# **Assignment 5: NLP**

### Instructions

1) Please submit the .ipynb and .pdf files to Gradescope.

2) Please include your Name and UNI below.

### Name: Triyasha Ghosh Dastidar
### UNI: tg2936

### Natural Language Processing
We will train a supervised model to predict whether a tweet has positive or negative sentiment.

####  **Dataset loading & dev/test splits**

**1.1) Load the twitter dataset from NLTK library**

In [1]:
import nltk
nltk.download('twitter_samples')
from nltk.corpus import twitter_samples
nltk.download('punkt')
nltk.download('stopwords')

import warnings
warnings.filterwarnings("ignore")

from nltk.corpus import stopwords
stop = set(stopwords.words('english'))  # a set gives constant-time membership tests
import pandas as pd
import string
import re
from sklearn.model_selection import train_test_split
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report

[nltk_data] Downloading package twitter_samples to
[nltk_data]     /home/triyasha/nltk_data...
[nltk_data]   Package twitter_samples is already up-to-date!
[nltk_data] Downloading package punkt to /home/triyasha/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /home/triyasha/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


**1.2) Load the positive & negative tweets**

In [2]:
all_positive_tweets = twitter_samples.strings('positive_tweets.json')
all_negative_tweets = twitter_samples.strings('negative_tweets.json')

**1.3) Make a data frame that has all tweets and their corresponding labels**

In [3]:
tweets = []

for tweet in all_positive_tweets:
    tweets.append({'Tweet': tweet, 'Sentiment': 'positive'})

for tweet in all_negative_tweets:
    tweets.append({'Tweet': tweet, 'Sentiment': 'negative'})

df = pd.DataFrame(tweets)


In [4]:
df

Unnamed: 0,Tweet,Sentiment
0,#FollowFriday @France_Inte @PKuchly57 @Milipol...,positive
1,@Lamb2ja Hey James! How odd :/ Please call our...,positive
2,@DespiteOfficial we had a listen last night :)...,positive
3,@97sides CONGRATS :),positive
4,yeaaaah yippppy!!! my accnt verified rqst has...,positive
...,...,...
9995,I wanna change my avi but uSanele :(,negative
9996,MY PUPPY BROKE HER FOOT :(,negative
9997,where's all the jaebum baby pictures :((,negative
9998,But but Mr Ahmad Maslan cooks too :( https://t...,negative


**1.4) Look at the class distribution of the tweets**

In [5]:
df['Sentiment'].value_counts()

Sentiment
positive    5000
negative    5000
Name: count, dtype: int64

The dataset contains 5,000 'positive' and 5,000 'negative' tweets, so the classes are balanced.

**1.5) Create a development & test split (80/20 ratio):**

In [6]:
df_dev, df_test = train_test_split(df, test_size=0.2)

In [8]:
print(f"Shape of development split: {df_dev.shape}")
print(f"Shape of test split: {df_test.shape}")

Shape of development split: (8000, 2)
Shape of test split: (2000, 2)
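The split above is unseeded and unstratified, so each run produces a different partition. A minimal sketch of a reproducible, class-balanced split follows, assuming a seed of 42 and a hypothetical 10-row stand-in frame (`toy_df`) rather than the real tweets. Rerunning the notebook with a seed would not reproduce the exact scores reported later, since those came from an unseeded split.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the tweets data frame built above.
toy_df = pd.DataFrame({
    "Tweet": [f"tweet {i}" for i in range(10)],
    "Sentiment": ["positive"] * 5 + ["negative"] * 5,
})

# random_state fixes the shuffle; stratify keeps the class ratio in both splits.
toy_dev, toy_test = train_test_split(
    toy_df, test_size=0.2, random_state=42, stratify=toy_df["Sentiment"]
)

print(f"Shape of development split: {toy_dev.shape}")
print(f"Shape of test split: {toy_test.shape}")
print(toy_test["Sentiment"].value_counts().to_dict())
```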


#### **Data preprocessing**
We will do some data preprocessing before we tokenize the data. We will remove the `#` symbol, hyperlinks, stop words, and punctuation from the data. You can use the `re` package in Python to find and replace these strings.

**1.6) Replace the `#` symbol with '' in every tweet**

In [10]:
def remove_hashtags(text):
    # Note: r'#\S+' deletes the whole hashtag token (e.g. '#FollowFriday'), not only the '#' symbol
    return re.sub(r'#\S+', '', text)

df_dev['Tweet'] = df_dev['Tweet'].apply(remove_hashtags)
df_test['Tweet'] = df_test['Tweet'].apply(remove_hashtags)
df_dev

Unnamed: 0,Tweet,Sentiment
9662,Longmorn 30 y.o. supposed to be a replacement ...,negative
105,@straz_das @DCarsonCPA @GH813600 for being to...,positive
3707,@JohnTorode1 @BBCOne @MasterChefUK think it'll...,positive
4602,@TheKimTillman ROAD TRIP!!! :D,positive
6957,@RedMakuzawa that would be terrible :(,negative
...,...,...
5863,"Things are hard right now, &amp; I can't even ...",negative
1272,@lazycrazygen Thank you Gen! Miss you! :D,positive
7126,@thorntonschocs this is awful news! Mum uses t...,negative
4635,Laguna again. :) see you,positive
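The task asks for the `#` symbol to be replaced, while the pattern `#\S+` removes the entire hashtag, including the word that follows the symbol. The sketch below contrasts the two behaviours on one sample string; the helper names `strip_hash_symbol` and `drop_hashtag_token` are hypothetical and not part of the notebook.

```python
import re

def strip_hash_symbol(text):
    """Remove only the '#' character and keep the tag word."""
    return text.replace("#", "")

def drop_hashtag_token(text):
    """Remove the whole hashtag, using the same pattern as the cell above."""
    return re.sub(r"#\S+", "", text)

sample = "#FollowFriday thanks for the follow :)"
print(repr(strip_hash_symbol(sample)))
print(repr(drop_hashtag_token(sample)))
```

Keeping the tag word may preserve sentiment-bearing vocabulary (for example a tag such as `#happy`), so which variant is preferable depends on the intended features.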


**1.7) Replace hyperlinks with '' in every tweet**

In [None]:
def remove_hyperlinks(text):
    # 'http\S+' already matches 'https...' links, so a separate 'https\S+' branch is not needed
    return re.sub(r'http\S+|www\.\S+', '', text)

df_dev['Tweet'] = df_dev['Tweet'].apply(remove_hyperlinks)
df_test['Tweet'] = df_test['Tweet'].apply(remove_hyperlinks)
df_dev


Unnamed: 0,Tweet,Sentiment
9662,Longmorn 30 y.o. supposed to be a replacement ...,negative
105,@straz_das @DCarsonCPA @GH813600 for being to...,positive
3707,@JohnTorode1 @BBCOne @MasterChefUK think it'll...,positive
4602,@TheKimTillman ROAD TRIP!!! :D,positive
6957,@RedMakuzawa that would be terrible :(,negative
...,...,...
5863,"Things are hard right now, &amp; I can't even ...",negative
1272,@lazycrazygen Thank you Gen! Miss you! :D,positive
7126,@thorntonschocs this is awful news! Mum uses t...,negative
4635,Laguna again. :) see you,positive


**1.8) Remove all stop words**

In [12]:
def remove_stopwords(text):
    words = text.split()
    filtered_words = [word for word in words if word.lower() not in stop]
    return ' '.join(filtered_words)

df_dev['Tweet'] = df_dev['Tweet'].apply(remove_stopwords)
df_test['Tweet'] = df_test['Tweet'].apply(remove_stopwords)
df_dev

Unnamed: 0,Tweet,Sentiment
9662,Longmorn 30 y.o. supposed replacement Tobermor...,negative
105,@straz_das @DCarsonCPA @GH813600 top engaged m...,positive
3707,@JohnTorode1 @BBCOne @MasterChefUK think it'll...,positive
4602,@TheKimTillman ROAD TRIP!!! :D,positive
6957,@RedMakuzawa would terrible :(,negative
...,...,...
5863,"Things hard right now, &amp; can't even home M...",negative
1272,@lazycrazygen Thank Gen! Miss you! :D,positive
7126,@thorntonschocs awful news! Mum uses birthday ...,negative
4635,Laguna again. :) see,positive
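In the output above, some stop words survive (for example `you!` and `now,`) because the filter compares whitespace-separated tokens, and a token with attached punctuation is not equal to the bare stop word. The following sketch illustrates this, assuming a hypothetical five-word stop list (`TOY_STOP`) in place of NLTK's English list, and shows one possible variant that strips surrounding punctuation before the comparison.

```python
import re

# Hypothetical mini stop list; the notebook uses NLTK's English list instead.
TOY_STOP = {"you", "now", "are", "the", "i"}

def remove_stopwords_split(text, stop_words=TOY_STOP):
    """Whitespace split, as in the cell above: 'you!' does not equal 'you'."""
    return " ".join(w for w in text.split() if w.lower() not in stop_words)

def remove_stopwords_stripped(text, stop_words=TOY_STOP):
    """Compare each token with leading/trailing punctuation stripped."""
    kept = []
    for w in text.split():
        core = re.sub(r"^\W+|\W+$", "", w).lower()
        if core not in stop_words:
            kept.append(w)
    return " ".join(kept)

sample = "Thank you Gen! Miss you! :D"
print(remove_stopwords_split(sample))
print(remove_stopwords_stripped(sample))
```

An alternative with a similar effect would be to run the punctuation step before the stop-word step.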


**1.9) Remove all punctuation**

In [13]:
def remove_punctuation(text):
    return re.sub(r'[^\w\s]', '', text)

df_dev['Tweet'] = df_dev['Tweet'].apply(remove_punctuation)
df_test['Tweet'] = df_test['Tweet'].apply(remove_punctuation)
df_dev

Unnamed: 0,Tweet,Sentiment
9662,Longmorn 30 yo supposed replacement Tobermory ...,negative
105,straz_das DCarsonCPA GH813600 top engaged memb...,positive
3707,JohnTorode1 BBCOne MasterChefUK think itll sam...,positive
4602,TheKimTillman ROAD TRIP D,positive
6957,RedMakuzawa would terrible,negative
...,...,...
5863,Things hard right now amp cant even home Mady ...,negative
1272,lazycrazygen Thank Gen Miss you D,positive
7126,thorntonschocs awful news Mum uses birthday ca...,negative
4635,Laguna again see,positive
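Two side effects of the pattern `[^\w\s]` are visible in the output above: underscores are kept, because `\w` includes `_`, and emoticons are reduced to a stray letter or removed entirely (`:D` becomes `D`, `:(` disappears). The short sketch below reproduces both effects on sample strings; it repeats the function from the cell above so that it runs on its own.

```python
import re

def remove_punctuation(text):
    # Same pattern as the notebook cell: drop anything that is not a word char or whitespace.
    return re.sub(r"[^\w\s]", "", text)

with_handle = remove_punctuation("@straz_das ROAD TRIP!!! :D")
with_frown = remove_punctuation("that would be terrible :(")
print(repr(with_handle))
print(repr(with_frown))
```

Since emoticons can carry sentiment, removing them may discard a useful cue; whether to keep them is a modelling choice that this notebook does not explore.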


**1.10) Apply stemming on the development & test datasets using Porter algorithm**

In [14]:
stemmer = PorterStemmer()
def stemming(text):
    words = text.split()
    stemmed_words = [stemmer.stem(word) for word in words]
    return ' '.join(stemmed_words)

df_dev['Tweet'] = df_dev['Tweet'].apply(stemming)
df_test['Tweet'] = df_test['Tweet'].apply(stemming)



In [15]:
df_dev

Unnamed: 0,Tweet,Sentiment
9662,longmorn 30 yo suppos replac tobermori 32 yo b...,negative
105,straz_da dcarsoncpa gh813600 top engag member ...,positive
3707,johntorode1 bbcone masterchefuk think itll sam...,positive
4602,thekimtillman road trip d,positive
6957,redmakuzawa would terribl,negative
...,...,...
5863,thing hard right now amp cant even home madi h...,negative
1272,lazycrazygen thank gen miss you d,positive
7126,thorntonschoc aw news mum use birthday cake ev...,negative
4635,laguna again see,positive
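Porter stemming lowercases each word and strips suffixes by rule, so the results are often not dictionary words. The sketch below applies the stemmer to a few words taken from the tweets shown above; the list `sample_words` is chosen only for illustration.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Words that appear in the tweets displayed above.
sample_words = ["terrible", "supposed", "replacement", "awful"]
stems = {word: stemmer.stem(word) for word in sample_words}

for word, stem in stems.items():
    print(f"{word} -> {stem}")
```

Stems such as `aw` for `awful` show that the algorithm can be aggressive, which may merge unrelated words into one feature.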


#### **Model training**

**1.11) Create bag of words features for each tweet in the development dataset**

In [16]:
vectorizer = CountVectorizer(stop_words='english')
bow_matrix = vectorizer.fit_transform(df_dev['Tweet'])
bow_df = pd.DataFrame(bow_matrix.toarray(), columns=vectorizer.get_feature_names_out())

print(bow_df)


      0001  00128835  009  00962778381838  00kouhey00  0115am  0116am  01282  \
0        0         0    0               0           0       0       0      0   
1        0         0    0               0           0       0       0      0   
2        0         0    0               0           0       0       0      0   
3        0         0    0               0           0       0       0      0   
4        0         0    0               0           0       0       0      0   
...    ...       ...  ...             ...         ...     ...     ...    ...   
7995     0         0    0               0           0       0       0      0   
7996     0         0    0               0           0       0       0      0   
7997     0         0    0               0           0       0       0      0   
7998     0         0    0               0           0       0       0      0   
7999     0         0    0               0           0       0       0      0   

      0129ann  01482  ...  للحياة  للعو

**1.12) Train a Logistic Regression model on the development dataset**

In [17]:
regressor = LogisticRegression()
regressor.fit(bow_df, df_dev['Sentiment'])

**1.13) Create TF-IDF features for each tweet in the development dataset**

In [None]:
tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform(df_dev['Tweet'])
tfidf_df = pd.DataFrame(tfidf_matrix.toarray(), columns=tfidf_vectorizer.get_feature_names_out())

print(tfidf_df)

      0001  00128835  009  00962778381838  00kouhey00  0115am  0116am  01282  \
0      0.0       0.0  0.0             0.0         0.0     0.0     0.0    0.0   
1      0.0       0.0  0.0             0.0         0.0     0.0     0.0    0.0   
2      0.0       0.0  0.0             0.0         0.0     0.0     0.0    0.0   
3      0.0       0.0  0.0             0.0         0.0     0.0     0.0    0.0   
4      0.0       0.0  0.0             0.0         0.0     0.0     0.0    0.0   
...    ...       ...  ...             ...         ...     ...     ...    ...   
7995   0.0       0.0  0.0             0.0         0.0     0.0     0.0    0.0   
7996   0.0       0.0  0.0             0.0         0.0     0.0     0.0    0.0   
7997   0.0       0.0  0.0             0.0         0.0     0.0     0.0    0.0   
7998   0.0       0.0  0.0             0.0         0.0     0.0     0.0    0.0   
7999   0.0       0.0  0.0             0.0         0.0     0.0     0.0    0.0   

      0129ann  01482  ...  للحياة  للعو

**1.14) Train the Logistic Regression model on the development dataset with TF-IDF features**

In [19]:
regressor2 = LogisticRegression()
regressor2.fit(tfidf_df, df_dev['Sentiment'])

**1.15) Compare the performance of the two models on the test dataset using a classification report and the scores obtained. Explain the difference in results obtained.**

In [20]:
tfidf_matrix_test = tfidf_vectorizer.transform(df_test['Tweet'])
tfidf_df_test = pd.DataFrame(tfidf_matrix_test.toarray(), columns=tfidf_vectorizer.get_feature_names_out())
y_pred_tfidf = regressor2.predict(tfidf_df_test)
print("Training using TF-IDF features")
print(classification_report(df_test['Sentiment'], y_pred_tfidf))

Training using TF-IDF features
              precision    recall  f1-score   support

    negative       0.75      0.77      0.76       997
    positive       0.76      0.75      0.76      1003

    accuracy                           0.76      2000
   macro avg       0.76      0.76      0.76      2000
weighted avg       0.76      0.76      0.76      2000



In [21]:
bow_matrix_test = vectorizer.transform(df_test['Tweet'])
bow_df_test = pd.DataFrame(bow_matrix_test.toarray(), columns=vectorizer.get_feature_names_out())
y_pred_bow = regressor.predict(bow_df_test)
print("Training using BOW features")
print(classification_report(df_test['Sentiment'], y_pred_bow))

Training using BOW features
              precision    recall  f1-score   support

    negative       0.73      0.76      0.75       997
    positive       0.75      0.72      0.74      1003

    accuracy                           0.74      2000
   macro avg       0.74      0.74      0.74      2000
weighted avg       0.74      0.74      0.74      2000
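The cells above convert each feature matrix to a dense data frame and repeat the transform step by hand for the test set. As a possible alternative, the sketch below wraps each vectorizer and classifier in a scikit-learn pipeline, which keeps the features sparse and applies the same vectorizer options to both models. The function name `compare_vectorizers`, the `shared` options, `max_iter=1000`, and the toy data are assumptions made for illustration, not settings taken from this notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.pipeline import make_pipeline

def compare_vectorizers(train_texts, train_labels, test_texts, test_labels):
    """Fit BOW and TF-IDF pipelines with identical options; return test accuracy per model."""
    shared = {"lowercase": True}  # assumption: the same options for both vectorizers
    candidates = {
        "bow": CountVectorizer(**shared),
        "tfidf": TfidfVectorizer(**shared),
    }
    scores = {}
    for name, vectorizer in candidates.items():
        # The pipeline hands the sparse matrix straight to the classifier,
        # so no dense data frame is materialized.
        model = make_pipeline(vectorizer, LogisticRegression(max_iter=1000))
        model.fit(train_texts, train_labels)
        scores[name] = accuracy_score(test_labels, model.predict(test_texts))
    return scores

# Hypothetical toy data, only to show the call shape.
toy_train = ["good day", "great day", "bad day", "sad day"]
toy_labels = ["positive", "positive", "negative", "negative"]
toy_scores = compare_vectorizers(
    toy_train, toy_labels, ["good", "sad"], ["positive", "negative"]
)
print(toy_scores)
```

With the real data, the call would take `df_dev['Tweet']`, `df_dev['Sentiment']`, `df_test['Tweet']`, and `df_test['Sentiment']` as arguments.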



# Conclusion

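The **TF-IDF-based** model outperforms the **BOW-based** model on every reported metric, though the margin is small: accuracy is 0.76 versus 0.74, and the per-class precision, recall, and F1 scores are each 1 to 3 points higher. A likely reason is that TF-IDF down-weights terms that appear in many tweets, which gives the classifier a more discriminative feature representation.

- **TF-IDF** scales each term count by how rare the term is across documents, so frequent but uninformative words contribute less to the prediction.
- **BOW** uses raw term counts, so common words carry as much weight as rare, informative ones, which may hurt generalization.

Two caveats apply. The BOW vectorizer was created with `stop_words='english'` while the TF-IDF vectorizer used default settings, so the two models also differ in vocabulary and the gap cannot be attributed to the weighting scheme alone. The comparison also rests on a single random split, so a difference of about two points may partly reflect sampling noise.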
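To make the weighting difference concrete, here is a minimal sketch on a hypothetical three-document corpus (`toy_docs`), using scikit-learn's default vectorizer settings. The word `movie` occurs in every document and `good` in only one, so BOW assigns them equal counts in the first document while TF-IDF gives `good` the larger weight.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical toy corpus: 'movie' is in every document, 'good' in only one.
toy_docs = ["good movie", "bad movie", "great movie"]

bow = CountVectorizer()
tfidf = TfidfVectorizer()
bow_row = bow.fit_transform(toy_docs).toarray()[0]      # features of "good movie"
tfidf_row = tfidf.fit_transform(toy_docs).toarray()[0]  # same document, TF-IDF weights

bow_good = bow_row[bow.vocabulary_["good"]]
bow_movie = bow_row[bow.vocabulary_["movie"]]
tfidf_good = tfidf_row[tfidf.vocabulary_["good"]]
tfidf_movie = tfidf_row[tfidf.vocabulary_["movie"]]

print("BOW    good vs movie:", bow_good, bow_movie)
print("TF-IDF good vs movie:", round(tfidf_good, 3), round(tfidf_movie, 3))
```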