# Twitter Sentiment Analysis - NLP

**Problem Statement:**
Develop a machine learning model for Twitter sentiment analysis using labeled historical tweets.

**Objective:**
The objective is to develop a machine learning model that can automatically classify tweets into sentiment categories (e.g., positive, negative, or neutral).

**Expected Outcome:**
The expected outcome is a robust machine learning model that achieves high accuracy, precision, recall, and F1-score on test data.

**Approach:**
1. **Data Exploration and Preprocessing:** Explore the dataset to understand its structure and identify missing values, duplicates, or inconsistencies. Preprocess the data by dropping rows with missing text, removing unused columns, and encoding the target labels.

2. **Feature Engineering:** Extract features from the tweet text that help distinguish between sentiment classes. In this notebook, that means removing stopwords, stemming the remaining words, and converting the cleaned text into TF-IDF vectors.

3. **Model Selection:** Experiment with different classification algorithms such as logistic regression, random forest, gradient boosting, or neural networks. Evaluate each model's performance using appropriate metrics such as accuracy, precision, recall, and F1-score.

4. **Model Training and Evaluation:** Train the selected model(s) on the preprocessed data and evaluate their performance.

5. **Model Deployment:** Once a satisfactory model is obtained, deploy it into a production environment where it can be used to classify new tweets in real time.

**How the Trained Model Will Be Helpful:**
- **Business Value:** Companies can monitor customer feedback, brand reputation, and campaign effectiveness on Twitter.

- **Social Insights:** Researchers and analysts can track public sentiment around trending topics, events, or policies.

- **Customer Support:** Helps identify dissatisfied customers faster, enabling proactive resolution.

- **Decision Making:** Organizations can use aggregated sentiment trends to guide marketing strategies, product development, and public relations.

In [1]:
# importing necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import re

from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer

# machine learning models
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from xgboost import XGBClassifier

# evaluation metrics
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

**Stopwords in Machine Learning (especially NLP)**

- Stopwords are very common words in a language that don’t add much meaning when analyzing text.
- Examples in English: is, the, and, a, of, in, on, to, with

When we do text analysis (like sentiment analysis, topic modeling, search engines), these words appear so frequently that they don’t help in understanding the actual meaning.

---

**Why remove them?**

Imagine this tweet:

> "The movie was really good"

- Words like *the* and *was* tell us little about the sentiment.
- But *really* and *good* are important → they reveal it’s positive.

So, we remove stopwords to focus on meaningful words that improve accuracy and reduce noise.
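The idea can be sketched in a few lines of plain Python. This is only an illustration: the `STOPWORDS` set below is a small hand-picked subset (an assumption for this example, not NLTK's full list), and `remove_stopwords` is a hypothetical helper name.

```python
# A small, hand-picked subset of English stopwords (illustrative only;
# NLTK's list used later in this notebook is much longer).
STOPWORDS = {"the", "was", "is", "a", "of", "in", "on", "to", "with", "and"}


def remove_stopwords(text):
    """Lowercase the text, split on whitespace, and drop stopwords."""
    tokens = text.lower().split()
    return [tok for tok in tokens if tok not in STOPWORDS]


kept = remove_stopwords("The movie was really good")
print(kept)  # ['movie', 'really', 'good']
```

Using a `set` keeps the membership test cheap, which matters once the same check runs over tens of thousands of tweets.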

In [2]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [3]:
# print the stopwords in English
print(stopwords.words('english'))

['a', 'about', 'above', 'after', 'again', 'against', 'ain', 'all', 'am', 'an', 'and', 'any', 'are', 'aren', "aren't", 'as', 'at', 'be', 'because', 'been', 'before', 'being', 'below', 'between', 'both', 'but', 'by', 'can', 'couldn', "couldn't", 'd', 'did', 'didn', "didn't", 'do', 'does', 'doesn', "doesn't", 'doing', 'don', "don't", 'down', 'during', 'each', 'few', 'for', 'from', 'further', 'had', 'hadn', "hadn't", 'has', 'hasn', "hasn't", 'have', 'haven', "haven't", 'having', 'he', "he'd", "he'll", 'her', 'here', 'hers', 'herself', "he's", 'him', 'himself', 'his', 'how', 'i', "i'd", 'if', "i'll", "i'm", 'in', 'into', 'is', 'isn', "isn't", 'it', "it'd", "it'll", "it's", 'its', 'itself', "i've", 'just', 'll', 'm', 'ma', 'me', 'mightn', "mightn't", 'more', 'most', 'mustn', "mustn't", 'my', 'myself', 'needn', "needn't", 'no', 'nor', 'not', 'now', 'o', 'of', 'off', 'on', 'once', 'only', 'or', 'other', 'our', 'ours', 'ourselves', 'out', 'over', 'own', 're', 's', 'same', 'shan', "shan't", 'she

In [4]:
# read file using read_csv function
twitter_train = pd.read_csv('/content/twitter_training.csv', encoding='ISO-8859-1')
twitter_test = pd.read_csv('/content/twitter_validation.csv', encoding='ISO-8859-1')

## Preprocessing
---

In [5]:
# number of rows and columns
twitter_train.shape, twitter_test.shape

((68431, 4), (999, 4))

In [6]:
# first 5 rows
twitter_train.head()

Unnamed: 0,2401,Borderlands,Positive,"im getting on borderlands and i will murder you all ,"
0,2401,Borderlands,Positive,I am coming to the borders and I will kill you...
1,2401,Borderlands,Positive,im getting on borderlands and i will kill you ...
2,2401,Borderlands,Positive,im coming on borderlands and i will murder you...
3,2401,Borderlands,Positive,im getting on borderlands 2 and i will murder ...
4,2401,Borderlands,Positive,im getting into borderlands and i can murder y...


In [7]:
# adding column names
column_names = ['id', 'user', 'target', 'text']
df_twitter_train = pd.read_csv('/content/twitter_training.csv', names=column_names, encoding='ISO-8859-1')
df_twitter_test = pd.read_csv('/content/twitter_validation.csv', names=column_names, encoding='ISO-8859-1')

In [8]:
# summary of the data: column names, total no.of non-null values, data types, memory usage
df_twitter_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 68432 entries, 0 to 68431
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   id      68432 non-null  int64 
 1   user    68432 non-null  object
 2   target  68432 non-null  object
 3   text    67830 non-null  object
dtypes: int64(1), object(3)
memory usage: 2.1+ MB


In [9]:
# summary statistics for numeric data types
df_twitter_train.describe()

Unnamed: 0,id
count,68432.0
mean,6250.292407
std,3760.727632
min,1.0
25%,2930.0
50%,6161.0
75%,9463.0
max,13200.0


In [10]:
# summary statistics for object data types
df_twitter_train.describe(include='O')

Unnamed: 0,user,target,text
count,68432,68432,67830.0
unique,30,4,63705.0
top,TomClancysRainbowSix,Negative,
freq,2400,20862,149.0


In [11]:
# check for missing values
df_twitter_train.isna().sum()

Unnamed: 0,0
id,0
user,0
target,0
text,602


In [12]:
# remove rows with blank values in 'text' column as they are not helpful
df_twitter_train.dropna(subset=['text'], inplace=True)
df_twitter_test.dropna(subset=['text'], inplace=True)

In [13]:
# drop columns that are not useful
df_twitter_train.drop(['id','user'], axis=1, inplace=True)
df_twitter_test.drop(['id','user'], axis=1, inplace=True)

In [14]:
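# check for duplicate values in the dataset
# note: calling .sum() on the duplicated rows concatenates their strings (as the output below shows);
# df_twitter_train.duplicated().sum() would return the number of duplicate rows instead
df_twitter_train[df_twitter_train.duplicated()].sum()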

Unnamed: 0,0
target,PositiveNegativeNeutralNeutralNegativePositive...
text,that was the first borderlands session in a lo...


In [15]:
df_twitter_train.columns

Index(['target', 'text'], dtype='object')

In [16]:
df_twitter_train['target'].value_counts()

Unnamed: 0_level_0,count
target,Unnamed: 1_level_1
Negative,20694
Positive,18579
Neutral,16138
Irrelevant,12419


In [17]:
# Replace 'Irrelevant' with 'Neutral' in the 'target' column
df_twitter_train['target'] = df_twitter_train['target'].replace('Irrelevant', 'Neutral')
df_twitter_test['target'] = df_twitter_test['target'].replace('Irrelevant', 'Neutral')
df_twitter_train['target'].value_counts()

Unnamed: 0_level_0,count
target,Unnamed: 1_level_1
Neutral,28557
Negative,20694
Positive,18579


In [18]:
import plotly.express as px
import plotly.graph_objects as go
from IPython.display import display, HTML

import plotly.io as pio
pio.templates.default = "plotly_dark" # for dark theme

# Utility function to update figure layout
def set_default_fig_size(fig, width=600, height=480):
    fig.update_layout(autosize=True, width=width, height=height)
    return fig

In [19]:
count_of_target = df_twitter_train['target'].value_counts().sort_index()

fig = px.pie(count_of_target, values=count_of_target, names=count_of_target.index, title='Distribution of target variable')

fig = set_default_fig_size(fig, 600)
fig.show()

From the above chart, the data is split across the three categories without any severe imbalance (Neutral is the largest class), so we do not upsample or downsample the data.
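To put numbers on the balance, the class shares can be computed directly from the value counts printed earlier. This is a small sketch in plain Python; the counts are copied from the output above, and `imbalance_ratio` is just an illustrative name.

```python
# Class counts copied from the value_counts() output after merging
# 'Irrelevant' into 'Neutral'.
counts = {"Neutral": 28557, "Negative": 20694, "Positive": 18579}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}

for label, share in shares.items():
    print(f"{label}: {share:.1%}")

# Ratio between the largest and the smallest class.
imbalance_ratio = max(counts.values()) / min(counts.values())
print(f"largest/smallest class ratio: {imbalance_ratio:.2f}")
```

The largest class is about one and a half times the size of the smallest, which is usually mild enough to train on without resampling.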

In [20]:
# importing LabelEncoder to convert categorical value to numeric
from sklearn.preprocessing import LabelEncoder

label_encoder = LabelEncoder()
df_twitter_train['target'] = label_encoder.fit_transform(df_twitter_train['target'])
df_twitter_test['target'] = label_encoder.transform(df_twitter_test['target']) # reuse the mapping learned from the training labels

After converting the "target" column value from string to integer we can see that:
- 0 denotes NEGATIVE
- 1 denotes NEUTRAL and,
- 2 denotes POSITIVE
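The mapping follows from the fact that `LabelEncoder` assigns integers to the sorted class names. The sketch below mimics that behavior in plain Python (it is not scikit-learn's implementation, and the tiny label lists are made up for illustration) and shows why the test labels should reuse the mapping learned from the training labels.

```python
# Made-up label lists for illustration.
train_labels = ["Positive", "Neutral", "Negative", "Neutral"]
test_labels = ["Negative", "Positive"]  # happens to contain no 'Neutral'

# Classes are sorted alphabetically and numbered in that order.
classes = sorted(set(train_labels))
mapping = {label: idx for idx, label in enumerate(classes)}
print(mapping)  # {'Negative': 0, 'Neutral': 1, 'Positive': 2}

# Reusing the training mapping keeps the codes consistent.
encoded_test = [mapping[label] for label in test_labels]
print(encoded_test)  # [0, 2]

# Re-fitting on the test labels alone would silently shift the codes.
refit_mapping = {label: idx for idx, label in enumerate(sorted(set(test_labels)))}
print(refit_mapping)  # {'Negative': 0, 'Positive': 1}
```

In this notebook both splits contain all three classes, so the codes agree either way, but reusing the fitted encoder avoids the problem in general.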

In [21]:
# "target" column converted to numerical values after label encoding
df_twitter_train['target'].value_counts()

Unnamed: 0_level_0,count
target,Unnamed: 1_level_1
1,28557
0,20694
2,18579


#### Stemming

Stemming is the process of reducing a word to its root form by chopping off word endings (suffixes). The resulting stem is not always a dictionary word.

---
**Why do we use stemming?**

- Reduces vocabulary size → fewer unique words to process.
- Groups similar words → model understands that play, played, playing all mean the same action.
- Improves efficiency in text mining, sentiment analysis, and search engines.
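The grouping effect can be illustrated with a deliberately simplified suffix stripper. Note that `toy_stem` is a hypothetical helper written for this example, not the Porter algorithm used in this notebook, which applies a much richer set of rules.

```python
# Suffixes checked in order; a toy rule set, far simpler than Porter's.
SUFFIXES = ("ing", "ed", "s")


def toy_stem(word):
    """Strip the first matching suffix, keeping at least 3 characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


words = ["play", "played", "playing", "plays"]
stems = [toy_stem(w) for w in words]
print(stems)  # ['play', 'play', 'play', 'play']

# Four distinct surface forms collapse into one vocabulary entry.
print(len(set(words)), "->", len(set(stems)))  # 4 -> 1
```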

In [22]:
# creates an object of the Porter Stemmer (a common stemming algorithm in NLP).
port_stem = PorterStemmer()

In [23]:
# function for cleaning and stemming a tweet
stop_words = set(stopwords.words('english')) # build the stopword set once instead of once per word

def stemming(content):
    stemmed_content = re.sub('[^a-zA-Z]',' ',content) # replaces anything that is not a letter with a space
    stemmed_content = stemmed_content.lower() # converts the whole text into lowercase.
    stemmed_content = stemmed_content.split() # Splits the text into a list of words (tokens).

    # goes through each word and does two things:
    # 1) Removes stopwords (like the, is, and, in).
    # 2) Applies stemming using PorterStemmer.
    stemmed_content = [port_stem.stem(word) for word in stemmed_content if word not in stop_words]

    stemmed_content = ' '.join(stemmed_content) # joins the processed words back into a single string.

    return stemmed_content # returns the cleaned and stemmed text

In [24]:
# Applies the stemming function to every row in the column "text" of training dataset.
# Stores the result in a new column stemmed_text.
df_twitter_train['stemmed_text'] = df_twitter_train['text'].apply(stemming)

In [25]:
# print last 7 rows from the dataset
df_twitter_train.tail(7)

Unnamed: 0,target,text,stemmed_text
68425,2,LETS FICKING GOOOOOOOOOOOO,let fick goooooooooooo
68426,2,LETS FUCKING GOOOOOOO,let fuck gooooooo
68427,2,LETS N GOOOOOOOOOO,let n goooooooooo
68428,2,she LETS IN FUCKING OF GOOOOOOOOOO,let fuck goooooooooo
68429,2,LETS FUCKING LI,let fuck li
68430,2,I canât wait for this to come out,wait come
68431,2,I can't wait for,wait


In [26]:
# Applies the stemming function to every row in the column "text" of testing dataset.
df_twitter_test['stemmed_text'] = df_twitter_test['text'].apply(stemming)

In [27]:
# Splitting the dataset into train and test
X_train = df_twitter_train['stemmed_text'].values
y_train = df_twitter_train['target'].values
X_test = df_twitter_test['stemmed_text'].values
y_test = df_twitter_test['target'].values

In [28]:
print(X_train)

['im get borderland murder' 'come border kill' 'im get borderland kill'
 ... 'let fuck li' 'wait come' 'wait']


In [29]:
print(X_test)

['mention facebook struggl motiv go run day translat tom great aunti hayley get bed told grandma think lazi terribl person'
 'bbc news amazon boss jeff bezo reject claim compani act like drug dealer bbc co uk news av busin'
 'microsoft pay word function poorli samsungu chromebook'
 'csgo matchmak full closet hack truli aw game'
 'presid slap american face realli commit unlaw act acquitt discov googl vanityfair com news'
 'hi eahelp madelein mccann cellar past year littl sneaki thing escap whilst load fifa point took card use paypal account work help resolv pleas'
 'thank eamaddennfl new te austin hooper orang brown brown austinhoop pic twitter com grg xzfkon'
 'rocket leagu sea thiev rainbow six sieg love play three stream best stream twitch rocketleagu seaofthiev rainbowsixsieg follow'
 'ass still knee deep assassin creed odyssey way anytim soon lmao'
 'fix jesu pleas fix world go playstat askplayst playstationsup treyarch callofduti neg silver wolf error code pic twitter com ziryhrf 

In [30]:
# transforming the "train" and "test" tweets into numerical vectors using TF-IDF, so machine learning algorithms (which work with numbers, not text) can process them.
vectorizer = TfidfVectorizer()

X_train = vectorizer.fit_transform(X_train)
X_test = vectorizer.transform(X_test)

In [31]:
print(X_train)

<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 704652 stored elements and shape (67830, 21218)>
  Coords	Values
  (0, 8639)	0.5057160006794084
  (0, 7012)	0.32720143452143197
  (0, 2039)	0.42523291685149306
  (0, 11930)	0.6755497867144308
  (1, 3308)	0.4482449500357379
  (1, 2037)	0.7422391763288988
  (1, 9829)	0.4981540624044501
  (2, 8639)	0.579559511728585
  (2, 7012)	0.3749786508106725
  (2, 2039)	0.4873244693272133
  (2, 9829)	0.5348052406213704
  (3, 8639)	0.4890624583860953
  (3, 2039)	0.4112297325429457
  (3, 11930)	0.6533035122655839
  (3, 3308)	0.4060819372139792
  (4, 8639)	0.5057160006794084
  (4, 7012)	0.32720143452143197
  (4, 2039)	0.42523291685149306
  (4, 11930)	0.6755497867144308
  (5, 8639)	0.5057160006794084
  (5, 7012)	0.32720143452143197
  (5, 2039)	0.42523291685149306
  (5, 11930)	0.6755497867144308
  (6, 2039)	0.13042272645064382
  (6, 17003)	0.188061629770934
  :	:
  (67821, 2158)	0.33960406859024894
  (67821, 8381)	0.24395293140557656
  (67821, 

In [32]:
print(X_test)

<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 12290 stored elements and shape (1000, 21218)>
  Coords	Values
  (0, 1070)	0.3044281667056202
  (0, 1482)	0.24861938911314294
  (0, 4107)	0.14228024144478724
  (0, 5817)	0.14122878028416033
  (0, 7012)	0.11760629471698979
  (0, 7233)	0.1254313823476484
  (0, 7405)	0.28915638610567496
  (0, 7435)	0.1449813469939473
  (0, 7851)	0.3290500373259431
  (0, 10208)	0.24778132672985703
  (0, 11320)	0.22331891829864983
  (0, 11798)	0.2799952849641723
  (0, 13484)	0.18720342953220478
  (0, 15556)	0.17990065179699147
  (0, 17390)	0.23790458799702116
  (0, 17984)	0.1993831859320953
  (0, 18196)	0.15530521287657548
  (0, 18437)	0.21118380586040175
  (0, 18442)	0.2648863923542845
  (0, 18571)	0.2656020299456177
  (1, 141)	0.20218932158384992
  (1, 516)	0.13294023709477504
  (1, 1102)	0.24812550431031832
  (1, 1429)	0.4685414936167827
  (1, 1633)	0.24434743006393256
  :	:
  (997, 17463)	0.2777700723326396
  (997, 17503)	0.38647993820286397


## Model Training

### 1) Logistic Regression (LG)

In [33]:
# Initialize the LogisticRegression model
model = LogisticRegression(max_iter=1000)

In [34]:
# Fit the model to the training data
model.fit(X_train, y_train.ravel())

# Make predictions on the test data and compute the accuracy
X_test_prediction = model.predict(X_test)
testing_data_accuracy = accuracy_score(y_test, X_test_prediction)

In [35]:
# Make predictions on the test data
y_pred = model.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f"LG Accuracy: {accuracy*100:.2f}%")

# Generate classification report
print("\nLG Classification Report:")
print(classification_report(y_test, y_pred))

# Generate confusion matrix
print("\nLG Confusion Matrix :")
print(confusion_matrix(y_test, y_pred))

LG Accuracy: 87.40%

LG Classification Report:
              precision    recall  f1-score   support

           0       0.86      0.89      0.88       266
           1       0.88      0.89      0.88       457
           2       0.88      0.83      0.86       277

    accuracy                           0.87      1000
   macro avg       0.87      0.87      0.87      1000
weighted avg       0.87      0.87      0.87      1000


LG Confusion Matrix :
[[236  22   8]
 [ 27 407  23]
 [ 10  36 231]]
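The numbers in the classification report can be re-derived from the confusion matrix, where rows are the true classes and columns are the predicted classes. The sketch below does this in plain Python for the Logistic Regression matrix printed above; `per_class_metrics` is a hypothetical helper name.

```python
# Logistic Regression confusion matrix from the output above
# (rows = true class, columns = predicted class).
cm = [
    [236, 22, 8],
    [27, 407, 23],
    [10, 36, 231],
]

n_classes = len(cm)
total = sum(sum(row) for row in cm)

# Accuracy: correctly classified tweets (the diagonal) over all tweets.
accuracy = sum(cm[i][i] for i in range(n_classes)) / total


def per_class_metrics(matrix, k):
    """Return (precision, recall, f1) for class k."""
    tp = matrix[k][k]
    predicted_k = sum(matrix[i][k] for i in range(len(matrix)))  # column sum
    actual_k = sum(matrix[k])                                    # row sum
    precision = tp / predicted_k
    recall = tp / actual_k
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


for k in range(n_classes):
    p, r, f = per_class_metrics(cm, k)
    print(f"class {k}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")

print(f"accuracy={accuracy:.3f}")
```

The same function can be applied to the confusion matrices of the other models to check their reports.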


---
### 2) Decision Tree

In [36]:
# Initialize the Decision Tree model
model_DT = DecisionTreeClassifier()

In [37]:
# Fit the model to the training data
model_DT.fit(X_train, y_train.ravel())

# Make predictions on the test data
y_pred = model_DT.predict(X_test)

In [38]:
# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f"DT Accuracy: {accuracy*100:.2f}%")

# Generate classification report
print("\nDT Classification Report:")
print(classification_report(y_test, y_pred))

# Generate confusion matrix
print("\nDT Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))

DT Accuracy: 88.30%

DT Classification Report:
              precision    recall  f1-score   support

           0       0.88      0.90      0.89       266
           1       0.91      0.88      0.89       457
           2       0.85      0.87      0.86       277

    accuracy                           0.88      1000
   macro avg       0.88      0.88      0.88      1000
weighted avg       0.88      0.88      0.88      1000


DT Confusion Matrix:
[[240  19   7]
 [ 20 403  34]
 [ 14  23 240]]


---
### 3) Random Forest

In [39]:
# Initialize the Random Forest model
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)

In [41]:
# Fit the model to the training data
rf_model.fit(X_train, y_train.ravel())

# Make predictions on the test data
y_pred_rf = rf_model.predict(X_test)

In [42]:
# Evaluate the model
accuracy_rf = accuracy_score(y_test, y_pred_rf)
print(f"Accuracy of Random Forest: {accuracy_rf*100:.2f}%")

# Generate classification report
print("\nClassification Report for Random Forest:")
print(classification_report(y_test, y_pred_rf))

# Generate confusion matrix
print("\nConfusion Matrix for Random Forest:")
print(confusion_matrix(y_test, y_pred_rf))

Accuracy of Random Forest: 92.30%

Classification Report for Random Forest:
              precision    recall  f1-score   support

           0       0.95      0.91      0.93       266
           1       0.90      0.96      0.92       457
           2       0.95      0.88      0.92       277

    accuracy                           0.92      1000
   macro avg       0.93      0.92      0.92      1000
weighted avg       0.92      0.92      0.92      1000


Confusion Matrix for Random Forest:
[[242  20   4]
 [ 12 437   8]
 [  2  31 244]]


---
### 4) Gradient Boosting Machines (GBM)

In [43]:
# Initialize the Gradient Boosting Classifier
gbm_model = GradientBoostingClassifier()

In [44]:
# Fit the model to the training data
gbm_model.fit(X_train, y_train.ravel())

# Make predictions on the test data
y_pred = gbm_model.predict(X_test)

In [45]:
# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f"GBM Accuracy: {accuracy*100:.2f}%")

# Generate classification report
print("\nGBM Classification Report:")
print(classification_report(y_test, y_pred))

# Generate confusion matrix
print("\nGBM Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))

GBM Accuracy: 63.70%

GBM Classification Report:
              precision    recall  f1-score   support

           0       0.68      0.53      0.59       266
           1       0.60      0.84      0.70       457
           2       0.72      0.41      0.52       277

    accuracy                           0.64      1000
   macro avg       0.67      0.59      0.61      1000
weighted avg       0.66      0.64      0.62      1000


GBM Confusion Matrix:
[[140 117   9]
 [ 39 383  35]
 [ 26 137 114]]


---
### 5) XGBoost Classifier

In [46]:
# Initialize the XGBoost classifier
# (use_label_encoder is no longer used by XGBoost; 'mlogloss' is the multiclass log-loss metric)
xgb_model = XGBClassifier(eval_metric='mlogloss')

In [47]:
# Fit the model to the training data
xgb_model.fit(X_train, y_train)

# Make predictions on the test data
y_pred = xgb_model.predict(X_test)

In [48]:
# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f"XGBoost Accuracy: {accuracy*100:.2f}%")

# Generate classification report
print("\nXGBoost Classification Report:")
print(classification_report(y_test, y_pred))

# Generate confusion matrix
print("\nXGBoost Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))

XGBoost Accuracy: 78.00%

XGBoost Classification Report:
              precision    recall  f1-score   support

           0       0.78      0.79      0.78       266
           1       0.75      0.86      0.80       457
           2       0.86      0.64      0.73       277

    accuracy                           0.78      1000
   macro avg       0.80      0.76      0.77      1000
weighted avg       0.79      0.78      0.78      1000


XGBoost Confusion Matrix:
[[209  51   6]
 [ 40 395  22]
 [ 19  82 176]]


## Model building using the most effective algorithm

We tried multiple ML models like Logistic Regression, Decision Tree, Random Forest, Gradient Boosting Machine and XGBoost Classifier.

### Out of these, ***Random Forest*** performed the best with an accuracy of ***92.30%***. Hence, we save the rf_model, which can then be deployed and used for sentiment analysis.
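For reference, the test accuracies printed in the sections above can be collected and ranked in a few lines. The values are copied from the notebook outputs; the dictionary itself is only an illustrative summary, not part of the original pipeline.

```python
# Test accuracies (in percent) copied from the outputs above.
accuracies = {
    "Logistic Regression": 87.40,
    "Decision Tree": 88.30,
    "Random Forest": 92.30,
    "Gradient Boosting": 63.70,
    "XGBoost": 78.00,
}

# Rank models from best to worst accuracy.
ranked = sorted(accuracies.items(), key=lambda item: item[1], reverse=True)
for name, acc in ranked:
    print(f"{name:<20} {acc:.2f}%")

best_name, best_acc = ranked[0]
print(f"Best model: {best_name} ({best_acc:.2f}%)")
```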

In [49]:
import pickle

In [50]:
# saving the model
filename = 'trained_model.sav'
with open(filename, 'wb') as f: # the context manager closes the file after writing
    pickle.dump(rf_model, f)

In [51]:
# load the saved model
with open('/content/trained_model.sav', 'rb') as f:
    loaded_model = pickle.load(f)
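One point worth keeping in mind for deployment: the saved model expects TF-IDF features produced by the same fitted vectorizer, so raw tweets can only be classified later if that vectorizer is persisted too. The sketch below shows one possible way to bundle both objects in a single pickle file. The dictionaries are stand-ins for the fitted `vectorizer` and `rf_model` (an assumption made so the example runs on its own), and the file name `sentiment_bundle.pkl` is hypothetical.

```python
import os
import pickle
import tempfile

# Stand-ins for the fitted objects; in the notebook these would be
# the fitted `vectorizer` and `rf_model`.
fitted_vectorizer = {"vocabulary": {"good": 0, "bad": 1}}
fitted_model = {"name": "rf_model placeholder"}

# Bundle everything needed at prediction time in one object.
bundle = {"vectorizer": fitted_vectorizer, "model": fitted_model}

# Hypothetical file name inside a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "sentiment_bundle.pkl")

with open(path, "wb") as fh:
    pickle.dump(bundle, fh)

with open(path, "rb") as fh:
    restored = pickle.load(fh)

print(sorted(restored))  # ['model', 'vectorizer']
```

At prediction time, a new tweet would go through the same cleaning and stemming function, then the restored vectorizer's `transform`, and only then the restored model's `predict`.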

Prediction for NEGATIVE tweet

In [52]:
X_new = X_test[23] # contains 0 (negative) value
print("Actual:",y_test[23])

prediction = loaded_model.predict(X_new)
print("Prediction: ", prediction[0]) # saved model predicted value

if prediction[0] == 0:
  print('The tweet is NEGATIVE')
elif prediction[0] == 1:
  print('The tweet is NEUTRAL')
else:
  print('The tweet is POSITIVE')

Actual: 0
Prediction:  0
The tweet is NEGATIVE


Prediction for NEUTRAL tweet

In [53]:
X_new = X_test[0] # contains 1 (neutral) value
print("Actual:", y_test[0])

prediction = loaded_model.predict(X_new)
print("Prediction: ", prediction[0]) # saved model predicted value

if prediction[0] == 0:
  print('The tweet is NEGATIVE')
elif prediction[0] == 1:
  print('The tweet is NEUTRAL')
else:
  print('The tweet is POSITIVE')

Actual: 1
Prediction:  1
The tweet is NEUTRAL


Prediction for POSITIVE tweet

In [54]:
X_new = X_test[92] # contains 2 (positive) value
print("Actual:",y_test[92])

prediction = loaded_model.predict(X_new)
print("Prediction: ", prediction[0]) # saved model predicted value

if prediction[0] == 0:
  print('The tweet is NEGATIVE')
elif prediction[0] == 1:
  print('The tweet is NEUTRAL')
else:
  print('The tweet is POSITIVE')

Actual: 2
Prediction:  2
The tweet is POSITIVE
