<a href="https://www.kaggle.com/code/tanishqharit21/twitter-sentiment-nlp-lr?scriptVersionId=219702735" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

GFG Project - Project YT Link :

https://www.youtube.com/watch?v=4YGkfAd2iXM&list=PLqM7alHXFySGTcwBQV-hYDkYAPJ4EPHe9

Kaggle Dataset Link : 

https://www.kaggle.com/datasets/kazanova/sentiment140

In [1]:
import numpy as np
import pandas as pd
import re                                                    # Regular Expression 
from nltk.corpus import stopwords                            # NLTK stop-word lists
from nltk.stem.porter import PorterStemmer                   # To stem words 
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split         # To split data into train and test set 
from sklearn.linear_model import LogisticRegression          # Logistic Regression 
from sklearn.metrics import accuracy_score                   # To calculate accuracy and performance 

In [2]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /usr/share/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [3]:
# Printing the English stop words
print(stopwords.words('english'))
# These words carry little meaning on their own, so we drop them
# to reduce the size and complexity of our data (1.6 million tweets)

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '
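
One thing worth noticing in the list above: it contains negation words such as 'no', 'nor' and 'not', so a tweet like "not good" loses its negation once stop words are removed. The sketch below illustrates this with a small hand-copied subset of the list. The names `stop_subset`, `negations` and `remove_stop_words` are made up for this example and are not part of the notebook, and whether keeping negations actually helps this model is an assumption that would need to be tested.

```python
# A small hand-copied subset of the English stop words printed above
stop_subset = {'i', 'am', 'is', 'the', 'a', 'this', 'no', 'nor', 'not'}

# Hypothetical variant: keep negation words because they can flip the sentiment
negations = {'no', 'nor', 'not'}
stop_keep_negations = stop_subset - negations

def remove_stop_words(text, stop_set):
    """Lower-case the text, split on whitespace and drop words in stop_set."""
    return [word for word in text.lower().split() if word not in stop_set]

tweet = 'This is not good'
print(remove_stop_words(tweet, stop_subset))          # negation is dropped
print(remove_stop_words(tweet, stop_keep_negations))  # negation is kept
```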

In [4]:
# Loading data as pandas df
twitter_data = pd.read_csv('/kaggle/input/sentiment140/training.1600000.processed.noemoticon.csv',
                              encoding = 'ISO-8859-1')

In [5]:
# Rows and columns
twitter_data.shape

(1599999, 6)

In [6]:
# Peeking data
twitter_data.head()

Unnamed: 0,0,1467810369,Mon Apr 06 22:19:45 PDT 2009,NO_QUERY,_TheSpecialOne_,"@switchfoot http://twitpic.com/2y1zl - Awww, that's a bummer. You shoulda got David Carr of Third Day to do it. ;D"
0,0,1467810672,Mon Apr 06 22:19:49 PDT 2009,NO_QUERY,scotthamilton,is upset that he can't update his Facebook by ...
1,0,1467810917,Mon Apr 06 22:19:53 PDT 2009,NO_QUERY,mattycus,@Kenichan I dived many times for the ball. Man...
2,0,1467811184,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,ElleCTF,my whole body feels itchy and like its on fire
3,0,1467811193,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,Karoli,"@nationwideclass no, it's not behaving at all...."
4,0,1467811372,Mon Apr 06 22:20:00 PDT 2009,NO_QUERY,joy_wolf,@Kwesidei not the whole crew


There is a problem with this data: the file has no header row with column names.

In the peek above, pandas has used the first tweet as the header, which is also why the shape shows 1,599,999 rows instead of 1,600,000.

So, we need to name the columns manually. We will use the same names as given in the Kaggle dataset overview.

In [7]:
# Naming the columns
column_names = ['target', 'id', 'date', 'flag', 'user', 'text']
# Reading the dataset again
twitter_data = pd.read_csv('/kaggle/input/sentiment140/training.1600000.processed.noemoticon.csv',
                              names = column_names, encoding = 'ISO-8859-1')
# Peeking at the dataframe again
twitter_data.head()

Unnamed: 0,target,id,date,flag,user,text
0,0,1467810369,Mon Apr 06 22:19:45 PDT 2009,NO_QUERY,_TheSpecialOne_,"@switchfoot http://twitpic.com/2y1zl - Awww, t..."
1,0,1467810672,Mon Apr 06 22:19:49 PDT 2009,NO_QUERY,scotthamilton,is upset that he can't update his Facebook by ...
2,0,1467810917,Mon Apr 06 22:19:53 PDT 2009,NO_QUERY,mattycus,@Kenichan I dived many times for the ball. Man...
3,0,1467811184,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,ElleCTF,my whole body feels itchy and like its on fire
4,0,1467811193,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,Karoli,"@nationwideclass no, it's not behaving at all...."


Now, according to dataset overview, the 'target' column contains either 0 or 4.

Here, 0 = negative, 4 = positive (Sentiment)
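
The notebook keeps the 0/4 coding, which LogisticRegression handles without trouble. If 0/1 labels are preferred, a mapping along these lines should work. This is a minimal sketch on a toy DataFrame (`toy` is a made-up stand-in for `twitter_data`).

```python
import pandas as pd

# Toy stand-in for twitter_data with the same 0/4 label coding
toy = pd.DataFrame({'target': [0, 4, 4, 0],
                    'text': ['bad day', 'great day', 'love it', 'so sad']})

# Map 4 (positive) to 1; 0 (negative) stays 0
toy['target'] = toy['target'].replace({4: 1})

print(toy['target'].tolist())
```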

In [8]:
# Checking any missing values
twitter_data.isnull().sum()    # Total missing values in all columns

target    0
id        0
date      0
flag      0
user      0
text      0
dtype: int64

Good, there are no missing values in the dataset.

In [9]:
# Checking the distribution of the 'target' column
twitter_data['target'].value_counts()

target
0    800000
4    800000
Name: count, dtype: int64

So, 800,000 (8 lakh) tweets have positive sentiment (target = 4) and

800,000 (8 lakh) tweets have negative sentiment (target = 0).

Good, the two classes are perfectly balanced.

#### Stemming
- It is the process of reducing a word to its root form (stem).
- Example : acting, acted, acts = act
- We do this to shrink the vocabulary, so that different forms of a word count as one feature.
- For this we will use the PorterStemmer which we have imported.
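
To make the idea concrete, here is a small sketch that runs PorterStemmer on a few words that appear in the tweets of this dataset. Note that a stem does not have to be a dictionary word; it only has to be the same for related forms.

```python
from nltk.stem.porter import PorterStemmer

port_stem = PorterStemmer()

# Words taken from the tweets in this dataset
words = ['behaving', 'body', 'itchy', 'update', 'feels']
stems = [port_stem.stem(word) for word in words]

for word, stem in zip(words, stems):
    print(f'{word} -> {stem}')
```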

In [10]:
# Initialising Porter Stemmer
port_stem = PorterStemmer()

In [11]:
# Improving 'text' column - Building a template for stemming

# Build the stop-word set once; rebuilding the list for every word is very slow
stop_words = set(stopwords.words('english'))

def stemming(content):  # Here, content is a single tweet from the dataset

    stemmed_content = re.sub('[^a-zA-Z]', ' ', content)  # Keep letters only, no [1,2,@,#,$...]
    stemmed_content = stemmed_content.lower()            # Converting words to lower case
    stemmed_content = stemmed_content.split()            # Splitting into a list of words

    # Stem only the words that are not stop words; stop words are dropped
    stemmed_content = [port_stem.stem(word) for word in stemmed_content
                          if word not in stop_words]

    stemmed_content = ' '.join(stemmed_content)
    return stemmed_content

In [12]:
# Creating a new column in dataframe called 'stemmed_content'
# and applying 'stemming' function to the 'text' column.
twitter_data['stemmed_content'] = twitter_data['text'].apply(stemming)

# CAUTION : This will take a lot of time, 1.6 million rows.

In [13]:
# Now, let's check our new column
twitter_data.head()

Unnamed: 0,target,id,date,flag,user,text,stemmed_content
0,0,1467810369,Mon Apr 06 22:19:45 PDT 2009,NO_QUERY,_TheSpecialOne_,"@switchfoot http://twitpic.com/2y1zl - Awww, t...",switchfoot http twitpic com zl awww bummer sho...
1,0,1467810672,Mon Apr 06 22:19:49 PDT 2009,NO_QUERY,scotthamilton,is upset that he can't update his Facebook by ...,upset updat facebook text might cri result sch...
2,0,1467810917,Mon Apr 06 22:19:53 PDT 2009,NO_QUERY,mattycus,@Kenichan I dived many times for the ball. Man...,kenichan dive mani time ball manag save rest g...
3,0,1467811184,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,ElleCTF,my whole body feels itchy and like its on fire,whole bodi feel itchi like fire
4,0,1467811193,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,Karoli,"@nationwideclass no, it's not behaving at all....",nationwideclass behav mad see


Now, to build our sentiment analysis model, we will use 'target' and 'stemmed_content'.

We don't need the 'id', 'date', 'flag', 'user' and 'text' columns.

In [14]:
# Separating data and label
X = twitter_data['stemmed_content'].values 
Y = twitter_data['target'].values 

In [15]:
# Data 
print(X)

['switchfoot http twitpic com zl awww bummer shoulda got david carr third day'
 'upset updat facebook text might cri result school today also blah'
 'kenichan dive mani time ball manag save rest go bound' ...
 'readi mojo makeov ask detail'
 'happi th birthday boo alll time tupac amaru shakur'
 'happi charitytuesday thenspcc sparkschar speakinguph h']


In [16]:
# Label
print(Y)

[0 0 0 ... 4 4 4]


In [17]:
# Splitting data into training and test data using the train_test_split function

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.2, 
                                                      stratify = Y, random_state = 2)

# test_size = 0.2 MEANS 20% of data will go to test data and 80% will go to training data.
# stratify = Y MEANS both sets keep the same ratio of positive (4) and negative (0) tweets as the full data.
# random_state = 2 fixes the random seed, so the same split is produced on every run.
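
The effect of `stratify` and `random_state` is easier to see on a small example. The sketch below uses 100 made-up tweets (`X_toy`, `Y_toy` and `split_counts` are illustrative names, not part of the notebook) with the same split settings as above.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Toy data: 50 negative (0) and 50 positive (4) labels
X_toy = [f'tweet {i}' for i in range(100)]
Y_toy = [0] * 50 + [4] * 50

def split_counts():
    """Split with the notebook's settings and count labels per set."""
    X_tr, X_te, Y_tr, Y_te = train_test_split(
        X_toy, Y_toy, test_size=0.2, stratify=Y_toy, random_state=2)
    return X_te, Counter(Y_tr), Counter(Y_te)

X_te_first, train_counts, test_counts = split_counts()
X_te_second, _, _ = split_counts()

print(train_counts)                 # class counts in the training set
print(test_counts)                  # class counts in the test set
print(X_te_first == X_te_second)    # same seed, same split
```

With a balanced input, stratifying gives 40 tweets of each class for training and 10 of each for testing, and repeating the call with the same `random_state` returns the identical test set.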

In [18]:
# Training data (stemmed content)
print(f'Original data: {X.shape}')
print(f'Training data: {X_train.shape}')
print(f'Testing data: {X_test.shape}')

Original data: (1600000,)
Training data: (1280000,)
Testing data: (320000,)


Machine learning models cannot work on raw text, so we need to convert the textual data to numerical data using a vectorizer.

In [19]:
# Converting textual data to numerical data
vectorizer = TfidfVectorizer()

X_train = vectorizer.fit_transform(X_train)
X_test = vectorizer.transform(X_test)

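# TF-IDF gives each word in a tweet a weight: higher for words that are frequent in that tweet,
# lower for words that appear in many tweets.
# fit_transform learns the vocabulary from the training data only; the test data is just transformed.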
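
The printed sparse matrices further down read as `(row, column index)  weight`. The small sketch below, on three made-up stemmed tweets, shows where those numbers come from. With the default settings TfidfVectorizer scales every row to unit length, which would explain why a tweet with a single known word shows a weight of exactly 1.0.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Three already-stemmed toy "tweets"; the second has a single word
docs = ['whole bodi feel itchi like fire',
        'happi',
        'upset updat facebook']

toy_vectorizer = TfidfVectorizer()
X_toy = toy_vectorizer.fit_transform(docs)

# One row per tweet, one column per distinct word in the vocabulary
print(X_toy.shape)

# Each row is scaled to unit (L2) length by default
row_norms = np.sqrt(np.asarray(X_toy.multiply(X_toy).sum(axis=1)).ravel())
print(row_norms)

# A one-word tweet therefore gets a single weight of 1.0
print(X_toy[1].toarray().max())
```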

In [20]:
# Let's check the converted numerical data
print(X_train)

  (0, 443066)	0.4484755317023172
  (0, 235045)	0.41996827700291095
  (0, 109306)	0.3753708587402299
  (0, 185193)	0.5277679060576009
  (0, 354543)	0.3588091611460021
  (0, 436713)	0.27259876264838384
  (1, 160636)	1.0
  (2, 288470)	0.16786949597862733
  (2, 132311)	0.2028971570399794
  (2, 150715)	0.18803850583207948
  (2, 178061)	0.1619010109445149
  (2, 409143)	0.15169282335109835
  (2, 266729)	0.24123230668976975
  (2, 443430)	0.3348599670252845
  (2, 77929)	0.31284080750346344
  (2, 433560)	0.3296595898028565
  (2, 406399)	0.32105459490875526
  (2, 129411)	0.29074192727957143
  (2, 407301)	0.18709338684973031
  (2, 124484)	0.1892155960801415
  (2, 109306)	0.4591176413728317
  (3, 172421)	0.37464146922154384
  (3, 411528)	0.27089772444087873
  (3, 388626)	0.3940776331458846
  (3, 56476)	0.5200465453608686
  :	:
  (1279996, 390130)	0.22064742191076112
  (1279996, 434014)	0.2718945052332447
  (1279996, 318303)	0.21254698865277746
  (1279996, 237899)	0.2236567560099234
  (1279996, 2910

In [21]:
print(X_test)

  (0, 420984)	0.17915624523539803
  (0, 409143)	0.31430470598079707
  (0, 398906)	0.3491043873264267
  (0, 388348)	0.21985076072061738
  (0, 279082)	0.1782518010910344
  (0, 271016)	0.4535662391658828
  (0, 171378)	0.2805816206356073
  (0, 138164)	0.23688292264071403
  (0, 132364)	0.25525488955578596
  (0, 106069)	0.3655545001090455
  (0, 67828)	0.26800375270827315
  (0, 31168)	0.16247724180521766
  (0, 15110)	0.1719352837797837
  (1, 366203)	0.24595562404108307
  (1, 348135)	0.4739279595416274
  (1, 256777)	0.28751585696559306
  (1, 217562)	0.40288153995289894
  (1, 145393)	0.575262969264869
  (1, 15110)	0.211037449588008
  (1, 6463)	0.30733520460524466
  (2, 400621)	0.4317732461913093
  (2, 256834)	0.2564939661498776
  (2, 183312)	0.5892069252021465
  (2, 89448)	0.36340369428387626
  (2, 34401)	0.37916255084357414
  :	:
  (319994, 123278)	0.4530341382559843
  (319995, 444934)	0.3211092817599261
  (319995, 420984)	0.22631428606830145
  (319995, 416257)	0.23816465111736276
  (319995, 3

Training the machine learning model. In this case, Logistic Regression (a classification model).

Here, the 2 classes are positive tweet and negative tweet.

In [22]:
# ML Model
model = LogisticRegression(max_iter = 1000)   # max_iter = 1000 gives the solver enough iterations to converge.

Fitting the training data into the LogisticRegression model.

In [23]:
model.fit(X_train, Y_train)

In [24]:
# Model Evaluation - Accuracy Score on the training data.

X_train_prediction = model.predict(X_train)
training_data_accuracy = accuracy_score(Y_train, X_train_prediction)

In [25]:
print(f'Accuracy score on the training data {training_data_accuracy}')

Accuracy score on the training data 0.81020703125


- Accuracy score of the model on the training data is 81%.
- For about 81 out of 100 training tweets, the model predicts the correct sentiment.

#### But, But, But 
- We have not yet shown the testing data to the model.
- This accuracy score is on the training data only.
- Let's show it some NEW tweets.

In [26]:
# Accuracy Score on the testing data.

X_test_prediction = model.predict(X_test)
testing_data_accuracy = accuracy_score(Y_test, X_test_prediction)

print(f'Accuracy score on the testing data {testing_data_accuracy}')

Accuracy score on the testing data 0.7780125


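- Training data accuracy (81.0%) and testing data accuracy (77.8%) are close to each other, about 3 percentage points apart.
- The model generalises reasonably well to tweets it has not seen.
- If the accuracy score on the testing data is much lower than the accuracy score on the training data :
- Then this is called overfitting (the model has memorised the training data instead of learning patterns that generalise).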
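
The comparison above can be written as a quick check. The two scores are the ones reported by this notebook; the 5-point threshold (`max_gap_points`) is a made-up value for illustration only, since what counts as "too large" a gap depends on the problem.

```python
# Accuracy scores reported by the notebook
training_data_accuracy = 0.81020703125
testing_data_accuracy = 0.7780125

# Gap in percentage points between training and testing accuracy
gap_points = (training_data_accuracy - testing_data_accuracy) * 100
print(f'Train/test gap: {gap_points:.2f} percentage points')

# Hypothetical threshold, chosen only for illustration
max_gap_points = 5.0
print('Possible overfitting' if gap_points > max_gap_points else 'Gap looks acceptable')
```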

### MODEL ACCURACY = 77.8%

In [27]:
# Saving the trained model
import pickle
filename = 'TSA_trained_model.sav'
with open(filename, 'wb') as f:   # 'wb' = write in binary mode (overwrites any existing file)
    pickle.dump(model, f)

# We are saving the learned parameters of the model in the file 'TSA_trained_model.sav'.
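
The saved file holds only the LogisticRegression model. To classify a brand-new tweet later, the same fitted TfidfVectorizer (and the same stemming step) is needed as well. One common way to keep them together is scikit-learn's `Pipeline`. The sketch below uses a tiny made-up training set and an in-memory pickle, so it is an illustration under those assumptions rather than the notebook's method.

```python
import pickle
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Tiny made-up, already-stemmed training set (0 = negative, 4 = positive)
texts = ['happi love great day', 'love thi song', 'sad upset cri', 'hate bad day']
labels = [4, 4, 0, 0]

# Vectorizer and classifier bundled into one object
toy_pipeline = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('clf', LogisticRegression(max_iter=1000)),
])
toy_pipeline.fit(texts, labels)

# Round-trip through pickle (in memory here; a file works the same way)
restored = pickle.loads(pickle.dumps(toy_pipeline))

# The restored pipeline accepts (stemmed) text directly, no separate vectorizer needed
new_texts = ['love great song', 'sad bad day']
print(restored.predict(new_texts))
```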

### Using Model for NEW Predictions

In [28]:
# Loading the saved model

with open(filename, 'rb') as f:   # 'rb' = read in binary mode
    loaded_model = pickle.load(f)

In [29]:
# Tweet at index 200 of the test set

X_new = X_test[200]
print(f'True Label: {Y_test[200]}')

prediction = loaded_model.predict(X_new)

if (prediction[0] == 0):
    print('Negative Tweet')
else:
    print('Positive Tweet')

True Label: 4
Positive Tweet


In [30]:
# There are so many tweets, let's pick another one
# Tweet at index 48294 of the test set

X_new = X_test[48294]
print(f'True Label: {Y_test[48294]}')

prediction = loaded_model.predict(X_new)

if (prediction[0] == 0):
    print('Negative Tweet')
else:
    print('Positive Tweet')

True Label: 4
Positive Tweet


So, our model predicted both of these tweets correctly.

It will not always be right, though: with a test accuracy of about 78%, roughly 1 in 5 predictions is wrong.