# TWITTER SENTIMENT ANALYSIS (NLP) | Machine Learning Projects | GeeksforGeeks
https://www.youtube.com/watch?v=4YGkfAd2iXM&t=20s

## Workflow

### Train

1. Twitter Data Collection (via Kaggle API)
2. Data pre-processing
3. Train-Test split
4. Train Logistic Regression model
5. Save the trained model
6. Use the saved model for future predictions

### Test

New data -> Trained model -> Prediction (Positive or Negative tweets)


## 1. Twitter Data Collection (via Kaggle API)

In [None]:
# installing kaggle library
# !pip install kaggle

### Upload your kaggle.json file

https://www.kaggle.com/settings

In [1]:
# configuring the path of kaggle.json file
!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json

### Import "Sentiment140 dataset with 1.6 million tweets"

https://www.kaggle.com/datasets/kazanova/sentiment140



In [2]:
# API to fetch the dataset from kaggle
!kaggle datasets download -d kazanova/sentiment140

Downloading sentiment140.zip to /Users/williamyeh/Documents/Codes/AI/twitter-entiment-analysis
100%|██████████████████████████████████████| 80.9M/80.9M [00:18<00:00, 4.91MB/s]
100%|██████████████████████████████████████| 80.9M/80.9M [00:18<00:00, 4.70MB/s]


In [3]:
# extract the zip file

from zipfile import ZipFile
dataset = 'sentiment140.zip'

with ZipFile(dataset, 'r') as zip_file: # avoid shadowing the built-in zip()
  zip_file.extractall()
  print('The dataset has been extracted.')

The dataset has been extracted.


### Importing Dependencies

In [4]:
import numpy as np
import pandas as pd
import re # regex
from nltk.corpus import stopwords # natural language toolkit
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer # textual data to numerical data
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score


### Download stopwords

Stop words are a set of commonly used words in a language. In English, for example, “the”, “is” and “and” qualify as stop words. NLP and text mining applications usually remove stop words so that the remaining, more informative words carry the analysis.
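
As a minimal sketch of what stop-word removal does, the snippet below filters a sentence against a tiny hand-written stop list. The `SAMPLE_STOP_WORDS` set and the `remove_stop_words` helper are illustrative assumptions, not part of NLTK; the notebook itself uses the full list from `stopwords.words('english')`.

```python
# Tiny illustrative stop list (an assumption; NLTK's English list is much longer).
SAMPLE_STOP_WORDS = {"the", "is", "and", "a", "of", "to"}

def remove_stop_words(text):
    """Lowercase the text, split on whitespace, and drop stop words."""
    return [word for word in text.lower().split() if word not in SAMPLE_STOP_WORDS]

kept = remove_stop_words("The movie is long and the ending is great")
print(kept)
```

Only the words that carry meaning for this sentence ("movie", "long", "ending", "great") remain.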

In [5]:
import nltk
nltk.download('stopwords')

print(stopwords.words('english'))

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/williamyeh/nltk_data...


['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

[nltk_data]   Unzipping corpora/stopwords.zip.


## 2. Data pre-processing

### Loading data

In [6]:
# Loading the dataset from csv file to a pandas dataframe
twitter_data = pd.read_csv('training.1600000.processed.noemoticon.csv', encoding = 'ISO-8859-1')

In [7]:
# Checking the number of rows and columns in the dataframe
twitter_data.shape

# Column descriptions (example values in parentheses):
# target: the polarity of the tweet (0 = negative, 2 = neutral, 4 = positive)
# ids: the id of the tweet (2087)
# date: the date of the tweet (Sat May 16 23:58:44 UTC 2009)
# flag: the query (lyx); if there is no query, this value is NO_QUERY
# user: the user that tweeted (robotickilldozr)
# text: the text of the tweet (Lyx is cool)

(1599999, 6)

In [8]:
# Displaying the first 5 rows of the dataframe
twitter_data.head()

Unnamed: 0,0,1467810369,Mon Apr 06 22:19:45 PDT 2009,NO_QUERY,_TheSpecialOne_,"@switchfoot http://twitpic.com/2y1zl - Awww, that's a bummer. You shoulda got David Carr of Third Day to do it. ;D"
0,0,1467810672,Mon Apr 06 22:19:49 PDT 2009,NO_QUERY,scotthamilton,is upset that he can't update his Facebook by ...
1,0,1467810917,Mon Apr 06 22:19:53 PDT 2009,NO_QUERY,mattycus,@Kenichan I dived many times for the ball. Man...
2,0,1467811184,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,ElleCTF,my whole body feels itchy and like its on fire
3,0,1467811193,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,Karoli,"@nationwideclass no, it's not behaving at all...."
4,0,1467811372,Mon Apr 06 22:20:00 PDT 2009,NO_QUERY,joy_wolf,@Kwesidei not the whole crew


### The file has no header row, so the first tweet was used as the column names. We need to supply the column names

In [9]:
# naming the columns and reading the data again

column_names = ['target', 'ids', 'date', 'flag', 'user', 'text']

twitter_data = pd.read_csv('training.1600000.processed.noemoticon.csv', names=column_names, encoding = 'ISO-8859-1')

In [10]:
# Checking the number of rows and columns in the dataframe again
twitter_data.shape

(1600000, 6)

In [11]:
# Displaying the first 5 rows of the dataframe again
twitter_data.head()

Unnamed: 0,target,ids,date,flag,user,text
0,0,1467810369,Mon Apr 06 22:19:45 PDT 2009,NO_QUERY,_TheSpecialOne_,"@switchfoot http://twitpic.com/2y1zl - Awww, t..."
1,0,1467810672,Mon Apr 06 22:19:49 PDT 2009,NO_QUERY,scotthamilton,is upset that he can't update his Facebook by ...
2,0,1467810917,Mon Apr 06 22:19:53 PDT 2009,NO_QUERY,mattycus,@Kenichan I dived many times for the ball. Man...
3,0,1467811184,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,ElleCTF,my whole body feels itchy and like its on fire
4,0,1467811193,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,Karoli,"@nationwideclass no, it's not behaving at all...."


### Handle missing values

In [12]:
# counting the number of missing values in the dataframe
twitter_data.isnull().sum()

target    0
ids       0
date      0
flag      0
user      0
text      0
dtype: int64

### Make sure the data has "even distribution"

In [13]:
# checking the distribution of the target variable (0 = negative, 4 = positive)
twitter_data['target'].value_counts()

target
0    800000
4    800000
Name: count, dtype: int64

### Convert the target "4" to "1" 

(0 = negative, 1 = positive)

In [14]:
twitter_data.replace({'target': {4: 1}}, inplace = True)

In [15]:
# checking the distribution of the target variable again
twitter_data['target'].value_counts()

target
0    800000
1    800000
Name: count, dtype: int64

### Stemming

Stemming is a text-processing method that strips affixes from words (mostly suffixes, in the case of the Porter stemmer used here) and reduces them to a common root form. The main objective is to standardize word variants so that they are treated as the same feature, which shrinks the vocabulary and helps downstream natural language processing tasks.

For example, the following words all reduce to the root word "like":
->"likes"
->"liked"
->"likely"
->"liking"

In [16]:
port_stem = PorterStemmer()

In [18]:
# build the stop word set once; calling stopwords.words('english') for every word is very slow
stop_words = set(stopwords.words('english'))

def stemming(content):
  stemmed_content = re.sub('[^a-zA-Z]', ' ', content) # replace every non-letter character with a space
  stemmed_content = stemmed_content.lower() # convert all the text to lowercase
  stemmed_content = stemmed_content.split() # split the text into a list of words
  stemmed_content = [port_stem.stem(word) for word in stemmed_content if word not in stop_words] # drop stop words, stem the rest
  stemmed_content = ' '.join(stemmed_content) # join the stems back into one string
  return stemmed_content
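
To make the cleaning steps concrete, here is a small sketch that applies only the first three steps of `stemming` (letters only, lowercase, tokenize) to a shortened version of the first tweet in the dataset. It uses only the standard library, so stop-word removal and stemming are left out; the variable names are illustrative.

```python
import re

sample = "@switchfoot http://twitpic.com/2y1zl - Awww, that's a bummer."

# Same three steps as the start of stemming(): keep letters only, lowercase, tokenize.
letters_only = re.sub('[^a-zA-Z]', ' ', sample)
tokens = letters_only.lower().split()
print(tokens)
```

Note that the user handle and the pieces of the URL are not removed; they are only split into tokens. This is why the `stemmed_content` column for this tweet starts with `switchfoot http twitpic com zl`.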

In [19]:
# creating a new column in the dataframe and applying the stemming function
twitter_data['stemmed_content'] = twitter_data['text'].apply(stemming) # slow step: took about 16 mins for 1.6 million rows when this notebook was run

In [20]:
twitter_data.head()

Unnamed: 0,target,ids,date,flag,user,text,stemmed_content
0,0,1467810369,Mon Apr 06 22:19:45 PDT 2009,NO_QUERY,_TheSpecialOne_,"@switchfoot http://twitpic.com/2y1zl - Awww, t...",switchfoot http twitpic com zl awww bummer sho...
1,0,1467810672,Mon Apr 06 22:19:49 PDT 2009,NO_QUERY,scotthamilton,is upset that he can't update his Facebook by ...,upset updat facebook text might cri result sch...
2,0,1467810917,Mon Apr 06 22:19:53 PDT 2009,NO_QUERY,mattycus,@Kenichan I dived many times for the ball. Man...,kenichan dive mani time ball manag save rest g...
3,0,1467811184,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,ElleCTF,my whole body feels itchy and like its on fire,whole bodi feel itchi like fire
4,0,1467811193,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,Karoli,"@nationwideclass no, it's not behaving at all....",nationwideclass behav mad see


In [21]:
# separating the data and label
x = twitter_data['stemmed_content'].values # NumPy array of stemmed tweets
y = twitter_data['target'].values

## 3. Train-Test split

In [23]:
# test_size: 20% of the data will be used for testing
# stratify: it will ensure that the distribution of data is similar in both the training and testing datasets (not all 0 or 1 in training or testing)
# random_state: it will ensure that the data is split in the same way if we run the code again (like seed)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.2, stratify=y, random_state = 2)   
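
The effect of `stratify` can be illustrated with a small standard-library sketch: split each class separately so that both parts keep the same class ratio. This is an illustration of the idea under simple assumptions, not how `train_test_split` is implemented; the `stratified_split` helper is hypothetical.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, test_size=0.2, seed=2):
    """Return (train_idx, test_idx) with the same class ratio in both parts."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)                      # shuffle within each class
        n_test = round(len(indices) * test_size)  # test share taken per class
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx

labels = [0] * 50 + [1] * 50
train_idx, test_idx = stratified_split(labels)
print(Counter(labels[i] for i in train_idx))
print(Counter(labels[i] for i in test_idx))
```

With 50 tweets per class and `test_size=0.2`, each class contributes 10 items to the test part and 40 to the training part, so both parts stay balanced.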

In [24]:
print(x.shape, x_train.shape, x_test.shape)

(1600000,) (1280000,) (320000,)


### Converting the textual data to numerical data

In [25]:
vectorizer = TfidfVectorizer() # TF-IDF: weight each word by how often it appears in a tweet, scaled down for words that appear in many tweets

# fit: it will learn the vocabulary and idf from the training data
# transform: it will use the fitted vocabulary to convert the text data into numerical data
x_train = vectorizer.fit_transform(x_train)
x_test = vectorizer.transform(x_test) # we are not fitting the test data, we are only transforming it
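
For intuition about the scores that `TfidfVectorizer` produces, here is a pure-Python sketch of TF-IDF using the formula that scikit-learn applies with its default settings (`smooth_idf=True`, `norm='l2'`): `idf(t) = ln((1 + n) / (1 + df(t))) + 1`, followed by L2 normalization of each row. The `tfidf` helper and the three sample documents are assumptions for illustration, and the sketch tokenizes on whitespace only, which is simpler than scikit-learn's default tokenizer.

```python
import math
from collections import Counter

def tfidf(documents):
    """TF-IDF with smoothed IDF and L2 row normalization (non-empty documents only)."""
    tokenized = [doc.split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}

    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)                       # raw term frequency
        weights = {t: c * idf[t] for t, c in counts.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({t: w / norm for t, w in weights.items()})
    return rows

docs = ["happi day", "sad day", "happi happi friend"]
rows = tfidf(docs)
for row in rows:
    print({term: round(weight, 3) for term, weight in row.items()})
```

In the second document, "sad" gets a higher score than "day" because "day" also appears in another document, so its IDF is lower.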

In [26]:
print(x_train)

'''
The TF-IDF scores of the words in the first tweet, printed as (tweet index, vocabulary index) followed by the score
  (0, 443066)	0.4484755317023172
  (0, 235045)	0.41996827700291095
  (0, 109306)	0.3753708587402299
  (0, 185193)	0.5277679060576009
  (0, 354543)	0.3588091611460021
  (0, 436713)	0.27259876264838384
'''

  (0, 443066)	0.4484755317023172
  (0, 235045)	0.41996827700291095
  (0, 109306)	0.3753708587402299
  (0, 185193)	0.5277679060576009
  (0, 354543)	0.3588091611460021
  (0, 436713)	0.27259876264838384
  (1, 160636)	1.0
  (2, 288470)	0.16786949597862733
  (2, 132311)	0.2028971570399794
  (2, 150715)	0.18803850583207948
  (2, 178061)	0.1619010109445149
  (2, 409143)	0.15169282335109835
  (2, 266729)	0.24123230668976975
  (2, 443430)	0.3348599670252845
  (2, 77929)	0.31284080750346344
  (2, 433560)	0.3296595898028565
  (2, 406399)	0.32105459490875526
  (2, 129411)	0.29074192727957143
  (2, 407301)	0.18709338684973031
  (2, 124484)	0.1892155960801415
  (2, 109306)	0.4591176413728317
  (3, 172421)	0.37464146922154384
  (3, 411528)	0.27089772444087873
  (3, 388626)	0.3940776331458846
  (3, 56476)	0.5200465453608686
  :	:
  (1279996, 390130)	0.22064742191076112
  (1279996, 434014)	0.2718945052332447
  (1279996, 318303)	0.21254698865277746
  (1279996, 237899)	0.2236567560099234
  (1279996, 2910

In [27]:
print(x_test)

  (0, 420984)	0.17915624523539803
  (0, 409143)	0.31430470598079707
  (0, 398906)	0.3491043873264267
  (0, 388348)	0.21985076072061738
  (0, 279082)	0.1782518010910344
  (0, 271016)	0.4535662391658828
  (0, 171378)	0.2805816206356073
  (0, 138164)	0.23688292264071403
  (0, 132364)	0.25525488955578596
  (0, 106069)	0.3655545001090455
  (0, 67828)	0.26800375270827315
  (0, 31168)	0.16247724180521766
  (0, 15110)	0.1719352837797837
  (1, 366203)	0.24595562404108307
  (1, 348135)	0.4739279595416274
  (1, 256777)	0.28751585696559306
  (1, 217562)	0.40288153995289894
  (1, 145393)	0.575262969264869
  (1, 15110)	0.211037449588008
  (1, 6463)	0.30733520460524466
  (2, 400621)	0.4317732461913093
  (2, 256834)	0.2564939661498776
  (2, 183312)	0.5892069252021465
  (2, 89448)	0.36340369428387626
  (2, 34401)	0.37916255084357414
  :	:
  (319994, 123278)	0.4530341382559843
  (319995, 444934)	0.32110928175992615
  (319995, 420984)	0.22631428606830148
  (319995, 416257)	0.23816465111736282
  (319995, 

## 4. Train Logistic Regression model

### Logistic Regression

Logistic regression is commonly used for prediction and classification problems. Here we use it to classify positive tweets and negative tweets. 
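
As a rough sketch of how a fitted logistic regression turns feature values into a label: it computes a weighted sum of the features, passes it through the sigmoid function to get a probability, and predicts the positive class when that probability is at least 0.5. The weights, bias and feature values below are made-up numbers for illustration, not values learned by the model in this notebook.

```python
import math

def predict_proba(features, weights, bias):
    """Probability of the positive class: sigmoid of the weighted sum."""
    z = bias + sum(weights.get(term, 0.0) * value for term, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical weights: positive values push towards "positive tweet".
weights = {"happi": 2.0, "sad": -2.5, "day": 0.1}
bias = 0.0

tweet = {"happi": 0.8, "day": 0.6}  # TF-IDF-style feature values (made up)
p = predict_proba(tweet, weights, bias)
label = 1 if p >= 0.5 else 0
print(round(p, 3), label)
```

Training (`model.fit`) is the step that finds the weights; prediction only applies them as shown here.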

In [28]:
model = LogisticRegression(max_iter=1000) # max_iter: maximum number of iterations taken

In [29]:
model.fit(x_train, y_train) # training the model

### Model evaluation

In [30]:
# Accuracy score on the training data
x_train_prediction = model.predict(x_train) # ask the model to predict the target variable for the training data
training_data_accuracy = accuracy_score(y_train, x_train_prediction) # compare the predicted target variable with the actual target variable

In [31]:
print('Accuracy score of the training data : ', training_data_accuracy)

Accuracy score of the training data :  0.79871953125


In [32]:
# Accuracy score on the test data
x_test_prediction = model.predict(x_test) # ask the model to predict the target variable for the test data
test_data_accuracy = accuracy_score(y_test, x_test_prediction) # compare the predicted target variable with the actual target variable

In [33]:
print('Accuracy score of the test data : ', test_data_accuracy)

Accuracy score of the test data :  0.77668125


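The training accuracy (about 79.9%) and the test accuracy (about 77.7%) are close. This suggests that the model is not overfitting badly and generalizes to unseen tweets about as well as it fits the training data.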
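
For reference, accuracy is simply the fraction of predictions that match the true labels. The sketch below shows that calculation on five made-up labels and then computes the gap between the two scores printed above; the `accuracy` helper is illustrative and stands in for `accuracy_score`.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

y_true = [1, 0, 1, 1, 0]
y_pred = [1, 0, 0, 1, 0]
print(accuracy(y_true, y_pred))  # 4 of 5 predictions are correct

# Gap between the training and test accuracy printed by the notebook.
train_acc, test_acc = 0.79871953125, 0.77668125
gap = train_acc - test_acc
print(round(gap, 4))
```

A gap of roughly two percentage points is small; a much larger gap would be a sign of overfitting.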

## 5. Save the trained model

In [34]:
import pickle

In [35]:
filename = 'trained_model.sav' # the extension is arbitrary; .pkl or .sav both work
with open(filename, 'wb') as f: # write binary; the file is closed automatically
  pickle.dump(model, f)
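
The saved model only accepts TF-IDF vectors produced by the same fitted `vectorizer`, so scoring raw tweets in a later session requires saving the vectorizer as well. One possible approach, sketched below, is to pickle both objects together in a dictionary. The `bundle` layout and the file name `sentiment_bundle.pkl` are assumptions for illustration, and plain dictionaries stand in for the fitted objects so that the snippet runs on its own.

```python
import os
import pickle
import tempfile

# Stand-ins for the fitted objects (assumption: in the notebook these would be
# the `vectorizer` and `model` variables).
bundle = {
    "vectorizer": {"vocabulary": {"happi": 0, "sad": 1}},
    "model": {"coef": [2.0, -2.5], "intercept": 0.0},
}

path = os.path.join(tempfile.mkdtemp(), "sentiment_bundle.pkl")

# Write the bundle; the file is closed when the block ends.
with open(path, "wb") as f:
    pickle.dump(bundle, f)

# Read it back, as a later session would.
with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored == bundle)
```

Pickle files can execute code when loaded, so only load files from sources you trust.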

## 6. Use the saved model for future predictions

In [36]:
# loading the model from the file
with open('trained_model.sav', 'rb') as f: # read binary
  loaded_model = pickle.load(f)

In [37]:
# pick one tweet from the test data: index 200 (a row of the TF-IDF matrix, not raw text)
x_new = x_test[200]

# print the true label of that tweet
print(y_test[200])

# predict the sentiment of that tweet (0 = negative, 1 = positive)
prediction = loaded_model.predict(x_new)
print(prediction)


1
[1]
