# Twitter sentiment analysis

I am using labelled Twitter comments from [here](https://www.kaggle.com/paoloripamonti/twitter-sentiment-analysis/data). First, I will convert the text into vectors using TfidfVectorizer and CountVectorizer.

On these vectors I will try several models (logistic regression, a random forest classifier, SVC and a gradient boosting classifier) for sentiment analysis. At the end I will conclude which model performs best on this data, and with which vectorizer.

In [15]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import re
import seaborn as sns
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split, GridSearchCV
%matplotlib inline

In [2]:
columns = ['target', 'id', 'date', 'flag', 'user', 'text']

In [3]:
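# Note: the file has no header row, so read_csv consumes the first tweet as the header line
# before the column names are overwritten below (header=None, names=columns would keep it).
df = pd.read_csv('C:/Users/vivek/Downloads/sentiment140/training.1600000.processed.noemoticon.csv', encoding="ISO-8859-1")
df.columns = columns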

In [4]:
df.head()

| | target | id | date | flag | user | text |
|---|---|---|---|---|---|---|
| 0 | 0 | 1467810672 | Mon Apr 06 22:19:49 PDT 2009 | NO_QUERY | scotthamilton | is upset that he can't update his Facebook by ... |
| 1 | 0 | 1467810917 | Mon Apr 06 22:19:53 PDT 2009 | NO_QUERY | mattycus | @Kenichan I dived many times for the ball. Man... |
| 2 | 0 | 1467811184 | Mon Apr 06 22:19:57 PDT 2009 | NO_QUERY | ElleCTF | my whole body feels itchy and like its on fire |
| 3 | 0 | 1467811193 | Mon Apr 06 22:19:57 PDT 2009 | NO_QUERY | Karoli | @nationwideclass no, it's not behaving at all.... |
| 4 | 0 | 1467811372 | Mon Apr 06 22:20:00 PDT 2009 | NO_QUERY | joy_wolf | @Kwesidei not the whole crew |


In [5]:
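# Keep only the label and the tweet text (the positional axis argument is not accepted by newer pandas)
df = df.drop(columns=['id', 'date', 'flag', 'user'])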

In [6]:
df.head()

| | target | text |
|---|---|---|
| 0 | 0 | is upset that he can't update his Facebook by ... |
| 1 | 0 | @Kenichan I dived many times for the ball. Man... |
| 2 | 0 | my whole body feels itchy and like its on fire |
| 3 | 0 | @nationwideclass no, it's not behaving at all.... |
| 4 | 0 | @Kwesidei not the whole crew |


In [7]:
df = df.dropna()

### Splitting data into train-test

In [8]:
X_train, X_test, Y_train, Y_test = train_test_split(df['text'], df['target'])
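
The split above relies on the defaults of `train_test_split` (25% of the rows go to the test set, shuffled with no fixed seed), so the scores further down will vary slightly between runs. As a hedged sketch, a fixed `random_state` and `stratify` make the split repeatable and keep the class balance equal in both parts. The toy `texts` and `labels` below are hypothetical stand-ins for `df['text']` and `df['target']`, and the seed value is arbitrary.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 40 "tweets", half per class
texts = [f"tweet number {i}" for i in range(40)]
labels = [0] * 20 + [1] * 20

# random_state fixes the shuffle; stratify keeps the class ratio in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=42
)

print("train:", Counter(y_tr))
print("test: ", Counter(y_te))
```

On the real data the same two keyword arguments could be added to the existing call; the reported scores would then change slightly because the rows land in different parts.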

## Feature Extraction

We need to extract numeric features from the text before we can pass it to a model, so we will use TfidfVectorizer and CountVectorizer to convert the text into vectors.

### TfidfVectorizer

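It weights each word by its term frequency (how often the word occurs in a document) multiplied by its inverse document frequency (a factor that shrinks as the word appears in more documents of the corpus), so words that are common everywhere contribute less.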
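
To make the weighting concrete, here is a minimal hand-rolled sketch on a toy corpus. It assumes the smoothed IDF form `ln((1 + n) / (1 + df)) + 1` followed by L2 normalisation of each document, which is how I understand scikit-learn's default settings; the three toy sentences and the whitespace tokenisation are illustrative assumptions, not part of the dataset or of the vectorizer.

```python
import math
from collections import Counter

# Toy corpus (hypothetical, not taken from the dataset)
docs = ["good day", "bad day", "good good movie"]
tokenized = [d.split() for d in docs]  # simple whitespace tokenisation
n_docs = len(tokenized)

# Document frequency: in how many documents each term appears
doc_freq = Counter(term for doc in tokenized for term in set(doc))


def tfidf(doc):
    """Return L2-normalised TF-IDF weights for one tokenised document."""
    tf = Counter(doc)
    # term frequency times smoothed inverse document frequency
    raw = {
        t: tf[t] * (math.log((1 + n_docs) / (1 + doc_freq[t])) + 1)
        for t in tf
    }
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {t: w / norm for t, w in raw.items()}


weights = [tfidf(doc) for doc in tokenized]
for text, w in zip(docs, weights):
    print(text, {t: round(v, 3) for t, v in sorted(w.items())})
```

In "bad day", the word "bad" occurs in one document while "day" occurs in two, so "bad" ends up with the larger weight.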

In [9]:
tfidf_vectorizer = TfidfVectorizer(min_df=2)
tfidf_vectorizer.fit(X_train)
X_train_tfidf = tfidf_vectorizer.transform(X_train)
X_test_tfidf = tfidf_vectorizer.transform(X_test)

### CountVectorizer

Unlike TfidfVectorizer, it simply counts the occurrences of each word in a document and stores that count in the word's column.
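
The same idea as a small sketch in plain Python: one row per document, one column per vocabulary word, each cell holding a raw count. The toy corpus and the whitespace tokenisation are assumptions for illustration; the real CountVectorizer applies its own tokenisation and lowercasing.

```python
from collections import Counter

# Toy corpus (hypothetical, not taken from the dataset)
docs = ["good day", "bad day", "good good movie"]

# Columns: the sorted set of all words seen in the corpus
vocabulary = sorted({term for d in docs for term in d.split()})

# Rows: per-document counts, in vocabulary order
matrix = []
for d in docs:
    counts = Counter(d.split())
    matrix.append([counts[term] for term in vocabulary])

print(vocabulary)
for row in matrix:
    print(row)
```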

In [10]:
count_vectorizer = CountVectorizer()
count_vectorizer.fit(X_train)
X_train_count = count_vectorizer.transform(X_train)
X_test_count = count_vectorizer.transform(X_test)

## Model generation

Now that the features are ready, we'll train several models and compare them with each other.

### RandomForestClassifier

In [11]:
rfc_tfidf = RandomForestClassifier(n_estimators=10)
rfc_tfidf.fit(X_train_tfidf, Y_train)
print("Printing scores for TfidfVectorizer:- \n")
print("Train data score: ", rfc_tfidf.score(X_train_tfidf, Y_train))
print("Test data score: ", rfc_tfidf.score(X_test_tfidf, Y_test))

rfc_count = RandomForestClassifier(n_estimators=10)
rfc_count.fit(X_train_count, Y_train)
print("Printing scores for CountVectorizer:- \n")
print("Train data score: ", rfc_count.score(X_train_count, Y_train))
print("Test data score: ", rfc_count.score(X_test_count, Y_test))

Printing scores for TfidfVectorizer:- 

Train data score:  0.9873508227923523
Test data score:  0.7557175
Printing scores for CountVectorizer:- 

Train data score:  0.9884608237173531
Test data score:  0.752745


### GradientBoostingClassifier

In [12]:
gbc_tfidf = GradientBoostingClassifier(loss='exponential')
gbc_tfidf.fit(X_train_tfidf, Y_train)
print("Printing scores for TfidfVectorizer:- \n")
print("Train data score: ", gbc_tfidf.score(X_train_tfidf, Y_train))
print("Test data score: ", gbc_tfidf.score(X_test_tfidf, Y_test))

gbc_count = GradientBoostingClassifier(loss='exponential')
gbc_count.fit(X_train_count, Y_train)
print("Printing scores for CountVectorizer:- \n")
print("Train data score: ", gbc_count.score(X_train_count, Y_train))
print("Test data score: ", gbc_count.score(X_test_count, Y_test))

Printing scores for TfidfVectorizer:- 

Train data score:  0.7009239174365979
Test data score:  0.701105
Printing scores for CountVectorizer:- 

Train data score:  0.6994572495477079
Test data score:  0.6997075


### LogisticRegression

In [13]:
lr_tfidf = LogisticRegression()
lr_tfidf.fit(X_train_tfidf, Y_train)
print("Printing scores for TfidfVectorizer:- \n")
print("Train data score: ", lr_tfidf.score(X_train_tfidf, Y_train))
print("Test data score: ", lr_tfidf.score(X_test_tfidf, Y_test))

lr_count = LogisticRegression()
lr_count.fit(X_train_count, Y_train)
print("Printing scores for CountVectorizer:- \n")
print("Train data score: ", lr_count.score(X_train_count, Y_train))
print("Test data score: ", lr_count.score(X_test_count, Y_test))



Printing scores for TfidfVectorizer:- 

Train data score:  0.8224781853984878
Test data score:  0.803285
Printing scores for CountVectorizer:- 

Train data score:  0.8574990479158733
Test data score:  0.8006675


### SVC

In [None]:
svc_tfidf = SVC()
svc_tfidf.fit(X_train_tfidf, Y_train)
print("Printing scores for TfidfVectorizer:- \n")
print("Train data score: ", svc_tfidf.score(X_train_tfidf, Y_train))
print("Test data score: ", svc_tfidf.score(X_test_tfidf, Y_test))

svc_count = SVC()
svc_count.fit(X_train_count, Y_train)
print("Printing scores for CountVectorizer:- \n")
print("Train data score: ", svc_count.score(X_train_count, Y_train))
print("Test data score: ", svc_count.score(X_test_count, Y_test))
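
The SVC cell above has no execution number or output. A likely reason is cost: training a kernel `SVC` scales at least quadratically with the number of samples, which is impractical for a training set of this size. As a hedged alternative (a different model, not what the cell above runs), a linear SVM via `LinearSVC` handles sparse text features far more cheaply. The sketch below uses a hypothetical toy corpus and labels in place of the real data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Hypothetical toy data standing in for X_train / Y_train
train_texts = [
    "i love this", "what a great day", "so happy today",
    "i hate this", "what a terrible day", "so sad today",
]
train_labels = [1, 1, 1, 0, 0, 0]

vec = TfidfVectorizer()
X = vec.fit_transform(train_texts)

# Linear SVM: no kernel matrix, works directly on the sparse features
clf = LinearSVC(C=1.0)
clf.fit(X, train_labels)

train_acc = clf.score(X, train_labels)
pred = clf.predict(vec.transform(["great happy day"]))
print("Train data score: ", train_acc)
print("Prediction: ", pred[0])
```

On the notebook's data the equivalent call would be `LinearSVC().fit(X_train_tfidf, Y_train)`; its scores are not reported here because that run is not part of the notebook.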



### GridSearchCV for best hyperparameters in LogisticRegression

In [19]:
hyperparameters = {
    'penalty': ['l2'],
    'random_state': [0, 20, 50],
    'C': [0.1, 0.5, 1.0],
    'solver': ['newton-cg', 'liblinear', 'sag', 'saga']
}
lr = LogisticRegression()
CV_lr = GridSearchCV(lr, hyperparameters, cv=3)
CV_lr.fit(X_train_tfidf, Y_train)
print("Best parameters: ", CV_lr.best_params_)
print("Train data score: ", CV_lr.score(X_train_tfidf, Y_train))
print("Test data score: ", CV_lr.score(X_test_tfidf, Y_test))

Best parameters:  {'C': 1.0, 'penalty': 'l2', 'random_state': 0, 'solver': 'saga'}
Train data score:  0.8224723520602933
Test data score:  0.803285
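
Two details of the search above are worth a second look: `random_state` is a seed rather than a tuning parameter, and the vectorizer was fitted on the whole training set before cross-validation. A minimal sketch of one possible alternative is shown below, assuming a `Pipeline` so that the vectorizer is refitted inside each fold and only `C` is searched. The toy texts, labels and step names (`tfidf`, `lr`) are hypothetical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical toy data: 6 positive and 6 negative examples
texts = [
    "good day", "great movie", "love it", "happy today", "nice work", "really good",
    "bad day", "awful movie", "hate it", "sad today", "poor work", "really bad",
]
labels = [1] * 6 + [0] * 6

pipe = Pipeline([
    ("tfidf", TfidfVectorizer()),
    # the seed is fixed once here instead of being searched over
    ("lr", LogisticRegression(solver="liblinear", random_state=0)),
])

# Step-name prefix "lr__" routes the parameter to the classifier
param_grid = {"lr__C": [0.1, 1.0, 10.0]}

search = GridSearchCV(pipe, param_grid, cv=3)
search.fit(texts, labels)
print("Best parameters: ", search.best_params_)
```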


In [23]:
# Retry with an L1 penalty, which the grid above (penalty: ['l2'] only) did not cover
lr_tfidf = LogisticRegression(penalty='l1', random_state=0, C=1.0, solver='saga')
lr_tfidf.fit(X_train_tfidf, Y_train)
print("Printing scores for TfidfVectorizer:- \n")
print("Train data score: ", lr_tfidf.score(X_train_tfidf, Y_train))
print("Test data score: ", lr_tfidf.score(X_test_tfidf, Y_test))



Printing scores for TfidfVectorizer:- 

Train data score:  0.8108965090804242
Test data score:  0.80185


# Conclusion

In conclusion, LogisticRegression with TfidfVectorizer works best, with about 82% accuracy on the training set and 80% on the test set. RandomForestClassifier overfits with both vectorizers: it scores about 99% on the training set but only about 75% on the test set.

GradientBoostingClassifier reached only about 70% on both the training and test sets. I tried to improve LogisticRegression by searching over different parameters with GridSearchCV, but the test accuracy stayed at about 80%.
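
As a small arithmetic check of this comparison, the sketch below takes the TF-IDF train and test scores printed in the cells above and computes the train-test gap for each model. The dictionary layout and variable names are my own; the numbers are copied from the outputs.

```python
# (train accuracy, test accuracy) with TfidfVectorizer, copied from the outputs above
scores = {
    "RandomForestClassifier": (0.9873508227923523, 0.7557175),
    "GradientBoostingClassifier": (0.7009239174365979, 0.701105),
    "LogisticRegression": (0.8224781853984878, 0.803285),
}

# A large positive gap between train and test accuracy suggests overfitting
gaps = {name: train - test for name, (train, test) in scores.items()}

best_test = max(scores, key=lambda name: scores[name][1])
largest_gap = max(gaps, key=gaps.get)

for name, (train, test) in scores.items():
    print(f"{name:28s} test={test:.3f} gap={gaps[name]:+.3f}")
print("Best test accuracy:", best_test)
print("Largest train-test gap:", largest_gap)
```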