## Twitter Sentiment Analysis
### ELEC-ENG 375-475: Final Project
Yemi Kelani


This data was sourced from [Kaggle](https://www.kaggle.com/datasets/kazanova/sentiment140)

---

##### **Background and Motivation**
Sentiment analysis happens in many day-to-day interactions and is an important part of
how humans communicate with each other. Certain combinations of words and phrases
lead us to form assumptions about another person's emotional state and shape the way we
choose to respond. Automated sentiment analysis has become increasingly relevant in
recent years with virtual assistants such as Siri and Alexa, where it helps to
analyze not only the command given but also the sentiment expressed in it.
The area is still under active development and is becoming more important as
AI becomes a larger part of daily life.

\\

##### **Project Goals and Objectives**
We aim to create a model that classifies the sentiment of tweets as negative,
neutral, or positive based on their text. The degree of sentiment is assigned
to a numeric scale. Our data is sourced from Kaggle and contains 1.6 million samples.
Each sample contains the target label, the tweet text, and other information such as
the tweet timestamp and id. After the remapping in the data processing step, the labels
used below are -1 (negative) and 1 (positive), so the models in this notebook are binary classifiers.

---

In [2]:
import numpy as np
import pandas as pd

# preprocessing
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_validate
from sklearn.feature_extraction.text import TfidfVectorizer

# models
from sklearn.model_selection import GridSearchCV
from sklearn.experimental import enable_halving_search_cv 
from sklearn.model_selection import HalvingGridSearchCV
# from sklearn.linear_model import SGDClassifier
from sklearn.linear_model import LogisticRegression

In [None]:
from google.colab import drive
drive.mount('/content/drive')

---
#### **Data Processing**

In [4]:
column_names = ["target", "ids", "date", "flag", "user", "text"]
data = pd.read_csv("/content/drive/MyDrive/EE 375 475 ML Project/trainingTwitter.csv", 
                    encoding="ISO-8859-1", names=column_names)

# reassign labels from {0, 4} to {-1, 1}: 0 (negative) -> -1, any other value -> 1
data["target"] = np.where(data["target"]==0, -1, 1) 
data.head()

| | target | ids | date | flag | user | text |
|---|---|---|---|---|---|---|
| 0 | -1 | 1467810369 | Mon Apr 06 22:19:45 PDT 2009 | NO_QUERY | _TheSpecialOne_ | @switchfoot http://twitpic.com/2y1zl - Awww, t... |
| 1 | -1 | 1467810672 | Mon Apr 06 22:19:49 PDT 2009 | NO_QUERY | scotthamilton | is upset that he can't update his Facebook by ... |
| 2 | -1 | 1467810917 | Mon Apr 06 22:19:53 PDT 2009 | NO_QUERY | mattycus | @Kenichan I dived many times for the ball. Man... |
| 3 | -1 | 1467811184 | Mon Apr 06 22:19:57 PDT 2009 | NO_QUERY | ElleCTF | my whole body feels itchy and like its on fire |
| 4 | -1 | 1467811193 | Mon Apr 06 22:19:57 PDT 2009 | NO_QUERY | Karoli | @nationwideclass no, it's not behaving at all.... |
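
One caveat of the remapping above: `np.where(data["target"]==0, -1, 1)` sends every non-zero label to 1. The sketch below, on a small made-up frame (the values and the names `toy`, `remapped_where`, and `remapped_map` are illustrative, not taken from the dataset), contrasts that with an explicit dictionary mapping, which leaves an unexpected label as missing instead of silently turning it into a positive.

```python
import numpy as np
import pandas as pd

# Hypothetical toy labels; the value 2 stands in for any label other than 0 or 4.
toy = pd.DataFrame({"target": [0, 4, 4, 2, 0]})

# Same expression as in the cell above: anything that is not 0 becomes 1.
remapped_where = np.where(toy["target"] == 0, -1, 1)

# Explicit mapping: labels that are missing from the dictionary become NaN.
remapped_map = toy["target"].map({0: -1, 4: 1})

print(remapped_where.tolist())    # the 2 is now indistinguishable from a positive
print(remapped_map.isna().sum())  # number of labels the mapping did not recognize
```

Checking `isna().sum()` after such a mapping is a cheap way to confirm that a file really contains only the two expected labels.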


---
#### **A Simple Logistic Regression**

In [5]:
# create corpus
corpus = data["text"]
tfidf_vec = TfidfVectorizer(stop_words='english')

# encode via TFIDF
X = tfidf_vec.fit_transform(corpus)
y = data["target"]

# split data into training (80%) and test (20%) sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
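
To make the encoding step concrete, here is a minimal sketch of what `TfidfVectorizer(stop_words='english')` produces on a made-up three-sentence corpus. The corpus and the names `toy_corpus`, `toy_vec`, `toy_X`, and `toy_terms` are illustrative assumptions, not part of the project data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up mini corpus, for illustration only.
toy_corpus = ["I love this phone", "I hate this phone", "love love love"]

toy_vec = TfidfVectorizer(stop_words="english")
toy_X = toy_vec.fit_transform(toy_corpus)  # sparse matrix, one row per text

# Terms that survive tokenization and stop-word removal, in column order.
toy_terms = sorted(toy_vec.vocabulary_, key=toy_vec.vocabulary_.get)

print(toy_terms)
print(toy_X.shape)
print(toy_X.toarray().round(2))  # each row is L2-normalized by default
```

Each tweet becomes one sparse row whose columns are vocabulary terms, which is why the full matrix stays manageable even for a very large corpus.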

In [10]:
# Logistic Regression
logistic_model = LogisticRegression(solver='saga')
logistic_model.fit(X_train, y_train)

LogisticRegression(solver='saga')

In [11]:
# Cross Validation: 3-fold on the training set, scored by F1
cv = cross_validate(logistic_model, X_train, y_train, scoring="f1", cv=3)
score = cv['test_score'].mean()
print("Logistic Regression CV Results: \n", 
      "average fit time:", cv['fit_time'].mean(), 
      "average test score:", score)

Logistic Regression CV Results: 
 average fit time: 30.296721537907917 average test score: 0.7856441608766036
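
The cell above cross-validates on a matrix whose TF-IDF vocabulary and document frequencies were computed from the full corpus before splitting. One possible variation, sketched here on a made-up mini corpus (the texts, labels, and `toy_*` names are illustrative), wraps the vectorizer and the classifier in a pipeline so that the vectorizer is refit on the training folds of each split. This is an alternative setup under those assumptions, not the one used to produce the results in this notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline

# Made-up texts: six positive (1) and six negative (-1) examples.
toy_texts = [
    "love this song", "great day today", "so happy right now",
    "best coffee ever", "what a lovely morning", "really enjoyed the movie",
    "hate this weather", "worst day ever", "so sad right now",
    "terrible coffee today", "what an awful morning", "really disliked the movie",
]
toy_labels = [1] * 6 + [-1] * 6

# The pipeline takes raw text, so TF-IDF statistics come from training folds only.
toy_pipeline = make_pipeline(
    TfidfVectorizer(stop_words="english"),
    LogisticRegression(solver="saga", max_iter=1000),
)

toy_cv = cross_validate(toy_pipeline, toy_texts, toy_labels, scoring="f1", cv=3)
print(toy_cv["test_score"])  # one F1 score per fold
```

On a corpus this small the scores themselves mean little; the point is only the structure, in which raw text goes into `cross_validate` rather than a pre-computed matrix.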


In [12]:
# Quality Metrics
accuracy = logistic_model.score(X_test, y_test)
print(f"Model accuracy: {accuracy}")

Model accuracy: 0.783159375


---
#### **Hyperparameter Tuning**
To improve our model, we can perform hyperparameter tuning with sklearn's `GridSearchCV` class, which cross-validates every combination in a hyperparameter grid and keeps a record of the best configuration. The cell below uses `HalvingGridSearchCV`, an experimental variant that evaluates all candidates on a small budget of samples and re-evaluates only the best-scoring ones on progressively larger budgets. We opt to use the 'f1' scoring metric as opposed to 'accuracy'. The F1 score is the harmonic mean of precision and recall, so it directly penalizes both false positives and false negatives for the positive class, whereas accuracy can stay high even when one kind of error dominates.

Reference: [GridSearchCV guide](https://towardsdatascience.com/tuning-the-hyperparameters-of-your-machine-learning-model-using-gridsearchcv-7fc2bb76ff27)
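
As a small worked example of the metric, the sketch below computes precision $P$, recall $R$, and $F_1 = 2PR/(P+R)$ by hand on made-up labels and compares the result with accuracy. The helper `precision_recall_f1` and the toy labels are illustrative and not part of the project code.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall, and F1 for the given positive label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Made-up labels: eight negatives (-1) and two positives (1).
y_true_toy = [-1, -1, -1, -1, -1, -1, -1, -1, 1, 1]
# A classifier that finds only one of the two positives and raises no false alarms.
y_pred_toy = [-1, -1, -1, -1, -1, -1, -1, -1, 1, -1]

accuracy_toy = sum(t == p for t, p in zip(y_true_toy, y_pred_toy)) / len(y_true_toy)
precision_toy, recall_toy, f1_toy = precision_recall_f1(y_true_toy, y_pred_toy)

print(f"accuracy={accuracy_toy:.3f}")  # accuracy=0.900
print(f"precision={precision_toy:.3f} recall={recall_toy:.3f} f1={f1_toy:.3f}")
# precision=1.000 recall=0.500 f1=0.667
```

Missing half of the positives barely moves accuracy here, but it pulls F1 down to about 0.67, which is the kind of difference the choice of scoring metric is meant to expose.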

In [6]:
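import warnings
# silence all warnings (e.g. convergence warnings) raised during the search
warnings.filterwarnings("ignore")

model = LogisticRegression()

parameters = {
  'max_iter': [1000],
  'penalty' : ['l1','l2'], 
  'C'       : [0.1, 1, 10, 100],
  'solver'  : ['liblinear', 'saga']
}

clf = HalvingGridSearchCV(model, param_grid=parameters, scoring='f1', cv=3)
# tune on the first 10,000 training samples only, to keep the search tractable
clf.fit(X_train[:10000], y_train[:10000])
# best_score_ is the mean cross-validated f1 of the best candidate, not an accuracy
print("best CV f1 :", clf.best_score_)
print("best params :", clf.best_params_)

best CV f1 : 0.7249534214984608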
best params : {'C': 1, 'max_iter': 1000, 'penalty': 'l2', 'solver': 'saga'}


In [7]:
# Quality Metrics
# clf was built with scoring='f1', so clf.score returns the F1 score, not accuracy
f1 = clf.score(X_test, y_test)
print(f"Model f1 score: {f1}")

Model f1 score: 0.7345212537569772
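
A search object built with `scoring='f1'` reports F1 from its `score` method, so the number above is not directly comparable with the accuracy of the earlier model. The sketch below illustrates this on synthetic data, using the plain `GridSearchCV` instead of the halving variant to keep it short. The data, the small grid, and the names `X_demo`, `search`, `test_f1`, and `test_acc` are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic, mildly imbalanced binary problem (illustrative only).
X_demo, y_demo = make_classification(
    n_samples=400, n_features=10, weights=[0.7, 0.3], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)

search = GridSearchCV(
    LogisticRegression(max_iter=1000), {"C": [0.1, 1, 10]}, scoring="f1", cv=3
)
search.fit(X_tr, y_tr)

pred = search.predict(X_te)
search_score = search.score(X_te, y_te)  # uses the 'f1' scorer
test_f1 = f1_score(y_te, pred)           # same quantity, computed explicitly
test_acc = accuracy_score(y_te, pred)    # accuracy has to be requested separately

print(search_score, test_f1, test_acc)
```

To compare a tuned model with the untuned one on equal terms, both metrics can be computed from the same predictions in this way.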


---
#### **Batching**
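To perform batching with sklearn's `LogisticRegression` class, we set the `warm_start` parameter to `True`. According to the documentation, `warm_start` reuses the solution of the previous call to `fit` as the initialization for the next one, so each batch starts from the coefficients learned on the previous batch rather than from scratch. Note that each call to `fit` still optimizes on the current batch only.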

In [15]:
# Logistic Regression
batch_model = LogisticRegression(solver='saga', warm_start=True)

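# create mini-batches: 10 contiguous, equally sized row slices of the training set
# (rows left over when the size is not divisible by num_batches would be dropped)
batches = []
num_batches = 10
batch_size = y_train.size // num_batches
for batch_number in range(num_batches):
  start = batch_number * batch_size
  end = (batch_number + 1) * batch_size
  batches.append((X_train[start:end], y_train.iloc[start:end]))

# train: with warm_start=True, each fit starts from the previous batch's coefficients
for mini_x, mini_y in batches:
  batch_model.fit(mini_x, mini_y)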

# Quality Metrics
accuracy = batch_model.score(X_test, y_test)
print(f"Model accuracy: {accuracy}")

Model accuracy: 0.7653125


---
##### **Obstacles**
Training our models on such a large dataset proved to be fairly difficult and time-consuming. Trial and error quickly became an expensive way to test parameters, as seen with the grid search during hyperparameter tuning. Some parameter configurations took hours to run and never converged, leaving us back at square one. To combat this issue, we reduced the number of samples we trained on when hyperparameter tuning (the first 10,000 training samples instead of the full training set), although this ultimately diminished the overall performance of the tuned model.
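
The tuning cell takes the first 10,000 rows of the shuffled training split. One alternative for drawing a smaller tuning set, sketched below on synthetic data, is a stratified subsample through `train_test_split`, which keeps the class proportions of the full set. The arrays and the names `X_full`, `y_full`, `X_tune`, and `y_tune` are made up for illustration; the same call also accepts sparse matrices.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up stand-in for the training split: 1,000 rows, labels in {-1, 1}.
rng = np.random.default_rng(0)
X_full = rng.normal(size=(1000, 5))
y_full = np.where(rng.random(1000) < 0.3, 1, -1)

# Keep 100 rows for tuning; stratify=y_full preserves the class proportions.
X_tune, _, y_tune, _ = train_test_split(
    X_full, y_full, train_size=100, stratify=y_full, random_state=0
)

print(X_tune.shape)
print((y_tune == 1).mean(), (y_full == 1).mean())  # positive share: subsample vs. full set
```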

\\

##### **Conclusion**
Using a plain logistic regression with default hyperparameters, we were able to achieve a reasonably high test accuracy of 78.3%. We attempted to improve on this via hyperparameter tuning, which yielded a test score of 73.4% (likely because of the obstacles mentioned above). Note that this figure is an F1 score rather than an accuracy, because the search was configured with `scoring='f1'`, so it is not directly comparable to the other two numbers. Additionally, we tried a batching approach, which reached a test accuracy of 76.5%.

One lingering question we have is how our model might perform on a completely separate dataset. We held out 20% of the data as a test set and cross-validated on the training portion, so we believe our reported scores are not inflated by overfitting. However, it would still be valuable to verify this on a totally new set of tweets, as all of the tweets we looked at are from 2009. Since then, Twitter has increased its maximum tweet length, tightened its moderation standards, and seen significant shifts in its user base.

Overall, this project has shown that even with basic machine learning classification models, we can train computers to assess human sentiment in text rather accurately. This suggests that with enough data and training time, as well as improvements in learning algorithms, creating artificial intelligence capable of interpreting human emotions is an achievable goal.
