In [1]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

The Sentiment140 dataset contains 1.6 million tweets. Each entry has the polarity of the tweet, ranging from negative to positive sentiment (target), a tweet ID (ids), the date the tweet was posted (date), the query associated with the tweet, if available (flag), the user who posted the tweet (user), and the tweet content as text (text).

In this project we train models to predict the sentiment of a tweet from its content, using this dataset for training. Most columns in the dataset are not needed for training, so we drop every column except the polarity (target) and the tweet content (text).

In [2]:
columns = ["target", "ids", "date", "flag", "user", "text"]
df = pd.read_csv("/Users/savitha/Desktop/New Yorkie University/2023-2024/Machine Learning/Homework/sentiment140.csv", header=None, names=columns, encoding="ISO-8859-1")

df = df.drop(["ids", "date", "flag", "user"], axis=1)

In this project, we train our models to predict sentiment as a binary label. The dataset encodes sentiment with values from 0 to 4, so we remap those values so that every label is either 0 or 1.

All tweets with a value of 0 or 1 are negative, so they are mapped to 0.

Values of 2 mark neutral tweets. These do not help with training since we are keeping the task binary, so the neutral tweets are dropped from the dataset.

Finally, values of 3 and 4 indicate positive sentiment, so they will be mapped to a value of 1.
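The mapping rule described above can be sketched in plain Python on a handful of made-up labels. The list `raw_labels` and the dictionary name `POLARITY_TO_BINARY` are illustrative assumptions, not part of the notebook.

```python
# Hypothetical toy labels on the original 0-4 polarity scale.
raw_labels = [0, 1, 2, 3, 4, 4, 0]

# Rule from the text: 0 and 1 -> 0 (negative), 3 and 4 -> 1 (positive).
# Neutral (2) has no entry, so it is filtered out below.
POLARITY_TO_BINARY = {0: 0, 1: 0, 3: 1, 4: 1}

binary_labels = [
    POLARITY_TO_BINARY[value]
    for value in raw_labels
    if value in POLARITY_TO_BINARY
]

print(binary_labels)  # [0, 0, 1, 1, 1, 0]
```

A lookup table keeps the rule in one place, which makes it easy to see that the neutral value is the only one left unmapped.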

In [3]:
# Map polarity to binary labels (0 = negative, 1 = positive); neutral (2) becomes None
df["target"] = df["target"].apply(lambda x: 0 if 0 <= x <= 1 else 1 if 3 <= x <= 4 else None)

# Drop rows with None values in the 'target' column (neutral tweets)
# and reset the index so later row-wise joins line up
df = df.dropna(subset=["target"]).reset_index(drop=True)

Since the input is raw tweet text, it must be converted into a format the models can train on. We use TF-IDF feature extraction for this, which turns raw text into a numerical representation: each tweet becomes a vector of weights, one per vocabulary term.

Stop words are excluded from the tweet content fed to the model. Stop words are common words that contribute little to the meaning of a sentence, such as "the", "and", or "is".

The extracted content is then used for our modelling.
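To make the TF-IDF step more concrete, here is a minimal hand-rolled sketch on a toy corpus. The corpus, the three-word stop list, and the helper names `tokenize` and `tfidf_vector` are assumptions for illustration. The weighting uses a smoothed IDF, ln((1 + n) / (1 + df)) + 1, followed by L2 normalization, which is intended to mirror the default settings of `TfidfVectorizer`; the whitespace tokenizer is a simplification of what scikit-learn actually does.

```python
import math
from collections import Counter

# Hypothetical toy corpus; the real stop-word list is far longer.
docs = ["the movie is great", "the movie is bad", "great great day"]
STOP_WORDS = {"the", "and", "is"}

def tokenize(text):
    """Lowercase, split on whitespace, and drop stop words."""
    return [word for word in text.lower().split() if word not in STOP_WORDS]

tokens = [tokenize(doc) for doc in docs]
vocab = sorted({word for doc in tokens for word in doc})
n_docs = len(docs)

# Document frequency: in how many documents each term appears.
doc_freq = Counter(word for doc in tokens for word in set(doc))

# Smoothed inverse document frequency: rarer terms get larger weights.
idf = {word: math.log((1 + n_docs) / (1 + doc_freq[word])) + 1 for word in vocab}

def tfidf_vector(doc_tokens):
    """Term count times IDF for each vocabulary term, scaled to unit length."""
    counts = Counter(doc_tokens)
    raw = [counts[word] * idf[word] for word in vocab]
    norm = math.sqrt(sum(x * x for x in raw)) or 1.0
    return [x / norm for x in raw]

matrix = [tfidf_vector(doc) for doc in tokens]

print(vocab)
for row in matrix:
    print([round(x, 3) for x in row])
```

In this toy corpus "bad" and "day" each appear in one document while "movie" and "great" appear in two, so the rarer terms receive the higher IDF weight.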

In [4]:
# Feature extraction using TF-IDF
tfidf_vectorizer = TfidfVectorizer(max_features=1000, stop_words='english')
tfidf_matrix = tfidf_vectorizer.fit_transform(df['text'])

# Build the feature frame on df's index so the concat below aligns row for row
df_tfidf = pd.DataFrame(tfidf_matrix.toarray(), columns=tfidf_vectorizer.get_feature_names_out(), index=df.index)

df = pd.concat([df, df_tfidf], axis=1)
df = df.drop("text", axis=1)

At this point we have everything we need for training: the TF-IDF features extracted from the tweets and the corresponding binary sentiment labels. To separate the two, we assign them to X and y, where X holds the TF-IDF features that represent the tweet content and y holds the binary sentiment values.

In [5]:
# Split the data into features (X) and target (y)
X = df.drop("target", axis=1)
y = df["target"]

In [6]:
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=999)
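As a rough illustration of what an 80/20 split with a fixed seed does, the sketch below shuffles row indices with a seeded generator and holds out 20% of them. The function `simple_train_test_split` is a hypothetical stand-in written for this example; it will not reproduce the exact rows that `train_test_split` selects, only the same proportions and the same repeatability.

```python
import math
import random

def simple_train_test_split(rows, test_size=0.2, seed=999):
    """Shuffle indices with a fixed seed and hold out a `test_size` share."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    n_test = math.ceil(len(rows) * test_size)
    test = [rows[i] for i in indices[:n_test]]
    train = [rows[i] for i in indices[n_test:]]
    return train, test

# Hypothetical stand-in for the 1.6 million rows.
rows = list(range(10))
train_rows, test_rows = simple_train_test_split(rows)

print(len(train_rows), len(test_rows))  # 8 2
```

Fixing the seed means the same rows end up in the test set on every run, so metrics from different models are computed on identical data.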

In [8]:
# Logistic Regression Model
log_reg = LogisticRegression(random_state=999)
log_reg.fit(X_train, y_train)
y_pred_log_reg = log_reg.predict(X_test)

accuracy = accuracy_score(y_test, y_pred_log_reg)
precision = precision_score(y_test, y_pred_log_reg)
recall = recall_score(y_test, y_pred_log_reg)
f1 = f1_score(y_test, y_pred_log_reg)

print("Logistic Regression Model\n")
print(f"Accuracy: {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"F1 Score: {f1:.4f}")

Logistic Regression Model

Accuracy: 0.7405
Precision: 0.7248
Recall: 0.7780
F1 Score: 0.7505


In [9]:
# MLP Classifier
mlp_clf = MLPClassifier(random_state=999)
mlp_clf.fit(X_train, y_train)
y_pred_mlp = mlp_clf.predict(X_test)

accuracy = accuracy_score(y_test, y_pred_mlp)
precision = precision_score(y_test, y_pred_mlp)
recall = recall_score(y_test, y_pred_mlp)
f1 = f1_score(y_test, y_pred_mlp)

print("MLP Classifier Model\n")
print(f"Accuracy: {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"F1 Score: {f1:.4f}")



MLP Classifier Model

Accuracy: 0.7449
Precision: 0.7320
Recall: 0.7754
F1 Score: 0.7530


Final Observations:

From the results of the two models, we can see that both performed at about the same level. Both reached an accuracy of around 74% (0.7405 for logistic regression and 0.7449 for the MLP), which is a good result, and the differences in precision, recall, and F1 score were all below 0.01 and therefore negligible.
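As a quick consistency check on the numbers printed above, the F1 score can be recomputed from precision and recall, since F1 is their harmonic mean. The helper `f1_from` and the dictionary `reported` are names introduced for this sketch; the values are copied from the two output cells.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Metrics copied from the printed results of the two models.
reported = {
    "Logistic Regression": {"accuracy": 0.7405, "precision": 0.7248, "recall": 0.7780, "f1": 0.7505},
    "MLP Classifier": {"accuracy": 0.7449, "precision": 0.7320, "recall": 0.7754, "f1": 0.7530},
}

for name, metrics in reported.items():
    recomputed = f1_from(metrics["precision"], metrics["recall"])
    print(f"{name}: reported F1 {metrics['f1']:.4f}, recomputed {recomputed:.4f}")

accuracy_gap = reported["MLP Classifier"]["accuracy"] - reported["Logistic Regression"]["accuracy"]
print(f"Accuracy gap (MLP - logistic regression): {accuracy_gap:.4f}")
```

Because the inputs are already rounded to four decimals, the recomputed values can differ from the reported ones in the last digit.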

The logistic regression model is a simple model and was mainly used as a baseline for comparison with the other model. Even so, it performed relatively well. The MLP (Multi-Layer Perceptron) model was used because it is a more complex neural network model and should be better at recognizing complex patterns in data, such as the interactions between different words in the tweets of our dataset.

However, it is important to note that while training my models, I chose to stop the training of the MLP model after 7 hours because I felt it was taking abnormally long. The metrics above were calculated after this long training period. Although the MLP had very slightly better metrics, I do not think the long training time was a good trade-off. Of course, this may have been due to the size of the dataset, which is quite large at 1.6 million entries.
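One option that might keep MLP training time bounded, which was not tried in this notebook, is to cap the number of epochs and enable early stopping. The sketch below uses a small synthetic dataset from `make_classification` rather than the tweet features, and the specific values (`hidden_layer_sizes=(32,)`, `max_iter=20`, `n_iter_no_change=3`) are arbitrary assumptions chosen for illustration, not tuned settings.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in data; the real features would be the TF-IDF columns.
X_demo, y_demo = make_classification(n_samples=300, n_features=20, random_state=999)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=999)

bounded_mlp = MLPClassifier(
    hidden_layer_sizes=(32,),  # one small hidden layer (assumed size)
    max_iter=20,               # hard cap on the number of epochs
    early_stopping=True,       # hold out part of the training data for validation
    n_iter_no_change=3,        # stop after 3 epochs without validation improvement
    random_state=999,
)
bounded_mlp.fit(X_tr, y_tr)
y_pred_demo = bounded_mlp.predict(X_te)

print(f"Epochs run: {bounded_mlp.n_iter_}")
```

With a low `max_iter`, scikit-learn may emit a convergence warning; that is expected here, since the cap is the point of the exercise. Whether such a cap would preserve the accuracy reported above on the full dataset is untested.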

Given the final results, the logistic regression model should be preferred, as its metrics are on par with those of the MLP model for a far shorter training time.