# Project

## Git LFS

Pull the training and test datasets from Git LFS:

In [22]:
# TRAINING DATA
!git lfs pull -I "training.1600000.processed.noemoticon.csv"

# TEST DATA
!git lfs pull -I "testdata.manual.2009.06.14.csv"
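
If a pull is skipped or fails, the CSV on disk is still a small Git LFS pointer file rather than the real data, and `pd.read_csv` will then load nonsense. The sketch below is one way to check for that. The helper name `is_lfs_pointer` is hypothetical; it relies only on the fact that LFS pointer files begin with a fixed `version` line.

```python
from pathlib import Path

# Every Git LFS pointer file starts with this line.
LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"


def is_lfs_pointer(path):
    """Return True if `path` still holds an LFS pointer instead of real data."""
    with open(path, "rb") as fh:
        return fh.read(len(LFS_POINTER_PREFIX)) == LFS_POINTER_PREFIX


# Report the state of the two dataset files used in this notebook.
for name in [
    "training.1600000.processed.noemoticon.csv",
    "testdata.manual.2009.06.14.csv",
]:
    path = Path(name)
    if not path.exists():
        print(name, "-> missing")
    elif is_lfs_pointer(path):
        print(name, "-> still an LFS pointer, run git lfs pull")
    else:
        print(name, "-> data present")
```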

## Sentiment140 Dataset

Sentiment140 allows you to discover the sentiment of a brand, product, or topic on Twitter.

The data is a CSV file with emoticons removed. Each row has six fields:

1. the polarity of the tweet (0 = negative, 2 = neutral, 4 = positive)
1. the id of the tweet (2087)
1. the date of the tweet (Sat May 16 23:58:44 UTC 2009)
1. the query (lyx). If there is no query, then this value is NO_QUERY.
1. the user that tweeted (robotickilldozr)
1. the text of the tweet (Lyx is cool)

<http://help.sentiment140.com/for-students>
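
As a rough illustration of the six-field layout, the sketch below parses one row with the standard `csv` module and maps the polarity code to a label. The row is assembled from the example values in the list above. Its polarity value (`4`) and the quoting style are assumptions made for illustration, and the column names are the ones this notebook passes to `pd.read_csv`.

```python
import csv
import io

# Column names as used in this notebook's read_csv calls.
FIELDS = ["target", "ids", "date", "flag", "user", "text"]

# Polarity codes from the field description above.
POLARITY = {0: "negative", 2: "neutral", 4: "positive"}

# Illustrative row; the polarity "4" and the quoting are assumptions.
sample = (
    '"4","2087","Sat May 16 23:58:44 UTC 2009",'
    '"lyx","robotickilldozr","Lyx is cool"\n'
)

# The file has no header line, so the field names are supplied explicitly.
row = next(csv.DictReader(io.StringIO(sample), fieldnames=FIELDS))
row["label"] = POLARITY[int(row["target"])]

print(row["user"], "|", row["text"], "|", row["label"])
```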

In [20]:
import pandas as pd

df = pd.read_csv(
    "training.1600000.processed.noemoticon.csv",
    encoding="ISO-8859-1",
    names=["target", "ids", "date", "flag", "user", "text"],
)

df.head()

| | target | ids | date | flag | user | text |
|---|---|---|---|---|---|---|
| 0 | 0 | 1467810369 | Mon Apr 06 22:19:45 PDT 2009 | NO_QUERY | _TheSpecialOne_ | @switchfoot http://twitpic.com/2y1zl - Awww, t... |
| 1 | 0 | 1467810672 | Mon Apr 06 22:19:49 PDT 2009 | NO_QUERY | scotthamilton | is upset that he can't update his Facebook by ... |
| 2 | 0 | 1467810917 | Mon Apr 06 22:19:53 PDT 2009 | NO_QUERY | mattycus | @Kenichan I dived many times for the ball. Man... |
| 3 | 0 | 1467811184 | Mon Apr 06 22:19:57 PDT 2009 | NO_QUERY | ElleCTF | my whole body feels itchy and like its on fire |
| 4 | 0 | 1467811193 | Mon Apr 06 22:19:57 PDT 2009 | NO_QUERY | Karoli | @nationwideclass no, it's not behaving at all.... |
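
The evaluation cell further down calls `vectorizer.transform` and `clf.predict`, so both objects have to be fitted on training text first. That step is not shown in this notebook. The following is a minimal sketch of what it could look like, with a made-up four-sentence toy corpus standing in for `df["text"]` and `df["target"]`. It is not necessarily the setup that produced the reported numbers.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy stand-in for df["text"] / df["target"]; the sentences are invented.
train_text = [
    "i love this phone",
    "what a great day",
    "i hate waiting",
    "this is awful",
]
train_target = [4, 4, 0, 0]  # 4 = positive, 0 = negative

# fit_transform learns the vocabulary and builds the count matrix in one step.
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_text)

# Multinomial naive Bayes works directly on the token counts.
clf = MultinomialNB()
clf.fit(X_train, train_target)

# New text goes through transform (not fit_transform) so that it is mapped
# onto the vocabulary learned from the training data.
X_new = vectorizer.transform(["great phone", "awful day waiting"])
predictions = clf.predict(X_new)
print(predictions)
```

Reusing the same fitted `vectorizer` for training and evaluation matters because the classifier's feature columns are defined by the training vocabulary.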


In [23]:
import nltk
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, confusion_matrix

nltk.download("stopwords")
stop_words = set(stopwords.words("english"))

# Load the test data
df_test = pd.read_csv(
    "testdata.manual.2009.06.14.csv",
    encoding="ISO-8859-1",
    names=["target", "ids", "date", "flag", "user", "text"],
)

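# Preprocess the text data: lowercase, strip punctuation, drop stop words.
# regex=True is required because recent pandas versions treat the pattern as
# a literal string by default.
df_test["text"] = (
    df_test["text"].str.lower().str.replace(r"[^\w\s]", "", regex=True).str.split()
)
df_test["text"] = df_test["text"].apply(
    lambda x: " ".join([word for word in x if word not in stop_words])
)

# Convert the text data into a matrix of token counts.
# NOTE: `vectorizer` and `clf` are assumed to have been fitted on the training
# data in an earlier cell. Creating a fresh CountVectorizer() here would raise
# NotFittedError on transform().
X_test = vectorizer.transform(df_test["text"])
y_test = df_test["target"]

# Make predictions on the test data
y_pred = clf.predict(X_test)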

# Evaluate the model
print("Accuracy: ", accuracy_score(y_test, y_pred))
print("Confusion Matrix: \n", confusion_matrix(y_test, y_pred))

[nltk_data] Downloading package stopwords to /home/user/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Accuracy:  0.772
Confusion Matrix: 
 [[127907  31587]
 [ 41373 119133]]
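
The accuracy figure can be re-derived from the confusion matrix, which also gives per-class numbers. In scikit-learn's `confusion_matrix`, rows are true labels and columns are predicted labels, in sorted label order. The sketch below assumes the two classes are 0 (negative) and 4 (positive), since the matrix is 2x2.

```python
# Confusion matrix copied from the output above.
# Rows: true label, columns: predicted label (assumed order: 0, then 4).
cm = [
    [127907, 31587],
    [41373, 119133],
]

total = sum(sum(row) for row in cm)

# Correct predictions sit on the diagonal.
accuracy = (cm[0][0] + cm[1][1]) / total

# Recall: share of each true class that was predicted correctly.
recall_negative = cm[0][0] / sum(cm[0])
recall_positive = cm[1][1] / sum(cm[1])

# Precision for the positive class: correct positives over predicted positives.
precision_positive = cm[1][1] / (cm[0][1] + cm[1][1])

print("samples:", total)
print("accuracy:", round(accuracy, 3))
print("recall (negative):", round(recall_negative, 3))
print("recall (positive):", round(recall_positive, 3))
print("precision (positive):", round(precision_positive, 3))
```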
