Note: the datasets with citations mostly come from this list:
https://blog.cambridgespark.com/50-free-machine-learning-datasets-sentiment-analysis-b9388f79c124
Remember to cite any dataset that gets used.

MORE NOTES TO SELF:
- Find average comment length, sentiment, score, etc.
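
A minimal sketch of how those summary statistics could be computed, assuming a DataFrame with the same "body" and "sentiment" columns built further down. The `demo` frame and the `n_chars` / `n_words` column names are made up for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the real dataset (same column names as below)
demo = pd.DataFrame({
    "body": ["Great film", "Bad", "Loved every minute of it"],
    "sentiment": [1, -1, 1],
})

# Length of each comment in characters and in whitespace-separated words
demo["n_chars"] = demo["body"].str.len()
demo["n_words"] = demo["body"].str.split().str.len()

# Average length per sentiment label
summary = demo.groupby("sentiment")[["n_chars", "n_words"]].mean()
print(summary)
```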

In [1]:
import numpy as np
import pandas as pd
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics

First model: IMDB  
From:  
Maas, A., Daly, R., Pham, P., Huang, D., Ng, A. and Potts, C. (2011). Learning Word Vectors for Sentiment Analysis: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies. [online] Portland, Oregon, USA: Association for Computational Linguistics, pp.142–150. Available at: http://www.aclweb.org/anthology/P11-1015.

This dataset comes as a neg folder and a pos folder, each holding IMDB reviews as individual .txt files. So, I'll need to load them into a more useful format.

In [25]:
import os

def folders_to_df(pos_path_as_str, neg_path_as_str):
    """Read every review file in the pos/neg folders into a labelled DataFrame."""
    lst = []

    # Positive reviews are labelled 1, negative reviews -1
    for folder, label in [(pos_path_as_str, 1), (neg_path_as_str, -1)]:
        for filename in os.listdir(folder):
            # os.path.join picks the right separator for the current OS
            filepath = os.path.join(folder, filename)
            with open(filepath, "r", encoding="utf8") as f:
                lst.append([f.read(), label])

    return pd.DataFrame(lst, columns=["body", "sentiment"])

In [26]:
train_pos_path = r"C:\Users\Spencer\Desktop\sentiment_datasets\aclImdb\train\pos"
train_neg_path = r"C:\Users\Spencer\Desktop\sentiment_datasets\aclImdb\train\neg"

In [27]:
training_set = folders_to_df(train_pos_path, train_neg_path)

In [28]:
training_set.head()

Unnamed: 0,body,sentiment
0,Bromwell High is a cartoon comedy. It ran at t...,1
1,Homelessness (or Houselessness as George Carli...,1
2,Brilliant over-acting by Lesley Ann Warren. Be...,1
3,This is easily the most underrated film inn th...,1
4,This is not the typical Mel Brooks film. It wa...,1


The dataset is already split into train/test, but I'm combining the two so I can choose the split ratio myself.

In [29]:
test_pos_path = r"C:\Users\Spencer\Desktop\sentiment_datasets\aclImdb\test\pos"
test_neg_path = r"C:\Users\Spencer\Desktop\sentiment_datasets\aclImdb\test\neg"

testing_set = folders_to_df(test_pos_path, test_neg_path)

In [30]:
testing_set.head()

Unnamed: 0,body,sentiment
0,I went and saw this movie last night after bei...,1
1,Actor turned director Bill Paxton follows up h...,1
2,As a recreational golfer with some knowledge o...,1
3,"I saw this film in a sneak preview, and it is ...",1
4,Bill Paxton has taken the true story of the 19...,1


In [31]:
# DataFrame.append was removed in pandas 2.0; pd.concat gives the same result
imdb_dataset = pd.concat([training_set, testing_set], ignore_index=True)

In [32]:
imdb_dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 2 columns):
body         50000 non-null object
sentiment    50000 non-null int64
dtypes: int64(1), object(1)
memory usage: 781.3+ KB


In [2]:
import string
from nltk.corpus import stopwords

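# Build the stop-word set once; calling stopwords.words() for every word is very slow
STOP_WORDS = set(stopwords.words("english"))

def text_processor(text):
    # Strip punctuation, then lowercase and drop English stop words
    punc_removed = ''.join([char for char in text if char not in string.punctuation])
    return [word.lower() for word in punc_removed.split() if word.lower() not in STOP_WORDS]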

In [3]:
from sklearn.feature_extraction.text import CountVectorizer

In [35]:
count_vec = CountVectorizer(analyzer = text_processor)

In [36]:
count_vec = count_vec.fit_transform(imdb_dataset['body'])

In [4]:
from sklearn.model_selection import train_test_split

In [38]:
X_train, X_test, y_train, y_test = train_test_split(count_vec, imdb_dataset['sentiment'], test_size = 0.3, random_state = 1)
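
The cells above fit the vectorizer on all 50,000 reviews before splitting, so the test reviews also contribute to the vocabulary. A sketch of the split-first ordering, using a made-up toy corpus and hypothetical `*_demo` names rather than the real data:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

# Hypothetical toy corpus standing in for imdb_dataset["body"] / ["sentiment"]
texts = [
    "great film", "loved the acting", "wonderful story", "great fun",
    "terrible film", "hated the acting", "awful story", "boring mess",
]
labels = [1, 1, 1, 1, -1, -1, -1, -1]

# Split the raw text first...
text_train, text_test, y_train_demo, y_test_demo = train_test_split(
    texts, labels, test_size=0.25, random_state=1
)

# ...then learn the vocabulary from the training text only
vectorizer = CountVectorizer()
X_train_demo = vectorizer.fit_transform(text_train)
# Test text is only transformed; words unseen in training are ignored
X_test_demo = vectorizer.transform(text_test)
```

Keeping the fitted `vectorizer` in its own variable also means it is still available later for transforming new text.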

In [5]:
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics

In [40]:
imdb_model = MultinomialNB().fit(X_train, y_train)

In [41]:
test_predictions = imdb_model.predict(X_test)

In [6]:
from sklearn.metrics import classification_report

In [43]:
print(classification_report(y_test, test_predictions))

              precision    recall  f1-score   support

          -1       0.84      0.88      0.86      7407
           1       0.88      0.84      0.86      7593

   micro avg       0.86      0.86      0.86     15000
   macro avg       0.86      0.86      0.86     15000
weighted avg       0.86      0.86      0.86     15000



In [7]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [45]:
tfidf_vec = TfidfVectorizer(analyzer = text_processor)

In [47]:
tfidf_vec = tfidf_vec.fit_transform(imdb_dataset['body'])

In [68]:
X_train_2, X_test_2, y_train_2, y_test_2 = train_test_split(tfidf_vec, imdb_dataset['sentiment'], test_size = 0.4, random_state = 57)

In [69]:
imdb_model_v2 = MultinomialNB().fit(X_train_2, y_train_2)

In [70]:
test_predictions_2 = imdb_model_v2.predict(X_test_2)

In [71]:
print(classification_report(y_test_2, test_predictions_2))

              precision    recall  f1-score   support

          -1       0.86      0.88      0.87     10098
           1       0.87      0.86      0.87      9902

   micro avg       0.87      0.87      0.87     20000
   macro avg       0.87      0.87      0.87     20000
weighted avg       0.87      0.87      0.87     20000



A little better, but the scores shift by about 0.01 when I change the random_state (and this run also used a different test_size than the first model), so it isn't a meaningful improvement.
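
One way to see how much of that difference is split-to-split noise is to score the model over several folds instead of one split. A rough sketch with scikit-learn's cross_val_score on a made-up toy corpus; the fold count, the scoring metric and the use of the default tokenizer instead of text_processor are all assumptions for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus standing in for imdb_dataset["body"] / ["sentiment"]
texts = [
    "great film", "loved the acting", "wonderful story", "great fun",
    "brilliant cast", "loved it",
    "terrible film", "hated the acting", "awful story", "boring mess",
    "terrible cast", "hated it",
]
labels = [1] * 6 + [-1] * 6

# The pipeline refits the vectorizer inside every fold
pipeline = make_pipeline(TfidfVectorizer(), MultinomialNB())
folds = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)

# One macro-averaged F1 score per fold
scores = cross_val_score(pipeline, texts, labels, cv=folds, scoring="f1_macro")
print(scores.mean(), scores.std())
```

If the gap between the count-based and TF-IDF models is smaller than the spread across folds, the two are effectively tied.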

Now I'll try Sentiment140  

Go, A., Bhayani, R. and Huang, L., 2009. Twitter sentiment classification using distant supervision. CS224N Project Report, Stanford, 1(12).

In [8]:
s140test = pd.read_csv(r"C:\Users\Spencer\Desktop\sentiment_datasets\trainingandtestdata\testdata.manual.2009.06.14.csv", encoding = "ISO-8859-1", names = ["polarity", "id", "date", "topic", "username", "body"])
s140train = pd.read_csv(r"C:\Users\Spencer\Desktop\sentiment_datasets\trainingandtestdata\training.1600000.processed.noemoticon.csv", encoding = "ISO-8859-1", names = ["polarity", "id", "date", "topic", "username", "body"])

In [9]:
s140test["polarity"].value_counts()

4    182
0    177
2    139
Name: polarity, dtype: int64

Based on these values and their distribution (and the fact that the original study uses a positive/negative/neutral scheme), I take 0 to be negative, 2 neutral, and 4 positive. The tweets themselves seem to bear that out:

In [10]:
# 4 looks positive
s140test[s140test["polarity"]==4]["body"].tolist()[0]

'@stellargirl I loooooooovvvvvveee my Kindle2. Not that the DX is cool, but the 2 is fantastic in its own right.'

In [11]:
# 2 looks neutral
s140test[s140test["polarity"]==2]["body"].tolist()[0]

"Check this video out -- President Obama at the White House Correspondents' Dinner http://bit.ly/IMXUM"

In [12]:
# 0 looks negative
s140test[s140test["polarity"]==0]["body"].tolist()[0]

'Fuck this economy. I hate aig and their non loan given asses.'

So, for consistency, I'll remap them to a -1/0/1 scheme that matches the -1/1 labels used for IMDB. I'll also combine test and train so I can split them differently.

In [13]:
# DataFrame.append was removed in pandas 2.0; pd.concat gives the same result
s140 = pd.concat([s140test, s140train], ignore_index=True)

In [14]:
s140["polarity"] = s140["polarity"].apply(lambda x: int((x / 2) - 1)) # 4->1, 2->0, 0->-1

In [15]:
s140["polarity"].value_counts()

 1    800182
-1    800177
 0       139
Name: polarity, dtype: int64

In [16]:
s140_cv = CountVectorizer(analyzer=text_processor).fit_transform(s140["body"])

KeyboardInterrupt: 
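
Running a pure-Python analyzer over 1.6 million tweets is slow. A sketch of a lighter-weight alternative, assuming CountVectorizer's built-in tokenizer and English stop-word list are acceptable substitutes for text_processor. They are not identical: the built-in list differs from NLTK's, and the default tokenizer drops one-character tokens. The `tweets` list and `fast_cv` name are made up for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical sample standing in for s140["body"]
tweets = [
    "I love the Kindle, it is fantastic",
    "I hate this economy",
]

# Lowercasing, tokenizing and stop-word removal all happen inside scikit-learn
fast_cv = CountVectorizer(lowercase=True, stop_words="english")
tweet_counts = fast_cv.fit_transform(tweets)

print(sorted(fast_cv.vocabulary_))
```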

In [None]:
# The target is the polarity label, not the tweet text
X_train_3, X_test_3, y_train_3, y_test_3 = train_test_split(s140_cv, s140["polarity"], test_size = 0.3, random_state = 135)

In [149]:
# Restarting the computer soon, so I'll save the models trained so far; that way they won't need to be retrained after the notebook restarts

In [150]:
import pickle

In [151]:
imdb_cv_filename = "imdb_cv_model.sav"
imdb_tfidf_filename = "imdb_tfidf_model.sav"

In [152]:
# Use context managers so the files are flushed and closed
with open(imdb_cv_filename, "wb") as f:
    pickle.dump(imdb_model, f)
with open(imdb_tfidf_filename, "wb") as f:
    pickle.dump(imdb_model_v2, f)
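
The saved .sav files hold only the classifiers. Since count_vec and tfidf_vec were reassigned to the transformed matrices above, the fitted vectorizers are not kept, and a reloaded model would have no vocabulary to transform new text with. A sketch of one way around that, assuming it is acceptable to pickle the vectorizer and classifier together as a scikit-learn pipeline. The toy data, temporary path and file name are made up for illustration.

```python
import os
import pickle
import tempfile

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy data standing in for the IMDB reviews
texts = ["great film", "terrible film", "great acting", "terrible plot"]
labels = [1, -1, 1, -1]

# Vectorizer and classifier are trained and stored as one object
pipeline = make_pipeline(CountVectorizer(), MultinomialNB()).fit(texts, labels)

# Save to a throwaway location for the demo
path = os.path.join(tempfile.mkdtemp(), "demo_pipeline.sav")
with open(path, "wb") as f:
    pickle.dump(pipeline, f)

# After a restart: load the file and predict on raw text directly
with open(path, "rb") as f:
    restored = pickle.load(f)

new_texts = ["great plot", "terrible acting"]
restored_predictions = restored.predict(new_texts)
print(restored_predictions)
```

Pickle files should be loaded with the same scikit-learn version that wrote them, and only from trusted sources.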

In [None]:
s140_cv_model = MultinomialNB().fit(X_train_3, y_train_3)

In [None]:
test_predictions_3 = s140_cv_model.predict(X_test_3)

In [None]:
print(classification_report(y_test_3, test_predictions_3))