Note: datasets with citations generally come from here: 
https://blog.cambridgespark.com/50-free-machine-learning-datasets-sentiment-analysis-b9388f79c124
So, remember to cite if you use one

MORE NOTES TO SELF:
- Find average comment length, sentiment, score, etc.

In [82]:
import numpy as np
import pandas as pd
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics

First model: IMDB  
From:  
Maas, A., Daly, R., Pham, P., Huang, D., Ng, A. and Potts, C. (2011). Learning Word Vectors for Sentiment Analysis: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies. [online] Portland, Oregon, USA: Association for Computational Linguistics, pp.142–150. Available at: http://www.aclweb.org/anthology/P11-1015.

This dataset comes in the form of a neg folder and a pos folder, each with comments in .txt format from IMDB. So, I'll need to put them in a more useful format.

In [83]:
import os

def folders_to_df(pos_path_as_str, neg_path_as_str):
    
    lst = []
    
    pos_path = os.fsencode(pos_path_as_str)
    neg_path = os.fsencode(neg_path_as_str)
    
    for file in os.listdir(pos_path):
        
        filepath = pos_path_as_str + "\\" + os.fsdecode(file)
        with open(filepath, "r", encoding="utf8") as f:
            body = f.read()
            lst.append([body, 1])
            
    for file in os.listdir(neg_path):
        
        filepath = neg_path_as_str + "\\" + os.fsdecode(file)
        with open(filepath, "r", encoding="utf8") as f:
            body = f.read()
            lst.append([body, -1])
            
    return pd.DataFrame(lst, columns = ["body", "sentiment"])

In [84]:
train_pos_path = r"C:\Users\Spencer\Desktop\sentiment_datasets\aclImdb\train\pos"
train_neg_path = r"C:\Users\Spencer\Desktop\sentiment_datasets\aclImdb\train\neg"

In [85]:
training_set = folders_to_df(train_pos_path, train_neg_path)

In [86]:
training_set.head()

Unnamed: 0,body,sentiment
0,Bromwell High is a cartoon comedy. It ran at t...,1
1,Homelessness (or Houselessness as George Carli...,1
2,Brilliant over-acting by Lesley Ann Warren. Be...,1
3,This is easily the most underrated film inn th...,1
4,This is not the typical Mel Brooks film. It wa...,1


The dataset is already split into train/test, but I'm combining them so I can define their ratio

In [87]:
test_pos_path = r"C:\Users\Spencer\Desktop\sentiment_datasets\aclImdb\test\pos"
test_neg_path = r"C:\Users\Spencer\Desktop\sentiment_datasets\aclImdb\test\neg"

testing_set = folders_to_df(test_pos_path, test_neg_path)

In [88]:
testing_set.head()

Unnamed: 0,body,sentiment
0,I went and saw this movie last night after bei...,1
1,Actor turned director Bill Paxton follows up h...,1
2,As a recreational golfer with some knowledge o...,1
3,"I saw this film in a sneak preview, and it is ...",1
4,Bill Paxton has taken the true story of the 19...,1


In [89]:
imdb_dataset = training_set.append(testing_set, ignore_index=True)

In [90]:
imdb_dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 2 columns):
body         50000 non-null object
sentiment    50000 non-null int64
dtypes: int64(1), object(1)
memory usage: 781.3+ KB


import string
from nltk.corpus import stopwords

def text_processor(text):
    punc_removed = ''.join([char for char in text if char not in string.punctuation])
    return [word.lower() for word in punc_removed.split() if word.lower() not in stopwords.words("english")]

In [91]:
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
import string

nltk.download('punkt')
nltk.download('stopwords')
stop = set(stopwords.words('english'))

from nltk.stem.wordnet import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
nltk.download('averaged_perceptron_tagger')
from nltk.corpus import wordnet

def full_processor(text_string):
    
    # tokenize text
    tokenized_text = word_tokenize(text_string)
    
    # remove stopwords and punctuation
    no_stops = []
    for word in tokenized_text:
        if (word not in stop) and (word not in string.punctuation) :
            no_stops.append(word)
    
    # get parts of speech and lemmatize
    pos_ = {"N": wordnet.NOUN,
            "V": wordnet.VERB,
            "J": wordnet.ADJ,
            "R": wordnet.ADV}
    lemmatized = []
    tags = nltk.pos_tag(tokenized_text)
    for word in tags:
        tag = word[1][0]
        if tag not in pos_: # anything not in .lemmatize()'s extremely limited domain is considered a noun
            tag = 'N'
        tag = pos_[tag]
        lem = lemmatizer.lemmatize(word[0], tag)
        lemmatized.append(lem)
    
    return lemmatized

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Spencer\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Spencer\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\Spencer\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


In [92]:
from sklearn.feature_extraction.text import CountVectorizer

In [93]:
count_vec = CountVectorizer(analyzer=full_processor).fit_transform(imdb_dataset["body"])

In [94]:
from sklearn.model_selection import train_test_split

In [101]:
X_train, X_test, y_train, y_test = train_test_split(count_vec, imdb_dataset['sentiment'], test_size = 0.3, random_state = 137)

In [102]:
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics

In [103]:
imdb_model = MultinomialNB().fit(X_train, y_train)

In [104]:
test_predictions = imdb_model.predict(X_test)

In [105]:
from sklearn.metrics import classification_report

In [106]:
print(classification_report(y_test, test_predictions))

              precision    recall  f1-score   support

          -1       0.83      0.86      0.85      7621
           1       0.85      0.82      0.83      7379

   micro avg       0.84      0.84      0.84     15000
   macro avg       0.84      0.84      0.84     15000
weighted avg       0.84      0.84      0.84     15000



In [107]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [109]:
tfidf_vec = TfidfVectorizer(analyzer = full_processor).fit_transform(imdb_dataset["body"])

In [114]:
X_train_2, X_test_2, y_train_2, y_test_2 = train_test_split(tfidf_vec, imdb_dataset['sentiment'], test_size = 0.3, random_state = 57)

In [115]:
imdb_model_v2 = MultinomialNB().fit(X_train_2, y_train_2)

In [116]:
test_predictions_2 = imdb_model_v2.predict(X_test_2)

In [117]:
print(classification_report(y_test_2, test_predictions_2))

              precision    recall  f1-score   support

          -1       0.85      0.88      0.87      7601
           1       0.87      0.84      0.86      7399

   micro avg       0.86      0.86      0.86     15000
   macro avg       0.86      0.86      0.86     15000
weighted avg       0.86      0.86      0.86     15000



In [118]:
from sklearn.svm import LinearSVC

In [121]:
X_train_svc, X_test_svc, y_train_svc, y_test_svc = train_test_split(tfidf_vec, imdb_dataset['sentiment'], test_size = 0.3, random_state = 49)
imdb_svc = LinearSVC().fit(X_train_svc, y_train_svc)
test_predictions_svc = imdb_svc.predict(X_test_svc)
print(classification_report(y_test_svc, test_predictions_svc))

              precision    recall  f1-score   support

          -1       0.91      0.90      0.90      7549
           1       0.90      0.91      0.90      7451

   micro avg       0.90      0.90      0.90     15000
   macro avg       0.90      0.90      0.90     15000
weighted avg       0.90      0.90      0.90     15000



In [123]:
# strangely high accuracy, something is probably wrong

A little better, but it shifts 0.01 or so when I change the random_state, so not really better at all

Now I'll try Sentiment140  

Go, A., Bhayani, R. and Huang, L., 2009. Twitter sentiment classification using distant supervision. CS224N Project Report, Stanford, 1(12).

In [None]:
s140test = pd.read_csv(r"C:\Users\Spencer\Desktop\sentiment_datasets\trainingandtestdata\testdata.manual.2009.06.14.csv", encoding = "ISO-8859-1", names = ["polarity", "id", "date", "topic", "username", "body"])
s140train = pd.read_csv(r"C:\Users\Spencer\Desktop\sentiment_datasets\trainingandtestdata\training.1600000.processed.noemoticon.csv", encoding = "ISO-8859-1", names = ["polarity", "id", "date", "topic", "username", "body"])

In [None]:
s140test["polarity"].value_counts()

It's my understanding from these numbers and their distribution (and the fact that the original study is using a positive, negative, neutral system) that 0 is negative, 2 is neutral, 4 is positive. The tweets themselves seem to indicate that:

In [None]:
# 4 looks positive
s140test[s140test["polarity"]==4]["body"].tolist()[0]

In [None]:
# 2 looks neutral
s140test[s140test["polarity"]==2]["body"].tolist()[0]

In [None]:
# 0 looks negative
s140test[s140test["polarity"]==0]["body"].tolist()[0]

So, I'll reassign them using the system I'm using for the sake of consistency. I'll also combine test and train so I can split them differently.

In [None]:
s140 = s140test.append(s140train, ignore_index=True)

In [None]:
s140["polarity"] = s140["polarity"].apply(lambda x: int((x / 2) - 1)) # 4->1, 2->0, 0->-1

In [None]:
s140["polarity"].value_counts()

In [16]:
s140_cv = CountVectorizer(analyzer=full_processor).fit_transform(s140["body"])

In [23]:
X_train_3, X_test_3, y_train_3, y_test_3 = train_test_split(s140_cv, s140["polarity"], test_size = 0.4, random_state = 135)

In [None]:
# Restarting computer soon so I'll save previously made models so they won't need to be retrained due to notebook restraints

In [30]:
import pickle

In [31]:
imdb_cv_filename = "imdb_cv_model.sav"
imdb_tfidf_filename = "imdb_tfidf_model.sav"

In [32]:
pickle.dump(imdb_model, open(imdb_cv_filename, "wb"))
pickle.dump(imdb_model_v2, open(imdb_tfidf_filename, "wb"))

NameError: name 'imdb_model' is not defined

In [25]:
s140_cv_model = MultinomialNB().fit(X_train_3, y_train_3)

In [26]:
test_predictions_3 = s140_cv_model.predict(X_test_3)

In [27]:
print(classification_report(y_test_3, test_predictions_3))

              precision    recall  f1-score   support

          -1       0.76      0.81      0.78    319714
           0       0.00      0.00      0.00        54
           1       0.79      0.74      0.76    320432

   micro avg       0.77      0.77      0.77    640200
   macro avg       0.52      0.51      0.51    640200
weighted avg       0.77      0.77      0.77    640200



In [33]:
s140_cv_filename = "s140_cv_model.sav"
pickle.dump(s140_cv_model, open(s140_cv_filename, "wb"))

In [35]:
s140_tfidf = TfidfVectorizer(analyzer = full_processor).fit_transform(s140["body"])

In [36]:
X_train_4, X_test_4, y_train_4, y_test_4 = train_test_split(s140_cv, s140["polarity"], test_size = 0.3, random_state = 93)

In [37]:
s140_tfidf_model = MultinomialNB().fit(X_train_4, y_train_4)

In [38]:
test_predictions_4 = s140_tfidf_model.predict(X_test_4)

In [39]:
print(classification_report(y_test_4, test_predictions_4))

              precision    recall  f1-score   support

          -1       0.76      0.81      0.78    239708
           0       0.00      0.00      0.00        38
           1       0.79      0.74      0.77    240404

   micro avg       0.77      0.77      0.77    480150
   macro avg       0.52      0.52      0.52    480150
weighted avg       0.77      0.77      0.77    480150



I should preprocess all the text so its lemmatized and everything before fitting data