Note: datasets with citations generally come from here: 
https://blog.cambridgespark.com/50-free-machine-learning-datasets-sentiment-analysis-b9388f79c124
So, remember to cite if you use one

MORE NOTES TO SELF:
- Find average comment length, sentiment, score, etc.

In [1]:
import numpy as np
import pandas as pd
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics

Model: IMDB reviews labelled by sentiment based on score
From:  
Maas, A., Daly, R., Pham, P., Huang, D., Ng, A. and Potts, C. (2011). Learning Word Vectors for Sentiment Analysis: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies. [online] Portland, Oregon, USA: Association for Computational Linguistics, pp.142–150. Available at: http://www.aclweb.org/anthology/P11-1015.

IMDB reviews may be applicable considering they are 1) comparable to the length of a reddit comment and 2) written about movies, obviously comparable to writing about TV shows

This dataset comes in the form of a neg folder and a pos folder, each with comments in .txt format from IMDB. So, I'll need to put them in a more useful format.

In [2]:
import os

def folders_to_df(pos_path_as_str, neg_path_as_str):
    
    lst = []
    
    pos_path = os.fsencode(pos_path_as_str)
    neg_path = os.fsencode(neg_path_as_str)
    
    for file in os.listdir(pos_path):
        
        filepath = pos_path_as_str + "\\" + os.fsdecode(file)
        with open(filepath, "r", encoding="utf8") as f:
            body = f.read()
            lst.append([body, 1])
            
    for file in os.listdir(neg_path):
        
        filepath = neg_path_as_str + "\\" + os.fsdecode(file)
        with open(filepath, "r", encoding="utf8") as f:
            body = f.read()
            lst.append([body, -1])
            
    return pd.DataFrame(lst, columns = ["body", "sentiment"])

In [7]:
train_pos_path = r"C:\Users\Spencer\Desktop\sentiment_datasets\aclImdb\train\pos"
train_neg_path = r"C:\Users\Spencer\Desktop\sentiment_datasets\aclImdb\train\neg"

In [8]:
training_set = folders_to_df(train_pos_path, train_neg_path)

In [9]:
training_set.head()

Unnamed: 0,body,sentiment
0,Bromwell High is a cartoon comedy. It ran at t...,1
1,Homelessness (or Houselessness as George Carli...,1
2,Brilliant over-acting by Lesley Ann Warren. Be...,1
3,This is easily the most underrated film inn th...,1
4,This is not the typical Mel Brooks film. It wa...,1


The dataset is already split into train/test, but I'm combining them so I can define their ratio

In [10]:
test_pos_path = r"C:\Users\Spencer\Desktop\sentiment_datasets\aclImdb\test\pos"
test_neg_path = r"C:\Users\Spencer\Desktop\sentiment_datasets\aclImdb\test\neg"

testing_set = folders_to_df(test_pos_path, test_neg_path)

In [11]:
testing_set.head()

Unnamed: 0,body,sentiment
0,I went and saw this movie last night after bei...,1
1,Actor turned director Bill Paxton follows up h...,1
2,As a recreational golfer with some knowledge o...,1
3,"I saw this film in a sneak preview, and it is ...",1
4,Bill Paxton has taken the true story of the 19...,1


In [12]:
imdb_dataset = training_set.append(testing_set, ignore_index=True)

In [13]:
imdb_dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 2 columns):
body         50000 non-null object
sentiment    50000 non-null int64
dtypes: int64(1), object(1)
memory usage: 781.3+ KB


In [14]:
# getting this into a csv so I don't have to deal with it on ubuntu later
imdb_dataset.to_csv("imdb_df.csv")

In [15]:
# ON UBUNTU START FROM HERE
imdb_dataset = pd.read_csv("imdb_df.csv")

import string
from nltk.corpus import stopwords

def text_processor(text):
    punc_removed = ''.join([char for char in text if char not in string.punctuation])
    return [word.lower() for word in punc_removed.split() if word.lower() not in stopwords.words("english")]

In [164]:
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
import string

nltk.download('punkt')
nltk.download('stopwords')
stop = set(stopwords.words('english'))

from nltk.stem.wordnet import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
nltk.download('averaged_perceptron_tagger')
from nltk.corpus import wordnet

def full_processor(text_string):
    
    # tokenize text
    tokenized_text = word_tokenize(text_string)
    
    # remove stopwords and punctuation
    no_stops = []
    for word in tokenized_text:
        if (word not in stop) and (word not in string.punctuation) :
            no_stops.append(word)
    
    # get parts of speech and lemmatize
    pos_ = {"N": wordnet.NOUN,
            "V": wordnet.VERB,
            "J": wordnet.ADJ,
            "R": wordnet.ADV}
    lemmatized = []
    tags = nltk.pos_tag(tokenized_text)
    for word in tags:
        tag = word[1][0]
        if tag not in pos_: # anything not in .lemmatize()'s pos dictionary is considered a noun
            tag = 'N'
        tag = pos_[tag]
        lem = lemmatizer.lemmatize(word[0], tag)
        lemmatized.append(lem)
    
    return lemmatized

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Spencer\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Spencer\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\Spencer\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


In [165]:
from sklearn.feature_extraction.text import CountVectorizer

In [166]:
imdb_cv = CountVectorizer(analyzer=full_processor).fit_transform(imdb_dataset["body"])

In [167]:
from sklearn.model_selection import train_test_split

In [174]:
X_train_imdb_cv_nb, X_test_imdb_cv_nb, y_train_imdb_cv_nb, y_test_imdb_cv_nb = train_test_split(imdb_cv, imdb_dataset['sentiment'], test_size = 0.25, random_state = 137)

In [175]:
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics

In [176]:
imdb_cv_nb = MultinomialNB().fit(X_train_imdb_cv_nb, y_train_imdb_cv_nb)

In [177]:
imdb_pred_cv_nb = imdb_cv_nb.predict(X_test_imdb_cv_nb)

In [178]:
from sklearn.metrics import classification_report

In [179]:
print(classification_report(y_test_imdb_cv_nb, imdb_pred_cv_nb))

              precision    recall  f1-score   support

          -1       0.83      0.87      0.85      6345
           1       0.86      0.82      0.84      6155

   micro avg       0.84      0.84      0.84     12500
   macro avg       0.84      0.84      0.84     12500
weighted avg       0.84      0.84      0.84     12500



In [107]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [141]:
imdb_tfidf = TfidfVectorizer(analyzer = full_processor).fit_transform(imdb_dataset["body"])

In [180]:
X_train_imdb_tfidf_nb, X_test_imdb_tfidf_nb, y_train_imdb_tfidf_nb, y_test_imdb_tfidf_nb = train_test_split(imdb_tfidf, imdb_dataset['sentiment'], test_size = 0.3, random_state = 57)

In [181]:
imdb_tfidf_nb = MultinomialNB().fit(X_train_imdb_tfidf_nb, y_train_imdb_tfidf_nb)

In [182]:
imdb_pred_tfidf_nb = imdb_tfidf_nb.predict(X_test_imdb_tfidf_nb)

In [183]:
print(classification_report(y_test_imdb_tfidf_nb, imdb_pred_tfidf_nb))

              precision    recall  f1-score   support

          -1       0.85      0.88      0.87      7601
           1       0.87      0.84      0.86      7399

   micro avg       0.86      0.86      0.86     15000
   macro avg       0.86      0.86      0.86     15000
weighted avg       0.86      0.86      0.86     15000



In [184]:
from sklearn.svm import LinearSVC

In [206]:
X_train_imdb_tfidf_svc, X_test_imdb_tfidf_svc, y_train_imdb_tfidf_svc, y_test_imdb_tfidf_svc = train_test_split(imdb_tfidf, imdb_dataset['sentiment'], test_size = 0.25, random_state = 49)
imdb_tfidf_svc = LinearSVC().fit(X_train_imdb_tfidf_svc, y_train_imdb_tfidf_svc)
imdb_pred_tfidf_svc = imdb_tfidf_svc.predict(X_test_imdb_tfidf_svc)
print(classification_report(y_test_imdb_tfidf_svc, imdb_pred_tfidf_svc))

              precision    recall  f1-score   support

          -1       0.91      0.90      0.91      6309
           1       0.90      0.91      0.91      6191

   micro avg       0.91      0.91      0.91     12500
   macro avg       0.91      0.91      0.91     12500
weighted avg       0.91      0.91      0.91     12500



In [209]:
# very high accuracy

In [129]:
from sklearn.feature_extraction.text import HashingVectorizer

In [189]:
imdb_hv = HashingVectorizer(analyzer = full_processor, non_negative = True).fit_transform(imdb_dataset["body"])



In [208]:
X_train_imdb_hv_svc, X_test_imdb_hv_svc, y_train_imdb_hv_svc, y_test_imdb_hv_svc = train_test_split(imdb_hv, imdb_dataset['sentiment'], test_size = 0.3, random_state = 799)
imdb_hv_svc = LinearSVC().fit(X_train_imdb_hv_svc, y_train_imdb_hv_svc)
imdb_pred_hv_svc = imdb_hv_svc.predict(X_test_imdb_hv_svc)
print(classification_report(y_test_imdb_hv_svc, imdb_pred_hv_svc))

              precision    recall  f1-score   support

          -1       0.90      0.89      0.90      7551
           1       0.89      0.90      0.90      7449

   micro avg       0.90      0.90      0.90     15000
   macro avg       0.90      0.90      0.90     15000
weighted avg       0.90      0.90      0.90     15000



In [194]:
X_train_imdb_hv_nb, X_test_imdb_hv_nb, y_train_imdb_hv_nb, y_test_imdb_hv_nb = train_test_split(imdb_hv, imdb_dataset['sentiment'], test_size = 0.3, random_state = 709)
imdb_hv_nb = MultinomialNB().fit(X_train_imdb_hv_nb, y_train_imdb_hv_nb)
imdb_pred_hv_nb = imdb_hv_nb.predict(X_test_imdb_hv_nb)
print(classification_report(y_test_imdb_hv_nb, imdb_pred_hv_nb))

              precision    recall  f1-score   support

          -1       0.80      0.91      0.85      7543
           1       0.89      0.77      0.83      7457

   micro avg       0.84      0.84      0.84     15000
   macro avg       0.85      0.84      0.84     15000
weighted avg       0.85      0.84      0.84     15000



In [150]:
import pickle

In [196]:
models = [imdb_cv_nb, imdb_tfidf_nb, imdb_tfidf_svc, imdb_hv_svc, imdb_hv_nb]
strings = ["imdb_cv_nb", "imdb_tfidf_nb", "imdb_tfidf_svc", "imdb_hv_svc", "imdb_hv_nb"]
for i in range(len(models)):
    pickle.dump(models[i], open(strings[i] + ".sav", "wb"))

Tentatively, TfidfVectorizer with an SVC classifier looks best. I will stick with TfidfVectorizer as it should be more accurate in most cases, seems to be at least as good (better, even) for this dataset, and I don't want to bother with several vectorizers that run the same/worse.  
So, I'll adjust TFIDF's parameters.

In [238]:
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
import string

nltk.download('punkt')
nltk.download('stopwords')
stop = set(stopwords.words('english'))

from nltk.stem.wordnet import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
nltk.download('averaged_perceptron_tagger')
from nltk.corpus import wordnet

def tokenizer_lemmatizer(text):
    
    # tokenizes, removes, stop words, gets pos, lemmatizes with pos
    tokenized_text = word_tokenize(text)
    
    no_stops = []
    for word in tokenized_text:
        if word not in stop:
            no_stops.append(word)
    tokenized_text = no_stops
    
    pos_ = {"N": wordnet.NOUN,
            "V": wordnet.VERB,
            "J": wordnet.ADJ,
            "R": wordnet.ADV}
    lemmatized = []
    tags = nltk.pos_tag(tokenized_text)
    for word in tags:
        tag = word[1][0]
        if tag not in pos_: # anything not in .lemmatize()'s pos dictionary is considered a noun
            tag = 'N'
        tag = pos_[tag]
        lem = lemmatizer.lemmatize(word[0], tag)
        lemmatized.append(lem)
    
    return lemmatized

def preprocessor_remove_punc(text):
    return text.lower().translate(str.maketrans('', '', string.punctuation))

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Spencer\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Spencer\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\Spencer\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


In [239]:
tfidf_vec = TfidfVectorizer(preprocessor=preprocessor_remove_punc, tokenizer=tokenizer_lemmatizer)

In [240]:
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.ensemble import ExtraTreesClassifier

In [257]:
tfidf_pipeline = Pipeline([('tfidf', tfidf_vec)])

In [259]:
tfidf_pipeline.fit_transform(imdb_dataset["body"])

<50000x170147 sparse matrix of type '<class 'numpy.float64'>'
	with 4866093 stored elements in Compressed Sparse Row format>

In [260]:
classifier_pipeline = Pipeline([('classifier', MultinomialNB())])

In [261]:
tfidf_clf_pipeline = Pipeline([('tfidf', tfidf_pipeline), ('classifier', MultinomialNB())])

In [279]:
parameters = [
                {'classifier': [MultinomialNB()],
                'classifier__alpha': [1, 0.1, 0.01, 0.001],
                'tfidf__tfidf__ngram_range': [(1, 1), (1, 2)],
                'tfidf__tfidf__max_df': [0.25, 0.5, 0.75],
                'tfidf__tfidf__min_df': [1, 10, 100]},
                {'classifier': [LinearSVC()],
                'classifier__C': [1, 10, 100, 1000],
                'tfidf__tfidf__ngram_range': [(1, 1), (1, 2)],
                'tfidf__tfidf__max_df': [0.25, 0.5, 0.75],
                'tfidf__tfidf__min_df': [1, 10, 100]}
            ]

In [280]:
grid_search = GridSearchCV(tfidf_clf_pipeline, parameters, cv = 3)

In [281]:
X_train, X_test, y_train, y_test = train_test_split(imdb_dataset["body"], imdb_dataset['sentiment'], test_size = 0.3, random_state = 709)

In [282]:
grid_search.fit(X_train, y_train)

KeyboardInterrupt: 