### Easy

Read the IMDB dataset.

[IMDB Dataset of 50K Movie Reviews](https://www.kaggle.com/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews)

Install spaCy and tokenize all texts; see [https://spacy.io/usage/spacy-101#annotations-token](https://spacy.io/usage/spacy-101#annotations-token) for how spaCy represents tokens.

In [116]:
import spacy
import pandas as pd
import re

In [117]:
df = pd.read_csv('./data/IMDB Dataset.csv')

In [118]:
nlp = spacy.load('en_core_web_sm')

In [119]:
# Matches any single character that is not a letter, digit, or underscore
pattern = r'[^\w]'

In [120]:
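# Regex-based tokenization (the spaCy pipeline `nlp` loaded above is not used here);
# filter(None, ...) drops the empty strings produced between adjacent separators
df['tokens'] = df['review'].apply(lambda doc: list(filter(None, re.split(pattern, doc))))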
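To see what this split does to punctuation and markup, here is a minimal sketch that applies the same pattern to a short string. The sample text and the helper name `regex_tokenize` are made up for illustration.

```python
import re

# Same pattern as in the notebook: split on any single non-word character.
pattern = r'[^\w]'

def regex_tokenize(text):
    """Split on non-word characters and drop the empty strings left between them."""
    return [tok for tok in re.split(pattern, text) if tok]

# Made-up sample combining a contraction and HTML line breaks.
sample = "Basically there's a family.<br /><br />The boy"
tokens = regex_tokenize(sample)
print(tokens)
# ['Basically', 'there', 's', 'a', 'family', 'br', 'br', 'The', 'boy']
```

The contraction is broken into `there` and `s`, and each `<br />` tag leaves a stray `br` token, which matches the `br` and `s` tokens visible in the `df.head()` output.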

In [121]:
df.head()

| | review | sentiment | tokens |
|---|---|---|---|
| 0 | One of the other reviewers has mentioned that ... | positive | [One, of, the, other, reviewers, has, mentione... |
| 1 | A wonderful little production. `<br /><br />`The... | positive | [A, wonderful, little, production, br, br, The... |
| 2 | I thought this was a wonderful way to spend ti... | positive | [I, thought, this, was, a, wonderful, way, to,... |
| 3 | Basically there's a family where a little boy ... | negative | [Basically, there, s, a, family, where, a, lit... |
| 4 | Petter Mattei's "Love in the Time of Money" is... | positive | [Petter, Mattei, s, Love, in, the, Time, of, M... |
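The cells above tokenize with a regular expression, although the task asks for spaCy. A minimal sketch of the spaCy route might look like the following. It assumes spaCy is installed, and it uses `spacy.blank("en")` (tokenizer only) rather than the `en_core_web_sm` model so that no trained components run. The helper name `spacy_tokenize` and the sample sentence are hypothetical.

```python
import spacy

# spacy.blank("en") builds a pipeline that contains only the rule-based English
# tokenizer, so no trained model has to be downloaded for tokenization alone.
nlp = spacy.blank("en")

def spacy_tokenize(texts):
    """Return one list of token strings per input text."""
    return [[token.text for token in doc] for doc in nlp.pipe(texts)]

# Made-up sample sentence, not a row from the dataset.
reviews = ["Basically there's a family where a little boy thinks."]
print(spacy_tokenize(reviews)[0])
```

With the notebook's data this could be applied as `df['tokens'] = spacy_tokenize(df['review'])`. Expect it to be noticeably slower than the regex split on 50K reviews, and note that HTML remnants such as `<br />` would still have to be removed separately.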


### Medium

Read the sentiment data (2020); the code below loads it from `./data/2000.tsv`.

Use the IMDB dataset and score each review as positive or negative from the `mean_sentiment` values of its words.

Install scikit-learn to measure precision and recall of the labels predicted from the `mean_sentiment` value of each word.
For background on precision and recall, see [https://www.analyticsvidhya.com/blog/2020/09/precision-recall-machine-learning/](https://www.analyticsvidhya.com/blog/2020/09/precision-recall-machine-learning/).

Calculate precision, recall, and F1.

In [122]:
import numpy as np

In [123]:
X = df['tokens']
y = df['sentiment']

In [124]:
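# Map each word to its mean sentiment score
dictionary = {}
# Each line holds three tab-separated fields: word, mean sentiment, and a third value that is ignored
with open('./data/2000.tsv') as file:
    for line in file:
        word, mean_sentiment, _ = line.strip().split('\t')
        dictionary[word] = float(mean_sentiment)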

In [125]:
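def mean_sentiment(tokens, dictionary):
    """Sum the lexicon scores of all tokens found in the dictionary.

    Despite the name, this returns a sum rather than a mean; only its sign
    is used for classification. Tokens missing from the dictionary add nothing.
    """
    sentiment_sum = 0
    for token in tokens:
        if token in dictionary:
            sentiment_sum += dictionary[token]
    return sentiment_sum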

In [126]:
y_prediction = df['tokens'].apply(lambda x: 'positive' if mean_sentiment(x, dictionary) >= 0 else 'negative')
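Two details of this decision rule are easy to miss: tokens are looked up with their original capitalization, and a review with no lexicon hits scores exactly 0 and is therefore labeled positive. The sketch below illustrates both with a toy lexicon. The names `sentiment_score`, `label`, and `toy_lexicon`, the scores, and the `lowercase` switch are assumptions for illustration, and whether lowercasing helps depends on how the real lexicon file is cased.

```python
def sentiment_score(tokens, lexicon, lowercase=True):
    """Sum the lexicon scores of the tokens found in the lexicon; unknown tokens add 0."""
    total = 0.0
    for token in tokens:
        key = token.lower() if lowercase else token
        total += lexicon.get(key, 0.0)
    return total

def label(tokens, lexicon, lowercase=True):
    """Same decision rule as the notebook: a score >= 0 counts as positive."""
    return 'positive' if sentiment_score(tokens, lexicon, lowercase) >= 0 else 'negative'

# Hypothetical lexicon entries, not taken from the real file.
toy_lexicon = {'wonderful': 2.0, 'boring': -1.5, 'terrible': -2.5}

print(label(['A', 'wonderful', 'film'], toy_lexicon))                  # positive
print(label(['Terrible', 'acting'], toy_lexicon))                      # negative
print(label(['Terrible', 'acting'], toy_lexicon, lowercase=False))     # positive
```

In the last call nothing matches, the score stays at 0, and the `>= 0` rule falls back to `positive`.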

In [127]:
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

In [128]:
print(f"Accuracy: {accuracy_score(y, y_prediction)}")
print(f"Precision: {precision_score(y, y_prediction, pos_label='positive')}")
print(f"Recall: {recall_score(y, y_prediction, pos_label='positive')}")
print(f"F1: {f1_score(y, y_prediction, pos_label='positive')}")

Accuracy: 0.64496
Precision: 0.6103397880891487
Recall: 0.80184
F1: 0.6931055943572366
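As a worked check of how the four numbers relate, the sketch below recomputes them from confusion-matrix counts. The counts are not printed by the notebook; they are back-computed from the reported metrics under the assumption that the dataset contains 25,000 positive and 25,000 negative reviews.

```python
# Counts back-computed from the printed metrics, ASSUMING 25,000 positive and
# 25,000 negative reviews; "positive" is the positive class (pos_label).
tp, fp, fn, tn = 20046, 12798, 4954, 12202

accuracy = (tp + tn) / (tp + fp + fn + tn)            # share of all reviews labeled correctly
precision = tp / (tp + fp)                            # of predicted positives, how many are right
recall = tp / (tp + fn)                               # of true positives, how many were found
f1 = 2 * precision * recall / (precision + recall)    # harmonic mean of precision and recall

print(f"Accuracy: {accuracy}")
print(f"Precision: {precision}")
print(f"Recall: {recall}")
print(f"F1: {f1}")
```

Under that assumption there are far more false positives (12,798) than false negatives (4,954), which is why recall is much higher than precision: the lexicon rule leans toward predicting `positive`.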


### Hard

Use Naive Bayes to classify the texts in the IMDB dataset.

In [129]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import TfidfVectorizer


In [130]:
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

In [131]:
df["cleaned_review"] = df["tokens"].apply(lambda doc: " ".join(doc))

In [132]:
def posneg(x):
    if x=="negative":
        return 0
    elif x=="positive":
        return 1
    return x

filtered_score = df["sentiment"].map(posneg)
df["score"] = filtered_score

In [133]:
# Split by position without shuffling: first 25,000 rows for testing, last 25,000 for training
test_data = df[:25000]
train_data = df[25000:50000]
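A positional split like this depends on the row order of the CSV. As an alternative, here is a minimal sketch using `train_test_split`, which the notebook imports but does not call, to shuffle the rows and keep the class balance equal in both halves. The toy frame, the variable names, and `random_state=42` are illustrative assumptions.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for df with the same column names as the notebook.
toy = pd.DataFrame({
    'cleaned_review': [f'review {i}' for i in range(8)],
    'score': [1, 1, 1, 1, 0, 0, 0, 0],
})

# Shuffle, split 50/50, and keep the positive/negative ratio equal in both parts.
train_part, test_part = train_test_split(
    toy, test_size=0.5, stratify=toy['score'], random_state=42
)

print(train_part['score'].value_counts().to_dict())
print(test_part['score'].value_counts().to_dict())
```

On the real data the same call would take `df` in place of `toy`. The metrics would then differ from the ones reported here, since the test set would contain different rows.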

In [134]:
X_train = train_data["cleaned_review"]
y_train = train_data["score"]

X_test = test_data["cleaned_review"]
y_test = test_data["score"]

In [135]:
y_train = y_train.astype('int')
y_test = y_test.astype('int')

In [136]:
# TF-IDF features over unigrams and bigrams; the vocabulary is fitted on the training texts only

tf_idf_vect = TfidfVectorizer(ngram_range=(1,2))
tf_idf_train = tf_idf_vect.fit_transform(X_train.values)
tf_idf_test = tf_idf_vect.transform(X_test.values)

In [137]:
clf = MultinomialNB(alpha=6)
clf.fit(tf_idf_train,y_train)

MultinomialNB(alpha=6)

In [138]:
y_pred_test = clf.predict(tf_idf_test)


In [139]:
print(f"Accuracy: {accuracy_score(y_test, y_pred_test)}")
print(f"Precision: {precision_score(y_test, y_pred_test)}")
print(f"Recall: {recall_score(y_test, y_pred_test)}")
print(f"F1: {f1_score(y_test, y_pred_test)}")

Accuracy: 0.86856
Precision: 0.9129067050152795
Recall: 0.8142536475869809
F1: 0.8607627118644068
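The vectorize, fit, and predict steps can also be bundled into one scikit-learn pipeline, so the vectorizer is always fitted on the training texts only. The following is a minimal sketch on a made-up four-sentence corpus. The texts and labels are invented, and `alpha=1.0` is chosen for the toy data rather than the `alpha=6` used in the notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up training corpus: 1 = positive, 0 = negative.
train_texts = [
    "a wonderful and moving film",
    "great acting and a great story",
    "boring plot and terrible acting",
    "a dull terrible waste of time",
]
train_labels = [1, 1, 0, 0]

# TF-IDF over unigrams and bigrams, followed by multinomial Naive Bayes.
model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    MultinomialNB(alpha=1.0),
)
model.fit(train_texts, train_labels)

predictions = model.predict(["wonderful story", "terrible waste"])
print(list(predictions))  # [1, 0]
```

With the notebook's variables, the equivalent calls would be `model.fit(X_train, y_train)` followed by `model.predict(X_test)`.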
