# Import data

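Dataset: Large Movie Review Dataset (IMDB), https://ai.stanford.edu/~amaas/data/sentiment/

The cells below assume the archive has been extracted to `aclImdb/` in the working directory.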

In [1]:
import glob

In [2]:
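# Read every review in the positive and negative training folders.
# An explicit encoding avoids platform-dependent decoding errors.
text_pos = []
for file_name in glob.glob('aclImdb/train/pos/*'):
    with open(file_name, 'r', encoding='utf-8') as file:
        text_pos.append(file.read())
text_neg = []
for file_name in glob.glob('aclImdb/train/neg/*'):
    with open(file_name, 'r', encoding='utf-8') as file:
        text_neg.append(file.read())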
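
`glob.glob` returns paths in an arbitrary, filesystem-dependent order, so `text_pos[1]` may be a different review on another machine. A minimal sketch of a reusable loader that sorts the paths first follows. `load_reviews` is a hypothetical helper name, not part of the original notebook, and the `*.txt` pattern assumes the reviews are stored as `.txt` files.

```python
from pathlib import Path


def load_reviews(folder):
    """Return the text of every .txt file in `folder`, in sorted path order."""
    # Sorting makes the list order reproducible across machines.
    paths = sorted(Path(folder).glob('*.txt'))
    return [path.read_text(encoding='utf-8') for path in paths]
```

Under those assumptions, `text_pos = load_reviews('aclImdb/train/pos')` and `text_neg = load_reviews('aclImdb/train/neg')` would stand in for the two loops above.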

In [4]:
print(text_pos[1])

Bizarre horror movie filled with famous faces but stolen by Cristina Raines (later of TV's "Flamingo Road") as a pretty but somewhat unstable model with a gummy smile who is slated to pay for her attempted suicides by guarding the Gateway to Hell! The scenes with Raines modeling are very well captured, the mood music is perfect, Deborah Raffin is charming as Cristina's pal, but when Raines moves into a creepy Brooklyn Heights brownstone (inhabited by a blind priest on the top floor), things really start cooking. The neighbors, including a fantastically wicked Burgess Meredith and kinky couple Sylvia Miles & Beverly D'Angelo, are a diabolical lot, and Eli Wallach is great fun as a wily police detective. The movie is nearly a cross-pollination of "Rosemary's Baby" and "The Exorcist"--but what a combination! Based on the best-seller by Jeffrey Konvitz, "The Sentinel" is entertainingly spooky, full of shocks brought off well by director Michael Winner, who mounts a thoughtfully downbeat en

In [5]:
import pandas as pd

In [9]:
# Assign 1 for positive and 0 for negative
df_pos = pd.DataFrame(text_pos, columns=['text'])
df_pos['target'] = 1
df_neg = pd.DataFrame(text_neg, columns=['text'])
df_neg['target'] = 0
# ignore_index=True avoids duplicate row labels in the combined frame
df = pd.concat([df_pos, df_neg], ignore_index=True)

In [10]:
print(len(text_pos), " ", len(text_neg))

12500   12500


In [11]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

In [12]:
# Classification pipeline: TF-IDF features fed into a RandomForestClassifier
pipe = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('cls', RandomForestClassifier())
])

In [13]:
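X, y = df['text'], df['target']
# With the default test_size=0.25, 6,250 of the 25,000 reviews are held out for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
pipe.fit(X_train, y_train)

# Predict labels for the held-out reviews
y_pred = pipe.predict(X_test)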

In [14]:
print(y_pred)

[0 1 0 ... 1 0 0]


In [15]:
from sklearn.metrics import classification_report

In [16]:
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.83      0.84      0.83      3100
           1       0.84      0.83      0.83      3150

    accuracy                           0.83      6250
   macro avg       0.83      0.83      0.83      6250
weighted avg       0.83      0.83      0.83      6250
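
To make the columns of the report concrete, here is a minimal sketch that computes precision, recall, and F1 for one class from raw label lists. The labels are toy values chosen for illustration, not the notebook's data, and `prf_for_class` is a hypothetical helper rather than a scikit-learn function.

```python
def prf_for_class(y_true, y_pred, cls):
    """Precision, recall and F1 for a single class label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == cls and p == cls for t, p in pairs)  # true positives
    fp = sum(t != cls and p == cls for t, p in pairs)  # false positives
    fn = sum(t == cls and p != cls for t, p in pairs)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy labels (illustrative only): 1 = positive, 0 = negative
y_true_toy = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred_toy = [1, 1, 0, 0, 0, 1, 1, 0]

precision, recall, f1 = prf_for_class(y_true_toy, y_pred_toy, cls=1)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

The `support` column in the report is simply the number of true samples of each class in `y_test`.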



In [17]:
import pickle

In [19]:
# Save the fitted pipeline (vectorizer and classifier together)
with open('model.p', 'wb') as f:
    pickle.dump(pipe, f)

# Predict

In [20]:
# Reload the saved pipeline; only unpickle files from a trusted source
with open('model.p', 'rb') as f:
    pipe = pickle.load(f)

In [21]:
sample_texts = [
    "Data science has many many scope. I love it!",
    "The connection is so bad, I can't understand anything! What a waste of time."
]

In [22]:
pipe.predict_proba(sample_texts)

array([[0.2 , 0.8 ],
       [0.62, 0.38]])
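
Each row of the `predict_proba` output holds one probability per class, with columns ordered as in `pipe.classes_`. The sketch below turns the printed probabilities into labels in plain Python so it runs without the saved model. It assumes `pipe.classes_` is `[0, 1]` (negative, positive), which matches the targets assigned earlier; `to_labels` and `label_names` are hypothetical names introduced for illustration. Because the forest was created without a fixed `random_state`, the exact probabilities may differ between training runs.

```python
# Probabilities copied from the output above.
proba = [[0.2, 0.8],
         [0.62, 0.38]]
classes = [0, 1]  # assumed column order, i.e. pipe.classes_
label_names = {0: "negative", 1: "positive"}  # hypothetical display names


def to_labels(proba, classes):
    """Pick the class with the highest probability in each row."""
    return [classes[max(range(len(row)), key=row.__getitem__)] for row in proba]


predicted = to_labels(proba, classes)
for row, label in zip(proba, predicted):
    print(f"{label_names[label]} ({max(row):.2f})")
```

With these values, the first sample is classified as positive and the second as negative, which is what `pipe.predict(sample_texts)` would be expected to return for the same probabilities.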