[Bag of Words Meets Bags of Popcorn](https://www.kaggle.com/c/word2vec-nlp-tutorial/data)
======

## Data Set

The labeled data set consists of 50,000 IMDB movie reviews, specially selected for sentiment analysis. Sentiment is binary: an IMDB rating < 5 gives a sentiment score of 0, and a rating >= 7 gives a sentiment score of 1. No individual movie has more than 30 reviews. The 25,000-review labeled training set does not include any of the same movies as the 25,000-review test set. In addition, another 50,000 IMDB reviews are provided without any rating labels.
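
The labeling rule above can be sketched as a small function. This is only an illustration: the name `rating_to_sentiment` is made up here, and returning `None` for ratings of 5 and 6 is an assumption, since the description only covers ratings below 5 and ratings of 7 or more.

```python
from typing import Optional


def rating_to_sentiment(rating: int) -> Optional[int]:
    """Map an IMDB star rating to the binary sentiment label."""
    if rating < 5:
        return 0  # negative review
    if rating >= 7:
        return 1  # positive review
    # Assumption: ratings of 5 and 6 fall outside both rules and get no label
    return None


print([rating_to_sentiment(r) for r in (2, 5, 9)])
```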

## File descriptions

labeledTrainData - The labeled training set. The file is tab-delimited and has a header row followed by 25,000 rows containing an id, sentiment, and text for each review.

## Data fields

* id - Unique ID of each review
* sentiment - Sentiment of the review; 1 for positive reviews and 0 for negative reviews
* review - Text of the review
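
To make the file layout concrete, here is a minimal sketch that parses a tab-delimited text with a header row using only the standard library. The two rows and their ids (`demo_1`, `demo_2`) are invented for illustration and are not taken from `labeledTrainData.tsv`.

```python
import csv
import io

# Hypothetical two-row sample in the same layout: header row, then id, sentiment, review
sample_tsv = (
    "id\tsentiment\treview\n"
    "demo_1\t1\tA made-up positive review.\n"
    "demo_2\t0\tA made-up negative review.\n"
)

# DictReader uses the header row as the field names
rows = list(csv.DictReader(io.StringIO(sample_tsv), delimiter="\t"))

reviews = [row["review"] for row in rows]          # model input
labels = [int(row["sentiment"]) for row in rows]   # prediction target

print(reviews)
print(labels)
```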

## Objective
The objective is to predict **sentiment** (positive or negative) from the **review** text, so X is the **review** column and y is the **sentiment** column.

## 1. Load Dataset

Let's first have a look at the data. You can download the file `labeledTrainData.tsv` from the [Kaggle website of the competition](https://www.kaggle.com/c/word2vec-nlp-tutorial/data), or from our [Google Drive](https://drive.google.com/file/d/1a1Lyn7ihikk3klAX26fgO3YsGdWHWoK5/view?usp=sharing).


In [1]:
# Import pandas, numpy
import pandas as pd
import numpy as np

In [2]:
# Read dataset with extra params sep='\t', encoding="latin-1"
df = pd.read_csv('labeledTrainData.tsv', sep = '\t', encoding = 'latin-1')
df

Unnamed: 0,id,sentiment,review
0,5814_8,1,With all this stuff going down at the moment w...
1,2381_9,1,"\The Classic War of the Worlds\"" by Timothy Hi..."
2,7759_3,0,The film starts with a manager (Nicholas Bell)...
3,3630_4,0,It must be assumed that those who praised this...
4,9495_8,1,Superbly trashy and wondrously unpretentious 8...
...,...,...,...
24995,3453_3,0,It seems like more consideration has gone into...
24996,5064_1,0,I don't believe they made this film. Completel...
24997,10905_3,0,"Guy is a loser. Can't get girls, needs to buil..."
24998,10194_3,0,This 30 minute documentary Buñuel made in the...


## 2. Preprocessing

In [6]:
# stop words
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to C:\Users\Dell
[nltk_data]     7577\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [9]:
from nltk.corpus import stopwords
stop_words = stopwords.words('english')

In [3]:
# Remove special characters and other noise
import re
def preprocessor(text):
    # Remove HTML markup
    text = re.sub(r'<[^>]*>', '', text)

    # Save emoticons such as :-) ;( =D :P for later appending
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text)
    # Convert to lower case, replace every run of non-word characters
    # with a space for standardization, then append the emoticons
    # (with the '-' nose removed)
    text = (re.sub(r'[\W]+', ' ', text.lower()) + ' ' + ' '.join(emoticons).replace('-', ''))
    return text
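
To see what this cleaning step does, the sketch below reruns the same logic on a made-up input. The function is restated (as `demo_preprocessor`, a name used only here) so the snippet runs on its own, and the sample string is hypothetical.

```python
import re


def demo_preprocessor(text):
    # Same three steps as the preprocessor above: strip HTML tags,
    # collect emoticons, then lowercase and collapse non-word characters
    text = re.sub(r'<[^>]*>', '', text)
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text)
    return re.sub(r'[\W]+', ' ', text.lower()) + ' ' + ' '.join(emoticons).replace('-', '')


sample = "<p>This movie is GREAT :-) </p>"  # made-up review text
cleaned = demo_preprocessor(sample)
print(cleaned.split())
```

The HTML tags and punctuation disappear, the text is lowercased, and the emoticon survives at the end of the string without its hyphen, so a whitespace tokenizer still sees it as a token.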

In [11]:
# tokenizer and stemming
# tokenizer: break each review down into individual words
# stemming: reduce each word to its root
from nltk.stem import PorterStemmer
porter = PorterStemmer()
def tokenizer_porter(text):
    # Split on whitespace, then stem every token
    return [porter.stem(word) for word in text.split()]

In [12]:
# Split the dataset into train (80%) and test (20%) sets
from sklearn.model_selection import train_test_split

X = df['review']
y = df['sentiment']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=102)

## 3. Create Model and Train 

Use a **Pipeline** to chain the **tfidf** step and the **LogisticRegression** step.

In [13]:
# Import Pipeline, LogisticRegression, TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(stop_words=stop_words,
                        tokenizer=tokenizer_porter,
                        preprocessor=preprocessor)

clf = Pipeline([('vect', tfidf),
                ('clf', LogisticRegression(random_state=0))])
clf.fit(X_train, y_train)



Pipeline(steps=[('vect',
                 TfidfVectorizer(preprocessor=<function preprocessor at 0x00000272A81B1C10>,
                                 stop_words=['i', 'me', 'my', 'myself', 'we',
                                             'our', 'ours', 'ourselves', 'you',
                                             "you're", "you've", "you'll",
                                             "you'd", 'your', 'yours',
                                             'yourself', 'yourselves', 'he',
                                             'him', 'his', 'himself', 'she',
                                             "she's", 'her', 'hers', 'herself',
                                             'it', "it's", 'its', 'itself', ...],
                                 tokenizer=<function tokenizer_porter at 0x00000272AD7AE3A0>)),
                ('clf', LogisticRegression(random_state=0))])
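
As a rough illustration of how the two pipeline steps work together, the sketch below fits the same kind of pipeline on four invented sentences. It uses a default `TfidfVectorizer` rather than the custom preprocessor and tokenizer above, and the names `toy_reviews`, `toy_labels`, and `toy_clf` are hypothetical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Made-up training data: 1 = positive, 0 = negative
toy_reviews = [
    "a great and enjoyable film",
    "a terrible and boring film",
    "great acting and a great story",
    "boring plot and terrible acting",
]
toy_labels = [1, 0, 1, 0]

# 'vect' turns raw text into tf-idf features; 'clf' is fitted on those features
toy_clf = Pipeline([
    ("vect", TfidfVectorizer()),
    ("clf", LogisticRegression(random_state=0)),
])
toy_clf.fit(toy_reviews, toy_labels)

# Calling predict_proba on the pipeline vectorizes the new text first
toy_proba = toy_clf.predict_proba(["a great film"])
print(toy_proba)
```

Because the vectorizer sits inside the pipeline, raw strings can be passed straight to `fit`, `predict`, and `predict_proba`.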

## 4. Evaluate Model

In [14]:
# Using Test dataset to evaluate model
# classification_report
# confusion matrix
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report


predictions = clf.predict(X_test)
print('accuracy:',accuracy_score(y_test,predictions))
print('confusion matrix:\n',confusion_matrix(y_test,predictions))
print('classification report:\n',classification_report(y_test,predictions))

accuracy: 0.8854
confusion matrix:
 [[2164  339]
 [ 234 2263]]
classification report:
               precision    recall  f1-score   support

           0       0.90      0.86      0.88      2503
           1       0.87      0.91      0.89      2497

    accuracy                           0.89      5000
   macro avg       0.89      0.89      0.89      5000
weighted avg       0.89      0.89      0.89      5000
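
The figures in the report can be recomputed by hand from the confusion matrix printed above, where rows are the true classes and columns are the predicted classes. The sketch below is plain arithmetic on those four counts; the variable names are chosen here for illustration.

```python
# Counts copied from the confusion matrix above
tn, fp = 2164, 339   # true class 0: predicted 0, predicted 1
fn, tp = 234, 2263   # true class 1: predicted 0, predicted 1

total = tn + fp + fn + tp
accuracy = (tn + tp) / total

precision_pos = tp / (tp + fp)   # of all predicted positives, how many are right
recall_pos = tp / (tp + fn)      # of all true positives, how many are found
precision_neg = tn / (tn + fn)
recall_neg = tn / (tn + fp)

print(f"accuracy: {accuracy:.4f}")
print(f"class 0: precision {precision_neg:.2f}, recall {recall_neg:.2f}")
print(f"class 1: precision {precision_pos:.2f}, recall {recall_pos:.2f}")
```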



## 5. Export Model 

In [23]:
# Use pickle to export our trained model
import pickle

# The with-block closes the file handle once the dump finishes
with open('logisticRegression.pkl', 'wb') as model_file:
    pickle.dump(clf, model_file)
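
The save-and-reload cycle can be tried in isolation with a minimal sketch like the one below. It pickles a plain dictionary (`toy_model`, a hypothetical stand-in for the trained pipeline) into a temporary directory, so it does not touch `logisticRegression.pkl`.

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for the trained pipeline
toy_model = {"name": "toy", "weights": [0.1, 0.2]}

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "toy_model.pkl")

    # Write the object in binary mode
    with open(model_path, "wb") as f:
        pickle.dump(toy_model, f)

    # Read it back as a new, equal object
    with open(model_path, "rb") as f:
        restored = pickle.load(f)

print(restored == toy_model)
```

Two points are worth keeping in mind for the real pipeline: pickle stores functions by reference, so `preprocessor` and `tokenizer_porter` must be importable wherever the model is loaded, and pickle files should only be loaded from sources you trust.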

In [27]:
predict_proba = clf.predict_proba(X_test)
predict_proba.shape

(5000, 2)

In [28]:
predict_proba[0]

array([0.04355957, 0.95644043])
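
Each row of `predict_proba` holds one probability per class. The sketch below reads the row shown above in plain Python, assuming the columns follow the order of `clf.classes_` (0 for negative, then 1 for positive), which matches the labels used in the printing loop further down.

```python
# Row copied from the predict_proba[0] output above
proba_row = [0.04355957, 0.95644043]

# Assumption: column order follows clf.classes_, i.e. [0, 1]
class_labels = [0, 1]

# The predicted class is the one with the highest probability
predicted_label = class_labels[proba_row.index(max(proba_row))]

print("probabilities sum to", round(sum(proba_row), 6))
print("predicted label:", predicted_label)
```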

In [35]:
with open('logisticRegression.pkl', 'rb') as model:
    reload_model = pickle.load(model)
preds = reload_model.predict_proba(X_test)


for i in range(len(X_test)):
    print(f'{X_test.iloc[i]} --> Negative, Positive = {preds[i]}')



Great acting on the part of Gretchen Mol. This film is one of the best biopics to hit the screen in some time. While it does cover the majority of Bettie's young life, it also manages to stay on a mostly focused path which is something most biographical films seem to lack. There is some lovely and alarmingly funny subtext in the dialogue and acting. This film is an excellent break from the Dir. of \American Psycho,\" and I think this will show through as her best work to date. Oh, and as a cinematography buff, I give this film 100% in the cine dept. It was amazing how well they pulled off a 50s look with modern film stocks. Accolades to the D.O.P. All around very enjoyable. I recommend any interested to see it: 8/10." --> Negative, Positive = [0.04355957 0.95644043]
While this movie isn't a classic by any stretch, it is very entertaining as I remember it. I saw it about 15 years ago on HBO and loved the movie. It was written by the same guy that wrote and directed \Arthur\" and though 

i expected something different:more passion,drama...Again another failed attempt of originality.i'm sorry to say that the film falls into the old cliché of 'cheesiness'.15 year old teens may appreciate it though.The acting was not very convincing and the lines common,lacking any wit.Still, the soundtrack was good and well adapted.I can't say that this movie is a total flop,because people do watch it but it didn't meet the public's expectations and sunk into mediocrity.So,to conclude,the production keeps you in front of the TV for almost an hour and a half,which is an appreciable thing.Thus,I guess its worth seeing if you don't get annoyed --> Negative, Positive = [0.92599414 0.07400586]
When I rented Domino I was expected it to be very dumb. I hate films that have really flashy editing and cinematography and Domino also just got very bad reviews. The only reason I watched it is because I like have liked Keira Knightley, Mickey Rourke, Christopher Walken, and Tony Scott on other occasi

\True\" story of a late monster that appears when an American industrial plant begins polluting the waters. Amusing, though not really good, monster film has lots of people trying to get the monster and find out whats going on but not in a completely involving way. Give it points for giving us a giant monster that they clearly built to scale for some scenes but take some away in that it looks like a non threatening puppy. An amusing exploitation film thats enjoyably silly in the right frame of mind. (My one complaint is that the print used on the Elvira release is so poor that it looks like a well worn video tape copy that was past its prime 20 years ago.)" --> Negative, Positive = [0.44229344 0.55770656]
I'm not a regular viewer of Springer's, but I do watch his show in glimpses and I think the show is a fine guilty pleasure and a good way to kill some time. So naturally, I'm going to watch this movie expecting to see \Jerry Springer Uncensored.\" First of all, Jerry appears in approx


