# Week 3: Lab - Natural Language Processing
## Sentiment Analysis of Movie Reviews

Welcome to Week 3! In today's lab, you will learn about Natural Language Processing (*NLP*). We will compare three methods of featurizing text data:
* `CountVectorizer` (Bag of Words)
* `TfidfVectorizer` (TF-IDF)
* `Doc2Vec` 

in order to perform **sentiment analysis** on the Cornell IMDB movie review corpus (http://www.cs.cornell.edu/people/pabo/movie-review-data/).

### Input Format

We can't directly input the raw reviews from the Cornell movie review data repository. Instead, we have to "clean them up" by:
1. Converting everything to lower case
2. Removing punctuation
3. Removing common words (stop words)
4. Stemming

'Cleaning up' text is an important **Data Pre-processing** step in NLP, and is crucial to getting good results. Just as we do with numerical features (e.g. filling NA values with a mean), we need to make sure that the words we are going to use as features are consistently formatted and don't carry information that will turn out to be unnecessary.

To practise, we are going to perform the above 4 steps on the sample movie review below.

In [1]:
movie_review = """Bromwell High is nothing short of brilliant. Expertly scripted and perfectly delivered, this searing parody of a students and teachers at a South London Public School leaves you literally rolling with laughter. It's vulgar, provocative, witty and sharp. The characters are a superbly caricatured cross section of British society (or to be more accurate, of any society). Following the escapades of Keisha, Latrina and Natella, our three "protagonists" for want of a better term, the show doesn't shy away from parodying every imaginable subject. Political correctness flies out the window in every episode. If you enjoy shows that aren't afraid to poke fun of every taboo subject imaginable, then Bromwell High will not disappoint!"""
print(movie_review)

Bromwell High is nothing short of brilliant. Expertly scripted and perfectly delivered, this searing parody of a students and teachers at a South London Public School leaves you literally rolling with laughter. It's vulgar, provocative, witty and sharp. The characters are a superbly caricatured cross section of British society (or to be more accurate, of any society). Following the escapades of Keisha, Latrina and Natella, our three "protagonists" for want of a better term, the show doesn't shy away from parodying every imaginable subject. Political correctness flies out the window in every episode. If you enjoy shows that aren't afraid to poke fun of every taboo subject imaginable, then Bromwell High will not disappoint!


#### The first step is to lowercase it
You can do this using the `.lower()` string method. Try it out on `movie_review`, and print it to see the result.

In [3]:
movie_review = movie_review.lower()

Next, we need to remove punctuation. From the `string` module, import `punctuation`.
Print `punctuation` to see the punctuation marks it contains.

In [4]:
from string import punctuation
print(punctuation)

!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~


The way we remove punctuation from a string is by creating a `translator` object, and then calling `.translate` on our string using the `translator` object.

Create a `translator` object by calling `str.maketrans('', '', punctuation)`.

In [7]:
translator = str.maketrans('', '', punctuation)

Now, call `.translate` on your `movie_review` and pass it your `translator` object. Then, print your movie review.

In [23]:
movie_review = movie_review.translate(translator)
movie_review

'bromwell high is nothing short of brilliant expertly scripted and perfectly delivered this searing parody of a students and teachers at a south london public school leaves you literally rolling with laughter its vulgar provocative witty and sharp the characters are a superbly caricatured cross section of british society or to be more accurate of any society following the escapades of keisha latrina and natella our three protagonists for want of a better term the show doesnt shy away from parodying every imaginable subject political correctness flies out the window in every episode if you enjoy shows that arent afraid to poke fun of every taboo subject imaginable then bromwell high will not disappoint'

You should see that all the punctuation has been removed!

If you want to understand why/how this works, check out these posts:
* https://www.tutorialspoint.com/python/string_maketrans.htm
* https://stackoverflow.com/questions/34293875/how-to-remove-punctuation-marks-from-a-string-in-python-3-x-using-translate
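To see what the translation table is doing, here is a minimal sketch that uses only the standard library. The `sample` sentence is taken from the review above; the variable names `sample` and `cleaned` are just for illustration.

```python
from string import punctuation

# The third argument lists the characters to delete
translator = str.maketrans('', '', punctuation)

# The table is a dict keyed by Unicode code point;
# characters marked for deletion map to None
print(translator[ord('!')])  # None

sample = "It's vulgar, provocative, witty and sharp."
cleaned = sample.lower().translate(translator)
print(cleaned)  # its vulgar provocative witty and sharp
```

Note that the apostrophe is deleted rather than replaced by a space, which is why `it's` becomes `its`.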

#### Remove stop words

Notice that all of the punctuation has been removed. Next, we will remove common words. In NLP we want to find the words that distinguish one set of texts from another, and we can make that easier by removing words that are common to ALL texts (and, is, are, etc.).

From `sklearn.feature_extraction.text`, import `ENGLISH_STOP_WORDS`. Then, print `ENGLISH_STOP_WORDS` to see a list of common stop words.

In [24]:
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
print(ENGLISH_STOP_WORDS)

frozenset({'also', 'give', 'therein', 'whom', 'often', 'such', 'thru', 'done', 'became', 'beyond', 'anything', 'for', 'thick', 'thereafter', 'un', 'either', 'why', 'their', 'here', 'found', 'itself', 'third', 'beforehand', 'de', 'our', 'hence', 'seems', 'so', 'there', 'ltd', 'about', 'becoming', 'nothing', 'something', 'moreover', 'whither', 'nor', 'whence', 'than', 'few', 'would', 'see', 'hereupon', 'almost', 'fill', 'though', 'him', 'become', 'empty', 'afterwards', 'me', 'through', 'out', 'six', 'both', 'nobody', 'could', 'ourselves', 'down', 'without', 'because', 'etc', 'had', 'indeed', 'someone', 'hasnt', 'last', 'beside', 'already', 'less', 'on', 'the', 'two', 'of', 'describe', 'hers', 'find', 'he', 'made', 'among', 'whatever', 'whereby', 'into', 'get', 'ours', 'cry', 'via', 'former', 'along', 'from', 'towards', 'becomes', 'may', 'too', 'always', 'toward', 'whenever', 'those', 'anyhow', 'wherein', 'around', 'per', 'neither', 'anyway', 'in', 'nine', 'twelve', 'even', 'until', 'befo

We want to remove the above words from `movie_review`. First, convert `movie_review` into a list by calling the `.split()` method. Call your new object `split_review`.

In [25]:
split_review = movie_review.split()

Now, use a `for` loop to create a new list (call it `clean_words`). Go through `split_review` and check every word. If the word is not in `ENGLISH_STOP_WORDS`, append it to `clean_words`.

In [26]:
clean_words = []
for word in split_review:
    if word not in ENGLISH_STOP_WORDS:
        clean_words.append(word)

Finally, put the clean words back together to re-create `movie_review`, by using the `.join` method on `clean_words`, with a space as the separator.

In [27]:
movie_review = ' '.join(clean_words)
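The same filter can also be written as a list comprehension. This is a minimal sketch: `SMALL_STOP_WORDS` is a hypothetical stand-in for `ENGLISH_STOP_WORDS`, so that the example runs without scikit-learn installed.

```python
# Hypothetical, tiny stop word set used only for this illustration
SMALL_STOP_WORDS = frozenset({'is', 'of', 'and', 'the', 'a', 'this', 'nothing'})

sample = 'bromwell high is nothing short of brilliant'

# Keep every word that is not a stop word
kept_words = [word for word in sample.split() if word not in SMALL_STOP_WORDS]
result = ' '.join(kept_words)
print(result)  # bromwell high short brilliant
```

A `frozenset` is a good fit here because membership checks (`word not in ...`) take roughly constant time no matter how large the stop word list is.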

Print your movie review! It should look like this:

In [28]:
print(movie_review)

bromwell high short brilliant expertly scripted perfectly delivered searing parody students teachers south london public school leaves literally rolling laughter vulgar provocative witty sharp characters superbly caricatured cross section british society accurate society following escapades keisha latrina natella protagonists want better term doesnt shy away parodying imaginable subject political correctness flies window episode enjoy shows arent afraid poke fun taboo subject imaginable bromwell high disappoint


#### Stem words
Finally, we will "stem" the words to remove the differences between words like "expertly" and "expert", which share the same root meaning. Read more on stemming [here](https://en.wikipedia.org/wiki/Stemming).

We will use the `SnowballStemmer` class. Import it from `nltk.stem.snowball`.

In [29]:
from nltk.stem.snowball import SnowballStemmer

`SnowballStemmer` takes a language as an argument. Since we are working with English, create a `SnowballStemmer` object and pass it `'english'` as the language. Call your object `stemmer`.

In [30]:
stemmer = SnowballStemmer('english')

To check the stem of a word, call `stemmer.stem()`. Try it out with the word `running`. See what it prints!

In [31]:
stemmer.stem('running')

'run'
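To get a feel for what a stemmer does, here is a deliberately naive suffix stripper. It is not the Snowball algorithm; `toy_stem` and `SUFFIXES` are hypothetical names made up for this sketch.

```python
# Suffixes to strip, checked in this order (an illustrative choice)
SUFFIXES = ('ing', 'ly', 'ed', 's')

def toy_stem(word):
    """Strip the first matching suffix, keeping at least 3 characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

print(toy_stem('expertly'))  # expert
print(toy_stem('scripted'))  # script
print(toy_stem('shows'))     # show
print(toy_stem('running'))   # runn
```

The toy version turns `running` into `runn`, whereas Snowball gave `run` above. Real stemmers carry extra rules for cases such as doubled consonants, which is one reason to use a library instead of rolling your own.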

Now, similar to how we removed the stop words, we want to go through our review and stem each word. So:
* Turn your `movie_review` back into a list using `.split()`
* Create an empty list called `stemmed_words`
* Use a `for` loop to go through every word in your `movie_review` and call `stemmer.stem` on it.
* Append the newly stemmed word to your `stemmed_words` list
* Finally, turn `stemmed_words` back into a string and store it in `movie_review` by calling `.join` with a space as your separator.

In [32]:
movie_review = movie_review.split()
stemmed_words = []
for word in movie_review:
    stemmed_words.append(stemmer.stem(word))
# Join the stemmed words (not the original list) back into one string
movie_review = ' '.join(stemmed_words)

Print your final movie review! It should look like this:

In [33]:
print(movie_review)

bromwel high short brilliant expert script perfect deliv sear parodi student teacher south london public school leav liter roll laughter vulgar provoc witti sharp charact superbl caricatur cross section british societi accur societi follow escapad keisha latrina natella protagonist want better term doesnt shi away parodi imagin subject polit correct fli window episod enjoy show arent afraid poke fun taboo subject imagin bromwel high disappoint


#### Put it all together
We can put all the steps above together in a function, like this:

In [57]:
def clean_text(raw_text):
    # Collect the cleaned, stemmed words here
    clean_words = []
    # Lowercase first so capitalised stop words ("The", "If") are caught,
    # then strip punctuation
    raw_text = raw_text.lower().translate(translator)
    raw_text_split = raw_text.split()

    for word in raw_text_split:
        # Keep only non-stop words, stemming each one
        if word not in ENGLISH_STOP_WORDS:
            clean_words.append(stemmer.stem(word))
    return ' '.join(clean_words)

Let's see how it works. Here is an unclean review:

In [58]:
unclean_review = """Bromwell High is nothing short of brilliant. Expertly scripted and perfectly delivered, this searing parody of a students and teachers at a South London Public School leaves you literally rolling with laughter. It's vulgar, provocative, witty and sharp. The characters are a superbly caricatured cross section of British society (or to be more accurate, of any society). Following the escapades of Keisha, Latrina and Natella, our three "protagonists" for want of a better term, the show doesn't shy away from parodying every imaginable subject. Political correctness flies out the window in every episode. If you enjoy shows that aren't afraid to poke fun of every taboo subject imaginable, then Bromwell High will not disappoint!"""
print(unclean_review)

Bromwell High is nothing short of brilliant. Expertly scripted and perfectly delivered, this searing parody of a students and teachers at a South London Public School leaves you literally rolling with laughter. It's vulgar, provocative, witty and sharp. The characters are a superbly caricatured cross section of British society (or to be more accurate, of any society). Following the escapades of Keisha, Latrina and Natella, our three "protagonists" for want of a better term, the show doesn't shy away from parodying every imaginable subject. Political correctness flies out the window in every episode. If you enjoy shows that aren't afraid to poke fun of every taboo subject imaginable, then Bromwell High will not disappoint!


Now clean it by calling the `clean_text` function above and make sure you notice the difference!

In [59]:
clean_text(unclean_review)

'bromwel high short brilliant expert script perfect deliv sear parodi student teacher south london public school leav liter roll laughter vulgar provoc witti sharp charact superbl caricatur cross section british societi accur societi follow escapad keisha latrina natella protagonist want better term doesnt shi away parodi imagin subject polit correct fli window episod enjoy show arent afraid poke fun taboo subject imagin bromwel high disappoint'

## Let's now clean up all our data!

Our data can be found in the file `all_reviews_small.csv` (some of the cleaning steps have already been done).

Import `pandas` and read `all_reviews_small.csv` into a dataframe. Call it `df_reviews`.

In [60]:
import pandas as pd
df_reviews = pd.read_csv('all_reviews_small.csv')

Print the `head` and `shape`. You should see 4000 reviews, with 3 columns.

In [61]:
print(df_reviews.head())

  label train_test_split                                               text
0   pos            train  bromwell high is a cartoon comedy it ran at th...
1   pos            train  homelessness or houselessness as george carlin...
2   pos            train  brilliant over acting by lesley ann warren bes...
3   pos            train  this is easily the most underrated film inn th...
4   pos            train  this is not the typical mel brooks film it was...


In [62]:
print(df_reviews.shape)

(4000, 3)


Now apply our `clean_text` function to all the reviews! Store the clean reviews in a new column called `clean_text`.

In [64]:
# Apply clean_text to every review and store the result in a new column
df_reviews['clean_text'] = df_reviews['text'].apply(clean_text)

Check the `head` again to see your new dataframe's `clean_text` column.

In [65]:
df_reviews.head()

  label train_test_split                                               text                                         clean_text
0   pos            train  bromwell high is a cartoon comedy it ran at th...  bromwel high cartoon comedi ran time program s...
1   pos            train  homelessness or houselessness as george carlin...  homeless houseless georg carlin state issu yea...
2   pos            train  brilliant over acting by lesley ann warren bes...  brilliant act lesley ann warren best dramat ho...
3   pos            train  this is easily the most underrated film inn th...  easili underr film inn brook cannon sure flaw ...
4   pos            train  this is not the typical mel brooks film it was...  typic mel brook film slapstick movi actual plo...


## Bag of Words
In scikit-learn, the `CountVectorizer` class implements the Bag of Words model. Import it from `sklearn.feature_extraction.text`, and create a `CountVectorizer()` object called `count_vect`.

In [66]:
from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer()

Now we want to convert all our `clean_text` reviews into a bag of words representation. Call `count_vect.fit_transform` on all our clean reviews (i.e. `df_reviews['clean_text']`) to do this. Save the result in `bag_of_words`.

In [68]:
bag_of_words = count_vect.fit_transform(df_reviews['clean_text'])

Print your `bag_of_words`!

In [71]:
print(bag_of_words)

  (0, 9282)	1
  (0, 13792)	1
  (0, 6360)	1
  (0, 6212)	1
  (0, 18299)	1
  (0, 274)	1
  (0, 197)	1
  (0, 6052)	1
  (0, 20055)	1
  (0, 15629)	1
  (0, 9083)	1
  (0, 10530)	1
  (0, 3302)	1
  (0, 14825)	1
  (0, 8800)	1
  (0, 2450)	1
  (0, 18738)	1
  (0, 15061)	1
  (0, 5825)	1
  (0, 15812)	1
  (0, 9971)	1
  (0, 15018)	1
  (0, 16581)	1
  (0, 13613)	1
  (0, 13967)	1
  :	:
  (3999, 15667)	1
  (3999, 5893)	1
  (3999, 5340)	1
  (3999, 7405)	1
  (3999, 12832)	3
  (3999, 4802)	3
  (3999, 12062)	1
  (3999, 14781)	1
  (3999, 2099)	1
  (3999, 1047)	1
  (3999, 18820)	1
  (3999, 14436)	1
  (3999, 19988)	1
  (3999, 12045)	1
  (3999, 1713)	1
  (3999, 15818)	1
  (3999, 15862)	1
  (3999, 9091)	1
  (3999, 7499)	1
  (3999, 2989)	1
  (3999, 13841)	1
  (3999, 10496)	1
  (3999, 9654)	1
  (3999, 4794)	2
  (3999, 18299)	1


Because our vocabulary is so large, CountVectorizer creates a sparse matrix for memory efficiency. Check `bag_of_words.shape`. You should see 4000 vectors, each with a dimension of 20719.

In [72]:
bag_of_words.shape

(4000, 20719)

`bag_of_words` is now a 4000 x 20,719 feature matrix, where every row is a movie review, and every column holds the count of the word that column represents. The words can be found using the `.get_feature_names()` method (renamed `.get_feature_names_out()` in scikit-learn 1.0 and later).

Check the word at index 200 in `count_vect`. It should be `adulthood`!

In [73]:
count_vect.get_feature_names()[200]

'adulthood'

Likewise, we can do the opposite using `.vocabulary_.get()`. Check `'adulthood'`; it should map to index 200.

In [75]:
count_vect.vocabulary_.get('adulthood')

200
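To make the matrix layout concrete, here is a minimal sketch that builds the same kind of representation by hand using only the standard library. The two toy reviews and the names `docs`, `index` and `entries` are assumptions for illustration. The real `CountVectorizer` tokenises differently (for example, its default token pattern ignores one-character tokens), so treat this as a picture of the idea and not of its implementation.

```python
from collections import Counter

docs = ['good movi good plot', 'bad movi']  # toy, already-cleaned reviews

# Vocabulary: every distinct word, sorted alphabetically,
# mapped to a column index
vocabulary = sorted({word for doc in docs for word in doc.split()})
index = {word: col for col, word in enumerate(vocabulary)}

# Sparse storage: keep only the non-zero (row, column) -> count entries
entries = {}
for row, doc in enumerate(docs):
    for word, count in Counter(doc.split()).items():
        entries[(row, index[word])] = count

print(vocabulary)  # ['bad', 'good', 'movi', 'plot']
print(entries)     # {(0, 1): 2, (0, 2): 1, (0, 3): 1, (1, 0): 1, (1, 2): 1}
```

The `(row, column)` pairs mirror what you saw when printing `bag_of_words`: only the words that actually occur in a review are stored, which is what makes the matrix sparse.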

## TF-IDF

In scikit-learn, the `TfidfVectorizer` class implements the TF-IDF model. Import it from `sklearn.feature_extraction.text`, and create a `TfidfVectorizer()` object called `tf_idf_vect`.

In [76]:
from sklearn.feature_extraction.text import TfidfVectorizer

tf_idf_vect = TfidfVectorizer()

Again, just like you did with the `CountVectorizer`, fit and transform your clean text reviews and store the result in a variable called `tf_idf`. 

In [77]:
tf_idf = tf_idf_vect.fit_transform(df_reviews['clean_text'])

Print `tf_idf` and notice how the values are different. Print the `shape` to confirm that the dimensions are the same as `bag_of_words`.

In [78]:
print(tf_idf)

  (0, 2278)	0.5855887807242065
  (0, 8284)	0.26592116063913257
  (0, 2762)	0.09029991322032968
  (0, 3537)	0.05638745903045433
  (0, 14695)	0.09457744785485278
  (0, 18420)	0.03230458413501131
  (0, 14289)	0.09160984007658156
  (0, 15898)	0.20455582137486983
  (0, 10470)	0.04733560734992249
  (0, 18066)	0.3611996528813187
  (0, 20542)	0.04433368983922523
  (0, 18065)	0.09061806545257328
  (0, 14274)	0.10997188049798773
  (0, 10300)	0.05682279347322916
  (0, 1562)	0.05092034552535194
  (0, 15781)	0.09498760381840703
  (0, 3377)	0.09817406573681878
  (0, 14793)	0.07758183927166384
  (0, 15956)	0.1276981011941443
  (0, 17782)	0.08588286428952507
  (0, 6429)	0.10558346176543791
  (0, 9067)	0.09061806545257328
  (0, 17536)	0.31275913685281154
  (0, 15279)	0.05369019100375045
  (0, 13360)	0.08767616862911443
  :	:
  (3999, 19905)	0.09164719069537093
  (3999, 12051)	0.11269137540016838
  (3999, 18145)	0.22003559744563847
  (3999, 2181)	0.08547048555848959
  (3999, 8399)	0.09390766756728264
  

Again, because our dataset has so many unique words, `TfidfVectorizer` creates a sparse matrix.

This matrix will again be 4000 x 20,719, where each entry is the term frequency (the number of times that word appears in the review) multiplied by the inverse document frequency (roughly, the total number of reviews divided by the number of reviews the word appears in, on a log scale).
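The calculation can be sketched by hand. The snippet below assumes scikit-learn's documented defaults for `TfidfVectorizer` (a smoothed IDF of `ln((1 + n) / (1 + df)) + 1`, followed by scaling each row to unit length); the toy reviews and the helper `tfidf_row` are hypothetical and exist only for this illustration.

```python
import math
from collections import Counter

docs = ['good movi good plot', 'bad movi']  # toy, already-cleaned reviews
tokenised = [doc.split() for doc in docs]
n_docs = len(docs)

# Document frequency: in how many reviews does each word appear?
doc_freq = Counter(word for tokens in tokenised for word in set(tokens))

def tfidf_row(tokens):
    counts = Counter(tokens)  # term frequency = raw count in this review
    # Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1
    weights = {word: count * (math.log((1 + n_docs) / (1 + doc_freq[word])) + 1)
               for word, count in counts.items()}
    # L2-normalise so that every review vector has length 1
    length = math.sqrt(sum(value * value for value in weights.values()))
    return {word: value / length for word, value in weights.items()}

rows = [tfidf_row(tokens) for tokens in tokenised]
for row in rows:
    print({word: round(value, 3) for word, value in row.items()})
```

In the second toy review, `bad` ends up with a higher weight than `movi` even though both occur once, because `movi` appears in every review and so carries less information.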

Let's compare the differences in the two feature sets for one of our reviews.

Print the word at index 9084 in either of our vectorizers (`count_vect` or `tf_idf_vect`). It should be `inspir`.

In [81]:
print(count_vect.get_feature_names()[9084])
print(tf_idf_vect.get_feature_names()[9084])

inspir
inspir


Let's see how often it appears in the review at row 1: print the value at `(1, 9084)` in your `bag_of_words`.

In [82]:
print(bag_of_words[1, 9084])

1


You should see `1`!

What about its tf-idf value?

In [83]:
print(tf_idf[1, 9084])

0.04589627183322225


You should see `0.04589627183322225`.

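Notice how much smaller it is? Two things scale it down: words that appear in many reviews receive a lower IDF weight, and `TfidfVectorizer` rescales each review's vector to unit length by default.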

### Classification

Now, let's build a classifier, feed it our feature data, and train and test it. We'll use a Logistic Regression classifier.

First, do this for the `CountVectorizer`. Use `train_test_split` with `test_size=0.1` and `random_state=42`. Your features will simply be your `bag_of_words` and your labels will be `df_reviews['label']`. Because our data is balanced, you can use `accuracy_score` if you like to check the accuracy of your classifier.

In [94]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

logreg = LogisticRegression()

X = bag_of_words
y = df_reviews['label']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=42)

logreg.fit(X_train, y_train)
predictions = logreg.predict(X_test)

print(accuracy_score(y_test, predictions))

0.8875


88.75%, not bad!
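For intuition about what this number measures, here is a minimal sketch of an accuracy calculation in plain Python. The helper `accuracy` and the toy labels are hypothetical. The figure of 400 test reviews is derived from `test_size=0.1` applied to the 4000 reviews.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    matches = sum(t == p for t, p in zip(y_true, y_pred))
    return matches / len(y_true)

y_true = ['pos', 'neg', 'pos', 'neg']
y_pred = ['pos', 'neg', 'neg', 'neg']
print(accuracy(y_true, y_pred))  # 0.75

# 10% of 4000 reviews is a 400-review test set,
# so 0.8875 corresponds to 355 correct predictions
print(355 / 400)  # 0.8875
```

Accuracy is a reasonable metric here because the two classes are balanced; on a heavily skewed dataset, always predicting the majority class would already score well.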

Now let's try using our `TfidfVectorizer` and see if it performs better. Use the same parameters as above; the only difference is that your features are `tf_idf`.

In [95]:
X = tf_idf
y = df_reviews['label']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=42)

logreg.fit(X_train, y_train)
predictions = logreg.predict(X_test)

print(accuracy_score(y_test, predictions))

0.9025


90.25%! So in this case, tf-idf is a little more accurate than bag of words.

## Doc2Vec

The `Doc2Vec` documentation can be found here:<br>
https://radimrehurek.com/gensim/models/doc2vec.html

A readable, easy introduction to `Doc2Vec` is available in this medium article:<br>
https://medium.com/scaleabout/a-gentle-introduction-to-doc2vec-db3e8c0cce5e

You don't need to understand the details of how `Doc2Vec` works; it's more important that you understand how to use it -- and *that* will be the goal of this lab.

## Setup

First of all, you need to install `gensim`, which is the module that contains `Doc2Vec`. Open up your Terminal (on Mac) or Command Prompt (on Windows) and type in the following:

`pip install -U gensim`

### Modules

We use `gensim`, since it has a much more readable implementation of `Word2Vec` (and `Doc2Vec`). We also use `numpy` for general array manipulation, and `sklearn` for the Logistic Regression classifier.

First, import `numpy` and, from `gensim.models`, import `Doc2Vec`.

In [1]:
import numpy as np
from gensim.models import Doc2Vec

Next, import `LogisticRegression` from `sklearn.linear_model`.

In [14]:
from sklearn.linear_model import LogisticRegression

### Building a Doc2Vec model

To build a Doc2Vec model here, each 'document' (lyrics of a song, words in an email, etc.) needs to be fully 'cleaned' (no punctuation, stemmed, etc.) and placed on its own line in a `.txt` file. In our case, we have 50,000 movie reviews, split into 4 different `.txt` files:

- `test-neg.txt`: 12500 negative movie reviews from the test data
- `test-pos.txt`: 12500 positive movie reviews from the test data
- `train-neg.txt`: 12500 negative movie reviews from the training data
- `train-pos.txt`: 12500 positive movie reviews from the training data

#### Check out the above text files and briefly go through them.

You can look at the `Doc2VecHelperFunctions.ipynb` file if you are curious to see how the text files are converted into our Doc2Vec model, `imdb.d2v`.

If you're curious about the parameters, do read the Doc2Vec/Word2Vec [documentation](https://radimrehurek.com/gensim/models/doc2vec.html).

For this lab, the model is already prepared. It is named `imdb.d2v`. Load it by calling `Doc2Vec.load('./imdb.d2v')` and save it in a variable called `model`.

In [2]:
model = Doc2Vec.load('./imdb.d2v')

### Inspecting the Model

Let's see what our model gives. If we want to see what words are most 'similar' to `'good'`, we can call `model.wv.most_similar('good')` on our model. Try it out!

In [4]:
model.wv.most_similar('good')

[('decent', 0.7383522391319275),
 ('great', 0.7068538069725037),
 ('bad', 0.6740086078643799),
 ('fine', 0.6538172364234924),
 ('solid', 0.6522965431213379),
 ('nice', 0.6312327980995178),
 ('excellent', 0.5867269039154053),
 ('terrific', 0.5646074414253235),
 ('poor', 0.5573974847793579),
 ('strong', 0.5290021300315857)]

Are some of the words above used in ways similar to how you would use the word 'good'? If yes, that means our model has, to some extent, captured the *meaning* of the word `good`. This is really awesome (and important), since we are doing sentiment analysis.
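The scores returned by `most_similar` are cosine similarities between word vectors. Here is a minimal sketch of that calculation in plain Python. The three-dimensional `toy_vectors` are invented for illustration (the real model uses 100 dimensions), and `cosine_similarity` is a hypothetical helper, not part of gensim.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: 1 = same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up vectors, chosen so that 'decent' points roughly the same way as 'good'
toy_vectors = {
    'good':   [0.9, 0.1, 0.3],
    'decent': [0.8, 0.2, 0.4],
    'window': [-0.1, 0.9, -0.5],
}

# Rank every other word by its similarity to 'good', highest first
ranked = sorted(
    (word for word in toy_vectors if word != 'good'),
    key=lambda word: cosine_similarity(toy_vectors['good'], toy_vectors[word]),
    reverse=True,
)
print(ranked)  # ['decent', 'window']
```

This also helps explain why `bad` and `poor` show up in the real output above: the vectors reflect the contexts in which words are used, and opposites often appear in very similar contexts.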

We can also look deeper and see what the model actually contains. To see the feature vector for the first review in the training set for negative reviews, check `model['TRAIN_NEG_0']`:

In [5]:
model['TRAIN_NEG_0']

array([-0.17747681,  0.12533455, -0.00807998, -0.27478257, -0.37068164,
       -0.98255026, -0.6695602 , -1.3983027 , -0.9171792 , -0.08588421,
       -0.36857808, -0.07579076,  0.71824706, -1.0862557 , -0.10960834,
       -0.44040376,  0.2701989 , -1.3847525 ,  0.09918125, -0.22871178,
        0.58657205,  0.11869906,  0.0109016 , -1.4388447 , -0.02760231,
       -0.69180745, -0.837127  , -0.9432887 , -0.18288958,  1.0258856 ,
        1.4753187 , -0.61216   , -0.7937162 , -0.41774765, -0.9840156 ,
       -0.6290388 , -0.84463   , -0.47075298,  1.0337003 ,  0.16890025,
        0.25671875, -0.04080485,  1.4223473 , -0.4656492 ,  0.18151864,
       -0.02523474, -0.9327431 , -0.44220942,  1.3065814 ,  0.1611904 ,
        0.35378784, -0.14659157,  0.53616524, -0.05212237, -0.4788619 ,
       -0.5630621 , -1.0142472 , -0.8748162 , -0.76901954,  0.9787782 ,
       -0.8417108 ,  1.3018354 , -0.43847433,  0.8446837 , -1.2455583 ,
       -0.36123884, -0.08434324,  0.7177968 , -0.14318839,  0.66

## Classifying Sentiments

### Training Vectors

Now let's use these vectors to train a classifier. First, we must extract the training vectors. Remember that we have a total of 25000 training reviews, with equal numbers of positive and negative ones (12500 positive, 12500 negative). There are two parallel arrays, one containing the vectors (`train_arrays`) and the other containing the labels (`train_labels`). We simply put the positive ones in the first half of the array, and the negative ones in the second half.

We will use a `for` loop to go through all `25000` training reviews, adding the vector for each review to `train_arrays` and its corresponding label (`1` for a positive review, and `0` for a negative review) to `train_labels`.

#### Read the code below and ask your Instructor/TA if you have any questions!

In [7]:
train_arrays = np.zeros((25000, 100))
train_labels = np.zeros(25000)

for i in range(12500):
    prefix_train_pos = 'TRAIN_POS_' + str(i)
    prefix_train_neg = 'TRAIN_NEG_' + str(i)
    train_arrays[i] = model[prefix_train_pos]
    train_arrays[12500 + i] = model[prefix_train_neg]
    train_labels[i] = 1
    train_labels[12500 + i] = 0
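To make the layout of the two parallel arrays easier to see, here is the same loop shrunk to three reviews per class. It is only a sketch: `fake_vectors` is a hypothetical dictionary standing in for the trained model, and plain lists stand in for the `numpy` arrays.

```python
n_per_class = 3  # 12500 in the real lab

# Hypothetical stand-in for the model: tag -> two-dimensional vector
fake_vectors = {
    f'TRAIN_{label}_{i}': [float(i), 1.0 if label == 'POS' else -1.0]
    for label in ('POS', 'NEG')
    for i in range(n_per_class)
}

train_arrays = [None] * (2 * n_per_class)
train_labels = [0] * (2 * n_per_class)

for i in range(n_per_class):
    # Positive reviews fill the first half, negative reviews the second half
    train_arrays[i] = fake_vectors[f'TRAIN_POS_{i}']
    train_arrays[n_per_class + i] = fake_vectors[f'TRAIN_NEG_{i}']
    train_labels[i] = 1
    train_labels[n_per_class + i] = 0

print(train_labels)  # [1, 1, 1, 0, 0, 0]
```

Because the label at position `k` always describes the vector at position `k`, the two arrays must never be shuffled independently of each other.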

Print `train_arrays`. You should see rows and rows of vectors, one for each review.

In [8]:
print(train_arrays)

[[-0.09523074  0.10516991 -0.07066526 ... -1.50765908  0.37817046
   0.45435163]
 [ 0.15591073 -1.00769353 -0.29605961 ... -1.45913517  1.49660051
   1.72079444]
 [-0.49689472 -0.63923281 -1.31833351 ... -2.12929225  0.9443326
   0.63289094]
 ...
 [-0.2536512  -0.89831948 -0.24197805 ...  1.50290143  1.01230037
  -0.3398996 ]
 [-2.00854516  0.64646685 -0.45022076 ...  1.535079    0.13337763
   0.06628666]
 [-0.50857788  0.85919422 -0.78979629 ... -0.4446539   1.05848455
   0.50058913]]


Print `train_labels`. They are simply category labels for the review vectors -- 1 representing positive and 0 for negative.

In [10]:
print(train_labels)

[1. 1. 1. ... 0. 0. 0.]


### Testing Vectors

We do the same for testing data -- data that we are going to feed to the classifier after we've trained it using the training data. This allows us to evaluate our results. The process is pretty much the same as extracting the vectors for the training data.

#### Read the code below and ask your Instructor/TA if you have any questions!

In [12]:
test_arrays = np.zeros((25000, 100))
test_labels = np.zeros(25000)

for i in range(12500):
    prefix_test_pos = 'TEST_POS_' + str(i)
    prefix_test_neg = 'TEST_NEG_' + str(i)
    test_arrays[i] = model[prefix_test_pos]
    test_arrays[12500 + i] = model[prefix_test_neg]
    test_labels[i] = 1
    test_labels[12500 + i] = 0

### Classification

Now, train a logistic regression classifier using the training data.

Create a LogisticRegression Classifier, and `fit` it to your `train_arrays` and `train_labels`.

In [15]:
logreg = LogisticRegression()
logreg.fit(train_arrays, train_labels)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

Call `score` on your classifier, passing in your `test_arrays` and `test_labels`.

In [32]:
logreg.score(test_arrays, test_labels)

0.8644

You should see that we have achieved about 86% accuracy for sentiment analysis.

Finally, if you have time, try running your classifier on a bunch of individual reviews and see if you agree with the predictions! You can do this in the following steps:
* Choose a review from one of the `.txt` files.
* You can grab its corresponding vector by using the correct tag in your model.
    * For example, for the 3rd negative test review, the feature vector is `model['TEST_NEG_2']`
* Call `logreg.predict` on your feature vector to see the prediction (you will need `.reshape(1, -1)` to turn the vector into a single row, since the classifier expects one row per review).
    * A result of 1 means it's a positive review, and 0 means negative.
* Do you agree :) ?
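The reshape step is easy to get backwards, so here is a small sketch of the two options. The `vector` below is a hypothetical stand-in for a 100-dimensional review vector such as `model['TEST_NEG_2']`.

```python
import numpy as np

# Hypothetical stand-in for one 100-dimensional review vector
vector = np.arange(100, dtype=float)

as_column = vector.reshape(-1, 1)  # 100 samples with 1 feature each
as_row = vector.reshape(1, -1)     # 1 sample with 100 features

print(as_column.shape)  # (100, 1)
print(as_row.shape)     # (1, 100)
```

A classifier trained on 100 features per review needs the `(1, 100)` shape; passing the `(100, 1)` version leads to an error about the number of features per sample.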

In [23]:
model['TEST_NEG_0'].reshape(-1,1)


array([[-0.4025294 ],
       [ 1.4757489 ],
       [-0.9191377 ],
       [-0.72288036],
       [-0.13658887],
       [ 0.37351483],
       [-0.29512873],
       [-0.09204291],
       [ 0.40346098],
       [ 1.3588545 ],
       [ 0.55906445],
       [-0.46164885],
       [ 0.28233474],
       [-0.13036124],
       [ 0.30230382],
       [ 0.03286894],
       [-1.172239  ],
       [ 0.23371239],
       [ 0.38307256],
       [-0.8923773 ],
       [-1.2599294 ],
       [-0.87436867],
       [-0.5740467 ],
       [-0.25664878],
       [ 1.0596973 ],
       [ 0.4461559 ],
       [-0.16825895],
       [ 0.30935693],
       [-0.2801147 ],
       [-1.0062453 ],
       [ 0.80701435],
       [ 0.4010077 ],
       [ 0.7609655 ],
       [-1.51837   ],
       [-0.65418285],
       [-0.71275574],
       [ 0.93690497],
       [-0.34276244],
       [-0.6172708 ],
       [-0.7193946 ],
       [-1.9586462 ],
       [ 0.7217415 ],
       [ 0.61408055],
       [-0.4114957 ],
       [-0.1370073 ],
       [-0

In [31]:
# scikit-learn expects one row per sample, so reshape the vector to (1, 100)
logreg.predict(model['TEST_NEG_0'].reshape(1, -1))

In [None]:
# Try a random review

In [None]:
# Try a random review

In [None]:
# Try a random review

In [None]:
# Try a random review

Again, ask your Instructor or TA if you have any questions. Good luck on the Assignment!

## References

- Doc2vec: https://radimrehurek.com/gensim/models/doc2vec.html
- Paper that inspired this: https://arxiv.org/pdf/1405.4053.pdf

---