# Assignment 2, due March 27, 10am

### Late submission policy: each late day costs 10% of the grade
In this assignment you will classify each review in the Movie Reviews Corpus as positive or negative. 
This is the polarity data set (1000 negative and 1000 positive reviews). For more information visit Bo Pang and Lillian Lee's Movie Review Site: http://www.cs.cornell.edu/people/pabo/movie-review-data/.
This corpus is part of the NLTK distribution. 


### Packages
First import all the packages that you will need during this assignment.

numpy (www.numpy.org) is the fundamental package for scientific computing with Python.
NLTK (https://www.nltk.org/) is the Natural Language Toolkit, a library for working with human language data.
pandas (https://pandas.pydata.org/) is the fundamental package for data manipulation and analysis (we will use dataframes).
sklearn (http://scikit-learn.org/stable/) provides simple and efficient tools for data mining and data analysis.
matplotlib (http://matplotlib.org) is a library for plotting graphs in Python.
keras (https://keras.io/) is a high-level neural networks API, written in Python and capable of running on top of TensorFlow, CNTK, or Theano.

# 1. Prepare data. 

In [None]:
import logging
import pandas as pd
import numpy as np
from numpy import asarray
from numpy import zeros

%matplotlib inline
import matplotlib.pyplot as plt
import itertools

In [None]:
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.preprocessing import text, sequence
from keras import regularizers
from keras.models import Model, Sequential
from keras.layers import Input, Dense, Dropout, Embedding, LSTM, Flatten, Activation
from keras.utils import to_categorical
from keras.callbacks import ModelCheckpoint
from keras import utils

In [None]:
from tensorflow import keras

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.preprocessing import LabelBinarizer, LabelEncoder
from sklearn.utils import shuffle

In [None]:
import nltk
from nltk.corpus import movie_reviews,stopwords
from nltk.tokenize import word_tokenize,sent_tokenize
import string
from nltk import pos_tag
from nltk.stem import wordnet
import random

In [None]:
# Build (tokens, joined text, label) triples, reading each review only once
documents = []
for category in movie_reviews.categories():
    for fileid in movie_reviews.fileids(category):
        words = list(movie_reviews.words(fileid))
        documents.append((words, " ".join(words), category))

In [None]:
# Create a dataframe called imdb where the data will be stored. 
# The imdb dataframe should have three columns: words (tokenized text), text, label (pos or neg)
imdb = None

imdb = pd.DataFrame(documents, columns=['words', 'text', 'label'])

In [None]:
# Convert the pos/neg labels into binary labels: 0 - positive; 1 - negative
imdb['target'] = pd.Categorical(imdb.label, categories=['pos', 'neg']).codes
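
To check which integer each label receives, the mapping can be tried on a toy column first. This is a minimal sketch with made-up labels (`toy_labels` is a hypothetical name, not part of the assignment); the codes follow the order of `categories`, so `pos` becomes 0 and `neg` becomes 1.

```python
import pandas as pd

# Hypothetical toy labels, standing in for imdb.label
toy_labels = pd.Series(['neg', 'pos', 'pos', 'neg'])

# Each code is the position of the label in `categories`
codes = pd.Categorical(toy_labels, categories=['pos', 'neg']).codes
print(codes.tolist())  # [1, 0, 0, 1]
```

A label that is missing from `categories` would get the code -1, which is worth checking for before training.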

In [None]:
imdb.head() 

### Your output should look like: 

In [None]:
imdb.tail() 

### Your output should look like: 

In [None]:
# Shuffle the data points to randomize the order

imdb = shuffle(imdb)

In [None]:
# Create a new data column with the number of tokens per text 
# We will use this information to specify the maximum length of the sentences that we will analyze

imdb['num_words'] = imdb.words.apply(len)

In [None]:
# See the distribution of the text length across data points.
# We will analyze up to 1000 tokens per document (data point)

bin_edges = [0, 200, 500, 900, 1400, np.inf]
bin_labels = ['0-200', '200-500', '500-900', '900-1400', '>1400']
imdb['bins'] = pd.cut(imdb.num_words, bins=bin_edges, labels=bin_labels)
word_distribution = imdb.groupby('bins', observed=False).size().reset_index(name='counts')
word_distribution
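
The binning step can be checked on a handful of values before applying it to the whole corpus. The sketch below uses made-up token counts (`toy_counts` is hypothetical) and the same bin edges; it shows that `pd.cut` intervals are closed on the right by default, so a document with exactly 200 tokens lands in `0-200`.

```python
import numpy as np
import pandas as pd

# Hypothetical token counts for six documents
toy_counts = pd.Series([150, 200, 201, 650, 1400, 2300])

edges = [0, 200, 500, 900, 1400, np.inf]
names = ['0-200', '200-500', '500-900', '900-1400', '>1400']

# Intervals are right-closed: 200 falls in '0-200', 1400 in '900-1400'
toy_bins = pd.cut(toy_counts, bins=edges, labels=names)
print(toy_bins.value_counts(sort=False))
```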

In [None]:
train_size = int(len(imdb) * .7) 
print ("Train size: %d" % train_size)
print ("Test size: %d" % (len(imdb) - train_size))

In [None]:
# Separate document representations from labels

train_docs = imdb['text'][:train_size]
train_labels = imdb['target'][:train_size]

test_docs = imdb['text'][train_size:]
test_labels = imdb['target'][train_size:]
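
Because the split is a plain slice of shuffled rows, the two parts are not guaranteed to hold equal shares of positive and negative reviews. The sketch below is one way to check this on a toy dataframe; it assumes a fixed seed for reproducibility and uses `DataFrame.sample` as a stand-in for the shuffle (`sklearn.utils.shuffle` also accepts a `random_state` argument). The names `toy`, `train` and `test` are hypothetical.

```python
import pandas as pd

# Hypothetical stand-in for the imdb dataframe: 5 positive (0) and 5 negative (1) rows
toy = pd.DataFrame({'text': [f'review {i}' for i in range(10)],
                    'target': [0] * 5 + [1] * 5})

# A fixed seed makes the shuffle, and therefore the split, reproducible
toy = toy.sample(frac=1, random_state=0).reset_index(drop=True)

split = int(len(toy) * 0.7)
train, test = toy[:split], toy[split:]

# Share of each class in the two parts
print(train['target'].value_counts(normalize=True))
print(test['target'].value_counts(normalize=True))
```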

# 2. Create model 1.

## Use the model that was created for Assignment 1. 
## Report the result. 

In [None]:
# The Tokenizer class (https://keras.io/preprocessing/text/) lets you vectorize a text corpus.
# Follow the provided link to learn about all the arguments for Tokenizer

# max_words is the cap discussed above (up to 1000 tokens analyzed per text).
# Note that Tokenizer's num_words argument limits the vocabulary to the most frequent words,
# not the length of each text.
max_words = 1000
tokenize = text.Tokenizer(num_words=max_words, char_level=False)
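
The effect of `num_words` is easier to see on a tiny corpus. The following is a pure-Python sketch of the idea, not the Keras implementation: words are ranked by frequency starting at index 1, and only words whose index is below `num_words` survive when a text is turned into a sequence. The corpus and the helper `to_sequence` are hypothetical, and the sketch ignores the lowercasing and punctuation filtering that the real Tokenizer performs.

```python
from collections import Counter

# Hypothetical toy corpus
toy_texts = ["great movie great cast great plot",
             "great movie bad cast",
             "great movie"]

# Rank words by frequency; index 0 is reserved, so ranks start at 1
counts = Counter(word for t in toy_texts for word in t.split())
word_index = {w: rank for rank, (w, _) in enumerate(counts.most_common(), start=1)}

def to_sequence(text, num_words):
    """Map words to indices, dropping words whose index is >= num_words."""
    return [word_index[w] for w in text.split()
            if w in word_index and word_index[w] < num_words]

print(word_index)
print(to_sequence("great bad movie cast", num_words=3))  # [1, 2]
```

With `num_words=3` only the two most frequent words are kept, which mirrors how the limit applies to vocabulary rank rather than to document length.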

In [None]:
# report results here
print (None)

# 3. Download the glove.6B.100d.txt file (100 dimensions)

### The glove.6B.100d.txt file should be in the same directory (root directory) as your notebook file

In [None]:
# load the whole embedding into memory
embeddings_index = dict()
# GloVe files are UTF-8 text; the context manager closes the file even if parsing fails
with open('glove.6B.100d.txt', encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs
print('Loaded %s word vectors.' % len(embeddings_index))
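
Each line of the GloVe file holds a word followed by its vector components, separated by spaces. The parsing logic can be tried without the large download by feeding it a small in-memory file. In this sketch `parse_glove` is a hypothetical helper and the 3-dimensional vectors are made up.

```python
import io
import numpy as np

def parse_glove(lines):
    """Parse lines of the form 'word v1 v2 ... vn' into {word: vector}."""
    index = {}
    for line in lines:
        values = line.split()
        if not values:  # skip blank lines
            continue
        index[values[0]] = np.asarray(values[1:], dtype='float32')
    return index

# Hypothetical 3-dimensional vectors, standing in for glove.6B.100d.txt
toy_file = io.StringIO("movie 0.1 0.2 0.3\nfilm 0.1 0.25 0.3\n")
toy_index = parse_glove(toy_file)
print(len(toy_index), toy_index['movie'].shape)  # 2 (3,)
```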

# 4. Create model 2.

## Create a model using word embeddings. 
## Compare the two models.
## Report the result. 

In [None]:
# out of 400000 word vectors, use only those that correspond to the words used in the movie review corpus. 
# vocab_size contains the number of different words used in the movie review corpus

vocab_size = len(tokenize.word_index) + 1
vocab_size

### The vocabulary size should be: 34211

In [None]:
# create a weight matrix for words in training docs
embedding_matrix = zeros((vocab_size, 100))
for word, i in tokenize.word_index.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector

In [None]:
embedding_matrix.shape
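
The same construction can be wrapped in a small function that also counts how many vocabulary words were found in the embeddings, which is useful when comparing embedding sets later. This is a sketch on toy inputs; `build_embedding_matrix` and the toy dictionaries are hypothetical names. Words without a pretrained vector keep an all-zero row, as does row 0, which no word uses.

```python
import numpy as np

def build_embedding_matrix(word_index, embeddings_index, dim):
    """Return a (len(word_index) + 1, dim) matrix and the number of words found."""
    matrix = np.zeros((len(word_index) + 1, dim))
    found = 0
    for word, i in word_index.items():
        vector = embeddings_index.get(word)
        if vector is not None:
            matrix[i] = vector
            found += 1
    return matrix, found

# Hypothetical toy inputs: word indices start at 1, row 0 stays all zeros
toy_word_index = {'movie': 1, 'film': 2, 'zzyzx': 3}
toy_embeddings = {'movie': np.array([0.1, 0.2]), 'film': np.array([0.3, 0.4])}

matrix, found = build_embedding_matrix(toy_word_index, toy_embeddings, dim=2)
print(matrix.shape, found)  # (4, 2) 2
```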

In [None]:
# define model
# in this model, in contrast to Assignment 1, instead of using hidden units, we use GloVe word embeddings

# Assumption: num_classes is not defined earlier in this notebook; a single sigmoid
# output matches the 0/1 'target' column and the binary_crossentropy loss
num_classes = 1

model = Sequential()
model.add(Embedding(vocab_size, 100, weights=[embedding_matrix], input_length=max_words, trainable=False))
model.add(Flatten())
model.add(Dense(num_classes, activation='sigmoid'))

# compile the model
model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])
model.summary()
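
To make the data flow through this model concrete, the three layers can be imitated with plain NumPy on toy sizes. This is only an illustrative sketch with random weights and a single output unit, not the Keras computation itself: the embedding layer is a row lookup, `Flatten` reshapes each document into one long vector, and the dense layer applies a sigmoid to a weighted sum. All sizes and array names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab, dim, max_len = 6, 4, 5             # hypothetical toy sizes
E = rng.normal(size=(vocab, dim))         # stands in for embedding_matrix
W = rng.normal(size=(max_len * dim, 1))   # dense layer weights
b = np.zeros(1)                           # dense layer bias

# Two padded documents of length max_len; 0 is the padding index
X = np.array([[0, 0, 1, 2, 3],
              [0, 4, 5, 1, 1]])

embedded = E[X]                                 # embedding lookup: (2, 5, 4)
flat = embedded.reshape(len(X), -1)             # flatten: (2, 20)
probs = 1.0 / (1.0 + np.exp(-(flat @ W + b)))   # dense + sigmoid: (2, 1)

print(embedded.shape, flat.shape, probs.shape)
```

Since the embedding weights are frozen (`trainable=False`), only the dense layer's weights and bias would be learned in the real model.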


In [None]:
# Fit the model here
# Use your code from Assignment 1

In [None]:
# Run the model here
# Use your code from Assignment 1
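
The embedding model expects every input row to have exactly `input_length` entries, so the integer sequences have to be padded or truncated to that length before fitting. Keras provides `pad_sequences` (imported at the top of this notebook) for this purpose. The pure-Python sketch below imitates its default behaviour, which pads and truncates at the front, on a single sequence; `pad_pre` is a hypothetical helper name and `maxlen` is assumed to be positive.

```python
def pad_pre(seq, maxlen, value=0):
    """Left-pad with `value`, or keep only the last `maxlen` items."""
    seq = list(seq)[-maxlen:]
    return [value] * (maxlen - len(seq)) + seq

print(pad_pre([7, 8], 4))           # [0, 0, 7, 8]
print(pad_pre([1, 2, 3, 4, 5], 4))  # [2, 3, 4, 5]
```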

In [None]:
print (None) # how the two models are different

# 5. Experiment with another set of word embeddings.

## Run the same model using a different embeddings matrix (for example, a different size for GloVe; word2vec; embeddings obtained from different corpora). Compare the obtained result with the result from part 2. 

### Compare the two models.
### Report the result. 

In [None]:
print (None) # how the two embeddings models are different

# 6. (optional) Experiment with word embeddings.

### Update the GloVe matrix so that it contains all 400000 word vectors.


### You can find many online resources with word vector operations. For example, 
### https://datascience-enthusiast.com/DL/Operations_on_word_vectors.html

### Use these functions, or write your own functions, to find interesting information / connections that could be deduced using the word embeddings. 

### Report your results
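
As a starting point for writing your own functions, the sketch below shows cosine similarity and a simple analogy query. The 2-dimensional vectors are chosen by hand for illustration and are not real GloVe values; `cosine_similarity`, `most_similar` and `toy` are hypothetical names. With real embeddings, the same functions could be applied to the `embeddings_index` dictionary.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def most_similar(target, vectors, exclude=()):
    """Word whose vector has the highest cosine similarity to `target`."""
    candidates = (w for w in vectors if w not in exclude)
    return max(candidates, key=lambda w: cosine_similarity(vectors[w], target))

# Hypothetical 2-d vectors chosen by hand, not real GloVe values
toy = {
    'man':    np.array([1.0, 0.0]),
    'woman':  np.array([1.0, 1.0]),
    'king':   np.array([3.0, 0.0]),
    'queen':  np.array([3.0, 1.0]),
    'prince': np.array([3.0, -0.5]),
    'apple':  np.array([-1.0, 2.0]),
}

# Analogy "man is to woman as king is to ?": king - man + woman
target = toy['king'] - toy['man'] + toy['woman']
print(most_similar(target, toy, exclude=('man', 'woman', 'king')))  # queen
```

Excluding the query words matters in practice, because the nearest vector to the target is often one of the inputs.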