## Getting and preparing data

Later on we will use `sklearn.feature_extraction.text.CountVectorizer` to create a bag-of-words set of features. Although it can read documents directly from file names, we pass it a sequence of documents instead, because we also need to attach a sentiment label to each review. In this dataset the label comes from the directory (`pos` or `neg`) that contains each file.

In [2]:
import os

# Root of the extracted aclImdb dataset (adjust to your local path)
DATA_DIR = "/Users/romilvasani/USC/CSCI544/Final Project/dataset/aclImdb"

def load_reviews(directory, label):
    """Read every .txt file under `directory` and pair its text with `label`."""
    reviews = []
    for root, dirs, files in os.walk(directory):
        for file in files:
            if file.endswith(".txt"):
                # the with-block closes each file after reading
                with open(os.path.join(root, file), "r") as file_open:
                    reviews.append([file_open.read(), label])
    return reviews

# reviews from the train/ directories
data_list = (load_reviews(os.path.join(DATA_DIR, "train", "pos"), "POSITIVE")
             + load_reviews(os.path.join(DATA_DIR, "train", "neg"), "NEGATIVE"))

# reviews from the test/ directories
data_list_test = (load_reviews(os.path.join(DATA_DIR, "test", "pos"), "POSITIVE")
                  + load_reviews(os.path.join(DATA_DIR, "test", "neg"), "NEGATIVE"))


Now that the reviews are loaded from the local files, we can put them into data frames for processing.

In [3]:
import pandas as pd

# NOTE: as written, train_data_df is built from data_list_test (the test/ directories)
# and test_data_df from data_list (the train/ directories).
train_data_df = pd.DataFrame(data_list_test, columns=["Text","Sentiment"])
test_data_df = pd.DataFrame(data_list, columns=["Text","Sentiment"])

In [4]:
train_data_df.shape

(25000, 2)

In [5]:
test_data_df.shape

(25000, 2)

Both data frames hold 25,000 reviews in two columns, `Text` and `Sentiment`.

Let's check the first few lines of the train data.  

In [6]:
train_data_df.head()

| | Text | Sentiment |
|---|---|---|
| 0 | I went and saw this movie last night after bei... | POSITIVE |
| 1 | Actor turned director Bill Paxton follows up h... | POSITIVE |
| 2 | As a recreational golfer with some knowledge o... | POSITIVE |
| 3 | I saw this film in a sneak preview, and it is ... | POSITIVE |
| 4 | Bill Paxton has taken the true story of the 19... | POSITIVE |


And the test data.

In [7]:
test_data_df.head()

| | Text | Sentiment |
|---|---|---|
| 0 | Bromwell High is a cartoon comedy. It ran at t... | POSITIVE |
| 1 | Homelessness (or Houselessness as George Carli... | POSITIVE |
| 2 | Brilliant over-acting by Lesley Ann Warren. Be... | POSITIVE |
| 3 | This is easily the most underrated film inn th... | POSITIVE |
| 4 | This is not the typical Mel Brooks film. It wa... | POSITIVE |


Let's count how many labels we have for each sentiment class.  

In [8]:
train_data_df.Sentiment.value_counts()

NEGATIVE    12500
POSITIVE    12500
Name: Sentiment, dtype: int64

Finally, let's calculate the average number of words per review. We can do this with a list comprehension that counts the words in each review.

In [9]:
import numpy as np 

np.mean([len(s.split(" ")) for s in train_data_df.Text])

228.51516000000001
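
Note that `split(" ")` splits on every single space, so two spaces in a row produce an empty string that is counted as a word, whereas `split()` with no argument splits on runs of whitespace. A small illustration on a made-up string (not taken from the dataset):

```python
# Hypothetical sample text with a double space between two words
sample = "A great  movie"

# split(" ") splits on each single space, so the double space yields an empty string
by_space = sample.split(" ")

# split() with no argument splits on any run of whitespace
by_whitespace = sample.split()

print(len(by_space), by_space)
print(len(by_whitespace), by_whitespace)
```

The average above may therefore be slightly higher than a count based on `split()` would give.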

## Preparing a *corpus*

The class [sklearn.feature_extraction.text.CountVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) in the wonderful `scikit-learn` Python library converts a collection of text documents to a matrix of token counts. This is just what we need for the bag-of-words linear classifier that we implement later on.  

First we need to initialize the vectorizer. We want to remove punctuation, lowercase the text, remove stop words, and stem words. `CountVectorizer` handles lowercasing and stop-word removal through its parameters, while punctuation removal and stemming are done by a custom `tokenize` function that we pass in. We can do this as follows.

In [10]:
import re, nltk
from sklearn.feature_extraction.text import CountVectorizer        
from nltk.stem.porter import PorterStemmer

#######
# based on http://www.cs.duke.edu/courses/spring14/compsci290/assignments/lab02.html
stemmer = PorterStemmer()
def stem_tokens(tokens, stemmer):
    stemmed = []
    for item in tokens:
        stemmed.append(stemmer.stem(item))
    return stemmed

def tokenize(text):
    # remove non letters
    text = re.sub("[^a-zA-Z]", " ", text)
    # tokenize
    tokens = nltk.word_tokenize(text)
    # stem
    stems = stem_tokens(tokens, stemmer)
    return stems
######## 

vectorizer = CountVectorizer(
    analyzer = 'word',
    tokenizer = tokenize,
    lowercase = True,
    stop_words = 'english',
    max_features = 100
)
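
The setting `max_features = 100` restricts the vocabulary to the 100 terms with the highest total count across the corpus. The idea can be sketched in plain Python. The documents and the helper name `top_terms` below are made up for illustration, and this is not how scikit-learn implements it internally.

```python
from collections import Counter

def top_terms(documents, max_features):
    """Return the `max_features` most frequent tokens, sorted alphabetically."""
    counts = Counter()
    for doc in documents:
        # whitespace splitting stands in for the real tokenizer
        counts.update(doc.lower().split())
    most_common = [term for term, _ in counts.most_common(max_features)]
    return sorted(most_common)

# Hypothetical mini corpus: "good" occurs 3 times, "movie" twice, the rest once
docs = ["good movie good plot", "bad movie", "good acting"]
print(top_terms(docs, 2))
```

A small limit like this keeps the feature matrix compact, at the cost of discarding every less frequent word.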

The method `fit_transform` does two things: first, it fits the model and learns the vocabulary; second, it transforms our corpus data into feature vectors. The input to `fit_transform` should be a list of strings, so we concatenate the train and test texts as follows.

In [11]:
corpus_data_features = vectorizer.fit_transform(train_data_df.Text.tolist() + test_data_df.Text.tolist())
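
To make the two steps concrete, here is a minimal sketch of a count vectorizer in plain Python. The class `TinyCountVectorizer` is a hypothetical, simplified stand-in written for this illustration. It is not scikit-learn's implementation and it skips stop words, stemming, and sparse storage.

```python
class TinyCountVectorizer:
    """Hypothetical stand-in used only to illustrate fit vs. transform."""

    def fit(self, documents):
        # Learn the vocabulary: map each distinct token to a column index
        tokens = sorted({tok for doc in documents for tok in doc.split()})
        self.vocabulary_ = {tok: i for i, tok in enumerate(tokens)}
        return self

    def transform(self, documents):
        # Count tokens per document; tokens outside the vocabulary are ignored
        rows = []
        for doc in documents:
            row = [0] * len(self.vocabulary_)
            for tok in doc.split():
                if tok in self.vocabulary_:
                    row[self.vocabulary_[tok]] += 1
            rows.append(row)
        return rows

    def fit_transform(self, documents):
        return self.fit(documents).transform(documents)

vec = TinyCountVectorizer()
matrix = vec.fit_transform(["good movie", "bad movie movie"])
print(vec.vocabulary_)
print(matrix)

# Words not seen during fit get no column
print(vec.transform(["good unseen"]))
```

Fitting on the concatenated train and test texts, as done above, guarantees that both sets share the same columns.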

NumPy arrays are easy to work with, so we convert the result to an array.

In [12]:
corpus_data_features_nd = corpus_data_features.toarray()
corpus_data_features_nd.shape

(50000, 100)
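
Converting to a dense array is affordable here because the matrix is small. As a rough estimate, assuming each count is stored as an 8-byte integer (the actual dtype may differ by platform):

```python
rows, cols = 50000, 100      # shape reported above
bytes_per_value = 8          # assumption: 64-bit integer counts

dense_bytes = rows * cols * bytes_per_value
print(dense_bytes / 1e6, "MB")
```

The dense size grows linearly with the number of columns, so with a much larger vocabulary it may be preferable to keep the sparse matrix that `fit_transform` returns.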

In [14]:
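# Take a look at the words in the vocabulary
# (get_feature_names() was removed in scikit-learn 1.2; newer versions provide get_feature_names_out())
vocab = vectorizer.get_feature_names()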
print(vocab)

['act', 'actor', 'actual', 'ani', 'anoth', 'bad', 'becaus', 'befor', 'believ', 'best', 'better', 'br', 'cast', 'charact', 'come', 'comedi', 'day', 'did', 'didn', 'direct', 'director', 'doe', 'doesn', 'don', 'effect', 'end', 'enjoy', 'everi', 'fact', 'feel', 'film', 'funni', 'girl', 'good', 'got', 'great', 'guy', 'ha', 'hi', 'horror', 'just', 'kill', 'know', 'life', 'like', 'littl', 'live', 'look', 'lot', 'love', 'm', 'make', 'man', 'mani', 'minut', 'movi', 'music', 'new', 'noth', 'old', 'onli', 'origin', 'peopl', 'perform', 'play', 'plot', 'point', 'pretti', 'quit', 'real', 'realli', 'role', 's', 'say', 'scene', 'seen', 'set', 'someth', 'star', 'start', 'stori', 't', 'thi', 'thing', 'think', 'thought', 'time', 'tri', 'turn', 'use', 've', 'veri', 'wa', 'want', 'watch', 'way', 'whi', 'work', 'world', 'year']
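
The vocabulary contains fragments such as `br`, `s`, `t`, and `didn`. These are consistent with the regex step in `tokenize`, which replaces every non-letter with a space, so HTML line breaks and apostrophes get split into separate pieces. A minimal check on a made-up review snippet, where whitespace splitting stands in for `nltk.word_tokenize`:

```python
import re

# Hypothetical snippet containing HTML line breaks and contractions
snippet = "It's great.<br /><br />I didn't expect that"

# Same substitution as in tokenize(): every non-letter becomes a space
cleaned = re.sub("[^a-zA-Z]", " ", snippet)
pieces = cleaned.lower().split()
print(pieces)
```

Stripping the HTML tags before vectorizing would be one way to keep `br` out of the vocabulary.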


We can also print the counts of each word in the vocabulary as follows.

In [15]:
# Sum up the counts of each vocabulary word
dist = np.sum(corpus_data_features_nd, axis=0)

# For each, print the vocabulary word and the number of times it 
# appears in the corpus (train and test data combined)
for tag, count in zip(vocab, dist):
    print(count, tag)

17494 act
13599 actor
10011 actual
15048 ani
8594 anoth
18544 bad
17715 becaus
8525 befor
7841 believ
12630 best
11459 better
201954 br
8762 cast
28364 charact
13260 come
7430 comedi
7850 day
12624 did
8768 didn
7450 direct
10050 director
11673 doe
8876 doesn
17661 don
7248 effect
19100 end
8490 enjoy
7968 everi
7358 fact
10353 feel
95890 film
8768 funni
7881 girl
30195 good
7234 got
18404 great
9257 guy
33400 ha
57725 hi
7507 horror
35186 just
7419 kill
15170 know
12938 life
45210 like
12435 littl
8718 live
19951 look
9723 lot
18221 love
10176 m
30035 make
11943 man
13486 mani
7441 minut
103284 movi
8587 music
8102 new
8396 noth
8811 old
23241 onli
7630 origin
18384 peopl
10733 perform
17374 play
13809 plot
8068 point
7258 pretti
7498 quit
9442 real
23095 realli
8386 role
129794 s
14992 say
21452 scene
13378 seen
8033 set
10219 someth
8450 star
8145 start
25285 stori
68322 t
151030 thi
16512 thing
17546 think
7604 thought
31968 time
12535 tri
7623 turn
10228 use
10244 ve
27729 veri
...
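
The call `np.sum(..., axis=0)` adds the matrix up column by column, giving one total per vocabulary word. The same operation on a tiny made-up count matrix, in plain Python:

```python
# Hypothetical count matrix: 3 documents (rows) by 3 words (columns)
words = ["bad", "good", "movie"]
counts = [
    [0, 1, 1],
    [1, 0, 2],
    [0, 2, 1],
]

# zip(*counts) iterates over columns; summing each column mirrors axis=0
totals = [sum(column) for column in zip(*counts)]

for tag, count in zip(words, totals):
    print(count, tag)
```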

## A bag-of-words linear classifier

In order to perform logistic regression in Python we use [sklearn.linear_model.LogisticRegression](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html). But first let's split our training data in order to get an evaluation set. We will use [sklearn.model_selection.train_test_split](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html), which lived in `sklearn.cross_validation` before scikit-learn 0.18.

In [18]:
from sklearn.model_selection import train_test_split  # sklearn.cross_validation in scikit-learn < 0.18

# remember that corpus_data_features_nd contains both our train and test data, so we need to exclude
# the test entries here
X_train, X_test, y_train, y_test  = train_test_split(
    corpus_data_features_nd[0:len(train_data_df)], 
    train_data_df.Sentiment,
    train_size=0.85, 
    random_state=1234)
print(X_test[0])

[1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 1 1 0 0 0
 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 1 0 0 1 0 1 0 1 0
 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0]
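
With `train_size=0.85` and 25,000 labeled rows, the split leaves 21,250 rows for training and 3,750 for evaluation. The mechanics can be sketched with a seeded shuffle in plain Python. The helper `simple_split` is hypothetical and does not reproduce scikit-learn's exact shuffling, so it would select different rows than the call above.

```python
import random

def simple_split(items, train_size, seed):
    """Shuffle indices with a fixed seed and cut them at the train fraction."""
    indices = list(range(len(items)))
    random.Random(seed).shuffle(indices)
    cut = round(len(items) * train_size)
    train = [items[i] for i in indices[:cut]]
    test = [items[i] for i in indices[cut:]]
    return train, test

rows = list(range(25000))
train_rows, test_rows = simple_split(rows, train_size=0.85, seed=1234)
print(len(train_rows), len(test_rows))
```

Fixing the seed, as `random_state=1234` does above, makes the split reproducible between runs.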


Now we are ready to train our classifier.

In [15]:
from sklearn.linear_model import LogisticRegression

log_model = LogisticRegression()
log_model = log_model.fit(X=X_train, y=y_train)

Now we use the classifier to label our evaluation set. We can use either `predict` for classes or `predict_proba` for probabilities.  
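
For a binary logistic regression, the probability is the logistic function applied to a weighted sum of the features, and the predicted class is the one with the higher probability. A sketch with made-up numbers (the weights, bias, and counts below are illustrative and are not taken from the trained `log_model`):

```python
import math

def sigmoid(z):
    """Logistic function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical weights, bias, and word counts for three features
weights = [0.8, -1.2, 0.1]
bias = -0.2
x = [2, 1, 0]

z = bias + sum(w * xi for w, xi in zip(weights, x))

# Probability of the positive class, analogous to predict_proba
p_positive = sigmoid(z)

# Thresholding at 0.5 gives the class, analogous to predict
label = "POSITIVE" if p_positive >= 0.5 else "NEGATIVE"

print(round(p_positive, 3), label)
```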

In [16]:
y_pred = log_model.predict(X_test)

The function [sklearn.metrics.classification_report](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html) reports precision, recall, and F1 score for each class of a classification model. See also [sklearn.metrics](http://scikit-learn.org/stable/modules/classes.html#sklearn-metrics-metrics) for other scores. Here we use it to check how well our classifier does on the evaluation set.  

In [17]:
from sklearn.metrics import classification_report

print(classification_report(y_test, y_pred))

             precision    recall  f1-score   support

   NEGATIVE       0.87      0.86      0.87      1920
   POSITIVE       0.85      0.87      0.86      1830

avg / total       0.86      0.86      0.86      3750
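
The three scores in each row can be computed by hand from the true and predicted labels. A minimal sketch on made-up labels. The helper `precision_recall_f1` is hypothetical and assumes the chosen class occurs at least once in both lists, otherwise it would divide by zero.

```python
def precision_recall_f1(y_true, y_pred, label):
    """Compute precision, recall, and F1 for one class from two label lists."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)  # true positives
    fp = sum(t != label and p == label for t, p in pairs)  # false positives
    fn = sum(t == label and p != label for t, p in pairs)  # false negatives
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical labels for five reviews
true_labels = ["POSITIVE", "POSITIVE", "POSITIVE", "NEGATIVE", "NEGATIVE"]
pred_labels = ["POSITIVE", "POSITIVE", "POSITIVE", "POSITIVE", "NEGATIVE"]

precision, recall, f1 = precision_recall_f1(true_labels, pred_labels, "POSITIVE")
print(precision, recall, round(f1, 3))
```

Note that the roles of the two lists are not symmetric: swapping them exchanges precision and recall.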



Finally, we can re-train our model with all the training data and evaluate it on the test set.

In [18]:
# train classifier
log_model = LogisticRegression()
log_model = log_model.fit(X=corpus_data_features_nd[0:len(train_data_df)], y=train_data_df.Sentiment)

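# get predictions
test_pred = log_model.predict(corpus_data_features_nd[len(train_data_df):])
actual_pred = test_data_df["Sentiment"].tolist()

# NOTE: classification_report expects (y_true, y_pred). With the predictions passed
# first, as here, the precision and recall columns are swapped and the support
# column counts predicted labels rather than true labels.
print(classification_report(test_pred, actual_pred))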

# sample some of them

# spl = range(len(test_pred))

# target = open("logistic_output.txt", 'w')

# # print text and labels
# for text, sentiment in zip(test_data_df.Text[spl], test_pred[spl]):
#     target.write(sentiment + "|" + text + "\n")
# target.close()

             precision    recall  f1-score   support

   NEGATIVE       0.86      0.85      0.86     12584
   POSITIVE       0.85      0.86      0.86     12416

avg / total       0.86      0.86      0.86     25000

