### NLP Part I 

dataset - IMDB (50k reviews in total)
train - 25k
test - 25k

In [2]:
# Read one review per line; the with-blocks close each file when done.
with open('movie_data/full_train.txt') as f:
    reviews_train = [line.strip() for line in f]
with open('movie_data/full_test.txt') as f:
    reviews_test = [line.strip() for line in f]

In [3]:
reviews_train[0]

'Bromwell High is a cartoon comedy. It ran at the same time as some other programs about school life, such as "Teachers". My 35 years in the teaching profession lead me to believe that Bromwell High\'s satire is much closer to reality than is "Teachers". The scramble to survive financially, the insightful students who can see right through their pathetic teachers\' pomp, the pettiness of the whole situation, all remind me of the schools I knew and their students. When I saw the episode in which a student repeatedly tried to burn down the school, I immediately recalled ......... at .......... High. A classic line: INSPECTOR: I\'m here to sack one of your teachers. STUDENT: Welcome to Bromwell High. I expect that many adults of my age think that Bromwell High is far fetched. What a pity that it isn\'t!'

In [4]:
import re
# Raw strings avoid invalid-escape warnings for \s, \[ and \-.
REPLACE_WITHOUT_SPACE = re.compile(r"[.;:!'?,\"()\[\]]")
REPLACE_WITH_SPACE = re.compile(r"(<br\s*/><br\s*/>)|(\-)|(\/)")

In [5]:
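def preprocess_reviews(reviews):
    # Lowercase each review and strip punctuation.
    reviews = [REPLACE_WITHOUT_SPACE.sub("", line.lower()) for line in reviews]
    # Note: despite the pattern's name, matches are replaced with "" here, not " ",
    # so "7/10" collapses into the single token "710" (presumably why '710'
    # shows up among the top positive features further down).
    reviews = [REPLACE_WITH_SPACE.sub("", line) for line in reviews]

    return reviews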
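
A quick way to see what the two patterns do is to run them on a made-up string. The sketch below is an illustration only: the sample text and the `clean` helper with its `gap` parameter are hypothetical, added to compare replacing matches of the second pattern with an empty string (as `preprocess_reviews` does) against replacing them with a space.

```python
import re

REPLACE_WITHOUT_SPACE = re.compile(r"[.;:!'?,\"()\[\]]")
REPLACE_WITH_SPACE = re.compile(r"(<br\s*/><br\s*/>)|(\-)|(\/)")

def clean(text, gap):
    """Hypothetical helper: same two steps as preprocess_reviews,
    with the replacement for the second pattern passed in as `gap`."""
    text = REPLACE_WITHOUT_SPACE.sub("", text.lower())
    return REPLACE_WITH_SPACE.sub(gap, text)

# Made-up review text, not taken from the dataset.
sample = "Well-made film.<br /><br />I give it 7/10!"

print(clean(sample, ""))   # wellmade filmi give it 710
print(clean(sample, " "))  # well made film i give it 7 10
```

With an empty replacement, words on either side of a hyphen, slash or line break are glued together, which changes the vocabulary the vectorizer sees.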

In [6]:
reviews_train_clean = preprocess_reviews(reviews_train)
reviews_test_clean = preprocess_reviews(reviews_test)

In [7]:
reviews_train_clean[0]

'bromwell high is a cartoon comedy it ran at the same time as some other programs about school life such as teachers my 35 years in the teaching profession lead me to believe that bromwell highs satire is much closer to reality than is teachers the scramble to survive financially the insightful students who can see right through their pathetic teachers pomp the pettiness of the whole situation all remind me of the schools i knew and their students when i saw the episode in which a student repeatedly tried to burn down the school i immediately recalled  at  high a classic line inspector im here to sack one of your teachers student welcome to bromwell high i expect that many adults of my age think that bromwell high is far fetched what a pity that it isnt'

### Vectorization

For a machine learning algorithm to work with this data, we’ll need to convert each review to a numeric representation, a step called vectorization.
The simplest form is one very large matrix with one column for every unique word in the corpus (here, the 25k training reviews the vectorizer is fit on). Each review then becomes one row of 0s and 1s, where a 1 means that the word for that column appears in the review. Each row of the matrix is therefore very sparse (mostly zeros). This is often loosely called one-hot encoding; more precisely, it is a binary bag-of-words representation.
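
To make the idea concrete, here is a from-scratch sketch of the binary word matrix on a three-document toy corpus. The corpus and variable names are made up for illustration, and the code is not how `CountVectorizer` is implemented (it does its own tokenization and returns a sparse matrix); it only shows the shape of the result.

```python
# Toy corpus (made up for illustration; not taken from the IMDB data).
docs = ["great movie", "bad movie", "great great acting"]

# One column per unique word, in sorted order.
vocab = sorted({word for doc in docs for word in doc.split()})
column = {word: j for j, word in enumerate(vocab)}

# One row per document: 1 if the word occurs at all, else 0.
matrix = []
for doc in docs:
    row = [0] * len(vocab)
    for word in doc.split():
        row[column[word]] = 1
    matrix.append(row)

print(vocab)   # ['acting', 'bad', 'great', 'movie']
for row in matrix:
    print(row)
# [0, 0, 1, 1]
# [0, 1, 0, 1]
# [1, 0, 1, 0]
```

Note that "great great acting" still gets a single 1 in the "great" column: a binary representation records presence, not counts.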

In [8]:
from sklearn.feature_extraction.text import CountVectorizer

In [9]:
cv = CountVectorizer(binary=True)
cv.fit(reviews_train_clean)
X = cv.transform(reviews_train_clean)
X_test_full = cv.transform(reviews_test_clean)

In [10]:
X.shape  # 25000 reviews x 141656 unique words (one column per word)

(25000, 141656)

### Build Classifier
Now that we’ve transformed our dataset into a format suitable for modeling, we can start building a classifier. Logistic regression is a good baseline model for several reasons: (1) linear models are easy to interpret, (2) they tend to perform well on sparse datasets like this one, and (3) they train very quickly compared to many other algorithms.

In [11]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

Note: The same targets/labels work for both training and testing because both datasets are structured the same way: the first 12.5k reviews are positive and the last 12.5k are negative.


In [12]:
target = [1 if i < 12500 else 0 for i in range(25000)]

In [13]:
X_train, X_test, y_train, y_test = train_test_split(X, target, train_size=0.75)

In [14]:
# Try several values of C (inverse regularization strength) on the held-out split.
for c in [0.01, 0.05, 0.25, 0.5, 1]:
    lr = LogisticRegression(C=c)
    lr.fit(X_train, y_train)
    print("Accuracy for C=%s: %s"
          % (c, accuracy_score(y_test, lr.predict(X_test))))



Accuracy for C=0.01: 0.87776
Accuracy for C=0.05: 0.88336
Accuracy for C=0.25: 0.88
Accuracy for C=0.5: 0.8792
Accuracy for C=1: 0.87744
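
Rather than reading the best value off the printout, the choice can be made in code. The small sketch below copies the accuracies printed above into a dictionary (the name `results` is hypothetical) and picks the C with the highest held-out accuracy. Because `train_test_split` is called without a fixed `random_state`, these numbers will differ slightly from run to run.

```python
# Held-out accuracies copied from the output above.
results = {0.01: 0.87776, 0.05: 0.88336, 0.25: 0.88, 0.5: 0.8792, 1: 0.87744}

# Pick the C whose accuracy is highest.
best_c = max(results, key=results.get)
print(best_c, results[best_c])  # 0.05 0.88336
```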


In [15]:
# The held-out accuracy above is highest for C=0.05, so retrain on all training data with that value.

final_model = LogisticRegression(C=0.05)
final_model.fit(X, target)
print("Final Accuracy: %s"
      % accuracy_score(target, final_model.predict(X_test_full)))

Final Accuracy: 0.88276


In [16]:
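# Top positive words: map each vocabulary word to its learned coefficient.
# get_feature_names() was removed in scikit-learn 1.2; on older versions
# that lack get_feature_names_out(), use cv.get_feature_names() instead.
feature_to_coef = {
    word: coef for word, coef in zip(
        cv.get_feature_names_out(), final_model.coef_[0]
    )
}
for best_positive in sorted(
    feature_to_coef.items(),
    key=lambda x: x[1],
    reverse=True)[:5]:
    print(best_positive)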

('excellent', 0.925621960754335)
('710', 0.771040250298678)
('perfect', 0.7517973059369191)
('great', 0.6705701564292714)
('amazing', 0.6264569091650866)


In [17]:
# Top negative words: the five most negative coefficients.
for best_negative in sorted(
    feature_to_coef.items(),
    key=lambda x: x[1])[:5]:
    print(best_negative)

('worst', -1.3688246996249862)
('waste', -1.1755848249870589)
('awful', -1.0101822286434057)
('poorly', -0.8715615733574955)
('boring', -0.8266797299739013)
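
These coefficients are what make the model easy to interpret: with binary features, the model adds up the coefficients of the words present in a review, adds an intercept, and passes the sum through the sigmoid function. The sketch below illustrates this with a few coefficients copied (rounded) from the outputs above. It rests on two loud assumptions: the intercept is set to 0 because the fitted value is not shown in the notebook, and every other word is treated as having coefficient 0, which is not true of the real model.

```python
import math

# Coefficients copied (rounded) from the outputs above.
coef = {"excellent": 0.926, "great": 0.671, "worst": -1.369, "boring": -0.827}
INTERCEPT = 0.0  # assumption: the fitted intercept is not shown in the notebook

def positive_probability(text):
    """Sigmoid of the summed coefficients of the distinct words in `text`.
    Words missing from `coef` contribute 0 (a simplification)."""
    words = set(text.lower().split())
    score = INTERCEPT + sum(coef.get(w, 0.0) for w in words)
    return 1.0 / (1.0 + math.exp(-score))

print(positive_probability("great film"))         # above 0.5, so predicted positive
print(positive_probability("worst boring film"))  # below 0.5, so predicted negative
```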


In [18]:
#test
r1 = ["Great movie"]
r1 = cv.transform(r1)

In [19]:
r1

<1x141656 sparse matrix of type '<class 'numpy.int64'>'
	with 2 stored elements in Compressed Sparse Row format>

In [23]:
p1 = final_model.predict(r1)
p1

array([1])

In [24]:
r2 = cv.transform(["Very bad movie"])
p2 = final_model.predict(r2)
p2

array([0])

In [25]:
r3 = ["Awesome","Terrible", "okay", "it was fun", "good morning", "fine"]
r3 = cv.transform(r3)
p3 = final_model.predict(r3)
p3

array([1, 0, 0, 1, 1, 1])