In [2]:
import pandas as pd
import numpy as np

df = pd.read_csv('data.csv')
df.head()


Unnamed: 0,target,review
0,0,The film starts with a manager (Nicholas Bell)...
1,0,It must be assumed that those who praised this...
2,0,"This movie could have been very good, but come..."
3,0,I watched this video at a friend's house. I'm ...
4,0,"A friend of mine bought this film for £1, and ..."


In [4]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(df['review'],
                                                    df['target'],
                                                    random_state=0)

In [6]:
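# X_train keeps the original index labels, so look up label 0 explicitly;
# X_train.iloc[0] would instead give the positionally first row after shuffling.
print('X_train first entry:\n\n', X_train.loc[0])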
print('\n\nX_train shape: ', X_train.shape)

X_train first entry:

 The film starts with a manager (Nicholas Bell) giving welcome investors (Robert Carradine) to Primal Park . A secret project mutating a primal animal using fossilized DNA, like ¨Jurassik Park¨, and some scientists resurrect one of nature's most fearsome predators, the Sabretooth tiger or Smilodon . Scientific ambition turns deadly, however, and when the high voltage fence is opened the creature escape and begins savagely stalking its prey - the human visitors , tourists and scientific.Meanwhile some youngsters enter in the restricted area of the security center and are attacked by a pack of large pre-historical animals which are deadlier and bigger . In addition , a security agent (Stacy Haiduk) and her mate (Brian Wimmer) fight hardly against the carnivorous Smilodons. The Sabretooths, themselves , of course, are the real star stars and they are astounding terrifyingly though not convincing. The giant animals savagely are stalking its prey and the group run afou

In [8]:
from sklearn.feature_extraction.text import CountVectorizer

vect = CountVectorizer().fit(X_train)


In [10]:
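# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
vect.get_feature_names_out()[::2000].tolist()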

['00',
 'aggelopoulos',
 'art',
 'befits',
 'brainer',
 'cataclysm',
 'cohabitant',
 'crazier',
 'deportivo',
 'downturn',
 'entwined',
 'feverishly',
 'gagnon',
 'groot',
 'him',
 'incorrectness',
 'joely',
 'landon',
 'lutzky',
 'memorize',
 'multinational',
 'odaka',
 'payal',
 'postmaster',
 'rahs',
 'retorts',
 'saugages',
 'shoving',
 'specked',
 'sullesteian',
 'theyare',
 'tutazema',
 'veer',
 'willingham']

In [12]:
len(vect.get_feature_names_out())

67544

In [14]:
X_train_vectorized = vect.transform(X_train)

In [15]:
X_train_vectorized

<18750x67544 sparse matrix of type '<class 'numpy.int64'>'
	with 2566410 stored elements in Compressed Sparse Row format>
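
To make the shape above concrete, here is a minimal sketch of what `CountVectorizer` does on a tiny corpus. The three documents and the `toy_*` names are hypothetical and are not part of the review data: each row of the result is a document, each column is a vocabulary token, and each cell is a raw count.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical three-document corpus, used only for illustration.
toy_docs = ["good film", "bad film", "good good plot"]

toy_vect = CountVectorizer().fit(toy_docs)

# vocabulary_ maps each token to its column index (tokens are ordered alphabetically).
vocab = {tok: int(idx) for tok, idx in toy_vect.vocabulary_.items()}

# transform() returns a sparse matrix; densify it only because the example is tiny.
toy_counts = toy_vect.transform(toy_docs).toarray().tolist()

print(vocab)
print(toy_counts)
```

On the real training set the same layout gives 18750 rows and 67544 columns, almost all of them zero, which is why the result is stored in a sparse format.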

In [17]:
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
model.fit(X_train_vectorized, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [21]:
from sklearn.metrics import roc_auc_score

# predict() returns hard 0/1 labels, so the AUC below equals balanced accuracy
predictions = model.predict(vect.transform(X_test))
print('AUC: ', roc_auc_score(y_test, predictions))

AUC:  0.878806240501
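
The cell above passes hard class labels to `roc_auc_score`. With binary labels the ROC curve has a single operating point, so the result coincides with balanced accuracy. The sketch below shows this on hypothetical labels and scores (none of these numbers come from the review data). On the fitted model, continuous scores could be obtained from `model.decision_function(...)` or `model.predict_proba(...)[:, 1]`, which would measure ranking quality instead.

```python
import numpy as np
from sklearn.metrics import balanced_accuracy_score, roc_auc_score

# Hypothetical ground truth and classifier scores, for illustration only.
y_true = np.array([0, 0, 0, 1, 1, 1])
scores = np.array([0.1, 0.4, 0.6, 0.45, 0.7, 0.9])

# Thresholding at 0.5 mimics what predict() returns.
hard_labels = (scores >= 0.5).astype(int)

auc_from_labels = roc_auc_score(y_true, hard_labels)
auc_from_scores = roc_auc_score(y_true, scores)
bal_acc = balanced_accuracy_score(y_true, hard_labels)

print('AUC from hard labels:', auc_from_labels)
print('Balanced accuracy:   ', bal_acc)
print('AUC from scores:     ', auc_from_scores)
```

The two AUC values differ because thresholding discards the ordering of examples that fall on the same side of 0.5.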


In [23]:
feature_names = np.array(vect.get_feature_names_out())
sorted_coef_index = model.coef_[0].argsort()

print('Smallest Coefs:\n{}\n'.format(feature_names[sorted_coef_index[:10]]))
print('Largest Coefs:\n{}'.format(feature_names[sorted_coef_index[:-11:-1]]))

Smallest Coefs:
['worst' 'waste' 'disappointment' 'disappointing' 'awful' 'boring' 'lacks'
 'poorly' 'laughable' 'mess']

Largest Coefs:
['carrey' 'funniest' 'wonderfully' 'erotic' 'excellent' 'perfect'
 'refreshing' 'superb' 'surprisingly' 'flight']
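
The slicing used above is compact, so here is a small sketch of how it works. The token names and coefficient values are hypothetical: `argsort()` returns indices in ascending order of coefficient, `[:k]` takes the k most negative, and `[:-(k + 1):-1]` walks backwards from the end to take the k most positive, largest first.

```python
import numpy as np

# Hypothetical tokens and coefficients, for illustration only.
names = np.array(['awful', 'fine', 'great', 'boring', 'superb'])
coefs = np.array([-2.0, 0.1, 1.5, -1.2, 2.3])

order = coefs.argsort()                   # indices from most negative to most positive

smallest = names[order[:2]].tolist()      # two most negative coefficients
largest = names[order[:-3:-1]].tolist()   # two most positive, largest first

print(smallest)
print(largest)
```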


In [24]:
from sklearn.feature_extraction.text import TfidfVectorizer

vect = TfidfVectorizer(min_df=5).fit(X_train)
len(vect.get_feature_names_out())

23617
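
The drop from 67544 to 23617 features comes from `min_df=5`, which discards tokens that appear in fewer than five documents. A minimal sketch on a hypothetical four-document corpus, using `min_df=2` so the effect is visible at this scale:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical corpus, for illustration only.
toy_docs = ["good film", "bad film", "good plot", "odd film"]

full = TfidfVectorizer().fit(toy_docs)            # keeps every token
pruned = TfidfVectorizer(min_df=2).fit(toy_docs)  # keeps tokens found in at least 2 documents

full_vocab = sorted(full.vocabulary_)
pruned_vocab = sorted(pruned.vocabulary_)

print(full_vocab)
print(pruned_vocab)
```

Only 'film' (three documents) and 'good' (two documents) survive the pruning; the tokens that occur in a single document are removed.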

In [26]:
X_train_vectorized = vect.transform(X_train)

model = LogisticRegression()
model.fit(X_train_vectorized, y_train)

predictions = model.predict(vect.transform(X_test))
print('AUC: ', roc_auc_score(y_test, predictions))

AUC:  0.890185343668


In [28]:
feature_names = np.array(vect.get_feature_names_out())
sorted_tfidf_index = X_train_vectorized.max(0).toarray()[0].argsort()

print('Smallest tfidf:\n{}\n'.format(feature_names[sorted_tfidf_index[:10]]))
print('Largest tfidf:\n{}'.format(feature_names[sorted_tfidf_index[:-11:-1]]))

Smallest tfidf:
['trajectory' 'dissolution' 'breathlessly' 'incisive' 'booed' 'dissolved'
 'attained' 'punishes' 'surname' 'oversee']

Largest tfidf:
['name' 'pokemon' 'steve' 'scanners' 'smallville' 'woo' 'botched' 'weller'
 'xica' 'bye']


In [29]:
sorted_coef_index = model.coef_[0].argsort()

print('Smallest Coefs:\n{}\n'.format(feature_names[sorted_coef_index[:10]]))
print('Largest Coefs:\n{}'.format(feature_names[sorted_coef_index[:-11:-1]]))

Smallest Coefs:
['worst' 'bad' 'awful' 'waste' 'boring' 'poor' 'nothing' 'terrible' 'worse'
 'no']

Largest Coefs:
['great' 'excellent' 'best' 'perfect' 'wonderful' 'well' 'amazing' 'love'
 'favorite' 'fun']
