## Classify positive/negative review

Three classification models attempted:
1. Logistic regression: easy to interpret
2. naive Bayes classifier: it is very quick, and performs reasonably well
3. Random Forest: more complex, non-linear model for even higher accuracy

We use accuracy as our metric, because the classes are almost balanced


In [2]:
import pickle
#Opening objects saved from NLP


In [5]:
# with open("vectorizer.pickle", "rb") as f:
#     vectorizer = pickle.load(f)

# with open("X_train.pickle", "rb") as f:
#     X_train = pickle.load(f)    

# with open("X_test.pickle", "rb") as f:
#     X_test = pickle.load(f)    
    
# with open("target_train.pickle", "rb") as f:
#     target_train = pickle.load(f)    

# with open("target_test.pickle", "rb") as f:
#     target_test = pickle.load(f)    

EOFError: Ran out of input

In [3]:
#Logistic Regression: no parameters to tune, because classes are almost balanced

from sklearn.linear_model import LogisticRegression

lr_clf = LogisticRegression(solver='lbfgs',max_iter=200)
lr_clf.fit(X_train, target_train)

NameError: name 'X_train' is not defined

In [19]:
#Mean accuracy for training set
lr_clf.score(X_train, target_train)

0.8764560819616256

In [20]:
#Mean accuracy for test set
lr_clf.score(X_test, target_test)

0.852680540518866

In [24]:
def get_top_values(lst,n,labels):
    #Give a list of values, find the indices with the highest n values
    #Return the labels for each of the indices
    return [labels[i] for i in np.argsort(lst)[::-1][:n]]
def get_bottom_values(lst,n,labels):
    #Give a list of values, find the indices with the lowest n values
    #Return the labels for each of the indices
    return [labels[i] for i in np.argsort(lst)[:n]]

In [29]:
ix_to_words = {ix:word for word,ix in vocab.items()}

top_positive_words = get_top_values(lst=lr_clf.coef_[0], n=10, labels=ix_to_words)
print(top_positive_words)

top_negative_words = get_bottom_values(lst=lr_clf.coef_[0], n=10, labels=ix_to_words)
print(top_negative_words)

['amazing', 'best', 'delicious', 'thank', 'awesome', 'excellent', 'highly', 'incredible', 'perfect', 'wow']
['worst', 'rude', 'ok', 'bland', 'terrible', 'horrible', 'disappointing', 'okay', 'however', 'mediocre']


In [30]:
#Naive Bayes

from sklearn.naive_bayes import MultinomialNB

nb_clf = MultinomialNB(alpha=1.0, fit_prior=True)
nb_clf.fit(X_train, target_train)

MultinomialNB()

In [31]:
#Accuracy for training set
nb_clf.score(X_train, target_train)

0.849469695450809

In [32]:
#Accuracy for test set
nb_clf.score(X_test, target_test)

0.8290228156702354

This accuracy score is not bad! It is quick to run and so definitely worth considering

In [37]:
#Random Forest: more complicated, hopefully improves on linear models

from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

#For reproduceable experiments
np.random.seed(5)

#Test different possibilities of max_features up until the actual number of features
total_number_of_features = X_train.shape[1]
log10_total_numbers_of_features = np.log10(total_number_of_features)
max_features = np.ceil(np.power(10,np.random.random(size=5)*log10_total_numbers_of_features)).astype(int)

#Check to make sure there are no duplicates
print(max_features)

[   12 13379    10 22559   207]


In [38]:
#Try different max_depths for trees:considering the number of features, we need deeper trees
max_depth = np.random.choice(range(10,50),size=5,replace=False) 
print(max_depth)

[38 36 16 27 33]


In [42]:
#Parameter tuning using grid search
rf_paramgrid = {
    'max_depth': max_depth,
    'n_estimators': (25,30,40), #For the sake of running speed, I'm limiting the no. of trees to 40
    'max_features': max_features
}

grid_rf = GridSearchCV(cv=3, 
                       estimator=RandomForestClassifier(),
                      param_grid=rf_paramgrid,
                      scoring='accuracy', n_jobs=2)

grid_rf.fit(X_train, target_train)

KeyboardInterrupt: 

In [None]:
print('After 3-fold randomized CV, Random Forest achieves a mean accuracy of {}'.format(grid_rf.best_score_))
print('The training accuracy is {}'.format(grid_rf.score(X_train, target_train)))

In [None]:
#Check the hyperparameters found
print(grid_rf.best_params_)

In [None]:
#Fine tuning around the parameter set found above

rf_paragrid_finetune = {
    
}

## Cluster Reviews with KMeans

Fit K-means clustering with training data, and then apply it on the entire dataset. Tune with elbow method.
Used sklearn.cluster.MiniBatchKMeans isntead of sklearn.cluster.KMeasn to speed up convergence

In [13]:
from sklearn.cluster import MiniBatchKMeans

#Define function to print top features (largest component of the centroid vector) of each cluster

def print_top_features(obj, ix_to_tokens, n=10):
    cluster_centers = obj.cluster_centers_
    labels = obj.labels_
    
    for i, centroid in enumerate(cluster_centers):
        top_n_features = np.argsort(centroid)[::-1][:n]
        top_n_words = [ix_to_tokens[feature_index] for feature_index in top_n_features]
        
        print('Top tokens from cluster {} (# of obs: {})'.format(i, (labels==i).sum()))
        print(top_n_words)

In [14]:
#Start with k=3, as a trial
kmeans = MiniBatchKMeans(n_clusters=3, batch_size=500, max_iter=100, n_init=5)
kmeans.fit(X_train)
index ={index:word for word,index in vocab.items()}
print_top_features(kmeans,index,n=10)

Top tokens from cluster 0 (# of obs: 124893)
['pizza', 'place', 'good', 'food', 'get', 'just', 'like', 'one', 'go', 'all']
Top tokens from cluster 1 (# of obs: 84760)
['great', 'food', 'service', 'place', 'very', 'good', 'amazing', 'friendly', 'back', 'will']
Top tokens from cluster 2 (# of obs: 85139)
['good', 'chicken', 'food', 'ordered', 'very', 'like', 'just', 'all', 'place', 'really']


### Experiment with different K

Attempt to find the optimal K using elbow method

In [16]:
from Elbow import elbow_tuner

k_to_try = range(3, 100, 9)

intra_cluster_sum_of_squares = elbow_tuner(X_train, k_to_try)

plt.plot(k_to_try, intra_cluster_sum_of_squares, '-+')
plt.xlabel('K (Number of Clusters)')
plt.ylabel('Mean Intra-cluster Sum of Squares')

MemoryError: Unable to allocate 132. GiB for an array with shape (166347, 106763) and data type float64

In [18]:
kmeans_final = MiniBatchKMeans(n_clusters=25).fit(X_train)
print_top_features(kmeans_final,index,n=8)

Top tokens from cluster 0 (# of obs: 1)
['desserts', 'tasting', 'belief', '125', 'highly', 'talented', 'will', 'blow']
Top tokens from cluster 1 (# of obs: 1)
['canele', 'pastry', 'bibingkus', 'filo', 'casino', 'ganache', 'danish', 'arm']
Top tokens from cluster 2 (# of obs: 8275)
['burger', 'fries', 'burgers', 'good', 'place', 'great', 'cheese', 'food']
Top tokens from cluster 3 (# of obs: 1)
['beans', 'packin', 'owner', 'cook', 'links', 'location', 'work', 'muzik']
Top tokens from cluster 4 (# of obs: 1)
['motioned', 'backs', 'warmly', 'escorted', 'headed', 'among', 'nobody', 'filipino']
Top tokens from cluster 5 (# of obs: 1)
['overendeared', 'frapps', 'probably', 'clientele', 'should', 'coffees', 'hire', 'teas']
Top tokens from cluster 6 (# of obs: 1)
['carlos', 'lil', 'spot', 'scrolling', 'caterpillar', 'ig', 'gorilla', 'errands']
Top tokens from cluster 7 (# of obs: 1)
['labeled', 'stations', 'country', 'complaint', 'liked', 'buffet', 'seated', 'took']
Top tokens from cluster 8 (

In [24]:
(kmeans_final.labels_ == 24).sum()

8893

In [22]:
kmeans_final.labels_ .shape

(294792,)

In [36]:
# list to store observations per cluster
obs_per_cluster = []
for cluster_id in range(24):
    obs_per_cluster.append( (kmeans_final.labels_ == cluster_id).sum() )

# convert into numpy array, which allows np.argsort() to work / speeds it up
# obs_per_cluster = np.array(obs_per_cluster)

# iterate through results of argsort() in reverse, to view number of observations in descending order
for cluster_id in reversed(np.argsort(obs_per_cluster)):
    print(cluster_id, obs_per_cluster[cluster_id])

23 85451
14 66053
11 22467
19 21227
8 20725
20 13392
16 12804
22 9245
18 8507
2 8275
10 6836
21 5571
9 5334
15 2
12 1
13 1
7 1
6 1
5 1
4 1
3 1
17 1
1 1
0 1


In [32]:
np.argsort?

[0;31mSignature:[0m [0mnp[0m[0;34m.[0m[0margsort[0m[0;34m([0m[0ma[0m[0;34m,[0m [0maxis[0m[0;34m=[0m[0;34m-[0m[0;36m1[0m[0;34m,[0m [0mkind[0m[0;34m=[0m[0;32mNone[0m[0;34m,[0m [0morder[0m[0;34m=[0m[0;32mNone[0m[0;34m)[0m[0;34m[0m[0;34m[0m[0m
[0;31mDocstring:[0m
Returns the indices that would sort an array.

Perform an indirect sort along the given axis using the algorithm specified
by the `kind` keyword. It returns an array of indices of the same shape as
`a` that index data along the given axis in sorted order.

Parameters
----------
a : array_like
    Array to sort.
axis : int or None, optional
    Axis along which to sort.  The default is -1 (the last axis). If None,
    the flattened array is used.
kind : {'quicksort', 'mergesort', 'heapsort', 'stable'}, optional
    Sorting algorithm. The default is 'quicksort'. Note that both 'stable'
    and 'mergesort' use timsort under the covers and, in general, the
    actual implementation will v