## Goal
Using Amazon Food reviews
1. Featurize/Vectorize - BoW, tfIDF, Average Word2Vec, tfIDF Word2Vec
2. Train/Test split using time based slicing as a 70-30 split
3. Use 10-fold cross validation to find optimal/best K in KNN for each of the vectorization
4. Report test accuracy with the best K for BoW, tfIDF, Average Word2Vec, tfIDF Word2Vec
5. Try brute force and KD-Tree
6. A subset of the data will be used especially for brute force

### Data
The Amazon fine food reviews dataset is available [here](https://www.kaggle.com/snap/amazon-fine-food-reviews/downloads/database.sqlite/2). The SQLLite version(database.sqlite) of the dataset is used. A folder called 'amazon-fine-food-reviews' needs to be created in the working directory and downloaded to that folder.

Google Word2Vec (GoogleNews-vectors-negative300.bin) is also used and those are available [here](https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit)

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import TimeSeriesSplit, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.feature_extraction.text import TfidfTransformer, TfidfVectorizer, CountVectorizer
from sklearn.metrics import confusion_matrix, f1_score, classification_report, make_scorer
from sklearn import metrics
from sklearn.decomposition import TruncatedSVD

import itertools, pickle, random, sqlite3, nltk, string
from pathlib import Path
from scipy import sparse

import warnings
warnings.filterwarnings("ignore")

import seaborn as sns
from nltk.stem.porter import PorterStemmer
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.stem.wordnet import WordNetLemmatizer

from gensim.models import Word2Vec, KeyedVectors

### Load Data

In [2]:
# using the SQLite Table to read data.
con = sqlite3.connect('./amazon-fine-food-reviews/database.sqlite') 

#filtering only positive and negative reviews i.e. 
# not considering those reviews with Score=3
filtered_data = pd.read_sql_query(""" SELECT * FROM Reviews WHERE Score != 3 """, con) 

In [3]:
# Give reviews with Score>3 a positive rating, and reviews with a score<3 a negative rating.
def partition(x):
    if x < 3:
        return 'negative'
    return 'positive'

#change reviews with score less than 3 to be positive and vice-versa
actualScore = filtered_data['Score']
positiveNegative = actualScore.map(partition) 
filtered_data['Score'] = positiveNegative
print ('Total number of reviews is {} with {} features each'.format(*filtered_data.shape))
filtered_data.head()

Total number of reviews is 525814 with 10 features each


Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text
0,1,B001E4KFG0,A3SGXH7AUHU8GW,delmartian,1,1,positive,1303862400,Good Quality Dog Food,I have bought several of the Vitality canned d...
1,2,B00813GRG4,A1D87F6ZCVE5NK,dll pa,0,0,negative,1346976000,Not as Advertised,Product arrived labeled as Jumbo Salted Peanut...
2,3,B000LQOCH0,ABXLMWJIXXAIN,"Natalia Corres ""Natalia Corres""",1,1,positive,1219017600,"""Delight"" says it all",This is a confection that has been around a fe...
3,4,B000UA0QIQ,A395BORC6FGVXV,Karl,3,3,negative,1307923200,Cough Medicine,If you are looking for the secret ingredient i...
4,5,B006K2ZZ7K,A1UQRSCLF8GW1T,"Michael D. Bigham ""M. Wassir""",0,0,positive,1350777600,Great taffy,Great taffy at a great price. There was a wid...


### EDA

#### Remove Duplicates

In [4]:
#Sorting data according to ProductId in ascending order
sorted_data=filtered_data.sort_values('ProductId', axis=0, ascending=True, inplace=False, kind='quicksort', na_position='last')
#Deduplication of entries
final=sorted_data.drop_duplicates(subset={"UserId","ProfileName","Time","Text"}, keep='first', inplace=False)

#### Data Cleanup

In [5]:
final=final[final.HelpfulnessNumerator<=final.HelpfulnessDenominator]
#Before starting the next phase of preprocessing lets see the number of entries left
print(final.shape)

#How many positive and negative reviews are present in our dataset?
final['Score'].value_counts()

(364171, 10)


positive    307061
negative     57110
Name: Score, dtype: int64

#### Note:
The dataset is imbalanced. So the metric to evaluate the model should not be accuracy, it could be confusion matrix 

### PreProcessing

#### Stemming, Stop word remove and Lemmatization

In [6]:
stop = set(stopwords.words('english')) #set of stopwords
sno = nltk.stem.SnowballStemmer('english') #initialising the snowball stemmer

def cleanhtml(sentence): #function to clean the word of any html-tags
    cleanr = re.compile('<.*?>')
    cleantext = re.sub(cleanr, ' ', sentence)
    return cleantext
def cleanpunc(sentence): #function to clean the word of any punctuation or special characters
    cleaned = re.sub(r'[?|!|\'|"|#]',r'',sentence)
    cleaned = re.sub(r'[.|,|)|(|\|/]',r' ',cleaned)
    return  cleaned
print('The stop words are \n {}'.format(stop))
print('************************************************************')
print('The stem word for tasty is {}'.format(sno.stem('tasty')))

The stop words are 
 {'than', 'has', 'of', 'from', 'which', "wouldn't", 'at', 'will', 'as', 'such', 'about', 'very', 'an', 's', 'above', 'now', 'when', "hadn't", 'wasn', 'who', 'nor', "it's", 'we', 'because', 'too', 'you', 'or', 'there', 'few', "you'd", "doesn't", "haven't", 'same', 'but', 'then', 'over', 'o', 'a', "aren't", 'having', 'him', 'd', 'don', 'here', 'all', 'both', 'shouldn', 'ma', 'if', "shan't", "don't", 'its', 'can', 'before', "she's", 'll', 'into', 'each', 'are', 'under', 'myself', 'in', "shouldn't", "won't", 'won', 'being', 'the', 'mustn', 've', 'itself', 'whom', 'needn', 'she', 'between', 'her', "you're", 'them', "that'll", 'during', 'other', 'these', 'out', 'himself', 'up', 'more', 'couldn', 'herself', 'and', 'shan', 'your', 'until', 'not', 'own', 'only', "you've", "you'll", 'so', 'against', "hasn't", 'their', 'y', "should've", 'further', 'is', 'with', 'ourselves', 'wouldn', 'yourself', 'below', 'me', 'any', 'haven', 'does', 'i', 'ours', 'hers', 'once', 'how', 'isn', 

We need 3 types of cleaned data so we will use 3 different variables to store this data:
1. For Bag of Words and TfIDF we need the stop words to be removed and stemmed
2. For Bigrams and trigrams we do not need to remove the stop words but we can stem the words
3. For Avg Word2Vec and TfIDF Word2Vec we can remove the stopwords but not stem the words if we plan to use Google's word2vec model.

In [7]:
i=0
str_b=' '
str_n=' '
str_w=' '
bow_tfidf_string=[]
ngrams_string=[]
word2vec_string=[]
s=''
for sent in final['Text'].values:
    filtered_sentence_b=[]
    filtered_sentence_n=[]
    filtered_sentence_w=[]
    
    sent=cleanhtml(sent) # remove HTMl tags
    for w in sent.split():
        for cleaned_words in cleanpunc(w).split():
            if((cleaned_words.isalpha()) & (len(cleaned_words)>2)): 
                s=(sno.stem(cleaned_words.lower())).encode('utf8')
                filtered_sentence_n.append(s) #for ngrams we need stemmed words
                if(cleaned_words.lower() not in stop):
                    filtered_sentence_b.append(s) #for BoW and tfidf we need stemmed and stop words removed data
                    filtered_sentence_w.append(cleaned_words.lower().encode('utf8')) # for Word2Vec we need non stemmed and stop words removed data
                else:
                    continue
            else:
                continue 
    
    
    str_n = b" ".join(filtered_sentence_n)
    str_b = b" ".join(filtered_sentence_b)
    str_w = b" ".join(filtered_sentence_w)
    
    bow_tfidf_string.append(str_b)
    ngrams_string.append(str_n)
    word2vec_string.append(str_w)

    i+=1

In [8]:
final['BowTfIDFText']=bow_tfidf_string 
final['BowTfIDFText']=final['BowTfIDFText'].str.decode("utf-8")

final['nGramsText']=ngrams_string 
final['nGramsText']=final['nGramsText'].str.decode("utf-8")

final['Word2VecText']=word2vec_string 
final['Word2VecText']=final['Word2VecText'].str.decode("utf-8")

In [9]:
# store final table into an SQlLite table for future.
final_dbconn = sqlite3.connect('./amazon-fine-food-reviews/finaldb.sqlite')
c=final_dbconn.cursor()
final_dbconn.text_factory = str
final.to_sql('Reviews', final_dbconn,  schema=None, if_exists='replace', index=True, index_label=None, chunksize=None, dtype=None)

### Time Based Slicing
The below function creates a cross validation dataset and fits the model. This will be used across BoW, tfIDF, Average Word2Vec and tfIDF Word2Vec

In [2]:
def plot_confusion_matrix(cm, classes,
                          normalize=False,
                          title='Confusion matrix',
                          figsize = (5,3),
                          cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
#         print("Normalized confusion matrix")
#     else:
#         print('Confusion matrix, without normalization')

#     print(cm)

#     plt.figure(figsize=(4,2))
    plt.figure(figsize=figsize)
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)

    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt),
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    plt.ylabel('True label')
    plt.xlabel('Predicted label')
    plt.tight_layout()
    plt.show()
    
def fit_predict(estimator, X_train, y_train, X_test, y_test):
    # fitting the model on crossvalidation train
    estimator.fit(X_train, y_train)

    # predict the response on the crossvalidation train
    y_pred = estimator.predict(X_test)
    
    return y_pred

### Run Classifier
Given a dataset of X, y the below function
1. Splits the dataset into train and test.
2. Executes KNN classifier using bruteforce and KDTree.
3. Executes the classifier for varying values of 'k'.
4. Calculates the optimal 'k' value from the above step.
5. Uses the optimal 'k' value to run the classifier against the test dataset.

In [3]:
def split_data(data, sort_by_time='Y', test_size=0.3):
    
    if sort_by_time == 'Y':
        data = data.sort_values(by='Time')
        data.reset_index(inplace=True,drop=True)
        
    train_index = int(np.floor(data.shape[0] * (1- test_size)))

    X_train = data.loc[:train_index]
    y_train = X_train.Score
    X_train.drop(['Score'], axis=1, inplace=True)
    

    X_test = data.loc[train_index:]
    y_test = X_test.Score
    X_test.drop(['Score'], axis=1, inplace=True)
    
    return X_train, y_train, X_test, y_test

def save_vectors_to_db(filename, df):
    final_dbconn = sqlite3.connect('./amazon-fine-food-reviews/finaldb.sqlite')
    c=final_dbconn.cursor()
    final_dbconn.text_factory = str
    final.to_sql('Reviews', final_dbconn,  schema=None, if_exists='replace', index=True, index_label=None, chunksize=None, dtype=None)
    
def create_save_vectors(data, vectorize_technique):
    if vectorize_technique not in ['bow', 'tfidf', 'Gavgw2v', 'Gtfidfw2v', 'Oavgw2v', 'Otfidfw2v']:
        print ('Allowed vectorization techniques are Bag Of Words(bow), TF-IDF(tfidf), \
                Average Word2Vec Using Google vectors(Gavgw2v), TF-IDF Word2Vec Using Google vectors(Gtfidfw2v), \
                Average Word2Vec Using Own Model(Oavgw2v), TF-IDF Word2Vec Using Own Model(Otfidfw2v)')
        return
    
    X_train, y_train, X_test, y_test = split_data(data, sort_by_time='Y')
    
    if vectorize_technique == 'bow':
        X_train = vectorize_bow(X_train)
        filename = './amazon-fine-food-reviews/bowvectors.sqlite'
        save_vectors_to_db(filename, X_train)

def run_classifier(data, vectorize_technique, algorithm, k=5):
    if vectorize_technique not in ['bow', 'tfidf', 'Gavgw2v', 'Gtfidfw2v', 'Oavgw2v', 'Otfidfw2v']:
        print ('Allowed vectorization techniques are Bag Of Words(bow), TF-IDF(tfidf), \
                Average Word2Vec Using Google vectors(Gavgw2v), TF-IDF Word2Vec Using Google vectors(Gtfidfw2v), \
                Average Word2Vec Using Own Model(Oavgw2v), TF-IDF Word2Vec Using Own Model(Otfidfw2v)')
        return
    
    if algorithm not in ['brute', 'kd_tree']:
        print ('Allowed algorithms are Brute Force(brute) and KD Tree(kd_tree)')
        return
    
    X_train, y_train, X_test, y_test = split_data(data, sort_by_time='Y')
    
    #Create X_train with just the vectors
    if vectorize_technique == 'bow':
        X_train, X_test = vectorize_bow(X_train, X_test, algorithm)
    elif vectorize_technique == 'tfidf':
        X_train, X_test = vectorize_tfidf(X_train, X_test, algorithm)
    elif vectorize_technique == 'Gavgw2v':
        X_train, X_test = vectorize_Gavgw2v(X_train, X_test, algorithm)
    elif vectorize_technique == 'Gtfidfw2v':
        X_train, X_test = vectorize_Gavgw2v(X_train, X_test, algorithm)
    elif vectorize_technique == 'Oavgw2v':
        X_train, X_test = vectorize_Oavgw2v(X_train, X_test, algorithm)
    elif vectorize_technique == 'Otfidfw2v':
        X_train, X_test = vectorize_Otfidfw2v(X_train, X_test, algorithm)
        
    cv = 10
    #Use TimeSeriesSplit for cross validation
    tscv = TimeSeriesSplit(n_splits=cv)
    
    # number of neighbors to try with
    #start from 2 to avoid underfitting the model
    neighbors = list(range(2, k+2))
    
    param_grid = {'n_neighbors': neighbors}
    knn_clf = KNeighborsClassifier(metric='cosine')
    
    print ('Algorithm - {}'.format(algorithm))
    print ('-' *80)
    train_test_split = tscv.split(X_train)
    print ('Doing Gridsearch .....')
    f1_scorer = make_scorer(f1_score, pos_label='positive')
    gs = GridSearchCV(knn_clf, param_grid, cv = train_test_split, n_jobs=-1,verbose=5, scoring=f1_scorer)

    gs.fit(X_train, y_train)
    print ('Calculating best score from grid search...')
    print ('Best score: ', gs.best_score_)

    print ('Best parameters set:')
    best_parameters = gs.best_estimator_.get_params()
    for param_name in sorted(param_grid.keys()):
        print (param_name, best_parameters[param_name])

    print ('Dimensions of X_test is {} and y_test is {}'.format (X_test.shape, y_test.shape))
    print ('Predicting with X_test.....')
    y_pred = gs.predict(X_test)
    print (classification_report(y_test, y_pred))

## Bag of Words

In [4]:
def get_bow_tfidfdata(no_of_records = 2000):
    #The original dataset has approximately 85% positive reviews and 15% negative reviews, so we will use the same % to make
    #a subset 
    pos_limit = int(no_of_records * 0.85)
    neg_limit = no_of_records - pos_limit

    final_dbconn = sqlite3.connect('./amazon-fine-food-reviews/finaldb.sqlite')
    final_dbconn.text_factory = str

    pos_qry = "SELECT time, BowTfIDFText, Score FROM Reviews where Score = 'positive' LIMIT " + str(pos_limit)
    pos_reviews = pd.read_sql_query(pos_qry, final_dbconn) 
    print('Positive reviews shape is {}'.format(pos_reviews.shape))

    neg_qry = "SELECT time, BowTfIDFText, Score FROM Reviews where Score = 'negative' LIMIT " + str(neg_limit)
    neg_reviews = pd.read_sql_query(neg_qry, final_dbconn) 
    print('Negative reviews shape is {}'.format(neg_reviews.shape))

    frames = [pos_reviews, neg_reviews]

    data = pd.concat(frames, ignore_index=True)
    print('Both positive and negative reviews combined together contains {} reviews'.format(data.shape[0]))
    return data

In [5]:
def run_svd(X_train_cnts, X_test_cnts, reduce_to):
    svd = TruncatedSVD(n_components=reduce_to)

    print ('Shape of X train cnts before SVD is {}'.format(X_train_cnts.shape))
    X_train_red = svd.fit_transform(X_train_cnts)
    print ('Shape of X train cnts after SVD is {}'.format(X_train_red.shape))
    explained_ratio = svd.explained_variance_ratio_
    print ('The explained ratio with {} components is {}'.format(reduce_to, np.sum(explained_ratio)))
    
    X_test_red = svd.fit_transform(X_test_cnts)
    return X_train_red, X_test_red

def vectorize_bow(X_train, X_test, algorithm):
    #Vectorize the reviews
    count_vect = CountVectorizer() 
    
    #Build vocabulary from X_train
    X_train_counts = count_vect.fit_transform(X_train['BowTfIDFText'].values)
    X_test_counts = count_vect.transform(X_test['BowTfIDFText'].values)
    
    print("Shape of vectorized train and test datasets are {} and {}".format(X_train_counts.get_shape(), X_test_counts.get_shape()) )
    print("Number of unique words in the dataset is ", X_train_counts.get_shape()[1])
    print ('Type of bow_counts ', type(X_train_counts))
    
    if algorithm == 'kd_tree':
        reduce_to = X_test_counts.get_shape()[0]
        if reduce_to > 2500:
            reduce_to = 2500
        X_train_counts, X_test_counts = run_svd(X_train_counts, X_test_counts, reduce_to)
        print ('Shape of X train counts are SVD is {}'.format(X_train_counts.shape))

    X_train_std_data = StandardScaler(with_mean = False).fit_transform(X_train_counts)
    print ('{} -Type of std_bow_data is {}'.format(algorithm, type(X_train_std_data)))
    print ('{}-Standardized Bag of Words review contains {} records with {} features each'.format(algorithm, *X_train_std_data.shape))

    #vectorize the test set with the same vocabulary as train data set
    if algorithm == 'kd_tree':
        print("{} - The shape of BOW vectorized test dataset is {}".format(algorithm, X_test_counts.shape))
    else:
        print("{} - The shape of our BOW vectorized test dataset is {}".format(algorithm, X_test_counts.get_shape()))

    X_test_std_data = StandardScaler(with_mean = False).fit_transform(X_test_counts)
    return X_train_std_data, X_test_std_data

In [8]:
no_of_records=30000
data = get_bow_tfidfdata(no_of_records)
run_classifier(data, vectorize_technique ='bow', algorithm='brute')

Positive reviews shape is (25500, 3)
Negative reviews shape is (4500, 3)
Both positive and negative reviews combined together contains 30000 reviews
Shape of vectorized train and test datasets are (21001, 20054) and (9000, 20054)
Number of unique words in the dataset is  20054
Type of bow_counts  <class 'scipy.sparse.csr.csr_matrix'>
brute -Type of std_bow_data is <class 'scipy.sparse.csr.csr_matrix'>
brute-Standardized Bag of Words review contains 21001 records with 20054 features each
brute - The shape of our BOW vectorized test dataset is (9000, 20054)
Algorithm - brute
--------------------------------------------------------------------------------
Doing Gridsearch .....
Fitting 10 folds for each of 5 candidates, totalling 50 fits




[CV] n_neighbors=2 ...................................................
[CV] n_neighbors=2 ...................................................
[CV] n_neighbors=2 ...................................................
[CV] n_neighbors=2 ...................................................
[CV] n_neighbors=2 ...................................................
[CV] n_neighbors=2 ...................................................
[CV] n_neighbors=2 ...................................................
[CV] n_neighbors=2 ...................................................
[CV] n_neighbors=2 ...................................................
[CV] n_neighbors=2 ...................................................
[CV] n_neighbors=3 ...................................................
[CV] n_neighbors=3 ...................................................
[CV] .......... n_neighbors=2, score=0.8752694795195566, total=   0.2s
[CV] n_neighbors=3 ...................................................
[CV] .

[Parallel(n_jobs=-1)]: Done  38 out of  50 | elapsed:   22.9s remaining:    7.2s


[CV] .......... n_neighbors=5, score=0.9082435187882318, total=   1.3s
[CV] .......... n_neighbors=6, score=0.9229885057471264, total=   0.9s
[CV] .......... n_neighbors=4, score=0.8997867803837953, total=   2.4s
[CV] .......... n_neighbors=5, score=0.9036603221083456, total=   1.7s
[CV] .......... n_neighbors=6, score=0.9234312032239494, total=   1.1s
[CV] .......... n_neighbors=4, score=0.8849878934624698, total=   1.9s
[CV] .......... n_neighbors=6, score=0.9041740152851263, total=   1.5s
[CV] .......... n_neighbors=5, score=0.9157188684747741, total=   1.9s
[CV] .......... n_neighbors=6, score=0.9023886759068122, total=   1.7s
[CV] .......... n_neighbors=5, score=0.9037730330505996, total=   1.8s
[CV] .......... n_neighbors=6, score=0.9095760450637415, total=   1.7s
[CV] .......... n_neighbors=6, score=0.9005917159763313, total=   1.9s


[Parallel(n_jobs=-1)]: Done  50 out of  50 | elapsed:   40.8s finished


Calculating best score from grid search...
Best score:  0.923612518832
Best parameters set:
n_neighbors 5
Dimensions of X_test is (9000, 20054) and y_test is (9000,)
Predicting with X_test.....
             precision    recall  f1-score   support

   negative       0.55      0.12      0.20      1632
   positive       0.83      0.98      0.90      7368

avg / total       0.78      0.82      0.77      9000



In [13]:
no_of_records=30000
data = get_bow_tfidfdata(no_of_records)
run_classifier(data, vectorize_technique ='bow', algorithm='kd_tree')

Positive reviews shape is (25500, 3)
Negative reviews shape is (4500, 3)
Both positive and negative reviews combined together contains 30000 reviews
Shape of vectorized train and test datasets are (21001, 20054) and (9000, 20054)
Number of unique words in the dataset is  20054
Type of bow_counts  <class 'scipy.sparse.csr.csr_matrix'>
Shape of X train cnts before SVD is (21001, 20054)
Shape of X train cnts after SVD is (21001, 2500)
The explained ratio with 2500 components is 0.9563980798221559
Shape of X train counts are SVD is (21001, 2500)
kd_tree -Type of std_bow_data is <class 'numpy.ndarray'>
kd_tree-Standardized Bag of Words review contains 21001 records with 2500 features each
kd_tree - The shape of BOW vectorized test dataset is (9000, 2500)
Algorithm - kd_tree
--------------------------------------------------------------------------------
Doing Gridsearch .....
Fitting 10 folds for each of 5 candidates, totalling 50 fits
[CV] n_neighbors=2 ....................................

[Parallel(n_jobs=-1)]: Done  38 out of  50 | elapsed:  2.5min remaining:   46.9s


[CV] n_neighbors=6 ...................................................
[CV] n_neighbors=6 ...................................................
[CV] .......... n_neighbors=6, score=0.9327635327635327, total=   7.2s
[CV] .......... n_neighbors=5, score=0.9053857350800582, total=  13.6s
[CV] .......... n_neighbors=6, score=0.9256388171116853, total=   8.9s
[CV] ........... n_neighbors=4, score=0.880901614377094, total=  13.6s
[CV] ........... n_neighbors=5, score=0.904860088365243, total=  15.0s
[CV] .......... n_neighbors=6, score=0.9236071223434807, total=  14.7s
[CV] .......... n_neighbors=5, score=0.9171303074670571, total=  15.9s
[CV] .......... n_neighbors=6, score=0.9076423723812334, total=  16.4s
[CV] .......... n_neighbors=5, score=0.9010600706713782, total=  15.8s
[CV] .......... n_neighbors=6, score=0.9031873696753052, total=  16.3s
[CV] .......... n_neighbors=6, score=0.9110320284697508, total=  16.0s
[CV] ........... n_neighbors=6, score=0.898542100565308, total=  18.7s


[Parallel(n_jobs=-1)]: Done  50 out of  50 | elapsed:  3.4min finished


Calculating best score from grid search...
Best score:  0.923700274162
Best parameters set:
n_neighbors 5
Dimensions of X_test is (9000, 2500) and y_test is (9000,)
Predicting with X_test.....
             precision    recall  f1-score   support

   negative       0.20      0.03      0.05      1632
   positive       0.82      0.98      0.89      7368

avg / total       0.71      0.80      0.74      9000



## TF-IDF

In [6]:
def vectorize_tfidf(X_train, X_test, algorithm):
    #Vectorize the reviews
    tf_idf_vect = TfidfVectorizer(ngram_range=(1,2))
    
    #Build vocabulary from X_train
    X_train_counts = tf_idf_vect.fit_transform(X_train['BowTfIDFText'].values)
    X_test_counts = tf_idf_vect.transform(X_test['BowTfIDFText'].values)
    
    print("Shape of vectorized train and test datasets are {} and {}".format(X_train_counts.get_shape(), X_test_counts.get_shape()) )
    print("Number of unique words in the dataset is ", X_train_counts.get_shape()[1])
    print ('Type of bow_counts ', type(X_train_counts))
    
    if algorithm == 'kd_tree':
        reduce_to = X_test_counts.get_shape()[0]
        X_train_counts, X_test_counts = run_svd(X_train_counts, X_test_counts, reduce_to)
        print ('Shape of X train counts are SVD is {}'.format(X_train_counts.shape))

    X_train_std_data = StandardScaler(with_mean = False).fit_transform(X_train_counts)
    print ('{} -Type of std_bow_data is {}'.format(algorithm, type(X_train_std_data)))
    print ('{}-Standardized Bag of Words review contains {} with {} features each'.format(algorithm, *X_train_std_data.shape))
    
    #vectorize the test set with the same vocabulary as train data set
    if algorithm == 'kd_tree':
        print("{} - The shape of vectorized test dataset is {}".format(algorithm, X_test_counts.shape))
    else:
        print("{} - The shape of vectorized test dataset is {}".format(algorithm, X_test_counts.get_shape()))
    
    X_test_std_data = StandardScaler(with_mean = False).fit_transform(X_test_counts)
    return X_train_std_data, X_test_std_data

In [7]:
no_of_records=30000
data = get_bow_tfidfdata(no_of_records)
run_classifier(data, vectorize_technique ='tfidf', algorithm='brute')

Positive reviews shape is (25500, 3)
Negative reviews shape is (4500, 3)
Both positive and negative reviews combined together contains 30000 reviews
Shape of vectorized train and test datasets are (21001, 430534) and (9000, 430534)
Number of unique words in the dataset is  430534
Type of bow_counts  <class 'scipy.sparse.csr.csr_matrix'>
brute -Type of std_bow_data is <class 'scipy.sparse.csr.csr_matrix'>
brute-Standardized Bag of Words review contains 21001 with 430534 features each
brute - The shape of vectorized test dataset is (9000, 430534)
Algorithm - brute
--------------------------------------------------------------------------------
Doing Gridsearch .....
Fitting 10 folds for each of 5 candidates, totalling 50 fits
[CV] n_neighbors=2 ...................................................
[CV] n_neighbors=2 ...................................................
[CV] n_neighbors=2 ...................................................
[CV] n_neighbors=2 ..................................

[Parallel(n_jobs=-1)]: Done  38 out of  50 | elapsed:   24.8s remaining:    7.8s


[CV] .......... n_neighbors=5, score=0.9081425673717763, total=   1.6s
[CV] .......... n_neighbors=4, score=0.9079535299374442, total=   1.9s
[CV] .......... n_neighbors=6, score=0.9284077892325315, total=   1.1s
[CV] .......... n_neighbors=6, score=0.9282845668387836, total=   1.4s
[CV] .......... n_neighbors=5, score=0.9071864998545242, total=   1.8s
[CV] .......... n_neighbors=4, score=0.9027567403817025, total=   2.1s
[CV] .......... n_neighbors=6, score=0.9071803852889667, total=   1.5s
[CV] .......... n_neighbors=5, score=0.9186214885606718, total=   2.1s
[CV] .......... n_neighbors=6, score=0.9058374889997066, total=   1.7s
[CV] .......... n_neighbors=5, score=0.9095697980684812, total=   2.2s
[CV] .......... n_neighbors=6, score=0.9197548876568427, total=   2.0s
[CV] .......... n_neighbors=6, score=0.9120521172638437, total=   2.2s


[Parallel(n_jobs=-1)]: Done  50 out of  50 | elapsed:   43.7s finished


Calculating best score from grid search...
Best score:  0.926368469959
Best parameters set:
n_neighbors 5
Dimensions of X_test is (9000, 430534) and y_test is (9000,)
Predicting with X_test.....
             precision    recall  f1-score   support

   negative       0.64      0.11      0.19      1632
   positive       0.83      0.99      0.90      7368

avg / total       0.80      0.83      0.77      9000



In [7]:
no_of_records=10000
data = get_bow_tfidfdata(no_of_records)
run_classifier(data, vectorize_technique ='tfidf', algorithm='kd_tree')

Positive reviews shape is (8500, 3)
Negative reviews shape is (1500, 3)
Both positive and negative reviews combined together contains 10000 reviews
Shape of vectorized train and test datasets are (7001, 191264) and (3000, 191264)
Number of unique words in the dataset is  191264
Type of bow_counts  <class 'scipy.sparse.csr.csr_matrix'>
Shape of X train cnts before SVD is (7001, 191264)
Shape of X train cnts after SVD is (7001, 3000)
The explained ratio with 3000 components is 0.6198217557360185
Shape of X train counts are SVD is (7001, 3000)
kd_tree -Type of std_bow_data is <class 'numpy.ndarray'>
kd_tree-Standardized Bag of Words review contains 7001 with 3000 features each
kd_tree - The shape of vectorized test dataset is (3000, 3000)
Algorithm - kd_tree
--------------------------------------------------------------------------------
Doing Gridsearch .....
Fitting 10 folds for each of 5 candidates, totalling 50 fits
[CV] n_neighbors=2 ..................................................

[Parallel(n_jobs=-1)]: Done  38 out of  50 | elapsed:   32.1s remaining:   10.1s


[CV] .......... n_neighbors=6, score=0.9488683989941326, total=   0.8s
[CV] n_neighbors=6 ...................................................
[CV] .......... n_neighbors=5, score=0.9106984969053935, total=   1.2s
[CV] n_neighbors=6 ...................................................
[CV] .......... n_neighbors=6, score=0.9166666666666666, total=   1.0s
[CV] n_neighbors=6 ...................................................
[CV] ........... n_neighbors=5, score=0.907488986784141, total=   1.1s
[CV] n_neighbors=6 ...................................................
[CV] .......... n_neighbors=6, score=0.9339055793991416, total=   1.4s
[CV] .......... n_neighbors=6, score=0.9274611398963729, total=   1.0s
[CV] n_neighbors=6 ...................................................
[CV] .......... n_neighbors=5, score=0.9164467897977132, total=   1.7s
[CV] n_neighbors=6 ...................................................
[CV] n_neighbors=6 ...................................................
[CV] .

[Parallel(n_jobs=-1)]: Done  50 out of  50 | elapsed:   42.3s finished


Calculating best score from grid search...
Best score:  0.924431922614
Best parameters set:
n_neighbors 5
Dimensions of X_test is (3000, 3000) and y_test is (3000,)
Predicting with X_test.....
             precision    recall  f1-score   support

   negative       0.20      0.02      0.04       525
   positive       0.83      0.98      0.90      2475

avg / total       0.72      0.81      0.75      3000



## Word2Vec
There are 2 ways to do word2vec
1. Use Google's model which was built using Google's news dataset. Since this was built from a large dataset, this gives us a 300 dimension vector which will be richer.
2. Train our own model. Since we have a small dataset, we can build a 50 dimensional vector with this.
We will try both ways below

As a first step we need to build a list of sentence from the Word2VecText we have saved earlier. This will be used to get the vectors from Google's model as well as the one that we are going to build

In [7]:
def get_w2vdata(no_of_records = 2000):
    #The original dataset has approximately 85% positive reviews and 15% negative reviews, so we will use the same % to make
    #a subset 
    pos_limit = int(no_of_records * 0.85)
    neg_limit = no_of_records - pos_limit

    final_dbconn = sqlite3.connect('./amazon-fine-food-reviews/finaldb.sqlite')
    final_dbconn.text_factory = str

    pos_qry = "SELECT time, Word2VecText, Score FROM Reviews where Score = 'positive' LIMIT " + str(pos_limit)
    pos_reviews = pd.read_sql_query(pos_qry, final_dbconn) 
    print('Positive reviews shape is {}'.format(pos_reviews.shape))

    neg_qry = "SELECT time, Word2VecText, Score FROM Reviews where Score = 'negative' LIMIT " + str(neg_limit)
    neg_reviews = pd.read_sql_query(neg_qry, final_dbconn) 
    print('Negative reviews shape is {}'.format(neg_reviews.shape))

    frames = [pos_reviews, neg_reviews]

    data = pd.concat(frames, ignore_index=True)
    print('Both positive and negative reviews combined together contains {} reviews'.format(data.shape[0]))

    return data

#### Average Word2Vec With Google's model

In [8]:
#set up google model
gmodel = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)
gmodel_words = list(gmodel.wv.vocab)    

In [9]:
def vectorize_Gavgw2v(X_train, X_test, algorithm):
    train_list_of_sent=[]
    test_list_of_sent=[]
    print ('Splitting train data into sentences.....')
    for sent in X_train['Word2VecText'].values:
        train_list_of_sent.append(sent.split())
    
    print ('Splitting test data into sentences.....')
    for sent in X_test['Word2VecText'].values:
        test_list_of_sent.append(sent.split())
        
    print ('Vectorizing train data.....')
    train_sent_vectors = []; # the avg-w2v for each sentence/review is stored in this list
    for sent in train_list_of_sent: # for each review/sentence
        sent_vec = np.zeros(300) # as word vectors are of zero length
        cnt_words =0; # num of words with a valid vector in the sentence/review
        for word in sent: # for each word in a review/sentence
            if word in gmodel_words:
                vec = gmodel.wv[word]
                sent_vec += vec
                cnt_words += 1
        if cnt_words != 0:
            sent_vec /= cnt_words
        train_sent_vectors.append(sent_vec)

    print ('Vectorizing test data.....')
    test_sent_vectors = []; # the avg-w2v for each sentence/review is stored in this list
    for sent in test_list_of_sent: # for each review/sentence
        sent_vec = np.zeros(300) # as word vectors are of zero length
        cnt_words =0; # num of words with a valid vector in the sentence/review
        for word in sent: # for each word in a review/sentence
            if word in gmodel_words:
                vec = gmodel.wv[word]
                sent_vec += vec
                cnt_words += 1
        if cnt_words != 0:
            sent_vec /= cnt_words
        test_sent_vectors.append(sent_vec)

    if algorithm == 'brute':
        #convert to sparse matrix
        X_train_counts = sparse.csr_matrix(np.array(train_sent_vectors))
        X_test_counts = sparse.csr_matrix(np.array(test_sent_vectors))
    elif algorithm == 'kd_tree':
        #convert to dense matrix
        X_train_counts = np.array(train_sent_vectors)
        X_test_counts = np.array(test_sent_vectors)
        print ('Shape of dense matrix train is {} and test is {}'.format(X_train_counts.shape, X_test_counts.shape))
        
    X_train_std_data = StandardScaler(with_mean = False).fit_transform(X_train_counts)
    print ('{} -Type of std_bow_data is {}'.format(algorithm, type(X_train_std_data)))
    print ('{}-Standardized reviews contain {} with {} features each'.format(algorithm, *X_train_std_data.shape))
    
    #vectorize the test set with the same vocabulary as train data set
    if algorithm == 'kd_tree':
        print("{} - The shape of vectorized test dataset is {}".format(algorithm, X_test_counts.shape))
    else:
        print("{} - The shape of vectorized test dataset is {}".format(algorithm, X_test_counts.get_shape()))
    
    X_test_std_data = StandardScaler(with_mean = False).fit_transform(X_test_counts)
    return X_train_std_data, X_test_std_data

In [22]:
no_of_records=20000
data = get_w2vdata(no_of_records)
run_classifier(data, vectorize_technique ='Gavgw2v', algorithm='brute')

Positive reviews shape is (17000, 3)
Negative reviews shape is (3000, 3)
Both positive and negative reviews combined together contains 20000 reviews
Splitting train data into sentences.....
Splitting test data into sentences.....
Vectorizing train data.....
Vectorizing test data.....
brute -Type of std_bow_data is <class 'scipy.sparse.csr.csr_matrix'>
brute-Standardized reviews contain 14001 with 300 features each
brute - The shape of vectorized test dataset is (6000, 300)
Algorithm - brute
--------------------------------------------------------------------------------
Doing Gridsearch .....
Fitting 10 folds for each of 5 candidates, totalling 50 fits
[CV] n_neighbors=2 ...................................................
[CV] n_neighbors=2 ...................................................
[CV] n_neighbors=2 ...................................................
[CV] n_neighbors=2 ...................................................
[CV] n_neighbors=2 ....................................

[Parallel(n_jobs=-1)]: Done  38 out of  50 | elapsed:  3.6min remaining:  1.1min


[CV] .......... n_neighbors=6, score=0.9214659685863874, total=   8.4s
[CV] .......... n_neighbors=5, score=0.9200524246395805, total=  11.9s
[CV] .......... n_neighbors=4, score=0.9024943310657597, total=  15.4s
[CV] .......... n_neighbors=6, score=0.9236141422959406, total=  10.0s
[CV] .......... n_neighbors=5, score=0.9117387466902029, total=  13.7s
[CV] .......... n_neighbors=4, score=0.9010889292196008, total=  15.4s
[CV] .......... n_neighbors=6, score=0.9170344218887908, total=  11.4s
[CV] .......... n_neighbors=5, score=0.9121265377855886, total=  15.0s
[CV] .......... n_neighbors=6, score=0.9066547565877623, total=  12.8s
[CV] .......... n_neighbors=5, score=0.9091706888986398, total=  16.1s
[CV] .......... n_neighbors=6, score=0.9123275478415667, total=  14.5s
[CV] .......... n_neighbors=6, score=0.9075555555555556, total=  16.2s


[Parallel(n_jobs=-1)]: Done  50 out of  50 | elapsed:  6.2min finished


Calculating best score from grid search...
Best score:  0.925703011946
Best parameters set:
n_neighbors 5
Dimensions of X_test is (6000, 300) and y_test is (6000,)
Predicting with X_test.....
             precision    recall  f1-score   support

   negative       0.69      0.20      0.31      1076
   positive       0.85      0.98      0.91      4924

avg / total       0.82      0.84      0.80      6000



In [23]:
# import datetime
from datetime import datetime, timedelta
# print(datetime.datetime.now())


ist = datetime.now() + timedelta(hours=5.5)
print (format(ist, '%H:%M:%S'))

19:54:10


In [24]:
no_of_records=20000
data = get_w2vdata(no_of_records)
run_classifier(data, vectorize_technique ='Gavgw2v', algorithm='kd_tree')

Positive reviews shape is (17000, 3)
Negative reviews shape is (3000, 3)
Both positive and negative reviews combined together contains 20000 reviews
Splitting train data into sentences.....
Splitting test data into sentences.....
Vectorizing train data.....
Vectorizing test data.....
Shape of dense matrix train is (14001, 300) and test is (6000, 300)
kd_tree -Type of std_bow_data is <class 'numpy.ndarray'>
kd_tree-Standardized reviews contain 14001 with 300 features each
kd_tree - The shape of vectorized test dataset is (6000, 300)
Algorithm - kd_tree
--------------------------------------------------------------------------------
Doing Gridsearch .....
Fitting 10 folds for each of 5 candidates, totalling 50 fits
[CV] n_neighbors=2 ...................................................
[CV] n_neighbors=2 ...................................................
[CV] n_neighbors=2 ...................................................
[CV] .......... n_neighbors=2, score=0.9161118508655126, total= 

[Parallel(n_jobs=-1)]: Done  38 out of  50 | elapsed:   17.8s remaining:    5.6s


[CV] .......... n_neighbors=4, score=0.9024943310657597, total=   1.4s
[CV] .......... n_neighbors=6, score=0.9214659685863874, total=   1.3s
[CV] .......... n_neighbors=5, score=0.9117387466902029, total=   1.2s
[CV] .......... n_neighbors=6, score=0.9236141422959406, total=   0.8s
[CV] .......... n_neighbors=4, score=0.9010889292196008, total=   1.7s
[CV] .......... n_neighbors=6, score=0.9170344218887908, total=   0.8s
[CV] .......... n_neighbors=5, score=0.9121265377855886, total=   1.7s
[CV] .......... n_neighbors=5, score=0.9091706888986398, total=   1.3s
[CV] .......... n_neighbors=6, score=0.9066547565877623, total=   1.2s
[CV] .......... n_neighbors=6, score=0.9123275478415667, total=   1.3s
[CV] .......... n_neighbors=6, score=0.9075555555555556, total=   0.8s


[Parallel(n_jobs=-1)]: Done  50 out of  50 | elapsed:   23.4s finished


Calculating best score from grid search...
Best score:  0.925703011946
Best parameters set:
n_neighbors 5
Dimensions of X_test is (6000, 300) and y_test is (6000,)
Predicting with X_test.....
             precision    recall  f1-score   support

   negative       0.69      0.20      0.31      1076
   positive       0.85      0.98      0.91      4924

avg / total       0.82      0.84      0.80      6000



In [25]:
from datetime import datetime, timedelta
# print(datetime.datetime.now())


ist = datetime.now() + timedelta(hours=5.5)
print (format(ist, '%H:%M:%S'))

20:34:24


#### TF-IDF weighted Word2Vec With Google's model

In [21]:
def vectorize_Gtfidfw2v(X_train, X_test, algorithm):
    # TF-IDF weighted Word2Vec
    tf_idf_vect = TfidfVectorizer(ngram_range=(1,2))
    tf_idf_counts = tf_idf_vect.fit_transform(X_train['Word2VecText'].values)

    tfidf_feat = tf_idf_vect.get_feature_names() 
    
    train_list_of_sent=[]
    test_list_of_sent=[]
    print ('Splitting train data into sentences.....')
    for sent in X_train['Word2VecText'].values:
        train_list_of_sent.append(sent.split())
    
    print ('Splitting test data into sentences.....')
    for sent in X_test['Word2VecText'].values:
        test_list_of_sent.append(sent.split())
        
    print ('Vectorizing train data.....')
    train_sent_vectors = []; # the tfidf-w2v for each sentence/review is stored in this list
    row=0;
    for sent in train_list_of_sent: # for each review/sentence 
        sent_vec = np.zeros(300) # as word vectors are of zero length
        weight_sum =0; # num of words with a valid vector in the sentence/review
        for word in sent: # for each word in a review/sentence
            if word in gmodel_words:
                vec = gmodel.wv[word]
                # obtain the tf_idfidf of a word in a sentence/review
                tf_idf = tf_idf_counts[row, tfidf_feat.index(word)]
                sent_vec += (vec * tf_idf)
                weight_sum += tf_idf
        if weight_sum != 0:
            sent_vec /= weight_sum
        train_sent_vectors.append(sent_vec)
        row += 1
        
    print ('Vectorizing test data.....')
    test_sent_vectors = []; # the tfidf-w2v for each sentence/review is stored in this list
    row=0;
    for sent in test_list_of_sent: # for each review/sentence 
        sent_vec = np.zeros(300) # as word vectors are of zero length
        weight_sum =0; # num of words with a valid vector in the sentence/review
        for word in sent: # for each word in a review/sentence
            if word in gmodel_words:
                vec = gmodel.wv[word]
                # obtain the tf_idfidf of a word in a sentence/review
                tf_idf = tf_idf_counts[row, tfidf_feat.index(word)]
                sent_vec += (vec * tf_idf)
                weight_sum += tf_idf
        if weight_sum != 0:
            sent_vec /= weight_sum
        test_sent_vectors.append(sent_vec)
        row += 1

    if algorithm == 'brute':
        #convert to sparse matrix
        X_train_counts = sparse.csr_matrix(np.array(train_sent_vectors))
        X_test_counts = sparse.csr_matrix(np.array(test_sent_vectors))
    elif algorithm == 'kd_tree':
        #convert to dense matrix
        X_train_counts = np.array(train_sent_vectors)
        X_test_counts = np.array(test_sent_vectors)
        print ('Shape of dense matrix train is {} and test is {}'.format(X_train_counts.shape, X_test_counts.shape))
        
    X_train_std_data = StandardScaler(with_mean = False).fit_transform(X_train_counts)
    print ('{} -Type of std_bow_data is {}'.format(algorithm, type(X_train_std_data)))
    print ('{}-Standardized reviews contain {} with {} features each'.format(algorithm, *X_train_std_data.shape))
    
    #vectorize the test set with the same vocabulary as train data set
    if algorithm == 'kd_tree':
        print("{} - The shape of vectorized test dataset is {}".format(algorithm, X_test_counts.shape))
    else:
        print("{} - The shape of vectorized test dataset is {}".format(algorithm, X_test_counts.get_shape()))
    
    X_test_std_data = StandardScaler(with_mean = False).fit_transform(X_test_counts)
    return X_train_std_data, X_test_std_data

In [26]:
no_of_records=20000
data = get_w2vdata(no_of_records)
run_classifier(data, vectorize_technique ='Gtfidfw2v', algorithm='brute')

Positive reviews shape is (17000, 3)
Negative reviews shape is (3000, 3)
Both positive and negative reviews combined together contains 20000 reviews
Splitting train data into sentences.....
Splitting test data into sentences.....
Vectorizing train data.....
Vectorizing test data.....
brute -Type of std_bow_data is <class 'scipy.sparse.csr.csr_matrix'>
brute-Standardized reviews contain 14001 with 300 features each
brute - The shape of vectorized test dataset is (6000, 300)
Algorithm - brute
--------------------------------------------------------------------------------
Doing Gridsearch .....
Fitting 10 folds for each of 5 candidates, totalling 50 fits
[CV] n_neighbors=2 ...................................................
[CV] n_neighbors=2 ...................................................
[CV] n_neighbors=2 ...................................................
[CV] n_neighbors=2 ...................................................
[CV] n_neighbors=2 ....................................

[Parallel(n_jobs=-1)]: Done  38 out of  50 | elapsed:  3.6min remaining:  1.1min


[CV] .......... n_neighbors=5, score=0.9200524246395805, total=  11.6s
[CV] .......... n_neighbors=6, score=0.9214659685863874, total=   8.8s
[CV] .......... n_neighbors=4, score=0.9024943310657597, total=  14.7s
[CV] .......... n_neighbors=5, score=0.9117387466902029, total=  11.5s
[CV] .......... n_neighbors=6, score=0.9236141422959406, total=  10.2s
[CV] .......... n_neighbors=4, score=0.9010889292196008, total=  16.5s
[CV] .......... n_neighbors=5, score=0.9121265377855886, total=  13.5s
[CV] .......... n_neighbors=6, score=0.9170344218887908, total=  11.6s
[CV] .......... n_neighbors=6, score=0.9066547565877623, total=  13.8s
[CV] .......... n_neighbors=5, score=0.9091706888986398, total=  14.9s
[CV] .......... n_neighbors=6, score=0.9123275478415667, total=  15.3s
[CV] .......... n_neighbors=6, score=0.9075555555555556, total=  17.8s


[Parallel(n_jobs=-1)]: Done  50 out of  50 | elapsed:  6.2min finished


Calculating best score from grid search...
Best score:  0.925703011946
Best parameters set:
n_neighbors 5
Dimensions of X_test is (6000, 300) and y_test is (6000,)
Predicting with X_test.....
             precision    recall  f1-score   support

   negative       0.69      0.20      0.31      1076
   positive       0.85      0.98      0.91      4924

avg / total       0.82      0.84      0.80      6000



In [27]:
from datetime import datetime, timedelta
# print(datetime.datetime.now())


ist = datetime.now() + timedelta(hours=5.5)
print (format(ist, '%H:%M:%S'))

21:20:57


In [28]:
no_of_records=20000
data = get_w2vdata(no_of_records)
run_classifier(data, vectorize_technique ='Gtfidfw2v', algorithm='kd_tree')

Positive reviews shape is (17000, 3)
Negative reviews shape is (3000, 3)
Both positive and negative reviews combined together contains 20000 reviews
Splitting train data into sentences.....
Splitting test data into sentences.....
Vectorizing train data.....
Vectorizing test data.....
Shape of dense matrix train is (14001, 300) and test is (6000, 300)
kd_tree -Type of std_bow_data is <class 'numpy.ndarray'>
kd_tree-Standardized reviews contain 14001 with 300 features each
kd_tree - The shape of vectorized test dataset is (6000, 300)
Algorithm - kd_tree
--------------------------------------------------------------------------------
Doing Gridsearch .....
Fitting 10 folds for each of 5 candidates, totalling 50 fits
[CV] n_neighbors=2 ...................................................
[CV] n_neighbors=2 ...................................................
[CV] n_neighbors=2 ...................................................
[CV] .......... n_neighbors=2, score=0.9161118508655126, total= 

[Parallel(n_jobs=-1)]: Done  38 out of  50 | elapsed:   18.6s remaining:    5.9s


[CV] .......... n_neighbors=6, score=0.9214659685863874, total=   0.6s
[CV] .......... n_neighbors=5, score=0.9200524246395805, total=   1.0s
[CV] .......... n_neighbors=4, score=0.9024943310657597, total=   1.3s
[CV] .......... n_neighbors=6, score=0.9236141422959406, total=   0.6s
[CV] .......... n_neighbors=4, score=0.9010889292196008, total=   1.8s
[CV] .......... n_neighbors=5, score=0.9117387466902029, total=   1.2s
[CV] .......... n_neighbors=5, score=0.9121265377855886, total=   0.9s
[CV] .......... n_neighbors=6, score=0.9170344218887908, total=   1.2s
[CV] .......... n_neighbors=5, score=0.9091706888986398, total=   1.7s
[CV] .......... n_neighbors=6, score=0.9066547565877623, total=   0.8s
[CV] .......... n_neighbors=6, score=0.9123275478415667, total=   1.1s
[CV] .......... n_neighbors=6, score=0.9075555555555556, total=   1.2s


[Parallel(n_jobs=-1)]: Done  50 out of  50 | elapsed:   24.8s finished


Calculating best score from grid search...
Best score:  0.925703011946
Best parameters set:
n_neighbors 5
Dimensions of X_test is (6000, 300) and y_test is (6000,)
Predicting with X_test.....
             precision    recall  f1-score   support

   negative       0.69      0.20      0.31      1076
   positive       0.85      0.98      0.91      4924

avg / total       0.82      0.84      0.80      6000



#### Train our own Word2Vec model

In [29]:
def build_word2vec(train_list_of_sent):
    # Train your own Word2Vec model using your own text corpus
    w2v_model=Word2Vec(train_list_of_sent,min_count=5,size=50, workers=4)

    w2v_words = list(w2v_model.wv.vocab)
    return w2v_model, w2v_words

#### Average Word2Vec With our own model

In [30]:
def vectorize_Oavgw2v(X_train, X_test, algorithm):
    train_list_of_sent=[]
    test_list_of_sent=[]
    print ('Splitting train data into sentences.....')
    for sent in X_train['Word2VecText'].values:
        train_list_of_sent.append(sent.split())
    
    print ('Splitting test data into sentences.....')
    for sent in X_test['Word2VecText'].values:
        test_list_of_sent.append(sent.split())
        
    print ('Building our word model.....')
    w2v_model, w2v_words = build_word2vec(train_list_of_sent)

    print ('Vectorizing train data.....')
    train_sent_vectors = []; # the avg-w2v for each sentence/review is stored in this list
    for sent in train_list_of_sent: # for each review/sentence
        sent_vec = np.zeros(50) # as word vectors are of zero length
        cnt_words =0; # num of words with a valid vector in the sentence/review
        for word in sent: # for each word in a review/sentence
            if word in w2v_words:
                vec = w2v_model.wv[word]
                sent_vec += vec
                cnt_words += 1
        if cnt_words != 0:
            sent_vec /= cnt_words
        train_sent_vectors.append(sent_vec)

    print ('Vectorizing test data.....')
    test_sent_vectors = []; # the avg-w2v for each sentence/review is stored in this list
    for sent in test_list_of_sent: # for each review/sentence
        sent_vec = np.zeros(50) # as word vectors are of zero length
        cnt_words =0; # num of words with a valid vector in the sentence/review
        for word in sent: # for each word in a review/sentence
            if word in w2v_words:
                vec = w2v_model.wv[word]
                sent_vec += vec
                cnt_words += 1
        if cnt_words != 0:
            sent_vec /= cnt_words
        test_sent_vectors.append(sent_vec)

    if algorithm == 'brute':
        #convert to sparse matrix
        X_train_counts = sparse.csr_matrix(np.array(train_sent_vectors))
        X_test_counts = sparse.csr_matrix(np.array(test_sent_vectors))
    elif algorithm == 'kd_tree':
        #convert to dense matrix
        X_train_counts = np.array(train_sent_vectors)
        X_test_counts = np.array(test_sent_vectors)
        print ('Shape of dense matrix train is {} and test is {}'.format(X_train_counts.shape, X_test_counts.shape))
        
    X_train_std_data = StandardScaler(with_mean = False).fit_transform(X_train_counts)
    print ('{} -Type of std_bow_data is {}'.format(algorithm, type(X_train_std_data)))
    print ('{}-Standardized reviews contain {} with {} features each'.format(algorithm, *X_train_std_data.shape))
    
    #vectorize the test set with the same vocabulary as train data set
    if algorithm == 'kd_tree':
        print("{} - The shape of vectorized test dataset is {}".format(algorithm, X_test_counts.shape))
    else:
        print("{} - The shape of vectorized test dataset is {}".format(algorithm, X_test_counts.get_shape()))
    
    X_test_std_data = StandardScaler(with_mean = False).fit_transform(X_test_counts)
    return X_train_std_data, X_test_std_data

In [31]:
no_of_records=20000
data = get_w2vdata(no_of_records)
run_classifier(data, vectorize_technique ='Oavgw2v', algorithm='brute')

Positive reviews shape is (17000, 3)
Negative reviews shape is (3000, 3)
Both positive and negative reviews combined together contains 20000 reviews
Splitting train data into sentences.....
Splitting test data into sentences.....
Building our word model.....
Vectorizing train data.....
Vectorizing test data.....
brute -Type of std_bow_data is <class 'scipy.sparse.csr.csr_matrix'>
brute-Standardized reviews contain 14001 with 50 features each
brute - The shape of vectorized test dataset is (6000, 50)
Algorithm - brute
--------------------------------------------------------------------------------
Doing Gridsearch .....
Fitting 10 folds for each of 5 candidates, totalling 50 fits
[CV] n_neighbors=2 ...................................................
[CV] n_neighbors=2 ...................................................
[CV] n_neighbors=2 ...................................................
[CV] n_neighbors=2 ...................................................
[CV] n_neighbors=2 .........

[Parallel(n_jobs=-1)]: Done  38 out of  50 | elapsed:   37.5s remaining:   11.8s


[CV] .......... n_neighbors=6, score=0.9111798497569599, total=   1.4s
[CV] ........... n_neighbors=5, score=0.908289241622575, total=   2.1s
[CV] .......... n_neighbors=4, score=0.8894052044609666, total=   2.7s
[CV] .......... n_neighbors=5, score=0.8973214285714285, total=   2.2s
[CV] .......... n_neighbors=6, score=0.9134826526130875, total=   1.8s
[CV] .......... n_neighbors=6, score=0.9008928571428572, total=   2.1s
[CV] .......... n_neighbors=4, score=0.8744228993536474, total=   4.9s
[CV] ........... n_neighbors=5, score=0.909737661182748, total=   2.6s
[CV] .......... n_neighbors=6, score=0.8882833787465941, total=   2.3s
[CV] .......... n_neighbors=5, score=0.8938249666814749, total=   2.8s
[CV] .......... n_neighbors=6, score=0.8994565217391305, total=   2.8s
[CV] .......... n_neighbors=6, score=0.8918062471706656, total=   2.9s


[Parallel(n_jobs=-1)]: Done  50 out of  50 | elapsed:  1.1min finished


Calculating best score from grid search...
Best score:  0.917578934132
Best parameters set:
n_neighbors 5
Dimensions of X_test is (6000, 50) and y_test is (6000,)
Predicting with X_test.....
             precision    recall  f1-score   support

   negative       0.54      0.18      0.27      1076
   positive       0.84      0.97      0.90      4924

avg / total       0.79      0.83      0.79      6000



In [32]:
from datetime import datetime, timedelta
# print(datetime.datetime.now())


ist = datetime.now() + timedelta(hours=5.5)
print (format(ist, '%H:%M:%S'))

22:02:25


In [33]:
no_of_records=20000
data = get_w2vdata(no_of_records)
run_classifier(data, vectorize_technique ='Oavgw2v', algorithm='kd_tree')

Positive reviews shape is (17000, 3)
Negative reviews shape is (3000, 3)
Both positive and negative reviews combined together contains 20000 reviews
Splitting train data into sentences.....
Splitting test data into sentences.....
Building our word model.....
Vectorizing train data.....
Vectorizing test data.....
Shape of dense matrix train is (14001, 50) and test is (6000, 50)
kd_tree -Type of std_bow_data is <class 'numpy.ndarray'>
kd_tree-Standardized reviews contain 14001 with 50 features each
kd_tree - The shape of vectorized test dataset is (6000, 50)
Algorithm - kd_tree
--------------------------------------------------------------------------------
Doing Gridsearch .....
Fitting 10 folds for each of 5 candidates, totalling 50 fits
[CV] n_neighbors=2 ...................................................
[CV] n_neighbors=2 ...................................................
[CV] n_neighbors=2 ...................................................
[CV] n_neighbors=2 ....................

[Parallel(n_jobs=-1)]: Done  38 out of  50 | elapsed:    9.9s remaining:    3.1s


[CV] .......... n_neighbors=6, score=0.9071523767214571, total=   0.6s
[CV] .......... n_neighbors=4, score=0.8868188169538891, total=   0.5s
[CV] .......... n_neighbors=6, score=0.9114811568799298, total=   0.7s
[CV] .......... n_neighbors=5, score=0.8972876834148511, total=   0.8s
[CV] .......... n_neighbors=6, score=0.9005376344086021, total=   0.6s
[CV] .......... n_neighbors=4, score=0.8717948717948717, total=   1.4s
[CV] .......... n_neighbors=6, score=0.8920018075011297, total=   0.5s
[CV] .......... n_neighbors=5, score=0.9033407572383072, total=   0.9s
[CV] .......... n_neighbors=5, score=0.8929527207850134, total=   0.8s
[CV] ........... n_neighbors=6, score=0.896958692691784, total=   0.8s
[CV] .......... n_neighbors=6, score=0.8891904115784712, total=   0.7s


[Parallel(n_jobs=-1)]: Done  50 out of  50 | elapsed:   14.2s finished


Calculating best score from grid search...
Best score:  0.91671286117
Best parameters set:
n_neighbors 5
Dimensions of X_test is (6000, 50) and y_test is (6000,)
Predicting with X_test.....
             precision    recall  f1-score   support

   negative       0.53      0.18      0.27      1076
   positive       0.84      0.97      0.90      4924

avg / total       0.79      0.82      0.79      6000



In [34]:
from datetime import datetime, timedelta
# print(datetime.datetime.now())


ist = datetime.now() + timedelta(hours=5.5)
print (format(ist, '%H:%M:%S'))

22:03:15


#### TF-IDF weighted Word2Vec With our own model

In [35]:
def vectorize_Otfidfw2v(X_train, X_test, algorithm):
    # TF-IDF weighted Word2Vec
    tf_idf_vect = TfidfVectorizer(ngram_range=(1,2))
    tf_idf_counts = tf_idf_vect.fit_transform(X_train['Word2VecText'].values)

    tfidf_feat = tf_idf_vect.get_feature_names() 
    
    train_list_of_sent=[]
    test_list_of_sent=[]
    print ('Splitting train data into sentences.....')
    for sent in X_train['Word2VecText'].values:
        train_list_of_sent.append(sent.split())
    
    print ('Splitting test data into sentences.....')
    for sent in X_test['Word2VecText'].values:
        test_list_of_sent.append(sent.split())
        
    print ('Building our word model.....')
    w2v_model, w2v_words = build_word2vec(train_list_of_sent)
    
    print ('Vectorizing train data.....')
    train_sent_vectors = []; # the tfidf-w2v for each sentence/review is stored in this list
    row=0;
    for sent in train_list_of_sent: # for each review/sentence 
        sent_vec = np.zeros(50) # as word vectors are of zero length
        weight_sum =0; # num of words with a valid vector in the sentence/review
        for word in sent: # for each word in a review/sentence
            if word in w2v_words:
                vec = w2v_model.wv[word]
                # obtain the tf_idfidf of a word in a sentence/review
                tf_idf = tf_idf_counts[row, tfidf_feat.index(word)]
                sent_vec += (vec * tf_idf)
                weight_sum += tf_idf
        if weight_sum != 0:
            sent_vec /= weight_sum
        train_sent_vectors.append(sent_vec)
        row += 1

    print ('Vectorizing test data.....')
    test_sent_vectors = []; # the tfidf-w2v for each sentence/review is stored in this list
    row=0;
    for sent in test_list_of_sent: # for each review/sentence 
        sent_vec = np.zeros(50) # as word vectors are of zero length
        weight_sum =0; # num of words with a valid vector in the sentence/review
        for word in sent: # for each word in a review/sentence
            if word in w2v_words:
                vec = w2v_model.wv[word]
                # obtain the tf_idfidf of a word in a sentence/review
                tf_idf = tf_idf_counts[row, tfidf_feat.index(word)]
                sent_vec += (vec * tf_idf)
                weight_sum += tf_idf
        if weight_sum != 0:
            sent_vec /= weight_sum
        test_sent_vectors.append(sent_vec)
        row += 1

    if algorithm == 'brute':
        #convert to sparse matrix
        X_train_counts = sparse.csr_matrix(np.array(train_sent_vectors))
        X_test_counts = sparse.csr_matrix(np.array(test_sent_vectors))
    elif algorithm == 'kd_tree':
        #convert to dense matrix
        X_train_counts = np.array(train_sent_vectors)
        X_test_counts = np.array(test_sent_vectors)
        print ('Shape of dense matrix train is {} and test is {}'.format(X_train_counts.shape, X_test_counts.shape))
        
    X_train_std_data = StandardScaler(with_mean = False).fit_transform(X_train_counts)
    print ('{} -Type of std_bow_data is {}'.format(algorithm, type(X_train_std_data)))
    print ('{}-Standardized reviews contain {} with {} features each'.format(algorithm, *X_train_std_data.shape))
    
    #vectorize the test set with the same vocabulary as train data set
    if algorithm == 'kd_tree':
        print("{} - The shape of vectorized test dataset is {}".format(algorithm, X_test_counts.shape))
    else:
        print("{} - The shape of vectorized test dataset is {}".format(algorithm, X_test_counts.get_shape()))
    
    X_test_std_data = StandardScaler(with_mean = False).fit_transform(X_test_counts)
    return X_train_std_data, X_test_std_data

In [36]:
no_of_records=20000
data = get_w2vdata(no_of_records)
run_classifier(data, vectorize_technique ='Otfidfw2v', algorithm='brute')

Positive reviews shape is (17000, 3)
Negative reviews shape is (3000, 3)
Both positive and negative reviews combined together contains 20000 reviews
Splitting train data into sentences.....
Splitting test data into sentences.....
Building our word model.....
Vectorizing train data.....
Vectorizing test data.....
brute -Type of std_bow_data is <class 'scipy.sparse.csr.csr_matrix'>
brute-Standardized reviews contain 14001 with 50 features each
brute - The shape of vectorized test dataset is (6000, 50)
Algorithm - brute
--------------------------------------------------------------------------------
Doing Gridsearch .....
Fitting 10 folds for each of 5 candidates, totalling 50 fits
[CV] n_neighbors=2 ...................................................
[CV] n_neighbors=2 ...................................................
[CV] n_neighbors=2 ...................................................
[CV] n_neighbors=2 ...................................................
[CV] n_neighbors=2 .........

[Parallel(n_jobs=-1)]: Done  38 out of  50 | elapsed:   38.9s remaining:   12.3s


[CV] .......... n_neighbors=5, score=0.9075777485764345, total=   1.9s
[CV] .......... n_neighbors=6, score=0.9090113735783028, total=   1.6s
[CV] ........... n_neighbors=4, score=0.891754951635191, total=   2.8s
[CV] ........... n_neighbors=5, score=0.896921017402945, total=   2.4s
[CV] .......... n_neighbors=6, score=0.9074967119684347, total=   1.7s
[CV] .......... n_neighbors=5, score=0.9028520499108735, total=   2.6s
[CV] ........... n_neighbors=6, score=0.898486197684773, total=   2.2s
[CV] .......... n_neighbors=4, score=0.8704133766836971, total=   3.2s
[CV] ........................ n_neighbors=6, score=0.89, total=   2.5s
[CV] .......... n_neighbors=5, score=0.8997338065661046, total=   2.9s
[CV] .......... n_neighbors=6, score=0.8983280614550384, total=   2.8s
[CV] .......... n_neighbors=6, score=0.8927927927927927, total=   3.1s


[Parallel(n_jobs=-1)]: Done  50 out of  50 | elapsed:  1.1min finished


Calculating best score from grid search...
Best score:  0.91659484772
Best parameters set:
n_neighbors 5
Dimensions of X_test is (6000, 50) and y_test is (6000,)
Predicting with X_test.....
             precision    recall  f1-score   support

   negative       0.26      0.08      0.13      1076
   positive       0.83      0.95      0.88      4924

avg / total       0.72      0.79      0.75      6000



In [37]:
from datetime import datetime, timedelta
# print(datetime.datetime.now())


ist = datetime.now() + timedelta(hours=5.5)
print (format(ist, '%H:%M:%S'))

00:06:05


In [38]:
no_of_records=20000
data = get_w2vdata(no_of_records)
run_classifier(data, vectorize_technique ='Otfidfw2v', algorithm='kd_tree')

Positive reviews shape is (17000, 3)
Negative reviews shape is (3000, 3)
Both positive and negative reviews combined together contains 20000 reviews
Splitting train data into sentences.....
Splitting test data into sentences.....
Building our word model.....
Vectorizing train data.....
Vectorizing test data.....
Shape of dense matrix train is (14001, 50) and test is (6000, 50)
kd_tree -Type of std_bow_data is <class 'numpy.ndarray'>
kd_tree-Standardized reviews contain 14001 with 50 features each
kd_tree - The shape of vectorized test dataset is (6000, 50)
Algorithm - kd_tree
--------------------------------------------------------------------------------
Doing Gridsearch .....
Fitting 10 folds for each of 5 candidates, totalling 50 fits
[CV] n_neighbors=2 ...................................................
[CV] n_neighbors=2 ...................................................
[CV] n_neighbors=2 ...................................................
[CV] n_neighbors=2 ....................

[Parallel(n_jobs=-1)]: Done  38 out of  50 | elapsed:    9.7s remaining:    3.1s


[CV] .......... n_neighbors=5, score=0.9035933391761614, total=   0.9s
[CV] .......... n_neighbors=6, score=0.9071803852889667, total=   0.4s
[CV] .......... n_neighbors=6, score=0.9084537888742882, total=   0.6s
[CV] .......... n_neighbors=5, score=0.8976868327402134, total=   1.1s
[CV] .......... n_neighbors=4, score=0.8830328247803977, total=   0.9s
[CV] .......... n_neighbors=4, score=0.8648901355773725, total=   1.0s
[CV] .......... n_neighbors=6, score=0.9004444444444444, total=   0.7s
[CV] .......... n_neighbors=6, score=0.8920018075011297, total=   0.8s
[CV] .......... n_neighbors=5, score=0.9017857142857144, total=   1.4s
[CV] .......... n_neighbors=5, score=0.8943379402585823, total=   1.0s
[CV] .......... n_neighbors=6, score=0.8994136220117275, total=   0.8s
[CV] .......... n_neighbors=6, score=0.8850522489777374, total=   0.7s


[Parallel(n_jobs=-1)]: Done  50 out of  50 | elapsed:   14.7s finished


Calculating best score from grid search...
Best score:  0.915121848389
Best parameters set:
n_neighbors 5
Dimensions of X_test is (6000, 50) and y_test is (6000,)
Predicting with X_test.....
             precision    recall  f1-score   support

   negative       0.24      0.07      0.10      1076
   positive       0.82      0.95      0.88      4924

avg / total       0.72      0.79      0.74      6000



In [39]:
from datetime import datetime, timedelta
# print(datetime.datetime.now())


ist = datetime.now() + timedelta(hours=5.5)
print (format(ist, '%H:%M:%S'))

02:18:54


|Model|Hyper Parameter|Train Error|Test Error|
| ------------- |:-------------:| -----:|-----:|
|KNN-BoW-Brute        | n_neighbors=5 | 92%| 77% |
|KNN-BoW-KDTree        |n_neighbors=5 |  92%|74% |
|KNN-tfIDF-Brute        | n_neighbors=5 |  92%| 77%|
|KNN-tfIDF-KDTree        | n_neighbors=5 |  92%|  75% |
|KNN-AvgW2VGoogle-Brute        | n_neighbors=5 | 92%|80% |
|KNN-AvgW2VGoogle-KDTree        | n_neighbors=5 | 92%|80%|
|KNN-TfIDFW2VGoogle-Brute        | n_neighbors=5 | 92%|80%|
|KNN-TfIDFW2VGoogle-KDTree        | n_neighbors=5 | 92%|80%|
|KNN-AvgW2VOwnModel-Brute        | n_neighbors=5| 92% | 79%|
|KNN-AvgW2VOwnModel-KDTree        | n_neighbors=5 | 92% |79%|
|KNN-TfIDFW2VOwnModel-Brute        | n_neighbors=5 | 92%| 75%|
|KNN-TfIDFW2VOwnModel-KDTree        | n_neighbors=5 | 92%|74%|

In [47]:
A = np.array([
[1,2,3,4,5,6,7,8,9,10],
[11,12,13,14,15,16,17,18,19,20],
[1,2,3,4,5,6,7,8,9,10],
[11,12,13,14,15,16,17,18,19,20],
[21,22,23,24,25,26,27,28,29,30]])
print('The original matrix is \n', A)
print('\nThe shape of the matrix is\n', A.shape)
# svd
print ('\nReducing number of components from 10 to 3. We will get back a 5x3 matrix')
print ('-'*80)
svd = TruncatedSVD(n_components=3)
svd.fit(A)
result = svd.transform(A)
print(result)

print ('\n Reducing number of components from 10 to 5. We will get back a 5x5 matrix')
print ('-'*80)
svd = TruncatedSVD(n_components=5)
svd.fit(A)
result = svd.transform(A)
print(result)

print ('\n Reducing number of components from 10 to 7. We will get back a 5x5 matrix')
print ('-'*80)
svd = TruncatedSVD(n_components=7)
svd.fit(A)
result = svd.transform(A)
print(result)

The original matrix is 
 [[ 1  2  3  4  5  6  7  8  9 10]
 [11 12 13 14 15 16 17 18 19 20]
 [ 1  2  3  4  5  6  7  8  9 10]
 [11 12 13 14 15 16 17 18 19 20]
 [21 22 23 24 25 26 27 28 29 30]]

The shape of the matrix is
 (5, 10)

Reducing number of components from 10 to 3. We will get back a 5x3 matrix
--------------------------------------------------------------------------------
[[ 1.86335320e+01  6.14747782e+00 -8.88178420e-16]
 [ 4.98391715e+01  1.02809650e+00  4.44089210e-16]
 [ 1.86335320e+01  6.14747782e+00 -8.88178420e-16]
 [ 4.98391715e+01  1.02809650e+00  4.44089210e-16]
 [ 8.10448110e+01 -4.09128482e+00  2.27595720e-15]]

 Reducing number of components from 10 to 5. We will get back a 5x5 matrix
--------------------------------------------------------------------------------
[[ 1.86335320e+01  6.14747782e+00 -1.18915294e-15 -3.33066907e-16
   1.11022302e-16]
 [ 4.98391715e+01  1.02809650e+00 -1.37910516e-15 -6.66133815e-16
  -2.22044605e-16]
 [ 1.86335320e+01  6.14747782e+00

### Reflection
1. While running classifier the entire training set is loaded into RAM. With 3000 records I noticed that most of the 16GB RAM was completly utilized and the CPU was running at 100%. For 2000 records slightly more than 50% of the RAM(around 9GB) was getting utilized. So while training KNN it is better to have a machine with large amounts of RAM. This was with dense matrices.
2. After vectorizing the dataset I converted it to a dataframe, thereby losing the sparsity of the matrix. This increased the processing time for 'brute' significantly. Use the sparse matrix for training as well as for prediction.
3. Using cosine distance gave me a slightly better results with a small dataset than with euclidean distance.
4. Reducing the dimensions for BoW and tfIDF improved the processing time significantly. Even when the expalined varaince ratio was less than 60% the f1_score reported was above 90%.