<a href="https://colab.research.google.com/github/zacherymoy/DS-Unit-4-Sprint-1-NLP/blob/master/KaggleDSPT4.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [0]:
# import zipfile

# with zipfile.ZipFile("whiskey-reviews-dspt4.zip","r") as zip_ref:
#     zip_ref.extractall("targetdir")

In [0]:
import pandas as pd
from sklearn.utils import resample

sample_submission_csv = pd.read_csv('sample_submission.csv')
test = pd.read_csv('test.csv')
train = pd.read_csv('train.csv')

# Split by class so the minority class (0) can be upsampled
minority = train[train['ratingCategory'] == 0]
majority = train[train['ratingCategory'] == 1]

# Sample the minority class with replacement until it matches the majority in size
df_minority_upsampled = resample(minority,
                                 replace=True,
                                 n_samples=majority.shape[0]
                                )
df_upsampled = pd.concat([majority, df_minority_upsampled])
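
The cell above balances the two classes by drawing minority rows with replacement until there are as many of them as majority rows. A minimal plain-Python sketch of that idea, using made-up toy rows (the `upsample` helper and the data are hypothetical, not part of the notebook or of scikit-learn):

```python
import random

def upsample(minority, majority, seed=0):
    """Draw from `minority` with replacement until it matches `majority` in size."""
    rng = random.Random(seed)  # fixed seed so the draw is repeatable
    return rng.choices(minority, k=len(majority))

# Toy rows: (description, ratingCategory)
majority = [("smooth and rich", 1), ("sweet oak", 1), ("long finish", 1), ("spicy rye", 1)]
minority = [("thin and harsh", 0), ("flat", 0)]

balanced = majority + upsample(minority, majority)
counts = {label: sum(1 for _, y in balanced if y == label) for label in (0, 1)}
print(counts)  # {0: 4, 1: 4}
```

Upsampled rows are exact duplicates, so if a balanced frame like this is later cross-validated, copies of the same review can land in both the training and validation folds and make the score look better than it is.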

In [0]:
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.feature_extraction.text import TfidfVectorizer

## Load Data 

In [16]:
df_upsampled.head()


Unnamed: 0,id,description,ratingCategory
0,1321,"\nSometimes, when whisky is batched, a few lef...",1
2,655,\nThis release is a port version of Amrut’s In...,1
3,555,\nThis 41 year old single cask was aged in a s...,1
4,1965,"\nQuite herbal on the nose, with aromas of dri...",1
5,3190,\nCooley produced some great Irish single malt...,1


In [17]:
df_upsampled['description'][0]


'\nSometimes, when whisky is batched, a few leftover barrels are returned to the warehouse. Canadian Club recently pulled and vatted several of these from the 1970s. Acetone, Granny Smith apples, and fresh-cut white cedar showcase this long age. Complex and spicy, yet reserved, this dram is ripe with strawberries, canned pears, cloves, pepper, and faint flowers, then slightly pulling oak tannins. Distinct, elegant, and remarkably vibrant, this ancient Canadian Club is anything but tired. (Australia only)\xa0A$133'

## Clean data

In [18]:
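import re

def cleandesc(text):
    # Drop the first character (the leading '\n') and the last character
    text = text[1:-1]
    # Remove the trailing parenthetical and price, e.g. "(Australia only) A$133"
    text = re.sub(r'\(.*$', '', text)
    return text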

print(cleandesc(train['description'][0]))

df_upsampled['cleandesc'] = df_upsampled['description'].apply(cleandesc)
test['cleandesc'] = test['description'].apply(cleandesc)

df_upsampled.head()

Sometimes, when whisky is batched, a few leftover barrels are returned to the warehouse. Canadian Club recently pulled and vatted several of these from the 1970s. Acetone, Granny Smith apples, and fresh-cut white cedar showcase this long age. Complex and spicy, yet reserved, this dram is ripe with strawberries, canned pears, cloves, pepper, and faint flowers, then slightly pulling oak tannins. Distinct, elegant, and remarkably vibrant, this ancient Canadian Club is anything but tired. 


Unnamed: 0,id,description,ratingCategory,cleandesc
0,1321,"\nSometimes, when whisky is batched, a few lef...",1,"Sometimes, when whisky is batched, a few lefto..."
2,655,\nThis release is a port version of Amrut’s In...,1,This release is a port version of Amrut’s Inte...
3,555,\nThis 41 year old single cask was aged in a s...,1,This 41 year old single cask was aged in a she...
4,1965,"\nQuite herbal on the nose, with aromas of dri...",1,"Quite herbal on the nose, with aromas of dried..."
5,3190,\nCooley produced some great Irish single malt...,1,Cooley produced some great Irish single malt w...
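
To see what each step of `cleandesc` does, here is a small sketch on two made-up review strings (the sample texts are hypothetical; the function body follows the cell above, written with a raw-string pattern):

```python
import re

def cleandesc(text):
    text = text[1:-1]                  # drop the first and last characters
    text = re.sub(r'\(.*$', '', text)  # drop everything from the first '(' to the end
    return text

# A review that ends with a parenthetical note and a price
sample = "\nRich and malty, with a dry finish. (Travel Retail only)\xa0$80"
cleaned = cleandesc(sample)
print(repr(cleaned))  # 'Rich and malty, with a dry finish. '

# A review with no parenthetical: the slice still removes the final character
print(repr(cleandesc("\nLight and floral.")))  # 'Light and floral'
```

The second case shows a side effect worth knowing about: when a review has no trailing parenthetical, the slice removes its last real character. Using `text.strip()` instead of `text[1:-1]` would be one gentler alternative, if that matters for the model.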


## Build a Baseline TF-IDF Model

In [0]:
# Create Pipeline Components
vect = TfidfVectorizer(stop_words='english', ngram_range=(1,2))
clf = RandomForestClassifier()
pipe = Pipeline([('vect', vect), ('clf', clf)])
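
With `ngram_range=(1,2)` the vectorizer builds features from single words and from pairs of adjacent words. The sketch below imitates that step in plain Python on a made-up sentence. The `ngrams` helper and the tiny stop-word set are hypothetical simplifications, not scikit-learn's actual tokenizer or its English stop-word list:

```python
import re

STOP_WORDS = {"a", "and", "the", "with", "of"}  # tiny stand-in for the real list

def ngrams(text, n_min=1, n_max=2):
    """Lowercase, keep word tokens of 2+ characters, drop stop words, emit n-grams."""
    tokens = [t for t in re.findall(r"\b\w\w+\b", text.lower()) if t not in STOP_WORDS]
    out = []
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            out.append(" ".join(tokens[i:i + n]))
    return out

features = ngrams("Complex and spicy, with a hint of oak")
print(features)
# ['complex', 'spicy', 'hint', 'oak', 'complex spicy', 'spicy hint', 'hint oak']
```

Because stop words are removed before pairs are formed, 'complex spicy' becomes a bigram even though 'and' sat between the two words in the original text.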

In [0]:
# Record the current machine specs, because this search runs for about 6 minutes
# n_jobs=-2 (use all CPU cores but one) can speed it up

# What I originally did 
# parameters = {
#     'vect__max_df': (0.7, 1.0),
#     'vect__min_df': (2, 5, 10),
#     'vect__max_features': (5000, 10000),
#     'clf__n_estimators': (100, 500),
#     'clf__max_depth': (10, 20)
# }

parameters = {
    'vect__max_df': (0.7, 1.0),
    'clf__max_depth': (10, 20)
}

grid_search = GridSearchCV(pipe, parameters, cv=5, n_jobs=4, verbose=1)
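
GridSearchCV fits one model per parameter combination per fold, so the cost grows multiplicatively with every parameter added to the grid. A rough sketch of that arithmetic in plain Python (the `n_fits` helper is hypothetical; the two grids are copied from the cell above):

```python
from itertools import product

def n_fits(grid, cv):
    """Number of cross-validation fits: every combination of values, once per fold."""
    combos = list(product(*grid.values()))
    return len(combos) * cv

reduced = {
    'vect__max_df': (0.7, 1.0),
    'clf__max_depth': (10, 20),
}
original = {
    'vect__max_df': (0.7, 1.0),
    'vect__min_df': (2, 5, 10),
    'vect__max_features': (5000, 10000),
    'clf__n_estimators': (100, 500),
    'clf__max_depth': (10, 20),
}

print(n_fits(reduced, cv=5))   # 4 combinations x 5 folds = 20
print(n_fits(original, cv=5))  # 48 combinations x 5 folds = 240
```

The `step__param` keys route each value to the pipeline step named before the double underscore, so `vect__max_df` sets `max_df` on the `vect` step and `clf__max_depth` sets `max_depth` on the `clf` step.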


In [32]:
train['ratingCategory']

0       1
1       0
2       1
3       1
4       1
       ..
4082    1
4083    1
4084    1
4085    1
4086    1
Name: ratingCategory, Length: 4087, dtype: int64

In [33]:
train.head()

Unnamed: 0,id,description,ratingCategory
0,1321,"\nSometimes, when whisky is batched, a few lef...",1
1,3861,\nAn uncommon exclusive bottling of a 6 year o...,0
2,655,\nThis release is a port version of Amrut’s In...,1
3,555,\nThis 41 year old single cask was aged in a s...,1
4,1965,"\nQuite herbal on the nose, with aromas of dri...",1


In [0]:
#df_upsampled['cleandesc'] = df_upsampled['description'].apply(cleandesc)
train['cleandesc'] = train['description'].apply(cleandesc)

In [35]:
train['cleandesc']

0       Sometimes, when whisky is batched, a few lefto...
1       An uncommon exclusive bottling of a 6 year old...
2       This release is a port version of Amrut’s Inte...
3       This 41 year old single cask was aged in a she...
4       Quite herbal on the nose, with aromas of dried...
                              ...                        
4082    What lies beneath the surface of Dewar’s? Here...
4083    After 6 to 7 years of maturation in bourbon ca...
4084    Bright, delicate, and approachable. While not ...
4085    I’m calling this the pitmaster’s dram: the nos...
4086    Spicy sultanas, greengage plums, toffee, and n...
Name: cleandesc, Length: 4087, dtype: object

In [36]:
grid_search.fit(train['cleandesc'], train['ratingCategory'])

Fitting 5 folds for each of 4 candidates, totalling 20 fits


[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done  20 out of  20 | elapsed:   23.1s finished


GridSearchCV(cv=5, error_score=nan,
             estimator=Pipeline(memory=None,
                                steps=[('vect',
                                        TfidfVectorizer(analyzer='word',
                                                        binary=False,
                                                        decode_error='strict',
                                                        dtype=<class 'numpy.float64'>,
                                                        encoding='utf-8',
                                                        input='content',
                                                        lowercase=True,
                                                        max_df=1.0,
                                                        max_features=None,
                                                        min_df=1,
                                                        ngram_range=(1, 2),
                                                        no

In [23]:
grid_search.best_params_


AttributeError: ignored

In [10]:
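import pandas as pd

# Get sparse dtm (fit_transform needs the documents to vectorize)
dtm = vect.fit_transform(train['cleandesc'])

# Convert to dataframe
# get_feature_names_out() needs scikit-learn >= 1.0; older versions use get_feature_names()
dtm = pd.DataFrame(dtm.toarray(), columns=vect.get_feature_names_out())
dtm.shape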

(3, 3)

In [0]:
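# Stuck on splitting? GridSearchCV already cross-validates (cv=5), so fit on the full training set
grid_search.fit(train['cleandesc'], train['ratingCategory'])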

In [0]:
grid_search.best_score_

In [0]:
grid_search.best_params_

In [0]:
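from sklearn.metrics import accuracy_score

# Predict on the test data
y_pred = grid_search.predict(test['cleandesc'])

# Accuracy needs true labels; score only if test.csv provides them
# (assumption: a Kaggle test file may omit 'ratingCategory')
if 'ratingCategory' in test.columns:
    print(accuracy_score(test['ratingCategory'], y_pred))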