# Project for General Assembly Data Science

## Description of Dataset

Yelp review dataset of 10,000 reviews with the following data fields:

1. Business ID
2. Date
3. Review ID
4. Stars
5. Text
6. Type
7. User ID
8. Cool
9. Useful
10. Funny

## Objectives

1. Predict the 'Stars' rating of a review, using the review text in the 'Text' data field as features.
2. Tune parameters using **`RandomizedSearchCV`** and combine the processing steps using **`Pipeline`**.

## Tasks

1. Reading in and exploring data
2. Feature extraction using **`CountVectorizer`** and chaining cross validation using **`Pipeline`**
3. Feature extraction using **`TfidfVectorizer`** and chaining cross validation using **`Pipeline`**
4. Tuning **`CountVectorizer`** and classifiers using **`RandomizedSearchCV`**
5. Tuning **`TfidfVectorizer`** and classifiers using **`RandomizedSearchCV`**

## 1. Reading in and exploring data

In [13]:
import pandas as pd
url = 'https://raw.githubusercontent.com/toclim/GA-data-science/master/yelp.csv'
yelp = pd.read_csv(url)
yelp.head()

|   | business_id | date | review_id | stars | text | type | user_id | cool | useful | funny |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 9yKzy9PApeiPPOUJEtnvkg | 2011-01-26 | fWKvX83p0-ka4JS3dc6E5A | 5 | My wife took me here on my birthday for breakf... | review | rLtl8ZkDX5vH5nAx9C3q5Q | 2 | 5 | 0 |
| 1 | ZRJwVLyzEJq1VAihDhYiow | 2011-07-27 | IjZ33sJrzXqU-0X6U8NwyA | 5 | I have no idea why some people give bad review... | review | 0a2KyEL0d3Yb1V6aivbIuQ | 0 | 0 | 0 |
| 2 | 6oRAC4uyJCsJl1X0WZpVSA | 2012-06-14 | IESLBzqUCLdSzSqm0eCSxQ | 4 | love the gyro plate. Rice is so good and I als... | review | 0hT2KtfLiobPvh6cDC8JQg | 0 | 1 | 0 |
| 3 | _1QQZuf4zZOyFCvXc0o6Vg | 2010-05-27 | G-WvGaISbqqaMHlNnByodA | 5 | Rosie, Dakota, and I LOVE Chaparral Dog Park!!... | review | uZetl9T0NcROGOyFfughhg | 1 | 2 | 0 |
| 4 | 6ozycU1RpktNG2-1BroVtw | 2012-01-05 | 1uJFq2r5QfJG_6ExMRCaGw | 5 | General Manager Scott Petello is a good egg!!!... | review | vYmM4KTsC8ZfQBg-j5MWkw | 0 | 0 | 0 |


In [3]:
yelp.shape

(10000, 10)

In [4]:
yelp.stars.value_counts().sort_index()

1     749
2     927
3    1461
4    3526
5    3337
Name: stars, dtype: int64

In [7]:
yelp.isnull().sum()

business_id    0
date           0
review_id      0
stars          0
text           0
type           0
user_id        0
cool           0
useful         0
funny          0
dtype: int64

##### Change into a 3-way classification problem by selecting only 1, 3 and 5 star reviews

In [14]:
yelp_new = yelp[(yelp.stars==5) | (yelp.stars==3) | (yelp.stars==1)]
yelp_new.head()

|   | business_id | date | review_id | stars | text | type | user_id | cool | useful | funny |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 9yKzy9PApeiPPOUJEtnvkg | 2011-01-26 | fWKvX83p0-ka4JS3dc6E5A | 5 | My wife took me here on my birthday for breakf... | review | rLtl8ZkDX5vH5nAx9C3q5Q | 2 | 5 | 0 |
| 1 | ZRJwVLyzEJq1VAihDhYiow | 2011-07-27 | IjZ33sJrzXqU-0X6U8NwyA | 5 | I have no idea why some people give bad review... | review | 0a2KyEL0d3Yb1V6aivbIuQ | 0 | 0 | 0 |
| 3 | _1QQZuf4zZOyFCvXc0o6Vg | 2010-05-27 | G-WvGaISbqqaMHlNnByodA | 5 | Rosie, Dakota, and I LOVE Chaparral Dog Park!!... | review | uZetl9T0NcROGOyFfughhg | 1 | 2 | 0 |
| 4 | 6ozycU1RpktNG2-1BroVtw | 2012-01-05 | 1uJFq2r5QfJG_6ExMRCaGw | 5 | General Manager Scott Petello is a good egg!!!... | review | vYmM4KTsC8ZfQBg-j5MWkw | 0 | 0 | 0 |
| 6 | zp713qNhx8d9KCJJnrw1xA | 2010-02-12 | riFQ3vxNpP4rWLk_CSri2A | 5 | Drop what you're doing and drive here. After I... | review | wFweIWhv2fREZV_dYkz_1g | 7 | 7 | 4 |


In [16]:
yelp_new.shape

(5547, 10)
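The same 1-, 3- and 5-star subset can be selected more compactly with `Series.isin`. The sketch below checks the equivalence on a small made-up frame; `toy` and its contents are hypothetical stand-ins, not rows from the Yelp file.

```python
import pandas as pd

# Hypothetical stand-in for the Yelp frame; only the 'stars' column matters here.
toy = pd.DataFrame({
    'stars': [1, 2, 3, 4, 5, 5],
    'text': ['bad', 'meh', 'ok', 'good', 'great', 'love it'],
})

# Boolean-OR filter, in the same style as the notebook cell above.
subset_or = toy[(toy.stars == 5) | (toy.stars == 3) | (toy.stars == 1)]

# Equivalent filter with isin; .copy() gives an independent frame
# so later column assignments do not trigger chained-assignment warnings.
subset_isin = toy[toy.stars.isin([1, 3, 5])].copy()

print(subset_isin.stars.tolist())
```

Both filters keep the original row labels, which is why the index of `yelp_new` skips the rows that were dropped.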

In [17]:
yelp_new.stars.value_counts().sort_index()

1     749
3    1461
5    3337
Name: stars, dtype: int64
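The three classes are imbalanced, so it helps to know what accuracy a trivial model would reach. The sketch below recomputes the class shares from the counts printed above and derives the accuracy of always predicting the most frequent class; the variable names are illustrative.

```python
import pandas as pd

# Class counts copied from the value_counts() output above.
counts = pd.Series({1: 749, 3: 1461, 5: 3337}, name='stars')

# Share of each class in the 3-class subset.
proportions = counts / counts.sum()

# Accuracy of a classifier that always predicts the most frequent class.
baseline_accuracy = proportions.max()

print(proportions.round(4))
print(round(baseline_accuracy, 4))
```

Any cross-validated accuracy close to this value means the model is doing little better than guessing 5 stars for every review.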

## 2. Feature extraction using **`CountVectorizer`** and chaining cross validation using **`Pipeline`**

In [12]:
X = yelp_new['text']
y = yelp_new['stars']

In [21]:
from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer()
from sklearn.naive_bayes import MultinomialNB
nb = MultinomialNB()
from sklearn.linear_model import LogisticRegression
logr = LogisticRegression()
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score  # replaces the removed sklearn.cross_validation module

##### Evaluating **`CountVectorizer`** and classifiers using default parameters

In [22]:
pipe_nb_vect = make_pipeline(vect, nb)
cross_val_score(pipe_nb_vect, X, y, cv=5, scoring='accuracy').mean()

0.78240718337046489

In [23]:
pipe_logr_vect = make_pipeline(vect, logr)
cross_val_score(pipe_logr_vect, X, y, cv=5, scoring='accuracy').mean()

0.79846061200128471
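Passing the whole pipeline to `cross_val_score` matters because the vectorizer is then refit inside every fold, so its vocabulary is learned from the training portion only. A minimal sketch on a hypothetical toy corpus (the texts and labels are made up for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus: four negative (1) and four positive (5) reviews.
texts = [
    'terrible food and rude staff', 'awful service never again',
    'bad coffee and dirty tables', 'rude waiter terrible meal',
    'great food and friendly staff', 'amazing service love this place',
    'excellent coffee and great pastries', 'friendly waiter amazing meal',
]
labels = [1, 1, 1, 1, 5, 5, 5, 5]

# The pipeline receives raw text; in each fold the vectorizer is fit on the
# training texts and only applied (transform) to the held-out texts.
pipe = make_pipeline(CountVectorizer(), MultinomialNB())
scores = cross_val_score(pipe, texts, labels, cv=2, scoring='accuracy')

print(scores.mean())
```

Vectorizing the full corpus first and cross-validating only the classifier would let held-out reviews influence the vocabulary, which the pipeline avoids.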

## 3. Feature extraction using **`TfidfVectorizer`** and chaining cross validation using **`Pipeline`**

##### Evaluating **`TfidfVectorizer`** and classifiers using default parameters

In [25]:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer()

In [26]:
pipe_nb_tfidf = make_pipeline(tfidf, nb)
cross_val_score(pipe_nb_tfidf, X, y, cv=5, scoring='accuracy').mean()

0.60248863303626088

In [27]:
pipe_logr_tfidf = make_pipeline(tfidf, logr)
cross_val_score(pipe_logr_tfidf, X, y, cv=5, scoring='accuracy').mean()

0.7719560751365464
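Sections 2 and 3 differ only in how text becomes numbers. As a small illustration on made-up documents (`docs` is hypothetical), the sketch below shows that both vectorizers build the same vocabulary, while `CountVectorizer` stores integer counts and `TfidfVectorizer` stores idf-weighted, normalized floats.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical mini-corpus used only for illustration.
docs = ['good food good service', 'bad food', 'good place']

count_vect = CountVectorizer()
tfidf_vect = TfidfVectorizer()
count_matrix = count_vect.fit_transform(docs)
tfidf_matrix = tfidf_vect.fit_transform(docs)

# Same tokenization, so the vocabulary and matrix shape are identical.
print(sorted(count_vect.vocabulary_))
print(count_matrix.shape, tfidf_matrix.shape)

# Raw integer counts for the first document ...
print(count_matrix[0].toarray())
# ... versus weighted floats for the same document.
print(tfidf_matrix[0].toarray().round(3))
```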

## 4. Tuning **`CountVectorizer`** and classifiers using **`RandomizedSearchCV`**

In [47]:
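import scipy as sp
import scipy.stats  # load the stats submodule so that sp.stats.uniform resolves
from sklearn.model_selection import RandomizedSearchCV  # replaces the removed sklearn.grid_search module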

In [48]:
pipe_nb_vect.named_steps.keys()

dict_keys(['countvectorizer', 'multinomialnb'])
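These step names are what the search parameters refer to: `make_pipeline` names each step after its lowercased class name, and `<step>__<parameter>` addresses one parameter of one step. A minimal sketch, independent of the Yelp data:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

pipe = make_pipeline(CountVectorizer(), MultinomialNB())

# Step names are derived automatically from the class names.
step_names = list(pipe.named_steps.keys())
print(step_names)

# '<step name>__<parameter>' sets a parameter on a single step.
pipe.set_params(countvectorizer__min_df=2, multinomialnb__alpha=0.5)
print(pipe.named_steps['countvectorizer'].min_df)
print(pipe.named_steps['multinomialnb'].alpha)
```

The keys of the search dictionaries in this section follow exactly this naming scheme.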

In [50]:
print(vect)

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)


In [51]:
print(nb)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)


In [29]:
# Creating search parameters for CountVectorizer and Naive Bayes
nb_vect = {}
nb_vect['countvectorizer__max_df'] = [0.1, 0.2, 0.3, 0.4]
nb_vect['countvectorizer__min_df'] = [1, 2, 3]
nb_vect['countvectorizer__stop_words'] = [None, 'english']
nb_vect['countvectorizer__ngram_range'] = [(1, 2)]
nb_vect['multinomialnb__alpha'] = sp.stats.uniform(scale=1)
nb_vect

{'countvectorizer__max_df': [0.1, 0.2, 0.3, 0.4],
 'countvectorizer__min_df': [1, 2, 3],
 'countvectorizer__ngram_range': [(1, 2)],
 'countvectorizer__stop_words': [None, 'english'],
 'multinomialnb__alpha': <scipy.stats._distn_infrastructure.rv_frozen at 0xb3d4208>}
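In a search space like this, lists are sampled uniformly and frozen `scipy.stats` distributions are sampled through their `rvs` method; `uniform(scale=1)` covers the interval from 0 to 1. The sketch below uses scikit-learn's `ParameterSampler` to draw a few candidates from a reduced, illustrative copy of the space (the name `space` is an assumption).

```python
import scipy.stats
from sklearn.model_selection import ParameterSampler

# Reduced copy of the search space above, for illustration only.
space = {
    'countvectorizer__min_df': [1, 2, 3],
    'countvectorizer__stop_words': [None, 'english'],
    'multinomialnb__alpha': scipy.stats.uniform(loc=0, scale=1),
}

# Draw three candidate settings; RandomizedSearchCV draws n_iter of them
# and cross-validates the pipeline once per candidate.
candidates = list(ParameterSampler(space, n_iter=3, random_state=1))
for params in candidates:
    print(params)
```

With `n_iter=20` and `cv=5`, as used in this notebook, each search fits the pipeline 100 times plus one final refit on all the data.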

In [33]:
rand_nb_vect = RandomizedSearchCV(pipe_nb_vect, nb_vect, cv=5, scoring='accuracy', n_iter=20, random_state=1)

In [34]:
%time rand_nb_vect.fit(X, y)

Wall time: 3min 7s


RandomizedSearchCV(cv=5, error_score='raise',
          estimator=Pipeline(memory=None,
     steps=[('countvectorizer', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), p..., vocabulary=None)), ('multinomialnb', MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True))]),
          fit_params={}, iid=True, n_iter=20, n_jobs=1,
          param_distributions={'countvectorizer__max_df': [0.1, 0.2, 0.3, 0.4], 'countvectorizer__min_df': [1, 2, 3], 'countvectorizer__stop_words': [None, 'english'], 'countvectorizer__ngram_range': [(1, 2)], 'multinomialnb__alpha': <scipy.stats._distn_infrastructure.rv_frozen object at 0x000000000B3D4208>},
          pre_dispatch='2*n_jobs', random_state=1, refit=True,
          scoring='accuracy', verbose=0)

In [35]:
print(rand_nb_vect.best_score_)
print(rand_nb_vect.best_params_)

0.8123309897241753
{'countvectorizer__max_df': 0.2, 'countvectorizer__min_df': 3, 'countvectorizer__ngram_range': (1, 2), 'countvectorizer__stop_words': None, 'multinomialnb__alpha': 0.78592436822684564}
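Because the search was created with `refit=True`, the fitted object also holds a pipeline retrained on all the data with the best parameters. The sketch below shows how such a tuned pipeline could be used for prediction; the toy corpus, the small search space and the example reviews are all hypothetical.

```python
import scipy.stats
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import RandomizedSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus: four negative (1) and four positive (5) reviews.
texts = [
    'terrible food and rude staff', 'awful service never again',
    'bad coffee and dirty tables', 'rude waiter terrible meal',
    'great food and friendly staff', 'amazing service love this place',
    'excellent coffee and great pastries', 'friendly waiter amazing meal',
]
labels = [1, 1, 1, 1, 5, 5, 5, 5]

pipe = make_pipeline(CountVectorizer(), MultinomialNB())
space = {
    'countvectorizer__ngram_range': [(1, 1), (1, 2)],
    'multinomialnb__alpha': scipy.stats.uniform(loc=0.01, scale=1),
}
search = RandomizedSearchCV(pipe, space, n_iter=4, cv=2,
                            scoring='accuracy', random_state=1)
search.fit(texts, labels)

# best_estimator_ is the pipeline refit on all data with best_params_.
tuned_pipe = search.best_estimator_
predictions = tuned_pipe.predict(['friendly staff and great coffee',
                                  'rude and dirty'])
print(search.best_params_)
print(predictions)
```

Calling `search.predict(...)` directly delegates to the same refit pipeline.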


In [36]:
pipe_logr_vect.named_steps.keys()

dict_keys(['countvectorizer', 'logisticregression'])

In [52]:
print(logr)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)


In [43]:
# Creating search parameters for CountVectorizer and Logistic Regression
logr_vect = {}
logr_vect['countvectorizer__max_df'] = [0.1, 0.2, 0.3, 0.4]
logr_vect['countvectorizer__min_df'] = [1, 2, 3]
logr_vect['countvectorizer__stop_words'] = [None, 'english']
logr_vect['countvectorizer__ngram_range'] = [(1, 2)]
logr_vect['logisticregression__penalty'] = ['l1', 'l2']
logr_vect['logisticregression__C'] = sp.stats.uniform(scale=1)
logr_vect

{'countvectorizer__max_df': [0.1, 0.2, 0.3, 0.4],
 'countvectorizer__min_df': [1, 2, 3],
 'countvectorizer__ngram_range': [(1, 2)],
 'countvectorizer__stop_words': [None, 'english'],
 'logisticregression__C': <scipy.stats._distn_infrastructure.rv_frozen at 0xb6c59e8>,
 'logisticregression__penalty': ['l1', 'l2']}
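Sampling both `'l1'` and `'l2'` works here because the `LogisticRegression` printed above uses `solver='liblinear'`, which supports both penalties. From scikit-learn 0.22 onward the default solver is `lbfgs`, which does not accept `penalty='l1'`, so on newer versions it is safer to set the solver explicitly. A minimal sketch on hypothetical two-class toy data:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus: four negative (1) and four positive (5) reviews.
texts = [
    'terrible food and rude staff', 'awful service never again',
    'bad coffee and dirty tables', 'rude waiter terrible meal',
    'great food and friendly staff', 'amazing service love this place',
    'excellent coffee and great pastries', 'friendly waiter amazing meal',
]
labels = [1, 1, 1, 1, 5, 5, 5, 5]

# Fixing solver='liblinear' keeps both penalties valid choices.
pipe = make_pipeline(CountVectorizer(), LogisticRegression(solver='liblinear'))

nonzero_coefs = {}
for penalty in ['l1', 'l2']:
    pipe.set_params(logisticregression__penalty=penalty,
                    logisticregression__C=0.5)
    pipe.fit(texts, labels)
    coef = pipe.named_steps['logisticregression'].coef_
    # Number of features that keep a non-zero weight under this penalty.
    nonzero_coefs[penalty] = int((coef != 0).sum())

print(nonzero_coefs)
```

`saga` is another solver that accepts both penalties; for the three-class target used in this project, check which solvers your installed version allows for multiclass data.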

In [45]:
rand_logr_vect = RandomizedSearchCV(pipe_logr_vect, logr_vect, cv=5, scoring='accuracy', n_iter=20, random_state=1)

In [46]:
%time rand_logr_vect.fit(X, y)

Wall time: 4min 56s


RandomizedSearchCV(cv=5, error_score='raise',
          estimator=Pipeline(memory=None,
     steps=[('countvectorizer', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
  ...ty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False))]),
          fit_params={}, iid=True, n_iter=20, n_jobs=1,
          param_distributions={'countvectorizer__max_df': [0.1, 0.2, 0.3, 0.4], 'countvectorizer__min_df': [1, 2, 3], 'countvectorizer__stop_words': [None, 'english'], 'countvectorizer__ngram_range': [(1, 2)], 'logisticregression__penalty': ['l1', 'l2'], 'logisticregression__C': <scipy.stats._distn_infrastructure.rv_frozen object at 0x000000000B6C59E8>},
          pre_dispatch='2*n_jobs', random_state=1, refit=True,
          scoring='accura

In [49]:
print(rand_logr_vect.best_score_)
print(rand_logr_vect.best_params_)

0.8062015503875969
{'countvectorizer__max_df': 0.2, 'countvectorizer__min_df': 1, 'countvectorizer__ngram_range': (1, 2), 'countvectorizer__stop_words': None, 'logisticregression__C': 0.47838106564430904, 'logisticregression__penalty': 'l2'}
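`best_score_` and `best_params_` report only the winning candidate. When the search comes from `sklearn.model_selection`, every sampled candidate is available in `cv_results_`, which loads neatly into a DataFrame. The sketch below demonstrates this on a hypothetical toy corpus with a deliberately small search space.

```python
import pandas as pd
import scipy.stats
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus: four negative (1) and four positive (5) reviews.
texts = [
    'terrible food and rude staff', 'awful service never again',
    'bad coffee and dirty tables', 'rude waiter terrible meal',
    'great food and friendly staff', 'amazing service love this place',
    'excellent coffee and great pastries', 'friendly waiter amazing meal',
]
labels = [1, 1, 1, 1, 5, 5, 5, 5]

pipe = make_pipeline(CountVectorizer(), LogisticRegression(solver='liblinear'))
space = {
    'countvectorizer__ngram_range': [(1, 1), (1, 2)],
    'logisticregression__C': scipy.stats.uniform(loc=0.01, scale=1),
}
search = RandomizedSearchCV(pipe, space, n_iter=4, cv=2,
                            scoring='accuracy', random_state=1)
search.fit(texts, labels)

# One row per sampled candidate, best-ranked first.
results = pd.DataFrame(search.cv_results_)
cols = ['rank_test_score', 'mean_test_score', 'std_test_score', 'params']
ranked = results[cols].sort_values('rank_test_score')
print(ranked)
```

Looking at the spread of `mean_test_score` across candidates shows whether the best setting is clearly better or only marginally ahead of the rest.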


## 5. Tuning **`TfidfVectorizer`** and classifiers using **`RandomizedSearchCV`**

In [53]:
pipe_nb_tfidf.named_steps.keys()

dict_keys(['tfidfvectorizer', 'multinomialnb'])

In [54]:
print(tfidf)

TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True,
        stop_words=None, strip_accents=None, sublinear_tf=False,
        token_pattern='(?u)\\b\\w\\w+\\b', tokenizer=None, use_idf=True,
        vocabulary=None)


In [55]:
# Creating search parameters for TfidfVectorizer and Naive Bayes
nb_tfidf = {}
nb_tfidf['tfidfvectorizer__max_df'] = [0.1, 0.2, 0.3, 0.4]
nb_tfidf['tfidfvectorizer__min_df'] = [1, 2, 3]
nb_tfidf['tfidfvectorizer__stop_words'] = [None, 'english']
nb_tfidf['tfidfvectorizer__ngram_range'] = [(1, 2)]
nb_tfidf['tfidfvectorizer__norm'] = ['l1', 'l2', None]
nb_tfidf['multinomialnb__alpha'] = sp.stats.uniform(scale=1)
nb_tfidf

{'multinomialnb__alpha': <scipy.stats._distn_infrastructure.rv_frozen at 0x123b8c50>,
 'tfidfvectorizer__max_df': [0.1, 0.2, 0.3, 0.4],
 'tfidfvectorizer__min_df': [1, 2, 3],
 'tfidfvectorizer__ngram_range': [(1, 2)],
 'tfidfvectorizer__norm': ['l1', 'l2', None],
 'tfidfvectorizer__stop_words': [None, 'english']}
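The `norm` option is the one parameter here that `CountVectorizer` does not have. As a small illustration on made-up documents (`docs` is hypothetical), the sketch below shows what each setting does to the rows of the tf-idf matrix.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical mini-corpus used only for illustration.
docs = ['good food good service', 'bad food', 'good place']

row_stats = {}
for norm in ['l1', 'l2', None]:
    matrix = TfidfVectorizer(norm=norm).fit_transform(docs).toarray()
    row_stats[norm] = {
        # Sum of absolute values per row: 1 for every row when norm='l1'.
        'l1_sum': np.abs(matrix).sum(axis=1),
        # Euclidean length per row: 1 for every row when norm='l2'.
        'l2_length': np.linalg.norm(matrix, axis=1),
    }
    print(norm,
          row_stats[norm]['l1_sum'].round(3),
          row_stats[norm]['l2_length'].round(3))
```

With `norm=None` the rows keep their raw tf-idf magnitudes, so longer reviews produce larger feature values.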

In [56]:
rand_nb_tfidf = RandomizedSearchCV(pipe_nb_tfidf, nb_tfidf, cv=5, scoring='accuracy', n_iter=20, random_state=1)

In [57]:
%time rand_nb_tfidf.fit(X, y)

Wall time: 3min 15s


RandomizedSearchCV(cv=5, error_score='raise',
          estimator=Pipeline(memory=None,
     steps=[('tfidfvectorizer', TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_i...   vocabulary=None)), ('multinomialnb', MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True))]),
          fit_params={}, iid=True, n_iter=20, n_jobs=1,
          param_distributions={'tfidfvectorizer__max_df': [0.1, 0.2, 0.3, 0.4], 'tfidfvectorizer__min_df': [1, 2, 3], 'tfidfvectorizer__stop_words': [None, 'english'], 'tfidfvectorizer__ngram_range': [(1, 2)], 'tfidfvectorizer__norm': ['l1', 'l2', None], 'multinomialnb__alpha': <scipy.stats._distn_infrastructure.rv_frozen object at 0x00000000123B8C50>},
          pre_dispatch='2*n_jobs', random_state=1, refit=True,
          scoring='accurac

In [58]:
print(rand_nb_tfidf.best_score_)
print(rand_nb_tfidf.best_params_)

0.7934018388318009
{'multinomialnb__alpha': 0.018523381574367836, 'tfidfvectorizer__max_df': 0.2, 'tfidfvectorizer__min_df': 2, 'tfidfvectorizer__ngram_range': (1, 2), 'tfidfvectorizer__norm': 'l2', 'tfidfvectorizer__stop_words': None}


In [60]:
pipe_logr_tfidf.named_steps.keys()

dict_keys(['tfidfvectorizer', 'logisticregression'])

In [61]:
# Creating search parameters for TfidfVectorizer and Logistic Regression
logr_tfidf = {}
logr_tfidf['tfidfvectorizer__max_df'] = [0.1, 0.2, 0.3, 0.4, 0.5]
logr_tfidf['tfidfvectorizer__min_df'] = [1, 2, 3]
logr_tfidf['tfidfvectorizer__stop_words'] = [None, 'english']
logr_tfidf['tfidfvectorizer__ngram_range'] = [(1, 2)]
logr_tfidf['tfidfvectorizer__norm'] = ['l1', 'l2', None]
logr_tfidf['logisticregression__penalty'] = ['l1', 'l2']
logr_tfidf['logisticregression__C'] = sp.stats.uniform(scale=1)
logr_tfidf

{'logisticregression__C': <scipy.stats._distn_infrastructure.rv_frozen at 0x1271dac8>,
 'logisticregression__penalty': ['l1', 'l2'],
 'tfidfvectorizer__max_df': [0.1, 0.2, 0.3, 0.4, 0.5],
 'tfidfvectorizer__min_df': [1, 2, 3],
 'tfidfvectorizer__ngram_range': [(1, 2)],
 'tfidfvectorizer__norm': ['l1', 'l2', None],
 'tfidfvectorizer__stop_words': [None, 'english']}

In [62]:
rand_logr_tfidf = RandomizedSearchCV(pipe_logr_tfidf, logr_tfidf, cv=5, scoring='accuracy', n_iter=20, random_state=1)

In [63]:
%time rand_logr_tfidf.fit(X, y)

Wall time: 10min 9s


RandomizedSearchCV(cv=5, error_score='raise',
          estimator=Pipeline(memory=None,
     steps=[('tfidfvectorizer', TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_i...ty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False))]),
          fit_params={}, iid=True, n_iter=50, n_jobs=1,
          param_distributions={'tfidfvectorizer__max_df': [0.1, 0.2, 0.3, 0.4, 0.5], 'tfidfvectorizer__min_df': [1, 2, 3], 'tfidfvectorizer__stop_words': [None, 'english'], 'tfidfvectorizer__ngram_range': [(1, 2)], 'tfidfvectorizer__norm': ['l1', 'l2', None], 'logisticregression__penalty': ['l1', 'l2'], 'logisticregression__C': <scipy.stats._distn_infrastructure.rv_frozen object at 0x000000001271DAC8>},
          pre_dispatch='2*n_jobs', ran

In [64]:
print(rand_logr_tfidf.best_score_)
print(rand_logr_tfidf.best_params_)

0.8053001622498648
{'logisticregression__C': 0.0093317967596888707, 'logisticregression__penalty': 'l2', 'tfidfvectorizer__max_df': 0.5, 'tfidfvectorizer__min_df': 1, 'tfidfvectorizer__ngram_range': (1, 2), 'tfidfvectorizer__norm': None, 'tfidfvectorizer__stop_words': None}


#### Optimal parameters based on **`RandomizedSearchCV`**

In [65]:
print(rand_nb_vect.best_score_)
print(rand_nb_vect.best_params_)

0.8123309897241753
{'countvectorizer__max_df': 0.2, 'countvectorizer__min_df': 3, 'countvectorizer__ngram_range': (1, 2), 'countvectorizer__stop_words': None, 'multinomialnb__alpha': 0.78592436822684564}
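To see all four combinations side by side, the sketch below collects the 5-fold cross-validated accuracies reported in the outputs above (rounded to four decimals) into one table; the column names are illustrative.

```python
import pandas as pd

# Accuracies copied from the notebook outputs above, rounded to 4 decimals.
results = pd.DataFrame(
    [
        ('CountVectorizer', 'MultinomialNB', 0.7824, 0.8123),
        ('CountVectorizer', 'LogisticRegression', 0.7985, 0.8062),
        ('TfidfVectorizer', 'MultinomialNB', 0.6025, 0.7934),
        ('TfidfVectorizer', 'LogisticRegression', 0.7720, 0.8053),
    ],
    columns=['vectorizer', 'classifier', 'default_accuracy', 'tuned_accuracy'],
)

# Improvement from tuning for each combination.
results['gain'] = results['tuned_accuracy'] - results['default_accuracy']

# Combination with the highest tuned accuracy.
best = results.sort_values('tuned_accuracy', ascending=False).iloc[0]

print(results)
print(best['vectorizer'], best['classifier'])
```

Tuning helps every combination, with the largest gain for `TfidfVectorizer` with `MultinomialNB`, while `CountVectorizer` with `MultinomialNB` remains the best overall.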
