### PROJECT MCMULTY

#### AWS MySQL RDS Contents:
- business.db (30M)
- tip.db (141M)
- review.db (3.5G)
- checkin.db (4.8M)
- business_hours.db (13M)
- user.db (1.3G)

<center>**Click below for all other data information** </center>
[![](https://s3-media2.fl.yelpcdn.com/assets/srv0/styleguide/1ea40efd80f5/assets/img/brand_guidelines/yelp_fullcolor.png)](https://www.yelp.com/dataset/documentation/json)

### Table of Contents
1. [AWS RDS Connection](#0)
2. [Import Data](#2)<br>
     2.1 [Read yelp data from RDS](#2.1)<br>
     2.2 [Read supplement data from csv and convert all sub-categories to primary categories](#2.2)<br>
3. [Exploratory Data Analysis](#3)<br>
     3.1 [Visualize star rating distribution](#3.1)<br>
     3.2 [Visualize category distribution](#3.2)<br>
     3.3 [Visualize Review Counts by City Distribution](#3.3)<br>
4. [Data Mining](#4)<br>
5. [Feature Engineering](#5)<br>
     5.1 [Calculate number of business days within a week](#5.1)<br>
     5.2 [Calculate number of business hours within a week](#5.2)<br>
     5.3 [Extract the length of the business name](#5.3)<br>
     5.4 [Extract number of vowels in the business name](#5.4)<br>
     5.5 [Extract number of friends on Yelp](#5.5)<br>
     5.6 [Final data cleaning prior to modeling](#5.6)<br>
6. [Baseline Model](#6)<br>
     6.1 [Define classification metrics function & downcasting function](#6.1)<br>
     6.2 [Baseline Logistic Regression model (One feature)](#6.2)<br>
     6.3 [Expanding Logistic Regression with 5 original features](#6.3)<br>
     6.4 [Expanding Logistic Regression with 5 feature engineered features](#6.4)<br>
     6.5 [Expanding Logistic Regression with 10 features](#6.5)<br>
     6.6 [Logistic Regression Model (All features)](#6.6)<br>
     6.7 [Baseline RF model (One feature)](#6.7)<br>
     6.8 [Baseline xgBoost model (One feature)](#6.8)<br>
     6.9 [RF model (All features)](#6.9)<br>
     6.10 [xgBoost model (All features)](#6.10)<br>
7. [Natural Language Processing (NLP) as additional features](#7)<br>
     7.1 [TFIDFVectorizer](#7.1)<br>
     7.2 [RF Model (TFIDF + all features)](#7.2)<br>
     7.3 [xgBoost model (TFIDF + all features)](#7.3)<br>
8. [Model Optimization](#8)<br>
9. [Final test-scores](#9)<br>

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import xgboost as xgb
import mysql.connector as sql
import math
import itertools
from functools import reduce
import statsmodels.formula.api as smf
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score, GridSearchCV, RandomizedSearchCV, train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.metrics import accuracy_score
from sklearn.metrics import recall_score
from sklearn.metrics import f1_score
from sklearn.metrics import precision_score
from sklearn.metrics import roc_curve, auc
from sklearn.metrics import make_scorer
from sklearn.metrics import confusion_matrix
from sklearn import metrics
from itertools import groupby
from sklearn.model_selection import KFold
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import StratifiedKFold
import warnings

warnings.filterwarnings('ignore')
%matplotlib inline
sns.set()

### <a id=0></a>1. Define AWS RDS endpoint connection

<a href="url"><center><img src="https://github.com/whosivan/sf18_ds9/blob/master/student_submissions/projects/luther/Cui_Ivan/AWS_setup.png?raw=true" align="left" height="480" width="480" ></center></a>

In [None]:
db_connection = sql.connect(host = 'yelpinstance.cvchd2jvtnxy.us-east-1.rds.amazonaws.com', 
                            database = 'yelpdb', 
                            user = 'root', 
                            password = 'admin123')

### <a id=2></a> 2. Import Data

#### <a id=2.1></a> 2.1 Read yelp data from RDS

In [None]:
%%time
business_df = pd.read_sql('''
select b.business_id, b.name as business_name, b.city, b.state, b.stars as business_stars, b.review_count, b.is_open, b.categories, bh.monday, bh.tuesday, bh.wednesday, bh.thursday, bh.friday, bh.saturday, bh.sunday
from business as b
join business_hours as bh
on b.business_id = bh.business_id
''', con = db_connection)

In [None]:
%%time
review_df = pd.read_sql('''SELECT user_id,
                                  business_id,
                                  stars,
                                  text as review_text,
                                  useful as r_review,
                                  funny as r_funny, 
                                  cool as r_cool FROM review LIMIT 1000000''', con = db_connection)
tip_df = pd.read_sql('''SELECT business_id,
                               count(text) as tip_count FROM tip
                               group by 1''', con = db_connection)
checkin_df = pd.read_sql('''SELECT business_id,
                                   checkins FROM checkin''', con = db_connection)
user_df = pd.read_sql('''SELECT user_id,
                                name as user_name,
                                review_count as user_review_counts,
                                friends,
                                useful as u_useful,
                                funny as u_funny,
                                cool as u_cool,
                                fans as u_fans,
                                average_stars,
                                compliment_hot,
                                compliment_more,
                                compliment_profile,
                                compliment_cute,
                                compliment_list,
                                compliment_note,
                                compliment_plain,
                                compliment_cool,
                                compliment_funny,
                                compliment_writer,
                                compliment_photos FROM user''', con = db_connection)

#### <a id=2.2></a> 2.2 Read supplement data from csv and convert all sub-categories to primary categories

In [None]:
category_list = pd.read_csv('../yelp_data/yelp-business-categories.csv')
category_list.drop_duplicates(subset='Sub-Categories', keep="first", inplace=True)

def pd_to_dict(df):
    category_dict = {k: g["Sub-Categories"].tolist() for k,g in df.groupby("Primary_Categories")}
    category_dict = dict((v,k) for k in category_dict for v in category_dict[k])
    return category_dict

category_dict = pd_to_dict(category_list)
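
As a small illustration of the lookup built above, the sketch below inverts a primary-to-sub-category table into a sub-to-primary dictionary and maps labels through it. The three toy rows and the name `sub_to_primary` are assumptions for illustration only; the real CSV contents are not reproduced here.

```python
import pandas as pd

# Toy stand-in for yelp-business-categories.csv (values are illustrative only)
toy = pd.DataFrame({
    "Primary_Categories": ["Food", "Food", "Pets"],
    "Sub-Categories": ["Bakeries", "Coffee & Tea", "Pet Groomers"],
})

# Each sub-category becomes a key that points at its primary category
sub_to_primary = dict(zip(toy["Sub-Categories"], toy["Primary_Categories"]))

# Series.map leaves NaN wherever a label is missing from the dictionary
labels = pd.Series(["Bakeries", "Pet Groomers", "Unknown"]).map(sub_to_primary)
print(labels.tolist())
```

Labels that are absent from the dictionary come back as NaN, which is why a fill step is needed before modeling.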

In [None]:
def get_primary_category_label(df):
    df.categories = df.categories.apply(lambda x:x.rstrip('\r')).str[0:].str.split(';')
    df.categories = df.categories.apply(lambda col: col[0])
    return df

get_primary_category_label(business_df).head(1)

In [None]:
def map_main_category(df, dict_):
    df['primary_category'] = df['categories'].map(dict_)
    df.drop(['categories'], axis =1, inplace=True)
    return df

map_main_category(business_df, category_dict).head(1)

### <a id=3></a> 3. Exploratory Data Analysis

#### <a id=3.1></a>3.1 Visualize star rating distribution

In [None]:
#Get the distribution of the ratings
x = business_df['business_stars'].value_counts().sort_index()
#plot
plt.figure(figsize=(15, 8))
ax = sns.barplot(x=x.index, y=x.values, alpha=0.8)
plt.title("Star Rating Distribution")
plt.ylabel('# of businesses', fontsize=12)
plt.xlabel('Star Ratings ', fontsize=12)

#adding the text labels
rects = ax.patches
labels = x.values
for rect, label in zip(rects, labels):
    height = rect.get_height()
    ax.text(rect.get_x() + rect.get_width()/2, height + 5, label, ha='center', va='bottom')

plt.show()

#### <a id=3.2></a>3.2 Visualize category distribution

In [None]:
x = business_df.primary_category.value_counts()
print("There are ",len(x)," different types/categories of Businesses in Yelp.")
count = x.iloc[0:20].sum()
print('Number of businesses in the top 20 categories: {}'.format(count))

#prep for chart
x = x.sort_values(ascending = False)
top_categories = x.index.tolist()

#chart
plt.figure(figsize = (16,8))
ax = sns.barplot(x=x.index, y=x.values, alpha = 0.8)
plt.title("Category Distribution",fontsize = 25)
locs, labels = plt.xticks()
plt.setp(labels, rotation = 80)
plt.ylabel('# businesses', fontsize = 12)
plt.xlabel('Category', fontsize = 12)

#adding the text labels
rects = ax.patches
labels = x.values
for rect, label in zip(rects, labels):
    height = rect.get_height()
    ax.text(rect.get_x() + rect.get_width()/2, height + 5, label, ha = 'center', va = 'bottom')

plt.show()

#### <a id=3.3></a> 3.3 Visualize Review Counts by City Distribution 

In [None]:
x = business_df['city'].value_counts().sort_values(ascending=False).iloc[0:20]
plt.figure(figsize=(16,8))
ax = sns.barplot(x=x.index, y=x.values, alpha=0.8)
plt.title("Business Count by City (Top 20)")
locs, labels = plt.xticks()
plt.setp(labels, rotation=45)
plt.ylabel('# businesses', fontsize=12)
plt.xlabel('City', fontsize=12)

#adding the text labels
rects = ax.patches
labels = x.values
for rect, label in zip(rects, labels):
    height = rect.get_height()
    ax.text(rect.get_x() + rect.get_width()/2, height + 5, label, ha = 'center', va = 'bottom')

plt.show()

### <a id=4></a> 4. Data Mining

In [None]:
print('business+business_hours df shape: {}'.format(business_df.shape))
print('review df shape: {}'.format(review_df.shape))
print('tip df shape: {}'.format(tip_df.shape))
print('checkin df shape: {}'.format(checkin_df.shape))
print('user df shape: {}'.format(user_df.shape))

In [None]:
# dfs = [business_df, tip_df, checkin_df, review_df]
# df2 = reduce(lambda left,right: pd.merge(left,right,on='business_id'), dfs)
# maindf = pd.merge(df2, user_df, how='inner', on='user_id')
# maindf.describe()

In [None]:
#join tables
testdf = pd.merge(business_df, tip_df, how='inner', on='business_id')
testdf2 = pd.merge(testdf, checkin_df, how='inner', on='business_id')
testdf3 = pd.merge(testdf2, review_df, how='inner', on='business_id')
maindf = pd.merge(testdf3, user_df, how='inner', on='user_id')

In [None]:
maindf.shape

In [None]:
maindf.describe()

In [None]:
maindf.info()

### <a id=5></a>5. Feature Engineering

#### <a id=5.1></a> 5.1 Calculate number of business days within a week

In [None]:
weekdays = ['monday', 'tuesday', 'wednesday', 'thursday', 'friday']
weekends = ['saturday', 'sunday']

def business_days(df):
    weekdays = ['monday', 'tuesday', 'wednesday', 'thursday', 'friday']
    weekends = ['saturday', 'sunday']
    temp = df[weekdays+weekends].replace('None', np.nan)
    df['number_of_business_days'] = 7 - temp.isnull().sum(axis = 1)
    return df

business_days(maindf).sample(1, random_state = 5)

In [None]:
#how many businesses on the list have no recorded opening days
print('Total records with 0 business hours: {}'.format(len(maindf.loc[maindf.number_of_business_days == 0])))

In [None]:
#how many businesses on the list are not open?
print('Total records shown as not open: {}'.format(len(maindf.loc[maindf.is_open == 0])))

#### <a id = 5.2></a> 5.2 Calculate number of business hours within a week

In [None]:
def business_hours(df):
    days = ['monday', 'tuesday', 'wednesday', 'thursday', 'friday', 'saturday', 'sunday']
    store_hours = pd.DataFrame(index = df.index)

    for day in days:
        #closed days are stored as the string 'None'; treat them as an empty '0-0' span
        span = df[day].replace(to_replace='None', value='0-0').str.split('-', expand=True)
        #keep only the hour part of each opening and closing time; minutes are ignored
        open_hour = span[0].str.split(':').str[0].astype(int)
        close_hour = span[1].str.split(':').str[0].astype(int)
        hours = close_hour - open_hour
        #a negative difference means the business closes after midnight
        store_hours[day] = hours.where(hours >= 0, hours + 24)
    
    df['total_business_hours'] = store_hours.sum(axis=1)
    df.drop(days, axis = 1, inplace = True)
    
    return df
business_hours(maindf).head(1)
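
To make the hour arithmetic above concrete, here is a minimal pure-Python sketch of the same rule applied to one week of opening spans. The helper name `span_hours` and the sample strings (written as `open-close` with `H:M` times) are assumptions for illustration; minutes are ignored, as in the function above.

```python
def span_hours(span):
    """Whole hours covered by an 'H:M-H:M' span; 'None' means closed."""
    if span == 'None':
        return 0
    open_t, close_t = span.split('-')
    open_hour = int(open_t.split(':')[0])
    close_hour = int(close_t.split(':')[0])
    hours = close_hour - open_hour
    # A negative difference means the business closes after midnight
    return hours + 24 if hours < 0 else hours


# Five 9-to-5 weekdays, a late Saturday shift, closed on Sunday
week = ['9:0-17:0'] * 5 + ['22:0-2:0', 'None']
total = sum(span_hours(s) for s in week)
print(total)  # 44
```

The overnight span contributes 4 hours (22:00 to 02:00), so the weekly total is 5 × 8 + 4 + 0 = 44.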

#### <a id=5.3></a> 5.3 Extract the length of the business name

In [None]:
def business_name_length(df):
    df['business_name_length'] = df['business_name'].apply(lambda x:len(x.split(' ')))
    return df

business_name_length(maindf).head(1)

#### <a id=5.4></a> 5.4 Extract number of vowels in the business name

In [None]:
def vowels_in_business_name(df):
    vow_list = []
    for title in df.business_name:
        vow = 0
        word = title.split(' ')
        for w in word:
            ws = w.lstrip('"').rstrip('"')
            vow += sum(letter in 'aeiou' for letter in ws.lower())
        vow_list.append(vow)

    df['number_of_vowels_in_business_name'] = vow_list
    return df

vowels_in_business_name(maindf).head(1)
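
The loop above can also be expressed with pandas string methods. The sketch below is one possible vectorized variant, not the notebook's own implementation; the sample names are made up for illustration.

```python
import pandas as pd

# Made-up business names, including one wrapped in double quotes
names = pd.Series(["Joe's Pizza", 'Sky Gym', '"The Olive Tree"'])

# Lower-case each name, then count characters matching the vowel class
vowel_counts = names.str.lower().str.count('[aeiou]')
print(vowel_counts.tolist())
```

Quotes and spaces never match the vowel pattern, so no stripping or splitting is needed in this variant.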

#### <a id=5.5></a> 5.5 Extract number of friends on Yelp

In [None]:
def number_of_friends(df):
    df['number_of_friends'] = df.friends.replace('None', np.nan).str.split(', ').str.len()
    df['number_of_friends'] = df['number_of_friends'].fillna(0.0).astype(int)
    #'friends' is kept here and dropped with the other helper columns in section 5.6
    return df

number_of_friends(maindf).head(1)
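
A quick check of the counting rule on toy data may help here. The friend IDs below are invented placeholders, assuming the same comma-plus-space separator and the `'None'` marker used above.

```python
import numpy as np
import pandas as pd

# Invented friend lists: three friends, no friends, one friend
friends = pd.Series(['a1, b2, c3', 'None', 'd4'])

counts = (friends.replace('None', np.nan)  # 'None' marks users without friends
                 .str.split(', ')          # one list element per friend ID
                 .str.len()                # list length = number of friends
                 .fillna(0)
                 .astype(int))
print(counts.tolist())
```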

#### <a id=5.6></a> 5.6 Final data cleaning prior to modeling

In [None]:
def drop_cols_post_feature_engineering(df):
    drop_final_maindfcol = ['user_id', 'user_id_y', 'review_id', 'business_id', 'friends']
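    #some listed columns may be absent after the merges, so missing labels are skipped
    df.drop(drop_final_maindfcol, axis = 1, inplace = True, errors = 'ignore')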
    return df

drop_cols_post_feature_engineering(maindf).head(1)

### <a id=6></a> 6. Baseline Model

#### <a id=6.1></a> 6.1 Define classification metrics function & downcasting function

In [None]:
yelp_df = maindf.copy()
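#unmapped categories become 'Others'; numeric columns are left untouched
yelp_df['primary_category'] = yelp_df['primary_category'].fillna('Others')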

In [None]:
# Reset dtypes
temp_dtypes = {
    'business_stars': np.float32,
    'review_count': np.int32,
    'is_open': np.int8,
    'tip_count': np.int32,
    'checkins': np.int32,
    'stars': np.int32,
    'r_review': np.int32,
    'r_funny': np.int32,
    'r_cool': np.int32,
    'user_review_counts': np.int32,
    'u_useful': np.int32,
    'u_funny': np.int32,
    'u_cool': np.int32,
    'u_fans': np.int32,
    'average_stars': np.float32,
    'compliment_cute': np.int32,
    'compliment_more': np.int32,
    'compliment_profile': np.int32,
    'compliment_cute': np.int32,
    'compliment_list': np.int32,
    'compliment_note': np.int32,
    'compliment_plain': np.int32,
    'compliment_cool': np.int32,
    'compliment_funny': np.int32,
    'compliment_writer': np.int32,
    'compliment_photos': np.int32,
    'number_of_business_days': np.int32,
    'total_business_hours': np.int32,
    'business_name_length': np.int32,
    'number_of_vowels_in_business_name': np.int32,
    'number_of_friends': np.int32,
    }

for col, col_type in temp_dtypes.items():
    yelp_df[col] = yelp_df[col].astype(col_type)

In [None]:
def classification_loops(df, x, y, model_list, model_names_list):
    scores_table = pd.DataFrame()
    scores = []
    baseline_xtrain, baseline_xtest, baseline_ytrain, baseline_ytest = train_test_split(x, y, 
                                                                                        test_size=0.2,random_state=42)

    #get scores
    for i, model in enumerate(model_list):
        scores.append(np.mean(cross_val_score(model, baseline_xtrain, baseline_ytrain, cv=3, 
                                              scoring=make_scorer(metrics.accuracy_score))))
        print(i)
        scores.append(np.mean(cross_val_score(model, baseline_xtrain, baseline_ytrain, cv=3, 
                                              scoring=make_scorer(metrics.precision_score, average='macro'))))
        print(i)
        scores.append(np.mean(cross_val_score(model, baseline_xtrain, baseline_ytrain, cv=3, 
                                              scoring=make_scorer(metrics.recall_score, average='macro'))))
        print(i)
        scores.append(np.mean(cross_val_score(model, baseline_xtrain, baseline_ytrain, cv=3, 
                                              scoring=make_scorer(metrics.f1_score, average='macro'))))
        print(i)
    feature_scores = pd.DataFrame(np.array(scores).reshape(len(model_list), 4), 
                         columns=['Accuracy', 'Precision', 'Recall', 'F1'], index=model_names_list)

    scores_table = pd.concat([scores_table, feature_scores])
    return scores_table 
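
The scoring loop above runs a separate cross-validation for every metric. As a hedged alternative sketch, scikit-learn's `cross_validate` accepts several scorers in a single pass. The toy data from `make_classification` and the names `scoring`, `cv_out`, and `summary` are assumptions for illustration, not part of the project.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

# Synthetic 3-class problem standing in for the Yelp features
X, y = make_classification(n_samples=300, n_features=6, n_informative=4,
                           n_classes=3, random_state=42)

# Built-in scorer names; the *_macro variants average over classes
scoring = ['accuracy', 'precision_macro', 'recall_macro', 'f1_macro']

# Each fold is fitted once and scored with all four metrics
cv_out = cross_validate(LogisticRegression(max_iter=1000), X, y,
                        cv=3, scoring=scoring)

summary = {name: cv_out['test_' + name].mean() for name in scoring}
print(summary)
```

Because each fold is fitted only once, this does roughly a quarter of the fitting work of four separate `cross_val_score` calls.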

In [None]:
def plot_confusion_matrix(cm, classes,
                          normalize=False,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        print("Normalized confusion matrix")
    else:
        print('Confusion matrix, without normalization')

    #print(cm)
    plt.figure(figsize=(10,10))
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=90)
    plt.yticks(tick_marks, classes)

    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt),
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")
    
    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')

#### <a id=6.2></a> 6.2 Baseline Logistic Regression model (One feature)

In [None]:
#define baseline model features and models
blv1_x = yelp_df[['review_count']]
blv1_y = yelp_df.primary_category
baseline_xtrain, baseline_xtest, baseline_ytrain, baseline_ytest = train_test_split(blv1_x, blv1_y, 
                                                                                     test_size=0.2,random_state=42)

In [None]:
yelp_lr = LogisticRegression(n_jobs=-1)
yelp_baseline_lr = yelp_lr.fit(baseline_xtrain, baseline_ytrain)
yelp_baseline_lr_pred = yelp_baseline_lr.predict(baseline_xtest)

In [None]:
print('baseline logistic regression coefficient:')
list(zip(yelp_baseline_lr.classes_, yelp_baseline_lr.coef_))

In [None]:
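#roc_curve returns (fpr, tpr, thresholds); use the predict_proba column that belongs to 'Pets'
pets_col = list(yelp_baseline_lr.classes_).index('Pets')
yelp_lrfpr, yelp_lrtpr, yelp_thresh_roc = roc_curve(np.array(baseline_ytest), 
                                                    yelp_baseline_lr.predict_proba(baseline_xtest)[:, pets_col], 
                                                    pos_label='Pets')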

plt.figure(figsize=(7,7))
plt.plot(yelp_lrfpr, yelp_lrtpr, color='#ff6666')
plt.plot([0,1], [0,1],linestyle='--',color='gray')
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.text(0.1, 0.9,'AUC: {:0.3f}'.format(metrics.auc(yelp_lrfpr, yelp_lrtpr)), fontsize=15)
plt.title('ROC curve for Yelp Baseline Logistic Regression');
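
For a multiclass model, a one-vs-rest ROC curve needs the probability column of the chosen positive class, and `roc_curve` returns its arrays in the order false positive rate, true positive rate, thresholds. The sketch below shows both points on synthetic data; the class names, the single random feature, and the variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, auc

# Synthetic one-feature data with three string labels
rng = np.random.RandomState(0)
X = rng.normal(size=(300, 1))
y = np.where(X[:, 0] > 0.5, 'Pets',
             np.where(X[:, 0] < -0.5, 'Automotive', 'Food'))

clf = LogisticRegression().fit(X, y)

# classes_ is sorted, so look the positive class up instead of hard-coding a column
pos_idx = list(clf.classes_).index('Pets')
scores = clf.predict_proba(X)[:, pos_idx]

# Return order is (fpr, tpr, thresholds)
fpr, tpr, thresholds = roc_curve(y, scores, pos_label='Pets')
print(round(auc(fpr, tpr), 3))
```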

In [None]:
print('lr_baseline accuracy: {}'.format(np.mean(cross_val_score(yelp_lr, blv1_x, blv1_y, 
                                                           cv=3, scoring=make_scorer(metrics.accuracy_score)))))
print('lr_baseline precision: {}'.format(np.mean(cross_val_score(yelp_lr, blv1_x, blv1_y, 
                                                           cv=3, scoring=make_scorer(metrics.precision_score, average='macro')))))
print('lr_baseline recall: {}'.format(np.mean(cross_val_score(yelp_lr, blv1_x, blv1_y, 
                                                           cv=3, scoring=make_scorer(metrics.recall_score, average='macro')))))
print('lr_baseline F1: {}'.format(np.mean(cross_val_score(yelp_lr, blv1_x, blv1_y, 
                                                           cv=3, scoring=make_scorer(metrics.f1_score, average='macro')))))

In [None]:
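#pass the fitted class order explicitly so the tick labels match the matrix rows and columns
lr_baseline_cnf = confusion_matrix(baseline_ytest, yelp_baseline_lr_pred, labels=yelp_baseline_lr.classes_)
plot_confusion_matrix(lr_baseline_cnf, classes=yelp_baseline_lr.classes_, title='Logistic Baseline CM')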

#### <a id=6.3></a> 6.3 Expanding Logistic Regression with 5 original features

In [None]:
#define baseline model with 5 features
blv1_5x = yelp_df[['tip_count', 'business_stars', 'review_count', 'stars', 'average_stars']]
blv1_5y = yelp_df.primary_category
baseline5f_xtrain, baseline5f_xtest, baseline5f_ytrain, baseline5f_ytest = train_test_split(blv1_5x, blv1_5y, 
                                                                                     test_size=0.2,random_state=42)

In [None]:
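#the L1 penalty needs a solver that supports it ('saga' here; the default 'lbfgs' does not)
yelp_lr5f = LogisticRegression(penalty='l1', solver='saga', n_jobs=-1)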
yelp_baseline5f_lr = yelp_lr5f.fit(baseline5f_xtrain, baseline5f_ytrain)
yelp_baseline5f_lr_pred = yelp_baseline5f_lr.predict(baseline5f_xtest)

In [None]:
print('baseline (5feature) logistic regression coefficient:')
list(zip(yelp_baseline5f_lr.classes_, yelp_baseline5f_lr.coef_))

In [None]:
print('lr_baseline (5e) recall: {}'.format(np.mean(cross_val_score(yelp_baseline5f_lr, blv1_5x, blv1_5y, 
                                                           cv=3, scoring=make_scorer(metrics.recall_score, average='macro')))))
print('lr_baseline (5e) F1: {}'.format(np.mean(cross_val_score(yelp_baseline5f_lr, blv1_5x, blv1_5y, 
                                                           cv=3, scoring=make_scorer(metrics.f1_score, average='macro')))))

#### <a id=6.4></a> 6.4 Expanding Logistic Regression with 5 feature engineered features

In [None]:
#5 feature baseline with engineering features
blv1_5xf = yelp_df[['review_count', 'number_of_vowels_in_business_name', 'business_name_length', 'total_business_hours', 'number_of_business_days']]
blv1_5yf = yelp_df.primary_category
baseline5fe_xtrain, baseline5fe_xtest, baseline5fe_ytrain, baseline5fe_ytest = train_test_split(blv1_5xf, blv1_5yf, 
                                                                                     test_size=0.2,random_state=42)

In [None]:
yelp_lr5fe = LogisticRegression(penalty='l1', solver='saga', n_jobs=-1)
yelp_baseline5fe_lr = yelp_lr5fe.fit(baseline5fe_xtrain, baseline5fe_ytrain)
yelp_baseline5fe_lr_pred = yelp_baseline5fe_lr.predict(baseline5fe_xtest)

print('baseline logistic regression coefficient:')
list(zip(yelp_baseline5fe_lr.classes_, yelp_baseline5fe_lr.coef_))

In [None]:
print('lr_baseline recall: {}'.format(np.mean(cross_val_score(yelp_baseline5fe_lr, blv1_5xf, blv1_5yf, 
                                                           cv=3, scoring=make_scorer(metrics.recall_score, average='macro')))))
print('lr_baseline F1: {}'.format(np.mean(cross_val_score(yelp_baseline5fe_lr, blv1_5xf, blv1_5yf, 
                                                           cv=3, scoring=make_scorer(metrics.f1_score, average='macro')))))

#### <a id=6.5></a> 6.5 Expanding Logistic Regression with 10 features

In [None]:
#baseline + 10 features
blv1_10xf = yelp_df[['review_count', 'number_of_vowels_in_business_name', 'business_name_length', 'total_business_hours', 'number_of_business_days',
                    'tip_count', 'business_stars', 'review_count', 'stars', 'average_stars']]
blv1_10yf = yelp_df.primary_category
baseline10fe_xtrain, baseline10fe_xtest, baseline10fe_ytrain, baseline10fe_ytest = train_test_split(blv1_10xf, blv1_10yf, 
                                                                                     test_size=0.2,random_state=42)

In [None]:
yelp_lr10fe = LogisticRegression(penalty='l1', solver='saga', n_jobs=-1)
yelp_baseline10fe_lr = yelp_lr10fe.fit(baseline10fe_xtrain, baseline10fe_ytrain)
yelp_baseline10fe_lr_pred = yelp_baseline10fe_lr.predict(baseline10fe_xtest)

print('baseline logistic regression coefficient:')
list(zip(yelp_baseline10fe_lr.classes_, yelp_baseline10fe_lr.coef_))

In [None]:
print('lr_baseline_10f recall: {}'.format(np.mean(cross_val_score(yelp_baseline10fe_lr, blv1_10xf, blv1_10yf, 
                                                           cv=3, scoring=make_scorer(metrics.recall_score, average='macro')))))
print('lr_baseline_10f F1: {}'.format(np.mean(cross_val_score(yelp_baseline10fe_lr, blv1_10xf, blv1_10yf, 
                                                           cv=3, scoring=make_scorer(metrics.f1_score, average='macro')))))

#### <a id=6.6></a> 6.6 Logistic Regression Model (All features)

In [None]:
#define baseline model features and models
v2_x = yelp_df.drop(['business_name', 'primary_category', 'review_text', 'user_name', 'city', 'state'], axis = 1)
v2_y = yelp_df.primary_category
v2_xtrain, v2_xtest, v2_ytrain, v2_ytest = train_test_split(v2_x, v2_y, test_size=0.2,random_state=42)

In [None]:
yelp_lrv2 = LogisticRegression(n_jobs=-1)
yelp_v2_lr = yelp_lrv2.fit(v2_xtrain, v2_ytrain)
yelp_v2_lr_pred = yelp_v2_lr.predict(v2_xtest)

In [None]:
print('V2_all_feature logistic regression coefficient:')
list(zip(yelp_v2_lr.classes_, yelp_v2_lr.coef_))

In [None]:
print('lrv2_allfeature recall: {}'.format(np.mean(cross_val_score(yelp_lrv2, v2_x, v2_y, 
                                                           cv=3, scoring=make_scorer(metrics.recall_score, average='macro')))))
print('lrv2_allfeature F1: {}'.format(np.mean(cross_val_score(yelp_lrv2, v2_x, v2_y, 
                                                           cv=3, scoring=make_scorer(metrics.f1_score, average='macro')))))

In [None]:
lr_v2_cnf = confusion_matrix(v2_ytest, yelp_v2_lr_pred, labels=yelp_v2_lr.classes_)
plot_confusion_matrix(lr_v2_cnf, classes=yelp_v2_lr.classes_, title='Logistic all_feature CM')

#### <a id=6.7></a> 6.7 Baseline RF model (One feature)

In [None]:
blv1_x = yelp_df[['review_count']]
blv1_y = yelp_df.primary_category
baseline_xtrain, baseline_xtest, baseline_ytrain, baseline_ytest = train_test_split(blv1_x, blv1_y, 
                                                                                     test_size=0.2,random_state=42)

In [None]:
yelp_RF = RandomForestClassifier(n_estimators = 1000, n_jobs=-1)
yelp_baseline_RF = yelp_RF.fit(baseline_xtrain, baseline_ytrain)
yelp_baseline_RF_pred = yelp_baseline_RF.predict(baseline_xtest)

In [None]:
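#roc_curve returns (fpr, tpr, thresholds); use the predict_proba column that belongs to 'Pets'
pets_col = list(yelp_baseline_RF.classes_).index('Pets')
yelp_RFfpr, yelp_RFtpr, yelp_RFthresh_roc = roc_curve(np.array(baseline_ytest), 
                                                    yelp_baseline_RF.predict_proba(baseline_xtest)[:, pets_col], 
                                                    pos_label='Pets')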

plt.figure(figsize=(7,7))
plt.plot(yelp_RFfpr, yelp_RFtpr, color='#ff6666')
plt.plot([0,1], [0,1],linestyle='--',color='gray')
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.text(0.1, 0.9,'AUC: {:0.3f}'.format(metrics.auc(yelp_RFfpr, yelp_RFtpr)), fontsize=15)
plt.title('ROC curve for Yelp Baseline RF');

In [None]:
print('RF_baseline accuracy: {}'.format(np.mean(cross_val_score(yelp_baseline_RF, blv1_x, blv1_y, 
                                                           cv=3, scoring=make_scorer(metrics.accuracy_score)))))
print('RF_baseline precision: {}'.format(np.mean(cross_val_score(yelp_baseline_RF, blv1_x, blv1_y, 
                                                           cv=3, scoring=make_scorer(metrics.precision_score, average='macro')))))
print('RF_baseline recall: {}'.format(np.mean(cross_val_score(yelp_baseline_RF, blv1_x, blv1_y, 
                                                           cv=3, scoring=make_scorer(metrics.recall_score, average='macro')))))
print('RF_baseline F1: {}'.format(np.mean(cross_val_score(yelp_baseline_RF, blv1_x, blv1_y, 
                                                           cv=3, scoring=make_scorer(metrics.f1_score, average='macro')))))

In [None]:
RF_baseline_cnf = confusion_matrix(baseline_ytest, yelp_baseline_RF_pred, labels=yelp_baseline_RF.classes_)
plot_confusion_matrix(RF_baseline_cnf, classes=yelp_baseline_RF.classes_, title='RF baseline CM')

#### <a id=6.8></a> 6.8 Baseline xgBoost model (One feature)

In [None]:
blv1_x = yelp_df[['review_count']]
blv1_y = yelp_df.primary_category
baseline_xtrain, baseline_xtest, baseline_ytrain, baseline_ytest = train_test_split(blv1_x, blv1_y, 
                                                                                     test_size=0.2,random_state=42)
#hold out a validation set for early stopping
baseline_xtrain, baseline_xval, baseline_ytrain, baseline_yval = train_test_split(baseline_xtrain, baseline_ytrain, 
                                                                  test_size=0.25, random_state=2)

In [None]:
yelp_gbm = xgb.XGBClassifier(n_estimators=10000, max_depth=4, objective='multi:softmax',
                        learning_rate=.05, subsample=.8, min_child_weight=3,
                        colsample_bytree=.8, n_jobs=-1)
yelp_baseline_gbm = yelp_gbm.fit(baseline_xtrain, baseline_ytrain, 
                    eval_set=[(baseline_xtrain, baseline_ytrain),(baseline_xval, baseline_yval)],
                    eval_metric='merror', early_stopping_rounds=50, verbose=True)

In [None]:
f1_score(baseline_ytest, yelp_gbm.predict(baseline_xtest, ntree_limit=yelp_gbm.best_ntree_limit), average='macro')

In [None]:
recall_score(baseline_ytest, yelp_gbm.predict(baseline_xtest, ntree_limit=yelp_gbm.best_ntree_limit), average='macro')

In [None]:
%%time
xgb_tuned_param=[{'learning_rate':[.01,.1,1,10,.001],'max_depth':[3,4,5,6,7,8]}]
clf = GridSearchCV(yelp_baseline_gbm, xgb_tuned_param, cv=3, scoring='f1_macro')
clf.fit(baseline_xtrain, baseline_ytrain)

print("Best parameters set found on development set:")
print()
print(clf.best_estimator_)
print()
print("Grid scores on development set:")
print()
cv_results = clf.cv_results_
for mean_score, std_score, params in zip(cv_results['mean_test_score'], 
                                         cv_results['std_test_score'], cv_results['params']):
    print("%0.3f (+/-%0.03f) for %r"
             % (mean_score, std_score / 2, params))
    
#0.118 (+/-0.002) for {'learning_rate': 0.05, 'max_depth': 4, 'objective': 'multi:softmax'}
#0.155 (+/-0.000) for {'learning_rate': 0.05, 'max_depth': 8, 'objective': 'multi:softmax'}
#0.160 (+/-0.000) for {'learning_rate': 0.05, 'max_depth': 9, 'objective': 'multi:softmax'}

In [None]:
print('xgb_baseline recall: {}'.format(np.mean(cross_val_score(yelp_baseline_gbm, blv1_x, blv1_y, 
                                                           cv=3, scoring=make_scorer(metrics.recall_score, average='macro')))))
print('xgb_baseline F1: {}'.format(np.mean(cross_val_score(yelp_baseline_gbm, blv1_x, blv1_y, 
                                                           cv=3, scoring=make_scorer(metrics.f1_score, average='macro')))))

#### <a id=6.9></a> 6.9 RF model (All features)

In [None]:
#define baseline model features and models
v2_x = yelp_df.drop(['business_name', 'primary_category', 'review_text', 'user_name', 'city', 'state'], axis = 1)
v2_y = yelp_df.primary_category
v2_xtrain, v2_xtest, v2_ytrain, v2_ytest = train_test_split(v2_x, v2_y, test_size=0.2,random_state=42)

In [None]:
print('max feature is determined through taking sqrt of total number of features: {}'\
      .format(math.ceil(np.sqrt(len(v2_xtrain.columns)))))

In [None]:
yelpv2_RF = RandomForestClassifier(n_estimators = 1000, max_features = math.ceil(np.sqrt(len(v2_xtrain.columns))),
                                min_samples_leaf = 4, n_jobs=-1)
yelp_v2_RF = yelpv2_RF.fit(v2_xtrain, v2_ytrain)
yelp_v2_RF_pred = yelp_v2_RF.predict(v2_xtest)

In [None]:
print('RFv2_allfeature accuracy: {}'.format(np.mean(cross_val_score(yelpv2_RF, v2_x, v2_y, 
                                                           cv=3, scoring=make_scorer(metrics.accuracy_score)))))
print('RFv2_allfeature precision: {}'.format(np.mean(cross_val_score(yelpv2_RF, v2_x, v2_y, 
                                                           cv=3, scoring=make_scorer(metrics.precision_score, average='macro')))))
print('RFv2_allfeature recall: {}'.format(np.mean(cross_val_score(yelpv2_RF, v2_x, v2_y, 
                                                           cv=3, scoring=make_scorer(metrics.recall_score, average='macro')))))
print('RFv2_allfeature F1: {}'.format(np.mean(cross_val_score(yelpv2_RF, v2_x, v2_y, 
                                                           cv=3, scoring=make_scorer(metrics.f1_score, average='macro')))))

In [None]:
RFv2_feature_importances = sorted(zip(v2_x.columns,abs(yelp_v2_RF.feature_importances_)), key=lambda x: -x[1])[:25]
RFv2_feature_importances

In [None]:
sns.set_context('talk')

plt.figure(figsize=(13,10))
features_, scores_ = zip(*RFv2_feature_importances)
sns.barplot(y=list(features_), x=list(scores_), palette='coolwarm')
plt.title("Feature Importances for RF_All_Feature Model")
plt.ylabel('Features', fontsize=12)
plt.xlabel('Relative Importances', fontsize=12)

In [None]:
sns.set_context('notebook')
RF_v2_cnf = confusion_matrix(v2_ytest, yelp_v2_RF_pred)
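# confusion_matrix orders rows and columns by sorted label, so pass the class names in that same order
plot_confusion_matrix(RF_v2_cnf, classes=sorted(yelp_df.primary_category.unique()), title='RF all_feature CM')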
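
Which class each row of the matrix refers to depends on the label order. A minimal sketch, using made-up category names rather than the real `primary_category` values: passing an explicit `labels` list to `confusion_matrix` and reusing that same list for the axis names keeps the two in step.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels, not taken from the Yelp data
demo_true = ['Food', 'Shopping', 'Food', 'Nightlife']
demo_pred = ['Food', 'Food', 'Food', 'Nightlife']

# Fix the order once and use it for both the matrix and the tick labels
demo_labels = sorted(set(demo_true) | set(demo_pred))
demo_cm = confusion_matrix(demo_true, demo_pred, labels=demo_labels)

# Rows are true classes, columns are predicted classes
for label, row in zip(demo_labels, demo_cm):
    print(label, row.tolist())
```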

#### <a id=6.10></a> 6.10 xgBoost model (All feature)

In [None]:
v2_x = yelp_df.drop(['business_name', 'primary_category', 'review_text', 'user_name', 'city', 'state'], axis = 1)
v2_y = yelp_df.primary_category
v2_xtrain, v2_xtest, v2_ytrain, v2_ytest = train_test_split(v2_x, v2_y, test_size=0.2,random_state=42)
v2_xtrain, v2_xval, v2_ytrain, v2_yval = train_test_split(v2_xtrain, v2_ytrain, test_size=0.25, random_state=2)

In [None]:
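# drop the column at position 15 from every split so train, validation, and test share the same features
v2_xtrain.drop(v2_xtrain.columns[15], axis=1, inplace=True)
v2_xtest.drop(v2_xtest.columns[15], axis=1, inplace=True)
v2_xval.drop(v2_xval.columns[15], axis=1, inplace=True)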

In [None]:
yelpv2_gbm = xgb.XGBClassifier(n_estimators=10000, max_depth=4, objective='multi:softmax',
                        learning_rate=0.1, subsample=.8, min_child_weight=3,
                        colsample_bytree=.8, n_jobs=-1)
yelp_v2_gbm = yelpv2_gbm.fit(v2_xtrain, v2_ytrain, 
                    eval_set=[(v2_xtrain, v2_ytrain),(v2_xval, v2_yval)],eval_metric='merror', early_stopping_rounds=30, verbose=True)

In [None]:
# score on the held-out test features so predictions line up with v2_ytest
recall_score(v2_ytest, yelpv2_gbm.predict(v2_xtest, ntree_limit=yelpv2_gbm.best_ntree_limit), average='macro')

In [None]:
# score on the held-out test features so predictions line up with v2_ytest
f1_score(v2_ytest, yelpv2_gbm.predict(v2_xtest, ntree_limit=yelpv2_gbm.best_ntree_limit), average='macro')
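
`average='macro'` computes the score per class and then takes an unweighted mean, so rare categories count as much as common ones. The sketch below reproduces that arithmetic in plain Python on toy labels; the labels `'a'`, `'b'`, `'c'` and the helper `macro_recall_f1` are hypothetical and only illustrate the formula.

```python
from collections import Counter

def macro_recall_f1(y_true, y_pred):
    """Return (macro recall, macro F1) as unweighted means over classes."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative

    recalls, f1s = [], []
    for label in labels:
        predicted = tp[label] + fp[label]
        actual = tp[label] + fn[label]
        precision = tp[label] / predicted if predicted else 0.0
        recall = tp[label] / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        recalls.append(recall)
        f1s.append(f1)
    return sum(recalls) / len(labels), sum(f1s) / len(labels)

toy_true = ['a', 'a', 'a', 'b', 'b', 'c']
toy_pred = ['a', 'a', 'b', 'b', 'c', 'c']
toy_recall, toy_f1 = macro_recall_f1(toy_true, toy_pred)
print('macro recall: {:.4f}, macro F1: {:.4f}'.format(toy_recall, toy_f1))
```

Here class `'c'` has a single example yet contributes a full third of each average, which is why macro scores can sit well below accuracy on imbalanced categories.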

In [None]:
%%time
v2_xtrain, v2_xtest, v2_ytrain, v2_ytest = train_test_split(v2_x, v2_y, test_size=0.2,random_state=42)
xgb_tuned_param=[{'learning_rate':[.01,.1,.001],'max_depth':[3,5,6,7,8], 'objective':['multi:softmax']}]
clf = GridSearchCV(xgb.XGBClassifier(), xgb_tuned_param, cv=2, scoring=make_scorer(metrics.f1_score, average='macro'))
clf.fit(v2_xtrain, v2_ytrain)

print("Best parameters set found on development set:")
print()
print(clf.best_estimator_)
print()
print("Grid scores on development set:")
print()
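# grid_scores_ was removed in scikit-learn 0.20; cv_results_ holds the same per-candidate statistics
cv_res = clf.cv_results_
for mean_score, std_score, params in zip(cv_res['mean_test_score'], cv_res['std_test_score'], cv_res['params']):
    print("%0.3f (+/-%0.03f) for %r"
             % (mean_score, std_score / 2, params))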

    
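# Scores recorded from an earlier run; its grid (learning_rate 0.05, max_depth 4/8/9) differs from the one above
#0.118 (+/-0.002) for {'learning_rate': 0.05, 'max_depth': 4, 'objective': 'multi:softmax'}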
#0.155 (+/-0.000) for {'learning_rate': 0.05, 'max_depth': 8, 'objective': 'multi:softmax'}
#0.160 (+/-0.000) for {'learning_rate': 0.05, 'max_depth': 9, 'objective': 'multi:softmax'}
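
Grid search cost grows multiplicatively. A rough sketch of the arithmetic for the grid above, using only the standard library: every combination of values is one candidate, and each candidate is fit once per fold. The final refit on the full training set is not counted here, and the names `demo_grid` and `candidates` are illustrative.

```python
from itertools import product

# Same grid and fold count as the cell above
demo_grid = {'learning_rate': [.01, .1, .001],
             'max_depth': [3, 5, 6, 7, 8],
             'objective': ['multi:softmax']}
n_folds = 2

# The Cartesian product of all value lists gives the candidate settings
candidates = [dict(zip(demo_grid, values)) for values in product(*demo_grid.values())]
n_fits = len(candidates) * n_folds

print('{} candidates x {} folds = {} fits'.format(len(candidates), n_folds, n_fits))
```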

### <a id=7></a> 7. Natural Language Processing (NLP) as additional features

#### <a id=7.1></a> 7.1 TFIDFVectorizer

In [None]:
pre_vec_df = yelp_df.copy()
pre_vec_df.review_text = pre_vec_df.review_text.str.lower()
pre_vec_df.review_text.replace(r'([^a-z\s])', '', regex=True, inplace=True)

In [None]:
transformer = TfidfTransformer(smooth_idf=False)
vectorizer = TfidfVectorizer(stop_words=['a', 'an', 'and', 'are', 'as', 'at', 'be', 
                                         'by', 'for', 'from', 'has', 'he', 'in', 'is', 
                                         'its', 'it', 'of', 'on', 'that', 'the',
                                         'to', 'was', 'were', 'will', 'with', 
                                         'she', 'mm', 'off', '-', '&', '...', '!', '\n', 'du', 'et',
                                         'le', 'las'], min_df = 2, max_features = 1000)
project_tfidf_vec = vectorizer.fit_transform(pre_vec_df.review_text).toarray()
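# vocabulary_ maps term -> column index; order the names by index so they line up with the matrix columns
tfidf_terms = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
project_tfidf_df = pd.DataFrame(project_tfidf_vec, columns=tfidf_terms)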
project_tfidf_df.shape
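
A fitted vectorizer stores its vocabulary as a term-to-column-index mapping. The toy sketch below (three invented review snippets, default tokenizer settings) shows one way to recover the column names in matrix order from `vocabulary_`, which avoids depending on any particular scikit-learn version's feature-name helper.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Invented snippets, not taken from the Yelp reviews
demo_docs = ["great tacos and great salsa", "slow service", "great service"]

demo_vec = TfidfVectorizer()
demo_matrix = demo_vec.fit_transform(demo_docs)

# Sort terms by their column index so position i names column i
demo_terms = sorted(demo_vec.vocabulary_, key=demo_vec.vocabulary_.get)

# List the terms that carry weight in the second document
second_doc = demo_matrix[1].toarray().ravel()
print([term for term, weight in zip(demo_terms, second_doc) if weight > 0])
```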

In [None]:
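# concat aligns on index; reset it so rows pair up by position even if cleaning dropped rows
yelp_tfidf_df = pd.concat([pre_vec_df.reset_index(drop=True), project_tfidf_df], axis = 1)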

#### <a id=7.2></a> 7.2 RF Model (TFIDF + all features)

In [None]:
v3_x = yelp_tfidf_df.drop(['business_name', 'primary_category', 'review_text', 'user_name', 'city', 'state'], axis = 1)
v3_y = yelp_tfidf_df.primary_category
v3_xtrain, v3_xtest, v3_ytrain, v3_ytest = train_test_split(v3_x, v3_y, test_size=0.2,random_state=42)

In [None]:
print('square root of the number of features, rounded up (a common max_features heuristic): {}'\
      .format(math.ceil(np.sqrt(len(v3_xtrain.columns)))))

In [None]:
%%time
yelpv3_RF = RandomForestClassifier(n_estimators = 500, max_features = 4,
                                min_samples_leaf = 1, n_jobs=-1)
yelp_v3_RF = yelpv3_RF.fit(v3_xtrain, v3_ytrain)
yelp_v3_RF_pred = yelp_v3_RF.predict(v3_xtest)

In [None]:
%%time
rf_tuned_param=[{'n_estimators':[500],'max_features':[10], 'min_samples_leaf':[10]}]
rfclf = GridSearchCV(RandomForestClassifier(), rf_tuned_param, cv=3, scoring=make_scorer(metrics.f1_score, average='macro'))
rfclf.fit(v3_xtrain, v3_ytrain)

print("Best parameters set found on development set:")
print()
print(rfclf.best_estimator_)
print()
print("Grid scores on development set:")
print()
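# grid_scores_ was removed in scikit-learn 0.20; cv_results_ holds the same per-candidate statistics
rf_cv_res = rfclf.cv_results_
for mean_score, std_score, params in zip(rf_cv_res['mean_test_score'], rf_cv_res['std_test_score'], rf_cv_res['params']):
    print("%0.3f (+/-%0.03f) for %r"
             % (mean_score, std_score / 2, params))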
    

In [None]:
RFv3_feature_importances = sorted(zip(v3_x.columns,abs(yelp_v3_RF.feature_importances_)), key=lambda x: -x[1])[:25]
RFv3_feature_importances

In [None]:
sns.set_context('talk')

plt.figure(figsize=(13,10))
features_, scores_ = zip(*RFv3_feature_importances)
sns.barplot(y=list(features_), x=list(scores_), palette='coolwarm')
plt.title("Feature Importances for RFv3_allfeature+tfidf Model")
plt.ylabel('Features', fontsize=12)
plt.xlabel('Relative Importances', fontsize=12)

In [None]:
sns.set_context('notebook')
RF_v3_cnf = confusion_matrix(v3_ytest, yelp_v3_RF_pred)
plot_confusion_matrix(RF_v3_cnf, classes=list(set(yelp_tfidf_df.primary_category.tolist())), title='RF all_feature+tfidf CM')

In [None]:
#print('RFv3_allfeature+tfidf recall: {}'.format(np.mean(cross_val_score(yelp_v3_RF, v3_x, v3_y, 
#                                                           cv=3, scoring=make_scorer(metrics.recall_score, average='macro')))))
print('RFv3_allfeature+tfidf F1: {}'.format(np.mean(cross_val_score(yelp_v3_RF, v3_x, v3_y, 
                                                           cv=3, scoring=make_scorer(metrics.f1_score, average='macro')))))

### <a id=8></a> 8. Final Model Optimization

In [None]:
%%time
# a dict keeps only the last value for a repeated key, so list both depths under one 'max_depth' entry
xgb_tuned_param=[{'learning_rate':[0.05, 1],'max_depth':[4, 8], 'objective':['multi:softmax']}]
clf_final = GridSearchCV(xgb.XGBClassifier(), xgb_tuned_param, cv=2, scoring=make_scorer(metrics.f1_score, average='macro'))
clf_final.fit(v2_xtrain, v2_ytrain)

print("Best parameters set found on development set:")
print()
print(clf_final.best_estimator_)
print()
print("Grid scores on development set:")
print()
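# grid_scores_ was removed in scikit-learn 0.20; cv_results_ holds the same per-candidate statistics
final_cv_res = clf_final.cv_results_
for mean_score, std_score, params in zip(final_cv_res['mean_test_score'], final_cv_res['std_test_score'], final_cv_res['params']):
    print("%0.3f (+/-%0.03f) for %r"
             % (mean_score, std_score / 2, params))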

### <a id=9></a> 9. Final test-scores

In [None]:
vf_x = yelp_df.drop(['business_name', 'primary_category', 'review_text', 'user_name', 'city', 'state'], axis = 1)
vf_y = yelp_df.primary_category
vf_xtrain, vf_xtest, vf_ytrain, vf_ytest = train_test_split(vf_x, vf_y, test_size=0.2,random_state=42)
vf_xtrain, vf_xval, vf_ytrain, vf_yval = train_test_split(vf_xtrain, vf_ytrain, test_size=0.25, random_state=2)

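# drop the column at position 15 from every split so train, validation, and test share the same features
vf_xtrain.drop(vf_xtrain.columns[15], axis=1, inplace=True)
vf_xtest.drop(vf_xtest.columns[15], axis=1, inplace=True)
vf_xval.drop(vf_xval.columns[15], axis=1, inplace=True)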

In [None]:
yelp_gbm_final = xgb.XGBClassifier(n_estimators=10000, max_depth=4, objective='multi:softmax',
                        learning_rate=0.1, subsample=.8, min_child_weight=3,
                        colsample_bytree=.8, n_jobs=-1)
yelp_gbm_final_model = yelp_gbm_final.fit(vf_xtrain, vf_ytrain, 
                    eval_set=[(vf_xtrain, vf_ytrain),(vf_xval, vf_yval)],eval_metric='merror', early_stopping_rounds=30, verbose=True)

In [None]:
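# score the final model fitted above on the held-out test set; print both so neither result is discarded
vf_pred = yelp_gbm_final.predict(vf_xtest, ntree_limit=yelp_gbm_final.best_ntree_limit)
print('final xgBoost recall: {}'.format(recall_score(vf_ytest, vf_pred, average='macro')))
print('final xgBoost F1: {}'.format(f1_score(vf_ytest, vf_pred, average='macro')))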