# Challenge

Transform this regression problem into a binary classifier and clean up the feature set. You can choose whether or not to include nutritional information, but try to cut your feature set down to the 30 most valuable features.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import scipy
import seaborn as sns
sns.set_style('white')
%matplotlib inline

import warnings
warnings.filterwarnings('ignore')

## 1) Initial Cleaning

As we saw in the Guided Example, there are close to 700 columns and over 20,000 rows. Let's see how accurate we can get if we use only the nutrition-related columns and drop the roughly 20% of rows that have null values in those columns.

In [2]:
raw_data = pd.read_csv('epi_r.csv')

In [3]:
raw_nutrition = raw_data.dropna(axis=0, subset=['calories', 'protein', 'fat', 'sodium'])

In [7]:
nutrition = raw_nutrition[['rating', 'title', 'calories', 'protein', 'fat', 'sodium']]
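
As a minimal sketch of what the `dropna` call above does, the toy frame below (made-up values, not taken from `epi_r.csv`) measures the share of rows missing any nutrition value and then drops them. The names `toy`, `nutrition_cols`, and `toy_clean` are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the recipe data (values are made up for illustration)
toy = pd.DataFrame({
    'title': ['a', 'b', 'c', 'd', 'e'],
    'rating': [4.375, 3.75, 5.0, 0.0, 4.375],
    'calories': [426.0, np.nan, 165.0, 547.0, 948.0],
    'protein': [30.0, 18.0, np.nan, 20.0, 19.0],
})

nutrition_cols = ['calories', 'protein']

# Fraction of rows missing at least one nutrition value
missing_fraction = toy[nutrition_cols].isnull().any(axis=1).mean()

# Keep only rows where every nutrition column is present
toy_clean = toy.dropna(axis=0, subset=nutrition_cols)

print(missing_fraction)  # 0.4
print(len(toy_clean))    # 3
```

Running the same `isnull().any(axis=1).mean()` check on the real data is one way to confirm how many rows the cleaning step removes.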

In [13]:
from sklearn.svm import SVR, SVC

In [14]:
svr = SVR()
X = nutrition.drop(['rating', 'title'], axis=1)
Y = nutrition.rating
svr.fit(X, Y)

SVR(C=1.0, cache_size=200, coef0=0.0, degree=3, epsilon=0.1,
  gamma='auto_deprecated', kernel='rbf', max_iter=-1, shrinking=True,
  tol=0.001, verbose=False)

In [11]:
svr.score(X, Y)

0.5670895690804438

In [12]:
from sklearn.model_selection import cross_val_score
cross_val_score(svr, X, Y, cv = 5)

array([0.08307481, 0.06269616, 0.06296291, 0.05585074, 0.06144234])

The in-sample score is MUCH higher than we saw previously, but the cross-validated scores are far lower, only about 0.06 to 0.08 per fold, so the model does not generalize well. Let's look for other tweaks that can produce a more consistent accuracy level.
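
One tweak that may be worth trying is feature scaling. An RBF kernel depends on distances between rows, so a column with a large numeric range can dominate the others. The sketch below uses synthetic data rather than the recipe data, and wraps a `StandardScaler` and `SVR` in a pipeline so the in-sample and cross-validated scores can be compared side by side. Whether scaling helps on the nutrition columns is an assumption that would need to be checked.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

rng = np.random.default_rng(0)

# Synthetic stand-in: four features on very different scales
X_toy = rng.normal(size=(200, 4)) * np.array([1.0, 10.0, 100.0, 1000.0])
y_toy = X_toy[:, 0] + X_toy[:, 1] / 10.0 + rng.normal(scale=0.1, size=200)

# Scaling lives inside the pipeline, so each CV fold is scaled
# using statistics from its own training split only
scaled_svr = make_pipeline(StandardScaler(), SVR())

in_sample = scaled_svr.fit(X_toy, y_toy).score(X_toy, y_toy)
cv_scores = cross_val_score(scaled_svr, X_toy, y_toy, cv=5)

print('In-sample R^2:', in_sample)
print('Cross-validated R^2:', cv_scores)
```

A large gap between the two printed values is the same overfitting signal seen in the scores above.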

## 2) Convert to Binary

We saw earlier that there are no datapoints with a 3.5 rating. Using this information, let us set the classification boundary to a rating of 4.

In [15]:
# Restart with the original raw data
X = raw_data.drop(['rating', 'title', 'calories', 'protein', 'fat', 'sodium'], axis=1)
Y = raw_data['rating']

In [16]:
# Convert the Y version of the ratings column to binary on the 4-star boundary
Y = np.where(raw_data['rating'] >= 4, 1, 0)
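
To illustrate the thresholding step, here is a small sketch on hypothetical ratings (not drawn from the dataset). It applies the same `np.where` rule and then checks the class balance, since a lopsided split would make plain accuracy hard to interpret.

```python
import numpy as np

# Hypothetical ratings, chosen to sit on both sides of the boundary
toy_ratings = np.array([0.0, 3.125, 3.75, 4.375, 5.0, 4.375])

# 1 for ratings of 4 or higher, 0 otherwise
toy_labels = np.where(toy_ratings >= 4, 1, 0)

# Share of each class in the labels
class_counts = np.bincount(toy_labels, minlength=2)
class_share = class_counts / class_counts.sum()

print(toy_labels)   # [0 0 0 1 1 1]
print(class_share)  # [0.5 0.5]
```

The same `np.bincount` check can be run on `Y` to see how balanced the real labels are.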

## 3) Feature Selection

The challenge asks us to cut the feature set down to the 30 most valuable features. Let's see how each feature scores on its value for prediction.

In [17]:
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix

In [18]:
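# Classifier we will train on the selected features.
# SelectKBest expects a scoring function such as f_classif, not an estimator,
# so the selector itself is built in the next cell.
svc = SVC()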

In [19]:
# Use SelectKBest to find the most relevant features 
selector = SelectKBest(f_classif, k = 30)
X_new = selector.fit_transform(X, Y)

# Collect the names and F-Scores of the best features we just found
names = X.columns.values[selector.get_support()]
scores = selector.scores_[selector.get_support()]

# Convert this new information into a dataframe
names_scores = list(zip(names, scores))
ns_df = pd.DataFrame(data = names_scores, columns = ['Feature', 'F-Score'])
ns_df_sorted = ns_df.sort_values(['F-Score', 'Feature'], ascending = [False, True])
ns_df_sorted

| | Feature | F-Score |
|---|---|---|
| 3 | bon appétit | 190.744223 |
| 17 | house & garden | 174.172453 |
| 9 | drink | 139.302115 |
| 0 | alcoholic | 120.080436 |
| 12 | gin | 101.773436 |
| 21 | roast | 93.686545 |
| 27 | thanksgiving | 90.66593 |
| 20 | peanut free | 87.444773 |
| 23 | soy free | 87.184921 |
| 6 | cocktail party | 81.26188 |
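
For intuition about how `SelectKBest` with `f_classif` ranks columns, the sketch below builds a synthetic frame with one column tied to the label and two columns of pure noise. The column names and data are hypothetical. The ANOVA F-score should be far larger for the informative column, so it is the one kept when `k=1`.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(42)

# Synthetic binary label and three candidate features
y_toy = rng.integers(0, 2, size=300)
X_toy = pd.DataFrame({
    'noise_a': rng.normal(size=300),
    'signal': y_toy + rng.normal(scale=0.3, size=300),  # tracks the label
    'noise_b': rng.normal(size=300),
})

# Score every column with the ANOVA F-test and keep the single best one
toy_selector = SelectKBest(f_classif, k=1).fit(X_toy, y_toy)
kept = list(X_toy.columns[toy_selector.get_support()])

print(dict(zip(X_toy.columns, toy_selector.scores_)))
print(kept)
```

The F-test scores each feature on its own, so redundant features (for example several alcohol-related tags) can all rank highly at once.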


## 4) Train and Test

Let's split the data into a training set and testing set to see how well these 30 features can predict.

In [20]:
X_train, X_test, y_train, y_test = train_test_split(X_new, Y, test_size=0.33, random_state=42)

In [22]:
svc.fit(X_train, y_train)

SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto_deprecated',
  kernel='rbf', max_iter=-1, probability=False, random_state=None,
  shrinking=True, tol=0.001, verbose=False)

In [24]:
predictions = svc.predict(X_test)
print('--------- RESULTS ---------')
print()
print('Accuracy')
print(svc.score(X_test, y_test))
print()
print('Confusion Matrix')
print(confusion_matrix(y_test, predictions))
print()
print('Classification Report')
print(classification_report(y_test, predictions))

--------- RESULTS ---------

Accuracy
0.581293442127531

Confusion Matrix
[[ 741 2269]
 [ 502 3106]]

Classification Report
              precision    recall  f1-score   support

           0       0.60      0.25      0.35      3010
           1       0.58      0.86      0.69      3608

   micro avg       0.58      0.58      0.58      6618
   macro avg       0.59      0.55      0.52      6618
weighted avg       0.59      0.58      0.54      6618
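
The figures in the report can be reproduced by hand from the confusion matrix above. The sketch below copies those four counts and derives the overall score, the class 0 precision and recall, and, for comparison, the accuracy of always predicting the larger class. Treating rows as true classes and columns as predictions follows scikit-learn's `confusion_matrix` convention.

```python
import numpy as np

# Confusion matrix copied from the output above:
# rows are true classes, columns are predicted classes
cm = np.array([[741, 2269],
               [502, 3106]])

total = cm.sum()

# Correct predictions sit on the diagonal
accuracy = np.trace(cm) / total

# Accuracy of always predicting the larger true class
majority_baseline = cm.sum(axis=1).max() / total

# Class 0 metrics
recall_0 = cm[0, 0] / cm[0].sum()
precision_0 = cm[0, 0] / cm[:, 0].sum()

print(round(accuracy, 4))           # 0.5813
print(round(majority_baseline, 4))  # 0.5452
print(round(recall_0, 2))           # 0.25
print(round(precision_0, 2))        # 0.6
```

The low class 0 recall shows the classifier labels most recipes as highly rated, which is why it sits only a few points above the majority baseline.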



## 5) Cross-Validation

Let's try a cross-validation again and see if our accuracy is consistent this time.

In [25]:
cross_val_score(svc, X_new, Y, cv = 5)

array([0.56070805, 0.57342309, 0.57417103, 0.57481297, 0.57944625])
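
One caveat: `X_new` was selected using all rows, so each validation fold has already influenced which features were chosen, and the scores above may be slightly optimistic. A minimal sketch of one way to avoid this, on synthetic data with hypothetical names, is to put the selector and classifier in a single pipeline so that selection is refit on each training split.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Synthetic stand-in for binary tag columns: 400 rows, 50 features
X_toy = rng.integers(0, 2, size=(400, 50))

# Label depends on the first two features plus noise
noise = rng.normal(scale=0.5, size=400)
y_toy = ((X_toy[:, 0] + X_toy[:, 1] + noise) > 1).astype(int)

# Selection happens inside each fold, using only that fold's training rows
leak_free = make_pipeline(SelectKBest(f_classif, k=10), SVC())
scores = cross_val_score(leak_free, X_toy, y_toy, cv=5)

print(scores.mean(), scores.std())
```

Reporting the mean and standard deviation of the fold scores gives a compact way to describe both the level and the consistency of the accuracy.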

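Wow! The scores look very consistent now, ranging from about 0.56 to 0.58 across the five folds. Keep in mind that the accuracy is still poor: about 55% of the test set is class 1 (3608 of 6618), so the model barely beats always predicting the majority class.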