In [None]:
import numpy as np
import pandas as pd
import scipy
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

Now it's time for another guided example. This time we're going to look at recipes. Specifically, we'll use the Epicurious dataset, which has a collection of recipes, key terms and ingredients, and their ratings.

What we want to see is if we can use the ingredient and keyword list to predict the rating. For someone writing a cookbook this could be really useful information that could help them choose which recipes to include because they're more likely to be enjoyed and therefore make the book more likely to be successful.

First let's load the dataset. It's [available on Kaggle](https://www.kaggle.com/hugodarwood/epirecipes). We'll use the CSV file here and pull out the column names and some summary statistics for the ratings.

In [None]:
raw_data = pd.read_csv('https://tf-assets-prod.s3.amazonaws.com/tf-curric/data-science/epi_r.csv')

In [None]:
list(raw_data.columns)

In [None]:
raw_data.rating.describe()

We learn a few things from this analysis. From a ratings perspective, there are just over 20,000 recipes with an average rating of 3.71. What is interesting is that the 25th percentile is actually above the mean. This suggests there is some kind of outlier population of very low ratings pulling the mean down. This makes sense when we think about reviews: some bad recipes may have only a few reviews, all of them very low.
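To see how a cluster of very low ratings can pull the mean below the 25th percentile, here is a minimal sketch on made-up ratings. The values and their proportions are illustrative assumptions, not figures from the Epicurious data.

```python
import pandas as pd

# Hypothetical ratings: most recipes score well, a minority sit at zero.
toy = pd.Series([0.0] * 20 + [3.75] * 10 + [4.375] * 50 + [5.0] * 20)

mean = toy.mean()
q25 = toy.quantile(0.25)

print(f"mean = {mean:.4f}, 25th percentile = {q25:.4f}")
# The zeros drag the mean down, while the quartile depends only on rank.
print("25th percentile above mean:", q25 > mean)
```

The quartile ignores how extreme the low values are, so a block of zeros moves the mean much more than it moves the 25th percentile.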

Let's validate the idea a bit further with a histogram.

In [None]:
raw_data.rating.hist(bins=20)
plt.title('Histogram of Recipe Ratings')
plt.show()

So a few things are shown in this histogram. Firstly, there are sharp discontinuities. We don't have continuous data. No recipe has a 3.5 rating, for example. We also see the anticipated spike at 0.

Let's try a naive approach again, this time using a support vector regressor (SVR). But first, we'll have to do a bit of data cleaning.

In [None]:
# Count nulls 
null_count = raw_data.isnull().sum()
null_count[null_count>0]

What we can see right away is that nutrition information is not available for all recipes. Now this would be an interesting data point, but let's focus on ingredients and keywords right now. So we'll actually drop the whole columns for calories, protein, fat, and sodium. We'll come back to nutrition information later.

In [None]:
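from sklearn.svm import SVR
svr = SVR()
# Drop the target, the free-text title, and the nutrition columns that contain nulls.
# Sampling X and Y with the same random_state keeps their rows aligned.
X = raw_data.drop(columns=['rating', 'title', 'calories', 'protein', 'fat', 'sodium']).sample(frac=0.3, replace=True, random_state=1)
Y = raw_data.rating.sample(frac=0.3, replace=True, random_state=1)
svr.fit(X, Y)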

__Note that this actually takes quite a while to run, compared to some of the models we've done before. Be patient.__ It's because of the number of features we have.

Let's see what a scatter plot looks like, comparing actuals to predicted.

In [None]:
plt.scatter(Y, svr.predict(X))

Now that is a pretty useless visualization. This is because of the discontinuous nature of our outcome variable: the points stack up in vertical bands, and there's too much data for us to really see what's going on. If you wanted to look at it, you could create histograms of the predictions for each rating value. Here we'll move on to the scores of both our full fit model and the cross-validated one. Again, rerunning these cells will take some time, so you probably shouldn't.
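As a numerical counterpart to those histograms, the sketch below summarizes predictions within each discrete rating level. The `actual` and `predicted` arrays are synthetic stand-ins for `Y` and `svr.predict(X)`, and the rating levels are assumed for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-ins for Y and svr.predict(X); not the real model output.
levels = np.array([0.0, 1.25, 2.5, 3.125, 3.75, 4.375, 5.0])
actual = rng.choice(levels, size=500)
predicted = 3.5 + 0.1 * actual + rng.normal(scale=0.3, size=500)

results = pd.DataFrame({"actual": actual, "predicted": predicted})

# One row per discrete rating level: how predictions are distributed within it.
summary = results.groupby("actual")["predicted"].describe()
print(summary[["count", "mean", "std", "min", "max"]])
```

With the real data, the same `groupby` on a frame built from `Y` and the model's predictions would show whether the predicted values actually shift between rating levels.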

In [None]:
svr.score(X, Y)

In [None]:
from sklearn.model_selection import cross_val_score
cross_val_score(svr, X, Y, cv=5)

Oh dear, so this did not seem to work very well. In fact it is remarkably poor. Now there are many things that we could do here.

Firstly, the overfit is a problem, even though the fit was poor in the first place. We could go back and clean up our feature set. There might be some gains to be made by getting rid of the noise.

We could also see how removing the nulls but including dietary information performs. Though it's a slight change to the question, we could still possibly get some improvements there.

Lastly, we could take our regression problem and turn it into a classifier. With this number of features and a discontinuous outcome, we might have better luck thinking of this as a classification problem. We could make it simpler still: instead of classifying on each possible value, we could group reviews into chosen high and low categories.
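The two ways of handling the missing nutrition values can be sketched on a tiny made-up frame. The column names mirror the dataset, but the rows below are invented for illustration.

```python
import numpy as np
import pandas as pd

# Invented rows; only the column names echo the real dataset.
toy = pd.DataFrame({
    "rating": [4.375, 5.0, 3.75, 0.0],
    "calories": [426.0, np.nan, 165.0, np.nan],
    "protein": [30.0, 18.0, np.nan, np.nan],
    "vegan": [0.0, 1.0, 0.0, 1.0],
})

nutrition = ["calories", "protein"]

# Option 1 (used in this notebook): drop the incomplete columns, keep every row.
dropped_cols = toy.drop(columns=nutrition)

# Option 2: keep the nutrition columns, drop rows missing any of them.
dropped_rows = toy.dropna(subset=nutrition)

print(dropped_cols.shape, dropped_rows.shape)
```

The first option keeps every recipe but loses the nutrition features; the second keeps those features at the cost of rows, which changes the population the model is trained on.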

__And that is your challenge.__

Transform this regression problem into a binary classifier and clean up the feature set. You can choose whether or not to include nutritional information, but try to cut your feature set down to the 30 most valuable features.
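One possible way to cut a feature set down, shown here on synthetic data, is univariate selection with scikit-learn's `SelectKBest`. This is only a sketch of one option: the data, the choice of `chi2` as the score function, and `k=5` are assumptions made for illustration.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

rng = np.random.default_rng(0)

# Synthetic data: 200 rows, 50 binary tag-like features (hypothetical).
X_toy = rng.integers(0, 2, size=(200, 50))
# Make the label depend on the first two features so they carry signal.
y_toy = ((X_toy[:, 0] + X_toy[:, 1] + rng.integers(0, 2, size=200)) >= 2).astype(int)

# Keep the k features with the highest chi-squared score against the label.
selector = SelectKBest(score_func=chi2, k=5)
X_reduced = selector.fit_transform(X_toy, y_toy)

kept = selector.get_support(indices=True)
print("kept feature indices:", kept)
print("reduced shape:", X_reduced.shape)
```

The `chi2` score requires non-negative features, which holds for 0/1 indicator columns like the keyword tags.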

Good luck!

When you've finished that, also take a moment to think about bias. Is there anything in this dataset that makes you think it could be biased, perhaps extremely so?

There is. Several things in fact, but the most glaring is that we don't actually have a random sample. It could be, and probably is, that the people more likely to choose some kinds of recipes are more likely to give high reviews.

After all, people who eat chocolate _might_ just be happier people.

Let's drop the variables that have very low correlation with the rating.

In [None]:
raw_data.columns.unique().value_counts().sum()

In [None]:
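## Getting the top 30 most correlated variables with rating.
target = 'rating'
no_cols = 31  # 30 features plus the rating column itself
# Restrict to numeric columns so the text column 'title' does not break corr().
corrmat = raw_data.select_dtypes(include='number').corr()
print(corrmat.nlargest(no_cols, target)[target])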

cols = corrmat.nlargest(no_cols, target)[target].index
cm = np.corrcoef(raw_data[cols].values.T)

plt.figure(figsize=(20,15))
hm = sns.heatmap(cm, cbar=True, annot=True, square=True, fmt='.2f', annot_kws={'size': 10}, yticklabels=cols.values, xticklabels=cols.values)
plt.show()

In [None]:
print('median rating = {}'.format(raw_data['rating'].median()))

In [None]:
# Take an explicit copy so that modifying 'rating' later does not act on a view of raw_data.
cols = raw_data[["rating", "bon appétit", "peanut free", "soy free", "tree nut free", "bake", "roast", "fall", "sauté", "dinner", "kosher", "winter", "pescatarian", "thanksgiving", "onion", "grill/barbecue", "high fiber", "gourmet", "no sugar added", "tomato", "quick & easy", "herb", "pork", "beef", "cheese", "low carb", "mixer", "christmas", "sugar conscious", "braise", "low cal"]].copy()

In [None]:
cols["rating"]

#### We binarize the target value rating in order to train a classifier

In [None]:
# 1 if the rating is strictly above the median, 0 otherwise (np.int was removed from NumPy).
cols['rating'] = (cols['rating'] > cols['rating'].median()).astype(int)
print(cols['rating'].head(5))

In [None]:
cols['rating'].unique()

In [None]:
raw_data.head()

### Fitting an SVC classifier

In [None]:
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

svc = SVC()
X = cols.drop(columns='rating')
Y = cols.rating
svc.fit(X, Y)

# Accuracy on the training data, then 5-fold cross-validated accuracy.
print(svc.score(X, Y))
print('---------------------------------------')
print(cross_val_score(svc, X, Y, cv=5))


#### We can see a great improvement in the performance of the model. The prediction accuracy with SVC is 86.4%, and cross-validation with cross_val_score gives a consistent ~86%, which suggests the model is not overfitting.
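Because the split keeps only ratings strictly above the median in the positive class, the two classes may be unbalanced, so an accuracy figure is easier to interpret next to a majority-class baseline. The sketch below shows that comparison on synthetic labels; the 15% positive rate and the random features are assumptions, not properties of this dataset.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

rng = np.random.default_rng(0)

# Hypothetical imbalanced labels: roughly 15% positives.
y_toy = (rng.random(1000) < 0.15).astype(int)
X_toy = rng.integers(0, 2, size=(1000, 30))  # ignored by the dummy model

# Always predicts the most frequent class seen during fit.
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_toy, y_toy)

baseline_acc = baseline.score(X_toy, y_toy)
majority_share = max(y_toy.mean(), 1 - y_toy.mean())
print(f"baseline accuracy = {baseline_acc:.3f}")
```

On the real data, running the same baseline on `X` and `Y` and checking `Y.mean()` would show how much of the SVC's accuracy comes from the class balance alone.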