# For the Love of Chocolate
Chocolate is loved worldwide. Can we identify the universal traits of chocolate we all love? Let's find out!

![](https://www.wtfclub.net/wp-content/uploads/2017/05/Chocolate-min.jpg)
Since we're all data people here, check out [some fun chocolate facts](https://www.wtfclub.net/chocolate-facts/) to get you properly motivated!

As a motivating philosophy, remember that machine learning can be done for two different end goals, prediction or understanding, and those goals shape the modeling decisions you make. For prediction, you emphasize processing and complexity in order to get the best performance possible. For interpretation, however, you relax the model performance a bit in order to understand the hidden relationships in the data. 

In this notebook, we are going to approach machine learning for understanding, and thus the actual processing of the data is going to be light. But, because we aren't doing heavy lifting, we should find interesting insights within the model itself!

So, do just enough processing to get the data into modeling format, then feed it to a lighter model. Our weapons of choice? Ridge Regression and Random Forests.

One thing that's interesting about this data is it's more descriptive than numerical. Not a lot of information on the contents of the bars themselves, just labels of where they were from. It's unlikely we can infer a lot about the taste preferences themselves, but we can discover what labels *contribute* to the taste preferences we have.

So, let's get started!

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [None]:
df = pd.read_csv('../input/flavors_of_cacao.csv')

#removing some of the unicode non printing characters for convenience
for c in df.columns.values:
    if c != 'Rating':
        df[c] = df[c].apply(lambda x: str(x).replace(u'\xa0', ''))

        
print("Number of records: " + str(df.shape[0]))
print("Baseline Average Rating: " + str(df['Rating'].mean()))
df.head()

# Data Exploration

It's important to start off by getting a feel for the data. Who knows, we might get some ideas for valuable features to add!

### Rating
Let's start with the target. What does the rating variable actually look like? 

In [None]:
#passing the column as x= since recent seaborn versions no longer accept it positionally
sns.countplot(x=df['Rating']).set_title('Distribution Over Chocolate Ratings')
plt.show()

A lot of chocolates in the 2.5-3.5 range. Not surprising. But there are actually some chocolates at the tail ends. Which are the sad ones liked by none? And the ones loved by all?

Looks like Belgium made some duds, but Italy made some hits!

In [None]:
print("Least liked chocolates: ")
df[df['Rating'] == 1]

In [None]:
print("Most liked chocolates: ")
df[df['Rating'] == 5]

### Review Date
Is there more data in the past, or is the volume growing with time? Is recent chocolate better, or is it staying the same?

In [None]:
#passing the column as x= since recent seaborn versions no longer accept it positionally
sns.countplot(x=df['Review\nDate']).set_title('Rating Volume Over Time')
plt.show()

In [None]:
sns.set_style("darkgrid")

#fitting a linear regression line to a scatterplot
sns.regplot(x=df['Review\nDate'].apply(lambda x: float(x)), 
            y=df['Rating'].apply(lambda x: float(x)))

plt.title('Rating Over Time')
plt.show()

On the whole, chocolate seems to be getting better with time. As a chocolate lover, that's good to hear!

Although, it's more noteworthy that variation in ratings is getting smaller over time, given the funneled shape towards the right. 

### Cocoa Percent

In [None]:
print("Total unique Cocoa Percent: " + str(len(df['Cocoa\nPercent'].unique())))
df['Cocoa\nPercent'] = df['Cocoa\nPercent'].apply(lambda x: float(str(x).replace('%', '')))

sns.regplot(x=df['Cocoa\nPercent'], y=df['Rating'])

plt.title('Rating By Cocoa Percentage')
plt.show()

On the other hand, rating seems to go down with the increase of cocoa. 

I guess people aren't as big a fan of dark chocolate. ¯\_(ツ)_/¯

### Non Numeric Columns

In [None]:
df.replace('', np.nan).isnull().sum()

Most everything is filled, except we're missing about half of the Bean Type, and a few Broad Bean Origin. Why is it missing for some? Is it an insightful variable? Does the Bean Type seem to matter much? 

In [None]:
#the cleanup above turned blank placeholders into '' and one NaN into the string "nan",
#so fillna would find nothing to fill; replace both values and assign the result back
df['Bean\nType'] = df['Bean\nType'].replace(['', 'nan'], "Uknown")

print("Total unique bean types: " + str(len(df['Bean\nType'].unique())))

f = {'Rating':['size','mean','std']}
df.groupby('Bean\nType').agg(f)

There seem to be three main players in the Bean category, Criollo, Forastero, and Trinitario, with some variations within the group.

The groups all seem to be around 3.1 on the whole, including the "Uknown" records, so maybe there's more than just the bean type when it comes to good taste!

### Company Location
What about countries? Are different parts of the world preferred for their style of chocolate?

In [None]:
print("Total unique company locations: " + str(len(df['Company\nLocation'].unique())))

f = {'Rating':['size','mean','std']}
df.groupby('Company\nLocation').agg(f)

Still a lot of twos and threes. Nothing too noteworthy. 

The other columns have too many unique values to really visualize, so we will likely treat them as text later.

In [None]:
print("Total unique Specific Bean Origin or Bar Name: " + str(len(df['Specific Bean Origin\nor Bar Name'].unique())))
print("Total unique REF: " + str(len(df['REF'].unique())))

----------------------------------------------------------------------------
# Modeling Time!

Now let's get to the good stuff, actual modeling! Of course, we start by preprocessing. With sklearn, it's simple.

In [None]:
X = df.drop('Rating', axis = 1)
y = df['Rating']

X.head()

## Categorical Variables

To start off, I'm going to go ahead and one hot encode the categorical variables **before** I split them into training and test sets. 

You can do it after you split them as well, but depending on how you do it, you may have mismatched columns in the two data sets, since some rarer values likely only appear in one data set or the other. Which means writing more code to fix it. To avoid that headache, I'm going to go ahead and do it now.
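To make the mismatch concrete, here is a minimal sketch on a made-up toy column (the `toy` frame and its values are hypothetical, not from the chocolate data). It shows how encoding each split separately produces different columns, and one possible way to realign them with `reindex` if you do encode after splitting.

```python
import pandas as pd

# Hypothetical toy data: 'Fiji' only appears in the rows we treat as the test set
toy = pd.DataFrame({'origin': ['Peru', 'Ghana', 'Peru', 'Fiji']})
train, test = toy.iloc[:3], toy.iloc[3:]

# Encoding each split on its own yields different column sets
train_enc = pd.get_dummies(train['origin'])
test_enc = pd.get_dummies(test['origin'])
print(list(train_enc.columns))
print(list(test_enc.columns))

# One possible fix: force the test columns to match the training columns,
# dropping unseen values and filling never-seen columns with 0
test_aligned = test_enc.reindex(columns=train_enc.columns, fill_value=0)
print(list(test_aligned.columns))
```

Encoding before the split, as done in this notebook, sidesteps the realignment step entirely.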

### Company (Maker-if known)


In [None]:
#using pd.get_dummies to create a one hot encoded matrix
dummies = pd.get_dummies(X['Company\xa0\n(Maker-if known)'])
#Adding the variable to the column names so I can keep track of which original variable it came from
dummies.columns = ['Company_' + k for k in dummies.columns.values]
X = pd.concat([X, dummies], axis=1)

#dropping the original column 
del X['Company\xa0\n(Maker-if known)']

X.head()

### Company Location

In [None]:
#using pd.get_dummies to create a one hot encoded matrix
dummies = pd.get_dummies(X['Company\nLocation'])
#Adding the variable to the column names so I can keep track of which original variable it came from
dummies.columns = ['Company_Loc_' + k for k in dummies.columns.values]

X = pd.concat([X, dummies], axis=1)

#dropping the original column 
del X['Company\nLocation']

X.head()

### REF
I'll be honest, I don't know exactly what REF represents, but I'm thinking it's likely a code for some other thing. Which makes me think treating it as a label instead of a number is more appropriate.

In [None]:
#using pd.get_dummies to create a one hot encoded matrix
dummies = pd.get_dummies(X['REF'])
#Adding the variable to the column names so I can keep track of which original variable it came from
dummies.columns = ['REF_' + k for k in dummies.columns.values]

X = pd.concat([X, dummies], axis=1)

#dropping the original column 
del X['REF']

X.head()

### Review Date
Dates can be tricky. On the one hand, we can bucket them into categories by the distinct year. On the other hand, we can convert it to a number, and treat it as "time since..." whatever we transform it as. In this dataset, we'd likely treat 0 as the first year, 1 as the second year, and so on. For now, I'm going to treat the years as buckets. 
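For comparison, the numeric alternative could look like the sketch below. The `years` series is made up for illustration; it only assumes the years are stored as strings, as they are in this notebook after the cleanup step.

```python
import pandas as pd

# Hypothetical review years stored as strings
years = pd.Series(['2006', '2007', '2010', '2006'])

# "Years since the first review year": 0 for the earliest year, 1 for the next, and so on
years_num = years.astype(int)
years_since_first = years_num - years_num.min()
print(years_since_first.tolist())  # [0, 1, 4, 0]
```

The numeric form lets a linear model learn a single trend over time, while buckets let each year have its own effect.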

In [None]:
#using pd.get_dummies to create a one hot encoded matrix
dummies = pd.get_dummies(X['Review\nDate'])
#Adding the variable to the column names so I can keep track of which original variable it came from
dummies.columns = ['Date_' + k for k in dummies.columns.values]

X = pd.concat([X, dummies], axis=1)

#dropping the original column 
del X['Review\nDate']

X.head()

## Splitting into Train and Test
One hot encoding is simple since it only looks at each row's own value to change the representation, so it doesn't matter when you apply it. Other preprocessing looks at *the whole dataset* in order to determine how to change the data representation. For instance, to impute a mean, you have to look at the whole column to figure out what the mean is. And if you're also using the test data to figure out how to change the data before you model with it, that is sort of cheating.

To avoid that, I need to split it into a training and test set, then fit transformations to the training data and apply the transformation to both data sets.  
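As a small illustration of that fit-on-train, apply-to-both pattern, here is a sketch using mean imputation on made-up numbers. The arrays are hypothetical, and `SimpleImputer` is just one convenient way to do it.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical single-column data with a missing value in each split
train_col = np.array([[70.0], [80.0], [np.nan]])
test_col = np.array([[np.nan], [60.0]])

imputer = SimpleImputer(strategy='mean')
imputer.fit(train_col)  # the mean is learned from the training rows only

train_filled = imputer.transform(train_col)
test_filled = imputer.transform(test_col)  # the test gap gets the *training* mean

print(imputer.statistics_)   # mean learned from train: 75.0
print(test_filled.ravel())   # [75. 60.]
```

The test value of 60 never influences the imputed mean, which is the whole point of splitting first.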

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=7)

X_train.head()

## Text Variables

### Specific Bean Origin or Bar Name

For this column, I want to use key words, since I think it could be useful to pick up on several separate words in the text. Given the value "Djakarta, Java and Ghana", I'd want to consider it similar to other bars from Java, and separately be able to consider it similar to other bars from Ghana. For this reason, I'll text vectorize it. 

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

tf = TfidfVectorizer(ngram_range=(1,2))

#fitting to the training data
tf.fit(X_train['Specific Bean Origin\nor Bar Name'])

#transforming on both data set
train_transformed = tf.transform(X_train['Specific Bean Origin\nor Bar Name'])
test_transformed = tf.transform(X_test['Specific Bean Origin\nor Bar Name'])

#converting to a dataframe so we can see the column names easier later
#get_feature_names_out() (sklearn >= 1.0) returns terms in column order; iterating vocabulary_ does not
bar_name_cols = ["Bar_Name_" + k for k in tf.get_feature_names_out()]
train_transformed = pd.DataFrame(data = train_transformed.toarray(), 
                              index = X_train.index.values, 
                              columns = bar_name_cols)
test_transformed = pd.DataFrame(data = test_transformed.toarray(), 
                              index = X_test.index.values, 
                              columns = bar_name_cols)

#appending back to the original data
X_train = pd.concat([X_train, train_transformed], axis=1)
X_test = pd.concat([X_test, test_transformed], axis=1)

del X_train['Specific Bean Origin\nor Bar Name']
del X_test['Specific Bean Origin\nor Bar Name']

X_train.head()

### Bean Type
Similarly, I want to detect key words from the bean type as well.

In [None]:
tf = TfidfVectorizer(ngram_range=(1,2))

tf.fit(X_train['Bean\nType'])

#transforming on both data set
train_transformed = tf.transform(X_train['Bean\nType'])
test_transformed = tf.transform(X_test['Bean\nType'])

#converting to a dataframe so we can see the column names easier later
#get_feature_names_out() (sklearn >= 1.0) returns terms in column order; iterating vocabulary_ does not
bean_type_cols = ["Bean_Type_" + k for k in tf.get_feature_names_out()]
train_transformed = pd.DataFrame(data = train_transformed.toarray(), 
                              index = X_train.index.values, 
                              columns = bean_type_cols)
test_transformed = pd.DataFrame(data = test_transformed.toarray(), 
                              index = X_test.index.values, 
                              columns = bean_type_cols)

#appending back to the original data
X_train = pd.concat([X_train, train_transformed], axis=1)
X_test = pd.concat([X_test, test_transformed], axis=1)

del X_train['Bean\nType']
del X_test['Bean\nType']

X_train.head()

### Broad Bean Origin

Lastly, I want to get keywords for the bean origin as well.

In [None]:
tf = TfidfVectorizer(
    ngram_range=(1,2))

tf.fit(X_train['Broad Bean\nOrigin'])

#transforming on both data set
train_transformed = tf.transform(X_train['Broad Bean\nOrigin'])
test_transformed = tf.transform(X_test['Broad Bean\nOrigin'])

#converting to a dataframe so we can see the column names easier later
#get_feature_names_out() (sklearn >= 1.0) returns terms in column order; iterating vocabulary_ does not
bean_origin_cols = ["Bean_Origin_" + k for k in tf.get_feature_names_out()]
train_transformed = pd.DataFrame(data = train_transformed.toarray(), 
                              index = X_train.index.values, 
                              columns = bean_origin_cols)
test_transformed = pd.DataFrame(data = test_transformed.toarray(), 
                              index = X_test.index.values, 
                              columns = bean_origin_cols)

#appending back to the original data
X_train = pd.concat([X_train, train_transformed], axis=1)
X_test = pd.concat([X_test, test_transformed], axis=1)

del X_train['Broad Bean\nOrigin']
del X_test['Broad Bean\nOrigin']

X_train.head()

Now for the last variable.

I'm going to scale down the Cocoa Percent, since right now the values are so much bigger than any other column, and we don't want it to be unfairly influencing the model.

In [None]:
from sklearn.preprocessing import StandardScaler

sc = StandardScaler()

sc.fit(X_train['Cocoa\nPercent'].values.reshape(-1, 1))

X_train['Cocoa\nPercent'] = sc.transform(X_train['Cocoa\nPercent'].values.reshape(-1, 1))
X_test['Cocoa\nPercent'] = sc.transform(X_test['Cocoa\nPercent'].values.reshape(-1, 1))

X_train.head()

# Modeling

Now we get to the main goal, inference. There's a vast array of models you can choose from, but I'm going to stick to ones that allow for feature relationships to be easily inferred. And to increase the chances of finding interesting things, I'll make a model from two different families, one tree based model, and one linear model.

## Random Forests
I like Random Forests because it's easy to apply to both regression and classification. It also has a built in feature importance ranking, which we can use to infer which variables the model leans on most when predicting the target variable.

Unfortunately, we can't infer the ***direction*** of the relationship, but we can at least see which variables the model found useful. So let's see what we find...

To build it, I'm going to use grid search to try out a couple different parameters and pick the model with the best cross validation score. The actual code to do so is quite straightforward:

In [None]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

#parameter combinations to try
param_grid = {'n_estimators': [10, 30, 50, 90], 
              'max_depth': [5, 10, 20, None]
             }

regr = RandomForestRegressor()

#fitting the model to each combination in the grid
model = GridSearchCV(regr, param_grid)
#finding the best parameters based on the search grid
#(np.matrix is deprecated and rejected by recent sklearn, so pass a plain float array)
model.fit(X_train.to_numpy(dtype=float), y_train)

#pulling the fitted model on the best settings so we can see the variable importances
regr = model.best_estimator_

print(model.best_score_)

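The score here is the mean cross-validated R² (the default for regressors in `GridSearchCV`), not an error: a value of about 0.17 means the model explains roughly 17% of the variance in ratings. That's a modest fit, so the relationships it surfaces are better read as hints than as firm conclusions. If it was really bad, the inference we might make from the model might not actually represent the true relationships and I would say we should throw the model away, but I feel comfortable using it to look for trends. 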
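For reference, when no `scoring` argument is given, `GridSearchCV` falls back on the estimator's own `score` method, which for scikit-learn regressors is R² rather than an error. If you want an error in rating units, you can ask for one explicitly. Below is a minimal sketch on synthetic stand-in data (the `X_demo` and `y_demo` arrays and the small grid are assumptions for illustration, not the chocolate data or the grid used above).

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data, not the chocolate ratings
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(120, 3))
y_demo = 3.0 + 0.5 * X_demo[:, 0] + rng.normal(scale=0.3, size=120)

grid = {'max_depth': [3, None]}

# Default scoring for a regressor is R^2 (higher is better, 1.0 is perfect)
r2_search = GridSearchCV(RandomForestRegressor(n_estimators=20, random_state=0),
                         grid, cv=3)
r2_search.fit(X_demo, y_demo)

# Asking for mean absolute error; sklearn negates it so that higher is still better
mae_search = GridSearchCV(RandomForestRegressor(n_estimators=20, random_state=0),
                          grid, cv=3, scoring='neg_mean_absolute_error')
mae_search.fit(X_demo, y_demo)

print("Mean CV R^2:", r2_search.best_score_)
print("Mean CV MAE:", -mae_search.best_score_)
```

Flipping the sign of the second score gives an average error on the same scale as the ratings.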

Now, let's get to what we're really interested in, the feature rankings. 

In [None]:
#finding the indices that would sort the array 
sorted_indices = np.argsort(regr.feature_importances_)

#finding the most important features and associated importance
importance_rating = regr.feature_importances_[sorted_indices]
variables = X_train.columns.values[sorted_indices]

#names go in 'variable' and scores in 'importance' (they were swapped before)
importances = pd.DataFrame({'variable':variables, 'importance':importance_rating})
importances.tail(10)

Interesting! The Cocoa Percentage is the strongest factor going into a rating. People must really have a preference there.  ¯\_(ツ)_/¯
We saw that trend early in the exploratory phase, so it's good validation that the model is doing something right. 

Also, remember that variable importances assign a value for EVERY variable, I just pulled the top ten values for inference. At some point, you have to make a decision at what the cutoff value is for a true "importance". In this instance, the cocoa percentage is ***significantly*** farther from the second variable than the second variable is from the third, so I am very comfortable in the "importance" of cocoa percent, but for the other ones, I'd say there is a small trend.

Looking into the honduras trend a little more, there doesn't seem to be anything super distinguishing there. So I would be less confident in the strength of the relationship for the other variables going forward.

In [None]:
mask = df['Specific Bean Origin\nor Bar Name'].apply(lambda x: 'honduras' in str(x).lower())

print("Number of bars for Honduras: " + str(df[mask].shape[0]))
df[mask]

Looking into the Soma trend, however, gives me a little more confidence. Not only does it have a lot of bars in the category, but they also score relatively higher! [Looking at their website](https://www.somachocolate.com/), they are quite fancy, so I guess they're worth a try if you're in Toronto!

In [None]:
print("Number of bars for Soma: " + str(df[df['Company\xa0\n(Maker-if known)'] == 'Soma'].shape[0]))
print("Average Soma bar rating: " + str(df[df['Company\xa0\n(Maker-if known)'] == 'Soma']['Rating'].mean()))

On the other hand, looking at REF 887, there are actually only two bars in the category, and together they didn't do well. With such a small sample, it's hard to tell if that's really a negative correlation, or if getting more samples would change it. So I wouldn't call this one a true trend just yet.

In [None]:
print("Number of bars for 887: " + str(df[df['REF'] == '887'].shape[0]))
print("Average bar rating: " + str(df[df['REF'] == '887']['Rating'].mean()))

Similarly with the words Del Toro, not a whole lot there. So, it's probably going to be diminishing returns going down the list. Let's move on to another model.

In [None]:
mask = df['Specific Bean Origin\nor Bar Name'].apply(lambda x: 'del toro' in str(x).lower())

print("Number of bars for del toro: " + str(df[mask].shape[0]))
df[mask]

----------------------------------------
## Ridge Regression

For another inference approach, we can generate a linear model to infer variable relationships. In this model, the coefficient of the model can be interpreted as the strength of the variable in relationship to the target variable, and the sign of the coefficient tells us the direction of the relationship. 

In that regard, it's a little more powerful than a tree based model, but because they are from different families, both models have the ability to pick up on different nuances, so always try more than one!
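To see the sign-and-size reading in action, here is a small sketch on synthetic data where the true directions are known. The feature names and numbers are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Synthetic data: feature 0 raises the target, feature 1 lowers it (by half as much)
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(200, 2))
y_demo = 3.0 + 0.8 * X_demo[:, 0] - 0.4 * X_demo[:, 1] + rng.normal(scale=0.1, size=200)

ridge = Ridge(alpha=1.0).fit(X_demo, y_demo)

# The sign gives the direction, the magnitude gives the strength
for name, coef in zip(['feature_0', 'feature_1'], ridge.coef_):
    direction = 'positive' if coef > 0 else 'negative'
    print(name, round(coef, 2), direction)
```

Comparing magnitudes like this only makes sense when the features are on similar scales, which is one reason the Cocoa Percent column was standardized earlier.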

In [None]:
from sklearn.linear_model import Ridge

param_grid = {'alpha': [0.001, 0.01, 1, 3, 5, 10]
             }

regr = Ridge()

model = GridSearchCV(regr, param_grid)
#finding the best parameters based on the search grid
#(np.matrix is deprecated and rejected by recent sklearn, so pass a plain float array)
model.fit(X_train.to_numpy(dtype=float), y_train)

#pulling the fitted model on the best settings so we can see the variable importances
regr = model.best_estimator_

print(model.best_score_)

In [None]:
#finding the indices that would sort the array 
sorted_indices = np.argsort(regr.coef_)

#finding the most important features and associated coefficient
coefficients = regr.coef_[sorted_indices]
variables = X_train.columns.values[sorted_indices]

#names go in 'variable' and values in 'coefficient' (they were swapped before,
#which made the non zero check below compare names against 0.0)
importances = pd.DataFrame({'variable':variables, 'coefficient':coefficients})
print("Total non zero coefficients: " + str(len(importances[importances['coefficient'] != 0.0])))

That's a lot of variables! Again, the smaller coefficients might not be as strong of a relationship, so it's best to stick with the coefficients that are largest in magnitude, whether positive or negative. 

In [None]:
importances.head()

For negative correlations, we again see REF 887, honduras, and del toro, and we already know those were smaller samples. What about Callebaut? It's a small sample as well. [Their website](https://www.callebaut.com/en-US/homepage) looks so tempting... Poor Callebaut. 

In [None]:
print("Number of bars for Callebaut: " + str(df[df['Company\xa0\n(Maker-if known)'] == 'Callebaut'].shape[0]))
df[df['Company\xa0\n(Maker-if known)'] == 'Callebaut']

Now for the flip side, let's look at positive trends! 

In [None]:
importances.tail()

[Amedei](http://www.amedei.it/en/) is looking good. They have a larger sample, and had several 4s.

In [None]:
print("Number of bars for Amedei: " + str(df[df['Company\xa0\n(Maker-if known)'] == 'Amedei'].shape[0]))
df[df['Company\xa0\n(Maker-if known)'] == 'Amedei']

REF 111 is also consistently high. Still a smallish sample, but promising.

In [None]:
print("Number of bars for 111: " + str(df[df['REF'] == '111'].shape[0]))
df[df['REF'] == '111']

[Patric](http://patric-chocolate.com/) also had some 4s in their history. 

In [None]:
print("Number of bars for Patric: " + str(df[df['Company\xa0\n(Maker-if known)'] == 'Patric'].shape[0]))
df[df['Company\xa0\n(Maker-if known)'] == 'Patric']

[Cacao Sampaka](http://www.cacaosampaka.com/) is also doing well. They have a larger sample, and had several 4s.

In [None]:
print("Number of bars for Cacao Sampaka: " + str(df[df['Company\xa0\n(Maker-if known)'] == 'Cacao Sampaka'].shape[0]))
df[df['Company\xa0\n(Maker-if known)'] == 'Cacao Sampaka']

We can continue on, but I think you get the idea.

# Wrapping Up
Overall, this was an interesting exploration into the world of chocolates. Surprisingly, the most significant trends the models were able to pick up were the companies themselves more than any other factor. We haven't made any groundbreaking discoveries in chocolate, but at least I've found a few new brands to add to my list of desserts to try!

Have you found any interesting trends?