# About the dataset
## Context
Chocolate is one of the most popular candies in the world. Each year, residents of the United States collectively eat more than 2.8 billion pounds. However, not all chocolate bars are created equal! This dataset contains expert ratings of over 1,700 individual chocolate bars, along with information on their regional origin, percentage of cocoa, the variety of chocolate bean used and where the beans were grown.
### Flavors of Cacao Rating System:
- 5 = Elite (transcending beyond the ordinary limits)
- 4 = Premium (superior flavor development, character and style)
- 3 = Satisfactory (3.0) to praiseworthy (3.75) (well made with special qualities)
- 2 = Disappointing (passable but contains at least one significant flaw)
- 1 = Unpleasant (mostly unpalatable)



![Photo by Food Photographer | Jennifer Pallian on Unsplash](https://images.unsplash.com/photo-1481391319762-47dff72954d9?ixlib=rb-1.2.1&q=80&fm=jpg&crop=entropy&cs=tinysrgb&dl=food-photographer-jennifer-pallian-dcPNZeSY3yk-unsplash.jpg)

Photo by [Food Photographer | Jennifer Pallian](https://unsplash.com/@foodess?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText) on [Unsplash](https://unsplash.com/s/photos/chocolate?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText)

Each chocolate is evaluated from a combination of both objective qualities and subjective interpretation. A rating here only represents an experience with one bar from one batch. Batch numbers, vintages and review dates are included in the database when known.

The database is narrowly focused on plain dark chocolate with an aim of appreciating the flavors of the cacao when made into chocolate. The ratings do not reflect health benefits, social missions, or organic status.

**Flavor** is the most important component of the Flavors of Cacao ratings. Diversity, balance, intensity and purity of flavors are all considered. It is possible for a straightforward single-note chocolate to rate as high as a complex flavor profile that changes throughout. Genetics, terroir, post-harvest techniques, processing and storage can all be discussed when considering the flavor component.

**Texture** has a great impact on the overall experience, and it is also possible for texture-related issues to impact flavor. It is a good way to evaluate the maker's vision, attention to detail and level of proficiency.

**Aftermelt** is the experience after the chocolate has melted. Higher quality chocolate will linger and be long lasting and enjoyable. Since the aftermelt is the last impression you get from the chocolate, it receives equal importance in the overall rating.

**Overall Opinion** is really where the ratings reflect a subjective opinion. Ideally it is my evaluation of whether or not the components above worked together and an opinion on the flavor development, character and style. It is also here where each chocolate can usually be summarized by the most prominent impressions that you would remember about each chocolate.

# Setting up the environment

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
df = pd.read_csv("/kaggle/input/chocolate-bar-ratings/flavors_of_cacao.csv")
df.sample(10)

# Cleaning the data
To make it easier to refer to the columns, I'm renaming them.

In [None]:
df.columns = ["Company", "Specific Bean Origin or Bar Name", "REF", "Review Date", "Cocoa Percent", "Location", "Rating", "Bean Type", "Broad Bean Origin"]
df.sample()

Let's get some info about our dataset.

In [None]:
df.info()

Here we can see that there aren't many null values in our dataset.

In [None]:
df.sample(10)

But in this sample we can see that there are a lot of blanks in the `Bean Type` column.

Let's clean that.

In [None]:
list(df['Bean Type'][0:5])

In [None]:
# str.strip() also removes non-breaking spaces ("\xa0"), so a single check for
# the empty string catches both blank cells and "\xa0" cells.
df = df.applymap(lambda x: np.nan if str(x).strip() == "" else x)
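
A side note on the cleaning step above: in Python 3, `str.strip()` treats the non-breaking space (`"\xa0"`) as whitespace, so a stripped cell can only ever equal the empty string. Below is a minimal sketch on a made-up toy frame (not the real data), assuming a regex-based `DataFrame.replace` is an acceptable alternative to `applymap` for turning whitespace-only cells into `NaN`.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data (values are invented for illustration).
toy = pd.DataFrame({
    "Bean Type": ["\xa0", "Criollo", "  ", "Trinitario"],
    "Rating": [3.0, 3.5, 2.75, 4.0],
})

# The non-breaking space counts as whitespace, so strip() leaves an empty string.
nbsp_stripped = "\xa0".strip()

# Replace cells that are empty or whitespace-only with NaN in a single pass.
cleaned = toy.replace(r"^\s*$", np.nan, regex=True)
n_missing = int(cleaned["Bean Type"].isna().sum())
print(repr(nbsp_stripped), n_missing)
```

The regex only touches string cells, so the numeric `Rating` column is left alone.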

In [None]:
df.info()

Now we can see that there are a lot of missing values in `Bean Type`.

In [None]:
df['Bean Type'].value_counts()

We have two options here:
1. Fill all of the null values with the mode value (`Trinitario`)
2. Drop the column

Let's try method 1.

In [None]:
df['Bean Type'] = df['Bean Type'].fillna('Trinitario')

In [None]:
df['Bean Type'].value_counts()

But now `Trinitario` dominates the column, because there were so many null values, and that might skew the results. Therefore, we will drop the whole column.

In [None]:
df.drop(['Bean Type'], axis=1, inplace=True)
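
The reasoning behind dropping the imputed column can be checked with a couple of numbers: the share of missing values, and the share of the mode before and after filling. This is only a sketch on an invented toy series; the real proportions come from running the same expressions on `df['Bean Type']` before it is filled.

```python
import pandas as pd

# Invented toy values: half of the entries are missing.
bean_type = pd.Series(
    ["Trinitario", "Criollo", None, None, "Trinitario", None, "Forastero", None]
)

missing_share = bean_type.isna().mean()

# mode() ignores missing values by default.
mode_value = bean_type.mode()[0]

# Share of the mode among known values vs. after mode imputation.
share_before = (bean_type.dropna() == mode_value).mean()
share_after = (bean_type.fillna(mode_value) == mode_value).mean()

print(mode_value, missing_share, share_before, share_after)
```

When the gap between `share_before` and `share_after` is large, the filled column mostly encodes "was missing" rather than the bean type.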

In [None]:
df.sample()

Great!

Now, let's analyze the `Company` column.

In [None]:
df['Company'].value_counts()

We can see `Company` has 416 different classes.

Again, we have 2 options:
1. Group the companies based on some other metric
2. Drop the column

Let's check the top 14 companies.

In [None]:
top_companies = list(df['Company'].value_counts()[:14].index)
top_companies

Maybe we can group the rest of the companies as `Other`.

In [None]:
df['Company'] = df['Company'].apply(lambda x: x if x in top_companies else 'Other')
df['Company'].value_counts()

We have the same problem as with `Bean Type`: a single class (here `Other`) dominates the column. Therefore, let's drop this column too.

In [None]:
df.drop(['Company'], axis=1, inplace=True)

In [None]:
df.sample()

Nice!

Now let's convert the `Cocoa Percent` to *float*.

In [None]:
df['Cocoa Percent'] = df['Cocoa Percent'].apply(lambda x: float(x.strip('%')) / 100.0)

In [None]:
df.sample(3)
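
The same conversion can also be written without `apply`. Here is a minimal sketch on made-up strings, assuming it is acceptable for malformed entries to become `NaN` rather than raise an error (the `"n/a"` entry is hypothetical and only there to show that behavior).

```python
import pandas as pd

# Made-up raw values; "n/a" is a hypothetical malformed entry.
cocoa_raw = pd.Series(["70%", "63.5%", "100%", "n/a"])

# Strip the trailing percent sign, parse to numbers, and scale to the 0-1 range.
# errors="coerce" turns anything unparseable into NaN.
cocoa = pd.to_numeric(cocoa_raw.str.rstrip("%"), errors="coerce") / 100.0
print(cocoa)
```

The lambda version above fails loudly on a bad value, which is also a reasonable choice when the column is expected to be clean.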

Awesome!

Now let's see the `Specific Bean Origin or Bar Name` column and the `Broad Bean Origin` column.

In [None]:
df['Specific Bean Origin or Bar Name'].value_counts()

In [None]:
df['Broad Bean Origin'].value_counts()

In [None]:
df.sample(10)

Here we can see the two columns are very similar. Therefore, we can drop `Specific Bean Origin or Bar Name`.

In [None]:
df.drop(['Specific Bean Origin or Bar Name'], axis=1, inplace=True)

In [None]:
df.sample()

Cool!

In [None]:
df['Broad Bean Origin'].value_counts()

Let's reduce the number of classes in this column.

In [None]:
top_origin = list(df['Broad Bean Origin'].value_counts()[:14].index)
top_origin

In [None]:
df['Broad Bean Origin'] = df['Broad Bean Origin'].apply(lambda x: x if x in top_origin else 'Other')

In [None]:
df['Broad Bean Origin'].value_counts()
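
The "keep the top N, call the rest `Other`" step is used for both `Company` and `Broad Bean Origin`, so it could be wrapped in a small helper. The function name `keep_top_n` is hypothetical, and the data below is an invented toy series.

```python
import pandas as pd

def keep_top_n(series, n, other="Other"):
    """Keep the n most frequent values and replace everything else with `other`."""
    # value_counts() sorts by frequency, most frequent first.
    top = series.value_counts().index[:n]
    # where() keeps entries for which the condition holds and replaces the rest.
    return series.where(series.isin(top), other)

origins = pd.Series(["A", "A", "A", "B", "B", "C", "D", None])
grouped = keep_top_n(origins, n=2)
print(grouped.value_counts())
```

Like the lambda-based version, this also sends missing values to `Other`, since they are never among the top values.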

# Feature Generation
We can classify the chocolates by `Cocoa Percent` into `dark` (70% or more), `normal` (above 0% but below 70%) and `white` (0%).

In [None]:
df[df['Cocoa Percent']<0.7].sample(10)

In [None]:
df['Chocolate Type'] = df['Cocoa Percent'].apply(lambda x: 'dark' if x>=0.7 else 'normal' if x>0.0 else 'white')

In [None]:
df['Chocolate Type'].value_counts()
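
The nested conditional in the lambda can be spelled out with `np.select`, which makes the thresholds easier to read. This is a sketch on invented values using the same cut-offs as above. Since the database is described as focused on plain dark chocolate, the `white` branch may never be hit on the real data.

```python
import numpy as np
import pandas as pd

# Invented cocoa fractions covering each branch and the 0.7 boundary.
cocoa = pd.Series([0.70, 0.55, 0.0, 1.0, 0.699])

# Conditions are checked in order; the first one that matches wins.
conditions = [cocoa >= 0.7, cocoa > 0.0]
labels = ["dark", "normal"]
chocolate_type = np.select(conditions, labels, default="white")
print(list(chocolate_type))
```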

In [None]:
df.info()

Peace.

We can even group the `Location` column into continents.

In [None]:
# Listing top 10 countries
list(df['Location'].value_counts()[:10].index)

In [None]:
asia = ['Japan', 'Vietnam', 'Israel', 'South Korea', 'Singapore', 'India', 'Philippines', 'Russia']
africa = ['Madagascar', 'Sao Tome', 'South Africa', 'Ghana']
# Misspellings such as 'Niacragua' and 'Eucador' are kept on purpose, presumably to match values in the data.
north_america = ['U.S.A.', 'Canada', 'Martinique', 'Niacragua', 'Guatemala', 'St. Lucia', 'Puerto Rico', 'Mexico', 'Costa Rica', 'Honduras', 'Nicaragua', 'Domincan Republic']
south_america = ['Ecuador', 'Eucador', 'Colombia', 'Suriname', 'Bolivia', 'Venezuela', 'Chile', 'Peru', 'Brazil', 'Argentina']
europe = ['France', 'Denmark', 'Scotland', 'Wales', 'Czech Republic', 'Finland', 'Ireland', 'Portugal', 'Netherlands', 'Poland', 'Amsterdam', 'Sweden', 'U.K.', 'Italy', 'Belgium', 'Switzerland', 'Germany', 'Austria', 'Spain', 'Hungary', 'Lithuania']
oceania = ['Australia', 'New Zealand', 'Fiji']

In [None]:
def continents(x):
    if x in asia:
        return 'asia'
    if x in africa:
        return 'africa'
    if x in north_america:
        return 'north america'
    if x in south_america:
        return 'south america'
    if x in europe:
        return 'europe'
    if x in oceania:
        return 'oceania'
    # Fallback: any location not listed above is counted as Europe.
    return 'europe'

In [None]:
df['Continent'] = df['Location'].apply(continents)
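
Because `continents` falls back to `'europe'` for anything it does not recognize, a location missing from the lists is mislabeled silently. One way to spot such cases is a dictionary lookup, which leaves unknown locations as `NaN`. The lists below are shortened, hypothetical stand-ins for the ones defined above, and `Grenada` is an invented example of an unlisted location.

```python
import pandas as pd

# Shortened stand-ins for the continent lists (illustration only).
continent_members = {
    "asia": ["Japan", "Vietnam"],
    "europe": ["France", "Italy"],
    "north america": ["U.S.A.", "Canada"],
}

# Invert to a location -> continent lookup.
location_to_continent = {
    location: continent
    for continent, members in continent_members.items()
    for location in members
}

locations = pd.Series(["France", "U.S.A.", "Japan", "Grenada", "Italy"])

# map() yields NaN for locations that are not in the lookup.
mapped = locations.map(location_to_continent)
unmapped = sorted(locations[mapped.isna()].unique())
print(unmapped)
```

Running the same check on `df['Location']` before that column is dropped would list every value that currently ends up in the fallback.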

In [None]:
df.sample(5)

In [None]:
df.drop(['Location'], axis=1, inplace=True)

In [None]:
df.sample()

In [None]:
df['Continent'].value_counts()

In [None]:
df.info()

Now, let's drop the `REF` column as it is of no use to us.

In [None]:
df.drop(['REF'], axis=1, inplace=True)

In [None]:
df.sample()

Now let's convert the data type of the string columns into categorical data type.

In [None]:
df['Continent'] = pd.Categorical(df['Continent'])
df['Broad Bean Origin'] = pd.Categorical(df['Broad Bean Origin'])
df['Chocolate Type'] = pd.Categorical(df['Chocolate Type'])

In [None]:
df.info()

# EDA
Types of chocolates reviewed

In [None]:
sns.countplot(x='Chocolate Type', data=df)

We can see that a lot more dark chocolates are reviewed.

*Maybe other types aren't really chocolates... just saying :)*

In [None]:
plt.figure(figsize=(15,8))
sns.lineplot(x='Cocoa Percent', y='Rating', data=df)

According to this plot, chocolates at the extremes (too dark or too light) tend to be rated lower. The *sweet spot* is around 60%-70%.

*You see what I did there... **Sweet spot**... hehe*
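
To double-check what the line plot suggests, the cocoa percentage can be binned and the mean rating compared per bin, together with how many bars fall into each bin. This is a sketch on invented numbers; the bin edges are an assumption and can be adjusted.

```python
import pandas as pd

# Invented toy data, not taken from the real dataset.
toy = pd.DataFrame({
    "Cocoa Percent": [0.55, 0.65, 0.68, 0.72, 0.75, 0.90, 1.00],
    "Rating": [3.0, 3.5, 3.25, 3.5, 3.25, 2.75, 2.0],
})

# Bins are right-inclusive by default, e.g. (0.6, 0.7].
bins = [0.4, 0.6, 0.7, 0.8, 1.0]
toy["Cocoa Bin"] = pd.cut(toy["Cocoa Percent"], bins=bins)

# Mean rating and number of bars per bin.
summary = toy.groupby("Cocoa Bin", observed=True)["Rating"].agg(["mean", "count"])
print(summary)
```

The `count` column matters here: a bin with only a handful of bars can produce a noisy mean.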

In [None]:
sns.relplot(x="Review Date", y="Rating", size="Cocoa Percent", sizes=(15, 200), data=df);

Here we can see that very dark chocolates are not rated well :/

The larger dots sit at the bottom of the rating chart for almost all of the years.

In [None]:
sns.relplot(x="Review Date", y="Rating", hue="Chocolate Type", kind="line", data=df);

Here we can see that the liking for dark chocolates has increased over the years, whereas for the others it has stayed the same.

In [None]:
sns.countplot(x='Continent', data=df)

Looking at this graph, we can see that this dataset is not very diverse and could be biased towards European and North American tastes.

In [None]:
plt.figure(figsize=(20,8))
sns.countplot(x='Broad Bean Origin', data=df)

If we exclude `Other`, we can see that the most common `Broad Bean Origin` is *Venezuela*.

In [None]:
plt.figure(figsize=(20,8))
sns.lineplot(y='Rating', x='Broad Bean Origin', data=df)

In this plot we can see that beans from all of the origins are more or less equally liked.
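
The "more or less equally liked" reading can be backed up with a small table of mean rating, spread and number of bars per origin. This is a sketch on invented values, not results from the real data.

```python
import pandas as pd

# Invented toy data for illustration.
toy = pd.DataFrame({
    "Broad Bean Origin": ["Venezuela", "Venezuela", "Peru", "Peru", "Other"],
    "Rating": [3.5, 3.0, 3.25, 3.75, 2.75],
})

# Mean, standard deviation and count per origin, best-rated first.
by_origin = (
    toy.groupby("Broad Bean Origin")["Rating"]
    .agg(["mean", "std", "count"])
    .sort_values("mean", ascending=False)
)
print(by_origin)
```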

In [None]:
cat_columns = df.select_dtypes(['category']).columns
cat_columns

In [None]:
df[cat_columns] = df[cat_columns].apply(lambda x: x.cat.codes)
df.sample(5)

In [None]:
plt.figure(figsize=(10,10))
sns.heatmap(df.corr(), linewidths=0.1, vmax=1.0, square=True, cmap='coolwarm', linecolor='white', annot=True).set_title("Correlation Map")
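
One caveat for the correlation map: `cat.codes` numbers the categories in sorted (alphabetical) order, so for columns like `Continent` the size of the code carries no meaning. A possible alternative, sketched here on invented data, is one-hot encoding with `pd.get_dummies`, so that each category gets its own 0/1 column.

```python
import pandas as pd

# Category codes follow alphabetical order, not any real-world ordering.
codes = pd.Categorical(["north america", "asia", "europe"]).codes

# Invented toy data for illustration.
toy = pd.DataFrame({
    "Chocolate Type": ["dark", "dark", "normal", "normal"],
    "Rating": [3.5, 3.75, 3.0, 2.75],
})

# One 0/1 column per category; dtype=float keeps corr() straightforward.
encoded = pd.get_dummies(toy, columns=["Chocolate Type"], dtype=float)
corr = encoded.corr()
print(list(codes))
print(corr.round(2))
```

Each coefficient in `corr` then describes one category against the rest, which is easier to interpret than a correlation with an arbitrary integer code.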
