To me, this data set appears to provide an opportunity to demonstrate the importance of data visualization skills. If we were to buy chocolates based on this data, can visualization help us reach a better decision? Also, how much do these ratings actually matter? Again, can we gauge their importance from the visualization exercises alone? Let's see.

As usual, we first import the libraries that we need.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory

from subprocess import check_output
print(check_output(["ls", "../input"]).decode("utf8"))

# Any results you write to the current directory are saved as output.

Read the relevant data file. Now, the column names of this file are somewhat tedious to use in the analysis. Hence, after loading the file, I have changed the names to something more manageable for me.

In [None]:
choco_df = pd.read_csv("../input/flavors_of_cacao.csv")
choco_df.head()

In [None]:
choco_df.columns=['Maker', 'Bean Origin', 'REF', 'Review Date','Cocoa percent', 'Company Location', 'Rating', 'Bean Type',' Bean Country']

In [None]:
choco_df.head()

The **Cocoa percent** column is stored as a string (for example, with a trailing % sign). To be used in data visualization, it needs to be converted to a numeric type.

In [None]:
choco_df['Cocoa percent'] = choco_df['Cocoa percent'].str.replace('%','')
choco_df['Cocoa percent'] = pd.to_numeric(choco_df['Cocoa percent'])
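
As a minimal sketch of the conversion above, the snippet below runs the same two steps on a small made-up Series. The sample values are illustrative assumptions, not rows from the data set, and `errors='coerce'` is an optional extra that is not used in the notebook itself.

```python
import pandas as pd

# Made-up sample values, for illustration only (not taken from the data set)
sample = pd.Series(['63%', '70%', '72.5%'])

# Strip the percent sign, then convert the remaining text to numbers
cleaned = sample.str.replace('%', '', regex=False)
numeric = pd.to_numeric(cleaned, errors='coerce')  # unparseable entries would become NaN

print(numeric.tolist())
```

With `errors='coerce'`, a malformed entry shows up as a missing value instead of stopping the notebook, which may or may not be what you want.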

Let's get a list of the countries where the companies are located.

In [None]:
choco_df['Company Location'].unique()

Now let's get a list of chocolate manufacturing companies

In [None]:
choco_df['Maker'].unique()

Let's look at the distribution of chocolate bars w.r.t. their cocoa percent

In [None]:
plt.subplots(figsize=(12,9))
# histplot replaces the deprecated distplot (requires seaborn 0.11 or later)
sns.histplot(choco_df['Cocoa percent'],color='red')

The above plot shows that most of the rated chocolates contain approximately 70-75% cocoa.

Let us now look at what countries have the highest mean rating of chocolate bars

In [None]:
choco_country_rating = choco_df[['Company Location','Rating']]
choco_country_rating.head()

Note that the above is unsorted.

We now group the ratings with the countries, based on their mean values

In [None]:
choco_country_rating_mean_std = choco_country_rating.groupby('Company Location',sort=False).mean()
choco_country_rating_mean_std=choco_country_rating_mean_std.reset_index()


In [None]:
# Use a flat list of names; a nested list would create multi-level column labels
choco_country_rating_mean_std.columns = ['Company Location','Rating Mean']
choco_country_rating_mean_std.head()

Let us plot the 20 highest mean ratings and the country those belong to

In [None]:
plt.subplots(figsize=(12,9))
sns.barplot(x='Company Location',y='Rating Mean',data=choco_country_rating_mean_std.nlargest(20,'Rating Mean'))
plt.xticks(rotation=90)
plt.tight_layout()

This shows some surprising results. Who would have thought that chocolates from Chile would have the highest mean rating? Also, note that Amsterdam and Netherlands appear as separate locations. That is because of how the data was entered. We didn't do any data cleaning, which we should have done. This is a manual task, where we would need to locate such entries by scanning through the list, and then replace them with a single consistent value.
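
The cleaning step described above could look roughly like this. It is only a sketch on a toy data frame: the ratings are made up, and the `location_fixes` mapping is a hypothetical name that would have to be filled in by hand after scanning the real list of locations.

```python
import pandas as pd

# Toy frame with made-up ratings; only the location labels mirror the issue above
toy = pd.DataFrame({
    'Company Location': ['Amsterdam', 'Netherlands', 'Chile', 'Amsterdam'],
    'Rating': [3.5, 3.25, 3.75, 3.0],
})

# Hypothetical mapping from inconsistent labels to one canonical label
location_fixes = {'Amsterdam': 'Netherlands'}

# Replace every inconsistent label; labels not in the mapping stay unchanged
toy['Company Location'] = toy['Company Location'].replace(location_fixes)
print(toy['Company Location'].value_counts())
```

The same `replace` call could be applied to `choco_df['Company Location']` before any grouping is done.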

Also, note that the values are not that different from each other. In other words, this appears to be a close competition.
Let us analyze the underlying causes for this result.

In [None]:
choco_df[(choco_df['Company Location']=='Chile')]['Company Location'].value_counts()

In [None]:
choco_df[(choco_df['Company Location']=='Netherlands')]['Company Location'].value_counts()

In [None]:
choco_df[(choco_df['Company Location']=='Amsterdam')]['Company Location'].value_counts()

From the above figures, we can see that there are only 2 entries (rated bars) for Chile, and 4 each for Netherlands and Amsterdam. Note that these counts are rows in the data, not companies. With so few ratings, the mean for these locations is not reliable.
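
One way to spot this kind of problem early is to look at the mean and the number of entries side by side. The sketch below does that on made-up ratings (the values are assumptions for illustration, not figures from the data set).

```python
import pandas as pd

# Made-up ratings: one location with very few entries, one with more
toy = pd.DataFrame({
    'Company Location': ['Chile', 'Chile', 'U.S.A.', 'U.S.A.', 'U.S.A.', 'U.S.A.'],
    'Rating': [3.75, 3.75, 2.5, 3.0, 3.5, 4.0],
})

# Mean rating and number of entries per location, highest mean first
summary = (
    toy.groupby('Company Location')['Rating']
       .agg(['mean', 'count'])
       .sort_values('mean', ascending=False)
)
print(summary)
```

A location at the top of the list with a very small `count` is a hint that its mean should be read with caution.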

Let us redo the above analysis for only those locations with more than 10 ratings for each location.

First, we find out which locations have > 10 entries.

In [None]:
# Boolean flag per location: True if the location has more than 10 entries.
# The explicit name keeps the merged column label independent of the pandas version.
choco_co_loc10 = (choco_df['Company Location'].value_counts()>10).rename('More than 10')
choco_co_loc10.head()

Now, we merge this flag into the data frame, so that each row records whether its country is mentioned > 10 times in the "Company Location" column.

In [None]:
choco_df_10loc = choco_df.merge(choco_co_loc10.to_frame(),left_on='Company Location',right_index=True)
choco_df_10loc.head()

We now look at the mean ratings of chocolates from such locations

In [None]:
# Keep only the flagged rows, then average the ratings per location
choco_highloc_rating = choco_df_10loc[choco_df_10loc['More than 10']].groupby('Company Location')[['Rating']].mean().reset_index()
choco_highloc_rating.head()
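
As an aside, the same kind of filter can be written without a merge. The sketch below is one possible alternative on made-up data, with a threshold of 2 instead of 10 so that the toy frame stays small; `min_entries` is a name introduced here for illustration.

```python
import pandas as pd

# Made-up ratings, for illustration only
toy = pd.DataFrame({
    'Company Location': ['France', 'France', 'France', 'Chile', 'U.S.A.', 'U.S.A.', 'U.S.A.'],
    'Rating': [3.5, 3.0, 4.0, 3.75, 3.0, 3.25, 2.75],
})
min_entries = 2  # the notebook uses 10

# Number of entries for each row's location, aligned with the original rows
counts = toy.groupby('Company Location')['Rating'].transform('count')

# Keep rows from frequent locations, then average the ratings per location
frequent = toy[counts > min_entries]
mean_by_location = frequent.groupby('Company Location')['Rating'].mean().reset_index()
print(mean_by_location)
```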

In [None]:
plt.subplots(figsize=(12,9))
sns.barplot(x='Company Location',y='Rating',data=choco_highloc_rating.nlargest(20,'Rating'))
plt.xticks(rotation=90)
plt.tight_layout()

Now, we have a better picture of the mean ratings of chocolate bars from these countries. We see that the countries with 10 or fewer entries have been removed. The U.S.A., therefore, is able to come into the top 20, albeit at the lower end. However, note that the ratings are still very close to each other.

**What this shows us is that it may not matter where the chocolate was manufactured!**

So, you needn't worry too much where the chocolate was made, when you purchase one.

**Does the percentage of cocoa determine what the rating of the chocolate would be?**

Let's take a look.

In [None]:
plt.subplots(figsize=(16,12))
sns.swarmplot(x='Cocoa percent',y='Rating',data=choco_df)
plt.xticks(rotation=90)
plt.tight_layout()

The above plot shows that there is **almost no** visible relation between cocoa content and chocolate rating! The rating is presumably based on a combination of various factors, and cocoa content on its own does not appear to drive it.
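
A plot is a visual impression; one way to put a number on it is a correlation coefficient. The sketch below uses made-up values, so its output says nothing about the real data. On the notebook's data frame, the equivalent call would be `choco_df['Cocoa percent'].corr(choco_df['Rating'])`.

```python
import pandas as pd

# Made-up cocoa percentages and ratings, for illustration only
toy = pd.DataFrame({
    'Cocoa percent': [60.0, 70.0, 70.0, 75.0, 85.0, 100.0],
    'Rating': [3.0, 3.5, 2.75, 3.25, 3.0, 2.5],
})

# Pearson correlation: close to 0 means little linear relation,
# close to +1 or -1 means a strong linear relation
pearson = toy['Cocoa percent'].corr(toy['Rating'])
print(round(pearson, 3))
```

Keep in mind that a correlation coefficient only captures a linear relation, so it complements the plot rather than replacing it.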

Let us now look at chocolate makers who have a high rating. A "**high rating**" is a rating of 3 or above, since 3 indicates satisfactory.

In [None]:
choco_df_high_rates = choco_df[choco_df['Rating']>=3.0]
choco_df_high_rates.head()

Let us look at the 20 chocolate makers with the largest number of highly rated bars

In [None]:
plt.subplots(figsize=(16,12))
choco_df_high_rates['Maker'].value_counts().head(20).plot.barh()
plt.xlabel('No. of bars')
plt.ylabel('Maker')
plt.tight_layout()

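This barplot shows that Soma has the highest number of chocolates that have been rated 3.0 or higher. So, if you buy a chocolate manufactured by Soma, you have many highly rated bars to choose from. Note that this is a count and not a proportion: a maker with many rated bars will naturally rank higher here, so it is worth checking how all of its bars were rated.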
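
To see why a count and a chance are not the same thing, the sketch below computes both the number and the share of highly rated bars per maker. The makers `A` and `B` and their ratings are made up for illustration.

```python
import pandas as pd

# Made-up makers and ratings, for illustration only
toy = pd.DataFrame({
    'Maker': ['A', 'A', 'A', 'A', 'B', 'B'],
    'Rating': [3.5, 3.0, 2.5, 2.75, 3.5, 3.25],
})

# Flag the highly rated bars using the same threshold as above
is_high = toy['Rating'] >= 3.0

# Per maker: number of high-rated bars, total bars, and share of high-rated bars
per_maker = toy.assign(high=is_high).groupby('Maker')['high'].agg(['sum', 'count', 'mean'])
per_maker.columns = ['high_count', 'total', 'high_share']
print(per_maker)
```

In this toy example both makers have the same number of highly rated bars, but only half of maker A's bars are rated highly, while all of maker B's are.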

What does the distribution of ratings for Soma's chocolates look like?

In [None]:
plt.subplots(figsize=(16,12))
choco_df[(choco_df['Maker']=='Soma')]['Rating'].plot.hist()

So, except for two, the rest of Soma's chocolates have a high rating. 

Where does Soma get its beans from?

In [None]:
plt.subplots(figsize=(16,12))
choco_df[choco_df['Maker']=='Soma'][' Bean Country'].value_counts().head(20).plot.barh()
plt.ylabel('Origin of bean for Soma chocolates')
plt.xlabel('No. of bars made with beans from each country')
plt.tight_layout()

Where does Soma make its chocolates?

In [None]:
plt.subplots(figsize=(16,12))
choco_df_high_rates[choco_df_high_rates['Maker']=='Soma']['Company Location'].value_counts().head(20).plot.barh()
plt.ylabel('Company Location')

Which are the **Top 20 countries** that manufacture the highly rated chocolate bars?

In [None]:
plt.subplots(figsize=(16,12))
choco_df_high_rates['Company Location'].value_counts().head(20).plot.barh()
plt.xlabel('No. of bars')
plt.ylabel('Company Location')

So, what is the distribution of ratings for the highly rated chocolates made in the USA?

In [None]:
plt.subplots(figsize=(16,12))
choco_df_high_rates[(choco_df_high_rates['Company Location']=='U.S.A.')]['Rating'].plot.hist()

Since we only kept ratings of 3 or above, all of these US-made chocolates are rated between 3 and 4, with most being rated between 3.0 and 3.6.

Which are the **Top 20** sources of beans, based on the number of highly rated chocolates?

In [None]:
plt.subplots(figsize=(16,12))
choco_df_high_rates[' Bean Country'].value_counts().head(20).plot.barh()
plt.xlabel('No. of Bars')
plt.ylabel('Origin of Bean')
plt.tight_layout()

Which companies located in the USA source their beans from Venezuela, and how many highly rated bars does each of them have?

In [None]:
plt.subplots(figsize=(16,12))
choco_df_high_rates[(choco_df_high_rates[' Bean Country']=='Venezuela') & (choco_df_high_rates['Company Location']=='U.S.A.')]['Maker'].value_counts().head(20).plot.barh()
plt.xlabel('No. of bars')
plt.ylabel('Company name')

So, to conclude, the cocoa percent doesn't provide any insight into how highly rated a chocolate could be. It turns out that cocoa percent is merely a personal choice. The analysis also sheds light on where one could most likely buy a highly rated chocolate, which manufacturers are most likely to have a highly rated chocolate on the market, and which country's beans to look out for when making a purchasing decision about that chocolate.

If you liked going through this analysis, do comment or upvote.