Hi everyone, this is a simple EDA project using Python, working with the Chocolate Bar Ratings dataset from [Flavors of Cacao](http://flavorsofcacao.com/index.html). Let's get started :)

In [None]:
#Usual imports for data analysis and visualization
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('darkgrid')
%matplotlib inline

Let's read in the dataset and get some information on what's in it.

In [None]:
df = pd.read_csv('../input/flavors_of_cacao.csv')

In [None]:
df.info()

In [None]:
df.head()

Now let's check to make sure that there aren't a lot of null values in the data.

In [None]:
sns.heatmap(df.isnull(),cmap='plasma')

In [None]:
df.isnull().sum()
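
Raw null counts are easier to judge as a share of all rows. Here is a minimal sketch of that idea on a made-up toy frame (the `toy` data and the `null_share` name are assumptions for illustration, not part of the dataset):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the chocolate data (values are made up)
toy = pd.DataFrame({
    'Rating': [3.5, 2.75, 4.0, 3.0],
    'Bean\nType': ['Criollo', np.nan, 'Trinitario', np.nan],
})

# isnull() gives booleans; their column-wise mean is the fraction of missing values
null_share = toy.isnull().mean()
print(null_share)
```

On the real data, the same call would be `df.isnull().mean()`.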

It doesn't look like there are a lot of null values, so we won't need to clean up too much. For now, we'll drop the rows with null values.

In [None]:
#Dropping null values from dataset
df.dropna(subset=['Broad Bean\nOrigin'],inplace=True)
df.dropna(subset=['Bean\nType'],inplace=True)

Also, one thing to note is that the values in the cocoa percent column are strings. Let's convert them to floats to simplify the analysis.

In [None]:
#Convert Cocoa Percent from string to float
df['Cocoa\nPercent'] = df['Cocoa\nPercent'].apply(lambda num: float(num.strip('%'))/100)
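
The same conversion can probably be done without `apply`, using pandas string methods. This is just a sketch on a made-up sample (the `cocoa` values and the `cocoa_fraction` name are assumptions for illustration):

```python
import pandas as pd

# Made-up sample of percentage strings in the same format as the column
cocoa = pd.Series(['63%', '70%', '72.5%', '100%'])

# Strip the trailing percent sign, cast to float, and scale to a 0-1 fraction
cocoa_fraction = cocoa.str.rstrip('%').astype(float) / 100
print(cocoa_fraction.tolist())
```

Both versions should give the same result; the vectorized one mostly saves a lambda.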

Now, let's explore the data with some simple visualizations.

In [None]:
#plot number of chocolates from top 20 Companies 
plt.figure(figsize=(12,6))
df['Company\xa0\n(Maker-if known)'].value_counts().head(20).plot.bar(color = 'crimson')
plt.title('Top 20 Companies')
plt.ylabel('Number of Chocolate Bars')
plt.xlabel('Company',labelpad = 20)
plt.tight_layout()

In [None]:
df['Company\xa0\n(Maker-if known)'].value_counts().head(5)

* Soma has the most chocolate bars in the dataset, with 46 different bars.
* The other companies in the top 20 have around the same number of bars.

In [None]:
#plot top 15 Company location countries
plt.figure(figsize=(12,6))
df['Company\nLocation'].value_counts().head(15).plot.bar(color = 'green')
plt.title('Top 15 Company Locations')
plt.xlabel('Company Location Country', labelpad = 20)
plt.ylabel('Number of Bars')
plt.tight_layout()

* Bars overwhelmingly come from companies based in the US, with over 750 bars from US-based makers! (Each row in the dataset is a bar, so this counts bars rather than distinct companies.)
* European countries are also well represented among company locations, although the origin of the beans will most likely be very different from where the company is based.
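
Since `value_counts()` counts rows (bars), counting distinct makers per location needs a different call. Here is a small sketch with made-up data (the `bars` frame and its short column names are assumptions for illustration):

```python
import pandas as pd

# Toy data (made up): one row per bar, as in the real dataset
bars = pd.DataFrame({
    'company': ['A', 'A', 'A', 'B', 'C', 'C'],
    'location': ['U.S.A.', 'U.S.A.', 'U.S.A.', 'U.S.A.', 'France', 'France'],
})

# value_counts() counts rows, i.e. bars per location
bars_per_location = bars['location'].value_counts()

# groupby + nunique counts distinct makers per location
companies_per_location = bars.groupby('location')['company'].nunique()

print(bars_per_location)
print(companies_per_location)
```

On the real data, the two columns would be the company location and company name columns used in the cells above.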

Let's do a pairplot to quickly visualize any relationships in our continuous data.

In [None]:
#Pairplot of columns for further EDA
sns.pairplot(df)

Some things to note:
* The REF distribution doesn't seem to change as time goes on. However, as review dates increase, the REF number increases as well. This suggests that REF is just an incremental number given to reviews as they come in, so it most likely has little to no impact on any further EDA.
* Ratings appear to follow a left-tailed distribution, with the mean floating around 3.5
* Cocoa Percent looks roughly normally distributed, showing that most chocolates tend to be around 70% cacao
* Reviews increase as time goes on, perhaps due to the internet becoming more and more mainstream

With the observations made above, let's explore the data further and see if what we assessed is accurate.

In [None]:
#Distribution of Review Dates for chocolates
plt.figure(figsize=(10,5))
sns.histplot(df['Review\nDate'], bins=30)  #histplot replaces the deprecated distplot
plt.ylabel('Number of Reviews')
plt.title('Reviews Over Time')
plt.tight_layout()

* The number of reviews increases over time on average, peaking in 2015

In [None]:
#Distribution of rating across dataset
plt.figure(figsize=(10,5))
sns.histplot(df['Rating'], bins=20, color='purple', kde=True, stat='density')  #histplot replaces the deprecated distplot
plt.title('Distribution of Ratings')
plt.tight_layout()

In [None]:
df['Rating'].describe()

* Ratings are roughly normally distributed, with the mean around 3.2
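
One way to check the shape numerically rather than by eye is to look at the sample skewness and compare the mean with the median. This is only a sketch on made-up ratings (the `ratings` values are an assumption for illustration, not taken from the dataset):

```python
import pandas as pd

# Made-up ratings with a long tail toward low scores
ratings = pd.Series([1.0, 2.5, 3.0, 3.0, 3.25, 3.5, 3.5, 3.5, 3.75, 4.0])

# Sample skewness: negative means a longer left tail, near zero means symmetric
skewness = ratings.skew()

# A mean below the median is another hint of a left tail
mean_rating, median_rating = ratings.mean(), ratings.median()

print(skewness, mean_rating, median_rating)
```

On the real data, `df['Rating'].skew()` would give the same kind of check.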

We saw in the graphs above that some bars were rated 1 and some were rated 5. Let's see which chocolates had those ratings.

In [None]:
#Outliers, showing chocolates with ratings of 5 and 1
rating_outliers = df[(df['Rating'] == 5) | (df['Rating']==1)]
rating_outliers

In [None]:
rating_outliers.describe()

In [None]:
df[df['Company\xa0\n(Maker-if known)'] =='Amedei']

* There are only four bars that have a 1 rating and two bars that have a perfect 5 rating
* The cocoa percentage doesn't seem to matter much, since all but one have around 70%
* Amedei had mostly good reviews the year they got their 5.0 ratings; perhaps they had a really good year

Let's look at the distribution of cocoa percentages now.

In [None]:
#Plot distribution of cocoa percentage
plt.figure(figsize = (12,6))
sns.histplot(df['Cocoa\nPercent'], bins=25, color='blue', kde=True, stat='density')  #histplot replaces the deprecated distplot

In [None]:
#Countplot of top 10 cocoa percentages
plt.figure(figsize=(12,6))
df['Cocoa\nPercent'].value_counts().head(10).plot.bar()
plt.title('Distribution of Top 10 Cocoa Percentages')
plt.xlabel('Cocoa Percent')
plt.ylabel('Count')
#Call tight_layout last so it accounts for the title and axis labels
plt.tight_layout()

* Most bars have cocoa percentages around 70%
* Only a few bars have 100% cocoa; perhaps there isn't as much of a market for them.

Now let's look at some of the top ratings and their company locations to see if location matters.

In [None]:
#countplot of ratings relative to company locations for top 200 ratings
top_countries = df[(df['Rating'] >= 3.75) & (df['Rating'] <= 5.0)]
grouped = top_countries.sort_values('Rating', ascending=False).head(200)

plt.figure(figsize=(19,7))
sns.countplot(x='Company\nLocation', hue='Rating', data=grouped)  #pass the column as x=; newer seaborn versions reject it positionally
plt.legend(title = 'Rating',loc='right', bbox_to_anchor=(1, .9))
plt.tight_layout()

* The USA seems to have the most above-average ratings of all countries; however, this could be because it has a larger number of bars in the survey. We'll look at average ratings in a bit.
* Although France has 5 times fewer bars in the dataset than the US, it has almost as many 4.0 ratings. It seems like if you want good chocolate, France is a safer choice.
* Italy is the only one with a 5.0 rating, despite having a similar number of above-average ratings to other European nations.

In [None]:
#Make dataframes showing USA and France ratings
FR = df[(df['Rating'] >= 3.75) & (df['Rating'] <= 5.0) & (df['Company\nLocation'] =='France')]
USA = df[(df['Rating'] >= 3.75) & (df['Rating'] <= 5.0) & (df['Company\nLocation'] =='U.S.A.')]

In [None]:
print('French Average Rating : '+ str(FR['Rating'].mean()))
print('U.S.A. Average Rating : '+ str(USA['Rating'].mean()))

* Based on the data, among bars rated 3.75 or higher, French-based companies have higher ratings on average than their U.S.A.-based counterparts
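
Note that the means above only cover bars already rated 3.75 or higher. To compare locations across all of their bars, a `groupby` on the unfiltered data is one option. Here is a sketch with made-up numbers (the `toy` frame and its column names are assumptions for illustration):

```python
import pandas as pd

# Toy ratings (made up) for two company locations
toy = pd.DataFrame({
    'location': ['France', 'France', 'France', 'U.S.A.', 'U.S.A.', 'U.S.A.', 'U.S.A.'],
    'rating':   [4.0, 3.75, 3.0, 4.0, 3.75, 2.5, 2.0],
})

# Mean over every bar from each location
overall_mean = toy.groupby('location')['rating'].mean()

# Mean restricted to bars rated 3.75 or higher, as in the cells above
top_mean = toy[toy['rating'] >= 3.75].groupby('location')['rating'].mean()

print(overall_mean)
print(top_mean)
```

In this toy example the filtered means are identical while the overall means differ, which shows how the two comparisons can tell different stories.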

Now let's look at the bean origin and see what insights we can gather from it. Some values in this column are blank placeholders (a non-breaking space rather than a true null), so we'll clean up the data first and then make observations.

In [None]:
#Converts xa0 unicode to null values
def null_converter(x):
    if x == '\xa0':
        return None
    else:
        return x

df['Broad Bean\nOrigin'] = df['Broad Bean\nOrigin'].apply(null_converter)

In [None]:
#Drops null values from dataset
df.dropna(subset=['Broad Bean\nOrigin'],inplace=True)

#Dropped 75 values from dataset
df.info()
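
The cleanup above could also be sketched with pandas string methods, which would catch other whitespace-only placeholders too. This is a minimal example on made-up values (the `origin` series is an assumption for illustration):

```python
import numpy as np
import pandas as pd

# Made-up origin values; '\xa0' is a non-breaking space used as a blank
origin = pd.Series(['Peru', '\xa0', 'Ecuador', ' ', None])

# str.strip() with no arguments removes Unicode whitespace, including '\xa0',
# so blank placeholders collapse to '' and can then be replaced with NaN
cleaned = origin.str.strip().replace('', np.nan)

print(cleaned.isnull().sum())
```

After this, `dropna` works the same way as in the cell above.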

Now that we've cleaned up the broad bean origin column a bit, let's explore the origin country data.

In [None]:
plt.figure(figsize=(14,6))
df['Broad Bean\nOrigin'].value_counts().head(15).plot.bar()
plt.title('Top 15 Bean Origin Countries')
plt.xlabel('Bean Origin')
#Call tight_layout last so it accounts for the title and axis labels
plt.tight_layout()

* The majority of beans originate from South American countries, with Venezuela, Ecuador, and Peru leading the pack

In [None]:
#Keep only bars from the six most common bean origins
top_origin_names = ['Venezuela', 'Ecuador', 'Peru', 'Madagascar', 'Dominican Republic', 'Nicaragua']
top_Origins = df[df['Broad Bean\nOrigin'].isin(top_origin_names)]

In [None]:
plt.figure(figsize=(12,5))
sns.boxplot(x='Broad Bean\nOrigin',y= 'Rating', data=top_Origins)

* Ratings seem to be pretty similar among the origin countries with the most bars in the dataset
* Venezuela seems to have slightly higher ratings than the other countries, and it is the only one of them with a 5.0 rating

This all leads into another question about the data: is there any correlation between cocoa percentage and rating?

In [None]:
sns.jointplot(x = 'Cocoa\nPercent', y = 'Rating', data = df)

In [None]:
#Correlation of Rating and Cocoa Percent, also adding Review Date and REF out of curiosity
df_rating_correlation = df[['Rating','Cocoa\nPercent','REF','Review\nDate']]

#Heatmap of correlation
sns.heatmap(df_rating_correlation.corr(),annot=True)


* Essentially no linear correlation between the ratings and the cocoa percentages!
* As shown before, REF and Review Date are positively correlated
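
The heatmap shows Pearson correlation, which only captures linear relationships. A rank-based (Spearman) check is a common complement. Here is a sketch on made-up values (the `toy` frame is an assumption for illustration, and Spearman is computed here as the Pearson correlation of ranks):

```python
import pandas as pd

# Toy values (made up) standing in for cocoa fraction and rating
toy = pd.DataFrame({
    'cocoa':  [0.60, 0.65, 0.70, 0.70, 0.75, 0.80, 1.00],
    'rating': [3.00, 3.50, 3.75, 3.25, 3.50, 3.00, 1.50],
})

# Pearson measures linear association between the raw values
pearson = toy['cocoa'].corr(toy['rating'])

# Spearman is the Pearson correlation of the ranks, so it is less
# sensitive to extreme points such as a single 100% bar
spearman = toy['cocoa'].rank().corr(toy['rating'].rank())

print(pearson, spearman)
```

If the two numbers disagree a lot on the real data, that would hint at outliers or a non-linear relationship worth a closer look.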

Thanks for taking the time to read through my analysis! Any feedback is always appreciated :) 