## **Dataset overview**
Chocolate is one of the most popular candies in the world. Each year, residents of the United States collectively eat more than 2.8 billion pounds. However, not all chocolate bars are created equal! This dataset contains expert ratings of over 1,700 individual chocolate bars, along with information on their regional origin, percentage of cocoa, the variety of chocolate bean used and where the beans were grown.

In this notebook, I have explored the chocolate bar rating dataset, and tried to draw conclusions from it. If you like my work, please upvote! :D

---

### Importing the libraries and dataset

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns
from statistics import mean
from sklearn.metrics import classification_report, accuracy_score
from pandas.plotting import scatter_matrix

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
        

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
#importing the dataset
df =pd.read_csv('../input/chocolate-bar-ratings/flavors_of_cacao.csv')
df.head()

In [None]:
#having a look at some of the basic info about the data
def info(df):
    
    # Shape of the dataframe
    print("Number of Instances:",df.shape[0])
    print("Number of Features:",df.shape[1])
    
    # Summary Stats
    print("\nSummary Stats:")
    print(df.describe())
    
    # Missing Value Inspection
    print("\nMissing Values:")
    print(df.isna().sum())

info(df)


## Cleaning the data

In [None]:
#changing the feature names and making it more readable

cols = list(df.columns)

# replacing newline characters and spaces in the feature names
def rec_features(feature_names):
    rec_feat = []
    for f in feature_names:
        rec_feat.append(((f.casefold()).replace("\n","_")).replace(" ","_"))
    return rec_feat

print("Feature Names before Cleaning:")
print(cols)
print("\nFeature Names after Cleaning:")
print(rec_features(cols))

# handling the feature name company
new_feature_names = rec_features(cols)
new_feature_names[0] = "company"

# Re-assigning feature names
df=df.rename(columns=dict(zip(df.columns,new_feature_names)))
df.dtypes

## Dealing with missing values
In the code cells above, the output of `info(df)` showed that only the features bean_type and broad_bean_origin have missing values, 1 each.

In [None]:
# having a look into the missing values
df[['bean_type', 'broad_bean_origin']].head()

But the first few rows show that the bean_type feature clearly has more missing values than that.

In [None]:
#Having a look at all the values 
print(df['bean_type'].value_counts().head())

In [None]:
#Having a look at what the missing spaces are encoded as
list(df['bean_type'][0:5])

So we have **887** instances in **bean_type** where the value is a non-breaking space, encoded as **\xa0**. These are not counted as missing by `isna()`.
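The count above can also be checked directly instead of being read off `value_counts()`. A minimal sketch, using a small made-up Series as a stand-in for `df['bean_type']` (the values are illustrative, not taken from the dataset):

```python
import pandas as pd

# Toy stand-in for df['bean_type']; the values are made up for illustration
bean_type = pd.Series(["Criollo", "\xa0", "Trinitario", "\xa0", None])

# Count entries that consist only of a non-breaking space
nbsp_count = int((bean_type == "\xa0").sum())

# Count genuine missing values separately; isna() does not flag "\xa0"
nan_count = int(bean_type.isna().sum())

print("non-breaking spaces:", nbsp_count)
print("missing values:", nan_count)
```

On the real data, the same comparison would be applied to `df['bean_type']`.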

In [None]:
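# Replacing non-breaking spaces with "None" to make the dataset cleaner

def space(x):
    # Compare by value (==); `is` tests object identity and is unreliable for strings
    if x == "\xa0":
        return "None"
    # Keep every other value unchanged instead of implicitly returning None
    return x

df['bean_type'] = df['bean_type'].apply(space)
df.head()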

In [None]:
# converting cocoa % to numerical values
df['cocoa_percent']=df['cocoa_percent'].str.replace('%','').astype(float)/100
df.head()

# Exploring the data
---

## **Cocoa percentage** in chocolate over the years


In [None]:
dcoco = df.groupby('review_date').aggregate({'cocoa_percent':'mean'})
dcoco = dcoco.reset_index()

# Plotting
sns.set()
plt.figure(figsize=(15, 4))
ax = sns.lineplot(x='review_date', y='cocoa_percent', data=dcoco)
ax.set(xticks=dcoco.review_date.values)
plt.xlabel("\nDate of Review")
plt.ylabel("Average Cocoa Percentage")
plt.title("Cocoa Percentage patterns over the years \n")
plt.show()

### Some Observations from the above graph:
- The highest average cocoa percentage was in 2008, at approximately 73%.
- The lowest average cocoa percentage followed in the very next year, 2009, at approximately 70%.
- The average cocoa percentage rose from 2009 to 2013.
- From 2014 onwards, the average cocoa percentage declined steadily.

## **Rating** of chocolate bars over the years

In [None]:
drate = df.groupby('review_date').aggregate({'rating':'mean'})
drate = drate.reset_index()

# Plotting
sns.set()
plt.figure(figsize=(15, 4))
ax = sns.lineplot(x='review_date', y='rating', data=drate)
ax.set(xticks=drate.review_date.values)
plt.xlabel("\nDate of Review")
plt.ylabel("Average Rating")
plt.title("Average Rating over the years \n")
plt.show()


### Some observations
- The lowest average rating, around 3, came in 2008.
- From then until 2011, the average rating increased, reaching 3.25 in 2011.
- The average rating peaks in 2017, at around 3.31.


Interestingly, in the year **2008**, the **average cocoa percentage** in chocolate was at its **highest** and the **average rating** happened to be at its **lowest**.

The following year in **2009**, the chocolate bars saw a **steep decline in cocoa percentage**, with an **increase in average rating**. This might indicate that *chocolate bar producers decreased their cocoa content to make better chocolates*.
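One way to probe the observation above is to compute the correlation between cocoa percentage and rating. The sketch below uses a small made-up DataFrame with the same column names as the cleaned `df`, so the numbers it prints say nothing about the real dataset:

```python
import pandas as pd

# Toy stand-in for df; the numbers are made up for illustration
toy = pd.DataFrame({
    "review_date":   [2008, 2008, 2009, 2009, 2010, 2010],
    "cocoa_percent": [0.75, 0.71, 0.70, 0.70, 0.72, 0.70],
    "rating":        [3.00, 3.00, 3.25, 3.00, 3.25, 3.50],
})

# Yearly means of both columns, as used in the two line plots
yearly = toy.groupby("review_date")[["cocoa_percent", "rating"]].mean()

# Pearson correlation between the yearly means (ranges from -1 to 1)
yearly_corr = yearly["cocoa_percent"].corr(yearly["rating"])

# The same correlation computed on individual bars
bar_corr = toy["cocoa_percent"].corr(toy["rating"])

print(yearly)
print("yearly correlation:", round(yearly_corr, 3))
print("per-bar correlation:", round(bar_corr, 3))
```

A negative value would be consistent with the pattern seen in 2008 and 2009, but a correlation over a handful of yearly averages is weak evidence and does not show that lowering cocoa content causes better ratings.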

## Chocolate companies

In [None]:
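# Top 5 companies in terms of chocolate bars
# value_counts() already sorts in descending order; the column names are set
# explicitly because the defaults after reset_index() differ across pandas versions
d = df['company'].value_counts().head(5)
d = d.rename_axis('company').reset_index(name='n_bars') # dataframe with top 5 companies

# Plotting
sns.set()
plt.figure(figsize=(10,4))
sns.barplot(x='company', y='n_bars', data=d)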
plt.xlabel("\nChocolate Company")
plt.ylabel("Number of Bars")
plt.title("Top 5 Companies in terms of Chocolate Bars\n")
plt.show()

**Soma** has the largest number of chocolate bars in this dataset.

In [None]:
# Top 5 companies in terms of average ratings
d2 = df.groupby('company').aggregate({'rating':'mean'})
d2 = d2.sort_values('rating', ascending=False).head(5)
d2 = d2.reset_index()

# Plotting
sns.set()
plt.figure(figsize=(20, 6))
sns.barplot(x='company', y='rating', data=d2)
plt.xlabel("\nChocolate Company")
plt.ylabel("Average Rating")
plt.title("Top 5 Companies in terms of Average Ratings \n")
plt.show()

- Tobago Estate (Pralus) has the highest average rating, at 4.0
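A top average can rest on very few bars, so it may help to look at the number of rated bars next to the mean. The sketch below uses made-up companies and ratings, and `MIN_BARS` is a hypothetical threshold chosen only for illustration:

```python
import pandas as pd

# Toy stand-in for df; company names and ratings are made up
toy = pd.DataFrame({
    "company": ["A", "B", "B", "B", "C", "C"],
    "rating":  [4.0, 3.5, 3.75, 3.5, 3.0, 3.25],
})

# Mean rating and number of rated bars per company
stats = toy.groupby("company")["rating"].agg(["mean", "count"])

# Hypothetical threshold: ignore companies with fewer than MIN_BARS rated bars
MIN_BARS = 2
ranked = stats[stats["count"] >= MIN_BARS].sort_values("mean", ascending=False)

print(stats)
print(ranked)
```

Here company A has the highest mean but only one bar, so it drops out of the filtered ranking.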

In [None]:
# Average rating over the years wrt companies
# Top 5 companies in terms of chocolate bars in this dataset
# (taken from the value_counts() index, which is sorted in descending order)
top5_list = df['company'].value_counts().head(5).index.tolist()

# Rows of the dataframe for each of the top 5 companies
top5_dict = {element: df[df['company']==element] for element in top5_list}

# Rating over the years
d7 = df.groupby(['review_date', 'company']).aggregate({'rating':'mean'})
d7 = d7.reset_index()
d7 = d7[d7['company'].isin(top5_list)]


# Plotting
sns.set()
plt.figure(figsize=(15, 4))
ax = sns.lineplot(x='review_date', y='rating', hue="company", data=d7, palette="husl")
ax.set(xticks=d7.review_date.values)
plt.xlabel("\nDate of Review")
plt.ylabel("Average Rating")
plt.title("Average Rating over the years (Top 5 Producer Companies)\n")
plt.show()

- Pralus and Bonnat were the earliest of these top 5 companies to be reviewed, in 2006, while A. Morin was the latest, in 2012
- Bonnat and Pralus both started with an average rating of around 3.40 in 2006, but in the very next year, 2007, while Pralus hit its highest ever rating of 4.00, Bonnat slumped to its lowest of 2.50. As of 2016, Bonnat stands 0.25 rating points clear of Pralus on the yearly average
- The worst rating among these top 5 came in 2009, when Pralus scored only 2.00 on average. This was the result of Pralus's steady decline from 4.00 in 2007 to 2.00 in 2009.
- Coincidentally, the highest rating came just a year earlier, in 2008, when Bonnat hit 4.00 (a feat Pralus had achieved in 2007)
- From 2011 to 2015, Pralus's average ratings were consistent
- A. Morin was reviewed only in 2012, 2013, 2014, 2015 and 2016. As of 2016, it has the highest average rating, at 3.75
- Fresco has not been reviewed since 2014, and its last review gave it an average rating of around 3.30
- Soma was first reviewed in 2009, when it got around 3.42. In its latest review, in 2016, it has a 3.61
- Soma's lowest rating came in 2009 (3.42), and this is still higher than the lowest ratings the other companies have received over all years

## Analysing the largest chocolate bar producer - Soma

Locations where Soma gets its beans from

In [None]:
soma = df[df['company']=='Soma']
# Column names are set explicitly because the defaults after reset_index()
# differ across pandas versions
d3 = soma['broad_bean_origin'].value_counts().head(5)
d3 = d3.rename_axis('broad_bean_origin').reset_index(name='n_bars')
# Plotting
sns.set()
plt.figure(figsize=(10, 6))
sns.barplot(x='broad_bean_origin', y='n_bars', data=d3)
plt.xlabel("\nBroad Bean Origin")
plt.ylabel("Number of Chocolate Bars")
plt.title("Where does Soma get its beans from? \n")
plt.show()

- Venezuela is the largest provider of Soma's beans.

### Visualizing Soma's chocolate bar rating

In [None]:
#Soma's performance over the years
d4 = soma.groupby('review_date').aggregate({'rating':'mean'})
d4 = d4.reset_index()

# Plotting
plt.figure(figsize=(10, 6))
sns.lineplot(x='review_date', y='rating', data=d4)
plt.xlabel("\nDate of Review")
plt.ylabel("Average Rating")
plt.title("Soma's Average Rating over the years\n")
plt.show()

- The worst average rating Soma ever got came in the year 2009 at 3.42, when it was first reviewed
- The highest average rating achieved came in 2010 at 3.75 
- Between 2012 and 2014, Soma's average rating slumped; it recovered to 3.75 in 2015 and then slipped to 3.61 in 2016

# Categorizing chocolate bars based on their rating

In [None]:
unsatisfactory = df[df['rating'] < 3.0]
satisfactory = df[(df['rating'] >= 3.0) & (df.rating < 4)]
pre_elite = df[df['rating'] >= 4.0]
label_names=['Unsatisfactory','Above Satisfactory','Premium']
sizes = [unsatisfactory.shape[0],satisfactory.shape[0],pre_elite.shape[0]]

# Making the donut plot
explode = (0.05,0.05,0.05)
my_circle=plt.Circle((0,0),0.7,color='white')
plt.figure(figsize=(7,7))
plt.pie(sizes,labels=label_names,explode=explode,autopct='%1.1f%%',pctdistance=0.85,startangle=90,shadow=True)
fig=plt.gcf()
fig.gca().add_artist(my_circle)
plt.axis('equal')
plt.tight_layout()
plt.show()

- Premium chocolate bars are rare: only 5.6% of the chocolate bars in the dataset
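The same three categories can also be derived with `pd.cut`, which gives the shares without building three separate dataframes. This is a sketch on made-up ratings, with bin edges chosen to match the boundaries used above:

```python
import pandas as pd

# Toy ratings, made up for illustration
ratings = pd.Series([2.5, 2.75, 3.0, 3.5, 3.75, 4.0, 3.25, 5.0])

# Same boundaries as above: < 3.0, 3.0 to < 4.0, >= 4.0
# right=False makes each bin include its left edge and exclude its right edge
categories = pd.cut(
    ratings,
    bins=[0, 3.0, 4.0, float("inf")],
    labels=["Unsatisfactory", "Above Satisfactory", "Premium"],
    right=False,
)

# Fraction of bars in each category, in label order
shares = categories.value_counts(normalize=True).sort_index()
print(shares)
```

On the real data, `ratings` would be replaced by `df['rating']`.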

## Counts of each rating of chocolate bars

In [None]:
# The counts of each rating
r=list(df['rating'].value_counts())
rating=df['rating'].value_counts().index.tolist()
rat=dict(zip(rating,r))
for key,val in rat.items():
    print ('Rating:',key,'Reviews:',val)
plt.figure(figsize=(10,5))
sns.countplot(x='rating',data=df)
plt.xlabel('Rating of chocolate bar',size=12,color='blue')
plt.ylabel('Number of Chocolate bars',size=12,color='blue')
plt.show()


- Most bars have been rated at 3.5.

## Distribution of chocolate bars according to their cocoa percentage

In [None]:
#Top 10 cocoa % is taken
plt.figure(figsize=(10,5))
df['cocoa_percent'].value_counts().head(10).sort_index().plot.bar()
plt.xlabel('Percentage of Cocoa',size=12)
plt.ylabel('Number of Chocolate bars',size=12)
plt.show()

- The most common cocoa percentage is 70%, followed by 75% and 72%.

## Where the best cocoa beans are grown (based on rating)

In [None]:
countries=df['broad_bean_origin'].value_counts().index.tolist()[:5]
satisfactory={}
for j in countries:
    b=df[df['broad_bean_origin']==j]
    br=b[b['rating']>=3]
    # len(br) counts the matching bars and records 0 when there are none
    satisfactory[j]=len(br)

# Code to visualize the countries that give best cocoa beans
li=satisfactory.keys()
plt.figure(figsize=(10,5))
plt.bar(range(len(satisfactory)), satisfactory.values(), align='center',color=['#a22a2a','#511515','#e59a9a','#d04949','#a22a2a'])
plt.xticks(range(len(satisfactory)), list(li))
plt.xlabel('\nCountry')
plt.ylabel('Number of chocolate bars')
plt.title("Top 5 Broad Origins of the Chocolate Beans with a Rating of 3.0 or Above\n")
plt.show()

print(satisfactory)

- Venezuela has the largest number of chocolate bars rated 3.0 or above

## Analysing the top chocolate bar producing countries (in terms of quantity)

In [None]:
print ('Top Chocolate Producing Countries in the World\n')
country=list(df['company_location'].value_counts().head(10).index)
choco_bars=list(df['company_location'].value_counts().head(10))
prod_ctry=dict(zip(country,choco_bars))
print(df['company_location'].value_counts().head())

plt.figure(figsize=(10,5))
plt.hlines(y=country,xmin=0,xmax=choco_bars,color='skyblue')
plt.plot(choco_bars,country,"o")
# Countries are on the y-axis of this horizontal lollipop plot
plt.xlabel('Number of chocolate bars')
plt.ylabel('Country')
plt.title("Top Chocolate Producing Countries in the World")
plt.show()

- The USA is the top chocolate producing country in this dataset, by company location

## Visualizing countries that produce best chocolates

In [None]:
countries=country
best_choc={}
for j in countries:
    b=df[df['company_location']==j]
    br=b[b['rating']>=4]
    # len(br) counts the matching bars and records 0 when there are none
    best_choc[j]=len(br)


# Convert the dict views to lists before passing them to matplotlib
li=list(best_choc.keys())
counts=list(best_choc.values())
# The lollipop plot (countries on the y-axis, counts on the x-axis)
plt.hlines(y=li,xmin=0,xmax=counts,color='darkgreen')
plt.plot(counts,li,"o")
plt.xlabel('Number of chocolate bars')
plt.ylabel('Country')
plt.title("Top Chocolate Producing Countries in the World (Ratings of 4.0 and above)")
plt.show()
print(best_choc)


- The USA produces the highest number of chocolate bars rated 4.0 and above, followed by France
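A raw count favours countries with many bars in the dataset, so the share of top-rated bars per country can be a useful complement. The sketch below uses made-up locations and ratings, and the column `is_top` is introduced only for this illustration:

```python
import pandas as pd

# Toy stand-in for df; locations and ratings are made up
toy = pd.DataFrame({
    "company_location": ["X", "X", "X", "X", "X", "Y", "Y"],
    "rating":           [4.0, 3.0, 3.0, 4.0, 3.5, 4.0, 3.5],
})

# Flag bars rated 4.0 or higher
toy["is_top"] = toy["rating"] >= 4.0

# Per location: number of top bars, total bars, and share of top bars
summary = toy.groupby("company_location")["is_top"].agg(
    top_bars="sum", total_bars="count", top_share="mean"
)
print(summary)
```

In this toy example X has more top-rated bars than Y, while Y has the higher share, which is the kind of difference a count alone would hide.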