#  <font color=red><u>__CHOCOLATE BAR RECIPE TREND ANALYSIS (2006-2020)__</u></font>
<br>

__In this notebook we study chocolate bar ingredient trends, company preferences and ratings. We mostly use NumPy and Pandas to compute the results, and Matplotlib and Seaborn to plot graphs. The dataset used in this project is taken from [kaggle.com](https://www.kaggle.com/soroushghaderi/chocolate-bar-2020?select=chocolate.csv). It contains information about chocolate bars and the companies that make them, such as 'company', 'company_location', 'country_of_bean_origin', 'review_date', chocolate 'rating', 'cocoa_percent', common ingredients and tastes.__

Let's look into our dataset.

To read the CSV file, we use the ```pd.read_csv()``` function and pass it the path to the CSV file we would like to use for this project.<br>
Let's call the result ```chocolate_raw_df```, as it is still the raw, unprocessed dataset, which we will modify further to prepare it for data analysis.

In [None]:
import pandas as pd

In [None]:
chocolate_raw_df = pd.read_csv('../input/chocolate-bar-2020/chocolate.csv')
chocolate_raw_df

Looks like there are 21 columns. Let's see all the columns using ```chocolate_raw_df.columns```

In [None]:
chocolate_raw_df.columns

## <font color=blue><u>__Data Preparation and Cleaning__</u></font>

In this section, we select the relevant data and explore details such as shape, unique values, column information, missing values and their counts, memory usage and a sample of rows, making appropriate changes where needed.

 Let's select a subset of columns with the relevant data for our analysis.

In [None]:
selected_columns = [
    # Company and respective ratings
    'company',
    'company_location',
    'country_of_bean_origin', 
    'review_date',
    'rating',     
    # Ingredients
    'cocoa_percent',
    'counts_of_ingredients',   
    'cocoa_butter',
    'vanilla',
    'lecithin',
    'salt',
    'sugar',
    # Tastes
    'first_taste',
    'second_taste',
    'third_taste',
    'fourth_taste'
]


In [None]:
# Let's check how many columns we have selected
len(selected_columns)

In [None]:
# We will be using copy() function to NOT modify original data frame
# and to actually create a separate one derived from original
chocolate_df = chocolate_raw_df[selected_columns].copy()
chocolate_df

Let's use ```pandas.DataFrame.shape``` here, which returns a tuple representing the dimensionality of the DataFrame.

In [None]:
chocolate_df.shape

Looking at the values in columns such as *cocoa_butter, vanilla, lecithin, salt and sugar*, we can see a similar pattern in each column. Let's have a look at one of these:

In [None]:
chocolate_df.lecithin

In [None]:
# Let's check unique values in this particular column 
chocolate_df.lecithin.unique()

We can deal with these values by adjusting the data type of each column on a case-by-case basis.
To make further analysis easier, we convert the values into boolean ```True``` and ```False```: if _'not'_ is present in the string, the value becomes ```False``` to show the absence of an item, and ```True``` otherwise.
<br>
To do this, our custom function _'change_to_boolean'_ uses:
* ```pandas.Series.apply``` <br>
> __Format:__ ```Series.apply(func, ...)```
> <br>Invoke a function on the values of a Series.
* ```lambda```, which defines an anonymous function. When it is passed to ```Series.apply```, each element of the series is fed into the lambda function. Here it carries the _if-else_ condition mentioned above.
> The result is another ```pd.Series``` with each element run through that ```lambda```.

To check the output later and also to verify with original data, we will use<br>
```pandas.Series.value_counts```<br><br>
__Format:__ ```Series.value_counts(normalize=False, sort=True, ascending=False, bins=None, dropna=True)```<br>
Returns a Series containing counts of unique values. The resulting object will be in descending order so the first element is the most frequently-occurring element. Excludes NA values by default.

In [None]:
def change_to_boolean(col_series):
    return col_series.apply(lambda x: False if 'not' in x else True)
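
One caveat with the lambda above: a missing value (```NaN```) is a float, so ```'not' in x``` raises a ```TypeError``` for it. The following is a minimal sketch of a variant that leaves missing cells untouched, assuming the column otherwise holds strings. The helper name ```change_to_boolean_safe``` and the sample values are made up for illustration and are not part of the dataset.

```python
import pandas as pd

def change_to_boolean_safe(col_series):
    # na_action='ignore' skips missing cells, so they stay missing instead of raising
    return col_series.map(lambda x: 'not' not in x, na_action='ignore')

# Made-up values for illustration only
sample = pd.Series(['have_lecithin', 'have_not_lecithin', None])
converted = change_to_boolean_safe(sample)
print(converted)
```

If a column has no missing values, this gives the same result as ```change_to_boolean```.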

In [None]:
# old values (to verify)
chocolate_df.cocoa_butter.value_counts()

In [None]:
chocolate_df['cocoa_butter'] = change_to_boolean(chocolate_df['cocoa_butter'])
chocolate_df.cocoa_butter.value_counts()

In [None]:
# old values (to verify)
chocolate_df.vanilla.value_counts()

In [None]:
chocolate_df['vanilla'] = change_to_boolean(chocolate_df['vanilla'])
chocolate_df.vanilla.value_counts()

In [None]:
# old values (to verify)
chocolate_df.lecithin.value_counts()

In [None]:
chocolate_df['lecithin'] = change_to_boolean(chocolate_df['lecithin'])
chocolate_df['lecithin'].value_counts()

In [None]:
# old values (to verify)
chocolate_df.salt.value_counts()

In [None]:
chocolate_df['salt'] = change_to_boolean(chocolate_df['salt'])
chocolate_df.salt.value_counts()

In [None]:
# old values (to verify)
chocolate_df.sugar.value_counts()

In [None]:
chocolate_df['sugar'] = change_to_boolean(chocolate_df['sugar'])
chocolate_df.sugar.value_counts()

In [None]:
chocolate_df

Let's now use ```pandas.DataFrame.info``` to print a concise summary of our DataFrame.
> __Format:__```DataFrame.info(verbose=None, buf=None, max_cols=None, memory_usage=None, null_counts=None)```<br>
> This method prints information about a DataFrame including the index dtype and columns, non-null values and memory usage.

In [None]:
chocolate_df.info()

We can also check missing values using ```pandas.DataFrame.isna```
> __Format:__ ```DataFrame.isna()```<br>
> Returns DataFrame: Mask of bool values for each element in DataFrame that indicates whether an element is an NA value.

Summing this mask column by column gives the _total number_ of missing values in each column.

In [None]:
chocolate_df.isna().sum()
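
To see why summing the mask counts missing values, here is a small sketch on a made-up frame (```toy_df``` is hypothetical and not part of the dataset). Each ```True``` in the mask counts as 1 when summed.

```python
import numpy as np
import pandas as pd

# Made-up data: one missing rating, two missing tastes
toy_df = pd.DataFrame({
    'rating': [3.5, np.nan, 4.0],
    'second_taste': ['nutty', None, None],
})

mask = toy_df.isna()             # True where a value is missing
missing_per_column = mask.sum()  # column-wise sum, True counts as 1
print(missing_per_column)
```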

In [None]:
# Let's now see all the columns again
chocolate_df.columns

We can also use ```pandas.DataFrame.describe``` to generate descriptive statistics.
> __Format:__ ```DataFrame.describe(percentiles=None, include=None, exclude=None, datetime_is_numeric=False)```

_Descriptive statistics_ include those that summarize the central tendency, dispersion and shape of a dataset’s distribution, excluding ```NaN``` values. Analyzes both numeric and object series, as well as ```DataFrame``` column sets of mixed data types.

In [None]:
chocolate_df.describe()

Let's see which companies are included in our data.

In [None]:
chocolate_df.company.value_counts()

We can see that the company 'Soma' has a large variety of chocolate bars in our dataset. Let's check in detail:

In [None]:
soma_df = chocolate_df[chocolate_df.company == 'Soma']
soma_df

Well, we've now cleaned up and prepared the dataset all ready for analysis.
<br>
Let's take a look at a sample of rows from the data frame.

In [None]:
chocolate_df.sample(10)

## <font color=blue><u>__Exploratory Analysis and Visualization__</u></font>

In this section, we compute the mean, percentages, sums, etc. We also sort values, explore more kinds of plots, draw a Venn diagram, and learn about correlation to understand the interdependence between two or more column variables. We also look into other useful functions such as size and head.

In [None]:
# center all output images using HTML
from IPython.display import HTML
HTML("""
<style>
.output_png {
    display: table-cell;
    text-align: center;
    vertical-align: middle;
}
</style>
""")

Let's begin our analysis and visualization journey by importing ```matplotlib.pyplot``` and ```seaborn``` first.

In [None]:
import seaborn as sns
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline

sns.set_style('darkgrid')
matplotlib.rcParams['font.size'] = 14
matplotlib.rcParams['figure.figsize'] = (12, 6)
matplotlib.rcParams['figure.facecolor'] = '#00000000'
matplotlib.rcParams["axes.labelsize"] = 14
matplotlib.rcParams["axes.titlesize"] = 18
matplotlib.rcParams["xtick.labelsize"] = 14
matplotlib.rcParams["ytick.labelsize"] = 14

### <a id="function1"><font color=green><u>1. Company And Ingredients</u> </font></a>

Let's look into how common an ingredient is among companies.

In [None]:
# Total companies
chocolate_df.company.nunique()

We create ```ingredients_df``` to view the present data in consideration. Here we will be using functions such as:
* ```pandas.DataFrame.mean```
> __Format:__ ```DataFrame.mean(axis=None, skipna=None, level=None, numeric_only=None, **kwargs)```
> Return the mean of the values for the requested axis as Series or DataFrame (if level specified).

* ```pandas.DataFrame.sort_values```
> __Format:__ ```DataFrame.sort_values(by, axis=0, ascending=True, inplace=False, kind='quicksort', na_position='last', ignore_index=False, key=None)```
> Sort by the values along either axis.

* We will use horizontal bars to visualize our data. Switching from the traditional vertical bar chart to a horizontal one is a way to add some variety to a report. This kind of chart also allows for extra long bar labels.<br>To draw a set of horizontal bars here, we will use ```seaborn.barplot``` <br>
> __Format:__ ```ax = sns.barplot(x=..., y=...)```

Labelling of the axis is achieved using the Matplotlib syntax on the “```plt```” object imported from ```pyplot```. The key functions used here are:<br>

* “```xlabel```” to add an x-axis label
* “```ylabel```” to add a y-axis label
* “```title```” to add a plot title

In [None]:
ingredients_df = chocolate_df[['cocoa_butter',
    'vanilla',
    'lecithin',
    'salt',
    'sugar']].copy()
ingredients_df

In [None]:
# Let's check type of any one column value, which was modified earlier
# using our custom function change_to_boolean
type(ingredients_df.cocoa_butter[0])

In [None]:
# Percentage of chocolate bars (rows) containing each ingredient
ingredients_percentage = ingredients_df.mean().sort_values(ascending=False) * 100
ingredients_percentage
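
The mean of a boolean column is the fraction of ```True``` values, which is why multiplying by 100 gives a percentage. A small sketch on made-up data (```toy_ingredients``` is hypothetical):

```python
import pandas as pd

# Made-up data: sugar present in 3 of 4 rows, salt in 1 of 4
toy_ingredients = pd.DataFrame({
    'sugar': [True, True, True, False],
    'salt': [False, False, True, False],
})

# True counts as 1 and False as 0, so the mean is the share of True values
share = toy_ingredients.mean().sort_values(ascending=False) * 100
print(share)
```

Note that every row of the frame contributes equally, so the result is a share of rows.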

In [None]:
plt.figure(figsize=(12,6)) 
sns.barplot(x=ingredients_percentage,y=ingredients_percentage.index, palette="Paired_r")
plt.title("Common Ingredients Preference")
plt.xlabel('percentage of chocolate bars');

**Summary:**<br>
Sugar is the most common ingredient, followed by cocoa butter, lecithin and vanilla. Salt is the least used.

***
### <a id="function1"><font color=green><u>2. Tastes</u> </font></a>

In the column description of the dataset, we noticed data for a first, second, third and fourth taste. Let's look at all the tastes, the tastes shared between these columns and the most common tastes in each category, and finally draw a Venn diagram to get a better view.

Here we learn which different tastes are present and which are preferred as first, second and third taste. Since a fourth taste is rarely given, let's ignore that column for now.<br>
Some functions we explore in this section are:

* ```numpy.ndarray.ravel```
> __Format:__ ```ndarray.ravel(order='C')```<br>
> Returns a flattened one-dimensional array. We call it on the ```.values``` array of the selected columns.

* ```pandas.unique```
> __Format:__ ```pandas.unique(values)```<br>
> Uniques are returned in order of appearance, though this does NOT sort.<br>
> Significantly faster than numpy.unique. Includes NA values.

* ```pandas.DataFrame.count```
> __Format:__ ```DataFrame.count(axis=0, level=None, numeric_only=False)```<br>
> Counts non-NA cells for each column or row. The values None, NaN, NaT, and optionally numpy.inf (depending on pandas.options.mode.use_inf_as_na) are considered NA.

* ```pandas.DataFrame.size```
> __Format:__ ```property DataFrame.size```<br>
> Return an int representing the number of elements in this object. Returns the number of rows if Series. Otherwise returns the number of rows times number of columns if DataFrame.

* ```pandas.DataFrame.head```
> __Format:__ ```DataFrame.head(n=5)```<br>
> This function returns the first n rows for the object based on position. It is useful for quickly testing if an object has the right type of data in it.<br>
> For negative values of n, this function returns all rows except the last n rows, equivalent to df[:-n].<br>

* Functions provided by ```matplotlib-venn``` for plotting area-proportional two- and three-way _Venn diagrams_ in matplotlib.<br>
> The functions ```venn2_circles``` and ```venn3_circles``` draw just the circles, whereas the functions ```venn2``` and ```venn3``` draw the diagrams as a collection of colored patches, annotated with text labels.<br> To install:<br>
> ```pip install matplotlib-venn```


In [None]:
# various tastes
column_values = chocolate_df[["first_taste", "second_taste", "third_taste"]].values.ravel()
unique_values =  pd.unique(column_values)
# type(unique_values)
unique_values.size
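
To make the flatten-then-deduplicate step concrete, here is a small sketch on a made-up frame (```toy_tastes``` is hypothetical). ```ravel``` walks the 2-D array row by row, and ```pd.unique``` keeps the first appearance of each value.

```python
import pandas as pd

# Made-up tastes: 'nutty' appears in both columns
toy_tastes = pd.DataFrame({
    'first_taste': ['creamy', 'nutty'],
    'second_taste': ['nutty', 'fruity'],
})

# 2-D array -> 1-D array, read row by row
flat = toy_tastes[['first_taste', 'second_taste']].values.ravel()
# Distinct values in order of appearance (not sorted)
distinct = pd.unique(flat)
print(list(flat))
print(list(distinct))
```

In the real data the flattened array also contains ```NaN``` wherever a taste is missing, and ```pd.unique``` counts it as one more value.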

In [None]:
# first_taste frequency across chocolate bars, as a percentage
first_taste = chocolate_df.first_taste.value_counts() * 100 / chocolate_df.first_taste.count()
first_taste.head(10)

In [None]:
# second_taste frequency across chocolate bars, as a percentage
second_taste = chocolate_df.second_taste.value_counts() * 100 / chocolate_df.second_taste.count()
second_taste.head(10)

In [None]:
# third_taste frequency across chocolate bars, as a percentage
third_taste = chocolate_df.third_taste.value_counts() * 100 / chocolate_df.third_taste.count()
third_taste.head(10)
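
The percentages above divide each count by the number of non-missing values. As a sketch on made-up data (```toy_taste``` is hypothetical), the same result can be obtained with the ```normalize=True``` option of ```value_counts```, which also ignores missing values by default:

```python
import pandas as pd

# Made-up data with one missing value
toy_taste = pd.Series(['creamy', 'creamy', 'nutty', None])

# Approach used in this notebook: counts divided by the non-missing total
manual = toy_taste.value_counts() * 100 / toy_taste.count()
# Equivalent shortcut: let value_counts return relative frequencies
built_in = toy_taste.value_counts(normalize=True) * 100
print(manual)
print(built_in)
```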

Since having three tastes is pretty common, we consider ```first_taste```, ```second_taste``` and ```third_taste``` data from our ```chocolate_df``` dataframe. As this is a _three-circle_ case, we will be using ```venn3``` function.

In [None]:
# pip install matplotlib-venn
from matplotlib_venn import venn3

# dropna() keeps missing values (NaN) from being counted as a taste
first_taste = set(chocolate_df['first_taste'].dropna())
second_taste = set(chocolate_df['second_taste'].dropna())
third_taste = set(chocolate_df['third_taste'].dropna())

plt.figure(figsize=(12,6)) 
venn3([first_taste, second_taste, third_taste], ('First Taste', 'Second Taste', 'Third Taste'))
plt.title("Number of Unique and Common Tastes")
plt.show();

Next, we use ```list(set(df1.A) & set(df2.A) & set(df3.A))``` to find total common tastes.

In [None]:
# common tastes
a = list(first_taste & second_taste & third_taste)
len(a)
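
Each region of the Venn diagram corresponds to a set operation. A small sketch with made-up sets (the names and values are for illustration only):

```python
# Made-up taste sets
first = {'creamy', 'nutty', 'sweet'}
second = {'nutty', 'sweet', 'fruity'}
third = {'sweet', 'nutty', 'earthy'}

# Centre of the diagram: tastes present in all three sets
common_to_all = first & second & third
# Outer part of the first circle: tastes found only in the first set
only_first = first - second - third

print(sorted(common_to_all))
print(sorted(only_first))
```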

***
### <a id="function1"><font color=green><u>3. Percentage of  Cocoa and Variation Over Years</u> </font></a>

Let's  use ```seaborn.lineplot``` this time to draw a line plot with possibility of several semantic groupings.<br>
> __Format:__ ```seaborn.lineplot(*, x=None, y=None, hue=None, size=None, style=None, data=None, palette=None, hue_order=None, hue_norm=None, sizes=None, size_order=None, size_norm=None, dashes=True, markers=None, style_order=None, units=None, estimator='mean', ci=95, n_boot=1000, seed=None, sort=True, err_style='band', err_kws=None, legend='auto', ax=None, **kwargs)```<br>
> By default, the plot aggregates over multiple y values at each value of x and shows an estimate of the central tendency and a confidence interval for that estimate. Passing the entire dataset in long-form mode will aggregate over repeated values (each year) to show the mean and 95% confidence interval:

In [None]:
plt.figure(figsize=(12,6)) 
sns.lineplot(x=chocolate_df.review_date, y=chocolate_df.cocoa_percent)
plt.title("Percentage of Cocoa Used Over Years (2006 - 2020)");
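
With the default ```estimator='mean'```, the central line of the plot follows the mean of ```cocoa_percent``` for each year. A small sketch of that aggregation on made-up data (```toy_reviews``` is hypothetical):

```python
import pandas as pd

# Made-up reviews: two bars per year
toy_reviews = pd.DataFrame({
    'review_date': [2019, 2019, 2020, 2020],
    'cocoa_percent': [70.0, 74.0, 68.0, 72.0],
})

# One mean value per year, the quantity the line traces
yearly_mean = toy_reviews.groupby('review_date')['cocoa_percent'].mean()
print(yearly_mean)
```

Running the same ```groupby``` on ```chocolate_df``` gives the exact numbers behind the line.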

**Summary:**<br>
In 2009, chocolate bars with less cocoa were popular, but in general 71-73% cocoa has been popular over the years. So, 71-73% cocoa is a safe bet!

***
### <a id="function1"><font color=green><u>4. Rating and Cocoa Percent</u> </font></a>

Let's see how the amount of _cocoa_ in chocolate bars affects their rating.

In this section we explore how the dependence between two variables can be analyzed. We can use a _joint plot_. A ```jointplot``` augments a bivariate relational or distribution plot with the marginal distributions of the two variables.<br>
In short, we visualize how rating and cocoa amount vary together using the ```jointplot``` function from ```seaborn```.<br>
> __Format:__ ```sns.jointplot(x=..., y=..., data=..., kind='scatter', ...)```<br>
> Setting ```kind="kde"``` in ```jointplot()``` combines two kinds of plot: a bivariate density in the centre and univariate densities on the margins. _KDE_ shows where the points are most densely concentrated.<br>
> A kernel density estimate (KDE) plot is a method for visualizing the distribution of observations in a dataset, analogous to a histogram. Several other figure-level plotting functions in seaborn make use of the ```histplot()``` and ```kdeplot()``` functions.<br>
> x and y can be column names (strings) that are looked up through the ```data``` parameter, or the data itself, as in the call below where we pass the columns directly.

Here we can see ```cocoa_percent``` on the _y axis_ and ```rating``` on the _x axis_. The _shade of color_ represents the density of values in a region of the graph.

In [None]:
plt_s = sns.jointplot(x=chocolate_df.rating, y=chocolate_df.cocoa_percent, kind ='kde');
plt_s.fig.suptitle("Rating and Cocoa Percent")
plt_s.ax_joint.collections[0].set_alpha(0)
plt_s.fig.tight_layout()
plt_s.fig.subplots_adjust(top=0.95)
plt_s.fig.set_figwidth(12)
plt_s.fig.set_figheight(7);


**Summary**<br>

As noticed in the previous section, 71-73% cocoa has been popular over the years. Here, the plot suggests that this amount is also a safe bet for a decent rating. It is certainly NOT the case that the higher the amount of cocoa, the higher the rating, though a lower-than-average amount of cocoa can also be a reasonable risk.

***
### <a id="function1"><font color=green><u>5. Correlation between different columns</u> </font></a>

To see the interdependence between two or more variables, we use the correlation function ```pandas.DataFrame.corr```, which lets us check all the correlations simultaneously.
> __Format:__ ```DataFrame.corr(method='pearson', min_periods=1)```<br>
> Computes pairwise correlation of columns, excluding NA/null values.<br>
> Returns : A DataFrame (Correlation matrix).<br>
> The _Pearson method_ is used by default, but _Pandas_ also supports other methods (```kendall``` and ```spearman```).<br>

> * 0.9 to 1 positive or negative indicates a very strong correlation.<br>
> * 0.7 to 0.9 positive or negative indicates a strong correlation.<br>
> * 0.5 to 0.7 positive or negative indicates a moderate correlation.<br>
> * 0.3 to 0.5 positive or negative indicates a weak correlation.<br>
> * 0 to 0.3 positive or negative indicates a negligible correlation.<br>
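
The scale above can be turned into a small helper. This is only a sketch: the function name ```correlation_strength``` is made up, and since the ranges share their edges (for example 0.7), the sketch assumes a value on an edge belongs to the stronger band.

```python
import pandas as pd

def correlation_strength(r):
    # Only the magnitude matters for strength; the sign gives the direction
    magnitude = abs(r)
    if magnitude >= 0.9:
        return 'very strong'
    if magnitude >= 0.7:
        return 'strong'
    if magnitude >= 0.5:
        return 'moderate'
    if magnitude >= 0.3:
        return 'weak'
    return 'negligible'

# Made-up data: y rises with x, z falls as x rises
toy = pd.DataFrame({'x': [1, 2, 3, 4], 'y': [2, 4, 6, 8], 'z': [4, 3, 2, 1]})
r_xy = toy['x'].corr(toy['y'])  # Pearson correlation by default
r_xz = toy['x'].corr(toy['z'])

print(correlation_strength(r_xy), correlation_strength(r_xz))
```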

To facilitate this visualization of the correlations, it is possible to use the colors. Let's use the ```heatmap``` function in ```seaborn```.
> ```seaborn.heatmap```<br>
> __Format:__ ```seaborn.heatmap(data, *, vmin=None, vmax=None, cmap=None, center=None, robust=False, annot=None, fmt='.2g', annot_kws=None, linewidths=0, linecolor='white', cbar=True, cbar_kws=None, cbar_ax=None, square=False, xticklabels='auto', yticklabels='auto', mask=None, ax=None, **kwargs)```<br>
> Plots rectangular data as a color-encoded matrix.<br>
> With ```annot=True```, each cell is annotated with its numeric value.<br>
> The color gradation is observed in relation to the positive and negative correlations.<br>

In [None]:
rating_and_composition = chocolate_df[['rating',
    'cocoa_percent',
    'counts_of_ingredients',
    'cocoa_butter',
    'vanilla',
    'lecithin',
    'salt',
    'sugar']]
rating_and_composition

In [None]:
plt.figure(figsize=(12,9)) 
chocolate_corr = rating_and_composition.corr()
sns.heatmap(chocolate_corr, xticklabels=chocolate_corr.columns, yticklabels=chocolate_corr.columns, annot=True, cmap='YlOrBr', linewidths=.5)
plt.title("Correlation Between Rating and Composition");

**Summary**<br>
Using the strength scale from the introduction of this section, we can see that _lecithin and cocoa butter_, or _lecithin and vanilla_, have a weak correlation, whereas _vanilla and rating_ and _cocoa percent and cocoa butter/lecithin/vanilla_ have a negligible correlation.

## <font color=blue>The analysis and visualization in the previous section have made us more curious about the whole dataset. Let's look into some common questions that come to mind and try to answer them.</font>

### <a id="function1"><font color=green><u> Q1. How does the presence of cocoa butter and lecithin affect ratings in the last three years (2018-2020)?</u> </font></a>

Both cocoa butter and lecithin lower the viscosity of chocolate and help bind the ingredients. The question is which of them is more popular. Cocoa butter is generally considered the better option, but it is comparatively expensive. The alternatives are to use both in the right quantities, or to use only lecithin to produce the cheapest variety of chocolate. In this section, let's check how these choices affect chocolate bar ratings.

We create ```cocoa_or_lecithin_all``` to view the data under consideration. Then we create ```cocoa_or_lecithin```, which holds only the last three years of data from ```cocoa_or_lecithin_all```. We move '```cocoa_butter```' and '```lecithin```' into the ```index``` and then ```unstack``` them. Because the original row index is kept (```append=True```), each row holds exactly one (cocoa_butter, lecithin) combination, so unstacking produces one rating column per combination.<br>
```Stacking``` a DataFrame means moving (also rotating or pivoting) the innermost column index to become the innermost row index. As you may have guessed, the inverse operation is called _unstacking_, which means moving the innermost row index to become the innermost column index again.
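
The reshaping described above can be tried on a tiny made-up frame first. This is a sketch only; ```toy``` and the other names are hypothetical.

```python
import pandas as pd

# Made-up data: three bars with different ingredient combinations
toy = pd.DataFrame({
    'rating': [3.0, 3.5, 4.0],
    'cocoa_butter': [True, True, False],
    'lecithin': [False, True, True],
})

# append=True keeps the row number as the first index level
indexed = toy.set_index(['cocoa_butter', 'lecithin'], append=True)
# Move both ingredient levels from the row index to the columns
wide = indexed.unstack(['cocoa_butter', 'lecithin'])
# Keep only the 'rating' block: one column per (cocoa_butter, lecithin) pair
ratings_wide = wide.xs('rating', axis=1)
print(ratings_wide)
```

Each row of ```ratings_wide``` has a value in exactly one column and ```NaN``` elsewhere, which is why every combination is drawn as its own line.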

Here we will be using visualization function:
```pandas.DataFrame.plot```<br>
> __Format:__ ```DataFrame.plot(x=None, y=None, kind='line', ax=None, subplots=False, sharex=None, sharey=False, layout=None, figsize=None, use_index=True, title=None, grid=None, legend=True, style=None, logx=False, logy=False, loglog=False, xticks=None, yticks=None, xlim=None, ylim=None, rot=None, fontsize=None, colormap=None, table=False, yerr=None, xerr=None, secondary_y=False, sort_columns=False, **kwds)```<br>
> Make plots of a Series or a DataFrame.



In [None]:
cocoa_or_lecithin_all = chocolate_df[['review_date','rating','cocoa_butter','lecithin']].copy()
cocoa_or_lecithin = cocoa_or_lecithin_all[(cocoa_or_lecithin_all.review_date >= 2018) & (cocoa_or_lecithin_all.review_date <= 2020)].reset_index(drop=True)
# cocoa_or_lecithin = cocoa_or_lecithin.set_index('review_date')
cocoa_or_lecithin

In [None]:
cocoa_or_lecithin.set_index(['cocoa_butter', 'lecithin'], append=True, inplace=True)

In [None]:
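ratings_by_combo = cocoa_or_lecithin.unstack(['cocoa_butter', 'lecithin']).xs('rating', axis=1)
# Build the legend from the actual (cocoa_butter, lecithin) column order instead of assuming it
combo_names = {(True, False): "Only Cocoa butter", (True, True): "Both",
               (False, False): "None", (False, True): "Only Lecithin"}
cocoa_or_lecithin_df = ratings_by_combo.plot(figsize=(12,7), colormap='plasma')
cocoa_or_lecithin_df.legend([combo_names[combo] for combo in ratings_by_combo.columns], prop={'size':14})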
plt.title("Rating vs Cocoa butter, Lecithin in Chocolates (2018-2020)", fontsize=18)
plt.ylabel("Rating", fontsize=14)
plt.yticks(fontsize=12)
cocoa_or_lecithin_df.set_facecolor("grey");

# cocoa_butter, lecithin
# True, False  (Only Cocoa butter)
# True, True   (Both)
# False, False (None)
# False, True  (Only Lecithin)

**Summary:**<br>
As we can see, using both cocoa butter and lecithin does equally well. But adding only cocoa butter gives a better chance of a good score.

***
### <a id="function1"><font color=green><u> Q2. How much cocoa is actually preferred by top companies?</u> </font></a>

We use ```pandas.DataFrame.max``` here:
> __Format__ ```DataFrame.max(axis=None, skipna=None, level=None, numeric_only=None, **kwargs)```<br>
> Returns the maximum of the values for the requested axis.

In [None]:
top_rated_df = chocolate_df[chocolate_df.rating == chocolate_df.rating.max()]
top_rated_df

Let's check percentage of cocoa each of these top companies have.
To plot a histogram, we use ```matplotlib.pyplot.hist``` here.
> __Format:__ ```matplotlib.pyplot.hist(x, bins=None, range=None, density=False, weights=None, cumulative=False, bottom=None, histtype='bar', align='mid', orientation='vertical', rwidth=None, log=False, color=None, label=None, stacked=False, *, data=None, **kwargs)```

In [None]:
plt.figure(figsize=(12,6)) 
plt.title("Percentage of Cocoa in Chocolates")
plt.xlabel('Percentage of cocoa')
plt.ylabel('Number of companies')
plt.hist(top_rated_df.cocoa_percent, bins=[30, 50, 60, 70, 85, 99], color='maroon');
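
When ```bins``` is a list, its values are the bin edges. Every bin includes its left edge and excludes its right edge, except the last bin, which includes both. The counts behind the bars can be checked with ```numpy.histogram```, which follows the same rule. A small sketch with made-up cocoa values:

```python
import numpy as np

# Made-up cocoa percentages
cocoa_values = [55, 70, 70, 72, 75, 88]
bin_edges = [30, 50, 60, 70, 85, 99]

# counts[i] is the number of values in [bin_edges[i], bin_edges[i + 1])
counts, edges = np.histogram(cocoa_values, bins=bin_edges)
print(counts)
```

Note that a value of exactly 70 falls into the 70-85 bin, not the 60-70 bin.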

**Summary:**<br>

Chocolate is broadly classified by the amount of cocoa it contains, and chocolate with over 70% cocoa is generally considered dark chocolate. This suggests that many top rated companies prefer making dark chocolate over other kinds of chocolate.

***
### <a id="function1"><font color=green><u> Q3. From which countries do top companies import cocoa beans?</u> </font></a>

Note: If the _x-axis_ labels are too long for comfortable display, there are two options:<br>
rotate the labels to make a bit more space, or rotate the entire chart to end up with a horizontal bar chart.<br>
Here we use the ```xticks``` function from ```Matplotlib``` with the ```rotation``` argument.
> The Matplotlib “```xticks```” function can rotate the tick labels on the x-axis, allowing for longer labels when needed.

In [None]:
top_bean_countries= top_rated_df.country_of_bean_origin.value_counts()
top_bean_countries

In [None]:
plt.figure(figsize=(12,6)) 
plt.xticks(rotation=75)
plt.title('Countries of Bean Origin of Chocolates(In Top Rated)')
sns.barplot(x=top_bean_countries.index, y=top_bean_countries, palette="viridis")
plt.xlabel("Bean Origin(Country)")
plt.ylabel("");

**Summary**<br>
It can be seen that most of these countries are _'developing countries'_.

***
### <a id="function1"><font color=green><u> Q4. What was the recipe of the top rated chocolate in 2019?</u> </font></a>

In [None]:
top_rated_recent = top_rated_df[(top_rated_df.review_date >= 2016) & (top_rated_df.review_date <= 2020)]
top_rated_recent

We have been displaying plain dataframes so far, and looking at the same design every time gets boring. For this special display of the _**recipe**_, let's add a background color. We will use:<br>
```df.style.set_properties```
> By using this, we can use inbuilt functionality to manipulate data frame styling from font color to background color.<br>
> ```DataFrame.style``` property, returns styler object having a number of useful methods for formatting and visualizing the data frames.

In [None]:
recipe_df = top_rated_recent[(top_rated_recent.review_date == 2019)]
recipe_df = recipe_df[['company',
    'company_location',               
    'cocoa_percent',    
    'cocoa_butter',
    'vanilla',
    'lecithin',
    'salt',
    'sugar',
    'first_taste',
    'second_taste',
    'third_taste'                        
    ]]
recipe_df = recipe_df.set_index('company')
recipe_df.style.set_properties(**{'background-color': 'brown', 
                           'color': 'yellow'})

**Summary**<br>
The top recipes do indicate dark chocolate. One recipe uses only cocoa butter and sugar, without lecithin, and has creamy, fruity and nutty tastes. Another uses cocoa butter, is sugar-free and has fig as its only taste.<br>

***
### <a id="function1"><font color=green><u> Q5. Which are the major chocolate regions whose companies generally make it to the Top 50?</u> </font></a>

In short: countries with the most companies in the Top 50.<br>
We take the average rating of each company here.<br>
**NOTE:** If you need to keep working with a regular dataframe after aggregation, use ```as_index=False``` so that the grouping columns stay as columns.
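
To see what ```as_index=False``` changes, here is a small sketch on made-up data (```toy``` and its values are hypothetical):

```python
import pandas as pd

# Made-up reviews: company A has two bars, company B has one
toy = pd.DataFrame({
    'company': ['A', 'A', 'B'],
    'company_location': ['U.S.A', 'U.S.A', 'Japan'],
    'rating': [3.0, 4.0, 3.5],
})

# Default: the grouping columns become the (multi-level) row index
indexed = toy.groupby(['company', 'company_location'])[['rating']].mean()
# as_index=False: the grouping columns stay ordinary columns
flat = toy.groupby(['company', 'company_location'], as_index=False)[['rating']].mean()

print(indexed.columns.tolist())
print(flat.columns.tolist())
```

With the flat version, a column such as ```company_location``` can be used directly in later steps like ```value_counts```.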

In [None]:
chocolate_df

In [None]:
average_rating_df = chocolate_df.groupby(['company','company_location'], as_index=False)[['rating']].mean()
average_rating_df

In [None]:
top_fifty_df = average_rating_df.sort_values('rating', ascending=False).head(50)
top_fifty_df

In [None]:
top_countries= top_fifty_df.company_location.value_counts()
top_countries

In [None]:
plt.figure(figsize=(12,6)) 
plt.xticks(rotation=75)
plt.title('Countries in Top 50')
sns.barplot(x=top_countries.index, y=top_countries, palette="hls")
plt.xlabel("Country")
plt.ylabel("");

**Summary:**<br>
Most companies in the top 50 are from the U.S.A, followed by Japan and Australia. These are the countries whose companies get the best ratings on average. Looking at the list, most of them are developed countries.

## <font color=blue><u>__Inferences and Conclusion__</u></font>

* 71-73% cocoa is a safe bet as dark chocolates are gaining popularity over the years!
* There are at least 155 common tastes that appear as first, second and third taste alike.
* Sugar is the most common ingredient, followed by cocoa butter, lecithin and vanilla. Salt is the least used.
* Using only cocoa butter, without lecithin, gives a better chance of a good score.
* Beans mostly originate in developing countries and are exported to developed countries.
* The top taste seems to be creamy, which is popular as a first taste, followed by honey as a second taste and nutty and cocoa-like as a third taste.

## <font color=blue><u>__References and Future Work__</u></font>

* **_Reference Links:_**<br>
> * [dataset](https://www.kaggle.com/soroushghaderi/chocolate-bar-2020)
> * [seaborn](http://seaborn.pydata.org/)
> * [visualization](https://pandas.pydata.org/pandas-docs/version/0.9.1/visualization.html)
> * [pandas.DataFrame.xs](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.xs.html)
> * [pandas-docs](https://pandas.pydata.org/pandas-docs/stable/user_guide/style.html)
> * [center-a-matplotlib-figure](https://moonbooks.org/Articles/How-to-center-a-matplotlib-figure-in-a-Jupyter-notebook-/)
> * [visualization-with-pandas-plot](https://kanoki.org/2019/09/16/dataframe-visualization-with-pandas-plot/)
> * [color_palettes](https://seaborn.pydata.org/tutorial/color_palettes.html)
> * [colormaps](https://matplotlib.org/3.1.1/tutorials/colors/colormaps.html)
> * [visualizing_set_diagrams](https://monstott.github.io/visualizing_set_diagrams_with_python)
> * [if-condition-in-pandas](https://datatofish.com/if-condition-in-pandas-dataframe/)
> * [python-lambda-functions](https://mode.com/python-tutorial/pandas-groupby-and-python-lambda-functions/)
> * [seaborn.barplot](https://seaborn.pydata.org/generated/seaborn.barplot.html)
> * [find-the-unique-values-in-multiple-columns](https://www.kite.com/python/answers/how-to-find-the-unique-values-in-multiple-columns-of-a-pandas-dataframe-in-python)
> * [medium.com/dunder-data](https://medium.com/dunder-data/selecting-subsets-of-data-in-pandas-39e811c81a0c)
> * [matplotlib-venn](https://pypi.org/project/matplotlib-venn/)
> * [seaborn-distribution-plots](https://www.geeksforgeeks.org/seaborn-distribution-plots/)
> * [area-plot](https://pythontic.com/pandas/dataframe-plotting/area-plot)
> * [correlation](https://medium.com/brdata/correlation-straight-to-the-point-e692ab601f4c)
> * [seaborn.heatmap](https://seaborn.pydata.org/generated/seaborn.heatmap.html)
> * [the-art-of-subplots](https://towardsdatascience.com/master-the-art-of-subplots-in-python-45f7884f3d2e)
> * [set-the-spacing-between-subplots](https://www.kite.com/python/answers/how-to-set-the-spacing-between-subplots-in-matplotlib-in-python)
> * [pandas-dataframe-background-color](https://www.geeksforgeeks.org/set-pandas-dataframe-background-color-and-font-color-in-python/)
> * [stack-and-unstack-explained](https://nikgrozev.com/2015/07/01/reshaping-in-pandas-pivot-pivot-table-stack-and-unstack-explained-with-pictures/)
> * [stackoverflow.com](https://stackoverflow.com/questions/19060144/more-efficient-matplotlib-stacked-bar-chart-how-to-calculate-bottom-values)
> * [www.shanelynn.ie/bar-plots-in-python-using-pandas-dataframes](https://www.shanelynn.ie/bar-plots-in-python-using-pandas-dataframes/#:~:text=Stacked%20bar%20plots,-In%20the%20stacked&text=Pandas%20makes%20this%20easy%20with,each%20x%2Daxis%20tick%20mark.)
> * [scentellegher.github.io/programming](https://scentellegher.github.io/programming/2017/07/15/pandas-groupby-multiple-columns-plot.html)

* **_Future Work:_**<br>
This dataset can be combined with market data for the respective companies to learn more about sales and market share.