# Bakery

We are going to analyse the Bakery data set from Kaggle.

## Index

- [1. Import libraries and download data](#section1)
- [2. Dataframe Analysis](#section2)
- [3. Conclusion](#section3)


## 1. Import libraries and download data <a id='section1'></a>

In [None]:
import numpy as np
import pandas as pd
#visualisation 

import matplotlib.pyplot as plt
import time
from datetime import date
from datetime import datetime
import datetime as dt


from wordcloud import WordCloud
from ipywidgets import interact
from collections import defaultdict

In [None]:
# Load the CSV file into a pandas DataFrame
path = "../input/"

bakery = pd.read_csv(path + "BreadBasket_DMS.csv", sep=",")

## 2. Dataframe Analysis <a id='section2'></a>

We are going to __analyse the features__ from the _bakery's_ dataframe.

### 2.1 Shape

In [None]:
print("Bakery shape:", bakery.shape)
print("Head:")
print(bakery.head(),"\n")

The data contains __four features__: _Date_, _Time_, _Transaction_ and _Item_. Now, we are going to analyse each of these variables in more detail.

### 2.2 Item

The Item feature is a categorical variable that shows the products sold in the bakery.

In [None]:
print('The number of products that are sold:', bakery['Item'].nunique())

In the cell below we create a __WordCloud graph__ with __all__ the items and a __bar plot__ with the __frequency of the 20 most popular__ products sold.

In [None]:
fig, axes=plt.subplots(nrows=2, figsize=(10,8))

#WordCloud Graph
items_dict=bakery.groupby('Item')['Item'].count().sort_values(ascending=False).to_dict()
wordcloud = WordCloud()
wordcloud.generate_from_frequencies(frequencies=items_dict)
axes[0].imshow(wordcloud, interpolation="bilinear")
axes[0].axis("off")

# Frequency bar plot of the 20 most popular items
bakery.groupby('Item')['Item'].count().sort_values(ascending=False)[0:20].plot.bar(ax=axes[1])
axes[1].set_title('Frequency of the 20 most popular items')
plt.show()

Observing the plot, the three most popular items are __Coffee__, __Bread__ and __Tea__.
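
The item frequencies used for the bar plot can also be obtained with `value_counts()`. The following is a minimal sketch on hypothetical toy data (the `toy` dataframe is invented for illustration and is not part of the Bakery data set); `head(n)` returns exactly `n` rows, which helps avoid off-by-one slicing.

```python
import pandas as pd

# Hypothetical toy data, not taken from the real Bakery data set
toy = pd.DataFrame({"Item": ["Coffee", "Bread", "Coffee", "Tea", "Coffee", "Bread"]})

# value_counts() sorts by frequency in descending order by default
item_counts = toy["Item"].value_counts()

# head(n) returns exactly the n most frequent items
top_2 = item_counts.head(2)
print(top_2)
```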

### 2.3 Transactions

We are going to calculate __the number of transactions over different time periods__: per __Date__, __Year__, __Month__, __Day__, __Weekday__ and __Hour__. To plot this information, we create the corresponding __new features__ in the dataframe. After this, we include several plots that display the number of transactions for each period.

In [None]:
bakery['Date'] = pd.to_datetime(bakery['Date'],format='%Y-%m-%d')
bakery['Year'] = bakery['Date'].dt.year
bakery['Month'] = bakery['Date'].dt.month
bakery['Day'] = bakery['Date'].dt.day
bakery['Weekday'] = bakery['Date'].dt.weekday
bakery['Hour'] = pd.to_datetime(bakery['Time'],format='%H:%M:%S').dt.hour

 <a id='section4'></a>
The plots are arranged in two columns: the left column shows the __number of transactions per time period__ in a line plot, and the right column displays a stacked bar plot of the __10 most popular products__.

In [None]:
def draw_plots(freq, i):
    # Count the rows (one per item sold) in each period
    aux1_df = pd.DataFrame(bakery.groupby(freq)['Transaction'].count())
    # Plot 1: line plot of the counts
    aux1_df.plot.line(ax=axes[i, 0])
    axes[i, 0].set_title('Frequency of Transactions per %s' % freq)
    # Plot 2: stacked bars for the 10 most popular items
    stacked = bakery_10_popItems.groupby([freq, 'Item'])['Transaction'].count().unstack().fillna(0)
    stacked.plot.bar(stacked=True, ax=axes[i, 1])
    if freq == 'Date':
        # Bars sit at positions 0..n-1; label every 20th bar with its date
        x = np.arange(0, len(stacked), 20)
        labels = stacked.index[x].strftime('%Y-%m-%d')
        axes[i, 1].set_xticks(x)
        axes[i, 1].set_xticklabels(labels, rotation=30)
    axes[i, 1].set_title('Stacked bar per %s' % freq)

In [None]:
popular_items_10=bakery.groupby('Item')['Transaction'].count().sort_values(ascending=False)[0:10]
popular_items_10=pd.DataFrame(popular_items_10)
bakery_10_popItems=bakery[bakery['Item'].isin(popular_items_10.index)]
# Grid of plots: one row per time feature
fig, axes = plt.subplots(nrows=6, ncols=2, figsize=(20, 20))
plt.subplots_adjust(hspace=0.50)

freq = ['Date', 'Year', 'Month', 'Day', 'Weekday', 'Hour']

for i, f in enumerate(freq):
    draw_plots(f, i)



Looking at the plots for each frequency, we observe:
- __Date__: The number of transactions per day stays between 100 and 200, with some exceptions. For example, there is only one transaction on 2017-01-01, possibly because the bakery was closed. Moreover, in the stacked bar plot, the colours that show the distribution of the 10 items are stable.
- __Year__: The number of transactions is larger in 2017 than in 2016, simply because the data cover more months of 2017 than of 2016. In the stacked bar plot, the items keep more or less the same proportions in both years.
- __Month__: Note that we only have transactions between 2016-10-30 and 2017-04-09. The graph shows a significantly lower number of transactions in April and October, but this is because these two months have far fewer days in this data set than the others. In the stacked bar plot, the mix of items sold is quite constant throughout the different months.
- __Day__: The line graph shows the number of transactions as a function of the day of the month. The number of transactions seems slightly higher at the beginning of the month. The stacked bar plot shows that the proportions of the items sold remain stable across the days.
- __Weekday__: The line graph shows that Saturday is the day with the highest number of sales. Besides, Friday, Saturday and Sunday are busier than the rest of the week. In the stacked bar plot, the proportions of the items remain stable across the different weekdays.
- __Hour__: The line graph shows that the opening hours seem to be from 7h to 19h, and that the most popular hour is 11h. The items keep their proportions throughout the day, with the exception of sandwiches, which are sold mostly during lunch hours (12h to 14h).
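
The Month observation suggests comparing months by their average count per observed day rather than by raw totals. The sketch below illustrates one way to do this on hypothetical toy data; the `toy` dataframe and the variable names are assumptions made for illustration, and this normalisation is not part of the original analysis.

```python
import pandas as pd

# Hypothetical toy data: one row per item sold, with its date
toy = pd.DataFrame({
    "Date": pd.to_datetime([
        "2016-10-30", "2016-10-30", "2016-10-31",
        "2016-11-01", "2016-11-01", "2016-11-02", "2016-11-03", "2016-11-03",
    ]),
})
toy["Month"] = toy["Date"].dt.month

# Raw number of rows per month (what the line plot shows)
rows_per_month = toy.groupby("Month")["Date"].count()

# Number of distinct days actually observed in each month
days_per_month = toy.groupby("Month")["Date"].nunique()

# Average number of rows per observed day, comparable across months
rows_per_day = rows_per_month / days_per_month
print(rows_per_day)
```

Grouping by month number alone works here because the period covers less than one year; for longer data sets, grouping by year and month together would be needed.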


 <a id='section5'></a>
#### - None
In this section, we focus on the __NONE__ value found in the _Item_ feature.

In [None]:
Nothing=bakery[bakery['Item']=='NONE'].groupby('Date')['Transaction'].count()
print('Mean:',Nothing.mean())
print('Std:', Nothing.std())
fig, ax=plt.subplots(nrows=1, figsize=(10,5))
Nothing.plot.line(ax=ax)
plt.show()

The line plot shows that __NONE__ is recorded around 5 times per day. This entry is probably the result of a mistake or of an item that could not be classified.

In [None]:
bakery['Date'].max()-bakery['Date'].min()

 <a id='section6'></a>
#### - Items sold once per week or less
We are going to study the behaviour of the items that are sold once per week or less. To do so, we use _items_dict_, a dictionary that maps each item to its frequency. It lets us select the items with a frequency below 23: the whole period covers 23 weeks, so an item with a frequency below 23 is sold, on average, less than once per week.
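
The 23-week threshold can be checked with a short calculation. This sketch uses the first and last dates quoted earlier in this notebook (2016-10-30 and 2017-04-09); the variable names are illustrative.

```python
import pandas as pd

# First and last dates of the data set, as quoted in the Month observation
start = pd.Timestamp("2016-10-30")
end = pd.Timestamp("2017-04-09")

# Length of the period in days and in whole weeks
n_days = (end - start).days
n_weeks = n_days // 7
print(n_days, n_weeks)
```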

In [None]:
items_dict.keys()

In [None]:
# Attach to every row the overall frequency of its item
bakery['n']=bakery['Item'].map(items_dict)

After this filter, the items are grouped by frequency and plotted.

In [None]:
fig, ax=plt.subplots(nrows=1, figsize=(10,5))
# Daily counts of the less frequent items, one column per frequency value
baker_less_freq=bakery[bakery['n']<23].groupby(['Date','n'])['n'].count().unstack().fillna(0)
baker_less_freq.plot.line(ax=ax)
ax.legend(loc='right')
plt.title('Elements with less frequency')
plt.show()

<a id='section8'></a>
Next, we look at how many items are involved in the same transaction when one of the less popular items is sold.

In [None]:
def number_items_transaction(n,i,j):
    transaction_23_list=bakery[bakery['n']==n].Transaction
    print('number of elements with frequency %d: %d' %(n,len(transaction_23_list)) )
    if len(transaction_23_list)==0:
        plt.sca(axes[i, j])
        axes[i,j].set_title('#Items per transaction with items bought {}'.format(n))
        plt.text(0.4, 0.5, "No Values", size=10, rotation=20.,
         ha="center", va="center",
         bbox=dict(boxstyle="round",
                   ec=(1., 0.5, 0.5),
                   fc=(1., 0.8, 0.8),
                   ))
    else:
        bakery_23=bakery[bakery['Transaction'].isin(transaction_23_list)]
        aux=pd.DataFrame(bakery_23.groupby('Transaction')['Item'].count()).groupby('Item')['Item'].count()
        #print('number of elements sold: %d'%len(bakery_23))
        pd.DataFrame(aux).plot.bar(ax=axes[i,j])
        axes[i,j].set_title('#Items per transaction with items bought {}'.format(n))
        #plt.title('Transaction with #elements per transaction')
        axes[i,j].get_legend().remove()
  

In [None]:
fig, axes=plt.subplots(nrows=7, ncols=3, figsize=(15,10))
plt.subplots_adjust(left=None, bottom=None, right=None, top=None, wspace=None, hspace=1.3)

n_list=list(np.arange(1,22))
i_list = sorted(list(np.arange(7))*3)
j_list = list(np.arange(3))*7

for (n,i,j) in zip(n_list,i_list,j_list):
    number_items_transaction(n,i,j)

In general, we observe that the least popular products are complements to the other items, since the majority of the transactions that include them contain more than one item.
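
The idea of measuring the basket size of transactions that contain a rare item can be sketched on a small example. The toy data and the names `basket_size`, `rare_transactions` and `rare_basket_sizes` below are hypothetical and only illustrate the logic; they are not taken from the Bakery data set.

```python
import pandas as pd

# Hypothetical toy data: one row per item sold
toy = pd.DataFrame({
    "Transaction": [1, 1, 2, 3, 3, 3, 4],
    "Item": ["Coffee", "Bread", "Coffee", "Coffee", "Bacon", "Bread", "Bread"],
})

# Overall frequency of each item, attached to every row
toy["n"] = toy["Item"].map(toy["Item"].value_counts())

# Number of items in each transaction
basket_size = toy.groupby("Transaction")["Item"].count()

# Transactions that contain an item sold exactly once ("Bacon" here)
rare_transactions = toy.loc[toy["n"] == 1, "Transaction"].unique()

# Basket sizes of those transactions
rare_basket_sizes = basket_size.loc[rare_transactions]
print(rare_basket_sizes)
```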

<a id='section7'></a>
#### - Correlations 
We are going to study the correlation between the 10 most popular items.

In [None]:
a=bakery_10_popItems.groupby(['Date','Item'])['Transaction'].count().unstack().fillna(0)

In [None]:
a.corr(method='pearson')

Observing the correlation matrix, we can see that most of the coefficients are positive. The product with the most negative coefficients is Medialuna, which is positively correlated with Bread, Coffee and Tea (the three most popular products), but negatively correlated with other complementary products such as Cake, Cookies and Sandwich.
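
To show what the correlation matrix measures, here is a minimal sketch on invented daily counts. The dates, items and counts are hypothetical and chosen so that the result is easy to read: Coffee and Bread rise together, while Tea falls as Coffee rises.

```python
import pandas as pd

# Hypothetical daily sales counts, invented for illustration only
counts = {
    "2017-01-02": {"Coffee": 1, "Bread": 2, "Tea": 3},
    "2017-01-03": {"Coffee": 2, "Bread": 4, "Tea": 2},
    "2017-01-04": {"Coffee": 3, "Bread": 6, "Tea": 1},
}

# Expand the counts into one row per item sold, like the bakery dataframe
rows = [
    (date, item)
    for date, items in counts.items()
    for item, k in items.items()
    for _ in range(k)
]
toy = pd.DataFrame(rows, columns=["Date", "Item"])

# Pivot to a Date x Item table of daily counts
daily = toy.groupby(["Date", "Item"]).size().unstack(fill_value=0)

# Pearson correlation between the daily counts of each pair of items
corr = daily.corr(method="pearson")
print(corr.round(2))
```

A positive coefficient means that two items tend to sell well on the same days; it does not by itself show that they are bought in the same transaction.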

#### - Probability and Conditional Probability

Formula for Conditional Probability:

$P(B|A)=\frac{P (B \cap A)}{P(A)}$

We are going to study different __conditional probabilities__, such as how choosing a specific product affects the number of items per transaction, and how selecting one product affects the chance of another product being in the same transaction.
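
As an illustration of the formula, the sketch below estimates $P(B|A)$ at the transaction level on hypothetical toy data. The helper `conditional_probability` and the `toy` dataframe are assumptions made for this example; this is one possible estimate and not necessarily the exact quantity plotted in the cells that follow.

```python
import pandas as pd

# Hypothetical toy data: one row per item sold
toy = pd.DataFrame({
    "Transaction": [1, 1, 2, 3, 3, 4, 4, 5],
    "Item": ["Coffee", "Bread", "Coffee", "Coffee", "Cake", "Bread", "Tea", "Bread"],
})


def conditional_probability(df, a, b):
    """Estimate P(b in transaction | a in transaction)."""
    with_a = set(df.loc[df["Item"] == a, "Transaction"])
    with_b = set(df.loc[df["Item"] == b, "Transaction"])
    if not with_a:
        return float("nan")
    # P(A and B) / P(A): the total number of transactions cancels out
    return len(with_a & with_b) / len(with_a)


p_bread_given_coffee = conditional_probability(toy, "Coffee", "Bread")
p_coffee_given_cake = conditional_probability(toy, "Cake", "Coffee")
print(p_bread_given_coffee, p_coffee_given_cake)
```

In this toy example, Coffee appears in three transactions and only one of them also contains Bread, so the estimate of P(Bread|Coffee) is 1/3.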

In [None]:
def transact_condition(items,i):
    # Rows of every transaction that contains the given item
    trans_list=list(bakery[bakery['Item']==items].Transaction)
    aux1_df=bakery[bakery['Transaction'].isin(trans_list)]
    aux2_df=aux1_df
    # Distribution of the number of items per transaction
    aux1_df=pd.DataFrame(aux1_df.groupby('Transaction')['Item'].count()).groupby('Item')['Item'].count()
    aux1_df=pd.DataFrame(aux1_df)
    aux1_df['Prob']=aux1_df['Item']/sum(aux1_df['Item'])
    # Share of each item among the rows of those transactions
    aux2_df=pd.DataFrame(aux2_df.groupby('Item')['Item'].count().sort_values(ascending=False))
    aux2_df['Prob']=aux2_df['Item']/sum(aux2_df['Item'])

    # Plot 1: number of items per transaction
    pd.DataFrame(aux1_df['Prob']).plot.bar(ax=axes[i,0])
    axes[i,0].set_title('P(#items per transaction|%s)' %items)
    axes[i,0].set_xlabel(items)
    axes[i,0].get_legend().remove()

    # Plot 2: up to 19 items after the first row
    # (the first row is expected to be the conditioning item itself)
    top=aux2_df['Prob'][1:20]
    top.plot.bar(ax=axes[i,1])
    # Use as many ticks as bars so that ticks and labels always match
    axes[i,1].set_xticks(np.arange(len(top)))
    axes[i,1].set_xticklabels(top.index,rotation=60)
    axes[i,1].set_title('P(item belongs to transaction|%s)' %items)
    axes[i,1].set_xlabel(' ')

In [None]:
# features from plot
fig, axes=plt.subplots(nrows=9, ncols=2,figsize=(25,40))
plt.subplots_adjust(left=None, bottom=None, right=None, top=None, wspace=None, hspace=0.8)
items=['Coffee','Bread','Tea','Cake','Pastry','Sandwich','NONE','Medialuna','Hot chocolate']

for (it,i) in zip(items, np.arange(9)):
    transact_condition(it,i)


Overall, the plots in the left column show that the mode (the most frequent value) of the conditional distribution is 2 items. The most notable exception is Bread, for which the mode is 1 item, that is, buying only bread. Besides, the figures in the right column show that coffee has the highest probability of being the second item when you choose an item.

## 3. Conclusion <a id='section3'></a>

Analysing the bakery data, we reach the following conclusions:

- [Period](#section4): The analysis over different time periods reveals some interesting facts. For example, weekends are busier than working days. Another example is that sandwiches are mostly sold around lunch. It would be interesting to have a data set covering a longer period of time, which would help to refine this analysis.
- [None](#section5): Some transactions are tagged with the __NONE__ item. This looks like an error, or a case where the salesperson did not know how to classify the item, because this item is recorded on average 5.35 times per day, with a standard deviation of 3.70.
- [Products not popular](#section6): At first sight, these items may not seem very profitable, since they are sold at most once per week. However, a more detailed analysis ([plot](#section8)) shows that the less popular products are usually sold together with other items in the same transaction. A longer period would be needed to judge whether these items are worth selling.
- [Products related](#section7): Most of the popular products are positively correlated. The most notable exception is Medialuna, which is negatively correlated with other complementary food products (such as Cake, Cookies and Sandwich). Besides, when we analyse the number of products per transaction conditional on the product sold, we observe that Bread is the only product most often sold on its own. All other items are most frequently sold together with other products, with coffee being the most common item to pair with.