# Transactions from a bakery

The data belongs to a bakery called "The Bread Basket", located in the historic center of Edinburgh. The bakery offers a refreshing selection of Argentine and Spanish products.


The data set contains 15 010 observations and more than 6 000 transactions from a bakery. It has the following columns:

- **Date**. Categorical variable that tells us the date of the transactions (YYYY-MM-DD format). The column covers dates from 2016-10-30 to 2017-04-09.

- **Time**. Categorical variable that tells us the time of the transactions (HH:MM:SS format).

- **Transaction**. Quantitative variable that lets us tell the transactions apart. Rows that share the same value in this field belong to the same transaction, which is why the data set has fewer transactions than observations.

- **Item**. Categorical variable that names the product sold.

You can find the original dataset [here](https://www.kaggle.com/aboliveira/bakery-market-basket-analysis/data).

![](./dataset-cover.jpg)

## Getting Started

We will use the Python programming language to look at the data in more detail. Before we continue, let's check which version of Python is installed.

In [1]:
!python --version

Python 2.7.15


These are some of the libraries that we will use along the way:

- pandas
- NumPy
- SciPy
- Bokeh

In [39]:
import pandas as pd
import numpy as np
import scipy as sp
import bokeh

from math import pi

from bokeh.io import output_notebook, show
from bokeh.palettes import inferno
from bokeh.plotting import figure
from bokeh.transform import cumsum

By default, Bokeh writes its output to an HTML file. Calling `output_notebook()` makes the plots render inline in the notebook instead.

In [3]:
output_notebook() 

Now that we have imported all the tools that we need, let's start by loading the dataset.

In [4]:
raw_df = pd.read_csv('BreadBasket_DMS.csv')
raw_df.dtypes

Date           object
Time           object
Transaction     int64
Item           object
dtype: object
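
The `object` dtype reported for `Date` and `Time` means that pandas has stored them as plain strings. One option, sketched below on a small hypothetical in-memory CSV rather than the real file, is to let `read_csv` parse the dates at load time. The rest of this notebook keeps working with the string columns, so this is only an illustration.

```python
import io
import pandas as pd

# Hypothetical stand-in for the first rows of BreadBasket_DMS.csv.
csv_text = """Date,Time,Transaction,Item
2016-10-30,09:58:11,1,Bread
2016-10-30,10:05:34,2,Scandinavian
"""

# parse_dates asks read_csv to convert the listed columns to datetime64.
parsed_df = pd.read_csv(io.StringIO(csv_text), parse_dates=['Date'])
print(parsed_df.dtypes)
```

With a parsed column, accessors such as `parsed_df['Date'].dt.year` are available without a separate call to `pd.to_datetime`.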

Here is a glimpse of what the dataset looks like:

In [5]:
raw_df.head()

| | Date | Time | Transaction | Item |
|---|---|---|---|---|
| 0 | 2016-10-30 | 09:58:11 | 1 | Bread |
| 1 | 2016-10-30 | 10:05:34 | 2 | Scandinavian |
| 2 | 2016-10-30 | 10:05:34 | 2 | Scandinavian |
| 3 | 2016-10-30 | 10:07:57 | 3 | Hot chocolate |
| 4 | 2016-10-30 | 10:07:57 | 3 | Jam |


## Cleaning the data

Before we continue, it is a good idea to get the data into a workable format. In this dataset, some of the Item information is missing: those rows carry the placeholder value `NONE` instead of a product name. To make our lives easier, we can discard these data points.

In [6]:
def cleanup_dataset(df):
    """Return a copy of df without the rows whose Item is the placeholder 'NONE'."""
    is_missing = df['Item'] == 'NONE'
    return df.loc[~is_missing].copy()

dataset = cleanup_dataset(raw_df)
dataset.head()

| | Date | Time | Transaction | Item |
|---|---|---|---|---|
| 0 | 2016-10-30 | 09:58:11 | 1 | Bread |
| 1 | 2016-10-30 | 10:05:34 | 2 | Scandinavian |
| 2 | 2016-10-30 | 10:05:34 | 2 | Scandinavian |
| 3 | 2016-10-30 | 10:07:57 | 3 | Hot chocolate |
| 4 | 2016-10-30 | 10:07:57 | 3 | Jam |
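
Before discarding rows, it can be useful to know how many are affected. The sketch below counts the placeholder rows with a boolean mask; the `sample` frame is hypothetical and only imitates the bakery data.

```python
import pandas as pd

# Hypothetical rows; 'NONE' marks a missing item, as in the bakery data.
sample = pd.DataFrame({
    'Transaction': [1, 2, 3, 3],
    'Item': ['Bread', 'NONE', 'Coffee', 'NONE'],
})

is_missing = sample['Item'] == 'NONE'   # True for placeholder rows
cleaned = sample[~is_missing]           # keep everything else

print('rows dropped:', int(is_missing.sum()))
print('rows kept:', len(cleaned))
```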


Each row in the table above represents an item in a transaction. We might want to focus on each of these aspects separately, so let's make a list of transactions and a separate list of items!

In [7]:
list_of_transactions = dataset[['Transaction', 'Date', 'Time']].drop_duplicates()
list_of_transactions.head()

| | Transaction | Date | Time |
|---|---|---|---|
| 0 | 1 | 2016-10-30 | 09:58:11 |
| 1 | 2 | 2016-10-30 | 10:05:34 |
| 3 | 3 | 2016-10-30 | 10:07:57 |
| 6 | 4 | 2016-10-30 | 10:08:41 |
| 7 | 5 | 2016-10-30 | 10:13:03 |


In [8]:
bakery_items = dataset[['Item']].drop_duplicates()
bakery_items.head()

| | Item |
|---|---|
| 0 | Bread |
| 1 | Scandinavian |
| 3 | Hot chocolate |
| 4 | Jam |
| 5 | Cookies |


## Setting up some auxiliary functions

Here are a few functions that will make it easier for us to interact with our dataframes.

The **Date** and **Time** columns each encode several pieces of information. Splitting them up will make it easier for us to group data points.

In [170]:
def split_date_field(df_orig):
    """
    Splits the Date column into separate Year, Month, Day and Weekday columns
    (YYYY-MM-DD) -> (YYYY, MM, DD, weekday number)
    """

    df = df_orig.copy()
    date = pd.to_datetime(df['Date'])

    df['Year'] = date.dt.year
    df['Month'] = date.dt.month
    df['Day'] = date.dt.day

    # Weekday is kept as a number: Monday=0 ... Sunday=6
    df['Weekday'] = date.dt.weekday

    return df

split_date_field(dataset).head()

| | Date | Time | Transaction | Item | Year | Month | Day | Weekday |
|---|---|---|---|---|---|---|---|---|
| 0 | 2016-10-30 | 09:58:11 | 1 | Bread | 2016 | 10 | 30 | 6 |
| 1 | 2016-10-30 | 10:05:34 | 2 | Scandinavian | 2016 | 10 | 30 | 6 |
| 2 | 2016-10-30 | 10:05:34 | 2 | Scandinavian | 2016 | 10 | 30 | 6 |
| 3 | 2016-10-30 | 10:07:57 | 3 | Hot chocolate | 2016 | 10 | 30 | 6 |
| 4 | 2016-10-30 | 10:07:57 | 3 | Jam | 2016 | 10 | 30 | 6 |
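
The numeric `Weekday` column is convenient for sorting, but labels are easier to read in tables and plots. As a rough sketch, the numbers can be mapped to short names and then used for grouping. The `sample` frame and the `WeekdayName` column are hypothetical names chosen for this illustration.

```python
import pandas as pd

# Hypothetical dates; 2016-10-30 was a Sunday and 2016-10-31 a Monday.
sample = pd.DataFrame({'Date': ['2016-10-30', '2016-10-30', '2016-10-31']})

weekday_names = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']
weekday_number = pd.to_datetime(sample['Date']).dt.weekday  # Monday=0 ... Sunday=6
sample['WeekdayName'] = weekday_number.map(lambda n: weekday_names[n])

# Number of rows recorded on each weekday.
rows_per_weekday = sample.groupby('WeekdayName').size()
print(rows_per_weekday)
```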


In [10]:
def split_time_field(df_orig):
    """
    Splits the Time column into three separate columns

    (HH:MM:SS) -> (HH, MM, SS)
    """
    df = df_orig.copy()
    # expand=True returns one column per piece of the split string
    df[['Hours', 'Mins', 'Secs']] = df['Time'].str.split(':', expand=True)
    return df

split_time_field(dataset).head()

| | Date | Time | Transaction | Item | Hours | Mins | Secs |
|---|---|---|---|---|---|---|---|
| 0 | 2016-10-30 | 09:58:11 | 1 | Bread | 9 | 58 | 11 |
| 1 | 2016-10-30 | 10:05:34 | 2 | Scandinavian | 10 | 5 | 34 |
| 2 | 2016-10-30 | 10:05:34 | 2 | Scandinavian | 10 | 5 | 34 |
| 3 | 2016-10-30 | 10:07:57 | 3 | Hot chocolate | 10 | 7 | 57 |
| 4 | 2016-10-30 | 10:07:57 | 3 | Jam | 10 | 7 | 57 |
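
Splitting a string column produces strings, so the pieces need to be converted before they can be used in arithmetic or numeric comparisons. The following is a minimal sketch on hypothetical data, assuming every `Time` value is a well-formed `HH:MM:SS` string.

```python
import pandas as pd

sample = pd.DataFrame({'Time': ['09:58:11', '10:05:34', '14:30:00']})

# expand=True returns one column per piece; astype(int) turns '09' into 9.
parts = sample['Time'].str.split(':', expand=True).astype(int)
parts.columns = ['Hours', 'Mins', 'Secs']

# With integer hours, grouping by hour of day is straightforward.
rows_per_hour = parts.groupby('Hours').size()
print(rows_per_hour)
```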


The exact time of a transaction is often more detail than we need. If we want to look for trends, it might make more sense to look at the approximate time of day.

In [158]:
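def extract_time_of_day(df_orig):
    """Labels each row as Morning, Afternoon, Evening or Night based on Time."""
    df = df_orig.copy()

    time = df['Time']

    time_of_day = ['Morning', 'Afternoon', 'Evening', 'Night']

    # Zero-padded HH:MM:SS strings sort chronologically, so plain string
    # comparison is enough here.
    df.loc[(time < '12:00:00'), 'Daytime'] = time_of_day[0]
    df.loc[(time >= '12:00:00') & (time < '17:00:00'), 'Daytime'] = time_of_day[1]
    df.loc[(time >= '17:00:00') & (time < '21:00:00'), 'Daytime'] = time_of_day[2]
    # No upper bound, so the last minutes of the day are labelled as well
    df.loc[(time >= '21:00:00'), 'Daytime'] = time_of_day[3]

    return df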

extract_time_of_day(dataset).head()

| | Date | Time | Transaction | Item | Daytime |
|---|---|---|---|---|---|
| 0 | 2016-10-30 | 09:58:11 | 1 | Bread | Morning |
| 1 | 2016-10-30 | 10:05:34 | 2 | Scandinavian | Morning |
| 2 | 2016-10-30 | 10:05:34 | 2 | Scandinavian | Morning |
| 3 | 2016-10-30 | 10:07:57 | 3 | Hot chocolate | Morning |
| 4 | 2016-10-30 | 10:07:57 | 3 | Jam | Morning |
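
An alternative way to bucket times, shown here only as a sketch, is to extract the hour and hand the bin edges to `pd.cut`. The edges below mirror the 12:00, 17:00 and 21:00 boundaries used in `extract_time_of_day`; the `sample` frame is hypothetical.

```python
import pandas as pd

sample = pd.DataFrame({'Time': ['09:58:11', '12:00:00', '17:45:10', '21:05:00']})

# The first two characters of a zero-padded HH:MM:SS string are the hour.
hour = sample['Time'].str.slice(0, 2).astype(int)

# right=False closes each bin on the left: [0, 12), [12, 17), [17, 21), [21, 24)
sample['Daytime'] = pd.cut(
    hour,
    bins=[0, 12, 17, 21, 24],
    right=False,
    labels=['Morning', 'Afternoon', 'Evening', 'Night'],
)
print(sample)
```

The result is a categorical column, which keeps the labels in chronological order when sorting or plotting.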


In [173]:
#@TODO:
def extract_season(df_orig):
    """Adds a parsed 'date' column and a Season label derived from the month."""
    df = df_orig.copy()

    df['date'] = pd.to_datetime(df['Date'])
    month = df['date'].dt.month

    time_of_year = ['Winter', 'Spring', 'Summer', 'Fall']

    # Winter wraps around the new year: November to February
    df.loc[(month < 3) | (month >= 11), 'Season'] = time_of_year[0]
    df.loc[(month >= 3) & (month < 6), 'Season'] = time_of_year[1]
    df.loc[(month >= 6) & (month < 9), 'Season'] = time_of_year[2]
    df.loc[(month >= 9) & (month < 11), 'Season'] = time_of_year[3]

    return df

extract_season(dataset).head()

| | Date | Time | Transaction | Item | date | Season |
|---|---|---|---|---|---|---|
| 0 | 2016-10-30 | 09:58:11 | 1 | Bread | 2016-10-30 | Fall |
| 1 | 2016-10-30 | 10:05:34 | 2 | Scandinavian | 2016-10-30 | Fall |
| 2 | 2016-10-30 | 10:05:34 | 2 | Scandinavian | 2016-10-30 | Fall |
| 3 | 2016-10-30 | 10:07:57 | 3 | Hot chocolate | 2016-10-30 | Fall |
| 4 | 2016-10-30 | 10:07:57 | 3 | Jam | 2016-10-30 | Fall |
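
The month ranges can also be written as an explicit lookup table, which makes the season boundaries easy to read and to change. This is a sketch, not a replacement for `extract_season`. The `season_by_month` dictionary is a hypothetical helper that mirrors the ranges used above, including the choice to count November as Winter.

```python
import pandas as pd

# Month -> season lookup mirroring the ranges in extract_season.
season_by_month = {
    11: 'Winter', 12: 'Winter', 1: 'Winter', 2: 'Winter',
    3: 'Spring', 4: 'Spring', 5: 'Spring',
    6: 'Summer', 7: 'Summer', 8: 'Summer',
    9: 'Fall', 10: 'Fall',
}

sample = pd.DataFrame({'Date': ['2016-10-30', '2016-11-15', '2017-04-09']})
month = pd.to_datetime(sample['Date']).dt.month
sample['Season'] = month.map(season_by_month)
print(sample)
```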


In this dataset, each transaction is identified by its date and time. We can now look more closely at all the transactions that were made during the period covered by the data.

## Understanding the data

Now we can use the data to answer some real questions!

### How many transactions took place?

In [13]:
list_of_transactions['Transaction'].count()

9465
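
The count above relies on every transaction id appearing with exactly one date and time, because `list_of_transactions` was built by dropping duplicate (Transaction, Date, Time) rows. A quick way to cross-check that assumption is sketched below on hypothetical data; on the real data the same expressions would be applied to `dataset`.

```python
import pandas as pd

# Hypothetical rows that mimic the bakery data.
sample = pd.DataFrame({
    'Date': ['2016-10-30'] * 4,
    'Time': ['09:58:11', '10:05:34', '10:05:34', '10:07:57'],
    'Transaction': [1, 2, 2, 3],
    'Item': ['Bread', 'Scandinavian', 'Scandinavian', 'Hot chocolate'],
})

keys = ['Transaction', 'Date', 'Time']
unique_ids = sample['Transaction'].nunique()
unique_rows = len(sample[keys].drop_duplicates())

# Number of distinct (Date, Time) pairs recorded for each transaction id.
stamps_per_id = sample.drop_duplicates(keys).groupby('Transaction').size()

print(unique_ids, unique_rows, stamps_per_id.max())
```

If the two counts agree and the maximum is 1, counting rows of the deduplicated frame and counting distinct ids give the same answer.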

### How many items are in the bakery?

In [57]:
number_of_items = dataset['Item'].nunique()
number_of_items

94

In [56]:
bakery_items.count()

Item    94
dtype: int64

### What are the most popular items?

In [51]:
transaction_count = dataset.groupby(by='Item')[['Transaction']].count().sort_values(by='Transaction', ascending=False)
transaction_count.head()

| Item | Transaction |
|---|---|
| Coffee | 5471 |
| Bread | 3325 |
| Tea | 1435 |
| Cake | 1025 |
| Pastry | 856 |


In [50]:
def convert_to_percentage(x):
    return 100 * x / float(x.sum())

transaction_percentage = transaction_count.apply(convert_to_percentage)
transaction_percentage.head()

| Item | Transaction |
|---|---|
| Coffee | 26.678695 |
| Bread | 16.213976 |
| Tea | 6.997611 |
| Cake | 4.998293 |
| Pastry | 4.174184 |
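
For comparison, pandas can produce the same kind of percentage table in one step with `value_counts(normalize=True)`. The snippet below is a small sketch on hypothetical data, not the computation used above.

```python
import pandas as pd

sample = pd.DataFrame({'Item': ['Coffee', 'Coffee', 'Bread', 'Tea']})

# normalize=True returns fractions that sum to 1; multiply by 100 for percent.
item_share = sample['Item'].value_counts(normalize=True) * 100
print(item_share)
```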


In [135]:
def create_circle_graph(title, slices_dict, colour_palette_func):
    """Builds a Bokeh pie chart with one wedge per entry in slices_dict."""
    number_of_slices = len(slices_dict)
    data = pd.Series(slices_dict).reset_index(name='value').rename(columns={'index': title})
    # Each wedge spans a share of the full circle proportional to its value
    data['angle'] = data['value']/data['value'].sum() * 2*pi
    data['color'] = colour_palette_func(number_of_slices)

    p = figure(plot_height=350, title=title, toolbar_location=None,
               tools="hover", tooltips="@{"+ title + "}: @value", x_range=(-0.5, 1.0))

    # cumsum stacks the angles so that each wedge starts where the previous one ends
    p.wedge(x=0, y=1, radius=0.4,
            start_angle=cumsum('angle', include_zero=True), end_angle=cumsum('angle'),
            line_color="white", fill_color='color', source=data)

    p.axis.axis_label = None
    p.axis.visible = False
    p.grid.grid_line_color = None

    return p

slices = transaction_percentage.to_dict()['Transaction']
circle_graph = create_circle_graph('Bakery Items', slices, inferno)
show(circle_graph)
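
With 94 items, most wedges in the chart are too thin to read. One possible refinement, sketched here with hypothetical numbers, is to keep the largest few items and fold the rest into a single "Other" slice before plotting. The cut-off `top_n` is an illustrative choice, not something taken from the analysis above.

```python
import pandas as pd

# Hypothetical percentages standing in for transaction_percentage['Transaction'].
shares = pd.Series({'Coffee': 40.0, 'Bread': 30.0, 'Tea': 15.0,
                    'Cake': 10.0, 'Jam': 3.0, 'Cookies': 2.0})

top_n = 3  # illustrative cut-off
largest_first = shares.sort_values(ascending=False)

grouped = largest_first.iloc[:top_n].copy()
grouped['Other'] = largest_first.iloc[top_n:].sum()  # everything below the cut-off

grouped_slices = grouped.to_dict()
print(grouped_slices)
```

The resulting dictionary has the same shape as `slices`, so it could be passed to `create_circle_graph` in its place.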