<center>
    <img src="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/assets/logos/SN_web_lightmode.png" width="300" alt="cognitiveclass.ai logo">
</center>


# Online sales of goods analysis
## **Lab. 4. Market Basket Analysis**


Estimated time needed: **45** minutes


## Abstract


This lab performs a market basket analysis on the "Online sales of goods" dataset. The dataset has 406934 entries.

<b>The dataset file contains the following fields:</b>
* Invoice Number: Invoice number. Nominal. A 6-digit integral number uniquely assigned to each transaction. If this code starts with the letter 'c', it indicates a cancellation.
* Stock Code: Product (item) code. Nominal. A 5-digit integral number uniquely assigned to each distinct product.
* Description: Product (item) name. Nominal.
* Quantity: The quantities of each product (item) per transaction. Numeric.
* Invoice Date: Invoice date and time. Numeric. The day and time when a transaction was generated.
* Unit Price: Unit price. Numeric. Product price per unit in sterling (£).
* Unit Price-USD: Unit price. Numeric. Product price per unit in US dollars ($).
* Customer ID: Customer number. Nominal. A 5-digit integral number uniquely assigned to each customer.
* Country: Country name. Nominal. The name of the country where a customer resides.
* Turnover-GBP: Turnover per line (quantity * unit price of the product) in pounds sterling (GBP).
* Turnover-USD: Turnover per line (quantity * unit price of the product) in US dollars (USD).
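
As a quick illustration of the fields listed above, the sketch below builds a tiny hand-made frame, flags cancellations by the leading 'c' in the invoice number, and recomputes the per-line turnover as quantity times unit price. The rows and the `is_cancellation` column name are invented for illustration and are not taken from the dataset.

```python
import pandas as pd

# Invented sample rows that mimic the column layout described above
sample = pd.DataFrame({
    "Invoice Number": ["536365", "536365", "c536379"],
    "Quantity": [6, 2, -1],
    "Unit Price-USD": [3.10, 4.50, 33.00],
})

# A leading 'c' (upper or lower case) marks a cancelled invoice
sample["is_cancellation"] = (
    sample["Invoice Number"].str.lower().str.startswith("c")
)

# Turnover per line = quantity * unit price
sample["Turnover-USD"] = sample["Quantity"] * sample["Unit Price-USD"]

print(sample)
```

The cleaned file used later in this lab may already have cancellations removed, so this check is only a way to understand the field definitions.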


The statistical data was obtained from the <a href="https://www.kaggle.com/datasets/sowndarya23/online-retail-dataset/download?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkGuidedProjectsIBMSkillsNetworkGPXX02PYEN2993-2023-01-01&datasetVersionNumber=1">https://www.kaggle.com/datasets/sowndarya23/online-retail-dataset/download?datasetVersionNumber=1</a><br>
License: <a href="https://creativecommons.org/publicdomain/zero/1.0/?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkGuidedProjectsIBMSkillsNetworkGPXX02PYEN2993-2023-01-01">CC0 1.0 Universal (CC0 1.0) Public Domain Dedication</a>
<p>The person who associated a work with this deed has dedicated the work to the public domain by waiving all of his or her rights to the work worldwide under copyright law, including all related and neighboring rights, to the extent allowed by law.
<br>You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission.</p>


## Introduction


Many modern companies, including online stores, analyze customer transactions and use them to form a market basket, that is, a set of popular products that are bought together.
This basket can be used to create shopping recommendations, plan the product range, arrange products on supermarket shelves, or design promotional offers.

Market Basket Analysis is a powerful tool for turning a huge number of customer transactions into simple, easy-to-visualize rules used to promote a product and build sales recommendations.

In this lab, we will learn to perform a market basket analysis using classical data visualization methods and association rules.


## Materials and methods


In this lab, we will learn how to load data, preprocess it, perform a basic market basket analysis, build association rules and visualize them.
This lab consists of the following steps:
<div class="alert alert-block alert-info" style="margin-top: 20px">
<ol>
    <li><a href="#part1">Import and preprocess the data</a></li>
    <li><a href="#part2">Data Visualizations</a></li>
    <li><a href="#part3">Association Rules</a></li>
</ol>
</div>
<hr>


## Prerequisites
* [Python](https://www.python.org/?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkGuidedProjectsIBMSkillsNetworkGPXX02PYEN2993-2023-01-01) - intermediate level
* [Pandas](https://pandas.pydata.org/?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkGuidedProjectsIBMSkillsNetworkGPXX02PYEN2993-2023-01-01) - intermediate level 
* [SeaBorn](https://seaborn.pydata.org/?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkGuidedProjectsIBMSkillsNetworkGPXX02PYEN2993-2023-01-01) - intermediate level


## Objectives


After completing this lab, you will be able to:


* Load a dataset from *.csv files
* Create new columns and recalculate the values of existing ones
* Transform a dataset of transactions into a market basket dataset
* Visualize data with seaborn
* Produce Association rules


<h2><a name="part1" style="color: black; text-decoration: none;">Import Libraries/Define Auxiliary Functions</a></h2>


**Running outside Skills Network Labs.** This notebook was tested within Skills Network Labs. Running it in another environment should work as well, but this is not guaranteed and may require a different setup routine.


Libraries such as pandas, Matplotlib, seaborn and NumPy should be installed.


In [ ]:
# conda install -c conda-forge pandas

In [ ]:
# conda install -c conda-forge matplotlib

In [ ]:
# conda install -c conda-forge seaborn 

In [ ]:
# conda install -c conda-forge numpy 

Some libraries should be imported before you can begin.


In [ ]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import datetime as dt
import numpy as np

### Download and pre-preparation


Let's load the customer transaction data from a csv file.


In [ ]:
filename = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMSkillsNetwork-GPXX02PYEN/clean_df_4.csv"

Use the Pandas method <b>read_csv()</b> to load the data from the web address. The file already contains a header row, so no extra parameters are needed.


In [ ]:
df = pd.read_csv(filename)

Use the method <b>head()</b> to display the first five rows of the dataframe.


In [ ]:
df.head()

Let's study the dataset. As you can see, it consists of 11 columns.


We can drop several unnecessary columns, for example Turnover-GBP and Unit Price, which are in pounds sterling. We will use the USD values instead.


In [ ]:
df=df.drop(columns = ['Turnover-GBP','Unit Price'])
#check updated df
df.head()

In [ ]:
df.info()

<h4>Convert data types to proper format</h4>


In [ ]:
df['Invoice Date']=pd.to_datetime(df['Invoice Date'])
# OR  df[["Invoice Date"]] = df[["Invoice Date"]].astype('datetime64[ns]')

In [ ]:
df[["Customer ID"]] = df[["Customer ID"]].astype("int64")
df[["Country"]] = df[["Country"]].astype("category")
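
The same conversions could, as an alternative, be requested at load time through the `parse_dates` and `dtype` parameters of `read_csv`. The sketch below assumes a two-row CSV invented for illustration; only the column names are borrowed from the lab.

```python
import io
import pandas as pd

# Invented two-row CSV that reuses the lab's column names
csv_text = """Invoice Number,Invoice Date,Customer ID,Country
536365,2010-12-01 08:26:00,17850,United Kingdom
536370,2010-12-01 08:45:00,12583,France
"""

sample = pd.read_csv(
    io.StringIO(csv_text),
    parse_dates=["Invoice Date"],          # parse as datetime while reading
    dtype={"Customer ID": "int64",         # integer customer identifier
           "Country": "category"},         # repeated strings as categories
)

print(sample.dtypes)
```

Converting after loading, as the lab does, makes each step visible; converting at load time keeps the preprocessing in one place.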

In [ ]:
df.info()

Now we should add some new columns for basic visual market basket analysis.

To analyse purchase dynamics over different time periods, we extract the time and the hour from the Invoice Date column.


In [ ]:
df['time']=df['Invoice Date'].dt.time
df['hour']=df['Invoice Date'].dt.hour

To analyse the dynamics of purchases during the year, we should add columns with month numbers and their names.


In [ ]:
df['month'] = df['Invoice Date'].dt.month
df['month name'] = df['month'].replace([1,2,3,4,5,6,7,8,9,10,11,12],['January','February','March','April','May','June','July','August','September','October','November','December'])

Similarly, to analyse weekly purchases, we need to extract the day of the week and its name.


In [ ]:
df['day'] = df['Invoice Date'].dt.day
df['weekday'] = df['Invoice Date'].dt.weekday
df['weekday name'] = df['weekday'].replace([0,1,2,3,4,5,6], ['Monday','Tuesday','Wednesday','Thursday','Friday','Saturday','Sunday'])
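
As a possible shortcut, pandas also offers `dt.month_name()` and `dt.day_name()`, which return the names directly without a replacement list. The sketch below uses two invented timestamps; it is an alternative to the mapping above, not the lab's own approach.

```python
import pandas as pd

# Two invented timestamps for illustration
dates = pd.Series(pd.to_datetime(["2011-01-03 10:15", "2011-11-24 14:05"]))

names = pd.DataFrame({
    "month": dates.dt.month,               # month number, 1..12
    "month name": dates.dt.month_name(),   # English month name
    "weekday": dates.dt.weekday,           # Monday = 0 ... Sunday = 6
    "weekday name": dates.dt.day_name(),   # English weekday name
})

print(names)
```

Keeping the numeric `month` and `weekday` columns remains useful, because they give the chronological sort order that the names alone do not.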

Let's analyse the final DataSet:


In [ ]:
df

In [ ]:
df.info()

As you can see, we have 16 columns with all necessary information for preliminary visual market basket analysis.


<h2><a name="part2" style="color: black; text-decoration: none;">Data Visualizations</a></h2>


Let's analyze the top 20 most popular purchases.


In [ ]:
popular = df['Description'].value_counts()
# Share of each of the top 20 products among all purchase lines, in percent
(df['Description'].value_counts(normalize=True)*100).head(20)

In [ ]:
plt.figure(figsize=(15,5))
sns.barplot(x = popular.head(20).index, y = popular.head(20).values, palette = 'hls')
plt.xlabel('Items', size = 15)
plt.xticks(rotation=90)
plt.ylabel('Count of Items', size = 15)
plt.title('Top 20 Items purchased by customers', color = 'blue', size = 20)
plt.show()

As you can see, the most popular purchase is White Hanging heart T-Light Holder, followed by Regency cakestand 3 tier and then Jumbo Bag red Retrospot.
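
To make the difference between absolute counts and percentage shares concrete, here is a minimal sketch on an invented list of items. `value_counts()` returns counts, while `value_counts(normalize=True)` returns fractions that can be multiplied by 100.

```python
import pandas as pd

# Invented purchase lines: 5 mugs, 3 bags, 2 lamps
items = pd.Series(
    ["mug", "bag", "mug", "lamp", "mug", "bag", "mug", "mug", "bag", "lamp"]
)

counts = items.value_counts()                      # absolute counts
share = items.value_counts(normalize=True) * 100   # percentage of all lines

print(counts)
print(share)
```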


Let's analyze the dynamics of monthly purchases. For correct sorting, we need to group the DataSet by month number but display on the graph by month name.


In [ ]:
monthTran = df.groupby(['month','month name'])['Invoice Number'].count().reset_index()
plt.figure(figsize=(12,5))
sns.barplot(data = monthTran[['month name', 'Invoice Number']], x = "month name", y = "Invoice Number")
plt.xlabel('Months', size = 15)
plt.ylabel('Orders per month', size = 15)
plt.title('Number of orders received each month', color = 'blue', size = 18)
plt.show()

As you can see, the largest numbers of purchases fall in October and November. Buyers are most active in autumn and least active in winter.
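
Another way to keep month names in calendar order, sketched here under the assumption of a small invented frame, is to turn the name column into an ordered categorical. Grouping then follows the category order rather than the alphabet.

```python
import pandas as pd

month_order = ["January", "February", "March", "April", "May", "June",
               "July", "August", "September", "October", "November", "December"]

# Invented order lines, deliberately out of calendar order
orders = pd.DataFrame(
    {"month name": ["March", "January", "March", "November", "January", "March"]}
)

# An ordered categorical sorts by calendar position, not alphabetically
orders["month name"] = pd.Categorical(
    orders["month name"], categories=month_order, ordered=True
)

# observed=True keeps only the months that actually occur in the data
per_month = (
    orders.groupby("month name", observed=True)
          .size()
          .reset_index(name="orders")
)

print(per_month)
```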


<div class="alert alert-danger alertdanger" style="margin-top: 20px">
<b style="font-size: 2em; font-weight: bold;"> Question  #1:</b>

<p>Let's analyze the weekly activity.
    Write your part of code</p>

</div>


In [ ]:
weekTran = ##YOUR CODE GOES HERE##

plt.figure(figsize=(12,5))
sns.barplot(data = weekTran[['weekday name', 'Invoice Number']], x = "weekday name", y = "Invoice Number")
plt.xlabel('Week Day', size = 15)
plt.ylabel('Orders per day', size = 15)
plt.title('Number of orders received each day', color = 'blue', size = 20)
plt.show()

<details><summary>Click <b>here</b> for the solution</summary> 
<code>weekTran = df.groupby(['weekday','weekday name'])['Invoice Number'].count().reset_index()
</code>
</details>


As you can see from the plot, buyers are active throughout the week, and there are no days without purchases. The maximum number of purchases falls on Thursday.


It is also interesting to study the activity of consumers during the day.


<div class="alert alert-danger alertdanger" style="margin-top: 20px">
<b style="font-size: 2em; font-weight: bold;"> Question  #2:</b>

<p>Let's analyze the share of purchases per hour of the day.<br>
    Write your part of code</p>

</div>


In [ ]:
countbyhour = ##YOUR CODE GOES HERE##
countbyhour.sort_values('hour',inplace=True)

plt.figure(figsize=(12,5))
sns.barplot(data=countbyhour, x='hour', y='Invoice Number')
plt.xlabel('Hour', size = 15)
plt.ylabel('Transaction', size = 15)
plt.title('Transaction per hour of the day', color = 'blue', size = 20)
plt.show()

<details><summary>Click <b>here</b> for the solution</summary> 
    <code>    
     countbyhour = df.groupby('hour')['Invoice Number'].count().reset_index()
    </code>
</details>


The plot clearly shows that consumers are most active from 9 in the morning to 4 in the afternoon.


#### If we want to check which period of the day has the most purchases, we have to add a new column.


This function sets the period of the day according to the hour.


In [ ]:
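def get_time_of_day(timestamp):
    # Map a timestamp to a period of the day based on its hour
    hour = timestamp.hour

    if hour < 12:
        return 'morning'      # 00:00-11:59
    elif hour < 17:
        return 'afternoon'    # 12:00-16:59
    else:
        return 'evening'      # 17:00-23:59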

Apply it to our dataset:


In [ ]:
df['period of day'] = df['Invoice Date'].apply(get_time_of_day)
df
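
The row-by-row function above can also be expressed as a vectorized binning with `pd.cut`. The sketch below is an alternative under the same hour boundaries (before 12, before 17, otherwise), applied to a handful of invented hour values.

```python
import pandas as pd

# Invented hour values covering each boundary
hours = pd.Series([8, 11, 12, 16, 17, 20])

# Right-closed bins: (-1, 11] morning, (11, 16] afternoon, (16, 23] evening
period = pd.cut(
    hours,
    bins=[-1, 11, 16, 23],
    labels=["morning", "afternoon", "evening"],
)

print(period)
```

Because `pd.cut` returns an ordered categorical, the periods also sort chronologically without any extra helper column.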

Let's analyze buyer activity by period of the day, using the column we have just added. Because groupby sorts the periods alphabetically (afternoon, evening, morning), we add an explicit sort order so that the bars appear in chronological order.


In [ ]:
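# groupby returns the periods in alphabetical order: afternoon, evening, morning
countbyperiod = df.groupby('period of day')['Invoice Number'].count().reset_index()
# Assign chronological ranks: morning = 0, afternoon = 1, evening = 2
countbyperiod.loc[:, "dayorder"] = [1, 2, 0]
countbyperiod.sort_values("dayorder", inplace=True)
plt.figure(figsize=(12,5))
sns.barplot(data=countbyperiod, x='period of day', y='Invoice Number')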
plt.xlabel('Period of the day', size = 15)
plt.ylabel('Transaction', size = 15)
plt.title('Transaction per period of the day', color = 'blue', size = 20)
plt.show()
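
The manual `dayorder` list above relies on knowing the alphabetical order of the groups. A less fragile sketch, shown here on invented data, is to reindex the counts by an explicit list of period names; the `period_order` name is an assumption made for this example.

```python
import pandas as pd

# Invented period labels for six purchase lines
periods = pd.Series(
    ["afternoon", "morning", "afternoon", "evening", "afternoon", "morning"]
)

# Desired chronological order of the bars
period_order = ["morning", "afternoon", "evening"]

# Count each period, then arrange the result in chronological order
counts = periods.value_counts().reindex(period_order)

print(counts)
```

The same list could also be passed to the `order` argument of `sns.barplot`, which controls the order in which the categories are drawn.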

Let's analyze the share of purchases in different periods of the day.


In [ ]:
size = df['period of day'].value_counts()
labels = size.index.values
colors = ["deepskyblue", "lightblue", "cornflowerblue"]
explode = [0, 0, 0]

plt.figure(figsize=(12,5))
plt.pie(size, labels = labels, colors = colors, explode = explode, shadow = True, autopct = "%.2f%%")
plt.title('Transaction by day period')
plt.show()

<h2><a name="part3" style="color: black; text-decoration: none;">Association Rules</a></h2>


To build association rules, we need to define the relationships between purchases. To do this, we transform the transaction dataset into a special table whose columns are products, whose rows are transactions, and whose cells are Boolean (True/False), indicating whether the transaction contains the product. There are two common ways to do this.
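
To make the target structure concrete, the sketch below builds such a Boolean table from five invented transaction lines. It uses `pd.crosstab` only for brevity; this is an illustrative shortcut and not necessarily one of the two ways mentioned above.

```python
import pandas as pd

# Invented transaction lines: one row per product in an invoice
lines = pd.DataFrame({
    "Invoice Number": ["A1", "A1", "A2", "A3", "A3"],
    "Description": ["mug", "bag", "mug", "bag", "lamp"],
})

# Count occurrences per (invoice, product), then convert counts to True/False
table = pd.crosstab(lines["Invoice Number"], lines["Description"]) > 0

print(table)
```

Each row is one transaction and each column one product, so a `True` cell means that the invoice contains that product.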


### Pivot table


This approach uses classical pandas methods such as pivot_table and groupby.

First of all, we group rows into transactions:


In [ ]:
transactions = df.groupby(['Invoice Number', 'Description'])['Quantity'].sum().reset_index(name ='Count')
transactions

For the next steps, we will use the data for the 20 most popular products.


In [ ]:
transactions_popular = transactions[transactions.Description.isin(popular.index[:20])]
transactions_popular

Then we use pivot_table to transform this dataset into the required market basket structure:


In [ ]:
basket = transactions_popular.pivot_table(index='Invoice Number', columns='Description', values='Count', aggfunc='sum').fillna(0)
basket

Next, we should change positive quantities to True and zero or negative quantities to False:


In [ ]:
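def encode_units(x):
    # True if the product has a positive net quantity in the invoice
    return x > 0

# In pandas >= 2.1, DataFrame.map is the preferred name for applymap
basket_sets = basket.applymap(encode_units)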
basket_sets

This is the required market basket dataset, which contains information about 11339 orders and 20 types of purchases.
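
The element-wise encoding step can also be written as a single vectorized comparison. The sketch below assumes a small invented `basket` frame with one negative net quantity, to show that non-positive values map to False just as in `encode_units`.

```python
import pandas as pd

# Invented pivoted quantities; -1.0 stands for a net return
basket = pd.DataFrame(
    {"bag": [2.0, 0.0, -1.0], "mug": [0.0, 5.0, 3.0]},
    index=pd.Index(["A1", "A2", "A3"], name="Invoice Number"),
)

# Positive quantity -> True, zero or negative -> False
basket_sets = basket > 0

print(basket_sets)
```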

The next step is saving our Dataframe and 'basket sets'.


In [ ]:
df.to_csv("df_clean_5.csv", index=False)

In [ ]:
basket_sets.to_csv("basket_sets.csv", index=True)
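
When the saved basket is loaded again, the invoice numbers need to be restored as the index. The sketch below illustrates the round trip with an in-memory buffer and an invented two-row frame instead of a real file.

```python
import io
import pandas as pd

# Invented Boolean basket with invoice numbers as the index
basket_sets = pd.DataFrame(
    {"bag": [True, False], "mug": [False, True]},
    index=pd.Index([536365, 536370], name="Invoice Number"),
)

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
basket_sets.to_csv(buffer, index=True)
buffer.seek(0)

# index_col restores the invoice numbers as the index
restored = pd.read_csv(buffer, index_col="Invoice Number")

print(restored.dtypes)
```

Saving with `index=True` matters here: without it, the invoice numbers would be lost and the rows could no longer be traced back to transactions.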

## Conclusions


In this lab, we learned how to perform a market basket analysis on our dataset using classical data visualization techniques, and we prepared the market basket dataset needed for association rules.

We will continue the market basket analysis in the next lab (Lab 5).


### Thank you for completing this lab!

## Author

<a href="https://author.skills.network/instructors/veronika_lanchuv?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkGuidedProjectsIBMSkillsNetworkGPXX02PYEN2993-2023-01-01">Veronika Lanchuv</a>

### Other Contributors

<a href="https://author.skills.network/instructors/yaroslav_vyklyuk_2?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkGuidedProjectsIBMSkillsNetworkGPXX02PYEN2993-2023-01-01">Yaroslav Vyklyuk</a>

<a href="https://author.skills.network/instructors/olga_kavun?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkGuidedProjectsIBMSkillsNetworkGPXX02PYEN2993-2023-01-01">Olga Kavun</a>


## Change Log

| Date (YYYY-MM-DD) | Version | Changed By | Change Description                 |
| ----------------- | ------- | ---------- | ---------------------------------- |
| 2022-04-25        | 2.0     | Veronika   | lab is done                        |

<hr>

## <h3 align="center"> © IBM Corporation 2023. All rights reserved. </h3>
