# Gufhtugu Publications - Visualization & Basket Analysis (Apriori Algorithm)
This notebook aims to uncover interesting insights from the Gufhtugu dataset, which is one of the largest public e-commerce datasets from Pakistan. It also applies the Apriori algorithm to obtain association rules from the dataset and visualizes them.

This notebook is split into 3 main parts:
1. Data cleaning,
2. Visualization & analysis,
3. Market basket analysis (apriori algorithm).
    
The analysis addresses the following questions:
* What are the best selling books?
* How does the number of orders vary by month?
* Does the number of sales depend on the day of the week?
    
More coming soon! Contributions and feedback are welcome. Connect with me on [LinkedIn](http://www.linkedin.com/in/muhammad-ali-857016172/).

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns
import os
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules

plt.rcParams['figure.figsize'] = [12, 8]

In [None]:
# Load dataset
df = pd.read_csv('/kaggle/input/gufhtugu-publications-dataset-challenge/GP Orders - 4.csv')
df.head(20)

In [None]:
df.info()  # View column names and their data types

In [None]:
# Get number of missing values in each column
pd.isna(df).sum()

In [None]:
# Rename columns to make them easier to use
df.columns = ['order_number', 'order_status', 'book_name', 'order_date', 'city']
df.head()

In [None]:
df['order_status'].value_counts()  # View values in column 'order_status' along with their frequencies

In [None]:
df['book_name'].value_counts().head(10)  # View 10 most frequent values in column 'book_name' along with their frequencies

In [None]:
df['city'].value_counts().head(10)  # View 10 most frequent values in column 'city' along with their frequencies

# Data Cleaning

We notice that each entry in the book_name column may contain multiple book names separated by '/'. So we split the column on '/'. We then transform the resulting dataframe so that each row corresponds to a single book sale.

In [None]:
# Split the column on '/'
s = df['book_name'].str.split('/', expand=True).stack()

# Melting dataframe so that we have one book in each row
s.index = s.index.droplevel(-1) # to line up with df's index
s.name = 'book_name' # needs a name to join

df = df.drop(columns='book_name').join(s)
df.head(10)

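The same reshaping can be sketched on a toy frame with `Series.str.split` followed by `DataFrame.explode` (available in pandas 0.25 and later). The order numbers and titles below are invented for illustration, and the extra `str.strip()` is an assumption that spaces around '/' are not part of a title.

```python
import pandas as pd

# Toy orders (invented): the second order contains two books separated by '/'
toy = pd.DataFrame({
    'order_number': [1, 2],
    'book_name': ['Python Programming', 'Linux - An Introduction / Python Programming'],
})

# Split each entry into a list of titles, then give every title its own row
toy_long = toy.assign(book_name=toy['book_name'].str.split('/')).explode('book_name')

# Assumption: surrounding whitespace is not part of the title
toy_long['book_name'] = toy_long['book_name'].str.strip()

print(toy_long)
```

Without the strip step, ' Python Programming' and 'Python Programming' would be counted as two different books.
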
We also notice that missing values in the book_name and city columns are represented with question marks. The number of question marks is not fixed, and entries that denote missing values may also contain '-' or whitespace. So we find all such entries using a regular expression and replace them with NA.

In [None]:
missing_book_names_mask = df['book_name'].str.contains(pat="^[-? ]+$", na=False)
df.loc[missing_book_names_mask, 'book_name'] = np.nan

missing_cities_mask = df['city'].str.contains(pat="^[-? ]+$", na=False)
df.loc[missing_cities_mask, 'city'] = np.nan

df.head()

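To see which entries the pattern `^[-? ]+$` treats as missing, here is a small check with Python's `re` module, which `str.contains` uses under the hood. The sample strings are invented for illustration.

```python
import re

# Same pattern as in the cell above: the whole entry consists only of '-', '?' and spaces
pattern = re.compile(r"^[-? ]+$")

samples = ["??", "???? ", "- ?", "Karachi", "What is AI?", ""]
flags = [bool(pattern.search(text)) for text in samples]

for text, is_missing in zip(samples, flags):
    print(repr(text), "->", "missing" if is_missing else "kept")
```

An entry that merely contains a question mark, such as a title ending in '?', is kept because the pattern must match the entire string.
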
In [None]:
# Further clean book_name and city columns by transforming entries to upper case and stripping periods
df['book_name'] = df['book_name'].str.upper()
df['city'] = df['city'].str.upper().str.strip('.')

df.head()

In [None]:
# We notice that some books appear multiple times under different names. We merge the variants of some of the most popular books.
# book_name was upper-cased above, so the keys are upper-cased as well to make sure they match.
book_name_variants = {"Linux - An Introduction  (Release Data - October 3, 2020)": "LINUX - AN INTRODUCTION",
                      "PYTHON PROGRAMMING- RELEASE DATE: AUGUST 14, 2020": "PYTHON PROGRAMMING",
                      "(C++) ++سی": "(C++)"}
df['book_name'] = df['book_name'].replace({variant.upper(): name for variant, name in book_name_variants.items()})

In [None]:
# order_status is a categorical variable and order_date is datetime so change dtypes accordingly.
df['order_status'] = pd.Categorical(df['order_status'])
df['order_date'] = pd.to_datetime(df['order_date'])

# Exploratory Data Analysis

In [None]:
# Plot number of orders against time

df['date'] = df['order_date'].dt.date
df_sales = pd.DataFrame({'count_sales': df['date'].value_counts().sort_index()})
df_sales.index = pd.to_datetime(df_sales.index)

df_sales.head()

In [None]:
df_sales.plot(y='count_sales')
plt.show()

In [None]:
# Smooth plot so we can see trends more clearly. Using weekly moving average.
df_weekly_ma = df_sales.rolling(7, min_periods=1).mean()

df_weekly_ma.plot(y='count_sales')
plt.show()

We see an abrupt drop in demand in February 2020, followed by a recovery in May. This is likely an effect of COVID and the resulting lockdown.

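One caveat about the moving average: `value_counts` only returns dates that have at least one order, so `rolling(7)` averages over the last seven observed dates rather than seven calendar days. Whether this dataset contains such gaps is not checked here. The sketch below, on invented counts, shows one way to fill missing days with zero before smoothing.

```python
import pandas as pd

# Toy daily counts (invented) with a gap: no rows for 3 and 4 January
toy_sales = pd.DataFrame(
    {'count_sales': [7, 7, 7]},
    index=pd.to_datetime(['2020-01-01', '2020-01-02', '2020-01-05']),
)

# Insert the missing calendar days with zero sales
toy_daily = toy_sales.asfreq('D', fill_value=0)

# 7-day moving average over calendar days
toy_ma = toy_daily.rolling(7, min_periods=1).mean()
print(toy_ma)
```
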
In [None]:
# Visualize order status values and label each bar with percentage of total
total = len(df)
ax = sns.countplot(x="order_status", data=df)
for p in ax.patches:
    height = p.get_height()
    ax.text(p.get_x()+p.get_width()/2.,
            height + 3,
            f"{height/total*100 :1.2f}%",
            ha="center")

ax.set(xlabel='Order Status', ylabel='Total Sales Received')
ax.set_title("Order Status Values (as Percentages of Total Orders)",fontsize=15)

plt.show()

## What are the best selling books?

In [None]:
n = 5  # Number of best selling books to show

best_sellers = df['book_name'].value_counts().head(n)

# Visualize sales of best selling books and label each bar with percentage of total
total = len(df)
plt.figure(figsize=(16,8))
ax = sns.barplot(x=best_sellers.index, y=best_sellers.values)  # keyword arguments are required in recent seaborn versions
for p in ax.patches:
    height = p.get_height()
    ax.text(p.get_x()+p.get_width()/2.,
            height + 3,
            f"{height/total*100 :1.2f}%",
            ha="center")

ax.set(xlabel='Book Names', ylabel='Total Sales')
ax.set_title("Best Selling Books (Sales as Percentages of Total Orders)",fontsize=15)
ax.set_xticklabels(ax.get_xticklabels(), rotation=40, ha="right")

plt.show()

## How does the number of orders vary by month?

In [None]:
df_sales['day_of_week'] = df_sales.index.day_name()
df_sales['day_of_week'] = pd.Categorical(df_sales['day_of_week'], ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday'])

df_sales['month'] = df_sales.index.month_name()
df_sales['month'] = pd.Categorical(df_sales['month'], df_sales['month'].drop_duplicates().tolist())

df_sales['year'] = df_sales.index.year

df_sales.head()

In [None]:
month_aggregated = pd.DataFrame(df_sales.groupby("month")["count_sales"].sum()).reset_index().sort_values('month')
ax = sns.barplot(data=month_aggregated,x="month",y="count_sales")
ax.set(xlabel='Month', ylabel='Total Sales received')
ax.set_title("Total Sales By Month",fontsize=15)

plt.show()

We do not have enough data to judge whether there is yearly seasonality.

## Does the number of sales depend on the day of the week?

In [None]:
day_aggregated = pd.DataFrame(df_sales.groupby("day_of_week")["count_sales"].sum()).reset_index().sort_values('count_sales')
ax = sns.barplot(data=day_aggregated,x="day_of_week",y="count_sales")
ax.set(xlabel='Day of the Week', ylabel='Total Sales received')
ax.set_title("Total Sales by Day of the Week",fontsize=15)

plt.show()

We uncover an interesting insight: sales are lowest mid-week (on Tuesday, Wednesday and Thursday) and highest on the weekend.

# Market Basket Analysis Using Association Rules and the Apriori Algorithm

## What is the Apriori Algorithm?
> "Apriori is an algorithm for frequent item set mining and association rule learning over relational databases. It proceeds by identifying the frequent individual items in the database and extending them to larger and larger item sets as long as those item sets appear sufficiently often in the database. The frequent item sets determined by Apriori can be used to determine association rules which highlight general trends in the database: this has applications in domains such as market basket analysis."
\- Wikipedia, 2020

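The level-wise search described in the quote can be sketched in plain Python. This is a minimal illustration on invented orders, not the implementation used by mlxtend; the function name `frequent_itemsets` is hypothetical.

```python
from itertools import combinations

def frequent_itemsets(transactions, min_support):
    """Return {itemset: support} for all itemsets with support >= min_support."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def support(itemset):
        # Fraction of transactions that contain every item in the itemset
        return sum(itemset <= t for t in transactions) / n

    # Level 1 candidates: every individual item
    candidates = {frozenset([item]) for t in transactions for item in t}
    frequent = {}
    k = 1
    while candidates:
        # Keep only the candidates that appear often enough
        level = {c: support(c) for c in candidates}
        level = {c: s for c, s in level.items() if s >= min_support}
        frequent.update(level)
        k += 1
        # Extend: join frequent (k-1)-itemsets into k-itemsets ...
        joined = {a | b for a, b in combinations(level, 2) if len(a | b) == k}
        # ... and keep only those whose (k-1)-subsets are all frequent
        candidates = {c for c in joined
                      if all(frozenset(s) in level for s in combinations(c, k - 1))}
    return frequent

# Toy orders (invented for illustration)
toy_orders = [
    {'PYTHON PROGRAMMING', 'LINUX - AN INTRODUCTION'},
    {'PYTHON PROGRAMMING', 'LINUX - AN INTRODUCTION', '(C++)'},
    {'PYTHON PROGRAMMING', '(C++)'},
    {'LINUX - AN INTRODUCTION'},
]
result = frequent_itemsets(toy_orders, min_support=0.5)
for itemset, s in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), s)
```

The three-book itemset is never counted here: one of its two-book subsets is infrequent, so the candidate is pruned before its support is computed.
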
## Useful Links:
The following resources were used as references for this section of the notebook:
1. [Wikipedia entry on the Apriori algorithm](https://en.wikipedia.org/wiki/Apriori_algorithm),
2. [Kdnuggets.com article "Association Rules and the Apriori Algorithm: A Tutorial"](https://www.kdnuggets.com/2016/04/association-rules-apriori-algorithm-tutorial.html/2),
3. [Kaggle notebook by Yogesh](https://www.kaggle.com/yugagrawal95/market-basket-analysis-apriori-in-python) which was used as a reference in writing this code,
4. Network graph code via [intelligentonlinetools.com](https://intelligentonlinetools.com/blog/2018/02/10/how-to-create-data-visualization-for-association-rules-in-data-mining/).

In [None]:
# Transform data into the form required by mlxtend
df_dropped_book_names_na = df.dropna(subset=['book_name'])  # Drop NAs in book name column
series_book_names_as_lists = df_dropped_book_names_na.groupby('order_number')['book_name'].apply(list)

# The mlxtend library requires book names to be columns and values to indicate whether that book was present in the order
# (True denotes that the book was present in the order while False denotes that it was not)
df_counts_binarized = series_book_names_as_lists.map(lambda x: '/'.join(map(str, x))).str.get_dummies(sep='/').astype(bool)
df_counts_binarized

Support is a measure of the popularity of an itemset. It is the proportion of transactions that include the itemset (for example, if a book is part of 50% of transactions, its support is 0.5). We set the minimum support to 0.01, i.e. we only consider itemsets that are part of at least 1% of all orders.

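As a small worked example, support can be computed by hand on a one-hot order matrix of the same shape as `df_counts_binarized`. The orders below are invented for illustration.

```python
import pandas as pd

# Toy one-hot order matrix (invented): rows are orders, columns are books
toy_basket = pd.DataFrame({
    'PYTHON PROGRAMMING':      [1, 1, 0, 1],
    'LINUX - AN INTRODUCTION': [1, 0, 0, 0],
    '(C++)':                   [0, 0, 1, 1],
}).astype(bool)

# Support of each single book = share of orders that contain it
single_support = toy_basket.mean()

# Support of a pair = share of orders that contain both books
pair_support = (toy_basket['PYTHON PROGRAMMING'] & toy_basket['(C++)']).mean()

print(single_support)
print(pair_support)
```

With a minimum support of 0.5 on this toy data, only 'PYTHON PROGRAMMING' and '(C++)' would be kept as frequent single items.
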
In [None]:
freq_items = apriori(df_counts_binarized, min_support=0.01, use_colnames=True, verbose=1)
freq_items.head(10)

Lift measures how much more likely item Y is to be purchased when item X is purchased, while controlling for the popularity of item Y. A lift greater than 1 means that Y is more likely to be purchased when X is purchased. A lift of 1 implies that there is no association between the sales of the two items, while a lift smaller than 1 means that Y is less likely to be purchased when X is purchased.

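A short worked example may help here. Using the standard definition, lift(X → Y) = support(X and Y) / (support(X) × support(Y)), which equals the confidence of the rule divided by support(Y). The orders below are invented for illustration.

```python
# Toy orders (invented): each set is one order
toy_orders = [
    {'X', 'Y'},
    {'X', 'Y'},
    {'X'},
    {'Y'},
    {'Z'},
]

def support(itemset, orders):
    """Proportion of orders that contain every item in `itemset`."""
    return sum(set(itemset) <= order for order in orders) / len(orders)

supp_x = support({'X'}, toy_orders)        # 3 of 5 orders
supp_y = support({'Y'}, toy_orders)        # 3 of 5 orders
supp_xy = support({'X', 'Y'}, toy_orders)  # 2 of 5 orders

confidence = supp_xy / supp_x  # share of orders with X that also contain Y
lift = confidence / supp_y     # same as supp_xy / (supp_x * supp_y)

print(round(confidence, 3), round(lift, 3))
```

Here the lift is slightly above 1, so buying X makes Y a little more likely than its overall popularity alone would suggest.
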
In [None]:
rules_mlxtend = association_rules(freq_items, metric="lift", min_threshold=1).sort_values('lift', ascending=False) # sort by lift to get rules where association is strongest
rules_mlxtend.head(10)

In [None]:
import networkx as nx

def draw_graph(rules, rules_to_show):
    """Draw the first `rules_to_show` association rules as a directed graph.

    Each rule gets its own node (R0, R1, ...). Edges run from the antecedent
    books to the rule node and from the rule node to the consequent books.
    """
    G1 = nx.DiGraph()
    rules_to_show = min(rules_to_show, len(rules))  # never index past the last rule
    colors = np.random.rand(rules_to_show)  # one random edge color value per rule
    rule_nodes = {"R" + str(i) for i in range(rules_to_show)}

    for i in range(rules_to_show):
        rule_node = "R" + str(i)
        G1.add_node(rule_node)
        for a in rules.iloc[i]['antecedents']:
            G1.add_edge(a, rule_node, color=colors[i], weight=2)
        for c in rules.iloc[i]['consequents']:
            G1.add_edge(rule_node, c, color=colors[i], weight=2)

    # Rule nodes are yellow, book nodes are green
    color_map = ['yellow' if node in rule_nodes else 'green' for node in G1]

    edges = list(G1.edges())
    edge_colors = [G1[u][v]['color'] for u, v in edges]
    weights = [G1[u][v]['weight'] for u, v in edges]

    pos = nx.spring_layout(G1, k=16, scale=1)
    # `edgelist` is the keyword networkx expects (the original `edges` is not a valid argument)
    nx.draw(G1, pos, edgelist=edges, node_color=color_map, edge_color=edge_colors, width=weights, font_size=16,
            with_labels=False)

    # Raise label positions so the text sits above the nodes, then draw labels once
    label_pos = {node: (x, y + 0.07) for node, (x, y) in pos.items()}
    nx.draw_networkx_labels(G1, label_pos)
    plt.show()

draw_graph(rules_mlxtend, 10)

## Ideas for further work:
1. Fix Urdu text being unreadable in plots,
2. Answer the questions included in the dataset description,
3. Analyze order cities,
4. Visualize order timings,
5. Analyze returns (which cities they come from, which books are frequently returned, etc.),
6. Time series analysis (check for stationarity, transform as required, fit and analyze time series models).

More coming soon!