In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

In [None]:
df = pd.read_csv("../input/store-transaction-data/Hackathon_Working_Data.csv")
df

The dataset contains information regarding the sales of ten different stores during a period of three months. Our objective is to discover the particular characteristics of each store's sales and possibly determine strategies to implement based on our findings.

In [None]:
df.info()

## Sales by Store Analysis

In the dataset, each transaction has its own Bill_Id number. However, some Bill_Id values are used by more than one store. To avoid miscalculations, we'll create a unique Id by combining the Storecode with the Bill_Id in a new column.
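
The overlap described above can be checked directly by counting how many stores use each bill number. This is a minimal sketch on made-up data; the store codes and bill numbers are hypothetical and only mirror the column names of the dataset.

```python
import pandas as pd

# Toy data (hypothetical values): bill "T1" is issued by two different stores.
toy = pd.DataFrame({
    "STORECODE": ["N1", "N1", "N2", "N2"],
    "BILL_ID":   ["T1", "T2", "T1", "T3"],
})

# Number of distinct stores that use each BILL_ID; values above 1 are collisions.
stores_per_bill = toy.groupby("BILL_ID")["STORECODE"].nunique()
shared_bills = stores_per_bill[stores_per_bill > 1]

# Combining store code and bill number removes the ambiguity.
toy["UNIQUE_ID"] = toy["STORECODE"].str.cat(toy["BILL_ID"], sep="_")
print(shared_bills)
print(toy)
```

If `shared_bills` is empty on the real data, the combined key is harmless but unnecessary.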

In [None]:
df['UNIQUE_ID'] = df['STORECODE'].str.cat(df['BILL_ID'],sep="_")
df

In [None]:
# Select the numeric columns before averaging; recent pandas versions raise on text columns
by_store = df.groupby("UNIQUE_ID")[["DAY","BILL_AMT","QTY"]].mean()
by_store

In [None]:
merged_df = pd.merge(df,by_store,on="UNIQUE_ID")
merged_df

In [None]:
merged_df = merged_df.drop(["DAY_x","BILL_AMT_x"], axis=1)
merged_df

In [None]:
# One row per bill, keeping only the day and the bill amount
unique_sales = merged_df.groupby("UNIQUE_ID")[["DAY_y","BILL_AMT_y"]].mean()
unique_sales
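
The steps above reduce the line items to one row per bill. That matters if the bill amount is repeated on every line item of a bill, which is the assumption behind this sketch. The data below is made up for illustration.

```python
import pandas as pd

# Toy line items (hypothetical): BILL_AMT repeats on every row of the same bill.
items = pd.DataFrame({
    "UNIQUE_ID": ["N1_T1", "N1_T1", "N1_T1", "N2_T1"],
    "DAY":       [4, 4, 4, 7],
    "BILL_AMT":  [300.0, 300.0, 300.0, 50.0],
})

# Summing the raw column counts bill N1_T1 three times.
naive_total = items["BILL_AMT"].sum()

# One row per bill: the mean of a repeated value is the value itself.
per_bill = items.groupby("UNIQUE_ID")[["DAY", "BILL_AMT"]].mean()
true_total = per_bill["BILL_AMT"].sum()

print(naive_total, true_total)
```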

Let's start by finding the total sales by store.

In [None]:
unique_sales['STORE'] = unique_sales.index.str.split('_').str[0]
unique_sales

In [None]:
sales_by_store = unique_sales.groupby("STORE").sum().sort_values('BILL_AMT_y', ascending=False)
plt.figure(figsize=(20,10))
sns.barplot(x=sales_by_store.index,y=sales_by_store['BILL_AMT_y'],data=sales_by_store)
plt.title("Total Sales by Store")
plt.xlabel("Store")
plt.ylabel("Sales")
plt.grid(axis='y',color='black')

We can observe that stores 7, 9, 5, 2, and 4 are well above the rest in terms of total sales during the three-month period. Let's now take a look at their variability and at their sales by day of the month, both in total and on average.

In [None]:
plt.figure(figsize=(20,10))
sns.boxplot(x=df["STORECODE"],y=df["BILL_AMT"],data=df)
plt.title("Variability of Sales by Store")
plt.ylabel("Sales")

## What are the daily sales by Store?

In [None]:
for i in unique_sales.STORE.unique():
    # Sum only the bill amounts; the STORE text column is not needed here
    store = unique_sales.loc[unique_sales.STORE == i].groupby("DAY_y")[["BILL_AMT_y"]].sum()
    plt.figure(figsize=(20,5))
    sns.barplot(x=store.index,y=store["BILL_AMT_y"],data=store).axhline(store["BILL_AMT_y"].mean(),color='purple')
    plt.title("Total Sales from Store: " + i)
    plt.xlabel("Day")
    plt.ylabel("Total Sales")
    plt.grid(axis='y',color='black')

In [None]:
for i in unique_sales.STORE.unique():
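    # Average only the bill amounts; recent pandas versions raise when averaging the STORE text column
    store = unique_sales.loc[unique_sales.STORE == i].groupby("DAY_y")[["BILL_AMT_y"]].mean()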
    plt.figure(figsize=(20,5))
    sns.barplot(x=store.index,y=store["BILL_AMT_y"],data=store).axhline(store["BILL_AMT_y"].mean(),color='purple')
    plt.title("Average Sales from Store: " + i)
    plt.xlabel("Day")
    plt.ylabel("Average Bill Amount (3 months)")
    plt.grid(axis='y',color='black')

The two sets of figures above are very revealing. The first set adds up the sales recorded on each day of the month across the three months. The second set averages the bill amounts recorded on each day of the month over the same period.

The first set confirms that stores 7, 9, 5, 2, and 4 have the highest total sales. However, the second set tells a different story: stores 9, 8, 4, 6, and 2 have the highest averages. This may suggest that the other stores had a few days of good sales but are less consistent.
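
One way to see how totals and averages can rank stores differently is to put the number of bills next to both figures. This is a minimal sketch on made-up data, assuming one row per bill as in `unique_sales`; the store labels and amounts are hypothetical.

```python
import pandas as pd

# Toy per-bill table (hypothetical): store A has many small bills, store B few large ones.
bills = pd.DataFrame({
    "STORE":    ["A"] * 4 + ["B"] * 2,
    "DAY":      [1, 1, 1, 2, 1, 2],
    "BILL_AMT": [100.0, 100.0, 100.0, 100.0, 200.0, 150.0],
})

by_store = bills.groupby("STORE")["BILL_AMT"]
summary = pd.DataFrame({
    "total_sales": by_store.sum(),    # sum of all bills, as in the total charts
    "mean_bill":   by_store.mean(),   # average bill amount, as in the average charts
    "n_bills":     by_store.count(),  # how many bills produced those figures
})
print(summary)
```

In this toy case store A leads on total sales while store B leads on average bill amount, simply because B issues fewer but larger bills.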

In order to have a better understanding of what could be happening, we need to take a deeper look into what's being sold at each store.

## Which category sells the most items in general and by store? 

We will begin by determining which product categories have the highest sales in general and at each store. 

In [None]:
# Select the numeric columns before summing so text columns are not aggregated
top_items_sales = merged_df.groupby('GRP')[['QTY_x','PRICE','VALUE']].sum().sort_values("VALUE",ascending=False).head(25)
plt.figure(figsize=(12,8))
sns.barplot(x='VALUE',y=top_items_sales.index,data=top_items_sales)
plt.title("Sales from Top 25 Categories")
plt.xlabel("Sales")
plt.ylabel("Categories")
plt.grid(axis='x',color='black');

In [None]:
top_items = merged_df.groupby('GRP')[['QTY_x','PRICE']].sum().sort_values("QTY_x",ascending=False).head(25)
plt.figure(figsize=(12,8))
sns.barplot(x='QTY_x',y=top_items.index,data=top_items)
plt.title("Number of Units Sold by Top 25 Categories")
plt.xlabel("Number of Units")
plt.ylabel("Categories")
plt.grid(axis='x',color='black');

In [None]:
for i in merged_df.STORECODE.unique():
    # Filter with merged_df's own column rather than a mask built from df
    x = merged_df.loc[merged_df.STORECODE == i].groupby("GRP")[["VALUE","QTY_x"]].sum().sort_values("VALUE",ascending=False).head(25)
    plt.figure(figsize=(12,8))
    sns.barplot(x='VALUE',y=x.index,data=x)
    plt.title("Top 25 Categories by Sales from store: " + i)
    plt.ylabel("Categories")
    plt.xlabel("Sales")

By looking into the different product categories, we can see that each store caters to a different clientele. This may be a result of location or business strategy. Sales in some stores are dominated by cleaning products, in others by pantry products, and in some by snacks or an even combination of all of them.
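
How strongly a category dominates a store can also be expressed as a number: the category's share of that store's total sales. The sketch below uses made-up data with the dataset's column names; the category labels and values are hypothetical.

```python
import pandas as pd

# Toy line items (hypothetical values); GRP is the product category.
sales = pd.DataFrame({
    "STORECODE": ["N1", "N1", "N1", "N2", "N2"],
    "GRP":       ["BISCUITS", "BISCUITS", "EDIBLE OILS", "EDIBLE OILS", "DETERGENTS"],
    "VALUE":     [60.0, 20.0, 20.0, 50.0, 50.0],
})

# Sales per store and category, then each category's share of its store's total.
by_cat = sales.groupby(["STORECODE", "GRP"])["VALUE"].sum()
share = by_cat / by_cat.groupby(level="STORECODE").transform("sum")

# Largest category share per store: a value close to 1 means one category dominates.
top_share = share.groupby(level="STORECODE").max()
print(share)
print(top_share)
```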

Similarly, we will analyze the number of units sold by category at each store and determine if they are congruent with sales.

In [None]:
for i in merged_df.STORECODE.unique():
    # Filter with merged_df's own column rather than a mask built from df
    x = merged_df.loc[merged_df.STORECODE == i].groupby("GRP")[["VALUE","QTY_x"]].sum().sort_values("QTY_x",ascending=False).head(25)
    plt.figure(figsize=(12,8))
    sns.barplot(x='QTY_x',y=x.index,data=x)
    plt.title("Number of Units Sold by Category from store: " + i)
    plt.xlabel("Number of Units")
    plt.ylabel("Categories")
    plt.grid(axis='x',color='black')

It would be helpful to determine the most and least expensive categories, as that would help us better understand the results of the sales and units analysis at each store.
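
A plain mean of the price column gives every row the same weight, regardless of how many units were sold at that price. An alternative is a quantity-weighted price, sketched below on made-up data. The sketch assumes that VALUE equals PRICE times QTY, which has not been verified for this dataset.

```python
import pandas as pd

# Toy line items (hypothetical); VALUE is assumed here to equal PRICE * QTY.
rows = pd.DataFrame({
    "GRP":   ["SOAP", "SOAP", "TEA"],
    "PRICE": [10.0, 50.0, 40.0],
    "QTY":   [9, 1, 2],
})
rows["VALUE"] = rows["PRICE"] * rows["QTY"]

totals = rows.groupby("GRP")[["VALUE", "QTY"]].sum()
prices = pd.DataFrame({
    # Plain mean of the listed prices: every row counts the same.
    "mean_price": rows.groupby("GRP")["PRICE"].mean(),
    # Quantity-weighted price: total value divided by total units.
    "weighted_price": totals["VALUE"] / totals["QTY"],
})
print(prices)
```

The two measures agree when a category sells at a single price and diverge when most units are sold at the cheaper end of the range.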

In [None]:
# Select the numeric columns before averaging; recent pandas versions raise on text columns
grp_per_price = merged_df.groupby("GRP")[["PRICE","QTY_x"]].mean().sort_values("PRICE", ascending=False).head(25)
plt.figure(figsize=(12,8))
sns.barplot(x='PRICE',y=grp_per_price.index,data=grp_per_price)
plt.title("25 Most Expensive Categories")
plt.ylabel("Category")
plt.xlabel("Average Price")
plt.grid(axis='x',color='black')

In [None]:
grp_per_price_2 = merged_df.groupby("GRP")[["PRICE","QTY_x"]].mean().sort_values("PRICE", ascending=True).head(25)
plt.figure(figsize=(12,8))
sns.barplot(x='PRICE',y=grp_per_price_2.index,data=grp_per_price_2)
plt.title("25 Least Expensive Categories")
plt.ylabel("Category")
plt.xlabel("Average Price")
plt.grid(axis='x',color='black')

The two previous graphs help us make a more accurate sales analysis. Some of the conclusions we can draw are:

1. Sales at stores 1 and 5 are largely dominated by snacks (Biscuits).
2. Sales at stores 3, 6, and 7 are mostly dominated by pantry products (Edible oils).
3. Sales at store 2 are dominated by packaged tea and coffee.
4. The rest of the stores do not have a single dominant category but instead two, three, or more of them.

This may help explain why stores 4, 6, 8, and 9 have higher average sales: their sales do not rely solely on lower-priced categories such as snacks (as is the case for stores 1, 2, and 5) but also on higher-priced items such as cleaning products and specialty pantry products. Knowing this can help each store create new business strategies or reinforce its current ones. This is particularly important for stores whose sales are largely dominated by one or two categories and that may want to diversify.

In addition, let's try to find out which brands perform best.

## Which Brands sell more by dollars and by units?

In [None]:
# Select the numeric columns before summing so text columns are not aggregated
brands_v = merged_df.groupby('BRD')[['VALUE','QTY_x']].sum().sort_values('VALUE',ascending=False).head(25)
plt.figure(figsize=(12,8))
sns.barplot(x='VALUE',y=brands_v.index,data=brands_v)
plt.title("Top 25 Brands by Sales")
plt.xlabel("Sales")
plt.ylabel("Brands")
plt.grid(axis='x',color='black')

In [None]:
brands_q = merged_df.groupby('BRD')[['VALUE','QTY_x']].sum().sort_values('QTY_x',ascending=False).head(25)
plt.figure(figsize=(12,8))
sns.barplot(x='QTY_x',y=brands_q.index,data=brands_q)
plt.title("Top 25 Brands by Units Sold")
plt.xlabel("Number of Units")
plt.ylabel("Brands")
plt.grid(axis='x',color='black')

## Brand analysis by store

In [None]:
for i in merged_df.STORECODE.unique():
    brd_st = merged_df.loc[merged_df.STORECODE == i]
    brd = brd_st.groupby('BRD')[['VALUE','QTY_x']].sum().sort_values('VALUE',ascending=False).head(25)
    plt.figure(figsize=(12,8))
    sns.barplot(x='VALUE',y=brd.index,data=brd)
    plt.title("Top 25 Brands by Sales from store: " + i)
    plt.xlabel("Sales")
    plt.ylabel("Brand")
    plt.grid(axis='x',color='black')

In [None]:
for i in merged_df.STORECODE.unique():
    brd_st = merged_df.loc[merged_df.STORECODE == i]
    brd = brd_st.groupby('BRD')[['VALUE','QTY_x']].sum().sort_values('QTY_x',ascending=False).head(25)
    plt.figure(figsize=(12,8))
    sns.barplot(x='QTY_x',y=brd.index,data=brd)
    plt.title("Top 25 Brands by Units Sold from store: " + i)
    plt.xlabel("Number of Units")
    plt.ylabel("Brand")
    plt.grid(axis='x',color='black')

## Which Store Sells More Units?

It may also be helpful to analyze the physical volume that each store handles in order to understand its size, as this could be significant when designing a strategy.

In [None]:
# Select the numeric columns before summing so text columns are not aggregated
store_qty = merged_df.groupby("UNIQUE_ID")[['QTY_x','PRICE']].sum()
store_qty['STORE'] = store_qty.index.str.split('_').str[0]
store_qty

In [None]:
units_by_store = store_qty.groupby("STORE").sum().sort_values('QTY_x',ascending=False)
plt.figure(figsize=(20,10))
sns.barplot(x=units_by_store.index,y=units_by_store["QTY_x"],data=units_by_store)
plt.title("Quantity of Items Sold by Store")
plt.xlabel("Stores")
plt.ylabel("Units")
plt.grid(axis='y',color='black');

This graph shows that store 1 handles five times more volume than most stores. However, as our units analysis has shown, store 1's unit sales rely largely on biscuits. Therefore, despite handling a seemingly large volume, store 1 may not necessarily be the largest one.

The second biggest store by units sold is store 2, whose total sales are dominated by packaged tea and coffee. However, its unit sales rely mostly on pantry and cleaning products. We can then infer that store 2 is very likely a physically larger store.

As for store 7 (third largest by volume), most of its sales rely on pantry and personal care products. This store is also the one with the highest sales during the three months analyzed.

Stores 4 and 5 have very diversified sales. However, biscuits make up a very large share of their unit sales, which may help explain why they handle almost twice the unit volume of stores 10, 3, 9, 8, and 6.
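
Comparisons such as "five times more volume" or "almost twice the unit volume" can be read off directly by expressing each store's units as a multiple of the median store. This is a minimal sketch; the unit totals below are hypothetical and are not taken from the dataset.

```python
import pandas as pd

# Toy unit totals per store (hypothetical numbers, not taken from the dataset).
units = pd.Series({"N1": 5000, "N2": 2000, "N3": 1000, "N4": 1000, "N5": 900},
                  name="QTY")

# Each store's volume as a multiple of the median store.
relative_volume = (units / units.median()).sort_values(ascending=False)
print(relative_volume)
```

On the real data, the same calculation could be applied to the `QTY_x` column of `units_by_store`.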

In [None]:
str_brd = df.groupby("STORECODE")["BRD"].nunique().sort_values(ascending=False)
str_brd

## Sales by Month

Lastly, let's analyze the sales by month.

In [None]:
month_info = merged_df[["MONTH","UNIQUE_ID"]]
month_info

In [None]:
monthly_sales = pd.merge(month_info,unique_sales,on="UNIQUE_ID")
monthly_sales

In [None]:
monthly_sales['N_MONTH'] = monthly_sales['MONTH'].apply(lambda x: x[1])
monthly_sales

In [None]:
monthly_sales = monthly_sales.drop('MONTH',axis=1)
monthly_sales

In [None]:
monthly_sales['N_MONTH'] = pd.to_numeric(monthly_sales['N_MONTH'])
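
Taking the second character of the month label works as long as every label has a single digit. The sketch below shows a variant that extracts all the digits, assuming labels of the form `M1`, `M2`, and so on; the `M10` label is hypothetical and only illustrates the multi-digit case.

```python
import pandas as pd

# Hypothetical labels; "M10" is included only to show the multi-digit case.
months = pd.Series(["M1", "M2", "M3", "M10"])

# Taking the second character works for M1..M9 but truncates "M10" to 1.
second_char = months.str[1].astype(int)

# Extracting all the digits keeps the full number.
n_month = months.str.extract(r"(\d+)", expand=False).astype(int)

print(second_char.tolist())
print(n_month.tolist())
```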

In [None]:
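# Average only the numeric columns; recent pandas versions raise when averaging the STORE text column
month_uq = monthly_sales.groupby('UNIQUE_ID')[['DAY_y','BILL_AMT_y','N_MONTH']].mean()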
month_uq

In [None]:
m_sales = month_uq.groupby('N_MONTH').sum()
plt.figure(figsize=(20,10))
sns.barplot(x=m_sales.index,y='BILL_AMT_y',data=m_sales)
plt.title("Total Sales by Month")
plt.xlabel('Month')
plt.ylabel('Sales')
plt.grid(axis='y',color='black');