# Differences Between Online and Offline Sales

Hi, thanks for checking out this notebook.  

This is my first public notebook for a competition.  
I hope this helps your understanding of the dataset for the competition. 

## What you can find in this notebook
1. Comparison of sales by sales channel ('online' and 'store').
2. Distribution of sales within each sales channel.
  
## Intro
I have read the EDAs shared for this competition, and quite a few of them are useful starting points.  
They visualize and summarize the data as a whole, which makes the dataset easy to understand.  
  
In this notebook, however, I focus on one particular subject for a deeper understanding of the data.  
The subject is *the differences between the sales channels*.

In [None]:
# Let's first take care of the packages and settings.

import pandas as pd
import plotly.express as px

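# Use the full option name; the short form 'max_columns' can match more than one option in newer pandas versions.
pd.set_option('display.max_columns', None)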

color = px.colors.qualitative.Pastel

I will first merge the articles table and the transactions table.

In [None]:
articles = pd.read_csv('../input/h-and-m-personalized-fashion-recommendations/articles.csv')
transactions = pd.read_csv('../input/h-and-m-personalized-fashion-recommendations/transactions_train.csv')

# To save some memory, keep only the columns used in this notebook.
articles = articles[['article_id', 'product_type_name', 'product_group_name', 'graphical_appearance_name',
          'colour_group_name', 'perceived_colour_value_name', 'perceived_colour_master_name',
          'index_group_name', 'index_name', 'section_name', 'garment_group_name']]
transactions = transactions[['article_id', 't_dat', 'sales_channel_id']]

combined = transactions.merge(articles, on='article_id', how='left')
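
A left merge like the one above should keep exactly one row per transaction, provided `article_id` is unique in the articles table. The following is a minimal sketch of how that could be checked, using small made-up tables (`transactions_toy` and `articles_toy` are hypothetical stand-ins, not the competition data):

```python
import pandas as pd

# Toy stand-ins for the two tables (hypothetical values, not competition data).
transactions_toy = pd.DataFrame({
    "article_id": [101, 101, 102, 999],
    "sales_channel_id": [1, 2, 2, 1],
})
articles_toy = pd.DataFrame({
    "article_id": [101, 102],
    "index_group_name": ["Ladieswear", "Divided"],
})

# validate="many_to_one" raises a MergeError if article_id is duplicated in articles_toy.
# indicator=True adds a "_merge" column showing whether each row found a match.
merged_toy = transactions_toy.merge(
    articles_toy, on="article_id", how="left",
    validate="many_to_one", indicator=True,
)

# A left merge against unique keys keeps exactly one row per transaction.
rows_preserved = len(merged_toy) == len(transactions_toy)

# Number of transactions whose article_id has no entry in the articles table.
unmatched = int((merged_toy["_merge"] == "left_only").sum())
print(rows_preserved, unmatched)
```

Unmatched rows would end up with missing category values, and `groupby` drops those by default, so a check like this helps explain any gap between the chart totals and the raw transaction count.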

For better readability, I will create and use a 'channel' column to indicate online / store sales.

In [None]:
combined['channel'] = ''

combined.loc[combined.sales_channel_id==1, 'channel'] = 'store'
combined.loc[combined.sales_channel_id==2, 'channel'] = 'online'
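
The same labelling could also be written with `Series.map`. This is only a sketch on a toy frame (the `toy` data is made up); the id-to-label mapping is the one used in the cell above:

```python
import pandas as pd

# Same mapping as in the cell above: 1 -> store, 2 -> online.
channel_names = {1: "store", 2: "online"}

# Toy data standing in for the merged table.
toy = pd.DataFrame({"sales_channel_id": [1, 2, 2, 1, 2]})

# map() looks up each id in the dictionary; ids missing from it become NaN.
toy["channel"] = toy["sales_channel_id"].map(channel_names)

channel_counts = toy["channel"].value_counts().to_dict()
print(channel_counts)
```

One difference from the `.loc` version: an unexpected id would produce `NaN` here rather than an empty string.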

First, let's investigate whether there is any difference in sales (by quantity) across the article categories.

I fixed the chart width at 1500 because some columns have quite a lot of categories.

In [None]:
groups_list = ['index_group_name', 'index_name', 'product_group_name', 'product_type_name', 'section_name', 'garment_group_name']

for group in groups_list:
    com_g = combined[['article_id', 'channel', group]].groupby([group, "channel"]).count().reset_index()
    fig = px.bar(com_g, x=group, y='article_id', color='channel', barmode='group',
                 labels={'article_id': 'sales_quantity'}, title='Sales (Quantity) by ' + group, color_discrete_sequence=color,
                 width=1500, height=500)
    fig.show()

### Insights from these charts.
1. Overall, online sales are substantially higher.
2. Most sales are from Divided and Ladieswear.  

Most of the sales are **Divided and Ladieswear** articles purchased **online**.  
(Divided is a collection aimed at teenagers.)

Let's also investigate how visual elements of articles affect sales.

In [None]:
groups_list = ['graphical_appearance_name', 'colour_group_name', 'perceived_colour_value_name', 'perceived_colour_master_name']

for group in groups_list:
    com_g = combined[['article_id', 'channel', group]].groupby([group, "channel"]).count().reset_index()
    fig = px.bar(com_g, x=group, y='article_id', color='channel', barmode='group',
                 labels={'article_id': 'sales_quantity'}, title='Sales (Quantity) by ' + group, color_discrete_sequence=color,
                 width=1500, height=500)
    fig.show()

### Insights from these charts.
1. Most sales are from solid-colored shirts.
2. The most popular color is Black, followed by White and then Blue.  

**Solid and Black** dominate the sales.



## Ratio of Sales by Channel

Until now, we have only looked at absolute sales volume.  
However, this makes it hard to see the details of categories that are less popular than the dominant ones.  

To compare the channels more accurately, I calculated the ratio of sales by channel within each group.  

In [None]:
groups_list = ['index_group_name', 'index_name', 'product_group_name', 'product_type_name', 'section_name', 'garment_group_name']

for group in groups_list:
    combined_grouped = combined[['article_id', 'channel', group]].groupby([group, "channel"]).count().reset_index()

    temp = combined[['article_id', group]].groupby(group).count().reset_index()
    temp = temp.rename(columns={'article_id':'total'})

    combined_grouped = combined_grouped.merge(temp, on=group)
    combined_grouped['sales_perc'] = combined_grouped['article_id'] / combined_grouped['total'] * 100

    combined_grouped = combined_grouped.rename(columns={'article_id':'sales_quantity'})

    fig = px.bar(combined_grouped, x=group, y='sales_perc', color='channel', barmode='group',
                 labels={'sales_perc': 'Percentage within group'}, title='Sales Ratio within ' + group,
                 color_discrete_sequence=color[2:], hover_data=['sales_quantity'],
                 width=1500, height=500)
    fig.show()
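
The within-group percentage computed in the loop above can also be written without the intermediate merge by using `groupby(...).transform("sum")`. This is a minimal sketch on made-up data (the `toy` frame is hypothetical), not the notebook's actual result:

```python
import pandas as pd

# Toy data: 4 Divided sales (3 online, 1 store) and 2 Menswear sales (1 each).
toy = pd.DataFrame({
    "index_group_name": ["Divided"] * 4 + ["Menswear"] * 2,
    "channel": ["online", "online", "online", "store", "online", "store"],
})

# Count sales per (group, channel) pair.
counts = (
    toy.groupby(["index_group_name", "channel"])
       .size()
       .reset_index(name="sales_quantity")
)

# transform("sum") broadcasts each group's total back onto its rows.
group_total = counts.groupby("index_group_name")["sales_quantity"].transform("sum")
counts["sales_perc"] = counts["sales_quantity"] / group_total * 100
print(counts)
```

Within each group the percentages add up to 100, so a bar well above 50 for one channel means that channel dominates the group.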

### Insights from these charts.
The ratios make it much clearer that online sales are higher, with a few exceptions.
1. Menswear's online share is not as high as that of the other index groups.
2. Offline sales of Accessories, Cosmetics, Items, and Socks and Tights are higher than online sales.  

Notable facts:
* 76% of Divided sales come from the online channel, which suggests that teenagers purchase more online.



In [None]:
groups_list = ['graphical_appearance_name', 'colour_group_name', 'perceived_colour_value_name', 'perceived_colour_master_name']

for group in groups_list:
    combined_grouped = combined[['article_id', 'channel', group]].groupby([group, "channel"]).count().reset_index()

    temp = combined[['article_id', group]].groupby(group).count().reset_index()
    temp = temp.rename(columns={'article_id':'total'})

    combined_grouped = combined_grouped.merge(temp, on=group)
    combined_grouped['sales_perc'] = combined_grouped['article_id'] / combined_grouped['total'] * 100

    combined_grouped = combined_grouped.rename(columns={'article_id':'sales_quantity'})

    fig = px.bar(combined_grouped, x=group, y='sales_perc', color='channel', barmode='group',
                 labels={'sales_perc': 'Percentage within group'}, title='Sales Ratio within ' + group,
                 color_discrete_sequence=color[2:], hover_data=['sales_quantity'],
                 width=1500, height=500)
    fig.show()

### Insights from these charts.
1. The visual elements do not play much of a role here. Most categories show a similar gap between online and offline sales.
2. Clothes with a Hologram, Transparent, or Argyle appearance sold much more in stores, but the sample sizes are too small to consider this meaningful.
3. Metallic colors (Gold, Silver) are sold more in stores.
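
To make the sample-size caveat concrete, small categories could be flagged before reading anything into their ratios. The sketch below uses made-up data, and the `MIN_SALES` threshold is a hypothetical value chosen only for illustration:

```python
import pandas as pd

# Hypothetical cut-off: groups with fewer sales than this are treated as too small.
MIN_SALES = 5

# Toy data: 8 Solid sales (6 online, 2 store) and 2 Hologram sales (both store).
toy = pd.DataFrame({
    "graphical_appearance_name": ["Solid"] * 8 + ["Hologram"] * 2,
    "channel": ["online"] * 6 + ["store"] * 2 + ["store"] * 2,
})

# Per category: total number of sales and the share sold in stores.
summary = toy.groupby("graphical_appearance_name").agg(
    total=("channel", "size"),
    store_perc=("channel", lambda s: (s == "store").mean() * 100),
)

# Flag categories with enough sales for the ratio to be worth interpreting.
summary["enough_data"] = summary["total"] >= MIN_SALES
print(summary)
```

In this toy example Hologram shows a 100% store share, but with only two sales it is flagged as too small to interpret.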

## Sales distribution within each channel

This time, let's investigate the distribution of sales within each channel.  
We have already found that online sales are much higher, but is there any difference in how sales are distributed within each channel?  

In [None]:
groups_list = ['index_group_name', 'index_name', 'product_group_name', 'product_type_name', 'section_name', 'garment_group_name']

for group in groups_list:
    combined_grouped = combined[['article_id', 'channel', group]].groupby([group, "channel"]).count().reset_index()

    online_total = combined[combined.channel=='online']['article_id'].count()
    store_total = combined[combined.channel=='store']['article_id'].count()

    combined_grouped.loc[combined_grouped.channel=='online', 'sales_perc'] = combined_grouped['article_id'] / online_total * 100
    combined_grouped.loc[combined_grouped.channel=='store', 'sales_perc'] = combined_grouped['article_id'] / store_total * 100

    fig = px.bar(combined_grouped, x='sales_perc', y='channel', color=group, text='sales_perc', text_auto='.2s',
                 labels={'sales_perc': 'Percentage within channel'}, title='Sales Distribution within each channel by ' + group,
                 color_discrete_sequence=color, orientation='h',
                 width=1500, height=600)
    fig.update_traces(insidetextanchor='middle')
    fig.show()
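
The per-channel shares computed above can also be obtained from a cross-tabulation normalized by column. This is a minimal sketch on a made-up frame (`toy` is hypothetical), shown only to illustrate the calculation:

```python
import pandas as pd

# Toy data: 4 online sales and 4 store sales across three index groups.
toy = pd.DataFrame({
    "index_group_name": ["Divided", "Divided", "Ladieswear", "Menswear",
                         "Divided", "Menswear", "Menswear", "Ladieswear"],
    "channel": ["online"] * 4 + ["store"] * 4,
})

# normalize="columns" divides each count by its channel's total,
# so every channel column sums to 1 (multiplied by 100 for percentages).
share = pd.crosstab(toy["index_group_name"], toy["channel"], normalize="columns") * 100
print(share)
```

Here the denominator is the channel total rather than the group total, which is what distinguishes this view from the ratio section.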

### Insights from these charts.
1. As we found in the ratio section, Menswear has a larger share of offline sales while Divided has a larger share of online sales.
2. Dresses and Swimwear have a much larger share of online sales.
3. Accessories have a much larger share of offline sales.

In [None]:
groups_list = ['graphical_appearance_name', 'colour_group_name', 'perceived_colour_value_name', 'perceived_colour_master_name']

for group in groups_list:
    combined_grouped = combined[['article_id', 'channel', group]].groupby([group, "channel"]).count().reset_index()

    online_total = combined[combined.channel=='online']['article_id'].count()
    store_total = combined[combined.channel=='store']['article_id'].count()

    combined_grouped.loc[combined_grouped.channel=='online', 'sales_perc'] = combined_grouped['article_id'] / online_total * 100
    combined_grouped.loc[combined_grouped.channel=='store', 'sales_perc'] = combined_grouped['article_id'] / store_total * 100

    fig = px.bar(combined_grouped, x='sales_perc', y='channel', color=group, text='sales_perc', text_auto='.2s',
                 labels={'sales_perc': 'Percentage within channel'}, title='Sales Distribution within each channel by ' + group,
                 color_discrete_sequence=color, orientation='h',
                 width=1500, height=600)
    fig.update_traces(insidetextanchor='middle')
    fig.show()

### Insights from these charts.
1. Visual elements have surprisingly little effect on the sales of each channel.  
2. The distribution of visual categories sold through each channel is very similar.

## Conclusion
After looking at quite a few visualizations, we can conclude the following:
1. Online sales are much higher in general and notably higher in the Divided category.
2. Menswear is an exception: its online share is lower than that of the other index groups.
3. Accessories are sold more in stores.
4. Dresses and swimwear are sold much more online compared to other categories.
5. Visual elements and sales channel do not seem to have any relationship.

Thanks for reading this notebook.  
I will further investigate the differences between the sales channels with respect to customers.