# <u>**Getting started with H&M EDA**</u>

### In this notebook we analyse the transactions, customers and articles dataframes separately:
* Perform basic operations to understand the data (column types, missing data and so on)
* Visualize the trend of sales over time
* Find out which days are most favourable for shoppers
* Compare online vs offline sales
* Extract lists of the top articles and customers
* Engineer a few of the customer features
* Identify the top products of H&M

In [None]:
import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
import plotly.express as px
import dask.dataframe as dd

warnings.simplefilter('ignore')
%matplotlib inline

In [None]:
transactions_path = '../input/h-and-m-personalized-fashion-recommendations/transactions_train.csv'
customers_path = '../input/h-and-m-personalized-fashion-recommendations/customers.csv'
articles_path = '../input/h-and-m-personalized-fashion-recommendations/articles.csv'

# converting the column datatypes to save memory
c = 'category'
tran_dict = {'customer_id': c, 't_dat': c}
art_dict = {'prod_name': c, 'product_type_name': c, 'product_group_name': c, 'graphical_appearance_name': c, 'colour_group_name': c, 'perceived_colour_value_name': c, 
            'department_name': c, 'index_code': c, 'index_name': c, 'index_group_name': c, 'section_name': c, 'garment_group_name': c, 'detail_desc': c}

transactions = pd.read_csv(transactions_path, dtype=tran_dict)
articles = pd.read_csv(articles_path, dtype=art_dict)
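
The memory saving from the `category` dtype can be checked directly. The following is a minimal sketch on a made-up column (the `demo` frame and its values are assumptions for illustration, not the H&M data); it compares the footprint of the same column stored as plain strings and as a categorical.

```python
import pandas as pd

# Hypothetical stand-in for a column with many repeated string values
demo = pd.DataFrame({'customer_id': ['a1', 'b2', 'c3', 'a1'] * 25_000})

# deep=True counts the actual string payload, not just the pointers
as_object = demo['customer_id'].astype('object').memory_usage(deep=True)
as_category = demo['customer_id'].astype('category').memory_usage(deep=True)

print(f'object:   {as_object / 1e6:.2f} MB')
print(f'category: {as_category / 1e6:.2f} MB')
```

The gain depends on how often values repeat: a column where almost every value is unique benefits far less from the conversion.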

In [None]:
def drop(df, *feature):
    '''
    Function to drop features from a dataframe
    Takes the dataframe and single/multiple features
    '''
    for feat in feature:
        df.drop(feat, axis=1, inplace=True)

# Analyzing the transaction data

In [None]:
transactions.head(3)

In [None]:
transactions.info()

In [None]:
transactions.isnull().sum()        # no missing values in transaction

**Extracting date info from t_dat feature**

In [None]:
dates = pd.DatetimeIndex(transactions['t_dat'])      # parse t_dat once and reuse it
transactions['year'] = dates.year
transactions['month'] = dates.month
transactions['dayofweek'] = dates.dayofweek

In [None]:
month_wise_sales = transactions.groupby(['year', 'month']).size().reset_index().rename(columns={0:'Count'})
month_wise_sales['month'] = month_wise_sales['month'].map({1:'Jan',2:'Feb',3:'Mar',4:'Apr',5:'May',6:'Jun',7:'Jul',8:'Aug',9:'Sep',10:'Oct',11:'Nov',12:'Dec'})
matplotlib.rcParams['figure.figsize'] = (12,6)
sns.barplot(x='month', y='Count', data=month_wise_sales, hue='year', palette=['blue', 'red', 'green'])
plt.title('Month wise Number of Items Sold')
plt.show()

In [None]:
day_wise_sales = transactions.groupby(['dayofweek']).size().reset_index().rename(columns={0:'Count'})
day_wise_sales['dayofweek'] = day_wise_sales['dayofweek'].map({0:'Monday', 1:'Tuesday', 2:'Wednesday', 3:'Thursday', 4:'Friday', 5:'Saturday', 6:'Sunday'})
matplotlib.rcParams['figure.figsize'] = (12,6)
sns.barplot(x='dayofweek', y='Count', data=day_wise_sales)
plt.title('WeekDay wise Number of Items Sold')
plt.xticks(rotation=45)
plt.show()

In [None]:
sales = transactions[['year', 'month', 'price']].groupby(['year', 'month']).sum().reset_index()
sales['price'] = sales['price'].astype(int)
sns.lineplot(data=sales, x="month", y="price", hue='year', palette=['blue', 'red', 'green'])    # same year colours as the monthly bar chart

### Some vital conclusions can be drawn from the above line chart:
* Total sales were quite low in the first month (September 2018)
* Sales doubled in the very next month and then stayed consistently in the range of 25000 to 40000
* Breakthrough month: June 2019
* In 2020, sales improved gradually until June, after which they started declining

In [None]:
plt.figure(figsize=(4,6))
sns.countplot(x=transactions['sales_channel_id'], color='crimson')     # pass the column as x; recent seaborn versions no longer accept it positionally
plt.show()

In [None]:
df_subset = transactions[['year', 'month', 'price']].groupby(['year', 'month']).sum().reset_index()
sns.catplot(x = 'year', y = 'price', data=df_subset, hue = 'month')

### **Top Articles and Customers**

In [None]:
top_ten_articles = transactions['article_id'].value_counts().index[:10]               # 10 most sold articles
top_twenty_customers = transactions['customer_id'].value_counts().index[:20]             # 20 customers who purchased max number of times

In [None]:
drop(transactions, ['t_dat'])                             # extracted useful date info, t_dat no longer required

## Summary of the findings: 
1. We don't have any missing values in the transaction dataframe
2. Transaction details are available from September 2018 to September 2020
3. Barring September, in every month 2018 saw more purchases than 2019, which in turn did better than 2020 (the number of purchases is declining year over year)
4. Customers make the most purchases on Saturdays and the fewest on Sundays, though the difference is not large
5. We found the min, max and avg price values of the transactions
6. We extracted the top 10 articles sold over the period and also the top 20 customers who bought the most number of times
7. About two-thirds of the sales happened via channel ID 2 and the rest via channel ID 1
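
The price statistics and the channel split mentioned in the summary can be computed in a couple of lines. This is a minimal sketch on a made-up miniature table (`toy` and its numbers are assumptions for illustration); on the real data the same calls would be applied to `transactions`.

```python
import pandas as pd

# Hypothetical miniature of the transactions table
toy = pd.DataFrame({
    'price': [0.01, 0.02, 0.03, 0.05, 0.02, 0.04],
    'sales_channel_id': [2, 2, 1, 2, 1, 2],
})

# Min, max and average price in one call
price_summary = toy['price'].agg(['min', 'max', 'mean'])

# Share of transactions per sales channel (fractions summing to 1)
channel_share = toy['sales_channel_id'].value_counts(normalize=True)

print(price_summary)
print(channel_share)
```

Using `normalize=True` gives the channel shares as fractions, which makes a statement such as "two-thirds via channel 2" easy to verify.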


# Analyzing the customer data

In [None]:
customers = pd.read_csv(customers_path)
customers.head(3)

In [None]:
customers.info()

In [None]:
def value_counts(feature_list):
    '''
    To print the value counts of the categories within a feature. Takes a list of features as an argument.
    '''
    for i in feature_list:
        print(i.upper())
        print(customers[i].value_counts())
        print('Missing values: ', customers[i].isnull().sum(), '\n')

In [None]:
check_features = ['FN', 'Active', 'club_member_status', 'fashion_news_frequency']
value_counts(check_features)

## Handling the missing values:
* **FN**: Filling the missing values with 0s as the non-null values are 1s
* **Active**: Same as FN
* **club_member_status**: Filling with the mode (ACTIVE)
* **fashion_news_frequency**: Replacing 2 `None` values with `NONE` and filling the missing values with the mode NONE for the time being

In [None]:
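# assign the result back; an in-place replace on a selected column may not update the dataframe in recent pandas
customers['fashion_news_frequency'] = customers['fashion_news_frequency'].replace(to_replace='None', value='NONE')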
values = {"FN": 0, "Active": 0, "club_member_status": 'ACTIVE', "fashion_news_frequency": 'NONE'}
customers.fillna(value=values, inplace=True)

In [None]:
value_counts(check_features)           # missing values handled

**Let's see the age feature now**

In [None]:
customers['age'].isnull().sum()           # total number of missing values in age feature

In [None]:
print('Minimum Age: ', customers['age'].min())
print('Maximum Age: ', customers['age'].max())

In [None]:
customers['age'].describe()

In [None]:
sns.boxplot(x=customers['age'], color='purple')

In [None]:
# Missing value imputation with median as we have outliers
customers['age'] = customers['age'].fillna(customers['age'].median())

In [None]:
customers[customers['age']>65].shape

**We have customers ranging from 16 years to 99 years of age with the average age being 36 years**

Binning - grouping the age into categories:
1. Below 26
2. 26-35
3. 36-45
4. 46-55
5. 56-65
6. Above 65

We are considering buckets of 10 years; later we can try different approaches to binning, or perhaps frequency encoding.
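
The two alternatives mentioned above can be sketched as follows. The ages below are made up for illustration, and the names `age_quartile`, `age_bucket` and `age_freq` are hypothetical; this is one possible approach, not the method used in the rest of the notebook.

```python
import pandas as pd

# Hypothetical sample of customer ages
ages = pd.Series([16, 19, 22, 25, 31, 38, 44, 52, 67, 99], name='age')

# Quantile binning: each bucket holds roughly the same number of customers
age_quartile = pd.qcut(ages, q=4, labels=['Q1', 'Q2', 'Q3', 'Q4'])

# Fixed-width buckets with left-closed bins, e.g. [26, 36) covers ages 26-35
labels = ['Below 26', '26-35', '36-45', '46-55', '56-65', 'Above 65']
age_bucket = pd.cut(ages, bins=[15, 26, 36, 46, 56, 66, 100],
                    right=False, labels=labels).astype(str)

# Frequency encoding: replace each bucket by its share of all customers
bucket_share = age_bucket.value_counts(normalize=True)
age_freq = age_bucket.map(bucket_share)

print(pd.DataFrame({'age': ages, 'quartile': age_quartile,
                    'bucket': age_bucket, 'freq': age_freq}))
```

Quantile bins adapt to the skew in the age distribution, while frequency encoding turns the category into a single numeric feature that a model can use directly.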

In [None]:
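age_bins = [15,26,36,46,56,66,100]
# right=False makes each bin left-closed, e.g. [26, 36) covers ages 26-35, so the bins match the labels
customers['age'] = pd.cut(customers['age'], bins=age_bins, right=False, labels=['Below 26','26-35','36-45','46-55', '56-65', 'Above 65'])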

In [None]:
plt.figure(figsize=(10,6))
customers.groupby('age').size().plot(kind='pie', autopct='%1.1f%%')      # show each share with one decimal place
plt.ylabel('Age Distribution', size=20)
plt.tight_layout()

In [None]:
drop(customers, ['postal_code'])             

## Insights from the customers dataset
* Outliers detected in the age feature
* Customers in the 16-35 age range dominate the customers data
* The FN and Active features contained only 1s or empty values

# Analyzing the articles data

In [None]:
articles.info()                      # only description has few missing values

In [None]:
articles[['product_code','prod_name']].value_counts().head(5)

**As the features come in pairs of unique codes or IDs and their corresponding names, we can drop one of each pair to avoid data redundancy**
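
Before dropping one column of each pair, it can be worth confirming that the code and the name really carry the same information. The helper below is a minimal sketch: `is_one_to_one` is a hypothetical function, and the toy table and its column values are assumptions for illustration.

```python
import pandas as pd

# Hypothetical miniature of the articles table
toy_articles = pd.DataFrame({
    'product_type_no': [253, 253, 272, 265],
    'product_type_name': ['Vest top', 'Vest top', 'Trousers', 'Dress'],
})

def is_one_to_one(df, code_col, name_col):
    """Return True when every code maps to one name and every name to one code."""
    pairs = df[[code_col, name_col]].drop_duplicates()
    return bool(pairs[code_col].is_unique and pairs[name_col].is_unique)

print(is_one_to_one(toy_articles, 'product_type_no', 'product_type_name'))

# A table where one code carries two different names fails the check
broken = toy_articles.assign(product_type_name=['Vest top', 'Top', 'Trousers', 'Dress'])
print(is_one_to_one(broken, 'product_type_no', 'product_type_name'))
```

If a pair fails the check, dropping the code column would lose information, so that pair deserves a closer look first.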

In [None]:
feature_list = [i for i in articles.columns[1:] if str(articles[i].dtypes)[:3] == 'int']
drop(articles, feature_list)
articles.head(3)

In [None]:
px.sunburst(articles, path=['perceived_colour_value_name', 'perceived_colour_master_name'], title='Color Categories')

**Perceived_colour_master_name doesn't add much information**

In [None]:
drop(articles, 'perceived_colour_master_name')

## **Dominating Products in H&M Stock**

In [None]:
def top_fives(feature, col):
    products = articles.groupby(feature).size().reset_index().rename(columns={0: 'Total'}).sort_values('Total', ascending=False).head()
    fig = px.pie(products, values='Total', names=feature, color_discrete_sequence=col, title='Top 5 {}'.format(feature))
    fig.show()

In [None]:
to_get_tops = {'product_group_name': px.colors.sequential.RdBu, 
               'prod_name': px.colors.sequential.RdBu_r, 
               'product_type_name': px.colors.sequential.BuGn_r, 
               'graphical_appearance_name': px.colors.sequential.OrRd_r}

for i, j in to_get_tops.items():
    top_fives(i, j)

### **Detailed Description Feature**

In [None]:
from wordcloud import WordCloud               # wordcloud's STOPWORDS is not imported because the gensim import below would shadow it
import gensim
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS
from nltk.corpus import stopwords

stop_words = stopwords.words('english')
stop_words.extend(['the', 'with', 'at', 'zip'])                        # adding in the list

In [None]:
articles['detail_desc'] = articles['detail_desc'].str.replace('[#@&.,]', '', regex=True)        # removing special characters from description (regex=True is required in pandas 2+)

# Remove stop words and remove words with 2 or less characters
def preprocess(text):
    ''' keeping only the words which are not in stop_words list '''
    result = []
    for token in gensim.utils.simple_preprocess(text) :
        if token not in gensim.parsing.preprocessing.STOPWORDS and len(token) > 2 and token not in stop_words:
            result.append(token)
            
    return ' '.join(result)

In [None]:
articles['detail_desc'] = articles['detail_desc'].fillna('None')
articles['clean_desc'] = articles['detail_desc'].apply(preprocess)
drop(articles, 'detail_desc')

In [None]:
def word_show(group):
    ''' to display the word cloud on the basis of index_group_name '''
    plt.figure(figsize=(15,10))
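    # join all descriptions into one string; str() on a Series would only give its truncated printout
    text = ' '.join(articles.loc[articles['index_group_name'] == group, 'clean_desc'])
    wc = WordCloud(max_words=2000, width=1600, height=800, stopwords=stop_words).generate(text)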
    plt.title('Dominating words in description of {} section'.format(group), fontsize=25)
    plt.imshow(wc)

In [None]:
word_show('Ladieswear')

In [None]:
word_show('Baby/Children')

## **Top 10**

In [None]:
top_articles = articles.loc[articles['article_id'].isin(list(top_ten_articles))]
px.sunburst(top_articles, path=['index_group_name', 'index_name', 'section_name'], title='Top Selling Products ')

In [None]:
top_customers = customers.loc[customers['customer_id'].isin(list(top_twenty_customers))].copy()    # copy so the edit below does not touch a view of customers
top_customers['fashion_news_frequency'] = top_customers['fashion_news_frequency'].replace('NONE', 'Irregular')
px.sunburst(top_customers, path=['age', 'fashion_news_frequency', 'club_member_status'], title="Top Customers' behaviour")

## **Summary of the insights:**
* Only `detail_desc` has a few missing values
* Every informative feature comes as a pair of a unique ID/code and its corresponding name
* Ladieswear is the most popular section, followed by the Baby/Children section, whereas Sport is a very small section in H&M
* There are 3 color features; `perceived_colour_master_name` doesn't add any extra information, hence it has been dropped
* Top 5 product features have been identified with the below results:
    1. Upper body garments are the most abundant product category, followed by lower body garments
    2. Dragonfly Dress is the top SKU
    3. The Trousers and Dress product types dominate the articles dataset
    4. H&M tends to prefer solid colors over patterns for its products
* Extracted the commonly occurring description words in the most popular categories, Ladieswear and Baby/Children
* The top 20 customers are all active club members and most of them follow fashion news regularly
* Top purchasing customers are in the age group of 25-55
* Divided and Ladieswear are the top selling product categories

## **Next steps**
* Merge the datasets
* Ensure missing values are handled
* Categorical encoding
* Scaling features
* Feature Correlation
* Feature Selection
* Model building
* Performance Evaluation
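
As a starting point for the first step, the three tables can be joined on their key columns. This is a minimal sketch on made-up miniature tables (all `toy_*` frames and their values are assumptions for illustration); `validate='many_to_one'` makes pandas raise an error if a key is unexpectedly duplicated on the lookup side.

```python
import pandas as pd

# Hypothetical miniatures of the three tables
toy_transactions = pd.DataFrame({
    'customer_id': ['c1', 'c2', 'c1'],
    'article_id': [101, 102, 103],
    'price': [0.02, 0.05, 0.01],
})
toy_customers = pd.DataFrame({
    'customer_id': ['c1', 'c2'],
    'age': ['26-35', 'Below 26'],
})
toy_articles = pd.DataFrame({
    'article_id': [101, 102, 103],
    'index_group_name': ['Ladieswear', 'Divided', 'Sport'],
})

# Left joins keep every transaction even if a lookup row is missing
merged = (
    toy_transactions
    .merge(toy_customers, on='customer_id', how='left', validate='many_to_one')
    .merge(toy_articles, on='article_id', how='left', validate='many_to_one')
)

print(merged)
print(merged.isnull().sum())      # any nulls here point to keys without a match
```

Checking the null counts right after the merge covers the second step as well, since unmatched keys show up as missing values in the joined columns.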

**Suggestions and feedback are welcome!**