![](https://media-cldnry.s-nbcnews.com/image/upload/t_social_share_1200x630_center,f_auto,q_auto:best/newscms/2017_24/1222336/hm-today-170616-tease.jpg)

Hennes & Mauritz AB is a Swedish multinational clothing company headquartered in Stockholm. It is known for its fast-fashion clothing for men, women, teenagers, and children.

### Data background

The dataset contains four CSV files and one folder with several subfolders, each holding a different number of images.

In this exploratory data analysis notebook we look at the data: we analyze the content of each CSV file, check for missing data, examine the data distributions, and explore the relations between the files. Three of the CSV files hold the tabular data:

* Customer Data
* Article Data
* Transaction Data
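
As a rough sketch of how the three tables could relate, the toy example below joins a transaction table to customers and articles. The column names `customer_id` and `article_id` appear later in this notebook; the assumption that the transaction table carries both keys, and all the toy values, are illustrative only.

```python
import pandas as pd

# Toy stand-ins for the three tables (made-up values, not the real data).
customers = pd.DataFrame({"customer_id": ["c1", "c2"], "age": [24, 51]})
articles = pd.DataFrame({
    "article_id": [10, 11],
    "product_group_name": ["Garment Upper body", "Accessories"],
})
# Assumption: each transaction row references one customer and one article.
transactions = pd.DataFrame({
    "customer_id": ["c1", "c1", "c2"],
    "article_id": [10, 11, 10],
})

# validate="many_to_one" raises if a key is duplicated on the right-hand side.
joined = (
    transactions
    .merge(customers, on="customer_id", how="left", validate="many_to_one")
    .merge(articles, on="article_id", how="left", validate="many_to_one")
)
print(joined)
```

A left join keeps every transaction even when a key has no match, which makes unmatched rows visible as NaN values.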


### Importing Libraries

Here we import the packages required for reading, parsing, filtering, processing, and visualizing the data, both tabular and image.

In [None]:
import numpy as np
import pandas as pd
import os
from tqdm import tqdm
import plotly.express as px
import matplotlib.pyplot as plt
import seaborn as sns
from wordcloud import WordCloud, STOPWORDS
from datetime import datetime

### Creating a class with data locations


In [None]:
class DataLocations:
    article_csv = '../input/h-and-m-personalized-fashion-recommendations/articles.csv'
    customer_csv = '../input/h-and-m-personalized-fashion-recommendations/customers.csv'
    tx_csv = '../input/h-and-m-personalized-fashion-recommendations/transactions_train.csv'
    sub_csv = '../input/h-and-m-personalized-fashion-recommendations/sample_submission.csv'

### Customer Data
We start the analysis with the customer data.

In [None]:
customer_df = pd.read_csv(DataLocations.customer_csv)
customer_df.head()

It is clear that there are some null values. We will inspect this further.

In [None]:
print("Shape of customer data:", customer_df.shape)

We can see that we have `1371980` rows and `7` columns.

In [None]:
customer_df.isnull().sum()

We can see that there are many null values in the dataset. For a clearer view, let's plot them as percentages.

In [None]:
# Plot the percentage of NaN values for each column that has at least one
def plot_nas(df):
    if df.isnull().sum().sum() != 0:
        na_df = (df.isnull().sum() / len(df)) * 100      
        na_df = na_df.drop(na_df[na_df == 0].index).sort_values(ascending=False)
        missing_data = pd.DataFrame({'Missing Ratio %' :na_df})
        missing_data.plot(kind = "barh")
        plt.show()
    else:
        print('No NAs found')

print("Checking nulls in customer data")
plot_nas(customer_df)

As shown in the plot, the `FN` and `Active` columns have a large number of null values. We have to decide whether we can use these columns to build the model, so we have two options:

* Fill the null values and use the columns
* Remove the columns and go with the rest
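
The two options can be sketched on a toy frame as follows. The encoding used here (1.0 for a set flag, NaN otherwise) and the choice of 0 as the fill value are assumptions for illustration, not a decision taken in this notebook.

```python
import numpy as np
import pandas as pd

# Toy frame: 1.0 marks a set flag, NaN a missing one (assumed encoding).
toy = pd.DataFrame({
    "customer_id": ["a", "b", "c", "d"],
    "FN": [1.0, np.nan, np.nan, 1.0],
    "Active": [1.0, np.nan, np.nan, np.nan],
    "age": [25.0, 31.0, np.nan, 47.0],
})

# Option 1: fill the nulls (here with 0) and keep the columns.
filled = toy.copy()
filled[["FN", "Active"]] = filled[["FN", "Active"]].fillna(0).astype(int)

# Option 2: drop the sparse columns and continue with the rest.
dropped = toy.drop(columns=["FN", "Active"])

print(filled)
print(dropped.columns.tolist())
```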

However, to choose between these options we have to analyze the other data as well.
Let's see how many unique values are in the columns.

In [None]:
print(customer_df.nunique())

As shown above, the `Active` column contains only one distinct non-null value. We can therefore assume that a NaN in `Active` means the customer is not active.

In [None]:
temp = customer_df.groupby(["age"])["customer_id"].count()
df = pd.DataFrame({'age':temp.index,'count':temp.values})
df = df.sort_values(['age'],ascending=False)
plt.figure(figsize=(35,7))
plt.title("Number of Customers by Age")
sns.set_color_codes("pastel")
s = sns.barplot(x = 'age', y="count", data=df)
plt.show()

We can see that the most common customer ages are between 20 and 30.
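
To put a number on such an observation, one possible approach is to bin the ages into bands and compute each band's share. The ages below are made up and the ten-year band edges are an arbitrary choice.

```python
import pandas as pd

ages = pd.Series([19, 22, 24, 27, 29, 35, 41, 58], name="age")  # made-up ages

# Left-closed, right-open bands: [10, 20), [20, 30), ...
bins = [10, 20, 30, 40, 50, 60]
labels = ["10-19", "20-29", "30-39", "40-49", "50-59"]
age_band = pd.cut(ages, bins=bins, labels=labels, right=False)

# Share of customers per band, in percent, ordered by band.
share = age_band.value_counts(normalize=True).sort_index() * 100
print(share.round(1))
```

On the real data, `ages` would be `customer_df["age"]`; rows with a missing age fall outside every band and are ignored by `value_counts`.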

In [None]:
temp = customer_df.groupby(["fashion_news_frequency"])["customer_id"].count()
df = pd.DataFrame({'Fashion News Frequency': temp.index,
                   'Customers': temp.values
                  })
df = df.sort_values(['Customers'], ascending=False)
plt.figure(figsize = (6,6))
plt.title(f'Number of Customers per each Fashion News Frequency')
sns.set_color_codes("pastel")
s = sns.barplot(x = 'Fashion News Frequency', y="Customers", data=df)
s.set_xticklabels(s.get_xticklabels(),rotation=90)
locs, labels = plt.xticks()
plt.show()

In [None]:
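print(temp)
# Total number of customers with a known news frequency
count = temp.sum()
# temp.iloc[3] is the fourth group in the sorted groupby index, taken here to be the regular subscribers
print("Percentage of customers who have subscribed to regular news:", round(temp.iloc[3] / count * 100, 2), "%")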

Most of the customers have not subscribed to fashion news. However, 35% of the customers have subscribed to receive the news regularly.
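
A label-based alternative to indexing by position is `value_counts(normalize=True)`, sketched here on a toy series. The category labels are placeholders and may not match the exact strings in the real column.

```python
import numpy as np
import pandas as pd

# Toy column; the labels are placeholders for the real category names.
freq = pd.Series(
    ["NONE", "Regularly", "NONE", "Monthly", "Regularly", np.nan, "NONE", "NONE"],
    name="fashion_news_frequency",
)

# dropna=False keeps missing values as their own row in the result.
shares = freq.value_counts(normalize=True, dropna=False) * 100
print(shares.round(2))

# Look the group up by label instead of by position.
regular_pct = shares["Regularly"]
print("Regular subscribers:", round(regular_pct, 2), "%")
```

Looking a group up by its label keeps the result correct even if the number or order of categories changes.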

In [None]:
temp = customer_df.groupby(["club_member_status"])["customer_id"].count()
df = pd.DataFrame({'Club Member Status': temp.index,
                   'Customers': temp.values
                  })
df = df.sort_values(['Customers'], ascending=False)
plt.figure(figsize = (6,6))
plt.title(f'Number of Customers per each Club Member Status')
sns.set_color_codes("pastel")
s = sns.barplot(x = 'Club Member Status', y="Customers", data=df)
s.set_xticklabels(s.get_xticklabels(),rotation=90)
locs, labels = plt.xticks()
plt.show()

As shown in the bar chart, most of the customers have an Active club member status. Let's see how many of the active members have subscribed to the news regularly.

In [None]:
# Plot an interactive bar chart of customer counts for the given column
def plot_bar(df, column):
    long_df = df.groupby(column)['customer_id'].count().reset_index().rename({'customer_id': 'count'}, axis=1)
    # f-string so that the column name is actually substituted into the title
    fig = px.bar(long_df, x=column, y="count", color=column, title=f"Bar plot for {column}")
    fig.show()
    

In [None]:
plot_bar(customer_df, 'club_member_status')

In [None]:
temp = customer_df.groupby(["club_member_status","fashion_news_frequency"])["customer_id"].count()
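print(temp)
# 471304 regular subscribers with an Active status out of 477416 regular subscribers (counts read from the output above)
print("Percentage of regular news subscribers who have an Active club member status:", round(471304 / 477416 * 100, 2), "%")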

We can see that about 99% of the customers who have subscribed to receive the news regularly have an Active club member status.
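
Instead of copying counts by hand, the same kind of conditional share could be derived with `pd.crosstab`. This is a minimal sketch on toy data; the status and frequency labels are hypothetical placeholders.

```python
import pandas as pd

# Toy data; the category labels are hypothetical placeholders.
toy = pd.DataFrame({
    "club_member_status": ["ACTIVE", "ACTIVE", "PRE-CREATE", "ACTIVE", "ACTIVE", "PRE-CREATE"],
    "fashion_news_frequency": ["Regularly", "NONE", "NONE", "Regularly", "Regularly", "Regularly"],
})

# normalize="columns" makes every news-frequency column sum to 1,
# so each cell is the share of that status within the frequency group.
table = pd.crosstab(
    toy["club_member_status"],
    toy["fashion_news_frequency"],
    normalize="columns",
)
active_among_regular = table.loc["ACTIVE", "Regularly"] * 100
print(f"{active_among_regular:.2f} %")
```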

In [None]:
temp_df = temp.to_frame()
#sns.barplot(x="club_member_status", y="customer_id", hue="fashion_news_frequency", data=temp_df)

In [None]:
fig = plt.figure(figsize=(20, 7))
sns.histplot(customer_df.postal_code.value_counts()[1:], bins=250, kde=False)
plt.xlim(0, 50)
plt.tight_layout()
plt.show()

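The histogram shows how many customers share each postal code. The single most frequent postal code is excluded from it (the `[1:]` slice), since it has far more customers than any other postal code.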
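
How dominant the top postal code is can be checked directly from the value counts. The sketch below uses made-up codes; on the real data `postal` would be `customer_df["postal_code"]`.

```python
import pandas as pd

# Made-up postal codes: "p1" is deliberately over-represented.
postal = pd.Series(["p1"] * 6 + ["p2"] * 2 + ["p3", "p4"], name="postal_code")

counts = postal.value_counts()          # sorted from most to least frequent
top_code = counts.index[0]
top_share = counts.iloc[0] / counts.sum() * 100
print(top_code, round(top_share, 1), "%")

# Everything except the top code, as used for the histogram above.
rest = counts.iloc[1:]
print(rest)
```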

### Article Data
Next, we analyze the article data.

In [None]:
articles_df = pd.read_csv(DataLocations.article_csv)
articles_df.head()

In [None]:
articles_df.shape

The article data has `105542` rows and `25` columns.

In [None]:
temp = articles_df.groupby(["product_group_name"])["product_type_name"].nunique()
df = pd.DataFrame({'Product Group': temp.index,
                   'Product Types': temp.values
                  })
df = df.sort_values(['Product Types'], ascending=False)
plt.figure(figsize = (8,6))
plt.title('Number of Product Types per each Product Group')
sns.set_color_codes("pastel")
s = sns.barplot(x = 'Product Group', y="Product Types", data=df,palette="cubehelix")
s.set_xticklabels(s.get_xticklabels(),rotation=90)
locs, labels = plt.xticks()
plt.show()

The Accessories product group has the largest number of product types. There is also an Unknown category.

In [None]:
stopwords = set(STOPWORDS)

def show_wordcloud(data, title = None):
    wordcloud = WordCloud(
        background_color='white',
        stopwords=stopwords,
        max_words=200,
        max_font_size=40, 
        scale=5,
        random_state=1
    ).generate(str(data))

    fig = plt.figure(1, figsize=(10,10))
    plt.axis('off')
    if title: 
        fig.suptitle(title, fontsize=14)
        fig.subplots_adjust(top=2.3)

    plt.imshow(wordcloud)
    plt.show()

In [None]:
show_wordcloud(articles_df["prod_name"], "Wordcloud from product name")

In [None]:
temp = articles_df.groupby(["product_group_name"])["article_id"].nunique()
df = pd.DataFrame({'Product Group': temp.index,
                   'Articles': temp.values
                  })
df = df.sort_values(['Articles'], ascending=False)
plt.figure(figsize = (8,6))
plt.title('Number of Articles per each Product Group')
sns.set_color_codes("pastel")
s = sns.barplot(x = 'Product Group', y="Articles", data=df)
s.set_xticklabels(s.get_xticklabels(),rotation=90)
locs, labels = plt.xticks()
plt.show()

Most of the articles are from the `Garment Upper Body` product group. Let's see it as a percentage.

In [None]:
temp = articles_df.groupby(["product_group_name"])['article_id'].nunique().sort_values(ascending=False)
temp

In [None]:
# temp is sorted in descending order, so position 0 is the largest product group
print("Garment Upper Body articles as a percentage of total articles:",
      round(temp.iloc[0] / articles_df['article_id'].count() * 100, 2), "%")

In [None]:
temp = articles_df.groupby(["product_type_name"])["article_id"].nunique()
df = pd.DataFrame({'Product Type': temp.index,
                   'Articles': temp.values
                  })
total_types = len(df['Product Type'].unique())

#getting top 50 
df = df.sort_values(['Articles'], ascending=False)[0:50]
plt.figure(figsize = (16,6))
plt.title(f'Number of Articles per each Product Type (top 50 from total: {total_types})')
s = sns.barplot(x = 'Product Type', y="Articles", data=df,palette="rocket")
s.set_xticklabels(s.get_xticklabels(),rotation=90)
locs, labels = plt.xticks()
plt.show()

Most of the articles belong to four product types:
* Trousers
* Dress
* Sweater
* T-shirt
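
One way to check how much of the catalogue the leading product types cover is a cumulative share over the sorted counts. The toy series below is made up; on the real data it would be `articles_df["product_type_name"]`.

```python
import pandas as pd

# Made-up product types with a few dominant ones and a tail of singletons.
types = pd.Series(
    ["Trousers"] * 4 + ["Dress"] * 3 + ["Sweater"] * 2 + ["T-shirt"] * 2
    + ["Hat", "Bag", "Socks", "Belt", "Scarf"],
    name="product_type_name",
)

counts = types.value_counts()                      # descending by count
cum_share = counts.cumsum() / counts.sum() * 100   # running share in percent
top4_share = cum_share.iloc[3]                     # share covered by the top 4 types
print(round(top4_share, 2), "%")
```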



In [None]:
temp = articles_df.groupby(["department_name"])["article_id"].nunique()
df = pd.DataFrame({'Department Name': temp.index,
                   'Articles': temp.values
                  })
total_depts = len(df['Department Name'].unique())
df = df.sort_values(['Articles'], ascending=False).head(50)
plt.figure(figsize = (16,6))
plt.title(f'Number of Articles per each Department (top 50 from total: {total_depts})')
sns.set_color_codes("pastel")
s = sns.barplot(x = 'Department Name', y="Articles", data=df,palette="CMRmap")
s.set_xticklabels(s.get_xticklabels(),rotation=90)
locs, labels = plt.xticks()
plt.show()

We can see that the `jersey` department has the most articles.

In [None]:
temp = articles_df.groupby(["graphical_appearance_name"])["article_id"].nunique()
df = pd.DataFrame({'Graphical Appearance Name': temp.index,
                   'Articles': temp.values
                  })
df = df.sort_values(['Articles'], ascending=False).head(50)
plt.figure(figsize = (16,6))
plt.title(f'Number of Articles per each Graphical Appearance Name')
sns.set_color_codes("pastel")
s = sns.barplot(x = 'Graphical Appearance Name', y="Articles", data=df,palette="crest")
s.set_xticklabels(s.get_xticklabels(),rotation=90)
locs, labels = plt.xticks()
plt.show()

In [None]:
temp = articles_df.groupby(["index_group_name"])["article_id"].nunique()
df = pd.DataFrame({'Index Group Name': temp.index,
                   'Articles': temp.values
                  })
df = df.sort_values(['Articles'], ascending=False)
plt.figure(figsize = (6,6))
plt.title(f'Number of Articles per each Index Group Name')
sns.set_color_codes("pastel")
s = sns.barplot(x = 'Index Group Name', y="Articles", data=df)
s.set_xticklabels(s.get_xticklabels(),rotation=90)
locs, labels = plt.xticks()
plt.show()

`Ladieswear` has the most articles, and `Babychildren` also has a large number of articles.

In [None]:
temp = articles_df.groupby(["colour_group_name"])["article_id"].nunique()
df = pd.DataFrame({'Colour Group Name': temp.index,
                   'Articles': temp.values
                  })
df = df.sort_values(['Articles'], ascending=False)
plt.figure(figsize = (12,6))
plt.title(f'Number of Articles per each Colour Group Name')
sns.set_color_codes("pastel")
s = sns.barplot(x = 'Colour Group Name', y="Articles", data=df)
s.set_xticklabels(s.get_xticklabels(),rotation=90)
locs, labels = plt.xticks()
plt.show()

We can see that `black` is the most common colour among the products.

In [None]:
temp = articles_df.groupby(["perceived_colour_value_name"])["article_id"].nunique()
df = pd.DataFrame({'Perceived Colour Group Name': temp.index,
                   'Articles': temp.values
                  })
df = df.sort_values(['Articles'], ascending=False)
plt.figure(figsize = (6,6))
plt.title(f'Number of Articles per each Perceived Colour Group Name')
sns.set_color_codes("pastel")
s = sns.barplot(x = 'Perceived Colour Group Name', y="Articles", data=df)
s.set_xticklabels(s.get_xticklabels(),rotation=90)
locs, labels = plt.xticks()
plt.show()

We can see that Dark is the most common perceived colour value. We also saw earlier that black is the most common colour among the products.