# Introduction

**Marketing** connects the **best products** to the **right customers**.


In today's digital world, **personalization** leads to **increased** customer **satisfaction** and a higher likelihood of **repeat purchases**.


**Recommendation system algorithms** recommend the most relevant items to users based on their predicted preferences. They act on **behavioural data**, such as a customer’s previous purchases, ratings or reviews, to predict how likely that customer is to buy a new product or service.
**Examples** are Amazon’s “Customers who buy this item also bought” and Netflix’s “shows or movies you may want to watch”.


Recommender systems are very popular for **recommending products** such as movies, music and groceries, and act as the backbone of cross-selling across industries.


**Three** widely used algorithms for building recommendation systems are:

1) Association Rules
2) Collaborative Filtering
3) Matrix Factorization
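
To give a feel for the first of these, here is a minimal sketch of the two basic association-rule measures, support and confidence, computed by hand. The baskets are hypothetical toy data, not taken from the H&M files, and a real project would more likely use a dedicated library.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy baskets (one set of items per purchase)
baskets = [
    {"jeans", "t-shirt"},
    {"jeans", "t-shirt", "socks"},
    {"jeans", "socks"},
    {"t-shirt"},
]

item_counts = Counter()
pair_counts = Counter()
for basket in baskets:
    item_counts.update(basket)
    # Count each unordered pair of items once per basket
    pair_counts.update(frozenset(p) for p in combinations(sorted(basket), 2))

n_baskets = len(baskets)
pair = frozenset({"jeans", "t-shirt"})

# Support: share of all baskets that contain both items
support = pair_counts[pair] / n_baskets
# Confidence of the rule jeans -> t-shirt: share of jeans baskets that also contain a t-shirt
confidence = pair_counts[pair] / item_counts["jeans"]

print(f"support={support:.2f}, confidence={confidence:.2f}")
```

A rule such as "jeans -> t-shirt" is usually kept only when both measures exceed thresholds chosen by the analyst.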


# Import Packages

In [None]:


import numpy as np
import pandas as pd
import os
import matplotlib.pyplot as plt
import seaborn as sns


from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity='all'

# Read files

There are 4 files, namely:
1. articles.csv
2. customers.csv
3. transactions_train.csv
4. sample_submission.csv

In [None]:
# Input data files are available in the "../input/" directory.
# List all files under the input directory

input_path = '../input/h-and-m-personalized-fashion-recommendations'

       
# read files

fname = 'articles.csv'
articles_df = pd.read_csv(os.path.join(input_path , fname))

fname = 'customers.csv'
customers_df = pd.read_csv(os.path.join(input_path , fname))

# Make sure article_id is loaded as a string (this keeps its leading zeros)
transactions_train_path = os.path.join(input_path, 'transactions_train.csv')
transactions_train_df = pd.read_csv(transactions_train_path, index_col="t_dat",
                                    parse_dates=True, dtype={'article_id': str})


fname = 'sample_submission.csv'
sample_submission_df = pd.read_csv(os.path.join(input_path , fname))

../input/h-and-m-personalized-fashion-recommendations/sample_submission.csv
../input/h-and-m-personalized-fashion-recommendations/articles.csv
../input/h-and-m-personalized-fashion-recommendations/transactions_train.csv
../input/h-and-m-personalized-fashion-recommendations/customers.csv
../input/h-and-m-personalized-fashion-recommendations/images/057/0570177001.jpg
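
The `dtype={'article_id': str}` argument used for transactions_train.csv matters because article ids can start with a zero (see the image file name above). The sketch below illustrates the difference on a hypothetical two-row CSV held in memory; the rows are made up for illustration.

```python
import io

import pandas as pd

# Hypothetical two-row stand-in for transactions_train.csv
csv_text = (
    "t_dat,article_id,price\n"
    "2020-09-01,0570177001,0.05\n"
    "2020-09-02,0108775015,0.03\n"
)

# Default parsing: article_id is inferred as an integer
as_int = pd.read_csv(io.StringIO(csv_text))
# Explicit dtype: article_id stays a string
as_str = pd.read_csv(io.StringIO(csv_text), dtype={"article_id": str})

first_int = as_int["article_id"].iloc[0]  # leading zero is lost
first_str = as_str["article_id"].iloc[0]  # leading zero is kept

print(first_int, first_str)
```

Keeping the id as a string makes it easier to match image file names and the submission format, at the cost of more memory than an integer column.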

## Exploring articles

In [None]:
articles_df.shape

So, there are 105,542 rows (observations) and 25 columns (features).

In [None]:
articles_df.info()

This method prints information about a **DataFrame** including the **index** dtype and **column** dtypes, **non-null** values and **memory usage**.

All values are **non-null**

So now we know the names of the **25 features** (on the left).

Of these, **11** are **ints** (whole numbers):

article_id, product_code, product_type_no, graphical_appearance_no, colour_group_code, perceived_colour_value_id, perceived_colour_master_id, department_no, index_group_no, section_no, garment_group_no

and **14** are objects (text or strings):

prod_name, product_type_name, product_group_name, graphical_appearance_name, colour_group_name, perceived_colour_value_name, perceived_colour_master_name, department_name, index_code, index_name, index_group_name, section_name, garment_group_name, detail_desc.
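
Rather than counting dtypes by eye from the `info()` output, the split can be computed. The following is a minimal sketch on a small hypothetical frame that only mimics a few articles.csv columns; the values are illustrative.

```python
import pandas as pd

# Hypothetical miniature of articles.csv (illustrative values only)
toy_articles = pd.DataFrame({
    "article_id": [108775015, 108775044, 110065001],
    "product_type_no": [253, 253, 306],
    "prod_name": ["Strap top", "Strap top", "OP T-shirt"],
    "index_code": ["A", "A", "B"],
})

# Columns holding numbers versus everything else (text)
numeric_cols = toy_articles.select_dtypes(include="number").columns.tolist()
text_cols = toy_articles.select_dtypes(exclude="number").columns.tolist()

# Number of missing values in each column
missing_per_column = toy_articles.isna().sum()

print(numeric_cols)
print(text_cols)
print(missing_per_column)
```

Running the same three calls on `articles_df` would give the numeric/text split and the missing-value count per column in one go.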

Let us check out their values and whether they match the types.

In [None]:
#pandas.DataFrame.head(n=5) - Return the first `n` rows.
articles_df.head()

The data in articles.csv is **qualitative** in nature. Qualitative data is categorical: it assigns each observation to a group.

Each article has prod_name, product_type_name, product_group_name, graphical_appearance_name, colour_group_name, perceived_colour_value_name, perceived_colour_master_name, department_name, index_name, index_group_name, section_name and garment_group_name (**12 different categories**).



In [None]:
product_group_name_counts = articles_df['product_group_name'].value_counts()
product_group_name_counts # Series (index, value)

product_group_name_counts.index.values # array of index of Series
product_group_name_counts.values # array of value of Series

In [None]:
index_group_name_counts = articles_df['index_group_name'].value_counts()
index_group_name_counts # Series (index, value)

index_group_name_counts.index.values # array of index of Series
index_group_name_counts.values # array of value of Series

In [None]:
garment_group_name_counts = articles_df['garment_group_name'].value_counts()
garment_group_name_counts # Series (index, value)

garment_group_name_counts.index.values # array of index of Series
garment_group_name_counts.values # array of value of Series

In [None]:
graphical_appearance_name_counts = articles_df['graphical_appearance_name'].value_counts()
graphical_appearance_name_counts # Series (index, value)

graphical_appearance_name_counts.index.values # array of index of Series
graphical_appearance_name_counts.values # array of value of Series
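
The four cells above repeat the same pattern. As a sketch of one way to avoid the repetition, the counts can be collected in a dictionary keyed by column name. The frame below is a hypothetical stand-in for `articles_df`.

```python
import pandas as pd

# Hypothetical stand-in for articles_df
toy = pd.DataFrame({
    "product_group_name": ["Garment Upper body", "Garment Upper body", "Socks & Tights"],
    "index_group_name": ["Ladieswear", "Menswear", "Ladieswear"],
})

columns_to_count = ["product_group_name", "index_group_name"]

# One value_counts() Series per column
counts_by_column = {col: toy[col].value_counts() for col in columns_to_count}

# Most frequent category in one of the columns
top_product_group = counts_by_column["product_group_name"].idxmax()
print(top_product_group)
```

Each entry is an ordinary Series, so `.index.values` and `.values` work on it exactly as in the cells above.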

## Bar plots for single categorical variable



### Frequency distribution of Single categorical variable

Plot the index of the Series (the distinct values that the column/feature can take) on the X-axis and the number of times each value occurs in the dataset (its frequency count) on the Y-axis.

In [None]:
# Newer matplotlib releases no longer ship the 'seaborn-whitegrid' style, so set it through seaborn
sns.set_style('whitegrid')


# Get the figure and the axes (or subplots)

fig, (ax0, ax1, ax2) = plt.subplots(nrows=1, ncols=3, figsize=(15, 4))

# Leave extra margin at the top for the figure title
plt.subplots_adjust(top=0.7)
ax0.xaxis.set_tick_params(rotation=90)
ax1.xaxis.set_tick_params(rotation=90)
ax2.xaxis.set_tick_params(rotation=90)

ax0.bar(product_group_name_counts.index.values, product_group_name_counts.values, width=0.5, align='center')
ax0.set(title = 'product_group_name_counts', xlabel='product_group_name' , ylabel = 'Frequency')

ax1.bar(index_group_name_counts.index.values, index_group_name_counts.values, width=0.5, align='center')
ax1.set(title = 'index_group_name_counts', xlabel='index_group_name' , ylabel = 'Frequency')

ax2.bar(garment_group_name_counts.index.values, garment_group_name_counts.values, width=0.5, align='center')
ax2.set(title = 'garment_group_name_counts', xlabel='garment_group_name' , ylabel = 'Frequency')

# Title the figure
fig.suptitle('Frequency Distribution', fontsize=14, fontweight='bold');

In [None]:
articles_df.nunique()

So there are as many unique article_ids as there are rows in articles_df (105,542): every row describes a distinct article.
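
The same uniqueness check can be made explicit instead of comparing two numbers by eye. A minimal sketch on a hypothetical three-row frame:

```python
import pandas as pd

# Hypothetical miniature of articles.csv
toy_articles = pd.DataFrame({
    "article_id": [108775015, 108775044, 110065001],
    "product_code": [108775, 108775, 110065],
})

# True when no value is repeated in the column
ids_are_unique = toy_articles["article_id"].is_unique
codes_are_unique = toy_articles["product_code"].is_unique

# Number of rows whose product_code already appeared in an earlier row
n_duplicated_codes = int(toy_articles["product_code"].duplicated().sum())

print(ids_are_unique, codes_are_unique, n_duplicated_codes)
```

On the real data, `articles_df['article_id'].is_unique` is the direct equivalent of comparing `nunique()` with the row count.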

## Exploring customers

In [None]:
customers_df.shape

So, there are **1,371,980** rows (observations) and **7** columns (features).


In [None]:
customers_df.info()

This method prints information about a **DataFrame** including the **index dtype** and **column dtypes**, **non-null values** and **memory usage**.

Not every column is fully **non-null**: Active and club_member_status, for example, contain nan values, as the counts further down show.

So now we know the names of the **7** features (on the left):

customer_id, FN, Active, club_member_status, fashion_news_frequency, age and postal_code

of which **3 are float64** (FN, Active, age)

and the rest are objects (text).

In [None]:
customers_df.head()

In [None]:
customers_df.nunique()

There are 1,371,980 unique customers.

FN and Active each take 1 unique value (qualitative data).

club_member_status has 3 unique values and fashion_news_frequency has 4 unique values (qualitative data).

age has 84 unique values (quantitative data).

postal_code has 352,899 unique values.
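
One detail worth keeping in mind when reading these numbers: `nunique()` ignores nan by default, whereas `unique()` returns nan as one of the values. The sketch below shows the difference on hypothetical data shaped like the FN and age columns.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of customers.csv
toy_customers = pd.DataFrame({
    "FN": [1.0, np.nan, 1.0, np.nan],
    "age": [24.0, 31.0, np.nan, 24.0],
})

# Default: nan is not counted as a value
unique_ignoring_nan = toy_customers.nunique()
# dropna=False: nan is counted as one more value
unique_counting_nan = toy_customers.nunique(dropna=False)
# unique() always includes nan when it is present
values_seen_in_fn = toy_customers["FN"].unique()

print(unique_ignoring_nan)
print(unique_counting_nan)
print(values_seen_in_fn)
```

This is why a column can report 1 unique value from `nunique()` while `unique().size` reports 2.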

Let us check out each column's values.

In [None]:
customers_df['FN'].unique()

customers_df['FN'].unique().size

In [None]:
customers_df['Active'].unique()

customers_df['Active'].unique().size

In [None]:
customers_df['club_member_status'].unique()

customers_df['club_member_status'].unique().size

In [None]:
customers_df['fashion_news_frequency'].unique()

customers_df['fashion_news_frequency'].unique().size

In [None]:
customers_df['age'].unique()

customers_df['age'].unique().size

In [None]:
customers_df['postal_code'].unique()

customers_df['postal_code'].unique().size

In [None]:
customers_active_counts = customers_df['Active'].value_counts()
customers_active_counts # Series (index, value)

customers_active_counts.index.values # array of index of Series
customers_active_counts.values # array of value of Series

Out of a total of 1,371,980 customers,

464,404 are active (around one third);

the rest are nan.
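
By default `value_counts()` leaves nan out, which is why the missing rows have to be inferred from the total. As a sketch, `dropna=False` makes them visible directly. The Series below is hypothetical toy data in which one third of the values are set.

```python
import numpy as np
import pandas as pd

# Hypothetical Active column: 1.0 when set, nan otherwise
active = pd.Series([1.0, np.nan, np.nan, 1.0, np.nan, np.nan], name="Active")

counts_without_nan = active.value_counts()           # nan rows are dropped
counts_with_nan = active.value_counts(dropna=False)  # nan gets its own row

share_active = active.notna().mean()  # fraction of rows with a value
n_missing = int(active.isna().sum())  # number of nan rows

print(counts_with_nan)
print(f"share active: {share_active:.2f}, missing: {n_missing}")
```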

In [None]:
club_member_status_counts = customers_df['club_member_status'].value_counts()
club_member_status_counts # Series (index, value)

club_member_status_counts.index.values # array of index of Series
club_member_status_counts.values # array of value of Series

Out of a total of 1,371,980 customers,

1,272,491 are ACTIVE club members (a clear majority),

92,960 have pre-create status,

467 have left the club and

the rest are nan.

In [None]:
fashion_news_frequency_counts = customers_df['fashion_news_frequency'].value_counts()
fashion_news_frequency_counts # Series (index, value)

fashion_news_frequency_counts.index.values # array of index of Series
fashion_news_frequency_counts.values # array of value of Series

Out of a total of 1,371,980 customers,

477,416 (Regularly) and 842 (Monthly) have subscribed to fashion news;

877,711 (NONE) and 2 (None) have not.

Roughly one third have subscribed and two thirds have not.

The number of unsubscribed customers (877,711) is almost double the number of subscribed ones (477,416).
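
The proportions can be checked with a few lines of arithmetic on the counts quoted above. The only assumption in this sketch is that customers not covered by the four categories have a nan value in fashion_news_frequency.

```python
# Counts quoted in the text above
total_customers = 1_371_980
regularly, monthly = 477_416, 842
none_upper, none_title = 877_711, 2  # the 'NONE' and 'None' categories

subscribed = regularly + monthly
unsubscribed = none_upper + none_title
# Assumption: the remaining customers have nan in this column
missing = total_customers - subscribed - unsubscribed

share_subscribed = subscribed / total_customers
ratio_unsub_to_sub = unsubscribed / subscribed

print(f"subscribed: {subscribed} ({share_subscribed:.1%})")
print(f"unsubscribed: {unsubscribed}, missing: {missing}")
print(f"unsubscribed / subscribed = {ratio_unsub_to_sub:.2f}")
```

The ratio comes out a little below 2, consistent with the "one third versus two thirds" reading.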


In [None]:
# Newer matplotlib releases no longer ship the 'seaborn-whitegrid' style, so set it through seaborn
sns.set_style('whitegrid')


# Get the figure and the axes (or subplots)

fig, (ax0, ax1, ax2) = plt.subplots(nrows=1, ncols=3, figsize=(15, 4))


ax0.bar(customers_active_counts.index.values, customers_active_counts.values, width=0.5, align='center')
ax0.set(title = 'customers_active_counts', xlabel='customers_active' , ylabel = 'Frequency')

ax1.bar(club_member_status_counts.index.values, club_member_status_counts.values, width=0.5, align='center')
ax1.set(title = 'club_member_status_counts', xlabel='club_member_status' , ylabel = 'Frequency')

ax2.bar(fashion_news_frequency_counts.index.values, fashion_news_frequency_counts.values, width=0.5, align='center')
ax2.set(title = 'fashion_news_frequency_counts', xlabel='fashion_news_frequency' , ylabel = 'Frequency')

# Title the figure
fig.suptitle('Frequency Distribution', fontsize=14, fontweight='bold');

### Exploring age

age is the only quantitative column in customers.csv.


In [None]:
#customers_df["age"].describe()

In [None]:
# Histogram (histplot replaces the deprecated distplot)
sns.histplot(customers_df['age'])

In [None]:
# KDE plot (smooth histogram); fill= replaces the deprecated shade= argument
sns.kdeplot(customers_df['age'], fill=True)

In [None]:
# Violin plot of age (one group only)
sns.violinplot(y=customers_df["age"])

Most customers are aged 20-30, with another peak around age 50. Note that these plots count customers, not purchases.
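
To put numbers on the peaks rather than reading them off a plot, the ages can be grouped into bands. This is a minimal sketch using `pd.cut` on hypothetical ages; the ten-year bin edges are an arbitrary choice.

```python
import pandas as pd

# Hypothetical ages standing in for customers_df['age']
ages = pd.Series([19, 22, 24, 25, 27, 29, 33, 41, 49, 51, 52, 67], name="age")

# Ten-year bands, closed on the left: [10, 20), [20, 30), ...
bins = [10, 20, 30, 40, 50, 60, 70]
age_bands = pd.cut(ages, bins=bins, right=False)

# Customers per band, in age order
customers_per_band = age_bands.value_counts().sort_index()
busiest_band = customers_per_band.idxmax()

print(customers_per_band)
print("busiest band:", busiest_band)
```

On the real column, nan ages fall outside every bin and are simply left out of the counts.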

## Exploring transactions_train

In [None]:
transactions_train_df.shape

There are 31,788,324 transactions.

In [None]:
transactions_train_df.info()

In [None]:
transactions_train_df.head()

In [None]:
transactions_train_df.tail()

In [None]:
sales_channel_id_counts = transactions_train_df['sales_channel_id'].value_counts()
sales_channel_id_counts # Series (index, value)

sales_channel_id_counts.index.values # array of index of Series
sales_channel_id_counts.values # array of value of Series

In [None]:
# Newer matplotlib releases no longer ship the 'seaborn-whitegrid' style, so set it through seaborn
sns.set_style('whitegrid')

# Get the figure and the axes (or subplots)

fig, (ax0) = plt.subplots(nrows=1, ncols=1, figsize=(8, 4))

ax0.bar(sales_channel_id_counts.index.values, sales_channel_id_counts.values, width=0.5, align='center')
ax0.set(title = 'sales_channel_id_counts', xlabel='sales_channel_id' , ylabel = 'Frequency')


# Title the figure
fig.suptitle('Frequency Distribution', fontsize=14, fontweight='bold');

In [None]:
list(transactions_train_df.columns)

In [None]:
transactions_train_df_from26Aug2020=transactions_train_df.tail(1000000) #Last 1 million records as data is huge
transactions_train_df_from26Aug2020.head()



In [None]:
transactions_train_df_from26Aug2020.tail()

In [None]:
transactions_train_df_from26Aug2020.shape
transactions_train_df_from26Aug2020.nunique()

There are 226,120 unique customers and 28,001 unique articles in the period from 26 Aug 2020 to 22 Sep 2020.
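
Taking the last million rows depends on the file being sorted by date. An alternative, sketched below on a hypothetical four-row frame indexed by `t_dat`, is to filter on the date itself. The cutoff date is taken from the text above; everything else is illustrative.

```python
import pandas as pd

# Hypothetical miniature of transactions_train_df, indexed by transaction date
toy_tx = pd.DataFrame(
    {
        "customer_id": ["a", "b", "a", "c"],
        "article_id": ["0108775015", "0108775044", "0110065001", "0108775015"],
    },
    index=pd.to_datetime(["2020-08-20", "2020-08-26", "2020-09-01", "2020-09-22"]),
)
toy_tx.index.name = "t_dat"

# Keep every transaction on or after the cutoff date
cutoff = pd.Timestamp("2020-08-26")
recent = toy_tx[toy_tx.index >= cutoff]

n_customers = recent["customer_id"].nunique()
n_articles = recent["article_id"].nunique()

print(recent.index.min().date(), n_customers, n_articles)
```

A boolean mask like this works whether or not the index is sorted, at the cost of scanning the whole index.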

In [None]:
# Set the width and height of the figure
plt.figure(figsize=(14,6))

# Add title
plt.title("Price")

# Line chart 
sns.lineplot(data=transactions_train_df_from26Aug2020['price'], label="Price")

# Add label for horizontal axis
plt.xlabel("Date")
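
With around a million rows and many transactions per day, it can help to aggregate the prices per day before plotting. The following is a minimal sketch, assuming a price Series indexed by date as in the cell above; the four prices are hypothetical.

```python
import pandas as pd

# Hypothetical prices indexed by transaction date, several rows per day
prices = pd.Series(
    [0.02, 0.04, 0.03, 0.05],
    index=pd.to_datetime(["2020-09-01", "2020-09-01", "2020-09-02", "2020-09-02"]),
    name="price",
)

# One value per calendar day: the mean price of that day's transactions
daily_mean_price = prices.resample("D").mean()

print(daily_mean_price)
```

The resulting daily Series can be passed to `sns.lineplot(data=...)` in the same way as the raw price column, with far fewer points to draw.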
