# H&M Data Exploration

In this notebook we investigate the categorical data in the articles table. The main objective is to get an overview of the data and to prepare the categorical columns so that they can be processed by ML techniques. Since categorical data needs to be encoded, we show here how to encode the product name of each article (the `prod_name` column) in five numerical columns.

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)


In [None]:
articles = pd.read_csv('../input/h-and-m-personalized-fashion-recommendations/articles.csv')
trans = pd.read_csv('../input/h-and-m-personalized-fashion-recommendations/transactions_train.csv')
customer  = pd.read_csv('../input/h-and-m-personalized-fashion-recommendations/customers.csv')

# Data info

An overview of the data we are dealing with. We are interested in the string columns of the articles table.

In [None]:
# Data size
print(f"articles.shape = {articles.shape}")
print(f"trans.shape = {trans.shape}")
print(f"customer.shape = {customer.shape}")

In [None]:
# Some info
articles.head().T

In [None]:
# The object dtype tells us that a column holds string data
articles.dtypes

We can see that product names repeat. Let's see how many unique names there are.

In [None]:
# Unique product names
uniqProdNameCnt = len(articles.loc[:, 'prod_name'].unique())
# Percent of unique product names
uniqProdNamePercent = 100*uniqProdNameCnt/articles.shape[0]
print(f"Percent of unique product names: {uniqProdNamePercent:.3f}%")

# Relation between `product_code` and `prod_name`

If there is a one-to-one correspondence between the `product_code` and `prod_name` columns, then the product name is already encoded. Let's check how many different names share the same product code.

In [None]:
print(f"Number of unique product codes: {len(articles.loc[:, 'product_code'].unique())}")

In [None]:
print(f"Number of unique product names: {len(articles.loc[:, 'prod_name'].unique())}")

Let's see how many codes are shared by more than one name, and what those names are.

In [None]:
diffNameList = []
for code in articles['product_code'].unique():
    apnlist = articles[articles.loc[:, 'product_code'] == code]['prod_name']
    if len(apnlist.unique()) > 1:
        diffNameList.append(apnlist.unique())
print(f"There are {len(diffNameList)} codes that encode more than one prod_name")

In [None]:
print("Different names encoded with the same product_code:\n")
# Slicing avoids an IndexError if fewer than 20 codes were found
for names in diffNameList[:20]:
    print(names)
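
The per-code loop above can also be written with `groupby`, which tends to be much faster on large tables. The following is a minimal sketch on a made-up toy table (the values are illustrative, not taken from the dataset); the variable names `toy`, `namesPerCode`, `multiNameCodes` and `diffNames` are assumptions introduced for this example.

```python
import pandas as pd

# Toy stand-in for the articles table (values are made up for illustration)
toy = pd.DataFrame({
    'product_code': [100, 100, 200, 200, 300],
    'prod_name': ['Strap top', 'Strap top (1)', 'Jade denim', 'Jade denim', 'Basic tee'],
})

# Number of distinct names per product code
namesPerCode = toy.groupby('product_code')['prod_name'].nunique()

# Codes that map to more than one name
multiNameCodes = namesPerCode[namesPerCode > 1].index

# The distinct names behind each of those codes
diffNames = (toy[toy['product_code'].isin(multiNameCodes)]
             .groupby('product_code')['prod_name']
             .unique())
print(f"There are {len(multiNameCodes)} codes that encode more than one prod_name")
print(diffNames)
```

On the real data, `toy` would be replaced by `articles`.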

OK, we conclude that we can use `product_code` as an encoding of the `prod_name` categorical data. Next we explore the product names a bit further to see whether we can extract more information from them before discarding the column.

# Word processing

We want to build a list of the most frequently used words in the product name column. First we lowercase all of the strings. In general, we can apply the lowercasing only to columns of 'object' dtype, since these are the columns with string data.

In [None]:
# Lowercase all of the following articles columns
artStrCols = articles.dtypes[articles.dtypes == np.dtype(object)]
print(artStrCols.keys())
for col in artStrCols.keys():
    articles.loc[:, col] = articles.loc[:, col].str.lower()    
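
As a small illustration of why lowercasing matters, the sketch below lowercases every string column of a made-up toy table and compares the number of unique names before and after. It selects the string columns with `pd.api.types.is_string_dtype` instead of comparing dtypes to `object`; that choice, the toy values, and the names `toy`, `strCols`, `uniqueBefore` and `uniqueAfter` are assumptions made for this example.

```python
import pandas as pd

# Toy table (values are made up for illustration)
toy = pd.DataFrame({
    'article_id': [1, 2, 3],
    'prod_name': ['Strap Top', 'STRAP TOP', 'Basic Tee'],
    'colour': ['Black', 'black', 'White'],
})

uniqueBefore = toy['prod_name'].nunique()

# Pick the columns that hold strings and lowercase them
strCols = [col for col in toy.columns if pd.api.types.is_string_dtype(toy[col])]
toy[strCols] = toy[strCols].apply(lambda col: col.str.lower())

uniqueAfter = toy['prod_name'].nunique()
print(f"unique names before: {uniqueBefore}, after: {uniqueAfter}")
```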

In [None]:
articles.head().T

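After lowercasing, the number of unique product names may decrease, since names that differed only in letter case now coincide. Let's recompute the percentage.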

In [None]:
# Unique product names
uniqProdNameCnt = len(articles.loc[:, 'prod_name'].unique())
# Percent of unique product names
uniqProdNamePercent = 100*uniqProdNameCnt/articles.shape[0]
print(f"Percent of unique product names: {uniqProdNamePercent:.3f}%")

Let's try some regular expressions for extracting words from the strings.

In [None]:
import re

# Let's take one specific product name
strOne = articles.loc[3, 'prod_name']
print(f"word: '{strOne}'")
# using regex( findall() ) to extract words from string
res = re.findall(r'\w+', strOne)
print(f"findall regex1: {res}")
res = re.findall(r'\b[0-9A-Za-z\-]+', strOne)
print(f"findall regex2: {res}")
# Regex 2 is better since it treats '-' as part of words
regexStr = r'\b[0-9A-Za-z\-]+'
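
To make the difference between the two patterns concrete, here is a small sketch on made-up strings (they are not taken from the dataset). It also shows one caveat of the second pattern: its character class is ASCII-only, so a word is cut at the first accented letter.

```python
import re

sample = 'strap top 2-pack'  # made-up product name

# \w+ splits at the hyphen
plainWords = re.findall(r'\w+', sample)
# The hyphen-aware pattern keeps '2-pack' together
hyphenWords = re.findall(r'\b[0-9A-Za-z\-]+', sample)
print(plainWords)
print(hyphenWords)

# Caveat: the ASCII-only class stops at the accented letter in 'caf\u00e9'
accentedWords = re.findall(r'\b[0-9A-Za-z\-]+', 'caf\u00e9 set')
print(accentedWords)
```

If the names contain many non-ASCII letters, a pattern such as `r'[\w\-]+'` might be a better fit, at the cost of also matching underscores.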

Let's check how our chosen regex splits the strings.

In [None]:
prodNameStrList = articles.loc[:, 'prod_name']
prodNameWordListList = prodNameStrList.str.findall(regexStr)
print(f"string series:\n{prodNameStrList.head()}\n")
print(f"word series:\n{prodNameWordListList.head()}")

Next we make a histogram of the most used words in `prod_name`. Since we have a Series of word lists, we first need to flatten it into a single Series of words. After flattening we can draw the histogram.

In [None]:
# explode() puts every word on its own row; dropna() removes rows that came from empty lists
prodNameWordList = prodNameWordListList.explode().dropna().reset_index(drop=True)
prodNameWordList.head()

In [None]:
# Series.value_counts() replaces the deprecated top-level pd.value_counts()
prodNameWordCount = prodNameWordList.value_counts()
print(prodNameWordCount[0:40])
prodNameWordCount[0:40].plot(kind='bar', figsize=(20,10))
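
The same word frequencies can be obtained without pandas, which can be handy for a quick check. This is a minimal sketch using only the standard library on made-up word lists; `wordLists`, `wordCounter` and `topWords` are names introduced for this example.

```python
from collections import Counter
from itertools import chain

# Made-up word lists, one list per product name (an empty list is allowed)
wordLists = [['strap', 'top'], ['strap', 'top', '2-pack'], ['basic', 'tee'], ['strap'], []]

# Flatten the lists and count every word in one pass
wordCounter = Counter(chain.from_iterable(wordLists))

# The two most frequent words with their counts
topWords = wordCounter.most_common(2)
print(topWords)
```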

# Histogram of word count in a string column

For later use it is good to know the distribution of the number of words per string in a string column. This will help us choose how many columns to add for encoding the string.

In [None]:
# Number of words in each product name (length of each word list)
wordCount = prodNameWordListList.str.len()

In [None]:
wordDist = wordCount.value_counts(normalize=True).sort_index()
wordDist

In [None]:
wordDist.plot(kind='bar', figsize=(20,10))

We use the cumulative sum of the word-count distribution to decide how many columns to use for encoding. For example, if we want to cover more than 95% of the product names, we should use 5 columns for encoding, as the following cells show.

In [None]:
wordDistCumsum = wordDist.cumsum()
wordDistCumsum

In [None]:
wordDistCumsum[wordDistCumsum > 0.95].index.min()
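
The selection rule can be wrapped in a small helper so that it is easy to try other coverage levels. This is a sketch under the assumption that the input is a Series of word counts, one per string; the function name `columnsForCoverage` and the toy counts are made up for this example.

```python
import pandas as pd

def columnsForCoverage(wordCounts, coverage=0.95):
    """Smallest word count whose cumulative share of strings exceeds `coverage`."""
    dist = wordCounts.value_counts(normalize=True).sort_index()
    cumulative = dist.cumsum()
    # Assumes coverage < 1, otherwise no entry may exceed the threshold
    return int(cumulative[cumulative > coverage].index.min())

# Made-up word counts for ten strings
toyCounts = pd.Series([1, 2, 2, 2, 3, 3, 3, 3, 4, 6])
nCols = columnsForCoverage(toyCounts, coverage=0.85)
print(nCols)
```

With these toy counts, 80% of the strings have at most three words and 90% have at most four, so a coverage of 0.85 needs four columns.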

# Encode string data

Here we construct a new articles table consisting only of numerical data. We show how to encode the product name and additionally extract some information that is not present in the `product_code`.

In [None]:
print("Create new table with data from numerical columns:\n")
artNumCols = articles.dtypes[articles.dtypes == np.dtype(np.int64)]
articlesCoded = pd.DataFrame({key:articles.loc[:, key] for key in artNumCols.keys()})
articlesCoded.head().T

Create a DataFrame that assigns a code to every word that appears in the `prod_name` column. Words are numbered in order of decreasing frequency.

In [None]:
code = [i for i in range(len(prodNameWordCount))]
prodName = pd.DataFrame({'code': code, 'count': prodNameWordCount.values}, index=prodNameWordCount.index)
prodName.iloc[0:10, :]

We add five more columns to the numeric articles table, in which we encode the words from `prod_name`. If a name has fewer than five words, we put `None` in the remaining columns. Before encoding, we sort the words of each name by their frequency of appearance in the `prod_name` column. Sorting by frequency favors the more common words (when the `prod_name` consists of more than five words). With inverse sorting we would instead favor the words that are more specific to the product name.

Let's see how the words are ordered by code value (the smaller the code, the more frequent the word).

In [None]:
alists = articles.loc[0:9, 'prod_name'].str.findall(regexStr)
print(f"alists:\n{alists}\n")
print(f"alists[0].code.sort:\n{prodName.loc[alists[0], 'code'].sort_values()[0:5]}\n")
print(f"alists[8].code.sort:\n{prodName.loc[alists[8], 'code'].sort_values()[0:5]}\n")

Now we use the above encoding to add the `prod_name` words to the `articlesCoded` DataFrame.

In [None]:
alists = articles.loc[:, 'prod_name'].str.findall(regexStr)
# Sort the word codes of every product name once (smaller code = more frequent word)
sortedCodes = [prodName.loc[words, 'code'].sort_values().tolist() for words in alists]
for i in range(5):
    colName = f'prod_name_{i}'
    # Use None when the product name has fewer than i + 1 words
    articlesCoded[colName] = [codes[i] if i < len(codes) else None for codes in sortedCodes]
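
To make the whole encoding step easier to follow, here is a self-contained sketch on a made-up toy table. It uses a plain dictionary for the word codes and three columns instead of five. The helper `encodeName`, its `rarestFirst` switch (which corresponds to the inverse sorting mentioned earlier) and the use of the nullable `Int64` dtype are assumptions made for this example, not part of the notebook's own code.

```python
import pandas as pd

regexStr = r'\b[0-9A-Za-z\-]+'
# Made-up product names for illustration
toy = pd.DataFrame({'prod_name': ['strap top', 'strap top 2-pack', 'basic tee', 'strap']})

wordLists = toy['prod_name'].str.findall(regexStr)

# Code 0 is the most frequent word, code 1 the next one, and so on
wordFreq = wordLists.explode().value_counts()
wordCode = {word: code for code, word in enumerate(wordFreq.index)}

def encodeName(words, nCols=3, rarestFirst=False):
    """Return exactly nCols codes for one name, padded with None."""
    codes = sorted((wordCode[w] for w in words), reverse=rarestFirst)[:nCols]
    return codes + [None] * (nCols - len(codes))

encoded = pd.DataFrame(
    wordLists.apply(encodeName).tolist(),
    columns=[f'prod_name_{i}' for i in range(3)],
    index=toy.index,
).astype('Int64')  # nullable integers keep the codes integral despite missing values
print(encoded)
```

Casting to `Int64` is optional; without it, columns that contain missing values are stored as floats.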

Voila! We created five additional columns of meaningful, encoded data based on a categorical column that we can now discard.

In [None]:
articlesCoded.loc[0:8].T

Next, you can apply the same method to all of the categorical string columns and prepare the table for XGBoost, for example.
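
For columns whose values are single labels rather than multi-word names, a simpler integer encoding may be enough. The sketch below uses `pd.factorize`, which is a different technique from the word-level encoding shown in this notebook. The toy table and its values are made up for illustration.

```python
import pandas as pd

# Toy table with two label-like string columns (values are made up)
toy = pd.DataFrame({
    'colour_group_name': ['black', 'white', 'black', 'blue'],
    'index_name': ['ladieswear', 'menswear', 'ladieswear', 'ladieswear'],
})

coded = pd.DataFrame(index=toy.index)
labels = {}
for col in toy.columns:
    # factorize numbers the labels in order of first appearance
    codes, uniques = pd.factorize(toy[col])
    coded[col] = codes
    labels[col] = list(uniques)  # keep the mapping so the codes can be decoded later
print(coded)
```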