# Relationships between article columns

One of the datasets provided in the [H&M Personalized Fashion Recommendations competition](https://www.kaggle.com/c/h-and-m-personalized-fashion-recommendations) consists of article information in tabular form. 

The `articles` dataset combines product information, such as color or product name, with general information such as the department number. The data is denormalized (flattened) so that it can easily be fed into ML models. However, the denormalization makes the hierarchies of the underlying data model hard to recognize, and it is also hard to spot which columns are redundant.
For instance, we will see that there is no 1:1 relationship between `product_type_no` and `product_type_name`.

The information about these relationships and hierarchies can later be used to create embeddings for articles or to build models with a particular focus.

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

import seaborn as sns
from matplotlib import pyplot as plt

import graphviz

import os

Load articles:

In [None]:
articles = pd.read_csv('../input/h-and-m-personalized-fashion-recommendations/articles.csv')

Check whether there is a *1:1*, *1:n* or *m:n* relationship between each pair of columns. The hierarchy is decomposed by comparing the number of distinct values in each column with the number of distinct value pairs.
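To make the counting rule concrete, here is a minimal sketch on a hypothetical toy table (the columns `item_id`, `dept_no`, `dept_name`, `colour` and the helper `relationship` are made up for illustration and are not part of `articles.csv`). If the number of distinct pairs equals the number of distinct values of one column, that column determines the other one.

```python
import pandas as pd

# Hypothetical toy data, not taken from articles.csv
toy = pd.DataFrame({
    "item_id":   [1, 2, 3, 4],
    "dept_no":   [10, 10, 20, 30],
    "dept_name": ["Socks", "Socks", "Shirts", "Shirts"],
    "colour":    ["red", "blue", "red", "blue"],
})

def relationship(df, col1, col2):
    """Classify col1:col2 by comparing distinct values with distinct pairs."""
    n1 = df[col1].nunique()
    n2 = df[col2].nunique()
    n_pairs = len(df[[col1, col2]].drop_duplicates())
    if n1 == n_pairs and n2 == n_pairs:
        return "1:1"
    if n1 == n_pairs:
        return "n:1"  # every col1 value belongs to exactly one col2 value
    if n2 == n_pairs:
        return "1:n"  # every col2 value belongs to exactly one col1 value
    return "m:n"

print(relationship(toy, "item_id", "dept_no"))    # item_id determines dept_no
print(relationship(toy, "dept_no", "dept_name"))  # several numbers share a name
print(relationship(toy, "dept_no", "colour"))     # no hierarchy
```

In this toy table, `dept_no` plays the role of a child of `dept_name`, much like the department columns discussed further down.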

In [None]:
a_unq = articles.nunique()
art_cols = articles.columns

In [None]:
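# Decompose the column hierarchy
def column_dependencies(articles, verbose=False):
    # Caveat: nunique() ignores NaN while drop_duplicates() keeps it,
    # so a column with missing values can be misclassified.
    a_unq = articles.nunique()
    art_cols = articles.columns

    # mx[child, parent] = 1 for n:1, mx[i1, i2] = 2 for 1:1, 0 for m:n
    mx = np.zeros((len(art_cols), len(art_cols)))

    if verbose:
        print('# List relationships. All others are m:n')

    for i1 in range(len(art_cols)):
        for i2 in range(i1 + 1, len(art_cols)):
            col1 = art_cols[i1]
            col2 = art_cols[i2]

            # Number of distinct (col1, col2) combinations
            pair_nunique = articles[[col1, col2]].drop_duplicates().shape[0]

            if a_unq[col1] == pair_nunique and a_unq[col2] == pair_nunique:
                # Equal unique counts alone do not prove 1:1;
                # the number of distinct pairs has to match as well.
                rel = '1:1'
                mx[i1, i2] = 2
            elif a_unq[col1] == pair_nunique:
                # col1 determines col2: col1 is the child, col2 the parent
                rel = 'n:1'
                mx[i1, i2] = 1
            elif a_unq[col2] == pair_nunique:
                # col2 determines col1: col2 is the child, col1 the parent
                rel = '1:n'
                mx[i2, i1] = 1
            else:
                rel = 'm:n'

            if verbose and rel != 'm:n' and col1 != 'article_id':
                print(col1, rel, col2)
    return mx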
    

In [None]:
col_deps = column_dependencies(articles, verbose = False)

Let's plot the relationships. 
- Red marks a 1:1 relationship (e.g. `index_name` : `index_code`).
- Black marks an n:1 relationship. Read it as: `article_id` is a child of `department_no` and a grandchild of `section_no`.
- Blue marks m:n relationships, i.e. no technical hierarchy.

In [None]:
mask = np.ones_like(col_deps)
mask[np.triu_indices_from(mask,1)] = 0

sns.set(rc={'figure.figsize':(12,10)})
sns.color_palette("tab10")

#with sns.axes_style("white"):
ax = sns.heatmap(col_deps, 
                 xticklabels = articles.columns, 
                 yticklabels = articles.columns, 
                 cmap= sns.color_palette("icefire",3),
                 linewidths = 1,
                 mask = mask
                 #cbar=False
                )
ax.set(xlabel='(grand-)parent', ylabel='(grand-)child')
colorbar = ax.collections[0].colorbar
colorbar.set_ticks([1/3, 1, 5/3])
colorbar.set_ticklabels(['n:m', ' n:1 ', '1:1'])
ax.set_title('Relationship between columns')
plt.show()

If we traverse the map from the bottom right to the upper left and, in each column, look at the first black box directly above the diagonal, its row gives that column's next direct child.

In [None]:
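# For every column (parent), pick the row (child) with the highest index that has a dependency
children = (np.sign(col_deps)*(np.arange(1,len(articles.columns)+1).reshape(-1,1))).argmax(axis=0)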
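The `argmax` trick above can be traced on a small example. The following sketch uses a hypothetical 4-column dependency matrix (`toy_deps`, with made-up column order item, dept, section, group) in the same layout as `col_deps`: rows are children, columns are parents. Weighting each row by its index and taking the column-wise `argmax` selects, for every parent, the descendant closest to it.

```python
import numpy as np

# Hypothetical dependency matrix; column order: item, dept, section, group
toy_deps = np.array([
    [0, 1, 1, 1],  # item descends from dept, section and group
    [0, 0, 1, 1],  # dept descends from section and group
    [0, 0, 0, 1],  # section descends from group
    [0, 0, 0, 0],  # group has no parent
])

# Weight row i with i + 1 so that lower rows win the argmax
row_weight = np.arange(1, toy_deps.shape[0] + 1).reshape(-1, 1)
toy_children = (np.sign(toy_deps) * row_weight).argmax(axis=0)

print(toy_children)  # index of the direct child for each column
```

A column without any descendant ends up with index 0. In this layout that is the first column itself, which is why the graph code skips edges from a column to itself.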

So now we can draw a graph of the hierarchy along with the number of unique values in each column.

In [None]:
g = graphviz.Graph('col_dep')#, graph_attr={'rankdir':'LR', 'size':'15,10'})
g.attr('node', shape='box')

for col, child in zip(articles.columns, children):
    g.node(col, label = f'{col} ({a_unq[col]})')
    if col != articles.columns[child]:
        g.edge(articles.columns[child],col)

g

The 1:1 relationships are between the `*_no` and `*_name` columns, except for `department_name`, where one name covers several `department_no` values. That's not a surprise: it's probably hard to come up with 299 meaningful names for departments.
Each `*_no`/`*_name` pair can also be interpreted as an entity, e.g. the entity `department` with primary key `department_no` and attribute `department_name`.
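As an illustration of the entity view, a key/attribute pair can be pulled out into its own lookup table. This is a minimal sketch on hypothetical toy rows (`toy_articles` and its values are invented for the example), not a step the notebook performs on the real data.

```python
import pandas as pd

# Hypothetical rows in the shape of the articles table
toy_articles = pd.DataFrame({
    "article_id":      [1, 2, 3, 4],
    "department_no":   [10, 10, 20, 30],
    "department_name": ["Socks", "Socks", "Shirts", "Shirts"],
})

# Entity `department`: primary key department_no, attribute department_name
department = (
    toy_articles[["department_no", "department_name"]]
    .drop_duplicates()
    .set_index("department_no")
)

# The key is only valid if every department_no occurs exactly once
print(department.index.is_unique)
print(department)
```

If `is_unique` were false, at least one `department_no` would carry more than one name, and the pair could not be treated as key and attribute.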

So there are two areas that don't match the expected hierarchical pattern.

The first one is `product_type_no`, `product_type_name` and `product_group_name`. Let's inspect what causes the ambiguous hierarchy. Which `product_type_name` values occur in more than one group?

In [None]:
articles[['product_type_name', 'product_group_name']].drop_duplicates().groupby('product_type_name').count().sort_values(by='product_group_name', ascending = False).head(n=3)

In [None]:
articles[articles.product_type_name=='Umbrella'][['product_type_name', 'product_type_no', 'product_group_name']].drop_duplicates()

OK, *Umbrella* is ambiguous: it belongs to both the *Items* and the *Accessories* group.

The other area that is ambiguous from a hierarchical point of view includes `product_code`, `prod_name` and `detail_desc`. The number of unique values here is huge, so it's not surprising that `product_code` and `prod_name` don't form a 1:1 relationship. We leave that part as it is and assume `prod_name` is an attribute of `product_code`.
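One way to test such an assumption is to count names per code and codes per name. The sketch below does this on hypothetical toy rows (`toy_products` and its values are invented); it does not report results for the real `articles` table.

```python
import pandas as pd

# Hypothetical rows; the same name is reused for two product codes
toy_products = pd.DataFrame({
    "product_code": [100, 100, 200, 300],
    "prod_name":    ["Basic tee", "Basic tee", "Basic tee", "Slim jeans"],
})

# Distinct names per code: a maximum of 1 means the name behaves like an attribute
names_per_code = toy_products.groupby("product_code")["prod_name"].nunique()

# Distinct codes per name: values above 1 are names shared by several products
codes_per_name = toy_products.groupby("prod_name")["product_code"].nunique()

print(names_per_code.max())
print(codes_per_name[codes_per_name > 1])
```

In this toy case every code has a single name, but one name is shared by two codes, so the pair is n:1 rather than 1:1.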

For a better overview, we redraw the graph without the name columns:

In [None]:
no_name_cols = [col for col in articles.columns if not col.endswith('name')]
col_deps_no_name = column_dependencies(articles[no_name_cols], verbose = False)
children_no_name = (np.sign(col_deps_no_name)*(np.arange(1,len(no_name_cols)+1).reshape(-1,1))).argmax(axis=0)

g_no_name = graphviz.Graph('col_dep_no_name')#, graph_attr={'rankdir':'LR', 'size':'15,10'})
g_no_name.attr('node', shape='box')

for col, child in zip(no_name_cols, children_no_name):
    g_no_name.node(col, label = f'{col} ({a_unq[col]})')
    if col != no_name_cols[child]:
        g_no_name.edge(no_name_cols[child],col)

g_no_name

... to be continued ...