## **<span style="color:#023e8a;">Intro</span>**

**<span style="color:#023e8a;">This competition is about product recommendations for H&M.</span>**

**<span style="color:#023e8a;">Several kinds of data are available to help produce good recommendations:</span>**

📸 `images` - images of every article_id

🙋 `articles`  - detailed metadata of every article_id

👔 `customers`  - detailed metadata of every customer_id

🧾 `transactions_train`  - purchases with details

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt
from tqdm.notebook import tqdm

In [2]:
articles = pd.read_csv("../input/h-and-m-personalized-fashion-recommendations/articles.csv")
customers = pd.read_csv("../input/h-and-m-personalized-fashion-recommendations/customers.csv")
transactions = pd.read_csv("../input/h-and-m-personalized-fashion-recommendations/transactions_train.csv")

## **<span id="Articles" style="color:#023e8a;">2. Articles</span>**

**<span style="color:#023e8a;"> This table contains all H&M articles with details such as product type, colour, product group and other features.</span>**  
**<span style="color:#023e8a;"> Article data description: </span>**

> `article_id` **<span style="color:#023e8a;">: A unique identifier of every article.</span>**  
> `product_code`, `prod_name` **<span style="color:#023e8a;">: An identifier of every product and its name (the two do not map one-to-one).</span>**  
> `product_type_no`, `product_type_name` **<span style="color:#023e8a;">: The product type that groups product codes, and its name</span>**  
> `graphical_appearance_no`, `graphical_appearance_name` **<span style="color:#023e8a;">: The group of graphics and its name</span>**  
> `colour_group_code`, `colour_group_name` **<span style="color:#023e8a;">: The group of color and its name</span>**  
> `perceived_colour_value_id`, `perceived_colour_value_name`, `perceived_colour_master_id`, `perceived_colour_master_name` **<span style="color:#023e8a;">: Additional color information</span>**  
> `department_no`, `department_name` **<span style="color:#023e8a;">: A unique identifier of every department and its name</span>**  
> `index_code`, `index_name` **<span style="color:#023e8a;">: A unique identifier of every index and its name</span>**  
> `index_group_no`, `index_group_name` **<span style="color:#023e8a;">: A group of indices and its name</span>**  
> `section_no`, `section_name` **<span style="color:#023e8a;">: A unique identifier of every section and its name</span>**  
> `garment_group_no`, `garment_group_name` **<span style="color:#023e8a;">: A unique identifier of every garment group and its name</span>**  
> `detail_desc` **<span style="color:#023e8a;">: A free-text description of the article</span>**  
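
The remark that `product_code` and `prod_name` are not the same thing can be checked by counting distinct names per code. The snippet below is only a sketch on a toy frame modelled on the `articles.head()` rows; `toy_articles`, `names_per_code` and `ambiguous_codes` are illustrative names, not part of the notebook.

```python
import pandas as pd

# Toy rows modelled on articles.head(); illustrative only, not the full table.
toy_articles = pd.DataFrame({
    "article_id": [108775015, 108775044, 108775051, 110065001],
    "product_code": [108775, 108775, 108775, 110065],
    "prod_name": ["Strap top", "Strap top", "Strap top (1)", "OP T-shirt (Idro)"],
})

# Number of distinct names attached to each product code.
names_per_code = toy_articles.groupby("product_code")["prod_name"].nunique()

# Codes carrying more than one name, i.e. code and name are not one-to-one.
ambiguous_codes = names_per_code[names_per_code > 1].index.tolist()
print(ambiguous_codes)
```

On the real table the same two lines would run against `articles` instead of `toy_articles`.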

In [3]:
articles.head()

Unnamed: 0,article_id,product_code,prod_name,product_type_no,product_type_name,product_group_name,graphical_appearance_no,graphical_appearance_name,colour_group_code,colour_group_name,...,department_name,index_code,index_name,index_group_no,index_group_name,section_no,section_name,garment_group_no,garment_group_name,detail_desc
0,108775015,108775,Strap top,253,Vest top,Garment Upper body,1010016,Solid,9,Black,...,Jersey Basic,A,Ladieswear,1,Ladieswear,16,Womens Everyday Basics,1002,Jersey Basic,Jersey top with narrow shoulder straps.
1,108775044,108775,Strap top,253,Vest top,Garment Upper body,1010016,Solid,10,White,...,Jersey Basic,A,Ladieswear,1,Ladieswear,16,Womens Everyday Basics,1002,Jersey Basic,Jersey top with narrow shoulder straps.
2,108775051,108775,Strap top (1),253,Vest top,Garment Upper body,1010017,Stripe,11,Off White,...,Jersey Basic,A,Ladieswear,1,Ladieswear,16,Womens Everyday Basics,1002,Jersey Basic,Jersey top with narrow shoulder straps.
3,110065001,110065,OP T-shirt (Idro),306,Bra,Underwear,1010016,Solid,9,Black,...,Clean Lingerie,B,Lingeries/Tights,1,Ladieswear,61,Womens Lingerie,1017,"Under-, Nightwear","Microfibre T-shirt bra with underwired, moulde..."
4,110065002,110065,OP T-shirt (Idro),306,Bra,Underwear,1010016,Solid,10,White,...,Clean Lingerie,B,Lingeries/Tights,1,Ladieswear,61,Womens Lingerie,1017,"Under-, Nightwear","Microfibre T-shirt bra with underwired, moulde..."


In [4]:
len(articles)

105542

## **<span id="Customers" style="color:#023e8a;">3. Customers</span>**

**<span style="color:#023e8a;"> Customers data description: </span>**

> `customer_id` **<span style="color:#023e8a;">: A unique identifier of every customer</span>**  
> `FN` **<span style="color:#023e8a;">: 1 or missing</span>**  
> `Active` **<span style="color:#023e8a;">: 1 or missing</span>**  
> `club_member_status` **<span style="color:#023e8a;">: Status in club</span>**  
> `fashion_news_frequency` **<span style="color:#023e8a;">: How often H&M may send news to customer</span>**  
> `age` **<span style="color:#023e8a;">: The current age</span>**  
> `postal_code` **<span style="color:#023e8a;">: Postal code of customer</span>**  
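
Since `FN` and `Active` hold either 1 or a missing value, one option is to turn them into explicit 0/1 flags. This is a sketch on made-up rows, and treating a missing value as 0 is an assumption that the data description does not confirm.

```python
import numpy as np
import pandas as pd

# Made-up customers; only the 1-or-missing pattern matches the real columns.
toy_customers = pd.DataFrame({
    "customer_id": ["c1", "c2", "c3", "c4"],
    "FN": [np.nan, np.nan, np.nan, 1.0],
    "Active": [np.nan, np.nan, np.nan, 1.0],
})

flag_columns = ["FN", "Active"]
# Assumption: a missing flag means "not set", so it becomes 0.
toy_customers[flag_columns] = toy_customers[flag_columns].fillna(0).astype(int)

fn_counts = toy_customers["FN"].value_counts().to_dict()
print(fn_counts)
```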

In [5]:
pd.options.display.max_rows = 50
customers.head()

Unnamed: 0,customer_id,FN,Active,club_member_status,fashion_news_frequency,age,postal_code
0,00000dbacae5abe5e23885899a1fa44253a17956c6d1c3...,,,ACTIVE,NONE,49.0,52043ee2162cf5aa7ee79974281641c6f11a68d276429a...
1,0000423b00ade91418cceaf3b26c6af3dd342b51fd051e...,,,ACTIVE,NONE,25.0,2973abc54daa8a5f8ccfe9362140c63247c5eee03f1d93...
2,000058a12d5b43e67d225668fa1f8d618c13dc232df0ca...,,,ACTIVE,NONE,24.0,64f17e6a330a85798e4998f62d0930d14db8db1c054af6...
3,00005ca1c9ed5f5146b52ac8639a40ca9d57aeff4d1bd2...,,,ACTIVE,NONE,54.0,5d36574f52495e81f019b680c843c443bd343d5ca5b1c2...
4,00006413d8573cd20ed7128e53b7b13819fe5cfc2d801f...,1.0,1.0,ACTIVE,Regularly,52.0,25fa5ddee9aac01b35208d01736e57942317d756b32ddd...


In [6]:
#quick_bar_chart('postal_code', (3, 3), False, customers, True)
len(customers)

1371980

In [7]:
import sys
import time

import logging
import threading

import math

In [8]:
class ProgressBar:
    """Print progress and a rough total-time estimate to stdout."""

    def __init__(self, transactions, trans_len = None):
        self.percent_intervals = 3
        self.percent_rounded_decimals = 2
        if trans_len is None:
            self.trans_len = len(transactions)
        else:
            self.trans_len = trans_len
        # Report roughly every 0.1% of the rows; at least 1 to avoid a modulo by zero.
        self.percent = max(1, int(self.trans_len / 10**self.percent_intervals))
        # Small offset that keeps the estimate finite at 0% progress.
        self.epsilon = 10**-2
        self.start = time.time()

    def check(self, i, chunked=False):
        if i % self.percent == 0 or i + 1 == self.trans_len or chunked:
            end = time.time()
            # Guard against a zero denominator when there is only one row.
            percent_decimal = i / max(1, self.trans_len - 1)
            percent_current = percent_decimal * 100
            time_elapsed = end - self.start
            time_estimated = time_elapsed / (percent_decimal + self.epsilon)
            sys.stdout.write("%6.2f%% time elapsed: %d, estimated: %d\r" % (percent_current, time_elapsed, time_estimated))
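
The estimate printed by `ProgressBar` is a linear extrapolation: elapsed time divided by the fraction of rows done (plus a small `epsilon`). A minimal sketch of the same idea without the offset is shown here; `estimate_total_seconds` is a hypothetical helper, not something the notebook defines.

```python
def estimate_total_seconds(elapsed, done, total):
    """Linear extrapolation: total time is roughly elapsed / fraction done."""
    if done <= 0 or total <= 0:
        # Nothing processed yet, so no estimate is possible.
        return None
    return elapsed * total / done


# 25 of 100 items took 10 seconds, so the whole run should take about 40.
print(estimate_total_seconds(10.0, 25, 100))
```

Returning `None` at 0% is one way to avoid the division by zero that the `epsilon` term works around.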

In [9]:
articles_id_hash = {}
pb = ProgressBar(articles)
for i in range(len(articles)):
    columns = ['prod_name',
    'product_type_name',
    'product_group_name',
    'graphical_appearance_name',
    'colour_group_name',
    'department_name',
    'index_name',
    'index_group_name',
    'section_name',
    'garment_group_name',
    'detail_desc']
    row = []
    for column in columns:
        row.append(articles[column][i])
    articles_id = articles['article_id'][i]
    articles_id_hash[articles_id] = row
    pb.check(i)

100.00% time elapsed: 7, estimated: 7
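
The lookup above is built with one pandas scalar access per cell. A possible alternative, sketched here on a two-row toy frame with a shortened column list, iterates over plain tuples instead; `toy_articles` and `lookup` are illustrative names.

```python
import pandas as pd

# Two toy rows modelled on articles.head(); the column list is shortened.
toy_articles = pd.DataFrame({
    "article_id": [108775015, 108775044],
    "prod_name": ["Strap top", "Strap top"],
    "colour_group_name": ["Black", "White"],
})
columns = ["prod_name", "colour_group_name"]

# itertuples(index=False, name=None) yields one plain tuple per row.
lookup = {
    article_id: list(values)
    for article_id, values in zip(
        toy_articles["article_id"].tolist(),
        toy_articles[columns].itertuples(index=False, name=None),
    )
}
print(lookup[108775044])
```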

In [10]:
articles.head()

Unnamed: 0,article_id,product_code,prod_name,product_type_no,product_type_name,product_group_name,graphical_appearance_no,graphical_appearance_name,colour_group_code,colour_group_name,...,department_name,index_code,index_name,index_group_no,index_group_name,section_no,section_name,garment_group_no,garment_group_name,detail_desc
0,108775015,108775,Strap top,253,Vest top,Garment Upper body,1010016,Solid,9,Black,...,Jersey Basic,A,Ladieswear,1,Ladieswear,16,Womens Everyday Basics,1002,Jersey Basic,Jersey top with narrow shoulder straps.
1,108775044,108775,Strap top,253,Vest top,Garment Upper body,1010016,Solid,10,White,...,Jersey Basic,A,Ladieswear,1,Ladieswear,16,Womens Everyday Basics,1002,Jersey Basic,Jersey top with narrow shoulder straps.
2,108775051,108775,Strap top (1),253,Vest top,Garment Upper body,1010017,Stripe,11,Off White,...,Jersey Basic,A,Ladieswear,1,Ladieswear,16,Womens Everyday Basics,1002,Jersey Basic,Jersey top with narrow shoulder straps.
3,110065001,110065,OP T-shirt (Idro),306,Bra,Underwear,1010016,Solid,9,Black,...,Clean Lingerie,B,Lingeries/Tights,1,Ladieswear,61,Womens Lingerie,1017,"Under-, Nightwear","Microfibre T-shirt bra with underwired, moulde..."
4,110065002,110065,OP T-shirt (Idro),306,Bra,Underwear,1010016,Solid,10,White,...,Clean Lingerie,B,Lingeries/Tights,1,Ladieswear,61,Womens Lingerie,1017,"Under-, Nightwear","Microfibre T-shirt bra with underwired, moulde..."


In [11]:
customers_id_hash = {}
nan_count = 0
pb = ProgressBar(customers)
for i in range(len(customers)):
    customer_age = customers['age'][i]
    customer_id = customers['customer_id'][i]
    customers_id_hash[customer_id] = customer_age
    if math.isnan(customer_age):
        nan_count += 1
    pb.check(i)

100.00% time elapsed: 16, estimated: 15

In [12]:
nan_count / len(customers) * 100

1.1560664149623172
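
The same id-to-age mapping and missing-age percentage can also be obtained without an explicit loop. This is a sketch on made-up customers; `age_by_customer` and `nan_percent` are illustrative names.

```python
import math

import pandas as pd

# Made-up customers; one age is missing on purpose.
toy_customers = pd.DataFrame({
    "customer_id": ["a", "b", "c", "d"],
    "age": [49.0, None, 24.0, 54.0],
})

# Map customer_id -> age in one pass over the two columns.
age_by_customer = dict(zip(toy_customers["customer_id"], toy_customers["age"]))

# isna() gives booleans; their mean is the share of missing ages.
nan_percent = toy_customers["age"].isna().mean() * 100
print(nan_percent)
```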

In [None]:
customers_id_hash['00000dbacae5abe5e23885899a1fa44253a17956c6d1c3d25f88aa139fdfc657']

In [None]:
np.max(customers['age']), np.min(customers['age']), np.mean(customers['age']), np.median(customers['age'])

## **<span id="Transactions" style="color:#023e8a;">4. Transactions</span>**

[**<span style="color:#FEF1FE;background-color:#023e8a;border-radius: 5px;padding: 2px">Go to Table of Content</span>**](#Content)

**<span style="color:#023e8a;"> Transactions data description: </span>**

> `t_dat` **<span style="color:#023e8a;">: Date of the transaction</span>**  
> `customer_id` **<span style="color:#023e8a;">: A unique identifier of every customer </span>**  **<span style="color:#FF0000;">(in </span>** `customers` **<span style="color:#FF0000;"> table)</span>**  
> `article_id` **<span style="color:#023e8a;">: A unique identifier of every article</span>**  **<span style="color:#FF0000;">(in </span>** `articles` **<span style="color:#FF0000;"> table)</span>**  
> `price` **<span style="color:#023e8a;">: Price of purchase</span>**  
> `sales_channel_id` **<span style="color:#023e8a;">: 1 or 2</span>**  
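
Because `t_dat` is read from CSV as text, it usually helps to parse it before doing any date arithmetic. The sketch below uses made-up dates and prices (the article ids are taken from the notebook output) and assumes an ISO `YYYY-MM-DD` format.

```python
import pandas as pd

# Made-up transactions; dates and prices are illustrative only.
toy_transactions = pd.DataFrame({
    "t_dat": ["2020-01-01", "2020-01-01", "2020-01-08"],
    "customer_id": ["a", "b", "a"],
    "article_id": [663713001, 541518023, 505221004],
    "price": [0.05, 0.03, 0.02],
    "sales_channel_id": [2, 2, 1],
})

# Assumption: dates are stored as YYYY-MM-DD strings.
toy_transactions["t_dat"] = pd.to_datetime(toy_transactions["t_dat"], format="%Y-%m-%d")

# Number of days covered by the toy data.
span_days = (toy_transactions["t_dat"].max() - toy_transactions["t_dat"].min()).days
print(span_days)
```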

In [None]:
transactions.head()

In [None]:
len(transactions)

In [None]:
percent_intervals = 3
percent_rounded_decimals = 2
trans_len = len(transactions)
percent = int(trans_len / 10**percent_intervals)
epsilon = 10**-2
start = time.time()
for i in range(0,len(transactions),30):
    transactions['t_dat'][i]
    if i % percent == 0 or i + 1 == trans_len:
        end = time.time()
        percent_decimal = i / (trans_len - 1)
        percent_current = percent_decimal * 100
        time_elapsed = end - start
        time_estimated = time_elapsed / (percent_decimal + epsilon)
        sys.stdout.write("%6.2f%% time elapsed: %d, estimated: %d\r" % (percent_current, time_elapsed, time_estimated))

In [None]:
t_dat = transactions['t_dat']
pb = ProgressBar(transactions)
for i in range(0,len(transactions),1):
    t_dat[i]
    pb.check(i)

In [None]:
speed_list = []
pb = ProgressBar(transactions)
for i in range(0,len(transactions),1):
    sub_list = []
    sub_list.append(transactions['t_dat'][i])
    sub_list.append(transactions['customer_id'][i])
    sub_list.append(transactions['article_id'][i])
    sub_list.append(transactions['price'][i])
    speed_list.append(sub_list)
    pb.check(i)

In [None]:
len(speed_list)

In [None]:
new_list = []
pb = ProgressBar(speed_list)
for i in range(0,len(speed_list)):
    new_list.append(speed_list[i][2])
    if i % 1000000 == 0:
        print(speed_list[i][2])
    pb.check(i)
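
The loops above call `transactions['column'][i]` for every row, which goes through a pandas label lookup each time. Converting each column to a Python list once and zipping the lists is usually much faster. A sketch on made-up rows, with `toy_transactions`, `rows` and `article_ids` as illustrative names:

```python
import pandas as pd

# Made-up transactions; dates and prices are illustrative only.
toy_transactions = pd.DataFrame({
    "t_dat": ["2020-01-01", "2020-01-01", "2020-01-08"],
    "customer_id": ["a", "b", "a"],
    "article_id": [663713001, 541518023, 505221004],
    "price": [0.05, 0.03, 0.02],
})

# Convert each column once, then walk the lists in lockstep.
rows = [
    list(row)
    for row in zip(
        toy_transactions["t_dat"].tolist(),
        toy_transactions["customer_id"].tolist(),
        toy_transactions["article_id"].tolist(),
        toy_transactions["price"].tolist(),
    )
]

# Equivalent of picking element 2 from every sub-list.
article_ids = [row[2] for row in rows]
print(article_ids)
```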

In [None]:
def thread_function(name):
    # print() does not apply logging-style arguments, so format explicitly.
    print("Thread %s: starting" % name)
    time.sleep(3 - name)
    print("Thread %s: finishing" % name)


In [None]:
print("Main    : before creating thread")
for i in range(0,2):
    x = threading.Thread(target=thread_function, args=(i,))
    print("Main    : before running thread")
    x.start()
print("Main    : wait for the thread to finish")
# x.join()
print("Main    : all done")
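
With `x.join()` commented out, the cell above prints "all done" while the threads are still sleeping. A minimal sketch that keeps a reference to every thread and waits for all of them follows; `worker`, `finished` and the short delays are illustrative choices.

```python
import threading
import time

finished = []
finished_lock = threading.Lock()


def worker(name, delay_seconds):
    # Simulate some work, then record that this thread is done.
    time.sleep(delay_seconds)
    with finished_lock:
        finished.append(name)


# Keep every Thread object so that each one can be joined later.
threads = [
    threading.Thread(target=worker, args=(i, 0.05 * (2 - i)))
    for i in range(2)
]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()  # blocks until that thread has finished

print("all done:", sorted(finished))
```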

In [None]:
#speed_list = []
#def thread_function(name):
#    pb = 0
#    if name == 0:
#        pb = ProgressBar(transactions)
#    for i in range(0, len(transactions)):
#        if name == 0 and i % 2 == 0:
#            continue
#        elif name == 1 and i % 2 == 1:
#            continue
#        speed_list.append(transactions['t_dat'][i])
#        if name == 0:
#            pb.check(i)
#            
#    print("done",name)
#        
#for i in range(0,2):
#    x = threading.Thread(target=thread_function, args=(i,))
#    print("Main    : before running thread")
#    x.start()

In [None]:
customers.head()

In [None]:
transactions.head()

In [None]:
transactions_age_list = []
pb = ProgressBar(transactions)
for i in range(0,len(transactions),100):
    customer_id = transactions['customer_id'][i]
    transactions_age_list.append(customers_id_hash[customer_id])
    pb.check(i, True)
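
Looking up each customer's age row by row can also be expressed as a single `Series.map` call. This is a sketch on made-up data; `age_lookup` and `transaction_ages` are illustrative names.

```python
import numpy as np
import pandas as pd

# Made-up customers and transactions.
toy_customers = pd.DataFrame({
    "customer_id": ["a", "b", "c"],
    "age": [49.0, 25.0, np.nan],
})
toy_transactions = pd.DataFrame({
    "customer_id": ["a", "a", "b", "c"],
    "article_id": [1, 2, 1, 3],
})

# Series indexed by customer_id, used as a lookup table.
age_lookup = toy_customers.set_index("customer_id")["age"]

# One age per transaction, NaN where the customer's age is unknown.
transaction_ages = toy_transactions["customer_id"].map(age_lookup)
print(transaction_ages.tolist())
```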

In [None]:
plt.figure(figsize=(10, 6), dpi=80)
n, bins, patches = plt.hist(transactions_age_list, 99-16, density=False, facecolor='#CC071E')
plt.xlabel('Age (Years)')
plt.ylabel('Count')
plt.title('Histogram of Customer Ages across H&M Transactions')
plt.xlim(15, 100)
plt.ylim(0, 20000)
plt.grid(True)
plt.xticks(np.arange(15, 105, step=5))
plt.show()

In [None]:
age_article_id_list = {}
pb = ProgressBar(transactions)
for i in range(0,len(transactions),1):
    customer_id = transactions['customer_id'][i]
    article_id = transactions['article_id'][i]
    customer_age = customers_id_hash[customer_id]
    if math.isnan(customer_age):
        customer_age = 'nan'
    else:
        customer_age = str(int(customer_age))
    if not customer_age in age_article_id_list:
        age_article_id_list[customer_age] = {}
    if not article_id in age_article_id_list[customer_age]:
        age_article_id_list[customer_age][article_id] = 0
    age_article_id_list[customer_age][article_id] += 1
    
    pb.check(i, True)

  2.99% time elapsed: 81, estimated: 2043
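
The progress line above reports an estimated total of about 2043 seconds for the row-by-row pass. One possible alternative is a merge followed by a `groupby`, sketched here on made-up data; it builds the same nested `age -> article_id -> count` dictionary, and the names `merged`, `age_key` and `age_article_counts` are illustrative.

```python
import numpy as np
import pandas as pd

# Made-up customers and transactions.
toy_customers = pd.DataFrame({
    "customer_id": ["a", "b", "c"],
    "age": [49.0, 25.0, np.nan],
})
toy_transactions = pd.DataFrame({
    "customer_id": ["a", "a", "a", "b", "c"],
    "article_id": [1, 1, 2, 1, 3],
})

merged = toy_transactions.merge(toy_customers, on="customer_id", how="left")

# Same labels as the loop: 'nan' for unknown ages, else the integer age as text.
merged["age_key"] = merged["age"].map(lambda a: "nan" if pd.isna(a) else str(int(a)))

# Count purchases per (age label, article) pair.
counts = merged.groupby(["age_key", "article_id"]).size()

age_article_counts = {}
for (age_key, article_id), n in counts.items():
    age_article_counts.setdefault(age_key, {})[int(article_id)] = int(n)
print(age_article_counts)
```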

In [61]:
speed_list = []
pb = ProgressBar(transactions)
for i in range(0,len(transactions),100):
    sub_list = []
    sub_list.append(transactions['t_dat'][i])
    sub_list.append(transactions['customer_id'][i])
    sub_list.append(transactions['article_id'][i])
    sub_list.append(transactions['price'][i])
    speed_list.append(sub_list)
    pb.check(i)

100.00% time elapsed: 8, estimated: 80

In [None]:
age_article_id_list = {}
pb = ProgressBar(speed_list)
for i in range(0,len(speed_list),1):
    customer_id = speed_list[i][1]
    article_id = speed_list[i][2]
    customer_age = customers_id_hash[customer_id]
    if math.isnan(customer_age):
        customer_age = 'nan'
    else:
        customer_age = str(int(customer_age))
    if not customer_age in age_article_id_list:
        age_article_id_list[customer_age] = {}
    if not article_id in age_article_id_list[customer_age]:
        age_article_id_list[customer_age][article_id] = 0
    age_article_id_list[customer_age][article_id] += 1
    
    pb.check(i, True)

In [None]:
age_article_id_list.keys()

In [17]:
def sort_obj_by_keys(obj, sort_on_keys=True, reverse=False):
    
    keys = []
    values = []
    for key in obj.keys():
        keys.append(key)
        values.append(obj[key])
    
    ind = None
    if sort_on_keys:
        ind = np.argsort(keys)
    else:
        ind = np.argsort(values)
    
    keys_sort = np.array(keys)[ind]
    values_sort = np.array(values)[ind]
    
    if reverse:
        keys_sort = np.flip(keys_sort)
        values_sort = np.flip(values_sort)
    
    newObj = {}
    newObj['keys'] = keys_sort
    newObj['values'] = values_sort
    
    return newObj
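
One thing to keep in mind when sorting on keys: the age keys used in this notebook are strings, so they sort lexicographically ('10' comes before '3'). A small standard-library sketch of the difference follows; `bucket_order` is a hypothetical helper and the counts are made up.

```python
# Made-up counts keyed by string labels, as in the age dictionaries.
counts_by_bucket = {"10": 5, "3": 7, "nan": 1, "4": 2}

# Plain string sort: '10' sorts before '3'.
lexicographic = sorted(counts_by_bucket)


def bucket_order(key):
    # Hypothetical helper: numeric labels in numeric order, 'nan' last.
    return (key == "nan", int(key) if key != "nan" else 0)


numeric = sorted(counts_by_bucket, key=bucket_order)

# Sorting on values in descending order, like sort_on_keys=False, reverse=True.
by_count_desc = sorted(counts_by_bucket.items(), key=lambda kv: kv[1], reverse=True)

print(lexicographic)
print(numeric)
print(by_count_desc)
```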

In [None]:
age_article_id_list_sort = sort_obj_by_keys(age_article_id_list)

In [None]:
#age_article_id_list_sort

In [None]:
values = age_article_id_list_sort['values']
newValues = []
for i in range(len(values)):
    value = values[i]
    newValue = sort_obj_by_keys(value, False, True)
    newValues.append(newValue)

In [None]:
age_article_id_list_sort['values'] = newValues

In [None]:
for i in range(len(age_article_id_list_sort['keys'])):
    age = age_article_id_list_sort['keys'][i]
    articles_per_age = age_article_id_list_sort['values'][i]
    
    print('Age:',age)
    end = 10
    if len(articles_per_age['keys']) < 10:
        end = len(articles_per_age['keys'])
    for j in range(0,end):
        article_id = articles_per_age['keys'][j]
        article_count = articles_per_age['values'][j]
        
        print('Count:',article_count, articles_id_hash[article_id][0])#, articles_id_hash[article_id][10])
        
    print()
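
Picking the ten most frequent articles per age can also be done with `collections.Counter.most_common`, which avoids the manual sort and the `end` bookkeeping. A sketch with made-up counts (the article ids are taken from the notebook output):

```python
from collections import Counter

# Made-up purchase counts per age label.
age_article_counts = {
    "20": Counter({663713001: 3, 541518023: 5, 505221004: 1}),
    "nan": Counter({685687003: 2}),
}

top_n = 2  # the notebook uses 10; 2 keeps the toy output short

# most_common returns (article_id, count) pairs, most frequent first,
# and simply returns fewer pairs when a group has fewer than top_n articles.
top_articles = {
    age: [article_id for article_id, _ in counter.most_common(top_n)]
    for age, counter in age_article_counts.items()
}
print(top_articles)
```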

In [45]:
transactions['article_id'][0]

663713001

In [46]:
transactions['article_id']

0           663713001
1           541518023
2           505221004
3           685687003
4           685687004
              ...    
31788319    929511001
31788320    891322004
31788321    918325001
31788322    833459002
31788323    898573003
Name: article_id, Length: 31788324, dtype: int64

In [53]:
age_article_id_list = {}
pb = ProgressBar(transactions)
for i in range(0,len(transactions),100):
    customer_id = transactions['customer_id'][i]
    article_id = transactions['article_id'][i]
    customer_age = customers_id_hash[customer_id]
    if math.isnan(customer_age):
        customer_age = 'nan'
    else:
        customer_age = str(int(customer_age/5))
    if not customer_age in age_article_id_list:
        age_article_id_list[customer_age] = {}
    if not article_id in age_article_id_list[customer_age]:
        age_article_id_list[customer_age][article_id] = 0
    age_article_id_list[customer_age][article_id] += 1
    
    pb.check(i, True)

100.00% time elapsed: 28, estimated: 27

In [54]:
age_article_id_list_sort = sort_obj_by_keys(age_article_id_list)

In [55]:
values = age_article_id_list_sort['values']
newValues = []
for i in range(len(values)):
    value = values[i]
    newValue = sort_obj_by_keys(value, False, True)
    newValues.append(newValue)

In [56]:
age_article_id_list_sort['values'] = newValues

In [57]:
for i in range(len(age_article_id_list_sort['keys'])):
    age = age_article_id_list_sort['keys'][i]
    articles_per_age = age_article_id_list_sort['values'][i]
    
    print('Age:',age)
    end = 10
    if len(articles_per_age['keys']) < 10:
        end = len(articles_per_age['keys'])
    for j in range(0,end):
        article_id = articles_per_age['keys'][j]
        article_count = articles_per_age['values'][j]
        
        print('Count:',article_count, articles_id_hash[article_id][0])#, articles_id_hash[article_id][10])
        
    print()

Age: 10
Count: 49 Jade HW Skinny Denim TRS
Count: 38 7p Basic Shaftless
Count: 37 Jade HW Skinny Denim TRS
Count: 35 Henry polo. (1)
Count: 34 Kanta slacks RW
Count: 31 Luna skinny RW
Count: 28 Skinny Ankle R.W Brooklyn
Count: 28 Mariette Blazer
Count: 28 Pluto RW slacks (1)
Count: 26 Tilly (1)

Age: 11
Count: 22 Tilly (1)
Count: 19 Jade HW Skinny Denim TRS
Count: 18 Henry polo. (1)
Count: 15 7p Basic Shaftless
Count: 15 Skinny Ankle R.W Brooklyn
Count: 15 7p Basic Shaftless
Count: 15 Skinny Ankle R.W Brooklyn
Count: 15 Calista cardigan.
Count: 15 3p Sneaker Socks
Count: 15 Lazer Razer Brief

Age: 12
Count: 12 Luna skinny RW
Count: 11 Jade HW Skinny Denim TRS
Count: 10 Skinny Ankle R.W Brooklyn
Count: 10 Luna skinny RW
Count: 10 HM+ Cora tee
Count: 9 Madison skinny HW
Count: 9 Calista cardigan.
Count: 9 Flock (1)
Count: 8 Scallop 5p Socks
Count: 8 Henry polo. (1)

Age: 13
Count: 6 Ringo hipbelt
Count: 5 Moa tank
Count: 5 Pluto slacks RW
Count: 5 Enter(1)
Count: 5 Skinny Ankle R.W Brook

In [59]:
import csv

# Open the file for writing; newline='' lets the csv module control line endings,
# and the with-block closes the file even if an error occurs.
with open('submission.csv', 'w', newline='') as f:
    writer = csv.writer(f)

    # Header row
    writer.writerow(['customer_id','prediction'])

    pb = ProgressBar(customers)
    for i in range(len(customers['customer_id'])):
        customer_id = customers['customer_id'][i]
        customer_age = customers['age'][i]
        # Same 5-year age buckets as used when counting purchases.
        if math.isnan(customer_age):
            customer_age = 'nan'
        else:
            customer_age = str(int(customer_age/5))

        article_ids = ''
        for j in range(0,len(age_article_id_list_sort['keys'])):
            key_age = age_article_id_list_sort['keys'][j]
            if key_age == customer_age:
                values_age = age_article_id_list_sort['values'][j]
                # Take at most the 10 most purchased articles of this bucket.
                end = min(10, len(values_age['keys']))
                for k in range(0,end):
                    # Left-pad article ids with zeros to 10 characters.
                    article_id = str(values_age['keys'][k]).zfill(10)
                    article_ids += article_id + ' '
        writer.writerow([customer_id, article_ids])
        pb.check(i)

100.00% time elapsed: 68, estimated: 67
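
Since every customer in the same age bucket receives the same prediction, the prediction string can be built once per bucket rather than once per customer. The sketch below writes to an in-memory buffer instead of `submission.csv`; the bucket contents and customer ids are made up, and `prediction_by_bucket` is an illustrative name.

```python
import csv
import io

# Made-up ranked article ids per age bucket.
top_articles_by_bucket = {"4": [663713001, 541518023], "nan": [505221004]}

# Build each bucket's prediction string once; ids are zero-padded to 10 characters.
prediction_by_bucket = {
    bucket: " ".join(str(article_id).zfill(10) for article_id in ids)
    for bucket, ids in top_articles_by_bucket.items()
}

# Made-up (customer_id, bucket) pairs; bucket "7" has no prediction.
toy_customers = [("cust_a", "4"), ("cust_b", "nan"), ("cust_c", "7")]

buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(["customer_id", "prediction"])
for customer_id, bucket in toy_customers:
    # Unknown buckets fall back to an empty prediction.
    writer.writerow([customer_id, prediction_by_bucket.get(bucket, "")])

submission_text = buffer.getvalue()
print(submission_text)
```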

In [48]:
#age_article_id_list_sort

In [52]:
len(age_article_id_list_sort['values'][0]['keys'][0])

9