## List of previously purchased articles
In this notebook, we build the list of articles each customer has already purchased and look at the score it yields.
This gives us a baseline score against which to compare the results of later models.

Later on, this list will also help generate a restricted selection of articles per customer on which to train the model.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline


In [2]:
# Load the transactions
transactions = pd.read_pickle('pickles/transactions_clean.pkl')

# Adjustments specific to this notebook
last_day = transactions['t_dat'].max()
transactions['day_number'] = (last_day - transactions['t_dat']).dt.days
transactions['article_id'] = transactions['article_id'].astype('int64')

transactions = transactions[['article_id', 'day_number', 'customer_id']].drop_duplicates()
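
The preparation step above can be illustrated on a toy table. This is only a sketch: the dates and identifiers below are made up, and `toy` is a hypothetical stand-in for the real `transactions` frame.

```python
import pandas as pd

# Toy transactions; dates and identifiers are invented for illustration.
toy = pd.DataFrame({
    "t_dat": pd.to_datetime(["2020-09-20", "2020-09-22", "2020-09-22", "2020-09-15"]),
    "customer_id": ["a", "a", "a", "b"],
    "article_id": [111, 111, 111, 222],
})

last_day = toy["t_dat"].max()
# 0 is the most recent day in the data; larger values lie further in the past.
toy["day_number"] = (last_day - toy["t_dat"]).dt.days

# Several purchases of one article by one customer on the same day collapse into one row.
toy = toy[["article_id", "day_number", "customer_id"]].drop_duplicates()
day_numbers = toy["day_number"].tolist()
```

Because of `drop_duplicates`, a purchase day appears at most once per customer and article, so the later `count` column counts distinct purchase days rather than units bought.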

In [3]:
# Build the target table: one row per (customer, article) pair
already_purchased = transactions.groupby(['customer_id', 'article_id'], as_index = False, sort = False).agg(
    days_list = ('day_number', list) # Assumes transactions is already sorted by date, most recent first.
)

In [4]:
# Extra fields are added after the groupby, which is already expensive.
already_purchased['count'] = already_purchased['days_list'].apply(len)
already_purchased['last_day'] = already_purchased['days_list'].apply(lambda x: x[0])   # most recent purchase
already_purchased['first_day'] = already_purchased['days_list'].apply(lambda x: x[-1]) # oldest purchase
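
Reading `last_day` and `first_day` from the ends of each list only works if the lists are sorted. As a hedged alternative, the sketch below computes the same fields with named aggregation, using `min` and `max` so the result does not depend on row order. The toy data is invented, and this is not the notebook's actual method.

```python
import pandas as pd

# Toy data, deliberately not relied upon to be sorted.
toy = pd.DataFrame({
    "customer_id": ["a", "a", "a", "b"],
    "article_id": [111, 111, 222, 111],
    "day_number": [0, 5, 3, 9],
})

summary = toy.groupby(["customer_id", "article_id"], as_index=False, sort=False).agg(
    days_list=("day_number", list),
    count=("day_number", "count"),
    last_day=("day_number", "min"),   # smallest day_number = most recent purchase
    first_day=("day_number", "max"),  # largest day_number = oldest purchase
)
```

On the full dataset this may be slower or faster than the two-step approach; that has not been measured here.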

In [6]:
already_purchased.to_pickle("pickles/list_already_purchased.pkl")
already_purchased.head()

Unnamed: 0,customer_id,article_id,days_list,count,last_day,first_day
0,fffef3b6b73545df065b521e19f64bf6fe93bfd450ab20...,898573003,[0],1,0,0
1,53d5f95331b01525404c3cbb2da6a84e1173dccb979d28...,752814021,"[0, 1]",2,0,1
2,53da4b44e81286ed175a46d8ffd5a2baf47843089dc03f...,793506006,[0],1,0,0
3,53da4b44e81286ed175a46d8ffd5a2baf47843089dc03f...,802459001,[0],1,0,0
4,53da4b44e81286ed175a46d8ffd5a2baf47843089dc03f...,874169001,[0],1,0,0


#### Distribution of intervals between two repurchases

In [10]:
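intervals = []

def append_interval(day_list):
    """Append the gap between every pair of purchase days in day_list to intervals."""
    end = len(day_list)

    for i in range(0, end):
        for j in range(i + 1, end):
            interval = day_list[j] - day_list[i]

            intervals.append(interval)

# apply is used only for its side effect on intervals; the returned Series is not needed.
blank = already_purchased[already_purchased['count'] > 1]['days_list'].apply(append_interval)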
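
The nested loop above takes every pair of purchase days, not only consecutive ones, so a list of n days contributes n(n-1)/2 intervals. A minimal sketch of the same idea with `itertools.combinations` follows; `pairwise_intervals` is a hypothetical helper name, and `abs` is added so the result does not depend on the sort direction.

```python
from itertools import combinations

def pairwise_intervals(day_list):
    """Return the gap between every pair of purchase days, whatever their order."""
    return [abs(b - a) for a, b in combinations(day_list, 2)]

# Three purchase days give three pairs: (0, 1), (0, 8) and (1, 8).
example = pairwise_intervals([0, 1, 8])
single = pairwise_intervals([5])
```

Returning a list instead of appending to a global makes the function easier to test; the results could then be flattened with `itertools.chain.from_iterable`.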

In [15]:
already_purchased[already_purchased['count'] > 1].tail()

Unnamed: 0,customer_id,article_id,days_list,count,last_day,first_day
27305763,ad2a92f1e284f000fc0d6374906234434824dcbc4391f6...,692216005,"[733, 733, 733]",3,733,733
27305878,a85989072287a41df0f46144bb5268385c1ea941031ebd...,547058001,"[733, 733]",2,733,733
27305908,a872d8b3b98cf5dbede2ae7f4f2ffb12a730d371fc2e78...,660712003,"[733, 733]",2,733,733
27306209,aa205195260c9adb0fef75b6649daddad2d0ba44678b44...,684033003,"[733, 733]",2,733,733
27306404,a91e805e60a92b1bafc60c96fd37996f4e3e3dd843142e...,650677004,"[733, 733]",2,733,733


In [6]:
transactions[(transactions['day_number'] == 733) & (transactions['article_id'] == 650677004)]

Unnamed: 0,t_dat,customer_id,article_id,price,sales_channel_id,quantity,week_number,day_number
1194,2018-09-20,06fb540d721f33154ff93ae30a699ec6e362c8a1e3ef62...,650677004,0.033881,2,1,104,733
8069,2018-09-20,2ffad87a239b7a88d275b06ea22800288dc9b581d1047d...,650677004,0.033881,2,1,104,733
40793,2018-09-20,ef379edf1d4fc2ca81937452a844e42fe526cee10d6f67...,650677004,0.033881,2,2,104,733
28592,2018-09-20,a5d55f3010d375068c51f266ebb807558017ccfaa75c30...,650677004,0.018627,1,1,104,733
29137,2018-09-20,a91e805e60a92b1bafc60c96fd37996f4e3e3dd843142e...,650677004,0.032186,2,1,104,733


In [11]:
intervals_distribution = pd.Series(intervals).value_counts(normalize=True)
intervals_distribution.to_pickle('pickles/repurchases_interval_distribution.pkl')

0    0.140400
1    0.079960
2    0.078543
3    0.064268
4    0.051180
5    0.039894
6    0.034179
7    0.030450
8    0.022329
9    0.017949
dtype: float64

In [8]:
plt.figure()
plt.hist(intervals, bins = 50)
plt.xlabel("Number of days")
plt.ylabel("Number of repurchases")
plt.title("Intervals between two repurchases of the same article")
plt.show()


Observations:
- 62% of repurchases of the same article happen within two weeks.
- Beyond the fourth week, the probability of repurchasing the same article drops sharply.
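
A share such as the one quoted above can be read off the cumulative distribution. The sketch below shows one way to do it, assuming intervals are expressed in days (as `day_number` suggests); the toy intervals are invented and do not reproduce the 62% figure.

```python
import pandas as pd

# Made-up intervals in days, for illustration only.
toy_intervals = [0, 0, 1, 3, 10, 14, 20, 40]
distribution = pd.Series(toy_intervals).value_counts(normalize=True)

# value_counts orders by frequency, so sort by interval before accumulating.
cumulative = distribution.sort_index().cumsum()

# Label-based slicing is inclusive: this is the share of intervals of at most 14 days.
share_within_14_days = cumulative.loc[:14].iloc[-1]
```

Applied to the notebook's `intervals_distribution`, the same two lines would give the share of repurchases within any chosen horizon.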

In [17]:
# Add customer and article data
customers = pd.read_pickle('pickles/customers_second_iteration.pkl')
customers = customers[['customer_id', 'repurchases', 'repurchases_interval']]

In [18]:
articles = pd.read_pickle('pickles/articles_second_iteration.pkl')
articles = articles[['article_id', 'repurchases', 'repurchase_interval']]

In [19]:
already_purchased = already_purchased.merge(customers, on = 'customer_id', how = 'left')

In [20]:
already_purchased = already_purchased.merge(articles, on = 'article_id', how = 'left', suffixes = ('_customer', '_article'))
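
Both lookup tables carry a `repurchases` column, which is why the second merge passes `suffixes`. A small sketch on invented one-row frames shows the resulting column names:

```python
import pandas as pd

# One-row toy frames; values are invented.
pairs = pd.DataFrame({"customer_id": ["a"], "article_id": [1]})
customers = pd.DataFrame({"customer_id": ["a"], "repurchases": [0.4]})
articles = pd.DataFrame({"article_id": [1], "repurchases": [0.5]})

# After the first merge the column is still plain `repurchases`.
merged = pairs.merge(customers, on="customer_id", how="left")
# The name clash in the second merge triggers the suffixes on both sides.
merged = merged.merge(articles, on="article_id", how="left",
                      suffixes=("_customer", "_article"))
columns = merged.columns.tolist()
```

Columns whose names do not clash, such as `repurchases_interval` and `repurchase_interval`, are left unsuffixed.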

In [21]:
already_purchased['mean_interval'] = (already_purchased['repurchases_interval'] + already_purchased['repurchase_interval']) / 2

In [22]:
already_purchased['score'] = already_purchased['repurchases_article'] * already_purchased['repurchases_customer']


In [23]:
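# NOTE: `last_week` is not created in the cells shown above (only `last_day` is); it must use the same time unit as `mean_interval`.
already_purchased['interval_weighted'] = already_purchased['last_week'] / already_purchased['mean_interval']
# Rows with a missing ratio or a ratio above 50 are dropped; .copy() avoids chained-assignment warnings in the next cell.
already_purchased = already_purchased[already_purchased['interval_weighted'] <= 50].copy()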

In [24]:
already_purchased['score'] *= already_purchased['interval_weighted'].apply(lambda x: 
    intervals_distribution.loc[round(x)]
)
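
To make the scoring steps concrete, here is a worked sketch on two invented rows: the base score is the product of the customer and article repurchase values, then weighted by the interval distribution looked up at the rounded ratio. The three-entry `distribution` is a made-up stand-in for `intervals_distribution`, and `map` is used instead of `apply` with `.loc`.

```python
import pandas as pd

# Hypothetical stand-in for intervals_distribution (index = interval, value = share).
distribution = pd.Series({0: 0.5, 1: 0.3, 2: 0.2})

toy = pd.DataFrame({
    "repurchases_customer": [0.4, 0.2],
    "repurchases_article": [0.5, 0.1],
    "interval_weighted": [0.2, 1.6],
})

base = toy["repurchases_article"] * toy["repurchases_customer"]
# Round the ratio to the nearest integer and look up the matching share.
weight = toy["interval_weighted"].round().astype(int).map(distribution)
toy["score"] = base * weight
scores = toy["score"].tolist()
```

Unlike `.loc`, `map` yields `NaN` rather than raising when a rounded ratio is absent from the distribution, which may or may not be the desired behaviour.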

In [25]:
already_purchased = already_purchased[['customer_id', 'article_id', 'score']].sort_values(['customer_id', 'score'], ascending = False)

In [26]:
already_purchased.reset_index(drop = True, inplace = True)

In [27]:
already_purchased

Unnamed: 0,customer_id,article_id,score
0,ffffd7744cebcf3aca44ae7049d2a94b87074c3d4ffe38...,0866755002,0.000430
1,ffffcd5046a6143d29a04fb8c424ce494a76e5cdf4fab5...,0663568009,0.001645
2,ffffcd5046a6143d29a04fb8c424ce494a76e5cdf4fab5...,0877009001,0.000311
3,ffffbbf78b6eaac697a8a5dfbfd2bfa8113ee5b403e474...,0557599022,0.078538
4,ffffbbf78b6eaac697a8a5dfbfd2bfa8113ee5b403e474...,0713997002,0.012731
...,...,...,...
541544,00009d946eec3ea54add5ba56d5210ea898def4b46c685...,0693242018,0.004934
541545,00006413d8573cd20ed7128e53b7b13819fe5cfc2d801f...,0399061015,0.000398
541546,0000423b00ade91418cceaf3b26c6af3dd342b51fd051e...,0590928022,0.014261
541547,0000423b00ade91418cceaf3b26c6af3dd342b51fd051e...,0599580049,0.009479


In [28]:
repurchase_lists = already_purchased.groupby('customer_id', as_index = False, sort = False).agg(
    list = ('article_id', lambda x: list(x))
)

In [30]:
repurchase_lists['list'] = repurchase_lists['list'].apply(lambda x: x[0:11])
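
The ranking and truncation can also be done before building the lists. The sketch below, on invented data, keeps the best-scored articles per customer with `groupby(...).head(...)`; `top_n` is a hypothetical parameter (the slice `x[0:11]` above keeps 11 articles).

```python
import pandas as pd

# Toy scored pairs; values are invented.
toy = pd.DataFrame({
    "customer_id": ["a", "a", "a", "b"],
    "article_id": [1, 2, 3, 4],
    "score": [0.1, 0.9, 0.5, 0.3],
})
top_n = 2  # hypothetical cut-off, for illustration

# Highest scores first within each customer.
ranked = toy.sort_values(["customer_id", "score"], ascending=False)
# Keep the first top_n rows of each customer, then collect them into lists.
top = ranked.groupby("customer_id", sort=False).head(top_n)
lists = top.groupby("customer_id", sort=False)["article_id"].agg(list)
result = lists.to_dict()
```

Truncating first means the per-customer lists never hold more than `top_n` items in memory.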

In [31]:
repurchase_lists.to_pickle('pickles/already_purchased_list.pkl')

In [33]:
repurchase_lists['list'].apply(lambda x: len(x)).describe()

count    226839.000000
mean          2.272087
std           2.102758
min           1.000000
25%           1.000000
50%           1.000000
75%           3.000000
max          11.000000
Name: list, dtype: float64