# Retail Lab (Simple Regression Model)

**Learning Objectives:**
  * Define and fit simple regression models
  * Gain exposure to retail-related datasets

## Context of the datasets

### 1. There are three datasets: `articles.csv.zip`, `customers.csv.zip` and `transactions2020.csv.zip`

#### 2. The Articles dataset contains information about the available products.
#### 3. The Customers dataset contains information about registered customers.
#### 4. The Transactions dataset contains the purchases of articles made by customers.



## 1. Library Import

In [1]:
import pandas as pd
import warnings
import numpy as np
import seaborn as sns
import statsmodels.formula.api as smf
from matplotlib import pyplot as plt

In [2]:
warnings.simplefilter('ignore')

## 2. Data loading and DataFrame creation

In [3]:
Articles=pd.read_csv("https://github.com/thousandoaks/Python4DS-I/raw/main/datasets/articles.csv.zip")

In [4]:
Articles.head(3)

Unnamed: 0,article_id,product_code,prod_name,product_type_no,product_type_name,product_group_name,graphical_appearance_no,graphical_appearance_name,colour_group_code,colour_group_name,...,department_name,index_code,index_name,index_group_no,index_group_name,section_no,section_name,garment_group_no,garment_group_name,detail_desc
0,108775015,108775,Strap top,253,Vest top,Garment Upper body,1010016,Solid,9,Black,...,Jersey Basic,A,Ladieswear,1,Ladieswear,16,Womens Everyday Basics,1002,Jersey Basic,Jersey top with narrow shoulder straps.
1,108775044,108775,Strap top,253,Vest top,Garment Upper body,1010016,Solid,10,White,...,Jersey Basic,A,Ladieswear,1,Ladieswear,16,Womens Everyday Basics,1002,Jersey Basic,Jersey top with narrow shoulder straps.
2,108775051,108775,Strap top (1),253,Vest top,Garment Upper body,1010017,Stripe,11,Off White,...,Jersey Basic,A,Ladieswear,1,Ladieswear,16,Womens Everyday Basics,1002,Jersey Basic,Jersey top with narrow shoulder straps.


In [5]:
Articles.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 105542 entries, 0 to 105541
Data columns (total 25 columns):
 #   Column                        Non-Null Count   Dtype 
---  ------                        --------------   ----- 
 0   article_id                    105542 non-null  int64 
 1   product_code                  105542 non-null  int64 
 2   prod_name                     105542 non-null  object
 3   product_type_no               105542 non-null  int64 
 4   product_type_name             105542 non-null  object
 5   product_group_name            105542 non-null  object
 6   graphical_appearance_no       105542 non-null  int64 
 7   graphical_appearance_name     105542 non-null  object
 8   colour_group_code             105542 non-null  int64 
 9   colour_group_name             105542 non-null  object
 10  perceived_colour_value_id     105542 non-null  int64 
 11  perceived_colour_value_name   105542 non-null  object
 12  perceived_colour_master_id    105542 non-null  int64 
 13 

In [6]:
Customers=pd.read_csv("https://github.com/thousandoaks/Python4DS-I/raw/main/datasets/customers.csv.zip")

In [7]:
Customers.sample(3)

Unnamed: 0,customer_id,FN,Active,club_member_status,fashion_news_frequency,age,postal_code
1352114,fc47d0d7f2aa745df2cd628d245139e98426def1059df6...,1.0,1.0,ACTIVE,Regularly,23.0,a3b2b2b362d3c5b13d81c87cc4193298e5ec8eef227385...
590471,6e2852b4365a905078a69c2919c96c64fc81a7d3fb939f...,1.0,1.0,ACTIVE,Regularly,23.0,9b89111c191fabd8e676bd307f28554aefaaf9cb9ec64a...
420403,4e9ca9de2d73db640848efd51a3a9228851e87b4f30d9d...,1.0,1.0,ACTIVE,Regularly,52.0,5f88ec1a6e7b04305f2462dc09a3d0b321469706bc8bce...


In [8]:
Customers.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1371980 entries, 0 to 1371979
Data columns (total 7 columns):
 #   Column                  Non-Null Count    Dtype  
---  ------                  --------------    -----  
 0   customer_id             1371980 non-null  object 
 1   FN                      476930 non-null   float64
 2   Active                  464404 non-null   float64
 3   club_member_status      1365918 non-null  object 
 4   fashion_news_frequency  1355969 non-null  object 
 5   age                     1356119 non-null  float64
 6   postal_code             1371980 non-null  object 
dtypes: float64(3), object(4)
memory usage: 73.3+ MB


In [9]:
Transactions=pd.read_csv("https://github.com/thousandoaks/Python4DS-I/raw/main/datasets/transactions2020.csv.zip")

In [10]:
Transactions.sample(3)

Unnamed: 0,t_dat,customer_id,article_id,price,sales_channel_id
2848230,2020-07-25,cae9ab8fe54d77e307b2e69c5f1e429f69bccaea7d1374...,754792003,0.025407,2
833219,2020-06-18,4a70418531a3e1093411cf8b1c8dc6d0f04e308aa71b8f...,783978004,0.022017,2
4169535,2020-08-27,581bc468946bcd5b617743dbc633e84ce5ca5ee4db6dd3...,827968001,0.016932,1


In [11]:
Transactions.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5151470 entries, 0 to 5151469
Data columns (total 5 columns):
 #   Column            Dtype  
---  ------            -----  
 0   t_dat             object 
 1   customer_id       object 
 2   article_id        int64  
 3   price             float64
 4   sales_channel_id  int64  
dtypes: float64(1), int64(2), object(2)
memory usage: 196.5+ MB


## 3. Merging DataFrames

#### 3.1. Transactions-Articles


In [12]:
Transactions.head(3)

Unnamed: 0,t_dat,customer_id,article_id,price,sales_channel_id
0,2020-06-01,00075ef36696a7b4ed8c83e22a4bf7ea7c90ee110991ec...,844198001,0.016932,2
1,2020-06-01,000b31552d3785c79833262bbeefa484cbc43d7b612b3c...,777016001,0.030492,1
2,2020-06-01,002d8d26c9414c981c012c6f5e4b2de7ffd3bc568c4574...,820507001,0.010153,2


In [13]:
Articles.head(3)

Unnamed: 0,article_id,product_code,prod_name,product_type_no,product_type_name,product_group_name,graphical_appearance_no,graphical_appearance_name,colour_group_code,colour_group_name,...,department_name,index_code,index_name,index_group_no,index_group_name,section_no,section_name,garment_group_no,garment_group_name,detail_desc
0,108775015,108775,Strap top,253,Vest top,Garment Upper body,1010016,Solid,9,Black,...,Jersey Basic,A,Ladieswear,1,Ladieswear,16,Womens Everyday Basics,1002,Jersey Basic,Jersey top with narrow shoulder straps.
1,108775044,108775,Strap top,253,Vest top,Garment Upper body,1010016,Solid,10,White,...,Jersey Basic,A,Ladieswear,1,Ladieswear,16,Womens Everyday Basics,1002,Jersey Basic,Jersey top with narrow shoulder straps.
2,108775051,108775,Strap top (1),253,Vest top,Garment Upper body,1010017,Stripe,11,Off White,...,Jersey Basic,A,Ladieswear,1,Ladieswear,16,Womens Everyday Basics,1002,Jersey Basic,Jersey top with narrow shoulder straps.


In [14]:
## we merge both DataFrames using the common key: article_id. We store the result in a new DataFrame
TransactionsAndArticles=pd.merge(Transactions, Articles, how='left',on='article_id')
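
A left merge keeps every transaction, but it can silently duplicate rows if the key is not unique on the right-hand side, and it leaves missing values where no match exists. The sketch below shows one way to guard against both using the `validate` and `indicator` arguments of `pd.merge`. The `transactions` and `articles` frames are small hypothetical stand-ins, not the lab data.

```python
import pandas as pd

# Hypothetical toy stand-ins for the Transactions and Articles DataFrames.
transactions = pd.DataFrame({
    "article_id": [101, 101, 102, 999],
    "price": [0.02, 0.02, 0.05, 0.01],
})
articles = pd.DataFrame({
    "article_id": [101, 102, 103],
    "prod_name": ["Strap top", "Skirt", "Sweater"],
})

# validate="many_to_one" raises an error if article_id is duplicated in `articles`.
# indicator=True adds a `_merge` column marking rows without a match on the right.
merged = pd.merge(transactions, articles, how="left", on="article_id",
                  validate="many_to_one", indicator=True)

# Number of transactions whose article_id was not found in `articles`.
unmatched = int((merged["_merge"] == "left_only").sum())
print(len(merged), unmatched)
```

The same two arguments could be added to the customer merge on `customer_id`.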

#### 3.2. Transactions-Articles-Customers

In [15]:
TransactionsAndArticles.head(3)

Unnamed: 0,t_dat,customer_id,article_id,price,sales_channel_id,product_code,prod_name,product_type_no,product_type_name,product_group_name,...,department_name,index_code,index_name,index_group_no,index_group_name,section_no,section_name,garment_group_no,garment_group_name,detail_desc
0,2020-06-01,00075ef36696a7b4ed8c83e22a4bf7ea7c90ee110991ec...,844198001,0.016932,2,844198,Saturn trs (J),296,Pyjama bottom,Nightwear,...,Nightwear,B,Lingeries/Tights,1,Ladieswear,62,"Womens Nightwear, Socks & Tigh",1017,"Under-, Nightwear",Pyjama bottoms in sweatshirt fabric with wide ...
1,2020-06-01,000b31552d3785c79833262bbeefa484cbc43d7b612b3c...,777016001,0.030492,1,777016,Cisco skirt,275,Skirt,Garment Lower body,...,Trousers & Skirt,A,Ladieswear,1,Ladieswear,18,Womens Trend,1009,Trousers,"Calf-length skirt in softly draping, patterned..."
2,2020-06-01,002d8d26c9414c981c012c6f5e4b2de7ffd3bc568c4574...,820507001,0.010153,2,820507,Charlotte Hipster Primula,286,Underwear bottom,Underwear,...,Expressive Lingerie,B,Lingeries/Tights,1,Ladieswear,61,Womens Lingerie,1017,"Under-, Nightwear","Hipster briefs in lace with a mid waist, lined..."


In [16]:
Customers.head(3)

Unnamed: 0,customer_id,FN,Active,club_member_status,fashion_news_frequency,age,postal_code
0,00000dbacae5abe5e23885899a1fa44253a17956c6d1c3...,,,ACTIVE,NONE,49.0,52043ee2162cf5aa7ee79974281641c6f11a68d276429a...
1,0000423b00ade91418cceaf3b26c6af3dd342b51fd051e...,,,ACTIVE,NONE,25.0,2973abc54daa8a5f8ccfe9362140c63247c5eee03f1d93...
2,000058a12d5b43e67d225668fa1f8d618c13dc232df0ca...,,,ACTIVE,NONE,24.0,64f17e6a330a85798e4998f62d0930d14db8db1c054af6...


In [17]:
## we merge both DataFrames using the common key: customer_id. We store the result in a new DataFrame
TransactionsAndArticlesAndCustomers=pd.merge(TransactionsAndArticles, Customers, how='left',on='customer_id')

## 4. Exploratory Data Analysis

In [18]:
TransactionsAndArticlesAndCustomers.head(3)

Unnamed: 0,t_dat,customer_id,article_id,price,sales_channel_id,product_code,prod_name,product_type_no,product_type_name,product_group_name,...,section_name,garment_group_no,garment_group_name,detail_desc,FN,Active,club_member_status,fashion_news_frequency,age,postal_code
0,2020-06-01,00075ef36696a7b4ed8c83e22a4bf7ea7c90ee110991ec...,844198001,0.016932,2,844198,Saturn trs (J),296,Pyjama bottom,Nightwear,...,"Womens Nightwear, Socks & Tigh",1017,"Under-, Nightwear",Pyjama bottoms in sweatshirt fabric with wide ...,,,ACTIVE,NONE,40.0,0c0e15f8fa88a1d4aa6ca8a0b4a8289ca1affbaebdea22...
1,2020-06-01,000b31552d3785c79833262bbeefa484cbc43d7b612b3c...,777016001,0.030492,1,777016,Cisco skirt,275,Skirt,Garment Lower body,...,Womens Trend,1009,Trousers,"Calf-length skirt in softly draping, patterned...",1.0,1.0,ACTIVE,Regularly,59.0,2c29ae653a9282cce4151bd87643c907644e09541abc28...
2,2020-06-01,002d8d26c9414c981c012c6f5e4b2de7ffd3bc568c4574...,820507001,0.010153,2,820507,Charlotte Hipster Primula,286,Underwear bottom,Underwear,...,Womens Lingerie,1017,"Under-, Nightwear","Hipster briefs in lace with a mid waist, lined...",,,ACTIVE,NONE,23.0,8d4ceb946237cf52ce5c2a1a71d1221fde77627a52d661...


In [19]:
TransactionsAndArticlesAndCustomers.sample(3).T

Unnamed: 0,4825541,3804790,274981
t_dat,2020-09-13,2020-08-17,2020-06-06
customer_id,3fc0b8c0a559a1c92be26fba637814e28b25568f424162...,a3d624af63dfcc720e3bb205ed280fdfd0fc60fc07fa63...,ccbeaf707712937c047c160d27e1b7e6762659350ee316...
article_id,852584001,733097001,804664004
price,0.033881,0.016932,0.023763
sales_channel_id,2,1,2
product_code,852584,733097,804664
prod_name,SUPREME RW tights,Saga body,Kaizen Teddy
product_type_no,-1,256,252
product_type_name,Unknown,Bodysuit,Sweater
product_group_name,Unknown,Garment Upper body,Garment Upper body


In [20]:
TransactionsAndArticlesAndCustomers['t_dat']=pd.to_datetime(TransactionsAndArticlesAndCustomers['t_dat'])

In [21]:
TransactionsAndArticlesAndCustomers.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5151470 entries, 0 to 5151469
Data columns (total 35 columns):
 #   Column                        Dtype         
---  ------                        -----         
 0   t_dat                         datetime64[ns]
 1   customer_id                   object        
 2   article_id                    int64         
 3   price                         float64       
 4   sales_channel_id              int64         
 5   product_code                  int64         
 6   prod_name                     object        
 7   product_type_no               int64         
 8   product_type_name             object        
 9   product_group_name            object        
 10  graphical_appearance_no       int64         
 11  graphical_appearance_name     object        
 12  colour_group_code             int64         
 13  colour_group_name             object        
 14  perceived_colour_value_id     int64         
 15  perceived_colour_value_name   ob

In [22]:
del Articles, Customers, Transactions

In [24]:
TransactionsAndArticlesAndCustomers.groupby('t_dat').count()

Unnamed: 0_level_0,customer_id,article_id,price,sales_channel_id,product_code,prod_name,product_type_no,product_type_name,product_group_name,graphical_appearance_no,...,section_name,garment_group_no,garment_group_name,detail_desc,FN,Active,club_member_status,fashion_news_frequency,age,postal_code
t_dat,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
2020-06-01,43084,43084,43084,43084,43084,43084,43084,43084,43084,43084,...,43084,43084,43084,43050,19110,18855,43030,42986,42950,43084
2020-06-02,44666,44666,44666,44666,44666,44666,44666,44666,44666,44666,...,44666,44666,44666,44626,18863,18659,44607,44543,44529,44666
2020-06-03,53187,53187,53187,53187,53187,53187,53187,53187,53187,53187,...,53187,53187,53187,53122,23647,23291,53124,53003,52933,53187
2020-06-04,50470,50470,50470,50470,50470,50470,50470,50470,50470,50470,...,50470,50470,50470,50427,22440,22159,50344,50334,50294,50470
2020-06-05,44470,44470,44470,44470,44470,44470,44470,44470,44470,44470,...,44470,44470,44470,44415,19500,19258,44426,44347,44332,44470
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2020-09-18,39284,39284,39284,39284,39284,39284,39284,39284,39284,39284,...,39284,39284,39284,39275,17958,17679,39235,39211,39168,39284
2020-09-19,36796,36796,36796,36796,36796,36796,36796,36796,36796,36796,...,36796,36796,36796,36790,15418,15188,36725,36701,36626,36796
2020-09-20,31489,31489,31489,31489,31489,31489,31489,31489,31489,31489,...,31489,31489,31489,31482,13799,13502,31415,31408,31375,31489
2020-09-21,32130,32130,32130,32130,32130,32130,32130,32130,32130,32130,...,32130,32130,32130,32124,14145,13949,32067,32072,32033,32130
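
Note that `count()` reports the number of non-null values in every column, which is why columns such as `FN` and `age` show smaller figures than the others. When only the number of rows per group is needed, `size()` is a more compact alternative. A minimal sketch on a hypothetical toy frame:

```python
import pandas as pd

# Hypothetical toy data with one missing age.
df = pd.DataFrame({
    "t_dat": pd.to_datetime(["2020-06-01", "2020-06-01", "2020-06-02"]),
    "price": [0.02, 0.03, 0.01],
    "age": [25.0, None, 40.0],
})

# size() counts rows per group, regardless of missing values.
rows_per_day = df.groupby("t_dat").size()

# count() counts non-null values per column within each group.
non_null_per_day = df.groupby("t_dat").count()

print(rows_per_day)
print(non_null_per_day)
```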


In [None]:
TransactionsAndArticlesAndCustomers.groupby('sales_channel_id').count()

Unnamed: 0_level_0,t_dat,customer_id,article_id,price,product_code,prod_name,product_type_no,product_type_name,product_group_name,graphical_appearance_no,...,section_name,garment_group_no,garment_group_name,detail_desc,FN,Active,club_member_status,fashion_news_frequency,age,postal_code
sales_channel_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,1788396,1788396,1788396,1788396,1788396,1788396,1788396,1788396,1788396,1788396,...,1788396,1788396,1788396,1786791,817650,804129,1787640,1780039,1779148,1788396
2,3363074,3363074,3363074,3363074,3363074,3363074,3363074,3363074,3363074,3363074,...,3363074,3363074,3363074,3361065,1502625,1482103,3356625,3358923,3352518,3363074


In [None]:
TransactionsAndArticlesAndCustomers.groupby('department_name').count()

Unnamed: 0_level_0,t_dat,customer_id,article_id,price,sales_channel_id,product_code,prod_name,product_type_no,product_type_name,product_group_name,...,section_name,garment_group_no,garment_group_name,detail_desc,FN,Active,club_member_status,fashion_news_frequency,age,postal_code
department_name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
AK Bottoms,4400,4400,4400,4400,4400,4400,4400,4400,4400,4400,...,4400,4400,4400,4400,2128,2099,4389,4393,4382,4400
AK Dresses & Outdoor,2879,2879,2879,2879,2879,2879,2879,2879,2879,2879,...,2879,2879,2879,2879,1402,1365,2872,2874,2865,2879
AK Other,113,113,113,113,113,113,113,113,113,113,...,113,113,113,113,66,63,113,113,113,113
AK Tops Jersey & Woven,390,390,390,390,390,390,390,390,390,390,...,390,390,390,390,155,153,390,390,390,390
AK Tops Knitwear,392,392,392,392,392,392,392,392,392,392,...,392,392,392,392,183,180,390,391,391,392
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Young Girl Shoes,380,380,380,380,380,380,380,380,380,380,...,380,380,380,380,183,181,380,379,380,380
Young Girl Swimwear,699,699,699,699,699,699,699,699,699,699,...,699,699,699,699,337,330,699,697,698,699
Young Girl Trouser,952,952,952,952,952,952,952,952,952,952,...,952,952,952,952,483,477,951,952,943,952
Young Girl UW/NW,1554,1554,1554,1554,1554,1554,1554,1554,1554,1554,...,1554,1554,1554,1554,717,707,1550,1551,1548,1554


In [None]:
TransactionsAndArticlesAndCustomers.groupby('section_name').count()

Unnamed: 0_level_0,t_dat,customer_id,article_id,price,sales_channel_id,product_code,prod_name,product_type_no,product_type_name,product_group_name,...,section_no,garment_group_no,garment_group_name,detail_desc,FN,Active,club_member_status,fashion_news_frequency,age,postal_code
section_name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
Baby Boy,6048,6048,6048,6048,6048,6048,6048,6048,6048,6048,...,6048,6048,6048,6048,2733,2697,6019,6018,6016,6048
Baby Essentials & Complements,16504,16504,16504,16504,16504,16504,16504,16504,16504,16504,...,16504,16504,16504,16504,6847,6739,16421,16413,16390,16504
Baby Girl,5482,5482,5482,5482,5482,5482,5482,5482,5482,5482,...,5482,5482,5482,5482,2743,2685,5458,5456,5445,5482
Boys Underwear & Basics,6105,6105,6105,6105,6105,6105,6105,6105,6105,6105,...,6105,6105,6105,6097,2906,2854,6079,6079,6069,6105
Collaborations,1611,1611,1611,1611,1611,1611,1611,1611,1611,1611,...,1611,1611,1611,1611,791,774,1607,1610,1602,1611
Contemporary Casual,30348,30348,30348,30348,30348,30348,30348,30348,30348,30348,...,30348,30348,30348,30348,14000,13820,30322,30283,30223,30348
Contemporary Smart,46340,46340,46340,46340,46340,46340,46340,46340,46340,46340,...,46340,46340,46340,46340,21878,21607,46290,46232,46158,46340
Contemporary Street,28180,28180,28180,28180,28180,28180,28180,28180,28180,28180,...,28180,28180,28180,28180,13643,13410,28155,28113,28079,28180
Denim Men,26376,26376,26376,26376,26376,26376,26376,26376,26376,26376,...,26376,26376,26376,26376,12262,12113,26351,26338,26264,26376
Divided Accessories,32575,32575,32575,32575,32575,32575,32575,32575,32575,32575,...,32575,32575,32575,32574,14968,14707,32528,32494,32450,32575


In [None]:
TransactionsAndArticlesAndCustomers.groupby('club_member_status').count()

Unnamed: 0_level_0,t_dat,customer_id,article_id,price,sales_channel_id,product_code,prod_name,product_type_no,product_type_name,product_group_name,...,section_no,section_name,garment_group_no,garment_group_name,detail_desc,FN,Active,fashion_news_frequency,age,postal_code
club_member_status,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
ACTIVE,5082441,5082441,5082441,5082441,5082441,5082441,5082441,5082441,5082441,5082441,...,5082441,5082441,5082441,5082441,5078877,2315032,2281184,5071449,5067537,5082441
LEFT CLUB,559,559,559,559,559,559,559,559,559,559,...,559,559,559,559,559,0,0,559,559,559
PRE-CREATE,61265,61265,61265,61265,61265,61265,61265,61265,61265,61265,...,61265,61265,61265,61265,61217,4518,4325,60629,57269,61265


In [None]:
TransactionsAndArticlesAndCustomers.groupby('fashion_news_frequency').count()



Unnamed: 0_level_0,t_dat,customer_id,article_id,price,sales_channel_id,product_code,prod_name,product_type_no,product_type_name,product_group_name,...,section_no,section_name,garment_group_no,garment_group_name,detail_desc,FN,Active,club_member_status,age,postal_code
fashion_news_frequency,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
Monthly,954,954,954,954,954,954,954,954,954,954,...,954,954,954,954,953,947,926,954,952,954
NONE,2811991,2811991,2811991,2811991,2811991,2811991,2811991,2811991,2811991,2811991,...,2811991,2811991,2811991,2811991,2809906,1212,814,2806374,2797394,2811991
Regularly,2326017,2326017,2326017,2326017,2326017,2326017,2326017,2326017,2326017,2326017,...,2326017,2326017,2326017,2326017,2324499,2318116,2284492,2325309,2321747,2326017


## 4. Do sales depend on the month under consideration?
### We are interested in determining whether sales are time dependent. To do this we fit a pooled cross-sectional model regressing `purchasespercustomer` on the factor `purchasemonth`.


#### First we need to compute the total value of purchases made by each customer per day and sales channel.


In [23]:
TransactionsAndArticlesAndCustomers[['t_dat','customer_id','article_id','price']]

Unnamed: 0,t_dat,customer_id,article_id,price
0,2020-06-01,00075ef36696a7b4ed8c83e22a4bf7ea7c90ee110991ec...,844198001,0.016932
1,2020-06-01,000b31552d3785c79833262bbeefa484cbc43d7b612b3c...,777016001,0.030492
2,2020-06-01,002d8d26c9414c981c012c6f5e4b2de7ffd3bc568c4574...,820507001,0.010153
3,2020-06-01,002d8d26c9414c981c012c6f5e4b2de7ffd3bc568c4574...,869811005,0.016932
4,2020-06-01,002d8d26c9414c981c012c6f5e4b2de7ffd3bc568c4574...,823118004,0.025407
...,...,...,...,...
5151465,2020-09-22,fff2282977442e327b45d8c89afde25617d00124d0f999...,929511001,0.059305
5151466,2020-09-22,fff2282977442e327b45d8c89afde25617d00124d0f999...,891322004,0.042356
5151467,2020-09-22,fff380805474b287b05cb2a7507b9a013482f7dd0bce0e...,918325001,0.043203
5151468,2020-09-22,fff4d3a8b1f3b60af93e78c30a7cb4cf75edaf2590d3e5...,833459002,0.006763


In [27]:
CustomerPurchases=TransactionsAndArticlesAndCustomers.groupby(['customer_id','t_dat','sales_channel_id']).agg({'price':'sum'}).reset_index()
CustomerPurchases.rename(columns={'price':'purchasespercustomer'},inplace=True)
CustomerPurchases

Unnamed: 0,customer_id,t_dat,sales_channel_id,purchasespercustomer
0,00000dbacae5abe5e23885899a1fa44253a17956c6d1c3...,2020-09-05,1,0.050831
1,0000423b00ade91418cceaf3b26c6af3dd342b51fd051e...,2020-07-08,1,0.027102
2,000058a12d5b43e67d225668fa1f8d618c13dc232df0ca...,2020-09-15,2,0.061000
3,00006413d8573cd20ed7128e53b7b13819fe5cfc2d801f...,2020-06-03,2,0.127068
4,00006413d8573cd20ed7128e53b7b13819fe5cfc2d801f...,2020-08-12,2,0.128746
...,...,...,...,...
1536657,ffffcf35913a0bee60e8741cb2b4e78b8a98ee5ff2e6a1...,2020-07-03,2,0.040627
1536658,ffffcf35913a0bee60e8741cb2b4e78b8a98ee5ff2e6a1...,2020-07-16,1,0.038932
1536659,ffffcf35913a0bee60e8741cb2b4e78b8a98ee5ff2e6a1...,2020-09-08,2,0.037237
1536660,ffffcf35913a0bee60e8741cb2b4e78b8a98ee5ff2e6a1...,2020-09-09,1,0.025407


In [34]:
## We extract the month in which the transaction took place

CustomerPurchases['purchasemonth']=CustomerPurchases['t_dat'].dt.month
CustomerPurchases

Unnamed: 0,customer_id,t_dat,sales_channel_id,purchasespercustomer,purchasemonth
0,00000dbacae5abe5e23885899a1fa44253a17956c6d1c3...,2020-09-05,1,0.050831,9
1,0000423b00ade91418cceaf3b26c6af3dd342b51fd051e...,2020-07-08,1,0.027102,7
2,000058a12d5b43e67d225668fa1f8d618c13dc232df0ca...,2020-09-15,2,0.061000,9
3,00006413d8573cd20ed7128e53b7b13819fe5cfc2d801f...,2020-06-03,2,0.127068,6
4,00006413d8573cd20ed7128e53b7b13819fe5cfc2d801f...,2020-08-12,2,0.128746,8
...,...,...,...,...,...
1536657,ffffcf35913a0bee60e8741cb2b4e78b8a98ee5ff2e6a1...,2020-07-03,2,0.040627,7
1536658,ffffcf35913a0bee60e8741cb2b4e78b8a98ee5ff2e6a1...,2020-07-16,1,0.038932,7
1536659,ffffcf35913a0bee60e8741cb2b4e78b8a98ee5ff2e6a1...,2020-09-08,2,0.037237,9
1536660,ffffcf35913a0bee60e8741cb2b4e78b8a98ee5ff2e6a1...,2020-09-09,1,0.025407,9
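
The aggregation and month extraction above can also be written in one chain using named aggregation, which avoids the separate `rename` step. This is only an alternative sketch on a hypothetical toy frame (`tx`); the column names mirror the ones used in the lab.

```python
import pandas as pd

# Hypothetical toy transactions: customer "a" buys two items on the same day and channel.
tx = pd.DataFrame({
    "customer_id": ["a", "a", "a", "b"],
    "t_dat": pd.to_datetime(["2020-06-01", "2020-06-01", "2020-07-03", "2020-06-01"]),
    "sales_channel_id": [2, 2, 1, 1],
    "price": [0.01, 0.02, 0.05, 0.03],
})

# Named aggregation: output column name = (input column, aggregation function).
purchases = (
    tx.groupby(["customer_id", "t_dat", "sales_channel_id"], as_index=False)
      .agg(purchasespercustomer=("price", "sum"))
)

# Month of the purchase as an integer (6 = June, 7 = July, ...).
purchases["purchasemonth"] = purchases["t_dat"].dt.month
print(purchases)
```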


#### 4.1. Model Fit

In [38]:
# We impose a simple linear model.
# We specify purchasespercustomer as the response variable (a.k.a. dependent variable).
# purchasemonth must be treated as a qualitative (categorical) variable, which is why we wrap it in C(purchasemonth).

reg = smf.ols(formula='purchasespercustomer ~ C(purchasemonth)', data=CustomerPurchases)


In [39]:
#We fit the model
results = reg.fit()

In [40]:

print(results.summary())


                             OLS Regression Results                             
Dep. Variable:     purchasespercustomer   R-squared:                       0.006
Model:                              OLS   Adj. R-squared:                  0.006
Method:                   Least Squares   F-statistic:                     3338.
Date:                  Mon, 12 Aug 2024   Prob (F-statistic):               0.00
Time:                          07:24:23   Log-Likelihood:             1.3559e+06
No. Observations:               1536662   AIC:                        -2.712e+06
Df Residuals:                   1536658   BIC:                        -2.712e+06
Df Model:                             3                                         
Covariance Type:              nonrobust                                         
                            coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------------
Intercept 

#### 4.2. Model Interpretation
##### Based on the previous results, we have fitted the following model:

$\text{purchasespercustomer} = 0.0889 - 0.0103\,\text{purchasemonth}_{7} - 0.0024\,\text{purchasemonth}_{8} + 0.0147\,\text{purchasemonth}_{9} + u$

#### The factor `purchasemonth` is a categorical variable that identifies the month in which the transaction was made (6: June, 7: July, 8: August, 9: September). June is the reference month, so the intercept (0.0889) is the average purchase in June and each coefficient is a difference from it. Given the results we conclude that sales in July and August are lower than sales in June (by 0.0103 and 0.0024 respectively), while sales in September are higher than in June (by 0.0147).

#### The low R-squared (0.006) indicates that the month alone explains very little of the variation in purchases, so the model is not very good at predicting sales.
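
To see why the coefficients of a categorical factor can be read as differences from the reference month, it may help to fit the same kind of model on a tiny dataset and compare it with plain group means. The sketch below uses made-up numbers (the `toy` frame is hypothetical, not the lab data). With a single categorical regressor, the intercept should match the mean of the reference level and each coefficient the difference in means.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical toy data: two purchases in each of three months.
toy = pd.DataFrame({
    "purchasemonth": [6, 6, 7, 7, 8, 8],
    "purchasespercustomer": [0.10, 0.08, 0.07, 0.09, 0.05, 0.07],
})

# Same model form as in the lab: response regressed on a categorical month.
fit = smf.ols("purchasespercustomer ~ C(purchasemonth)", data=toy).fit()

# Plain group means for comparison.
means = toy.groupby("purchasemonth")["purchasespercustomer"].mean()

# Intercept vs. mean of the reference month (June).
print(fit.params["Intercept"], means.loc[6])
# July coefficient vs. difference between the July and June means.
print(fit.params["C(purchasemonth)[T.7]"], means.loc[7] - means.loc[6])
```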





## 5. What is the temporal evolution of sales across channels?
### We are interested in determining whether sales are time dependent and how these sales evolve in the different channels. To do this we fit a pooled cross-sectional model regressing `purchasespercustomer` on the factors `purchasemonth` and `sales_channel_id`, including their interaction.

In [42]:
CustomerPurchases.head(3)

Unnamed: 0,customer_id,t_dat,sales_channel_id,purchasespercustomer,purchasemonth
0,00000dbacae5abe5e23885899a1fa44253a17956c6d1c3...,2020-09-05,1,0.050831,9
1,0000423b00ade91418cceaf3b26c6af3dd342b51fd051e...,2020-07-08,1,0.027102,7
2,000058a12d5b43e67d225668fa1f8d618c13dc232df0ca...,2020-09-15,2,0.061,9


#### 5.1. Model Fit

In [49]:
# In a formula, a*b already expands to a + b + a:b, so the main effects need not be listed separately.
reg2 = smf.ols(formula='purchasespercustomer ~ C(purchasemonth)*sales_channel_id', data=CustomerPurchases)

In [50]:
#We fit the model
results2 = reg2.fit()

In [51]:
print(results2.summary())


                             OLS Regression Results                             
Dep. Variable:     purchasespercustomer   R-squared:                       0.075
Model:                              OLS   Adj. R-squared:                  0.075
Method:                   Least Squares   F-statistic:                 1.789e+04
Date:                  Mon, 12 Aug 2024   Prob (F-statistic):               0.00
Time:                          07:50:53   Log-Likelihood:             1.4111e+06
No. Observations:               1536662   AIC:                        -2.822e+06
Df Residuals:                   1536654   BIC:                        -2.822e+06
Df Model:                             7                                         
Covariance Type:              nonrobust                                         
                                             coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------

#### 5.2. Model Interpretation
##### Based on the previous results we conclude that:

* In June (the reference month), sales in channel 2 are 0.0509 larger than in channel 1 (the reference channel).

* In July, the gap between channel 2 and channel 1 is 0.0076 smaller than in June.

* In August, the gap between channel 2 and channel 1 is 0.0037 larger than in June.

* In September, the gap between channel 2 and channel 1 is 0.0177 larger than in June.
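
Assuming the figures quoted above are the `sales_channel_id` main effect (0.0509) and the month-by-channel interaction coefficients of `reg2`, the difference between channel 2 and channel 1 in a given month is the main effect plus that month's interaction term. The short sketch below only does this arithmetic with the quoted numbers; the variable names are illustrative.

```python
# Coefficients as quoted in the text (assumed to be the sales_channel_id
# main effect and the month-by-channel interaction terms).
channel_main_effect = 0.0509
interaction = {
    "June": 0.0,        # reference month: no interaction term
    "July": -0.0076,
    "August": 0.0037,
    "September": 0.0177,
}

# Channel 2 minus channel 1, month by month.
channel_gap = {
    month: round(channel_main_effect + delta, 4)
    for month, delta in interaction.items()
}

for month, gap in channel_gap.items():
    print(f"{month}: {gap:.4f}")
```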