# Retail: Assess Sales Outlets' Performance Decomposition

### Define the Goal

   * What do you want to achieve and why?
   
   
   * Who's interested in what you produce?
   
   
   * What decisions will be made based on your analysis?


**What do you want to achieve and why?**

   * Find the relationship between revenue growth and the probability of churn.
   * Identify the relationship between payment frequency and the probability of churn.
   * Compare the time since the last purchase with the probability of churn.


**Who's interested in what you produce?**

   * The Home World company in general and the Marketing department in particular.
   
   * Interested customers.

**What decisions will be made based on your analysis?**

   * To decide whether to continue the loyalty program.
   
   * To confirm whether loyal customers are satisfied with the loyalty program.

### Specify Details


Task: Determine the probability that a customer will leave based on their behavior.

### Propose Hypotheses

   * Loyal customers show lower growth dynamics than average for the sample.
   
   
   * Loyal customers make payments less often than average.
   
   
   * Loyal customers haven't bought anything for a long time.


### Action Plan

It follows from the hypotheses that we need to:


   * Look into the relationship between revenue growth and the probability of churn.
   
   
   * Identify the relationship between payment frequency and the probability of churn.
   
   
   * Compare the time since the last purchase with the probability of churn.
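
The first two items of the plan can be sketched on toy data as follows. This is only an illustration under stated assumptions: the column names `customer_id`, `purchase_id`, `purchase_date` and `revenue` are hypothetical (the raw files use different names and have no revenue column), and growth is measured simply between the first and the last month present in the data.

```python
import pandas as pd

# Invented purchase lines with a precomputed revenue column (hypothetical names).
df = pd.DataFrame({
    "customer_id": [1, 1, 1, 2, 2],
    "purchase_id": ["a", "b", "c", "d", "e"],
    "purchase_date": pd.to_datetime([
        "2016-12-05", "2017-01-10", "2017-01-20", "2016-12-15", "2017-01-03",
    ]),
    "revenue": [10.0, 15.0, 5.0, 40.0, 10.0],
})
# Month label as a string such as "2016-12"; these labels sort chronologically.
df["month"] = df["purchase_date"].dt.strftime("%Y-%m")

# Revenue per customer and month, with months as columns.
monthly = df.pivot_table(
    index="customer_id", columns="month", values="revenue",
    aggfunc="sum", fill_value=0,
)

# Relative growth between the first and the last month.
# Note: a customer with zero revenue in the first month would yield inf or NaN.
first, last = monthly.columns.min(), monthly.columns.max()
growth = (monthly[last] - monthly[first]) / monthly[first]

# Payment frequency: number of distinct purchases per customer.
frequency = df.groupby("customer_id")["purchase_id"].nunique()

summary = pd.DataFrame({"revenue_growth": growth, "n_purchases": frequency})
print(summary)
```

Either column of `summary` can then be compared against a churn indicator once one has been defined.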


### Description of the data

The dataset contains data on purchases made at the building-material retailer Home World. All of its customers have membership cards. Moreover, they can become members of the store's loyalty program for $20 per month. The program includes discounts, information on special offers, and gifts.

`retail_dataset_us.csv` contains the following fields:

- `purchaseid`
- `item_ID`
- `purchasedate`
- `Quantity` — the number of items in the purchase
- `CustomerID`
- `ShopID`
- `loyalty_program` — whether the customer is a member of the loyalty program

`product_codes_us.csv` contains the following fields:

- `productID`
- `price_per_one`

# Step 1. Download the data

In [39]:
import pandas as pd
import warnings
warnings.filterwarnings('ignore')

In [40]:
try:
    retail = pd.read_csv('/retail_dataset_us.csv')
    product = pd.read_csv('/product_codes_us.csv', sep=';')
except FileNotFoundError:
    # Fall back to the alternative location of the datasets
    retail = pd.read_csv('/datasets/retail_dataset_us.csv')
    product = pd.read_csv('/datasets/product_codes_us.csv', sep=';')
finally:
    print('Try to read files')

product.head()

Try to read files


| | productID | price_per_one |
|---|---|---|
| 0 | 10002 | 0.85 |
| 1 | 10080 | 0.85 |
| 2 | 10120 | 0.21 |
| 3 | 10123C | 0.65 |
| 4 | 10124A | 0.42 |


In [41]:
retail.head()

| | purchaseid | item_ID | Quantity | purchasedate | CustomerID | loyalty_program | ShopID |
|---|---|---|---|---|---|---|---|
| 0 | 538280 | 21873 | 11 | 2016-12-10 12:50:00 | 18427.0 | 0 | Shop 3 |
| 1 | 538862 | 22195 | 0 | 2016-12-14 14:11:00 | 22389.0 | 1 | Shop 2 |
| 2 | 538855 | 21239 | 7 | 2016-12-14 13:50:00 | 22182.0 | 1 | Shop 3 |
| 3 | 543543 | 22271 | 0 | 2017-02-09 15:33:00 | 23522.0 | 1 | Shop 28 |
| 4 | 543812 | 79321 | 0 | 2017-02-13 14:40:00 | 23151.0 | 1 | Shop 28 |


In [42]:
# Number of rows and columns in the DataFrame
retail.shape

(105335, 7)

In [43]:
# Number of rows and columns in the DataFrame
product.shape

(3159, 2)

In [44]:
# Get information about the DataFrame
retail.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 105335 entries, 0 to 105334
Data columns (total 7 columns):
 #   Column           Non-Null Count   Dtype  
---  ------           --------------   -----  
 0   purchaseid       105335 non-null  object 
 1   item_ID          105335 non-null  object 
 2   Quantity         105335 non-null  int64  
 3   purchasedate     105335 non-null  object 
 4   CustomerID       69125 non-null   float64
 5   loyalty_program  105335 non-null  int64  
 6   ShopID           105335 non-null  object 
dtypes: float64(1), int64(2), object(4)
memory usage: 29.0 MB


In [45]:
# Get information about the DataFrame
product.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3159 entries, 0 to 3158
Data columns (total 2 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   productID      3159 non-null   object 
 1   price_per_one  3159 non-null   float64
dtypes: float64(1), object(1)
memory usage: 217.0 KB


In [46]:
# Check whether the column names follow a consistent style
retail.columns

Index(['purchaseid', 'item_ID', 'Quantity', 'purchasedate', 'CustomerID',
       'loyalty_program', 'ShopID'],
      dtype='object')

In [47]:
# Check whether the column names follow a consistent style
product.columns

Index(['productID', 'price_per_one'], dtype='object')

In [48]:
# Summary statistics for the numeric columns
retail.describe()

| | Quantity | CustomerID | loyalty_program |
|---|---|---|---|
| count | 105335.0 | 69125.0 | 105335.0 |
| mean | 7.821218 | 21019.302047 | 0.226345 |
| std | 327.946695 | 1765.444679 | 0.418467 |
| min | -74216.0 | 18025.0 | 0.0 |
| 25% | 0.0 | 19544.0 | 0.0 |
| 50% | 2.0 | 20990.0 | 0.0 |
| 75% | 7.0 | 22659.0 | 0.0 |
| max | 74214.0 | 23962.0 | 1.0 |


In [49]:
# Summary statistics for the numeric columns
product.describe()

| | price_per_one |
|---|---|
| count | 3159.0 |
| mean | 2.954495 |
| std | 7.213808 |
| min | 0.0 |
| 25% | 0.65 |
| 50% | 1.45 |
| 75% | 3.29 |
| max | 175.0 |


In [50]:
# Checking duplicates in the DataFrame
retail.duplicated().sum()

1033

In [51]:
# Checking duplicates in the DataFrame
product.duplicated().sum()

0

In [52]:
# Count missing values in each column
retail.isna().sum()

purchaseid             0
item_ID                0
Quantity               0
purchasedate           0
CustomerID         36210
loyalty_program        0
ShopID                 0
dtype: int64

In [53]:
# Count missing values in each column
product.isna().sum()

productID        0
price_per_one    0
dtype: int64

In [54]:
# To get the exact number of missing values
retail['CustomerID'].isnull().sum()

36210

In [55]:
# Percentage of missing values in each column
mis_values = retail.isnull().sum().to_frame('missing_values')
mis_values['%'] = round(retail.isnull().sum()*100/len(retail), 3)
mis_values['%']

purchaseid          0.000
item_ID             0.000
Quantity            0.000
purchasedate        0.000
CustomerID         34.376
loyalty_program     0.000
ShopID              0.000
Name: %, dtype: float64

In [56]:
mis_values.sort_values(by='%', ascending=False)

| | missing_values | % |
|---|---|---|
| CustomerID | 36210 | 34.376 |
| purchaseid | 0 | 0.0 |
| item_ID | 0 | 0.0 |
| Quantity | 0 | 0.0 |
| purchasedate | 0 | 0.0 |
| loyalty_program | 0 | 0.0 |
| ShopID | 0 | 0.0 |


In [57]:
# Percentage of missing values in each column
mis_values = product.isnull().sum().to_frame('missing_values')
mis_values['%'] = round(product.isnull().sum()*100/len(product), 3)
mis_values['%']

productID        0.0
price_per_one    0.0
Name: %, dtype: float64

In [58]:
mis_values.sort_values(by='%', ascending=False)

| | missing_values | % |
|---|---|---|
| productID | 0 | 0.0 |
| price_per_one | 0 | 0.0 |


In [59]:
# Print a sample of DataFrame
retail.sample(10)

| | purchaseid | item_ID | Quantity | purchasedate | CustomerID | loyalty_program | ShopID |
|---|---|---|---|---|---|---|---|
| 103172 | 543096 | 22077 | 0 | 2017-02-03 11:39:00 | 19723.0 | 0 | Shop 17 |
| 14772 | 536621 | 22536 | 47 | 2016-12-02 10:35:00 | 18787.0 | 0 | Shop 25 |
| 77523 | 540551 | 21494 | 5 | 2017-01-10 09:43:00 | NaN | 0 | Shop 0 |
| 9910 | 539453 | 35957 | 2 | 2016-12-17 17:08:00 | NaN | 0 | Shop 0 |
| 53608 | 541131 | 82583 | 1 | 2017-01-14 10:16:00 | NaN | 0 | Shop 0 |
| 27808 | 544040 | 22666 | 0 | 2017-02-15 11:40:00 | 20380.0 | 0 | Shop 12 |
| 63697 | 540257 | 47566B | 1 | 2017-01-05 17:33:00 | 18427.0 | 0 | Shop 30 |
| 3904 | 538868 | 22909 | 47 | 2016-12-14 14:42:00 | 22696.0 | 1 | Shop 30 |
| 50033 | 544434 | 23006 | 1 | 2017-02-18 16:12:00 | NaN | 0 | Shop 0 |
| 22375 | 540524 | 84755 | 7 | 2017-01-09 12:53:00 | 22414.0 | 1 | Shop 6 |


In [60]:
# Print a sample of DataFrame
product.sample(10)

| | productID | price_per_one |
|---|---|---|
| 2019 | 47518f | 4.13 |
| 571 | 21448 | 0.0 |
| 3005 | 90157 | 4.98 |
| 1456 | 22602 | 0.85 |
| 1706 | 22863 | 2.55 |
| 2846 | 90024B | 8.32 |
| 1304 | 22440 | 0.42 |
| 896 | 21945 | 0.85 |
| 131 | 20681 | 3.25 |
| 2084 | 71406C | 0.42 |


In [61]:
# Inspect the unique values in the column
retail['purchaseid'].unique()

array(['538280', '538862', '538855', ..., '540564', '542572', 'C541650'],
      dtype=object)

In [62]:
# Inspect the unique values in the column
retail['item_ID'].unique()

array(['21873', '22195', '21239', ..., '90053', '17028J', '79320'],
      dtype=object)

In [63]:
# Inspect the unique values in the column
retail['Quantity'].unique()

array([    11,      0,      7,      1,      5,      4,      3,      9,
           35,     23,      2,     71,     -2,    103,      6,     24,
          383,    -25,     18,     19,     95,    143,      8,     47,
           17,     20,     14,    -11,     12,     99,     39,     49,
          239,     25,     31,    -31,     29,    -61,     13,     15,
           28,    -13,     -3,     -8,   -101,    -10,   -940,     59,
          199,     27,     22,    119,    287,     37,     -9,     46,
           -5,    -81,     10,     -4,     -6,    215,    -41,     69,
          -16,    -26,   -724,     -7,   -100,     30,    479,   -193,
          -33,   2399,    -15,    191,     21,     63,    -24,    299,
          -19,   1295,    107,     79,     26,    140,    -17,    -14,
          359,    323,     32,   1007,     48,     64,    179,    251,
          -21,    -37,   1286,    -34,    599,    207,     33,    -36,
           52,   -145,     83,     43,    -54,    -65,     68,    503,
      

In [64]:
# Inspect the unique values in the column
retail['purchasedate'].unique()

array(['2016-12-10 12:50:00', '2016-12-14 14:11:00',
       '2016-12-14 13:50:00', ..., '2016-12-05 13:09:00',
       '2017-01-10 10:36:00', '2017-01-20 11:44:00'], dtype=object)

In [65]:
# Inspect the unique values in the column
retail['CustomerID'].unique()

array([18427., 22389., 22182., ..., 20156., 20358., 23763.])

In [66]:
# Inspect the unique values in the column
retail['loyalty_program'].unique()

array([0, 1])

In [67]:
# Inspect the unique values in the column
retail['ShopID'].unique()

array(['Shop 3', 'Shop 2', 'Shop 28', 'Shop 20', 'Shop 0', 'Shop 1',
       'Shop 6', 'Shop 17', 'Shop 24', 'Shop 18', 'Shop 16', 'Shop 14',
       'Shop 13', 'Shop 30', 'Shop 8', 'Shop 23', 'Shop 12', 'Shop 26',
       'Shop 4', 'Shop 7', 'Shop 10', 'Shop 9', 'Shop 19', 'Shop 5',
       'Shop 29', 'Shop 22', 'Shop 27', 'Shop 21', 'Shop 25', 'Shop 15',
       'Shop 11'], dtype=object)

In [68]:
# Inspect the unique values in the column
product['productID'].unique()

array(['10002', '10080', '10120', ..., 'gift_0001_40', 'gift_0001_50',
       'm'], dtype=object)

In [69]:
# Inspect the unique values in the column
product['price_per_one'].unique()

array([8.500e-01, 2.100e-01, 6.500e-01, 4.200e-01, 1.690e+00, 1.400e-01,
       5.300e-01, 2.950e+00, 4.950e+00, 4.250e+00, 4.600e+00, 1.246e+01,
       7.950e+00, 2.500e-01, 1.200e-01, 3.200e-01, 5.000e-01, 7.200e-01,
       0.000e+00, 5.550e+00, 1.250e+00, 1.600e-01, 2.900e-01, 2.100e+00,
       2.460e+00, 2.400e-01, 7.000e-02, 3.800e-01, 1.700e-01, 2.550e+00,
       1.630e+00, 6.000e-02, 1.950e+00, 5.060e+00, 7.500e-01, 8.300e-01,
       1.000e-01, 1.060e+00, 3.250e+00, 3.750e+00, 3.290e+00, 1.650e+00,
       1.450e+00, 6.350e+00, 4.650e+00, 5.490e+00, 3.995e+01, 6.750e+00,
       2.120e+00, 7.900e-01, 1.102e+01, 9.320e+00, 5.950e+00, 4.210e+00,
       8.470e+00, 5.500e-01, 4.130e+00, 9.500e-01, 2.750e+00, 7.460e+00,
       8.950e+00, 6.400e-01, 1.698e+01, 6.950e+00, 3.390e+00, 3.360e+00,
       1.530e+00, 1.850e+00, 4.000e-01, 7.620e+00, 5.910e+00, 5.790e+00,
       8.500e+00, 8.290e+00, 2.195e+01, 5.450e+00, 1.275e+01, 9.950e+00,
       2.079e+01, 1.050e+00, 1.595e+01, 1.095e+01, 

# Step 2. Data Preprocessing

The preprocessing stage can be divided into the following substages:

   * Study missing values
   
   * Study type correspondence
   
   * Study duplicate values
   
   * Check the correctness of column names
   
   * Rename the columns
   
   * Remove duplicates
   
   * Convert types
   
   * Replace missing values
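
The substages above could look roughly like this on a small invented sample that mimics the raw purchase columns. The snake_case target names, the nullable `Int64` type for customer IDs, and the `has_customer` flag are choices made for this sketch only; the actual handling of missing `CustomerID` values is still to be decided.

```python
import pandas as pd

# Invented rows shaped like the raw purchase table (one exact duplicate).
raw = pd.DataFrame({
    "purchaseid": ["100001", "100001", "100002"],
    "item_ID": ["11111", "11111", "22222"],
    "Quantity": [3, 3, 1],
    "purchasedate": ["2016-12-10 12:50:00"] * 2 + ["2016-12-14 14:11:00"],
    "CustomerID": [18000.0, 18000.0, None],
    "loyalty_program": [0, 0, 1],
    "ShopID": ["Shop 3", "Shop 3", "Shop 0"],
})

# Rename to a single snake_case convention (names chosen for this sketch).
clean = raw.rename(columns={
    "purchaseid": "purchase_id",
    "item_ID": "item_id",
    "Quantity": "quantity",
    "purchasedate": "purchase_date",
    "CustomerID": "customer_id",
    "ShopID": "shop_id",
})

# Remove fully duplicated rows.
clean = clean.drop_duplicates().reset_index(drop=True)

# Convert types: timestamps to datetime, customer IDs to nullable integers.
clean["purchase_date"] = pd.to_datetime(clean["purchase_date"])
clean["customer_id"] = clean["customer_id"].astype("Int64")

# Flag rows without a customer rather than inventing an ID for them.
clean["has_customer"] = clean["customer_id"].notna()
print(clean.dtypes)
```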


# Step 3. Exploratory Data Analysis

   * Look into the relationship between revenue growth and the probability of churn.
   * Identify the relationship between payment frequency and the probability of churn.
   * Compare the time since the last purchase with the probability of churn.
      * For each customer, find the date of the last purchase.
      * Use this data to split the customers into n categories.
      * For each category, calculate the share of the customers who left.
      * Within each category, define extra indices (e.g. total sum of payments, total number of purchases).
      * Draw conclusions: how time since the last purchase relates to customers' indices.
      * Draw conclusions: how time since the last purchase relates to churn.
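
The recency steps listed above can be sketched as follows on invented data. Several things here are assumptions rather than part of the plan: the column names, the per-customer `churned` label (the plan does not yet define how churn is determined), the choice of n = 2 categories, and the 30-day bin edge.

```python
import pandas as pd

# Invented purchases; `churned` is a hypothetical per-customer label.
purchases = pd.DataFrame({
    "customer_id": [1, 1, 2, 3, 3, 4],
    "purchase_date": pd.to_datetime([
        "2016-12-01", "2017-02-20", "2016-12-05",
        "2017-01-15", "2017-02-01", "2016-12-20",
    ]),
    "revenue": [20.0, 30.0, 15.0, 10.0, 25.0, 40.0],
})
churned = pd.Series({1: 0, 2: 1, 3: 0, 4: 1}, name="churned")

# Reference date: the latest purchase in the data.
snapshot = purchases["purchase_date"].max()

# Last purchase date plus extra indices for each customer.
per_customer = purchases.groupby("customer_id").agg(
    last_purchase=("purchase_date", "max"),
    total_revenue=("revenue", "sum"),
    n_purchases=("purchase_date", "count"),
)
per_customer["days_since_last"] = (snapshot - per_customer["last_purchase"]).dt.days
per_customer = per_customer.join(churned)

# Split customers into n = 2 recency categories (edges chosen for the toy data).
per_customer["recency_group"] = pd.cut(
    per_customer["days_since_last"], bins=[-1, 30, 365],
    labels=["0-30 days", "31+ days"],
)

# Share of churned customers and mean indices within each category.
by_group = per_customer.groupby("recency_group", observed=True).agg(
    churn_share=("churned", "mean"),
    mean_revenue=("total_revenue", "mean"),
    mean_purchases=("n_purchases", "mean"),
)
print(by_group)
```

With real data, quantile-based bins (`pd.qcut`) may be a reasonable alternative to fixed edges when recency is heavily skewed.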


# Step 4. Conclusion

Based on the results of the analysis, conclusions will be drawn on:
   * The relationship between revenue growth and the probability of churn.
   * The relationship between payment frequency and the probability of churn.
   * How the time since the last purchase relates to customers' indices.
   * How the time since the last purchase relates to churn.