# Black Friday Purchase Forecasting
### Dataset
https://www.kaggle.com/datasets/sdolezel/black-friday

<div>
    <h2>Problem Statement</h2>
A retail company, "ABC Private Limited", wants to understand customer purchase behaviour (specifically, purchase amount) across products in different categories. They have shared a purchase summary of various customers for selected high-volume products from last month. The data set also contains customer demographics <i>(age, gender, marital status, city_type, stay_in_current_city)</i>, product details <i>(product_id and product category)</i> and the total purchase_amount from last month.

Now they want to build a model that predicts a customer's purchase amount for various products, which will help them create personalized offers for customers on different products.
    </div>

<div class="alert alert-info">
    <h2>Plan for the project:</h2>
    <ol>
        <li>Load the dataset</li>
        <li>Exploratory Data Analysis</li>
        <li>Feature Engineering
            <ol>
                <li>Feature generation</li>
                <li>Feature selection</li>
                <li>Feature scaling</li>
            </ol>
        </li>
        <li>Model selection</li>
        <li>Hyper-parameter tuning (GridSearchCV)</li>
        <li>Measure Model performance (cross validation)</li>
    </ol>
</div>

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

%matplotlib inline

In [2]:
df = pd.read_csv('black_friday.csv')
df.head()

Unnamed: 0,User_ID,Product_ID,Gender,Age,Occupation,City_Category,Stay_In_Current_City_Years,Marital_Status,Product_Category_1,Product_Category_2,Product_Category_3,Purchase
0,1000001,P00069042,F,0-17,10,A,2,0,3,,,8370
1,1000001,P00248942,F,0-17,10,A,2,0,1,6.0,14.0,15200
2,1000001,P00087842,F,0-17,10,A,2,0,12,,,1422
3,1000001,P00085442,F,0-17,10,A,2,0,12,14.0,,1057
4,1000002,P00285442,M,55+,16,C,4+,0,8,,,7969


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 550068 entries, 0 to 550067
Data columns (total 12 columns):
 #   Column                      Non-Null Count   Dtype  
---  ------                      --------------   -----  
 0   User_ID                     550068 non-null  int64  
 1   Product_ID                  550068 non-null  object 
 2   Gender                      550068 non-null  object 
 3   Age                         550068 non-null  object 
 4   Occupation                  550068 non-null  int64  
 5   City_Category               550068 non-null  object 
 6   Stay_In_Current_City_Years  550068 non-null  object 
 7   Marital_Status              550068 non-null  int64  
 8   Product_Category_1          550068 non-null  int64  
 9   Product_Category_2          376430 non-null  float64
 10  Product_Category_3          166821 non-null  float64
 11  Purchase                    550068 non-null  int64  
dtypes: float64(2), int64(5), object(5)
memory usage: 50.4+ MB


<div class='alert alert-info'>
    <h4>Observations:</h4>
    <ol>
        <li>Number of observations: <b>550068</b></li>
        <li>Number of columns: <b>12</b></li>
        <li>User_ID and Product_ID are identifier columns and are not used as features in this analysis</li>
    </ol>
</div>
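The `df.info()` output above also shows that `Product_Category_2` and `Product_Category_3` have fewer non-null entries than the other columns. As a minimal sketch, the missing share can be worked out from the printed counts alone, without loading the file. The helper name `missing_share` is hypothetical and only used for illustration.

```python
# Non-null counts copied from the df.info() output above.
total_rows = 550068
non_null = {
    "Product_Category_2": 376430,
    "Product_Category_3": 166821,
}


def missing_share(non_null_count, n_rows):
    """Return the fraction of rows with a missing value."""
    return 1 - non_null_count / n_rows


shares = {col: missing_share(cnt, total_rows) for col, cnt in non_null.items()}
for col, share in shares.items():
    print(f"{col}: {share:.1%} missing")
```

On the loaded DataFrame, `df.isna().mean()` should give the same per-column fractions.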

#### Drop User_ID and Product_ID columns

In [6]:
df.drop(['User_ID', 'Product_ID'], axis=1, inplace=True)
df.head()

Unnamed: 0,Gender,Age,Occupation,City_Category,Stay_In_Current_City_Years,Marital_Status,Product_Category_1,Product_Category_2,Product_Category_3,Purchase
0,F,0-17,10,A,2,0,3,,,8370
1,F,0-17,10,A,2,0,1,6.0,14.0,15200
2,F,0-17,10,A,2,0,12,,,1422
3,F,0-17,10,A,2,0,12,14.0,,1057
4,M,55+,16,C,4+,0,8,,,7969
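Because `inplace=True` mutates `df`, running the drop cell a second time raises a `KeyError`, since the columns are already gone. A minimal sketch of a re-runnable variant, shown on a tiny stand-in frame `demo` (an assumption for illustration, not the real dataset):

```python
import pandas as pd

# Tiny stand-in for df, with values taken from the head() output above.
demo = pd.DataFrame({
    "User_ID": [1000001, 1000002],
    "Product_ID": ["P00069042", "P00285442"],
    "Purchase": [8370, 7969],
})

id_cols = ["User_ID", "Product_ID"]

# errors="ignore" skips labels that are already absent,
# so the same cell can be executed repeatedly.
demo = demo.drop(columns=id_cols, errors="ignore")
demo = demo.drop(columns=id_cols, errors="ignore")  # second run is a no-op

print(demo.columns.tolist())
```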


## Explore the features to identify categorical ones

<div class="alert alert-info">
    Looking at the dataset above, without any further analysis, we can see that the following features are categorical:
    <ol>
        <li>Gender</li>
        <li>Age (Age groups)</li>
        <li>City_Category</li>
        <li>Marital_Status</li>
    </ol>
    </div>
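One way to back up this reading is to count distinct values per column: a column with only a handful of distinct values is a candidate categorical feature, even when it is stored as a number (as `Marital_Status` is). The sketch below rebuilds some of the columns from the five rows shown earlier as a small `sample` DataFrame, a stand-in for `df`, so the counts are only illustrative. The threshold `max_levels` is an arbitrary assumption.

```python
import pandas as pd

# Columns from the five rows printed by df.head() above (ID columns dropped).
sample = pd.DataFrame({
    "Gender": ["F", "F", "F", "F", "M"],
    "Age": ["0-17", "0-17", "0-17", "0-17", "55+"],
    "Occupation": [10, 10, 10, 10, 16],
    "City_Category": ["A", "A", "A", "A", "C"],
    "Stay_In_Current_City_Years": ["2", "2", "2", "2", "4+"],
    "Marital_Status": [0, 0, 0, 0, 0],
    "Purchase": [8370, 15200, 1422, 1057, 7969],
})

# Number of distinct values in each column.
n_unique = sample.nunique()

# Columns that are not stored as numbers are categorical by construction.
non_numeric_cols = sample.select_dtypes(exclude="number").columns.tolist()

# Assumed threshold: columns with few distinct values are candidates too.
max_levels = 3
candidates = n_unique[n_unique <= max_levels].index.tolist()

print(n_unique)
print(non_numeric_cols)
print(candidates)
```

On the full dataset the same two checks would be run on `df` directly, with `max_levels` chosen after looking at the `nunique()` output.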

### Further exploring the features...

In [8]:
df.Stay_In_Current_City_Years.value_counts()

1     193821
2     101838
3      95285
4+     84726
0      74398
Name: Stay_In_Current_City_Years, dtype: int64

In [9]:
df['Product_Category_1'].value_counts()

5     150933
1     140378
8     113925
11     24287
2      23864
6      20466
3      20213
4      11753
16      9828
15      6290
13      5549
10      5125
12      3947
7       3721
18      3125
20      2550
19      1603
14      1523
17       578
9        410
Name: Product_Category_1, dtype: int64

<div class='alert alert-info'>
    <h4>Observations:</h4>
    <ol>
        <li><b>Stay_In_Current_City_Years</b> can be considered as a categorical feature.</li>
        <li><b>Product_Category_1, Product_Category_2 and Product_Category_3</b> should be considered <b>non-categorical</b> features, because they seem to have unpredictable categories.</li>
    </ol>
</div>
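As a sketch of the first observation, `Stay_In_Current_City_Years` could be stored as an ordered categorical so that the `4+` bucket sorts after `3`. The level list is taken from the `value_counts()` output above. The `stay` Series is a small made-up sample, not the real column.

```python
import pandas as pd

# Levels as they appear in the value_counts() output, in their natural order.
levels = ["0", "1", "2", "3", "4+"]

# Small illustrative sample standing in for df["Stay_In_Current_City_Years"].
stay = pd.Series(["2", "4+", "0", "1", "3", "4+"])

# Ordered categorical: sorting and comparisons follow the order of `levels`.
stay_cat = pd.Categorical(stay, categories=levels, ordered=True)

# The integer codes can serve as a simple ordinal encoding.
codes = pd.Series(stay_cat.codes, index=stay.index)

print(stay_cat)
print(codes.tolist())
```

Whether an ordinal encoding or one-hot encoding suits this feature better depends on the model chosen later, so this is only one possible option.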