- A retail company, “ABC Private Limited”, wants to understand customer purchase behaviour (specifically, purchase amount) across products of different categories. They have shared a purchase summary of various customers for selected high-volume products from last month.


- The data set also contains customer demographics (age, gender, marital status, city type, and years of stay in the current city), product details (product ID and product category), and the total purchase amount from last month.


- They now want to build a model that predicts a customer's purchase amount for a given product, which will help them create personalized offers for customers across different products.


### Data

|Variable                      |Definition                                                             |
|------------------------------|-----------------------------------------------------------------------|
|User_ID                       |User ID                                                                |
|Product_ID                    |Product ID                                                             |
|Gender                        |Sex of user                                                            |
|Age                           |Age in bins                                                            |
|Occupation                    |Occupation (masked)                                                    |
|City_Category                 |Category of the city (A, B, C)                                         |
|Stay_In_Current_City_Years    |Number of years of stay in the current city                            |
|Marital_Status                |Marital status                                                         |
|Product_Category_1            |Product category (masked)                                              |
|Product_Category_2            |Product may also belong to another category (masked)                   |
|Product_Category_3            |Product may also belong to another category (masked)                   |
|Purchase                      |Purchase amount (target variable)                                      |

In [1]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

In [2]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, Lasso, Ridge
from sklearn.ensemble import RandomForestRegressor
import xgboost
from xgboost import XGBRegressor
from lightgbm import LGBMRegressor
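# Regression metrics (accuracy_score is a classification metric and does not apply here)
from sklearn.metrics import mean_squared_error, r2_score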



In [3]:
train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")
submission = pd.read_csv("sample_submission_V9Inaty.csv")

In [4]:
display(train.head())
display(test.head())
display(submission.head())

Unnamed: 0,User_ID,Product_ID,Gender,Age,Occupation,City_Category,Stay_In_Current_City_Years,Marital_Status,Product_Category_1,Product_Category_2,Product_Category_3,Purchase
0,1000001,P00069042,F,0-17,10,A,2,0,3,,,8370
1,1000001,P00248942,F,0-17,10,A,2,0,1,6.0,14.0,15200
2,1000001,P00087842,F,0-17,10,A,2,0,12,,,1422
3,1000001,P00085442,F,0-17,10,A,2,0,12,14.0,,1057
4,1000002,P00285442,M,55+,16,C,4+,0,8,,,7969


Unnamed: 0,User_ID,Product_ID,Gender,Age,Occupation,City_Category,Stay_In_Current_City_Years,Marital_Status,Product_Category_1,Product_Category_2,Product_Category_3
0,1000004,P00128942,M,46-50,7,B,2,1,1,11.0,
1,1000009,P00113442,M,26-35,17,C,0,0,3,5.0,
2,1000010,P00288442,F,36-45,1,B,4+,1,5,14.0,
3,1000010,P00145342,F,36-45,1,B,4+,1,4,9.0,
4,1000011,P00053842,F,26-35,1,C,1,0,4,5.0,12.0


Unnamed: 0,Purchase,User_ID,Product_ID
0,100,1000004,P00128942
1,100,1000009,P00113442
2,100,1000010,P00288442
3,100,1000010,P00145342
4,100,1000011,P00053842


In [5]:
print(train.shape)
print(test.shape)

(550068, 12)
(233599, 11)


In [6]:
train_original = train.copy()
test_original = test.copy()

In [7]:
display(train.info())
display(test.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 550068 entries, 0 to 550067
Data columns (total 12 columns):
 #   Column                      Non-Null Count   Dtype  
---  ------                      --------------   -----  
 0   User_ID                     550068 non-null  int64  
 1   Product_ID                  550068 non-null  object 
 2   Gender                      550068 non-null  object 
 3   Age                         550068 non-null  object 
 4   Occupation                  550068 non-null  int64  
 5   City_Category               550068 non-null  object 
 6   Stay_In_Current_City_Years  550068 non-null  object 
 7   Marital_Status              550068 non-null  int64  
 8   Product_Category_1          550068 non-null  int64  
 9   Product_Category_2          376430 non-null  float64
 10  Product_Category_3          166821 non-null  float64
 11  Purchase                    550068 non-null  int64  
dtypes: float64(2), int64(5), object(5)
memory usage: 50.4+ MB


None

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 233599 entries, 0 to 233598
Data columns (total 11 columns):
 #   Column                      Non-Null Count   Dtype  
---  ------                      --------------   -----  
 0   User_ID                     233599 non-null  int64  
 1   Product_ID                  233599 non-null  object 
 2   Gender                      233599 non-null  object 
 3   Age                         233599 non-null  object 
 4   Occupation                  233599 non-null  int64  
 5   City_Category               233599 non-null  object 
 6   Stay_In_Current_City_Years  233599 non-null  object 
 7   Marital_Status              233599 non-null  int64  
 8   Product_Category_1          233599 non-null  int64  
 9   Product_Category_2          161255 non-null  float64
 10  Product_Category_3          71037 non-null   float64
dtypes: float64(2), int64(4), object(5)
memory usage: 19.6+ MB


None

In [8]:
# Drop the identifier columns; they remain available in train_original / test_original
train.drop(['User_ID', 'Product_ID'], axis=1, inplace=True)
test.drop(['User_ID', 'Product_ID'], axis=1, inplace=True)

In [9]:
display(train.isnull().sum())
display(test.isnull().sum())

Gender                             0
Age                                0
Occupation                         0
City_Category                      0
Stay_In_Current_City_Years         0
Marital_Status                     0
Product_Category_1                 0
Product_Category_2            173638
Product_Category_3            383247
Purchase                           0
dtype: int64

Gender                             0
Age                                0
Occupation                         0
City_Category                      0
Stay_In_Current_City_Years         0
Marital_Status                     0
Product_Category_1                 0
Product_Category_2             72344
Product_Category_3            162562
dtype: int64
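
To put the null counts in proportion, the short sketch below converts the two non-zero train counts into percentages of the 550068 training rows. The numbers are copied from the outputs above; the variable names are illustrative only.

```python
import pandas as pd

# Null counts and row total copied from the train outputs above.
null_counts = pd.Series(
    {"Product_Category_2": 173638, "Product_Category_3": 383247}
)
n_rows = 550068

# Share of rows with a missing value, as a percentage.
missing_pct = (null_counts / n_rows * 100).round(1)
print(missing_pct)
```

On the live DataFrame, `train.isnull().mean() * 100` should give the same figures for every column at once.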

### 1. Gender

In [10]:
display(train['Gender'].value_counts())
display(test['Gender'].value_counts())

M    414259
F    135809
Name: Gender, dtype: int64

M    175772
F     57827
Name: Gender, dtype: int64

In [11]:
train['Gender'] = train['Gender'].map({'M':1, 'F':0})
test['Gender'] = test['Gender'].map({'M':1, 'F':0})

In [12]:
display(train.head(3))
display(test.head(3))

Unnamed: 0,Gender,Age,Occupation,City_Category,Stay_In_Current_City_Years,Marital_Status,Product_Category_1,Product_Category_2,Product_Category_3,Purchase
0,0,0-17,10,A,2,0,3,,,8370
1,0,0-17,10,A,2,0,1,6.0,14.0,15200
2,0,0-17,10,A,2,0,12,,,1422


Unnamed: 0,Gender,Age,Occupation,City_Category,Stay_In_Current_City_Years,Marital_Status,Product_Category_1,Product_Category_2,Product_Category_3
0,1,46-50,7,B,2,1,1,11.0,
1,1,26-35,17,C,0,0,3,5.0,
2,0,36-45,1,B,4+,1,5,14.0,


### 2. Age

In [13]:
display(train['Age'].value_counts())
display(test['Age'].value_counts())

26-35    219587
36-45    110013
18-25     99660
46-50     45701
51-55     38501
55+       21504
0-17      15102
Name: Age, dtype: int64

26-35    93428
36-45    46711
18-25    42293
46-50    19577
51-55    16283
55+       9075
0-17      6232
Name: Age, dtype: int64

In [14]:
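# Caution: '55+' becomes the integer 55 here, but the map in the next cell
# only has string keys, so these rows come out as NaN (see the counts below).
train['Age'] = train['Age'].replace('55+',55)
test['Age'] = test['Age'].replace('55+',55)

In [15]:
# Map each remaining age bin to its upper bound; values without a key become NaN.
train['Age'] = train['Age'].map({'0-17': 17, '18-25': 25, '26-35': 35, '36-45': 45, '46-50': 50, '51-55': 55})
test['Age'] = test['Age'].map({'0-17': 17, '18-25': 25, '26-35': 35, '36-45': 45, '46-50': 50, '51-55': 55})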

In [16]:
display(train['Age'].value_counts())
display(test['Age'].value_counts())

35.0    219587
45.0    110013
25.0     99660
50.0     45701
55.0     38501
17.0     15102
Name: Age, dtype: int64

35.0    93428
45.0    46711
25.0    42293
50.0    19577
55.0    16283
17.0     6232
Name: Age, dtype: int64

In [41]:
# The only NaNs at this point are the former '55+' rows, so that bin ends up encoded as 0.
train['Age'] = train['Age'].fillna(0)
test['Age'] = test['Age'].fillna(0)
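
The value counts after the mapping no longer list the `55+` bin, and the later `fillna(0)` encodes those customers as age 0. A minimal sketch of one way to avoid this is to map all seven labels in a single step and fail loudly on anything unmapped. The helper name `encode_age` and the value 60 for `55+` are assumptions made for illustration, not part of the original notebook, and the scores reported below were produced with the original encoding.

```python
import pandas as pd

# Upper bound of each age bin; 60 for "55+" is an assumed stand-in,
# since that bin has no upper bound.
AGE_MAP = {
    "0-17": 17, "18-25": 25, "26-35": 35, "36-45": 45,
    "46-50": 50, "51-55": 55, "55+": 60,
}

def encode_age(age: pd.Series) -> pd.Series:
    """Map age-bin labels to integers and raise on unknown labels."""
    encoded = age.map(AGE_MAP)
    if encoded.isnull().any():
        unknown = sorted(age[encoded.isnull()].unique())
        raise ValueError(f"unmapped age labels: {unknown}")
    return encoded.astype("int64")

sample = pd.Series(["0-17", "55+", "26-35", "51-55"])
encoded_sample = encode_age(sample)
print(encoded_sample.tolist())  # [17, 60, 35, 55]
```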

### 3. City_Category

In [17]:
display(train['City_Category'].value_counts())
display(test['City_Category'].value_counts())

B    231173
C    171175
A    147720
Name: City_Category, dtype: int64

B    98566
C    72509
A    62524
Name: City_Category, dtype: int64

In [18]:
train['City_Category'] = train['City_Category'].map({'A':0, 'B':1, 'C':2})
test['City_Category'] = test['City_Category'].map({'A':0, 'B':1, 'C':2})

In [19]:
display(train['City_Category'].value_counts())
display(test['City_Category'].value_counts())

1    231173
2    171175
0    147720
Name: City_Category, dtype: int64

1    98566
2    72509
0    62524
Name: City_Category, dtype: int64

In [20]:
display(train.head(3))
display(test.head(3))

Unnamed: 0,Gender,Age,Occupation,City_Category,Stay_In_Current_City_Years,Marital_Status,Product_Category_1,Product_Category_2,Product_Category_3,Purchase
0,0,17.0,10,0,2,0,3,,,8370
1,0,17.0,10,0,2,0,1,6.0,14.0,15200
2,0,17.0,10,0,2,0,12,,,1422


Unnamed: 0,Gender,Age,Occupation,City_Category,Stay_In_Current_City_Years,Marital_Status,Product_Category_1,Product_Category_2,Product_Category_3
0,1,50.0,7,1,2,1,1,11.0,
1,1,35.0,17,2,0,0,3,5.0,
2,0,45.0,1,1,4+,1,5,14.0,


### 4. Stay_In_Current_City_Years

In [21]:
display(train['Stay_In_Current_City_Years'].value_counts())
display(test['Stay_In_Current_City_Years'].value_counts())

1     193821
2     101838
3      95285
4+     84726
0      74398
Name: Stay_In_Current_City_Years, dtype: int64

1     82604
2     43589
3     40143
4+    35945
0     31318
Name: Stay_In_Current_City_Years, dtype: int64

In [22]:
# Cap the open-ended '4+' label at 4
train['Stay_In_Current_City_Years'] = train['Stay_In_Current_City_Years'].replace('4+',4)
test['Stay_In_Current_City_Years'] = test['Stay_In_Current_City_Years'].replace('4+',4)

In [23]:
display(train['Stay_In_Current_City_Years'].value_counts())
display(test['Stay_In_Current_City_Years'].value_counts())

1    193821
2    101838
3     95285
4     84726
0     74398
Name: Stay_In_Current_City_Years, dtype: int64

1    82604
2    43589
3    40143
4    35945
0    31318
Name: Stay_In_Current_City_Years, dtype: int64
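
Replacing `'4+'` with the integer 4 leaves a column that mixes strings and an integer, which is why a separate integer cast appears later in the notebook. As a sketch of an alternative, the label can be stripped and the column cast in one step. The helper name `encode_stay_years` is hypothetical.

```python
import pandas as pd

def encode_stay_years(stay: pd.Series) -> pd.Series:
    """Strip the trailing '+' from '4+' and cast the column to int64."""
    return stay.astype(str).str.rstrip("+").astype("int64")

sample = pd.Series(["1", "2", "4+", "0"])
stay_encoded = encode_stay_years(sample)
print(stay_encoded.tolist())  # [1, 2, 4, 0]
```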

### 5. Marital_Status

In [24]:
display(train['Marital_Status'].value_counts())
display(test['Marital_Status'].value_counts())

0    324731
1    225337
Name: Marital_Status, dtype: int64

0    137807
1     95792
Name: Marital_Status, dtype: int64

### 6. Product_Category_1

In [25]:
display(train['Product_Category_1'].value_counts())
display(test['Product_Category_1'].value_counts())

5     150933
1     140378
8     113925
11     24287
2      23864
6      20466
3      20213
4      11753
16      9828
15      6290
13      5549
10      5125
12      3947
7       3721
18      3125
20      2550
19      1603
14      1523
17       578
9        410
Name: Product_Category_1, dtype: int64

5     65017
1     60321
8     48369
2     10192
11    10153
6      8860
3      8578
4      5003
16     4105
15     2694
13     2381
10     2248
12     1663
7      1624
18     1311
14      663
17      223
9       194
Name: Product_Category_1, dtype: int64

### 7. Product_Category_2

In [26]:
display(train['Product_Category_2'].value_counts())
display(test['Product_Category_2'].value_counts())

8.0     64088
14.0    55108
2.0     49217
16.0    43255
15.0    37855
5.0     26235
4.0     25677
6.0     16466
11.0    14134
17.0    13320
13.0    10531
9.0      5693
12.0     5528
10.0     3043
3.0      2884
18.0     2770
7.0       626
Name: Product_Category_2, dtype: int64

8.0     27229
14.0    23726
2.0     21281
16.0    18432
15.0    16259
4.0     11028
5.0     10930
6.0      7109
11.0     6096
17.0     5784
13.0     4523
9.0      2484
12.0     2273
10.0     1377
18.0     1257
3.0      1239
7.0       228
Name: Product_Category_2, dtype: int64

In [27]:
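# 1 never occurs in Product_Category_2 (observed values are 2-18),
# so the fill value acts as a "no second category" marker.
train['Product_Category_2'] = train['Product_Category_2'].fillna(1)
test['Product_Category_2'] = test['Product_Category_2'].fillna(1)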

In [28]:
print(train['Product_Category_2'].isnull().sum())
print(test['Product_Category_2'].isnull().sum())

0
0


### 8. Product_Category_3

In [29]:
display(train['Product_Category_3'].value_counts())
display(test['Product_Category_3'].value_counts())

16.0    32636
15.0    28013
14.0    18428
17.0    16702
5.0     16658
8.0     12562
9.0     11579
12.0     9246
13.0     5459
6.0      4890
18.0     4629
4.0      1875
11.0     1805
10.0     1726
3.0       613
Name: Product_Category_3, dtype: int64

16.0    13833
15.0    11955
14.0     7855
5.0      7141
17.0     7116
8.0      5299
9.0      4953
12.0     3869
13.0     2390
6.0      1998
18.0     1992
4.0       816
11.0      780
10.0      775
3.0       265
Name: Product_Category_3, dtype: int64

In [30]:
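# 1 never occurs in Product_Category_3 either (observed values are 3-18),
# so it again marks "no third category".
train['Product_Category_3'] = train['Product_Category_3'].fillna(1)
test['Product_Category_3'] = test['Product_Category_3'].fillna(1)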

In [31]:
print(train['Product_Category_3'].isnull().sum())
print(test['Product_Category_3'].isnull().sum())

0
0
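
Filling with a constant hides which rows were originally empty. One possible variation, sketched below on a toy frame, keeps an explicit indicator column next to the filled value. The helper `fill_with_flag`, the `_missing` suffix, and the sentinel 0 are all assumptions for illustration; the notebook itself fills with 1 and adds no indicator.

```python
import numpy as np
import pandas as pd

def fill_with_flag(df: pd.DataFrame, col: str, sentinel: int = 0) -> pd.DataFrame:
    """Return a copy with NaNs in `col` filled and a 0/1 `<col>_missing` flag added."""
    out = df.copy()
    out[col + "_missing"] = out[col].isnull().astype("int64")
    out[col] = out[col].fillna(sentinel).astype("int64")
    return out

toy = pd.DataFrame({"Product_Category_2": [6.0, np.nan, 14.0]})
flagged = fill_with_flag(toy, "Product_Category_2")
print(flagged)
```

Tree-based models can usually split on the sentinel directly, so the flag is likely to matter more for the linear models.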


In [42]:
display(train.isnull().sum())
display(test.isnull().sum())

Gender                        0
Age                           0
Occupation                    0
City_Category                 0
Stay_In_Current_City_Years    0
Marital_Status                0
Product_Category_1            0
Product_Category_2            0
Product_Category_3            0
Purchase                      0
dtype: int64

Gender                        0
Age                           0
Occupation                    0
City_Category                 0
Stay_In_Current_City_Years    0
Marital_Status                0
Product_Category_1            0
Product_Category_2            0
Product_Category_3            0
dtype: int64

In [43]:
display(train.head())
display(test.head())

Unnamed: 0,Gender,Age,Occupation,City_Category,Stay_In_Current_City_Years,Marital_Status,Product_Category_1,Product_Category_2,Product_Category_3,Purchase
0,0,17.0,10,0,2,0,3,1.0,1.0,8370
1,0,17.0,10,0,2,0,1,6.0,14.0,15200
2,0,17.0,10,0,2,0,12,1.0,1.0,1422
3,0,17.0,10,0,2,0,12,14.0,1.0,1057
4,1,0.0,16,2,4,0,8,1.0,1.0,7969


Unnamed: 0,Gender,Age,Occupation,City_Category,Stay_In_Current_City_Years,Marital_Status,Product_Category_1,Product_Category_2,Product_Category_3
0,1,50.0,7,1,2,1,1,11.0,1.0
1,1,35.0,17,2,0,0,3,5.0,1.0
2,0,45.0,1,1,4,1,5,14.0,1.0
3,0,45.0,1,1,4,1,4,9.0,1.0
4,0,35.0,1,2,1,0,4,5.0,12.0


In [44]:
display(train.info())
display(test.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 550068 entries, 0 to 550067
Data columns (total 10 columns):
 #   Column                      Non-Null Count   Dtype  
---  ------                      --------------   -----  
 0   Gender                      550068 non-null  int64  
 1   Age                         550068 non-null  float64
 2   Occupation                  550068 non-null  int64  
 3   City_Category               550068 non-null  int64  
 4   Stay_In_Current_City_Years  550068 non-null  int64  
 5   Marital_Status              550068 non-null  int64  
 6   Product_Category_1          550068 non-null  int64  
 7   Product_Category_2          550068 non-null  float64
 8   Product_Category_3          550068 non-null  float64
 9   Purchase                    550068 non-null  int64  
dtypes: float64(3), int64(7)
memory usage: 42.0 MB


None

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 233599 entries, 0 to 233598
Data columns (total 9 columns):
 #   Column                      Non-Null Count   Dtype  
---  ------                      --------------   -----  
 0   Gender                      233599 non-null  int64  
 1   Age                         233599 non-null  float64
 2   Occupation                  233599 non-null  int64  
 3   City_Category               233599 non-null  int64  
 4   Stay_In_Current_City_Years  233599 non-null  int64  
 5   Marital_Status              233599 non-null  int64  
 6   Product_Category_1          233599 non-null  int64  
 7   Product_Category_2          233599 non-null  float64
 8   Product_Category_3          233599 non-null  float64
dtypes: float64(3), int64(6)
memory usage: 16.0 MB


None

In [45]:
train['Stay_In_Current_City_Years'] = train['Stay_In_Current_City_Years'].astype('int64')
test['Stay_In_Current_City_Years'] = test['Stay_In_Current_City_Years'].astype('int64')

In [46]:
display(train.info())
display(test.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 550068 entries, 0 to 550067
Data columns (total 10 columns):
 #   Column                      Non-Null Count   Dtype  
---  ------                      --------------   -----  
 0   Gender                      550068 non-null  int64  
 1   Age                         550068 non-null  float64
 2   Occupation                  550068 non-null  int64  
 3   City_Category               550068 non-null  int64  
 4   Stay_In_Current_City_Years  550068 non-null  int64  
 5   Marital_Status              550068 non-null  int64  
 6   Product_Category_1          550068 non-null  int64  
 7   Product_Category_2          550068 non-null  float64
 8   Product_Category_3          550068 non-null  float64
 9   Purchase                    550068 non-null  int64  
dtypes: float64(3), int64(7)
memory usage: 42.0 MB


None

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 233599 entries, 0 to 233598
Data columns (total 9 columns):
 #   Column                      Non-Null Count   Dtype  
---  ------                      --------------   -----  
 0   Gender                      233599 non-null  int64  
 1   Age                         233599 non-null  float64
 2   Occupation                  233599 non-null  int64  
 3   City_Category               233599 non-null  int64  
 4   Stay_In_Current_City_Years  233599 non-null  int64  
 5   Marital_Status              233599 non-null  int64  
 6   Product_Category_1          233599 non-null  int64  
 7   Product_Category_2          233599 non-null  float64
 8   Product_Category_3          233599 non-null  float64
dtypes: float64(3), int64(6)
memory usage: 16.0 MB


None

In [47]:
# Features are every column except the last; the target is the last column (Purchase)
X = train.iloc[:,:-1]
y = train.iloc[:,-1]

In [48]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

In [49]:
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

(440054, 9) (110014, 9) (440054,) (110014,)
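
The split sizes can be checked by hand. To my understanding, `train_test_split` rounds a fractional `test_size` up to a whole number of rows and gives the remainder to the training set; the arithmetic below reproduces the shapes printed above under that assumption.

```python
import math

n_samples = 550068  # rows in train, from the shape printed earlier
test_size = 0.2

# Test rows are rounded up; the training set takes the rest.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test
print(n_train, n_test)  # 440054 110014
```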


### Linear Models (Linear Regression, Lasso, Ridge)

In [50]:
LinReg = LinearRegression()
LinReg.fit(X_train, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
         normalize=False)

In [51]:
LasReg = Lasso()
LasReg.fit(X_train, y_train)

Lasso(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=1000,
   normalize=False, positive=False, precompute=False, random_state=None,
   selection='cyclic', tol=0.0001, warm_start=False)

In [52]:
LidReg = Ridge()
LidReg.fit(X_train, y_train)

Ridge(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=None,
   normalize=False, random_state=None, solver='auto', tol=0.001)

In [53]:
y_pred_LinReg = LinReg.predict(X_test)

y_pred_lasso = LasReg.predict(X_test)

y_pred_ridge = LidReg.predict(X_test)

In [54]:
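# .score() on a scikit-learn regressor returns R^2; the lines are printed in the order Linear, Lasso, Ridge
print("Train Score {:.2f} & Test Score {:.2f}".format(LinReg.score(X_train, y_train), LinReg.score(X_test, y_test)))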
print("Train Score {:.2f} & Test Score {:.2f}".format(LasReg.score(X_train, y_train), LasReg.score(X_test, y_test)))
print("Train Score {:.2f} & Test Score {:.2f}".format(LidReg.score(X_train, y_train), LidReg.score(X_test, y_test)))

Train Score 0.15 & Test Score 0.15
Train Score 0.15 & Test Score 0.15
Train Score 0.15 & Test Score 0.15
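
The scores printed here are R² values, which are unitless. An error in purchase units such as RMSE can be easier to interpret alongside them. The sketch below implements both with NumPy on made-up numbers; the function names `rmse` and `r2` and the sample predictions are illustrative and are not results from this notebook.

```python
import numpy as np

def rmse(y_true, y_pred) -> float:
    """Root mean squared error, in the same units as the target."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def r2(y_true, y_pred) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = float(np.sum((y_true - y_pred) ** 2))
    ss_tot = float(np.sum((y_true - y_true.mean()) ** 2))
    return 1.0 - ss_res / ss_tot

y_true = np.array([8370.0, 15200.0, 1422.0, 1057.0])
y_pred = np.array([8000.0, 14000.0, 2000.0, 1500.0])  # made-up predictions
print(f"R^2 = {r2(y_true, y_pred):.3f}, RMSE = {rmse(y_true, y_pred):.1f}")
```

In the notebook itself, `sklearn.metrics.mean_squared_error(y_test, y_pred_LinReg)` followed by a square root, or `sklearn.metrics.r2_score`, would compute the same quantities.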


### 1. XGBoost

In [55]:
xgb = XGBRegressor()
xgb.fit(X_train, y_train)

XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bynode=1, colsample_bytree=1, gamma=0, gpu_id=-1,
       importance_type='gain', interaction_constraints='',
       learning_rate=0.300000012, max_delta_step=0, max_depth=6,
       min_child_weight=1, missing=nan, monotone_constraints='()',
       n_estimators=100, n_jobs=0, num_parallel_tree=1,
       objective='reg:squarederror', random_state=0, reg_alpha=0,
       reg_lambda=1, scale_pos_weight=1, subsample=1, tree_method='exact',
       validate_parameters=1, verbosity=None)

In [56]:
y_pred_xgb = xgb.predict(X_test)

In [57]:
print(" Train Score {:.2f} & Test Score {:.2f}".format(xgb.score(X_train, y_train), (xgb.score(X_test, y_test))))

 Train Score 0.68 & Test Score 0.67


### 2. Random Forest Regressor

In [58]:
rf = RandomForestRegressor(n_estimators=100)
rf.fit(X_train, y_train)

RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=None,
           oob_score=False, random_state=None, verbose=0, warm_start=False)

In [59]:
y_pred_rf = rf.predict(X_test)

In [60]:
print("Train Score {:.2f} & Test Score {:.2f}".format(rf.score(X_train, y_train), rf.score(X_test, y_test)))

Train Score 0.79 & Test Score 0.63


### 3. LightGBM

In [61]:
lgb = LGBMRegressor()
lgb.fit(X_train, y_train)

LGBMRegressor(boosting_type='gbdt', class_weight=None, colsample_bytree=1.0,
       importance_type='split', learning_rate=0.1, max_depth=-1,
       min_child_samples=20, min_child_weight=0.001, min_split_gain=0.0,
       n_estimators=100, n_jobs=-1, num_leaves=31, objective=None,
       random_state=None, reg_alpha=0.0, reg_lambda=0.0, silent=True,
       subsample=1.0, subsample_for_bin=200000, subsample_freq=0)

In [62]:
y_pred_lgb = lgb.predict(X_test)

In [63]:
print(" Train Score {:.2f} & Test Score {:.2f}".format(lgb.score(X_train, y_train), (lgb.score(X_test, y_test))))

 Train Score 0.67 & Test Score 0.67
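
Collecting the scores printed in the sections above into one table makes the comparison easier to read. The sketch below only re-enters the reported numbers and adds the train/test gap as a rough overfitting indicator; the column names are arbitrary.

```python
import pandas as pd

# Train/test R^2 values copied from the outputs above.
scores = pd.DataFrame(
    {
        "train_r2": [0.15, 0.15, 0.15, 0.68, 0.79, 0.67],
        "test_r2": [0.15, 0.15, 0.15, 0.67, 0.63, 0.67],
    },
    index=[
        "LinearRegression", "Lasso", "Ridge",
        "XGBRegressor", "RandomForestRegressor", "LGBMRegressor",
    ],
)

# A large positive gap suggests the model fits the training data
# better than it generalizes.
scores["gap"] = (scores["train_r2"] - scores["test_r2"]).round(2)
scores = scores.sort_values("gap", ascending=False)
print(scores)
```

On these numbers the random forest has the largest gap (0.16), while XGBoost and LightGBM reach the highest test score (0.67) with almost no gap.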


### Submission

In [43]:
# Predict on the preprocessed competition test set with the XGBoost model
y_pred_test = xgb.predict(test)

In [44]:
submission = pd.DataFrame({'Purchase':y_pred_test, 'User_ID':test_original['User_ID'],
                           'Product_ID':test_original['Product_ID']})
submission.to_csv('Submission.csv', index=False)
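
Before writing the file it can be worth checking that the predictions line up with the identifiers. The sketch below wraps the same three columns in a small helper that verifies the lengths match and clips negative predictions to zero. Both the helper name `build_submission` and the clipping step are assumptions added for illustration (a purchase amount cannot be negative); they are not part of the original cell.

```python
import numpy as np
import pandas as pd

def build_submission(preds, user_ids, product_ids) -> pd.DataFrame:
    """Assemble Purchase / User_ID / Product_ID after basic sanity checks."""
    preds = np.asarray(preds, dtype=float)
    if not (len(preds) == len(user_ids) == len(product_ids)):
        raise ValueError("predictions and identifier columns differ in length")
    return pd.DataFrame(
        {
            "Purchase": np.clip(preds, 0, None),  # assumed: no negative amounts
            "User_ID": list(user_ids),
            "Product_ID": list(product_ids),
        }
    )

demo = build_submission(
    [7969.4, -12.0], [1000004, 1000009], ["P00128942", "P00113442"]
)
print(demo)
```

With the notebook's variables this would be called as `build_submission(y_pred_test, test_original['User_ID'], test_original['Product_ID'])`.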