# Black Friday Sales Prediction

A retail company, “ABC Private Limited”, wants to understand customer purchase behaviour (specifically, the purchase amount) across products of different categories. They have shared a purchase summary of various customers for selected high-volume products from the last month.
The data set also contains customer demographics (age, gender, marital status, city_type, stay_in_current_city), product details (product_id and product category), and the total purchase_amount from the last month.

They now want to build a model that predicts a customer's purchase amount for various products, which will help them create personalized offers for customers across different products.

## Data

| Variable | Definition |
| --- | --- |
| User_ID | User ID |
| Product_ID | Product ID |
| Gender | Sex of user |
| Age | Age in bins |
| Occupation | Occupation (masked) |
| City_Category | Category of the city (A, B, C) |
| Stay_In_Current_City_Years | Number of years of stay in the current city |
| Marital_Status | Marital status |
| Product_Category_1 | Product category (masked) |
| Product_Category_2 | Second category the product may also belong to (masked) |
| Product_Category_3 | Third category the product may also belong to (masked) |
| Purchase | Purchase amount (target variable) |

In [1]:
import pandas as pd
import numpy as np

In [2]:
train=pd.read_csv("C:\\Users\\DELL\\Downloads\\Internship task\\Black Friday sales\\train.csv")
train

Unnamed: 0,User_ID,Product_ID,Gender,Age,Occupation,City_Category,Stay_In_Current_City_Years,Marital_Status,Product_Category_1,Product_Category_2,Product_Category_3,Purchase
0,1000001,P00069042,F,0-17,10,A,2,0,3,,,8370
1,1000001,P00248942,F,0-17,10,A,2,0,1,6.0,14.0,15200
2,1000001,P00087842,F,0-17,10,A,2,0,12,,,1422
3,1000001,P00085442,F,0-17,10,A,2,0,12,14.0,,1057
4,1000002,P00285442,M,55+,16,C,4+,0,8,,,7969
...,...,...,...,...,...,...,...,...,...,...,...,...
550063,1006033,P00372445,M,51-55,13,B,1,1,20,,,368
550064,1006035,P00375436,F,26-35,1,C,3,0,20,,,371
550065,1006036,P00375436,F,26-35,15,B,4+,1,20,,,137
550066,1006038,P00375436,F,55+,1,C,2,0,20,,,365


In [3]:
print(train.shape)


(550068, 12)


In [4]:
train.isnull().sum()

User_ID                            0
Product_ID                         0
Gender                             0
Age                                0
Occupation                         0
City_Category                      0
Stay_In_Current_City_Years         0
Marital_Status                     0
Product_Category_1                 0
Product_Category_2            173638
Product_Category_3            383247
Purchase                           0
dtype: int64

In [5]:
train = train.drop(['Product_Category_3'], axis=1)
train.head()

Unnamed: 0,User_ID,Product_ID,Gender,Age,Occupation,City_Category,Stay_In_Current_City_Years,Marital_Status,Product_Category_1,Product_Category_2,Purchase
0,1000001,P00069042,F,0-17,10,A,2,0,3,,8370
1,1000001,P00248942,F,0-17,10,A,2,0,1,6.0,15200
2,1000001,P00087842,F,0-17,10,A,2,0,12,,1422
3,1000001,P00085442,F,0-17,10,A,2,0,12,14.0,1057
4,1000002,P00285442,M,55+,16,C,4+,0,8,,7969


In [6]:
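# Fill missing Product_Category_2 values with the column mean; plain assignment avoids chained in-place modification
train['Product_Category_2'] = train['Product_Category_2'].fillna(train['Product_Category_2'].mean())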



In [7]:
train.head()

Unnamed: 0,User_ID,Product_ID,Gender,Age,Occupation,City_Category,Stay_In_Current_City_Years,Marital_Status,Product_Category_1,Product_Category_2,Purchase
0,1000001,P00069042,F,0-17,10,A,2,0,3,9.842329,8370
1,1000001,P00248942,F,0-17,10,A,2,0,1,6.0,15200
2,1000001,P00087842,F,0-17,10,A,2,0,12,9.842329,1422
3,1000001,P00085442,F,0-17,10,A,2,0,12,14.0,1057
4,1000002,P00285442,M,55+,16,C,4+,0,8,9.842329,7969


Some special characters remain, such as the '+' in the 'Age' and 'Stay_In_Current_City_Years' columns; they need to be removed before the machine learning algorithms can be run later.

In [8]:
train['Age']=(train['Age'].str.strip('+'))


In [9]:
train['Stay_In_Current_City_Years']=(train['Stay_In_Current_City_Years'].str.strip('+').astype('float'))


In [10]:
train.head()

Unnamed: 0,User_ID,Product_ID,Gender,Age,Occupation,City_Category,Stay_In_Current_City_Years,Marital_Status,Product_Category_1,Product_Category_2,Purchase
0,1000001,P00069042,F,0-17,10,A,2.0,0,3,9.842329,8370
1,1000001,P00248942,F,0-17,10,A,2.0,0,1,6.0,15200
2,1000001,P00087842,F,0-17,10,A,2.0,0,12,9.842329,1422
3,1000001,P00085442,F,0-17,10,A,2.0,0,12,14.0,1057
4,1000002,P00285442,M,55,16,C,4.0,0,8,9.842329,7969
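
The effect of `str.strip('+')` used above can be checked on a small example. The two Series below are made-up values that only mimic the raw 'Age' and 'Stay_In_Current_City_Years' columns; the names `age` and `stay` are hypothetical.

```python
import pandas as pd

# Made-up values mimicking the raw columns; not read from train.csv.
age = pd.Series(["0-17", "26-35", "55+"])
stay = pd.Series(["2", "0", "4+"])

# Remove a leading or trailing '+'; the age bins stay strings.
age_clean = age.str.strip("+")

# After stripping, the stay values are plain digits and can be cast to float.
stay_clean = stay.str.strip("+").astype(float)

print(age_clean.tolist())
print(stay_clean.tolist())
```

Note that '4+' becomes 4.0 after this step, so "four or more years" is treated as exactly four.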


In [11]:
train.isnull().sum()

User_ID                       0
Product_ID                    0
Gender                        0
Age                           0
Occupation                    0
City_Category                 0
Stay_In_Current_City_Years    0
Marital_Status                0
Product_Category_1            0
Product_Category_2            0
Purchase                      0
dtype: int64

In [12]:
train.head()

Unnamed: 0,User_ID,Product_ID,Gender,Age,Occupation,City_Category,Stay_In_Current_City_Years,Marital_Status,Product_Category_1,Product_Category_2,Purchase
0,1000001,P00069042,F,0-17,10,A,2.0,0,3,9.842329,8370
1,1000001,P00248942,F,0-17,10,A,2.0,0,1,6.0,15200
2,1000001,P00087842,F,0-17,10,A,2.0,0,12,9.842329,1422
3,1000001,P00085442,F,0-17,10,A,2.0,0,12,14.0,1057
4,1000002,P00285442,M,55,16,C,4.0,0,8,9.842329,7969


In [13]:
train.drop(['User_ID','Product_ID'], axis=1, inplace=True)

train.head()

Unnamed: 0,Gender,Age,Occupation,City_Category,Stay_In_Current_City_Years,Marital_Status,Product_Category_1,Product_Category_2,Purchase
0,F,0-17,10,A,2.0,0,3,9.842329,8370
1,F,0-17,10,A,2.0,0,1,6.0,15200
2,F,0-17,10,A,2.0,0,12,9.842329,1422
3,F,0-17,10,A,2.0,0,12,14.0,1057
4,M,55,16,C,4.0,0,8,9.842329,7969


In [14]:
import matplotlib.pyplot as plt
import seaborn as sns
plt.figure(figsize=(10,8))
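# Correlate numeric columns only; 'Gender', 'Age' and 'City_Category' still hold strings at this point
sns.heatmap(train.select_dtypes(include="number").corr(),cmap="rainbow",annot=True)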
plt.show()

<Figure size 1000x800 with 2 Axes>

The key takeaways from the above plot are the positive correlation coefficients between Purchase and three features:

- Occupation
- Stay_In_Current_City_Years
- Marital_Status

An increase in any of these three features is associated with a higher purchase amount from the customer. A correlation coefficient measures linear association only and does not by itself show that the feature causes the higher purchase.
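
The correlations with the target can also be read as numbers rather than from the heatmap colours. The following is a minimal sketch on made-up data; the frame `toy` and its values are assumptions used only to show the mechanics, not figures from train.csv.

```python
import pandas as pd

# Made-up numbers, only to show the mechanics; not taken from train.csv.
toy = pd.DataFrame({
    "Gender": ["F", "M", "M", "F", "M"],
    "Occupation": [1, 4, 7, 10, 16],
    "Marital_Status": [0, 1, 0, 1, 1],
    "Purchase": [1000, 3000, 2500, 6000, 8000],
})

# Keep numeric columns only, since string columns cannot be correlated.
numeric = toy.select_dtypes(include="number")

# Pearson correlation of every numeric feature with the target, strongest first.
corr_with_target = (
    numeric.corr()["Purchase"].drop("Purchase").sort_values(ascending=False)
)
print(corr_with_target)
```

Applied to the real `train` frame, the same expression would list each numeric feature next to its correlation with Purchase, which makes it easier to judge how strong the positive relationships actually are.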

In [15]:
train_Gender = pd.get_dummies(train['Gender'])
train_Age = pd.get_dummies(train['Age'])
train_City_Category = pd.get_dummies(train['City_Category'])
train_Stay_In_Current_City_Years = pd.get_dummies(train['Stay_In_Current_City_Years'])

data = pd.concat([train, train_Gender, train_Age, train_City_Category, train_Stay_In_Current_City_Years], axis=1)

data.head()

Unnamed: 0,Gender,Age,Occupation,City_Category,Stay_In_Current_City_Years,Marital_Status,Product_Category_1,Product_Category_2,Purchase,F,...,51-55,55,A,B,C,0.0,1.0,2.0,3.0,4.0
0,F,0-17,10,A,2.0,0,3,9.842329,8370,1,...,0,0,1,0,0,0,0,1,0,0
1,F,0-17,10,A,2.0,0,1,6.0,15200,1,...,0,0,1,0,0,0,0,1,0,0
2,F,0-17,10,A,2.0,0,12,9.842329,1422,1,...,0,0,1,0,0,0,0,1,0,0
3,F,0-17,10,A,2.0,0,12,14.0,1057,1,...,0,0,1,0,0,0,0,1,0,0
4,M,55,16,C,4.0,0,8,9.842329,7969,0,...,0,1,0,0,1,0,0,0,0,1


In [16]:
data.columns

Index([                    'Gender',                        'Age',
                       'Occupation',              'City_Category',
       'Stay_In_Current_City_Years',             'Marital_Status',
               'Product_Category_1',         'Product_Category_2',
                         'Purchase',                          'F',
                                'M',                       '0-17',
                            '18-25',                      '26-35',
                            '36-45',                      '46-50',
                            '51-55',                         '55',
                                'A',                          'B',
                                'C',                          0.0,
                                1.0,                          2.0,
                                3.0,                          4.0],
      dtype='object')

In [17]:
data.drop(['Gender','City_Category','Age'], axis=1, inplace=True)
data.head()

Unnamed: 0,Occupation,Stay_In_Current_City_Years,Marital_Status,Product_Category_1,Product_Category_2,Purchase,F,M,0-17,18-25,...,51-55,55,A,B,C,0.0,1.0,2.0,3.0,4.0
0,10,2.0,0,3,9.842329,8370,1,0,1,0,...,0,0,1,0,0,0,0,1,0,0
1,10,2.0,0,1,6.0,15200,1,0,1,0,...,0,0,1,0,0,0,0,1,0,0
2,10,2.0,0,12,9.842329,1422,1,0,1,0,...,0,0,1,0,0,0,0,1,0,0
3,10,2.0,0,12,14.0,1057,1,0,1,0,...,0,0,1,0,0,0,0,1,0,0
4,16,4.0,0,8,9.842329,7969,0,1,0,0,...,0,1,0,0,1,0,0,0,0,1


In [18]:
X=data[[ 'Occupation','Stay_In_Current_City_Years','Marital_Status','Product_Category_1','Product_Category_2',
'M','F','0-17','18-25','26-35','36-45','46-50','51-55','55','A','B','C',0.0,1.0,2.0,3.0,4.0]]
y=data["Purchase"]

In [19]:
X.shape

(550068, 22)

In [20]:
y.shape

(550068,)

In [21]:
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test = train_test_split(X, y, train_size = 0.7, test_size = 0.3, random_state = 100)

In [22]:
print(x_train.shape)
print(x_test.shape)
print(y_train.shape)
print(y_test.shape)

(385047, 22)
(165021, 22)
(385047,)
(165021,)


In [23]:
from sklearn.linear_model import LinearRegression
lm = LinearRegression()
lm.fit(x_train, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

In [24]:
print(lm.intercept_)
print(lm.coef_)

1613086439651099.8
[ 5.25574583e+00 -8.15345389e+14 -7.48254036e+01 -4.10600375e+02
 -7.05543383e+01  1.07306151e+12  1.07306151e+12  6.29943794e+12
  6.29943794e+12  6.29943794e+12  6.29943794e+12  6.29943794e+12
  6.29943794e+12  6.29943794e+12  1.89406859e+13  1.89406859e+13
  1.89406859e+13 -1.63939962e+15 -8.24054236e+14 -8.70884654e+12
  8.06636543e+14  1.62198193e+15]


In [25]:
coeff=pd.DataFrame(lm.coef_,X.columns,columns=["Coefficient"])
coeff

Unnamed: 0,Coefficient
Occupation,5.255746
Stay_In_Current_City_Years,-815345400000000.0
Marital_Status,-74.8254
Product_Category_1,-410.6004
Product_Category_2,-70.55434
M,1073062000000.0
F,1073062000000.0
0-17,6299438000000.0
18-25,6299438000000.0
26-35,6299438000000.0


In [26]:
y_pred=lm.predict(x_test)



In [27]:
from sklearn import metrics
mae=metrics.mean_absolute_error(y_test,y_pred)
mse=metrics.mean_squared_error(y_test,y_pred)
rmse=np.sqrt(mse)
r_sq= metrics.r2_score(y_test, y_pred)
print("Mean absolute error is:",mae)
print("Mean squared error is:",mse)
print(" Root Mean squared error is:",rmse)
print('R2 score is {}'.format(r_sq))

Mean absolute error is: 3598.373211591252
Mean squared error is: 22095830.76246243
 Root Mean squared error is: 4700.62025295199
R2 score is 0.12635791659805684


In [28]:
test=pd.read_csv("C:\\Users\\DELL\\Downloads\\Internship task\\Black Friday sales\\test.csv")
test

Unnamed: 0,User_ID,Product_ID,Gender,Age,Occupation,City_Category,Stay_In_Current_City_Years,Marital_Status,Product_Category_1,Product_Category_2,Product_Category_3
0,1000004,P00128942,M,46-50,7,B,2,1,1,11.0,
1,1000009,P00113442,M,26-35,17,C,0,0,3,5.0,
2,1000010,P00288442,F,36-45,1,B,4+,1,5,14.0,
3,1000010,P00145342,F,36-45,1,B,4+,1,4,9.0,
4,1000011,P00053842,F,26-35,1,C,1,0,4,5.0,12.0
...,...,...,...,...,...,...,...,...,...,...,...
233594,1006036,P00118942,F,26-35,15,B,4+,1,8,,
233595,1006036,P00254642,F,26-35,15,B,4+,1,5,8.0,
233596,1006036,P00031842,F,26-35,15,B,4+,1,1,5.0,12.0
233597,1006037,P00124742,F,46-50,1,C,4+,0,10,16.0,


In [29]:
print(test.shape)

(233599, 11)


In [30]:
test.isnull().sum()

User_ID                            0
Product_ID                         0
Gender                             0
Age                                0
Occupation                         0
City_Category                      0
Stay_In_Current_City_Years         0
Marital_Status                     0
Product_Category_1                 0
Product_Category_2             72344
Product_Category_3            162562
dtype: int64

In [31]:
test = test.drop(['Product_Category_3'], axis=1)
test.head()

Unnamed: 0,User_ID,Product_ID,Gender,Age,Occupation,City_Category,Stay_In_Current_City_Years,Marital_Status,Product_Category_1,Product_Category_2
0,1000004,P00128942,M,46-50,7,B,2,1,1,11.0
1,1000009,P00113442,M,26-35,17,C,0,0,3,5.0
2,1000010,P00288442,F,36-45,1,B,4+,1,5,14.0
3,1000010,P00145342,F,36-45,1,B,4+,1,4,9.0
4,1000011,P00053842,F,26-35,1,C,1,0,4,5.0


In [32]:
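# Fill missing Product_Category_2 values with the column mean; plain assignment avoids chained in-place modification
test['Product_Category_2'] = test['Product_Category_2'].fillna(test['Product_Category_2'].mean())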


In [33]:
test['Age']=(test['Age'].str.strip('+'))

In [34]:
test['Stay_In_Current_City_Years']=(test['Stay_In_Current_City_Years'].str.strip('+').astype('float'))


In [35]:
test_Gender = pd.get_dummies(test['Gender'])
test_Age = pd.get_dummies(test['Age'])
test_City_Category = pd.get_dummies(test['City_Category'])
test_Stay_In_Current_City_Years = pd.get_dummies(test['Stay_In_Current_City_Years'])

df = pd.concat([test, test_Gender, test_Age, test_City_Category, test_Stay_In_Current_City_Years], axis=1)


df.head()

Unnamed: 0,User_ID,Product_ID,Gender,Age,Occupation,City_Category,Stay_In_Current_City_Years,Marital_Status,Product_Category_1,Product_Category_2,...,51-55,55,A,B,C,0.0,1.0,2.0,3.0,4.0
0,1000004,P00128942,M,46-50,7,B,2.0,1,1,11.0,...,0,0,0,1,0,0,0,1,0,0
1,1000009,P00113442,M,26-35,17,C,0.0,0,3,5.0,...,0,0,0,0,1,1,0,0,0,0
2,1000010,P00288442,F,36-45,1,B,4.0,1,5,14.0,...,0,0,0,1,0,0,0,0,0,1
3,1000010,P00145342,F,36-45,1,B,4.0,1,4,9.0,...,0,0,0,1,0,0,0,0,0,1
4,1000011,P00053842,F,26-35,1,C,1.0,0,4,5.0,...,0,0,0,0,1,0,1,0,0,0


In [36]:
df.drop(['Gender','City_Category','Age'], axis=1, inplace=True)
df.head()

Unnamed: 0,User_ID,Product_ID,Occupation,Stay_In_Current_City_Years,Marital_Status,Product_Category_1,Product_Category_2,F,M,0-17,...,51-55,55,A,B,C,0.0,1.0,2.0,3.0,4.0
0,1000004,P00128942,7,2.0,1,1,11.0,0,1,0,...,0,0,0,1,0,0,0,1,0,0
1,1000009,P00113442,17,0.0,0,3,5.0,0,1,0,...,0,0,0,0,1,1,0,0,0,0
2,1000010,P00288442,1,4.0,1,5,14.0,1,0,0,...,0,0,0,1,0,0,0,0,0,1
3,1000010,P00145342,1,4.0,1,4,9.0,1,0,0,...,0,0,0,1,0,0,0,0,0,1
4,1000011,P00053842,1,1.0,0,4,5.0,1,0,0,...,0,0,0,0,1,0,1,0,0,0


In [37]:
df_copy=df.copy()
df_copy.head()

Unnamed: 0,User_ID,Product_ID,Occupation,Stay_In_Current_City_Years,Marital_Status,Product_Category_1,Product_Category_2,F,M,0-17,...,51-55,55,A,B,C,0.0,1.0,2.0,3.0,4.0
0,1000004,P00128942,7,2.0,1,1,11.0,0,1,0,...,0,0,0,1,0,0,0,1,0,0
1,1000009,P00113442,17,0.0,0,3,5.0,0,1,0,...,0,0,0,0,1,1,0,0,0,0
2,1000010,P00288442,1,4.0,1,5,14.0,1,0,0,...,0,0,0,1,0,0,0,0,0,1
3,1000010,P00145342,1,4.0,1,4,9.0,1,0,0,...,0,0,0,1,0,0,0,0,0,1
4,1000011,P00053842,1,1.0,0,4,5.0,1,0,0,...,0,0,0,0,1,0,1,0,0,0


In [38]:
df.drop(['User_ID','Product_ID'], axis=1, inplace=True)
df.head()

Unnamed: 0,Occupation,Stay_In_Current_City_Years,Marital_Status,Product_Category_1,Product_Category_2,F,M,0-17,18-25,26-35,...,51-55,55,A,B,C,0.0,1.0,2.0,3.0,4.0
0,7,2.0,1,1,11.0,0,1,0,0,0,...,0,0,0,1,0,0,0,1,0,0
1,17,0.0,0,3,5.0,0,1,0,0,1,...,0,0,0,0,1,1,0,0,0,0
2,1,4.0,1,5,14.0,1,0,0,0,0,...,0,0,0,1,0,0,0,0,0,1
3,1,4.0,1,4,9.0,1,0,0,0,0,...,0,0,0,1,0,0,0,0,0,1
4,1,1.0,0,4,5.0,1,0,0,0,1,...,0,0,0,0,1,0,1,0,0,0


In [39]:
from sklearn.linear_model import LinearRegression
lm = LinearRegression()
lm.fit(x_train, y_train)


LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

In [41]:
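# Reorder the test columns to match the training feature order (X lists 'M' before 'F', df has 'F' before 'M')
y_pred=lm.predict(df[X.columns])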
print(y_pred.shape)
print(y_pred)

(233599,)
[10558.  10669.   9167.  ... 11402.   7586.5 10190.5]


In [42]:
submission = pd.DataFrame({"Purchase": y_pred,"User_ID":df_copy["User_ID"],"Product_ID":df_copy["Product_ID"]})

In [43]:
submission.to_csv('sample_submission_V9Inaty.csv', index=False)