### About the data

This dataset comprises sales transactions captured at a retail store. It’s a classic dataset for practising feature engineering and for building a day-to-day understanding of shopping behaviour. This is a regression problem. The idea and dataset are taken from AnalyticsVidhya, where the project was part of a hackathon.

### Data Description
| Variable | Definition |
| --- | --- |
| User_ID | User ID |
| Product_ID | Product ID |
| Gender | Sex of user |
| Age | Age in bins |
| Occupation | Occupation (masked) |
| City_Category | Category of the city (A, B, C) |
| Stay_In_Current_City_Years | Number of years of stay in the current city |
| Marital_Status | Marital status |
| Product_Category_1 | Product category (masked) |
| Product_Category_2 | Product may also belong to another category (masked) |
| Product_Category_3 | Product may also belong to another category (masked) |
| Purchase | Purchase amount (target variable) |

### Import Libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

### Import dataset

In [None]:
test = pd.read_csv("../input/black-friday/test.csv")
sales = pd.read_csv("../input/black-friday/train.csv")
sales.head()

In [None]:
test.head()

In [None]:
#for submission

# IDs come from the test set; Purchase stays empty until predictions are available
submission = test[['User_ID', 'Product_ID']].copy()
submission['Purchase'] = np.nan

### Data exploration

In [None]:
sales.shape

In [None]:
sales.info()

Looking at the data, we can conclude that our set possesses 12 different parameters: 7 numerical (integer and float) and 5 object variables.

Summary statistics for these 7 numerical features:

In [None]:
sales.describe()

There are 12 features. Let's look at each of them:

1. User_ID: Each user has been assigned a unique ID. Let's see how many unique users we have in our dataset.

In [None]:
sales.User_ID.nunique()

There are 5891 unique users in our dataset, and none of the values in this feature is null.

2. Product_ID: Each product that is available for sale has a unique product ID associated with it. Let's look at the number of unique products available for sale.

In [None]:
sales.Product_ID.nunique()

So there are a total of 3631 products available for sale.

3. Gender: Gender is a categorical variable with 2 categories: Male (M) and Female (F).

In [None]:
sales.Gender.value_counts(normalize=True)*100

There are no null values in this feature, and males constitute 75% of the data.

4. Age: Age is again categorical, with ages divided into ranges.

In [None]:
sales.Age.value_counts()

The age is divided into 7 categories, i.e. 0-17, 18-25, 26-35, 36-45, 46-50, 51-55, 55+. The bin sizes vary.

5. Occupation: The Occupation number is the ID of each customer's occupation type. We can see that 21 different occupations exist.

In [None]:
sales.Occupation.nunique()

6. City_Category: The cities have been grouped into 3 categories, i.e. A, B, C.

In [None]:
sales.City_Category.value_counts()

7. Stay_In_Current_City_Years: This depicts the number of years a person has been residing in that particular city. It has been divided into 5 categories.

In [None]:
sales.Stay_In_Current_City_Years.value_counts()

8. Marital_Status: This feature shows whether a person is married or not.

In [None]:
sales.Marital_Status.value_counts()

The products are described by three category fields, which are three different features:
    
9. Product_Category_1 
    
10. Product_Category_2 
    
11. Product_Category_3

12. Purchase: This is our final feature and our dependent variable, the purchase amount, whose value we want to predict. It is a continuous variable, which makes this a regression problem.

### Missing data

In [None]:
# lets combine the data for data prep

test['Purchase']=np.nan
sales['data']='train'
test['data']='test'
test=test[sales.columns]
combined=pd.concat([sales,test],axis=0)

In [None]:
combined.head()

In [None]:
sales.isna().sum().sort_values(ascending=False)

Here we can see that 2 features contain missing values, i.e. Product_Category_2 and Product_Category_3.
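The share of missing values per column is usually more informative than the raw count. One compact way to get it is `isna().mean()`, since the mean of a boolean mask is the fraction of `True` values. A minimal sketch on a made-up frame (the `toy` frame and its values are assumptions for illustration, not the Black Friday data):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: values are made up for illustration only
toy = pd.DataFrame({
    "Product_Category_1": [1, 5, 8, 3],
    "Product_Category_2": [2.0, np.nan, 6.0, 14.0],
    "Product_Category_3": [np.nan, np.nan, np.nan, 5.0],
})

# Mean of the boolean null mask = fraction of missing values per column
missing_frac = toy.isna().mean().sort_values(ascending=False)
print(missing_frac)
```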

In [None]:
#percent of missing data relative to all data
percent = (sales.isnull().sum()/sales.isnull().count()).sort_values(ascending=False)
percent.iloc[[0,1]]  # positional lookup of the two columns with the most missing data

The feature Product_Category_3 has 70% of its data missing, so imputing this much data is not feasible; it is better to drop this feature.

In [None]:
combined.drop('Product_Category_3',axis=1,inplace=True)

The feature Product_Category_2 has 30% of its data missing, so we can impute values using an appropriate method.

In [None]:
combined.Product_Category_2.value_counts()

Product_Category_2 is divided into almost 18 categories. Imputing the mean value does not make sense, because that gives a decimal value (9.8) which is not a product category here. That leaves the median or the mode as simple options; the next cell instead fills the gaps by sampling from the observed category distribution.
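To make the trade-off concrete, here is a minimal sketch on a made-up series (the values in `s` are assumptions, not taken from the dataset). It shows that the mean is generally not a valid category code, that the mode gives every gap the same value, and that frequency-weighted sampling only ever fills in categories that already occur:

```python
import numpy as np
import pandas as pd

# Hypothetical category column with gaps; values are made up for illustration
s = pd.Series([2.0, 8.0, 8.0, 14.0, np.nan, np.nan, 8.0, 2.0, np.nan, 16.0])

print("mean:", s.mean())      # generally not a valid category code
print("mode:", s.mode()[0])   # valid, but every gap would get the same value

# Sample replacements in proportion to the observed category frequencies
rng = np.random.default_rng(0)  # seeded so the sketch is reproducible
freq = s.value_counts(normalize=True)
mask = s.isna()
filled = s.copy()
filled[mask] = rng.choice(freq.index.to_numpy(), size=mask.sum(), p=freq.to_numpy())

print(filled.value_counts().sort_index())
```

Seeding the generator is a choice made for this sketch; without a seed, the imputed values change on every run.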

In [None]:
#imputed missing values with random values in the same probability distribution as given feature already had

vc = combined.Product_Category_2.value_counts(normalize = True)
miss = combined.Product_Category_2.isna()
combined.loc[miss, 'Product_Category_2'] = np.random.choice(vc.index, size = miss.sum(), p = vc.values)

In [None]:
combined.Product_Category_2.value_counts()

In [None]:
combined.isna().sum()

There are no null values left in the features. The remaining null values in Purchase come from the test data, whose purchase amounts still need to be predicted.

### Exploratory Data Analysis

In [None]:
#using the train data part from combined dataset for eda

sales_1 = combined[combined['data']=='train']

#### Univariate Analysis:

In [None]:
sns.countplot(x='Gender', data=sales_1)
plt.show()

The graph shows that there are almost 3 times more male customers than female customers.

In [None]:
sns.countplot(x='Age', data=sales_1)
plt.show()

The graph shows that the majority of the customers who purchase during the sales season belong to the age groups 26-35 and 36-45.

In [None]:
sns.countplot(x='Occupation', data=sales_1)
plt.show()

The graph shows that the top three occupations by number of buyers are 4, 0, and 7.

In [None]:
sns.countplot(x='City_Category', data=sales_1)
plt.show()

The graph shows that people from city category B buy the most during the sale.

In [None]:
sns.countplot(x='Stay_In_Current_City_Years', data=sales_1)
plt.show()

The graph shows that the majority of people buying during the sale have lived in their current city for one year.

In [None]:
sns.countplot(x='Marital_Status', data=sales_1)
plt.show()

The graph shows that single people tend to buy more during sales.

#### Bivariate Analysis / Multivariate Analysis:

In [None]:
# Average amount spent by different age groups

data = sales_1.groupby('Age')['Purchase'].mean()
plt.plot(data.index,data.values,marker='o',color='g')
plt.xlabel('Age group');
plt.ylabel('Average_Purchase amount in $');
plt.title('Age group vs average amount spent');
plt.show()

The average amount spent during the festive season sales is highest for the 51-55 age group.

In [None]:
# Average amount spent based on the length of stay in the current city

data = sales_1.groupby('Stay_In_Current_City_Years')['Purchase'].mean()
plt.plot(data.index,data.values,marker='o',color='y')
plt.xlabel('Stay_In_Current_City_Years');
plt.ylabel('Average_Purchase amount in $');
plt.title('Stay_In_Current_City_Years vs average amount spent');
plt.show()

People who have been living in their current city for 2 or more years spend more on average in the Black Friday sales.

In [None]:
# Average purchase based on Marital_Status

data = sales_1.groupby('Marital_Status')['Purchase'].mean()
plt.bar(data.index,data.values)
plt.xlabel('Marital_Status');
plt.ylabel('Average_Purchase amount in $');
plt.title('Average purchase based on Marital_Status');
plt.show()

Married and unmarried purchasers have almost the same average purchase.

In [None]:
# Top 10 products which made the highest sales

# select Purchase before summing so the non-numeric columns are not aggregated
data = sales_1.groupby("Product_ID")['Purchase'].sum()

plt.figure(figsize=(10,5))
data.sort_values(ascending=False)[0:10].plot(kind='bar')
plt.xticks(rotation=90)
plt.xlabel('Product ID')
plt.ylabel('Total amount purchased in Million $')
plt.title('Top 10 Products with highest sales')
plt.show()

In [None]:
#comparing based on Marital_Status and Gender

sns.countplot(x='Marital_Status',data=sales_1,hue='Gender')
plt.title('Comparing based on Marital_Status and Gender')
plt.show()

Males tend to purchase more. Unmarried males make up around 45% of the data and purchase about $9000 on average.

Products that are most purchased by each of the age group:

In [None]:
a =pd.crosstab(sales_1['Age'],sales_1['Product_ID'])
a.idxmax(axis=1)

In [None]:
#Occupations and City Category

plt.figure(figsize=(15,5))
sns.countplot(x='Occupation',data=sales_1,hue='City_Category')
plt.title('Comparing Occupations and City Category')
plt.show()

People from occupations 4, 0, and 7 buy the most, and most of the people in these occupations belong to City_Category B.

In [None]:
#the purchase habits of different genders across the different city categories.

g = sns.FacetGrid(sales_1,col="City_Category")
g.map(sns.barplot, "Gender", "Purchase")
plt.show()

For City_Category A and B, males tend to dominate the purchasing, whereas the opposite holds for City_Category C, where females tend to purchase more than males.

### Data preprocessing

In [None]:
# for datapreprocessing again working with the combined dataset
combined.head()

1. User_ID and Product_ID: 
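Both ID columns carry a constant part that can be stripped to leave plain integers. As a quick sanity check of that idea, here is a minimal sketch on hypothetical IDs written in the dataset's format (the `ids` frame and its values are assumptions for illustration):

```python
import pandas as pd

# Hypothetical IDs in the same format as the dataset
ids = pd.DataFrame({
    "User_ID": [1000002, 1004321],
    "Product_ID": ["P00069042", "P00248942"],
})

# Drop the constant 1000000 offset from the user IDs
ids["User_ID"] = ids["User_ID"] - 1000000

# Strip the literal 'P00' prefix, then parse the remaining digits as integers
ids["Product_ID"] = pd.to_numeric(
    ids["Product_ID"].str.replace("P00", "", regex=False), errors="coerce"
)
print(ids)
```

Note that `errors="coerce"` turns any ID that still contains non-digit characters into `NaN`, so an `isna()` check after the conversion is a cheap safeguard.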

In [None]:
# User_ID data preprocess. e.g. 1000002 -> 2

combined['User_ID'] = combined['User_ID'] - 1000000

# Product_ID preprocess e.g. P00069042 -> 69042

combined['Product_ID'] = combined['Product_ID'].str.replace('P00', '')

#object to int
combined['Product_ID'] = pd.to_numeric(combined['Product_ID'],errors='coerce')

In [None]:
combined.info()

2. Product_Category_2 :

All the unique values in product category 2 are integers. But the data type shown in info is float so we can change it by converting the numbers in float to integers.
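The float dtype is a side effect of the missing values: `NaN` is a float, so a column that contains gaps cannot be stored as a plain integer column. A small sketch on made-up codes (the series `codes` and the fill value 8 are arbitrary assumptions, used only to show the dtypes):

```python
import numpy as np
import pandas as pd

# Hypothetical category codes with one gap
codes = pd.Series([2, 8, np.nan, 14])
print(codes.dtype)  # float64: NaN cannot live in a plain int column

# Once the gaps are filled, the cast to a plain integer dtype is safe
as_int = codes.fillna(8).astype("int64")

# Alternative: pandas' nullable integer dtype keeps the gap as <NA>
nullable = codes.astype("Int64")
print(as_int.dtype, nullable.dtype)
```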

In [None]:
combined.Product_Category_2 = combined.Product_Category_2.astype('int64')

In [None]:
# features with datatype object

cat_cols = combined.select_dtypes(['object']).columns
cat_cols

3. Stay_In_Current_City_Years

For Stay in current city years we need to convert the object datatype to int.
It contains a category which has '4+' that needs to be altered.
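One possible way to do this conversion is to strip the literal `+` and cast the result, sketched here on a hypothetical sample of the column's values (the series `stay` is an assumption for illustration). After the conversion, 4 stands for "4 or more years".

```python
import pandas as pd

# Hypothetical sample of the column's string values
stay = pd.Series(["0", "1", "2", "3", "4+"])

# Remove the literal '+' (regex=False, since '+' is a regex metacharacter) and cast
stay_int = stay.str.replace("+", "", regex=False).astype(int)
print(stay_int.tolist())
```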

In [None]:
# 4+ to 4
combined['Stay_In_Current_City_Years'] =np.where(combined['Stay_In_Current_City_Years'].str[:2]=="4+",4,combined['Stay_In_Current_City_Years'])

#object to int
combined['Stay_In_Current_City_Years'] = pd.to_numeric(combined['Stay_In_Current_City_Years'],errors='coerce')

4. Gender: 

    Gender 'F' (female) is represented by the value 0.

    Gender 'M' (male) is represented by the value 1.

In [None]:
combined['Gender'] = combined['Gender'].map({'F':0, 'M':1}).astype(int)

5. Age

In [None]:
# Modify age column

combined['Age'] = combined['Age'].map({'0-17': 9,
                               '18-25': 22,
                               '26-35': 31,
                               '36-45': 42,
                               '46-50': 48,
                               '51-55': 53,
                               '55+': 60})
combined['Age'].value_counts()

6. City_Category : dummy variables for this feature
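With `drop_first=True`, the first level is dropped and acts as the baseline, so three city categories become two indicator columns. A small sketch on hypothetical rows (the `cities` frame is an assumption for illustration):

```python
import pandas as pd

# Hypothetical rows covering the three city categories
cities = pd.DataFrame({"City_Category": ["A", "B", "C", "B"]})

# drop_first=True drops the first level ('A'), which becomes the baseline
dummies = pd.get_dummies(cities, columns=["City_Category"], drop_first=True)
print(dummies.columns.tolist())
print(dummies)
```

A row from category A is then encoded as "neither B nor C", which avoids a redundant, perfectly collinear column in linear models.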

In [None]:
combined = pd.get_dummies(combined,columns=['City_Category'],drop_first = True)

In [None]:
combined.head()

In [None]:
combined.info()

In [None]:
#splitting the data back into train and test as it was already provided

# .drop returns a new frame, which avoids modifying a slice of `combined` in place
sales = combined[combined['data']=='train'].drop('data',axis=1)
test_input = combined[combined['data']=='test'].drop(['Purchase','data'],axis=1)

del combined

In [None]:
#Heatmap to show the correlation between various variables of the train data set

plt.figure(figsize=(12, 5))
cor = sales.corr()
ax = sns.heatmap(cor,annot=True)
bottom, top = ax.get_ylim()
ax.set_ylim(bottom + 0.5, top - 0.5)
plt.show()

The variables which show a significant correlation in the data are:

1. Marital_Status and Age
2. Product_Category_1 and Purchase
3. City_Category_B and City_Category_C (City_Category_A was dropped by `drop_first=True`)
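Rather than reading the strongest relationships off the heatmap by eye, the pairs can also be ranked by absolute correlation. A sketch on synthetic data (the `toy` frame and its columns `a`, `b`, `c` are assumptions for illustration; in the notebook the same steps would apply to the correlation matrix of the training data):

```python
import numpy as np
import pandas as pd

# Hypothetical numeric frame: 'b' is built to track 'a', 'c' is independent noise
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "b": a + 0.1 * rng.normal(size=200),
    "c": rng.normal(size=200),
})

corr = toy.corr()

# Keep only the upper triangle so each pair appears once and the diagonal is dropped
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# One row per pair, strongest absolute correlation first
pairs = upper.stack().dropna().sort_values(key=np.abs, ascending=False)
print(pairs)
```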

### Model building

In [None]:
#splitting the data into X and y
X = sales.drop('Purchase',axis=1)
y = sales['Purchase']

#train test split for model building
from sklearn.model_selection import train_test_split

X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.30,random_state=0)

LinearRegression :
    
LinearRegression fits a linear model with coefficients w = (w1, …, wp) to minimize the residual sum of squares between the observed targets in the dataset, and the targets predicted by the linear approximation.
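To make "minimize the residual sum of squares" concrete, here is a minimal sketch that solves the same least-squares problem directly with NumPy on synthetic data. The data-generating rule and the names `X1` and `w` are assumptions for illustration; this is not how scikit-learn is used in the notebook, only the underlying idea.

```python
import numpy as np

# Synthetic data from a known linear rule: y = 3*x1 - 2*x2 + 5 + small noise
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
y = 3 * X[:, 0] - 2 * X[:, 1] + 5 + 0.01 * rng.normal(size=500)

# Append a column of ones so the intercept is estimated along with the weights
X1 = np.column_stack([X, np.ones(len(X))])

# Least-squares solution: the w that minimises ||X1 @ w - y||^2
w, *_ = np.linalg.lstsq(X1, y, rcond=None)
print("coefficients:", w[:2], "intercept:", w[2])
```

Because the noise is small, the recovered coefficients land very close to the values used to generate the data.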



In [None]:
#Linear regression

from sklearn.linear_model import LinearRegression

lr = LinearRegression()
lr.fit(X_train,y_train) # training the algorithm

# Getting the coefficients and intercept

print('coefficients:\n', lr.coef_)
print('\n intercept:', lr.intercept_)

In [None]:
#Predicting on the test data

y_pred = lr.predict(X_test)

from sklearn import metrics

print('r2_score:', metrics.r2_score(y_test,y_pred)) 
print('rmse:', np.sqrt(metrics.mean_squared_error(y_test,y_pred)))

The score from the Linear Regression model was very low, so we next try a regularized linear model, i.e. Ridge Regression.

Ridge Regression: This model solves a regression problem where the loss function is the linear least squares function and regularization is given by the l2-norm.
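The effect of the l2 penalty can be seen in the closed-form solution of the objective ||y - Xw||² + alpha·||w||². The sketch below is a hand-rolled illustration on synthetic, centred data with two nearly collinear features; the helper `ridge_fit` is hypothetical and ignores the intercept by assuming centred inputs.

```python
import numpy as np

def ridge_fit(X, y, alpha):
    """Closed-form ridge weights for centred data: (X^T X + alpha * I)^-1 X^T y."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

# Synthetic, centred data with two nearly collinear features
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
X = np.column_stack([x1, x1 + 0.01 * rng.normal(size=200)])
X = X - X.mean(axis=0)
y = X[:, 0] + X[:, 1] + 0.1 * rng.normal(size=200)
y = y - y.mean()

w_ols = ridge_fit(X, y, alpha=0.0)    # alpha = 0 is ordinary least squares
w_ridge = ridge_fit(X, y, alpha=1.0)  # the l2 penalty shrinks the weights
print("OLS weights:  ", w_ols, "norm:", np.linalg.norm(w_ols))
print("ridge weights:", w_ridge, "norm:", np.linalg.norm(w_ridge))
```

The ridge weights have a smaller norm than the unpenalized ones, which is what makes the fit more stable when features are strongly correlated.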

In [None]:
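# Ridge Regression

from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Ridge's `normalize` argument was removed in scikit-learn 1.2, so the features
# are scaled in a pipeline instead. The penalty is not on the same scale as with
# normalize=True, so alpha may need retuning.
RR = make_pipeline(StandardScaler(), Ridge(alpha=0.05))
RR.fit(X_train, y_train)

y_pred = RR.predict(X_test)

print('rmse:', np.sqrt(metrics.mean_squared_error(y_test,y_pred)))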

The linear regression models did not give much improvement, so we next try non-linear regression models.

Decision Tree: 

In [None]:
# Decision Tree Model

from sklearn.tree import DecisionTreeRegressor
DT = DecisionTreeRegressor(max_depth=15, min_samples_leaf=100)

DT.fit(X_train, y_train)

y_pred = DT.predict(X_test)

print('rmse:', np.sqrt(metrics.mean_squared_error(y_test,y_pred)))

In [None]:
#Decision Tree 2

DT2 = DecisionTreeRegressor(max_depth=8, min_samples_leaf=150)

DT2.fit(X_train, y_train)

y_pred = DT2.predict(X_test)

print('rmse:', np.sqrt(metrics.mean_squared_error(y_test,y_pred)))

Random Forest Regressor: 

Random Forest is an ensemble machine learning algorithm that follows the bagging technique. The base estimators in a random forest are decision trees. At each node of a tree, it randomly selects a subset of features from which the best split is chosen.
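Bagging means each tree is trained on a bootstrap sample, i.e. rows drawn with replacement. A consequence is that roughly 1/e (about 37%) of the rows are left out of any given sample; these "out-of-bag" rows are what an option such as `oob_score=True` evaluates on. A minimal NumPy sketch of a single bootstrap draw (the row count is an arbitrary assumption):

```python
import numpy as np

rng = np.random.default_rng(0)
n_rows = 10_000  # arbitrary size for the illustration

# One bootstrap sample: draw n_rows row indices with replacement
sample_idx = rng.integers(0, n_rows, size=n_rows)

# Rows never drawn are "out of bag" for the tree trained on this sample
in_bag = np.zeros(n_rows, dtype=bool)
in_bag[sample_idx] = True
oob_fraction = 1 - in_bag.mean()

print(f"out-of-bag fraction: {oob_fraction:.3f} (theory: about {np.exp(-1):.3f})")
```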

In [None]:
#Fitting the model
from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor(random_state = 3,max_depth=10,n_estimators=25)

rf.fit(X_train,y_train)

y_pred = rf.predict(X_test)

print('r2_score:', metrics.r2_score(y_test,y_pred)) 
print('rmse:', np.sqrt(metrics.mean_squared_error(y_test,y_pred)))

In [None]:
# another random forest

from sklearn.ensemble import RandomForestRegressor

rf3 = RandomForestRegressor(random_state=3,max_depth=10,min_samples_split=500,oob_score=True)


rf3.fit(X_train,y_train)

y_pred = rf3.predict(X_test)

print('r2_score:', metrics.r2_score(y_test,y_pred)) 
print('rmse:', np.sqrt(metrics.mean_squared_error(y_test,y_pred)))

In [None]:
# random forest 4

rf4 = RandomForestRegressor(n_estimators=30,random_state=3,max_depth=15,min_samples_split=100,oob_score=True)


rf4.fit(X_train,y_train)

y_pred = rf4.predict(X_test)

print('r2_score:', metrics.r2_score(y_test,y_pred)) 
print('rmse:', np.sqrt(metrics.mean_squared_error(y_test,y_pred)))

ExtraTreesRegressor : 

The main difference between random forests and extra trees (short for extremely randomized trees) lies in how each split is chosen: a random forest computes the locally optimal split for each feature under consideration, whereas extra trees select a random split value for each of those features.
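The difference can be sketched for a single feature: scan all thresholds and keep the best one, versus draw one threshold at random. This is a simplified, hand-rolled illustration on synthetic data, not scikit-learn's implementation; the helper `split_sse` and the step at 0.6 are assumptions.

```python
import numpy as np

def split_sse(x, y, threshold):
    """Sum of squared errors when each side of the split predicts its own mean."""
    left, right = y[x <= threshold], y[x > threshold]
    return sum(((side - side.mean()) ** 2).sum() for side in (left, right) if len(side))

# Synthetic one-feature data with a step at x = 0.6
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=300)
y = np.where(x > 0.6, 10.0, 2.0) + rng.normal(scale=0.5, size=300)

# Random-forest style: scan the candidate thresholds and keep the best one
candidates = np.sort(x)[:-1]
best_t = min(candidates, key=lambda t: split_sse(x, y, t))

# Extra-trees style: draw one threshold uniformly between the feature's min and max
random_t = rng.uniform(x.min(), x.max())

print(f"best split   t={best_t:.3f}  sse={split_sse(x, y, best_t):.1f}")
print(f"random split t={random_t:.3f}  sse={split_sse(x, y, random_t):.1f}")
```

A single random split is usually worse than the optimal one, but it is much cheaper to compute, and averaging many such trees reduces variance.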

In [None]:
#Fitting the model
from sklearn.ensemble import ExtraTreesRegressor

# named `et` so the random forest fitted earlier as `rf` is not overwritten
et = ExtraTreesRegressor()

et.fit(X_train,y_train)

y_pred = et.predict(X_test)

print('r2_score:', metrics.r2_score(y_test,y_pred)) 
print('rmse:', np.sqrt(metrics.mean_squared_error(y_test,y_pred)))

XGBRegressor:
    
XGBoost (eXtreme Gradient Boosting) is an advanced implementation of the gradient boosting algorithm. It has proved to be a highly effective ML algorithm and is used extensively in machine learning competitions and hackathons. XGBoost has high predictive power and is often cited as being almost 10 times faster than other gradient boosting techniques. It also includes several forms of regularization, which reduce overfitting and improve overall performance. Hence it is also known as a ‘regularized boosting‘ technique.
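The core idea of gradient boosting, which XGBoost builds on, is to add small trees one at a time, each fitted to the residuals of the current model. The sketch below is plain gradient boosting with one-split trees (stumps) and squared loss on synthetic data. It deliberately leaves out XGBoost's regularized objective and other refinements, and the helper `fit_stump` is hypothetical.

```python
import numpy as np

def fit_stump(x, residual):
    """Best single-threshold split of x for predicting `residual` by side means."""
    best = None
    for t in np.unique(x)[:-1]:
        left, right = residual[x <= t], residual[x > t]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, t, left.mean(), right.mean())
    _, t, left_value, right_value = best
    return lambda q: np.where(q <= t, left_value, right_value)

# Synthetic one-feature regression problem
rng = np.random.default_rng(0)
x = rng.uniform(0, 6, size=200)
y = np.sin(x) + 0.1 * rng.normal(size=200)

learning_rate = 0.3
prediction = np.full_like(y, y.mean())  # start from a constant model
mse_history = [np.mean((y - prediction) ** 2)]

for _ in range(50):
    residual = y - prediction           # negative gradient of the squared loss
    stump = fit_stump(x, residual)
    prediction += learning_rate * stump(x)
    mse_history.append(np.mean((y - prediction) ** 2))

print(f"MSE: start {mse_history[0]:.4f} -> after 50 rounds {mse_history[-1]:.4f}")
```

The roles of `n_estimators` and `learning_rate` in the cells below correspond to the number of rounds and the step size in this loop.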

In [None]:
#XGBoost Model1
from xgboost import XGBRegressor


xgb1 = XGBRegressor(n_estimators=1000, learning_rate=0.05)

xgb1.fit(X_train,y_train)

y_pred = xgb1.predict(X_test)

print('r2_score:', metrics.r2_score(y_test,y_pred)) 
print('rmse:', np.sqrt(metrics.mean_squared_error(y_test,y_pred)))

In [None]:
## XGBoost2
from xgboost import XGBRegressor

xgb2 = XGBRegressor(n_estimators=500,max_depth=10,learning_rate=0.05)

xgb2.fit(X_train,y_train)

y_pred = xgb2.predict(X_test)

print('r2_score:', metrics.r2_score(y_test,y_pred)) 
print('rmse:', np.sqrt(metrics.mean_squared_error(y_test,y_pred)))

In [None]:
## XGBoost3

xgb3 = XGBRegressor(n_estimators=6,max_depth=500)

xgb3.fit(X_train,y_train)

y_pred = xgb3.predict(X_test)

print('r2_score:', metrics.r2_score(y_test,y_pred)) 
print('rmse:', np.sqrt(metrics.mean_squared_error(y_test,y_pred)))

In [None]:
#XGBoost4

xgb4 = XGBRegressor(learning_rate=1.0, max_depth=6, min_child_weight=40, random_state=0)

xgb4.fit(X_train,y_train)

y_pred = xgb4.predict(X_test)

print('r2_score:', metrics.r2_score(y_test,y_pred)) 
print('rmse:', np.sqrt(metrics.mean_squared_error(y_test,y_pred)))

In [None]:
#XGBoost5
from xgboost import XGBRegressor

xgb5 = XGBRegressor(n_estimators=450,max_depth=8,learning_rate=0.076)

xgb5.fit(X_train,y_train)

y_pred = xgb5.predict(X_test)

print('r2_score:', metrics.r2_score(y_test,y_pred)) 
print('rmse:', np.sqrt(metrics.mean_squared_error(y_test,y_pred)))

In [None]:
#XGBoost6
from xgboost import XGBRegressor

xgb6 = XGBRegressor(n_estimators=470,max_depth=9,learning_rate=0.06)

xgb6.fit(X_train,y_train)

y_pred = xgb6.predict(X_test)

print('r2_score:', metrics.r2_score(y_test,y_pred)) 
print('rmse:', np.sqrt(metrics.mean_squared_error(y_test,y_pred)))

Light GBM:

LightGBM is a gradient boosting framework that uses tree-based algorithms. It grows trees leaf-wise, whereas many other algorithms grow them level-wise. It is designed for very large datasets, where it typically takes less time to run than comparable algorithms.

In [None]:
from lightgbm import LGBMRegressor

lgbm1 = LGBMRegressor(n_estimators=500,max_depth=10,learning_rate=0.05)

lgbm1.fit(X_train,y_train)

y_pred = lgbm1.predict(X_test)

print('r2_score:', metrics.r2_score(y_test,y_pred)) 
print('rmse:', np.sqrt(metrics.mean_squared_error(y_test,y_pred)))

### Conclusion

Comparing all the models, we conclude that XGBRegressor is the best model for predicting the purchase amount from our dataset.

Parameters and score: 

XGBRegressor(n_estimators=500,max_depth=10,learning_rate=0.05)

r2_score: 0.7492767237638949

rmse: 2518.284905633662
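For reference, the two scores quoted above can be computed directly from their definitions. The sketch below uses hand-written helpers (`rmse` and `r2` are hypothetical names) and made-up numbers chosen so the arithmetic is easy to follow; it is meant to show what the metrics measure, not to replace `sklearn.metrics`.

```python
import numpy as np

def rmse(y_true, y_hat):
    """Root mean squared error: typical size of a prediction error, in target units."""
    y_true, y_hat = np.asarray(y_true, dtype=float), np.asarray(y_hat, dtype=float)
    return np.sqrt(np.mean((y_true - y_hat) ** 2))

def r2(y_true, y_hat):
    """1 - SS_res / SS_tot: share of the target's variance explained by the predictions."""
    y_true, y_hat = np.asarray(y_true, dtype=float), np.asarray(y_hat, dtype=float)
    ss_res = np.sum((y_true - y_hat) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1 - ss_res / ss_tot

# Made-up values: errors are 1, 0, -1, 0, so SS_res = 2 and SS_tot = 20
y_true = [3, 5, 7, 9]
y_hat = [2, 5, 8, 9]
print("rmse:", rmse(y_true, y_hat))
print("r2_score:", r2(y_true, y_hat))
```

Read this way, an RMSE of about 2518 means predictions are typically off by roughly that many purchase units, and an r2 of about 0.75 means about three quarters of the variance in Purchase is explained.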


In [None]:
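# Recompute predictions with the selected model (xgb2); y_pred otherwise still holds the LightGBM output
y_pred = xgb2.predict(X_test)
df = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred})
df1 = df.head(25)
df1.head()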

In [None]:
# Feature Importance

imp = pd.DataFrame(xgb2.feature_importances_,index=X.columns,columns=['importance'])
imp.sort_values(by='importance',ascending=False)

In [None]:
df1.plot(kind='bar',figsize=(10,8))
plt.grid(which='major', linestyle='-', linewidth='0.5', color='green')
plt.grid(which='minor', linestyle=':', linewidth='0.5', color='black')
plt.show()

In [None]:
plt.scatter(df1.Predicted,df1.Actual)
plt.plot(y_pred,y_pred,'r')
plt.xlabel('y predicted')
plt.ylabel('y actual')
plt.show()