![](https://thumbs.gfycat.com/GleamingWarmheartedAracari-small.gif)

**Introduction**

Black Friday is an informal name for the Friday following Thanksgiving Day in the United States, which is celebrated on the fourth Thursday of November. The day after Thanksgiving has been regarded as the beginning of the United States Christmas shopping season since 1952. The term "Black Friday" did not become widely used until more recent decades, during which time global retailers have adopted the term and date to market their own holiday sales.

Many stores offer highly promoted sales on Black Friday and open very early, for example at midnight, or even start their sales at some point on Thanksgiving itself. Black Friday is not an official holiday, but California and some other states observe "The Day After Thanksgiving" as a holiday for state government employees. It is sometimes observed in lieu of another federal holiday, such as Columbus Day. Many non-retail employees and schools have both Thanksgiving and the following Friday off. Together with the regular weekend that follows, this makes Black Friday weekend a four-day weekend. The additional days off are said to increase the number of potential shoppers.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
train=pd.read_csv("../input/black-friday-sales-prediction/train_oSwQCTC (1)/train.csv")

In [None]:
train.head()

* User_ID: Unique identifier of shopper.
* Product_ID: Unique identifier of product. (No key given)
* Gender: Sex of shopper.
* Age: Age of shopper split into bins.
* Occupation: Occupation of shopper. (No key given)
* City_Category: Residence location of shopper. (No key given)
* Stay_In_Current_City_Years: Number of years the shopper has stayed in the current city.
* Marital_Status: Marital status of shopper.
* Product_Category_1: Product category of purchase.
* Product_Category_2: Second category the product may also belong to.
* Product_Category_3: Third category the product may also belong to.
* Purchase: Purchase amount in dollars.
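
The descriptions above suggest that `Product_Category_2` and `Product_Category_3` are optional, so those columns may well contain missing values. A minimal sketch of how one might check this before modelling, using a small made-up frame: the `toy` rows and the `missing_summary` helper are illustrative assumptions, not part of the dataset or of this notebook.

```python
import numpy as np
import pandas as pd

# Toy rows that mimic a few of the columns described above (all values are made up)
toy = pd.DataFrame({
    "User_ID": [1, 1, 2, 3],
    "Product_ID": ["P001", "P002", "P003", "P004"],
    "Product_Category_1": [3, 1, 12, 8],
    "Product_Category_2": [np.nan, 6.0, np.nan, 14.0],
    "Product_Category_3": [np.nan, 14.0, np.nan, np.nan],
    "Purchase": [8370, 15200, 1422, 7969],
})

def missing_summary(df):
    """Return the count and percentage of missing values per column (hypothetical helper)."""
    counts = df.isna().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (100 * counts / len(df)).round(2),
    })
    # Columns with the most missing values come first
    return summary.sort_values("missing", ascending=False)

print(missing_summary(toy))
```

On the real data the same helper could be called as `missing_summary(train)` once `train` has been loaded.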

In [None]:
test=pd.read_csv("../input/black-friday-sales-prediction/test_HujdGe7 (1)/test.csv")

In [None]:
test.head()

In [None]:
train.shape

In [None]:
test.shape

In [None]:
train.columns

In [None]:
m=train['Gender'].value_counts()

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
age=train['Age'].value_counts()

In [None]:
import plotly.express as px

In [None]:
fig=px.bar(age,y=age,x=age.index,color=age.index,template='ggplot2')
fig.update_layout(
    xaxis_title= 'Age',
    yaxis_title="Count",
    legend_title='Age',
    font_family="Courier New",
    font_color="blue",
    title_font_family="Times New Roman",
    title_font_color="red",
    legend_title_font_color="green"
)
fig.show()

In [None]:

labels = ['Male', 'Female']
colors = ['Green', 'Orange']
explode = [0, 0.1]

plt.pie(m, colors = colors, labels = labels, shadow = True, explode = explode, autopct = '%.2f%%')
plt.title('A Pie Chart representing the gender gap', fontsize = 20)
plt.legend()
plt.show()

In [None]:
train[['Product_Category_1','Product_Category_2','Product_Category_3']].groupby(train['Gender']).mean()

In [None]:
train[['Product_Category_1','Product_Category_2','Product_Category_3']].groupby(train['User_ID']).count()

In [None]:
train.groupby('User_ID').Product_ID.count()

Top Sellers

In [None]:
train['Product_ID'].value_counts()[:5]
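
`value_counts()` ranks products by the number of rows in which they appear. Assuming each row is one transaction and `Purchase` is the amount paid, a product could also be ranked by the total amount it brought in. The sketch below illustrates the difference on made-up rows; the `toy` frame and the `units` and `revenue` column names are illustrative assumptions.

```python
import pandas as pd

# Made-up transactions: P1 sells most often, P2 brings in the most money
toy = pd.DataFrame({
    "Product_ID": ["P1", "P1", "P1", "P2", "P2", "P3"],
    "Purchase": [100, 120, 80, 500, 450, 50],
})

# Count transactions and sum the purchase amount per product,
# then sort by the summed amount instead of the count
top = (
    toy.groupby("Product_ID")["Purchase"]
       .agg(units="count", revenue="sum")
       .sort_values("revenue", ascending=False)
)
print(top.head(5))
```

Sorting by `units` instead reproduces the ranking that `value_counts()` gives.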

In [None]:
train.info()

In [None]:
fig,ax=plt.subplots(figsize=(8,8))
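# Correlation is only defined for numeric columns, so select them explicitly
sns.heatmap(train.select_dtypes(include='number').corr(),annot=True)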
plt.show()

In [None]:
sns.pairplot(train)
plt.show()

In [None]:
train.describe()

In [None]:
from scipy import stats
from scipy.stats import norm

# fitting the target variable to the normal curve
mu, sigma = norm.fit(train['Purchase'])
print("The mu {} and Sigma {} for the curve".format(mu, sigma))

# plotting the distribution of the target variable with the fitted normal curve on top
# (sns.distplot is deprecated, so histplot plus an explicit pdf is used instead)
plt.rcParams['figure.figsize'] = (7, 7)
sns.histplot(train['Purchase'], color = 'pink', stat = 'density')
xs = np.linspace(train['Purchase'].min(), train['Purchase'].max(), 200)
plt.plot(xs, norm.pdf(xs, mu, sigma), color = 'black',
         label = r'Normal Distribution ($\mu$: {:.2f}, $\sigma$: {:.2f})'.format(mu, sigma))
plt.title('A distribution plot to represent the distribution of Purchase')
plt.legend(loc = 'best')
plt.show()

# plotting the QQplot
stats.probplot(train['Purchase'], plot = plt)
plt.show()

In [None]:
# Pass the data by keyword; recent seaborn versions no longer accept it positionally
sns.boxplot(x=train['Purchase'])

* Who purchases more: male (married/unmarried) or female (married/unmarried) shoppers?
* Marital status by gender

In [None]:
fig,ax = plt.subplots(figsize=(20,4),ncols=2,nrows=1)
sns.barplot(x="Gender",y="Purchase",hue="Marital_Status",estimator=np.mean,data=train,ax=ax[0])
sns.countplot(x="Gender",hue="Marital_Status",data=train,ax=ax[1])
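
The two plots can also be read off as a table, which makes small differences easier to compare. A minimal sketch with `pivot_table` on made-up rows; the `toy` values are assumptions, and since no key is given for `Marital_Status`, the codes 0 and 1 are left uninterpreted.

```python
import pandas as pd

# Made-up rows with the three columns used in the plots above
toy = pd.DataFrame({
    "Gender": ["M", "M", "M", "F", "F", "F"],
    "Marital_Status": [0, 1, 0, 0, 1, 1],
    "Purchase": [9000, 9500, 10000, 8000, 8500, 9500],
})

# Mean purchase (left plot) and number of rows (right plot) per group
table = toy.pivot_table(
    index="Gender",
    columns="Marital_Status",
    values="Purchase",
    aggfunc=["mean", "count"],
)
print(table)
```

Replacing `toy` with `train` would give the corresponding numbers for the real data.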

How do the city and the number of years spent in that city affect the purchase amount?

In [None]:
fig,ax = plt.subplots(figsize=(20,4),ncols=2,nrows=1)
sns.barplot(x="Stay_In_Current_City_Years",y="Purchase",hue="City_Category",order=["0","1","2","3","4+"],estimator=np.mean,data=train,ax=ax[0])
sns.countplot(x="Stay_In_Current_City_Years",hue="City_Category",order=["0","1","2","3","4+"],data=train,ax=ax[1])

Which occupations purchase more?

In [None]:
fig,ax = plt.subplots(figsize=(20,4),ncols=2,nrows=1)
sns.violinplot(x="City_Category",y="Occupation",data=train,ax=ax[0])
sns.lineplot(x="Occupation",y="Purchase",data=train,ax=ax[1])

Which age-group has potential buyers?

In [None]:
fig,ax = plt.subplots(figsize=(20,4),ncols=2,nrows=1)
sns.violinplot(x="Age",y="Purchase",order=["0-17","18-25","26-35","36-45","46-50","51-55","55+"],data=train,ax=ax[0])
sns.violinplot(x="Age",y="Occupation",order=["0-17","18-25","26-35","36-45","46-50","51-55","55+"],data=train,ax=ax[1])

In [None]:
test_copy=test.copy()

In [None]:
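# Assign the result back instead of using inplace=True on a column selection,
# which newer pandas versions warn about and may not apply to the DataFrame
train['Product_Category_2']=train['Product_Category_2'].fillna(train['Product_Category_2'].mean())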

In [None]:
# Assign the result back so the fill is actually stored in the DataFrame
test['Product_Category_2']=test['Product_Category_2'].fillna(test['Product_Category_2'].mean())

In [None]:
# Fill missing values with the most frequent category and assign the result back
train['Product_Category_3']=train['Product_Category_3'].fillna(train['Product_Category_3'].mode()[0])

In [None]:
# Fill missing values with the most frequent category and assign the result back
test['Product_Category_3']=test['Product_Category_3'].fillna(test['Product_Category_3'].mode()[0])

In [None]:
train['Product_Category_1']=np.sqrt(train['Product_Category_1'])

In [None]:
test['Product_Category_1']=np.sqrt(test['Product_Category_1'])

In [None]:
from sklearn.preprocessing import LabelEncoder

In [None]:
le=LabelEncoder()

In [None]:
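# Fit the encoder on the product IDs of both files so that the same product
# gets the same integer code in train and test
le.fit(pd.concat([train['Product_ID'],test['Product_ID']]))
train['Product_ID']=le.transform(train['Product_ID'])

In [None]:
# Collect the remaining text columns (Product_ID is already numeric at this point)
columns=[]
for col in train.columns:
    if train[col].dtypes=='object':
        columns.append(col)

In [None]:
train_encode=pd.get_dummies(train,columns=columns,dtype=np.uint8)

In [None]:
test_encode=pd.get_dummies(test,columns=columns,dtype=np.uint8)

In [None]:
# Reuse the fitted encoder; refitting on test alone would assign different codes
test_encode['Product_ID']=le.transform(test_encode['Product_ID'])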

In [None]:
y=train_encode['Purchase']

In [None]:
X=train_encode.drop(['Purchase'],axis=1)
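
Because `pd.get_dummies` is applied to the train and test frames separately, the two results only share the same columns if every category occurs in both files. A hedged sketch of one way to guard against a mismatch, using made-up frames (`toy_train`, `toy_test` and the category values are illustrative assumptions):

```python
import numpy as np
import pandas as pd

# Made-up frames: the test frame lacks the "0-17" age bin and cities "B" and "C"
toy_train = pd.DataFrame({"Age": ["0-17", "26-35", "55+"],
                          "City_Category": ["A", "B", "C"]})
toy_test = pd.DataFrame({"Age": ["26-35", "55+"],
                         "City_Category": ["A", "A"]})

train_dummies = pd.get_dummies(toy_train, dtype=np.uint8)
test_dummies = pd.get_dummies(toy_test, dtype=np.uint8)

# Reindex the test columns to the training layout: categories that are missing
# from the test frame become all-zero columns, and the column order matches
test_aligned = test_dummies.reindex(columns=train_dummies.columns, fill_value=0)

print(test_aligned)
```

In this notebook the equivalent call would be `test_encode.reindex(columns=X.columns, fill_value=0)`, which is a suggestion rather than something the original code does.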

In [None]:
from sklearn.preprocessing import StandardScaler
col=X.columns
ind=X.index
colum=test_encode.columns
index=test_encode.index
scl=StandardScaler()
X=scl.fit_transform(X)
X=pd.DataFrame(X,columns=col,index=ind)
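# Reuse the statistics learned from the training data; refitting on test
# would put the two sets on different scales
test_encode=scl.transform(test_encode)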
test_encode=pd.DataFrame(test_encode,columns=colum,index=index)

In [None]:
from sklearn.model_selection import GridSearchCV,RandomizedSearchCV
from xgboost import XGBRegressor
from sklearn.metrics import mean_squared_error
from math import sqrt
def metrics(true,pred):
    return sqrt(mean_squared_error(true,pred))

In [None]:
from sklearn.model_selection import KFold
kf = KFold(n_splits=5,random_state=1,shuffle=True)
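# Note: every iteration overwrites the previous split, so after the loop
# xtr/xvl/ytr/yvl hold only the last of the five folds
for train_index,test_index in kf.split(X,y):
    xtr,xvl = X.iloc[train_index],X.iloc[test_index]
    ytr,yvl = y.iloc[train_index],y.iloc[test_index]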

In [None]:
xgb=XGBRegressor(random_state=1,n_estimators= 61, max_depth=11)
xgb.fit(xtr,ytr)
pre=xgb.predict(xvl)
print(metrics(yvl,pre))
result=xgb.predict(test_encode)
test_copy['Purchase']=result
submission = test_copy[['Purchase','User_ID','Product_ID']]
submission.to_csv('sub_xgb.csv', index=False)
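
The `KFold` loop earlier keeps only its last split, so the RMSE printed above comes from a single fold. A minimal sketch of scoring every fold and averaging the result is shown below. It runs on synthetic data, and `LinearRegression` is only a lightweight stand-in for `XGBRegressor`; the names `X_demo`, `y_demo` and `fold_rmse` are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

# Synthetic regression data standing in for X and y (not the Black Friday data)
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=200)

kf = KFold(n_splits=5, shuffle=True, random_state=1)
fold_rmse = []
for train_idx, valid_idx in kf.split(X_demo):
    # Train a fresh model on each training split (stand-in for XGBRegressor)
    model = LinearRegression()
    model.fit(X_demo[train_idx], y_demo[train_idx])
    # Score it on the held-out split and store the fold's RMSE
    pred = model.predict(X_demo[valid_idx])
    fold_rmse.append(np.sqrt(mean_squared_error(y_demo[valid_idx], pred)))

print("RMSE per fold:", np.round(fold_rmse, 4))
print("Mean RMSE:", np.mean(fold_rmse))
```

Averaging over the folds gives a steadier estimate than one split, at the cost of training the model five times.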
