## Import Libraries

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

## Import Dataset

In [None]:
data = pd.read_csv('../input/black-friday/train.csv')
data.head()

The first rows show NaN values in Product_Category_2 and Product_Category_3. There are also two ID columns: User_ID and Product_ID.
The first four rows belong to the same user (User_ID 1000001), a female customer who bought four products.

In [None]:
print("Number of Rows: ", data.shape[0])

In [None]:
data.describe()

The maximum Purchase value is $23,961.

In [None]:
data.info()

As we can see, Product_Category_2 and Product_Category_3 have many missing values. Let's count how many values are missing in each column.

In [None]:
print("Missing Values in Each Column:")
print(data.isna().sum())

Product_Category_2 and Product_Category_3 have many missing values. We could fill them with the mean, median or mode, or forward fill them. Both columns are float64, so the mean, median and mode are all candidates; before choosing, we check the two columns for outliers.
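To make these options concrete, here is a minimal sketch on a made-up Series (the values are hypothetical, not taken from the dataset). It fills the same gaps with each strategy so the results can be compared side by side.

```python
import numpy as np
import pandas as pd

# Hypothetical category codes with two gaps (not real dataset values).
s = pd.Series([2.0, np.nan, 8.0, 8.0, np.nan, 16.0], name="Product_Category_2")

# Fill the same gaps with each candidate strategy.
filled = pd.DataFrame({
    "mean": s.fillna(s.mean()),
    "median": s.fillna(s.median()),
    "mode": s.fillna(s.mode().iloc[0]),
    "ffill": s.ffill(),
})
print(filled)

# The mean of the observed codes need not be a code that exists in the column.
mean_is_real_code = s.mean() in s.dropna().values
print("Mean is an existing category code:", mean_is_real_code)
```

In this toy example the mean falls between two category codes, which is one reason a mean fill can be a poor fit for a column of category IDs.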

In [None]:
fig, axes = plt.subplots(nrows=1, ncols=2)
sns.boxplot(x=data.Product_Category_2, ax=axes[0])  # pass the column as x explicitly
data.Product_Category_2.plot(kind='box', ax=axes[1])
plt.show()

The box plots show no outliers in Product_Category_2.

In [None]:
fig, axes = plt.subplots(nrows=1, ncols=2)
sns.boxplot(x=data.Product_Category_3, ax=axes[0])  # pass the column as x explicitly
data.Product_Category_3.plot(kind='box', ax=axes[1])
plt.show()

The box plots show no outliers in Product_Category_3.
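The visual check can also be done numerically. The sketch below applies the 1.5 × IQR rule, which is the default whisker rule for box plots in matplotlib and seaborn, to a made-up Series. The helper name `iqr_outliers` is hypothetical and is not part of the notebook.

```python
import pandas as pd

def iqr_outliers(s: pd.Series) -> pd.Series:
    """Return the values lying outside 1.5 * IQR of the quartiles."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    # NaN compares as False on both sides, so missing values are never flagged.
    return s[(s < lower) | (s > upper)]

# Hypothetical values with one obvious outlier.
toy = pd.Series([2, 4, 5, 6, 8, 8, 9, 14, 40])
outliers = iqr_outliers(toy)
print(outliers)
```

An empty result for a column would agree with a box plot that shows no points beyond the whiskers.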

So we will replace the NaN values with forward fill (ffill) followed by backward fill (bfill). Filling with the mean would insert an average that may not be a real category code, and filling with the mode would put the most frequent product category into every gap; both would distort the data.

In [None]:
# fillna(method='ffill') is deprecated in recent pandas versions; ffill() does the same thing
data.ffill(inplace=True)
data.isna().sum()

One missing value remains in each of the two columns: the first row was empty too, so ffill had no earlier value to copy. Now we will use bfill to fill the remaining values.

In [None]:
# fillna(method='bfill') is deprecated in recent pandas versions; bfill() does the same thing
data.bfill(inplace=True)
data.isna().sum()
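The two-step fill can be seen on a small made-up Series (hypothetical values). Forward fill copies the previous value down, so a gap in the very first row has nothing to copy from; backward fill then closes it with the next value.

```python
import numpy as np
import pandas as pd

# Hypothetical column that starts with a missing value.
s = pd.Series([np.nan, 6.0, np.nan, 14.0, np.nan])

after_ffill = s.ffill()            # the leading NaN has no earlier value to copy
after_both = after_ffill.bfill()   # the leading NaN takes the next value

print("Missing after ffill:", after_ffill.isna().sum())
print("Missing after ffill + bfill:", after_both.isna().sum())
print(after_both.tolist())
```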

In [None]:
print("Now let's see how our data looks:")
data.head(15)

Now let's look at the data info again.

In [None]:
data.info()

Let's count the unique values in each column.

In [None]:
data.nunique()

Here we can see that Gender, City_Category, Marital_Status and Stay_In_Current_City_Years have only a few unique values, so they can be treated as categories.

In [None]:
# But first, let's confirm that no missing values are left in the data
assert pd.notnull(data).all().all()

Great! There are no missing values left in the dataset.
Now let's look at the unique values in every column.

In [None]:
col = list(data.columns)
print("Unique Values in each column:\n")
for c in col:
    print(c, ": ", data[c].unique())
    print()

So we will encode Gender with a mapping and City_Category with LabelEncoder.

In [None]:
data['Gender'] = data.Gender.map({
    'M' : 0,
    'F' : 1
})
data.head()

Here:
* M (Male) is 0
* F (Female) is 1
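One thing to keep in mind with a dictionary mapping is that any value missing from the dictionary becomes NaN. The sketch below shows this on made-up Series; the value 'X' is hypothetical and only illustrates the behaviour.

```python
import pandas as pd

mapping = {"M": 0, "F": 1}

# All values are covered by the mapping, so the result has no gaps.
gender = pd.Series(["M", "F", "F", "M"])
encoded = gender.map(mapping)
print(encoded.tolist())

# A value that is not a key in the mapping turns into NaN.
with_unknown = pd.Series(["M", "X"]).map(mapping)
print(with_unknown.isna().tolist())
```

Checking `encoded.isna().sum()` after a mapping is a cheap way to confirm that every value was covered.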

In [None]:
data.Gender.unique()

Now we will label-encode the City_Category column.

In [None]:
from sklearn.preprocessing import LabelEncoder
lE = LabelEncoder()
data.City_Category = lE.fit_transform(data.City_Category)
data.head()

Here:
* A is 0
* B is 1
* C is 2
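The mapping listed above does not have to be read off by eye. A fitted LabelEncoder stores its labels in sorted order in the `classes_` attribute, and `inverse_transform` converts codes back to labels. This is a small sketch on a made-up Series, not the notebook's actual column.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical city categories.
city = pd.Series(["A", "C", "B", "A"])

encoder = LabelEncoder()
codes = encoder.fit_transform(city)

# classes_ holds the labels in sorted order; the position of each label is its code.
label_to_code = {label: code for code, label in enumerate(encoder.classes_)}
print(label_to_code)

# inverse_transform recovers the original labels from the codes.
restored = encoder.inverse_transform(codes)
print(list(restored))
```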

In [None]:
data.City_Category.unique()

In [None]:
data.info()

As we can see, Gender and City_Category are now int64 columns.

The Stay_In_Current_City_Years column has 5 unique values: ['2' '4+' '3' '1' '0'].
We will replace '4+' with '4' and then change the data type of the column to integer.

In [None]:
data.loc[data['Stay_In_Current_City_Years'] == '4+','Stay_In_Current_City_Years'] = '4'
data.Stay_In_Current_City_Years = data.Stay_In_Current_City_Years.astype('int64')
data.info()
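The same conversion can also be written with a string method. This is an alternative sketch on a made-up Series, not a change to the cell above: it strips the literal '+' and then casts to integer.

```python
import pandas as pd

# The five values reported for Stay_In_Current_City_Years.
stay = pd.Series(["2", "4+", "3", "1", "0"])

# regex=False treats '+' as a plain character rather than a regex quantifier.
stay_int = stay.str.replace("+", "", regex=False).astype("int64")
print(stay_int.tolist())
print(stay_int.dtype)
```

Note that after this step the value 4 means "4 or more years", so the column is ordinal rather than an exact count.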

All good so far, but we still have the Age column to explore.

In [None]:
print("Unique Values in Age Column:")
data.Age.unique()

Age is stored as ranges (for example 26-35), so we cannot convert it to a single integer and will leave it unchanged here.
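If a numeric form of Age is needed later, one possible approach is an ordered categorical, which keeps the ranges as labels while giving them a rank. The sketch below is not part of the original analysis, and the list of age bins in `age_order` is an assumption; it should be checked against the output of `data.Age.unique()`.

```python
import pandas as pd

# Assumed age bins, listed from youngest to oldest (verify against data.Age.unique()).
age_order = ["0-17", "18-25", "26-35", "36-45", "46-50", "51-55", "55+"]

# Hypothetical sample of the Age column.
age = pd.Series(["26-35", "0-17", "55+", "26-35"])

# An ordered categorical keeps the labels and knows their order.
age_cat = pd.Categorical(age, categories=age_order, ordered=True)

# The integer codes give each range its rank within age_order.
age_codes = pd.Series(age_cat.codes)
print(age_codes.tolist())
```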

## Plotting

#### Exploring Age Column

In [None]:
sns.countplot(x=data.Age)
plt.show()

Customers aged 26-35 place the most orders on Black Friday.

#### Explore Gender Column

In [None]:
ax = sns.countplot(x=data.Gender)  # pass the column as x explicitly
gen = ['M', 'F']
ax.set(xticklabels=gen)
plt.show()

The plot shows that male customers placed the most orders on this Black Friday.

#### Explore Marital Status Column

In [None]:
ax = sns.countplot(x=data.Marital_Status)  # pass the column as x explicitly
mar = ['Married', 'Single']
ax.set(xlabel='Marital Status', xticklabels=mar)
plt.show()

As we can see, with the labels used here (0 = Married, 1 = Single), married users make the most purchases.

#### Plotting Gender and Marital Status with respect to Purchase

In [None]:
print('Marital Status: 0=Married, 1=Single')
print('Gender: 0=Male, 1=Female')
# count() gives the number of purchases in each (Marital_Status, Gender) group
ax = data.groupby(['Marital_Status', 'Gender'])['Purchase'].count().plot(kind='bar')
ax.set(xlabel='Marital Status and Gender', ylabel='Number of purchases', title='Gender and Marital Status with respect to Purchase')
plt.show()

Married customers, both male and female, buy the most products on Black Friday.

#### Product Category

In [None]:
plt.bar(['PC1', 'PC2', 'PC3'], [data.Product_Category_1.sum(), data.Product_Category_2.sum(), data.Product_Category_3.sum()])
plt.show()

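The bars show the sum of the category codes in each column, and Product_Category_3 has the largest total. Because these values are category IDs rather than quantities, a larger sum reflects larger code numbers, not necessarily more purchases.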
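To see which category code was bought most often, counting occurrences is a more direct measure than summing codes. Here is a minimal sketch on made-up values (hypothetical, not from the dataset) using `value_counts`.

```python
import pandas as pd

# Hypothetical category codes for six purchases.
pc1 = pd.Series([1, 1, 5, 8, 1, 5], name="Product_Category_1")

# Number of purchases per category code, most frequent first.
counts = pc1.value_counts()
print(counts)

# The sum of the codes is dominated by large code numbers, not by frequency.
print("Sum of codes:", pc1.sum())
print("Most frequent category:", counts.idxmax())
```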

#### Stay in Current City Years

In [None]:
data.groupby('Stay_In_Current_City_Years')['Purchase'].count()

In [None]:
plt.bar([0, 1, 2, 3, 4], data.groupby('Stay_In_Current_City_Years')['Purchase'].count())
plt.show()

#### City Category

In [None]:
plt.bar(['A', 'B', 'C'], data.groupby('City_Category')['Purchase'].mean())
plt.show()

All city categories have a good average purchase amount, and category C has the highest.

#### x = Gender, y = Purchase

In [None]:
data.groupby('Gender')['Purchase'].mean()

In [None]:
sns.regplot(x='Gender', y='Purchase', data=data)
plt.show()

#### x = City Category, y = Purchase

In [None]:
data.groupby('City_Category')['Purchase'].mean()

In [None]:
sns.regplot(x='City_Category', y='Purchase', data=data)
plt.show()

As we can see, neither City_Category nor Gender shows a clear linear relationship with Purchase. We can explore further by using the corr method on the dataset.

## Further Analysis

In [None]:
cols = ['Gender', 'Occupation', 'City_Category', 'Stay_In_Current_City_Years', 'Marital_Status', 'Product_Category_1', 'Product_Category_2', 'Product_Category_3', 'Purchase']
corr_result = data[cols].corr()
corr_result

The table shows both negative and positive correlations with the Purchase column.

In [None]:
sns.heatmap(corr_result, annot=True)
plt.show()
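When only the relationship with Purchase matters, it can help to pull that single column out of the correlation matrix and sort it. The sketch below uses a small made-up frame (hypothetical values), so the numbers it prints say nothing about the real dataset.

```python
import pandas as pd

# Hypothetical data: Purchase falls as the category code rises.
toy = pd.DataFrame({
    "Gender": [0, 1, 0, 1, 0, 1],
    "Product_Category_1": [1, 5, 8, 1, 5, 8],
    "Purchase": [15000, 8000, 2000, 16000, 7000, 3000],
})

# Correlation of every other column with Purchase, most negative first.
purchase_corr = toy.corr()["Purchase"].drop("Purchase").sort_values()
print(purchase_corr)
```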

In [None]:
data.groupby(['Age', 'Marital_Status', 'Gender']).count()

Users who are aged 26-35, married and male made the most purchases on Black Friday.
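Instead of scanning the grouped table by eye, the largest group can be picked out directly. This sketch uses a made-up frame (hypothetical rows) with `size()` to count the rows per group and `idxmax()` to find the largest one.

```python
import pandas as pd

# Hypothetical rows, one per purchase.
toy = pd.DataFrame({
    "Age": ["26-35", "26-35", "18-25", "26-35", "18-25"],
    "Marital_Status": [0, 0, 1, 0, 0],
    "Gender": [0, 0, 1, 0, 1],
})

# Number of rows in each (Age, Marital_Status, Gender) group, largest first.
group_sizes = (
    toy.groupby(["Age", "Marital_Status", "Gender"])
    .size()
    .sort_values(ascending=False)
)
print(group_sizes)

# idxmax returns the group key of the largest group as a tuple.
top_group = group_sizes.idxmax()
print("Largest group:", top_group)
```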

In [None]:
print("Let's see which user paid the maximum price for a product, and for which product:")
data.loc[data.Purchase.idxmax()]

The person who bought the highest-priced product is male, aged 26-35 and unmarried.