# Assignment -- Walmart

The management team at Walmart Inc. wants to analyze customer purchase behavior (specifically, purchase amount) against customer gender and various other factors to help the business make better decisions. They want to understand whether spending habits differ between male and female customers: do women spend more on Black Friday than men?

Data Description

The company collected the transactional data of customers who purchased products from Walmart Stores during Black Friday. The dataset in walmart_data.csv has the following features:

User_ID
Product_ID
Gender - sex of a customer
Age - age in bins
Occupation (masked)
City_Category - category of the city [A, B, C]
Stay_In_Current_City_Years - number of years a customer has stayed in their current city
Marital_Status
Product_Category (masked)
Purchase - purchase amount

Practicalities

For simplicity, you may assume that 50% of Walmart's customer base is male and the other 50% is female.

Analyze the provided data and provide insights to the best of your abilities. Include the relevant tables, graphs, and visualizations to explain what you have learned about the market. Make sure that the solution reflects your entire thought process, including the preparation of data: how the code is structured matters more than the final result or plot.

In [54]:
import pandas as pd
import ast
from collections import defaultdict
import numpy as np
import warnings
warnings.filterwarnings("ignore")
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
from scipy import stats
from scipy.stats import norm
import gender_guesser.detector as gender
predictor = gender.Detector()

In [55]:
walmart = pd.read_csv('walmart_data.csv')

## Data Overview

In [56]:
walmart.head()

Unnamed: 0,User_ID,Product_ID,Gender,Age,Occupation,City_Category,Stay_In_Current_City_Years,Marital_Status,Product_Category,Purchase
0,1000001,P00069042,F,0-17,10,A,2,0,3,8370
1,1000001,P00248942,F,0-17,10,A,2,0,1,15200
2,1000001,P00087842,F,0-17,10,A,2,0,12,1422
3,1000001,P00085442,F,0-17,10,A,2,0,12,1057
4,1000002,P00285442,M,55+,16,C,4+,0,8,7969


In [66]:
walmart.tail()

Unnamed: 0,User_ID,Product_ID,Gender,Age,Occupation,City_Category,Stay_In_Current_City_Years,Marital_Status,Product_Category,Purchase
550063,1006033,P00372445,M,51-55,13,B,1,1,20,368
550064,1006035,P00375436,F,26-35,1,C,3,0,20,371
550065,1006036,P00375436,F,26-35,15,B,4+,1,20,137
550066,1006038,P00375436,F,55+,1,C,2,0,20,365
550067,1006039,P00371644,F,46-50,0,B,4+,1,20,490


In [67]:
walmart.sample(5)

Unnamed: 0,User_ID,Product_ID,Gender,Age,Occupation,City_Category,Stay_In_Current_City_Years,Marital_Status,Product_Category,Purchase
420201,1004625,P00217942,M,26-35,7,C,0,0,5,7026
201956,1001181,P00022142,M,36-45,7,A,3,1,1,8295
499206,1004867,P00117542,M,26-35,16,A,3,0,18,3763
530068,1003661,P00251242,M,36-45,12,C,0,1,5,7194
324056,1001883,P00232442,F,51-55,0,B,1,1,5,5346


In [69]:
walmart.shape

(550068, 10)

In [68]:
walmart.describe()

Unnamed: 0,User_ID,Occupation,Marital_Status,Product_Category,Purchase
count,550068.0,550068.0,550068.0,550068.0,550068.0
mean,1003029.0,8.076707,0.409653,5.40427,9263.968713
std,1727.592,6.52266,0.49177,3.936211,5023.065394
min,1000001.0,0.0,0.0,1.0,12.0
25%,1001516.0,2.0,0.0,1.0,5823.0
50%,1003077.0,7.0,0.0,5.0,8047.0
75%,1004478.0,14.0,1.0,8.0,12054.0
max,1006040.0,20.0,1.0,20.0,23961.0
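
The quartiles reported above are enough for a quick outlier check on Purchase. The following is a minimal sketch, assuming Tukey's 1.5 × IQR rule (a choice that the original notebook does not make at this point) and using only the numbers copied from the describe() output:

```python
# Quartiles and maximum of Purchase, copied from the describe() output above
q1, q3 = 5823.0, 12054.0
purchase_max = 23961.0

# Tukey's 1.5 * IQR rule (an assumption, not part of the original analysis)
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

print(f"IQR: {iqr}")
print(f"Fences: [{lower_fence}, {upper_fence}]")
print("Maximum exceeds the upper fence:", purchase_max > upper_fence)
```

Under this rule the lower fence is negative, so only unusually large purchases could be flagged as outliers.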


In [70]:
walmart.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 550068 entries, 0 to 550067
Data columns (total 10 columns):
 #   Column                      Non-Null Count   Dtype 
---  ------                      --------------   ----- 
 0   User_ID                     550068 non-null  int64 
 1   Product_ID                  550068 non-null  object
 2   Gender                      550068 non-null  object
 3   Age                         550068 non-null  object
 4   Occupation                  550068 non-null  int64 
 5   City_Category               550068 non-null  object
 6   Stay_In_Current_City_Years  550068 non-null  object
 7   Marital_Status              550068 non-null  int64 
 8   Product_Category            550068 non-null  int64 
 9   Purchase                    550068 non-null  int64 
dtypes: int64(5), object(5)
memory usage: 42.0+ MB
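
The "+" in the reported memory usage means the figure is a lower bound, because the contents of object columns are not measured in depth. One possible way to reduce the footprint is to store the low-cardinality string columns as pandas categoricals. The sketch below uses a small toy frame (the values are illustrative, not taken from walmart_data.csv) to compare the two representations:

```python
import pandas as pd

# Toy frame mimicking three of the string columns (illustrative values only)
toy = pd.DataFrame({
    "Gender": ["F", "M", "M", "F"] * 1000,
    "Age": ["0-17", "55+", "26-35", "46-50"] * 1000,
    "City_Category": ["A", "C", "B", "A"] * 1000,
})

# deep=True measures the actual string payload, not just the pointers
before = toy.memory_usage(deep=True).sum()

# Categoricals store each distinct label once plus small integer codes
compact = toy.astype("category")
after = compact.memory_usage(deep=True).sum()

print(f"string columns: {before} bytes, category columns: {after} bytes")
```

Whether the conversion is worthwhile on the full dataset depends on how the columns are used later, since some operations behave differently on categorical data.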


In [57]:
walmart.dtypes

User_ID                        int64
Product_ID                    object
Gender                        object
Age                           object
Occupation                     int64
City_Category                 object
Stay_In_Current_City_Years    object
Marital_Status                 int64
Product_Category               int64
Purchase                       int64
dtype: object

Find unique values in each column

In [63]:
def print_unique_values(df):
    for column in df.columns:
        unique_values = df[column].unique()
        print(f"\nUnique Values of {column}: ", unique_values)

# Example usage
print_unique_values(walmart)


Unique Values of User_ID:  [1000001 1000002 1000003 ... 1004113 1005391 1001529]

Unique Values of Product_ID:  ['P00069042' 'P00248942' 'P00087842' ... 'P00370293' 'P00371644'
 'P00370853']

Unique Values of Gender:  ['F' 'M']

Unique Values of Age:  ['0-17' '55+' '26-35' '46-50' '51-55' '36-45' '18-25']

Unique Values of Occupation:  [10 16 15  7 20  9  1 12 17  0  3  4 11  8 19  2 18  5 14 13  6]

Unique Values of City_Category:  ['A' 'C' 'B']

Unique Values of Stay_In_Current_City_Years:  ['2' '4+' '3' '1' '0']

Unique Values of Marital_Status:  [0 1]

Unique Values of Product_Category:  [ 3  1 12  8  5  4  2  6 14 11 13 15  7 16 18 10 17  9 20 19]

Unique Values of Purchase:  [ 8370 15200  1422 ...   135   123   613]


In [65]:
def print_nunique_values(df):
    # Print the number of distinct values in every column
    for column in df.columns:
        n_unique = df[column].nunique()
        print(f"\nNumber of unique values in {column}: ", n_unique)

# Example usage
print_nunique_values(walmart)


Number of unique values in User_ID:  5891

Number of unique values in Product_ID:  3631

Number of unique values in Gender:  2

Number of unique values in Age:  7

Number of unique values in Occupation:  21

Number of unique values in City_Category:  3

Number of unique values in Stay_In_Current_City_Years:  5

Number of unique values in Marital_Status:  2

Number of unique values in Product_Category:  20

Number of unique values in Purchase:  18105
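
The counts above show that the rows are transactions rather than customers: 550068 rows are spread over 5891 users and 3631 products. A short worked calculation, using only the figures already reported in this notebook, gives the average number of rows per user and per product:

```python
# Figures copied from the shape and unique-count outputs above
n_rows = 550068
n_users = 5891
n_products = 3631

# Average number of transaction rows per distinct user and per distinct product
rows_per_user = n_rows / n_users
rows_per_product = n_rows / n_products

print(f"Average rows per user: {rows_per_user:.2f}")
print(f"Average rows per product: {rows_per_product:.2f}")
```

Because each user contributes many rows, per-transaction and per-customer summaries of Purchase can differ, which is worth keeping in mind when comparing genders.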


## Data Cleaning

In [59]:
walmart.isnull().values.any()

False

The result False means that no column in the dataset contains a null value.
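
A single boolean does not say where nulls would be if there were any. As a sketch, a per-column count can be obtained with isnull().sum(). The toy frame below deliberately contains missing values for illustration; it is not the Walmart data:

```python
import numpy as np
import pandas as pd

# Toy frame with one missing value per column (illustrative only)
toy = pd.DataFrame({
    "Gender": ["F", "M", None],
    "Purchase": [8370, np.nan, 1422],
})

# Number of nulls in each column
null_counts = toy.isnull().sum()
print(null_counts)

# Same overall check as used in the notebook
print("Any nulls:", toy.isnull().values.any())
```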

In [62]:
print("Number of duplicated values: ", walmart.duplicated().sum())

Number of duplicated values:  0
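
Called without arguments, duplicated() only flags rows that match an earlier row in every column. A stricter check, sketched below on toy data, looks for repeated (User_ID, Product_ID) pairs. Treating that pair as a key is an assumption about the dataset, not something the assignment states:

```python
import pandas as pd

# Toy frame: the first two rows share user and product but differ in Purchase
toy = pd.DataFrame({
    "User_ID": [1000001, 1000001, 1000002],
    "Product_ID": ["P00069042", "P00069042", "P00285442"],
    "Purchase": [8370, 8400, 7969],
})

# Rows identical to an earlier row in every column
full_row_dupes = toy.duplicated().sum()

# Rows repeating an earlier (User_ID, Product_ID) pair
key_dupes = toy.duplicated(subset=["User_ID", "Product_ID"]).sum()

print("Full-row duplicates:", full_row_dupes)
print("Repeated user/product pairs:", key_dupes)
```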


In [72]:
# Remove the "+" suffix; regex=False treats "+" as a literal character, not a regex quantifier
walmart["Stay_In_Current_City_Years"] = walmart["Stay_In_Current_City_Years"].str.replace("+", "", regex=False)

In [73]:
walmart.Stay_In_Current_City_Years.unique()

array(['2', '4', '3', '1', '0'], dtype=object)

In [74]:
# Convert the cleaned strings to integers; the value 4 now stands for "4 or more" years
walmart['Stay_In_Current_City_Years'] = pd.to_numeric(walmart['Stay_In_Current_City_Years'])
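
After this conversion the value 4 no longer shows that it originally meant "4 or more". If that distinction matters later, one option is to record it in a separate flag before stripping the suffix. The sketch below works on a toy Series holding the five values seen in the column; the name is_capped is hypothetical:

```python
import pandas as pd

# The five distinct raw values of Stay_In_Current_City_Years
stay = pd.Series(["2", "4+", "3", "1", "0"], name="Stay_In_Current_City_Years")

# Flag the open-ended bin before the "+" is removed (is_capped is a hypothetical name)
is_capped = stay.str.endswith("+")

# Strip the literal "+" and convert to integers
years = pd.to_numeric(stay.str.replace("+", "", regex=False))

print(pd.DataFrame({"years": years, "is_capped": is_capped}))
```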

In [75]:
walmart

Unnamed: 0,User_ID,Product_ID,Gender,Age,Occupation,City_Category,Stay_In_Current_City_Years,Marital_Status,Product_Category,Purchase
0,1000001,P00069042,F,0-17,10,A,2,0,3,8370
1,1000001,P00248942,F,0-17,10,A,2,0,1,15200
2,1000001,P00087842,F,0-17,10,A,2,0,12,1422
3,1000001,P00085442,F,0-17,10,A,2,0,12,1057
4,1000002,P00285442,M,55+,16,C,4,0,8,7969
...,...,...,...,...,...,...,...,...,...,...
550063,1006033,P00372445,M,51-55,13,B,1,1,20,368
550064,1006035,P00375436,F,26-35,1,C,3,0,20,371
550065,1006036,P00375436,F,26-35,15,B,4,1,20,137
550066,1006038,P00375436,F,55+,1,C,2,0,20,365
