# Retail Sales Prediction 

## Project Summary

This is the capstone project for the Advanced Data Science with IBM specialization on Coursera (https://www.coursera.org/learn/advanced-data-science-capstone/home/welcome).

It demonstrates the technical skills gained in data science and machine learning, and uses:
* Python 3.6+
* Jupyter notebook environment
* Libraries:
  * Numerical - NumPy, pandas
  * Visualization - Matplotlib, seaborn
  * ML - scikit-learn
  * NN - Keras, TensorFlow


### Problem Statement
A retail company wants to understand customer purchase behaviour (specifically, purchase amount) across products in different categories. It has shared a purchase summary of various customers for selected high-volume products from last month.
The data set also contains customer demographics (age, gender, marital status, city_type, stay_in_current_city), product details (product_id and product category), and the total purchase_amount from last month.

The company now wants to build a model that predicts a customer's purchase amount for various products, which will help it create personalized offers for customers across different products.

### Data Source
https://datahack.analyticsvidhya.com/contest/black-friday/

In [1]:
## imports

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

## Initial Data Exploration

### Read data
Read the data from the input CSV file.

In [2]:
train_df = pd.read_csv('./BFS/Dataset/train.csv')
train_df.head()

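| | User_ID | Product_ID | Gender | Age | Occupation | City_Category | Stay_In_Current_City_Years | Marital_Status | Product_Category_1 | Product_Category_2 | Product_Category_3 | Purchase |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1000001 | P00069042 | F | 0-17 | 10 | A | 2 | 0 | 3 | NaN | NaN | 8370 |
| 1 | 1000001 | P00248942 | F | 0-17 | 10 | A | 2 | 0 | 1 | 6.0 | 14.0 | 15200 |
| 2 | 1000001 | P00087842 | F | 0-17 | 10 | A | 2 | 0 | 12 | NaN | NaN | 1422 |
| 3 | 1000001 | P00085442 | F | 0-17 | 10 | A | 2 | 0 | 12 | 14.0 | NaN | 1057 |
| 4 | 1000002 | P00285442 | M | 55+ | 16 | C | 4+ | 0 | 8 | NaN | NaN | 7969 |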


### Getting familiar with data
Explore the data in the DataFrame to identify the cleanup and transformations needed later.

In [7]:
train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 550068 entries, 0 to 550067
Data columns (total 12 columns):
 #   Column                      Non-Null Count   Dtype  
---  ------                      --------------   -----  
 0   User_ID                     550068 non-null  int64  
 1   Product_ID                  550068 non-null  object 
 2   Gender                      550068 non-null  object 
 3   Age                         550068 non-null  object 
 4   Occupation                  550068 non-null  int64  
 5   City_Category               550068 non-null  object 
 6   Stay_In_Current_City_Years  550068 non-null  object 
 7   Marital_Status              550068 non-null  int64  
 8   Product_Category_1          550068 non-null  int64  
 9   Product_Category_2          376430 non-null  float64
 10  Product_Category_3          166821 non-null  float64
 11  Purchase                    550068 non-null  int64  
dtypes: float64(2), int64(5), object(5)
memory usage: 50.4+ MB
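
The `info()` output above shows that only `Product_Category_2` and `Product_Category_3` contain nulls. A minimal sketch of how that could be quantified per column, assuming a small hand-built frame (`sample_df`, a hypothetical stand-in that mirrors the first five rows of `train.csv`) rather than the full data set:

```python
import numpy as np
import pandas as pd

# Hypothetical sample mirroring the first rows of train.csv; not the full data set.
sample_df = pd.DataFrame({
    "Product_Category_1": [3, 1, 12, 12, 8],
    "Product_Category_2": [np.nan, 6.0, np.nan, 14.0, np.nan],
    "Product_Category_3": [np.nan, 14.0, np.nan, np.nan, np.nan],
    "Purchase": [8370, 15200, 1422, 1057, 7969],
})

# Count and percentage of missing values per column, largest first.
null_summary = pd.DataFrame({
    "null_count": sample_df.isna().sum(),
    "null_pct": (sample_df.isna().mean() * 100).round(1),
}).sort_values("null_count", ascending=False)

print(null_summary)
```

On the full data set, the non-null counts reported by `info()` imply that about 31.6% of `Product_Category_2` and 69.7% of `Product_Category_3` values are missing.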


In [29]:
# Exclude User_ID from describe
train_df.iloc[:,1:].describe(include='all')

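| | Product_ID | Gender | Age | Occupation | City_Category | Stay_In_Current_City_Years | Marital_Status | Product_Category_1 | Product_Category_2 | Product_Category_3 | Purchase |
|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 550068 | 550068 | 550068 | 550068.0 | 550068 | 550068.0 | 550068.0 | 550068.0 | 376430.0 | 166821.0 | 550068.0 |
| unique | 3631 | 2 | 7 | NaN | 3 | 5.0 | NaN | NaN | NaN | NaN | NaN |
| top | P00265242 | M | 26-35 | NaN | B | 1.0 | NaN | NaN | NaN | NaN | NaN |
| freq | 1880 | 414259 | 219587 | NaN | 231173 | 193821.0 | NaN | NaN | NaN | NaN | NaN |
| mean | NaN | NaN | NaN | 8.076707 | NaN | NaN | 0.409653 | 5.40427 | 9.842329 | 12.668243 | 9263.968713 |
| std | NaN | NaN | NaN | 6.52266 | NaN | NaN | 0.49177 | 3.936211 | 5.08659 | 4.125338 | 5023.065394 |
| min | NaN | NaN | NaN | 0.0 | NaN | NaN | 0.0 | 1.0 | 2.0 | 3.0 | 12.0 |
| 25% | NaN | NaN | NaN | 2.0 | NaN | NaN | 0.0 | 1.0 | 5.0 | 9.0 | 5823.0 |
| 50% | NaN | NaN | NaN | 7.0 | NaN | NaN | 0.0 | 5.0 | 9.0 | 14.0 | 8047.0 |
| 75% | NaN | NaN | NaN | 14.0 | NaN | NaN | 1.0 | 8.0 | 15.0 | 16.0 | 12054.0 |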


In [26]:
# Print the distinct values of each demographic column
col_list = ['Gender', 'Age', 'Occupation', 'City_Category', 'Stay_In_Current_City_Years', 'Marital_Status']
for col in col_list:
    print(f'{col} -> {train_df[col].unique()}')

Gender -> ['F' 'M']
Age -> ['0-17' '55+' '26-35' '46-50' '51-55' '36-45' '18-25']
Occupation -> [10 16 15  7 20  9  1 12 17  0  3  4 11  8 19  2 18  5 14 13  6]
City_Category -> ['A' 'C' 'B']
Stay_In_Current_City_Years -> ['2' '4+' '3' '1' '0']
Marital_Status -> [0 1]


In [None]:
# Explore

# Top 10 customers by number of purchases and by total purchase amount
# Top 5 products
# Age demographics by number of purchases and by total purchase amount
# Top 5 occupations
# Top cities
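
The exploration items listed above could be approached with a single group-and-rank helper. This is only a sketch: `top_n` is a hypothetical helper name, `sample_df` is a hand-built stand-in for `train_df` that mirrors the first five rows of the data, and ranking by total purchase amount (rather than by count) is an assumption.

```python
import pandas as pd

# Hypothetical sample mirroring the first rows of train.csv; not the full data set.
sample_df = pd.DataFrame({
    "User_ID": [1000001, 1000001, 1000001, 1000001, 1000002],
    "Product_ID": ["P00069042", "P00248942", "P00087842", "P00085442", "P00285442"],
    "Age": ["0-17", "0-17", "0-17", "0-17", "55+"],
    "Occupation": [10, 10, 10, 10, 16],
    "City_Category": ["A", "A", "A", "A", "C"],
    "Purchase": [8370, 15200, 1422, 1057, 7969],
})


def top_n(df, key, n):
    """Summarise purchases per value of `key` and keep the n largest by total amount."""
    stats = df.groupby(key)["Purchase"].agg(num_purchases="count", total_purchase="sum")
    return stats.sort_values("total_purchase", ascending=False).head(n)


top_customers = top_n(sample_df, "User_ID", 10)
top_products = top_n(sample_df, "Product_ID", 5)
age_summary = top_n(sample_df, "Age", 7)        # 7 age bands exist in the data
top_occupations = top_n(sample_df, "Occupation", 5)
top_cities = top_n(sample_df, "City_Category", 3)

print(top_customers)
```

Sorting the same summary by `num_purchases` instead would rank by purchase count; with the real data, `sample_df` would simply be replaced by `train_df`.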

## Visualizations

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

## Identifying data issues

## ETL

### Handle NULLs
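
One possible way to handle the nulls in `Product_Category_2` and `Product_Category_3` is sketched below. It assumes that a missing value means "no secondary category" and that `0` is free to use as a sentinel, which is consistent with the `describe()` output (the minimum observed codes are 2 and 3) but is still an assumption, not a decision taken in this notebook. `sample_df` is a hypothetical stand-in for `train_df`.

```python
import numpy as np
import pandas as pd

# Hypothetical sample mirroring the first rows of train.csv; not the full data set.
sample_df = pd.DataFrame({
    "Product_Category_1": [3, 1, 12, 12, 8],
    "Product_Category_2": [np.nan, 6.0, np.nan, 14.0, np.nan],
    "Product_Category_3": [np.nan, 14.0, np.nan, np.nan, np.nan],
})

null_cols = ["Product_Category_2", "Product_Category_3"]

# Assumption: NaN means "no secondary category", encoded here as 0.
# After filling, the columns can be stored as integers like Product_Category_1.
clean_df = (
    sample_df
    .fillna({col: 0 for col in null_cols})
    .astype({col: "int64" for col in null_cols})
)

print(clean_df)
```

Dropping `Product_Category_3` altogether is another option worth comparing, since most of its values are missing.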

### Transform columns to categorical or use one-hot encoding
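
Both options named in this heading can be sketched with pandas alone. The choice of which column gets which treatment is an assumption made for illustration: `Age` is treated as an ordered categorical because its bands have a natural order, while `Gender` and `City_Category` are one-hot encoded. `sample_df` is again a hypothetical stand-in for `train_df`.

```python
import pandas as pd

# Hypothetical sample mirroring the first rows of train.csv; not the full data set.
sample_df = pd.DataFrame({
    "Gender": ["F", "F", "F", "F", "M"],
    "Age": ["0-17", "0-17", "0-17", "0-17", "55+"],
    "City_Category": ["A", "A", "A", "A", "C"],
    "Purchase": [8370, 15200, 1422, 1057, 7969],
})

# Option 1: ordered categorical. The band list matches the unique Age values
# printed earlier in the notebook, sorted from youngest to oldest.
age_order = ["0-17", "18-25", "26-35", "36-45", "46-50", "51-55", "55+"]
age_dtype = pd.CategoricalDtype(categories=age_order, ordered=True)
cat_df = sample_df.copy()
cat_df["Age"] = cat_df["Age"].astype(age_dtype)
age_codes = cat_df["Age"].cat.codes  # integer position of each band

# Option 2: one-hot encoding of the unordered columns.
encoded_df = pd.get_dummies(cat_df, columns=["Gender", "City_Category"], dtype=int)

print(encoded_df.columns.tolist())
```

Note that `get_dummies` only creates columns for values present in the frame it is given, so on this five-row sample there is no `City_Category_B` column; the full data set would produce one.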