## Overview

For this course assignment, we were given a sales dataset from a retail company that wants to understand customer purchase behaviour (specifically, the purchase amount) across products of different categories.

The dataset also contains customer demographics (age, gender, marital status, city category, years in current city), product details (product ID and product category), and the purchase amount from last month.

The purpose of this assignment is to build a model that predicts the purchase amount of a customer for a given product, which will help the company create personalized offers for customers on different products.

## Data

| Variable                      | Description                                          |
|-------------------------------|------------------------------------------------------|
|``User_ID``                    |User ID                                               |
|``Product_ID``                 |Product ID                                            |
|``Gender``                     |Sex of the user                                       |
|``Age``                        |Age in bins                                           |
|``Occupation``                 |Occupation (masked)                                   |
|``City_Category``              |Category of the city (A, B, C)                        |
|``Stay_In_Current_City_Years`` |Number of years the user has stayed in the current city |
|``Marital_Status``             |Marital status                                        |
|``Product_Category_1``         |Product category (masked)                             |
|``Product_Category_2``         |Product may also belong to another category (masked)  |
|``Product_Category_3``         |Product may also belong to another category (masked)  |
|``Purchase``                   |Purchase amount (target variable)                     |

## Evaluation

The root mean squared error (RMSE) will be used for model evaluation.
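
As a quick illustration of the metric, here is a minimal sketch of RMSE computed directly with NumPy. The helper name `rmse` and the toy values are illustrative assumptions and are not part of the assignment.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error: sqrt(mean((y_true - y_pred)^2))."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Toy purchase amounts (made up for illustration)
actual = [100.0, 200.0, 300.0]
predicted = [110.0, 190.0, 330.0]
print(rmse(actual, predicted))
```

Because errors are squared before averaging, large misses are penalised more heavily than small ones, and the result is expressed in the same units as ``Purchase``.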

### Loading the required libraries and dataset

In [1]:
import numpy as np
import pandas as pd

# Seed NumPy's global random number generator (seed is a function and must be called, not assigned)
np.random.seed(42)

In [2]:
data = pd.read_csv("sales_data.csv")
data.head()

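|   | Age   | City_Category | Gender | Marital_Status | Occupation | Product_Category_1 | Product_Category_2 | Product_Category_3 | Product_ID | Purchase | Stay_In_Current_City_Years | User_ID |
|---|-------|---------------|--------|----------------|------------|--------------------|--------------------|--------------------|------------|----------|----------------------------|---------|
| 0 | 0-17  | A             | F      | 0              | 10         | 1                  | 6                  | 14                 | 394        | 15200.0  | 2                          | 1000001 |
| 1 | 46-50 | B             | M      | 1              | 7          | 1                  | 8                  | 17                 | 287        | 19215.0  | 2                          | 1000004 |
| 2 | 26-35 | A             | M      | 1              | 20         | 1                  | 2                  | 5                  | 214        | 15665.0  | 1                          | 1000005 |
| 3 | 51-55 | A             | F      | 0              | 9          | 5                  | 8                  | 14                 | 366        | 5378.0   | 1                          | 1000006 |
| 4 | 51-55 | A             | F      | 0              | 9          | 2                  | 3                  | 4                  | 521        | 13055.0  | 1                          | 1000006 |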


**Randomly splitting the given data into two subsets for training (80%) and testing (20%), using *random_state = 42*.**

In [3]:
from sklearn.model_selection import train_test_split

train, test = train_test_split(data, test_size = 0.2, random_state = 42)
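
To make the effect of `test_size` and `random_state` concrete, here is a small sketch on a made-up ten-row frame. The frame `toy` and the variable names are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame standing in for the sales data (values are made up)
toy = pd.DataFrame({"x": np.arange(10), "y": np.arange(10) * 2})

# Two splits with the same seed
train_a, test_a = train_test_split(toy, test_size=0.2, random_state=42)
train_b, test_b = train_test_split(toy, test_size=0.2, random_state=42)

print(len(train_a), len(test_a))          # sizes of the two subsets
print(test_a.index.equals(test_b.index))  # same seed, same rows selected
```

Fixing `random_state` makes the split reproducible, so RMSE values from different runs are computed on the same held-out rows.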

**Saving the 2 subsets to 2 CSV files: ``sales_train.csv`` and ``sales_test.csv``.**

In [4]:
train.to_csv('sales_train.csv', index=False)
test.to_csv('sales_test.csv', index=False)

**Reloading the two newly created CSV files into 2 dataframes ``train`` and ``test``.**

In [5]:
train = pd.read_csv("sales_train.csv")
test = pd.read_csv("sales_test.csv")
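
The save-and-reload round trip above relies on `index=False`. The following sketch, using an in-memory buffer and a made-up two-row frame (both assumptions for illustration), shows what changes when the index is written as well.

```python
import io
import pandas as pd

# Toy frame with a non-trivial index, as produced by a shuffled split
toy = pd.DataFrame({"Age": ["0-17", "55+"], "Purchase": [15200.0, 5378.0]},
                   index=[7, 3])

# With index=False only the data columns are written
without_index = pd.read_csv(io.StringIO(toy.to_csv(index=False)))

# With the default index=True the row labels come back as an extra unnamed column
with_index = pd.read_csv(io.StringIO(toy.to_csv()))

print(list(without_index.columns))
print(list(with_index.columns))
```

Writing without the index keeps the old row labels from turning into an unintended extra column after reloading.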

**Converting all categorical data in ``train`` and ``test`` into numerical data.**

In [6]:
def cat_to_num(df, cat_name, cat_dict):
    # Replace each label in column `cat_name` with its integer code (modifies df in place)
    df[cat_name] = df[cat_name].apply(lambda label: cat_dict[label])

# One mapping per categorical column, applied identically to both subsets
cat_maps = {
    'Gender': {'F':0, 'M':1},
    'Age': {'0-17':0, '18-25':1, '26-35':2, '36-45':3, '46-50':4, '51-55':5, '55+':6},
    'City_Category': {'A':0, 'B':1, 'C':2},
    'Stay_In_Current_City_Years': {'0':0, '1':1, '2':2, '3':3, '4+':4},
}

for df in (train, test):
    for cat_name, cat_dict in cat_maps.items():
        cat_to_num(df, cat_name, cat_dict)

# You can use LabelEncoder in sklearn.preprocessing as well
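
As a sketch of the `LabelEncoder` alternative mentioned in the comment above, the example below encodes the age bins. The list `age_bins` and the sample labels are illustrative assumptions.

```python
from sklearn.preprocessing import LabelEncoder

# The age bins used in the dataset
age_bins = ['0-17', '18-25', '26-35', '36-45', '46-50', '51-55', '55+']

encoder = LabelEncoder()
encoder.fit(age_bins)  # classes_ are stored in sorted order

# Encode a few sample labels
codes = encoder.transform(['26-35', '0-17', '55+'])
print(list(encoder.classes_))
print(list(codes))
```

`LabelEncoder` assigns codes by sorted label order. For these age bins that order happens to match the hand-written dictionary, but this is not guaranteed for other label sets, so an explicit mapping is the safer choice when the order of the codes matters.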

**Building a Linear Regression model on ``train``, using all columns except ``User_ID`` and ``Purchase`` as predictors, and reporting RMSE values on the training and test sets.**

Documentation: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html

In [7]:
from sklearn import metrics

def fit_model(algo, dtrain, dtest, predictors, target):
    """Fit `algo` on the training set and print RMSE on the training and test sets."""
    algo.fit(dtrain[predictors], dtrain[target])

    # RMSE on the data the model was fitted on
    dtrain_predictions = algo.predict(dtrain[predictors])
    print("Train - RMSE : %.4g" % np.sqrt(metrics.mean_squared_error((dtrain[target]).values, dtrain_predictions)))

    # RMSE on the held-out test set
    dtest_predictions = algo.predict(dtest[predictors])
    print("Test - RMSE : %.4g" % np.sqrt(metrics.mean_squared_error((dtest[target]).values, dtest_predictions)))

In [8]:
target = 'Purchase'

predictors = list(train.columns)
predictors.remove('Purchase')
predictors.remove('User_ID')
print(predictors)

['Age', 'City_Category', 'Gender', 'Marital_Status', 'Occupation', 'Product_Category_1', 'Product_Category_2', 'Product_Category_3', 'Product_ID', 'Stay_In_Current_City_Years']


In [9]:
from sklearn.linear_model import LinearRegression
LR = LinearRegression()
fit_model(LR, train, test, predictors, target)

Train - RMSE : 4601
Test - RMSE : 4617


**Building a Decision Tree Regressor with default parameters on ``train``, using all columns except ``User_ID`` and ``Purchase`` as predictors, and reporting RMSE values on the training and test sets.**

Documentation: https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html#sklearn.tree.DecisionTreeRegressor

In [10]:
from sklearn.tree import DecisionTreeRegressor
DT = DecisionTreeRegressor()
fit_model(DT, train, test, predictors, target)

Train - RMSE : 1244
Test - RMSE : 4481


**Using RandomizedSearchCV to tune the parameters ``max_depth``, ``min_samples_split``, ``min_samples_leaf`` and ``max_features`` of the Decision Tree Regressor in order to obtain a lower RMSE value on the test set than the default Decision Tree Regressor above.**

Documentation:

https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html
https://scikit-learn.org/stable/modules/model_evaluation.html#scoring-parameter
https://scikit-learn.org/stable/auto_examples/model_selection/plot_randomized_search.html#sphx-glr-auto-examples-model-selection-plot-randomized-search-py

In [11]:
from sklearn.model_selection import RandomizedSearchCV

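# Float values for min_samples_split and min_samples_leaf are interpreted as fractions of the training samples
param_dist = {"max_depth": [10, 20, 30, None],
              "min_samples_split": np.linspace(0.1, 1.0, 10, endpoint=True),
              "min_samples_leaf": np.linspace(0.1, 0.5, 5, endpoint=True),
              # range(1, len(predictors)) yields 1..9, so using all 10 features is never sampled
              "max_features": list(range(1,len(predictors)))}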

random_search = RandomizedSearchCV(DT, param_distributions=param_dist, n_iter=100, cv=5,
                                   scoring='neg_mean_squared_error', random_state=42)

random_search.fit(train[predictors], train[target])

RandomizedSearchCV(cv=5, error_score='raise-deprecating',
          estimator=DecisionTreeRegressor(criterion='mse', max_depth=None, max_features=None,
           max_leaf_nodes=None, min_impurity_decrease=0.0,
           min_impurity_split=None, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           presort=False, random_state=None, splitter='best'),
          fit_params=None, iid='warn', n_iter=100, n_jobs=None,
          param_distributions={'max_depth': [10, 20, 30, None], 'min_samples_split': array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1. ]), 'min_samples_leaf': array([0.1, 0.2, 0.3, 0.4, 0.5]), 'max_features': [1, 2, 3, 4, 5, 6, 7, 8, 9]},
          pre_dispatch='2*n_jobs', random_state=42, refit=True,
          return_train_score='warn', scoring='neg_mean_squared_error',
          verbose=0)

In [12]:
random_search.best_params_

{'min_samples_split': 0.2,
 'min_samples_leaf': 0.1,
 'max_features': 6,
 'max_depth': 10}
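
Because the search is scored with `neg_mean_squared_error`, the stored `best_score_` is a negated MSE rather than an RMSE. The sketch below, on small synthetic data with a reduced parameter grid (both assumptions made only for illustration), shows one way to turn it back into an RMSE-style figure.

```python
import numpy as np
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data (made up for illustration)
rng = np.random.RandomState(42)
X = rng.rand(200, 4)
y = 10 * X[:, 0] + rng.normal(scale=0.5, size=200)

search = RandomizedSearchCV(
    DecisionTreeRegressor(random_state=42),
    param_distributions={"max_depth": [2, 4, 8, None],
                         "min_samples_leaf": [1, 5, 10]},
    n_iter=5, cv=5, scoring="neg_mean_squared_error", random_state=42)
search.fit(X, y)

# best_score_ is the mean negated MSE over the CV folds,
# so flip the sign before taking the square root
cv_rmse = np.sqrt(-search.best_score_)
print(search.best_params_)
print("CV - RMSE : %.4g" % cv_rmse)
```

This cross-validated figure is computed on the training data only, so it can be used to compare candidate settings without touching the test set.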

In [13]:
train_predictions = random_search.best_estimator_.predict(train[predictors])
print("Train - RMSE : %.4g" % np.sqrt(metrics.mean_squared_error((train[target]).values, train_predictions)))

Train - RMSE : 4276


In [14]:
test_predictions = random_search.best_estimator_.predict(test[predictors])
print("Test - RMSE : %.4g" % np.sqrt(metrics.mean_squared_error((test[target]).values, test_predictions)))

Test - RMSE : 4286
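
To compare the three models side by side, the reported RMSE values can be collected in a small table. This is only a convenience sketch; the frame name `results` and the row labels are assumptions, while the numbers are copied from the outputs above.

```python
import pandas as pd

# RMSE values copied from the printed outputs of the three models
results = pd.DataFrame(
    {"train_rmse": [4601, 1244, 4276],
     "test_rmse": [4617, 4481, 4286]},
    index=["LinearRegression", "DecisionTree (default)", "DecisionTree (tuned)"])

# Difference between test and training error for each model
results["gap"] = results["test_rmse"] - results["train_rmse"]
print(results)
```

The default tree's large gap between training and test RMSE suggests it overfits, while the tuned tree gives up some training fit and reaches the lowest test RMSE of the three.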


In [16]:
train[predictors].shape

(133456, 10)