# Overview

For this assignment, we were given a sales dataset from a retail company that wants to understand customer purchase behaviour (specifically, purchase amount) across products in different categories.

The dataset also contains customer demographics (age, gender, marital status, city category, and years of stay in the current city), product details (product ID and product category), and the total purchase amount from last month.

The purpose of this assignment is to build a model that predicts a customer's purchase amount for a given product, which will help the company create personalized offers for customers across different products.


## Data

| Variable                      | Description                                        |
|-------------------------------|----------------------------------------------------|
|``User_ID``                    |User ID                                             |
|``Product_ID``                 |Product ID                                          |
|``Gender``                     |Sex of User                                         |
|``Age``                        |Age in bins                                         |
|``Occupation``                 |Occupation (Masked)                                 |
|``City_Category``              |Category of the City (A, B, C)                      |
|``Stay_In_Current_City_Years`` |Number of years stayed in the current city          |
|``Marital_Status``             |Marital Status                                      |
|``Product_Category_1``         |Product Category (Masked)                           |
|``Product_Category_2``         |Other category the product may belong to (Masked)   |
|``Product_Category_3``         |Other category the product may belong to (Masked)   |
|``Purchase``                   |Purchase Amount (Target Variable)                   |

## Evaluation

The root mean squared error (RMSE) will be used for model evaluation.
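
For reference, RMSE is the square root of the mean squared difference between the actual and predicted values. A minimal sketch with NumPy follows; the helper name `rmse` and the sample numbers are illustrative and not part of the assignment.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Made-up purchase amounts, for illustration only.
actual = [100.0, 200.0, 300.0]
predicted = [110.0, 190.0, 330.0]
print(rmse(actual, predicted))
```

Because the errors are squared before averaging, a few large misses raise RMSE more than many small ones.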

## Code

In [1]:
import numpy as np
import pandas as pd

# Seed NumPy's global random number generator for reproducibility.
np.random.seed(42)

Load the given dataset.

In [2]:
data = pd.read_csv("sales_data.csv")
data.head()

Unnamed: 0,Age,City_Category,Gender,Marital_Status,Occupation,Product_Category_1,Product_Category_2,Product_Category_3,Product_ID,Purchase,Stay_In_Current_City_Years,User_ID
0,0-17,A,F,0,10,1,6,14,394,15200.0,2,1000001
1,46-50,B,M,1,7,1,8,17,287,19215.0,2,1000004
2,26-35,A,M,1,20,1,2,5,214,15665.0,1,1000005
3,51-55,A,F,0,9,5,8,14,366,5378.0,1,1000006
4,51-55,A,F,0,9,2,3,4,521,13055.0,1,1000006


**Any missing values?**

In [3]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 166821 entries, 0 to 166820
Data columns (total 12 columns):
Age                           166821 non-null object
City_Category                 166821 non-null object
Gender                        166821 non-null object
Marital_Status                166821 non-null int64
Occupation                    166821 non-null int64
Product_Category_1            166821 non-null int64
Product_Category_2            166821 non-null int64
Product_Category_3            166821 non-null int64
Product_ID                    166821 non-null int64
Purchase                      166821 non-null float64
Stay_In_Current_City_Years    166821 non-null object
User_ID                       166821 non-null int64
dtypes: float64(1), int64(7), object(4)
memory usage: 15.3+ MB


There are no missing values, since the number of non-null values in each column equals the number of rows.
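
A more direct check is to count the nulls per column. The sketch below uses a small made-up frame (`toy`) rather than `sales_data.csv`, so that it runs on its own; the column values are illustrative.

```python
import pandas as pd

# Toy frame with one deliberately missing value (illustrative data only).
toy = pd.DataFrame({
    "Age": ["0-17", "26-35", None],
    "Purchase": [15200.0, 15665.0, 5378.0],
})

# Count of missing values per column; all zeros means nothing is missing.
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# Single boolean answer for the whole frame.
has_missing = bool(toy.isnull().values.any())
print(has_missing)
```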

**Dropping attribute `User_ID`.**

In [4]:
data = data.drop("User_ID", axis=1)

**Convert the following categorical attributes to numerical values using the rules below.**
+ `Gender`: `F`:0, `M`:1
+ `Age`: `0-17`:0, `18-25`:1, `26-35`:2, `36-45`:3, `46-50`:4, `51-55`:5, `55+`:6
+ `Stay_In_Current_City_Years`: `0`:0, `1`:1, `2`:2, `3`:3, `4+`:4

You may want to apply a `lambda` function to each row of a column in the dataframe. Some examples here may be helpful: https://thispointer.com/pandas-apply-apply-a-function-to-each-row-column-in-dataframe/

In [5]:
data["Gender"] = data["Gender"].apply(lambda r: {'F':0, 'M':1}[r])
data["Age"] = data["Age"].apply(lambda r: {'0-17':0, '18-25':1, '26-35':2, '36-45':3, '46-50':4, '51-55':5, '55+':6}[r])
data["Stay_In_Current_City_Years"] = data["Stay_In_Current_City_Years"].apply(lambda r: {'0':0, '1':1, '2':2, '3':3, '4+':4}[r])
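
An equivalent approach is `Series.map` with a dictionary, which avoids rebuilding the dictionary for every row. One behavioural difference is worth noting: `map` turns unmapped values into `NaN` instead of raising a `KeyError`, so a check afterwards is sensible. The sketch below runs on a made-up frame (`toy`); the constant names are illustrative.

```python
import pandas as pd

# Mapping rules copied from the conversion rules for this step.
GENDER_MAP = {"F": 0, "M": 1}
AGE_MAP = {"0-17": 0, "18-25": 1, "26-35": 2, "36-45": 3,
           "46-50": 4, "51-55": 5, "55+": 6}
STAY_MAP = {"0": 0, "1": 1, "2": 2, "3": 3, "4+": 4}

# Made-up rows, for illustration only.
toy = pd.DataFrame({
    "Gender": ["F", "M", "M"],
    "Age": ["0-17", "46-50", "55+"],
    "Stay_In_Current_City_Years": ["2", "4+", "0"],
})

for column, mapping in [("Gender", GENDER_MAP), ("Age", AGE_MAP),
                        ("Stay_In_Current_City_Years", STAY_MAP)]:
    toy[column] = toy[column].map(mapping)

# map() yields NaN for unmapped values, so confirm nothing slipped through.
assert not toy.isnull().values.any()
print(toy)
```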

In [6]:
data.head()

Unnamed: 0,Age,City_Category,Gender,Marital_Status,Occupation,Product_Category_1,Product_Category_2,Product_Category_3,Product_ID,Purchase,Stay_In_Current_City_Years
0,0,A,0,0,10,1,6,14,394,15200.0,2
1,4,B,1,1,7,1,8,17,287,19215.0,2
2,2,A,1,1,20,1,2,5,214,15665.0,1
3,5,A,0,0,9,5,8,14,366,5378.0,1
4,5,A,0,0,9,2,3,4,521,13055.0,1


**Randomly split the current data frame into 2 subsets for training (80%) and test (20%). Use *random_state = 42*.**

In [7]:
from sklearn.model_selection import train_test_split

train, test = train_test_split(data, test_size = 0.2, random_state = 42)
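
As a quick sanity check on the split, the sketch below confirms the 80/20 sizes and shows that a fixed `random_state` makes the split reproducible. It uses a ten-row made-up frame (`toy`) in place of the real data.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Ten made-up rows stand in for the real dataset.
toy = pd.DataFrame({"x": range(10), "Purchase": [float(v) for v in range(10)]})

train_a, test_a = train_test_split(toy, test_size=0.2, random_state=42)
train_b, test_b = train_test_split(toy, test_size=0.2, random_state=42)

print(len(train_a), len(test_a))          # 8 2
# Same random_state, same rows in the same order.
print(test_a.index.equals(test_b.index))  # True
```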

**Get the list of numerical predictors (all the numerical attributes in the current data frame except the target, `Purchase`) and the list of categorical predictors.**

In [8]:
num_attribs = list(data.select_dtypes(include=[np.number]))
num_attribs.remove("Purchase")
cat_attribs = ["City_Category"]

**Create a transformation pipeline including two pipelines handling the following**
- Numerical *predictors*: apply Standard Scaling
- Categorical *predictor*: apply One-hot-encoding

You will need to use `ColumnTransformer`.

In [9]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer

full_pipeline = ColumnTransformer([
        ("num", StandardScaler(), num_attribs),
        ("cat", OneHotEncoder(), cat_attribs),
    ])
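
To see what such a transformer produces, here is a sketch on a made-up frame (`toy`, with only a few of the real columns). Columns that are not listed, such as `Purchase`, are dropped by default (`remainder="drop"`), each numerical column is standardized, and `City_Category` expands into one column per category.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up rows that mimic a few columns of the real data.
toy = pd.DataFrame({
    "Age": [0, 2, 4, 6],
    "Occupation": [10, 7, 20, 9],
    "City_Category": ["A", "B", "C", "A"],
    "Purchase": [15200.0, 19215.0, 15665.0, 5378.0],
})

toy_pipeline = ColumnTransformer([
    ("num", StandardScaler(), ["Age", "Occupation"]),
    ("cat", OneHotEncoder(), ["City_Category"]),
])

X_toy = toy_pipeline.fit_transform(toy)
# The result may be a sparse matrix; densify it for inspection.
X_toy = X_toy.toarray() if hasattr(X_toy, "toarray") else np.asarray(X_toy)

print(X_toy.shape)  # (4, 5): 2 scaled numerical columns + 3 one-hot columns
```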

**Train and use that transformation pipeline to transform the training data (e.g. for a machine learning model).**

In [10]:
X_train = full_pipeline.fit_transform(train)
y_train = train["Purchase"].copy()



**Use that transformation pipeline to transform the test data (e.g. for testing a machine learning model).**

In [11]:
X_test = full_pipeline.transform(test)
y_test = test["Purchase"].copy()



**Build a Linear Regression model using the training data after transformation and test it on the test data. Report the RMSE values on the training and test data.**

Documentation: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html

In [12]:
from sklearn.linear_model import LinearRegression
from sklearn import metrics

LR = LinearRegression()
LR.fit(X_train, y_train)

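# Predict on the training data and report the training RMSE.
y_train_pred = LR.predict(X_train)
print("Train - RMSE : %.4g" % np.sqrt(metrics.mean_squared_error(y_train, y_train_pred)))

# Predict on the held-out test data and report the test RMSE.
y_test_pred = LR.predict(X_test)
print("Test - RMSE : %.4g" % np.sqrt(metrics.mean_squared_error(y_test, y_test_pred)))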

Train - RMSE : 4600
Test - RMSE : 4616


**Repeat the previous question (Linear Regression) using a `KNeighborsRegressor`. Comment on the processing time and performance of the model in this question.**

Documentation: https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsRegressor.html

In [13]:
import sklearn.neighbors

knn_regression_model = sklearn.neighbors.KNeighborsRegressor(n_neighbors=3)
knn_regression_model.fit(X_train, y_train)

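# Predict on the training data and report the training RMSE.
y_train_pred = knn_regression_model.predict(X_train)
print("Train - RMSE : %.4g" % np.sqrt(metrics.mean_squared_error(y_train, y_train_pred)))

# Predict on the held-out test data and report the test RMSE.
y_test_pred = knn_regression_model.predict(X_test)
print("Test - RMSE : %.4g" % np.sqrt(metrics.mean_squared_error(y_test, y_test_pred)))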

Train - RMSE : 2935
Test - RMSE : 4211


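Compared to Linear Regression, the k-NN regressor is much slower but achieves a lower test RMSE (4211 vs. 4616). Its training RMSE (2935) is far below its test RMSE, however, which suggests that the model with `n_neighbors=3` overfits the training data.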
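
Much of the speed difference comes from prediction: k-NN has to search the training set for the neighbours of every query point, whereas linear regression only evaluates a dot product. A rough way to measure this is sketched below on synthetic data; the data sizes, the seed, and the helper `fit_and_time` are illustrative assumptions, and absolute timings depend on the machine.

```python
import time

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor

# Synthetic regression data, not the sales dataset.
rng = np.random.default_rng(42)
X = rng.normal(size=(2000, 10))
y = X @ rng.normal(size=10) + rng.normal(scale=0.1, size=2000)

def fit_and_time(model, X, y):
    """Return (fit_seconds, predict_seconds) for one model."""
    start = time.perf_counter()
    model.fit(X, y)
    fit_seconds = time.perf_counter() - start

    start = time.perf_counter()
    model.predict(X)
    predict_seconds = time.perf_counter() - start
    return fit_seconds, predict_seconds

for name, model in [("LinearRegression", LinearRegression()),
                    ("KNeighborsRegressor", KNeighborsRegressor(n_neighbors=3))]:
    fit_s, pred_s = fit_and_time(model, X, y)
    print(f"{name}: fit {fit_s:.4f}s, predict {pred_s:.4f}s")
```

Timing fit and predict separately shows where each model spends its time, which a single end-to-end measurement would hide.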