# DATA 201 - Assignment 2

Please use this page http://apps.ecs.vuw.ac.nz/submit/DATA201 for submitssion and submit only this single Jupyter notebook with your code added into it at the appropriate places.

The due date is **Sunday 5th April, before midnight**.

The dataset for this assignment is file **sales_data.csv** which is provided with this notebook.

Please choose menu items *Kernel => Restart & Run All* then *File => Save and Checkpoint* in Jupyter before submission.

## Problem Statement

A retail company wants to understand the customer purchase behaviour (specifically, purchase amount) against various products of different categories. They have shared purchase summary of various customers for selected high volume products from last month.
The data set also contains customer demographics (age, gender, marital status, city_type, stay_in_current_city), product details (product_id and product category) and total purchase_amount from last month.

You need to build a model to predict the purchase amount of customer against various products which will help the company to create personalized offer for customers against different products.

## Data

| Variable	                    | Description                                        |
|-------------------------------|----------------------------------------------------|
|``User_ID``                    |User ID                                             |
|``Product_ID``                 |Product ID                                          |
|``Gender``                     |Sex of User                                         |
|``Age``                        |Age in bins                                         |
|``Occupation``                 |Occupation (Masked)                                 |
|``City_Category``              |Category of the City (A, B, C)                      |
|``Stay_In_Current_City_Years`` |Number of years stay in current city                |
|``Marital_Status``             |Marital Status                                      |
|``Product_Category_1``         |Product Category (Masked)                           |
|``Product_Category_2``         |Product may belongs to other category also (Masked) |
|``Product_Category_3``         |Product may belongs to other category also (Masked) |
|``Purchase``                   |Purchase Amount (Target Variable)                   |

## Evaluation

The root mean squared error (RMSE) will be used for model evaluation.

## Questions and Code

In [1]:
import numpy as np
import pandas as pd

np.random.seed = 42

Load the given dataset.

In [2]:
data = pd.read_csv("sales_data.csv")
data.head()

Unnamed: 0,Age,City_Category,Gender,Marital_Status,Occupation,Product_Category_1,Product_Category_2,Product_Category_3,Product_ID,Purchase,Stay_In_Current_City_Years,User_ID
0,0-17,A,F,0,10,1,6,14,394,15200.0,2,1000001
1,46-50,B,M,1,7,1,8,17,287,19215.0,2,1000004
2,26-35,A,M,1,20,1,2,5,214,15665.0,1,1000005
3,51-55,A,F,0,9,5,8,14,366,5378.0,1,1000006
4,51-55,A,F,0,9,2,3,4,521,13055.0,1,1000006


**1. Is there any missing value? [1 point]**

In [3]:
np.where(data.isnull())

(array([], dtype=int64), array([], dtype=int64))

** 2. Drop attribute `User_ID`. [1 point] **

In [4]:
data = data.drop('User_ID', axis=1)

** 3. Then convert the following categorical attributes below to numerical values with the rule as below. [4 points]**
+ `Gender`: `F`:0, `M`:1
+ `Age`: `0-17`:0, `18-25`:1, `26-35`:2, `36-45`:3, `46-50`:4, `51-55`:5, `55+`:6
+ `Stay_In_Current_City_Years`: `0`:0, `1`:1, `2`:2, `3`:3, `4+`:4

You may want to apply a `lambda` function to each row of a column in the dataframe. Some examples here may be helpful: https://thispointer.com/pandas-apply-apply-a-function-to-each-row-column-in-dataframe/

In [5]:
data['Gender'] = data['Gender'].apply(lambda row: (0 if row == 'F' else 1))

data['Age'] = data['Age'].apply(lambda row: (0 if row == '0-17' else 
                                             (1 if row == '18-25' else 
                                              (2 if row == '26-35' else 
                                              (3 if row == '36-45' else 
                                              (4 if row == '46-50' else 
                                              (5 if row == '51-55' else
                                               (6)
                                              )))))))

data['Stay_In_Current_City_Years'] = data['Stay_In_Current_City_Years'].apply(lambda row: row if row != '4+' else 4)

**4. Randomly split the current data frame into 2 subsets for training (80%) and test (20%). Use *random_state = 42*. [2 points]**

In [6]:
from sklearn.model_selection import train_test_split

In [7]:
train_set, test_set = train_test_split(data, test_size=0.2, random_state=42)

x_train = train_set.drop('Purchase', axis=1)
y_train = train_set['Purchase'].copy()

x_test = test_set.drop('Purchase', axis=1)
y_test = test_set['Purchase'].copy()

**5. Get the list of numerical predictors (all the attributes in the current data frame except the target, `Purchase`) and the list of categorical predictor. [1 point]**

In [8]:
numericPredictors = x_train._get_numeric_data()
categoricalPredictors = ['City_Category']

**6. Create a transformation pipeline including two pipelines handling the following [3 points]**
- Numerical *predictors*: apply Standard Scaling
- Categorical *predictor*: apply One-hot-encoding

You will need to use `ColumnTransformer`. The example in Week 3 lectures may be helpful.

In [9]:
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder
try:
    from sklearn.impute import SimpleImputer # Scikit-Learn 0.20+
except ImportError:
    from sklearn.preprocessing import Imputer as SimpleImputer

In [10]:
num_pipeline = Pipeline([
        ('imputer', SimpleImputer(strategy="mean")),
        ('norm', StandardScaler()),
    ])

pipeline = ColumnTransformer([
    ('num', num_pipeline, list(numericPredictors)),
    ('cat', OneHotEncoder(), categoricalPredictors)
])

**7. Train and use that transformation pipeline to transform the training data (e.g. for a machine learning model). [2 points]**

In [11]:
training_data_prepared = pipeline.fit_transform(x_train)

**8. Use that transformation pipeline to transform the test data (e.g. for testing a machine learning model). [2 points]**

In [12]:
test_data_prepared = pipeline.fit_transform(x_test)

**9. Build a Linear Regression model using the training data after transformation and test it on the test data. Report the RMSE values on the training and test data. [3 points]**

Document: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html

In [13]:
import sklearn.linear_model
from sklearn.metrics import mean_squared_error

linear_regression_model = sklearn.linear_model.LinearRegression()
linear_regression_model.fit(training_data_prepared, y_train)

predictions = linear_regression_model.predict(test_data_prepared)

lrrmse = mean_squared_error(y_test, predictions)

np.sqrt(lrrmse)

4615.575563973856

**10. Repeat Question 9 using a `KNeighborsRegressor`. Comment on the processing time and performance of the model in this question. [1 point]**

Document: https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsRegressor.html

In [14]:
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(training_data_prepared, y_train)
predictions = knn.predict(test_data_prepared)
knnrmse = mean_squared_error(y_test, predictions)
np.sqrt(knnrmse)

5299.727158357741

KNN seems to take more time to execute, and it has a higher RMSE. Thereore, it is less accurate than linear regression in this insance