# Task 2: Recommendation Engine - Skeleton Notebook

This notebook provides a very basic example for the notebook you are expected to submit for Task 2 of the Final Project. The main purpose is that we can try different examples to get a better sense of your approach. Compared to Task 1 (Kaggle Competition), we don't have any objective means to evaluate the recommendations. 

Some general comments:
* You can import any data you need. This particularly includes your cleaned version of the Used Cars dataset; there's no need to show the data cleaning / preprocessing steps in this notebook.
* You can also import your code in the form of an external Python (.py) script. You're actually encouraged to do so to keep this notebook light and uncluttered.
* Please consider this notebook as an example and not as a set of specific requirements. As long as there is a section where we can easily test your solution, it should be fine.

## Overview 
In this notebook, we introduce a personalized recommendation system for new customers of a second-hand car sales website. It helps a new customer explore their interests and find the cars they are most likely to be interested in.<br>

When a customer first visits the website, a list of popular cars is shown. The customer may click on some of the listings in that list.<br>
We record each clicked listing and derive the customer's interests from the click history. Items similar to the clicked listings are then recommended.<br>
To help the customer explore further and to prevent overspecialization, we also mix some popular items into the recommendations.  

## Objectives of a Good Recommendation System
1. Design for new customers <br>
The system can make recommendations for new customers by using a popularity-based engine. 
2. Create a customer profile <br>
The system remembers the customer's browsing history and makes recommendations accordingly. <br>
3. Prevent overspecialization <br>
The system prevents overspecialization by mixing some popular items into the recommendations.<br>
4. Prevent repetition <br>
The system does not recommend items that the customer has already clicked. <br>
5. Robustness <br>
The system adjusts its recommendations as the browsing history accumulates. 

## Assumptions
The following assumptions are made to keep the recommendation system simple and implementable. 
1.	A click on a listing is not random: the customer clicks because they are attracted by features of that car, for example make, color, mileage, or price.<br>
2.	A rating for each car is known in the original database, which makes a popularity-based engine possible. <br>
3.	The click history passed to the recommendation system belongs to one specific customer.



## Setting up the Notebook

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import sys  
sys.path.insert(0, r'C:\Users\wuhan\Desktop\cs5228project\src')
import ItemSimilarity1 as ItemSimilarity

In [3]:
%matplotlib notebook
import pandas as pd
import numpy as np

## Load the Data

Two datasets are generated by `Data_Processing.ipynb` from the Task 1 training dataset, with an added column "Rating": <br>
1. df_non_encoded: the original dataset. <br>
2. df_encoded: the encoded data used for item-item similarity analysis. <br>

Row indices of recommended items are computed on df_encoded and then applied to df_non_encoded to return the actual recommended listings. 
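The index hand-off only works if both frames hold the same rows in the same order. A minimal sketch with toy data (the two small frames and their values are made up for illustration and are not taken from the project files):

```python
import pandas as pd

# Toy stand-ins for df_non_encoded / df_encoded (hypothetical values).
toy_non_encoded = pd.DataFrame({"make": ["bmw", "toyota", "jaguar"],
                                "Rating": [45, 48, 82]})
toy_encoded = pd.DataFrame({"make": [0, 2, 1],
                            "Rating": [45, 48, 82]})

# Both frames must have the same number of rows, in the same order.
assert len(toy_encoded) == len(toy_non_encoded)

# Row positions chosen on the encoded frame ...
suggested_positions = [2, 0]

# ... select the readable rows from the non-encoded frame.
toy_result = toy_non_encoded.iloc[suggested_positions]
print(toy_result)
```

Because `iloc` is positional, the lookup does not depend on the index labels of either frame, only on the row order.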


In [4]:
df_encoded = pd.read_excel(r'C:\Users\wuhan\Desktop\cs5228project\Train_encoded.xlsx')
df_non_encoded = pd.read_excel(r'C:\Users\wuhan\Desktop\cs5228project\Train_non_encoded.xlsx')
df_non_encoded.head(2)
# print(df_encoded.shape)
# print(df_non_encoded.shape)

Unnamed: 0,make,model,manufactured,type_of_vehicle,transmission,curb_weight,power,fuel_type,engine_cap,no_of_owners,...,coe,road_tax,dereg_value,mileage,omv,arf,opc_scheme,price,Date_Used,Rating
0,bmw,320i,2013,luxury sedan,auto,1560.0,135.0,petrol,1997,1,...,77100.0,1210.0,47514.0,73000.0,45330,50462,non_opc,71300,2897,45
1,toyota,hiace,2014,van,manual,1740.0,133.622143,diesel,2982,3,...,10660.0,1341.544993,3648.0,110112.0,27502,1376,non_opc,43800,2484,48


## Computing the Top Recommendations
In this recommendation system, the method `get_top_recommendations()` returns the top K recommendations as a `pd.DataFrame`, given the listing the customer has just clicked. The method is a thin wrapper around code in an external Python script, `ItemSimilarity.py`. 

In [5]:
def get_top_recommendations(row_id, popularization=0.2, **kwargs) -> pd.DataFrame:
    
    #####################################################
    ## Initialize the required parameters
    
    # The number of recommendations to return
    # Additional input parameters are up to you
    # Note: k is parsed but not used below; this notebook sets the number
    # of recommendations when creating Recommendations(K) in Step 1
    k = kwargs.get('k')
    
    #####################################################
    ## Compute your recommendations
    #
    # This is where your magic happens. Of course, you can call methods
    # defined in this notebook or in external Python (.py) scripts
    # `recom` is the ItemSimilarity.Recommendations instance created in Step 1
    
    # Calculate item-item similarities
    Item_similarities = recom.calc_item_similarities(df_encoded)
    
    # Get the row indices of the suggested items
    row_id_suggested = recom.calc_rating_item(row_id, Item_similarities, df_encoded, popularization=popularization)
    
    # Look up the recommended rows in the non-encoded data
    df_result = df_non_encoded.iloc[row_id_suggested]
    
    # Return the dataset with the k recommendations
    return df_result


## Functions used in `ItemSimilarity.py`
`topk_popular()`: Returns the top K popular items based on "Rating". <br>
`calc_item_similarities()`: Returns `Item_similarities`, computed with cosine similarity.<br>
`calc_rating_item()`: Returns the recommended items. For the first click, the K recommendations consist of `(1-popularization)*K` similar items and `popularization*K` popular items to prevent overspecialization, where popularization = 0.2 by default. <br>
For later clicks, it returns `(1-popularization)*K` similar items based on the latest click. For the `popularization*K` popular items, we use `valued_feature()` on the click history to obtain the top 5 most frequent features together with the most frequent value of each. The original dataset is filtered by these feature values, and the `popularization*K` popular items are then chosen from the filtered dataset by "Rating". <br>
`valued_feature()`: Based on `click_history()`, returns `feature_key_top`, a dictionary whose keys are the top 5 most frequent features and whose values are the most frequent value of each feature.<br>
`click_history()`: Stores the click history of the customer.<br>
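As an illustration of the item-item step, here is a minimal sketch of cosine similarity on an encoded feature matrix. The code of `ItemSimilarity.py` is not shown in this notebook, so the function names `cosine_item_similarities` and `most_similar` and the toy matrix are assumptions for illustration, not the project's implementation:

```python
import numpy as np


def cosine_item_similarities(features):
    """Return an (n_items, n_items) matrix of cosine similarities between rows."""
    features = np.asarray(features, dtype=float)
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # avoid division by zero for all-zero rows
    unit = features / norms
    return unit @ unit.T


def most_similar(similarities, item, n, exclude=()):
    """Indices of the n items most similar to `item`, skipping `item` and `exclude`."""
    order = np.argsort(-similarities[item])  # most similar first
    skip = set(exclude) | {item}
    return [int(i) for i in order if int(i) not in skip][:n]


# Toy encoded feature matrix: one row per listing (hypothetical values).
toy = np.array([[1.0, 0.0, 1.0],
                [1.0, 0.1, 0.9],
                [0.0, 1.0, 0.0],
                [0.9, 0.0, 1.0]])
sims = cosine_item_similarities(toy)
print(most_similar(sims, item=0, n=2))
print(most_similar(sims, item=0, n=2, exclude=[3]))
```

The `exclude` argument is where a click history could be passed so that already viewed listings are never returned.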


## Testing the Recommendation Engine

This will be the main part of your notebook to allow for testing your solutions. Most basically, for a given listing (defined by the row id in your input dataframe), we would like to see the recommendations you make. So however you set up your notebook, it should have at least a comparable section that will allow us to run your solution for different inputs.

### Step 1: For new customers
The top K popular cars are suggested by the system when a customer browses the website for the first time.

In [6]:
# Return the top K cars by rating from the current database
K = 5
recom = ItemSimilarity.Recommendations(K)
top_K_index, top_K_popular_car = recom.topk_popular(df_non_encoded)
print("Top K popular cars:")
print("K = {}".format(K))
pd.set_option('display.max_columns', None)
pd.DataFrame(top_K_popular_car)

Top K popular cars:
K = 5


Unnamed: 0,make,model,manufactured,type_of_vehicle,transmission,curb_weight,power,fuel_type,engine_cap,no_of_owners,depreciation,coe,road_tax,dereg_value,mileage,omv,arf,opc_scheme,price,Date_Used,Rating
112,bmw,x5,2018,suv,auto,2060.0,250.0,petrol,2998,2,32310.0,32909.0,2384.0,115445.0,73410.018874,83555,122399,non_opc,328700,1028,99
124,honda,accord,2008,luxury sedan,auto,1505.0,115.0,petrol,1997,4,8590.0,17706.0,1573.0,9479.0,207000.0,33066,33066,non_opc,25300,4582,99
162,mercedes-benz,s300l,2012,luxury sedan,auto,1850.0,170.0,petrol,2997,2,25350.0,96101.0,2382.0,61654.0,78500.0,84972,84972,non_opc,90000,3166,99
485,hyundai,elantra,2011,mid-sized sedan,auto,1267.0,95.6,petrol,1591,1,7760.0,41937.849024,738.0,44238.116556,91125.0,14512,14512,non_opc,42700,3650,99
500,honda,jazz,2008,hatchback,auto,1115.0,88.0,petrol,1497,5,6710.0,27571.0,889.0,20055.0,131500.0,21533,21533,non_opc,53700,4723,99


## Comments: Compare Design Objectives and Results

Objective 1: The recommendation system can make recommendations for new customers by using a popularity-based engine.

Result 1: Popular items are recommended based on "Rating". The maximum rating is 99, and all five recommended cars have it.
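A popularity ranking of this kind could be sketched as follows. The actual `topk_popular()` lives in `ItemSimilarity.py` and is not shown here, so `topk_popular_sketch` and the toy frame are hypothetical:

```python
import pandas as pd


def topk_popular_sketch(df, k):
    """Return (index labels, rows) of the k listings with the highest Rating."""
    top = df.nlargest(k, "Rating")
    # With a default RangeIndex, the labels equal the row positions.
    return list(top.index), top


toy = pd.DataFrame({"make": ["bmw", "honda", "jaguar", "kia"],
                    "Rating": [99, 45, 82, 99]})
positions, rows = topk_popular_sketch(toy, k=2)
print(positions)
print(rows)
```

When many listings share the maximum rating, as in the output above, the choice among them depends on the tie-breaking rule of the ranking, so the top K is not necessarily unique.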

### Step 2: Pick a Sample Listing as Input 
After a while, the user clicks a first listing. The listing can be any item in the database, including the top K popular cars.

In [7]:
# Pick a row id of choice
row_id = 20
row = df_non_encoded.iloc[row_id]
print("row_id = {}".format(row_id))
pd.DataFrame([row])

row_id = 20


Unnamed: 0,make,model,manufactured,type_of_vehicle,transmission,curb_weight,power,fuel_type,engine_cap,no_of_owners,depreciation,coe,road_tax,dereg_value,mileage,omv,arf,opc_scheme,price,Date_Used,Rating
20,jaguar,xf,2010,luxury sedan,auto,1735.0,175.0,petrol,2967,4,9160.0,37502.0,2818.0,30711.0,73410.018874,54019,54019,non_opc,82500,4191,82


### Compute and Display the recommendations

1. Call the method `get_top_recommendations()` to return the K suggested recommendations as a `pd.DataFrame`. 
2. The K recommendations consist of `(1-popularization)*K` similar items and `popularization*K` popular items to prevent overspecialization, where popularization = 0.2 by default. 
3. The browsing history of the user is stored in `click_history`.
4. The K recommendations do not include items in `click_history`, which the customer has already viewed. 
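The split and the exclusion rule can be sketched in plain Python. This is an illustration under assumptions: the helper names `split_k` and `merge_recommendations`, the rounding rule, and the ranked candidate lists are hypothetical and not taken from `ItemSimilarity.py`:

```python
def split_k(k, popularization=0.2):
    """Split k into (n_similar, n_popular). The rounding rule is an assumption."""
    n_popular = int(round(popularization * k))
    return k - n_popular, n_popular


def merge_recommendations(similar_ranked, popular_ranked, click_history,
                          k, popularization=0.2):
    """Pick n_similar similar items, then fill up to k with popular items,
    never returning an item that was already clicked or already picked."""
    n_similar, _ = split_k(k, popularization)
    seen = set(click_history)
    picked = []
    for candidate in similar_ranked:
        if len(picked) >= n_similar:
            break
        if candidate not in seen:
            picked.append(candidate)
            seen.add(candidate)
    for candidate in popular_ranked:
        if len(picked) >= k:
            break
        if candidate not in seen:
            picked.append(candidate)
            seen.add(candidate)
    return picked


# Hypothetical ranked candidate lists (best first); 20 is the clicked item.
similar_ranked = [20, 10975, 9252, 13784, 13934, 4577]
popular_ranked = [4923, 112]
print(split_k(5))  # (4, 1)
print(merge_recommendations(similar_ranked, popular_ranked, [20], k=5))
```

With K = 5 and popularization = 0.2 this gives 4 similar items and 1 popular item, and the clicked item 20 is skipped even though it ranks first among the similar candidates.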

In [8]:
df_recommendations = get_top_recommendations(row_id, k=K, popularization=0.2)
pd.DataFrame(df_recommendations)

click_history [20]


Unnamed: 0,make,model,manufactured,type_of_vehicle,transmission,curb_weight,power,fuel_type,engine_cap,no_of_owners,depreciation,coe,road_tax,dereg_value,mileage,omv,arf,opc_scheme,price,Date_Used,Rating
10975,jaguar,xf,2008,luxury sedan,auto,1735.0,175.0,petrol,2967,3,8510.0,33377.0,3052.0,23446.0,140000.0,63601,63601,non_opc,65800,4808,1
9252,jaguar,xf,2008,luxury sedan,auto,1735.0,175.0,petrol,2967,3,8610.0,33377.0,3052.0,23565.0,139000.0,63601,63601,non_opc,66900,4808,94
13784,jaguar,xf,2008,luxury sedan,auto,1735.0,175.0,petrol,2967,6,7910.0,37906.0,3287.0,25776.0,100300.0,55806,55806,non_opc,59200,4903,92
13934,jaguar,xf,2011,luxury sedan,auto,1810.0,140.0,diesel,2179,2,10580.0,41937.849024,2096.0,44238.116556,73410.018874,54583,54583,non_opc,116500,3531,79
4923,porsche,cayenne,2004,suv,auto,2245.0,184.0,petrol,3189,3,9790.0,71811.0,4023.0,23098.0,135000.0,80715,88787,non_opc,34600,6186,99


## Comments: Compare Design Objectives and Results

1. Objective 3: Prevent overspecialization <br>
The K recommendations should consist of `(1-popularization)*K` similar items and `popularization*K` popular items, where popularization = 0.2 by default.<br>
Result 3: With K = 5, the system indeed returns 4 similar items and 1 popular item with "Rating" = 99.<br>
2. Objective 2: Create a customer profile <br>
The browsing history of the user is stored in `click_history`.<br>
Result 2: `click_history` is updated to `click_history` = [20]. <br>
3. Objective 4: Prevent repetition <br>
The K recommendations do not include the items in `click_history`.<br>
Result 4: The objective is met, since item 20 does not appear in the recommendation list.

### Step 3: Pick another Sample Listing as Input 
In Step 3, we test the robustness of the recommendation system. <br> 
In other words, we check that the recommendation system adjusts the recommended items based on the accumulated browsing history. 

In [9]:
# Pick another row id of choice
row_id = 9252
row = df_non_encoded.iloc[row_id]
pd.DataFrame([row])

Unnamed: 0,make,model,manufactured,type_of_vehicle,transmission,curb_weight,power,fuel_type,engine_cap,no_of_owners,depreciation,coe,road_tax,dereg_value,mileage,omv,arf,opc_scheme,price,Date_Used,Rating
9252,jaguar,xf,2008,luxury sedan,auto,1735.0,175.0,petrol,2967,3,8610.0,33377.0,3052.0,23565.0,139000.0,63601,63601,non_opc,66900,4808,94


`feature_key_top` is a dictionary whose keys are the top 5 most frequent features and whose values are the most frequent value of each feature.<br>
`click_history` is a list that contains the browsing history of the customer.


In [10]:
df_recommendations = get_top_recommendations(row_id, k=K, popularization=0.2)
pd.DataFrame(df_recommendations)

click_history [20, 9252]
feature_key_top = {'make': 11, 'model': 19, 'type_of_vehicle': 0, 'transmission': 0, 'curb_weight': 9}


Unnamed: 0,make,model,manufactured,type_of_vehicle,transmission,curb_weight,power,fuel_type,engine_cap,no_of_owners,depreciation,coe,road_tax,dereg_value,mileage,omv,arf,opc_scheme,price,Date_Used,Rating
10975,jaguar,xf,2008,luxury sedan,auto,1735.0,175.0,petrol,2967,3,8510.0,33377.0,3052.0,23446.0,140000.0,63601,63601,non_opc,65800,4808,1
13784,jaguar,xf,2008,luxury sedan,auto,1735.0,175.0,petrol,2967,6,7910.0,37906.0,3287.0,25776.0,100300.0,55806,55806,non_opc,59200,4903,92
4577,jaguar,xf,2011,luxury sedan,auto,1735.0,175.0,petrol,2967,2,10870.0,41937.849024,2348.0,44238.116556,133000.0,54755,54755,non_opc,119700,3552,45
13934,jaguar,xf,2011,luxury sedan,auto,1810.0,140.0,diesel,2179,2,10580.0,41937.849024,2096.0,44238.116556,73410.018874,54583,54583,non_opc,116500,3531,79
8341,jaguar,xf,2014,luxury sedan,auto,1873.0,177.0,petrol,1999,1,14410.0,68668.0,1212.0,64299.0,51500.0,49476,61267,non_opc,90200,2432,89


In [11]:
# Pick another row id of choice
row_id = 4577
row = df_non_encoded.iloc[row_id]
pd.DataFrame([row])

Unnamed: 0,make,model,manufactured,type_of_vehicle,transmission,curb_weight,power,fuel_type,engine_cap,no_of_owners,depreciation,coe,road_tax,dereg_value,mileage,omv,arf,opc_scheme,price,Date_Used,Rating
4577,jaguar,xf,2011,luxury sedan,auto,1735.0,175.0,petrol,2967,2,10870.0,41937.849024,2348.0,44238.116556,133000.0,54755,54755,non_opc,119700,3552,45


In [12]:
df_recommendations = get_top_recommendations(row_id, k=K, popularization=0.2)
pd.DataFrame(df_recommendations)

click_history [20, 9252, 4577]
feature_key_top = {'make': 11, 'model': 19, 'type_of_vehicle': 0, 'transmission': 0, 'curb_weight': 9}


Unnamed: 0,make,model,manufactured,type_of_vehicle,transmission,curb_weight,power,fuel_type,engine_cap,no_of_owners,depreciation,coe,road_tax,dereg_value,mileage,omv,arf,opc_scheme,price,Date_Used,Rating
13934,jaguar,xf,2011,luxury sedan,auto,1810.0,140.0,diesel,2179,2,10580.0,41937.849024,2096.0,44238.116556,73410.018874,54583,54583,non_opc,116500,3531,79
2447,jaguar,xf,2014,luxury sedan,auto,1810.0,147.0,diesel,2179,2,13680.0,71889.0,2096.0,61043.0,90000.0,51013,53824,non_opc,84200,2410,10
2348,jaguar,xf,2013,luxury sedan,auto,1873.0,177.0,petrol,1999,2,15100.0,65001.0,1212.0,57271.0,120000.0,50957,63723,non_opc,83700,2664,77
14542,jaguar,xf,2011,luxury sedan,auto,1735.0,175.0,petrol,2967,3,14476.872525,75889.0,2348.0,30050.0,139000.0,56642,56642,non_opc,43900,3637,65
13784,jaguar,xf,2008,luxury sedan,auto,1735.0,175.0,petrol,2967,6,7910.0,37906.0,3287.0,25776.0,100300.0,55806,55806,non_opc,59200,4903,92


In [13]:
# Pick another row id of choice
row_id = 4923
row = df_non_encoded.iloc[row_id]
pd.DataFrame([row])

Unnamed: 0,make,model,manufactured,type_of_vehicle,transmission,curb_weight,power,fuel_type,engine_cap,no_of_owners,depreciation,coe,road_tax,dereg_value,mileage,omv,arf,opc_scheme,price,Date_Used,Rating
4923,porsche,cayenne,2004,suv,auto,2245.0,184.0,petrol,3189,3,9790.0,71811.0,4023.0,23098.0,135000.0,80715,88787,non_opc,34600,6186,99


In [14]:
df_recommendations = get_top_recommendations(row_id, k=K, popularization=0.2)
pd.DataFrame(df_recommendations)

click_history [20, 9252, 4577, 4923]
feature_key_top = {'transmission': 0, 'curb_weight': 9, 'power': 9, 'fuel_type': 0, 'engine_cap': 9}


Unnamed: 0,make,model,manufactured,type_of_vehicle,transmission,curb_weight,power,fuel_type,engine_cap,no_of_owners,depreciation,coe,road_tax,dereg_value,mileage,omv,arf,opc_scheme,price,Date_Used,Rating
12966,porsche,cayenne,2005,suv,auto,2245.0,184.0,petrol,3189,6,12330.0,60519.0,4023.0,25517.0,73410.018874,73982,81381,non_opc,57200,5827,17
8247,porsche,cayenne,2007,suv,auto,2245.0,213.0,petrol,3598,5,15640.0,52473.0,4651.0,31139.0,127000.0,83811,92193,non_opc,102100,5128,63
3349,porsche,cayenne,2009,suv,auto,2245.0,213.0,petrol,3598,4,11830.0,37906.0,3986.0,25901.0,133000.0,84001,84001,non_opc,88900,4472,47
1861,porsche,cayenne,2008,suv,auto,2245.0,213.0,petrol,3598,1,11740.0,36888.0,4651.0,25397.0,133323.0,85134,85134,non_opc,88900,4859,83
1344,maserati,quattroporte,2010,luxury sedan,auto,1990.0,295.0,petrol,4244,5,13320.0,37502.0,5198.0,30639.0,135000.0,111145,111145,non_opc,119700,4149,99


## Comments: Compare Design Objectives and Results

Objective 5: Robustness. The recommendation system can adjust the recommended items based on the accumulated browsing history.<br>
Result 5: <br>
The final recommendation contains 4 similar items based on the latest click (row_id = 4923).<br>
The suggested popular item is where the accumulated history comes in: we want the popular item to reflect the customer's interests. Based on the click history, `valued_feature()` returns the top 5 most frequent features with the most frequent value of each, stored in `feature_key_top`. The original dataset is then filtered by `feature_key_top`, and the suggested popular items are chosen from the filtered dataset by "Rating".<br>

In the example above, `row_id` = [20, 9252, 4577] are similar items (all Jaguar XF), while `row_id` = 4923 is a popular item with "Rating" = 99 (a Porsche Cayenne) that differs from them. Compare `feature_key_top` for the history [20, 9252, 4577] with that for [20, 9252, 4577, 4923]: the features used to filter the original dataset change from make, model, type_of_vehicle, transmission, curb_weight to transmission, curb_weight, power, fuel_type, engine_cap. This is because `row_id` = 4923 differs from the earlier clicks in make, model, and vehicle type, so those features drop out of the top 5. The change in customer interests is thus captured in `feature_key_top`. <br>
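The way `feature_key_top` shifts can be reproduced with a small sketch. It assumes that features are ranked by how often their most frequent value occurs in the click history and that ties keep the column order; `valued_feature_sketch` and the toy encoded rows are hypothetical, since the real `valued_feature()` is not shown in this notebook:

```python
from collections import Counter


def valued_feature_sketch(clicked_rows, top_n=5):
    """clicked_rows: list of dicts (feature -> encoded value), one per click.
    Returns {feature: most frequent value} for the top_n features whose
    most frequent value occurs most often."""
    best = {}
    for feature in clicked_rows[0]:
        counts = Counter(row[feature] for row in clicked_rows)
        value, count = counts.most_common(1)[0]
        best[feature] = (value, count)
    # sorted() is stable, so features with equal counts keep column order.
    ranked = sorted(best, key=lambda f: best[f][1], reverse=True)
    return {f: best[f][0] for f in ranked[:top_n]}


# Toy click history with encoded values (hypothetical, three clicks).
clicks = [
    {"make": 11, "model": 19, "manufactured": 3, "transmission": 0, "fuel_type": 0},
    {"make": 11, "model": 19, "manufactured": 2, "transmission": 0, "fuel_type": 0},
    {"make": 7, "model": 4, "manufactured": 1, "transmission": 0, "fuel_type": 0},
]
print(valued_feature_sketch(clicks[:2], top_n=3))
print(valued_feature_sketch(clicks, top_n=3))
```

After two clicks the leading features are make, model, and transmission; once the third click disagrees on make and model, transmission and fuel_type move to the front, which mirrors the shift seen in the notebook output.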


## Side Note
The encoded values in `feature_key_top` correspond to the following original values:<br>

`'make': 11` corresponds to `'make': jaguar` <br>

`'model': 19` corresponds to `'model': xf`<br>

`'type_of_vehicle': 0` corresponds to `'type_of_vehicle': luxury sedan`<br>

`'transmission': 0` corresponds to `'transmission': auto` <br>

`'curb_weight': 9` corresponds to `'curb_weight': Range(1000, 3000)`<br>

`'power': 9` corresponds to `'power': Range(100, 350)`<br>

`'fuel_type': 0` corresponds to `'fuel_type': petrol`<br>

`'engine_cap': 9` corresponds to `'engine_cap': Range(1500, 4500)`<br>
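For readability, the printed `feature_key_top` could be decoded before display. The sketch below uses only the mappings listed in the side note above; the `DECODER` table and the function `decode_feature_key_top` are hypothetical helpers, not part of `ItemSimilarity.py`:

```python
# Partial decoding table, copied from the side note above.
DECODER = {
    "make": {11: "jaguar"},
    "model": {19: "xf"},
    "type_of_vehicle": {0: "luxury sedan"},
    "transmission": {0: "auto"},
    "curb_weight": {9: "Range(1000, 3000)"},
    "power": {9: "Range(100, 350)"},
    "fuel_type": {0: "petrol"},
    "engine_cap": {9: "Range(1500, 4500)"},
}


def decode_feature_key_top(feature_key_top, decoder=DECODER):
    """Replace encoded values by readable labels; unknown codes are kept as-is."""
    return {feature: decoder.get(feature, {}).get(code, code)
            for feature, code in feature_key_top.items()}


encoded = {"make": 11, "model": 19, "type_of_vehicle": 0,
           "transmission": 0, "curb_weight": 9}
decoded = decode_feature_key_top(encoded)
print(decoded)
```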

## Future Improvements

One improvement to the current recommendation system concerns the case where the customer shows no clear preference. In other words, `feature_key_top` cannot identify the most frequent features because the most frequent value of every feature has the same count. In this case, we can recommend popular items or random items to the customer.
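One possible shape for this fallback is sketched below. It is not implemented in the current system; the helper names, the tie test, and the candidate lists are assumptions for illustration:

```python
def has_clear_preference(top_counts):
    """top_counts maps feature -> count of its most frequent clicked value.
    If every feature has the same count, no feature stands out."""
    return len(set(top_counts.values())) > 1


def choose_popular_pool(top_counts, filtered_popular, global_popular):
    """Use the feature-filtered popular items only when a preference exists
    and the filter left something; otherwise fall back to global popularity."""
    if has_clear_preference(top_counts) and filtered_popular:
        return filtered_popular
    return global_popular


# Hypothetical counts: in the first case every feature ties.
no_preference = {"make": 1, "model": 1, "transmission": 1}
some_preference = {"make": 2, "model": 2, "transmission": 3}
print(choose_popular_pool(no_preference, [1344], [112, 124]))
print(choose_popular_pool(some_preference, [1344], [112, 124]))
```

Falling back to the global list also covers the case where the filter by `feature_key_top` leaves no candidates at all.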