# Task 2: Recommendation Engine - Skeleton Notebook

This notebook provides a very basic example of the notebook you are expected to submit for Task 2 of the Final Project. Its main purpose is to let us try different examples to get a better sense of your approach. Unlike Task 1 (Kaggle Competition), there is no objective way to evaluate the recommendations.

Some general comments:
* You can import any data you need. This particularly includes your cleaned version of the Used Cars dataset; there's no need to show the data cleaning / preprocessing steps in this notebook.
* You can also import your code in the form of an external Python (.py) script. You're actually encouraged to do so to keep this notebook light and uncluttered.
* Please treat this notebook as an example, not as a set of specific requirements. As long as there is a section where we can easily test your solution, it should be fine.

## Setting up the Notebook

In [1]:
import numpy as np
import pandas as pd
from pandarallel import pandarallel

In [2]:
%%capture
cd ..

In [3]:
import cleaner
import constants as const
import similarity as sim
import utils

In [4]:
pandarallel.initialize()

INFO: Pandarallel will run on 12 workers.
INFO: Pandarallel will use standard multiprocessing data transfer (pipe) to transfer data between the main process and workers.


In [5]:
df_train_raw = pd.read_csv(const.TRAIN_PATH)
df_train = cleaner.clean_preliminary(df_train_raw)

## Load the Data

For this example, we use a simplified version of the dataset with only 100 data samples, each with only 6 features.

In [6]:
df_sample_raw = pd.read_csv(const.USED_CARS_SIMPLIFIED_PATH)
df_sample = cleaner.clean_preliminary(df_sample_raw)
df_sample.head()

Unnamed: 0,index,listing_id,make,power,engine_cap,mileage,price
0,0,1006025,bmw,135.0,1997.0,,82300
1,1,994672,land rover,202.0,2993.0,25843.0,427900
2,2,921142,honda,95.6,1496.0,2000.0,109800
3,3,1008328,bmw,185.0,1998.0,57386.0,166600
4,4,1010661,UNKNOWN,96.0,1498.0,76000.0,59400


## Computing the Top Recommendations

The method `get_top_recommendations()` shows an example of how to get the top recommendations for a given data sample (data sample = row in the dataframe of the dataset). The input is a row from the dataset and a list of optional input parameters, which will depend on your approach; a parameter `k` for the number of returned recommendations seems useful, though.

The output should be a `pd.DataFrame` containing the recommendations. The output dataframe should have the same columns as the row + any additional columns you deem important (e.g., any score or tags that you might want to add to your recommendations).

In principle, the method `get_top_recommendations()` may be imported from an external Python (.py) script as well.

In [None]:
from sentence_transformers import SentenceTransformer

In [None]:
encoder = SentenceTransformer('paraphrase-MiniLM-L6-v2')

In [None]:
vec_file = "task1/data/instance_vec.npy"
tmp_item = np.load(vec_file, allow_pickle=True).item()
series_list = tmp_item["series_list"]
vec_array = tmp_item["vec_list"]
norm_vec_array = np.linalg.norm(vec_array, axis = 1)
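
The cell above expects `instance_vec.npy` to hold a pickled dictionary with the keys `series_list` and `vec_list`. How that file was produced is not shown in this notebook. The following is a minimal sketch of one way such a file could be written and read back, using random vectors in place of real sentence embeddings. The helper names, the toy listings, and the temporary path are assumptions for illustration only.

```python
import os
import tempfile

import numpy as np
import pandas as pd


def save_instance_vectors(path, series_list, vec_array):
    """Store the rows and their vectors together as one pickled dict (hypothetical helper)."""
    np.save(path, {"series_list": series_list, "vec_list": vec_array}, allow_pickle=True)


def load_instance_vectors(path):
    """Read the dict back; .item() unwraps the 0-d object array created by np.save."""
    item = np.load(path, allow_pickle=True).item()
    return item["series_list"], item["vec_list"]


# Toy stand-ins: three listings and random 4-dimensional "embeddings"
toy_df = pd.DataFrame({"listing_id": [1, 2, 3], "make": ["bmw", "honda", "honda"]})
toy_series = [toy_df.iloc[i] for i in range(len(toy_df))]
toy_vecs = np.random.default_rng(0).normal(size=(len(toy_df), 4))

with tempfile.TemporaryDirectory() as tmp_dir:
    toy_path = os.path.join(tmp_dir, "instance_vec.npy")
    save_instance_vectors(toy_path, toy_series, toy_vecs)
    loaded_series, loaded_vecs = load_instance_vectors(toy_path)

# One norm per stored vector, as computed for norm_vec_array in the cell above
toy_norms = np.linalg.norm(loaded_vecs, axis=1)
```

Keeping the rows and their vectors in the same file means that position `i` in `vec_list` always corresponds to position `i` in `series_list`, which is what the index lookup in the fast path relies on.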

In [7]:
DEFAULT_K = 10
DEFAULT_NOISE_LEVEL = 0.05
cache = {}


def get_top_recommendations(row, use_fast_method=0, **kwargs) -> pd.DataFrame:
    #####################################################
    ## Initialize the required parameters
    
    # Passing the number of recommendations (k) is recommended;
    # any additional input parameters are up to you
    k = kwargs.get('k', DEFAULT_K)
    noise_level = kwargs.get('noise_level', DEFAULT_NOISE_LEVEL)
    assert k >= 1, 'k should be >= 1'
    assert noise_level >= 0.0, 'noise should not be negative'
    #####################################################
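    if use_fast_method == 1:
        # Describe the row as one short sentence per feature, e.g. "make is honda."
        tmp_list = []
        for key in row.keys():
            tmp_list.append("{} is {}.".format(key, row[key]))

        query_row = " ".join(tmp_list)
        query_emb = encoder.encode([query_row]).squeeze()
        # Cosine similarity between the query and every precomputed embedding
        output = ((vec_array @ query_emb)/norm_vec_array)/np.linalg.norm(query_emb)
        # Indices of the k most similar listings, best match first
        index_list = output.argsort()[-k:][::-1]

        ans_list = []
        for index in index_list:
            ans_list.append(series_list[index])

        # Transpose so that each recommendation is a row, matching the other branch
        return pd.concat(ans_list, axis=1).T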
    
    if row.listing_id in cache:
        sim_df = cache[row.listing_id]
    else:
        sim_df = sim.compute_similarities(pd.DataFrame([row]), df_train, is_test=True)
        cache[row.listing_id] = sim_df

    noise = np.random.uniform(-noise_level, noise_level, size=len(df_train))
    sim_df_noisy = sim_df + noise
    most_similar_indices = utils.get_top_k_most_similar(sim_df_noisy, k=k).iloc[0]
    df_result = df_train.iloc[most_similar_indices]
        
    # Return the dataset with the k recommendations
    return df_result
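
The fast path in `get_top_recommendations()` ranks listings by the cosine similarity between the query embedding and a matrix of precomputed embeddings. Here is a minimal standalone sketch of that ranking on toy vectors, so it can be run without the encoder or the vector file. The function `top_k_cosine` and the toy arrays are hypothetical and not part of the project code.

```python
import numpy as np


def top_k_cosine(query_vec, matrix, k):
    """Return indices and scores of the k rows of `matrix` most similar to `query_vec`."""
    row_norms = np.linalg.norm(matrix, axis=1)
    scores = (matrix @ query_vec) / (row_norms * np.linalg.norm(query_vec))
    top = np.argsort(scores)[::-1][:k]  # highest similarity first
    return top, scores[top]


# Toy vectors: row 0 points in the query's direction, row 2 is orthogonal to it
toy_matrix = np.array([[2.0, 0.0], [1.0, 1.0], [0.0, 3.0]])
toy_query = np.array([1.0, 0.0])
top_idx, top_scores = top_k_cosine(toy_query, toy_matrix, k=2)
print(top_idx, top_scores)
```

Because cosine similarity divides by both norms, row 0 scores highest even though it is twice as long as the query; only the direction matters.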

## Testing the Recommendation Engine

This will be the main part of your notebook to allow for testing your solutions. Most basically, for a given listing (defined by the row id in your input dataframe), we would like to see the recommendations you make. So however you set up your notebook, it should have at least a comparable section that will allow us to run your solution for different inputs.

### Pick a Sample Listing as Input

In [8]:
# Pick a row id of choice
row_id = 10
#row_id = 20
#row_id = 30
#row_id = 40
#row_id = 50

row = df_sample.iloc[row_id]
pd.DataFrame([row])

Unnamed: 0,index,listing_id,make,power,engine_cap,mileage,price
10,10,1020216,honda,73.0,1317.0,22703.0,78000


### Compute and Display the Recommendations

Since the method `get_top_recommendations()` returns a `pd.DataFrame`, it's easy to display the result.

In [9]:
# Number of similar rows to show
k = 5

# Noise level. The higher the value, the noisier the suggestions. If noise_level=0,
# get_top_recommendations becomes deterministic
noise_level = 10

df_recommendations = get_top_recommendations(row, k=k, noise_level=noise_level)
df_recommendations.head(k)

Unnamed: 0,index,listing_id,title,make,model,description,manufactured,type_of_vehicle,category,transmission,...,dereg_value,mileage,omv,arf,opc_scheme,features,accessories,price,registered_date,car_age
2317,2383,1030661,Honda Fit 1.3A G F-Package (OPC),honda,fit,able to convert to normal plate option! 100% l...,2019.0,0,"{low mileage car, opc car, parf car}",0,...,19962.0,20000.0,15965.0,5965.0,1,"1.3l earth dreams i-vtec engine, cvt auto tran...","upgraded sports rims, leather seats, touchscre...",61500,2019-07-23,2.0
146,149,1028117,Honda Jazz 1.5A RS,honda,jazz,1 owner. low mileage. 5 years warranty. kah mo...,2018.0,0,"{low mileage car, parf car}",0,...,34353.0,23000.0,17135.0,17135.0,0,"128bhp, 1.5l 4 cylinders inline 16 valve dohc ...","leather upholstery, keyless entry/start, facto...",83400,2018-10-19,3.0
12654,12933,1030273,Honda Freed 1.5A G (COE till 09/2024),honda,freed,"consignment unit, trade-in, finance available,...",2008.0,3,{coe car},0,...,8862.0,,20631.0,20631.0,0,"1.5l 16 valves dohc vvt-i engine, airbags, abs...","leather seats, sports rim.",28400,2009-09-15,13.0
10447,10686,1026157,Honda Fit 1.3A G Skyroof (COE till 11/2023),honda,fit,UNKNOWN,2008.0,0,{coe car},0,...,6617.0,,14338.0,14338.0,0,view specs of the honda fit,UNKNOWN,18200,2008-11-25,13.0
1670,1719,1024061,Honda Vezel 1.5A X,honda,vezel,beautiful unit in midnight blue. low mileage c...,2016.0,2,"{premium ad car, parf car}",0,...,33435.0,70000.0,20700.0,10980.0,0,original condition. view specs of the honda vezel,pioneer audio system with bluetooth and mic. s...,60000,2016-07-28,5.0
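
To get a feel for what `noise_level` does, the sketch below adds uniform noise to a handful of made-up similarity scores before picking the top `k`, mirroring the `np.random.uniform(-noise_level, noise_level, ...)` step in `get_top_recommendations()`. The scores, the function `noisy_top_k`, and the assumption that scores lie roughly between 0 and 1 are illustrative only; the actual scale of the values returned by `sim.compute_similarities` is not shown in this notebook.

```python
import numpy as np


def noisy_top_k(scores, k, noise_level, rng):
    """Add uniform noise in [-noise_level, noise_level] to each score, then take the top k."""
    noise = rng.uniform(-noise_level, noise_level, size=len(scores))
    return np.argsort(scores + noise)[::-1][:k]


toy_scores = np.array([0.9, 0.8, 0.5, 0.1])  # made-up similarity scores
rng = np.random.default_rng(42)

# Without noise the ranking is always the same
print(noisy_top_k(toy_scores, k=2, noise_level=0.0, rng=rng))

# With noise much larger than the gaps between scores, the noise dominates the ranking
print([noisy_top_k(toy_scores, k=2, noise_level=10.0, rng=rng).tolist() for _ in range(3)])
```

How much variety a given `noise_level` introduces depends on how it compares with the spread of the similarity scores, so it is worth checking that spread before settling on a value.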
