# Task 2: Recommendation Engine

This notebook implements our solution to Task 2. If you run into problems running it locally, a hosted version is available at https://colab.research.google.com/drive/1jsVK6iWu78BEwGA7fJ731-H2UsEt47eY?usp=sharing

## Setting up the Notebook

In [1]:
%%capture
cd ..

In [3]:
import numpy as np
import pandas as pd
from sentence_transformers import SentenceTransformer
from pandarallel import pandarallel

import cleaner
import constants as const
import similarity as sim
import utils
import item_filters as itf

In [2]:
REPORT_ROW_LISTING_ID = 1028659
pandarallel.initialize()

INFO: Pandarallel will run on 12 workers.
INFO: Pandarallel will use standard multiprocessing data transfer (pipe) to transfer data between the main process and workers.


Load the sentence-embedding model used by the fast approach. **NB: This may download the model to your Python installation.**
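The fast approach in `get_top_recommendations()` turns the query row into a single string, embeds it, and compares it against precomputed car embeddings using cosine similarity. The sketch below illustrates that computation in plain Python. It is only an illustration under assumptions: the helper names (`build_query`, `l2_norm`, `cosine_scores`) are hypothetical, and the 2-dimensional vectors are made-up stand-ins for real encoder output.

```python
import math


def build_query(row):
    """Turn a listing (column -> value mapping) into one sentence-like string."""
    return ' '.join(f'{key} is {val}' for key, val in row.items())


def l2_norm(vec):
    """Euclidean length of a vector."""
    return math.sqrt(sum(v * v for v in vec))


def cosine_scores(matrix, row_norms, query):
    """Cosine similarity between `query` and each row of `matrix`."""
    query_norm = l2_norm(query)
    return [
        sum(r * q for r, q in zip(row, query)) / (row_norm * query_norm)
        for row, row_norm in zip(matrix, row_norms)
    ]


# Toy 2-dimensional "embeddings" with made-up values (not real model output)
car_embeddings = [[1.0, 0.0], [1.0, 1.0], [0.0, 2.0]]
row_norms = [l2_norm(row) for row in car_embeddings]  # computed once, reused per query

query_text = build_query({'make': 'bmw', 'power': 80.0})
query_embedding = [2.0, 0.0]  # hypothetical stand-in for the encoded query_text

scores = cosine_scores(car_embeddings, row_norms, query_embedding)
# Indices of the cars, most similar first
best_first = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
```

Precomputing the row norms mirrors the role of `CAR_EMBEDDING_NORM_VEC`: only the query norm has to be computed per request.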

In [3]:
# Setup related to fast approach
encoder = SentenceTransformer('paraphrase-MiniLM-L6-v2')
CAR_EMBEDDING_MATRIX = np.load(const.CAR_EMBEDDING_MATRIX_PATH, allow_pickle=True)
CAR_EMBEDDING_NORM_VEC = np.linalg.norm(CAR_EMBEDDING_MATRIX, axis=1)

In [4]:
df_train_raw = pd.read_csv(const.TRAIN_PATH)
df_train = cleaner.clean_preliminary(df_train_raw)

## Load the Data

For this example, we use a simplified version of the dataset with only 100 data samples, each with only 6 features.

In [5]:
df_sample_raw = pd.read_csv(const.USED_CARS_SIMPLIFIED_PATH)
df_sample = cleaner.clean_preliminary(df_sample_raw)
df_sample.head()

Unnamed: 0,index,listing_id,make,power,engine_cap,mileage,price
0,0,1006025,bmw,135.0,1997.0,,82300
1,1,994672,land rover,202.0,2993.0,25843.0,427900
2,2,921142,honda,95.6,1496.0,2000.0,109800
3,3,1008328,bmw,185.0,1998.0,57386.0,166600
4,4,1010661,,96.0,1498.0,76000.0,59400


In [6]:
COLS_TO_SHOW = [
    'listing_id',
    'title',
    'make',
    'power',
    'engine_cap',
    'mileage',
    'price',
    'description',
    'fuel_type',
    'accessories'
]

## Computing the Top Recommendations

The method `get_top_recommendations()` shows an example of how to get the top recommendations for a given data sample (a data sample is a row in the dataset's dataframe). The input is a row from the dataset plus optional keyword parameters that depend on the approach; `k`, the number of recommendations to return, is one useful example.

The output should be a `pd.DataFrame` containing the recommendations. The output dataframe should have the same columns as the row + any additional columns you deem important (e.g., any score or tags that you might want to add to your recommendations).

In principle, the method `get_top_recommendations()` may be imported from an external Python (.py) script as well.

In [7]:
DEFAULT_K = 10
DEFAULT_NOISE_LEVEL = 0.1
DEFAULT_USE_FAST_METHOD = False

cache = {}


def get_top_recommendations(row, **kwargs) -> pd.DataFrame:
    #####################################################
    ## Initialize the required parameters
    k = kwargs.get('k', DEFAULT_K)
    noise_level = kwargs.get('noise_level', DEFAULT_NOISE_LEVEL)
    use_fast_method = kwargs.get('use_fast_method', DEFAULT_USE_FAST_METHOD)
    user_id = kwargs.get('user_id', None)
    user_pref = USER_PREFERENCES.get(user_id, {})
    assert k >= 1, 'k should be >= 1'
    assert noise_level >= 0.0, 'noise should not be negative'
    assert isinstance(use_fast_method, bool), '`use_fast_method` should be a bool'
    assert user_id is None or user_id in USER_PREFERENCES, \
        f'User ID not found, try one of {list(USER_PREFERENCES.keys())} or None'
    #####################################################
    sim_df = None
    if use_fast_method:
        query_strs = [f'{key} is {val}' for key, val in row.items()]
        query_row = ' '.join(query_strs)
        query_embedding = encoder.encode([query_row]).squeeze()
        raw_sim_score = CAR_EMBEDDING_MATRIX @ query_embedding
        normalizer = CAR_EMBEDDING_NORM_VEC * np.linalg.norm(query_embedding)
        sim_scores = raw_sim_score / normalizer
        sim_df = pd.DataFrame(sim_scores.reshape(1, -1))
    else:  # Slower approach
        # Get listing id similarity from cache if present, otherwise compute from scratch
        if row.listing_id in cache:
            sim_df = cache[row.listing_id]
        else:
            sim_df = sim.compute_similarities(pd.DataFrame([row]), df_train, is_test=True)
            cache[row.listing_id] = sim_df

    # Add uniform noise to the similarity scores
    noise = np.random.uniform(-noise_level, noise_level, size=len(df_train))
    sim_df_noisy = sim_df + noise
    
    # Get most similar
    most_similar_indices = utils.get_top_k_most_similar(sim_df_noisy, k=None).iloc[0]
    df_most_similar = df_train.iloc[most_similar_indices]
    
    # Filter based on user preferences (if any) and original listing id
    df_user = itf.filter_on_user_pref(user_pref, df_most_similar)
    df_result = df_user[df_user.listing_id != row.listing_id]
        
    # Return the dataset with the k recommendations
    return df_result.head(k)

## Testing the Recommendation Engine

This will be the main part of your notebook to allow for testing your solutions. At a minimum, for a given listing (defined by the row id in your input dataframe), we would like to see the recommendations you make. So however you set up your notebook, it should have at least a comparable section that will allow us to run your solution for different inputs.

### Define User Preferences

Below, we define a user preferences dictionary where the key is the user id and the value is a dictionary of column-filter pairs specifying the item filter the user wants to apply to a given column.

It is assumed that if a user specifies some preferences, they *only* want to see cars that fulfill their preferences, i.e., we can filter directly on their preferences.

The following item filters are available: 
* `RegexFilter(pattern: str)`
* `NumericalFilter(min_value: Optional[Number], max_value: Optional[Number])`
* `SetFilter(values: Set[Any])`
* `NotFilter(item_filter: ItemFilter)`
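The `item_filters` module itself is not shown in this notebook. As a rough illustration of how filters with these signatures could behave, here is a minimal sketch that re-implements them under the same names. Everything beyond the constructor signatures listed above is an assumption: the `matches()` method, the inclusive numerical bounds, the treatment of missing values as non-matching, and the `filter_rows()` helper (a stand-in for `itf.filter_on_user_pref`, operating on plain dictionaries rather than a dataframe).

```python
import re
from dataclasses import dataclass
from typing import Any, Optional, Set


class ItemFilter:
    """Base class: a filter decides whether a single cell value is acceptable."""

    def matches(self, value: Any) -> bool:
        raise NotImplementedError


@dataclass
class RegexFilter(ItemFilter):
    pattern: str

    def matches(self, value: Any) -> bool:
        # Non-string values (e.g. missing text) never match
        return isinstance(value, str) and re.search(self.pattern, value) is not None


@dataclass
class NumericalFilter(ItemFilter):
    min_value: Optional[float] = None
    max_value: Optional[float] = None

    def matches(self, value: Any) -> bool:
        # Assumption: None and NaN count as "does not match"
        if value is None or value != value:
            return False
        if self.min_value is not None and value < self.min_value:
            return False
        if self.max_value is not None and value > self.max_value:
            return False
        return True


@dataclass
class SetFilter(ItemFilter):
    values: Set[Any]

    def matches(self, value: Any) -> bool:
        return value in self.values


@dataclass
class NotFilter(ItemFilter):
    item_filter: ItemFilter

    def matches(self, value: Any) -> bool:
        return not self.item_filter.matches(value)


def filter_rows(user_pref, rows):
    """Keep only the rows that satisfy every column filter in `user_pref`."""
    return [
        row for row in rows
        if all(f.matches(row.get(col)) for col, f in user_pref.items())
    ]


# Made-up listings for illustration only
cars = [
    {'listing_id': 1, 'make': 'honda', 'accessories': 'bluetooth, reverse camera', 'price': 95_000},
    {'listing_id': 2, 'make': 'bmw', 'accessories': 'leather seats', 'price': 90_000},
    {'listing_id': 3, 'make': 'bmw', 'accessories': 'bluetooth', 'price': 150_000},
]
user_pref = {
    'accessories': RegexFilter(r'bluetooth'),
    'price': NumericalFilter(max_value=110_000),
}
kept_ids = [car['listing_id'] for car in filter_rows(user_pref, cars)]
not_bmw_ids = [
    car['listing_id']
    for car in filter_rows({'make': NotFilter(SetFilter({'bmw'}))}, cars)
]
```

Because every filter must match, the preferences act as a hard constraint, which is consistent with the assumption that users only want to see cars that fulfill all of their preferences.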

In [8]:
USER_PREFERENCES = {
    1: {
        'accessories': itf.RegexFilter(r'bluetooth'),
        'price': itf.NumericalFilter(max_value=110_000)
    },
    2: {
        'accessories': itf.RegexFilter(r'leather'),
        'car_age': itf.NumericalFilter(1, 3),
        'price': itf.NumericalFilter(min_value=50_000)
    }
}

### Pick a Sample Listing as Input

In [9]:
# Pick a row id of choice
#row_id = 10
#row_id = 20
row_id = 30
#row_id = 40
#row_id = 50
#row = df_sample.iloc[row_id]

# Row used in report
row = df_train[df_train.listing_id == REPORT_ROW_LISTING_ID].iloc[0]

pd.DataFrame([row])

Unnamed: 0,index,listing_id,title,make,model,description,manufactured,type_of_vehicle,category,transmission,...,mileage,omv,arf,opc_scheme,features,accessories,price,make_model,registered_date,car_age
50,51,1028659,BMW 2 Series 216i Active Tourer,bmw,216i,"1 owner unit, warranty under pml 3 years 200,0...",2021.0,0,"{parf car, low mileage car, almost new car}",0,...,41.0,31169.0,35637.0,,3-cylinders 12-valve turbocharged. 7-speed dc...,multi-function steering. auto headlights/rain ...,147100,63,2021-06-15,0.0


### Compute and Display the Recommendations

Since the method `get_top_recommendations()` returns a `pd.DataFrame`, it's easy to display the result.
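The `noise_level` argument passed in the cell below controls how much uniform noise is added to the similarity scores before ranking. The following sketch shows the effect on a ranking using only the standard library. It is an illustration under assumptions: `rank_with_noise` is a hypothetical helper, the scores are made up, and it assumes that higher scores rank first.

```python
import random


def rank_with_noise(scores, noise_level, rng):
    """Rank indices by score (highest first) after adding uniform noise."""
    noisy = [s + rng.uniform(-noise_level, noise_level) for s in scores]
    return sorted(range(len(scores)), key=lambda i: noisy[i], reverse=True)


# Made-up similarity scores for four candidate listings
scores = [0.90, 0.88, 0.50, 0.10]
rng = random.Random(0)  # seeded only to make the sketch reproducible

# With noise_level=0 the ranking is fully determined by the scores
exact_ranking = rank_with_noise(scores, 0.0, rng)

# With noise_level=0.1 only candidates whose scores differ by less than 0.2
# can swap places, so listings 0 and 1 may trade positions but 2 and 3 cannot
noisy_ranking = rank_with_noise(scores, 0.1, rng)
```

Small noise levels therefore shuffle near-ties and add variety to the suggestions, while clearly dissimilar listings stay at the bottom.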

In [10]:
# Number of similar rows to show
k = 5

# Noise level. The higher it is, the noisier the suggestions. If noise_level=0,
# get_top_recommendations becomes deterministic
noise_level = 0

# Optional user id. Should be either None or an existing user id (1 and 2 are defined above)
user_id = None

# Whether to use the fast but rougher approach or the slower, more precise approach
use_fast_method = False

# Get and show recommendations
df_recommendations = get_top_recommendations(
    row,
    k=k,
    noise_level=noise_level,
    user_id=user_id,
    use_fast_method=use_fast_method
)
df_recommendations[COLS_TO_SHOW]

Unnamed: 0,listing_id,title,make,power,engine_cap,mileage,price,description,fuel_type,accessories
8105,1026303,BMW 2 Series 216i Active Tourer,bmw,80.0,1499.0,232.0,141700,"1 owner. low mileage, newly registered. warran...",petrol,multi-function steering. auto headlights/rain ...
5453,1022382,BMW 2 Series 216i Active Tourer,bmw,80.0,1499.0,2300.0,152700,30,petrol,"keyless entry/start, navigation, bluetooth, el..."
11966,1024241,BMW 2 Series 216i Active Tourer Sport,bmw,80.0,1499.0,35.0,146900,lowest price gteed! unbelievable nice number g...,petrol,"original sport rims, connected idrive, bluetoo..."
8868,1020719,BMW 2 Series 216i Active Tourer,bmw,80.0,1496.0,45.0,150500,"new unit, elegant white unit on rare black int...",petrol,"original sport rims, connected idrive, bluetoo..."
12692,1030742,BMW 2 Series 216i Active Tourer Sport,bmw,80.0,1499.0,5053.0,136200,certified management bmw active tourer. in ren...,petrol,"reverse camera, bluetooth, electric front seat..."


Print the selected columns of the recommendations in $\LaTeX$ format for copy-pasting into the report.

In [11]:
REPORT_COLS = [
    'listing_id',
    'title',
    'power',
    'engine_cap',
    'mileage',
    'price'
]

latex_str = df_recommendations[REPORT_COLS].to_latex(index=False)
print(latex_str)

\begin{tabular}{rlrrrr}
\toprule
 listing\_id &                                 title &  power &  engine\_cap &  mileage &  price \\
\midrule
    1026303 &       BMW 2 Series 216i Active Tourer &   80.0 &      1499.0 &    232.0 & 141700 \\
    1022382 &       BMW 2 Series 216i Active Tourer &   80.0 &      1499.0 &   2300.0 & 152700 \\
    1024241 & BMW 2 Series 216i Active Tourer Sport &   80.0 &      1499.0 &     35.0 & 146900 \\
    1020719 &       BMW 2 Series 216i Active Tourer &   80.0 &      1496.0 &     45.0 & 150500 \\
    1030742 & BMW 2 Series 216i Active Tourer Sport &   80.0 &      1499.0 &   5053.0 & 136200 \\
\bottomrule
\end{tabular}

