# Task 2: Recommendation Engine - item_based

In this notebook, we implement item-based recommendation by computing the similarity between data entries (listings). Given any data entry, we recommend the entries that are most similar to it.
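
For reference, the similarity measure used later in this notebook (the `cal_sim` function) is the cosine similarity between two feature vectors $x$ and $y$:

$$
\mathrm{sim}(x, y) = \frac{x \cdot y}{\lVert x \rVert \, \lVert y \rVert}
$$

Its value lies between -1 and 1; a vector compared with itself scores 1, so larger values mean more similar entries.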


## Setting up the Notebook

In [195]:
import pandas as pd
import numpy as np
from data_processing import data_processing
from sklearn.metrics.pairwise import cosine_similarity

In [196]:
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


## Load the Data

For this example, we use a simplified version of the dataset with only 2k+ data samples, each with only a subset of features.

In [197]:
df_sample = pd.read_csv('../data/sg-property-prices-simplified.csv')
df_sample.head()

Unnamed: 0,listing_id,title,property_name,property_type,built_year,num_beds,num_baths,size_sqft,planning_area,price
0,799762,hdb flat for sale in 524 ang mo kio avenue 5,hdb-ang mo kio,hdb 3 rooms,1980.0,2.0,2.0,732,ang mo kio,419000.0
1,896907,4 bed condo for sale in kopar at newton,kopar at newton,condo,2023.0,4.0,4.0,1528,novena,3727500.0
2,445021,4 bed condo for sale in nouvel 18,nouvel 18,condo,2014.0,4.0,3.0,2476,newton,8013600.0
3,252293,hdb flat for sale in 467 jurong west street 41,hong kah ville,hdb,1985.0,3.0,2.0,1302,jurong west,682500.0
4,926453,hdb flat for sale in 664b punggol drive,waterway sunbeam,Hdb 5 Rooms,2016.0,3.0,2.0,1184,punggol,764400.0


## Data Processing

As in the Task 1 data processing, only the useful columns are retained, categorical features are one-hot encoded, and numerical features are normalized. In addition, entries containing NaN values are dropped rather than filled in, so that every recommendation is based on actual listing data.
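
The `data_processing` module itself is not shown in this notebook. As a rough illustration of the three steps described above (dropping NaN rows, normalizing numerical columns, one-hot encoding categorical columns), a minimal sketch might look like the following. The function name `preprocess_sketch`, the choice of columns, the use of z-score normalization, and the toy data are all assumptions made for illustration, not the actual implementation.

```python
import pandas as pd


def preprocess_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in for data_processing(); not the notebook's actual code."""
    numeric_cols = ["num_beds", "num_baths", "size_sqft"]  # assumed feature subset
    categorical_cols = ["property_type"]                   # assumed feature subset

    # Keep only the columns used for similarity, plus the listing id
    out = df[["listing_id"] + numeric_cols + categorical_cols].copy()

    # Drop incomplete rows instead of filling in missing values
    out = out.dropna().reset_index(drop=True)

    # Z-score normalization of the numerical columns (assumed normalization scheme)
    nums = out[numeric_cols].astype(float)
    out[numeric_cols] = (nums - nums.mean()) / nums.std()

    # One-hot encode the categorical columns as 0/1 integers
    return pd.get_dummies(out, columns=categorical_cols, dtype=int)


# Toy data for illustration only; listing 3 has a missing value and is dropped
toy = pd.DataFrame({
    "listing_id": [1, 2, 3, 4],
    "num_beds": [2.0, 4.0, None, 3.0],
    "num_baths": [2.0, 4.0, 2.0, 2.0],
    "size_sqft": [732, 1528, 2476, 1302],
    "property_type": ["hdb", "condo", "condo", "hdb"],
})
features = preprocess_sketch(toy)
print(features)
```

Because rows are dropped, the processed frame can have fewer rows than the raw data and a different row order, which is why listings are matched by `listing_id` rather than by row position later on.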

In [198]:
X = data_processing(df_sample)
X.head()

Unnamed: 0,listing_id,num_beds,num_baths,size_sqft,area_mean_price,property_type_condo,property_type_hdb,property_type_house,built_year_1995,built_year_2005,built_year_2015,built_year_2025,price_2.0,price_3.0,price_4.0
0,799762,-0.830249,-0.385384,-0.469722,-0.343703,0,1,0,0,0,0,0,0,0,0
1,966261,-0.830249,-0.385384,-0.476024,-0.343703,0,1,0,0,0,0,0,0,0,0
2,528355,-0.013897,-0.385384,-0.347105,-0.343703,0,1,0,0,0,0,0,0,0,0
3,567595,-0.013897,-0.385384,-0.285224,-0.343703,0,1,0,0,0,0,0,0,0,0
4,703909,-0.013897,-0.385384,-0.315591,-0.343703,0,1,0,0,0,1,0,0,0,0


## Computing the Top Recommendations

The method `get_top_recommendations()` shows an example of how to get the top recommendations for a given data sample (data sample = row in the dataframe of the dataset). The input is a row from the dataset plus optional keyword parameters that depend on your approach; `k`, the number of recommendations to return, is a useful one to support.

The output should be a `pd.DataFrame` containing the recommendations. The output dataframe should have the same columns as the row + any additional columns you deem important (e.g., any score or tags that you might want to add to your recommendations).

In principle, the method `get_top_recommendations()` may be imported from an external Python (.py) script as well.

In [199]:
# Cosine similarity between two 1-D vectors
def cal_sim(x, y):
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))


# Cosine similarity between every row of property_array and a single item vector
def calculate_sim(property_array, item):
    return np.array([cal_sim(vec, item) for vec in property_array])

def get_top_recommendations(row, **kwargs) -> pd.DataFrame:
    
    #####################################################
    ## Initialize the required parameters
    
    # The number of recommendations (k) is a recommended parameter
    # Additional input parameters are up to you
    # Extract the **kwargs input parameters used here: k and row_id
    # (both default to None if not supplied)
    k = kwargs.get('k')
    row_id = kwargs.get('row_id')
    #####################################################
    ## Compute your recommendations
    #
    # This is where your magic happens. Of course, you can call methods
    # defined in this notebook or in external Python (.py) scripts
    #
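    # Look up the query listing in the raw data, then find its row in the processed features
    # (raises IndexError if the listing was dropped during data processing)
    item_id = df_sample.iloc[row_id]['listing_id']
    x_row = X[X['listing_id'] == item_id].index[0]
    X_cal = X.drop('listing_id', axis=1)  # remove listing_id before calculation
    item = np.array(X_cal.iloc[x_row])  # feature vector of the base item

    # Cosine similarity between the base item and every processed listing
    sim = calculate_sim(np.array(X_cal), item)
    # Sort ascending and take ranks 3 to k+2 from the top, best first.
    # The two highest-scoring entries are skipped; the base item itself
    # scores 1.0 and normally ranks among them.
    idx = sim.argsort()[-k-2:-2][::-1]
    res = X.iloc[idx]['listing_id']
    # Select the matching rows of the original dataframe
    # (isin keeps df_sample's row order, not the similarity ranking)
    df_result = df_sample.loc[df_sample['listing_id'].isin(res)]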
        
    # Return the dataset with the k recommendations
    return df_result
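
As noted earlier, the output may carry additional columns such as a score. The sketch below is one possible variant, not the implementation above: it computes all cosine similarities in a single vectorized step, excludes only the query row, keeps the results in ranking order, and attaches the score. The helper name `top_k_similar`, the parameter `query_pos`, the `similarity` column, and the toy features are hypothetical and introduced only for illustration.

```python
import numpy as np
import pandas as pd


def top_k_similar(features: pd.DataFrame, query_pos: int, k: int = 5) -> pd.DataFrame:
    """Hypothetical helper: rank rows of `features` by cosine similarity to one row."""
    matrix = features.drop(columns="listing_id").to_numpy(dtype=float)

    # Cosine similarity of the query row to every row
    # (assumes no all-zero rows, which would cause a division by zero)
    norms = np.linalg.norm(matrix, axis=1)
    sim = matrix @ matrix[query_pos] / (norms * norms[query_pos])

    # Exclude the query itself so it is never recommended back
    sim[query_pos] = -np.inf

    # Positions of the k highest scores, best first
    top = np.argsort(sim)[::-1][:k]

    result = features.iloc[top][["listing_id"]].copy()
    result["similarity"] = sim[top]
    return result.reset_index(drop=True)


# Toy feature table for illustration only
toy_features = pd.DataFrame({
    "listing_id": [101, 102, 103, 104],
    "f1": [1.0, 0.9, -1.0, 0.0],
    "f2": [0.0, 0.1, 0.0, 1.0],
})
recs = top_k_similar(toy_features, query_pos=0, k=2)
print(recs)
```

The `cosine_similarity` function from `sklearn.metrics.pairwise`, already imported at the top of this notebook, could replace the manual norm computation. To return full listing details in ranked order, the result can be merged with the raw dataframe on `listing_id`.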


## Testing the Recommendation Engine

This will be the main part of your notebook to allow for testing your solutions. Most basically, for a given listing (defined by the row id in your input dataframe), we would like to see the recommendations you make. So however you set up your notebook, it should have at least a comparable section that will allow us to run your solution for different inputs.

### Pick a Sample Listing as Input

In [200]:
# Pick a row id of choice
# row_id = 10
# row_id = 20
row_id = 30
# row_id = 40

# Get the row from the dataframe (an invalid row id will throw an error)
row = df_sample.iloc[row_id]

# Just for printing it nicely, we create a new dataframe from this single row
pd.DataFrame([row])

Unnamed: 0,listing_id,title,property_name,property_type,built_year,num_beds,num_baths,size_sqft,planning_area,price
30,800627,hdb flat for sale in 86 telok blangah heights,hdb-bukit merah,hdb 5 rooms,2003.0,3.0,2.0,1184,bukit merah,890400.0


### Compute and Display the Recommendations

Since the method `get_top_recommendations()` returns a `pd.DataFrame`, it's easy to display the result.

In [201]:
k = 5
df_recommendations = get_top_recommendations(row, k=k,row_id=row_id)
df_recommendations.head(k)

Unnamed: 0,listing_id,title,property_name,property_type,built_year,num_beds,num_baths,size_sqft,planning_area,price
725,709820,hdb flat for sale in 73a redhill road,73a redhill road,hdb,2005.0,3.0,2.0,1076,bukit merah,942900.0
906,862524,hdb flat for sale in 185 bedok north road,vista 8,hdb,2005.0,3.0,2.0,990,bedok,661500.0
1567,595261,hdb flat for sale in 77a redhill road,77a redhill road,hdb,2005.0,3.0,2.0,1076,bukit merah,834800.0
1695,380465,hdb flat for sale in 74a redhill road,74a redhill road,hdb,2005.0,3.0,2.0,1237,bukit merah,942900.0
1960,154525,hdb flat for sale in 596b ang mo kio street 52,city view @ cheng san,hdb 5 rooms,2002.0,3.0,2.0,1184,ang mo kio,840000.0
