# Task 2: Recommendation Engine - Skeleton Notebook

This notebook provides a very basic example for the notebook you are expected to submit for Task 2 of the Final Project. The main purpose is that we can try different examples to get a better sense of your approach. Compared to Task 1 (Kaggle Competition), we don't have any objective means to evaluate the recommendations. 

Some general comments:
* You can import any data you need. This particularly includes your cleaned version of the properties dataset (incl. the auxiliary data or any other data you might have collected); there's no need to show the data cleaning / preprocessing steps in this notebook.
* You can also import your code in form of external Python (.py) script. You're actually encouraged to do so to keep this notebook light and uncluttered.
* **Important:** Please consider this notebook as an example and not to set specific requirements. Your notebook is likely to look very different. As long there is a section where we can easily test your solution, it should be fine.

## Setting up the Notebook

In [1]:
import pandas as pd
from data_processing import data_processing


%load_ext autoreload
%autoreload 2


## Load the Data

For this example, we use a simplified version of the dataset with only 2k+ data samples, each with only a subset of features.

In [54]:
import pandas as pd
import numpy as np

numerical_features = ['num_beds', 'num_baths', 'size_sqft', 'planning_area', 'price']
categorical_features = ['property_type', 'built_year']

def normalize(train_data, features):
    for feature in features:
        mean = np.mean(train_data[feature])
        std = np.std(train_data[feature])
        train_data[feature] = (train_data[feature] - mean) / std
    return train_data

def data_processing(df_sample):
    df = df_sample.copy()
    df = df.loc[:, numerical_features + categorical_features]
    df = df.dropna()
    
    df['property_type'] = df.property_type.str.lower()
    df.loc[df['property_type'].str.contains('hdb'), 'property_type'] = 'hdb'
    df.loc[df['property_type'].str.contains('condo'), 'property_type'] = 'condo'
    df.loc[df['property_type'].str.contains('house'), 'property_type'] = 'house'
    df.loc[df['property_type'].str.contains('bungalow'), 'property_type'] = 'bungalow'
    df = df[df['property_type'].str.contains('hdb|condo|house|bungalow')]
    
    df['built_year'] = df['built_year'].astype(int)
    df.loc[df['built_year'] <= 1990, 'built_year'] = 1990
    df.loc[(df['built_year'] > 1990) & (df['built_year'] <= 2000), 'built_year'] = 1995
    df.loc[(df['built_year'] > 2000) & (df['built_year'] <= 2010), 'built_year'] = 2005
    df.loc[(df['built_year'] > 2010) & (df['built_year'] <= 2020), 'built_year'] = 2015
    df.loc[df['built_year'] > 2020, 'built_year'] = 2025
    pd.value_counts(df['built_year'])
    
    df['planning_area'] = df['planning_area'].str.lower()
    mean_price = df['price'] / df['size_sqft']
    mean_price = pd.concat([df['planning_area'], mean_price], axis=1)
    area_mean_price = mean_price.groupby('planning_area').mean()
    area_mean_price.rename(columns={0: 'area_mean_price'}, inplace=True)
    df = pd.merge(df, area_mean_price, on='planning_area')
    
    df.loc[df['price'] <= 1000000, 'price'] = 1
    df.loc[(df['price'] > 1000000) & (df['price'] <= 2000000), 'price'] = 2
    df.loc[(df['price'] > 2000000) & (df['price'] <= 3000000), 'price'] = 3
    df.loc[df['price'] > 3000000, 'price'] = 4
    
    df['listing_id'] = df_sample['listing_id']
    
    
    numerical = ['num_beds', 'num_baths', 'size_sqft', 'area_mean_price']
    categorical = ['property_type', 'built_year', 'price']
    df = pd.get_dummies(df, columns=categorical, drop_first=True)
    
    df = normalize(df, numerical)
    
    df.drop('planning_area', axis=1, inplace=True)
    
    return df





df_sample = pd.read_csv('data/sg-property-prices-simplified.csv')

df_sample.head()

X = data_processing(df_sample)

X.head()

Unnamed: 0,num_beds,num_baths,size_sqft,area_mean_price,listing_id,property_type_condo,property_type_hdb,property_type_house,built_year_1995,built_year_2005,built_year_2015,built_year_2025,price_2.0,price_3.0,price_4.0
0,-0.830249,-0.385384,-0.469722,-0.343703,799762,0,1,0,0,0,0,0,0,0,0
1,-0.830249,-0.385384,-0.476024,-0.343703,896907,0,1,0,0,0,0,0,0,0,0
2,-0.013897,-0.385384,-0.347105,-0.343703,445021,0,1,0,0,0,0,0,0,0,0
3,-0.013897,-0.385384,-0.285224,-0.343703,252293,0,1,0,0,0,0,0,0,0,0
4,-0.013897,-0.385384,-0.315591,-0.343703,926453,0,1,0,0,0,1,0,0,0,0


## Computing the Top Recommendations

The method `get_top_recommendations()` shows an example of how to get the top recommendations for a given data sample (data sample = row in the dataframe of the dataset). The input is a row from the dataset and a list of optional input parameters which will depend on your approach; `k` is the number of returned recommendations seems useful, though.

The output should be a `pd.DataFrame` containing the recommendations. The output dataframe should have the same columns as the row + any additional columns you deem important (e.g., any score or tags that you might want to add to your recommendations).

In principle, the method `get_top_recommendations()` may be imported from a external Python (.py) script as well.

In [38]:
import random
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from pandas import Series
from sklearn.metrics.pairwise import cosine_similarity




def cal_sim(x, y):
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

def calculate_sim(car_array, item):
    sim = np.array([cal_sim(car_array[i], item) for i in range(0,len(car_array))])
    return sim

def get_user_profile(listingid_score_dict, processed_df, non_zero_column_num):
    scores = np.array(list(listingid_score_dict.values()))
    scores_norm = (scores - scores.mean()).T
    rated_items = processed_df.set_index('listing_id').loc[list(listingid_score_dict.keys())].reset_index(inplace=False).drop(columns=['listing_id'])
    user_profile = scores_norm.dot(rated_items) / non_zero_column_num
    return user_profile


def calculate_cos_similar_for_all_items(user_profile, user_item_index, df):
    similarity_list = []
    user_profile = np.array(user_profile).reshape(1,user_profile.shape[0])
    unrated_items = df[~df['listing_id'].isin(user_item_index)]
    for index, row in unrated_items.iterrows():
        row_cleaned = np.array(row[1:]).reshape(1,user_profile.shape[1])
        simi = cosine_similarity(row_cleaned, user_profile)
        t = (simi[0][0], row['listing_id'])
        similarity_list.append(t)
    return similarity_list


def get_top_recommendations(k, **kwargs):
    filtered_data = kwargs.get("filtered_data")
    listingid_score_dict = kwargs.get("listingid_score_dict")
    
    # filtered data processing: normalize and one-hot encode

    processed_df =user_select_data

    # get the number of features

    non_zero_column_num = filtered_data.shape[1]-1
    
    # get user profile
    user_profile = get_user_profile(listingid_score_dict, processed_df, non_zero_column_num)
    
    # compute similarity for all items in filtered data
    similarity_list = calculate_cos_similar_for_all_items(user_profile=user_profile, user_item_index=list(listingid_score_dict.keys()), df=processed_df)
    sorted_list = sorted(similarity_list,key=lambda t:t[0], reverse=True)
    
    # return top k items
    top_k = sorted_list[:k]
    id_list = []
    for x in top_k:
        id_list.append(x[1])
    recommend_items = X.set_index('listing_id').loc[id_list].reset_index(inplace=False)
    return recommend_items


## Testing the Recommendation Engine

This will be the main part of your notebook to allow for testing your solutions. Most basically, for a given listing (defined by the row id in your input dataframe), we would like to see the recommendations you make. So however you set up your notebook, it should have at least a comparable section that will allow us to run your solution for different inputs.

### Pick a Sample Listing as Input
We designed a user who have browse records in many properties. This user prefer property that property type is condo and prefer buly year bewteen 2010~2020.

Design score rules to measure user behavior. -2 means that the user clicked "don't want to see this kind of information" when browsing, -1 means that the browsing time is less than five minutes ,0 means that the browsing time is more than five minutes, 1 means that the house is included in the favorites, 2 means that the house agent is contacted through the website.




In [68]:
user_select_data = X[(X.built_year_2015 == 1) & (X.property_type_condo == 1)]
# user_select_data = X[ (X.property_type_condo == 1)]
# user_select_data = X[((X["price_2.0"]) == 1)&(X.built_year_2015 == 1)&(X.property_type_hdb == 1)]

n = 10
user_item = user_select_data.sample(n=n)
user_item_index = user_item['listing_id'].tolist()
scores = [random.randint(-2, 2) for _ in range(10)]

listingid_score_dict = {user_item_index[i]: scores[i] for i in range(n)}

print(f"Your score list is: {listingid_score_dict}")
X[X['listing_id'].isin(listingid_score_dict.keys())]


Your score list is: {735919: 0, 713023: 2, 164888: 0, 147860: 1, 574574: 2, 398341: -1, 728128: -1, 126387: 1, 850862: -2, 997603: -2}


Unnamed: 0,num_beds,num_baths,size_sqft,area_mean_price,listing_id,property_type_condo,property_type_hdb,property_type_house,built_year_1995,built_year_2005,built_year_2015,built_year_2025,price_2.0,price_3.0,price_4.0
148,0.802454,0.349469,0.529546,1.742353,713023,1,0,0,0,0,1,0,0,0,1
179,0.802454,0.349469,-0.228499,-1.17571,126387,1,0,0,0,0,1,0,1,0,0
384,-0.830249,-0.385384,-0.395808,-0.354482,574574,1,0,0,0,0,1,0,1,0,0
519,0.802454,0.349469,-0.130521,-0.914443,997603,1,0,0,0,0,1,0,1,0,0
605,-0.013897,0.349469,-0.25944,-1.335549,728128,1,0,0,0,0,1,0,1,0,0
608,0.802454,-0.385384,-0.06291,-1.335549,147860,1,0,0,0,0,1,0,1,0,0
615,0.802454,0.349469,-0.229072,-1.335549,398341,1,0,0,0,0,1,0,1,0,0
1091,-0.013897,-0.385384,-0.290953,-0.733276,850862,1,0,0,0,0,1,0,1,0,0
1302,-0.013897,-0.385384,-0.013634,1.508647,735919,1,0,0,0,0,1,0,0,0,1
1588,-1.6466,-1.120238,-0.617549,-0.185273,164888,1,0,0,0,0,1,0,1,0,0


## Compute and Display the recommendations

Since the method `get_top_recommendations()` returns a `pd.DataFrame`, it's easy to display the result.

We create the user profile according to the user behav-ior  
The type of all the result is condo. The system works well in the single feature. 
When we design another user that multiple conditions, whose preferred price is between 1000000 to 2000000, built year is between 2010~2020 and type is HDB the recom-mendation system gives only 2 results that satisfy all fea-tures because of small data base. When type changed to condo, the system works well in the multiple features. To avoid situation that recommendation system cannot give enough proper results, in the futures some features can have smaller weights, for example, built year are less im-portant than price.


In [69]:
get_top_recommendations(10, filtered_data=user_select_data, listingid_score_dict=listingid_score_dict)

Unnamed: 0,listing_id,num_beds,num_baths,size_sqft,area_mean_price,property_type_condo,property_type_hdb,property_type_house,built_year_1995,built_year_2005,built_year_2015,built_year_2025,price_2.0,price_3.0,price_4.0
0,101620,-0.013897,0.349469,-0.136823,1.508647,1,0,0,0,0,1,0,0,0,1
1,111643,-0.013897,0.349469,-0.019363,1.508647,1,0,0,0,0,1,0,0,0,1
2,127075,-0.013897,0.349469,0.18347,1.02202,1,0,0,0,0,1,0,0,0,1
3,168189,-0.013897,0.349469,0.800563,1.559053,1,0,0,0,0,1,0,0,0,1
4,192693,0.802454,1.084322,0.578822,1.816187,1,0,0,0,0,1,0,0,0,1
5,245087,-0.013897,0.349469,0.041945,1.508647,1,0,0,0,0,1,0,0,0,1
6,254384,-1.6466,-1.120238,-0.537333,1.508647,1,0,0,0,0,1,0,0,1,0
7,422867,-0.013897,-0.385384,-0.112185,1.508647,1,0,0,0,0,1,0,0,0,1
8,309548,1.618805,1.084322,0.412087,1.508647,1,0,0,0,0,1,0,0,0,1
9,499663,-0.830249,-0.385384,-0.167764,1.508647,1,0,0,0,0,1,0,0,0,1
