# Introduction

This recommender system suggests hostels that are similar to a given hostel.

Since no data for hostels in Ireland is available in public data libraries, I scraped the data from the Hostel World website for this experiment.

Let's start by importing the necessary libraries.

In [None]:
# Importing libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, StandardScaler, MinMaxScaler
from sklearn.metrics import euclidean_distances

import warnings
warnings.filterwarnings('ignore')

In [None]:
# Loading the dataset
df_hostels = pd.read_csv("../input/latest.csv", encoding='latin1')

In [None]:
df_hostels.head()

For some entertainment features, the values are given as 1 and 0, where 0 = No and 1 = Yes.

# 1. Exploratory Data Analysis

In [None]:
# General Information
df_hostels.info()

In [None]:
# Statistical characteristics of numerical features
df_hostels.describe()

Let's draw histograms for some relevant fields.

In [None]:
plt.figure(figsize=(14, 10))
plt.subplot(221)
plt.hist(df_hostels['Price'].values, bins=20)
plt.title('Price')
plt.subplot(222)
plt.hist(df_hostels['summary.score'].values, bins=20)
plt.title('summary.score')

Most hostels in the dataset are priced between 15 and 20 euros.

# 2. Data Preprocessing

Handling Missing Values

In [None]:
df_hostels.isnull().sum()

No missing values are found.

Next, I drop summary.score, Name, and rating.band. summary.score is the average of Value.for.money, Security, Location, Staff, Atmosphere, Cleanliness, and Facilities, so it adds no new information. rating.band is not needed either, because it is derived from summary.score; e.g., if summary.score is between 1 and 3, then rating.band is "Good".
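
Before dropping summary.score, it may be worth confirming that it really is the mean of the seven sub-scores. The sketch below shows one way to check this. It runs on made-up ratings rather than the scraped data, and it assumes the summary is rounded to one decimal place, hence the small tolerance.

```python
import pandas as pd

# Made-up ratings for three hypothetical hostels (not taken from the dataset)
score_cols = ['Value.for.money', 'Security', 'Location', 'Staff',
              'Atmosphere', 'Cleanliness', 'Facilities']
toy = pd.DataFrame(
    [[8.0, 9.0, 7.0, 9.0, 8.0, 9.0, 6.0],
     [6.0, 7.0, 9.0, 8.0, 5.0, 6.0, 9.0],
     [9.0, 9.0, 9.0, 10.0, 9.0, 9.0, 8.0]],
    columns=score_cols,
)
# Assumption: the summary is the row mean rounded to one decimal place
toy['summary.score'] = toy[score_cols].mean(axis=1).round(1)

# Largest gap between the stored summary and the recomputed mean
max_gap = (toy['summary.score'] - toy[score_cols].mean(axis=1)).abs().max()

# A gap no larger than the rounding error means the column is redundant
is_redundant = bool(max_gap <= 0.05 + 1e-9)
print(max_gap, is_redundant)
```

On the real data, the same comparison can be run on df_hostels before the drop; a large gap would suggest the summary is computed differently and might be worth keeping.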

In [None]:
df_hostels.drop(['summary.score', 'Name', 'rating.band'], inplace=True, axis=1);

In [None]:
df_hostels.head()

Now, let's label encode the categorical features Distance and City.

In [None]:
# Label Encoding
le = LabelEncoder()
df_hostels['Distance'] = le.fit_transform(df_hostels['Distance'])
df_hostels['City'] = le.fit_transform(df_hostels['City'])

df_hostels.head(3)
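
One thing to keep in mind with label encoding: LabelEncoder assigns integer codes in the sorted order of the labels, so for string columns the codes carry no numeric meaning. The sketch below illustrates this on made-up values (the real City and Distance strings may look different). It also uses one encoder per column, which is an optional variation that keeps each mapping invertible.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up values; the real 'City' and 'Distance' strings may look different
toy = pd.DataFrame({
    'City': ['Galway', 'Dublin', 'Cork', 'Dublin'],
    'Distance': ['2.5km', '10km', '0.8km', '2.5km'],
})

# One encoder per column so each mapping can be inverted later
encoders = {}
for col in ['City', 'Distance']:
    encoders[col] = LabelEncoder()
    toy[col + '_code'] = encoders[col].fit_transform(toy[col])

# Codes follow the sorted order of the labels: '10km' sorts before '2.5km'
print(list(encoders['City'].classes_))
print(list(encoders['Distance'].classes_))

# Recover the original city names from the codes
city_names = encoders['City'].inverse_transform(toy['City_code'])
print(list(city_names))
```

Because the codes are only identifiers, Euclidean distances computed on them reflect alphabetical order rather than real closeness.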

In [None]:
def normalize_features(df):
    # standardize each column to zero mean and unit variance
    df = df.copy()
    for col in df.columns:
        # fill any NaNs with the column mean
        df[col] = df[col].fillna(df[col].mean())
        df[col] = StandardScaler().fit_transform(df[col].values.reshape(-1, 1)).ravel()
    return df

def get_hostel_recommendations(df, anchor_id):
    # features used to compute the similarity (only the label-encoded city here)
    features = ['City']
    
    # make the anchor the first row in the dataframe
    df_sorted = pd.concat([df[df['hotel.id'] == anchor_id], df[df['hotel.id'] != anchor_id]])
    df_features = normalize_features(df_sorted[features])
    
    # compute the distance of every row to the anchor (row 0)
    X = df_features.values
    Y = X[0].reshape(1, -1)
    df_sorted['similarity_distance'] = euclidean_distances(X, Y).ravel()
    
    return df_sorted.sort_values('similarity_distance').reset_index(drop=True)

def get_city_hostel_recommendations(df, city_id):
    # features used to compare hostels within the selected cities
    features = ['Distance', 'Value.for.money', 'Security', 'Location', 'Staff', 'Atmosphere', 'Cleanliness', 'Facilities', 'Price', 'Board.Games', 'Dvds', 'Foosball', 'Games.Room', 'PlayStation', 'Pool.Table']
    
    # copy the subset so the new column is not written to a view of df
    df_sorted = df[df["City"].isin(city_id)].copy()
    df_features = normalize_features(df_sorted[features])
    
    # compute the distance of every row to the first row of the subset
    # (note: that row is not necessarily the anchor hostel)
    X = df_features.values
    Y = X[0].reshape(1, -1)
    df_sorted['similarity_distance'] = euclidean_distances(X, Y).ravel()
    return df_sorted.sort_values('similarity_distance').reset_index(drop=True)

def Remove(duplicate):
    # drop duplicates while keeping the first-seen order
    return list(dict.fromkeys(duplicate))

Recommendations are computed in two stages. The first stage uses only City to find the city of the anchor hostel. The second stage ranks the hostels in that city using features = ['Distance', 'Value.for.money', 'Security', 'Location', 'Staff', 'Atmosphere', 'Cleanliness', 'Facilities', 'Price', 'Board.Games', 'Dvds', 'Foosball', 'Games.Room', 'PlayStation', 'Pool.Table'].

A few things to note here:

1. Normalizing the feature space is very important since we are dealing with features that have different units. The choice of normalization scheme depends on the problem and the data. In this case we use a standard score approach, on the assumption that the numeric features are roughly normally distributed, but min/max scaling or TF-IDF (for comparing documents) may also be useful for other applications.

2. It is important to backtest your recommendation algorithm to pick the best normalization scheme and similarity score and tune any other parameters.
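
To make the first point concrete, here is a small sketch of how scaling can change which neighbor is nearest. The four rows are made-up (price, cleanliness) pairs, not rows from the dataset, and nearest_to_anchor is a hypothetical helper written only for this illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.metrics import euclidean_distances

# Made-up rows: [Price in euros, Cleanliness rating out of 10]
X = np.array([
    [15.0, 9.5],   # row 0: the anchor
    [16.0, 4.0],   # row 1: similar price, much lower rating
    [22.0, 9.4],   # row 2: higher price, nearly identical rating
    [30.0, 7.0],   # row 3: far away on both features
])

def nearest_to_anchor(features):
    """Index of the row closest to row 0, excluding row 0 itself."""
    d = euclidean_distances(features, features[:1]).ravel()
    return int(np.argmin(d[1:])) + 1

raw_pick = nearest_to_anchor(X)
std_pick = nearest_to_anchor(StandardScaler().fit_transform(X))
mm_pick = nearest_to_anchor(MinMaxScaler().fit_transform(X))

print(raw_pick, std_pick, mm_pick)  # prints: 1 2 2
```

Without scaling, one euro counts as much as one rating point, so row 1 wins despite its poor rating. After either scaling, each difference is measured relative to the spread of its feature and row 2 becomes the nearest neighbor. The two scalers agree here, but they need not agree on real data, which is why backtesting matters.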

Let’s see what happens when we make the anchor hotel.id = 66:

In [None]:
df_recs = get_hostel_recommendations(df_hostels, 66)

df_final = df_recs.head(n=4)

city = Remove(df_final.City)

# Now make recommendations for only the data where city is fetched from above
df_final = get_city_hostel_recommendations(df_hostels, city)

df_final.head(n=3)

Next, let’s look at a second example where we set the anchor to hotel.id = 1, to check whether we get recommendations for hostels that have board games.

In [None]:
df_recs = get_hostel_recommendations(df_hostels, 1)

df_final = df_recs.head(n=4)

city = Remove(df_final.City)

# Now make recommendations for only the data where city is fetched from above
df_final = get_city_hostel_recommendations(df_hostels, city)

df_final.head(n=3)

The results look promising.
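
One caveat on the two-step procedure: get_city_hostel_recommendations measures distances to the first row of the city subset, which is not necessarily the anchor hostel. The sketch below is a possible variation, not the method used above. It restricts the data to the anchor's city and measures distances to the anchor itself. The helper recommend_in_same_city and the toy rows are hypothetical and exist only for illustration.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import euclidean_distances

def recommend_in_same_city(df, anchor_id, features, n=3):
    """Hypothetical helper: rank hostels in the anchor's city by distance to the anchor."""
    anchor_city = df.loc[df['hotel.id'] == anchor_id, 'City'].iloc[0]
    same_city = df[df['City'] == anchor_city].reset_index(drop=True)
    # Standardize within the city subset (zero mean, unit variance per column)
    X = StandardScaler().fit_transform(same_city[features])
    # Position of the anchor inside the subset, wherever it happens to be
    anchor_pos = same_city.index[same_city['hotel.id'] == anchor_id][0]
    same_city['similarity_distance'] = euclidean_distances(X, X[[anchor_pos]]).ravel()
    # Leave the anchor itself out of its own recommendations
    others = same_city[same_city['hotel.id'] != anchor_id]
    return others.sort_values('similarity_distance').head(n).reset_index(drop=True)

# Made-up hostels, only to exercise the function
toy = pd.DataFrame({
    'hotel.id': [1, 2, 3, 4, 5],
    'City': [0, 0, 0, 0, 1],
    'Price': [15.0, 16.0, 30.0, 18.0, 15.0],
    'Cleanliness': [9.0, 8.8, 6.0, 9.2, 9.0],
    'Board.Games': [1, 1, 0, 0, 1],
})
recs = recommend_in_same_city(
    toy, anchor_id=2, features=['Price', 'Cleanliness', 'Board.Games'], n=2)
print(recs[['hotel.id', 'similarity_distance']])
```

In this toy run the anchor (hotel.id = 2) is not the first row of its city, yet the closest match is hotel.id = 1, which has a similar price and rating and also offers board games.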