# Introduction

This recommender system uses the name of a hostel to recommend similar hostels.

No dataset of hostels in Ireland is available in the usual data libraries, so I scraped the data from the Hostel World website for this experiment.

Let's start by importing the necessary libraries.

In [173]:
import pandas as pd
import numpy as np
from nltk.corpus import stopwords
from sklearn.metrics.pairwise import linear_kernel
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation
import re
import random
import cufflinks
pd.options.display.max_columns = 30
from IPython.core.interactiveshell import InteractiveShell
import matplotlib.pyplot as plt
import seaborn as sns

Loading the dataset.

In [174]:
df_hostels = pd.read_csv("../input/text_based_hostel_new.csv", encoding='latin1')

In [175]:
df_hostels.head()

Unnamed: 0,name,address,desc
0,Abbey View,"Bushy Park, on Main Clifden N59, Co. Galway, I...","Abbey View is a comfortable, family-run B&B, c..."
1,Hotel Killarney,"Cork Road, Ballyspillane, Killarney, Co. Kerry...",Like the meeting of the 3 lakes we are a desti...
2,Abigails Hostel,"7-9 Aston Quay, Temple Bar, Dublin, D02 DX56, ...",We look forward to welcoming you to Abigails H...
3,Abbey Court Hostel,"29 Bachelors Walk, North City, Dublin, D01 AX9...",Bachelors Walk Apartments are located in Dubli...
4,Abrahams Hostel,"82-83 Lower Gardiner Street, Mountjoy, Dublin ...",Abrahams Hostel enjoys a reputation as one of ...


# EDA

### Token (vocabulary) Frequency Distribution Before Removing Stop Words

In [176]:
def get_top_n_words(corpus, n=None):
    vec = CountVectorizer().fit(corpus)
    bag_of_words = vec.transform(corpus)
    sum_words = bag_of_words.sum(axis=0) 
    words_freq = [(word, sum_words[0, idx]) for word, idx in vec.vocabulary_.items()]
    words_freq =sorted(words_freq, key = lambda x: x[1], reverse=True)
    return words_freq[:n]
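
To make the mechanics of `get_top_n_words` easier to follow, here is a minimal sketch on a made-up three-name corpus (not the hostel dataset). It shows how the column sums of the bag-of-words matrix line up with `vocabulary_`, which maps each token to its column index.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up corpus for illustration only; not the hostel dataset.
toy_corpus = ["Abbey Hostel", "The Abbey Inn", "City Hostel"]

vec = CountVectorizer().fit(toy_corpus)
counts = vec.transform(toy_corpus)   # sparse matrix: rows = documents, columns = tokens
totals = counts.sum(axis=0)          # 1 x vocabulary_size matrix of column sums

# vocabulary_ maps each token to its column index in the matrix.
freq = {word: int(totals[0, col]) for word, col in vec.vocabulary_.items()}

# Sort by count (descending), then alphabetically to break ties.
print(sorted(freq.items(), key=lambda item: (-item[1], item[0])))
```

Note that `CountVectorizer` lowercases the text by default, which is why the tokens in the tables below are all lowercase.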

In [177]:
common_words = get_top_n_words(df_hostels['name'], 20)
df1 = pd.DataFrame(common_words, columns = ['name' , 'count'])

df1.head(n=20)

Unnamed: 0,name,count
0,hostel,54
1,the,20
2,house,14
3,hotel,6
4,inn,5
5,backpackers,4
6,guest,4
7,guesthouse,4
8,dublin,4
9,abbey,3


### Token (vocabulary) Frequency Distribution After Removing Stop Words

In [178]:
def get_top_n_words(corpus, n=None):
    vec = CountVectorizer(stop_words='english').fit(corpus)
    bag_of_words = vec.transform(corpus)
    sum_words = bag_of_words.sum(axis=0) 
    words_freq = [(word, sum_words[0, idx]) for word, idx in vec.vocabulary_.items()]
    words_freq =sorted(words_freq, key = lambda x: x[1], reverse=True)
    return words_freq[:n]
common_words = get_top_n_words(df_hostels['name'], 20)
df2 = pd.DataFrame(common_words, columns = ['name' , 'count'])

df2.head(n=20)

Unnamed: 0,name,count
0,hostel,54
1,house,14
2,hotel,6
3,inn,5
4,backpackers,4
5,guest,4
6,guesthouse,4
7,dublin,4
8,abbey,3
9,bar,3


### Hostel Name Word Count Distribution

In [179]:
df_hostels['word_count'] = df_hostels['name'].apply(lambda x: len(str(x).split()))
desc_lengths = list(df_hostels['word_count'])
print("Number of names:",len(desc_lengths),
      "\nAverage word count", np.average(desc_lengths),
      "\nMinimum word count", min(desc_lengths),
      "\nMaximum word count", max(desc_lengths))

Number of names: 102 
Average word count 2.911764705882353 
Minimum word count 2 
Maximum word count 6


Names are short: 2 to 6 words, with an average of about 2.9. Many hostels nevertheless use their name to its full potential, choosing a captivating name that appeals to travelers’ emotions and drives direct bookings.

### Text Preprocessing

The text is pretty clean, so there is not much to do, but we clean it anyway just in case.

In [180]:
# Raw strings avoid invalid-escape warnings for sequences such as \[ and \|.
REPLACE_BY_SPACE_RE = re.compile(r'[/(){}\[\]\|@,;]')
BAD_SYMBOLS_RE = re.compile(r'[^0-9a-z #+_]')
STOPWORDS = set(stopwords.words('english'))

def clean_text(text):
    """
        text: a string
        
        return: modified initial string
    """
    text = text.lower() # lowercase text
    text = REPLACE_BY_SPACE_RE.sub(' ', text) # replace separator symbols with a space
    text = BAD_SYMBOLS_RE.sub('', text) # delete any remaining character outside 0-9, a-z, space, #, + and _
    text = ' '.join(word for word in text.split() if word not in STOPWORDS) # remove stop words from text
    return text

# Despite the column name, desc_clean holds the cleaned hostel name, not the description.
df_hostels['desc_clean'] = df_hostels['name'].apply(clean_text)
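
As a quick check of what this cleaning does, the sketch below applies the same two regular expressions to three names that appear in this notebook. To stay runnable without downloading the NLTK corpus, it uses a tiny hand-picked stop-word set (`DEMO_STOPWORDS`, an assumption for illustration) in place of NLTK's English list, and `clean_text_demo` is a stand-in name for the function above.

```python
import re

REPLACE_BY_SPACE_RE = re.compile(r'[/(){}\[\]\|@,;]')
BAD_SYMBOLS_RE = re.compile(r'[^0-9a-z #+_]')
# Assumption: a tiny stand-in for NLTK's English stop-word list.
DEMO_STOPWORDS = {"the", "of", "and", "a"}

def clean_text_demo(text):
    """Lowercase, normalize separators, strip other symbols, drop stop words."""
    text = text.lower()
    text = REPLACE_BY_SPACE_RE.sub(' ', text)
    text = BAD_SYMBOLS_RE.sub('', text)
    return ' '.join(w for w in text.split() if w not in DEMO_STOPWORDS)

for name in ["The Times Hostel", "O'Donoghue Hostel", "Lovett's Hostel"]:
    print(repr(name), "->", repr(clean_text_demo(name)))
```

Apostrophes are deleted rather than replaced by a space, so "O'Donoghue" becomes the single token "odonoghue" and "Lovett's" becomes "lovetts".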

### Modeling

* Create a TF-IDF matrix of unigrams, bigrams, and trigrams from each cleaned hostel name.
* Compute the similarity between all hostels using sklearn’s linear_kernel. Because TfidfVectorizer L2-normalizes each row by default, this dot product is equivalent to cosine similarity in our case.
* Define a function that takes a hostel name as input and returns the top 10 recommended hostels.
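
The equivalence mentioned in the second bullet can be checked on a toy corpus. This is a minimal sketch using three made-up cleaned names (not the dataset itself); it relies on `TfidfVectorizer`'s default `norm='l2'`, under which every row has unit length and the dot product equals the cosine.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Made-up names for illustration only.
toy_names = ["times hostel", "times hostel camden", "sky backpackers"]

# Default norm='l2' scales every row to unit length.
toy_tfidf = TfidfVectorizer(ngram_range=(1, 3)).fit_transform(toy_names)

dot_products = linear_kernel(toy_tfidf, toy_tfidf)
cosines = cosine_similarity(toy_tfidf, toy_tfidf)

# Prints True when the two similarity matrices agree.
print(np.allclose(dot_products, cosines))
```

`linear_kernel` skips the normalization step that `cosine_similarity` repeats, which is why it is the cheaper choice when the rows are already unit length.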

In [181]:
df_hostels.set_index('name', inplace = True)
# min_df=1 keeps every term; recent scikit-learn versions reject the integer 0
tf = TfidfVectorizer(analyzer='word', ngram_range=(1, 3), min_df=1, stop_words='english')
tfidf_matrix = tf.fit_transform(df_hostels['desc_clean'])
cosine_similarities = linear_kernel(tfidf_matrix, tfidf_matrix)

indices = pd.Series(df_hostels.index)

def recommendations(name, cosine_similarities = cosine_similarities):
    
    recommended_hostels = []
    
    # getting the index of the hostel that matches the name
    idx = indices[indices == name].index[0]
    
    # creating a Series with the similarity scores in descending order,
    # dropping the query hostel itself so a tie cannot leave it in the results
    score_series = pd.Series(cosine_similarities[idx]).drop(idx).sort_values(ascending = False)

    # getting the indexes of the 10 most similar hostels
    top_10_indexes = list(score_series.iloc[:10].index)
    
    # populating the list with the names of the top 10 matching hostels
    for i in top_10_indexes:
        recommended_hostels.append(df_hostels.index[i])
        
    return recommended_hostels

### Recommendations

Let’s make some recommendations!

In [182]:
recommendations('The Times Hostel')

['Times Hostel Camden',
 'City Hostel',
 'Kinlay Hostel',
 'Galway Hostel',
 'Woodquay Hostel',
 'Sheilas Hostel',
 'Rainbow Hostel',
 "O'Donoghue Hostel",
 'Lynfield Hostel',
 "Lovett's Hostel"]

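Good results: the top recommendation, "Times Hostel Camden", shares both "times" and "hostel" with "The Times Hostel". The remaining nine match only on the word "hostel".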

Let's test for "Backpackers Citi Hostel".

In [183]:
recommendations('Backpackers Citi Hostel') 

['Sky Backpackers',
 'Macgabhainns Backpackers Hostel',
 'Garden Lane Backpackers',
 'The Times Hostel',
 'City Hostel',
 'Galway Hostel',
 'Kinlay Hostel',
 'Sheilas Hostel',
 'Rainbow Hostel',
 "Neptune's Hostel"]

The results are again good: the first three recommendations all contain "Backpackers", and the remaining seven match on "hostel".
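
To judge how strong each match is, it can help to show the similarity score next to each name. The sketch below is a self-contained illustration on five names copied from the outputs above. `top_matches_with_scores` is a hypothetical helper, not part of the notebook, and scores computed on this small subset will differ from those on the full 102-name dataset.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

# A handful of names copied from the outputs above (illustrative subset).
names = [
    "The Times Hostel",
    "Times Hostel Camden",
    "City Hostel",
    "Sky Backpackers",
    "Backpackers Citi Hostel",
]

# TfidfVectorizer lowercases by default and drops English stop words here.
vectorizer = TfidfVectorizer(ngram_range=(1, 3), stop_words="english")
matrix = vectorizer.fit_transform(names)
similarities = linear_kernel(matrix, matrix)

def top_matches_with_scores(name, k=3):
    """Hypothetical helper: return the k most similar names with their scores."""
    idx = names.index(name)
    scores = pd.Series(similarities[idx], index=names).drop(name)
    return scores.sort_values(ascending=False).head(k)

print(top_matches_with_scores("The Times Hostel"))
```

A score of 0 means the two names share no token after stop-word removal, so a list padded with low scores signals that only generic words such as "hostel" are matching.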