### Description : 
##### The code below takes in the description and amenities as the input file and consolidates all the similar amenities into one

### Approach : 
##### 1. Found all unique amenities from the amenties column
##### 2. Descriptive analysis of the amenities and manually processing the data showed that there were many amenities which were joined together and needed processing : For example : Safety Deposit BoxMini Refrigerator have 2 different amenities attached to it which is "Safety Deposit Box" and "Mini Refrigerator"
##### 3. Pre-processed such amenities to separate them and then found the unique amenities from them.
##### 4. Pre-Processed each of theme amenities by removing punctuations, extra spaces, removing spaces at end , lowering and lemmatization
##### 5. Found vector representation of these processed words using spacy pre-trained model 'en_core_web_lg' model, giving a vector representation of 30Dim for each word
##### 6. Found Cosine similarity between all of these amenities to find a number between [0-1] to form a matrix
##### 7. Experimented with different threshold values of the similarity which is  0.80,0.85,0.90,0.95
##### 8. Manual analysis of the amenities and its corresponding similar words , chose 0.85 as the threshold value
#####  Examples :  'Breakfast buffet' : ['Breakfast buffet','Complimentary Breakfast', 'breakfast','Breakfast Buffet', 'Breakfast Room', 'Continental breakfast','Continental Breakfast', 'Restaurant (buffet)'] as results

#### Evaluation : Manually looked into the final list of consolidated amenities

### Other approaches : 
##### Train embedding model on hotel description to get word embeddings of all the amenities and then do a similarity matrix 
##### Need more experimentation on embedding size and also various other approache of embedding models like word2vec,glove,fasttext,BERT etc..

### Challenges : 
##### Dealing with Acronyms : For example TV is Television but they are different words right now
##### Some cases where there is a mixing of Amenities together like TVHot & Cold water (mixed together) , not able to separate them 
##### Phrase detection , right now the approach is take embeddings of all words and just average them out but the algo can perform better if we are able to find the phrases first and then train the model

In [None]:
# !pip3 install -U pip setuptools wheel
# !pip3 install -U spacy
# !python3 -m spacy download en_core_web_lg

#### Import Libraries

In [1]:
import pandas as pd
import spacy
import re
import string
import numpy as np
from collections import Counter
import scipy.spatial as sp
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
 
lemmatizer = WordNetLemmatizer()
nlp = spacy.load('en_core_web_lg')

#### Reading Data

In [2]:
df = pd.read_csv('assignment_hotel_training.csv')
df['hotel_ameneties_words'] = df.hotel_ameneties.apply(lambda x : x.split('|'))

In [3]:
df.head()

Unnamed: 0,hotel_ameneties,property_description,hotel_ameneties_words
0,Hot & Cold water|Credit cards accepted|Airpor...,New Soul Kiss Group of Houseboats |New Soul Ki...,"[ Hot & Cold water, Credit cards accepted, Air..."
1,Hot/cold Water |Parking Facility| Shower Area...,Hotel Raj International |Hotel Raj Internation...,"[ Hot/cold Water , Parking Facility, Shower A..."
2,Backup generator|Indoor Multi Cuisine Restaura...,Avisa Nila Beach Resort | Avisa Nila Beach Res...,"[Backup generator, Indoor Multi Cuisine Restau..."
3,Business centre|WiFiBathroom Toiletries |Baker...,Mayflower Hotel |Occupying a prime location on...,"[Business centre, WiFiBathroom Toiletries , Ba..."
4,Dining Table|Geyser In Bathroom|Bathroom Toile...,The Zara Residency Hotel Located in the heart ...,"[Dining Table, Geyser In Bathroom, Bathroom To..."


In [4]:
all_amenities = df.hotel_ameneties_words.tolist()
all_amenities = [a.strip() for b in all_amenities for a in b]

#### Analysis on amenities

In [5]:
amenities_df = pd.DataFrame.from_dict(Counter(all_amenities), orient='index').reset_index()
amenities_df.columns = ['amenity','count']

In [6]:
## Top 10 amenities
amenities_df.sort_values('count',ascending=False).head(10)

Unnamed: 0,amenity,count
11,Doctor on Call,5830
5,Laundry Service,4443
13,Parking Facility,4165
6,Room Service,3528
30,Attached Bathroom,3446
12,Hot/cold Water,3163
18,Bathroom Toiletries,3110
7,Television,3082
52,Restaurant,3050
33,Air Conditioning,2391


In [7]:
## Bottom 10 amenities
amenities_df.sort_values('count',ascending=False).tail(10)

Unnamed: 0,amenity,count
1742,CinemaDaily Newspaper,1
1743,Tour assistanceTelevision,1
1747,Video Game ArcadeTelephone,1
1749,Wi-Fi Internet accessClub Lounge Access,1
1754,Free parking (limited spaces)Hot/cold Water,1
1755,Air conditioningTelephone,1
1762,Outdoor pool (all year)Telephone,1
1767,Room service (24 hours)Cable T V,1
1769,Yoga ClassisTea/Coffee Maker,1
2380,Parking FacilityIroning Board,1


In [9]:
### Unique amenities 
print('Unique amenities: '+str(len(set(all_amenities))))

Unique amenities: 2381


#### Processing[removing punctuations,spaces,stripping,lowering, removing stopwords and lemmatization] along with Processing of the amenities list to form a unqiue list where the amenities are separated


In [10]:
remove = string.punctuation
regex = re.compile('[%s]' % re.escape(remove))

def pre_processing_words(sent):
    ## Removing punctuations, extra spaces, striping,lowering, stopwords and lemmatization
    new_sent = regex.sub(' ', sent)
    new_sent = re.sub(' +',' ',new_sent).strip().lower()
    text_tokens = new_sent.split()
    tokens_without_sw = [lemmatizer.lemmatize(word) for word in text_tokens if not word in stopwords.words('english')]
    final_sent = ' '.join(tokens_without_sw)
    
    return new_sent

In [11]:
def split_sentences(string_to_iterate):
    all_sents = []
    char_index = 0

    while char_index < len(string_to_iterate)-1:        
        if (string_to_iterate[char_index].islower() and string_to_iterate[char_index+1].isupper()) or (string_to_iterate[char_index]==')' and string_to_iterate[char_index+1].isupper()) or (string_to_iterate[char_index].islower() and string_to_iterate[char_index+1].isdigit()) or (string_to_iterate[char_index]== ')' and string_to_iterate[char_index+1].isdigit()):
            all_sents.append(string_to_iterate[:char_index+1])
            string_to_iterate = string_to_iterate[char_index+1:]
            char_index = 0
        else:
            char_index = char_index+1

    if len(string_to_iterate):
        all_sents.append(string_to_iterate)
        
        
    return all_sents


new_amen_list = []
for amen in set(all_amenities):
    processed_new_list = []
    all_words = split_sentences(amen)
    
    i = 0
    while i < len(all_words):
        if len(all_words) > 1:
            try:
                if 'Wi' in all_words[i] and 'Fi' in all_words[i+1]:
                    processed_new_list.append(all_words[i]+all_words[i+1])
                    del all_words[i+1]
                    i = i+1
                    
                else:
                    processed_new_list.append(all_words[i])
                    i = i+1
            
            except:
                i = i+1
                
            
        else:
            processed_new_list.append(all_words[i])
            i = i+1
    
    new_amen_list.append(processed_new_list)

new_amen_list = list(set([amen for sublist in new_amen_list for amen in sublist]))   


In [13]:
### Removing some amenities which seems irrelavant
new_amen_list.remove('i')
new_amen_list.remove('Ãƒ')
new_amen_list.remove('Ã‚Â©')
new_amen_list.remove('H')

In [14]:
### Unique amenities after pocessing
print('Unique amenities: '+str(len(set(new_amen_list))))

Unique amenities: 1391


#### Finding Vector Matrix for each unique amenity in the new list

In [15]:
word_embeddings = {}
word_embeddings_array = []
for amen in set(new_amen_list):
    if amen == '':
        continue
    else:
        word_embeddings[amen] =  nlp(pre_processing_words(amen)).vector
        word_embeddings_array.append(nlp(pre_processing_words(amen)).vector)

#### Finding the Cosine Similarity between each of the vectors

In [16]:
word_embeddings_array = np.array(word_embeddings_array)
similarity_scores = 1 - sp.distance.cdist(word_embeddings_array, word_embeddings_array, 'cosine')

In [17]:
#### Putting similar words with different threshold value into pandas dataframe for manual analysis
similarity_scores_df = pd.DataFrame(similarity_scores,columns=word_embeddings.keys(),index=word_embeddings.keys())
similarity_scores_df_80 = similarity_scores_df.apply(lambda row: list(row[row >= 0.8].index), axis=1)
similarity_scores_df_85 = similarity_scores_df.apply(lambda row: list(row[row >= 0.85].index), axis=1)
similarity_scores_df_90 = similarity_scores_df.apply(lambda row: list(row[row >= 0.9].index), axis=1)
similarity_scores_df_95 = similarity_scores_df.apply(lambda row: list(row[row >= 0.95].index), axis=1)

similarity_scores_df['similarity_scores_80'] = similarity_scores_df_80
similarity_scores_df['similarity_scores_85'] = similarity_scores_df_85
similarity_scores_df['similarity_scores_90'] = similarity_scores_df_90
similarity_scores_df['similarity_scores_95'] = similarity_scores_df_95

In [18]:
similarity_scores_df[['similarity_scores_80','similarity_scores_85','similarity_scores_90','similarity_scores_95']].head(100)


Unnamed: 0,similarity_scores_80,similarity_scores_85,similarity_scores_90,similarity_scores_95
Private Beach Area,"[Private Beach Area, Private beach, Beach Near...","[Private Beach Area, Private beach]","[Private Beach Area, Private beach]",[Private Beach Area]
Sun Beds (pool),"[Sun Beds (pool), Sun Terrace, Pool Terrace, S...","[Sun Beds (pool), Sun Beds, Sun beds]","[Sun Beds (pool), Sun Beds, Sun beds]",[Sun Beds (pool)]
Private check in/out,"[Private check in/out, Express Check-In/Check-...","[Private check in/out, Express Check-In/Check-...",[Private check in/out],[Private check in/out]
Wired high-speed Internet access (surcharge),"[Wired high-speed Internet access (surcharge),...","[Wired high-speed Internet access (surcharge),...","[Wired high-speed Internet access (surcharge),...","[Wired high-speed Internet access (surcharge),..."
Breakfast available (surcharge),"[Breakfast available (surcharge), Parking limi...",[Breakfast available (surcharge)],[Breakfast available (surcharge)],[Breakfast available (surcharge)]
...,...,...,...,...
Half board rates available,"[Half board rates available, Full board rates ...","[Half board rates available, Full board rates ...","[Half board rates available, Full board rates ...",[Half board rates available]
Electronic Key,"[Electronic Key, Electronic/magnetic keys]",[Electronic Key],[Electronic Key],[Electronic Key]
Mountain biking,"[Mountain biking, Hiking/biking trails, Hiking]","[Mountain biking, Hiking/biking trails]","[Mountain biking, Hiking/biking trails]",[Mountain biking]
Complimentary Airport/Railway Station Transfer,[Complimentary Airport/Railway Station Transfe...,[Complimentary Airport/Railway Station Transfe...,[Complimentary Airport/Railway Station Transfe...,[Complimentary Airport/Railway Station Transfe...


In [19]:
similarity_scores_df[similarity_scores_df.index == 'Free garage parking' ]['similarity_scores_85'][0]

['Free parking nearby',
 'Free covered parking',
 'Parking(free)',
 'Free secure outdoor parking',
 'Free garage parking',
 'Free valet parking',
 'Free outdoor parking',
 'Parking garage',
 'Free on-street parking',
 'Free secure parking',
 'Free Parking',
 'Parking (free)',
 'Free self parking']

In [20]:
similarity_scores_df[similarity_scores_df.index == 'Hot & Cold water' ]['similarity_scores_85'][0]


['TVHot/cold Water',
 'Hot & Cold water',
 'Hot Water',
 '24-hrs Hot & Cold Water',
 'hot water',
 'Hot & Cold Running Water',
 '24-hr Hot & Cold Water',
 'Hot/cold Water',
 'TVHot & Cold Running Water']

In [21]:
similarity_scores_df[similarity_scores_df.index == 'Breakfast buffet' ]['similarity_scores_85'][0]

['Continental Breakfast',
 'Breakfast Buffet',
 'Breakfast Room',
 'Breakfast buffet',
 'breakfast',
 'Restaurant (buffet)',
 'Complimentary Breakfast',
 'Continental breakfast']

In [22]:
### Output of similar words
similarity_scores_df[['similarity_scores_80','similarity_scores_85','similarity_scores_90','similarity_scores_95']].to_csv('similar_words_processed_amenities_v2.csv')
similarity_scores_df[['similarity_scores_85']].to_csv('similar_words_thresh85_processed_amenities_v2.csv')
