### Description : 
##### Given a property description, find the amenities mentioned in it.

### Approach :
##### 1. Load the list of unique amenities created in Question 1 and build a dictionary with the processed amenity as key and the raw amenity as value.
##### 2. Given a description, pre-process it by removing punctuation, collapsing extra spaces, stripping, lowercasing, removing stop words, and lemmatizing.
##### 3. The function extracting_amenities searches the processed description for every processed amenity and returns the raw amenities that are present.

### Evaluation :
##### 1. For each row, count how many extracted amenities match the given list of amenities and divide by the number of given amenities. The per-row score ranges from 0 to 1 and is averaged over all rows; a higher score means the extraction agrees more closely with the given amenities.
##### 2. A second measure: for each predicted amenity, check whether the same or a similar amenity is present in the actual list. As in approach 1, a higher score is better.
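
The second measure above could be sketched as follows. This is only an illustration under stated assumptions: the helper names, the use of `difflib.SequenceMatcher` as the similarity function, the 0.85 threshold, and dividing by the number of predicted amenities are all choices made here, not something the notebook specifies.

```python
from difflib import SequenceMatcher

def similar(a, b, threshold=0.85):
    """True when two amenity strings are equal or nearly equal (assumed threshold)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

def evaluate_similarity_approach(given_amenities, extracted_amenities, threshold=0.85):
    """Share of extracted amenities that have a same-or-similar entry in the given list."""
    if not extracted_amenities:
        return 0.0
    hits = sum(
        any(similar(pred, actual, threshold) for actual in given_amenities)
        for pred in extracted_amenities
    )
    return hits / len(extracted_amenities)

# Toy data, not taken from the dataset
given = ["Wi-Fi Internet", "Laundry Service", "Parking"]
predicted = ["WiFi Internet", "Laundry", "Swimming Pool"]
score = evaluate_similarity_approach(given, predicted)
print(score)
```

In the toy data only "WiFi Internet" is close enough to a given amenity, so a lower threshold would be needed for "Laundry" to count as similar to "Laundry Service".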

### Other Approaches :
##### 1. The current method only checks whether a whole amenity string is present in the description. We could instead match n-grams built from the amenities against the description to get a more exhaustive list.
##### 2. Train a custom NER model: tag amenity spans in the descriptions with an entity label such as AMEN and train on those annotations.
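
A minimal sketch of the n-gram idea from point 1, assuming whitespace tokenization and bigrams. The function names and the rule that an amenity matches when any one of its n-grams occurs in the description are assumptions made for illustration.

```python
def ngrams(tokens, n):
    """All contiguous n-token tuples of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngram_match(amenity, description, n=2):
    """True when any n-gram of the amenity occurs as an n-gram of the description.
    Amenities shorter than n tokens are compared as a whole."""
    amen_tokens = amenity.lower().split()
    desc_tokens = description.lower().split()
    size = min(n, len(amen_tokens))
    if size == 0:
        return False
    desc_grams = set(ngrams(desc_tokens, size))
    return any(g in desc_grams for g in ngrams(amen_tokens, size))

# Toy data, not taken from the dataset
description = "free internet access in all rooms and a large swimming pool"
amenities = ["wi fi internet access", "swimming pool", "room service", "pool"]
found = [a for a in amenities if ngram_match(a, description)]
print(found)
```

Here "wi fi internet access" is found through its bigram "internet access" even though the full string never appears, which is the extra coverage the n-gram approach is meant to give. The trade-off is more false positives when an amenity shares a common n-gram with unrelated text.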

### Challenges with this approach :
##### 1. For this method to work, the amenity list has to be kept up to date.
##### 2. If the description uses a completely new word, this method cannot identify it, because it only searches from a given list of amenities.
##### 3. Data challenges: amenities listed in the given data are sometimes not mentioned in the description at all, which makes it hard to evaluate how this rule-based method performs.
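
One way to get a feel for the data challenge in point 3 is to measure how many of the given amenities literally occur in the description. The sketch below uses a hypothetical helper, `mentioned_share`, and plain lowercase substring matching; both are assumptions made here, and the result is only a rough ceiling on what a string-match method could score.

```python
def mentioned_share(given_amenities, description):
    """Fraction of the given amenities that appear verbatim (case-insensitive) in the description."""
    if not given_amenities:
        return 0.0
    desc = description.lower()
    mentioned = [a for a in given_amenities if a.lower() in desc]
    return len(mentioned) / len(given_amenities)

# Toy data, not taken from the dataset
share = mentioned_share(
    ["Laundry", "Parking", "Doctor on call", "Geyser"],
    "The hotel offers laundry and free parking for all guests.",
)
print(share)
```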


### Importing Libraries

In [1]:
import pandas as pd
import re
import string
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
import numpy as np
import pickle
 
lemmatizer = WordNetLemmatizer()

### Reading Data and All Unique Amenities from previous problem

In [2]:
df = pd.read_csv('assignment_hotel_training.csv')
all_amenities_df = pd.read_csv('similar_words_thresh85_processed_amenities_v2.csv')
all_amenities = list(all_amenities_df['Unnamed: 0'])

In [3]:
df.head()

Unnamed: 0,hotel_ameneties,property_description
0,Hot & Cold water|Credit cards accepted|Airpor...,New Soul Kiss Group of Houseboats |New Soul Ki...
1,Hot/cold Water |Parking Facility| Shower Area...,Hotel Raj International |Hotel Raj Internation...
2,Backup generator|Indoor Multi Cuisine Restaura...,Avisa Nila Beach Resort | Avisa Nila Beach Res...
3,Business centre|WiFiBathroom Toiletries |Baker...,Mayflower Hotel |Occupying a prime location on...
4,Dining Table|Geyser In Bathroom|Bathroom Toile...,The Zara Residency Hotel Located in the heart ...


### Processing function

In [4]:
regex = re.compile('[%s]' % re.escape(string.punctuation))
# Build the stop-word set once instead of reloading the list for every token
stop_words = set(stopwords.words('english'))

def pre_processing_words(sent):
    ## Remove punctuation, collapse extra spaces, strip, lowercase, remove stop words, lemmatize
    new_sent = regex.sub(' ', sent)
    new_sent = re.sub(' +', ' ', new_sent).strip().lower()
    text_tokens = new_sent.split()
    tokens_without_sw = [lemmatizer.lemmatize(word) for word in text_tokens if word not in stop_words]
    final_sent = ' '.join(tokens_without_sw)

    return final_sent

In [5]:
## Put all amenities into a dictionary: the key is the processed amenity, the value is the corresponding raw amenity
amenities_dict = {}
for amen in all_amenities:
    k = pre_processing_words(amen)
    amenities_dict[k] = amen

In [6]:
with open('amenities_dictionary.pkl', 'wb') as f:
    pickle.dump(amenities_dict, f)

### Logic to find amenities in description 

In [7]:
def extracting_amenities(amenities_dict,description):
    description = pre_processing_words(description)
    amenities = []
    for amen in amenities_dict.keys():
        if amen in description:
            amenities.append(amenities_dict[amen])
    return amenities
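
Because `amen in description` is a plain substring test, a short key such as `ac` can match inside unrelated words. The sketch below shows a possible whole-word variant. The function name `extract_with_word_boundaries` and the toy dictionary are hypothetical, and the input is assumed to be already pre-processed text.

```python
import re

def extract_with_word_boundaries(amenities_dict, processed_description):
    """Like the substring search, but a key only counts when it matches whole words."""
    found = []
    for key, raw in amenities_dict.items():
        if not key:  # skip keys that became empty after pre-processing
            continue
        if re.search(r'\b' + re.escape(key) + r'\b', processed_description):
            found.append(raw)
    return found

# Toy data, not taken from the dataset
toy_dict = {"ac": "ac", "card": "Cards", "laundry service": "Laundry Service"}
text = "peaceful place laundry service credit card accepted"

plain = [raw for key, raw in toy_dict.items() if key in text]  # substring test
strict = extract_with_word_boundaries(toy_dict, text)          # whole-word test
print(plain)
print(strict)
```

In the toy text the substring test reports `ac` because of "peaceful" and "place", while the whole-word test does not. Whether this helps on the real data would need to be checked against the evaluation.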

### Given a description and the amenities dictionary, find the amenities in the description

In [8]:
extracting_amenities(amenities_dict,df.property_description[0])

['Housekeeping',
 'Laundry',
 'Laundry Service',
 'Television',
 'Wi-fi',
 'Wi-Fi Internet access',
 'Cards',
 'Hot/cold Water',
 'Dry Cleaning',
 'Telephone',
 'Wi-Fi Internet',
 'Credit cards accepted',
 'Doctor on call',
 'Internet access',
 'ac',
 'Airport Transfer',
 'Phone']

### Evaluation

In [13]:
def split_sentences(string_to_iterate):
    all_sents = []
    char_index = 0

    while char_index < len(string_to_iterate)-1:        
        cur, nxt = string_to_iterate[char_index], string_to_iterate[char_index+1]
        # Split where a lowercase letter or ')' is directly followed by an uppercase letter or a digit
        if (cur.islower() or cur == ')') and (nxt.isupper() or nxt.isdigit()):
            all_sents.append(string_to_iterate[:char_index+1])
            string_to_iterate = string_to_iterate[char_index+1:]
            char_index = 0
        else:
            char_index = char_index+1

    if len(string_to_iterate):
        all_sents.append(string_to_iterate)

    
    return all_sents


def process_amenities(hotel_amenities):
    all_amenities = hotel_amenities.split('|')
    all_amenities = [amen.strip() for amen in all_amenities]
    
    new_amen_list = []
    for amen in set(all_amenities):
        processed_new_list = []
        all_words = split_sentences(amen)
        i = 0
        while i < len(all_words):
            if len(all_words) > 1:
                try:
                    if 'Wi' in all_words[i] and 'Fi' in all_words[i+1]:
                        processed_new_list.append(all_words[i]+all_words[i+1])
                        del all_words[i+1]
                        i = i+1

                    else:
                        processed_new_list.append(all_words[i])
                        i = i+1

                except IndexError:
                    # all_words[i+1] does not exist when i is the last index
                    i = i+1
                
            else:
                processed_new_list.append(all_words[i])
                i = i+1
    
        new_amen_list.append(processed_new_list)
    
    return new_amen_list


In [10]:
df['extracted_amenities'] = df.property_description.apply(lambda x : extracting_amenities(amenities_dict,x))

In [14]:
df['all_hotel_amenities'] = df.hotel_ameneties.apply(lambda x : [a for sublist in process_amenities(x) for a in sublist])

In [30]:
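def evaluate_string_match_approach(given_amenities, extracted_amenities):
    ## Number of extracted amenities found in the given list, divided by the number of given amenities
    den = len(given_amenities)
    if den == 0:
        # No given amenities for this row; avoid division by zero
        return 0.0
    processed_given_amen = {pre_processing_words(i) for i in given_amenities}
    sum_a = 0
    for amen in extracted_amenities:
        if pre_processing_words(amen) in processed_given_amen:
            sum_a = sum_a+1

    return sum_a/den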

In [31]:
df['evaluation_num'] = df.apply(lambda x : evaluate_string_match_approach(x['all_hotel_amenities'],x['extracted_amenities']),axis=1)

In [34]:
print('Performance'+' : '+str(df['evaluation_num'].mean()*100))

Performance : 34.73111984069456
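
The score above divides the matches by the number of given amenities, so it behaves like a recall and does not penalize extra predictions. A possible complement is to report precision and F1 as well. The sketch below is an assumption-laden illustration: the function name is hypothetical, and it compares lowercased strings exactly rather than using `pre_processing_words`.

```python
def precision_recall_f1(given_amenities, extracted_amenities):
    """Set-based precision, recall and F1 between given and extracted amenities."""
    given_set = {g.strip().lower() for g in given_amenities}
    extracted_set = {e.strip().lower() for e in extracted_amenities}
    true_pos = len(given_set & extracted_set)
    precision = true_pos / len(extracted_set) if extracted_set else 0.0
    recall = true_pos / len(given_set) if given_set else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# Toy data, not taken from the dataset
p, r, f = precision_recall_f1(
    ["Laundry", "Parking", "Wi-Fi", "Geyser"],
    ["Laundry", "Wi-Fi", "Telephone"],
)
print(p, r, f)
```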
