### Description
##### The following code reads an input file of property descriptions and uses two functions, find_airport_station_bus_with_index and finding_distance, to extract the distance from the property to the airport, railway station, or bus stand mentioned in each description

### Approach

##### 1. Use the spaCy pre-trained NER model to list the airports, railway stations, and bus stands mentioned in the property description, together with their character index in the description
##### 2. Use that list and index to look at the words immediately to the left and right of each facility:
###### a) If the word directly to the left of the airport/railway station/bus stand is "from", look for km/m/meters/miles just before it and take the number next to that unit (pattern: "<number> <unit> from <facility>").
###### b) If the word directly to the right of the airport/railway station/bus stand is "is", look for km/m/meters/miles just after it and take the number next to that unit (pattern: "<facility> is <number> <unit>").
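
The two rules can be sketched on a toy sentence without spaCy. This is a minimal illustration, not the notebook's implementation: the helper name `distance_near`, the sample sentence, and the facility names are made up, and the facility span is found with `str.index` instead of NER.

```python
UNITS = {"km", "kms", "m", "meters", "miles"}

def distance_near(description, start, end):
    """Apply the "from" / "is" rules around description[start:end + 1]."""
    left = description[:start].split()
    right = description[end + 1:].split()
    # Rule a: "<number> <unit> from <facility>"
    if len(left) >= 3 and left[-1] == "from" and left[-2] in UNITS:
        return left[-3] + left[-2]
    # Rule b: "<facility> is <number> <unit>"
    if len(right) >= 3 and right[0] == "is" and right[2] in UNITS:
        return right[1] + right[2]
    return None

# Hypothetical description; the facility names stand in for NER output
text = "The hotel is 5 km from Chennai Airport and Central Railway Station is 2 km away."
for name in ["Chennai Airport", "Central Railway Station"]:
    start = text.index(name)
    print(name, "->", distance_near(text, start, start + len(name) - 1))
```

Only the words directly adjacent to the facility are inspected, which is why descriptions phrased any other way yield no distance.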

### Other Approaches
##### 1. Keyword-based: find the relevant words (airport, railway, junction, bus) in the normalized description, move right or left depending on whether "is" or "from" is nearby, check for km, m, meters, or miles in the vicinity, and return the number found just before the unit
##### 2. Train an NER model with Airport name, Railway Station, and Bus Stop as entities, then use it to find these facilities in a given sentence
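
A rough sketch of the keyword-based alternative, using only the standard library. The window size, the normalization steps, and the "closest unit wins" tie-break are assumptions made for illustration; they are not part of the notebook, and the heuristic can attach a distance to the wrong facility when two are mentioned close together.

```python
import re

KEYWORDS = {"airport": "airport", "railway": "railway", "junction": "railway", "bus": "bus"}
UNITS = {"km", "kms", "m", "meters", "miles"}
WINDOW = 6  # hypothetical: number of tokens scanned on each side of a keyword
NUMBER = re.compile(r"\d+(?:\.\d+)?")

def keyword_distances(description):
    # Normalize: lowercase, split "5km" into "5 km", keep numbers and words only
    text = re.sub(r"(\d)([a-z])", r"\1 \2", description.lower())
    tokens = re.findall(r"\d+(?:\.\d+)?|[a-z]+", text)
    found = {}
    for i, token in enumerate(tokens):
        category = KEYWORDS.get(token)
        if category is None or category in found:
            continue
        lo, hi = max(0, i - WINDOW), min(len(tokens), i + WINDOW + 1)
        window = tokens[lo:hi]
        if "from" not in window and "is" not in window:
            continue
        # Scan positions closest to the keyword first; take the first "<number> <unit>"
        for j in sorted(range(max(lo, 1), hi), key=lambda j: abs(j - i)):
            if tokens[j] in UNITS and NUMBER.fullmatch(tokens[j - 1]):
                found[category] = tokens[j - 1] + tokens[j]
                break
    return found

sample = "The hotel is 5 km from Chennai Airport. Central Railway Station is 2 km away."
print(keyword_distances(sample))
```

Unlike the NER-based approach, this needs no model, so it is much cheaper per description, at the cost of matching any sentence that happens to contain a keyword.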

### Challenges with current approach:
##### 1. Relies on the spaCy pre-trained NER model, which may not detect every airport, railway station, and bus stand
##### 2. The method is rule-based, so the extracted values are not guaranteed to be correct
##### 3. Slow execution (the full run below took about 8 minutes 44 seconds); the solution still needs optimizing
##### 4. Only captures descriptions whose wording matches the rules
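
One possible way to reduce the run time is a cheap keyword pre-filter. The NER function keeps only entities whose text contains airport, railway, junction, or bus, so a description that contains none of these words always produces an empty result and does not need to go through spaCy at all. The helper names below are hypothetical, and the sketch has not been timed on the dataset.

```python
KEYWORDS = ("airport", "railway", "junction", "bus")

def mentions_facility(description):
    """Cheap check mirroring the keyword filter applied to the FAC entities."""
    lowered = description.lower()
    return any(keyword in lowered for keyword in KEYWORDS)

def extract_with_prefilter(descriptions, extract):
    """Call the expensive `extract` function only on descriptions that can match."""
    return [extract(text) if mentions_facility(text) else {} for text in descriptions]
```

In the notebook, `extract` would be something like `lambda x: finding_distance(find_airport_station_bus_with_index(x), x)`. Batching the remaining descriptions through spaCy's `nlp.pipe` may help further, though that is untested here.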


### Importing Libraries

In [1]:
import pandas as pd
import spacy
import re
import string

nlp = spacy.load('en_core_web_lg')

### Reading Data

In [2]:
df = pd.read_csv('assignment_hotel_training.csv')

### Functions for finding the distance

In [7]:
### Facilities found in the description by the spaCy pre-trained NER model, with their character spans
def find_airport_station_bus_with_index(description):
    keywords = ('airport', 'railway', 'junction', 'bus')
    tagging_word = []
    for ent in nlp(description).ents:
        # Keep facility (FAC) entities whose text mentions one of the keywords
        if ent.label_ == 'FAC' and any(k in ent.text.lower() for k in keywords):
            tagging_word.append(ent.text)

    # Map each facility name to [start, end] (end inclusive) of its first occurrence;
    # names that are not preceded by a space in the description are skipped
    res = {key: [description.index(key), description.index(key) + len(key) - 1]
           for key in tagging_word if ' ' + key in description}

    return res


### Rule-based extraction of the airport/railway/bus distance from the description.
### Inputs: the result of find_airport_station_bus_with_index and the property description.
### Note: pre_processing_words is not defined in the cells shown here; it is assumed to come from an earlier cell.
UNITS = [' km ', ' kms ', ' m ', ' meters ', ' miles ']

def is_unit(token):
    # True when the space-padded token contains one of the space-padded unit words
    return any(unit in ' ' + token.strip() + ' ' for unit in UNITS)

def facility_category(name):
    # Map a facility name to the output key used in the distance dictionary
    name = name.lower()
    if 'airport' in name:
        return 'airport'
    if 'railway' in name or 'junction' in name:
        return 'railway'
    if 'bus' in name:
        return 'bus'
    return None

def finding_distance(facilities_with_index, description):
    distance = {}
    for key, (start, end) in facilities_with_index.items():
        category = facility_category(key)
        if category is None:
            continue
        try:
            left = description[:start].strip().split()
            right = description[end + 1:].strip().split()
            if left[-1] == 'from':
                # "<number> <unit> from <facility>"
                if is_unit(left[-2]):
                    distance[category] = left[-3] + left[-2]
                elif is_unit(pre_processing_words(left[-3].strip())):
                    distance[category] = left[-4] + left[-3]
            elif right[0] == 'is':
                # "<facility> is <number> <unit>"
                if is_unit(right[2]):
                    distance[category] = right[1] + right[2]
                elif is_unit(right[3]):
                    distance[category] = right[2] + right[3]
        except Exception:
            # Too few words around the facility (or any other failure): skip it
            continue

    return distance
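
A side note on the span lookup in find_airport_station_bus_with_index: `description.index(key)` returns only the first occurrence of a facility name, and the `' ' + key` check skips a name that opens the description. A small sketch of collecting every occurrence instead is shown below; `all_spans` is a hypothetical helper and the sentence is made up.

```python
import re

def all_spans(description, name):
    """Return [start, end] (end inclusive) for every occurrence of `name`."""
    return [[m.start(), m.end() - 1]
            for m in re.finditer(re.escape(name), description)]

text = "Chennai Airport is 5 km away. Taxis to Chennai Airport run all day."
print(all_spans(text, "Chennai Airport"))
```

Each span could then be passed through the same left/right rules, so a distance stated next to a later mention is not missed.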

In [8]:
### Pandas apply function on the complete dataset to find the distance
import datetime
t1 = datetime.datetime.now()
df['distance_metrics'] = df.property_description.apply(lambda x : finding_distance(find_airport_station_bus_with_index(x),x))
t2 = datetime.datetime.now()

print(t2-t1)

### Separating the dictionary into multiple columns
df['distance_from_airport'] = df['distance_metrics'].apply(lambda x: x.get('airport', ''))
df['distance_from_railway'] = df['distance_metrics'].apply(lambda x: x.get('railway', ''))
df['distance_from_bus'] = df['distance_metrics'].apply(lambda x: x.get('bus', ''))

0:08:44.389596


In [16]:
## This algorithm extracted at least one airport/railway/bus distance for about 45% of the rows provided
(df[df.distance_metrics != {}].shape[0]/df.shape[0])*100

45.074285714285715

In [18]:
df.to_csv('property_description_with_amenities_with_distance_metrics.tsv',sep = '\t')