# NER Extraction:
This is the second sub-module of the `activities_recommendation` module. We first describe its purpose and challenges, then go over its features, and finally walk through the code.

## Context:
As mentioned earlier, the aim of the 4-level sub-module is to map Viator activities to the TripAdvisor attraction data. In the `pre-processing` sub-module, we split the Viator activities into mapped and unmapped activities (with reference to TripAdvisor activities). In this sub-module, we extract potential attraction names from each activity description. A later stage runs a `fuzzy` match against the TripAdvisor attraction data to finally map each activity to an attraction.

### Initialization
We import the tagger, instantiate it, and see how the extraction performs on a sample description.

In [8]:
VIATOR_UNMAPPED_ACTS = 'viator_unmapped_act.csv'
VIATOR_MAPPED_ACTS = 'viator_mapped_act.csv'
WORLD_CITIES = '../world-cities.csv'
WORLD_STATES_AND_REGIONS = '../State and Countries DDC September 2019.csv'
VIATOR_ACT_NER_DONE = 'viator_act_ner_done.csv'

import nltk
nltk.download('punkt')
from nltk.tag import StanfordNERTagger
from nltk.tokenize import word_tokenize
st = StanfordNERTagger('../stanford-ner-4.0.0/classifiers/english.all.3class.distsim.crf.ser.gz',
                       '../stanford-ner-4.0.0/stanford-ner.jar', encoding='utf-8')

[nltk_data] Downloading package punkt to /home/somu/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [16]:
import pandas as pd
pd.options.mode.chained_assignment = None
pd.options.display.max_columns = None
df = pd.read_csv(VIATOR_UNMAPPED_ACTS, low_memory=False)

In [10]:
df.ProductText.at[2]

"Enjoy an evening dinner show at the Moulin Rouge the number one show in Paris. With a choice of three different dinner menus, don't miss your chance to see the world-renowned showgirls and French Cancan dancers strut their stuff on the Moulin Rouge's historic stage. Moulin Rouge Paris sells out months in advance, so book ahead to avoid disappointment."


In [11]:
out = st.tag(word_tokenize(df.ProductText.at[2]))
print(out)

[('Enjoy', 'O'), ('an', 'O'), ('evening', 'O'), ('dinner', 'O'), ('show', 'O'), ('at', 'O'), ('the', 'O'), ('Moulin', 'LOCATION'), ('Rouge', 'LOCATION'), ('the', 'O'), ('number', 'O'), ('one', 'O'), ('show', 'O'), ('in', 'O'), ('Paris', 'LOCATION'), ('.', 'O'), ('With', 'O'), ('a', 'O'), ('choice', 'O'), ('of', 'O'), ('three', 'O'), ('different', 'O'), ('dinner', 'O'), ('menus', 'O'), (',', 'O'), ('do', 'O'), ("n't", 'O'), ('miss', 'O'), ('your', 'O'), ('chance', 'O'), ('to', 'O'), ('see', 'O'), ('the', 'O'), ('world-renowned', 'O'), ('showgirls', 'O'), ('and', 'O'), ('French', 'O'), ('Cancan', 'O'), ('dancers', 'O'), ('strut', 'O'), ('their', 'O'), ('stuff', 'O'), ('on', 'O'), ('the', 'O'), ('Moulin', 'ORGANIZATION'), ('Rouge', 'ORGANIZATION'), ("'s", 'O'), ('historic', 'O'), ('stage', 'O'), ('.', 'O'), ('Moulin', 'ORGANIZATION'), ('Rouge', 'ORGANIZATION'), ('Paris', 'ORGANIZATION'), ('sells', 'O'), ('out', 'O'), ('months', 'O'), ('in', 'O'), ('advance', 'O'), (',', 'O'), ('so', 'O'),

### From the Above Example:
- The tagger classifies every token in a sentence and assigns it a tag. From the classified text we need the tokens tagged `ORGANIZATION` and `LOCATION`. The original names are concatenations of consecutive classified tokens (for example `Moulin` + `Rouge`), so we need to join them back together. However, in `VIATOR_MAPPED_ACTS`, we already have an extracted name, which is the **second-to-last element in the `parents` array** (from TripAdvisor activities).

- The extractions are also not precise: the tagger picks up locations such as **Paris** and **Europe**. We need to filter out such entries because they tend to create a lot of false positives. So we will create some filter layers:
    - An exhaustive list of countries.
    - A list of major cities of the world (we need to filter these out because these are not attraction names).
    - A list of states or regions in the world.
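
The three filter layers above can be sketched with plain sets. This is a minimal illustration, not the notebook's implementation: the tiny `COUNTRIES`, `CITIES` and `REGIONS` sets and both helper functions are hypothetical stand-ins for the full lists, and the sketch assumes exact, case-insensitive matching.

```python
# Hypothetical miniature filter layers; the real lists are far larger.
COUNTRIES = {"france", "india", "italy"}
CITIES = {"paris", "mumbai", "rome"}
REGIONS = {"europe", "maharashtra", "lazio"}


def is_generic_place(name):
    """Return True if the name is a country, city or region rather than an attraction."""
    key = name.strip().lower()
    return key in COUNTRIES or key in CITIES or key in REGIONS


def drop_generic_places(candidates):
    """Keep candidate names that pass all three layers, lowercased and de-duplicated."""
    kept = []
    for name in candidates:
        key = name.strip().lower()
        if not is_generic_place(key) and key not in kept:
            kept.append(key)
    return kept


print(drop_generic_places(["Moulin Rouge", "Paris", "Europe", "moulin rouge"]))
```

With these sample sets, only `moulin rouge` survives: `Paris` and `Europe` are caught by the city and region layers, and the repeated name is dropped as a duplicate.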

In [12]:
# exhaustive list of countries
countries = [ "Afghanistan", "Åland Islands", "Albania", "Algeria", "American Samoa", "Andorra", "Angola", "Anguilla", "Antarctica", "Antigua and Barbuda", "Argentina", "Armenia", "Aruba", "Australia", "Austria", "Azerbaijan", "Bahamas", "Bahrain", "Bangladesh", "Barbados", "Belarus", "Belgium", "Belize", "Benin", "Bermuda", "Bhutan", "Bolivia (Plurinational State of)", "Bonaire, Sint Eustatius and Saba", "Bosnia and Herzegovina", "Botswana", "Bouvet Island", "Brazil", "British Indian Ocean Territory", "United States Minor Outlying Islands", "Virgin Islands (British)", "Virgin Islands (U.S.)", "Brunei Darussalam", "Bulgaria", "Burkina Faso", "Burundi", "Cambodia", "Cameroon", "Canada", "Cabo Verde", "Cayman Islands", "Central African Republic", "Chad", "Chile", "China", "Christmas Island", "Cocos (Keeling) Islands", "Colombia", "Comoros", "Congo", "Congo (Democratic Republic of the)", "Cook Islands", "Costa Rica", "Croatia", "Cuba", "Curaçao", "Cyprus", "Czech Republic", "Denmark", "Djibouti", "Dominica", "Dominican Republic", "Ecuador", "Egypt", "El Salvador", "Equatorial Guinea", "Eritrea", "Estonia", "Ethiopia", "Falkland Islands (Malvinas)", "Faroe Islands", "Fiji", "Finland", "France", "French Guiana", "French Polynesia", "French Southern Territories", "Gabon", "Gambia", "Georgia", "Germany", "Ghana", "Gibraltar", "Greece", "Greenland", "Grenada", "Guadeloupe", "Guam", "Guatemala", "Guernsey", "Guinea", "Guinea-Bissau", "Guyana", "Haiti", "Heard Island and McDonald Islands", "Holy See", "Honduras", "Hong Kong", "Hungary", "Iceland", "India", "Indonesia", "Côte d'Ivoire", "Iran (Islamic Republic of)", "Iraq", "Ireland", "Isle of Man", "Israel", "Italy", "Jamaica", "Japan", "Jersey", "Jordan", "Kazakhstan", "Kenya", "Kiribati", "Kuwait", "Kyrgyzstan", "Lao People's Democratic Republic", "Latvia", "Lebanon", "Lesotho", "Liberia", "Libya", "Liechtenstein", "Lithuania", "Luxembourg", "Macao", "Macedonia (the former Yugoslav Republic of)", "Madagascar", "Malawi", 
"Malaysia", "Maldives", "Mali", "Malta", "Marshall Islands", "Martinique", "Mauritania", "Mauritius", "Mayotte", "Mexico", "Micronesia (Federated States of)", "Moldova (Republic of)", "Monaco", "Mongolia", "Montenegro", "Montserrat", "Morocco", "Mozambique", "Myanmar", "Namibia", "Nauru", "Nepal", "Netherlands", "New Caledonia", "New Zealand", "Nicaragua", "Niger", "Nigeria", "Niue", "Norfolk Island", "Korea (Democratic People's Republic of)", "Northern Mariana Islands", "Norway", "Oman", "Pakistan", "Palau", "Palestine, State of", "Panama", "Papua New Guinea", "Paraguay", "Peru", "Philippines", "Pitcairn", "Poland", "Portugal", "Puerto Rico", "Qatar", "Republic of Kosovo", "Réunion", "Romania", "Russian Federation", "Rwanda", "Saint Barthélemy", "Saint Helena, Ascension and Tristan da Cunha", "Saint Kitts and Nevis", "Saint Lucia", "Saint Martin (French part)", "Saint Pierre and Miquelon", "Saint Vincent and the Grenadines", "Samoa", "San Marino", "Sao Tome and Principe", "Saudi Arabia", "Senegal", "Serbia", "Seychelles", "Sierra Leone", "Singapore", "Sint Maarten (Dutch part)", "Slovakia", "Slovenia", "Solomon Islands", "Somalia", "South Africa", "South Georgia and the South Sandwich Islands", "Korea (Republic of)", "South Sudan", "Spain", "Sri Lanka", "Sudan", "Suriname", "Svalbard and Jan Mayen", "Swaziland", "Sweden", "Switzerland", "Syrian Arab Republic", "Taiwan", "Tajikistan", "Tanzania, United Republic of", "Thailand", "Timor-Leste", "Togo", "Tokelau", "Tonga", "Trinidad and Tobago", "Tunisia", "Turkey", "Turkmenistan", "Turks and Caicos Islands", "Tuvalu", "Uganda", "Ukraine", "United Arab Emirates", "United Kingdom of Great Britain and Northern Ireland", "United States of America", "Uruguay", "Uzbekistan", "Vanuatu", "Venezuela (Bolivarian Republic of)", "Viet Nam", "Wallis and Futuna", "Western Sahara", "Yemen", "Zambia", "Zimbabwe" ]
countries = [country.lower() for country in countries]

# list of world cities and states
world_cities_df = pd.read_csv(WORLD_CITIES)
states_df = pd.read_csv(WORLD_STATES_AND_REGIONS)

# lowercase the names
world_cities_df = world_cities_df['name'].apply(lambda name: name.lower())
states_df = states_df['State'].apply(lambda name: name.lower())

### The Algorithm
- The algorithm is quite simple. We traverse the classified tokens and keep, for each extracted entity, a pair whose first element is the entity name and whose second element is the position of its last token. In each iteration we check whether the current position immediately follows the previous one. If it does, we concatenate the token to the previous name with a space (generating a longer name); if not, we start a new entity. This does the job in `O(n)` time.

- We then filter the extracted locations and organizations through the 3 layers defined earlier (countries, states, cities) and return the lists of locations and organizations as a tuple. Each `str.contains` check scans the whole city or state list, so this step costs roughly `O(k · m)` for `k` extracted names and `m` list entries.
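
The merging step can be tried out without the Stanford tagger by feeding it a hand-written tagged list. The sketch below is a simplified variant, not the function defined in this notebook: the `merge_adjacent` helper is hypothetical, it tracks the previous tag instead of token positions, and it leaves out the GeoText city check.

```python
def merge_adjacent(tagged, wanted=("LOCATION", "ORGANIZATION")):
    """Join runs of consecutive tokens that share a wanted tag into one name."""
    merged = {tag: [] for tag in wanted}
    prev_tag = None
    for token, tag in tagged:
        if tag in merged:
            if tag == prev_tag:
                # same class as the previous token: extend the latest name
                merged[tag][-1] += " " + token
            else:
                # a new run starts here
                merged[tag].append(token)
        prev_tag = tag
    return merged


# Hand-written sample in the tagger's (token, tag) output format.
tagged = [
    ("the", "O"), ("Moulin", "LOCATION"), ("Rouge", "LOCATION"),
    ("in", "O"), ("Paris", "LOCATION"), (".", "O"),
    ("Moulin", "ORGANIZATION"), ("Rouge", "ORGANIZATION"), ("sells", "O"),
]
result = merge_adjacent(tagged)
print(result["LOCATION"])
print(result["ORGANIZATION"])
```

On this sample the locations come out as `Moulin Rouge` and `Paris`, and the organizations as `Moulin Rouge`. Each token is visited once, which matches the linear cost described above.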

In [13]:
from geotext import GeoText
def get_locs_and_orgs(text):
    """Extracts the locations and organizations from the text.
    Arguments: text (string)
    Outputs: tuple(list, list) :: (locations, organizations)
    """
    
    # eliminate the brackets -- as they would eventually cause regex problems
    text = text.replace('(', "")
    text = text.replace(')', "")
    
    # use GeoText to identify place names as well.
    places = GeoText(text)
    tokenized_text = word_tokenize(text)
    classified_text = st.tag(tokenized_text)
    # e.g. classified_text = [('Moulin', 'ORGANIZATION'), ('Rouge', 'ORGANIZATION'), ('the', 'O'), ...]
    # each entry is [name, position of its last token]: [['Moulin', 0]] grows into [['Moulin Rouge', 1]]
    organizations = []
    locations = []
    
    # traverse the classified text and concatenate successive tokens that share the same class
    for i in range(len(classified_text)):
        if classified_text[i][1].lower() == 'organization': 
            
            # check if the position is adjacent to the latest organization added
            if organizations and i == organizations[-1][1] + 1:
                if classified_text[i][0] not in places.cities: 
                    
                    # if the token is not a city GeoText found, concatenate it and advance the latest organization position
                    organizations[-1][0] += ' ' + classified_text[i][0]
                    organizations[-1][1] += 1
            else:
                # append the organization
                organizations.append([classified_text[i][0], i])
        
        # same with the locations
        elif classified_text[i][1].lower() == 'location':
            if locations and i == locations[-1][1] + 1:
                if classified_text[i][0] not in places.cities:
                    locations[-1][0] += ' ' + classified_text[i][0]
                    locations[-1][1] += 1
            else:
                locations.append([classified_text[i][0], i])
    
    # lowercase GeoText's matches so they compare correctly with the lowercased candidates
    geo_cities = [c.lower() for c in places.cities]
    geo_countries = [c.lower() for c in places.countries]

    def keep(name, seen):
        """True if name is new and is not a country, city, state or region."""
        # regex=False does a plain substring test, so special characters cannot raise a regex error
        return (name not in seen and name not in geo_cities and
                name not in geo_countries and name not in countries and
                not world_cities_df.str.contains(name, regex=False).any() and
                not states_df.str.contains(name, regex=False).any())

    # filter organizations before appending to the final result
    final_orgs = []
    for o in organizations:
        if keep(o[0].lower(), final_orgs):
            final_orgs.append(o[0].lower())

    # filter locations before appending to the final result
    final_locs = []
    for w in locations:
        if keep(w[0].lower(), final_locs):
            final_locs.append(w[0].lower())
    
    # return the tuple
    return (final_locs, final_orgs)

### The Multi-processing wrapper
We use `pandarallel` for the job, running on 4 workers (`N_CORES`).

In [17]:
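import time
from pandarallel import pandarallel
N_CORES = 4
# set the worker count at initialization; parallel_apply takes no worker argument
pandarallel.initialize(nb_workers=N_CORES)

# sample on which extraction is to be done (copy so new columns do not touch df)
small_df = df.head(100).copy()

t0 = time.time()
small_df['output'] = small_df['ProductText'].parallel_apply(get_locs_and_orgs)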

# expand the output to unpack the location organization lists to different columns
small_df[['locations', 'organizations']] = small_df.apply(lambda entry: entry.output, result_type='expand', axis=1)

# drop the intermediate column
small_df.drop(columns = ['output'], inplace=True)
t1 = time.time()

print("Time elapsed(secs): ", t1 - t0, ' For: ', small_df.shape[0], ' records')
small_df.head(5)

INFO: Pandarallel will run on 4 workers.
INFO: Pandarallel will use Memory file system to transfer data between the main process and workers.
Time elapsed(secs):  149.28998446464539  For:  100  records


Unnamed: 0,Rank,ProductType,ProductCode,ProductName,Introduction,ProductText,Special,Duration,Commences,ProductImage,...,PriceSEK,PriceHKD,PriceSGD,PriceZAR,AvgRating,AvgRatingStarURL,BookingType,VoucherOption,locations,organizations
0,4,SITours_NEW,3951WESTDLX,Grand Canyon West Rim and Hoover Dam Tour from...,Hit the highway out of Las Vegas and spend the...,Hit the highway out of Las Vegas and spend the...,0,12 hours,"Las Vegas, United States",https://media.tacdn.com/media/attractions-spli...,...,1172.09,938.95,171.65,2214.89,4.5,http://www.partner.viator.com/images/stars/red...,Freesale,VOUCHER_E,"[grand canyon, grand canyon skywalk]",[]
1,5,SITours_NEW,2280LI_5H,Grand Canyon 4-in-1 Helicopter Tour,Take the ultimate Grand Canyon tour! You'll fl...,Take the ultimate Grand Canyon tour! You'll fl...,0,6 hours 30 minutes,"Las Vegas, United States",https://media.tacdn.com/media/attractions-spli...,...,6148.42,4925.44,900.43,11618.64,5.0,http://www.partner.viator.com/images/stars/red...,FreesaleOnRequest,VOUCHER_E,"[grand canyon, colorado river, west rim, grand...",[]
2,10,SITours_NEW,5022MOUDIN,Moulin Rouge Paris Dinner and Show Ticket,Enjoy an evening dinner show at the Moulin Rou...,Enjoy an evening dinner show at the Moulin Rou...,0,4 hours,"Paris, France",https://media.tacdn.com/media/attractions-spli...,...,2059.7,1609.57,301.61,3891.78,4.5,http://www.partner.viator.com/images/stars/red...,OnRequest,VOUCHER_E,[moulin rouge],[moulin rouge]
3,11,SITours_NEW,2800NYE,New York City New Year's Eve Circle Line Cruise,Celebrate the turn of the year in style on thi...,Celebrate the turn of the year in style on thi...,0,3 hours,"New York, United States",https://media.tacdn.com/media/attractions-spli...,...,2578.41,2065.54,377.61,4872.41,4.0,http://www.partner.viator.com/images/stars/red...,Freesale,VOUCHER_E,[],[]
4,54,SITours_NEW,5516ST1,Grand Canyon West Rim Luxury Helicopter Tour,Travel in five-star style and luxury to one of...,Travel in five-star style and luxury to one of...,0,3 hours,"Las Vegas, United States",https://media.tacdn.com/media/attractions-spli...,...,3272.5,2621.57,479.25,6184.03,5.0,http://www.partner.viator.com/images/stars/red...,FreesaleOnRequest,VOUCHER_E,"[grand canyon, hoover dam, lake mead, colorado...",[black canyon]


In [18]:
# export the data as csv; it is used by the 3rd sub-module, where we finally export the
# Viator activities mapped to TripAdvisor attractions
small_df.to_csv(VIATOR_ACT_NER_DONE, index=False, encoding='utf-8')

### Wrap Up
Now that we have exported the NER-extracted file, we are ready to fuzzy match it against the attraction data.
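
To give a rough idea of what that next step could look like, here is a minimal sketch using `difflib` from the standard library. It is an assumption for illustration only: the matcher actually used by the next sub-module is not shown here, and the `ATTRACTIONS` list, the `best_attraction` helper and the `0.8` cutoff are all hypothetical.

```python
from difflib import get_close_matches

# Hypothetical attraction names; the real ones come from the TripAdvisor data.
ATTRACTIONS = ["Moulin Rouge", "Grand Canyon Skywalk", "Hoover Dam", "Eiffel Tower"]


def best_attraction(name, attractions=ATTRACTIONS, cutoff=0.8):
    """Return the closest attraction name, or None if nothing is similar enough."""
    # compare in lowercase, but return the original spelling
    lookup = {a.lower(): a for a in attractions}
    hits = get_close_matches(name.lower(), list(lookup), n=1, cutoff=cutoff)
    return lookup[hits[0]] if hits else None


print(best_attraction("moulin rouge"))
print(best_attraction("hoover damm"))
print(best_attraction("black canyon"))
```

The cutoff trades recall for precision: a lower value maps more extracted names but admits more false matches, so it would need tuning against the mapped activities.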