# Pre-processing:
This is the first sub-module of the `activities_recommendation` module. We first describe the purpose of the module and its challenges, then cover what this sub-module does, and finally walk through the code.

## Activities Recommendation
Each trip consists of a certain number of days. In addition to travel and hotel stays, travellers can engage in different activities during their trip. The Activity Recommendation Logic analyzes Viator / TripAdvisor activity and attraction data and generates day-wise activity recommendations that take the travellers' schedule, budget and interests into account.

### Challenges
Even before we get to the recommendation logic, the initial data presents several challenges:

#### The expected data for each activity
- For each activity we need the following data: `num_reviews`, `star_rating`, `prices`, `duration`, `age-group allowed`, `location`, `timings`
- We also need data for the attraction to which an activity is mapped: `attraction_name`, `star_rating`, `num_reviews`, `hours` (availability times), `location`

In the default data, however, the activity data has no association with the attraction data. In short, we need to map the activities to the attractions ourselves. Moreover, the Viator data alone does not contain all the attributes we need for an activity; some of them are only present in a third dataset, the `ta_activity` dataset.

#### The mapping
Even with the right datasets identified, mapping the activity data to the attraction data is not straightforward, since the two have no common column on which we could do an inner join. We therefore first **extract** candidate names from the activity data and then map them to the attraction data. We will see how this is done using the `Stanford NER Tagger` and `fuzzy matching`.
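
As a rough illustration of the fuzzy-matching half of this step, the sketch below scores a candidate name against a list of attraction names using Python's standard-library `difflib`. The sample names, the `best_match` helper and the 0.6 cutoff are assumptions made for illustration; the later sub-modules may well use a different matcher and threshold.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Return a 0..1 similarity ratio between two strings, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def best_match(candidate, attraction_names, cutoff=0.6):
    """Return (name, score) of the closest attraction, or None if below cutoff."""
    scored = [(name, similarity(candidate, name)) for name in attraction_names]
    name, score = max(scored, key=lambda pair: pair[1])
    return (name, score) if score >= cutoff else None


# Hypothetical attraction names, not taken from the real attraction dataset.
attractions = ["Grand Canyon National Park", "Moulin Rouge", "Vatican Museums"]

match = best_match("Moulin Rouge Paris", attractions)
no_match = best_match("Desert Safari", attractions)
```

Here `match` picks `"Moulin Rouge"`, while `no_match` is `None` because no attraction name is similar enough. The cutoff trades missed mappings against wrong ones and would need tuning on real data.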

#### The optimization
Even once the matching itself works, the volume is large: we have about **240 billion** comparisons to do. The code therefore needs to be optimized carefully, which is one more reason to preprocess the data properly.
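
One common way to cut down a pairwise comparison count is to compare only records that share a cheap key (often called blocking). The sketch below counts comparisons with and without such a key on made-up records. Using the city as the key is an assumption here, not necessarily the optimization used later in this module.

```python
from collections import Counter


def count_all_pairs(activities, attractions):
    """Comparisons needed when every activity is compared to every attraction."""
    return len(activities) * len(attractions)


def count_blocked_pairs(activities, attractions, key):
    """Comparisons needed when only records sharing `key` are compared."""
    act_counts = Counter(a[key] for a in activities)
    attr_counts = Counter(a[key] for a in attractions)
    # A key missing from attr_counts contributes 0 comparisons.
    return sum(n * attr_counts[k] for k, n in act_counts.items())


# Toy records (hypothetical); real rows would come from the datasets above.
activities = [{"city": "Paris"}, {"city": "Paris"}, {"city": "Doha"}]
attractions = [{"city": "Paris"}, {"city": "Doha"}, {"city": "Doha"}, {"city": "Roma"}]

naive = count_all_pairs(activities, attractions)                # 3 * 4 = 12
blocked = count_blocked_pairs(activities, attractions, "city")  # 2*1 + 1*2 = 4
```

The saving grows with the number of distinct key values, but any record whose key is wrong or missing can no longer be matched, so the key has to be reliable.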

## In this sub-module
In this sub-module we have two input data sources:
- TripAdvisor activities data: collected by an `os.walk` through the server, recursively (around 73,000 entries)
- Viator activity dataset: available as a `.csv` file (around 390,000 entries)

Our aim in this sub-module is to sanitize the **Viator activity dataset** by removing the null rows, and then map it to the **TripAdvisor activities**. This gives us all the attributes we need for the activities (except `age-groups allowed`; after manually picking through and checking the data, I didn't find this key anywhere). We export the data at each step to `.csv` files for later use. These are the outputs we will have by the end of this sub-module:
- `TA_ACTS_OS_WALK` (approx. 73,000 entries): the TripAdvisor activity data from the `os.walk`.
- `VIATOR_ACTS_CLEANED` (approx. 268,000 entries): the Viator activities after the null rows are removed.
- `MAPPED_ACT_VIA_TO_TA` (approx. 32,000 entries): the activities mapped from Viator to TripAdvisor, with the columns of both datasets.
- `VIATOR_MAPPED_ACTS` (approx. 32,000 entries): the mapped Viator activities, with the columns from the Viator dataset alone.
- `VIATOR_UNMAPPED_ACTS` (approx. 236,000 entries): the unmapped Viator activities.

In [2]:
# imports and global pandas settings for the notebook
import pandas as pd
import os
import json
pd.options.display.max_columns = None

# input CSV locations
VIATOR_ACTS = '../vapProducts.csv'

# output CSV files to be written
TA_ACTS_OS_WALK = 'ta_activities_os_walk.csv'
VIATOR_ACTS_CLEANED = 'viator_activities_cleaned.csv'
MAPPED_ACT_VIA_TO_TA = 'viator_to_ta_mapped_act.csv'
VIATOR_MAPPED_ACTS = 'viator_mapped_act.csv'
VIATOR_UNMAPPED_ACTS = 'viator_unmapped_act.csv'

### Create a DataFrame for TripAdvisor activities
We do an `os.walk` to collect this data, since it is the default data source we have. We also save the result to a CSV file for later use, because running `os.walk` every time is slow.

In [3]:
# ta_activities data from os walk
data_root_folder = "/mnt/disks/data-01/tapa/activity_index"
TA_DATA_FILE_PATH = data_root_folder + '/raw_data/trip_advisor_products/'
ta_data = []    
for folder, subs, files in os.walk(TA_DATA_FILE_PATH):
    for filename in files:
        with open(os.path.join(folder, filename), 'r') as src:
            for l in src:
                ta_data.append(json.loads(l))
adf = pd.DataFrame(ta_data)

# representation
adf.head(5)

Unnamed: 0,star_rating,detailed_ratings,details,certificate,tags,num_reviews,product_code,inside_city_rank,vendor,category,highlights,similar_activity_names,parents,name,location,parent_paths,stored_directory,similar_activity_codes,address,price,parent_ids,current_date,review_desc,extra_info,status,description,url
0,4.5,,"[Visit Khor Al Adeid, one of the few places in...",,,5.0,25649P11,,[Qatar International Adventures (QIA)],,[Full-day private Qatar desert safari from Doh...,"[Full Day Desert Safari with Lunch or Dinner, ...","[Middle East, Qatar, Doha, Places to visit in ...",Full-day Private Qatar Desert Safari from Doha,"{'lng': 51.498245, 'city_id': 294009, 'lat': 2...","[/Tourism-g21-Middle_East-Vacations.html, /Tou...",21/294008/294009,"[12898432, 11988610, 11988619]",,€ 99.40,"[21, 294008, 294009, 294009, 294009]",1527784859.000944,"[{'rating': 5.0, 'desc': 'Had the BEST day on ...",,True,"{'rating': 5.0, 'desc': 'Had the BEST day on t...",https://www.tripadvisor.in/AttractionProductDe...
1,,,"[Enjoy a desert safari in Qatar, an experience...",,,,52343P19,,[],,[],[],"[Middle East, Qatar, Doha, Places to visit in ...",Full Day Desert Safari with Lunch or Dinner,"{'lng': 51.458443, 'city_id': 294009, 'lat': 2...","[/Tourism-g21-Middle_East-Vacations.html, /Tou...",21/294008/294009,[],,"₹ 7,582.90","[21, 294008, 294009, 294009, 294009]",1527784861.7445896,[],,True,{},https://www.tripadvisor.in/AttractionProductDe...
2,4.0,,[Go dune bashing in the sandy expanse Qatar's ...,,,1.0,25649P3,,[Qatar International Adventures (QIA)],,[Overnight accommodation in a luxury Bedouin-s...,[],"[Middle East, Qatar, Doha, Places to visit in ...",Desert Safari with Overnight Camping from Doha,"{'lng': 51.53805, 'city_id': 294009, 'lat': 25...","[/Tourism-g21-Middle_East-Vacations.html, /Tou...",21/294008/294009,[],,"₹ 9,004.70","[21, 294008, 294009, 294009, 294009]",1527784834.872959,"[{'rating': 4.0, 'desc': 'This trip was very e...",,True,"{'rating': 4.0, 'desc': 'This trip was very ex...",https://www.tripadvisor.in/AttractionProductDe...
3,,,"[ , , , , , , , , , , , ]",,,,53513P25,,[365 Adventures],,[],[],"[Middle East, Qatar, Doha, Places to visit in ...",Education and Sports City Tour,"{'lng': 51.50022, 'city_id': 294009, 'lat': 25...","[/Tourism-g21-Middle_East-Vacations.html, /Tou...",21/294008/294009,[],,"₹ 3,385.20","[21, 294008, 294009, 294009, 294009]",1527784851.8813367,[],,True,{},https://www.tripadvisor.in/AttractionProductDe...
4,,,[Enjoy access to the inland sea of Khor Al Ade...,,,,52343P2,,[],,"[Hotel pickup and drop-off in Doha, Guided com...","[Full Day Desert Safari with Lunch or Dinner, ...","[Middle East, Qatar, Doha, Places to visit in ...",Inland Sea Desert Safari with BBQ at Souq Al W...,"{'lng': 51.486706, 'city_id': 294009, 'lat': 2...","[/Tourism-g21-Middle_East-Vacations.html, /Tou...",21/294008/294009,"[12898432, 11988610, 11988772]",,"₹ 8,192.30","[21, 294008, 294009, 294009, 294009]",1527784844.6952927,[],,True,{},https://www.tripadvisor.in/AttractionProductDe...


In [3]:
# export ta_activity data as .csv file (TA_ACTS_OS_WALK is the constant defined in the first cell)
adf.to_csv(TA_ACTS_OS_WALK, index=False, encoding='utf-8')
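
One thing to keep in mind when reloading this cache: CSV stores nested values such as the `location` dicts or the `parent_ids` lists as plain text. The sketch below round-trips a tiny made-up frame and restores the nested columns with `ast.literal_eval`. The `parse_literal` helper, the toy values and the file name are assumptions for illustration.

```python
import ast
import os
import tempfile

import pandas as pd


def parse_literal(value):
    """Turn a stringified list/dict back into a Python object; blanks become None."""
    if not isinstance(value, str) or value == "":
        return None
    return ast.literal_eval(value)


# Tiny frame with the same kind of nested columns as the TripAdvisor data.
toy = pd.DataFrame({
    "product_code": ["A1"],
    "location": [{"lat": 1.5, "lng": 2.5}],
    "parent_ids": [[21, 294008]],
})
path = os.path.join(tempfile.mkdtemp(), "toy_cache.csv")
toy.to_csv(path, index=False, encoding="utf-8")

# Without converters the nested columns come back as strings.
raw = pd.read_csv(path)

# With converters they are parsed back into dicts and lists.
parsed = pd.read_csv(
    path,
    converters={"location": parse_literal, "parent_ids": parse_literal},
)
```

`ast.literal_eval` only evaluates Python literals, so it is a safer choice than `eval` for text read from a file.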

### Import the Viator activity data:
Now we import the data for the Viator activities. We pass `low_memory=False` to `read_csv` to avoid the mixed-dtypes warning. Finally, we export the cleaned data for later use; it is saved in the `viator_activities_cleaned.csv` file.

#### A few notes on this data:
- **This data has a `ProductCode` column which can be used to map it to the TripAdvisor activities column `product_code`.**
- More than 100k rows of this data are `NaN`, i.e. completely empty (the row count drops from around 390,000 to around 268,000). We sanitize the data by removing such rows.
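
The cleaning step relies on the observation that a missing `ProductCode` means the whole row is empty. The sketch below shows, on a made-up three-row frame, how that shortcut differs from dropping only fully empty rows, and how one might check whether any partially filled rows would be lost. The toy values are assumptions for illustration.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "ProductCode": ["A1", np.nan, np.nan],
    "ProductName": ["Tour", np.nan, "Orphan row"],
})

# Drop only the rows in which every column is empty.
all_empty_dropped = toy.dropna(how="all")

# Drop every row that lacks a ProductCode, as the cell below does.
no_code_dropped = toy.dropna(subset=["ProductCode"])

# Rows that hold some data but no ProductCode. If this is empty on the real
# data, the two approaches above give the same result.
partial_rows = toy[toy["ProductCode"].isna() & toy.notna().any(axis=1)]
```

On this toy frame the two approaches differ by one row, the one that appears in `partial_rows`.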

In [5]:
# data from viator activities
via_df = pd.read_csv(VIATOR_ACTS, low_memory=False)

# clean data by removing the null rows; here if 'ProductCode' is null, then the whole row is null (you can
# also pass the full list of columns as the subset, but for us this does the job, and it is fast)
via_df.dropna(subset=['ProductCode'], inplace=True)

# export the cleaned data as a csv file
via_df.to_csv(VIATOR_ACTS_CLEANED, index=False, encoding='utf-8')

# representation
via_df.head(5)

Unnamed: 0,Rank,ProductType,ProductCode,ProductName,Introduction,ProductText,Special,Duration,Commences,ProductImage,ProductImageThumb,DestinationID,Continent,Country,Region,City,IATACode,Group1,Category1,Subcategory1,Group2,Category2,Subcategory2,Group3,Category3,Subcategory3,ProductURL,PriceAUD,PriceNZD,PriceEUR,PriceGBP,PriceUSD,PriceCAD,PriceCHF,PriceNOK,PriceJPY,PriceSEK,PriceHKD,PriceSGD,PriceZAR,AvgRating,AvgRatingStarURL,BookingType,VoucherOption
0,1,SITours_NEW,2280AAHT,Grand Canyon All-American Helicopter Tour,Take off from McCarran Airport on an exhilarat...,Take off from McCarran Airport on an exhilarat...,0,3 hours 30 minutes,"Las Vegas, United States",https://media.tacdn.com/media/attractions-spli...,https://media.tacdn.com/media/attractions-spli...,684.0,Northern America,United States,Nevada,Las Vegas,LAS,"Air, Helicopter & Balloon Tours",Helicopter Tours,Helicopter Tour,Tours & Sightseeing,Self-guided Tours & Rentals,Self Guided Tours,Tours & Sightseeing,Audio Guided Tours,Audio Guided Tours,http://www.partner.viator.com/en/66575/tours/L...,637.45,689.81,38167,340.92,409.99,583.66,407.34,"4 196,38",45119,4065.86,3257.12,595.44,7683.23,5.0,http://www.partner.viator.com/images/stars/red...,FreesaleOnRequest,VOUCHER_E
1,2,SITours_NEW,5022MOULIN,Moulin Rouge Show Ticket Paris,The Moulin Rouge is the number one show in Par...,The Moulin Rouge is the number one show in Par...,0,2 hours,"Paris, France",https://media.tacdn.com/media/attractions-spli...,https://media.tacdn.com/media/attractions-spli...,479.0,Western Europe,France,Ile-de-France,Paris,CDG,"Shows, Concerts & Sports",Cabaret,Cabaret,"Shows, Concerts & Sports",Family-friendly Shows,Children's Shows & Concerts,"Shows, Concerts & Sports","Theater, Shows & Musicals",Show,http://www.partner.viator.com/en/66575/tours/P...,169.96,183.9,10000,90.9,111.33,155.6,108.59,"1 118,82",12029,1084.05,847.14,158.74,2048.3,4.5,http://www.partner.viator.com/images/stars/red...,OnRequest,VOUCHER_E
2,3,SITours_NEW,3731VATICAN,"Faster Than Skip-the-Line: Vatican, Sistine Ch...",Spend more time inside with no-wait access to ...,Spend more time inside with no-wait access to ...,0,3 hours,"Roma, Italy",https://media.tacdn.com/media/attractions-spli...,https://media.tacdn.com/media/attractions-spli...,511.0,Southern Europe,Italy,Lazio,Roma,ROM,Cultural & Theme Tours,Cultural Tours,Cultural Tour,Tours & Sightseeing,Half-day Tours,Half-day Tour,Walking & Biking Tours,Walking Tours,Walking Tour,http://www.partner.viator.com/en/66575/tours/R...,107.07,115.86,6300,57.27,70.14,98.03,68.41,70486,7578,682.95,533.7,100.01,1290.43,5.0,http://www.partner.viator.com/images/stars/red...,Freesale,VOUCHER_E
3,4,SITours_NEW,3951WESTDLX,Grand Canyon West Rim and Hoover Dam Tour from...,Hit the highway out of Las Vegas and spend the...,Hit the highway out of Las Vegas and spend the...,0,12 hours,"Las Vegas, United States",https://media.tacdn.com/media/attractions-spli...,https://media.tacdn.com/media/attractions-spli...,684.0,Northern America,United States,Nevada,Las Vegas,LAS,"Air, Helicopter & Balloon Tours",Helicopter Tours,Helicopter Tour,Tours & Sightseeing,Bus & Minivan Tours,Bus Tour,Outdoor Activities,"4WD, ATV & Off-Road Tours",Adventure Tour,http://www.partner.viator.com/en/66575/tours/L...,183.76,198.86,11002,98.28,118.19,168.25,117.43,"1 209,71",13007,1172.09,938.95,171.65,2214.89,4.5,http://www.partner.viator.com/images/stars/red...,Freesale,VOUCHER_E
4,5,SITours_NEW,2280LI_5H,Grand Canyon 4-in-1 Helicopter Tour,Take the ultimate Grand Canyon tour! You'll fl...,Take the ultimate Grand Canyon tour! You'll fl...,0,6 hours 30 minutes,"Las Vegas, United States",https://media.tacdn.com/media/attractions-spli...,https://media.tacdn.com/media/attractions-spli...,684.0,Northern America,United States,Nevada,Las Vegas,LAS,"Air, Helicopter & Balloon Tours",Helicopter Tours,Helicopter Tour,Tours & Sightseeing,Self-guided Tours & Rentals,Self Guided Tours,Tours & Sightseeing,Full-day Tours,Day Tour,http://www.partner.viator.com/en/66575/tours/L...,963.96,1043.14,57716,515.54,619.99,882.61,615.99,"6 345,79",68230,6148.42,4925.44,900.43,11618.64,5.0,http://www.partner.viator.com/images/stars/red...,FreesaleOnRequest,VOUCHER_E


### Merge the two DataFrames:
- Now we merge the two DataFrames on matching product codes: `ProductCode` in the Viator data and `product_code` in the TripAdvisor data.
- We export the merged DataFrame as `viator_to_ta_mapped_act.csv` (the `MAPPED_ACT_VIA_TO_TA` constant).

In [6]:
# an inner merge on the two datasets to find the activities with same product code
mapped_act_df = pd.merge(via_df, adf, left_on='ProductCode', right_on='product_code', how='inner')

In [7]:
# exporting the data to CSV file
mapped_act_df.to_csv(MAPPED_ACT_VIA_TO_TA, index=False, encoding='utf-8')

### Distinguish Mapped and Unmapped Viator activities:
- We now export these two sets of activities to separate CSV files.
- For the mapped activities, we have the following extra attributes:
    - `num_reviews`, `parents` (the second-to-last element of the list is the likely attraction name), `location` (holds the coordinates), `extra_info` (might hold other details)
- The remaining activities (about 90%) are the Viator activities with no matching TripAdvisor product code. Later sub-modules map them to the relevant attraction using NER and string matching.
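
To make the extra attributes concrete, here is a minimal sketch of pulling the likely attraction name and the coordinates out of one mapped record. The helper names and the record values are hypothetical; only the `parents` and `location` shapes follow the TripAdvisor rows shown earlier.

```python
def attraction_name(parents):
    """Return the second-to-last entry of `parents`, or None if the list is too short."""
    if isinstance(parents, list) and len(parents) >= 2:
        return parents[-2]
    return None


def coordinates(location):
    """Return (lat, lng) from a location dict, or None if either key is missing."""
    if not isinstance(location, dict):
        return None
    lat, lng = location.get("lat"), location.get("lng")
    return (lat, lng) if lat is not None and lng is not None else None


# Hypothetical record shaped like the TripAdvisor rows (values invented).
record = {
    "parents": ["Region", "Country", "City", "Some Attraction", "Things to do"],
    "location": {"lat": 10.0, "lng": 20.0, "city_id": 1},
}

name = attraction_name(record["parents"])
coords = coordinates(record["location"])
```

Both helpers return `None` instead of raising, since rows in the real data may have empty or missing values for these columns.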

In [10]:
# copy only the Viator columns from mapped_act_df
done_act_df = mapped_act_df[['Rank', 'ProductType', 'ProductCode', 'ProductName', 'Introduction',
       'ProductText', 'Special', 'Duration', 'Commences', 'ProductImage',
       'ProductImageThumb', 'DestinationID', 'Continent', 'Country', 'Region',
       'City', 'IATACode', 'Group1', 'Category1', 'Subcategory1', 'Group2',
       'Category2', 'Subcategory2', 'Group3', 'Category3', 'Subcategory3',
       'ProductURL', 'PriceAUD', 'PriceNZD', 'PriceEUR', 'PriceGBP',
       'PriceUSD', 'PriceCAD', 'PriceCHF', 'PriceNOK', 'PriceJPY', 'PriceSEK',
       'PriceHKD', 'PriceSGD', 'PriceZAR', 'AvgRating', 'AvgRatingStarURL',
       'BookingType', 'VoucherOption']].copy()

# drop duplicates, if any
done_act_df.drop_duplicates(inplace=True)

# export done activities for later use with the NER sub-module
done_act_df.to_csv(VIATOR_MAPPED_ACTS, encoding='utf-8', index=False)

# keep the via_df rows whose ProductCode was not mapped; the merged frame gets a fresh index,
# so we compare on ProductCode rather than on index labels
not_mapped_df = via_df[~via_df['ProductCode'].isin(done_act_df['ProductCode'])]
not_mapped_df.head(5)

Unnamed: 0,Rank,ProductType,ProductCode,ProductName,Introduction,ProductText,Special,Duration,Commences,ProductImage,ProductImageThumb,DestinationID,Continent,Country,Region,City,IATACode,Group1,Category1,Subcategory1,Group2,Category2,Subcategory2,Group3,Category3,Subcategory3,ProductURL,PriceAUD,PriceNZD,PriceEUR,PriceGBP,PriceUSD,PriceCAD,PriceCHF,PriceNOK,PriceJPY,PriceSEK,PriceHKD,PriceSGD,PriceZAR,AvgRating,AvgRatingStarURL,BookingType,VoucherOption
3,4,SITours_NEW,3951WESTDLX,Grand Canyon West Rim and Hoover Dam Tour from...,Hit the highway out of Las Vegas and spend the...,Hit the highway out of Las Vegas and spend the...,0,12 hours,"Las Vegas, United States",https://media.tacdn.com/media/attractions-spli...,https://media.tacdn.com/media/attractions-spli...,684.0,Northern America,United States,Nevada,Las Vegas,LAS,"Air, Helicopter & Balloon Tours",Helicopter Tours,Helicopter Tour,Tours & Sightseeing,Bus & Minivan Tours,Bus Tour,Outdoor Activities,"4WD, ATV & Off-Road Tours",Adventure Tour,http://www.partner.viator.com/en/66575/tours/L...,183.76,198.86,11002,98.28,118.19,168.25,117.43,"1 209,71",13007,1172.09,938.95,171.65,2214.89,4.5,http://www.partner.viator.com/images/stars/red...,Freesale,VOUCHER_E
4,5,SITours_NEW,2280LI_5H,Grand Canyon 4-in-1 Helicopter Tour,Take the ultimate Grand Canyon tour! You'll fl...,Take the ultimate Grand Canyon tour! You'll fl...,0,6 hours 30 minutes,"Las Vegas, United States",https://media.tacdn.com/media/attractions-spli...,https://media.tacdn.com/media/attractions-spli...,684.0,Northern America,United States,Nevada,Las Vegas,LAS,"Air, Helicopter & Balloon Tours",Helicopter Tours,Helicopter Tour,Tours & Sightseeing,Self-guided Tours & Rentals,Self Guided Tours,Tours & Sightseeing,Full-day Tours,Day Tour,http://www.partner.viator.com/en/66575/tours/L...,963.96,1043.14,57716,515.54,619.99,882.61,615.99,"6 345,79",68230,6148.42,4925.44,900.43,11618.64,5.0,http://www.partner.viator.com/images/stars/red...,FreesaleOnRequest,VOUCHER_E
9,10,SITours_NEW,5022MOUDIN,Moulin Rouge Paris Dinner and Show Ticket,Enjoy an evening dinner show at the Moulin Rou...,Enjoy an evening dinner show at the Moulin Rou...,0,4 hours,"Paris, France",https://media.tacdn.com/media/attractions-spli...,https://media.tacdn.com/media/attractions-spli...,479.0,Western Europe,France,Ile-de-France,Paris,CDG,"Shows, Concerts & Sports",Cabaret,Cabaret,"Shows, Concerts & Sports","Theater, Shows & Musicals",Show,"Shows, Concerts & Sports",Dinner Packages,Dinner and Show,http://www.partner.viator.com/en/66575/tours/P...,322.92,349.41,19000,172.7,211.53,295.64,206.31,"2 125,76",22856,2059.7,1609.57,301.61,3891.78,4.5,http://www.partner.viator.com/images/stars/red...,OnRequest,VOUCHER_E
10,11,SITours_NEW,2800NYE,New York City New Year's Eve Circle Line Cruise,Celebrate the turn of the year in style on thi...,Celebrate the turn of the year in style on thi...,0,3 hours,"New York, United States",https://media.tacdn.com/media/attractions-spli...,https://media.tacdn.com/media/attractions-spli...,687.0,Northern America,United States,New York,New York,NYC,Holiday & Seasonal Tours,New Years,New Years Eve,"Cruises, Sailing & Water Tours",Night Cruises,Night Cruise,Tours & Sightseeing,Night Tours,Night Tours,http://www.partner.viator.com/en/66575/tours/N...,404.25,437.45,24204,216.2,260.0,370.13,258.32,"2 661,18",28613,2578.41,2065.54,377.61,4872.41,4.0,http://www.partner.viator.com/images/stars/red...,Freesale,VOUCHER_E
53,54,SITours_NEW,5516ST1,Grand Canyon West Rim Luxury Helicopter Tour,Travel in five-star style and luxury to one of...,Travel in five-star style and luxury to one of...,0,3 hours,"Las Vegas, United States",https://media.tacdn.com/media/attractions-spli...,https://media.tacdn.com/media/attractions-spli...,684.0,Northern America,United States,Nevada,Las Vegas,LAS,"Air, Helicopter & Balloon Tours",Helicopter Tours,Helicopter Tour,Tours & Sightseeing,Full-day Tours,Day Tour,Luxury & Special Occasions,Luxury Tours,Luxury Tours,http://www.partner.viator.com/en/66575/tours/L...,513.07,555.21,30719,274.4,329.99,469.77,327.86,"3 377,55",36315,3272.5,2621.57,479.25,6184.03,5.0,http://www.partner.viator.com/images/stars/red...,FreesaleOnRequest,VOUCHER_E


In [9]:
# export the unmapped data as viator_unmapped_act.csv
not_mapped_df.to_csv(VIATOR_UNMAPPED_ACTS, encoding='utf-8', index=False)
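
As a cross-check on the mapped/unmapped split, the same partition can be obtained from a single left merge with `indicator=True`, which labels each Viator row as matched or not. The sketch below uses two made-up frames; the codes and names are assumptions for illustration.

```python
import pandas as pd

viator = pd.DataFrame({
    "ProductCode": ["A1", "B2", "C3"],
    "ProductName": ["Tour A", "Tour B", "Tour C"],
})
ta = pd.DataFrame({
    "product_code": ["A1", "C3", "C3"],  # C3 appears twice on purpose
    "num_reviews": [5, 1, 1],
})

# Keep one row per TripAdvisor code so the merge cannot multiply Viator rows.
ta_codes = ta[["product_code"]].drop_duplicates()

# The _merge column is "both" for matched rows and "left_only" otherwise.
flagged = viator.merge(
    ta_codes,
    left_on="ProductCode",
    right_on="product_code",
    how="left",
    indicator=True,
)

mapped = flagged.loc[flagged["_merge"] == "both", viator.columns]
unmapped = flagged.loc[flagged["_merge"] == "left_only", viator.columns]
```

Because every Viator row lands in exactly one of the two frames, `len(mapped) + len(unmapped)` should equal the number of Viator rows, which makes a convenient sanity check.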

### Conclusion
We have now preprocessed the data for the subsequent sub-modules. We will use `VIATOR_UNMAPPED_ACTS` and `VIATOR_MAPPED_ACTS` to extract the possible location names using an NER tagger.