## Problem
### Using the available information, we need you to identify clusters of accommodation that bring the most positive impact to the community, allowing a wider range of actors to participate in travel and tourism as consumers and/or providers.

The main dataset is yet to be released by the contest organizers. Meanwhile, we can look at the secondary data. We assume this data was collected to identify the factors considered to have the most positive impact on society.   

<span style="color:red"><b>We (Vishal/Shagun)</b> can work on this notebook simultaneously, each under the headings assigned to us.</span>

<b> Note: 
    * Before pushing code to the repo, always clear the output first: Cell -> All Output -> Clear.
    * Document the code thoroughly. Working on the same repo only helps both of us if the code is well documented.
    * Always cite the exact data source, including the URL, so the other person can download it; we are not going to push the data.  
</b>

## Official Data 
The main data for the problem is stored in a TSV file called data.tsv. Let's read it.

In [None]:
import pandas as pd 

mainData = pd.read_csv("../../data/raw/official/data.tsv", sep='\t')

mainData.head()

This looks about right. Let's profile the data to get some insight into it.

In [None]:
import pandas_profiling

profile = mainData.profile_report(title='mainData Profiling Report')
profile.to_file(output_file="../../data/processed/mainData.html")

Because a profiling report can be very large, we write it to an HTML file and save it in the data/processed folder. We do the same for the other datasets. 

In [None]:
# profiling counties data
countiesData = pd.read_csv("../../data/raw/official/counties.tsv", sep='\t')
profile = countiesData.profile_report(title='countiesData Profiling Report')
profile.to_file(output_file="../../data/processed/countiesData.html")

In [None]:
# profiling the population characteristics of US counties for the period 2010 to 2018
censusData = pd.read_csv("../../data/raw/official/cc-est2018-alldata/cc-est2018-alldata.csv", encoding="ISO-8859-1")
profile = censusData.profile_report(title='censusData Profiling Report')
profile.to_file(output_file="../../data/processed/censusData.html")

<b>air_outbound_popularity_bucket</b> is highly correlated with <b>air_inbound_popularity_bucket <span style="color:green">(ρ = 0.9962258185)</span></b>, so we can drop air_outbound_popularity_bucket from the table.  
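
The coefficient above is taken from the profiling report. As a rough cross-check, a rank correlation can also be computed directly with pandas. The sketch below uses a small made-up frame (`toy` and its values are illustrative, not taken from data.tsv) and assumes Spearman's coefficient, computed here as the Pearson correlation of the ranks, is a reasonable choice for ordinal bucket columns; the report may use a different coefficient.

```python
import pandas as pd

# Toy stand-in for mainData; the values are invented for illustration only.
toy = pd.DataFrame({
    'air_inbound_popularity_bucket':  [1, 2, 3, 4, 5, 5],
    'air_outbound_popularity_bucket': [1, 2, 3, 4, 5, 4],
})

# Spearman's rho equals the Pearson correlation of the ranked values.
rank_corr = toy.rank().corr(method='pearson')
rho = rank_corr.loc['air_inbound_popularity_bucket',
                    'air_outbound_popularity_bucket']
print(round(rho, 3))
```

A value close to 1 would support treating one of the two columns as redundant.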

In [None]:
mainData = mainData.drop(['air_outbound_popularity_bucket'], axis=1)
mainData.head()

<b>countyfp</b> is the odd one. Although the number of distinct countyfp values matches the number of counties in each state and in the USA as a whole, the values themselves do not seem to follow an obvious pattern.   

In [None]:
import numpy as np

# Number of counties in the first 3 states, in alphabetical order
print(mainData.groupby('state_code')['countyfp'].nunique()[:3], '\n')
# Total number of counties in USA
print(mainData.groupby('state_code')['countyfp'].nunique().sum(), '\n')
# Sorted countyfp values for 4 sample states
for state in ['VA', 'AK', 'RI', 'TX']:
    print(np.sort(mainData[mainData.state_code == state].countyfp.unique()))


So the countyfp values repeat across states. This is not the case with <b>geoid</b>: <b>geoid</b> has a unique value for every county, ranging from 1001 to 56037. As with countyfp, there is no obvious pattern in how the values are assigned.  
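
One possible reading of these values, assuming the columns follow the standard US Census FIPS convention (this is not confirmed from the dataset's documentation): a county GEOID is the two-digit state FIPS code followed by the three-digit county FIPS code. That would explain why countyfp repeats across states while geoid does not. The frame below is a hand-made example, not a sample of mainData.

```python
import pandas as pd

# Hand-made example rows; the statefp/countyfp pairs are illustrative.
toy = pd.DataFrame({
    'statefp':  [1, 1, 2, 56],
    'countyfp': [1, 3, 13, 37],
})

# Assumed rule: geoid = state FIPS * 1000 + county FIPS
toy['geoid_rebuilt'] = toy['statefp'] * 1000 + toy['countyfp']
print(toy)
```

If the assumption holds for the real data, `(mainData.statefp * 1000 + mainData.countyfp == mainData.geoid).all()` should evaluate to `True`.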

In [None]:
# values of geoid for randomly selected 2 states.
print(np.sort(mainData[mainData.state_code == 'AK'].geoid.unique()))
print(np.sort(mainData[mainData.state_code == 'AL'].geoid.unique()))

<b>lodging_num_reviews_bucket</b> is highly correlated with <b>lodging_inventory_bucket <span style="color:green">(ρ = 0.929536127)</span></b>, and <b>lodging_popularity_bucket</b> is highly correlated with <b>lodging_num_reviews_bucket <span style="color:green">(ρ = 0.9485770051)</span></b>. So we can drop both lodging_num_reviews_bucket and lodging_popularity_bucket. 

In [None]:
mainData = mainData.drop(['lodging_num_reviews_bucket','lodging_popularity_bucket'], axis=1)
mainData.head()

<b>state_code</b> and <b>statefp</b> carry essentially the same information, so we can drop state_code for the analysis as well.  
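
Before dropping a column on these grounds, it may be worth confirming that the two columns really map one-to-one. A minimal sketch on invented rows (the `toy` frame is an assumption for illustration):

```python
import pandas as pd

# Invented rows standing in for mainData.
toy = pd.DataFrame({
    'state_code': ['AL', 'AL', 'AK', 'TX'],
    'statefp':    [1, 1, 2, 48],
})

# Each statefp should map to exactly one state_code, and vice versa.
one_code_per_fp = bool(toy.groupby('statefp')['state_code'].nunique().eq(1).all())
one_fp_per_code = bool(toy.groupby('state_code')['statefp'].nunique().eq(1).all())
print(one_code_per_fp and one_fp_per_code)
```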

In [None]:
mainData = mainData.drop(['state_code'], axis=1)
mainData.head()

### Fixing the missing values. 

<b>lodging_avg_review_rating, lodging_avg_star_rating</b> and <b>lodging_inventory_bucket</b> are missing 59.3%, 64.1% and 51.1% of their values respectively.
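
Percentages like these can be recomputed directly from a dataframe. The sketch below shows one way to do it on a made-up frame (the values are invented and do not reproduce the figures above):

```python
import numpy as np
import pandas as pd

# Made-up values; only the column names mirror the main data.
toy = pd.DataFrame({
    'lodging_avg_review_rating': [4.1, np.nan, np.nan, 3.8],
    'lodging_avg_star_rating':   [np.nan, np.nan, np.nan, 3.0],
})

# Share of missing values per column, as a percentage
missing_pct = toy.isna().mean().mul(100).round(1)
print(missing_pct)
```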

In [None]:
# Summary of lodging_avg_review_rating for non-vacation rentals and vacation rentals
nonVacationRentalReview = mainData[mainData.is_vacation_rental == 0].lodging_avg_review_rating
vacationRentalReview = mainData[mainData.is_vacation_rental == 1].lodging_avg_review_rating
print("Non-vacation rental review summary:\n", nonVacationRentalReview.describe())
print("\nVacation rental review summary:\n", vacationRentalReview.describe())

For <b>lodging_avg_review_rating</b> the distribution is roughly Gaussian, with a low standard deviation around the mean. The mean and standard deviation differ between vacation rentals and non-vacation rentals, so we fill the NaN values of each group with that group's own mean.   
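
As an alternative to typing the group means by hand, the fill values can be derived from the data. This is a minimal sketch on invented rows, assuming mean imputation per `is_vacation_rental` group is what is wanted:

```python
import numpy as np
import pandas as pd

# Invented rows; group 0 has mean 3.9 and group 1 has mean 4.5.
toy = pd.DataFrame({
    'is_vacation_rental':        [0, 0, 0, 1, 1, 1],
    'lodging_avg_review_rating': [3.8, 4.0, np.nan, 4.4, np.nan, 4.6],
})

# Mean of each group, broadcast back to the original row order
group_mean = (toy.groupby('is_vacation_rental')['lodging_avg_review_rating']
                 .transform('mean'))

# Fill each missing rating with the mean of its own group
toy['lodging_avg_review_rating'] = toy['lodging_avg_review_rating'].fillna(group_mean)
print(toy)
```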

In [None]:
# Fill NaN with the group mean: 3.9 for non-vacation rentals, 4.5 for vacation rentals
col = 'lodging_avg_review_rating'
nonVacation = mainData.is_vacation_rental == 0
vacation = mainData.is_vacation_rental == 1
# Write the filled values back in place; rows in neither group are left untouched
mainData.loc[nonVacation, col] = mainData.loc[nonVacation, col].fillna(3.9)
mainData.loc[vacation, col] = mainData.loc[vacation, col].fillna(4.5)

The remaining two fields need more than a simple replacement with the mean. We can come back to them if these values are needed. 

### Feature Engineering
Let's start building the features and the final dataset that will be used for clustering. We build the data county by county. 

In [None]:
# moving geoid to final dataset
finalData = pd.DataFrame({'geoid' : np.sort(mainData.geoid.unique())})
finalData.head()

### Customer Satisfaction
Customer satisfaction is the first feature we add to the dataset. We can take <b>lodging_avg_review_rating, lodging_avg_star_rating</b> as measures of customer satisfaction. We have already filled the missing values in lodging_avg_review_rating, so we can add it to the dataset directly. However, the main dataset gives the rating for several years. Let's take the latest value and average it across vacation rentals and non-vacation rentals.  
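
The same aggregation can be sketched without an explicit loop. The example below runs on invented rows and assumes, as a simplification, that rows are ordered oldest to newest within each county, so the last two rows per `geoid` are the latest vacation-rental and non-vacation-rental entries:

```python
import pandas as pd

# Invented rows; assumed to be ordered oldest to newest within each geoid.
toy = pd.DataFrame({
    'geoid':                     [1001, 1001, 1001, 1001, 1003, 1003],
    'is_vacation_rental':        [0, 1, 0, 1, 0, 1],
    'lodging_avg_review_rating': [3.5, 4.0, 3.9, 4.5, 4.1, 4.7],
})

# Average the last two ratings of each county
county_rating = (
    toy.groupby('geoid')['lodging_avg_review_rating']
       .apply(lambda s: s.tail(2).mean())
       .rename('CustomerSatisfactionAvgReviewRating')
       .reset_index()
)
print(county_rating)
```

On a large table this kind of grouped aggregation is usually much faster than filtering the dataframe once per county.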

In [None]:
from tqdm import tqdm

# initialize the new column with a float value
finalData['CustomerSatisfactionAvgReviewRating'] = 0.0

# Average the latest review rating for vacation and non-vacation rentals.
# This relies on the last two rows per geoid in mainData being the latest pair.
for index, geoid in tqdm(finalData['geoid'].items(), total=len(finalData)):
    latest = mainData.loc[mainData.geoid == geoid, 'lodging_avg_review_rating'][-2:]
    # .loc avoids chained assignment, which may write to a copy
    finalData.loc[index, 'CustomerSatisfactionAvgReviewRating'] = latest.values.mean()
    
finalData.head()

Regenerating the missing values of <b>lodging_avg_star_rating</b> will be a little trickier. We need to apply a machine learning algorithm to estimate them.

We are not interested in all the values, only the latest ones. Let's start by creating a new dataframe with fewer rows.  
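
A minimal sketch of such a filter, on invented rows. The `year` column name is hypothetical (the actual time column in data.tsv may be named differently), and keeping one row per county and rental type is an assumption about what "latest" should mean here:

```python
import pandas as pd

# Invented rows; 'year' is a hypothetical column name.
toy = pd.DataFrame({
    'geoid':                   [1001, 1001, 1001, 1001],
    'is_vacation_rental':      [0, 0, 1, 1],
    'year':                    [2017, 2018, 2017, 2018],
    'lodging_avg_star_rating': [3.0, 3.5, 4.0, None],
})

# Keep only the most recent row for each county and rental type
latest = (
    toy.sort_values('year')
       .groupby(['geoid', 'is_vacation_rental'])
       .tail(1)
       .reset_index(drop=True)
)
print(latest)
```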

## Economic impact
### Vishal 

Tourism affects the economy of a region, and the reverse also holds. In this section of the notebook we try to analyze the economic factors that affect the community and hence tourism. 

<h4>Income</h4>
We start with the income of people in the region. A higher average income indicates greater prosperity. Let's start exploring the income-related data we have. 

* Data used: SELECTED ECONOMIC CHARACTERISTICS, 2018: ACS 1-Year Estimates Data profiles
* url: https://data.census.gov/cedsci/table?q=United%20States&g=0100000US,.050000&table=DP03&tid=ACSDP1Y2018.DP03&hidePreview=true&vintage=2018&lastDisplayedRow=144

The data is stored in raw folder as a CSV file.

In [None]:
# loading the data
import pandas as pd

censusData = pd.read_csv("../../data/raw/Census2018/ACSDP1Y2018.DP03_data_with_overlays_2019-12-28T161853.csv") 

censusData.head()

In [None]:
# check the shape of the dataframe.
censusData.shape

Things to remember for later.
* pay gap between genders
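
As a placeholder for the pay gap item, here is one common way to express it: one minus the ratio of female to male median earnings. The sketch runs on invented numbers, and the column names `median_earnings_male` and `median_earnings_female` are hypothetical; the corresponding columns in the census file would need to be identified first.

```python
import pandas as pd

# Invented numbers; the earnings column names are hypothetical.
toy = pd.DataFrame({
    'geoid':                  [1001, 1003],
    'median_earnings_male':   [50000.0, 40000.0],
    'median_earnings_female': [40000.0, 36000.0],
})

# Pay gap as a fraction of male median earnings (0 means no gap)
toy['gender_pay_gap'] = 1 - toy['median_earnings_female'] / toy['median_earnings_male']
print(toy[['geoid', 'gender_pay_gap']])
```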

## Environmental impact
### Shagun