# Section 1: Business Understanding

&ensp; This is the first project of the Udacity Data Science Nanodegree. For this project, I picked the Boston Airbnb dataset (listings.csv) to find out whether the listings show any geographic preference.
<br> &ensp; Question: Is there any geographic preference in the Boston market? (The other two questions are covered separately in two other notebooks.)
<br> &ensp; The method is to count the listings per zip code and plot each zip code as a bubble on a map, with the bubble's color and size determined by its listing count.

In [3]:
#import libraries
import pandas as pd
import pgeocode as pg
import plotly.express as px

# Section 2: Data Understanding

&ensp; After the dataset is loaded into the listing_file data frame, my main focus is on the "zipcode" field. I also notice that this field has some missing/null values and some malformed entries, so it needs data cleansing. Because "zipcode" is stored as an object (string) type, no type conversion is needed to count occurrences.
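
&ensp; As a rough illustration of the kind of check described above, the sketch below counts missing values and tabulates string lengths in a zipcode column. The `sample` frame and its values are hypothetical stand-ins, not rows from listings.csv.

```python
import pandas as pd

# Hypothetical sample standing in for listing_file['zipcode'] (not real data).
sample = pd.DataFrame(
    {"zipcode": ["02116", "02130", None, "02108 02111", "02134-1704", "02116"]}
)

# Number of missing values in the field.
n_missing = sample["zipcode"].isna().sum()

# Distribution of string lengths among the non-null values; any length
# other than 5 points to a malformed entry.
length_counts = sample["zipcode"].dropna().str.len().value_counts()

print(n_missing)
print(length_counts)
```

&ensp; Running the same two checks on the real column would show how many rows the cleansing step is expected to drop.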

In [4]:
#load the listings file into the variable listing_file
listing_file = pd.read_csv(u'/Users/zhosheng/Desktop/Study/DS_Udacity_Projects/Boston/listings.csv')

listing_file.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3585 entries, 0 to 3584
Data columns (total 95 columns):
 #   Column                            Non-Null Count  Dtype  
---  ------                            --------------  -----  
 0   id                                3585 non-null   int64  
 1   listing_url                       3585 non-null   object 
 2   scrape_id                         3585 non-null   int64  
 3   last_scraped                      3585 non-null   object 
 4   name                              3585 non-null   object 
 5   summary                           3442 non-null   object 
 6   space                             2528 non-null   object 
 7   description                       3585 non-null   object 
 8   experiences_offered               3585 non-null   object 
 9   neighborhood_overview             2170 non-null   object 
 10  notes                             1610 non-null   object 
 11  transit                           2295 non-null   object 
 12  access

# Section 3: Data Preparation

&ensp; In this section, the first step is to remove rows with a missing/null value in the "zipcode" field. The next step is to remove bad data, that is, entries whose length is not exactly five characters. After this cleansing, counting the listings per zip code is straightforward.
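
&ensp; The steps above can be sketched on a small toy frame. This is only an illustration under assumptions: the `listings` data is hypothetical, and the sketch uses a five-digit pattern match, which is a slightly stricter variation than a pure length check because it also rejects five-character values that contain non-digits.

```python
import pandas as pd

# Hypothetical sample (not real data) containing a null, a two-code entry,
# a ZIP+4 entry, and a five-character entry that is not all digits.
listings = pd.DataFrame(
    {"zipcode": ["02116", "02130", None, "02108 02111", "02134-1704", "0213A", "02116"]}
)

# Step 1: drop rows with no zipcode.
cleaned = listings.dropna(subset=["zipcode"])

# Step 2: keep only values made of exactly five digits.
is_five_digits = cleaned["zipcode"].str.fullmatch(r"\d{5}")
cleaned = cleaned[is_five_digits]

# Step 3: count listings per zipcode as a two-column data frame.
zip_counts = cleaned.groupby("zipcode").size().reset_index(name="count")
print(zip_counts)
```

&ensp; Whether the stricter pattern is worth using depends on the data; if every malformed value has the wrong length, the length check alone gives the same result.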

In [6]:
#count listings before cleaning (len gives rows; .size would give rows x columns)
print('There are {} listings before cleaning zipcode'.format(len(listing_file)))

#remove null values in zipcode from listing_file
listing_file=listing_file.dropna(subset=['zipcode'])

#remove irregular zipcode formats (anything that is not five characters long)
listing_file=listing_file[listing_file['zipcode'].str.len()==5]
print('There are {} listings after cleaning zipcode'.format(len(listing_file)))

#count the number of listings for each zipcode and return a data frame
#with the columns 'zipcode' and 'count'
count_zip=listing_file.groupby('zipcode').size().reset_index(name='count')

There are 3585 listings before cleaning zipcode
There are 3545 listings after cleaning zipcode


&ensp; The next step is to create a function that generates a geographic visualization showing which areas are popular in terms of number of listings.

In [9]:
def map_bubble_scatter(nation,location, count_zipcode):
    '''
    This function draws a bubble scatter plot on a map according to the zipcode and number of listings
    
    Input parameter:
    1. "nation" string type, country code passed to pgeocode (e.g. 'us')
    2. "location" series type, the zipcodes
    3. "count_zipcode" series type, the number of listings per zipcode
    
    Output result: bubble scatter plot
    
    '''
    #load the postal code database for the given country
    geo_lookup=pg.Nominatim(nation)
    
    zipcode_count_file=pd.DataFrame().assign(count=count_zipcode,zipcode=location)

    #look up all zipcodes in a single query instead of once per coordinate
    geo=geo_lookup.query_postal_code(zipcode_count_file['zipcode'].tolist())

    #add latitude and longitude by position so the input index does not matter
    zipcode_count_file['Latitude']=geo['latitude'].to_numpy()
    zipcode_count_file['Longitude']=geo['longitude'].to_numpy()

    #Create scatter plot based on latitude and longitude, with bubble size and color based on the count of occurrences of the zipcode
    fig = px.scatter_mapbox(zipcode_count_file, lat="Latitude", lon="Longitude", color="count", size="count",
                  color_continuous_scale=px.colors.cyclical.IceFire, size_max=15, zoom=10,
                  mapbox_style="carto-positron")

    fig.show()
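
&ensp; One thing worth guarding against before plotting is a zip code that cannot be geocoded. The sketch below assumes that such codes end up with missing coordinates, and it uses a hand-written `lookup` table in place of a real geocoder so that it runs offline. The table, its placeholder coordinates, and the `99999` code are all hypothetical.

```python
import pandas as pd

# Counts in the same shape as count_zip; "99999" is a deliberately
# unresolvable code used only for illustration.
counts = pd.DataFrame({"zipcode": ["02116", "02130", "99999"], "count": [12, 7, 1]})

# Hypothetical lookup table standing in for a geocoder result; the
# coordinates are illustrative placeholders, not authoritative values.
lookup = pd.DataFrame(
    {"zipcode": ["02116", "02130"], "Latitude": [42.35, 42.31], "Longitude": [-71.08, -71.11]}
)

# A left merge keeps every zipcode; unresolved ones get NaN coordinates.
merged = counts.merge(lookup, on="zipcode", how="left")

# Report what could not be placed on the map, then keep only plottable rows.
unresolved = merged[merged["Latitude"].isna()]["zipcode"].tolist()
plottable = merged.dropna(subset=["Latitude", "Longitude"])

print(unresolved)
print(plottable)
```

&ensp; Listing the unresolved codes separately makes it clear how many listings are left off the map, rather than dropping them silently.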

# Section 5: Evaluate the Results

&ensp; After running the map_bubble_scatter function, the conclusion from the map is that listings are most concentrated on the south side of the Cambridge area.

In [11]:
map_bubble_scatter('us',count_zip['zipcode'], count_zip['count'])
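
&ensp; To complement the visual reading of the map, a numeric ranking of zip codes by listing count can serve as a cross-check. The sketch below is only an illustration: `count_zip_demo` and its numbers are hypothetical and merely mimic the shape of count_zip.

```python
import pandas as pd

# Hypothetical counts (not real results) in the same shape as count_zip.
count_zip_demo = pd.DataFrame(
    {"zipcode": ["02116", "02130", "02118", "02134"], "count": [40, 25, 31, 12]}
)

# Take the three zip codes with the most listings.
top = count_zip_demo.nlargest(3, "count").copy()

# Express each count as a share of all listings in the frame.
top["share"] = top["count"] / count_zip_demo["count"].sum()
print(top)
```

&ensp; Applied to the real count_zip, the same two lines would show which zip codes drive the largest bubbles and how much of the market they account for.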