## Airbnb DC Hosting Helper ##

## 1_data_collection ##

### Executive Summary ###

Airbnb was started in 2007 and has been disrupting the hospitality industry ever since. Hosts on Airbnb offer unique stays and local experiences for travelers that can't be replicated by a stay in a hotel. According to [their site,](https://news.airbnb.com/about-us/) Airbnb has helped over 4 million hosts welcome over 900 million guests in almost every country around the world. 

According to [SmartAsset's](https://smartasset.com/mortgage/where-do-airbnb-hosts-make-the-most-money) 2020 study on the profit potential of rentals in 15 of the largest Airbnb markets in the US, renting out an entire place or a single room can be a profitable venture. In the cities studied, hosts renting out a full two-bedroom apartment or house can expect an average annual profit of $20,619 after expenses. Hosts renting out one room in a two-bedroom home in these cities could expect that income to cover about 81% of their rent on average.

As you might imagine, there is a large amount of Airbnb data for every major city around the world. To make this data explorable, [Inside Airbnb](http://insideairbnb.com/index.html) created an independent, non-commercial website that scrapes publicly available listing and review data from Airbnb every month and allows anyone to explore or work with this information.

This information can be useful to many parties working with Airbnb. For this project, I aim to use it to help hosts understand what makes an Airbnb listing popular and what they could focus on to make their listing more competitive and increase their profits. I will focus specifically on Washington DC as my case study. I will also use the [Foursquare API](https://developer.foursquare.com/) to gather information on the types of venues in each neighborhood, to get a better idea of the city from a tourist's point of view and see whether this affects Airbnb listing popularity.

### Problem Statement ###

I will create the best binary classification model I can to predict whether an Airbnb listing in DC will be considered popular compared to the current listing competition. In addition to the best predictive model, I will create a highly interpretable model to help hosts understand which features of their listing they could improve to increase popularity. These models will be deployed together in an app that hosts can use to make their listings as strong as possible.

There are a few important metrics to point out for this project. The first is how to determine the binary classification of a listing as popular or not. I will use a combination of the number of reviews and the average rating to define popularity: listings with more than 60 reviews and an overall rating above 4.8 (out of 5) will be considered popular. I want to explore the features that separate this group from the rest. Why are these listings getting so many reviews and such high ratings? Second, the model I select should have accuracy scores that are as consistent as possible between the training and testing sets, and ideally both scores should be high so that the model classifies as many listings correctly as possible. I will also focus on optimizing the precision score (true positives over all predicted positives), because in this case the worse error is telling a host that their listing will be popular when it actually will not be.
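As a minimal sketch of the labeling rule and the precision metric described above, assuming the Inside Airbnb columns `number_of_reviews` and `review_scores_rating` hold the review count and the overall rating on a 5-point scale. The `label_popular` and `precision` helpers and the small example frame are hypothetical illustrations, not the project's final code.

```python
import pandas as pd


def label_popular(df, min_reviews=60, min_rating=4.8):
    """Return 1 for listings with more than min_reviews reviews and a rating above min_rating."""
    popular = (df['number_of_reviews'] > min_reviews) & (df['review_scores_rating'] > min_rating)
    return popular.astype(int)


def precision(y_true, y_pred):
    """True positives divided by all predicted positives."""
    predicted_pos = (y_pred == 1).sum()
    true_pos = ((y_pred == 1) & (y_true == 1)).sum()
    return true_pos / predicted_pos if predicted_pos else 0.0


# Hypothetical toy data, not taken from the listings file
toy = pd.DataFrame({
    'number_of_reviews': [75, 429, 102, 31],
    'review_scores_rating': [4.59, 4.82, 4.90, 4.95],
})
toy['popular'] = label_popular(toy)
```

Listings with a missing rating compare as not popular under this rule, so it may be worth deciding separately how unreviewed listings should be treated.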

Import libraries and read in data

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import json, requests

In [None]:
np.random.seed(100)

In [2]:
pd.set_option('display.max_columns', 300)

In [3]:
pd.set_option('display.max_rows', 300)

In [4]:
listings = pd.read_csv('../data/listings.csv.gz')

In [45]:
listings.head()

Unnamed: 0,id,listing_url,scrape_id,last_scraped,name,description,neighborhood_overview,picture_url,host_id,host_url,host_name,host_since,host_location,host_about,host_response_time,host_response_rate,host_acceptance_rate,host_is_superhost,host_thumbnail_url,host_picture_url,host_neighbourhood,host_listings_count,host_total_listings_count,host_verifications,host_has_profile_pic,host_identity_verified,neighbourhood,neighbourhood_cleansed,neighbourhood_group_cleansed,latitude,longitude,property_type,room_type,accommodates,bathrooms,bathrooms_text,bedrooms,beds,amenities,price,minimum_nights,maximum_nights,minimum_minimum_nights,maximum_minimum_nights,minimum_maximum_nights,maximum_maximum_nights,minimum_nights_avg_ntm,maximum_nights_avg_ntm,calendar_updated,has_availability,availability_30,availability_60,availability_90,availability_365,calendar_last_scraped,number_of_reviews,number_of_reviews_ltm,number_of_reviews_l30d,first_review,last_review,review_scores_rating,review_scores_accuracy,review_scores_cleanliness,review_scores_checkin,review_scores_communication,review_scores_location,review_scores_value,license,instant_bookable,calculated_host_listings_count,calculated_host_listings_count_entire_homes,calculated_host_listings_count_private_rooms,calculated_host_listings_count_shared_rooms,reviews_per_month
0,3686,https://www.airbnb.com/rooms/3686,20210710190002,2021-07-11,Vita's Hideaway,IMPORTANT NOTES<br />* Carefully read and be s...,We love that our neighborhood is up and coming...,https://a0.muscache.com/pictures/61e02c7e-3d66...,4645,https://www.airbnb.com/users/show/4645,Vita,2008-11-26,"Washington D.C., District of Columbia, United ...","I am a literary scholar, teacher, poet, vegan ...",within a day,80%,75%,f,https://a0.muscache.com/im/users/4645/profile_...,https://a0.muscache.com/im/users/4645/profile_...,Anacostia,2.0,2.0,"['email', 'phone', 'reviews', 'jumio', 'offlin...",t,t,"Washington, District of Columbia, United States",Historic Anacostia,,38.86339,-76.98889,Private room in house,Private room,1,,1 private bath,1.0,1.0,"[""First aid kit"", ""Long term stays allowed"", ""...",$55.00,2,365,2,2,365,365,2.0,365.0,,t,1,31,61,336,2021-07-11,75,3,0,2014-06-22,2021-01-12,4.59,4.71,4.44,4.89,4.82,3.8,4.58,,f,2,0,2,0,0.87
1,3943,https://www.airbnb.com/rooms/3943,20210710190002,2021-07-11,Historic Rowhouse Near Monuments,Please contact us before booking to make sure ...,This rowhouse is centrally located in the hear...,https://a0.muscache.com/pictures/432713/fab7dd...,5059,https://www.airbnb.com/users/show/5059,Vasa,2008-12-12,"Washington, District of Columbia, United States",I have been living and working in DC for the l...,within a few hours,100%,29%,f,https://a0.muscache.com/im/pictures/user/8ec69...,https://a0.muscache.com/im/pictures/user/8ec69...,Eckington,0.0,0.0,"['email', 'phone', 'reviews', 'kba']",t,t,"Washington, District of Columbia, United States","Edgewood, Bloomingdale, Truxton Circle, Eckington",,38.91195,-77.00456,Private room in townhouse,Private room,2,,1.5 shared baths,1.0,1.0,"[""Cooking basics"", ""First aid kit"", ""Dedicated...",$70.00,2,1125,2,2,1125,1125,2.0,1125.0,,t,9,39,69,344,2021-07-11,429,0,0,2010-08-08,2018-08-07,4.82,4.89,4.91,4.94,4.9,4.54,4.74,,f,2,0,2,0,3.22
2,4529,https://www.airbnb.com/rooms/4529,20210710190002,2021-07-11,Bertina's House Part One,This is large private bedroom with plenty of...,Very quiet neighborhood and it is easy accessi...,https://a0.muscache.com/pictures/86072003/6709...,5803,https://www.airbnb.com/users/show/5803,Bertina'S House,2008-12-30,"Washington, District of Columbia, United States","I am an easy going, laid back person who loves...",,,,f,https://a0.muscache.com/im/users/5803/profile_...,https://a0.muscache.com/im/users/5803/profile_...,Eastland Gardens,3.0,3.0,"['email', 'phone', 'facebook', 'reviews', 'jum...",t,t,"Washington, District of Columbia, United States","Eastland Gardens, Kenilworth",,38.90585,-76.94469,Private room in house,Private room,4,,1 shared bath,1.0,1.0,"[""Cooking basics"", ""First aid kit"", ""Keypad"", ...",$54.00,30,180,30,30,180,180,30.0,180.0,,t,29,59,89,179,2021-07-11,102,0,0,2014-09-23,2019-07-05,4.66,4.8,4.6,4.93,4.93,4.51,4.83,,f,1,0,1,0,1.23
3,4967,https://www.airbnb.com/rooms/4967,20210710190002,2021-07-11,"DC, Near Metro","<b>The space</b><br />Hello, my name is Seveer...",,https://a0.muscache.com/pictures/2439810/bb320...,7086,https://www.airbnb.com/users/show/7086,Seveer,2009-01-26,"Washington D.C., District of Columbia, United ...","I am fun, honest and very easy going and trave...",within a few hours,100%,78%,t,https://a0.muscache.com/im/pictures/user/6efb4...,https://a0.muscache.com/im/pictures/user/6efb4...,Ivy City,5.0,5.0,"['email', 'phone', 'reviews', 'kba']",t,t,,"Ivy City, Arboretum, Trinidad, Carver Langston",,38.91217,-76.99249,Private room in house,Private room,1,,3 baths,1.0,1.0,"[""Cable TV"", ""TV with standard cable"", ""Kitche...",$99.00,2,365,2,2,365,365,2.0,365.0,,t,0,0,0,146,2021-07-11,31,0,0,2012-02-13,2016-09-22,4.74,4.68,4.89,4.93,4.93,4.21,4.64,,f,3,0,3,0,0.27
4,5589,https://www.airbnb.com/rooms/5589,20210710190002,2021-07-11,Cozy apt in Adams Morgan,This is a 1 br (bedroom + living room in Adams...,"Adams Morgan spills over with hipsters, salsa ...",https://a0.muscache.com/pictures/207249/9f1df8...,6527,https://www.airbnb.com/users/show/6527,Ami,2009-01-13,"Washington D.C., District of Columbia, United ...","I am an environmentalist, and I own and operat...",within a few hours,100%,17%,f,https://a0.muscache.com/im/users/6527/profile_...,https://a0.muscache.com/im/users/6527/profile_...,Adams Morgan,4.0,4.0,"['email', 'phone', 'facebook', 'reviews', 'jum...",t,t,"Washington, District of Columbia, United States","Kalorama Heights, Adams Morgan, Lanier Heights",,38.91887,-77.04008,Entire apartment,Entire home/apt,3,,1 bath,1.0,1.0,"[""Window guards"", ""Cooking basics"", ""First aid...",$86.00,5,150,5,23,150,150,8.8,150.0,,t,7,32,62,121,2021-07-11,95,0,0,2010-07-30,2020-03-05,4.54,4.75,4.17,4.83,4.84,4.91,4.47,,f,2,1,1,0,0.71


In [47]:
listings.shape

(8033, 74)

In [52]:
listings.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8033 entries, 0 to 8032
Data columns (total 74 columns):
 #   Column                                        Non-Null Count  Dtype  
---  ------                                        --------------  -----  
 0   id                                            8033 non-null   int64  
 1   listing_url                                   8033 non-null   object 
 2   scrape_id                                     8033 non-null   int64  
 3   last_scraped                                  8033 non-null   object 
 4   name                                          8032 non-null   object 
 5   description                                   7875 non-null   object 
 6   neighborhood_overview                         5144 non-null   object 
 7   picture_url                                   8033 non-null   object 
 8   host_id                                       8033 non-null   int64  
 9   host_url                                      8033 non-null   o

Create a list of DC neighborhoods to work with

In [54]:
listings['neighbourhood_cleansed'].value_counts()

Capitol Hill, Lincoln Park                                                                           746
Union Station, Stanton Park, Kingman Park                                                            724
Columbia Heights, Mt. Pleasant, Pleasant Plains, Park View                                           687
Dupont Circle, Connecticut Avenue/K Street                                                           649
Shaw, Logan Circle                                                                                   548
Edgewood, Bloomingdale, Truxton Circle, Eckington                                                    531
Brightwood Park, Crestwood, Petworth                                                                 416
Downtown, Chinatown, Penn Quarters, Mount Vernon Square, North Capitol Street                        386
Kalorama Heights, Adams Morgan, Lanier Heights                                                       347
West End, Foggy Bottom, GWU                            

In [55]:
list_neighborhoods = np.unique(listings['neighbourhood_cleansed'])
list_neighborhoods

array(['Brightwood Park, Crestwood, Petworth',
       'Brookland, Brentwood, Langdon', 'Capitol Hill, Lincoln Park',
       'Capitol View, Marshall Heights, Benning Heights',
       'Cathedral Heights, McLean Gardens, Glover Park',
       'Cleveland Park, Woodley Park, Massachusetts Avenue Heights, Woodland-Normanstone Terrace',
       'Colonial Village, Shepherd Park, North Portal Estates',
       'Columbia Heights, Mt. Pleasant, Pleasant Plains, Park View',
       'Congress Heights, Bellevue, Washington Highlands',
       'Deanwood, Burrville, Grant Park, Lincoln Heights, Fairmont Heights',
       'Douglas, Shipley Terrace',
       'Downtown, Chinatown, Penn Quarters, Mount Vernon Square, North Capitol Street',
       'Dupont Circle, Connecticut Avenue/K Street',
       'Eastland Gardens, Kenilworth',
       'Edgewood, Bloomingdale, Truxton Circle, Eckington',
       'Fairfax Village, Naylor Gardens, Hillcrest, Summit Park',
       'Friendship Heights, American University Park, Tenle

Create a locations dataframe and group by neighborhood. Find the average latitude and longitude for each neighborhood to serve as the neighborhood's center point for Foursquare lookups.

In [56]:
locations = listings[['neighbourhood_cleansed', 'latitude', 'longitude']]

In [57]:
locations.head()

Unnamed: 0,neighbourhood_cleansed,latitude,longitude
0,Historic Anacostia,38.86339,-76.98889
1,"Edgewood, Bloomingdale, Truxton Circle, Eckington",38.91195,-77.00456
2,"Eastland Gardens, Kenilworth",38.90585,-76.94469
3,"Ivy City, Arboretum, Trinidad, Carver Langston",38.91217,-76.99249
4,"Kalorama Heights, Adams Morgan, Lanier Heights",38.91887,-77.04008


In [58]:
locations = locations.groupby('neighbourhood_cleansed').mean()
locations

Unnamed: 0_level_0,latitude,longitude
neighbourhood_cleansed,Unnamed: 1_level_1,Unnamed: 2_level_1
"Brightwood Park, Crestwood, Petworth",38.946504,-77.024729
"Brookland, Brentwood, Langdon",38.92611,-76.983434
"Capitol Hill, Lincoln Park",38.884692,-76.992667
"Capitol View, Marshall Heights, Benning Heights",38.885287,-76.931463
"Cathedral Heights, McLean Gardens, Glover Park",38.924179,-77.075507
"Cleveland Park, Woodley Park, Massachusetts Avenue Heights, Woodland-Normanstone Terrace",38.931255,-77.059736
"Colonial Village, Shepherd Park, North Portal Estates",38.986958,-77.035842
"Columbia Heights, Mt. Pleasant, Pleasant Plains, Park View",38.929396,-77.031084
"Congress Heights, Bellevue, Washington Highlands",38.835467,-76.999961
"Deanwood, Burrville, Grant Park, Lincoln Heights, Fairmont Heights",38.900688,-76.928999


In [59]:
locations.reset_index(inplace=True)

In [60]:
locations

Unnamed: 0,neighbourhood_cleansed,latitude,longitude
0,"Brightwood Park, Crestwood, Petworth",38.946504,-77.024729
1,"Brookland, Brentwood, Langdon",38.92611,-76.983434
2,"Capitol Hill, Lincoln Park",38.884692,-76.992667
3,"Capitol View, Marshall Heights, Benning Heights",38.885287,-76.931463
4,"Cathedral Heights, McLean Gardens, Glover Park",38.924179,-77.075507
5,"Cleveland Park, Woodley Park, Massachusetts Av...",38.931255,-77.059736
6,"Colonial Village, Shepherd Park, North Portal ...",38.986958,-77.035842
7,"Columbia Heights, Mt. Pleasant, Pleasant Plain...",38.929396,-77.031084
8,"Congress Heights, Bellevue, Washington Highlands",38.835467,-76.999961
9,"Deanwood, Burrville, Grant Park, Lincoln Heigh...",38.900688,-76.928999
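The grouping and the index reset above can also be done in one step. A minimal sketch on a small hypothetical frame (the neighborhood names and coordinates are invented for illustration), using `as_index=False` so the neighborhood name stays a regular column:

```python
import pandas as pd

# Hypothetical listings: two in neighborhood 'A', one in neighborhood 'B'
toy_listings = pd.DataFrame({
    'neighbourhood_cleansed': ['A', 'A', 'B'],
    'latitude': [38.90, 38.92, 38.86],
    'longitude': [-77.00, -77.02, -76.98],
})

# Average the coordinates per neighborhood; as_index=False keeps a 0..n-1 index
centers = (
    toy_listings
    .groupby('neighbourhood_cleansed', as_index=False)[['latitude', 'longitude']]
    .mean()
)
```

Each center is the average position of the listings in that neighborhood, so it leans toward wherever listings are densest rather than marking the geographic middle.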


Get a list of the average latitude and longitude coordinates by neighborhood for the Foursquare lookups in the next steps.

In [61]:
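# Format each center as the 'lat, lon' string passed to the Foursquare `ll` parameter.
# Building the string directly avoids depending on how a tuple of NumPy floats is printed.
lat_lon = [f'{lat}, {lon}' for lat, lon in zip(locations['latitude'], locations['longitude'])]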

In [74]:
lat_lon[0]

'38.946504018489904, -77.02472855730106'

Create dictionary of venue categories to loop through for each neighborhood

In [63]:
category_dict = {
    'historic site': '4deefb944765f83613cdba6e',
    'museum': '4bf58dd8d48988d181941735',
    'metro': '4bf58dd8d48988d1fd931735',
    'music venue': '4bf58dd8d48988d1e5931735',
    'perfomring arts venue': '4bf58dd8d48988d1f2931735',
    'college and university': '4d4b7105d754a06372d81259',
    'food': '4d4b7105d754a06374d81259',
    'nightlife spot': '4d4b7105d754a06376d81259',
    'outdoors and recreation': '4d4b7105d754a06377d81259',
    'government building': '4bf58dd8d48988d126941735',
    'clothing store': '4bf58dd8d48988d103951735'
                }

In [64]:
cat_list = list(category_dict.values())
cat_list

['4deefb944765f83613cdba6e',
 '4bf58dd8d48988d181941735',
 '4bf58dd8d48988d1fd931735',
 '4bf58dd8d48988d1e5931735',
 '4bf58dd8d48988d1f2931735',
 '4d4b7105d754a06372d81259',
 '4d4b7105d754a06374d81259',
 '4d4b7105d754a06376d81259',
 '4d4b7105d754a06377d81259',
 '4bf58dd8d48988d126941735',
 '4bf58dd8d48988d103951735']

In [65]:
cat_key_list = list(category_dict.keys())
cat_key_list

['historic site',
 'museum',
 'metro',
 'music venue',
 'perfomring arts venue',
 'college and university',
 'food',
 'nightlife spot',
 'outdoors and recreation',
 'government building',
 'clothing store']

Set up Foursquare API pull on neighborhood venues

In [84]:
#set up first pull

venue_list = []

for j in cat_list:

    url = 'https://api.foursquare.com/v2/venues/search'

    params = dict(
    client_id='JWWPNW4JVAJ4OLM3ZSWASPF0R2ZP4DVHKQ52FGRLK0514J3Q',
    client_secret='NL0E1FVEIB0I1N22M51IRHPVUW45WDG0RSFF4IPW5Z01HBXM',
    v='20180323',
    ll= lat_lon[0],
    categoryId = j,
    limit=50,
    radius=1000  # meters, roughly a 0.6 mile radius
    )

    #make request
    req = requests.get(url=url, params=params)

    #pull necessary data
    data = json.loads(req.text)

    x = len(data['response']['venues'])
    
    venue_list.append(x)
    
full = pd.DataFrame(venue_list).T

full.columns = cat_key_list

In [85]:
full

Unnamed: 0,historic site,museum,metro,music venue,perfomring arts venue,college and university,food,nightlife spot,outdoors and recreation,government building,clothing store
0,1,1,1,3,5,8,48,15,47,10,10


In [86]:
#set up pulls for the rest of the neighborhoods

for i in lat_lon[1:]:

    venue_list = []
    
    for j in cat_list:

        #set up url for looping
        url = 'https://api.foursquare.com/v2/venues/search'

        params = dict(
        client_id='JWWPNW4JVAJ4OLM3ZSWASPF0R2ZP4DVHKQ52FGRLK0514J3Q',
        client_secret='NL0E1FVEIB0I1N22M51IRHPVUW45WDG0RSFF4IPW5Z01HBXM',
        v='20180323',
        ll= i,
        categoryId = j,
        limit=50,
        radius=1000  # meters, roughly a 0.6 mile radius
        )

        #make request
        req = requests.get(url=url, params=params)

        #pull necessary data
        data = json.loads(req.text)

        x = len(data['response']['venues'])

        venue_list.append(x)

    temp = pd.DataFrame(venue_list).T

    temp.columns = cat_key_list

    full = pd.concat([full, temp])


In [87]:
full.reset_index(inplace=True)

In [88]:
full

Unnamed: 0,index,historic site,museum,metro,music venue,perfomring arts venue,college and university,food,nightlife spot,outdoors and recreation,government building,clothing store
0,0,1,1,1,3,5,8,48,15,47,10,10
1,0,0,0,0,2,6,6,44,22,29,10,8
2,0,8,6,2,0,9,13,50,49,46,49,23
3,0,1,0,1,0,0,4,13,1,15,4,0
4,0,5,0,0,1,4,17,48,27,44,24,2
5,0,7,0,2,3,4,10,50,19,46,23,7
6,0,7,3,1,4,3,16,50,19,44,33,7
7,0,7,2,3,22,23,50,50,50,46,39,29
8,0,1,0,0,1,1,8,18,1,17,9,2
9,0,1,0,1,0,0,1,13,2,12,1,0
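
The two pull cells above repeat the same request logic. As a sketch only, they could be folded into a single helper that builds the whole table in one pass. The function names, the `get` parameter (added so the request can be swapped for a stub), the timeout, and the environment variable names `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET` are assumptions for illustration; the endpoint and query parameters are the ones used above.

```python
import os

import pandas as pd
import requests

SEARCH_URL = 'https://api.foursquare.com/v2/venues/search'


def count_venues(ll, category_id, get=requests.get):
    """Count venues of one category around an 'lat, lon' string."""
    params = dict(
        # Assumed environment variable names; keeps credentials out of the notebook
        client_id=os.environ.get('FOURSQUARE_CLIENT_ID', ''),
        client_secret=os.environ.get('FOURSQUARE_CLIENT_SECRET', ''),
        v='20180323',
        ll=ll,
        categoryId=category_id,
        limit=50,
        radius=1000,  # meters
    )
    resp = get(SEARCH_URL, params=params, timeout=10)
    resp.raise_for_status()  # surface HTTP errors before reading the body
    return len(resp.json()['response']['venues'])


def build_venue_counts(lat_lon, category_dict, get=requests.get):
    """Return one row per location and one column per category name."""
    rows = [
        {name: count_venues(ll, cat_id, get) for name, cat_id in category_dict.items()}
        for ll in lat_lon
    ]
    return pd.DataFrame(rows)
```

Under these assumptions, `full = build_venue_counts(lat_lon, category_dict)` would produce the same table with a clean 0 to n-1 index, so the separate first pull and the later index reset would not be needed.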


Combine neighborhood locations df with venue information

In [89]:
neighborhood_venues = pd.concat([locations, full], axis=1)

In [90]:
neighborhood_venues

Unnamed: 0,neighbourhood_cleansed,latitude,longitude,index,historic site,museum,metro,music venue,perfomring arts venue,college and university,food,nightlife spot,outdoors and recreation,government building,clothing store
0,"Brightwood Park, Crestwood, Petworth",38.946504,-77.024729,0,1,1,1,3,5,8,48,15,47,10,10
1,"Brookland, Brentwood, Langdon",38.92611,-76.983434,0,0,0,0,2,6,6,44,22,29,10,8
2,"Capitol Hill, Lincoln Park",38.884692,-76.992667,0,8,6,2,0,9,13,50,49,46,49,23
3,"Capitol View, Marshall Heights, Benning Heights",38.885287,-76.931463,0,1,0,1,0,0,4,13,1,15,4,0
4,"Cathedral Heights, McLean Gardens, Glover Park",38.924179,-77.075507,0,5,0,0,1,4,17,48,27,44,24,2
5,"Cleveland Park, Woodley Park, Massachusetts Av...",38.931255,-77.059736,0,7,0,2,3,4,10,50,19,46,23,7
6,"Colonial Village, Shepherd Park, North Portal ...",38.986958,-77.035842,0,7,3,1,4,3,16,50,19,44,33,7
7,"Columbia Heights, Mt. Pleasant, Pleasant Plain...",38.929396,-77.031084,0,7,2,3,22,23,50,50,50,46,39,29
8,"Congress Heights, Bellevue, Washington Highlands",38.835467,-76.999961,0,1,0,0,1,1,8,18,1,17,9,2
9,"Deanwood, Burrville, Grant Park, Lincoln Heigh...",38.900688,-76.928999,0,1,0,1,0,0,1,13,2,12,1,0
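
Note that `pd.concat(..., axis=1)` lines rows up by index label rather than by position, which is why `full` needed its index reset before this step. A small sketch with hypothetical values illustrating the idea:

```python
import pandas as pd

left = pd.DataFrame({'neighbourhood': ['A', 'B']})  # index labels 0, 1

# Rows stacked with pd.concat keep their original labels, here 0 and 0
right = pd.concat([pd.DataFrame({'food': [48]}), pd.DataFrame({'food': [44]})])

# With a fresh 0..n-1 index the rows pair up one-to-one with `left`
aligned = pd.concat([left, right.reset_index(drop=True)], axis=1)
```

Passing `drop=True` discards the old labels, which would also avoid creating the extra `index` column that is dropped in the next cell.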


In [91]:
neighborhood_venues.drop(columns=['index'], inplace=True)

Export neighborhood venue information to csv

In [92]:
neighborhood_venues.to_csv('../data/neighborhood_venues.csv')