# Predicting 5-Star Airbnb Ratings

Young Jeong

In this project, I use several classification algorithms to build a model that best predicts whether an Airbnb listing has a 5-star rating.

<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Abstract" data-toc-modified-id="Abstract-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Abstract</a></span></li><li><span><a href="#Obtain-the-Data" data-toc-modified-id="Obtain-the-Data-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Obtain the Data</a></span><ul class="toc-item"><li><span><a href="#Airbnb-Data" data-toc-modified-id="Airbnb-Data-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>Airbnb Data</a></span></li></ul></li><li><span><a href="#Scrub-the-Data" data-toc-modified-id="Scrub-the-Data-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Scrub the Data</a></span><ul class="toc-item"><li><span><a href="#Adding-useful-features-1:-Walkscore" data-toc-modified-id="Adding-useful-features-1:-Walkscore-3.1"><span class="toc-item-num">3.1&nbsp;&nbsp;</span>Adding useful features 1: Walkscore</a></span></li></ul></li><li><span><a href="#Explore-the-Data" data-toc-modified-id="Explore-the-Data-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Explore the Data</a></span></li><li><span><a href="#Model-the-Data" data-toc-modified-id="Model-the-Data-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Model the Data</a></span></li><li><span><a href="#Interpret-the-Model" data-toc-modified-id="Interpret-the-Model-6"><span class="toc-item-num">6&nbsp;&nbsp;</span>Interpret the Model</a></span></li></ul></div>

## Abstract

In the past few years, short-term rentals have become a vital part of the hospitality business, and Airbnb has grown into an international giant in the space. As a former Airbnb host, I often struggled to maintain the highest quality for my guests. Perception matters, and I fought to keep my ratings at the 5-star level. Why? Because Airbnb grants a special designation called SuperHost that leads to higher visibility, bonus perks, and higher income (Airbnb states an increase in income of 22%, or up to $2250 a month).

This leads to my main question: is there a correlation between certain features of a listing and its rating? Can I build a model that can (1) identify strong indicators of a 5-star rating and (2) correctly predict whether a listing is rated 5 stars? I believe this can help two groups of people:

1. Potential hosts who want to curate their listings to attract 5-star ratings, and
2. Current hosts who want to move toward 5-star ratings as a first step to becoming a SuperHost.

So here comes my journey in generating that model!

## Obtain the Data

### Airbnb Data

The main data source is insideairbnb.com, a non-profit website that scrapes Airbnb data for visualization purposes. The data is organized by major city, one of which is Seattle. For this project, I used the latest scrape of the Seattle listings.

The file is in CSV format and contains about 8400 listings, with information about each one stored in columns. Please see `references/data_dictionary` for more details.
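
As a minimal sketch of this step, the listings file could be loaded and inspected with pandas. The inline two-row CSV below is a made-up stand-in for the real `listings.csv`, used only to keep the example self-contained; the column values are illustrative, not taken from the dataset.

```python
import io

import pandas as pd

# Hypothetical two-row sample standing in for the real listings.csv
sample_csv = io.StringIO(
    "id,neighbourhood_cleansed,price,review_scores_rating\n"
    "1,Ballard,$85.00,100\n"
    "2,Fremont,$120.00,93\n"
)

# low_memory=False reads the file in one pass so column types are inferred consistently
listings = pd.read_csv(sample_csv, low_memory=False)

n_rows, n_cols = listings.shape
print(f"{n_rows} listings, {n_cols} columns")
print(listings.dtypes)
```

With the real file, `sample_csv` would be replaced by the path to the downloaded CSV.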

## Scrub the Data

There are lots of features in the scraped Airbnb flat file; 106 of them, to be exact. Looking through all of them gives some insight into which features are usable and which are not.

After carefully looking at them, I've determined these to be useful:

- 'host_since' - How long a host has been a host on Airbnb
- 'host_response_time' - How long it takes for the host to respond back to potential guests or guests
- 'host_response_rate' - How much the host responds to messages (rate of response vs total messages)
- 'host_total_listings_count' - How many listings hosts own 
- 'host_verifications' - How many ways Airbnb has verified the host (IDs, phone numbers, etc.)
- 'host_has_profile_pic' - Whether the host has a profile picture
- 'host_identity_verified' - Whether the host's identity is verified by Airbnb
- 'neighbourhood_cleansed' - Neighbourhood the listing is located in
- 'zipcode' - Zipcode of the listing
- 'latitude' - latitude of the listing
- 'longitude' - longitude of the listing
- 'room_type' - Type of room (Entire house or a room in a house)
- 'bathrooms' - # of Bathrooms 
- 'bedrooms' - # of bedrooms
- 'beds' - # of beds
- 'bed_type' - Types of beds provided
- 'amenities' - Amenities provided
- 'price' - Price of the listing (per night)
- 'security_deposit' - Security Deposit (if required)
- 'cleaning_fee' - Cleaning Fee (if required)
- 'extra_people' - Fee charged per additional guest
- 'minimum_nights' - Minimum nights required per reservation
- 'calendar_updated' - How recently the listing's calendar was updated
- 'availability_30' - How many days available within the next 30 days
- 'availability_90' - How many days available within the next 90 days
- 'cancellation_policy' - Cancellation policy
- 'reviews_per_month' - How many reviews are received per month
- 'review_scores_rating' - Ratings (our target variable)
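
In Inside Airbnb exports, several of these columns typically arrive as text, for example prices such as `$85.00` and response rates such as `95%`, so they need converting before modeling. The sketch below shows one way to do that and to derive a binary target from `review_scores_rating`. The sample rows, the 0-100 rating scale, the cutoff of 100 for "5-star", and the helper names are all assumptions made for illustration, not values confirmed by this project.

```python
import pandas as pd

# Made-up sample rows; real values would come from listings.csv
df = pd.DataFrame({
    "price": ["$85.00", "$1,200.00", "$60.00"],
    "host_response_rate": ["95%", None, "100%"],
    "review_scores_rating": [100.0, 93.0, None],
})

def money_to_float(series):
    """Strip '$' and thousands separators, then convert to numbers."""
    return pd.to_numeric(series.str.replace(r"[$,]", "", regex=True))

def percent_to_fraction(series):
    """Convert strings like '95%' to 0.95; missing values stay missing."""
    return pd.to_numeric(series.str.rstrip("%")) / 100

df["price"] = money_to_float(df["price"])
df["host_response_rate"] = percent_to_fraction(df["host_response_rate"])

# Assumed cutoff: treat a rating of 100 (on an assumed 0-100 scale) as 5-star
FIVE_STAR_CUTOFF = 100

# Listings without any rating cannot be labeled, so they are dropped here
rated = df.dropna(subset=["review_scores_rating"]).copy()
rated["is_five_star"] = (rated["review_scores_rating"] >= FIVE_STAR_CUTOFF).astype(int)
print(rated)
```

Where to place the cutoff is a modeling decision; a looser threshold would label more listings as positive and change the class balance.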

### Adding useful features 1: Walkscore

Using the latitude and longitude features, I first used Google's Geocoding API to look up the street address of each listing (reverse geocoding). This was a necessary step because the Walkscore API requires a street address.

In [12]:
## %%writefile ../src/features/build_features.py
import json
import os

import googlemaps
import numpy as np
import pandas as pd
import requests
import seaborn as sns
from psycopg2 import connect
from psycopg2.extensions import ISOLATION_LEVEL_AUTOCOMMIT
from sqlalchemy import create_engine

# Connection settings for the local PostgreSQL server
DB_PARAMS = {
    'host': '127.0.0.1',
    'user': 'youngjeong',
    'port': 5432
}

def dataset_sql(directory, filename):
    """Loads the raw CSV into a 'listings' table in a new 'listings' database."""
    df = pd.read_csv(directory+filename, delimiter=',', low_memory=False)

    # 'listings' does not exist yet, so connect to the default 'postgres'
    # database (assumed to be present) in order to create it
    connection = connect(**DB_PARAMS, dbname='postgres')
    connection.set_isolation_level(ISOLATION_LEVEL_AUTOCOMMIT)
    connection.cursor().execute('CREATE DATABASE listings;')
    connection.close()

    # SQLAlchemy expects the 'postgresql://' scheme; no password is set here
    connection_string = (
        f'postgresql://{DB_PARAMS["user"]}@{DB_PARAMS["host"]}:{DB_PARAMS["port"]}/listings'
    )
    engine = create_engine(connection_string)
    df.to_sql('listings', engine, index=False)

# Retrieves street address from Google using Geocode API
def retrieve_address():
    connection = connect(**DB_PARAMS, dbname='listings')
    cursor = connection.cursor()
    cursor.execute("SELECT latitude, longitude FROM listings")
    latlong = cursor.fetchall()  # list of (latitude, longitude) tuples
    connection.close()

    # Assumption: the API key is read from an environment variable
    # (GOOGLE_MAPS_API_KEY is a name chosen here) instead of being hard-coded
    gmaps = googlemaps.Client(key=os.environ['GOOGLE_MAPS_API_KEY'])

    add_list = []

    for lat, long in latlong:
        address = gmaps.reverse_geocode((round(lat, 6), round(long, 6)))
        # Keep the list aligned with the listings even when no address is found
        add_list.append(address[0]['formatted_address'] if address else None)

    return add_list

def get_walkscore_url(address, city, zip_code, lat, lon):
    """
    Construct url for Walkscore api call
    Input: address, city, and zip_code as strings; lat/lon coords as float
    Output: prepared url to request walkscore for address
    """
    # Assumption: the API key is read from an environment variable
    # (WALKSCORE_API_KEY is a name chosen here) instead of being hard-coded
    api_key = os.environ['WALKSCORE_API_KEY']
    base_url = 'http://api.walkscore.com/score?format=json'
    mid_url = 'transit=1&bike=1'
    # Spaces are percent-encoded as %20
    address = 'address=' + '%20'.join(address.split())
    address = '%20'.join([address, city, 'WA', zip_code])
    lat = f'lat={lat}'
    lon = f'lon={lon}'
    api_key = f'wsapikey={api_key}'
    url = '&'.join([base_url, address, lat, lon, mid_url, api_key])
    return url


def get_walkscores(row):
    """
    Makes api call to Walkscore and extracts bike, walk, and transit scores
    Input: dataframe row containing required fields
    Output: list containing bike, walk, and transit scores (or nan, if failure)
    """
    fields = ['address1', 'city', 'zip_code', 'latitude', 'longitude']
    address, city, zip_code, lat, lon = row[fields]
    url = get_walkscore_url(address, city, zip_code, lat, lon)
    
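    try:
        r = requests.get(url, timeout=10)
        response = json.loads(r.text)

        bike_score = response['bike']['score']
        walk_score = response['walkscore']
        transit_score = response['transit']['score']
    except (requests.RequestException, ValueError, KeyError, TypeError):
        # Request failed, response was not valid JSON, or a score was missing
        return [np.nan, np.nan, np.nan]
    return [bike_score, walk_score, transit_score]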
    
    
def run():
    """
    Executes a set of helper functions that read files from data/raw, cleans them,
    and converts the data into a design matrix that is ready for modeling.
    """
    dataset_sql('../data/raw/', 'listings.csv')
    add_list = retrieve_address()
    # clean_dataset_2('data/raw', filename)
    # save_cleaned_data_1('data/interim', filename)
    # save_cleaned_data_2('data/interim', filename)
    # build_features()
    # save_features('data/processed')
    pass

In [13]:
run()
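
As an alternative to joining URL fragments by hand, the query string could be built with the standard library's `urllib.parse.urlencode`, which percent-encodes spaces and other special characters automatically. This is a sketch rather than the project's implementation: the function name `build_walkscore_url`, the sample address, and the placeholder key are hypothetical, while the parameter names mirror those used in `get_walkscore_url` above.

```python
from urllib.parse import quote, urlencode

def build_walkscore_url(address, lat, lon, api_key):
    """Return a Walkscore request URL with all parameters percent-encoded."""
    params = {
        "format": "json",
        "address": address,
        "lat": lat,
        "lon": lon,
        "transit": 1,
        "bike": 1,
        "wsapikey": api_key,
    }
    # quote_via=quote encodes spaces as %20 instead of '+'
    return "http://api.walkscore.com/score?" + urlencode(params, quote_via=quote)

# Hypothetical address and placeholder key, for illustration only
url = build_walkscore_url("123 Example St Seattle WA 98101", 47.61, -122.33, "YOUR_KEY")
print(url)
```

Letting the library handle the encoding avoids mistakes such as a bare `%` separator, which does not produce a valid URL.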

*Before moving on to exploratory analysis, write down some notes about challenges encountered while working with this data that might be helpful for anyone else (including yourself) who may work through this later on.*

## Explore the Data

*Before you start exploring the data, write out your thought process about what you're looking for and what you expect to find. Take a minute to confirm that your plan actually makes sense.*

*Calculate summary statistics and plot some charts to give you an idea what types of useful relationships might be in your dataset. Use these insights to go back and download additional data or engineer new features if necessary. Not now though... remember we're still just trying to finish the MVP!*

In [None]:
## %%writefile ../src/visualization/visualize.py

# imports
# helper functions go here

def run():
    """
    Executes a set of helper functions that read files from data/processed,
    calculates descriptive statistics for the population, and plots charts
    that visualize interesting relationships between features.
    """
    # data = load_features('data/processed')
    # describe_features(data, 'reports/')
    # generate_charts(data, 'reports/figures/')
    pass


*What did you learn? What relationships do you think will be most helpful as you build your model?*

## Model the Data

*Describe the algorithm or algorithms that you plan to use to train with your data. How do these algorithms work? Why are they good choices for this data and problem space?*

In [None]:
## %%writefile ../src/models/train_model.py

# imports
# helper functions go here

def run():
    """
    Executes a set of helper functions that read features from data/processed,
    split them into training and test sets, fit a model on the training set,
    and save the trained model to models/.
    """
    # data = load_features('data/processed/')
    # train, test = train_test_split(data)
    # save_train_test(train, test, 'data/processed/')
    # model = build_model()
    # model.fit(train)
    # save_model(model, 'models/')
    pass


In [None]:
## %%writefile ../src/models/predict_model.py

# imports
# helper functions go here

def run():
    """
    Executes a set of helper functions that load the test data from
    data/processed and the trained model from models/, generate predictions,
    and save evaluation metrics to reports/.
    """
    # test_X, test_y = load_test_data('data/processed')
    # trained_model = load_model('models/')
    # predictions = trained_model.predict(test_X)
    # metrics = evaluate(test_y, predictions)
    # save_metrics(metrics, 'reports/')
    pass
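
The commented-out train and predict steps above could look roughly like the following scikit-learn sketch. Everything here is an assumption for illustration: the data is synthetic, and logistic regression is only a stand-in classifier, since the notebook does not yet say which algorithms are used.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the processed design matrix and the 5-star labels
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# Hold out 25% of the rows for evaluation, keeping the class balance in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Stand-in classifier; any scikit-learn estimator with fit/predict would slot in here
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Score the model on rows it has not seen during training
accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"Hold-out accuracy: {accuracy:.2f}")
```

If 5-star listings turn out to be much rarer or more common than the rest, accuracy alone can be misleading, and metrics such as precision and recall would be worth reporting alongside it.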



_Write down any thoughts you may have about working with these algorithms on this data. What other ideas do you want to try out as you iterate on this pipeline?_

## Interpret the Model

_Write up the things you learned and how well your model performed. Be sure to address the model's strengths and weaknesses. What types of data does it handle well? What types of observations tend to give it a hard time? What future work might you, or someone reading this, want to do, building on the lessons learned and tools developed in this project?_