# ASK - Queries

The AirBnB dataset was obtained from [http://insideairbnb.com/get-the-data]. We used the last four quarters of AirBnB listing data for Washington D.C.

We began our investigation of the dataset with the following questions:
1. What is the context of the need, and who are the stakeholders and other interested parties?
2. What is the organizational need which requires fixing with data?
3. What is going to be required and what does success look like?
4. How will the result work itself back into the organization?

## 1. Context

AirBnB is a digital platform that connects private individuals who want to rent out rooms in their homes for a short time with people who are looking for a room for an overnight stay. On AirBnB, those who offer their homes for rent are called 'hosts' and those who rent them are 'guests'. AirBnB can be compared to a hotel booking model, except that the booked rooms are in homes owned by private individuals.

The city of Washington D.C. has many AirBnB rentals distributed across the city. As guests look to book a room, they use many signals provided by the AirBnB platform to narrow down their results and make their decision. These include host status (superhost), verification, location, reviews, neighborhood, availability, price, and other listing information.

## 2. Need

Our team is interested in knowing which factors are the most influential in determining the price of an AirBnB. From our experience booking hotels and using the AirBnB service, we hypothesized that location is a primary factor in determining the nightly price of an AirBnB listing. This line of thinking naturally led us to consider the characteristics of a location, such as the neighborhood and the relative safety of the area, and in turn the impact of crime on AirBnB pricing. Additionally, we believe that price can be predicted from the characteristics of a listing, its location, crime (as a measure of general safety), and reviews. The resulting model is intended to show hosts and public safety officials what drives the pricing power of AirBnBs.

## 3. Vision

The success of our model depends on the ability of our secondary dataset, crime data, to boost the model's power to predict price. We believe that with sufficient location data and other information from the listing, we will see that crime and location are two factors impacting the price of AirBnBs.

## 4. Outcome

If we can create a model that accurately predicts AirBnB prices, it can be used as a baseline to help hosts price their AirBnBs more competitively in their market by showing which factors lead to the most pricing power. Additionally, we can publish the findings so that cities and policymakers can understand the impact crime has on these short-term rental operations.

# Queries

In [3]:
import numpy as np
import time
import seaborn as sns
import matplotlib.pyplot as plt
import scipy.stats as stats
import pandas as pd
import duckdb

sns.set(style="whitegrid")

In [5]:
con = duckdb.connect(database='ps6.duckdb', read_only=False)

Count the number of latest listings in the database.

In [13]:
con.execute("select count(id) from latest_listings;")
print(con.fetchall())

[(10560,)]


Describe the database tables and data types.

In [16]:
con.execute("DESCRIBE")
print(con.fetchall())

[('all_listings', ['accommodates', 'amenities', 'availability_30', 'availability_365', 'availability_60', 'availability_90', 'bathrooms', 'bathrooms_text', 'bedrooms', 'beds', 'calculated_host_listings_count', 'calculated_host_listings_count_entire_homes', 'calculated_host_listings_count_private_rooms', 'calculated_host_listings_count_shared_rooms', 'calendar_last_scraped', 'calendar_updated', 'description', 'first_review', 'has_availability', 'host_about', 'host_acceptance_rate', 'host_has_profile_pic', 'host_id', 'host_identity_verified', 'host_is_superhost', 'host_listings_count', 'host_location', 'host_name', 'host_neighbourhood', 'host_picture_url', 'host_response_rate', 'host_response_time', 'host_since', 'host_thumbnail_url', 'host_total_listings_count', 'host_url', 'host_verifications', 'id', 'instant_bookable', 'last_review', 'last_scraped', 'latitude', 'license', 'listing_url', 'longitude', 'maximum_maximum_nights', 'maximum_minimum_nights', 'maximum_nights', 'maximum_nights_

Obtain all listings from the data warehouse and save them to a pandas dataframe.

In [6]:
all_listings = con.execute("SELECT * from all_listings").df()

Obtain a list of neighborhoods from the neighborhoods table and save it to a pandas dataframe.

## Neighborhoods

In [7]:
neighborhoods = con.execute("SELECT * from neighborhoods").df()

Display a list of the prices from all listings where price is less than $5.

In [8]:
from tabulate import tabulate

In [9]:
con.execute("SELECT DISTINCT id, name, price from all_listings WHERE price < 5")
low_cost = list(con.fetchall())
print(tabulate(low_cost, headers=["id", "name", "price"], tablefmt='fancy_grid'))

╒══════════╤════════════════════════════════╤═════════╕
│       id │ name                           │   price │
╞══════════╪════════════════════════════════╪═════════╡
│ 42738808 │ Capital View Hostel            │       0 │
├──────────┼────────────────────────────────┼─────────┤
│ 43036130 │ U Street Capsule Hostel        │       0 │
├──────────┼────────────────────────────────┼─────────┤
│ 46253554 │ citizenM Washington DC Capitol │       0 │
├──────────┼────────────────────────────────┼─────────┤
│ 43301430 │ Riggs Washington DC            │       0 │
├──────────┼────────────────────────────────┼─────────┤
│ 42065771 │ The LINE Hotel DC              │       0 │
├──────────┼────────────────────────────────┼─────────┤
│ 43308773 │ Viceroy Washington DC          │       0 │
╘══════════╧════════════════════════════════╧═════════╛
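The two price filters in this section (below $5 and above $5,000) can also be summarized in a single pass. The following is a minimal sketch, not part of the original analysis: it uses Python's built-in `sqlite3` as a stand-in for DuckDB so that it runs without `ps6.duckdb`, and the connection name `con_demo` and all rows are made up for illustration. The SQL itself is plain enough that it would be expected to work on the DuckDB connection as well, though that is an assumption.

```python
import sqlite3

# Toy stand-in for all_listings; ids, names, and prices are hypothetical.
con_demo = sqlite3.connect(":memory:")
con_demo.execute("CREATE TABLE all_listings (id INTEGER, name TEXT, price INTEGER)")
con_demo.executemany(
    "INSERT INTO all_listings VALUES (?, ?, ?)",
    [
        (1, "hypothetical hostel", 0),
        (2, "hypothetical studio", 120),
        (3, "hypothetical rowhome", 250),
        (4, "hypothetical mansion", 7500),
        (2, "hypothetical studio", 120),  # same listing from a second scrape
    ],
)

# Bucket each distinct (id, price) pair with the 5 and 5000 thresholds.
price_band_sql = """
    SELECT
        CASE
            WHEN price < 5 THEN 'under 5'
            WHEN price > 5000 THEN 'over 5000'
            ELSE 'between'
        END AS price_band,
        COUNT(*) AS n
    FROM (SELECT DISTINCT id, price FROM all_listings) AS distinct_prices
    GROUP BY price_band
    ORDER BY price_band
"""
price_bands = con_demo.execute(price_band_sql).fetchall()
print(price_bands)
```

Counting distinct `(id, price)` pairs rather than raw rows keeps repeated scrapes of the same listing from inflating a band.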


Display a list of listings where price is greater than $5,000.

In [10]:
con.execute("SELECT DISTINCT id, name, price from all_listings WHERE price > 5000")
high_cost = list(con.fetchall())
print(tabulate(high_cost, headers=["id", "name", "price"], tablefmt='fancy_grid'))

╒════════════════════╤════════════════════════════════════════════════════╤═════════╕
│                 id │ name                                               │   price │
╞════════════════════╪════════════════════════════════════════════════════╪═════════╡
│           14507861 │ Entire Capitol Hill Home - 5BR/4BA                 │    5995 │
├────────────────────┼────────────────────────────────────────────────────┼─────────┤
│           46004444 │ Yours Truly DC, 2 Bedroom Master Suite             │   10000 │
├────────────────────┼────────────────────────────────────────────────────┼─────────┤
│ 614471937104927680 │ NEW Listing! Unique House+Garden Rental, sleeps 40 │    7500 │
├────────────────────┼────────────────────────────────────────────────────┼─────────┤
│            8303678 │ Vista 2 Bedroom Rowhome FoggyBottom                │    6000 │
├────────────────────┼────────────────────────────────────────────────────┼─────────┤
│            8784458 │ Spacious condo in NW, DC       

## Secondary Data Source

For our secondary data source, we wanted to answer the question of how the safety of a neighborhood impacts the price of an AirBnB. We used crime as our measure of safety and drew on the crime data available through the Metropolitan Police Department.

Count the number of crimes in the crimes table. 

In [14]:
con.execute("SELECT COUNT(*) FROM crimes")
print(con.fetchall())

[(27707,)]
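The total can also be broken down by neighborhood cluster and offense group, which is closer to the per-neighborhood safety signal we are after. Below is a minimal sketch under stated assumptions: it uses Python's built-in `sqlite3` as a stand-in for DuckDB, the connection name `con_demo` is hypothetical, the rows are made up, and it assumes the `crimes` table has `neighborhood_cluster` and `offensegroup` columns.

```python
import sqlite3

# Toy stand-in for the crimes table; rows are hypothetical.
con_demo = sqlite3.connect(":memory:")
con_demo.execute("CREATE TABLE crimes (neighborhood_cluster TEXT, offensegroup TEXT)")
con_demo.executemany(
    "INSERT INTO crimes VALUES (?, ?)",
    [
        ("cluster 39", "violent"),
        ("cluster 39", "property"),
        ("cluster 39", "property"),
        ("cluster 18", "property"),
        ("cluster 18", "violent"),
    ],
)

# Count crimes for each (cluster, offense group) combination.
crime_counts_sql = """
    SELECT neighborhood_cluster, offensegroup, COUNT(*) AS n_crimes
    FROM crimes
    GROUP BY neighborhood_cluster, offensegroup
    ORDER BY neighborhood_cluster, offensegroup
"""
crime_counts = con_demo.execute(crime_counts_sql).fetchall()
for cluster, group, n_crimes in crime_counts:
    print(cluster, group, n_crimes)
```

A table of this shape could later be joined to listings by neighborhood, provided the listings can be mapped to the same cluster labels.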


Explore the crimes data.

In [18]:
con.execute('select * from crimes;')
crime_data = con.fetch_df()
print(crime_data)

      neighborhood_cluster  census_tract offensegroup  longitude  \
0               cluster 32        9603.0      violent -76.952775   
1               cluster 39        9700.0      violent -76.985384   
2               cluster 33        9906.0      violent -76.934174   
3               cluster 21        3400.0      violent -77.012155   
4               cluster 39        7304.0     property -76.992605   
...                    ...           ...          ...        ...   
27702           cluster 18        2400.0     property -77.022238   
27703           cluster 34        7601.0     property -76.975962   
27704           cluster 27        7201.0     property -77.001291   
27705           cluster 25       10601.0     property -77.007482   
27706           cluster 18        2301.0      violent -77.016427   

                      end_date                offense_text     shift  \
0        11/6/2013, 1:07:00 PM                    homicide  midnight   
1        1/9/2022, 11:31:00 PM         

## Listings

The all_listings table includes multiple entries for the same listing. We may want to keep just one row per listing for the whole year, taking the latest data. This would weight each listing the same.

Create a view for just the latest listings.

First, we need to determine the latest scrape date for each listing (e.g. when was the latest data scrape for listing 3686?).

In [None]:
con.execute('drop view if exists last_scraped')
con.execute('create view last_scraped as select id, max(calendar_last_scraped) as last_scraped from all_listings group by id;')


Now, we'll create a view called latest_listings that only takes the latest listing data for each listing.

In [None]:
con.execute('drop view if exists latest_listings')
con.execute('create view latest_listings as select all_listings.* from all_listings inner join last_scraped on all_listings.id = last_scraped.id and all_listings.calendar_last_scraped = last_scraped.last_scraped; ')
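To see what the two views do, the same pattern can be run on a handful of rows. This is a minimal sketch, not the notebook's actual data: it uses Python's built-in `sqlite3` as a stand-in for DuckDB, `con_demo` is a hypothetical connection name, and the scrape dates and prices are made up (only the listing id 3686 comes from the text above).

```python
import sqlite3

# Toy stand-in for all_listings with several scrapes per listing.
con_demo = sqlite3.connect(":memory:")
con_demo.execute(
    "CREATE TABLE all_listings (id INTEGER, calendar_last_scraped TEXT, price INTEGER)"
)
con_demo.executemany(
    "INSERT INTO all_listings VALUES (?, ?, ?)",
    [
        (3686, "2022-03-20", 60),
        (3686, "2022-06-15", 65),
        (3686, "2022-09-14", 67),
        (9999, "2022-03-20", 150),
        (9999, "2022-06-15", 140),
    ],
)

# Step 1: latest scrape date per listing.
con_demo.execute("DROP VIEW IF EXISTS last_scraped")
con_demo.execute(
    "CREATE VIEW last_scraped AS "
    "SELECT id, MAX(calendar_last_scraped) AS last_scraped "
    "FROM all_listings GROUP BY id"
)

# Step 2: keep only the rows that match that latest date.
con_demo.execute("DROP VIEW IF EXISTS latest_listings")
con_demo.execute(
    "CREATE VIEW latest_listings AS "
    "SELECT all_listings.* FROM all_listings "
    "INNER JOIN last_scraped "
    "ON all_listings.id = last_scraped.id "
    "AND all_listings.calendar_last_scraped = last_scraped.last_scraped"
)

latest = con_demo.execute("SELECT * FROM latest_listings ORDER BY id").fetchall()
print(latest)
```

One caveat of this join: if a listing has two rows with the same maximum `calendar_last_scraped` value, both rows are kept, so the view is only guaranteed to have one row per listing when each (id, scrape date) pair is unique.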

How many listings are there compared to the entire data set?

In [None]:
con.execute("select count(id) from all_listings;")
print(con.fetchall())

There were 28076 total rows in the all_listings data.

In [None]:
con.execute("select count(id) from latest_listings;")
print(con.fetchall())

The latest_listings view, which only has one row per listing, has just 10560 rows, fewer than half of the 28076 rows in all_listings.

## Reviews

Count the number of reviews in the reviews table.

In [24]:
con.execute("select count(reviewer_id) from reviews;")
print(con.fetchall())

[(321209,)]


## Calendar

Get a count of the rows in the calendar table.

In [27]:
con.execute("select count(date) from calendar;")
print(con.fetchall())

[(10245531,)]


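The calendar table is very large compared to the other tables: 10245531 rows is roughly 365 rows for each of the 28076 rows in all_listings. This could indicate many duplicate or redundant rows, although it would also be consistent with about a year of daily entries per scraped listing.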
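One way to test the duplicates hypothesis is to look for `(listing_id, date)` pairs that occur more than once. The sketch below is illustrative only: it uses Python's built-in `sqlite3` as a stand-in for DuckDB, `con_demo` is a hypothetical connection name, and the rows are made up.

```python
import sqlite3

# Toy stand-in for the calendar table; rows are hypothetical.
con_demo = sqlite3.connect(":memory:")
con_demo.execute("CREATE TABLE calendar (listing_id INTEGER, date TEXT, price INTEGER)")
con_demo.executemany(
    "INSERT INTO calendar VALUES (?, ?, ?)",
    [
        (3686, "2022-06-24", 67),
        (3686, "2022-06-24", 67),  # same listing and date, e.g. from a second scrape
        (3686, "2022-06-25", 67),
        (9999, "2022-06-24", 140),
    ],
)

# Any (listing_id, date) pair with more than one row is a candidate duplicate.
duplicate_sql = """
    SELECT listing_id, date, COUNT(*) AS n_rows
    FROM calendar
    GROUP BY listing_id, date
    HAVING COUNT(*) > 1
    ORDER BY listing_id, date
"""
duplicates = con_demo.execute(duplicate_sql).fetchall()
print(duplicates)
```

An empty result would suggest the table is large simply because it is daily data, while a long result would point to overlapping scrapes that may need to be deduplicated.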

Let's look at a description of the columns in the calendar table.

In [32]:
con.execute("DESCRIBE calendar")
print(con.fetchall())

[('listing_id', 'BIGINT', 'YES', None, None, None), ('date', 'DATE', 'YES', None, None, None), ('available', 'VARCHAR', 'YES', None, None, None), ('price', 'INTEGER', 'YES', None, None, None), ('adjusted_price', 'INTEGER', 'YES', None, None, None), ('minimum_nights', 'INTEGER', 'YES', None, None, None), ('maximum_nights', 'INTEGER', 'YES', None, None, None)]


Retrieve the distinct dates in the calendar.

In [29]:
con.execute("SELECT DISTINCT date FROM calendar")
print(con.fetchall())

[(datetime.date(2022, 6, 24),), (datetime.date(2022, 7, 3),), (datetime.date(2022, 7, 14),), (datetime.date(2022, 8, 11),), (datetime.date(2022, 8, 16),), (datetime.date(2022, 9, 21),), (datetime.date(2022, 10, 4),), (datetime.date(2022, 10, 5),), (datetime.date(2022, 10, 17),), (datetime.date(2022, 11, 3),), (datetime.date(2022, 11, 11),), (datetime.date(2022, 11, 14),), (datetime.date(2022, 11, 28),), (datetime.date(2022, 11, 29),), (datetime.date(2022, 12, 11),), (datetime.date(2022, 12, 12),), (datetime.date(2021, 12, 20),), (datetime.date(2022, 1, 4),), (datetime.date(2022, 1, 13),), (datetime.date(2022, 1, 15),), (datetime.date(2022, 1, 16),), (datetime.date(2022, 1, 25),), (datetime.date(2022, 1, 30),), (datetime.date(2022, 3, 26),), (datetime.date(2022, 5, 24),), (datetime.date(2022, 5, 31),), (datetime.date(2022, 6, 3),), (datetime.date(2022, 6, 13),), (datetime.date(2022, 1, 6),), (datetime.date(2022, 1, 9),), (datetime.date(2022, 3, 31),), (datetime.date(2023, 3, 6),), (date
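Printing every distinct date produces a long list that is hard to read. A compact alternative is to ask for the number of distinct dates together with the earliest and latest date. This is a minimal sketch with made-up rows, using Python's built-in `sqlite3` as a stand-in for DuckDB; `con_demo` is a hypothetical connection name.

```python
import sqlite3

# Toy stand-in for the calendar table; rows are hypothetical.
con_demo = sqlite3.connect(":memory:")
con_demo.execute("CREATE TABLE calendar (listing_id INTEGER, date TEXT)")
con_demo.executemany(
    "INSERT INTO calendar VALUES (?, ?)",
    [
        (3686, "2021-12-20"),
        (3686, "2022-06-24"),
        (9999, "2022-06-24"),
        (9999, "2023-03-06"),
    ],
)

# Summarize the date coverage instead of listing every date.
date_summary_sql = """
    SELECT COUNT(DISTINCT date) AS n_dates,
           MIN(date) AS first_date,
           MAX(date) AS last_date
    FROM calendar
"""
n_dates, first_date, last_date = con_demo.execute(date_summary_sql).fetchone()
print(n_dates, first_date, last_date)
```

Dates are stored here as ISO-formatted text, so `MIN` and `MAX` order them chronologically; with a real `DATE` column, as in the calendar table, the comparison is chronological by type.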