# Airbnb Dataset
The **[Inside Airbnb](http://insideairbnb.com/about.html)** project has provided data on Airbnb listings by location since 2015. The data is updated nearly every month, and previous datasets are archived but remain publicly available. For a given location and date, three datasets are provided. The *listings* dataset contains detailed information on registered listings, such as location (i.e. latitude and longitude), host information, price, availability in the next 30-365 days, and review ratings. The *reviews* dataset provides the date, the reviewer's identification number, and the review text for every review of a specific listing. Finally, the *calendar* dataset provides the availability of a listing by day, as well as the price and the listing identification number.

For our project, we are interested in compiling information from the *listings* and *calendar* datasets for the San Francisco Bay Area, which covers the datasets for Oakland, San Francisco, San Mateo County, and Santa Clara County. We downloaded the datasets for each of these locations for all available months and combined them into a final dataset for data exploration. This notebook walks through the creation of that dataset step by step.


### Importing the data

First, we need to import all files downloaded from Inside Airbnb into the notebook. The files were organized by location and date, so the import follows the same order. 

In [1]:
import pandas as pd
import glob

In [2]:
# Creating a list of files' names

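# sorted() keeps the file order reproducible, since glob.glob returns names in arbitrary order
Bay_Area_listings = sorted(glob.glob("**/*listings.csv"))
Bay_Area_calendar = sorted(glob.glob("**/*calendar.csv"))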

In [3]:
# Reading the csv files

Bay_Area_listings_dfs = [pd.read_csv(file) for file in Bay_Area_listings]
Bay_Area_calendar_dfs = [pd.read_csv(file) for file in Bay_Area_calendar]
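
Since the files are organized by location and date, it can help to recover both from each file name instead of relying on list positions. The following is a minimal sketch that assumes every file follows the pattern seen in names such as `Oakland/Oakland_2020_Oct25_calendar.csv`; the helpers `parse_snapshot_name` and `sort_snapshots` are hypothetical and not part of the original notebook.

```python
import os
import re
from datetime import datetime

# Assumed naming scheme: <City>_<YYYY>_<MonDD>_<kind>.csv
SNAPSHOT_PATTERN = re.compile(
    r"^(?P<city>.+)_(?P<year>\d{4})_(?P<month>[A-Za-z]{3})(?P<day>\d{2})"
    r"_(?P<kind>listings|calendar)\.csv$"
)


def parse_snapshot_name(path):
    """Return (city, scrape_date, kind) parsed from a snapshot file path."""
    name = os.path.basename(path)
    match = SNAPSHOT_PATTERN.match(name)
    if match is None:
        raise ValueError(f"Unexpected file name: {path}")
    # %b expects English month abbreviations (Jan, Feb, ...) in the default locale
    scrape_date = datetime.strptime(
        f"{match['year']} {match['month']} {match['day']}", "%Y %b %d"
    ).date()
    return match["city"], scrape_date, match["kind"]


def sort_snapshots(paths):
    """Order file paths by city, then chronologically by scrape date."""
    return sorted(paths, key=lambda p: parse_snapshot_name(p)[:2])


example_paths = [
    "Oakland/Oakland_2018_May17_calendar.csv",
    "Oakland/Oakland_2018_Apr14_calendar.csv",
    "Oakland/Oakland_2015_Jun22_calendar.csv",
]
ordered = sort_snapshots(example_paths)
print(ordered)
```

Sorting chronologically rather than alphabetically would place the April 2018 file before the May 2018 file, which alphabetical order also does, but it would additionally place May before August, which alphabetical order does not.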

### *Listings* Datasets

The *listings* datasets contain the following columns: 

1. *id*: The unique identifier of the Airbnb listing
2. *name*: The name of the listing
3. *host_id*: The unique identifier of the Airbnb host
4. *host_name*: The host's name
5. *neighbourhood_group*: A blank column
6. *neighbourhood*: Name of the listing's neighbourhood
7. *latitude*: Latitude of the listing
8. *longitude*: Longitude of the listing
9. *room_type*: Type of room of the listing
10. *price*: Price of the listing for one day
11. *minimum_nights*: Minimum number of nights required for a booking
12. *number_of_reviews*: The number of reviews the listing has
13. *last_review*: Date of the last review
14. *reviews_per_month*: Number of reviews per month since the listing was added
15. *calculated_host_listings_count*: Total number of listings the host has
16. *availability_365*: Number of days the listing is available in the next 365 days

From the *listings* datasets, we are interested in the location of the Airbnb listing. 
To create a directory of the listings, we will create a dataset with the columns **id**, **latitude**, and **longitude**. Then, we will merge all *listings* datasets into one and remove duplicates. 

In [4]:
# Removing columns from datasets 
for i in range(len(Bay_Area_listings_dfs)):
    Bay_Area_listings_dfs[i] = Bay_Area_listings_dfs[i].drop(columns = ['name', 'host_id', 'host_name', 'neighbourhood_group', 'neighbourhood', 'room_type', 'price', 'minimum_nights', 'number_of_reviews', 'last_review', 'reviews_per_month', 'calculated_host_listings_count', 'availability_365'])


In [5]:
Bay_Area_listings_dfs[0].head()

,id,latitude,longitude
0,1789525,37.800801,-122.300892
1,6093412,37.805462,-122.285417
2,5421997,37.814515,-122.251765
3,5970720,37.809887,-122.258498
4,3878187,37.809335,-122.257051


In [6]:
# Merging datasets and removing duplicates 

Listings_directory = pd.concat(Bay_Area_listings_dfs, ignore_index=True).drop_duplicates()
Listings_directory.sort_values(['id']).head(10)

,id,latitude,longitude
409701,958,37.769266,-122.433588
556979,958,37.76931,-122.43386
265418,958,37.76931,-122.433856
262343,2822,37.728192,-122.397966
26543,3083,37.80832,-122.293
855,3083,37.808323,-122.293002
26544,3264,37.83593,-122.2611
2688,3264,37.835931,-122.261096
265869,3850,37.754021,-122.458048
564257,3850,37.75402,-122.45805


In [7]:
# We still have duplicate ids because the latitude or longitude values differ slightly between datasets
# So, we take the average latitude and longitude of each id as the best approximation of the real values
Final_Listings_directory = Listings_directory.groupby('id').mean()

Final_Listings_directory.sort_values(['id']).head(10)

id,latitude,longitude
958,37.769295,-122.433768
2822,37.728192,-122.397966
3083,37.808322,-122.293001
3264,37.83593,-122.261098
3850,37.754021,-122.458049
4952,37.439721,-122.156722
5021,37.762976,-122.434161
5193,37.793339,-122.415912
5739,37.815032,-122.262122
5841,37.752496,-122.422588
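
Before averaging, it may be worth checking how far apart the duplicated coordinates actually are. The sketch below uses the rows for ids 958 and 2822 shown in the output above; the names `toy`, `directory`, and `spread` are illustrative only.

```python
import pandas as pd

# Rows copied from the merged output above: id 958 appears with three
# slightly different coordinate pairs, id 2822 appears once.
toy = pd.DataFrame({
    "id": [958, 958, 958, 2822],
    "latitude": [37.769266, 37.76931, 37.76931, 37.728192],
    "longitude": [-122.433588, -122.43386, -122.433856, -122.397966],
})

grouped = toy.groupby("id")[["latitude", "longitude"]]

# One row per id, using the mean coordinate as in the notebook
directory = grouped.mean()

# Range (max - min) of the coordinates per id, in degrees
spread = grouped.max() - grouped.min()

print(directory)
print(spread)
```

If the spread stays within a few ten-thousandths of a degree, as it does for id 958 here, the mean is a reasonable single location for the listing. A large spread would suggest inspecting that id before averaging.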


------------

### *Calendar* Datasets

The *calendar* datasets contain the following columns:

1. *listing_id*: The unique identifier of the Airbnb listing
2. *date*: The date of interest
3. *available*: Availability of the listing on that date, where *t* = available and *f* = not available
4. *price*: Price of the listing


Our goal with this dataset is to investigate the number of bookings in the Bay Area over time.

First, it is important to note that the dates listed in this dataset refer to upcoming bookings and not past bookings. For example, the dataset that corresponds to the file 'Oakland_2020_Oct25_calendar.csv' contains the availability of listings from October 26th, 2020 onward. Given the multiple files and overlapping dates, we will compile the availability of listings by date according to the most recent dataset that covers each date.

For example, using the dataset Oakland_2020_May18_calendar.csv, we will extract the availability of listings in Oakland from May 19th, 2020 until the date of the next dataset (i.e. Oakland_2020_Oct25_calendar.csv), which is October 25th, 2020.
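
The stitching rule described above can be sketched on toy data. This is an illustration under stated assumptions, not the notebook's method: `make_snapshot` and `stitch_latest` are hypothetical helpers, the data is made up, and the cutoff excludes the first date of the following snapshot so that no date is kept twice.

```python
import pandas as pd


def make_snapshot(start, days, flags):
    """Build a toy long-format calendar snapshot for a single listing."""
    dates = pd.date_range(start, periods=days, freq="D")
    return pd.DataFrame({"listing_id": 1, "Date": dates, "available": flags})


def stitch_latest(snapshots):
    """Keep each snapshot's rows only up to the day before the next one starts."""
    ordered = sorted(snapshots, key=lambda df: df["Date"].min())
    pieces = []
    for current, following in zip(ordered, ordered[1:] + [None]):
        if following is not None:
            cutoff = following["Date"].min()
            current = current[current["Date"] < cutoff]
        pieces.append(current)
    return pd.concat(pieces, ignore_index=True)


# Two overlapping toy snapshots: the later one overrides the earlier one
snapshots = [
    make_snapshot("2020-05-19", 6, ["t", "t", "f", "f", "t", "t"]),
    make_snapshot("2020-05-22", 4, ["f", "f", "f", "t"]),
]

stitched = stitch_latest(snapshots)
print(stitched)
```

The first three days come from the earlier snapshot and the remaining four from the later one, so every date appears exactly once in the result.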

In [8]:
# Parsing the date strings into a datetime column
for i in range(len(Bay_Area_calendar_dfs)):
    Bay_Area_calendar_dfs[i]['Date'] = pd.to_datetime(Bay_Area_calendar_dfs[i]['date'])

In [9]:
# Removing unused columns
for i in range(len(Bay_Area_calendar_dfs)):
    Bay_Area_calendar_dfs[i] = Bay_Area_calendar_dfs[i].drop(columns = ['price', 'date'])

In [10]:
Bay_Area_calendar_dfs[0].head()

,listing_id,available,Date
0,6093412,f,2015-07-08
1,6093412,f,2015-07-09
2,6093412,f,2015-07-10
3,6093412,f,2015-07-11
4,6093412,f,2015-07-12


In [11]:
# Pivoting dataframes
for i in range(len(Bay_Area_calendar_dfs)):
    Bay_Area_calendar_dfs[i] = Bay_Area_calendar_dfs[i].pivot(index='listing_id', columns='Date', values='available')

# Needs revision starting at df[58]

ValueError: Index contains duplicate entries, cannot reshape
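
`DataFrame.pivot` raises this error when the same (`listing_id`, `Date`) pair occurs more than once. One possible way to handle it, sketched on toy data, is to inspect the repeated pairs and then drop them before pivoting. Why the pairs repeat in the affected files is not known here, and keeping the first occurrence is an assumption; if the repeated rows disagree on `available`, they should be examined before anything is dropped.

```python
import pandas as pd

# Toy long-format calendar in which listing 1 has a repeated row for 2020-01-01
long_df = pd.DataFrame({
    "listing_id": [1, 1, 1, 2, 2],
    "Date": pd.to_datetime(
        ["2020-01-01", "2020-01-01", "2020-01-02", "2020-01-01", "2020-01-02"]
    ),
    "available": ["t", "t", "f", "f", "f"],
})

keys = ["listing_id", "Date"]

# Every row whose (listing_id, Date) pair occurs more than once
duplicated_rows = long_df[long_df.duplicated(subset=keys, keep=False)]
print(duplicated_rows)

# Assumption: keep the first occurrence of each pair, then pivot as before
deduped = long_df.drop_duplicates(subset=keys, keep="first")
wide = deduped.pivot(index="listing_id", columns="Date", values="available")
print(wide)
```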

In [12]:
for i in range(len(Bay_Area_calendar_dfs)):
    print([i], Bay_Area_calendar[i], 'Start:', Bay_Area_calendar_dfs[i].columns[0])
    print([i], Bay_Area_calendar[i], 'End:', Bay_Area_calendar_dfs[i].columns[-1])

[0] Oakland/Oakland_2015_Jun22_calendar.csv Start: 2015-06-29 00:00:00
[0] Oakland/Oakland_2015_Jun22_calendar.csv End: 2016-07-09 00:00:00
[1] Oakland/Oakland_2016_May04_calendar.csv Start: 2016-05-04 00:00:00
[1] Oakland/Oakland_2016_May04_calendar.csv End: 2017-05-03 00:00:00
[2] Oakland/Oakland_2018_Apr14_calendar.csv Start: 2018-04-14 00:00:00
[2] Oakland/Oakland_2018_Apr14_calendar.csv End: 2019-04-14 00:00:00
[3] Oakland/Oakland_2018_Aug16_calendar.csv Start: 2018-08-16 00:00:00
[3] Oakland/Oakland_2018_Aug16_calendar.csv End: 2019-08-15 00:00:00
[4] Oakland/Oakland_2018_Dec12_calendar.csv Start: 2018-12-12 00:00:00
[4] Oakland/Oakland_2018_Dec12_calendar.csv End: 2019-12-12 00:00:00
[5] Oakland/Oakland_2018_Jul16_calendar.csv Start: 2018-07-16 00:00:00
[5] Oakland/Oakland_2018_Jul16_calendar.csv End: 2019-07-15 00:00:00
[6] Oakland/Oakland_2018_May17_calendar.csv Start: 2018-05-17 00:00:00
[6] Oakland/Oakland_2018_May17_calendar.csv End: 2019-05-18 00:00:00
[7] Oakland/Oakland_

In [13]:
# The downloaded datasets vary by location and date, so we first concatenate by location and then create the final dataset
# Creating final datasets by city
# Note: .loc label slices include the end label, so a date shared by two consecutive snapshots may be included twice
Bay_Area_bookings_Oakland = pd.concat([
    Bay_Area_calendar_dfs[0].loc[:, :'2016-05-04 00:00:00'],
    Bay_Area_calendar_dfs[1].loc[:, :'2017-05-03 00:00:00'],
    Bay_Area_calendar_dfs[2].loc[:, :'2018-05-17 00:00:00'],
    Bay_Area_calendar_dfs[6].loc[:, :'2018-07-16 00:00:00'],
    Bay_Area_calendar_dfs[5].loc[:, :'2018-08-16 00:00:00'],
    Bay_Area_calendar_dfs[3].loc[:, :'2018-09-13 00:00:00'],
    Bay_Area_calendar_dfs[9].loc[:, :'2018-10-11 00:00:00'],
    Bay_Area_calendar_dfs[8].loc[:, :'2018-11-15 00:00:00'],
    Bay_Area_calendar_dfs[7].loc[:, :'2018-12-12 00:00:00'],
    Bay_Area_calendar_dfs[4].loc[:, :'2019-01-17 00:00:00'],
    Bay_Area_calendar_dfs[14].loc[:, :'2019-02-09 00:00:00'],
    Bay_Area_calendar_dfs[13].loc[:, :'2019-03-11 00:00:00'],
    Bay_Area_calendar_dfs[17].loc[:, :'2019-04-14 00:00:00'],
    Bay_Area_calendar_dfs[10].loc[:, :'2019-05-18 00:00:00'],
    Bay_Area_calendar_dfs[18].loc[:, :'2019-06-13 00:00:00'],
    Bay_Area_calendar_dfs[16].loc[:, :'2019-07-13 00:00:00'],
    Bay_Area_calendar_dfs[15].loc[:, :'2019-08-14 00:00:00'],
    Bay_Area_calendar_dfs[11].loc[:, :'2019-09-20 00:00:00'],
    Bay_Area_calendar_dfs[21].loc[:, :'2019-10-18 00:00:00'],
    Bay_Area_calendar_dfs[20].loc[:, :'2019-11-20 00:00:00'],
    Bay_Area_calendar_dfs[19].loc[:, :'2019-12-15 00:00:00'],
    Bay_Area_calendar_dfs[12].loc[:, :'2020-01-14 00:00:00'],
    Bay_Area_calendar_dfs[24].loc[:, :'2020-02-22 00:00:00'],
    Bay_Area_calendar_dfs[23].loc[:, :'2020-03-17 00:00:00'],
    Bay_Area_calendar_dfs[26].loc[:, :'2020-04-21 00:00:00'],
    Bay_Area_calendar_dfs[22].loc[:, :'2020-05-18 00:00:00'],
    Bay_Area_calendar_dfs[27].loc[:, :'2020-06-17 00:00:00'],
    Bay_Area_calendar_dfs[25].loc[:, :'2020-10-25 00:00:00'],
], axis=1, sort=False)

#Not yet run
#Bay_Area_bookings_San_Francisco = pd.concat([], axis=1, sort=False)
#Bay_Area_bookings_Santa_Clara = pd.concat([], axis=1, sort=False)
#Bay_Area_bookings_San_Mateo = pd.concat([], axis=1, sort=False)

In [14]:
Bay_Area_bookings_Oakland.sort_values(['listing_id']).head()

listing_id,2015-06-29 00:00:00,2015-06-30 00:00:00,2015-07-01 00:00:00,2015-07-02 00:00:00,2015-07-03 00:00:00,2015-07-04 00:00:00,2015-07-05 00:00:00,2015-07-06 00:00:00,2015-07-07 00:00:00,2015-07-08 00:00:00,...,2020-10-16 00:00:00,2020-10-17 00:00:00,2020-10-18 00:00:00,2020-10-19 00:00:00,2020-10-20 00:00:00,2020-10-21 00:00:00,2020-10-22 00:00:00,2020-10-23 00:00:00,2020-10-24 00:00:00,2020-10-25 00:00:00
3083,f,f,f,f,f,f,f,f,f,f,...,t,t,t,t,t,t,t,t,t,t
3264,,,,,,,,,,,...,,,,,,,,,,
5739,f,f,f,t,t,t,f,f,f,f,...,f,f,f,f,f,f,f,f,f,f
6201,,,,,,,,,,,...,,,,,,,,,,
8478,,,,,,,,,,,...,,,,,,,,,,


In [15]:
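# Counting available (t) and unavailable (f) listings per date
# pd.Series.value_counts is used because the top-level pd.value_counts is deprecated in recent pandas versions
Bay_Area_bookings_Oakland.apply(pd.Series.value_counts)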

Date,2015-06-29 00:00:00,2015-06-30 00:00:00,2015-07-01 00:00:00,2015-07-02 00:00:00,2015-07-03 00:00:00,2015-07-04 00:00:00,2015-07-05 00:00:00,2015-07-06 00:00:00,2015-07-07 00:00:00,2015-07-08 00:00:00,...,2020-10-16 00:00:00,2020-10-17 00:00:00,2020-10-18 00:00:00,2020-10-19 00:00:00,2020-10-20 00:00:00,2020-10-21 00:00:00,2020-10-22 00:00:00,2020-10-23 00:00:00,2020-10-24 00:00:00,2020-10-25 00:00:00
f,24,77,132,198,287,375,445,521,607,665,...,1960,1960,1956,1962,1964,1962,1967,1963,1957,1956
t,4,14,44,57,52,75,123,153,166,169,...,1244,1244,1248,1242,1240,1242,1237,1241,1247,1248
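
Because the number of listings with data changes from date to date, the raw counts are easier to compare as a share of the listings observed on each date. The sketch below uses a made-up wide table in the same layout as `Bay_Area_bookings_Oakland`. Treating *f* as a proxy for a booking is an assumption, since a host can also block a date without a booking.

```python
import pandas as pd

# Toy wide table: one row per listing, one column per date, NaN where the
# listing has no data for that date
wide = pd.DataFrame(
    {
        pd.Timestamp("2020-10-24"): ["t", None, "f", "f"],
        pd.Timestamp("2020-10-25"): ["t", None, "f", "t"],
    },
    index=pd.Index([3083, 3264, 5739, 6201], name="listing_id"),
)

# Listings marked unavailable on each date
unavailable = (wide == "f").sum()

# Listings with any value (t or f) on each date
observed = wide.notna().sum()

# Share of observed listings that are unavailable on each date
share_unavailable = unavailable / observed
print(share_unavailable)
```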
