This notebook scrapes and cleans Airbnb rental data. The data was acquired from this [source](https://tomslee.net/airbnb-data-collection-get-the-data).

## 1. Libraries & Functions

In [1]:
import pandas as pd
import numpy as np

## 2. Data

#### Import original AirBnB Boston data.

In [4]:
df_boston = pd.read_csv('../data/boston_rentals.csv')
df_boston.head(2)

| | id | name | description | neighbourhood_cleansed | transit | host_id | host_name | host_is_superhost | city | state | ... | property_type | room_type | bathrooms | bedrooms | beds | price | reviews_per_month | review_scores_rating | accommodates | cancellation_policy |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3075044 | Charming room in pet friendly apt | Charming and quiet room in a second floor 1910... | Roslindale | Plenty of safe street parking. Bus stops a few... | 2572247 | Andrea | f | Boston | MA | ... | Apartment | Private room | 1.0 | 1.0 | 1.0 | 65.0 | 1.3 | 94.0 | 2 | moderate |
| 1 | 6976 | Mexican Folk Art Haven in Boston | Come stay with a friendly, middle-aged guy in ... | Roslindale | PUBLIC TRANSPORTATION: From the house, quick p... | 16701 | Phil | t | Boston | MA | ... | Apartment | Private room | 1.0 | 1.0 | 1.0 | 65.0 | 0.47 | 98.0 | 2 | moderate |


In [5]:
cols = df_boston.columns.to_list()
print(f"Boston data:\n {df_boston.shape[0]} listings\n {df_boston.shape[1]} features\n Columns: {cols}")

Boston data:
 2506 listings
 24 features
 Columns: ['id', 'name', 'description', 'neighbourhood_cleansed', 'transit', 'host_id', 'host_name', 'host_is_superhost', 'city', 'state', 'zipcode', 'country', 'latitude', 'longitude', 'property_type', 'room_type', 'bathrooms', 'bedrooms', 'beds', 'price', 'reviews_per_month', 'review_scores_rating', 'accommodates', 'cancellation_policy']


#### Import/Scrape data for another city

In [6]:
df_data = pd.read_csv('../data/airbnb_calgary.csv')
df_data.head(2)

| | room_id | survey_id | host_id | room_type | country | city | borough | neighborhood | reviews | overall_satisfaction | accommodates | bedrooms | bathrooms | price | minstay | last_modified | latitude | longitude | location |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 10795607 | 1558 | 32275597 | Shared room | | Calgary | CENTRE | POINT MCKAY | 0 | 0.0 | 2 | 1.0 | | 40.0 | | 2017-08-10 14:18:57.809195 | 51.059864 | -114.143254 | 0101000020E61000005B41D3122B895CC0971AA19FA987... |
| 1 | 8596438 | 1558 | 45239480 | Shared room | | Calgary | CENTRE | DOWNTOWN EAST VILLAGE | 0 | 0.0 | 1 | 1.0 | | 405.0 | | 2017-08-10 14:18:56.338034 | 51.050347 | -114.056722 | 0101000020E6100000ADBD4F55A1835CC0D2393FC57186... |


In [7]:
cols_data = df_data.columns.to_list()
print(f"New data:\n {df_data.shape[0]} listings\n {df_data.shape[1]} features\n Columns: {cols_data}")

New data:
 2983 listings
 19 features
 Columns: ['room_id', 'survey_id', 'host_id', 'room_type', 'country', 'city', 'borough', 'neighborhood', 'reviews', 'overall_satisfaction', 'accommodates', 'bedrooms', 'bathrooms', 'price', 'minstay', 'last_modified', 'latitude', 'longitude', 'location']
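The two datasets use different schemas, so it can help to see which column names they already share. The sketch below is not part of the original notebook: it copies the two column lists from the printed outputs above into hypothetical variables (`boston_cols`, `calgary_cols`) and compares them with set operations. It assumes that exact name matches are what matter; columns that mean the same thing under different names (for example `room_id` and `id`) will not show up as shared.

```python
# Column lists copied from the printed outputs of the two DataFrames.
# The variable names here are illustrative and do not appear in the notebook.
boston_cols = [
    "id", "name", "description", "neighbourhood_cleansed", "transit",
    "host_id", "host_name", "host_is_superhost", "city", "state",
    "zipcode", "country", "latitude", "longitude", "property_type",
    "room_type", "bathrooms", "bedrooms", "beds", "price",
    "reviews_per_month", "review_scores_rating", "accommodates",
    "cancellation_policy",
]
calgary_cols = [
    "room_id", "survey_id", "host_id", "room_type", "country", "city",
    "borough", "neighborhood", "reviews", "overall_satisfaction",
    "accommodates", "bedrooms", "bathrooms", "price", "minstay",
    "last_modified", "latitude", "longitude", "location",
]

# Columns present under the same name in both datasets.
shared = sorted(set(boston_cols) & set(calgary_cols))
# Columns that exist only in the second dataset.
only_calgary = sorted(set(calgary_cols) - set(boston_cols))

print("Shared columns:", shared)
print("Only in the new data:", only_calgary)
```

With the real DataFrames, the same comparison could be run on `df_boston.columns` and `df_data.columns` directly.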


In [8]:
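# Count listings per neighborhood and drop neighborhoods with fewer than 30 listings.
neighborhood_count = df_data["neighborhood"].value_counts()
rare_neighborhoods = neighborhood_count[neighborhood_count < 30].index
df_data = df_data[~df_data["neighborhood"].isin(rare_neighborhoods)]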
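The cell above keeps only neighborhoods with at least 30 listings. As a minimal sketch of how that filter behaves, the example below applies the same `value_counts` / `isin` pattern to a small toy DataFrame. The toy values and the `MIN_LISTINGS` threshold of 2 are illustrative assumptions, not data from the notebook. An equivalent `groupby(...).transform("size")` formulation is shown for comparison.

```python
import pandas as pd

# Toy data standing in for the listings table (values are made up).
toy = pd.DataFrame({
    "neighborhood": ["A", "A", "A", "B", "C", "C"],
    "price": [50, 60, 70, 80, 90, 100],
})
MIN_LISTINGS = 2  # illustrative threshold; the notebook uses 30

# Same pattern as the notebook: find rare neighborhoods, then exclude them.
counts = toy["neighborhood"].value_counts()
rare = counts[counts < MIN_LISTINGS].index
filtered = toy[~toy["neighborhood"].isin(rare)].copy()

# Alternative: attach each row's group size and filter on it directly.
sizes = toy.groupby("neighborhood")["neighborhood"].transform("size")
via_transform = toy[sizes >= MIN_LISTINGS].copy()

print(filtered)
print(filtered.equals(via_transform))
```

Both versions drop the single listing in neighborhood `B`. The `.copy()` call makes the result an independent DataFrame, which avoids chained-assignment warnings in later cells.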

In [9]:
print(f"New data:\n {df_data.shape[0]} listings\n {df_data.shape[1]} features\n Columns: {cols_data}")

New data:
 1242 listings
 19 features
 Columns: ['room_id', 'survey_id', 'host_id', 'room_type', 'country', 'city', 'borough', 'neighborhood', 'reviews', 'overall_satisfaction', 'accommodates', 'bedrooms', 'bathrooms', 'price', 'minstay', 'last_modified', 'latitude', 'longitude', 'location']


In [14]:
# Assign the renamed result instead of using inplace=True, which raises a
# SettingWithCopyWarning when df_data is a filtered slice of another DataFrame.
df_data = df_data.rename(columns={"room_id": "id"})
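Modifying a DataFrame in place after it has been filtered with a boolean mask is a common trigger for pandas' `SettingWithCopyWarning`. A minimal sketch of two ways to avoid it follows, using toy data with made-up values. The names `toy`, `subset` and `renamed` are illustrative only. Whether the warning appears at all depends on the pandas version.

```python
import pandas as pd

# Toy data; the values are made up for illustration.
toy = pd.DataFrame({
    "room_id": [1, 2, 3],
    "neighborhood": ["A", "A", "B"],
})

# Option 1: take an explicit copy when filtering, so later in-place
# changes clearly apply to an independent DataFrame.
subset = toy[toy["neighborhood"] == "A"].copy()
subset.rename(columns={"room_id": "id"}, inplace=True)

# Option 2: skip inplace=True and bind the returned DataFrame to a name.
renamed = toy[toy["neighborhood"] == "A"].rename(columns={"room_id": "id"})

print(subset.columns.tolist())
print(renamed.columns.tolist())
print(toy.columns.tolist())  # the original frame keeps its column names
```

Either option leaves the original DataFrame untouched, so the rename cannot silently apply to the wrong object.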


In [10]:
df_data.to_csv("../data/re_analyst.csv", index=False)

## END