## Wrangling Airbnb Rental Listings in Seattle 

### Goals of the Task



The data set contains three tables, each scraped from the Airbnb website on different dates: <br>
*listings - each row is a unique rental property* <br>
*reviews - each row is a review left by a guest after checking out of a property* <br>
*calendar - each row is a property and a date, showing whether the property was available or unavailable on that date* <br>

We want to use this data to understand how many Airbnb properties are located close to the less popular cycle hire stations, and whether any Airbnb guests commented on transport in their reviews of those properties. This insight could be used to make decisions about the future of those cycle stations.

- the three data tables are too large to manage in Excel and are slow to visualise in Power BI
- we want to focus on specific locations only 
- not all the columns will be useful to this analysis
- potentially useful mentions of transport are embedded in the review text 
- there are different dates across the 3 data sets (date scraped, date of review, dates available)

#### Step 1: use pandas to read the data from the 3 csv files to create 3 data frames (listings, reviews, calendar)
- import pandas as pd 
- use pandas read_csv 
- ensure you are pointing at the correct file path for the data sources (you may have to navigate in your notebook!) 


In [1]:
import pandas as pd
calendar = pd.read_csv('data/calendar.csv')
listings = pd.read_csv('data/listings.csv')
reviews = pd.read_csv('data/reviews.csv')


#### Step 2: preview each dataframe using pandas functions like .info(), .head(), .tail() and .describe()
- look out for nulls and missing data 
- any problematic data types 
- consider if you need to do anything about missing data (replace/ impute /ignore / drop)
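The checks above can be sketched on a tiny made-up frame that mimics the shape of `calendar` (the sample values and the new column names `price_num` and `is_available` are illustrative assumptions, not part of the real file). It counts nulls per column, then converts the `price` strings to floats and the `t`/`f` flags to booleans:

```python
import pandas as pd

# Toy stand-in for the calendar table (values are illustrative only)
sample = pd.DataFrame({
    "listing_id": [241032, 241032, 241032],
    "date": ["2016-01-04", "2016-01-05", "2016-01-06"],
    "available": ["t", "t", "f"],
    "price": ["$85.00", "$1,250.00", None],
})

# Count missing values per column
null_counts = sample.isna().sum()

# Strip the currency symbol and thousands separator, then convert to float
sample["price_num"] = pd.to_numeric(
    sample["price"].str.replace(r"[$,]", "", regex=True), errors="coerce"
)

# Map the 't'/'f' flags to real booleans
sample["is_available"] = sample["available"].map({"t": True, "f": False})

print(null_counts)
print(sample.dtypes)
```

In the real `calendar` preview the missing prices appear on rows where `available` is `f`, so it is worth checking whether that holds for every null before deciding to impute or drop anything.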

In [48]:
calendar.info()
calendar.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1393570 entries, 0 to 1393569
Data columns (total 4 columns):
 #   Column      Non-Null Count    Dtype 
---  ------      --------------    ----- 
 0   listing_id  1393570 non-null  int64 
 1   date        1393570 non-null  object
 2   available   1393570 non-null  object
 3   price       934542 non-null   object
dtypes: int64(1), object(3)
memory usage: 42.5+ MB


Unnamed: 0,listing_id,date,available,price
0,241032,2016-01-04,t,$85.00
1,241032,2016-01-05,t,$85.00
2,241032,2016-01-06,f,
3,241032,2016-01-07,f,
4,241032,2016-01-08,f,


In [7]:
calendar.tail()

Unnamed: 0,listing_id,date,available,price
1393565,10208623,2016-12-29,f,
1393566,10208623,2016-12-30,f,
1393567,10208623,2016-12-31,f,
1393568,10208623,2017-01-01,f,
1393569,10208623,2017-01-02,f,


In [9]:
calendar.describe()

Unnamed: 0,listing_id
count,1393570.0
mean,5550111.0
std,2962274.0
min,3335.0
25%,3258213.0
50%,6118244.0
75%,8035212.0
max,10340160.0


In [10]:
listings.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3818 entries, 0 to 3817
Data columns (total 65 columns):
 #   Column                            Non-Null Count  Dtype  
---  ------                            --------------  -----  
 0   id                                3818 non-null   int64  
 1   date_scraped                      3818 non-null   object 
 2   name                              3818 non-null   object 
 3   summary                           3641 non-null   object 
 4   space                             3249 non-null   object 
 5   description                       3818 non-null   object 
 6   transit                           2884 non-null   object 
 7   host_id                           3818 non-null   int64  
 8   host_response_time                3295 non-null   object 
 9   host_response_rate                3295 non-null   object 
 10  host_acceptance_rate              3045 non-null   object 
 11  host_is_superhost                 3816 non-null   object 
 12  host_l

In [18]:
listings.id.min()

3335

In [12]:
listings.describe()

Unnamed: 0,id,host_id,host_listings_count,host_total_listings_count,latitude,longitude,accommodates,bathrooms,bedrooms,beds,...,availability_365,number_of_reviews,review_scores_rating,review_scores_accuracy,review_scores_cleanliness,review_scores_checkin,review_scores_communication,review_scores_location,review_scores_value,reviews_per_month
count,3818.0,3818.0,3816.0,3816.0,3818.0,3818.0,3818.0,3802.0,3812.0,3817.0,...,3818.0,3818.0,3171.0,3160.0,3165.0,3160.0,3167.0,3163.0,3162.0,3191.0
mean,5550111.0,15785560.0,7.157757,7.157757,47.628961,-122.333103,3.349398,1.259469,1.307712,1.735394,...,244.772656,22.223415,94.539262,9.636392,9.556398,9.786709,9.809599,9.608916,9.452245,2.078919
std,2962660.0,14583820.0,28.628149,28.628149,0.043052,0.031745,1.977599,0.590369,0.883395,1.13948,...,126.772526,37.730892,6.606083,0.698031,0.797274,0.595499,0.568211,0.629053,0.750259,1.822348
min,3335.0,4193.0,1.0,1.0,47.505088,-122.417219,1.0,0.0,0.0,1.0,...,0.0,0.0,20.0,2.0,3.0,2.0,2.0,4.0,2.0,0.02
25%,3258256.0,3275204.0,1.0,1.0,47.609418,-122.354321,2.0,1.0,1.0,1.0,...,124.0,2.0,93.0,9.0,9.0,10.0,10.0,9.0,9.0,0.695
50%,6118244.0,10558140.0,1.0,1.0,47.623601,-122.328874,3.0,1.0,1.0,1.0,...,308.0,9.0,96.0,10.0,10.0,10.0,10.0,10.0,10.0,1.54
75%,8035127.0,25903090.0,3.0,3.0,47.662694,-122.3108,4.0,1.0,2.0,2.0,...,360.0,26.0,99.0,10.0,10.0,10.0,10.0,10.0,10.0,3.0
max,10340160.0,53208610.0,502.0,502.0,47.733358,-122.240607,16.0,8.0,7.0,15.0,...,365.0,474.0,100.0,10.0,10.0,10.0,10.0,10.0,10.0,12.15


In [13]:
reviews.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 84849 entries, 0 to 84848
Data columns (total 5 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   listing_id   84849 non-null  int64 
 1   id           84849 non-null  int64 
 2   date         84849 non-null  object
 3   reviewer_id  84849 non-null  int64 
 4   comments     84831 non-null  object
dtypes: int64(3), object(2)
memory usage: 3.2+ MB


In [14]:
reviews.describe()

Unnamed: 0,listing_id,id,reviewer_id
count,84849.0,84849.0,84849.0
mean,3005067.0,30587650.0,17013010.0
std,2472877.0,16366130.0,13537040.0
min,4291.0,3721.0,15.0
25%,794633.0,17251270.0,5053141.0
50%,2488228.0,32288090.0,14134760.0
75%,4694479.0,44576480.0,27624020.0
max,10248140.0,58736510.0,52812740.0


In [15]:
reviews.head()

Unnamed: 0,listing_id,id,date,reviewer_id,comments
0,7202016,38917982,19/07/2015,28943674,Cute and cozy place. Perfect location to every...
1,7202016,39087409,20/07/2015,32440555,Kelly has a great room in a very central locat...
2,7202016,39820030,26/07/2015,37722850,"Very spacious apartment, and in a great neighb..."
3,7202016,40813543,02/08/2015,33671805,Close to Seattle Center and all it has to offe...
4,7202016,41986501,10/08/2015,34959538,Kelly was a great host and very accommodating ...


#### Step 3: detect and manage any duplicate rows using pandas .duplicated()
- consider what a true duplicate is in each data frame
- decide whether to drop the duplicates entirely, or to review them and try to understand why they exist
- if you drop any rows, remember to reset your index on the dataframe afterwards using .reset_index(inplace=True,drop=True)
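As a rough illustration of the difference between an exact duplicate and a duplicate on key columns, here is a sketch on a toy calendar-like frame. Treating `listing_id` plus `date` as the key is an assumption about what a "true duplicate" means for that table, and the rows are made up:

```python
import pandas as pd

# Toy calendar rows (illustrative): row 2 repeats row 0 exactly,
# row 3 repeats the listing/date key of row 1 with a different price
cal = pd.DataFrame({
    "listing_id": [1, 1, 1, 1],
    "date": ["2016-01-04", "2016-01-05", "2016-01-04", "2016-01-05"],
    "price": ["$85.00", "$85.00", "$85.00", "$90.00"],
})

# Exact duplicates: every column matches an earlier row
exact_dupes = cal.duplicated()

# Key duplicates: same listing and date, regardless of the other columns;
# keep=False marks every member of a duplicated group
key_dupes = cal.duplicated(subset=["listing_id", "date"], keep=False)

# Inspect the conflicting rows before deciding what to drop
print(cal[key_dupes])

# Drop only the exact duplicates and rebuild the index
cal_clean = cal.drop_duplicates().reset_index(drop=True)
```

Rows that share a key but disagree on other columns are usually worth reviewing by hand rather than dropping automatically.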

In [22]:
# Detect and manage duplicates in 'listings'
listings_duplicates = listings.duplicated()
listings = listings[~listings_duplicates].reset_index(drop=True)

# Detect and manage duplicates in 'reviews'
reviews_duplicates = reviews.duplicated()
reviews = reviews[~reviews_duplicates].reset_index(drop=True)

# Detect and manage duplicates in 'calendar'
calendar_duplicates = calendar.duplicated()
calendar = calendar[~calendar_duplicates].reset_index(drop=True)


In [23]:
listings.info()
reviews.info()
calendar.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3818 entries, 0 to 3817
Data columns (total 65 columns):
 #   Column                            Non-Null Count  Dtype  
---  ------                            --------------  -----  
 0   id                                3818 non-null   int64  
 1   date_scraped                      3818 non-null   object 
 2   name                              3818 non-null   object 
 3   summary                           3641 non-null   object 
 4   space                             3249 non-null   object 
 5   description                       3818 non-null   object 
 6   transit                           2884 non-null   object 
 7   host_id                           3818 non-null   int64  
 8   host_response_time                3295 non-null   object 
 9   host_response_rate                3295 non-null   object 
 10  host_acceptance_rate              3045 non-null   object 
 11  host_is_superhost                 3816 non-null   object 
 12  host_l

#### Step 4: filter the listings data frame and reduce the number of columns

We want to identify Airbnb properties listed near the locations of the less popular cycle stations found in the cycle hire data, using pandas .query().

This is the location information we know about the stations:

- station WF-03 (160 trips from here) zipcode 98121 lat/long 47.6114 -122.349 Alaskan Way Belltown
- station SLU-22 (761 trips from here) zipcode 98109 lat/long 47.6209 -122.347 Thomas Street South Lake Union  <br>

You will have to identify the columns in the listings dataframe that contain matching location information.

Next, reduce the number of columns in this data frame to those likely to be useful for analysing the sample of properties. You should keep the Id, the Host Id, some location information and some columns about the space, e.g. property type and number of bedrooms.

In [30]:
listings_columns_to_keep = [
    'id', 'name', 'summary', 'space', 'description', 'transit',
    'host_id', 'neighbourhood', 'neighbourhood_group', 'zipcode',
    'latitude', 'longitude', 'property_type', 'beds'
]

# Create a new DataFrame with the specified columns
filtered_listings = listings[listings_columns_to_keep].copy()

# Display the new DataFrame
filtered_listings.head(3)


Unnamed: 0,id,name,summary,space,description,transit,host_id,neighbourhood,neighbourhood_group,zipcode,latitude,longitude,property_type,beds
0,241032,Stylish Queen Anne Apartment,,Make your self at home in this charming one-be...,Make your self at home in this charming one-be...,,956883,West Queen Anne,Queen Anne,98119,47.636289,-122.371025,Apartment,1.0
1,953595,Bright & Airy Queen Anne Apartment,Chemically sensitive? We've removed the irrita...,"Beautiful, hypoallergenic apartment in an extr...",Chemically sensitive? We've removed the irrita...,"Convenient bus stops are just down the block, ...",5177328,West Queen Anne,Queen Anne,98119,47.639123,-122.365667,Apartment,1.0
2,3308979,New Modern House-Amazing water view,New modern house built in 2013. Spectacular s...,"Our house is modern, light and fresh with a wa...",New modern house built in 2013. Spectacular s...,A bus stop is just 2 blocks away. Easy bus a...,16708587,West Queen Anne,Queen Anne,98119,47.629724,-122.369483,House,7.0


In [100]:

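# Use .query() to filter listings based on the provided station locations.
# Exact equality on latitude/longitude floats will rarely (if ever) match,
# so in practice this selects by zipcode; proximity is refined with np.isclose below.
stations_query = filtered_listings.query(
    "(zipcode == '98121' or latitude == 47.6114 or longitude == -122.349) or " +
    "(zipcode == '98109' or latitude == 47.6209 or longitude == -122.347)"
)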

# Display the filtered listings DataFrame
# 398
stations_query.head(5)


Unnamed: 0,id,name,summary,space,description,transit,host_id,neighbourhood,neighbourhood_group,zipcode,latitude,longitude,property_type,beds
8,4948745,Urban Charm || Downtown || Views,"Nestled in the heart of the city, this space i...","Located in the heart of the city, this space i...","Nestled in the heart of the city, this space i...",Bus stop to downtown directly across the stree...,2166277,West Queen Anne,Queen Anne,98109,47.63241,-122.357216,Apartment,1.0
39,6389657,Cute little Seattle Studio!,I love my little studio! My space might be sma...,The patio has a table and chairs and potted pl...,I love my little studio! My space might be sma...,"The area is very walkable, though there is a b...",2052160,West Queen Anne,Queen Anne,98109,47.634616,-122.358023,Apartment,1.0
44,6575380,One bedroom with Lounge,"This large, remodeled daylight basement space ...",With a comfortable space for lounging and havi...,"This large, remodeled daylight basement space ...",Only five minutes from the bus stop to take yo...,34392395,West Queen Anne,Queen Anne,98109,47.6343,-122.35706,House,1.0
199,3768745,Downtown Queen Anne large suite,Gorgeous professionally decorated suite in 192...,"As a fellow traveler to over 75 countries, I u...",Gorgeous professionally decorated suite in 192...,Half a dozen bus lines stop within a block of ...,19311504,East Queen Anne,Queen Anne,98109,47.631843,-122.345123,House,1.0
200,3975434,One private bedroom in a large Apt,This bedroom is yours and it comes with the sh...,This is a great apartment in a great neighborh...,This bedroom is yours and it comes with the sh...,There is the monorail although I would only ad...,6307839,East Queen Anne,Queen Anne,98109,47.62812,-122.348752,Apartment,1.0


In [45]:
import numpy as np

# Tolerance for float comparisons
tolerance = 1e-3

# Check for approximate matches within the tolerance
filtered_dfWF = stations_query[np.isclose(stations_query['latitude'], 47.6114, atol=tolerance) & np.isclose(stations_query['longitude'], -122.349, atol=tolerance)]

# Display the filtered DataFrame
filtered_dfWF



Unnamed: 0,id,name,summary,space,description,transit,host_id,neighbourhood,neighbourhood_group,zipcode,latitude,longitude,property_type,beds
1212,5736082,Living room in 1 BHK for females,"Hi, I have living room available for rent in 1...","Hi, I have living room available for rent in m...","Hi, I have living room available for rent in 1...",Lot of buses near this place. westlake transit...,29745749,Belltown,Downtown,98121,47.61109,-122.348157,Apartment,1.0
1284,7807169,Amazing Views of Elliott Bay! AH2,"This 2-bed, 2-bath condo sleeps 6 and includes...",ArtHouse takes center stage in Belltown as one...,"This 2-bed, 2-bath condo sleeps 6 and includes...",Convenient public transportation. The location...,4962900,Belltown,Downtown,98121,47.612499,-122.348258,Apartment,3.0
1388,1857141,Charming studio in Belltown,"This comfortable studio sleeps two, has a full...",Live like the locals do! This is a charming st...,"This comfortable studio sleeps two, has a full...","Very convenient transit, train from airport st...",9691769,Belltown,Downtown,98121,47.611477,-122.348518,Apartment,2.0
1390,8517235,Heart of Seattle,My apartment is 1 block from the Pike Place Ma...,,My apartment is 1 block from the Pike Place Ma...,,44846373,Belltown,Downtown,98121,47.611711,-122.347532,Apartment,1.0
1432,7440415,1BR 2 Blocks frm Pike Market & Pool,Make yourself at home in a large one bedroom a...,I am offering my private apartment in a great ...,Make yourself at home in a large one bedroom a...,,30499792,Belltown,Downtown,98121,47.611165,-122.347545,Apartment,1.0


In [46]:
# Tolerance for float comparisons
tolerance = 1e-3

# Check for approximate matches within the tolerance
filtered_dfSLU = stations_query[np.isclose(stations_query['latitude'], 47.6209, atol=tolerance) & np.isclose(stations_query['longitude'], -122.347, atol=tolerance)]

# Display the filtered DataFrame
filtered_dfSLU

Unnamed: 0,id,name,summary,space,description,transit,host_id,neighbourhood,neighbourhood_group,zipcode,latitude,longitude,property_type,beds
1146,8829089,In the heart of it all,Great location with comfortable accommodation...,you will have occupy the master bedroom with a...,Great location with comfortable accommodation...,Seattle is a drivers nightmare generally speak...,25855544,South Lake Union,Cascade,98109,47.619917,-122.346108,Apartment,1.0
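The `atol=1e-3` tolerance used above is expressed in degrees, which at Seattle's latitude is roughly 111 m north-south and about 75 m east-west. If a radius in metres is easier to reason about, a haversine distance is one alternative. The sketch below is an assumption-laden illustration: the 300 m cut-off (`RADIUS_M`) and the helper `haversine_m` are hypothetical, and the three sample listings are copied from the outputs above.

```python
import numpy as np
import pandas as pd

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres

def haversine_m(lat, lon, lat0, lon0):
    """Great-circle distance in metres from (lat, lon) to a fixed point."""
    lat, lon, lat0, lon0 = map(np.radians, (lat, lon, lat0, lon0))
    a = (np.sin((lat - lat0) / 2) ** 2
         + np.cos(lat) * np.cos(lat0) * np.sin((lon - lon0) / 2) ** 2)
    return 2 * EARTH_RADIUS_M * np.arcsin(np.sqrt(a))

# Station coordinates from the task description
stations = {"WF-03": (47.6114, -122.349), "SLU-22": (47.6209, -122.347)}

# Three listings copied from the outputs above
sample = pd.DataFrame({
    "id": [5736082, 8829089, 241032],
    "latitude": [47.61109, 47.619917, 47.636289],
    "longitude": [-122.348157, -122.346108, -122.371025],
})

RADIUS_M = 300  # hypothetical cut-off, not part of the original task

# Add one distance column per station
for name, (lat0, lon0) in stations.items():
    sample[f"dist_{name}_m"] = haversine_m(
        sample["latitude"], sample["longitude"], lat0, lon0
    )

near_wf = sample[sample["dist_WF-03_m"] <= RADIUS_M]
near_slu = sample[sample["dist_SLU-22_m"] <= RADIUS_M]
print(near_wf["id"].tolist(), near_slu["id"].tolist())
```

The same function can be applied to the full `stations_query` frame; the radius is a judgment call about how far a guest might walk to a cycle station.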


#### Step 5: obtain a list of unique listing ids from the calendar table which show properties available for rent on any day in August 2016

- note that it is not possible to do this using excel due to the large size of the file 
- extract the unique values from the dataframe using unique() 
- convert the series to a list 
- use the list of ids to filter the listings table with a pandas query 

In [50]:
calendar['date'] = pd.to_datetime(calendar['date'].str.strip(), format='%Y-%m-%d')

In [89]:
august_listings_ids = calendar.loc[
    (calendar['date'].dt.year == 2016) & (calendar['date'].dt.month == 8) & (calendar['available'] == 't')]

# Unique ids of listings available on at least one day in August 2016, as a list
unique_listing = august_listings_ids['listing_id'].unique().tolist()
# len(unique_listing)

# Filter the listings dataframe to the ids available on any day in August 2016
filtered_list_august = filtered_listings.query('id in @unique_listing')
filtered_list_august.head(5)

Unnamed: 0,id,name,summary,space,description,transit,host_id,neighbourhood,neighbourhood_group,zipcode,latitude,longitude,property_type,beds
0,241032,Stylish Queen Anne Apartment,,Make your self at home in this charming one-be...,Make your self at home in this charming one-be...,,956883,West Queen Anne,Queen Anne,98119,47.636289,-122.371025,Apartment,1.0
1,953595,Bright & Airy Queen Anne Apartment,Chemically sensitive? We've removed the irrita...,"Beautiful, hypoallergenic apartment in an extr...",Chemically sensitive? We've removed the irrita...,"Convenient bus stops are just down the block, ...",5177328,West Queen Anne,Queen Anne,98119,47.639123,-122.365667,Apartment,1.0
2,3308979,New Modern House-Amazing water view,New modern house built in 2013. Spectacular s...,"Our house is modern, light and fresh with a wa...",New modern house built in 2013. Spectacular s...,A bus stop is just 2 blocks away. Easy bus a...,16708587,West Queen Anne,Queen Anne,98119,47.629724,-122.369483,House,7.0
3,7421966,Queen Anne Chateau,A charming apartment that sits atop Queen Anne...,,A charming apartment that sits atop Queen Anne...,,9851441,West Queen Anne,Queen Anne,98119,47.638473,-122.369279,Apartment,2.0
4,278830,Charming craftsman 3 bdm house,Cozy family craftman house in beautiful neighb...,Cozy family craftman house in beautiful neighb...,Cozy family craftman house in beautiful neighb...,The nearest public transit bus (D Line) is 2 b...,1452570,West Queen Anne,Queen Anne,98119,47.632918,-122.372471,House,3.0


#### Step 6: combine the data sources into a new data frame
- the Id in the listings table and the Listing Id in reviews are the relevant keys to use 
- use the pandas merge method
- the data frame should contain a subset of columns that you consider useful to analyse airbnb properties near the 2 least popular cycle stations, plus the review text column
- the data frame should only contain properties that were available for rent in August 2016
- the data frame should contain only the reviews those properties received in the period matching our cycle hire sample dates (2014-2016)
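The requirements above can be sketched end to end on toy data. Everything in the frames below is made up for illustration, and the names `near_stations`, `reviews_sample` and `august_ids` are hypothetical stand-ins for the real filtered listings, reviews and August id list. The review dates in this data set appear as day/month/year strings (for example `19/07/2015`), so the sketch parses them with an explicit format before filtering on the 2014-2016 window.

```python
import pandas as pd

# Toy stand-ins for the real tables (values are illustrative only)
near_stations = pd.DataFrame({
    "id": [101, 102, 103],
    "zipcode": ["98121", "98109", "98109"],
    "property_type": ["Apartment", "House", "Apartment"],
})
reviews_sample = pd.DataFrame({
    "listing_id": [101, 101, 102, 103],
    "date": ["19/05/2013", "29/09/2014", "26/08/2015", "14/12/2015"],
    "comments": ["Great stay", "Easy bus ride", "Lovely host", "Bike friendly"],
})
august_ids = [101, 102]  # listings available on any day in August 2016

# Review dates are day-first strings, so state the format explicitly
reviews_sample["date"] = pd.to_datetime(reviews_sample["date"], format="%d/%m/%Y")

# Keep listings available in August 2016, then attach their reviews
available = near_stations[near_stations["id"].isin(august_ids)]
merged = available.merge(
    reviews_sample, left_on="id", right_on="listing_id", how="inner"
).drop(columns="listing_id")

# Keep only reviews written in the 2014-2016 window
in_window = merged["date"].between("2014-01-01", "2016-12-31")
merged = merged[in_window].reset_index(drop=True)
print(merged)
```

An inner merge drops listings with no reviews at all; a left merge would keep them with empty review columns, which may matter when counting properties.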

In [101]:
# rename() without inplace returns a new DataFrame and avoids the SettingWithCopyWarning
stations_query = stations_query.rename(columns={'id': 'listing_id'})
merged_and_reviews_inner = reviews.merge(stations_query, on='listing_id', how="inner")


In [103]:
merged_and_reviews_inner

Unnamed: 0,listing_id,id,date,reviewer_id,comments,name,summary,space,description,transit,host_id,neighbourhood,neighbourhood_group,zipcode,latitude,longitude,property_type,beds
0,7833113,44472406,26/08/2015,34123551,Kelli was an amazing host and immediately open...,Prime Location to Explore Seattle!,Come explore Seattle! Private room/den with fu...,My home is in a vintage brick apartment buildi...,Come explore Seattle! Private room/den with fu...,You can catch the Link light rail from Sea-Tac...,5203265,South Lake Union,Cascade,98109,47.623861,-122.342263,Apartment,1.0
1,7833113,45156782,31/08/2015,13316300,First and foremost Kelli went out of her way t...,Prime Location to Explore Seattle!,Come explore Seattle! Private room/den with fu...,My home is in a vintage brick apartment buildi...,Come explore Seattle! Private room/den with fu...,You can catch the Link light rail from Sea-Tac...,5203265,South Lake Union,Cascade,98109,47.623861,-122.342263,Apartment,1.0
2,7833113,45704944,05/09/2015,7354543,Convenient location close to Music Museum. Eas...,Prime Location to Explore Seattle!,Come explore Seattle! Private room/den with fu...,My home is in a vintage brick apartment buildi...,Come explore Seattle! Private room/den with fu...,You can catch the Link light rail from Sea-Tac...,5203265,South Lake Union,Cascade,98109,47.623861,-122.342263,Apartment,1.0
3,7833113,46177947,08/09/2015,16840700,Kelli made my friend and I feel extremely welc...,Prime Location to Explore Seattle!,Come explore Seattle! Private room/den with fu...,My home is in a vintage brick apartment buildi...,Come explore Seattle! Private room/den with fu...,You can catch the Link light rail from Sea-Tac...,5203265,South Lake Union,Cascade,98109,47.623861,-122.342263,Apartment,1.0
4,7833113,46752160,13/09/2015,23038979,Kelli was so sweet and made us feel welcome ri...,Prime Location to Explore Seattle!,Come explore Seattle! Private room/den with fu...,My home is in a vintage brick apartment buildi...,Come explore Seattle! Private room/den with fu...,You can catch the Link light rail from Sea-Tac...,5203265,South Lake Union,Cascade,98109,47.623861,-122.342263,Apartment,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7530,6781308,56513469,14/12/2015,39312228,From start to finish Tim was always there when...,"Heart of Seattle, minutes to it all",Our beautiful newly renovated vacation rental ...,"Newly renovated one bedroom, one bath apartmen...",Our beautiful newly renovated vacation rental ...,Many of our guests bike or walk. If you choose...,33944952,East Queen Anne,Queen Anne,98109,47.642192,-122.346430,Apartment,2.0
7531,6781308,56817099,18/12/2015,50993517,Timothy checked to make sure we got to the apa...,"Heart of Seattle, minutes to it all",Our beautiful newly renovated vacation rental ...,"Newly renovated one bedroom, one bath apartmen...",Our beautiful newly renovated vacation rental ...,Many of our guests bike or walk. If you choose...,33944952,East Queen Anne,Queen Anne,98109,47.642192,-122.346430,Apartment,2.0
7532,830948,4665466,19/05/2013,6079208,I traveled cross-country from Florida to Seatt...,City & Lake Views. Central- shared,,My condo is located in the center of Seattle. ...,My condo is located in the center of Seattle. ...,,4133860,Westlake,Cascade,98109,47.633474,-122.343445,Apartment,1.0
7533,830948,20420393,29/09/2014,19157125,"Pam was a kind, welcoming host. She offered to...",City & Lake Views. Central- shared,,My condo is located in the center of Seattle. ...,My condo is located in the center of Seattle. ...,,4133860,Westlake,Cascade,98109,47.633474,-122.343445,Apartment,1.0


In [None]:
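# Keep only the station-area listings that were available in August 2016
# (the id column of stations_query was renamed to listing_id above)
filtered_list_august = stations_query.query('listing_id in @unique_listing')
filtered_list_august.head(5)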

#### Step 7: check for the mentions of transportation, transport, cycling or bikes in the review text 

- the review column (comments) contains free-text sentences
- you can use any of the methods you encountered in the week 8 text analysis topic
- for example, you can use the pandas str.contains function, or regex pattern matching
- the aim is to flag all the reviews that contain a reference to transport or transportation, cycling, bikes etc.
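A minimal sketch of the flagging step with `Series.str.contains`, on made-up review text. The keyword pattern is a hypothetical starting point: "bus" and "light rail" are added here as examples of the "etc." above and should be adjusted to whatever counts as a transport mention for the analysis.

```python
import pandas as pd

# Toy reviews (illustrative only); one entry is missing, as in the real data
df = pd.DataFrame({
    "comments": [
        "Close to the bus stop and easy public transport.",
        "We rented bikes and cycled along the waterfront.",
        "Lovely host, very clean room.",
        None,
        "Transportation was simple: light rail from the airport.",
    ]
})

# Hypothetical keyword list; \b keeps matches on whole words,
# and (?:...) groups avoid the pandas warning about capture groups
pattern = (
    r"\b(?:transport(?:ation)?|cycl(?:e|es|ing|ed)|bikes?|biking|"
    r"bicycles?|bus(?:es)?|light rail)\b"
)

# case=False ignores capitalisation; na=False treats missing text as no match
df["mentions_transport"] = df["comments"].str.contains(
    pattern, case=False, regex=True, na=False
)
print(df["mentions_transport"].sum())
```

Passing `na=False` matters because the `comments` column has a few nulls, and a boolean column with missing values cannot be used directly as a filter.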

#### Step 8: data visualisations 
- visualise the number of listings and the number of beds per room type for the properties you have found close to the cycle stations mentioned
- then visualise how many reviews for those properties mentioned the theme of transport
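One possible shape for these two charts is sketched below on a toy merged table. It groups by `property_type` because that is the column kept in Step 4 (using it in place of room type is an assumption), and the `mentions_transport` flag and the helper `plot_summaries` are hypothetical names. The plotting helper needs matplotlib installed.

```python
import pandas as pd

# Toy merged table (illustrative only): one row per review
merged = pd.DataFrame({
    "listing_id": [1, 1, 2, 3, 3, 3],
    "property_type": ["Apartment", "Apartment", "House",
                      "Apartment", "Apartment", "Apartment"],
    "beds": [1.0, 1.0, 3.0, 2.0, 2.0, 2.0],
    "mentions_transport": [True, False, False, True, True, False],
})

# One row per listing, so beds are not counted once per review
per_listing = merged.drop_duplicates(subset="listing_id")
per_type = per_listing.groupby("property_type").agg(
    listings=("listing_id", "nunique"), beds=("beds", "sum")
)

# Number of reviews that do / do not mention transport
transport_counts = merged["mentions_transport"].value_counts().sort_index()

def plot_summaries(per_type, transport_counts):
    """Draw the two bar charts side by side (requires matplotlib)."""
    import matplotlib.pyplot as plt
    fig, axes = plt.subplots(1, 2, figsize=(10, 4))
    per_type.plot.bar(ax=axes[0], title="Listings and beds by property type")
    transport_counts.plot.bar(ax=axes[1], title="Reviews mentioning transport")
    fig.tight_layout()
    return fig
```

In a notebook, calling `plot_summaries(per_type, transport_counts)` renders both charts. De-duplicating to one row per listing before summing beds avoids inflating the totals for properties with many reviews.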