## Wrangling Airbnb Rental Listings in Seattle 

### Goals of the Task



The data set contains three tables, which were scraped from the Airbnb website on different dates: <br>
*listings - each row is a unique rental property* <br>
*reviews - each row is a review left by a guest after checking out of a property* <br>
*calendar - each row is a date and a property, showing whether the property was available or unavailable on that date* <br>

We want to use this data to understand how many Airbnb properties are located close to the less popular cycle hire stations, and whether any Airbnb guests commented on transport in their reviews of those properties. This insight could be used to make decisions about the future of those cycle stations. The data presents some challenges:

- the three data tables are too large to manage in Excel and are slow to visualise in Power BI
- we want to focus on specific locations only 
- not all the columns will be useful to this analysis
- potentially useful mentions of transport are embedded in the review text 
- there are different dates across the 3 data sets (date scraped, date of review, dates available)

#### Step 1: use pandas to read the data from the 3 CSV files and create 3 data frames (listings, reviews, calendar)
- import pandas as pd 
- use pandas read_csv 
- ensure you are pointing at the correct file path for the data sources (you may have to navigate in your notebook!) 
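
As a starting point, here is a minimal sketch of the loading step. The helper name `load_tables` and the file names `listings.csv`, `reviews.csv` and `calendar.csv` are assumptions; change them to match the files you downloaded.

```python
from pathlib import Path

import pandas as pd


def load_tables(data_dir):
    """Read the three CSV files found in data_dir into data frames."""
    data_dir = Path(data_dir)
    # File names are assumptions; adjust them to match your download.
    listings = pd.read_csv(data_dir / "listings.csv")
    reviews = pd.read_csv(data_dir / "reviews.csv")
    calendar = pd.read_csv(data_dir / "calendar.csv")
    return listings, reviews, calendar
```

In a notebook you might call it as `listings, reviews, calendar = load_tables("data")`, where `"data"` is whatever folder holds your files relative to the notebook.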


#### Step 2: preview each dataframe using pandas functions like .info(), .head(), .tail() and .describe() 
- look out for nulls and missing data 
- any problematic data types 
- consider whether you need to do anything about missing data (replace / impute / ignore / drop)
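
The checks above could look something like the sketch below. It runs on a tiny made-up table rather than the real listings data, and the column names (including a `price` column stored as text) are assumptions used only to illustrate a null count and a data type fix.

```python
import pandas as pd

# Tiny stand-in for the listings table; column names and values are made up.
listings = pd.DataFrame(
    {
        "id": [1, 2, 3],
        "zipcode": ["98121", None, "98109"],
        "bedrooms": [1.0, None, 2.0],
        "price": ["$85.00", "$150.00", "$1,200.00"],
    }
)

listings.info()             # data types and non-null counts per column
print(listings.head())      # first rows
print(listings.tail())      # last rows
print(listings.describe())  # summary statistics for numeric columns

# Count missing values per column to help decide what to impute or drop.
null_counts = listings.isna().sum()
print(null_counts)

# Example of fixing a problematic data type: a price stored as text.
listings["price"] = (
    listings["price"].str.replace(r"[$,]", "", regex=True).astype(float)
)
```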

#### Step 3: detect and manage any duplicate rows using pandas .duplicated()
- consider what a true duplicate is in each data frame
- decide whether to drop the duplicates entirely or to review them first and try to understand why they exist 
- if you drop any rows, remember to reset your index on the dataframe afterwards using .reset_index(inplace=True,drop=True)
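
A minimal sketch of this step on a made-up reviews sample is shown below. It treats a duplicate as a row that matches an earlier row in every column, which is only one possible definition; pass `subset=[...]` to `.duplicated()` if a narrower set of columns defines a true duplicate in your data.

```python
import pandas as pd

# Hypothetical reviews sample: the third row is an exact copy of the second.
reviews = pd.DataFrame(
    {
        "listing_id": [1, 1, 1, 2],
        "date": ["2015-06-01", "2015-07-04", "2015-07-04", "2015-07-04"],
        "comments": ["Great", "Close to bikes", "Close to bikes", "Quiet"],
    }
)

# Flag rows that repeat an earlier row in every column.
dup_mask = reviews.duplicated()
n_duplicates = int(dup_mask.sum())

# keep=False shows every copy, which helps when reviewing before dropping.
print(reviews[reviews.duplicated(keep=False)])

# Drop the repeats and rebuild a clean 0..n-1 index.
reviews = reviews.drop_duplicates().reset_index(drop=True)
```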

#### Step 4: filter the listings data frame and reduce the number of columns 

Using pandas .query(), we want to identify Airbnb properties listed near the locations of the less popular cycle stations found in the cycle hire data. 

This is the location information we know about the stations:

- station WF-03 (160 trips from here): zipcode 98121, lat/long 47.6114, -122.349, Alaskan Way, Belltown
- station SLU-22 (761 trips from here): zipcode 98109, lat/long 47.6209, -122.347, Thomas Street, South Lake Union  <br>

You will have to identify the columns in the listings dataframe that contain matching location information. 

Next, reduce the number of columns in this data frame to those likely to be useful for analysing the sample of properties. You should keep the Id, Host Id, some location information and some columns about the space, e.g. property type and number of bedrooms. 
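
One way to approach this is sketched below on a made-up listings sample. It matches on zipcode only, and every column name (`zipcode`, `room_type`, `listing_url` and so on) is an assumption; check the real column names with .info() first, and note that zipcodes may be read in as numbers rather than text.

```python
import pandas as pd

# Made-up listings sample; column names and values are assumptions.
listings = pd.DataFrame(
    {
        "id": [1, 2, 3, 4],
        "host_id": [10, 11, 12, 13],
        "zipcode": ["98121", "98109", "98101", "98121"],
        "latitude": [47.6120, 47.6215, 47.6080, 47.6135],
        "longitude": [-122.3480, -122.3465, -122.3350, -122.3500],
        "property_type": ["Apartment", "House", "Apartment", "Condominium"],
        "room_type": ["Entire home/apt", "Private room", "Entire home/apt", "Private room"],
        "bedrooms": [1, 1, 2, 1],
        "beds": [2, 1, 3, 1],
        "listing_url": ["url-1", "url-2", "url-3", "url-4"],
    }
)

# Zipcodes of the two stations, stored as text to match the sample column.
station_zipcodes = ["98121", "98109"]

# @ lets .query() refer to a Python variable defined outside the data frame.
near_stations = listings.query("zipcode in @station_zipcodes")

# Keep the ids, some location information and some columns about the space.
keep_columns = [
    "id", "host_id", "zipcode", "latitude", "longitude",
    "property_type", "room_type", "bedrooms", "beds",
]
near_stations = near_stations[keep_columns].reset_index(drop=True)
```

Filtering on zipcode is the coarsest option. A latitude/longitude range around each station would give a tighter sample, at the cost of choosing a distance threshold.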

#### Step 5: obtain a list of unique listing ids from the calendar table which show properties available for rent on any day in August 2016

- note that it is not possible to do this using Excel due to the large size of the file 
- extract the unique values from the dataframe using unique() 
- convert the result to a list 
- use the list of ids to filter the listings table with a pandas query 
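
The sketch below walks through these bullets on a made-up calendar sample. The column names `listing_id`, `date` and `available`, and the use of `"t"` to mark an available day, are assumptions to verify against the real calendar table.

```python
import pandas as pd

# Made-up calendar sample; "t" = available and "f" = unavailable (assumed).
calendar = pd.DataFrame(
    {
        "listing_id": [1, 1, 2, 3, 3],
        "date": ["2016-08-01", "2016-08-02", "2016-08-15", "2016-07-31", "2016-08-10"],
        "available": ["t", "t", "f", "t", "t"],
    }
)
calendar["date"] = pd.to_datetime(calendar["date"])

# Rows that fall in August 2016 and are marked as available.
in_august = calendar["date"].between("2016-08-01", "2016-08-31")
is_available = calendar["available"] == "t"

# unique() returns an array of distinct ids; tolist() turns it into a list.
available_ids = calendar.loc[in_august & is_available, "listing_id"].unique().tolist()

# Use the list of ids to filter the listings table.
listings = pd.DataFrame({"id": [1, 2, 3, 4]})
available_listings = listings.query("id in @available_ids")
```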

#### Step 6: combine the data sources into a new data frame
- the Id in the listings table and the Listing Id in reviews are the relevant keys to use 
- use the pandas merge method
- the data frame should contain a subset of columns that you consider useful to analyse airbnb properties near the 2 least popular cycle stations, plus the review text column
- the data frame should only contain properties that were available for rent in August 2016
- the data frame should contain only the reviews those properties received in the period matching our cycle hire sample dates (2014-2016)
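
A minimal sketch of the merge is shown below, using made-up samples. It assumes the listings have already been filtered as in the earlier steps, that the review columns are named `listing_id`, `date` and `comments`, and that "2014-2016" means the calendar years 2014 to 2016 inclusive; tighten the date range if your cycle hire sample covers a narrower window.

```python
import pandas as pd

# Listings already filtered to the station areas and August 2016 availability.
listings = pd.DataFrame(
    {
        "id": [1, 3],
        "zipcode": ["98121", "98109"],
        "room_type": ["Entire home/apt", "Private room"],
    }
)

# Made-up reviews; listing 2 is not part of the filtered listings above.
reviews = pd.DataFrame(
    {
        "listing_id": [1, 1, 2, 3],
        "date": ["2013-05-01", "2015-07-04", "2015-08-01", "2016-02-10"],
        "comments": ["Too early", "Bikes nearby", "Other area", "Easy transport"],
    }
)
reviews["date"] = pd.to_datetime(reviews["date"])

# Keep only reviews dated within the assumed cycle hire sample period.
in_period = reviews["date"].dt.year.between(2014, 2016)
reviews_in_period = reviews.loc[in_period, ["listing_id", "date", "comments"]]

# An inner merge keeps only listings that have at least one matching review.
combined = listings.merge(
    reviews_in_period, how="inner", left_on="id", right_on="listing_id"
)
```

An inner join drops properties with no reviews in the period. Use `how="left"` instead if you want to keep those properties with empty review columns.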

#### Step 7: check for the mentions of transportation, transport, cycling or bikes in the review text 

- the review column contains free-text sentences
- you can utilise any of the methods you encountered in the week 8 text analysis topic
- for example, you can use the pandas str.contains function, or regular expression pattern matching
- the aim is to flag all the reviews that contain a reference to transport or transportation, cycling, bikes etc
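
As one possible approach, the sketch below combines str.contains with a regular expression on a made-up sample. The `comments` column name, the new `mentions_transport` column and the keyword pattern are all assumptions; treat the pattern as a starting point and extend it with other terms you care about.

```python
import pandas as pd

# Made-up combined sample; the last review is missing.
combined = pd.DataFrame(
    {
        "id": [1, 1, 3, 3],
        "comments": [
            "Great location, public transportation was right outside.",
            "Lovely host and a quiet street.",
            "We hired bikes nearby and cycled along the waterfront.",
            None,
        ],
    }
)

# Matches transport/transportation, cycle/cycling/cycled, bike(s), bicycle(s).
# (?:...) is a non-capturing group, so pandas does not warn about match groups.
pattern = r"\b(?:transport\w*|cycl\w*|bikes?|bicycles?)\b"

# case=False ignores capitalisation; na=False flags missing reviews as False.
combined["mentions_transport"] = combined["comments"].str.contains(
    pattern, case=False, regex=True, na=False
)
```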

#### Step 8: data visualisations 
- visualise the number of listings and the number of beds per room type for the properties you have found close to the cycle stations mentioned
- then visualise how many reviews for those properties mentioned the theme of transport
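
The sketch below shows one way to prepare and plot both summaries, again on made-up data. The column names and the helper `plot_summaries` are assumptions, and the plotting function needs matplotlib to be installed. Because the combined table holds one row per review, it is reduced to one row per property before listings and beds are counted.

```python
import pandas as pd

# Made-up combined sample: one row per review, flagged as in Step 7.
combined = pd.DataFrame(
    {
        "id": [1, 1, 3, 4],
        "room_type": ["Entire home/apt", "Entire home/apt", "Private room", "Private room"],
        "beds": [2, 2, 1, 1],
        "mentions_transport": [True, False, True, False],
    }
)

# One row per property, so listings with many reviews are not double counted.
properties = combined.drop_duplicates(subset="id")
by_room_type = properties.groupby("room_type").agg(
    listings=("id", "nunique"),
    beds=("beds", "sum"),
)

# Number of reviews that do and do not mention transport.
transport_counts = (
    combined["mentions_transport"]
    .map({True: "mentions transport", False: "no mention"})
    .value_counts()
)


def plot_summaries(by_room_type, transport_counts):
    """Draw both summaries as bar charts side by side (requires matplotlib)."""
    import matplotlib.pyplot as plt

    fig, axes = plt.subplots(1, 2, figsize=(10, 4))
    by_room_type.plot(kind="bar", ax=axes[0], title="Listings and beds by room type")
    transport_counts.plot(kind="bar", ax=axes[1], title="Reviews mentioning transport")
    fig.tight_layout()
    return fig
```

Calling `plot_summaries(by_room_type, transport_counts)` in a notebook cell should display the two charts.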