# Filtering Airbnb

### Introduction

In this lesson, we'll use our knowledge of loops and filtering to work with Airbnb data in New York City. Let's get started.

### Loading and Exploring our Data

Let's start by loading up our data.

In [4]:
import pandas as pd
listings_df = pd.read_csv('AB_NYC_2019.csv')

listings = listings_df.to_dict('records')

Next, let's see how many listings we have gathered.

In [5]:
len(listings)

# 48895

48895

And now, let's see which attributes are available to us on each listing.  Look at the keys available on a single listing.

In [6]:
listings[0].keys()

# dict_keys(['id', 'name', 'host_id', 'host_name', 'neighbourhood_group', 
# 'neighbourhood', 'latitude', 'longitude', 'room_type', 'price',
# 'minimum_nights', 'number_of_reviews', 'last_review', 'reviews_per_month', 
# 'calculated_host_listings_count', 'availability_365'])

dict_keys(['id', 'name', 'host_id', 'host_name', 'neighbourhood_group', 'neighbourhood', 'latitude', 'longitude', 'room_type', 'price', 'minimum_nights', 'number_of_reviews', 'last_review', 'reviews_per_month', 'calculated_host_listings_count', 'availability_365'])

Now we'd like to calculate some statistics with this data.  Before we do, we should remove data that is not up to date.

Let's begin by looking at the `last_review` data.  Use a list comprehension to create a list of `last_reviews`, one for each listing.

In [7]:
last_reviews = [listing['last_review'] for listing in listings]

In [8]:
last_reviews[:5]

['2018-10-19', '2019-05-21', nan, '2019-07-05', '2018-11-19']

So we can see that some of our listings have `nan` values.  `nan` stands for "not a number" and is generally used to represent missing values.  Let's filter out the listings with `last_review` values of `nan`.

> A `nan` value is of type `float`, while an actual review date is a string.  So to check whether a value is `nan`, we can use something like the following.

In [9]:
type(last_reviews[0]) == float

False

In [10]:
type(last_reviews[2]) == float

True

In [11]:
last_reviews[0], last_reviews[2]

('2018-10-19', nan)

Use the logic above to only select listings that do not have a `last_review` value of nan.

In [102]:
listings_not_nan = [listing for listing in listings if not type(listing['last_review']) == float]

In [103]:
len(listings_not_nan)

38843
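
As a small, self-contained illustration of why the type check works here, consider the sketch below.  The `sample_listings` data is made up for the example and only mimics the shape of our listings.  Note that comparing against `nan` with `==` would not work, because `nan` is not equal to anything, including itself.

```python
# Hypothetical sample data that mimics the shape of our listings
nan = float('nan')
sample_listings = [
    {'id': 1, 'last_review': '2018-10-19'},
    {'id': 2, 'last_review': nan},
    {'id': 3, 'last_review': '2019-07-05'},
]

# nan is never equal to itself, so an equality check cannot find it
print(nan == nan)  # False

# In this data, real review dates are strings and missing ones are floats,
# so checking the type separates the two cases
reviewed = [listing for listing in sample_listings
            if type(listing['last_review']) != float]
print([listing['id'] for listing in reviewed])  # [1, 3]
```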

Calculate the percentage of listings we have left.

In [74]:
len(listings_not_nan)/len(listings)

# 0.7944166070150323

0.7944166070150323

Ok, so we keep about 79% of our listings.  Not amazing, but not bad.

### Back on Track

Now remember that our goal is to ensure we are working with relatively recent reviews.

Begin by coercing the first `last_review` to a Python datetime object.

> See [this post](https://chrisalbon.com/python/basics/strings_to_datetime/) for how to coerce a string to a datetime.

In [105]:
first_listing = listings[0]

In [106]:
last_review = first_listing['last_review']

In [107]:
from datetime import datetime

last_review_datetime = datetime.strptime(last_review, '%Y-%m-%d')

In [108]:
last_review_datetime.year, last_review_datetime.month 

(2018, 10)
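
To make the format string a little less mysterious, here is a minimal sketch of what each code in `'%Y-%m-%d'` matches, and what happens when a string does not fit the format.  The `raw_date` value is just an example string in the same shape as our `last_review` data.

```python
from datetime import datetime

raw_date = '2019-05-21'  # same format as the last_review strings

# %Y = four-digit year, %m = zero-padded month, %d = zero-padded day
parsed = datetime.strptime(raw_date, '%Y-%m-%d')
print(parsed.year, parsed.month, parsed.day)  # 2019 5 21

# A string that does not match the format raises a ValueError
try:
    datetime.strptime('05/21/2019', '%Y-%m-%d')
except ValueError as error:
    print('could not parse:', error)
```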

In [109]:
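# copy each dictionary too, so the original listings keep their string dates
listings_not_nan_copied = [listing.copy() for listing in listings_not_nan]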

In [110]:
for copied_listing in listings_not_nan_copied:
    last_review_dt = datetime.strptime(copied_listing['last_review'], '%Y-%m-%d')
    copied_listing['last_review'] = last_review_dt

In [113]:
updated_last_reviews = [listing['last_review'] 
                        for listing in listings_not_nan_copied]

updated_last_reviews[:2]

[datetime.datetime(2018, 10, 19, 0, 0), datetime.datetime(2019, 5, 21, 0, 0)]
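
One detail worth knowing when copying a list of dictionaries: `list.copy()` creates a new list, but the dictionaries inside it are still the same objects, so changing a dictionary through the copy also changes it in the original list.  The sketch below uses made-up data to show the difference between copying only the list and copying each dictionary as well.

```python
original = [{'last_review': '2018-10-19'}]  # made-up one-item list

shallow = original.copy()                      # new list, same dictionaries
per_item = [item.copy() for item in original]  # new list, new dictionaries

shallow[0]['last_review'] = 'changed'

print(original[0]['last_review'])  # changed (the dictionary is shared)
print(per_item[0]['last_review'])  # 2018-10-19 (unaffected)
```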

Ok, now let's sort the listings by `last_review`.  First we'll find the listing with the oldest `last_review`, and then the listing with the most recent `last_review`.

In [114]:
sorted_listings = sorted(listings_not_nan_copied, key=lambda x: x['last_review'])

In [117]:
earliest_listing = sorted_listings[0]
earliest_listing


{'id': 74860,
 'name': 'Sunlit and Cozy Williamsburg/Greenpoint, Brooklyn',
 'host_id': 394752,
 'host_name': 'Allison',
 'neighbourhood_group': 'Brooklyn',
 'neighbourhood': 'Greenpoint',
 'latitude': 40.72488,
 'longitude': -73.95018,
 'room_type': 'Private room',
 'price': 55,
 'minimum_nights': 2,
 'number_of_reviews': 1,
 'last_review': datetime.datetime(2011, 3, 28, 0, 0),
 'reviews_per_month': 0.01,
 'calculated_host_listings_count': 1,
 'availability_365': 0}

In [119]:
earliest_listing['last_review']

# datetime.datetime(2011, 3, 28, 0, 0)

datetime.datetime(2011, 3, 28, 0, 0)

In [120]:
sorted_listings[-1]['last_review']
# datetime.datetime(2019, 7, 8, 0, 0)

datetime.datetime(2019, 7, 8, 0, 0)
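
Sorting the whole list works, but if we only need the two extremes, Python's built-in `min` and `max` accept the same kind of `key` argument.  A minimal sketch on made-up data, where `review_date` is a helper name chosen for this example:

```python
from datetime import datetime

sample = [
    {'id': 1, 'last_review': datetime(2018, 10, 19)},
    {'id': 2, 'last_review': datetime(2011, 3, 28)},
    {'id': 3, 'last_review': datetime(2019, 7, 8)},
]

def review_date(listing):
    """Return the value that listings should be compared by."""
    return listing['last_review']

oldest = min(sample, key=review_date)
newest = max(sample, key=review_date)
print(oldest['id'], newest['id'])  # 2 3
```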

So we can see that our data ranges from 2011 to July 2019.  Let's limit our data so that we are only working with data from July 2017 to July 2019.

In [129]:
recent_listings = [listing for listing in sorted_listings if (listing['last_review'].year > 2016 and  listing['last_review'].month > 6)]

In [130]:
len(recent_listings)

# 10774

10774

So now we have about 11,000 recent listings.
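
One thing to be aware of with the filter above: it tests the year and the month separately, so it keeps reviews from July through December of 2017 and later years, and leaves out reviews from January through June (for example, March 2018).  If the goal is every review on or after July 1, 2017, comparing against a single cutoff date is one way to express it.  The sketch below uses made-up dates, and the `cutoff` name is just for illustration.  Applying it to the full dataset would give different counts than the ones shown in this lesson.

```python
from datetime import datetime

sample_dates = [
    datetime(2016, 12, 1),
    datetime(2017, 8, 15),
    datetime(2018, 3, 10),
    datetime(2019, 7, 8),
]

# Year and month tested separately: March 2018 is dropped because 3 > 6 is False
by_year_and_month = [d for d in sample_dates if d.year > 2016 and d.month > 6]

# A single cutoff keeps everything on or after July 1, 2017
cutoff = datetime(2017, 7, 1)
by_cutoff = [d for d in sample_dates if d >= cutoff]

print(len(by_year_and_month))  # 2
print(len(by_cutoff))          # 3
```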

### Listings by Neighborhood

Let's get a better sense of some of these recent listings.  Begin by creating a list of the unique `neighbourhood_group` values in our recent listings.

In [133]:
list(set([listing['neighbourhood_group'] for listing in recent_listings]))

# ['Queens', 'Brooklyn', 'Bronx', 'Manhattan', 'Staten Island']

['Queens', 'Brooklyn', 'Bronx', 'Manhattan', 'Staten Island']
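
The `set` is what removes the duplicates here.  Because sets are unordered, the order of the resulting list can vary between runs, so sorting is one option when a stable order matters.  A small sketch on made-up values:

```python
sample_boroughs = ['Brooklyn', 'Manhattan', 'Brooklyn', 'Queens', 'Manhattan']

unique_boroughs = set(sample_boroughs)  # duplicates collapse into one entry
print(len(unique_boroughs))  # 3

# Sets are unordered, so sort when a stable order matters
print(sorted(unique_boroughs))  # ['Brooklyn', 'Manhattan', 'Queens']
```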

Ok, so it looks like the five boroughs.  Now let's find the number of `recent_listings` in Manhattan that cost 50 dollars or less.

In [136]:
cheaper_manhattan_listings = [listing for listing in recent_listings if listing['neighbourhood_group'] == 'Manhattan' and listing['price'] <= 50] 

In [137]:
len(cheaper_manhattan_listings)

# 215

215
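
If we wanted to ask the same question for other boroughs or price limits, one option is to wrap the filter in a function.  This is only a sketch: the function name `listings_at_or_below` and the sample data are made up for illustration.

```python
sample = [
    {'neighbourhood_group': 'Manhattan', 'price': 45},
    {'neighbourhood_group': 'Manhattan', 'price': 150},
    {'neighbourhood_group': 'Brooklyn', 'price': 40},
    {'neighbourhood_group': 'Manhattan', 'price': 50},
]

def listings_at_or_below(listings, borough, max_price):
    """Return listings in the given borough priced at or below max_price."""
    return [listing for listing in listings
            if listing['neighbourhood_group'] == borough
            and listing['price'] <= max_price]

print(len(listings_at_or_below(sample, 'Manhattan', 50)))  # 2
print(len(listings_at_or_below(sample, 'Brooklyn', 50)))   # 1
```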

### Summary

In this lesson, we saw how to use filtering to reduce our dataset down to higher quality data and then query that data.  We also removed `nan` values by checking whether the datatype of a value was a float.  As an alternative, we could have used the `isnan` function from the `math` library.

In [139]:
import math
import numpy as np

math.isnan(np.nan)

True
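
One caveat if we apply this to our listings: `math.isnan` expects a number, and passing it a string such as `'2018-10-19'` raises a `TypeError`.  A minimal sketch of a guard, using a hypothetical helper named `is_missing` and made-up sample values:

```python
import math

def is_missing(value):
    """Return True only for a float nan; strings and other values are not missing."""
    return isinstance(value, float) and math.isnan(value)

sample_reviews = ['2018-10-19', float('nan'), '2019-07-05']
present = [review for review in sample_reviews if not is_missing(review)]
print(present)  # ['2018-10-19', '2019-07-05']
```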