# Filtering Airbnb

### Introduction

In this lesson, we'll use our knowledge of loops and filtering to work with Airbnb data in New York City. Let's get started.

### Loading and Exploring our Data

Let's start by loading up our data.

In [1]:
import pandas as pd
listings_df = pd.read_csv('https://raw.githubusercontent.com/jigsawlabs-student/mod-1-a-data-structures/master/3-coercing-filtering-data/AB_NYC_2019.csv')

listings = listings_df.to_dict('records')

HTTPError: HTTP Error 404: Not Found

Let's start by seeing the number of listings we have gathered.

In [38]:


# 48895

48895

And now, let's see which attributes are available to us on each listing.  Look at the keys available on a single listing.

In [97]:


# dict_keys(['id', 'name', 'host_id', 'host_name', 'neighbourhood_group', 
# 'neighbourhood', 'latitude', 'longitude', 'room_type', 'price',
# 'minimum_nights', 'number_of_reviews', 'last_review', 'reviews_per_month', 
# 'calculated_host_listings_count', 'availability_365'])

dict_keys(['id', 'name', 'host_id', 'host_name', 'neighbourhood_group', 'neighbourhood', 'latitude', 'longitude', 'room_type', 'price', 'minimum_nights', 'number_of_reviews', 'last_review', 'reviews_per_month', 'calculated_host_listings_count', 'availability_365'])

Now we'll like to calculate some statistics with this data.  Before doing so, let's get a sense of how recent our data is.  

To start, let's select just the `last_review` value from each element.  
> Use list comprehension to create a list of `last_reviews`, one for each listing.

In [98]:
last_reviews = None

In [99]:
last_reviews[:5]

# ['2018-10-19', '2019-05-21', nan, '2019-07-05', '2018-11-19']

['2018-10-19', '2019-05-21', nan, '2019-07-05', '2018-11-19']

We can see from the above that some of our listings have `nan` values.  
> As we may know `nan` stands for not a number and is generally used to represent missing values.  

Let's filter out the listings with last_review values of `nan`.  Removing only `nan` values can be tricky.  So here's a hint to get you started:

> nan is of type `float`.  So to check if the our value is `nan` we can use something like the following.

In [100]:
type(last_reviews[0]) == float

False

In [101]:
type(last_reviews[2]) == float

True

In [None]:
last_reviews[0], last_reviews[2]

Now, use what you learned from above to select *listings* that do not have a `last_review` value of nan.

In [141]:
listings_not_nan = []

In [103]:
len(listings_not_nan)

# 38843

38843

Calculate the percentage of listings we have left.

In [74]:


# 0.7944166070150323

0.7944166070150323

Ok, not amazing, but not bad.

### Back on Track

Now that we've removed our listnigs  that our goal is to ensure we are working with relatively recent reviews.

Begin by coercing the first `last_review` to a Python datetime object.

> See [this post](https://chrisalbon.com/python/basics/strings_to_datetime/) for coercing data to a datetime to do so.

In [105]:
first_listing = listings[0]

In [106]:
last_review = first_listing['last_review']

In [142]:
from datetime import datetime

last_review_datetime = None

In [108]:
last_review_datetime.year, last_review_datetime.month 

(2018, 10)

> Now iterate through the copy of `listings_not_nan_copied` and change each `last_review` to a datetime.

In [109]:
listings_not_nan_copied = listings_not_nan.copy()

In [110]:
for copied_listing in listings_not_nan_copied:
    last_review_dt = datetime.strptime(copied_listing['last_review'], '%Y-%m-%d')
    copied_listing['last_review'] = last_review_dt

> We can check that each `last_review` is now a datetime object.

In [113]:
updated_last_reviews = [listing['last_review'] 
                        for listing in listings_not_nan_copied]

updated_last_reviews[:2]

# [datetime.datetime(2018, 10, 19, 0, 0), datetime.datetime(2019, 5, 21, 0, 0)]

[datetime.datetime(2018, 10, 19, 0, 0), datetime.datetime(2019, 5, 21, 0, 0)]

Ok, now find the listing with the `most_recent_last_review`, and then we'll find the listing with the `oldest_last_review` .

In [143]:
earliest_listing = None

In [119]:
earliest_listing['last_review']

# datetime.datetime(2011, 3, 28, 0, 0)

datetime.datetime(2011, 3, 28, 0, 0)

In [145]:
latest_listing = None


In [146]:
latest_listing['last_review']
# datetime.datetime(2019, 7, 8, 0, 0)

So we can see that our data ranges from 2011 to July 2019.  Let's limit our data so that we only are working with data from July 2017 to July 2019.

In [147]:
recent_listings = []

In [148]:
len(recent_listings)

# 10774

0

So now we have about 11000 recent listings.

### Listings by Neighborhood

Let's getter a better sense of some of these recent listings.  Begin by creating a list of each `neighbourhood_group` of our recent listings.

In [133]:
# write code here

# ['Queens', 'Brooklyn', 'Bronx', 'Manhattan', 'Staten Island']

['Queens', 'Brooklyn', 'Bronx', 'Manhattan', 'Staten Island']

Ok, so it looks like the five boroughs.  Now let's find the number `recent_listings` in Manhattan that were less than or equal to 50 dollars.

In [149]:
cheaper_manhattan_listings = []

In [150]:
len(cheaper_manhattan_listings)

# 215

0

### Summary

In this lesson, we saw how to use filtering to reduce our dataset down to higher quality data and to then query our data.  We also worked to remove `nan` values by looking to see if the datatype of our data was a float.  As an alternative, we could have also used the function from the math library.

In [139]:
import math
import numpy as np

math.isnan(np.nan)

True