![rmotr](https://user-images.githubusercontent.com/7065401/52071918-bda15380-2562-11e9-828c-7f95297e4a82.png)
<hr style="margin-bottom: 40px;">

<img src="https://user-images.githubusercontent.com/7065401/69450217-6cee2780-0d3b-11ea-947b-461ea407da85.jpg"
    style="width:400px; float: right; margin: 0 40px 40px 40px;"></img>

### Project

# New York City Airbnb Open Data

Let's put into practice the topics covered in the course and analyze Airbnb listings data.

Since 2008, guests and hosts have used Airbnb to expand their travel possibilities and to offer a more unique, personalized way of experiencing the world. This dataset describes the listing activity and metrics in NYC, NY for 2019.

![purple-divider](https://user-images.githubusercontent.com/7065401/52071927-c1cd7100-2562-11e9-908a-dde91ba14e59.png)

## Hands on! 

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.image as mpimg

![green-divider](https://user-images.githubusercontent.com/7065401/52071924-c003ad80-2562-11e9-8297-1c6595f8a7ff.png)

### Read the `airbnb_data` dataset into an `airbnb_df` DataFrame variable.

This data file includes all needed information to find out more about hosts and geographical availability.

This public dataset is part of Airbnb, and the original source can be found on this [website](http://insideairbnb.com/).

Here's a preview of that file:

In [None]:
!head data/airbnb_data.csv

The column names are taken from the original documentation for this dataset.

In [None]:
# your code goes here



In [None]:
# solution

airbnb_df = pd.read_csv('data/airbnb_data.csv')

airbnb_df.head()

![green-divider](https://user-images.githubusercontent.com/7065401/52071924-c003ad80-2562-11e9-8297-1c6595f8a7ff.png)

### Setting the `airbnb_df` index

Set the index of the DataFrame to the `listing_id` column.

In [None]:
# your code goes here



In [None]:
# solution

airbnb_df.set_index('listing_id', inplace=True)
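
As a side note, the index can also be set while reading the file. The sketch below uses a made-up two-row CSV (the ids, names and prices are hypothetical, not taken from the real file) and the `index_col` parameter of `pd.read_csv`.

```python
import io

import pandas as pd

# Hypothetical two-row CSV standing in for the real data file
toy_csv = io.StringIO(
    "listing_id,listing_name,price\n"
    "1,Toy listing A,149\n"
    "2,Toy listing B,225\n"
)

# index_col tells read_csv which column becomes the index
toy_df = pd.read_csv(toy_csv, index_col='listing_id')

print(toy_df.index.name)
print(toy_df.loc[2, 'price'])
```

This gives the same result as calling `set_index('listing_id')` after loading, in a single step.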

![green-divider](https://user-images.githubusercontent.com/7065401/52071924-c003ad80-2562-11e9-8297-1c6595f8a7ff.png)

### Checking missing data

We need to check whether `airbnb_df` has any null values.

To do that, let's create a `percent_missing` Series containing the column names and the percent of missing values per column.

In [None]:
# your code goes here



In [None]:
# solution

# multiply first, then round, to avoid floating-point noise such as 20.560000000000002
percent_missing = (airbnb_df.isna().mean() * 100).round(2)

percent_missing
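
To see why taking the mean of `isna()` gives a percentage, here is a minimal sketch on a made-up DataFrame (the values are hypothetical). The mean of a boolean column is the fraction of `True` values, so multiplying by 100 gives the percent of missing entries.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not the real Airbnb file
toy_df = pd.DataFrame({
    'host_name': ['Ana', None, 'Li', 'Sam'],
    'reviews_per_month': [0.5, np.nan, np.nan, 2.0],
    'price': [100, 80, 120, 95],
})

# isna() returns a boolean mask; its column-wise mean is the share of missing values
toy_percent_missing = toy_df.isna().mean() * 100

print(toy_percent_missing)
```

Here `host_name` has one missing value out of four rows (25%), `reviews_per_month` has two (50%), and `price` has none.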

![green-divider](https://user-images.githubusercontent.com/7065401/52071924-c003ad80-2562-11e9-8297-1c6595f8a7ff.png)

### Dealing with `reviews_per_month` missing values

Impute null values in the `reviews_per_month` column with a `0` value.

In [None]:
# your code goes here



In [None]:
# solution

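# assign the result back: fillna(inplace=True) on a selected column may not update airbnb_df in recent pandas
airbnb_df['reviews_per_month'] = airbnb_df['reviews_per_month'].fillna(0)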

![green-divider](https://user-images.githubusercontent.com/7065401/52071924-c003ad80-2562-11e9-8297-1c6595f8a7ff.png)

### Dealing with `host_name` missing values

Drop the rows where `host_name` has a missing value.

In [None]:
# your code goes here



In [None]:
# solution 1

airbnb_df = airbnb_df.loc[airbnb_df['host_name'].notna()]

In [None]:
# solution 2

airbnb_df = airbnb_df.dropna(subset=['host_name'])
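
The two solutions are interchangeable. A small sketch on made-up data (hypothetical values) shows that a `notna()` mask and `dropna(subset=...)` keep the same rows, and that missing values in other columns are left alone.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one missing host_name, one missing price
toy_df = pd.DataFrame({
    'host_name': ['Ana', None, 'Sam'],
    'price': [100, 80, np.nan],
})

# Option 1: boolean mask
toy_masked = toy_df.loc[toy_df['host_name'].notna()]

# Option 2: dropna restricted to one column
toy_dropped = toy_df.dropna(subset=['host_name'])

print(toy_masked.equals(toy_dropped))
print(toy_dropped.index.tolist())
```

The row with a missing `price` survives because only `host_name` is listed in `subset`.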

![green-divider](https://user-images.githubusercontent.com/7065401/52071924-c003ad80-2562-11e9-8297-1c6595f8a7ff.png)

### Dealing with `last_review` missing values

Drop the rows where `last_review` has a missing value.

In [None]:
# your code goes here



In [None]:
# solution 1

airbnb_df = airbnb_df.loc[airbnb_df['last_review'].notna()]

In [None]:
# solution 2

airbnb_df = airbnb_df.dropna(subset=['last_review'])

![green-divider](https://user-images.githubusercontent.com/7065401/52071924-c003ad80-2562-11e9-8297-1c6595f8a7ff.png)

### Dealing with `host_id` invalid values

Drop the rows where `host_id` has a `0` value.

In [None]:
# your code goes here



In [None]:
# solution

airbnb_df = airbnb_df.loc[airbnb_df['host_id'] != 0]

![green-divider](https://user-images.githubusercontent.com/7065401/52071924-c003ad80-2562-11e9-8297-1c6595f8a7ff.png)

### Now cast this `last_review` column to `datetime`

In [None]:
# your code goes here



In [None]:
# solution

airbnb_df['last_review'] = pd.to_datetime(airbnb_df['last_review'])
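
Once a column is `datetime`, date arithmetic and the `.dt` accessor become available. The sketch below uses a few made-up date strings (hypothetical values in `YYYY-MM-DD` form; the real file's format is not confirmed here).

```python
import pandas as pd

# Hypothetical date strings
toy_last_review = pd.Series(['2019-05-21', '2018-10-19', '2019-07-05'])

# Parse strings into datetime values
toy_dates = pd.to_datetime(toy_last_review)

print(toy_dates.dtype)
print(toy_dates.dt.year.tolist())
print(toy_dates.max())
```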

![green-divider](https://user-images.githubusercontent.com/7065401/52071924-c003ad80-2562-11e9-8297-1c6595f8a7ff.png)

### Duplicated listings

Many listings could be duplicated. Drop all the entries that have the same `listing_name`, `price` and `room_type`. Keep just the last entry.

In [None]:
# your code goes here



In [None]:
# solution

airbnb_df.drop_duplicates(subset=['listing_name', 'price', 'room_type'],
                          keep='last',
                          inplace=True)
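
A minimal sketch of how `keep='last'` behaves, using made-up listings (the names, prices and ids are hypothetical). Rows 101 and 102 match on all three columns, so only the later one is kept.

```python
import pandas as pd

# Hypothetical toy listings; 101 and 102 are duplicates of each other
toy_df = pd.DataFrame(
    {
        'listing_name': ['Cozy loft', 'Cozy loft', 'Sunny room'],
        'price': [120, 120, 80],
        'room_type': ['Entire home/apt', 'Entire home/apt', 'Private room'],
    },
    index=[101, 102, 103],
)

# Rows are duplicates only if all columns in subset match
toy_deduped = toy_df.drop_duplicates(
    subset=['listing_name', 'price', 'room_type'],
    keep='last',
)

print(toy_deduped.index.tolist())
```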

![green-divider](https://user-images.githubusercontent.com/7065401/52071924-c003ad80-2562-11e9-8297-1c6595f8a7ff.png)

### Invalid `price`

Check the `price`. It should be a numeric type column.

- Remove the `$`, `.` and `-` characters.
- Replace `,` with a `.` character.
- Cast the column to `float` dtype.

In [None]:
# your code goes here



In [None]:
# solution

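# regex=False treats '$' and '.-' as literal text instead of regex patterns
airbnb_df['price'] = airbnb_df['price'].str.replace('$', '', regex=False)
airbnb_df['price'] = airbnb_df['price'].str.replace(',', '.', regex=False)
airbnb_df['price'] = airbnb_df['price'].str.replace('.-', '', regex=False)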

airbnb_df['price'] = airbnb_df['price'].astype('float')
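
The cleaning steps can be traced on a couple of made-up price strings. The format below (`$149.-`, `$89,5.-`) is an assumption for illustration only; check the real column before relying on it. Passing `regex=False` matters because `$` and `.` have special meanings in regular expressions.

```python
import pandas as pd

# Hypothetical raw price strings (assumed format)
toy_price = pd.Series(['$149.-', '$89,5.-'])

toy_price = toy_price.str.replace('$', '', regex=False)   # '149.-', '89,5.-'
toy_price = toy_price.str.replace(',', '.', regex=False)  # '149.-', '89.5.-'
toy_price = toy_price.str.replace('.-', '', regex=False)  # '149',   '89.5'

# Only now can the strings be parsed as numbers
toy_price = toy_price.astype('float')

print(toy_price.tolist())
```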

![green-divider](https://user-images.githubusercontent.com/7065401/52071924-c003ad80-2562-11e9-8297-1c6595f8a7ff.png)

### Separating `neighbourhood_group` column

Check the `neighbourhood_group` column. It should be divided into two different columns: `neighbourhood` and `borough`.

After the split, drop the `neighbourhood_group` column.

In [None]:
# your code goes here



In [None]:
# solution

airbnb_df[['neighbourhood', 'borough']] = airbnb_df['neighbourhood_group'].str.split(', ', expand=True)

airbnb_df.drop(columns='neighbourhood_group', inplace=True)
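
A small sketch of `str.split(..., expand=True)` on made-up values. The `'neighbourhood, borough'` layout is assumed from the exercise statement, and the two rows are hypothetical.

```python
import pandas as pd

# Hypothetical combined values in 'neighbourhood, borough' form
toy_df = pd.DataFrame({
    'neighbourhood_group': ['Harlem, Manhattan', 'Williamsburg, Brooklyn'],
})

# expand=True returns one column per piece instead of a list per row
toy_df[['neighbourhood', 'borough']] = (
    toy_df['neighbourhood_group'].str.split(', ', expand=True)
)

toy_df = toy_df.drop(columns='neighbourhood_group')

print(toy_df)
```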

![green-divider](https://user-images.githubusercontent.com/7065401/52071924-c003ad80-2562-11e9-8297-1c6595f8a7ff.png)

### Separating `lat_lon` column

Create two new columns, `latitude` and `longitude`, by splitting the `lat_lon` column. Both new columns should be cast to `float` dtype.

After the split, drop the `lat_lon` column.

In [None]:
# your code goes here



In [None]:
# solution

airbnb_df[['latitude', 'longitude']] = airbnb_df['lat_lon'].str.split(';', expand=True)

airbnb_df.drop(columns='lat_lon', inplace=True)

airbnb_df = airbnb_df.astype({'latitude': 'float',
                              'longitude': 'float'})

#### Visualizing points on a map

Let's visualize your `airbnb_df` to confirm everything is in the correct format so far.

Run the code below to plot the `latitude` and `longitude` columns you just created.

In [None]:
longlat_min_max = (airbnb_df.longitude.min(), airbnb_df.longitude.max(), airbnb_df.latitude.min(), airbnb_df.latitude.max())

nyc = plt.imread('./data/nyc-map.png')

fig, ax = plt.subplots(figsize=(10,9))

ax.scatter(airbnb_df.longitude, airbnb_df.latitude, zorder=1, alpha=0.6, c='#fd5c63', s=10)
ax.set_title('Airbnb listing locations ')
ax.imshow(nyc, extent=longlat_min_max);

![green-divider](https://user-images.githubusercontent.com/7065401/52071924-c003ad80-2562-11e9-8297-1c6595f8a7ff.png)

### Fixing `availability_365` values

This column should be cast to a numeric dtype, but it has many invalid string values, like `43+N524`, that should be coerced to `NaN` while casting.

We can also see many out-of-range values. Let's fix them:
- Negative numbers should be converted to positive.
- Numbers above 365 should be dropped.

In [None]:
airbnb_df['availability_365'].unique()

In [None]:
# your code goes here



In [None]:
# solution

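# invalid strings become NaN; abs() turns negative numbers into positive ones
airbnb_df['availability_365'] = pd.to_numeric(airbnb_df['availability_365'], errors='coerce').abs()

# keep values up to 365; rows coerced to NaN are dropped too, since NaN <= 365 is False
airbnb_df = airbnb_df.loc[airbnb_df['availability_365'] <= 365]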
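
The rules listed above can be traced on a handful of made-up values (hypothetical, chosen to hit each case): a valid number, a negative number, an invalid string, a value above 365, and zero.

```python
import pandas as pd

# Hypothetical raw values covering each case
toy_avail = pd.Series(['365', '-12', '43+N524', '400', '0'])

# Invalid strings become NaN instead of raising an error
toy_avail = pd.to_numeric(toy_avail, errors='coerce')

# Negative numbers are converted to positive
toy_avail = toy_avail.abs()

# Values above 365 are dropped; NaN also fails the comparison and is dropped
toy_avail = toy_avail.loc[toy_avail <= 365]

print(toy_avail.tolist())
```

Note that the column ends up as `float`, because `NaN` forced a floating-point dtype during the coercion step.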

![green-divider](https://user-images.githubusercontent.com/7065401/52071924-c003ad80-2562-11e9-8297-1c6595f8a7ff.png)

### Most reviews

Create a Series `most_reviews` containing the top 10 `host_name`s with the highest total (sum) of reviews per month across all properties owned by that host.

In [None]:
# your code goes here



In [None]:
# solution 

most_reviews = airbnb_df['reviews_per_month'].groupby(airbnb_df['host_name']).sum().sort_values(ascending=False).head(10)
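
The group, sum, sort, and slice pattern can be checked by hand on a tiny made-up table (hypothetical hosts and values).

```python
import pandas as pd

# Hypothetical toy data: Ana and Li each own two listings
toy_df = pd.DataFrame({
    'host_name': ['Ana', 'Li', 'Ana', 'Sam', 'Li'],
    'reviews_per_month': [1.0, 0.5, 2.0, 4.0, 0.25],
})

toy_top = (
    toy_df.groupby('host_name')['reviews_per_month']
    .sum()                         # Ana 3.0, Li 0.75, Sam 4.0
    .sort_values(ascending=False)  # largest totals first
    .head(2)                       # keep the top 2
)

print(toy_top)
```

One caveat: two different hosts may share a name, so grouping by `host_name` can merge them. Grouping by `host_id` would avoid that, at the cost of less readable labels.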

#### Visualizing most_reviews

In [None]:
most_reviews.plot(kind='pie',figsize=(8, 8))

plt.title("Top 10 Most Reviews Per Month")
plt.ylabel("")

![green-divider](https://user-images.githubusercontent.com/7065401/52071924-c003ad80-2562-11e9-8297-1c6595f8a7ff.png)

### Most expensive listings

Create a variable `expensive_listings_df` containing the 100 listings with the highest price per night. The most expensive should be at the top.

In [None]:
# your code goes here



In [None]:
# solution

expensive_listings_df = airbnb_df.sort_values(by='price', ascending=False).head(100)

expensive_listings_df
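
An alternative worth knowing is `DataFrame.nlargest`, which returns the top rows for a column already sorted in descending order. The sketch below compares both approaches on made-up prices (hypothetical values).

```python
import pandas as pd

# Hypothetical toy listings
toy_df = pd.DataFrame(
    {
        'listing_name': ['A', 'B', 'C', 'D'],
        'price': [100.0, 500.0, 250.0, 80.0],
    },
    index=[1, 2, 3, 4],
)

# Sort everything, then take the first rows
toy_sorted = toy_df.sort_values(by='price', ascending=False).head(2)

# Ask directly for the 2 largest prices
toy_largest = toy_df.nlargest(2, 'price')

print(toy_largest)
print(toy_sorted.equals(toy_largest))
```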

![green-divider](https://user-images.githubusercontent.com/7065401/52071924-c003ad80-2562-11e9-8297-1c6595f8a7ff.png)

### Which neighbourhood has the largest number of expensive listings?

Using `expensive_listings_df`, count how many listings belong to each `neighbourhood`.

In [None]:
# your code goes here



In [None]:
# solution

expensive_listings_df['neighbourhood'].value_counts()

#### Visualizing neighbourhoods

In [None]:
expensive_listings_df['neighbourhood'].value_counts().plot(kind='bar', figsize=(10,7))

![green-divider](https://user-images.githubusercontent.com/7065401/52071924-c003ad80-2562-11e9-8297-1c6595f8a7ff.png)

### Which neighbourhood has the most listings?

Using all the listings from `airbnb_df`, count how many listings each neighbourhood has. Keep just the top 10 neighbourhoods with the most listings.

In [None]:
# your code goes here



In [None]:
# solution

airbnb_df['neighbourhood'].value_counts().head(10)
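
`value_counts()` already returns counts sorted from most to least frequent, which is why `head(10)` is enough to get a top 10. A minimal sketch on made-up values (hypothetical neighbourhoods):

```python
import pandas as pd

# Hypothetical toy column
toy_neighbourhood = pd.Series(
    ['Harlem', 'Midtown', 'Harlem', 'SoHo', 'Harlem', 'Midtown']
)

# Counts come back in descending order: Harlem 3, Midtown 2, SoHo 1
toy_top = toy_neighbourhood.value_counts().head(2)

print(toy_top)
```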

#### Visualizing neighbourhoods

In [None]:
airbnb_df['neighbourhood'].value_counts().head(10).plot(kind='bar', figsize=(10,7))

![purple-divider](https://user-images.githubusercontent.com/7065401/52071927-c1cd7100-2562-11e9-908a-dde91ba14e59.png)