# Airbnb Berlin 2020

This dataset contains attributes of Airbnb listings in Berlin, scraped in March 2020.

## Kickoff Questions
1. How do accommodation facilities affect the final review score of each listing?
2. How did prices evolve over time?
3. Which neighbourhoods are the hottest (shown as a heat map)?

In [0]:
import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In [0]:
from google.colab import drive
drive.mount('/content/drive')

In [0]:
df_listings = pd.read_csv('/content/drive/My Drive/data-science-notebooks/airbnb-berlin/listings.csv')
df_listings.head()

Below is the correlation of accommodation attributes with the overall review score and the price.

In [0]:
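# 'float_price' is derived from the string column 'price' (e.g. "$1,200.00")
df_listings['float_price'] = df_listings['price'].str.replace(r'[$,]', '', regex=True).astype('float64')
# keep numeric columns only, then drop all-empty columns and zero-priced listings
df = df_listings.select_dtypes(exclude='object')
df = df.dropna(axis=1, how='all')
df = df[df['float_price'] != 0.0]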

In [0]:
import seaborn as sns
df = df[['accommodates','bathrooms','bedrooms','beds','number_of_reviews','review_scores_rating','float_price']]
sns.heatmap(df.corr(), annot=True)
plt.show()

### Conclusion

**Review Score Rating** and **Price** are only weakly correlated with accommodation characteristics such as the number of guests accommodated, bathrooms, and bedrooms.
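
A heat map can be hard to read at a glance, so it may help to list the correlations with a single target column, ordered by strength. The sketch below does this on a small toy frame. The values are made up and the helper `correlations_with` is a hypothetical name, not part of the notebook.

```python
import pandas as pd

# Toy stand-in for the numeric listing columns used above (values are made up).
toy = pd.DataFrame({
    'accommodates':         [2, 4, 2, 6, 3, 1],
    'bedrooms':             [1, 2, 1, 3, 1, 1],
    'review_scores_rating': [95, 90, 98, 92, 88, 97],
    'float_price':          [40.0, 80.0, 45.0, 150.0, 60.0, 30.0],
})

def correlations_with(frame, target):
    """Return each column's Pearson correlation with `target`, strongest first."""
    corr = frame.corr()[target].drop(target)
    # order by absolute strength so weak relationships sink to the bottom
    return corr.reindex(corr.abs().sort_values(ascending=False).index)

price_corr = correlations_with(toy, 'float_price')
rating_corr = correlations_with(toy, 'review_scores_rating')
print(price_corr)
print(rating_corr)
```

Applied to the real data, the same call would be `correlations_with(df, 'review_scores_rating')` on the frame used for the heat map.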

In [0]:
df_reviews = pd.read_csv('/content/drive/My Drive/data-science-notebooks/airbnb-berlin/reviews.csv')
df_reviews.head()

In [0]:
df_calendar = pd.read_csv('/content/drive/My Drive/data-science-notebooks/airbnb-berlin/calendar.csv')
df_calendar.head()

In [0]:
print(f"df_listings.shape = {df_listings.shape}")
print(f"df_reviews.shape = {df_reviews.shape}")
print(f"df_calendar.shape = {df_calendar.shape}")

In [0]:
df_listings.loc[:, 'price':'extra_people'].head()

# Accommodations Overview (Prices)

These are the questions I will explore in this section:
1. How do prices vary per neighbourhood?
2. Which neighbourhood has the most expensive listings?
3. What distinguishes an expensive listing from a cheap one in terms of facilities? What are the most popular facilities?


Shown below are the 10 **most** and the 10 **least** expensive neighbourhoods by average listing price.

In [0]:
# create an additional column with price as float
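# regex=True is required in recent pandas versions for the character class to be applied
df_listings['float_price'] = df_listings['price'].str.replace(r'[$,]', '', regex=True).astype('float64')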
df_grouped = df_listings.groupby('neighbourhood')['float_price']

In [0]:
# sort neighbourhoods by price
df_mean = df_grouped.mean().sort_values(ascending=False)
df_mean = df_mean.reset_index()
df_mean = pd.concat([df_mean[:10], df_mean[-10:]])
df_mean.plot.bar(x='neighbourhood')
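
The price parsing and ranking steps above can be checked on a toy frame before running them on the full file. This is a minimal sketch: the prices are made up, and `nlargest` / `nsmallest` are used here as an alternative to slicing the sorted result.

```python
import pandas as pd

# Toy listings; neighbourhood names are real Berlin areas, prices are made up.
toy = pd.DataFrame({
    'neighbourhood': ['Mitte', 'Mitte', 'Kreuzberg', 'Kreuzberg', 'Spandau', 'Pankow'],
    'price': ['$120.00', '$1,080.00', '$45.00', '$55.00', '$30.00', '$70.00'],
})

# strip the currency symbol and thousands separator, then convert to float
toy['float_price'] = (
    toy['price'].str.replace(r'[$,]', '', regex=True).astype('float64')
)

# average price per neighbourhood
mean_price = toy.groupby('neighbourhood')['float_price'].mean()

n = 2  # the notebook uses 10; 2 keeps the toy output readable
most_expensive = mean_price.nlargest(n)
least_expensive = mean_price.nsmallest(n)
print(most_expensive)
print(least_expensive)
```

Note how the single `$1,080.00` listing pulls the Mitte average far above the rest, which is worth keeping in mind when reading a chart of averages.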

Shown below are the amenities present in more than 50% of listings in the 10 most expensive neighbourhoods.

In [0]:
# select popular amenities among listings in the most expensive neighbourhoods
indexer = df_listings['neighbourhood'].isin(df_mean[:10]['neighbourhood'])
df_expensive_neig = df_listings[indexer]
number_of_listings = df_expensive_neig.shape[0]

df_amenities = df_expensive_neig['amenities']
df_amenities = pd.DataFrame([elem.replace('{','').replace('}', '').replace('"','').split(',') for elem in df_amenities])
df_amenities = pd.DataFrame(df_amenities.values.flatten())
df_amenities = df_amenities.dropna()

In [0]:
# keep amenities that appear in more than half of the selected listings
amenities_counts = df_amenities[0].value_counts()
top_amenities = amenities_counts[amenities_counts / number_of_listings > .5]
top_amenities.plot.bar()

Shown below are the amenities present in more than 50% of listings in the 10 least expensive neighbourhoods.

In [0]:
# select popular amenities among listings in the least expensive neighbourhoods
indexer = df_listings['neighbourhood'].isin(df_mean[-10:]['neighbourhood'])
df_cheap_neig = df_listings[indexer]
number_of_listings = df_cheap_neig.shape[0]

df_amenities = df_cheap_neig['amenities']
df_amenities = pd.DataFrame([elem.replace('{','').replace('}', '').replace('"','').split(',') for elem in df_amenities])
df_amenities = pd.DataFrame(df_amenities.values.flatten())
df_amenities = df_amenities.dropna()

In [0]:
# keep amenities that appear in more than half of the selected listings
amenities_counts = df_amenities[0].value_counts()
top_amenities = amenities_counts[amenities_counts / number_of_listings > .5]
top_amenities.plot.bar()

### Conclusion

Amenities do not have a large impact on price: the same amenities are popular in expensive and cheap neighbourhoods alike. Nevertheless, there is a set of essentials you **MUST** have in your accommodation.
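
To compare the two groups side by side rather than in two separate charts, the amenity shares can be put into one table. The sketch below is one possible way to do this with `Series.explode`. It runs on made-up amenity strings in the `{a,"b c",d}` format that the cells above strip apart, and `amenity_share` is a hypothetical helper.

```python
import pandas as pd

# Toy amenities strings in the {a,"b c",d} format handled by the cells above.
toy = pd.DataFrame({
    'group': ['expensive', 'expensive', 'expensive', 'cheap', 'cheap'],
    'amenities': [
        '{Wifi,Kitchen,"Hair dryer"}',
        '{Wifi,Heating}',
        '{Wifi,Kitchen,Heating}',
        '{Wifi,Kitchen}',
        '{Heating}',
    ],
})

def amenity_share(amenities):
    """Fraction of listings that mention each amenity."""
    cleaned = amenities.str.strip('{}').str.replace('"', '', regex=False)
    per_listing = cleaned.str.split(',').explode()
    per_listing = per_listing[per_listing != '']  # ignore empty "{}" entries
    return per_listing.value_counts() / len(amenities)

# one column of shares per group, aligned on the amenity name
shares = pd.DataFrame({
    name: amenity_share(part['amenities'])
    for name, part in toy.groupby('group')
}).fillna(0.0)

# amenities present in more than half of the listings of either group
popular = shares[(shares > 0.5).any(axis=1)]
print(popular)
```

With the real data, the `group` column would be derived from the expensive and cheap neighbourhood selections made earlier.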

# Accommodations Overview (Reviews)

These are the questions I will explore in this section:
1. How do facilities affect the final review score?
2. What characterizes a good or bad listing (high or low review score)?

In [0]:
df_listings.loc[:, 'number_of_reviews':'review_scores_value'].head()

In [0]:
# remove listings with no reviews
df = df_listings[df_listings['number_of_reviews'] > 0]

In [0]:
df[['number_of_reviews']].hist()

In [0]:
# focus on listings with more than 100 reviews
df = df_listings[df_listings['number_of_reviews'] > 100]
df[['number_of_reviews']].hist()

In [0]:
df[['review_scores_rating']].plot(kind='hist',bins=[0,20,40,60,80,100],rwidth=0.8)
plt.show()
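
One possible next step towards characterizing good and bad listings is to split them by rating and compare the group averages. The sketch below uses made-up values, and the cut-off of 90 is an assumption chosen for illustration, not a threshold taken from the data.

```python
import pandas as pd

# Toy listings with made-up values; column names match those used above.
toy = pd.DataFrame({
    'number_of_reviews':    [120, 15, 300, 8, 45, 210],
    'review_scores_rating': [98, 72, 95, 60, 91, 85],
    'accommodates':         [2, 4, 2, 6, 3, 2],
    'float_price':          [55.0, 40.0, 70.0, 35.0, 65.0, 50.0],
})

# Assumed cut-off: a rating of 90 or more counts as "high"; tune this to the data.
HIGH_RATING = 90
toy['rating_group'] = toy['review_scores_rating'].ge(HIGH_RATING).map(
    {True: 'high', False: 'low'}
)

# mean of each numeric feature per rating group, one row per group
profile = toy.groupby('rating_group')[
    ['number_of_reviews', 'accommodates', 'float_price']
].mean()
print(profile)
```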

# Prices Over Time

Key question of this section:
1. How do prices vary over time?

**Note:** Answering this requires data from previous months or years.

In [0]:
from google.colab import files

uploaded = files.upload()

for fn in uploaded.keys():
  print('User uploaded file "{name}" with length {length} bytes'.format(
      name=fn, length=len(uploaded[fn])))
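
Once data covering several months is loaded, the monthly price trend could be computed along these lines. This is only a sketch on toy rows: it assumes a frame with a `date` column and a string `price` column, which has not been verified against calendar.csv, and all values are made up.

```python
import pandas as pd

# Toy calendar rows; the `date` and `price` column names are assumptions
# about the data, and the values are made up.
toy_calendar = pd.DataFrame({
    'listing_id': [1, 1, 1, 2, 2, 2],
    'date': ['2020-03-15', '2020-04-15', '2020-05-15',
             '2020-03-15', '2020-04-15', '2020-05-15'],
    'price': ['$50.00', '$55.00', '$60.00', '$100.00', '$95.00', '$1,100.00'],
})

toy_calendar['date'] = pd.to_datetime(toy_calendar['date'])
toy_calendar['float_price'] = (
    toy_calendar['price'].str.replace(r'[$,]', '', regex=True).astype('float64')
)

# average price per calendar month
monthly = (
    toy_calendar
    .groupby(toy_calendar['date'].dt.to_period('M'))['float_price']
    .mean()
)
print(monthly)
```

The resulting series is indexed by month, so `monthly.plot()` would give a simple trend line.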

# Hottest Neighbourhoods

Key questions of this section:
1. Which neighbourhoods have the most listings?
2. Which neighbourhoods have the highest occupancy rate?
3. Which neighbourhoods are trending, i.e., show recent spikes in the number of listings?
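
The first two questions could be approached roughly as sketched below. The toy frames are made up, and the calendar columns `listing_id` and `available` (with values `'t'` / `'f'`) as well as the listings column `id` are assumptions about the files. Unavailable days are only a proxy for occupancy, since a host may also block dates without a booking.

```python
import pandas as pd

# Toy frames with made-up values. The calendar columns `listing_id` and
# `available` ('t'/'f') and the listings column `id` are assumptions.
toy_listings = pd.DataFrame({
    'id': [1, 2, 3, 4],
    'neighbourhood': ['Mitte', 'Mitte', 'Kreuzberg', 'Pankow'],
})
toy_calendar = pd.DataFrame({
    'listing_id': [1, 1, 2, 2, 3, 3, 4, 4],
    'available':  ['f', 'f', 't', 'f', 't', 't', 'f', 't'],
})

# 1. number of listings per neighbourhood, largest first
listing_counts = toy_listings['neighbourhood'].value_counts()

# 2. occupancy proxy: share of calendar days marked unavailable
toy_calendar['booked'] = toy_calendar['available'].eq('f')
merged = toy_calendar.merge(
    toy_listings, left_on='listing_id', right_on='id', how='inner'
)
occupancy = (
    merged.groupby('neighbourhood')['booked'].mean().sort_values(ascending=False)
)
print(listing_counts)
print(occupancy)
```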
