# Analysis of Airbnb data from Munich, Germany

![https://images.portal.muenchen.de/upload/media/000/000/218/415/resized/0750x0310/marienplatz-750.jpg](https://images.portal.muenchen.de/upload/media/000/000/218/415/resized/0750x0310/marienplatz-750.jpg)

**Motivation:** Being relatively new to the field of data science, I am enthusiastic about the information that can be obtained from data, especially about how businesses can use data to make decisions. For my training I chose an Airbnb data set to examine in this notebook, and because I live in Munich, Germany, I find this data particularly interesting.

**Dataset:** The data sets used here were created on November 25th, 2019 and contain detailed listings data, review data and calendar data of current Airbnb listings in Munich, Germany. The data was compiled by Murray Cox as part of his Inside Airbnb project, which can be found here: http://insideairbnb.com/get-the-data.html

The data set can also be found here: https://www.kaggle.com/chriskue/munich-airbnb-data

**Methodology:** For my analysis I will use the *CRISP-DM* methodology. CRISP-DM stands for "Cross-Industry Standard Process for Data Mining". It provides a structured approach to planning a data mining project. The process consists of six steps:

1. Business Understanding
1. Data Understanding
1. Data Preparation
1. Modeling
1. Evaluation
1. Deployment

The underlying Jupyter notebook can be found on GitHub (https://github.com/noema-git/airbnb-analysis).

**Let's get started ...**

# 1. Business Understanding  
The business understanding phase defines the specific goals and requirements for data mining. Users and analysts exchange information on tasks and expectations, formulate the concrete questions, and agree on a rough procedure for answering them. The phase ends with the criteria for success being set.

Airbnb is a community-based platform that supports magical travel that is local, authentic and unique. Airbnb has more than 5 million listings and operates in 191 countries and more than 81,000 cities. To date, the Airbnb community has hosted nearly half-a-billion guests through a model designed to support healthy travel. Moreover, it cultivates a sharing-economy by allowing property owners to rent out private flats worldwide. (Source: https://news.airbnb.com/en-us/an-update-about-our-community-in-san-francisco/)


**In order to get a better insight into the business model of Airbnb, especially in Munich, I am particularly interested in the answers to the following three questions:**
1. When is the most expensive time of the year to visit Munich and how much does the price spike?
2. What are the most expensive neighbourhoods in Munich? 
3. What factors influence the price most?

In [None]:
# Import all the libraries which will be needed later
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error, median_absolute_error
from sklearn.ensemble import RandomForestRegressor
from sklearn.utils import shuffle

%matplotlib inline

# 2. Data Understanding
Data understanding is about getting a first overview of the available data and its quality. Problems with the quality and reliability of the existing data in relation to the task defined in the previous phase must be identified. All observations are noted so that appropriate corrections can be implemented in the data preparation phase.

## 2.1. Load the relevant dataset  
The CSV files available from Inside Airbnb (http://data.insideairbnb.com/germany/bv/munich/2019-11-25/data/) were stored in a separate dataset (https://www.kaggle.com/chriskue/munich-airbnb-data) for further analysis.

The data set contains the following files:
* **listings.csv**: descriptions and review score
* **calendar.csv**: listing id, price and availability for the upcoming year
* **reviews.csv**: unique id for each reviewer and detailed comments

In [None]:
df_calendar = pd.read_csv("../input/munich-airbnb-data/calendar.csv")
df_listings = pd.read_csv("../input/munich-airbnb-data/listings.csv")
df_reviews = pd.read_csv("../input/munich-airbnb-data/reviews.csv")

## 2.2. Understand the underlying data
To get a better understanding of the data we will look at the features and the overall quality of the data (e.g. missing values).
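The missing-value check that is repeated for each data set below can be wrapped in a small helper. This is only a sketch: the function name `missing_summary` and the toy data are made up for illustration and are not part of the original notebook.

```python
import pandas as pd


def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return the count and percentage of missing values per column."""
    missing = df.isnull().sum()
    percent = (missing * 100 / len(df)).round(3)
    summary = pd.DataFrame({"missing": missing, "percent": percent})
    # Most incomplete columns first
    return summary.sort_values("missing", ascending=False)


# Toy data, invented for illustration only
toy = pd.DataFrame({
    "listing_id": [1, 2, 3, 4],
    "price": ["$80.00", None, "$120.00", "$95.00"],
    "adjusted_price": [None, None, "$120.00", "$95.00"],
})
summary = missing_summary(toy)
print(summary)
```

Returning a DataFrame instead of printing line by line makes it easy to sort or filter the result, for example to list only columns above a given share of missing values.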

### 2.2.1. Dataset "calendar.csv"

In [None]:
# Take a look at a concise summary of the DataFrame 'calendar'
df_calendar.info()

In [None]:
# Show the first 5 rows of the data set
df_calendar.head(5)

In [None]:
# List all features in this data set and show the number of missing values
missing = df_calendar.isnull().sum()
# Series.iteritems() was removed in pandas 2.0; items() is the equivalent
for key, value in missing.items():
    percent = round(value * 100 / len(df_calendar), 3)
    print(key, ", ", value, "(", percent, "%)")

**Summary after analysing the dataset "calendar"**  
The data set consists of 7 features and a total of 4,190,565 rows.
The overall quality is good; only the features 'price' and 'adjusted_price' have missing data (171).

For further analysis the following data cleaning is required:
- Drop feature 'adjusted_price'.
- Remove the rows with missing data in 'price' (number of affected rows is low ~0.004%).
- The feature 'price' needs to be converted to numerical value
- The feature 'date' needs to be converted to datetime format
- The feature 'available' needs to be converted into Boolean data type
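The cleaning steps listed above can be sketched on a tiny, made-up calendar frame. The helper name `clean_calendar` and the sample rows are hypothetical; only the column names are taken from the data set.

```python
import pandas as pd


def clean_calendar(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the cleaning steps listed above to a calendar-like frame."""
    out = df.drop(columns=["adjusted_price"])
    # Drop only the rows where 'price' is missing
    out = out.dropna(subset=["price"]).copy()
    # "$1,250.00" -> 1250.0
    out["price"] = out["price"].str.replace(r"[$,]", "", regex=True).astype(float)
    out["date"] = pd.to_datetime(out["date"])
    out["available"] = out["available"].map({"t": True, "f": False})
    return out


# Toy rows, invented for illustration only
toy_calendar = pd.DataFrame({
    "listing_id": [1, 1, 2],
    "date": ["2019-11-25", "2019-11-26", "2019-11-25"],
    "available": ["t", "f", "t"],
    "price": ["$1,250.00", None, "$80.00"],
    "adjusted_price": ["$1,250.00", None, "$80.00"],
})
cleaned = clean_calendar(toy_calendar)
print(cleaned.dtypes)
```

Note that `dropna(subset=["price"])` removes a row as soon as its price is missing, whereas `dropna(how="all")` would only remove rows in which every column is empty.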

### 2.2.2. Dataset "listings.csv"

In [None]:
# Take a look at a concise summary of the DataFrame 'listings'
df_listings.info()

In [None]:
# Show the first 5 rows of the data set
df_listings.head(5)

In [None]:
# List all features in this data set and show the number of missing values
missing = df_listings.isnull().sum()
# Series.iteritems() was removed in pandas 2.0; items() is the equivalent
for key, value in missing.items():
    percent = round(value * 100 / len(df_listings), 3)
    print(key, ", ", value, "(", percent, "%)")

In [None]:
# Count distinct observations per feature
df_listings.nunique()

**Summary after analysing the dataset "listings"**  
The dataset consists of 106 features and a total of 11,481 rows.  
The overall quality is not good for all features, so it must be considered how to deal with the missing values. Filling does not look promising in all cases, so features whose share of missing values exceeds a threshold should be dropped.
Some features have only constant values and don't help us any further.

For further analysis the following data cleaning is required:  
- Drop features with constant values
- Drop features with more than 50% missing data
- Fill missing numerical data with the mean value
- Convert features to a usable data type (e.g. price)
- Drop features that do not provide us with any useful information (for our specific questions)

Note: Consider whether the outliers in feature 'price' should be dropped.
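The first three cleaning rules from the list above can be tried on a small, invented frame. This is a sketch under assumptions: `basic_clean` and the toy columns are hypothetical, and plain pandas `fillna` is used here in place of an imputer class.

```python
import numpy as np
import pandas as pd


def basic_clean(df: pd.DataFrame, max_missing: float = 0.5) -> pd.DataFrame:
    """Drop constant and mostly-empty columns, then mean-fill numeric gaps."""
    # Drop constant columns (nunique() ignores NaN)
    out = df.loc[:, df.nunique() != 1]
    # Drop columns whose share of missing values exceeds the threshold
    out = out.loc[:, out.isnull().mean() <= max_missing].copy()
    # Fill the remaining numerical gaps with the column mean
    num_cols = out.select_dtypes(include=np.number).columns
    out[num_cols] = out[num_cols].fillna(out[num_cols].mean())
    return out


# Toy data, invented for illustration only
toy_listings = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "country": ["Germany"] * 5,                            # constant
    "square_feet": [np.nan, np.nan, np.nan, 500.0, 650.0],  # 60% missing
    "bedrooms": [1.0, np.nan, 3.0, 2.0, 2.0],               # one gap
})
cleaned_listings = basic_clean(toy_listings)
print(cleaned_listings)
```

Mean filling keeps the number of rows but pulls the filled values toward the center of the distribution, so for skewed features the median may be worth comparing.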

### 2.2.3. Dataset "reviews.csv"


In [None]:
# Take a look at a concise summary of the DataFrame 'reviews'
df_reviews.info()

In [None]:
# Show the first 5 rows of the data set
df_reviews.head(5)

In [None]:
# List all features in this data set and show the number of missing values
missing = df_reviews.isnull().sum()
# Series.iteritems() was removed in pandas 2.0; items() is the equivalent
for key, value in missing.items():
    percent = round(value * 100 / len(df_reviews), 3)
    print(key, ", ", value, "(", percent, "%)")

In [None]:
# Plot the number of reviews over time to see any patterns
df_reviews_plot = df_reviews.groupby('date')['id'].count().reset_index()
df_reviews_plot['date'] = pd.to_datetime(df_reviews_plot['date'])
df_reviews_plot = df_reviews_plot.sort_values('date')
# Smooth the daily counts with a rolling mean over 30 observed days
df_reviews_plot["rolling_mean"] = df_reviews_plot['id'].rolling(window=30).mean()

plt.figure(figsize=(20, 10));
plt.plot(df_reviews_plot.date, df_reviews_plot.rolling_mean);

plt.title("Number of reviews by date (30-day rolling mean)");
plt.xlabel("time");
plt.ylabel("reviews");
plt.grid()

**Summary after analysing the dataset "reviews"**  
The dataset consists of 6 features and a total of 175,561 rows.  
The overall quality is good; only the feature 'comments' has missing data (74).  
At first sight the data set cannot be used to answer the questions. Nonetheless, the number of reviews shows an interesting pattern which should also be examined.
Maybe the feature 'id' can be connected to the other data sets 'calendar' or 'listings'.
If I used the data set for my further analysis, I would drop the rows with missing data in 'comments' (the number of affected rows is low, ~0.04%) and convert the feature 'date' into the 'DateTime' data type.
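One way the reviews could be connected to the listings is sketched below. It assumes that the reviews table carries a `listing_id` column that refers to the `id` column of the listings table; this should be verified against the output of `df_reviews.info()` before relying on it. All rows shown are invented.

```python
import pandas as pd

# Toy data, invented for illustration only
toy_reviews = pd.DataFrame({
    "listing_id": [10, 10, 20],   # assumed foreign key to listings 'id'
    "id": [1, 2, 3],
    "date": ["2019-09-28", "2019-10-01", "2019-10-01"],
    "comments": ["Great stay", None, "Very central"],
})
toy_listings = pd.DataFrame({
    "id": [10, 20, 30],
    "neighbourhood_cleansed": ["Altstadt-Lehel", "Trudering-Riem", "Allach-Untermenzing"],
})

# Clean-up described above: drop empty comments, parse the date
reviews = toy_reviews.dropna(subset=["comments"]).copy()
reviews["date"] = pd.to_datetime(reviews["date"])

# Count reviews per listing and attach the count to the listings
reviews_per_listing = (
    reviews.groupby("listing_id").size().rename("review_count").reset_index()
)
merged = toy_listings.merge(
    reviews_per_listing, how="left", left_on="id", right_on="listing_id"
)
# Listings without any review get a count of 0
merged["review_count"] = merged["review_count"].fillna(0).astype(int)
print(merged[["id", "neighbourhood_cleansed", "review_count"]])
```

A left join keeps listings that have no reviews at all, which an inner join would silently drop.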

# 3. Data Preparation
Data preparation creates the final data set that forms the basis for the modeling phase. Variables are encoded or transformed if necessary, and appropriate procedures for handling missing data are applied. Experience shows that this phase takes a large part of the project time. Only if the data is valid and reliable can the subsequent analysis deliver reliable results.

## 3.1. Preparation for Question 1
### "When is the most expensive time of the year to visit Munich and how much does the price spike?"

In [None]:
# Copy the data to a new DataFrame for further clean up
df_calendar_clean = df_calendar.copy(deep=True)

In [None]:
# Clean up the data set "calendar" as the previous analysis pointed out

# Drop "adjusted_price"
df_calendar_clean = df_calendar_clean.drop("adjusted_price", axis = 1)

# Remove rows with a missing price
# (how='all' would only drop rows in which every column is empty)
df_calendar_clean.dropna(subset=['price'], inplace=True)

# Convert the data type of feature 'date' from object to DateTime
df_calendar_clean['date'] = pd.to_datetime(df_calendar_clean['date'])

# Clean up the format of the 'price' values (strip "$" and thousands separators)
df_calendar_clean['price'] = df_calendar_clean['price'].replace(r'[\$,]', '', regex=True).astype(float)

# Convert the feature 'available' to boolean data type
# (This conversion is actually not necessary for further analysis)
df_calendar_clean['available'] = df_calendar_clean['available'].map({'t': True, 'f': False})

In [None]:
# Group the data by mean price per date
df_calendar_clean = df_calendar_clean.groupby('date')['price'].mean().reset_index()

In [None]:
# Plot the mean price over time
plt.figure(figsize=(20, 10));
plt.plot(df_calendar_clean.date, df_calendar_clean.price);

plt.title("Mean price by date");
plt.xlabel("date");
plt.ylabel("price");
plt.grid();

In [None]:
# Create a new feature 'month'
df_calendar_clean["month"] = df_calendar_clean["date"].dt.month

# Show a Boxplot to see the price distribution per month
plt.figure(figsize=(20, 10))
sns.boxplot(x='month', y='price',
            data=df_calendar_clean).set_title('Distribution of price per month');

In [None]:
# Create a new feature 'weekday'
# (dt.weekday_name was removed in pandas 1.0; dt.day_name() is the replacement)
df_calendar_clean["weekday"] = df_calendar_clean["date"].dt.day_name()

# Show a Boxplot to see the price distribution per weekday
plt.figure(figsize=(20, 10))
sns.boxplot(x='weekday', y='price',
            data=df_calendar_clean).set_title('Distribution of price per weekday');

In [None]:
# Show the statistics per weekday
df_calendar_clean.groupby(['weekday'])['price'].describe()

In [None]:
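# Show the statistics per week of the year
# (dt.week was removed in pandas 2.0; dt.isocalendar().week is the replacement)
df_calendar_clean["week"] = df_calendar_clean["date"].dt.isocalendar().week.astype(int)
df_calendar_clean.groupby(['week'])['price'].describe()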

In [None]:
# Show a Boxplot to see the price distribution per week
plt.figure(figsize=(20, 10))
sns.boxplot(x='week', y='price',
            data=df_calendar_clean).set_title('Distribution of price per week');

### Summary for Question 1: "When is the most expensive time of the year to visit Munich and how much does the price spike?"

Regardless of the location or other properties of the apartment, the following can be observed:
- **There are two periods in the year in which the price differs significantly.**  
    The mean price in the year is ~114 USD.  
    a) In weeks 39 and 40 the price increases on average by ~13% to ~129 USD. It looks like there is an Oktoberfest effect at Airbnb.  
    b) In weeks 48 and 49 the price drops on average by ~18% (from ~114 USD to ~94 USD). The reason for this drop is not obvious and requires additional research.
- **The mean price rises slightly at the weekends (Friday and Saturday).**  
    During the week the mean price is around ~112 USD and on the weekend around ~114 USD; an increase of ~ 2%.   
- **The price of apartments also rises during certain times of the year.**  
    It is believed that both the German holidays and large exhibitions could have an impact on this.  
    But this information cannot be obtained from the examined data set and would need additional analysis (e.g. a comparison with the holiday calendar).

**Answer:**  
The most expensive time of the year 2020 is between the end of September and the beginning of October, during the Oktoberfest.
The mean price spikes by around +15 USD (~13%) compared to the yearly mean.
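The absolute and relative differences quoted above follow from comparing each weekly mean with the yearly mean. The sketch below shows that arithmetic on invented prices; the helper name `weekly_deviation` and the numbers are hypothetical and do not reproduce the notebook's results.

```python
import pandas as pd


def weekly_deviation(daily: pd.DataFrame) -> pd.DataFrame:
    """Compare the mean price per ISO week with the overall mean price."""
    out = daily.copy()
    out["week"] = out["date"].dt.isocalendar().week.astype(int)
    weekly = out.groupby("week")["price"].mean().to_frame("mean_price")
    overall = out["price"].mean()
    # Absolute difference in currency units and relative difference in percent
    weekly["diff_abs"] = weekly["mean_price"] - overall
    weekly["diff_pct"] = weekly["diff_abs"] / overall * 100
    return weekly


# Toy daily mean prices, invented for illustration only
toy_daily = pd.DataFrame({
    "date": pd.to_datetime(["2020-09-21", "2020-09-22", "2020-11-23", "2020-11-24"]),
    "price": [130.0, 130.0, 90.0, 90.0],
})
deviation = weekly_deviation(toy_daily)
print(deviation.round(1))
```

Both percentages use the yearly mean as the base, so a rise and a drop of the same absolute size yield the same percentage with opposite sign.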

## 3.2. Preparation for Question 2/3
### "What are the most expensive neighbourhoods in Munich?" and "What factors influence the price most?"

In [None]:
# Copy the data to a new DataFrame for further clean up
df_listings_clean = df_listings.copy(deep=True)

In [None]:
# Clean up the data set "listings" as the previous analysis pointed out

# Drop features which are not used further 
features_to_drop = ['listing_url', 'picture_url','host_url', 'host_thumbnail_url', 'host_picture_url',
                    'name', 'summary', 'space', 'neighborhood_overview', 'transit', 'interaction', 'description',
                    'host_name', 'host_location', 'host_neighbourhood', 'street', 'last_scraped', 'zipcode',
                    'calendar_last_scraped', 'first_review', 'last_review', 'host_since', 'calendar_updated',
                    'experiences_offered', 'state', 'country', 'country_code', 'city', 'market',
                    'host_total_listings_count', 'smart_location']
df_listings_clean.drop(features_to_drop, axis=1, inplace=True)

In [None]:
# Remove constant features (features with exactly one unique value)
df_listings_clean = df_listings_clean.loc[:, df_listings_clean.nunique() != 1]

# Drop features with more than 50% missing values
more_than_50 = list(df_listings_clean.columns[df_listings_clean.isnull().mean() > 0.5])
df_listings_clean.drop(more_than_50, axis=1, inplace=True)

# Clean up the format of the currency values (strip "$" and thousands separators)
df_listings_clean['price'] = df_listings_clean['price'].replace(r'[\$,]', '', regex=True).astype(float)
df_listings_clean['extra_people'] = df_listings_clean['extra_people'].replace(r'[\$,]', '', regex=True).astype(float)
df_listings_clean['cleaning_fee'] = df_listings_clean['cleaning_fee'].replace(r'[\$,]', '', regex=True).astype(float)

# Convert rates type from string to float and remove the % sign
df_listings_clean['host_response_rate'] = df_listings_clean['host_response_rate'].str.replace('%', '').astype(float)
df_listings_clean['host_response_rate'] = df_listings_clean['host_response_rate'] * 0.01

# Convert boolean data from string data type to boolean
boolean_features = ['instant_bookable', 'require_guest_profile_picture',
                    'require_guest_phone_verification', 'is_location_exact', 'host_is_superhost',
                    'host_has_profile_pic', 'host_identity_verified']
df_listings_clean[boolean_features] = df_listings_clean[boolean_features].replace({'t': True, 'f': False})

# Fill numerical missing data with mean value
numerical_feature = df_listings_clean.select_dtypes(np.number)
numerical_columns = numerical_feature.columns

imp_mean = SimpleImputer(missing_values = np.nan, strategy = 'mean')
imp_mean = imp_mean.fit(numerical_feature)

df_listings_clean[numerical_columns] = imp_mean.transform(df_listings_clean[numerical_columns])
     
# Remove all remaining missing values  
df_listings_clean.dropna(inplace=True)

In [None]:
# Show price statistic for each neighbourhood  
df_listings_clean.groupby(["neighbourhood_cleansed"])["price"].describe()

In [None]:
# Create new feature 'mean' with the mean price per neighbourhood
df_listings_clean['mean'] = df_listings_clean.groupby('neighbourhood_cleansed')['price'].transform('mean')

In [None]:
# Plot the mean price per neighbourhood, most expensive first
df_listings_plot = df_listings_clean
df_listings_plot = df_listings_plot.groupby('neighbourhood_cleansed')[['price']].mean()
df_listings_plot = df_listings_plot.reset_index()
df_listings_plot = df_listings_plot.sort_values(by='price',ascending=False)
df_listings_plot.plot.bar(x='neighbourhood_cleansed', y='price', color='blue', rot=90, figsize = (20,10)).set_title('Mean Price per Neighbourhood');

In [None]:
# Since we also have the geo data (latitude and longitude) of the apartments we can create a map.
# The "open-street-map" style needs no access token (the Stamen styles now require a Stadia Maps account).
fig = px.scatter_mapbox(df_listings_clean, color="mean", lat='latitude', lon='longitude',
                        center=dict(lat=48.137154, lon=11.576124), zoom=10,
                        mapbox_style="open-street-map", width=1000, height=800)
fig.show()

### Summary for Question 2: "What are the most expensive neighbourhoods in Munich?"

From the analysis, there is a clear difference in the costs between the different neighbourhoods in Munich.   
In general, the closer the apartment is to the city center or to the Munich trade fair (Messe München), the higher the price.  
Furthermore the outliers must also be taken into account, since there are individual apartments in which the price deviates considerably.

**Answer:**  
The TOP 3 most expensive neighbourhoods in Munich (on average) are:  
- Altstadt-Lehel 
- Trudering-Riem
- Allach-Untermenzing
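Because individual apartments with extreme prices can pull the mean of a neighbourhood upwards, it can help to look at the median next to the mean. The following sketch uses invented neighbourhood labels ("A", "B") and prices purely for illustration.

```python
import pandas as pd

# Toy data, invented for illustration only: "A" contains one extreme price
toy = pd.DataFrame({
    "neighbourhood_cleansed": ["A", "A", "A", "B", "B", "B"],
    "price": [100.0, 110.0, 900.0, 150.0, 160.0, 170.0],
})

# Mean, median and number of listings per neighbourhood
stats = toy.groupby("neighbourhood_cleansed")["price"].agg(["mean", "median", "count"])

top_by_mean = stats.sort_values("mean", ascending=False).index.tolist()
top_by_median = stats.sort_values("median", ascending=False).index.tolist()

print(stats)
print("Ranking by mean:  ", top_by_mean)
print("Ranking by median:", top_by_median)
```

In this toy example the ranking flips: "A" leads by mean only because of a single 900 USD listing, while "B" is more expensive for the typical apartment. If the two rankings differ on the real data, the outliers deserve a closer look.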

# 4. Modeling
In the modeling phase, the data mining methods suitable for the task are applied to the data set created during data preparation. Typically, several models are created and their parameters are varied and optimized. If predictive models are built, this is referred to as predictive analytics. Which of the many possible data mining methods is applicable largely depends on the question.

In [None]:
# Copy the data to a new DataFrame for encoding 
df_listings_encoded = df_listings_clean.copy(deep=True)

In [None]:
# Show the remaining features and the data type
df_listings_encoded.info()

In [None]:
# Remove outliers of the feature 'price': keep only rows below the 90% quantile
price_cap = df_listings_encoded["price"].quantile(0.90)
df_listings_encoded = df_listings_encoded[df_listings_encoded["price"] < price_cap]

In [None]:
# Encode features for use in machine learning model

# Encode feature 'amenities' and concat the data
# (strip braces and quotes; regex=True is required since pandas 2.0 treats patterns as literal by default)
df_listings_encoded.amenities = df_listings_encoded.amenities.str.replace(r'[{}"]', "", regex=True)
df_amenities = df_listings_encoded.amenities.str.get_dummies(sep = ",")
df_listings_encoded = pd.concat([df_listings_encoded, df_amenities], axis=1) 

# Encode feature 'host_verifications' and concat the data
# (strip brackets, quotes and spaces so that "['email', 'phone']" becomes "email,phone")
df_listings_encoded.host_verifications = df_listings_encoded.host_verifications.str.replace(r"[\[\]' ]", "", regex=True)
df_verification = df_listings_encoded.host_verifications.str.get_dummies(sep = ",")
df_listings_encoded = pd.concat([df_listings_encoded, df_verification], axis=1)
    
# Encode feature 'host_response_time'
dict_response_time = {'within an hour': 1, 'within a few hours': 2, 'within a day': 3, 'a few days or more': 4}
df_listings_encoded['host_response_time'] = df_listings_encoded['host_response_time'].map(dict_response_time)

# Encode the remaining categorical feature 
for categorical_feature in ['neighbourhood_cleansed', 'property_type', 'room_type', 'bed_type', 'neighbourhood', 
                            'cancellation_policy']:
    df_listings_encoded = pd.concat([df_listings_encoded, 
                                     pd.get_dummies(df_listings_encoded[categorical_feature])],axis=1)
        
# Drop features
df_listings_encoded.drop(['amenities', 'neighbourhood_cleansed', 'property_type', 'room_type', 'bed_type', 
                          'host_verifications', 'neighbourhood','cancellation_policy','security_deposit',
                          'id', 'host_id', 'mean', 'latitude', 'longitude'],
                         axis=1, inplace=True)

In [None]:
# Last check if there are any missing values in the data set
df_listings_encoded.isnull().sum().sum()

In [None]:
# Shuffle the data to ensure a good distribution (fixed seed for reproducibility)
df_listings_encoded = shuffle(df_listings_encoded, random_state=42)

X = df_listings_encoded.drop(['price'], axis=1)
y = df_listings_encoded['price']

# Split the data into random train and test subsets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)

In [None]:
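# Initialize the model
# (criterion='mse' was renamed to 'squared_error' in scikit-learn 1.0)
model = RandomForestRegressor(max_depth=15, n_estimators=100, criterion='squared_error', random_state=42)
# Fit the model on the training data only, so the test set stays unseen
model.fit(X_train, y_train)

# Predict results
prediction = model.predict(X_test)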

# 5. Evaluation

The evaluation compares the created models against the task and selects the most suitable one. Various measures of model quality are used, and often a balance has to be struck between goodness of fit, model complexity and applicability. Based on the results, earlier phases are repeated or the last phase of the CRISP-DM model is initiated.

For the evaluation we look at the coefficient of determination (r squared value) of the training set and the test set, so we can compare the quality of the model.
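The comparison of training and test scores only says something about generalization if the model never sees the test rows during fitting. The following self-contained sketch illustrates the pattern on synthetic data; the data-generating formula and all numbers are assumptions made for this example and have nothing to do with the Airbnb data.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic data: a target that depends on two of three features plus noise
rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(400, 3))
y = 20 * X[:, 0] + 5 * X[:, 1] + rng.normal(0, 10, size=400)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit on the training split only
model = RandomForestRegressor(n_estimators=100, max_depth=15, random_state=42)
model.fit(X_train, y_train)

train_r2 = r2_score(y_train, model.predict(X_train))
test_r2 = r2_score(y_test, model.predict(X_test))
test_rmse = np.sqrt(mean_squared_error(y_test, model.predict(X_test)))

print("R^2 train:", round(train_r2, 2))
print("R^2 test: ", round(test_r2, 2))
print("RMSE test:", round(test_rmse, 2))
```

A training score that is somewhat higher than the test score is normal for a random forest. A large gap hints at overfitting, while identical scores suggest that the test rows were part of the training data.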

In [None]:
# Evaluate the result - compare r squared of the training set with the test set

# Find R^2 on training set
print("Training Set:")
print("R_squared:", round(model.score(X_train, y_train) ,2))

# Find R^2 on testing set
print("\nTest Set:")
print("R_squared:", round(model.score(X_test, y_test), 2))

In [None]:
# Scatter plot of the actual vs predicted data
plt.figure(figsize=(10, 10))
plt.grid()
plt.xlim((0, 200))
plt.ylim((0, 200))
plt.plot([0,200],[0,200], color='#AAAAAA', linestyle='dashed')
plt.scatter(y_test, prediction, alpha=0.5)
coef = np.polyfit(y_test,prediction,1)
poly1d_fn = np.poly1d(coef) 
plt.plot(y_test, poly1d_fn(y_test))
plt.title('Actual vs. Predicted data');
plt.xlabel("Actual values")
plt.ylabel("Predicted values")
plt.show()

In [None]:
# Sort the features by importance (highest score first)
feature_importances = pd.DataFrame({"feature": X_train.columns, "score": model.feature_importances_})
feature_importances = feature_importances.sort_values(by="score", ascending=False).reset_index(drop=True)

features = feature_importances['feature'][:10]
y_feature = np.arange(len(features))
score = feature_importances['score'][:10]

# Plot the importance of a feature to the price
plt.figure(figsize=(20,10));
plt.bar(y_feature, score, align='center');
plt.xticks(y_feature, features, rotation='vertical');
plt.xlabel('Features');
plt.ylabel('Score');
plt.title('Importance of features (TOP 10)');

### Summary for Question 3: "What factors influence the price most?"

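The run reported here reached a coefficient of determination (R^2) of 0.87 on both the training set and the test set, which at first sight seems accurate enough to predict the price of an apartment. Identical scores are a warning sign rather than a confirmation, though: they arise when the model has seen the test rows during fitting, so the score on genuinely unseen data is likely lower and should be re-checked before relying on the predictions.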

**Answer:**  
The TOP 5 factors of an apartment that have the greatest influence on the price are:
- Accommodates
- Entire home/ Apartment
- Extra people
- Number of reviews
- Guests included
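The ranking above is based on the impurity-based `feature_importances_` of the random forest. As a cross-check, scikit-learn also offers permutation importance, which measures how much the test score drops when a single feature is shuffled. The sketch below runs it on synthetic data; the column names are borrowed from the listings data only as labels, and the values and the price formula are invented.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data: only 'accommodates' actually drives the target
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "accommodates": rng.integers(1, 9, size=500),
    "number_of_reviews": rng.integers(0, 200, size=500),
    "noise": rng.normal(size=500),
})
y = 25 * X["accommodates"] + rng.normal(0, 5, size=500)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# Shuffle each feature several times on the test set and record the score drop
result = permutation_importance(
    model, X_test, y_test, n_repeats=5, random_state=0
)
ranking = pd.Series(result.importances_mean, index=X.columns)
ranking = ranking.sort_values(ascending=False)
print(ranking)
```

If both methods agree on the top features, the ranking is more trustworthy. Either way, importance describes what the model uses for its predictions and does not by itself prove a causal effect on the price.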

# 6. Deployment

The deployment is the final phase of the CRISP-DM process.   
In this phase, the results obtained are summarized, processed and presented in an understandable way so that they can feed into the client's decision-making process.

This blog post and the repository on GitHub (https://github.com/noema-git/airbnb-analysis) are the deployment of this work. To improve the quality of the model, the CRISP-DM cycle should be run through again with adjusted parameters.

![https://upload.wikimedia.org/wikipedia/commons/thumb/b/b9/CRISP-DM_Process_Diagram.png/240px-CRISP-DM_Process_Diagram.png](https://upload.wikimedia.org/wikipedia/commons/thumb/b/b9/CRISP-DM_Process_Diagram.png/240px-CRISP-DM_Process_Diagram.png)

# Summary and conclusion

This notebook analyzes Airbnb data for Munich, Germany, to answer the following questions.

### When is the most expensive time of the year to visit Munich and how much does the price spike?
**Answer:**   
The most expensive time of the year 2020 is between the end of September and the beginning of October, during the Oktoberfest. The average price spikes by around +15 USD.
### What are the most expensive neighbourhoods in Munich?  
**Answer:**   
The TOP 3 most expensive neighbourhoods in Munich (on average) are:
- Altstadt-Lehel
- Trudering-Riem
- Allach-Untermenzing

### What factors influence the price most?
**Answer:**  
The TOP 5 factors of an apartment that have the greatest influence on the price are:
- Accommodates
- Entire home/ Apartment
- Extra people
- Number of reviews
- Guests included

This analysis was done as a project within the Udacity Data Science Nanodegree.   
Any kind of optimization is very welcome. I really appreciate any feedback for improvement.  

The underlying Jupyter Notebook to this evaluation can be found on GitHub (https://github.com/noema-git/airbnb-analysis)