Exploratory Data Analysis on AirBnb ratings & reviews dataset

In [None]:
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns 
import pandas as pd
import scipy.stats

df = pd.read_csv('/Users/v1teka/airbnb-ratings-dataset/LA_Listings.csv', 
                 encoding='ISO-8859-1')
df.head()

In [None]:
# Read the NY_Listings.csv
df2 = pd.read_csv('/Users/v1teka/airbnb-ratings-dataset/NY_Listings.csv', 
                  encoding='ISO-8859-1')
df2.head()

In [None]:
# Read the airbnb_ratings_new.csv
df3 = pd.read_csv('/Users/v1teka/airbnb-ratings-dataset/airbnb_ratings_new.csv', 
                  encoding='ISO-8859-1')
pd.set_option('display.max_columns', None)
df3.head()

As prices vary a lot between countries, it is reasonable to consider only USA listings.

In [None]:
df_filtered = df3[df3['Country'] == 'United States']

df_filtered.head()
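To see why mixing countries can distort a price analysis, here is a minimal sketch on a made-up table. The rows and prices below are illustrative assumptions, not values from `airbnb_ratings_new.csv`. It compares the median price per country and then applies the same `Country` filter as above.

```python
import pandas as pd

# Toy listings; all values are invented for illustration.
toy = pd.DataFrame({
    'Country': ['United States', 'United States', 'France', 'France', 'Japan'],
    'Price': [120, 200, 80, 95, 9000],
})

# Median price per country shows how far the price scales can drift apart.
median_by_country = toy.groupby('Country')['Price'].median()

# Same filter as in the notebook: keep only United States listings.
us_only = toy[toy['Country'] == 'United States']

print(median_by_country)
print(len(us_only))
```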

In [None]:
df.describe()

In [None]:
df2.describe()

In [None]:
df_filtered.describe()

Now, let's combine those three datasets into one:

In [None]:
# DataFrame.append was removed in pandas 2.0; pd.concat stacks the rows instead
combinedDf = pd.concat([df, df2])
df_final = pd.concat([combinedDf, df_filtered])

df_final.describe()

Now, 'df_final' has 295,452 rows of data and is ready to use.
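A quick sanity check for the combination step, sketched on small made-up frames (the names `la`, `ny`, `other` and their values are illustrative assumptions). When tables are stacked row-wise, the row counts should add up exactly, and any column that exists in only some of the inputs is filled with NaN for the others.

```python
import pandas as pd

# Toy stand-ins for the three listing tables (values are made up).
la = pd.DataFrame({'Price': [100, 250], 'Bedrooms': [1, 3]})
ny = pd.DataFrame({'Price': [180], 'Bedrooms': [2]})
other = pd.DataFrame({'Price': [90, 400], 'Bedrooms': [1, 4],
                      'Country': ['United States', 'United States']})

# Stack the rows; ignore_index gives the result a fresh 0..n-1 index.
combined = pd.concat([la, ny, other], ignore_index=True)

# The combined row count should equal the sum of the input row counts.
rows_match = len(combined) == len(la) + len(ny) + len(other)

# 'Country' is absent from la and ny, so those rows hold NaN.
missing_country = int(combined['Country'].isna().sum())

print(rows_match, missing_country)
```

If the real tables use different column names for the same quantity, those columns would end up side by side and half empty, so a look at `isna().sum()` per column after combining may be worthwhile.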

Let's do exploratory data analysis by examining the correlation between Price and the number of bedrooms, the number of bathrooms, and the review scores.

In [None]:
# Density Plot and Histogram of variable "Price"
sns.distplot(df_final['Price'], hist=True, kde=True, 
             bins=int(180/5), color = 'darkblue', 
             hist_kws={'edgecolor':'black'},
             kde_kws={'linewidth': 1})

From the graph we can see that 'Price' has a very large range, so we need to filter out the extreme values that could distort our analysis. After inspecting the data, we found that a range from 0 to 500 is appropriate.
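One way to support a cutoff like 500 is to check what share of listings it keeps and where the upper quantiles sit. The sketch below uses synthetic log-normal prices as a stand-in for `df_final['Price']`. The distribution parameters are assumptions chosen only to produce a right-skewed shape.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic right-skewed prices; NOT the real data.
prices = pd.Series(rng.lognormal(mean=4.8, sigma=0.7, size=10_000), name='Price')

cutoff = 500

# Fraction of listings that survive the "Price < cutoff" filter.
share_kept = (prices < cutoff).mean()

# Upper quantiles show where the long tail begins.
upper_quantiles = prices.quantile([0.90, 0.95, 0.99])

print(f"share of listings below {cutoff}: {share_kept:.3f}")
print(upper_quantiles)
```

On the real data, the same two quantities would show how many listings the filter discards.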

In [None]:
# Filter the Price to below 500
PriceFilteredData = df_final[df_final['Price'] < 500]

# Density Plot and Histogram of variable "Price"
sns.distplot(PriceFilteredData['Price'], hist=True, kde=True, 
             bins=int(180/5), color = 'darkblue', 
             hist_kws={'edgecolor':'black'},
             kde_kws={'linewidth': 1})

Now, let's see the distribution of numbers of Bedrooms:

In [None]:
# Density Plot and Histogram of variable "Bedrooms"
sns.distplot(df_final['Bedrooms'], hist=True, kde=True, 
             bins=int(180/5), color = 'darkblue', 
             hist_kws={'edgecolor':'black'},
             kde_kws={'linewidth': 1})

We can see that most listings have fewer than 6 bedrooms, so let's filter the data accordingly:

In [None]:
# Filter the Bedrooms to below 6
BedroomsFilteredData = df_final[df_final['Bedrooms'] < 6]

# Histogram of variable "Bedrooms" (plot the frame filtered above)
sns.distplot(BedroomsFilteredData['Bedrooms'], hist=True, kde=False, 
             bins=int(180/5), color = 'darkblue', 
             hist_kws={'edgecolor':'black'},
             kde_kws={'linewidth': 1})

From this graph, we can see that the distribution of the number of bedrooms and the distribution of 'Price' have very similar shapes, which suggests a possible relationship between them; we will investigate this further later. Before that, let's plot the distributions of some other variables.
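Similar histogram shapes are only a hint, because two variables can have the same marginal shape and still be unrelated. A direct check is a rank correlation on the paired values. The sketch below uses synthetic data, where the link between bedrooms and price is built in by assumption. It compares the Spearman coefficient for correctly paired values with the coefficient after shuffling the prices, which keeps the price distribution unchanged but breaks the pairing.

```python
import numpy as np
import pandas as pd
import scipy.stats

rng = np.random.default_rng(1)

# Synthetic listings: price rises with bedroom count, plus noise.
bedrooms = rng.integers(0, 6, size=2_000)
price = 60 + 45 * bedrooms + rng.normal(0, 40, size=2_000)
toy = pd.DataFrame({'Bedrooms': bedrooms, 'Price': price})

# Same set of prices, but randomly reassigned to listings.
shuffled_price = rng.permutation(toy['Price'].to_numpy())

rho_paired, p_paired = scipy.stats.spearmanr(toy['Bedrooms'], toy['Price'])
rho_shuffled, p_shuffled = scipy.stats.spearmanr(toy['Bedrooms'], shuffled_price)

print(f"paired:   rho = {rho_paired:.2f}")
print(f"shuffled: rho = {rho_shuffled:.2f}")
```

The histograms of `Price` are identical in both cases; only the paired version shows a strong rank correlation.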

In [None]:
# Filter the Bathrooms to below 6
BathroomsFilteredData = df_final[df_final['Bathrooms'] < 6]

# Histogram of variable "Bathrooms"
sns.distplot(BathroomsFilteredData['Bathrooms'], hist=True, kde=False, 
             bins=int(180/5), color = 'darkblue', 
             hist_kws={'edgecolor':'black'},
             kde_kws={'linewidth': 1})

Let's look at the distributions of more variables:

In [None]:
# Density Plot and Histogram of variable "Availability 365"
sns.distplot(df_final['Availability 365'], hist=True, kde=True, 
             bins=int(180/5), color = 'darkblue', 
             hist_kws={'edgecolor':'black'},
             kde_kws={'linewidth': 1})

In [None]:
# Density Plot and Histogram of variable "Review Scores Value"
sns.distplot(df_final['Review Scores Value'], hist=True, kde=True, 
             bins=int(180/5), color = 'darkblue', 
             hist_kws={'edgecolor':'black'},
             kde_kws={'linewidth': 1})

In [None]:
# Density Plot and Histogram of variable "Reviews per month"
sns.distplot(df_final['Reviews per month'], hist=True, kde=True, 
             bins=int(180/5), color = 'darkblue', 
             hist_kws={'edgecolor':'black'},
             kde_kws={'linewidth': 1})

Let's filter this variable:

In [None]:
# Filter the Reviews per month to below 10
ReviewsFilteredData = df_final[df_final['Reviews per month'] < 10]


# Density Plot and Histogram of variable "Reviews per month"
sns.distplot(ReviewsFilteredData['Reviews per month'], hist=True, 
             kde=True, 
             bins=int(180/5), color = 'darkblue', 
             hist_kws={'edgecolor':'black'},
             kde_kws={'linewidth': 1})

In [None]:
# Density Plot and Histogram of variable "Number of reviews"
sns.distplot(df_final['Number of reviews'], hist=True, kde=True, 
             bins=int(180/5), color = 'darkblue', 
             hist_kws={'edgecolor':'black'},
             kde_kws={'linewidth': 1})

Filter the data:

In [None]:
# Filter the Number of reviews to below 60
ReviewsFilteredData = df_final[df_final['Number of reviews'] < 60]


# Density Plot and Histogram of variable "Number of reviews"
sns.distplot(ReviewsFilteredData['Number of reviews'], hist=True, 
             kde=True, 
             bins=int(180/5), color = 'darkblue', 
             hist_kws={'edgecolor':'black'},
             kde_kws={'linewidth': 1})

We can see that most listings have 0 - 10 reviews.
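A reading like this can be backed with a number rather than taken from the plot alone. The sketch below uses synthetic review counts that are concentrated near zero. The geometric distribution is an assumption made for illustration, not a property of the dataset. It computes the share of listings with 0 to 10 reviews and tabulates the counts in bands of ten.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)

# Synthetic review counts, heavy near zero; NOT the real data.
toy = pd.DataFrame({'Number of reviews': rng.geometric(p=0.12, size=5_000) - 1})

# Share of listings with between 0 and 10 reviews (inclusive).
share_0_to_10 = toy['Number of reviews'].between(0, 10).mean()

# Counts per band; values of 60 or more fall outside the bins and become NaN,
# mirroring the "< 60" filter used above.
binned = pd.cut(toy['Number of reviews'],
                bins=[-1, 10, 20, 30, 40, 50, 59],
                labels=['0-10', '11-20', '21-30', '31-40', '41-50', '51-59'])
band_counts = binned.value_counts(sort=False)

print(f"share with 0-10 reviews: {share_0_to_10:.2%}")
print(band_counts)
```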

Next, let's run a _Spearman correlation test_ to look for potential correlations between the variables:

In [None]:
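# Apply the Price, Bedrooms, Bathrooms and Reviews-per-month filters in sequence
BedroomsFilteredData = PriceFilteredData[PriceFilteredData['Bedrooms'] < 6]
BathroomsFilteredData = BedroomsFilteredData[BedroomsFilteredData['Bathrooms'] < 6]
filteredData = BathroomsFilteredData[BathroomsFilteredData['Reviews per month'] < 10]
# Spearman rank correlation over the numeric columns only
filteredData.select_dtypes('number').corr(method='spearman')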

From the result table, we found that *'Price'* and *'Accommodates'* have a correlation coefficient of 0.55, which indicates that they are *moderately* correlated. *'Bedrooms'* has a correlation coefficient of 0.46 with *'Price'*, which is the second-highest value among all variables. This is understandable: the more bedrooms a listing has and the more people it can accommodate, the higher its price tends to be.
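The full correlation table is wide, so it can help to pull out only the `Price` column and rank the other variables by the strength of their coefficient. The sketch below does this on synthetic data. The relationships between the columns are built in by assumption, and the `City` column is a made-up example of a non-numeric field that has to be excluded before computing correlations.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(4)
n = 3_000

# Synthetic listings; the dependencies below are invented for illustration.
accommodates = rng.integers(1, 9, size=n)
bedrooms = np.clip(np.round(accommodates / 2 + rng.normal(0, 0.7, size=n)), 0, 5)
toy = pd.DataFrame({
    'Accommodates': accommodates,
    'Bedrooms': bedrooms,
    'Availability 365': rng.integers(0, 366, size=n),
    'Price': 40 + 25 * accommodates + rng.normal(0, 60, size=n),
    'City': rng.choice(['Los Angeles', 'New York'], size=n),  # non-numeric
})

# Spearman correlation over numeric columns only.
corr = toy.select_dtypes('number').corr(method='spearman')

# Coefficients against Price, strongest (by absolute value) first.
price_corr = corr['Price'].drop('Price').sort_values(key=abs, ascending=False)

print(price_corr)
```

Applied to `filteredData`, the same two lines would list the variables in the order discussed above, with 'Accommodates' first and 'Bedrooms' second.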