# DATA EXPLORATION AND VISUALISATION OF AIRBNB DATASET

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

Content
* Importing required libraries
* Reading of Data
* Dealing with Null and Empty values
* Dealing with Outliers
* Data Exploration and Visualisation of Numerical features
* Data Exploration and Visualisation of Categorical features
* Combined Exploration and Visualisation of all features

# IMPORTING LIBRARIES

These are the libraries commonly required for a data science / Kaggle beginner project.

In [None]:
#data wrangling
import pandas as pd
import numpy as np
import random as rnd

#data visualisation
import seaborn as sns 
import matplotlib.pyplot as plt
%matplotlib inline
import re
import unicodedata
import nltk
from wordcloud import WordCloud,STOPWORDS

#machine learning
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVC, LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

from sklearn.preprocessing import LabelEncoder

# READ DATA

In [None]:
data = pd.read_csv('/kaggle/input/us-airbnb-open-data/AB_US_2020.csv')

From the head and tail of the dataset, I learnt a few things:
* I could identify the categorical and numerical features.
* Feature [neighbourhood] contains a mix of place names and postal codes. It would need standardising later if we use it.
* Feature [host_name] may not be very useful on its own. However, we could engineer a feature from it, such as the likely gender of the host, for further analysis.
* Feature [neighbourhood_group] seems to contain a lot of missing data.
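As a rough sketch of how the mixed [neighbourhood] values could be told apart before standardising them, the snippet below flags entries that look like 5-digit postal codes. The sample values are made up for illustration, and the assumption that postal codes are exactly five digits is mine, not something checked against the dataset.

```python
import pandas as pd

# Made-up sample values standing in for the [neighbourhood] column.
neighbourhood = pd.Series(["Harlem", "28806", "Downtown", "78702", "Venice"])

# Assumption: a postal code is a string of exactly five digits.
is_postal_code = neighbourhood.str.fullmatch(r"\d{5}")

print(neighbourhood[is_postal_code].tolist())   # entries that look like postal codes
print(neighbourhood[~is_postal_code].tolist())  # entries that look like place names
```

Splitting the column this way would at least show how many rows fall into each group before deciding how to clean them.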

In [None]:
data.head()

In [None]:
data.tail()

In [None]:
data.info()

# DEALING WITH NULL AND EMPTY VALUES

There are missing values for features [name], [host_name], [neighbourhood_group], [last_review] and [reviews_per_month]. I deleted [neighbourhood_group] because almost 50% of the data is missing. 

Improvement that can be made: display the missing values as a percentage of each feature.

In [None]:
data.isnull().sum()
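A minimal sketch of the improvement noted above, showing missing values as a percentage of each feature. It runs on a tiny stand-in frame (`demo`, with made-up values) rather than on the real data, so the numbers it prints say nothing about the Airbnb dataset.

```python
import pandas as pd

# Tiny stand-in frame; column names mirror the dataset but the values are made up.
demo = pd.DataFrame({
    "name": ["Cosy loft", None, "Beach house", "Studio"],
    "neighbourhood_group": [None, None, "Manhattan", None],
    "price": [100, 150, 200, 120],
})

# isnull().mean() is the fraction of missing rows per column; scale it to a percentage.
missing_pct = (demo.isnull().mean() * 100).round(1).sort_values(ascending=False)
print(missing_pct)
```

The same expression should work on `data` directly, for example `data.isnull().mean() * 100`.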

In [None]:
data2 = data.drop(['neighbourhood_group'], axis = 1)

We can also check for the number of unique data points for each feature. This could be useful when we decide to encode categorical data later. 

In [None]:
data2.select_dtypes('object').apply(pd.Series.nunique, axis = 0) 

In [None]:
data2.select_dtypes('int').apply(pd.Series.nunique, axis = 0)

In [None]:
data2.select_dtypes('float').apply(pd.Series.nunique, axis = 0)

# DEALING WITH OUTLIERS

In dealing with numerical outliers, I plotted a density distribution for all numerical features to obtain a big picture of the data. With the distributions, it is much easier to catch the outliers. Features [price], [number_of_reviews], [reviews_per_month] and [minimum_nights] seem to contain outliers. It is implausible for a nightly price to go anywhere close to $25000, or for a listing to require a minimum stay of 100000 nights.

The outliers are removed using the interquartile range of price: only listings priced between the 25th and 75th percentiles are kept. The other features are trimmed with fixed cut-offs.

In [None]:
numerical = data2.select_dtypes(include = ('int', 'float')).columns
numerical

In [None]:
plt.figure(figsize=(20,20))

for i, feature in enumerate(numerical):
    plt.subplot(4,3,i+1)
    sns.kdeplot(data2[feature])
    plt.title('Distribution of %s' %feature)
    plt.xlabel('%s' % feature); plt.ylabel('Density')
    plt.tight_layout()
        


In [None]:
lower_bound = .25
upper_bound = .75
# keep listings priced between the 25th and 75th percentiles (between() includes both endpoints by default)
iqr = data2[data2['price'].between(data2['price'].quantile(lower_bound), data2['price'].quantile(upper_bound))]
iqr = iqr[iqr['number_of_reviews'] > 0]
iqr = iqr[iqr['calculated_host_listings_count'] < 10]
iqr = iqr[iqr['number_of_reviews'] < 400]
iqr = iqr[iqr['minimum_nights'] < 10]
iqr = iqr[iqr['reviews_per_month'] < 5]


#referenced code from Thomas Konstantin's notebook
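For comparison, a more conventional way to use the interquartile range is the 1.5 x IQR rule (Tukey's fences), which drops only the points far outside the middle 50% instead of keeping the middle 50% alone. This is not what the cell above does; it is a sketch of an alternative. The helper `iqr_filter`, its parameter `k` and the demo prices are all hypothetical.

```python
import pandas as pd

def iqr_filter(df, column, k=1.5):
    """Keep rows whose value lies within [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = df[column].quantile(0.25)
    q3 = df[column].quantile(0.75)
    spread = q3 - q1  # the interquartile range
    lower, upper = q1 - k * spread, q3 + k * spread
    return df[df[column].between(lower, upper)]

# Made-up prices with one extreme value.
demo = pd.DataFrame({"price": [80, 90, 100, 110, 120, 130, 25000]})
filtered = iqr_filter(demo, "price")
print(filtered["price"].tolist())
```

With this rule the extreme price is dropped while the ordinary prices on both sides of the median are kept, so far fewer rows are lost than with the quartile-only filter.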

In [None]:
iqr.info()

In [None]:
numerical_iqr = iqr.select_dtypes(include = ('int', 'float')).columns

plt.figure(figsize=(20,20))

for i, feature in enumerate(numerical_iqr):
    plt.subplot(4,3,i+1)
    sns.kdeplot(iqr[feature], bw_method = 0.2)  # bw_method replaces the deprecated bw argument
    plt.title('Distribution of %s' %feature)
    plt.xlabel('%s' % feature); plt.ylabel('Density')
    plt.tight_layout()

From the distributions above, we can also draw a few observations:
* There is a large density of latitude and longitude at 42 and -70 respectively. A simple Google search will tell you that this refers to New York. This possibly means that most listings are from NY. We will look into this later.
* Most listings are priced at round figures such as $100, $150 and $200 per night.
* Most listings require a minimum stay of 1-2 nights.
* Most hosts have only one listing.
* There is a large number of listings in city 14, which is in New York. This agrees with what we learnt from the latitude and longitude distributions, which also point to New York.

In [None]:
cleaned_data = iqr.copy()

# EXPLORATION OF NUMERICAL FEATURES


**The Target Feature: Price**

In datasets like these, the end goal is usually to predict price. We examine the distribution of the price of each listing. A describe function to get the gist of the distribution is a good start. It seems that the outliers are indeed out of our way.

In [None]:
cleaned_data['price'].describe()

In [None]:
sns.boxplot(x=cleaned_data['price'])

In [None]:
sns.kdeplot(cleaned_data['price'])
# compare this with cleaned_data['price'].plot(kind='kde'): the seaborn plot shows more detail, while the pandas plot shows only a single peak

We can look at the correlation between price and the other numerical features.

I define a significant correlation as one with a magnitude of more than 50%.
Positive correlations of significance: minimum_nights. The larger the minimum number of nights, the higher the price.
Negative correlations of significance: latitude and reviews_per_month. The lower the latitude, the higher the price. The fewer the reviews per month, the higher the price.

Some of these correlations seem strange, don't they?


In [None]:
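# restrict to numeric columns first; corr() raises on text columns in recent pandas versions
correlations = cleaned_data.select_dtypes('number').corr()['price'].sort_values()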
correlations
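To make the 50% cut-off concrete, here is a small sketch that keeps only the features whose correlation with price exceeds 0.5 in absolute value. Reading "more than 50%" as an absolute value is my assumption, and the demo frame is made up, so the output does not reflect the real dataset.

```python
import pandas as pd

# Made-up values; only the column names are borrowed from the dataset.
demo = pd.DataFrame({
    "price": [100, 120, 140, 160, 180],
    "minimum_nights": [1, 2, 3, 4, 5],
    "reviews_per_month": [5.0, 4.0, 3.0, 2.0, 1.0],
    "availability_365": [10, 300, 50, 200, 120],
})

# Pearson correlation of every feature with price, excluding price itself.
corr_with_price = demo.corr()["price"].drop("price")

# Keep correlations whose magnitude is above the 0.5 threshold, in either direction.
strong = corr_with_price[corr_with_price.abs() > 0.5].sort_values()
print(strong)
```

Filtering on the absolute value keeps strong negative correlations as well as strong positive ones.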

In [None]:
fig, ax = plt.subplots()
ax.scatter(x = cleaned_data['latitude'],y = cleaned_data['price'], s=0.01)
plt.ylabel('price', fontsize=13)
plt.xlabel('latitude', fontsize=13)
plt.show()

**Minimum Nights**

In [None]:
plt.hist(cleaned_data['minimum_nights'])

In [None]:
fig, ax = plt.subplots()
ax.scatter(x = cleaned_data['minimum_nights'], y = cleaned_data['price'], s=0.01)
plt.ylabel('price', fontsize=13)
plt.xlabel('minimum_nights', fontsize=13)
plt.show()

In [None]:
fig, ax = plt.subplots()
ax.scatter(x = cleaned_data['reviews_per_month'],y = cleaned_data['price'], s=0.01)
plt.ylabel('price', fontsize=13)
plt.xlabel('reviews_per_month', fontsize=13)
plt.show()

# EXPLORATION OF CATEGORICAL DATA

To deal with the categorical features, we need to do some encoding. The categorical features that may require encoding are neighbourhood, room type and city. There are 1450 unique neighbourhoods, 4 room types and 28 unique cities.

* Label encoding or one-hot encoding? Generally, for a feature with more than 2 categories, we would use one-hot encoding. However, I use label encoding in this project as it is easier to deal with and I am not ready to get into dimensionality reduction here. Note that label encoding a feature with more than 2 categories imposes an arbitrary ordering, so a model may assign different weights to each category.
* 1450 neighbourhoods are too many to encode. Also, I do not know how to clean this feature up.
* For room types and cities, I carried on with encoding.

In [None]:
le = LabelEncoder()

le.fit(cleaned_data['room_type'])
le_room_type_mapping = dict(zip(le.classes_, le.transform(le.classes_)))
print(le_room_type_mapping)
cleaned_data['room_type'] = le.transform(cleaned_data['room_type'])

le.fit(cleaned_data['city'])
le_city_mapping = dict(zip(le.classes_, le.transform(le.classes_)))
print(le_city_mapping)
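# encode into a separate column of cleaned_data (the frame the encoder was fitted on) so the city names stay readable in later plots
cleaned_data['city_encoded'] = le.transform(cleaned_data['city'])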
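To illustrate the trade-off between the two encodings discussed above, here is a small pandas-only sketch. `pd.factorize(..., sort=True)` stands in for `LabelEncoder` (both number the categories in sorted order) and `pd.get_dummies` stands in for one-hot encoding. The sample room types are made up.

```python
import pandas as pd

# Made-up sample of room types.
rooms = pd.Series(
    ["Private room", "Entire home/apt", "Shared room", "Private room"],
    name="room_type",
)

# Label encoding: one integer per category, numbered in sorted order.
codes, classes = pd.factorize(rooms, sort=True)
label_mapping = dict(zip(classes, range(len(classes))))
print(label_mapping)

# One-hot encoding: one indicator column per category, with no implied ordering.
one_hot = pd.get_dummies(rooms, prefix="room_type")
print(one_hot.columns.tolist())
```

Label encoding keeps a single column but implies that "Shared room" is somehow greater than "Entire home/apt". One-hot encoding avoids that at the cost of one column per category, which is why it becomes unwieldy for a feature like neighbourhood.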

**Room Types**

Most listings are entire home/apt or private rooms.

In [None]:
plt.figure(figsize= (10,10))
cleaned_data.room_type.value_counts().plot.pie(autopct="%.1f%%", title = 'distribution of room types')

In [None]:
fig, ax = plt.subplots()
ax.scatter(x = cleaned_data['room_type'], y = cleaned_data['price'], s=0.01)
plt.ylabel('price', fontsize=13)
plt.xlabel('room_type', fontsize=13)


**Cities Listed**

As expected, most listings are from New York City followed by Los Angeles then Hawaii. 

In [None]:
plt.figure(figsize = (10,10))
ax = sns.countplot(y=cleaned_data['city'],order=cleaned_data['city'].value_counts().index,palette='rocket')
ax.set_yticklabels(ax.get_yticklabels(),fontsize=11,fontweight='bold')
ax.set_title('Distribution Of Different Cities In Our Data',fontsize=16,fontweight='bold')
ax.set_xlabel('Count',fontsize=14,fontweight='bold')
plt.show()

#referenced code from Thomas Konstantin's notebook

**Number of Reviews per City**

In [None]:
plt.figure(figsize = (10,10))
box_plot = sns.barplot(x='number_of_reviews', y='city', 
                 data=cleaned_data, 
                 palette="rocket")

**Length of Description of Listings**

* There is a large variance in the number of words in the listing names. While the median is 6 words, there are listings with more than 40 words. However, most listings are kept to fewer than 10 words. There are probably outliers but I do not think it is necessary to remove them.

In [None]:
word_length = cleaned_data['name'].apply(lambda x : len(str(x).split()))
word_length.describe()

In [None]:
plt.figure(figsize = (10,5))
sns.kdeplot(word_length, fill=True, color='r').set_title('Distribution of word length in name')  # fill replaces the deprecated shade argument

**Most Used Words in Listings**

Through a word cloud and a barplot, we can find out the popular words used in the listings.
Most popular words seen from the word cloud: cosy, home, apartment, beautiful, downtown, studio, heart, beach, charming, modern.
Most popular words seen from the barplot: private, bedroom, apartment, home, studio, cosy, room, beach, house, spacious, modern, downtown, park.

It seems like most listings are trying to portray an apartment that is filled with warmth. 

Improvements that can be made: Add words like 'apartment', 'private', 'bedroom' into stopwords. 
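A minimal sketch of that improvement, using only the standard library so the idea is easy to see. The listing names and the small base stopword set are made up. With the wordcloud package, the same idea would be to extend its `STOPWORDS` set and pass the result through the `stopwords` argument of `WordCloud`.

```python
import re
from collections import Counter

# Made-up listing names.
names = [
    "Cosy private bedroom near beach",
    "Modern apartment downtown",
    "Private bedroom in cosy home",
    "Beach apartment with private bedroom",
]

# A tiny illustrative base set plus the domain words suggested above.
base_stopwords = {"in", "near", "with", "the", "a"}
extra_stopwords = {"apartment", "private", "bedroom"}
stopwords = base_stopwords | extra_stopwords

# Lowercase, split into alphabetic tokens, and drop stopwords word by word.
tokens = [
    word
    for name in names
    for word in re.findall(r"[a-z]+", name.lower())
    if word not in stopwords
]
counts = Counter(tokens)
print(counts.most_common(3))
```

Once the generic property words are filtered out, the descriptive words such as "cosy" and "beach" rise to the top of the counts.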

In [None]:
x = cleaned_data['name'].astype(str)
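# filter stopwords word by word; comparing whole listing names against STOPWORDS would remove almost nothing
listToStr = ' '.join(word for name in x for word in name.lower().split() if word not in STOPWORDS)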

In [None]:
plt.figure(figsize = (20,20))
wordcloud = WordCloud(width=800,height=600,min_font_size=10).generate(listToStr)
plt.imshow(wordcloud)

In [None]:
text = re.sub(r"[^a-zA-Z_]", ' ', listToStr) # replace everything other than letters and underscores with a space
text = re.sub(r'\b\w{1,3}\b', '', text) # remove words of 3 characters or fewer

In [None]:
words_df = pd.DataFrame(text.split(), columns = ['words'])
plt.figure(figsize = (20,20))
sns.countplot(y= words_df['words'],order=words_df['words'].value_counts().iloc[:50].index,palette='rocket')

# ALL FEATURES

Looking at a heatmap of the correlations across all numerical and encoded categorical features, we can attempt to draw more observations. Something that stands out to me is the correlation between id and number of reviews. Apparently, listings with earlier ids have more reviews. Is this a result of how the listings are sorted on Airbnb? Or are those listings simply older, so they have had more time to collect reviews?

In [None]:
plt.figure(figsize = (10,5))
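sns.heatmap(cleaned_data.select_dtypes('number').corr())  # numeric columns only; corr() raises on text columns in recent pandas versions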

In [None]:
le_city_mapping

In [None]:
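city_grouped = cleaned_data.groupby('city', as_index = False)
# select the columns with a list (tuple selection on a groupby is no longer supported) and leave out the grouping key
city_grouped2 = city_grouped[['price', 'room_type', 'number_of_reviews']].mean()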
city_grouped2

**Prices in each City**

Looking at the box plot, you can see that listings in most cities are priced at around $120. Listings in Pacific Grove seem to have a wide variance. A few listings are really expensive.

In [None]:
plt.figure(figsize = (10,10))
box_plot = sns.boxplot(x='price', y='city', 
                 data=cleaned_data, 
                 palette="rocket")
