In [20]:
import pandas as pd
import numpy as np
import plotly.express as px
import warnings
warnings.filterwarnings('ignore')

from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

%matplotlib inline

In [21]:
listings_raw = pd.read_csv('data/listings_CPH.csv')

# Can we project the data onto a lower dimensional space?

We can use PCA to project the data onto a lower dimensional space. This is great for visualizations as well as gathering insights about the data. Let's try to see if the principal components explain something meaningful about the data.

Let's start by dropping some features that are largely unsuitable for PCA. `id`, `name`, `host_id`, `host_name`, `neighbourhood_group`, `license` and `last_review` don't really say much about the data and are mostly there for identification purposes. We also drop `latitude` and `longitude`, since `neighbourhood` conveys much of the same information and the neighbourhood names are easier to interpret.

In [40]:
features_to_drop = ["id", "name", "host_id", "host_name", 
                    "neighbourhood_group", "license", "last_review", 
                    "longitude", "latitude"]

listings_dropped = listings_raw.drop(features_to_drop, axis=1)
# Fill missing reviews_per_month values with -1 as a sentinel so PCA gets no NaNs
listings_dropped["reviews_per_month"] = listings_dropped["reviews_per_month"].fillna(-1)

listings_dropped.head()

Unnamed: 0,neighbourhood,room_type,price,minimum_nights,number_of_reviews,reviews_per_month,calculated_host_listings_count,availability_365,number_of_reviews_ltm
0,Nrrebro,Entire home/apt,898,3,172,1.08,1,0,4
1,Indre By,Entire home/apt,2600,4,59,0.55,1,303,8
2,Indre By,Entire home/apt,3250,3,300,2.06,3,56,7
3,Vesterbro-Kongens Enghave,Entire home/apt,725,7,24,0.16,1,59,2
4,Vesterbro-Kongens Enghave,Entire home/apt,1954,3,19,0.13,1,0,2


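PCA needs numeric input and is sensitive to the scale of each feature, so we'll one-hot encode the categorical columns and standardize everything before applying PCA.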

In [41]:
listings_dummies = pd.get_dummies(listings_dropped, columns=["neighbourhood", "room_type"])

scaler = StandardScaler()
listings_scaled = scaler.fit_transform(listings_dummies)

pca = PCA(n_components=2)
listings_pca = pca.fit_transform(listings_scaled)
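
To see why the standardization step matters, here is a minimal sketch on synthetic data (not the listings). The two features, `price` and `min_nights`, are made-up stand-ins drawn independently on very different scales. Without scaling, the first component should be dominated by the large-scale feature; after `StandardScaler`, the variance should be shared roughly evenly.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption, not the listings):
# two independent features on very different scales.
rng = np.random.default_rng(0)
price = rng.normal(loc=1000, scale=500, size=500)   # large-scale feature
min_nights = rng.normal(loc=3, scale=1, size=500)   # small-scale feature
X = np.column_stack([price, min_nights])

# Without scaling, PC1 mostly follows the large-variance feature.
ratio_raw = PCA(n_components=2).fit(X).explained_variance_ratio_

# After standardization both features have unit variance, so neither dominates.
X_scaled = StandardScaler().fit_transform(X)
ratio_scaled = PCA(n_components=2).fit(X_scaled).explained_variance_ratio_

print("raw:   ", ratio_raw.round(3))
print("scaled:", ratio_scaled.round(3))
```

The same reasoning applies to the listings, where `price` and `availability_365` span much larger ranges than the one-hot encoded columns.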

In [42]:
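# Keep PC1/PC2 numeric; np.c_ with a string column would turn every column into object dtype
listings_pca = pd.DataFrame(listings_pca, columns=["PC1", "PC2"])
listings_pca["Room Type"] = listings_raw["room_type"].to_numpy()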
px.scatter(listings_pca, x="PC1", y="PC2", color="Room Type", title="PCA on Listings")

In the scatter plot, `Entire home/apt` and `Private room` listings fall into two distinct clusters. `Hotel room` and `Shared room` have too few observations to tell how well they separate.
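
One way to check whether the components explain something meaningful is to look at how much variance each one captures and which features load on it. The sketch below does this on a small synthetic frame, not the listings data. The column names and the relationship between price and room type are assumptions made up for illustration. On the real data the same two attributes, `explained_variance_ratio_` and `components_`, could be read off the fitted `pca` object.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the one-hot encoded listings (hypothetical columns and values).
rng = np.random.default_rng(0)
n = 300
is_private = rng.integers(0, 2, size=n)
toy = pd.DataFrame({
    "price": 1500 - 700 * is_private + rng.normal(0, 200, size=n),
    "minimum_nights": rng.integers(1, 10, size=n),
    "room_type_Private room": is_private,
    "room_type_Entire home/apt": 1 - is_private,
})

toy_pca = PCA(n_components=2).fit(StandardScaler().fit_transform(toy))

# Share of the total variance captured by each component.
print(toy_pca.explained_variance_ratio_)

# Loadings: one row per component, one column per input feature.
loadings = pd.DataFrame(toy_pca.components_, index=["PC1", "PC2"], columns=toy.columns)
print(loadings.round(2))
```

Features with large absolute loadings on a component are the ones that drive it, which would help explain why room type separates so cleanly along the plotted axes.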

