# Assignment 6 - [30 points]

##  Seattle Airbnb Listing Analysis

Suppose that you work as a data scientist at Airbnb. You'd like to learn more about the main types of Airbnb listings in the Seattle area and then use this information to advertise these types of listings to interested customers.

The following dataset is a sample of available Airbnb listings in Seattle, WA. These listings were collected in January 2016 and filtered to contain only listings from some of the most popular Seattle neighborhoods (for Airbnb listings) and only listings whose property is either a house or an apartment. Rows with missing values have already been dropped from the dataset.

This dataset contains the following variables.

**Listing Information**
The dataset contains the following information about the Airbnb *listing*:
* <u>price</u>: price of the listing per night (in US dollars)
* <u>review_scores_rating</u>: the average rating of the listing [0,100] (100 is the best)
* <u>number_of_reviews</u>: the number of reviews for the listing
* <u>security_deposit</u>: the security deposit required for the listing (in US dollars)
* <u>cleaning_fee</u>: the cleaning fee required for the listing (in US dollars)
* <u>neighborhood</u>: the neighborhood of Seattle the listing is located in
* <u>property_type</u>: is the listing in a 'House' or 'Apartment'
* <u>room_type</u>: is the listing an 'Entire home/apt', 'Private room', or 'Shared room'
* <u>accommodates</u>: how many guests will the listing accommodate
* <u>bathrooms</u>: how many bathrooms does the listing have
* <u>beds</u>: how many beds does the listing have

**Host Information**
The dataset also contains the following information about the *host* of the given Airbnb listing:
* <u>host_is_superhost</u>: is the host a "superhost": t=True, f=False
* <u>host_has_profile_pic</u>: does the host have a profile pic in their bio: t=True, f=False
* <u>host_response_time</u>: how fast will the host respond to requests (on average)
* <u>host_acceptance_rate</u>: what percent of booking requests will the host accept

### <u>Case Study 1</u>: Reduced Dataset - *Just Categorical Variables*

In this assignment, we will cluster the Airbnb listing dataset using just the categorical variables. In Assignment 7, we will consider the full dataset, using both the categorical and the numerical variables.

### <u>Research Goals</u>:

In this analysis, we have the following research goals.

1. Identify larger "main clusters" of Airbnb listings in the Seattle area. In general, we would like most of our listings to fall into larger clusters, rather than be separated into small or singleton clusters.
2. We would also like to identify potential sub-clusters of listings within each of the "main clusters."
3. What attributes characterize each of these clusters and sub-clusters?

#### Imports

## 1. Data Preprocessing and Cleaning

### 1.1. Original Dataset
Read the seattle_airbnb_listings_cleaned.csv into a dataframe. This dataframe has already been cleaned (rows with missing values have already been dropped).

### 1.2. Categorical Dataset

Next, create a dataframe that is just comprised of the categorical variables.
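
One way to build this subset is to select the categorical columns by name. The sketch below uses a made-up toy dataframe in place of the real file, and the `categorical_cols` list is an assumption that should be extended to match the actual columns.

```python
import pandas as pd

# Toy stand-in for the listings dataframe (made-up values)
df = pd.DataFrame({
    "price": [85.0, 150.0, 60.0],
    "neighborhood": ["Wallingford", "Ballard", "Wallingford"],
    "property_type": ["House", "Apartment", "House"],
    "room_type": ["Entire home/apt", "Private room", "Shared room"],
    "host_is_superhost": ["t", "f", "f"],
})

# Hypothetical list of categorical columns; extend it to match the real file
categorical_cols = ["neighborhood", "property_type", "room_type", "host_is_superhost"]

# .copy() avoids modifying the original dataframe in later steps
df_cat = df[categorical_cols].copy()
print(df_cat.dtypes)
```

Naming the columns explicitly keeps the selection independent of how pandas infers the dtype of each column.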

### 1.3. Label Encoding

Next, label encode this dataframe that is comprised of just your categorical variables. *That is, each distinct value in each of your categorical variables should be represented with a number.*
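
As a minimal sketch of label encoding, pandas can convert each column to a categorical dtype and return its integer codes. The toy `df_cat` below is made up for illustration; other encoders would work equally well.

```python
import pandas as pd

# Toy categorical dataframe (made-up values)
df_cat = pd.DataFrame({
    "property_type": ["House", "Apartment", "House", "Apartment"],
    "room_type": ["Entire home/apt", "Private room", "Shared room", "Private room"],
})

# Replace each distinct value with an integer code, column by column.
# Codes follow the sorted order of the distinct values in each column.
df_encoded = df_cat.apply(lambda col: col.astype("category").cat.codes)
print(df_encoded)
```

The integer codes carry no ordering information that matters for Hamming distance, which only checks whether two values are equal.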

### 1.4. Hamming Distance Matrix

Finally, create a Hamming distance matrix of your categorical variables.
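
The Hamming distance between two listings is the fraction of attributes on which they disagree. A small sketch with SciPy, using a made-up encoded array `X` in place of the real encoded dataframe:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Toy label-encoded data: 3 listings, 3 categorical attributes
X = np.array([
    [0, 1, 2],
    [0, 1, 0],
    [1, 0, 0],
])

# pdist returns the condensed (upper-triangle) distances;
# squareform expands them into a symmetric n x n matrix.
condensed = pdist(X, metric="hamming")
dist_matrix = squareform(condensed)
print(dist_matrix.round(3))
```

Here listings 0 and 1 differ on one of three attributes (distance 1/3), while listings 0 and 2 differ on all three (distance 1).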

## 2. Clusterability

### 2.1. t-SNE Plots
Using 6 different perplexity values and at least two random states for each perplexity value, map this **distance matrix** onto a two-dimensional dataset with the t-SNE algorithm. Show your projected coordinates in a scatterplot for each combination of random states and perplexity value.
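
The grid of projections can be sketched as below, assuming scikit-learn's `TSNE` is the implementation in use. The data is randomly generated, and the two perplexity values are placeholders for the six the task asks for.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.spatial.distance import pdist, squareform
from sklearn.manifold import TSNE

# Toy label-encoded data standing in for the real categorical dataframe
rng = np.random.default_rng(0)
X = rng.integers(0, 3, size=(60, 5))
dist_matrix = squareform(pdist(X, metric="hamming"))

perplexities = [5, 15]   # placeholder values; the task asks for 6
random_states = [0, 1]

embeddings = {}
fig, axes = plt.subplots(len(perplexities), len(random_states), figsize=(8, 8))
for i, perp in enumerate(perplexities):
    for j, rs in enumerate(random_states):
        # metric="precomputed" tells TSNE the input is a distance matrix;
        # that mode requires init="random" (PCA init is not available).
        tsne = TSNE(n_components=2, metric="precomputed", init="random",
                    perplexity=perp, random_state=rs)
        coords = tsne.fit_transform(dist_matrix)
        embeddings[(perp, rs)] = coords
        axes[i, j].scatter(coords[:, 0], coords[:, 1], s=10)
        axes[i, j].set_title(f"perplexity={perp}, random_state={rs}")
fig.tight_layout()
```

Each perplexity value must be smaller than the number of observations, so the usable range depends on the size of the dataset.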

### 2.2. Assessing Clustering Structure

Answer the following questions below.

1. Does the t-SNE algorithm suggest that this dataset is clusterable?
2. How many "main clusters" do you think that this dataset has? *[Subjective Answer: As long as your logic is correct, you will not lose points].*

Finally, pick out a random state and perplexity value that reflects the answers to your questions and show the corresponding t-SNE plot below.

### 2.3. Association between the Attributes and the Clustering Structure Suggested by the t-SNE Plots

Finally, we would like to assess how each of our 7 categorical attributes is individually associated with the clustering structure suggested by our selected t-SNE plot from 2.2. In the code below, plot your t-SNE plot 7 times, each time color coding the points by one of the 7 categorical attributes.
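
A possible pattern for the color coding is one panel per attribute, with one scatter call per attribute level. In this sketch the coordinates are random numbers standing in for the selected t-SNE projection, and only two made-up attributes are shown.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Made-up 2-D coordinates standing in for the selected t-SNE projection
rng = np.random.default_rng(1)
coords = rng.normal(size=(30, 2))
df_cat = pd.DataFrame({
    "property_type": rng.choice(["House", "Apartment"], size=30),
    "room_type": rng.choice(["Entire home/apt", "Private room", "Shared room"], size=30),
})

fig, axes = plt.subplots(1, df_cat.shape[1], figsize=(10, 4))
for ax, col in zip(axes, df_cat.columns):
    # One scatter call per level so each level gets its own color and legend entry
    for level in sorted(df_cat[col].unique()):
        mask = (df_cat[col] == level).to_numpy()
        ax.scatter(coords[mask, 0], coords[mask, 1], s=15, label=level)
    ax.set_title(col)
    ax.legend(fontsize=7)
fig.tight_layout()
```

An attribute whose colors line up with the visible clouds of points is a candidate driver of the clustering structure.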

### 2.4. Interpretation

Select one of the smaller *highly* dense clouds of points in your t-SNE plot. Do the categorical attribute values of the points in this dense cloud differ at all?

## 3. K-Modes - *Parameter Selection*

Next, we would like to cluster this dataset with the k-modes algorithm. We would like to explore which values of $k$ best meet our research goals.


### 3.1. Elbow Plot

Create an elbow plot for the k-modes algorithm. Your plot should assess clusterings with k=1, k=2,..., k=16 clusters. For each k, run a single k-modes algorithm, using a random state of 100.
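
The assignment presumably intends a library implementation of k-modes (for example the third-party `kmodes` package). To illustrate what the elbow plot measures without that dependency, the sketch below uses a hand-written, hypothetical `k_modes` function on random toy data. The cost is the total number of attribute mismatches between each observation and the mode of its cluster.

```python
import numpy as np
import matplotlib.pyplot as plt

def k_modes(X, k, random_state=100, max_iter=20):
    """Minimal k-modes sketch (not a library API): returns (labels, modes, cost)."""
    rng = np.random.default_rng(random_state)
    # Initialise the modes with k distinct observed rows
    unique_rows = np.unique(X, axis=0)
    modes = unique_rows[rng.choice(len(unique_rows), size=k, replace=False)]
    for _ in range(max_iter):
        # Mismatch counts between every row and every mode: shape (n, k)
        dists = (X[:, None, :] != modes[None, :, :]).sum(axis=2)
        labels = dists.argmin(axis=1)
        new_modes = modes.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members):
                # Column-wise most frequent code among the members of cluster j
                new_modes[j] = [np.bincount(col).argmax() for col in members.T]
        if np.array_equal(new_modes, modes):
            break
        modes = new_modes
    # Final assignment and cost with the final modes
    dists = (X[:, None, :] != modes[None, :, :]).sum(axis=2)
    labels = dists.argmin(axis=1)
    cost = int(dists.min(axis=1).sum())
    return labels, modes, cost

# Toy label-encoded data (non-negative integer codes)
X = np.random.default_rng(0).integers(0, 3, size=(60, 4))

# The task uses k = 1..16; a shorter range keeps the toy example small
costs = {k: k_modes(X, k, random_state=100)[2] for k in range(1, 7)}

fig, ax = plt.subplots()
ax.plot(list(costs.keys()), list(costs.values()), marker="o")
ax.set_xlabel("k")
ax.set_ylabel("cost (total mismatches to cluster modes)")
```

Because each k is run only once from a single initialization, the cost curve is not guaranteed to decrease smoothly; a bump at some k may reflect a poor starting point rather than the structure of the data.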

### 3.2. t-SNE Plots

For k=1, k=2,..., k=10, run the k-modes clustering algorithm on your dataset, using a random state of 100. For each of your clusterings, plot a t-SNE plot in which you have color coded the points by their cluster labels.

### 3.3. Interpretation

How many clusters does your elbow plot suggest are in this dataset? Does the k-modes clustering with this number of clusters *strongly* agree with the clustering structure suggested by the t-SNE plot?

### 3.4. Clustering Again

Use the value of k that you selected in your elbow plot to cluster the dataset one more time using k-modes.

### 3.5. Cluster Modes

Display the modes for each of your clusters in this clustering that you found in 3.4. Your modes should be *unencoded* so you are able to examine the actual attribute values that correspond to each of the modes.
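
One way to unencode is to keep the category list of each column and index into it with the encoded mode. The sketch below assumes the encoding was done with pandas categorical codes; the data and the cluster `labels` are made up, and the modes are recomputed per cluster with a groupby rather than taken from a fitted model.

```python
import pandas as pd

# Toy categorical data and hypothetical cluster labels
df_cat = pd.DataFrame({
    "room_type": ["Entire home/apt", "Private room", "Entire home/apt",
                  "Private room", "Private room"],
    "property_type": ["House", "Apartment", "House", "Apartment", "House"],
})
labels = pd.Series([0, 1, 0, 1, 1], name="cluster")

# Remember which original value each integer code stands for
categories = {c: df_cat[c].astype("category").cat.categories for c in df_cat.columns}
encoded = df_cat.apply(lambda s: s.astype("category").cat.codes)

# Encoded mode of every attribute within every cluster
encoded_modes = encoded.groupby(labels).agg(lambda s: s.mode().iloc[0])

# Map the codes back to the original attribute values
decoded_modes = pd.DataFrame(
    {c: categories[c][encoded_modes[c].to_numpy()] for c in encoded_modes.columns},
    index=encoded_modes.index,
)
print(decoded_modes)
```

If a different encoder was used in 1.3, its own inverse mapping would take the place of the `categories` lookup.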

### 3.6. Describing the Clusters

Next, we would like to describe how each of our 7 categorical attributes *associates* with each of the clusters in our clustering.

#### 3.6.1. Cluster Label Distribution for each Distinct Attribute Level

First, create 7 side-by-side barplot figures.

1. One that visualizes the relationship between neighborhood and cluster labels.
2. One that visualizes the relationship between property_type and cluster labels.

...

7. One that visualizes the relationship between host_identity_verified and cluster labels.

For each of these figures, your "attribute" should be in the "x-axis".
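
A side-by-side barplot of this kind can be sketched from a cross-tabulation, shown here for a single attribute. The data and the cluster `labels` are made up.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Toy attribute values and hypothetical cluster labels
df_cat = pd.DataFrame({
    "property_type": ["House", "House", "Apartment", "Apartment", "House", "Apartment"],
})
labels = pd.Series([0, 0, 1, 1, 1, 0], name="cluster")

# Rows: attribute levels (x-axis); columns: cluster labels (one bar per cluster)
counts = pd.crosstab(df_cat["property_type"], labels)
ax = counts.plot(kind="bar", rot=0)
ax.set_ylabel("number of listings")
print(counts)
```

Swapping the two arguments of `pd.crosstab` puts the cluster labels on the x-axis instead.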

#### 3.6.2. Plot Interpretation

Which cluster do the Wallingford listings most belong to?

#### 3.6.3. Attribute Level Distribution for each Cluster Label

Next, create 7 side-by-side barplot figures.

1. One that visualizes the relationship between neighborhood and cluster labels.
2. One that visualizes the relationship between property_type and cluster labels.

...

7. One that visualizes the relationship between host_identity_verified and cluster labels.

For each of these figures, your "cluster labels" should be in the "x-axis".

#### 3.6.4. Plot Interpretation

Which neighborhood do the listings in cluster 1 most belong to?

## 4. Hierarchical Agglomerative Clustering

Next, we would like to cluster our Hamming distance matrix with hierarchical agglomerative clustering using single linkage, complete linkage, and average linkage.



### 4.1. Single Linkage

#### 4.1.1. Dendrogram

Create a dendrogram using hierarchical agglomerative clustering with single linkage using your Hamming distance matrix.
* Because this is a small dataset, we do not need/want to truncate our dendrogram results. You should be able to see the indices of each of your observations at the leaves of your dendrogram tree.
* Make sure you are able to read the indices labels in your dendrogram.
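
A minimal sketch of an untruncated dendrogram with SciPy, on random toy data. Note that `linkage` expects the condensed form of the distance matrix, so a square matrix is converted back with `squareform` first.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram
from scipy.spatial.distance import pdist, squareform

# Toy label-encoded data and its square Hamming distance matrix
rng = np.random.default_rng(0)
X = rng.integers(0, 3, size=(25, 5))
dist_matrix = squareform(pdist(X, metric="hamming"))

# linkage expects the condensed form; a square matrix would be
# misread as a table of raw observations.
condensed = squareform(dist_matrix, checks=False)
Z = linkage(condensed, method="single")

# A wide figure plus rotated, small leaf labels keeps the indices readable
fig, ax = plt.subplots(figsize=(12, 5))
result = dendrogram(Z, ax=ax, leaf_rotation=90, leaf_font_size=8)
ax.set_ylabel("Hamming distance")
```

Changing `method` to `"complete"` or `"average"` gives the other two linkages.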

#### 4.1.2. t-SNE Individual Clustering Visualization

Then for each of the clusterings with k=2,k=3,...,k=10 clusters, color code the points in your selected t-SNE plot with the respective cluster labels.
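
Flat cluster labels for each k can be cut from the linkage matrix with SciPy's `fcluster`, as in the sketch below on random toy data. The resulting label arrays can then be used to color the t-SNE points.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

# Toy label-encoded data and a single-linkage tree on Hamming distances
rng = np.random.default_rng(0)
X = rng.integers(0, 3, size=(40, 5))
Z = linkage(pdist(X, metric="hamming"), method="single")

cluster_labels = {}
for k in range(2, 11):
    # criterion="maxclust" cuts the tree so that at most k flat clusters remain
    cluster_labels[k] = fcluster(Z, t=k, criterion="maxclust")

print({k: len(np.unique(v)) for k, v in cluster_labels.items()})
```

Because Hamming distances on a few attributes take only a handful of distinct values, many merges happen at the same height, and a cut may return fewer than k clusters.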

### 4.2. Complete Linkage

#### 4.2.1. Dendrogram

Create a dendrogram using hierarchical agglomerative clustering with complete linkage using your Hamming distance matrix.
* Because this is a small dataset, we do not need/want to truncate our dendrogram results. You should be able to see the indices of each of your observations at the leaves of your dendrogram tree.
* Make sure you are able to read the indices labels in your dendrogram.

#### 4.2.2. t-SNE Individual Clustering Visualization

Then for each of the clusterings with k=2,k=3,...,k=10 clusters, color code the points in your selected t-SNE plot with the respective cluster labels.

### 4.3. Average Linkage

#### 4.3.1. Dendrogram

Create a dendrogram using hierarchical agglomerative clustering with average linkage using your Hamming distance matrix.
* Because this is a small dataset, we do not need/want to truncate our dendrogram results. You should be able to see the indices of each of your observations at the leaves of your dendrogram tree.
* Make sure you are able to read the indices labels in your dendrogram.

#### 4.3.2. t-SNE Individual Clustering Visualization

Then for each of the clusterings with k=2,k=3,...,k=10 clusters, color code the points in your selected t-SNE plot with the respective cluster labels.

### 4.4. Dendrogram Comparison 

Out of the three dendrograms that were created, which one do you think best helped us BOTH:
* identify any "main clusters" that exist in the dataset that are larger in size (as opposed to small singleton clusters) AND
* identify clusters that we have evidence to suggest are meaningfully separated from each other in the dataset?

Explain.

### Note:

Note that a more complete analysis may have also explored what attributes characterize some of the clusterings found in this "best" dendrogram that you selected in 4.4.