# Let's do Clustering Analysis

* The goal is to segment the passengers and see which features drive their satisfaction.
* First we read in the training data (we won't use the test set).

In [None]:
import pandas as pd
df = pd.read_csv("/kaggle/input/airline-passenger-satisfaction/train.csv")

In [None]:
df.head()

We have around 25 different features available

In [None]:
df.columns

Let's separate them into numerical columns, categorical columns, and the columns we need to drop. The column that holds the labels is `satisfaction`; we save it as `y_col`.

In [None]:
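drop_cols = ["Unnamed: 0", "id"]
num_cols = ["Age", "Flight Distance", "Departure Delay in Minutes", "Arrival Delay in Minutes"]
y_col = "satisfaction"
# Everything else is treated as categorical. The label is excluded so it does not leak into the features.
# A list comprehension (instead of a set difference) keeps the column order stable between runs.
excluded = set(drop_cols + num_cols + [y_col])
cat_cols = [c for c in df.columns if c not in excluded]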

Now we one-hot encode all the categorical columns.
* This is primarily because the clustering algorithm we are going to use requires all columns to be numerical and on a similar scale.
* One-hot encoding converts all categorical columns to numerical columns with values `0 or 1`.

For the numerical columns we use `MinMaxScaler` from sklearn to squeeze them into `0-1`.

In [None]:
from sklearn.preprocessing import OneHotEncoder

In [None]:
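ohe = OneHotEncoder(drop="first")
# get_feature_names_out replaces get_feature_names, which was removed in scikit-learn 1.2
cat_df = pd.DataFrame(ohe.fit_transform(df[cat_cols]).toarray(), columns=ohe.get_feature_names_out(cat_cols), index=df.index)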

In [None]:
from sklearn.preprocessing import MinMaxScaler

In [None]:
mms = MinMaxScaler()
num_df = pd.DataFrame(mms.fit_transform(df[num_cols]) , columns=num_cols)
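
To make the two transformations concrete, here is a minimal sketch on a made-up four-row frame. The `toy` data and its values are assumptions for illustration, not the airline data, and `pd.get_dummies` stands in for `OneHotEncoder` only to keep the example short.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy frame (not the airline data)
toy = pd.DataFrame({"Class": ["Eco", "Business", "Eco Plus", "Eco"],
                    "Age": [20, 40, 60, 30]})

# One-hot encode; drop_first plays the same role as OneHotEncoder(drop="first")
toy_cat = pd.get_dummies(toy[["Class"]], drop_first=True).astype(int)

# Min-max scaling by hand: (x - min) / (max - min)
age = toy["Age"]
manual_age = (age - age.min()) / (age.max() - age.min())

# The same scaling done by MinMaxScaler
scaled_age = MinMaxScaler().fit_transform(toy[["Age"]]).ravel()

print(toy_cat)
print(manual_age.tolist())
print(scaled_age.tolist())
```

Every column ends up in the `0-1` range, so no single feature dominates the distances that KMeans computes.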

Now we join our categorical dataframe with our numerical dataframe.

In [None]:
X = cat_df.join(num_df)
y = df[y_col]

Any missing values are filled using `Mean Imputation` below.

In [None]:
X = X.fillna(X.mean())

Let's split the data we have into a training and testing set

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
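# random_state is fixed (42 is an arbitrary choice) so the split is reproducible between runs
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)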

### Finally, down to clustering!

In [None]:
from sklearn.cluster import KMeans

KMeans takes the number of clusters to find (`k`) as an input parameter. You may therefore want to experiment with different values and keep the one that makes the most sense when interpreting the results.

We can also use the `Elbow Method` to figure out a good `k`.
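
As a rough sketch of the Elbow Method: fit KMeans for a range of `k`, record the inertia (within-cluster sum of squares), and look for the `k` after which the inertia stops dropping sharply. The data below is synthetic (`make_blobs` with three hand-picked centers, an assumption for illustration); in this notebook `X_train` would take the place of `X_demo`.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in data with three well separated groups (illustration only)
X_demo, _ = make_blobs(n_samples=300,
                       centers=[[0, 0], [10, 10], [-10, 10]],
                       cluster_std=1.0,
                       random_state=0)

# Fit KMeans for several k and store the inertia of each fit
inertias = {}
for k in range(1, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_demo)
    inertias[k] = km.inertia_

for k, inertia in inertias.items():
    print(k, round(inertia, 1))
```

On data like this the inertia falls steeply up to `k=3` and flattens afterwards; that bend is the "elbow". On real data the bend is often less clear, so it is a guide rather than a rule.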

For now we try 2 clusters, as the `satisfaction` variable has two labels.

In [None]:
kmeans = KMeans(2)
kmeans.fit(X_train)

KMeans clustering is fast, so different values of `k` can be tried out quickly.

To visually analyse the clustering results we use `TSNE`. Like `PCA`, `TSNE` reduces the dimensionality of the data at hand. To be able to plot our individual data points, we reduce the dimensionality to `2`.

NOTE: TSNE takes time. Patience is mandatory.

In [None]:
from sklearn.manifold import TSNE

In [None]:
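# 250 is the minimum number of iterations; from scikit-learn 1.5 onward the parameter is named max_iter
tsne = TSNE(n_components=2, n_iter=250)
transformed_df = tsne.fit_transform(X_train)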
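
If the wait is too long, one option is to embed only a random subsample. The sketch below uses random stand-in data (`X_demo`, an assumption for illustration) and an arbitrary sample size of 300; in this notebook the sampled rows would come from `X_train`, and the cluster labels and `y_train` would have to be subset with the same positions before plotting.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)

# Random stand-in for the feature matrix (illustration only)
X_demo = rng.random((2000, 10))

# Pick row positions without replacement; keep them so labels can be subset the same way
sample_idx = rng.choice(len(X_demo), size=300, replace=False)
X_sample = X_demo[sample_idx]

# Embed only the sampled rows into 2 dimensions
embedding = TSNE(n_components=2, random_state=0).fit_transform(X_sample)
print(embedding.shape)
```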

Now that we have our reduced data, we can plot it to see how our cluster labels compare to the true labels (the `satisfaction` column).

Let's call the 2 dimensions of the data returned by TSNE `c1` and `c2`. We take the labels found by `KMeans` from `kmeans.labels_`. We then plot `c1` on the `x-axis`, `c2` on the `y-axis`, and `color` the points using the `labels` we have found.

In [None]:
from plotnine import ggplot, aes, geom_point, geom_col, coord_flip, facet_wrap, theme

In [None]:
(
    ggplot(pd.DataFrame({"c1": transformed_df[:, 0],
                         "c2": transformed_df[:, 1],
                         # cast to str so plotnine uses a discrete color scale
                         "cluster": kmeans.labels_.astype(str)}),
           aes(x="c1", y="c2", fill="cluster"))
    + geom_point(alpha=0.4, stroke=0)
)

We can see that the clusters found mostly don't overlap, so the `KMeans` labels give a good separation of the data points. To compare this to the `ground truth` labels we have in the `satisfaction` variable, let's do another plot.

In [None]:
(
    ggplot(pd.DataFrame({"c1": transformed_df[:, 0],
                         "c2": transformed_df[:, 1],
                         # plain values, so rows are matched by position rather than by y_train's shuffled index
                         "y": y_train.to_numpy()}),
           aes(x="c1", y="c2", fill="y"))
    + geom_point(alpha=0.4, stroke=0)
)

Both graphs look pretty similar, so we have been able to come up with labels very similar to the ground truth using an `unsupervised approach`, based purely on the data we have.
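
The visual comparison can also be backed with numbers. A minimal sketch, on hypothetical toy labels: cross-tabulate the cluster labels against the true labels and compute the adjusted Rand index, which is 1 for identical groupings and close to 0 for random ones. In this notebook, `kmeans.labels_` and `y_train` would take the place of the two toy lists.

```python
import pandas as pd
from sklearn.metrics import adjusted_rand_score

# Hypothetical toy labels (illustration only)
cluster_labels = [0, 0, 0, 1, 1, 1, 1, 0]
true_labels = (["neutral or dissatisfied"] * 3) + (["satisfied"] * 5)

# Rows are clusters, columns are true labels, cells are counts
table = pd.crosstab(pd.Series(cluster_labels, name="cluster"),
                    pd.Series(true_labels, name="satisfaction"))
print(table)

# Agreement score that ignores how the clusters happen to be numbered
ari = adjusted_rand_score(true_labels, cluster_labels)
print(round(ari, 3))
```

The crosstab shows which cluster absorbs which label, which matters because KMeans numbers its clusters arbitrarily.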

### Cluster Features Analysis

Now that we have our cluster labels, we can see how these clusters differ from one another in terms of the original feature set.

To do that, we group by the `label` found and take the mean of each variable. This allows us to compare each individual feature across the clusters we have found.

In [None]:
result_df = X_train.copy()
result_df["cluster"] = kmeans.labels_

To be able to plot easily, we also melt the dataframe so that we get two columns, one containing the original feature name and the other containing the mean value.

In [None]:
melt_cluster = result_df.groupby("cluster").mean().reset_index().melt(id_vars="cluster")
melt_cluster

In [None]:
melt_cluster = melt_cluster["variable"].str.split("_" , expand=True).join(melt_cluster)
melt_cluster = melt_cluster.rename({0 : "variable_base" , 1:"response"} , axis=1)
melt_cluster

In [None]:
melt_cluster["variable_base"].unique()

In [None]:
melt_cluster["cluster"] = melt_cluster["cluster"].astype("category")

In [None]:
(
    ggplot(melt_cluster[~melt_cluster["response"].isna()],aes(x="response" , y="value" ,fill="cluster"))
    + geom_col(position="fill")
    + coord_flip()
    + facet_wrap("~ variable_base")
    + theme(figure_size=(12,10))
)

### Interpreting the results

Looking at the above graphs, most people in cluster `0` rated the survey questions `1 or 2`, while most people in cluster `1` rated them `4 or 5`. That isn't very helpful, BUT it does help us eliminate a few of the questions:
1. `Departure/Arrival Time convenience`
2. `Gate Location`
3. `Ease of Online Booking`

because, irrespective of the cluster label, we see the responses spread all over `1-5`.

Another important deduction is that people in the `'unsatisfied'` category are mostly the personal travel folks, whereas the business travel people are mostly satisfied.


To do even better at the interpretation, let's re-run KMeans clustering, this time with 6 clusters (an arbitrary choice).

In [None]:
kmeans = KMeans(6)
kmeans.fit(X_train)

In [None]:
def get_melted_clusters(labels):
    result_df = X_train.copy()
    result_df[num_cols] = mms.inverse_transform(X_train[num_cols])
    result_df["cluster"] = labels
    melt_cluster = result_df.groupby("cluster").mean().reset_index().melt(id_vars="cluster")
    melt_cluster = melt_cluster["variable"].str.split("_" , expand=True).join(melt_cluster)
    melt_cluster = melt_cluster.rename({0 : "variable_base" , 1:"response"} , axis=1)
    melt_cluster["cluster"] = melt_cluster["cluster"].astype("category") 
    return melt_cluster

In [None]:
mc2 = get_melted_clusters(kmeans.labels_)

In [None]:
(
    ggplot(mc2[~mc2["response"].isna()],aes(x="response" , y="value" ,fill="cluster"))
    + geom_col(position="fill")
    + coord_flip()
    + facet_wrap("~ variable_base")
    + theme(figure_size=(12,10))
)

From the above chart we can eliminate one more question: `Checkin service`.

Let's try to summarize each cluster:

1. Cluster 0:
    1. Mostly satisfied with their experience
    2. Key for them seems to be `Inflight Entertainment`
    3. Especially happy with `On-board Service, Cleanliness, Food and Drinks`
2. Cluster 1:
    1. Quite unhappy with their experience
    2. Key for them: `Cleanliness, Seat Comfort, Inflight Entertainment`
    3. As we will see below, this is the youngest group on average and is unsatisfied with most in-flight services
3. Cluster 2:
    1. This is the oldest group, with an average age close to 50
    2. Most satisfied with `Cleanliness and Food and Drinks`
    3. Mostly approve of the `Seat comfort` and `Inflight Entertainment`
    4. This is also the group that travels mostly on long-distance flights and does not experience `delays` (likely because, being business customers, they get moved to a different flight)
4. Cluster 3:
    1. This is the `'middle'` cluster , where people have mostly rated 3
5. Cluster 4:
    1. This is the group of people that despite experiencing delays is mostly satisfied with their experience
6. Cluster 5:
    1. This group is unhappy with most of their experience

In [None]:
(
    ggplot(mc2[mc2["response"].isna()],aes(x="cluster" , y="value" ,fill="cluster"))
    + geom_col()
    + coord_flip()
    + facet_wrap("~ variable_base" , scales="free_x")
    + theme(figure_size=(12,5),subplots_adjust={'hspace': 0.5})
)