# Titanic Unsupervised KMeans

The goal of this notebook is to look at the Titanic data from a clustering perspective. I will do some basic cleaning and a simple K-Means fit, then analyze the resulting clusters. I do not check very closely whether K-Means did a good job; I just want to analyze the clusters. Normally one might try different models with different hyperparameters.

## Setup

In [None]:
import numpy as np 
import pandas as pd 
import seaborn as sns
import random
import os

from sklearn.cluster import KMeans
#from sklearn.preprocessing import StandardScaler

In [None]:
def seed_everything(seed_value):
    # makes sure i can run this again with the same results!
    random.seed(seed_value)
    np.random.seed(seed_value)
    os.environ['PYTHONHASHSEED'] = str(seed_value)
    
        
seed_everything(2718)

In [None]:
df = pd.read_csv('/kaggle/input/titanic/train.csv')

In [None]:
df

In [None]:
df.info()

# Cleaning

There are some missing values. I am going to either replace them with the column mean or drop the affected rows.

In [None]:
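# assign the result back; chained fillna(..., inplace=True) is deprecated in recent pandas
df['Age'] = df['Age'].fillna(df['Age'].mean())
df['Fare'] = df['Fare'].fillna(df['Fare'].mean())
# drop rows with no Embarked value; .copy() avoids SettingWithCopyWarning when adding columns later
df = df[~df.Embarked.isna()].copy()
df.info()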

There are still missing values in Cabin, but I am not interested in this column for analysis, so I will leave it as is.
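The two cleaning strategies used above (fill a numeric gap with the mean, drop rows with a missing category) can be sketched on a tiny frame. The values here are made up for illustration and are not taken from the Titanic data.

```python
import numpy as np
import pandas as pd

# Toy frame (made-up values) with a gap in a numeric and a categorical column.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 30.0, 38.0],
    "Embarked": ["S", "C", None, "Q"],
})

# Numeric gap: fill with the column mean (NaN is ignored when computing it).
toy["Age"] = toy["Age"].fillna(toy["Age"].mean())

# Categorical gap: drop the affected row instead of guessing a value.
toy = toy.dropna(subset=["Embarked"]).copy()

print(toy)
```

Mean filling keeps every row but pulls the filled values toward the center of the distribution, which is worth keeping in mind when reading Age-related plots.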

In [None]:
passenger_df = df.drop(['PassengerId', 'Pclass', 'Name', 'Cabin', 'Ticket'], axis=1)

Sex and Embarked are categorical. I am going to one-hot encode these columns.
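As a small illustration of what one-hot encoding produces, here is `pd.get_dummies` on a three-row toy frame. The rows are made up; only the column names mirror the real data.

```python
import pandas as pd

# Toy rows (made-up values) with two categorical columns and one numeric column.
toy = pd.DataFrame({
    "Sex": ["male", "female", "female"],
    "Embarked": ["S", "C", "S"],
    "Fare": [7.25, 71.28, 8.05],
})

# Each listed column is replaced by one indicator column per category.
encoded = pd.get_dummies(toy, columns=["Sex", "Embarked"])

print(encoded.columns.tolist())
```

Each category becomes its own 0/1 column, so K-Means can compute distances on them like any other numeric feature.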

In [None]:
passenger_df = pd.get_dummies(passenger_df, columns=['Sex', 'Embarked'])

In [None]:
passenger_df.info()

In [None]:
passenger_df.head()

In [None]:

X = passenger_df

# Clustering

Using K-Means, I will check the inertia to determine the number of clusters. Next, I will look at what each cluster seems to be telling me about the data.
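For reference, inertia is the sum of squared distances from each point to its nearest cluster center. A minimal NumPy sketch on four made-up points shows why it drops as clusters are added; the helper `inertia` is a hypothetical name for illustration, not part of scikit-learn.

```python
import numpy as np

def inertia(points, centroids):
    """Sum of squared distances from each point to its nearest centroid."""
    # Pairwise squared distances, shape (n_points, n_centroids).
    d2 = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    # Keep the distance to the closest centroid for every point, then add up.
    return d2.min(axis=1).sum()

# Two tight pairs of points, far apart from each other (made-up data).
points = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])

one_center = np.array([[5.0, 1.0]])
two_centers = np.array([[0.0, 1.0], [10.0, 1.0]])

print(inertia(points, one_center), inertia(points, two_centers))
```

Inertia always goes down as the number of clusters goes up, so the interesting part of the curve is where the drop starts to flatten out.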

In [None]:
inert = []
for i in range(1,20):
    inert.append(KMeans(n_clusters=i, random_state=2718).fit(X).inertia_)


In [None]:
# use bigger plots
sns.set(rc={'figure.figsize':(11.7,8.27)})

In [None]:
# recent seaborn versions require x and y as keyword arguments
sns.lineplot(x=list(range(1, 20)), y=inert);

Looks like 5 is the best number of clusters based on inertia

In [None]:
km = KMeans(n_clusters=5, random_state=2718)
clusters = km.fit_predict(X)

Now that I have my clusters, I can attach them back to my original dataframe. This way I can look at which group each row was assigned to!
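One detail worth a quick sketch: after rows are dropped, the dataframe index has gaps, and a plain NumPy array of labels is attached by position, not by index. The toy frame and labels below are made up.

```python
import numpy as np
import pandas as pd

# Toy frame; dropping row 1 leaves an index with a gap: 0, 2, 3.
toy = pd.DataFrame({"Fare": [7.25, 71.28, 8.05, 53.1]})
toy = toy.drop(index=1).copy()

# Labels as a NumPy array, one per remaining row (what fit_predict returns).
labels = np.array([0, 1, 0])

# A NumPy array is assigned by position, so the gap in the index is harmless.
toy["clusters"] = labels

# A Series would be aligned on its own 0..n-1 index instead, which misplaces
# labels and leaves NaN where the indexes do not match.
toy["as_series"] = pd.Series(labels)

print(toy)
```

This is why assigning the raw `clusters` array works here, as long as its length matches the number of rows that were clustered.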

In [None]:
# makes a new column called "clusters"
df['clusters'] = clusters

# Plots

In [None]:
sns.barplot(x=clusters, y=df.Fare).set_title('Ticket Price');

In [None]:
sns.countplot(data=df, x='clusters', hue='Embarked').set_title('Embarked Code');

In [None]:
sns.countplot(data=df, x='clusters', hue='Pclass').set_title('Ticket Class');

Pclass was not included in the `passenger_df` as input to the K-Means clustering model. However, I can still see if I was able to separate classes in any kind of meaningful way! It looks like group 1 has the vast majority of the 2nd and 3rd class tickets. The other groups are almost exclusively 1st class.
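The same comparison can be read off as numbers with a cross-tabulation. This is a sketch on made-up labels, assuming a frame with `clusters` and `Pclass` columns like the one above.

```python
import pandas as pd

# Made-up cluster labels and ticket classes for six passengers.
toy = pd.DataFrame({
    "clusters": [0, 0, 1, 1, 1, 1],
    "Pclass":   [1, 1, 2, 3, 3, 1],
})

# normalize="index" turns counts into per-cluster shares (each row sums to 1).
share = pd.crosstab(toy["clusters"], toy["Pclass"], normalize="index")

print(share)
```

Row shares make it easier to compare groups of very different sizes than the raw counts in a count plot.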

In [None]:
sns.barplot(x=clusters, y=df.SibSp).set_title('Number of Siblings/Spouses');

In [None]:
sns.barplot(x=clusters, y=df.Parch).set_title('Number of Parents/Children');

In [None]:
sns.countplot(data=df, x='clusters', hue='Sex').set_title('Sex');

In [None]:
# log1p instead of log: some fares are 0, and log(0) is -inf
sns.boxplot(x=df.clusters, y=np.log1p(df.Fare)).set_title('Log Ticket Price Box-Plot');

In [None]:
sns.barplot(x=df.clusters, y=df.Survived).set_title('Survived');

In [None]:
df[df.clusters == 4]

In [None]:
df[df.Fare == df.Fare.max()]

In [None]:
df.clusters.value_counts()

# Analysis

With 5 groups there is a bit of imbalance. 

### Group 0

* The second largest group
* Seems to be made up mostly of siblings.

---

### Group 1 
* Seems to be heavily 2nd and 3rd class passengers.
* There is a fair number of 1st class passengers, but they make up a smaller share here than in the other groups.
* Seems to be 75% male
* Most people in this group did not survive.

---

### Group 2

* This group seems to be traveling with the most children
* They also paid the second most for tickets

---

### Group 3 

* Small population, but seems to have more females traveling with siblings or children.

---

### Group 4 
    
* Just 3 people, who paid the most for their tickets.
* They seem to be outliers as far as Fare goes
* 100% survived in this group




# Conclusion

Thanks for making it this far. The basic K-Means put most of the non-first class passengers into one group (Group 1). The other groups had different segments of first class passengers. Some traveled with family, while others paid very high amounts for their tickets. More work can be done to see if there is a better clustering model to help explain the data.

Was this helpful? Are you interested in more unsupervised learning? Let me know in the comments.