## **Upload File**

Google Drive is first mounted in Google Colab so that the CSV file stored there can be read, using the following code:

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


# **Clustering Titanic Passengers with K Means**

For this project, we would like to use K Means clustering on the famous Titanic Passenger dataset to see if the K Means algorithm can give us any insight as to which factors led to a passenger's death or survival.

We will start by importing some libraries to use.

In [2]:
import matplotlib.pyplot as plt
from matplotlib import style
style.use('ggplot')
import numpy as np
from sklearn.cluster import KMeans
from sklearn import preprocessing
import pandas as pd

Now we load the CSV file into a dataframe with the read_csv() function from pandas.
The head() function then shows the first 5 rows of the data.

In [3]:
df = pd.read_csv("dataset-1.csv")
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


The info() function prints a concise summary of the dataframe, including the non-null count and dtype of each column.

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


The summary shows missing values in the Age, Cabin, and Embarked columns. We fill every null value with 0.

In [5]:
df.fillna(0, inplace=True)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          891 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        891 non-null    object 
 11  Embarked     891 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
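To make the effect of `fillna(0)` concrete, here is a minimal sketch on a made-up three-row frame (the values are illustrative assumptions, not rows from the dataset). It shows that a numeric column simply gets 0.0 in place of NaN, while a text column ends up holding the integer 0 next to its strings.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; these are not rows from the Titanic file.
toy = pd.DataFrame({
    "Age": pd.Series([22.0, np.nan, 26.0]),
    "Cabin": pd.Series(["C85", None, None], dtype=object),
})

# Same call as in the notebook, but returning a new frame
# instead of modifying the original in place.
filled = toy.fillna(0)

print(filled["Age"].tolist())
print(filled["Cabin"].tolist())
print(int(filled.isnull().sum().sum()))  # total number of remaining nulls
```

Filling with 0 is the simplest choice, but it also means that a missing age is treated as an age of 0 by anything that later measures distances between rows.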


All the nulls have been filled. Because K Means works only on numerical data, we will create a function for handling text columns. It replaces each unique text value in a column with a unique integer, so the column keeps the same distinctions between values in numeric form. For example, in the 'Sex' column, 'female' and 'male' each become 0 or 1. Which value gets which number is arbitrary; in the output shown further down, 'male' became 0 and 'female' became 1.

In [6]:
def handle_non_numerical_data(df):
    """Replace every non-numeric column with integer codes, one per unique value."""
    for column in df.columns.values:
        # Only columns that are neither int64 nor float64 need converting.
        if df[column].dtype != np.int64 and df[column].dtype != np.float64:
            # Maps each unique value to an integer, e.g. {'female': 0, 'male': 1}.
            # Set iteration order is arbitrary, so the exact codes may differ between runs.
            text_digit_vals = {}
            unique_elements = set(df[column].values.tolist())
            for x, unique in enumerate(unique_elements):
                text_digit_vals[unique] = x
            # Swap every value in the column for its integer code.
            df[column] = [text_digit_vals[val] for val in df[column]]

    return df
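The function above builds its mapping by iterating over a set, and the iteration order of a set of strings can change from one Python session to the next, so the codes are not guaranteed to be reproducible. As a hedged alternative (not what this notebook uses), `pandas.factorize` assigns codes in order of first appearance. The sketch below runs on a made-up column, not on the dataset.

```python
import pandas as pd

# Hypothetical toy column; the values are illustrative only.
sex = pd.Series(["male", "female", "female", "male"])

# factorize returns one integer code per row plus the unique labels,
# numbered in order of first appearance.
codes, uniques = pd.factorize(sex)

# Rebuild the label -> code dictionary for inspection.
mapping = {label: code for code, label in enumerate(uniques)}

print(codes.tolist())
print(mapping)
```

Either way, the integers are only labels. K Means will still treat them as magnitudes, which matters most for high-cardinality columns such as Name and Ticket.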

Great, we have our function, now let's run it on our dataframe.

In [8]:
df = handle_non_numerical_data(df)

df.drop(['Cabin', 'SibSp', 'Embarked'], axis=1, inplace=True)

df.head()


Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,Parch,Ticket,Fare
0,1,0,3,458,0,22.0,0,134,7.25
1,2,1,1,781,1,38.0,0,528,71.2833
2,3,1,3,685,1,26.0,0,455,7.925
3,4,1,1,228,1,35.0,0,71,53.1
4,5,0,3,60,0,35.0,0,120,8.05


Looks like everything worked, and we now have only quantifiable data to work with. Let's set up and train the model.

We will start by determining our X and y values, preprocessing (scaling) the X data, and fitting the model. Since this is unsupervised learning, there is no splitting the data for training and testing.

We will set the number of clusters to 2, hoping that the model will separate the passengers into survived and deceased clusters.

In [10]:
X = np.array(df.drop(['Survived'], axis=1).astype(float))
X = preprocessing.scale(X)
y = np.array(df['Survived'])

clf = KMeans(n_clusters=2)
clf.fit(X)


KMeans(n_clusters=2)
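The scaling step matters because K Means groups points by distance, so a column with large values (such as Fare) would otherwise dominate a column with small ones (such as Pclass). The sketch below standardizes a made-up array by hand with NumPy, which is the same per-column calculation that `preprocessing.scale` performs by default: subtract the column mean and divide by the column standard deviation. The array values are assumptions chosen for illustration.

```python
import numpy as np

# Hypothetical toy data: two columns on very different scales.
toy = np.array([
    [1.0, 100.0],
    [2.0, 200.0],
    [3.0, 300.0],
])

# Standardize each column: zero mean, unit (population) standard deviation.
col_mean = toy.mean(axis=0)
col_std = toy.std(axis=0)
scaled = (toy - col_mean) / col_std

print(scaled.mean(axis=0))  # per-column means after scaling
print(scaled.std(axis=0))   # per-column standard deviations after scaling
```

After scaling, both columns contribute equally to the distances, even though the raw values differed by a factor of 100.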

Let's take a look at the 2 clusters by calling describe() on the 'Survived' column of each one.

In [12]:
df['Predicted'] = clf.predict(X)

print(df[df['Predicted'] == 0]['Survived'].describe())

print(df[df['Predicted'] == 1]['Survived'].describe())

count    285.000000
mean       0.645614
std        0.479168
min        0.000000
25%        0.000000
50%        1.000000
75%        1.000000
max        1.000000
Name: Survived, dtype: float64
count    606.000000
mean       0.260726
std        0.439393
min        0.000000
25%        0.000000
50%        0.000000
75%        1.000000
max        1.000000
Name: Survived, dtype: float64
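Since K Means numbers its clusters arbitrarily, cluster 0 is not guaranteed to mean "deceased" and cluster 1 "survived". One way to summarize how well two clusters line up with a binary label is to compute the agreement under both possible labelings and keep the better one. The helper below is a minimal sketch under that assumption. The name `cluster_agreement` and the toy inputs are hypothetical, and the approach only applies to two clusters compared against a 0/1 label.

```python
import numpy as np

def cluster_agreement(labels, truth):
    """Fraction of rows whose cluster matches the true 0/1 label,
    taking whichever of the two possible cluster namings fits better."""
    labels = np.asarray(labels)
    truth = np.asarray(truth)
    match = float(np.mean(labels == truth))
    # If the clusters are numbered the other way round, the agreement is 1 - match.
    return max(match, 1.0 - match)

# Hypothetical toy inputs, not taken from the dataset.
toy_labels = [0, 0, 1, 1, 1]
toy_truth = [1, 1, 0, 0, 1]

print(cluster_agreement(toy_labels, toy_truth))
```

In the notebook this could be called as `cluster_agreement(df['Predicted'], df['Survived'])`.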


# **Conclusion**

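We can see here that the first cluster has a survival rate of about 65% and the second a survival rate of about 26% (the mean of 'Survived' in each cluster). Though this is a discernible difference, it isn't as large a difference as we were hoping. If we treat the first cluster as 'survived' and the second as 'deceased', the counts and means above work out to roughly 184 + 448 = 632 of the 891 passengers, about 71%, falling in the cluster that matches their actual outcome. Although the clusters are not distinct groups of 'survived' and 'dead' like we had hoped, we can still learn from them, so we would call this model somewhat successful.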