# Mean Shift applied to Titanic Dataset

We're going to explore the Titanic dataset by clustering it with Mean Shift. The question is whether Mean Shift will automatically separate the passengers into meaningful groups. If it does, we can inspect those groups: first their survival rates, then their attributes, to understand why the algorithm settled on those particular clusters.

In [1]:
import numpy as np
from sklearn.cluster import MeanShift, KMeans
from sklearn import preprocessing
import pandas as pd
import matplotlib.pyplot as plt



| Column | Description |
|---|---|
| pclass | Passenger class (1 = 1st; 2 = 2nd; 3 = 3rd) |
| survived | Survival (0 = No; 1 = Yes) |
| name | Name |
| sex | Sex |
| age | Age |
| sibsp | Number of siblings/spouses aboard |
| parch | Number of parents/children aboard |
| ticket | Ticket number |
| fare | Passenger fare (British pounds) |
| cabin | Cabin |
| embarked | Port of embarkation (C = Cherbourg; Q = Queenstown; S = Southampton) |
| boat | Lifeboat |
| body | Body identification number |
| home.dest | Home/destination |

In [2]:
# Data source: https://pythonprogramming.net/static/downloads/machine-learning-data/titanic.xls
# Download titanic.xls from the URL above and upload it to the Jupyter
# notebook's working directory before running this cell.
# Read the spreadsheet into a pandas DataFrame named df.
df = pd.read_excel('titanic.xls')

In [3]:
# Keep a copy of the original data, then drop the columns that are
# irrelevant for clustering and fill the missing values with 0.

original_df = df.copy()
df.drop(['body', 'name'], axis=1, inplace=True)
df.fillna(0, inplace=True)

In [9]:
def handle_non_numerical_data(df):
    
    # handling non-numerical data: must convert.
    columns = df.columns.values

    for column in columns:
        text_digit_vals = {}
        def convert_to_int(val):
            return text_digit_vals[val]

        #print(column,df[column].dtype)
        if df[column].dtype != np.int64 and df[column].dtype != np.float64:
            
            column_contents = df[column].values.tolist()
            #finding just the uniques
            unique_elements = set(column_contents)
            # great, found them. 
            x = 0
            for unique in unique_elements:
                if unique not in text_digit_vals:
                    # creating dict that contains new
                    # id per unique string
                    text_digit_vals[unique] = x
                    x+=1
            # now we map the new "id" value
            # to replace the string. 
            df[column] = list(map(convert_to_int,df[column]))

    return df

df = handle_non_numerical_data(df)
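
The function above builds its ids by iterating over a `set`, and the iteration order of a set of strings can differ from one Python session to the next, so the same string may receive a different id on a different run. As a minimal sketch of an alternative (not what this notebook uses), `pd.factorize` assigns codes in order of first appearance. The `toy` frame and the helper name `encode_non_numeric` are made up for illustration.

```python
import pandas as pd

# Toy frame standing in for the Titanic data (values are illustrative only).
toy = pd.DataFrame({
    "sex": ["male", "female", "female", "male"],
    "embarked": ["S", "C", "S", "Q"],
    "age": [22.0, 38.0, 26.0, 35.0],
})

def encode_non_numeric(frame):
    """Return a copy with each non-numeric column replaced by integer codes."""
    out = frame.copy()
    for column in out.columns:
        if not pd.api.types.is_numeric_dtype(out[column]):
            # factorize numbers the values in order of first appearance,
            # so the mapping is the same on every run.
            codes, _uniques = pd.factorize(out[column])
            out[column] = codes
    return out

encoded = encode_non_numeric(toy)
print(encoded)
```

Either way, the integer ids are arbitrary labels: the clustering treats them as numbers, so the distance between two encoded categories carries no real meaning.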

In [5]:
df.drop(['ticket', 'home.dest'], axis=1, inplace=True)

In [6]:
X = np.array(df.drop(['survived'], axis=1).astype(float))
X = preprocessing.scale(X)
y = np.array(df['survived'])
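
To make the scaling step concrete, here is a small sketch on a made-up matrix (not the Titanic features). It checks that `preprocessing.scale` matches subtracting each column's mean and dividing by its standard deviation, which is why features like fare and age end up on comparable scales before clustering.

```python
import numpy as np
from sklearn import preprocessing

# Small made-up matrix: two features on very different scales.
X_toy = np.array([[1.0, 100.0],
                  [2.0, 300.0],
                  [3.0, 500.0]])

scaled = preprocessing.scale(X_toy)

# The same result by hand: subtract the column mean, then divide by the
# column standard deviation (population std, i.e. ddof=0).
manual = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0)

print(scaled)
print(np.allclose(scaled, manual))
```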

In [7]:
clf = MeanShift()
clf.fit(X)

MeanShift(bandwidth=None, bin_seeding=False, cluster_all=True, min_bin_freq=1,
     n_jobs=1, seeds=None)
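
The repr above shows `bandwidth=None`, which means scikit-learn estimates the bandwidth itself with `estimate_bandwidth`. The sketch below, on synthetic blobs rather than the Titanic data, calls that function explicitly so the value can be inspected or tuned. The `quantile=0.3` setting and all variable names are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import MeanShift, estimate_bandwidth

# Synthetic data: two well-separated blobs (not the Titanic data).
rng = np.random.RandomState(0)
blob_a = rng.normal(loc=0.0, scale=0.5, size=(100, 2))
blob_b = rng.normal(loc=10.0, scale=0.5, size=(100, 2))
X_demo = np.vstack([blob_a, blob_b])

# Estimate the kernel bandwidth from pairwise neighbor distances,
# then pass it to MeanShift instead of leaving bandwidth=None.
bandwidth = estimate_bandwidth(X_demo, quantile=0.3)
model = MeanShift(bandwidth=bandwidth).fit(X_demo)

print("bandwidth:", bandwidth)
print("clusters found:", len(np.unique(model.labels_)))
```

A larger bandwidth tends to merge clusters and a smaller one tends to split them, so the number of groups Mean Shift reports depends on this value.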

In [8]:
# Now that the model is fitted, we can read some attributes from the clf object:
labels = clf.labels_
cluster_centers = clf.cluster_centers_

In [10]:
print(labels)

[2 2 2 ... 0 0 0]


In [11]:
print(cluster_centers)

[[ 0.34910996  0.20475299 -0.12348244 -0.23556698 -0.2851465  -0.35671074
  -0.3928262   0.20573184 -0.24998224]
 [-1.54609786 -0.30074929  0.97374665 -0.47908676  0.1328818   9.26124543
   1.39416512 -1.81687688  2.40535265]
 [-1.54609786 -0.82287239 -0.02146906  2.40203684  1.86652569  4.44117492
   3.25376159  0.62364835  0.90153727]
 [-1.54609786 -0.30074929  2.1680055   0.48128777  4.17805088  4.44117492
   3.25376159  0.62364835  0.38203741]
 [ 0.84191642 -0.30074929 -1.35790158  0.48128777  9.95686385  0.70136971
  -0.45501345  0.62364835 -0.65696231]]
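
These centers are expressed in scaled units, which makes them hard to read. One possible way to map a center back to the original units, sketched here on made-up numbers, is to scale with `StandardScaler` instead of `preprocessing.scale`, because the scaler object remembers the column means and standard deviations. This is a swapped-in technique, not what the notebook does, and the data and names below are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up feature matrix; the columns could be, say, age and fare.
X_raw = np.array([[22.0, 7.25],
                  [38.0, 71.28],
                  [26.0, 7.92],
                  [35.0, 53.10]])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_raw)

# Pretend this is one row of clf.cluster_centers_ (in scaled units).
center_scaled = X_scaled.mean(axis=0, keepdims=True)

# inverse_transform maps scaled coordinates back to the original units.
center_raw = scaler.inverse_transform(center_scaled)
print(center_raw)
```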


In [12]:
#we're going to add a new column to our original dataframe
original_df['cluster_group']=np.nan

In [13]:
# populate the new column with the cluster label of each row
original_df['cluster_group'] = labels.astype(float)


In [14]:
n_clusters_ = len(np.unique(labels))

In [15]:
print(n_clusters_)

5
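
Besides the number of clusters, it helps to know how many rows fall into each one. A minimal sketch with `np.unique` follows; `labels_demo` is a hypothetical stand-in for `clf.labels_`.

```python
import numpy as np

# Hypothetical label array standing in for clf.labels_.
labels_demo = np.array([0, 0, 2, 1, 0, 2, 0])

# return_counts=True gives each distinct label and how often it occurs.
cluster_ids, counts = np.unique(labels_demo, return_counts=True)
for cluster_id, count in zip(cluster_ids, counts):
    print(f"cluster {cluster_id}: {count} rows")
```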


In [16]:
#we can check the survival rates for each of the groups we happen to find

survival_rates = {}
for i in range(n_clusters_):
    temp_df = original_df[ (original_df['cluster_group']==float(i)) ]
    #print(temp_df.head())

    survival_cluster = temp_df[  (temp_df['survived'] == 1) ]

    survival_rate = len(survival_cluster) / len(temp_df)
    #print(i,survival_rate)
    survival_rates[i] = survival_rate
    
print(survival_rates)

{0: 0.36837815810920943, 1: 1.0, 2: 0.6379310344827587, 3: 0.5, 4: 0.1}
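
The loop above can also be written as a single pandas `groupby`. The sketch below uses a toy frame with made-up values and reports the group size next to each rate, which helps when a rate like 1.0 or 0.1 comes from only a handful of passengers.

```python
import pandas as pd

# Toy frame with the same two columns used above (values are made up).
demo = pd.DataFrame({
    "cluster_group": [0, 0, 0, 1, 1, 2],
    "survived":      [1, 0, 0, 1, 1, 0],
})

# The mean of a 0/1 column is the fraction of ones, i.e. the survival rate.
summary = demo.groupby("cluster_group")["survived"].agg(["mean", "size"])
summary.columns = ["survival_rate", "passengers"]
print(summary)
```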


This is somewhat curious: the ship had three passenger classes, yet Mean Shift found five clusters. Survival ranges from 10% in cluster 4 to 100% in cluster 1, so I immediately wonder whether the clusters track passenger class, with cluster 1 being first class and cluster 4 third class. The classes on the ship were ordered with 3rd class at the bottom and 1st class at the top. The bottom flooded first, and the top is where the lifeboats were. I can look deeper by doing:

In [17]:
#What this does is give us just the rows from the original_df where the cluster_group column is 1.
print(original_df[ (original_df['cluster_group']==1) ])

     pclass  survived                                               name  \
35        1         1                           Bowen, Miss. Grace Scott   
49        1         1                 Cardeza, Mr. Thomas Drake Martinez   
50        1         1  Cardeza, Mrs. James Warburton Martinez (Charlo...   
66        1         1                        Chaudanson, Miss. Victorine   
183       1         1                             Lesurer, Mr. Gustave J   
302       1         1                                   Ward, Miss. Anna   

        sex   age  sibsp  parch    ticket      fare        cabin embarked  \
35   female  45.0      0      0  PC 17608  262.3750          NaN        C   
49     male  36.0      0      1  PC 17755  512.3292  B51 B53 B55        C   
50   female  58.0      0      1  PC 17755  512.3292  B51 B53 B55        C   
66   female  36.0      0      0  PC 17608  262.3750          B61        C   
183    male  35.0      0      0  PC 17755  512.3292         B101        C   
302  

In [18]:
print(original_df[ (original_df['cluster_group']==4) ])

      pclass  survived                                               name  \
629        3         0                        Andersson, Mr. Anders Johan   
632        3         0  Andersson, Mrs. Anders Johan (Alfrida Konstant...   
644        3         0         Asplund, Mr. Carl Oscar Vilhelm Gustafsson   
646        3         1  Asplund, Mrs. Carl Oscar (Selma Augusta Emilia...   
831        3         0                     Goodwin, Mr. Charles Frederick   
832        3         0            Goodwin, Mrs. Frederick (Augusta Tyler)   
1106       3         0             Panula, Mrs. Juha (Maria Emilia Ojala)   
1146       3         0               Rice, Mrs. William (Margaret Norton)   
1179       3         0                              Sage, Mr. John George   
1180       3         0                     Sage, Mrs. John (Annie Bullen)   

         sex   age  sibsp  parch    ticket     fare cabin embarked boat  \
629     male  39.0      1      5    347082  31.2750   NaN        S  NaN   
63

Sure enough, cluster 1 is entirely first class and every passenger in it survived, but it contains only 6 people. Cluster 4 is the opposite: 10 passengers, all third class, of whom only one survived. Let's look into cluster 0, which seemed a bit more diverse. This time, we will use the pandas .describe() method:

In [19]:
print(original_df[ (original_df['cluster_group']==0) ].describe())

            pclass     survived         age        sibsp        parch  \
count  1227.000000  1227.000000  975.000000  1227.000000  1227.000000   
mean      2.349633     0.368378   29.446923     0.427058     0.289324   
std       0.808301     0.482561   14.256699     0.836671     0.638801   
min       1.000000     0.000000    0.166700     0.000000     0.000000   
25%       2.000000     0.000000   21.000000     0.000000     0.000000   
50%       3.000000     0.000000   28.000000     0.000000     0.000000   
75%       3.000000     1.000000   38.000000     1.000000     0.000000   
max       3.000000     1.000000   80.000000     5.000000     4.000000   

              fare        body  cluster_group  
count  1226.000000  113.000000         1227.0  
mean     24.317213  161.530973            0.0  
std      27.738835   98.070317            0.0  
min       0.000000    1.000000            0.0  
25%       7.895800   72.000000            0.0  
50%      13.000000  165.000000            0.0  
75%   

There are 1,227 people here. The mean class is about 2.35, between 2nd and 3rd class, but the group ranges from 1st to 3rd.

Let's check cluster 2, which had the second-highest survival rate:

In [20]:
# This group is mostly 1st class (the median pclass is 1), although pclass ranges up to 3.
print(original_df[ (original_df['cluster_group']==2) ].describe())

          pclass   survived        age      sibsp      parch        fare  \
count  58.000000  58.000000  49.000000  58.000000  58.000000   58.000000   
mean    1.310345   0.637931  32.222790   2.000000   1.120690  160.228086   
std     0.730462   0.484796  15.581265   2.708013   0.880143   66.817344   
min     1.000000   0.000000   0.916700   0.000000   0.000000   69.550000   
25%     1.000000   0.000000  23.000000   0.000000   0.000000  113.162475   
50%     1.000000   1.000000  31.000000   1.000000   1.000000  146.520800   
75%     1.000000   1.000000  40.000000   2.000000   2.000000  221.779200   
max     3.000000   1.000000  67.000000   8.000000   2.000000  263.000000   

             body  cluster_group  
count    5.000000           58.0  
mean    93.400000            2.0  
std     37.792856            0.0  
min     45.000000            2.0  
25%     67.000000            2.0  
50%     96.000000            2.0  
75%    124.000000            2.0  
max    135.000000            2.0  


When we revisit cluster 1, which is all first class, the fares displayed above are either 262.375 or 512.3292, far above cluster 0's mean fare of about 24. Despite cluster 0 having some 1st-class passengers, it's clear that cluster 1 is the most elite group.

Out of curiosity, what is the survival rate of the 1st-class passengers in cluster 0, compared to the overall survival rate of cluster 0?

In [21]:
cluster_0 = (original_df[ (original_df['cluster_group']==0) ])

In [22]:
cluster_0_fc = (cluster_0[ (cluster_0['pclass']==1) ])

In [23]:
print(cluster_0_fc.describe())

       pclass    survived         age       sibsp       parch        fare  \
count   261.0  261.000000  223.000000  261.000000  261.000000  261.000000   
mean      1.0    0.590038   40.116592    0.352490    0.191571   59.846010   
std       0.0    0.492771   13.938324    0.517282    0.465839   39.427437   
min       1.0    0.000000    4.000000    0.000000    0.000000    0.000000   
25%       1.0    0.000000   29.500000    0.000000    0.000000   29.700000   
50%       1.0    1.000000   40.000000    0.000000    0.000000   52.554200   
75%       1.0    1.000000   50.000000    1.000000    0.000000   79.200000   
max       1.0    1.000000   80.000000    2.000000    2.000000  227.525000   

             body  cluster_group  
count   30.000000          261.0  
mean   167.666667            0.0  
std     82.975000            0.0  
min     16.000000            0.0  
25%    113.000000            0.0  
50%    170.500000            0.0  
75%    233.500000            0.0  
max    307.000000         

Sure enough, they have a better survival rate, ~59% versus ~37% for cluster 0 overall, but still much worse than the 100% of the more apparently elite cluster 1 (elite by both ticket price and survival rate).