## Analyzing the bachelorette data set

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
data = pd.read_csv('..//input//bachelorettedataset//BacheloretteDSFinal-Dogu.csv')

In [None]:
data.head()

In [None]:
data["Season"].unique()   # to see how many seasons the data covers

In [None]:
# cleaning the data

In [None]:
data.isnull().sum() # checking out missing values

In [None]:
# handling missing values

In [None]:
data = data[data['Name'].notnull()].copy()  # dropping the empty rows (all of them have a missing name); copy() avoids chained-assignment warnings below

In [None]:
data.loc[data['Height (cm)'].isnull(),['Height (cm)']] = data['Height (cm)'].mean() # filling missing height info with mean

In [None]:
data.loc[data['College'].isnull(),'College'] = data['College'].value_counts().index[0] # filling in with most common college

In [None]:
data.isnull().sum() # better

In [None]:
data.info()

In [None]:
data[data['Win_Loss']==1]

Only 5 of the 141 contestants were chosen, so an ML model would not have enough positive examples to learn from (it might as well guess). We'll therefore stick to analyzing the data.
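
To put numbers on that imbalance, here is a quick back-of-the-envelope sketch. It only uses the two counts quoted above (141 contestants, 5 winners); the variable names are made up for illustration.

```python
# Counts taken from the text above: 141 contestants, 5 winners.
n_contestants = 141
n_winners = 5

win_rate = n_winners / n_contestants   # share of positive examples
baseline_accuracy = 1 - win_rate       # accuracy of a "model" that always predicts "No"

print(f"Win rate: {win_rate:.1%}")
print(f"Always-'No' baseline accuracy: {baseline_accuracy:.1%}")
```

A classifier would have to beat that trivial baseline using only five positive rows, which is why guessing is about as good.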

In [None]:
# replace outcome with "Yes" or "No"

# map the numeric flag directly; replacing the strings '1.0'/'0.0' would not match numeric values
data['Win_Loss'] = data['Win_Loss'].map({1: 'Yes', 0: 'No'})

In [None]:
# cleaning up a little

data['Age'] = data["Age"].astype('int')
data['Height (cm)'] = data['Height (cm)'].round(decimals=2)  # assign back; round() returns a new Series

In [None]:
print(data['Hometown'].nunique())
print(data['Occupation'].nunique())
print(data['College'].nunique())
print(data['State'].nunique())

In [None]:
# dropping columns whose values are almost all unique (too few shared values to be useful)

data.drop(['Hometown','Occupation','College'],axis=1,inplace=True)

## Doing some clustering

Because we are dealing with both numerical (age, height) and categorical (state, hair color, eye color) variables, K-means will not work: it relies on Euclidean distance, which is not meaningful for categories. K-modes can work with categorical data, but only categorical data. To include the numerical columns in the analysis as well, we use K-prototypes from the same library, so we have to import KPrototypes.

more info here: https://github.com/nicodv/kmodes
original paper here: https://pdfs.semanticscholar.org/d42b/b5ad2d03be6d8fefa63d25d02c0711d19728.pdf
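
As a rough illustration of how K-prototypes compares two rows, the sketch below adds the squared Euclidean distance on the numerical columns to a weighted count of categorical mismatches. The function name `mixed_dissimilarity`, the two contestants, and the weight `gamma=0.5` are all made up for this example; it is not the kmodes implementation, which picks its own weight by default.

```python
import numpy as np

def mixed_dissimilarity(num_a, num_b, cat_a, cat_b, gamma=1.0):
    """Squared Euclidean distance on the numeric parts plus
    gamma times the number of categorical mismatches."""
    diff = np.asarray(num_a, dtype=float) - np.asarray(num_b, dtype=float)
    numeric_part = float(np.sum(diff ** 2))
    categorical_part = sum(a != b for a, b in zip(cat_a, cat_b))
    return numeric_part + gamma * categorical_part

# Hypothetical contestants: (age, height in cm) and (state, hair color, eye color)
d = mixed_dissimilarity(
    [28, 183.0], [30, 180.0],
    ["TX", "Brown", "Blue"], ["CA", "Brown", "Green"],
    gamma=0.5,
)
print(d)  # 13 from the numeric part + 0.5 * 2 mismatches = 14.0
```

The weight `gamma` controls how much a categorical mismatch counts relative to the numerical distance, so the scale of the numerical columns matters.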

In [None]:
from kmodes.kprototypes import KPrototypes

# determine no of clusters with elbow method
cost = []
K = range(1,7)
for num_clusters in list(K):
    kproto = KPrototypes(n_clusters=num_clusters, init = "Cao")
    kproto.fit_predict(data.drop(['Name'],axis=1), categorical=[0,2,3,5,6,7])
    cost.append(kproto.cost_)
    
plt.plot(K, cost, 'bx-')
plt.xlabel('k clusters')
plt.ylabel('Cost')
plt.title('Elbow Method For Optimal k')
plt.show()

In [None]:
# let's try 5
kproto = KPrototypes(n_clusters=5, init='Cao')
clusters = kproto.fit_predict(data.drop(['Name'],axis=1), categorical=[0,2,3,5,6,7]) #define the cols which are categorical
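
The hard-coded positions in `categorical=[...]` break silently if the column order changes. One possible alternative, sketched below, derives the positions from the column dtypes. The helper `categorical_positions` and the toy frame are hypothetical; the real data set has different columns.

```python
import pandas as pd

def categorical_positions(df):
    """Return the positional indices of the non-numeric columns of df."""
    return [
        i for i, dtype in enumerate(df.dtypes)
        if not pd.api.types.is_numeric_dtype(dtype)
    ]

# Hypothetical frame, only loosely modeled on the data set
toy = pd.DataFrame({
    "Season": ["11", "11", "12"],
    "Age": [28, 30, 26],
    "State": ["TX", "CA", "NY"],
    "Height (cm)": [183.0, 180.0, 178.0],
    "Hair Color": ["Brown", "Blond", "Black"],
})

cat_idx = categorical_positions(toy)
print(cat_idx)  # [0, 2, 4]
```

Columns that are categorical in meaning but stored as numbers (a season number, for instance) are not detected this way and would still have to be added by hand.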

In [None]:
data['cluster'] = clusters

In [None]:
data.head()

In [None]:
#print the cluster groups
gb = data[['Name','cluster']].groupby('cluster')
for key, item in gb:
    print(gb.get_group(key), "\n\n")

In [None]:
# another way to see the clusters
# for s, c in zip(data['Name'], clusters):
#     print("Name: {}, cluster:{}".format(s, c))

In [None]:
for c in range(0,5):
    print("Cluster {} is composed of {} people".format(c, data[data['cluster']==c].shape[0]))

In [None]:
# rough visualization of the clusters

fig, ax = plt.subplots(figsize=(30,10))
ax.scatter(data['Name'], clusters, c=clusters, s=200)
ax.set_ylabel('Cluster')
ax.set_xticks([])        # hide the x ticks; each point is annotated with the name instead
ax.set_yticks(range(5))  # one tick per cluster, same cluster names as in the print above
ax.set_ylim(-1,6)
for x,y in zip(data['Name'],clusters):
    # label each point with the contestant's name
    ax.annotate(x,
                (x,y),
                textcoords="offset points",
                xytext=(0,10),
                ha='left',
                rotation=45)

plt.show()

## Dimension reduction

Again, PCA won't work on our data because of the categorical variables.
In this case, the way to go is a Factor Analysis of Mixed Data (FAMD).

We will use the prince library, which can run a FAMD in Python.

More about it here: https://nextjournal.com/pc-methods/calculate-pc-mixed-data?change-id=CWQNw1kVRgQMFFzobMC2bo&node-id=d4243af6-f940-41fc-8ffa-a235bc135601
https://github.com/MaxHalford/prince
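
For intuition, here is a minimal sketch of the idea behind a FAMD: standardize the numerical columns, one-hot encode the categorical ones and scale each indicator by the square root of its category's frequency, then run a PCA (via SVD) on the combined table. The function `famd_sketch` and the toy frame are assumptions made for this example; this is not how prince implements it and the numbers will not match prince's output.

```python
import numpy as np
import pandas as pd

def famd_sketch(df, n_components=2):
    """Toy FAMD: returns row coordinates and the explained inertia ratio."""
    num = df.select_dtypes(include="number")
    cat = df.select_dtypes(exclude="number")

    # Numerical columns: center and scale to unit variance
    z_num = (num - num.mean()) / num.std(ddof=0)

    # Categorical columns: indicators divided by sqrt(category share), then centered
    dummies = pd.get_dummies(cat).astype(float)
    z_cat = dummies / np.sqrt(dummies.mean())
    z_cat = z_cat - z_cat.mean()

    # PCA on the combined table through an SVD
    Z = np.hstack([z_num.to_numpy(), z_cat.to_numpy()])
    U, s, _ = np.linalg.svd(Z, full_matrices=False)
    eigenvalues = s ** 2 / len(df)
    explained = eigenvalues / eigenvalues.sum()
    coords = U[:, :n_components] * s[:n_components]
    return coords, explained[:n_components]

# Hypothetical data, not taken from the data set
toy = pd.DataFrame({
    "Age": [28, 30, 26, 33, 29],
    "Height (cm)": [183.0, 180.0, 178.0, 190.0, 175.0],
    "Hair Color": ["Brown", "Blond", "Brown", "Black", "Blond"],
    "State": ["TX", "CA", "NY", "NY", "CA"],
})

coords, explained = famd_sketch(toy)
print(coords.shape)
print(explained)
```

The scaling of the indicators is what lets numerical and categorical variables contribute on a comparable footing.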

In [None]:
pip install prince

In [None]:
import prince

In [None]:
famd = prince.FAMD(
     n_components=5,
     n_iter=10,
     copy=True,
     check_input=True,
     engine='auto',       ## Can be "auto", 'sklearn', 'fbpca'
     random_state=42)

In [None]:
famd = famd.fit(data.drop('Win_Loss', axis=1)) ## Exclude target variable

In [None]:
print(famd.explained_inertia_)

In [None]:
# The first dimension accounts for 55% of the variance within the data
# Will update the notebook as soon as I can
# To do: categorize everything and run an MCA
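
One possible way to "categorize everything" before an MCA is to bin the numerical columns with `pd.cut`. The sketch below is only an illustration: the ages, bin edges, and labels are arbitrary choices, not values derived from the data set.

```python
import pandas as pd

# Hypothetical ages; the bin edges and labels are arbitrary choices
ages = pd.Series([24, 27, 31, 35], name="Age")

age_group = pd.cut(
    ages,
    bins=[0, 26, 30, 100],              # intervals (0, 26], (26, 30], (30, 100]
    labels=["<=26", "27-30", ">30"],
)
print(age_group.tolist())  # ['<=26', '27-30', '>30', '>30']
```

The same approach would apply to height, after which every column is categorical and an MCA becomes applicable.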