**The Young People Survey dataset contains a large amount of information about the music tastes, movie preferences and habits of young people. From these, I thought it would be interesting to see whether I could identify clusters based on movie preferences and the phobias people might have.
Most of the variables are answers to questions on a particular subject, where the respondent chose a value between 1 and 5, 1 being the lowest level of agreement with a statement and 5 the highest.
Since most of the variables are categorical, I used KModes to build the clusters.**
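
As a rough illustration of why KModes suits this kind of data: instead of a Euclidean distance, k-modes compares two rows by counting the questions on which their answers differ (simple matching dissimilarity). The sketch below is a minimal version of that idea; the function name and the two respondents are made up for illustration and are not part of the kmodes library.

```python
import numpy as np

def matching_dissimilarity(a, b):
    """Count the positions where two categorical vectors differ."""
    a = np.asarray(a)
    b = np.asarray(b)
    return int(np.sum(a != b))

# Two made-up respondents answering five questions on the 1-5 scale.
respondent_a = [5, 1, 3, 4, 2]
respondent_b = [5, 2, 3, 1, 2]

# They disagree on the 2nd and 4th answers only.
distance = matching_dissimilarity(respondent_a, respondent_b)
print(distance)
```

Note that under this measure an answer of 4 is as different from 5 as an answer of 1 is, so the ordering of the 1 to 5 scale is ignored.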

In [None]:


import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)


import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))



First, I read the whole dataset.

In [None]:
from sklearn import preprocessing
from kmodes.kmodes import KModes
import matplotlib.pyplot as plt

db = pd.read_csv('/kaggle/input/young-people-survey/responses.csv')
db.head()

There is a fairly large number of variables, so the first step is to look at all of them and keep the ones I'm interested in in a new dataframe.

In [None]:
pd.set_option('display.max_rows', 500)
cols = pd.DataFrame(db.columns)
cols

In [None]:
df1 = db[db.columns[19:30]] #these are the 'Movies' related questions that I want to use
df2 = db[db.columns[63:72]] #these are the 'Phobias' related questions that I want to use
df3 = df1.merge(df2, how = 'inner', left_index = True, right_index = True) #let's put them together
df3.head()

Let's see if there are any missing values:

In [None]:
df3.isnull().sum()

Yes, there are missing values, so I'll replace them with each column's median, using sklearn's SimpleImputer:

In [None]:
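df3_copy = df3.copy()  # real copy, so the original answers stay available later

from sklearn.impute import SimpleImputer

# replace missing answers with the median of each column
imputer = SimpleImputer(strategy = 'median')
X = imputer.fit_transform(df3)

db = pd.DataFrame(X)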
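
To make the effect of median imputation concrete, here is a small pandas-only sketch of the same idea (it uses `fillna` with the column medians rather than SimpleImputer). The toy dataframe and its values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Made-up answers on the 1-5 scale with a few gaps (NaN).
toy = pd.DataFrame({
    "Horror": [1.0, np.nan, 5.0, 3.0],
    "Spiders": [2.0, 4.0, np.nan, np.nan],
})

# Fill each column's gaps with that column's median
# (missing values are ignored when the median is computed).
toy_filled = toy.fillna(toy.median())
print(toy_filled)
```

One thing to keep in mind: when a column has an even number of observed answers, the median can fall between two categories (for example 2.5), which would introduce a value that no respondent actually chose.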

Checking the missing values again:

In [None]:
db.isnull().sum() 

All good, now we have our database as follows:

In [None]:
db.columns = df3_copy.columns
db.head()

The next step is to see how many clusters I need; I'm checking between 1 and 9 clusters:

In [None]:
db = np.array(db)

cost = []
for nb_clusters in range(1, 10):  # 1 to 9 clusters
    kmode = KModes(n_clusters = nb_clusters, init = 'Huang', n_init = 1, verbose = 1)
    kmode.fit_predict(db)
    cost.append(kmode.cost_)  # clustering cost for this number of clusters

In [None]:
k_values = np.arange(1, 10)  # cluster counts tried in the loop above
plt.plot(k_values, cost, marker = 'o')
plt.xlabel('Number of clusters')
plt.ylabel('Cost')

The elbow is not sharply defined; it seems to fall somewhere between 3 and 5, so I will go with 4 clusters.
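
One possible way to make the elbow choice a little less subjective is to look at how much the cost improves with each additional cluster. The sketch below uses made-up cost values purely for illustration; in the notebook the `cost` list from the loop would take the place of `example_cost`.

```python
import numpy as np

# Hypothetical cost values for k = 1..6, made up for illustration only.
example_cost = np.array([1000.0, 800.0, 700.0, 650.0, 640.0, 635.0])
k_values = np.arange(1, len(example_cost) + 1)

# Relative improvement gained by going from k-1 to k clusters.
relative_drop = -np.diff(example_cost) / example_cost[:-1]

for k, drop in zip(k_values[1:], relative_drop):
    print(f"k={k}: cost drops by {drop:.1%}")
```

Since the models here are fitted with `n_init = 1`, the cost curve may change from run to run; a larger `n_init` would likely give a more stable picture.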

In [None]:
km = KModes(n_clusters = 4, init = 'Huang', n_init = 1, verbose = 1)
fitClusters = km.fit_predict(db)

In [None]:
db = df3_copy.reset_index()

clusters_df = pd.DataFrame(fitClusters)
clusters_df.columns = ['clusters_pred']
db_w_clusters = pd.concat([db, clusters_df], axis = 1).reset_index()

db_w_clusters = db_w_clusters.drop(['level_0', 'index'], axis = 1)

db_w_clusters.head()

Now that each respondent has a cluster label (from 0 to 3), let's have a look at the clusters.
First, let's see how the movie lovers are distributed across the clusters.

In [None]:
import seaborn as sns

plt.subplots(figsize = (15,5))

sns.countplot(x=db_w_clusters['clusters_pred'],order=db_w_clusters['clusters_pred'].value_counts().index,hue=db_w_clusters['Movies'])
plt.show() 

It seems that people in all clusters enjoy watching movies.
Let's see whether the clusters differ in what they like to watch:

In [None]:
db_gr = db_w_clusters.groupby('clusters_pred').mean()
db_gr

In [None]:
for col in db_gr.columns:
    plt.subplots(figsize = (15,5))
    sns.countplot(x=db_w_clusters['clusters_pred'],order=db_w_clusters['clusters_pred'].value_counts().index,hue=db_w_clusters[col])
    plt.show() 
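
Because the clusters have different sizes, raw counts can be hard to compare. A possible complement to the count plots is a table of within-cluster shares, sketched here on a made-up dataframe; in the notebook, `db_w_clusters` and any of its columns would take the place of `toy` and `Horror`.

```python
import pandas as pd

# Made-up cluster labels and answers, for illustration only.
toy = pd.DataFrame({
    "clusters_pred": [0, 0, 0, 0, 1, 1],
    "Horror":        [1, 1, 2, 5, 5, 4],
})

# Share of each answer within each cluster (every row sums to 1).
shares = pd.crosstab(toy["clusters_pred"], toy["Horror"], normalize="index")
print(shares.round(2))
```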

# Conclusions:

We can see that cluster 0 has quite a dislike for horror movies, while comedy scores highly for everyone. The sci-fi genre is loved especially in cluster 3, and animated and fantasy movies seem to be the favourite genres of clusters 0 and 3. Westerns seem to be disliked in all clusters, but especially in cluster 0.

However, this analysis would not be interesting if we didn't look at the clusters for movies and phobias together, since that was its main purpose.


Cluster 0 seems to enjoy romantic, animated and fantasy movies a lot, dislikes horror movies and seems to fear spiders, dogs and snakes.

The fears of ageing, heights, storms and flying do not seem to be strong in any of the groups.
