# Clustering Kaggle's User-base

## Table of Contents

1. [Objective](#obj)
2. [Data Clean and Prep](#clean)
3. [Exploratory Data Analysis](#eda)
4. [Clustering Users](#users)
5. [Describing Clusters](#desc)

## Objective <a id = 'obj'></a>

This notebook aims to identify the different subgroups of Kaggle users. A clustering analysis with K-modes will separate users into several distinct groups, and each group will then be described based on its answers to the survey questions.

## Data Clean & Prep <a id = 'clean'></a>

The data needs cleaning before analysis. Criteria:
1. The first column will be removed: it is not needed for this study and, as a mixed-type column, uses a lot of memory.
2. The first row will be removed, since it only states the full text of each question.
3. Respondents who did not answer Q5 will be removed, because that answer is essential for describing the demographics.


In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv('/kaggle/input/kaggle-survey-2020/kaggle_survey_2020_responses.csv')

In [None]:
# Remove first column
df = df.iloc[:,1:]

In [None]:
# Remove first row 
df = df.iloc[1:]
df = df.reset_index(drop = True)

In [None]:
# Drop rows where question 5 is empty
q5_drop_count = df['Q5'].isna().sum()
df = df.dropna(subset = ['Q5'])

# Reset index
df = df.reset_index(drop = True)

# Display rows dropped
print('rows dropped because q5 was empty: ' + str(q5_drop_count))

## Exploratory Data Analysis <a id = 'eda'></a>

A basic EDA of the dataset can be found in the link below:

https://www.kaggle.com/paultimothymooney/2020-kaggle-data-science-machine-learning-survey

Key takeaways from analyzing the demographics (questions 1 through 5) are:
1. Age: Most participants are in the age groups of 18-21, 22-24, and 25-29.
2. Gender: Participants are predominantly male. 
3. Country: Most participants are from either India or the United States of America. Indian users outnumber US users by more than two to one.
4. Degree: Most participants have either a Master's or Bachelor's degree.
5. Current Role: Most participants are students; the second most common role is data scientist.

## Clustering Users <a id = 'users'></a>

We will use K-modes to identify clusters in our data set, that is, particular types of users. Since we do not have initial labels, it is up to us to describe these clusters.

scikit-learn does not have a built-in method for clustering categorical data, so we will use the ***kmodes*** library for this.
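For intuition, K-modes replaces the Euclidean distance and the mean used by K-means with a simple matching dissimilarity (the number of attributes on which two records differ) and a per-column mode. The sketch below illustrates both ideas in plain Python on made-up records. The function names are hypothetical, and this is not the ***kmodes*** library's implementation.

```python
from collections import Counter

def matching_dissimilarity(a, b):
    """Count the positions at which two categorical records differ."""
    return sum(x != y for x, y in zip(a, b))

def column_modes(records):
    """Return the most frequent value in each column."""
    return [Counter(col).most_common(1)[0][0] for col in zip(*records)]

# Made-up survey-style records: (age group, degree, role)
records = [
    ("18-21", "Bachelor", "Student"),
    ("18-21", "Bachelor", "Data Scientist"),
    ("25-29", "Master", "Student"),
]

# The "centroid" of a group of categorical records is its column-wise mode
centroid = column_modes(records)

# Distance of each record to that centroid, in number of mismatched attributes
distances = [matching_dissimilarity(r, centroid) for r in records]
```

A record that agrees with the centroid on every attribute has distance 0, and each disagreement adds 1.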

Choosing the number of clusters is an essential step in any clustering problem, and there are numerous ways to evaluate it. We will use the elbow method to determine the optimal number of clusters: we want the most distinction between clusters with the least amount of error. The ***yellowbrick*** library provides an easy way to create the elbow plot.
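As a rough illustration of what "finding the elbow" means, the sketch below picks the k at which the cost curve bends most sharply, measured by the largest second difference. The cost values are invented for illustration and are not results from the survey data. The heuristic itself is an assumption of this sketch; ***yellowbrick*** applies its own knee-detection routine.

```python
def elbow_by_second_difference(ks, costs):
    """Return the k where the cost curve bends most sharply.

    For each interior point, the second difference
    (costs[i-1] - costs[i]) - (costs[i] - costs[i+1])
    is large when a big drop in cost is followed by a small one.
    """
    if len(ks) != len(costs) or len(ks) < 3:
        raise ValueError("need at least three (k, cost) pairs of equal length")
    bends = {
        ks[i]: (costs[i - 1] - costs[i]) - (costs[i] - costs[i + 1])
        for i in range(1, len(ks) - 1)
    }
    return max(bends, key=bends.get)

# Invented costs for k = 2..6 (illustrative only)
ks = [2, 3, 4, 5, 6]
costs = [1000, 700, 650, 620, 600]
best_k = elbow_by_second_difference(ks, costs)
```

With these invented numbers the cost falls steeply from k = 2 to k = 3 and only slowly afterwards, so the heuristic selects 3.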

There are more than 300 columns in our dataset. This high dimensionality could introduce noise into the analysis, so we need to choose the questions that create the most distinction. The choice is subjective, but I chose Q1, Q2, Q3, Q4, Q5, Q6, Q7, Q9, Q10, Q11, Q12, Q13, Q14, and Q15.
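Because multi-select questions are spread over several columns (`Q7_Part_1`, `Q7_Part_2`, ..., `Q7_OTHER`), the column list for the chosen questions can also be built from the question names rather than typed by hand. The helper below is a hypothetical sketch that assumes this naming pattern holds for every chosen question.

```python
def columns_for_questions(all_columns, questions):
    """Collect each question's own column plus its '_Part_n' / '_OTHER' columns."""
    selected = []
    for q in questions:
        for col in all_columns:
            # Match 'Q1' exactly or 'Q1_...', but not 'Q10', 'Q11', ...
            if col == q or col.startswith(q + "_"):
                selected.append(col)
    return selected

# A small, hand-written subset of column names in the survey's naming style
all_columns = ["Q1", "Q7_Part_1", "Q7_Part_2", "Q7_OTHER", "Q8", "Q10_Part_1", "Q11"]
chosen = columns_for_questions(all_columns, ["Q1", "Q7", "Q11"])
```

In the notebook this could be called as `columns_for_questions(df.columns, ['Q1', 'Q2', ...])` and the result used to index `df`.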


In [None]:
pip install kmodes

In [None]:
pip install yellowbrick

In [None]:
label_df = df[['Q1', 'Q2', 'Q3', 'Q4', 'Q5', 'Q6', 
               'Q7_Part_1', 'Q7_Part_2', 'Q7_Part_3', 'Q7_Part_4', 'Q7_Part_5', 'Q7_Part_6',
               'Q7_Part_7', 'Q7_Part_8', 'Q7_Part_9', 'Q7_Part_10', 'Q7_Part_11', 'Q7_Part_12' ,'Q7_OTHER',
               'Q9_Part_1', 'Q9_Part_2', 'Q9_Part_3', 'Q9_Part_4', 'Q9_Part_5', 'Q9_Part_6', 'Q9_Part_7',
               'Q9_Part_8', 'Q9_Part_9', 'Q9_Part_10', 'Q9_Part_11', 'Q9_OTHER', 
               'Q10_Part_1', 'Q10_Part_2', 'Q10_Part_3', 'Q10_Part_4', 'Q10_Part_5', 'Q10_Part_6', 'Q10_Part_7',
               'Q10_Part_8', 'Q10_Part_9', 'Q10_Part_10', 'Q10_Part_11', 'Q10_Part_12', 'Q10_Part_13', 'Q10_OTHER',
               'Q11', 'Q12_Part_1', 'Q12_Part_2', 'Q12_Part_3', 'Q12_OTHER', 'Q13',
               'Q14_Part_1', 'Q14_Part_2', 'Q14_Part_3', 'Q14_Part_4', 'Q14_Part_5', 'Q14_Part_6', 'Q14_Part_7',
               'Q14_Part_8', 'Q14_Part_9', 'Q14_Part_10', 'Q14_Part_11', 'Q14_OTHER', 'Q15']]
label_df = label_df.fillna('0')


In [None]:
# Encode categorical data for machine learning
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
label_df_enc = label_df.apply(le.fit_transform)
pd.options.display.max_columns = None
label_df_enc.head(5)

In [None]:
# Elbow method for K-modes
from yellowbrick.cluster import KElbowVisualizer
from kmodes.kmodes import KModes

model = KModes( init = 'Cao', n_init = 1, verbose=1)
# k is the range of cluster counts to try; (2, 7) evaluates 2 through 6 (upper bound exclusive).
visualizer = KElbowVisualizer(model, k=(2,7), timings= True)
visualizer.fit(label_df_enc)        # Fit data to visualizer
visualizer.show()        # Finalize and render figure

From the elbow plot, we can see that the optimal number of clusters is 3, the first major inflection point. 

Now, let's take a look at the clusters and see their characteristics.

In [None]:
from kmodes.kmodes import KModes

km_cao = KModes(n_clusters=3, init = 'Cao', n_init = 1, verbose=1)
fitClusters_cao = km_cao.fit_predict(label_df) # predict cluster

clusterCentroidsDf = pd.DataFrame(km_cao.cluster_centroids_)
clusterCentroidsDf.columns = label_df.columns
pd.options.display.max_columns = None

clusterCentroidsDf

    Cluster 0
Looking at the centroids above, cluster 0's dominant features are being 18-21 years old, male, from India, having or pursuing a Bachelor's degree, and having 1-2 years of programming experience. In other words, this group is just starting to explore data science: its members are still learning, do not have experience with a wide variety of data tools, and are more likely to try new ideas. We can infer that this group uses Kaggle as a learning experience or as something to put on a resume.

    Cluster 1
Looking at the centroid, cluster 1 is mostly 25-29 years old, male, from India, holding or pursuing a Master's degree, working as data scientists, and with 3-5 years of programming experience. In other words, this group has just started a data science career, is generally at an entry level of experience, and is highly educated.

    Cluster 2
For cluster 2, the main qualities are being 35-39 years old, male, from the United States of America, holding a Master's degree, working as data scientists, and having 3-5 years of programming experience. This group is generally older than the previous two clusters and should have more job experience, be more settled in their data science careers, and have used a wider variety of data tools. Since this group is already established in their careers, they could be using Kaggle for intrinsic reasons, such as for fun or to learn about new data methods.

    Percentage Plot
Plotting the proportion of respondents in each predicted cluster in the graph below, we can see that most participants fall under cluster 0. In other words, most of Kaggle's user base is young, in school, most likely pursuing a Bachelor's degree, without extensive coding experience, and using Kaggle as a means to explore the world of data science.

In [None]:
# Combine df and predicted cluster into one df
combinedDf = label_df.reset_index(drop = True)
combinedDf['cluster_predicted'] = fitClusters_cao

# Percentage plot
percentage_group = combinedDf.groupby(['cluster_predicted'])['Q1'].count()
percentage_group = percentage_group / sum(percentage_group)
percentage_group = percentage_group.reset_index().rename(columns = { 'Q1': 'Proportion'})

# settings
plt.figure(figsize = (5,5))
sns.set_style("whitegrid")

# plot
plot2 = sns.barplot(x = 'cluster_predicted', y = 'Proportion', data = percentage_group, palette = 'viridis')
plot2.set(ylim=(0, .75))

# annotate bar plots
for p in plot2.patches:
    plot2.annotate(format(p.get_height(), '.2f'), 
                   (p.get_x() + p.get_width() / 2., p.get_height()), 
                   ha = 'center', va = 'center', 
                   xytext = (0, 9), 
                   textcoords = 'offset points')
    
# title
plt.title('Proportion of Predicted Clusters', fontsize = 20) # title with fontsize 20

plot2

## Describing Clusters <a id = 'desc'></a>

Now we will do a deep dive into the clusters and their characteristics. 
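For the multi-select questions, each heatmap cell in this section is the fraction of a cluster's respondents who selected a given answer, with the string `'0'` standing in for "not selected". The plain-Python sketch below shows that calculation on made-up data; the function name and the example values are hypothetical.

```python
from collections import Counter

def selection_share(cluster_labels, answers, not_selected="0"):
    """Share of each cluster's members whose answer is not the placeholder."""
    totals = Counter(cluster_labels)
    selected = Counter(
        cluster
        for cluster, answer in zip(cluster_labels, answers)
        if answer != not_selected
    )
    return {cluster: selected[cluster] / totals[cluster] for cluster in sorted(totals)}

# Made-up example: five respondents and one multi-select column
cluster_labels = [0, 0, 0, 1, 1]
answers = ["Python", "0", "Python", "0", "Python"]
shares = selection_share(cluster_labels, answers)
```

Here two of the three respondents in cluster 0 and one of the two in cluster 1 selected the answer, so the shares are 2/3 and 1/2.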

In [None]:
# df : the dataframe that contains predicted clusters (null values should be string '0')
# columns : takes list of column names (columns should be in df)
# cluster_col : takes one string of cluster column name (this should be in df) (values should be integers)

def probability_heatmap(df, columns, cluster_col, optional_size = (15,10), title = ''):
    # Collect one row per (cluster, answer) pair; DataFrame.append was removed in pandas 2.0
    rows = []
    clusters_list = sorted(df[cluster_col].unique())
    
    for j in columns:
        data_crosstab = pd.crosstab(df[cluster_col], df[j], margins = True)
                        
        # Each multi-select column holds a single answer label, or '0' where it was not selected
        column_value = [v for v in df[j].unique() if v != '0']
        if not column_value:
            continue  # no respondent selected this option
        column_value = column_value[0]
        
        for i in clusters_list:
            # Share of cluster i that selected this answer
            prob = data_crosstab.at[i, column_value] / data_crosstab.at[i, 'All']
            rows.append({cluster_col : i, ' ' : column_value, 'probability' : prob})
            
    prob_df = pd.DataFrame(rows, columns = [cluster_col, ' ', 'probability'])

    # pivot to a cluster x answer table for the seaborn heatmap
    pivot_df = prob_df.pivot(index = cluster_col, columns = ' ', values = 'probability')

    # set figure size
    sns.set(font_scale=1)
    plt.figure(figsize = optional_size)

    # heatmap settings
    ax = sns.heatmap(pivot_df, vmin=0., vmax=1,cmap='viridis', fmt='.2f', annot = True, square = True,
                     linewidths = .5, cbar = False)
    ax.set_xticklabels(ax.get_xticklabels(), rotation=45, horizontalalignment='right')

    # set title
    plt.title(title, fontsize = 20) # title with fontsize 20
    # show
    plt.show()

In [None]:
# Q7: What programming languages do you use on a regular basis?
Q7_cols = ['Q7_Part_1', 'Q7_Part_2', 'Q7_Part_3', 'Q7_Part_4', 'Q7_Part_5', 'Q7_Part_6',
           'Q7_Part_7', 'Q7_Part_8', 'Q7_Part_9', 'Q7_Part_10', 'Q7_Part_11', 'Q7_Part_12' ,'Q7_OTHER']

probability_heatmap(combinedDf, columns = Q7_cols, cluster_col = 'cluster_predicted',
                   title = 'What programming languages do you use on a regular basis?')

In [None]:
Q9_cols = ['Q9_Part_1', 'Q9_Part_2', 'Q9_Part_3', 'Q9_Part_4', 'Q9_Part_5', 'Q9_Part_6', 'Q9_Part_7',
          'Q9_Part_8', 'Q9_Part_9', 'Q9_Part_10', 'Q9_Part_11', 'Q9_OTHER']

probability_heatmap(combinedDf, columns = Q9_cols, cluster_col = 'cluster_predicted',
                   title = 'Which of the following integrated development environments (IDE\'s) do you use on a regular basis?')

In [None]:
Q10_cols = ['Q10_Part_1', 'Q10_Part_2', 'Q10_Part_3', 'Q10_Part_4', 'Q10_Part_5', 'Q10_Part_6', 'Q10_Part_7',
               'Q10_Part_8', 'Q10_Part_9', 'Q10_Part_10', 'Q10_Part_11', 'Q10_Part_12', 'Q10_Part_13', 'Q10_OTHER']

probability_heatmap(combinedDf, columns = Q10_cols, cluster_col = 'cluster_predicted',
                   title = 'Which of the following hosted notebook products do you use on a regular basis?  ')

In [None]:
# df : the dataframe that contains predicted clusters (null values should be string '0')
# column : takes one string of column name (column should be in df)
# cluster_col : takes one string of cluster column name (this should be in df) (values should be integers)

def probability_heatmap_single_col(df, column, cluster_col, optional_size = (15,10), title = ''):
    # Collect one row per (cluster, answer) pair; DataFrame.append was removed in pandas 2.0
    rows = []
    clusters_list = sorted(df[cluster_col].unique())
    # Answer values, excluding the '0' placeholder used for missing answers
    column_values = [v for v in sorted(df[column].unique()) if v != '0']
    
    data_crosstab = pd.crosstab(df[cluster_col], df[column], margins = True)

    for j in column_values:   
        for i in clusters_list:
            # Share of cluster i that gave answer j
            prob = data_crosstab.at[i, j] / data_crosstab.at[i, 'All']
            rows.append({cluster_col : i, ' ' : j, 'probability' : prob})
            
    prob_df = pd.DataFrame(rows, columns = [cluster_col, ' ', 'probability'])

    # pivot to a cluster x answer table for the seaborn heatmap
    pivot_df = prob_df.pivot(index = cluster_col, columns = ' ', values = 'probability')

    # set figure size
    sns.set(font_scale=1)
    plt.figure(figsize = optional_size)

    # heatmap settings
    ax = sns.heatmap(pivot_df, vmin=0., vmax=1,cmap='viridis', fmt='.2f', annot = True, square = True,
                     linewidths = .5, cbar = False)
    ax.set_xticklabels(ax.get_xticklabels(), rotation=45, horizontalalignment='right')

    # set title
    plt.title(title, fontsize = 20) # title with fontsize 20
    # show
    plt.show()

In [None]:
probability_heatmap_single_col(df = combinedDf, column = 'Q11' , cluster_col = 'cluster_predicted', optional_size = (7,7), 
                               title = 'What type of computing platform do you use most often for your data science projects?')

In [None]:
Q12_cols = ['Q12_Part_1', 'Q12_Part_2', 'Q12_Part_3', 'Q12_OTHER']

probability_heatmap(combinedDf, columns = Q12_cols, cluster_col = 'cluster_predicted', optional_size = (5,5),
                   title = 'Which types of specialized hardware do you use on a regular basis?')

In [None]:
probability_heatmap_single_col(df = combinedDf, column = 'Q13' , cluster_col = 'cluster_predicted', optional_size = (7,7), 
                               title = 'Approximately how many times have you used a TPU (tensor processing unit)?')

In [None]:
Q14_cols = ['Q14_Part_1', 'Q14_Part_2', 'Q14_Part_3', 'Q14_Part_4', 'Q14_Part_5', 'Q14_Part_6', 'Q14_Part_7',
               'Q14_Part_8', 'Q14_Part_9', 'Q14_Part_10', 'Q14_Part_11', 'Q14_OTHER']
probability_heatmap(combinedDf, columns = Q14_cols, cluster_col = 'cluster_predicted',
                   title = 'What data visualization libraries or tools do you use on a regular basis?')

In [None]:
probability_heatmap_single_col(df = combinedDf, column = 'Q15' , cluster_col = 'cluster_predicted', optional_size = (10,10), 
                               title = 'For how many years have you used machine learning methods?')