# How unique are you on Kaggle?
I apologize for the clickbaity title, but it encapsulates the question I wanted to answer in this notebook: given my answers to the survey, how unique am I compared to other people on Kaggle?

## My Answers
I did not actually participate in the survey (and that wouldn't have mattered anyway since it's all anonymized), so first up I'll make a small interactive widget allowing me to quickly fill in the questionnaire based on answers provided by other users. This is not perfect (e.g. nobody from Denmark apparently filled in the survey, so I guess I'll be Swedish today), but I reckon it's good enough for this exercise to be valid and fun anyway :)

**Note:** By forking this notebook, you can of course fill in your answers to the questions, and get a personalized analysis :)

In [None]:
!pip install trimap scikit-plot --quiet | cat

In [None]:
import time
import re
from typing import Tuple

import numpy as np
import pandas as pd

import trimap
from sklearn.preprocessing import normalize
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import ExtraTreesClassifier
from joblib import Parallel, delayed

from ipywidgets import Label, Accordion, SelectMultiple, Dropdown, VBox

from plotly.subplots import make_subplots
import plotly.express as px
import plotly.graph_objects as go
import matplotlib.pyplot as plt
import seaborn as sns
import scikitplot as skplt

sns.set()
# Show full column contents (None means no truncation; -1 is deprecated)
pd.set_option('display.max_colwidth', None)

# Read the raw data
raw = pd.read_csv('/kaggle/input/kaggle-survey-2020/kaggle_survey_2020_responses.csv').iloc[:, 1:]

# Strip whitespace from all entries
raw = raw.apply(lambda column: column.str.strip(), axis=0)

# We'll keep the processed data in 'df'. We replace everything in first row with empty strings; a placeholder for user answers 
df = raw.copy().fillna('')
df.iloc[0] = ''

# Get ordered list of questions names (i.e. exclude the _Part and _Other columns)
questions_names = [re.sub(r'(_Part_\d+|_OTHER)', '', q) for q in raw.columns]
questions_names = [x for i, x in enumerate(questions_names) if x not in questions_names[:i]]

# Keep track of which question names have multiple raw columns
has_multiple = [True if any([k for k in df.columns if f'{c}_Part' in k]) else False for c in questions_names]

# Get the question texts (& remove anything after 'Selected Choice')
questions_texts = [raw.loc[0, q] if q in raw.columns else raw.loc[0, f'{q}_Part_1'] for q in questions_names]
questions_texts = [re.sub(r'\s-\sSelected.+', '', q) for q in questions_texts]

# Create a tab for each of the questions
tabs = []
for name, text in zip(questions_names, questions_texts):
    
    # Get all answer options (in all related columns) for this question
    related = [c for c in raw.columns if name == c or f'{name}_' in c]
    options = sorted(pd.unique(df[related].values.ravel()))
    
    # Type of widget
    select_widget = (SelectMultiple(options=options) 
        if 'all that apply' in text
        else Dropdown(options=options))

    # Show the dropdown
    tabs.append(VBox([
        Label(text),
        select_widget
    ]))
    
# Set the question titles for each child of the accordion
tab = Accordion(children=tabs)
for i, name in enumerate(questions_names):
    tab.set_title(i, name)

In [None]:
display(tab)

Once the questions have been filled out, we can move on to preparing the data for analysis.

# Featurizing the data
In order to quantify similarity between my answers and those in the survey, the data has to be featurized in some manner. To keep things simple, I went for the following approach:

* As a simplification we'll assume that all data is categorical (even those categories that are strictly ordinal in nature), and one-hot-encode them all, taking into account that some questions allow multiple answers (i.e. multiple `1`s in our encoding).
* Apply L2 normalization to scale all user vectors to unit norm. By doing this, we can calculate the cosine similarity between any two users simply as the dot product of their feature vectors.
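
As a minimal sketch of the two steps above, here is the same idea on a made-up toy matrix rather than the survey data (the `answers` array and its three respondees are hypothetical):

```python
import numpy as np

# Hypothetical toy data: 3 respondees x 4 one-hot/multi-hot features.
# Columns 0-1 encode a single-answer question, columns 2-3 a multi-answer one.
answers = np.array([
    [1, 0, 1, 1],
    [1, 0, 1, 0],
    [0, 1, 0, 0],
], dtype=float)

# Scale every row to unit L2 norm
unit = answers / np.linalg.norm(answers, axis=1, keepdims=True)

# For unit vectors, the dot product equals the cosine similarity
similarity = unit @ unit.T
print(np.round(similarity, 3))
```

The diagonal is `1` (everyone is identical to themselves), and respondees sharing no answers get a similarity of `0`.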

In [None]:
# If no answers were provided, then use mine (note: I'm okay with this being public)
did_answer = any([True if t.children[1].index else False for t in tab.children])
if not did_answer:
    mathias_answers = [4,1,45,2,6,2,(1, 5, 10, 12),10,(1, 10, 12),(4, 6, 7, 11),2,(1,),1,(7, 10, 11),6,(3, 4, 6, 7, 11, 12, 13, 14, 16),(1, 2, 3, 4, 5, 7, 8, 11, 12),(1, 3, 4, 6),(1, 2, 5, 6),2,7,5,(1, 2, 3, 4, 5, 6),5,3,(2, 5),(1, 2, 3, 4, 5, 10),(10,),(2, 11, 16, 17),15,(8,),0,(1, 2, 3, 4, 5, 6),(1, 2, 3, 5, 11),(9, 11),(1, 2, 4),(2, 3, 4, 9),4,(1, 2, 5, 8, 12),3,(1, 5, 8),(10,),(18,),(8,),(1, 2, 3, 4, 5, 6),(5,),(9, 11)]
    for child, answer in zip(tab.children, mathias_answers):
        child.children[1].index = answer
    did_answer = True
    
# Go through the questions one by one
result = []
for i, (question, multiple) in enumerate(zip(questions_names, has_multiple)):        
    
    # USER ANSWER
    ############################
    
    # Get the values picked in the widget
    value = tab.children[i].children[1].value    
    
    # Check if single or multiple label
    if not multiple:
        df.loc[0, question] = value
    else:
        columns = [c for c in df.columns if f'{question}_' in c]
        for column in columns:
            options = df[column].unique()
            if any([k in options for k in value]):
                df.loc[0, column] = 1 
    
    # FEATURIZE
    ############################
    
    if not multiple:
        # If a single-answer question, then just get OHE columns
        result.append(pd.get_dummies(df[question]))    
    else:
        # Convert texts to 1s and nothings to 0s
        columns = [c for c in df.columns if f'{question}_' in c]
        result.append((df[columns] != '').astype(int))

# Concatenate all the results. Output is [responses X features]
result = pd.concat(result, axis=1)
feature_names = result.columns

# L2 Normalize the data
result = normalize(result, norm='l2', axis=1)

# Calculate the cosine similarities between all pairs of users
# (rows are unit vectors, so the dot product is the cosine similarity)
distances = result.dot(result.T)

# So, how unique am I?
With all the prep-work out of the way, we can now try to answer the question at hand. "Uniqueness" is something that could be quantified using a variety of approaches, e.g. multivariate density estimation, correlation metrics, or some other custom logic. To keep things simple, I calculated the dot product between my L2-normalized answer vector and all the other response vectors, which gives the cosine similarity.

With this I get the cosine similarity between my answers and those of all other respondees, with values of `1` being very similar and values of `0` being very dissimilar. To get a single "score" for uniqueness, I've chosen to take the mean of these similarities to all other users, so that a *lower score* means *more unique* and a *higher score* means *less unique*.
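
To make the scoring concrete, here is a small sketch of the mean-similarity score on a hypothetical 4x4 similarity matrix (the numbers are invented for illustration and are not taken from the survey):

```python
import numpy as np

# Hypothetical similarities between 4 respondees (symmetric, 1 on the diagonal)
similarity = np.array([
    [1.0, 0.8, 0.6, 0.1],
    [0.8, 1.0, 0.7, 0.2],
    [0.6, 0.7, 1.0, 0.1],
    [0.1, 0.2, 0.1, 1.0],
])

# Mean similarity per respondee (the self-similarity of 1 is included in the mean)
scores = similarity.mean(axis=1)

# The lowest mean similarity marks the most unique respondee
most_unique = int(np.argmin(scores))
print(scores, most_unique)
```

Here the last respondee shares little with anyone else and therefore ends up with the lowest score.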

In [None]:
# Calculate uniqueness score of all Kagglers, as well as for the notebook answers
scores = distances.mean(axis=1)
score = scores[0]

# Create histogram from scores
counts, bins = np.histogram(scores, bins=np.linspace(0.1, 0.4, num=101))
bins = 0.5 * (bins[:-1] + bins[1:])

# For histogram of notebooks answers, we only show one count
notebook_counts = np.zeros_like(counts)
notebook_counts[np.abs(bins - score).argmin()] = np.max(counts) + 100

# Show plot with all kaggler scores & notebook scores
fig = go.Figure([
    go.Bar(
        x=bins, 
        y=counts,
        name='Other kaggler scores',
        marker=dict(
            color=sns.color_palette('deep', 1).as_hex()[0]
        )
    ),
    go.Bar(
        x=bins, 
        y=notebook_counts,
        name='Answers in this notebook',
        marker=dict(
            color='darkred'
        )
    )
], layout=dict(
    xaxis_title='<-- More unique | Similarity Score | Less unique -->',
    yaxis_title='# Responses',
    yaxis=dict(range=[0, np.max(counts)+100]),
    legend=dict(
        orientation="h",
        yanchor="bottom",
        y=1.02,
        xanchor="right",
        x=1
    ), 
    barmode='overlay'
))

fig.add_annotation(
    text=f"""
    Notebook Score: {score:.3f}<br />
    More unique answers: {(scores < score).sum()}<br />
    Less unique answers: {(scores > score).sum()}
    """,
    xref="paper", yref="paper",
    x=0.05, y=0.9, 
    showarrow=False
)


fig.show()

I got a score of 0.238 and a uniqueness rank of 4097 - i.e. 4097 people are more unique than me. Guess I'm pretty average based off this metric :)

# Visualizing the Survey
With the previous section we deduced a quantitative score for how similar the answers provided in this notebook are to answers given in the survey. It could also be interesting to see this visually - i.e. let us try to do some dimensionality reduction, and see where the answers filled into the notebook widget lie in a 2D plot containing all the other respondees in the survey. 

The idea of the dimensionality reduction algorithm is to retain as much information about the dataset  as possible in a lower dimensional space (typically 2D), which can be achieved in different ways, see e.g. [examples of this in scikit-learn](https://scikit-learn.org/stable/modules/manifold.html). These days the type of dimensionality reduction we're trying to do is often achieved with either t-SNE or UMAP, but in this notebook I'll use a recent technique called TriMap [(See paper here)](https://arxiv.org/abs/1910.00204), which in my experience empirically works better than t-SNE/UMAP when it comes to global accuracy, and which is faster to run.

In [None]:
import trimap

# Create TriMap embeddings
trimap_embedding = trimap.TRIMAP(
  n_iters=1000,
  distance='angular',
  apply_pca=True,
  weight_adj=5000
).fit_transform(result)

In [None]:
# What column to color by
COLOR_COL = 'Q6'

# Use the seaborn palette
classes = df[COLOR_COL].unique()
color_palette = sns.color_palette('deep', len(classes)).as_hex()
color_map = {c: color_palette[i] for i, c in enumerate(classes)}

# Put TRIMAP embeddings into dataframe & attach question information
trimap_df = pd.DataFrame({
    'Component 1': trimap_embedding[:, 0], 
    'Component 2': trimap_embedding[:, 1],
    COLOR_COL: df[COLOR_COL]
})

# Visualize the embeddings in 2D plot
fig = px.scatter(
    trimap_df, 
    x='Component 1', y='Component 2', 
    color=COLOR_COL,
    color_discrete_map=color_map,
    opacity=0.8
)
fig.update_traces(
    marker=dict(size=5, line=dict(width=1, color='DarkSlateGrey')),
    selector=dict(mode='markers')
)
fig.add_trace(
    go.Scattergl(
        mode='markers',
        x=[trimap_df.loc[0, 'Component 1']],
        y=[trimap_df.loc[0, 'Component 2']],
        name='My Answers',
        marker=dict(
            color='red',
            size=20,
            line=dict(
                color='DarkSlateGrey',
                width=2
            )
        ),
        showlegend=True
    )
)
fig.show()

I've colored the plots by years of coding experience - it's clear there's a long tail of un-filled responses, as well as a tail of "I have never written code" responses - these seem to be more "unique", in a sense. Interestingly, I find myself in the big pile of respondees with different experience levels. Could be fun to dig deeper to see if there are some groupings of people, which may align more or less with my (or yours, if you fork this notebook) responses.

## Clustering
I'll use a simple KMeans to cluster in the L2 normed space, restricting focus to those respondees that have written code before, i.e. ignoring the long tails seen in the TriMap embedding. First, let's find how many clusters to use, which I do by scoring the KMeans algorithm with different numbers of clusters.
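
The score in question is the K-means objective, i.e. the sum of squared distances from each point to its nearest centroid (`KMeans.score` returns the negative of it). A rough sketch of that quantity, using hand-picked centroids on hypothetical toy data instead of an actual KMeans fit:

```python
import numpy as np

def sum_of_squared_errors(X, centroids):
    """Sum of squared distances from each point to its nearest centroid."""
    # Pairwise squared distances, shape (n_points, n_centroids)
    sq_dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return sq_dists.min(axis=1).sum()

# Hypothetical toy data: two tight groups of points in 2D
X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]])

one_centroid = X.mean(axis=0, keepdims=True)         # best single centroid
two_centroids = np.array([[0.0, 0.5], [10.0, 0.5]])  # one centroid per group

print(sum_of_squared_errors(X, one_centroid))   # 101.0
print(sum_of_squared_errors(X, two_centroids))  # 1.0
```

The error can only go down as clusters are added, so the interesting part of the curve is where the improvement starts to level off.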

In [None]:
def _scoreKMeans(X: np.ndarray, n_clusters: int) -> Tuple[float, float]:
    """Score KMeans with a given number of clusters on a dataset X. Return score, time_spent"""
    start = time.time()
    kmeans = KMeans(n_clusters=n_clusters)
    return -kmeans.fit(X).score(X), time.time() - start


# Get the respondees that have written code before
idx = (df.Q6 != '') & (df.Q6 != 'I have never written code')

# Perform KMeans with different numbers of clusters
cluster_ranges = np.arange(1, 30, 3)
       
# Run the KMeans clusterings with different clustering ranges
tuples = Parallel(n_jobs=-1)(
    delayed(_scoreKMeans)(result[idx], i) for i in cluster_ranges
)
cluster_scores, cluster_times = zip(*tuples)

In [None]:
# Color scheme
colors = sns.color_palette('deep', 2).as_hex()

# Create plot with secondary y axis
fig = make_subplots(specs=[[{"secondary_y": True}]])

# Create plot
fig.add_trace(
    go.Scatter(
        x=cluster_ranges, 
        y=cluster_scores,
        mode='lines+markers',
        marker=dict(
            color=colors[0]
        )
    ),
    secondary_y=False,
)

fig.add_trace(
    go.Scatter(
        x=cluster_ranges, 
        y=cluster_times,
        mode='lines+markers',
        marker=dict(
            color=colors[1]
        )
    ),
    secondary_y=True,
)

# Set axes titles
fig.update_yaxes(
    title_text="<b>Sum of Squared Errors</b>", 
    title_font=dict(color=colors[0]),
    secondary_y=False
)
fig.update_yaxes(
    title_text="<b>Clustering duration (seconds)</b>", 
    title_font=dict(color=colors[1]),
    secondary_y=True
)
fig.update_xaxes(title_text="<b>Number of Clusters</b>")

# Remove legend
fig.update_layout(showlegend=False)

fig.show()

Looks like about 15 clusters is reasonable - i.e. beyond this point the sum of squared errors no longer decreases as rapidly with more clusters. Let's see how these clusters would be distributed in the TriMap plot from before (note that clustering is done in the original feature space, and not in the 2D TriMap space).

In [None]:
# Create clusterer & fit it to those who've written code
clusterer = KMeans(n_clusters=15)
clusterer.fit(result[idx])

# Create a dataframe with all respondees & their assigned clusters for those who write code
cluster_df = pd.DataFrame(result[idx], columns=feature_names)
cluster_df['cluster'] = -1
cluster_df['cluster'] = clusterer.labels_

# Create cluster name
cluster_df['cluster_name'] = cluster_df['cluster'].apply(lambda x: f'Cluster {x}')
color_palette = sns.color_palette('deep', len(cluster_df.cluster.unique())).as_hex()
color_map = {f'Cluster {x}': color_palette[x] if x >= 0 else 'lightgrey' for x in clusterer.labels_}

# Put TRIMAP embeddings into dataframe & attach question information
trimap_df = pd.DataFrame({'Component 1': trimap_embedding[idx, 0], 'Component 2': trimap_embedding[idx, 1]})
trimap_df = pd.concat([trimap_df, cluster_df], axis=1)

# Sort by cluster to put -1 first
trimap_df = trimap_df.sort_values('cluster')

# Visualize the embeddings in 2D plot
fig = px.scatter(
    trimap_df, 
    x='Component 1', y='Component 2', 
    color='cluster_name',
    color_discrete_map=color_map,
)
fig.update_traces(
    marker=dict(size=5, line=dict(width=1, color='DarkSlateGrey')),
    selector=dict(mode='markers')
)
fig.add_trace(
    go.Scattergl(
        mode='markers',
        x=[trimap_df.loc[0, 'Component 1']],
        y=[trimap_df.loc[0, 'Component 2']],
        name='My Answers',
        marker=dict(
            color='red',
            size=20,
            line=dict(
                color='DarkSlateGrey',
                width=2
            )
        ),
        showlegend=True
    )
)
fig.show()

Looks like clustering in the L2-normed space translates well to the 2D representation from TriMap, but although the plot looks nice, it does not tell us too much, except for the fact that we're in a cluster with some random ID. 

### Interpreting the clusters
To remedy this lack of cluster interpretation, I'll try to see if we can determine which answers are more important for each cluster, and from that gain a bit more insight. The way I'm going to go about doing this is by setting up an extremely randomized trees classifier (`ExtraTreesClassifier`, a close relative of the random forest) to predict whether a given respondee belongs to a given cluster or not. From the model fit, I then extract the importance of each feature, and with that see which features are important for a given cluster classification.

Using this approach I can also estimate how much I actually belong to any given cluster (i.e. with `predict_proba`), which is pretty neat.
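
A minimal sketch of this one-vs-rest idea on hypothetical toy data, where cluster membership is (by construction) fully determined by the first feature:

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier

# Hypothetical toy data: 200 respondees x 4 binary features
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 4))

# Pretend cluster membership is fully determined by feature 0
y = X[:, 0]

# One-vs-rest classifier: "in this cluster" vs "not in this cluster"
model = ExtraTreesClassifier(n_estimators=100, random_state=0)
model.fit(X, y)

# Impurity-based importances sum to 1; the largest marks the most telling feature
top_feature = int(np.argmax(model.feature_importances_))

# Estimated probability that a new respondee belongs to the cluster
new_respondee = np.array([[1, 0, 1, 0]])
proba = model.predict_proba(new_respondee)[0, 1]
print(top_feature, proba)
```

Note that feature importances only say which answers help separate a cluster from the rest, not whether an answer is typical or atypical for it.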

In [None]:
from sklearn.ensemble import ExtraTreesClassifier

def getKeywords(cluster_id):
    
    # Let's try to predict this cluster based on features in the dataset
    X = (cluster_df[feature_names] > 0).astype(int)
    y = (cluster_df.cluster == cluster_id).astype(int)
    
    # Fit an extra-trees classifier for predicting the cluster (row 0, the notebook answers, is held out)
    model = ExtraTreesClassifier(n_jobs=-1)
    model.fit(X.iloc[1:], y.iloc[1:])
    
    # Get probability of this notebook answers belonging to cluster
    proba = model.predict_proba(X.iloc[[0]])[0, 1]
    
    # Sort features by importance for classifying this cluster (ascending, most important last)
    sorted_features = X.columns[np.argsort(model.feature_importances_)]
    
    # Get keywords either just as column name, or as entry from OHE columns
    def _cleanFeatures(features):        
        keywords = [[e for e in df[f].unique() if e not in [1, '']][0] if '_Part_' in f else f for f in features]
        keywords = [f for f in keywords if f not in ['', 'No / None', 'None']]
        return ', '.join(keywords)
    
    # Get keywords for the 10 most important features
    keywords = _cleanFeatures(sorted_features[-10:])
    if not keywords:
        keywords = 'Users did not fill, or mostly filled None, No/None in most answers'
        
    # Return cluster keywords, and probability of belonging to it
    return keywords, proba

# Get the number of respondees in each cluster
cluster_counts = cluster_df[cluster_df.cluster >= 0].cluster.value_counts()

# Create dataframe with keywords for each cluster
data, probas = [], []
for cluster_id, n_responses in cluster_counts.items():
    keywords, proba = getKeywords(cluster_id)
    data.append({
        'Cluster': cluster_id,
        'High Frequency Words': keywords,
        'Respondees': n_responses,
        'Notebook Answers [%]': int(proba*100)
    })
    probas.append(proba)
data = pd.DataFrame(data).sort_values('Cluster')

# Display the resulting dataframe & mark notebook 
# answers with color based on probability
display(
    data
    .style
    .background_gradient(subset='Notebook Answers [%]')
    .hide_index()
)

Taking my own answers in the notebook widget, with this I can conclude that I'm part of the clusters using GPUs, various deep learning networks, as well as various cloud technologies, which fits pretty well. 

# Kaggle Archetypes
Now that we've ventured into clustering of people, it could be interesting to take it a bit further and see if we can determine some underlying archetypes of Kagglers. My thought on this is that if we perform a principal component analysis (PCA) on our featurized dataset, we should be able to pick up components which are *orthogonal* to each other; it could be interesting to investigate the largest of these components, as they may reflect overarching types within the community. First up, let us run the PCA on the dataset, to see how much variance is described by each component. I'll again restrict focus to those who have written code before.
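
For intuition, here is a small NumPy-only sketch of what the explained variance ratios are, computed via an SVD of mean-centered data. The toy matrix `X` is hypothetical, and scikit-learn's `PCA` is what the notebook itself uses:

```python
import numpy as np

# Hypothetical toy data: 5 samples x 3 features
X = np.array([
    [2.0, 0.0, 1.0],
    [0.0, 1.0, 0.0],
    [3.0, 1.0, 1.0],
    [1.0, 0.0, 0.0],
    [4.0, 2.0, 1.0],
])

# PCA via SVD of the mean-centered data; rows of `components` are orthonormal
centered = X - X.mean(axis=0)
_, singular_values, components = np.linalg.svd(centered, full_matrices=False)

# Variance along each component, and each component's share of the total
explained_variance = singular_values ** 2 / (len(X) - 1)
explained_ratio = explained_variance / explained_variance.sum()
cumulative = np.cumsum(explained_ratio)

# searchsorted returns a zero-based index, so add 1 to get a component count
n_components_75 = int(np.searchsorted(cumulative, 0.75)) + 1
print(np.round(cumulative, 3), n_components_75)
```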

In [None]:
from sklearn.decomposition import PCA

# Get the respondees that have written code before
idx = (df.Q6 != '') & (df.Q6 != 'I have never written code')
       
# Run PCA
pca = PCA(random_state=2020)
trafo = pca.fit_transform(result[idx])

# Get the cumulative explained variance
cumulative_sum_ratios = np.cumsum(pca.explained_variance_ratio_)
components_list = list(range(len(cumulative_sum_ratios)))

# Figure out how many components to explain 75% of variance
idx = np.searchsorted(cumulative_sum_ratios, 0.75)

# Show the plot
fig = go.Figure(data=go.Scatter(
    x=components_list,
    y=cumulative_sum_ratios, 
    showlegend=False,
    mode='lines+markers',
    marker_color=color_palette[0]
))

fig.add_trace(
    go.Scatter(
        mode='lines',
        x=components_list,
        y=[cumulative_sum_ratios[idx]] * len(cumulative_sum_ratios),
        name=f'75% variance explained with {idx} components',
        showlegend=True,
        line=dict(color='firebrick', width=2, dash='dash'),
    )
)

fig.update_layout(
    xaxis_title='Principal Components',
    yaxis_title='Cumulative explained variance',
    legend=dict(
        x=0,
        y=1.1,
        traceorder='normal',
        font=dict(size=12),
    ),
)
fig.show()

print(f"Explained variance ratio of first three components: {pca.explained_variance_ratio_[0:3]}")

My main takeaway from this is that there are not a lot of super strong components that explain e.g. 30 or 20% of the variance - rather it actually takes a whopping 122 components to describe 75% of the variance in the dataset; I reckon that's indicative of a lot of individuality on Kaggle. Looking a bit closer, though, the first three components do explain 7.0%, 3.8%, and 2.7% of the variance, respectively. So let us now say that these are our underlying archetypes for Kagglers, and see what they contain.

In [None]:
# Calculate the PCA feature loadings (pick out PC1-3)
loadings = pd.DataFrame(
    pca.components_.T * np.sqrt(pca.explained_variance_),
    columns=[f'PC{i + 1}' for i in range(len(pca.components_))],  # 1-based, so PC1 is the first component
    index=feature_names
)[['PC1', 'PC2', 'PC3']]

# Give better names to features
name_map = {
    c: '{}'.format(f.split(" - Selected Choice - ")[-1].strip())
    if 'Selected Choice' in f else raw.loc[0, c]
    for c, f in raw.iloc[0].items()
}
loadings.index = [name_map[i] if i in name_map else i for i in loadings.index]

# Remove all features where all absolute loading < 0.02
loadings = loadings[loadings.apply(lambda row: row.abs().max() > 0.02, axis=1)]

# Remove meaningless features
loadings = loadings[~loadings.index.isin(['None', 'Never', '', 'No / None'])]

# Show heatmap plot
_, ax = plt.subplots(1, 1, figsize=(10, 20))
sns.heatmap(loadings, ax=ax)
plt.show()

So with that we can conclude the following for the archetypes:
* Archetype 1 (7% of variance): High correlation with lots of cloud & ML technologies. Also anticorrelated with "<1 year", which is consistent with this type being more experienced. 
* Archetype 2 (3.8% of variance): This seems like an archetype that covers people not using machine learning methods, but who are more correlated with SQL, Tableau, PowerBI, and whose companies have not invested in ML.
* Archetype 3 (2.7% of variance): This seems to be a more beginner level archetype, where development mostly happens locally and with various ML technologies

# What is a normal Kaggler?
Finally, having done all the above analyses, it could be interesting to look at the answers from the 'least' unique respondee - i.e. the person with the highest average cosine similarity to all other kagglers, just to see who the most stereotypical kaggler is :)

In [None]:
name_map = {
    c: '{}'.format(f.split(" - Selected Choice - ")[0].strip())
    if 'Selected Choice' in f else raw.loc[0, c]
    for c, f in raw.iloc[0].items()
}

person = df.loc[np.argmax(scores)]
person = pd.DataFrame({'Average Joe': person})
person = person[person['Average Joe'] != ""]
person.index = person.index.map(name_map)
person

So the most typical Kaggler is relatively new to coding (1-2 years) and machine learning (<1 year), lives in India, uses Python, SQL and knows some C/C++/Java, typically uses a personal computer/laptop, and hopes to familiarize himself with one of the three main cloud platforms within the next 2 years. Makes sense to me; could be interesting to see if there has been a shift in this "stereotypical" Kaggler over time.