In [None]:
from pathlib import Path
import random

import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import altair as alt
from IPython.display import Image

import umap
import umap.plot
from sklearn.mixture import GaussianMixture

%config InlineBackend.figure_format = 'retina'
alt.data_transformers.disable_max_rows();

# Data Science Practitioner Segmentation

* [Introduction](#intro)
    - [What does a segment look like?](#segment-example)
    - [Professionals and non-professionals](#prof-nonprof)
* [Preprocessing](#preprocess)
    - [An important choice!](#important-choice)
* [Clustering](#clustering)
    - [UMAP projection](#umap)
    - [How about the number of clusters?](#num-clusters)
* [Analysis](#analysis)
    - [Cluster lift](#lift)
* [Segmentation summary](#segm-summary)

<a id="intro"></a>
## Introduction

Data science / ML practitioners tackle a large variety of tasks at work, from analyzing data to come up with business insights to deploying ML models in cloud environments. In this notebook we try to tease out some of this complexity by grouping similar users together based on their responses to the Kaggle ML and DS survey. Some of the questions we tackle are:

* what kind of problems do different segments focus on at work
* what are the primary software tools (libraries, frameworks, cloud services) they use to solve these problems
* what is the relationship between level of experience and types of problem at work

Importantly, our focus is on **interpretable** segments.


<a id="segment-example"></a>
### What does a segment look like?

We will use UMAP for dimensionality reduction and Gaussian Mixture models for clustering using the responses to most of the questions in the survey. We end up with a total of 9 user segments. We then create profiles that emphasize unique features in each cluster. Here is an example:

**Cluster 5: Computer Vision**

In [None]:
Image('../input/example-cluster/cluster_5_profile.png', width=700)

The bars on the left (in lighter color) show the percent of users that have selected a given response to a question. For example, the top bar shows that about 40% of users have selected Convolutional Neural Networks (CNNs) as an answer to Q17 *Which of the following ML algorithms do you use on a regular basis?* The bars on the right in solid color show the **lift**, or the increase in response rate specific **to the given cluster** compared to the overall average in the dataset. This means that **an additional** 50% in Cluster 5 have selected CNNs as an answer as compared to the baseline, for a whopping 90% in total in this cluster.

We will go into more details about the segmentation in the analysis section. Here is my summary for this cluster (based on the visualization above and more analysis further down):

> More than 90% of the users in this cluster use Convolutional Neural Networks to tackle various problems in image classification, object detection and segmentation. They are most likely using Tensorflow (78%) and/or Keras (73%). Interestingly, in addition to Python, more than a quarter of the users utilize C++, presumably for high-performance computer vision applications. 
>
> These users utilize GPUs, but more than half do not use a cloud platform.
This is the group with highest proportion of PhDs (25%) and Research Scientists as the job role (19%). Other common job roles are Data Scientist, ML Engineer, Software Engineer. In terms of ML experience, this cluster falls in the middle, with about half the users having less than 2 years of experience with ML models. Users in this cluster are more likely to do research to advance ML methods compared to other segments. 

To skip ahead to the rest of the cluster summaries, go to the **Segmentation summary** section near the bottom.

<a id="prof-nonprof"></a>
### Professionals and non-professionals

Before we go any further, let's look at a projection of the survey dataset in two dimensions using UMAP: this will give us a bird's-eye view of the dataset. We will go into all the technical details below, but for now, let's examine the projection. Each point represents a survey participant, and points that are closer together have similar responses. Points are colored based on the participant's job role (Q5).

In [None]:
Image('../input/segmentation-mappng/segmentation_map.png', width=700)

You will notice two main groups, one on the left, and one on the right, separated by a wide margin. This separation is in fact driven by the survey structure itself. From the survey document:

> Non-professionals received questions with an alternate phrasing (questions for non-professionals asked what tools they hope to become familiar with in the next 2 years instead of asking what tools they use on a regular basis). Non-professionals were defined as students, unemployed, and respondents that have never spent any money in the cloud.

It is not surprising that many of the respondents in the group on the right (non-professionals) are students. On the other hand, the professional group on the left includes data scientists, analysts, machine learning engineers and more!

My focus is on understanding data science practitioners at work, so for the rest of the analysis I will use the professionals group on the left and perform segmentation on it. I am sure the group on the right will also provide lots of insights, and I invite others to analyze that subset as well. 

<a id="preprocess"></a>
## Preprocessing

Below is my code for preprocessing. Basically, this consists of:
* selecting which questions to use for clustering, and which to use for validation / analysis 
* encoding each response to a question as a binary variable (but note that some questions are multi-option select, while others are single-option select)
* adding annotations to the questions as well as the responses to make them easier to analyze, e.g. `Q14 -> viz libraries`, and `Q14_Part_2 -> seaborn`

In [None]:
DATA_DIR = Path('../input/kaggle-survey-2020/')
df = pd.read_csv(DATA_DIR / 'kaggle_survey_2020_responses.csv', skiprows=[1])
df.head(2)

<a id="important-choice"></a>
### An important choice!

We need to select questions we will include in the clustering (`cluster_qs`) and questions we will use later to study and validate the clustering solution (`valid_qs`).

Most of the questions in this survey ask about things data science practitioners do at work: types of problems they solve, frameworks they use, models they build. So we will use the responses to these questions for clustering. For the full list of questions included, you can open the code cell below.

The questions selected for analysis (post-clustering) relate to the users' prior experience (years of coding experience, formal education), as well as work environment (company size, company using ML methods). This selection will allow us to make conclusions such as: *Users from cluster X focus on building deep learning prototypes using Tensorflow. Most of them have at least 2 years of experience working with ML models.*

We do not include questions that relate to compensation in the analysis as these depend on the country and a lot of extraneous factors. 


In [None]:
# A mapping between the question and a short description
# question: (short_description, question_type)
short_qs = {
    'Q2':  ('gender', 'valid'),
    'Q4':  ('highest edu formal', 'valid'),
    'Q5':  ('most similar role', 'valid'),
    'Q6':  ('writing code years', 'valid'),
    'Q7':  ('languages regular', 'program'),
    'Q8':  ('language recommend', 'program'),
    'Q9':  ('IDEs regular', 'program'),
    'Q10': ('hosted notebooks', 'cloud'),
    'Q11': ('compute platform', 'cloud'),
    'Q12': ('hardware', 'tools'),
    'Q13': ('used TPU', 'tools'),
    'Q14': ('viz libraries', 'program'),
    'Q15': ('years ML', 'valid'),
    'Q16': ('ML frameworks', 'ML'),
    'Q17': ('ML algos', 'ML'),
    'Q18': ('CV methods', 'ML'),
    'Q19': ('NLP methods', 'ML'),
    'Q20': ('company size', 'valid'),
    'Q21': ('num. DS individuals', 'valid'),
    'Q22': ('employer ML', 'valid'),
    'Q23': ('work activities', 'work'),
    'Q25': ('cloud money', 'valid'),
    'Q26': ('cloud platforms', 'cloud'),
    'Q27': ('cloud compute', 'cloud'),
    'Q28': ('cloud ML', 'cloud'),
    'Q29': ('big data regular', 'tools'),
    'Q30': ('big data most often', 'tools'),
    'Q31': ('BI tools regular', 'tools'),
    'Q32': ('BI tools most often', 'tools'),
    'Q33': ('autoML tasks', 'ML'),
    'Q34': ('autoML tools', 'ML'),
    'Q35': ('ML experiments', 'ML'),
    'Q36': ('share deploy apps', 'tools'),
    'Q37': ('DS courses', 'learn'),
    'Q38': ('primary tool', 'tools'),
    'Q39': ('media sources DS', 'learn'),
}

for q, v in short_qs.items():
    assert len(v) == 2, v
    
# types of questions, will be used for coloring later
qtypes = ['cloud', 'ML', 'program', 'tools', 'work', 'learn', 'valid']
assert set(qtypes) == set(v[1] for v in short_qs.values())
# this mapping will be used for plotting, 2 colors per question type
qtype_map = {qtype: i * 2 for i, qtype in enumerate(qtypes)}

In [None]:
cluster_qs = [q for q, v in short_qs.items() if v[1] != 'valid']
valid_qs =   [q for q, v in short_qs.items() if v[1] == 'valid']
valid_qs

In [None]:
{q: short_qs[q][0] for q in valid_qs}

In [None]:
# random sample of questions used for clustering
{q: short_qs[q] for q in random.sample(cluster_qs, 5)}

The `uniques` dictionary will keep track of unique responses per column (one-hot encoded), e.g.

```
{
    'Q7_Part_1': 'Python',
    'Q7_Part_2': 'R'
 }
```
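
To make the encoding concrete, here is a minimal sketch on a made-up three-respondent table. The column names follow the survey's naming pattern, but the values and the `toy` / `toy_uniques` names are invented for illustration only.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the survey: a multi-select option column stores the option
# text when selected and NaN otherwise (values are made up for illustration).
toy = pd.DataFrame({
    'Q7_Part_1': ['Python', np.nan, 'Python'],
    'Q7_Part_2': [np.nan, 'R', 'R'],
    'Q6': ['1-2 years', '3-5 years', '1-2 years'],  # single-select question
})

# A column with exactly one distinct non-null value is a multi-select option.
toy_uniques = {
    col: toy[col].dropna().unique()[0].strip()
    for col in toy.columns
    if toy[col].nunique() == 1
}

# Replace the option text with a 0/1 indicator.
for col in toy_uniques:
    toy[col] = toy[col].notnull().astype(np.int8)

print(toy_uniques)
print(toy)
```

The single-select column `Q6` has more than one distinct value, so it is left untouched at this stage and handled separately as a categorical.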

In [None]:
def make_uniques_dict(df):
    """Make a dictionary mapping a given multiple-selection option to the actual text response.
    e.g. 'Q7_Part_1' -> 'Python'
    """
    uniques = {}
    for col in df.columns:
        if df[col].nunique() == 1:
            # keep track of the response text - to be used in analysis
            uniques[col] = df[col].dropna().unique()[0].strip()
    return uniques
            
uniques = make_uniques_dict(df)

# convert each column (corresponding to a single selection) to 0 / 1 col
for col in uniques:
    df[col] = df[col].notnull().astype(np.int8)

In [None]:
# some of the categorical questions have an ordering so we assign the order manually
# used for visualizations later
ordered_cats = {
    'Q4': [
        # somewhat arbitrary how we order the education degrees
        'I prefer not to answer',
        'No formal education past high school',
        'Some college/university study without earning a bachelor’s degree',
        'Bachelor’s degree',
        'Master’s degree',
        'Professional degree',
        'Doctoral degree'
    ],
    'Q6': [
        'I have never written code',
        '< 1 years',
        '1-2 years',
        '3-5 years',
        '5-10 years',
        '10-20 years',
        '20+ years'
    ],
    'Q15': [
        'I do not use machine learning methods',
        'Under 1 year',
        '1-2 years',
        '2-3 years',
        '3-4 years',
        '4-5 years',
        '5-10 years',
        '10-20 years',
        '20 or more years'
    ],
    'Q20': [
        '0-49 employees',
        '50-249 employees',
        '250-999 employees',
        '1000-9,999 employees',
        '10,000 or more employees',
    ],
    'Q21': ['0', '1-2', '3-4', '5-9', '10-14', '15-19', '20+'],
    'Q22': [
        'No (we do not use ML methods)', 
        'I do not know',
        'We are exploring ML methods (and may one day put a model into production)',
        'We use ML methods for generating insights (but do not put working models into production)',
        'We recently started using ML methods (i.e., models in production for less than 2 years)',
        'We have well established ML methods (i.e., models in production for more than 2 years)',
    ]
}

In [None]:
def col_to_q(col):
    return col.split('_')[0]

# drop OTHER responses since we cannot correlate them with a segment
df = df[[col for col in df.columns if 'OTHER' not in col]]

# keep only questions for analysis / validation
df = df[[col for col in df.columns if col_to_q(col) in short_qs]]
df.head(2)

In [None]:
def object_cols_to_cats(df, ordered_cats):
    """Convert any columns with dtype = object to categorical.
    If a column contains ordered categories (specified in the ordered_cats dict),
    the order will be preserved in the resulting pandas categorical.
    Modifies df inplace.
    
    Returns a dictionary which maps each encoded column to its categorical response.
    For example: {
       'Q6_0': 'I have never written code',
       'Q6_1': '< 1 years',
       ...
    }
    """
    uniques_cat = {}
    for col in df.select_dtypes('object').columns:
        q = col_to_q(col)
        cats = ordered_cats.get(q)
        if cats is not None:
            assert set(cats) == set(df[col].dropna())
            # use the category list looked up by question id above
            df[col] = pd.Categorical(df[col], ordered=True, categories=cats)
        else:
            df[col] = pd.Categorical(df[col])

        for i, c in enumerate(df[col].cat.categories):
            # uniques_cat['Q6_1'] = '< 1 years'
            uniques_cat[f'{q}_{i}'] = c
            
    assert len(df.select_dtypes('object').columns) == 0
    return uniques_cat
    
uniques_cat = object_cols_to_cats(df, ordered_cats)
# merge the uniques + uniques_cat dicts
uniques = {**uniques, **uniques_cat}

In [None]:
df.to_feather('processed_responses.feather')

In [None]:
# cl_df only contains questions used for clustering
cl_df = df[[col for col in df if col_to_q(col) in cluster_qs]].copy()
cl_df.shape, df.shape

In [None]:
def cat_cols_to_dummies(cl_df):
    """Convert categorical columns to dummy (one-hot) columns."""
    cat_cols = cl_df.select_dtypes('category').columns
    for col in cat_cols:
        cl_df[col] = cl_df[col].cat.codes

    cl_df = pd.get_dummies(cl_df, columns=cat_cols)
    return cl_df
    
cl_df = cat_cols_to_dummies(cl_df)
# drop '-1 columns' corresponding to missing values
cl_df = cl_df[[c for c in cl_df if '-1' not in c]]
cl_df = cl_df.astype(np.int8)

# check to ensure each col is mapped to a value
for col in cl_df:
    assert col in uniques, col
    
# some users have not responded to any of the selected qs so we drop them
cl_df = cl_df[cl_df.mean(axis=1) > 0.01]
cl_df.shape

<a id="clustering"></a>
## Clustering

<a id="umap"></a>
### UMAP Projection

[UMAP](https://umap-learn.readthedocs.io/) is a very useful nonlinear dimensionality reduction technique and it can give insights into a wide range of different datasets, provided it is configured correctly.

#### Jaccard coefficient as a distance measure

Perhaps the most important UMAP parameter is the distance measure. Our dataset includes binary features only so the Jaccard coefficient is a good choice. It is defined as the size of the intersection of two sets $u$ and $v$, divided by the size of their union:

$$J(u, v) = \frac{|u \cap v|}{|u \cup v |}$$


Wikipedia has a nice [graphic](https://en.wikipedia.org/wiki/Jaccard_index) of the Jaccard coefficient. In our case, the numerator will count the number of matches (shared selections) between users $u$ and $v$. The denominator will normalize this count by the total number of unique selections of both $u$ and $v$. It is easier to have more matches with a user that has made a lot of positive selections, so we need the denominator to control for this effect.
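
As a small worked example of the formula, the sketch below computes the coefficient for three made-up respondents. The helper `jaccard_similarity` and the vectors are hypothetical; returning 0 for two empty selections is an assumption of this sketch, not something defined in the text above.

```python
def jaccard_similarity(u, v):
    """Jaccard coefficient of two 0/1 vectors of equal length."""
    both = sum(1 for a, b in zip(u, v) if a and b)    # size of the intersection
    either = sum(1 for a, b in zip(u, v) if a or b)   # size of the union
    # Assumption: two users with no selections at all get similarity 0.
    return both / either if either else 0.0

# Hypothetical respondents over six answer options.
user_a = [1, 1, 0, 0, 0, 0]
user_b = [1, 1, 1, 0, 0, 0]
user_c = [1, 1, 1, 1, 1, 1]  # selected every option

sim_ab = jaccard_similarity(user_a, user_b)  # 2 shared / 3 in union
sim_ac = jaccard_similarity(user_a, user_c)  # 2 shared / 6 in union

print(sim_ab, sim_ac)
# A Jaccard distance is the complement of the coefficient.
print(1 - sim_ab, 1 - sim_ac)
```

Both `user_b` and `user_c` share two selections with `user_a`, but `user_c` selected everything, so the larger union in the denominator lowers the similarity.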

In [None]:
%%time

job_roles = df.loc[cl_df.index, 'Q5']
mapper = umap.UMAP(n_neighbors=15, min_dist=0.1, metric='jaccard', random_state=0)
mapper.fit(cl_df);

# uncomment this line to create the UMAP projection from the introduction section
# throws an error with the Kaggle version of umap-learn, but visualization is still generated
# umap.plot.points(mapper, labels=job_roles, color_key_cmap='tab20', width=700, height=600)

#### Let's study the professionals!

<img src="https://upload.wikimedia.org/wikipedia/en/0/03/Leon-poster.jpg" align="center"/>

We will split the projection into two, and select the professionals for further analysis.

In [None]:
def get_prof_df(mapper, cl_df, boundary_x):
    """Get df with professionals only by splitting the UMAP
    projection based on a boundary value."""
    
    proj = mapper.transform(cl_df)
    prof_df = cl_df[proj[:, 0] < boundary_x]
    plot_proj_with_boundary(proj, boundary_x)
    return prof_df

def plot_proj(proj, i0, i1, ax=None, alpha=0.2, s=5, **kwargs):
    """Plot 2D UMAP projection for components i0 and i1."""
    ax = ax or plt.gca()
    ax.scatter(proj[:, i0], proj[: , i1], alpha=alpha, s=s, **kwargs)
    ax.set_xlabel(f'UMAP component {i0}', fontsize=12)
    ax.set_ylabel(f'UMAP component {i1}', fontsize=12)
    return ax

def plot_proj_with_boundary(proj, boundary_x):
    """Plot projection together with boundary value."""
    plt.figure(figsize=(8, 6))
    ax = plot_proj(proj, 0, 1, ax=plt.gca())
    ymin, ymax = ax.get_ylim()
    ax.vlines(boundary_x, ymin, ymax, linestyle='dashed')
    return ax
    
    
prof_df = get_prof_df(mapper, cl_df, boundary_x=4)
prof_df.shape

For the rest of the analysis, we focus on professionals (group on the left).

For clustering, we are going to refit UMAP on the professionals data only. We will use the UMAP-transformed data (rather than the sparse binary data) as input to the clustering algorithm. However, we will increase the number of UMAP components (dimensions) to 4, so we can capture more information in our projection. There is a discussion on using UMAP as a clustering preprocessing step [here](https://umap-learn.readthedocs.io/en/latest/clustering.html). 

In [None]:
prof_mapper = umap.UMAP(n_components=4, n_neighbors=15, min_dist=0, 
                        metric='jaccard', random_state=0)
prof_mapper.fit(prof_df)
trans_prof = prof_mapper.transform(prof_df)

In [None]:
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(12, 4))
# plot a few 2D projections
plot_proj(trans_prof, 0, 1, ax=axes[0])
plot_proj(trans_prof, 2, 3, ax=axes[1]);

From the 2D projections of the professionals, we can make a few observations: 
* some regions of the UMAP space are more dense than others
* high-density regions are not spherical
* high-density regions are not well separated from each other with one exception in the bottom left corner. Large separation between regions tends to be driven by the survey structure (as opposed to an inherent user segmentation), as we saw above in the projection of the full survey dataset.

Based on this, we will use Gaussian Mixture with full covariance matrix for clustering. This will allow us to capture the non-spherical regions (as opposed to say, using K-means).
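
To illustrate what the covariance choice changes, here is a minimal sketch on synthetic elongated blobs (not the survey data). The data, the `stretch` matrix and the variable names are made up for illustration; only the `covariance_type` parameter of `GaussianMixture` is the point.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Two synthetic blobs stretched along the x axis, so they are far from spherical.
stretch = np.array([[3.0, 0.0], [0.0, 0.3]])
blob_a = rng.normal(size=(200, 2)) @ stretch
blob_b = rng.normal(size=(200, 2)) @ stretch + np.array([0.0, 3.0])
X = np.vstack([blob_a, blob_b])

full = GaussianMixture(n_components=2, covariance_type='full',
                       random_state=0).fit(X)
spherical = GaussianMixture(n_components=2, covariance_type='spherical',
                            random_state=0).fit(X)

# 'full' learns one 2x2 covariance matrix per component (an arbitrary ellipse),
# 'spherical' learns a single variance per component (a circle).
print(full.covariances_.shape)
print(spherical.covariances_.shape)

labels_full = full.predict(X)
```

With a full covariance matrix each component can stretch and rotate to follow an elongated region, which a single variance per component cannot do.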

I tried several different algorithms (e.g. spectral clustering on the raw data), and they all roughly agreed on the segments, which gives us some confidence about the results.

<a id="num-clusters"></a>
### How about the number of clusters / segments?


There are a few important considerations when selecting the number of clusters for this type of task:
* The overall purpose of this clustering is to enhance our understanding of the Kaggle userbase (as opposed to, e.g. feature engineering for an ML algorithm). Each cluster needs to be analyzed, so we can understand the segment of users it represents. We cannot use a large number of clusters, e.g. 100. We can limit ourselves to a few segments especially when getting started.
* When doing the analysis, it is easy to **overcluster** (pick more clusters than what we expect the right number is), and then manually merge similar clusters together. Going in the opposite direction is harder. 

With this in mind, we will use the following recipe:
* Start with 10 clusters
* Go through each of them and try to interpret the user segment it represents
* Any clusters that are judged very similar to each other can be merged together.
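
The merge step of this recipe can be sketched as a simple relabeling. The labels and the decision to fold cluster 9 into cluster 8 are assumptions for illustration; `merge_map` is a hypothetical name.

```python
import numpy as np

# Hypothetical labels from an over-clustered solution.
labels = np.array([0, 8, 9, 3, 9, 8, 5])

# Assumption for illustration: clusters 8 and 9 were judged to describe the
# same segment, so 9 is folded into 8. All other labels are left unchanged.
merge_map = {9: 8}
merged = np.array([merge_map.get(label, label) for label in labels.tolist()])

print(merged)
print(np.unique(merged))
```

Merging is just a many-to-one mapping of labels, which is why starting with too many clusters is cheap to undo, while splitting a cluster afterwards requires refitting.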

In [None]:
clusterer = GaussianMixture(n_components=10, random_state=0)
clusterer.fit(trans_prof)
cl_labels_prof = clusterer.predict(trans_prof)
# show the cluster sizes
np.unique(cl_labels_prof, return_counts=True)

None of the clusters are particularly large or small, so we obtain a balanced solution.

Let's plot a few 2D projections (colored by cluster label) to examine the solution qualitatively.

In [None]:
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(12, 4))
plot_proj(trans_prof, 0, 1, ax=axes[0], c=cl_labels_prof, cmap='tab10')
plot_proj(trans_prof, 2, 3, ax=axes[1], c=cl_labels_prof, cmap='tab10');

There is some overlap between clusters near the borders, but overall, the regions of high density appear to be well-clustered. Note how the full covariance clustering allows us to capture non-spherical regions.

<a id="analysis"></a>
## Analysis

It is time to actually make sense of our clustering. Below are some helper functions. 

In [None]:
def col_label(col, n_words=5):
    """Make a human-friendly (but short) label for a given column.
    For example:
    'Q18_Part_2' -> 'CV methods: Image segmentation methods (U-Net, Mask'
    """
    # pass n_words through so callers can control the label length
    answer_label = col_to_answer(col, n_words=n_words)
    q = col_to_q(col)  # e.g. 'Q18'
    q_label = short_qs[q][0]  # e.g. 'CV methods'
    return q_label + ': ' + answer_label

def col_to_answer(col, n_words=5):
    answer_words = uniques[col].split()
    return ' '.join(answer_words[:n_words])  # limit answer to n_words

def col_to_qtype(col):
    q = col_to_q(col)  # e.g. 'Q26'
    qtype = short_qs[q][1]  # e.g. 'cloud'
    return qtype 


print(col_label('Q26_A_Part_1'))
print(col_label('Q33_A_Part_2'))

In [None]:
print(col_to_qtype('Q26_A_Part_1'))
print(col_to_qtype('Q33_A_Part_2'))


<a id="lift"></a>
### Cluster Lift

Some responses to a question are inherently more popular than others. For example, for Q7 (programming languages regularly used), almost all users (87%) in the professionals dataset use Python. In comparison, only 16% use C++. If we only look at response rates per cluster, Python will drown out all the other responses. 

In order to emphasize differences between clusters, we use *cluster lift* - the difference between cluster response rate and overall response rate for a given selection. If we see a large lift for C++ for a given cluster, then we know that this cluster is using C++ more than the rest of the clusters. Meanwhile, we are unlikely to see large values of lift for Python because almost all clusters are using Python. In fact, we might see negative values in lift for clusters that tend to use Python less, which can be informative as well.
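
As a toy illustration of this definition, the sketch below computes the lift on made-up responses and cluster labels (not the survey data). The column names `python` and `cpp` and the variable names are hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up 0/1 responses for six users and two answer options.
responses = pd.DataFrame({
    'python': [1, 1, 1, 1, 1, 0],
    'cpp':    [1, 1, 0, 0, 0, 0],
})
clusters = np.array([0, 0, 0, 1, 1, 1])  # hypothetical cluster labels

overall = responses.mean()                         # overall response rate
per_cluster = responses.groupby(clusters).mean()   # response rate per cluster
lift = per_cluster - overall                       # cluster lift

print(lift.round(2))
```

In this toy table C++ gets a lift of about +0.33 in cluster 0 even though Python is more popular overall, and cluster 1 shows a negative lift for Python because it uses Python less than average.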

In [None]:
prof_lift = prof_df - prof_df.mean()
centers_lift = prof_lift.groupby(cl_labels_prof).mean()

def cluster_top(idx, centers_df, n=15):
    """Given a cluster centers df, return the columns with largest value for a given cluster."""
    return centers_df.loc[idx].nlargest(n)

cluster_top(0, centers_lift, n=5)

In [None]:
def make_lift_df(cl_idx, centers_lift, avg_resp, n=15):
    """Create a dataframe with the responses with highest lift for a given cluster.
    
    cl_idx: the cluster integer index
    centers_lift: dataframe with the mean lift per cluster (one row per cluster)
    avg_resp: series with the overall average response
    n: the number of responses to include in the result
    
    Returns a dataframe with top n questions with highest lift.
    Also included is the overall average response (across all clusters)
    for a given question.
    """
    lift_top = cluster_top(cl_idx, centers_lift, n=n)
    avg_top  = avg_resp.loc[lift_top.index]
    res = pd.DataFrame({'lift': lift_top, 'overall_avg':  avg_top})
    res['cluster_avg'] = res['lift'] + res['overall_avg']
    res = res.reset_index().rename(columns={'index': 'q'})
    res['qtype'] = res['q'].apply(col_to_qtype)
    res.index =    res['q'].apply(col_label)
    res.index.name = 'question_label'
    return res

def get_qcolors(qtypes, primary):
    """Create the paired color coding for the question types.
    This is used in the plot_segment_qs below.
    If primary=True, we use a solid color, otherwise we use a light color."""
    paired = plt.get_cmap('Paired')
    
    def qtype_color(qtype):
        """Map a question type to a color, e.g.
        'cloud' -> (0.650, 0.808, 0.890, 1.0)
        """
        if primary: return paired(qtype_map[qtype] + 1)
        else:       return paired(qtype_map[qtype])
        
    return [qtype_color(qtype) for qtype in qtypes]

def plot_segment_qs(lift_df, ax=None):
    """Create a bar plot for the questions with highest lift in a given cluster.
    
    We show both the lift as well as the overall avg response for a given question.
    Questions are colored by the question type (e.g. cloud, ML, etc.)
    """
    # reverse df to plot values with highest lift at top
    lift_df = lift_df.iloc[::-1]  
    if ax is None:
        fig = plt.figure(figsize=(6, 7))
        ax = fig.gca()
        
    idx = np.arange(len(lift_df))
    ax.barh(idx, lift_df['lift'] * 100,
            color=get_qcolors(lift_df['qtype'], primary=True))
    ax.barh(idx, -lift_df['overall_avg'] * 100,
            color=get_qcolors(lift_df['qtype'], primary=False))
    ax.set_yticks(idx)
    ax.set_yticklabels(lift_df.index, fontsize=12)
    ax.set_xlabel('% users', fontsize=12)
    return ax

In [None]:
# here is the lift dataframe for cluster 0 with top 5 entries
make_lift_df(0, centers_lift, prof_df.mean(), n=5)

We can interpret the result for cluster 0 as follows. Let's take a look at the first row. Across all users (in the professional group), 25% use Amazon EC2. However, in cluster 0, this proportion is much higher: 25% (overall) + 37% (lift) = 62%. More than double!

Let's take a look at the fourth row ("hardware: GPUs"). The overall response rate is higher (53%) while the lift is slightly smaller (32%). But that means that the majority of cluster 0 users use GPUs: 53% + 32% = 85%.

It is important to consider both the lift, as well as the overall response rate, when analyzing a given question. So let's plot them together!

In [None]:
# change cluster index to analyze each cluster in turn
cl_idx = 5
ax = plot_segment_qs(make_lift_df(cl_idx, centers_lift, prof_df.mean(), n=15))
ax.set_title(f'Cluster {cl_idx}', fontsize=14)
ax.grid(axis='x')

As explained in the introduction, this chart shows the overall response rate as well as the cluster-specific lift for a given answer choice. The answers are color-coded based on categories (manually assigned earlier).

In [None]:
def make_analysis_df(df, valid_qs, cluster_labels):
    """Make the analysis df based on the validation questions and cluster labels."""
    analysis_df = df[valid_qs].assign(cluster=cluster_labels)

    # ordering info for plotting
    # .items() replaces .iteritems(), which was removed in pandas 2.0
    for name, col in analysis_df.items():
        if hasattr(col, 'cat') and col.cat.ordered:
            analysis_df[f'{name}_order'] = col.cat.codes + 1
    return analysis_df
            
analysis_df = make_analysis_df(df.loc[prof_df.index], valid_qs, cl_labels_prof)
analysis_df.head(2)

Let's start exploring the data in the validation questions (`analysis_df`).

Below we can see the distribution of job roles for the different clusters. Hover over the bars to see the actual counts.
Since the cluster numbers are arbitrary, all charts are ordered based on the level of experience with ML for each cluster.

In [None]:
def prop_new_to_ml(s):
    """Find the proportion of users that are new to ML models."""
    return ((s == 'Under 1 year') | (s ==  'I do not use machine learning methods')).mean()

# sort clusters by years of experience in ML models
sort_order = (analysis_df
              .groupby('cluster')['Q15']
              .apply(prop_new_to_ml)
              .sort_values()
              .index.tolist())

def plot_job_titles(data, q='Q5'):
    return alt.Chart(data=data[[q, 'cluster']]).mark_bar(size=35).encode(
        x=alt.X('cluster:N', sort=sort_order),
        y=alt.Y('count()', stack='normalize', title='Proportion (per cluster)'),
        color=alt.Color(q, scale=alt.Scale(scheme='tableau20')),
        tooltip=[q, 'count()']
    ).properties(width=500)

plot_job_titles(analysis_df).properties(title='Job title most similar to current role')

In [None]:
def stacked_bar_cluster(data, q):
    """Stacked bar for a given question with possibly order information.
    
    data: dataframe to plot, needs to contain the question q, and cluster label
    q: name of the question, e.g. 'Q6'
    """
    q_order_col = f'{q}_order'
    if q_order_col in data.columns:
        order = q_order_col
        cols = [q, q_order_col, 'cluster']
    else:
        order = []
        cols = [q, 'cluster']
        
    return alt.Chart(data=data[cols]).mark_bar(size=25).encode(
        x=alt.X('cluster:N', sort=sort_order),
        y=alt.Y('count()', title='Proportion (per cluster)', stack='normalize'),
        # need to provide a list with ordered categories to display correctly
        color=alt.Color(f'{q}:O', scale=alt.Scale(scheme='inferno'), 
                        sort=list(data[q].cat.categories)),
        tooltip=[q, 'count()'],
        # force an order on a categorical variable
        order=order
    ).properties(
        width=400,
        height=280,
    )

In [None]:
stacked_bar_cluster(analysis_df, 'Q6').properties(title='Years writing code')

In [None]:
stacked_bar_cluster(analysis_df, 'Q15').properties(title='Years using machine learning methods')

In [None]:
# stacked_bar_cluster(analysis_df, 'Q2').properties(title='Gender')
# stacked_bar_cluster(analysis_df, 'Q4').properties(title='Education')
stacked_bar_cluster(analysis_df, 'Q22').properties(title='Employer incorporating ML methods')

Note how the experience with ML models (Q15) correlates with employer incorporating ML methods (Q22). 

In [None]:
stacked_bar_cluster(analysis_df, 'Q20').properties(title='Company size')

There is not much variation in company size across clusters, but we can notice that cluster 4 users tend to be employed in larger companies than cluster 0 and 3 users (while being otherwise similar in terms of ML experience).

**Q23** (select activities that make an important part of your work) is central to this analysis so it deserves its own table with lift coefficients. Note that we included this question in the clustering.

In [None]:
role_avg = centers_lift.filter(like='Q23').loc[sort_order]
role_avg.columns = role_avg.columns.map(lambda c: col_to_answer(c, n_words=50))
role_avg.index.name = 'cluster'
role_avg.style.background_gradient(cmap='PiYG').format("{:.1%}")

You will notice "more green near the top" - more experienced users in clusters 0, 4, 3, 7 are more likely to make a selection (or multiple selections) that matches one of the standard data science roles.  

<a id="segm-summary"></a>
## Segmentation summary

You can find below the summaries I wrote for each cluster, based on the charts and analysis above.

In [None]:
cl_summary = {
    0: 'Advanced ML',
    1: 'ML Beginners',
    2: 'Business Intelligence',
    3: 'Google Cloud and models in production',
    4: 'ML on tabular data',
    5: 'Computer Vision',
    6: 'R Users',
    7: 'Azure and R and more',
    8: 'Getting started with data',
    9: 'Getting started with data'
}

cl_counts = pd.Series(cl_labels_prof).value_counts().loc[sort_order]
# fix cluster label
cl_counts.index = cl_counts.index.map({k: f'{k}: {v}' for k, v in cl_summary.items()})
ax = cl_counts.iloc[::-1].plot(kind='barh', figsize=(6, 6))
ax.set_yticklabels(ax.get_yticklabels(), fontsize=12)
ax.set_title('Cluster size', fontsize=14)
ax.set_xlabel('Number of participants', fontsize=12);

### Cluster 0: advanced ML

Cluster 0 includes some of the most experienced users in terms of machine learning: more than 75% of them have been using ML methods for more than 2 years. From the lift plots, we see that they use a wide range of ML methods, and focus particularly on deep learning: anything from convolutional networks for CV to NLP models. PyTorch, Keras and Tensorflow are used, and the proportions are roughly equal between the three frameworks. In terms of cloud services, AWS (and in particular EC2) receive the highest lift. 

This cluster includes the highest proportion of ML Engineers (nearly 25%) compared to other clusters, and also Data Scientists, Software Engineers, and Research Scientists. These participants focus heavily on building ML prototypes to explore new areas (73%) and experimentation to improve existing ML models (63%). 

### Cluster 1: ML Beginners

Users in Cluster 1 tend to use standard Python libraries (matplotlib, scikit-learn) in their local Jupyter environment. The majority of them (more than 70%) have 2 years or less of experience using machine learning methods. They tend not to use cloud ML services or AutoML tools. They rely on classic algorithms such as linear and logistic regression and random forests.

In terms of job roles, there is a mix of Software Engineers, Data Scientists, and Data Analysts, with smaller proportions for the rest of the roles. About 75% of these participants say that their employer is either exploring or already using ML methods, which suggests that there will be future opportunities for them to gain more experience in the field.

### Cluster 2: Business Intelligence

Cluster 2 users tend to use BI tools such as Tableau, Microsoft Power BI, and Excel. This is the cluster with the largest proportion of employers not using ML (about half).

In terms of job roles, this group includes a larger proportion of Business and Data Analysts (about 30% together), and also the highest proportion of unspecified roles ("Other"): more than a quarter!

Note: these survey participants have selected "I do not use ML methods", which is why they received some questions that are differently worded. In the UMAP projections, you will notice how they are separated from the rest of the clusters. This is also the smallest cluster.

### Cluster 3: Google Cloud and models in production

This cluster makes heavy use of Google Cloud products: anything from GC AutoML to Google Cloud SQL. They use TensorFlow (85%) and Keras (73%), but also PyTorch (63%). These users also apply various AutoML tools to their tasks, such as auto-sklearn.

Similar to cluster 0, these participants apply deep learning techniques to both CV and NLP problems. About 46% of the users say they do research to advance ML, which is the highest across all clusters! In addition, many cluster 3 members focus on putting models in production: 58% build and run the data infrastructure, and 57% build and run ML services. The distribution of job roles is also similar to that of cluster 0, where most users are Data Scientists, ML Engineers, Data Analysts or Software Engineers. About 40% of the users are employed in small companies with fewer than 50 employees.

### Cluster 4: ML on tabular data

Most participants in Cluster 4 regularly use SQL (more than 75%) to query various databases and warehouses such as PostgreSQL, Redshift, and Google BigQuery. They use ML models that are typically applied to tabular data, such as gradient boosting (59%) or regression (85%), and the scikit-learn library (80%). They are cloud users (AWS at 72%), but do not utilize cloud ML services. They do not use specialized hardware like GPUs / TPUs. This cluster includes the most experienced coders: more than 65% of the users have more than 5 years of coding experience.

These are some of the most experienced users in terms of using ML methods and writing code: about 90% have been writing code for more than 3 years. They tend to be employed in large companies: nearly half of them work in a company with more than 1000 people. Curiously, this is the group with the largest proportion of Data Scientist job roles (nearly half). The primary focus is analyzing data to influence business decisions (81%) and building ML prototypes in new areas (65%).

### Cluster 5: Computer Vision

More than 90% of the users in this cluster use Convolutional Neural Networks to tackle various problems in image classification, object detection and segmentation. They are most likely using TensorFlow (78%) and/or Keras (73%). Interestingly, in addition to Python, more than a quarter of the users utilize C++, presumably for high-performance computer vision applications. These users utilize GPUs, but more than half do not use a cloud platform.

This is the group with the highest proportion of PhDs (25%) and of Research Scientists as the job role (19%). Other common job roles are Data Scientist, ML Engineer, and Software Engineer. In terms of ML experience, this cluster falls in the middle, with about half the users having less than 2 years of experience with ML models. Users in this cluster are more likely to do research to advance ML methods compared to other segments.

### Cluster 6: R Users

This is the R Cluster! Nearly all participants in this cluster use R (in RStudio) and standard R packages such as ggplot2 and Caret, as well as Shiny (40%) for building and deploying apps.

This cluster has a higher percentage of Business and Data Analysts (nearly 30% in total), which is in line with the fact that many analysts use R. Perhaps not surprisingly, the majority of the Statisticians in the dataset fall under this cluster (but still represent a small proportion at 12%). The majority of users (more than 80%) focus heavily on analyzing data at work. Curiously, this is the cluster with the highest percentage of women (about 20% vs. 13% in the whole "professionals" dataset).

### Cluster 7: Azure and R and more

Just like Cluster 3 uses Google Cloud, Cluster 7 participants are more likely to use a set of tools from the Microsoft Azure platform, such as SQL Server (46%) and Power BI (52%). Interestingly, more than half of the participants in this cluster use R regularly. This might be related to the fact that Microsoft provides good support for R. In terms of ML experience, this cluster falls in the middle, with roughly half of the participants relatively new to ML methods. 

Users in this cluster are involved in a variety of activities, with the most common being "analyzing data to influence decisions" (80%), and "building ML prototypes to explore new areas" (60%). In addition, nearly half of them also run data infrastructure. This is the largest cluster and also perhaps the most diverse in terms of work activities, so I further split it into two subgroups. This is easy to do by applying the functions defined above to this cluster only. I found that one subgroup is more focused on ML model research (TensorBoard, PyTorch, CNNs, AutoML), whereas the other subgroup is focused on data analysis (R, RStudio, Azure Notebooks, ggplot2).
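The sub-clustering step can be sketched roughly as follows: select the rows assigned to one cluster and fit a two-component Gaussian mixture on them. This is only an illustration on synthetic data; the `embedding` and `labels` arrays are hypothetical stand-ins, and the notebook's own helper functions are not reproduced here.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Toy 2-D embedding standing in for the UMAP coordinates (made-up data).
embedding = np.vstack([
    rng.normal(loc=0.0, scale=0.3, size=(50, 2)),  # some other cluster
    rng.normal(loc=5.0, scale=0.3, size=(40, 2)),  # cluster 7, first subgroup
    rng.normal(loc=8.0, scale=0.3, size=(40, 2)),  # cluster 7, second subgroup
])
labels = np.array([0] * 50 + [7] * 80)  # toy cluster assignments

# Keep only the rows assigned to cluster 7 and fit a 2-component mixture on them.
mask = labels == 7
gmm = GaussianMixture(n_components=2, random_state=0)
sub_labels = gmm.fit_predict(embedding[mask])

print(np.bincount(sub_labels))  # size of each subgroup
```

The resulting `sub_labels` can then be profiled with the same lift calculation used for the main clusters.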

### Clusters 8 and 9: Getting started with data

I decided to merge these clusters together. It is challenging to infer much about these clusters since participants here selected "None" to many of the questions. A few exceptions: many participants selected "using basic statistical tool (e.g. Excel)", as well as "regularly using SQL". It is likely that many of these users have non-data-science primary work responsibilities, and are trying to incorporate some data analysis / data science techniques into their work. 

In terms of experience with ML methods, most of the participants in these clusters are beginners (and slightly more so in cluster 8). Cluster 8 has the highest percentage of software engineers across clusters (above 30%), which might explain why JavaScript was often picked as one of the regularly used languages.
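If you want the merged segment to show up as a single group in counts or plots, one simple option is to relabel cluster 9 as cluster 8 before aggregating. The snippet below is a sketch on toy labels, not the notebook's actual `cl_labels_prof`.

```python
import pandas as pd

labels = pd.Series([0, 8, 9, 8, 5, 9, 9])  # toy cluster assignments

# Relabel cluster 9 as cluster 8 so the two are counted as one segment.
merged = labels.replace({9: 8})
counts = merged.value_counts().sort_index()

print(counts)
```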