In [None]:
import numpy as np 
import pandas as pd 
from sklearn.preprocessing import StandardScaler
from yellowbrick.cluster import KElbowVisualizer
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
import seaborn as sns
from scipy.stats import mannwhitneyu, spearmanr, pearsonr
pd.options.mode.chained_assignment = None

# Introduction

Kaggle is a formidable platform with a great community, but who are the people who make up this platform? That's the question I wanted to answer with this notebook. I grouped the survey participants into 5 clusters using k-means clustering. The five types of Kagglers I found are:

* Seasoned ML-Professionals
* Seasoned Coders
* Inexperienced with ML-Methods
* Young ML-Professionals
* Students

The clusters are based on a selection of discrete, ordinal and binary features. I then explored the clusters with regard to the occupation of their members, how much they or their team spend on ml- or cloud computing services, their gender, and the share of each type among the respondents from the 15 countries with the most respondents.

In [None]:
survey = pd.read_csv('../input/kaggle-survey-2021/kaggle_survey_2021_responses.csv', low_memory=False)
survey.loc[survey['Q6'] == 'I have never written code', 'Q6'] = '0 years'
survey.loc[survey['Q15'] == 'I do not use machine learning methods', 'Q15'] = '0 years'
# drop the first row, which holds the question text rather than a response
data = survey.iloc[1:].copy()

# Getting the Features for KMeans-Clustering

In [None]:
# Creating the features for clustering
profiles = pd.DataFrame(index = data.index)

# Number of programming languages used
data['Q7_Part_12'] = np.nan
profiles['num_langs'] = data.filter(regex='Q7').notna().sum(axis=1)

# Number of ml-algorithms used
data['Q17_Part_11'] = np.nan
profiles['num_algs'] = data.filter(regex='Q17').notna().sum(axis=1)

# Number of sources
data['Q42_Part_11'] = np.nan
profiles['num_sources'] = data.filter(regex='Q42').notna().sum(axis=1)

# Age class (ordinal rank of the age bracket)
age_classes = ['18-21', '22-24', '25-29', '30-34', '35-39', '40-44', '45-49', '50-54', '55-59', '60-69', '70+']
profiles['age_class'] = data['Q1'].map({a: i for i, a in enumerate(age_classes)})

# User's degree
degrees = ['I prefer not to answer', 'No formal education past high school', 'Some college/university study without earning a bachelor’s degree',
          'Bachelor’s degree', 'Master’s degree', 'Professional doctorate', 'Doctoral degree']
degree_rank = {d: i for i, d in enumerate(degrees)}
# assign same rank to doctoral degree and professional doctorate
degree_rank['Doctoral degree'] = degree_rank['Professional doctorate']
profiles['degree'] = data['Q4'].map(degree_rank)
        
# How long has the user been writing code
code = ['0 years', '< 1 years', '1-3 years', '3-5 years', '5-10 years', '10-20 years', '20+ years']
profiles['years_coding_class'] = data['Q6'].map({c: i for i, c in enumerate(code)})

# How long has the user been using ml-methods
ml = ['0 years', 'Under 1 year', '1-2 years', '2-3 years', '3-4 years',  '4-5 years', '5-10 years', '10-20 years', '20 or more years']
profiles['years_ml_class'] = data['Q15'].map({m: i for i, m in enumerate(ml)})
# respondents who skipped the question get the median class
profiles['years_ml_class'] = profiles['years_ml_class'].fillna(profiles['years_ml_class'].median())

# Is the user a student?
profiles['student'] = data['Q5'].str.contains('Student')

# Does the employer use ml? 0 = no usage/unknown, 1 = moderate usage, 2 = extensive usage
ml_usage_rank = {
    'I do not know': 0,
    'No (we do not use ML methods)': 0,
    'We are exploring ML methods (and may one day put a model into production)': 1,
    'We recently started using ML methods (i.e., models in production for less than 2 years)': 1,
    'We use ML methods for generating insights (but do not put working models into production)': 1,
    'We have well established ML methods (i.e., models in production for more than 2 years)': 2,
}
profiles['ml_used_at_job'] = data['Q23'].fillna('I do not know').map(ml_usage_rank)

In [None]:
profiles

**The Features:**
* num_langs: The number of languages a Kaggler uses on a regular basis *(discrete)*
* num_algs: The number of ml-algorithms an individual uses on a regular basis *(discrete)*
* num_sources: The variety of sources a Kaggler uses *(discrete)*
* age_class: The class corresponding to a Kagglers age *(ordinal)*
* degree: Highest degree of a Kaggler or the highest degree the Kaggler will obtain in the next 2 years *(ordinal)*
* years_coding_class: Variable describing how many years an individual has been coding *(ordinal)*
* years_ml_class: Variable describing how many years an individual has been using ml-methods *(ordinal)*
* student: Variable describing whether an individual is a student  *(binary)*
* ml_used_at_job: How intensively does the workplace use ml-methods *(ordinal)*
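
The three count features come from multi-select questions, where every answer option has its own column and an option that was not selected is left empty. Here is a minimal sketch of that counting step on a made-up frame. The column names follow the survey's `Q7_Part_*` pattern, but the values are invented and `Q7_Part_3` merely plays the role of the "None" option:

```python
import numpy as np
import pandas as pd

# Toy multi-select question: one column per answer option, NaN = not selected.
toy = pd.DataFrame({
    'Q7_Part_1': ['Python', 'Python', np.nan],
    'Q7_Part_2': ['R', np.nan, np.nan],
    'Q7_Part_3': [np.nan, np.nan, 'None'],  # hypothetical "None" option
})

# Blank out the "None" option so it is not counted as a language.
toy['Q7_Part_3'] = np.nan

# Count the selected options per respondent.
num_langs = toy.filter(regex='^Q7_').notna().sum(axis=1)
print(num_langs.tolist())
```
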

In [None]:
palette = ['#016E78', '#02ABB7', '#00F3D7', "#94167F", '#E672E0', "#F62E97"]
pal5 = ['#02ABB7', '#00F3D7', "#F62E97", '#E672E0', '#674076']
pal3 = ['#00F3D7', "#94167F", "#F62E97"]
pal2 = ['#00F3D7', "#F62E97"]
fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(20,10))

sns.countplot(ax=ax1, x="num_langs", palette=palette, data=profiles)
sns.countplot(ax=ax2, x="num_algs", palette=palette, data=profiles)
sns.countplot(ax=ax3, x="num_sources", palette=palette, data=profiles)
plt.show()

In [None]:
fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(20,10))

sns.countplot(ax=ax1, x="age_class", palette=palette, data=profiles)
sns.countplot(ax=ax2, x="years_coding_class", palette=palette, data=profiles)
sns.countplot(ax=ax3, x="years_ml_class", palette=palette, data=profiles)
plt.show()

In [None]:
pal = pal2
fig = plt.subplots(figsize=(10,10))

data1 = [profiles['student'].mean(), 1-profiles['student'].mean()]
labels = ['student', 'not a student']
plt.pie(data1, labels = labels, colors = pal, autopct='%.0f%%')
plt.show()

In [None]:
pal = pal3

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(20,10))

sns.countplot(ax=ax1, x="Q4", palette=pal, data=survey[1:])
ax1.tick_params(labelrotation=45)
sns.countplot(ax=ax2, x="ml_used_at_job", palette=pal, data=profiles)
plt.show()

# KMeans

In [None]:
# KMeans relies on Euclidean distances, so the features are standardized to put them on a comparable scale
scl = StandardScaler()
s_profiles = pd.DataFrame(scl.fit_transform(profiles), index=profiles.index, columns=profiles.columns)
# using the elbow-method to find a suitable number of clusters
model = KMeans(random_state=69)
visualizer = KElbowVisualizer(model, k=(2,10))
visualizer.fit(s_profiles)  
visualizer.show()
plt.show()
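
KMeans assigns points by Euclidean distance, so a feature with a wide range (such as a count) would otherwise outweigh a binary one. As a rough sketch with invented numbers, this is the transformation `StandardScaler` applies with its default settings: each column is centered on its mean and divided by its population standard deviation.

```python
import numpy as np
import pandas as pd

# Invented values, for illustration only.
toy = pd.DataFrame({
    'num_langs': [0, 2, 3, 7],
    'student': [1, 0, 0, 1],
})

# z = (x - mean) / std, using the population standard deviation (ddof=0)
scaled = (toy - toy.mean()) / toy.std(ddof=0)
print(scaled.round(2))
```

After scaling, every column has mean 0 and standard deviation 1, so each feature contributes on the same scale to the distances.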

In [None]:
# fit and predict
kmeans = KMeans(n_clusters=5,random_state=69)
profiles['cluster'] = kmeans.fit_predict(s_profiles)

# Exploring the Clusters

In [None]:
#pal = ["palegreen", "paleturquoise", "lightpink","salmon","plum"]
pal = pal5

fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(30,10))

sns.boxplot(ax=ax1, x="cluster", y="num_algs", palette=pal, data=profiles)
sns.boxplot(ax=ax2, x="cluster", y="num_langs", palette=pal, data=profiles)
sns.boxplot(ax=ax3, x="cluster", y="num_sources", palette=pal, data=profiles)
plt.show()

**Number of Algorithms**

Kagglers from cluster 2 use the fewest ml-algorithms. 57% of users from cluster 2 use no ml-algorithm on a regular basis, and another 15.84% use one ml-algorithm. The boxplots for clusters 1 and 4 generally follow the same shape, with a 25th percentile of 0, a median of 1 and a 75th percentile of 3. Kagglers from these two clusters use a smaller variety of algorithms than Kagglers from clusters 0 and 3 but a larger variety than Kagglers from cluster 2. Clusters 0 and 3 use the most algorithms and are in general very similar. 57.75% from cluster 3 and 58.26% from cluster 0 reported that they use between 3 and 5 different algorithms on a regular basis.
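
The quartiles and shares quoted here can be read off the boxplots, but they can also be computed directly. A minimal sketch on invented data (the column names match `profiles`, the numbers do not):

```python
import pandas as pd

# Invented data: two clusters with five respondents each.
toy = pd.DataFrame({
    'cluster':  [1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
    'num_algs': [0, 0, 1, 3, 4, 0, 0, 0, 1, 2],
})

# 25th percentile, median and 75th percentile per cluster.
quartiles = (toy.groupby('cluster')['num_algs']
                .quantile([0.25, 0.5, 0.75])
                .unstack())
print(quartiles)

# Share of each cluster that uses no algorithm at all.
share_none = (toy['num_algs'] == 0).groupby(toy['cluster']).mean()
print(share_none)
```
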

**Number of Languages**

The median number of languages used on a regular basis by Kagglers from cluster 2 is 2. 1186 Kagglers from this cluster don't use any programming language on a regular basis, out of 1351 across all participants. Kagglers from clusters 0 and 1 use the most programming languages on a regular basis. For cluster 0, the median and the mode are both 3 languages and the arithmetic mean is 3.1938. For cluster 1, the median is also 3, the mode is 2 and the arithmetic mean is 2.9310.

**Number of Sources**

Individuals from the clusters 0 and 3 use the largest variety of sources. The median for both clusters is 3 and the 25th percentile and 75th percentile are at 2 and 4 respectively.

In [None]:
fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(30,10))

sns.countplot(ax=ax1, x="age_class", hue='cluster', palette=pal, data=profiles)
sns.countplot(ax=ax2, x="years_coding_class", hue='cluster', palette=pal, data=profiles)
sns.countplot(ax=ax3, x="years_ml_class", hue='cluster', palette=pal, data=profiles)
plt.show()

**Age**

Clusters 0 and 1 are the oldest. Cluster 0's median age class is '40-44' and its mode is '30-34'. Cluster 1 is older, with a median age class of '45-49' and a mode of '40-44'. Cluster 4 is the youngest: both its median and its mode are '18-21'. 66.95% of cluster 3 are at most 29 years old and 84.57% are under 35.

**Years a Kaggler has been Coding**

Kagglers from cluster 0 have mostly been coding for '10-20 years', which is both the median and the mode. The same applies to cluster 1. Both the median and mode for cluster 2 are '< 1 years'. Cluster 2 also includes almost all of the users who've never written any code: 983 out of 1032. Kagglers from cluster 4 have a little more experience than that, as their mode and median are '1-3 years'. The mode and median for cluster 3 are the same as for cluster 4; however, the classes 'I have never written code', '< 1 years' and '1-3 years' make up 55.30% of cluster 3 but 78.92% of cluster 4. So cluster 3 is generally more experienced in writing code than cluster 4.
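
Because the experience classes are stored as ordinal ranks, the median, the mode and the "share at or below a class" used in this comparison are simple group-wise aggregations. A hedged sketch with invented ranks, where `code_labels` mirrors the `code` list from the feature cell:

```python
import pandas as pd

code_labels = ['0 years', '< 1 years', '1-3 years', '3-5 years',
               '5-10 years', '10-20 years', '20+ years']

# Invented ranks for two clusters, five respondents each.
toy = pd.DataFrame({
    'cluster':            [3, 3, 3, 3, 3, 4, 4, 4, 4, 4],
    'years_coding_class': [1, 2, 3, 3, 4, 0, 1, 1, 2, 3],
})
grouped = toy.groupby('cluster')['years_coding_class']

# Median rank per cluster, translated back to the survey label
# (the toy medians are whole numbers, so the int cast is exact).
median_label = grouped.median().astype(int).map(lambda r: code_labels[r])

# Most frequent rank per cluster.
mode_rank = grouped.agg(lambda s: s.mode().iloc[0])

# Share of each cluster with at most '1-3 years' of experience (rank <= 2).
low_share = (toy['years_coding_class'] <= 2).groupby(toy['cluster']).mean()
print(median_label, mode_rank, low_share, sep='\n')
```

A larger `low_share` means a cluster is concentrated in the low-experience classes, which is the comparison made between clusters 3 and 4.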

**Years a Kaggler has been using ML-Methods**

The biggest difference between cluster 0 and 1 can be found when we look at their experience with machine learning methods. 53.40% of Kagglers from cluster 1 have less than one year of experience with ml-methods. The median for cluster 0 is '4-5 years' and the mode is '5-10 years'. In fact, 48% from this cluster belong to the classes '5-10 years' and upwards.

In [None]:
# function getting the data for the pie-chart
def get_data(df, cluster, key):
    out = []
    labl = []
    dft = df[df['cluster'] == cluster]
    for i in sorted(dft[key].unique()):
        out.append(len(dft[key][dft[key] == i]))
        labl.append(str(i))
    return out, labl

In [None]:
fig = plt.figure(figsize=(18,15))

fig.suptitle('Student')

pal = pal2

data0, label0 = get_data(profiles, 0, 'student')
data1, label1 = get_data(profiles, 1, 'student')
data2, label2 = get_data(profiles, 2, 'student')
data3, label3 = get_data(profiles, 3, 'student')
data4, label4 = get_data(profiles, 4, 'student')

ax1 = plt.subplot2grid(shape=(2,6), loc=(0,0), colspan=2)
ax2 = plt.subplot2grid((2,6), (0,2), colspan=2)
ax3 = plt.subplot2grid((2,6), (0,4), colspan=2)
ax4 = plt.subplot2grid((2,6), (1,1), colspan=2)
ax5 = plt.subplot2grid((2,6), (1,3), colspan=2)

ax1.pie(data0, labels = label0, colors = pal, autopct='%.0f%%')
ax1.set_title('Cluster 0')
ax2.pie(data1, labels = label1, colors = pal, autopct='%.0f%%')
ax2.set_title('Cluster 1')
ax3.pie(data2, labels = label2, colors = pal, autopct='%.0f%%')
ax3.set_title('Cluster 2')
ax4.pie(data3, labels = label3, colors = pal, autopct='%.0f%%')
ax4.set_title('Cluster 3')
ax5.pie(data4, labels = label4, colors = pal, autopct='%.0f%%')
ax5.set_title('Cluster 4')
plt.show()

**Share of Students**

Cluster 4 only consists of students, whilst the other clusters almost exclusively consist of non-students. 

In [None]:
fig = plt.figure(figsize=(18,15))

fig.suptitle('Highest Degree')

pal = palette

data0, label0 = get_data(profiles, 0, 'degree')
data1, label1 = get_data(profiles, 1, 'degree')
data2, label2 = get_data(profiles, 2, 'degree')
data3, label3 = get_data(profiles, 3, 'degree')
data4, label4 = get_data(profiles, 4, 'degree')

labels = ['Not Disclosed', 'No Degree p. HS.', 'College no Degree', 'Bachelor', 'Master', 'Doctor']

ax1 = plt.subplot2grid(shape=(2,6), loc=(0,0), colspan=2)
ax2 = plt.subplot2grid((2,6), (0,2), colspan=2)
ax3 = plt.subplot2grid((2,6), (0,4), colspan=2)
ax4 = plt.subplot2grid((2,6), (1,1), colspan=2)
ax5 = plt.subplot2grid((2,6), (1,3), colspan=2)

ax1.pie(data0, labels = labels, colors = pal, autopct='%.0f%%')
ax1.set_title('Cluster 0')
ax2.pie(data1, labels = labels, colors = pal, autopct='%.0f%%')
ax2.set_title('Cluster 1')
ax3.pie(data2, labels = labels, colors = pal, autopct='%.0f%%')
ax3.set_title('Cluster 2')
ax4.pie(data3, labels = labels, colors = pal, autopct='%.0f%%')
ax4.set_title('Cluster 3')
ax5.pie(data4, labels = labels, colors = pal, autopct='%.0f%%')
ax5.set_title('Cluster 4')
plt.show()

**Highest Degree**

Cluster 0 is highly educated, with 46% master’s degrees and 41% doctorates, for a combined share of 87%. Cluster 1 has the second largest share of doctorates with 20%. Its share of bachelor’s degrees (24%) is larger than that of cluster 0 but smaller than in all the other clusters. Clusters 2 and 3 mostly consist of people with bachelor’s and master’s degrees. The share of master’s degrees is larger in cluster 3 with 48%, and the share of bachelor’s degrees is larger in cluster 2 with 44%. Cluster 4 has the largest share of people who've had or will have some college/university education without a degree, at 12%. Bachelor’s degrees make up the majority of cluster 4 with 53%.
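
The pie charts pair the counts with a fixed list of six labels, which assumes that every cluster contains every degree rank. As an alternative sketch (invented data, and `degree_labels` is a name I made up for this example), a row-normalized crosstab keeps each share attached to its category even when a category is missing from a cluster:

```python
import pandas as pd

# Hypothetical mapping from degree rank to the short labels used in the charts.
degree_labels = {0: 'Not Disclosed', 1: 'No Degree p. HS.', 2: 'College no Degree',
                 3: 'Bachelor', 4: 'Master', 5: 'Doctor'}

# Invented data: cluster 4 contains no doctorates.
toy = pd.DataFrame({
    'cluster': [0, 0, 0, 0, 4, 4, 4, 4],
    'degree':  [4, 4, 5, 3, 3, 3, 2, 4],
})

# Each row sums to 1; absent categories are filled in with a share of 0.
shares = (pd.crosstab(toy['cluster'], toy['degree'], normalize='index')
            .reindex(columns=list(degree_labels), fill_value=0.0)
            .rename(columns=degree_labels))
print(shares.round(2))
```
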

In [None]:
fig = plt.figure(figsize=(18,15))

fig.suptitle('Usage of ML-Algorithms at the Workplace')

pal = pal3

data0, label0 = get_data(profiles, 0, "ml_used_at_job")
data1, label1 = get_data(profiles, 1, "ml_used_at_job")
data2, label2 = get_data(profiles, 2, "ml_used_at_job")
data3, label3 = get_data(profiles, 3, "ml_used_at_job")
data4, label4 = get_data(profiles, 4, "ml_used_at_job")

labels = ['No Usage/Unknown', 'Moderate Usage', 'Extensive Usage']

ax1 = plt.subplot2grid(shape=(2,6), loc=(0,0), colspan=2)
ax2 = plt.subplot2grid((2,6), (0,2), colspan=2)
ax3 = plt.subplot2grid((2,6), (0,4), colspan=2)
ax4 = plt.subplot2grid((2,6), (1,1), colspan=2)
ax5 = plt.subplot2grid((2,6), (1,3), colspan=2)

ax1.pie(data0, labels = labels, colors = pal, autopct='%.0f%%')
ax1.set_title('Cluster 0')
ax2.pie(data1, labels = labels, colors = pal, autopct='%.0f%%')
ax2.set_title('Cluster 1')
ax3.pie(data2, labels = labels, colors = pal, autopct='%.0f%%')
ax3.set_title('Cluster 2')
ax4.pie(data3, labels = labels, colors = pal, autopct='%.0f%%')
ax4.set_title('Cluster 3')
ax5.pie(data4, labels = ['No Usage/Unknown'], colors = pal, autopct='%.0f%%')
ax5.set_title('Cluster 4')
plt.show()

**Usage of ML-Algorithms at the Workplace** 

No usage of ML-algorithms is recorded at the workplaces of people from cluster 4, which is unsurprising as this cluster only contains students. Cluster 0 has the largest share of extensive use, which corresponds to the answer 'We have well established ML methods (i.e., models in production for more than 2 years)'. Clusters 1 and 2 are both mostly made up of people who either don't know whether their company uses ml-methods or whose company doesn't use any. This share is 61% for cluster 1 and 79% for cluster 2. In cluster 3, 58% report a moderate usage of ml-methods at their company.

# Naming the Clusters

In [None]:
fig, ax = plt.subplots(figsize=(10,10))
pal = pal5
# number of respondents per cluster, ordered by cluster id
cluster_sizes = profiles['cluster'].value_counts().sort_index()
plt.pie(cluster_sizes, labels = cluster_sizes.index.astype(str), colors = pal, autopct='%.0f%%')
plt.show()

**Cluster 0: Seasoned ML-Professionals**

Kagglers from cluster 0 are among the most experienced in coding and have the most experience in using ml-methods. They are highly educated, with the largest share of doctorates in their ranks. Their workplaces use ml-methods to the highest extent, with 38% reporting well-established ml-models. This cluster accounts for 13% of the participants.

**Cluster 1: Seasoned Coders**

Cluster 1 is the oldest. They're as experienced in coding as Kagglers from cluster 0; however, they generally have less experience in using ml-methods. The workplace of 61% of these Kagglers either doesn't use ml-methods or the Kagglers don't know whether it does. 13% of the participants belong to this cluster.

**Cluster 2: Inexperienced with ML-Methods**

A majority from this cluster doesn't work in a company where ml-methods are used, as 79% either don't know whether or not ml-methods are used at their company, or they reported that they aren't used. 983 out of 1032 participants who've never written code belong to this cluster. Those who work in data science are generally less experienced when it comes to the usage of ml-methods. 

**Cluster 3: Young ML-Professionals**

Cluster 3 consists of young adults who mostly work at companies with a moderate usage of ml-methods. Kagglers of this type use a wide variety of ml-algorithms but have less experience than the 'Seasoned ML-Professionals'.

**Cluster 4: Students**

We're students. There isn't much else to say.

# Occupation and Industry

In [None]:
# preparing the dataframes for the exploration concerning further questions
data['cluster'] = profiles['cluster']
profiles['gender'] = data['Q2']
profiles['country'] = data['Q3']
profiles['occupation'] = data['Q5']

In [None]:
fig, axs = plt.subplots(2, 2, figsize=(19,15))

# top ten professions by cluster
axs[0, 0].set_title('Seasoned ML-Professionals')
axs[0, 0].bar(data[profiles['cluster'] == 0].value_counts('Q5')[:10].index, data[profiles['cluster'] == 0].value_counts('Q5')[:10], color = palette)
axs[0, 0].tick_params(labelrotation=45)
axs[0, 0].grid(False)

axs[0, 1].set_title('Seasoned Coders')
axs[0, 1].bar(data[profiles['cluster'] == 1].value_counts('Q5')[:10].index, data[profiles['cluster'] == 1].value_counts('Q5')[:10], color = palette)
axs[0, 1].tick_params(labelrotation=45)
axs[0, 1].grid(False)

axs[1, 0].set_title('Inexperienced with ML-Methods')
axs[1, 0].bar(data[profiles['cluster'] == 2].value_counts('Q5')[:10].index, data[profiles['cluster'] == 2].value_counts('Q5')[:10], color = palette)
axs[1, 0].tick_params(labelrotation=45)
axs[1, 0].grid(False)

axs[1, 1].set_title('Young Professionals')
axs[1, 1].bar(data[profiles['cluster'] == 3].value_counts('Q5')[:10].index, data[profiles['cluster'] == 3].value_counts('Q5')[:10], color = palette)
axs[1, 1].tick_params(labelrotation=45)
axs[1, 1].grid(False)

plt.tight_layout()

**Seasoned ML-Professionals**

Most people in this cluster are data scientists (1074 individuals), followed by research scientists (549). This cluster has the largest share of research scientists with 16.67%.

**Seasoned Coders**

Most Kagglers from this cluster are software engineers. 482 of the participants belonging to this cluster answered with 'Other'.

**Inexperienced with ML-Methods**

The largest group in this cluster is currently not employed, with 1291 individuals; 1226 answered with 'Other'. However, there are also large groups of data analysts and data scientists, with 1106 and 850 respondents respectively.

**Young Professionals**

The four largest groups in this cluster are data scientists with 1458, data analysts with 758, software engineers with 717 and machine learning engineers with 668.
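
The counts quoted in this section all come from the same operation: filter one cluster, then count the answers to a question. A small hypothetical helper (`top_answers` is my own name, not part of the notebook) shows that step on invented data:

```python
import pandas as pd

def top_answers(df, column, cluster, n=10):
    """Return the n most frequent answers in `column` for one cluster."""
    return df.loc[df['cluster'] == cluster, column].value_counts().head(n)

# Invented responses, for illustration only.
toy = pd.DataFrame({
    'cluster': [0, 0, 0, 1, 1, 1],
    'Q5': ['Data Scientist', 'Data Scientist', 'Research Scientist',
           'Software Engineer', 'Software Engineer', 'Other'],
})

for cluster in (0, 1):
    print(cluster, top_answers(toy, 'Q5', cluster, n=2).to_dict())
```

The returned Series can be passed to a bar chart as `index` (labels) and values (heights), which is what the four panels do.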

In [None]:
data['Q20'] = data['Q20'].fillna('No Answer')

fig, axs = plt.subplots(2, 2, figsize=(19,15))

# top ten industries by cluster
axs[0, 0].set_title('Seasoned ML-Professionals')
axs[0, 0].bar(data[profiles['cluster'] == 0].value_counts('Q20')[:10].index, data[profiles['cluster'] == 0].value_counts('Q20')[:10], color = palette)
axs[0, 0].tick_params(labelrotation=45)
axs[0, 0].grid(False)

axs[0, 1].set_title('Seasoned Coders')
axs[0, 1].bar(data[profiles['cluster'] == 1].value_counts('Q20')[:10].index, data[profiles['cluster'] == 1].value_counts('Q20')[:10], color = palette)
axs[0, 1].tick_params(labelrotation=45)
axs[0, 1].grid(False)

axs[1, 0].set_title('Inexperienced with ML-Methods')
axs[1, 0].bar(data[profiles['cluster'] == 2].value_counts('Q20')[:10].index, data[profiles['cluster'] == 2].value_counts('Q20')[:10], color = palette)
axs[1, 0].tick_params(labelrotation=45)
axs[1, 0].grid(False)

axs[1, 1].set_title('Young Professionals')
axs[1, 1].bar(data[profiles['cluster'] == 3].value_counts('Q20')[:10].index, data[profiles['cluster'] == 3].value_counts('Q20')[:10], color = palette)
axs[1, 1].tick_params(labelrotation=45)
axs[1, 1].grid(False)

plt.tight_layout()

# Expenses for ML and Cloud Computing Services

In [None]:
fig = plt.figure(figsize=(18,15))

fig.suptitle('USD spent on ml-/cloud computing services in the past 5 years by the team or the individual')

pal = palette

data0, label0 = data['Q26'][profiles['cluster'] == 0].value_counts().sort_index(), sorted(data['Q26'][profiles['cluster'] == 0].value_counts().index)
data1, label1 = data['Q26'][profiles['cluster'] == 1].value_counts().sort_index(), sorted(data['Q26'][profiles['cluster'] == 1].value_counts().index)
data2, label2 = data['Q26'][profiles['cluster'] == 2].value_counts().sort_index(), sorted(data['Q26'][profiles['cluster'] == 2].value_counts().index)
data3, label3 = data['Q26'][profiles['cluster'] == 3].value_counts().sort_index(), sorted(data['Q26'][profiles['cluster'] == 3].value_counts().index)
data4, label4 = data['Q26'][profiles['cluster'] == 4].value_counts().sort_index(), sorted(data['Q26'][profiles['cluster'] == 4].value_counts().index)

ax1 = plt.subplot2grid(shape=(2,6), loc=(0,0), colspan=2)
ax2 = plt.subplot2grid((2,6), (0,2), colspan=2)
ax3 = plt.subplot2grid((2,6), (0,4), colspan=2)
ax4 = plt.subplot2grid((2,6), (1,1), colspan=2)
ax5 = plt.subplot2grid((2,6), (1,3), colspan=2)

ax1.pie(data0, labels = label0, colors = pal, autopct='%.0f%%')
ax1.set_title('Seasoned ML-Professionals')
ax2.pie(data1, labels = label1, colors = pal, autopct='%.0f%%')
ax2.set_title('Seasoned Coders')
ax3.pie(data2, labels = label2, colors = pal, autopct='%.0f%%')
ax3.set_title('Inexperienced with ML-Methods')
ax4.pie(data3, labels = label3, colors = pal, autopct='%.0f%%')
ax4.set_title('Young Professionals')
ax5.pie(data4, labels = label4, colors = pal, autopct='%.0f%%')
ax5.set_title('Students')
plt.show()

**USD spent on ml-/cloud computing services in the past 5 years by the team or the individual**

Seasoned ML-Professionals have spent the most money, with a staggering 17% answering that they or their team have spent at least 100,000 USD. 59% have spent at least 1000 USD. The opposite is true for the group that's inexperienced with ml-methods: 58% reported that they or their team didn't spend any money on ml-/cloud computing services, and overall 85% of them have spent less than 1000 USD.

# Gender

In [None]:
# shorten the gender labels for the pie charts
profiles['gender'] = profiles['gender'].replace({
    'Nonbinary': 'NB',
    'Prefer to self-describe': 'Self-described',
    'Prefer not to say': 'Not disclosed',
})

fig = plt.figure(figsize=(18,15))

fig.suptitle('Gender')

pal = palette
ex = (0, 0.05, 0.2, 0.1, 0)

data0, label0 = profiles['gender'][profiles['cluster'] == 0].value_counts().sort_index(), sorted(profiles['gender'][profiles['cluster'] == 0].value_counts().index)
data1, label1 = profiles['gender'][profiles['cluster'] == 1].value_counts().sort_index(), sorted(profiles['gender'][profiles['cluster'] == 1].value_counts().index)
data2, label2 = profiles['gender'][profiles['cluster'] == 2].value_counts().sort_index(), sorted(profiles['gender'][profiles['cluster'] == 2].value_counts().index)
data3, label3 = profiles['gender'][profiles['cluster'] == 3].value_counts().sort_index(), sorted(profiles['gender'][profiles['cluster'] == 3].value_counts().index)
data4, label4 = profiles['gender'][profiles['cluster'] == 4].value_counts().sort_index(), sorted(profiles['gender'][profiles['cluster'] == 4].value_counts().index)

ax1 = plt.subplot2grid(shape=(2,6), loc=(0,0), colspan=2)
ax2 = plt.subplot2grid((2,6), (0,2), colspan=2)
ax3 = plt.subplot2grid((2,6), (0,4), colspan=2)
ax4 = plt.subplot2grid((2,6), (1,1), colspan=2)
ax5 = plt.subplot2grid((2,6), (1,3), colspan=2)

ax1.pie(data0, labels = label0, explode = ex, colors = pal, autopct='%.0f%%')
ax1.set_title('Seasoned ML-Professionals')
ax2.pie(data1, labels = label1,explode = ex, colors = pal, autopct='%.0f%%')
ax2.set_title('Seasoned Coders')
ax3.pie(data2, labels = label2, explode = ex, colors = pal, autopct='%.0f%%')
ax3.set_title('Inexperienced with ML-Methods')
ax4.pie(data3, labels = label3, explode = ex, colors = pal, autopct='%.0f%%')
ax4.set_title('Young Professionals')
ax5.pie(data4, labels = label4, explode = ex, colors = pal, autopct='%.0f%%')
ax5.set_title('Students')
plt.show()

**Gender by cluster**

The community is still predominantly male. The cluster with the lowest share of women is the seasoned ml-professionals: only 11% of this cluster are women. The situation for the seasoned coders isn't much different, since only 14% of this group are women. Young ml-professionals are also for the largest part male. The clusters for students and for people who are inexperienced with ml-methods are quite similar, as both have a share of women of 23%.

# Clusters by Country

In [None]:
fig, axs = plt.subplots(5, 3, figsize=(20,20))

pal = pal5

axs[0, 0].set_title('India')
axs[0, 0].pie(profiles['cluster'][profiles['country'] == 'India'].value_counts().sort_index(), labels = sorted(profiles['cluster'][profiles['country'] == 'India'].value_counts().index), colors = pal, autopct='%.0f%%')
axs[0, 1].set_title('United States of America')
axs[0, 1].pie(profiles['cluster'][profiles['country'] == 'United States of America'].value_counts().sort_index(), labels = sorted(profiles['cluster'][profiles['country'] == 'United States of America'].value_counts().index), colors = pal, autopct='%.0f%%')
axs[0, 2].set_title('Japan')
axs[0, 2].pie(profiles['cluster'][profiles['country'] == 'Japan'].value_counts().sort_index(), labels = sorted(profiles['cluster'][profiles['country'] == 'Japan'].value_counts().index), colors = pal, autopct='%.0f%%')
axs[1, 0].set_title('China')
axs[1, 0].pie(profiles['cluster'][profiles['country'] == 'China'].value_counts().sort_index(), labels = sorted(profiles['cluster'][profiles['country'] == 'China'].value_counts().index), colors = pal, autopct='%.0f%%')
axs[1, 1].set_title('Brazil')
axs[1, 1].pie(profiles['cluster'][profiles['country'] == 'Brazil'].value_counts().sort_index(), labels = sorted(profiles['cluster'][profiles['country'] == 'Brazil'].value_counts().index), colors = pal, autopct='%.0f%%')
axs[1, 2].set_title('Russia')
axs[1, 2].pie(profiles['cluster'][profiles['country'] == 'Russia'].value_counts().sort_index(), labels = sorted(profiles['cluster'][profiles['country'] == 'Russia'].value_counts().index), colors = pal, autopct='%.0f%%')
axs[2, 0].set_title('Nigeria')
axs[2, 0].pie(profiles['cluster'][profiles['country'] == 'Nigeria'].value_counts().sort_index(), labels = sorted(profiles['cluster'][profiles['country'] == 'Nigeria'].value_counts().index), colors = pal, autopct='%.0f%%')
axs[2, 1].set_title('UK') 
axs[2, 1].pie(profiles['cluster'][profiles['country'] == 'United Kingdom of Great Britain and Northern Ireland'].value_counts().sort_index(), labels = sorted(profiles['cluster'][profiles['country'] == 'United Kingdom of Great Britain and Northern Ireland'].value_counts().index), colors = pal, autopct='%.0f%%')
axs[2, 2].set_title('Pakistan')
axs[2, 2].pie(profiles['cluster'][profiles['country'] == 'Pakistan'].value_counts().sort_index(), labels = sorted(profiles['cluster'][profiles['country'] == 'Pakistan'].value_counts().index), colors = pal, autopct='%.0f%%')
axs[3, 0].set_title('Egypt')
axs[3, 0].pie(profiles['cluster'][profiles['country'] == 'Egypt'].value_counts().sort_index(), labels = sorted(profiles['cluster'][profiles['country'] == 'Egypt'].value_counts().index), colors = pal, autopct='%.0f%%')
axs[3, 1].set_title('Germany')
axs[3, 1].pie(profiles['cluster'][profiles['country'] == 'Germany'].value_counts().sort_index(), labels = sorted(profiles['cluster'][profiles['country'] == 'Germany'].value_counts().index), colors = pal, autopct='%.0f%%')
axs[3, 2].set_title('Spain')
axs[3, 2].pie(profiles['cluster'][profiles['country'] == 'Spain'].value_counts().sort_index(), labels = sorted(profiles['cluster'][profiles['country'] == 'Spain'].value_counts().index), colors = pal, autopct='%.0f%%')
axs[4, 0].set_title('Indonesia')
axs[4, 0].pie(profiles['cluster'][profiles['country'] == 'Indonesia'].value_counts().sort_index(), labels = sorted(profiles['cluster'][profiles['country'] == 'Indonesia'].value_counts().index), colors = pal, autopct='%.0f%%')
axs[4, 1].set_title('Turkey')
axs[4, 1].pie(profiles['cluster'][profiles['country'] == 'Turkey'].value_counts().sort_index(), labels = sorted(profiles['cluster'][profiles['country'] == 'Turkey'].value_counts().index), colors = pal, autopct='%.0f%%')
axs[4, 2].set_title('France')
axs[4, 2].pie(profiles['cluster'][profiles['country'] == 'France'].value_counts().sort_index(), labels = sorted(profiles['cluster'][profiles['country'] == 'France'].value_counts().index), colors = pal, autopct='%.0f%%')
plt.show()

In [None]:
# loading data GDP per capita, ppp in 2018
gdp_per_capita = pd.read_csv('../input/gdp-per-capita-all-countries/GDP.csv')

# Making the country names compatible with the spelling used in the survey
gdp_per_capita['Country '] = gdp_per_capita['Country '].replace({
    'Russian Federation': 'Russia',
    'Egypt, Arab Rep.': 'Egypt',
    'Iran, Islamic Rep.': 'Iran, Islamic Republic of...',
    'United States': 'United States of America',
    'Vietnam': 'Viet Nam',
    'United Kingdom': 'United Kingdom of Great Britain and Northern Ireland',
    'Hong Kong SAR, China': 'Hong Kong (S.A.R.)',
    'Korea, Rep.': 'South Korea',
})
# ['Other', 'I do not wish to disclose my location', 'Taiwan'] were not included

p_country = pd.get_dummies(profiles, columns = ['cluster'])
# share of each cluster per country (mean of the 0/1 indicator columns)
p_country = p_country.groupby('country').mean(numeric_only=True)
gdp_per_capita = gdp_per_capita.set_index('Country ')
gdp_per_capita = gdp_per_capita[gdp_per_capita.index.isin(p_country.index)]
p_country['gdp_pc'] = gdp_per_capita['2018']
# countries without a GDP value are dropped
p_country = p_country.dropna()
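
The idea behind this cell is that the mean of a 0/1 cluster indicator within a country equals that cluster's share of the country's respondents. A minimal sketch of the same idea on toy data, where the GDP figures and the extra country are invented purely for illustration:

```python
import pandas as pd

toy = pd.DataFrame({
    'country': ['India', 'India', 'India', 'India', 'Germany', 'Germany'],
    'cluster': [4, 4, 2, 0, 0, 1],
})
# Invented GDP values; 'Elsewhere' has no survey respondents.
gdp = pd.Series({'India': 7000.0, 'Germany': 54000.0, 'Elsewhere': 1.0}, name='gdp_pc')

# Mean of a 0/1 indicator per country = share of respondents in that cluster.
shares = (pd.get_dummies(toy['cluster'], prefix='cluster', dtype=float)
            .groupby(toy['country'])
            .mean())

# An inner join keeps only the countries present in both tables.
by_country = shares.join(gdp, how='inner')
print(by_country)
```
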

In [None]:
fig = plt.figure(figsize=(20,15))

fig.suptitle('GDP per capita (PPP, 2018) vs. the share of each cluster')

ax1 = plt.subplot2grid(shape=(2,6), loc=(0,0), colspan=2)
ax2 = plt.subplot2grid((2,6), (0,2), colspan=2)
ax3 = plt.subplot2grid((2,6), (0,4), colspan=2)
ax4 = plt.subplot2grid((2,6), (1,1), colspan=2)
ax5 = plt.subplot2grid((2,6), (1,3), colspan=2)

ax1.scatter(p_country['gdp_pc'], p_country['cluster_0'], color = "#F62E97")
ax1.set_title('Seasoned ML-Professionals')
ax2.scatter(p_country['gdp_pc'], p_country['cluster_1'], color = "#F62E97")
ax2.set_title('Seasoned Coders')
ax3.scatter(p_country['gdp_pc'], p_country['cluster_2'], color = "#F62E97")
ax3.set_title('Inexperienced with ML-Methods')
ax4.scatter(p_country['gdp_pc'], p_country['cluster_3'], color = "#F62E97")
ax4.set_title('Young ML-Professionals')
ax5.scatter(p_country['gdp_pc'], p_country['cluster_4'], color = "#F62E97")
ax5.set_title('Students')
plt.show()

In [None]:
print('r and p-value')
print('Seasoned ML-Professionals: ' + str(pearsonr(p_country['gdp_pc'], p_country['cluster_0'])))
print('Seasoned Coders: ' + str(pearsonr(p_country['gdp_pc'], p_country['cluster_1'])))
print('Inexperienced with ML-Methods: ' + str(pearsonr(p_country['gdp_pc'], p_country['cluster_2'])))
print('Young ML-Professionals: ' + str(pearsonr(p_country['gdp_pc'], p_country['cluster_3'])))
print('Students: ' + str(pearsonr(p_country['gdp_pc'], p_country['cluster_4'])))

**Relationship between GDP per capita (PPP, 2018) and each cluster's share of a country's respondents**

There is a significant linear relationship between each cluster's share of a country's respondents and that country's GDP per capita. The correlation is strong and positive for Seasoned ML-Professionals and moderately positive for Seasoned Coders. The shares of Students, Young ML-Professionals and people inexperienced with ML-methods are negatively correlated with GDP per capita. I can think of two possible reasons for these relationships. The first is the negative relationship between birth rates and a country's wealth: wealthy countries may simply have fewer young people who could become data analysts and data scientists. The second is that the shares may reflect different levels of maturity in each country's tech sector. I can't properly explain the pattern with the data at hand, but it is still an interesting observation.
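
Since Pearson's r only captures linear association and GDP per capita tends to be skewed, a rank-based coefficient could serve as a robustness check. The sketch below is not part of the original analysis: the helper `correlation_table` is a hypothetical name, and the toy data only mimics the layout of `p_country` (one `gdp_pc` column plus one share column per cluster).

```python
import numpy as np
import pandas as pd
from scipy.stats import pearsonr, spearmanr

def correlation_table(df, x_col, y_cols):
    """Pearson and Spearman coefficients (with p-values) of x_col against each column in y_cols."""
    rows = {}
    for col in y_cols:
        r, p_r = pearsonr(df[x_col], df[col])
        rho, p_rho = spearmanr(df[x_col], df[col])
        rows[col] = {'pearson_r': r, 'pearson_p': p_r,
                     'spearman_rho': rho, 'spearman_p': p_rho}
    return pd.DataFrame.from_dict(rows, orient='index')

# Synthetic data for illustration only (assumed values, not survey results)
rng = np.random.default_rng(0)
gdp = rng.uniform(5_000, 70_000, size=30)
toy = pd.DataFrame({
    'gdp_pc': gdp,
    'cluster_0': 0.1 + 2e-6 * gdp + rng.normal(0, 0.02, size=30),
    'cluster_4': 0.4 - 3e-6 * gdp + rng.normal(0, 0.02, size=30),
})

table = correlation_table(toy, 'gdp_pc', ['cluster_0', 'cluster_4'])
print(table.round(3))
```

On the real data the call would presumably be `correlation_table(p_country, 'gdp_pc', ['cluster_0', 'cluster_1', 'cluster_2', 'cluster_3', 'cluster_4'])`. If the Pearson and Spearman coefficients agree in sign and rough size, the relationship is less likely to be driven by a few very wealthy countries.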

# Conclusion

I identified 5 distinct types of Kagglers.

The first group is the Seasoned ML-Professionals. They are among the oldest Kagglers and have the most experience in machine learning. Their most common occupation is 'Data Scientist', followed by 'Research Scientist'. 17% of this type reported that they or their team have spent at least 100,000 USD on ML and cloud computing services in the past 5 years. This cluster has the smallest share of women, at only 11%.

The Seasoned Coders are the oldest group. They have about as much coding experience as the Seasoned ML-Professionals but little experience in machine learning. Most Kagglers in this group are software engineers. For the most part they don't use ML-methods at work: 61% reported that their company doesn't use ML-methods or that they are unaware whether it does. The share of women is also very low, at only 14%.

The cluster of people who are inexperienced with ML-methods is rather heterogeneous: the four most common answers regarding occupation were that they are currently not employed, that none of the options describes their job, that they work as data analysts, or that they work as data scientists. Data analysts and data scientists in this group are generally less experienced in using ML-methods. This group is at the younger end of the spectrum, and a relatively large share of it is female, at 23%.

Young ML-Professionals are young adults who mostly work at companies with moderate usage of ML-methods. By far the most common job is data scientist, followed by data analyst, software engineer and machine learning engineer.

The Students are the youngest and fairly inexperienced. The share of women is 23%.

It might be reasonable to have a look at the accessibility of well-paid data science jobs for women. Another interesting observation is that the relative frequency of each type differs a lot across countries and is correlated with a country's PPP-adjusted GDP per capita.