# Kaggle DS and ML Survey 2020

%%HTML
<style type="text/css">

div.h2 {
    background-color: firebrick; 
    color: white; 
    padding: 8px; 
    padding-right: 300px; 
    font-size: 24px; 
    max-width: 1500px; 
    margin-top: 50px;
    margin-bottom:4px;
}

</style>

![](https://www.gammanalytics.com/assets/img/services/DataScience.png)

Image: https://www.gammanalytics.com

%%HTML
<style type="text/css">

div.h3 {
    background-color: dodgerblue; 
    color: white; 
    padding: 8px; 
    padding-right: 300px; 
    font-size: 20px; 
    max-width: 1500px; 
    margin-top: 50px;
    margin-bottom:4px;
}

</style>

<div class=h2>Overview</div>

This year, 20,036 Kaggle users told us how they learn and level up, which tools they’re using, and what they recommend. The results include raw numbers about who is working with data, what’s happening with machine learning in different industries, and the best ways for new data scientists to break into the field.

<div class=h2>Key Results</div>

### Here are a few of the top takeaways from this year’s results.

* Python, the fastest-growing major programming language, is the most used language in the survey, followed by SQL.
* The overwhelming majority of respondents are still men, although this situation is slowly changing.
* Around half of respondents reside in Asia, making it the continent with the most Kaggle users.
* The most common role is student, and the most common degrees are master's and bachelor's degrees.
* India has the most respondents, followed by the USA, while countries such as Ghana and Ireland have among the fewest.
* More than half of respondents have less than 5 years of coding experience.
* Most respondents are between 18 and 30 years old.

In [None]:
pip install --upgrade pip

In [None]:
pip install seaborn --upgrade

In [None]:
!pip install pycountry_convert

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
print('Seaborn version', sns.__version__)
import os
%config InlineBackend.figure_format = 'retina'
plt.style.use('ggplot')
import warnings
warnings.filterwarnings('ignore')
import pycountry
import pycountry_convert as pc
from plotly.offline import init_notebook_mode, iplot
import plotly.graph_objs as go
import plotly.offline as py
import textwrap

In [None]:
data = pd.read_csv("../input/kaggle-survey-2020/kaggle_survey_2020_responses.csv")
data.head()

In [None]:
questions = data.iloc[0, :].T
data = data.iloc[1:, :]

<div class=h2>Kagglers Profile</div>

### What we know about Kaggle users

In [None]:
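# The column is read as text because the first CSV row held the question wording,
# so convert it to numbers before taking the median (reported in minutes).
pd.to_numeric(data['Time from Start to Finish (seconds)']).median() / 60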

<div class=h2>Geography</div>

In [None]:
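# Count respondents per country; name the column 'Q3' explicitly so later cells can rely on it
Map = data['Q3'].value_counts().rename('Q3').to_frame()

def alpha3code(column):
    """Return the ISO 3166-1 alpha-3 code for each country name in `column`."""
    codes = []
    for country in column:
        if country == 'Other':
            codes.append('Other')
            continue
        try:
            # .alpha_3 is the 3-letter country code; .alpha_2 would be the 2-letter one
            codes.append(pycountry.countries.search_fuzzy(country)[0].alpha_3)
        except LookupError:
            # no fuzzy match was found for this name
            codes.append('None')
    return codes

# create a column for the code
Map['CODE'] = alpha3code(Map.index)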
Map.head()

In [None]:
import geopandas
from geopandas import GeoDataFrame
world = geopandas.read_file(geopandas.datasets.get_path('naturalearth_lowres'))
# rename the columns so that we can merge with our data
world.columns=['pop_est', 'continent', 'name', 'CODE', 'gdp_md_est', 'geometry']
# then merge with our data 
merge=pd.merge(Map,world, how='right', on='CODE')
# merge['Q3'] = merge['Q3'].fillna(0)
merge = GeoDataFrame(merge).sort_values(by='Q3',ascending=False)
location=pd.read_csv('https://raw.githubusercontent.com/melanieshi0120/COVID-19_global_time_series_panel_data/master/data/countries_latitude_longitude.csv')
merge=merge.merge(location,on='name').reset_index()
merge.head()

In [None]:
x=pd.array(merge[merge.name=="Egypt"].latitude)[0]
merge['latitude'] = merge['latitude'].replace( x,26.8357675)
merge['longitude'] = merge['longitude'].replace([-78.183406],30.7956597)

In [None]:
merge.plot(column='Q3', scheme="quantiles",
           figsize=(30, 25), cmap='Reds',
           legend=True, missing_kwds={'color': 'grey',
           "hatch": "",
           "label": "Missing values"} )
plt.title('2020 Participants',fontsize=30, weight='bold')
# add countries names and numbers 
for i in range(0,20):
    plt.text(float(merge.longitude[i]), float(merge.latitude[i]),
             "{}\n{}".format(merge.name[i], int(merge.Q3[i])), size=10)

The median time spent on the survey for qualified responses was 10.43 minutes. The survey data also contains missing values, a limitation to keep in mind when interpreting the results.
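
The median comes from the `Time from Start to Finish (seconds)` column, which can load as text because the first CSV row holds the question wording. The following is a minimal plain-Python sketch of that calculation. The durations are made up for illustration and are not survey values, and `median_minutes` is a hypothetical helper name.

```python
from statistics import median

def median_minutes(durations_in_seconds):
    """Return the median duration in minutes, ignoring blank entries."""
    seconds = [float(s) for s in durations_in_seconds if s not in ("", None)]
    return median(seconds) / 60

# Made-up durations stored as strings, with one blank entry.
sample = ["300", "600", "", "900", "1200"]
print(median_minutes(sample))
```

Converting to numbers first matters: a median over the raw strings would either fail or sort them alphabetically.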

In [None]:
# Shorten long country names; assign the result back instead of relying on inplace=True
data['Q3'] = data['Q3'].replace({
    'United States of America': 'USA',
    'Viet Nam': 'Vietnam',
    'United Kingdom of Great Britain and Northern Ireland': 'UK',
    'Iran, Islamic Republic of...': 'Iran',
})

In [None]:
data['Q3'].unique()

In [None]:
data['Q3'].count()

In [None]:
data['Q3'].value_counts()

<div class=h3>Continents</div>

In [None]:
countries = np.asarray(data["Q3"])
# Continent_code to Continent_names
continents = {
    'NA': 'North America',
    'SA': 'South America', 
    'AS': 'Asia',
    'OC': 'Oceania',
    'AF': 'Africa',
    'EU' : 'Europe',
    'na' : 'Others'
}

# Define a function that returns the continent code for a country.
def country_to_continent_code(country):
    try:
        return pc.country_alpha2_to_continent_code(pc.country_name_to_country_alpha2(country))
    except :
        return 'na'
    
#Collecting Continent Information
data.insert(2,"continent", [continents[country_to_continent_code(country)] for country in countries[:]])
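
The cell above maps each country to a continent and falls back to `'Others'` when the lookup fails. A plain-Python sketch of that fallback pattern follows. The small `COUNTRY_TO_CODE` table is a hypothetical, hand-written stand-in for the lookup that `pycountry_convert` performs, and `continent_of` is an illustrative name.

```python
# Continent codes and display names (subset, for illustration).
CONTINENT_NAMES = {
    "NA": "North America",
    "AS": "Asia",
    "EU": "Europe",
    "na": "Others",
}

# Hypothetical stand-in for a full country -> continent-code lookup.
COUNTRY_TO_CODE = {"India": "AS", "Germany": "EU", "Canada": "NA"}

def continent_of(country):
    """Return the continent name, or 'Others' if the country is unknown."""
    code = COUNTRY_TO_CODE.get(country, "na")
    return CONTINENT_NAMES[code]

print(continent_of("India"))
print(continent_of("Other"))
```

Any name the lookup does not recognize, including shortened names, ends up in the fallback group, so it is worth checking which countries land there.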

In [None]:
df_continents = data.groupby(["continent"]).sum()

In [None]:
continents = data['continent'].value_counts().sort_values(ascending=False)
plt.figure(figsize=(15,6))
color = ['dodgerblue' if (x < max(continents)) else 'firebrick' for x in continents]
ax = sns.countplot(x="continent", data=data, order=continents.index, palette=color, saturation=1)
plt.xticks(rotation=0, fontsize=15)
for p in ax.patches:
    width, height = p.get_width(), p.get_height()
    x, y = p.get_xy() 
    ax.annotate("{:.2f}%".format((p.get_height())*100/len(data['continent'])),
                (x + width/2, y + height/2), ha='center', va='center',fontsize=12)

About 50% of respondents currently reside in Asia, and about 15% of respondents currently reside in Europe. These are the continents with the highest number of respondents.
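
The shares quoted here are counts divided by the number of respondents. A minimal plain-Python sketch of that calculation is shown below, using made-up continent labels rather than the survey data. The helper name `shares` is illustrative.

```python
from collections import Counter

def shares(values):
    """Map each distinct value to its percentage of all entries, largest first."""
    counts = Counter(values)
    total = len(values)
    return {value: 100 * n / total for value, n in counts.most_common()}

# Made-up sample, not the survey data.
sample = ["Asia"] * 5 + ["Europe"] * 3 + ["Africa"] * 2
for continent, pct in shares(sample).items():
    print(f"{continent}: {pct:.2f}%")
```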

<div class=h3>Countries</div>

In [None]:
countries = data['Q3'].value_counts()
plt.figure(figsize=(10, 30))
color = ['dodgerblue' if (y < max(countries)) else 'firebrick' for y in countries]
ax= sns.countplot(y="Q3", data=data, order=countries.index, palette=color, saturation=1)
plt.ylabel('country')
plt.yticks(fontsize=15)
for p in ax.patches:
    width, height = p.get_width(), p.get_height()
    x, y = p.get_xy() 
    ax.annotate("{:.2f}%".format((p.get_width())*100/len(data['Q3'])),
                (x + width/2, y + height/2), ha='left', va='center',fontsize=12)

More than 29% of Kagglers are from India, and more than 11% are from the USA. Countries like Belarus, Ireland and Ghana are the least represented among Kagglers.
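
The charts in this notebook color the largest bar `firebrick` and every other bar `dodgerblue`. The list comprehension behind that can be sketched on its own as follows. The function name `bar_colors` and the counts are illustrative only.

```python
def bar_colors(counts, base="dodgerblue", highlight="firebrick"):
    """Return one color per bar, highlighting every bar that ties for the maximum."""
    top = max(counts)
    return [highlight if c == top else base for c in counts]

# Made-up counts; the first bar is the largest.
print(bar_colors([120, 45, 30]))
```

The colors follow the order of the counts, so the list only lines up with the bars when the chart is drawn in that same order.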

<div class=h2>Demographics</div>

<div class=h3>Gender</div>

In [None]:
colors = ['firebrick', 'dodgerblue', 'black', 'yellow', 'olive'] 
counts = data['Q2'].value_counts(sort=True)
labels = counts.index
values = counts.values
pie = go.Pie(labels=labels, values=values, marker=dict(colors=colors, line=dict(color='#000000', width=1)))
fig = go.Figure(data=[pie])
py.iplot(fig)

Respondents were asked about their gender identity: globally, about 80% are men. More than 19% of this year's respondents are women, slightly up from last year's survey. This is an improvement, but the continued low proportion points to problems with inclusion in the tech industry in general and on Kaggle in particular.

<div class=h3>Age</div>

In [None]:
age = data['Q1'].value_counts()
plt.figure(figsize=(15,6))
color = ['dodgerblue' if (x < max(age)) else 'firebrick' for x in age]
ax= sns.countplot(x="Q1", data=data, order=age.index, palette=color, saturation=1)
plt.xlabel('age')
plt.xticks(rotation=0, fontsize=15)
for p in ax.patches:
    width, height = p.get_width(), p.get_height()
    x, y = p.get_xy() 
    ax.annotate("{:.2f}%".format((p.get_height())*100/len(data['Q1'])),
                (x + width/2, y + height/2), ha='center', va='center',fontsize=12)

About three-fourths of the people who took the survey are younger than 35.

<div class=h3>Education</div>

In [None]:
education = data['Q4'].value_counts()
plt.figure(figsize=(10, 10))
color = ['dodgerblue' if (y < max(education)) else 'firebrick' for y in education]
ax= sns.countplot(y="Q4", data=data, order=education.index, palette=color, saturation=1)
max_width = 20
ax.set_yticklabels(textwrap.fill(y.get_text(), max_width) for y in ax.get_yticklabels())
plt.ylabel('education')
plt.yticks(rotation=0, fontsize=15)
for p in ax.patches:
    width, height = p.get_width(), p.get_height()
    x, y = p.get_xy() 
    ax.annotate("{:.2f}%".format((p.get_width())*100/len(data['Q4'])),
                (x + width/2, y + height/2), ha='left', va='center', fontsize=12)

Worldwide, more than three-fourths of respondents have the equivalent of a bachelor's degree or higher. However, it is not that rare to find accomplished professionals who have not completed a degree.
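
Long answer options are wrapped with `textwrap.fill` so that tick labels fit beside the bars. A short standalone sketch of that step, using a 20-character width as in the cells of this notebook and an example label:

```python
import textwrap

# Example of a long answer label that would otherwise crowd the axis.
label = "Some college/university study without earning a bachelor's degree"

# Break the label into lines of at most 20 characters, splitting at spaces.
wrapped = textwrap.fill(label, 20)
print(wrapped)
```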

<div class=h3>Current Job Role</div>

In [None]:
job_role = data['Q5'].value_counts()
plt.figure(figsize=(10, 10))
color = ['dodgerblue' if (y < max(job_role)) else 'firebrick' for y in job_role]
ax= sns.countplot(y="Q5", data=data, order=job_role.index, palette=color, saturation=1)
max_width = 20
ax.set_yticklabels(textwrap.fill(y.get_text(), max_width) for y in ax.get_yticklabels())
plt.ylabel('job role')
plt.yticks(rotation=0, fontsize=15)
for p in ax.patches:
    width, height = p.get_width(), p.get_height()
    x, y = p.get_xy() 
    ax.annotate("{:.2f}%".format((p.get_width())*100/len(data['Q5'])),
                (x + width/2, y + height/2), ha='left', va='center',fontsize=12)

Almost 26% of all respondents are Students. Among professionals, more than 13% of respondents are Data Scientists, and about 10% of respondents are Software Engineers.

<div class=h3>Code Experience</div>

In [None]:
experience = data['Q6'].value_counts()
plt.figure(figsize=(15, 6))
color = ['dodgerblue' if (x < max(experience)) else 'firebrick' for x in experience]
ax= sns.countplot(x="Q6", data=data, order=experience.index, palette=color, saturation=1)
max_width = 20
ax.set_xticklabels(textwrap.fill(x.get_text(), max_width) for x in ax.get_xticklabels())
plt.xlabel('code experience')
plt.xticks(rotation=0, fontsize=15)
for p in ax.patches:
    width, height = p.get_width(), p.get_height()
    x, y = p.get_xy() 
    ax.annotate("{:.2f}%".format((p.get_height())*100/len(data['Q6'])),
                (x + width/2, y + height/2), ha='center', va='center',fontsize=12)

Over 60% of respondents have less than five years of professional coding experience.
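
A statement like this combines several answer buckets into one cumulative share. A minimal sketch of the arithmetic follows. The bucket labels and counts are made up for illustration and are not the survey results.

```python
# Made-up bucket counts, ordered from least to most experience.
buckets = [
    ("< 1 years", 30),
    ("1-2 years", 25),
    ("3-5 years", 20),
    ("5-10 years", 15),
    ("10-20 years", 10),
]

def cumulative_share(buckets, upto):
    """Percentage of respondents falling in the first `upto` buckets."""
    total = sum(n for _, n in buckets)
    return 100 * sum(n for _, n in buckets[:upto]) / total

# Share of respondents in the three lowest buckets.
print(cumulative_share(buckets, 3))
```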

<div class=h3>Compensation</div>

In [None]:
compensation = data['Q24'].value_counts()
plt.figure(figsize=(10, 15))
color = ['dodgerblue' if (y < max(compensation)) else 'firebrick' for y in compensation]
ax= sns.countplot(y="Q24", data=data, order=compensation.index, palette=color, saturation=1)
max_width = 20
ax.set_yticklabels(textwrap.fill(y.get_text(), max_width) for y in ax.get_yticklabels())
plt.ylabel('compensation')
plt.yticks(rotation=0, fontsize=15)
for p in ax.patches:
    width, height = p.get_width(), p.get_height()
    x, y = p.get_xy() 
    ax.annotate("{:.2f}%".format((p.get_width())*100/len(data['Q24'])),
                (x + width/2, y + height/2), ha='left', va='center',fontsize=12)

Respondents were asked about their compensation, and the most common answer among those who responded was less than 1,000 USD a year.
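
The bar annotations divide by `len(data['Q24'])`, which counts every respondent, including those who skipped the question. The share among people who answered is therefore larger than the plotted percentage. A small sketch of the two denominators, using made-up answers where `None` marks a skipped question:

```python
def share_of_all(count, answers):
    """Percentage relative to every respondent, including skipped answers."""
    return 100 * count / len(answers)

def share_of_answered(count, answers):
    """Percentage relative to respondents who gave an answer."""
    answered = [a for a in answers if a is not None]
    return 100 * count / len(answered)

# Made-up answers, not survey data.
answers = ["$0-999", "$0-999", "1,000-1,999", None, None, None, None, None]
n = answers.count("$0-999")
print(share_of_all(n, answers))
print(share_of_answered(n, answers))
```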

<div class=h2>Technology</div>

<div class=h3>Programming Languages Used on a Regular Basis</div>

In [None]:
df = data[[i for i in data.columns if 'Q7' in i]]
df_all = pd.Series(dtype='int')
for i in df.columns:
    df_all[df[i].value_counts().index[0]] = df[i].count()
df_all = df_all.sort_values(ascending=True)
color = ['dodgerblue' if (y<max(df_all)) else 'firebrick' for y in df_all]
plt.figure(figsize=(10, 10))
ax = df_all.plot(kind='barh', color=color, alpha=1, width=0.8)
for p in ax.patches:
    width, height = p.get_width(), p.get_height()
    x, y = p.get_xy() 
    ax.annotate("{:.2f}%".format((p.get_width())*100/len(data[[i for i in data.columns if 'Q7' in i]])),
                (x + width/2, y + height/2), ha='left', va='center', fontsize=12)
plt.yticks(fontsize=15)
plt.ylabel('programming languages')

Python is the most commonly used programming language: almost 80% of respondents use it on a regular basis. Python is also the fastest-growing major programming language today. SQL is second, used by almost 38% of respondents, and R is third, used by over 20%. Swift and Julia are the least used programming languages.
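
Multiple-choice questions such as Q7 are stored with one column per option: a cell holds the option label if the respondent selected it and is empty otherwise. The tally in the cell above can be sketched in plain Python as follows. The rows are hypothetical, the column names are modelled on a `Q7_Part_N` pattern, and `tally_multi_select` is an illustrative name.

```python
def tally_multi_select(rows, columns):
    """Count, for each option column, how many rows selected that option."""
    tally = {}
    for col in columns:
        selected = [row[col] for row in rows if row.get(col)]
        if selected:  # skip options that nobody chose
            tally[selected[0].strip()] = len(selected)
    return tally

# Hypothetical rows; None means the option was not selected.
rows = [
    {"Q7_Part_1": "Python", "Q7_Part_2": None, "Q7_Part_3": "SQL"},
    {"Q7_Part_1": "Python", "Q7_Part_2": "R", "Q7_Part_3": None},
    {"Q7_Part_1": "Python", "Q7_Part_2": None, "Q7_Part_3": "SQL"},
]
counts = tally_multi_select(rows, ["Q7_Part_1", "Q7_Part_2", "Q7_Part_3"])
print(counts)
```

Because respondents can select several options, these shares can add up to more than 100%.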

<div class=h3>Programming Languages Recommended for Aspiring Data Scientists</div>

In [None]:
experience = data['Q8'].value_counts()
plt.figure(figsize=(10, 10))
color = ['dodgerblue' if (y < max(experience)) else 'firebrick' for y in experience]
ax= sns.countplot(y="Q8", data=data, order=experience.index, palette=color, saturation=1)
plt.ylabel('programming languages')
plt.yticks(fontsize=15)
for p in ax.patches:
    width, height = p.get_width(), p.get_height()
    x, y = p.get_xy() 
    ax.annotate("{:.2f}%".format((p.get_width())*100/len(data['Q8'])),
                (x + width/2, y + height/2), ha='left', va='center',fontsize=12)

Globally, over 70% of respondents recommend Python for aspiring Data Scientists, while over 6% of respondents recommend R. SQL is recommended by less than 5% of respondents.

<div class=h3>Computing Platforms</div>

In [None]:
experience = data['Q11'].value_counts()
plt.figure(figsize=(15,6))
color = ['dodgerblue' if (x < max(experience)) else 'firebrick' for x in experience]
ax= sns.countplot(x="Q11", data=data, order=experience.index, palette=color, saturation=1)
max_width = 20
ax.set_xticklabels(textwrap.fill(x.get_text(), max_width) for x in ax.get_xticklabels())
plt.xlabel('computing platform')
plt.xticks(rotation=0, fontsize=15)
for p in ax.patches:
    width, height = p.get_width(), p.get_height()
    x, y = p.get_xy() 
    ax.annotate("{:.2f}%".format((p.get_height())*100/len(data['Q11'])),
                (x + width/2, y + height/2), ha='center', va='center',fontsize=12)

Respondents were asked what computing platforms they use for work. Over 60% say they mainly use a personal computer or laptop, about 12% use a cloud computing platform, and over 4% use a deep learning workstation.

<div class=h3>Hardware Accelerators</div>

In [None]:
df = data[[i for i in data.columns if 'Q12' in i]]
df_all = pd.Series(dtype='int')
for i in df.columns:
    df_all[df[i].value_counts().index[0]] = df[i].count()
df_all = df_all.sort_values(ascending=False)
color = ['dodgerblue' if (x<max(df_all)) else 'firebrick' for x in df_all]
plt.figure(figsize=(15, 6))
ax = df_all.plot(kind='bar', color=color, alpha=1, width=0.8)
for p in ax.patches:
    width, height = p.get_width(), p.get_height()
    x, y = p.get_xy() 
    ax.annotate("{:.2f}%".format((p.get_height())*100/len(data[[i for i in data.columns if 'Q12' in i]])),
                (x + width/2, y + height/2), ha='center', va='center', fontsize=12)
plt.xticks(rotation=0, fontsize=15)
plt.xlabel('hardware accelerators')

When asked about using hardware accelerators such as GPUs or TPUs, about 40% of respondents say they do not use any. Those who do most commonly use GPUs, and only about 5% of respondents use TPUs.

<div class=h3>Usage of TPUs</div>

In [None]:
experience = data['Q13'].value_counts()
plt.figure(figsize=(15,6))
color = ['dodgerblue' if (x < max(experience)) else 'firebrick' for x in experience]
ax= sns.countplot(x="Q13", data=data, order=experience.index, palette=color, saturation=1)
plt.xlabel('number of times')
plt.xticks(rotation=0, fontsize=15)
for p in ax.patches:
    width, height = p.get_width(), p.get_height()
    x, y = p.get_xy() 
    ax.annotate("{:.2f}%".format((p.get_height())*100/len(data['Q13'])),
                (x + width/2, y + height/2), ha='center', va='center',fontsize=12)

Over 60% of respondents never use TPUs.

<div class=h3>Machine Learning Algorithms</div>

In [None]:
df = data[[i for i in data.columns if 'Q17' in i]]
df_all = pd.Series(dtype='int')
for i in df.columns:
    df_all[df[i].value_counts().index[0]] = df[i].count()
df_all = df_all.sort_values(ascending=True)
color = ['dodgerblue' if (y<max(df_all)) else 'firebrick' for y in df_all]
plt.figure(figsize=(10, 10))
ax = df_all.plot(kind='barh', color=color, alpha=1, width=0.8)
max_width = 20
ax.set_yticklabels(textwrap.fill(y.get_text(), max_width) for y in ax.get_yticklabels())
for p in ax.patches:
    width, height = p.get_width(), p.get_height()
    x, y = p.get_xy() 
    ax.annotate("{:.2f}%".format((p.get_width())*100/len(data[[i for i in data.columns if 'Q17' in i]])),
                (x + width/2, y + height/2), ha='left', va='center', fontsize=12)
plt.yticks(fontsize=15)
plt.ylabel('algorithms')

When it comes to machine learning algorithms, more than half of respondents use Linear or Logistic Regression, and around 45% use Decision Trees or Random Forests.

<div class=h3>Favorite Integrated Development Environments (IDEs)</div>

In [None]:
df = data[[i for i in data.columns if 'Q9' in i]]
df_all = pd.Series(dtype='int')
for i in df.columns:
    df_all[df[i].value_counts().index[0]] = df[i].count()
df_all = df_all.sort_values(ascending=True)
color = ['dodgerblue' if (y<max(df_all)) else 'firebrick' for y in df_all]
plt.figure(figsize=(10, 10))
ax = df_all.plot(kind='barh', color=color, alpha=1, width=0.8)
max_width = 20
ax.set_yticklabels(textwrap.fill(y.get_text(), max_width) for y in ax.get_yticklabels())
for p in ax.patches:
    width, height = p.get_width(), p.get_height()
    x, y = p.get_xy() 
    ax.annotate("{:.2f}%".format((p.get_width())*100/len(data[[i for i in data.columns if 'Q9' in i]])),
                (x + width/2, y + height/2), ha='left', va='center', fontsize=12)
plt.yticks(fontsize=15)
plt.ylabel('favorite ides')

Jupyter is the most popular integrated development environment, with Visual Studio Code and PyCharm also widely used this year. Vim/Emacs and MATLAB are the least used IDEs.

<div class=h3>Most Hosted Notebooks</div>

In [None]:
df = data[[i for i in data.columns if 'Q10' in i]]
df_all = pd.Series(dtype='int')
for i in df.columns:
    df_all[df[i].value_counts().index[0]] = df[i].count()
df_all = df_all.sort_values(ascending=True)
color = ['dodgerblue' if (y<max(df_all)) else 'firebrick' for y in df_all]
plt.figure(figsize=(10, 15))
ax = df_all.plot(kind='barh', color=color, alpha=1, width=0.8)
max_width = 20
ax.set_yticklabels(textwrap.fill(y.get_text(), max_width) for y in ax.get_yticklabels())
for p in ax.patches:
    width, height = p.get_width(), p.get_height()
    x, y = p.get_xy() 
    ax.annotate("{:.2f}%".format((p.get_width())*100/len(data[[i for i in data.columns if 'Q10' in i]])),
                (x + width/2, y + height/2), ha='left', va='center', fontsize=12)
plt.yticks(fontsize=15)
plt.ylabel('notebooks')

Colab Notebooks are the most commonly used hosted notebooks. However, the difference between the share of respondents who use Colab Notebooks and the share who use Kaggle Notebooks is less than 2%, and a sizable share of respondents use none.

<div class=h3>Visualization Libraries</div>

In [None]:
df = data[[i for i in data.columns if 'Q14' in i]]
df_all = pd.Series(dtype='int')
for i in df.columns:
    df_all[df[i].value_counts().index[0]] = df[i].count()
df_all = df_all.sort_values(ascending=True)
color = ['dodgerblue' if (y<max(df_all)) else 'firebrick' for y in df_all]
plt.figure(figsize=(10, 10))
ax = df_all.plot(kind='barh', color=color, alpha=1, width=0.8)
for p in ax.patches:
    width, height = p.get_width(), p.get_height()
    x, y = p.get_xy() 
    ax.annotate("{:.2f}%".format((p.get_width())*100/len(data[[i for i in data.columns if 'Q14' in i]])),
                (x + width/2, y + height/2), ha='left', va='center', fontsize=12)
plt.yticks(fontsize=15)
plt.ylabel('visualization tools')

Matplotlib is the number one visualization library Kagglers use for work, followed by Seaborn. The third spot is about evenly split between Plotly and ggplot. Altair is the least popular visualization library in the survey.

<div class=h3>Machine Learning Frameworks</div>

In [None]:
df = data[[i for i in data.columns if 'Q16' in i]]
df_all = pd.Series(dtype='int')
for i in df.columns:
    df_all[df[i].value_counts().index[0]] = df[i].count()
df_all = df_all.sort_values(ascending=True)
color = ['dodgerblue' if (y<max(df_all)) else 'firebrick' for y in df_all]
plt.figure(figsize=(10, 10))
ax = df_all.plot(kind='barh', color=color, alpha=1, width=0.8)
for p in ax.patches:
    width, height = p.get_width(), p.get_height()
    x, y = p.get_xy() 
    ax.annotate("{:.2f}%".format((p.get_width())*100/len(data[[i for i in data.columns if 'Q16' in i]])),
                (x + width/2, y + height/2), ha='left', va='center', fontsize=12)
plt.yticks(fontsize=15)
plt.ylabel('ml frameworks')

Scikit-learn is the dominant machine learning framework this year, followed by TensorFlow and Keras.

<div class=h3>Machine Learning Experience</div>

In [None]:
ml_experience = data['Q15'].value_counts()
plt.figure(figsize=(10, 10))
color = ['dodgerblue' if (y < max(ml_experience)) else 'firebrick' for y in ml_experience]
ax= sns.countplot(y="Q15", data=data, order=ml_experience.index, palette=color, saturation=1)
max_width = 20
ax.set_yticklabels(textwrap.fill(y.get_text(), max_width) for y in ax.get_yticklabels())
plt.ylabel('ml experience')
plt.yticks(fontsize=15)
for p in ax.patches:
    width, height = p.get_width(), p.get_height()
    x, y = p.get_xy() 
    ax.annotate("{:.2f}%".format((p.get_width())*100/len(data['Q15'])),
                (x + width/2, y + height/2), ha='left', va='center',fontsize=12)

About half of respondents have less than two years of machine learning experience.

<div class=h3>Computer Vision Methods</div>

In [None]:
df = data[[i for i in data.columns if 'Q18' in i]]
df_all = pd.Series(dtype='int')
for i in df.columns:
    df_all[df[i].value_counts().index[0]] = df[i].count()
df_all = df_all.sort_values(ascending=True)
color = ['dodgerblue' if (y<max(df_all)) else 'firebrick' for y in df_all]
plt.figure(figsize=(10, 15))
ax = df_all.plot(kind='barh', color=color, alpha=1, width=0.8)
max_width = 20
ax.set_yticklabels(textwrap.fill(y.get_text(), max_width) for y in ax.get_yticklabels())
for p in ax.patches:
    width, height = p.get_width(), p.get_height()
    x, y = p.get_xy() 
    ax.annotate("{:.2f}%".format((p.get_width())*100/len(data[[i for i in data.columns if 'Q18' in i]])),
                (x + width/2, y + height/2), ha='left', va='center', fontsize=12)
plt.yticks(fontsize=15)
plt.ylabel('cv methods')

Respondents were asked what computer vision methods they use the most, and Image classification and General purpose image/video tools were the most common answers.

<div class=h3>Natural Language Processing (NLP) Methods</div>

In [None]:
df = data[[i for i in data.columns if 'Q19' in i]]
df_all = pd.Series(dtype='int')
for i in df.columns:
    df_all[df[i].value_counts().index[0]] = df[i].count()
df_all = df_all.sort_values(ascending=True)
color = ['dodgerblue' if (y<max(df_all)) else 'firebrick' for y in df_all]
plt.figure(figsize=(10, 10))
ax = df_all.plot(kind='barh', color=color, alpha=1, width=0.8)
max_width = 20
ax.set_yticklabels(textwrap.fill(y.get_text(), max_width) for y in ax.get_yticklabels())
for p in ax.patches:
    width, height = p.get_width(), p.get_height()
    x, y = p.get_xy() 
    ax.annotate("{:.2f}%".format((p.get_width())*100/len(data[[i for i in data.columns if 'Q19' in i]])),
                (x + width/2, y + height/2), ha='left', va='center', fontsize=12)
plt.yticks(fontsize=15)
plt.ylabel('nlp methods')

Word embeddings/vectors are the most popular NLP methods.

<div class=h3>Cloud Computing Platforms</div>

In [None]:
df = data[[i for i in data.columns if 'Q26' in i]]
df_all = pd.Series(dtype='int')
for i in df.columns:
    df_all[df[i].value_counts().index[0]] = df[i].count()
df_all = df_all.sort_values(ascending=True)
color = ['dodgerblue' if (y<max(df_all)) else 'firebrick' for y in df_all]
plt.figure(figsize=(10, 10))
ax = df_all.plot(kind='barh', color=color, alpha=1, width=0.8)
max_width = 20
ax.set_yticklabels(textwrap.fill(y.get_text(), max_width) for y in ax.get_yticklabels())
for p in ax.patches:
    width, height = p.get_width(), p.get_height()
    x, y = p.get_xy() 
    ax.annotate("{:.2f}%".format((p.get_width())*100/len(data[[i for i in data.columns if 'Q26' in i]])),
                (x + width/2, y + height/2), ha='left', va='center', fontsize=12)
plt.yticks(fontsize=15)
plt.ylabel('cloud computing platforms')

The most common cloud computing platforms for respondents are Amazon Web Services, Google Cloud Platform and Microsoft Azure.
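
One caveat about the column filter: if the response file stores this question in two variants, a substring test such as `'Q26' in i` selects both, and columns that share an option label overwrite one another in the tally. The sketch below separates the variants. It assumes column names of the form `Q26_A_Part_1` and `Q26_B_Part_1`, which should be checked against `data.columns`, and `split_question_columns` is an illustrative name.

```python
def split_question_columns(columns, question):
    """Separate a question's columns into its _A and _B variants."""
    part_a = [c for c in columns if c.startswith(f"{question}_A_")]
    part_b = [c for c in columns if c.startswith(f"{question}_B_")]
    return part_a, part_b

# Assumed column names, for illustration only.
columns = [
    "Q25",
    "Q26_A_Part_1", "Q26_A_Part_2",
    "Q26_B_Part_1", "Q26_B_Part_2",
    "Q27_A_Part_1",
]
part_a, part_b = split_question_columns(columns, "Q26")
print(part_a)
print(part_b)
```

Tallying each variant on its own keeps the two sets of answers from being mixed in one chart.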

<div class=h3>Cloud Computing Products</div>

In [None]:
df = data[[i for i in data.columns if 'Q27' in i]]
df_all = pd.Series(dtype='int')
for i in df.columns:
    df_all[df[i].value_counts().index[0]] = df[i].count()
df_all = df_all.sort_values(ascending=True)
color = ['dodgerblue' if (y<max(df_all)) else 'firebrick' for y in df_all]
plt.figure(figsize=(10, 10))
ax = df_all.plot(kind='barh', color=color, alpha=1, width=0.8)
max_width = 20
ax.set_yticklabels(textwrap.fill(y.get_text(), max_width) for y in ax.get_yticklabels())
for p in ax.patches:
    width, height = p.get_width(), p.get_height()
    x, y = p.get_xy() 
    ax.annotate("{:.2f}%".format((p.get_width())*100/len(data[[i for i in data.columns if 'Q27' in i]])),
                (x + width/2, y + height/2), ha='left', va='center', fontsize=12)
plt.yticks(fontsize=15)
plt.ylabel('cloud computing products')

Google Cloud Compute Engine is the most used cloud computing product.

<div class=h3>Machine Learning Products</div>

In [None]:
df = data[[i for i in data.columns if 'Q28' in i]]
df_all = pd.Series(dtype='int')
for i in df.columns:
    df_all[df[i].value_counts().index[0]] = df[i].count()
df_all = df_all.sort_values(ascending=True)
color = ['dodgerblue' if (y<max(df_all)) else 'firebrick' for y in df_all]
plt.figure(figsize=(10, 10))
ax = df_all.plot(kind='barh', color=color, alpha=1, width=0.8)
max_width = 20
ax.set_yticklabels(textwrap.fill(y.get_text(), max_width) for y in ax.get_yticklabels())
for p in ax.patches:
    width, height = p.get_width(), p.get_height()
    x, y = p.get_xy() 
    ax.annotate("{:.2f}%".format((p.get_width())*100/len(data[[i for i in data.columns if 'Q28' in i]])),
                (x + width/2, y + height/2), ha='left', va='center', fontsize=12)
plt.yticks(fontsize=15)
plt.ylabel('ml products')

Google Cloud AI Platform/Google Cloud ML Engine is the most broadly used of the machine learning products.

<div class=h3>Big Data Products Used on a Regular Basis</div>

In [None]:
df = data[[i for i in data.columns if 'Q29' in i]]
df_all = pd.Series(dtype='int')
for i in df.columns:
    df_all[df[i].value_counts().index[0]] = df[i].count()
df_all = df_all.sort_values(ascending=True)
color = ['dodgerblue' if (y<max(df_all)) else 'firebrick' for y in df_all]
plt.figure(figsize=(10, 15))
ax = df_all.plot(kind='barh', color=color, alpha=1, width=0.8)
max_width = 20
ax.set_yticklabels(textwrap.fill(y.get_text(), max_width) for y in ax.get_yticklabels())
for p in ax.patches:
    width, height = p.get_width(), p.get_height()
    x, y = p.get_xy() 
    ax.annotate("{:.2f}%".format((p.get_width())*100/len(data[[i for i in data.columns if 'Q29' in i]])),
                (x + width/2, y + height/2), ha='left', va='center', fontsize=12)
plt.yticks(fontsize=15)
plt.ylabel('big data products')

MySQL is the most popular database used on a regular basis. MongoDB has taken the second spot.

<div class=h3>Big Data Products Used Most Often</div>

In [None]:
experience = data['Q30'].value_counts()
plt.figure(figsize=(10, 15))
color = ['dodgerblue' if (y < max(experience)) else 'firebrick' for y in experience]
ax= sns.countplot(y="Q30", data=data, order=experience.index, palette=color, saturation=1)
max_width = 20
ax.set_yticklabels(textwrap.fill(y.get_text(), max_width) for y in ax.get_yticklabels())
plt.ylabel('big data products')
plt.yticks(fontsize=15)
for p in ax.patches:
    width, height = p.get_width(), p.get_height()
    x, y = p.get_xy() 
    ax.annotate("{:.2f}%".format((p.get_width())*100/len(data['Q30'])),
                (x + width/2, y + height/2), ha='left', va='center',fontsize=12)

MySQL is the most commonly used database. PostgreSQL has taken the second spot, edging ahead of Microsoft SQL Server.

<div class=h3>Business Intelligence Tools Used on a Regular Basis</div>

In [None]:
df = data[[i for i in data.columns if 'Q31' in i]]
df_all = pd.Series(dtype='int')
for i in df.columns:
    df_all[df[i].value_counts().index[0]] = df[i].count()
df_all = df_all.sort_values(ascending=True)
color = ['dodgerblue' if (y<max(df_all)) else 'firebrick' for y in df_all]
plt.figure(figsize=(10, 10))
ax = df_all.plot(kind='barh', color=color, alpha=1, width=0.8)
max_width = 20
ax.set_yticklabels(textwrap.fill(y.get_text(), max_width) for y in ax.get_yticklabels())
for p in ax.patches:
    width, height = p.get_width(), p.get_height()
    x, y = p.get_xy() 
    ax.annotate("{:.2f}%".format((p.get_width())*100/len(data[[i for i in data.columns if 'Q31' in i]])),
                (x + width/2, y + height/2), ha='left', va='center', fontsize=12)
plt.yticks(fontsize=15)
plt.ylabel('business intelligence tools')

About 19% of respondents use Tableau on a regular basis, and about 16% use Microsoft Power BI.

<div class=h3>Business Intelligence Tools Used Most Often</div>

In [None]:
experience = data['Q32'].value_counts()
plt.figure(figsize=(10, 10))
color = ['dodgerblue' if (y < max(experience)) else 'firebrick' for y in experience]
ax= sns.countplot(y="Q32", data=data, order=experience.index, palette=color, saturation=1)
max_width = 20
ax.set_yticklabels(textwrap.fill(y.get_text(), max_width) for y in ax.get_yticklabels())
plt.ylabel('business intelligence tools')
plt.yticks(fontsize=15)
for p in ax.patches:
    width, height = p.get_width(), p.get_height()
    x, y = p.get_xy() 
    ax.annotate("{:.2f}%".format((p.get_width())*100/len(data['Q32'])),
                (x + width/2, y + height/2), ha='left', va='center',fontsize=12)

Tableau and Microsoft Power BI are also the business intelligence tools used most often.

<div class=h3>Automated Machine Learning Tools</div>

In [None]:
df = data[[i for i in data.columns if 'Q33' in i]]
df_all = pd.Series(dtype='int')
for i in df.columns:
    df_all[df[i].value_counts().index[0]] = df[i].count()
df_all = df_all.sort_values(ascending=True)
color = ['dodgerblue' if (y<max(df_all)) else 'firebrick' for y in df_all]
plt.figure(figsize=(10, 15))
ax = df_all.plot(kind='barh', color=color, alpha=1, width=0.8)
max_width = 20
ax.set_yticklabels(textwrap.fill(y.get_text(), max_width) for y in ax.get_yticklabels())
for p in ax.patches:
    width, height = p.get_width(), p.get_height()
    x, y = p.get_xy() 
    ax.annotate("{:.2f}%".format((p.get_width())*100/len(data[[i for i in data.columns if 'Q33' in i]])),
                (x + width/2, y + height/2), ha='left', va='center', fontsize=12)
plt.yticks(fontsize=15)
plt.ylabel('automated ml tools')

Most respondents say they do not use any automated machine learning tools. Among those who do, the most commonly used categories are automated model selection and automation of full ML pipelines.

<div class=h3>Automated Machine Learning Tools Used on a Regular Basis</div>

In [None]:
df = data[[i for i in data.columns if 'Q34' in i]]
df_all = pd.Series(dtype='int')
for i in df.columns:
    df_all[df[i].value_counts().index[0]] = df[i].count()
df_all = df_all.sort_values(ascending=True)
color = ['dodgerblue' if (y<max(df_all)) else 'firebrick' for y in df_all]
plt.figure(figsize=(10, 10))
ax = df_all.plot(kind='barh', color=color, alpha=1, width=0.8)
max_width = 20
ax.set_yticklabels(textwrap.fill(y.get_text(), max_width) for y in ax.get_yticklabels())
for p in ax.patches:
    width, height = p.get_width(), p.get_height()
    x, y = p.get_xy() 
    ax.annotate("{:.2f}%".format((p.get_width())*100/len(data[[i for i in data.columns if 'Q34' in i]])),
                (x + width/2, y + height/2), ha='left', va='center', fontsize=12)
plt.yticks(fontsize=15)
plt.ylabel('automated ml tools')

Auto-Sklearn, Google Cloud AutoML and Auto-Keras are the most commonly used automated ML tools.

<div class=h3>Tools to Help Machine Learning Experiments</div>

In [None]:
df = data[[i for i in data.columns if 'Q35' in i]]
df_all = pd.Series(dtype='int')
for i in df.columns:
    df_all[df[i].value_counts().index[0]] = df[i].count()
df_all = df_all.sort_values(ascending=True)
color = ['dodgerblue' if (y<max(df_all)) else 'firebrick' for y in df_all]
plt.figure(figsize=(10, 10))
ax = df_all.plot(kind='barh', color=color, alpha=1, width=0.8)
for p in ax.patches:
    width, height = p.get_width(), p.get_height()
    x, y = p.get_xy() 
    ax.annotate("{:.2f}%".format((p.get_width())*100/len(data[[i for i in data.columns if 'Q35' in i]])),
                (x + width/2, y + height/2), ha='left', va='center', fontsize=12)
plt.yticks(fontsize=15)
plt.ylabel('tools for ml experiments')

About 40% of respondents do not use any tools to help manage ML experiments.

<div class=h3>Platforms to Share Applications</div>

In [None]:
df = data[[i for i in data.columns if 'Q36' in i]]
df_all = pd.Series(dtype='int')
for i in df.columns:
    df_all[df[i].value_counts().index[0]] = df[i].count()
df_all = df_all.sort_values(ascending=True)
color = ['dodgerblue' if (y<max(df_all)) else 'firebrick' for y in df_all]
plt.figure(figsize=(10, 10))
ax = df_all.plot(kind='barh', color=color, alpha=1, width=0.8)
max_width = 20
ax.set_yticklabels(textwrap.fill(y.get_text(), max_width) for y in ax.get_yticklabels())
for p in ax.patches:
    width, height = p.get_width(), p.get_height()
    x, y = p.get_xy() 
    ax.annotate("{:.2f}%".format((p.get_width())*100/len(data[[i for i in data.columns if 'Q36' in i]])),
                (x + width/2, y + height/2), ha='left', va='center', fontsize=12)
plt.yticks(fontsize=15)
plt.ylabel('platforms to share apps')

Respondents were asked which platforms they use to share or deploy their data analysis or machine learning applications, and GitHub was the most common answer.

<div class=h3>Platforms for Data Science Courses</div>

In [None]:
df = data[[i for i in data.columns if 'Q37' in i]]
df_all = pd.Series(dtype='int')
for i in df.columns:
    df_all[df[i].value_counts().index[0]] = df[i].count()
df_all = df_all.sort_values(ascending=True)
color = ['dodgerblue' if (y<max(df_all)) else 'firebrick' for y in df_all]
plt.figure(figsize=(10, 15))
ax = df_all.plot(kind='barh', color=color, alpha=1, width=0.8)
max_width = 20
ax.set_yticklabels(textwrap.fill(y.get_text(), max_width) for y in ax.get_yticklabels())
for p in ax.patches:
    width, height = p.get_width(), p.get_height()
    x, y = p.get_xy() 
    ax.annotate("{:.2f}%".format((p.get_width())*100/len(data[[i for i in data.columns if 'Q37' in i]])),
                (x + width/2, y + height/2), ha='left', va='center', fontsize=12)
plt.yticks(fontsize=15)
plt.ylabel('platforms for courses')

There are many online platforms people use to learn Data Science. Respondents' most common choices this year were Coursera, Kaggle Learn Courses, and Udemy.

<div class=h3>Primary Tools at Work</div>

In [None]:
experience = data['Q38'].value_counts()
plt.figure(figsize=(10, 10))
color = ['dodgerblue' if (y < max(experience)) else 'firebrick' for y in experience]
ax= sns.countplot(y="Q38", data=data, order=experience.index, palette=color, saturation=1)
max_width = 20
ax.set_yticklabels(textwrap.fill(y.get_text(), max_width) for y in ax.get_yticklabels())
plt.ylabel('primary tools at work')
plt.yticks(fontsize=15)
for p in ax.patches:
    width, height = p.get_width(), p.get_height()
    x, y = p.get_xy() 
    ax.annotate("{:.2f}%".format((p.get_width())*100/len(data['Q38'])),
                (x + width/2, y + height/2), ha='left', va='center',fontsize=12)

Over half of respondents use either a local development environment or basic statistical software as their primary tool at work.
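
Whether a combined share clears 50% depends on the denominator. The plot above divides by every row, including blank answers. The sketch below, on made-up answers, compares that with dividing by answered rows only. The toy series and all variable names are assumptions for illustration.

```python
import pandas as pd

# Hypothetical answers to a single-choice question; None marks a blank answer.
answers = pd.Series([
    "Local development environments",
    "Basic statistical software",
    "Local development environments",
    None,
    "Other",
    "Basic statistical software",
    None,
    "Local development environments",
], name="Q38")

counts = answers.value_counts()  # blanks are dropped by default, sorted descending

# Share of all rows (the denominator used when annotating the bars).
share_of_all = counts * 100 / len(answers)

# Share of rows that actually contain an answer.
share_of_answered = counts * 100 / answers.count()

# Combined share of the two most common answers under each denominator.
top_two_all = share_of_all.iloc[:2].sum()
top_two_answered = share_of_answered.iloc[:2].sum()
print(round(top_two_all, 2), round(top_two_answered, 2))
```

When many respondents leave a question blank, the two figures can differ a lot, so it helps to state which denominator a quoted percentage uses.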

<div class=h3>Favorite Data Science Media Sources</div>

In [None]:
df = data[[i for i in data.columns if 'Q39' in i]]
df_all = pd.Series(dtype='int')
for i in df.columns:
    df_all[df[i].value_counts().index[0]] = df[i].count()
df_all = df_all.sort_values(ascending=True)
color = ['dodgerblue' if (y<max(df_all)) else 'firebrick' for y in df_all]
plt.figure(figsize=(10, 15))
ax = df_all.plot(kind='barh', color=color, alpha=1, width=0.8)
max_width = 20
ax.set_yticklabels(textwrap.fill(y.get_text(), max_width) for y in ax.get_yticklabels())
for p in ax.patches:
    width, height = p.get_width(), p.get_height()
    x, y = p.get_xy() 
    ax.annotate("{:.2f}%".format((p.get_width())*100/len(data[[i for i in data.columns if 'Q39' in i]])),
                (x + width/2, y + height/2), ha='left', va='center', fontsize=12)
plt.yticks(fontsize=15)
plt.ylabel('favorite media sources')

Kaggle is the most loved Data Science media source among respondents, followed closely by YouTube and blogs. Podcasts and Slack communities are the least loved Data Science media sources.

<div class=h2>Company Information</div>

<div class=h3>Company Size</div>

In [None]:
experience = data['Q20'].value_counts()
plt.figure(figsize=(15, 6))
color = ['dodgerblue' if (x < max(experience)) else 'firebrick' for x in experience]
ax= sns.countplot(x="Q20", data=data, order=experience.index, palette=color, saturation=1)
max_width = 20
ax.set_xticklabels(textwrap.fill(x.get_text(), max_width) for x in ax.get_xticklabels())
plt.xlabel('company size')
plt.xticks(rotation=0, fontsize=15)
for p in ax.patches:
    width, height = p.get_width(), p.get_height()
    x, y = p.get_xy() 
    ax.annotate("{:.2f}%".format((p.get_height())*100/len(data['Q20'])),
                (x + width/2, y + height/2), ha='center', va='center',fontsize=12)

Kagglers work in companies of all sizes, from small to large enterprise organizations. About 30% work at small companies and more than 20% work at large companies.

<div class=h3>Activities at Work</div>

In [None]:
df = data[[i for i in data.columns if 'Q23' in i]]
df_all = pd.Series(dtype='int')
for i in df.columns:
    df_all[df[i].value_counts().index[0]] = df[i].count()
df_all = df_all.sort_values(ascending=True)
color = ['dodgerblue' if (y<max(df_all)) else 'firebrick' for y in df_all]
plt.figure(figsize=(10, 15))
ax = df_all.plot(kind='barh', color=color, alpha=1, width=0.8)
max_width = 20
ax.set_yticklabels(textwrap.fill(y.get_text(), max_width) for y in ax.get_yticklabels())
for p in ax.patches:
    width, height = p.get_width(), p.get_height()
    x,y = p.get_xy() 
    ax.annotate("{:.2f}%".format((p.get_width())*100/len(data[[i for i in data.columns if 'Q23' in i]])),
                (x + width/2, y + height/2), ha='left', va='center', fontsize=12)
plt.yticks(fontsize=15)
plt.ylabel('activities')

The activities that make up respondents' roles vary from one organization to another. Over 30% of respondents analyze and understand data to influence product or business decisions.

<div class=h3>People Responsible for Data Science at Work</div>

In [None]:
experience = data['Q21'].value_counts()
plt.figure(figsize=(15,6))
color = ['dodgerblue' if (x < max(experience)) else 'firebrick' for x in experience]
ax= sns.countplot(x="Q21", data=data, order=experience.index, palette=color, saturation=1)
plt.xlabel('people responsible for ds')
plt.xticks(rotation=0, fontsize=15)
for p in ax.patches:
    width, height = p.get_width(), p.get_height()
    x, y = p.get_xy() 
    ax.annotate("{:.2f}%".format((p.get_height())*100/len(data['Q21'])),
                (x + width/2, y + height/2), ha='center', va='center',fontsize=12)

Over 40% of respondents say that fewer than 10 people are responsible for Data Science at their organization.

<div class=h3>Machine Learning Incorporation at the Company</div>

In [None]:
experience = data['Q22'].value_counts()
plt.figure(figsize=(10, 10))
color = ['dodgerblue' if (y < max(experience)) else 'firebrick' for y in experience]
ax= sns.countplot(y="Q22", data=data, order=experience.index, palette=color, saturation=1)
max_width = 20
ax.set_yticklabels(textwrap.fill(y.get_text(), max_width) for y in ax.get_yticklabels())
plt.ylabel('ml incorporation')
plt.yticks(fontsize=15)
for p in ax.patches:
    width, height = p.get_width(), p.get_height()
    x, y = p.get_xy() 
    ax.annotate("{:.2f}%".format((p.get_width())*100/len(data['Q22'])),
                (x + width/2, y + height/2), ha='left', va='center',fontsize=12)

Less than 20% of respondents say that their companies are using ML methods (including putting models into production), and over 20% say their companies are not using ML methods.

<div class=h3>Money Spent on Machine Learning</div>

In [None]:
experience = data['Q25'].value_counts()
plt.figure(figsize=(15,6))
color = ['dodgerblue' if (x < max(experience)) else 'firebrick' for x in experience]
ax= sns.countplot(x="Q25", data=data, order=experience.index, palette=color, saturation=1)
max_width = 20
ax.set_xticklabels(textwrap.fill(x.get_text(), max_width) for x in ax.get_xticklabels())
plt.xlabel('money spent on ml')
plt.xticks(rotation=0, fontsize=15)
for p in ax.patches:
    width, height = p.get_width(), p.get_height()
    x, y = p.get_xy() 
    ax.annotate("{:.2f}%".format((p.get_height())*100/len(data['Q25'])),
                (x + width/2, y + height/2), ha='center', va='center',fontsize=12)

Most respondents on the survey say their organizations are not spending any money on ML.

# Thanks for reading! ☺