# Introduction

This notebook analyzes the Kaggle surveys from 2018 to 2021. The work consists of two parts: a socio-economic analysis of the respondents and an analysis of trends in data science tools. The first part shows how age, education level, salary level and other indicators reported by respondents have changed over time. Because the 2018 survey differed from the 2019-2021 surveys, some plots show dynamics that do not include 2018. The second part is simpler: it shows which tools were in demand a few years ago and which are becoming popular now.

**25973** participants from **66** countries took part in the 2021 survey. The survey had **38** questions, and the median completion time was **10.9** minutes.

All conclusions are based solely on the available data and should not be generalized, because the respondents span many social groups and the sample is small.

<h1 style="background-color:#a0c6eb;text-align:center;color:white;font-size:400%;font-family:monospace;border-radius:10px">PART I</h1>

![](https://www.guidobarbacci.com/wp-content/uploads/home1-2.png)

In [None]:
import warnings
import numpy as np 
import pandas as pd
import plotly as py
import seaborn as sns
import statistics as stat
from datetime import date
import plotly.express as px
import plotly.graph_objs as go
import matplotlib.image as image
warnings.filterwarnings("ignore")
pd.set_option('display.max_columns', None)
from matplotlib.ticker import FixedLocator, FixedFormatter
from matplotlib.offsetbox import TextArea, DrawingArea, OffsetImage, AnnotationBbox
from plotly.offline import init_notebook_mode
init_notebook_mode(connected = True)
import matplotlib.lines as lines
import matplotlib.pyplot as plt
%matplotlib inline

sur2018 = pd.read_csv('../input/kaggle-survey-2018/multipleChoiceResponses.csv')
sur2019 = pd.read_csv('../input/kaggle-survey-2019/multiple_choice_responses.csv')
sur2020 = pd.read_csv('../input/kaggle-survey-2020/kaggle_survey_2020_responses.csv')
sur2021 = pd.read_csv('../input/kaggle-survey-2021/kaggle_survey_2021_responses.csv')

# The first row of each file holds the question text, not a response, so drop it.
sur2021 = sur2021[1:]
sur2020 = sur2020[1:]
sur2019 = sur2019[1:]
sur2018 = sur2018[1:]

c_codes = pd.read_csv('../input/iso-country-codes-global/wikipedia-iso-country-codes.csv')
c_codes = c_codes[['English short name lower case', 'Alpha-3 code']]

cnt = pd.read_csv('../input/gapminder-datacamp-2007/gapminder - gapminder.csv')
cnt = cnt[['country', 'continent']]
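
The response files keep the full question text in the row directly under the header, which is why the first row of every frame is dropped above. As a hedged alternative, the sketch below reads a small in-memory stand-in (the `raw` string is made up for illustration and is not real survey data) and passes `skiprows=[1]` to `pd.read_csv`, so the question row never enters the frame. A second read with `nrows=1` keeps the question text available for lookup.

```python
import io
import pandas as pd

# Made-up stand-in for a survey file: header, question-text row, then answers.
raw = (
    "Q1,Q2\n"
    "What is your age (# years)?,What is your gender?\n"
    "25-29,Man\n"
    "30-34,Woman\n"
)

# Keep the question text as a Series indexed by column name.
questions = pd.read_csv(io.StringIO(raw), nrows=1).iloc[0]

# skiprows=[1] skips the second line of the file (the question text).
responses = pd.read_csv(io.StringIO(raw), skiprows=[1])

print(questions["Q1"])
print(responses)
```

Read this way, the index of `responses` starts at 0, whereas slicing with `[1:]` leaves it starting at 1.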

# Age

In [None]:
fig = plt.figure(figsize = (15, 5))

age = sur2021['Q1'].value_counts().reset_index()
age['age'] = age['index'].apply(lambda x: x[:2])
age = age.sort_values('age')

plt.title('Results 2021', size = 25, y = 1.1, fontname = 'monospace', color = '#313233')
clrs = ['#65b2e5' if x > 4902 else '#AACDE5' for x in age['Q1']]
a = sns.barplot(data = age, x = 'Q1', y = 'index', palette = clrs, linestyle = "-", linewidth = 1, edgecolor = "black")
plt.xticks([])
plt.yticks(fontname = 'monospace', size = 14, color = '#313233')
plt.xlabel('')
plt.ylabel('')
for j in ['right', 'top', 'left', 'bottom']:
    a.spines[j].set_visible(False)
    
for p in a.patches:
    width = p.get_width()
    plt.text(150 + width, p.get_y() + 0.55*p.get_height(), f'{round((width / age["Q1"].sum()) * 100, 1)}%',
             ha = 'center', va = 'center', fontname = 'monospace', fontsize = 14, color = '#313233')

plt.show()

In [None]:
age = {'18-21': 'Young',
      '22-24': 'Young',
      '25-29': 'Young',
      '30-34': 'Middle aged',
      '35-39': 'Middle aged',
      '40-44': 'Middle aged',
      '45-49': 'Middle aged',
      '50-54': 'Old',
      '55-59': 'Old',
      '60-69': 'Old',
      '70+': 'Old'}
for i in [sur2019, sur2020, sur2021]:
    i['Q1'] = i['Q1'].map(age)
    
sur2018['Q2'] = sur2018['Q2'].map(age)

age2021 = sur2021['Q1'].value_counts().reset_index()
age2020 = sur2020['Q1'].value_counts().reset_index()
age2019 = sur2019['Q1'].value_counts().reset_index()
age2018 = sur2018['Q2'].value_counts().reset_index().rename(columns={'Q2': 'Q1'})

for i in [age2021, age2020, age2019, age2018]:
    i['sum'] = i['Q1'].sum()
    i['percentage'] = round(i['Q1']/i['sum']*100, 1)
    i.drop(['Q1', 'sum'], axis = 1, inplace = True)     

age2021['year'] = 2021
age2020['year'] = 2020
age2019['year'] = 2019
age2018['year'] = 2018

age_dynamics = pd.concat([age2018, age2019, age2020, age2021])
age_dynamics = age_dynamics.pivot(index = 'year', columns = 'index', values = 'percentage').reset_index()    

fig = plt.figure(figsize = (15, 6))
plt.title('Dynamics', size = 25, y = 1.14, fontname = 'monospace', color = '#313233')
plt.grid(color = 'gray', linestyle = '-', axis = 'y', alpha = 0.1)
a = sns.lineplot(data = age_dynamics, x = 'year', y = 'Young', color = '#75b1a9', linewidth = 2)
sns.scatterplot(x = 2021, y = np.array([55.9]), color = '#75b1a9', s = 90, edgecolor = 'black', zorder = 5)
a = sns.lineplot(data = age_dynamics, x = 'year', y = 'Middle aged', color = '#d9b44a', linewidth = 2)
sns.scatterplot(x = 2021, y = np.array([35.5]), color = '#d9b44a', s = 90, edgecolor = 'black', zorder = 5)
a = sns.lineplot(data = age_dynamics, x = 'year', y = 'Old', color = '#AACDE5', linewidth = 2)
sns.scatterplot(x = 2021, y = np.array([8.6]), color = '#AACDE5', s = 90, edgecolor = 'black', zorder = 5)
plt.xticks(np.array([2018, 2019, 2020, 2021]), fontname = 'monospace', size = 16, color = '#313233')
a.yaxis.set_major_formatter(FixedFormatter(['10%', '20%', '30%', '40%', '50%', '60%']))
a.yaxis.set_major_locator(FixedLocator([10, 20, 30, 40, 50, 60]))
plt.yticks(fontname = 'monospace', size = 15, color = '#313233')
plt.ylabel('')
plt.xlabel('')

for j in ['right', 'top', 'left']:
    a.spines[j].set_visible(False)
for j in ['bottom']:
    a.spines[j].set_linewidth(2)
    
plt.figtext(0.925, 0.8, 'Young', fontname = 'monospace', size = 17, color = '#75b1a9') 
plt.figtext(0.915, 0.76, '<30 years', fontname = 'monospace', size = 14, color = '#75b1a9') 
plt.figtext(0.9, 0.55, 'Middle aged', fontname = 'monospace', size = 17, color = '#d9b44a') 
plt.figtext(0.91, 0.51, '30-49 years', fontname = 'monospace', size = 14, color = '#d9b44a')
plt.figtext(0.94, 0.22, 'Old', fontname = 'monospace', size = 17, color = '#AACDE5') 
plt.figtext(0.92, 0.18, '50< years', fontname = 'monospace', size = 14, color = '#AACDE5') 


plt.figtext(0.85, 0.82, '55.9', fontname = 'monospace', size = 14, color = '#75b1a9')

plt.figtext(0.85, 0.57, '35.5', fontname = 'monospace', size = 14, color = '#d9b44a')

plt.figtext(0.855, 0.23, '8.6', fontname = 'monospace', size = 14, color = '#AACDE5') 

plt.figtext(0.125, -0.14, '''More than half of the survey participants are young (under 30 years old).
However, compared to 2018, their share decreased by 4.4 percentage points, while the share
of participants in the oldest group (50 and over) increased by 3.6 percentage points.''', fontname = 'monospace', size = 14, color = '#313233')

l1 = lines.Line2D([0.125, 0.9], [0.02, 0.02], transform = fig.transFigure, figure = fig, color = '#ededed', linestyle = '-',linewidth = 4)
a.lines.extend([l1])

plt.show()
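
The same steps (map the raw answers to groups, turn counts into percentages, pivot by year) recur in the sections that follow. A minimal sketch of how they could be folded into one function is shown here. `share_by_year` is a hypothetical helper, not part of the notebook, and the toy frames are made up for illustration. It relies on `value_counts(normalize=True)`, which computes fractions over the non-missing answers.

```python
import pandas as pd

def share_by_year(frames, column_by_year, mapping):
    """Return a year x group table of percentages (hypothetical helper)."""
    rows = {}
    for year, frame in frames.items():
        grouped = frame[column_by_year[year]].map(mapping)
        # normalize=True yields fractions over the non-missing answers
        rows[year] = (grouped.value_counts(normalize=True) * 100).round(1)
    return pd.DataFrame(rows).T.rename_axis("year")

# Toy data, made up for illustration only
toy_groups = {"18-21": "Young", "25-29": "Young", "30-34": "Middle aged",
              "35-39": "Middle aged", "50-54": "Old", "70+": "Old"}
toy_frames = {
    2018: pd.DataFrame({"Q2": ["18-21", "25-29", "30-34", "50-54"]}),
    2021: pd.DataFrame({"Q1": ["18-21", "30-34", "35-39", "70+"]}),
}
toy_dynamics = share_by_year(toy_frames, {2018: "Q2", 2021: "Q1"}, toy_groups)
print(toy_dynamics)
```

The `column_by_year` argument covers the case where the same question has a different column name in different years, as with age (`Q2` in 2018, `Q1` afterwards).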

# Gender

In [None]:
fig = plt.figure(figsize = (15, 2.5))
clrs = ['#65b2e5' if x > 10000 else '#AACDE5' for x in sur2021['Q2'].value_counts().reset_index()['Q2']]
plt.title('Results 2021', size = 25, y = 1.1, fontname = 'monospace', color = '#313233')
a = sns.barplot(data = sur2021['Q2'].value_counts().reset_index(), x = 'Q2', y = 'index', palette = clrs, linestyle = "-", linewidth = 1, edgecolor = "black")
plt.xticks([])
plt.yticks(fontname = 'monospace', size = 14, color = '#313233')
plt.xlabel('')
plt.ylabel('')
for j in ['right', 'top', 'bottom', 'left']:
    a.spines[j].set_visible(False)
    
for p in a.patches:
    width = p.get_width()
    plt.text(800 + width, p.get_y() + 0.55*p.get_height(), f'{round((width / sur2021["Q2"].value_counts().reset_index()["Q2"].sum()) * 100, 1)}%',
             ha = 'center', va = 'center', fontname = 'monospace', fontsize = 14, color = '#313233')

plt.show()

In [None]:
gen2020_2021 = {'Man': 'Man',
               'Woman': 'Woman',
               'Nonbinary': 'Others',
               'Prefer not to say': 'Others',
               'Prefer to self-describe': 'Others'}
for i in [sur2020, sur2021]:
    i['Q2'] = i['Q2'].map(gen2020_2021)
    
gen2018_2019 = {'Male': 'Man',
               'Female': 'Woman',
               'Prefer not to say': 'Others',
               'Prefer to self-describe': 'Others'}
sur2018['Q1'] = sur2018['Q1'].map(gen2018_2019)
sur2019['Q2'] = sur2019['Q2'].map(gen2018_2019)

gen2021 = sur2021['Q2'].value_counts().reset_index()
gen2020 = sur2020['Q2'].value_counts().reset_index()
gen2019 = sur2019['Q2'].value_counts().reset_index()
gen2018 = sur2018['Q1'].value_counts().reset_index().rename(columns={'Q1': 'Q2'})

for i in [gen2021, gen2020, gen2019, gen2018]:
    i['sum'] = i['Q2'].sum()
    i['percentage'] = round(i['Q2']/i['sum']*100, 1)
    i.drop(['Q2', 'sum'], axis = 1, inplace = True) 

gen2021['year'] = 2021
gen2020['year'] = 2020
gen2019['year'] = 2019
gen2018['year'] = 2018

gen_dynamics = pd.concat([gen2018, gen2019, gen2020, gen2021])
gen_dynamics = gen_dynamics.pivot(index = 'year', columns = 'index', values = 'percentage').reset_index()  

fig = plt.figure(figsize = (15, 6))
plt.title('Dynamics', size = 25, y = 1.14, fontname = 'monospace', color = '#313233')
plt.grid(color = 'gray', linestyle = '-', axis = 'y', alpha = 0.1)
a = sns.lineplot(data = gen_dynamics, x = 'year', y = 'Man', color = '#75b1a9', linewidth = 2)
sns.scatterplot(x = 2021, y = np.array([79.3]), color = '#75b1a9', s = 90, edgecolor = 'black', zorder = 5)
a = sns.lineplot(data = gen_dynamics, x = 'year', y = 'Woman', color = '#d9b44a', linewidth = 2)
sns.scatterplot(x = 2021, y = np.array([18.8]), color = '#d9b44a', s = 90, edgecolor = 'black', zorder = 5)
a = sns.lineplot(data = gen_dynamics, x = 'year', y = 'Others', color = '#AACDE5', linewidth = 2)
sns.scatterplot(x = 2021, y = np.array([1.9]), color = '#AACDE5', s = 90, edgecolor = 'black', zorder = 5)
plt.xticks(np.array([2018, 2019, 2020, 2021]), fontname = 'monospace', size = 16, color = '#313233')
a.yaxis.set_major_formatter(FixedFormatter(['10%', '20%', '30%', '40%', '50%', '60%', '70%', '80%']))
a.yaxis.set_major_locator(FixedLocator([10, 20, 30, 40, 50, 60, 70, 80]))
plt.yticks(fontname = 'monospace', size = 15, color = '#313233')
plt.ylabel('')
plt.xlabel('')

for j in ['right', 'top', 'left']:
    a.spines[j].set_visible(False)
for j in ['bottom']:
    a.spines[j].set_linewidth(2)
    
plt.figtext(0.925, 0.815, 'Man', fontname = 'monospace', size = 17, color = '#75b1a9') 
plt.figtext(0.92, 0.29, 'Woman', fontname = 'monospace', size = 17, color = '#d9b44a') 
plt.figtext(0.915, 0.15, 'Others', fontname = 'monospace', size = 17, color = '#AACDE5') 
 
plt.figtext(0.85, 0.85, '79.3', fontname = 'monospace', size = 14, color = '#75b1a9')
plt.figtext(0.85, 0.34, '18.8', fontname = 'monospace', size = 14, color = '#d9b44a')
plt.figtext(0.855, 0.19, '1.9', fontname = 'monospace', size = 14, color = '#AACDE5') 

plt.figtext(0.125, -0.14, '''In each of the four years, most participants identified as men.
However, between 2018 and 2021 the share of women increased by 2 percentage
points, while the share of men decreased by 2.1 percentage points.''', fontname = 'monospace', size = 14, color = '#313233')

l1 = lines.Line2D([0.125, 0.9], [0.02, 0.02], transform = fig.transFigure, figure = fig, color = '#ededed', linestyle = '-',linewidth = 4)
a.lines.extend([l1])

plt.show()
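
The changes quoted under the dynamics plots are differences between two shares, so they are measured in percentage points, which is not the same as a relative change in percent. The sketch below illustrates the distinction using the share of men: 79.3 is the 2021 value annotated on the plot, and the 2018 value is only implied by the 2.1-point decrease stated in the caption. The function names are hypothetical.

```python
def point_change(old_share, new_share):
    """Difference between two shares, in percentage points."""
    return new_share - old_share

def relative_change(old_share, new_share):
    """Change relative to the starting share, in percent."""
    return (new_share - old_share) / old_share * 100

men_2021 = 79.3            # share annotated on the plot above
men_2018 = men_2021 + 2.1  # implied by the 2.1-point decrease in the caption

points = point_change(men_2018, men_2021)
relative = relative_change(men_2018, men_2021)
print(f"{points:.1f} percentage points, {relative:.1f}% relative")
```

For a group as large as this one the two numbers are close, but for a small group a change of a couple of points can be a large relative change.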

# Education

In [None]:
sur2018 = sur2018.replace('Some college/university study without earning a bachelor’s degree', 'University without bachelor’s degree')
sur2019 = sur2019.replace('Some college/university study without earning a bachelor’s degree', 'University without bachelor’s degree')
sur2020 = sur2020.replace('Some college/university study without earning a bachelor’s degree', 'University without bachelor’s degree')
sur2021 = sur2021.replace('Some college/university study without earning a bachelor’s degree', 'University without bachelor’s degree')

fig = plt.figure(figsize = (15, 3))
clrs = ['#65b2e5' if x > 10000 else '#AACDE5' for x in sur2021['Q4'].value_counts().reset_index()['Q4']]
plt.title('Results 2021', size = 25, y = 1.1, fontname = 'monospace', color = '#313233')
a = sns.barplot(data = sur2021['Q4'].value_counts().reset_index(), x = 'Q4', y = 'index', palette = clrs, linestyle = "-", linewidth = 1, edgecolor = "black")
plt.xticks([])
plt.yticks(fontname = 'monospace', size = 14, color = '#313233')
plt.xlabel('')
plt.ylabel('')
for j in ['right', 'top', 'bottom', 'left']:
    a.spines[j].set_visible(False)
    
for p in a.patches:
    width = p.get_width()
    plt.text(350 + width, p.get_y() + 0.55*p.get_height(), f'{round((width / sur2021["Q4"].value_counts().reset_index()["Q4"].sum()) * 100, 1)}%',
             ha = 'center', va = 'center', fontname = 'monospace', fontsize = 14, color = '#313233')

plt.show()

In [None]:
educ2021 = {'Bachelor’s degree': 'Bachelor',
       'Master’s degree': 'Master',
       'Doctoral degree': 'Doctorate',
       'I prefer not to answer': 'Other',
       'University without bachelor’s degree': 'Other',
       'No formal education past high school': 'Other',
       'Professional doctorate': 'Doctorate'}

sur2021['Q4'] = sur2021['Q4'].map(educ2021)

educ2018_2020 = {'Bachelor’s degree': 'Bachelor',
       'Master’s degree': 'Master',
       'Doctoral degree': 'Doctorate',
       'I prefer not to answer': 'Other',
       'University without bachelor’s degree': 'Other',
       'No formal education past high school': 'Other',
       'Professional degree': 'Doctorate'}
for i in [sur2018, sur2019, sur2020]:
    i['Q4'] = i['Q4'].map(educ2018_2020)
    
edc2021 = sur2021['Q4'].value_counts().reset_index()
edc2020 = sur2020['Q4'].value_counts().reset_index()
edc2019 = sur2019['Q4'].value_counts().reset_index()
edc2018 = sur2018['Q4'].value_counts().reset_index()

for i in [edc2021, edc2020, edc2019, edc2018]:
    i['sum'] = i['Q4'].sum()
    i['percentage'] = round(i['Q4']/i['sum']*100, 1)
    i.drop(['Q4', 'sum'], axis = 1, inplace = True) 

edc2021['year'] = 2021
edc2020['year'] = 2020
edc2019['year'] = 2019
edc2018['year'] = 2018

edc_dynamics = pd.concat([edc2018, edc2019, edc2020, edc2021])
edc_dynamics = edc_dynamics.pivot(index = 'year', columns = 'index', values = 'percentage').reset_index() 

fig = plt.figure(figsize = (15, 6))
plt.title('Dynamics', size = 25, y = 1.14, fontname = 'monospace', color = '#313233')
plt.grid(color = 'gray', linestyle = '-', axis = 'y', alpha = 0.1)
a = sns.lineplot(data = edc_dynamics, x = 'year', y = 'Bachelor', color = '#75b1a9', linewidth = 2)
sns.scatterplot(x = 2021, y = np.array([38.1]), color = '#75b1a9', s = 90, edgecolor = 'black', zorder = 5)
a = sns.lineplot(data = edc_dynamics, x = 'year', y = 'Doctorate', color = '#d9b44a', linewidth = 2)
sns.scatterplot(x = 2021, y = np.array([12.1]), color = '#d9b44a', s = 90, edgecolor = 'black', zorder = 5)
a = sns.lineplot(data = edc_dynamics, x = 'year', y = 'Master', color = '#AACDE5', linewidth = 2)
sns.scatterplot(x = 2021, y = np.array([39.0]), color = '#AACDE5', s = 90, edgecolor = 'black', zorder = 5)
a = sns.lineplot(data = edc_dynamics, x = 'year', y = 'Other', color = '#4f6457', linewidth = 2)
sns.scatterplot(x = 2021, y = np.array([10.7]), color = '#4f6457', s = 90, edgecolor = 'black', zorder = 5)
plt.xticks(np.array([2018, 2019, 2020, 2021]), fontname = 'monospace', size = 16, color = '#313233')
a.yaxis.set_major_formatter(FixedFormatter(['10%', '15%', '20%', '25%', '30%', '35%', '40%', '45%']))
a.yaxis.set_major_locator(FixedLocator([10, 15, 20, 25, 30, 35, 40, 45]))
plt.yticks(fontname = 'monospace', size = 15, color = '#313233')
plt.ylabel('')
plt.xlabel('')

for j in ['right', 'top', 'left']:
    a.spines[j].set_visible(False)
for j in ['bottom']:
    a.spines[j].set_linewidth(2)
    
plt.figtext(0.915, 0.67, 'Bachelor', fontname = 'monospace', size = 17, color = '#75b1a9') 
plt.figtext(0.91, 0.25, 'Doctorate', fontname = 'monospace', size = 17, color = '#d9b44a') 
plt.figtext(0.925, 0.72, 'Master', fontname = 'monospace', size = 17, color = '#AACDE5') 
plt.figtext(0.93, 0.205, 'Other', fontname = 'monospace', size = 17, color = '#4f6457') 

plt.figtext(0.85, 0.65, '38.1', fontname = 'monospace', size = 14, color = '#75b1a9') 
plt.figtext(0.85, 0.29, '12.1', fontname = 'monospace', size = 14, color = '#d9b44a')
plt.figtext(0.85, 0.76, '39.0', fontname = 'monospace', size = 14, color = '#AACDE5') 
plt.figtext(0.85, 0.17, '10.7', fontname = 'monospace', size = 14, color = '#4f6457')

plt.figtext(0.122, -0.36, '''Over the past 3 years, the share of respondents with a master's or doctoral degree
has declined, while the share with a bachelor's degree or another level of education
has grown. It is quite possible that next year bachelor's degree holders will be the
largest group and doctorate holders the smallest. A likely reason is that Kaggle is
primarily an educational platform for gaining basic knowledge in DS, so it has far
more beginners than professionals. Data Science has been one of the most promising
fields for years, so more and more newcomers keep arriving, while professionals
are more interested in competitions with money prizes
and in other projects.''', fontname = 'monospace', size = 14, color = '#313233')

l1 = lines.Line2D([0.125, 0.9], [0.02, 0.02], transform = fig.transFigure, figure = fig, color = '#ededed', linestyle = '-',linewidth = 4)
a.lines.extend([l1])

plt.show()
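
`Series.map` with a dictionary returns `NaN` for every value that is not a key, and `value_counts` then drops those rows silently. That is why two dictionaries are needed above: the label is 'Professional degree' in 2018-2020 and 'Professional doctorate' in 2021. A small check like the one below can list the labels a mapping would lose before the shares are computed. `unmapped_values` is a hypothetical helper and the sample series is made up.

```python
import pandas as pd

def unmapped_values(series, mapping):
    """List the distinct non-missing values that have no key in `mapping`."""
    present = series.dropna().unique()
    return sorted(value for value in present if value not in mapping)

# Made-up answers for illustration
answers = pd.Series(["Bachelor’s degree", "Professional degree", None, "Master’s degree"])
partial_mapping = {"Bachelor’s degree": "Bachelor", "Master’s degree": "Master"}

print(unmapped_values(answers, partial_mapping))
```

An empty list means every observed label is covered by the mapping.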

# Job

In [None]:
fig = plt.figure(figsize = (15, 6.5))
clrs = ['#65b2e5' if x > 5000 else '#AACDE5' for x in sur2021['Q5'].value_counts().reset_index()['Q5']]
plt.title('Results 2021', size = 25, y = 1.1, fontname = 'monospace', color = '#313233')
a = sns.barplot(data = sur2021['Q5'].value_counts().reset_index(), x = 'Q5', y = 'index', palette = clrs, linestyle = "-", linewidth = 1, edgecolor = "black")
plt.xticks([])
plt.yticks(fontname = 'monospace', size = 14, color = '#313233')
plt.xlabel('')
plt.ylabel('')
for j in ['right', 'top', 'bottom', 'left']:
    a.spines[j].set_visible(False)
    
for p in a.patches:
    width = p.get_width()
    plt.text(230 + width, p.get_y() + 0.55*p.get_height(), f'{round((width / sur2021["Q5"].value_counts().reset_index()["Q5"].sum()) * 100, 1)}%',
             ha = 'center', va = 'center', fontname = 'monospace', fontsize = 14, color = '#313233')

plt.show()

In [None]:
jobs = ['Software Engineer', 'Data Scientist', 'Statistician', 'Data Analyst', 'Research Scientist', 'Business Analyst', 'Data Engineer', 'DBA/Database Engineer']

j2021 = sur2021['Q5'].value_counts().reset_index().rename(columns={'Q5': 'q2021'})
j2018 = sur2018['Q6'].value_counts().reset_index().rename(columns={'Q6': 'q2018'})

j2021 = j2021[j2021['index'].isin(jobs)]
j2018 = j2018[j2018['index'].isin(jobs)]

j18_21 = j2021.merge(j2018, how = 'left', left_on = 'index', right_on = 'index')

j18_21['sum21'] = j18_21['q2021'].sum()
j18_21['sum18'] = j18_21['q2018'].sum()

j18_21['2021'] = round((j18_21['q2021'] / len(sur2021)) * 100, 1)
j18_21['2018'] = round((j18_21['q2018'] / len(sur2018)) * 100, 1)

fig = plt.figure(figsize = (15, 6))
plt.title('Dynamics', size = 25, y = 1.12, fontname = 'monospace', color = '#313233')
plt.grid(color = 'gray', linestyle = '-', axis = 'y', alpha = 0.1, zorder = 0)
a = sns.scatterplot(x = j18_21['2021'], y = j18_21['index'], color = '#AACDE5', s = 300, edgecolor = 'black', zorder = 5)
sns.scatterplot(x = j18_21['2018'], y = j18_21['index'], color = '#d9b44a', s = 100, edgecolor = 'black', zorder = 5)
plt.xticks(fontname = 'monospace', size = 15, color = '#313233')
a.xaxis.set_major_formatter(FixedFormatter(['0%', '2.5%', '5%', '7.5%', '10%', '12.5%', '15%', '17.5%']))
a.xaxis.set_major_locator(FixedLocator([0, 2.5, 5, 7.5, 10, 12.5, 15, 17.5]))
plt.yticks(fontname = 'monospace', size = 15, color = '#313233')
plt.ylabel('')
plt.xlabel('')

for j in ['right', 'top', 'left']:
    a.spines[j].set_visible(False)
for j in ['bottom']:
    a.spines[j].set_linewidth(2)
    
plt.figtext(0.48, 0.94, '2018', fontname = 'monospace', size = 15, color = '#d9b44a') 
plt.figtext(0.52, 0.94, '2021', fontname = 'monospace', size = 15, color = '#AACDE5') 

plt.figtext(0.122, -0.18, '''Data scientist remains the leading role, despite a significant decrease by 2021. Why did
this happen? One likely reason is that in 2018 the title was used broadly: any data analyst
or business analyst could identify as a data scientist. Now the community
draws a clearer line between what each specialization is responsible for.''', fontname = 'monospace', size = 14, color = '#313233')

l1 = lines.Line2D([0.125, 0.9], [0.02, 0.02], transform = fig.transFigure, figure = fig, color = '#ededed', linestyle = '-',linewidth = 4)
a.lines.extend([l1])

def newline(p1, p2, color = '#313233'):
    # Draw a straight connector between points p1 and p2 on the current axes.
    a = plt.gca()
    l = lines.Line2D([p1[0], p2[0]], [p1[1], p2[1]], color = color)
    a.add_line(l)
    return l

for i, p1, p2 in zip(j18_21['index'], j18_21['2018'], j18_21['2021']):
    newline([p1, i], [p2, i])

plt.show()
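
The percentages in the job plot are computed against `len(sur2021)` and `len(sur2018)`, that is, against every respondent, including those who hold other roles or skipped the question. Dividing by the `sum21` and `sum18` columns instead would give the share among the tracked roles only. The sketch below shows how the two denominators differ on a made-up series; the variable names are hypothetical.

```python
import pandas as pd

# Toy answers, made up for illustration only
roles = pd.Series(["Student", "Data Scientist", "Data Analyst", "Student", None])
tracked = ["Data Scientist", "Data Analyst"]

counts = roles.value_counts()
counts = counts[counts.index.isin(tracked)]

# Denominator: every row, whether the question was answered or not
share_of_all = (counts / len(roles) * 100).round(1)
# Denominator: respondents in the tracked roles only
share_of_tracked = (counts / counts.sum() * 100).round(1)

print(share_of_all)
print(share_of_tracked)
```

Either choice can be reasonable, as long as the same denominator is used for both years being compared.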

# Salary

In [None]:
sal21d = {'25,000-29,999': '20k-50k',
          '60,000-69,999': '50k-100k',
          '$0-999': '<5k',
          '30,000-39,999': '20k-50k',
          '15,000-19,999': '5k-20k',
          '70,000-79,999': '50k-100k',
          '2,000-2,999': '<5k',
          '10,000-14,999': '5k-20k',
          '5,000-7,499': '5k-20k',
          '20,000-24,999': '20k-50k',
          '1,000-1,999': '<5k',
          '100,000-124,999': '100k-250k',
          '7,500-9,999': '5k-20k',
          '4,000-4,999': '<5k',
          '40,000-49,999': '20k-50k',
          '50,000-59,999': '50k-100k',
          '3,000-3,999': '<5k',
          '300,000-499,999': '250k<',
          '200,000-249,999': '100k-250k',
          '125,000-149,999': '100k-250k',
          '250,000-299,999': '250k<',
          '80,000-89,999': '50k-100k',
          '90,000-99,999': '50k-100k',
          '150,000-199,999': '100k-250k',
          '>$1,000,000': '250k<',
          '$500,000-999,999': '250k<'}

sal21 = sur2021['Q25'].map(sal21d).reset_index()

fig = plt.figure(figsize = (15, 3))
clrs = ['#65b2e5' if x > 5000 else '#AACDE5' for x in sal21['Q25'].value_counts().reset_index()['Q25']]
plt.title('Results 2021', size = 25, y = 1.1, fontname = 'monospace', color = '#313233')
a = sns.barplot(data = sal21['Q25'].value_counts().reset_index(), x = 'Q25', y = 'index', palette = clrs, linestyle = "-", linewidth = 1, edgecolor = "black")
plt.xticks([])
plt.yticks(fontname = 'monospace', size = 14, color = '#313233')
plt.xlabel('')
plt.ylabel('')
for j in ['right', 'top', 'bottom', 'left']:
    a.spines[j].set_visible(False)
    
for p in a.patches:
    width = p.get_width()
    plt.text(230 + width, p.get_y() + 0.55*p.get_height(), f'{round((width / sal21["Q25"].value_counts().reset_index()["Q25"].sum()) * 100, 1)}%',
             ha = 'center', va = 'center', fontname = 'monospace', fontsize = 14, color = '#313233')

plt.show()

In [None]:
sal18d = {'10-20,000': '5k-20k',
          '0-10,000': '<5k', # Approximation: 2018 has no finer bracket below 10k, so this group is overstated; needed for the comparison
          '20-30,000': '20k-50k',
          '125-150,000': '100k-250k',
          '30-40,000': '20k-50k',
          '50-60,000': '50k-100k',
          '100-125,000': '100k-250k',
          '90-100,000': '50k-100k',
          '70-80,000': '50k-100k',
          '80-90,000': '50k-100k',
          '60-70,000': '50k-100k',
          '400-500,000': '250k<',
          '40-50,000': '20k-50k',
          '150-200,000': '100k-250k',
          '500,000+': '250k<',
          '300-400,000': '250k<',
          '200-250,000': '100k-250k',
          '250-300,000': '250k<'}

sal18 = sur2018['Q9'].map(sal18d).reset_index()

sal18 = sal18['Q9'].value_counts().reset_index()
sal21 = sal21['Q25'].value_counts().reset_index()

sal18_21 = sal21.merge(sal18, how = 'left', left_on = 'index', right_on = 'index')

sal18_21['sum21'] = sal18_21['Q25'].sum()
sal18_21['sum18'] = sal18_21['Q9'].sum()

sal18_21['2021'] = round((sal18_21['Q25'] / sal18_21['sum21']) * 100, 1)
sal18_21['2018'] = round((sal18_21['Q9'] / sal18_21['sum18']) * 100, 1)

fig = plt.figure(figsize = (15, 5))
plt.title('Dynamics', size = 25, y = 1.12, fontname = 'monospace', color = '#313233')
plt.grid(color = 'gray', linestyle = '-', axis = 'y', alpha = 0.1, zorder = 0)
a = sns.scatterplot(x = sal18_21['2021'], y = sal18_21['index'], color = '#AACDE5', s = 300, edgecolor = 'black', zorder = 5)
sns.scatterplot(x = sal18_21['2018'], y = sal18_21['index'], color = '#d9b44a', s = 100, edgecolor = 'black', zorder = 5)
plt.xticks(fontname = 'monospace', size = 15, color = '#313233')
a.xaxis.set_major_formatter(FixedFormatter(['0%', '5%', '10%', '15%', '20%', '25%', '30%', '35%']))
a.xaxis.set_major_locator(FixedLocator([0, 5, 10, 15, 20, 25, 30, 35]))
plt.yticks(fontname = 'monospace', size = 15, color = '#313233')
plt.ylabel('')
plt.xlabel('')

for j in ['right', 'top', 'left']:
    a.spines[j].set_visible(False)
for j in ['bottom']:
    a.spines[j].set_linewidth(2)
    
a.set_ylim(6, -0.3)
    
plt.figtext(0.48, 0.94, '2018', fontname = 'monospace', size = 15, color = '#d9b44a') 
plt.figtext(0.52, 0.94, '2021', fontname = 'monospace', size = 15, color = '#AACDE5') 

plt.figtext(0.121, -0.11, '''Under 5k $ a year is not much, but do not forget that many of the respondents are students.
Less than 2% of respondents have an annual salary of over 250k $.''', fontname = 'monospace', size = 14, color = '#313233')

l1 = lines.Line2D([0.125, 0.9], [0.02, 0.02], transform = fig.transFigure, figure = fig, color = '#ededed', linestyle = '-',linewidth = 4)
a.lines.extend([l1])

for i, p1, p2 in zip(sal18_21['index'], sal18_21['2018'], sal18_21['2021']):
    newline([p1, i], [p2, i])

plt.show()
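
The 2021 salary groups above are assigned by a hand-written dictionary with one entry per bracket. As a hedged alternative, the group can be derived from the lower edge of each bracket with `pd.cut`. The sketch assumes labels in the 2021 style ('25,000-29,999', '>$1,000,000'); the 2018 labels such as '10-20,000' would need a different parser. `lower_bound`, `edges` and `names` are hypothetical names.

```python
import pandas as pd

def lower_bound(label):
    """Lower edge of a bracket label such as '25,000-29,999' or '>$1,000,000'."""
    cleaned = label.replace("$", "").replace(">", "").replace(",", "").strip()
    return int(cleaned.split("-")[0])

# Group boundaries matching the groups used in the plots above
edges = [0, 5_000, 20_000, 50_000, 100_000, 250_000, float("inf")]
names = ["<5k", "5k-20k", "20k-50k", "50k-100k", "100k-250k", "250k<"]

labels = pd.Series(["$0-999", "5,000-7,499", "40,000-49,999",
                    "90,000-99,999", "200,000-249,999", ">$1,000,000"])

# right=False makes each bin include its lower edge: [0, 5000), [5000, 20000), ...
groups = pd.cut(labels.map(lower_bound), bins=edges, labels=names, right=False)
print(pd.DataFrame({"label": labels, "group": groups}))
```

Because the result is an ordered categorical, the groups also sort from lowest to highest without any manual reordering.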

# Programming experience

In [None]:
fig = plt.figure(figsize = (15, 3.5))
clrs = ['#65b2e5' if x > 7000 else '#AACDE5' for x in sur2021['Q6'].value_counts().reset_index().rename({0: 1, 1: 0}).sort_index()['Q6']]
plt.title('Results 2021', size = 25, y = 1.1, fontname = 'monospace', color = '#313233')
a = sns.barplot(data = sur2021['Q6'].value_counts().reset_index().rename({0: 1, 1: 0}).sort_index(), x = 'Q6', y = 'index', palette = clrs, linestyle = "-", linewidth = 1, edgecolor = "black")
plt.xticks([])
plt.yticks(fontname = 'monospace', size = 14, color = '#313233')
plt.xlabel('')
plt.ylabel('')
for j in ['right', 'top', 'bottom', 'left']:
    a.spines[j].set_visible(False)
    
for p in a.patches:
    width = p.get_width()
    plt.text(270 + width, p.get_y() + 0.55*p.get_height(), f'{round((width / sur2021["Q6"].value_counts().reset_index()["Q6"].sum()) * 100, 1)}%',
             ha = 'center', va = 'center', fontname = 'monospace', fontsize = 14, color = '#313233')

plt.show()

In [None]:
exp2021 = {'< 1 years': '< 1 years',
       '1-3 years': '1-3 years',
       '3-5 years': '3-5 years',
       '5-10 years': '5-10 years',
       '10-20 years': '10-20 years',
       '20+ years': '20< years',
       'I have never written code': '< 1 years'}

sur2021['Q6'] = sur2021['Q6'].map(exp2021)

exp2020 = {'< 1 years': '< 1 years',
       '1-2 years': '1-3 years',
       '3-5 years': '3-5 years',
       '5-10 years': '5-10 years',
       '10-20 years': '10-20 years',
       '20+ years': '20< years',
       'I have never written code': '< 1 years'}

sur2020['Q6'] = sur2020['Q6'].map(exp2020)

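# NOTE: these 2019 labels start at '0' and use brackets ('3-4', '5-9', '10-14', '15-19') that
# differ from every other year. Confirm from the question text for Q7 (the first row of the CSV,
# dropped above) that this column really measures coding experience; if it does not, the 2019
# values below are not comparable with the other years.
exp2019 = {'0': '< 1 years',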
       '1-2': '1-3 years',
       '3-4': '3-5 years',
       '5-9': '5-10 years',
       '10-14': '10-20 years',
       '15-19': '10-20 years',
       '20+': '20< years'}

sur2019['Q7'] = sur2019['Q7'].map(exp2019)

exp2018 = {'0-1': '< 1 years',
       '1-2': '1-3 years',
       '2-3': '1-3 years',
       '3-4': '3-5 years',
       '4-5': '3-5 years',
       '5-10': '5-10 years',
       '10-15': '10-20 years',
       '15-20': '10-20 years',
       '20-25': '20< years',
       '25-30': '20< years',
       '30 +': '20< years'}

sur2018['Q8'] = sur2018['Q8'].map(exp2018)

e2021 = sur2021['Q6'].value_counts().reset_index()
e2020 = sur2020['Q6'].value_counts().reset_index()
e2019 = sur2019['Q7'].value_counts().reset_index().rename(columns={'Q7': 'Q6'})
e2018 = sur2018['Q8'].value_counts().reset_index().rename(columns={'Q8': 'Q6'})

for i in [e2021, e2020, e2019, e2018]:
    i['sum'] = i['Q6'].sum()
    i['percentage'] = round(i['Q6']/i['sum']*100, 1)
    i.drop(['Q6', 'sum'], axis = 1, inplace = True) 
    
indx = {0: 1, 1: 0}
e2018 = e2018.rename(indx).sort_index()
e2019 = e2019.rename(indx).sort_index()
e2020 = e2020.rename(indx).sort_index()
e2021 = e2021.rename(indx).sort_index()

e18_21 = e2018.merge(e2021, how = 'left', left_on = 'index', right_on = 'index').rename(columns={'percentage_x': '2018', 'percentage_y': '2021'})

fig = plt.figure(figsize = (15, 5))
plt.title('Dynamics', size = 25, y = 1.12, fontname = 'monospace', color = '#313233')
plt.grid(color = 'gray', linestyle = '-', axis = 'y', alpha = 0.1, zorder = 0)
a = sns.scatterplot(x = e18_21['2021'], y = e18_21['index'], color = '#AACDE5', s = 300, edgecolor = 'black', zorder = 5)
sns.scatterplot(x = e18_21['2018'], y = e18_21['index'], color = '#d9b44a', s = 100, edgecolor = 'black', zorder = 5)
plt.xticks(fontname = 'monospace', size = 15, color = '#313233')
a.xaxis.set_major_formatter(FixedFormatter(['0%', '5%', '10%', '15%', '20%', '25%', '30%', '35%']))
a.xaxis.set_major_locator(FixedLocator([0, 5, 10, 15, 20, 25, 30, 35]))
plt.yticks(fontname = 'monospace', size = 15, color = '#313233')
plt.ylabel('')
plt.xlabel('')

for j in ['right', 'top', 'left']:
    a.spines[j].set_visible(False)
for j in ['bottom']:
    a.spines[j].set_linewidth(2)
    
a.set_ylim(6, -0.3)
    
plt.figtext(0.48, 0.94, '2018', fontname = 'monospace', size = 15, color = '#d9b44a') 
plt.figtext(0.52, 0.94, '2021', fontname = 'monospace', size = 15, color = '#AACDE5')

plt.figtext(0.122, -0.18, '''Respondents' reported experience varies considerably from year to year. In 2018 and 2021 the
largest group had 1-3 years of experience, while in 2019 the largest group reported more than
20 years. That 2019 result is an outlier and may stem from how the 2019 question was worded
or mapped. In 2018, 2020 and 2021, most respondents had up to 5 years of experience.''', fontname = 'monospace', size = 14, color = '#313233')

l1 = lines.Line2D([0.125, 0.9], [0.02, 0.02], transform = fig.transFigure, figure = fig, color = '#ededed', linestyle = '-',linewidth = 4)
a.lines.extend([l1])

for i, p1, p2 in zip(e18_21['index'], e18_21['2018'], e18_21['2021']):
    newline([p1, i], [p2, i])
    
sns.scatterplot(x = 23.3, y = ['20< years'], color = '#4f6457', s = 100, edgecolor = 'black', zorder = 5)
sns.scatterplot(x = 13.8, y = ['< 1 years'], color = '#4f6457', s = 100, edgecolor = 'black', zorder = 5)

plt.annotate('''anomaly in 2019''', xy = (14, ['< 1 years']), xytext = (23, ['3-5 years']),
             arrowprops = dict(facecolor = 'steelblue', arrowstyle = "->", connectionstyle = "arc3,rad = 0.22",color = '#4f6457'), 
             fontsize = 13, fontfamily = 'monospace', ha = 'center', color = '#4f6457')
plt.annotate('', xy = (23.5, ['20< years']), xytext = (23, ['3-5 years']),
             arrowprops = dict(facecolor = 'steelblue', arrowstyle = "->", connectionstyle = "arc3,rad = -0.4",color = '#4f6457'), 
             fontsize = 12, fontfamily = 'monospace', ha = 'center', color = '#4f6457')

plt.show()
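
The experience categories above are put in order by swapping the first two rows (`rename({0: 1, 1: 0}).sort_index()`), which works only while the counts happen to come out in a particular order. A minimal sketch of a more explicit approach is to state the desired order once and `reindex` the shares against it. The `order` list reuses the group labels from the mappings above, and the sample answers are made up.

```python
import pandas as pd

order = ["< 1 years", "1-3 years", "3-5 years", "5-10 years", "10-20 years", "20< years"]

# Made-up answers for illustration
answers = pd.Series(["1-3 years", "1-3 years", "< 1 years", "20< years"])

shares = (
    (answers.value_counts(normalize=True) * 100)
    .round(1)
    .reindex(order, fill_value=0.0)  # fixed order; missing groups become 0
)
print(shares)
```

Using the same `order` for every year also guarantees that the rows line up when the years are merged.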

# Do you want a big salary? Let's see what it takes.

I will consider only the period from 2019 to 2021. As we have already seen, most respondents earn up to 5k per year, and the 2018 survey offered salary brackets that are too coarse at that level (for example, 0-10k), so it is more accurate to use the surveys with finer salary brackets.

In [None]:
sal21d = {'25,000-29,999': 27500,
          '60,000-69,999': 65000,
          '$0-999': 500,
          '30,000-39,999': 35000,
          '15,000-19,999': 17500,
          '70,000-79,999': 75000,
          '2,000-2,999': 2500,
          '10,000-14,999': 12500,
          '5,000-7,499': 6250,
          '20,000-24,999': 22500,
          '1,000-1,999': 1500,
          '100,000-124,999': 112500,
          '7,500-9,999': 8750,
          '4,000-4,999': 4500,
          '40,000-49,999': 45000,
          '50,000-59,999': 55000,
          '3,000-3,999': 3500,
          '300,000-499,999': 400000,
          '200,000-249,999': 225000,
          '125,000-149,999': 137500,
          '250,000-299,999': 275000,
          '80,000-89,999': 85000,
          '90,000-99,999': 95000,
          '150,000-199,999': 175000,
          '>$1,000,000': 1000000,
          '$500,000-999,999': 750000}

sur2021['salary'] = sur2021['Q25'].map(sal21d)

sal19d = {'30,000-39,999': 35000,
          '5,000-7,499': 6250,
          '250,000-299,999': 275000,
          '4,000-4,999': 4500,
          '60,000-69,999': 65000,
          '10,000-14,999': 12500,
          '80,000-89,999': 85000,
          '$0-999': 500,
          '2,000-2,999': 2500,
          '70,000-79,999': 75000,
          '90,000-99,999': 95000,
          '125,000-149,999': 137500,
          '40,000-49,999': 45000,
          '20,000-24,999': 22500,
          '15,000-19,999': 17500,
          '100,000-124,999': 112500,
          '7,500-9,999': 8750,
          '150,000-199,999': 175000,
          '25,000-29,999': 27500,
          '3,000-3,999': 3500,
          '1,000-1,999': 1500,
          '200,000-249,999': 225000,
          '50,000-59,999': 55000,
          '> $500,000': 500000,
          '300,000-500,000': 400000}

sur2019['salary'] = sur2019['Q10'].map(sal19d)

ms19 = sur2019.groupby(["Q4", "Q5"]).agg({'salary': 'mean'}).reset_index()
ms19 = ms19[ms19['Q5'].isin(jobs)].query("Q4 != 'Other'")

ms21 = sur2021.groupby(["Q4", "Q5"]).agg({'salary': 'mean'}).reset_index()
ms21 = ms21[ms21['Q5'].isin(jobs)].query("Q4 != 'Other'")

ms21_2 = sur2021.groupby(["Q5"]).agg({'salary': 'mean'}).reset_index()

ms19_21 = sur2019.groupby(["Q5"]).agg({'salary': 'mean'}).reset_index()
ms19_21 = ms19_21[ms19_21['Q5'].isin(jobs)]
ms19_21 = ms19_21.merge(ms21_2, how = 'left', left_on = 'Q5', right_on = 'Q5').rename(columns={'salary_x': '2019', 'salary_y': '2021'})
ms19_21['dynamics'] = round((100 - (ms19_21['2021'] / ms19_21['2019']) * 100), 1)

fig = plt.figure(figsize = (10, 18))

plt.subplot(211)
plt.title('Salary dependence on education', size = 27, y = 1.07, fontname = 'monospace', color = '#313233')
plt.grid(color = 'gray', linestyle = '-', axis = 'x', alpha = 0.1, zorder = 0)
a = sns.barplot(data = ms19, x = 'salary', y = 'Q5', hue = 'Q4', palette = ['#4f6457', '#AACDE5', '#d9b44a'], linestyle = "-", linewidth = 1, edgecolor = "black")
plt.xticks(fontname = 'monospace', size = 15, color = '#313233')
a.xaxis.set_major_formatter(FixedFormatter(['10k', '20k', '30k', '40k', '50k', '60k', '70k', '80k']))
a.xaxis.set_major_locator(FixedLocator([10000, 20000, 30000, 40000, 50000, 60000, 70000, 80000]))
plt.yticks(fontname = 'monospace', size = 15, color = '#313233')
plt.ylabel('')
plt.xlabel('')
a.get_legend().remove()

for j in ['right', 'top', 'left']:
    a.spines[j].set_visible(False)
for j in ['bottom']:
    a.spines[j].set_linewidth(2)
    
plt.subplot(212)
plt.grid(color = 'gray', linestyle = '-', axis = 'x', alpha = 0.1, zorder = 0)
a = sns.barplot(data = ms21, x = 'salary', y = 'Q5', hue = 'Q4', palette = ['#4f6457', '#AACDE5', '#d9b44a'], linestyle = "-", linewidth = 1, edgecolor = "black")
plt.xticks(fontname = 'monospace', size = 15, color = '#313233')
a.xaxis.set_major_formatter(FixedFormatter(['10k', '20k', '30k', '40k', '50k', '60k', '70k', '80k']))
a.xaxis.set_major_locator(FixedLocator([10000, 20000, 30000, 40000, 50000, 60000, 70000, 80000]))
plt.yticks(fontname = 'monospace', size = 15, color = '#313233')
plt.ylabel('')
plt.xlabel('')
plt.legend(bbox_to_anchor = (0.77, 1.18), ncol = 3, borderpad = 1, frameon = True, fontsize = 14)

for j in ['right', 'top', 'left']:
    a.spines[j].set_visible(False)
for j in ['bottom']:
    a.spines[j].set_linewidth(2)
    
fig.tight_layout(h_pad = 5)

plt.figtext(0.99, 0.77, '2019', fontsize = 35, fontname = 'monospace', color = '#313233')
plt.figtext(0.99, 0.23, '2021', fontsize = 35, fontname = 'monospace', color = '#313233')

l1 = lines.Line2D([1.05, 1.05], [0.75, 0.265], transform = fig.transFigure, figure = fig, color = '#313233', linestyle = '-',linewidth = 2)
a.lines.extend([l1])
l1 = lines.Line2D([0.29, 0.985], [-0.01, -0.01], transform = fig.transFigure, figure = fig, color = '#ededed', linestyle = '-',linewidth = 4)
a.lines.extend([l1])

y = 0.74
for job in ms19_21['Q5'].tolist():
    plt.figtext(1.18, y, job, fontsize = 13, fontname = 'monospace', color = '#313233', ha = 'center')
    y -= 0.064
    
y = 0.72
for job in ms19_21['dynamics'].tolist():
    plt.figtext(1.18, y, f'-{job}%', fontsize = 13, fontname = 'monospace', color = '#AA4B41', ha = 'center')
    y -= 0.064
    
plt.figtext(0.29, -0.19, '''The first thing to notice is that the highest-paid job
among those compared is Data Scientist in both years.
Data analysts earn the least.

Now to education. In almost all professions, respondents
with a doctoral degree earn more than the others.
Those with a bachelor's degree earn the least.

Even more interesting is the drop in average salaries
across all professions. Within a single country the reasons
may differ, but on a global scale a plausible explanation
is the COVID-19 pandemic: companies cut salaries, along
with other costs, to get through this difficult time.''', fontsize = 13, fontname = 'monospace', color = '#313233')
    
plt.show()
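
The salary "dynamics" in this notebook are percentage changes between two yearly means. A small signed helper keeps the direction explicit, so a rise can never be printed as a drop. This is a sketch with made-up numbers; `pct_change` and `pct_label` are hypothetical names, not functions used elsewhere in the notebook.

```python
def pct_change(old, new, ndigits=1):
    """Signed percentage change from old to new (negative means a drop)."""
    return round((new / old - 1) * 100, ndigits)

def pct_label(value):
    """Format a change with an explicit sign for use in a plot annotation."""
    return f'{value:+}%'

# Made-up average salaries for two survey years.
drop = pct_change(60000, 45000)
rise = pct_change(50000, 52000)
print(pct_label(drop), pct_label(rise))
```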

In [None]:
exp_m_s19 = sur2019.groupby(["Q7"]).agg({'salary': 'mean'}).reset_index()
exp_m_s21 = sur2021.groupby(["Q6"]).agg({'salary': 'mean'}).reset_index()

# Reorder the experience levels by swapping rows. The things we do for the sake of a beautiful plot :)
def swap_rows(df, i1, i2):
    a, b = df.iloc[i1, :].copy(), df.iloc[i2, :].copy()
    df.iloc[i1, :], df.iloc[i2, :] = b, a
    return df

for df in [exp_m_s19, exp_m_s21]:
    df = swap_rows(df, 0, 5)
    df = swap_rows(df, 1, 5)
    df = swap_rows(df, 2, 3)
    df = swap_rows(df, 3, 4)
    df = swap_rows(df, 4, 5)
    
exp_m_s19_21 = exp_m_s19.merge(exp_m_s21, how = 'left', left_on = 'Q7', right_on = 'Q6').rename(columns={'salary_x': '2019', 'salary_y': '2021'}).drop('Q6', axis = 1)
exp_m_s19_21['2019'] = exp_m_s19_21['2019'].astype('int')
exp_m_s19_21['2021'] = exp_m_s19_21['2021'].astype('int')

fig = plt.figure(figsize = (15, 5))
plt.title('Salary dependence on experience', size = 25, y = 1.12, fontname = 'monospace', color = '#313233')
plt.grid(color = 'gray', linestyle = '-', axis = 'y', alpha = 0.1, zorder = 0)
a = sns.scatterplot(x = exp_m_s19_21['2021'], y = exp_m_s19_21['Q7'], color = '#AACDE5', s = 300, edgecolor = 'black', zorder = 5)
sns.scatterplot(x = exp_m_s19_21['2019'], y = exp_m_s19_21['Q7'], color = '#d9b44a', s = 100, edgecolor = 'black', zorder = 5)
plt.xticks(fontname = 'monospace', size = 15, color = '#313233')
a.xaxis.set_major_formatter(FixedFormatter(['20k', '30k', '40k', '50k', '60k', '70k', '80k', '90k']))
a.xaxis.set_major_locator(FixedLocator([20000, 30000, 40000, 50000, 60000, 70000, 80000, 90000]))
plt.yticks(fontname = 'monospace', size = 15, color = '#313233')
plt.ylabel('')
plt.xlabel('')

for j in ['right', 'top', 'left']:
    a.spines[j].set_visible(False)
for j in ['bottom']:
    a.spines[j].set_linewidth(2)
    
a.set_ylim(6, -0.3)
    
l1 = lines.Line2D([0.125, 0.9], [0.02, 0.02], transform = fig.transFigure, figure = fig, color = '#ededed', linestyle = '-',linewidth = 4)
a.lines.extend([l1])

for i, p1, p2 in zip(exp_m_s19_21['Q7'], exp_m_s19_21['2019'], exp_m_s19_21['2021']):
    newline([p1, i], [p2, i])
    
plt.figtext(0.48, 0.94, '2019', fontname = 'monospace', size = 15, color = '#d9b44a') 
plt.figtext(0.52, 0.94, '2021', fontname = 'monospace', size = 15, color = '#AACDE5')
    
plt.figtext(0.125, -0.23, '''As expected, salary is positively correlated with experience, but over these 2 years
a large gap has opened between respondents with up to 5 years of experience and those
with more than 5 years. Perhaps this is because highly qualified specialists are in
short supply worldwide, while newcomers are numerous and compete
intensely with each other.''', fontsize = 14, fontname = 'monospace', color = '#313233')
    
plt.show()
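
The row swaps above exist only to put the experience levels in a sensible order. An alternative sketch uses an ordered categorical, which makes the intended order explicit. The labels in `order` and the salary values are assumptions for illustration; the real labels should be taken from the survey's answer options.

```python
import pandas as pd

# Assumed experience labels, listed from least to most experience.
order = ['< 1 years', '1-3 years', '3-5 years', '5-10 years', '10-20 years', '20+ years']

# Toy frame standing in for the grouped means (values are made up).
means = pd.DataFrame({
    'Q6': ['1-3 years', '10-20 years', '20+ years', '3-5 years', '5-10 years', '< 1 years'],
    'salary': [20000, 60000, 80000, 30000, 45000, 15000],
})

# An ordered categorical sorts by the given order instead of alphabetically.
means['Q6'] = pd.Categorical(means['Q6'], categories=order, ordered=True)
ordered = means.sort_values('Q6').reset_index(drop=True)
print(ordered)
```

Labels missing from `order` become NaN and are sorted last, which also makes unexpected answer options easy to spot.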

In [None]:
gen_ms19_21 = sur2019.groupby(["Q2"]).agg({'salary': 'mean'}).reset_index()\
.merge(sur2021.groupby(["Q2"]).agg({'salary': 'mean'}).reset_index(), how = 'left', left_on = 'Q2', right_on = 'Q2')\
.rename(columns={'salary_x': '2019', 'salary_y': '2021'})
gen_ms19_21['2019'] = gen_ms19_21['2019'].astype('int')
gen_ms19_21['2021'] = gen_ms19_21['2021'].astype('int')

fig = plt.figure(figsize = (15, 3.5))
plt.title('Salary dependence on gender', size = 25, y = 1.13, fontname = 'monospace', color = '#313233')
plt.grid(color = 'gray', linestyle = '-', axis = 'y', alpha = 0.1, zorder = 0)
a = sns.scatterplot(x = gen_ms19_21['2021'], y = gen_ms19_21['Q2'], color = '#AACDE5', s = 300, edgecolor = 'black', zorder = 5)
sns.scatterplot(x = gen_ms19_21['2019'], y = gen_ms19_21['Q2'], color = '#d9b44a', s = 100, edgecolor = 'black', zorder = 5)
plt.xticks(fontname = 'monospace', size = 15, color = '#313233')
a.xaxis.set_major_formatter(FixedFormatter(['30k', '40k', '50k', '60k', '70k']))
a.xaxis.set_major_locator(FixedLocator([30000, 40000, 50000, 60000, 70000]))
plt.yticks(fontname = 'monospace', size = 15, color = '#313233')
plt.ylabel('')
plt.xlabel('')

a.set_ylim(-1, 2.8)

for j in ['right', 'top', 'left']:
    a.spines[j].set_visible(False)
for j in ['bottom']:
    a.spines[j].set_linewidth(2)
    
l1 = lines.Line2D([0.125, 0.9], [-0.01, -0.01], transform = fig.transFigure, figure = fig, color = '#ededed', linestyle = '-',linewidth = 4)
a.lines.extend([l1])

for i, p1, p2 in zip(gen_ms19_21['Q2'], gen_ms19_21['2019'], gen_ms19_21['2021']):
    newline([p1, i], [p2, i])
    
plt.figtext(0.48, 0.94, '2019', fontname = 'monospace', size = 15, color = '#d9b44a') 
plt.figtext(0.52, 0.94, '2021', fontname = 'monospace', size = 15, color = '#AACDE5')
    
plt.figtext(0.125, -0.42, '''The picture here is ambiguous. The average salary of men and women has fallen significantly
in two years, while the salary of those who identify differently has increased, even though
it was already much higher in 2019. I will not go into details, but note that the "other"
gender group does not differ from the groups of men and women in the share of respondents
with a doctoral degree, with extensive experience or working as data scientists.''', fontsize = 14, fontname = 'monospace', color = '#313233')

plt.show()
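
When group averages are compared, it helps to look at the group sizes and medians as well, because a mean over a small group can be moved by a few large values. A minimal sketch with made-up data (the column names follow the notebook, the numbers do not come from the survey):

```python
import pandas as pd

# Made-up respondents; one very high salary in a group of size one.
toy = pd.DataFrame({
    'Q2': ['Man', 'Man', 'Man', 'Woman', 'Woman', 'Other'],
    'salary': [40000, 50000, 60000, 30000, 50000, 200000],
})

# Mean, median and number of respondents per group, side by side.
summary = toy.groupby('Q2')['salary'].agg(['mean', 'median', 'count'])
print(summary)
```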

In [None]:
ms_country19 = pd.concat([sur2019.groupby("Q3").agg({'salary': 'mean'}).reset_index().sort_values('salary').head(5), 
                         sur2019.groupby("Q3").agg({'salary': 'mean'}).reset_index().sort_values('salary').tail(5)])
# Shorten long country names (label-based .loc avoids chained assignment)
ms_country19.loc[22, 'Q3'] = 'Iran'
ms_country19.loc[57, 'Q3'] = 'USA'

ms_country21 = pd.concat([sur2021.groupby("Q3").agg({'salary': 'mean'}).reset_index().sort_values('salary').head(5), 
                         sur2021.groupby("Q3").agg({'salary': 'mean'}).reset_index().sort_values('salary').tail(5)])
ms_country21.loc[64, 'Q3'] = 'USA'

c = ['United States of America', 'India', 'China', 'Russia']
msc1 = sur2019.groupby("Q3").agg({'salary': 'mean'}).reset_index()
msc2 = sur2021.groupby("Q3").agg({'salary': 'mean'}).reset_index()
msc1 = msc1[msc1['Q3'].isin(c)]
msc2 = msc2[msc2['Q3'].isin(c)]
msc = msc1.merge(msc2, how = 'left', left_on = 'Q3', right_on = 'Q3').rename(columns={'salary_x': '2019', 'salary_y': '2021'})
msc['dynamics'] = round((msc['2021'] / msc['2019']) * 100 - 100, 1)
msc.loc[3, 'Q3'] = 'USA'

fig = plt.figure(figsize = (9, 15))

plt.subplot(211)
clrs = ['#AACDE5' if x > 50000 else '#75b1a9' for x in ms_country19['salary']]
plt.title('Salary dependence on country', size = 27, x = 0.47, y = 1.08, fontname = 'monospace', color = '#313233')
a = sns.barplot(data = ms_country19, x = 'salary', y = 'Q3', palette = clrs, linestyle = "-", linewidth = 1, edgecolor = "black")
plt.axhline(4.49, 0, 0.95, color = '#AA4B41', linewidth = 2)
plt.xticks([])
plt.yticks(fontname = 'monospace', size = 14, color = '#313233')
plt.ylabel('')
plt.xlabel('')
for p in a.patches:
    width = p.get_width()
    plt.text(7500 + p.get_width(), p.get_y() + 0.6*p.get_height(), f'{round((width/1000), 1)}k',
             ha = 'center', va = 'center', fontname = 'monospace', size = 14, color = '#313233')
    
for j in ['right', 'top', 'bottom', 'left']:
    a.spines[j].set_visible(False)
    
plt.subplot(212)
clrs = ['#AACDE5' if x > 50000 else '#75b1a9' for x in ms_country21['salary']]
a = sns.barplot(data = ms_country21, x = 'salary', y = 'Q3', palette = clrs, linestyle = "-", linewidth = 1, edgecolor = "black")
plt.axhline(4.505, 0, 0.95, color = '#AA4B41', linewidth = 2)
plt.xticks([])
plt.yticks(fontname = 'monospace', size = 14, color = '#313233')
plt.ylabel('')
plt.xlabel('')
for p in a.patches:
    width = p.get_width()
    plt.text(7500 + p.get_width(), p.get_y() + 0.6*p.get_height(), f'{round((width/1000), 1)}k',
             ha = 'center', va = 'center', fontname = 'monospace', size = 14, color = '#313233')
    
for j in ['right', 'top', 'bottom', 'left']:
    a.spines[j].set_visible(False)
    
plt.figtext(0.99, 0.7, '2019', fontsize = 35, fontname = 'monospace', color = '#313233')
plt.figtext(0.99, 0.29, '2021', fontsize = 35, fontname = 'monospace', color = '#313233')

l1 = lines.Line2D([1.05, 1.05], [0.68, 0.33], transform = fig.transFigure, figure = fig, color = '#313233', linestyle = '-',linewidth = 2)
a.lines.extend([l1])
l1 = lines.Line2D([0.125, 0.86], [0.1, 0.1], transform = fig.transFigure, figure = fig, color = '#ededed', linestyle = '-',linewidth = 4)
a.lines.extend([l1])

y = 0.6
for job in msc['Q3'].tolist():
    plt.figtext(1.13, y, job, fontsize = 14, fontname = 'monospace', color = '#313233', ha = 'center')
    y -= 0.064
    
y = 0.58
for i in msc['dynamics'].tolist():
    if i < 0:
        plt.figtext(1.13, y, f'{i}%', fontsize = 13, fontname = 'monospace', color = '#AA4B41', ha = 'center')
        y -= 0.064
    else:
        plt.figtext(1.13, y, f'+{i}%', fontsize = 13, fontname = 'monospace', color = '#4f6457', ha = 'center')
        y -= 0.064

plt.figtext(0.12, 0.04, '''The highest average salaries are in Norway, Switzerland,
Australia, Israel and the USA. But don't jump to conclusions:
every country has different taxes, insurance and
cost of living.''', fontsize = 13, fontname = 'monospace', color = '#313233')
    
plt.show()
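
The lowest and highest country averages can also be selected from a single grouped result, with long names shortened by value rather than by row position. This is a sketch on made-up data; the `short_names` mapping is an assumption and would need the survey's exact country labels.

```python
import pandas as pd

# Made-up country averages, one respondent per country for brevity.
toy = pd.DataFrame({
    'Q3': ['United States of America', 'India', 'Norway', 'Egypt'],
    'salary': [90000.0, 20000.0, 95000.0, 10000.0],
})
short_names = {'United States of America': 'USA'}  # assumed long label -> short label

# Group once, rename by label, then take both ends of the ranking.
by_country = toy.groupby('Q3')['salary'].mean().rename(index=short_names)
extremes = pd.concat([by_country.nsmallest(2), by_country.nlargest(2).sort_values()])
print(extremes)
```

Renaming by label keeps working when the row positions shift, for example after a survey adds a country.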

**So, if you want a high salary, you have to live in the USA, have a doctoral degree and 20 years of experience, and be a data scientist. Easy :)**

# World map of Kaggle respondents

**This map breaks the respondents of each country down by social group: hover over a country to see its shares by gender, age and education, and its average salary.**
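
The per-country shares shown on the map are row percentages of a country-by-group table. As a compact illustration, `pd.crosstab` with `normalize='index'` computes such shares in one step. The data below is made up; only the column names follow the notebook.

```python
import pandas as pd

# Made-up respondents: country (Q3) and gender (Q2).
toy = pd.DataFrame({
    'Q3': ['India', 'India', 'India', 'India', 'Japan', 'Japan'],
    'Q2': ['Man', 'Man', 'Man', 'Woman', 'Man', 'Woman'],
})

# Each row sums to 100: the share of every gender within a country.
shares = (pd.crosstab(toy['Q3'], toy['Q2'], normalize='index') * 100).round(1)
print(shares)
```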

In [None]:
# Data for plotly map
mapp = sur2021['Q3'].value_counts().reset_index().merge(c_codes, how = 'left', left_on = 'index', right_on = 'English short name lower case')
mapp.drop('English short name lower case', axis = 1, inplace = True)
mapp.iloc[1,2] = "USA"
mapp.iloc[8,2] = "GBR"
mapp.iloc[22,2] = "VNM"
mapp.iloc[27,2] = "IRN"
mapp.iloc[49,2] = "HKG"
mapp.drop(index = [2, 52], inplace = True)
mapp.rename(columns = {'Q3': "Count of respondents", 'Alpha-3 code': "Code"}, inplace = True)

# Statistics
gen_stat = sur2021.groupby(['Q3', 'Q2']).agg({'Q1': 'count'}).reset_index()
gen_stat['sum'] = gen_stat.groupby('Q3')['Q1'].transform('sum')
gen_stat['fraction'] = round(gen_stat['Q1']/gen_stat['sum'] * 100,1)

age_stat = sur2021.groupby(['Q3', 'Q1']).agg({'Q2': 'count'}).reset_index()
age_stat['sum'] = age_stat.groupby('Q3')['Q2'].transform('sum')
age_stat['fraction'] = round(age_stat['Q2']/age_stat['sum'] * 100,1)

educ_stat = sur2021.groupby(['Q3', 'Q4']).agg({'Q2': 'count'}).reset_index()
educ_stat['sum'] = educ_stat.groupby('Q3')['Q2'].transform('sum')
educ_stat['fraction'] = round(educ_stat['Q2']/educ_stat['sum'] * 100,1)

salary_stat = sur2021.groupby("Q3").agg({'salary': 'mean'}).reset_index()
salary_stat['salary'] = round(salary_stat['salary'] / 1000, 1)
salary_stat['salary'] = salary_stat['salary'].astype('str') + 'k $'

# Merge data
for gender in gen_stat['Q2'].unique().tolist():
    mapp = mapp.merge(gen_stat.query("Q2 == @gender")[['Q3', 'fraction']], how = 'left', left_on = 'index', right_on = 'Q3').drop('Q3', axis = 1).rename(columns={'fraction': gender})
    mapp = mapp.fillna(0)
    mapp[gender] = mapp[gender].astype('str') + '%'

mapp['Man / Woman / Others'] = mapp['Man'] + ' / ' + mapp['Woman'] + ' / ' + mapp['Others']

for age in age_stat['Q1'].unique().tolist():
    mapp = mapp.merge(age_stat.query("Q1 == @age")[['Q3', 'fraction']], how = 'left', left_on = 'index', right_on = 'Q3').drop('Q3', axis = 1).rename(columns={'fraction': age})
    mapp = mapp.fillna(0)
    mapp[age] = mapp[age].astype('str') + '%'

mapp['Young / Middle aged / Old'] = mapp['Young'] + ' / ' + mapp['Middle aged'] + ' / ' + mapp['Old']

for education in educ_stat['Q4'].unique().tolist():
    mapp = mapp.merge(educ_stat.query("Q4 == @education")[['Q3', 'fraction']], how = 'left', left_on = 'index', right_on = 'Q3').drop('Q3', axis = 1).rename(columns={'fraction': education})
    mapp = mapp.fillna(0)
    mapp[education] = mapp[education].astype('str') + '%'
    
mapp['Doctorate / Master / Bachelor / Other'] = mapp['Doctorate'] + ' / ' + mapp['Master'] + ' / ' + mapp['Bachelor'] + ' / ' + mapp['Other']

mapp = mapp.merge(salary_stat, how = 'left', left_on = 'index', right_on = 'Q3').drop('Q3', axis = 1)

# Text for hover
mapp['text'] = 'Country</b> - ' + mapp['index'] + '<br>' + '<br>' + '<b>Respondents</b> - ' + mapp['Count of respondents'].astype('str') + \
    '<br>' + '<b>Man / Woman / Others</b> - ' + mapp['Man / Woman / Others'] + \
    '<br>' + '<b>Young / Middle aged / Old</b> - ' + mapp['Young / Middle aged / Old'] + \
    '<br>' + '<b>Doctorate / Master / Bachelor / Other</b> - ' + mapp['Doctorate / Master / Bachelor / Other'] + \
    '<br>' + '<b>Average salary</b> - ' + mapp['salary']

# Plot
fig = go.Figure(data=go.Choropleth(
    locations=mapp['Code'],
    z = mapp['Count of respondents'],
    locationmode = 'ISO-3',
    colorscale = 'Blues',
    zmax = 4000,
    zmin = 0,
    text = mapp['text'],
    hovertemplate = "<b>%{text}</b>" + "<extra></extra>"
    
))

fig.update_layout(
    geo_scope='world',
    hoverlabel=dict(
        bgcolor="#bed5eb",
        font_size=14,
        font_family="monospace"
    )
)

fig.update_geos(visible=False, 
                resolution=110, 
                showcountries=True,
                countrywidth=0.5)

fig.update_traces(showscale=False)

fig.show()

mapp.drop(['Man / Woman / Others', 'Young / Middle aged / Old', 'Doctorate / Master / Bachelor / Other', 'text'], axis = 1, inplace = True)

In [None]:
for col in mapp.columns.tolist()[3:-1]:
    mapp[col] = mapp[col].apply(lambda x: x[:-1])
    mapp[col] = mapp[col].astype('float')
    
r_country = pd.concat([mapp.sort_values('Count of respondents').head(5), mapp.sort_values('Count of respondents').tail(5)])
                      
fig = plt.figure(figsize = (15, 7))
clrs = ['#AACDE5' if x > 100 else '#75b1a9' for x in r_country['Count of respondents']]
a = sns.barplot(data = r_country, x = 'Count of respondents', y = 'index', palette = clrs, linestyle = "-", linewidth = 1, edgecolor = "black")
plt.axhline(4.5, 0, 0.95, color = '#AA4B41', linewidth = 2)
plt.xticks([])
plt.yticks(fontname = 'monospace', size = 13, color = '#313233')
plt.ylabel('')
plt.xlabel('')
for p in a.patches:
    width = p.get_width()
    plt.text(180 + p.get_width(), p.get_y() + 0.6*p.get_height(), int(width),
             ha = 'center', va = 'center', fontname = 'monospace', size = 13, color = '#313233')
    
for j in ['right', 'top', 'bottom', 'left']:
    a.spines[j].set_visible(False)
for j in ['bottom']:
    a.spines[j].set_linewidth(2)
    
plt.figtext(0.12, 0.96, 'The', fontsize = 21, fontname = 'monospace', color = '#313233')
plt.figtext(0.16, 0.96, 'largest', fontsize = 21, fontname = 'monospace', color = '#AACDE5')
plt.figtext(0.25, 0.96, 'and the', fontsize = 21, fontname = 'monospace', color = '#313233')
plt.figtext(0.345, 0.96, 'smallest', fontsize = 21, fontname = 'monospace', color = '#75b1a9')
plt.figtext(0.45, 0.96, 'count of respondents by country in 2021', fontsize = 21, fontname = 'monospace', color = '#313233')

l1 = lines.Line2D([0.125, 0.9], [0.07, 0.07], transform = fig.transFigure, figure = fig, color = '#ededed', linestyle = '-',linewidth = 4)
a.lines.extend([l1])

plt.figtext(0.122, -0.08, '''It is worth noting that the number of countries has grown over 3 years: the Data Science
and Kaggle community is expanding, and new countries have appeared, such as
Ecuador, Uganda, Kazakhstan, Ethiopia and Iraq, which had never shown up in
the survey results before!''', fontname = 'monospace', size = 14, color = '#313233')

plt.show()
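
Which countries are new in a given year can be checked with a set difference over the country columns. A sketch with made-up country sets (with the real data, each set would come from something like `set(sur2021['Q3'])`, and catch-all labels such as "Other" would need to be excluded first):

```python
# Made-up country sets per survey year.
earlier = {
    2018: {'India', 'Brazil'},
    2019: {'India', 'Brazil', 'Japan'},
    2020: {'India', 'Japan'},
}
countries_2021 = {'India', 'Brazil', 'Japan', 'Uganda'}

# Countries seen in any earlier survey, then those that appear only in 2021.
seen_before = set().union(*earlier.values())
new_in_2021 = sorted(countries_2021 - seen_before)
print(new_in_2021)
```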

In [None]:
cntrs = sur2018['Q3'].value_counts().reset_index().rename(columns={'Q3':'2018'})\
.merge(sur2019['Q3'].value_counts().reset_index().rename(columns={'Q3':'2019'}), how = 'left', left_on = 'index', right_on = 'index')\
.merge(sur2020['Q3'].value_counts().reset_index().rename(columns={'Q3':'2020'}), how = 'left', left_on = 'index', right_on = 'index')\
.merge(sur2021['Q3'].value_counts().reset_index().rename(columns={'Q3':'2021'}), how = 'left', left_on = 'index', right_on = 'index')\
.query("index == 'United States of America' | index == 'India' | index == 'China' | index == 'Russia'")\
.transpose()

cntrs.columns = cntrs.iloc[0]
cntrs = cntrs[1:].reset_index().rename(columns={'index':'year'})

for i in cntrs.columns.tolist():
    cntrs[i] = cntrs[i].astype('int')
    
fig = plt.figure(figsize = (15, 6))
plt.grid(color = 'gray', linestyle = '-', axis = 'y', alpha = 0.1)
a = sns.lineplot(data = cntrs, x = 'year', y = 'United States of America', color = '#75b1a9', linewidth = 2)
sns.scatterplot(x = 2021, y = np.array([2650]), color = '#75b1a9', s = 90, edgecolor = 'black', zorder = 5)
a = sns.lineplot(data = cntrs, x = 'year', y = 'India', color = '#d9b44a', linewidth = 2)
sns.scatterplot(x = 2021, y = np.array([7434]), color = '#d9b44a', s = 90, edgecolor = 'black', zorder = 5)
a = sns.lineplot(data = cntrs, x = 'year', y = 'China', color = '#AACDE5', linewidth = 2)
sns.scatterplot(x = 2021, y = np.array([814]), color = '#AACDE5', s = 90, edgecolor = 'black', zorder = 5)
a = sns.lineplot(data = cntrs, x = 'year', y = 'Russia', color = '#4f6457', linewidth = 2)
sns.scatterplot(x = 2021, y = np.array([742]), color = '#4f6457', s = 90, edgecolor = 'black', zorder = 5)
plt.xticks(np.array([2018, 2019, 2020, 2021]), fontname = 'monospace', size = 16, color = '#313233')
plt.yticks(np.array([2000, 4000, 6000, 8000]), fontname = 'monospace', size = 15, color = '#313233')
a.yaxis.set_major_formatter(FixedFormatter(['2k', '4k', '6k', '8k']))
a.yaxis.set_major_locator(FixedLocator([2000, 4000, 6000, 8000]))
plt.ylabel('')
plt.xlabel('')

for j in ['right', 'top', 'left']:
    a.spines[j].set_visible(False)
for j in ['bottom']:
    a.spines[j].set_linewidth(2)

plt.figtext(0.845, 0.87, '7 434', fontname = 'monospace', size = 14, color = '#d9b44a')
plt.figtext(0.845, 0.4, '2 650', fontname = 'monospace', size = 14, color = '#75b1a9')
plt.figtext(0.85, 0.23, '814', fontname = 'monospace', size = 14, color = '#AACDE5') 
plt.figtext(0.85, 0.14, '742', fontname = 'monospace', size = 14, color = '#4f6457')

l1 = lines.Line2D([0.125, 0.9], [0.02, 0.02], transform = fig.transFigure, figure = fig, color = '#ededed', linestyle = '-',linewidth = 4)
a.lines.extend([l1])

plt.figtext(0.12, 0.98, 'Dynamics of participation of the', fontsize = 23, fontname = 'monospace', color = '#313233')
plt.figtext(0.545, 0.98, 'USA', fontsize = 23, fontname = 'monospace', color = '#75b1a9')
plt.figtext(0.583, 0.98, ',', fontsize = 23, fontname = 'monospace', color = '#313233')
plt.figtext(0.609, 0.98, 'India', fontsize = 23, fontname = 'monospace', color = '#d9b44a')
plt.figtext(0.672, 0.98, ',', fontsize = 23, fontname = 'monospace', color = '#313233')
plt.figtext(0.698, 0.98, 'China', fontsize = 23, fontname = 'monospace', color = '#AACDE5')
plt.figtext(0.775, 0.98, 'and', fontsize = 23, fontname = 'monospace', color = '#313233')
plt.figtext(0.826, 0.98, 'Russia', fontsize = 23, fontname = 'monospace', color = '#4f6457')

plt.figtext(0.122, -0.18, '''In 2018, the largest group of respondents came from the USA, but since 2019 India has held
the lead and keeps widening it. The United States remains in second place for now,
but its number of respondents has fallen by 43.8% since 2018.
China and Russia have roughly equal numbers.''', fontname = 'monospace', size = 14, color = '#313233')

plt.show()
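
The year-by-country table behind this plot is built with a chain of merges. One possible alternative, sketched here on made-up answers, collects the yearly `value_counts` in a dictionary and lets pandas align them by country name.

```python
import pandas as pd

# Made-up answers standing in for each year's country column.
answers = {
    2018: pd.Series(['India', 'China', 'India']),
    2019: pd.Series(['India', 'China', 'China', 'China']),
}

# One column per year, aligned by country; transpose so that each row is a year.
counts = pd.DataFrame({year: s.value_counts() for year, s in answers.items()}).T
counts = counts.fillna(0).astype(int)  # a country missing in a year counts as 0
print(counts)
```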

<h1 style="background-color:#a0c6eb;text-align:center;color:white;font-size:400%;font-family:monospace;border-radius:10px">PART II</h1>

![](https://content.techgig.com/thumb/msid-76354712,width-860,resizemode-4/Roadmap-to-become-a-successful-programmer.jpg?216831)

# Programming languages

In [None]:
languages = np.array([])
results = np.array([])
for i in sur2021.columns.tolist()[7:18]:
    languages = np.append(languages, np.array(sur2021[sur2021[i] == sur2021[i]][i].reset_index().iloc[0,1]))
    results = np.append(results, np.array(len(sur2021[sur2021[i] == sur2021[i]][i])))
languages = {'languages': languages}
lan21 = pd.DataFrame(data=languages)
lan21['results'] = results
lan21['percentage'] = round(lan21['results'] / len(sur2021) * 100, 1)
lan21.sort_values('results', inplace = True)

fig = plt.figure(figsize = (15, 5.5))
clrs = ['#65b2e5' if x > 12000 else '#AACDE5' for x in lan21['results']]
plt.title('Results 2021', size = 25, y = 1.1, fontname = 'monospace', color = '#313233')
a = sns.barplot(data = lan21, x = 'percentage', y = 'languages', palette = clrs, linestyle = "-", linewidth = 1, edgecolor = "black")
plt.xticks([])
plt.yticks(fontname = 'monospace', size = 14, color = '#313233')
plt.xlabel('')
plt.ylabel('')
for j in ['right', 'top', 'bottom', 'left']:
    a.spines[j].set_visible(False)
    
for p in a.patches:
    width = p.get_width()
    plt.text(3 + width, p.get_y() + 0.55*p.get_height(), f'{width}%',
             ha = 'center', va = 'center', fontname = 'monospace', fontsize = 14, color = '#313233')

plt.show()
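
The counting loops in this part rely on the expression `sur[i] == sur[i]`, which is False only for missing values because NaN is not equal to itself. The same count can be written more directly with `dropna`. This is a sketch on a toy frame; the column names are assumed to follow the survey's multi-select layout, where each option has its own column.

```python
import numpy as np
import pandas as pd

# Toy multi-select block: a cell holds the option name if it was ticked, else NaN.
toy = pd.DataFrame({
    'Q7_Part_1': ['Python', np.nan, 'Python', 'Python'],
    'Q7_Part_2': [np.nan, 'R', np.nan, np.nan],
})

rows = []
for col in toy.columns:
    answered = toy[col].dropna()           # respondents who ticked this option
    rows.append({'languages': answered.iloc[0], 'results': len(answered)})

lan = pd.DataFrame(rows)
lan['percentage'] = (lan['results'] / len(toy) * 100).round(1)
print(lan)
```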

In [None]:
languages = np.array([])
results = np.array([])
for i in sur2018.columns.tolist()[65:83]:
    languages = np.append(languages, np.array(sur2018[sur2018[i] == sur2018[i]][i].reset_index().iloc[0,1]))
    results = np.append(results, np.array(len(sur2018[sur2018[i] == sur2018[i]][i])))
languages = {'languages': languages}
lan18 = pd.DataFrame(data=languages)
lan18['results'] = results
lan18['percentage'] = round(lan18['results'] / len(sur2018) * 100, 1)
lan18.sort_values('results', inplace = True)

lan18_21 = lan18.merge(lan21, how = 'right', left_on = 'languages', right_on = 'languages')\
.rename(columns={'percentage_x': '2018', 'percentage_y': '2021'})\
.dropna().drop(['results_x', 'results_y'], axis = 1)

dsl = ['Python', 'R', 'SQL']
lan18_21 = lan18_21[lan18_21['languages'].isin(dsl)]

fig = plt.figure(figsize = (15, 3.5))
plt.title('Dynamics', size = 25, y = 1.12, fontname = 'monospace', color = '#313233')
plt.grid(color = 'gray', linestyle = '-', axis = 'y', alpha = 0.1, zorder = 0)
a = sns.scatterplot(x = lan18_21['2021'], y = lan18_21['languages'], color = '#AACDE5', s = 300, edgecolor = 'black', zorder = 5)
sns.scatterplot(x = lan18_21['2018'], y = lan18_21['languages'], color = '#d9b44a', s = 100, edgecolor = 'black', zorder = 5)
plt.xticks(np.array([0, 10, 20, 30, 40, 50, 60, 70, 80]), fontname = 'monospace', size = 15, color = '#313233')
a.xaxis.set_major_formatter(FixedFormatter(['0%', '10%', '20%', '30%', '40%', '50%', '60%', '70%', '80%']))
a.xaxis.set_major_locator(FixedLocator([0, 10, 20, 30, 40, 50, 60, 70, 80]))
plt.yticks(fontname = 'monospace', size = 15, color = '#313233')
plt.ylabel('')
plt.xlabel('')

a.set_ylim(-1, 2.8)

for j in ['right', 'top', 'left']:
    a.spines[j].set_visible(False)
for j in ['bottom', 'left']:
    a.spines[j].set_linewidth(2)
    
l1 = lines.Line2D([0.125, 0.9], [-0.03, -0.03], transform = fig.transFigure, figure = fig, color = '#ededed', linestyle = '-',linewidth = 4)
a.lines.extend([l1])

for i, p1, p2 in zip(lan18_21['languages'], lan18_21['2018'], lan18_21['2021']):
    newline([p1, i], [p2, i])
    
plt.figtext(0.48, 0.92, '2018', fontname = 'monospace', size = 15, color = '#d9b44a') 
plt.figtext(0.52, 0.92, '2021', fontname = 'monospace', size = 15, color = '#AACDE5') 
    
plt.figtext(0.125, -0.75, '''From year to year, Python is becoming more popular while R is gradually fading. Many people
find Python simpler and see it widely used in industry, whereas R is perceived as a more
complex, scientific tool. SQL is also becoming more popular.

These are the three main languages in data science, and the ones most often recommended
for learning first:

Python - 59.4% in 2018 and 77.8% in 2021;
SQL - 3.8% in 2018 and 5.2% in 2021;
R - 9.8% in 2018 and 5.6% in 2021.''', fontsize = 14, fontname = 'monospace', color = '#313233')
    
plt.show()
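
A change in a share can be read in two ways: as a difference in percentage points or as a relative change. The sketch below computes both from the three pairs of figures quoted in the caption above; the variable names are only illustrative.

```python
# Shares in 2018 and 2021, copied from the caption above.
shares = {'Python': (59.4, 77.8), 'SQL': (3.8, 5.2), 'R': (9.8, 5.6)}

changes = {}
for name, (y2018, y2021) in shares.items():
    points = round(y2021 - y2018, 1)                # percentage points
    relative = round((y2021 / y2018 - 1) * 100, 1)  # relative change in %
    changes[name] = (points, relative)
    print(f'{name}: {points:+} points, {relative:+}% relative')
```

The two readings can differ a lot for small shares: SQL gains only 1.4 points, yet that is a relative increase of more than a third.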

# Integrated development environments (IDEs)

In [None]:
ide = np.array([])
results = np.array([])
for i in sur2021.columns.tolist()[21:33]:
    ide = np.append(ide, np.array(sur2021[sur2021[i] == sur2021[i]][i].reset_index().iloc[0,1]))
    results = np.append(results, np.array(len(sur2021[sur2021[i] == sur2021[i]][i])))
ide = {'ide': ide}
ide21 = pd.DataFrame(data=ide)
ide21['results'] = results
ide21['percentage'] = round(ide21['results'] / len(sur2021) * 100, 1)
ide21.sort_values('results', inplace = True)
ide21 = ide21.query("ide != 'None'")
ide21['ide'] = ide21['ide'].apply(lambda x: x.strip())

fig = plt.figure(figsize = (15, 5.5))
clrs = ['#65b2e5' if x > 12000 else '#AACDE5' for x in ide21['results']]
plt.title('Results 2021', size = 25, y = 1.1, fontname = 'monospace', color = '#313233')
a = sns.barplot(data = ide21, x = 'percentage', y = 'ide', palette = clrs, linestyle = "-", linewidth = 1, edgecolor = "black")
plt.xticks([])
plt.yticks(fontname = 'monospace', size = 14, color = '#313233')
plt.xlabel('')
plt.ylabel('')
for j in ['right', 'top', 'bottom', 'left']:
    a.spines[j].set_visible(False)
    
for p in a.patches:
    width = p.get_width()
    plt.text(2 + width, p.get_y() + 0.55*p.get_height(), f'{width}%',
             ha = 'center', va = 'center', fontname = 'monospace', fontsize = 14, color = '#313233')

plt.show()

In [None]:
ide = np.array([])
results = np.array([])
for i in sur2018.columns.tolist()[30:44]:
    ide = np.append(ide, np.array(sur2018[sur2018[i] == sur2018[i]][i].reset_index().iloc[0,1]))
    results = np.append(results, np.array(len(sur2018[sur2018[i] == sur2018[i]][i])))
ide = {'ide': ide}
ide18 = pd.DataFrame(data=ide)
ide18['results'] = results
ide18['percentage'] = round(ide18['results'] / len(sur2018) * 100, 1)
ide18.sort_values('results', inplace = True)
ide18 = ide18.query("ide != 'None'")

ide18_21 = ide18.merge(ide21, how = 'right', left_on = 'ide', right_on = 'ide')\
.rename(columns={'percentage_x': '2018', 'percentage_y': '2021'}).dropna()

fig = plt.figure(figsize = (15, 6))
plt.title('Dynamics', size = 25, y = 1.12, fontname = 'monospace', color = '#313233')
plt.grid(color = 'gray', linestyle = '-', axis = 'y', alpha = 0.1, zorder = 0)
a = sns.scatterplot(x = ide18_21['2021'], y = ide18_21['ide'], color = '#AACDE5', s = 300, edgecolor = 'black', zorder = 5)
sns.scatterplot(x = ide18_21['2018'], y = ide18_21['ide'], color = '#d9b44a', s = 100, edgecolor = 'black', zorder = 5)
plt.xticks(fontname = 'monospace', size = 15, color = '#313233')
a.xaxis.set_major_formatter(FixedFormatter(['10%', '15%', '20%', '25%', '30%', '35%']))
a.xaxis.set_major_locator(FixedLocator([10, 15, 20, 25, 30, 35]))
plt.yticks(fontname = 'monospace', size = 15, color = '#313233')
plt.ylabel('')
plt.xlabel('')

for j in ['right', 'top', 'left']:
    a.spines[j].set_visible(False)
for j in ['bottom']:
    a.spines[j].set_linewidth(2)

a.set_ylim(7, -0.3)
    
l1 = lines.Line2D([0.125, 0.9], [0.02, 0.02], transform = fig.transFigure, figure = fig, color = '#ededed', linestyle = '-',linewidth = 4)
fig.add_artist(l1)  # Axes.lines cannot be extended in recent Matplotlib releases

for i, p1, p2 in zip(ide18_21['ide'], ide18_21['2018'], ide18_21['2021']):
    newline([p1, i], [p2, i])
    
plt.figtext(0.48, 0.94, '2018', fontname = 'monospace', size = 15, color = '#d9b44a') 
plt.figtext(0.52, 0.94, '2021', fontname = 'monospace', size = 15, color = '#AACDE5') 
    
plt.figtext(0.125, -0.13, '''In 2021, Jupyter Notebook holds the leading position. In 2018 it was not yet as well known and
therefore did not even make it into the survey, so RStudio took first place.
All IDEs are losing users, even PyCharm.''', fontsize = 14, fontname = 'monospace', color = '#313233')
    
plt.show()
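
The right merge followed by `.dropna()` silently removes every option whose label differs between the two years. A small sketch, with made-up labels and numbers, of how `indicator=True` could be used to list such options before they disappear:

```python
import pandas as pd

# Hypothetical shares; the labels only imitate wording that changes between years.
old = pd.DataFrame({"ide": ["Editor A (classic)", "Editor B", "Editor C"],
                    "percentage": [60.0, 35.0, 25.0]})
new = pd.DataFrame({"ide": ["Editor A", "Editor B", "Editor C"],
                    "percentage": [55.0, 18.0, 22.0]})

merged = old.merge(new, how="outer", on="ide",
                   suffixes=("_old", "_new"), indicator=True)

# Rows present in only one year are the ones .dropna() would remove.
unmatched = merged.loc[merged["_merge"] != "both", ["ide", "_merge"]]
print(unmatched)
```

Anything listed in `unmatched` is either a genuinely new or retired option, or a label that needs renaming before the merge.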

# Notebook products

In [None]:
nb = np.array([])
results = np.array([])
for i in sur2021.columns.tolist()[34:50]:
    nb = np.append(nb, np.array(sur2021[sur2021[i] == sur2021[i]][i].reset_index().iloc[0,1]))
    results = np.append(results, np.array(len(sur2021[sur2021[i] == sur2021[i]][i])))
nb = {'nb': nb}
nb21 = pd.DataFrame(data=nb)
nb21['results'] = results
nb21['percentage'] = round(nb21['results'] / len(sur2021) * 100, 1)
nb21.sort_values('results', inplace = True)
nb21 = nb21.query("nb != 'None'")
nb21['nb'] = nb21['nb'].apply(lambda x: x.strip())

fig = plt.figure(figsize = (15, 7.5))
clrs = ['#65b2e5' if x > 9600 else '#AACDE5' for x in nb21['results']]
plt.title('Results 2021', size = 25, y = 1.1, fontname = 'monospace', color = '#313233')
a = sns.barplot(data = nb21, x = 'percentage', y = 'nb', palette = clrs, linestyle = "-", linewidth = 1, edgecolor = "black")
plt.xticks([])
plt.yticks(fontname = 'monospace', size = 14, color = '#313233')
plt.xlabel('')
plt.ylabel('')
for j in ['right', 'top', 'bottom', 'left']:
    a.spines[j].set_visible(False)
    
for p in a.patches:
    width = p.get_width()
    plt.text(1.3 + width, p.get_y() + 0.55*p.get_height(), f'{width}%',
             ha = 'center', va = 'center', fontname = 'monospace', fontsize = 14, color = '#313233')

plt.show()
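
The highlight color in these bar charts depends on a hand-picked threshold (`12000`, `9600`, and so on). As an alternative sketch, assuming the goal is simply to emphasize the largest bars, a hypothetical helper `highlight_colors` can pick the top entries by rank:

```python
def highlight_colors(values, top_n=2, strong="#65b2e5", soft="#AACDE5"):
    """Return one color per value, using `strong` for the `top_n` largest values."""
    largest = sorted(values, reverse=True)[:top_n]
    threshold = largest[-1] if largest else float("inf")
    # Ties with the threshold value are highlighted as well.
    return [strong if v >= threshold else soft for v in values]

# Hypothetical answer counts, in the order they would be plotted.
demo_colors = highlight_colors([5, 20, 12, 1], top_n=2)
print(demo_colors)
```

The returned list could be passed as `palette` in the same way as `clrs`.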

In [None]:
nb = np.array([])
results = np.array([])
for i in sur2019.columns.tolist()[69:81]:
    nb = np.append(nb, np.array(sur2019[sur2019[i] == sur2019[i]][i].reset_index().iloc[0,1]))
    results = np.append(results, np.array(len(sur2019[sur2019[i] == sur2019[i]][i])))
nb = {'nb': nb}
nb19 = pd.DataFrame(data=nb)
nb19['results'] = results
nb19['percentage'] = round(nb19['results'] / len(sur2019) * 100, 1)
nb19.sort_values('results', inplace = True)
nb19['nb'] = nb19['nb'].apply(lambda x: x.strip())

nb19 = nb19.replace('Microsoft Azure Notebooks', 'Azure Notebooks')
nb19 = nb19.replace('Google Colab', 'Colab Notebooks')
nb19 = nb19.replace('Kaggle Notebooks (Kernels)', 'Kaggle Notebooks')
nb19 = nb19.replace('Google Cloud Notebook Products (AI Platform, Datalab, etc)', 'Google Cloud Notebook Products')
nb21 = nb21.replace('Google Cloud Notebooks (AI Platform / Vertex AI)', 'Google Cloud Notebook Products')

nb19_21 = nb19.merge(nb21, how = 'right', left_on = 'nb', right_on = 'nb')\
.rename(columns={'percentage_x': '2019', 'percentage_y': '2021'}).dropna()

fig = plt.figure(figsize = (15, 6))
plt.title('Dynamics', size = 25, y = 1.12, fontname = 'monospace', color = '#313233')
plt.grid(color = 'gray', linestyle = '-', axis = 'y', alpha = 0.1, zorder = 0)
a = sns.scatterplot(x = nb19_21['2021'], y = nb19_21['nb'], color = '#AACDE5', s = 300, edgecolor = 'black', zorder = 5)
sns.scatterplot(x = nb19_21['2019'], y = nb19_21['nb'], color = '#d9b44a', s = 100, edgecolor = 'black', zorder = 5)
plt.xticks(fontname = 'monospace', size = 15, color = '#313233')
a.xaxis.set_major_formatter(FixedFormatter(['0%', '5%', '10%', '15%', '20%', '25%', '30%', '35%']))
a.xaxis.set_major_locator(FixedLocator([0, 5, 10, 15, 20, 25, 30, 35]))
plt.yticks(fontname = 'monospace', size = 15, color = '#313233')
plt.ylabel('')
plt.xlabel('')

for j in ['right', 'top', 'left']:
    a.spines[j].set_visible(False)
for j in ['bottom']:
    a.spines[j].set_linewidth(2)
    
a.set_ylim(8, -0.3)
    
l1 = lines.Line2D([0.125, 0.9], [0.02, 0.02], transform = fig.transFigure, figure = fig, color = '#ededed', linestyle = '-',linewidth = 4)
fig.add_artist(l1)  # Axes.lines cannot be extended in recent Matplotlib releases

for i, p1, p2 in zip(nb19_21['nb'], nb19_21['2019'], nb19_21['2021']):
    newline([p1, i], [p2, i])
    
plt.figtext(0.48, 0.94, '2019', fontname = 'monospace', size = 15, color = '#d9b44a') 
plt.figtext(0.52, 0.94, '2021', fontname = 'monospace', size = 15, color = '#AACDE5')  
    
plt.figtext(0.125, -0.07, '''It would be quite strange if Kaggle Notebooks and Colab Notebooks were not popular
among the respondents of the Kaggle survey :)''', fontsize = 14, fontname = 'monospace', color = '#313233')
    
plt.show()
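
The chain of `replace` calls above can also be written as a single mapping. A minimal sketch, using the 2019 label pairs from the cell above and a few hypothetical rows:

```python
import pandas as pd

# 2019 label -> label used in the comparison (pairs copied from the cell above).
ALIASES_2019 = {
    "Microsoft Azure Notebooks": "Azure Notebooks",
    "Google Colab": "Colab Notebooks",
    "Kaggle Notebooks (Kernels)": "Kaggle Notebooks",
    "Google Cloud Notebook Products (AI Platform, Datalab, etc)": "Google Cloud Notebook Products",
}

# Hypothetical rows; only the labels matter here.
demo = pd.DataFrame({"nb": ["Google Colab", "Kaggle Notebooks (Kernels)", "Some other product"],
                     "results": [3, 5, 1]})
demo["nb"] = demo["nb"].replace(ALIASES_2019)
print(demo["nb"].tolist())
```

Keeping the aliases in one dictionary makes it easier to see which labels were harmonized and which were left untouched.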

# Specialized hardware

In [None]:
h = np.array([])
results = np.array([])
for i in sur2021.columns.tolist()[52:57]:
    h = np.append(h, np.array(sur2021[sur2021[i] == sur2021[i]][i].reset_index().iloc[0,1]))
    results = np.append(results, np.array(len(sur2021[sur2021[i] == sur2021[i]][i])))
h = {'h': h}
h21 = pd.DataFrame(data=h)
h21['results'] = results
h21['percentage'] = round(h21['results'] / len(sur2021) * 100, 1)
h21.sort_values('results', inplace = True)
h21 = h21.query("h != 'None'")
h21['h'] = h21['h'].apply(lambda x: x.strip())

fig = plt.figure(figsize = (15, 2))
clrs = ['#65b2e5' if x > 5000 else '#AACDE5' for x in h21['results']]
plt.title('Results 2021', size = 25, y = 1.2, fontname = 'monospace', color = '#313233')
a = sns.barplot(data = h21, x = 'percentage', y = 'h', palette = clrs, linestyle = "-", linewidth = 1, edgecolor = "black")
plt.xticks([])
plt.yticks(fontname = 'monospace', size = 14, color = '#313233')
plt.xlabel('')
plt.ylabel('')
for j in ['right', 'top', 'bottom', 'left']:
    a.spines[j].set_visible(False)
    
for p in a.patches:
    width = p.get_width()
    plt.text(1 + width, p.get_y() + 0.55*p.get_height(), f'{width}%',
             ha = 'center', va = 'center', fontname = 'monospace', fontsize = 14, color = '#313233')
    
l1 = lines.Line2D([0.125, 0.9], [0.02, 0.02], transform = fig.transFigure, figure = fig, color = '#ededed', linestyle = '-',linewidth = 4)
fig.add_artist(l1)  # Axes.lines cannot be extended in recent Matplotlib releases

plt.figtext(0.125, -0.4, '''In 2019, only 2.5% of participants used TPUs and more than half indicated that they used CPUs.
In the 2021 survey the CPU is no longer listed as specialized hardware, and GPUs and TPUs,
which are much more powerful, are used more widely.''', fontsize = 14, fontname = 'monospace', color = '#313233')

plt.show()
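
The gray rule under each chart is a figure-level decoration. A minimal standalone sketch of attaching such a line with `Figure.add_artist` (the `Agg` backend is chosen here only so the snippet runs without a display; it is not needed inside the notebook):

```python
import matplotlib
matplotlib.use("Agg")  # off-screen backend, assumed only for this standalone sketch
import matplotlib.pyplot as plt
import matplotlib.lines as lines

fig, ax = plt.subplots(figsize=(15, 2))

# Coordinates are fractions of the figure because of transform=fig.transFigure.
rule = lines.Line2D([0.125, 0.9], [0.02, 0.02], transform=fig.transFigure,
                    color="#ededed", linestyle="-", linewidth=4)
fig.add_artist(rule)   # the figure, not the axes, owns the line
fig.canvas.draw()
```

Because the line belongs to the figure, it does not depend on the axes limits or on the axes' own list of lines.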

# ML frameworks

In [None]:
ml = np.array([])
results = np.array([])
for i in sur2021.columns.tolist()[72:89]:
    ml = np.append(ml, np.array(sur2021[sur2021[i] == sur2021[i]][i].reset_index().iloc[0,1]))
    results = np.append(results, np.array(len(sur2021[sur2021[i] == sur2021[i]][i])))
ml = {'ml': ml}
ml21 = pd.DataFrame(data=ml)
ml21['results'] = results
ml21['percentage'] = round(ml21['results'] / len(sur2021) * 100, 1)
ml21.sort_values('results', inplace = True)
ml21 = ml21.query("ml != 'None'")
ml21['ml'] = ml21['ml'].apply(lambda x: x.strip())
ml21 = ml21.replace('H2O 3', 'H20')

fig = plt.figure(figsize = (15, 7.5))
clrs = ['#65b2e5' if x > 10000 else '#AACDE5' for x in ml21['results']]
plt.title('Results 2021', size = 25, y = 1.1, fontname = 'monospace', color = '#313233')
a = sns.barplot(data = ml21, x = 'percentage', y = 'ml', palette = clrs, linestyle = "-", linewidth = 1, edgecolor = "black")
plt.xticks([])
plt.yticks(fontname = 'monospace', size = 14, color = '#313233')
plt.xlabel('')
plt.ylabel('')
for j in ['right', 'top', 'bottom', 'left']:
    a.spines[j].set_visible(False)
    
for p in a.patches:
    width = p.get_width()
    plt.text(2 + width, p.get_y() + 0.55*p.get_height(), f'{width}%',
             ha = 'center', va = 'center', fontname = 'monospace', fontsize = 14, color = '#313233')

plt.show()

In [None]:
ml = np.array([])
results = np.array([])
for i in sur2018.columns.tolist()[88:107]:
    ml = np.append(ml, np.array(sur2018[sur2018[i] == sur2018[i]][i].reset_index().iloc[0,1]))
    results = np.append(results, np.array(len(sur2018[sur2018[i] == sur2018[i]][i])))
ml = {'ml': ml}
ml18 = pd.DataFrame(data=ml)
ml18['results'] = results
ml18['percentage'] = round(ml18['results'] / len(sur2018) * 100, 1)
ml18.sort_values('results', inplace = True)
ml18['ml'] = ml18['ml'].apply(lambda x: x.strip())
ml18 = ml18.replace('catboost', 'CatBoost')
ml18 = ml18.replace('lightgbm', 'LightGBM')
ml18 = ml18.replace('Scikit-Learn', 'Scikit-learn')
ml18 = ml18.replace('Mxnet', 'MXNet')
ml18 = ml18.replace('Fastai', 'Fast.ai')

ml18_21 = ml18.merge(ml21, how = 'right', left_on = 'ml', right_on = 'ml')\
.rename(columns={'percentage_x': '2018', 'percentage_y': '2021'}).dropna()

fig = plt.figure(figsize = (15, 10))
plt.title('Dynamics', size = 25, y = 1.1, fontname = 'monospace', color = '#313233')
plt.grid(color = 'gray', linestyle = '-', axis = 'y', alpha = 0.1, zorder = 0)
a = sns.scatterplot(x = ml18_21['2021'], y = ml18_21['ml'], color = '#AACDE5', s = 300, edgecolor = 'black', zorder = 5)
sns.scatterplot(x = ml18_21['2018'], y = ml18_21['ml'], color = '#d9b44a', s = 100, edgecolor = 'black', zorder = 5)
plt.xticks(fontname = 'monospace', size = 15, color = '#313233')
a.xaxis.set_major_formatter(FixedFormatter(['0%', '10%', '20%', '30%', '40%', '50%']))
a.xaxis.set_major_locator(FixedLocator([0, 10, 20, 30, 40, 50]))
plt.yticks(fontname = 'monospace', size = 15, color = '#313233')
plt.ylabel('')
plt.xlabel('')

for j in ['right', 'top', 'left']:
    a.spines[j].set_visible(False)
for j in ['bottom']:
    a.spines[j].set_linewidth(2)
    
l1 = lines.Line2D([0.125, 0.9], [0.06, 0.06], transform = fig.transFigure, figure = fig, color = '#ededed', linestyle = '-',linewidth = 4)
fig.add_artist(l1)  # Axes.lines cannot be extended in recent Matplotlib releases

for i, p1, p2 in zip(ml18_21['ml'], ml18_21['2018'], ml18_21['2021']):
    newline([p1, i], [p2, i])
    
plt.figtext(0.48, 0.935, '2018', fontname = 'monospace', size = 15, color = '#d9b44a') 
plt.figtext(0.52, 0.935, '2021', fontname = 'monospace', size = 15, color = '#AACDE5') 
    
plt.figtext(0.125, -0.01, '''In general, framework preferences have not changed much in 3 years, except for a
decrease in the use of TensorFlow, Keras, Caret and H2O. PyTorch has become more popular.''', fontsize = 14, fontname = 'monospace', color = '#313233')
    
plt.show()
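
To back up a statement about which tools gained or lost users, the merged table can be ranked by the difference between the two years. A sketch with hypothetical framework names and shares (none of these numbers come from the survey):

```python
import pandas as pd

# Hypothetical shares in percent; not taken from the survey.
shares = pd.DataFrame({"ml": ["Framework A", "Framework B", "Framework C"],
                       "2018": [40.0, 12.0, 30.0],
                       "2021": [35.5, 20.0, 30.5]})

# Difference in percentage points, largest decrease first.
shares["change_pp"] = (shares["2021"] - shares["2018"]).round(1)
ranked = shares.sort_values("change_pp").reset_index(drop=True)
print(ranked)
```

The same two lines would apply to a table shaped like `ml18_21`, assuming its year columns are named `'2018'` and `'2021'` as above.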

# ML algorithms

In [None]:
ml = np.array([])
results = np.array([])
for i in sur2021.columns.tolist()[90:101]:
    ml = np.append(ml, np.array(sur2021[sur2021[i] == sur2021[i]][i].reset_index().iloc[0,1]))
    results = np.append(results, np.array(len(sur2021[sur2021[i] == sur2021[i]][i])))
ml = {'ml': ml}
ml21 = pd.DataFrame(data=ml)
ml21['results'] = results
ml21['percentage'] = round(ml21['results'] / len(sur2021) * 100, 1)
ml21.sort_values('results', inplace = True)
ml21 = ml21.query("ml != 'None'")
ml21['ml'] = ml21['ml'].apply(lambda x: x.strip())

fig = plt.figure(figsize = (15, 5))
clrs = ['#65b2e5' if x > 12000 else '#AACDE5' for x in ml21['results']]
plt.title('Results 2021', size = 25, y = 1.1, fontname = 'monospace', color = '#313233')
a = sns.barplot(data = ml21, x = 'percentage', y = 'ml', palette = clrs, linestyle = "-", linewidth = 1, edgecolor = "black")
plt.xticks([])
plt.yticks(fontname = 'monospace', size = 14, color = '#313233')
plt.xlabel('')
plt.ylabel('')
for j in ['right', 'top', 'bottom', 'left']:
    a.spines[j].set_visible(False)
for j in ['bottom']:
    a.spines[j].set_linewidth(2)
    
for p in a.patches:
    width = p.get_width()
    plt.text(2 + width, p.get_y() + 0.55*p.get_height(), f'{width}%',
             ha = 'center', va = 'center', fontname = 'monospace', fontsize = 14, color = '#313233')

plt.show()

In [None]:
ml = np.array([])
results = np.array([])
for i in sur2019.columns.tolist()[118:130]:
    ml = np.append(ml, np.array(sur2019[sur2019[i] == sur2019[i]][i].reset_index().iloc[0,1]))
    results = np.append(results, np.array(len(sur2019[sur2019[i] == sur2019[i]][i])))
ml = {'ml': ml}
ml19 = pd.DataFrame(data=ml)
ml19['results'] = results
ml19['percentage'] = round(ml19['results'] / len(sur2019) * 100, 1)
ml19.sort_values('results', inplace = True)
ml19['ml'] = ml19['ml'].apply(lambda x: x.strip())
ml19 = ml19.replace('Transformer Networks (BERT, gpt-2, etc)', 'Transformer Networks')
ml19 = ml19.replace('Gradient Boosting Machines (xgboost, lightgbm, etc)', 'Gradient Boosting Machines')
ml19 = ml19.replace('Dense Neural Networks (MLPs, etc)', 'Dense Neural Networks')
ml21 = ml21.replace('Transformer Networks (BERT, gpt-3, etc)', 'Transformer Networks')
ml21 = ml21.replace('Gradient Boosting Machines (xgboost, lightgbm, etc)', 'Gradient Boosting Machines')
ml21 = ml21.replace('Dense Neural Networks (MLPs, etc)', 'Dense Neural Networks')

ml19_21 = ml19.merge(ml21, how = 'right', left_on = 'ml', right_on = 'ml')\
.rename(columns={'percentage_x': '2019', 'percentage_y': '2021'}).dropna()

fig = plt.figure(figsize = (15, 8))
plt.title('Dynamics', size = 25, y = 1.1, fontname = 'monospace', color = '#313233')
plt.grid(color = 'gray', linestyle = '-', axis = 'y', alpha = 0.1, zorder = 0)
a = sns.scatterplot(x = ml19_21['2021'], y = ml19_21['ml'], color = '#AACDE5', s = 300, edgecolor = 'black', zorder = 5)
sns.scatterplot(x = ml19_21['2019'], y = ml19_21['ml'], color = '#d9b44a', s = 100, edgecolor = 'black', zorder = 5)
plt.xticks(fontname = 'monospace', size = 15, color = '#313233')
a.xaxis.set_major_formatter(FixedFormatter(['10%', '20%', '30%', '40%', '50%', '60%']))
a.xaxis.set_major_locator(FixedLocator([10, 20, 30, 40, 50, 60]))
plt.yticks(fontname = 'monospace', size = 15, color = '#313233')
plt.ylabel('')
plt.xlabel('')

for j in ['right', 'top', 'left']:
    a.spines[j].set_visible(False)
for j in ['bottom']:
    a.spines[j].set_linewidth(2)
    
a.set_ylim(10, -0.3)
    
l1 = lines.Line2D([0.125, 0.9], [0.06, 0.06], transform = fig.transFigure, figure = fig, color = '#ededed', linestyle = '-',linewidth = 4)
fig.add_artist(l1)  # Axes.lines cannot be extended in recent Matplotlib releases

for i, p1, p2 in zip(ml19_21['ml'], ml19_21['2019'], ml19_21['2021']):
    newline([p1, i], [p2, i])
    
plt.figtext(0.48, 0.93, '2019', fontname = 'monospace', size = 15, color = '#d9b44a') 
plt.figtext(0.52, 0.93, '2021', fontname = 'monospace', size = 15, color = '#AACDE5')
    
plt.figtext(0.125, 0.01, '''In general, there have been no major changes, but interest in Transformer Networks has grown.''', fontsize = 14, fontname = 'monospace', color = '#313233')
    
plt.show()
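
The percent labels on the x axis are typed by hand in each `FixedFormatter` list. As a possible alternative, sketched here on hypothetical data, `PercentFormatter` from `matplotlib.ticker` can generate them from the tick values:

```python
import matplotlib
matplotlib.use("Agg")  # off-screen backend, assumed only for this standalone sketch
import matplotlib.pyplot as plt
from matplotlib.ticker import MultipleLocator, PercentFormatter

fig, ax = plt.subplots()
ax.scatter([12.5, 48.0], [0, 1])                  # hypothetical shares in percent
ax.xaxis.set_major_locator(MultipleLocator(10))   # one tick every 10 units

# xmax=100 means the data are already in percent; decimals=0 gives '10%', '20%', ...
pct = PercentFormatter(xmax=100, decimals=0)
ax.xaxis.set_major_formatter(pct)
fig.canvas.draw()
```

With this approach the labels cannot drift out of sync with the tick positions when the range of the data changes.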

# Computer vision methods

In [None]:
ml = np.array([])
results = np.array([])
for i in sur2021.columns.tolist()[102:108]:
    ml = np.append(ml, np.array(sur2021[sur2021[i] == sur2021[i]][i].reset_index().iloc[0,1]))
    results = np.append(results, np.array(len(sur2021[sur2021[i] == sur2021[i]][i])))
ml = {'ml': ml}
ml21 = pd.DataFrame(data=ml)
ml21['results'] = results
ml21['percentage'] = round(ml21['results'] / len(sur2021) * 100, 1)
ml21.sort_values('results', inplace = True)
ml21 = ml21.query("ml != 'None'")
ml21['ml'] = ml21['ml'].apply(lambda x: x.strip())
ml21 = ml21.replace('Image classification and other general purpose networks (VGG, Inception, ResNet, ResNeXt, NASNet, EfficientNet, etc)',
                    'Image classification and other general purpose networks')
ml21 = ml21.replace('Image segmentation methods (U-Net, Mask R-CNN, etc)',
                    'Image segmentation methods')
ml21 = ml21.replace('Object detection methods (YOLOv3, RetinaNet, etc)',
                    'Object detection methods')
ml21 = ml21.replace('General purpose image/video tools (PIL, cv2, skimage, etc)',
                    'General purpose image/video tools')
ml21 = ml21.replace('Generative Networks (GAN, VAE, etc)',
                    'Generative Networks')

fig = plt.figure(figsize = (15, 2.5))
clrs = ['#65b2e5' if x > 3000 else '#AACDE5' for x in ml21['results']]
plt.title('Results 2021', size = 25, y = 1.1, fontname = 'monospace', color = '#313233')
a = sns.barplot(data = ml21, x = 'percentage', y = 'ml', palette = clrs, linestyle = "-", linewidth = 1, edgecolor = "black")
plt.xticks([])
plt.yticks(fontname = 'monospace', size = 14, color = '#313233')
plt.xlabel('')
plt.ylabel('')
for j in ['right', 'top', 'bottom', 'left']:
    a.spines[j].set_visible(False)
    
for p in a.patches:
    width = p.get_width()
    plt.text(0.5 + width, p.get_y() + 0.55*p.get_height(), f'{width}%',
             ha = 'center', va = 'center', fontname = 'monospace', fontsize = 14, color = '#313233')

plt.show()

In [None]:
ml = np.array([])
results = np.array([])
for i in sur2019.columns.tolist()[140:147]:
    ml = np.append(ml, np.array(sur2019[sur2019[i] == sur2019[i]][i].reset_index().iloc[0,1]))
    results = np.append(results, np.array(len(sur2019[sur2019[i] == sur2019[i]][i])))
ml = {'ml': ml}
ml19 = pd.DataFrame(data=ml)
ml19['results'] = results
ml19['percentage'] = round(ml19['results'] / len(sur2019) * 100, 1)
ml19.sort_values('results', inplace = True)
ml19['ml'] = ml19['ml'].apply(lambda x: x.strip())
ml19 = ml19.replace('Image classification and other general purpose networks (VGG, Inception, ResNet, ResNeXt, NASNet, EfficientNet, etc)',
                    'Image classification and other general purpose networks')
ml19 = ml19.replace('Image segmentation methods (U-Net, Mask R-CNN, etc)',
                    'Image segmentation methods')
ml19 = ml19.replace('Object detection methods (YOLOv3, RetinaNet, etc)',
                    'Object detection methods')
ml19 = ml19.replace('General purpose image/video tools (PIL, cv2, skimage, etc)',
                    'General purpose image/video tools')
ml19 = ml19.replace('Generative Networks (GAN, VAE, etc)',
                    'Generative Networks')

ml19_21 = ml19.merge(ml21, how = 'right', left_on = 'ml', right_on = 'ml')\
.rename(columns={'percentage_x': '2019', 'percentage_y': '2021'}).dropna()

fig = plt.figure(figsize = (15, 4.5))
plt.title('Dynamics', size = 25, y = 1.08, fontname = 'monospace', color = '#313233')
plt.grid(color = 'gray', linestyle = '-', axis = 'y', alpha = 0.1, zorder = 0)
a = sns.scatterplot(x = ml19_21['2021'], y = ml19_21['ml'], color = '#AACDE5', s = 300, edgecolor = 'black', zorder = 5)
sns.scatterplot(x = ml19_21['2019'], y = ml19_21['ml'], color = '#d9b44a', s = 100, edgecolor = 'black', zorder = 5)
plt.xticks(fontname = 'monospace', size = 15, color = '#313233')
a.xaxis.set_major_formatter(FixedFormatter(['6%', '8%', '10%', '12%', '14%', '16%']))
a.xaxis.set_major_locator(FixedLocator([6, 8, 10, 12, 14, 16]))
plt.yticks(fontname = 'monospace', size = 14, color = '#313233')
plt.ylabel('')
plt.xlabel('')

for j in ['right', 'top', 'left']:
    a.spines[j].set_visible(False)
for j in ['bottom']:
    a.spines[j].set_linewidth(2)
    
a.set_ylim(5, -0.3)
    
l1 = lines.Line2D([0.125, 0.9], [0.02, 0.02], transform = fig.transFigure, figure = fig, color = '#ededed', linestyle = '-',linewidth = 4)
fig.add_artist(l1)  # Axes.lines cannot be extended in recent Matplotlib releases

for i, p1, p2 in zip(ml19_21['ml'], ml19_21['2019'], ml19_21['2021']):
    newline([p1, i], [p2, i])
    
plt.figtext(0.48, 0.9, '2019', fontname = 'monospace', size = 15, color = '#d9b44a') 
plt.figtext(0.52, 0.9, '2021', fontname = 'monospace', size = 15, color = '#AACDE5') 
    
plt.figtext(0.125, -0.22, '''There is a slight increase in Image classification and other general purpose networks
(VGG, Inception, ResNet, ResNeXt) and Object detection methods (YOLOv3, RetinaNet).
There is also a slight decrease in General purpose
image/video tools (PIL, cv2, skimage).''', fontsize = 14, fontname = 'monospace', color = '#313233')
    
plt.show()
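
Most of the `replace` calls in this section only remove the examples in parentheses from an answer label. A sketch of doing that with one regular expression (the helper name `strip_examples` is hypothetical; the labels are the ones used above):

```python
import re

def strip_examples(label):
    """Drop a trailing parenthetical such as ' (U-Net, Mask R-CNN, etc)'."""
    return re.sub(r"\s*\([^()]*\)\s*$", "", label).strip()

labels = [
    "Image segmentation methods (U-Net, Mask R-CNN, etc)",
    "Object detection methods (YOLOv3, RetinaNet, etc)",
    "Generative Networks (GAN, VAE, etc)",
    "None",
]
short = [strip_examples(x) for x in labels]
print(short)
```

Since the examples in parentheses may differ between survey years, applying the same function to both years would also make the labels match for the merge.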

# Natural language processing methods

In [None]:
ml = np.array([])
results = np.array([])
for i in sur2021.columns.tolist()[109:114]:
    ml = np.append(ml, np.array(sur2021[sur2021[i] == sur2021[i]][i].reset_index().iloc[0,1]))
    results = np.append(results, np.array(len(sur2021[sur2021[i] == sur2021[i]][i])))
ml = {'ml': ml}
ml21 = pd.DataFrame(data=ml)
ml21['results'] = results
ml21['percentage'] = round(ml21['results'] / len(sur2021) * 100, 1)
ml21.sort_values('results', inplace = True)
ml21 = ml21.query("ml != 'None'")
ml21['ml'] = ml21['ml'].apply(lambda x: x.strip())
ml21 = ml21.replace('Contextualized embeddings (ELMo, CoVe)',
                    'Contextualized embeddings')
ml21 = ml21.replace('Encoder-decorder models (seq2seq, vanilla transformers)',
                    'Encoder-decorder models')
ml21 = ml21.replace('Transformer language models (GPT-3, BERT, XLnet, etc)',
                    'Transformer language models')
ml21 = ml21.replace('Word embeddings/vectors (GLoVe, fastText, word2vec)',
                    'Word embeddings/vectors')

fig = plt.figure(figsize = (15, 2))
clrs = ['#65b2e5' if x > 2400 else '#AACDE5' for x in ml21['results']]
plt.title('Results 2021', size = 25, y = 1.1, fontname = 'monospace', color = '#313233')
a = sns.barplot(data = ml21, x = 'percentage', y = 'ml', palette = clrs, linestyle = "-", linewidth = 1, edgecolor = "black")
plt.xticks([])
plt.yticks(fontname = 'monospace', size = 14, color = '#313233')
plt.xlabel('')
plt.ylabel('')
for j in ['right', 'top', 'bottom', 'left']:
    a.spines[j].set_visible(False)
for j in ['bottom']:
    a.spines[j].set_linewidth(2)
    
for p in a.patches:
    width = p.get_width()
    plt.text(0.3 + width, p.get_y() + 0.55*p.get_height(), f'{width}%',
             ha = 'center', va = 'center', fontname = 'monospace', fontsize = 14, color = '#313233')

plt.show()

In [None]:
ml = np.array([])
results = np.array([])
for i in sur2019.columns.tolist()[148:154]:
    ml = np.append(ml, np.array(sur2019[sur2019[i] == sur2019[i]][i].reset_index().iloc[0,1]))
    results = np.append(results, np.array(len(sur2019[sur2019[i] == sur2019[i]][i])))
ml = {'ml': ml}
ml19 = pd.DataFrame(data=ml)
ml19['results'] = results
ml19['percentage'] = round(ml19['results'] / len(sur2019) * 100, 1)
ml19.sort_values('results', inplace = True)
ml19['ml'] = ml19['ml'].apply(lambda x: x.strip())
ml19 = ml19.replace('Transformer language models (GPT-2, BERT, XLnet, etc)', 'Transformer language models')
ml19 = ml19.replace('Contextualized embeddings (ELMo, CoVe)',
                    'Contextualized embeddings')
ml19 = ml19.replace('Encoder-decorder models (seq2seq, vanilla transformers)',
                    'Encoder-decorder models')
ml19 = ml19.replace('Word embeddings/vectors (GLoVe, fastText, word2vec)',
                    'Word embeddings/vectors')

ml19_21 = ml19.merge(ml21, how = 'right', left_on = 'ml', right_on = 'ml')\
.rename(columns={'percentage_x': '2019', 'percentage_y': '2021'}).dropna()

fig = plt.figure(figsize = (15, 4))
plt.title('Dynamics', size = 25, y = 1.08, fontname = 'monospace', color = '#313233')
plt.grid(color = 'gray', linestyle = '-', axis = 'y', alpha = 0.1, zorder = 0)
a = sns.scatterplot(x = ml19_21['2021'], y = ml19_21['ml'], color = '#AACDE5', s = 300, edgecolor = 'black', zorder = 5)
sns.scatterplot(x = ml19_21['2019'], y = ml19_21['ml'], color = '#d9b44a', s = 100, edgecolor = 'black', zorder = 5)
plt.xticks(fontname = 'monospace', size = 15, color = '#313233')
a.xaxis.set_major_formatter(FixedFormatter(['3%', '4%', '5%', '6%', '7%', '8%', '9%', '10%', '11%']))
a.xaxis.set_major_locator(FixedLocator([3,4,5,6,7,8,9,10,11]))
plt.yticks(fontname = 'monospace', size = 14, color = '#313233')
plt.ylabel('')
plt.xlabel('')

for j in ['right', 'top', 'left']:
    a.spines[j].set_visible(False)
for j in ['bottom']:
    a.spines[j].set_linewidth(2)
    
a.set_ylim(4, -0.3)

l1 = lines.Line2D([0.125, 0.9], [0, 0], transform = fig.transFigure, figure = fig, color = '#ededed', linestyle = '-', linewidth = 4)
fig.add_artist(l1)  # Axes.lines cannot be extended in recent Matplotlib releases

for i, p1, p2 in zip(ml19_21['ml'], ml19_21['2019'], ml19_21['2021']):
    newline([p1, i], [p2, i])
    
plt.figtext(0.48, 0.9, '2019', fontname = 'monospace', size = 15, color = '#d9b44a') 
plt.figtext(0.52, 0.9, '2021', fontname = 'monospace', size = 15, color = '#AACDE5') 
    
plt.figtext(0.125, -0.2, '''Here it gets more interesting. There is a sharp increase in interest in Transformer
language models (GPT-3, BERT, XLnet). Encoder-decoder models (seq2seq, vanilla transformers)
also show a slight increase.''', fontsize = 14, fontname = 'monospace', color = '#313233')
    
plt.show()

# Automated machine learning tools

In [None]:
ml = np.array([])
results = np.array([])
for i in sur2021.columns.tolist()[213:219]:
    ml = np.append(ml, np.array(sur2021[sur2021[i] == sur2021[i]][i].reset_index().iloc[0,1]))
    results = np.append(results, np.array(len(sur2021[sur2021[i] == sur2021[i]][i])))
ml = {'ml': ml}
ml21 = pd.DataFrame(data=ml)
ml21['results'] = results
ml21['percentage'] = round(ml21['results'] / len(sur2021) * 100, 2)
ml21.sort_values('results', inplace = True)
ml21 = ml21.query("ml != 'None'")
ml21['ml'] = ml21['ml'].apply(lambda x: x.strip())

fig = plt.figure(figsize = (15, 3))
clrs = ['#65b2e5' if x > 500 else '#AACDE5' for x in ml21['results']]
plt.title('Results 2021', size = 25, y = 1.1, fontname = 'monospace', color = '#313233')
a = sns.barplot(data = ml21, x = 'percentage', y = 'ml', palette = clrs, linestyle = "-", linewidth = 1, edgecolor = "black")
plt.xticks([])
plt.yticks(fontname = 'monospace', size = 14, color = '#313233')
plt.xlabel('')
plt.ylabel('')
for j in ['right', 'top', 'bottom', 'left']:
    a.spines[j].set_visible(False)
for j in ['bottom']:
    a.spines[j].set_linewidth(2)
    
for p in a.patches:
    width = p.get_width()
    plt.text(0.1 + width, p.get_y() + 0.55*p.get_height(), f'{width}%',
             ha = 'center', va = 'center', fontname = 'monospace', fontsize = 14, color = '#313233')

plt.show()

In [None]:
ml = np.array([])
results = np.array([])
for i in sur2019.columns.tolist()[220:232]:
    ml = np.append(ml, np.array(sur2019[sur2019[i] == sur2019[i]][i].reset_index().iloc[0,1]))
    results = np.append(results, np.array(len(sur2019[sur2019[i] == sur2019[i]][i])))
ml = {'ml': ml}
ml19 = pd.DataFrame(data=ml)
ml19['results'] = results
ml19['percentage'] = round(ml19['results'] / len(sur2019) * 100, 1)
ml19.sort_values('results', inplace = True)
ml19['ml'] = ml19['ml'].apply(lambda x: x.strip())
ml19 = ml19.replace('Google AutoML', 'Google Cloud AutoML')
ml19 = ml19.replace('H20 Driverless AI', 'H2O Driverless AI')

ml19_21 = ml19.merge(ml21, how = 'right', left_on = 'ml', right_on = 'ml')\
.rename(columns={'percentage_x': '2019', 'percentage_y': '2021'}).dropna()

fig = plt.figure(figsize = (15, 4))
plt.title('Dynamics', size = 25, y = 1.08, fontname = 'monospace', color = '#313233')
plt.grid(color = 'gray', linestyle = '-', axis = 'y', alpha = 0.1, zorder = 0)
a = sns.scatterplot(x = ml19_21['2021'], y = ml19_21['ml'], color = '#AACDE5', s = 300, edgecolor = 'black', zorder = 5)
sns.scatterplot(x = ml19_21['2019'], y = ml19_21['ml'], color = '#d9b44a', s = 100, edgecolor = 'black', zorder = 5)
plt.xticks(fontname = 'monospace', size = 15, color = '#313233')
a.xaxis.set_major_formatter(FixedFormatter(['1%', '1.25%', '1.5%', '1.75%', '2%', '2.25%', '2.5%', '2.75%']))
a.xaxis.set_major_locator(FixedLocator([1,1.25,1.5,1.75,2,2.25,2.5,2.75]))
plt.yticks(fontname = 'monospace', size = 14, color = '#313233')
plt.ylabel('')
plt.xlabel('')

for j in ['right', 'top', 'left']:
    a.spines[j].set_visible(False)
for j in ['bottom']:
    a.spines[j].set_linewidth(2)
    
a.set_ylim(4, -0.3)

l1 = lines.Line2D([0.125, 0.9], [0, 0], transform = fig.transFigure, figure = fig, color = '#ededed', linestyle = '-', linewidth = 4)
fig.add_artist(l1)  # Axes.lines cannot be extended in recent Matplotlib releases

for i, p1, p2 in zip(ml19_21['ml'], ml19_21['2019'], ml19_21['2021']):
    newline([p1, i], [p2, i])
    
plt.figtext(0.48, 0.9, '2019', fontname = 'monospace', size = 15, color = '#d9b44a') 
plt.figtext(0.52, 0.9, '2021', fontname = 'monospace', size = 15, color = '#AACDE5') 
    
plt.figtext(0.125, -0.17, '''Google Cloud AutoML holds the leading position in 2021. In 2019 the leader was Auto-Sklearn,
which for some reason was not among the response options of the 2021 survey.''', fontsize = 14, fontname = 'monospace', color = '#313233')
    
plt.show()

# Data visualization libraries

In [None]:
ml = np.array([])
results = np.array([])
for i in sur2021.columns.tolist()[59:70]:
    ml = np.append(ml, np.array(sur2021[sur2021[i] == sur2021[i]][i].reset_index().iloc[0,1]))
    results = np.append(results, np.array(len(sur2021[sur2021[i] == sur2021[i]][i])))
ml = {'ml': ml}
ml21 = pd.DataFrame(data=ml)
ml21['results'] = results
ml21['percentage'] = round(ml21['results'] / len(sur2021) * 100, 1)
ml21.sort_values('results', inplace = True)
ml21 = ml21.query("ml != 'None'")
ml21['ml'] = ml21['ml'].apply(lambda x: x.strip())

fig = plt.figure(figsize = (15, 5))
clrs = ['#65b2e5' if x > 14000 else '#AACDE5' for x in ml21['results']]
plt.title('Results 2021', size = 25, y = 1.1, fontname = 'monospace', color = '#313233')
a = sns.barplot(data = ml21, x = 'percentage', y = 'ml', palette = clrs, linestyle = "-", linewidth = 1, edgecolor = "black")
plt.xticks([])
plt.yticks(fontname = 'monospace', size = 14, color = '#313233')
plt.xlabel('')
plt.ylabel('')
for j in ['right', 'top', 'bottom', 'left']:
    a.spines[j].set_visible(False)
for j in ['bottom']:
    a.spines[j].set_linewidth(2)
    
for p in a.patches:
    width = p.get_width()
    plt.text(2.2 + width, p.get_y() + 0.55*p.get_height(), f'{width}%',
             ha = 'center', va = 'center', fontname = 'monospace', fontsize = 14, color = '#313233')

plt.show()

In [None]:
ml = np.array([])
results = np.array([])
for i in sur2018.columns.tolist()[110:123]:
    ml = np.append(ml, np.array(sur2018[sur2018[i] == sur2018[i]][i].reset_index().iloc[0,1]))
    results = np.append(results, np.array(len(sur2018[sur2018[i] == sur2018[i]][i])))
ml = {'ml': ml}
ml18 = pd.DataFrame(data=ml)
ml18['results'] = results
ml18['percentage'] = round(ml18['results'] / len(sur2018) * 100, 1)
ml18.sort_values('results', inplace = True)
ml18['ml'] = ml18['ml'].apply(lambda x: x.strip())
ml18 = ml18.replace('ggplot2', 'Ggplot / ggplot2')
ml18 = ml18.replace('Plotly', 'Plotly / Plotly Express')
ml18 = ml18.replace('D3', 'D3 js')


ml18_21 = ml18.merge(ml21, how = 'right', left_on = 'ml', right_on = 'ml')\
.rename(columns={'percentage_x': '2018', 'percentage_y': '2021'}).dropna()

fig = plt.figure(figsize = (15, 7))
plt.title('Dynamics', size = 25, y = 1.08, fontname = 'monospace', color = '#313233')
plt.grid(color = 'gray', linestyle = '-', axis = 'y', alpha = 0.1, zorder = 0)
a = sns.scatterplot(x = ml18_21['2021'], y = ml18_21['ml'], color = '#AACDE5', s = 300, edgecolor = 'black', zorder = 5)
sns.scatterplot(x = ml18_21['2018'], y = ml18_21['ml'], color = '#d9b44a', s = 100, edgecolor = 'black', zorder = 5)
plt.xticks(fontname = 'monospace', size = 15, color = '#313233')
a.xaxis.set_major_formatter(FixedFormatter(['0%', '10%', '20%', '30%', '40%', '50%', '60%', '70%']))
a.xaxis.set_major_locator(FixedLocator([0,10,20,30,40,50,60,70]))
plt.yticks(fontname = 'monospace', size = 14, color = '#313233')
plt.ylabel('')
plt.xlabel('')

for j in ['right', 'top', 'left']:
    a.spines[j].set_visible(False)
for j in ['bottom']:
    a.spines[j].set_linewidth(2)
    
a.set_ylim(9, -0.3)

l1 = lines.Line2D([0.125, 0.9], [0.05, 0.05], transform = fig.transFigure, figure = fig, color = '#ededed', linestyle = '-', linewidth = 4)
fig.add_artist(l1)  # Axes.lines cannot be extended in recent Matplotlib releases

for i, p1, p2 in zip(ml18_21['ml'], ml18_21['2018'], ml18_21['2021']):
    newline([p1, i], [p2, i])
    
plt.figtext(0.48, 0.91, '2018', fontname = 'monospace', size = 15, color = '#d9b44a') 
plt.figtext(0.52, 0.91, '2021', fontname = 'monospace', size = 15, color = '#AACDE5') 
    
plt.figtext(0.125, -0.07, '''Matplotlib and Seaborn are the most popular libraries: the share of respondents using them
grew by more than 10 percentage points between the 2018 and 2021 surveys. A likely reason is
that both are Python libraries, while ggplot2, an R library, lost more than 10 percentage points.''', fontsize = 14, fontname = 'monospace', color = '#313233')
    
plt.show()
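
The counting loop used in the cells above repeats for every question and year, so it could be wrapped in a small helper. Below is a minimal sketch of that idea, not the code the notebook actually runs. The name `option_shares` and the toy data are my own assumptions for illustration. Unlike the original loop, it skips a column that nobody selected instead of raising an error.

```python
import numpy as np
import pandas as pd


def option_shares(df, columns):
    """Hypothetical helper: one row per multi-select option with its count and share."""
    rows = []
    for col in columns:
        answers = df[col].dropna()  # keep only respondents who picked this option
        if answers.empty:
            continue  # nobody selected this option
        rows.append({
            'ml': str(answers.iloc[0]).strip(),                    # option label
            'results': len(answers),                               # number of respondents
            'percentage': round(len(answers) / len(df) * 100, 1),  # share of all respondents
        })
    out = pd.DataFrame(rows, columns=['ml', 'results', 'percentage'])
    return out.sort_values('results').reset_index(drop=True)


# Toy data, made up for illustration (not survey results).
toy = pd.DataFrame({
    'Q_Part_1': ['Matplotlib', np.nan, 'Matplotlib', 'Matplotlib'],
    'Q_Part_2': [np.nan, ' Seaborn ', np.nan, ' Seaborn '],
})
shares = option_shares(toy, toy.columns)
print(shares)
```

With the real data the call would look like `option_shares(sur2018, sur2018.columns[110:123])`, assuming the column slice matches the question of interest.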

# Big Data products

In [None]:
ml = np.array([])
results = np.array([])
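# Each column in this slice is one answer option of the big data products question:
# a cell holds the option label if the respondent picked it, NaN otherwise.
for i in sur2021.columns.tolist()[165:185]:
    answered = sur2021[i].dropna()
    ml = np.append(ml, answered.iloc[0])         # option label
    results = np.append(results, len(answered))  # number of respondents who picked it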
ml = {'ml': ml}
ml21 = pd.DataFrame(data=ml)
ml21['results'] = results
ml21['percentage'] = round(ml21['results'] / len(sur2021) * 100, 1)
ml21.sort_values('results', inplace = True)
# Strip whitespace first so the 'None' filter also catches padded labels
ml21['ml'] = ml21['ml'].apply(lambda x: x.strip())
ml21 = ml21.query("ml != 'None'")

fig = plt.figure(figsize = (15, 10))
clrs = ['#65b2e5' if x > 3000 else '#AACDE5' for x in ml21['results']]
plt.title('Results 2021', size = 25, y = 1.1, fontname = 'monospace', color = '#313233')
a = sns.barplot(data = ml21, x = 'percentage', y = 'ml', palette = clrs, linestyle = "-", linewidth = 1, edgecolor = "black")
plt.xticks([])
plt.yticks(fontname = 'monospace', size = 14, color = '#313233')
plt.xlabel('')
plt.ylabel('')
for j in ['right', 'top', 'bottom', 'left']:
    a.spines[j].set_visible(False)
for j in ['bottom']:
    a.spines[j].set_linewidth(2)
    
for p in a.patches:
    width = p.get_width()
    plt.text(0.5 + width, p.get_y() + 0.55*p.get_height(), f'{width}%',
             ha = 'center', va = 'center', fontname = 'monospace', fontsize = 14, color = '#313233')

plt.show()

In [None]:
ml = np.array([])
results = np.array([])
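# Each column in this slice is one answer option of a multi-select question:
# a cell holds the option label if the respondent picked it, NaN otherwise.
for i in sur2018.columns.tolist()[195:223]:
    answered = sur2018[i].dropna()
    ml = np.append(ml, answered.iloc[0])         # option label
    results = np.append(results, len(answered))  # number of respondents who picked it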
ml = {'ml': ml}
ml18 = pd.DataFrame(data=ml)
ml18['results'] = results
ml18['percentage'] = round(ml18['results'] / len(sur2018) * 100, 1)
ml18.sort_values('results', inplace = True)
ml18['ml'] = ml18['ml'].apply(lambda x: x.strip())
ml18 = ml18.replace('AWS Aurora', 'Amazon Aurora')
ml18 = ml18.replace('Google Cloud Bigtable', 'Google Cloud BigTable')
ml18 = ml18.replace('IBM Cloud Db2', 'IBM Db2')
ml18 = ml18.replace('Azure Cosmos DB', 'Microsoft Azure Cosmos DB')
ml18 = ml18.replace('AWS DynamoDB', 'Amazon DynamoDB')
ml18 = ml18.replace('Azure SQL Database', 'Microsoft Azure SQL Database')
ml18 = ml18.replace('PostgresSQL', 'PostgreSQL')

ml18_21 = ml18.merge(ml21, how = 'right', left_on = 'ml', right_on = 'ml')\
.rename(columns={'percentage_x': '2018', 'percentage_y': '2021'}).dropna()

fig = plt.figure(figsize = (15, 10))
plt.title('Dynamics', size = 25, y = 1.08, fontname = 'monospace', color = '#313233')
plt.grid(color = 'gray', linestyle = '-', axis = 'y', alpha = 0.1, zorder = 0)
a = sns.scatterplot(x = ml18_21['2021'], y = ml18_21['ml'], color = '#AACDE5', s = 300, edgecolor = 'black', zorder = 5)
sns.scatterplot(x = ml18_21['2018'], y = ml18_21['ml'], color = '#d9b44a', s = 100, edgecolor = 'black', zorder = 5)
plt.xticks(fontname = 'monospace', size = 15, color = '#313233')
a.xaxis.set_major_formatter(FixedFormatter(['0%', '5%', '10%', '15%', '20%', '25%']))
a.xaxis.set_major_locator(FixedLocator([0,5,10,15,20,25]))
plt.yticks(fontname = 'monospace', size = 14, color = '#313233')
plt.ylabel('')
plt.xlabel('')

for j in ['right', 'top', 'left']:
    a.spines[j].set_visible(False)
for j in ['bottom']:
    a.spines[j].set_linewidth(2)

l1 = lines.Line2D([0.125, 0.9], [0.05, 0.05], transform = fig.transFigure, figure = fig, color = '#ededed', linestyle = '-', linewidth = 4)
# Axes.lines is read-only in recent Matplotlib releases, so attach the line to the figure instead
fig.add_artist(l1)

for i, p1, p2 in zip(ml18_21['ml'], ml18_21['2018'], ml18_21['2021']):
    newline([p1, i], [p2, i])
    
plt.figtext(0.48, 0.92, '2018', fontname = 'monospace', size = 15, color = '#d9b44a') 
plt.figtext(0.52, 0.92, '2021', fontname = 'monospace', size = 15, color = '#AACDE5') 
    
plt.figtext(0.125, -0.07, '''Although respondents report less database use in 2021 than in 2018, the overall pattern of
use is almost unchanged. MySQL remains the most popular database, and more than 20% of
respondents said they plan to use it in the next 2 years. MongoDB, which is not widely used
yet, is planned by more than 15% of respondents.''', fontsize = 14, fontname = 'monospace', color = '#313233')
    
plt.show()
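
The "Dynamics" plots compare the 2018 and 2021 shares by eye. The same comparison can be put into numbers by adding a column with the change in percentage points after the merge. This is only a sketch under my own assumptions: the shares below are made up for illustration, and the column name `change_pp` does not appear anywhere else in the notebook.

```python
import pandas as pd

# Toy shares, made up for illustration (not survey results).
shares_2018 = pd.DataFrame({'ml': ['MySQL', 'PostgreSQL', 'Oracle Database'],
                            'percentage': [30.0, 15.0, 12.0]})
shares_2021 = pd.DataFrame({'ml': ['MySQL', 'PostgreSQL', 'MongoDB'],
                            'percentage': [20.0, 14.5, 10.0]})

# Right merge keeps the 2021 options; dropna() removes options missing in 2018.
change = (
    shares_2018.merge(shares_2021, how='right', on='ml', suffixes=('_2018', '_2021'))
    .dropna()
    .rename(columns={'percentage_2018': '2018', 'percentage_2021': '2021'})
)

# Difference between the two shares, in percentage points.
change['change_pp'] = (change['2021'] - change['2018']).round(1)
print(change)
```

A negative `change_pp` means the option was selected by a smaller share of respondents in 2021 than in 2018.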

# Inference

I have analyzed the main questions of the survey and drawn a conclusion for each of them. One could go into much more detail by examining each question for a particular social group, of which there are many, so it seemed more appropriate to focus on general trends. If you liked my work, you can **upvote** it. And if you find an error, please write in the comments :)