# <font color='#4a8bad'>Table of contents</font>
***
* [1. Introduction](#intro)

* [2. Overview of participant profile](#overview)

* [3. The current state of programming languages](#state)

* [4. Trends in Data Science and Machine Learning](#trend)

* [5. Exploring the salary gap](#salary)

# <font color='#4a8bad'>1. Introduction</font>
<a id="intro"></a>
***


The Kaggle platform has been gaining popularity in recent years. For the fifth year in a row, Kaggle has conducted an industry-wide survey that presents a comprehensive view of the state of Data Science and Machine Learning. This year, the survey attracted over 25,000 participants from around the globe.

**In my previous notebook [(click here for details)](https://www.kaggle.com/lynnxy/kaggle-survey-participants-profile-and-response), I provided a snapshot of the status quo**, summarising participants' basic profiles and how they responded to questions about their experience, jobs, and academic achievement.

**The objective of this notebook is to explore how things have been evolving in the Data Science and Machine Learning space over the past 5 years** since Kaggle first launched the survey. The notebook aims to address the following questions:
- Is the community becoming more diverse in terms of participants' age, gender, job and academic background?
- Which countries have seen rising interest in Data Science and Machine Learning?
- What is the current state of programming languages: will R fade away with the rise of Python?
- What are the trends for popular Data Science and Machine Learning tools?
- Which country has the highest-paying jobs for Data Scientists and Machine Learners?

In [None]:
%pip install joypy

In [None]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import matplotlib.ticker as mtick  # percent formatting for chart axes
import seaborn as sns
import squarify
import plotly.graph_objects as go
import plotly.express as px
import joypy

import warnings
warnings.filterwarnings('ignore')

# Chart setting
sns.set()
facecolor = '#f1f1f9'
plt.rcParams['figure.facecolor'] = facecolor
plt.rcParams['axes.facecolor'] = facecolor
#plt.rcParams.update({'axes.spines.top': False, 'axes.spines.right': False})
plt.rcParams['font.family'] = ['Calibri']
palette17 = ['#FFD166','#F16A70','#B1D877','#8CDCDA','#4D4D4D']
palette18 = ['#F16A70','#B1D877','#8CDCDA','#4D4D4D']
palette19 = ['#B1D877','#8CDCDA','#4D4D4D']

In [None]:
# Load files
raw21 = pd.read_csv('../input/kaggle-survey-2021/kaggle_survey_2021_responses.csv')
raw20 = pd.read_csv('../input/kaggle-survey-2020/kaggle_survey_2020_responses.csv')
raw19 = pd.read_csv('../input/kaggle-survey-2019/multiple_choice_responses.csv')
raw18 = pd.read_csv('../input/kaggle-survey-2018/multipleChoiceResponses.csv')
raw17 = pd.read_csv('../input/kaggle-survey-2017/multipleChoiceResponses.csv',encoding='unicode_escape')

In [None]:
# Clean data
column_name21 = raw21.columns.str.replace(r'_Part([\s\S])*', '', regex=True).str.replace('_OTHER', '')
column_name20 = raw20.columns.str.replace(r'_Part([\s\S])*', '', regex=True).str.replace('_OTHER', '')
column_name19 = raw19.columns.str.replace(r'_Part([\s\S])*', '', regex=True).str.replace('_OTHER', '')
column_name18 = raw18.columns.str.replace(r'_Part([\s\S])*', '', regex=True).str.replace('_OTHER', '')
column_name17 = raw17.columns.str.replace(r'_Part([\s\S])*', '', regex=True).str.replace('_OTHER', '')
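To make the cleaning step concrete, the sketch below applies the same two replacements to a handful of made-up column names. They only mimic the survey's `Q7_Part_1` / `Q7_OTHER` naming and are not read from the files. Every part of a multiple-choice question collapses to a shared question ID, which is why selecting a question such as `'Q7'` later returns one column per answer option.

```python
import pandas as pd

# Hypothetical column names that imitate the survey's multi-part layout
sample_columns = pd.Index(['Q1', 'Q7_Part_1', 'Q7_Part_12', 'Q7_OTHER', 'Q27_A_Part_3'])

# Same two replacements as in the cleaning cell
cleaned = (
    sample_columns
    .str.replace(r'_Part([\s\S])*', '', regex=True)
    .str.replace('_OTHER', '', regex=False)
)

print(list(cleaned))
```

Note that the suffix `_A` in `Q27_A_Part_3` survives, so such questions are addressed as `Q27_A` rather than `Q27`.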

In [None]:
# 2021 data
data21 = raw21.iloc[1:,:].copy()  # drop the question-text row; copy to avoid modifying a slice of raw21
data21.columns = column_name21
data21['year'] = '2021'

# 2020 data
data20 = raw20.iloc[1:,:].copy()
data20.columns = column_name20
data20['year'] = '2020'

# 2019 data
data19 = raw19.iloc[1:,:].copy()
data19.columns = column_name19
data19['year'] = '2019'

# 2018 data
data18 = raw18.iloc[1:,:].copy()
data18.columns = column_name18
data18['year'] = '2018'

# 2017 data
data17 = raw17.iloc[1:,:].copy()
data17.columns = column_name17
data17['year'] = '2017'

# consolidate
files = [data21, data20, data19, data18, data17] # all data
files1 = [data21, data20, data19, data18] # exclude 2017
files2 = [data21, data20, data19] # exclude 2017 and 2018

In [None]:
sns.palplot(palette17)
plt.title('color palette for this notebook', size=14, fontweight='bold')
plt.show()

#### <font color='#4a8bad'>Before we start, let's find out how long it takes to complete the survey</font>
> - The majority of participants take no more than 15 minutes to finish the survey    
> - Notably, participants spent less time on the survey than in 2018, the earliest year shown, as a result of the questionnaire being more concise and consistent
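One way to back these two statements with numbers is to compute, per year, the share of responses finished within 15 minutes (900 seconds) and the median duration. The sketch below uses made-up durations purely for illustration; with the real data, the same two lines could be applied to `df_duration` once it has been built in the next cell.

```python
import pandas as pd

# Made-up durations in seconds, for illustration only
toy_duration = pd.DataFrame({
    'year':     ['2018', '2018', '2018', '2021', '2021', '2021', '2021'],
    'duration': [600, 1500, 2400, 300, 700, 850, 1200],
})

# Share of responses completed within 15 minutes, per year
share_within_15_min = toy_duration['duration'].le(900).groupby(toy_duration['year']).mean()

# Median completion time per year
median_duration = toy_duration.groupby('year')['duration'].median()

print(share_within_15_min)
print(median_duration)
```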

In [None]:
duration_cols = ['Time from Start to Finish (seconds)', 'Time from Start to Finish (seconds)', 'Time from Start to Finish (seconds)', 'Time from Start to Finish (seconds)']

df_duration = []

for (file, duration) in zip(files1, duration_cols):
    df_duration.append(file.loc[:,['year', duration]])

for i in range(len(duration_cols)):
    df_duration[i].columns = ['year', 'duration']

df_duration = pd.concat(df_duration, axis=0, ignore_index=False)

df_duration['duration'] = pd.to_numeric(df_duration['duration'], downcast="float")
df_duration = df_duration[((df_duration['duration']>0) & (df_duration['duration']<3600))]

fig,axes =  joypy.joyplot(df_duration, by='year',color=palette18, alpha=0.8)
plt.title('Time taken to complete the survey (in seconds)', size=14, fontweight='bold')
plt.show()

# <font color='#4a8bad'>2. Overview of participant profile</font>
<a id="overview"></a>
***

> ### <font color='#4a8bad'>2.1. Record high number of participants with higher female ratio</font>
<a id="number"></a>
- The Kaggle Survey registered a record high number of participants this year, with nearly 26,000 people providing valuable feedback, a 55% increase compared to 2017.
- In terms of gender breakdown, 20,598 men and 4,890 women participated in the 2021 survey, representing 51% and 76% growth respectively since the survey was first launched.
- Although women are still significantly underrepresented in the DS/ML space, the number of female participants has grown faster than the number of male participants, and women now account for almost one fifth of the sample population
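The "almost one fifth" figure follows directly from the two counts quoted above. A minimal check, assuming the share is taken over participants who selected Man or Woman only:

```python
# Counts quoted in the bullets above (2021 survey)
men_2021 = 20_598
women_2021 = 4_890

# Share of women among Man/Woman responses
women_share = women_2021 / (men_2021 + women_2021)

print(f"Women: {women_share:.1%} of Man/Woman responses")  # Women: 19.2% of Man/Woman responses
```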

In [None]:
# Total number of participants 
no_participants = []
for i, file in enumerate(files):
    no_participants.append([(2021-i), len(file)])
no_participants = pd.DataFrame(no_participants, columns=['year', 'number'])

# chart (no. of participants vs. year)
fig = plt.figure(figsize=(6,4))
plt.bar(x=no_participants['year'], height=no_participants['number'], width=0.6, color=palette17[::-1])

# Label bar values
for rect in plt.gca().patches:
    height = rect.get_height()
    width = rect.get_width()
    x = rect.get_x()
    y = rect.get_y()
    # label
    label = '{:,}'.format(int(height))
    # coordinates for text
    x_coord = x + width/2
    y_coord = y + height+200
    plt.gca().text(x_coord, y_coord, label, ha='center', va='bottom', fontsize=12)
plt.title('No. of participants between 2017 - 2021', y=1.05, size=14)
plt.show()

In [None]:
# Rename to make gender choices consistent
data19.replace(['Female', 'Male'],['Woman', 'Man'], inplace=True)
data18.replace(['Female', 'Male'],['Woman', 'Man'], inplace=True)
data17.replace(['Female', 'Male'],['Woman', 'Man'], inplace=True)

# Gender distribution
gender_col = ['Q2', 'Q2', 'Q2', 'Q1', 'GenderSelect']
df_genders = []

for (file, gender) in zip(files, gender_col):
    df_genders.append(file.loc[:,['year', gender]])

for i in range(len(gender_col)):
    df_genders[i].columns = ['year', 'gender']

df_genders = pd.concat(df_genders, axis=0, ignore_index=False)
df_genders_ct = pd.crosstab(index=df_genders['year'], columns=df_genders['gender'], values=df_genders['gender'], aggfunc='count') 

In [None]:
color_b = ['#6c8fc5', '#6c8fc5', '#6c8fc5', '#6c8fc5', '#34568B']
color_o = ['#fbc283', '#fbc283', '#fbc283', '#fbc283', '#FDAC53']

fig, ax = plt.subplots(figsize=(10,6))
ax.barh(df_genders_ct.index, df_genders_ct['Man'], color=color_b, linewidth=2, label='Male')
ax.barh(df_genders_ct.index, df_genders_ct['Woman'], left = df_genders_ct['Man'], color=color_o, linewidth=2, label='Female')
ax.invert_yaxis()    
# Label bar values
for rect in ax.patches:
    height = rect.get_height()
    width = rect.get_width()
    x = rect.get_x()
    y = rect.get_y()
    # label
    label = '{:,}'.format(int(width))
    # coordinates for text
    x_coord = x + width - 300
    y_coord = y + height / 2
    ax.text(x_coord, y_coord, label, ha='right', va='center', fontsize=14, color='w',  fontweight='bold')

plt.title('No. of male and female participants',  size=16)
plt.legend() 
plt.show()

In [None]:
# percentage male and female
df_genders_pct = df_genders_ct.loc[:,['Man', 'Woman']].div(df_genders_ct.loc[:,['Man', 'Woman']].sum(axis=1), axis=0) *100

color_b = ['#6c8fc5', '#6c8fc5', '#6c8fc5', '#6c8fc5', '#34568B']
color_o = ['#fbc283', '#fbc283', '#fbc283', '#fbc283', '#FDAC53']

fig, ax = plt.subplots(figsize=(10,6))
ax.barh(df_genders_pct.index, df_genders_pct['Man'], color=color_b, linewidth=2, label='Male')
ax.barh(df_genders_pct.index, df_genders_pct['Woman'], left = df_genders_pct['Man'], color=color_o, linewidth=2, label='Female')
ax.invert_yaxis()    

# Label bar values
for rect in ax.patches:
    height = rect.get_height()
    width = rect.get_width()
    x = rect.get_x()
    y = rect.get_y()
    # label
    label = '{:.1f}%'.format(width)
    # coordinates for text
    x_coord = x + width - 2
    y_coord = y + height / 2
    ax.text(x_coord, y_coord, label, ha='right', va='center', fontsize=14, color='w',  fontweight='bold')

plt.title('% male vs. female',  size=16)
plt.show()

> ### <font color='#4a8bad'>2.2. The rise of young coders</font>
<a id="age"></a>
- The chart below shows a clear trend: the youngest age group (18-21) is growing faster than the rest of the age groups
- Participants aged 18-21 account for nearly 19% of the sample population this year, almost on par with the largest age group (25-29).
- In recent years, coding has been gaining popularity among children, and many countries have introduced programming classes as an important part of the school curriculum. If the Kaggle Survey were to include age options for minors, we might see some interesting figures on how programming is evolving among students below age 18
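The percentages in the heatmaps of this section are row shares: each year's counts divided by that year's total. The sketch below shows the calculation on made-up responses, first in two steps (count, then divide) and then with `pd.crosstab(..., normalize='index')`, which gives the same result in one call. The toy data and variable names are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Made-up responses, for illustration only
toy_age = pd.DataFrame({
    'year': ['2018'] * 4 + ['2021'] * 5,
    'age':  ['18-21', '22-24', '25-29', '25-29',
             '18-21', '18-21', '22-24', '25-29', '25-29'],
})

# Two-step version: count responses, then divide each row by its total
counts = pd.crosstab(index=toy_age['year'], columns=toy_age['age'])
share_two_step = counts.div(counts.sum(axis=1), axis=0)

# One-step version: let crosstab normalise each row
share_one_step = pd.crosstab(index=toy_age['year'], columns=toy_age['age'], normalize='index')

print(share_one_step.round(2))
```

Each row sums to 1, so a rising value in the 18-21 column means that group makes up a larger part of that year's sample, regardless of how many people took the survey.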

In [None]:
def plot_heatmap(data, cmap=['Reds', 'Greens', 'Blues', 'Greys'], figsize=(12,6)):
    fig = plt.figure(figsize=figsize)
    for i in range(len(data.index)):
        ax = fig.add_subplot(5,1,i+1)
        sns.heatmap(pd.DataFrame(data.iloc[i,:]).transpose(), cbar=False, cmap=cmap[i], annot=True, fmt='.1%', linewidth=0.5)
        ax.set_xlabel('')
        plt.yticks(rotation=360)
        if i < (len(data.index)-1):
            ax.set_xticklabels('')
    plt.subplots_adjust(hspace=0.01)

In [None]:
# Age columns in different files
age_col = ['Q1', 'Q1', 'Q1', 'Q2']

# Rename to make age choices consistent
data18.replace(['70-79', '80+'],['70+', '70+'], inplace=True)

df_age = []

for (file, age) in zip(files1, age_col):
    df_age.append(file.loc[:,['year', age]])

for i in range(len(age_col)):
    df_age[i].columns = ['year', 'age']

df_age = pd.concat(df_age, axis=0, ignore_index=False)
df_age_ct = pd.crosstab(index=df_age['year'], columns=df_age['age'], values=df_age['age'], aggfunc='count') 
df_age_pct = df_age_ct.div(df_age_ct.sum(axis=1), axis=0)

# Plot
plot_heatmap(data = df_age_pct)
plt.title('Age comparison from 2018 - 2021',  y=4.1, size=16)
plt.show()

> ### <font color='#4a8bad'>2.3. Participants represented 60 countries with over 7,000 from India</font>
- Not only did the number of participants reach a record high this year, the number of countries represented also rose to 60. It is good to see a more diverse Kaggle community
- India continues to top the list for the third year in a row, with 7,434 participants in 2021. This number has almost tripled from 5 years ago
- While the US remains second on the list, its number of participants has dropped by 40% compared to the peak in 2018
- However, for the age groups 45 and above, US participants still represent the largest sample population
- Apart from India and the US, countries that have remained in the top 10 for the past 5 years include Russia, the UK, China and Brazil
- There were no African countries in the top 10 list 5 years ago; however, an African country (Nigeria) broke into the top 10 for the first time in 2020. In 2021, there were also 482 participants from Egypt
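Counting the countries represented in a given year amounts to counting distinct country names among that year's responses, after leaving out the catch-all `Other` option. A minimal sketch on made-up rows (the data and variable names are illustrative assumptions):

```python
import pandas as pd

# Made-up responses, for illustration only
toy_country = pd.DataFrame({
    'year':    ['2020', '2020', '2020', '2020',
                '2021', '2021', '2021', '2021', '2021', '2021'],
    'country': ['India', 'USA', 'India', 'Nigeria',
                'India', 'USA', 'Egypt', 'Nigeria', 'India', 'Other'],
})

# Leave out the catch-all option, then count distinct country names per year
named = toy_country[toy_country['country'] != 'Other']
countries_per_year = named.groupby('year')['country'].nunique()

print(countries_per_year)
```

The distinct count is taken over country names, not over the per-country totals, so two countries with the same number of participants are still counted separately.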

In [None]:
# country columns in  files
country_col = ['Q3', 'Q3', 'Q3', 'Q3','Country']

# Rename 
data17.replace(['United States', 'United Kingdom', "People 's Republic of China"],['USA', 'UK', 'China'], inplace=True)
data18.replace(['United Kingdom of Great Britain and Northern Ireland', 'United States of America','Viet Nam'],['UK','USA', 'Vietnam'], inplace=True)
data19.replace(['United Kingdom of Great Britain and Northern Ireland', 'United States of America','Viet Nam'],['UK','USA', 'Vietnam'], inplace=True)
data20.replace(['United Kingdom of Great Britain and Northern Ireland', 'United States of America','Viet Nam'],['UK','USA', 'Vietnam'], inplace=True)
data21.replace(['United Kingdom of Great Britain and Northern Ireland', 'United States of America','Viet Nam'],['UK','USA', 'Vietnam'], inplace=True)

# DataFrame summary
df_country = []
for (file, country) in zip(files, country_col):
    df_country.append(file.loc[:,['year', country]])

for i in range(len(country_col)):
    df_country[i].columns = ['year', 'country']

df_country = pd.concat(df_country, axis=0, ignore_index=False)
df_country_ct = pd.crosstab(index=df_country['year'], columns=df_country['country'], values=df_country['country'], aggfunc='count').transpose().drop(index='Other')


country_no = []
for i in range(len(country_col)):
    # count countries with at least one participant in that year (non-missing cells)
    country_no.append([2017+i, df_country_ct.iloc[:,i].count()])

country_no = pd.DataFrame(country_no, columns=['year', 'number'])

# chart (no. of participants vs. year)
fig = plt.figure(figsize=(6,4))
plt.bar(x=country_no['year'], height=country_no['number'], width=0.6, color=palette17)

# Label bar values
for rect in plt.gca().patches:
    height = rect.get_height()
    width = rect.get_width()
    x = rect.get_x()
    y = rect.get_y()
    # label
    label = '{:,}'.format(int(height))
    # coordinates for text
    x_coord = x + width/2
    y_coord = y + height+1
    plt.gca().text(x_coord, y_coord, label, ha='center', va='bottom', fontsize=12)
plt.title('No. of countries represented in the survey', y=1.05, size=14)   
plt.show()

In [None]:
color = palette17

fig = plt.figure(figsize=(15,6))
for i in range(len(country_col)):
    ax = fig.add_subplot(1,5, i+1)
    top10 =  df_country_ct.iloc[:, i].nlargest(10)
    ax= top10.plot(kind='barh', color=color[i])
    ax.set_title(top10.name + ' top 10 countries', size=12, fontweight='bold')
    
    ax.invert_yaxis()
    ax.set_ylabel("")
    
    values = top10.values    
    for j, value in enumerate(values):
        ax.text(value+values.max()*0.05 , j, '{:,}'.format(int(value)),va='center', 
                 fontweight='bold', alpha=0.8, size=12, ha='left')   
plt.tight_layout()

In [None]:
age_country_tm = data21.loc[:,['Q1','Q3']].reset_index().groupby(['Q1','Q3'], as_index=False).count()
age_country_tm
fig = px.treemap(age_country_tm, path=['Q1','Q3'], values='index')
fig.update_layout(
    treemapcolorway = palette17[::-1],
    margin = dict(t=50, l=25, r=25, b=25)
)
fig.show()

> ### <font color='#4a8bad'>2.4. The Kaggle Survey attracts people with higher education</font>
- Although a degree is not a prerequisite for becoming a Data Scientist, results from the Kaggle Survey suggest that the majority of the Kaggle community has received higher education. People with Master's and Bachelor's degrees represent the largest groups in the sample. 
- Less than 10% of participants have no formal education beyond high school 

In [None]:
# education columns in files
education_col = ['Q4', 'Q4', 'Q4', 'Q4']

# Rename 
data21.replace(['Professional doctorate'],['Professional degree'], inplace=True)

df_education = []

for (file, education) in zip(files1, education_col):
    df_education.append(file.loc[:,['year', education]])

for i in range(len(education_col)):
    df_education[i].columns = ['year', 'education']

df_education = pd.concat(df_education, axis=0, ignore_index=False)
df_education_ct = pd.crosstab(index=df_education['year'], columns=df_education['education'], values=df_education['education'], aggfunc='count') 
df_education_ct.columns = ['Bachelor’s', 'PhD', 'Not disclosed', 'Master’s', 'High school', 'Professional', 'College']
df_education_ct = pd.DataFrame(df_education_ct, columns=['Professional','PhD', 'Master’s', 'Bachelor’s', 'High school', 'College', 'Not disclosed'])
df_education_pct = df_education_ct.div(df_education_ct.sum(axis=1), axis=0)

# Plot
plot_heatmap(data=df_education_pct, figsize=(9,6))
plt.title('Academic qualification from 2018 - 2021', y=4.2, size=16)
plt.show()

> ### <font color='#4a8bad'>2.5. Half of participants have less than 3 years of coding experience</font>

In [None]:
exp_col = ['Q6', 'Q6', 'Q15', 'Q24']

# Rename
data18.replace(['1-2 years', '20-30 years','30-40 years', '40+ years', '< 1 year', 'I have never written code and I do not want to learn', 'I have never written code but I want to learn'],['1-3 years', '20+ years', '20+ years','20+ years', '< 1 years', 'Never','Never'], inplace=True)
data19.replace(['1-2 years', 'I have never written code'],['1-3 years', 'Never'], inplace=True)
data20.replace(['1-2 years', 'I have never written code'],['1-3 years', 'Never'], inplace=True)
data21.replace(['I have never written code'],['Never'], inplace=True)

df_exp = []

for (file, exp) in zip(files1, exp_col):
    df_exp.append(file.loc[:,['year', exp]])

for i in range(len(exp_col)):
    df_exp[i].columns = ['year', 'experience']

df_exp = pd.concat(df_exp, axis=0, ignore_index=False)
df_exp_ct = pd.crosstab(index=df_exp['year'], columns=df_exp['experience'], values=df_exp['experience'], aggfunc='count') 
df_exp_ct = pd.DataFrame(df_exp_ct, columns=['Never','< 1 years','1-3 years', '3-5 years', '5-10 years','10-20 years','20+ years'])
df_exp_pct = df_exp_ct.div(df_exp_ct.sum(axis=1), axis=0)

plot_heatmap(df_exp_pct, figsize=(9,6))
plt.title('Years of coding experience', y=4.2, size=16)
plt.show()

> ### <font color='#4a8bad'>2.6. Students and Data Scientists are most active participants</font>
- The initial Kaggle Survey had fewer than 1,000 students participating, but the number increased by more than four times in just a year to 5,000+. The number of students reached 6,804 this year. 
- The second largest occupation is Data Scientist

In [None]:
job_col = ['Q5', 'Q5', 'Q5', 'Q6', 'CurrentJobTitleSelect']

df_job = []
for (file, job) in zip(files, job_col):
    df_job.append(file.loc[:,['year', job]])

for i in range(len(job_col)):
    df_job[i].columns = ['year', 'job']

df_job = pd.concat(df_job, axis=0, ignore_index=False)
df_job_ct = pd.crosstab(index=df_job['year'], columns=df_job['job'], values=df_job['job'], aggfunc='count').transpose().drop(index='Other')
df_job_ct.loc['Student', '2017'] = len(data17[data17['StudentStatus'] == 'Yes'])

df = df_job.reset_index().groupby(['year','job'], as_index=False).count()

fig = px.treemap(df, path=['year','job'], values='index')
fig.update_layout(
    treemapcolorway = palette17[::-1],
    margin = dict(t=50, l=25, r=25, b=25)
)

> ### <font color='#4a8bad'>2.7. A snapshot of 2021 participant's profile</font>
- The `parallel categories` plot provides a quick snapshot of the 2021 participants' profile
- In summary, the Kaggle Survey mostly attracts young male students with higher education who have 1-3 years of coding experience

In [None]:
snapshot21 = data21.loc[:, ['Q1','Q2','Q4','Q5', 'Q6']]
fig = px.parallel_categories(snapshot21,  labels={'Q1':'Age', 'Q2':'Gender', 'Q4':'Education', 'Q5':'Job', 'Q6': 'coding experience'})

fig.update_layout(
    hovermode = 'x',
    title="<b>2021 Kaggle Survey Participant's Profile</b>",
    font=dict(size = 10, color = 'black'),
    title_font=dict(size=16),
    paper_bgcolor=facecolor
)

fig.show()

# <font color='#4a8bad'>3. The current state of programming languages</font>
<a id="state"></a>
***

> ### <font color='#4a8bad'>3.1. Will R fade away with the rise of Python?</font>
- Python is used extensively by the Kaggle community and has been the most popular programming language in recent years.
- In 2018, 66% of participants used Python, and this ratio has increased to 84% this year
- Meanwhile, the importance of SQL should not be underestimated: over 40% of participants also use SQL on a regular basis
- At first glance, it seems that R has gradually lost popularity within the Kaggle community
- However, if we look at programming language usage by job, R is heavily used by statisticians. According to the 2021 survey, 64% of statisticians use R on a regular basis.
- The figure below suggests that if you want to be a Database Engineer or Data Engineer, you should also learn SQL
- And if you want to work as a Software Engineer or in Developer Relations, you should learn JavaScript on top of Python to enhance your competitiveness
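The per-job percentages quoted above are shares of a multi-select question: for each job, the number of people who ticked a language divided by the number of people in that job. The sketch below reproduces the idea on made-up answers, assuming (as in the survey files) that a cell holds the option's name when it was ticked and is empty otherwise. The column names `lang_python`, `lang_r` and `lang_sql` are hypothetical.

```python
import pandas as pd

# Made-up multi-select answers: option name if ticked, otherwise None
toy_lang = pd.DataFrame({
    'job':         ['Statistician', 'Statistician', 'Statistician',
                    'Data Engineer', 'Data Engineer'],
    'lang_python': ['Python', None, 'Python', 'Python', 'Python'],
    'lang_r':      ['R', 'R', None, None, None],
    'lang_sql':    [None, 'SQL', None, 'SQL', 'SQL'],
})
lang_columns = ['lang_python', 'lang_r', 'lang_sql']

# Number of ticks per job and language (count() skips empty cells)
ticked = toy_lang.groupby('job')[lang_columns].count()

# Number of participants per job
group_size = toy_lang.groupby('job').size()

# Share of each job's participants who use each language
usage = ticked.div(group_size, axis=0)

print(usage.round(2))
```

Because participants can tick several languages, the shares in a row do not need to sum to 1.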

In [None]:
# Function to extract all multiple-choice responses
def question(columns, store_list, choice, file=files1):
    # `file` is the list of yearly DataFrames (newest first);
    # `columns` holds the matching question ID for each year
    for i in range(len(file)):
        data_year = file[i]
        col = columns[i]
        for j in range(len(data_year[col].columns)):
            choice_name = data_year[col].iloc[:, j].dropna().unique()
            count = data_year[col].iloc[:, j].count()
            store_list.append([2021-i, choice_name, count, len(data_year)])
            
    df = pd.DataFrame(store_list)
    df.columns = ['year', choice, 'count', 'participants']
    df.iloc[:,1] =   df.iloc[:,1].str.get(0)  # keep the option name only
    df['pct'] = df['count'] / df['participants']

    return df

# Function for bar plot
def bar_plot(df, columns, choice,title=None, color=palette18):
    fig = plt.figure(figsize=(15,10))
    # one panel per survey year; `columns` lists the question ID for each year
    for i in range(len(columns)):
        ax = fig.add_subplot(2,2, i+1)
        df[df['year']==2018+i].loc[:, [choice, 'pct']].sort_values(by=['pct'], ascending=False).plot.barh(y='pct', x=choice, ax=ax, color=color[i])
        ax.invert_yaxis()
        ax.set_title('Response from '+str(2018+i) + ' Kaggle Survey', size=15)
        ax.set_ylabel('')
        ax.get_legend().remove()
        
        # label values
        values = df[df['year']==2018+i].loc[:,'pct'].sort_values(ascending=False)
        for j, value in enumerate(values):
            ax.text(value+values.max()*0.01 , j, '{:.1%}'.format(value),va='center', 
                    fontweight='bold', alpha=0.8, size=11, ha='left')
    plt.suptitle(t=title, y=1, size=20, fontweight='bold')
    plt.tight_layout()

    return

# Function for scatter plot
def scatter_plot(data, choice, palette=palette19, title=None):
    fig = plt.figure(figsize=(10,5))
    ax = sns.scatterplot(x='pct', y=choice, hue='year', s=50, data=data, palette = palette, edgecolor='grey', linewidth=1.5)
    plt.title(title, y=1.05, size=16)
    plt.xlabel('')
    plt.ylabel('')
    plt.gca().xaxis.set_major_formatter(mtick.PercentFormatter(1.0, decimals=0))
    
    return

In [None]:
lang_cols = ['Q7', 'Q7', 'Q18', 'Q16']
lang_list = []
choice = 'language'
title = "What programming languages do you use on a regular basis?"

lang_df = question(columns=lang_cols, store_list=lang_list, choice=choice)
bar_plot(df=lang_df, columns=lang_cols, choice=choice, title=title)

In [None]:
# Which languages to learn for your job?
choice_name = []  # create an empty list to store names of programming languages 
job_lang= data21.loc[:, ['Q5','Q7']]
for i in range(len(job_lang['Q7'].columns)):
    choice_name.append(job_lang['Q7'].iloc[:, i].dropna().unique())

choice_name= pd.DataFrame(choice_name).values
choice_name = [element for sublist in choice_name for element in sublist] # flatten the list of lists
job_lang = pd.pivot_table(data=job_lang, index='Q5', values='Q7', aggfunc='count')
job_lang.columns = choice_name # replace columns with the language's name
job_lang['participants'] = data21.groupby('Q5')['Q5'].count()
job_lang =  job_lang.div(job_lang.loc[:, 'participants'], axis=0)
job_lang.drop(columns='participants', inplace=True)

In [None]:
fig = plt.figure(figsize=(15,8))
sns.heatmap(job_lang, cmap='Blues', annot=True, fmt='.1%', linewidth=1)
plt.ylabel('')
plt.title('Which programming languages to learn for your job? (2021 data)', y=1.05, size=16)
plt.show()

In [None]:
# Create Sankey diagram from DataFrame
# prepare Sankey plot
label = (job_lang.index.tolist(), job_lang.columns.tolist())
label = [element for sublist in label for element in sublist]
color_palette = ['#ffc107', '#00ffff', '#c0ff3e', '#d6d6dd','#d6d6dd','#d6d6dd','#d6d6dd','#d6d6dd','#d6d6dd','#d6d6dd','#d6d6dd','#d6d6dd','#d6d6dd']

source= []
target = []
value = []
color = []

# create array to store 'source', 'target', 'value', and 'color'
for i in range(len(job_lang.index)):
    for j in range(len(job_lang.columns)):
        source.append(i)  # source encoding
        target.append(j+ len(job_lang.index)) # target encoding
        value.append(job_lang.iloc[i,j]) # value with respect to source/target
        color.append(color_palette[j])
                
source = np.array(source)
target = np.array(target)
value = np.array(value)

# plot

link = dict(source=source, target=target, value=value, color=color)
node = dict(pad=8, thickness=30, label=label, line = dict(color = "blue", width = 0.5), color='#56a1ec')

fig = go.Figure(go.Sankey(link=link, node=node))

fig.update_layout(
    hovermode = 'x',
    title="<b>Who uses Python, R and SQL? </b> (Visualization in a different way)",
    title_font=dict(size=16),
    font=dict(size = 11, color = 'black'),
    paper_bgcolor= facecolor
)


> ### <font color='#4a8bad'>3.2.  Kagglers recommend starting DS/ML with Python</font>
- In response to the question `What programming language would you recommend an aspiring data scientist to learn first?`, the share of participants opting for Python increases every year, in contrast to the share of participants choosing R

In [None]:
langrec_cols = ['Q8' ,'Q8', 'Q19', 'Q18']

# Rename to make age choices consistent
#data18.replace(['70-79', '80+'],['70+', '70+'], inplace=True)

df_rec = []

for (file, rec) in zip(files1, langrec_cols):
    df_rec.append(file.loc[:,['year', rec]])

for i in range(len(langrec_cols)):
    df_rec[i].columns = ['year', 'recommendation']

df_rec = pd.concat(df_rec, axis=0, ignore_index=False)
df_rec_ct = pd.crosstab(index=df_rec['year'], columns=df_rec['recommendation'], values=df_rec['recommendation'], aggfunc='count') 
df_rec_pct = df_rec_ct.div(df_rec_ct.sum(axis=1), axis=0)
df_rec_pct.sort_index(axis=1, inplace=True, ascending=False)

In [None]:
import matplotlib.ticker as mtick
fig = plt.figure(figsize=(10,5), facecolor = facecolor)
color =palette18
for i in range(len(langrec_cols)):
    x = df_rec_pct.iloc[i, :]
    y =  df_rec_pct.columns
    plt.scatter(x=x, y=y, s=40, label=2018+i, color=color[i], edgecolor='black')
plt.title('What programming language would you recommend an aspiring data scientist to learn first?', y=1.05, size=16)
plt.gca().xaxis.set_major_formatter(mtick.PercentFormatter(1.0, decimals=0))
plt.legend()
plt.show()

# <font color='#4a8bad'>4. Trends in Data Science and Machine Learning</font>
<a id="trend"></a>
***

> ### <font color='#4a8bad'>4.1. Jupyter Notebook - data scientists' IDE of choice?</font>
- Jupyter Notebook is the most popular IDE among data scientists: over 60% of participants use Jupyter on a regular basis, far more than any other IDE
- Back in 2018, RStudio was the second most popular IDE according to the survey, with 36% of participants using RStudio on a regular basis. However, this ratio has dropped every year, with only 18% of participants choosing this option in the latest survey. 
- The number of participants using RStudio is declining faster than the number using the R programming language, implying that some R users may have switched from RStudio to other IDEs
- It is also worth noting that VSCode has been gaining popularity in recent years
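When comparing rates of decline, it helps to separate the change in percentage points from the relative change. The sketch below applies both to the RStudio shares quoted above. The same arithmetic could be applied to the R shares from the language chart to compare the two declines; that comparison is not reproduced here because those figures are not quoted in the text.

```python
# Shares of participants using RStudio, as quoted in the bullets above
rstudio_2018 = 0.36
rstudio_2021 = 0.18

# Absolute change in percentage points
point_change = (rstudio_2021 - rstudio_2018) * 100

# Change relative to the 2018 level
relative_change = (rstudio_2021 - rstudio_2018) / rstudio_2018

print(f"{point_change:+.0f} percentage points, {relative_change:+.0%} relative")
# -18 percentage points, -50% relative
```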

In [None]:
ide_cols = ['Q9', 'Q9', 'Q16', 'Q13']
ide_list = []
choice = 'IDE'
title = "Which of the integrated development environments (IDE's) do you use on a regular basis?"

ide_df = question(columns=ide_cols, store_list=ide_list, choice=choice)
bar_plot(df=ide_df, columns=ide_cols, choice=choice, title=title)

> ### <font color='#4a8bad'>4.2. Which hosted notebook to choose?</font>
- When the question was asked in 2018, over 30% of participants had not used any hosted notebook products
- Although this ratio has dropped in recent years as more people opt for Kaggle Notebook and Colab Notebook, it is still hard to say at this point which will win in the future

In [None]:
notebook_cols = ['Q10', 'Q10', 'Q17', 'Q14']
notebook_list = []
choice = 'notebook'
title = "Which of the following hosted notebook products do you use on a regular basis?"

notebook_df = question(columns=notebook_cols, store_list=notebook_list, choice=choice)
bar_plot(df=notebook_df, columns=notebook_cols, choice=choice, title=title)

> ### <font color='#4a8bad'>4.3. How important is specialized hardware?</font>
- Half of participants do not use specialized hardware according to the latest survey. This ratio has increased compared to 2 years ago, likely because a higher proportion of less experienced participants took part in the survey

In [None]:
hardware_cols = ['Q12','Q12','Q21']
# Rename
data19.replace(['None / I do not know'],['None'], inplace=True)

hardware_list = []
choice = 'hardware'
title = "Which types of specialized hardware do you use on a regular basis? "

hardware_df = question(columns=hardware_cols, store_list=hardware_list, choice=choice, file=files2)

fig = plt.figure(figsize=(10,5))
ax = sns.scatterplot(x='pct', y='hardware', hue='year', data=hardware_df, s=hardware_df['pct']*30**2, palette = ['#B1D877','#8CDCDA','#4D4D4D'], edgecolor='grey', linewidth=1.5)
plt.title(title, y=1.05, size=16)
plt.xlabel("")
plt.ylabel("")
plt.gca().xaxis.set_major_formatter(mtick.PercentFormatter(1.0, decimals=0))
plt.show()

> ### <font color='#4a8bad'>4.4. Matplotlib - a Swiss Army knife for visualization</font>
- Matplotlib is the most popular visualization library due to its flexibility and power to create complicated charts
- Nearly 70% of participants use Matplotlib on a regular basis, followed by Seaborn
- Plotly may gain wider acceptance in the future as it is capable of creating beautiful interactive charts
- Fewer participants use ggplot, in line with the decline in the number of R users

In [None]:
vis_cols = ['Q14', 'Q14', 'Q20', 'Q21']
vis_list = []
choice = 'visualisation library'
title = "What data visualization libraries or tools do you use on a regular basis?"

vis_df = question(columns=vis_cols, store_list=vis_list, choice=choice)
bar_plot(df=vis_df, columns=vis_cols, choice=choice, title=title)

> ### <font color='#4a8bad'>4.5. Machine learning with Scikit-learn</font>
- There are many machine learning frameworks available for data scientists and ML practitioners to choose from
- Scikit-learn is one of the most popular ML frameworks, providing a fast and easy way to build and test ML models
- TensorFlow and Keras have consistently ranked in the top 3 of the list

In [None]:
mlf_cols = ['Q16', 'Q16', 'Q28', 'Q19']
mlf_list = []
choice = 'ML framework'
title = "Which of the following machine learning frameworks do you use on a regular basis?"

mlf_df = question(columns=mlf_cols, store_list=mlf_list, choice=choice)
bar_plot(df=mlf_df, columns=mlf_cols, choice=choice, title=title)

> ### <font color='#4a8bad'>4.6. Machine learning algorithm</font>
- Linear/logistic regression and decision trees/random forests (DT/RF) are the most popular machine learning algorithms; there has not been much change over the past 3 years. 

In [None]:
algo_cols = ['Q17', 'Q17', 'Q24']
algo_list = []
choice = 'ML algorithm'
title = "Which of the following ML algorithms do you use on a regular basis?"

algo_df = question(columns=algo_cols, store_list=algo_list, choice=choice, file=files2)
scatter_plot(algo_df, choice, palette=palette19, title=title)

> ### <font color='#4a8bad'>4.7. Computer Vision</font>
- Based on the survey, the usage of computer vision methods has remained largely unchanged over the past three years

In [None]:
vision_cols = ['Q18', 'Q18', 'Q26']
vision_list = []
choice = 'vision'
title = "Which categories of computer vision methods do you use on a regular basis? "

vision_df = question(columns=vision_cols, store_list=vision_list, choice=choice, file=files2)
scatter_plot(vision_df, choice, palette=palette19, title=title)

> ### <font color='#4a8bad'>4.8. NLP</font>
- A similar trend is seen in NLP methods, likely because survey participants are more involved in data analysis/visualization than in NLP.

In [None]:
nlp_cols = ['Q19', 'Q19', 'Q27']
nlp_list = []
choice = 'NLP'
title = "Which of the following natural language processing (NLP) methods do you use on a regular basis? "

nlp_df = question(columns=nlp_cols, store_list=nlp_list, choice=choice, file=files2)
scatter_plot(nlp_df, choice, palette=palette19, title=title)

> ### <font color='#4a8bad'>4.9. Cloud Computing</font>

In [None]:
cloud_cols =  ['Q27_A', 'Q26_A', 'Q29']
cloud_list = []
choice = 'cloud'
title = "Which of the following cloud computing platforms do you use on a regular basis? "

cloud_df = question(columns=cloud_cols, store_list=cloud_list, choice=choice, file=files2)
scatter_plot(cloud_df, choice, palette=palette19, title=title)

# <font color='#4a8bad'>5. Exploring the salary gap</font>
<a id="salary"></a>
***

> ### <font color='#4a8bad'>5.1. Salary trend from 2019-2021</font>
- This year, there are many low-income participants, reflecting a higher ratio of students and unemployed participants

In [None]:
salary_cols = ['Q25', 'Q24', 'Q10']

# Rename
data21.replace(['$500,000-999,999','>$1,000,000','300,000-499,999'],['> 500,000','> 500,000','300,000-500,000'], inplace=True)
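# regex=False treats '$' as a literal character, not the regex end-of-string anchor
data21['Q25'] = data21['Q25'].str.replace('$', '', regex=False)
data20['Q24'] = data20['Q24'].str.replace('$', '', regex=False)
data19['Q10'] = data19['Q10'].str.replace('$', '', regex=False)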

# Salary order
salary_order = ['0-999', '1,000-1,999', '2,000-2,999', '3,000-3,999', '4,000-4,999', '5,000-7,499', '7,500-9,999', '10,000-14,999',
         '15,000-19,999', '20,000-24,999', '25,000-29,999', '30,000-39,999', '40,000-49,999', '50,000-59,999', '60,000-69,999',
         '70,000-79,999', '80,000-89,999', '90,000-99,999', '100,000-124,999', '125,000-149,999', '150,000-199,999', '200,000-249,999',
         '250,000-299,999', '300,000-500,000', '> 500,000']

df_salary = []
choice = 'salary'
title = "What is your current yearly compensation (approximate $USD)?"

for (file, salary) in zip(files2, salary_cols):
    df_salary.append(file.loc[:,['year', salary]])

for i in range(len(salary_cols)):
    df_salary[i].columns = ['year', 'salary']

df_salary = pd.concat(df_salary, axis=0, ignore_index=False)
df_salary_ct = pd.crosstab(index=df_salary['year'], columns=df_salary['salary'], values=df_salary['salary'], aggfunc='count') 
df_salary_ct = pd.DataFrame(df_salary_ct, columns=salary_order)
df_salary_pct = df_salary_ct.div(df_salary_ct.sum(axis=1), axis=0)
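
The cell above strips the dollar sign, counts responses per year and bracket, reorders the brackets, and converts counts to row-wise shares. The same steps can be sketched on toy data as follows. The values and the shortened `order` list are made up for illustration, and `reindex` with `fill_value=0` is a swapped-in alternative to rebuilding the frame with `pd.DataFrame(..., columns=...)`.

```python
import pandas as pd

# Toy responses (illustrative values only, not survey data)
toy = pd.DataFrame({
    'year': [2020, 2020, 2021, 2021, 2021, 2021],
    'salary': ['$0-999', '1,000-1,999', '$0-999', '$0-999',
               '1,000-1,999', '> $500,000'],
})

# regex=False makes '$' a literal character rather than a regex anchor
toy['salary'] = toy['salary'].str.replace('$', '', regex=False)

# Desired bracket order (shortened for the example)
order = ['0-999', '1,000-1,999', '> 500,000']

# Count responses per year and bracket, then put the brackets in order;
# brackets that never occur in a year get a count of 0 instead of NaN
counts = pd.crosstab(index=toy['year'], columns=toy['salary'])
counts = counts.reindex(columns=order, fill_value=0)

# Divide each row by its total so every year sums to 1
pct = counts.div(counts.sum(axis=1), axis=0)
print(pct.round(2))
```

Normalising by row is what makes the years comparable even though the number of respondents differs from one survey to the next.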

In [None]:
# bar plot to show salary distribution
color = palette19
fig = plt.figure(figsize=(15,10))
for i in range(len(salary_cols)):
    ax = fig.add_subplot(1,3, i+1)
    df_salary_ct.iloc[i,:].plot(kind='barh', color=palette19[i])
    ax.invert_yaxis()
    ax.set_title(str(2019+i) +' salary distribution (in $)', fontweight='bold', ha='center', size=12)
    if i > 0:
        ax.set_yticks([])
        
    values = df_salary_ct.iloc[i,:].values    
    for j, value in enumerate(values):
        ax.text(value+50 , j, '{:,}'.format(int(value)),va='center', 
                 fontweight='bold', alpha=0.8, size=12, ha='left')   

In [None]:
plot_heatmap(df_salary_pct, cmap=['Greens', 'Blues', 'Greys'], figsize=(15,6))
plt.title('salary distribution in %', y=3.2, size=16)
plt.show()

> ### <font color='#4a8bad'>5.2. Which country has the highest ratio of high-income participants?</font>
- Stripping out students and unemployed participants, more than 40% of participants in Egypt, Pakistan, Indonesia and Nigeria earn less than \$1,000 a year

- In comparison, more than 40% of participants from the US and Australia earn \$100k+ a year
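
A headline figure such as "more than 40% earn \$100k+" comes from adding up several bracket shares. Below is a minimal sketch of that aggregation on made-up numbers. The countries, the shares and the `lower_bound` helper are hypothetical and are not part of this notebook.

```python
import pandas as pd

# Hypothetical row-normalised bracket shares per country (each row sums to 1)
toy_pct = pd.DataFrame(
    {'0-999': [0.10, 0.50],
     '50,000-59,999': [0.40, 0.40],
     '100,000-124,999': [0.30, 0.10],
     '125,000-149,999': [0.20, 0.00]},
    index=['Country A', 'Country B'],
)

def lower_bound(bracket):
    """Return the lower bound of a label like '100,000-124,999' or '> 500,000'."""
    first = bracket.replace('>', '').strip().split('-')[0]
    return int(first.replace(',', ''))

# Select every bracket starting at 100,000 or more and sum the shares per country
high_cols = [c for c in toy_pct.columns if lower_bound(c) >= 100_000]
high_share = toy_pct[high_cols].sum(axis=1)
print(high_share)
```

The same idea applied with `lower_bound(c) < 1_000` would give the share of participants in the lowest bracket.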

In [None]:
data21_salary = data21[(data21['Q5']!= 'Student') & (data21['Q5']!= 'Currently not employed')]
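# Count respondents per country and salary bracket, keep the 20 countries
# with the most respondents, then drop the 'All' margin column
data21_salary = (
    pd.crosstab(index=data21_salary['Q3'], columns=data21_salary['Q25'],
                values=data21_salary['Q25'], aggfunc='count', margins=True)
    .sort_values(by='All', ascending=False)
    .drop(index=['All', 'Other'])
    .head(20)
    .iloc[:, :-1]
)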
data21_salary = pd.DataFrame(data21_salary, columns=salary_order)
data21_salary_pct = data21_salary.div(data21_salary.sum(axis=1), axis=0)

In [None]:
fig = plt.figure(figsize=(15,12))
sns.heatmap(data21_salary_pct, cmap='Blues', annot=True, fmt='.0%', linewidth=1, cbar=False)
plt.xlabel('')
plt.ylabel('')
plt.title('Which country has the highest % of high-income participants? (excl. students/unemployed)', y=1.02, size=15, fontweight='bold')
plt.show()