## Data Preprocessing

You can find all preprocessing steps in this [notebook](https://www.kaggle.com/katokev/2021-kaggle-machine-learning-data-science-survey/).

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import pycountry 
import plotly
import plotly.express as px
import seaborn as sb
import matplotlib.pyplot as plt
from IPython.display import Markdown as md, HTML, display

%matplotlib inline

In [None]:
country_region = pd.read_csv('../input/kaggle-surveys/country_region.csv')
df_19 = pd.read_csv('../input/kaggle-surveys/df_19.csv')
df_20 = pd.read_csv('../input/kaggle-surveys/df_20.csv')
df_21 = pd.read_csv('../input/kaggle-surveys/df_21.csv')

# Survey time

The first thing I wanted to know was how long respondents took to complete the survey, especially since its structure changed after 2019.
The number of questions increased from 2019 to 2021, yet the change in completion time is not huge. However, when we look at how the time is distributed by `gender` and `age`, we find that younger respondents have more outliers, and these outliers take more time.

One more observation: unlike the previous surveys, 2021 has no low-value outliers. It would be interesting to know whether the survey mechanism was changed.
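
One way to make the outlier observation concrete is to count, per survey year, how many completion times fall outside the usual box-plot whiskers (1.5 × IQR beyond the quartiles). The sketch below does this on a tiny made-up frame; the `toy` values and the helper `count_outliers` are illustrative assumptions, not survey data and not part of the original notebook.

```python
import pandas as pd

# Made-up completion times in seconds; NOT taken from the survey files.
toy = pd.DataFrame({
    'year': [2019] * 6 + [2021] * 6,
    'time_in_seconds': [30, 500, 600, 700, 800, 90000,
                        400, 500, 600, 700, 800, 90000],
})

def count_outliers(times):
    """Count values below/above the 1.5 * IQR whiskers of a box plot."""
    q1, q3 = times.quantile(0.25), times.quantile(0.75)
    iqr = q3 - q1
    low = int((times < q1 - 1.5 * iqr).sum())
    high = int((times > q3 + 1.5 * iqr).sum())
    return {'low': low, 'high': high}

# One row per year, with the number of low and high outliers.
outliers = pd.DataFrame({
    year: count_outliers(group)
    for year, group in toy.groupby('year')['time_in_seconds']
}).T
print(outliers)
```

With the real data, the same helper could be applied to `time_in_seconds` grouped by year (and optionally by `age_range` or `gender`) to check whether 2021 really lacks low-value outliers.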

In [None]:
# numbers of questions in each survey & time taken by gender

data19 = df_19.loc[:, ['gender', 'age_range','time_in_seconds']]
data19['year'] = '2019 (36 Qs)'

data20 = df_20.loc[:, ['gender', 'age_range','time_in_seconds']]
data20['year'] = '2020 (47 Qs)'

data21 = df_21.loc[:, ['gender', 'age_range','time_in_seconds']]
data21['year'] = '2021 (51 Qs)'

data = pd.concat([data19, data20, data21], ignore_index=True)


In [None]:
fig = px.box(data, y="age_range", x="time_in_seconds", color='gender', facet_col="year", log_x=True,
             height = 1000, width=1200,
            labels=dict(age_range="Age", time_in_seconds="Log(Time) (seconds)", gender="Gender"))
fig.for_each_annotation(lambda a: a.update(text=a.text.split("=")[-1]))
fig.update_yaxes(categoryorder='array', categoryarray= ['18-21', '22-24', '25-29', '30-34', '35-39',
                                                       '40-44', '45-49', '50-54', '55-59', '60-69', '70'])
fig.show()

# General Profiling

At first glance, you might think you share the characteristics of the sample Kaggle collected responses from.
Overall, the collected data doesn't change much before and after **B.C.** (Before Covid).

A typical profile would be:
- **Country of origin:** one of the top 3 entries [India, United States, Other]
- **Age range:** one of [18-21, 22-24, 25-29]
- **Gender:** 79% chance of being male
- **Education:** holding a master's or bachelor's degree
- **Profession:** if not a full-time student, the respondent is most probably a data scientist whose main activity is *Analyze and understand data to influence product or business decisions*, in the field of Technology
- **Programming skills:** using at most 3 languages, namely Python, SQL, and R, with an average of 3 years of coding experience; favorite IDE(s): *Jupyter (JupyterLab, Jupyter Notebooks, etc), Visual Studio Code (VSCode), PyCharm*
- **ML skills:** <2 years of experience, mostly using *Linear or Logistic Regression, Decision Trees or Random Forests*, and spending less than $1K on ML

# Closer Look

Let's view the main changes over the past 3 years. I'll start with ***demographics***!

In 2019, **59** countries were represented in the survey
> <b><span style="color: #FF0000">8</span></b> of them dropped out .. <b><span style="color: #228B22">4</span></b> new countries joined the race

That made <b>55</b> countries in the 2020 survey.
> <b><span style="color: #228B22">5</span></b> out of the dropped countries returned .. plus <b><span style="color: #228B22">5</span></b> brand new countries .. in addition to the new option <span style="background-color: #F5F5F5">not to disclose location</span>

That brings the 2021 survey to a total of **66** countries.
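
The dropped/joined bookkeeping above boils down to set differences between the country lists of two survey years. Here is a minimal sketch with hypothetical country sets (the names are placeholders, not the actual lists); the closing arithmetic assumes the "undisclosed" option is counted as one extra entry, which is what makes the reported totals add up.

```python
# Hypothetical country sets, only to illustrate the bookkeeping.
countries_19 = {'India', 'United States', 'Brazil', 'Norway'}
countries_20 = {'India', 'United States', 'Brazil', 'Ghana'}

dropped = countries_19 - countries_20   # present in 2019, missing in 2020
joined = countries_20 - countries_19    # new in 2020

print(sorted(dropped), sorted(joined))

# The totals reported in the text follow the same arithmetic.
total_20 = 59 - 8 + 4               # 2019 total, minus dropped, plus new
total_21 = total_20 + 5 + 5 + 1     # returning + brand new + "undisclosed" (assumed to count as one)
print(total_20, total_21)
```

With the real frames, the sets could be built with something like `set(df_19['country'])`.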

In [None]:
highest_index = country_region.sort_values('human development index', ascending=False)['country'].head(1).item()
highest_internet = country_region.sort_values('internet users', ascending=False)['country'].head(1).item()

# look up the row with the largest world share once, then read both fields from it
largest_row = country_region.loc[country_region['world_share'].idxmax()]
largest_country = largest_row['country']
largest_share = largest_row['world_share']*100

In [None]:
md("You might think that the Kaggle world mirrors the real world, meaning it would make perfect sense if **{}**, which has the highest world share, dominated the survey with a share at or near its world share of **{:.2f}%**. But that is not the case here!".format(largest_country, largest_share))

In [None]:
md("Okay, then we would assume the dominating country would be the one with the highest *human development index*: **{}**. If not, it would definitely be the one with the highest number of *internet users*: **{}**. But no!".format(highest_index, highest_internet))

<center><b>India</b> has consistently dominated the survey data, accounting for more than <span style="background-color: #F5F5F5">20%</span> of it in each of the past 3 years.</center>

<center>In second place comes the <b>USA</b>, with a share equal to <span style="background-color: #F5F5F5">half</span> of India's!</center>

<center>And in third place comes <b>Other</b>, with a share slightly above or below <span style="background-color: #F5F5F5">5%</span> of the survey data!</center>

In [None]:
# rename_axis('index') keeps the country column named 'index' across pandas versions
# (pandas >= 2.0 would otherwise name it 'country'); the plots and merge below rely on it
con19 = df_19['country'].value_counts().rename_axis('index').reset_index(name='count')
con19['perc'] = (con19['count']/con19['count'].sum())*100
con19['year']=2019
con20 = df_20['country'].value_counts().rename_axis('index').reset_index(name='count')
con20['perc'] = (con20['count']/con20['count'].sum())*100
con20['year']=2020
con21 = df_21['country'].value_counts().rename_axis('index').reset_index(name='count')
con21['perc'] = (con21['count']/con21['count'].sum())*100
con21['year']=2021

data = pd.concat([con19, con20, con21], ignore_index=True)

fig = px.bar(data[data['perc']>= data['perc'].quantile(0.95)], x="year", y="perc",
              facet_col="index", facet_col_wrap=7,
              facet_row_spacing=0.04, # default is 0.07 when facet_col_wrap is used
              facet_col_spacing=0.09, # default is 0.03
              height=300, width=800,
              range_x = [2018,2022],
              range_y = [0,35],
              title="Countries with volume of at least 5% of Surveys' total records for 2019-2021",
            labels=dict(count="", year="Year"))
fig.for_each_annotation(lambda a: a.update(text=a.text.split("=")[-1]))

fig.show()

When we look at the remaining countries (whose respondents constitute <5% of the data), we find a pattern of developing versus developed countries.

`Developed countries` *(except Japan, UAE, Israel & South Korea)* seem to have lost representation over the past 3 years. `Developing countries`, on the other hand, maintain their presence, and a few of them do even better: they either newly joined or increased their representation.

In [None]:
subdata = country_region.loc[:, ['country', 'region', 'developing']]
data = data.merge(subdata, how='left', left_on='index', right_on='country')

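# assign the result back instead of calling fillna(inplace=True) on a column selection,
# which newer pandas versions warn about and may not apply to `data`
data['developing'] = data['developing'].fillna('NA')
data['region'] = data['region'].fillna('NA')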

In [None]:
fig = px.bar(data[data['perc']< data['perc'].quantile(0.95)], x="year", y="perc", color = 'developing',
             facet_col="index", facet_col_wrap=7,
              facet_row_spacing=0.04, # default is 0.07 when facet_col_wrap is used
              height=1000, width=800,
              range_x = [2018,2022],
              range_y = [0,4],
              title="Survey population percentage of remaining countries (2019-2021)",
            labels=dict(count="", year="Year", developing="is a developing country?"))
fig.for_each_annotation(lambda a: a.update(text=a.text.split("=")[-1]))
fig.show()

##### And just to see how the world, as regions, is represented in the `Kaggle` universe, let's have a look at their percentages over the past 3 years.

##### Shout out to `Africa`, `Latin/South America`, & `Middle East` for sustaining their presence while `Europe & North America` have a decreasing percentage in the face of the `Asia & Pacific` region!

##### Side note: NA here stands for `other` & `undisclosed` countries. It would be very useful to know their details, as they hold up nicely in the race!

In [None]:
data19 = data[(data['year']==2019)].groupby('region')['count'].sum().reset_index(name='perc')
data19 = data19.sort_values('region')
data20 = data[(data['year']==2020)].groupby('region')['count'].sum().reset_index(name='perc')
data20 = data20.sort_values('region')
data21 = data[(data['year']==2021)].groupby('region')['count'].sum().reset_index(name='perc')
data21 = data21.sort_values('region')

In [None]:
import plotly.graph_objects as go
from plotly.subplots import make_subplots

colors = ['orange', 'lightyellow','powderblue', 'violet', 'crimson', 'thistle', 'lightgreen', 'cornflowerblue']
labels = data21['region'].tolist()

fig = make_subplots(1, 3, specs=[[{'type':'domain'}, {'type':'domain'}, {'type':'domain'}]],
                    subplot_titles=['2019', '2020', '2021'])
# label each pie with its own year's regions so labels and values always line up
fig.add_trace(go.Pie(labels=data19['region'], values=data19['perc'], 
                     name="2019", pull=[0, 0, 0, 0, 0, 0.1]), 1, 1)
fig.add_trace(go.Pie(labels=data20['region'], values=data20['perc'], 
                     name="2020", pull=[0, 0, 0, 0, 0, 0.1]), 1, 2)
fig.add_trace(go.Pie(labels=data21['region'], values=data21['perc'], 
                     name="2021", pull=[0, 0, 0, 0, 0, 0.1]), 1, 3)
fig.update_traces(hoverinfo='label+percent', 
                  marker=dict(colors=colors, line=dict(color='#000000')))
fig.update(layout_title_text="Regions' representation in 2019-2021 surveys")
fig.show()

In [None]:
def count_variable(group_cols):
    """Count respondents per combination of `group_cols` for each survey year.

    `perc` is the share of that year's respondents, in percent.
    """
    frames = []
    for year, df in [(2019, df_19), (2020, df_20), (2021, df_21)]:
        counts = (df.groupby(group_cols).size()
                    .reset_index(name='count')
                    .sort_values('count', ascending=False))
        counts['perc'] = round(counts['count']/df.shape[0]*100, 4)
        counts['year'] = year
        frames.append(counts)

    return pd.concat(frames, ignore_index=True)

# Closer Look - Cont'd

### What's New after B.C.?

Kaggle has attracted a different audience after B.C. The charts below show how the spectrum changed over the past 3 years in terms of *age*, *education*, & *title*, split by gender.

In [None]:
plot_data = count_variable(['gender','age_range'])
fig = px.bar(plot_data, x="age_range", y="perc", color="gender",
  animation_frame="year",animation_group="age_range", range_y=[0,25],
             title="Overall percentage of Age per Gender (2019-2021)",
            labels=dict(age_range="Age Range", perc="percentage", gender="Gender"))
fig.update_xaxes(categoryorder='array', categoryarray= ['18-21', '22-24', '25-29', '30-34', '35-39',
                                                       '40-44', '45-49', '50-54', '55-59', '60-69', '70'])

fig.show()

In [None]:
plot_data = count_variable(['gender','education'])
fig = px.bar(plot_data, x="education", y="perc", color="gender",
  animation_frame="year",animation_group="education", range_y=[0,50],
             title="Overall percentage of Education per Gender (2019-2021)",
            labels=dict(education="Education", perc="percentage", gender="Gender"))
fig.update_xaxes(categoryorder='array', categoryarray= ['Some college/university study without earning a bachelor’s degree',
                                                        'Bachelor’s degree',
                                                        'Master’s degree',  
                                                        'Doctoral degree', 
                                                        'I prefer not to answer',
                                                        'Professional doctorate'])


fig.show()

##### The most obvious observation is that the `Data Scientist` percentage decreased when `Machine Learning Engineer` was introduced in 2020.

In [None]:
plot_data = count_variable(['gender','title'])
fig = px.bar(plot_data, x="title", y="perc", color="gender",
  animation_frame="year",animation_group="title", range_y=[0,30],
             title="Overall percentage of Title per Gender (2019-2021)",
            labels=dict(title="Title", perc="percentage", gender="Gender"))
fig.update_xaxes(categoryorder='array', categoryarray= ['Not employed', 
                                                        'Student',
                                                        'Other', 
                                                        'Product/Project Manager',
                                                        'Software Engineer',
                                                        'Research Scientist',
                                                        'Data Scientist', 
                                                        'Data Analyst', 
                                                        'Machine Learning Engineer', 
                                                        'Business Analyst', 
                                                        'Data Engineer', 
                                                        'Statistician', 
                                                        'Developer Relations/Advocacy',
                                                        'DBA/Database Engineer'])

fig.show()

Here is an interesting finding from comparing the data of the past 3 years: for some professions, men are not the highest paid; however, men's salaries are not bounded by a certain ceiling the way other genders' salaries are!

Check out the comparison below of average salary vs. years of coding, split by gender and title.
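
To see how a group can have the higher typical salary while another group has the higher ceiling, it can help to put the median next to the maximum. The frame below is made up purely for illustration (it is not survey data); only the column names mirror the ones used in this notebook.

```python
import pandas as pd

# Made-up salaries; NOT survey data.
toy = pd.DataFrame({
    'title': ['Data Scientist'] * 6,
    'gender': ['Man', 'Man', 'Man', 'Woman', 'Woman', 'Woman'],
    'yearly_salary_avg': [40000, 60000, 300000, 70000, 80000, 90000],
})

# Median shows the typical salary, max shows how far the upper end reaches.
summary = (toy.groupby(['title', 'gender'])['yearly_salary_avg']
              .agg(['median', 'max']))
print(summary)
```

In this toy example the women have the higher median, while the men have the much higher maximum, which is the kind of pattern the box plots make visible.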

In [None]:
# keep only respondents who reported a salary, and tag each row with its survey year
salary_cols = ['coding_years', 'title', 'gender','yearly_salary_avg']
data = pd.concat(
    [df[salary_cols].dropna(subset=['yearly_salary_avg']).assign(year=year)
     for year, df in [(2019, df_19), (2020, df_20), (2021, df_21)]],
    ignore_index=True)

# remove outliers
data = data[data['yearly_salary_avg']<=600000]

fig = px.box(data, x="coding_years", y="yearly_salary_avg", facet_col="year",
             color="gender", hover_name="gender", facet_row="title",
             height=3500, width=2000, facet_row_spacing=0.009,
             title="Average reported salary per years of coding, split by profession per gender (2019-2021)",
             labels=dict(coding_years="# of Coding Years", yearly_salary_avg="Average Salary ($)", gender="Gender")
)
fig.update_xaxes(categoryorder='array', categoryarray= ['-', '1', '1-2', 
                                                        '1-3', '3-5', '5-10',
                                                        '10-20', '20'])
fig.for_each_annotation(lambda a: a.update(text=a.text.split("=")[-1]))
fig.show()

# Sub-Grouping

I chose to split our sample space into 3 groups: 

- the emerging audience (18-21 yrs)
- the least represented ages (40+ yrs)
- those in-between

Let's dig deeper into their world to know more about their background, tools and plans!

#### Here are the top 5 values for each subgroup over 2019, 2020 & 2021 for the following:

**gender, country, education, title, learn first recommendation, company size, if business is using ML, # of data science workers, yearly salary, ML spending, data primary tool, # of coding years, TPU avg. usage, # of ML years**
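
A note on what "top 5" means here: the ranking cell further down uses dense ranking, so tied percentages share a rank and a "top N" selection can return more than N rows. A minimal sketch on made-up numbers (the `toy` frame is an assumption for illustration only):

```python
import pandas as pd

# Made-up percentages for one variable in one subgroup and year.
toy = pd.DataFrame({
    'value': ['Python', 'SQL', 'R', 'Java'],
    'perc': [40.0, 25.0, 25.0, 10.0],
})

# Dense ranking gives tied values the same rank and leaves no gaps.
toy['rank'] = toy['perc'].rank(method='dense', ascending=False).astype(int)

# "Top 2" keeps ranks 1 and 2, which is three rows here because of the tie.
top2 = toy[toy['rank'].between(1, 2)]
print(top2)
```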

In [None]:
def get_stats():
    """Value counts of every survey column, per age subgroup and survey year.

    Percentages are relative to the size of each subgroup within its year.
    """
    subgroups = {
        'youth': ['18-21'],
        'in-between': ['22-24', '25-29', '30-34', '35-39'],
        'least': ['40-44', '45-49', '50-54', '55-59', '60-69', '70'],
    }
    surveys = {2019: df_19, 2020: df_20, 2021: df_21}

    frames = []
    for subgroup, ages in subgroups.items():
        for year, df in surveys.items():
            sub = df[df['age_range'].isin(ages)]
            sub = sub.drop(columns=['age_range', 'time_in_seconds'])

            for col in sub:
                df_tmp = sub.groupby(col).size().reset_index(name='count').sort_values('count', ascending=False)
                df_tmp['perc'] = round(df_tmp['count']/sub.shape[0]*100, 2)
                df_tmp['year'] = year
                df_tmp['variable'] = col
                df_tmp['subgroup'] = subgroup
                frames.append(df_tmp.rename(columns={col:'value'}))

    # rows are ordered youth, in-between, least, and by year within each subgroup
    return pd.concat(frames, ignore_index=True)

In [None]:
data = get_stats()

In [None]:
# human-readable label -> survey columns it applies to
label_to_columns = {
    'Role Activities': ['role_activities_count', 'combined_work_activities'],
    'Media Platforms': ['media_plat_count', 'combined_media_plat'],
    'Course Platforms': ['course_plat_count', 'combined_course_plat'],
    'Data Primary Tools': ['data_primary_tool_count'],
    'IDEs': ['ide_count', 'combined_IDE'],
    'Notebooks': ['notebook_count', 'combined_notebook'],
    'Programming Languages': ['language_count', 'combined_lang'],
    'Visualization Libs': ['viz_lib_count', 'combined_libs'],
    'Hardware': ['hardware_count', 'combined_HW'],
    'ML Algorithms': ['ml_algo_count', 'combined_ml_algo'],
    'Auto ML': ['auto_ml_count', 'combined_auto_ml'],
    'CV methods': ['cv_method_count', 'combined_cv_method'],
    'NLP methods': ['nlp_method_count', 'combined_nlp_method'],
    'ML frameworks': ['ml_framework_count', 'combined_ml_framework'],
    'Cloud Platforms': ['cloud_plat_count', 'combined_cloud_platform'],
    'Cloud Products': ['cloud_prod_count', 'combined_cloud_prod'],
    'Bigdata Products': ['bigdata_prod_count', 'combined_bigdata_prod'],
    'ML Management': ['ml_manage_count', 'combined_ml_manage'],
    'Regular autoML': ['auto_ml_regular_count', 'combined_auto_ml_regular'],
    'ML Products': ['ml_prod_count', 'combined_ml_prod'],
    'BI Products': ['bi_prod_count', 'combined_bi'],
    'Deploy Env.': ['deploy_count', 'combined_deploy'],
    'Future Cloud Plats': ['f_cloud_plat_count', 'combined_f_cloud_plat'],
    'Future Cloud Prods': ['f_cloud_prod_count', 'combined_f_cloud_prod'],
    'Future ML Prods': ['f_ml_prod_count', 'combined_f_ml_prod'],
    'Future Bigdata Prods': ['f_bigdata_prod_count', 'combined_f_bigdata_prod'],
    'Future BI Prods': ['f_bi_prod_count', 'combined_f_bi_prod'],
    'Future autoML': ['f_auto_ml_count'],
    'Future Regular autoML': ['f_auto_ml_regular_count'],
    'Future ML Management': ['f_manage_ml_count', 'combined_f_manage_ml'],
    'Storage': ['storage_count', 'combined_storage'],
    'Future Storage': ['f_storage_count'],
    'Future autoML Cat.': ['f_automl_cat_count'],
    'Future autoML tools': ['f_automl_tool_count'],
    'Company Size': ['company_size'],
    'Data Scientist #': ['data_science_workers'],
    'Business Using ML?': ['business_using_ml'],
    'Yearly Salary Range': ['yearly_salary'],
    'Yearly ML Spend Range': ['ml_spending'],
    'Data Primary Tool': ['data_primary_tool'],
    'Coding experience (yrs)': ['coding_years'],
    'TPU average usage': ['tpu_avg'],
    'ML experience (yrs)': ['ml_years'],
    'Recommended to learn first': ['learn_first_recommendation'],
    'Bigdata Products Often used': ['often_bigdata_products'],
    'Bigdata Regular Products': ['combined_bigdata_prod_regular'],
    'Regular Computing Platforms': ['often_computing_platform'],
    'Future AutoML Cat.': ['combined_f_automl_cat'],
    'Future AutoML Tools': ['combined_f_automl_tool'],
    'Industry': ['industry'],
}
variable_labels = {col: label for label, cols in label_to_columns.items() for col in cols}

# columns without an entry keep their original name
data['new_variable'] = data['variable'].map(variable_labels).fillna(data['variable'])

In [None]:
data['rank'] = (data.groupby(['variable', 'year', 'subgroup'])['perc']
                      .rank(method='dense', ascending=False)
                      .astype(int)
                   )

In [None]:
def draw_pivots(values, cells_youth, cells_between, cells_least):
    """Render one table section per variable, with the three subgroup pivots side by side."""
    html = """<style type="text/css">.tg  {border:none;}</style><table>"""
    
    for i in range(len(values)):
        # .unique() returns an array; take its first entry so the label is a plain string
        new_variable = data[data['variable']==values[i]]['new_variable'].unique()[0]
        html += """<thead>
        <tr> 
            <th colspan="3" style="text-align: center;"> Showing <em><u>""" +new_variable+ """</u></em> Data for the 3 subgroups .. pivoted on Survey year</th>
        </tr>
        <tr>
            <td style="text-align:center !important">Youth</td>
            <td style="text-align:center !important">In-Between</td>
            <td style="text-align:center !important">Least</td>
        </tr>
          <tr style="text-align:center !important">
            <td >"""+cells_youth[i]+"""</td>
            <td >"""+cells_between[i]+"""</td>
            <td >"""+cells_least[i]+"""</td>
          </tr>
        </thead>
        """
    
    html += "<tbody></tbody></table>"
    
    # html is a plain string now, so it can be displayed without any cleanup
    display(HTML(html))

In [None]:
def get_top_values(rank, values):
    """Show the top-`rank` values of each variable for the three subgroups."""
    subdata = data[(data['rank'].between(1,rank)) & (data['variable'].isin(values))]
    
    def subgroup_cells(subgroup):
        # one HTML pivot (value x year) per variable for the given subgroup
        cells = []
        for item in values:
            show_data = subdata[(subdata['variable']==item) & (subdata['subgroup']==subgroup)]
            show_data = show_data.pivot(index=['value'], columns='year', values='perc')
            show_data = show_data.fillna('')
            show_data.index.name = None
            cells.append(show_data.to_html())
        return cells
    
    draw_pivots(values, subgroup_cells('youth'), subgroup_cells('in-between'), subgroup_cells('least'))

In [None]:
values = ['gender', 'country', 'education', 'title',
          'learn_first_recommendation',  'company_size','business_using_ml',
          'data_science_workers', 'yearly_salary', 'ml_spending',
          'data_primary_tool', 'coding_years', 'tpu_avg', 'ml_years']

get_top_values(5, values)

#### How about the number of tools, languages, .. etc these 3 subgroups use?

##### I counted the number of choices selected in every multiple choice question to see whether any trends appear when comparing the 2019, 2020, and 2021 data. Here is what's noticeable:
- Media/Course platforms were more popular B.C.!
- Respondents are using more visualization libs, IDEs, programming languages, notebooks, and HW than in previous years
- Respondents of this year's survey want to learn more & have bigger future plans than their predecessors
- AutoML, ML management tools & cloud platforms are used more by the least represented group
- Older groups know/use more programming languages than the emerging one
- The number of role activities increases with age!

Side note: percentages are calculated per subgroup relative to itself, not relative to the total sample space
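
The difference between "relative to the subgroup" and "relative to the total sample" matters because the subgroups differ a lot in size. The sketch below contrasts the two on made-up counts. As a simplifying assumption it divides by the sum of counts in each subgroup, whereas the notebook divides by the number of respondents in the subgroup.

```python
import pandas as pd

# Made-up counts; NOT survey data.
toy = pd.DataFrame({
    'subgroup': ['youth', 'youth', 'least', 'least'],
    'value': [1, 2, 1, 2],          # e.g. number of languages selected
    'count': [30, 10, 2, 6],
})

# Share of each value within its own subgroup ...
subgroup_totals = toy.groupby('subgroup')['count'].transform('sum')
toy['perc_in_subgroup'] = toy['count'] / subgroup_totals * 100

# ... versus its share of the whole sample.
toy['perc_of_total'] = toy['count'] / toy['count'].sum() * 100
print(toy)
```

Within its own subgroup, the "least" group mostly selects 2 (75%), a pattern that is hard to see in the total-sample shares because the group is so small.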

In [None]:
subdata = data[data['variable'].isin(['often_bigdata_products', 'role_activities_count', 'media_plat_count', 'course_plat_count',
 'data_primary_tool_count', 'ide_count', 'notebook_count', 'language_count',
 'viz_lib_count', 'hardware_count', 'ml_algo_count', 'auto_ml_count', 'cv_method_count', 'nlp_method_count',
 'ml_framework_count', 'cloud_plat_count', 'cloud_prod_count', 'bigdata_prod_count', 'ml_manage_count',
 'auto_ml_regular_count',  'ml_prod_count', 'bi_prod_count',
 'deploy_count', 'f_cloud_plat_count', 'f_cloud_prod_count', 'f_ml_prod_count',
 'f_bigdata_prod_count', 'f_bi_prod_count', 'f_auto_ml_count', 'f_auto_ml_regular_count',
 'f_manage_ml_count', 'storage_count', 'f_storage_count', 'f_automl_cat_count', 'f_automl_tool_count'])]

fig = px.bar(subdata, x='value', y="perc", color='year', facet_row='new_variable', facet_col='subgroup',
             facet_row_spacing=0.002, barmode='group',
             facet_col_spacing=0.02, # default is 0.07 when facet_col_wrap is used
              height=4500, width=1100,range_x=[0,10],range_y=[0,100],
             title="Count Percentage of multiple choice Qs per Year",
            )
fig.for_each_annotation(lambda a: a.update(text=a.text.split("=")[-1]))
fig.show()

#### Now that we know about the counts, let's look at the content ..

#### Here are the top 3 values for each subgroup over 2019, 2020 & 2021 for all MCQs!

In [None]:
values = ['combined_work_activities', 'combined_media_plat',
       'combined_course_plat', 
       'combined_IDE', 'combined_notebook', 'combined_lang',
       'combined_libs', 'combined_HW', 'combined_ml_algo',
       'combined_auto_ml', 'combined_cv_method', 'combined_nlp_method',
       'combined_ml_framework', 'combined_cloud_platform',
       'combined_cloud_prod', 'combined_bigdata_prod',
       'combined_ml_manage', 'combined_auto_ml_regular',
       'combined_bigdata_prod_regular', 
       'often_computing_platform', 
       'combined_ml_prod', 'combined_bi', 'combined_deploy',
       'combined_f_cloud_plat', 'combined_f_cloud_prod',
       'combined_f_ml_prod', 'combined_f_bigdata_prod',
       'combined_f_bi_prod', 'combined_f_automl_cat',
       'combined_f_automl_tool', 'combined_f_manage_ml', 'industry',
        'combined_storage']

get_top_values(3, values)

# Conclusion

- Splitting our sample space by age and comparing the subgroups across the 3 years in scope was insightful: it enabled us to see the differences and how they changed over time.

- Of course, this is not the only way to look at our data. Every time our goal changes, we might exclude or include various aspects of the data, whether collected by the survey or additional complementary data that helps us make the right decisions.

- I see many opportunities coming from the future MCQs. Maybe Kaggle can consider making those questions part of the original survey, or even make a separate survey for this type of question in the future.

*Side note: Further cleaning of the data can be done (specifically for the choices of the multiple choice questions)*