# One chart, many answers: Kaggle Surveys in Slopes 

![](https://media.giphy.com/media/SwyVL4IjvWMfncmM9h/giphy.gif)

<div style="font-family:Helvetica Neue; font-size:16px; line-height:1.7; color:slategray;">
In previous surveys, I explored <a href="https://www.kaggle.com/andresionek/what-makes-a-kaggler-valuable">What Makes a Kaggler Valuable</a> and compared job posts with survey answers in <a href="https://www.kaggle.com/andresionek/is-there-any-job-out-there-kaggle-vs-glassdoor">Is there any job out there? Kaggle vs Glassdoor</a>.
<br><br>
This is the 4th Kaggle Survey, so I decided to explore trends over time. Unfortunately, the 2017 survey was very different from the others, so I excluded it from the analysis. That left me with the 2018, 2019 and 2020 surveys, from which I tried to extract as much value as possible.
<br><br>
    <b>With one extra challenge: use only one chart type.</b>
</div>
<br>
<h3>I present to you Kaggle Surveys in Slopes! Enjoy!</h3>


## Slope Graphs - How to read them?

<div style="font-family:Helvetica Neue; font-size:16px; line-height:1.7; color:slategray;">
    Despite the fancy name, Slope Graphs are simple line charts, the ones you are already familiar with. But let me give you an intro on how to read the charts I'm presenting here. <b>I promise you only need to learn it once!</b>
<br><br>
Let's look at this example:
</div>
<img src="https://i.imgur.com/EX1X0Zi.png" align="left" style="width:600px;"/>

<div style="font-family:Helvetica Neue; font-size:16px; line-height:1.7; color:slategray;">
    Here we have two Slope Graphs, one for women and another for men. Note that they share the y axis.

<img src="https://i.imgur.com/19JgPzl.png" align="left" style="width:600px;"/>
    </div>

<div style="font-family:Helvetica Neue; font-size:16px; line-height:1.7; color:slategray;">
    Each line in this chart represents a country. This information is available in the subtitle and also when you hover your mouse over the data points.

<img src="https://i.imgur.com/1g0LdeL.png" align="left" style="width:600px;"/>
    </div>

<div style="font-family:Helvetica Neue; font-size:16px; line-height:1.7; color:slategray;">
    
<div class="alert alert-warning">
  <strong>Warning!</strong> For all charts in this study we applied a filter to select only Professionals (people who are actively working).
</div>
</div>
<br>
<div style="font-family:Helvetica Neue; font-size:16px; line-height:1.7; color:slategray;">
Non-professionals were defined as respondents whose Job Title (Q5) answer was one of:
<ul>
<li>Student</li>
<li>Currently not employed</li>
<li>No answer (NaN)</li>
</ul>
<br>
<h3 style="color:red;">Professionals were defined as everyone but the non-professionals.</h3>
<br>
<b>Now let's start the fun part!</b>
</div>
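
The filter described above can be sketched in plain Python. This is only a minimal illustration under stated assumptions, not the notebook's implementation (which works on a pandas DataFrame): the sample respondents and the `is_professional` helper are made up for the example.

```python
import math

# Job titles treated as non-professional; a missing answer (None/NaN) also counts.
NON_PROFESSIONAL_TITLES = {"Student", "Currently not employed"}


def is_professional(job_title):
    """Return True when the Job Title (Q5) answer identifies a working professional."""
    if job_title is None:
        return False
    if isinstance(job_title, float) and math.isnan(job_title):
        return False
    return job_title not in NON_PROFESSIONAL_TITLES


# Hypothetical respondents, for illustration only.
respondents = [
    {"id": 1, "job_title": "Data Scientist"},
    {"id": 2, "job_title": "Student"},
    {"id": 3, "job_title": None},
    {"id": 4, "job_title": "Currently not employed"},
    {"id": 5, "job_title": "Software Engineer"},
]

# Keep everyone who is not a student, unemployed, or missing an answer.
professionals = [r for r in respondents if is_professional(r["job_title"])]
print([r["id"] for r in professionals])
```

Only respondents 1 and 5 survive the filter, which matches the "everyone but the non-professionals" definition.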

In [None]:
"""
Before starting, I created a spreadsheet mapping all questions across the 4 years of surveys.

https://docs.google.com/spreadsheets/d/1HpVi0ipElWYxwXali7QlIbMWjCQWk6nuaZRAZLcksn4/edit?usp=sharing

Some questions were the same through the years and had exactly the same wording.
Others had changes that did not compromise the meaning of the question too much. For example:

2020 - For how many years have you been writing code and/or programming?
2019 - How long have you been writing code to analyze data (at work or at school)?

Or

2020 - Which of the following big data products (relational databases, data warehouses, data lakes, or similar) do you use on a regular basis?
2019 - Which specific big data / analytics products do you use on a regular basis?
2018 - Which of the following big data and analytics products have you used at work or school in the last 5 years?


---

Other questions were worded so differently that their meaning changed, so they were excluded from this analysis.

I picked only questions that were the same over the last 3 years (2020, 2019 and 2018). 
The 2017 survey was very different from the others and only a few questions were useful, so I decided to exclude 2017 from the analysis.


## ## ## ## ## 
I suggest that Kaggle keeps the survey consistent over the following years to allow better time-series analysis. 
## ## ## ## ##


Note: I'm trying to write functions for all transformations following the single responsibility principle.
"""
pass

In [None]:
from enum import Enum
import numpy as np


class Mapping(Enum):
    """
    Contains dicts mapping values found in the surveys to the values we want to replace them with.
    """
    COMPENSATION={ 
        '$0-999': '0-10k',
        '1,000-1,999': '0-10k',
        '2,000-2,999': '0-10k',
        '3,000-3,999': '0-10k',
        '4,000-4,999': '0-10k',
        '5,000-7,499': '0-10k',
        '7,500-9,999': '0-10k',
        '10,000-14,999': '10-20k',
        '15,000-19,999': '10-20k',
        '20,000-24,999': '20-30k',
        '25,000-29,999': '20-30k',
        '30,000-39,999': '30-40k',
        '40,000-49,999': '40-50k',
        '50,000-59,999': '50-60k',
        '60,000-69,999': '60-70k',
        '70,000-79,999': '70-80k',
        '80,000-89,999': '80-90k',
        '90,000-99,999': '90-100k',
        '100,000-124,999': '100-125k',
        '125,000-149,999': '125-150k',
        '150,000-199,999': '150-200k',
        '200,000-249,999': '200-250k',
        '300,000-500,000': '300-500k',
        '> $500,000': np.nan,
        '0-10,000': '0-10k',
        '10-20,000': '10-20k',
        '20-30,000': '20-30k',
        '30-40,000': '30-40k',
        '40-50,000': '40-50k',
        '50-60,000': '50-60k',
        '60-70,000': '60-70k',
        '70-80,000': '70-80k',
        '80-90,000': '80-90k',
        '90-100,000': '90-100k',
        '100-125,000': '100-125k',
        '125-150,000': '125-150k',
        '150-200,000': '150-200k',
        '200-250,000': '200-250k',
        '300-400,000': '300-500k',
        '400-500,000': '300-500k',
        '500,000+': np.nan,
        'I do not wish to disclose my approximate yearly compensation': np.nan
    }
    JOB_TITLE={
        'Data Scientist': 'Data Scientist',
        'Software Engineer': 'Software Engineer',
        'Data Analyst': 'Data Analyst',
        'Other': 'Other',
        'Research Scientist': 'Research Scientist/Statistician',
        'Business Analyst': 'Business Analyst',
        'Product/Project Manager': 'Product/Project Manager',
        'Data Engineer': 'Data Engineer/DBA',
        'Not employed': 'Currently not employed',
        'Machine Learning Engineer': 'Machine Learning Engineer',
        'Statistician': 'Research Scientist/Statistician',
        'Consultant': 'Other',
        'Research Assistant': 'Research Scientist/Statistician',
        'Manager': 'Manager/C-level',
        'DBA/Database Engineer': 'Data Engineer/DBA',
        'Chief Officer': 'Manager/C-level',
        'Developer Advocate': 'Other',
        'Marketing Analyst': 'Business Analyst',
        'Salesperson': 'Other',
        'Principal Investigator': 'Research Scientist/Statistician',
        'Data Journalist': 'Other',
        'Currently not employed': 'Currently not employed', 
        'Student': 'Student'
    } 
    GENDER={
        'Male': 'Men',
        'Female': 'Women',
        'Man': 'Men',
        'Woman': 'Women',
        'Prefer not to say': np.nan, # Very few answers on those categories to do any meaningful analysis
        'Prefer to self-describe':  np.nan, # Very few answers on those categories to do any meaningful analysis
        'Nonbinary':  np.nan # Very few answers on those categories to do any meaningful analysis
    }
    AGE={
        '18-21': '18-21', 
        '22-24': '22-24', 
        '25-29': '25-29',
        '30-34': '30-34', 
        '35-39': '35-39', 
        '40-44': '40-44', 
        '45-49': '45-49', 
        '50-54': '50-54', 
        '55-59': '55-59', 
        '60-69': '60-69', 
        '70+': '70+',
        '70-79': '70+',
        '80+': '70+'
    }
    EDUCATION={
        'Master’s degree': 'Master’s', 
        'Bachelor’s degree': 'Bachelor’s',
        'Some college/university study without earning a bachelor’s degree': 'Some college',
        'Doctoral degree': 'Doctoral',
        'Professional degree': 'Professional',
        'I prefer not to answer': np.nan,
        'No formal education past high school': 'High school'
    }
    YEARS_WRITING_CODE={
        '3-5 years': '3-5 years',
        '1-2 years': '1-3 years',
        '2-3 years': '1-3 years',
        '5-10 years': '5-10 years',
        '10-20 years': '10+ years',
        '< 1 years': '< 1 year',
        '< 1 year': '< 1 year',
        '20+ years': '10+ years',
        np.nan: 'None',
        'I have never written code': 'None',
        'I have never written code but I want to learn': 'None',
        '20-30 years': '10+ years',
        '30-40 years': '10+ years',
        '40+ years': '10+ years'
    }    
    YEARS_WRITING_CODE_PROFILES={
        '3-5 years': '3-10 years',
        '1-2 years': '1-2 years',
        '2-3 years': '2-3 years',
        '5-10 years': '3-10 years',
        '10-20 years': '10+ years',
        '< 1 years': '0-1 years',
        '< 1 year': '0-1 years',
        '20+ years': '10+ years',
        np.nan: 'None',
        'I have never written code': 'None',
        'I have never written code but I want to learn': 'None',
        '20-30 years': '10+ years',
        '30-40 years': '10+ years',
        '40+ years': '10+ years'
    } 
    RECOMMENDED_LANGUAGE={
        'Python': 'Python',
        'R': 'R',
        'SQL': 'SQL',
        'C++': 'C++',
        'MATLAB': 'MATLAB',
        'Other': 'Other',
        'Java': 'Java',
        'C': 'C',
        'None': 'None',
        'Javascript': 'Javascript',
        'Julia': 'Julia',
        'Scala': 'Other',
        'SAS': 'Other',
        'Bash': 'Bash',
        'VBA': 'Other',
        'Go': 'Other',
        'Swift': 'Swift',
        'TypeScript': 'Other'
    } 
    LANGUAGES={
        'SQL': 'SQL', 
        'R': 'R', 
        'Java': 'Java', 
        'MATLAB': 'MATLAB', 
        'Python': 'Python', 
        'Javascript/Typescript': 'Javascript/Typescript',
        'Bash': 'Bash', 
        'Visual Basic/VBA': 'VBA', 
        'Scala': 'Scala', 
        'PHP': 'Other', 
        'C/C++': 'C/C++',
        'Other': 'Other', 
        'C#/.NET': 'Other',
        'Go': 'Other', 
        'SAS/STATA': 'Other', 
        'Ruby': 'Other', 
        'Julia': 'Julia',
        'None': 'None',
         np.nan: 'None',
        'Javascript': 'Javascript/Typescript',
        'C': 'C/C++', 
        'TypeScript': 'Javascript/Typescript', 
        'C++': 'C/C++', 
        'Swift': 'Swift'
    }
    YEARS_USING_ML={
        '1-2 years': '1-3 years',
        '2-3 years': '1-3 years',
        '< 1 year': '< 1 year',
        'Under 1 year': '< 1 year',
        '< 1 years': '< 1 year',
        '3-4 years': '3-5 years',
        '5-10 years': '5+ years',
        '4-5 years': '3-5 years',
        np.nan: 'None',
        'I have never studied machine learning but plan to learn in the future': 'None',
        'I do not use machine learning methods': 'None',
        '10-15 years': '5+ years',
        '20+ years': '5+ years',
        '10-20 years': '5+ years',
        '20 or more years': '5+ years',
        'I have never studied machine learning and I do not plan to': 'None'
    } 
    YEARS_USING_ML_PROFILES={
        '1-2 years': '1-2 years',
        '2-3 years': '2-3 years',
        '< 1 year': '0-1 years',
        'Under 1 year': '0-1 years',
        '< 1 years': '0-1 years',
        '3-4 years': '3-10 years',
        '5-10 years': '3-10 years',
        '4-5 years': '3-10 years',
        np.nan: 'None',
        'I have never studied machine learning but plan to learn in the future': 'None',
        'I do not use machine learning methods': 'None',
        '10-15 years': '10+ years',
        '20+ years': '10+ years',
        '10-20 years': '10+ years',
        '20 or more years': '10+ years',
        'I have never studied machine learning and I do not plan to': 'None'
    } 
    PRIMARY_TOOL={
        'Local development environments (RStudio, JupyterLab, etc.)': 'Local or hosted development environments',
        'Basic statistical software (Microsoft Excel, Google Sheets, etc.)': 'Basic statistical software',
        'Local or hosted development environments (RStudio, JupyterLab, etc.)': 'Local or hosted development environments',
        'Cloud-based data software & APIs (AWS, GCP, Azure, etc.)': 'Cloud-based data software & APIs',
        'Other': 'Other',
        'Advanced statistical software (SPSS, SAS, etc.)': 'Advanced statistical software',
        'Business intelligence software (Salesforce, Tableau, Spotfire, etc.)': 'Business intelligence software'
    }
    COUNTRY = {
        'India': 'India',
        'United States of America': 'United States',
        'Other': 'Other',
        'Brazil': 'Brazil',
        'Russia': 'Russia',
        'Japan': 'Japan',  
        'United Kingdom of Great Britain and Northern Ireland': 'United Kingdom',
        'Germany': 'Germany',
        'China': 'China',
        'Spain': 'Spain',
        'France': 'France',
        'Canada': 'Canada',
        'Italy': 'Italy',
        'Nigeria': 'Nigeria',
        'Turkey': 'Turkey',
        'Australia': 'Australia'
    }
    IDE={
        'None': 'None', 
        'MATLAB': 'MATLAB', 
        'RStudio': 'RStudio', 
        'Jupyter/IPython': 'Jupyter/IPython', 
        'PyCharm': 'PyCharm', 
        'Atom': 'Vim/Emacs/Atom',
        'Visual Studio': 'Visual Studio',
        'Notepad++': 'Notepad++/Sublime', 
        'Sublime Text': 'Notepad++/Sublime', 
        'IntelliJ': 'PyCharm', 
        'Spyder': 'Spyder',
        'Visual Studio Code': 'Visual Studio',
        'Vim': 'Vim/Emacs/Atom', 
        'Other': 'Other', 
        'nteract': 'Other', 
        np.nan: 'Other',
        'Jupyter (JupyterLab, Jupyter Notebooks, etc) ': 'Jupyter/IPython', 
        ' RStudio ': 'RStudio',
        ' PyCharm ': 'PyCharm', 
        ' MATLAB ': 'MATLAB', 
        '  Spyder  ': 'Spyder', 
        '  Notepad++  ': 'Notepad++/Sublime',
        '  Sublime Text  ': 'Notepad++/Sublime', 
        ' Atom ': 'Vim/Emacs/Atom',
        ' Visual Studio / Visual Studio Code ': 'Visual Studio',
        '  Vim / Emacs  ': 'Vim/Emacs/Atom',
        'Visual Studio Code (VSCode)': 'Visual Studio'
    }
    CLOUD={
        'I have not used any cloud providers': 'None', 
        'Microsoft Azure': 'Azure',
       'Google Cloud Platform (GCP)': 'GCP', 
        'Amazon Web Services (AWS)': 'AWS',
       'IBM Cloud': 'IBM/Red Hat', 
        'Other': 'Other', 
        'Alibaba Cloud': 'Alibaba', 
        np.nan: 'None',
       ' Amazon Web Services (AWS) ': 'AWS', 
        ' Google Cloud Platform (GCP) ': 'GCP',
       ' Microsoft Azure ': 'Azure', 
        'None': 'None', 
        ' Salesforce Cloud ': 'Other',
       ' Red Hat Cloud ': 'IBM/Red Hat', 
        ' VMware Cloud ': 'Other', 
        ' Alibaba Cloud ': 'Alibaba',
       ' SAP Cloud ': 'Other', 
        ' IBM Cloud ': 'IBM/Red Hat', 
        ' Oracle Cloud ': 'Other',
       ' IBM Cloud / Red Hat ': 'IBM/Red Hat',
        ' Tencent Cloud ': 'Other'
    }
    ML_STATUS={ 
        'No (we do not use ML methods)': 'Do not use ML / Do not know',
        'I do not know': 'Do not use ML / Do not know',
        'We recently started using ML methods (i.e., models in production for less than 2 years)': 'Recently started using ML',
        'We have well established ML methods (i.e., models in production for more than 2 years)':  'Well established ML',
        'We are exploring ML methods (and may one day put a model into production)': 'Exploring ML',
        'We use ML methods for generating insights (but do not put working models into production)': 'Use ML for generating insights',
        np.nan: 'Do not use ML / Do not know',
    }
    ML_FRAMEWORKS={
        'None': 'None', 
        'Prophet': 'Prophet', 
        'Scikit-Learn': 'Scikit-learn', 
        'Keras': 'Keras', 
        'TensorFlow': 'TensorFlow',
        'Spark MLlib': 'Other', 
        'Xgboost': 'Xgboost', 
        'randomForest': 'Other', 
        'lightgbm': 'LightGBM',
        'Caret': 'Caret',
        'mlr': 'Other', 
        'PyTorch': 'PyTorch', 
        'Mxnet': 'Other', 
        'CNTK': 'Other', 
        'Caffe': 'Other', 
        'H20': 'H2O', 
        'catboost': 'CatBoost',
        'Fastai': 'Fast.ai', 
        'Other': 'Other', 
        np.nan: 'None', 
        '  Scikit-learn ': 'Scikit-learn', 
        ' RandomForest': 'Other',
        ' Xgboost ': 'Xgboost', 
        ' LightGBM ': 'LightGBM',
        '  TensorFlow ': 'TensorFlow',
        ' Keras ': 'Keras', 
        ' Caret ': 'Caret',
        ' PyTorch ': 'PyTorch', 
        ' Spark MLib ': 'Other',  # same class as 'Spark MLlib' above, so all years agree
        ' Fast.ai ': 'Fast.ai', 
        ' Tidymodels ': 'Other',
        ' CatBoost ': 'CatBoost', 
        ' JAX ': 'Other', 
        ' Prophet ': 'Prophet', 
        ' H2O 3 ': 'H2O', 
        ' MXNet ': 'Other'   
    }
    
    
class Category(Enum):
    COMPENSATION=[
        'Not Disclosed', '0-10k', '10-20k', '20-30k', '30-40k', '40-50k', '50-60k', 
        '60-70k', '70-80k', '80-90k', '90-100k', '100-125k', '125-150k', '150-200k', 
        '200-250k', '300-500k'
    ]
    JOB_TITLE=[
        'Other', 'Manager/C-level', 'Product/Project Manager', 'Business Analyst', 'Data Analyst', 
        'Research Scientist/Statistician', 'Data Scientist', 'Machine Learning Engineer', 
        'Data Engineer/DBA', 'Software Engineer'
    ]  
    GENDER = ['Women', 'Men'] 
    AGE=['18-21', '22-24', '25-29', '30-34', '35-39', '40-44', '45-49', '50-54', '55-59', '60-69', '70+']
    YEARS_WRITING_CODE=['None', '< 1 year', '1-3 years', '3-5 years', '5-10 years', '10+ years']
    YEARS_USING_ML=['None', '< 1 year', '1-3 years', '3-5 years', '5+ years']
    SURVEY_YEAR=[2018, 2019, 2020]
    EDUCATION=['High school', 'Some college', 'Professional', 'Bachelor’s', 'Master’s', 'Doctoral']
    PROFILES=['Beginners', 'Others', 'Modern DS', 'Coders', 'ML Veterans']

COLORS = {
    'India': '#FE9933', 
    'Brazil': '#179B3A',
    'United States': '#002366', 
    'China': '#ED2124', 
    'Average': 'blueviolet',
    'Canada': '#F60B00',
    'Data Scientist': '#13A4B4',
    'Product/Project Manager': '#D70947',
    'Software Engineer': '#E8743B', 
    'Data Analyst': '#BF399E',
    'Data Engineer/DBA': '#144B7F',
    '< 1 year': 'lightgreen', 
    '10+ years': 'green', 
    'Women': 'hotpink', 
    'Men': 'midnightblue',
    'Python': '#FEC331',
    'SQL': '#66B900',
    'R': '#2063b7',
    'C/C++': 'slateblue',
    'Basic statistical software': '#0D7036', 
    'Local or hosted development environments': '#36B5E2',
    'Visual Studio': '#349FED',
    'Jupyter/IPython': '#EC7426',
    'AWS': '#F79500',
    'GCP': '#1AA746',
    'Azure': '#3278B1',
    'Well established ML': 'dodgerblue', 
    'PyTorch': 'orangered', 
    'Scikit-learn': 'goldenrod',
    'None': 'darkblue',
    'Do not use ML / Do not know': 'slategrey',
    'Exploring ML': 'lightseagreen', 
    'Recently started using ML': 'forestgreen'
}
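
To illustrate how these dictionaries bring differently worded answers onto a common scale, here is a minimal pandas-free sketch. It mimics the behaviour of `Series.map` with `dict.get`: answers missing from the mapping become `None`, much like pandas yields NaN for unmapped values. The `harmonize` helper and the sample answers are hypothetical; the excerpt reuses a few entries from `Mapping.COMPENSATION`.

```python
# A small excerpt of the compensation mapping defined above.
COMPENSATION_EXCERPT = {
    "$0-999": "0-10k",
    "1,000-1,999": "0-10k",
    "0-10,000": "0-10k",
    "100,000-124,999": "100-125k",
    "100-125,000": "100-125k",
}


def harmonize(answers, mapping):
    """Replace each raw answer with its harmonized label (None if unmapped)."""
    return [mapping.get(answer) for answer in answers]


# Two hypothetical batches of answers that use different wordings.
batch_a = ["0-10,000", "100-125,000"]
batch_b = ["$0-999", "100,000-124,999", "an unexpected label"]

print(harmonize(batch_a, COMPENSATION_EXCERPT))
print(harmonize(batch_b, COMPENSATION_EXCERPT))
```

Because unmapped values silently turn into missing data, a typo in a dictionary key drops every respondent who gave that answer, so it is worth checking for unexpected missing values after mapping.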



In [None]:
from typing import List, Type, Tuple
import pandas as pd
from abc import ABC, abstractmethod


class BaseKaggle(ABC):
    """
    Base class to handle cleaning and transformation of datasets from different years.
    """
    def __init__(self) -> None:
        self.df = None
        self.non_professionals = ['Student', 'Currently not employed', np.nan]
        self.mapping = {}
        self.questions_to_combine = []
        self.survey_year = None
   
    @property
    def questions_to_keep(self) -> List[str]:
        """
        Select which questions we should keep in the dataframe using the mapping keys
        """
        return list(self.mapping)

    def remove_non_professionals(self) -> pd.DataFrame:
        """
        Non-professionals were defined as students, unemployed and NaNs. 
        Also removed those who didn't disclose compensation.
        """
        self.df = self.df.drop(self.df[self.df['Job Title'].isin(self.non_professionals)].index)
        self.df.dropna(subset=['Yearly Compensation'], inplace=True)
        return self.df
    
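    @staticmethod
    @abstractmethod
    def filter_question_columns(columns: List[str], question: str) -> List[str]:
        """
        Returns the columns that hold the answers to a multiple-choice question.
        Each survey year implements its own column-naming rules.
        """
        pass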
    
    @staticmethod
    def remove_nans_from_list(answers: List[str]) -> List[str]:
        """
        This function removes all nans from a list
        """
        return [x for x in answers if pd.notnull(x)]

    def combine_answers_into_list(self, question: str) -> pd.DataFrame:
        """
        This function will create a new column in the dataframe adding 
        all answers to a list and removing nans.
        """
        filtered_columns = self.filter_question_columns(list(self.df.columns), question)
        self.df[question] = self.df[filtered_columns].values.tolist()
        self.df[question] = self.df[question].apply(self.remove_nans_from_list)
        return self.df

    def batch_combine_answers_into_list(self, questions_to_combine: List[str]) -> pd.DataFrame:
        """
        Applies combine_answers_into_list to multiple columns
        """
        for question in questions_to_combine:
            self.combine_answers_into_list(question=question)
        return self.df
    
    def rename_columns(self) -> pd.DataFrame:
        """
        Renames columns using mapping
        """
        self.df = self.df.rename(columns=self.mapping)
        return self.df
    
    def do_mapping(self, column: str, mapping: Mapping) -> pd.DataFrame:
        """
        Maps values to have the same classes across all years
        """
        self.df[column] = self.df[column].map(mapping.value)
        return self.df
    
    def do_list_mapping(self, column: str, mapping: Mapping) -> pd.DataFrame:
        """
        Maps values to have the same classes across all years for columns that are list type
        """
        mapping_dict = mapping.value
        self.df[column] = self.df[column].apply(lambda x: [mapping_dict[val] for val in x])
        return self.df

    def add_numeric_average_compensation(self) -> pd.DataFrame:
        """
        Creates a numeric value for compensation: the midpoint of each class,
        i.e. (lowest value + highest value) / 2, in thousands of dollars.
        A regex strips everything except digits and hyphens before splitting.
        """
        # regex=True must be explicit: pandas 2.0 and later treat the pattern as a literal string by default
        compensation = self.df['Yearly Compensation'].str.replace(r'[^\d-]', '', regex=True).str.split('-')
        self.df['Yearly Compensation Numeric'] = compensation.apply(lambda x: (int(x[0]) + int(x[1])) / 2)  # in thousands of dollars
        return self.df

    def add_survey_year_column(self) -> pd.DataFrame:
        """
        Adds the year the survey was taken as a column
        """
        self.df['Survey Year'] = self.survey_year
        return self.df
    
    def add_dummy_column(self) -> pd.DataFrame:
        """
        Adds Dummy = 1 to make group-by counts easier
        """
        self.df['Dummy'] = 1
        return self.df
    
    def select_questions(self) -> pd.DataFrame:
        """
        Selects only the relevant questions from each survey year
        """
        self.df = self.df[self.questions_to_keep]
        return self.df
    
    def fill_na(self, column: str, value: str) -> pd.DataFrame:
        """
        Fill column NaNs with a given value
        """
        self.df[column] = self.df[column].fillna(value)
        return self.df
   
    def calculate_profile(self, values: tuple) -> str:
        """
        This function creates profiles for professionals, adapted from the work developed by Teresa Kubacka on last year's survey
        https://www.kaggle.com/tkubacka/a-story-told-through-a-heatmap
        """
        years_code, years_ml = values
        if years_code in ['0-1 years', '1-2 years'] and years_ml in ['0-1 years', '1-2 years']:
            return 'Beginners'
        elif years_code in ['2-3 years', '3-10 years'] and years_ml in ['1-2 years', '2-3 years', '3-10 years']:
            return 'Modern DS'
        elif years_code == '10+ years' and years_ml in ['0-1 years', '1-2 years']:
            return 'Coders'
        elif years_code == '10+ years' and years_ml == '10+ years':
            return 'ML Veterans'
        else:
            return 'Others'

    def create_profiles(self) -> None:
        """
        This function creates a new column with profiles for professionals, adapted from the work developed by Teresa Kubacka on last year's survey
        https://www.kaggle.com/tkubacka/a-story-told-through-a-heatmap
        """
        self.df['Years Writing Code Profile'] = self.df['Tenure: Years Writing Code'].map(Mapping.YEARS_WRITING_CODE_PROFILES.value)
        self.df['Years Using ML Profile'] = self.df['Tenure: Years Using Machine Learning Methods'].map(Mapping.YEARS_USING_ML_PROFILES.value)
        
        self.df['Profile'] = self.df[
            ['Years Writing Code Profile', 
             'Years Using ML Profile']
        ].apply(self.calculate_profile, axis=1)

        
    def transform(self) -> pd.DataFrame:
        """
        Process and clean the dataset
        """

        self.df.drop(0, axis=0, inplace=True)  # dropping first row (questions) from processed data

        self.batch_combine_answers_into_list(questions_to_combine=self.questions_to_combine)
        self.select_questions()
        self.rename_columns()

        self.create_profiles()
        self.do_mapping(column='Yearly Compensation', mapping=Mapping.COMPENSATION)
        self.do_mapping(column='Job Title', mapping=Mapping.JOB_TITLE)
        self.do_mapping(column='Gender', mapping=Mapping.GENDER)
        self.do_mapping(column='Age', mapping=Mapping.AGE)
        self.do_mapping(column='Education', mapping=Mapping.EDUCATION)
        self.do_mapping(column='Tenure: Years Writing Code', mapping=Mapping.YEARS_WRITING_CODE)
        self.do_mapping(column='Recommended Programming Language', mapping=Mapping.RECOMMENDED_LANGUAGE)
        self.do_mapping(column='Tenure: Years Using Machine Learning Methods', mapping=Mapping.YEARS_USING_ML)
        self.do_mapping(column='Primary Tool to Analyze Data', mapping=Mapping.PRIMARY_TOOL)
        self.do_mapping(column='Country', mapping=Mapping.COUNTRY)
        self.do_mapping(column='Machine Learning Status in Company', mapping=Mapping.ML_STATUS)
        self.do_list_mapping(column='Machine Learning Frameworks', mapping=Mapping.ML_FRAMEWORKS)
        
        self.do_list_mapping(column='Programming Languages', mapping=Mapping.LANGUAGES)
        self.do_list_mapping(column='IDEs', mapping=Mapping.IDE)
        self.do_list_mapping(column='Cloud Computing Platforms', mapping=Mapping.CLOUD)
        self.fill_na(column='Country', value='Other')

        self.remove_non_professionals()       
        self.add_numeric_average_compensation()
        self.add_survey_year_column()
        self.add_dummy_column()
                
        self.df.reset_index(drop=True, inplace=True)
    
        return self.df
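
The midpoint logic in `add_numeric_average_compensation` can be illustrated without pandas. The sketch below is a simplified stand-in, not the notebook's code: `bucket_midpoint` is a hypothetical helper, and it assumes labels that were already harmonized to the `low-highk` form (for example `100-125k`).

```python
import re


def bucket_midpoint(label):
    """Return the midpoint of a label such as '100-125k', in thousands of dollars."""
    # Pull out the two numbers that bound the compensation class.
    low, high = re.findall(r"\d+", label)[:2]
    return (int(low) + int(high)) / 2


print(bucket_midpoint("0-10k"))     # 5.0
print(bucket_midpoint("100-125k"))  # 112.5
print(bucket_midpoint("300-500k"))  # 400.0
```

Using the class midpoint is a rough approximation: it treats every respondent in a class as earning the same amount, and wide classes such as `300-500k` carry more uncertainty than narrow ones.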
    

In [None]:
class Kaggle2020(BaseKaggle):
    """
    Processing and cleaning 2020 Dataset

    Here we do the following:
    * Group all multiple choice answers into a list in a single column.
    * Remove Non-Professionals from the data set. Non-professionals were defined as students, unemployed and NaNs.
    * Select the questions we want to keep, based on the spreadsheet analysis done previously.
    * Remove all non-multiple choice answers
    """
    
    def __init__(self) -> None:
        super().__init__()
        self.survey_year = 2020
        self.df = pd.read_csv('/kaggle/input/kaggle-survey-2020/kaggle_survey_2020_responses.csv', low_memory=False)
        self.mapping = {
            'Q1':'Age',
            'Q2':'Gender',
            'Q3':'Country',
            'Q4':'Education',
            'Q5':'Job Title',
            'Q6':'Tenure: Years Writing Code',
            'Q7':'Programming Languages',
            'Q8':'Recommended Programming Language',
            'Q9':'IDEs',
            'Q10':'Hosted Notebooks',
            'Q14':'Data Visualization Libraries',
            'Q15':'Tenure: Years Using Machine Learning Methods',
            'Q16':'Machine Learning Frameworks',
            'Q22':'Machine Learning Status in Company',
            'Q23':'Daily activities',
            'Q24':'Yearly Compensation',
            'Q26_A':'Cloud Computing Platforms',
            'Q27_A':'Cloud Computing Products',
            'Q28_A':'Machine Learning Products',
            'Q29_A':'Big Data Products',
            'Q37':'Data Science Courses',
            'Q38':'Primary Tool to Analyze Data',
            'Q39':'Media Sources',
        }
        self.questions_to_combine = [
            'Q7', 'Q9', 'Q10', 'Q14', 'Q16', 'Q23', 'Q26_A', 'Q27_A', 'Q28_A', 'Q29_A', 'Q37', 'Q39'
        ]
           
    @staticmethod
    def filter_question_columns(columns: List[str], question: str) -> List[str]:
        """
        Keeps only the columns that start with the question number and do not end with the string _OTHER
        """
        return [col for col in columns if col.startswith(f'{question}_P') and not col.endswith('_OTHER')]
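
To make the column-combining step concrete, here is a pandas-free sketch of what `filter_question_columns` and `combine_answers_into_list` do for a single respondent. The row below is invented for illustration, and the `Q7_Part_N` column names are an assumption about how the multiple-choice columns are laid out; `is_missing` is a hypothetical helper standing in for the `pd.notnull` check.

```python
import math


def filter_question_columns(columns, question, other_suffix="_OTHER"):
    """Keep the per-option columns of one question, skipping the free-text column."""
    return [
        col for col in columns
        if col.startswith(f"{question}_P") and not col.endswith(other_suffix)
    ]


def is_missing(value):
    """True for None or NaN, the markers of an option that was not selected."""
    return value is None or (isinstance(value, float) and math.isnan(value))


# One hypothetical respondent: each selected option fills its own column.
row = {
    "Q7_Part_1": "Python",
    "Q7_Part_2": float("nan"),
    "Q7_Part_3": "SQL",
    "Q7_OTHER": "COBOL",
    "Q8": "Python",
}

selected_columns = filter_question_columns(list(row), "Q7")
answers = [row[col] for col in selected_columns if not is_missing(row[col])]

print(selected_columns)
print(answers)
```

The trailing underscore in the prefix matters: searching for `Q1_P` does not accidentally pick up columns of `Q10`, `Q14` and so on.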


In [None]:

class Kaggle2019(BaseKaggle):
    """
    Processing and cleaning 2019 Dataset
    """
    
    
    def __init__(self) -> None:
        super().__init__()
        self.survey_year = 2019
        self.df = pd.read_csv('/kaggle/input/kaggle-survey-2019/multiple_choice_responses.csv', low_memory=False)
        self.mapping = {
            'Q1':'Age',
            'Q2':'Gender',
            'Q3':'Country',
            'Q4':'Education',
            'Q5':'Job Title',
            'Q15':'Tenure: Years Writing Code',
            'Q18':'Programming Languages',
            'Q19':'Recommended Programming Language',
            'Q16':'IDEs',
            'Q17':'Hosted Notebooks',
            'Q20':'Data Visualization Libraries',
            'Q23':'Tenure: Years Using Machine Learning Methods',
            'Q28':'Machine Learning Frameworks',
            'Q8':'Machine Learning Status in Company',
            'Q9':'Daily activities',
            'Q10':'Yearly Compensation',
            'Q29':'Cloud Computing Platforms',
            'Q30':'Cloud Computing Products',
            'Q32':'Machine Learning Products',
            'Q31':'Big Data Products',
            'Q13':'Data Science Courses',
            'Q14':'Primary Tool to Analyze Data',
            'Q12':'Media Sources',
        }
        self.questions_to_combine = [
            'Q18', 'Q16', 'Q17', 'Q20', 'Q28', 'Q9', 'Q29', 'Q30', 'Q32', 'Q31', 'Q13', 'Q12'
        ]
           
    @staticmethod
    def filter_question_columns(columns: List[str], question: str) -> List[str]:
        """
        Keeps only the columns that start with the question number and do not end with the string _OTHER_TEXT
        """
        return [col for col in columns if col.startswith(f'{question}_P') and not col.endswith('_OTHER_TEXT')]


In [None]:

class Kaggle2018(BaseKaggle):
    """
    Processing and cleaning 2018 Dataset
    """
    
    def __init__(self) -> None:
        super().__init__()
        self.survey_year = 2018
        self.df = pd.read_csv('/kaggle/input/kaggle-survey-2018/multipleChoiceResponses.csv', low_memory=False)
        self.mapping = {
            'Q2':'Age',
            'Q1':'Gender',
            'Q3':'Country',
            'Q4':'Education',
            'Q6':'Job Title',
            'Q24':'Tenure: Years Writing Code',
            'Q16':'Programming Languages',
            'Q18':'Recommended Programming Language',
            'Q13':'IDEs',
            'Q14':'Hosted Notebooks',
            'Q21':'Data Visualization Libraries',
            'Q25':'Tenure: Years Using Machine Learning Methods',
            'Q19':'Machine Learning Frameworks',
            'Q10':'Machine Learning Status in Company',
            'Q11':'Daily activities',
            'Q9':'Yearly Compensation',
            'Q15':'Cloud Computing Platforms',
            'Q27':'Cloud Computing Products',
            'Q28':'Machine Learning Products',
            'Q30':'Big Data Products',
            'Q36':'Data Science Courses',
            'Q12_MULTIPLE_CHOICE':'Primary Tool to Analyze Data',
            'Q38':'Media Sources',
        }
        self.questions_to_combine = [
            'Q16', 'Q13', 'Q14', 'Q21', 'Q19', 'Q11', 'Q15', 'Q27', 'Q28', 'Q30', 'Q36', 'Q38'
        ]
    
    @staticmethod
    def filter_question_columns(columns: List[str], question: str) -> List[str]:
        """
        Keeps only the columns that start with the question number and do not end with the string _OTHER_TEXT
        """
        return [col for col in columns if col.startswith(f'{question}_P') and not col.endswith('_OTHER_TEXT')]


In [None]:
class KaggleCombinedSurvey:
    """
    This class combines surveys from multiple years into a concatenated dataframe.
    """
    
    def __init__(self, surveys: List[Type[BaseKaggle]]) -> None:
        self.surveys = surveys
        self._cached_df = None
        
    @property
    def df(self) -> pd.DataFrame:
        """
        If df was already processed, get it from the cache; otherwise process it and save it to the cache.
        """
        if self._cached_df is None:
            self._cached_df = self._concatenate()
        return self._cached_df
    
    def _get_surveys_dfs(self) -> List[pd.DataFrame]:
        """
        Applies the transform method of each survey and returns the dfs in a list
        """
        return [survey().transform() for survey in self.surveys]
    
    def _concatenate(self) -> pd.DataFrame:
        """
        Concatenate survey dataframes into a single dataframe
        """
        df = pd.concat(self._get_surveys_dfs())
        df = df.reset_index(drop=True)
        return df

In [None]:
import plotly.graph_objects as go
import plotly.offline as pyo
import plotly.express as px
from plotly.subplots import make_subplots
from collections import namedtuple 


MetricData = namedtuple('MetricData', ['subplot_name','trace_name','y_values', 'x_values', 
                                       'subplot_y_position', 'subplot_x_position', 'highlighted_traces']) 


class BaseMetric(ABC):
    """
    Base class that computes the data (traces and subplots) behind a slopegraph
    """
    
    def __init__(
        self, 
        survey: KaggleCombinedSurvey, 
        traces_col: str, 
        y_col: str, 
        x_col: str,
        explode: bool = False
    ) -> None:
        """
        traces: the column name we want to create traces from
        y: the column name we will be plotting
        x: Will always be survey year for our slopegraphs.
        """
        self.traces_col = traces_col
        self.y_col = y_col
        self.x_col = x_col
        self.survey = survey
        self.traces = []
        self.explode = explode
        self.metric_df = None

    @property
    def traces_names(self) -> List[str]:
        """
        Calculate unique values of traces_col
        """
        return self.metric_df[self.traces_col].cat.categories
    
    @property
    def subplots_names(self) -> List[str]:
        """
        Calculate unique values of y_col (one subplot per value)
        """
        return self.metric_df[self.y_col].cat.categories

    @property
    def subplots_qty(self):
        return len(self.subplots_names)
    
    @property
    def traces_qty(self):
        return len(self.traces_names)
    
    def apply_filter(self, df: pd.DataFrame, column: str, value: str) -> pd.DataFrame:
        """
        filters data for a single trace
        """
        return df[df[column] == value] 

    @abstractmethod
    def calculate(self) -> pd.DataFrame:
        """
        Group the data by y_col, perform count and convert it to a list
        Transforms absolute values into percentages
        Yield the metrics for a given trace
        """
        pass
   
    def groupby(self, df: pd.DataFrame, columns: List[str]) -> pd.DataFrame:
        """
        Calculates quantity per x, y and traces col
        """
        return df.groupby(columns, as_index=False)['Dummy'].sum()
    
    def join_dataframes(self, df1: pd.DataFrame, df2: pd.DataFrame, on_columns: List[str]) -> pd.DataFrame:
        return df1.set_index(on_columns).join(df2.set_index(on_columns), rsuffix='_total').reset_index()
    
    def to_categorical(self, column: str, categories: Category) -> pd.DataFrame:
        cat_dtype = pd.api.types.CategoricalDtype(categories=categories.value, ordered=True)
        self.metric_df[column] = self.metric_df[column].astype(cat_dtype)
        return self.metric_df
    
    def get_df(self):
        """
        Returns a dataframe with or without lists exploded 
        """
        if self.explode:
            return self.survey.df.explode(self.traces_col)
        else:
            return self.survey.df
        
    def get_subplots(self, highlighted_traces: List[str]) -> List[MetricData]:
        self.apply_categories()
        self.metric_df['subplot_y_position'] = self.metric_df[self.y_col].cat.codes + 1
        self.metric_df['subplot_x_position'] = 1       

        for index, row in self.metric_df.iterrows():
            filtered_df = self.apply_filter(df=self.metric_df, column=self.y_col, value=row[self.y_col])
            filtered_df = self.apply_filter(df=filtered_df, column=self.traces_col, value=row[self.traces_col])
            
            metric_data = MetricData(
                subplot_name=row[self.y_col],
                trace_name=row[self.traces_col],
                y_values=filtered_df['Metric'].values,
                x_values=filtered_df[self.x_col].values,
                subplot_y_position=row['subplot_y_position'],
                subplot_x_position=row['subplot_x_position'],
                highlighted_traces=row[self.traces_col] in highlighted_traces
            )
            self.traces.append(metric_data)

In [None]:
class PercentageMetric(BaseMetric):
    """
    Metric that computes the percentage of respondents per group for slopegraphs
    """   
    
    def calculate_average(self, df: pd.DataFrame) -> pd.DataFrame:
        detail = self.groupby(df=df, columns=[self.x_col, self.y_col])
        total = self.groupby(df=df, columns=[self.x_col])
        joined = self.join_dataframes(df1=detail, df2=total, on_columns=[self.x_col]) 
        joined['Metric'] = joined['Dummy'] / joined['Dummy_total'] * 100  # get percentage
        joined[self.traces_col] = 'Average'
        return joined
            
    
    def calculate(self, add_avg: bool = False) -> pd.DataFrame:
        """
        Group the data by y_col, perform count and convert it to a list
        Transforms absolute values into percentages
        Yield the metrics for a given trace
        """
        df = self.get_df()
        detail = self.groupby(df=df, columns=[self.x_col, self.y_col, self.traces_col])
        total = self.groupby(df=df, columns=[self.x_col, self.traces_col]) 
        joined = self.join_dataframes(df1=detail, df2=total, on_columns=[self.x_col, self.traces_col]) 
        joined['Metric'] = joined['Dummy'] / joined['Dummy_total'] * 100  # get percentage
        
        if add_avg:
            avg_df = self.calculate_average(df=joined)
            # DataFrame.append was removed in pandas 2.0; pd.concat gives the same result
            joined = pd.concat([joined, avg_df])
                
        self.metric_df = joined
        return joined

    @abstractmethod
    def apply_categories(self):
        pass 

In [None]:
class GenderProportionMetric(PercentageMetric):
    def __init__(self, survey: KaggleCombinedSurvey) -> None:
        super().__init__(survey=survey, traces_col='Country', y_col='Gender', x_col='Survey Year')
    
    def apply_categories(self):
        self.to_categorical(column='Gender', categories=Category.GENDER)
        self.to_categorical(column='Survey Year', categories=Category.SURVEY_YEAR)

In [None]:
class BasePlot(ABC):
    """
    Creates a plotly plot for slopegraphs
    """
    
    def __init__(
        self, 
        metric: Type[BaseMetric], 
        title: str, 
        yaxes_title: str, 
        shared_yaxes: bool, 
        yticks: List[float], 
        yticks_template: str, 
        annotation_template: str,
        x_nticks: int,
        hover_template: str
    ) -> None:
    
        pyo.init_notebook_mode()
        self.metric = metric
        self.yaxes_title = yaxes_title
        self.shared_yaxes = shared_yaxes
        self.hover_template = hover_template
        self.title = title
        self.yticks = yticks
        self.yticks_template = yticks_template
        self.annotation_template = annotation_template
        self.x_nticks = x_nticks
        self.figure = go.Figure()
        self.range = (0, 0)

    def make_subplots(self) -> None:
        """
        Creates subplots in the figure and adds titles
        """
        self.figure = make_subplots(
            cols=self.metric.subplots_qty, # our subplots will have the number of unique values for the select column
            rows=1, # and 1 row
            subplot_titles=self.metric.subplots_names, # Add titles to subplots
            specs=[[{'type': 'scatter'}]*self.metric.subplots_qty]*1, # Define chart type for each subplot
            shared_yaxes=self.shared_yaxes,
            shared_xaxes=True
        )

        for idx, subplot_title in enumerate(self.figure['layout']['annotations']):
            subplot_title['font'] = dict(size=14,color='grey')  # Size and color of subplot title
            subplot_title['align'] = 'left'
            subplot_title['xanchor'] = 'left'
            subplot_title['x'] = 0
            subplot_title['xref'] = 'x' if idx == 0 else f'x{idx + 1}'
            

    def update_common_layout(self) -> None:
        """
        Updates general layout characteristics
        """
        self.figure.update_layout(
            showlegend = False,
            plot_bgcolor='white',
            title_text = self.title,
            title_font_color = 'grey',
            title_font_size = 15,
            title_x=0,
            title_y=0.98,
            margin_t=130,
            margin_l=0,
            margin_r=0,
            height=600,
            width=800,
            yaxis_range=self.range
        )

    def get_yticks_text(self) -> List[str]:
        """
        Calculates the y_ticks text for charts
        """
        return [self.yticks_template.format(i) for i in self.yticks]
    
    def update_subplots_layout(self) -> None:
        """
        Updates scatter subplots layout characteristics
        """
        for subplot_idx in range(self.metric.subplots_qty):
            self.figure.update_xaxes(
                type='category',
                color='lightgrey', # to not draw too much attention to the axis
                showgrid=False, 
                visible=subplot_idx == 0,  # Visible only to the first subplot
                row=1,
                nticks=self.x_nticks,               
                col=subplot_idx + 1 # Subplots start at 1 
            )
            self.figure.update_yaxes(
                showgrid=False,
                visible=subplot_idx == 0 or not self.shared_yaxes,
                title=self.yaxes_title if subplot_idx == 0 else None,  # Visible only to the first subplot
                color='grey',
                row=1, 
                col=subplot_idx + 1,
                tickvals=self.yticks, # show ticks only at the values given in yticks
                ticktext=self.get_yticks_text(),
                tickmode='array',
                tickfont_color='lightgrey',
                autorange=True
            )

    def line_color(self, trace: MetricData) -> str:
        """
        Sets color to the highlight color or to a tone of grey if not highlighted
        """
        return self.highlight_color(trace=trace) if trace.highlighted_traces else 'lightslategrey'
    
    def highlight_color(self, trace: MetricData) -> str:
        """
        Returns the highlight color
        """
        return COLORS[trace.trace_name]
    
    def line_width(self, trace: MetricData) -> float:
        """
        Returns the line width of traces depending if trace is highlighted or not
        """
        return 1.6 if trace.highlighted_traces else 0.6
   
    def opacity(self, trace: MetricData) -> float:
        """
        Returns the opacity depending if trace is highlighted or not
        """
        return 0.8 if trace.highlighted_traces else 0.25

    def add_trace(self, trace: MetricData) -> None:
        """
        Adds a new trace to a figure
        """
        self.figure.add_trace(
            go.Scatter(
                x=trace.x_values, 
                y=trace.y_values, 
                mode='lines',
                name=trace.trace_name,
                hoverinfo='name+text+y',
                hovertemplate=self.hover_template,
                text=trace.x_values,
                line_color=self.line_color(trace=trace),
                showlegend=False,
                opacity= self.opacity(trace=trace),
                line_shape='linear',
                line_width=self.line_width(trace=trace),
                connectgaps=True
            ), 
            trace.subplot_x_position, 
            trace.subplot_y_position
        )
    
    def get_annotation_text(self, trace: MetricData, idx: int) -> str:
        """
        Calculates the annotation text to be added to the plot
        """
        if trace.subplot_y_position == 1 and idx == 0:
            template = '{}<br>' + f'{self.annotation_template}'
            return template.format(trace.trace_name, trace.y_values[idx])
        else:
            return self.annotation_template.format(trace.y_values[idx])
        
    def add_annotations(self) -> None:
        """
        Adds annotations to the plot
        """
        for trace in self.metric.traces:
            if trace.highlighted_traces:
                # Add left annotation
                self.figure.add_annotation(
                    xref=f'x{trace.subplot_y_position}',
                    yref=f'y{trace.subplot_y_position}',
                    font=dict(
                        size=11,
                        color=self.highlight_color(trace=trace)
                    ),
                    opacity=0.8,
                    align='center',
                    yshift=0,
                    xshift=-3,
                    xanchor='right',
                    x=trace.x_values.codes[0], 
                    y=trace.y_values[0],
                    text=self.get_annotation_text(trace=trace, idx=0),
                    showarrow=False
                    )
                # Add right annotation
                self.figure.add_annotation(
                    xref=f'x{trace.subplot_y_position}',
                    yref=f'y{trace.subplot_y_position}',
                    font=dict(
                        size=11,
                        color=self.highlight_color(trace=trace)
                    ),
                    opacity=0.8,
                    align='center',
                    yshift=0,
                    xshift=3,
                    xanchor='left',
                    x=trace.x_values.codes[-1], 
                    y=trace.y_values[-1],
                    text=self.get_annotation_text(trace=trace, idx=-1),
                    showarrow=False
                    )

    def add_subplot_axis_annotation(self) -> None:
        """
        Add subplot axis annotation
        """
        self.figure.add_annotation(
            xref="x", 
            yref="paper",
            font=dict(
                size=14,
                color='lightgrey'
            ),
            align='left',
            x=0, 
            xanchor='left',
            y=1.05,
            yanchor='bottom',
            text=f'{self.metric.y_col}',
            showarrow=False
        )
        
    def add_source_annotation(self) -> None:
        """
        Add source annotation
        """
        self.figure.add_annotation(
            xref="paper", 
            yref="paper",
            font=dict(
                size=11,
                color='lightgrey'
            ),
            align='left',
            x=-0.07, 
            xanchor='left',
            y=-0.13,
            yanchor='bottom',
            text='<b>Source:</b> Kaggle surveys from 2018 to 2020.',
            showarrow=False
        )
        
    def add_data(self) -> None:
        """
        Adds a trace to the figure following the same standard for each trace
        """
        # Add every trace (highlighted or not) and widen the y range to fit it.
        for trace in self.metric.traces:
            self.add_trace(trace=trace)
            self.update_range(data=trace.y_values)

        
    def update_range(self, data: List[float]) -> None:
        """
        Updates the range to be 80% of the minimum value and 120% of the maximum value of all traces
        """
        if len(data) == 0:
            return self.range
        
        max_range = max(data) * 1.2
        min_range = min(data) * 0.8
        self.range = (self.range[0], max_range) if max_range > self.range[1] else self.range 
        self.range = (min_range, self.range[1]) if min_range < self.range[0] else self.range
        
    def show(self) -> None:
        """
        Renders and shows the plot
        """
        self.make_subplots()
        self.update_common_layout()
        self.add_data()
        self.add_annotations()
        self.add_subplot_axis_annotation()
        self.update_subplots_layout()
        self.add_source_annotation()
        self.figure.show()

In [None]:
class GenderProportionPlot(BasePlot):
    pass

In [None]:
kaggle_combined_survey = KaggleCombinedSurvey(surveys=[Kaggle2018, Kaggle2019, Kaggle2020])

<h1>The Gender Gap</h1>
<div style="font-family:Helvetica Neue; font-size:16px; line-height:1.7; color:slategray;">
I wanted to start with something simple and important at the same time. So I asked myself: over the past three years, did the proportion of men and women change? I knew there was a gap, but I hoped to see it decreasing (and a lot) in recent years.
    </div>
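
The percentage behind each point on the chart can be sketched on toy data as follows. This is a minimal illustration, not the notebook's `PercentageMetric` itself: the rows are made up and the column names `count` and `share_pct` are assumptions introduced only for this example.

```python
import pandas as pd

# Hypothetical toy data: one row per respondent, as in the combined survey.
responses = pd.DataFrame({
    'Survey Year': [2018, 2018, 2018, 2018, 2019, 2019, 2019, 2019, 2019],
    'Gender': ['Men', 'Men', 'Men', 'Women', 'Men', 'Men', 'Men', 'Men', 'Women'],
})

# Count respondents per (year, gender).
counts = (
    responses.groupby(['Survey Year', 'Gender'])
    .size()
    .rename('count')
    .reset_index()
)

# Divide each count by the total number of respondents in the same year.
totals = counts.groupby('Survey Year')['count'].transform('sum')
counts['share_pct'] = counts['count'] / totals * 100

print(counts)
```

Within each survey year the shares add up to 100%, which is why the lines for men and women mirror each other.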

In [None]:
metric = GenderProportionMetric(survey=kaggle_combined_survey)
metric.calculate(add_avg=True)
metric.get_subplots(highlighted_traces=['Average'])

GenderProportionPlot(
    metric=metric,
    yaxes_title='% of Respondents per Survey Year',
    shared_yaxes=True,
    yticks=[20, 40, 60, 80],
    yticks_template='{}%',
    hover_template='%{y:0.1f}%',
    annotation_template='{:0.1f}%',
    x_nticks=3,
    title='<b>Gender Gap: Kaggle members are mostly men. </b><br>And there are no signs of increase in women participation since 2018.'\
          '<br><span style="font-size:14px;color:lightgrey"><i>Percentage of professional respondents per country</i></span>'
).show()

<div style="font-family:Helvetica Neue; font-size:16px; line-height:1.7; color:slategray;">
    Unfortunately, there is a considerable gap in the participation of professionals on Kaggle: 84% men against 16% women.
<br><br>
Worse still, <mark>women's participation did not increase over the past three years</mark>. I saw some <a href="https://www.kaggle.com/chandramanaha/kyc-know-your-community">other notebooks</a> that showed an increase in female students. Maybe this will lead to more women among data professionals next year, but we need a lot more women to close this gap.
<br>
<div class="alert alert-success">Maybe Kaggle could host women-only competitions, in a bid to attract more of them to the platform (and to Data Science).</div>
<br>

If we zoom in using the same chart, we can see which countries have been getting more women into data over the past few years.</div>

In [None]:
metric = GenderProportionMetric(survey=kaggle_combined_survey)
metric.calculate(add_avg=False)
metric.get_subplots(highlighted_traces=['India', 'Brazil', 'Canada'])

GenderProportionPlot(
    metric=metric, 
    yaxes_title='% of Respondents per Survey Year',
    shared_yaxes=False,
    yticks=[5, 10, 15, 20, 75, 80, 85, 90],
    yticks_template='{}%',
    hover_template='%{y:0.1f}%',
    annotation_template='{:0.1f}%',
    x_nticks=3,
    title='<b>Gender Gap: India, Brazil and Canada are countries where the gender gap is reducing. </b><br>However, changes are still too small to make any difference.'\
          '<br><span style="font-size:14px;color:lightgrey"><i>Percentage of professional respondents per country</i></span>'
).show()

<div style="font-family:Helvetica Neue; font-size:16px; line-height:1.7; color:slategray;">
    While the gender gap widened in most countries in 2020, <mark>India is the country closing the gap fastest</mark>. But remember that we are still talking about 18.5% women against 80.3% men in India.
    </div>
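
One way to rank countries by how fast the gap is closing is to compare the share of women in the first and last survey. The sketch below assumes the shares were already computed. The countries `A`, `B`, `C` and the column `women_pct` are hypothetical placeholders, not survey results.

```python
import pandas as pd

# Hypothetical shares of women (in %) per country and survey year.
shares = pd.DataFrame({
    'Country': ['A', 'A', 'B', 'B', 'C', 'C'],
    'Survey Year': [2018, 2020, 2018, 2020, 2018, 2020],
    'women_pct': [15.0, 18.5, 17.0, 16.0, 20.0, 20.5],
})

# One row per country, one column per survey year.
wide = shares.pivot(index='Country', columns='Survey Year', values='women_pct')

# Positive values mean the share of women grew between the two surveys.
change = (wide[2020] - wide[2018]).sort_values(ascending=False)

print(change)
```

On a slopegraph this difference is simply the slope of each line, so the steepest upward line in the women subplot is the country at the top of this ranking.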

In [None]:
class AverageBaseMetric(BaseMetric):
    """
    Base metric that averages yearly compensation per group for slopegraphs
    """   
        
    def groupby(self, df: pd.DataFrame, columns: List[str]) -> pd.DataFrame:
        """
        Calculates the average yearly compensation per x, y and traces col
        """
        return df.groupby(columns, as_index=False).agg({'Yearly Compensation Numeric': lambda x: x.mean(skipna=False)})
    
    def calculate_average(self, df: pd.DataFrame) -> pd.DataFrame:
        """
        Calculates the average trace
        """
        detail = self.groupby(df=df, columns=[self.x_col, self.y_col])
        detail['Metric'] = detail['Yearly Compensation Numeric'] 
        detail[self.traces_col] = 'Average'
        return detail
            
    
    def calculate(self, add_avg: bool = False) -> pd.DataFrame:
        """
        Group the data by x_col, y_col and traces_col and average the yearly compensation
        Optionally append an 'Average' trace computed across all traces
        Store the result in metric_df and return it
        """
        df = self.get_df()
        detail = self.groupby(df=df, columns=[self.x_col, self.y_col, self.traces_col])
        detail['Metric'] = detail['Yearly Compensation Numeric'] 
        if add_avg:
            avg_df = self.calculate_average(df=detail)
            # DataFrame.append was removed in pandas 2.0; pd.concat gives the same result
            detail = pd.concat([detail, avg_df])
                
        self.metric_df = detail
        return detail

In [None]:
class CompensationGenderMetric(AverageBaseMetric):
    """
    Creates a plotly plot for slopegraphs
    """ 
    def __init__(self, survey: KaggleCombinedSurvey) -> None:
        super().__init__(survey=survey, traces_col='Gender', y_col='Education', x_col='Survey Year')

    def apply_categories(self):
        self.to_categorical(column='Survey Year', categories=Category.SURVEY_YEAR)
        self.to_categorical(column='Education', categories=Category.EDUCATION)      
        
class CompensationPlot5(BasePlot):
            
    def add_annotations(self) -> None:
        """
        Adds annotations to the plot
        """
        for trace in self.metric.traces:
            if trace.highlighted_traces:
                if trace.subplot_y_position == 1:
                    # Add left annotation
                    self.figure.add_annotation(
                        xref=f'x{trace.subplot_y_position}',
                        yref=f'y{trace.subplot_y_position}',
                        font=dict(
                            size=11,
                            color=self.highlight_color(trace=trace)
                        ),
                        opacity=0.8,
                        align='center',
                        yshift=5,
                        yanchor='bottom',
                        xshift=0,
                        xanchor='left',
                        x=trace.x_values.codes[0], 
                        y=trace.y_values[0],
                        text=trace.trace_name,
                        showarrow=False
                        )

<div style="font-family:Helvetica Neue; font-size:16px; line-height:1.7; color:slategray;">
    Ok, there is a huge gap in participation, but the pay gap must be closing, no? We are past 2020 after all.
<br><br>
To analyse the pay gap, I broke down the average annual compensation of women and men by education level. I was hoping to see the gap close with higher degrees. 
    </div>
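
The breakdown described above boils down to a grouped average followed by a difference. The sketch below uses made-up numbers and a hypothetical `compensation_k` column (yearly compensation in thousands of USD), so only the mechanics carry over to the real data.

```python
import pandas as pd

# Hypothetical respondents; compensation is in thousands of USD.
respondents = pd.DataFrame({
    'Education': ["Bachelor's"] * 4 + ['Doctoral'] * 4,
    'Gender': ['Men', 'Men', 'Women', 'Women'] * 2,
    'compensation_k': [50, 70, 40, 60, 90, 110, 60, 80],
})

# Average compensation per education level, with one column per gender.
avg = (
    respondents.groupby(['Education', 'Gender'])['compensation_k']
    .mean()
    .unstack('Gender')
)

# The pay gap is the distance between the two lines in each subplot.
avg['gap_k'] = avg['Men'] - avg['Women']

print(avg)
```

Repeating this per survey year gives one pair of lines per education level, which is what each subplot of the chart shows.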

In [None]:
metric = CompensationGenderMetric(survey=kaggle_combined_survey)
metric.calculate(add_avg=False)
metric.get_subplots(highlighted_traces=['Women', 'Men'])

CompensationPlot5(
    metric=metric,
    yaxes_title='Average Yearly Compensation (USD)',
    shared_yaxes=True,
    yticks=[20, 40, 60],
    yticks_template='U$ {}k',
    hover_template='U$ %{y:0.1f}k',
    annotation_template='U$ {:0.1f}k',
    x_nticks=1,
    title='<b>Gender Gap: In 2020 most women saw the pay gap increase regardless of their education.</b> '\
          '<br>The gap is greater at the extremes: those with either too little or too much education.'\
          '<br><span style="font-size:14px;color:lightgrey"><i>Average Yearly Compensation in USD of professional respondents.</i></span>'
).show()

<div style="font-family:Helvetica Neue; font-size:16px; line-height:1.7; color:slategray;">
    I have to confess that seeing this chart was very disappointing. First, <mark>the distance between women's and men's salaries increased in 2020 for most education levels</mark>. The pandemic could partially explain the sharp drop in 2020, with women having to leave their jobs or reduce hours to take care of children at home, for example. However, the gap also increased slightly in 2019, so this is an alarming trend.
<br><br>
And the worst news is that <mark> even though the gap closes a little bit for Bachelor's and Master's degrees, it increases again for PhDs (Doctoral)!</mark> <b>This was something that I did not expect, and I feel sorry for all the women who, despite all their effort to achieve the highest education title, are still treated unequally to men.</b>
<br>
<div class="alert alert-success"><b>Let's do something to close the gap?</b> <br>Give women more opportunities to enter data careers even if they don't have all the required experience. And please, pay women the same as you pay men for the same education level and experience.</div>
<br>
    </div>

**Extra Chart - Added after the competition finished**

In [None]:
class CompensationGenderMetric2(AverageBaseMetric):
    """
    Creates a plotly plot for slopegraphs
    """ 
    def __init__(self, survey: KaggleCombinedSurvey) -> None:
        super().__init__(survey=survey, traces_col='Gender', y_col='Tenure: Years Writing Code', x_col='Survey Year')

    def apply_categories(self):
        self.to_categorical(column='Survey Year', categories=Category.SURVEY_YEAR)
        self.to_categorical(column='Tenure: Years Writing Code', categories=Category.YEARS_WRITING_CODE)      
        
class CompensationPlot6(BasePlot):
            
    def add_annotations(self) -> None:
        """
        Adds annotations to the plot
        """
        for trace in self.metric.traces:
            if trace.highlighted_traces:
                if trace.subplot_y_position == 1:
                    # Add left annotation
                    self.figure.add_annotation(
                        xref=f'x{trace.subplot_y_position}',
                        yref=f'y{trace.subplot_y_position}',
                        font=dict(
                            size=11,
                            color=self.highlight_color(trace=trace)
                        ),
                        opacity=0.8,
                        align='center',
                        yshift=5,
                        yanchor='bottom',
                        xshift=0,
                        xanchor='left',
                        x=trace.x_values.codes[0], 
                        y=trace.y_values[0],
                        text=trace.trace_name,
                        showarrow=False
                        )

In [None]:
metric = CompensationGenderMetric2(survey=kaggle_combined_survey)
metric.calculate(add_avg=False)
metric.get_subplots(highlighted_traces=['Women', 'Men'])

CompensationPlot6(
    metric=metric,
    yaxes_title='Average Yearly Compensation (USD)',
    shared_yaxes=True,
    yticks=[20, 40, 60],
    yticks_template='U$ {}k',
    hover_template='U$ %{y:0.1f}k',
    annotation_template='U$ {:0.1f}k',
    x_nticks=1,
    title='<b>Gender Gap: comparing men and women by years of experience writing code also shows a gap.</b> '\
          '<br>Men are probably able to find better jobs and end up working for companies that pay more.'\
          '<br><span style="font-size:14px;color:lightgrey"><i>Average Yearly Compensation in USD of professional respondents.</i></span>'
).show()

When breaking down by experience there is still a gap, though a smaller one. For me, this is a symptom that men have better access to better-paying jobs than women. The main reason is that men are privileged in selection processes and interviews, which gives them (myself included) an advantage in getting the best salaries out there.

How many women were turned down at interviews because they were pregnant? How many women are in leadership positions at your company?

<h1>Education vs Experience</h1>

<div style="font-family:Helvetica Neue; font-size:16px; line-height:1.7; color:slategray;">
    This is one that I love, because I dropped out of university and never got a degree. I'm always curious to see whether someone with more formal education earns more than I do. Are you curious to see the results?
</div>

In [None]:
class CompensationEducationMetric(AverageBaseMetric):
    """
    Creates a plotly plot for slopegraphs
    """ 
    def __init__(self, survey: KaggleCombinedSurvey) -> None:
        super().__init__(survey=survey, traces_col='Tenure: Years Writing Code', y_col='Education', x_col='Survey Year')

    def apply_categories(self):
        self.to_categorical(column='Survey Year', categories=Category.SURVEY_YEAR)
        self.to_categorical(column='Education', categories=Category.EDUCATION)      
        
class CompensationPlot4(BasePlot):
            
    def add_annotations(self) -> None:
        """
        Adds annotations to the plot
        """
        for trace in self.metric.traces:
            if trace.highlighted_traces:
                if trace.subplot_y_position == 1:
                    # Add left annotation
                    self.figure.add_annotation(
                        xref=f'x{trace.subplot_y_position}',
                        yref=f'y{trace.subplot_y_position}',
                        font=dict(
                            size=11,
                            color=self.highlight_color(trace=trace)
                        ),
                        opacity=0.8,
                        align='center',
                        yshift=10 if trace.trace_name == '10+ years' else -25,
                        yanchor='bottom',
                        xshift=0,
                        xanchor='left',
                        x=trace.x_values.codes[0], 
                        y=trace.y_values[0],
                        text=trace.trace_name,
                        showarrow=False
                        )

In [None]:
metric = CompensationEducationMetric(survey=kaggle_combined_survey)
metric.calculate(add_avg=False)
metric.get_subplots(highlighted_traces=['< 1 year', '10+ years'])

CompensationPlot4(
    metric=metric,
    yaxes_title='Average Yearly Compensation (USD)',
    shared_yaxes=True,
    yticks=[30, 60, 90, 120],
    yticks_template='U$ {}k',
    hover_template='U$ %{y:0.1f}k',
    annotation_template='U$ {:0.1f}k',
    x_nticks=1,
    title='<b>Formal education has little impact on salary when compared to experience writing code.</b> <br>But dropping out of university is better than no university at all.'\
          '<br><span style="font-size:14px;color:lightgrey"><i>Average Yearly Compensation in USD of professional respondents. Lines are years of experience writing code.</i></span>'
).show()

<div style="font-family:Helvetica Neue; font-size:16px; line-height:1.7; color:slategray;">
And the truth is that <mark>experience is much more important than formal education.</mark> Just look at people with less than 1 year of experience writing code: their <mark>salary did not increase with more education</mark>. 
<br>

<div class="alert alert-danger">A PhD with no experience writing code will earn the same as someone fresh from High School without experience.</div> 
</div>

<div style="font-family:Helvetica Neue; font-size:16px; line-height:1.7; color:slategray;">

Now there is one curious thing about getting into a university. <mark>Those with more than 10 years of experience who never attended a university tend to earn less than those who did at least some college.</mark> And there is no noticeable distinction between the salary of experienced people who didn't finish university and those who went all the way up to a doctoral degree.
<br>

<div class="alert alert-success">So if you are choosing between getting more education and getting a job, the answer is crystal clear: <b>get a job!</b></div>
</div>
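
<div style="font-family:Helvetica Neue; font-size:16px; line-height:1.7; color:slategray;">
The comparison above boils down to grouped averages. As a minimal sketch (not the notebook's own metric classes), the snippet below averages compensation per experience band and education level in plain Python. The records and the helper name <code>average_by</code> are made up for illustration; they are not survey results.
</div>

```python
from collections import defaultdict
from statistics import mean

# Toy records, NOT survey data:
# (education, years writing code, yearly compensation in USD thousands)
records = [
    ("High school", "< 1 year", 18.0),
    ("High school", "< 1 year", 22.0),
    ("Doctoral degree", "< 1 year", 19.0),
    ("Doctoral degree", "< 1 year", 23.0),
    ("High school", "10+ years", 60.0),
    ("Doctoral degree", "10+ years", 95.0),
    ("Doctoral degree", "10+ years", 105.0),
]


def average_by(rows, key):
    """Return the mean compensation for each group produced by `key`."""
    groups = defaultdict(list)
    for row in rows:
        groups[key(row)].append(row[2])
    return {group: mean(values) for group, values in groups.items()}


# One average per (experience band, education level)
averages = average_by(records, key=lambda row: (row[1], row[0]))
for group, value in sorted(averages.items()):
    print(group, f"U$ {value:0.1f}k")
```

<div style="font-family:Helvetica Neue; font-size:16px; line-height:1.7; color:slategray;">
Reading the output one experience band at a time shows how much (or how little) the average moves across education levels within that band.
</div>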


<div style="font-family:Helvetica Neue; font-size:16px; line-height:1.7; color:slategray;">

<i>Note: I swear I didn't tamper with the results to confirm my bias :D (and you can always check the code as well)</i>

</div>

<br><br>
<h1>Why are salaries decreasing?</h1>
<div style="font-family:Helvetica Neue; font-size:16px; line-height:1.7; color:slategray;">
From the previous chart you might have noticed that salaries have been dropping (a lot) since 2018. Let's have a closer look at that by breaking the data down by Years of Experience Writing Code and Job Title. 
</div>

In [None]:
class CompensationJobTitleMetric(AverageBaseMetric):
    """
    Average yearly compensation per job title, split by years of experience writing code
    """ 
    def __init__(self, survey: KaggleCombinedSurvey) -> None:
        super().__init__(survey=survey, traces_col='Job Title', y_col='Tenure: Years Writing Code', x_col='Survey Year')

    def apply_categories(self):
        self.to_categorical(column='Survey Year', categories=Category.SURVEY_YEAR)
        self.to_categorical(column='Tenure: Years Writing Code', categories=Category.YEARS_WRITING_CODE)
        

In [None]:
class CompensationPlot(BasePlot):
            
    def add_annotations(self) -> None:
        """
        Adds annotations to the plot
        """
        for trace in self.metric.traces:
            if trace.highlighted_traces:
                # Add left annotation
                self.figure.add_annotation(
                    xref=f'x{trace.subplot_y_position}',
                    yref=f'y{trace.subplot_y_position}',
                    font=dict(
                        size=11,
                        color=self.highlight_color(trace=trace)
                    ),
                    opacity=0.8,
                    align='center',
                    yshift=8,
                    yanchor='bottom',
                    xshift=0,
                    xanchor='left',
                    x=trace.x_values.codes[0], 
                    y=trace.y_values[0],
                    text=self.get_annotation_text(trace=trace, idx=0),
                    showarrow=False
                    )
                # Add right annotation
                self.figure.add_annotation(
                    xref=f'x{trace.subplot_y_position}',
                    yref=f'y{trace.subplot_y_position}',
                    font=dict(
                        size=11,
                        color=self.highlight_color(trace=trace)
                    ),
                    opacity=0.8,
                    align='center',
                    yshift=-8,
                    yanchor='top',
                    xshift=0,
                    xanchor='right',
                    x=trace.x_values.codes[-1], 
                    y=trace.y_values[-1],
                    text=self.get_annotation_text(trace=trace, idx=-1),
                    showarrow=False
                    )


In [None]:
metric = CompensationJobTitleMetric(survey=kaggle_combined_survey)
metric.calculate(add_avg=True)
metric.get_subplots(highlighted_traces=['Average'])

CompensationPlot(
    metric=metric,
    yaxes_title='Average Yearly Compensation (USD)',
    shared_yaxes=True,
    yticks=[30, 60, 90],
    yticks_template='U$ {}k',
    hover_template='U$ %{y:0.1f}k',
    annotation_template='U$ {:0.1f}k',
    x_nticks=1,
    title='<b>The average salary increases with experience in writing code for all job titles.</b><br>But all salaries have been decreasing since 2018.'\
          '<br><span style="font-size:14px;color:lightgrey"><i>Average Yearly Compensation in USD of professional respondents by job title</i></span>'
    ).show()

<div style="font-family:Helvetica Neue; font-size:16px; line-height:1.7; color:slategray;">
We can clearly see that, on average, all salaries have been decreasing since 2018, regardless of experience level or job title, and that there is a sharper drop in 2020.
<br><br>
2020, as we all know, was an exceptional year. <mark>By around March 2020 practically all countries went into some sort of lockdown because of COVID-19.</mark> As a result, many employees started working from home and many others were dismissed due to the global economic crisis caused by the pandemic.
<br>
<div class="alert alert-info">If there are more professionals available, their market price will drop. <b>Simple economics.</b></div>
</div>

<div style="font-family:Helvetica Neue; font-size:16px; line-height:1.7; color:slategray;">
    Another effect here is that <b>data science is not hyped anymore</b>. A few years ago it was named the sexiest job of the century, and there was huge hype and inflated expectations around what data science could deliver to companies. Now businesses are starting to realise what is actually commercially viable. This is well known as the technology hype cycle.
</div><br>


<img src="https://i.imgur.com/DtArIoA.png" align="left" style="width:600px;"/>

<div style="font-family:Helvetica Neue;  color:slategray;">
<i>Technology Hype Cycle. Adapted from <a href="https://www.gartner.com/en/research/methodologies/gartner-hype-cycle">Gartner</a></i></div>
<br><br>
<div style="font-family:Helvetica Neue; font-size:16px; line-height:1.7; color:slategray;">
We can check if this is true by looking at the actual Gartner Analysis for AI in 2020. Look at the Machine Learning position in the chart!
</div>

<img src="https://specials-images.forbesimg.com/imageserve/5f7a42499897d2d0a1c67cf5/960x0.jpg" align="left" style="width:600px;"/>


<div style="font-family:Helvetica Neue; color:slategray;">
<i>What’s New In Gartner’s Hype Cycle For AI, 2020. Source: <a href="https://www.forbes.com/sites/louiscolumbus/2020/10/04/whats-new-in-gartners-hype-cycle-for-ai-2020/?sh=1b6f0992335c">Forbes</a></i>
</div>
<br><br>
<div style="font-family:Helvetica Neue; font-size:16px; line-height:1.7; color:slategray;">
Given that we are past the peak of inflated expectations, <mark>I would expect salaries to continue decreasing over the next two to five years</mark>, until Machine Learning reaches the plateau of productivity.
<br><br>
Continuing with the same chart, I want to highlight two professions and show how experience writing code impacts their average salary.
</div>

In [None]:
class CompensationPlot2(BasePlot):
            
    def add_annotations(self) -> None:
        """
        Adds annotations to the plot
        """
        for trace in self.metric.traces:
            if trace.highlighted_traces:
                if trace.subplot_y_position == 1:
                    # Add left annotation
                    self.figure.add_annotation(
                        xref=f'x{trace.subplot_y_position}',
                        yref=f'y{trace.subplot_y_position}',
                        font=dict(
                            size=11,
                            color=self.highlight_color(trace=trace)
                        ),
                        opacity=0.8,
                        align='center',
                        yshift=-90 if trace.trace_name == 'Data Scientist' else 45,
                        yanchor='bottom',
                        xshift=0,
                        xanchor='left',
                        x=trace.x_values.codes[0], 
                        y=trace.y_values[0],
                        text=trace.trace_name,
                        showarrow=False
                        )

In [None]:
metric = CompensationJobTitleMetric(survey=kaggle_combined_survey)
metric.calculate(add_avg=False)
metric.get_subplots(highlighted_traces=['Product/Project Manager', 'Data Scientist'])

CompensationPlot2(
    metric=metric,
    yaxes_title='Average Yearly Compensation (USD)',
    shared_yaxes=True,
    yticks=[30, 60, 90],
    yticks_template='U$ {}k',
    hover_template='U$ %{y:0.1f}k',
    annotation_template='U$ {:0.1f}k',
    x_nticks=1,
    title='<b>Data Scientists with coding skills benefit more from it than product managers.</b>'\
          '<br>Data Scientists with little coding experience are amongst the least paid professionals.'\
          '<br><span style="font-size:14px;color:lightgrey"><i>Average Yearly Compensation in USD of professional respondents per job title</i></span>'
    ).show()

<div style="font-family:Helvetica Neue; font-size:16px; line-height:1.7; color:slategray;">
We can clearly see that experience writing code increases salary for both Product/Project Managers (PMs) and Data Scientists. A PM doesn't need coding experience to earn more than a Data Scientist. However, <mark>because writing code is much more important for Data Science than for Product Management, the lines switch places after 5 years of experience and Data Scientists start earning more than PMs.</mark>
<br><br>
Also note that in 2020 Data Scientists with less than 3 years of experience are the ones with the worst salaries. This might also be an indication of our current position in the Hype Cycle. 
<br>
<div class="alert alert-info">There are a lot of Data Science beginners available in the market, but companies want to hire experienced data scientists with proven records of delivering business results.</div>
<br>
The next charts will be plotted by Machine Learning Experience instead of Years of Experience Writing Code.
</div>
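
<div style="font-family:Helvetica Neue; font-size:16px; line-height:1.7; color:slategray;">
The point where the Data Scientist line overtakes the PM line can also be located programmatically. The sketch below is only an illustration: the band labels, the salary values and the helper <code>first_crossover</code> are hypothetical and are not taken from the survey.
</div>

```python
# Toy averages in USD thousands, NOT survey data; ordered from least to most experience
experience_bands = ["< 1 year", "1-2 years", "3-5 years", "5-10 years", "10+ years"]
pm_salary = [45.0, 50.0, 58.0, 66.0, 75.0]
ds_salary = [30.0, 38.0, 52.0, 70.0, 90.0]


def first_crossover(bands, challenger, leader):
    """Return the first band where `challenger` exceeds `leader`, or None if it never does."""
    for band, challenger_value, leader_value in zip(bands, challenger, leader):
        if challenger_value > leader_value:
            return band
    return None


print(first_crossover(experience_bands, ds_salary, pm_salary))
```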

<h1>How does experience using machine learning methods change compensation?</h1>
<br>
<div style="font-family:Helvetica Neue; font-size:16px; line-height:1.7; color:slategray;">
The survey had two questions about respondents' experience: one about experience writing code to analyze data, the other about experience using machine learning methods. I wanted to know how each country values ML experience in terms of salary. This is the chart: </div>

In [None]:
class CompensationCountryMetric(AverageBaseMetric):
    """
    Average yearly compensation per country, split by years of experience using machine learning methods
    """   
    def __init__(self, survey: KaggleCombinedSurvey) -> None:
        super().__init__(survey=survey, traces_col='Country', y_col='Tenure: Years Using Machine Learning Methods', x_col='Survey Year')
        

    def apply_categories(self):
        self.to_categorical(column='Survey Year', categories=Category.SURVEY_YEAR)
        self.to_categorical(column='Tenure: Years Using Machine Learning Methods', categories=Category.YEARS_USING_ML)
        

In [None]:
metric = CompensationCountryMetric(survey=kaggle_combined_survey)
metric.calculate(add_avg=True)
metric.get_subplots(highlighted_traces=['Average'])

compensation_plot = CompensationPlot(
    metric=metric,
    yaxes_title='Average Yearly Compensation (USD)',
    shared_yaxes=True,
    yticks=[30, 60, 90, 120],
    yticks_template='U$ {}k',
    hover_template='U$ %{y:0.1f}k',
    annotation_template='U$ {:0.1f}k',
    x_nticks=1,
    title='<b>Looking at ML experience, average salaries are stable over time.</b> <br>However, those with less experience saw a drop in earnings in 2020.'\
          '<br><span style="font-size:14px;color:lightgrey"><i>Average Yearly Compensation in USD of professional respondents by country</i></span>'
    )
compensation_plot.show()

<div style="font-family:Helvetica Neue; font-size:16px; line-height:1.7; color:slategray;">
We can see straight away that, on average, experienced ML professionals did not see a reduction in their salaries. This is a sign that companies are finding professionals and that the global market is well balanced between supply and demand. <mark>However, unlike experience writing code, gaining more ML experience does not increase your compensation as much.</mark>
<br><br>
Now I want to focus on two countries: the United States, because it's the one that pays the most, and Brazil, because it's where I come from.</div>

In [None]:
class CompensationPlot3(BasePlot):
            
    def add_annotations(self) -> None:
        """
        Adds annotations to the plot
        """
        for trace in self.metric.traces:
            if trace.highlighted_traces:
                if trace.subplot_y_position == 1:
                    # Add left annotation
                    self.figure.add_annotation(
                        xref=f'x{trace.subplot_y_position}',
                        yref=f'y{trace.subplot_y_position}',
                        font=dict(
                            size=11,
                            color=self.highlight_color(trace=trace)
                        ),
                        opacity=0.8,
                        align='center',
                        yshift=10 if trace.trace_name == 'Brazil' else 50,
                        yanchor='bottom',
                        xshift=0,
                        xanchor='left',
                        x=trace.x_values.codes[0], 
                        y=trace.y_values[0],
                        text=trace.trace_name,
                        showarrow=False
                        )

In [None]:
metric = CompensationCountryMetric(survey=kaggle_combined_survey)
metric.calculate(add_avg=False)
metric.get_subplots(highlighted_traces=['Brazil', 'United States'])

CompensationPlot3(
    metric=metric,
    yaxes_title='Average Yearly Compensation (USD)',
    shared_yaxes=True,
    yticks=[30, 60, 90, 120],
    yticks_template='U$ {}k',
    hover_template='U$ %{y:0.1f}k',
    annotation_template='U$ {:0.1f}k',
    x_nticks=1,
    title='<b>The United States is certainly the country where ML experience is most valued (or overpriced).</b> <br>Other countries, such as Brazil, saw a decrease in compensation in 2020 even for the most experienced.'\
          '<br><span style="font-size:14px;color:lightgrey"><i>Average Yearly Compensation in USD of professional respondents</i></span>'
    ).show()


<div style="font-family:Helvetica Neue; font-size:16px; line-height:1.7; color:slategray;">
In the United States, experienced ML professionals are very well paid (I would probably say that they are overpriced). There is clearly high demand for such professionals, and salaries tend to increase in such situations. <mark>This high demand often also causes data professionals to stay for very short periods (often less than a year) at their jobs, because they receive better offers from other companies.</mark> 
<br><br>
I heard this once, and I think it describes this kind of professional. <b>They are POC engineers - because in such a short time before changing jobs the only thing possible to deliver is a proof of concept.</b>
<br><br>
Now in Brazil, we see a more stable trend over time and over experience, with some decrease in the salary of most professionals in 2020. <mark>There is a currency effect to be considered here: the Brazilian Real lost ~25% of its value against the US Dollar in 2020.</mark>
<br><br>
We see a bigger drop for experienced professionals, probably because expensive employees were laid off due to the pandemic's effects on the economy and had to find other jobs at a lower salary.
</div>
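
<div style="font-family:Helvetica Neue; font-size:16px; line-height:1.7; color:slategray;">
The currency effect is plain arithmetic: a salary that stays flat in Reais shrinks in USD by the same percentage the currency lost. The sketch below works through it; the function name <code>usd_change_pct</code> and the 10% raise are illustrative assumptions, and only the ~25% figure comes from the text above.
</div>

```python
def usd_change_pct(local_raise_pct: float, currency_loss_pct: float) -> float:
    """Percentage change of a salary once it is converted to USD.

    local_raise_pct: raise received in the local currency, in percent.
    currency_loss_pct: value the local currency lost against the USD, in percent.
    """
    local_factor = 1 + local_raise_pct / 100
    fx_factor = 1 - currency_loss_pct / 100
    return (local_factor * fx_factor - 1) * 100


# A flat salary in Reais with a 25% currency loss shows up as a 25% drop in USD
print(round(usd_change_pct(0, 25), 1))   # -25.0
# Even a (hypothetical) 10% raise in Reais still looks like a 17.5% drop in USD
print(round(usd_change_pct(10, 25), 1))  # -17.5
```

<div style="font-family:Helvetica Neue; font-size:16px; line-height:1.7; color:slategray;">
Under these assumptions, a raise of about 33% in Reais would be needed just to keep the USD figure unchanged, since 1 / 0.75 is roughly 1.33.
</div>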


<h1>Creating Professional Profiles</h1>

<div style="font-family:Helvetica Neue; font-size:16px; line-height:1.7; color:slategray;">
For the rest of this analysis we will create data professional profiles to help us understand some behaviours. To create those profiles I used the definition created by <a href="https://www.kaggle.com/tkubacka">Teresa Kubacka</a> in the winning submission of the 2019 Kaggle Survey.
<br><br>

In her notebook <a href="https://www.kaggle.com/tkubacka/a-story-told-through-a-heatmap">Who codes what and how long - a story told through a heatmap</a> she created professional categories using the following two questions:
<ul>
<li>How long have you been writing code to analyze data (at work or at school)?</li>
<li>For how many years have you used machine learning methods?</li>
</ul>
<br><br>
They are as follows:
</div>




<img src="https://i.imgur.com/J9gFPPi.png" align="left" style="width:500px;"/>

<div style="font-family:Helvetica Neue; color:slategray;">
<i>Professional subgroups based on the answers for the two questions. <br>Author: <a href="https://www.kaggle.com/tkubacka">Teresa Kubacka</a> <br>Source: <a href="https://www.kaggle.com/tkubacka/a-story-told-through-a-heatmap">A story told through a heatmap</a></i>
</div>
<div style="font-family:Helvetica Neue; font-size:16px; line-height:1.7; color:slategray;">
<br>
Here I'm just shortening their names for better visualization in the charts:
<ul>
<li> <b>Beginners</b>: Those with less than 2 years of experience in both coding and ML methods.</li>
<li> <b>Coders</b>: Those with lots of coding experience, but who have started working with machine learning only recently.</li>
<li> <b>ML Veterans</b>: Those who have been coding and doing machine learning for a very long time.</li>
<li> <b>Modern DS</b>: They started their careers in ML when the hype began and have enough coding experience to provide measurable value.</li>
</ul>

Now let's look at the yearly compensation for each profile!
    </div>
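
<div style="font-family:Helvetica Neue; font-size:16px; line-height:1.7; color:slategray;">
To make the idea of a profile concrete, here is a rough sketch of how two experience answers could be mapped to a profile name. Only the Beginners rule (less than 2 years of both) comes from the list above. Every other cutoff is a placeholder I picked for illustration, and the real boundaries are the ones in the figure. The function <code>assign_profile</code> is hypothetical, and it takes numeric years, while the survey answers are ranges.
</div>

```python
def assign_profile(years_coding: float, years_ml: float) -> str:
    """Map two experience values (in years) to a profile name.

    Only the Beginners rule comes from the text; all other cutoffs are placeholders.
    """
    if years_coding < 2 and years_ml < 2:
        return "Beginners"
    if years_coding >= 10 and years_ml >= 5:  # placeholder cutoffs
        return "ML Veterans"
    if years_coding >= 5 and years_ml < 2:  # placeholder cutoffs
        return "Coders"
    if years_coding >= 2 and 2 <= years_ml < 5:  # placeholder cutoffs
        return "Modern DS"
    return "None"  # combinations that fit no profile


for coding, ml in [(1, 1), (8, 1), (12, 8), (3, 3), (1, 6)]:
    print(coding, ml, assign_profile(coding, ml))
```

<div style="font-family:Helvetica Neue; font-size:16px; line-height:1.7; color:slategray;">
The <code>"None"</code> fallback is modelled on the <code>'None'</code> value that the notebook appears to filter out before computing percentages per profile.
</div>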

In [None]:
class CompensationProfileMetric(AverageBaseMetric):
    """
    Average yearly compensation per job title, split by professional profile
    """ 
    def __init__(self, survey: KaggleCombinedSurvey) -> None:
        super().__init__(survey=survey, traces_col='Job Title', y_col='Profile', x_col='Survey Year')

    def apply_categories(self):
        self.to_categorical(column='Survey Year', categories=Category.SURVEY_YEAR)
        self.to_categorical(column='Profile', categories=Category.PROFILES)   

class CompensationPlot5(BasePlot):
            
    def add_annotations(self) -> None:
        """
        Adds annotations to the plot
        """
        for trace in self.metric.traces:
            if trace.highlighted_traces:
                if trace.subplot_y_position == 1:
                    # Add left annotation
                    self.figure.add_annotation(
                        xref=f'x{trace.subplot_y_position}',
                        yref=f'y{trace.subplot_y_position}',
                        font=dict(
                            size=11,
                            color=self.highlight_color(trace=trace)
                        ),
                        opacity=0.8,
                        align='center',
                        yshift=15 if trace.trace_name == 'Product/Project Manager' else -55,
                        yanchor='bottom',
                        xshift=0,
                        xanchor='left',
                        x=trace.x_values.codes[0], 
                        y=trace.y_values[0],
                        text=trace.trace_name,
                        showarrow=False
                        )

In [None]:
metric = CompensationProfileMetric(survey=kaggle_combined_survey)
metric.calculate(add_avg=False)
metric.get_subplots(highlighted_traces=['Data Engineer/DBA', 'Product/Project Manager'])

CompensationPlot5(
    metric=metric,
    yaxes_title='Average Yearly Compensation (USD)',
    shared_yaxes=True,
    yticks=[20, 50, 100, 150],
    yticks_template='U$ {}k',
    hover_template='U$ %{y:0.1f}k',
    annotation_template='U$ {:0.1f}k',
    x_nticks=1,
    title='<b>ML Veterans working in Data Engineering and Product Management are in high demand.</b>'\
          '<br>Salaries for both professions are the ones that increased the most since the first survey.'\
          '<br><span style="font-size:14px;color:lightgrey"><i>Average Yearly Compensation in USD of professional respondents.</i></span>'
).show()

<div style="font-family:Helvetica Neue; font-size:16px; line-height:1.7; color:slategray;">
While most other profiles remained stable or had a drop in their earnings in 2020, salaries for ML Veterans in Data Engineering and Product Management continued to increase sharply. This means that those seasoned professionals are being requested to deliver real value to companies, and the problems they are facing have nothing to do with ML algorithms...
  <br>  
<div class="alert alert-danger">The real problems in 2020 are:<b><ul>
<li>how to get and process data for ML</li>
<li>how to manage projects so that they deliver what was promised</li></ul></b>
</div>
    <br>

Now let's have a look at what they think is the best language for an aspiring data scientist to learn first.
</div>

In [None]:

class RecommendedLanguageMetric(PercentageMetric):
    """
    Percentage of respondents recommending each programming language, split by professional profile
    """ 
    def __init__(self, survey: KaggleCombinedSurvey) -> None:
        super().__init__(survey=survey, traces_col='Recommended Programming Language', y_col='Profile', x_col='Survey Year')

    def apply_categories(self):
        self.to_categorical(column='Survey Year', categories=Category.SURVEY_YEAR)
        self.to_categorical(column='Profile', categories=Category.PROFILES) 
    
    def calculate(self, add_avg: bool = False) -> pd.DataFrame:
        """
        Count answers per (x_col, y_col, traces_col)
        Transform the counts into percentages of each (x_col, y_col) group
        Store the result in self.metric_df and return it
        """
        df = self.get_df()
        # Drop rows whose y_col value is the string 'None'
        df = df[df[self.y_col] != 'None']
        detail = self.groupby(df=df, columns=[self.x_col, self.y_col, self.traces_col])
        total = self.groupby(df=df, columns=[self.x_col, self.y_col])
        joined = self.join_dataframes(df1=detail, df2=total, on_columns=[self.x_col, self.y_col]) 
        joined['Metric'] = joined['Dummy'] / joined['Dummy_total'] * 100  # get percentage
        
        if add_avg:
            avg_df = self.calculate_average(df=joined)
            # DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent
            joined = pd.concat([joined, avg_df])
                
        self.metric_df = joined
        return joined

        
class RecommendedLanguagePlot(BasePlot):
                
    def add_annotations(self) -> None:
        """
        Adds annotations to the plot
        """
        for trace in self.metric.traces:
            if trace.highlighted_traces:
                if trace.subplot_y_position == 1:
                    # Add left annotation
                    self.figure.add_annotation(
                        xref=f'x{trace.subplot_y_position}',
                        yref=f'y{trace.subplot_y_position}',
                        font=dict(
                            size=11,
                            color=self.highlight_color(trace=trace)
                        ),
                        opacity=0.8,
                        align='center',
                        yshift=-15 if trace.trace_name == 'SQL' else 10,
                        yanchor='bottom',
                        xshift=0,
                        xanchor='left',
                        x=trace.x_values.codes[0], 
                        y=trace.y_values[0],
                        text=trace.trace_name,
                        showarrow=False
                        )

In [None]:
metric = RecommendedLanguageMetric(survey=kaggle_combined_survey)
metric.calculate(add_avg=False)
metric.get_subplots(highlighted_traces=['Python', 'R', 'SQL'])

RecommendedLanguagePlot(
    metric=metric,
    yaxes_title='% of Respondents',
    shared_yaxes=True,
    yticks=[0, 20, 40, 60, 80],
    yticks_template='{}%',
    hover_template='%{y:0.1f}%',
    annotation_template='{:0.1f}%',
    x_nticks=1,
    title='<b>R is losing space to Python as the most recommended language to learn first.</b> '\
          '<br>Those experienced in writing code are the ones that changed their minds the most over the past years.'\
          '<br><span style="font-size:14px;color:lightgrey"><i>Percentage of professional respondents that recommend '\
          'a programming language <br>for an aspiring data scientist to learn first.</i></span>'
).show()


<div style="font-family:Helvetica Neue; font-size:16px; line-height:1.7; color:slategray;">
<div class="alert alert-info">That old fight between Python fans and R lovers is in the past.<br><b>Python has consolidated itself as the most recommended language to start with. </b>
</div>
    <br>

Around 80% of Beginners recommend Python as the first language. Because this group has little experience coding, this probably means that Python is also their first language.
    <br><br>
    The old ML veterans, who grew up using R for analysis, are also giving Python a chance and started to recommend it more in the last year. SQL recommendations are consistent across all profiles.
<br>
    <div class="alert alert-success">If you want to learn a programming language to do Data Science projects, go with Python. You won't regret it. 
    </div>
    </div>
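
<div style="font-family:Helvetica Neue; font-size:16px; line-height:1.7; color:slategray;">
The percentages in this chart are shares within each survey year and profile: answers for a language divided by all answers in the same group. The snippet below is a dependency-free sketch of that calculation. It uses <code>collections.Counter</code> rather than the notebook's pandas helpers, and the answers are toy values, not survey data.
</div>

```python
from collections import Counter

# Toy answers, NOT survey data: (survey year, profile, recommended language)
answers = [
    (2020, "Beginners", "Python"),
    (2020, "Beginners", "Python"),
    (2020, "Beginners", "Python"),
    (2020, "Beginners", "R"),
    (2020, "ML Veterans", "Python"),
    (2020, "ML Veterans", "R"),
]

# Numerator: respondents per (year, profile, language)
detail = Counter(answers)
# Denominator: respondents per (year, profile)
total = Counter((year, profile) for year, profile, _ in answers)

# Share of each language within its (year, profile) group, in percent
shares = {key: count / total[key[:2]] * 100 for key, count in detail.items()}

for key, share in sorted(shares.items()):
    print(key, f"{share:0.1f}%")
```

<div style="font-family:Helvetica Neue; font-size:16px; line-height:1.7; color:slategray;">
Because each respondent gives a single answer to this question, the shares within one group add up to 100%.
</div>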

In [None]:
class ListColumnsPercentageMetric(PercentageMetric):
    """
    Percentage metric for questions whose answers are stored as lists (multiple selections)
    """ 
    
    def calculate(self, add_avg: bool = False) -> pd.DataFrame:
        """
        Count answers per (x_col, y_col, traces_col) on the exploded data
        Count respondents per (x_col, y_col) on the non-exploded data
        Transform the answer counts into percentages of respondents
        """
        # Numerator: get_df is expected to explode list answers while self.explode is True
        df = self.get_df()
        detail = self.groupby(df=df, columns=[self.x_col, self.y_col, self.traces_col])
        # Denominator: reload without exploding so each respondent is counted once
        self.explode = False
        df = self.get_df()
        total = self.groupby(df=df, columns=[self.x_col, self.y_col])
        joined = self.join_dataframes(df1=detail, df2=total, on_columns=[self.x_col, self.y_col]) 
        joined['Metric'] = joined['Dummy'] / joined['Dummy_total'] * 100  # get percentage
        
        if add_avg:
            avg_df = self.calculate_average(df=joined)
            # DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent
            joined = pd.concat([joined, avg_df])
                
        self.metric_df = joined
        return joined

In [None]:
class LanguagesMetric(ListColumnsPercentageMetric):
    """
    Percentage of respondents using each programming language on a regular basis, split by professional profile
    """ 
    def __init__(self, survey: KaggleCombinedSurvey) -> None:
        super().__init__(
            survey=survey, 
            traces_col='Programming Languages', 
            y_col='Profile', 
            x_col='Survey Year', 
            explode=True
        )
        
    def apply_categories(self):
        self.to_categorical(column='Survey Year', categories=Category.SURVEY_YEAR)
        self.to_categorical(column='Profile', categories=Category.PROFILES) 

In [None]:
class LanguagesPlot(BasePlot):
                
    def add_annotations(self) -> None:
        """
        Adds annotations to the plot
        """
        for trace in self.metric.traces:
            if trace.highlighted_traces:
                if trace.subplot_y_position == 1:
                    # Add left annotation
                    self.figure.add_annotation(
                        xref=f'x{trace.subplot_y_position}',
                        yref=f'y{trace.subplot_y_position}',
                        font=dict(
                            size=11,
                            color=self.highlight_color(trace=trace)
                        ),
                        opacity=0.8,
                        align='center',
                        yshift=-15 if trace.trace_name == 'Python' else 10,
                        yanchor='bottom',
                        xshift=0,
                        xanchor='left',
                        x=trace.x_values.codes[0], 
                        y=trace.y_values[0],
                        text=trace.trace_name,
                        showarrow=False
                        )

In [None]:
metric = LanguagesMetric(survey=kaggle_combined_survey)
metric.calculate(add_avg=False)
metric.get_subplots(highlighted_traces=['Python', 'C/C++'])

LanguagesPlot(
    metric=metric,
    yaxes_title='% of Professionals',
    shared_yaxes=True,
    yticks=[30, 60, 90],
    yticks_template='{}%',
    hover_template='%{y:0.1f}%',
    annotation_template='{:0.1f}%',
    x_nticks=1,
    title='<b>Python is the language most beginners use on a regular basis and adoption is increasing.</b> '\
          '<br>C/C++ usage is also increasing for all profiles, but especially for Coders.'\
          '<br><span style="font-size:14px;color:lightgrey"><i>Percentage of professional respondents that use a language on a regular basis.</i></span>'
).show()

<div style="font-family:Helvetica Neue; font-size:16px; line-height:1.7; color:slategray;">
    There is a noticeable increase in C/C++ usage for all profiles, but especially for Coders, a group that already has lots of experience writing code. This suggests that <mark>more people coming from a C/C++ background (and who use it on a daily basis) want to dive into Machine Learning.</mark> They are coming to Kaggle to practice their skills and learn from the community.

<br><br>
    
Now that we know the languages used on a regular basis for each profile, let's have a look at the primary tool they use to analyse data.
</div>
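
<div style="font-family:Helvetica Neue; font-size:16px; line-height:1.7; color:slategray;">
Language usage is a multiple-selection question, so the percentages are computed a little differently: each selected language is counted once, but the denominator is the number of respondents, not the number of selections. The sketch below illustrates this with toy data and plain Python. It is my reading of what <code>ListColumnsPercentageMetric</code> does, not a copy of it.
</div>

```python
from collections import Counter

# Toy answers, NOT survey data: each respondent lists every language used regularly
respondents = [
    {"profile": "Coders", "languages": ["Python", "C/C++"]},
    {"profile": "Coders", "languages": ["Python"]},
    {"profile": "Coders", "languages": ["C/C++", "SQL"]},
    {"profile": "Coders", "languages": ["Python", "SQL"]},
]

# Numerator: one count per (profile, language) mention, i.e. the "exploded" view
mentions = Counter(
    (person["profile"], language)
    for person in respondents
    for language in person["languages"]
)
# Denominator: one count per respondent, NOT per mention
people = Counter(person["profile"] for person in respondents)

# Share of respondents in each profile that use each language, in percent
usage = {key: count / people[key[0]] * 100 for key, count in mentions.items()}

for key, share in sorted(usage.items()):
    print(key, f"{share:0.1f}%")
```

<div style="font-family:Helvetica Neue; font-size:16px; line-height:1.7; color:slategray;">
With this definition the percentages within a profile can add up to more than 100%, because one person can use several languages.
</div>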


In [None]:
class PrimaryToolMetric(ListColumnsPercentageMetric):
    """
    Percentage of respondents per primary tool used to analyze data, split by professional profile
    """ 
    def __init__(self, survey: KaggleCombinedSurvey) -> None:
        super().__init__(
            survey=survey, 
            traces_col='Primary Tool to Analyze Data', 
            y_col='Profile', 
            x_col='Survey Year', 
            explode=True
        )
        
    def apply_categories(self):
        self.to_categorical(column='Survey Year', categories=Category.SURVEY_YEAR)
        self.to_categorical(column='Profile', categories=Category.PROFILES) 

class PrimaryToolPlot(BasePlot):
                
    def add_annotations(self) -> None:
        """
        Adds annotations to the plot
        """
        for trace in self.metric.traces:
            if trace.highlighted_traces:
                if trace.subplot_y_position == 1:
                    # Add left annotation
                    self.figure.add_annotation(
                        xref=f'x{trace.subplot_y_position}',
                        yref=f'y{trace.subplot_y_position}',
                        font=dict(
                            size=11,
                            color=self.highlight_color(trace=trace)
                        ),
                        opacity=0.8,
                        align='center',
                        yshift=-20 if trace.trace_name == 'Basic statistical software' else 10,
                        yanchor='bottom',
                        xshift=0,
                        xanchor='left',
                        x=trace.x_values.codes[0], 
                        y=trace.y_values[0],
                        text=trace.trace_name,
                        showarrow=False
                        )

In [None]:
metric = PrimaryToolMetric(survey=kaggle_combined_survey)
metric.calculate(add_avg=False)
metric.get_subplots(highlighted_traces=['Basic statistical software', 'Local or hosted development environments'])

PrimaryToolPlot(
    metric=metric,
    yaxes_title='% of Professionals',
    shared_yaxes=True,
    yticks=[15, 30, 45, 60],
    yticks_template='{}%',
    hover_template='%{y:0.1f}%',
    annotation_template='{:0.1f}%',
    x_nticks=1,
    title='<b>Basic statistical software is gaining ground in data analysis.</b> '\
          '<br>And adoption of local or hosted dev environments is greater among Modern Data Scientists.'\
          '<br><span style="font-size:14px;color:lightgrey"><i>Percentage of professional respondents and their primary tool used to analyze data.</i></span>'
).show()

<div style="font-family:Helvetica Neue; font-size:16px; line-height:1.7; color:slategray;">
Here again I was very surprised by the results. <mark>Who would have imagined that in 2020 Modern Data Scientists and Beginners would use Basic Statistical Software (such as Excel and Google Sheets) as their <b>primary tool</b> to analyse data instead of local or hosted development environments?</mark>

<br><br>
I understand that Basic Statistical Software is common ground for everyone and easy to use. But once I switched to writing code and gained experience, I could never imagine moving back to spreadsheets as my primary tool. I can't recall any release or market change in those tools that would justify moving back to them.
<br><br>
    
 <div class="alert alert-danger">I'm aware that both Google and Microsoft <a href="https://techcrunch.com/2020/06/30/google-sheets-will-soon-be-able-to-autocomplete-data-for-you/">added some ML features into their products</a>...<br> <b>But no... Once you start coding you should never move back to spreadsheets. Or should you?</b>
    </div>
</div>

In [None]:
class IDEMetric(ListColumnsPercentageMetric):
    """
    Creates a plotly plot for slopegraphs
    """ 
    def __init__(self, survey: KaggleCombinedSurvey) -> None:
        super().__init__(
            survey=survey, 
            traces_col='IDEs', 
            y_col='Profile', 
            x_col='Survey Year', 
            explode=True
        )
        
        
    def apply_categories(self):
        self.to_categorical(column='Survey Year', categories=Category.SURVEY_YEAR)
        self.to_categorical(column='Profile', categories=Category.PROFILES) 

class IDEPlot(BasePlot):
                
    def add_annotations(self) -> None:
        """
        Adds annotations to the plot
        """
        for trace in self.metric.traces:
            if trace.highlighted_traces:
                if trace.subplot_y_position == 1:
                    # Add left annotation
                    self.figure.add_annotation(
                        xref=f'x{trace.subplot_y_position}',
                        yref=f'y{trace.subplot_y_position}',
                        font=dict(
                            size=11,
                            color=self.highlight_color(trace=trace)
                        ),
                        opacity=0.8,
                        align='center',
                        yshift=10,
                        yanchor='bottom',
                        xshift=0,
                        xanchor='left',
                        x=trace.x_values.codes[0], 
                        y=trace.y_values[0],
                        text=trace.trace_name,
                        showarrow=False
                        )

In [None]:
metric = IDEMetric(survey=kaggle_combined_survey)
metric.calculate(add_avg=False)
metric.get_subplots(highlighted_traces=['Jupyter/IPython', 'Visual Studio'])

IDEPlot(
    metric=metric,
    yaxes_title='% of Professionals',
    shared_yaxes=True,
    yticks=[20, 40, 60, 80],
    yticks_template='{}%',
    hover_template='%{y:0.1f}%',
    annotation_template='{:0.1f}%',
    x_nticks=1,
    title='<b>Visual Studio gained adoption with all professional profiles in 2020</b> '\
          '<br>Overall IDE usage is decreasing with time.'\
          '<br><span style="font-size:14px;color:lightgrey"><i>Percentage of professional respondents and the IDEs they use.</i></span>'
).show()

<div style="font-family:Helvetica Neue; font-size:16px; line-height:1.7; color:slategray;">
Jupyter/IPython is very popular with Beginners and Modern Data Scientists, and less popular with Coders and ML Veterans. <mark>It is interesting to note that regular use of IPython is slowly decreasing over time and giving way to IDEs traditionally used by Software Developers.</mark> Here it's important to highlight the increase in Visual Studio adoption in 2020. I believe this movement is due to the <a href="https://devblogs.microsoft.com/python/notebooks-are-getting-revamped/">native integration with notebooks released in mid-2020.</a>
    
    
<div class="alert alert-info">Do you wanna try a proper IDE that has all good features such as code-completion, variable inspection, debugging, etc and still work on your loved notebook environment? <b>Then I suggest you follow the lead and give a try to Visual Studio Code.</b></div>
</div>
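
<div style="font-family:Helvetica Neue; font-size:16px; line-height:1.7; color:slategray;">
One way to put a number on a statement like "usage is slowly decreasing" is the percentage-point change between the first and last survey year for each line in a slopegraph. The sketch below shows the idea on made-up values. The tool names, the numbers and the helper <code>point_change</code> are all hypothetical and are not taken from the survey data.
</div>

```python
from typing import Dict, List

# Hypothetical values: percentage of respondents per survey year (2018, 2019, 2020).
YEARS: List[int] = [2018, 2019, 2020]
toy_traces: Dict[str, List[float]] = {
    'Tool A': [80.0, 76.0, 74.0],
    'Tool B': [10.0, 11.0, 16.0],
}


def point_change(values: List[float]) -> float:
    """Change in percentage points between the first and last value."""
    return values[-1] - values[0]


for name, values in toy_traces.items():
    change = point_change(values)
    direction = 'up' if change > 0 else 'down' if change < 0 else 'flat'
    print(f'{name}: {change:+.1f} pp from {YEARS[0]} to {YEARS[-1]} ({direction})')
```

<div style="font-family:Helvetica Neue; font-size:16px; line-height:1.7; color:slategray;">
A percentage-point change only compares the endpoints, so it hides anything that happened in the middle year. The slopegraph itself remains the better tool for spotting that.
</div>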

In [None]:
class CloudMetric(ListColumnsPercentageMetric):
    """
    Creates a plotly plot for slopegraphs
    """ 
    def __init__(self, survey: KaggleCombinedSurvey) -> None:
        super().__init__(
            survey=survey, 
            traces_col='Cloud Computing Platforms', 
            y_col='Profile', 
            x_col='Survey Year', 
            explode=True
        )
        
        
    def apply_categories(self):
        self.to_categorical(column='Survey Year', categories=Category.SURVEY_YEAR)
        self.to_categorical(column='Profile', categories=Category.PROFILES) 

class CloudPlot(BasePlot):
              
    def add_annotations(self) -> None:
        """
        Adds annotations to the plot
        """
        for trace in self.metric.traces:
            if trace.highlighted_traces:
                if trace.subplot_y_position == 1:
                    # Add left annotation
                    self.figure.add_annotation(
                        xref=f'x{trace.subplot_y_position}',
                        yref=f'y{trace.subplot_y_position}',
                        font=dict(
                            size=11,
                            color=self.highlight_color(trace=trace)
                        ),
                        opacity=0.8,
                        align='center',
                        yshift=-60 if trace.trace_name == 'Azure' else 5,
                        yanchor='bottom',
                        xshift=0,
                        xanchor='left',
                        x=trace.x_values.codes[0], 
                        y=trace.y_values[0],
                        text=trace.trace_name,
                        showarrow=False
                        )

In [None]:
metric = CloudMetric(survey=kaggle_combined_survey)
metric.calculate(add_avg=False)
metric.get_subplots(highlighted_traces=['None'])

CloudPlot(
    metric=metric,
    yaxes_title='% of Professionals',
    shared_yaxes=True,
    yticks=[10, 30, 50],
    yticks_template='{}%',
    hover_template='%{y:0.1f}%',
    annotation_template='{:0.1f}%',
    x_nticks=1,
    title='<b>Cloud adoption is increasing amongst Kagglers since 2018!</b> '\
          '<br>Those who answered None for cloud platform are decreasing consistently.'\
          '<br><span style="font-size:14px;color:lightgrey"><i>Percentage of professional respondents and the cloud platforms they use.</i></span>'
).show()

<div style="font-family:Helvetica Neue; font-size:16px; line-height:1.7; color:slategray;">
Here we see how many people answered None to cloud platforms (meaning that they don't use a cloud platform on a regular basis). And it is decreasing over time! So... Cloud adoption is increasing amongst professionals, with <mark>Modern Data Scientists being the ones that use cloud services the most.</mark> This is very good news: more people have access to the best Data Science tools, and they are also getting closer to productionizing Data Science!
<br>
<div class="alert alert-warning">Now there is one thing I think it's curious... <b>I would expect ML Veterans to have a lot of experience with cloud, but they don't</b>. Are they too cool for using the cloud?</div>
<br>
Hey Kaggle! This a good question for next years survey: How many years of experience with cloud platforms?
<br><br>
Now how about we have a look at cloud adoption per provider?

</div>
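
<div style="font-family:Helvetica Neue; font-size:16px; line-height:1.7; color:slategray;">
As a rough illustration of what the percentages in these cloud charts represent, the sketch below computes, for each survey year and profile, the share of respondents who selected each option of a multi-select question. It uses only the standard library and a tiny made-up list of answers. The data and the helper <code>share_per_option</code> are hypothetical, and this is not the <code>ListColumnsPercentageMetric</code> implementation used in this notebook.
</div>

```python
from collections import Counter, defaultdict
from typing import Dict, List, Tuple

# Hypothetical toy answers: (survey year, profile, selected cloud platforms).
toy_answers: List[Tuple[int, str, List[str]]] = [
    (2018, 'Beginners', ['None']),
    (2018, 'Beginners', ['None']),
    (2018, 'Beginners', ['AWS', 'GCP']),
    (2020, 'Beginners', ['None']),
    (2020, 'Beginners', ['AWS']),
    (2020, 'Beginners', ['GCP', 'Azure']),
    (2020, 'Beginners', ['AWS', 'GCP']),
]


def share_per_option(answers) -> Dict[Tuple[int, str], Dict[str, float]]:
    """Percentage of respondents in each (year, profile) group selecting each option."""
    totals: Counter = Counter()
    counts = defaultdict(Counter)
    for year, profile, selected in answers:
        group = (year, profile)
        totals[group] += 1              # one respondent
        for option in set(selected):    # count each option once per respondent
            counts[group][option] += 1
    return {
        group: {option: 100 * n / totals[group] for option, n in options.items()}
        for group, options in counts.items()
    }


shares = share_per_option(toy_answers)
for (year, profile), options in sorted(shares.items()):
    print(year, profile, f"None: {options.get('None', 0.0):.1f}%")
```

<div style="font-family:Helvetica Neue; font-size:16px; line-height:1.7; color:slategray;">
The denominator is the number of respondents in the group, not the number of selections, so the percentages of a multi-select question can add up to more than 100%.
</div>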

In [None]:
metric = CloudMetric(survey=kaggle_combined_survey)
metric.calculate(add_avg=False)
metric.get_subplots(highlighted_traces=['AWS', 'Azure', 'GCP'])

CloudPlot(
    metric=metric,
    yaxes_title='% of Professionals',
    shared_yaxes=True,
    yticks=[10, 30, 50],
    yticks_template='{}%',
    hover_template='%{y:0.1f}%',
    annotation_template='{:0.1f}%',
    x_nticks=1,
    title='<b>The three big providers remain the three big providers, with AWS losing marketshare.</b> '\
          '<br>GCP usage amongst coders has increased and is now above Azure'\
          '<br><span style="font-size:14px;color:lightgrey"><i>Percentage of professional respondents and the cloud platforms they use.</i></span>'
).show()


<div style="font-family:Helvetica Neue; font-size:16px; line-height:1.7; color:slategray;">
No big surprises in cloud provider adoption. Google Cloud and Microsoft are increasing market share due to discounts and policies for both startups and large corporations. AWS is the biggest provider and is usually adopted by businesses that were "cloud first" a few years ago.
</div>

In [None]:
class MLStatusMetric(ListColumnsPercentageMetric):
    """
    Creates a plotly plot for slopegraphs
    """ 
    def __init__(self, survey: KaggleCombinedSurvey) -> None:
        super().__init__(
            survey=survey, 
            traces_col='Machine Learning Status in Company', 
            y_col='Profile', 
            x_col='Survey Year', 
            explode=True
        )
        
        
    def apply_categories(self):
        self.to_categorical(column='Survey Year', categories=Category.SURVEY_YEAR)
        self.to_categorical(column='Profile', categories=Category.PROFILES) 

class MLStatusPlot(BasePlot):
              
    def add_annotations(self) -> None:
        """
        Adds annotations to the plot
        """
        for trace in self.metric.traces:
            if trace.highlighted_traces:
                if trace.subplot_y_position == 1:
                    # Add left annotation
                    self.figure.add_annotation(
                        xref=f'x{trace.subplot_y_position}',
                        yref=f'y{trace.subplot_y_position}',
                        font=dict(
                            size=11,
                            color=self.highlight_color(trace=trace)
                        ),
                        opacity=0.8,
                        align='center',
                        yshift=5 if trace.trace_name == 'Exploring ML' else -30,
                        yanchor='bottom',
                        xshift=0,
                        xanchor='left',
                        x=trace.x_values.codes[0], 
                        y=trace.y_values[0],
                        text=trace.trace_name,
                        showarrow=False
                        )

In [None]:
metric = MLStatusMetric(survey=kaggle_combined_survey)
metric.calculate(add_avg=False)
metric.get_subplots(highlighted_traces=['Well established ML'])

MLStatusPlot(
    metric=metric,
    yaxes_title='% of Professionals',
    shared_yaxes=True,
    yticks=[15, 30, 45],
    yticks_template='{}%',
    hover_template='%{y:0.1f}%',
    annotation_template='{:0.1f}%',
    x_nticks=1,
    title='<b>Veterans usually work for companies that have well established models in production</b> '\
          '<br>Coders usually work for companies that are exploring ML and may one day put a model into production'\
          '<br><span style="font-size:14px;color:lightgrey"><i>Percentage of professional respondents and the incorporation of ML methods into the business.</i></span>'
).show()


<div style="font-family:Helvetica Neue; font-size:16px; line-height:1.7; color:slategray;">
I found this chart particularly interesting! The proportion of ML Veterans that work for companies with well established machine learning models in production is huge (and all other categories scored very low). I would say that <mark>ML Veterans are the ones responsible for putting all those models into production, and their experience allows them to do it fast!</mark>
<br><br>
Modern Data Scientists also work for companies with well established models, but there is another aspect here that I would like to explore. Let's see it in the next chart.
</div>

In [None]:
class MLStatusPlot2(BasePlot):
              
    def add_annotations(self) -> None:
        """
        Adds annotations to the plot
        """
        for trace in self.metric.traces:
            if trace.highlighted_traces:
                if trace.subplot_y_position == 1:
                    # Add left annotation
                    self.figure.add_annotation(
                        xref=f'x{trace.subplot_y_position}',
                        yref=f'y{trace.subplot_y_position}',
                        font=dict(
                            size=11,
                            color=self.highlight_color(trace=trace)
                        ),
                        opacity=0.8,
                        align='center',
                        yshift=65 if trace.trace_name == 'Do not use ML / Do not know' else 5,
                        yanchor='bottom',
                        xshift=0,
                        xanchor='left',
                        x=trace.x_values.codes[0], 
                        y=trace.y_values[0],
                        text=trace.trace_name,
                        showarrow=False
                        )

In [None]:
metric = MLStatusMetric(survey=kaggle_combined_survey)
metric.calculate(add_avg=False)
metric.get_subplots(highlighted_traces=['Do not use ML / Do not know', 'Exploring ML', 'Recently started using ML'])

MLStatusPlot2(
    metric=metric,
    yaxes_title='% of Professionals',
    shared_yaxes=True,
    yticks=[15, 30, 45],
    yticks_template='{}%',
    hover_template='%{y:0.1f}%',
    annotation_template='{:0.1f}%',
    x_nticks=1,
    title='<b>Beginners usually work for companies that do not use machine learning</b> '\
          '<br>Modern Data Scientists might be driving ML adoption in their workplaces.'\
          '<br><span style="font-size:14px;color:lightgrey"><i>Percentage of professional respondents and the incorporation of ML methods into the business.</i></span>'
).show()


<div style="font-family:Helvetica Neue; font-size:16px; line-height:1.7; color:slategray;">
<ul>
<li>Many Modern Data Scientists also work for companies that started using ML recently. <mark>Are they the ones driving this change?</mark></li>
<li>Beginners usually work for companies that either don't use machine learning or are starting to explore or use it.</li>
</ul>
I have heard this complaint from so many friends that you will probably relate to it as well: <b>Companies don't want to hire Juniors. It is very difficult to get a chance to join the Data Science market without having prior experience.</b><br><br>
Well... Unfortunately this is true and the data confirms it.
</div>
<div style="font-family:Helvetica Neue; font-size:16px; line-height:1.7; color:slategray;">
<div class="alert alert-danger"><ul>
<li>Companies with well established ML models will only hire ML Veterans.</li>
<li>Companies that are doing ok with ML will only hire mid-level Modern Data Scientists.</li>
<li><b>Only companies that don't use ML or that have started using it recently will hire the beginners.</b></li>
</ul></div>
</div>

<h1>Conclusion</h1>
<br>
<div style="font-family:Helvetica Neue; font-size:16px; line-height:1.7; color:slategray;">
My main objective here was to study the trends in Data Professionals over three years of Kaggle Surveys using only one type of chart: the slopegraph. It was very challenging to plot this data in a way that was informative, useful and beautiful while trying to explore different facets at the same time.
    <br><br>
I won't take too much of your time in this conclusion. <b>I would just like to point out that seeing those results made me very sad, especially because of two things:</b>
</div>
<div style="font-family:Helvetica Neue; font-size:16px; line-height:1.7; color:slategray;"> 
 <ul>
     <li>How wide the gender gap is, and how we didn't do anything to solve it in the past three years.</li>
     <li>How difficult it is for newcomers to enter this very competitive market. <mark>I was lucky to start doing data science in the first hype (around 2015), and now I'm a happy Modern Data Scientist working for a company that already has well established machine learning models in production.</mark> But I understand how frustrating it must be to want to work in Data Science and not get hired just because you lack the experience (which you can only acquire by working in Data Science).</li>
</ul>
</div>
<br><br>
<h3>I hope you have enjoyed this notebook. And good luck to both women and beginners!</h3>