## Table of contents
* [Introduction](#section-one)
     *      [Methodology to analyze past years' survey responses](#subsection-one)
* [Which country sees the huge impact because of Covid19?](#section-two)
* [Which segments of kagglers are highly impacted by pandemic?](#section-three)
     *      [Coding Experience](#subsection-two)
     *      [Age Group](#subsection-three)
     *      [Education Background](#subsection-four)
     *      [ML Experience](#subsection-five)
* [How Data Scientist profile evolved over time and across countries?](#section-four)
* [Conclusion](#section-five)
* [References](#section-six)

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import math

# Input data files are available in the read-only "../input/" directory

from pandas.plotting import parallel_coordinates


import matplotlib.pyplot as plt
import seaborn as sb

import warnings
warnings.filterwarnings(action = 'ignore')

!pip install pycountry-convert
import pycountry_convert as pc

## plotly library
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import plotly.express as px



In [None]:
## Main class to load and clean past years' survey responses
class kaggle_survey:
    def __init__(self):
        self.jobs_oths_2018 = ["Consultant","Research Assistant","Manager","Chief Officer","Developer Advocate",
                               "Marketing Analyst","Salesperson","Principal Investigator","Data Journalist"]
        self.top_countries = ["India","United States of America","Brazil","Japan","Russia"]
        self.sea_countires = ["Indonesia","Malaysia","Thailand","Singapore","Philippines","Viet Nam"]
        self.job_grp = {"Scientist":["Data Scientist","Statistician","Research Scientist","Machine Learning Engineer"],
                   "Engineer" : ["DBA/Database Engineer","Software Engineer","Data Engineer"],
                   "Business": ["Business Analyst","Product/Project Manager"]}
        
        self.ml_exp_mapping = {"I do not use machine learning methods":{"grp":"grp1_[I do not use ML methods]","num":0 },"10+ years" : {"grp":"grp8_[10+]","num":12.5},
                                 "Under 1 year" : {"grp":"grp2_[<1]","num":0.5 }, "1-2 years" : {"grp":"grp3_[1-2]","num":1.5 },
                                 "2-3 years" : {"grp":"grp4_[2-3]","num":2.5 }, "3-4 years":{"grp":"grp5_[3-4]","num":3.5 },
                                 "4-5 years" : {"grp":"grp6_[4-5]","num": 4.5}, "5-10 years" : {"grp":"grp7_[5-10]","num":7.5 }}
        
        self.coding_exp_mapping = {"I have never written code":{"grp":"grp1_[0]","num":0 },"20+ years" : {"grp":"grp7_[20+]","num":22.5},
                                 "< 1 years" : {"grp":"grp2_[<1]","num":0.5 }, "1-2 years" : {"grp":"grp3_[1-2]","num":1 },
                                 "3-5 years" : {"grp":"grp4_[3-5]","num":4 }, "5-10 years":{"grp":"grp5_[5-10]","num":7.5 },
                                 "10-20 years" : {"grp":"grp6_[10-20]","num": 15}}
        
        self.employer_size_mapping = {"0-49 employees":{"grp":"grp1_[0-49]","num": 25 },"10,000 or more employees" : {"grp":"grp5_[10,000+]","num":12000},
                                 "50-249 employees" : {"grp":"grp2_[50-249]","num":150 }, "250-999 employees" : {"grp":"grp3_[250-999]","num": 625},
                                 "1000-9,999 employees" : {"grp":"grp4_[1000-9,999]","num":5500 }}
        
        self.ds_workloads_mapping = {"0":{"grp":"grp1_[0]","num": 0 },
                             "1-2" : {"grp":"grp2_[1-10]","num":5},'3-4' : {"grp":"grp2_[1-10]","num":5},
                             '5-9' : {"grp":"grp2_[1-10]","num":5},
                             '10-14' : {"grp":"grp3_[10+]","num":15},'15-19' : {"grp":"grp3_[10+]","num":15},
                             '20+' : {"grp":"grp3_[10+]","num":15}}
        
        self.ml_usage_mapping = {"I do not know":{"grp":"grp1_[I do not know]"},
                             "No (we do not use ML methods)" : {"grp":"grp2_[No (we do not use ML methods)]"},
                             'We use ML methods for generating insights (but do not put working models into production)' : {"grp":"grp3_[We use ML methods for generating insights (but do not put working models into production)]"},
                             'We are exploring ML methods (and may one day put a model into production)' : {"grp":"grp4_[We are exploring ML methods (and may one day put a model into production)]"},
                             'We recently started using ML methods (i.e., models in production for less than 2 years)' : {"grp":"grp5_[We recently started using ML methods (i.e., models in production for less than 2 years)]"},
                             'We have well established ML methods (i.e., models in production for more than 2 years)' : {"grp":"grp6_[We have well established ML methods (i.e., models in production for more than 2 years)]"}}
                                    
                                
        
    def load_past_years_data(self):
        self.data_2020 = pd.read_csv('/kaggle/input/kaggle-survey-2020/kaggle_survey_2020_responses.csv')
        self.data_2019 = pd.read_csv('/kaggle/input/kaggle-survey-2019/multiple_choice_responses.csv')
        self.data_2018 = pd.read_csv('/kaggle/input/kaggle-survey-2018/multipleChoiceResponses.csv')
        self.data_2017 = pd.read_csv('/kaggle/input/kaggle-survey-2017/multipleChoiceResponses.csv',engine='python')
    
    @staticmethod    
    def country_continent(row):
        try:
            country_code = pc.country_name_to_country_alpha2(row, cn_name_format="default")
            continent_name = pc.country_alpha2_to_continent_code(country_code)

        except KeyError:
            continent_name = "Other"
            if row not in ["Other","I do not wish to disclose my location"]:
                print (row)
        
        return continent_name

    def preprocess_data(self):
        '''
        This function cleans past years' responses.
        '''
        self.data_2020['Year'] = [2020]*len(self.data_2020)
        self.data_2019['Year'] = [2019]*len(self.data_2019)
        self.data_2018['Year'] = [2018]*len(self.data_2018)
        self.data_2017['Year'] = [2017]*len(self.data_2017)

        self.data_2020['cnt'] = [1]*len(self.data_2020)
        self.data_2019['cnt'] = [1]*len(self.data_2019)
        self.data_2018['cnt'] = [1]*len(self.data_2018)
        self.data_2017['cnt'] = [1]*len(self.data_2017)

        self.data_2020 = self.data_2020.loc[1:].reset_index(drop=True)
        self.data_2019 = self.data_2019.loc[1:].reset_index(drop=True)
        self.data_2018 = self.data_2018.loc[1:].reset_index(drop=True)

        self.data_2017 = self.data_2017[self.data_2017.Age>=18]
        
    def get_gender(self):
        self.data_2017['Q2'] =  self.data_2017['GenderSelect']
        self.data_2020['Q2'] = np.where(self.data_2020['Q2']=="Man","Male",self.data_2020['Q2'])
        self.data_2020['Q2'] = np.where(self.data_2020['Q2']=="Woman","Female",self.data_2020['Q2'])
    
    def clean_country_col(self):
        '''
         This function cleans the country question responses across the years.
        '''
        self.data_2017['Q3'] = self.data_2017['Country']
        self.data_2017['Q3'] = np.where(self.data_2017['Q3'] == "United States","United States of America",self.data_2017['Q3'])
        
        self.data_2020['country'] = self.data_2020["Q3"]
        self.data_2019['country'] = self.data_2019["Q3"]
        self.data_2018['country'] = self.data_2018["Q3"]
        self.data_2017['country'] = self.data_2017["Q3"]
        
        
        self.data_2020['country'] = np.where(self.data_2020['country'] == "United Kingdom of Great Britain and Northern Ireland","United Kingdom",self.data_2020['country'] )
        self.data_2019['country'] = np.where(self.data_2019['country'] == "United Kingdom of Great Britain and Northern Ireland","United Kingdom",self.data_2019['country'] )
        self.data_2018['country'] = np.where(self.data_2018['country'] == "United Kingdom of Great Britain and Northern Ireland","United Kingdom",self.data_2018['country'] )
        
        self.data_2020['country'] = np.where(self.data_2020['country'] == "Iran, Islamic Republic of...","Iran",self.data_2020['country'] )
        self.data_2019['country'] = np.where(self.data_2019['country'] == "Iran, Islamic Republic of...","Iran",self.data_2019['country'] )
        self.data_2018['country'] = np.where(self.data_2018['country'] == "Iran, Islamic Republic of...","Iran",self.data_2018['country'] )

        self.data_2020['country'] = np.where(self.data_2020['country'] == "Republic of Korea","South Korea",self.data_2020['country'] )
        self.data_2019['country'] = np.where(self.data_2019['country'] == "Republic of Korea","South Korea",self.data_2019['country'] )
        self.data_2018['country'] = np.where(self.data_2018['country'] == "Republic of Korea","South Korea",self.data_2018['country'] )

        self.data_2019['country'] = np.where(self.data_2019['country'] == "Hong Kong (S.A.R.)","Hong Kong",self.data_2019['country'] )
        self.data_2018['country'] = np.where(self.data_2018['country'] == "Hong Kong (S.A.R.)","Hong Kong",self.data_2018['country'] )

        self.data_2017['country'] = np.where(self.data_2017['country'].isin(["People 's Republic of China","Republic of China"]),"China",self.data_2017['country'] )
        self.data_2017['country'] = np.where(self.data_2017['country'].isin(["Vietnam"]),"Viet Nam",self.data_2017['country'] )

        self.data_2017['country'] = self.data_2017['country'].fillna("Other")
        
        self.data_2020["country_grp"] = self.data_2020['country'].apply(self.country_continent)
        self.data_2019["country_grp"] = self.data_2019['country'].apply(self.country_continent)
        self.data_2018["country_grp"] = self.data_2018['country'].apply(self.country_continent)
        self.data_2017["country_grp"] = self.data_2017['country'].apply(self.country_continent)
        
        
        self.data_2020["top_countries"] = np.where(self.data_2020['country'].isin(self.top_countries),self.data_2020['country'],"Others")
        self.data_2020["top_countries"] = np.where(self.data_2020['country'].isin(self.sea_countires),"SEA",self.data_2020['top_countries'])

        self.data_2019["top_countries"] = np.where(self.data_2019['country'].isin(self.top_countries),self.data_2019['country'],"Others")
        self.data_2019["top_countries"] = np.where(self.data_2019['country'].isin(self.sea_countires),"SEA",self.data_2019['top_countries'])

        self.data_2018["top_countries"] = np.where(self.data_2018['country'].isin(self.top_countries),self.data_2018['country'],"Others")
        self.data_2018["top_countries"] = np.where(self.data_2018['country'].isin(self.sea_countires),"SEA",self.data_2018['top_countries'])

        self.data_2017["top_countries"] = np.where(self.data_2017['country'].isin(self.top_countries),self.data_2017['country'],"Others")
        self.data_2017["top_countries"] = np.where(self.data_2017['country'].isin(self.sea_countires),"SEA",self.data_2017['top_countries'])

    
    def get_age(self):
        '''
         This function cleans the age responses across the years.
         '''
        self.data_2018['Q2'] = np.where(self.data_2018['Q2'].isin(["70-79","80+"]),
                                  "70+", self.data_2018['Q2'])
        
    def get_job(self):
        '''
         This function cleans the job question responses across the years.
         '''
        self.data_2018['Q6'] = np.where(self.data_2018['Q6'].isin(self.jobs_oths_2018),"Other",self.data_2018['Q6'] )
        self.data_2020['Q5'] = np.where(self.data_2020['Q5'] == "Currently not employed","Not employed",self.data_2020['Q5'])

        self.data_2020['job'] = self.data_2020["Q5"]
        self.data_2020['job'] = np.where(self.data_2020['job'] == "Machine Learning Engineer","Data Scientist",self.data_2020['job'] )

        self.data_2019['job'] = self.data_2019["Q5"]
        self.data_2018['job'] = self.data_2018["Q6"]
        
        self.data_2017['job'] = np.where(self.data_2017['EmploymentStatus'].isin(['Not employed, but looking for work','Not employed, and not looking for work']),
                                         "Not employed",self.data_2017['CurrentJobTitleSelect'])
        self.data_2020['job_grp'] = self.data_2020['job']
        self.data_2019['job_grp'] = self.data_2019['job']
        self.data_2018['job_grp'] = self.data_2018['job']
        self.data_2017['job_grp'] = self.data_2017['job']
        for key, values in self.job_grp.items():
            self.data_2020['job_grp'] = np.where(self.data_2020['job'].isin(values),key,self.data_2020['job_grp'])
            self.data_2019['job_grp'] = np.where(self.data_2019['job'].isin(values),key,self.data_2019['job_grp'])
            self.data_2018['job_grp'] = np.where(self.data_2018['job'].isin(values),key,self.data_2018['job_grp'])
            self.data_2017['job_grp'] = np.where(self.data_2017['job'].isin(values),key,self.data_2017['job_grp'])


    def create_num(self,df,col,mapping,final_col):
        '''
         This function converts binned answers into numeric values.
        '''
        df[final_col] = df[col]
        for key,value in mapping.items():
            df[final_col] = np.where(df[final_col]==key,value['num'],df[final_col])
        #display(df[final_col])
        df[final_col] = df[final_col].astype(float)
        return df
    
    def create_grp(self,df,col,mapping,final_col):
        '''
         This function converts binned answers into ordered group labels for presentation.
        '''
        df[final_col] = df[col]
        for key,value in mapping.items():
            df[final_col] = np.where(df[final_col]==key,value['grp'],df[final_col])
        return df
        
    def get_ml_exp(self):
        '''
         This function cleans the ML experience question responses across the years.
        '''
        
        self.data_2018['Q25'] = np.where(self.data_2018['Q25'] == "< 1 year","Under 1 year",self.data_2018['Q25'])
        self.data_2018['Q25'] = np.where(self.data_2018.Q25.isin(['10-15 years','20+ years']), "10+ years", self.data_2018.Q25)
        self.data_2018['Q25'] = np.where(self.data_2018.Q25.isin(['I have never studied machine learning but plan to learn in the future','I have never studied machine learning and I do not plan to']),
                                    "I do not use machine learning methods", self.data_2018.Q25)
        
        self.data_2019['Q23'] = np.where(self.data_2019.Q23.isin(['10-15 years','20+ years']), "10+ years", self.data_2019.Q23)
        self.data_2019['Q23'] = np.where(self.data_2019['Q23'] == "< 1 years","Under 1 year",self.data_2019['Q23'])
        
        self.data_2020['Q15'] = np.where(self.data_2020.Q15.isin(['10-20 years','20 or more years']), "10+ years", self.data_2020.Q15)
              
        self.data_2020 = self.create_num(self.data_2020,"Q15",self.ml_exp_mapping,"ml_exp_num" )
        self.data_2020 = self.create_grp(self.data_2020,"Q15",self.ml_exp_mapping,"ml_exp_grp" )
        
        self.data_2019 = self.create_num(self.data_2019,"Q23",self.ml_exp_mapping,"ml_exp_num" )
        self.data_2019 = self.create_grp(self.data_2019,"Q23",self.ml_exp_mapping,"ml_exp_grp" )
        
        self.data_2018 = self.create_num(self.data_2018,"Q25",self.ml_exp_mapping,"ml_exp_num" )
        self.data_2018 = self.create_grp(self.data_2018,"Q25",self.ml_exp_mapping,"ml_exp_grp" )
        
        
    def get_coding_exp(self):
        '''
         This function cleans the coding experience question responses across the years.
        '''
        
        self.data_2020 = self.create_num(self.data_2020,"Q6",self.coding_exp_mapping,"coding_exp_num" )
        self.data_2020 = self.create_grp(self.data_2020,"Q6",self.coding_exp_mapping,"coding_exp_grp" )
        
        self.data_2019 = self.create_num(self.data_2019,"Q15",self.coding_exp_mapping,"coding_exp_num" )
        self.data_2019 = self.create_grp(self.data_2019,"Q15",self.coding_exp_mapping,"coding_exp_grp" )
        
        self.data_2018['Q24'] = np.where(self.data_2018.Q24.isin(["20-30 years","30-40 years","40+ years"]), "20+ years" , self.data_2018['Q24'])
        self.data_2018['Q24'] = np.where(self.data_2018.Q24.isin(["I have never written code but I want to learn","I have never written code and I do not want to learn"]),
                                    "I have never written code" ,self.data_2018['Q24'])
        self.data_2018['Q24'] = np.where(self.data_2018.Q24.isin(["< 1 year"]),"< 1 years" ,self.data_2018['Q24'])
        self.data_2018 = self.create_num(self.data_2018,"Q24",self.coding_exp_mapping,"coding_exp_num" )
        self.data_2018 = self.create_grp(self.data_2018,"Q24",self.coding_exp_mapping,"coding_exp_grp" )
        
        
        
    def get_company_size(self):
        '''
         This function cleans the employer size question responses across the years.
        '''
        self.data_2020 = self.create_num(self.data_2020,"Q20",self.employer_size_mapping,"employer_size_num" )
        self.data_2020 = self.create_grp(self.data_2020,"Q20",self.employer_size_mapping,"employer_size_grp" )
        
        
        self.data_2019['Q6'] = np.where(self.data_2019['Q6'].isin(["> 10,000 employees"]), "10,000 or more employees" , self.data_2019['Q6'])
        self.data_2019 = self.create_num(self.data_2019,"Q6",self.employer_size_mapping,"employer_size_num" )
        self.data_2019 = self.create_grp(self.data_2019,"Q6",self.employer_size_mapping,"employer_size_grp" )
        
    def get_ds_workloads(self):
        '''
         This function cleans the ML workload size question responses across the years.
        '''
        
        self.data_2020 = self.create_num(self.data_2020,"Q21",self.ds_workloads_mapping,"ds_workloads_num" )
        self.data_2020 = self.create_grp(self.data_2020,"Q21",self.ds_workloads_mapping,"ds_workloads_grp" )
        
        self.data_2019 = self.create_num(self.data_2019,"Q7",self.ds_workloads_mapping,"ds_workloads_num" )
        self.data_2019 = self.create_grp(self.data_2019,"Q7",self.ds_workloads_mapping,"ds_workloads_grp" )
       
        
    def get_ml_usage(self):       
        '''
         This function cleans the ML usage question responses across the years.
        '''
        self.data_2020 = self.create_grp(self.data_2020,"Q22",self.ml_usage_mapping,"ml_usage_grp" )
        self.data_2019 = self.create_grp(self.data_2019,"Q8",self.ml_usage_mapping,"ml_usage_grp" )
        self.data_2018 = self.create_grp(self.data_2018,"Q10",self.ml_usage_mapping,"ml_usage_grp" )
                                 
                                 
    def get_past_data(self):
        '''
         Main function to call above functions.
        '''
        self.load_past_years_data()
        self.preprocess_data()
        self.get_gender()
        self.clean_country_col()
        self.get_age()
        self.get_job()
        self.get_ml_exp()
        self.get_ml_usage()
        self.get_coding_exp()
        self.get_company_size()
        self.get_ds_workloads()
        
        return self.data_2020,self.data_2019,self.data_2018,self.data_2017
        
        

In [None]:
## load data
ks = kaggle_survey()
data_2020,data_2019,data_2018,data_2017 = ks.get_past_data()
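
The class above turns binned survey answers into representative numbers with repeated `np.where` calls. As a minimal sketch of the same idea, the snippet below uses `Series.map` on a handful of the coding-experience bins. The helper name `bins_to_numeric` and the toy answers are hypothetical and are not part of the notebook's pipeline.

```python
import pandas as pd

# Subset of the coding-experience bins with the numeric values used in the class above.
coding_exp_num = {
    "I have never written code": 0,
    "< 1 years": 0.5,
    "1-2 years": 1,
    "3-5 years": 4,
    "5-10 years": 7.5,
}

def bins_to_numeric(answers: pd.Series, mapping: dict) -> pd.Series:
    """Map each binned answer to its numeric value (hypothetical helper).

    Missing answers and labels absent from the mapping become NaN.
    """
    return answers.map(mapping).astype(float)

# Toy answers (made up), including a missing value and an unmapped label.
toy_answers = pd.Series(["3-5 years", "< 1 years", None, "unexpected label"])
numeric = bins_to_numeric(toy_answers, coding_exp_num)
print(numeric.tolist())
```

One difference worth noting: with this approach an unmapped label silently becomes NaN, whereas `create_num` would fail at the `astype(float)` step, so the choice depends on whether unexpected labels should stop the run.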

In [None]:
default = ["ALL"]
common_countries = ["India","United States of America"]
top4_countries = ["India","United States of America",'Brazil','Japan']
years_color_code = {"2020":"#CF7224","2019":"#17B0EB","2018":"#3D17EB"}

In [None]:
## Common Functions 

def yearly_trends_NE(data_sets = {"2020":data_2020,
                                  "2019":data_2019,
                                  "2018":data_2018} ,
                     columns = {"2020": "Q1",
                                "2019": "Q1",
                                "2018": "Q1"}, 
                     filters = {"country" : ["ALL"], "job" : ['ALL']},
                     exclude_null = True, threshold_per_year = 10,sort_values = True,figure_hgt = 400,
                     title_values = {"main":"Title","xaxis":"xaxis","yaxis":"yaxis"},
                     color_range = years_color_code ):
    
    '''
    This function creates a grouped bar chart.
    
    It has the following dependencies:
    plotly express: 0.4.1
    
    import plotly.express as px
    '''
    
    yearly_data = pd.DataFrame()
    for year in columns.keys() :
        tmp_col = columns[year]
        tmp_data = data_sets[year].copy()
        tmp_data = tmp_data[~tmp_data.job.isnull()]
        tmp_data['not_employed'] = np.where(tmp_data.job == "Not employed",1,0)
        
        # apply the available filters
        for filter_col,values in filters.items():
            if values[0] != "ALL":
                tmp_data = tmp_data[tmp_data[filter_col].isin(values)] 
                
        # exclude null values
        if exclude_null:
            tmp_data = tmp_data[~tmp_data[tmp_col].isnull()]
        
        # a list (not a set) keeps the order of the aggregate columns deterministic
        df = tmp_data.groupby([tmp_col,'Year'])['not_employed'].agg(["count","mean","sum"]).reset_index()
        df['mean'] = 100*df['mean']  
        df.rename(columns= {tmp_col:"var"}, inplace = True)        
        yearly_data = pd.concat([yearly_data,df], axis=0)
        

    
    if (sort_values): 
        # sorting based on yearly change in trend 
        a = yearly_data.pivot_table(index = "var",columns = "Year",values = "mean").reset_index()
        a['2020_2019'] = a[2020] - a[2019]
        a['2019_2018'] = a[2019] - a[2018]
        a['yearly_change'] = a['2020_2019']/a['2019_2018']
        a.sort_values(by =['yearly_change'], inplace = True)

        #removing insufficient data points 
        b = yearly_data.pivot_table(index = "var",columns = "Year",values = "sum").reset_index()
        # sum only the yearly count columns; the string column "var" is excluded
        b['total_sum'] = b.drop(columns = "var").sum(axis=1)
        a = a.merge(b,on=['var'], how = "inner")
        
        #implement both the sorting
        fil_val = list(a[a.total_sum>threshold_per_year*len(columns.keys())]['var'].values)

        dummy_df = pd.DataFrame()
        dummy_df['var'] = list(fil_val)
        dummy_df['ranking'] = range(0,len(dummy_df))
        
        #sorting the columns
        yearly_data = yearly_data.merge(dummy_df,on=['var'], how = "inner")
        yearly_data.sort_values(by = ['ranking'], inplace = True)
    
    # Draw Plot
    fig = go.Figure()
    for year in list(columns.keys())[::-1]:
        tmp_data = yearly_data[yearly_data.Year==int(year)]
        fig.add_trace(go.Bar( x=tmp_data['var'],
                                y=tmp_data['mean'],
                                name=year,
                                marker_color=color_range[year]))

    fig.update_layout(barmode='group', xaxis_tickangle=-45, title = title_values['main'], width = 1300, height = figure_hgt)
    fig.update_xaxes(title=dict(text = title_values['xaxis']))
    fig.update_yaxes(title=dict(text = title_values['yaxis']))
    fig.show()

    
def plotly_bar_chart(response_counts,title,y_axis_title,orientation):
    '''
    This function creates a bar chart.
    
    It has the following dependencies:
    plotly express: 0.4.1
    
    import plotly.express as px
    '''
    response_counts_series = pd.Series(response_counts)
    fig = px.bar(response_counts_series,
             labels={"index": '',"value": y_axis_title},
             text=response_counts_series.values,
             orientation=orientation,)
    fig.update_layout(showlegend=False,
                      title={'text': title,
                             'y':0.95,
                             'x':0.5,})
    fig.show()
    
    
def multilevel_cols(df):
    new_col = []
    for i in df.columns:
        if len(i[1]):
            new_col.append(i[0]+"_"+i[1])
        else:
            new_col.append(i[0])
    return new_col

def yearly_trends_job_title(data_sets = {"2020":data_2020,
                                  "2019":data_2019,
                                  "2018":data_2018} ,
                            years = ['2020','2019','2018'],
                            countries_comp = ['India'],job = ['Data Scientist'],
                            figure_hgt = 500,x = "ml_exp_num",y="coding_exp_num",
                            title_values = {"main":"Title","xaxis":"xaxis","yaxis":"yaxis"}):
    
    '''
    This function creates a bubble (scatter) chart of the mean of x vs. the mean of y per country and year.
    
    It has the following dependencies:
    plotly express: 0.4.1
    
    import plotly.express as px
    '''
    
    res = pd.DataFrame()
    for year in years:
        df = data_sets[year].groupby(['country','job'])[[x,y]].agg(["mean","count"]).reset_index()
        df.columns = multilevel_cols(df)
        df = df[df['job'].isin(job)]
        df = df[df.country.isin(countries_comp)]
        df['Year'] = [year]*len(df)
        res = pd.concat([res,df], axis=0)
    size_col = x+"_count"
    x+= "_mean"
    y+= "_mean"
    
    ##Draw plot 
    fig = px.scatter(res, x=x, y=y, text="country", log_x=True, size_max=100, color="Year",size = size_col,
                     title=title_values['main'])
    fig.update_traces(textposition='top center')
    fig.update_layout(xaxis_tickangle=-45, width = 1400, height = figure_hgt)
    fig.update_xaxes(title=dict(text = title_values['xaxis']))
    fig.update_yaxes(title=dict(text = title_values['yaxis']))

    fig.show()
    

### <a id="section-one"></a>
<font color='#2A6FBB' size=5 >(1) Introduction</font><br>  

It's no secret that the [novel coronavirus 2019 (COVID-19)](https://en.wikipedia.org/wiki/COVID-19_pandemic) outbreak has affected millions of lives around the world: many companies shut down and many people lost their jobs. In this notebook, I analyze the impact of the COVID-19 pandemic on Kagglers. The Kaggle survey was live for 3.5 weeks in October, so I believe that by examining the survey responses we can get an initial view of the pandemic's effect on the data science and machine learning community. I've also included past years' Kaggle survey responses to highlight how the 2020 trend differs from 2019 and 2018. I have not included the 2017 survey responses, as there was no significant overlap in questions.  
 
In this work, I analyze the survey question ***Select the title most similar to your current role (or most recent title if retired)***, focusing on the Not Employed segment of Kagglers, and compare it with past years' trends to draw insights about the pandemic's impact. The main goal is to understand that impact and help answer the following questions:
 
- `Which country sees the huge impact because of Covid19?`
- `Which segments of kagglers are highly impacted by pandemic?`


Let's begin by looking at the country-wise response counts of the 2020 survey.


In [None]:
percentages = round(100*data_2020['country'].value_counts(normalize = True)[:10],2)
title_for_chart = '<b>Most Common Nationalities (2020 Survey)</b>'
title_for_y_axis = '% of respondents'
orientation_for_chart = 'h'
plotly_bar_chart(response_counts=percentages,
                 title=title_for_chart,
                 y_axis_title=title_for_y_axis,
                 orientation=orientation_for_chart)

- India and the USA together contributed close to 41% of all responses. To reduce the bias these two countries introduce into worldwide insights, I've analyzed their data separately where required. 
- To analyze the COVID-19 impact on Kagglers' jobs, I've focused on **"Not employed"** Kagglers, identified from Q5 of the survey: "Select the title most similar to your current role (or most recent title if retired)".

In [None]:
rows=2
cols=3
fig = make_subplots(
    rows=rows, cols=cols,
    specs=[[{"type": "pie","rowspan":2},{"type": "pie"},{"type": "pie"}],
           [None, {"type": "pie"},{"type": "pie"}]],horizontal_spacing = 0.05,vertical_spacing = 0.25,
    print_grid=False)

not_employed_2020 = data_2020['job'].value_counts(normalize = True)
pull = [0.2]*len(not_employed_2020)
pull = pull*(not_employed_2020.index=="Not employed")
fig.add_trace(go.Pie(labels=not_employed_2020.index, values=not_employed_2020.values, pull=pull,title = "<b>2020 Survey - All responses</b>"),row=1,col=1)
for count,country in enumerate(common_countries+ ['Brazil','Japan']):
    row = int(count/rows)+1
    column = count +1-rows*int(count/rows)+1
    not_employed_country = data_2020[data_2020.country==country].Q5.value_counts(normalize = True)
    pull = [0.2]*len(not_employed_country)
    pull = pull*(not_employed_country.index=="Not employed")
    fig.add_trace(go.Pie(labels=not_employed_country.index, values=not_employed_country.values, pull=pull,title = '<b>' + country + '</b>'),row=row,col=column)
fig.show()


- In the 2020 survey, out of 19,277 non-null responses to Q5, 1,652 Kagglers are 'Not employed', which is 8.57% of the total. 
- The not-employed percentage varies from 4% to 10.1% among the countries with the most responses, with India the highest at 10.1%. 
- **Looking at the 2020 numbers alone is not enough to conclude whether the high not-employed rate is caused by an external event (COVID-19) or whether such a rate is normal. Analyzing past years' trends will help identify the main cause of the high not-employed percentage in 2020.**
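
As a small sketch of how such a share can be computed, the snippet below relies on `value_counts(normalize=True)`, which drops missing answers by default, so the denominator is the non-null count. The helper `share_pct` and the toy answers are hypothetical; the last lines only re-derive the quoted 8.57% from the two counts stated above.

```python
import pandas as pd

def share_pct(responses: pd.Series, label: str) -> float:
    """Percentage of non-null responses equal to `label` (hypothetical helper)."""
    # value_counts drops NaN by default, so shares are relative to non-null answers.
    shares = responses.value_counts(normalize=True)
    return round(100 * shares.get(label, 0.0), 2)

# Toy answers (made up), including one missing response.
toy_q5 = pd.Series(["Not employed", "Data Scientist", "Student", "Data Scientist", None])
toy_share = share_pct(toy_q5, "Not employed")  # 1 of the 4 non-null answers

# Cross-check of the quoted figure from the two counts given in the text.
quoted_share = round(100 * 1652 / 19277, 2)
print(toy_share, quoted_share)
```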

<a id="subsection-one"></a>
<font color='#2A6FBB' size=5> Methodology to analyze past years' survey responses</font><br>  
I used the following methodology to compare past years' data and generate insights:
- The base population changes from year to year, and so does the distribution of the dimensions (country, age, job) used for the analysis. 
- For example, the number of respondents from the USA decreased from 3085 (2019) to 2237 (2020), whereas the total number of survey responses increased in 2020 compared to 2019. To normalize for this, **I tracked how the fraction of not-employed respondents changes over time across different dimensions (country, age, job) rather than using absolute not-employed response counts.**
- I applied a minimum count condition on not-employed responses across dimensions to remove noise from the data.
- Finally, I sorted the data by the year-wise change in the not-employed percentage, comparing the 2018-19 change with the 2019-20 change.
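
The steps above can be sketched on toy data as follows. This is a simplified illustration, not the notebook's `yearly_trends_NE` function: the data is made up, the helper `not_employed_share` is hypothetical, and it sorts by the plain percentage-point change from 2019 to 2020 rather than by the ratio of the two yearly changes.

```python
import pandas as pd

# Toy responses (made up, not survey data): 3 respondents per country per year.
toy = pd.DataFrame({
    "Year": [2018] * 6 + [2019] * 6 + [2020] * 6,
    "country": ["A", "A", "A", "B", "B", "B"] * 3,
    "job": [
        "Not employed", "Data Scientist", "Data Scientist",    # 2018, country A
        "Data Scientist", "Data Scientist", "Data Scientist",  # 2018, country B
        "Not employed", "Data Scientist", "Data Scientist",    # 2019, country A
        "Not employed", "Data Scientist", "Data Scientist",    # 2019, country B
        "Not employed", "Not employed", "Data Scientist",      # 2020, country A
        "Not employed", "Data Scientist", "Data Scientist",    # 2020, country B
    ],
})

def not_employed_share(df, dim, min_count=1):
    """Not-employed percentage per value of `dim` and year, one column per year."""
    flagged = df.assign(not_employed=(df["job"] == "Not employed").astype(int))
    stats = (flagged.groupby([dim, "Year"])["not_employed"]
                    .agg(["sum", "mean"])
                    .reset_index())
    # Keep only groups whose total not-employed count reaches the threshold.
    totals = stats.groupby(dim)["sum"].transform("sum")
    stats = stats[totals >= min_count]
    share = stats.pivot(index=dim, columns="Year", values="mean") * 100
    # Percentage-point change between the two most recent years.
    share["change_2019_2020"] = share[2020] - share[2019]
    return share.sort_values("change_2019_2020")

result = not_employed_share(toy, "country")
print(result.round(1))
```

Working with fractions instead of counts keeps the comparison meaningful even when the number of respondents per country differs between years.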


<a id="section-two"></a>
<font color='#2A6FBB' size=5 >Q1) Which country sees the huge impact because of Covid19?</font><br>  

In [None]:
yearly_trends_NE(columns = {"2020": "country",
                            "2019": "country",
                            "2018": "country"},
                 threshold_per_year = 15 , 
                 title_values = {"main":"Not Employed yearly trend <b>(Country-wise)</b>", "xaxis" : "country", "yaxis":"% of responses Not Employed"})

## Insights


- In 2020, the not-employed percentage increased in all the major countries compared to 2019, which points to a worldwide event **(COVID-19)** that resulted in large job losses in the data science and machine learning community.
- In Russia, Indonesia, and Turkey, the not-employed rate was trending downward in 2018-19 but increased sharply in 2019-20.
- India, the USA, and Spain are the most affected countries based on year-to-year changes in the rate.
- All the countries most impacted in terms of not-employed rate are also heavily impacted by COVID-19, as [reported by the World Health Organization (WHO)](https://covid19.who.int/?gclid=Cj0KCQiA3NX_BRDQARIsALA3fILEptDxsQZ0WhYMBincH-NmbFHk81OCudjm0I8AJ4qy7hlXDQjO_rMaAm13EALw_wcB).




<a id="section-three"></a>
<font color='#2A6FBB' size=5 >Q2) Which segments of kagglers are highly impacted by pandemic?</font><br> 
* I analyzed the not-employed population across different segments of Kagglers, broken down by coding experience, machine learning experience, education background, and age group.

<a id="subsection-two"></a>
<font color='#2A6FBB' size=5> Coding Experience</font><br> 

In [None]:
for country in ["ALL"] + common_countries:
    title = "Not Employed yearly trend <b> Coding Experience-({}) </b>".format(country)
    yearly_trends_NE(columns={"2020": "coding_exp_grp",
                              "2019": "coding_exp_grp",
                              "2018": "coding_exp_grp"},
                     filters={"country": [country]},
                     sort_values=True, figure_hgt=350,
                     title_values={"main": title,
                                   "xaxis": "coding exp (years)",
                                   "yaxis": "% of responses Not Employed"})


## Insights

- Experienced Kagglers (10+ years of coding experience) fared well: fewer of them were impacted by the worldwide event (COVID-19), as their year-over-year change in not-employed rate is much smaller than that of coding beginners.
- In the USA, the fraction of not-employed respondents in the experienced coding bucket decreased in 2020.


<a id="subsection-three"></a>
<font color='#2A6FBB' size=5> Age Group </font><br> 

In [None]:
for country in ["ALL"] + common_countries:
    title = "Not Employed yearly trend <b> Age Group-({}) </b>".format(country)
    # The age question is Q1 in the 2019 and 2020 surveys but Q2 in the 2018 survey.
    yearly_trends_NE(columns={"2020": "Q1",
                              "2019": "Q1",
                              "2018": "Q2"},
                     filters={"country": [country]},
                     sort_values=True, figure_hgt=350, threshold_per_year=5,
                     title_values={"main": title,
                                   "xaxis": "Age (years)",
                                   "yaxis": "% of responses Not Employed"})


## Insights

- The not-employed rate increased for all age groups.
- 25-29 and 45-49: the not-employed fraction changed sharply in 2019-2020 (mid-career employees may have been affected by layoffs).
- In the 22-24 group, the not-employed rate is the highest among all age groups (a decrease in demand for new hires because of COVID-19).


<a id="subsection-four"></a>
<font color='#2A6FBB' size=5> Education Background</font><br> 

In [None]:
for country in ["ALL"] + common_countries:
    title = "Not Employed yearly trend <b> Education Background-({}) </b>".format(country)
    yearly_trends_NE(columns={"2020": "Q4",
                              "2019": "Q4",
                              "2018": "Q4"},
                     filters={"country": [country]},
                     sort_values=True, figure_hgt=350, threshold_per_year=10,
                     title_values={"main": title,
                                   "xaxis": "Education Background",
                                   "yaxis": "% of responses Not Employed"})


## Insights

- Kagglers with no formal education background were less impacted, as they usually do not opt for corporate jobs in the first place and most of them run their own business, work at a startup, or freelance.
- Kagglers with a Bachelor's degree were affected the most, as company growth slowed down during the pandemic and companies either did not hire recent graduates or laid them off.
- In the USA, professionals with a doctoral degree were affected the most.

- Because of COVID-19, [many companies' businesses were impacted and they revoked offers](https://www.forbes.com/sites/poetsandquants/2020/03/29/how-covid-19-is-crashing-on-the-class-of-2020-job-offers-already-disappearing/?sh=5b0f93b63ffb) or [delayed the joining dates of new hires (fresh college graduates)](https://economictimes.indiatimes.com/jobs/covid-19-to-delay-job-interviews-impact-hiring-experts/articleshow/74757861.cms?from=mdr), which resulted in an increase in the not-employed rate for Kagglers who are just starting their careers in data science and machine learning.


<a id="subsection-five"></a>
<font color='#2A6FBB' size=5> ML Experience</font><br> 

In [None]:
for country in ["ALL"] + common_countries:
    title = "Not Employed yearly trend <b> Machine Learning Experience-({}) </b>".format(country)
    yearly_trends_NE(columns={"2020": "ml_exp_grp",
                              "2019": "ml_exp_grp",
                              "2018": "ml_exp_grp"},
                     filters={"country": [country]},
                     sort_values=True, figure_hgt=400, threshold_per_year=10,
                     title_values={"main": title,
                                   "xaxis": "ML Experience (years)",
                                   "yaxis": "% of responses Not Employed"})


## Insights
- Experienced ML Kagglers (4+ years of experience) fared well: their year-over-year change in not-employed rate is much smaller than that of ML beginners.

<a id="section-four"></a>
<font color='#2A6FBB' size=5 >Q3) How Data Scientist profile evolved over time and across countries?</font><br> 
- In this section, I've analyzed the data scientist profile with respect to coding experience, ML experience, and employer size.
- I've converted each bin (window) into a numeric value by taking the midpoint of the window. For the open-ended last window, I've taken its lower bound (the cap point) and added half the width of the previous window.
- Example: 7.5 for the window [5,10], and 12.5 for the window [10+] (10 + 5/2).
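
The bin-to-number conversion described above can be sketched as follows. This is an illustrative sketch, not the notebook's actual code: the helper name `bins_to_numbers` and the bin boundaries other than [5,10] and [10+] are assumptions.

```python
def bins_to_numbers(bins):
    """Map ordered (low, high) bins to numeric values.

    `high` is None for the open-ended last bin, e.g. (10, None) for "10+".
    Hypothetical helper written for this example.
    """
    values = {}
    prev_width = None
    for low, high in bins:
        if high is not None:
            # Closed bin: use the midpoint of the window.
            values[(low, high)] = (low + high) / 2
            prev_width = high - low
        else:
            if prev_width is None:
                raise ValueError("an open-ended bin needs a preceding closed bin")
            # Open-ended bin: lower bound plus half the width of the previous bin.
            values[(low, high)] = low + prev_width / 2
    return values


# Assumed experience bins in years; only [5,10] and [10+] come from the text.
experience_bins = [(0, 1), (1, 2), (2, 5), (5, 10), (10, None)]
numeric = bins_to_numbers(experience_bins)
```

With these bins, `numeric[(5, 10)]` is 7.5 and `numeric[(10, None)]` is 12.5, matching the example in the text. Averages computed from such values are only approximate, since every respondent in a bin is treated as sitting at its midpoint.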



<a id="subsection-six"></a>
<font color='#2A6FBB' size=5> Evolution of Data Scientist </font><br> 

In [None]:
job = ['Data Scientist']
title = "Evolution of <b>" + job[0] + "</b> Profile over time (Experience in ML and Coding)"
yearly_trends_job_title(countries_comp=top4_countries,figure_hgt=600,job = job ,
                        title_values = {"main":title,
                                        "xaxis" : "Avg. ML Experience (Years)", 
                                        "yaxis":"Avg. Coding Experience (Years)"})

title = "Evolution of <b>" + job[0] + "</b> Profile over time (Experience in ML and Employer size)"
yearly_trends_job_title(countries_comp=top4_countries,figure_hgt=600,job = job ,
                        x = "ml_exp_num", y = "employer_size_num",
                        years = ['2020','2019'],
                        title_values = {"main":title,
                                        "xaxis" : "Avg. ML Experience (Years)", 
                                        "yaxis":"Avg. Employer Size"})

## Insights

- In the USA, the typical data scientist appears to be highly experienced in ML and coding. Based on the 2020 survey, they average 5 years of ML experience and 10 years of coding experience, the highest among the countries compared.
- Data scientists in India have the least experience in coding and ML compared to other major countries. In short, the data scientist position in the USA is closer to a managerial role, whereas in India it is an entry-level position.
- From 2019 to 2020, India shifted to the left in the graph above, meaning many new data scientists (with minimal ML experience) joined in 2020. The USA shifted very little, indicating that its data scientist market is fairly stable with respect to coding and ML experience.
- From 2019 to 2020, Japan and Brazil shifted upward, indicating that data scientist profiles there are moving towards software/developer profiles, where more coding experience is required.

### Employer Size
- From 2019 to 2020, India shifted towards the bottom left, indicating that many small companies (startups) are hiring freshers as data scientists.
- The USA sits at the top right, meaning big companies are leveraging ML solutions, which is consistent with many tech giants having their headquarters in the USA.


<a id="section-five"></a>
<font color='#2A6FBB' size=5 >Conclusion</font><br> 
- In 2020, the not-employed percentage increased in all major countries compared to 2019, indicating a worldwide event **(COVID-19)** that resulted in large job losses in the data science and machine learning community.
- All the countries most impacted in terms of not-employed rate were also heavily impacted by COVID-19, as [reported by the World Health Organization (WHO)](https://covid19.who.int/?gclid=Cj0KCQiA3NX_BRDQARIsALA3fILEptDxsQZ0WhYMBincH-NmbFHk81OCudjm0I8AJ4qy7hlXDQjO_rMaAm13EALw_wcB).
- Experienced (coding or ML) Kagglers fared well in 2020, as a smaller fraction of them was impacted by the worldwide event (COVID-19).
- Kagglers with no formal education background were less impacted, as they usually do not opt for corporate jobs in the first place and most of them run their own business, work at a startup, or freelance.
- Kagglers with a Bachelor's degree were affected the most, as company growth slowed down during the pandemic and companies either did not hire recent graduates or laid them off.

<a id="section-six"></a>
<font color='#2A6FBB' size=5 >References </font><br> 
- [2020 Kaggle Data Science & Machine Learning Survey](https://www.kaggle.com/paultimothymooney/2020-kaggle-data-science-machine-learning-survey) by Paul Mooney
- [Who codes what and how long - a story told through a heatmap](https://www.kaggle.com/tkubacka/a-story-told-through-a-heatmap) by Teresa Kubacka
- [WHO Report - COVID 19](https://covid19.who.int/?gclid=Cj0KCQiA3NX_BRDQARIsALA3fILEptDxsQZ0WhYMBincH-NmbFHk81OCudjm0I8AJ4qy7hlXDQjO_rMaAm13EALw_wcB)