## Kaggle surveys and the animated graphs (Plotly)

Data visuals don't always have to be static! Animated graphs made with Plotly are relatively easy to implement and bring data to life.

In this notebook, I want to use a couple of basic animated graphs to explore trends in the Kaggle Data Science surveys from 2017 to 2020. Let's try to animate the following:

* [Countries - India vs USA and others](#section-one)

* [Respondents' salaries change over time](#section-two)

* [Respondents' sex change over time](#section-three)

* [Respondents' job field change over time](#section-four) 

* [Respondents' age change over time](#section-five) 

* [Visualisation modules over time](#section-six) 

* [ML modules/frameworks over time](#section-seven)

* [Conclusion](#section-eight)

Let's first import the required Python modules and the datasets.

In [None]:
#data work
import pandas as pd
import numpy as np

# visuals
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.graph_objects as go

In [None]:
# Load 2017 data
data_2017 = pd.read_csv("../input/kaggle-survey-2017/multipleChoiceResponses.csv",encoding='ISO-8859-1', low_memory=False)

# Load 2018 data
data_2018 = pd.read_csv("../input/kaggle-survey-2018/multipleChoiceResponses.csv", low_memory=False)

# Load 2019 data
data_2019 = pd.read_csv("../input/kaggle-survey-2019/multiple_choice_responses.csv", low_memory=False)

# Load 2020 data
data_2020 = pd.read_csv("../input/kaggle-survey-2020/kaggle_survey_2020_responses.csv", low_memory=False)

Plotting the animated graphs isn't super difficult, but the Kaggle Survey data is quite challenging, as it mostly contains categorical data whose format differs from year to year. So some data wrangling and consolidation needs to be done first.
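
One possible way to express this kind of consolidation is a plain lookup table instead of a chain of comparisons. The sketch below is only an illustration: `COUNTRY_ALIASES` and `normalise_country` are hypothetical names that are not used elsewhere in this notebook, and the spellings are the ones handled by the consolidation functions here.

```python
# hypothetical alias table; the spellings are the ones handled in this notebook
COUNTRY_ALIASES = {
    "United States": "USA",
    "United States of America": "USA",
    "United Kingdom of Great Britain and Northern Ireland": "UK",
    "United Kingdom": "UK",
}

def normalise_country(name):
    # unknown names (and missing values) pass through unchanged
    return COUNTRY_ALIASES.get(name, name)

print(normalise_country("United States of America"))  # USA
print(normalise_country("Brazil"))                    # Brazil
```

A dictionary keeps every alias on its own line, so adding a new spelling from a later survey is a one-line change.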

In [None]:
#function to consolidate countries across all years as they're in different formats
def consolidate_country_2017_to_2020(x):
    if x in ('United States', 'United States of America'):
        return 'USA'
    elif x in ('United Kingdom of Great Britain and Northern Ireland', 'United Kingdom'):
        return 'UK'
    elif x == "People 's Republic of China":
        return 'China'
    elif x == 'Iran, Islamic Republic of...':
        return 'Iran'
    else:
        # any other value (including missing ones) is kept as it is
        return x

In [None]:
#function to consolidate salaries across all years as they're in different formats  
def consolidate_salary_2017(x):
    try:
        # 2017 salaries are free-text numbers such as '120,000'
        x = int(x.replace(',', ''))
    except (AttributeError, ValueError):
        # missing values (not strings) or text that is not a whole number
        return 'No compensation disclosed'
    if x < 30000:
        return '0-30k'
    elif x < 50000:
        return '30-50k'
    elif x < 100000:
        return '50-100k'
    elif x < 150000:
        return '100-150k'
    elif x < 200000:
        return '150-200k'
    elif x < 250000:
        return '200-250k'
    elif x < 500000:
        return '250-500k'
    else:
        return '$500k+'

    
def consolidate_salary_2018(x):
    if ((x == '0-10,000') | (x== '10-20,000') | (x == '20-30,000')):
        return '0-30k'
    elif ((x == '30-40,000') | (x == '40-50,000')):
        return '30-50k'
    elif ((x == '50-60,000') | (x == '60-70,000') | (x == '70-80,000') | (x == '80-90,000') | (x == '90-100,000')):
        return '50-100k' 
    elif ((x == '100-125,000') | (x == '125-150,000')):
        return '100-150k'
    elif ((x == '150-200,000')):
        return '150-200k'
    elif ((x == '200-250,000')):
        return '200-250k'
    elif ((x == '250-300,000') | (x == '300-400,000') | (x == '400-500,000')):
        return '250-500k'
    elif ((x == '500,000+')):
        return '$500k+'  
    else: 
        return 'No compensation disclosed'    

def consolidate_salary_2019_2020(x):
    if ((x == '$0-999') | (x== '1,000-1,999') | (x == '2,000-2,999') | (x == '3,000-3,999') | (x == '4,000-4,999')  
        | (x == '5,000-7,499') | (x == '7,500-9,999')| (x == '10,000-14,999')  | (x == '15,000-19,999') | (x == '20,000-24,999') | (x == '25,000-29,999')):
        return '0-30k'
    elif ((x == '30,000-39,999') | (x == '40,000-49,999')):
        return '30-50k'
    elif ((x == '50,000-59,999') | (x == '60,000-69,999') | (x == '70,000-79,999') | (x == '80,000-89,999') | (x == '90,000-99,999')):
        return '50-100k' 
    elif ((x == '100,000-124,999') | (x == '125,000-149,999')):
        return '100-150k'
    elif ((x == '150,000-199,999')):
        return '150-200k'
    elif ((x == '200,000-249,999')):
        return '200-250k'
    elif ((x == '250,000-299,999') | (x == '300,000-500,000')):
        return '250-500k'
    elif ((x == '> $500,000')):
        return '$500k+'
    else: 
        return 'No compensation disclosed'
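
The threshold chain in `consolidate_salary_2017` could also be written as a sorted lookup. The following is a sketch using the standard-library `bisect` module; `salary_bucket`, `THRESHOLDS` and `LABELS` are hypothetical names that are not used elsewhere in this notebook.

```python
from bisect import bisect_right

# upper bounds (exclusive) of each bucket, in ascending order
THRESHOLDS = [30000, 50000, 100000, 150000, 200000, 250000, 500000]
LABELS = ['0-30k', '30-50k', '50-100k', '100-150k',
          '150-200k', '200-250k', '250-500k', '$500k+']

def salary_bucket(amount):
    # bisect_right counts how many thresholds are <= amount,
    # which is exactly the index of the matching label
    return LABELS[bisect_right(THRESHOLDS, amount)]

print(salary_bucket(29999))   # 0-30k
print(salary_bucket(30000))   # 30-50k
print(salary_bucket(750000))  # $500k+
```

Keeping the bounds and labels in two parallel lists makes it easier to check that the buckets match the ones used for the 2018 to 2020 answers.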
    


In [None]:
#function to consolidate sex across all years as they're in different formats 
def consolidate_sex_2017_to_2020(x):
    if ((x == 'Prefer not to say ') | (x== 'Prefer to self-describe') | (x== 'Nonbinary') | (x== 'A different identity') | (x== 'Prefer not to say') | (x== 'Non-binary, genderqueer, or gender non-conforming')):
        return 'Other'
    elif ((x == "Man")):
        return 'Male' 
    elif ((x == 'Woman')):
        return 'Female' 
    else: 
        return x   

In [None]:
#function to consolidate job titles across all years as they're in different formats
def consolidate_job_titles_2017_to_2020(x):
    if ((x == 'Data Scientist') | (x == 'Machine Learning Engineer') | (x == 'Predictive Modeler')):
        return 'Data science/ML engineers'
    
    elif ((x == 'Software Engineer') | (x == 'Software Developer/Software Engineer') | (x == 'Programmer') | (x == 'Developer Advocate') | (x == 'Computer Scientist')):
        return 'Software developers'
    
    elif ((x == 'Data Analyst') | (x == 'Business Analyst') | (x == 'Data Miner') | (x == 'Marketing Analyst') | (x == 'Data Journalist')):
        return 'Data/business analysts'
    
    elif ((x == 'DBA/Database Engineer') | (x == 'Data Engineer')):
        return 'Data engineers'
    
    elif ((x == 'Student') | (x== 'Researcher')):
        return 'Students'
    else: 
        return 'Other job fields'    

In [None]:
#function to consolidate respondents' age across all years as they're in different formats
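def consolidate_age_2017(x):
    # keep missing ages missing; otherwise they fall through every comparison and end up as '70+'
    if pd.isna(x):
        return x
    if x < 25: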
        return 'Under 25'
    elif x < 35:
        return '25-35'  
    elif x < 45:
        return '35-45'
    elif x < 60:
        return '45-60'
    elif x < 70:
        return '60-69'
    else:
        return '70+'
    

def consolidate_age_2018_to_2020(x):
    if ((x == '18-21') | (x == '22-24')):
        return 'Under 25'
    elif ((x == '25-29') | (x == '30-34')):
        return '25-35'
    elif ((x == '35-39') | (x == '40-44')):
        return '35-45'
    elif ((x == '45-49') | (x == '50-54') | (x == '55-59')):
        return '45-60'
    else: 
        return x    
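
For numeric ages such as the 2017 answers, `pd.cut` is a possible alternative to a hand-written chain of comparisons. This is only a sketch on toy values; `AGE_BINS` and `AGE_LABELS` are hypothetical names, with edges chosen to mirror the thresholds in `consolidate_age_2017`.

```python
import pandas as pd

# bin edges mirror the thresholds in consolidate_age_2017; right=False makes
# each bin include its left edge and exclude its right edge, e.g. [25, 35)
AGE_BINS = [0, 25, 35, 45, 60, 70, float("inf")]
AGE_LABELS = ["Under 25", "25-35", "35-45", "45-60", "60-69", "70+"]

ages = pd.Series([18, 25, 44, 59, 60, 82, None])  # toy values, not survey data
age_groups = pd.cut(ages, bins=AGE_BINS, labels=AGE_LABELS, right=False)

print(age_groups.tolist())
```

Missing ages stay missing with this approach, so they are not counted in any age group.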

In [None]:
#function to consolidate visual modules across all years as they're in different formats
def consolidate_vizmodules_2018_to_2020(x):
    if (x == 'MatplotlibSeaborn'):
        return 'Matplotlib & Seaborn'
    elif ((x == 'Plotly / Plotly Express') | (x == 'Plotly')):
        return 'Plotly Alone'
    elif ((x == 'MatplotlibPlotly / Plotly Express') | (x == 'MatplotlibPlotly')):
        return 'Matplotlib & Plotly'
    elif ((x == 'MatplotlibSeabornPlotly / Plotly ExpressBokeh') | (x == 'MatplotlibPlotly / Plotly ExpressBokehSeaborn') | (x == 'MatplotlibPlotlyBokehSeaborn')):
        return 'All 4 modules'
    elif (x == 'Bokeh'):
        return 'Bokeh Alone'
    elif ((x == 'MatplotlibBokeh')):
        return 'Matplotlib & Bokeh'
    elif ((x == 'Seaborn')):
        return 'Seaborn Alone'
    elif (x == 'Matplotlib'):
        return 'Matplotlib Alone' 
    elif ((x == 'Shiny')):
        return 'Shiny Alone'
    elif ((x == 'Ggplot / ggplot2') | (x == 'ggplot2')):
        return 'Ggplot Alone' 
    elif ((x == 'Ggplot / ggplot2Shiny') | (x == 'ggplot2Shiny')):
        return 'Both Modules'
    else: 
        return x    

In [None]:
#function to consolidate ML modules across all years as they're in different formats
def consolidate_mlmodules_2018_to_2020(x):
    if ((x == 'Scikit-learn') | (x == 'Scikit-Learn') | (x == 'Scikit-LearnrandomForest') | (x == 'Scikit-learnRandomForest') | (x == 'Scikit-learnCaret') | (x == 'randomForest') | (x == 'Caret')
       | (x == 'Scikit-learnRandomForestCaret') | (x == 'CaretrandomForest') | (x == 'RandomForest') | (x == 'Scikit-LearnCaret') | (x == 'RandomForestCaret') | (x == 'Scikit-LearnCaretrandomForest')):
        return 'Classic ML modules'
    
    elif ((x == 'TensorFlow') | (x == 'TensorFlowKeras') | (x == 'PyTorch') | (x == 'TensorFlowKerasPyTorch')
         | (x == 'Keras') | (x == 'TensorFlowPyTorch') | (x == 'KerasPyTorch') | (x == 'TensorFlowKerasPyTorchFast.ai') | (x == 'TensorFlowKerasPyTorchFastai')
         | (x == 'PyTorchFast.ai') | (x == 'TensorFlowKerasFastai') | (x == 'Fast.ai') | (x == 'TensorFlowKerasFast.ai') | (x == 'KerasPyTorchFast.ai')
         | (x == 'TensorFlowPyTorchFast.ai') | (x == 'Fastai') | (x == 'PyTorchFastai') | (x == 'KerasFast.ai') | (x == 'TensorFlowPyTorchFastai')
         | (x == 'TensorFlowFast.ai ') | (x == 'KerasPyTorchFastai') | (x == 'TensorFlowFastai') | (x == 'KerasFastai') | (x == 'TensorFlowFast.ai')
         | (x == 'Xgboost') | (x == 'XgboostLightGBM') | (x == 'Xgboostlightgbm') | (x == 'Xgboostlightgbmcatboost') | (x == 'XgboostLightGBMCatBoost') | (x == 'LightGBM')
         | (x == 'Xgboostcatboost') | (x == 'XgboostCatBoost') | (x == 'lightgbm') | (x == 'CatBoost') | (x == 'catboost') | (x == 'LightGBMCatBoost') | (x == 'lightgbmcatboost')
         | (x == 'TensorFlowKerasXgboost') | (x == 'TensorFlowKerasPyTorchXgboost') | (x == 'TensorFlowKerasXgboostLightGBM') | (x == 'TensorFlowXgboost') | (x == 'KerasXgboost') | (x == 'TensorFlowKerasPyTorchXgboostLightGBM') | (x == 'TensorFlowKerasXgboostlightgbm')
         | (x == 'PyTorchXgboost') | (x == 'TensorFlowKerasXgboostlightgbmcatboost') | (x == 'KerasXgboostLightGBM') | (x == 'KerasPyTorchXgboost') | (x == 'KerasPyTorchXgboostLightGBM') | (x == 'PyTorchLightGBM') | (x == 'TensorFlowKerasPyTorchFast.aiXgboostLightGBM')
         
         | (x == 'TensorFlowKerasPyTorchXgboostLightGBMCatBoost') | (x == 'PyTorchXgboostLightGBM') | (x == 'TensorFlowKerasXgboostLightGBMCatBoost') | (x == 'TensorFlowKerasPyTorchXgboostlightgbm') | (x == 'TensorFlowPyTorchXgboost') | (x == 'TensorFlowKerasPyTorchXgboostlightgbmcatboost')):
        return 'Deep Learning & Boosting modules'
    
    else: 
        return x

In [None]:
# Merge all survey data from all years into one dataframe and unify responses as much as possible
consolidated_data =  pd.DataFrame()
year_2017 =  pd.DataFrame()
year_2018 =  pd.DataFrame()
year_2019 =  pd.DataFrame()
year_2020 =  pd.DataFrame()

#'17 (Age is numeric here, so the 2017 file has no question-text row and row 0 is a real respondent)
year_2017['Salary'] = data_2017.CompensationAmount.apply(consolidate_salary_2017)
year_2017['Country'] = data_2017.Country.apply(consolidate_country_2017_to_2020)
year_2017['Age'] = data_2017.Age.apply(consolidate_age_2017)
year_2017['Job_field'] = data_2017.CurrentJobTitleSelect.apply(consolidate_job_titles_2017_to_2020)
year_2017['Sex'] = data_2017.GenderSelect.apply(consolidate_sex_2017_to_2020)
year_2017['Year'] = 2017

#'18
year_2018['Salary'] = data_2018.Q9.apply(lambda x: consolidate_salary_2018(x))
year_2018['Country'] = data_2018.Q3.drop(0).apply(lambda x: consolidate_country_2017_to_2020(x))
year_2018['Age'] = data_2018.Q2.drop(0).replace({'70-79' : '70+', '80+' : '70+'})
year_2018['Age'] = year_2018.Age.apply(lambda x: consolidate_age_2018_to_2020(x))
year_2018['Job_field'] = data_2018.Q6.drop(0).apply(lambda x: consolidate_job_titles_2017_to_2020(x))
year_2018['Sex'] = data_2018.Q1.drop(0).apply(lambda x: consolidate_sex_2017_to_2020(x))
year_2018['Visual_modules'] = (data_2018['Q21_Part_1'].fillna('').drop(0).str.strip(' ') + data_2018['Q21_Part_2'].fillna('').drop(0).str.strip(' ') + data_2018['Q21_Part_3'].fillna('').drop(0).str.strip(' ') + data_2018['Q21_Part_4'].fillna('').drop(0).str.strip(' ') 
                                  + data_2018['Q21_Part_5'].fillna('').drop(0).str.strip(' ') + data_2018['Q21_Part_6'].fillna('').drop(0).str.strip(' ') + data_2018['Q21_Part_7'].fillna('').drop(0).str.strip(' ') + data_2018['Q21_Part_8'].fillna('').drop(0).str.strip(' ') 
                                  + data_2018['Q21_Part_9'].fillna('').drop(0).str.strip(' ') + data_2018['Q21_Part_10'].fillna('').drop(0).str.strip(' ') + data_2018['Q21_Part_11'].fillna('').drop(0).str.strip(' ') + data_2018['Q21_Part_12'].fillna('').drop(0).str.strip(' '))
year_2018['Visual_modules'] = year_2018.Visual_modules.apply(lambda x: consolidate_vizmodules_2018_to_2020(x))
year_2018['ML_modules'] = (data_2018['Q19_Part_1'].fillna('').drop(0).str.strip(' ') + data_2018['Q19_Part_2'].fillna('').drop(0).str.strip(' ') + data_2018['Q19_Part_3'].fillna('').drop(0).str.strip(' ') + data_2018['Q19_Part_4'].fillna('').drop(0).str.strip(' ')
                                 + data_2018['Q19_Part_5'].fillna('').drop(0).str.strip(' ') + data_2018['Q19_Part_6'].fillna('').drop(0).str.strip(' ') + data_2018['Q19_Part_7'].fillna('').drop(0).str.strip(' ') + data_2018['Q19_Part_8'].fillna('').drop(0).str.strip(' ')
                                 + data_2018['Q19_Part_9'].fillna('').drop(0).str.strip(' ') + data_2018['Q19_Part_10'].fillna('').drop(0).str.strip(' ') + data_2018['Q19_Part_11'].fillna('').drop(0).str.strip(' ') + data_2018['Q19_Part_12'].fillna('').drop(0).str.strip(' ')
                                 + data_2018['Q19_Part_13'].fillna('').drop(0).str.strip(' ') + data_2018['Q19_Part_14'].fillna('').drop(0).str.strip(' ') + data_2018['Q19_Part_15'].fillna('').drop(0).str.strip(' ') + data_2018['Q19_Part_16'].fillna('').drop(0).str.strip(' ')
                                 + data_2018['Q19_Part_17'].fillna('').drop(0).str.strip(' ') + data_2018['Q19_Part_18'].fillna('').drop(0).str.strip(' ') + data_2018['Q19_Part_19'].fillna('').drop(0).str.strip(' '))
year_2018['ML_modules'] = year_2018.ML_modules.apply(lambda x: consolidate_mlmodules_2018_to_2020(x))
year_2018['Year'] = 2018 

#'19
year_2019['Salary'] = data_2019.Q10.apply(lambda x: consolidate_salary_2019_2020(x))
year_2019['Country'] = data_2019.Q3.drop(0).apply(lambda x: consolidate_country_2017_to_2020(x))
year_2019['Age'] = data_2019.Q1.drop(0)
year_2019['Age'] = year_2019.Age.apply(lambda x: consolidate_age_2018_to_2020(x))
year_2019['Job_field'] = data_2019.Q5.drop(0).apply(lambda x: consolidate_job_titles_2017_to_2020(x))
year_2019['Sex'] = data_2019.Q2.drop(0).apply(lambda x: consolidate_sex_2017_to_2020(x))
year_2019['Visual_modules'] = (data_2019['Q20_Part_1'].fillna('').drop(0).str.strip(' ') + data_2019['Q20_Part_2'].fillna('').drop(0).str.strip(' ') + data_2019['Q20_Part_3'].fillna('').drop(0).str.strip(' ') + data_2019['Q20_Part_4'].fillna('').drop(0).str.strip(' ') 
                                  + data_2019['Q20_Part_5'].fillna('').drop(0).str.strip(' ') + data_2019['Q20_Part_6'].fillna('').drop(0).str.strip(' ') + data_2019['Q20_Part_7'].fillna('').drop(0).str.strip(' ') + data_2019['Q20_Part_8'].fillna('').drop(0).str.strip(' ') 
                                  + data_2019['Q20_Part_9'].fillna('').drop(0).str.strip(' ') + data_2019['Q20_Part_10'].fillna('').drop(0).str.strip(' ') + data_2019['Q20_Part_11'].fillna('').drop(0).str.strip(' ') + data_2019['Q20_Part_12'].fillna('').drop(0).str.strip(' '))
year_2019['Visual_modules'] = year_2019.Visual_modules.apply(lambda x: consolidate_vizmodules_2018_to_2020(x))
year_2019['ML_modules'] = (data_2019['Q28_Part_1'].fillna('').drop(0).str.strip(' ') + data_2019['Q28_Part_2'].fillna('').drop(0).str.strip(' ') + data_2019['Q28_Part_3'].fillna('').drop(0).str.strip(' ') + data_2019['Q28_Part_4'].fillna('').drop(0).str.strip(' ')
                                 + data_2019['Q28_Part_5'].fillna('').drop(0).str.strip(' ') + data_2019['Q28_Part_6'].fillna('').drop(0).str.strip(' ') + data_2019['Q28_Part_7'].fillna('').drop(0).str.strip(' ') + data_2019['Q28_Part_8'].fillna('').drop(0).str.strip(' ')
                                 + data_2019['Q28_Part_9'].fillna('').drop(0).str.strip(' ') + data_2019['Q28_Part_10'].fillna('').drop(0).str.strip(' ') + data_2019['Q28_Part_11'].fillna('').drop(0).str.strip(' ') + data_2019['Q28_Part_12'].fillna('').drop(0).str.strip(' '))
year_2019['ML_modules'] = year_2019.ML_modules.apply(lambda x: consolidate_mlmodules_2018_to_2020(x))
year_2019['Year'] = 2019 

#'20 
year_2020['Salary'] = data_2020.Q24.apply(lambda x: consolidate_salary_2019_2020(x))
year_2020['Country'] = data_2020.Q3.drop(0).apply(lambda x: consolidate_country_2017_to_2020(x))
year_2020['Age'] = data_2020.Q1.drop(0)
year_2020['Age'] = year_2020.Age.apply(lambda x: consolidate_age_2018_to_2020(x))
year_2020['Job_field'] = data_2020.Q5.drop(0).apply(lambda x: consolidate_job_titles_2017_to_2020(x))
year_2020['Sex'] = data_2020.Q2.drop(0).apply(lambda x: consolidate_sex_2017_to_2020(x))
year_2020['Visual_modules'] = (data_2020['Q14_Part_1'].fillna('').drop(0).str.strip(' ') + data_2020['Q14_Part_2'].fillna('').drop(0).str.strip(' ') + data_2020['Q14_Part_3'].fillna('').drop(0).str.strip(' ') + data_2020['Q14_Part_4'].fillna('').drop(0).str.strip(' ')
                              + data_2020['Q14_Part_5'].fillna('').drop(0).str.strip(' ') + data_2020['Q14_Part_6'].fillna('').drop(0).str.strip(' ') + data_2020['Q14_Part_7'].fillna('').drop(0).str.strip(' ') + data_2020['Q14_Part_8'].fillna('').drop(0).str.strip(' ')
                              + data_2020['Q14_Part_9'].fillna('').drop(0).str.strip(' ') + data_2020['Q14_Part_10'].fillna('').drop(0).str.strip(' ') + data_2020['Q14_Part_11'].fillna('').drop(0).str.strip(' ') + data_2020['Q14_OTHER'].fillna('').drop(0).str.strip(' '))
year_2020['Visual_modules'] = year_2020.Visual_modules.apply(lambda x: consolidate_vizmodules_2018_to_2020(x))
year_2020['ML_modules'] = (data_2020['Q16_Part_1'].fillna('').drop(0).str.strip(' ') + data_2020['Q16_Part_2'].fillna('').drop(0).str.strip(' ') + data_2020['Q16_Part_3'].fillna('').drop(0).str.strip(' ') + data_2020['Q16_Part_4'].fillna('').drop(0).str.strip(' ')
                                 + data_2020['Q16_Part_5'].fillna('').drop(0).str.strip(' ') + data_2020['Q16_Part_6'].fillna('').drop(0).str.strip(' ') + data_2020['Q16_Part_7'].fillna('').drop(0).str.strip(' ') + data_2020['Q16_Part_8'].fillna('').drop(0).str.strip(' ') 
                                 + data_2020['Q16_Part_9'].fillna('').drop(0).str.strip(' ') + data_2020['Q16_Part_10'].fillna('').drop(0).str.strip(' ') + data_2020['Q16_Part_11'].fillna('').drop(0).str.strip(' ') + data_2020['Q16_Part_12'].fillna('').drop(0).str.strip(' ') 
                                 + data_2020['Q16_Part_13'].fillna('').drop(0).str.strip(' ') + data_2020['Q16_Part_14'].fillna('').drop(0).str.strip(' ') + data_2020['Q16_Part_15'].fillna('').drop(0).str.strip(' ') + data_2020['Q16_OTHER'].fillna('').drop(0).str.strip(' '))
year_2020['ML_modules'] = year_2020.ML_modules.apply(lambda x: consolidate_mlmodules_2018_to_2020(x))
year_2020['Year'] = 2020


#consolidating the dataframe (DataFrame.append was removed in pandas 2.0, so pd.concat is used)
consolidated_data = pd.concat([year_2017, year_2018, year_2019, year_2020], ignore_index=True)
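
The multi-select questions above are stored as one column per option, and the long `+` chains glue those columns into a single string per respondent. A more compact way to do the same thing might look like the sketch below; the `Q_Part_*` column names and the values are made up for illustration.

```python
import pandas as pd

# toy multi-select answers: one column per option, missing when not ticked
answers = pd.DataFrame({
    "Q_Part_1": [" Matplotlib ", "Matplotlib", None],
    "Q_Part_2": ["Seaborn", None, None],
    "Q_Part_3": [None, " Plotly", "Bokeh"],
})

part_cols = [c for c in answers.columns if c.startswith("Q_Part_")]

# fill unticked options with '', strip spaces, then glue each row together
combined = answers[part_cols].fillna("").apply(
    lambda row: "".join(value.strip(" ") for value in row), axis=1)

print(combined.tolist())  # ['MatplotlibSeaborn', 'MatplotlibPlotly', 'Bokeh']
```

Building the column list programmatically avoids typing each part by hand, which is where a column can easily be repeated or skipped.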

<a id="section-one"></a>
### Countries - India vs USA and others 

How has the share of respondents from different countries changed over the years? Let's first plot static bar charts to see it.

In [None]:
#making subplots for top 10 countries on total number of respondents
#value_counts() already sorts in descending order; the columns are named explicitly
#because the default names after reset_index() differ between pandas versions
def top_10_countries(df):
    return df['Country'].value_counts().head(10).rename_axis('Country').reset_index(name='Respondents')

sub_2017 = top_10_countries(year_2017)
sub_2018 = top_10_countries(year_2018)
sub_2019 = top_10_countries(year_2019)
sub_2020 = top_10_countries(year_2020)

fig = make_subplots(
    rows=2, cols=2,
    subplot_titles=("2017", "2018", "2019", "2020"))

fig.add_trace(go.Bar(x=sub_2017['Country'], y=sub_2017['Respondents']),
              row=1, col=1)

fig.add_trace(go.Bar(x=sub_2018['Country'], y=sub_2018['Respondents']),
              row=1, col=2)

fig.add_trace(go.Bar(x=sub_2019['Country'], y=sub_2019['Respondents']),
              row=2, col=1)

fig.add_trace(go.Bar(x=sub_2020['Country'], y=sub_2020['Respondents']),
              row=2, col=2)

fig.update_layout(
                  title_text="Top 10 countries 2017-20 (total number of respondents)", title_x=0.5)

fig.update_layout(showlegend=False)
fig.show()

As seen, the US dominated until 2019, when India took the lead. So instead of looking at these boring subplots, let's animate them. Just press the "Play" button below.
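
The animation needs each country's share within its own year rather than raw counts. A minimal sketch of that calculation on invented toy data, using `groupby(...).transform('sum')` to get the per-year totals:

```python
import pandas as pd

# toy respondents, invented for illustration
toy = pd.DataFrame({
    "Year":    [2019, 2019, 2019, 2019, 2020, 2020],
    "Country": ["USA", "USA", "India", "Brazil", "India", "USA"],
})

# count respondents per (Year, Country) pair
shares = toy.groupby(["Year", "Country"]).size().reset_index(name="Counts")

# divide each count by the total of its own year, not by the grand total
year_totals = shares.groupby("Year")["Counts"].transform("sum")
shares["Percentage"] = (100 * shares["Counts"] / year_totals).round(2)

print(shares)
```

Because `transform` returns a result aligned with the original rows, the percentages can be assigned directly as a new column.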

In [None]:
#making a dataframe for animation 
animate_countries = consolidated_data.groupby(['Year', 'Country']).size().reset_index()
animate_countries['Percentage'] = consolidated_data.groupby(['Year', 'Country']).size().groupby(level=0).apply(lambda x: 100 * x / float(x.sum())).values.round(2)
animate_countries.columns = ['Year', 'Country', 'Counts', 'Percentage']
#.copy() avoids a SettingWithCopyWarning when a column is added to this subset later
animate_countries = animate_countries.loc[animate_countries['Country'].isin(['USA', 'India'])].copy()

#plotting animated bars for USA vs India change over the years
fig = px.bar(
        animate_countries, x="Country", y = 'Percentage', color="Country",
        animation_frame="Year", range_y=[0,50], text=animate_countries['Percentage'].apply(lambda x: '{0:1.2f}%'.format(x)))
fig.layout.updatemenus[0].buttons[0].args[1]["frame"]["duration"] = 1200
fig.update_layout(title_text='India vs USA (2017-2020)', title_x=0.5)
fig.update_yaxes(title='Portion of Respondents (%)')
fig.update_xaxes(title='')
fig.update_traces(marker_line_color='rgb(31, 27, 25)', marker_line_width=1.2, opacity=1)
fig.update_layout(showlegend=False)

fig.show()

In [None]:
# Trend for India and USA over time
animate_countries['Years'] = animate_countries['Year'].astype(str)
fig = px.bar(animate_countries, x="Country", y = 'Percentage', color='Years', barmode='group')
fig.show()
#animate_countries.head(20)

Since 2017 there has been a strong growth trend in India's share, as can be seen in the plot above.

Now let's see how Brazil and China slightly overtook other countries over the last 4 years.

In [None]:
#making a dataframe for animation 
animate_countries = consolidated_data.groupby(['Year', 'Country']).size().reset_index()
animate_countries['Percentage'] = consolidated_data.groupby(['Year', 'Country']).size().groupby(level=0).apply(lambda x: 100 * x / float(x.sum())).values.round(2)
animate_countries.columns = ['Year', 'Country', 'Counts', 'Percentage']
animate_countries = animate_countries.loc[animate_countries['Country'].isin(['Germany', 'Brazil', 'Russia', 'UK', 'China'])]
animate_countries = animate_countries.sort_values(by=['Percentage'], ascending = False)

#plotting animated bars for other countries change over the years
fig = px.bar(
        animate_countries, x="Country", y = 'Percentage', color="Country",
        animation_frame="Year", range_y=[0,10], text=animate_countries['Percentage'].apply(lambda x: '{0:1.2f}%'.format(x)), 
    category_orders={'Year': [2017, 2018, 2019, 2020]})

fig.layout.updatemenus[0].buttons[0].args[1]["frame"]["duration"] = 1200
fig.update_layout(title_text='Rise of Brazil & China (2017-2020)', title_x=0.5)
fig.update_yaxes(title='Portion of Respondents (%)')
fig.update_xaxes(title='')
fig.update_traces(marker_line_color='rgb(61, 57, 55)', marker_line_width=1, opacity=1)
fig.update_layout(showlegend=False)
fig.show()

As seen, US Kagglers dominated the survey in 2017, and then India took the lead in 2019 and 2020. The share of Chinese respondents increased dramatically in 2018 and then fell back in 2019 and 2020, while Brazil steadily moved into 4th place.

<a id="section-two"></a>
### Respondents' salaries change over time

It is a good idea to see how the salaries reported by survey respondents changed from 2017 to 2020. Again, just press the play button to see it in action.
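
Note that the shares in this section are computed only among respondents who disclosed a salary. A small sketch with made-up answers shows how dropping the undisclosed ones changes the denominator:

```python
from collections import Counter

# made-up answers for one survey year
answers = ['0-30k', '0-30k', '50-100k', 'No compensation disclosed']

disclosed = [a for a in answers if a != 'No compensation disclosed']
counts = Counter(disclosed)

# shares are relative to the 3 disclosed answers, not to all 4 respondents
percentages = {label: round(100 * n / len(disclosed), 2) for label, n in counts.items()}
print(percentages)  # {'0-30k': 66.67, '50-100k': 33.33}
```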

In [None]:
#making a dataframe for animation 
consolidated_data_0 = consolidated_data[consolidated_data.Salary != 'No compensation disclosed']
animate_salaries = consolidated_data_0.groupby(['Year', 'Salary']).size().reset_index()
animate_salaries['Percentage'] = consolidated_data_0.groupby(['Year', 'Salary']).size().groupby(level=0).apply(lambda x: 100 * x / float(x.sum())).values.round(2)
animate_salaries.columns = ['Year', 'Salary', 'Counts', 'Percentage']
animate_salaries = animate_salaries.sort_values(by=['Percentage'], ascending = False)


#plotting animated bars for salary change over years
fig = px.bar(
        animate_salaries, x="Salary", y = 'Percentage', color="Salary", 
        animation_frame="Year", range_y=[0,100], text=animate_salaries['Percentage'].apply(lambda x: '{0:1.2f}%'.format(x)), 
    category_orders={'Year': [2017, 2018, 2019, 2020]})

fig.layout.updatemenus[0].buttons[0].args[1]["frame"]["duration"] = 1200
fig.update_layout(title_text="Respondents' salaries change (2017-2020)", title_x=0.5)
fig.update_yaxes(title='Portion of Respondents (%)')
fig.update_xaxes(title='')
fig.update_traces(marker_line_color='rgb(31, 27, 25)', marker_line_width=1, opacity=1)
fig.update_layout(showlegend=False)

fig.show()

As seen, in 2017 the largest group of respondents reported a salary in the range of USD 50-100k (27 percent). Over time, however, the USD 0-30k range became the one reported by the majority of respondents.

This is probably because respondents became more honest, or because more respondents with lower salaries took part in the survey (junior positions, people from regions with a lower cost of living, etc.).

<a id="section-three"></a>
### Respondents' sex change over time

Let's see how respondents' sex was changing over the years.

In [None]:
#making a dataframe for animation 
animate_sex = consolidated_data.groupby(['Year', 'Sex']).size().reset_index()
animate_sex['Percentage'] = consolidated_data.groupby(['Year', 'Sex']).size().groupby(level=0).apply(lambda x: 100 * x / float(x.sum())).values.round(2)
animate_sex.columns = ['Year', 'Sex', 'Counts', 'Percentage']
animate_sex = animate_sex.sort_values(by=['Percentage'], ascending = False)


#plotting animated bars for sex change over years
fig = px.bar(
        animate_sex, x="Sex", y = 'Percentage', color="Sex", 
        animation_frame="Year", range_y=[0,100], text=animate_sex['Percentage'].apply(lambda x: '{0:1.2f}%'.format(x)), category_orders={'Year': [2017, 2018, 2019, 2020]})

fig.layout.updatemenus[0].buttons[0].args[1]["frame"]["duration"] = 1200
fig.update_layout(title_text="Respondents' sex (2017-2020)", title_x=0.5)
fig.update_yaxes(title='Portion of Respondents (%)')
fig.update_xaxes(title='')
fig.update_traces(marker_line_color='rgb(31, 27, 25)', marker_line_width=1, opacity=1)
fig.update_layout(showlegend=False)

fig.show()

In [None]:
# Trend for sex over time
animate_sex['Years'] = animate_sex['Year'].astype(str)
fig = px.bar(animate_sex, x="Sex", y = 'Percentage', color='Years', barmode='group')
fig.show()


Kaggle is still dominated by males to a large extent. However, the share of female respondents surged in 2020, and hopefully this trend will continue.

<a id="section-four"></a>
### Respondents' job field change over time

Let's see how respondents' job titles were changing across the years.

In [None]:
#making a dataframe for animation 
animate_job_field = consolidated_data.groupby(['Year', 'Job_field']).size().reset_index()
animate_job_field['Percentage'] = consolidated_data.groupby(['Year', 'Job_field']).size().groupby(level=0).apply(lambda x: 100 * x / float(x.sum())).values.round(2)
animate_job_field.columns = ['Year', 'Job_field', 'Counts', 'Percentage']
animate_job_field = animate_job_field.sort_values(by=['Percentage'], ascending = False)


#plotting animated bars for job field change over years
fig = px.bar(
        animate_job_field, x="Job_field", y = 'Percentage', color="Job_field", 
        animation_frame="Year", range_y=[0,70], text=animate_job_field['Percentage'].apply(lambda x: '{0:1.2f}%'.format(x)), category_orders={'Year': [2017, 2018, 2019, 2020]})

fig.layout.updatemenus[0].buttons[0].args[1]["frame"]["duration"] = 1200
fig.update_layout(title_text='Job field change (2017-2020)', title_x=0.5)
fig.update_yaxes(title='Portion of Respondents (%)')
fig.update_xaxes(title='')
fig.update_traces(marker_line_color='rgb(31, 27, 25)', marker_line_width=1, opacity=1)
fig.update_layout(showlegend=False)

fig.show()

As seen, the number of students has increased since 2017, likely because the 2017 survey had no specific question on students. It is also interesting that direct DS/ML job titles account for only approx. 20 percent of respondents, while many more people are studying or work in other job fields such as software development.

<a id="section-five"></a>
### Respondents' age change over time

Let's see how respondents' age was changing across the years:

In [None]:
#making a dataframe for animation 
animate_age = consolidated_data.groupby(['Year', 'Age']).size().reset_index()
animate_age['Percentage'] = consolidated_data.groupby(['Year', 'Age']).size().groupby(level=0).apply(lambda x: 100 * x / float(x.sum())).values.round(2)
animate_age.columns = ['Year', 'Age', 'Counts', 'Percentage']
animate_age = animate_age.sort_values(by=['Percentage'], ascending = False)


#plotting animated bars for age change over years
fig = px.bar(
        animate_age, x="Age", y = 'Percentage', color="Age", 
        animation_frame="Year", range_y=[0,50], text=animate_age['Percentage'].apply(lambda x: '{0:1.2f}%'.format(x)), category_orders={'Year': [2017, 2018, 2019, 2020]})

fig.layout.updatemenus[0].buttons[0].args[1]["frame"]["duration"] = 1200
fig.update_layout(title_text="Respondents' age change (2017-2020)", title_x=0.5)
fig.update_yaxes(title='Portion of Respondents (%)')
fig.update_xaxes(title='')
fig.update_traces(marker_line_color='rgb(31, 27, 25)', marker_line_width=1, opacity=1)
fig.update_layout(showlegend=False)

fig.show()

In [None]:
# Trend for age over time
animate_age['Years'] = animate_age['Year'].astype(str)
fig = px.bar(animate_age, x="Age", y = 'Percentage', color='Years', barmode='group',
             category_orders={"Age": ["Under 25", "25-35", "35-45", "45-60","60-69","70+"]})
fig.show()

All we can say is that Kaggle is definitely getting younger and younger: the largest age group is shifting from 25-35 to Under 25.


<a id="section-six"></a>
### Visualisation modules over time

#### Python: Matplotlib vs Seaborn, Plotly & Bokeh

In [None]:
#making a dataframe for animation 
exclude_2017 = consolidated_data[consolidated_data.Visual_modules != '']
animate_visuals = exclude_2017.groupby(['Year', 'Visual_modules']).size().reset_index()
animate_visuals['Percentage'] = exclude_2017.groupby(['Year', 'Visual_modules']).size().groupby(level=0).apply(lambda x: 100 * x / float(x.sum())).values.round(2)
animate_visuals.columns = ['Year', 'Visual_modules', 'Counts', 'Percentage']
python_modules = ['Matplotlib Alone', 'Seaborn Alone', 'Plotly Alone', 'Bokeh Alone', 'Matplotlib & Bokeh',
                  'Matplotlib & Seaborn', 'Matplotlib & Plotly', 'All 4 modules']
animate_visuals = animate_visuals.loc[animate_visuals['Visual_modules'].isin(python_modules)]
animate_visuals = animate_visuals.sort_values(by=['Percentage'], ascending = False)

#plotting animated bars for visual modules change over years
fig = px.bar(
        animate_visuals, x="Visual_modules", y = 'Percentage', color="Visual_modules", 
        animation_frame="Year", range_y=[0,25], text=animate_visuals['Percentage'].apply(lambda x: '{0:1.2f}%'.format(x)), 
    category_orders={'Year': [2018, 2019, 2020]})

fig.layout.updatemenus[0].buttons[0].args[1]["frame"]["duration"] = 1200
fig.update_layout(title_text='Most popular Python visualisation modules use (2018-2020)', title_x=0.5)
fig.update_yaxes(title='Portion of Respondents (%)')
fig.update_xaxes(title='')
fig.update_traces(marker_line_color='rgb(31, 27, 25)', marker_line_width=1, opacity=1)

fig.update_layout(showlegend=False)

fig.show()

In 2018, Matplotlib alone was the dominant visualisation module, but by 2020 the combination of Matplotlib and Seaborn had become the number one choice.

#### R: Ggplot vs Shiny

In [None]:
#making a dataframe for animation 
exclude_2017 = consolidated_data[consolidated_data.Visual_modules != '']
animate_visuals = exclude_2017.groupby(['Year', 'Visual_modules']).size().reset_index()
animate_visuals['Percentage'] = exclude_2017.groupby(['Year', 'Visual_modules']).size().groupby(level=0).apply(lambda x: 100 * x / float(x.sum())).values.round(2)
animate_visuals.columns = ['Year', 'Visual_modules', 'Counts', 'Percentage']
animate_visuals = animate_visuals.loc[animate_visuals['Visual_modules'].isin(['Ggplot Alone', 'Shiny Alone', 'Both Modules'])]
animate_visuals = animate_visuals.sort_values(by=['Percentage'], ascending = False)

#plotting animated bars for visual modules change over years
fig = px.bar(
        animate_visuals, x="Visual_modules", y = 'Percentage', color="Visual_modules", 
        animation_frame="Year", range_y=[0,10], text=animate_visuals['Percentage'].apply(lambda x: '{0:1.2f}%'.format(x)), 
    category_orders={'Year': [2018, 2019, 2020]})

fig.layout.updatemenus[0].buttons[0].args[1]["frame"]["duration"] = 1200
fig.update_layout(title_text='R visualisation modules use (2018-2020)', title_x=0.5)
fig.update_yaxes(title="Portion of Respondents (%)")
fig.update_xaxes(title='')
fig.update_traces(marker_line_color='rgb(31, 27, 25)', marker_line_width=1, opacity=1)

fig.update_layout(showlegend=False)

fig.show()

So the dominant visualisation modules are Seaborn with Matplotlib for Python and Ggplot for R. Plotly isn't getting much attention yet, but hopefully that will change.

<a id="section-seven"></a>
### ML modules/frameworks over time

I decided to generalize and compare the use of classic ML modules with that of deep learning and boosting modules. The groups combine modules as follows:

- Classic ML = Scikit-learn (Python) + Caret (R)
- Deep Learning = TensorFlow + Keras + Pytorch + Fast.ai (Python)
- Boosting models = Catboost + Lightgbm + Xgboost
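
To make the grouping concrete, here is one way a respondent's selected modules could be mapped to a group label. This is only a sketch under my own assumptions: the function name `label_ml_modules`, the "Mixed" label, and the exact rule are hypothetical, and the labelling actually used for the `ML_modules` column is defined earlier in the notebook and may differ.

```python
# Module groups as listed above (spellings follow the list)
CLASSIC = {"Scikit-learn", "Caret"}
DEEP_LEARNING = {"TensorFlow", "Keras", "Pytorch", "Fast.ai"}
BOOSTING = {"Catboost", "Lightgbm", "Xgboost"}


def label_ml_modules(selected):
    """Map a respondent's selected modules to a group label (hypothetical rule)."""
    selected = set(selected)
    uses_classic = bool(selected & CLASSIC)
    uses_modern = bool(selected & (DEEP_LEARNING | BOOSTING))
    if uses_classic and uses_modern:
        return "Mixed"  # assumed label, not taken from the notebook
    if uses_classic:
        return "Classic ML modules"
    if uses_modern:
        return "Deep Learning & Boosting modules"
    return ""  # nothing selected; the cells filter out empty labels


print(label_ml_modules(["Scikit-learn", "Caret"]))
print(label_ml_modules(["Keras", "Xgboost"]))
```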

In [None]:
#making a dataframe for animation 
exclude_2017 = consolidated_data[consolidated_data.ML_modules != '']
animate_mlmods = exclude_2017.groupby(['Year', 'ML_modules']).size().reset_index()
animate_mlmods['Percentage'] = exclude_2017.groupby(['Year', 'ML_modules']).size().groupby(level=0).apply(lambda x: 100 * x / float(x.sum())).values.round(2)
animate_mlmods.columns = ['Year', 'ML_modules', 'Counts', 'Percentage']
animate_mlmods = animate_mlmods.sort_values(by=['Percentage'], ascending = False)
animate_mlmods = animate_mlmods.loc[animate_mlmods['ML_modules'].isin(['Classic ML modules', 'Deep Learning & Boosting modules'])]

#plotting animated bars for ML modules change over years
fig = px.bar(
        animate_mlmods, x="ML_modules", y = 'Percentage', color="ML_modules", 
        animation_frame="Year", range_y=[0,20], text=animate_mlmods['Percentage'].apply(lambda x: '{0:1.2f}%'.format(x)), 
    category_orders={'Year': [2018, 2019, 2020]})

fig.layout.updatemenus[0].buttons[0].args[1]["frame"]["duration"] = 1200
fig.update_layout(title_text='ML frameworks change over time (2018-2020)', title_x=0.5)
fig.update_yaxes(title='Portion of Respondents (%)')
fig.update_xaxes(title='')
fig.update_traces(marker_line_color='rgb(31, 27, 25)', marker_line_width=1, opacity=1)

fig.update_layout(showlegend=False)

fig.show()

As seen, classic ML modules still dominate, but the use of deep learning & boosting modules rose towards 2020 and will probably continue to do so.

<a id="section-eight"></a>
### Conclusion

As we've seen, animated plots are quite easy to implement and can sometimes be quite helpful. Specifically, the graphs in this notebook have shown that:

- India overtook the USA in the last 2 years, with a strong growth trend in India.

- Reported salaries of respondents became lower and are mostly in the range of USD 0-30k.

- Kaggle is still dominated by males, but there is "A New Hope" in 2020 as the share of females increased.

- Actually, only around 20% of Kaggle survey respondents work directly in the DS/ML field, but there is a surge in the learning field.

- Kaggle is becoming younger and younger. That's probably good. 

- Matplotlib and Seaborn, along with classic ML modules, are still the 'go-to' modules.

More info on the animated plots can be found [here](https://plotly.com/python/animations/)
