# Kaggle 2021 community bias
<div style="font-family:Helvetica Neue; font-size:20px; line-height:1.7; color:slategray;">
This is the second analytics challenge I have participated in on Kaggle. The first was <a href="https://www.kaggle.com/djvivekj/solving-complex-issues-with-simple-vizualizations">Solving complex issues with simple vizualizations</a>, which did not impress many.
<br>
As the competition requires, I have tried to tell a story about a subgroup within the Kaggle community. I have also used data from previous surveys. Because the questionnaire has changed frequently, I could use the earlier data only for the questions common to most years. Using line graphs and horizontal bar graphs, I show how the community has changed over the previous years. Let's find out more about our community.
</div>


In [None]:
import numpy as np 
import pandas as pd
import os
import seaborn as sns
import plotly.express as px
from plotly.offline import init_notebook_mode
import plotly.graph_objects as go
init_notebook_mode(connected=True)
pd.set_option('display.max_columns', 5000)
import warnings
warnings.filterwarnings("ignore")
if not os.path.exists('individual_charts'):
  os.mkdir('individual_charts')
import matplotlib.gridspec as gridspec
from IPython.display import clear_output
import matplotlib.pyplot as plt

In [None]:
def count_then_return_percent(dataframe,column_name,count=None):
    '''
    A function to return value counts as percentages. 
    '''
    if count:
        counts = dataframe[column_name].value_counts(dropna=False)[0:count]
    else:
        counts = dataframe[column_name].value_counts(dropna=False)
    percentages = round(counts*100/(dataframe[column_name].count()),1).sort_values(ascending=True)
    return percentages

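def top_count(dataframe,column,count):
    '''
    A function to return the rows of a dataframe whose value in the given column
    is among the `count` most frequent values of that column.
    '''
    # Avoid shadowing the built-in `list`.
    top_values = dataframe[column].value_counts().sort_values(ascending=False).keys()[0:count]
    df = dataframe[dataframe[column].isin(top_values)]
    return df 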

def create_dataframe_of_agr(dataframe,column,new_column,count_column,agr,return_percentages=False):
    '''
    A function to create a series by aggregating over a column. The resulting series is named 
    new_column. If count_column is None the grouped column itself is aggregated, else count_column
    is aggregated. Note that new_column is also added to the input dataframe.
    '''
    if count_column is None:
        dataframe[new_column] = dataframe[column]
    else:
        dataframe[new_column] = dataframe[count_column]

    dataframe = dataframe.groupby(column).agg(agr)[new_column]
    if return_percentages:
        # The aggregated result is a Series, so scale it directly to percentages.
        dataframe = (dataframe*100)/(dataframe.sum())
    return dataframe

class plot_data():    
    '''
      A class to carry out an aggregation operation and plot the resulting data.
      data: A list of lists of the format [dataframe,column,new_column,count_column], where dataframe is the one over which
        the aggregation is carried out, column is the column to group by, new_column is the name of the resulting column,
        and count_column is the column to aggregate over (if None, aggregation is carried out over column)
      title: title of the first graph, showing count
      x_label: graph xlabel
      y_label: graph ylabel
      agr_list: list of columns to count, used by plot_multi_column
      ax: nested list or array of plot axes, indexed as ax[row][col]
      fsize: figure size
      title_c: title of second graph, showing % change
      count: int, number of top rows to keep after sorting by the aggregated value (False keeps all rows)
      agr: mathematical operation to aggregate, eg 'count', 'mean'.
      cat: bool, whether both columns are categorical or not
      return_percentages: bool, whether to return aggregation as percentage or not.
    '''

    def __init__(self,data,title=None,x_label=None,y_label=None,agr_list=None,ax=None,fsize=None,title_c=False,count=False,cat=False,agr='count',return_percentages=False):
        self.data = data
        self.agr = agr
        self.cat = cat
        self.return_percentages = return_percentages
        self.title = title
        self.x_label = x_label
        self.y_label = y_label
        self.title_c = title_c
        self.ax = ax
        self.fsize = fsize
        self.count = count
        self.agr_list = agr_list
        
    def create_dataframe_of_agr(self,dataframe,column,new_column,count_column=None):
        '''
        A function to create a dataframe by aggregating over a column. The result is named 
        new_column. If count_column is None the grouped column itself is aggregated, else
        count_column is aggregated.
        '''
        if count_column is None:
            dataframe[new_column] = dataframe[column]
        else:
            dataframe[new_column] = dataframe[count_column]

        if self.cat:
            dataframe = dataframe.groupby(column).agg({new_column: self.agr}).unstack(fill_value=0)[new_column]
        else:
            dataframe = dataframe.groupby(column).agg(self.agr)[new_column]

        if self.return_percentages:
            # Scale the aggregated result itself (column-wise for a DataFrame) to percentages.
            dataframe = (dataframe*100)/(dataframe.sum())

        return dataframe
   
    def plot_column_cat_multi(self):
        '''
         A function to plot data for multiple years, when both columns are categorical,
         eg age-duration or industry-experience.
        '''
    
        dataframe = self.data[0][0]
        column = self.data[0][1]
        new_column = self.data[0][2]
        count_column = self.data[0][3]

        df_all = pd.DataFrame(self.create_dataframe_of_agr(dataframe,column,new_column,count_column))
        x = self.ax[0]
        x[0].set_xlabel(self.x_label, fontsize=15)
        x[0].set_ylabel(self.y_label, fontsize=15)                    
        x[0].set_title(self.title, fontsize=20, fontweight='bold')

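        # Use .iloc for positional access and a distinct loop variable so the axis `x` is not confused with it.
        df_all = pd.DataFrame([(c,df_all[c].sort_values().keys()[-1],df_all[c].sort_values().iloc[-1]) for c in df_all.keys()])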
        df_all.rename(columns={1:column,0:self.x_label+'_'+new_column,2:new_column},inplace=True)
        df_all.index = df_all[self.x_label+'_'+new_column]
        df_all.index.name=count_column
        df_all.drop(columns={self.x_label+'_'+new_column},inplace=True)
      
        for i, v in enumerate(df_all[new_column]):
            x[0].text(v + 3, i + .15, str(df_all.iloc[i][column]), fontweight='bold',color='blue')
        df_all[new_column].plot(kind='barh',figsize=self.fsize, ax = x[0],ylabel=self.y_label,legend=True,color='blue',width=0.5)

        for d in self.data[1::]:
            df = d[0]
            column = d[1]
            new_column = d[2]
            count_column=d[3]
      
            df_ = pd.DataFrame(self.create_dataframe_of_agr(df,column,new_column,count_column))
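            # Use .iloc for positional access and a distinct loop variable so the axis `x` is not confused with it.
            df_ = pd.DataFrame([(c,df_[c].sort_values().keys()[-1],df_[c].sort_values().iloc[-1]) for c in df_.keys()])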
            df_.rename(columns={1:column,0:self.x_label+'_'+new_column,2:new_column},inplace=True)
            df_.index = df_[self.x_label+'_'+new_column]
            df_.index.name=self.data[0][3]
            df_.drop(columns={self.x_label+'_'+new_column},inplace=True)
            for i, v in enumerate(df_[new_column]):
                x[0].text(v + 3, i - 0.15, str(df_.iloc[i][column]), fontweight='bold',color='red')
            df_[new_column].plot(kind='barh',figsize=self.fsize, ax = x[0],ylabel=self.y_label,legend=True,color='red',width=0.25)

    def plot_column_cat(self):
            '''
            A function to plot data for just year 2021 when both columns are categorical,
            eg age-duration or industry-experience.
            '''
            dataframe = self.data[0][0]
            column = self.data[0][1]
            new_column = self.data[0][2]
            count_column = self.data[0][3]
            df_all = pd.DataFrame(self.create_dataframe_of_agr(dataframe,column,new_column,count_column))
            self.ax[0][0].set_xlabel(self.x_label, fontsize=15)
            self.ax[0][0].set_ylabel(self.y_label, fontsize=15)                    
            self.ax[0][0].set_title(self.title, fontsize=20, fontweight='bold')
            # Use .iloc for positional access to the largest value in each column.
            df_all = pd.DataFrame([(c,df_all[c].sort_values().keys()[-1],df_all[c].sort_values().iloc[-1]) for c in df_all.keys()])
            df_all.rename(columns={1:column,0:self.x_label+'_'+new_column,2:new_column},inplace=True)
            df_all.index = df_all[self.x_label+'_'+new_column]
            df_all.index.name=self.data[0][3]
            df_all.drop(columns={self.x_label+'_'+new_column},inplace=True)
      
            for i, v in enumerate(df_all[new_column]):
              self.ax[0][0].text(v + 3, i + .25, str(df_all.iloc[i][column]), fontweight='bold',color='blue')
            df_all[new_column].plot(kind='barh',figsize=self.fsize, ax = self.ax[0][0],ylabel=self.y_label,legend=True,color='blue',width=0.5)

    def plot_column(self):
            '''
            A function to plot data for multiple years, when one column is categorical and the other numerical, 
            eg age-duration or industry-experience.
            '''
            dataframe = self.data[0][0]
            column = self.data[0][1]
            new_column = self.data[0][2]
            count_column = self.data[0][3]

            df_all = pd.DataFrame(self.create_dataframe_of_agr(dataframe,column,new_column,count_column))

            if self.count:
                df_all = df_all.sort_values(by=new_column,ascending=False)[0:self.count]
    
            for x in self.ax:
                for d in self.data[1::]:
                    df = d[0]
                    column = d[1]
                    new_column = d[2]
                    count_column = d[3]
                    df_ = pd.DataFrame(self.create_dataframe_of_agr(df,column,new_column,count_column))
                    try:
                        df_all = pd.merge(df_all,df_,on=column,how='left')
                    except KeyError:
                        df_all = pd.merge(df_all,df_,left_on=df_all.index, right_on=column,how='left')
                        df_all.index = df_all[column]

                x[0].set_xlabel(self.x_label, fontsize=15)
                x[0].set_ylabel(self.y_label, fontsize=15)                    
                x[0].set_title(self.title, fontsize=20, fontweight='bold')

                df_all.plot(ax=x[0],figsize=self.fsize,xlabel=self.x_label)
    
                if self.title_c:
                    if '2020' in df_all.columns:
                        df_all['% change from 2020-2021'] = 100*(df_all['2021']-df_all['2020'])/df_all['2020']
                    if '2019' in df_all.columns:
                        df_all['% change from 2019-2020'] = 100*(df_all['2020']-df_all['2019'])/df_all['2019']
                    if '2018' in df_all.columns:
                        df_all['% change from 2018-2019'] = 100*(df_all['2019']-df_all['2018'])/df_all['2018']
                    if '2017' in df_all.columns:
                        # Percent change is relative to the earlier year (2017), as in the branches above.
                        df_all['% change from 2017-2018'] = 100*(df_all['2018']-df_all['2017'])/df_all['2017']
        
                    x[1].set_xlabel(self.x_label, fontsize=15)
                    x[1].set_ylabel('percentage', fontsize=15)                    
                    x[1].set_title(self.title_c, fontsize=20, fontweight='bold')
                    df_all[[x for x in df_all.columns if '%' in x]].plot(figsize=self.fsize, 
                                                            ax=x[1],
                                                            xlabel=self.x_label,
                                                            )
                    
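    def multi_column_count(self,df,column,agr_list,count_column,new_column=None):
        '''
        A function that, for each value of count_column, finds the four columns in agr_list with the most
        non-null responses and returns their names (as column) and the mean of their counts (as new_column).
        '''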
        d = df.groupby(count_column).agg('count')[agr_list].T
        d = pd.DataFrame([(x,d[x].sort_values(ascending=False).keys()[0:4].values,d[x].sort_values(ascending=False)[0:4].values) for x in d.keys()])
        d[2] = [x.mean() for x in d[2]]

        d.rename(columns={1:column,0:count_column,2:new_column},inplace=True)
        d.index = d[count_column]
        d.index.name=count_column
        d.drop(columns={count_column},inplace=True)
        return d

    def plot_multi_column(self):
        '''
        A function to plot data for multiple years, when the columns are a list of preference,
         eg age and list of languages used on regular basis.
        '''
        dataframe = self.data[0][0]
        column = self.data[0][1]
        new_column = self.data[0][2]
        count_column = self.data[0][3]
        agr_list = self.agr_list
        df_all = self.multi_column_count(dataframe,column,agr_list,count_column,new_column)
        x = self.ax[0]

        x[0].set_xlabel(self.x_label, fontsize=15)
        x[0].set_ylabel(self.y_label, fontsize=15)                    
        x[0].set_title(self.title, fontsize=20, fontweight='bold')

        for i, v in enumerate(df_all[new_column]):
            x[0].text(v + 3, i + .15, str(df_all.iloc[i][column]), fontweight='bold',color='blue')
        df_all[new_column].plot(kind='barh',figsize=self.fsize, ax = x[0],ylabel=self.y_label,legend=True,color='blue',width=0.5)

        for d in self.data[1::]:
            df = d[0]
            column = d[1]
            new_column = d[2]
            count_column=d[3]
      
            df_ = self.multi_column_count(df,column,agr_list,count_column,new_column)

            for i, v in enumerate(df_[new_column]):
                  x[0].text(v + 3, i - 0.15, str(df_.iloc[i][column]), fontweight='bold',color='red')
            df_[new_column].plot(kind='barh',figsize=self.fsize, ax = x[0],ylabel=self.y_label,legend=True,color='red',width=0.25)  
            

def plotly_choropleth_map(df, column, title, max_value):
    '''
    This function creates a choropleth map.
    '''
    df = pd.DataFrame(df)
    df['country']=df.index
    fig = px.choropleth(df, 
                    locations = 'country',  
                    color = column,
                    locationmode = 'country names', 
                    color_continuous_scale = 'viridis',
                    title = title,
                    range_color = [0, max_value])
    fig.update(layout=dict(title=dict(x=0.5)))
    fig.show()
    return df    

def plotly_bar_chart(response_counts,title,y_axis_title,orientation):
    '''
    This function creates a bar chart.
    '''
    response_counts_series = pd.Series(response_counts)
    pd.DataFrame(response_counts_series).to_csv('./individual_charts/chart_'+title+'.csv',index=True)
    fig = px.bar(response_counts_series,
             labels={"index": '',"value": y_axis_title},
             text=response_counts_series.values,
             orientation=orientation,)
    fig.update_layout(showlegend=False,
                      title={'text': title+' in 2021',
                             'y':0.95,
                             'x':0.5,})
    fig.show()
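
The year-over-year percentage change that plot_column draws in its second graph can be illustrated on toy data. This is only a minimal sketch: the counts and the country labels "A" and "B" are made up, and the loop is a generic alternative to writing one branch per year, not the notebook's own code. Each change is divided by the earlier year's value.

```python
import pandas as pd

# Hypothetical respondent counts per country and year (not real survey data).
counts = pd.DataFrame(
    {"2019": [100, 50], "2020": [120, 40], "2021": [150, 60]},
    index=["A", "B"],
)

years = sorted(counts.columns)
change = pd.DataFrame(index=counts.index)
for prev, curr in zip(years, years[1:]):
    # Percent change is measured relative to the earlier year.
    change[f"% change from {prev}-{curr}"] = (
        100 * (counts[curr] - counts[prev]) / counts[prev]
    )
print(change)
```

Looping over consecutive pairs of years keeps the denominator consistent and skips any year that is missing from the columns.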

In [None]:
df_21 = pd.read_csv('../input/kaggle-survey-2021/kaggle_survey_2021_responses.csv')
df_20 = pd.read_csv('../input/kaggle-survey-2020/kaggle_survey_2020_responses.csv')
df_19 = pd.read_csv('../input/kaggle-survey-2019/multiple_choice_responses.csv')
df_18 = pd.read_csv('../input/kaggle-survey-2018/multipleChoiceResponses.csv')
df_17 = pd.read_csv('../input/kaggle-survey-2017/multipleChoiceResponses.csv',low_memory=False,encoding='ISO-8859-1')

def col_ren(df):
    df.rename(columns={'Time from Start to Finish (seconds)':'duration',
                   'Q1':'age', 'Q2':'gender', 'Q3':'country', 'Q4':'education', 'Q5':'title', 'Q6':'programming_exp', 
                   'Q7_Part_1':'python', 'Q7_Part_2':'r', 'Q7_Part_3':'sql', 'Q7_Part_4':'c', 'Q7_Part_5':'c++', 'Q7_Part_6':'java', 
                   'Q7_Part_7':'javascript', 'Q7_Part_8':'julia', 'Q7_Part_9':'swift', 'Q7_Part_10':'bash', 'Q7_Part_11':'matlab', 
                   'Q7_Part_12':'none_7', 'Q7_OTHER':'other_7', 
                   'Q8':'first_language', 
                   'Q9_Part_1':'jupyter', 'Q9_Part_2':'rstudio', 'Q9_Part_3':'vs', 'Q9_Part_4':'vscode', 'Q9_Part_5':'pycharm', 
                   'Q9_Part_6':'spyder', 'Q9_Part_7':'notepad++', 'Q9_Part_8':'sublime_text', 'Q9_Part_9':'vim/emacs', 'Q9_Part_10':'matlab_', 
                   'Q9_Part_11':'jupyter_n', 'Q9_Part_12':'none_9', 'Q9_OTHER':'other_9',
                   'Q10_Part_1':'kaggle','Q10_Part_2':'colab','Q10_Part_3':'azure','Q10_Part_4':'paperspace_gradient',
                   'Q10_Part_5':'binder_jupyterhub','Q10_Part_6':'code_ocean','Q10_Part_7':'ibm_watson_studio',
                   'Q10_Part_8':'sagemaker','Q10_Part_9':'emr','Q10_Part_10':'gcn','Q10_Part_11':'gcd','Q10_Part_12':'databricks',
                   'Q10_Part_13':'zeppelin','Q10_Part_14':'deepnote','Q10_Part_15':'observable','Q10_Part_16':'none_10','Q10_OTHER':'other_10',
                   'Q11':'prefered_computing', 
                   'Q12_Part_1':'gpu','Q12_Part_2':'tpu','Q12_Part_3':'trainium','Q12_Part_4':'inferentia','Q12_Part_5':'none_12',
                   'Q12_OTHER':'other_12',
                   'Q13':'tpu_usage',
                   'Q14_Part_1':'matplotlib','Q14_Part_2':'seaborn','Q14_Part_3':'plotly','Q14_Part_4':'ggplot','Q14_Part_5':'shiny'
                   ,'Q14_Part_6':'d3js','Q14_Part_7':'altair','Q14_Part_8':'bokeh','Q14_Part_9':'geoplotlib'
                   ,'Q14_Part_10':'leaflet_folium','Q14_Part_11':'none_14','Q14_OTHER':'other_14'
                   ,'Q15':'machine_exp'
                   ,'Q16_Part_1':'scikit_learn','Q16_Part_2':'tensorflow','Q16_Part_3':'keras','Q16_Part_4':'pytorch','Q16_Part_5':'fastai'
                   ,'Q16_Part_6':'mxnet','Q16_Part_7':'xgboost','Q16_Part_8':'lightgbm','Q16_Part_9':'catboost','Q16_Part_10':'prophet'
                   ,'Q16_Part_11':'h2o3','Q16_Part_12':'caret','Q16_Part_13':'tidymodels','Q16_Part_14':'jax','Q16_Part_15':'pytorch_lightning'
                   ,'Q16_Part_16':'huggingface','Q16_Part_17':'none_16','Q16_OTHER':'other_16'
                   ,'Q17_Part_1':'regression','Q17_Part_2':'decision_random','Q17_Part_3':'gradient_boosting','Q17_Part_4':'bayesian'
                   ,'Q17_Part_5':'evolution','Q17_Part_6':'dnn','Q17_Part_7':'cnn','Q17_Part_8':'gan','Q17_Part_9':'rnn'
                   ,'Q17_Part_10':'transformer','Q17_Part_11':'none_17','Q17_OTHER':'other_17'
                   ,'Q18_Part_1':'iv_tools','Q18_Part_2':'image_segmentation','Q18_Part_3':'object_detection','Q18_Part_4':'image_classification',
                   'Q18_Part_5':'generative','Q18_Part_6':'none_18','Q18_OTHER':'others_18'
                   ,'Q19_Part_1':'word_embedding','Q19_Part_2':'encoder_decoder','Q19_Part_3':'contextualized_embedding'
                   ,'Q19_Part_4':'transformer_','Q19_OTHER':'others_19','Q19_Part_5':'none_19',
                   'Q20':'industry', 'Q21':'company_size', 'Q22':'data_science_size', 'Q23':'ml_relevance',
                   'Q24_Part_1':'analysis', 'Q24_Part_2':'data_infrastructure', 'Q24_Part_3':'prototypes', 'Q24_Part_4':'product_improvement', 
                   'Q24_Part_5':'model_improvement', 'Q24_Part_6':'R&D', 'Q24_Part_7':'none_24','Q24_OTHER':'others_24',
                   'Q25':'current_yr_compensation', 'Q26':'company_past_5_yr_expense',
                   'Q27_A_Part_1':'aws','Q27_A_Part_2':'azure_','Q27_A_Part_3':'gcp','Q27_A_Part_4':'ibm_red_hat','Q27_A_Part_5':'oracle','Q27_A_Part_6':'sap',
                   'Q27_A_Part_7':'salesforce','Q27_A_Part_8':'vmware','Q27_A_Part_9':'alibaba','Q27_A_Part_10':'tencent','Q27_A_Part_11':'none_27',
                   'Q27_A_OTHER':'othes_27',
                   'Q28':'best_developer_exp', 
                   'Q29_A_Part_1':'aws_ec2','Q29_A_Part_2':'ms_azure_vm','Q29_A_Part_3':'google_cce','Q29_A_Part_4':'none_29','Q29_A_OTHER':'others_29',
                   'Q30_A_Part_1':'ms_azure_dls', 'Q30_A_Part_2':'ms_azure_ds', 'Q30_A_Part_3':'aws_s3', 'Q30_A_Part_4':'aws_efs', 'Q30_A_Part_5':'gcs', 
                   'Q30_A_Part_6':'gcf', 'Q30_A_Part_7':'none_30', 'Q30_A_OTHER':'other_30',
                   'Q31_A_Part_1':'aws_sagemaker','Q31_A_Part_2':'aws_mls','Q31_A_Part_3':'google_vi','Q31_A_Part_4':'datarobot','Q31_A_Part_5':'databricks_',
                   'Q31_A_Part_6':'dataiku','Q31_A_Part_7':'alterys','Q31_A_Part_8':'rapidminer','Q31_A_Part_9':'none_31','Q31_A_OTHER':'othes_31',
                   'Q32_A_Part_1':'mysql','Q32_A_Part_2':'postgresql','Q32_A_Part_3':'sqlite','Q32_A_Part_4':'oracle_db','Q32_A_Part_5':'mongodb',
                   'Q32_A_Part_6':'snowflake','Q32_A_Part_7':'ibmdb2','Q32_A_Part_8':'ms_sqlserver','Q32_A_Part_9':'ms_azure_sql',
                   'Q32_A_Part_10':'ms_sqldatabase','Q32_A_Part_11':'ms_cosmosdb','Q32_A_Part_12':'aws_redshift','Q32_A_Part_13':'aws_aurora',
                   'Q32_A_Part_14':'aws_rds','Q32_A_Part_15':'aws_dynamodb','Q32_A_Part_16':'google_bigquery','Q32_A_Part_17':'google_sql',
                   'Q32_A_Part_18':'google_firestore','Q32_A_Part_19':'google_bigtable', 'Q32_A_Part_20':'google_spanner','Q32_A_Part_21':'none_32',
                   'Q32_A_OTHER':'other_32',
                   'Q33':'bdp_most_often', 
                   'Q34_A_Part_1':'aws_quicksight','Q34_A_Part_2':'ms_powerbi','Q34_A_Part_3':'google_ds','Q34_A_Part_4':'looker','Q34_A_Part_5':'tableau',
                   'Q34_A_Part_6':'salesforce(bi)','Q34_A_Part_7':'tableau_crm','Q34_A_Part_8':'Qlik',
                   'Q34_A_Part_9':'domo','Q34_A_Part_10':'tibco_spotfire','Q34_A_Part_11':'alteryx','Q34_A_Part_12':'sisence','Q34_A_Part_13':'sap_analytics',
                   'Q34_A_Part_14':'ms_synapse','Q34_A_Part_15':'thoughtspot','Q34_A_Part_16':'none_34','Q34_A_OTHER':'other_34',
                   'Q35':'bi_most_often',
                   'Q36_A_Part_1':'data_augumentation', 'Q36_A_Part_2':'feature_engineering', 'Q36_A_Part_3':'model_selection', 
                   'Q36_A_Part_4':'model_architecture_searches', 'Q36_A_Part_5':'hyperparameter_tuning', 'Q36_A_Part_6':'auto_ml_pipeline', 
                   'Q36_A_Part_7':'none_36', 'Q36_A_OTHER':'other_36',
                   'Q37_A_Part_1':'google_automl','Q37_A_Part_2':'h2o_driverless_ai','Q37_A_Part_3':'databricks_automl','Q37_A_Part_4':'datarobot_automl',
                   'Q37_A_Part_5':'aws_sagemaker_autopilot','Q37_A_Part_6':'azure_aml','Q37_A_Part_7':'none_37','Q37_A_OTHER':'other_37',
                   'Q38_A_Part_1':'neptune_ai','Q38_A_Part_2':'weights_biases','Q38_A_Part_3':'comet_ml','Q38_A_Part_4':'sacred_omniboard',
                   'Q38_A_Part_5':'tensorboard','Q38_A_Part_6':'guild_ai','Q38_A_Part_7':'polyaxon','Q38_A_Part_8':'clearml',
                   'Q38_A_Part_9':'domino_model_monitor','Q38_A_Part_10':'mlflow','Q38_A_Part_11':'none_38','Q38_A_OTHER':'other_38',
                   'Q39_Part_1':'plotly_','Q39_Part_2':'streamlit','Q39_Part_3':'nbviewer','Q39_Part_4':'github','Q39_Part_5':'personal_blog',
                   'Q39_Part_6':'kaggle_','Q39_Part_7':'colab_','Q39_Part_8':'shiny_','Q39_Part_9':'none_39','Q39_OTHER':'other_39',
                   'Q40_Part_1':'coursera','Q40_Part_2':'edx','Q40_Part_3':'kaggle_learn','Q40_Part_4':'datacamp','Q40_Part_5':'fast_ai',
                   'Q40_Part_6':'udacity','Q40_Part_7':'udemy','Q40_Part_8':'linkedin_learning','Q40_Part_9':'cloud_certifications',
                   'Q40_Part_10':'university','Q40_Part_11':'none_40','Q40_OTHER':'other_40',
                   'Q41':'primary_work_tool',
                   'Q42_Part_1':'twitter','Q42_Part_2':'email_newsletters','Q42_Part_3':'reddit','Q42_Part_4':'kaggle_forums','Q42_Part_5':'course_forums',
                   'Q42_Part_6':'youtube','Q42_Part_7':'podcasts','Q42_Part_8':'blogs','Q42_Part_9':'journals','Q42_Part_10':'slack_communities',
                   'Q42_Part_11':'none_42','Q42_OTHER':'others_42',
                   'Q27_B_Part_1':'aws_2','Q27_B_Part_2':'ms_azure_2','Q27_B_Part_3':'gcp_2','Q27_B_Part_4':'ibm_redhat_5','Q27_B_Part_5':'oracle_2',
                   'Q27_B_Part_6':'sap_2',
                   'Q27_B_Part_7':'vmware_2','Q27_B_Part_8':'salesforce_2','Q27_B_Part_9':'alibaba_2','Q27_B_Part_10':'tencent_2','Q27_B_Part_11':'none_27b',
                   'Q27_B_OTHER':'other_27b',
                   'Q29_B_Part_1':'ec2_2','Q29_B_Part_2':'ms_azure_vm_2','Q29_B_Part_3':'gcc_2','Q29_B_Part_4':'none_29b','Q29_B_OTHER':'other_29b',
                   'Q30_B_Part_1':'ms_azure_dls_2','Q30_B_Part_2':'ms_azure_ds_2','Q30_B_Part_3':'aws_s3_2','Q30_B_Part_4':'aws_efs_2','Q30_B_Part_5':'gcs_2',
                   'Q30_B_Part_6':'google_firestore_2','Q30_B_Part_7':'none_30b','Q30_B_OTHER':'other_30b',
                   'Q31_B_Part_1':'aws_sagemaker_2','Q31_B_Part_2':'azure_mls_2','Q31_B_Part_3':'google_vertex_ai_2','Q31_B_Part_4':'datarobot_2',
                   'Q31_B_Part_5':'databricks_2','Q31_B_Part_6':'dataiku_2','Q31_B_Part_7':'alteryx_2','Q31_B_Part_8':'rapidminer_2',
                   'Q31_B_Part_9':'none_31b','Q31_B_OTHER':'other_31b',
                   'Q32_B_Part_1':'mysql_2','Q32_B_Part_2':'postgresql_2','Q32_B_Part_3':'sqlite_2','Q32_B_Part_4':'oracle_db_2','Q32_B_Part_5':'mongo_db_2',
                   'Q32_B_Part_6':'snowflake_2','Q32_B_Part_7':'ibm_db_2','Q32_B_Part_8':'ms_sql_2','Q32_B_Part_9':'ms_sqldb_2',
                   'Q32_B_Part_10':'ms_cosmosdb_2','Q32_B_Part_11':'aws_redshift_2','Q32_B_Part_12':'aws_aurora_2','Q32_B_Part_13':'aws_dynamodb_2',
                   'Q32_B_Part_14':'amazon_rds_2','Q32_B_Part_15':'google_bigquery_2','Q32_B_Part_16':'google_sql_2','Q32_B_Part_17':'google_firestore_3',
                   'Q32_B_Part_18':'google_bigtable_2','Q32_B_Part_19':'google_spanner_2','Q32_B_Part_20':'none_32b','Q32_B_OTHER':'other_32b',
                   'Q34_B_Part_1':'ms_powerbi_2','Q34_B_Part_2':'aws_quicksight_2','Q34_B_Part_3':'google_ds_2','Q34_B_Part_4':'looker_2',
                   'Q34_B_Part_5':'tableau_crm_2',
                   'Q34_B_Part_6':'qlik_2','Q34_B_Part_7':'domo_2','Q34_B_Part_8':'tibco_spotfire_2','Q34_B_Part_9':'alteryx_bi_2','Q34_B_Part_10':'sisense_2',
                   'Q34_B_Part_11':'sap_analytics_2','Q34_B_Part_12':'ms_synapse_2','Q34_B_Part_13':'thoughtspot_2','Q34_B_Part_14':'none_34b',
                   'Q34_B_Part_15': 'thoughtspot','Q34_B_Part_16': 'none',
                   'Q34_B_OTHER':'other_34b',
                   'Q36_B_Part_1':'data_aug_2','Q36_B_Part_2':'feature_engineering_2','Q36_B_Part_3':'model_selection_2',
                   'Q36_B_Part_4':'model_architecture_2','Q36_B_Part_5':'hyperparam_2','Q36_B_Part_6':'full_ml_2','Q36_B_Part_7':'none_36b',
                   'Q36_B_OTHER':'other_36b',
                   'Q37_B_Part_1':'google_automl_2','Q37_B_Part_2':'h2o_driverless_2','Q37_B_Part_3':'databricks_2_','Q37_B_Part_4':'datarobot_2',
                   'Q37_B_Part_5':'aws_sagemaker_2','Q37_B_Part_6':'azure_ml_2','Q37_B_Part_7':'none_37b','Q37_B_OTHER':'other_37b',
                   'Q38_B_Part_1':'neptune_ai_2','Q38_B_Part_2':'weights_biases_2','Q38_B_Part_3':'comet_ml','Q38_B_Part_4':'sacred_omniboard_2',
                   'Q38_B_Part_5':'tensorboard_2','Q38_B_Part_6':'guild_ai_2','Q38_B_Part_7':'polyaxon_2','Q38_B_Part_8':'clearml_2',
                   'Q38_B_Part_9':'domino_2','Q38_B_Part_10':'mlflow_2','Q38_B_Part_11':'none_38b','Q38_B_OTHER':'other_38b'
      }, inplace=True)
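    # Drop the first row, which holds the question text, and copy so the assignment below acts on a real dataframe.
    df = df.iloc[1::].copy()
    df['duration'] = df['duration'].astype('int')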
    return df

df_21 = col_ren(df_21)
df_20 = col_ren(df_20)
df_19 = df_19.iloc[1::]
df_19.rename(columns={'Q10':'current_yr_compensation', 'Q15':'programming_exp', 'Q23':'machine_exp','Q5':'title',
                      'Time from Start to Finish (seconds)':'duration','Q1':'age','Q2':'gender','Q3':'country',
                      'Q4':'education'
                     },inplace=True)
df_20.drop(columns={'current_yr_compensation'}, inplace=True)
df_20.rename(columns={'Q24':'current_yr_compensation'}, inplace=True)
df_18 = df_18.iloc[1::]
df_21.replace({'United Kingdom of Great Britain and Northern Ireland':'UK'}, inplace=True)

df_21['machine_exp_avg'] = df_21['machine_exp'].replace({'5-10 years':7, 'Under 1 year':0.5, 
                                                         'I do not use machine learning methods':0, 
               '10-20 years':15, '2-3 years':2.5, np.nan:0, '1-2 years':1.5, '4-5 years':4.5, '3-4 years':3.5,
               '20 or more years':20,
               '20+ years':20, '1-3 years':2, '< 1 years':0.5, '3-5 years':4,'I have never written code':0})

df_21['programming_exp_avg'] = df_21['programming_exp'].replace({'5-10 years':7, 'Under 1 year':0.5, 'I do not use machine learning methods':0, 
               '10-20 years':15, '2-3 years':2.5, np.nan:0, '1-2 years':1.5, '4-5 years':4.5, '3-4 years':3.5,
               '20 or more years':20,
               '20+ years':20, '1-3 years':2, '< 1 years':0.5, '3-5 years':4,'I have never written code':0})

df_19['machine_exp'] = df_19['machine_exp'].replace({'1-2 years':'1-2 years', 
                                                     'I have never written code':'I do not use machine learning methods', 
       '20+ years':'20+ years', '3-5 years':'3-4 years', '5-10 years':'5-10 years', '10-20 years':'10-20 years'})

df_19['programming_exp'] = df_19['programming_exp'].replace({'1-2 years':'1-2 years', 
                                                     'I have never written code':'I do not use machine learning methods', 
       '20+ years':'20+ years', '3-5 years':'3-4 years', '5-10 years':'5-10 years', '10-20 years':'10-20 years'})

df_21['current_yr_compensation_avg'] = df_21['current_yr_compensation'].replace({'25,000-29,999':27500, 
        '60,000-69,999':65000, '$0-999':500, '30,000-39,999':35000, np.nan:0,
       '15,000-19,999':17500, '70,000-79,999':75000, '2,000-2,999':2500, '10,000-14,999':12500,
       '5,000-7,499':6000, '20,000-24,999':22500, '1,000-1,999':1500, '100,000-124,999':112500,
       '7,500-9,999':8700, '4,000-4,999':4500, '40,000-49,999':45000, '50,000-59,999':55000,
       '3,000-3,999':3500, '300,000-499,999':375000, '200,000-249,999':225000,
       '125,000-149,999':137000, '250,000-299,999':275000, '80,000-89,999':85000,
       '90,000-99,999':95000, '150,000-199,999':175000, '>$1,000,000':1000000,
       '$500,000-999,999':750000})

df_20['current_yr_compensation_avg'] = df_20['current_yr_compensation'].replace({'25,000-29,999':27500, 
        '60,000-69,999':65000, '$0-999':500, '30,000-39,999':35000, np.nan:0,
       '15,000-19,999':17500, '70,000-79,999':75000, '2,000-2,999':2500, '10,000-14,999':12500,
       '5,000-7,499':6000, '20,000-24,999':22500, '1,000-1,999':1500, '100,000-124,999':112500,
       '7,500-9,999':8700, '4,000-4,999':4500, '40,000-49,999':45000, '50,000-59,999':55000,
       '3,000-3,999':3500, '300,000-499,999':375000, '200,000-249,999':225000,
       '125,000-149,999':137000, '250,000-299,999':275000, '80,000-89,999':85000,
       '90,000-99,999':95000, '150,000-199,999':175000, '>$1,000,000':1000000,
       '$500,000-999,999':750000,'> $500,000':500000,'300,000-500,000':400000})

df_19['current_yr_compensation_avg'] = df_19['current_yr_compensation'].replace({'25,000-29,999':27500, 
        '60,000-69,999':65000, '$0-999':500, '30,000-39,999':35000, np.nan:0,
       '15,000-19,999':17500, '70,000-79,999':75000, '2,000-2,999':2500, '10,000-14,999':12500,
       '5,000-7,499':6000, '20,000-24,999':22500, '1,000-1,999':1500, '100,000-124,999':112500,
       '7,500-9,999':8700, '4,000-4,999':4500, '40,000-49,999':45000, '50,000-59,999':55000,
       '3,000-3,999':3500, '300,000-499,999':375000, '200,000-249,999':225000,
       '125,000-149,999':137000, '250,000-299,999':275000, '80,000-89,999':85000,
       '90,000-99,999':95000, '150,000-199,999':175000, '>$1,000,000':1000000,'> $500,000':500000,
       '$500,000-999,999':750000,'300,000-500,000':400000})

clear_output()
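
The compensation ranges above are mapped to representative values by hand. As a cross-check, the midpoint of a range label could also be derived from the label itself. The helper below is a hypothetical sketch (range_midpoint is not part of the notebook) and assumes labels of the form 'low-high'. Open-ended labels such as '>$1,000,000' have no midpoint and are returned as None, so they must still be handled explicitly.

```python
import re

def range_midpoint(label):
    """Return the midpoint of a range label such as '25,000-29,999'.

    Hypothetical helper, not part of the notebook. Labels that do not
    contain exactly two numbers (e.g. '>$1,000,000') return None.
    """
    numbers = [int(n.replace(",", "")) for n in re.findall(r"\d[\d,]*", label)]
    if len(numbers) != 2:
        return None
    low, high = numbers
    # Upper bounds like 29,999 stand for "just under 30,000".
    if high % 10 == 9:
        high += 1
    return (low + high) / 2

for label in ["25,000-29,999", "$0-999", "125,000-149,999", ">$1,000,000"]:
    print(label, range_midpoint(label))
```

Computed midpoints may differ slightly from hand-picked values, so this is best used to spot typos rather than as a drop-in replacement.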

<h1> A Look At The Data </h1>
<div style="font-family:Helvetica Neue; font-size:20px; line-height:1.7; color:slategray;">
The survey has 42 questions, some of them divided into parts. Information such as age, country, education, experience and title shows how the community has evolved, whereas preferences such as language, IDE and computing platform show how the overall industry has taken shape over the years. I have used the class plot_data, which provides the following plotting functions; the details are explained in the code:
    <br> <br>
    plot_column_cat_multi: plots data from two dataframes, eg. for men/women or for the years 2021/2020. The columns to plot need to be categorical, eg. age/industry. This plot has been used to compare two different sets of data.
    <br><br>
    plot_column_cat: takes only one dataframe, with both columns categorical, eg. age/ml_relevance.
    <br><br>
    plot_column: takes any number of dataframes, with one categorical column and one numerical column, eg. age/current_yr_compensation_avg. It has been used to plot line graphs and compare data over previous years, eg. how compensation has changed with experience over the past years.
    <br><br>
    plot_multi_column: takes two dataframes, each with a list of preference columns (eg. the languages used on a regular basis) and a categorical column (eg. age).
    <br><br>
 <mark>All the horizontal bar plots find and plot the most common values after grouping and aggregation, eg. the most common age group based
 on count, or the most popular languages used on a regular basis.</mark>
    </div>
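
The "most common value after grouping" idea used by the bar plots can be sketched with plain pandas. This is a minimal illustration on made-up data, not the code inside plot_data; the column names and values are assumptions.

```python
import pandas as pd

# Made-up rows standing in for survey responses.
toy = pd.DataFrame({
    "industry": ["tech", "tech", "tech", "finance", "finance", "finance"],
    "age": ["25-29", "25-29", "30-34", "30-34", "30-34", "25-29"],
})

# Count every (industry, age) pair.
pair_counts = toy.groupby(["industry", "age"]).size().reset_index(name="count")

# Keep the pair with the largest count for each industry.
most_common = (
    pair_counts.sort_values("count", ascending=False)
    .drop_duplicates("industry")
    .set_index("industry")
    .sort_index()
)
print(most_common)
```

Ties between equally common values are broken arbitrarily here, which is worth keeping in mind when reading plots of this kind.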

In [None]:
d = create_dataframe_of_agr(df_21,'country','Count of respondents',None,'count',return_percentages=False)
plotly_choropleth_map(d, 
                       'Count of respondents', 
                       'Total number of responses per country in 2021',
                        max_value = 1000)

d = count_then_return_percent(df_21,'country',20)
plotly_bar_chart(d,'% of respondents from top 20 countries','% of respondents',orientation='h')
print('\n\n')

In [None]:
fig1, ax1 = plt.subplots()
fig2, ax2 = plt.subplots()

plot_data(data=[[df_21,'country','2021',None],
                  [df_20,'country','2020',None],
                  [df_19,'country','2019',None],
                  [df_18,'Q3','2018',None],
                 ],
            agr='count',
            title='Number of respondents from 30 countries with most respondents',
            y_label='count',
            x_label='Countries',
            title_c = '% change in number of respondents from various countries compared to previous year',
            ax=[[ax1,ax2]],
            fsize=(20,7),
            count=30
            ).plot_column()

<div style="font-family:Helvetica Neue; font-size:20px; line-height:1.7; color:slategray;">
As we can see, for many countries both the number of respondents and the % increase over the previous year decreased from 2018 to 2021.
Even though there was a global pandemic towards 2020, the drop from 2018 to 2019 is a surprise. Overall, numbers have recovered, but the 2021 count is still not a substantial improvement, and in some countries the 2018 count is higher than the 2021 count.
    </div>
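
The "% change compared to previous year" panel can be reproduced in spirit with `DataFrame.pct_change`. The sketch below uses made-up counts and placeholder country names; it is not the computation inside plot_data.

```python
import pandas as pd

# Made-up respondent counts: one row per survey year, one column per country.
counts = pd.DataFrame(
    {"Country A": [100, 80, 120], "Country B": [50, 60, 60]},
    index=["2018", "2019", "2020"],
)

# Percent change of each year relative to the row above it.
# The first year has no predecessor, so its row is NaN.
pct = (counts.pct_change() * 100).round(1)
print(pct)
```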

In [None]:
fig1, ax1 = plt.subplots()
fig2, ax2 = plt.subplots()

plot_data(data=[[df_21,'machine_exp','2021','current_yr_compensation_avg'],
                  [df_20,'machine_exp','2020','current_yr_compensation_avg'],
                  [df_19,'machine_exp','2019','current_yr_compensation_avg'],
                 ],
            agr='mean',
            title='How yearly compensation has changed with machine exp. over the past years ',
            y_label='yearly compensation (USD)',
            x_label='Machine Experience',
            title_c='% change in yearly compensation',
            ax=[[ax1,ax2]],
            fsize=(20,7),
            count=30
            ).plot_column()

In [None]:
fig1, ax1 = plt.subplots()
fig2, ax2 = plt.subplots()

plot_data(data=[[df_21,'programming_exp','2021','current_yr_compensation_avg'],
                  [df_20,'programming_exp','2020','current_yr_compensation_avg'],
                  [df_19,'programming_exp','2019','current_yr_compensation_avg'],
                 ],
            agr='mean',
            title='How yearly compensation has changed with programming exp. over the past years',
            y_label='yearly compensation (USD)',
            x_label='Programming Experience',
            title_c='% change in yearly compensation',
            ax=[[ax1,ax2]],
            fsize=(20,7),
            count=30
            ).plot_column()

<div style="font-family:Helvetica Neue; font-size:20px; line-height:1.7; color:slategray;"> 
As can be seen, yearly compensation dropped from 2019 to 2020. From 2020 to 2021, however, it increased only for those with 5 or more years of machine learning experience or 3 to 5 years of programming experience.
    </div>

In [None]:
fig1, ax1 = plt.subplots()
fig2, ax2 = plt.subplots()

plot_data(data=[[df_21,'age','2021',None],
                  [df_20,'age','2020',None],
                  [df_19,'age','2019',None],
                  [df_18,'Q2','2018',None],
                 ],
            agr='count',
            title='Number of entries from each age group',
            y_label='count',
            x_label='Age groups',
            title_c = '% change in number of respondents of various age groups',
            ax=[[ax1,ax2]],
            fsize=(10,5),
            ).plot_column()

<div style="font-family:Helvetica Neue; font-size:20px; line-height:1.7; color:slategray;">
Again, there were more respondents aged 23 to 35 years in 2018 than in 2021, and more in 2019 than in 2020. Respondents aged 35 years and above increased the most in 2021 compared to previous years.
Similarly, negative growth can be seen from 2018 to 2019 for respondents aged up to around 35 years. From 2019 to 2020, growth was negative or very small for respondents aged 23 to 55 years.
    </div>

In [None]:
# Create an empty figure; the four axes are added from the GridSpec below
fig1 = plt.figure(constrained_layout=True, figsize=(20,10))
spec4 = gridspec.GridSpec(ncols=2, nrows=2, figure=fig1)

ax1 = fig1.add_subplot(spec4[0, 0])
ax2 = fig1.add_subplot(spec4[0, 1])
ax3 = fig1.add_subplot(spec4[1, 0])
ax4 = fig1.add_subplot(spec4[1, 1])

plot_data(data=[[df_21[df_21.gender=='Woman'],'age','2021',None],
                  [df_20[df_20.gender=='Woman'],'age','2020',None],
                  [df_19[df_19.gender=='Female'],'age','2019',None],
                  [df_18[df_18.Q1=='Female'],'Q2','2018',None],
                 ],
            agr='count',
            title='Number of entries by women from each age group',
            y_label='count',
            x_label='Age groups',
            title_c = '% change in women respondents from various age groups',
            ax=[[ax1,ax2]],
            ).plot_column()

plot_data(data=[[df_21[df_21.gender=='Man'],'age','2021',None],
                  [df_20[df_20.gender=='Man'],'age','2020',None],
                  [df_19[df_19.gender=='Male'],'age','2019',None],
                  [df_18[df_18.Q1=='Male'],'Q2','2018',None],
                 ],
            agr='count',
            title='Number of entries by men from each age group',
            y_label='count',
            x_label='Age groups',
            title_c = '% change in men respondents from various age groups',
            ax=[[ax3,ax4]],
            ).plot_column()            

<div style="font-family:Helvetica Neue; font-size:20px; line-height:1.7; color:slategray;">
A look at the counts by gender shows how a drop in the number of male respondents aged around 23 to 35 years contributed to the overall drop in respondents from 2019 to 2020. The decrease from 2018 to 2021 was also more significant for male respondents aged around 23 to 35 years than for female respondents of the same age.
Finally, from 2020 to 2021 the percentage increase for men was much larger than for women.
<mark>Overall, it seems that from 2018 to 2021, when the world was going through a phase of panic and uncertainty, the male kaggle community was affected more than the female community.</mark> This is an insight that could be of great help to corporate policy makers.
    </div>

In [None]:
titles = list(set(df_21.title.unique()).intersection(set(df_20.title.unique())).intersection(set(df_19.title.unique()))-set(['Other']))
titles_21 = df_21[df_21.title.isin(titles)][['gender','title']]
titles_20 = df_20[df_20.title.isin(titles)][['gender','title']]
titles_19 = df_19[df_19.title.isin(titles)][['gender','title']]

# Create an empty figure; the four axes are added from the GridSpec below
fig1 = plt.figure(constrained_layout=True, figsize=(30,10))
spec4 = gridspec.GridSpec(ncols=2, nrows=2, figure=fig1)

ax1 = fig1.add_subplot(spec4[0, 0])
ax2 = fig1.add_subplot(spec4[0, 1])
ax3 = fig1.add_subplot(spec4[1, 0])
ax4 = fig1.add_subplot(spec4[1, 1])

plot_data(data=[[titles_21[titles_21.gender=='Woman'],'title','2021',None],
                  [titles_20[titles_20.gender=='Woman'],'title','2020',None],
                  [titles_19[titles_19.gender=='Female'],'title','2019',None],
                 ],
            agr='count',
            title='Number of entries by women with different titles',
            y_label='count',
            x_label='Titles',
            title_c = '% change in number of women with different titles',
            ax=[[ax1,ax2]],
            ).plot_column()

plot_data(data=[[titles_21[titles_21.gender=='Man'],'title','2021',None],
                  [titles_20[titles_20.gender=='Man'],'title','2020',None],
                  [titles_19[titles_19.gender=='Male'],'title','2019',None],
                 ],
            agr='count',
            title='Number of entries by men with different titles',
            y_label='count',
            x_label='Titles',
            title_c = '% change in number of men with different titles',
            ax=[[ax3,ax4]],
            ).plot_column()            

<div style="font-family:Helvetica Neue; font-size:20px; line-height:1.7; color:slategray;">
The decrease in the number of male data analysts, data engineers, data scientists, research scientists and software engineers from 2019 to 2020 is much more significant than the decrease for women. From 2020 to 2021 the increase in male data engineers, data scientists and research scientists was larger than for women, whereas the percentage increase in female software engineers was larger than for men during this time.
    </div>

In [None]:
c = df_21.country.value_counts().sort_values(ascending=False)[0:9].keys()
d = df_21[df_21.country.isin(c)]
cn_men = pd.DataFrame(d[d.gender=='Man'].groupby(['country']).agg("count")['gender'])
cn_men.rename(columns={'gender':'men'},inplace=True)
cn_women = pd.DataFrame(d[d.gender=='Woman'].groupby(['country']).agg("count")['gender'])
cn_women.rename(columns={'gender':'women'},inplace=True)
d = pd.merge(cn_men,cn_women,on='country',how='left')
d['mw'] = 100*d['women']/d['men']
d = d.sort_values(by='mw',ascending=True)
plotly_bar_chart(d['mw'], 'Percentage of female respondents in countries with most respondents', 'Percentage',orientation='h')

i = df_21.industry.value_counts().sort_values(ascending=False)[1:6].keys()
d = df_21[df_21.industry.isin(i)]
d = d[d.country.isin(c)]
d = d.groupby(['country']).agg({"industry": "value_counts"}).unstack(fill_value=0)['industry']
(d/d.sum()).plot(figsize=(20,10),title='Top 5 industry count in countries with most respondents')
plt.show()

<div style="font-family:Helvetica Neue; font-size:20px; line-height:1.7; color:slategray;">
Countries with a higher percentage of female respondents (India, U.S.A., U.K., Nigeria) have more respondents from computers/technology than from manufacturing/fabrication, whereas in countries with a lower female percentage, like Brazil, Japan and Russia, manufacturing/fabrication has more respondents than computers/technology.
    </div>
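
A note on the `mw` column computed above: `100*women/men` is the number of women per 100 men, which differs from the share of women among all men and women. The ordering of countries is the same under both measures, but the values are not. A minimal sketch with made-up counts and placeholder country names:

```python
import pandas as pd

# Made-up counts of men and women respondents per country.
toy = pd.DataFrame(
    {"men": [800, 300], "women": [200, 100]},
    index=["Country A", "Country B"],
)

# Ratio as in the cell above: women per 100 men.
toy["women_per_100_men"] = 100 * toy["women"] / toy["men"]

# Share of women among men and women combined.
toy["women_share_pct"] = 100 * toy["women"] / (toy["men"] + toy["women"])
print(toy)
```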

In [None]:
fig1, ax1 = plt.subplots()

plot_data(data=[[df_21[df_21.gender=='Man'],'age','men','duration'],
                  [df_21[df_21.gender=='Woman'],'age','women','duration'],
                 ],
            agr='mean',
            title='Time spent by each age group to complete the survey',
            y_label='mean duration',
            x_label='Age groups',
            ax=[[ax1,ax1]],
            fsize=(10,5),
            ).plot_column()            

<div style="font-family:Helvetica Neue; font-size:20px; line-height:1.7; color:slategray;">
Female respondents aged 21 to 45 and 49 to 59 took more time to complete the survey.
    </div>

In [None]:
d = top_count(df_21,'industry',11)

fig1, ax1 = plt.subplots()

plot_data(data=[[d[d.gender=='Man'],'industry','men machine learning experience','machine_exp_avg'],
                  [d[d.gender=='Woman'],'industry','women machine learning experience','machine_exp_avg'],
                 ],
            agr='mean',
            title='Level of experience of respondents from top 10 industries with most respondents',
            y_label='Average Experience (years)',
            x_label='Industries',
            ax=[[ax1]],
            fsize=(20,5),
            ).plot_column()            

plot_data(data=[[d[d.gender=='Man'],'industry','men programming experience','programming_exp_avg'],
                  [d[d.gender=='Woman'],'industry','women programming experience','programming_exp_avg'],
                 ],
            agr='mean',
            title='Level of experience of respondents from top 10 industries with most respondents',
            y_label='Average Experience (years)',
            x_label='Industries',
            ax=[[ax1]],
            fsize=(20,5),
            ).plot_column()                        

<div style="font-family:Helvetica Neue; font-size:20px; line-height:1.7; color:slategray;">
 As we can see, both men and women respondents have more programming experience than machine learning experience. Overall, however, men respondents have more work experience.
    </div>

<h1> The Subgroup </h1>
<div style="font-family:Helvetica Neue; font-size:20px; line-height:1.7; color:slategray;">
We can see that, as at many other organisations and professional communities, gender bias is quite visible in the kaggle 2021 community, and there can be many reasons behind that. But let's look at the data further in order to understand our community better and how it has evolved.
<br><br>
<mark>By now you will have guessed that this story is about the female kaggle community.</mark>
    </div>

In [None]:
fig1, ax1 = plt.subplots()

d_21 = df_21[df_21.ml_relevance.notna() & (df_21.ml_relevance!='I do not know')]

plot_data(data=[[d_21[d_21.gender=='Man'],'ml_relevance','men','industry'],
                [d_21[d_21.gender=='Woman'],'ml_relevance','women','industry']
                 ],
            cat=True,
            agr='value_counts',
            title='How relevant is machine learning in various industries',
            x_label='count',
            ax=[[ax1]],
            fsize=(15,15),
            ).plot_column_cat_multi()           
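
When filtering out missing answers, as the cell above intends, it helps to remember that an element-wise comparison with `np.nan` is always true, so it removes nothing. The sketch below shows the difference on made-up data; only the column name is taken from the notebook.

```python
import numpy as np
import pandas as pd

# Made-up answers, one of them missing.
toy = pd.DataFrame({"ml_relevance": ["Yes", np.nan, "I do not know", "No"]})

# NaN is not equal to anything, including itself, so every row passes.
kept_by_comparison = toy[toy.ml_relevance != np.nan]

# notna() actually drops the missing row.
kept_by_notna = toy[toy.ml_relevance.notna() & (toy.ml_relevance != "I do not know")]

print(len(kept_by_comparison), list(kept_by_notna.ml_relevance))
```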

<div style="font-family:Helvetica Neue; font-size:20px; line-height:1.7; color:slategray;"> 
The above graph shows how, within the same industry, men and women can report different levels of relevance for machine learning.
For instance, in shipping/transportation most men do not use machine learning methods, while most women respondents have been exploring them. Similarly, in online business/internet-based sales more men reported well-established ML methods, while women have recently started using ML methods. In military/security/defence men are still exploring ML methods, while women have already been using them for the last two years. In marketing/crm men are using well-established methods, while women are not using ML methods.
<br><br>
Overall, in shipping/transportation, military/security/defence and energy/mining women are ahead in terms of the maturity of machine learning methods in use, while men report a higher level of relevance in online business/internet, marketing/crm, broadcasting/communication and accounting/finance.
<br><br>
<mark>This could be interpreted in many ways, but in sectors related to information and technology more men are using machine learning, while most women using machine learning algorithms are working in non-IT sectors.</mark>
    </div>

In [None]:
cn_men = pd.DataFrame(df_21[df_21.gender=='Man'].groupby(['industry']).agg("count")['gender'])
cn_men.rename(columns={'gender':'men'},inplace=True)
cn_women = pd.DataFrame(df_21[df_21.gender=='Woman'].groupby(['industry']).agg("count")['gender'])
cn_women.rename(columns={'gender':'women'},inplace=True)
d = pd.merge(cn_men,cn_women,on='industry',how='left')
d['mw'] = 100*d['women']/d['men']
d = d.sort_values(by='mw',ascending=True)
plotly_bar_chart(d['mw'], 'Percentage of female respondents in various industry', 'Percentage',orientation='h')

<div style="font-family:Helvetica Neue; font-size:20px; line-height:1.7; color:slategray;"> 
Industries like shipping/transportation, military/security/defence and energy/mining, where more women were using machine learning methods, have some of the lowest percentages of women.
    </div>

In [None]:
fig1, ax1 = plt.subplots()

plot_data(data=[[df_21[df_21.gender=='Man'],'age','men','industry'],
                [df_21[df_21.gender=='Woman'],'age','women','industry'],
                 ],
            cat=True,
            agr='value_counts',
            title='Most common age group in every industry',
            x_label='count',
            ax=[[ax1]],
            fsize=(15,10),
            ).plot_column_cat_multi() 

<div style="font-family:Helvetica Neue; font-size:20px; line-height:1.7; color:slategray;">
In most industries the common age groups for men and women are similar. But in industries like non-profit/crm and insurance/risk-assessment, more women respondents are from a higher age group than men, whereas in industries like military, medical/pharmaceutical and broadcasting/communications more men respondents are from a higher age group. Military and broadcasting/communications are, again, industries where women's presence is lower than men's. It might be a case where age is a reason behind discrimination.
    </div>

In [None]:
fig1, ax1 = plt.subplots()

plot_data(data=[[df_21[df_21.gender=='Man'],'language','Man','age'],
                [df_21[df_21.gender=='Woman'],'language','Woman','age'],
                 ],
            agr_list = ['python','r', 'sql', 'c', 'c++', 'java', 'javascript', 'julia', 'swift', 'bash', 'matlab'],
            title='Most common languages used on regular basis by different age groups (2021)',
            x_label='count',
            ax=[[ax1]],
            fsize=(15,10),
            ).plot_multi_column()                        

fig1, ax1 = plt.subplots()

plot_data(data=[[df_20[df_20.gender=='Man'],'language','Man','age'],
                [df_20[df_20.gender=='Woman'],'language','Woman','age'],
                 ],
            agr_list = ['python','r', 'sql', 'c', 'c++', 'java', 'javascript', 'julia', 'swift', 'bash', 'matlab'],
            title='Most common languages used on regular basis by different age groups (2020)',
            x_label='count',
            ax=[[ax1]],
            fsize=(15,10),
            ).plot_multi_column()                        

<div style="font-family:Helvetica Neue; font-size:20px; line-height:1.7; color:slategray;">
The above graphs relate age groups to languages. They show the four most popular languages in every age group in 2020 and 2021, based on count. For instance, for men in the age group 25-29, python, sql, r and javascript were the top four languages used on a regular basis in 2020, but in 2021 c++ replaced javascript.
<br>
Apart from the similarities, there are some differences in how women chose their languages in 2021. Python, sql and r are used on a regular basis by all. But most women respondents prefer java, c++ and matlab, whereas men have preferred javascript.
<br>
We can also see how regular language preference changed from 2020 to 2021. Matlab, which was popular among women respondents aged 25-39 years in 2020, is less popular in 2021 (30-34 age group), replaced by either java or c++.
    </div>
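
The "top languages per age group" ranking can be sketched as follows. This is an illustration only: it assumes the multi-select answers are stored as one 0/1 column per language, which may differ from the notebook's actual layout, and the data is made up.

```python
import pandas as pd

# Made-up multi-select answers: one indicator column per language (assumed layout).
toy = pd.DataFrame({
    "age":        ["25-29", "25-29", "25-29", "30-34", "30-34"],
    "python":     [1, 1, 1, 1, 1],
    "sql":        [1, 1, 0, 1, 0],
    "r":          [0, 1, 0, 1, 1],
    "javascript": [1, 0, 0, 0, 0],
})
languages = ["python", "sql", "r", "javascript"]

# Sum the indicator columns per age group to get usage counts.
usage = toy.groupby("age")[languages].sum()

# Keep the two most used languages in each age group.
top_two = {age: row.nlargest(2).index.tolist() for age, row in usage.iterrows()}
print(top_two)
```

Replacing `2` with `4` gives the top four per group, as in the plots above.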

In [None]:
fig1, ax1 = plt.subplots()
fig2, ax2 = plt.subplots()

d = top_count(df_21,'industry',11)
first_language = list(df_21.first_language.unique())
d_21 = d[d.first_language.isin(first_language)]

plot_data(data=[[df_21[df_21.gender=='Man'],'title','men','age'],
                [df_21[df_21.gender=='Woman'],'title','women','age'],
                 ],
            cat=True,
            agr='value_counts',
            title='Most common title for every age group (2021)',
            x_label='count',
            ax=[[ax1]],
            fsize=(15,10),
            ).plot_column_cat_multi()                        

plot_data(data=[[df_20[df_20.gender=='Man'],'title','men','age'],
                [df_20[df_20.gender=='Woman'],'title','women','age'],
                 ],
            cat=True,
            agr='value_counts',
            title='Most common title for every age group (2020)',
            x_label='count',
            ax=[[ax2]],
            fsize=(15,10),
            ).plot_column_cat_multi()

<div style="font-family:Helvetica Neue; font-size:20px; line-height:1.7; color:slategray;">
In both 2020 and 2021, most men respondents in the age group 25-29 were data scientists, while most women respondents were students.
<mark>In 2020, for age groups from 30-34 years upwards, the most common title for women was either other or currently unemployed. In 2021 things have changed drastically: the most common title for women from 30-44 years is data scientist.</mark> 
<br>
Also, while research scientist is a common title for women in some age groups, it is not for men. It seems fewer men than women are interested in data science research.
<br>
Finally, software engineer was the most popular title for men in the age group 45-49 years in 2020, which is not the case for any age group, for either men or women, in 2021. This may support the view that, as automation progresses, jobs like programming might not exist in the future.
    </div>

In [None]:
fig1, ax1 = plt.subplots()

d_21 = df_21[df_21.current_yr_compensation!='$0-999']

plot_data(data=[[d_21[d_21.gender=='Man'],'current_yr_compensation','men','industry'],
                [d_21[d_21.gender=='Woman'],'current_yr_compensation','women','industry'],
                 ],
            cat=True,
            agr='value_counts',
            title='Most common yearly compensation for every industry (2021)',
            x_label='count',
            ax=[[ax1]],
            fsize=(15,10),
            ).plot_column_cat_multi()      

<div style="font-family:Helvetica Neue; font-size:20px; line-height:1.7; color:slategray;"> 
Retail/sales, online services, manufacturing/fabrication, hospitality/entertainment/sports, government/public service and energy/mining are the industries where women more often earn more than men.
    </div>
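
Comparing compensation brackets between men and women needs the brackets to be ordered by amount, because the labels are strings and sort alphabetically by default. The sketch below shows one way to do this with an ordered categorical; the industries and values are made up, and only a subset of bracket labels is listed.

```python
import pandas as pd

# Bracket labels in ascending order of amount (a subset, for illustration).
order = ["2,000-2,999", "10,000-14,999", "100,000-124,999"]

# Made-up most common bracket per industry for men and women.
toy = pd.DataFrame(
    {"men": ["10,000-14,999", "10,000-14,999"],
     "women": ["100,000-124,999", "2,000-2,999"]},
    index=["Industry A", "Industry B"],
)

# Ordered categoricals compare by bracket rank.
men = pd.Categorical(toy["men"], categories=order, ordered=True)
women = pd.Categorical(toy["women"], categories=order, ordered=True)
toy["women_higher"] = women > men

# Plain string comparison gets Industry B wrong, because "2" sorts after "1".
naive = toy["women"] > toy["men"]
print(toy["women_higher"].tolist(), naive.tolist())
```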

In [None]:
fig1, ax1 = plt.subplots()

d_21 = df_21[df_21.current_yr_compensation!='$0-999']
plot_data(data=[[d_21[d_21.gender=='Man'],'current_yr_compensation','men','title'],
                [d_21[d_21.gender=='Woman'],'current_yr_compensation','women','title'],
                 ],
            cat=True,
            agr='value_counts',
            title='Most common yearly compensation for every title (2021)',
            x_label='count',
            ax=[[ax1]],
            fsize=(15,10),
            ).plot_column_cat_multi()                        

<div style="font-family:Helvetica Neue; font-size:20px; line-height:1.7; color:slategray;"> 
In terms of title, the picture was a little different. Women statisticians, program/project managers, data engineers and dba/database engineers were mostly earning more than men in 2021.
<br>
Men software engineers, research scientists, product managers and business analysts, on the other hand, were able to earn more than women.
<br>    
<mark>It can be said that data science jobs are more fruitful for women than men.</mark>
    </div>

In [None]:
fig1, ax1 = plt.subplots()

d_21 = df_21[df_21.current_yr_compensation!='$0-999']

plot_data(data=[[d_21[d_21.gender=='Man'],'current_yr_compensation','men','education'],
                [d_21[d_21.gender=='Woman'],'current_yr_compensation','women','education'],
                 ],
            cat=True,
            agr='value_counts',
            title='Most common, yearly compensation for every education (2021)',
            x_label='count',
            ax=[[ax1]],
            fsize=(15,10),
            ).plot_column_cat_multi()                        

<div style="font-family:Helvetica Neue; font-size:20px; line-height:1.7; color:slategray;"> 
For non-professional education levels, such as "Some college/university study without earning a bachelor's degree" and "No formal education past high school", women were earning more than men in 2021.
<br>
Otherwise, for every professional course, men respondents were earning more than women.
<br>
<mark>For men, compensation seems to increase from bachelor's to master's, doctoral and professional doctoral degrees. But for women, for some reason, it decreases and then remains the same.</mark>
    </div>

In [None]:
fig1, ax1 = plt.subplots()

d_21 = df_21[df_21.current_yr_compensation!='$0-999']

plot_data(data=[[d_21[d_21.gender=='Man'],'current_yr_compensation','men','machine_exp'],
                [d_21[d_21.gender=='Woman'],'current_yr_compensation','women','machine_exp'],
                 ],
            cat=True,
            agr='value_counts',
            title='Most common, yearly compensation for every machine experience (2021)',
            x_label='count',
            ax=[[ax1]],
            fsize=(15,10),
            ).plot_column_cat_multi()                        

fig1, ax1 = plt.subplots()

plot_data(data=[[d_21[d_21.gender=='Man'],'current_yr_compensation','men','programming_exp'],
                [d_21[d_21.gender=='Woman'],'current_yr_compensation','women','programming_exp'],
                 ],
            cat=True,
            agr='value_counts',
            title='Most common, yearly compensation for every programming experience (2021)',
            x_label='count',
            ax=[[ax1]],
            fsize=(15,10),
            ).plot_column_cat_multi()                        

<div style="font-family:Helvetica Neue; font-size:20px; line-height:1.7; color:slategray;">
Yearly compensation increases for both men and women as machine learning and programming experience increase.
<br>
 <mark>Compensation increases more with machine learning experience than with programming experience. For instance, people with 3-4 years of machine learning experience were earning more than those with 3-5 years of programming experience.</mark>
    </div>

In [None]:
fig1, ax1 = plt.subplots()

d = df_21[df_21.gender=='Woman']
# Titles to plot, by position in d.title.unique() (positions 0, 1 and 6 are left out)
women_titles = d.title.unique()
selected_titles = [women_titles[i] for i in (2,3,4,5,7,8,9,10,11,12,13,14)]

plot_data(data=[[d[d.title==t],'machine_exp',t,'current_yr_compensation_avg']
                for t in selected_titles],
            agr='mean',
            title='How yearly compensation changes for women with title and machine exp.',
            y_label='compensation (USD)',
            x_label='Experience',
            ax=[[ax1]],
            fsize=(20,10),
            ).plot_column()            

In [None]:
fig1, ax1 = plt.subplots()

d = df_21[df_21.gender=='Woman']
# Titles to plot, by position in d.title.unique() (positions 0, 1 and 6 are left out)
women_titles = d.title.unique()
selected_titles = [women_titles[i] for i in (2,3,4,5,7,8,9,10,11,12,13,14)]

plot_data(data=[[d[d.title==t],'programming_exp',t,'current_yr_compensation_avg']
                for t in selected_titles],
            agr='mean',
            title='How yearly compensation changes for women with title and programming exp.',
            y_label='compensation (USD)',
            x_label='Experience',
            ax=[[ax1]],
            fsize=(20,10),
            ).plot_column()        

<div style="font-family:Helvetica Neue; font-size:20px; line-height:1.7; color:slategray;"> 
The two graphs above support the earlier findings. Women statisticians, product managers and data learning engineers with work experience earn the most.
    </div>
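
The mean compensation per title and experience level shown in the two line graphs can also be tabulated directly. The following is a minimal sketch with made-up numbers; the column names follow the notebook, but it is not the code inside plot_data.

```python
import pandas as pd

# Made-up rows: title, experience bracket and numeric compensation.
toy = pd.DataFrame({
    "title": ["Statistician", "Statistician", "Statistician",
              "Data Engineer", "Data Engineer"],
    "machine_exp": ["1-2 years", "3-4 years", "3-4 years",
                    "1-2 years", "3-4 years"],
    "current_yr_compensation_avg": [30000, 50000, 70000, 40000, 45000],
})

# One row per experience level, one column per title, mean compensation in each cell.
table = toy.pivot_table(
    index="machine_exp",
    columns="title",
    values="current_yr_compensation_avg",
    aggfunc="mean",
)
print(table)
```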

In [None]:
fig1, ax1 = plt.subplots()

plot_data(data=[[df_21[df_21.gender=='Man'],'language','Man','age'],
                [df_21[df_21.gender=='Woman'],'language','Woman','age'],
                 ],
            agr_list = ['jupyter', 'rstudio', 'vs', 'vscode','pycharm', 'spyder', 'notepad++',
                        'sublime_text', 'vim/emacs','matlab_', 'jupyter_n'],
            title='Most common IDE used on regular basis by different age groups (2021)',
            x_label='count',
            ax=[[ax1]],
            fsize=(15,15),
            ).plot_multi_column()                        

fig1, ax1 = plt.subplots()

plot_data(data=[[df_20[df_20.gender=='Man'],'language','Man','age'],
                [df_20[df_20.gender=='Woman'],'language','Woman','age'],
                 ],
            agr_list = ['jupyter', 'rstudio', 'vs', 'vscode','pycharm', 'spyder', 'notepad++', 
                        'sublime_text', 'vim/emacs','matlab_', 'jupyter_n'],
            title='Most common IDE used on regular basis by different age groups (2020)',
            x_label='count',
            ax=[[ax1]],
            fsize=(15,15),
            ).plot_multi_column()                     

<div style="font-family:Helvetica Neue; font-size:20px; line-height:1.7; color:slategray;"> 
For every age group, Jupyter is the most popular IDE. The second most popular among men in 2021 is Visual Studio Code, whereas among women it is RStudio for age groups from 40 onwards and PyCharm for 30-39 years.
<br>
In 2020 the situation was much the same.
    </div>

In [None]:
fig1, ax1 = plt.subplots()

plot_data(data=[[df_21[df_21.gender=='Man'],'language','Man','age'],
                [df_21[df_21.gender=='Woman'],'language','Woman','age'],
                 ],
            agr_list = ['kaggle', 'colab', 'azure', 'paperspace_gradient',
                        'binder_jupyterhub', 'code_ocean', 'ibm_watson_studio',
                        'sagemaker', 'emr', 'gcn', 'gcd', 'databricks',
                        'zeppelin', 'deepnote', 'observable'],
            title='Most common notebook used on regular basis by different age groups (2021)',
            x_label='count',
            ax=[[ax1]],
            fsize=(15,15),
            ).plot_multi_column()                        

fig1, ax1 = plt.subplots()

plot_data(data=[[df_20[df_20.gender=='Man'],'language','Man','age'],
                [df_20[df_20.gender=='Woman'],'language','Woman','age'],
                 ],
            agr_list = ['kaggle', 'colab', 'azure', 'paperspace_gradient',
                        'binder_jupyterhub', 'code_ocean', 'ibm_watson_studio',
                        'sagemaker', 'emr', 'gcn', 'gcd', 'databricks',
                        'zeppelin'],
            title='Most common notebook used on regular basis by different age groups (2020)',
            x_label='count',
            ax=[[ax1]],
            fsize=(15,15),
            ).plot_multi_column()                     

<div style="font-family:Helvetica Neue; font-size:20px; line-height:1.7; color:slategray;"> 
Colab is the most popular notebook in 2021, except in a few age groups such as 35-39 years.
<br>
In 2020, however, Zeppelin / Zepl notebooks were the most popular among both men and women, except for men below 39 years.
    </div>
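
Raw counts are hard to compare across age groups of different sizes. As a hedged sketch only (toy rows, a hypothetical helper `share_by_group`, and the same assumed non-null encoding for a selected product), the selections can be converted to the percentage of each group's respondents who picked a notebook product. The notebook's own `plot_data` class may compute its values differently.

```python
import pandas as pd

def share_by_group(df, tool_columns, group_column):
    """Percentage of respondents in each group who selected each tool."""
    selected = df[tool_columns].notna()
    # The mean of a boolean column is the fraction of True values
    return (selected.groupby(df[group_column]).mean() * 100).round(1)

# Toy data (hypothetical, not survey responses)
toy = pd.DataFrame({
    "age": ["25-29"] * 4 + ["35-39"] * 2,
    "colab": ["Colab", "Colab", "Colab", None, None, "Colab"],
    "kaggle": ["Kaggle", None, None, None, "Kaggle", "Kaggle"],
})

shares = share_by_group(toy, ["colab", "kaggle"], "age")
print(shares)

# The top product per group is the column with the largest share
top_product = shares.idxmax(axis=1)
print(top_product)
```

With shares instead of counts, an exception such as a single age group preferring a different product stands out regardless of how many respondents that group has.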

In [None]:
fig1, ax1 = plt.subplots()

plot_data(data=[[df_21[df_21.gender=='Man'],'language','Man','title'],
                [df_21[df_21.gender=='Woman'],'language','Woman','title'],
                 ],
            agr_list = ['iv_tools', 'image_segmentation', 'object_detection',
                        'image_classification', 'generative'],
            title='Most common computer vision methods used on regular basis by different titles (2021)',
            x_label='count',
            ax=[[ax1]],
            fsize=(15,15),
            ).plot_multi_column()                        


In [None]:
fig1, ax1 = plt.subplots()

plot_data(data=[[df_21[df_21.gender=='Man'],'language','Man','title'],
                [df_21[df_21.gender=='Woman'],'language','Woman','title'],
                 ],
            agr_list = ['analysis', 'data_infrastructure', 'prototypes',
                        'product_improvement', 'model_improvement', 'R&D'],
            title='Most common organizational role for different titles (2021)',
            x_label='count',
            ax=[[ax1]],
            fsize=(15,15),
            ).plot_multi_column()                        


<div style="font-family:Helvetica Neue; font-size:20px; line-height:1.7; color:slategray;">  
For women machine learning engineers, the most common role is to analyze and understand data, whereas for men it is to build prototypes.
For women data engineers, the most common task is to build and/or run data infrastructure, whereas for men it is data analysis.
    </div>

In [None]:
fig1, ax1 = plt.subplots()

plot_data(data=[[df_21[df_21.gender=='Man'],'language','Man','title'],
                [df_21[df_21.gender=='Woman'],'language','Woman','title'],
                 ],
            agr_list = ['aws', 'azure_', 'gcp', 'ibm_red_hat', 'oracle', 'sap',
                        'salesforce', 'vmware', 'alibaba', 'tencent'],
            title='Most common cloud computing platforms used by different titles (2021)',
            x_label='count',
            ax=[[ax1]],
            fsize=(15,15),
            ).plot_multi_column()                        


<div style="font-family:Helvetica Neue; font-size:20px; line-height:1.7; color:slategray;"> 
While AWS seems to be more popular overall, women product managers, statisticians and research scientists prefer Google Cloud Platform over AWS.
   </div>

# Conclusion
<div style="font-family:Helvetica Neue; font-size:20px; line-height:1.7; color:slategray;"> 
The Kaggle survey includes questions on most professional aspects, but it is still not sufficient to predict future changes for women in data science and machine learning. Much more analysis is possible. For now, a few things are clear:
<ul>
    <li>
        Though there is a substantial bias against women, as in many other industries, there is a brighter side too. First, the data science industry offers more opportunities for women than the IT sector as a whole. In traditional jobs such as software engineer or business analyst, the bias in compensation is much higher than in jobs such as data scientist. There are not only more opportunities but also more growth for people with more experience.</li>
     <li>
         Data science is still a largely unknown domain that many corporations have only just entered. Roles such as data engineer, research scientist, statistician and others have been more fruitful for women than for men. I feel the gender bias will only decrease in the coming years.
        </li>
    </ul>
    </div>    