# Objective

## This notebook aims to analyze how Kagglers from any chosen Country differ from the Rest of the World

<b>Data Source</b>: kaggle_survey_2021_responses.csv

"Copy and Edit" this notebook to choose a country, and watch how the graphs and insights change accordingly!

<b>Please don't forget to <font color=green>UPVOTE</font> if you find this interesting. Feedback is more than welcome :)</b>

### This analysis covers almost **<font color=green>ALL</font>** the questions asked in the Kaggle 2021 DS and ML survey, divided into the sections below:
* Demographics
* Personal Background
* Personal Preferences
* Learning Preferences
* Workplace

### So hold on tight, this is going to take a while!

In [None]:
""" 
TO DO:
1. Add missing questions
2. Document code
3. Use plotly buttons to select other parameters
4. Lint with Black
"""

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python

import ipywidgets as widgets
import numpy as np
import pandas as pd 
import plotly.express as px
import plotly.graph_objects as go

from IPython.display import display, clear_output
from ipywidgets import Output, Button, interact
from plotly.subplots import make_subplots
from typing import Union

pd.set_option('mode.chained_assignment', None)
!jupyter nbextension enable --py --sys-prefix widgetsnbextension

In [None]:
def donut_subplots(df:pd.DataFrame, question_text:str, country:str, title:str) -> None:
    """
    Plots plotly donut charts comparing the country to the global stats for a given question.
    
    Arguments:
    ----------
    df (pandas.DataFrame): Source dataframe
    question_text (str): Exact text of the question, or the column name if using a custom column. We use the text instead of the index as the text is more consistent across multiple years' surveys
    country (str): The country to compare global averages with 
    title (str): Supertitle of the subplots
    
    Returns:
    --------
    None
    """
    
    fig = make_subplots(rows=1, cols=2, specs=[[{'type':'domain'}, {'type':'domain'}]])
    fig.add_trace(go.Pie(labels=df[question_text].value_counts().index
                     , values=df[question_text].value_counts()
                     , name="Global", title="Global"), 
              1, 1)
    fig.add_trace(go.Pie(labels=df[df[COUNTRY_QUESTION_TEXT]==country][question_text].value_counts().index
                     , values=df[df[COUNTRY_QUESTION_TEXT]==country][question_text].value_counts()
                     , name=country, title=country), 
              1, 2)

    fig.update_traces(hole=.6, textposition='inside', textinfo='percent+label')
    fig.update_layout(uniformtext_minsize=10, uniformtext_mode='hide', title_text=title)
    fig.show()

In [None]:
def get_popular_answer_and_plot(df: pd.DataFrame, question_text:str, country:str) -> None:
    """Calculates the most common response to a question and plots plotly donut charts comparing the country's answers to that of others
    
    Arguments:
    ----------
    df (pandas.DataFrame): Source dataframe
    question_text (str): Exact text of the question, or the column name if using a custom column. We use the text instead of the index as the text is more consistent across multiple years' surveys
    country (str): The country to compare global averages with 
    
    Returns:
    --------
    None
    """
    
    items = df[question_text].value_counts(normalize=True)*100
    most_common_item = items.index[0].strip()
    # Use positional access explicitly; the index holds answer labels, not integers
    most_common_pct = items.iloc[0]

    country_items = df[df[COUNTRY_QUESTION_TEXT]==country][question_text].value_counts(normalize=True)*100
    country_common_item = country_items.index[0].strip()
    country_common_pct = country_items.iloc[0]

    if most_common_item == country_common_item:
        if country_common_pct > most_common_pct:
            title = f"In {country}, this percentage increases to {country_common_pct:.0f}%"
        else:
            title = f"In {country}, this percentage decreases to {country_common_pct:.0f}%"
    else:
        title = f"However, in {country}, {country_common_item} is more popular<br>with {country_common_pct:.0f}% using it most often"

    donut_subplots(df, question_text, country, f"{most_common_pct:.0f}% of all Kagglers use {most_common_item} most often<br>{title}")

In [None]:
def bar_with_mean_num(df:pd.DataFrame, question_text:str, country:str, global_mean:float, title:str) -> None:
    """
    Plots a bar chart comparing the country's mean value for a numerical column to that of others, with a line for the global mean
    
    Arguments:
    ----------
    df (pandas.DataFrame): Source dataframe
    question_text (str): Exact text of the question, or the column name if using a custom column. We use the text instead of the index as the text is more consistent across multiple years' surveys. This string is also used as the `yaxis_title` 
    country (str): The country to compare global averages with 
    global_mean (float): The global mean of the column in question
    title (str): Title of the plot
    
    Returns:
    --------
    None
    """

    df_group = df.groupby(COUNTRY_QUESTION_TEXT)[question_text].mean().sort_values(ascending=False)
    loc = df_group.index.to_list().index(country)
    
    # Highlight the selected country's bar; all other bars keep the default colour
    color = ['#636EFA']*len(df_group)
    color[loc] = 'orange'  
    
    fig = go.Figure(data=[go.Bar(x=df_group.index, y=df_group, marker_color=color)])
    
    fig.update_layout(
        shapes=[
        dict(
          type='line',
          yref='y', y0=global_mean, y1=global_mean,
          xref='x', x0=-0.5, x1=len(df.groupby(COUNTRY_QUESTION_TEXT)[question_text])-0.5
        )],
        title=title,
        xaxis_title=None,
        yaxis_title=question_text)

    fig.add_annotation(x=len(df.groupby(COUNTRY_QUESTION_TEXT)[question_text])*0.95, y=global_mean, xshift=-20, yshift=10,
                text="Global Average",
                showarrow=False)
    fig.show()

In [None]:
##### WIP #####
def dropdown_bar_chart(df:pd.DataFrame, selected_choice, question, x, y, text, axes=None, showpercent=False):
    buttons = []
    fig = go.Figure()
    for i in range(len(y)):
        label = y[i]
        fig.add_trace(go.Bar(
                 x           = x, 
                 y           = y, 
                 orientation = "h",
                 text        = ['{0:1.2f}%'.format(100.*x[i]/x.sum()) for i in range(len(y))] if showpercent else x,
                 marker      = dict(color = ["#2471A3" if x == label else "#BBBBBB" for x in y])))
        
        buttons.append({'label'  : label,
                        'method' : 'update',
                        'args'   : [{'visible'  : [True if x == i else False for x in range(len(y))]},
                                    {'title'    : f"{country_avg:.1f}% of Kagglers from {country} {filler_text} \"{label}\",<br>compared to the global average of {global_mean:.1f}%"
},
                                    {'selected' : [True if label == selected_choice else False for x in y]}]})
    
    index = y.get_loc(label)
    fig.update_layout(updatemenus = [dict(type='dropdown', x = 1.0, y = 1.0, buttons=buttons, active=index)],
                      title       = f"{country_avg:.1f}% of Kagglers from {country} {filler_text} \"{label}\",<br>compared to the global average of {global_mean:.1f}%",
                      xaxis       = dict(title = axes[0]),
                      yaxis       = dict(title = axes[1]))
    
    for i in range(len(y)):
        fig.data[i].visible = False
    fig.data[index].visible = True
    fig.show()

#dropdown_bar_chart(df, 'Man', question_text, df[question_text].value_counts().values, df[question_text].value_counts().index, 'identified as', ['Number of Kagglers', 'Genders'], showpercent=False)

In [None]:
def bar_with_mean_cat(df:pd.DataFrame, question_text:str, country:str, filler_text:str) -> None:
    """
    Plots bar charts for all categories in a categorical column comparing the country's mean value to that of others with a line for the global mean, for all the distinct values in a column
    
    Arguments:
    ----------
    df (pandas.DataFrame): Source dataframe
    question_text (str): Exact text of the question, or the column name if using a custom column. We use the text instead of the index as the text is more consistent across multiple years' surveys. This string is also used as the `yaxis_title` 
    country (str): The country to compare global averages with 
    filler_text (str): Filler to be used in the plot title in the format f"{country_avg} of Kagglers from {country} {filler_text} {value} compared to the global average of {global_mean} OR f"Nobody from {country} {filler_text} {value}"
    
    Returns:
    --------
    None
    """
    values = [x for x in df[question_text].unique() if x !='Prefer not to say']

    for value in values:

        df_all = df.groupby(COUNTRY_QUESTION_TEXT)[question_text].value_counts().groupby(level=0).apply(
            lambda x: 100 * x / float(x.sum()))[:,value].sort_values(ascending=False)

        if country in df_all.index:
            country_avg = df_all[country]
            global_mean = len(df[df[question_text]==value])*100/len(df)
            val_pct = len(df[df[question_text]==value])*100/len(df)
            title=f"{country_avg:.1f}% of Kagglers from {country} {filler_text} \"{value}\",<br>compared to the global average of {global_mean:.1f}%"

            loc = df_all.index.to_list().index(country)
            color = ['#636EFA']*len(df_all.index)
            color[loc] = 'orange'

            fig = go.Figure(data=[go.Bar(x=df_all.index, y=df_all.values, marker_color=color)])
            fig.update_layout(
                shapes=[
                    dict(
                      type= 'line',
                      yref= 'y', y0= global_mean, y1= global_mean,
                      xref= 'x', x0= -0.5, x1= len(df_all.index)-0.5
                    )],
                title=title,
                xaxis_title=None,
                yaxis_title='Percentage')
            fig.add_annotation(x=len(df_all.index)*0.95, y=global_mean, xshift=-20, yshift=10,
                        text="Global Average",
                        showarrow=False)
            fig.show()
            
        else:
            print(f"Nobody from {country} {filler_text} {value}")

In [None]:
def plot_distribution(df:pd.DataFrame, question_text:str, country:str, title:str, xaxis_title:str=None) -> None:
    """
    Plots a bar chart comparing the distribution of the column's values between that country and the rest of the world
    
    Arguments:
    ----------
    df (pandas.DataFrame): Source dataframe
    question_text (str): Exact text of the question, or the column name if using a custom column. This is also used as the `xaxis_title` of the plot. We use the text instead of the index as the text is more consistent across multiple years' surveys. 
    country (str): The country to compare global averages with 
    title (str): Title of the plot
    xaxis_title (str): X-axis title of the plot.
    
    Returns:
    --------
    None
    """
    
    df_country = df[df.country_agg==country][question_text].value_counts(normalize=True).sort_index()
    df_others = df[df.country_agg=='Others'][question_text].value_counts(normalize=True).sort_index()

    for index in df_others.index:
        if index not in df_country.index:
            df_country[index] = 0
    df_country.sort_index(inplace=True)

    fig = go.Figure(data=[
        go.Bar(name=country, y=df_country.values*100),
        go.Bar(name='Others', y=df_others.values*100)
    ])

    fig.update_layout(
        barmode='group',
        title=title,
        xaxis_title=xaxis_title,
        yaxis_title='Percentage of respondents',
        xaxis = dict(
            tickmode = 'array',
            tickvals = [x for x in range(df[question_text].nunique()+1)],
            ticktext = df[question_text].sort_values().unique()
        )
    )

    fig.show()

In [None]:
def global_and_country_bars(df:pd.DataFrame, question_text:str, country:str, drop_vals:Union[list, bool] = None, title:bool = True) -> None:
    """
    Plots global and country specific bar charts for 'preference' questions
    
    Arguments:
    ----------
    df (pandas.DataFrame): Source dataframe
    question_text (str): Exact text of the question, or the column name if using a custom column. We use the text instead of the index as the text is more consistent across multiple years' surveys. This string is also used as the `yaxis_title` 
    country (str): The country to compare global averages with 
    drop_vals (list|bool): Answer values which need to be dropped from calculations and plots. This overrides dropping 'None' and 'No / None' which are dropped by default. Use False to drop no values.
    title (bool): True to have a detailed title, False to display just 'Global' and 'Country' (Default: True)
    
    Returns:
    --------
    None
    """
    
    filler_text = 'use' if ' use ' in question_text else 'hope to become more familiar with'
        
    cols = [col for col in df.columns if question_text in col]
    mapper = [col.split('- ',maxsplit=2)[2].strip() for col in cols]
    mapping_dict = dict(zip(cols,mapper))
    df = df[cols + [COUNTRY_QUESTION_TEXT] + ['country_agg']].rename(columns=mapping_dict)
    df.dropna(how='all', subset=mapper, inplace=True)
    
    if drop_vals is False:
        # Explicitly keep every answer column
        pass
    elif drop_vals:
        df.drop(columns=drop_vals, inplace=True)
    else:
        # Default: drop the "none of the above" style answers when they are present
        df.drop(columns=['None', 'No / None'], errors='ignore', inplace=True)

    most_comm = df[df.columns[:-2]].count().sort_values(ascending=False)/len(df)
    most_comm_val = most_comm.index[0]
    most_comm_pct = most_comm.iloc[0]*100
    fig = px.bar(df[df.columns[:-2]].count().sort_values(ascending=False))
    fig.update_layout(
        title=f"{most_comm_pct:.0f}% of all respondents {filler_text} {most_comm_val}" if title else 'Global',
        xaxis_title=None,
        yaxis_title='Number of respondents',
        showlegend=False
    )
    fig.show()
    
    filler_text = 'using it' if ' use ' in question_text else 'hoping to become more familiar with it'
    
    country_most_comm = df[df[COUNTRY_QUESTION_TEXT]==country][df.columns[:-2]].count().sort_values(ascending=False)
    country_most_comm_val = country_most_comm.index[0]
    country_most_comm_pct = country_most_comm.iloc[0]*100/len(df[df[COUNTRY_QUESTION_TEXT]==country])
    if country_most_comm_val==most_comm_val:
        if country_most_comm_pct > most_comm_pct:
            title_text = f'{most_comm_val} is even more popular in {country},<br>with {country_most_comm_pct:.0f}% of respondents {filler_text}'
        else:
            title_text = f'{most_comm_val} remains the most popular in {country} too,<br>with {country_most_comm_pct:.0f}% of respondents {filler_text}'
    else:
        title_text = f"However, in {country}, {country_most_comm_val} is more popular,<br>with {country_most_comm_pct:.0f}% of respondents {filler_text}"

    fig = px.bar(df[df[COUNTRY_QUESTION_TEXT]==country][df.columns[:-2]].count().sort_values(ascending=False))
    fig.update_layout(
        title=title_text if title else f'In {country}',
        xaxis_title=None,
        yaxis_title='Number of respondents',
        showlegend=False
    )
    fig.show()

In [None]:
COUNTRY_QUESTION_TEXT = 'In which country do you currently reside?'

In [None]:
df = pd.read_csv('/kaggle/input/kaggle-survey-2021/kaggle_survey_2021_responses.csv', skiprows=1, low_memory=False)

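# Shorten long country names; assign the result back instead of relying on an in-place replace of a column selection
df[COUNTRY_QUESTION_TEXT] = df[COUNTRY_QUESTION_TEXT].replace({'United Kingdom of Great Britain and Northern Ireland':'UK',
                           'Iran, Islamic Republic of...':'Iran',
                           'United Arab Emirates':'UAE',
                           'United States of America':'USA',
                           'Viet Nam':'Vietnam'})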

<font color=red>Ipywidgets do not work in committed Kaggle notebooks. To select a country, "Copy and Edit" the notebook.</font>

In [None]:
output = Output()

country_widget = widgets.Dropdown(
    options=np.sort(df[COUNTRY_QUESTION_TEXT].unique()),
    value='India',
    description='Country:',
    disabled=False,
)

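def on_change(change):
    # Keep the module-level `country` in sync with the dropdown selection
    global country
    country = change['new']

country = country_widget.value

# Only react to changes of the widget's value, not to every trait update
country_widget.observe(on_change, names='value')
display(country_widget, output)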

df['country_agg'] = np.where(df[COUNTRY_QUESTION_TEXT]==country,country,'Others')

# Demographics
## 1. Country

In [None]:
fig = go.Figure()
fig.add_trace(go.Pie(values=df[COUNTRY_QUESTION_TEXT].value_counts().values,
                     labels=df[COUNTRY_QUESTION_TEXT].value_counts().index,
                     pull=(df[COUNTRY_QUESTION_TEXT].value_counts().index==country)*0.2,
                     hoverinfo ='label+percent'))

fig.update_traces(textposition='inside', textinfo='percent+label')
fig.update_layout(title={'text':f"{len(df[df[COUNTRY_QUESTION_TEXT]==country])*100/len(df):.2f}% of all survey respondents are from {country}"},
                  uniformtext_minsize=10, uniformtext_mode='hide')
fig.show()

## 2. Age

In [None]:
question_text = 'What is your age (# years)?'

##### Donut subplots #####
overall_pct = len(df[(df[question_text].isin(['18-21','22-24','25-29']))])*100/len(df)

age_pct = len(df[(df[COUNTRY_QUESTION_TEXT]==country) 
                 & (df[question_text].isin(['18-21','22-24','25-29']))])*100/len(df[df[COUNTRY_QUESTION_TEXT]==country])

if age_pct < overall_pct:
    title = f"{country} is older, with {age_pct:.0f}% of Kagglers being under 30"
elif age_pct > overall_pct:
    title = f"{country} is younger, with {age_pct:.0f}% of Kagglers being under 30"
else:
    title = f"{age_pct:.0f}% of Kagglers from {country} are also under 30"

donut_subplots(df, question_text, country, f"{overall_pct:.0f}% of all Kagglers are less than 30 years old<br>{title}")

##### Barchart with global average #####
# Convert each age bucket (e.g. '25-29') to its midpoint; the open-ended '70+' bucket is treated as exactly 70
df['age1'] = df[question_text].str.split('-').str[0].replace('70+', '70').astype('int')
df['age2'] = df[question_text].str.split('-').str[1].fillna(70).astype('int')
df['Age'] = (df.age1+df.age2)/2

global_mean = df.Age.mean()
country_mean = df[df.country_agg==country].Age.mean()

# Report the gap as a positive number in both branches
if country_mean <= global_mean:
    title = f"With an average age of {country_mean:.0f},<br>Kagglers from {country} are generally {abs(global_mean - country_mean):.0f} years younger than the average Kaggler"
else:
    title = f"With an average age of {country_mean:.0f},<br>Kagglers from {country} are generally {abs(global_mean - country_mean):.0f} years older than the average Kaggler"

bar_with_mean_num(df, 'Age', country, global_mean, title)

##### Distribution compared to rest of the world #####
plot_distribution(df, question_text, country, f'Age distribution of Kagglers from {country} compared to others', 'Age')

## 3. Gender diversity

In [None]:
question_text = 'What is your gender? - Selected Choice'

##### Donut subplots #####
donut_subplots(df, question_text, country, f"Gender Ratio of Kagglers: Global vs {country}")

##### Barchart with global average for each gender #####
bar_with_mean_cat(df, question_text, country, 'identified as')

# Personal Background
## 4. Academic qualification
<font color=red> Only includes respondents who answered the question "<b>What is the highest level of formal education that you have attained or plan to attain within the next 2 years?</b>" </font> 
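
As a rough illustration of the per-country percentages that `bar_with_mean_cat` plots, the sketch below computes the share of each answer within each country and the global share for comparison. The toy data and the column names `country` and `degree` are made up for this example and are not the survey's actual columns.

```python
import pandas as pd

# Toy responses; the column names here are illustrative, not the survey's.
answers = pd.DataFrame({
    "country": ["India", "India", "India", "USA", "USA"],
    "degree": ["Master", "Master", "Bachelor", "Master", "Doctoral"],
})

# Share of each answer within each country, as a percentage.
pct_by_country = (
    answers.groupby("country")["degree"]
    .value_counts(normalize=True)
    .mul(100)
    .unstack(fill_value=0)
)

# Share of each answer across all respondents, used as the global reference line.
global_pct = answers["degree"].value_counts(normalize=True).mul(100)
```

Answers that nobody in a country selected show up as 0 after `unstack(fill_value=0)`, which corresponds to the "Nobody from ..." message printed by the plotting function.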

In [None]:
question_text = 'What is the highest level of formal education that you have attained or plan to attain within the next 2 years?'

##### Donut subplots #####
donut_subplots(df, question_text, country, "Academic qualification of Kagglers")

##### Barchart with global average for each qualification #####
bar_with_mean_cat(df, question_text, country, 'reported their qualification as')

## 5. Job title
<font color=red> Only includes respondents who answered the question "<b>Select the title most similar to your current role (or most recent title if retired)</b>" </font>

In [None]:
question_text = 'Select the title most similar to your current role (or most recent title if retired): - Selected Choice'

df_filtered = df[~df[question_text].isna()]

##### Donut subplots #####
# Most common job title overall (value_counts ignores missing answers)
comm = df_filtered[question_text].value_counts(normalize=True).head(1)*100
comm_val = comm.index[0]
comm_pct = comm.values[0]

# Most common job title in the selected country
country_comm = df_filtered[df_filtered[COUNTRY_QUESTION_TEXT]==country][question_text].value_counts(normalize=True).head(1)*100
country_comm_val = country_comm.index[0]
country_comm_pct = country_comm.values[0]

title = f"Most Kagglers are {comm_val}s ({comm_pct:.0f}%)<br>"
if country_comm_val==comm_val:
    title = f"{title}In {country} too, most Kagglers are {country_comm_val}s ({country_comm_pct:.0f}%)"
else:
    title = f"{title}However, in {country}, most Kagglers are {country_comm_val}s ({country_comm_pct:.0f}%)"

donut_subplots(df, question_text, country, title)

##### Barchart with global average for each job title #####
bar_with_mean_cat(df, question_text, country, 'reported their job-title as')

## 6. Annual Compensation (in USD)
<font color=red>Only includes respondents who have answered the question '<b>What is your current yearly compensation?</b>'<br>Each compensation bracket is represented by its midpoint; the open-ended bracket above 1,000,000 is assumed to top out at 5,000,000 for calculation purposes</font>
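
The bracket-to-midpoint conversion described above can be sketched as follows. This is a simplified stand-alone version, not the notebook's exact code; the helper name `compensation_midpoint` and its `open_ended_cap` parameter are hypothetical.

```python
def compensation_midpoint(bucket: str, open_ended_cap: int = 5_000_000) -> float:
    """Return the midpoint of a compensation bracket such as '$0-999'.

    `open_ended_cap` is the assumed upper bound for the open-ended top bracket.
    """
    cleaned = bucket.replace("$", "").replace(",", "").replace(">", "").strip()
    parts = cleaned.split("-")
    low = int(parts[0])
    # An open-ended bracket has no upper bound, so the assumed cap is used instead.
    high = int(parts[1]) if len(parts) > 1 else open_ended_cap
    return (low + high) / 2


sample = ["$0-999", "1,000-1,999", ">$1,000,000"]
midpoints = [compensation_midpoint(bucket) for bucket in sample]
```

Because the top bracket's midpoint depends entirely on the assumed cap, country averages with even a few respondents in that bracket are sensitive to this choice.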

In [None]:
question_text = 'What is your current yearly compensation (approximate $USD)?'

df_filtered = df[~df[question_text].isna()]

##### Barchart with global average #####
df_filtered['comp1'] = df_filtered[question_text].str.split('-').str[0].apply(lambda x: x.replace(',','').replace('$','').replace('>','')).astype('int')
df_filtered['comp2'] = df_filtered[question_text].str.split('-').str[1].fillna('5000000').apply(lambda x: x.replace(',','')).astype('int')
df_filtered['Annual Compensation (USD)'] = (df_filtered.comp1+df_filtered.comp2)/2

global_mean = df_filtered['Annual Compensation (USD)'].mean()
country_mean = df_filtered[df_filtered.country_agg==country]['Annual Compensation (USD)'].mean()

if country_mean <= global_mean:
    title = f"With an average annual compensation of {country_mean:.0f} USD,<br>Kagglers from {country} generally earn less than the global average ({global_mean:.0f} USD)"
else:
    title = f"With an average annual compensation of {country_mean:.0f} USD,<br>Kagglers from {country} generally earn more than the global average ({global_mean:.0f} USD)"

bar_with_mean_num(df_filtered, 'Annual Compensation (USD)', country, global_mean, title)

##### Donut plot #####
most_common = df_filtered[df_filtered.country_agg==country].groupby(df_filtered[question_text]).size().sort_values(ascending=False)
most_common_val = most_common.index[0]
most_common_pct = most_common.iloc[0]*100/most_common.sum()

fig = px.pie(df_filtered[df_filtered.country_agg==country], question_text, 
             title=f'{most_common_pct:.0f}% of Kagglers from {country} reported an annual compensation in the range {most_common_val}', 
             hole=0.6)
fig.update_traces(textposition='inside',textinfo='percent+label')
fig.update_layout(uniformtext_minsize=10, uniformtext_mode='hide')
fig.show()

##### Distribution compared to rest of the world #####
# Build a clean, numerically sortable label per bracket from the bounds computed earlier in this cell
df_filtered['clean_comp'] = df_filtered['comp1'].apply(str) + '-' + df_filtered['comp2'].apply(str)
categories = sorted(set(df_filtered['clean_comp'].values), key=(lambda x: int(x.split('-')[0])))

# Count on the clean labels and align both series to `categories`,
# so every bar lines up with its x-axis label and missing brackets become 0
df_country_agg = df_filtered[df_filtered.country_agg==country]['clean_comp'].value_counts(normalize=True).reindex(categories, fill_value=0)
df_others_agg = df_filtered[df_filtered.country_agg=='Others']['clean_comp'].value_counts(normalize=True).reindex(categories, fill_value=0)

fig = go.Figure(data=[
    go.Bar(name=country, x=categories, y=df_country_agg.values*100),
    go.Bar(name='Others', x=categories, y=df_others_agg.values*100)
])

fig.update_layout(
    title=f'Annual Compensation (USD) of Kagglers from {country} compared to others',
    xaxis_title='Annual Compensation (USD)',
    yaxis_title='Percentage',
    xaxis={'categoryorder':'array',
           'categoryarray':categories}
)
fig.show()

## 7. Coding Experience
<font color=red> Only includes respondents who answered the question "<b>For how many years have you been writing code and/or programming?</b>"<br>Each experience bracket is represented by its midpoint; the open-ended bracket above 20 years is assumed to top out at 25 years for calculation purposes</font>
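
A minimal sketch of how an experience answer maps to a number of years under these assumptions. The helper name `experience_midpoint` and its `open_ended_cap` parameter are hypothetical, and the function is a simplified stand-in for the column-wise pandas code in the next cell.

```python
def experience_midpoint(answer: str, open_ended_cap: int = 25) -> float:
    """Return the midpoint, in years, of a coding-experience answer."""
    if answer == "I have never written code":
        return 0.0
    if answer.startswith("<"):
        # '< 1 years' is treated as the range 0-1
        return 0.5
    bounds = answer.split()[0]  # '1-3 years' -> '1-3', '20+ years' -> '20+'
    if bounds.endswith("+"):
        # Open-ended bracket: use the assumed cap as the upper bound
        low, high = int(bounds.rstrip("+")), open_ended_cap
    else:
        low, high = (int(bound) for bound in bounds.split("-"))
    return (low + high) / 2


answers = ["I have never written code", "< 1 years", "1-3 years", "10-20 years", "20+ years"]
years = [experience_midpoint(answer) for answer in answers]
```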

In [None]:
question_text = 'For how many years have you been writing code and/or programming?'

df_filtered = df[~df[question_text].isna()]

##### Barchart with global average #####
df_filtered[question_text] = df_filtered[question_text].replace('< 1 years','0-1 years').replace('I have never written code','0-0').apply(lambda x: x.split()[0])
df_filtered['code1'] = df_filtered[question_text].str.split('-').str[0].replace('20+','20').astype('int')
df_filtered['code2'] = df_filtered[question_text].str.split('-').str[1].fillna('25').astype('int')
df_filtered.groupby([df_filtered[question_text]]+['code1','code2']).size()
df_filtered['Coding Experience'] = (df_filtered.code1+df_filtered.code2)/2

global_mean = df_filtered['Coding Experience'].mean()
country_mean = df_filtered[df_filtered.country_agg==country]['Coding Experience'].mean()

if country_mean <= global_mean:
    title = f"The average Kaggler from {country} has been coding for {country_mean:.1f} years,<br>less than the global average of {global_mean:.1f} years"
else:
    title = f"The average Kaggler from {country} has been coding for {country_mean:.1f} years,<br>more than the global average of {global_mean:.1f} years"

bar_with_mean_num(df_filtered, 'Coding Experience', country, global_mean, title)

##### Donut plot #####
most_common = df_filtered[df_filtered.country_agg==country].groupby(df_filtered[question_text]).size().sort_values(ascending=False)
most_common_val = most_common.index[0]
most_common_pct = most_common.iloc[0]*100/most_common.sum()

fig = px.pie(df_filtered[df_filtered.country_agg==country], question_text, 
             title=f'{most_common_pct:.0f}% of Kagglers from {country} have been coding for {most_common_val} years', 
             hole=0.6)
fig.update_traces(textposition='inside',textinfo='percent+label')
fig.update_layout(uniformtext_minsize=10, uniformtext_mode='hide')
fig.show()

##### Distribution compared to rest of the world #####

categories = ['I have never written code','< 1 years','1-3 years','3-5 years','5-10 years','10-20 years','20+ years']

df_country_agg = df[df.country_agg==country][question_text].value_counts(normalize=True)
df_country_agg.index = pd.Categorical(df_country_agg.index, categories)
df_country_agg.sort_index(inplace=True)

df_others_agg = df[df.country_agg=='Others'][question_text].value_counts(normalize=True)
df_others_agg.index = pd.Categorical(df_others_agg.index, categories)
df_others_agg.sort_index(inplace=True)

for index in df_others_agg.index:
        if index not in df_country_agg.index:
            df_country_agg[index] = 0
df_country_agg.sort_index(inplace=True)

fig = go.Figure(data=[
    go.Bar(name=country, x=categories, y=df_country_agg.values*100),
    go.Bar(name='Others', x=categories, y=df_others_agg.values*100)
])
# Set the titles; plotly groups the two bar traces side by side by default
fig.update_layout(
    title=f'Coding Experience of respondents from {country} compared to Other countries',
    xaxis_title=None,
    yaxis_title='Percentage'
)
fig.show()

# Personal Preferences
## 8. Programming language
<font color=red>Only includes respondents who chose at least one option for the question "<b>What programming languages do you use on a regular basis?</b>"</font>
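
For multiple-choice questions, each option is stored in its own column, with a value only when the respondent ticked it. The sketch below shows, on made-up data, how respondents who chose no option can be excluded before computing each option's share. The column names are illustrative and the code is a simplified version of what `global_and_country_bars` does.

```python
import pandas as pd

# Toy multi-select data: one column per option, empty when the option was not ticked.
responses = pd.DataFrame({
    "country": ["India", "India", "USA", "USA"],
    "Python": ["Python", None, "Python", None],
    "R": [None, None, "R", None],
})
option_cols = ["Python", "R"]

# Keep only respondents who ticked at least one option.
answered = responses.dropna(how="all", subset=option_cols)

# Percentage of those respondents who ticked each option.
share = answered[option_cols].count() / len(answered) * 100
```

Because a respondent can tick several options, these percentages can add up to more than 100.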

In [None]:
global_and_country_bars(df, 'What programming languages do you use on a regular basis?', country)

## 9. IDE
<font color=red>Only includes respondents who chose at least one option for the question '<b>Which of the following integrated development environments (IDE's) do you use on a regular basis?</b>'</font>

In [None]:
global_and_country_bars(df, "Which of the following integrated development environments (IDE's) do you use on a regular basis?", country)

## 10. Hosted Notebooks
<font color=red>Only includes respondents who chose at least one option for the question '<b>Which of the following hosted notebook products do you use on a regular basis?</b>'</font>

In [None]:
global_and_country_bars(df, 'Which of the following hosted notebook products do you use on a regular basis?', country)

## 11. Computing Platform

<font color=red>Only includes respondents who answered the question '<b>What type of computing platform do you use most often for your data science projects?</b>'</font>
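
For a single-choice question like this one, the comparison comes down to the share of one answer among respondents who answered, globally and within the chosen country. A minimal sketch on made-up data follows; the column names `country` and `platform` are illustrative, not the survey's.

```python
import pandas as pd

# Toy single-choice data; a missing value means the question was skipped.
answers = pd.DataFrame({
    "country": ["India", "India", "USA", "USA", "USA"],
    "platform": ["A laptop", "A laptop", "A laptop", "A personal computer / desktop", None],
})
choice = "A laptop"

# Drop skipped questions and the literal answer 'None'.
answered = answers[answers["platform"].notna() & (answers["platform"] != "None")]

# Share of the chosen answer globally and within one country.
global_pct = (answered["platform"] == choice).mean() * 100
country_pct = (answered.loc[answered["country"] == "India", "platform"] == choice).mean() * 100
```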

In [None]:
question_text = 'What type of computing platform do you use most often for your data science projects? - Selected Choice'
answer_choice= 'A deep learning workstation (NVIDIA GTX, LambdaLabs, etc)'

df_filtered = df[(~df[question_text].isna()) & (df[question_text]!='None')]

##### Donut subplots #####
overall_pct = len(df_filtered[(df_filtered[question_text]==answer_choice)])*100/len(df_filtered)

country_pct = len(df_filtered[(df_filtered[COUNTRY_QUESTION_TEXT]==country) 
                 & (df_filtered[question_text]==answer_choice)])*100/len(df_filtered[df_filtered[COUNTRY_QUESTION_TEXT]==country])

if country_pct < overall_pct:
    title = f"In {country}, this decreases to {country_pct:.0f}%."
elif country_pct > overall_pct:
    title = f"In {country}, this increases to {country_pct:.0f}%."
else:
    title = f"This is the same as in {country}."

donut_subplots(df_filtered, question_text, country, f"{overall_pct:.0f}% of all Kagglers use {answer_choice} most often<br>{title}")

##### Distribution compared to rest of the world #####
plot_distribution(df_filtered, question_text, country, f'Computing platform preference of Kagglers from {country} compared to others')

## 12. Data Visualization Library
<font color=red>Only includes respondents who chose at least one option for the question '<b>What data visualization libraries or tools do you use on a regular basis?</b>'</font>

In [None]:
global_and_country_bars(df, 'What data visualization libraries or tools do you use on a regular basis?', country)

## 13. Machine Learning Frameworks
<font color=red>Only includes respondents who chose at least one option for the question '<b>Which of the following machine learning frameworks do you use on a regular basis?</b>'</font>

In [None]:
global_and_country_bars(df, 'Which of the following machine learning frameworks do you use on a regular basis?', country)

## 14. Machine Learning Algorithms
<font color=red>Only includes respondents who chose at least one option for the question '<b>Which of the following ML algorithms do you use on a regular basis?</b>'</font>

In [None]:
global_and_country_bars(df, 'Which of the following ML algorithms do you use on a regular basis?', country)

## 15. Computer Vision Methods
<font color=red>Only includes respondents who chose at least one option for the question '<b>Which categories of computer vision methods do you use on a regular basis?</b>'</font>

In [None]:
global_and_country_bars(df,  'Which categories of computer vision methods do you use on a regular basis?', country)

## 16. NLP Methods
<font color=red>Only includes respondents who chose at least one option for the question '<b>Which of the following natural language processing (NLP) methods do you use on a regular basis?</b>'</font>

In [None]:
global_and_country_bars(df, 'Which of the following natural language processing (NLP) methods do you use on a regular basis?', country)

## 17. Cloud Computing Platforms
<font color=red>Only includes respondents who chose at least one option for the question '<b>Which of the following cloud computing platforms do you use on a regular basis?</b>'</font>

In [None]:
global_and_country_bars(df, 'Which of the following cloud computing platforms do you use on a regular basis?', country)

## 18. Cloud Computing Products
<font color=red>Only includes respondents who chose at least one option for the question '<b>Do you use any of the following cloud computing products on a regular basis?</b>'</font>

In [None]:
global_and_country_bars(df, 'Do you use any of the following cloud computing products on a regular basis?', country)

## 19. Machine Learning Products
<font color=red>Only includes respondents who chose at least one option for the question '<b>Do you use any of the following managed machine learning products on a regular basis?</b>'</font>

In [None]:
global_and_country_bars(df, "Do you use any of the following managed machine learning products on a regular basis?", country)

## 20. Big Data Products
<font color=red>Only includes respondents who chose at least one option for the questions '<b>Which of the following big data products (relational databases, data warehouses, data lakes, or similar) do you use on a regular basis?</b>' and '<b>Which of the following big data products (relational database, data warehouse, data lake, or similar) do you use most often?</b>'</font>

In [None]:
multi_question_text = "Which of the following big data products (relational databases, data warehouses, data lakes, or similar) do you use on a regular basis?"
single_question_text = "Which of the following big data products (relational database, data warehouse, data lake, or similar) do you use most often? - Selected Choice"

##### Bar charts #####
global_and_country_bars(df, multi_question_text, country)

##### Donut subplots #####
get_popular_answer_and_plot(df, single_question_text, country)

##### Distribution plot #####
plot_distribution(df, single_question_text, country, f"Big Data product preference of Kagglers from {country} compared to others")

## 21. Business Intelligence Tools
<font color=red>Only includes respondents who chose at least one option for the question '<b>Which of the following business intelligence tools do you use on a regular basis?</b>'</font>

In [None]:
multi_question_text = "Which of the following business intelligence tools do you use on a regular basis?"
single_question_text = "Which of the following business intelligence tools do you use most often? - Selected Choice"

##### Bar charts #####
global_and_country_bars(df, multi_question_text, country)

##### Donut subplots #####
get_popular_answer_and_plot(df, single_question_text, country)

##### Distribution plot #####
plot_distribution(df, single_question_text, country, f"BI Tool preference of Kagglers from {country} compared to others")

## 22. ML Lifecycle Automation Tools
<font color=red>Only includes respondents who chose at least one option for the question '<b>Do you use any automated machine learning tools (or partial AutoML tools) on a regular basis?</b>'</font>

In [None]:
question_text = "Do you use any automated machine learning tools (or partial AutoML tools) on a regular basis?"

##### Bar charts #####
global_and_country_bars(df, question_text, country)

## 23. AutoML Tools
<font color=red>Only includes respondents who chose at least one option for the question '<b>Which of the following automated machine learning tools (or partial AutoML tools) do you use on a regular basis?</b>'</font>

In [None]:
question_text = "Which of the following automated machine learning tools (or partial AutoML tools) do you use on a regular basis?"

##### Bar charts #####
global_and_country_bars(df, question_text, country)

## 24. ML Experiment Management Tools
<font color=red>Only includes respondents who chose at least one option for the question '<b>Do you use any tools to help manage machine learning experiments?</b>'</font>

In [None]:
question_text = "Do you use any tools to help manage machine learning experiments?"

##### Bar charts #####
global_and_country_bars(df, question_text, country)

## 25. Data Analysis Sharing/ML Experiment Deployment Platform
<font color=red>Only includes respondents who chose at least one option for the question '<b>Where do you publicly share your data analysis or machine learning applications?</b>'</font>

In [None]:
question_text = "Where do you publicly share your data analysis or machine learning applications?"

##### Bar charts #####
global_and_country_bars(df, question_text, country, ['I do not share my work publicly'])

## 26. TPU Usage
<font color=red>Only includes respondents who have answered the question '<b>Approximately how many times have you used a TPU (tensor processing unit)?</b>'</font>

In [None]:
question_text = 'Approximately how many times have you used a TPU (tensor processing unit)?'
answer_choice = 'Never'

df_filtered = df[(~df[question_text].isna()) & (df[question_text]!='None')]

##### Donut subplots #####
overall_pct = len(df_filtered[(df_filtered[question_text]!=answer_choice)])*100/len(df_filtered)

country_pct = len(df_filtered[(df_filtered[COUNTRY_QUESTION_TEXT]==country) 
                 & (df_filtered[question_text]!=answer_choice)])*100/len(df_filtered[df_filtered[COUNTRY_QUESTION_TEXT]==country])

if country_pct < overall_pct:
    title = f"Kagglers from {country} are less experienced with TPUs, with {country_pct:.1f}% having used them at least once."
elif country_pct > overall_pct:
    title = f"Kagglers from {country} are more experienced with TPUs, with {country_pct:.1f}% having used them at least once."
else:
    title = f"{country_pct:.1f}% of Kagglers from {country} have also used a TPU at least once."

donut_subplots(df_filtered, question_text, country, f"{overall_pct:.1f}% of all Kagglers have used a TPU at least once<br>{title}")

##### Distribution compared to rest of the world #####
plot_distribution(df_filtered, question_text, country, f'TPU Usage of Kagglers from {country} compared to others', 'TPU Usage')

# Learning Preferences
## 27. Data Science Learning Platform
<font color=red>Only includes respondents who chose at least one option for the question '<b>On which platforms have you begun or completed data science courses?</b>'</font>

In [None]:
question_text = "On which platforms have you begun or completed data science courses?"

##### Bar charts #####
global_and_country_bars(df, question_text, country)

## 28. Data Science Media Sources
<font color=red>Only includes respondents who chose at least one option for the question '<b>Who/what are your favorite media sources that report on data science topics?</b>'</font>

In [None]:
question_text = "Who/what are your favorite media sources that report on data science topics?"

##### Bar charts #####
global_and_country_bars(df, question_text, country)

## 29. Cloud Computing Platforms
<font color=red>Only includes respondents who chose at least one option for the question '<b>Which of the following cloud computing platforms do you hope to become more familiar with in the next 2 years?</b>'</font>

In [None]:
question_text = "Which of the following cloud computing platforms do you hope to become more familiar with in the next 2 years?"

##### Bar charts #####
global_and_country_bars(df, question_text, country)

## 30. Cloud Computing Products
<font color=red>Only includes respondents who chose at least one option for the question '<b>In the next 2 years, do you hope to become more familiar with any of these specific cloud computing products?</b>'</font>

In [None]:
question_text = "In the next 2 years, do you hope to become more familiar with any of these specific cloud computing products?"

##### Bar charts #####
global_and_country_bars(df, question_text, country)

## 31. Machine Learning Products
<font color=red>Only includes respondents who chose at least one option for the question '<b>In the next 2 years, do you hope to become more familiar with any of these managed machine learning products?</b>'</font>

In [None]:
question_text = "In the next 2 years, do you hope to become more familiar with any of these managed machine learning products?"

##### Bar charts #####
global_and_country_bars(df, question_text, country)

## 32. Big Data Products
<font color=red>Only includes respondents who chose at least one option for the question '<b>Which of the following big data products (relational databases, data warehouses, data lakes, or similar) do you hope to become more familiar with in the next 2 years?</b>'</font>

In [None]:
question_text = "Which of the following big data products (relational databases, data warehouses, data lakes, or similar) do you hope to become more familiar with in the next 2 years?"

##### Bar charts #####
global_and_country_bars(df, question_text, country)

## 33. Business Intelligence Tools
<font color=red>Only includes respondents who chose at least one option for the question '<b>Which of the following business intelligence tools do you hope to become more familiar with in the next 2 years?</b>'</font>

In [None]:
question_text = "Which of the following business intelligence tools do you hope to become more familiar with in the next 2 years?"

##### Bar charts #####
global_and_country_bars(df, question_text, country)

## 34. ML Workflow Automation Tools
<font color=red>Only includes respondents who chose at least one option for the question '<b>Which categories of automated machine learning tools (or partial AutoML tools) do you hope to become more familiar with in the next 2 years?</b>'</font>

In [None]:
question_text = "Which categories of automated machine learning tools (or partial AutoML tools) do you hope to become more familiar with in the next 2 years?"

##### Bar charts #####
global_and_country_bars(df, question_text, country)

## 35. Machine Learning / Partial AutoML Tools
<font color=red>Only includes respondents who chose at least one option for the question '<b>Which specific automated machine learning tools (or partial AutoML tools) do you hope to become more familiar with in the next 2 years?</b>'</font>

In [None]:
question_text = "Which specific automated machine learning tools (or partial AutoML tools) do you hope to become more familiar with in the next 2 years?"

##### Bar charts #####
global_and_country_bars(df, question_text, country)

## 36. ML Experiment Management Tools
<font color=red>Only includes respondents who chose at least one option for the question '<b>In the next 2 years, do you hope to become more familiar with any of these tools for managing ML experiments?</b>'</font>

In [None]:
question_text = "In the next 2 years, do you hope to become more familiar with any of these tools for managing ML experiments?"

##### Bar charts #####
global_and_country_bars(df, question_text, country)

# Workplace

## 37. Size
<font color=red>Only includes respondents who answered the question '<b>What is the size of the company where you are employed?</b>'</font>
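
The overall and country-level percentages shown in the donut titles of this section can be sketched as one helper. This is a minimal sketch under stated assumptions, not the notebook's own code: `share_with_answer` and `COUNTRY_COL` are hypothetical names (the latter stands in for `COUNTRY_QUESTION_TEXT`), and the toy frame is made-up data.

```python
import pandas as pd

COUNTRY_COL = "country"  # hypothetical stand-in for COUNTRY_QUESTION_TEXT


def share_with_answer(df, question_text, answers, country=None):
    """Percentage of respondents whose answer is in `answers`, among those who answered."""
    # Drop respondents who skipped the question (NaN) or whose answer is the string 'None'.
    answered = df[df[question_text].notna() & (df[question_text] != "None")]
    if country is not None:
        answered = answered[answered[COUNTRY_COL] == country]
    if answered.empty:
        return float("nan")  # no respondents left, so the share is undefined
    return answered[question_text].isin(answers).mean() * 100


# Toy data (not real survey results).
toy = pd.DataFrame({
    COUNTRY_COL: ["India", "India", "Japan", "Japan", "Japan"],
    "size": ["0-49 employees", "1000-9,999 employees", "0-49 employees", None, "50-249 employees"],
})
small = ["0-49 employees", "50-249 employees", "250-999 employees"]

overall_pct = share_with_answer(toy, "size", small)           # 3 of the 4 who answered
country_pct = share_with_answer(toy, "size", small, "India")  # 1 of the 2 who answered
print(overall_pct, country_pct)
```

Returning NaN for an empty group avoids a division by zero when the chosen country has no respondents for a question.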

In [None]:
question_text = 'What is the size of the company where you are employed?'
answer_choice = ['0-49 employees','50-249 employees','250-999 employees']

df_filtered = df[(~df[question_text].isna()) & (df[question_text]!='None')]

##### Donut subplots #####
overall_pct = len(df_filtered[(df_filtered[question_text].isin(answer_choice))])*100/len(df_filtered)
country_pct = len(df_filtered[(df_filtered[COUNTRY_QUESTION_TEXT]==country) 
                 & (df_filtered[question_text].isin(answer_choice))])*100/len(df_filtered[df_filtered[COUNTRY_QUESTION_TEXT]==country])

if country_pct < overall_pct:
    title = f"For {country}, this percentage decreases to {country_pct:.0f}%<br>Kagglers from {country} tend to work in larger companies"
elif country_pct > overall_pct:
    title = f"For {country}, this percentage increases to {country_pct:.0f}%<br>Kagglers from {country} tend to work in smaller companies"
else:
    title = f"At {country_pct:.0f}%, it is the same for {country} too"

donut_subplots(df_filtered, question_text, country, f"{overall_pct:.0f}% of all Kagglers work in companies with fewer than 1,000 employees<br>{title}")

##### Distribution compared to rest of the world #####
plot_distribution(df_filtered, question_text, country, f'Workplace size of Kagglers from {country} compared to others')

## 38. Data Science Team Size
<font color=red>Only includes respondents who answered the question '<b>Approximately how many individuals are responsible for data science workloads at your place of business?</b>'</font>

In [None]:
question_text = 'Approximately how many individuals are responsible for data science workloads at your place of business?'
answer_choice = ['0','1-2','3-4']

df_filtered = df[(~df[question_text].isna()) & (df[question_text]!='None')]

##### Donut subplots #####
overall_pct = len(df_filtered[(df_filtered[question_text].isin(answer_choice))])*100/len(df_filtered)
country_pct = len(df_filtered[(df_filtered[COUNTRY_QUESTION_TEXT]==country) 
                 & (df_filtered[question_text].isin(answer_choice))])*100/len(df_filtered[df_filtered[COUNTRY_QUESTION_TEXT]==country])

if country_pct < overall_pct:
    title = f"For {country}, this percentage decreases to {country_pct:.0f}%<br>Kagglers from {country} work in companies with larger Data Science teams"
elif country_pct > overall_pct:
    title = f"For {country}, this percentage increases to {country_pct:.0f}%<br>Kagglers from {country} work in companies with smaller Data Science teams"
else:
    title = f"At {country_pct:.0f}%, it is the same for {country} too"

donut_subplots(df_filtered, question_text, country, 
               f"{overall_pct:.0f}% of all Kagglers work in companies with fewer than 5 individuals handling Data Science workloads<br>{title}")

##### Distribution compared to rest of the world #####
plot_distribution(df_filtered, question_text, country, f'Company Data Science Team size of Kagglers from {country} compared to others')

## 39. Machine Learning Adoption at Work
<font color=red>Only includes respondents who answered the question '<b>Does your current employer incorporate machine learning methods into their business?</b>'</font>

In [None]:
question_text = 'Does your current employer incorporate machine learning methods into their business?'
answer_choice = ['No (we do not use ML methods)']

df_filtered = df[(~df[question_text].isna()) & (df[question_text]!='None')]

##### Donut subplots #####
overall_pct = len(df_filtered[(df_filtered[question_text].isin(answer_choice))])*100/len(df_filtered)
country_pct = len(df_filtered[(df_filtered[COUNTRY_QUESTION_TEXT]==country) 
                 & (df_filtered[question_text].isin(answer_choice))])*100/len(df_filtered[df_filtered[COUNTRY_QUESTION_TEXT]==country])

if country_pct < overall_pct:
    title = f"For {country}, this percentage decreases to {country_pct:.0f}%<br>More Kagglers from {country} work in a company using ML methods"
elif country_pct > overall_pct:
    title = f"For {country}, this percentage increases to {country_pct:.0f}%<br>Fewer Kagglers from {country} work in a company using ML methods"
else:
    title = f"At {country_pct:.0f}%, it is the same for {country} too"

donut_subplots(df_filtered, question_text, country, 
               f"{overall_pct:.0f}% of all Kagglers work in companies that don't use ML methods<br>{title}")

##### Distribution compared to rest of the world #####
plot_distribution(df_filtered, question_text, country, f'ML adoption in workplaces of Kagglers from {country} compared to others')

## 40. Role at Work
<font color=red>Only includes respondents who chose at least one option for the question '<b>Select any activities that make up an important part of your role at work</b>'</font>

In [None]:
question_text = "Select any activities that make up an important part of your role at work"

##### Bar charts #####
global_and_country_bars(df, question_text, country, drop_vals=False, title=False)

## 41. Spend on ML/Cloud
<font color=red>Only includes respondents who answered the question '<b>Approximately how much money have you (or your team) spent on machine learning and/or cloud computing services at home (or at work) in the past 5 years (approximate $USD)?</b>'</font>

In [None]:
question_text = 'Approximately how much money have you (or your team) spent on machine learning and/or cloud computing services at home (or at work) in the past 5 years (approximate $USD)?'
answer_choice = ['0 (USD)']

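# Copy the filtered rows so the column assignment below does not write to a slice of df
df_filtered = df[(~df[question_text].isna()) & (df[question_text]!='None')].copy()
# Strip '$' and ',' from the answer labels so that they match answer_choice above
df_filtered[question_text] = df_filtered[question_text].str.replace('$', '', regex=False).str.replace(',', '', regex=False)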

##### Donut subplots #####
overall_pct = len(df_filtered[(df_filtered[question_text].isin(answer_choice))])*100/len(df_filtered)
country_pct = len(df_filtered[(df_filtered[COUNTRY_QUESTION_TEXT]==country) 
                 & (df_filtered[question_text].isin(answer_choice))])*100/len(df_filtered[df_filtered[COUNTRY_QUESTION_TEXT]==country])


if country_pct < overall_pct:
    title = f"For {country}, this percentage decreases to {country_pct:.1f}%<br>Fewer Kagglers from {country} work in a company not spending on ML/Cloud"
elif country_pct > overall_pct:
    title = f"For {country}, this percentage increases to {country_pct:.1f}%<br>More Kagglers from {country} work in a company not spending on ML/Cloud"
else:
    title = f"At {country_pct:.1f}%, it is the same for {country} too"

donut_subplots(df_filtered, question_text, country, 
               f"{overall_pct:.1f}% of all Kagglers work in companies that don't spend on ML/Cloud<br>{title}")

##### Distribution compared to rest of the world #####
plot_distribution(df_filtered, question_text, country, f'Spend on ML/Cloud in workplaces of Kagglers from {country} compared to others')

## 42. Primary Tool used at Work
<font color=red>Only includes respondents who answered the question '<b>What is the primary tool that you use at work or school to analyze data?</b>'</font>

In [None]:
question_text = 'What is the primary tool that you use at work or school to analyze data? (Include text response) - Selected Choice'
answer_choice = ['Cloud-based data software & APIs (AWS, GCP, Azure, etc.)']

df_filtered = df[(~df[question_text].isna()) & (df[question_text]!='None')]

##### Donut subplots #####
overall_pct = len(df_filtered[(df_filtered[question_text].isin(answer_choice))])*100/len(df_filtered)
country_pct = len(df_filtered[(df_filtered[COUNTRY_QUESTION_TEXT]==country) 
                 & (df_filtered[question_text].isin(answer_choice))])*100/len(df_filtered[df_filtered[COUNTRY_QUESTION_TEXT]==country])


if country_pct < overall_pct:
    title = f"For {country}, this percentage decreases to {country_pct:.1f}%<br>Fewer Kagglers from {country} use cloud-based data software to analyze data"
elif country_pct > overall_pct:
    title = f"For {country}, this percentage increases to {country_pct:.1f}%<br>More Kagglers from {country} use cloud-based data software to analyze data"
else:
    title = f"At {country_pct:.1f}%, it is the same for {country} too"

donut_subplots(df_filtered, question_text, country, 
               f"{overall_pct:.1f}% of all Kagglers use cloud-based data software to analyze data<br>{title}")

##### Distribution compared to rest of the world #####
plot_distribution(df_filtered, question_text, country, f'Primary Tool used at Work in workplaces of Kagglers from {country} compared to others')

# Thanks!

Thanks for going through my notebook.  
If you liked it, please don't forget to <font color=green><b>UPVOTE</b></font>, and suggest what else you would like to see added.  
I would also love to know which analysis you found particularly interesting, and why.

<b>I am working on making the code as reusable as possible across questions/plots/countries. If you see any further scope for making the code more modular, please do let me know :)</b>
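
One possible step in that direction is to move the repeated `if`/`elif`/`else` title logic into a single function. The sketch below is only a suggestion: `comparison_title` is a hypothetical helper, its wording is a generic placeholder, and unlike the cells above it compares the rounded values, so that the "same" branch triggers whenever the two displayed numbers are equal.

```python
def comparison_title(country: str, country_pct: float, overall_pct: float, decimals: int = 0) -> str:
    """Return a sentence saying how the country's share compares with the overall share."""
    # Compare the rounded values so the wording matches the numbers the reader sees.
    shown_country = round(country_pct, decimals)
    shown_overall = round(overall_pct, decimals)
    if shown_country < shown_overall:
        return f"In {country}, this decreases to {country_pct:.{decimals}f}%."
    if shown_country > shown_overall:
        return f"In {country}, this increases to {country_pct:.{decimals}f}%."
    return f"At {country_pct:.{decimals}f}%, it is the same in {country}."


print(comparison_title("India", 41.6, 48.2))
print(comparison_title("India", 48.4, 48.2))
```

Each section could then pass its own question-specific second sentence separately, keeping the comparison logic in one place.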

## Check out my similar works on the Kaggle 2020 survey
* [Kaggle 2020: Your Country VS the World](https://www.kaggle.com/siddhantsadangi/kaggle-2020-your-country-vs-the-world)
* [Kaggle 2020: India VS the World](https://www.kaggle.com/siddhantsadangi/kaggle-2020-india-vs-the-world-all-questions)
* [Kaggle 2020: USA VS the World](https://www.kaggle.com/siddhantsadangi/kaggle-2020-usa-vs-the-world-all-questions)