In [None]:
%%HTML
<style type="text/css">

div.caveat{
    background-color: yellow;
    padding: 5px;
    color: black;
    font-family: "Arial Black", Gadget, sans-serif;
    font-size: 20px;
    max-width: 1500px;
    margin: auto; 
    margin-top: 5px;
}

div.section{
    background-color: #008abc;
    color: white;
    padding: 5px;
    font-size: 20px;
    max-width: 1500px;
    margin: auto; 
    margin-top: 5px;
}

div.subsection{
    color: #f8766d;
    padding: 5px;
    font-size: 20px;
    max-width: 1500px;
    margin: auto; 
    margin-top: 5px;
}

div.subsubsection{
    color: #000000;
    padding: 5px;
    font-size: 20px;
    max-width: 1500px;
    margin: auto; 
    margin-top: 5px;
}

div.extra{
    color: #5cae71;
    padding: 5px;
    font-size: 15px;
    max-width: 1500px;
    margin: auto; 
    margin-top: 5px;
}

div.extra2{
    color: #949599;
    padding: 5px;
    font-size: 15px;
    max-width: 1500px;
    margin: auto; 
    margin-top: 5px;
}

</style>

<h1><center>Growing into the Future</center></h1>
<center><i>What Kaggle can do in order to fortify a data-driven future, using the context of young Indian data science aspirants</i></center>

![](https://github.com/ry05/kag_survey_analysis/blob/main/rise.png?raw=true)

The ubiquity of data science in the current world cannot be questioned. Though recognized *officially* only within this decade, data science has been delivering impact for a very long time, albeit in less glamorous forms than the cloak it wears today. It is a well-propagated idea that this discipline holds the key to how the world will function in the future (Davenport and Patil, 2012). In fact, every field known to mankind, from medicine to education and from crime to conservation, stands to benefit from the ability to make decisions with data (many already do). 

However, I believe that the most important component required to realize this dream of a data-driven future is not *the discipline itself*. The future of data science, like that of any other growing field, depends on *those who will work in it*. In other words, those who currently refer to themselves as *data science aspirants* hold the key to furthering this field. 

On those grounds, it then becomes important to understand the current state of these aspirants.

In this report, I explore the Kaggle survey responses of **young data science aspirants from India** and try to understand their current state in data science by dissecting my findings across multiple themes.

> "We cannot always build the future for our youth, but we can build the youth for the future" - Franklin D. Roosevelt

<div class='caveat'>Caveat</div>

This notebook contains interpretations that *I find to be true* based on my readings and analysis. Yet, I must concede to the reader that some of the interpretations are born out of my own personal experiences as a data science aspirant. **I believe that a small degree of personal opinion in an analysis aids its explanation.** However, if you believe otherwise, I have marked these personal interpretations with a **🔎 (magnifying glass) emoji**. You can choose to skip these sections if personal experiences fail to interest you.

Another important point to note is that since this analysis is based on a survey, the results may be affected by [sampling error](https://www.investopedia.com/terms/s/samplingerror.asp). The reader is encouraged to keep this in mind while reading the report. 
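To give a rough sense of how large sampling error can be, the sketch below computes a normal-approximation 95% margin of error for a survey proportion. The function name `margin_of_error` and the example numbers (a 30% share among 1,000 respondents) are illustrative assumptions, not figures from the survey. The formula also assumes simple random sampling, which an opt-in online survey does not guarantee, so treat the result as a lower bound on the uncertainty.

```python
import math


def margin_of_error(p_hat, n, z=1.96):
    """Normal-approximation margin of error for a sample proportion.

    p_hat : observed proportion (between 0 and 1)
    n     : number of respondents the proportion is based on
    z     : critical value (1.96 corresponds to roughly 95% confidence)
    """
    return z * math.sqrt(p_hat * (1 - p_hat) / n)


# hypothetical numbers, for illustration only
moe = margin_of_error(p_hat=0.30, n=1000)
print(f"30% +/- {moe * 100:.1f} percentage points")
```

The smaller the subgroup, the wider this interval becomes, which matters most for the less common categories discussed later in the report.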

<div class='section'>Table of Contents</div>

This notebook is split into the following main sections:

1. <a href='#1'>Reading the Notebook</a><br>
2. <a href='#2'>Summary</a><br>
3. <a href='#3'>Choosing the Community</a><br>
4. <a href='#4'>Why Must this Community be Studied?</a><br>
5. <a href='#5'>A Broad Overview of the IRU21 Community</a><br>
6. <a href='#6'>The Challenges of the IRU21 Community</a><br>
7. <a href='#7'>References</a><br>
8. <a href='#8'>Appendix</a><br>

<a id='1'></a>
<div class='section'>1. Reading the Notebook</div>

<div class='extra'>Data Used</div>

- All official Kaggle survey datasets from 2017-2020 (inclusive) have been used

<div class='extra'>Terms Used</div>

- **Data science aspirant** : A person looking to break into the data science field, i.e. seeking a career in data science
- **Data role** : A job title that involves substantial work with data (Software Engineer does not count as a data role in the context of this notebook)

<div class='extra'>Interpreting Emojis</div>

The following emojis have been used in this report:

- 📝 : Note
- 💡 : Insights from a visualization
- 🔎 : Personal Interpretation
- ❓ : Question
- 🎯 : Important point

<div class='extra'>Reading Visualizations</div>

- This report contains several visualizations. To get the most out of each one, I recommend that the reader **hover over it**.
- With the exception of the heatmaps (and a few other graphs), most of the visualizations are interactive.
- Where a visualization needs more explanation than the graph itself provides, look out for the **📝 (note)** emoji immediately following it.

In [None]:
# import libraries

# data manipulation
import pandas as pd
import numpy as np

# data visualization
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots

# stat tests
from scipy import stats

# slide deck embed
from IPython.display import HTML

<a id='2'></a>
<div class='section'>2. Summary</div>

In summary, this study finds that in spite of evident challenges, the community of young Indian data science aspirants has great scope for growth if provided the "right" opportunities.

While several patterns have surfaced from the analysis of data from the Kaggle ML and DS Surveys (2017-2020), a few stand taller than the rest. These patterns are not just descriptive in nature, but also present new options for action.

It is recommended that Kaggle build on its highly effective community-building work by making progress across these 3 themes - **Foundation**, **Inclusion** and **Motivation**.

The recommendations are explained further in the following slide deck. Please allow at least 2-5 seconds per slide for loading.



In [None]:
HTML('<div style="position: relative; width: 100%; height: 0; padding-top: 56.2500%;padding-bottom: 48px; box-shadow: 0 2px 8px 0 rgba(63,69,81,0.16); margin-top: 1.6em; margin-bottom: 0.9em; overflow: hidden;border-radius: 8px; will-change: transform;"><iframe style="position: absolute; width: 100%; height: 100%; top: 0; left: 0; border: none; padding: 0;margin: 0;"src="https:&#x2F;&#x2F;www.canva.com&#x2F;design&#x2F;DAEQe2bfn-8&#x2F;view?embed"></iframe></div><a href="https:&#x2F;&#x2F;www.canva.com&#x2F;design&#x2F;DAEQe2bfn-8&#x2F;view?utm_content=DAEQe2bfn-8&amp;utm_campaign=designshare&amp;utm_medium=embeds&amp;utm_source=link" target="_blank" rel="noopener">Kaggle Survey Analysis</a> by Ramshankar Yadhunath')

In [None]:
# load data

data_17 = pd.read_csv(
    "/kaggle/input/kaggle-survey-2017/multipleChoiceResponses.csv", encoding="latin-1"
)
data_18 = pd.read_csv("/kaggle/input/kaggle-survey-2018/multipleChoiceResponses.csv")
data_19 = pd.read_csv("/kaggle/input/kaggle-survey-2019/multiple_choice_responses.csv")
data_20 = pd.read_csv(
    "/kaggle/input/kaggle-survey-2020/kaggle_survey_2020_responses.csv"
)

In [None]:
# utility functions - used by the author in background analysis


def format_headers(df):
    """
    Use the first row of the dataframe as its column headers

    Parameters
    ----------
    df (dataframe):
        dataframe whose header is to be formatted

    Returns
    -------
    dataframe
    """

    headers = df.iloc[0, :]
    df = df[1:]
    df.columns = headers

    return df


def print_all(df):
    """
    Print every column header of the dataframe, one per line, without truncation

    Parameters
    ----------
    df (dataframe):
        dataframe under consideration

    Returns
    -------
    None
    """

    headers = df.columns
    for h in headers:
        print(h)


def get_ques(df):
    """
    Get the unique questions in the survey data and tag each one
    as multi-choice ("M") or single-choice ("S")

    Parameters
    ----------
    df (dataframe):
        dataframe under consideration

    Returns
    -------
    dict
    """

    uniq_ques = {}
    for col in df.columns:
        temp = col.split("-")
        key = temp[0]

        if key not in uniq_ques.keys():
            if "(Select all that apply)" in key:
                value = "M"
            else:
                value = "S"
            uniq_ques[key] = value
    return uniq_ques


In [None]:
# prepare data

data_18 = format_headers(data_18)
data_18["In which country do you currently reside?"] = data_18[
    "In which country do you currently reside?"
].replace(
    {
        "United States of America": "United States",
        "People 's Republic of China": "China",
    }
)
data_19 = format_headers(data_19)
data_19["In which country do you currently reside?"] = data_19[
    "In which country do you currently reside?"
].replace(
    {
        "United States of America": "United States",
        "People 's Republic of China": "China",
    }
)
data_20 = format_headers(data_20)
data_20["In which country do you currently reside?"] = data_20[
    "In which country do you currently reside?"
].replace(
    {
        "United States of America": "United States",
        "People 's Republic of China": "China",
    }
)

# subset to IRU21 (Indian respondents aged 18-21)
# .copy() avoids pandas SettingWithCopyWarning when these subsets are modified later
d_17 = data_17[
    (data_17["Country"] == "India") & (data_17["Age"] >= 18) & (data_17["Age"] <= 21)
].copy()
d_18 = data_18[
    (data_18["What is your age (# years)?"] == "18-21")
    & (data_18["In which country do you currently reside?"] == "India")
].copy()
d_19 = data_19[
    (data_19["What is your age (# years)?"] == "18-21")
    & (data_19["In which country do you currently reside?"] == "India")
].copy()
d_20 = data_20[
    (data_20["What is your age (# years)?"] == "18-21")
    & (data_20["In which country do you currently reside?"] == "India")
].copy()

<a id='3'></a>
<div class='section'>3. Choosing the Community</div>

In [None]:
def country_age_dist(df):
    """
    Visualize distribution of respondents according to age groups
    across the top 10 countries in terms of responses

    Parameters
    ----------
    df (dataframe):
        dataframe under consideration

    Returns
    -------
    dataframe
    """

    # get the list
    ctry_list = (
        df["In which country do you currently reside?"].value_counts().head(10).index
    )
    temp = df[df["In which country do you currently reside?"].isin(ctry_list)]
    cross = pd.crosstab(
        temp["In which country do you currently reside?"],
        temp["What is your age (# years)?"],
        normalize="index",
    )
    return cross

def ctry_age_u21(
    df,
    ctry_col="In which country do you currently reside?",
    age_col="What is your age (# years)?",
    idn="Duration (in seconds)",
):
    """
    Ratio of respondents under 21 years in top 10 countries
    in terms of size of respondent pool

    Parameters
    ----------
    df (dataframe):
        dataframe under consideration
    ctry_col (str):
        name of column containing country attributes
    age_col (str):
        name of column containing age attribute
    idn (str):
        name of column containing identifier attribute

    Returns
    -------
    dataframe
    """

    # agg all response counts for each country
    total = df.groupby([ctry_col], as_index=False).agg({idn: "count"})
    total.columns = ["Country", "Total responses"]

    # agg age<21 responses for each country
    u_21_df = df[(df[age_col] == "18-21")]
    age_split = u_21_df.groupby([ctry_col], as_index=False).agg({idn: "count"})
    age_split.columns = ["Country", "Responses"]

    # create new dataframe
    merged = pd.merge(left=total, right=age_split, on="Country")
    merged["Ratio"] = merged["Responses"] / merged["Total responses"]
    # consider a country only if it has more than 500 total responses in a year
    merged = merged[(merged["Total responses"] > 500)]
    merged.columns = ["Country", "Total responses", "Responses", "Ratio"]
    merged = merged.sort_values(by="Ratio", ascending=False).reset_index(drop=True)

    return merged

As described at the beginning of this notebook, the focus of this work is to explore the responses of a very specific community - **Indian respondents under the age of 21 years**. This notebook is the story of this community, their goals, their aspirations, their priorities and their challenges.

In this section, I explain why this community specifically has been chosen.

In [None]:
# viz. respondent distribution

fig, axes = plt.subplots(3, 1, figsize=(12, 25))
fig.suptitle(
    "Distribution of respondents based on country of residence and age group\n(For top 10 countries each year)",
    fontsize=20,
)

sns.heatmap(
    country_age_dist(data_18),
    annot=True,
    fmt=".2g",
    linewidths=0.5,
    cmap="Blues",
    cbar=False,
    ax=axes[0],
)

axes[0].set_title("2018 Kaggle Survey")
axes[0].set_xlabel("Age of respondents(years)")
axes[0].set_ylabel("Country of residence")

sns.heatmap(
    country_age_dist(data_19),
    annot=True,
    fmt=".2g",
    linewidths=0.5,
    cmap="Blues",
    cbar=False,
    ax=axes[1],
)

axes[1].set_title("2019 Kaggle Survey")
axes[1].set_xlabel("Age of respondents(years)")
axes[1].set_ylabel("Country of residence")

sns.heatmap(
    country_age_dist(data_20),
    annot=True,
    fmt=".2g",
    linewidths=0.5,
    cmap="Blues",
    cbar=False,
    ax=axes[2],
)

axes[2].set_title("2020 Kaggle Survey")
axes[2].set_xlabel("Age of respondents(years)")
axes[2].set_ylabel("Country of residence")

plt.show()

<center><strong>Exhibit 3-A.</strong> Distribution of respondents based on country and age group</center>

📝 **Note**

- Only those countries that ranked in the top 10 by number of responses in a given year's survey have been included for that year
- Each cell's value is the share of that country's respondents who fall in the given age group (each row sums to 1)
- The values inside the cells have been rounded off

The above figure depicts the distribution of survey respondents (by age and country of residence) for the years 2018, 2019 and 2020. A very important visual contribution of this figure is that it **sheds light on the dominance of Indian respondents under the age of 21 years**. 

In 2018 and 2019, 30% of India's responses came from 18-21 year olds. In 2020, this proportion has increased to 40%. A bird's eye view of the heatmap shows how this demographic has contributed the most to the survey responses over the years that have been considered. 
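The shares quoted above come from row-normalized cross-tabulations. As a minimal sketch of what that normalization does, the toy table below (hypothetical values, with made-up column names `country` and `age`) is passed through `pd.crosstab` with `normalize="index"`, the same option used by `country_age_dist`.

```python
import pandas as pd

# toy respondent table (hypothetical values)
toy = pd.DataFrame(
    {
        "country": ["India", "India", "India", "India", "Brazil", "Brazil"],
        "age": ["18-21", "18-21", "22-24", "25-29", "18-21", "25-29"],
    }
)

# normalize="index" divides every row by its row total,
# so each country's age-group shares add up to 1
shares = pd.crosstab(toy["country"], toy["age"], normalize="index")
print(shares.round(2))
```

Because each row is scaled by its own total, the values compare age mixes across countries but say nothing about how many respondents each country contributed.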

Keeping this as *one of many factors* in mind, I have decided that the focus of my analysis shall be on the rapidly growing community of data science enthusiasts in India who are under the age of 21. Now, I mentioned *many factors*. A further look into these other factors that have influenced my decision is given in the following section.

<a id='4'></a>
<div class='section'>4. Why Must this Community be Studied?</div>

Following from the previous section, this section throws more light on why the chosen community of Indian respondents under 21 years needs to be studied and understood. 

In [None]:
# growth of young Indian  ds aspirants

# data prep

# 2017 survey
d_17_total = data_17.groupby(["Country"], as_index=False).agg(
    {"StudentStatus": "count"}
)
d_17_total.columns = ["Country", "Total responses"]
d_17_u21 = (
    data_17[(data_17["Age"] >= 18) & (data_17["Age"] <= 21)]
    .groupby(["Country"], as_index=False)
    .agg({"StudentStatus": "count"})
)
d_17_u21.columns = ["Country", "Responses"]
# merge data
ctry_age_u21_17 = pd.merge(left=d_17_total, right=d_17_u21, on="Country")
ctry_age_u21_17["Ratio"] = (
    ctry_age_u21_17["Responses"] / ctry_age_u21_17["Total responses"]
)
ctry_age_u21_17 = (
    ctry_age_u21_17.sort_values(by=["Total responses", "Ratio"], ascending=False)
    .reset_index(drop=True)
    .head(10)
)
ctry_age_u21_17["Year"] = "2017"
ctry_age_u21_17 = ctry_age_u21_17[(ctry_age_u21_17["Total responses"] > 500)]
ctry_age_u21_17["Country"] = ctry_age_u21_17["Country"].replace(
    {
        "United States of America": "United States",
        "People 's Republic of China": "China",
    }
)

# 2018 survey
ctry_age_u21_18 = ctry_age_u21(data_18)
ctry_age_u21_18["Year"] = "2018"

# 2019 survey
ctry_age_u21_19 = ctry_age_u21(data_19)
ctry_age_u21_19["Year"] = "2019"

# 2020 survey
ctry_age_u21_20 = ctry_age_u21(data_20)
ctry_age_u21_20["Year"] = "2020"

In [None]:
# visualization

ctry_age_u21_df = pd.concat(
    [ctry_age_u21_17, ctry_age_u21_18, ctry_age_u21_19, ctry_age_u21_20]
)
fig = px.scatter(
    ctry_age_u21_df,
    x="Total responses",
    y="Responses",
    size="Ratio",
    color="Country",
    size_max=25,
    hover_name="Year",
)
fig.update_traces(
    marker=dict(line=dict(width=2, color="DarkSlateGrey")),
    selector=dict(mode="markers"),
)
annotations = []
annotations.append(
    dict(
        text="India's growth over the past 3 years",
        xref="paper",
        yref="paper",
        x=0.9,
        y=0.75,
        showarrow=False,
    )
)
fig.add_shape(
    type="circle",
    xref="x",
    yref="y",
    x0=3800,
    y0=900,
    x1=6500,
    y1=2500,
    opacity=0.3,
    fillcolor="#008abc",
)
fig.update_layout(
    annotations=annotations,
    template="simple_white",
    font_family="Rockwell",
    title={
        "text": "Growth of young data science aspirants in India",
        "font": {"family": "Rockwell", "size": 20},
        "yanchor": "top",
    },
    xaxis_title="Total number of survey responses",
    yaxis_title="Number of responses from those under 21 years",
    showlegend=False,
)

fig.show()

<center><strong>Exhibit 4-A.</strong> Growth of young data science aspirants in India</center>

📝 **Note**
- Only those countries with more than 500 responses for a given survey year have been considered in the scatter plot above
- The size of a point represents the ratio of `responses from 18-21 year olds from any country C` to `total responses in a year from any country C`
- Hover over each point for further details

💡 **Insights**
- From the above figure, it is evident that the rise in survey respondents over the years has been most prominent in India
- While the USA has the most total responses in 2018, India makes a bigger leap in the next 2 years while the USA falls back
- Also, unlike other countries, respondents under 21 years have been more frequent from India in the past 3 years
- Hence, we could say that **India has seen in recent times a phenomenal growth in the number of young, aspiring data scientists!**
- However, note that the staggering increase from India could be attributed to [India's much larger youth population](https://economictimes.indiatimes.com/news/politics-and-nation/india-has-worlds-largest-youth-population-un-report/articleshow/45190294.cms?from=mdr)
    - Even if this is the case, the fact that a **significant portion, i.e. close to 30%-35%, of Indian responses are from those under 21 years old** is strong evidence for the purpose of my study in this notebook
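The ratio that sizes each point in Exhibit 4-A is the number of 18-21 year old respondents divided by the total number of respondents for a country. The sketch below reproduces that arithmetic on a toy table with hypothetical values. It uses `groupby(...).size()` and `pd.concat` rather than the `pd.merge` call in `ctry_age_u21`, which is a simplification for illustration and not the notebook's exact code.

```python
import pandas as pd

# toy responses (hypothetical): one row per respondent
toy = pd.DataFrame(
    {
        "country": ["India"] * 5 + ["Japan"] * 4,
        "age": [
            "18-21", "18-21", "22-24", "25-29", "30-34",
            "18-21", "25-29", "30-34", "35-39",
        ],
    }
)

# total responses per country
total = toy.groupby("country").size().rename("Total responses")

# responses from the 18-21 age group per country
young = toy[toy["age"] == "18-21"].groupby("country").size().rename("Responses")

# ratio = 18-21 responses / total responses
summary = pd.concat([total, young], axis=1).fillna(0)
summary["Ratio"] = summary["Responses"] / summary["Total responses"]
print(summary)
```

In this toy case India's ratio is 2/5 and Japan's is 1/4, so India's point would be drawn larger even though both countries have a similar number of total responses.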

🎯 **Important Point**

This community of **Indian respondents under 21 years** shall be known as **IRU21** for the remainder of this notebook.

This rapid growth of IRU21 respondents builds a case that motivates the contents of this notebook. 

There is also another reason to study IRU21 respondents: they are most likely data science "aspirants" in this survey. While there could be arguments against the use of the term "aspirants", the term fits well in context here. 

Aspirants in any field grow into the roles they desire within that field on the basis of 
- The **effort** they put in
- The **resources** they receive
- The **mentorship** they receive

While the first point can receive no substantial contribution by external intervention, the next two can. **Good resources and legitimate mentorship are critical to the growth of an individual.** However, given the popularity of data science currently, it is very easy for aspirants to fall into the trap of *resource overload*.

Hence, this work is also an attempt to draw focus onto insights that could help work towards the provision of useful resources and able mentorship to the IRU21 community specifically.

<a id='5'></a>
<div class='section'>5. A Broad Overview of the IRU21 Community</div>

In this section, I provide a broad overview of the responses from the IRU21 community. This overview will help build a general understanding which could prove beneficial later on. 

This broad overview is built around two relevant themes:
1. **Gender**
2. **Work**

The third theme of **Education** has been interleaved with the above two themes to draw insights from this community's role in the current data science spectrum.

📝 **Note**  
Education in this section's context also includes experience writing code and experience with ML methods.

![](https://github.com/ry05/kag_survey_analysis/blob/main/broad_themes.png?raw=true)

<div class='subsection'>5.1. Does Gender play a role in the IRU21 subset of responses?</div>

One of the most important explorations of the role of gender in the data science space was carried out by Parul Pandey in last year's work, [Geek Girls Rising : Myth or Reality!](https://www.kaggle.com/parulpandey/geek-girls-rising-myth-or-reality). 

In this section however, I choose to work on a smaller set of questions pertaining to gender trends within the IRU21 respondents.

In [None]:
def make_3_genders(df, gender_col):
    """Clean data to have only 3 genders
    - Male
    - Female
    - Other

    Parameters
    ----------
    df (dataframe):
        dataframe under consideration
    gender_col (str):
        name of column containing gender attribute

    Returns
    -------
    dataframe
    """

    df[gender_col] = df[gender_col].replace(
        {
            "A different identity": "Other",
            "Non-binary, genderqueer, or gender non-conforming": "Other",
            "Prefer not to say": "Other",
            "Prefer to self-describe": "Other",
            "Nonbinary": "Other",
            "Man": "Male",
            "Woman": "Female",
        }
    )

    return df


def gender_stats_df(df, yr, op=1):
    """
    Make a dataframe with gender stats

    Parameters
    ----------
    df (dataframe):
        dataframe under consideration
    yr (str):
        year as a string
    op (int):
        if 1, indicates 2017
        else, indicates 2018 or 2019 or 2020

    Returns
    -------
    dataframe
    """

    if op == 1:
        df_g = pd.DataFrame(df["GenderSelect"].value_counts()).reset_index()
    else:
        df_g = pd.DataFrame(
            df["What is your gender? - Selected Choice"].value_counts()
        ).reset_index()
    df_g.columns = ["Gender", "Count"]
    df_g = make_3_genders(df_g, "Gender")
    df_g = df_g.groupby(["Gender"], as_index=False).agg({"Count": "sum"})
    df_g["Percent"] = round(df_g["Count"] / df_g["Count"].sum() * 100, 2)
    df_g["Year"] = yr

    return df_g

In [None]:
# gender trends

# prep data
d_17_g = gender_stats_df(d_17, "2017")
d_18_g = gender_stats_df(d_18, "2018", 2)
d_19_g = gender_stats_df(d_19, "2019", 2)
d_20_g = gender_stats_df(d_20, "2020", 2)
gender_stats = pd.concat([d_17_g, d_18_g, d_19_g, d_20_g]).reset_index(drop=True)
years = ["2017", "2018", "2019", "2020"]

# visualize
fig = go.Figure()
fig.add_trace(
    go.Bar(x=d_17_g["Gender"], y=d_17_g["Percent"], name="2017", marker_color="#7CB9E8")
)
fig.add_trace(
    go.Bar(x=d_18_g["Gender"], y=d_18_g["Percent"], name="2018", marker_color="#6699CC")
)
fig.add_trace(
    go.Bar(x=d_19_g["Gender"], y=d_19_g["Percent"], name="2019", marker_color="#007FFF")
)
fig.add_trace(
    go.Bar(x=d_20_g["Gender"], y=d_20_g["Percent"], name="2020", marker_color="#00308F")
)

# enhance template
fig.update_layout(
    template="simple_white",
    font_family="Rockwell",
    title={
        "text": "Gender trends among IRU21 respondents",
        "font": {"size": 20},
        "yanchor": "top",
    },
    xaxis_title="Gender",
    yaxis_title="Percentage of respondents",
)

fig.show()

<center><strong>Exhibit 5-A.</strong> Gender trends among IRU21 respondents</center>

💡 **Insights**

- The above visualization does depict the **huge imbalance of gender representation** in the data science community. But, let's look at it again. This time, let's use the **lens of positivity**!
- It also shows the emergence of a trend that looks to correct this lack of representation
    - **The share of young female respondents from India has been increasing** every year, with a very promising increase in share of over 4% from 2019 to 2020
    - The share of young male respondents from India, on the other hand, has seen a substantial decline of over 6%
    - **Young respondents from India who identify as another gender** or didn't wish to disclose their gender have seen a very small, but still **evident increase** in their representation in the survey
    
🎯 **Important Point**

There surely is a gender imbalance, but there is also a steady force that is trying to bring about balance. Therefore, what we need is a **stronger, community-led approach to expedite this balancing process**, especially when it comes to uplifting the performance of those who do not identify with the male or female genders.
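If the changes in share quoted in the insights above are read as differences between yearly percentages (that is, percentage points), they can be computed by pivoting a table shaped like `gender_stats` and differencing across years. The sketch below does this on hypothetical numbers. The values are made up for illustration and are not the survey's results.

```python
import pandas as pd

# hypothetical shares (percent of respondents), shaped like `gender_stats`
toy = pd.DataFrame(
    {
        "Year": ["2019", "2019", "2020", "2020"],
        "Gender": ["Female", "Male", "Female", "Male"],
        "Percent": [18.0, 80.0, 22.0, 76.0],
    }
)

# one row per year, one column per gender
wide = toy.pivot(index="Year", columns="Gender", values="Percent")

# year-over-year change in percentage points
change = wide.diff().dropna()
print(change)
```

A move from 18% to 22% is a change of 4 percentage points, which is not the same as a 4% relative increase, so it helps to state which of the two is meant.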

<div class='subsubsection'>5.1.1. Education in the context of Gender</div>

In this part, let's try to identify the interactions between education and gender among IRU21 respondents.

In [None]:
# data prep

d_17 = make_3_genders(d_17, 'GenderSelect')
d_18 = make_3_genders(d_18, 'What is your gender? - Selected Choice')
d_19 = make_3_genders(d_19, 'What is your gender? - Selected Choice')
d_20 = make_3_genders(d_20, 'What is your gender? - Selected Choice')

In [None]:
# visualization

fig, axes = plt.subplots(4, 1, figsize=(8, 25))
fig.suptitle(
    "Percentage of education levels per gender in young Indian data science aspirants",
    fontsize=20,
)

cross = pd.crosstab(
    d_17["FormalEducation"], d_17["GenderSelect"], normalize="columns", margins=True
)
sns.heatmap(cross, cmap="Blues", annot=True, cbar=False, linewidths=0.5, ax=axes[0])
axes[0].set_title("2017 Kaggle Survey")
axes[0].set_ylabel("Education")
axes[0].set_xlabel("")

cross = pd.crosstab(
    d_18[
        "What is the highest level of formal education that you have attained or plan to attain within the next 2 years?"
    ],
    d_18["What is your gender? - Selected Choice"],
    normalize="columns",
    margins=True,
)
sns.heatmap(cross, cmap="Blues", annot=True, cbar=False, linewidths=0.5, ax=axes[1])
axes[1].set_title("2018 Kaggle Survey")
axes[1].set_ylabel("Education")
axes[1].set_xlabel("")

cross = pd.crosstab(
    d_19[
        "What is the highest level of formal education that you have attained or plan to attain within the next 2 years?"
    ],
    d_19["What is your gender? - Selected Choice"],
    normalize="columns",
    margins=True,
)
sns.heatmap(cross, cmap="Blues", annot=True, cbar=False, linewidths=0.5, ax=axes[2])
axes[2].set_title("2019 Kaggle Survey")
axes[2].set_ylabel("Education")
axes[2].set_xlabel("")

cross = pd.crosstab(
    d_20[
        "What is the highest level of formal education that you have attained or plan to attain within the next 2 years?"
    ],
    d_20["What is your gender? - Selected Choice"],
    normalize="columns",
    margins=True,
)
sns.heatmap(cross, cmap="Blues", annot=True, cbar=False, linewidths=0.5, ax=axes[3])
axes[3].set_title("2020 Kaggle Survey")
axes[3].set_ylabel("Education")
axes[3].set_xlabel("Gender")

plt.show()

<center><strong>Exhibit 5-B.</strong> Percentage of education levels per gender in young data science aspirants</center>

💡 **Insights**

- Across all years, the **Bachelor's degree is the most common, with at least 65% of all young Indian respondents** having one or planning to attain one in the next 2 years
    - This is intuitive as the age group considered here is 18-21, which is generally the age when the Bachelor's degree is pursued in India
- Other higher degrees are relatively lower in proportion for the very reason that the respondents in this sample are between 18-21
    - Master's degrees: An interesting pattern is that a **slightly higher percentage of female respondents have Master's degrees** when compared to their male counterparts
    - Doctoral degree: Extremely rare, but not nil. This makes sense as 18-21 is probably too early to get a PhD!
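The heatmaps in Exhibit 5-B are built from cross-tabulations with `normalize="columns"` and `margins=True`. As a small sketch of how to read them, the toy example below (hypothetical values, with made-up column names `education` and `gender`) shows that each gender column sums to 1 and that an extra "All" column carries the overall distribution.

```python
import pandas as pd

# toy data (hypothetical values)
toy = pd.DataFrame(
    {
        "education": [
            "Bachelor's", "Bachelor's", "Master's",
            "Bachelor's", "Master's", "Bachelor's",
        ],
        "gender": ["Male", "Male", "Male", "Female", "Female", "Male"],
    }
)

# normalize="columns" makes every gender column sum to 1;
# margins=True appends an "All" column with the overall distribution
cross = pd.crosstab(
    toy["education"], toy["gender"], normalize="columns", margins=True
)
print(cross.round(2))
```

Since each column is scaled separately, a higher value in the Female column than in the Male column means a larger share within that gender, not a larger number of respondents.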
    
❗❗❗ **Alert**

The choice **Some college/university study without earning a bachelor's degree** is very ambiguous. The question asks the respondent what level of education they have or they *expect to attain in next 2 years.* Considering the open-endedness of this question, the response of **Some college/uni without bachelor's degree** lacks clarity.

Hence, Kaggle could probably rephrase this choice to ensure better data quality for this question.

<div class='subsubsection'>5.1.2. Experience writing code in the context of Gender</div>

Writing code is an important part of doing data science (Royal Society, 2019). In this part, let's investigate if experience with writing code has anything to do with gender.

In [None]:
# data prep

index = ["Never", "< 1 year", "1-2 years", "3-5 years", "5-10 years", "10+ years"]

d_18["How long have you been writing code to analyze data?"] = d_18[
    "How long have you been writing code to analyze data?"
].replace(
    {
        "10-20 years": "10+ years",
        "40+ years": "10+ years",
        "I have never written code and I do not want to learn": "Never",
        "I have never written code but I want to learn": "Never",
    }
)
cross_18 = pd.crosstab(
    d_18["How long have you been writing code to analyze data?"],
    d_18["What is your gender? - Selected Choice"],
    normalize="columns",
    margins=True,
).reindex(index)

d_19[
    "How long have you been writing code to analyze data (at work or at school)?"
] = d_19[
    "How long have you been writing code to analyze data (at work or at school)?"
].replace(
    {
        "< 1 years": "< 1 year",
        "10-20 years": "10+ years",
        "20+ years": "10+ years",
        "I have never written code": "Never",
    }
)
cross_19 = pd.crosstab(
    d_19["How long have you been writing code to analyze data (at work or at school)?"],
    d_19["What is your gender? - Selected Choice"],
    normalize="columns",
    margins=True,
).reindex(index)

d_20["For how many years have you been writing code and/or programming?"] = d_20[
    "For how many years have you been writing code and/or programming?"
].replace(
    {
        "< 1 years": "< 1 year",
        "10-20 years": "10+ years",
        "20+ years": "10+ years",
        "I have never written code": "Never",
    }
)
cross_20 = pd.crosstab(
    d_20["For how many years have you been writing code and/or programming?"],
    d_20["What is your gender? - Selected Choice"],
    normalize="columns",
    margins=True,
).reindex(index)

In [None]:
# visualization

fig, axes = plt.subplots(1, 3, figsize=(25, 8))
fig.suptitle(
    "Percentage of code writing experience levels per gender in young Indian data science aspirants",
    fontsize=20,
)

# 2018
sns.heatmap(cross_18, cmap="Blues", annot=True, cbar=False, linewidths=0.5, ax=axes[0])
axes[0].set_title("2018 Kaggle Survey")
axes[0].set_ylabel("Years of experience writing code")
axes[0].set_xlabel("Gender")

# 2019
sns.heatmap(cross_19, cmap="Blues", annot=True, cbar=False, linewidths=0.5, ax=axes[1])
axes[1].set_title("2019 Kaggle Survey")
axes[1].set_ylabel("")
axes[1].set_xlabel("Gender")

# 2020
sns.heatmap(cross_20, cmap="Blues", annot=True, cbar=False, linewidths=0.5, ax=axes[2])
axes[2].set_title("2020 Kaggle Survey")
axes[2].set_ylabel("")
axes[2].set_xlabel("Gender")

plt.show()

<center><strong>Exhibit 5-C.</strong> Percentage of coding experience per gender in young Indian data science aspirants</center>

💡 **Insights**

- As expected, many of the young respondents across the 2018, 2019 and 2020 surveys have less than a year of experience with coding
    - There is **no substantial difference between male and female respondents** in this case
- But, there is an interesting pattern
    - In 2018 and 2019, respondents with less than a year of experience far outnumber those with 1-5 years of experience
    - However, in 2020 this pattern is reversed. There are more respondents in this year's survey who have 1-5 years of experience with coding than those with under a year
        - This could mean that this year's survey saw a **decreased participation** from young Indians who are *complete coding beginners*
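
The reversal described above can be checked directly on the normalized crosstabs. The snippet below is a minimal sketch of that check. The shares are made-up numbers, not survey results, and the experience labels (`"1-2 years"`, `"3-5 years"`) are assumptions about how the experience levels are named.

```python
import pandas as pd

# Hypothetical shares (fractions of respondents) per experience level.
# These numbers are invented for illustration and are NOT survey results.
shares = pd.DataFrame(
    {
        "2019": [0.10, 0.50, 0.30, 0.10],
        "2020": [0.10, 0.35, 0.40, 0.15],
    },
    index=["Never", "< 1 year", "1-2 years", "3-5 years"],
)

# share of respondents with under a year of coding experience
under_one = shares.loc["< 1 year"]

# combined share of respondents with 1-5 years of coding experience
one_to_five = shares.loc[["1-2 years", "3-5 years"]].sum()

# True for a survey year in which the 1-5 years group is the larger one
reversed_pattern = one_to_five > under_one
print(reversed_pattern)
```

A year flagged `True` is one in which the 1-5 years group outnumbers the under-a-year group.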
        
❓ **Question**

The **decreased participation** argument above is only one possibility. A more important question to consider is ***Is coding now introduced to schoolchildren in India much earlier, and more seriously, than in past years?***

If this is the case, it makes sense to see the surge in 18-21 year olds with 1-5 years of experience in coding.

🔎 **Personal Interpretation**

I believe that students are now introduced to programming at the school level in a much more serious fashion than they were a few years back. This can be attributed to the recent realization amongst schools about the importance of programming in the future.

<div class='subsubsection'>5.1.3. Experience with Machine Learning methods in the context of Gender</div>

In [None]:
# data prep

index = [
    "Never",
    "< 1 year",
    "1-2 years",
    "2-3 years",
    "3-4 years",
    "4-5 years",
    "5+ years",
]

d_18[
    "For how many years have you used machine learning methods (at work or in school)?"
] = d_18[
    "For how many years have you used machine learning methods (at work or in school)?"
].replace(
    {
        "5-10 years": "5+ years",
        "10-15 years": "5+ years",
        "20+ years": "5+ years",
        "I have never studied machine learning but plan to learn in the future": "Never",
        "I have never studied machine learning and I do not plan to": "Never",
    }
)
cross_18 = pd.crosstab(
    d_18[
        "For how many years have you used machine learning methods (at work or in school)?"
    ],
    d_18["What is your gender? - Selected Choice"],
    normalize="columns",
).reindex(index)

d_20["For how many years have you used machine learning methods?"] = d_20[
    "For how many years have you used machine learning methods?"
].replace(
    {
        "5-10 years": "5+ years",
        "10-20 years": "5+ years",
        "20 or more years": "5+ years",
        "Under 1 year": "< 1 year",
        "I do not use machine learning methods": "Never",
    }
)
cross_20 = pd.crosstab(
    d_20["For how many years have you used machine learning methods?"],
    d_20["What is your gender? - Selected Choice"],
    normalize="columns",
).reindex(index)

d_19["For how many years have you used machine learning methods?"] = d_19[
    "For how many years have you used machine learning methods?"
].replace(
    {
        "5-10 years": "5+ years",
        "10-15 years": "5+ years",
        "20+ years": "5+ years",
        "< 1 years": "< 1 year",
    }
)
cross_19 = pd.crosstab(
    d_19["For how many years have you used machine learning methods?"],
    d_19["What is your gender? - Selected Choice"],
    normalize="columns",
).reindex(index)

In [None]:
# visualization

fig, axes = plt.subplots(1, 3, figsize=(25, 8))
fig.suptitle(
    "Percentage of ML methods experience levels per gender in young Indian data science aspirants",
    fontsize=20,
)

# 2018
sns.heatmap(cross_18, cmap="Blues", annot=True, cbar=False, linewidths=0.5, ax=axes[0])
axes[0].set_title("2018 Kaggle Survey")
axes[0].set_ylabel("Years of experience using ML methods")
axes[0].set_xlabel("Gender")

# 2019
sns.heatmap(cross_19, cmap="Blues", annot=True, cbar=False, linewidths=0.5, ax=axes[1])
axes[1].set_title("2019 Kaggle Survey")
axes[1].set_ylabel("")
axes[1].set_xlabel("Gender")

# 2020
sns.heatmap(cross_20, cmap="Blues", annot=True, cbar=False, linewidths=0.5, ax=axes[2])
axes[2].set_title("2020 Kaggle Survey")
axes[2].set_ylabel("")
axes[2].set_xlabel("Gender")

plt.show()


<center><strong>Exhibit 5-D.</strong> Percentage of ML methods' experience level per gender in young Indian data science aspirants</center>

💡 **Insights**

- Most of the respondents from India under the age of 21 have less than a year of experience using ML methods
    - This is exactly what was expected, as Machine Learning is not yet an active part of school curriculums, at least not in a nationwide sense
    - This finding is **independent of gender**, i.e., irrespective of gender, experience with ML methods among IRU21 respondents is low
    

<div class='subsubsection'>5.1.4. Further Thoughts</div>

The gender imbalance in data science as a discipline is obvious. This imbalance could be attributed to several factors such as cultural perceptions, attrition and biases as depicted in [Women in data science and AI | The Alan Turing Institute](https://www.turing.ac.uk/research/research-projects/women-data-science-and-ai). 

The Boston Consulting Group's [What’s Keeping Women Out of Data Science?](https://www.bcg.com/en-gb/publications/2020/what-keeps-women-out-data-science) is another excellent resource that dives deep into the problem.

<div class='subsection'>5.2. How is employment in the IRU21 subset of responses?</div>

The IRU21 respondents only include those in the age group of 18-21 years. In most parts of the world (and India for sure), this is the age when students go through college to attain a basic or Bachelor's degree. 

Therefore, it is most likely that "employment" here signifies part-time or internship opportunities. 

In [None]:
def top_n(df, col, n=5):
    """
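    Get the counts of the top n values in a column
    of the dataframe
    
    Parameters
    ----------
    df (dataframe):
        dataframe under consideration
    col (str):
        name of column
    n (int):
        number of most frequent values to return (default 5)

    Returns
    -------
    dataframe
        one row per value, with the columns
        'Current role' and 'Number of respondents'
    """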
    
    temp = pd.DataFrame(df[col].value_counts()).head(n).reset_index()
    temp.columns = ['Current role', 'Number of respondents']
    
    return temp


In [None]:
# data prep
d_18_cur_role = top_n(
    d_18,
    "Select the title most similar to your current role (or most recent title if retired): - Selected Choice",
    10,
)
d_19_cur_role = top_n(
    d_19,
    "Select the title most similar to your current role (or most recent title if retired): - Selected Choice",
    10,
)
d_20_cur_role = top_n(
    d_20,
    "Select the title most similar to your current role (or most recent title if retired): - Selected Choice",
    10,
)

# visualize
fig = make_subplots(
    3,
    1,
    specs=[[{}], [{}], [{}]],
    subplot_titles=("2018 Kaggle Survey", "2019 Kaggle Survey", "2020 Kaggle Survey"),
)

colors = ["#6699cc"] * 10
colors[0] = "#FF3333"
fig.add_trace(
    go.Bar(
        x=d_18_cur_role["Current role"],
        y=d_18_cur_role["Number of respondents"],
        name="2018",
        marker_color=colors,
    ),
    row=1,
    col=1,
)
fig.update_yaxes(title_text="Number of respondents", row=1, col=1)

colors = ["#007fff"] * 10
colors[0] = "#FF3333"
fig.add_trace(
    go.Bar(
        x=d_19_cur_role["Current role"],
        y=d_19_cur_role["Number of respondents"],
        name="2019",
        marker_color=colors,
    ),
    row=2,
    col=1,
)
fig.update_yaxes(title_text="Number of respondents", row=2, col=1)

colors = ["#00308f"] * 10
colors[0] = "#FF3333"
fig.add_trace(
    go.Bar(
        x=d_20_cur_role["Current role"],
        y=d_20_cur_role["Number of respondents"],
        name="2020",
        marker_color=colors,
    ),
    row=3,
    col=1,
)
fig.update_yaxes(title_text="Number of respondents", row=3, col=1)

# enhance template
fig.update_layout(
    template="simple_white",
    font_family="Rockwell",
    title={
        "text": "No surprises - Most young respondents are students!",
        "font": {"size": 20},
        "yanchor": "top",
    },
    xaxis_title="",
    # yaxis_title="Percentage of respondents",
    height=900,
    showlegend=False,
)

fig.show()

<center><strong>Exhibit 5-E.</strong> No surprises - Most IRU21 respondents are students</center>

The above visualization is unsurprising: the community under consideration is restricted to 18-21 year olds, who are usually still in education.

<div class='subsubsection'>5.2.1. Employment Status of Young Indian Data Science Aspirants</div>

In [None]:
# data prep

# 2019
d_19_work = d_19.groupby(
    [
        "What is the size of the company where you are employed?",
        "Approximately how many individuals are responsible for data science workloads at your place of business?",
    ],
    as_index=False,
).agg({"Duration (in seconds)": "count"})
d_19_work.columns = ["Company size", "Data science team size", "Count"]
d_19_work["cs_num"] = d_19_work["Company size"].replace(
    {
        "0-49 employees": 1,
        "50-249 employees": 2,
        "> 10,000 employees": 5,
        "250-999 employees": 3,
        "1000-9,999 employees": 4,
    }
)
d_19_work["ts_num"] = d_19_work["Data science team size"].replace(
    {
        "0": 0,
        "1-2": 1,
        "3-4": 2,
        "20+": 6,
        "5-9": 3,
        "10-14": 4,
        "15-19": 5,
    }
)
d_19_work = d_19_work.sort_values(["ts_num", "cs_num"], ascending=True)

# 2020
d_20_work = d_20.groupby(
    [
        "What is the size of the company where you are employed?",
        "Approximately how many individuals are responsible for data science workloads at your place of business?",
    ],
    as_index=False,
).agg({"Duration (in seconds)": "count"})
d_20_work.columns = ["Company size", "Data science team size", "Count"]
d_20_work["cs_num"] = d_20_work["Company size"].replace(
    {
        "0-49 employees": 1,
        "50-249 employees": 2,
        "10,000 or more employees": 5,
        "250-999 employees": 3,
        "1000-9,999 employees": 4,
    }
)
d_20_work["ts_num"] = d_20_work["Data science team size"].replace(
    {
        "0": 0,
        "1-2": 1,
        "3-4": 2,
        "20+": 6,
        "5-9": 3,
        "10-14": 4,
        "15-19": 5,
    }
)
d_20_work = d_20_work.sort_values(["ts_num", "cs_num"], ascending=True)

In [None]:
# employment status

d_19_t = d_19.shape[0]
d_19_w = d_19_work['Count'].sum()

d_20_t = d_20.shape[0]
d_20_w = d_20_work['Count'].sum()

# data viz

labels = ['Employed', 'Not employed']
colors = ['#00308f', '#6699cc']

fig = make_subplots(rows=1, cols=2, specs=[[{'type':'domain'}, {'type':'domain'}]],
                   subplot_titles=['2019 Kaggle Survey', '2020 Kaggle Survey'])
fig.add_trace(go.Pie(labels=labels, values=[d_19_w, d_19_t-d_19_w], name='2019'),
              1, 1)
fig.add_trace(go.Pie(labels=labels, values=[d_20_w, d_20_t-d_20_w], name='2020'),
              1, 2)

fig.update_traces(hole=.4, hoverinfo="label+value+name",
                  marker=dict(colors=colors, line=dict(color='#000000', width=2)))

# add annotations
annotations = []
annotations.append(
    dict(
        xref="paper",
        yref="paper",
        x=0.85,
        y=-0.1,
        xanchor="center",
        yanchor="top",
        text="If a respondent is not work",
        font=dict(family="Rockwell", size=12, color="grey"),
        showarrow=False,
    )
)

# enhance template
fig.update_layout(
    template="simple_white",
    font_family='Rockwell',
    title={
        "text": "Employment Status of Young Indian DS Aspirants",
        "font": {"size": 18},
        "yanchor": "top",
    },
    #xaxis_title="",
    #yaxis_title="Percentage of respondents",
    showlegend=True
)

fig.show()

<center><strong>Exhibit 5-F.</strong> Employment status of young Indian data science aspirants</center>

📝 **Note** 

In the above visualization, a respondent is considered not employed if their response was null for `size of employees in company` and `size of data science teams`. This choice has been made to accommodate cases where a respondent is a Student but is working a part-time or research job on the side.
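
The rule in the note can be sketched on a toy frame. This is only an illustration under stated assumptions: the short column names `company_size` and `ds_team_size` are placeholders for the two long survey questions, and the rows are invented. A respondent is treated as employed here when both answers are present, which matches how a pandas `groupby` on the two columns drops rows with a missing key by default.

```python
import pandas as pd

# Toy frame; the column names are placeholders for the two survey questions
# on company size and data science team size (hypothetical, shortened names).
df = pd.DataFrame(
    {
        "company_size": ["0-49 employees", None, "50-249 employees", None],
        "ds_team_size": ["0", None, "1-2", None],
    }
)

# Assumption: a respondent counts as employed only if both answers are present
employed = df["company_size"].notna() & df["ds_team_size"].notna()

n_employed = int(employed.sum())
n_not_employed = int((~employed).sum())
print(n_employed, n_not_employed)
```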

💡 **Insights**

- As expected, **employment under 21 years is not the norm, it's an exception**
- However, around one-fifth of respondents are employed
    - Probably, this can be **attributed to internships or part-time jobs**, more so as the companies they are involved with are small (see the next visualization)
- The percentage of employment has reduced in 2020 from 2019
    - This could well be a sampling problem or coincidence
    - But, it **could also be the pandemic affecting job opportunities**
    
Let's now analyze the kind of employment this community gets.

<div class='subsubsection'>5.2.2. Where do Young Indian Data Science Aspirants work?</div>

As the above visualization indicated that a part of the IRU21 respondents do in fact have employment, it is now time to analyze what kind of employment they are in.

This is done by focusing on two characteristics of a workplace:
1. Size of the company in terms of employees
2. The number of employees responsible for data science workloads (the data science team)

In [None]:
# workplace of young ds aspirants

fig = make_subplots(
    2,
    1,
    specs=[[{}], [{}]],
    subplot_titles=("2019 Kaggle Survey", "2020 Kaggle Survey"),
)

hover_text = []
for _, row in d_19_work.iterrows():  # "_" keeps the earlier `index` list intact
    hover_text.append(
        (
            "Number of respondents: {resp_num}<br>"
            + "Company Size: {cs}<br>"
            + "Data science team size: {ts}<br>"
        ).format(
            resp_num=row["Count"],
            cs=row["Company size"],
            ts=row["Data science team size"],
        )
    )
d_19_work["text"] = hover_text

hover_text = []
for _, row in d_20_work.iterrows():  # "_" keeps the earlier `index` list intact
    hover_text.append(
        (
            "Number of respondents: {resp_num}<br>"
            + "Company Size: {cs}<br>"
            + "Data science team size: {ts}<br>"
        ).format(
            resp_num=row["Count"],
            cs=row["Company size"],
            ts=row["Data science team size"],
        )
    )
d_20_work["text"] = hover_text

fig.add_trace(
    go.Scatter(
        x=d_19_work["Company size"],
        y=d_19_work["Data science team size"],
        mode="markers",
        text=d_19_work["text"],
        name="2019",
        marker=dict(size=d_19_work["Count"], sizemin=0, color="#007fff"),
    ),
    1,
    1,
)
fig.update_xaxes(title_text="Size of company", row=1, col=1)
fig.update_yaxes(title_text="Size of data science team", row=1, col=1)


fig.add_trace(
    go.Scatter(
        x=d_20_work["Company size"],
        y=d_20_work["Data science team size"],
        text=d_20_work["text"],
        name="2020",
        mode="markers",
        marker=dict(size=d_20_work["Count"], sizemin=0, color="#003085"),
    ),
    2,
    1,
)
fig.update_xaxes(title_text="Size of company", row=2, col=1)
fig.update_yaxes(title_text="Size of data science team", row=2, col=1)

# enhance template
fig.update_layout(
    template="simple_white",
    font_family="Rockwell",
    title={
        "text": "Where do young Indian DS Aspirants work?",
        "font": {"size": 20},
        "yanchor": "top",
    },
    height=900,
    showlegend=False,
)

fig.show()

<center><strong>Exhibit 5-G.</strong> Where do young Indian data science aspirants work?</center>

📝 **Note**

The above visualization only considers those respondents who *are working* as per the survey data. It can be read keeping the following couple of points in mind:
- Each **point is a cluster of respondents** identified by the size of the company they work in and the number of employees responsible for data science workloads at the company
- The **size of the point represents the number of respondents** in such a cluster
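
The way each point comes about can be sketched as a group-and-count over the two workplace characteristics. The rows below are invented purely for illustration; only the two column labels follow the ones used in the chart.

```python
import pandas as pd

# Toy responses (made-up rows, not survey data)
responses = pd.DataFrame(
    {
        "Company size": [
            "0-49 employees",
            "0-49 employees",
            "0-49 employees",
            "> 10,000 employees",
        ],
        "Data science team size": ["0", "0", "1-2", "20+"],
    }
)

# One row per (company size, team size) cluster; "Count" drives the point size
clusters = (
    responses.groupby(["Company size", "Data science team size"])
    .size()
    .reset_index(name="Count")
)
print(clusters)
```

In this toy example the largest point would be the cluster of two respondents in companies with 0-49 employees and no data science team.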

💡 **Insights**

- Across both years, **most young Indian respondents work in small companies (less than 50 employees)** where there are **no employees in charge of data science workloads**!
    - This means that most **young Indian data science aspirants** could be **working in jobs which don't need to use data science for their operations**
    - There is also a possibility that they are working in companies that use data science but have **no immediate data science teams**, and hence the IRU21 respondents might **not have opportunities to be mentored**
- **Large companies** (with over 10,000 employees) tend to have **more people in charge of data science workloads**
- There are far more young Indian respondents working in smaller companies (0-49 employees) in 2020 than in 2019
    - This could be attributed to sampling differences, but what if there was a story to this?
        - For example, **what if there is a rise in young Indian data science aspirants finding work early in smaller establishments to gain some hands-on experience in the industry?**
        - It's worth a thought!
        
🎯 **Important Point**

The challenges of IRU21 respondents seeking an early career break in data science include not finding companies that use data science or getting into companies where there is limited opportunity to receive mentorship.

In [None]:
# 2018
flow_18 = d_18.groupby(
    [
        "Does your current employer incorporate machine learning methods into their business?",
        "For how many years have you used machine learning methods (at work or in school)?",
    ],
    as_index=False,
).agg({"Duration (in seconds)": "count"})
flow_18.columns = ["ML used in work", "Experience with ML", "Count"]
flow_18 = flow_18[
    (flow_18["ML used in work"] != "Missing")
    & (flow_18["ML used in work"] != "I do not know")
]
flow_18["ord"] = flow_18["Experience with ML"].replace(
    {
        "Never": 0,
        "< 1 year": 1,
        "1-2 years": 2,
        "2-3 years": 3,
        "3-4 years": 4,
        "4-5 years": 5,
        "5+ years": 6,
    }
)
flow_18 = flow_18.sort_values(by="ord", ascending=True)
flow_18

# 2019
flow_19 = d_19.groupby(
    [
        "Does your current employer incorporate machine learning methods into their business?",
        "For how many years have you used machine learning methods?",
    ],
    as_index=False,
).agg({"Duration (in seconds)": "count"})
flow_19.columns = ["ML used in work", "Experience with ML", "Count"]
flow_19 = flow_19[(flow_19["ML used in work"] != "I do not know")]
flow_19["ord"] = flow_19["Experience with ML"].replace(
    {
        "Never": 0,
        "< 1 year": 1,
        "1-2 years": 2,
        "2-3 years": 3,
        "3-4 years": 4,
        "4-5 years": 5,
        "5+ years": 6,
    }
)
flow_19 = flow_19.sort_values(by="ord", ascending=True)

# 2020
flow_20 = d_20.groupby(
    [
        "Does your current employer incorporate machine learning methods into their business?",
        "For how many years have you used machine learning methods?",
    ],
    as_index=False,
).agg({"Duration (in seconds)": "count"})
flow_20.columns = ["ML used in work", "Experience with ML", "Count"]
flow_20 = flow_20[
    (flow_20["ML used in work"] != "Missing")
    & (flow_20["ML used in work"] != "I do not know")
]
flow_20["ord"] = flow_20["Experience with ML"].replace(
    {
        "Never": 0,
        "< 1 year": 1,
        "1-2 years": 2,
        "2-3 years": 3,
        "3-4 years": 4,
        "4-5 years": 5,
        "5+ years": 6,
    }
)
flow_20 = flow_20.sort_values(by="ord", ascending=True)

In [None]:
def plot_exp_comp(flow, title, color, ticks):
    """
    Plot respondents' ML experience against their company's ML usage,
    with one subplot per level of company ML usage
    """

    # (survey answer, subplot title) for each level of company ML usage
    usage_levels = [
        ("No (we do not use ML methods)", "Company does not use ML"),
        (
            "We are exploring ML methods (and may one day put a model into production)",
            "Company is exploring ML",
        ),
        (
            "We recently started using ML methods (i.e., models in production for less than 2 years)",
            "Company is a new(<2 yr) ML user",
        ),
        (
            "We have well established ML methods (i.e., models in production for more than 2 years)",
            "Company is experienced with ML",
        ),
    ]

    fig = make_subplots(
        len(usage_levels),
        1,
        subplot_titles=[label for _, label in usage_levels],
    )

    # experience levels in their sorted order, shared by all subplots
    exp_levels = list(flow["Experience with ML"].unique())

    for i, (answer, _) in enumerate(usage_levels, start=1):
        # take x and y from the same subset so that every count stays
        # paired with its own experience level, even if a level is absent
        subset = flow[flow["ML used in work"] == answer]
        fig.add_trace(
            go.Bar(
                x=subset["Experience with ML"],
                y=subset["Count"],
                marker_color=color,
            ),
            i,
            1,
        )
        fig.update_xaxes(
            title_text="Respondent's experience with ML",
            categoryorder="array",
            categoryarray=exp_levels,
            row=i,
            col=1,
        )
        fig.update_yaxes(
            title_text="Number of respondents", row=i, col=1, tickvals=ticks
        )

    # enhance template
    fig.update_layout(
        template="simple_white",
        font_family="Rockwell",
        title={
            "text": "Company's ML usage vs respondent's ML experience - " + title,
            "font": {"size": 20},
            "yanchor": "top",
        },
        height=900,
        showlegend=False,
    )

    fig.show()

<div class='subsubsection'>5.2.3. Is there a relation between a Company's ML usage and Hiring Strategy?</div>

In [None]:
plot_exp_comp(flow_18, '2018', "#6699cc",[0, 50, 100, 150, 200, 250])

In [None]:
plot_exp_comp(flow_19, '2019', "#007fff", [0,5,10,15,20,25,30,35,40,45,50])

In [None]:
plot_exp_comp(flow_20, '2020', "#00308f", [0,10,20,30,40,50,60,70,80,90,100])

<center><strong>Exhibit 5-H a), b), c).</strong> Company's ML usage vs Respondent's ML Experience</center>

💡 **Insights**

- The above visualizations have a very slight embedded pattern
- In general, as a company's familiarity with ML methods increases, its hiring begins to include more people with some experience in ML, OR respondents with some ML experience tend to apply for jobs in companies that use ML methods
    - This can be noticed in the slight concentration of bar heights towards the right as the company's experience increases. This is noticeable every year.
    - However, this is just a broad way of looking at it (so broad that there might not even be a pattern)
    - An **individual company or an Indian respondent under 21 years need not conform to this**
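
One way to put a number on the "concentration towards the right" is a count-weighted mean of the experience order for each level of company ML usage. The snippet below is a rough sketch of that idea, not a calculation taken from the analysis above: the table mimics the shape of the `flow_*` frames, but the counts and the shortened usage labels are invented.

```python
import pandas as pd

# Toy table in the shape of the flow_* frames; counts and labels are made up
flow = pd.DataFrame(
    {
        "ML used in work": [
            "No ML", "No ML", "No ML",
            "Established ML", "Established ML", "Established ML",
        ],
        "ord": [1, 2, 3, 1, 2, 3],  # larger value = more ML experience
        "Count": [6, 3, 1, 2, 3, 5],
    }
)

# Sum of (order x count) and of count per level of company ML usage
weighted = (
    flow.assign(weighted_ord=flow["ord"] * flow["Count"])
    .groupby("ML used in work")[["weighted_ord", "Count"]]
    .sum()
)

# Count-weighted mean experience order per level of company ML usage
mean_ord = weighted["weighted_ord"] / weighted["Count"]
print(mean_ord)
```

A higher mean for companies with more established ML usage would be consistent with the pattern described, though it would still say nothing about which of the two explanations is at work.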

<div class='subsubsection'>5.2.4. Annual Compensation for Young Indian DS Aspirants who are employed</div>

In [None]:
def get_median(x):
    """
    Get the midpoint of a compensation interval such as "$1,000-1,999".
    Returns "Undisclosed" when the respondent gave no figure.
    """

    # strip currency symbols, separators and open-ended markers
    x = x.replace("$", "").replace("> ", "").replace(",", "").replace("+", "")

    if (
        x == "I do not wish to disclose my approximate yearly compensation"
        or x == "Missing"
    ):
        return "Undisclosed"
    elif len(x.split("-")) != 2:
        # open-ended bracket: use the single bound itself
        return int(x)
    else:
        # closed bracket: midpoint of the lower and upper bounds
        return (int(x.split("-")[1]) + int(x.split("-")[0])) / 2


# 2018
d_18["What is your current yearly compensation (approximate $USD)?"] = d_18[
    "What is your current yearly compensation (approximate $USD)?"
].fillna("Missing")
d_18["comp_order"] = d_18[
    "What is your current yearly compensation (approximate $USD)?"
].apply(get_median)

# 2019
d_19["What is your current yearly compensation (approximate $USD)?"] = d_19[
    "What is your current yearly compensation (approximate $USD)?"
].fillna("Missing")
d_19["comp_order"] = d_19[
    "What is your current yearly compensation (approximate $USD)?"
].apply(get_median)

# 2020
d_20["What is your current yearly compensation (approximate $USD)?"] = d_20[
    "What is your current yearly compensation (approximate $USD)?"
].fillna("Missing")
d_20["comp_order"] = d_20[
    "What is your current yearly compensation (approximate $USD)?"
].apply(get_median)

# aggregate compensation data for 3 years
# (.copy() so that adding the "year" column does not write into a slice)
comp_18 = d_18[
    ["What is your current yearly compensation (approximate $USD)?", "comp_order"]
].copy()
comp_18["year"] = "2018"

comp_19 = d_19[
    ["What is your current yearly compensation (approximate $USD)?", "comp_order"]
].copy()
comp_19["year"] = "2019"

comp_20 = d_20[
    ["What is your current yearly compensation (approximate $USD)?", "comp_order"]
].copy()
comp_20["year"] = "2020"

comp = pd.concat([comp_18, comp_19, comp_20]).reset_index(drop=True)
comp = comp[comp["comp_order"] != "Undisclosed"].copy()
comp["comp_order"] = comp["comp_order"].astype("int")
comp = comp.sort_values("comp_order", ascending=True).reset_index(drop=True)
comp.columns = ["Compensation", "Order", "Year"]

# prep data for viz


def comp_plot_data(year):
    """
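    Count respondents per compensation bracket for a given survey year,
    ordered by bracket midpoint (the count is stored in the "Year" column)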
    """

    comp_yr = comp[comp["Year"] == year]
    comp_yr = (
        comp_yr.groupby(["Compensation", "Order"], as_index=False)
        .agg({"Year": "count"})
        .sort_values(by="Order", ascending=True)
    )

    return comp_yr


amt_18 = comp_plot_data("2018")
amt_19 = comp_plot_data("2019")
amt_20 = comp_plot_data("2020")

In [None]:
# data prep
# 2019 vs 2020 compensation

amt_19["less_1k"] = amt_19["Compensation"].apply(lambda x: 1 if x == "$0-999" else 0)
amt_20["less_1k"] = amt_20["Compensation"].apply(lambda x: 1 if x == "$0-999" else 0)

amt_19_t = comp_19[comp_19["comp_order"] != "Undisclosed"].shape[0]
amt_19_l = amt_19[amt_19["less_1k"] == 1]["Year"].sum()

amt_20_t = comp_20[comp_20["comp_order"] != "Undisclosed"].shape[0]
amt_20_l = amt_20[amt_20["less_1k"] == 1]["Year"].sum()

# data viz

labels = ["Less than 1,000 USD", "More than 1,000 USD"]
colors = ["#6699cc", "#00308f"]

fig = make_subplots(
    rows=1,
    cols=2,
    specs=[[{"type": "domain"}, {"type": "domain"}]],
    subplot_titles=["2019 Kaggle Survey", "2020 Kaggle Survey"],
)
fig.add_trace(
    go.Pie(labels=labels, values=[amt_19_l, amt_19_t - amt_19_l], name="2019"), 1, 1
)
fig.add_trace(
    go.Pie(labels=labels, values=[amt_20_l, amt_20_t - amt_20_l], name="2020"), 1, 2
)

fig.update_traces(
    hole=0.4,
    hoverinfo="label+value+name",
    marker=dict(colors=colors, line=dict(color="#000000", width=2)),
)

# enhance template
fig.update_layout(
    template="simple_white",
    font_family="Rockwell",
    title={
        "text": "How much do young Indian respondents earn annually?",
        "font": {"size": 20},
        "yanchor": "top",
    },
    # xaxis_title="",
    # yaxis_title="Percentage of respondents",
    showlegend=True,
)

fig.show()


<center><strong>Exhibit 5-I.</strong> How much do young Indian respondents earn annually?</center>

💡 **Insights**

- There has **been a decrease of almost 20% in the respondents who earn more than 1,000 USD annually** from the 2019 survey to the 2020 survey
    - This could mean that more respondents from India under the age of 21 years are **choosing to gain some early experience in the industry irrespective of the pay they get!**
    
🔎 **Personal Interpretation**

Based on my experience, 18-21 year olds usually do not get employed full-time. Hence, the annual figures in the dataset could most likely be extrapolations of what was originally the compensation for 3 months, 6 months or whatever period the employment lasted (which is likely to be less than a year).
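
To make this interpretation concrete, here is a small worked example of such an extrapolation. It is only a sketch: the helper `annualize` and the amounts are hypothetical, and nothing in the survey data confirms that respondents actually computed their answers this way.

```python
def annualize(pay_for_period, months_worked):
    """Scale the pay earned over `months_worked` months to a 12-month figure."""
    return pay_for_period * 12 / months_worked


# Hypothetical case: a 3-month internship paying 300 USD in total
print(annualize(300, 3))  # 1200.0
```

Under this reading, a respondent who earned 300 USD over a 3-month internship would land in a bracket above 1,000 USD, even though the money actually received was well below that.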

<a id='6'></a>
<div class='section'>6. The Challenges of the IRU21 Community</div>

There is no community without its challenges. Similarly, IRU21 respondents too face a few concerns in their journey to become data scientists. This section is a reminder of what some of those challenges are and how they inhibit the IRU21 community.

Before diving into the section, I want to tell you a story. It's short, don't worry.

Or wait, I shall do you one better - It's pictorial!

![](https://github.com/ry05/kag_survey_analysis/blob/main/challenges.png?raw=true)

> Does the story of X resemble your own personal story?

Data science is no doubt exciting. However, the large amount of talk around the discipline can be very unsettling for beginners. Since this body of work concerns itself with young Indians who are looking to break into data science, it becomes important to first recognize the challenges a data science beginner could potentially have to face when starting out.

While this section can in no way discover or account for every challenge a data science aspirant will face, I make an attempt to include at least the most relevant ones.

<div class='subsection'>6.1. A primer before you read on</div>

For the purpose of the analysis in this section, I have also considered another subset of responses from the surveys. This subset shall be referred to as **Non-IRU21** and will include all responses from the surveys which do not belong to the IRU21 community.

This means that the Non-IRU21 community will include all respondents who are older than 21 years (irrespective of their country). 

In [None]:
# rest of world data subsets
row_18 = data_18[data_18["What is your age (# years)?"] != "18-21"]
row_19 = data_19[data_19["What is your age (# years)?"] != "18-21"]
row_20 = data_20[data_20["What is your age (# years)?"] != "18-21"]

# iru21 data subsets
d_18 = data_18[
    (data_18["What is your age (# years)?"] == "18-21")
    & (data_18["In which country do you currently reside?"] == "India")
]
d_19 = data_19[
    (data_19["What is your age (# years)?"] == "18-21")
    & (data_19["In which country do you currently reside?"] == "India")
]
d_20 = data_20[
    (data_20["What is your age (# years)?"] == "18-21")
    & (data_20["In which country do you currently reside?"] == "India")
]

In [None]:
freq_roles = top_n(
    row_20,
    "Select the title most similar to your current role (or most recent title if retired): - Selected Choice",
    20,
)

# plot
colors = [
    "#00308f",
] * 20
colors[1] = colors[5] = colors[6] = colors[7] = colors[8] = "#ff3333"

fig = go.Figure(
    data=[
        go.Bar(
            x=freq_roles["Current role"],
            y=freq_roles["Number of respondents"],
            marker_color=colors,
        )
    ]
)

# enhance template
fig.update_layout(
    template="simple_white",
    font_family="Rockwell",
    title={
        "text": "Most common job roles(non-IRU21) in 2020 Survey",
        "font": {"size": 20},
        "yanchor": "top",
    },
    height=500,
    showlegend=False,
)

fig.show()

<center><strong>Exhibit 6-A.</strong> Most common job roles(non-IRU21) in 2020 Survey</center>

This section deals with the challenges of the IRU21 community. In this context, a challenge refers to a hurdle in a typical IRU21 respondent's path towards achieving a **goal**. So, what is this goal?

A reasonable hypothesis based on the context of this notebook would be that the *IRU21 respondent's goal is to become a data scientist*. However, since "data scientist" is not a universally well-defined term (Donoho, 2017), let us assume that the actual goal is to work in a *position where the principles of data science and machine learning are used regularly*. As per the figure above, 8 roles (excluding Student) qualify for this condition.

For the purpose of more reliable analysis, I have chosen 5 roles from the original 8 as aspirational employment positions for an IRU21 respondent (these are depicted in red). Thus, **I am making an assumption here that it is highly likely that a typical IRU21 respondent will want to work in one of the 5 roles in red.**

So, this section essentially takes the form of a comparison between the trends of respondents working in any of these 5 roles and the trends of IRU21 respondents. This could help identify key differences between the two groups and thus identify ways in which IRU21 respondents can become more like the former.

📝 **Note**  
For the sake of analysis, **Non-IRU21 respondents have now further been filtered into only those respondents whose most recent employment at the time of the respective survey was one of Data Scientist, Data Analyst, Research Scientist, Machine Learning Engineer or Business Analyst.**
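The filter described in this note can be sketched on toy data as follows. This is only an illustration in plain Python: the `respondents` records and the names `TARGET_ROLES` and `keep_target_roles` are made up for this sketch, while the notebook itself applies the same idea to the survey dataframes.

```python
# Hypothetical toy records; the real notebook filters the Kaggle survey dataframes.
TARGET_ROLES = {
    "Data Scientist",
    "Data Analyst",
    "Research Scientist",
    "Machine Learning Engineer",
    "Business Analyst",
}


def keep_target_roles(respondents):
    """Keep only respondents whose most recent role is one of the 5 target roles."""
    return [r for r in respondents if r["role"] in TARGET_ROLES]


respondents = [
    {"id": 1, "role": "Data Scientist"},
    {"id": 2, "role": "Student"},
    {"id": 3, "role": "Business Analyst"},
    {"id": 4, "role": "Software Engineer"},
]

filtered = keep_target_roles(respondents)
print([r["id"] for r in filtered])
```

Keeping the five role names in one place means the same filter can be reused for every survey year.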

<div class='subsection'>6.2. The 3 Main Challenges</div>

From the pictorial story of X the cat, we understand that there do exist quite a few challenges in the path of an aspiring data scientist. Now, what if we group these different challenges into 3 main challenges? 

According to the categorization in this work, the 3 main challenges are:
- **Learning Challenges** => What concepts and methods must an individual understand well before being able to contest for a data science role?
- **Tool Challenges** => What tools must an individual familiarize oneself with to do good data science?
- **Project Challenges** => How should an individual work on and showcase their projects to maximize their chances of breaking into data science?

These challenges shall be studied within the scope of the available Kaggle Survey 2020 data. This has been done to ensure that a *contemporary view of the challenges* is taken. Therefore, all visualizations are created from the 2020 Kaggle survey data, unless specified otherwise in the title.

🔎 **Personal Interpretation**

When I started my stint with data science, I was very overwhelmed. It started with first identifying what to learn! Once that was identified, the next challenge was which language to learn - R or Python? This struggle of choosing between languages or IDEs is a **very real struggle, especially if the individual does not have a background in programming**. Finally, it is taxing to work on a project and make it polished for public sharing. A lot of aspirants do not do this because of the **laborious amount of work involved in making an analysis palatable and interpretable** for other readers.

📝 **Note**

These 3 main challenges will require much further study apart from the use of the survey data in order to understand them completely. However, that is outside the scope of this notebook currently.

<div class='subsection'>6.3. Learning Challenges</div>

The field of data science has always been growing at a remarkable pace (Provost and Fawcett, 2013). In order to keep up with this growth and the competition that comes with it, IRU21 respondents have to keep their learning up to speed.

Learning challenges concern the concepts and methods an individual must understand well before attempting to land a data-based role. This section takes a brief look at topics related to these challenges, based on the Kaggle surveys.

In [None]:
def multi_ops(df, ques):
    """
    Get data of questions with multiple options
    """

    # filter out only those columns relating to the ques
    resp = []
    for col in df.columns:
        if ques in col:
            resp.append(col)
    # rename columns
    resp2 = [name.replace(ques + " - Selected Choice - ", "") for name in resp]
    temp = df[resp]
    temp.columns = resp2

    return temp


def make_sparse(df):
    """
    Code dataframe to 1s and 0s
    (best used after multi_ops)
    """

    # code the dataset with 1s and 0s
    # 1 if value exists
    # 0 if it does not

    return df.notnull().astype("int")


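def agg_data(df):
    """
    Count selections per option and keep the 10 most selected
    (best used after make_sparse)
    """

    header = []
    values = []
    for col in df.columns:
        header.append(col)
        # column sum = number of respondents who selected this option
        values.append(df[col].sum())
    # the generic column names below are the ones the plotting code reads
    new_df = (
        pd.DataFrame({"Current role": header, "Number of respondents": values})
        .sort_values(by="Number of respondents", ascending=False)
        .head(10)
        .iloc[::-1]  # reverse so the largest bar ends up on top in horizontal plots
    )
    return new_df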


def iru21_vs_rest_plot(iru21, non_iru21, h1=[], h2=[], title="You forgot me!"):
    """
    Plot bar plots to compare IRU21 and non-IRU21 responses

    iru21: Dataframe with iru21 responses
    non_iru21: Dataframe with non iru21 responses
    h1: Indices (0-9) of the bars to be highlighted in iru21
    h2: Indices (0-9) of the bars to be highlighted in non iru21
    title: Title of the figure
    """

    # visualize
    fig = make_subplots(
        2,
        1,
        specs=[[{}], [{}]],
        subplot_titles=("Non-IRU21 Respondents", "IRU21 Respondents"),
    )

    # non iru 21 plots
    colors = ["#dca917"] * 10
    for bar in h2:
        colors[bar] = "#FF3333"
    fig.add_trace(
        go.Bar(
            y=non_iru21["Current role"],
            x=non_iru21["Number of respondents"],
            name="Non-IRU21",
            marker_color=colors,
            orientation="h",
        ),
        row=1,
        col=1,
    )
    fig.update_xaxes(title_text="Number of respondents", row=1, col=1)

    # iru 21 plots
    colors = ["#008abc"] * 10
    for bar in h1:
        colors[bar] = "#FF3333"
    fig.add_trace(
        go.Bar(
            y=iru21["Current role"],
            x=iru21["Number of respondents"],
            name="IRU21",
            marker_color=colors,
            orientation="h",
        ),
        row=2,
        col=1,
    )
    fig.update_xaxes(title_text="Number of respondents", row=2, col=1)

    # enhance template
    fig.update_layout(
        template="simple_white",
        font_family="Rockwell",
        title={
            "text": title,
            "font": {"size": 20},
            "yanchor": "top",
        },
        xaxis_title="",
        # yaxis_title="Percentage of respondents",
        height=600,
        showlegend=False,
    )

    fig.show()

<div class='subsubsection'>6.3.1. Does coding experience decide job title?</div>

The key idea of this part is to identify whether there is an association between the experience a respondent has in writing code and the job role they eventually get.

To understand this, I use the non-IRU21 data subset and apply a [Chi-square test for independence](https://www.youtube.com/watch?v=ZjdBM7NO7bY) to it.

In [None]:
# data prep

com_roles_18 = row_18[
    row_18[
        "Select the title most similar to your current role (or most recent title if retired): - Selected Choice"
    ].isin(
        [
            "Data Scientist",
            "Data Analyst",
            "Research Scientist",
            "Machine Learning Engineer",
            "Business Analyst",
        ]
    )
]
com_roles_19 = row_19[
    row_19[
        "Select the title most similar to your current role (or most recent title if retired): - Selected Choice"
    ].isin(
        [
            "Data Scientist",
            "Data Analyst",
            "Research Scientist",
            "Machine Learning Engineer",
            "Business Analyst",
        ]
    )
]
com_roles_20 = row_20[
    row_20[
        "Select the title most similar to your current role (or most recent title if retired): - Selected Choice"
    ].isin(
        [
            "Data Scientist",
            "Data Analyst",
            "Research Scientist",
            "Machine Learning Engineer",
            "Business Analyst",
        ]
    )
]


In [None]:
def chi2_test_specs(crosstab, print_op=False):
    """
    Print output of chi2 test in an easy-to-interpret format
    """

    # run the chi-square test of independence on the contingency table
    stats_op = stats.chi2_contingency(crosstab)

    print(f"Pearson Chi2 value: {stats_op[0]}")
    print(f"p-value: {stats_op[1]}")
    if stats_op[1] < 0.05:
        # null hypothesis of independence rejected at the 5% level
        print("There is an association between the two variables")
    else:
        print("There is no evidence of an association between the two variables")
    if print_op:
        print("\nOriginal output from stats.chi2_contingency")
        print(stats_op)

In [None]:
# role vs experience

crosstab = pd.crosstab(
    com_roles_20[
        "Select the title most similar to your current role (or most recent title if retired): - Selected Choice"
    ],
    com_roles_20["For how many years have you been writing code and/or programming?"],
)
chi2_test_specs(crosstab, False)

📝 **Note**

The above output is the output of a **Chi2 test for independence**. If you don't know what that means yet, think of it as a statistical test that helps identify whether there is a meaningful association between job title and coding experience.

The p-value is the probability of seeing an association at least as strong as the one in our data purely by chance, if job title and coding experience were in fact unrelated. Here, the p-value is very small (<0.05). Hence, it is very unlikely that the association we see is a coincidence!
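To make the test more concrete, here is a minimal sketch of the same calculation on a made-up 2x3 table. It is written in plain Python and does not use `stats.chi2_contingency`, which the notebook relies on; the counts and the function name `chi2_independence` are assumptions for illustration only.

```python
import math

# Made-up counts (not survey data): rows are two job titles,
# columns are three coding-experience buckets.
observed = [
    [30, 20, 10],
    [10, 20, 30],
]


def chi2_independence(table):
    """Return (chi2 statistic, degrees of freedom) for a contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)

    chi2 = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            # count expected in this cell if rows and columns were independent
            expected = row_totals[i] * col_totals[j] / grand_total
            chi2 += (obs - expected) ** 2 / expected

    dof = (len(table) - 1) * (len(table[0]) - 1)
    return chi2, dof


chi2, dof = chi2_independence(observed)

# For exactly 2 degrees of freedom the chi-square tail probability has the
# closed form exp(-x / 2); other table shapes need a statistics library.
p_value = math.exp(-chi2 / 2)

print(chi2, dof, p_value)
```

The further the observed counts drift from the counts expected under independence, the larger the statistic and the smaller the p-value.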

💡 **Insight**

There is an association between coding experience and job title at the 5% significance level, i.e. with 95% confidence (as the p-value is less than 0.05).

But the next question then is **what exactly is this association**? This is answered below.

In [None]:
# what exactly is the association?

crosstab_norm = pd.crosstab(
    com_roles_20[
        "Select the title most similar to your current role (or most recent title if retired): - Selected Choice"
    ],
    com_roles_20["For how many years have you been writing code and/or programming?"],
    normalize="index",
)
crosstab_norm = crosstab_norm[
    [
        "I have never written code",
        "< 1 years",
        "1-2 years",
        "3-5 years",
        "5-10 years",
        "10-20 years",
        "20+ years",
    ]
]
sns.heatmap(crosstab_norm, cmap="Blues", annot=True, cbar=False, linewidths=0.5)
plt.title("Association between coding experience and job title")
plt.xlabel("Coding experience(in years)")
plt.ylabel("Most recent job title")
plt.show()

<center><strong>Exhibit 6-B.</strong> Association between coding experience and job title</center>

💡 **Insights**

- A **high level of coding experience** (3-10 years) is necessary when it comes to **Data Scientist and ML Engineer** roles
    - 28% of each of these roles have 3-5 years of experience in coding
- **Moderate level** of coding experience (1-5 years) works fine with Data **Analyst roles**
- **Research Scientist** roles tend to be taken up by those with **very high coding experience**
    - 16% of those working as research scientists have 20+ years of coding experience, which is the highest among all these 5 roles
- 36% of **business analysts** have under 1 year of experience writing code out of which 12% of them have never written code
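The percentages above are read row by row: because the crosstab is built with `normalize="index"`, every value is a share of its own role, so each row of the heatmap sums to 1. A minimal sketch of that row normalization in plain Python, with made-up counts and a hypothetical helper `row_shares`:

```python
# Made-up counts (not survey data): how many people in each role fall
# into each coding-experience bucket.
counts = {
    "Data Analyst": {"< 1 years": 20, "1-2 years": 50, "3-5 years": 30},
    "Data Scientist": {"< 1 years": 10, "1-2 years": 30, "3-5 years": 60},
}


def row_shares(table):
    """Divide every count by its row total, so each row sums to 1."""
    shares = {}
    for role, buckets in table.items():
        total = sum(buckets.values())
        shares[role] = {bucket: n / total for bucket, n in buckets.items()}
    return shares


shares = row_shares(counts)
print(shares["Data Scientist"]["3-5 years"])  # share of data scientists in this bucket
```

Row shares make roles of very different sizes comparable, which raw counts would not.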

🎯 **Important Point**

According to the above analysis, it seems easier to get into an Analyst role than into Data Scientist or ML roles when coding experience is low. Therefore, analyst could be a job profile that freshers could seek, if they find it interesting.

<div class='subsubsection'>6.3.2. What programming languages should an aspirant learn?</div>

This is a very commonly asked question that has seen several well-formed debates. The languages most constantly at the centre of this virtual battleground in data science are Python and R (Ozgur et al., 2017).

There are plenty of articles that provide suggestions on why one language is better suited than another. But these articles often compare languages based on the individual attributes and features of each of them. In this section, I attempt to analyze the same topic, but from the vantage point of *what is used and preferred in the industry*.

📝 **Note**

Programming languages could easily fall into the Tool Challenges section. However, I choose to include them here because, for many entering the field of data science, programming is not generally a part of their background.

First, let's understand the **recommendations non-IRU21 respondents make for beginners** to learn as their first language.

In [None]:
# data prep
row_18_prog_asp = top_n(
    row_18,
    "What programming language would you recommend an aspiring data scientist to learn first? - Selected Choice",
    10,
)
row_19_prog_asp = top_n(
    row_19,
    "What programming language would you recommend an aspiring data scientist to learn first? - Selected Choice",
    10,
)
row_20_prog_asp = top_n(
    row_20,
    "What programming language would you recommend an aspiring data scientist to learn first? - Selected Choice",
    10,
)

# visualize
fig = make_subplots(
    3,
    1,
    specs=[[{}], [{}], [{}]],
    subplot_titles=("2018 Kaggle Survey", "2019 Kaggle Survey", "2020 Kaggle Survey"),
)

colors = ["#6699cc"] * 10
colors[0] = colors[1] = colors[2] = "#FF3333"
fig.add_trace(
    go.Bar(
        x=row_18_prog_asp["Current role"],
        y=row_18_prog_asp["Number of respondents"],
        name="2018",
        marker_color=colors,
    ),
    row=1,
    col=1,
)
fig.update_yaxes(title_text="Number of respondents", row=1, col=1)
fig.update_xaxes(title_text="Recommendation for aspiring data scientists", row=1, col=1)

colors = ["#007fff"] * 10
colors[0] = colors[1] = colors[2] = "#FF3333"
fig.add_trace(
    go.Bar(
        x=row_19_prog_asp["Current role"],
        y=row_19_prog_asp["Number of respondents"],
        name="2019",
        marker_color=colors,
    ),
    row=2,
    col=1,
)
fig.update_yaxes(title_text="Number of respondents", row=2, col=1)
fig.update_xaxes(title_text="Recommendation for aspiring data scientists", row=2, col=1)

colors = ["#00308f"] * 10
colors[0] = colors[1] = colors[2] = "#FF3333"
fig.add_trace(
    go.Bar(
        x=row_20_prog_asp["Current role"],
        y=row_20_prog_asp["Number of respondents"],
        name="2020",
        marker_color=colors,
    ),
    row=3,
    col=1,
)
fig.update_yaxes(title_text="Number of respondents", row=3, col=1)
fig.update_xaxes(title_text="Recommendation for aspiring data scientists", row=3, col=1)


# enhance template
fig.update_layout(
    template="simple_white",
    font_family="Rockwell",
    title={
        "text": "The Big 3 according to non-IRU21 responses",
        "font": {"size": 20},
        "yanchor": "top",
    },
    xaxis_title="",
    # yaxis_title="Percentage of respondents",
    height=900,
    showlegend=False,
)

fig.show()

<center><strong>Exhibit 6-C.</strong> The Big 3 according to non-IRU21 responses</center>

💡 **Insights**

- Python wins the race by a very heavy margin
- R and SQL take the next two places
- It is also worth noticing how C has become more prominent as a recommendation from non-IRU21 respondents over the years
    - Is this because C is being used increasingly now for data science?
    - Is this because of C being recognized as a good language to grasp core programming concepts in recent years?
    - Is this because of the fall in popularity of other languages?
    - All relevant questions to ask!
    
    
Now, let's see what **IRU21 respondents recommend fellow beginners** to learn as their first programming language.

In [None]:
# data prep
row_18_prog_asp = top_n(
    d_18,
    "What programming language would you recommend an aspiring data scientist to learn first? - Selected Choice",
    10,
)
row_19_prog_asp = top_n(
    d_19,
    "What programming language would you recommend an aspiring data scientist to learn first? - Selected Choice",
    10,
)
row_20_prog_asp = top_n(
    d_20,
    "What programming language would you recommend an aspiring data scientist to learn first? - Selected Choice",
    10,
)

# visualize
fig = make_subplots(
    3,
    1,
    specs=[[{}], [{}], [{}]],
    subplot_titles=("2018 Kaggle Survey", "2019 Kaggle Survey", "2020 Kaggle Survey"),
)

colors = ["#6699cc"] * 10
colors[0] = colors[1] = colors[2] = "#FF3333"
fig.add_trace(
    go.Bar(
        x=row_18_prog_asp["Current role"],
        y=row_18_prog_asp["Number of respondents"],
        name="2018",
        marker_color=colors,
    ),
    row=1,
    col=1,
)
fig.update_yaxes(title_text="Number of respondents", row=1, col=1)
fig.update_xaxes(title_text="Recommendation for aspiring data scientists", row=1, col=1)

colors = ["#007fff"] * 10
colors[0] = colors[1] = colors[2] = "#FF3333"
fig.add_trace(
    go.Bar(
        x=row_19_prog_asp["Current role"],
        y=row_19_prog_asp["Number of respondents"],
        name="2019",
        marker_color=colors,
    ),
    row=2,
    col=1,
)
fig.update_yaxes(title_text="Number of respondents", row=2, col=1)
fig.update_xaxes(title_text="Recommendation for aspiring data scientists", row=2, col=1)

colors = ["#00308f"] * 10
colors[0] = colors[1] = colors[2] = "#FF3333"
fig.add_trace(
    go.Bar(
        x=row_20_prog_asp["Current role"],
        y=row_20_prog_asp["Number of respondents"],
        name="2020",
        marker_color=colors,
    ),
    row=3,
    col=1,
)
fig.update_yaxes(title_text="Number of respondents", row=3, col=1)
fig.update_xaxes(title_text="Recommendation for aspiring data scientists", row=3, col=1)

# enhance template
fig.update_layout(
    template="simple_white",
    font_family="Rockwell",
    title={
        "text": "IRU21 respondents choose C++ over SQL",
        "font": {"size": 20},
        "yanchor": "top",
    },
    xaxis_title="",
    # yaxis_title="Percentage of respondents",
    height=900,
    showlegend=False,
)

fig.show()

<center><strong>Exhibit 6-D.</strong> IRU21 respondents choose C++ over SQL</center>

💡 **Insights**

- Python still wins (duh!). But, there is a surprise - SQL has been **dethroned**!
- A good number of IRU21 respondents recommend C++, rather than SQL, as the first language an aspiring data scientist should learn
    - Interesting observation as non-IRU21 respondents recommend SQL over C++ in general
- Julia makes an entry into the top 10 in the 2020 survey, just as it did in the case of non-IRU21 respondents
- Among IRU21 respondents, MATLAB is seeing waning favour
    - It has been dropping down the top 10 list every year
    - Is there an anti-MATLAB sentiment among young Indian data science aspirants? **(I sure have seen it in my peer group!)**
  
  
🎯 **Important Point**

IRU21 respondents recommend C++ over SQL. While this seems odd at first, there could be a logical explanation for this. C++ is what IRU21 respondents generally study as a part of their curriculum in undergraduate programming courses. But this is again open to debate. Also, the difference between the counts of those choosing either language is not too large, contrary to the non-IRU21 community.
  
The above two graphs only talk about recommendations. Let's have a look at what languages are *most used on a regular basis* by our respondents by again making the distinction between IRU21 and non-IRU21 respondents.
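The "most used" counts come from a select-all-that-apply question, where every option is stored as its own column. The notebook's `multi_ops`, `make_sparse` and `agg_data` helpers handle this on the survey dataframes; the idea can be sketched in plain Python as below, assuming toy answers and a hypothetical helper `count_selections`.

```python
# Hypothetical toy answers to a "select all that apply" question.
# Each option is its own column: the option name if ticked, None otherwise.
answers = [
    {"Python": "Python", "R": None, "SQL": "SQL"},
    {"Python": "Python", "R": "R", "SQL": None},
    {"Python": "Python", "R": None, "SQL": None},
]


def count_selections(rows):
    """Count, for every option, how many respondents ticked it."""
    totals = {}
    for row in rows:
        for option, value in row.items():
            # adds 1 if the respondent ticked the option, 0 otherwise
            totals[option] = totals.get(option, 0) + (value is not None)
    return totals


totals = count_selections(answers)
# most selected options first
ranking = sorted(totals.items(), key=lambda item: item[1], reverse=True)
print(ranking)
```

Since a respondent can tick several options, these counts can add up to more than the number of respondents.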

In [None]:
temp = multi_ops(
    com_roles_20,
    "What programming languages do you use on a regular basis? (Select all that apply)",
)
temp = make_sparse(temp)
temp_non_iru21 = agg_data(temp)

temp = multi_ops(
    d_20,
    "What programming languages do you use on a regular basis? (Select all that apply)",
)
temp = make_sparse(temp)
temp_iru21 = agg_data(temp)

iru21_vs_rest_plot(
    temp_iru21,
    temp_non_iru21,
    h1=[0, 2, 3],
    h2=[4, 5, 7],
    title="Most used languages - 2020",
)

<center><strong>Exhibit 6-E.</strong> Most used languages - 2020</center>

💡 **Insights**

- **IRU21 respondents don't use R, MATLAB or Bash as much as non-IRU21 respondents** (who are working in the industry) do
- IRU21 respondents also use C++ and C more in general
    - This could be because of the curriculum structure and recruitment tests, and not just personal preference

🎯 **Important Point**

It could prove beneficial for IRU21 respondents to use R, MATLAB and Bash more and gain some experience with them. However, the choice of language depends on the employer and the task at hand. So, it is not a *must learn* requirement, but it could help in *general*.

<div class='subsubsection'>6.3.3. What machine learning methods should an aspirant learn?</div>

In [None]:
# role vs experience

crosstab = pd.crosstab(
    com_roles_20[
        "Select the title most similar to your current role (or most recent title if retired): - Selected Choice"
    ],
    com_roles_20["For how many years have you used machine learning methods?"],
)
chi2_test_specs(crosstab, False)

📝 **Note**

The above output is the output of a Chi2 test for independence. If you don't know what that means yet, think of it as a statistical test that helps identify whether there is a meaningful association between job title and machine learning experience.

The p-value is the probability of seeing an association at least as strong as the one in our data purely by chance, if job title and machine learning experience were in fact unrelated. Here, the p-value is very small (<0.05). Hence, it is very unlikely that the association we see is a coincidence!

💡 **Insight**

There is an association between machine learning experience and job title at the 5% significance level, i.e. with 95% confidence (as the p-value is less than 0.05).

But the next question then is **what exactly is this association**? This is answered below.

In [None]:
# what exactly is the association?

crosstab_norm = pd.crosstab(
    com_roles_20[
        "Select the title most similar to your current role (or most recent title if retired): - Selected Choice"
    ],
    com_roles_20["For how many years have you used machine learning methods?"],
    normalize="index",
)
crosstab_norm = crosstab_norm[
    [
        "I do not use machine learning methods",
        "Under 1 year",
        "1-2 years",
        "2-3 years",
        "3-4 years",
        "4-5 years",
        "5-10 years",
        "10-20 years",
        "20 or more years",
    ]
]
sns.heatmap(crosstab_norm, cmap="Blues", annot=True, cbar=False, linewidths=0.5)
plt.title("Association between machine learning experience and job title")
plt.xlabel("Machine Learning experience(in years)")
plt.ylabel("Most recent job title")
plt.show()

<center><strong>Exhibit 6-F.</strong> Association between machine learning experience and job title</center>

💡 **Insights**

- If you **do not use ML methods, your best shot is at analyst roles**
- To be a data scientist, ML engineer or research scientist, there needs to be some level of experience using ML methods

Now, let's have a look at which ML algorithms are the most used and whether they differ across the IRU21 and non-IRU21 respondents.

In [None]:
# 2019
temp = multi_ops(
    com_roles_19,
    "Which of the following ML algorithms do you use on a regular basis? (Select all that apply):",
)
temp = make_sparse(temp)
temp = temp.drop(
    [
        "Which of the following ML algorithms do you use on a regular basis? (Select all that apply): - Other - Text"
    ],
    axis=1,
)
temp_non_iru21 = agg_data(temp)

temp = multi_ops(
    d_19,
    "Which of the following ML algorithms do you use on a regular basis? (Select all that apply):",
)
temp = make_sparse(temp)
temp = temp.drop(
    [
        "Which of the following ML algorithms do you use on a regular basis? (Select all that apply): - Other - Text"
    ],
    axis=1,
)
temp_iru21 = agg_data(temp)

iru21_vs_rest_plot(
    temp_iru21,
    temp_non_iru21,
    h1=[7, 8, 9],
    h2=[7, 8, 9],
    title="Most used ML algorithms - 2019",
)

# 2020
temp = multi_ops(
    com_roles_20,
    "Which of the following ML algorithms do you use on a regular basis? (Select all that apply):",
)
temp = make_sparse(temp)
temp_non_iru21 = agg_data(temp)

temp = multi_ops(
    d_20,
    "Which of the following ML algorithms do you use on a regular basis? (Select all that apply):",
)
temp = make_sparse(temp)
temp_iru21 = agg_data(temp)

iru21_vs_rest_plot(
    temp_iru21,
    temp_non_iru21,
    h1=[7, 8, 9],
    h2=[7, 8, 9],
    title="Most used ML algorithms - 2020",
)

<center><strong>Exhibit 6-G a) b).</strong> Most used ML algorithms</center>

💡 **Insights**

- Across both 2019 and 2020 surveys, Linear and Logistic Regression is the most used ML algorithm (for both communities).
- However, there is an interesting difference between IRU21 and non-IRU21 respondents
    - For both years, IRU21 respondents use more CNNs (Convolutional Neural Networks) than GBMs (Gradient Boosting Machines)
    - In the case of non-IRU21 respondents, the reverse is true, i.e. more GBMs are used than CNNs
    - This could be because **IRU21 respondents have more time on hand than non-IRU21 respondents to learn and work on projects that implement CNNs**
    - Another possible reason could be that **IRU21 respondents are more drawn towards the greater power of CNNs, compared to other algorithms, when dealing with unstructured data**
- Bayesian approaches are more in favour with non-IRU21 respondents than IRU21 respondents

<div class='subsubsection'>6.3.4. What data science activity must definitely be learnt by an aspirant?</div>

In [None]:
temp = multi_ops(
    com_roles_20,
    "Select any activities that make up an important part of your role at work: (Select all that apply)",
)
temp = make_sparse(temp)
temp_non_iru21 = agg_data(temp)

temp = multi_ops(
    d_20,
    "Select any activities that make up an important part of your role at work: (Select all that apply)",
)
temp = make_sparse(temp)
temp_iru21 = agg_data(temp)

iru21_vs_rest_plot(
    temp_iru21, temp_non_iru21, h1=[7], h2=[7], title="Most frequent activities at work"
)

<center><strong>Exhibit 6-H.</strong> Most frequent activities at work</center>

💡 **Insights**

- The most important activity an aspirant has to learn is **"Analyzing and understanding data with a business context"**
    - It's already prominent in the IRU21 community too
    - Well of course, without this activity data science has little value
    
🎯 **Important Point**

It is necessary to note that there is a very limited resource pool when it comes to teaching the art of understanding data with context and asking the right questions from it. We see from above that this is the most important duty for data scientists. Therefore, it probably is time to develop an **art of thinking for data science!**

<div class='subsubsection'>6.3.5. What learning platforms are most used?</div>

A learning platform in this context refers to a platform where data science courses are available.

In [None]:
temp = multi_ops(
    com_roles_20,
    "On which platforms have you begun or completed data science courses? (Select all that apply)",
)
temp = make_sparse(temp)
temp_non_iru21 = agg_data(temp)

temp = multi_ops(
    d_20,
    "On which platforms have you begun or completed data science courses? (Select all that apply)",
)
temp = make_sparse(temp)
temp_iru21 = agg_data(temp)

iru21_vs_rest_plot(
    temp_iru21,
    temp_non_iru21,
    h1=[],
    h2=[0],
    title="Most Used Data Science Learning Platforms",
)

<center><strong>Exhibit 6-I.</strong> Most used data science learning platforms</center>

💡 **Insights**

- **Coursera is the most frequent** learning platform for both groups, followed by Udemy and Kaggle Learn Courses
- Fast.ai makes it into the top 10 preferences of non-IRU21 respondents working in a data-based role, while this is not the case with IRU21 respondents
    - [**Fast.ai has a top-down learning method**](https://www.fast.ai/) and hence should be more useful to beginners looking to enter the field
    - However, Kaggle Learn Courses too have a top-down learning method
    - Therefore, it is worth thinking about why fast.ai is not as popular yet among IRU21 respondents while Kaggle Learn is

<div class='subsection'>6.4. Tool Challenges</div>

A point of confusion for several beginners is the idea of *what data science tool to choose!* These dilemmas are not limited to the debates of R or Python, PyTorch or TensorFlow, matplotlib or plotly etc. In fact, some of these can even pertain to the very bare basics such as the choice of IDE!

In this section, I attempt to identify the tool preferences of non-IRU21 respondents and compare them with what the IRU21 respondents favour at present, i.e. in the 2020 survey.

<div class='subsubsection'>6.4.1. What primary tools are most used for analysis?</div>

In [None]:
# data prep
non_iru21_pr_tool = top_n(
    com_roles_20,
    "What is the primary tool that you use at work or school to analyze data? (Include text response) - Selected Choice",
    10,
).iloc[::-1]
iru21_pr_tool = top_n(
    d_20,
    "What is the primary tool that you use at work or school to analyze data? (Include text response) - Selected Choice",
    10,
).iloc[::-1]

# visualize
iru21_vs_rest_plot(
    iru21_pr_tool,
    non_iru21_pr_tool,
    h1=[4, 5],
    h2=[4, 5],
    title="Primary tools used for analysis - 2020 Survey",
)

<center><strong>Exhibit 6-J.</strong> Primary tools used for analysis - 2020 Survey</center>

💡 **Insights**

- Irrespective of whether the respondents are from IRU21 or not, **local development environments** are most used for analysis
- This is **followed by Basic statistical software**

🎯 **Important Point**

Cloud-based tools are not particularly favoured as primary analysis tools. Therefore, it would help **beginners to first gain a firm footing in the use of local development environments and basic statistical software**.

<div class='subsubsection'>6.4.2. What Integrated Development Environments are most favoured?</div>

> "An IDE, or Integrated Development Environment, enables programmers to consolidate the different aspects of writing a computer program." - [Source](https://www.codecademy.com/articles/what-is-an-ide)

In [None]:
temp = multi_ops(
    com_roles_20,
    "Which of the following integrated development environments (IDE's) do you use on a regular basis?  (Select all that apply)",
)
temp = make_sparse(temp)
temp = temp.drop(["Click to write Choice 13"], axis=1)
temp_non_iru21 = agg_data(temp)

temp = multi_ops(
    d_20,
    "Which of the following integrated development environments (IDE's) do you use on a regular basis?  (Select all that apply)",
)
temp = make_sparse(temp)
temp = temp.drop(["Click to write Choice 13"], axis=1)
temp_iru21 = agg_data(temp)

iru21_vs_rest_plot(temp_iru21, temp_non_iru21, h1=[9], h2=[9], title="Most Common IDEs")

<center><strong>Exhibit 6-K.</strong> Most common IDEs</center>

💡 **Insights**

- The **Jupyter** ecosystem is the most used IDE
    - It leads the other IDEs by a wide margin
- **RStudio is highly used among non-IRU21** respondents and **less used among IRU21** respondents

🎯 **Important Point**

While Jupyter notebooks are wonderful tools, there does exist an issue with how they are used. Jupyter notebooks can be worked on in a non-linear fashion and this is one of their main strengths. However, this non-linearity can also lead to inconsistent notebooks if the author is not careful. According to a recent study by [Datalore, 36% of Jupyter notebooks on GitHub are inconsistent](https://blog.jetbrains.com/datalore/2020/12/17/we-downloaded-10-000-000-jupyter-notebooks-from-github-this-is-what-we-learned/).


Many different IDEs are listed, and some suit a given task better than others. So, does the number of IDEs a respondent uses on a regular basis differ between the IRU21 and non-IRU21 groups? Let's find out.

In [None]:
# iru21
temp = multi_ops(
    d_20,
    "Which of the following integrated development environments (IDE's) do you use on a regular basis?  (Select all that apply)",
)
temp = make_sparse(temp)
temp = temp.drop(["Click to write Choice 13"], axis=1)
temp = temp.drop(["None", "Other"], axis=1)
temp["tot"] = temp.sum(axis=1)

iru21 = temp["tot"]

# noniru21
temp = multi_ops(
    com_roles_20,
    "Which of the following integrated development environments (IDE's) do you use on a regular basis?  (Select all that apply)",
)
temp = make_sparse(temp)
temp = temp.drop(["Click to write Choice 13"], axis=1)
temp = temp.drop(["None", "Other"], axis=1)
temp["tot"] = temp.sum(axis=1)

non_iru21 = temp["tot"]

# visualize
fig = go.Figure()
fig.add_trace(go.Box(y=iru21, name="IRU21", marker_color="#008abc"))
fig.add_trace(go.Box(y=non_iru21, name="Non-IRU21", marker_color="#dca917"))

# enhance template
fig.update_layout(
    template="simple_white",
    font_family="Rockwell",
    title={
        "text": "How many IDEs does a respondent use?",
        "font": {"size": 20},
        "yanchor": "top",
    },
    xaxis_title="Category of respondents",
    yaxis_title="Number of IDEs used regularly",
    showlegend=False,
)

fig.show()

<center><strong>Exhibit 6-L.</strong> How many IDEs does a respondent use?</center>

📝 **Note**

The boxplots have been created with data points where each point represents the number of IDEs a particular respondent uses regularly. The options `None` and `Other` have been removed for accurate results.
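
To illustrate how each data point behind the boxplots is formed, here is a minimal sketch on toy data. The one-hot rows, the IDE names and the helper `ides_per_respondent` are hypothetical stand-ins for the survey frame. It assumes each answer is stored as 0 or 1, and it leaves out `None` and `Other` because neither maps to a single, countable IDE.

```python
from statistics import median

# Hypothetical one-hot answers: 1 means the respondent ticked that option.
responses = [
    {"Jupyter": 1, "RStudio": 0, "PyCharm": 1, "None": 0, "Other": 0},
    {"Jupyter": 1, "RStudio": 0, "PyCharm": 0, "None": 0, "Other": 1},
    {"Jupyter": 0, "RStudio": 0, "PyCharm": 0, "None": 1, "Other": 0},
    {"Jupyter": 1, "RStudio": 1, "PyCharm": 1, "None": 0, "Other": 0},
]

# Options that do not correspond to one specific IDE.
EXCLUDED = {"None", "Other"}


def ides_per_respondent(rows):
    """Count the named IDEs ticked by each respondent."""
    return [
        sum(value for option, value in row.items() if option not in EXCLUDED)
        for row in rows
    ]


counts = ides_per_respondent(responses)
print(counts)          # one number per respondent
print(median(counts))  # the centre line a boxplot would draw
```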

💡 **Insights**

- Non-IRU21 respondents tend to use a larger variety of IDEs regularly than IRU21 respondents
    - This could mean that IRU21 respondents tend to stick to a couple of IDEs for their projects
    - Or it could be that non-IRU21 respondents have more experience and hence know more of the IDEs they need at their workplace

<div class='subsubsection'>6.4.3. Which are the most used hosted notebooks?</div>

In [None]:
temp = multi_ops(
    com_roles_20,
    "Which of the following hosted notebook products do you use on a regular basis?  (Select all that apply)",
)
temp = make_sparse(temp)
temp_non_iru21 = agg_data(temp)

temp = multi_ops(
    d_20,
    "Which of the following hosted notebook products do you use on a regular basis?  (Select all that apply)",
)
temp = make_sparse(temp)
temp_iru21 = agg_data(temp)

iru21_vs_rest_plot(
    temp_iru21, temp_non_iru21, h1=[7], h2=[7], title="Most Common Hosted Notebooks"
)

<center><strong>Exhibit 6-M.</strong> Most common hosted notebooks</center>

💡 **Insights**

- Colab and Kaggle notebooks are the most used in both communities
- Also, a substantial portion of respondents in both **do not use any hosted notebook service**, i.e., they probably use only their local systems
    - It could also be that some of these respondents **do not use notebooks** at all and prefer scripts

🎯 **Important Point**

It is thus necessary to teach beginners how to use their local machines efficiently for data science, for example by doing data science at the command line.

<div class='subsubsection'>Which are the most used data viz libraries?</div>

In [None]:
temp = multi_ops(
    com_roles_20,
    "What data visualization libraries or tools do you use on a regular basis?  (Select all that apply)",
)
temp = make_sparse(temp)
temp_non_iru21 = agg_data(temp)

temp = multi_ops(
    d_20,
    "What data visualization libraries or tools do you use on a regular basis?  (Select all that apply)",
)
temp = make_sparse(temp)
temp_iru21 = agg_data(temp)

iru21_vs_rest_plot(
    temp_iru21,
    temp_non_iru21,
    h1=[8, 9],
    h2=[8, 9],
    title="Most Common Data Viz Libraries",
)

<center><strong>Exhibit 6-N.</strong> Most common data visualization libraries</center>

💡 **Insights**

- Matplotlib and Seaborn are the most used

However, to what extent do these two libraries dominate the data viz domain in both non-IRU21 and IRU21 communities?

In [None]:
# non-iru21
temp = multi_ops(
    com_roles_20,
    "What data visualization libraries or tools do you use on a regular basis?  (Select all that apply)",
)
temp = make_sparse(temp)
temp["tot"] = temp.sum(axis=1)
temp["ms"] = temp[" Matplotlib "] + temp[" Seaborn "]
temp["ms_used"] = temp["ms"].apply(lambda x: 1 if x > 0 else 0)

non_ms = temp[temp["ms_used"] == 1].shape[0]
non_non_ms = temp[temp["ms_used"] == 0].shape[0]

# iru21
temp = multi_ops(
    d_20,
    "What data visualization libraries or tools do you use on a regular basis?  (Select all that apply)",
)
temp = make_sparse(temp)
temp["tot"] = temp.sum(axis=1)
temp["ms"] = temp[" Matplotlib "] + temp[" Seaborn "]
temp["ms_used"] = temp["ms"].apply(lambda x: 1 if x > 0 else 0)

iru_ms = temp[temp["ms_used"] == 1].shape[0]
iru_non_ms = temp[temp["ms_used"] == 0].shape[0]

# data viz

labels = ["Matplotlib or Seaborn", "Anything else"]
colors = ["#6699cc", "#00308f"]

fig = make_subplots(
    rows=1,
    cols=2,
    specs=[[{"type": "domain"}, {"type": "domain"}]],
    subplot_titles=["Non-IRU21 Respondents", "IRU21 Respondents"],
)
fig.add_trace(
    go.Pie(labels=labels, values=[non_ms, non_non_ms], name="Non-IRU21"), 1, 1
)
fig.add_trace(go.Pie(labels=labels, values=[iru_ms, iru_non_ms], name="IRU21"), 1, 2)

fig.update_traces(
    hole=0.4,
    hoverinfo="label+value+name",
    marker=dict(colors=colors, line=dict(color="#000000", width=2)),
)

# enhance template
fig.update_layout(
    template="simple_white",
    font_family="Rockwell",
    title={
        "text": "The dominance of Matplotlib and Seaborn",
        "font": {"size": 20},
        "yanchor": "top",
    },
    # xaxis_title="",
    # yaxis_title="Percentage of respondents",
    showlegend=True,
)

fig.show()

<center><strong>Exhibit 6-O.</strong> The dominance of Matplotlib and Seaborn</center>

📝 **Note**

The doughnut chart above compares respondents who use matplotlib or seaborn (or both) on a regular basis with respondents who use neither.

💡 **Insights**

- **3/4 of non-IRU21 respondents** (i.e., those in the industry in a data-related role) use matplotlib or seaborn regularly
- This figure drops to 63% (a drop of roughly 10 percentage points) for the IRU21 community
    - **37% of young Indian data science aspirants in the survey do not use matplotlib or seaborn** (the most basic and probably the most powerful viz libraries)
        - Could this be because of matplotlib's greater complexity when one wants to make good-looking visualizations?
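
When comparing two shares like these, it helps to separate a gap in percentage points from a relative drop. The sketch below computes both on toy data. The rows, the column names (without the padding spaces the survey columns carry) and the helper `share_using_any` are hypothetical, and the numbers are not survey results.

```python
def share_using_any(rows, tools):
    """Fraction of respondents who ticked at least one tool in `tools`."""
    used = sum(1 for row in rows if any(row.get(tool, 0) for tool in tools))
    return used / len(rows)


TOOLS = ("Matplotlib", "Seaborn")

# Hypothetical one-hot answers, not survey data.
group_a = [
    {"Matplotlib": 1, "Seaborn": 1},
    {"Matplotlib": 1, "Seaborn": 0},
    {"Matplotlib": 0, "Seaborn": 1},
    {"Matplotlib": 0, "Seaborn": 0},
]
group_b = [
    {"Matplotlib": 1, "Seaborn": 0},
    {"Matplotlib": 0, "Seaborn": 0},
]

share_a = share_using_any(group_a, TOOLS)  # 3 of 4 respondents
share_b = share_using_any(group_b, TOOLS)  # 1 of 2 respondents

# Absolute gap between the two shares, in percentage points.
point_gap = (share_a - share_b) * 100
# The same gap expressed relative to group A's share, in percent.
relative_drop = (share_a - share_b) / share_a * 100

print(f"{point_gap:.1f} percentage points, {relative_drop:.1f}% relative drop")
```

The two figures answer different questions, so it is worth stating which one is being quoted.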

<div class='subsubsection'>6.4.4. Which are the most used ML frameworks?</div>

In [None]:
temp = multi_ops(
    com_roles_20,
    "Which of the following machine learning frameworks do you use on a regular basis? (Select all that apply)",
)
temp = make_sparse(temp)
temp_non_iru21 = agg_data(temp)

temp = multi_ops(
    d_20,
    "Which of the following machine learning frameworks do you use on a regular basis? (Select all that apply)",
)
temp = make_sparse(temp)
temp_iru21 = agg_data(temp)

iru21_vs_rest_plot(
    temp_iru21,
    temp_non_iru21,
    h1=[7, 8, 9],
    h2=[7, 8, 9],
    title="Most Common ML Frameworks",
)

<center><strong>Exhibit 6-P.</strong> Most common ML frameworks</center>

💡 **Insights**

- For both communities, the most used ML frameworks are **scikit-learn, TensorFlow and Keras**

<div class='subsection'>6.5. Project Challenges</div>

Project-based learning offers much needed hands-on experience with concepts. It helps the learner approach the problem at hand, think deeply about it and come up with valuable solutions (Solomon, 2003). 

Data science in the real world is a very applied field. Hence, learning by working on projects is of paramount importance here. In this section, a couple of questions concerning challenges associated with projects are analyzed.  

<div class='subsubsection'>6.5.1. Where are projects most shared?</div>

[Sharing of projects is necessary to gain visibility in the data science spectrum](https://www.dataquest.io/blog/build-a-data-science-portfolio/). In this part, let's identify the most common project sharing platforms.

In [None]:
temp = multi_ops(
    com_roles_20,
    "Where do you publicly share or deploy your data analysis or machine learning applications? (Select all that apply)",
)
temp = make_sparse(temp)
temp_non_iru21 = agg_data(temp)

temp = multi_ops(
    d_20,
    "Where do you publicly share or deploy your data analysis or machine learning applications? (Select all that apply)",
)
temp = make_sparse(temp)
temp_iru21 = agg_data(temp)

iru21_vs_rest_plot(
    temp_iru21,
    temp_non_iru21,
    h1=[9],
    h2=[9],
    title="Most favoured project sharing platforms",
)

<center><strong>Exhibit 6-Q.</strong> Most favoured project sharing platforms</center>

💡 **Insights**

- **GitHub** is the most favoured project sharing platform
- **IRU21** respondents also share their work widely on **Kaggle**

However, the y-axis shows that the number of responses in the graph is far lower than we would expect from the ~3k responses in IRU21. Why is that so?

It is because **several respondents left this question unanswered**, which led to null values here. Not answering this question could indicate that the respondent
- Has never worked on a project, or
- Does not share their work

If the latter were true, the respondent could have chosen the option `I do not share my work publicly`. So, there is a **high chance that these respondents have not yet worked on substantial projects!** 

In [None]:
# data prep

temp = multi_ops(
    d_20,
    "Where do you publicly share or deploy your data analysis or machine learning applications? (Select all that apply)",
)
temp = make_sparse(temp)
temp["tot"] = temp.sum(axis=1)
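# respondents who ticked at least one option and did not tick the no-share option
shares_mask = (temp["tot"] >= 1) & (temp["I do not share my work publicly"] == 0)
share = int(shares_mask.sum())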
no_share = temp.shape[0] - share

# data viz

labels = ["Definitely yes", "Most likely not"]
colors = ["#6699cc", "#00308f"]

fig = make_subplots(rows=1, cols=1, specs=[[{"type": "domain"}]])
fig.add_trace(
    go.Pie(labels=labels, values=[share, no_share], name="Project sharing1"), 1, 1
)

fig.update_traces(
    hole=0.4,
    hoverinfo="label+value+name",
    marker=dict(colors=colors, line=dict(color="#000000", width=2)),
)

# enhance template
fig.update_layout(
    template="simple_white",
    font_family="Rockwell",
    title={
        "text": "Do IRU21 respondents work on projects?",
        "font": {"size": 20},
        "yanchor": "top",
    },
    # xaxis_title="",
    # yaxis_title="Percentage of respondents",
    showlegend=True,
)

fig.show()

<center><strong>Exhibit 6-R.</strong> Do IRU21 respondents work on projects?</center>

📝 **Note**

Every respondent who left the question empty has been placed in the **"Most likely not"** category. This is an assumption; it may not hold for every respondent, but it is likely to hold for a substantial share of them.
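
The grouping rule can be sketched on toy data as follows. The answers and the helper `shares_publicly` are hypothetical; an empty set stands for a respondent who left the question unanswered.

```python
NO_SHARE = "I do not share my work publicly"


def shares_publicly(selected: set) -> bool:
    """True if at least one option was ticked and the no-share option was not."""
    return bool(selected) and NO_SHARE not in selected


# Hypothetical answers; an empty set stands for an unanswered question.
answers = [
    {"GitHub"},
    set(),
    {NO_SHARE},
    {"Kaggle", "GitHub"},
    set(),
]

share = sum(shares_publicly(a) for a in answers)  # "Definitely yes"
no_share = len(answers) - share                   # "Most likely not"
print(share, no_share)
```

Note that unanswered respondents and those who explicitly do not share end up in the same group, which is why the second label is worded cautiously.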

💡 **Insights**

- **Only around 5% of IRU21 respondents have answered the question** regarding the medium of publishing their project work
- This could mean that **only around 5% of this community in the 2020 survey have been working on substantial projects** as a part of their learning

🎯 **Important Point**

There needs to be an initiative to spread awareness about the need for project-based self-learning within the IRU21 community.

<div class='subsubsection'>6.5.2. Which media sources are most followed?</div>

While working on projects, learning often happens on the fly. Such on-the-fly learning is usually well supported by media sources. Hence, this question finds a place in the Project Challenges section.

In [None]:
temp = multi_ops(
    com_roles_20,
    "Who/what are your favorite media sources that report on data science topics? (Select all that apply)",
)
temp = make_sparse(temp)
temp_non_iru21 = agg_data(temp)

temp = multi_ops(
    d_20,
    "Who/what are your favorite media sources that report on data science topics? (Select all that apply)",
)
temp = make_sparse(temp)
temp_iru21 = agg_data(temp)

iru21_vs_rest_plot(
    temp_iru21, temp_non_iru21, h1=[2], h2=[6], title="Most favoured media sources"
)

<center><strong>Exhibit 6-S.</strong> Most favoured media sources</center>

💡 **Insights**

- **Kaggle, YouTube and Blogs** are the most preferred sources of learning via media
- Journal publications as a media source offer an interesting perspective
    - Non-IRU21 respondents include them in their top 4 preferences
    - But the IRU21 community places them in the bottom 3
    - This suggests that **IRU21 respondents do not prefer learning from academic work** as much as those working in the industry in data roles
- In spite of being extremely fun and resourceful, podcasts are yet to take off as top media sources!

🎯 **Important Point**

The IRU21 community might be skipping academic work because of the complexity involved, which makes sense. Therefore, it would be useful to distill academic work into simpler explanations to make it easier for data science aspirants to learn.

<a id='7'></a>
<div class='section'>7. References</div>

1. Davenport, T.H. and Patil, D.J., 2012. Data scientist. Harvard business review, 90(5), pp.70-76.
2. The Royal Society (Charity), 2019. Dynamics of data science skills: how can all sectors benefit from data science talent?
3. Donoho, D., 2017. 50 years of data science. Journal of Computational and Graphical Statistics, 26(4), pp.745-766.
4. Provost, F. and Fawcett, T., 2013. Data science and its relationship to big data and data-driven decision making. Big data, 1(1), pp.51-59.
5. Ozgur, C., Colliau, T., Rogers, G., Hughes, Z. and Myer-Tyson, B., 2017. MatLab vs. Python vs. R. Journal of Data Science, 15(3), pp.355-372.
6. Solomon, G., 2003. Project-based learning: A primer. Technology and Learning, 23(6), p.20.


<a id='8'></a>
<div class='section'>8. Appendix</div>

<div class='subsection'>A. Methodology</div>

![](https://github.com/ry05/kag_survey_analysis/blob/main/methodology.png?raw=true)

I have followed a linear methodology for working on this study. A brief explanation of the steps involved is as follows:

1. **Choose the Community =>** Decide which specific community's story to tell
2. **Get a Broad Overview =>** Develop a general understanding of the community and its characteristics
3. **Conduct Literature Survey =>** Read relevant studies in the field
4. **Develop Framework for Analysis =>** Create a skeletal structure of analysis. In this study, my structure focuses on the challenges of the community and offers recommendations on what Kaggle can do to answer these challenges
5. **Analyze the Data =>** Use descriptive analytics to understand what is going on
6. **Interpret the Finds =>** Make sense of the visualizations generated. Refine visualizations.
7. **Build a Report =>** Organize the findings, readings and figures into a coherent report
8. **Recommend Next Steps =>** Make actionable recommendations

<div class='subsection'>B. Other links that helped me</div>

1. [Some Best Practices for Analytics Reporting by John Miller](https://www.kaggle.com/jpmiller/some-best-practices-for-analytics-reporting)
2. [Low Tech SUPER POWERS for Data Storytelling](https://www.youtube.com/watch?v=2-48m867oTc&feature=emb_logo)
3. [Geek Girls Rising: Myth or Reality by Parul Pandey](https://www.kaggle.com/parulpandey/geek-girls-rising-myth-or-reality)