## Introduction

Python is one of the most widely used programming languages across a broad spectrum of applications. First conceived in the late 1980s, it has come a long way since then.

Interestingly, Python, which began as a general-purpose developer's tool, has also become the lingua franca of data munging and crunching, overtaking specialised tools like R (which was made by statisticians for statisticians). I have witnessed this growth in popularity myself: when I first joined Kaggle over 5 years ago, R was the dominant tool, and data science itself was a fairly confined and little-known field.

In such a short period of time, the growth of Python and of data science has gone hand in hand, and Python has now overtaken most of its contemporaries. I thought I might explore the Python community on Kaggle and see what it has to show us. This is my first submission on Kaggle, despite joining in 2014; people of my age group are now master competitors here (I thank them for the inspiration). But it's never too late! I hope this notebook turns out to be as interesting and insightful as possible.

In this notebook I explore the Python community specifically, with some comparisons with other languages here and there. Pandas is a powerful tool, and I personally believe it to be more versatile than R's tidyr package. I've done most of the analysis through the chained `.groupby().unstack()` approach, keeping my code as simple and yet as insightful as possible.
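
Here's a minimal sketch of that chained approach on a toy table. The column names and values are made up for illustration and are not taken from the survey:

```python
import pandas as pd

# Toy stand-in for the survey; names and values are invented for illustration.
toy = pd.DataFrame({
    "gender": ["Man", "Man", "Woman", "Woman", "Man", "Woman"],
    "role": ["Student", "Data Scientist", "Student", "Student", "Student", "Data Scientist"],
})

# Count roles within each gender, then pivot the inner index level (role)
# into columns so each row is a gender and each column a role.
counts = (
    toy.groupby("gender")["role"]
       .value_counts()
       .unstack(fill_value=0)
)
print(counts)
```

The result is a small gender-by-role table of counts, which is the shape most of the charts in this notebook are drawn from.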

This is a work in progress, so I'll keep mining from the vast Kaggle dataset and adding more insights and new ideas whenever I come across anything important.

## Initial Procedures

Let's load our dataset and do some manipulations. The first row of the CSV holds the question text, so I'll store the questions in a dictionary for reference and then drop that row before isolating the columns of importance.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb
import numpy as np

from pandas.plotting import register_matplotlib_converters
import warnings

register_matplotlib_converters()

warnings.filterwarnings("ignore")

plt.rcParams["font.family"] = "sans-serif"
# plt.rcParams["figure.dpi"] = 150

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

directory = "/kaggle/input/kaggle-survey-2020/"


In [None]:
survey = pd.read_csv(directory + "kaggle_survey_2020_responses.csv")

In [None]:
def isolate_col(df):
    """
    Map each column name to the question text stored in the first row.

    Parameters
    ----------
    df : DataFrame
        Raw survey responses whose first row (index 0) holds the
        question text for every column.

    Returns
    -------
    dict
        Column name -> question text.

    """
    # Row 0 of the raw CSV holds the question wording, not a response
    return dict(zip(df.columns, df.loc[0]))

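# %% Run above function and subset only Python users
quest_dict = isolate_col(survey)
# Drop the question-text row so only real responses remain
survey.drop(index=0, inplace=True)

# Q7_Part_1 is "Python" only for respondents who selected Python;
# .copy() lets us modify the subset later without touching survey
python = survey[survey.Q7_Part_1 == "Python"].copy()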

In [None]:
print("Total number of respondents: {}".format(len(survey)))
print("Total number of pythonists: {}".format(len(python)))

## Age Distribution of Python Programmers 

With a large majority of respondents coding in Python (see the counts above), the language's reach is clear.
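
The share of Python users depends on what we divide by: all respondents, or only those who actually answered the language question. A rough sketch on invented data (the two columns mimic the survey's naming, the values do not come from it):

```python
import pandas as pd

# Toy stand-in: None marks a respondent who did not tick that language,
# either because they use something else or because they skipped the question.
toy = pd.DataFrame({
    "Q7_Part_1": ["Python", "Python", None, "Python", None],
    "Q7_Part_2": [None, "R", "R", None, None],
})

python_users = (toy["Q7_Part_1"] == "Python").sum()

# Denominator 1: every respondent.
share_all = python_users / len(toy)

# Denominator 2: only respondents who selected at least one language.
answered = toy.notna().any(axis=1).sum()
share_answered = python_users / answered

print(f"{share_all:.0%} of all respondents, {share_answered:.0%} of those who answered")
```

The second figure is always at least as large as the first, so it is worth stating which one a quoted percentage refers to.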

Let's start with some early steps and simple charting, and then I'll level up one step at a time. Below is the age distribution of pythonists; the education levels are also shortened here so that their labels fit within the plot borders later on.

In [None]:
mapping = {"Bachelor’s degree": "Bachelor's",
           "Doctoral degree": "Doctoral",
           "I prefer not to answer": "Not Answered",
           "Master’s degree": "Master's",
           "No formal education past high school": "High School",
           "Professional degree": "Professional",
           "Some college/university study without earning a bachelor’s degree": "Some College Study"}

python["Q4"] = python.Q4.map(mapping)

python["Q4"].unique()
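
One thing to watch with `.map` is that any label missing from the dictionary silently becomes NaN, for example when an apostrophe is typed differently. A small check along these lines can catch that (the labels below are hypothetical):

```python
import pandas as pd

# Hypothetical labels; the second uses a straight apostrophe, so it will not
# match a mapping key written with a typographic one.
raw = pd.Series(["Master’s degree", "Master's degree", "Doctoral degree"])
mapping = {"Master’s degree": "Master's", "Doctoral degree": "Doctoral"}

mapped = raw.map(mapping)

# Labels that were present before mapping but came out as NaN afterwards.
unmapped = raw[mapped.isna() & raw.notna()].unique().tolist()
print(unmapped)
```

An empty list means every label was covered by the mapping.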

In [None]:
python_age_dist = python.Q1.value_counts().sort_index()

fig, ax = plt.subplots(1, 1, figsize=(10, 8))
p1 = ax.bar(python_age_dist.index,
            python_age_dist,
            color="lightcoral",
            alpha=0.6,
            edgecolor='black',
            linewidth=0.7)

ax.grid(axis='y', color='grey', linewidth=0.5, alpha=0.4)

# Annotate each bar with its count, centred just above the bar
for patch, count in zip(p1.patches, python_age_dist):
    ax.annotate(str(count),
                xy=(patch.get_x() + patch.get_width() / 2,
                    patch.get_height()),
                xytext=(0, 8),
                textcoords="offset points",
                va='center',
                ha='center',
                fontweight="bold")

sb.despine(left=True)
plt.title("Age Distribution (Pythonists)",
          loc="left",
          fontweight='bold',
          fontsize=15)

plt.tight_layout()
plt.savefig("Age Distribution.png");

Clearly, the 25-29 age group is the most frequent, followed by the younger age groups. We see an abrupt decline starting from the 30-34 bracket. Conclusion: Python is the choice of young learners and data crunchers.

## Education (general and gender-wise)
Let's see how pythonists fare on the education front.

In [None]:
edu_df = python[["Q2", "Q4"]]

# Number of respondents per education level
edu_count = edu_df.groupby("Q4").count().reset_index()
edu_df_gender = pd.crosstab(edu_df.Q4, edu_df.Q2)

fig, ax = plt.subplots(figsize=(15, 8))

cmap = 'ocean'

sb.barplot(y=edu_count.Q4,
           x=edu_count.Q2,
           ax=ax,
           # color_palette returns the colours; set_palette returns None
           # and changes the global palette as a side effect
           palette=sb.color_palette(cmap, len(edu_count)))

ax.set_xlabel("")
ax.set_ylabel("")
ax.tick_params(labelsize=12)
ax.set_title("Count of Degree Holders", loc="left",
             fontfamily="sans-serif", fontweight="bold", fontsize=16)

ax.grid(axis='x', alpha=0.3, linewidth=0.5, color='black')
sb.despine(left=True)

Data science is certainly a specialised area and requires consistent practice and skill development. Consequently, master's degree holders form the largest group, followed by bachelor's degree holders. We can conclude that more than 90% of pythonists hold a full degree.

Let's go a little deeper into this: what is the educational distribution by gender?

In [None]:
male_edu = edu_df_gender.loc[:, "Man"]
fem_edu = edu_df_gender.loc[:, "Woman"]

bbox_props = dict(boxstyle="square,pad=0.3", fc="w", ec="k", lw=0.72)
kw = dict(arrowprops=dict(arrowstyle="-"),
          bbox=bbox_props, zorder=0, va="center")


def donut(ax, counts, title):
    """Draw a donut chart of counts on ax with a boxed label per wedge."""
    wedges, text, pct = ax.pie(counts,
                               wedgeprops=dict(width=0.5),
                               startangle=0,
                               autopct="%.1f%%",
                               pctdistance=0.8,
                               # color_palette returns the colours;
                               # set_palette returns None
                               colors=sb.color_palette("tab10", len(counts)),
                               textprops=dict(color='w',
                                              fontfamily='sans-serif',
                                              fontsize=12,
                                              fontweight='bold'))

    for i, p in enumerate(wedges):
        # Place each label along the bisector of its wedge
        ang = (p.theta2 - p.theta1)/2. + p.theta1
        y = np.sin(np.deg2rad(ang))
        x = np.cos(np.deg2rad(ang))
        hor_align = "right" if x < 0 else "left"
        conn_style = "angle, angleA=0, angleB={}".format(ang)
        kw["arrowprops"].update({"connectionstyle": conn_style})
        ax.annotate(counts.index[i],
                    xy=(x, y),
                    xytext=(1.35 * np.sign(x), 1.3 * y),
                    horizontalalignment=hor_align,
                    fontfamily='sans-serif',
                    fontsize=12, **kw)

    ax.set_title(title, fontweight='bold', fontsize=14)


fig, ax = plt.subplots(2, 1, figsize=(10, 10), subplot_kw=dict(aspect='equal'))

donut(ax[0], male_edu, "Qualifications (Men)")
donut(ax[1], fem_edu, "Qualifications (Women)")
plt.tight_layout()

The gender-wise comparison is interesting. Despite there being far fewer responses from women, the share of women with a master's degree is about 4 percentage points higher than that of men, the share with a bachelor's degree about 2 points higher, and the share with a doctoral degree about 1 point higher. In conclusion, the women pythonists on Kaggle hold proportionally more full degrees than their male counterparts.
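
Because the two groups differ so much in size, the comparison only makes sense on shares rather than counts. A minimal sketch of how such percentage-point gaps can be computed, using invented values and the survey's column names `Q2` and `Q4`:

```python
import pandas as pd

# Toy data; the values are invented and only illustrate the calculation.
toy = pd.DataFrame({
    "Q2": ["Man"] * 4 + ["Woman"] * 2,
    "Q4": ["Master's", "Bachelor's", "Bachelor's", "Doctoral", "Master's", "Bachelor's"],
})

# normalize="columns" makes each gender's column sum to 1, so groups of
# different sizes can be compared directly.
shares = pd.crosstab(toy["Q4"], toy["Q2"], normalize="columns") * 100

# Difference in percentage points (women minus men) for each qualification.
gap = (shares["Woman"] - shares["Man"]).round(1)
print(gap)
```

A positive value means the qualification is relatively more common among women, a negative one among men.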

## Occupations

Let's move on to occupations: what occupations do pythonists pursue?

In [None]:
occupation = python["Q5"].value_counts()

# Explore gender vs. occupation
# fillna(0): an occupation with no respondents of a gender would otherwise be NaN
gen_occu = python[["Q2", "Q5"]].groupby("Q2")["Q5"].value_counts().unstack().fillna(0)
gen_occu = gen_occu.transpose()
gen_occu.drop(["Nonbinary", "Prefer to self-describe", "Prefer not to say"],
              axis=1, inplace=True)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(30, 30), dpi=150)

p1 = sb.barplot(x=gen_occu.Man,
                y=gen_occu.index,
                data=gen_occu,
                color='lightblue',
                ax=ax1)

p2 = sb.barplot(x=gen_occu.Woman,
                y=gen_occu.index,
                data=gen_occu,
                color='lightcoral',
                ax=ax2)

ax2.invert_xaxis()
ax2.yaxis.set_ticks_position("right")

sb.despine(left=True)
ax1.set_xlabel('')
ax1.set_ylabel('')
ax1.tick_params(labelsize=25)
ax2.set_xlabel('')
ax2.set_ylabel('')
ax2.tick_params(labelsize=25)
ax1.set_title("Occupations (Python, Men)", fontweight='bold', loc="left", fontsize=30)
ax2.set_title("Occupations (Python, Women)", fontweight='bold', loc="right", fontsize=30)

percent_men = [str(round((i/gen_occu.Man.sum()) * 100, 2)) + "%" for i in gen_occu.Man]
percent_women = [str(round((i/gen_occu.Woman.sum()) * 100, 2)) + "%" for i in gen_occu.Woman]

# Annotate men and women
for perc, num in zip(percent_men, range(len(gen_occu))):
    ax1.annotate(perc,
                   xy=(p1.patches[num].get_width(),
                       p1.patches[num].get_y() + p1.patches[num].get_height()/2),
                   textcoords="offset points",
                   xytext=(28, 0),
                   ha='center', va='center',
                   fontsize=23,
                rotation=-25)

for perc, num in zip(percent_women, range(len(gen_occu))):
    ax2.annotate(perc,
                   xy=(p2.patches[num].get_width(),
                       p2.patches[num].get_y() + p2.patches[num].get_height()/2),
                   textcoords="offset points",
                   xytext=(-30, 0),
                   ha='center', va='center',
                   fontsize=23,
                rotation=-25)

plt.tight_layout(pad=1.0);

Students make up the largest part of the Python community, followed by data scientists and, interestingly, by a considerable number of unemployed people. Note that I intentionally kept the x-axis in absolute counts rather than percentages: the annotations give the percentage of people in each occupation, while the axis gives a rough idea of how many people actually work in it.

Proportionally, there are more students among women than among men, and this could explain why more women hold professional degrees than men. Interestingly, despite holding more degrees, female Python programmers have an unemployment share about 2 percentage points higher. For men, we clearly observe an inclination towards engineering roles, hence the greater shares of data engineers, machine learning engineers, etc.

How does the Python community compare with, say, C++?

In [None]:
# %% Compare above chart with C++ programmers
cpp = survey[survey.Q7_Part_5 == "C++"]
# fillna(0): an occupation with no respondents of a gender would otherwise be NaN
gen_occu_cpp = cpp[["Q2", "Q5"]].groupby("Q2")["Q5"].value_counts().unstack().fillna(0)
gen_occu_cpp = gen_occu_cpp.transpose()
gen_occu_cpp.drop(["Nonbinary", "Prefer to self-describe", "Prefer not to say"],
              axis=1, inplace=True)

# Plot bars
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(30, 30), dpi=150)

p1 = sb.barplot(x=gen_occu_cpp.Man,
                y=gen_occu_cpp.index,
                data=gen_occu_cpp,
                color='#BFA840',
                ax=ax1)

p2 = sb.barplot(x=gen_occu_cpp.Woman,
                y=gen_occu_cpp.index,
                data=gen_occu_cpp,
                color='#82BF40',
                ax=ax2)

ax2.invert_xaxis()
ax2.yaxis.set_ticks_position("right")

sb.despine(left=True)
ax1.set_xlabel('')
ax1.set_ylabel('')
ax1.tick_params(labelsize=25)
ax2.set_xlabel('')
ax2.set_ylabel('')
ax2.tick_params(labelsize=25)
ax1.set_title("Occupations (C++, Men)", fontweight='bold', loc="left", fontsize=30)
ax2.set_title("Occupations (C++, Women)", fontweight='bold', loc="right", fontsize=30)

percent_men_cpp = [str(round((i/gen_occu_cpp.Man.sum()) * 100, 2)) + "%" for i in gen_occu_cpp.Man]
percent_women_cpp = [str(round((i/gen_occu_cpp.Woman.sum()) * 100, 2)) + "%" for i in gen_occu_cpp.Woman]

# Annotate men
for perc, num in zip(percent_men_cpp, range(len(gen_occu_cpp))):
    ax1.annotate(perc,
                   xy=(p1.patches[num].get_width(),
                       p1.patches[num].get_y() + p1.patches[num].get_height()/2),
                   textcoords="offset points",
                   xytext=(28, 0),
                   ha='center', va='center',
                   fontsize=23,
                rotation=-25)

for perc, num in zip(percent_women_cpp, range(len(gen_occu_cpp))):
    ax2.annotate(perc,
                   xy=(p2.patches[num].get_width(),
                       p2.patches[num].get_y() + p2.patches[num].get_height()/2),
                   textcoords="offset points",
                   xytext=(-30, 0),
                   ha='center', va='center',
                   fontsize=23,
                rotation=-25)

plt.tight_layout(pad=1.0);

Wow! I was expecting something different. The share of students is higher among C++ coders, almost 15 percentage points higher on average for both genders. Also, the share of unemployed respondents is higher among Python programmers than in the C++ community, for both genders. Additionally, software engineers and data scientists also seem to rely considerably on C++. Let's have a final look at Python's rival, R.

In [None]:
r_lang = survey[survey.Q7_Part_2 == "R"]
gen_occu_r = r_lang[["Q2", "Q5"]].groupby("Q2")["Q5"].value_counts().unstack().fillna(0)
gen_occu_r = gen_occu_r.transpose()
gen_occu_r.drop(["Nonbinary", "Prefer to self-describe", "Prefer not to say"],
              axis=1, inplace=True)

# Plot bars
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(30, 30), dpi=150)

p1 = sb.barplot(x=gen_occu_r.Man,
                y=gen_occu_r.index,
                data=gen_occu_r,
                color='#40BFB6',
                ax=ax1)

p2 = sb.barplot(x=gen_occu_r.Woman,
                y=gen_occu_r.index,
                data=gen_occu_r,
                color='#BF4049',
                ax=ax2)

ax2.invert_xaxis()
ax2.yaxis.set_ticks_position("right")

sb.despine(left=True)
ax1.set_xlabel('')
ax1.set_ylabel('')
ax1.tick_params(labelsize=25)
ax2.set_xlabel('')
ax2.set_ylabel('')
ax2.tick_params(labelsize=25)
ax1.set_title("Occupations (R, Men)", fontweight='bold', loc="left", fontsize=30)
ax2.set_title("Occupations (R, Women)", fontweight='bold', loc="right", fontsize=30)

percent_men_r = [str(round((i/gen_occu_r.Man.sum()) * 100, 2)) + "%" for i in gen_occu_r.Man]
percent_women_r = [str(round((i/gen_occu_r.Woman.sum()) * 100, 2)) + "%" for i in gen_occu_r.Woman]

# Annotate men
for perc, num in zip(percent_men_r, range(len(gen_occu_r))):
    ax1.annotate(perc,
                   xy=(p1.patches[num].get_width(),
                       p1.patches[num].get_y() + p1.patches[num].get_height()/2),
                   textcoords="offset points",
                   xytext=(28, 0),
                   ha='center', va='center',
                   fontsize=23,
                rotation=-25)

for perc, num in zip(percent_women_r, range(len(gen_occu_r))):
    ax2.annotate(perc,
                   xy=(p2.patches[num].get_width(),
                       p2.patches[num].get_y() + p2.patches[num].get_height()/2),
                   textcoords="offset points",
                   xytext=(-30, 0),
                   ha='center', va='center',
                   fontsize=23,
                rotation=-25)

plt.tight_layout(pad=1.0);

Clear results. Despite the perceived dominance of Python in the data science domain, the role 'Data Scientist' accounts for a larger share of R users than of Python users. Keep in mind that these charts show shares within each language community, not how many data scientists use each language. Several inferences can be drawn from this. For example, despite the following Python commands, data scientists still seem to perceive R as dependable. Python remains the choice of students and young people just entering the industry. R clearly seems to be holding its territory: data analysts, data scientists, statisticians and researchers make up a larger share of the R community than of the Python community.
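
The comparison can be framed in two directions, and they answer different questions. Here is a rough sketch on invented data; `role_share` is a hypothetical helper written for this illustration, and only the column names follow the survey:

```python
import pandas as pd

def role_share(df, lang_col, role_col="Q5"):
    """Percentage of each role among respondents who selected the language in lang_col."""
    users = df[df[lang_col].notna()]
    return users[role_col].value_counts(normalize=True).mul(100).round(1)

# Toy data; the values are invented.
toy = pd.DataFrame({
    "Q7_Part_1": ["Python", "Python", "Python", "Python", None],
    "Q7_Part_2": [None, "R", None, "R", "R"],
    "Q5": ["Student", "Data Scientist", "Student", "Student", "Data Scientist"],
})

# Direction 1: within each language community, how common is each role?
comparison = pd.DataFrame({
    "Python": role_share(toy, "Q7_Part_1"),
    "R": role_share(toy, "Q7_Part_2"),
}).fillna(0)
print(comparison)

# Direction 2: among data scientists, what share ticked each language?
ds = toy[toy["Q5"] == "Data Scientist"]
ds_usage = ds[["Q7_Part_1", "Q7_Part_2"]].notna().mean().mul(100)
print(ds_usage)
```

The charts above correspond to the first direction; a claim about which language data scientists use more would need the second.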

Moving on, let's see what age has to tell us. For example, we can analyse occupation against age and ask what the typical age of data scientists is, or of an allied occupation such as data analysts, and so on.

In [None]:
age_occu = python[["Q1", "Q5"]].groupby("Q1")["Q5"].value_counts().unstack().fillna(0)

plt.figure(figsize=(50, 300), dpi=150)

# One panel per occupation (the columns of age_occu)
for ix, group in enumerate(age_occu):
    ax = plt.subplot(11, 2, ix+1)
    p = sb.barplot(x=age_occu.index,
                   y=group,
                   data=age_occu,
                   ax=ax,
                   # color_palette returns the colours; set_palette returns None
                   palette=sb.color_palette("inferno", len(age_occu)))

    # Annotate each bar with its count, centred just above the bar
    for patch, count in zip(p.patches, age_occu[group]):
        p.annotate(str(int(count)),
                   xy=(patch.get_x() + patch.get_width() / 2,
                       patch.get_height()),
                   xytext=(0, 5),
                   fontweight="bold",
                   fontsize=24,
                   textcoords="offset points",
                   ha='center',
                   va='center')

    ax.tick_params(axis='x', labelrotation=50)
    ax.tick_params(labelsize=28)
    ax.set_xlabel("")
    ax.set_ylabel("")
    ax.set_title(f"Age vs {group}",
                 fontsize=34,
                 fontweight="bold")

    sb.despine(left=True)

A lot of inferences can be made here:

* People aged 18-21 dominate the student category. Interestingly, the 25-29 age group has fewer students than I personally expected.
* Some roles clearly skew towards younger age groups (data analysts, ML engineers and business analysts).
* Project managers, data scientists and database engineers have a more uniform distribution; in fact, the DBEs seem to lean towards the older age groups.
* There are more unemployed people in the younger age groups than in the older ones.
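
Since the survey records age only in brackets, an "average age" per occupation can only be approximated. One rough way is to weight bracket midpoints by the counts. The sketch below uses invented counts, and treating "70+" as 72 is an arbitrary assumption of mine:

```python
import pandas as pd

# Midpoints of the age brackets; the value for "70+" is an arbitrary assumption.
midpoints = {"18-21": 19.5, "22-24": 23.0, "25-29": 27.0, "30-34": 32.0,
             "35-39": 37.0, "40-44": 42.0, "45-49": 47.0, "50-54": 52.0,
             "55-59": 57.0, "60-69": 64.5, "70+": 72.0}

# Toy counts in the same shape as age_occu (rows: age bracket, columns: occupation).
toy_counts = pd.DataFrame(
    {"Student": [30, 10, 0], "Data Scientist": [0, 10, 10]},
    index=["18-21", "25-29", "30-34"],
)

# Align the midpoints with the rows, then take the count-weighted mean per column.
mids = pd.Series(midpoints).reindex(toy_counts.index)
approx_mean_age = toy_counts.mul(mids, axis=0).sum() / toy_counts.sum()
print(approx_mean_age.round(1))
```

This is only as precise as the brackets allow, so it is better suited to ranking occupations by age than to quoting exact figures.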

This is it for now!
### Work in Progress