# Skills for Success: Trends of Tech, Tools & Top Languages in the Data Industry

---



In this notebook, I explore and summarize trends in programming languages, big data tools and machine learning technology within the data industry over the past couple of years.

Specifically, I'd like to answer the question:

### "*Which technologies, tools and languages would help me succeed in a data-related career?*"

Many data workers (or those aspiring to a data-related position!) ask this kind of question so they can be as prepared and qualified as possible to solve the real-world data problems that companies face. Of course, which tech, tools and languages are most useful will vary with each person's specific career goals.

The goal of this analysis is to unpack, unravel and explore many different skills and technologies alongside other career-relevant factors such as job title, location, company size, and more, to identify which tools and skills are more applicable to one career than another. This should give not only an overview of these data tools but hopefully also some insight into whatever area interests the reader, such as the trends in a particular country, programming language or cloud computing platform.

As I'm planning to transition to full-time data work in 2021 in the Pearl River Delta region of China, I'll also take some deeper dives into the trends and insights which are especially relevant to this specific career focus - ***data science in China***.

#### The Data Source - Kaggle Data Science Surveys
All the data for this analysis (with the exception of Appendix C) comes from the Kaggle Data Science Surveys from 2018-2020. Besides asking about a wide range of data science tools, the surveys span three consecutive years, which makes them a fitting dataset for uncovering trends in the data world.

#### Data-Related Job Titles
Since this analysis focuses on trends related to data careers, it's worth noting up front that ***almost half of the 2020 respondents do not have a "data/programming-related" job title***, as the chart below shows:


In [None]:
# Import needed libraries
import numpy as np
import pandas as pd
import re
import matplotlib.pyplot as plt
import matplotlib.ticker as mtick
import matplotlib.patches as mpatches


In [None]:
### 2020 DF ###

# Only read in the needed columns
use_cols_2020 = [
    list(range(7)) + [107, 118] + 
    list(range(7, 18)) + 
    list(range(66, 80)) + 
    list(range(120, 130)) + 
    list(range(155, 171)) + 
    list(range(174, 187))
][0]

# These will be the new column names...
general_cols1 = ["duration", "age", "gender", "country", "education", "title", "yrs_coding"]
language_cols = ["Python", "R", "SQL", "C", "C++", "Java", "Javascript", "Julia", "Swift", 
                 "Bash", "MATLAB"]
ml_cols = ["Scikit-learn", "TensorFlow", "Keras", "PyTorch", "Fast.ai", "MXNet", "Xgboost", "LightGBM",
            "CatBoost", "Prophet", "H2O 3", "Caret", "Tidymodels", "JAX"]
general_cols2 = ["company_size", "compensation"]
cloud_cols = ["Amazon Web Services (AWS)", "Microsoft Azure", "Google Cloud Platform (GCP)", 
                  "IBM Cloud / Red Hat", "Oracle Cloud", "SAP Cloud", "Salesforce Cloud", 
                  "VMware Cloud", "Alibaba Cloud", "Tencent Cloud"]
bigdata_cols = ["MySQL", "PostgreSQL", "SQLite", "Oracle Database", "MongoDB", "Snowflake", "IBM Db2",
                "Microsoft SQL Server", "Microsoft Access", "Microsoft Azure Data Lake Storage", 
                "Amazon Redshift", "Amazon Athena", "Amazon DynamoDB", "Google Cloud BigQuery", 
                "Google Cloud SQL", "Google Cloud Firestore"]
bi_cols = ["Amazon QuickSight", "Microsoft Power BI", "Google Data Studio", "Looker", "Tableau", 
           "Salesforce", "Einstein Analytics", "Qlik", "Domo", "TIBCO Spotfire", "Alteryx", "Sisense", 
           "SAP Analytics Cloud"]

# ... combined in this list. This will be a bit different for each of the 3 Surveys
col_names_2020 = general_cols1 + language_cols + ml_cols + general_cols2 + cloud_cols + bigdata_cols + bi_cols

# Read in DF with clearer column names
df2020 = pd.read_csv(
#     "kaggle-survey-2020\kaggle_survey_2020_responses.csv",
    "../input/kaggle-data-science-survey-analysis/kaggle_survey_2020_responses.csv",    
    skiprows=1,
    header=0,
    usecols=use_cols_2020
)
df2020.columns = col_names_2020

# Reduce DF memory_usage by making all categories except 'compensation' col
for col in df2020.select_dtypes(include="object").columns:
    if col != "compensation":
        df2020[col] = df2020[col].astype("category")
df2020["year"] = 2020
        
    
    
### 2019 DF ###

# Same thing as above for the 2019 DF
use_cols_2019 = [
    [0,1,2,4,5,6,8,20] + 
    list(range(82, 89)) + [90,91] +
    [155,156,157,159,160,161,162,164] + 
    list(range(168, 178)) + 
    [194,195,201,240,242] + list(range(233, 239))
][0]

col_names_2019 = [
    "duration", "age", "gender", "country", "education", "title", "company_size", "compensation",
    "Python", "R", "SQL", "C", "C++", "Java", "Javascript", "Bash", "MATLAB", 
    "Scikit-learn", "TensorFlow", "Keras", "Xgboost", "PyTorch", "Caret", "LightGBM", "Fast.ai",
    "Google Cloud Platform (GCP)", "Amazon Web Services (AWS)", "Microsoft Azure",  "IBM Cloud",
    "Alibaba Cloud", "Salesforce Cloud", "Oracle Cloud", "SAP Cloud", "VMware Cloud", "Red Hat",
    "Google Cloud BigQuery", "Amazon Redshift", "Amazon Athena", "MySQL", "PostgreSQL", "SQLite", 
    "Microsoft SQL Server", "Oracle Database", "Microsoft Access", "Amazon DynamoDB", "Google Cloud SQL"
]

# Read in DF with clearer column names
df2019 = pd.read_csv(
#     "kaggle-survey-2019\multiple_choice_responses.csv",
    "../input/kaggle-data-science-survey-analysis/multiple_choice_responses.csv",
    skiprows=1,
    header=0,
    usecols=use_cols_2019
)
df2019.columns = col_names_2019

# Reduce DF memory_usage by making all categories except 'compensation' col
for col in df2019.select_dtypes(include="object").columns:
    if col != "compensation":
        df2019[col] = df2019[col].astype("category")
df2019["year"] = 2019
        

    
### 2018 DF ###

use_cols_2018 = [
    [0,1,3,4,5,7,12] + 
    list(range(65,71)) + [72,73,75] + 
    [88,89,90,91,93,94,95,96,97,99,101,102] + 
    list(range(57, 62)) + 
    list(range(203, 208)) + [197,199,209,220,231,232,233,239]
][0]

col_names_2018 = [
    "duration", "gender", "age", "country", "education", "title", "compensation",
    "Google Cloud Platform (GCP)", "Amazon Web Services (AWS)", "Microsoft Azure", 
    "IBM Cloud / Red Hat", "Alibaba Cloud",
    "Python", "R", "SQL", "Bash", "Java", "Javascript", "C", "MATLAB", "Julia",
    "Scikit-learn", "TensorFlow", "Keras", "PyTorch", "H2O 3", "Fast.ai", "MXNet", 
    "Caret", "Xgboost", "Prophet", "LightGBM", "CatBoost",
    "Google Cloud SQL", "Amazon DynamoDB", "Microsoft SQL Server", "MySQL", "PostgreSQL", 
    "SQLite", "Oracle Database", "Microsoft Access", "IBM Db2", "Amazon Athena", 
    "Amazon Redshift", "Google Cloud BigQuery", "Snowflake"
]

# Read in DF with clearer column names
df2018 = pd.read_csv(
#     "kaggle_2018_survey\multipleChoiceResponses.csv",
    "../input/kaggle-data-science-survey-analysis/multipleChoiceResponses.csv",
    skiprows=1,
    header=0,
    usecols=use_cols_2018
)
df2018.columns = col_names_2018

# Reduce DF memory_usage by making all categories except 'compensation' col
for col in df2018.select_dtypes(include="object").columns:
    if col != "compensation":
        df2018[col] = df2018[col].astype("category")
df2018["year"] = 2018



### Concat DFs ###
df = pd.concat([df2020, df2019, df2018], axis=0)
    
# Some clean up before ordering the categories
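# Assign the results back rather than using inplace on a column accessor, which
# newer pandas versions may silently apply to a temporary copy
df["education"] = df["education"].fillna('I prefer not to answer')
df["compensation"] = df["compensation"].replace('> $500,000', '$500,000-500000')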
df.loc[df.age.isin(["70-79", "80+"]), "age"] = "70+"
df.loc[df.company_size=="10,000 or more employees", "company_size"] = "> 10,000 employees"
df.company_size = df.company_size.str.replace(" employees", "")

# Extract the lower and upper bound of each compensation range and take their midpoint.
# str.extract keeps one row per respondent (NaN where no range was given), so the result
# stays aligned with df; extractall would drop the unmatched rows and shift the values.
comp = (df.compensation
    .str.extract(r'([\d,]+)-([\d,]+)')
    .apply(lambda s: s.str.replace(",", "", regex=False))
    .astype(float)
    .mean(axis=1)
    .round()
)
df["compensation"] = comp

# Category Ordering
for col in df.select_dtypes(include="object").columns:
    if col != "compensation":
        df[col] = df[col].astype("category")
age_order = ['18-21', '22-24', '25-29', '30-34', '35-39', '40-44', '45-49', '50-54', '55-59', '60-69', '70+']
yrs_coding_order = ["I have never written code", "< 1 years", "1-2 years", "3-5 years", "5-10 years", "10-20 years", "20+ years"]
education_order = ['I prefer not to answer', 'No formal education past high school', 'Some college/university study without earning a bachelor’s degree',
                  'Bachelor’s degree', 'Professional degree', 'Master’s degree', 'Doctoral degree']
size_order = ['0-49', '50-249', '250-999', '1000-9,999', '> 10,000']
cat_orders = [age_order, yrs_coding_order, education_order, size_order]
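for col, order in zip(["age", "yrs_coding", "education", "company_size"], cat_orders):
    # reorder_categories returns a new Series; its inplace option is deprecated
    # or removed in recent pandas versions, so assign the result back
    df[col] = df[col].cat.reorder_categories(order, ordered=True)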

# Replace the Question columns with Booleans (True if the respondent selected that option)
nonBoolCols = ["duration", "age", "gender", "country", "education", "title", "yrs_coding",
              "company_size", "compensation", "year"]
bool_cols = [col for col in df.columns if col not in nonBoolCols]
df[bool_cols] = df[bool_cols].notnull()

# Combine IBM Cloud & Red Hat since they were merged starting from the 2020 survey
df["IBM Cloud / Red Hat"] = df.loc[:, ["IBM Cloud / Red Hat", "Red Hat", "IBM Cloud"]].sum(axis=1).astype(bool) 
df.drop(["Red Hat", "IBM Cloud"], axis=1, inplace=True)

# New features - "number of _____" used by each person
df["num_lang"] = df.loc[:, language_cols].sum(axis=1)
df["num_ml"] = df.loc[:, ml_cols].sum(axis=1)
df["num_cloud"] = df.loc[:, cloud_cols].sum(axis=1)
df["num_bigdata"] = df.loc[:, bigdata_cols].sum(axis=1)
df["num_bi"] = df.loc[:, bi_cols].sum(axis=1)

# Combine the "not employed" data and the "other"s
df.title = df.title.str.replace("Not employed", "Currently not employed")
to_other = ["Chief Officer", "Consultant", "Data Journalist", "Developer Advocate",
            "Manager", "Marketing Analyst", "Principal Investigator", "Research Assistant",
           "Salesperson"]
df.loc[df.title.isin(to_other), "title"] = "Other"

# Replace 8 words with 2 letters
df.country = df.country.str.replace("United Kingdom of Great Britain and Northern Ireland", "UK")
# And a few more...
df.country = df.country.str.replace("United States of America", "USA")
df.country = df.country.str.replace("Iran, Islamic Republic of...", "Iran")
df.country = df.country.str.replace("Republic of Korea", "South Korea")  # There are entries for both, so combine
df.country = df.country.str.replace("Viet Nam", "Vietnam")



In [None]:
def summaryBar(df, y_axis=None, special_list=None, cols=None, default_color="tab:grey", 
                special_color="tab:red", label_offset=100, title_text=None, 
                def_patch=None, spec_patch=None, size=(10,7), sort_index=False,
                ascending=False):
    '''
    Creates a horizontal bar chart based on the unique values in the y-axis or the total
    sum of boolean values in a list of columns "cols". Certain bars can be highlighted in
    a custom color using the special_list and special_color args.
    
    Parameters
    ----------
    df : DataFrame
        DF from which the data will be drawn
    y_axis : str
        the Series name which contains the values to have on the y-axis; i.e. the labels
        to use to make the bar chart. Use either cols or y_axis.
    special_list : array-like, optional
        list of items on the y-axis for which the special_color argument will be applied,
        default None
    cols : list of strings
        the columns in the DF which will be plotted; these columns should be boolean type.
        Use either cols or y_axis.
    default_color : str, optional
        matplotlib colorname for all non-special bars, default "tab:grey"
    special_color : str, optional
        matplotlib colorname for all bars included in the special_list, default "tab:red"
    label_offset : int, optional
        space in units along the x-axis to offset the text value of each bar, default 100
    title_text : str, optional
        text for the title, default None
    def_patch : str, optional
        text to describe the "default color" bars within the legend; if None, legend is not
        drawn, default None
    spec_patch : str, optional
        text to describe the "special color" bars within the legend; if None, legend is not
        drawn, default None
    size : tuple of integers, optional
        matplotlib figure.set_size_inches() parameter, default (10, 7)
    sort_index : bool, optional
        whether to sort the bars by index name rather than value, default False
    ascending : bool, optional
        if sort_index is True, then whether to sort by ascending index order, default False
        
    Returns
    -------
    None
    Just draws the chart    
    
    '''
    # Count the 2020 responses: column sums when cols is given, otherwise the
    # value counts of the y_axis column
    df_2020 = df.loc[df.year == 2020]
    if cols is not None:
        counts = df_2020[cols].sum().sort_values(ascending=False)
    elif y_axis is not None:
        counts = df_2020[y_axis].value_counts()
    else:
        raise ValueError("Provide either y_axis or cols")
    # Optionally order the bars by label instead of by count
    if sort_index:
        counts = counts.sort_index(ascending=ascending)
    x = counts.index.values
    y = counts.values

    # Create the list of colors for the bars; with no special_list, every bar
    # gets the default color
    special_list = [] if special_list is None else special_list
    colors = [special_color if val in special_list else default_color for val in x]
    
    # Plot the bars
    fig, ax = plt.subplots()
    bars = ax.barh(range(len(x)), y, alpha=0.6)
    for i, bar in enumerate(bars):
        bar.set_color(colors[i])
        ax.text(bar.get_width() + label_offset, bar.get_y() + bar.get_height()/2,
                "{:,}".format(int(bar.get_width())), va="center", color="dimgrey", fontsize=11)
    ax.tick_params(axis="both", which="both", length=0, labelsize=11, labelcolor="dimgrey")
    for s in ["top", "bottom", "left", "right"]:
        ax.spines[s].set_color("white")
    ax.set_yticks(range(len(x)))
    ax.set_yticklabels(x, alpha=1)
    ax.set_xticklabels("")
    ax.invert_yaxis()
    # Pass the visibility flag positionally; the "b" keyword was removed in newer matplotlib
    ax.xaxis.grid(True, which="major", color="white", linestyle="-")
    ax.set_title(title_text, fontsize=16, alpha=0.8)

    # Create the legend if both of the 'patch' arguments were given
    if def_patch != None and spec_patch != None:
        grey_patch = mpatches.Patch(color=default_color, alpha=0.6, label=def_patch)
        red_patch = mpatches.Patch(color=special_color, alpha=0.6, label=spec_patch)
        ax.legend(loc=(0.6, 0.1), handles=[red_patch, grey_patch], fontsize=11)

    fig.set_size_inches(*size)
    plt.show()
    
    
summaryBar(
    df=df, 
    y_axis="title",
    special_list=["Student", "Other", "Currently not employed"], 
    default_color="tab:grey", 
    special_color="tab:red",
    label_offset=100, 
    title_text="2020 Survey Respondents by Job Title", 
    def_patch='Data/Programming Title', 
    spec_patch='Non-Data/Programming Title', 
    size=(10,7),
    sort_index=False
)

Since nearly half of our data is coming from respondents with a "non-data/non-programming" job title, it's important to first address the question:

#### Are the survey responses from respondents with "non-data titles" still relevant to our analysis which seeks to bring insights to data-related careers? 

I believe so, for several reasons:

- First, there are a number of paths to working in a "data-related" career, many of which may not include "data", "analyst" or "programmer" in the job title. 
- Second, those with "non-data-related" titles may very well be involved with data regularly. Many "Students", for example, will likely be data-related students, those "Currently not employed" may be in transition from one data job to another, and those belonging to "Other" may also be working with data in some way. 
- Third, those people who've registered an email with Kaggle (and thus were able to receive the survey invitation) are almost certainly involved with, transitioning to or interested in data work to some extent.
- Finally, and as we'll see more clearly later, respondents with "non-data-related" titles are also heavy users of a lot of these technologies, tools and languages!

Therefore, ***all the data of this dataset can help us identify and understand situations and trends in the data industry, even from those respondents who don't currently have a data-related job title***. Some of our analyses will, however, break out the different job titles so we can see the trends among them. But given the above points, when I refer to "data workers" at times during the analysis, I'm including all the survey respondents as all of them are involved in data to a certain degree.
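
To make the idea of breaking out job titles concrete, here is a minimal sketch on a made-up toy DataFrame. The rows are invented for illustration (they are not survey data), and the names `toy`, `non_data_titles` and `usage_by_group` are assumptions of this example. It flags the three titles highlighted in the chart above and compares how often each group reports using a language.

```python
import pandas as pd

# Toy data, invented for illustration only (not survey results)
toy = pd.DataFrame({
    "title": ["Student", "Data Scientist", "Other", "Data Analyst",
              "Currently not employed", "Software Engineer"],
    "Python": [True, True, False, True, True, False],
    "SQL": [False, True, True, True, False, True],
})

# The three titles treated as "non-data/programming" titles in this notebook
non_data_titles = ["Student", "Other", "Currently not employed"]
toy["title_group"] = toy["title"].isin(non_data_titles).map(
    {True: "non-data title", False: "data title"}
)

# Share of respondents in each group who use each language
# (the mean of a boolean column is the fraction of True values)
usage_by_group = toy.groupby("title_group")[["Python", "SQL"]].mean()
print(usage_by_group)
```

Taking the mean of boolean columns keeps the comparison in shares rather than raw counts, which matters when the two groups differ in size.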


### Methodology

The methodology for preparing the data was to read in and clean up relevant portions of the 2018-2020 datasets, merge them together and use <code>.groupby()</code> calls and pivot tables to aggregate answers to specific questions. Four main kinds of visualizations will be used in each key area, ranging from a simple overview chart to a "2018 to 2020 trend" chart with two axes. Further details of the data preparation are commented within the code.
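
As a rough illustration of that aggregation step, the sketch below melts boolean "uses this tool" columns into long form and pivots them into a year-by-tool usage rate. The toy rows are invented for the example and the names `long_form` and `usage_rate` are placeholders, so treat it as a sketch of the approach rather than the notebook's exact code.

```python
import pandas as pd

# Toy data, invented for illustration only: one row per respondent
toy = pd.DataFrame({
    "year": [2018, 2018, 2019, 2019, 2020, 2020],
    "Python": [True, False, True, True, True, True],
    "R": [True, True, False, True, False, False],
})

# Long form: one row per (respondent, skill) pair
long_form = toy.melt(id_vars="year", var_name="skill", value_name="use")
long_form["use"] = long_form["use"].astype(int)

# Pivot to the share of respondents per year who use each skill
usage_rate = long_form.pivot_table(
    index="year", columns="skill", values="use", aggfunc="mean"
)
print(usage_rate)
```

The same long-form table can feed a `.groupby()` instead of a pivot table when extra grouping keys such as country or job title are needed.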


#### The 5 Key Areas of this Analysis
This analysis will focus on 5 key areas:

1. Programming Languages
2. Machine Learning Frameworks
3. Cloud Computing Platforms
4. Big Data Tools
5. Business Intelligence Tools

For each of these areas, we'll explore both the overarching picture and then dive down a little deeper to understand these data tools and trends for different careers, taking a specific look at the situation in **China** for each given key area. 

#### Groundwork to Lay

There will be some groundwork to lay with the first key area (programming languages) to introduce the main visualization tools, but after that we'll be able to move along faster with areas 2 through 5 and reap the benefits of applying the same analysis from area 1 to these other areas.

Let's begin.

---

In [None]:
def vsCompare(df, A, B, y_axis="country", color1="tab:red", color2="tab:blue", thresh=40,
             A_loc=(0.2, 0.2), B_loc=(0.6, 0.6), xSpaceMultiplier=1.2, label_pad=10, 
             text_loc=(0.52, 0.05), size=(12, 10)):
    '''
    Compares two Series "head-to-head" among a group on the y-axis with sorted horizontal bar graphs 
    with two colors, each representing one of the two Series.
    
    Parameters
    ----------
    df : DataFrame
        DF which contains the Series "A", "B" and "y_axis" below
    A : string
        the (column) name of the first Series to compare
    B : string
        the (column) name of the second Series to compare
    y_axis : string, optional
        the Series name which contains the values to have on the y-axis, default "country"
    color1 : matplotlib color, optional
        color for the "B" series (plotted to the right), default "tab:red"
    color2 : matplotlib color, optional
        color for the "A" series (plotted to the left), default "tab:blue"
    thresh : int, optional
        threshold for which values in the "y_axis" to include in the chart based on a given value
        meeting the threshold which is the number of rows that have at least A or B true; for example,
        if A and B are two programming languages, and in a given country there are only 30 respondents
        who know one or both those languages and the thresh is 40, this country would not appear in the
        resulting chart, default 40
    A_loc : tuple of floats between 0 and 1, optional
        location of the text label for "A"; (0, 0) is bottom left corner, default (0.2, 0.2)
    B_loc : tuple of floats between 0 and 1, optional
        location of the text label for "B"; (0, 0) is bottom left corner, default (0.6, 0.6)
    xSpaceMultiplier : float, optional
        adds space in between the two stacks of horizontal bar charts, default 1.2
    label_pad : int, optional
        space in units between the edge of a horizontal bar and its numerical label, default 10
    text_loc : tuple of floats between 0 and 1, optional
        location between (0, 0) and (1, 1) to display the text "based on threshold of ___", default
        (0.52, 0.05) which is near the lower center
    size : tuple of integers, optional
        matplotlib figure.set_size_inches() parameter, default (12, 10)
    
    Returns
    -------
    DataFrame which is the resulting pivot table from which the results are graphed
    Also shows the chart
    
    '''
    # Create a DF with the columns of country and those columns to compare
    df2 = df.loc[df.year==2020, [y_axis, A, B]]

    # Melt and pivot the data, summed up by country
    df2 = df2.melt(
        id_vars = y_axis,
        var_name = "data_tool",
        value_name = "use"
    )
    PT = df2.pivot_table(
        index = y_axis,
        columns = "data_tool",
        values = "use",
        aggfunc = "sum"
    )

    # Function that will later calculate the "Tool A Preferred" Series
    pref_func = lambda a, b: (b - a) / a

    # Now, tidy up the Pivot Table and ready it for viz
    PT["total"] = PT.sum(axis=1)
    # Remove countries with < the threshold defined
    PT = PT.loc[PT.total>=thresh, :]
    # Sort the values, make all integers
    PT = PT.sort_values("total", ascending=False).applymap(int)
    # Create the Series of data that will be visualized
    # The "Tool A Preferred" of 40% for instance means that 40% more people prefer
    # tool A to tool B
    PT["B_Preferred"] = pref_func(PT[A], PT[B])
    PT["A_Preferred"] = pref_func(PT[B], PT[A])
    # Remove the negative values since we don't want these to plot (since negative values
    # mean that the other Tool was preferred, and that will be the one that's graphed)
    PT = PT.applymap(lambda x: 0 if x < 0 else x)
    # Make Tool A's percentages negative so that they graph the other way
    PT.loc[:, [A, "A_Preferred"]] = PT.loc[:, [A, "A_Preferred"]] * -1
    # Sort by Tool B preferred then Tool A, to get the 'spiral' kind of effect
    PT = PT.sort_values(["B_Preferred", "A_Preferred"], ascending=False)
    # Combine these into a single column which will be graphed
    PT["Preferred"] = PT["B_Preferred"] + PT["A_Preferred"]
    # Create the colors
    PT["colors"] = PT.Preferred.apply(lambda x: color1 if x > 0 else color2)
    # Calculate the difference from one Tool to the other (this will be graphed as the dark color)
    PT["difference"] = PT[A] + PT[B]
    # Reset index to "free up" the country column for easier indexing
    PT = PT.reset_index()
    # Remove "Other" countries. To keep "Other" countries, we would need to add back in all the
    # other countries we filtered out based on our threshold above, as they would count now as
    # "other".
    if y_axis == "country":
        PT = PT.loc[PT.country!="Other", :]


    # Plot the bar charts
    fig, ax = plt.subplots()
    ax.barh(PT[y_axis], PT.difference, color=PT.colors, alpha=0.7)
    B_bars = ax.barh(range(len(PT)), PT[B], color=color1, alpha=0.2)
    A_bars = ax.barh(range(len(PT)), PT[A], color=color2, alpha=0.2)
    # Add in the actual number of respondents for each country
    for bar in B_bars:
        if bar.get_width() > 0:
            ax.text(bar.get_width() + label_pad, bar.get_y() + bar.get_height()/2,
                    "{:,}".format(np.abs(bar.get_width())), va="center", ha="left", 
                     color=color1, fontsize=9, alpha=0.7)
    for bar in A_bars:
        if bar.get_width() < 0:
            ax.text(bar.get_width() - label_pad, bar.get_y() + bar.get_height()/2,
                    "{:,}".format(np.abs(bar.get_width())), va="center", ha="right", 
                     color=color2, fontsize=9, alpha=0.7) 
    ax.invert_yaxis()
    ax.set_title("{} vs. {}".format(A, B), fontsize=20, alpha=0.8)
    ax.tick_params(axis="both", which="both", length=0, labelsize=11)
    ax.text(*B_loc, B, ha='center', va='center', transform=ax.transAxes,
            fontsize=14, color=color1)
    ax.text(*A_loc, A, ha='center', va='center', transform=ax.transAxes,
            fontsize=14, color=color2)
    ax.yaxis.set_tick_params(pad=5)
    ax2 = ax.twiny()
    percent_bars = ax2.barh(PT[y_axis], PT.Preferred, color=PT.colors, alpha=0.5)
    ax2.set_xlim(-22,1)
    ax2.text(1.05, 0.5, "% respondents\nwho use one\ntool more than\nthe other", 
            ha='left', va='center', transform=ax.transAxes, fontsize=11, color="tab:grey",
            alpha=0.8)
    for s in ["top", "bottom", "right", "left"]:
        ax.spines[s].set_color("white")
        ax2.spines[s].set_color("white")
    ax.set_xlim(ax.get_xlim()[0], ax.get_xlim()[1] * xSpaceMultiplier)
    ax.set_xticklabels("")
    ax2.set_xticklabels("")
    for bar in percent_bars:
        if bar.get_width() > 0:
            ax2.text(bar.get_width() + 0.2, bar.get_y() + bar.get_height()/2,
                    "{:.0%}".format(np.abs(bar.get_width())), va="center", ha="left", 
                     color="dimgrey", fontsize=10)
        else:
            ax2.text(bar.get_width() - 0.2, bar.get_y() + bar.get_height()/2,
                    "{:.0%}".format(np.abs(bar.get_width())), va="center", ha="right", 
                     color="dimgrey", fontsize=10)
    ax.text(
        *text_loc, 
        "Based on a threshold of {:,} respondents\nwho use one or both tools".format(thresh), 
        transform=ax.transAxes, fontsize=10, alpha=0.6, color="tab:grey"
    )        
    fig.set_size_inches(*size)
    plt.show()
    
    return PT


In [None]:
def allLangPT(df, cols, num_col, country="all", show=True, inverse=True,
              offsets=None, text_loc=(0.1, 0.03), title_text=None, 
              color_dict=None, patches=None, **kwargs):
    '''
    Takes inputs of country and variables and graphs the "% of respondents who use ___ skill"
    on the x-axis and the "# of total skills" that respondents use regularly who also use
    a given skill. The function is simply filtering by the given key-value criteria and 
    outputting the pivot table for all the skills (languages, for example) and showing the
    resulting graph.
    
    Parameters
    ----------
    df : DataFrame
        DF from which the data will be drawn
    cols : list of strings
        the columns in the DF which will be plotted
    num_col : str
        name of the column which is the total 'skills' of a category that someone knows. 
        This is used for the y-axis plotting in the chart.
    country : str, optional
        filter the DF by a specific country, default "all" to show results for all countries
    show : bool, optional
        whether to display the graph or not, default True
    inverse : bool, optional
        whether to display the y-axis as 'total num languages used' rather than the "% share", 
        which may aid interpretation, default True
    offsets : dictionary, optional
        dictionary of keys which are the "cols" above to offset the ax.text() location to 
        avoid labels placed on top of each other. A bit hacky, yes. But hey it works. Default
        None.
    text_loc : tuple of floats between 0 and 1, optional
        location between (0, 0) and (1, 1) to display the text "based on ___ samples", default
        (0.1, 0.03) which is near the lower left hand corner
    title_text : str, optional
        text for the title, default None
    color_dict : dictionary, optional
        keys of "cols" above and values of matplotlib colors
    patches : list of patch objects, optional
        matplotlib patches to overlay on the chart
    kwargs : dictionary, optional
        key-value pairs of column name and criteria by which to filter the column of the
        DataFrame; for example, {"year": 2020} filters the DF by the year column for rows
        with the year 2020, then continues to create the pivot table and chart, default None
    
    Returns
    -------
    DataFrame which is the resulting pivot table from which the results are graphed
    Also shows the chart    
    
    '''
    # Filter DF based on 'country' and kwargs
    if country!="all":
        df = df.loc[df.country==country, :]
    for k, v in kwargs.items():
        try:
            df = df.loc[df[k]==v, :]
        except Exception:
            print("Double filter applied") 
            continue

    # Filter DF based on columns
    df2 = df.loc[:, cols + [num_col, "year"]]
    # Record sample size of this filtered DF
    sample_size = df2.shape[0]
    # Prep data for pivot table
    df2 = df2.melt(
        id_vars = [num_col, "year"],
        var_name = "skill",
        value_name = "use"
    )

    # If no color dictionary is passed in, then make a default one
    if color_dict is None:
        # One "CN" color per column; matplotlib cycles these once N passes the palette length
        color_dict = dict(zip(cols, ["C{}".format(i) for i in range(len(cols))]))

    # Offset each skill label to avoid them overlapping
    offset_dict = dict(zip(cols, [(0,0) for x in range(len(cols))]))
    if offsets != None:
        for k, v in offsets.items():
            offset_dict[k] = v
    
    # Pivot Table to calculate how many people regularly use a given skill,
    # and of those people who use a given skill, how many other skills in
    # this category (such as programming languages) do they regularly use.
    pt = df2.groupby(["year", "skill", "use"]).agg(["count", "sum"])
    pt = pt.unstack()
    pt["total_count"] = pt.iloc[:, 0] + pt.iloc[:, 1]
    pt["skill_share"] = pt.iloc[:, 1] / pt.iloc[:, 3]
    # Instead of showing "% share of a skill", take the inverse, which shows
    # "how many total skills" are used regularly by those who use that given
    # skill.
    if inverse==True:
        pt["skill_share"] = 1 / pt["skill_share"]
    pt["use_skill"] = pt.iloc[:, 1] / pt.iloc[:, 4]
    # And clean up the Pivot Table a bit
    pt = pt.droplevel([1, 2], axis=1).iloc[:, -2:]
    pt = pt.reset_index("year")
    
    # These are the alphas and sizes for the 3 years
    alphas = [0.2, 0.35, 0.65]
    sizes = [150, 200, 250]
    # Use this for setting the markers below
    num_years = pt.year.nunique()
    
    if show==True:
        fig, ax = plt.subplots()
        for j, yr in enumerate([2018, 2019, 2020]):
            temp_pt = pt.loc[pt.year==yr, :]
            for i, row in enumerate(temp_pt.index.values):
                color = color_dict[row]
                # Set the marker to Pentagon if the color is green to help readers who have difficulty
                # distinguishing red/green. Only do this for 'trend' graphs which will have unlabeled dots.
                if num_years == 1:
                    marker = "o"
                else:
                    marker = "o" if color != "C2" else "p"
                offset_x = offset_dict[row][0]
                offset_y = offset_dict[row][1]
                x = temp_pt.loc[row, "use_skill"]
                y = temp_pt.loc[row, "skill_share"]
                # If the color is blue, then add a label in there which will be included in the legend
                # if there is more than one year plotted
                if color=="C0":
                    ax.scatter(x, y, marker=marker, s=sizes[j], alpha=alphas[j], color=color, label=str(yr))
                else:
                    # Don't plot "C" values for 2018 since they were combined with C++
                    if "C" in color_dict and color == "C3" and yr == 2018:
                        continue
                    else:
                        ax.scatter(x, y, marker=marker, s=sizes[j], alpha=alphas[j], color=color)   
                    
                if yr==2020:
                    ax.text(x + 0.02 + offset_x, y + offset_y, row, color=color, fontsize=14)
            if yr==2020:
                ax.set_title(title_text, fontsize=18, pad=15)
                ax.text(
                    *text_loc, "Based on {:,} respondents".format(int(sample_size)), 
                    transform=ax.transAxes, fontsize=11, alpha=0.4
                )
                ax.set_ylabel("# of Skills Regularly Used", fontsize=14, labelpad=20)
                ax.set_xlabel("% Who Regularly Use Skill", fontsize=14, labelpad=20)
                ax.xaxis.set_major_formatter(mtick.PercentFormatter(xmax=1, decimals=0, symbol="%"))
                ax.tick_params(axis="both", labelsize=12)
                if inverse==False:
                    ax.set_ylabel("% Share of Skills Regularly Used", fontsize=14, labelpad=20)
                    ax.yaxis.set_major_formatter(mtick.PercentFormatter(xmax=1, decimals=0, symbol="%"))
                fig.set_size_inches(8, 8)
        # Create the legend if more than 1 year will be plotted
        if num_years > 1:
            ax.legend(loc=(0.84, 0.82), fontsize=9, handletextpad=0.3, labelspacing=1.1,
                     frameon=False)
        # Add patches to axis if they were added in
        if patches is not None:
            for patch in patches:
                ax.add_patch(patch)
        
        plt.show()
    
    return pt


In [None]:
def heatmap(df, cols, expand_col, cbar_text, title_text, transpose=False, 
            rotation=45, fraction=0.056, cmap="Greens", xlabel=None, 
            ylabel=None, size=(8,8)):
    '''
    Makes a pivot table from the DataFrame based on the cols and expand_col parameters
    and graphs a heatmap with the boolean values which come from the cols.
    
    Parameters
    ----------
    df : DataFrame
        DF from which the data will be drawn
    cols : list of strings
        the columns in the DF which will be plotted; these columns should be boolean type
    expand_col : str
        name of the column which will be expanded along the x-axis
    cbar_text : str
        text label for the matplotlib colorbar beside the chart
    title_text : str
        text for the title
    transpose : bool, optional
        whether to transpose the x and y axes, default False
    rotation : int, optional
        x-axis tick label rotation in degrees, default 45
    fraction : float, optional
        controls the size of the cbar to align it with the height of the heatmap, 
        default 0.056
    cmap : str, optional
        matplotlib colormap, default "Greens"
    xlabel : str, optional
        text for xlabel, default None
    ylabel : str, optional
        text for ylabel, default None
    size : tuple of integers, optional
        matplotlib figure.set_size_inches() parameter, default (8, 8)
    
    Returns
    -------
    DataFrame which is the resulting pivot table from which the results are graphed
    Also shows the heatmap chart    
    
    '''
    # Keep only the 2020 rows and the columns needed for the heatmap
    df2 = df.loc[df.year==2020, cols + [expand_col]]
    df2 = df2.melt(
        id_vars = expand_col,
        var_name = "variable",
        value_name = "usage"
    )
    # Make the pivot table based on the transpose parameter
    if transpose==False:
        heatmap_pt = df2.pivot_table(
            index="variable",
            columns=expand_col,
            values="usage",
            aggfunc="mean"
        )
    else:
        heatmap_pt = df2.pivot_table(
            index=expand_col,
            columns="variable",
            values="usage",
            aggfunc="mean"
        )
        
    # Do the heatmap visualization
    fig, ax = plt.subplots()
    im = ax.imshow(heatmap_pt, cmap=cmap)
    ax.set_xticks(np.arange(heatmap_pt.shape[1]))
    ax.set_yticks(np.arange(heatmap_pt.shape[0]))
    ax.set_xticklabels(heatmap_pt.columns)
    ax.set_yticklabels(heatmap_pt.index.values)
    plt.setp(ax.get_xticklabels(), rotation=rotation, ha="right",
             rotation_mode="anchor")
    cbar = ax.figure.colorbar(im, ax=ax, fraction=fraction, pad=0.03)
    cbar.ax.set_ylabel(cbar_text, rotation=-90, va="bottom", labelpad=10, fontsize=12)
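    # Pin the tick locations before relabeling them as percentages, so the
    # labels stay aligned with the ticks they describe
    cbar_ticks = cbar.get_ticks()
    cbar.set_ticks(cbar_ticks)
    cbar.ax.set_yticklabels(["{:.0%}".format(x) for x in cbar_ticks])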
    ax.set_title(title_text, fontsize=16, alpha=0.8, pad=15)
    ax.set_xlabel(xlabel, fontsize=12, alpha=0.8, labelpad=15)
    ax.set_ylabel(ylabel, fontsize=12, alpha=0.8, labelpad=15)
    fig.set_size_inches(*size)
    plt.show()
    
    return heatmap_pt


# 1. Programming Languages

### Summary of Programming Languages

The first area we'll explore is programming languages. Starting with a summary, we see that Python and SQL are the most common "Languages Regularly Used" among the respondents:


In [None]:
summaryBar(
    df=df, 
    y_axis=None,
    cols=language_cols,
    special_list=["Python"], 
    default_color="tab:grey", 
    special_color="tab:green",
    label_offset=100, 
    title_text="Languages Regularly Used, 2020", 
    def_patch=None, 
    spec_patch=None, 
    size=(10,7),
    sort_index=False
)

But there's another aspect of programming languages that would be helpful to understand: **Of the people who *do* regularly use a given language (say, "R"), of *all* the languages they use regularly, what % does R take up?** For example, the 4,277 respondents who regularly use R report 13,938 regularly used languages between them (a total I calculated separately from the DataFrame).

Of the 20,036 respondents in the 2020 survey, this means that **21.3% regularly use R**, and that of those people who *do* regularly use R, **R accounts for 30.7% of the languages they regularly use**.
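
The arithmetic behind these two figures can be reproduced in a few lines. This is a minimal sketch that uses only the three totals quoted above; the variable names are my own for illustration and are not columns of the notebook's DataFrame.

```python
# Totals quoted in the text above (2020 survey)
r_users = 4_277                  # respondents who regularly use R
langs_used_by_r_users = 13_938   # languages regularly used by those R users, summed
all_respondents = 20_036         # respondents in the 2020 survey

# x-axis: share of all respondents who regularly use R
use_rate = r_users / all_respondents

# y-axis ("% share"): R's share of all the languages its users regularly use
skill_share = r_users / langs_used_by_r_users

print(f"Regularly use R: {use_rate:.1%}")
print(f"R's share of its users' languages: {skill_share:.1%}")
```

Running this prints 21.3% and 30.7%, matching the figures in the text.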

Let's look at *all* the programming languages based on the percentage of those who regularly used that language (x-axis) and the "percentage share" of that language (y-axis):

In [None]:
# All respondents

offset = {
    "C": (-0.06, -0.004),
    "Java": (0.005, -0.004),
    "Bash": (-0.03, 0.004),
    "Python": (-0.12, -0.005),
    "Javascript": (0.025, -0.001)
}
cols = ["Python", "R", "SQL", "C", "C++", "Java", "Javascript", 
        "Bash", "MATLAB"]
color_dict = dict(zip(language_cols, ["C{}".format(i) for i in range(11)]))
color_dict["MATLAB"] = "C8"
                    
pt = allLangPT(
    df, 
    cols=cols, 
    num_col="num_lang", 
    country="all", 
    show=True, 
    inverse=False, 
    offsets=offset, 
    text_loc=(0.6, 0.03),
    title_text="Programming Languages Regularly Used:\nAll Respondents, 2020", 
    color_dict=color_dict, 
    year=2020,
    patches = [
        mpatches.Rectangle((0.163, 0.2965), 0.11, 0.02, linewidth=1, edgecolor='r', facecolor='none')
    ]   
)

First, we can see **R** plotted where we expected from our calculations above: **21.3% regular usage** (x-axis) and taking up **30.7% of the total languages** that R users regularly use (y-axis).

If we just look at the **x-axis**, we see the same proportions that the above summary bar chart showed (removing Swift and Julia which had very low numbers) - Python, then SQL, then R, etc.

From the **y-axis**, we can see that **Python** also occupied a much larger "% share" of total languages used than other languages.

**We can invert this "% share" statistic to make it more interpretable.** For instance, in our previous example with R, inverting R's "% share" of 30.7% gives us **3.26**, which is simply the **number of languages regularly used by those who regularly use R**. This is a more interpretable and meaningful metric than the "% share", so from now on we'll only use this for our y-axis when plotting this kind of "trend chart".
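
As a quick check of this inversion, here is a small sketch using the R totals quoted earlier (4,277 regular R users who report 13,938 regularly used languages in total). The variable names are illustrative only. It also shows why the inverse is easy to interpret: it equals the average number of languages per R user.

```python
r_users = 4_277                  # respondents who regularly use R
langs_used_by_r_users = 13_938   # languages regularly used by those R users, summed

skill_share = r_users / langs_used_by_r_users   # about 0.307
langs_per_r_user = 1 / skill_share              # inverse of the "% share"

# The inverse is the same quantity as the mean number of languages per R user
mean_langs = langs_used_by_r_users / r_users
assert abs(langs_per_r_user - mean_langs) < 1e-9

print(f"{langs_per_r_user:.2f} languages per regular R user")
```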

In [None]:
# This is just the inverted chart of above

offset = {
#     "C": (-0.06, -0.004),
#     "Java": (0.005, -0.004),
    "Bash": (-0.04, 0.05),
    "Python": (-0.12, 0.001),
    "Javascript": (0.00, -0.04)
}
cols = ["Python", "R", "SQL", "C", "C++", "Java", "Javascript", 
        "Bash", "MATLAB"]
color_dict = dict(zip(language_cols, ["C{}".format(i) for i in range(11)]))
color_dict["MATLAB"] = "C8"

pt = allLangPT(
    df, 
    cols=cols, 
    num_col="num_lang", 
    country="all", 
    show=True, 
    inverse=True, 
    offsets=offset, 
    text_loc=(0.1, 0.03),
    title_text="Programming Languages Regularly Used:\nAll Respondents, 2020", 
    color_dict=color_dict, 
    year=2020
)

The **y-axis** label has now been converted from "% share" to its inverse, "Number of skills regularly used".

Now we can see that Python is not only regularly used by nearly 80% of respondents, but those who *do* use Python regularly use only about **2.7 languages on average**. **C**, on the other hand, is **regularly used by only about 15%** of respondents, and those C users regularly use **more than 4 languages** on average.

#### How can the y-axis "number of languages regularly used" metric be helpful to us? 
For career considerations, it's useful to know how much a certain skill (such as a programming language) is used in proportion to other skills in a given job, company size or country. A skill in one circumstance might be just 1 of 4 that are regularly used (like C in this example), whereas the same skill in a different job, country or company might be 1 of 2 skills commonly used, which would indicate that the skill plays a larger part in that role.

But we should also ask the question, **"Why do those who regularly use Python only use 2.7 languages regularly?"** Is it because Python is so useful for many things, or is it because those who started programming in recent years began with Python and thus use fewer languages regularly? Based on this kind of analysis of several skills in this report, I think both factors are present, and perhaps the usefulness of these leading skills (like Python here) also draws more students and younger people to learn them. (More on this later and in Appendix B.)

At the same time, regarding the y-axis statistic of "number of skills regularly used", just keep in mind that this is ***descriptive*** of the survey results, not ***prescriptive*** for how many skills you or I should regularly use!

Seeing both the statistics of "Percentage of those who regularly use a language" (x-axis) and the "Total number of skills regularly used" (y-axis) **can help us identify and prioritize certain skills** which turn out to be more relevant to our specific career goals.
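
To make the two axes concrete, here is a toy sketch that computes both statistics from scratch. The five respondents below are invented purely for illustration (they are not survey data), and `language_stats` is a hypothetical helper, not a function from this notebook.

```python
# Hypothetical respondents, each with the set of languages they regularly use
respondents = [
    {"Python"},
    {"Python", "SQL"},
    {"Python", "SQL", "R"},
    {"Python", "C", "C++", "Java"},
    {"R", "SQL"},
]

def language_stats(respondents, language):
    """Return (share of respondents who use `language`,
    average number of languages among those users)."""
    users = [langs for langs in respondents if language in langs]
    if not users:
        return 0.0, float("nan")
    use_rate = len(users) / len(respondents)                       # x-axis
    avg_langs = sum(len(langs) for langs in users) / len(users)    # y-axis
    return use_rate, avg_langs

for lang in ["Python", "SQL", "R", "C"]:
    x, y = language_stats(respondents, lang)
    print(f"{lang:<6} x = {x:.0%}   y = {y:.2f}")
```

In this made-up sample, Python is used by 4 of 5 respondents who average 2.5 languages each, while C is used by only 1 respondent who uses 4 languages.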


### Trends in Programming Languages

Let's now examine the trends from 2018-2020 in these two statistics:

In [None]:
offset = {
    "Javascript": (0.00, -0.05),
    "Bash": (-0.04, 0.05),
    "Python": (-0.12, 0.001),
    "MATLAB": (-0.02, -0.08),
    "C++": (0, -0.05),
    "C": (0, 0.03)
}
cols = ["Python", "R", "SQL", "C", "C++", "Java", "Javascript", 
        "Bash", "MATLAB"]
color_dict = dict(zip(language_cols, ["C{}".format(i) for i in range(11)]))
color_dict["MATLAB"] = "C8"
                    
pt = allLangPT(
    df, 
    cols=cols, 
    num_col="num_lang", 
    country="all", 
    show=True, 
    inverse=True, 
    offsets=offset, 
    text_loc=(0.1, 0.03),
    title_text="Programming Languages Regularly Used:\nAll Respondents, 2018-2020", 
    color_dict=color_dict, 
#     year=2020
)

A big trend is that **Python has jumped by nearly 15%** in respondents who regularly use it. **R has decreased 5-10%** by the same metric, and **SQL has increased by about 5%**.

Many languages, including SQL and R, are also increasing on the y-axis, meaning that those who regularly use these languages now regularly use slightly more languages in total than before.

>***Note:*** *The pentagon shape for SQL doesn't have meaning in the chart but is just there to help some readers distinguish the red and green data points, as the [National Eye Institute](https://www.nei.nih.gov/learn-about-eye-health/eye-conditions-and-diseases/color-blindness/types-color-blindness) identifies red/green as the most common kind of color blindness. The green data points in the following "trend graphs" like this one will all have this feature since only the 2020 data point is labeled and sometimes the green and red points for 2018 and 2019 data can stray near to each other.*
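
The marker rule described in this note can be sketched as a small standalone function. `pick_marker` is a hypothetical name used here for illustration; the rule itself follows the plotting code earlier in the notebook, where green is matplotlib's `"C2"` in the default color cycle.

```python
def pick_marker(color, num_years):
    """Return a matplotlib marker code: 'p' (pentagon) for green points on
    multi-year trend charts, 'o' (circle) in every other case."""
    if num_years > 1 and color == "C2":   # "C2" is green in the default color cycle
        return "p"
    return "o"

print(pick_marker("C2", num_years=3))  # green point on a trend chart
print(pick_marker("C2", num_years=1))  # green point on a single-year chart
print(pick_marker("C3", num_years=3))  # red point on a trend chart
```

Using a second visual channel (shape) alongside color means the green series can still be told apart from the red one even when the two hues look alike to the reader.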



### How about Programming Languages among Data Scientists?

In [None]:
offset = {
    "Javascript": (0.01, -0.05),
    "C++": (0, 0.01),
    "MATLAB": (0, -0.04),
    "Python": (-0.1, 0.06)
}
cols = ["Python", "R", "SQL", "C", "C++", "Java", "Javascript", 
        "Bash", "MATLAB"]
color_dict = dict(zip(language_cols, ["C{}".format(i) for i in range(11)]))
color_dict["MATLAB"] = "C8"
                    
pt = allLangPT(
    df, 
    cols=cols, 
    num_col="num_lang", 
    country="all", 
    show=True, 
    inverse=True, 
    offsets=offset, 
    text_loc=(0.1, 0.03),
    title_text="Programming Languages Regularly Used:\nData Scientists, 2020", 
    color_dict=color_dict, 
    year=2020,
    title="Data Scientist"
)

Unsurprisingly, Python, SQL and R are at the top in regular usage.

### How about the trend of Programming Languages among Data Scientists from 2018-2020?

In [None]:
offset = {
    "Javascript": (0.01, -0.05),
    "C++": (0, 0.01),
    "MATLAB": (0, -0.04),
    "Python": (-0.1, 0.06)
}
cols = ["Python", "R", "SQL", "C", "C++", "Java", "Javascript", 
        "Bash", "MATLAB"]
color_dict = dict(zip(language_cols, ["C{}".format(i) for i in range(11)]))
color_dict["MATLAB"] = "C8"
                    
pt = allLangPT(
    df, 
    cols=cols, 
    num_col="num_lang", 
    country="all", 
    show=True, 
    inverse=True, 
    offsets=offset, 
    text_loc=(0.1, 0.03),
    title_text="Programming Languages Regularly Used:\nData Scientists, 2018-2020", 
    color_dict=color_dict, 
#     year=2020,
    title="Data Scientist",
    patches = [
        mpatches.Rectangle((0.44, 3.05), 0.2, 0.33, linewidth=1, edgecolor='r', facecolor='none'),
        mpatches.Rectangle((0.72, 2.48), 0.25, 0.36, linewidth=1, edgecolor='r', facecolor='none'),
    ]   
)

The trend shows that only two languages, **Python** and **SQL**, have made a sizeable increase in total regular usage among Data Scientists, whereas the other languages (including R) have more or less stayed the same in regular usage.

This statistic highlights both the growth of these languages as well as the importance for aspiring Data Scientists to gain experience with them.

### Which languages do Software Engineer respondents use the most?

In [None]:
offset = {
    "Javascript": (0.0, 0.03),
    "C++": (0, 0.01),
    "MATLAB": (0, -0.04),
    "Python": (-0.12, 0.06)
}
cols = ["Python", "R", "SQL", "C", "C++", "Java", "Javascript", 
        "Bash", "MATLAB"]
color_dict = dict(zip(language_cols, ["C{}".format(i) for i in range(11)]))
color_dict["MATLAB"] = "C8"
                    
pt = allLangPT(
    df, 
    cols=cols, 
    num_col="num_lang", 
    country="all", 
    show=True, 
    inverse=True, 
    offsets=offset, 
    text_loc=(0.1, 0.03),
    title_text="Programming Languages Regularly Used:\nSoftware Engineers, 2020", 
    color_dict=color_dict, 
    year=2020,
    title="Software Engineer"
)

#### Python first choice?
The fact that **Python** is the most commonly used language simply confirms what we already know about the Kaggle Survey dataset: it was made by data enthusiasts *for* data enthusiasts, so the software engineers involved in a data science community like Kaggle will naturally be inclined to use Python more regularly than the average programmer would.

#### Interesting Feature: The Diagonal Line
Another interesting feature is the roughly diagonal line that these points form. Respondents who use a more common skill (Python, in this case) tend to use fewer tools in total (lower y-coordinate), whereas respondents who use a less common tool such as MATLAB tend to use more tools in general (higher y-coordinate).

This makes intuitive sense, and it's a feature that will show up often in this "trend graph". *(For a closer look at this pattern and my theory for why it shows up so often, please see **Appendix B**.)*

### Now let's see the 2018-2020 trends for Software Engineers:



In [None]:
offset = {
    "Javascript": (0.0, 0.03),
    "C++": (0, 0.01),
    "MATLAB": (0, -0.04),
    "Python": (-0.12, 0.06)
}
cols = ["Python", "R", "SQL", "C", "C++", "Java", "Javascript", 
        "Bash", "MATLAB"]
color_dict = dict(zip(language_cols, ["C{}".format(i) for i in range(11)]))
color_dict["MATLAB"] = "C8"
                    
pt = allLangPT(
    df, 
    cols=cols, 
    num_col="num_lang", 
    country="all", 
    show=True, 
    inverse=True, 
    offsets=offset, 
    text_loc=(0.1, 0.03),
    title_text="Programming Languages Regularly Used:\nSoftware Engineers, 2018-2020", 
    color_dict=color_dict, 
#     year=2020,
    title="Software Engineer"
)

#### Growth Languages
Many languages, including **Python**, **SQL**, **JavaScript** and **Java**, have seen considerable growth in "% regular usage".

#### Using More Languages Regularly
Another interesting pattern is that **Software Engineers are using more languages regularly** than they were in 2018 and 2019, as shown by the most solid-colored points (2020) sitting higher on the y-axis than the points from previous years.

Clearly a trend among Software Engineers is regularly using more languages than before. While the data and visualization don't necessarily reveal why that is, the practical takeaway for someone who wants to be a Software Engineer (perhaps a Software Engineer with a slant toward data?) is to expect to work with more than just 1 or 2 cool languages.

Let's take a look at another profession.

### How about trends in programming languages among Statisticians?

First, let's look at the 2020 data:

In [None]:
offset = {
    "Javascript": (-0.02, 0.09),
    "R": (-0.06, 0.06),
    "C": (0, -0.03)
}
cols = ["Python", "R", "SQL", "C", "C++", "Java", "Javascript", 
        "Bash", "MATLAB"]
color_dict = dict(zip(language_cols, ["C{}".format(i) for i in range(11)]))
color_dict["MATLAB"] = "C8"
                    
pt = allLangPT(
    df, 
    cols=cols, 
    num_col="num_lang", 
    country="all", 
    show=True, 
    inverse=True, 
    offsets=offset, 
    text_loc=(0.1, 0.03),
    title_text="Programming Languages Regularly Used:\nStatisticians, 2020", 
    color_dict=color_dict, 
    year=2020,
    title="Statistician"
)

As you might have anticipated, **R tops Python among Statisticians**. It's interesting that Statisticians who regularly use R use fewer than 2.5 languages regularly in total. This is one of the lower values I encountered in this analysis, suggesting that R is indeed the Statistician's primary tool and takes up a large part of their toolbox.

### And now the 2018-2020 trend in Programming Languages among Statisticians:

In [None]:
offset = {
    "Javascript": (-0.02, 0.09),
    "R": (-0.06, 0.06),
    "C": (0, -0.03)
}
cols = ["Python", "R", "SQL", "C", "C++", "Java", "Javascript", 
        "Bash", "MATLAB"]
color_dict = dict(zip(language_cols, ["C{}".format(i) for i in range(11)]))
color_dict["MATLAB"] = "C8"
                    
pt = allLangPT(
    df, 
    cols=cols, 
    num_col="num_lang", 
    country="all", 
    show=True, 
    inverse=True, 
    offsets=offset, 
    text_loc=(0.1, 0.03),
    title_text="Programming Languages Regularly Used:\nStatisticians, 2018-2020", 
    color_dict=color_dict, 
#     year=2020,
    title="Statistician"
)

Note that the 2018 R dot is plotted right underneath the 2020 dot: **R's usage fluctuated in 2019 but ended 2020 where it started in 2018**.

Python has gained some traction, but other languages have stayed relatively fixed in terms of "% regularly used". Again we see a slight trend of more languages being used per respondent in general, though it is not as pronounced as with the Software Engineers.

### How about programming languages used by Machine Learning Engineers?

Since "ML Engineer" was only added as an option in the 2020 survey, **we can only plot the 2020 numbers**:

In [None]:
# No ML Engineer data from 2018-2019

offset = {
    "Javascript": (0.0, 0.01),
    "C++": (-0.02, 0.05),
    "MATLAB": (-0.07, 0.06),
    "Bash": (0, -0.06),
    "Python": (-0.12, 0.06),
    "Java": (0.01, -0.03)
}
cols = ["Python", "R", "SQL", "C", "C++", "Java", "Javascript", 
        "Bash", "MATLAB"]
color_dict = dict(zip(language_cols, ["C{}".format(i) for i in range(11)]))
color_dict["MATLAB"] = "C8"
                    
pt = allLangPT(
    df, 
    cols=cols, 
    num_col="num_lang", 
    country="all", 
    show=True, 
    inverse=True, 
    offsets=offset, 
    text_loc=(0.1, 0.03),
    title_text="Programming Languages Regularly Used:\nML Engineers, 2020", 
    color_dict=color_dict, 
    year=2020,
    title="Machine Learning Engineer"
)

In the 2020 data, among the 1,082 Machine Learning Engineers, **Python is used regularly by more than 90%** with nothing else coming close, and **only about 10% of ML Engineers regularly use R**.

This point could help set the direction of someone who wants to be a ML Engineer and isn't quite sure where to get started.

### What languages do Students regularly use?

In [None]:
offset = {
    "SQL": (-0.05, 0.05),
    "Python": (-0.12, 0.04)
}
cols = ["Python", "R", "SQL", "C", "C++", "Java", "Javascript", 
        "Bash", "MATLAB"]
color_dict = dict(zip(language_cols, ["C{}".format(i) for i in range(11)]))
color_dict["MATLAB"] = "C8"
                    
pt = allLangPT(
    df, 
    cols=cols, 
    num_col="num_lang", 
    country="all", 
    show=True, 
    inverse=True, 
    offsets=offset, 
    text_loc=(0.1, 0.03),
    title_text="Programming Languages Regularly Used:\nStudents, 2020", 
    color_dict=color_dict, 
    year=2020,
    title="Student"
)

**Python again** is the most regularly used language among Students, followed by **C++, C** and **SQL**.

### How about the 2018-2020 trends of the languages Students often use?

In [None]:
offset = {
    "SQL": (-0.05, 0.05),
    "Python": (-0.12, 0.04)
}
cols = ["Python", "R", "SQL", "C", "C++", "Java", "Javascript", 
        "Bash", "MATLAB"]
color_dict = dict(zip(language_cols, ["C{}".format(i) for i in range(11)]))
color_dict["MATLAB"] = "C8"
                    
pt = allLangPT(
    df, 
    cols=cols, 
    num_col="num_lang", 
    country="all", 
    show=True, 
    inverse=True, 
    offsets=offset, 
    text_loc=(0.1, 0.03),
    title_text="Programming Languages Regularly Used:\nStudents, 2018-2020", 
    color_dict=color_dict, 
#     year=2020,
    title="Student"
)

For Students, **Python** again moves forward on the x-axis in "% regular usage", while **C** and **C++** have made meaningful strides as well. And, as before, we see a general trend of Students regularly using slightly more languages than they did in the past couple of years.

## Languages by Country

Switching gears, let's continue to analyze the usage of programming languages in different countries to provide additional context for those who are interested in a data career in a specific region. Since India had the most respondents in the 2020 survey, let's start there.

### What are the most regularly used programming languages among Kagglers in India?


In [None]:
offset = {
    "C++": (0, -0.05),
    "Python": (-0.12, 0.04)
}
cols = ["Python", "R", "SQL", "C", "C++", "Java", "Javascript", 
        "Bash", "MATLAB"]
color_dict = dict(zip(language_cols, ["C{}".format(i) for i in range(11)]))
color_dict["MATLAB"] = "C8"
                    
pt = allLangPT(
    df, 
    cols=cols, 
    num_col="num_lang", 
    country="India", 
    show=True, 
    inverse=True, 
    offsets=offset, 
    text_loc=(0.1, 0.03),
    title_text="Programming Languages Regularly Used:\nIndia, 2020", 
    color_dict=color_dict, 
    year=2020,
#     title="Student"
)

After **Python**, we see **SQL** at around **35%** regular usage, **C** and **C++** at about **25%**, **Java** near **20%** and **R** around **15%**. 

### How about the trend in India from 2018-2020?

In [None]:
# C and C++ increasing a lot

offset = {
    "C++": (0, -0.05),
    "Python": (-0.12, 0.04)
}
cols = ["Python", "R", "SQL", "C", "C++", "Java", "Javascript", 
        "Bash", "MATLAB"]
color_dict = dict(zip(language_cols, ["C{}".format(i) for i in range(11)]))
color_dict["MATLAB"] = "C8"
                    
pt = allLangPT(
    df, 
    cols=cols, 
    num_col="num_lang", 
    country="India", 
    show=True, 
    inverse=True, 
    offsets=offset, 
    text_loc=(0.1, 0.03),
    title_text="Programming Languages Regularly Used:\nIndia, 2018-2020", 
    color_dict=color_dict, 
#     year=2020,
#     title="Student"
)

In addition to a **big jump in Python's usage**, C, C++ and SQL are making gains as well in overall regular usage.

### What about programming languages in the U.S.?

In [None]:
offset = {
    "Javascript": (-0.06, -0.1),
    "MATLAB": (0, -0.06),
    "Python": (-0.12, 0.04)
}
cols = ["Python", "R", "SQL", "C", "C++", "Java", "Javascript", 
        "Bash", "MATLAB"]
color_dict = dict(zip(language_cols, ["C{}".format(i) for i in range(11)]))
color_dict["MATLAB"] = "C8"
                    
pt = allLangPT(
    df, 
    cols=cols, 
    num_col="num_lang", 
    country="USA", 
    show=True, 
    inverse=True, 
    offsets=offset, 
    text_loc=(0.1, 0.03),
    title_text="Programming Languages Regularly Used:\nUSA, 2020", 
    color_dict=color_dict, 
    year=2020,
#     title="Student"
)

**Python, SQL and R** are the most regularly used languages among Kagglers in the U.S. Compared with India, **R is much more widely used in the U.S.** whereas **C and C++ are more popular in India**.

### How about the 2018-2020 trend in the U.S.?

In [None]:
offset = {
    "Javascript": (-0.06, -0.1),
    "R": (0, 0.07),
    "MATLAB": (0, -0.06),
    "Python": (-0.1, 0.06)
}
cols = ["Python", "R", "SQL", "C", "C++", "Java", "Javascript", 
        "Bash", "MATLAB"]
color_dict = dict(zip(language_cols, ["C{}".format(i) for i in range(11)]))
color_dict["MATLAB"] = "C8"
                    
pt = allLangPT(
    df, 
    cols=cols, 
    num_col="num_lang", 
    country="USA", 
    show=True, 
    inverse=True, 
    offsets=offset, 
    text_loc=(0.1, 0.03),
    title_text="Programming Languages Regularly Used:\nUSA, 2018-2020", 
    color_dict=color_dict, 
#     year=2020,
#     title="Student"
)

A few interesting things about the programming language trends of Kagglers in the U.S.:

1. Besides Python and perhaps SQL, **there are few big jumps or drops in regular usage**.
2. Most of the other language trend charts showed an increase on the y-axis (the total number of languages regularly used), but Kagglers in the U.S. don't share this trend: **the y-axis values are either staying stable or slightly decreasing** across the board. As noted above, the y-axis is *descriptive* rather than *prescriptive*, but this could indicate that the trend is not to continually use more and more languages on the job, but perhaps only the ones most needed for a task.

Now let's turn our attention to **China**.

### How about programming languages regularly used in China?

In [None]:
offset = {
    "C++": (0, -0.06),
    "MATLAB": (-0.14, 0.02),
    "Python": (-0.12, 0.04)
}
cols = ["Python", "R", "SQL", "C", "C++", "Java", "Javascript", 
        "Bash", "MATLAB"]
color_dict = dict(zip(language_cols, ["C{}".format(i) for i in range(11)]))
color_dict["MATLAB"] = "C8"
                    
pt = allLangPT(
    df, 
    cols=cols, 
    num_col="num_lang", 
    country="China", 
    show=True, 
    inverse=True, 
    offsets=offset, 
    text_loc=(0.1, 0.03),
    title_text="Programming Languages Regularly Used:\nChina, 2020", 
    color_dict=color_dict, 
    year=2020,
#     title="Student"
)

Similar to India, **Chinese Kagglers use SQL, C++, C and Java (and even MATLAB) more than R**, whereas in the U.S. R was the third most commonly used language, well ahead of fourth place.

### How about the trend of programming languages in China from 2018-2020?

In [None]:
offset = {
    "C++": (0, -0.06),
    "MATLAB": (-0.14, -0.08),
    "Python": (-0.12, 0.04)
}
cols = ["Python", "R", "SQL", "C", "C++", "Java", "Javascript", 
        "Bash", "MATLAB"]
color_dict = dict(zip(language_cols, ["C{}".format(i) for i in range(11)]))
color_dict["MATLAB"] = "C8"
                    
pt = allLangPT(
    df, 
    cols=cols, 
    num_col="num_lang", 
    country="China", 
    show=True, 
    inverse=True, 
    offsets=offset, 
    text_loc=(0.1, 0.03),
    title_text="Programming Languages Regularly Used:\nChina, 2018-2020", 
    color_dict=color_dict, 
#     year=2020,
#     title="Student"
)

**Python has seen a large increase** in regular usage, and **C++, SQL, C and Java** have also seen **small gains** in regular usage in the past few years.

Let's now look at programming language usage by **company size**, specifically answering the question:

### Is there a difference between languages commonly used in small businesses vs. large corporations?

This would be useful information to know if someone has a preference for working in a small start-up or in an established company.

In [None]:
# Only 2 years' data

offset = {
    "Python": (-0.12, 0.04),
    "MATLAB": (-0.07, -0.11),
    "Bash": (-0.07, -0.1),
    "C": (-0.01, 0.03)
}
cols = ["Python", "R", "SQL", "C", "C++", "Java", "Javascript", 
        "Bash", "MATLAB"]
color_dict = dict(zip(language_cols, ["C{}".format(i) for i in range(11)]))
color_dict["MATLAB"] = "C8"
                    
pt = allLangPT(
    df, 
    cols=cols, 
    num_col="num_lang", 
    country="all", 
    show=True, 
    inverse=True, 
    offsets=offset, 
    text_loc=(0.1, 0.03),
    title_text="Programming Languages Regularly Used:\nSmall Businesses, 2019-2020", 
    color_dict=color_dict, 
#     year=2020,
#     title="Student"
    company_size="0-49"
)

For **small businesses** (those with 0 to 49 employees), **Python** and **SQL** are the **most regularly used languages**, and they've also been gaining in usage from 2019 to 2020. **JavaScript, C++, Java**, and **C** have also seen some minor gains, but are still only regularly used by fewer than 20% of the respondents from small companies.

### How does this compare to programming language usage of large corporations?

In [None]:
offset = {
    "Python": (-0.12, 0.04),
    "MATLAB": (-0.07, -0.1),
    "Bash": (0, -0.08)
}
cols = ["Python", "R", "SQL", "C", "C++", "Java", "Javascript", 
        "Bash", "MATLAB"]
color_dict = dict(zip(language_cols, ["C{}".format(i) for i in range(11)]))
color_dict["MATLAB"] = "C8"
                    
pt = allLangPT(
    df, 
    cols=cols, 
    num_col="num_lang", 
    country="all", 
    show=True, 
    inverse=True, 
    offsets=offset, 
    text_loc=(0.1, 0.03),
    title_text="Programming Languages Regularly Used:\nLarge Corporations, 2019-2020", 
    color_dict=color_dict, 
#     year=2020,
#     title="Student"
    company_size="> 10,000",
    patches = [
        mpatches.Rectangle((0.38, 2.95), 0.23, 0.33, linewidth=1, edgecolor='r', facecolor='none'),
    ]   
)

Interestingly, the trend of programming language usage in large corporations (with more than 10,000 employees) is **quite similar to that of small businesses**, the biggest difference being that **SQL is used by 15% more respondents** in large corporations. This makes sense, as larger companies are more likely to work with relational databases and to rely less on spreadsheets than smaller companies do.

### Languages regularly used by "Years of Coding Experience"

Let's switch gears again and look at languages commonly used by **years of programming experience**. This will provide a good warm-up to another visualization type we'll come back to throughout the analysis - the heatmap.


In [None]:
heatmap_pt = heatmap(
    df.loc[df.yrs_coding!="I have never written code", :],
    cols=language_cols, 
    expand_col="yrs_coding", 
    cbar_text="% Usage", 
    title_text="Language Use by Coding Experience", 
    transpose=True, 
    rotation=45, 
    fraction=0.026, 
    cmap="Greens",
    ylabel="Coding Experience"
)

In this heatmap, we can see:
1. **Python is heavily used** by those from all levels of coding experience.
2. **As programmers gain more experience, they tend to use SQL more.** Is this because more experienced programmers tend to work at different kinds of companies than newer programmers do? Or because companies are moving away from SQL-based systems in favor of other options, so that programmers with less experience use SQL less regularly than their more experienced counterparts? These are interesting questions but a little beyond the scope of this analysis, so we'll leave them for another time.
3. **Programmers with more experience also tend to use non-Python languages more**, such as Bash, C, C++, Java and Javascript.
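
For readers curious about the numbers behind a heatmap like this one, here is a minimal sketch of how the "% Usage" values could be computed. It runs on made-up rows and assumes (as the calls above suggest) one True/False column per language plus a `yrs_coding` column. The notebook's own `heatmap` helper isn't shown in this section and may well do things differently, so `usage_by_group` is a hypothetical stand-in.

```python
import pandas as pd

# Toy data (invented for illustration, not survey results):
# one row per respondent, one boolean column per language.
toy = pd.DataFrame({
    "yrs_coding": ["< 1 years", "< 1 years", "5-10 years", "5-10 years"],
    "Python":     [True,  True,  True,  False],
    "SQL":        [False, False, True,  True],
})

def usage_by_group(data, cols, expand_col):
    """Percent of respondents in each group who regularly use each tool."""
    # The mean of a boolean column is the share of True values.
    return data.groupby(expand_col)[cols].mean().mul(100).round(1)

usage_pct = usage_by_group(toy, ["Python", "SQL"], "yrs_coding")
print(usage_pct)
```

Each cell of the resulting table is the share of one experience group that uses one language, which is the quantity the color scale encodes.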

And now onto another question relevant to honing our skills to a career we're interested in:

### How does Programming Language use differ by Job Title?


In [None]:
heatmap_pt = heatmap(
    df,
    cols=language_cols, 
    expand_col="title", 
    cbar_text="% Usage", 
    title_text="Language Use by Job Title", 
    transpose=True, 
    rotation=35, 
    fraction=0.056, 
    cmap="Greens"
)

This chart reveals several things:
- **Python remains a powerhouse** for all professions of the respondents
- **Any job title that includes the word "Data" uses SQL heavily**, as well as Business Analysts and Software Engineers
- **Students use C and C++ the most**
- **Software Engineers use Java and Javascript the most**
- **Statistician** is the only job title that **uses R more frequently than Python**
- **DBAs and DB Engineers live in SQL**, using it more often than Python

### Programming Language Usage by Age

And just for fun (while we're still on the subject of programming languages), how do the languages regularly used change depending on a person's age?

In [None]:
heatmap_pt = heatmap(
    df,
    cols=language_cols, 
    expand_col="age", 
    cbar_text="% Usage", 
    title_text="Language Use by Age", 
    transpose=False, 
    rotation=25, 
    fraction=0.046, 
    cmap="Greens",
    xlabel="Age range"
)

The most noticeable patterns are:
- **Python is used heavily across the board** with a lean toward younger programmers
- **SQL is mostly used** regularly by those in their **mid-career range**, tapering off on either side from there
- **R shows a steady increase** in usage as a programmer's **age increases** (reminder: "correlation, not causation")
- As we saw above, **C and C++ aren't used much by Kagglers after graduation**, though there is a gradual resurgence of C/C++ usage in those respondents toward the latter part of their careers

### C vs. C++: Which does each country prefer?

Speaking of C and C++ (and to take a breather from noting how widespread Python's usage is), let's have **C** and **C++** go head-to-head in a matchup to see which countries regularly use one more than the other.


In [None]:
PT = vsCompare(df, "C", "C++", thresh=75,
               A_loc=(0.2,0.3), B_loc=(0.7,0.9), xSpaceMultiplier=1.25)

Several things are going on in this chart. Let's break it down:
1. For **each country** we see the **total number of respondents who regularly use one language or the other**. This total number is also represented by the width of the lighter-colored bars.
2. **This chart is simply comparing the regular usage of these two languages**, and some respondents do indeed regularly use both.
3. **The solid-colored bars on top of the lighter-colored bars represent the difference of those who regularly use one language more than the other.** For example, in Pakistan 42 people regularly use C, while 88 people regularly use C++. The difference (88-42=46) is filled in with solid red.
4. **The chart is sorted by the percentage of respondents who use one tool** (programming language, in this instance) **more than the other**. This data is shown in the small bars to the right. Going back to our Pakistan example, 46 *more* people regularly use C++ than the 42 people who regularly use C. In other words, 110% *more people* regularly use C++ than C (46/42=1.1).
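
The arithmetic in points 3 and 4 can be written out in a few lines. This is just a sketch of the calculation described above using the Pakistan figures quoted in the text; `percent_more` is a hypothetical helper, not part of the notebook's `vsCompare` function.

```python
def percent_more(count_a, count_b):
    """Return (difference, percent by which the larger count exceeds the smaller)."""
    smaller, larger = sorted((count_a, count_b))
    difference = larger - smaller
    # Assumes the smaller count is non-zero.
    return difference, 100 * difference / smaller

# Pakistan figures quoted in the text: 42 regular C users, 88 regular C++ users.
difference, pct = percent_more(42, 88)
print(difference, round(pct))  # 46 110
```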

We can see that **47% more Kagglers in the U.S. regularly use C++ than C**, while in India C is used slightly more than C++.

#### Removing India's results to see the others more clearly
In this particular graph, only India's and the USA's results are easy to read, because the large number of respondents from India shrinks the other bars. If we were to remove India to get a better feel for the rest of the data, this is what we'd see:

In [None]:
# Removing India to "zoom in" on the rest
PT = vsCompare(df.loc[df.country!="India", :], "C", "C++", thresh=75,
               A_loc=(0.1,0.2), B_loc=(0.7,0.9), xSpaceMultiplier=1.25,
              label_pad=4, text_loc=(0.52, 0.05))

This will conclude our analysis of programming languages, but now that we've introduced the different visualization tools, we'll be able to move through the remaining 4 areas more quickly. And the next area won't be too far removed from the discussion of programming languages - **machine learning frameworks**.


# 2. Machine Learning Frameworks

Let's get a "lay of the land" before we dive into the trends:

In [None]:
summaryBar(
    df=df, 
    y_axis=None,
    cols=ml_cols,
    special_list=["Scikit-learn", "TensorFlow", "Keras"], 
    default_color="tab:grey", 
    special_color="tab:orange",
    label_offset=100, 
    title_text="Machine Learning Frameworks\nRegularly Used, 2020", 
    def_patch=None, 
    spec_patch=None, 
    size=(10,7)
)

**Scikit-learn**, **TensorFlow** and **Keras** were the most regularly used ML frameworks in the 2020 survey with some others also regularly used by about 1,000+ respondents. We'll focus our trends analysis on these more commonly used ML frameworks.

### What are the trends of Machine Learning Framework usage?

In [None]:
offset = {
    "Scikit-learn": (-0.12, 0.04)
}
cols = ['Scikit-learn', 'TensorFlow', 'Keras', 'Xgboost', 'PyTorch',
       'LightGBM', 'Caret', 'CatBoost']
color_dict = dict(zip(cols, ["C{}".format(i) for i in range(11)]))
                    
pt = allLangPT(
    df.loc[df.year.isin([2019,2020]), :], 
    cols=cols, 
    num_col="num_ml", 
    country="all", 
    show=True, 
    inverse=True, 
    offsets=offset, 
    text_loc=(0.1, 0.03),
    title_text="ML Frameworks Regularly Used:\nAll Respondents, 2019-2020", 
    color_dict=color_dict, 
#     year=2020,
#     title="Student"
#     company_size="> 10,000"
)

#### Similar pattern as before
As we saw with programming languages, the ML frameworks that are more regularly used are also used alongside fewer other frameworks. For instance, those who regularly use Scikit-learn typically use about 3 ML frameworks in total, whereas the roughly 10% of respondents (x-axis) who use LightGBM typically use about 5 ML frameworks (y-axis) regularly.
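
To make the reading of the two axes concrete, here is a rough sketch of how one point on such a chart could be derived. It assumes one boolean column per framework and a count column playing the role of `num_ml`. The data is invented, and `chart_point` is a hypothetical helper, since the internals of `allLangPT` aren't shown in this section.

```python
import pandas as pd

# Toy data (invented, not survey results): one boolean column per framework.
toy = pd.DataFrame({
    "Scikit-learn": [True, True, True, False],
    "LightGBM":     [False, False, True, False],
    "Keras":        [False, True, True, False],
})
frameworks = ["Scikit-learn", "LightGBM", "Keras"]
# Number of frameworks each respondent regularly uses (the role `num_ml` plays above).
toy["num_ml"] = toy[frameworks].sum(axis=1)

def chart_point(data, tool, num_col):
    """x: share of all respondents using `tool`; y: mean tool count among its users."""
    users = data[data[tool]]
    return len(users) / len(data), users[num_col].mean()

for tool in frameworks:
    x, y = chart_point(toy, tool, "num_ml")
    print(f"{tool}: x={x:.2f}, y={y:.2f}")
```

In this toy example the most widely used framework has the largest x and the smallest y, which mirrors the pattern described above.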

#### Main Gains from 2019 to 2020
We can also see that **Scikit-learn**, **TensorFlow** and **PyTorch** all made gains in regular usage from 2019 to 2020. There is a reason why we're not looking at the 2018 data, and we'll revisit that in the next section. 

And now on to the trends among Data Scientists.

### What is the trend of Machine Learning Frameworks regularly used by Data Scientists?

In [None]:
offset = {
    "Scikit-learn": (-0.15, 0.1)
}
cols = ['Scikit-learn', 'TensorFlow', 'Keras', 'Xgboost', 'PyTorch',
       'LightGBM', 'Caret']
color_dict = dict(zip(cols, ["C{}".format(i) for i in range(11)]))
                    
pt = allLangPT(
    df.loc[df.year.isin([2019,2020]), :], 
    cols=cols, 
    num_col="num_ml", 
    country="all", 
    show=True, 
    inverse=True, 
    offsets=offset, 
    text_loc=(0.1, 0.03),
    title_text="ML Frameworks Regularly Used:\nData Scientists, 2019-2020", 
    color_dict=color_dict, 
#     year=2020,
    title="Data Scientist"
#     company_size="> 10,000"
)

There is a clear trend from 2019 to 2020 among all Data Scientists of **regularly using more ML Frameworks** (shown by the increases along the y-axis). 

Also, **Scikit-learn remains the most commonly used framework** followed by TensorFlow, Keras and Xgboost. Staying on the subject of ML Frameworks regularly used by Data Scientists:

### How about Data Scientists in the U.S.?

In [None]:
offset = {
    "Scikit-learn": (-0.15, 0.1),
    "Caret": (-0.01, -0.14),
    "TensorFlow": (-0.15, -0.1)
}
cols = ['Scikit-learn', 'TensorFlow', 'Keras', 'Xgboost', 'PyTorch',
       'LightGBM', 'Caret']
color_dict = dict(zip(cols, ["C{}".format(i) for i in range(11)]))
                    
pt = allLangPT(
    df.loc[df.year.isin([2019,2020]), :], 
    cols=cols, 
    num_col="num_ml", 
    country="USA", 
    show=True, 
    inverse=True, 
    offsets=offset, 
    text_loc=(0.1, 0.03),
    title_text="ML Frameworks Regularly Used:\nData Scientists in the USA, 2019-2020", 
    color_dict=color_dict, 
#     year=2020,
    title="Data Scientist"
#     company_size="> 10,000"
)

The ML Frameworks regularly used by Data Scientists in the U.S. are quite representative of the overall Data Scientist population in the survey as we saw above, with some slight variations in some of the Frameworks.

But as I'm looking to transition into a **Data Science job in China...**

### What are the most regularly used Machine Learning Frameworks among Data Scientists in China?

In [None]:
offset = {
    "Scikit-learn": (-0.11, -0.12),
    "PyTorch": (-0.065, 0.1),
    "TensorFlow": (-0.15, 0)
}
cols = ['Scikit-learn', 'TensorFlow', 'Keras', 'Xgboost', 'PyTorch',
       'LightGBM', 'Caret']
color_dict = dict(zip(cols, ["C{}".format(i) for i in range(11)]))
                    
pt = allLangPT(
    df.loc[df.year.isin([2019,2020]), :], 
    cols=cols, 
    num_col="num_ml", 
    country="China", 
    show=True, 
    inverse=True, 
    offsets=offset, 
    text_loc=(0.1, 0.03),
    title_text="ML Frameworks Regularly Used:\nData Scientists in China, 2019-2020", 
    color_dict=color_dict, 
#     year=2020,
    title="Data Scientist",
#     company_size="> 10,000",
    patches = [
        mpatches.Rectangle((0.41, 3.5), 0.17, 0.37, linewidth=1, edgecolor='r', facecolor='none'),
    ]   
)

#### TensorFlow claims the #1 spot in 2020
**TensorFlow is the most commonly used ML Framework among Data Scientists in China!** It's important to note that a sample of 59 data scientists located in China isn't very large for a 2-year span. At the same time, not only are we *not* seeing a "runaway victory" for Scikit-learn as in the other groups we looked at, but we're also seeing Scikit-learn fall in regular usage while TensorFlow, Xgboost and PyTorch rise, so this is a meaningful insight.

#### Scikit-learn usage lower than average
Whereas Scikit-learn is used regularly by 75% of Data Scientists in general, among Data Scientists in China it's only used regularly by 50% of them.
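
Given the small sample, a quick back-of-the-envelope check on that 50% figure may be useful. The sketch below uses a simple normal-approximation interval. It assumes that all 59 respondents mentioned above sit behind the 50% figure and that they can be treated as a simple random sample, neither of which is guaranteed, so take it as a rough gauge only.

```python
import math

def approx_95_interval(p, n):
    """Normal-approximation 95% interval for a proportion p observed on n respondents."""
    margin = 1.96 * math.sqrt(p * (1 - p) / n)
    return p - margin, p + margin

# Figures quoted in the text: about 50% usage, 59 data scientists in China.
low, high = approx_95_interval(0.50, 59)
print(f"{low:.2f} to {high:.2f}")  # 0.37 to 0.63
```

Under those assumptions, even the upper end of the range stays below the 75% seen among Data Scientists in general.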

#### No clear favorite ML Framework
Though TensorFlow is at #1, this chart shows that **there's no clearly preferred ML framework among Data Scientists in China**. It would therefore be prudent for a **data scientist job seeker to gain problem-solving experience in each of the 5 most common frameworks** above, both to be able to talk through their strengths and weaknesses and to be more recruitable to companies that prefer a specific framework over the others.

It will be interesting to see how these trends develop next year.

### What are the most regularly used Machine Learning Frameworks by Company Size?


In [None]:
heatmap_pt = heatmap(
    df,
    cols=ml_cols, 
    expand_col="company_size", 
    cbar_text="% Usage", 
    title_text="Machine Learning Framework Usage\nby Company Size", 
    transpose=True, 
    rotation=25, 
    fraction=0.017, 
    cmap="Oranges",
    ylabel="Employees"
)

A couple of things stand out:
- **Scikit-learn is popular among all company sizes**, though especially popular in **large corporations**
- **Xgboost** also gains popularity in larger corporations

Therefore, if someone really desires to work in a large corporation in the machine learning field, learning **Xgboost** may give their credentials a nice boost.

And double checking our sample sizes for each of these company size categories, we see that none of them is underrepresented, though Kagglers tend to work in small companies:

In [None]:
summaryBar(
    df=df, 
    y_axis="company_size",
    cols=None,
    special_list=["0-49"], 
    default_color="tab:grey", 
    special_color="tab:orange",
    label_offset=100, 
    title_text="Number of Respondents by\nCompany Size, 2020", 
    def_patch=None, 
    spec_patch=None, 
    size=(7,3),
    sort_index=True,
    ascending=True
)

Continuing our quest to understand the tools and tech that are more pertinent to a specific career direction we're interested in, let's next look at:

### What Machine Learning Frameworks are used more regularly by Profession?

In [None]:
heatmap_pt = heatmap(
    df,
    cols=ml_cols, 
    expand_col="title", 
    cbar_text="% Usage", 
    title_text="Machine Learning Framework Usage\nby Job Title", 
    transpose=False, 
    rotation=35, 
    fraction=0.056, 
    cmap="Oranges"
)

Some things we can see from this chart are:
- As expected, **Data Scientists** and **ML Engineers** tend to be the **heavier users of ML techniques**
- Data Scientists are heavy users of the most popular frameworks (Sklearn, TensorFlow, PyTorch, and Keras), but **they are also the ones who tend to employ lesser-used tools** (Caret, CatBoost, LightGBM, Prophet, etc.) to a higher degree than other professions, which mostly stick to the popular frameworks
- **Statisticians tend to use the R packages Caret and Tidymodels** more than the other job titles do, which is unsurprising as Statisticians tend to use R more regularly than Python

And for fun:

### Which Machine Learning Frameworks are used more regularly by different age groups?

In [None]:
heatmap_pt = heatmap(
    df,
    cols=ml_cols, 
    expand_col="age", 
    cbar_text="% Usage", 
    title_text="Machine Learning Framework Usage\nby Age", 
    transpose=False, 
    rotation=35, 
    fraction=0.056, 
    cmap="Oranges",
    xlabel="Age range"
)

While there aren't any drastic differences between the age groups, some frameworks such as **LightGBM and Xgboost are used a bit more by those in their early- to mid-career range**.

To conclude our analysis of ML Frameworks, let's see a head-to-head comparison of:

### Facebook's PyTorch vs. Google's TensorFlow

Among all 2020 survey respondents, which countries regularly use one tool more than the other?

In [None]:
PT = vsCompare(df, "PyTorch", "TensorFlow", thresh=100,
               A_loc=(0.15,0.3), B_loc=(0.7,0.9), xSpaceMultiplier=1.25)

We can see that among all the survey respondents, **all countries but Russia use TensorFlow more regularly than PyTorch** - and most by a significant margin of **50-100%** more regularly.

Now we'll look at the third key area: cloud computing platforms.

# 3. Cloud Computing Platforms

As always, let's start with an overview of the most popular platforms:

In [None]:
summaryBar(
    df=df, 
    y_axis=None,
    cols=cloud_cols,
    special_list=["Amazon Web Services (AWS)", "Microsoft Azure", "Google Cloud Platform (GCP)"], 
    default_color="tab:grey", 
    special_color="tab:blue",
    label_offset=100, 
    title_text="Cloud Computing Platforms\nRegularly Used, 2020", 
    def_patch=None, 
    spec_patch=None, 
    size=(10,6),
    sort_index=False,
    ascending=False
)

**AWS**, **GCP** and **Azure** fill up about 90% of the cloud-computing sky. 

### What is the trend of Cloud Computing Platforms from 2018 to 2020?

In [None]:
offset = {
    "Amazon Web Services (AWS)": (-0.01, 0), 
    "Microsoft Azure": (-0.01, 0),  
    "Google Cloud Platform (GCP)": (-0.01, 0),  
    "IBM Cloud / Red Hat": (-0.01, 0),  
    "Oracle Cloud": (-0.01, 0),  
    "Alibaba Cloud": (-0.01, 0)
}
cols = ["Amazon Web Services (AWS)", "Microsoft Azure", "Google Cloud Platform (GCP)", 
                  "IBM Cloud / Red Hat", "Oracle Cloud", "Alibaba Cloud"]
color_dict = dict(zip(cols, ["C{}".format(i) for i in range(11)]))
                    
pt = allLangPT(
    df, 
    cols=cols, 
    num_col="num_cloud", 
    country="all", 
    show=True, 
    inverse=True, 
    offsets=offset, 
    text_loc=(0.05, 0.03),
    title_text="Cloud Computing Platforms Regularly Used:\nAll Respondents, 2018-2020", 
    color_dict=color_dict, 
#     year=2020,
#     title="Data Scientist"
#     company_size="> 10,000"
)

#### Fewer Platforms commonly used
First, we can see that those who use **AWS**, **GCP** and **Azure** typically only use **that platform plus one other platform** regularly (since the y-axis value is around 2), whereas those who regularly use a less-popular platform regularly use 3+ platforms.

#### 2018 to 2019 Dropoff?
Next, we can see an interesting trend among nearly all cloud computing platforms:
- **From 2018 to 2019, there is a large dropoff** as regular usage vaporizes by about **50%** for each cloud computing platform
- **From 2019 to 2020, we can see an increase on the y-axis** which means that those regular users of that specific platform are in general regularly using more platforms than in 2019

While it appears that companies could be relying less on a specific cloud computing platform, **the big "dropoff" from 2018 to 2019 is likely due to two reasons**:
1. While the **2018 survey** had just **5 cloud computing platforms** (the 6 above minus Oracle Cloud), the **2019** and **2020 surveys** provided expanded options of **10 choices** not counting "None" and "Other";
2. More importantly, **the question itself regarding cloud computing was different**. Whereas the 2018 survey asked about tools used ***in the last 5 years***, from 2019 the question changed to **"*use on a regular basis*"**:
    - *Which of the following cloud computing platforms do you use on a regular basis?* (2019, 2020)
    - *Which of the following cloud computing services have you used at work or school in the last 5 years?* (2018)

Therefore, it's natural that respondents would regularly use fewer tools than the tools they have used at some point in the past 5 years. This explains the dropoff from 2018 to 2019.

Having noted that point, **the ratio among the top 3 platforms remains remarkably similar from 2018-2020** and only further confirms for us the current status of cloud computing among data workers.
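
One way to see why the ratio can stay stable even when every platform's usage drops is to rescale each year's shares so that they sum to 1 within the top 3. The numbers below are invented purely for illustration (they are not survey results), and `relative_share` is a hypothetical helper.

```python
def relative_share(usage):
    """Rescale usage shares so they sum to 1 within the chosen platforms."""
    total = sum(usage.values())
    return {name: round(value / total, 3) for name, value in usage.items()}

# Invented numbers for illustration only (not survey results):
# every platform drops by half between the two question wordings.
used_in_last_5_years = {"AWS": 0.40, "GCP": 0.30, "Azure": 0.20}
used_regularly       = {"AWS": 0.20, "GCP": 0.15, "Azure": 0.10}

print(relative_share(used_in_last_5_years))
print(relative_share(used_regularly))
```

Both calls give the same relative shares, because a uniform drop across platforms leaves the ratio among them unchanged.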

#### This phenomenon also occurs in the ML Frameworks and Big Data questions from 2018.
In section 2 about ML Frameworks, we only plotted 2019-2020 for this reason. But if we were to go back and re-plot the ML Framework trends from 2018 to 2020, we'd see a similar pattern:


In [None]:
offset = {
    "Scikit-learn": (-0.12, 0.04)
}
cols = ['Scikit-learn', 'TensorFlow', 'Keras', 'Xgboost', 'PyTorch',
       'LightGBM', 'Caret', 'CatBoost']
color_dict = dict(zip(cols, ["C{}".format(i) for i in range(11)]))
                    
pt = allLangPT(
    df, 
    cols=cols, 
    num_col="num_ml", 
    country="all", 
    show=True, 
    inverse=True, 
    offsets=offset, 
    text_loc=(0.1, 0.03),
    title_text="ML Frameworks Regularly Used:\nAll Respondents, 2018-2020", 
    color_dict=color_dict, 
#     year=2020,
#     title="Student"
#     company_size="> 10,000"
)

Interestingly, while there is a consistent dropoff from 2018 to 2019, it's not nearly as pronounced as the cloud computing platform dropoff, which was about 50%. This suggests that **the ML frameworks a data worker has used in the past 5 years are largely the same tools that he or she still regularly uses**.

Drifting back to our analysis of Cloud Computing:

### Is there a difference of Cloud Computing Platform usage among companies of different sizes?

In [None]:
heatmap_pt = heatmap(
    df,
    cols=cloud_cols, 
    expand_col="company_size", 
    cbar_text="% Usage", 
    title_text="Cloud Computing Platforms Usage\nby Company Size", 
    transpose=False, 
    rotation=35, 
    fraction=0.056, 
    cmap="Blues",
    xlabel="Employees"
)

We can see:
- **AWS is regularly used** by companies of all sizes
- **GCP is used widely also**, yet more so with smaller to mid-sized companies
- **Azure regular usage increases** as the number of employees increases, yet even in large corporations **AWS** is still more widely used

### What is Cloud Computing Platform usage by Profession?


In [None]:
# Keep only the 16 countries with the most respondents
countries = list(df.country.value_counts().sort_values(ascending=False)[:16].index.values)

heatmap_pt = heatmap(
    df.loc[df.country.isin(countries), :],
    cols=cloud_cols, 
    expand_col="title", 
    cbar_text="% Usage", 
    title_text="Cloud Computing Platforms Usage\nby Profession", 
    transpose=False, 
    rotation=35, 
    fraction=0.0358, 
    cmap="Blues"
)

We can see that **Data Engineers**, **ML Engineers** and **Data Scientists** are the heaviest regular users of cloud computing platforms. 

### How does Cloud Computing Platform usage vary by country?

In [None]:
countries = list(df.country.value_counts().sort_values(ascending=False)[:16].index.values)

heatmap_pt = heatmap(
    df.loc[df.country.isin(countries), :],
    cols=cloud_cols, 
    expand_col="country", 
    cbar_text="% Usage", 
    title_text="Cloud Computing Platforms Usage\nby Country", 
    transpose=False, 
    rotation=35, 
    fraction=0.029, 
    cmap="Blues"
)

First, we can see that **Australia**, **Canada**, the **UK** and the **USA** have a larger percentage of respondents who **regularly use AWS**.

Also, we see that **Japan** has a higher percentage of regular usage of **GCP** among its respondents than that of other countries.

#### Cloud Computing in China
Finally, the most interesting insight to me is **the importance of Tencent Cloud and Alibaba Cloud in China**. Though other platforms are used, the home-grown technology is more regularly used than the other platforms. 

What's more interesting is that this data is from the Kaggle survey, and that those located in China who are part of the Kaggle community will likely be *more exposed to other technologies* as well as *more open to using them* than, say, those in China who aren't part of the Kaggle community. Therefore, if the results of the Chinese respondents in the Kaggle survey show such relatively high usage of Alibaba Cloud and Tencent Cloud, it could be the case that these two players (plus others like Baidu Cloud in the Chinese market) are ***even more regularly used*** than the data above suggest.

#### Other "Local" Cloud Computing Options?
Other countries in this heatmap also seem to lack some heavier color shades, such as **Russia** and **India**. Even though "Other" was selected only 99 times by all countries combined in the 2020 survey, perhaps there are more popular cloud computing platforms in these countries which were not included as selection options but also were not selected as "Other" by respondents. 

Speaking of countries' regular usage of Cloud Computing Platforms, **which countries use AWS more regularly than GCP?**

### Google Cloud Platform vs. Amazon Web Services

In [None]:
PT = vsCompare(
    df, 
    "Google Cloud Platform (GCP)", 
    "Amazon Web Services (AWS)",
    thresh=50, 
    A_loc=(0.7,0.25), 
    B_loc=(0.7,0.7), 
    xSpaceMultiplier=1.35
)


The chart shows that **AWS is more regularly used than GCP in most countries**, but the silver lining for Google is that **China uses GCP more regularly than AWS by a ratio of 2:1**.

And now, the same analysis by Profession: 

### Google Cloud Platform vs. Amazon Web Services by Profession

In [None]:
PT = vsCompare(
    df,
    A = "Google Cloud Platform (GCP)",   
    B = "Amazon Web Services (AWS)", 
    y_axis = "title", 
    color1="tab:red", 
    color2="tab:blue", 
    thresh=40,
    A_loc=(0.7,0.25), 
    B_loc=(0.75,0.75), 
    xSpaceMultiplier=1.25, 
    label_pad=10, 
    size=(11,6)
)

While **Data Scientists and Data Engineers** tend to work more regularly with **AWS**, **Statisticians and Research Scientists** tend to work with **both platforms roughly equally** or lean toward **GCP**.

This concludes our analysis of cloud computing platforms and brings us to our final 2 key areas, the next of which is Big Data Tools.

# 4. Big Data Tools

First, we'll look at a summary of the most regularly used big data tools:

In [None]:
summaryBar(
    df=df, 
    y_axis=None,
    cols=bigdata_cols,
    special_list=["MySQL", "PostgreSQL", "Microsoft SQL Server"], 
    default_color="tab:grey", 
    special_color="tab:red",
    label_offset=100, 
    title_text="Big Data Tools\nRegularly Used, 2020", 
    def_patch=None, 
    spec_patch=None, 
    size=(10,6),
    sort_index=False,
    ascending=False
)

**MySQL**, **PostgreSQL** and **Microsoft SQL Server** are the dominant players in the big data space, but there are several popular alternatives as well.

### Which Professions regularly use which Big Data Tools?

In [None]:
heatmap_pt = heatmap(
    df,
    cols=bigdata_cols, 
    expand_col="title", 
    cbar_text="% Usage", 
    title_text="Big Data Tools Usage\nby Job Title", 
    transpose=False, 
    rotation=35, 
    fraction=0.056, 
    cmap="Reds"
)

Of course, **DBAs, Database Engineers and Data Engineers** regularly use more big data tools than other professions do. 

For **Data Scientists**, the top 3 (**MySQL, PostgreSQL and Microsoft SQL Server**) are the most regularly used tools, and the same holds for Product/Project Managers, Business & Data Analysts and Software Engineers.

In addition, **ML Engineers** and **Software Engineers** tend to use **MongoDB** more regularly than the other non-database-related professions.

### Which Countries use which Big Data Tools?

In [None]:
countries = list(df.country.value_counts().sort_values(ascending=False)[:20].index.values)

heatmap_pt = heatmap(
    df.loc[df.country.isin(countries), :],
    cols=bigdata_cols, 
    expand_col="country", 
    cbar_text="% Usage", 
    title_text="Big Data Tools Usage\nby Country", 
    transpose=False, 
    rotation=35, 
    fraction=0.037, 
    cmap="Reds",
    size=(10,10)
)

In addition to **Mexico's high usage of MySQL** per survey respondent, the most meaningful insight to me is that **Kagglers located in China use MySQL much more regularly than other big data tools**, with MongoDB coming in as the next most often used tool. 

While being able to navigate and use multiple flavors of **SQL** is always useful, a chart like this lets us home in on a specific Big Data Tool that would be more relevant in one country than another.

### Which countries use MySQL more regularly than PostgreSQL?

In [None]:
PT = vsCompare(df, "MySQL", "PostgreSQL", thresh=40, A_loc=(0.2,0.3), B_loc=(0.7,0.9),
              xSpaceMultiplier=2, text_loc=(0.1, 0.05))


**Russia** and **Ukraine** top the list of those countries who use **PostgreSQL more often than MySQL**, while **most other countries use MySQL more** frequently. **India, Nigeria and China** show a rather large portion of data workers who use **MySQL** more regularly than PostgreSQL - especially **China at a 4:1 ratio!**

### How about Microsoft SQL Server vs. PostgreSQL by country?

In [None]:
PT = vsCompare(df, "Microsoft SQL Server", "PostgreSQL", thresh=40, A_loc=(0.2,0.2), B_loc=(0.7,0.9),
              xSpaceMultiplier=1.5)


As with the bar chart at the beginning of this section, we can see that regular usage between **Microsoft SQL Server** and **PostgreSQL** is split quite evenly, though several countries lean one way or another.

Instead of comparing countries, let's compare these two by job titles.

### Microsoft SQL Server vs. PostgreSQL by Profession

In [None]:
PT = vsCompare(
    df,
    A = "Microsoft SQL Server",  
    B = "PostgreSQL", 
    y_axis = "title", 
    color1="tab:red", 
    color2="tab:blue", 
    thresh=40,
    A_loc=(0.17,0.15), 
    B_loc=(0.8,0.8), 
    xSpaceMultiplier=1.2, 
    label_pad=10, 
    size=(12,6)
)

**Data Scientists and ML Engineers tend to use PostgreSQL more regularly than Microsoft SQL Server**, as well as **Software Engineers** and **Research Scientists**. But as we've seen above, the variations in usage between different countries can be quite large, so for those interested in a certain profession in a certain country, this data can be filtered to give a more specific analysis.

And, as a final review of Big Data Tools, we'll look at Big Data Tool regular usage trends.

### Trends in Big Data Tool Usage from 2019-2020

In [None]:
offset = {
    "MySQL": (-0.04, 0.04),
    "PostgreSQL": (-0.015, 0), 
    "SQLite": (-0.015, -0.05),
    "Oracle Database": (-0.015, 0),
    "Microsoft SQL Server": (-0.015, 0),
    "Microsoft Access": (-0.015, 0),
    "Amazon Redshift": (-0.015, 0),
    "Google Cloud BigQuery": (-0.017, 0.07)
}
cols = ["MySQL", "PostgreSQL", "SQLite", "Oracle Database",
                "Microsoft SQL Server", "Microsoft Access", 
                "Amazon Redshift", "Google Cloud BigQuery"]

color_dict = dict(zip(cols, ["C{}".format(i) for i in range(len(cols))]))
                    
pt = allLangPT(
    df.loc[df.year.isin([2019, 2020]), :], 
    cols=cols, 
    num_col="num_bigdata", 
    country="all", 
    show=True, 
    inverse=True, 
    offsets=offset, 
    text_loc=(0.05, 0.03),
    title_text="Big Data Tools Regularly Used:\nAll Respondents, 2019-2020", 
    color_dict=color_dict, 
#     year=2020,
#     title="Data Scientist"
#     company_size="> 10,000"
)

**There seems to be an overall decrease in usage of each particular tool** (decrease in the x-axis) **alongside the use of more big data tools in general** (increase in the y-axis) **from 2019 to 2020.** However, while both the 2019 and 2020 surveys asked about regular usage of big data tools, the 2019 survey split big data products and platforms into two different questions, whereas the 2020 survey combined them into one. This might account for the somewhat consistent trends across the board from 2019 to 2020. Regardless, **MySQL is still the dominant leader in Big Data Tools.**

And now we'll dive into the final of our 5 main areas: Business Intelligence.

# 5. Business Intelligence Tools

Let's start with our summary across the whole 2020 survey:

In [None]:
summaryBar(
    df=df, 
    y_axis=None,
    cols=bi_cols,
    special_list=["Tableau", "Microsoft Power BI"], 
    default_color="tab:grey", 
    special_color="tab:purple",
    label_offset=30, 
    title_text="Business Intelligence Tools\nRegularly Used, 2020", 
    def_patch=None, 
    spec_patch=None, 
    size=(10,6),
    sort_index=False,
    ascending=False
)

As expected, **Tableau and Power BI lead the way** hands down.

Since BI Tools data wasn't included in the 2018-2019 surveys, we can't see any trends over time. But if we plot the 2020 data in the same "trend" chart, we can see that respondents who use one of the top 3 BI Tools typically use only about 2 BI Tools regularly:


In [None]:
# No 2018-2019 data

offset = {
    "Microsoft Power BI": (-0.045, 0.04),
    "Google Data Studio": (-0.017, 0), 
    "Tableau": (-0.033, -0.02),
    "Salesforce": (-0.017, 0),
    "Qlik": (-0.017, 0)
}
cols = ["Microsoft Power BI", "Google Data Studio", "Tableau", 
           "Salesforce", "Qlik"]

color_dict = dict(zip(cols, ["C{}".format(i) for i in range(len(cols))]))
                    
pt = allLangPT(
    df, 
    cols=cols, 
    num_col="num_bi", 
    country="all", 
    show=True, 
    inverse=True, 
    offsets=offset, 
    text_loc=(0.05, 0.03),
    title_text="Business Intelligence Tools Regularly Used:\nAll Respondents, 2020", 
    color_dict=color_dict, 
    year=2020,
#     title="Data Scientist"
#     company_size="> 10,000"
)

### What about BI Tools Regular Usage by Country?


In [None]:
countries = list(df.country.value_counts().sort_values(ascending=False)[:20].index.values)

heatmap_pt = heatmap(
    df.loc[df.country.isin(countries), :],
    cols=bi_cols, 
    expand_col="country", 
    cbar_text="% Usage", 
    title_text="Business Intelligence Tools Usage\nby Country", 
    transpose=False, 
    rotation=35, 
    fraction=0.0305, 
    cmap="Purples",
    size=(10, 10)
)

We can see that:
- **Brazil is a power user of Power BI**, while **Tableau is still preferred by several countries** as well.
- **Google Data Studio is used more regularly by survey respondents in Mexico** compared to its usage in other countries.
- **Qlik has a slightly higher usage in Italy and Spain** compared to its usage in other countries, but of course Italy and Spain both prefer Power BI to Qlik by a lot.
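
The percentages behind a heatmap like this one can be approximated with a grouped mean. Below is a minimal sketch, assuming each BI tool is stored as a 0/1 indicator column. The `toy` DataFrame and its values are made up for illustration (they are not survey data), and the notebook's own `heatmap` helper may compute things differently.

```python
import pandas as pd

# Hypothetical toy data: one row per respondent, 1 = regularly uses the tool.
toy = pd.DataFrame({
    "country": ["Brazil", "Brazil", "Mexico", "Mexico", "Italy", "Italy"],
    "Tableau": [0, 1, 1, 0, 0, 1],
    "Microsoft Power BI": [1, 1, 0, 0, 1, 0],
})

# The mean of a 0/1 column within each group is the share of respondents
# in that group who use the tool; multiply by 100 for a percentage.
usage_pct = toy.groupby("country")[["Tableau", "Microsoft Power BI"]].mean() * 100
print(usage_pct.round(1))
```

Each row of `usage_pct` corresponds to one row of the heatmap, and each column to one tool.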

Let's look at a head-to-head matchup of Tableau vs. Power BI to see in which countries each BI tool has a higher regular usage.

### Tableau vs. Power BI

In [None]:
PT = vsCompare(df, "Microsoft Power BI", "Tableau", A_loc=(0.16, 0.2), B_loc=(0.6, 0.7),
              text_loc=(0.48, 0.05))


While the split among the number of countries is roughly even (this chart only includes countries with at least 40 respondents who regularly use one or both tools), both **India and the U.S.** - with the largest and second largest number of survey respondents - **tend to use Tableau more than Power BI by a large margin**.

**In China, Tableau is slightly more regularly used than Power BI.**

Diving deeper into China's usage of BI Tools:

### In China, what are the commonly used BI Tools by company size?

In [None]:
heatmap_pt = heatmap(
    df.loc[df.country=="China", :],
    cols=bi_cols, 
    expand_col="company_size", 
    cbar_text="% Usage", 
    title_text="Business Intelligence Tools Usage\nin China by Company Size", 
    transpose=True, 
    rotation=35, 
    fraction=0.018, 
    cmap="Purples",
    size=(10, 10)
)

#### In China:
- **Tableau dominates the mid-sized company range** and also has a **high usage among larger corporations**, whereas **Power BI and Google Data Studio** are used a bit more than Tableau by **smaller companies**. 
- **Power BI** is also used more regularly by those who work in **large corporations**, while **Salesforce** also makes an appearance in the mid- to large-size company range.

And finally, let's look at another head-to-head between Power BI and Tableau by job titles.

### Power BI vs. Tableau - by Profession

In [None]:
PT = vsCompare(
    df,
    A = "Microsoft Power BI", 
    B = "Tableau", 
    y_axis = "title", 
    color1="tab:red", 
    color2="tab:blue", 
    thresh=40,
    A_loc=(0.15,0.15), 
    B_loc=(0.8,0.65), 
    xSpaceMultiplier=1.2, 
    label_pad=10, 
    size=(12,6)
)


Among all 2020 survey respondents, **ML Engineers and Data Scientists** tend to use **Tableau more regularly than Power BI**, whereas **Power BI** is used more often by **Business Analysts, Statisticians and DBAs**.
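
For readers who want to reproduce a head-to-head comparison like the two above, here is a rough sketch of the underlying calculation. It assumes 0/1 indicator columns and a minimum group size; the `thresh` idea mirrors the argument passed to `vsCompare`, but the function `head_to_head` and the toy data are hypothetical and are not the notebook's implementation.

```python
import pandas as pd

def head_to_head(data, a, b, group_col, thresh=1):
    """Per group, compute tool `a`'s share of all regular-use mentions of `a` or `b`."""
    # Keep only respondents who regularly use at least one of the two tools.
    users = data.loc[(data[a] == 1) | (data[b] == 1)]
    grouped = users.groupby(group_col)[[a, b]].sum()
    # Count respondents per group so that small groups can be dropped.
    grouped["n"] = users.groupby(group_col).size()
    grouped = grouped.loc[grouped["n"] >= thresh].copy()
    grouped["a_share"] = grouped[a] / (grouped[a] + grouped[b])
    return grouped

# Hypothetical toy data, not survey results.
toy = pd.DataFrame({
    "title": ["Data Scientist"] * 3 + ["Business Analyst"] * 3 + ["DBA"],
    "Microsoft Power BI": [0, 1, 0, 1, 1, 0, 1],
    "Tableau":            [1, 1, 1, 0, 1, 0, 0],
})
result = head_to_head(toy, "Microsoft Power BI", "Tableau", "title", thresh=2)
print(result)
```

In this toy example the "DBA" group is dropped because only one of its respondents uses either tool, which is the same kind of filtering the threshold of 40 performs on the real data.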

# Conclusion

This survey data has provided insights regarding the skills, tools and technologies one might desire in pursuing a specific data-related career direction. As nearly every reader will have his or her own goals and career direction in mind, I approached the analysis from a high level before zooming in a bit, spanning across countries, company sizes, job titles and at times age groups. This hopefully gave each reader some unique takeaways and insights that are more relevant to his or her own career situation.

I also zoomed in on the trends around these areas in China, as I aim to pursue a job in China in 2021.

### Insights & Takeaways

Some of my biggest takeaways regarding this specific goal include:
1. **Python**, **C** and **C++** are making big gains in China, whereas SQL is widely used but its growth is mostly stagnant. And while R is a popular language for data workers in the U.S., it hasn't gained much traction in China.
2. **TensorFlow** is big in China and is growing, especially among Data Scientists.
3. **Alibaba Cloud** and **Tencent Cloud** are the dominant Cloud Computing players in China, so it would be useful to gain familiarity with these tools and even do some projects with them.
4. **MySQL** is the dominant big data tool in China.
5. **Tableau** is heavily preferred by mid-sized and larger companies in China, whereas **Google Data Studio** and **Power BI** are used more by smaller companies.

### Next Steps

These insights have sharpened my awareness of the overall Data Science direction in China, as well as the whole Kaggle community worldwide. They've guided me in some decisions that will shape the trajectory of my data endeavors:
1. Based on the prominence of SQL among data workers, including those in China, I've **moved my PostgreSQL course up** by one semester to start diving deeper into SQL this month.

2. As I learn more SQL, I'll also **keep tabs on MySQL syntax** and become more familiar with that as it's more widely used in China.

3. Also, I had originally wanted to pick up R this year, but after seeing that it's not as widely used in China, and that TensorFlow is even more widely used among Data Scientists than sklearn, I've decided to **prioritize a deep dive into TensorFlow** before embarking on R.

4. Finally, I will explore and do some **projects with Alibaba Cloud and Tencent Cloud** to at least become familiar with these cloud technologies and their similarities and differences with the "big 3", though this will be a priority for later in the year.

I hope you also gained some valuable takeaways to help you move toward your own desired career direction in the data world. 

There's no better time than now to make a resolution and put your insights into action!

***Thank you for reading!***


*****



### Appendix A: My Points of Learning in this Project

In addition to the above insights into the overall data science trends that I got from this project, I also learned a lot during the data cleaning, function writing, and visualization steps. These points of learning include:
1. **Reducing the size of the DF.** I ran out of memory a few times trying to melt all the columns together into long format. Since nearly every response choice has its own column containing just a single unique value, all but a few of the hundreds of columns can be read in as, or converted to, a category data type. This alone reduced the DF size by a factor of 31. Once I had a better idea of the direction I wanted to take the analysis, I further reduced the DF size by reading in only the columns I needed from the 3 surveys.
2. **Animation using FuncAnimation in Matplotlib.** At first I thought about showing the dots moving from 2018 to 2020, and it's not difficult to implement with FuncAnimation, but in the end I decided not to use it. With only 3 data points, I found the trend harder to follow in an animation of three dots than in a static plot of the 3 years with slightly different sizes and alpha values. For a dozen or more data points, perhaps the animation would have been suitable.
3. **Keeping some of the charts 'clutter-free' was a challenge.** Especially the trend charts since there is a lot going on in them. After some trial and error with different alphas and sizes, I came up with the balance between allowing the reader to identify the colors at varying alpha values, while also being able to differentiate the alphas enough to see which one is 2020 and which were the 2018-2019 data points. That was the most difficult part of the trend chart.
4. **I was inspired by others using Plotly and JavaScript.** So much so that I took a few days and learned some HTML/CSS/JavaScript basics. While these are areas I'd like to pursue more later as they could liven up visualizations, I decided to stick with Matplotlib for this project. One concept emphasized during my few days of study, though, was designing webpages for *accessibility*, which led me to change the green circle in the trend charts to make it more distinguishable for some readers.
5. **Writing functions to reuse and declutter code.** At first, though, the functions were too hard-coded and became unwieldy when I wanted to do the same visualization from a slightly different angle on different data. I initially ended up copying the whole functions and modifying their contents slightly for each of the 5 sections, but later went back and refactored each set of copies into one function with more arguments to allow for more flavors of the same visualization. I think there's still room for improvement within these main visualization functions, such as splitting them into smaller sub-functions that focus on just one thing each, but all in all I'm content with at least making these 4 main visualization functions flexible enough to work for any of the 5 categories I was working with.
6. **Y-axis labels in the 'A vs. B' charts.** I really enjoyed making the "Power BI vs. Tableau" visualization of comparing two skills between several groups. Since that visualization packs in so much data, I had to find the balance between making the chart simple and intuitive enough to understand while at the same time not watering it down to where it loses its value. While I'm happy enough with the current result, I thought about where to better put the y-axis labels (i.e. country names) so that the reader could still clearly identify the corresponding bar plots but without cluttering up that part of the chart. That's something I'll be keeping an eye on in the future.
7. **The "trend charts" helped me catch a mistake!** When I plotted the 2018-2020 trends of ML Frameworks and Cloud Computing Platforms, it seemed puzzling that there was quite a consistent dropoff from 2018-2019 among every item. It made me look back through the questions, and at first I thought it was because the cloud computing options in the survey were fewer in 2018, but after seeing the same trend in Big Data Tools (which had *more* options in 2018) I knew that couldn't be it. That led me to see that the question in 2018 asked about tools used in the past 5 years, which accounts for the sudden yet quite consistent dropoff among all skills. (The language question though in 2018 was the same as in 2019 and 2020 which allowed those results to be more meaningfully plotted together.) Goes to show that data visualization can help you catch your mistakes and faulty assumptions!
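
To illustrate the first point above, here is a small sketch of how converting a sparse, single-valued text column to the category data type shrinks its memory footprint. The column name `Q7_Part_1` and the generated values are hypothetical stand-ins rather than the real survey data, and the savings on the real data will differ.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for a survey column: each cell holds either the
# option text or a missing value, so there is one unique non-null value.
rng = np.random.default_rng(0)
values = ["Python" if flag else None for flag in rng.random(10_000) < 0.3]

as_object = pd.DataFrame({"Q7_Part_1": values}, dtype="object")
as_category = as_object.astype("category")

# deep=True counts the Python string objects, not just the pointers to them.
obj_bytes = as_object.memory_usage(deep=True).sum()
cat_bytes = as_category.memory_usage(deep=True).sum()
print(f"object: {obj_bytes:,} bytes, category: {cat_bytes:,} bytes")
```

When loading from disk, `pd.read_csv` also accepts `dtype` and `usecols` arguments, so the conversion and the column selection can both happen at read time instead of afterwards.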

---

### Appendix B: A Shallow Deep-Dive into the Trend Chart

When plotting different skills on the trend chart below, we often saw the skills form a diagonal line from the top left to the bottom right, indicating that more regularly used skills are less frequently used with other skills, and vice versa.

So what would it mean if a language, tool or skill were to appear in one of these "open" areas circled below?

In [None]:
offset = {
    "Javascript": (0.0, 0.03),
    "C++": (0, 0.01),
    "MATLAB": (0, -0.04),
    "Python": (-0.12, 0.06)
}
cols = ["Python", "R", "SQL", "C", "C++", "Java", "Javascript", 
        "Bash", "MATLAB"]
color_dict = dict(zip(language_cols, ["C{}".format(i) for i in range(11)]))
color_dict["MATLAB"] = "C8"
                    
pt = allLangPT(
    df, 
    cols=cols, 
    num_col="num_lang", 
    country="all", 
    show=True, 
    inverse=True, 
    offsets=offset, 
    text_loc=(0.1, 0.03),
    title_text="Programming Languages Regularly Used:\nSoftware Engineers, 2020", 
    color_dict=color_dict, 
    year=2020,
    title="Software Engineer",
    patches = [
        mpatches.Ellipse((0.23, 3.55), 0.3, 0.3, linewidth=2, edgecolor='r', facecolor='none'),
        mpatches.Ellipse((0.55, 4.55), 0.4, 0.6, linewidth=2, edgecolor='b', facecolor='none')
    ]   
)

If a skill were to be plotted in the **red area**, its low x-axis value would mean that it's not a skill regularly used by many people, and the low position on the y-axis would mean that the skill is paired less often with other skills used regularly. But it wouldn't make sense for a skill to stay in this location.
- If the few people (low x-axis value) who did regularly use the skill could really do so without using many other skills regularly (since the skill is really good at solving a particular problem?), then more people would pick up this skill and add it to their arsenal. Not only would this increase the skill's x-axis position, but since it's being "picked up" by those who've already been using other skills, its y-axis position would increase as well! 
- Conversely, if the skill were not actually *that* useful, then those few people who do regularly use the skill would need to learn other skills to solve the problems they need to solve, thus increasing the skill's position on the y-axis.

In either situation, the skill will eventually drift toward that diagonal line of dots in the middle.

And the same logic could apply to the **blue area** to the upper right:
- If the skill were in fact really awesome at solving problems, then its users would need to regularly use fewer other skills. Also, younger learners would target that skill first since it's so useful, which would pull down its y-axis position, again closer to that diagonal line.
- Whereas if the skill were not actually that great, and perhaps was being overtaken by another skill, then fewer people would regularly use it, which would also cause it to gravitate toward the diagonal line.

Therefore, it makes intuitive sense that the data points tend to fall in a diagonal pattern in this way.
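
To make the two axes concrete, here is a minimal sketch of how one point per skill could be computed. It assumes the x-axis is the share of respondents who regularly use a skill and the y-axis is the average number of skills regularly used by those same respondents, which is how the charts are read in this notebook. The toy data is made up, and `allLangPT` itself may be implemented differently.

```python
import pandas as pd

# Hypothetical toy data: 1 = respondent regularly uses the language.
toy = pd.DataFrame({
    "Python": [1, 1, 1, 1, 0],
    "SQL":    [1, 1, 0, 0, 0],
    "Bash":   [1, 0, 0, 0, 0],
})
skills = ["Python", "SQL", "Bash"]

# Number of skills each respondent uses regularly (analogous to num_lang).
toy["num_lang"] = toy[skills].sum(axis=1)

points = {}
for skill in skills:
    users = toy.loc[toy[skill] == 1]
    x = len(users) / len(toy)             # share of all respondents using the skill
    y = float(users["num_lang"].mean())   # avg. number of skills among its users
    points[skill] = (x, y)

print(points)
```

Even in this tiny example, the most widely used skill has the lowest y-value and the least used skill has the highest, which is the diagonal pattern described above.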

---


### Appendix C: Bonus!

#### Cultural Cues - "Do more time-sensitive cultures finish the Kaggle survey faster than more laid back cultures?"

At least, that's the question I wanted to answer. Unfortunately, I couldn't find any data on a kind of "punctuality index" for different countries. At best, I found articles that highlight punctual countries (Germany, Japan, South Korea) as well as countries and regions that don't place as much emphasis on punctuality (Latin America, Middle East, etc.), but no raw data.

So I decided the best "proxy" dataset I could use would be the average temperature, under the (perhaps incorrect) logic that warmer climates will be more laid back while colder climates will be more punctual. The dataset is here on [Wikipedia](https://en.wikipedia.org/wiki/List_of_countries_by_average_yearly_temperature). Admittedly, it's not the best substitute dataset for a cultural dimension such as punctuality. But it'll work.

So our question can be updated to: 

#### "Is there a correlation between a country's average temperature and its median duration of filling in the survey?"

Let's find out.

First, we can see that 2,212 people in the past 3 years took more than 1 day to finish the survey.


In [None]:
# Count respondents whose duration (stored in seconds) exceeds 1 day
((df.duration / 60 / 60 / 24) > 1).sum()

Also, 92 respondents over the past 3 years took more than a week to complete the survey. Since the results within the survey were deemed "usable" by Kaggle and because there are legitimate reasons for taking a long time to fill in the survey (internet connectivity issues, just being busy, etc.), we'll still go ahead and use all the data. 

In [None]:
# Count respondents whose duration (stored in seconds) exceeds 7 days
((df.duration / 60 / 60 / 24) > 7).sum()

But we'll use the median instead of the mean to mitigate the effects of these outliers.

In [None]:
time = df.groupby("country")["duration"].agg("median") / 60  # duration in minutes
time.head()
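
As a quick illustration of why the median is the safer summary here, consider the following sketch. The durations are made-up values in seconds, not survey data.

```python
import pandas as pd

# Hypothetical durations in seconds: four responses of roughly 10 minutes
# and one survey that stayed open for a full week.
durations = pd.Series([540, 600, 630, 660, 7 * 24 * 60 * 60])

print("mean (minutes):  ", durations.mean() / 60)
print("median (minutes):", durations.median() / 60)
```

The single week-long response drags the mean above 2,000 minutes, while the median stays at 10.5 minutes.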

Now, we'll read in the Wikipedia average temperature data.

In [None]:
# Now, read in the Wikipedia data
wikiTemp = pd.read_csv("../input/kaggle-data-science-survey-analysis/wikiTemp.csv", usecols=[1,2])
wikiTemp.columns = ["country", "avgTemp"]
wikiTemp.country = wikiTemp.country.str.strip()
country_dict = {
    "United States": "USA",
    "United Kingdom": "UK",
}
# Map Wikipedia's country names onto the names used in the survey data
wikiTemp.country = wikiTemp.country.replace(country_dict)
wikiTemp.head()

And now we'll merge the Temperature data into the Survey Duration data.

In [None]:
# Merge the Temperature data into the time data which was taken from the main Kaggle DF
timeTemp = pd.merge(time, wikiTemp, how="left", left_index=True, right_on="country")
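# Take an explicit copy so the column assignment below doesn't trigger a SettingWithCopyWarning
timeTemp = timeTemp.loc[timeTemp.avgTemp.notna(), :].copy()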
timeTemp.avgTemp = timeTemp.avgTemp.apply(lambda x: float(x) if "−" not in x else -(float(x.replace("−",""))))
timeTemp.head()
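
The `apply` call above is needed because the negative temperatures use a Unicode minus sign that `float` can't parse. As an alternative sketch (not what this notebook runs), the same conversion can be vectorized, assuming the character is U+2212. The sample strings below are hypothetical.

```python
import pandas as pd

# Hypothetical sample of temperature strings; "\u2212" is the Unicode minus sign.
raw = pd.Series(["25.0", "\u22125.1", "8.55"])

# Swap the Unicode minus for an ASCII hyphen-minus, then convert to float.
clean = raw.str.replace("\u2212", "-", regex=False).astype(float)
print(clean.tolist())
```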

#### Now, it's showtime. 

#### Is there a correlation between the average temperature and the median duration of time in filling in the Kaggle survey?

In [None]:
fig, ax = plt.subplots()
ax.scatter(timeTemp.duration, timeTemp.avgTemp)
ax.set_title("Correlation between Temperature\nand Survey Duration?")
ax.set_ylabel("Avg. Temp (°C)")
ax.set_xlabel("Median Survey Duration (minutes)")
plt.show()

***The answer is a resounding "no"*** - in fact, if anything, there's a tiny negative correlation between the two variables.

In [None]:
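# Restrict to the numeric columns; recent pandas versions raise an error on the text "country" column
timeTemp[["duration", "avgTemp"]].corr()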

Whether or not a cultural aspect such as "punctuality" is involved in the filling in of the survey, ***cultural understanding and awareness*** is nonetheless important in not only our careers but in communication in general, especially given the global growth and the cross-cultural collaboration that we see in the data science community.