# What Beginner Language is Best?

Like many Kagglers, I do software and data science for a living. Tech is always changing, and there are always new languages and platforms out there to learn. I don't have infinite time, so I'm wondering: based on the Kaggle survey data, which tech looks up and coming? Which languages and platforms are worth learning next, whether you're an experienced programmer or new to the field?

While the 2020 data has plenty to dig into, we can also make use of the previous three years of surveys to get a sense of how tech popularity has evolved over time.

### Import and Clean Data

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import datetime as dt
import matplotlib.pyplot as plt
import re

sns.set(style='white', context='notebook', palette='hls')
pd.set_option('display.max_columns', None)

In [None]:
# Prep 2020 Data
df_2020 = pd.read_csv('../input/kaggle-survey-2020/kaggle_survey_2020_responses.csv')
question_labels_2020 = {'Time from Start to Finish (seconds)' : 'duration',
            'Q1': 'age',
            'Q2' : 'gender',
            'Q3' : 'country',
            'Q4' : 'education',
            'Q5' : 'job_title',
            'Q6' : 'coding_experience',
            'Q8' : 'beginner_language_rec',
            'Q11' : 'most_common_compute_platform',
            'Q13' : 'tpu_use',
            'Q15' : 'ml_experience',
            'Q20' : 'company_size',
            'Q21' : 'data_science_team_size',
            'Q22' : 'company_ml_use',
            'Q24' : 'compensation',
            'Q25' : 'compute_spend',
            'Q30' : 'most_common_db',
            'Q32' : 'most_common_bi',
            'Q38' : 'most_common_analyzer'  
            }
df_2020 = df_2020.rename(columns=question_labels_2020)
df_2020.loc[df_2020['gender']=='Man','gender'] = 'Male'
df_2020.loc[df_2020['gender']=='Woman','gender'] = 'Female'
df_2020.loc[df_2020['gender']=='Prefer to self-describe','gender'] = 'GNC'
df_2020.loc[df_2020['gender']=='Nonbinary','gender'] = 'GNC'
df_2020['education'] = df_2020.education.apply(lambda x: x.replace("’", "'") if isinstance(x, str) else x)
df_2020 = df_2020.drop(0, axis='index')

# Prep 2019 Data
df_2019 = pd.read_csv('../input/kaggle-survey-2019/multiple_choice_responses.csv')
questions_2019 = pd.read_csv('../input/kaggle-survey-2019/questions_only.csv')
question_labels_2019 = {'Time from Start to Finish (seconds)' : 'duration',
            'Q1': 'age',
            'Q2' : 'gender',
            'Q3' : 'country',
            'Q4' : 'education',
            'Q5' : 'job_title',
            'Q6' : 'company_size', 
            'Q7' : 'data_science_team_size',
            'Q8' : 'company_ml_use',
            'Q10' : 'compensation',
            'Q11' : 'compute_spend',
            'Q14' : 'most_common_analyzer',
            'Q15' : 'data_coding_experience',
            'Q19' : 'beginner_language_rec',
            'Q22' : 'tpu_use',
            'Q23' : 'ml_experience' 
            }
df_2019 = df_2019.rename(columns=question_labels_2019)
df_2019.loc[df_2019['gender']=='Prefer to self-describe','gender'] = 'GNC'
df_2019['education'] = df_2019.education.apply(lambda x: x.replace("’", "'") if isinstance(x, str) else x)
df_2019 = df_2019.drop(0, axis='index')

# Prep 2018 data
df_2018 = pd.read_csv('../input/kaggle-survey-2018/multipleChoiceResponses.csv')
schema_2018 = pd.read_csv('../input/kaggle-survey-2018/SurveySchema.csv')
question_labels_2018 = {'Time from Start to Finish (seconds)' : 'duration',
            'Q1': 'gender',
            'Q2' : 'age',
            'Q3' : 'country',
            'Q4' : 'education',
            'Q5' : 'undergraduate_major', 
            'Q6' : 'job_title', #'company_size', 
            'Q7' : 'industry', # 'data_science_team_size',
            'Q8' : 'experience_in_current_role', #
            'Q9' : 'compensation',
            'Q10' : 'company_ml_use',#'compute_spend',
            'Q17' : 'most_freq_programming_lang', #'most_common_analyzer',
            'Q18' : 'beginner_language_rec', #'data_coding_experience',
            'Q20' : 'most_used_ml_library',
            'Q22' : 'most_used_viz_library', #'tpu_use',
            'Q23' : 'time_spent_coding', #'ml_experience' 
            'Q24' : 'data_coding_experience',
            'Q25' : 'ml_experience',
            'Q26' : 'consider_yourself_data_scientist',
            'Q32' : 'most_frequent_data_encountered',
            'Q37' : 'most_frequent_online_platform_used',
            'Q40' : 'academic_achievements_vs_independent_projects',
            'Q43' : 'exploring_unfair_bias',
            'Q46' : 'exploring_model_insights',
            'Q48' : 'black_box_ml'            
            }
df_2018 = df_2018.rename(columns=question_labels_2018)
df_2018.loc[df_2018['gender']=='Prefer to self-describe','gender'] = 'GNC'
df_2018.loc[df_2018['age']=='70-79','age'] = '70+'
df_2018.loc[df_2018['age']=='80+','age'] = '70+'
df_2018['education'] = df_2018.education.apply(lambda x: x.replace("’", "'") if isinstance(x, str) else x)
df_2018 = df_2018.drop(0, axis='index')

# Prep 2017 Data
df_2017 = pd.read_csv('../input/kaggle-survey-2017/multipleChoiceResponses.csv', encoding="ISO-8859-1")
question_labels_2017 = {'Time from Start to Finish (seconds)' : 'duration',
            'GenderSelect': 'gender',
            'Age' : 'age',
            'Country' : 'country',
            'FormalEducation' : 'education',
            'MajorSelect' : 'undergraduate_major', 
            'CurrentJobTitleSelect' : 'job_title', #'company_size', 
            'CompensationAmount' : 'compensation',
            'LanguageRecommendationSelect' : 'beginner_language_rec', #'data_coding_experience',
            'Tenure' : 'data_coding_experience',
            'DataScienceIdentitySelect' : 'consider_yourself_data_scientist',        
            }
df_2017 = df_2017.rename(columns=question_labels_2017)
df_2017.loc[df_2017['gender']=='A different identity','gender'] = 'GNC'
df_2017.loc[df_2017['gender']=='Non-binary, genderqueer, or gender non-conforming','gender'] = 'GNC'
df_2017['exact_age'] = df_2017.age
bins = [17, 21, 24, 29, 34, 39, 44, 49, 54, 59, 69, 150]
labels = ["18-21","22-24","25-29","30-34","35-39","40-44","45-49","50-54","55-59","60-69","70+"]
df_2017.age = pd.cut(df_2017.age, bins=bins, labels=labels)
# No row is dropped for 2017: `age` was already numeric for pd.cut above, so row 0 is a real response, not question text

# Append all the datasets together
df_2017['year'] = '2017'
df_2018['year'] = '2018'
df_2019['year'] = '2019'
df_2020['year'] = '2020'
# DataFrame.append was removed in pandas 2.0; pd.concat stacks the frames the same way
df_tot = pd.concat([df_2020, df_2019, df_2018, df_2017])

### Plotting Helper Functions

In [None]:
sns.set(style='white', context='notebook', palette="colorblind")
"""
This function is going to give you a sum of 100% per every bar
The df looks like
index     legend_1     legend_2
x1        % L1 in x1   % L2 in x1 (these sum to 1)
x2        % L1 in x2   % L2 in x2 (these sum to 1)
"""
def group_x_bar_percentages_plot(df, x_axis_var, legend_var, color_order=False):
    x_axis_groups = df.groupby(x_axis_var)[legend_var]
    # One normalized value_counts Series per x-axis group, named after the group
    per_group = [group.value_counts(normalize=True).rename(group_name)
                 for group_name, group in x_axis_groups]
    # Rows are x-axis groups, columns are legend values; each row sums to 1
    percentages_df = pd.concat(per_group, axis=1).transpose()

    # Convert from proportions to percent
    percentages_df = percentages_df.mul(100)
    percentages_df = percentages_df.reindex(sorted(percentages_df.columns), axis=1)


    ax = percentages_df.plot.bar(stacked=True,
                                title = f"Normalized Counts of {x_axis_var.capitalize()}, Broken Down by {legend_var.capitalize()}",
                                xlabel = x_axis_var.capitalize(),
                                ylabel = "Percent of Population")
    plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left')
    plt.show()
    

"""
This function is going to give you a sum of 100% across a legend item
The df looks like
index     legend_1     legend_2
x1        % x1 in L1   % x1 in L2
x2        % x2 in L1   % x2 in L2 
          (sum to 1)   (sum to 1)
"""
def group_legend_percentages_plot(df, x_axis_var, legend_var):
    legend_groups = df.groupby(legend_var)[x_axis_var]
    # One normalized value_counts Series per legend group, named after the group
    per_group = [group.value_counts(normalize=True).rename(group_name)
                 for group_name, group in legend_groups]
    # Rows are x-axis values, columns are legend groups; each column sums to 1
    percentages_df = pd.concat(per_group, axis=1).sort_index()
    
    # Convert from proportions to percent
    percentages_df = percentages_df.mul(100)

    ax = percentages_df.plot.bar(title = f"Normalized Value Counts of {x_axis_var.capitalize()}, Grouped by {legend_var.capitalize()}",
                                xlabel = x_axis_var.capitalize(),
                                ylabel = "Percent of Population")
    plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left')
    plt.show()

def plot_gender_support_by_year(df, col_name, yes_value, inc_nb=True):
    # Work on a copy of the frame that was passed in, rather than the global df_tot
    smaller_df = df[['year','gender',col_name]].copy()
    smaller_df['yes_value_count'] = smaller_df[col_name] == yes_value
    g = smaller_df.groupby(by=['year','gender']).agg({col_name:'count','yes_value_count':'sum'}).reset_index()
    # Share of non-null answers equal to yes_value, as a percent
    g['perc'] = 100 * (g['yes_value_count'] / g[col_name])
    plt.plot('year','perc','o-',data=g[g.gender=='Male'], label='Male')
    plt.plot('year','perc','o-',data=g[g.gender=='Female'], label='Female')
    if inc_nb:
        plt.plot('year','perc','o-',data=g[g.gender=='GNC'], label='Non-binary or Gender Non-Conforming')
    plt.title(f'Percent of {col_name} with value {yes_value}')
    plt.ylabel('Percent of Responses')
    plt.xlabel('Year')
    plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left')
    plt.show()

def plot_support_by_year(df, col_name, yes_values, inc_nb=True):
    for yes_value in yes_values:
        # Build the indicator on a copy so the caller's DataFrame is not modified
        smaller_df = df[['year',col_name]].copy()
        smaller_df['yes_value_count'] = smaller_df[col_name] == yes_value
        g = smaller_df.groupby(by=['year']).agg({col_name:'count','yes_value_count':'sum'}).reset_index()
        g['perc'] = 100 * (g['yes_value_count'] / g[col_name])
        plt.plot('year','perc','-o', data=g, label=yes_value)
    # The title names every plotted value, not just the last one in the loop
    plt.title(f"Percent of {col_name} with value {', '.join(yes_values)}")
    plt.legend()
    plt.ylabel('Percent of Responses')
    plt.xlabel('Year')
    plt.show()

def get_lin_fit(row):
    try:
        y = np.array([row['2017'],row['2018'],row['2019'],row['2020']])
    except KeyError:  # this subset has no responses for one of the survey years
        return np.nan
    x = np.array([2017, 2018, 2019, 2020])
    idx = np.isfinite(y)
    if sum(idx) <= 1:
        return np.nan
    else:
        fit = np.polyfit(x=x[idx], y=y[idx], deg=1)
        return fit[0]
    
def over_year_percents(df):
    over_years_df = df[["year","beginner_language_rec"]].groupby(['year','beginner_language_rec']).size().reset_index().rename(columns={0:"count"})
    over_years_df = over_years_df.pivot(index='beginner_language_rec', columns='year', values='count')
    # Computed from raw counts and attached at the end, so the flag is not
    # normalized or averaged along with the yearly percentages
    enough_responses = over_years_df.max(axis=1) > 10
    over_years_df = over_years_df.div(over_years_df.sum(axis=0), axis=1).multiply(100)
    over_years_df['avg'] = over_years_df.mean(axis=1)
    over_years_df['yearly_delta'] = over_years_df.apply(get_lin_fit, axis=1)
    over_years_df['relative_yearly_delta'] = 100 * over_years_df['yearly_delta'] / over_years_df['avg']
    # Despite the name, True means the language has MORE than 10 responses in at least one year
    over_years_df['small_sample_flag'] = enough_responses
    over_years_df = over_years_df.drop(['Other','None'], errors='ignore')
    return over_years_df

def plot_absolute_change(df, filter_small_samples=True):
    if filter_small_samples:
        df = df[df.small_sample_flag]
    df.yearly_delta.sort_values().dropna().plot.bar()
    plt.title("Average Yearly Change in Percentage from 2017 to 2020")
    plt.xlabel("Language")
    plt.ylabel("Yearly Change in Recommendation %")
    plt.show()
    
def plot_relative_change(df, filter_small_samples=True):
    if filter_small_samples:
        df = df[df.small_sample_flag]
    df.relative_yearly_delta.sort_values().dropna().plot.bar()
    plt.title("Relative Yearly Change in Percentage from 2017 to 2020")
    plt.xlabel("Language")
    plt.ylabel("Yearly Change as % of Average Recommendation Share")
    plt.show()
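
The two grouped-percentage helpers above differ only in which axis they normalize over. As a rough illustration (not the notebook's own implementation), the same two tables can be sketched with `pd.crosstab` on a small made-up frame; the `toy` data below is hypothetical and not taken from the survey.

```python
import pandas as pd

# Hypothetical toy responses, for illustration only
toy = pd.DataFrame({
    "job_title": ["Analyst", "Analyst", "Analyst", "Statistician", "Statistician"],
    "beginner_language_rec": ["Python", "Python", "R", "R", "Python"],
})

# Each job title's row sums to 100% (the layout group_x_bar_percentages_plot stacks)
per_bar = pd.crosstab(toy["job_title"], toy["beginner_language_rec"],
                      normalize="index") * 100

# Each language's column sums to 100% (the layout group_legend_percentages_plot draws)
per_legend = pd.crosstab(toy["job_title"], toy["beginner_language_rec"],
                         normalize="columns") * 100

print(per_bar.round(1))
print(per_legend.round(1))
```

Reading the first table answers "what do analysts recommend?", while the second answers "who recommends R?", so the choice of helper depends on which question a plot is meant to address.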
    

## Short Answer: Use Python!

Let's start with the basics. For beginners out there, you're in luck: the survey asked directly about language recommendations! Everyone was asked for a single beginner language recommendation, and they replied with a resounding "Python!"

In [None]:
df_2020.beginner_language_rec.value_counts(normalize=True).multiply(100).plot(kind='bar')
plt.title('2020 Beginner Language Recommendations')
plt.xlabel('Language Recommended')
plt.ylabel('Percent of Responses')
plt.show()

Damn. It's hard to get 80% of people to agree on anything in 2020, but Python has managed it somehow. The second most popular choice, R, only has 7% of total responses. But has that always been true? Is Python on its way in or on its way out?

We'll look at the three most common languages (Python, R and SQL) to keep the next plot easier to read. 

In [None]:
plot_support_by_year(df=df_tot, col_name='beginner_language_rec', yes_values=['Python','R','SQL'])

From 2017 to 2020, Python has been on the rise while R has slipped in popularity. Python jumped 17 percentage points over the four years, a relative increase of 26%.

It's too bad we don't have data from far enough back to see what things looked like before Python was the clear champion of the data science world. Still, are there any up-and-comers hiding among our less popular language brethren?

Sadly, the language options for this question haven't been fully consistent across the years, so if we do a direct subtraction from 2020 to 2017 we lose many of the potential languages. Instead, we can do a linear fit of the four years for each language that's in at least two of the survey years, and find our average change per year for each language (the slope of the linear fit).
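
As a minimal sketch of that idea, the snippet below fits a line through only the years that have data and returns its slope. The function name `yearly_slope` and the example shares are hypothetical; `np.polyfit` is the same NumPy routine the helper functions rely on.

```python
import numpy as np

def yearly_slope(shares_by_year):
    """Slope (percentage points per year) of a least-squares line through the
    years that have data; NaN if fewer than two years are available."""
    years = np.array(sorted(shares_by_year), dtype=float)
    shares = np.array([shares_by_year[int(y)] for y in years], dtype=float)
    has_data = np.isfinite(shares)
    if has_data.sum() < 2:
        return float("nan")
    slope, _intercept = np.polyfit(years[has_data], shares[has_data], deg=1)
    return slope

# Hypothetical shares (percent of responses); NaN marks a year the option was not offered
steady = {2017: 1.0, 2018: 2.0, 2019: 3.0, 2020: 4.0}
gappy = {2017: float("nan"), 2018: 2.0, 2019: float("nan"), 2020: 3.0}

print(round(yearly_slope(steady), 3))
print(round(yearly_slope(gappy), 3))
```

A language that appears in only two surveys still gets a slope this way, although a line through two points is an exact fit and carries no information about noise.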

In [None]:
overall_yearly_changes = over_year_percents(df_tot)
plot_absolute_change(overall_yearly_changes)

Cool, as expected we get Python in the lead, gaining an average of 5 percentage points of responses each year from 2017 through 2020. Most languages change by much less, in part because they account for a much smaller share of the overall responses. For example, Julia has gained an average of 0.14 percentage points each year, which isn't much next to Python in an absolute sense, but since Julia's average share across the four years is only 0.48%, that's a 28% relative increase in popularity each year! Plotting relative increases rather than absolute ones tells a different story.
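
To make the absolute-versus-relative distinction concrete, here is a small sketch on made-up numbers. `BigLang`, `SmallLang`, and the function name are hypothetical; the relative change is the yearly slope divided by the average share, matching the definition used in this notebook.

```python
# Hypothetical yearly shares (percent of responses) for four consecutive years
shares = {
    "BigLang": [60.0, 65.0, 70.0, 75.0],
    "SmallLang": [0.2, 0.4, 0.6, 0.8],
}

def absolute_and_relative_change(yearly_shares):
    """Return (slope in percentage points per year, slope as % of the average share)."""
    n = len(yearly_shares)
    # Ordinary least-squares slope for evenly spaced years (x = 0, 1, 2, ...)
    mean_x = (n - 1) / 2
    mean_y = sum(yearly_shares) / n
    slope = (sum((i - mean_x) * (y - mean_y) for i, y in enumerate(yearly_shares))
             / sum((i - mean_x) ** 2 for i in range(n)))
    relative = 100 * slope / mean_y
    return slope, relative

for name, values in shares.items():
    slope, relative = absolute_and_relative_change(values)
    print(f"{name}: {slope:.2f} points/year, {relative:.1f}% of its average share")
```

In this toy case the big language gains far more points per year, yet the small one grows much faster relative to its own size, which is the same effect that reorders the languages in the relative plot.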

In [None]:
plot_relative_change(overall_yearly_changes)

Python, the original powerhouse, has been overshadowed! While it's still data science's main go-to, it looks like C, Julia and JavaScript are growing in popularity, at least in that more people are recommending them as the language you should use first.

That being said, we're currently oversimplifying. If someone asked me which data science language to learn first, I'd want to know: What *exactly* are you trying to do? If your goal is to make interactive visualizations, you're going to want a different language than if you only care about statistical testing.

So, how does the beginner language recommendation change by job title in 2020? For visual simplicity, this plot focuses on the three most popular responses (SQL, R and Python).

In [None]:
group_x_bar_percentages_plot(df_2020[df_2020.beginner_language_rec.isin(["Python","SQL","R"])], 
                                                          'job_title',
                                                           'beginner_language_rec',)

Statisticians are the clear outlier here! R may be the right path for you, but only if you're working in statistics. If we limit our dataset to folks who said they're statisticians, how have beginner language preferences changed over time?

In [None]:
plot_support_by_year(df=df_tot[df_tot.job_title=="Statistician"].copy(), col_name='beginner_language_rec', yes_values=['Python','R','SQL'])

Among statisticians, R was actually the most popular recommendation in 2017! We're seeing that preference decay, though, even in stats. Do statisticians have any up-and-coming language preferences that differ from the general population's?

In [None]:
plot_relative_change(over_year_percents(df_tot[df_tot.job_title == 'Statistician']))

Again, we're seeing C as a growing choice. And wow, statisticians sure are running away from SAS!

While job title is an important distinction, looking at other demographics can help us figure out if some languages are friendlier to a specific gender or age group. 

For example: Women and non-binary folks have been traditionally underrepresented in data science fields. Are their language preferences different? Let's start by taking a look at 2020.

In [None]:
group_legend_percentages_plot(df_2020[df_2020.gender.isin(['Male','Female','GNC'])], 'beginner_language_rec', 'gender')

In 2020, preferences seem relatively equal across genders, with a slight male preference for Python and a slight female preference for R. But was that always true? If we look at the rate at which people recommend Python over the years, we can see that the gender difference in Python preference was more acute in 2017 and has been slowly evening out.
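
For readers who want to see the calculation behind that rate, here is a compact sketch on a hypothetical toy frame: among non-null answers, take the mean of a "recommended Python" indicator within each year and gender. The `toy` data is made up, and this is one way to get the numbers, not necessarily how the plotting helper computes them.

```python
import pandas as pd

# Hypothetical toy responses, for illustration only
toy = pd.DataFrame({
    "year": ["2017", "2017", "2017", "2017", "2020", "2020", "2020", "2020"],
    "gender": ["Male", "Male", "Female", "Female"] * 2,
    "beginner_language_rec": ["Python", "Python", "Python", "R",
                              "Python", "Python", "Python", "Python"],
})

# Skipped questions should not count toward the denominator
answered = toy.dropna(subset=["beginner_language_rec"])

# The mean of a boolean indicator is the share of True values in each group
python_share = (
    answered["beginner_language_rec"].eq("Python")
    .groupby([answered["year"], answered["gender"]])
    .mean()
    .mul(100)
    .unstack("gender")
)
print(python_share)
```

Each row of the result is a survey year and each column a gender, so a gap that shrinks down the rows corresponds to lines converging in the plot.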

In [None]:
plot_gender_support_by_year(df=df_tot, col_name='beginner_language_rec', yes_value='Python')

Interestingly, that same equalization is happening with R! It used to be a more popular recommendation from women, and now R is almost the same popularity across genders.

In [None]:
plot_gender_support_by_year(df=df_tot, col_name='beginner_language_rec', yes_value='R')

What about those lesser-known languages? If we limit our sample to men and look at the relative change in popularity by year and language, we can see that C and Julia have been gaining popularity the fastest.

In [None]:
male_yearly_changes = over_year_percents(df_tot[df_tot.gender == 'Male'])
plot_relative_change(male_yearly_changes)

In contrast, JavaScript is gaining popularity the fastest among women respondents! And Bash is losing popularity the fastest, as opposed to SAS among men.

In [None]:
plot_relative_change(over_year_percents(df_tot[df_tot.gender == 'Female']))

Do folks with PhDs know something the rest of us don't? At least judging by which languages are increasing in popularity, their picture looks pretty similar to the rest of the distributions.

In [None]:
plot_relative_change(over_year_percents(df_tot[df_tot.education == 'Doctoral degree']))