# The languages of Data Science

## Table of Contents

- [Introduction](#Introduction)
- [The size of the dataset](#The-size-of-the-dataset)
- [Language usage](#Language-usage)
    - [Secondary language usage](#Secondary-language-usage)
    - [Heatmap of language usage](#Heatmap-of-language-usage)
- [Job distribution across languages](#Job-distribution-across-languages)
- [The activities of a job](#The-activities-of-a-job)
- [The intricacies of machine learning across languages](#The-intricacies-of-machine-learning-across-languages)
    - [Employers' use of machine learning](#Employers'-use-of-machine-learning)
    - [Kagglers own usage of machine learning methods](#Kagglers-own-usage-of-machine-learning-methods)
- [Overall programming experience](#Overall-programming-experience)
- [Language recommendations](#Language-recommendations)
- [Learning platforms](#Learning-platforms)
- [Results](#Results)
- [Conclusion](#Conclusion)

## Introduction

Data science is a complex field to say the least. It takes its roots from statistics, mathematics and computer science; its branches spread towards nearly every field. Whichever way you look at it, data science is a juggernaut indeed.

But everything begins with a base, a foundation, a first step. One of the most important bases of data science is computer programming, that critical skill which allows us to work with massive amounts of data. Programming, in turn, begins with languages: programming languages, that is. There are countless programming languages around; some are ubiquitous, others obsolete. Aspiring data scientists are often daunted by the question of where to begin. A popular language choice is Python, but is it the only one? Is R a better choice? How many languages does a data scientist need anyway?

In this notebook, I will look at the languages that Kagglers use. Though Kagglers are a wide and varied bunch, they represent a vibrant community of current and aspiring data scientists. Taking a look at their language choices would give a good idea of the languages in vogue today. Also, I will be using Python, as *my* language of choice.

## Importing the data

The first thing to do is to import the relevant libraries, and the dataset.

In [None]:
# importing the relevant libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import MinMaxScaler

In [None]:
# importing the kaggle dataset
df = pd.read_csv('../input/kaggle-survey-2021/kaggle_survey_2021_responses.csv', header = 1)

## The size of the dataset

The first thing we do is look at what the dataframe has to present to us.

In [None]:
df.head()

In [None]:
print('The Kaggle 2021 survey dataset has {} rows and {} columns. Each row consists of information from a single Kaggle user.'.format(df.shape[0], df.shape[1]))

We see from the above that there are 25,000+ data entries, a good sample size. We also see that there are 369 columns covering different aspects of the respondents. A cursory look at the dataset shows that the questions cover personal details as well as the respondent's knowledge and level of skill.

## Language usage

Now we take a look at the languages themselves. The first thing we would want to see is which languages are used most often. From a preliminary look at our data, we see that columns 7 to 19 contain this information.

In [None]:
# isolate the required columns into a separate dataset
langpopRaw = df.iloc[:, 7:20]

# get the name of the language from the question 
languageList = [x.split(' ')[-1] for x in list(langpopRaw.columns)]
langpopRaw.columns = languageList

In [None]:
# sum up the number of language users
langpop = langpopRaw.notnull().sum().sort_values(ascending = False).reset_index()
langpop.columns = ['Language', 'No. of users']
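
The counting step above relies on how multi-select answers are stored: each option has its own column, which holds the option text when it was selected and is empty otherwise. A minimal sketch on made-up data may make this clearer; the question text and rows below are hypothetical and are not taken from the survey.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the multi-select layout: one column per option,
# holding the option text when selected and NaN otherwise.
toy = pd.DataFrame({
    'Which languages do you use? - Python': ['Python', 'Python', np.nan, 'Python'],
    'Which languages do you use? - R': [np.nan, 'R', 'R', np.nan],
    'Which languages do you use? - SQL': ['SQL', np.nan, np.nan, np.nan],
})

# Keep only the last space-separated token of each header, as in the cell above.
toy.columns = [name.split(' ')[-1] for name in toy.columns]

# notnull() marks the selected cells; summing the booleans counts the users.
toy_counts = toy.notnull().sum().sort_values(ascending=False).reset_index()
toy_counts.columns = ['Language', 'No. of users']
print(toy_counts)
```

Note that taking the last token only works when the option name is a single word; a header ending in a multi-word option would be shortened to its final word.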

In [None]:
langpop.head(6)

In [None]:
# finally, plot out the popularity of each language on a bar chart
plt.figure(figsize = (15,10))
popbar = plt.bar(langpop['Language'], langpop['No. of users'], color = 'olive', edgecolor = 'black')
plt.xlabel('Language', fontsize = 15)
plt.ylabel('No. of users', fontsize = 15)
plt.bar_label(popbar)
plt.title('Figure 1: Most used programming languages among Kagglers', fontsize = 20)
plt.show()

From the above we can see that Python is, by a large margin, the language most regularly used amongst Kagglers. However, SQL has a greater number of users than R, and C++ has slightly more users than R as well. Java is the fifth most used language, though the chart shows that C has about the same number of users.

### Secondary language usage

But programmers often know and use more than one language. This is true of the given dataset as well, as a large number of respondents have selected multiple languages that they use regularly. We delve further into this to see which languages are used alongside other languages or, in a way, which languages 'go' together.

In [None]:
# we get subsets of the main dataset in order to gather this data
pythondf = df.dropna(subset = [df.columns[7]])
rdf = df.dropna(subset = [df.columns[8]])
sqldf = df.dropna(subset = [df.columns[9]])
cdf = df.dropna(subset = [df.columns[10]])
cplusdf = df.dropna(subset = [df.columns[11]])
javadf = df.dropna(subset = [df.columns[12]])
jsdf = df.dropna(subset = [df.columns[13]])
juliadf = df.dropna(subset = [df.columns[14]])
swiftdf = df.dropna(subset = [df.columns[15]])
bashdf = df.dropna(subset = [df.columns[16]])
mtlbdf = df.dropna(subset = [df.columns[17]])
nonedf = df.dropna(subset = [df.columns[18]])
otherdf = df.dropna(subset = [df.columns[19]])
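
The thirteen subsets above all follow the same pattern, so they could also be collected in a dictionary keyed by column. The sketch below shows the idea on a hypothetical five-column frame; the column names, positions and the `subsets` name are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the survey frame: two leading columns,
# then one column per language option.
toy = pd.DataFrame({
    'Age': ['25-29', '30-34', '18-21', '22-24'],
    'Role': ['Student', 'Data Scientist', 'Student', 'Data Analyst'],
    'Python': ['Python', 'Python', np.nan, 'Python'],
    'R': [np.nan, 'R', 'R', np.nan],
    'SQL': ['SQL', np.nan, np.nan, 'SQL'],
})

# Positions of the language columns (7 to 19 in the real data, 2 to 4 here).
lang_cols = toy.columns[2:5]

# One filtered frame per language: rows where that language was selected.
subsets = {col: toy.dropna(subset=[col]) for col in lang_cols}

for name, frame in subsets.items():
    print(name, len(frame))
```

A dictionary keeps the subsets addressable by name, which makes later per-language steps easy to write as loops.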

In [None]:
pytlang = pythondf.groupby(pythondf.columns[7], as_index = False).count().iloc[:,np.r_[0:1,8:20]].T
pytlang = pytlang.iloc[1:,:].reset_index()
pytlang.rename(columns = {'index': 'Other languages', 0: 'No. of users'}, inplace = True)
pytlang['Other languages'] = [x.split(' ')[-1] for x in pytlang['Other languages']]
pytlang.sort_values(by = 'No. of users', ascending = False, inplace = True, ignore_index = True)
pytlang = pytlang.iloc[:-1,:]

In [None]:
rlang = rdf.groupby(rdf.columns[8], as_index = False).count().iloc[:,np.r_[0:1,8:20]].T
rlang = rlang.iloc[1:,:].reset_index()
rlang.rename(columns = {'index': 'Other languages', 0: 'No. of users'}, inplace = True)
rlang['Other languages'] = [x.split(' ')[-1] for x in rlang['Other languages']]
rlang.sort_values(by = 'No. of users', ascending = False, inplace = True, ignore_index = True)
rlang = rlang.iloc[:-1,:]


In [None]:
sqllang = sqldf.groupby(sqldf.columns[9], as_index = False).count().iloc[:,np.r_[0:1,8:20]].T
sqllang = sqllang.iloc[1:,:].reset_index()
sqllang.rename(columns = {'index': 'Other languages', 0: 'No. of users'}, inplace = True)
sqllang['Other languages'] = [x.split(' ')[-1] for x in sqllang['Other languages']]
sqllang.sort_values(by = 'No. of users', ascending = False, inplace = True, ignore_index = True)
sqllang = sqllang.iloc[:-1,:]

In [None]:
clang = cdf.groupby(cdf.columns[10], as_index = False).count().iloc[:,np.r_[0:1,8:20]].T
clang = clang.iloc[1:,:].reset_index()
clang.rename(columns = {'index': 'Other languages', 0: 'No. of users'}, inplace = True)
clang['Other languages'] = [x.split(' ')[-1] for x in clang['Other languages']]
clang.sort_values(by = 'No. of users', ascending = False, inplace = True, ignore_index = True)
clang = clang.iloc[:-1,:]

In [None]:
cpluslang = cplusdf.groupby(cplusdf.columns[11], as_index = False).count().iloc[:,np.r_[0:1,8:20]].T
cpluslang = cpluslang.iloc[1:,:].reset_index()
cpluslang.rename(columns = {'index': 'Other languages', 0: 'No. of users'}, inplace = True)
cpluslang['Other languages'] = [x.split(' ')[-1] for x in cpluslang['Other languages']]
cpluslang.sort_values(by = 'No. of users', ascending = False, inplace = True, ignore_index = True)
cpluslang = cpluslang.iloc[:-1,:]

In [None]:
javalang = javadf.groupby(javadf.columns[12], as_index = False).count().iloc[:,np.r_[0:1,8:20]].T
javalang = javalang.iloc[1:,:].reset_index()
javalang.rename(columns = {'index': 'Other languages', 0: 'No. of users'}, inplace = True)
javalang['Other languages'] = [x.split(' ')[-1] for x in javalang['Other languages']]
javalang.sort_values(by = 'No. of users', ascending = False, inplace = True, ignore_index = True)
javalang = javalang.iloc[:-1,:]

In [None]:
jslang = jsdf.groupby(jsdf.columns[13], as_index = False).count().iloc[:,np.r_[0:1,8:20]].T
jslang = jslang.iloc[1:,:].reset_index()
jslang.rename(columns = {'index': 'Other languages', 0: 'No. of users'}, inplace = True)
jslang['Other languages'] = [x.split(' ')[-1] for x in jslang['Other languages']]
jslang.sort_values(by = 'No. of users', ascending = False, inplace = True, ignore_index = True)
jslang = jslang.iloc[:-1,:]

In [None]:
julialang = juliadf.groupby(juliadf.columns[14], as_index = False).count().iloc[:,np.r_[0:1,8:20]].T
julialang = julialang.iloc[1:,:].reset_index()
julialang.rename(columns = {'index': 'Other languages', 0: 'No. of users'}, inplace = True)
julialang['Other languages'] = [x.split(' ')[-1] for x in julialang['Other languages']]
julialang.sort_values(by = 'No. of users', ascending = False, inplace = True, ignore_index = True)
julialang = julialang.iloc[:-1,:]

In [None]:
swiftlang = swiftdf.groupby(swiftdf.columns[15], as_index = False).count().iloc[:,np.r_[0:1,8:20]].T
swiftlang = swiftlang.iloc[1:,:].reset_index()
swiftlang.rename(columns = {'index': 'Other languages', 0: 'No. of users'}, inplace = True)
swiftlang['Other languages'] = [x.split(' ')[-1] for x in swiftlang['Other languages']]
swiftlang.sort_values(by = 'No. of users', ascending = False, inplace = True, ignore_index = True)
swiftlang = swiftlang.iloc[:-1,:]

In [None]:
bashlang = bashdf.groupby(bashdf.columns[16], as_index = False).count().iloc[:,np.r_[0:1,8:20]].T
bashlang = bashlang.iloc[1:,:].reset_index()
bashlang.rename(columns = {'index': 'Other languages', 0: 'No. of users'}, inplace = True)
bashlang['Other languages'] = [x.split(' ')[-1] for x in bashlang['Other languages']]
bashlang.sort_values(by = 'No. of users', ascending = False, inplace = True, ignore_index = True)
bashlang = bashlang.iloc[:-1,:]

In [None]:
mtlblang = mtlbdf.groupby(mtlbdf.columns[17], as_index = False).count().iloc[:,np.r_[0:1,8:20]].T
mtlblang = mtlblang.iloc[1:,:].reset_index()
mtlblang.rename(columns = {'index': 'Other languages', 0: 'No. of users'}, inplace = True)
mtlblang['Other languages'] = [x.split(' ')[-1] for x in mtlblang['Other languages']]
mtlblang.sort_values(by = 'No. of users', ascending = False, inplace = True, ignore_index = True)
mtlblang = mtlblang.iloc[:-1,:]

In [None]:
nonelang = nonedf.groupby(nonedf.columns[18], as_index = False).count().iloc[:,np.r_[0:1,8:20]].T
nonelang = nonelang.iloc[1:,:].reset_index()
nonelang.rename(columns = {'index': 'Other languages', 0: 'No. of users'}, inplace = True)
nonelang['Other languages'] = [x.split(' ')[-1] for x in nonelang['Other languages']]
nonelang.sort_values(by = 'No. of users', ascending = False, inplace = True, ignore_index = True)
nonelang = nonelang.iloc[:-1,:]

In [None]:
otherlang = otherdf.groupby(otherdf.columns[19], as_index = False).count().iloc[:,np.r_[0:1,8:20]].T
otherlang = otherlang.iloc[1:,:].reset_index()
otherlang.rename(columns = {'index': 'Other languages', 0: 'No. of users'}, inplace = True)
otherlang['Other languages'] = [x.split(' ')[-1] for x in otherlang['Other languages']]
otherlang.sort_values(by = 'No. of users', ascending = False, inplace = True, ignore_index = True)
otherlang = otherlang.iloc[:-1,:]
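
Since the cells above repeat one recipe per language, the same counts could plausibly be produced by a small helper. The following is only a sketch on made-up data: `co_usage` is a hypothetical name, and unlike the cells above it does not drop the last row of the result.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the language columns.
toy = pd.DataFrame({
    'Python': ['Python', 'Python', np.nan, 'Python', 'Python'],
    'R': [np.nan, 'R', 'R', np.nan, 'R'],
    'SQL': ['SQL', np.nan, 'SQL', 'SQL', 'SQL'],
})

def co_usage(frame, language):
    """Among users of `language`, count how many also selected each other language."""
    users = frame[frame[language].notna()]
    counts = users.drop(columns=language).notna().sum()
    out = counts.sort_values(ascending=False).reset_index()
    out.columns = ['Other languages', 'No. of users']
    return out

python_co = co_usage(toy, 'Python')
print(python_co)
```

Calling the helper once per language column would give one table per language, in the same shape as `pytlang`, `rlang` and the others.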

In [None]:
# first we create the four subplots
fig,((ax1,ax2), (ax3,ax4)) = plt.subplots(2,2, sharex=True, sharey=True, figsize = (18, 18))


# we add data for the python pie chart
ax1.pie(pytlang['No. of users'], labels = pytlang['Other languages'], autopct='%1.1f%%',
        explode = [0,0,0,0,0,0,0,0,0,0.2,0.2],  wedgeprops={'linewidth': 3.0, 'edgecolor': 'white'}, colors = sns.color_palette('colorblind'))
ax1.set_title('Python', fontdict = {'fontsize': 20})


# we add data for the R pie chart
ax2.pie(rlang['No. of users'], labels = rlang['Other languages'], autopct='%1.1f%%',
        explode = [0,0,0,0,0,0,0,0,0,0.2,0.2],  wedgeprops={'linewidth': 3.0, 'edgecolor': 'white'}, colors = sns.color_palette('colorblind'))
ax2.set_title('R Language', fontdict = {'fontsize': 20})


# we add data for the SQL pie chart
ax3.pie(sqllang['No. of users'], labels = sqllang['Other languages'], autopct='%1.1f%%',
        explode = [0,0,0,0,0,0,0,0,0,0.2,0.2],  wedgeprops={'linewidth': 3.0, 'edgecolor': 'white'}, colors = sns.color_palette('colorblind'))
ax3.set_title('SQL', fontdict = {'fontsize': 20})


# we add data for the C++ pie chart
ax4.pie(cpluslang['No. of users'], labels = cpluslang['Other languages'], autopct='%1.1f%%',
        explode = [0,0,0,0,0,0,0,0,0,0.2,0.2],  wedgeprops={'linewidth': 3.0, 'edgecolor': 'white'}, colors = sns.color_palette('colorblind'))
ax4.set_title('C++', fontdict = {'fontsize': 20})
fig.suptitle('Figure 2: 2nd most used languages amongst users of the four most popular languages', fontsize = 22)
plt.show()

Now we check the least popular programming languages: Matlab, Bash, Swift and Julia.

In [None]:
# the second figure showing the four least popular languages
fig,((ax1,ax2), (ax3,ax4)) = plt.subplots(2,2, sharex=True, sharey=True, figsize = (18, 18))


# we add data for the matlab pie chart
ax1.pie(mtlblang['No. of users'], labels = mtlblang['Other languages'], autopct='%1.1f%%',
        explode = [0,0,0,0,0,0,0,0,0,0.2,0.2],  wedgeprops={'linewidth': 3.0, 'edgecolor': 'white'}, colors = sns.color_palette('colorblind'))
ax1.set_title('Matlab', fontdict = {'fontsize': 20})


# we add data for the bash pie chart
ax2.pie(bashlang['No. of users'], labels = bashlang['Other languages'], autopct='%1.1f%%',
        explode = [0,0,0,0,0,0,0,0,0,0.2,0.2],  wedgeprops={'linewidth': 3.0, 'edgecolor': 'white'}, colors = sns.color_palette('colorblind'))
ax2.set_title('Bash', fontdict = {'fontsize': 20})


# we add data for the swift pie chart
ax3.pie(swiftlang['No. of users'], labels = swiftlang['Other languages'], autopct='%1.1f%%',
        explode = [0,0,0,0,0,0,0,0,0,0.2,0.2],  wedgeprops={'linewidth': 3.0, 'edgecolor': 'white'}, colors = sns.color_palette('colorblind'))
ax3.set_title('Swift', fontdict = {'fontsize': 20})


# we add data for the julia pie chart
ax4.pie(julialang['No. of users'], labels = julialang['Other languages'], autopct='%1.1f%%',
        explode = [0,0,0,0,0,0,0,0,0,0.2,0.2],  wedgeprops={'linewidth': 3.0, 'edgecolor': 'white'}, colors = sns.color_palette('colorblind'))
ax4.set_title('Julia', fontdict = {'fontsize': 20})
fig.suptitle('Figure 3: 2nd most used language amongst users of the four least popular languages', fontsize = 22)
plt.show()

For the four most used languages, we see that the trends are generally the same as those found in overall popularity, with Python being the second most used programming language by a large margin. After that, users of R and SQL are more likely to use the complementary language (i.e., users of R are likely to use SQL and vice versa). For users of C++, the second most used language is C, which is understandable considering C++'s relationship to C. However, while SQL is the third most popular language for these users, R's relative unpopularity is surprising: it is in 7th place, with only 5.2% of C++ users also using R.

For the least used languages, Python and SQL still reign supreme as the 2nd and 3rd most used languages respectively, suggesting that most data scientists are likely to be using them regardless of which other language they use. Users of Matlab, Bash and Swift are more likely to use C++ than R, which is relegated to the middle of the pack along with Java and C. Users of Julia show differences in their language preferences; for them R is the 4th most popular language, and they are much more likely to use Bash. Users of Bash, however, are no more likely than others to use Julia.


### Heatmap of language usage

A more comprehensive yet somewhat more complicated way of visualizing language usage is by a heatmap.

In [None]:
htmpdf = pd.DataFrame(columns = languageList, index = languageList)

In [None]:
for lang in languageList:
    htmpdf.at[lang, lang] = int(langpop.loc[langpop['Language'] == lang, 'No. of users'].iloc[0])

In [None]:
htmpdf.head(3)

In [None]:
langtodf = {'Python':pytlang,'R':rlang,'SQL':sqllang,'C':clang,'C++':cpluslang,'Java':javalang,'Javascript':jslang,'Julia':julialang,'Swift':swiftlang,'Bash':bashlang,'MATLAB':mtlblang,'None':nonelang,'Other':otherlang}

In [None]:
for lang in languageList:
    for i in range(11):
        curdf = langtodf[lang]
        htmpdf.at[curdf.at[i, 'Other languages'], lang] = curdf.at[i, 'No. of users']

In [None]:
htmpdf.fillna(0, inplace = True)

In [None]:
htmpdf.head(3)
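
The table built above is essentially a co-occurrence matrix: the diagonal holds the number of users of each language, and every other cell holds the number of respondents who selected both languages. As a cross-check, a matrix of this kind can be sketched with a single matrix product of 0/1 flags. The data below is made up, and the result will not match `htmpdf` exactly because the cells above drop the least used language from each table.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the language columns.
toy = pd.DataFrame({
    'Python': ['Python', 'Python', np.nan, 'Python', 'Python'],
    'R': [np.nan, 'R', 'R', np.nan, 'R'],
    'SQL': ['SQL', np.nan, 'SQL', 'SQL', 'SQL'],
})

# 1 where the language was selected, 0 otherwise.
flags = toy.notna().astype(int)

# (languages x respondents) times (respondents x languages):
# cell [a, b] counts respondents who selected both a and b.
co_matrix = flags.T.dot(flags)
print(co_matrix)
```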

In [None]:
# we use scaling to bring the measurements to a heatmap-compatible scale
scaler = MinMaxScaler()
heatMatrixRaw = scaler.fit_transform(htmpdf)

In [None]:
heatmatrix = pd.DataFrame(heatMatrixRaw)
heatmatrix.columns = htmpdf.columns
heatmatrix.index = htmpdf.index
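
With its default settings, `MinMaxScaler` rescales each column separately to the range 0 to 1, using (x - min) / (max - min) computed over that column. The sketch below reproduces that arithmetic in plain pandas on a made-up 3x3 table of counts, to show what the scaled values mean.

```python
import pandas as pd

# Hypothetical co-usage counts: columns are the language whose users we look at,
# rows are the languages those users also selected.
counts = pd.DataFrame(
    {'Python': [4, 2, 3], 'R': [2, 3, 2], 'SQL': [3, 2, 4]},
    index=['Python', 'R', 'SQL'],
)

# Column-wise min-max scaling, the same formula MinMaxScaler applies by default.
col_min = counts.min(axis=0)
col_max = counts.max(axis=0)
scaled = (counts - col_min) / (col_max - col_min)
print(scaled)
```

Because every column is scaled on its own, the colours in the heatmap compare rows within a column; a light cell in one column and a light cell in another do not necessarily stand for similar raw counts.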

In [None]:
heatmatrix.head(3)

In [None]:
plt.figure(figsize = (12,8))
sns.heatmap(heatmatrix, vmin = 0, vmax = 1, cmap = 'viridis')
plt.title('Figure 4: Heatmap showing language combinations that Kagglers use regularly')
plt.show()

The heatmap tells a more generalized version of the story that the pie charts show separately. The top half of the heatmap is lighter, showing that languages like Python and SQL are commonly used alongside other data science languages. The lighter portion in the right-center of the map shows that languages like C, C++ and Javascript are more commonly used by Swift, Bash and Julia users than R or C. The darker part at the bottom of the map tells us that languages like Bash, MATLAB, etc. are less used; people using 'Other' languages, however, use Javascript and SQL more than others.

## Job distribution across languages

We have seen all the languages used by Kagglers. We have also seen which are more commonly used and which are less common. However, this raises another question: who uses which language to do what? SQL is not typically used for machine learning; Python, for all its versatility, is not the ideal language for designing websites. Of course, in our modern-day world there is hardly a field or industry in which programming does not play some part. For the purposes of this notebook, we shall run more code to see what jobs Kagglers usually do.

In [None]:
# first we get list of the different roles for the users of each language
pytroles = pythondf.groupby(pythondf.iloc[:,5]).count().iloc[:,1].sort_values()
rroles = rdf.groupby(rdf.iloc[:,5]).count().iloc[:,1].sort_values()
sqlroles = sqldf.groupby(sqldf.iloc[:,5]).count().iloc[:,1].sort_values()
croles = cdf.groupby(cdf.iloc[:,5]).count().iloc[:,1].sort_values()
cplusroles = cplusdf.groupby(cplusdf.iloc[:,5]).count().iloc[:,1].sort_values()
javaroles = javadf.groupby(javadf.iloc[:,5]).count().iloc[:,1].sort_values()
jsroles = jsdf.groupby(jsdf.iloc[:,5]).count().iloc[:,1].sort_values()
juliaroles = juliadf.groupby(juliadf.iloc[:,5]).count().iloc[:,1].sort_values()
swiftroles = swiftdf.groupby(swiftdf.iloc[:,5]).count().iloc[:,1].sort_values()
bashroles = bashdf.groupby(bashdf.iloc[:,5]).count().iloc[:,1].sort_values()
mtlbroles = mtlbdf.groupby(mtlbdf.iloc[:,5]).count().iloc[:,1].sort_values()
noneroles = nonedf.groupby(nonedf.iloc[:,5]).count().iloc[:,1].sort_values()
otherroles = otherdf.groupby(otherdf.iloc[:,5]).count().iloc[:,1].sort_values()

In [None]:
# we concatenate the resulting series into one dataframe
roles = pd.concat([pytroles, rroles, sqlroles, croles, cplusroles, javaroles, jsroles, juliaroles, swiftroles, bashroles, mtlbroles, noneroles, otherroles], axis = 1)

# rename the columns by which language we are using
roles.columns = languageList

In [None]:
# we need to rename the index from the question asked to 'None'
roles.index.name = None

In [None]:
# transpose the dataframe and reset the index. This allows us to get all the languages in one column
roles = roles.T.reset_index()

# then rename the first column to 'languages'
roles.rename(columns = {roles.columns[0] : 'Languages'}, inplace = True)

In [None]:
# also, we change all NaNs to 0
roles.replace(np.nan, 0, inplace = True)

In [None]:
# finally, we change all values to percentages for easy visualization
for i in range(13):
    roles.iloc[i,1:] = [(x/np.sum(roles.iloc[i,1:])) * 100 for x in roles.iloc[i,1:]]
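
The same row-wise conversion to percentages can be written without an explicit loop by dividing each row by its own total. This is a sketch on a hypothetical two-row table, not the survey data.

```python
import pandas as pd

# Hypothetical counts of roles per language.
toy_roles = pd.DataFrame({
    'Languages': ['Python', 'R'],
    'Student': [30.0, 10.0],
    'Data Scientist': [50.0, 30.0],
    'Other': [20.0, 10.0],
})

# Every column except the first holds counts.
value_cols = toy_roles.columns[1:]

# Divide each row by its own total (axis=0 aligns the totals with the rows).
row_totals = toy_roles[value_cols].sum(axis=1)
toy_roles[value_cols] = toy_roles[value_cols].div(row_totals, axis=0) * 100
print(toy_roles)
```

After this step every row sums to 100, which is what allows the languages to be compared on one stacked bar chart regardless of how many users each has.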

In [None]:
roles.head(3)

In [None]:
colors = ['#ea0755', '#f717a9', '#ff71ce', '#0b5394', '#3ea0f8', '#01fef6', '#4fa282', '#00ad89' ,'#01e38c' , '#741b47','#6d00ca' , '#c569e7','#f4f796', '#ffea35', '#f87000']

In [None]:
plt.figure(figsize = (22, 10))
b = 0
for val in range(1, 16):
    if val == 1:
        plt.barh(roles.iloc[:, 0], roles.iloc[:, val], data = roles, edgecolor = 'black', color = colors[val - 1])
        b = roles.iloc[:, val]
    else:
        plt.barh(roles.iloc[:, 0], roles.iloc[:, val], data = roles, left = b, edgecolor = 'black', color = colors[val - 1])
        b = b + roles.iloc[:, val]
        
        
plt.ylabel('Programming languages', fontsize = 12)
plt.xlabel('Percentage of Kagglers in role/job position', fontsize = 12)
plt.title('Figure 5: Role/Job position of Kagglers using each programming language', fontsize = 20)
plt.legend(roles.columns[1:], bbox_to_anchor = (0, 1.05, 1, 0), loc = 'lower left', mode = 'expand', ncol = 3, fontsize = 12)
plt.show()
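
The loop above stacks the bars by keeping a running total in `b` and passing it as `left`. The same offsets can be computed in one step with a cumulative sum, as sketched below on made-up percentages; plotting is left out so that the example only shows the arithmetic.

```python
import pandas as pd

# Hypothetical percentage shares: one row per language, one column per role.
shares = pd.DataFrame(
    {'Student': [30.0, 20.0], 'Data Scientist': [50.0, 60.0], 'Other': [20.0, 20.0]},
    index=['Python', 'R'],
)

# The cumulative sum along each row gives the right-hand edge of every segment;
# subtracting the segment's own width gives its left-hand edge.
right_edges = shares.cumsum(axis=1)
left_edges = right_edges - shares
print(left_edges)
```

Each column of `left_edges` corresponds to the value that the running total `b` holds just before the matching role is drawn.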

The bar chart above shows that the largest group of Kagglers are students, followed by data scientists, analysts and software engineers, which are not very unusual choices. However, users of C and C++ have the largest share of students (>40%) while Bash and Swift have the smallest (~15%). Around 25% of Python users and ~20% of R and SQL users are students. A detailed breakdown is as follows:

**Python users**: ~28% are students, ~15% are data scientists, followed by ~8% each of software engineers and data analysts. Occupations such as business analyst, product manager and research scientist make up <10% of Python users’ occupations.

**R users**: 22% are students, 21% data scientists, and 13% data analysts. Only ~3% are software engineers, a remarkably low number compared to Python users. This, coupled with the differences in the numbers of data scientists and analysts between the two languages, suggests that R is possibly ‘the’ language for data science. Other occupations make up <10% of R users.

**SQL users**: the percentages are more evenly spread amongst students, data scientists, data analysts and software engineers (~13-19%).

**C, C++, Java and Javascript users**: A good look at the table tells us that data scientists and data analysts make up <10% of their users; the largest numbers of Kagglers who use them regularly are students (>40% for C and C++, to 26% for Javascript), followed by software engineers (ranging from 11% to 23%). This lends credence to the position that these are more software languages than data ones.

**Julia users**: Students make up 20% of its users. Of the rest, ~18% are data scientists and 16% research scientists, implying that Julia might be used more in research work. Other occupations, including data analysts, software engineers, data engineers, etc., account for only a handful each, with every occupation contributing <10% of the total.

**Swift users**: Software engineers make up the highest percentage of Swift users, at 26%. 14% are students, and ~10% are in ‘other’ industries. Of the named occupations, however, none makes up more than 10% of the total, though data scientists are at the higher end of the scale at 9.9%.

**Bash users**: 19% are data scientists, 16% software engineers, and 13% students. An interesting thing to note is that research scientist and machine learning engineer rank relatively highly (at 9.7% and 9.1% respectively).

**MATLAB users**: After students (33%), research scientists make up the greatest proportion of its users at ~13%. Data scientists make up 11% of its users, while the other named occupations are <10%.

**None**: People who use no programming language regularly are mostly students (18%) or in ‘other’ jobs (27%). Of the rest, 16.9% are unemployed and 10% are data analysts. The other occupations each make up <10% of the total number of users; data scientists and software engineers make up, respectively, 1.5% and 2% of the total.

**Others**: Kagglers who use other languages regularly are mostly either software engineers (23%), students (~12%), in ‘other’ jobs (12%) or data analysts (~10%). 

We can shift the kaleidoscope a little to see which languages are used in which **industry**. 

In [None]:
# we get the industries by different languages
indusdf = df.groupby(df.iloc[:, 115]).count().iloc[:,7:20].reset_index()
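
In the cell above, `groupby(...).count()` counts the non-empty cells of every column within each group, so for the language columns it gives the number of users of each language per industry. A minimal sketch on made-up data follows; the industry labels and rows are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature: one industry column and two language columns.
toy = pd.DataFrame({
    'Industry': ['Academics', 'Finance', 'Academics', np.nan, 'Finance'],
    'Python': ['Python', 'Python', np.nan, 'Python', 'Python'],
    'R': [np.nan, 'R', 'R', np.nan, np.nan],
})

# count() ignores NaN, so each cell is the number of users of that language
# within that industry.
by_industry = toy.groupby('Industry')[['Python', 'R']].count()
print(by_industry)
```

Rows with a missing industry are left out of the groups by default, so respondents who skipped that question do not appear in any bar.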

In [None]:
result = [x.split(' ')[-1] for x in indusdf.columns[1:]]

In [None]:
indusdf.rename(columns=dict(zip(indusdf.columns[1:], result)), inplace=True)

In [None]:
indusdf.head(3)

In [None]:
# convert values to percentages for easier visualization
for val in range(1, 14):
    indusdf.iloc[:, val] = [(x/np.sum(indusdf.iloc[:, val])) * 100 for x in indusdf.iloc[:, val]]

In [None]:
colors2 = ['grey', 'lightcoral', 'maroon', 'red', 'indigo', 'chocolate', 'goldenrod', 'darkorange', 'yellow', 'limegreen', 'lightgreen', 'azure', 'powderblue', 'steelblue' ,'purple', 'darkviolet', 'pink', 'crimson']

In [None]:
# finally, plot the figure
plt.figure(figsize = (22, 12))
c = 0

for val in range(0, 18):
    if val == 0:
        plt.barh(indusdf.columns[1:], indusdf.iloc[val, 1:], edgecolor = 'black', color = colors2[val])
        c = indusdf.iloc[val,1:]
    else:
        plt.barh(indusdf.columns[1:], indusdf.iloc[val,1:], left = c, edgecolor = 'black', color = colors2[val])
        c = c + indusdf.iloc[val, 1:]
        
plt.ylabel('Programming languages', fontsize = 12)
plt.xlabel('Percentage of language users', fontsize = 12)
plt.title('Figure 6: Industries of Kagglers divided by programming language', fontsize = 20)
plt.legend(labels = indusdf.iloc[:,0], bbox_to_anchor = (0, 1.05, 1, 0), loc = 'lower left', mode = 'expand', ncol = 3, fontsize = 12)
plt.show()

Taken in conjunction with the data on jobs, we can assume that a large majority of the users in Academics/Education are students. Languages such as **C** and **C++** and even **MATLAB** are often taught in universities and programming courses, so it is understandable that they would be used regularly by students. **R**, **Java** and **Julia** are the next most popular in Academics, followed by **Python, Javascript and SQL**.  
32% of **Swift users** and 31.5% of **Bash users** are in Computer/Technology, comparable to ~32% of **users of C++**, 35% of **Java users**, and 33% of **Javascript users**. Of course, by the very nature of the field, all languages have a higher percentage of users here; computers and technology are where programming languages are in their natural habitat! We can assume that this is where all the data engineers, database engineers and machine learning engineers are pooled. 
In the next industry, Finance and Accounting, we see that **SQL** has the greatest number of users; between 5% and 10% of users of the other languages are in the Finance/Accounting industries.

So we have seen what the users of different languages do. One thing to note is that usage of a language in a particular field does not imply that the language is being used to perform that job. Many programmers know a multitude of languages; Figure 6 has shown us this as well. What this tells us is the degree of overlap between languages and jobs: while users of R, for instance, may be software engineers, it is less likely that they are actively using R to design any software.
This brings us to our next question: What activities make up an important role in the Kagglers’ work? 

## The activities of a job

The question now becomes: what activities are there that we could look at? Programming encompasses a world of activities, to say nothing of things like meetings and paperwork that are common to all jobs. Our question is mostly concerned with data science; data analysis, infrastructure and experimentation using machine learning are more along the line of what we want to know about. For this we look at **Figure 7**, which shows eight different options that were asked for on the survey.

In [None]:
# first we count the selected activities within each language subset

pytact = pythondf.iloc[:, 119: 127].notna().sum()
ract = rdf.iloc[:, 119: 127].notna().sum()
sqlact = sqldf.iloc[:, 119: 127].notna().sum()
cact = cdf.iloc[:, 119: 127].notna().sum()
cplusact = cplusdf.iloc[:, 119: 127].notna().sum()
javaact = javadf.iloc[:, 119: 127].notna().sum()
jsact = jsdf.iloc[:, 119: 127].notna().sum()
juliaact = juliadf.iloc[:, 119: 127].notna().sum()
swiftact = swiftdf.iloc[:, 119: 127].notna().sum()
bashact = bashdf.iloc[:, 119: 127].notna().sum()
mtlbact = mtlbdf.iloc[:, 119: 127].notna().sum()
noneact = nonedf.iloc[:, 119: 127].notna().sum()
otheract = otherdf.iloc[:, 119: 127].notna().sum()

In [None]:
# we concatenate the series of activities into a single dataframe
activities = pd.concat([pytact,ract,sqlact,cact,cplusact,javaact,jsact,juliaact,swiftact,bashact,mtlbact,noneact,otheract], axis = 1)

In [None]:
# we rename the columns
activities.columns = languageList

In [None]:
# reset index so that the activities are in a separate column
activities = activities.reset_index()

# rename the activities column
activities.rename(columns = {'index': 'Activities'}, inplace = True)

#get the activities
activities['Activities'] = [x.split('-')[-1].strip() for x in activities.iloc[:,0]]

In [None]:
# remove the question part of each column name and put the results in a list
rolesnames = [x.split('-')[-1].strip() for x in activities.columns[1:]]

In [None]:
# rename the rest of the columns by activity
activities.rename(columns=dict(zip(activities.columns[1:], rolesnames)), inplace=True)

In [None]:
# transform the raw numbers into percentages for uniformity
for i in range(1,14):
    activities.iloc[:,i] = [(x /np.sum(activities.iloc[:,i])) * 100 for x in activities.iloc[:,i]]

In [None]:
activities.head(3)

In [None]:
plt.figure(figsize = (22,10))

d = 0

for i in range(8):
    if i == 0:
        plt.bar(activities.columns[1:], activities.iloc[i,1:], edgecolor = 'black')
        d = activities.iloc[i,1:]
    else:
        plt.bar(activities.columns[1:], activities.iloc[i,1:], bottom = d, edgecolor = 'black')
        d = d + activities.iloc[i,1:]
        
plt.xlabel('Programming languages', fontsize = 12)
plt.ylabel('Percentage of language users', fontsize = 12)
plt.title('Figure 7: Activities that make up an important part of users\' role at work', fontsize = 20)
plt.legend(labels = activities.iloc[:,0], bbox_to_anchor = (0, 1.05, 1, 0), loc = 'lower left', mode = 'expand', ncol = 2, fontsize = 12)
plt.show()

We see a fairly even distribution of activities. Regardless of language, most Kagglers spend their time analysing and understanding data; understandably, users of **R** and **SQL**, and to a lesser extent **Python**, are in greater numbers here. Those Kagglers who do not regularly use any language are actually the likeliest to be performing analyses, possibly by means of programs such as Tableau. Users of other languages spend more time on machine learning activities: building prototypes to explore machine learning in new areas, running machine learning services, or experimenting with and researching machine learning models.
Overall, **Julia users** spend the greatest amount of time on data science activities; only ~2% have activities that have nothing to do with the given options, and 0.8% have ‘other’ activities. In contrast, users who do not regularly use any programming language have the least to do with machine learning activities, which is again understandable. 

We are starting, by now, to see something of the individual character of the different languages. Next, we look more closely at the machine learning aspect of the data, to get a better idea of how involved each language's users are in it.
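
The percentage step used for Figure 7 (each language column rescaled so that it sums to 100) can also be written without an explicit loop. The following is only a sketch under assumptions: `to_column_percentages` is a hypothetical helper, and the counts in `toy_counts` are made up for illustration, not taken from the survey.

```python
import pandas as pd


def to_column_percentages(counts: pd.DataFrame, label_col: str) -> pd.DataFrame:
    """Rescale every column except `label_col` so that it sums to 100."""
    out = counts.copy()
    value_cols = [c for c in out.columns if c != label_col]
    # Divide each column by its own total, then convert to a percentage.
    out[value_cols] = out[value_cols].div(out[value_cols].sum(axis=0), axis=1) * 100
    return out


# Made-up counts, for illustration only (not survey results).
toy_counts = pd.DataFrame({
    'Activity': ['Analyze data', 'Build prototypes'],
    'Python': [30, 10],
    'R': [1, 3],
})
toy_pct = to_column_percentages(toy_counts, 'Activity')
print(toy_pct)
```

Because every column is divided by its own total, languages with very different numbers of users become directly comparable in the stacked chart.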

## The intricacies of machine learning across languages

Here we divide the data by two metrics: **Figure 8** shows how much the Kagglers’ current employers employ machine learning in their business, while **Figure 9** shows how many years the users themselves have been using machine learning methods. 
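
Both figures are stacked horizontal bar charts, in which each segment has to start where the previous segments end. As a rough sketch, those starting positions are a cumulative sum shifted down by one row. The helper name `stacked_offsets` and the numbers in `toy_values` are assumptions made for illustration, not part of the survey data.

```python
import pandas as pd


def stacked_offsets(values: pd.DataFrame) -> pd.DataFrame:
    """For each segment (row), return where its bar should start per column."""
    # The running total up to and including each row, moved down one row,
    # so that the first segment starts at 0.
    return values.cumsum(axis=0).shift(1, fill_value=0)


# Made-up percentages: rows are answer categories, columns are languages.
toy_values = pd.DataFrame(
    {'LangA': [20, 30, 50], 'LangB': [10, 40, 50]},
    index=['No', 'Exploring', 'Established'],
)
toy_offsets = stacked_offsets(toy_values)
print(toy_offsets)
```

Row `i` of the result could then be passed as the `left` argument of `plt.barh` (or `bottom` for `plt.bar`) when drawing segment `i`.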

### Employers' use of machine learning

In [None]:
mldf = df.groupby(df.iloc[:, 118]).count().iloc[:,7:20].reset_index()
colnames = [x.split(' ')[-1] for x in mldf.columns[1:]]
mldf.rename(columns=dict(zip(mldf.columns[1:], colnames)), inplace=True)

In [None]:
for val in range(1, 14):
    mldf.iloc[:, val] = [(x/np.sum(mldf.iloc[:, val])) * 100 for x in mldf.iloc[:, val]]

In [None]:
mldf = mldf.reindex([0,1,2,5,4,3])

In [None]:
mldf.head(3)

In [None]:
plt.figure(figsize = (22, 12))
c = 0

for val in range(0, 6):
    if val == 0:
        plt.barh(mldf.columns[1:], mldf.iloc[val, 1:], edgecolor = 'black')
        c = mldf.iloc[val,1:]
    else:
        plt.barh(mldf.columns[1:], mldf.iloc[val,1:], left = c, edgecolor = 'black')
        c = c + mldf.iloc[val, 1:]
        
plt.ylabel('Programming languages', fontsize = 12)
plt.xlabel('Percentage of language users', fontsize = 12)
plt.title('Figure 8: Does your current employer incorporate machine learning methods into their business?', fontsize = 20)
plt.legend(labels = mldf.iloc[:,0], bbox_to_anchor = (0, 1.05, 1, 0), loc = 'lower left', mode = 'expand', ncol = 2, fontsize = 12)
plt.show()

### Kagglers own usage of machine learning methods

In [None]:
usrmldf = df.groupby(df.iloc[:, 71]).count().iloc[:,7:20].reset_index()
loc2colnames = [x.split(' ')[-1] for x in usrmldf.columns[1:]]
usrmldf.rename(columns=dict(zip(usrmldf.columns[1:], loc2colnames)), inplace=True)

In [None]:
for val in range(1, 14):
    usrmldf.iloc[:, val] = [(x/np.sum(usrmldf.iloc[:, val])) * 100 for x in usrmldf.iloc[:, val]]

In [None]:
usrmldf = usrmldf.reindex([7,8,0,2,4,5,6,1,3])

In [None]:
usrmldf.head(3)

In [None]:
plt.figure(figsize = (22, 12))
c = 0

for val in range(0, 9):
    if val == 0:
        plt.barh(usrmldf.columns[1:], usrmldf.iloc[val, 1:], edgecolor = 'black')
        c = usrmldf.iloc[val,1:]
    else:
        plt.barh(usrmldf.columns[1:], usrmldf.iloc[val,1:], left = c, edgecolor = 'black')
        c = c + usrmldf.iloc[val, 1:]
        
plt.ylabel('Programming languages', fontsize = 12)
plt.xlabel('Percentage of language users', fontsize = 12)
plt.title('Figure 9: {}'.format(usrmldf.columns[0]), fontsize = 20)
plt.legend(labels = usrmldf.iloc[:,0], bbox_to_anchor = (0, 1.05, 1, 0), loc = 'lower left', mode = 'expand', ncol = 2, fontsize = 12)
plt.show()

According to **Figure 8**, a little under 40% (exact percentages range from 32% to 36%) of **Python, SQL, R, C, C++, Java, Javascript and MATLAB** users either do not know whether machine learning methods are employed in their place of work, or work where they are not; taken in conjunction with **Figure 7** above, we can assume that these are either those working in data analysis, or the very few whose activities are not represented in the data. Users of **Julia** and **Bash** are the least likely to fall into this group, with only ~20% of **Julia users** and ~21% of **Bash users** saying that they do not know or use them. More than 60% of Kagglers who do not use any language do not know of or use any machine learning methods, compared to 40% of those who use other languages.

Around a quarter (~25%) of **C, C++, Java, Javascript and MATLAB** users work in businesses which are exploring machine learning methods; **22% of Python users, 21% of R users, 20% of SQL users and 19.5% of Julia users** work in similar businesses. Percentages are similar across the board for businesses that have recently started using machine learning methods; in contrast, 26% of **Julia users** and 28% of **Bash users** work in businesses that have already established machine learning methods.

This trend is reflected in the users’ own machine learning experience as well. Going by **Figure 9** and the associated data, we can see that only around 5% of **Julia** users do not use ML methods, and 19% have used them for less than a year; taken together, only \~23% of Julia users are ‘new’ to machine learning. A greater share of **Bash** users (~9%) do not use ML, a little less than the 10.9% of **MATLAB**, 12.8% of **Python**, 11.9% of **R**, and 12.3% of **Swift** users. Users of **C, C++, Java and Javascript** who do not use ML all hover around ~16%. Meanwhile, 41.5% of **C users**, ~39% of **Python, C++ and Java users**, ~36% of **SQL and Javascript users** and 31% of **R users** have been using machine learning methods for less than a year.

After this, the percentages spread out somewhat evenly again across the different languages; beyond the one-year mark, **C, C++, Java, Javascript** and, surprisingly, **Python** users are in the 40-50% range, **R and MATLAB** users are in the 50-60% range, and 76.5% of **Julia** users have been using machine learning methods for more than a year.

In short, it seems that Kagglers who use **Julia**, and to a lesser extent **MATLAB** and **Bash**, are the ones with the greatest experience in using machine learning methods.
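
The ‘new to machine learning’ share discussed above is simply the sum of two rows of the percentage table: those who do not use ML and those who have used it for less than a year. A minimal sketch of that arithmetic follows; `combined_share` is a hypothetical helper, and the labels and numbers in `toy_ml` are invented for illustration and do not reproduce the survey figures.

```python
import pandas as pd


def combined_share(pct: pd.DataFrame, label_col: str, labels: list) -> pd.Series:
    """Add up the percentage rows whose label is in `labels`, per language."""
    rows = pct[pct[label_col].isin(labels)]
    return rows.drop(columns=label_col).sum(axis=0)


# Invented percentages; each language column sums to 100.
toy_ml = pd.DataFrame({
    'ML experience': ['Do not use ML', 'Under 1 year', 'Over 1 year'],
    'LangA': [10.0, 30.0, 60.0],
    'LangB': [5.0, 20.0, 75.0],
})
toy_new_to_ml = combined_share(toy_ml, 'ML experience', ['Do not use ML', 'Under 1 year'])
print(toy_new_to_ml)
```

Subtracting the result from 100 gives the share of users past the one-year mark, which is the figure compared across languages in the text.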

## Overall programming experience

But machine learning experience isn’t the only kind of experience a data scientist needs; first, they must have experience in coding. **Figure 10** shows us the years of programming experience that the users of different languages have.

In [None]:
# for each, we first isolate the column which gives duration of years working, and group the years
# we rename the resulting series by language name

pytdur = pythondf.groupby(pythondf.iloc[:,6]).count().iloc[:,0].reset_index().iloc[:,1]
pytdur.name = 'Python'
rdur = rdf.groupby(rdf.iloc[:,6]).count().iloc[:,0].reset_index().iloc[:,1]
rdur.name = 'R'
sqldur = sqldf.groupby(sqldf.iloc[:,6]).count().iloc[:,0].reset_index().iloc[:,1]
sqldur.name = 'SQL'
cdur = cdf.groupby(cdf.iloc[:,6]).count().iloc[:,0].reset_index().iloc[:,1]
cdur.name = 'C'
cplusdur = cplusdf.groupby(cplusdf.iloc[:,6]).count().iloc[:,0].reset_index().iloc[:,1]
cplusdur.name = 'C++'
javadur = javadf.groupby(javadf.iloc[:,6]).count().iloc[:,0].reset_index().iloc[:,1]
javadur.name = 'Java'
jsdur = jsdf.groupby(jsdf.iloc[:,6]).count().iloc[:,0].reset_index().iloc[:,1]
jsdur.name = 'Javascript'
juliadur = juliadf.groupby(juliadf.iloc[:,6]).count().iloc[:,0].reset_index().iloc[:,1]
juliadur.name = 'Julia'
swiftdur = swiftdf.groupby(swiftdf.iloc[:,6]).count().iloc[:,0].reset_index().iloc[:,1]
swiftdur.name = 'Swift'
bashdur = bashdf.groupby(bashdf.iloc[:,6]).count().iloc[:,0].reset_index().iloc[:,1]
bashdur.name = 'Bash'
mtlbdur = mtlbdf.groupby(mtlbdf.iloc[:,6]).count().iloc[:,0].reset_index().iloc[:,1]
mtlbdur.name = 'Matlab'
nonedur = nonedf.groupby(nonedf.iloc[:,6]).count().iloc[:,0].reset_index().iloc[:,1]
nonedur.name = 'None'
otherdur = otherdf.groupby(otherdf.iloc[:,6]).count().iloc[:,0].reset_index().iloc[:,1]
otherdur.name = 'Other'

# finally, we get the groups of years working
durations = otherdf.groupby(otherdf.iloc[:,6]).count().iloc[:,0].reset_index().iloc[:,0]
durations.name = 'Years spent writing code and/or programming'

In [None]:
# we concatenate the years worked for each language into one dataframe
yrsworked = pd.concat([durations, pytdur, rdur, sqldur, cdur, cplusdur, javadur, jsdur, juliadur, swiftdur, bashdur, mtlbdur, nonedur, otherdur], axis = 1)

In [None]:
# transpose and transform the dataframe so that rows are languages and columns are the experience brackets
yrsworked = yrsworked.T.reset_index()
yrsworked.columns = yrsworked.iloc[0,:]
yrsworked = yrsworked.iloc[1:,:]

In [None]:
# we get the headers to rearrange the columns
headers = list(yrsworked.columns)

In [None]:
yrsworked = yrsworked[[headers[0],headers[6],headers[1],headers[4],headers[5],headers[2],headers[3]]]

In [None]:
# now for each value we take its percentage so that the values are normalized
for i in range(0, 13):
    total = np.sum(yrsworked.iloc[i,1:])
    yrsworked.iloc[i,1:] = [(x/total) * 100 for x in yrsworked.iloc[i,1:]]

In [None]:
yrsworked.head(3)

In [None]:
# we get the experience brackets for the legend of the bar chart
labels = list(yrsworked.columns[1:])
colors = ['#fe2c49', '#ff7d02', '#fad819','skyblue','coral', 'tan']

In [None]:
# finally we construct the bar chart
plt.figure(figsize = (12,6))
a = 0

for val in range(1,7):
    if val == 1:
        plt.bar(yrsworked.iloc[:,0], yrsworked.iloc[:,val], data = yrsworked, color = colors[val - 1])
        a = yrsworked.iloc[:,val]
    else:
        plt.bar(yrsworked.iloc[:,0], yrsworked.iloc[:,val], data = yrsworked, bottom = a, color = colors[val - 1])
        a = a + yrsworked.iloc[:,val]
        
plt.yticks(np.arange(0, 120, 10))
plt.xlabel('Programming languages', fontsize = 12)
plt.ylabel('Percentage of Kagglers in year bracket', fontsize = 12)
plt.title('Figure 10: Programming experience by programming language', fontsize = 20)
plt.legend(labels, bbox_to_anchor = (1, 1), loc = 'upper left')
plt.show()

Right away we can see that as we move across the figure, from ‘popular’ programming languages like **Python, R and SQL**, through languages geared towards software engineering like **C, C++, Java and Javascript**, to the harder, less popular languages like **Julia and Bash**, the users’ programming experience increases, with ~20% of **Julia and Swift users** having more than twenty years’ experience. This trend is broken by **MATLAB** which, in line with its popularity amongst students, has a higher percentage of users with fewer years of experience. It is understandable that Kagglers who do not regularly use any programming language have little programming experience.
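
The table behind Figure 10 (one row per language, one column per experience bracket, each row summing to 100) can be sketched more compactly by looping over the per-language frames. Everything named here is an assumption for illustration: `experience_percentages` is a hypothetical helper, the `'years'` column and bracket labels are placeholders, and the toy frames are made up. Unlike a concatenation by row position, this version aligns brackets by label, so a language with no respondents in some bracket gets a 0 rather than a shifted row.

```python
import pandas as pd


def experience_percentages(frames: dict, column: str, order: list) -> pd.DataFrame:
    """Percentage of each language's users per experience bracket."""
    counts = pd.DataFrame({
        # Count answers per bracket; brackets with no answers become 0.
        name: frame[column].value_counts().reindex(order, fill_value=0)
        for name, frame in frames.items()
    })
    # Normalise each language to 100, then put languages on the rows.
    return (counts.div(counts.sum(axis=0), axis=1) * 100).T


# Made-up respondents, for illustration only.
toy_frames = {
    'LangA': pd.DataFrame({'years': ['< 1 years', '< 1 years', '1-3 years', '20+ years']}),
    'LangB': pd.DataFrame({'years': ['20+ years', '1-3 years']}),
}
toy_order = ['< 1 years', '1-3 years', '20+ years']
toy_experience = experience_percentages(toy_frames, 'years', toy_order)
print(toy_experience)
```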

## Language recommendations

Of course, experience depends on which language one begins one’s programming journey with, and also on which platform is used for the learning. An aspiring data scientist's first programming language usually depends on many things: what interested them in data science, their previous education/career, and where they began to learn. While the question of which language a Kaggler learned first wasn’t present in the survey (perhaps rightly, since it's likely that some might not remember the answer), the question of which one they would recommend to an aspiring data scientist was asked.

In [None]:
langrec = (
    df
    .groupby(df.iloc[:,20])
    .agg(Total = ('Duration (in seconds)', 'count'))
    .sort_values(by = 'Total', ascending = False)
)
langrec.reset_index(inplace = True)
langrec.rename(columns = {langrec.columns[0]: 'Recommended Lang.'}, inplace = True)

In [None]:
langrec['Percentage'] = [(x/np.sum(langrec['Total'])) * 100 for x in langrec['Total']]

In [None]:
langrec.head()

In [None]:
plt.figure(figsize = (10, 10))
plt.pie(langrec.iloc[:,2], labels = langrec.iloc[:,0], explode = [0,0,0,0,0,0.3,0.3,0.3,0.3,0.3,0.3,0.5,0.5], autopct = '%1.1f%%')
plt.title('Figure 11: Language recommendations for aspiring data scientists', fontsize = 20)
plt.show()

Out of the languages most commonly used, **Python** is overwhelmingly the most recommended for newbies as well. However, while **SQL** is by far the second most common language used by Kagglers, in recommendations it is behind **R**, though only by 0.4%. And while **C++** is the third most common language used by Kagglers, in recommendations it is in fifth place.

The other languages are recommended only sparingly. This can be due to the steepness of their learning curves as well as the experience required of their users; as seen in Figure 10, Kagglers who normally use Bash, Swift and Julia have been programming for longer periods of time. It could be that the jobs which use these languages require more years of experience.

However, it could be due to a vicious circle as well: because of the ease and versatility of languages like Python and SQL, languages like Julia and Bash aren't learned by newcomers. These languages are then left to older, more experienced programmers, which produces the pattern seen in the survey.
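
For reference, the recommendation shares shown in Figure 11 boil down to counting each answer and dividing by the number of respondents who answered. The sketch below uses `Series.value_counts`, which skips missing answers and sorts in descending order by default; `recommendation_shares` is a hypothetical helper, and the answers in `toy_answers` are invented rather than taken from the survey.

```python
import pandas as pd


def recommendation_shares(answers: pd.Series) -> pd.DataFrame:
    """Count each recommended language and express it as a percentage."""
    totals = answers.value_counts()  # missing answers are dropped
    out = totals.rename_axis('Recommended Lang.').reset_index(name='Total')
    out['Percentage'] = out['Total'] / out['Total'].sum() * 100
    return out


# Invented answers; None stands for a respondent who skipped the question.
toy_answers = pd.Series(['LangA'] * 4 + ['LangB'] * 3 + ['LangC'] + [None])
toy_shares = recommendation_shares(toy_answers)
print(toy_shares)
```

Note that the percentages here are relative to respondents who answered the question, so skipped answers do not dilute the shares.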

## Learning platforms

Finally, we see where Kagglers have obtained their education, to see whether that has any influence on the language learned.

In [None]:
pytlrn = pythondf.iloc[:, 243: 255].notna().sum()
rlrn = rdf.iloc[:, 243: 255].notna().sum()
sqllrn = sqldf.iloc[:, 243: 255].notna().sum()
clrn = cdf.iloc[:, 243: 255].notna().sum()
cpluslrn = cplusdf.iloc[:, 243: 255].notna().sum()
javalrn = javadf.iloc[:, 243: 255].notna().sum()
jslrn = jsdf.iloc[:, 243: 255].notna().sum()
julialrn = juliadf.iloc[:, 243: 255].notna().sum()
swiftlrn = swiftdf.iloc[:, 243: 255].notna().sum()
bashlrn = bashdf.iloc[:, 243: 255].notna().sum()
mtlblrn = mtlbdf.iloc[:, 243: 255].notna().sum()
nonelrn = nonedf.iloc[:, 243: 255].notna().sum()
otherlrn = otherdf.iloc[:, 243: 255].notna().sum()

In [None]:
lrnplat = pd.concat([pytlrn,rlrn,sqllrn,clrn,cpluslrn,javalrn,jslrn,julialrn,swiftlrn,bashlrn,mtlblrn,nonelrn,otherlrn], axis = 1)

In [None]:
# we rename the columns
lrnplat.columns = languageList

In [None]:
# reset index so that the learning platforms are in a separate column
lrnplat = lrnplat.reset_index()

# rename the learning platforms column
lrnplat.rename(columns = {'index': 'Learning Platforms'}, inplace = True)

# keep only the platform name (the text after the last '-')
lrnplat['Learning Platforms'] = [x.split('-')[-1].strip() for x in lrnplat.iloc[:,0]]

In [None]:
platforms = [x.split('-')[-1].strip() for x in lrnplat.columns[1:]]

In [None]:
# rename the language columns with the cleaned names
lrnplat.rename(columns=dict(zip(lrnplat.columns[1:], platforms)), inplace=True)

In [None]:
# transform the raw numbers into percentages for uniformity
for i in range(1,14):
    lrnplat.iloc[:,i] = [(x /np.sum(lrnplat.iloc[:,i])) * 100 for x in lrnplat.iloc[:,i]]

In [None]:
lrnplat.head(3)

In [None]:
lrnplat.shape

In [None]:
plt.figure(figsize = (22,10))

d = 0

for i in range(12):
    if i == 0:
        plt.bar(lrnplat.columns[1:], lrnplat.iloc[i,1:], edgecolor = 'black')
        d = lrnplat.iloc[i,1:]
    else:
        plt.bar(lrnplat.columns[1:], lrnplat.iloc[i,1:], bottom = d, edgecolor = 'black')
        d = d + lrnplat.iloc[i,1:]
        
plt.xlabel('Programming languages', fontsize = 12)
plt.ylabel('Percentage of language users', fontsize = 12)
plt.title('Figure 12: Learning platforms on which users began/completed data science courses', fontsize = 20)
plt.legend(labels = lrnplat.iloc[:,0], bbox_to_anchor = (0, 1.05, 1, 0), loc = 'lower left', mode = 'expand', ncol = 2, fontsize = 12)
plt.show()

**Figure 12** shows us the platforms on which Kagglers began or completed data science courses. However, the spread of users across languages and learning platforms is more or less the same; there don’t appear to be any languages whose representation is higher or lower on any particular platform, be it university courses, certification programs, or online learning websites like Coursera or Kaggle.
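
One way to put a number on ‘more or less the same’ is to measure, for each platform, the gap between the language with the highest share and the language with the lowest. This is only a sketch of such a check, not something computed in the analysis above: `platform_spread` is a hypothetical helper, and the percentages in `toy_platforms` are made up.

```python
import pandas as pd


def platform_spread(pct: pd.DataFrame, label_col: str) -> pd.Series:
    """Largest minus smallest share across languages, per platform."""
    values = pct.set_index(label_col)
    spread = values.max(axis=1) - values.min(axis=1)
    return spread.sort_values(ascending=False)


# Made-up percentages; each language column sums to 100.
toy_platforms = pd.DataFrame({
    'Learning Platforms': ['Coursera', 'University', 'Kaggle'],
    'LangA': [30.0, 50.0, 20.0],
    'LangB': [34.0, 40.0, 26.0],
    'LangC': [31.0, 45.0, 24.0],
})
toy_spread = platform_spread(toy_platforms, 'Learning Platforms')
print(toy_spread)
```

Small gaps of a few percentage points on every platform would support the reading that no language is tied to a particular learning platform.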

## Results

After our analysis, the following differences can be observed in the usage of different programming languages:

- Python, R and SQL are popular but 'easy' programming languages, used equally in analytics as well as machine learning. They are more likely to be learned early and used by novice data scientists. Amongst them, SQL users are more likely to be in data analytics; they are also more likely to be in the field of Accounting or Finance. Python users are by contrast more likely to be working in computers/technology, with their work geared more towards data science and data engineering. R users are more likely to be in ‘hard’ data analytics, with equal numbers in academics, finance and technology.
- C, C++, Java and Javascript are languages whose users are more likely to be in software engineering and related disciplines; they are also more likely to be students and those learning programming and data science. While their users work in businesses/jobs that employ machine learning methods, it could be that they use other languages in addition to these four in their work.
- Swift's users tend to be into software engineering as well. However, the users of Swift are more likely to be older and more experienced, and much less likely to be students. They are also more likely to be using or employed in businesses using machine learning.
- Julia and Bash appear to be languages for more experienced programmers, and those deeply into data science. Very few of their users do not use machine learning; they are the least likely to be students. Users of Julia are more likely to be in research.
- MATLAB users are also likely to be more into machine learning than analytics. However, a higher percentage of them tend to be students, meaning that the language is more likely to be taught early to aspiring data scientists.
- The recommended language for new data scientists to learn is overwhelmingly Python, with fewer suggestions for SQL and R. Languages like Bash and Swift are recommended only sparingly.

## Conclusion

As stated in the introduction, there are countless programming languages. Through our analysis, we have seen the ones Kagglers use, and where and how they are likely to use them. We have seen which languages are used by experience, and taken a look at which ones to choose depending on potential career. While it is clear that some languages are preferred over others, it depends, in the end, on the nature of the work being done; there is no 'best' programming language after all!