# What makes you use the technologies you use?
### 2020 Kaggle ML & DS Survey

![tools](https://i.imgur.com/Ar42cBf.png)

It is no surprise that there are many tools for the job. From our first steps in machine learning and data science we ask ourselves: "Should I start with Python or R?", "Should I build my models with package x or y?", "Should I use Amazon, GCP, or something else?"

In this notebook I will explore what makes us choose the technologies / tools we use.

But first it is inevitable to wonder ...


# What tools are being used?

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib

In [None]:
import numpy as np 
import pandas as pd


import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

survey2020 = pd.read_csv("../input/kaggle-survey-2020/kaggle_survey_2020_responses.csv", encoding='utf8')


In [None]:
survey2020headers = survey2020.loc[survey2020.Q1 == "What is your age (# years)?"]
survey2020 = survey2020.loc[survey2020.Q1 != "What is your age (# years)?"]

In [None]:
# Each Q7_Part_* column holds one language option: the label if selected, NaN otherwise.
cols_lang = list(survey2020.columns[survey2020.columns.str.startswith('Q7_Part_')])
summary = survey2020[cols_lang].describe()
languages = list(summary.loc["top"])        # option label of each column
languagescount = list(summary.loc["freq"])  # number of respondents who selected it

fig, ax = plt.subplots(figsize=(15, 10))

width = 0.9

ax.bar(languages, languagescount, width=width)

ax.set_xlabel("Languages selected", fontsize=18)
ax.set_title("What programming languages do you use on a regular basis?", fontsize=18)

plt.show()

In [None]:
summary = survey2020[cols_lang].describe()
lensurvey = len(survey2020)

# Share of all respondents who selected each language
for label, freq in zip(summary.loc["top"], summary.loc["freq"]):
    print(f"{label}: {round(freq / lensurvey * 100, 1)}%")


In [None]:
# Each Q9_Part_* column holds one IDE option: the label if selected, NaN otherwise.
cols_ide = list(survey2020.columns[survey2020.columns.str.startswith('Q9_Part_')])
summary = survey2020[cols_ide].describe()
ide = list(summary.loc["top"])         # option label of each column
ide_count = list(summary.loc["freq"])  # number of respondents who selected it

fig, ax = plt.subplots(figsize=(15, 10))

width = 0.9

ax.bar(ide, ide_count, width=width)

ax.set_xlabel("IDEs selected", fontsize=18)
ax.set_title("Which of the following integrated development environments (IDE's) do you use on a regular basis?", fontsize=18)

plt.xticks(rotation=90)
plt.show()

In [None]:
summary = survey2020[cols_ide].describe()
lensurvey = len(survey2020)

# Share of all respondents who selected each IDE
for label, freq in zip(summary.loc["top"], summary.loc["freq"]):
    print(f"{label}: {round(freq / lensurvey * 100, 1)}%")


In [None]:
# Each Q16_Part_* column holds one framework option: the label if selected, NaN otherwise.
cols_ml = list(survey2020.columns[survey2020.columns.str.startswith('Q16_Part_')])
summary = survey2020[cols_ml].describe()
mlf = list(summary.loc["top"])         # option label of each column
mlf_count = list(summary.loc["freq"])  # number of respondents who selected it

fig, ax = plt.subplots(figsize=(15, 10))

width = 0.9

ax.bar(mlf, mlf_count, width=width)

ax.set_xlabel("Frameworks", fontsize=18)
ax.set_title("Which of the following machine learning frameworks do you use on a regular basis?", fontsize=18)

plt.xticks(rotation=45)
plt.show()

In [None]:
summary = survey2020[cols_ml].describe()
lensurvey = len(survey2020)

# Share of all respondents who selected each framework
for label, freq in zip(summary.loc["top"], summary.loc["freq"]):
    print(f"{label}: {round(freq / lensurvey * 100, 1)}%")


In [None]:
# Each Q26_A_Part_* column holds one cloud platform option: the label if selected, NaN otherwise.
cols_cpl = list(survey2020.columns[survey2020.columns.str.startswith('Q26_A_Part_')])
summary = survey2020[cols_cpl].describe()
cpl = list(summary.loc["top"])         # option label of each column
cpl_count = list(summary.loc["freq"])  # number of respondents who selected it

fig, ax = plt.subplots(figsize=(15, 10))

width = 0.9

ax.bar(cpl, cpl_count, width=width)

ax.set_xlabel("Platforms", fontsize=18)
ax.set_title("Which of the following cloud computing platforms do you use on a regular basis?", fontsize=18)

plt.xticks(rotation=90)
plt.show()

In [None]:
summary = survey2020[cols_cpl].describe()
lensurvey = len(survey2020)

# Share of all respondents who selected each cloud platform
for label, freq in zip(summary.loc["top"], summary.loc["freq"]):
    print(f"{label}: {round(freq / lensurvey * 100, 1)}%")


## Most Used Tools!
From these numbers we can now see which tools are used the most!

**Programming language**
1. Python **77.5%**
2. SQL 37.6%
3. R 21.3%
4. C++ 19.1%

**IDEs**
1. JupyterLab / Jupyter Notebooks **56.0%**
2. Visual Studio Code 29.3%
3. PyCharm 25.4%
4. RStudio 19.1%

**ML Frameworks**
1. Scikit-learn **51.2%**
2. TensorFlow 34.6%
3. Keras 30.9%
4. PyTorch 25.4%

**Cloud Computing Platforms**
1. Amazon Web Services 14.0%
2. Google Cloud Platform 11.4%
3. Microsoft Azure 8.5%
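
Each percentage above is the number of respondents who selected an option divided by the total number of respondents. Here is a minimal sketch of that calculation on a toy frame. The column names mirror the survey layout, but the data and the `selection_share` helper are made up for illustration and are not part of the notebook:

```python
import pandas as pd

def selection_share(df, prefix):
    """Percent of all respondents who ticked each option of a multi-select question.

    Assumes the survey layout: one column per option, holding the option
    label when selected and a missing value otherwise.
    """
    cols = [c for c in df.columns if c.startswith(prefix)]
    counts = df[cols].count()  # non-null cells = selections
    # Use the option label as the name; fall back to the column name if nobody selected it.
    labels = [df[c].dropna().iloc[0].strip() if counts[c] else c for c in cols]
    share = (counts / len(df) * 100).round(1)
    share.index = labels
    return share.sort_values(ascending=False)

# Hypothetical toy data: 4 respondents, 2 language options.
toy = pd.DataFrame({
    "Q7_Part_1": ["Python", "Python", None, "Python"],
    "Q7_Part_2": [None, "R", "R", None],
})
shares = selection_share(toy, "Q7_Part_")
print(shares)
```

Because respondents can tick several options, these shares do not have to add up to 100%.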

Now we can ask ourselves: do different groups favor different tools?

# Is it age?

In [None]:
a = survey2020[["Q1"] + cols_lang].groupby(["Q1", cols_lang[0]]).size()
a

In [None]:
# Count, per age group, how many respondents selected each language.
# count() ignores NaN, so a language nobody in a group selected shows up as 0.
counts = survey2020.groupby("Q1")[cols_lang].count()

# Convert each row to percentages of that age group's total selections.
arr = counts.div(counts.sum(axis=1), axis=0).mul(100).round(1).to_numpy()

ages_serie = counts.index

fig, ax = plt.subplots(figsize=(15, 10))
im = ax.imshow(arr, cmap="magma_r")

# We want to show all ticks...
ax.set_xticks(np.arange(len(languages)))
ax.set_yticks(np.arange(len(ages_serie)))
# ... and label them with the respective list entries
ax.set_xticklabels(languages)
ax.set_yticklabels(ages_serie)

# Rotate the tick labels and set their alignment.
plt.setp(ax.get_xticklabels(), rotation=45, ha="right",
         rotation_mode="anchor")

# Loop over data dimensions and create text annotations.
for i in range(len(ages_serie)):
    for j in range(len(languages)):
        ax.text(j, i, arr[i, j], ha="center", va="center", color="w")

ax.set_title("Use of languages by age group (% of the group's selections)")
fig.tight_layout()
plt.show()

We can conclude several things from this chart. First, Python dominates in every age group.
Second, R is not very popular among young people (ages 18-24).
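
Note that the heatmap divides each row by the number of language selections made in that age group, not by the number of people in it. If the share of respondents is preferred, divide by the group size instead. The sketch below contrasts the two on made-up data (the rows and the `lang_cols` list are illustrative assumptions):

```python
import pandas as pd

# Hypothetical toy data: two age groups, two language options.
toy = pd.DataFrame({
    "Q1": ["18-21", "18-21", "22-24", "22-24"],
    "Q7_Part_1": ["Python", "Python", "Python", None],
    "Q7_Part_2": [None, "R", "R", "R"],
})
lang_cols = ["Q7_Part_1", "Q7_Part_2"]

counts = toy.groupby("Q1")[lang_cols].count()  # selections per group and language
group_sizes = toy.groupby("Q1").size()         # respondents per group

# Rows sum to 100: each cell is a share of the group's selections.
share_of_selections = counts.div(counts.sum(axis=1), axis=0).mul(100).round(1)

# Rows can exceed 100 in total: each cell is a share of the group's respondents.
share_of_respondents = counts.div(group_sizes, axis=0).mul(100).round(1)

print(share_of_selections)
print(share_of_respondents)
```

The second table answers "what fraction of this age group uses the language", which is often the easier number to compare across groups.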

In [None]:
# Count, per age group, how many respondents selected each IDE.
# count() ignores NaN, so an IDE nobody in a group selected shows up as 0.
counts = survey2020.groupby("Q1")[cols_ide].count()

# Convert each row to percentages of that age group's total selections.
arr = counts.div(counts.sum(axis=1), axis=0).mul(100).round(1).to_numpy()

ages_serie = counts.index

fig, ax = plt.subplots(figsize=(15, 10))
im = ax.imshow(arr, cmap="magma_r")

# We want to show all ticks...
ax.set_xticks(np.arange(len(ide)))
ax.set_yticks(np.arange(len(ages_serie)))
# ... and label them with the respective list entries
ax.set_xticklabels(ide)
ax.set_yticklabels(ages_serie)

# Rotate the tick labels and set their alignment.
plt.setp(ax.get_xticklabels(), rotation=45, ha="right",
         rotation_mode="anchor")

# Loop over data dimensions and create text annotations.
for i in range(len(ages_serie)):
    for j in range(len(ide)):
        ax.text(j, i, arr[i, j], ha="center", va="center", color="w")

ax.set_title("Use of IDEs by age group (% of the group's selections)")
fig.tight_layout()
plt.show()

It looks like Notepad++, RStudio and Visual Studio are used less by young people (ages 18-24), while VSCode is more popular in that same group.

In [None]:
# Respondents per age group (rows) and most-used computing platform (columns).
# unstack(fill_value=0) keeps every age group in every column, so the stacked bars stay aligned.
platform_by_age = survey2020.groupby(['Q1', 'Q11']).size().unstack(fill_value=0)

fig, ax = plt.subplots(figsize=(15, 10))

width = 0.9

bottom = np.zeros(len(platform_by_age))
for platform in platform_by_age.columns:
    ax.bar(platform_by_age.index, platform_by_age[platform], width=width, bottom=bottom, label=platform)
    bottom = bottom + platform_by_age[platform].to_numpy()

ax.set_xlabel("Age group", fontsize=18)
ax.set_title(survey2020headers["Q11"].iloc[0], fontsize=18)

ax.legend()
plt.show()

And whether we use cloud computing platforms does not seem to depend on our age.
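
Stacked counts make this hard to judge, because the age groups differ a lot in size. One way to check the claim is to compare row-normalized shares with `pd.crosstab(..., normalize="index")`. This is a sketch on made-up data, with shortened platform labels that are assumptions rather than the exact survey wording:

```python
import pandas as pd

# Hypothetical toy data: age group and most-used computing platform.
toy = pd.DataFrame({
    "Q1": ["18-21", "18-21", "22-24", "22-24", "22-24", "25-29"],
    "Q11": [
        "A personal computer or laptop", "A cloud computing platform",
        "A personal computer or laptop", "A personal computer or laptop",
        "A cloud computing platform", "A personal computer or laptop",
    ],
})

# Each row sums to 100: the share of each platform within one age group.
platform_share = (
    pd.crosstab(toy["Q1"], toy["Q11"], normalize="index").mul(100).round(1)
)
print(platform_share)
```

If the cloud column stays roughly constant from row to row on the real data, that supports the reading that age has little influence.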

In [None]:
ages_serie = survey2020.Q1.value_counts().sort_index().index
ages_serie

# Is it role?

In [None]:
occupations= list(survey2020.Q5.value_counts().sort_index().index)

fig, ax = plt.subplots(figsize=(15, 10))

width = 0.9

previous = []

for i in range(len(occupations)):
    label = occupations[i]
    q = survey2020[["Q5"] + cols_cpl]["Q5"] == label
    ml_oc_uses = list(survey2020[cols_cpl][q].describe().iloc[0])
    if i == 0:
        ax.bar(cpl, ml_oc_uses, width=width, label=label)
        previous = ml_oc_uses
    else:
        ax.bar(cpl, ml_oc_uses, width=width,bottom=previous, label=label)
        previous = [x + y for x, y in zip(previous, ml_oc_uses)]
        
ax.legend()
plt.xticks(rotation=90)
ax.set_xlabel ("Platforms", fontsize=18)
ax.set_title ("Which of the following cloud computing platforms do you use on a regular basis?", fontsize=18)
ax.set_ylim(0,3000)
plt.show()

That's a lot of information, but the interesting points are:

- AWS is the most used platform in almost every role
- GCP comes close to overtaking AWS in roles such as Research Scientist, Machine Learning Engineer and Data Analyst
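
To put a number on how close GCP gets to AWS in each role, the two counts can be compared directly. The sketch below uses made-up data, and it assumes that `Q26_A_Part_1` holds AWS and `Q26_A_Part_3` holds GCP; check the labels in `cpl` before reusing it on the real survey:

```python
import pandas as pd

# Hypothetical toy data: role plus two cloud platform columns.
toy = pd.DataFrame({
    "Q5": ["Data Scientist", "Data Scientist",
           "Research Scientist", "Research Scientist"],
    "Q26_A_Part_1": ["Amazon Web Services (AWS)", "Amazon Web Services (AWS)",
                     "Amazon Web Services (AWS)", None],
    "Q26_A_Part_3": [None, "Google Cloud Platform (GCP)",
                     "Google Cloud Platform (GCP)", None],
})

# Number of respondents per role who selected each platform.
by_role = toy.groupby("Q5")[["Q26_A_Part_1", "Q26_A_Part_3"]].count()
by_role.columns = ["AWS", "GCP"]

# A ratio near 1 means GCP is about as widely used as AWS in that role.
by_role["GCP_to_AWS"] = (by_role["GCP"] / by_role["AWS"]).round(2)
print(by_role)
```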

In [None]:
# Respondents per occupation (rows) and most-used computing platform (columns).
# unstack(fill_value=0) keeps every occupation in every column, so the stacked bars stay aligned.
platform_by_role = survey2020.groupby(['Q5', 'Q11']).size().unstack(fill_value=0)

fig, ax = plt.subplots(figsize=(15, 10))

width = 0.9

bottom = np.zeros(len(platform_by_role))
for platform in platform_by_role.columns:
    ax.bar(platform_by_role.index, platform_by_role[platform], width=width, bottom=bottom, label=platform)
    bottom = bottom + platform_by_role[platform].to_numpy()

ax.set_xlabel("Occupation", fontsize=18)
ax.set_title(survey2020headers["Q11"].iloc[0], fontsize=18)

plt.xticks(rotation=90)
ax.legend()
plt.show()

Two interesting things stand out in this chart:

- Cloud computing platforms are very popular among students and data scientists
- Business analysts and product managers don't tend to use cloud computing platforms. Is that a lack of suitable tools, or a lack of marketing for these tools?

### I hope you liked it, any suggestions are welcome!