In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

# Learning from Kagglers 2020

This notebook includes an Exploratory Data Analysis (EDA) on the Kaggle Survey 2020 dataset.

## Inspiration and Goals

While writing this notebook, I asked myself the questions whose answers would help me enter the professional fields of Data Science and Machine Learning. Through these questions I tried to:

* Understand the role of a scientist who works in the Data Science and Machine Learning fields
* Choose the skills I need to practice in order to enter these fields

I hope the results of this analysis will be useful to any Data Science and Machine Learning beginner who aims to enter these fields as a professional.

## About the data

For the analysis I used the [Kaggle Survey 2020](http://www.kaggle.com/c/kaggle-survey-2020/overview) dataset. 

The dataset is a table in .CSV format that contains the answers Kaggle users gave to the questions of the Kaggle Survey 2020. Inside the table each Survey question corresponds to:
* One column, if only one choice could be selected as an answer to this question
* Multiple columns, if multiple choices could be selected as an answer to this question

Each row corresponds to a Survey participant. 
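The two column layouts call for slightly different counting. The sketch below uses a small made-up table (the column names and values are illustrative, not taken from the Survey file) to show one way of counting answers for a single-choice question and for a multiple-choice question.

```python
import pandas as pd

# Toy data (made up): one single-choice column and one multiple-choice
# question spread over three columns, one column per answer choice.
toy = pd.DataFrame({
    "Q5": ["Student", "Data Scientist", "Student"],
    "Q7_Part_1": ["Python", "Python", None],
    "Q7_Part_2": [None, "R", "R"],
    "Q7_Part_3": ["SQL", None, None],
})

# Single-choice question: count the values of its one column.
role_counts = toy["Q5"].value_counts()

# Multiple-choice question: melt its columns into one long column,
# then count the values (value_counts ignores missing selections).
multi_cols = [c for c in toy.columns if c.startswith("Q7_")]
language_counts = toy[multi_cols].melt(value_name="choice")["choice"].value_counts()

print(role_counts)
print(language_counts)
```

The same idea applies to any multiple-choice question once its columns are known.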

The dataset has 20,036 rows and 335 columns, which means that 20,036 participants took the Survey.

**Supplementary data**: The Kaggle Survey 2020 dataset comes with two .PPT files that describe the methodology applied for the Survey and the questionnaire with the exact questions and answer choices that the participants received.

### Defining the 'Kaggler'
From now on in this notebook, each Kaggle Survey 2020 participant will be called a 'Kaggler'. The Kagglers' answers (in the Kaggle Survey 2020 dataset) are the data that will help me answer my questions about the Data Science and ML professional fields.

## Workflow

**Topics**: Kagglers' professional roles, programming languages, use of Project sharing platforms, activities per professional role, use of ML frameworks and algorithms, use of data analysis tools.

**Organisation**: The analysis that will be presented is based on ten (10) questions:

1. How many of the Kagglers are students?
2. What are the three most used learning platforms by the Kagglers?
3. Which programming language is most used by the most experienced data scientists?
4. Inside the experienced coders group, which other languages are used together with Python?
5. Which programming language is most recommended for an aspiring data scientist to learn first?
6. Which platform is the most used for sharing and deploying Data Science projects?
7. Does the use of platforms for sharing DS projects differ among the professional roles?
8. Which are the most important activities of each professional role?
9. 
  a) Which are the most used Machine Learning frameworks by the group of: i) Research Scientists ii) Data Scientists and Machine Learning Engineers?
  
  b) Which are the most used Machine Learning algorithms by the group of: i) Research Scientists ii) Data Scientists and Machine Learning Engineers?
  
10. Which data analysis tools are mostly used by the Kagglers who currently work in the Data Science field in any professional role?

These questions came to my mind one after the other while I was analysing the dataset with the goal of this notebook in mind. The analysis that follows tries to answer them.

## Exploratory Data Analysis

In [None]:
# import packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
# import data
data = pd.read_csv("../input/kaggle-survey-2020/kaggle_survey_2020_responses.csv", low_memory=False)

# extract the first row of data, where the Survey questions are included
questions = list(data.iloc[0,:])
data = data.iloc[1:]

In [None]:
# view data
print(data.shape)
data.head()

In [None]:
# view the first five Survey questions
questions[:5]

### **Question 1**: How many of the Kagglers are students?

To answer this question we will use the column 'Q5' of data, which corresponds to the Survey question: 'Select the title most similar to your current role (or most recent title if retired)- Selected Choice'. This Survey question was asked to all participants.

In [None]:
# count missing values in column 'Q5'
data['Q5'].isnull().sum()

The 'Q5' column has 759 missing values. That means that out of the 20,036 Kagglers who were asked 'Q5', 759 did not give an answer, or a valid answer, to this question. In other words, about 96% of the Kagglers answered question 'Q5' about their current role.

In [None]:
# count the times that each unique value in column 'Q5' appears in this column
roles = data['Q5'].value_counts()
roles

In [None]:
# picture the frequency of the Kagglers' roles in a pie chart
fig, ax = plt.subplots(figsize=[8,8])
labels = roles.keys()
# one explode offset per role, so the chart does not depend on a fixed number of roles
plt.pie(x=roles, autopct="%.1f%%", explode=[0.05]*len(roles), labels=labels, pctdistance=0.8)
plt.title("Kagglers' roles", fontsize=16)
plt.show()

In [None]:
# calculate the percentage of Kagglers that are students (rounded to the nearest unit)
round(data['Q5'].value_counts().max() / data['Q5'].value_counts().sum(), 2)*100

**Answer**: 5,171 Kagglers are students. Student is the most frequent role among Kagglers, with a share of almost 27% of those who answered 'Q5'.

### **Question 2**: What are the three most used learning platforms by the Kagglers?

We will use the answers that Kagglers gave to 'Q37': 'On which platforms have you begun or completed data science courses? (Select all that apply)'. 'Q37' spans multiple columns of data, so we first need to locate the right columns.
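Locating the columns of a multiple-choice question comes down to searching the question texts for a phrase and keeping the matching positions. A minimal sketch, assuming a short hypothetical list of question texts (`sample_questions`) and a hypothetical helper name (`find_question_columns`):

```python
# Hypothetical question texts standing in for the first row of the Survey file.
sample_questions = [
    "What is your age (# years)?",
    "On which platforms have you begun or completed data science courses? "
    "(Select all that apply) - Selected Choice - Coursera",
    "On which platforms have you begun or completed data science courses? "
    "(Select all that apply) - Selected Choice - Udemy",
    "In which country do you currently reside?",
]

def find_question_columns(question_texts, phrase):
    """Return the positions of all question texts that contain `phrase`."""
    return [i for i, text in enumerate(question_texts) if phrase in text]

course_cols = find_question_columns(sample_questions, "data science courses")
print(course_cols)  # positions that can be passed to DataFrame.iloc[:, ...]
```

Using `enumerate` gives each position directly, so the result stays correct even if two question texts happen to be identical.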

In [None]:
# find the index of the questions that contain the phrase 'data science courses'
col_indexes1 = [i for i, question in enumerate(questions) if 'data science courses' in question]
col_indexes1

In [None]:
# pick the columns from data based on index
cour_data = data[data.columns[231:243]]
cour_data.head()

In [None]:
# get the pairs of course - number of course users
courses = []
cs = []
for i in range(cour_data.shape[1]):
    s = cour_data.iloc[:,i].value_counts()
    courses.append(list(s.items()))
for course in courses:
    for c,n in course:
        cs.append([c,n])
        
# create a dataframe containing the pairs
courses_data = pd.DataFrame(cs, columns = ['Platform', 'Number of users'])
courses_data = courses_data.sort_values("Number of users", ascending =  False)
courses_data = courses_data.reset_index(drop =True)

courses_data

In [None]:
# visualize courses_data
sns.set_style("whitegrid")
plt.figure(figsize = (10,8))
course_bar = sns.barplot(y = 'Platform',x = 'Number of users', data = courses_data, orient = 'h')
plt.show()

**Answer**: The top three most used learning platforms by Kagglers for Data Science are: Coursera, Kaggle Learn Courses and Udemy.

### **Question 3**: Which programming language is most used by the most experienced Kagglers?

To answer this question we will use the answers that Kagglers gave to Survey questions:
* 'Q6': 'For how many years have you been writing code and/or programming?'
* 'Q7': 'What programming languages do you use on a regular basis? (Select all that apply)'.

As the most experienced we consider the Kagglers with more than 10 years of coding.
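Selecting the experienced coders amounts to keeping the rows whose 'Q6' answer falls in one of the two top experience bands. A minimal sketch on made-up rows (the frame `toy_q6` is illustrative; only the two band labels come from the analysis):

```python
import pandas as pd

# Made-up answers to 'Q6'; None stands for a participant who skipped the question.
toy_q6 = pd.DataFrame({"Q6": ["< 1 years", "10-20 years", "20+ years", "3-5 years", None]})

# Keep only the rows in the two bands that mean more than 10 years of coding.
experienced_bands = ["10-20 years", "20+ years"]
experienced = toy_q6[toy_q6["Q6"].isin(experienced_bands)]

print(experienced.shape[0])
```

`isin` scales more easily than chained `==` comparisons if more bands are added later.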

In [None]:
# pick the rows from data that correspond to Kagglers with more than 10 years of coding
exp_data = data.loc[(data['Q6'] == '10-20 years') | (data['Q6'] == '20+ years')]
exp_data.shape

3,080 out of the 20,036 Kagglers are experienced coders.

Next, we take the subset of experienced coders and pick the columns corresponding to question 'Q7', which lists the languages they use on a regular basis.

In [None]:
# find the index of the columns that contain the phrase 'What programming languages do you use on a regular basis?'
col_indexes2 = [i for i, question in enumerate(questions)
                if 'What programming languages do you use on a regular basis?' in question]
col_indexes2

In [None]:
# pick the columns from exp_data
lang_data = exp_data[data.columns[7:20]]
print(lang_data.shape)
lang_data.head()

In [None]:
# get the language - number of users pairs
lans = []
ls = []
for i in range(lang_data.shape[1]):
    s = lang_data.iloc[:,i].value_counts()
    lans.append(list(s.items()))
for lan in lans:
    for l,n in lan:
        ls.append([l,n])
        
# create a dataframe
languages_data = pd.DataFrame(ls, columns = ['Language', 'Number of users'])
languages_data = languages_data.sort_values("Number of users", ascending =  False) 
languages_data = languages_data.reset_index(drop =True)

languages_data

In [None]:
# visualize languages_data
plt.figure(figsize = (8,6))
languages_bar = sns.barplot(y = 'Language', x = 'Number of users', data = languages_data, orient = 'h')
plt.show()

In [None]:
# calculate the share of the three most selected languages among all language selections
for i in range(3):
    share = languages_data['Number of users'][i] / languages_data['Number of users'].sum() * 100
    print(f"{languages_data['Language'][i]} accounts for {share:.1f}% of the language selections made by Kagglers with more than 10 years of coding experience.")

**Answer**: Python is the most used language by the Kagglers that are experienced coders (more than 10 years of coding experience).

### **Question 4**: Inside the experienced coders group, which other programming languages are used together with Python?

To answer this question we are going to use the lang_data dataframe created in Question 3. It contains the columns of data that correspond to Survey question 'Q7' and the rows that correspond to experienced coders (> 10 years of coding experience).
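The counting of languages used together with Python can also be written without explicit loops. The sketch below is one possible vectorised version on a made-up stand-in for lang_data (`toy_langs` and its values are illustrative), assuming, as in the Survey layout, that the first column holds 'Python' when it was selected:

```python
import pandas as pd

# Toy stand-in for lang_data (made-up rows): column 0 holds 'Python' when selected.
toy_langs = pd.DataFrame({
    "Q7_Part_1": ["Python", "Python", None, "Python"],
    "Q7_Part_2": ["R", None, "R", None],
    "Q7_Part_3": ["SQL", "SQL", None, None],
})

# Keep the Python users and look only at their other language columns.
python_users = toy_langs[toy_langs.iloc[:, 0] == "Python"]
others = python_users.iloc[:, 1:]

# Count every other language selected by Python users.
co_counts = others.melt(value_name="language")["language"].value_counts()

# Python users who selected no other language at all.
co_counts["only python"] = int(others.isnull().all(axis=1).sum())

# Share of each entry among all counted entries, in percent.
share = (co_counts / co_counts.sum() * 100).round(2)
print(share)
```

As in the loop version, the percentages are shares of all counted selections, not shares of Python users.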

In [None]:
# create a list with all the programming languages used by the Kagglers who are experienced coders and use Python
comp_langs = []
for i in range(0, lang_data.shape[0]):
    if lang_data.iloc[i, 0] == 'Python':
        for j in range(1, lang_data.shape[1]):
            if pd.notnull(lang_data.iloc[i,j]):
                comp_langs.append(lang_data.iloc[i,j])
        if lang_data.iloc[i,1:].isnull().sum() == 12:
            comp_langs.append('only python')

# turn list into a pandas series
comp_lan_series = pd.Series(comp_langs).value_counts()
total = comp_lan_series.sum()
comp_languages = round((pd.Series(comp_langs).value_counts()/total)*100, 2)
comp_languages

In [None]:
# picture the frequency of each language used along with Python by experienced coders
fig, ax = plt.subplots(figsize=[8,8])
labels = comp_languages.keys()
# one explode offset per slice, so the chart does not depend on a fixed number of languages
patches, texts, autotexts = plt.pie(x=comp_languages, autopct="%.2f%%", explode=[0.05]*len(comp_languages), labels=labels, pctdistance=0.8)
plt.title("Programming languages used along with Python\n by experienced coders (> 10 yrs of coding)", fontsize=16)
plt.setp(autotexts, **{'color':'black', 'fontsize':10.5})
plt.setp(texts, **{'color':'black', 'fontsize':11.5})
plt.show()

**Answer**: Inside the group of experienced coders (> 10 years of coding), Python is most often used along with SQL (21.97% of the times) and rarely with no other language (3.47% of the times).

### **Question 5**: Which programming language is most recommended for an aspiring data scientist to learn first?

To answer this question, we will use the answers to the Survey question 'Q8':
'What programming language would you recommend an aspiring data scientist to learn first?'. 'Q8' accepts a single answer, so it takes up only one column in data.

In [None]:
# find the index of the columns that contain the phrase 'What programming language would you recommend an aspiring data scientist to learn first?'
col_indexes3 = [i for i, question in enumerate(questions)
                if 'What programming language would you recommend an aspiring data scientist to learn first?' in question]
col_indexes3

In [None]:
# pick the columns from data
recommendations = data[data.columns[20]]
recommendations.head()

In [None]:
# calculate the percentage of each recommended programming language
rec = (recommendations.value_counts(normalize =True))*100
rec

In [None]:
# picture the frequency of each language recommended by Kagglers for Data Science beginners
pie, ax = plt.subplots(figsize=[8,8])
colors = ['lightskyblue','yellowgreen','red','gold','violet','blue','lightcoral','grey', 'darkgreen','purple','black','tan', 'navy']
patches, texts =  plt.pie(x=rec, explode=[0.05]*13, labeldistance = 1.05, colors = colors)
labels = ['{0} - {1:1.2f} %'.format(i,j) for i,j in zip(list(rec.keys()), rec.tolist())]

sort_legend = True
if sort_legend:
    patches, labels, dummy =  zip(*sorted(zip(patches, labels, rec.tolist()),
                                          key=lambda x: x[2],
                                          reverse=True))

plt.legend(patches, labels, loc='center right', bbox_to_anchor=(-0.1, 0.6),
           fontsize=14)

plt.title("Programming languages recommended for Data Science beginners", fontsize=16)
plt.show()

**Answer**: The most recommended language for a beginner to data science is Python!

### **Question 6**: Which platform is the most used for sharing and deploying Data Science projects?


To answer this question we are going to use the columns corresponding to question 'Q36': 'Where do you publicly share or deploy your data analysis or machine learning applications? (Select all that apply)'. 'Q36' corresponds to more than one column in data, so we need to locate the right columns. 

In [None]:
# find the index of the columns that contain the phrase 'Where do you publicly share or deploy your data analysis or machine learning applications?'
col_indexes4 = [i for i, question in enumerate(questions)
                if 'Where do you publicly share or deploy your data analysis or machine learning applications?' in question]
col_indexes4

In [None]:
# pick the columns from data
platforms = data[data.columns[221:230]]
platforms.head()

In [None]:
# get the platform - number of users pairs
lans = []
ls = []
for i in range(platforms.shape[1]):
    s = platforms.iloc[:,i].value_counts()
    lans.append(list(s.items()))
for lan in lans:
    for l,n in lan:
        ls.append([l,n])
        
# create a dataframe
platforms_data = pd.DataFrame(ls, columns = ['Platforms', 'Counts'])
platforms_data = platforms_data.sort_values("Counts", ascending = False) 
platforms_data = platforms_data.reset_index(drop = True)

plats = round(platforms_data['Counts'] / platforms_data['Counts'].sum()*100, 2) 

plats.index = platforms_data['Platforms']
plats.name = 'platforms_perc'
plats

In [None]:
# picture the frequency of each platform used by Kagglers to publish and deploy Data Science projects
pie, ax = plt.subplots(figsize=[8,8])
labels = plats.keys()
colors = ['lightskyblue','yellowgreen','red','gold','violet','blue','lightcoral','grey', 'tan']
patches, texts, autotexts =  plt.pie(x=plats, explode=[0.05]*9, labeldistance = 1.05, colors = colors, labels =labels, autopct="%.1f%%", pctdistance=0.8)
labels = ['{0} - {1:1.2f} %'.format(i,j) for i,j in zip(list(plats.keys()), plats.tolist())]
plt.title("Platforms used by Kagglers to share or deploy Data Science projects", fontsize=16)
plt.show()

**Answer**: The most used platform for sharing Data Science projects among Kagglers is GitHub with 33.54% frequency.

### **Question 7**: Does the use of platforms for sharing DS projects differ among the professional roles?

To answer this question we will use the answers that Kagglers gave to Survey questions:

* 'Q5': 'Select the title most similar to your current role (or most recent title if retired)- Selected Choice'
* 'Q36': 'Where do you publicly share or deploy your data analysis or machine learning applications? (Select all that apply)'.

We are going to use the list col_indexes4 that we created in the previous question, which contains the indexes of the columns corresponding to Survey question 'Q36'. 

In [None]:
# add the index of the column 'Q5' in the col_indexes4 
col_indexes4.insert(0,5)

In [None]:
# pick the columns from data
platforms2 = data.iloc[:, col_indexes4]
# pick the rows from platforms2 where role is not student or currently not employed
platforms2 = platforms2.loc[(platforms2['Q5'] != 'Student') & (platforms2['Q5'] != 'Currently not employed')]

print(platforms2.shape)
platforms2.head()

In [None]:
# Create a dataframe which contains the frequency of use of each platform by each professional role. 
# Each column (platform) sums to 100%.
df = pd.DataFrame()
for i in range(1, platforms.shape[1]):
    p = platforms2.groupby('Q5')[platforms.columns[i]]
    pcount = p.count()
    perc = round((pcount/ pcount.sum())*100, 2)
    name = ''.join(platforms.iloc[:, i].value_counts().index.tolist())
    counts = perc.tolist()
    df[name] = counts

df_index = perc.index.tolist()
df['Role'] = df_index 
df = df.set_index('Role',drop = True)
df

In [None]:
# Picture the content of df, plotting a line for each professional role. 
# An almost straight line means that a professional role uses all the platforms with almost the same frequency.
fig,axis = plt.subplots(nrows=1,ncols=1,figsize=(10,7),sharex=True)
for i in range(len(df)):
    plt.plot([k for k in df.columns],[df[y].iloc[i] for y in df.columns], '-o')

plt.legend(df.index,title='Roles', bbox_to_anchor=(1.05, 1), loc='upper left', fontsize='large')
plt.xlabel("Platforms for project sharing",size = 24, labelpad=20)
plt.ylabel("Percentage of usage per platform",size = 24, labelpad=20)
axis.set_xticklabels(['Streamlit', 'NBViewer', 'GitHub', 'Personal blog', 'Kaggle','Colab', 'Shiny','No public share' ],fontsize=16, rotation = 27)
axis.set_yticklabels(['0', '10', '20', '30', '40', '50', '60'],fontsize=16)
plt.show()

**Answer**: There are no big differences in the frequency of use of the platforms among the professional roles. We can notice, though, that Shiny is the most used platform for project sharing among Data Scientists and Data Analysts, while it is the least used among Machine Learning Engineers and Software Engineers.

### **Question 8**: Which are the most important activities of each professional role?

To answer this question we are going to use the answers that Kagglers gave to the Survey questions:
* 'Q5': 'Select the title most similar to your current role (or most recent title if retired)' 
* 'Q23': 'Select any activities that make up an important part of your role at work: (Select all that apply)'

In [None]:
# find the index of the columns that contain the phrase 'Select any activities that make up an important part of your role at work'
col_indexes5 = [i for i, question in enumerate(questions)
                if 'Select any activities that make up an important part of your role at work' in question]
col_indexes5

In [None]:
# add 'Q5' to the col_indexes5
col_indexes5.insert(0,5)
col_indexes5

In [None]:
# pick the columns from data
activities = data.iloc[:, col_indexes5]
# pick the rows from activities where role is neither student nor currently not employed
activities = activities.loc[(activities['Q5'] != 'Student') & (activities['Q5'] != 'Currently not employed')]
activities = activities.reset_index(drop = True)

print(activities.shape)
activities.head(5)

In [None]:
# create a dataframe that contains the frequency with which each professional role does each activity
activities_df = pd.DataFrame()
for i in range(1, activities.shape[1]):
    p = activities.groupby('Q5')[activities.columns[i]]
    pcount = p.count()
    perc = round((pcount/ pcount.sum())*100, 2)
    name = ''.join(activities.iloc[:, i].value_counts().index.tolist())
    counts = perc.tolist()
    activities_df[name] = counts

df_index = perc.index.tolist()
activities_df['Role'] = df_index 
activities_df = activities_df.set_index('Role',drop = True)
activities_df

All activities are done by all professional roles, but not with the same frequency, which means that some roles do some activities more often than other roles do. 

To spot the most and least important activities for each role more clearly, we can rescale the percentages across each row of activities_df, so that the most frequent activity of a role gets the value 5 and the least frequent gets the value 1. All the other activities get a value between 1 and 5. 

Next, we can make a graph to visualize the relative frequency of each activity for each role.
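The rescaling is a linear map applied to each row separately: a value x becomes 1 + 4 (x - min) / (max - min), where min and max are taken over that row. A small sketch with made-up percentages (the array `perc` is illustrative and assumes no row is constant, since max - min would then be zero):

```python
import numpy as np

# Made-up percentages: rows are roles, columns are activities.
perc = np.array([
    [10.0, 30.0, 20.0],
    [5.0, 5.5, 6.0],
])

# Minimum and maximum of every row, kept as column vectors for broadcasting.
row_min = perc.min(axis=1, keepdims=True)
row_max = perc.max(axis=1, keepdims=True)

# Map each row linearly so that its minimum becomes 1 and its maximum becomes 5.
rescaled = 1 + 4 * (perc - row_min) / (row_max - row_min)
print(rescaled)
```

After rescaling, rows with very different spreads become directly comparable, at the cost of hiding how large the original differences were.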

In [None]:
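# rescale each row of activities_df to the range [1, 5]
from sklearn.preprocessing import minmax_scale
scaled = minmax_scale(activities_df, feature_range=(1, 5), axis=1)
activities_df = pd.DataFrame(scaled, index=activities_df.index, columns=activities_df.columns)
activities_df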

In [None]:
# picture a bar plot of activities for each professional role
fig,axes = plt.subplots(nrows=4,ncols=3,figsize=(20,22),sharex=True)
fig.delaxes(axes[3,2])
# draw one bar plot per role, filling the grid column by column (4 + 4 + 3 roles)
for i, role in enumerate(activities_df.index):
    g = sns.barplot(x=list(activities_df.columns), y=activities_df.iloc[i].tolist(), ax=axes[i % 4, i // 4])
    g.set_title(role, size = 20)
    g.set(xticklabels=[])

handles = [bar for bar in axes[0,0].containers[0]]
leg = axes[0,0].legend(handles, activities_df.columns, title = 'Activities', 
                       bbox_to_anchor=(-0.1,1.9,3.6,0.5), loc="upper left",
                       mode="expand", borderaxespad=4, ncol=1, fontsize='x-large', framealpha=0.1)
leg.get_frame().set_facecolor('grey')
leg.get_title().set_fontsize(20)

fig.suptitle("Activities by professional role", y=1.12, x = 0.52,size = 26)
plt.show()

We can use each graph to create the profile of each role based on its activities.

For example, a **Business Analyst** spends a big part of his or her activities analyzing and understanding data to influence product or business decisions, and does fewer activities related to Machine Learning experimentation. A **Data Scientist** is more occupied with Machine Learning experimentation and the creation of prototypes, still in a business context. A **Research Scientist**, on the other hand, is occupied with Machine Learning experimentation for research purposes, in order to improve the state of the art of Machine Learning models.

There are also some roles that cannot be described well by the activities suggested by the Survey, so the answers 'other' and 'none of these activities are important to my role' are the most frequent answers of the participants with these roles. This category includes the **Software Engineer**, the **Statistician** and the **Product/Project Manager**. 

Noticeably, there is a contradiction among the answers of the **DBA/Database Engineer** group of Kagglers, where the answer 'none of these activities are important to my role' is almost as frequent as the answer 'Build and/or run the data infrastructure that my business uses for storing, analyzing, and operationalizing data'.

*Data Scientists, Machine Learning Engineers and Research Scientists seem to be the professional roles that are mostly occupied with Machine Learning. Research Scientists are more oriented towards scientific research, while Data Scientists and Machine Learning Engineers seem to be more oriented towards business development.*

### **Question 9**: a) Which are the most used Machine Learning frameworks by the group of: i) Research Scientists ii) Data Scientists and Machine Learning Engineers? 

To answer this question we are going to use the answers that Kagglers gave to the questions:
* 'Q5: Select the title most similar to your current role (or most recent title if retired)'
* 'Q16: Which of the following machine learning frameworks do you use on a regular basis? (Select all that apply)'


a)i) Find the frequency of use of each ML framework inside the group of Research Scientists.

In [None]:
# picking the rows from data that correspond to Research Scientists
ml_res_data = data[data['Q5'] == 'Research Scientist']

ml_res_data.shape

In [None]:
# find the index of the columns of data that correspond to 'Q16'
col_indexes6 = [i for i, question in enumerate(questions)
                if 'Which of the following machine learning frameworks do you use on a regular basis' in question]
col_indexes6

In [None]:
# add the index of the column 'Q5'
col_indexes6.insert(0,5)
col_indexes6

In [None]:
# pick columns from ml_res_data
ml_res_data = ml_res_data.iloc[:, col_indexes6]
ml_res_data = ml_res_data.reset_index(drop = True)

print(ml_res_data.shape)
ml_res_data.head(5)

In [None]:
# get the framework - number of users pairs.
lans = []
ls = []
for i in range(1,ml_res_data.shape[1]):
    s = ml_res_data.iloc[:,i].value_counts()
    lans.append(list(s.items()))
for lan in lans:
    for l,n in lan:
        ls.append([l,n])
        
# Create a dataframe
frameworks_data = pd.DataFrame(ls, columns = ['Frameworks', 'Number of users'])
frameworks_data = frameworks_data.sort_values("Number of users", ascending = False) 
frameworks_data = frameworks_data.reset_index(drop = True)

frames = round(frameworks_data['Number of users'] / frameworks_data['Number of users'].sum()*100, 2) 

frames.index = frameworks_data['Frameworks']
frames.name = 'res_frameworks_perc'
frames

In [None]:
# turn the res_frameworks_perc series into a dataframe
ml_frames = pd.DataFrame(frames)
ml_frames.reset_index(inplace = True)
ml_frames

a)ii) Repeat the same process for Data Scientists and Machine Learning Engineers.
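Because each group's table is sorted by its own counts, the two groups are safest to combine by framework name rather than by row position. A minimal sketch with made-up percentages (the numbers are illustrative, not Survey results):

```python
import pandas as pd

# Made-up percentages; each Series is sorted by its own values,
# so the frameworks do not appear in the same order.
research = pd.Series(
    {"Scikit-learn": 30.0, "TensorFlow": 25.0, "PyTorch": 20.0},
    name="res_frameworks_perc",
)
industry = pd.Series(
    {"Scikit-learn": 35.0, "PyTorch": 22.0, "TensorFlow": 21.0},
    name="ds_ml_frameworks_perc",
)

# concat with axis=1 aligns on the index labels,
# so every row holds the two values of one framework.
combined = pd.concat([research, industry], axis=1)
print(combined)
```

Assigning `industry.values` as a new column instead would pair PyTorch's value with TensorFlow's row in this example.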

In [None]:
# picking the rows from data that correspond to Data Scientists and ML Engineers
ml_ds_en_data = data[data['Q5'].isin(['Data Scientist', 'Machine Learning Engineer'])]
ml_ds_en_data.shape

In [None]:
# pick the suitable columns from ml_ds_en_data 

ml_ds_en_data = ml_ds_en_data.iloc[:, col_indexes6]
ml_ds_en_data = ml_ds_en_data.reset_index(drop = True)

print(ml_ds_en_data.shape)
ml_ds_en_data.head(5)

In [None]:
# get the framework - number of users pairs.
lans = []
ls = []
for i in range(1, ml_ds_en_data.shape[1]):
    s = ml_ds_en_data.iloc[:,i].value_counts()
    lans.append(list(s.items()))
for lan in lans:
    for l,n in lan:
        ls.append([l,n])
        
# Create a dataframe
frameworks_data2 = pd.DataFrame(ls, columns = ['Frameworks', 'Number of users'])
frameworks_data2 = frameworks_data2.sort_values("Number of users", ascending = False) 
frameworks_data2 = frameworks_data2.reset_index(drop = True)

frames2 = round(frameworks_data2['Number of users'] / frameworks_data2['Number of users'].sum()*100, 2) 

frames2.index = frameworks_data2['Frameworks']
frames2.name = 'ds_ml_frameworks_perc'
frames2

In [None]:
# add the frames2 percentages to the ml_frames dataframe, matching rows by framework name
ml_frames['ds_ml_frameworks_perc'] = ml_frames['Frameworks'].map(frames2)

In [None]:
# reshape to long format for plotting the frequency of each framework for both groups of scientists
ml_frames = pd.melt(ml_frames, id_vars="Frameworks", var_name="Role", value_name="percentage")
ml_frames

In [None]:
# plot the grouped bar chart with catplot (the current name of factorplot)

p = sns.catplot(x='Frameworks', y='percentage', hue='Role', data=ml_frames, 
                kind='bar', legend_out=False, 
                palette ='mako')
fig = plt.gcf()
fig.set_size_inches(13, 6)
p.set_ylabels('Frequency of Use', size = 14)
p.set_xlabels(size = 14)
new_labels=['Research Scientist', 'Data Scientist and ML Engineer']
leg = p.axes.flat[0].get_legend()
for t, l in zip(leg.texts, new_labels): t.set_text(l)
plt.xticks(rotation = 35, size = 11)
plt.show()

#### Notes on the plot:
* Both groups of scientists use the different ML frameworks with almost the same frequency.
* Scikit-learn is the most used framework in both groups.
* For Data Scientists and ML Engineers, TensorFlow, Keras and PyTorch share second place with almost the same percentage of usage. 
* For Research Scientists, TensorFlow is used more than Keras, and Keras more than PyTorch.
* In fifth place for both groups is Xgboost, with no big difference in usage between the two groups. 
* In sixth place is LightGBM, which is used more by Data Scientists and ML Engineers than by Research Scientists. 

### **Question 9**: b) Which are the most used Machine Learning algorithms by the group of: i) Research Scientists ii) Data Scientists and Machine Learning Engineers?

To answer this question we are going to use the answers that Kagglers gave to the questions:

* 'Q5: Select the title most similar to your current role (or most recent title if retired)'
* 'Q17: Which of the following ML algorithms do you use on a regular basis? (Select all that apply)'


b)i) Find the frequency of use of each ML algorithm inside the group of Research Scientists.

In [None]:
# spot the right columns
col_indexes7 = [i for i, question in enumerate(questions)
                if 'Which of the following ML algorithms do you use on a regular basis' in question]
# add 'Q5'
col_indexes7.insert(0,5)
col_indexes7

In [None]:
# pick rows and columns from data
ml_res_data = data[data['Q5'] == 'Research Scientist']

ml_res_data = ml_res_data.iloc[:, col_indexes7]
ml_res_data = ml_res_data.reset_index(drop = True)


# get the algorithm - number of users pairs.
lans = []
ls = []
for i in range(1,ml_res_data.shape[1]):
    s = ml_res_data.iloc[:,i].value_counts()
    lans.append(list(s.items()))
for lan in lans:
    for l,n in lan:
        ls.append([l,n])
        
# create a dataframe containing the frequency of each algorithm for the Research Scientists group
alg_data = pd.DataFrame(ls, columns = ['Algorithms', 'Number of users'])
alg_data = alg_data.sort_values("Number of users", ascending = False) 
alg_data = alg_data.reset_index(drop = True)

algos = round(alg_data['Number of users'] / alg_data['Number of users'].sum()*100, 2) 

algos.index = alg_data['Algorithms']
algos.name = 'res_algorithms_perc'
algos_frame1 = pd.DataFrame(algos)
algos_frame1.reset_index(inplace = True)
algos_frame1

b) ii) Repeat the same process for Data Scientists and ML Engineers

In [None]:
# picking the rows and columns from data
ml_ds_en_data = data[data['Q5'].isin(['Data Scientist', 'Machine Learning Engineer'])]
ml_ds_en_data = ml_ds_en_data.iloc[:, col_indexes7]
ml_ds_en_data = ml_ds_en_data.reset_index(drop = True)

# get the algorithm - number of users pairs
lans = []
ls = []
for i in range(1, ml_ds_en_data.shape[1]):
    s = ml_ds_en_data.iloc[:,i].value_counts()
    lans.append(list(s.items()))
for lan in lans:
    for l,n in lan:
        ls.append([l,n])
        
# Create a dataframe that contains the frequency of use of each algorithm for the Data Scientists and ML engineers group
alg_data2 = pd.DataFrame(ls, columns = ['Algorithms', 'Number of users'])
alg_data2 = alg_data2.sort_values("Number of users", ascending = False) 
alg_data2 = alg_data2.reset_index(drop = True)

algos2 = round(alg_data2['Number of users'] / alg_data2['Number of users'].sum()*100, 2) 

algos2.index = alg_data2['Algorithms']
algos2.name = 'ds_ml_algorithms_perc'

algos_frame2 = algos2.to_frame()
algos_frame2.reset_index(inplace= True)
algos_frame2

In [None]:
# merge algos_frame1 and algos_frame2 dataframes
algos_df = pd.merge(algos_frame1, algos_frame2, on = 'Algorithms', how = 'inner')

In [None]:
# reshape to long format for a grouped bar plot showing the frequency of each algorithm for both groups of scientists
algos_df = pd.melt(algos_df, id_vars='Algorithms', var_name="Role", value_name="percentage")
algos_df
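
To make the merge and melt steps above easier to follow, here is a small sketch on invented numbers (not survey results). The frame names `frame_a` and `frame_b` and the shortened algorithm labels are placeholders:

```python
import pandas as pd

# Invented percentages for two groups; not survey results.
frame_a = pd.DataFrame({'Algorithms': ['Regression', 'Trees', 'CNN'],
                        'res_algorithms_perc': [30.0, 25.0, 15.0]})
frame_b = pd.DataFrame({'Algorithms': ['Regression', 'Trees', 'GBM'],
                        'ds_ml_algorithms_perc': [28.0, 26.0, 20.0]})

# an inner merge keeps only algorithms present in both frames ('CNN' and 'GBM' are dropped)
wide = pd.merge(frame_a, frame_b, on='Algorithms', how='inner')

# melt turns the two percentage columns into (Role, percentage) rows, one row per bar
long = pd.melt(wide, id_vars='Algorithms', var_name='Role', value_name='percentage')
print(long)
```

With `how='inner'`, an algorithm reported by only one group disappears from the comparison; `how='outer'` would keep it with a missing value for the other group.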

In [None]:
# plot a grouped bar chart with catplot (the successor of the deprecated factorplot)
p = sns.catplot(x='Algorithms', y='percentage', hue='Role', data=algos_df,
                kind='bar', legend_out=False,
                palette='rocket')
fig = plt.gcf()
fig.set_size_inches(15, 6)
p.set_ylabels('Percentage of Use', size = 14)
p.set_xlabels(size = 12)

# replace the column names in the legend with readable role names
new_labels=['Research Scientist', 'Data Scientist and ML Engineer']
leg = p.axes.flat[0].get_legend()
for t, l in zip(leg.texts, new_labels): t.set_text(l)

# put each word of a tick label on its own line so long algorithm names do not overlap
for ax in p.axes.flat:
    labels = [label.get_text().replace(' ', '\n') for label in ax.get_xticklabels()]
    ax.set_xticklabels(labels)
plt.show()

#### Notes on the plot:
* **Linear or Logistic Regression** is the most frequently used Machine Learning algorithm in both groups of scientists, followed by **Decision Trees or Random Forests** and, in third place, **Convolutional Neural Networks**.
* Generally, there are no big differences in how frequently the two groups of scientists use the different algorithms. The exception is **Gradient Boosting Machines**, which Data Scientists and Machine Learning Engineers use much more than Research Scientists do.

### **Question 10**: Which data analysis tools are used most by Kagglers who currently work in the Data Science field in any professional role?

To answer this final question, we are going to use the answers that Kagglers gave to the Survey question 'Q38: What is the primary tool that you use at work or school to analyze data? (Include text response)'.

In [None]:
# spot the right columns in data
col_indexes9 = []
for question in questions:
    if 'What is the primary tool that you use at work or school to analyze data' in question:
        col_indexes9.append(questions.index(question))

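# pick rows from data: keep only respondents with a professional role
data_prof = data[(data['Q5'] != 'Student') & (data['Q5'] != 'Currently not employed') & (data['Q5'] != 'Other')]
# pick the tool column from the professionals subset (as a Series, so value_counts() has a plain index)
tools = data_prof.iloc[:, col_indexes9[0]]
tools.value_counts()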

In [None]:
# calculate the frequency of use of each tool
tools_perc = (tools.value_counts()/tools.value_counts().sum())*100
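
The percentage calculation above can also be expressed with `value_counts(normalize=True)`, which returns shares directly. A minimal sketch on invented answers (not survey results), assuming the answers are held in a pandas Series:

```python
import pandas as pd

# Invented answers for illustration only; None marks a respondent who skipped the question.
answers = pd.Series(['Local development environments', 'Basic statistical software',
                     'Local development environments', None,
                     'Local development environments'])

# normalize=True returns shares that sum to 1; missing answers are excluded by default
perc = answers.value_counts(normalize=True) * 100
print(perc.round(1))
```

Because missing answers are excluded, the percentages are relative to the respondents who answered the question, which matches dividing the counts by their sum.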

In [None]:
# plot the frequency of use of each tool
plt.figure(figsize=(7,6))
colors = sns.color_palette('Set2', n_colors=len(tools_perc))
plot4 = sns.barplot(x = tools_perc.index, y = tools_perc.values, palette=colors)
handles = [bar for bar in plot4.containers[0]]
leg = plot4.legend(handles, tools_perc.index, 
                       bbox_to_anchor=(1.0,0.0,1.41,0.3), loc="lower left",
                       mode="expand", borderaxespad=4, ncol=1, fontsize='large', framealpha=0.1)
leg.get_frame().set_facecolor('grey')
plot4.set(yticks=np.arange(0,50,5), xticklabels=[], yticklabels=[0,5,10,15,20,25,30,35,40,45])
plt.xlabel('Tools for data analysis', size=14)
plt.ylabel('Frequency of use', size=14)
plt.show()

**Answer**: The tools most used for data analysis by Kagglers working professionally in any Data Science related field are local development environments (RStudio, JupyterLab, etc.) and basic statistical software (Microsoft Excel, Google Sheets, etc.).

## Conclusions
* Python is a highly recommended language to start with when learning and applying Data Science and Machine Learning.
* GitHub is a commonly used platform for sharing Data Science projects, which makes it a very useful learning tool for Data Science and Machine Learning beginners.
* Online courses on platforms like Coursera, Kaggle and Udemy have been chosen by the majority of Kagglers for learning Data Science and Machine Learning, and can be useful sources for beginners to learn from.
* Basic statistical software, like Microsoft Excel and Google Sheets, is still widely used by Data Science and Machine Learning professionals.
* The use of Machine Learning frameworks and algorithms does not differ significantly between the scientific and business worlds. These factors should therefore not weigh heavily when choosing a career path.
* Some of the most important roles of a Data Scientist are:
    * to build or run a machine learning service that operationally improves products or workflows,
    * to experiment and iterate in order to improve existing Machine Learning models,
    * to build prototypes to explore applying Machine Learning to new areas, and
    * to analyze and understand data to influence product or business decisions.

Thank you !!!