# 2020 Kaggle Machine Learning and Data Science Survey 

<p style='text-align: right;'>Authors: Daniel Noto, Yupak Satsrisakul</p>

<a id='table'></a>

### Table of Contents

#### 1. [Key Results](#section1)
#### 2. [Finding the right target group to analyze](#section2)
- 2.1 [Data preparation](#section2_1)
- 2.2 [Story Telling](#section2_2)

#### 3. [The *Top Three*](#section3)
- 3.1 [Sociodemographic attributes](#section3_1)
- 3.2 [Importance of Gender - What does the current picture look like?](#section3_2)

#### 4. [Target Group: US Female Students](#section4)
- 4.1 [Sociodemographic attributes](#section4_1)
- 4.2 [US Girls and their Machine Learning experience / Technology know-how](#section4_2)
- 4.3 [Social Media and Learning online](#section4_3)
- 4.4 [What do US female students plan for the next two years?](#section4_4)

#### 5. [Conclusion](#section5)

# Prerequisite

In [None]:
# load required libraries
import pandas as pd
import numpy as np
import seaborn as sns
sns.set_style('darkgrid') # set design
import matplotlib.pyplot as plt
%matplotlib inline

from _plotly_future_ import v4_subplots
from plotly import __version__
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot

print('Version of current Plotly package: ' + __version__) # requires version >= 1.9.0

import cufflinks as cf
# For Notebooks
init_notebook_mode(connected=True)
# For offline use
cf.go_offline()

<a id='section1'></a>

### 1. Key Results

More than 20,000 people were interviewed on Kaggle

-	Half of these Kagglers are students (26.8%), data scientists (13.9%) or software engineers (10.2%)
-	The remaining 49.1% of Kagglers are, for example, currently not employed (8.6%) or data analysts (8.6%)

#### Because students are the largest group in the survey, the further analysis is based on that subgroup.

Moreover, the survey results reveal that most students are from India (43.4%), the USA (6.8%) or China (4%) (disregarding Other (5%)). These three most represented countries are called the *Top Three* in the following analysis.

Regarding the age distribution of the *Top Three*, most students (> 95%) are between 18 and 29 years old.

Looking at the education level of the students from the *Top Three*, most attend either a bachelor's or a master's program. Together these account for more than 84%. In addition, some students study without earning an academic degree (8%).

As might be expected, the gender representation is still in favor of men. Male students are the majority (73.4%), whereas women account for 24.9%, which is roughly every fourth student within the *Top Three*. The USA has the highest share of women relative to the sum of men and women amongst the *Top Three* (the other three gender choices have been neglected for simplicity): roughly every third American student who joined the 2020 Kaggle survey is a woman. That led us to focus on this specific subgroup for the remaining analysis.

#### Insights about female students studying in the USA

#### Sociodemographic data:

-	Most female students from the USA are between 25 and 29 years old
-	Every second American female student attends a master’s program

#### Machine Learning and Technology:

-	Roughly one-third of them have one to two years of programming experience. One in four has three to five years of experience, whereas 2% have more than 10 years. (On average, a female student can also handle at least two programming languages.)
-	Among the most used languages, Python (32.2%) is the clear winner, followed by R (17.2%) and SQL (12.2%). (By the way, nearly 8 out of 10 female students recommend Python to an aspiring data scientist.)
-	The majority uses a personal computer or laptop for data science projects. The female students use GPUs (32.9%) far more often than TPUs (1.2%), and most of them (88.1%) have never used a TPU
-	For visualizing/plotting data, the most used libraries are 1. Matplotlib (30.5%), 2. Ggplot/ggplot2 (19.9%) and 3. Seaborn (16.9%)
-	The results indicate that the largest group of female students (39%) has been using ML methods for less than one year
-	The most used ML algorithms are 1. Linear or Logistic Regression (29.5%), 2. Decision Trees or Random Forests (21.2%) and 3. Bayesian Approaches (12.8%)
-	In terms of computer vision categories, the most frequent answers are image classification and other general purpose networks, or none (22.5% each)
-	The most used NLP methods are 1. Transformer language models (31%), 2. Word embeddings/vectors (31%) and 3. Encoder-decoder models (20.7%)

#### Social Media and Learning online:

-	Approximately every fifth female student from the US acquires the needed data science know-how through university courses, followed by Coursera (18.5%) and LinkedIn Learning (11.5%)
-	The favorite media sources for further information about data science are YouTube (1 out of 5 female students), Kaggle (16.4%) and blogs (15.8%); only 2.4% of these students do not use any media source

#### Endeavors of female students for the upcoming two years:

In general, the good news is that female students from the USA want to invest more time in getting familiar with technologies such as cloud computing, ML, big data products and more. 

For instance, there is great interest in getting more familiar with cloud computing platforms like Amazon Web Services (26.2%), Google Cloud Platform (19%) and Microsoft Azure (17.2%). Only 4.5% responded that they have no interest at all. Among specific cloud computing products, Azure Cloud Services (14.6%), Google Cloud Compute Engine (13.4%) and Amazon EC2 (12.9%) are the most popular of the given choices. Here again, only 4.8% of female students indicate that they do not want to get familiar with any of them in the next two years.

Regarding specific machine learning products, female students would like to get more familiar with Google Cloud AI Platform / Google Cloud ML Engine (15%) and Google Cloud Natural Language (14.1%). Products from Amazon sit near the end of the ranking; for instance, Amazon Forecast ranks second to last (7.5%). Only a small minority (3.1%) is interested in none of them. 

Regarding big data products, MySQL (12.9%), MongoDB (10.8%) and Microsoft SQL Server (9.7%) are the favorites amongst the US female students. Products from Amazon (like Amazon Redshift or Amazon Athena), IBM (like IBM Db2) or Microsoft (like Microsoft Access or Microsoft Azure Data Lake Storage) are less popular among American female students. Each of these is chosen by 1.8% to 2.9% of the female students. 

Among Business Intelligence tools, Tableau is the clear favorite. More than every fourth student (28.6%) voted for that particular software. Microsoft Power BI and Google Data Studio rank second and third, respectively (19.3% and 13%). 
TIBCO Spotfire, Domo and Qlik are the least chosen tools among American female students (each chosen by 1.2%). 

Looking at the choices for automated ML tools, the picture is as follows:
The largest share of female students (17.1%) wants to get more familiar with automation of full ML pipelines (like Google Cloud AutoML). Automated model selection (for instance with auto-sklearn) is chosen by 16.6% of female students, whereas automated feature engineering/selection is chosen by 16%. Nearly every tenth female student from the US does not want to get familiar with any of these tools. 
When asked about specific automated ML products, the largest group is in favor of Auto-Sklearn (every fifth female student). Auto-Keras and Google Cloud AutoML rank second and third, respectively. 
For managing ML experiments, the clear favorite is TensorBoard (28.3% of female students). However, one out of four students (25.3%) does not want to get familiar with any of these management tools. Still, Weights & Biases and Neptune.ai seem to be interesting for 14.1% and 10.1% of female students, respectively. 

[Back to Table](#table)

<a id='section2'></a>

## 2. Finding the right target group to analyze

<a id='section2_1'></a>

#### 2.1 Data preparation

In [None]:
# load the dataset

survey = pd.read_csv('../input/kaggle-survey-2020/kaggle_survey_2020_responses.csv', low_memory=False)

In [None]:
# show the first 5 rows of the data

survey.head()

In [None]:
# here, it would be great to drop the first row (= description of the question)
# before doing so, a function 'question' is defined that returns the full question text
# for a given column name:

def question(col_name, row_entry=0):
    return survey.loc[row_entry, col_name]

In [None]:
# does it work?

question('Q8')

In [None]:
# drop the description of the question
Survey = survey.drop(0, axis=0)

# please note capital and small s of the declared variable

In [None]:
Survey.info()

In [None]:
# average time of finishing the survey in seconds:

Survey['Time from Start to Finish (seconds)'] = Survey['Time from Start to Finish (seconds)'].astype(str).astype(int)
avg_time = Survey['Time from Start to Finish (seconds)'].mean() / 60 / 60
avg_time = round(avg_time,1)

print('Average time to complete the survey is {} hours'.format(avg_time))

***Remark***

Let us look again at the distribution of the time spent on the survey. The mean completion time over all respondents is about 2.5 hours (9,156 seconds), as the distribution plot on the left suggests. This number is misleading, because a small number of very long completion times dominates the average. Zooming in on the responses below 2,500 seconds (right plot) shows that, in reality, most respondents completed the survey in less than 2,000 seconds, i.e. in roughly half an hour.

In [None]:
# mean completion time over all respondents
time_spent = Survey['Time from Start to Finish (seconds)']
mean = time_spent.mean()
print('Over the whole data, the mean time to complete the survey is {} seconds'.format(round(mean)))

# restrict to responses below 2,500 seconds to zoom into the bulk of the distribution
restricted_time = Survey[time_spent < 2500]
time_spent_range = restricted_time['Time from Start to Finish (seconds)']

f, axes = plt.subplots(1,2,figsize=(15,5))
sns.distplot(time_spent, ax=axes[0])
axes[0].set_title('Time spent on the survey (all responses)')
sns.distplot(time_spent_range, kde=True, ax=axes[1])
axes[1].set_title('Most of the time spent in reality');
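
The effect described in the remark can also be checked numerically by comparing the mean with the median. The following is only a minimal sketch on made-up completion times (`toy_times` is hypothetical and not taken from the survey); it shows how a single very long session pulls the mean far above the typical value.

```python
import pandas as pd

# Hypothetical completion times in seconds; the last value mimics a respondent
# who left the survey open for a very long time.
toy_times = pd.Series([300, 450, 600, 750, 900, 1200, 90000])

mean_seconds = toy_times.mean()      # strongly influenced by the outlier
median_seconds = toy_times.median()  # robust against the outlier
share_under_2500 = (toy_times < 2500).mean() * 100  # share of "typical" responses

print('Mean:   {:.0f} seconds'.format(mean_seconds))
print('Median: {:.0f} seconds'.format(median_seconds))
print('Share of responses under 2500 seconds: {:.1f}%'.format(share_under_2500))
```

Applied to the real data, the same three lines would use `Survey['Time from Start to Finish (seconds)']` instead of `toy_times`.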

#### Here, we partition the data in a reasonable way in order to get a better overview of the given results. The sections are:
- Section I: Questions about social and demographic attributes
- Section II: Questions about software and programming skills
- Section III: Questions about hardware, ML technologies and libraries used by data scientists
- Section IV: Questions about current technologies like cloud computing
- Section V: Questions about social media activities
- Section VI: Questions related to what data science enthusiasts plan to learn in the next two years (Part B)

These sections should help us to build a story in a reasonable way. The idea behind this is simply to break down the survey questions in order to get a better picture of what was asked, so that everyone can, in principle, write their own story based on the results.  

##### [Back to Table](#table)

In [None]:
### Questions 1 to 5 are regarding social and demographical attributes
survey_Q1_Q5 = Survey.loc[:,('Q1','Q2','Q3','Q4','Q5')]

In [None]:
### Questions 6 to 10 refer to software and programming (skills)
survey_Q6_Q10 = Survey.loc[:,('Q6','Q7_Part_1','Q7_Part_2','Q7_Part_3','Q7_Part_4','Q7_Part_5','Q7_Part_6',
                              'Q7_Part_7','Q7_Part_8','Q7_Part_9','Q7_Part_10','Q7_Part_11','Q7_Part_12','Q7_OTHER',
                              'Q8','Q9_Part_1','Q9_Part_2','Q9_Part_3','Q9_Part_4','Q9_Part_5','Q9_Part_6','Q9_Part_7',
                              'Q9_Part_8','Q9_Part_9','Q9_Part_10','Q9_Part_11','Q9_OTHER','Q10_Part_1','Q10_Part_2',
                              'Q10_Part_3','Q10_Part_4','Q10_Part_5','Q10_Part_6','Q10_Part_7','Q10_Part_8',
                              'Q10_Part_9','Q10_Part_10','Q10_Part_11','Q10_Part_12','Q10_Part_13','Q10_OTHER')]



In [None]:

### Questions 11 to 19 report about hardware, machine learning techniques and given libraries 
### in programming languages that are used by data scientists
survey_Q11_Q19 = Survey.loc[:,('Q11','Q12_Part_1','Q12_Part_2','Q12_Part_3','Q12_OTHER','Q13','Q14_Part_1','Q14_Part_2',
                               'Q14_Part_3','Q14_Part_4','Q14_Part_5','Q14_Part_6','Q14_Part_7','Q14_Part_8','Q14_Part_9',
                               'Q14_Part_10','Q14_Part_11','Q14_OTHER','Q15','Q16_Part_1','Q16_Part_2','Q16_Part_3',
                               'Q16_Part_4','Q16_Part_5','Q16_Part_6','Q16_Part_7','Q16_Part_8','Q16_Part_9','Q16_Part_10',
                               'Q16_Part_11','Q16_Part_12','Q16_Part_13','Q16_Part_14','Q16_Part_15','Q16_OTHER',
                               'Q17_Part_1','Q17_Part_2','Q17_Part_3','Q17_Part_4','Q17_Part_5','Q17_Part_6','Q17_Part_7','Q17_Part_8',
                               'Q17_Part_9','Q17_Part_10','Q17_Part_11','Q17_OTHER','Q18_Part_1','Q18_Part_2','Q18_Part_3',
                               'Q18_Part_4','Q18_Part_5','Q18_Part_6','Q18_OTHER','Q19_Part_1','Q19_Part_2','Q19_Part_3',
                               'Q19_Part_4','Q19_Part_5','Q19_OTHER')]

In [None]:

### Questions 20 to 25 are regarding working place and its circumstances
survey_Q20_Q25 = Survey.loc[:,('Q20','Q21','Q22','Q23_Part_1','Q23_Part_2','Q23_Part_3','Q23_Part_4','Q23_Part_5',
                               'Q23_Part_6','Q23_Part_7','Q23_OTHER','Q24','Q25')]

In [None]:
### Questions 26 to 35 ask about everything related to cloud computing, ML, automated ML, tools for automated 
### ML and visualization tools / BI tools that are used by data scientists
###### PLEASE NOTE: THERE ARE BOTH PART A AND PART B QUESTIONS. ACCORDINGLY, THE QUESTIONS ARE MARKED
###### AS PART A AND PART B
survey_Q26_Q35_PartA = Survey.loc[:,('Q26_A_Part_1','Q26_A_Part_2','Q26_A_Part_3','Q26_A_Part_4','Q26_A_Part_5',
                                     'Q26_A_Part_6','Q26_A_Part_7','Q26_A_Part_8','Q26_A_Part_9','Q26_A_Part_10',
                                     'Q26_A_Part_11','Q26_A_OTHER','Q27_A_Part_1','Q27_A_Part_2','Q27_A_Part_3',
                                     'Q27_A_Part_4','Q27_A_Part_5','Q27_A_Part_6','Q27_A_Part_7','Q27_A_Part_8',
                                     'Q27_A_Part_9','Q27_A_Part_10','Q27_A_Part_11','Q27_A_OTHER','Q28_A_Part_1',
                                     'Q28_A_Part_2','Q28_A_Part_3','Q28_A_Part_4','Q28_A_Part_5','Q28_A_Part_6',
                                     'Q28_A_Part_7','Q28_A_Part_8','Q28_A_Part_9','Q28_A_Part_10','Q28_A_OTHER',
                                     'Q29_A_Part_1','Q29_A_Part_2','Q29_A_Part_3','Q29_A_Part_4','Q29_A_Part_5',
                                     'Q29_A_Part_6','Q29_A_Part_7','Q29_A_Part_8','Q29_A_Part_9','Q29_A_Part_10',
                                     'Q29_A_Part_11','Q29_A_Part_12','Q29_A_Part_13','Q29_A_Part_14','Q29_A_Part_15',
                                     'Q29_A_Part_16','Q29_A_Part_17','Q29_A_OTHER','Q30','Q31_A_Part_1','Q31_A_Part_2',
                                     'Q31_A_Part_3','Q31_A_Part_4','Q31_A_Part_5','Q31_A_Part_6','Q31_A_Part_7',
                                     'Q31_A_Part_8','Q31_A_Part_9','Q31_A_Part_10','Q31_A_Part_11','Q31_A_Part_12',
                                     'Q31_A_Part_13','Q31_A_Part_14','Q31_A_OTHER','Q32','Q33_A_Part_1','Q33_A_Part_2',
                                     'Q33_A_Part_3','Q33_A_Part_4','Q33_A_Part_5','Q33_A_Part_6','Q33_A_Part_7',
                                     'Q33_A_OTHER','Q34_A_Part_1','Q34_A_Part_2','Q34_A_Part_3','Q34_A_Part_4',
                                     'Q34_A_Part_5','Q34_A_Part_6','Q34_A_Part_7','Q34_A_Part_8','Q34_A_Part_9',
                                     'Q34_A_Part_10','Q34_A_Part_11','Q34_A_OTHER','Q35_A_Part_1','Q35_A_Part_2',
                                     'Q35_A_Part_3','Q35_A_Part_4','Q35_A_Part_5','Q35_A_Part_6','Q35_A_Part_7',
                                     'Q35_A_Part_8','Q35_A_Part_9','Q35_A_Part_10','Q35_A_OTHER')]


In [None]:

### Questions 26 to 35 ask about everything related to cloud computing, ML, automated ML, tools for automated 
### ML and visualization tools / BI tools that are used by data scientists, but with a focus on the tools 
### respondents want to get familiar with in the next two years.
###### PLEASE NOTE: THERE ARE BOTH PART A AND PART B QUESTIONS. ACCORDINGLY, THE QUESTIONS ARE MARKED
###### AS PART A AND PART B
survey_Q26_Q35_PartB = Survey.loc[:,('Q26_B_Part_1','Q26_B_Part_2','Q26_B_Part_3','Q26_B_Part_4',
                                      'Q26_B_Part_5','Q26_B_Part_6','Q26_B_Part_7','Q26_B_Part_8',
                                      'Q26_B_Part_9','Q26_B_Part_10','Q26_B_Part_11','Q26_B_OTHER',
                                      'Q27_B_Part_1','Q27_B_Part_2','Q27_B_Part_3','Q27_B_Part_4',
                                      'Q27_B_Part_5','Q27_B_Part_6','Q27_B_Part_7','Q27_B_Part_8',
                                      'Q27_B_Part_9','Q27_B_Part_10','Q27_B_Part_11','Q27_B_OTHER',
                                      'Q29_B_Part_1','Q29_B_Part_2','Q29_B_Part_3','Q29_B_Part_4',
                                      'Q29_B_Part_5','Q29_B_Part_6','Q29_B_Part_7','Q29_B_Part_8','Q29_B_Part_9','Q29_B_Part_10','Q29_B_Part_11',
                                      'Q29_B_Part_12','Q29_B_Part_13','Q29_B_Part_14','Q29_B_Part_15','Q29_B_Part_16',
                                      'Q29_B_Part_17','Q29_B_OTHER','Q31_B_Part_1','Q31_B_Part_2','Q31_B_Part_3',
                                      'Q31_B_Part_4','Q31_B_Part_5','Q31_B_Part_6','Q31_B_Part_7','Q31_B_Part_8',
                                      'Q31_B_Part_9','Q31_B_Part_10','Q31_B_Part_11','Q31_B_Part_12','Q31_B_Part_13',
                                      'Q31_B_Part_14','Q31_B_OTHER','Q33_B_Part_1','Q33_B_Part_2','Q33_B_Part_3',
                                      'Q33_B_Part_4','Q33_B_Part_5','Q33_B_Part_6','Q33_B_Part_7','Q33_B_OTHER',
                                      'Q34_B_Part_1','Q34_B_Part_2','Q34_B_Part_3','Q34_B_Part_4','Q34_B_Part_5',
                                      'Q34_B_Part_6','Q34_B_Part_7','Q34_B_Part_8','Q34_B_Part_9','Q34_B_Part_10',
                                      'Q34_B_Part_11','Q34_B_OTHER','Q35_B_Part_1','Q35_B_Part_2','Q35_B_Part_3',
                                      'Q35_B_Part_4','Q35_B_Part_5','Q35_B_Part_6','Q35_B_Part_7','Q35_B_Part_8',
                                      'Q35_B_Part_9','Q35_B_Part_10','Q35_B_OTHER')]

In [None]:

### Questions 36 to 39 report information about given activities on social media platforms. Moreover, it is asked
### what primary tools are used to analyze data at school or at work.
survey_Q36_Q39 = Survey.loc[:,('Q36_Part_1','Q36_Part_2','Q36_Part_3','Q36_Part_4','Q36_Part_5','Q36_Part_6','Q36_Part_7',
                               'Q36_Part_8','Q36_Part_9','Q36_OTHER','Q37_Part_1','Q37_Part_2','Q37_Part_3','Q37_Part_4',
                               'Q37_Part_5','Q37_Part_6','Q37_Part_7','Q37_Part_8','Q37_Part_9','Q37_Part_10','Q37_Part_11',
                               'Q37_OTHER','Q38','Q39_Part_1','Q39_Part_2','Q39_Part_3','Q39_Part_4','Q39_Part_5',
                               'Q39_Part_6','Q39_Part_7','Q39_Part_8','Q39_Part_9','Q39_Part_10','Q39_Part_11','Q39_OTHER')]

# After slicing the data into predefined sections, we have not only an easier way of looking at the data;
# unnecessary columns like 'Unnamed 7', 'Unnamed 21' are dropped as well.
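
The column lists above are written out by hand, which is explicit but easy to get wrong. As a hedged alternative, the same kind of slice can be built from the column names themselves. The helper `question_columns` and the frame `toy` below are hypothetical and only illustrate the idea on a handful of made-up column names that follow the survey's naming pattern (`Q7_Part_1`, `Q7_OTHER`, `Q26_A_Part_1`, ...).

```python
import pandas as pd

def question_columns(df, first, last):
    """Return the columns of df that belong to questions Q<first>..Q<last>."""
    wanted = {'Q{}'.format(i) for i in range(first, last + 1)}
    # the text before the first underscore is the question id:
    # 'Q7_Part_1' -> 'Q7', 'Q26_A_Part_1' -> 'Q26', 'Q8' -> 'Q8'
    return [col for col in df.columns if col.split('_')[0] in wanted]

# Empty toy frame with a few column names in the style of the survey
toy = pd.DataFrame(columns=['Time from Start to Finish (seconds)', 'Q1', 'Q6',
                            'Q7_Part_1', 'Q7_OTHER', 'Q10_Part_1', 'Q11',
                            'Q26_A_Part_1'])

cols_q6_q10 = question_columns(toy, 6, 10)
toy_q6_q10 = toy.loc[:, cols_q6_q10]
print(cols_q6_q10)
```

Note that this sketch does not distinguish Part A from Part B questions; for that split, an additional check on `'_A_'` or `'_B_'` in the column name would be needed.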

<a id='section2_2'></a>

#### 2.2 Story Telling

After defining and partitioning the data, it is time to get an idea of what our story might look like. To that end, think about how we can get the most insight out of a specific group in 
the results of the 2020 Kaggle survey.  
It is therefore useful to look at question 5, which tells us the current role / title of the respondent:

In [None]:
question('Q5')

In [None]:
# Which roles are represented most in the survey 2020?
q5_percent =  Survey['Q5'].value_counts()/Survey['Q5'].value_counts().sum()*100 # results in %
q5_percent = round(q5_percent, 1)

n_all = len(Survey) # sample size

q5_percent.iplot(kind='bar', color='#51ccfc',
                 title='Kaggle Survey 2020 - Representation of responses grouped by roles (n = {})'.format(n_all),
                 yTitle='% share')

In [None]:
# What are the given representations in percentage:

percent = Survey['Q5'].value_counts()/Survey['Q5'].value_counts().sum()*100

#percent.round(1)

name_ = []
for i,j in zip(percent.index, percent.round(1)):
    n = '{} % {}'.format(j, i)
    name_.append(n)


percent.plot(kind='pie', figsize=(10,6), autopct='%1.1f%%')
plt.title('Kaggle Survey 2020 - Representation of responses (in %) grouped by roles \n (n = {})'.format(n_all))
plt.ylabel('') # in order to remove "Q5" as y label
plt.tight_layout()
plt.legend(name_, title='Percentage', loc ='center right',bbox_to_anchor=(1, 0, 1, 1))
plt.show();

Students, data scientists and software engineers make up the majority of the survey: more than half of the overall responses.
It is worth having a closer look at the students in the survey since they represent about 1/4 of the overall roles.
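
The percentage shares used throughout this notebook are computed as counts divided by their sum. pandas offers the same calculation through `value_counts(normalize=True)`. The sketch below uses a small made-up Series (`toy_roles` is hypothetical, not survey data) to show the pattern.

```python
import pandas as pd

# Hypothetical answers to a single-choice question; None mimics a skipped answer
toy_roles = pd.Series(['Student', 'Student', 'Data Scientist', 'Student',
                       'Software Engineer', 'Data Scientist', None, 'Student'])

# normalize=True returns shares instead of counts;
# missing answers are excluded by default (dropna=True)
role_share = (toy_roles.value_counts(normalize=True) * 100).round(1)
print(role_share)
```

Because missing answers are dropped before normalizing, the shares refer to respondents who actually answered the question, which matches the manual calculation in the cells above.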



In [None]:
survey_student = Survey[Survey['Q5']=='Student'].copy() # copy, so that later column assignments do not act on a view of Survey

In [None]:
survey_student['Q3'].unique()
# let's abbreviate the following countries/unions of countries in order to have a clearer presentation of the viz:
# - United States of America --> USA
# - United Kingdom of Great Britain and Northern Ireland --> UK/NI
# - United Arab Emirates --> UAE
# - Iran, Islamic Republic of... --> IIR

In [None]:
# map the long country names to their abbreviations (exact matches only, no regex involved)
survey_student['Q3'] = survey_student['Q3'].replace({
    'United States of America': 'USA',
    'United Kingdom of Great Britain and Northern Ireland': 'UK/NI',
    'United Arab Emirates': 'UAE',
    'Iran, Islamic Republic of...': 'IIR',
})

In [None]:
# keep only the students and visualize them by country

stud_percent = survey_student['Q3'].value_counts()/survey_student['Q3'].value_counts().sum()*100
stud_percent = round(stud_percent, 1)

stud_percent.iplot(kind='bar',color='#51ccfc', title='Kaggle Survey 2020 - Where do students come from?')

Based on the result above, we can observe that India, the USA and China (disregarding Other) are the *Top Three*
leading groups represented in the Kaggle community. 

For our further analysis, we will only look at students who reside in one of those three countries: our *Top Three*

<a id='section3'></a>

## 3. The *Top Three*

In [None]:
# keep only students in India, the USA and China in order to get our target groups
top_three = survey_student[survey_student['Q3'].isin(['USA', 'India', 'China'])]

top_three.Q3.unique() # check if filtering worked out

In [None]:
# number of responses
a = len(top_three) # sample size
a_percent = a / len(Survey) * 100
a_percent = round(a_percent,1)

print('Number of responses within the Top Three: {} - which are {}% of the overall answers.'.format(a, a_percent))

##### [Back to Table](#table)

<a id='section3_1'></a>

##### 3.1 Sociodemographic attributes

In this part of our journey we first look at how the ages are distributed amongst our defined *Top Three* group. As a next step, we look at which education level the students currently attend. As a last step, we look at the gender representation within each country of the *Top Three*.

In [None]:
question('Q1')

In [None]:
n = a # sample size

top_three['Q1'].iplot(kind='hist',bins=11,color='#51ccfc', 
                          title="Kaggle Survey 2020 - Age distribution based off Top Three (n = {})".format(n),
                          xTitle='Age', yTitle='Frequency')

In [None]:
# percentage share of the three most frequent age groups (selected by position, not by label)
round(top_three['Q1'].value_counts().iloc[:3].sum()/top_three['Q1'].value_counts().sum()*100,1) 

#### Insights:
- Most students in the *Top Three* are between 18 and 29 years old: 95.5% in total (2,678 students). The oldest student is more than 70 years old. Congrats on studying again at that age! :-)

#### Next: Visualize the occurrences of education level for
- *Top Three* as a whole picture
- India
- China
- USA

#### Overall Top Three

In [None]:
top3_percent = top_three.Q4.value_counts()/top_three.Q4.value_counts().sum()*100
top3_percent = round(top3_percent, 1)

top3_percent.iplot(kind='bar', color='#51ccfc',
                   title='Kaggle Survey 2020 - Representation of Education by its level for Top Three (n = {})'.format(n),
                   xTitle='Education Level', yTitle='Percentage')

#### Insights:
- Most students in the *Top Three* attend either a bachelor's or a master's 
  program. Together these account for more than 84%. In addition, some students study without 
  earning an academic degree (8%).

- The smallest group are 'students' who have no formal education (< 1%). Cool that they are studying and 
  interested in joining Kaggle!

In [None]:
target_countries = top_three.Q3.unique() # the three countries, in order of first appearance in the data

for country in target_countries:
    # share of each education level within the country, sorted by frequency
    y = top_three[top_three['Q3']==country]['Q4'].value_counts(normalize=True)*100
    y = round(y, 1)
    # take the labels from the data: the order of value_counts differs per country,
    # so a fixed label list would pair values with the wrong education level
    x = y.index

    n_country = top_three[top_three['Q3']==country]['Q4'].value_counts().sum() # sample size

    # plot the values
    plt.figure(figsize=(12,8))
    plt.barh(x, y)
    plt.title('{} and its representation by education level \n (n = {})'.format(country,n_country),fontsize=18)
    plt.xlabel('% Share')
    plt.ylabel('Education level')

    # annotate each bar with its percentage
    for index, value in enumerate(y):
        plt.text(value, index, str(value) + '%')

    plt.show();

#### What insights can we extract out of the previous figures?

- As everybody can see, bachelor's and master's students are the majority in the *Top Three*. Indian students are the biggest represented group on Kaggle (with > 1,400 interviewed students).
- For China and the US, master's students are the largest group in the Kaggle community (China: 47% / USA: 44%).
- Students pursuing a doctoral degree are represented most by the US (> 10%).
- 'Students' who have no formal education are the least represented group in the Kaggle 2020 survey.
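
The per-country shares behind these insights can also be put side by side in a single table. The following is a minimal sketch using `pd.crosstab` on made-up data; the frame `toy` and its shortened education labels are hypothetical and only mimic the `Q3` (country) and `Q4` (education level) columns.

```python
import pandas as pd

# Hypothetical students: country (Q3) and education level (Q4, shortened labels)
toy = pd.DataFrame({
    'Q3': ['India', 'India', 'India', 'India', 'USA', 'USA', 'China', 'China'],
    'Q4': ['Bachelor', 'Bachelor', 'Bachelor', 'Master',
           'Master', 'Doctoral', 'Master', 'Bachelor'],
})

# normalize='index' turns each row (country) into shares that sum to 1
education_share = (pd.crosstab(toy['Q3'], toy['Q4'], normalize='index') * 100).round(1)
print(education_share)
```

Each row of the resulting table corresponds to one of the bar charts above, which makes it easier to compare the countries directly.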

<a id='section3_2'></a>

#### 3.2 Importance of Gender - What does the current picture look like?
- Overall representation amongst the *Top Three*
- Individual representation of India, USA and China

In [None]:
question('Q2')

In [None]:
gender_percent = top_three['Q2'].value_counts()/top_three['Q2'].value_counts().sum()*100
gender_percent = round(gender_percent, 1)

gender_percent.iplot(kind='bar', color='#51ccfc',
                     title='Kaggle Survey 2020 - Gender representation of the Top Three',
                     xTitle='Gender', yTitle='Percentage')


#### Insights

Men still make up the majority within the *Top Three*. (Do you think there is a country in which that is not the case?) Women account for only about one-quarter. 1.2% of respondents within the *Top Three* prefer not to say, whereas nonbinary and prefer to self-describe each account for 0.3%.  

#### What does the representation look like for the individual members of the *Top Three*?

In [None]:
num_men = []
num_women = []
ratio_countries = []

for country in target_countries:
    selected_data = top_three[top_three['Q3']==country]
    n_men = (selected_data['Q2'] == 'Man').sum()
    n_women = (selected_data['Q2'] == 'Woman').sum()
    total = n_men + n_women # the other gender choices are left out for simplicity

    # percentage of men and women within the country
    num_men.append(round(n_men / total * 100, 1))
    num_women.append(round(n_women / total * 100, 1))

    # female ratio = # of women / (# of women + # of men) * 100
    ratio_countries.append(n_women / total * 100)

#print(num_men, num_women) 

In [None]:
x = np.arange(len(target_countries))  # the label locations
width = 0.35  # the width of the bars
#print(x, n_men, n_women)
fig, ax = plt.subplots(figsize=(12,8))
rects1 = ax.bar(x - width/2, num_men, width, label='Men')
rects2 = ax.bar(x + width/2, num_women, width, label='Women')

# Add some text for labels, title and custom x-axis tick labels, etc.
ax.set_ylabel('Percentage')
ax.set_title('Gender Representation')
ax.set_xticks(x)
ax.set_xticklabels(target_countries)
ax.legend()


def autolabel(rects):
    """Attach a text label above each bar in *rects*, displaying its height."""
    for rect in rects:
        height = rect.get_height()
        ax.annotate('{}'.format(height),
                    xy=(rect.get_x() + rect.get_width() / 2, height),
                    xytext=(0, 3),  # 3 points vertical offset
                    textcoords="offset points",
                    ha='center', va='bottom')


autolabel(rects1)
autolabel(rects2)

fig.tight_layout();
plt.show();

#### Insights:
Interesting! We can directly observe that the USA has the highest percentage of female students with respect to the sum of men and women within the respective country. India follows with roughly 25%, and China's female students are represented by nearly 20%.

Let us calculate their ratios in order to figure out the highest one amongst the *Top Three*.
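
The ratio used here (women divided by the sum of women and men, times 100) can be reproduced on a small toy table. The following is only a minimal sketch: the DataFrame `toy` and the helper `female_ratio` are hypothetical and not part of this notebook, and the numbers are made up. Only the column names (`Q2` for gender, `Q3` for country) mirror the 2020 survey.

```python
import pandas as pd

# Hypothetical toy data; only the column names mirror the 2020 survey
toy = pd.DataFrame({
    'Q3': ['India', 'India', 'India', 'India', 'USA', 'USA', 'USA', 'China', 'China'],
    'Q2': ['Man', 'Man', 'Man', 'Woman', 'Man', 'Woman', 'Prefer not to say', 'Man', 'Woman'],
})

def female_ratio(df):
    """Return women / (women + men) * 100 per country, rounded to one decimal."""
    binary = df[df['Q2'].isin(['Man', 'Woman'])]   # keep only the two compared groups
    is_woman = binary['Q2'].eq('Woman')            # True for women, False for men
    # the mean of a boolean Series is the share of True values
    return is_woman.groupby(binary['Q3']).mean().mul(100).round(1)

toy_ratios = female_ratio(toy).sort_values(ascending=False)
print(toy_ratios)
```

Respondents who chose neither 'Man' nor 'Woman' are excluded from the denominator, which matches the definition of the ratio in this section.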

In [None]:
# Using the values from the cell above, we take the respective national female ratio and visualize it in order to see
# which country has the highest representation of women.
# note: female ratio = # of women / (# of women + # of men) * 100

ratios = pd.DataFrame({'Ratio':ratio_countries})
ratios = ratios.rename({0:'India', 1:'USA', 2:'China'})
ratios = ratios.round(1)
ratios = ratios.sort_values(by=['Ratio'], ascending=False)

# visualize to have a comparison between the ratios:
ratios.iplot(kind='bar', color='#51ccfc', legend=False,
             title='Ratios within the Top Three',
             yTitle='% Share')
#plt.title('Ratios within the Top Three', fontsize=14);
#plt.ylabel('% Share')
#plt.xlabel('Top Three')
#plt.xticks(rotation=0)
#plt.show()

#### Insights:
The clear winner is the US, where women make up more than 30% of the students. To be honest, I rather expected India (25.1%) to win, since so many students in the Kaggle community are Indian. Either way, female students from the US look like a great subgroup to find out more about. :) 

#### Next steps

For the upcoming analysis, we take a step back and visualize the time series of female students from the USA who took part in the Kaggle surveys since 2017. 

<a id='section4'></a>


### 4. Target Group: US Female Students

- 4.1 Sociodemographic attributes
- 4.2 US Girls and their Machine Learning experience / Technology know-how
- 4.3 Social Media and Learning online
- 4.4 What do US female students plan for the next two years?

##### [Back to Table](#table)



In [None]:
# load data from 2017, 2018 and 2019

survey_2017 = pd.read_csv('../input/2017-kaggle-survey/multipleChoiceResponses.csv', low_memory=False, encoding='latin-1')
survey_2018 = pd.read_csv('../input/kaggle-survey-2018/multipleChoiceResponses.csv', low_memory=False, encoding='latin-1')
survey_2019 = pd.read_csv('../input/kagglesurvey2019/mcr_2019.csv', low_memory=False, encoding='latin-1')

In [None]:
survey_2017.head()

In [None]:
survey_2018.head()

In [None]:
survey_2019.head()

Our only purpose here is to visualize the time series of responses of female students from the US, from 2017 up to the current survey. For that, we proceed as follows:

1. Add a column 'year' to each dataset (the 2017 survey, for example, has no column that indicates the time)
2. Filter each dataset for the target group
3. Merge them together
4. Count the responses within each year
5. Visualize the time series
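
The first four steps can be sketched on two miniature surveys. This is a minimal sketch under stated assumptions: the DataFrames `toy_2019` and `toy_2020` and the helper `us_female_students` are hypothetical, and the rows are invented. Only the gender labels ('Female' vs. 'Woman') follow the filters used in this notebook.

```python
import pandas as pd

# Hypothetical miniature surveys; the rows are invented for illustration
toy_2019 = pd.DataFrame({'Q2': ['Female', 'Male', 'Female'],
                         'Q3': ['United States of America'] * 3,
                         'Q5': ['Student', 'Student', 'Student']})
toy_2020 = pd.DataFrame({'Q2': ['Woman', 'Woman', 'Man', 'Woman', 'Woman'],
                         'Q3': ['United States of America'] * 5,
                         'Q5': ['Student', 'Data Scientist', 'Student', 'Student', 'Student']})

# 1. add the survey year
toy_2019['year'] = 2019
toy_2020['year'] = 2020

# 2. filter each data set for the target group (the gender label differs between years)
def us_female_students(df, female_label):
    mask = ((df['Q2'] == female_label)
            & (df['Q3'] == 'United States of America')
            & (df['Q5'] == 'Student'))
    return df.loc[mask, ['year']]

# 3. stack the filtered data sets row wise
stacked = pd.concat([us_female_students(toy_2019, 'Female'),
                     us_female_students(toy_2020, 'Woman')], ignore_index=True)

# 4. count the respondents per year (step 5 would plot this Series)
per_year = stacked.groupby('year').size()
print(per_year)
```

Counting rows with `size()` does not depend on the content of any particular column, so the result is not affected by missing answers in the columns that are carried along.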

In [None]:
# add column year

survey_2017['year'] = 2017
survey_2018['year'] = 2018
survey_2019['year'] = 2019

# use the predefined table from above; .copy() avoids a SettingWithCopyWarning when the column is added
usa = top_three[top_three['Q3']=='United States of America'].copy()
usa['year'] = 2020

survey_2017.head()

In [None]:
# filter after target group: female students in the U.S.

### survey 2017
us_FemaleStudents_2017 = survey_2017[(survey_2017['GenderSelect']=='Female') & (survey_2017['Country']=='United States') & (survey_2017['StudentStatus']=='Yes')]

### survey 2018
us_FemaleStudents_2018 = survey_2018[(survey_2018['Q1']=='Female') & (survey_2018['Q3']=='United States of America') & (survey_2018['Q6']=='Student')]

### survey 2019
us_FemaleStudents_2019 = survey_2019[(survey_2019['Q2']=='Female') & (survey_2019['Q3']=='United States of America') & (survey_2019['Q5']=='Student')]

#####################################
# Before we append the four datasets vertically (i.e., row wise), the respective columns need identical names.
# For that, we give the column that describes the current status the same name in every dataset.

# 2017
# Please note that a comparison is difficult since the set up of the survey is not consistent for all years.
# For that, we set the role by hand as a trick to obtain a reasonable comparison.
us_FemaleStudents_2017 = us_FemaleStudents_2017.assign(Current_Role='Student')
us_FemaleStudents_2017 = us_FemaleStudents_2017[['year','Current_Role']]

# 2018
us_FemaleStudents_2018 = us_FemaleStudents_2018.rename({'Q4':'Current_Role'}, axis=1)
us_FemaleStudents_2018 = us_FemaleStudents_2018[['Current_Role','year']]

# 2019
us_FemaleStudents_2019 = us_FemaleStudents_2019.rename({'Q4':'Current_Role'}, axis=1)
us_FemaleStudents_2019 = us_FemaleStudents_2019.loc[:,('Current_Role','year')]

# 2020+
US_FemaleStudents_2020 = usa[usa['Q2']=='Woman']
US_FemaleStudents_2020 = US_FemaleStudents_2020.loc[:,('Q4','year')]
US_FemaleStudents_2020 = US_FemaleStudents_2020.rename({'Q4':'Current_Role'}, axis=1)

# let us append the data now
US_FemaleStudents = pd.concat([us_FemaleStudents_2017, us_FemaleStudents_2018, us_FemaleStudents_2019, US_FemaleStudents_2020], ignore_index=True, sort=True)
US_FemaleStudents.head()

In [None]:
plt.figure(figsize=(12,8))

US_FemaleStudents.groupby('year')['Current_Role'].count().plot(kind='line',color='#51ccfc')
plt.title('Development of Female Students from the U.S. joining Kaggle Surveys', fontsize=16)
plt.xticks([2017,2018,2019,2020])
plt.xlabel('Year', fontsize=14)
plt.annotate('34',xy=(2017.1,34),fontsize=14) # annotation 2017
plt.annotate('255',xy=(2018,255),fontsize=14) # annotation 2018
plt.annotate('126',xy=(2019,126),fontsize=14) # annotation 2019
plt.annotate('102',xy=(2020,102),fontsize=14) # annotation 2020
plt.ylabel('Number of Students',fontsize=14)
plt.show();

#### Insights:
As the graph shows, the number of female students from the US developed in an interesting way. The peak was reached in 2018, whereas only 34 female students participated in 2017, the first year in which the survey was conducted. In the current survey (2020), a total of 102 female US students took part. 

102 answers are still a reasonable sample size to explore the data and find out more about the female students from the USA. As a next step, we take a look at the answers to the remaining questions of the latest survey data.

In [None]:
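# note: this filter uses the label 'USA', whereas the time series cell above filtered on 'United States of America';
# only the label that is actually present in top_three['Q3'] returns rows
usa = top_three[top_three['Q3']=='USA']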
FemaleStudents_USA_2020 = usa[usa['Q2']=='Woman']
FemaleStudents_USA_2020.head()

Before we continue our journey, we might go a step back and explore once again the age distribution, the occurrences of the given education levels within that group and how long they took to complete the survey :)

In [None]:
a = FemaleStudents_USA_2020['Time from Start to Finish (seconds)'].mean()/60/60
a = round(a,1)
a
print('Female students from the USA took on average {} hours to complete the survey.'.format(a))

Interesting: female students from the US are faster than the (statistical) average response time of 2.5 hours. :D

***Remark!***

Let us look again at the distribution of the time spent on the survey. The distribution graph (left) covers all responses, for which the mean is 1.4 hours (5173 seconds). This mean is dominated by a small number of very long completion times. Zooming into the range between 0 and 4000 seconds (right) gives a clearer picture: in reality, the majority of students completed the survey in less than 2000 seconds, or even within half an hour.
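
Why a few very long sessions dominate the mean can be illustrated with invented numbers. This is a minimal sketch using only the Python standard library. The list `durations` is hypothetical and does not come from the survey. In the notebook itself, the comparable check would be calling `.median()` on the time column.

```python
from statistics import mean, median

# Hypothetical completion times in seconds: most respondents are quick, two are extreme
durations = [600, 750, 900, 1100, 1300, 1500, 1800, 40000, 90000]

print('mean  :', round(mean(durations)))
print('median:', median(durations))

# share of respondents below the 4000-second cut-off used for the zoomed plot
below_cutoff = [d for d in durations if d < 4000]
share_below = len(below_cutoff) / len(durations) * 100
print('share below 4000 s: {:.1f}%'.format(share_below))
```

The median is not affected by the two extreme values, which is why it describes the typical respondent better than the mean in a skewed distribution like this one.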

In [None]:
mean = FemaleStudents_USA_2020['Time from Start to Finish (seconds)'].mean()
print('Across the whole data, the mean time to complete the survey is {} seconds'.format(round(mean)))
time_spent = FemaleStudents_USA_2020['Time from Start to Finish (seconds)']

restricted_time = FemaleStudents_USA_2020[FemaleStudents_USA_2020['Time from Start to Finish (seconds)']<4000]
time_spent_range = restricted_time['Time from Start to Finish (seconds)']

f, axes = plt.subplots(1,2,figsize=(15,5))
sns.distplot(time_spent,ax=axes[0])
plt.title('Most of the time spent in reality')
sns.distplot(time_spent_range,kde=True, ax=axes[1]);

<a id='section4_1'></a>

#### 4.1 Sociodemographic attributes

In [None]:
FemaleStudents_USA_2020['Q1'].iplot(kind='hist',color='#51ccfc', sortbars=True, 
                                    title="Kaggle Survey 2020 - Age distribution of female students living in the USA",
                                    xTitle='Age', yTitle='Frequency')

#### Insights:

The majority of these students are between 25 and 29 years old (41 students). The youngest age group, 18-21, comprises 21 students, followed by 19 students in the age group 22-24. Apparently, there is also one female student close to 50.

In [None]:
y = FemaleStudents_USA_2020['Q4'].value_counts()/FemaleStudents_USA_2020['Q4'].value_counts().sum()*100
y = round(y,1)
# take the labels from the index of y so that every bar is guaranteed to carry its own label;
# long labels are wrapped to keep the axis readable
x = y.index.str.wrap(30)

n_y = len(FemaleStudents_USA_2020) # sample size of female US students


# plot the values
plt.figure(figsize=(12,8))
plt.barh(x, y, color='#51ccfc')
plt.title('Shares of education level \n (n = {})'.format(n_y), fontsize=18)
plt.xlabel('% Share',  fontsize=14)
plt.xticks(fontsize=14)
plt.ylabel('Education level', fontsize=14)
plt.yticks(fontsize=14)

for index, value in enumerate(y):
    plt.text(value, index, str(value) + '%', fontsize=14)
    
plt.tight_layout()
plt.show();

#### Insights:
That is interesting. Half of this subgroup in the most recent survey is pursuing a Master's degree, followed by Bachelor's students. More than 16% of the female students are pursuing a doctoral degree. 
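
Shares like the ones above can be computed in one step with `value_counts(normalize=True)`. This is a minimal sketch on invented answers. The Series `education` is hypothetical, and its labels are examples rather than survey results.

```python
import pandas as pd

# Hypothetical answers; the labels are examples, not survey results
education = pd.Series(["Master's degree", "Bachelor's degree", "Master's degree",
                       "Doctoral degree", "Master's degree", None])

# normalize=True returns fractions of the non-missing answers
shares = education.value_counts(normalize=True).mul(100).round(1)

# labels and values stay aligned because both come from the same Series
labels = shares.index.tolist()
values = shares.tolist()
print(shares)
```

Taking the bar labels from `shares.index` rather than from a hand-written list keeps every label attached to its own value, even if the ranking changes with new data.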

Note: For the upcoming questions we use the question blocks sliced (or partitioned) above. Each section banner indicates the partition in use. 

        ## ------------------ Section II - Questions about software and programming skills ------------------ ##

<a id='section4_2'></a>

#### 4.2 US Girls and their Machine Learning experience / Technology know-how

In [None]:
print(question('Q6'))

In [None]:
y = FemaleStudents_USA_2020['Q6'].value_counts()/FemaleStudents_USA_2020['Q6'].value_counts().sum()*100 
y = round(y,1)

n = len(FemaleStudents_USA_2020['Q6']) # sample size of female US students

y.iplot(kind='bar', orientation='v', color='#51ccfc', sortbars=True,
       title='Kaggle Survey 2020 - Programming experience of female US students (n = {})'.format(n),
       yTitle='% share')

In [None]:
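# note: the positions 0, 1, 3 and 5 refer to the rows of value_counts(), which is sorted by frequency;
# they have to be re-checked whenever the data changes
b = FemaleStudents_USA_2020.Q6.value_counts().iloc[[0,1,3,5]].sum()/FemaleStudents_USA_2020.Q6.value_counts().sum()*100
b = round(b,1)

print('{}% of the female students in the USA who joined the 2020 survey on Kaggle have at least 1 year \n of programming experience.'.format(b))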

# But, what languages do they use on a regular basis?

In [None]:
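# Create a function to plot the answers to a multiple-choice question


n_usa = len(FemaleStudents_USA_2020)

def viz_demo_plot(Q, plot_title, y_label, n_usa):
    """Plot the share of every answer option of a multiple-choice question as horizontal bars."""
    name = []
    counts = []

    # every column holds exactly one answer option (NaN if the option was not selected)
    for i in Q.columns:

        name.append(Q[i].dropna().unique())
        counts.append(Q[i].count())

    table = pd.DataFrame(name).set_index(0)
    table['Counts'] = counts
    # the shares are relative to the total number of selections, not to the number of respondents
    values_q = table['Counts']/table['Counts'].sum()*100
    values_q = values_q.sort_values()
    values_q = round(values_q,1)
    
    # drop the smallest bar if nobody selected that option
    if values_q.iloc[0] == 0:
        values_q = values_q.iloc[1:]


    plt.figure(figsize=(12,8))

    values_q.plot(kind='barh', color='#51ccfc')

    plt.title(plot_title.format(n_usa), fontsize=16)
    plt.xlabel('% Share')
    plt.ylabel(y_label)

    # write the value next to each bar
    for index, value in enumerate(values_q):
        plt.text(value, index, str(value)+'%', fontsize=14)

    plt.show();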


In [None]:
# Question 7: What programming languages do you use on a regular basis?

Q7 = FemaleStudents_USA_2020.loc[:,('Q7_Part_1','Q7_Part_2','Q7_Part_3','Q7_Part_4','Q7_Part_5','Q7_Part_6',
                                    'Q7_Part_7','Q7_Part_8','Q7_Part_9','Q7_Part_10','Q7_Part_11','Q7_Part_12',
                                    'Q7_OTHER')]

In [None]:
plot_title = 'Shares of programming language usage \n (n = {})'
y_label = 'Programming Language'
viz_demo_plot(Q7,plot_title,y_label,n_usa)

#### Insights:
Most female students in the USA who participated in the Kaggle 2020 survey use Python (32%) and R (17%) as a common programming language on a regular basis. In contrast, Julia (0.4%) and Swift (1.3%) are used least.
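
Please note that the percentages in these multiple-choice plots are shares of all selections, not shares of students, because each count is divided by the sum of all counts. The difference can be seen on a toy block of columns. This is a minimal sketch: the DataFrame `toy_q7` is hypothetical and its rows are invented.

```python
import pandas as pd

# Hypothetical multi-select block: one column per answer option, None = not selected
toy_q7 = pd.DataFrame({
    'Q7_Part_1': ['Python', 'Python', 'Python', None],
    'Q7_Part_2': ['R', None, 'R', None],
    'Q7_Part_3': [None, None, 'SQL', 'SQL'],
})

counts = toy_q7.count()   # number of selections per option
# use the answer label stored in each column as the index
counts.index = [toy_q7[col].dropna().iloc[0] for col in toy_q7.columns]

share_of_selections = (counts / counts.sum() * 100).round(1)    # normalisation used in the plots
share_of_respondents = (counts / len(toy_q7) * 100).round(1)    # how many students ticked the option
print(pd.DataFrame({'% of selections': share_of_selections,
                    '% of respondents': share_of_respondents}))
```

In this toy example, Python accounts for about 43% of all selections, although three out of four respondents use it. Which of the two figures is more useful depends on the question being asked.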

In [None]:
# Count the selected programming languages row wise in order to see how many languages a (statistical) 
# female student from the US is capable of. 
# Every Q7 column contains either its own answer label or NaN, so counting the non-missing cells per row
# gives the number of selected options (note: 'None' and 'Other' are counted as well).

Q7 = Q7.assign(numb_lang=Q7.notna().sum(axis=1))

Q7.head()

In [None]:
numb_lang = Q7['numb_lang'].mean(skipna=True) # 2.254901 ~ 2
numb_lang = int(numb_lang)

print('On average, a female student from the USA is able to handle at least {} programming languages'.format(numb_lang))

In [None]:
question('Q8')

In [None]:
q8 = FemaleStudents_USA_2020['Q8'].value_counts()/FemaleStudents_USA_2020['Q8'].value_counts().sum()*100
q8 = round(q8,1)
q8.iplot(kind='bar', color='#51ccfc',
         title='Programming language recommended to an aspiring Data Scientist by female students from the USA',
         xTitle='Programming language', yTitle='% Share')


#### Insights:
A clear winner is Python! However, as we saw, most female students in the US already use Python. Could there be a correlation between the language a student uses on a regular basis and the language she recommends? In other words: do students simply recommend the language they know, or would they also recommend a language they do not use regularly to an aspiring Data Scientist?
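
One way to approach this question is a cross-tabulation of usage against recommendation. This is a minimal sketch on invented data. The DataFrame `toy` and its columns `uses_python` and `recommended` are hypothetical. In the survey data they would have to be derived from the Q7 and Q8 columns first.

```python
import pandas as pd

# Hypothetical respondents: does she use Python regularly, and what does she recommend?
toy = pd.DataFrame({
    'uses_python': ['yes', 'yes', 'yes', 'no', 'no', 'yes'],
    'recommended': ['Python', 'Python', 'R', 'R', 'Python', 'Python'],
})

# row-wise percentages: how the recommendations split within users and non-users
table = pd.crosstab(toy['uses_python'], toy['recommended'], normalize='index').mul(100).round(1)
print(table)
```

If the rows for users and non-users look very similar, the recommendation hardly depends on the language used. Large differences between the rows would hint at the suspected relationship, although a sample of this size cannot prove it.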

In [None]:
### Question 9: Which of the following integrated development environments (IDE's) do you use on a regular basis?

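# select the columns from the target group, consistent with the other questions and the n shown in the plot title
Q9 = FemaleStudents_USA_2020.loc[:,('Q9_Part_1','Q9_Part_2','Q9_Part_3','Q9_Part_4','Q9_Part_5','Q9_Part_6','Q9_Part_7',
                                    'Q9_Part_8','Q9_Part_9','Q9_Part_10','Q9_Part_11','Q9_OTHER')]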

In [None]:
plot_title = 'Use of IDEs on a regular basis \n (n = {})'
y_label = 'IDE'
viz_demo_plot(Q9,plot_title,y_label,n_usa)

#### Insights:
The plot above shows that Jupyter Notebook and RStudio are the integrated development environments used most by female students from the USA. Vim / Emacs are used least among these students. 

In [None]:
### Question 10: Which of the following hosted notebook products do you use on a regular basis?

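# select the columns from the target group, consistent with the other questions and the n shown in the plot title
Q10 = FemaleStudents_USA_2020.loc[:,('Q10_Part_1','Q10_Part_2','Q10_Part_3','Q10_Part_4','Q10_Part_5','Q10_Part_6',
                                     'Q10_Part_7','Q10_Part_8','Q10_Part_9','Q10_Part_10','Q10_Part_11',
                                     'Q10_Part_12','Q10_Part_13','Q10_OTHER')]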

In [None]:
plot_title = 'Use of hosted notebook products on a regular basis \n (n = {})'
y_label = 'Hosted Notebook Product'
viz_demo_plot(Q10,plot_title,y_label,n_usa)

#### Insights:
Colab, Kaggle and Binder/JupyterHub notebooks are the hosted notebook products used most by female students from the USA (leaving aside 'None', which is the most frequent answer with more than 25%). Paperspace/Gradient and Google Cloud AI Platform Notebooks are used least.

    ## ----- Section III - Questions about hardware, ML technologies and libraries used by data scientists ----- ##

In [None]:
question('Q11')

In [None]:
plt.figure(figsize=(10,8))

q11_perc = round(FemaleStudents_USA_2020['Q11'].value_counts()/FemaleStudents_USA_2020['Q11'].value_counts().sum()*100,1)
q11_perc.plot(kind='barh', color='#51ccfc',fontsize=12)

plt.title('Most often used computing platforms aimed for data science projects \n (n = {})'.format(n_y),fontsize=14)
plt.xlabel('% Share')
plt.ylabel('Computing Platform')

for index, value in enumerate(q11_perc):
    plt.text(value, index, str(value)+'%', fontsize=14)

plt.show();

#### Insights:
Based on the given results, approx. 8% of female students from the USA use a cloud computing platform for data science projects, whereas around 87% use a personal computer or laptop for such projects. 

In [None]:
# Question 12: Which types of specialized hardware do you use on a regular basis? 

q12 = FemaleStudents_USA_2020.loc[:,('Q12_Part_1','Q12_Part_2','Q12_Part_3','Q12_OTHER')]

In [None]:
plot_title = 'Types of Specialized Hardware for use on a regular basis \n (n = {})'
y_label = 'Specialized Hardware'
viz_demo_plot(q12,plot_title,y_label,n_usa)

#### Insights:
Female students from the USA use GPUs (32.9%) far more often than TPUs (1.2%).

In [None]:
print(question('Q13'))

In [None]:
share_q13 = FemaleStudents_USA_2020.Q13.value_counts()/FemaleStudents_USA_2020.Q13.value_counts().sum()*100
share_q13 = round(share_q13,1)

share_q13.iplot(kind='bar', color='#51ccfc',
                title='Shares of how many times a TPU has been used (n = {})'.format(n_y),
                yTitle='% Share', xTitle='Frequency')

#### Insights:
Based on question 12, one could already assume that female students in the USA are unlikely to have worked with a TPU six times or more (about 2% only). 8% of them have worked with a tensor processing unit a few times, and the majority, more than 88%, have never worked with that particular hardware.

In [None]:
# Question 14: What data visualization libraries or tools do you use on a regular basis?

q14 = FemaleStudents_USA_2020.loc[:,('Q14_Part_1','Q14_Part_2','Q14_Part_3','Q14_Part_4','Q14_Part_5','Q14_Part_6',
                                     'Q14_Part_7','Q14_Part_8','Q14_Part_9','Q14_Part_10','Q14_Part_11','Q14_OTHER')]

In [None]:
plot_title = 'Data visualization libraries or tools in use on a regular basis \n (n = {})'
y_label = 'Data Viz library / tool'
viz_demo_plot(q14,plot_title,y_label,n_usa)

#### Insights:
Matplotlib (30.3%), ggplot2 / Ggplot (19.9%) and Seaborn (16.9%) are the visualization libraries or tools used most frequently on a regular basis.

In [None]:
question('Q15')

In [None]:
Q15 = pd.DataFrame(FemaleStudents_USA_2020['Q15'])
Q15 = Q15.dropna()
Q15 = Q15.rename(columns={'Q15':'ExpLevel'})
Q15.head()

In [None]:
# One-hot encode the experience level (one 0/1 column per answer) in order to get the individual summations

exp_labels = {'< 1 year': 'Under 1 year',
              '1-2 years': '1-2 years',
              '2-3 years': '2-3 years',
              '3-4 years': '3-4 years',
              '4-5 years': '4-5 years',
              '5-10 years': '5-10 years',
              'I do not use ML': 'I do not use machine learning methods'}

# every new column is 1 where the answer matches the label and 0 otherwise
Q15 = pd.DataFrame({short: (Q15['ExpLevel'] == full).astype(int) for short, full in exp_labels.items()})

Q15.head()

In [None]:
Q15_sum = Q15.sum(axis=0)
Q15_sum = Q15_sum/Q15_sum.sum()*100
Q15_sum = round(Q15_sum, 1)
Q15_sum = Q15_sum.sort_values()
Q15_sum.iplot(kind='bar', color='#51ccfc',
              title='Shares of how many years ML methods have been used by US female students (n = {})'.format(n_y),
              yTitle='% Share', xTitle='Experience')

#### Insights:
39% of the female students from the US have been using machine learning methods for less than one year, and 25.6% have been using them for 1-2 years. Roughly one quarter (23.2%) of these students do not use ML methods at all. 
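
Experience levels have a natural order, which gets lost when the bars are sorted by share. The following is a minimal sketch of how the shares could be brought into that order with `reindex`. The Series `answers` is hypothetical and does not contain survey results, and the list `levels` only covers the answer labels used in the cell above.

```python
import pandas as pd

# Answer options in their natural order (only the labels used in the cell above)
levels = ['I do not use machine learning methods', 'Under 1 year', '1-2 years',
          '2-3 years', '3-4 years', '4-5 years', '5-10 years']

# Hypothetical answers, not survey results
answers = pd.Series(['Under 1 year', '1-2 years', 'Under 1 year',
                     'I do not use machine learning methods', '2-3 years'])

shares = (answers.value_counts(normalize=True)
                 .reindex(levels, fill_value=0)   # natural order; levels nobody chose become 0
                 .mul(100)
                 .round(1))
print(shares)
```

Plotting `shares` in this order makes it easier to see how the group is distributed along the experience scale.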

In [None]:
# Question 16:  Which of the following machine learning frameworks do you use on a regular basis?

Q16 = FemaleStudents_USA_2020.loc[:,('Q16_Part_1','Q16_Part_2','Q16_Part_3','Q16_Part_4','Q16_Part_5',
                                     'Q16_Part_6','Q16_Part_7','Q16_Part_8','Q16_Part_9','Q16_Part_10',
                                     'Q16_Part_11','Q16_Part_12','Q16_Part_13','Q16_Part_14','Q16_Part_15',
                                     'Q16_OTHER')]

In [None]:
plot_title = 'Use of ML frameworks on a regular basis \n (n = {})'
y_label = 'ML framework'
viz_demo_plot(Q16,plot_title,y_label,n_usa)

#### Insights:
Among female students in the USA, the machine learning frameworks used most frequently on a regular basis are:

1. Scikit-learn (27.6%)
2. Tensorflow (19%)
3. Keras (17.8%)

The ML frameworks that are least popular amongst female students in the USA are:
- Prophet, CatBoost and MXNet (each 0.6%)

In [None]:
# Question 17: Which of the following ML algorithms do you use on a regular basis?

# note: 'Q17_Part_7' is not included in the selection below
Q17 = FemaleStudents_USA_2020.loc[:,('Q17_Part_1','Q17_Part_2','Q17_Part_3','Q17_Part_4','Q17_Part_5','Q17_Part_6',
                                     'Q17_Part_8','Q17_Part_9','Q17_Part_10','Q17_Part_11','Q17_OTHER')]

In [None]:
plot_title = 'Most used ML Algos \n (n = {})'
y_label = 'ML Algorithm'
viz_demo_plot(Q17,plot_title,y_label,n_usa)

#### Insights:
Most female students use linear / logistic regression (29.5%) when working with machine learning algorithms. Decision trees or random forests (21.2%) are another common choice.

The least common ML algorithm is the Generative Adversarial Network (2.6%).

In [None]:
# Question 18:  Which categories of computer vision methods do you use on a regular basis? 
# Note: Question 18 (which specific ML methods) was only asked to respondents that selected the relevant
# answer choices for Question 17 (which categories of algorithms).

Q18 = FemaleStudents_USA_2020.loc[:,('Q18_Part_1','Q18_Part_2','Q18_Part_3','Q18_Part_4',
                                     'Q18_Part_5','Q18_Part_6','Q18_OTHER')]

In [None]:
name_q18 = []
counts_q18 = []

for i in Q18.columns:
    
    name_q18.append(Q18[i].dropna().unique())
    counts_q18.append(Q18[i].count())  

In [None]:
import textwrap

In [None]:
s = 'Image classification and other general purpose networks (VGG, Inception, ResNet, ResNeXt, NASNet, EfficientNet, etc)'
length_s = len('Image classification and other general purpose networks')
s = textwrap.fill(s, length_s)
name_q18[3] = np.array([s], dtype=object)

In [None]:
table = pd.DataFrame(name_q18).set_index(0)
table['Counts'] = counts_q18
table['share'] = table['Counts']/table['Counts'].sum()*100
table['share'] = round(table['share'],1)
values = table['share'].sort_values()
values = values[1:]

plt.figure(figsize=(12,8))

values.plot(kind='barh',color='#51ccfc', fontsize=14) 

plt.title('Use of Computer Vision method categories on a regular basis \n (n = {})'.format(n_y),fontsize=14)
plt.xlabel('% Share', color='black', fontsize=14)
plt.ylabel('CV Category', color='black', fontsize=14)

for index, value in enumerate(values):
    plt.text(value, index, str(value)+'%', fontsize=14)

plt.show();

#### Insights:
One fifth of the female students in the US do not use any computer vision methods. 
Among the methods in use, image classification and other general purpose networks (22.5%)
and image segmentation methods (20%) rank highest. 

General purpose image/video tools are used least, by 5% of the female students.
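
The long axis label in the plot above was wrapped by hand and written back by position. A small helper can do the same for any label. This is a minimal sketch: the function `wrap_label` and its default width of 40 characters are assumptions for illustration and not part of this notebook.

```python
import textwrap

def wrap_label(label, width=40):
    """Break a long axis label into lines of at most `width` characters."""
    return textwrap.fill(label, width=width)

long_label = ('Image classification and other general purpose networks '
              '(VGG, Inception, ResNet, ResNeXt, NASNet, EfficientNet, etc)')
wrapped = wrap_label(long_label)
print(wrapped)
```

Applying such a helper to every label of a plot avoids having to know the position of the long label in advance.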

In [None]:
# Question 19: Which of the following natural language processing (NLP) methods do you use on a regular basis?
# Note: Question 19 (which specific ML methods) was only asked to respondents that selected the relevant
# answer choices for Question 17 (which categories of algorithms).

Q19 = FemaleStudents_USA_2020.loc[:,('Q19_Part_1','Q19_Part_2','Q19_Part_3','Q19_Part_4','Q19_Part_5','Q19_OTHER')]

In [None]:
plot_title = 'Most used NLP methods on a regular basis \n (n = {})'
y_label = 'NLP Method'
viz_demo_plot(Q19,plot_title,y_label,n_usa)

#### Insights:
According to the plot, female students from the USA mostly use either Transformer language models or word embeddings/vectors (31% each).

Contextualized embeddings come last in the ranking, used by nearly 7% of female students. 

Please note: We will directly jump to Section V (since the questions from Section IV, i.e. questions 20-36 were not answered by (female) students in general). 

                ### ----------------- Section V: Questions about social media activities ----------------- ###

<a id='section4_3'></a>

#### 4.3 Social Media and Learning online

In [None]:
# Question Q37:
# On which platforms have you begun or completed data science courses? (Select all that apply)

Q37 = FemaleStudents_USA_2020.loc[:,('Q37_Part_1','Q37_Part_2','Q37_Part_3','Q37_Part_4',
                                     'Q37_Part_5','Q37_Part_6','Q37_Part_7','Q37_Part_8','Q37_Part_9',
                                     'Q37_Part_10','Q37_Part_11','Q37_OTHER')]

In [None]:
name_q37 = []
counts_q37 = []

for i in Q37.columns:
    
    name_q37.append(Q37[i].dropna().unique())
    counts_q37.append(Q37[i].count())  

In [None]:
s = 'Cloud-certification programs (direct from AWS, Azure, GCP, or similar)'
length_s = len('Cloud-certification programs ')
s = textwrap.fill(s, length_s)
name_q37[8] = np.array([s], dtype=object)

In [None]:
table = pd.DataFrame(name_q37).set_index(0)
table['Counts'] = counts_q37
table['share'] = table['Counts']/table['Counts'].sum()*100
table['share'] = round(table['share'],1)
values = table['share'].sort_values()
#values = values[1:]

plt.figure(figsize=(12,8))

values.plot(kind='barh',color='#51ccfc', fontsize=14) 

plt.title('Ranking of platforms to learn DS \n (n = {})'.format(n_y), fontsize=16, color='black')
plt.xlabel('% Share', color='black', fontsize=14)
plt.ylabel('Learning Platform', color='black', fontsize=14)

for index, value in enumerate(values):
    plt.text(value, index, str(value)+'%', fontsize=14)

plt.show();

#### Insights
The clear winner is university courses resulting in a university degree (> 21%). Coursera ranks second (18.5%), followed by LinkedIn Learning (11.5%). Our Kaggle community with its courses only comes fourth (11%). 


In contrast, female students from the US are unlikely (2.5%) to start or complete a cloud-certification program (e.g. from AWS). 

In [None]:
question('Q38')

In [None]:
Q38 = FemaleStudents_USA_2020.Q38
Q38 = pd.DataFrame(Q38)
Q38 = Q38.dropna()
Q38 = Q38.rename(columns={'Q38':'PrimTool'})
Q38.head()

In [None]:
FemaleStudents_USA_2020.Q38.unique()

In [None]:
# Question 38: 
# What is the primary tool that you use at work or school to analyze data?

tool_labels = {'Basic StatSoft': 'Basic statistical software (Microsoft Excel, Google Sheets, etc.)',
               'Advanced StatSoft': 'Advanced statistical software (SPSS, SAS, etc.)',
               'BITool': 'Business intelligence software (Salesforce, Tableau, Spotfire, etc.)',
               'LocalEnv': 'Local development environments (RStudio, JupyterLab, etc.)',
               'Cloud': 'Cloud-based data software & APIs (AWS, GCP, Azure, etc.)',
               'other': 'Other'}

# one 0/1 column per answer option: 1 where the answer matches the label and 0 otherwise
Q38 = pd.DataFrame({short: (Q38['PrimTool'] == full).astype(int) for short, full in tool_labels.items()})

Q38.head()

In [None]:
Q38_sum = Q38.sum(axis=0)
Q38_sum = Q38_sum/Q38_sum.sum()*100
Q38_sum = round(Q38_sum, 1)
Q38_sum = Q38_sum.sort_values()
Q38_sum.iplot(kind='bar', color='#51ccfc',
              title='Shares of primary tool to analyze data (n = {})'.format(n_y),
              yTitle='% Share')

#### Insights
As expected from the prior results, the clear favorite is local development environments like RStudio or JupyterLab, which are used most among the female students from the US (58.7%).

Nearly one-fifth (17.5%) of these students use basic statistical software to analyse data.

Cloud-based data software & APIs are not used that frequently (approx. 3%).
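
For a single-choice question like this one, summing one-hot columns and calling `value_counts` lead to the same shares. This is a minimal sketch on invented answers. The Series `tool` is hypothetical and uses shortened labels for readability.

```python
import pandas as pd

# Hypothetical answers to a single-choice question (labels shortened for readability)
tool = pd.Series(['LocalEnv', 'Basic StatSoft', 'LocalEnv', None, 'Cloud', 'LocalEnv'])

# route 1: one-hot columns, summed up and normalised
one_hot = pd.get_dummies(tool.dropna())
shares_one_hot = (one_hot.sum() / one_hot.sum().sum() * 100).round(1)

# route 2: value_counts does the counting and the normalising in one step
shares_direct = tool.value_counts(normalize=True).mul(100).round(1)

print(shares_one_hot.sort_index())
print(shares_direct.sort_index())
```

Both routes ignore missing answers, so the shares refer to the students who actually answered the question.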

In [None]:
# Question 39: Who/what are your favorite media sources that report on data science topics?

Q39 = FemaleStudents_USA_2020.loc[:,('Q39_Part_1','Q39_Part_2','Q39_Part_3','Q39_Part_4','Q39_Part_5',
                                     'Q39_Part_6','Q39_Part_7','Q39_Part_8','Q39_Part_9','Q39_Part_10','Q39_Part_11','Q39_OTHER')]

In [None]:
plot_title = 'Ranking of favorite media sources that report on DS topics \n (n = {})'
y_label = 'Media Source'
viz_demo_plot(Q39,plot_title,y_label,n_usa)

#### Insights
Among the female students in the USA, the ranking of favorite media sources is as follows:
1. YouTube (20%)
2. Kaggle (16.4%)
3. Blogs (15.8%)

...

11. None (2.4%)

        ### Section VI - Questions related to the effort data scientist enthusiasts will do in the next two years ###

<a id='section4_4'></a>

#### 4.4 What do US female students plan for the next two years?

In [None]:
# Question 26 Part B: Which of the following cloud computing platforms do you hope to become 
# more familiar with in the next 2 years?

Q26b = FemaleStudents_USA_2020.loc[:,('Q26_B_Part_1','Q26_B_Part_2','Q26_B_Part_3','Q26_B_Part_4',
                                      'Q26_B_Part_5','Q26_B_Part_6','Q26_B_Part_7','Q26_B_Part_8',
                                      'Q26_B_Part_9','Q26_B_Part_10','Q26_B_Part_11','Q26_B_OTHER')]

In [None]:
plot_title = 'In the next two years, I hope to become better in the following computing platform  \n (n = {})'
y_label = 'Computing Platform'
viz_demo_plot(Q26b,plot_title,y_label,n_usa)

#### Insights

Female students from the USA mainly hope to become more familiar with AWS and GCP (26.2% and 19%, respectively) in the next two years. Microsoft Azure, in third place, was chosen by 17.3% of female students. Alibaba Cloud is the named computing platform that the fewest female students (1.4%) want to get familiar with by 2022. :)

In [None]:
# Question 27 Part B: In the next 2 years, do you hope to become more familiar with any of these specific 
# cloud computing products? 

Q27b = FemaleStudents_USA_2020.loc[:,('Q27_B_Part_1','Q27_B_Part_2','Q27_B_Part_3','Q27_B_Part_4',
                                      'Q27_B_Part_5','Q27_B_Part_6','Q27_B_Part_7','Q27_B_Part_8',
                                      'Q27_B_Part_9','Q27_B_Part_10','Q27_B_Part_11','Q27_B_OTHER')]

In [None]:
plot_title = 'In the next two years, I hope to become better in the following cloud computing products \n (n = {})'
y_label = 'Cloud Computing Product'
viz_demo_plot(Q27b,plot_title,y_label,n_usa)

#### Insights

In the next two years, female students from the USA mostly hope to become familiar with Azure Cloud Services (14.8%), Google Cloud Compute Engine (13.4%) and Amazon EC2 (12.9%). Nearly 5% of students do not want to get familiar with any of these cloud computing products. Still, there is a clear trend that female students want to get more familiar with cloud computing platforms in general. :)

In [None]:
# Question 28 B: In the next 2 years, do you hope to become more familiar with any of these specific 
# machine learning products? 

Q28b = FemaleStudents_USA_2020.loc[:,('Q28_B_Part_1','Q28_B_Part_2','Q28_B_Part_3','Q28_B_Part_4','Q28_B_Part_5',
                                      'Q28_B_Part_6','Q28_B_Part_7','Q28_B_Part_8','Q28_B_Part_9','Q28_B_Part_10',
                                      'Q28_B_OTHER')]

In [None]:
plot_title = 'In the next two years, I hope to become better in the following specific ML products \n (n = {})'
y_label = 'Specific ML Product'
viz_demo_plot(Q28b,plot_title,y_label,n_usa)

In [None]:
# Question 29 B: Which of the following big data products (relational databases, data warehouses, 
# data lakes, or similar) do you hope to become more familiar with in the next 2 years?

Q29b = FemaleStudents_USA_2020.loc[:,('Q29_B_Part_1','Q29_B_Part_2','Q29_B_Part_3','Q29_B_Part_4','Q29_B_Part_5',
                                      'Q29_B_Part_6','Q29_B_Part_7','Q29_B_Part_8','Q29_B_Part_9','Q29_B_Part_10',
                                      'Q29_B_Part_11','Q29_B_Part_12','Q29_B_Part_13','Q29_B_Part_14','Q29_B_Part_15',
                                      'Q29_B_Part_16','Q29_B_Part_17','Q29_B_OTHER')]

In [None]:
plot_title = 'In the next two years, I hope to become better in the following big data products \n (n = {})'
y_label = 'Big data product'
viz_demo_plot(Q29b,plot_title,y_label,n_usa)

In [None]:
name_q29b = []
counts_q29b = []

for col in Q29b.columns:
    name_q29b.append(Q29b[col].dropna().unique())  # label of the answer option
    counts_q29b.append(Q29b[col].count())          # number of times it was selected

In [None]:
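# one row per answer option, indexed by the option label
table = pd.DataFrame(name_q29b).set_index(0)
table['Counts'] = counts_q29b
# share of each option among all selections, in percent
table['share'] = table['Counts']/table['Counts'].sum()*100
table['share'] = round(table['share'],1)
values = table['share'].sort_values()
values = values[1:]  # drop the first entry, i.e. the smallest share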

plt.figure(figsize=(12,8))

values.plot(kind='barh',color='#51ccfc', fontsize=14) 

plt.title('In the next two years, I hope to become better in the following big data products \n (n = {})'.format(n_usa), fontsize=16, color='black')
plt.xlabel('% Share', color='black', fontsize=14)
plt.ylabel('Big data product', color='black', fontsize=14)

for index, value in enumerate(values):
    plt.text(value, index, str(value)+'%', fontsize=14)

plt.show();

#### Insights

MySQL (12.9%) is the big data product that female students most often want to learn more about. The top three are:

1. MySQL (12.9%)
2. MongoDB (10.8%)
3. Microsoft SQL Server (9.7%)

Again, we can see how ambitious female students from the US are: only 3.6% of all responses indicated no intention to become familiar with any big data product in the next two years.
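
Because respondents could tick several options, a percentage can be read in two ways: as a share of all selections (the `Counts / Counts.sum()` computation used for the table above) or as a share of respondents. The sketch below contrasts the two on made-up answers from four hypothetical respondents; the frame `toy` and its column names are assumptions for illustration, not survey data.

```python
import numpy as np
import pandas as pd

# Hypothetical answers of four respondents to a multiple-choice question
# (each column holds its option label where selected, NaN otherwise).
toy = pd.DataFrame({
    'Part_1': ['MySQL', 'MySQL', np.nan, 'MySQL'],
    'Part_2': ['MongoDB', np.nan, np.nan, 'MongoDB'],
    'Part_3': [np.nan, np.nan, 'None', np.nan],
})

counts = toy.count()                                  # selections per option
labels = toy.apply(lambda col: col.dropna().iloc[0])  # option label of each column
counts.index = labels.to_numpy()

# Share of all selections: sums to 100 across options.
share_of_responses = (counts / counts.sum() * 100).round(1)

# Share of respondents who ticked the option: may sum to more than 100.
share_of_respondents = (counts / len(toy) * 100).round(1)

print(pd.DataFrame({'responses %': share_of_responses,
                    'respondents %': share_of_respondents}))
```

The two readings can differ noticeably, so it helps to state which denominator a quoted percentage refers to.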

In [None]:
# Question 31 B: Which of the following business intelligence tools do you hope to become more 
# familiar with in the next 2 years?

Q31b = FemaleStudents_USA_2020.loc[:,('Q31_B_Part_1','Q31_B_Part_2','Q31_B_Part_3','Q31_B_Part_4',
                                      'Q31_B_Part_5','Q31_B_Part_6','Q31_B_Part_7','Q31_B_Part_8',
                                      'Q31_B_Part_9','Q31_B_Part_10','Q31_B_Part_11','Q31_B_Part_12',
                                      'Q31_B_Part_13','Q31_B_Part_14','Q31_B_OTHER')]

In [None]:
plot_title = 'In the next two years, I hope to become better in the following BI tool \n (n = {})'
y_label = 'BI Tool'
viz_demo_plot(Q31b,plot_title,y_label,n_usa)

#### Insights

The top three BI tools are:

1. Tableau (26.6%)
2. Microsoft Power BI (19.3%)
3. Google Data Studio (13%)

Only 5.6% of female students do not want to learn any BI tool in the upcoming two years.
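
A ranking like this can be read off a share table programmatically. The sketch below is only an illustration: the helper `top_n` is hypothetical, the first four values are the ones quoted above, and the last entry is a made-up placeholder. Catch-all options are dropped before ranking so that they cannot appear among the favorites.

```python
import pandas as pd

# Shares in percent; the first four come from the insight above,
# 'Some other tool' is a made-up placeholder.
shares = pd.Series({
    'Tableau': 26.6,
    'Microsoft Power BI': 19.3,
    'Google Data Studio': 13.0,
    'None': 5.6,
    'Some other tool': 4.0,
})

def top_n(shares, n=3, exclude=('None', 'Other')):
    """Return the n largest shares, ignoring catch-all options."""
    return shares.drop(labels=list(exclude), errors='ignore').nlargest(n)

print(top_n(shares))
```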

In [None]:
# Question 33 B: Which categories of automated machine learning tools (or partial AutoML tools) 
# do you hope to become more familiar with in the next 2 years?

Q33b = FemaleStudents_USA_2020.loc[:,('Q33_B_Part_1','Q33_B_Part_2','Q33_B_Part_3','Q33_B_Part_4',
                                      'Q33_B_Part_5','Q33_B_Part_6','Q33_B_Part_7','Q33_B_OTHER')]

In [None]:
plot_title = 'In the next two years, I hope to become better in the following categories of automated ML \n (n = {})'
y_label = '(Partial) Automated ML Tool'
viz_demo_plot(Q33b,plot_title,y_label,n_usa)

#### Insights

Only about 10% of female students do not want to become familiar with any of the given categories of (partial) automated machine learning tools in the next two years.

In [None]:
# Question 34 B: Which specific automated machine learning tools (or partial AutoML tools) do you hope 
# to become more familiar with in the next 2 years?

Q34b = FemaleStudents_USA_2020.loc[:,('Q34_B_Part_1','Q34_B_Part_2','Q34_B_Part_3','Q34_B_Part_4',
                                      'Q34_B_Part_5','Q34_B_Part_6','Q34_B_Part_7','Q34_B_Part_8',
                                      'Q34_B_Part_9','Q34_B_Part_10','Q34_B_Part_11','Q34_B_OTHER')]

In [None]:
plot_title = 'In the next two years, I hope to become better in the specific automated ML tool \n (n = {})'
y_label = 'Specific automated ML Tool'
viz_demo_plot(Q34b,plot_title,y_label,n_usa)

#### Insights

Auto-Sklearn (21.2%), Auto-Keras (16.9%) and Google Cloud AutoML (16.1%) are the specific automated ML tools that female students from the US most want to become more familiar with in the next two years.

Roughly 6% do not want to become familiar with any of them.

In [None]:
# Question 35 B: In the next 2 years, do you hope to become more familiar with any of these tools 
# for managing ML experiments?

Q35b = FemaleStudents_USA_2020.loc[:,('Q35_B_Part_1','Q35_B_Part_2','Q35_B_Part_3','Q35_B_Part_4',
                                      'Q35_B_Part_5','Q35_B_Part_6','Q35_B_Part_7','Q35_B_Part_8',
                                      'Q35_B_Part_9','Q35_B_Part_10','Q35_B_OTHER')]

In [None]:
plot_title = 'In the next two years, I hope to become more familiar with the following tools for managing ML experiments \n (n = {})'
y_label = 'Manager for ML experiments'
viz_demo_plot(Q35b,plot_title,y_label,n_usa)

#### Insights

Most female students from the USA either want to become familiar with TensorBoard (28.3%) in the next two years or with none (25.3%) of the given choices.
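
Since the share of "None" answers recurs in the insights of this section, it can be collected for several questions in one place. The following is a rough sketch on made-up data: the frame `toy`, the helper `none_share` and the assumption that the option is stored under the exact label `'None'` are all hypothetical and would need to be checked against the real survey columns.

```python
import numpy as np
import pandas as pd

# Hypothetical answers of four respondents to two multiple-choice questions.
toy = pd.DataFrame({
    'Q27_B_Part_1': ['Amazon EC2', 'Amazon EC2', np.nan, np.nan],
    'Q27_B_Part_2': [np.nan, 'Azure Cloud Services', np.nan, 'Azure Cloud Services'],
    'Q27_B_Part_3': [np.nan, np.nan, 'None', np.nan],
    'Q31_B_Part_1': ['Tableau', 'Tableau', 'Tableau', np.nan],
    'Q31_B_Part_2': [np.nan, np.nan, np.nan, 'None'],
})

def none_share(df, prefix, none_label='None'):
    """Percentage of all selections for one question that fall on the 'None' option."""
    block = df.filter(regex='^' + prefix)   # all columns of this question
    counts = block.count()                  # selections per option
    is_none = block.eq(none_label).any()    # True for the column holding the 'None' label
    return round(counts[is_none].sum() / counts.sum() * 100, 1)

summary = pd.Series({q: none_share(toy, q + '_') for q in ['Q27_B', 'Q31_B']})
print(summary)
```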

<a id='section5'></a>

### 5. Conclusion

The underlying analysis of the **2020 Kaggle Machine Learning and Data Science Survey** revealed many interesting insights. 

At the beginning, we looked for the subgroup that is most strongly represented in the survey. After filtering for the *Top Three*, we found that the USA has the highest share of women among female and male respondents, followed by India and China. Based on that result, we continued our journey with the target group of US female students. The results show that female students from the USA are a subgroup of the Kaggle community that is on the rise and wants to become more familiar with data science, machine learning and related fields such as big data products. Every second female student from the USA attends a master's program. Moreover, the majority of them are between 25 and 29 years old, which suggests that most of them will be looking for a job or PhD position in the foreseeable future. What is more, the majority of this subgroup has only one to two years of programming experience (there is still a lot of capacity left!!). The favorite (or rather the most used) programming languages are Python and R.

Regarding the regular use of TPUs and GPUs, the answers clearly showed that the majority of female students use GPUs rather than TPUs (maybe because they are still in the easy and perfect world of theory (= university)). Computer vision also seems to be an area of less interest; it would be interesting to learn more about why that is the case. Likewise, the majority has less than one year of experience with particular machine learning algorithms (but as we know, all things come to those who wait).

With respect to the next two years, female students from the USA are likely to invest more time in technology, ML and other related fields. As the results show, female students are interested in cloud computing platforms like Amazon Web Services or Google Cloud Platform. Interest in business intelligence tools also appears to be high; in particular, Tableau emerged as a clear favorite.
Lastly, most of the results indicate that ambitions for the next two years are quite high, since for each of the topics only a minority of these students are interested in none of the given options.

Thanks for taking the time to read through. :)

##### [Back to Table](#table)

                                                END