**A COMPREHENSIVE UNDERSTANDING OF ML AND DS AND GENDER ROLE**

**A comprehensive understanding of females in technological career roles in relation to their education level.**

This is Kaggle's third annual Machine Learning and Data Science Survey, and the second-ever survey data challenge.

This year, the industry-wide survey presents a comprehensive view of the state of data science and machine learning. 

The survey was live for three weeks in October, and after cleaning the data we have 19,717 responses. We will use these responses to understand the current state of ML and DS.

The results include raw numbers about who is working with data, what’s happening with machine learning in different industries, and the best ways for new data scientists to break into these fields. Kaggle published the data in raw form without compromising anonymity, which makes it an unusual example of a survey dataset.

**The challenge objective:**
To tell a data story about a subset of the data science community represented in this survey, through a combination of both narrative text and data exploration. A “story” could be defined any number of ways, and that’s deliberate. The challenge is to deeply explore (through data) the impact, priorities, or concerns of a specific group of data science and machine learning practitioners. That group can be defined in the macro (for example: anyone who does most of their coding in Python) or the micro (for example: female data science students studying machine learning in masters programs).

**Submissions will be evaluated on the following:**

1. **Composition** - Is there a clear narrative thread to the story that’s articulated and supported by data? The subject should be well defined, well researched, and well supported through the use of data and visualizations.
2. **Originality** - Does the reader learn something new through this submission? Or is the reader challenged to think about something in a new way? It should be informative, thought provoking, and fresh all at the same time.
3. **Documentation** - Are your code, notebook, and additional data sources well documented so a reader can understand what you did? Are your sources clearly cited? A high quality analysis should be concise and clear at each step so the rationale is easy to follow and the process is reproducible.

![](https://media.giphy.com/media/SRL18lj1i4UoM/giphy.gif)

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns

import matplotlib.pyplot as plt
from wordcloud import WordCloud, STOPWORDS

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Any results you write to the current directory are saved as output.

**UNDERSTANDING THE DATA**

In [None]:
schema = pd.read_csv('../input/kaggle-survey-2019/survey_schema.csv')
questions = pd.read_csv('../input/kaggle-survey-2019/questions_only.csv')
mcq = pd.read_csv('../input/kaggle-survey-2019/multiple_choice_responses.csv')
responses = pd.read_csv('../input/kaggle-survey-2019/other_text_responses.csv')


In [None]:
responses.head()

The responses data contains many empty values, represented by NaN. For instance, the free-text answers to question 13 are mostly empty:

In [None]:
responses.Q13_OTHER_TEXT.head(50)

In [None]:
questions.head()

Let us look at the schema, which contains the survey questions along with information about which respondents were shown each one.

In [None]:
schema.head()

In [None]:
schema.groupby('Q13')['Q31'].count().head(500)

In [None]:
schema['Q4'].describe()

In [None]:
mcq.head()

**UNDERSTANDING THE SHAPE OF THE DATA**

In [None]:
## Let us first get to understand the shape of the data before choosing which one to work with

responses_len, schema_len, mcq_len, questions_len = len(responses.index), len(schema.index),len(mcq.index), len(questions.index)
print(f'responses size: {responses_len}, schema size: {schema_len} , mcq size: {mcq_len}, questions size: {questions_len}')

Based on the sizes above, we will work with either the responses or the mcq data, because they contain the most rows.

First, let us focus on the multiple choice questions (mcq) dataset for now:

Let us check the missing values in the mcq dataset

In [None]:
## Count the missing values 
miss_val_mcq = mcq.isnull().sum(axis=0) / mcq_len
miss_val_mcq = miss_val_mcq[miss_val_mcq> 0] * 100
miss_val_mcq

The results show that Q4, Q5, Q6, Q7 and Q8 each have less than 50% missing values, so these questions have enough answers to support an analysis.

The columns Q34_Part_8 through Q34_Part_12, by contrast, are more than 97% missing. They are almost entirely NaN and will not be helpful for analyzing the data.
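The split between usable and sparse columns described above can be made explicit with a small helper. This is a minimal sketch rather than part of the original notebook: the function names `missing_percentage` and `split_by_missingness`, the 50% threshold default, and the toy frame are all assumptions made for illustration.

```python
import pandas as pd


def missing_percentage(df: pd.DataFrame) -> pd.Series:
    """Percentage of missing values per column, highest first (columns with none are dropped)."""
    pct = df.isna().mean() * 100
    return pct[pct > 0].sort_values(ascending=False)


def split_by_missingness(df: pd.DataFrame, threshold: float = 50.0):
    """Return (usable, sparse) column names, split at `threshold` percent missing."""
    pct = df.isna().mean() * 100
    usable = pct[pct < threshold].index.tolist()
    sparse = pct[pct >= threshold].index.tolist()
    return usable, sparse


# Toy frame standing in for the survey data (hypothetical values)
toy = pd.DataFrame({
    "Q4": ["Master's degree", None, "Doctoral degree", "Bachelor's degree"],
    "Q34_Part_8": [None, None, None, "Other"],
    "Q1": ["25-29", "30-34", "22-24", "18-21"],
})

usable_cols, sparse_cols = split_by_missingness(toy)
print(missing_percentage(toy))
print("usable:", usable_cols, "sparse:", sparse_cols)
```

On the real data, the same call would be `split_by_missingness(mcq)`. Note that for multi-select columns a missing value usually means "option not ticked" rather than "question skipped", so a high percentage there is not necessarily a data quality problem.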

In [None]:
## Number of columns in the mcq dataset
len(list(mcq.columns))

Now, let us check the highest level of education using a word cloud. First, we will look at the question itself, then at the answers to find the most popular level of education among the survey takers.

In [None]:
## Let's see the words of the Q4 question text

plt.figure(figsize=(20, 5))

text = schema.Q4[0]

# Create and generate a word cloud image:
wordcloud = WordCloud().generate(text)

# Display the generated image:
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()

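Now let us build a word cloud from the answers themselves, sized by how often each level of education was chosen.

In [None]:
## Word cloud of the Q4 answers, weighted by how many respondents chose each one

plt.figure(figsize=(20, 5))

# Skip row 0, which holds the question text rather than an answer
education_counts = mcq.Q4.iloc[1:].value_counts().to_dict()

# Create and generate a word cloud image:
wordcloud = WordCloud().generate_from_frequencies(education_counts)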

# Display the generated image:
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()

We see that a master's degree is the most common level of education among the survey takers.

Let us now find the gender and the percentage of the participants.

In [None]:
mcq.Q2.head(50)

How many males and females took the survey?

In [None]:
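## How many respondents identified as male or female?

# Skip row 0 (the question text) and count only the "Male" and "Female" answers
gender = mcq.Q2.iloc[1:]
n_male_female = gender.isin(["Male", "Female"]).sum()
print("There were {} respondents who identified as male or female in the survey.".format(n_male_female))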

Okay, now let us break it down and check the ratio of males to females
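One way to get that breakdown is to count each answer and turn the counts into shares. The sketch below is an illustration, not the notebook's own code: `gender_breakdown` and `male_to_female_ratio` are hypothetical helper names, and the toy series stands in for `mcq.Q2` with its question-text row removed.

```python
import pandas as pd


def gender_breakdown(gender: pd.Series) -> pd.DataFrame:
    """Count and percentage share of each answer in a gender column."""
    counts = gender.value_counts()
    percent = (counts / counts.sum() * 100).round(1)
    return pd.DataFrame({"count": counts, "percent": percent})


def male_to_female_ratio(gender: pd.Series) -> float:
    """Number of 'Male' answers per 'Female' answer (NaN if there are no 'Female' answers)."""
    counts = gender.value_counts()
    females = int(counts.get("Female", 0))
    males = int(counts.get("Male", 0))
    return males / females if females else float("nan")


# Toy answers (hypothetical); on the real data use mcq.Q2.iloc[1:]
toy_gender = pd.Series(["Male", "Male", "Male", "Female", "Prefer not to say", "Male"])

print(gender_breakdown(toy_gender))
print("male-to-female ratio:", male_to_female_ratio(toy_gender))
```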

In [None]:
## Slice the column list into groups for inspection (the variable names are only labels for the slices)
all_mcq_columns = list(mcq.columns)
Q2 = all_mcq_columns[:11]
Q4 = all_mcq_columns[11:32]
Q3  = all_mcq_columns[32:41]

In [None]:
mcq[Q4]

In [None]:
mcq.Q5.head(50)


In [None]:
mcq.Q4.head(50)

In [None]:
mcq.groupby('Q2')['Q4'].count().head(50)


In [None]:
mcq['Q4'].describe()

In [None]:
schema['Q3'].describe()

In [None]:
schema['Q5'].describe()

In [None]:
schema['Q1'].describe()

In [None]:
schema['Q2'].describe()

**DATA CLEANING AND PREPROCESSING**

![](https://media.giphy.com/media/zOvBKUUEERdNm/giphy.gif)

Now, let's look at the gender distribution of the survey takers, and then at how gender is distributed across age groups.

In [None]:
# Keep age and gender, skipping row 0, which holds the question text
pt1 = mcq[['Q1', 'Q2']].iloc[1:].rename(columns={"Q1": "Age", "Q2": "Gender"})

# Pie chart of the gender distribution; wedge labels come from the data itself
# so they cannot drift out of step with the order of value_counts()
plt.figure(figsize=(18,12))
plt.subplot(221)
pt1["Gender"].value_counts().plot.pie(autopct = "%1.0f%%",colors = sns.color_palette("prism",8),startangle = 60,
wedgeprops={"linewidth":4,"edgecolor":"k"},explode=[.1,.1,.2,.3],shadow =True)
plt.title("Gender Distribution in (%)")

plt.show()

In [None]:
plt.figure(figsize=(16,10))
sorted_age=['18-21','22-24','25-29','30-34','35-39','40-44','45-49','50-54','55-59','60-69','70+']
sns.countplot(x='Age', hue='Gender',order=sorted_age, data=pt1 )
plt.title('Gender distribution based on Their Ages',fontsize=24)
plt.ylabel('Number of Survey Takers', fontsize = 18.0) # Y label
plt.xlabel('Age of the Survey Takers', fontsize = 18) # X label
plt.show()

The plot above shows the age distribution of all survey takers, split by gender. Male respondents clearly outnumber female respondents, at least among the DS and ML practitioners who took this survey.
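Raw counts make it hard to compare age groups of very different sizes, so it can help to look at the share of each gender within each age group instead. The following is a minimal sketch under that assumption: `gender_share_by_age` is a hypothetical helper, and the toy frame mimics the `Age` and `Gender` columns of `pt1`.

```python
import pandas as pd


def gender_share_by_age(df: pd.DataFrame, age_col: str = "Age", gender_col: str = "Gender") -> pd.DataFrame:
    """Percentage of each gender within every age group (each row sums to about 100)."""
    table = pd.crosstab(df[age_col], df[gender_col], normalize="index") * 100
    return table.round(1)


# Toy data (hypothetical); on the real data call gender_share_by_age(pt1)
toy_ages = pd.DataFrame({
    "Age": ["18-21", "18-21", "22-24", "22-24", "22-24", "22-24"],
    "Gender": ["Male", "Female", "Male", "Male", "Male", "Female"],
})

share_by_age = gender_share_by_age(toy_ages)
print(share_by_age)
```

Reading across a row then answers the question "of the respondents in this age group, what percentage are women?", which is independent of how many people are in the group.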

In [None]:
plt.figure(figsize=(12,6))
pt2 = mcq[['Q2','Q4', 'Q5']]
pt2 = pt2.rename(columns={"Q2":"Gender","Q4": "Education", "Q5": "title"})
pt2.drop(0, axis=0, inplace=True)

# Replacing the long education names with short, easy-to-read labels.
# Assigning the result back avoids relying on an in-place edit of a column selection.
pt2['Education'] = pt2['Education'].replace(
                                                   {'Master’s degree':'MS',
                                                    'Bachelor’s degree':'BS',
                                                    "Doctoral degree":'Doct',
                                                    "Some college/university study without earning a bachelor’s degree":'Some college',
                                                    "Professional degree":"Professional",
                                                   "I prefer not to answer":"Prefer NA",
                                                   "No formal education past high school":"No edu"})

sns.countplot(x='Education', data=pt2 )
plt.title('Education wise distribution',fontsize=18)
plt.ylabel('Number of People', fontsize = 16.0) # Y label
plt.xlabel('Education', fontsize = 16) # X label
plt.show()

plt.figure(figsize=(12,6))

sns.countplot(y='title', data=pt2 )
plt.title("Job Profile Distribution", fontsize=18)

plt.ylabel('Job Profile', fontsize = 16.0) # Y label
plt.xlabel('Number of People', fontsize = 16) # X label
plt.show()

In [None]:
# pt2 already has its columns renamed and its question-text row dropped in the previous cell,
# so it can be plotted directly (dropping row 0 a second time would raise a KeyError)
plt.figure(figsize=(16,8))

sns.countplot(x='title',hue='Education', data=pt2 )
plt.title("Job Profile vs Education Distribution", fontsize=18)

plt.ylabel('Number of People', fontsize = 16.0) # Y label
plt.xlabel('Job Profile', fontsize = 16) # X label
plt.xticks(rotation=45)
plt.show()

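Next, let us look at the coding experience of the survey takers, first on its own and then in relation to their salary level.

In [None]:
plt.style.use('fivethirtyeight')
# Seven most common answers to Q15 (coding experience), skipping row 0 (the question text)
exp_counts = mcq['Q15'].iloc[1:].value_counts().head(7)
# Pass x and y as keywords; recent seaborn versions do not accept them positionally
sns.barplot(x=exp_counts.values, y=exp_counts.index, palette='bright')
plt.xticks(rotation=90)
fig=plt.gcf()
fig.set_size_inches(10,8)
plt.title('Coding Experience of the Survey Takers')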
plt.show()

In [None]:
plt.figure(figsize=(14,8))
ptML = mcq[['Q15','Q10']]
ptML = ptML.rename(columns={"Q15":"code_experience","Q10": "Income"})
ptML.drop(0, axis=0, inplace=True)
sorted_exp=['I have never written code','< 1 years','1-2 years','3-5 years','5-10 years','10-20 years','20+ years']
sns.countplot(x='code_experience',hue='Income',order=sorted_exp, data=ptML )
plt.title("Coding Experience wise Income Distribution", fontsize=18)

plt.ylabel('Number of People', fontsize = 16.0) # Y label
plt.xlabel('Coding Experience in years', fontsize = 16) # X label
plt.xticks(rotation=45)
plt.show()
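The hand-written `sorted_exp` list above works for plotting, but the same ordering can also be attached to the column itself so that sorting and comparisons respect it. This is a sketch of that idea using an ordered pandas categorical; `EXPERIENCE_ORDER` and `as_ordered_experience` are names introduced here for illustration, and the toy frame stands in for `ptML`.

```python
import pandas as pd

# Same order as the sorted_exp list used for the plot
EXPERIENCE_ORDER = ['I have never written code', '< 1 years', '1-2 years', '3-5 years',
                    '5-10 years', '10-20 years', '20+ years']


def as_ordered_experience(series: pd.Series) -> pd.Categorical:
    """Convert coding-experience answers into an ordered categorical."""
    return pd.Categorical(series, categories=EXPERIENCE_ORDER, ordered=True)


# Toy answers (hypothetical); on the real data use ptML["code_experience"]
toy_exp = pd.DataFrame({
    "code_experience": ["3-5 years", "< 1 years", "20+ years", "1-2 years", "< 1 years"],
})
toy_exp["code_experience"] = as_ordered_experience(toy_exp["code_experience"])

# Sorting now follows experience, not alphabetical order
sorted_levels = toy_exp.sort_values("code_experience")["code_experience"].astype(str).tolist()
print(sorted_levels)

# Ordered categoricals also support comparisons against a level
experienced = int((toy_exp["code_experience"] >= "3-5 years").sum())
print("respondents with at least 3-5 years of coding:", experienced)
```

Answers that are not in `EXPERIENCE_ORDER` become NaN under this conversion, which is worth checking before relying on the counts.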

Their favorite learning platforms are:

In [None]:
media = ["Twitter", "HackerNews", "Reddit", "Kaggle", "Forums", "YouTube", "Podcasts", "Blogs", "Journals", "Slack"]
# Count the respondents who ticked each option, skipping row 0 (the question text)
media_count = [mcq[f"Q12_Part_{i}"].iloc[1:].notna().sum() for i in range(1, 11)]

# Divide by the number of respondents (all rows except the question-text row)
pt_media = pd.DataFrame({"media": media, "media_percentage": np.array(media_count) * 100 / (mcq.shape[0] - 1)})
pt_media.sort_values("media_percentage", ascending=False, inplace=True)
plt.style.use('fivethirtyeight')
sns.barplot(x=pt_media['media_percentage'], y=pt_media['media'], palette='coolwarm')
plt.xticks(rotation=90)
fig=plt.gcf()
fig.set_size_inches(10,8)
plt.title('Media Sources Responsible for Learning Datascience')
plt.show()

Ever wondered which ML framework Kaggle survey takers use the most?

In [None]:
labels_frame = ["Scikit-Learn","Tensorflow","Keras","Random Forest","Xgboost","Pytorch","Caret","LightGBM", "Spark MLlib", "Fast.ai"]
# Count the respondents who ticked each framework, skipping row 0 (the question text)
frame_count = [mcq[f"Q28_Part_{i}"].iloc[1:].notna().sum() for i in range(1, 11)]

# Divide by the number of respondents (all rows except the question-text row)
pt_frame = pd.DataFrame({"ML Frameworks": labels_frame, "Percentage Used": np.array(frame_count) * 100 / (mcq.shape[0] - 1)})
pt_frame.sort_values("Percentage Used", ascending=False, inplace=True)

plt.style.use('classic')
sns.barplot(x=pt_frame['ML Frameworks'], y=pt_frame['Percentage Used'])
plt.xticks(rotation=90)
fig=plt.gcf()
fig.set_size_inches(18,6)
plt.title('ML Frameworks Popularity')
plt.show()

**The Survey Methodology**

This survey received 19,717 usable responses from 171 countries and
territories. If a country or territory received fewer than 50
respondents, we grouped them into a group named “Other” for
anonymity.

We excluded respondents who were flagged by our survey system as
“Spam”.

Most of our respondents were found primarily through Kaggle channels,
like our email list, discussion forums and social media channels.

The survey was live from October 8th to October 28th. We allowed
respondents to complete the survey at any time during that window.
The median response time for those who participated in the survey was
approximately 10 minutes.

Not every question was shown to every respondent. You can learn more
about the different segments we used in the survey_schema.csv file. In general,
respondents with more experience were asked more questions and respondents
with less experience were asked fewer questions.

To protect the respondents’ identity, the answers to multiple choice
questions have been separated into a separate data file from the
open-ended responses. We do not provide a key to match up the
multiple choice and free form responses. Further, the free form
responses have been randomized column-wise such that the responses
that appear on the same row did not necessarily come from the same
survey-taker.

Multiple choice single response questions fit into individual columns whereas 
multiple choice multiple response questions were split into multiple columns.
Text responses were encoded to protect user privacy and countries with fewer
than 50 respondents were grouped into the category "other".
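Because a multiple-response question is spread over several `_Part_` columns, with a value where the option was ticked and NaN where it was not, tallying such a question comes down to counting the non-missing cells in each part. The sketch below illustrates this layout on made-up data; `multi_select_share` is a hypothetical helper, and the three toy columns and their labels are assumptions that do not match the real option order of Q12.

```python
import pandas as pd


def multi_select_share(df: pd.DataFrame, question: str, n_parts: int, labels=None) -> pd.Series:
    """Percentage of rows that ticked each option of a multiple-response question."""
    cols = [f"{question}_Part_{i}" for i in range(1, n_parts + 1)]
    # A non-missing cell means the respondent selected that option
    share = df[cols].notna().sum() / len(df) * 100
    if labels is not None:
        share.index = labels
    return share.sort_values(ascending=False)


# Toy multiple-response question with three options (hypothetical layout)
toy_multi = pd.DataFrame({
    "Q12_Part_1": ["Twitter", None, None, "Twitter"],
    "Q12_Part_2": [None, None, "Kaggle", "Kaggle"],
    "Q12_Part_3": ["Blogs", "Blogs", "Blogs", None],
})

media_share = multi_select_share(toy_multi, "Q12", 3, labels=["Twitter", "Kaggle", "Blogs"])
print(media_share)
```

On the real data the frame passed in should exclude the first row, which holds the question text, for example `multi_select_share(mcq.iloc[1:], "Q12", 10, labels=media)`. The percentages can add up to more than 100 because one respondent may tick several options.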