# Building an EDA Model

### I want to thank Ken Jee for his YouTube video and his notebook.

## 1. Introduction
*Kaggle conducts an industry-wide survey that presents a truly comprehensive view of the state of data science and machine learning. The survey was live for 3.5 weeks in October with 20,036 responses.*

### The challenge objective

Tell a data story about a subset of the data science community represented in this survey, through a combination of both narrative text and data exploration. A “story” could be defined any number of ways, and that’s deliberate. The challenge is to deeply explore (through data) the impact, priorities, or concerns of a specific group of data science and machine learning practitioners. This notebook tries to see Kagglers from the perspective of continents.

In [None]:
# Import the libraries

import pandas as pd
import numpy as np  # np.nan replaces the NaN alias, which NumPy 2.0 removed
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.graph_objects as go
import warnings
warnings.filterwarnings('ignore')
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)

In [None]:
# Read the dataset to the kernel

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
df = pd.read_csv('/kaggle/input/kaggle-survey-2020/kaggle_survey_2020_responses.csv')

In [None]:
# Removing the first row because we can't work with it (it holds the question text, not a response)

df_1 = df.iloc[1:,:]
df_1.head()
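
In the Kaggle survey export, the first data row holds the full question text rather than a response, which is why it is dropped above. The sketch below uses a made-up two-column frame (the names `raw`, `questions` and `responses` and all values are illustrative, not taken from the real file) to show one way of keeping that row as a lookup before dropping it:

```python
import pandas as pd

# Toy stand-in for the survey file: row 0 holds question text, not an answer.
raw = pd.DataFrame({
    'Q1': ['What is your age (# years)?', '25-29', '22-24'],
    'Q2': ['What is your gender?', 'Man', 'Woman'],
})

# Keep the question text as a column -> text lookup, e.g. for plot titles.
questions = raw.iloc[0].to_dict()

# Drop the question row and renumber the remaining responses from 0.
responses = raw.iloc[1:].reset_index(drop=True)

print(questions['Q2'])
print(len(responses))
```

Keeping the lookup means a column code such as `Q2` can still be translated back to its question later on.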

In [None]:
# Gender

counts = df_1.Q2.value_counts()
plt.figure(figsize=(12,8))
# Draw the bars in the same order as `counts` so each label sits on its own bar
sns.countplot(data=df_1, x='Q2', order=counts.index)
plt.title('Number of Participants by Gender')
plt.xlabel('Gender')
for i, n in enumerate(counts):
    plt.text(i, n/2, f'{n:,}', horizontalalignment='center', fontdict=dict(color='w'))

**The chart above shows that men far outnumber women among the survey respondents.**

*These figures describe only the Kaggle survey sample; they do not necessarily reflect data-related fields worldwide.*
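
To put a number on the imbalance, the raw counts can be turned into percentages. This is a minimal sketch on made-up answers (the series `q2` stands in for `df_1.Q2`; the values are not the survey's):

```python
import pandas as pd

# Made-up answers standing in for df_1.Q2; the real column has more categories.
q2 = pd.Series(['Man', 'Man', 'Man', 'Woman', 'Prefer not to say'])

# normalize=True returns fractions of the total instead of raw counts.
share = q2.value_counts(normalize=True).mul(100).round(1)
print(share)
```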

In [None]:
# Gender and Age

# To get the gender and age, we slice the data frame based on columns
# which will give us the required information
genders = df_1.groupby(['Q1','Q2']).count()
genders = genders.iloc[:,0]
genders = genders.reset_index()
genders = genders.rename({'Time from Start to Finish (seconds)':'Count'}, axis=1)
plt.figure(figsize=(15,8))
plt.title('Gender and Age group count', fontdict=dict(fontsize=25))
sns.barplot(data=genders, y='Count', x='Q1', hue='Q2')
plt.xlabel('Age Group')
plt.legend(title='Gender', bbox_to_anchor= (0.78, 1))

**The largest group of men falls in the 25-29 age range, which makes sense.**

**Women, on the other hand, are concentrated in the 22-24 range, which may be explained later when we look at roles and education.**

In [None]:
# Country & Gender

# Again we sliced the data to get the required columns only
genders = df_1.groupby(['Q2','Q3']).count()
genders = genders.iloc[:,0]
genders = genders.reset_index()
genders = genders.rename({'Time from Start to Finish (seconds)':'Count'}, axis=1)
genders = genders.sort_values('Count', ascending=False)
womans = genders.loc[genders['Q2'] == 'Woman']
womans = womans.sort_values('Count', ascending=False)
womans = womans.iloc[:10, :]
men = genders.loc[genders['Q2'] == 'Man']
men = men.sort_values('Count', ascending=False)
men = men.iloc[:10, :]
men_women = pd.concat([womans, men])
plt.figure(figsize=(15,8))
plt.title('Gender and Country count (top 10)', fontdict=dict(fontsize=25))
sns.barplot(data=men_women, y='Count', x='Q3', hue='Q2')
plt.xticks(rotation=80)
plt.xlabel('Country')
plt.legend(title='Gender', bbox_to_anchor= (0.78, 1))

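**This graph illustrates a couple of points:**
1. India has the most participants of both genders.
2. Some countries, such as Turkey, Indonesia and Canada, show only a Women bar. The chart combines each gender's own top 10, so these countries are in the women's top 10 but not the men's; it does not mean they had no men participants.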
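
Because each gender's top 10 is selected separately, some countries end up with a single bar. An alternative, sketched here on made-up counts in the same long shape as `genders` (the numbers and the choice of `nlargest(2)` are illustrative assumptions), is to rank countries by total respondents and keep the same countries for both genders:

```python
import pandas as pd

# Made-up counts in the same long shape as `genders` (Q2 = gender, Q3 = country).
genders = pd.DataFrame({
    'Q2': ['Man', 'Man', 'Man', 'Woman', 'Woman', 'Woman'],
    'Q3': ['India', 'Japan', 'Turkey', 'India', 'Japan', 'Turkey'],
    'Count': [100, 50, 5, 30, 4, 6],
})

# Rank countries by total respondents, then keep those countries for both genders.
top_countries = genders.groupby('Q3')['Count'].sum().nlargest(2).index
shared = genders[genders['Q3'].isin(top_countries)]

# One row per country, one column per gender.
table = shared.pivot(index='Q3', columns='Q2', values='Count')
print(table)
```

With a shared country list, every country in the chart has a bar for each gender, so a missing bar can no longer be misread as zero participants.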

In [None]:
# Role & Gender

# Again filtering the dataframe
proffesion = df_1.groupby(['Q2','Q5']).count()
proffesion = proffesion.iloc[:, 1]
proffesion = proffesion.reset_index()
proffesion = proffesion.rename({'Q1':'Count'}, axis=1)
proffesion = proffesion.sort_values('Count', ascending=False)

In [None]:
# Plotting the above dataset results

# One more filter for genders
men_q = proffesion.loc[proffesion['Q2'] == 'Man']
wom_q = proffesion.loc[proffesion['Q2']== 'Woman']
fig, ax = plt.subplots(1,2, figsize=(20,10))
ax[0].pie(men_q['Count'], labels=men_q['Q5'], autopct='%1.1f%%')
ax[0].set_title('Men', fontdict=dict(fontsize=20))
ax[0].axis('equal')
# Label the women's chart with the women's own role order, not the men's
ax[1].pie(wom_q['Count'], labels=wom_q['Q5'], autopct='%1.1f%%', rotatelabels=True)
ax[1].set_title('Women',fontdict=dict(fontsize=20))
ax[1].axis('equal')
plt.show()

**The two pie charts look very similar.**

**Students make up the largest share of participants for both genders, although the share of students among women is about 8 percentage points higher than among men. Data Scientists and Software Engineers follow.**

**One point worth noting: the share of unemployed participants is almost the same for both genders, which suggests there is no obvious gender gap in employment, at least among those who took the survey.**
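
The percentage comparisons above can also be read from a table rather than from the pie slices. This is a minimal sketch on made-up responses (the frame `toy` stands in for the `Q2` and `Q5` columns; the values are not the survey's):

```python
import pandas as pd

# Made-up responses standing in for the Q2 (gender) and Q5 (role) columns.
toy = pd.DataFrame({
    'Q2': ['Man', 'Man', 'Man', 'Man', 'Woman', 'Woman'],
    'Q5': ['Student', 'Data Scientist', 'Data Scientist', 'Currently not employed',
           'Student', 'Currently not employed'],
})

# normalize='index' makes each gender's row sum to 1, so the rows are
# comparable even though the groups differ in size.
role_share = pd.crosstab(toy['Q2'], toy['Q5'], normalize='index').mul(100).round(1)
print(role_share)
```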



In [None]:
# Programming Languages

# Helper for collecting sub-questions.
# Questions with several sub-columns (e.g. Q7_Part_1, Q7_Part_2, ...) are
# gathered by prefix here; the resulting list is then used to slice a new data frame.
def all_Q(df, prefix='Q7'):
    # Q stays global because later cells read it directly after calling all_Q
    global Q
    Q = [col for col in df.columns if col.startswith(str(prefix))]
    return Q

In [None]:
# Capturing Q7, which covers the programming languages

all_Q(df_1, 'Q7')
prog = df_1[Q]
prog.columns = list(prog.mode().iloc[0,:])
q7 = prog.count().reset_index()
q7.columns = ['Languages','Count']
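
The cell above works because each multi-select sub-column holds either one option name or NaN. The sketch below shows the idea on a made-up frame (`toy` and its values are illustrative); it reads the label with `dropna().iloc[0]` rather than the notebook's `mode()` call, which is an assumption that every column has at least one answer:

```python
import pandas as pd
import numpy as np

# Toy multi-select layout: one column per option, holding the option name or NaN.
toy = pd.DataFrame({
    'Q7_Part_1': ['Python', 'Python', np.nan],
    'Q7_Part_2': [np.nan, 'R', 'R'],
    'Q7_Part_3': ['SQL', np.nan, np.nan],
})

# The only non-null value in each column is the option label.
labels = [toy[col].dropna().iloc[0] for col in toy.columns]
toy.columns = labels

# count() ignores NaN, so it gives the number of respondents who ticked each option.
q7 = toy.count().rename_axis('Languages').reset_index(name='Count')
print(q7)
```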

In [None]:
# Plotting the above dataset

plt.figure(figsize=(15,5))
sns.barplot(data=q7, x='Languages', y='Count', color='Red')
plt.title('Most Used Languages', fontdict=dict(fontsize=20), color='Purple')

**This graph is self-explanatory: Python is the most popular language, followed by SQL, which is a language specific to querying data.**

In [None]:
# Gender & Programming Languages

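# We append the Q2 (gender) column to the Q list and create a new dataframe.
# Q still holds the Q7 columns at this point
Q.append('Q2')
gen_prog = df_1[Q]
# mode() labels each column with its most frequent value; for Q2 that is 'Man',
# so the gender column is renamed from 'Man' to 'Gender'
gen_prog.columns = list(gen_prog.mode().iloc[0,:])
gen_prog = gen_prog.rename({'Man':'Gender'}, axis=1)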
gen_prog = gen_prog.groupby('Gender').count()
gen_prog = gen_prog.reset_index()
gen_prog

In [None]:
# Subplots for all languages by gender

count=1
plt.figure(figsize=(30, 20))
for i in gen_prog.columns[1:]:
    plt.subplot(4,5, count)
    sns.barplot(x=gen_prog['Gender'], y=gen_prog[i])
    plt.xticks(rotation=20)
    count+=1

plt.show()

**Men outnumber women for every language.**

**The notable chart is the one for "None" (no language): it is the only chart where the number of women is close to the number of men.**
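
Raw counts mostly reflect how many more men answered the survey. To compare like with like, the share of each gender that selected a language can be computed instead. This is a minimal sketch on made-up data (the frame `toy` mimics the layout of `gen_prog` before the `groupby`; the values are illustrative):

```python
import pandas as pd
import numpy as np

# Made-up rows: a gender column plus one column per language (name or NaN).
toy = pd.DataFrame({
    'Gender': ['Man', 'Man', 'Man', 'Man', 'Woman', 'Woman'],
    'Python': ['Python', 'Python', 'Python', np.nan, 'Python', 'Python'],
    'R':      [np.nan, 'R', np.nan, np.nan, 'R', np.nan],
})

# notna() turns each selection into True/False; the mean of booleans is the
# fraction of the group that selected the language.
rates = toy.set_index('Gender').notna().groupby(level='Gender').mean().mul(100)
print(rates)
```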

In [None]:
# Women only and Languages

# .copy() so the column assignments below modify a new frame, not a view of df_1
prog_wom = df_1[Q].copy()
prog_wom.columns = list(prog_wom.mode().iloc[0,:])
prog_wom = prog_wom.rename({'Man':'Gender'}, axis=1)
for col in prog_wom.columns[:-1]:
    prog_wom[col] = np.where(prog_wom[col].isnull(), 0, 1) # 1 if the language was selected, else 0
prog_wom = prog_wom.loc[prog_wom['Gender'] == 'Woman']
#prog_wom = prog_wom.groupby('Gender').sum()
#prog_wom = prog_wom.reset_index()
prog_wom

In [None]:
# Plotting the above dataset

count=1
plt.figure(figsize=(30, 20))
plt.suptitle('Women and programming languages')
for i in prog_wom.columns[:-1]:
    plt.subplot(4,5, count)
    sns.barplot(y=prog_wom[i], x=prog_wom.Gender)
    #plt.xticks(rotation=20)
    plt.yscale('linear')
    plt.ylim((-0.2, 0.8))
    count+=1
plt.show()

**The charts above are not surprising: they show the same pattern as for men.**

In [None]:
# Roles & Coding experience

code_prof = df_1[['Time from Start to Finish (seconds)','Q5','Q6']]
code_prof = code_prof.rename({code_prof.columns[0]:'Count'}, axis=1)
code_prof = code_prof.groupby(['Q5','Q6']).count()
code_prof = code_prof.reset_index()
plt.figure(figsize=(15,8))
sns.barplot(data=code_prof, x='Q6', y='Count', hue='Q5', palette='Set2')
plt.xlabel('Years of Coding')
plt.legend(title='Role')

In [None]:
df_1.columns

In [None]:
# Education & Role

edu_role = df_1[['Time from Start to Finish (seconds)','Q4','Q5']]
edu_role = edu_role.groupby(['Q4','Q5']).count()
edu_role = edu_role.reset_index()
edu_role = edu_role.rename({edu_role.columns[-1]:'Count'}, axis=1)
fig = go.Figure([go.Bar(x=edu_role.Q5, y=edu_role.Count)])
fig.show()

In [None]:
# Role & Gender

role_gen = df_1.groupby(['Q2','Q5']).count()
role_gen = role_gen[role_gen.columns[0]]
role_gen = role_gen.reset_index()
role_gen = role_gen.rename({role_gen.columns[2]: 'Count'}, axis=1)
role_gen = role_gen.loc[(role_gen['Q2'] == 'Woman') | (role_gen['Q2'] == 'Man')]
role_gen.head()

In [None]:
plt.figure(figsize=(15,8))
sns.barplot(data=role_gen, y='Q5', hue='Q2', x='Count')
plt.ylabel('Role', fontdict=dict(fontsize=15))
plt.legend(title='Gender')
plt.title('Role & Gender', fontdict=dict(fontsize=20))

In [None]:
# IDE

#first we need to collect Q9
all_Q(df_1, 'Q9')
ide = df_1[Q].copy()
ide.columns = list(ide.mode().iloc[0, :])
ide = ide.count()
ide

In [None]:
fig = go.Figure([go.Pie(values=ide.values, labels=ide.index, title='IDE Percentage')])
fig.update_traces(hoverinfo='value+label', textinfo='percent', textfont_size=10,
                  marker=dict(line=dict(color='#000000', width=1.5)))
fig.show()

In [None]:
# Programming Languages & Role

all_Q(df_1, 'Q7')
prog_role = df_1[['Q5'] + Q].copy()
# mode() labels the role column 'Student' (its most frequent value),
# which is why the groupby below uses 'Student'
prog_role.columns = list(prog_role.mode().iloc[0,:])
prog_role = prog_role.groupby('Student').count()
prog_role

In [None]:
count=1
plt.figure(figsize=(30, 20))
plt.suptitle('Roles and programming Languages', fontsize=30)
# prog_role's columns are all languages (the role is the index), so plot every column
for i in prog_role.columns:
    plt.subplot(4,4, count)
    plt.pie(prog_role[i], labels=prog_role.index, autopct='%1.1f%%')
    plt.title(label=str(i))
    count+=1
plt.show()

In [None]:
# Melting the dataframe to convert columns to rows

prog_role_1 = prog_role.reset_index()
role_prog = pd.melt(prog_role_1, id_vars=['Student'])
# Pivot the dataframe to convert the rows to columns
role_prog = role_prog.pivot(columns='Student', values='value', index='variable')
# Drop the last role column (roles are sorted alphabetically after the pivot)
role_prog = role_prog.drop(role_prog.columns[-1], axis=1)
role_prog.head()
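
The melt-then-pivot step above swaps rows and columns. The sketch below shows the same two steps on a made-up table (`wide`, `long` and `flipped` are illustrative names) and checks that the result matches a plain transpose:

```python
import pandas as pd

# Made-up counts: one row per role, one column per language.
wide = pd.DataFrame(
    {'Python': [10, 7], 'R': [3, 5]},
    index=pd.Index(['Data Scientist', 'Student'], name='Role'),
)

# melt: one row per (role, language) pair, in columns 'variable' and 'value'.
long = pd.melt(wide.reset_index(), id_vars=['Role'])

# pivot: languages become rows, roles become columns.
flipped = long.pivot(index='variable', columns='Role', values='value')

print(flipped)
print((flipped.values == wide.T.values).all())
```

For a table that is already aggregated like this one, `wide.T` gives the same values in one step; the long form from `melt` is mainly useful when the data also needs filtering or regrouping in between.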

In [None]:
count=1
plt.figure(figsize=(30, 20))
plt.suptitle('Programming Languages and Roles', fontsize=30)
for i in role_prog.columns:
    plt.subplot(4,4, count)
    plt.pie(role_prog[i], labels=role_prog.index, autopct='%1.1f%%')
    plt.title(label=str(i))
    count+=1
plt.show()

In [None]:
# ML Vis Lib

all_Q(df_1, 'Q14')
ml_lib = df_1[Q]
ml_lib.columns = ml_lib.mode().iloc[0,:]
ml_lib = ml_lib.count()
fig = go.Figure([go.Bar(x=ml_lib.index, y=ml_lib.values)])
fig.update_traces(marker_color='pink', marker_line_color='blue')
fig.update_layout(title_text='Most Selected Viz lib')
fig.show()

In [None]:
# ML Lib

all_Q(df_1, 'Q16')
ml_lib = df_1[Q]
ml_lib.columns = ml_lib.mode().iloc[0,:]
ml_lib = ml_lib.count()
x, y = ml_lib.index, ml_lib.values
fig = go.Figure(data=[go.Bar(x=x, y=y)])
# Customize aspect
fig.update_traces(marker_color='rgb(158,202,225)', marker_line_color='rgb(8,48,107)',
                  marker_line_width=1.5, opacity=0.5)
fig.update_layout(title_text='Most Selected ML Library')
fig.show()