<center> <h1> 2020 Kaggle Machine Learning & Data Science Survey </h1>
<br>
<img src="https://storage.googleapis.com/kaggle-competitions/kaggle/23724/logos/header.png?t=2020-10-31-23-22-58"> </center>

**The aim of this analysis is to understand the responses to the 2020 Kaggle Machine Learning & Data Science Survey filled in by Kagglers. The data consists of 20,036 responses to the various questions posed.**

Let's gain some interesting insights!

In [None]:
import numpy as np
import pandas as pd
import re

import seaborn as sns

# importing plotly for creating visualizations
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from plotly.offline import init_notebook_mode
import plotly.io as pio

from sklearn.decomposition import PCA

# setting default template to plotly_dark for all visualizations
pio.templates.default = "plotly_dark"
# for charts to be rendered properly
init_notebook_mode()

from colorama import Fore, Back, Style
r_ = Fore.RED
g_ = Fore.GREEN
b_ = Fore.BLUE
c_ = Fore.CYAN
y_ = Fore.YELLOW
res = Style.RESET_ALL

In [None]:
df = pd.read_csv('/kaggle/input/kaggle-survey-2020/kaggle_survey_2020_responses.csv', low_memory=False)

In [None]:
df.head()

In [None]:
print(f"As stated in the Dataset Information, the data consists of {y_}{df.shape[0]-1}{res} Responses")

# Univariate Analysis

Let's check out the questions that were posed to Kagglers and see whether their answers match general expectations.
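In this file the first data row holds the question text rather than a response, which is why the cells below keep slicing with `[1:]`. A minimal sketch of separating the two once, using a tiny made-up frame (the names `raw`, `questions` and `responses` are illustrative, not part of the notebook):

```python
import pandas as pd

# Tiny stand-in for the survey file: row 0 holds the question text,
# the remaining rows hold responses (column names mirror the real file).
raw = pd.DataFrame({
    "Q1": ["What is your age (# years)?", "25-29", "30-34", "25-29"],
    "Q2": ["What is your gender? - Selected Choice", "Man", "Woman", None],
})

# Keep the question text as a lookup keyed by column name.
questions = raw.iloc[0]

# Drop the question row so every remaining row is one respondent.
responses = raw.iloc[1:].reset_index(drop=True)

print(questions["Q1"])
print(len(responses))
```

With this split, counts and NaN checks on `responses` no longer need the `- 1` or `[1:]` adjustments.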

## Time From Start to Finish

In [None]:
print(f"Column Header - {r_}{df.columns[0]}{res}")
print(f"Feature Represented - {c_}{df.iloc[0, 0]}{res}")

**This feature should denote the time taken to fill in the questionnaire. It will be interesting to check its distribution.**

In [None]:
df.loc[1:, 'Time from Start to Finish (seconds)'] = df.loc[1:, 'Time from Start to Finish (seconds)'].astype(int)
sec_series = df.iloc[1:, 0].value_counts().sort_index().reset_index()

print(f"Maximum Time taken for filling the form - {g_}{max(sec_series['index'])}{res} seconds")
print(f"Minimum Time taken for filling the form - {y_}{min(sec_series['index'])}{res} seconds")

The fastest person took 20 seconds to fill in the form. Hmm, that sounds shady. Let's check the share of NaN values for them.

In [None]:
# share of unanswered columns per row (all columns except the timing column)
tmp = list(df[df['Time from Start to Finish (seconds)'] == 20].isna().sum(axis=1)/(df.shape[1] - 1))
vals = [round(x, 5) for x in tmp]
print(f"Nan Ratios of the Fastest Form Fillers are {g_}{vals}{res}")

As expected, the fastest form fillers left most of the form blank.
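The NaN ratio used above can also be computed without hard-coding the number of columns. A minimal sketch on a hypothetical mini-survey (the frame `demo` and its `seconds` column are made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical mini-survey: one timing column plus four question columns.
demo = pd.DataFrame({
    "seconds": [20, 20, 640],
    "Q1": ["25-29", np.nan, "30-34"],
    "Q2": [np.nan, np.nan, "Man"],
    "Q3": [np.nan, np.nan, "India"],
    "Q4": [np.nan, np.nan, "Master's degree"],
})

question_cols = demo.columns.drop("seconds")

# Row-wise mean of the boolean NaN mask is the share of unanswered questions.
nan_ratio = demo[question_cols].isna().mean(axis=1)

# Look only at the respondents with the shortest completion time.
fastest = nan_ratio[demo["seconds"] == demo["seconds"].min()]
print(fastest.round(2).tolist())
```

Taking the mean of the mask keeps the ratio correct even if columns are added or dropped later.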

The maximum time taken is also huge! This is easily explained: the form may have been left open while the user was not filling it in. The graph would not be interpretable if we plotted the data as it is, so let's clip the outliers and then plot it.

In [None]:
sec_series = df.iloc[1:, 0][df.iloc[1:, 0]< 1500]

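# create the bins (edges 0, 20, ..., 1500 so that values up to 1499 are counted)
counts, bins = np.histogram(sec_series, bins=range(0, 1501, 20))
# use the bin centres as x positions
bins = 0.5 * (bins[:-1] + bins[1:])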

fig = px.bar(x=bins, y=counts)
             
fig.update_layout(title_text='Time taken to Fill Questions (in seconds)', 
                  xaxis=dict(title='Time in Seconds'),
                  yaxis=dict(title='Person Count'),
                  showlegend=False
                 )
fig.show()

* After clipping, the plot looks roughly like a Normal distribution
* People take a mean time of about 500 seconds, which is around 8 minutes, to fill in the form
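Because the plot is drawn after clipping at 1500 seconds, the reported centre depends on that cutoff. A minimal sketch on synthetic right-skewed times (generated data, not the survey responses; the distribution parameters are arbitrary assumptions) shows how the mean and the median react to clipping:

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic, right-skewed completion times in seconds (not the survey data).
times = rng.lognormal(mean=6.2, sigma=0.8, size=5000)

# Same cutoff as the plotting cell above.
clipped = times[times < 1500]

print(f"mean (all)      : {times.mean():.0f}")
print(f"mean (clipped)  : {clipped.mean():.0f}")
print(f"median (all)    : {np.median(times):.0f}")
print(f"median (clipped): {np.median(clipped):.0f}")
```

On skewed data like this the median typically moves less than the mean when the long tail is removed, so it may be the safer summary to quote.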

## Q1 - What is your Age?

In [None]:
print(f"Column Header - {r_}{df.columns[1]}{res}")
print(f"Question Asked - {c_}{df.iloc[0, 1]}{res}")
print(f"Number of NaN values - {y_}{df['Q1'].isna().sum()}{res}")

**Let us see the choices offered to respondents and the number of responses for each**

In [None]:
tmp = df['Q1'][1:].value_counts().sort_index().reset_index() 
ages = tmp['index']
cnts = tmp['Q1']

fig = go.Figure(data=[go.Table(header=dict(values=['Age_group','Responses'],
                                          fill_color='indigo',
                                           height=30),
                 cells=dict(values=[ages,cnts],
                            height=25))
                     ])
fig.update_layout(title_text='Age Group choices')

fig.show()

In [None]:
colors = ['cyan']*len(tmp)

fig = go.Figure(data=[go.Bar(
    x=tmp['index'],
    y=tmp['Q1'],
    marker_color=colors # marker color can be a single color value or an iterable
)])

fig.update_layout(title='Age group Responses',
                 xaxis=dict(title='Age Group'),
                 yaxis=dict(title='Count'))



fig.show()

* Only respondents aged 18 and above filled in the questionnaire. This probably reflects an age restriction on the survey.
* Younger respondents outnumber older ones, according to the data
* There are even Kagglers aged 70+!

##  Q2 - What is your Gender?

In [None]:
print(f"Column Header - {r_}{df.columns[2]}{res}")
print(f"Question Asked - {c_}{df.iloc[0, 2]}{res}")
print(f"Number of NaN values - {y_}{df['Q2'].isna().sum()}{res}")

**Let's see the responses**

In [None]:
tmp = df['Q2'][1:].value_counts().reset_index()
fig = go.Figure(data=[go.Pie(labels=tmp['index'], values=tmp['Q2'],
                             sort=False)])


fig.update_layout(title_text='Gender Responses')
fig.show()

In [None]:
colors = ['orange'] * len(tmp)

fig = go.Figure(data=[go.Bar( x = tmp['index'],
                            y = tmp['Q2'],
                            marker_color=colors)])

fig.update_layout(title='Gender Responses',
                 xaxis=dict(title='Gender'),
                 yaxis=dict(title='Count'))

fig.show()

#### This data shows that Kaggle respondents are predominantly male. A total of 78.8% of responses are from men!

##  Q3 - Which Country are you from?

In [None]:
print(f"Column Header - {r_}{df.columns[3]}{res}")
print(f"Question Asked - {c_}{df.iloc[0, 3]}{res}")
print(f"Number of NaN values - {y_}{df['Q3'].isna().sum()}{res}")

In [None]:
tmp = df['Q3'][1:].value_counts().reset_index()

fig = go.Figure(data=[go.Pie(labels=tmp['index'], values=tmp['Q3'])])
fig.update_traces(textposition='inside')

fig.update_layout(title_text='Responses by Country')
fig.show()

#### The most responses come from India, followed by the USA. Together they make up about 40% of the survey data
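A combined share like the one quoted above can be read off a cumulative sum of normalized counts. A minimal sketch with made-up country answers (the proportions here are illustrative and do not reproduce the survey figures):

```python
import pandas as pd

# Made-up answers; the real Q3 column has many more countries.
countries = pd.Series(
    ["India"] * 5 + ["United States of America"] * 3 + ["Brazil", "Japan"]
)

# value_counts sorts from most to least frequent; normalize=True gives fractions.
share = countries.value_counts(normalize=True)

# Running total of the shares: entry k is the combined share of the top k+1 countries.
cumulative = share.cumsum()
top2 = cumulative.iloc[1]

print(f"Top two countries cover {top2:.0%} of responses")
```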

##  Q4 - Highest level of education

In [None]:
print(f"Column Header - {r_}{df.columns[4]}{res}")
print(f"Question Asked - {c_}{df.iloc[0, 4]}{res}")
print(f"Number of NaN values - {y_}{df['Q4'].isna().sum()}{res}")

In [None]:
tmp = df['Q4'][1:].value_counts().sort_index().reset_index()

fig = go.Figure(data=[go.Table(header=dict(values=['Education Level','Responses'],
                                          fill_color='gray',
                                           height=30),
                 cells=dict(values=[tmp['index'], tmp['Q4']],
                            height=25))
                     ])
fig.update_layout(title_text='Education choices', height=400)

fig.show()

In [None]:
fig = go.Figure(data=[go.Pie(labels=tmp['index'], values=tmp['Q4'],
                             sort=False,marker=dict(colors=px.colors.qualitative.Prism))])


fig.update_layout(title_text='Education Responses')
fig.show()

#### Most respondents hold either a Master's degree or a Bachelor's degree. These two levels account for 75.9% of the survey population!

## Q5 - Current Role

In [None]:
print(f"Column Header - {r_}{df.columns[5]}{res}")
print(f"Question Asked - {c_}{df.iloc[0, 5]}{res}")
print(f"Number of NaN values - {y_}{df['Q5'].isna().sum()}{res}")

In [None]:
tmp = df['Q5'][1:].value_counts().sort_index().reset_index()

fig = go.Figure(data=[go.Table(header=dict(values=['Current Role','Responses'],
                                          fill_color='orange',
                                           height=30),
                 cells=dict(values=[tmp['index'], tmp['Q5']],
                            height=22,
                           fill=dict(color='blue')))
                     ])
fig.update_layout(title_text='Current Role choices')

fig.show()

In [None]:
colors = ['orange'] * len(tmp)

fig = go.Figure(data=[go.Bar( y = tmp['index'],
                            x = tmp['Q5'],
                            marker_color=colors,
                            orientation='h')])

fig.update_layout(title='Role Responses',
                 yaxis=dict(title='Role'),
                 xaxis=dict(title='Count'))

fig.show()

#### Kaggle is a platform for learning, so it is no surprise that Students are the largest category of Kagglers. Next in line is what many of these students are trying to become: Data Scientists

##  Q6 - Years of Programming

In [None]:
print(f"Column Header - {r_}{df.columns[6]}{res}")
print(f"Question Asked - {c_}{df.iloc[0, 6]}{res}")
print(f"Number of NaN values - {y_}{df['Q6'].isna().sum()}{res}")

In [None]:
tmp = df['Q6'][1:].value_counts().sort_index().reset_index()

fig = go.Figure(data=[go.Table(header=dict(values=['Years of Programming','Responses'],
                                          fill_color='gray',
                                           height=30),
                 cells=dict(values=[tmp['index'], tmp['Q6']],
                            height=22,
                           fill=dict(color='indigo')))
                     ])
fig.update_layout(title_text='Years of Programming', height=400)

fig.show()

In [None]:
fig = go.Figure(data=[go.Pie(labels=tmp['index'], values=tmp['Q6'],
                             sort=False,marker=dict(colors=px.colors.qualitative.Prism))])


fig.update_layout(title_text='Years of Programming')
fig.show()

#### We have a fairly similar number of people in each programming-experience group. It suggests that Kaggle is a diverse platform that caters to everyone from beginners to experts!

#### It will be interesting to examine how coding experience relates to other features such as education and age. This will be done in the Multivariate Analysis
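One common way to relate two single-choice questions is a row-normalized cross-tabulation. A minimal sketch with made-up answers (the frame `demo` is hypothetical and the labels are only illustrative of what Q4 and Q6 might contain):

```python
import pandas as pd

# Made-up answers for education (Q4) and programming experience (Q6).
demo = pd.DataFrame({
    "Q4": ["Bachelor's degree", "Master's degree", "Master's degree",
           "Bachelor's degree", "Doctoral degree", "Master's degree"],
    "Q6": ["< 1 years", "3-5 years", "5-10 years",
           "1-2 years", "10-20 years", "3-5 years"],
})

# normalize="index" turns each row into shares that sum to 1,
# i.e. the experience mix within each education level.
table = pd.crosstab(demo["Q4"], demo["Q6"], normalize="index")
print(table.round(2))
```

Row-normalizing makes education levels with very different respondent counts directly comparable.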

##  Q7 - Type of Programming Language Used

### This Question Explores the Programming Language Used by the Kagglers.

In [None]:
q7cols = df.columns[df.columns.str.startswith('Q7')].tolist()
print(f"This Question consists of Many Parts - \n{b_}{q7cols}{res}")

### Each Part allows the Kaggler to select the language
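Since several later questions share this one-column-per-option layout, the counting step could be wrapped in a small helper. A minimal sketch, assuming the question row has already been removed and that each part column contains a single distinct option (the function name `count_multiselect` and the demo frame are hypothetical):

```python
import pandas as pd

def count_multiselect(frame: pd.DataFrame, prefix: str) -> pd.Series:
    """Count selections for every column named `<prefix>_...`."""
    counts = {}
    # Match on "Q7_" rather than "Q7" so that e.g. "Q70_..." is not picked up.
    for col in frame.columns[frame.columns.str.startswith(f"{prefix}_")]:
        answers = frame[col].dropna()
        if answers.empty:
            continue  # nobody picked this option
        label = str(answers.iloc[0]).strip()
        counts[label] = len(answers)
    return pd.Series(counts, dtype="int64").sort_values(ascending=False)

# Made-up responses (question row already removed).
demo = pd.DataFrame({
    "Q7_Part_1": ["Python", "Python", None, "Python"],
    "Q7_Part_2": [None, "R", None, None],
    "Q7_Part_3": ["SQL", None, None, "SQL"],
    "Q70_Part_1": ["x", "x", "x", "x"],
})

result = count_multiselect(demo, "Q7")
print(result)
```

The returned Series is already sorted, so its index and values can be passed straight to a table or bar chart.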

In [None]:
lang_dict = {}
for col in q7cols:
    # skip row 0 (the question text); each part column holds a single language
    counts = df[col][1:].value_counts()
    lang_dict[counts.index[0]] = counts.iloc[0]

In [None]:
lang_dict = {k: v for k, v in sorted(lang_dict.items(), key=lambda item: item[1], reverse=True)}
fig = go.Figure(data=[go.Table(header=dict(values=['Programming Language','People Count'],
                                          fill_color='indigo',
                                           height=30),
                 cells=dict(values=[list(lang_dict.keys()), list(lang_dict.values())],
                            height=25,
                           fill=dict(color='chocolate')))
                     ])
fig.update_layout(title_text='Programming Language Used', height=600)

fig.show()

In [None]:
fig = go.Figure(data=[go.Pie(labels=list(lang_dict.keys()), values=list(lang_dict.values()),
                             sort=False,marker=dict(colors=px.colors.qualitative.Prism))])


fig.update_layout(title_text='Programming Language Used')
fig.show()

### Python, SQL and R dominate on Kaggle, as expected. They are the most preferred data science languages

##  Q8 - Programming Language Recommended

In [None]:
print(f"Column Header - {r_}{df.columns[20]}{res}")
print(f"Question Asked - {c_}{df.iloc[0, 20]}{res}")
print(f"Number of NaN values - {y_}{df['Q8'].isna().sum()}{res}")

In [None]:
tmp = df.iloc[1:, :]['Q8'].value_counts().reset_index()
colors = ['cyan'] * len(tmp)

fig = go.Figure(data=[go.Bar( x = tmp['index'],
                            y = tmp['Q8'],
                            marker_color=colors)])

fig.update_layout(title='Programming language Recommended',
                 xaxis=dict(title='Programming Language'),
                 yaxis=dict(title='Count'))

fig.show()

### Python is the programming language most recommended by Kagglers.

## Q9 - Programming IDE Used on a Regular Basis

### This Question explores the Programming IDE Used by the Kagglers

In [None]:
q9cols = df.columns[df.columns.str.startswith('Q9')].tolist()
print(f"This Question consists of Many Parts - \n{g_}{q9cols}{res}")

In [None]:
ide_dict = {}
for col in q9cols:
    # skip row 0 (the question text); each part column holds a single IDE name
    counts = df[col][1:].value_counts()
    ide_dict[counts.index[0].strip()] = counts.iloc[0]

In [None]:
ide_dict = {k: v for k, v in sorted(ide_dict.items(), key=lambda item: item[1], reverse=True)}
fig = go.Figure(data=[go.Table(header=dict(values=['Programming IDE','People Count'],
                                          fill_color='magenta',
                                           height=30),
                 cells=dict(values=[list(ide_dict.keys()), list(ide_dict.values())],
                            height=25,
                           fill=dict(color='green')))
                     ])
fig.update_layout(title_text='Programming IDE Used on a Regular Basis', height=600)

fig.show()

In [None]:
fig = go.Figure(data=[go.Pie(labels=list(ide_dict.keys()), values=list(ide_dict.values()),
                             sort=False,marker=dict(colors=px.colors.qualitative.Prism_r))])


fig.update_layout(title_text='Programming IDE Usage by Kagglers')
fig.show()

### Jupyter Notebook is, as expected, the most popular IDE, followed by VSCode.

The top IDEs include Jupyter and PyCharm, which are Python-oriented, followed by RStudio, which is R-oriented. This fits well with the language usage seen in Q7, where Python and R were both among the most popular languages

## Q10 -  Hosted Notebook Product Used on a Regular Basis

In [None]:
setcols = df.columns[df.columns.str.startswith('Q10')].tolist()

val_dict = {}
for col in setcols:
    ftr = df[col].value_counts().reset_index().iloc[0,0].lstrip().rstrip()
    val = df[col].value_counts().reset_index().iloc[0,1]
    val_dict[ftr] = val
    
val_dict = {k: v for k, v in sorted(val_dict.items(), key=lambda item: item[1], reverse=True)}

colors = ['cyan'] * len(val_dict)

fig = go.Figure(data=[go.Bar( x = list(val_dict.keys()),
                            y = list(val_dict.values()),
                            marker_color=colors)])

fig.update_layout(title='Hosted Notebook Product Used',
                 xaxis=dict(title='Notebook Product'),
                 yaxis=dict(title='Count'))

fig.show()


### Google Colab and Kaggle Notebooks dominate the hosted notebooks used by Kagglers

## Q11 - Computing Platform used

In [None]:
tmp = df['Q11'][1:].value_counts().sort_index().reset_index()

fig = go.Figure(data=[go.Pie(labels=tmp['index'], values=tmp['Q11'],
                             sort=False,marker=dict(colors=px.colors.qualitative.Prism_r))])


fig.update_layout(title_text='Computing Platform Used on a Regular basis')
fig.show()

### The majority (78.4%) of Kagglers prefer to work on a personal computer or laptop

## Q12 -  Specialized Computing Hardware

In [None]:
setcols = df.columns[df.columns.str.startswith('Q12')].tolist()

val_dict = {}
for col in setcols:
    ftr = df[col].value_counts().reset_index().iloc[0,0].lstrip().rstrip()
    val = df[col].value_counts().reset_index().iloc[0,1]
    val_dict[ftr] = val
    
val_dict = {k: v for k, v in sorted(val_dict.items(), key=lambda item: item[1], reverse=True)}

colors = ['cyan'] * len(val_dict)

fig = go.Figure(data=[go.Bar( x = list(val_dict.keys()),
                            y = list(val_dict.values()),
                            marker_color=colors)])

fig.update_layout(title='Specialized Computing Hardware',
                 xaxis=dict(title='Hardware'),
                 yaxis=dict(title='Count'))

fig.show()

### 'None' can be assumed to mean CPU only. As we saw before, the majority of Kagglers use Google Colab or Kaggle Notebooks, for which GPU access may be a major attraction

## Q13 -  TPU Usage

In [None]:
tmp = df['Q13'][1:].value_counts().sort_index().reset_index()

fig = go.Figure(data=[go.Pie(labels=tmp['index'], values=tmp['Q13'],
                             sort=False,marker=dict(colors=px.colors.qualitative.Prism))])


fig.update_layout(title_text='Number of Times TPU was Used')
fig.show()

### As the previous plot also suggested, the majority of Kagglers have not used a TPU. This might be because running on a TPU requires TPU-specific code

## Q14 -  Visualization Libraries/tools

In [None]:
setcols = df.columns[df.columns.str.startswith('Q14')].tolist()

val_dict = {}
for col in setcols:
    ftr = df[col].value_counts().reset_index().iloc[0,0].lstrip().rstrip()
    val = df[col].value_counts().reset_index().iloc[0,1]
    val_dict[ftr] = val
    
val_dict = {k: v for k, v in sorted(val_dict.items(), key=lambda item: item[1], reverse=True)}

colors = ['yellow'] * len(val_dict)

fig = go.Figure(data=[go.Bar( x = list(val_dict.keys()),
                            y = list(val_dict.values()),
                            marker_color=colors)])

fig.update_layout(title='Visualization Libraries and tools on a Regular Basis',
                 xaxis=dict(title='Library/tool'),
                 yaxis=dict(title='Count'))

fig.show()

### Matplotlib is the most used library. It has been around for a long time, and its extensive documentation and support are a major attraction. In fact, it forms the base for the second most used library, Seaborn.

### Next in line is Plotly which is used to provide these beautiful visualizations!

## Q15 - Years of Machine learning

In [None]:
tmp = df['Q15'][1:].value_counts().sort_index().reset_index()

fig = go.Figure(data=[go.Pie(labels=tmp['index'], values=tmp['Q15'],
                             sort=False,marker=dict(colors=px.colors.qualitative.Plotly))])


fig.update_layout(title_text='Years of Machine learning Usage')
fig.show()

### The majority of Kagglers have only 0-2 years of machine learning experience, which suggests that many are newcomers to the field
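Experience buckets are text labels, so `sort_index()` orders them alphabetically rather than by duration, which can scramble the slices of a pie drawn with `sort=False`. A minimal sketch of imposing a logical order instead (the label list `order` is an assumption and should be adjusted to the labels actually present in the column):

```python
import pandas as pd

# Assumed label order for an experience question; adjust to the real labels.
order = ["Under 1 year", "1-2 years", "2-3 years", "3-4 years",
         "4-5 years", "5-10 years", "10-20 years", "20 or more years"]

# Made-up answers.
answers = pd.Series(["1-2 years", "Under 1 year", "10-20 years", "1-2 years",
                     "5-10 years", "Under 1 year", "1-2 years"])

# Alphabetical order: "10-20 years" lands right after "1-2 years".
alphabetical = answers.value_counts().sort_index()

# Logical order: reindex against the explicit list, filling unseen buckets with 0.
logical = answers.value_counts().reindex(order, fill_value=0)

print(list(alphabetical.index))
print(logical)
```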

## Q16 - Machine learning Frameworks Used

In [None]:
setcols = df.columns[df.columns.str.startswith('Q16')].tolist()

val_dict = {}
for col in setcols:
    ftr = df[col].value_counts().reset_index().iloc[0,0].lstrip().rstrip()
    val = df[col].value_counts().reset_index().iloc[0,1]
    val_dict[ftr] = val
    
val_dict = {k: v for k, v in sorted(val_dict.items(), key=lambda item: item[1], reverse=True)}

colors = ['red'] * len(val_dict)

fig = go.Figure(data=[go.Bar( x = list(val_dict.keys()),
                            y = list(val_dict.values()),
                            marker_color=colors)])

fig.update_layout(title='Machine Learning frameworks Used',
                 xaxis=dict(title='Framework'),
                 yaxis=dict(title='Count'))

fig.show()

### Scikit-learn, the evergreen library, is unsurprisingly the most used framework. It is followed by popular frameworks such as TensorFlow, Keras and PyTorch.

## Q17 - Popular ML Algorithms

In [None]:
setcols = df.columns[df.columns.str.startswith('Q17')].tolist()

val_dict = {}
for col in setcols:
    ftr = df[col].value_counts().reset_index().iloc[0,0].lstrip().rstrip()
    val = df[col].value_counts().reset_index().iloc[0,1]
    val_dict[ftr] = val
    
val_dict = {k: v for k, v in sorted(val_dict.items(), key=lambda item: item[1], reverse=False)}

colors = ['green'] * len(val_dict)

fig = go.Figure(data=[go.Bar( y = list(val_dict.keys()),
                            x = list(val_dict.values()),
                            marker_color=colors,
                            orientation='h')])

fig.update_layout(title='ML Algorithms Used on a Regular Basis',
                 yaxis=dict(title='Algorithm'),
                 xaxis=dict(title='Count'))

fig.show()

### Basic algorithms such as Linear/Logistic Regression are, as expected, the most popular. They are followed by Decision Trees and Random Forests.

## Q18 - Computer Vision Methods Used

In [None]:
setcols = df.columns[df.columns.str.startswith('Q18')].tolist()

val_dict = {}
for col in setcols:
    ftr = df[col].value_counts().reset_index().iloc[0,0].lstrip().rstrip()
    val = df[col].value_counts().reset_index().iloc[0,1]
    val_dict[ftr] = val
    
val_dict = {k: v for k, v in sorted(val_dict.items(), key=lambda item: item[1], reverse=False)}

colors = ['lightcyan'] * len(val_dict)

fig = go.Figure(data=[go.Bar( y = list(val_dict.keys()),
                            x = list(val_dict.values()),
                            marker_color=colors,
                            orientation='h')])

fig.update_layout(title='Computer Vision Algorithms Used on a Regular Basis',
                 yaxis=dict(title='Algorithm'),
                 xaxis=dict(title='Count'))

fig.show()

### Basic image classification networks clearly dominate people's interest.

## Q19 - Natural Language Processing Methods

In [None]:
setcols = df.columns[df.columns.str.startswith('Q19')].tolist()

val_dict = {}
for col in setcols:
    ftr = df[col].value_counts().reset_index().iloc[0,0].lstrip().rstrip()
    val = df[col].value_counts().reset_index().iloc[0,1]
    val_dict[ftr] = val
    
val_dict = {k: v for k, v in sorted(val_dict.items(), key=lambda item: item[1], reverse=False)}

colors = ['grey'] * len(val_dict)

fig = go.Figure(data=[go.Bar( y = list(val_dict.keys()),
                            x = list(val_dict.values()),
                            marker_color=colors,
                            orientation='h')])

fig.update_layout(title='NLP Methods Used on a Regular Basis',
                 yaxis=dict(title='NLP'),
                 xaxis=dict(title='Count'))

fig.show()

### Word embeddings/vectors are the most popular methods, followed by encoder-decoder models
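Raw counts for multi-select questions such as this one depend on how many people answered the question at all, so percentages of respondents can be easier to compare across questions. A minimal sketch, assuming that anyone who ticked at least one option counts as having answered (the frame `demo` and its shortened labels are made up):

```python
import pandas as pd

# Made-up responses with shortened labels (question row already removed).
demo = pd.DataFrame({
    "Q19_Part_1": ["Word embeddings", "Word embeddings", None, "Word embeddings", None],
    "Q19_Part_2": ["Encoder-decoder", None, None, "Encoder-decoder", None],
    "Q19_Part_3": [None, None, "Transformers", None, None],
})

selected = demo.notna()
# Respondents who ticked at least one option are treated as having answered.
answered = selected.any(axis=1)

# Map each part column to the option label it holds.
labels = demo.apply(lambda col: col.dropna().iloc[0]).to_dict()

# Mean of the boolean mask among those who answered = share selecting each option.
pct = selected[answered].mean().mul(100).round(1).rename(index=labels)
print(pct)
```

Because respondents can pick several options, these percentages need not sum to 100.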

# Work in Progress ...

### Do Upvote if you find it useful!