<h1><center><font size="6">Data Scientists in 2020 - Kaggle Survey</font></center></h1>

<img src="https://upload.wikimedia.org/wikipedia/commons/7/7c/Kaggle_logo.png"></img>

# <a id='0'>Content</a>

- <a href='#1'>Introduction</a>  
- <a href='#2'>Prepare the data analysis</a>  
 - <a href='#21'>Load packages</a>  
 - <a href='#22'>Load the data</a>  
- <a href='#3'>Data exploration</a>   
- <a href='#4'>Combine the features</a>   
- <a href='#5'>Final note</a>   

# <a id='1'>Introduction</a>  

We will analyze the dataset `2020 Kaggle ML & DS Survey` with answers provided by the respondents to the survey of Kaggle users in 2020.


# <a id='2'>Prepare the data analysis</a>   


Before starting the analysis, we need a few preparation steps: load the packages, then load and inspect the data.



## <a id='21'>Load packages</a>

We load the packages used for the analysis.

In [None]:
import pandas as pd
import numpy as np
import sys
import os
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.graph_objs as go
import plotly.figure_factory as ff
from plotly import tools
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
init_notebook_mode(connected=True)

<a href="#0"><font size="1">Go to top</font></a>  


## <a id='22'>Load the data</a>  

Let's first see which data files are available in the input directory.

In [None]:
os.listdir("../input/kaggle-survey-2020")

Let's load the survey responses file, `kaggle_survey_2020_responses.csv`.

In [None]:
multiple_df = pd.read_csv('../input/kaggle-survey-2020/kaggle_survey_2020_responses.csv', low_memory=False)

In [None]:
print("Multiple choice response - rows: {} columns: {}".format(multiple_df.shape[0], multiple_df.shape[1]))


<a href="#0"><font size="1">Go to top</font></a>  


# <a id='3'>Data exploration</a>  


Let's start by exploring the multiple choice response dataset.

## Glimpse the data

In [None]:
multiple_df.head(3)

Because the first row contains the description (question text) of each column, we will read the categorical values of each column starting from the second row.
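
As a minimal sketch of this layout, the toy frame below mimics the survey file: the first row holds a description of each column and the remaining rows hold the answers. The frame `toy_df`, its values, and the names `questions` and `responses` are assumptions made for illustration; they are not part of the survey data.

```python
import pandas as pd

# Toy stand-in for the survey file (made-up values):
# row 0 describes the column, the other rows are answers.
toy_df = pd.DataFrame({
    "Q1": ["Age question (text)", "25-29", "30-34", "25-29"],
    "Q2": ["Gender question (text)", "Man", "Woman", "Man"],
})

questions = toy_df.iloc[0]    # Series: column name -> description
responses = toy_df.iloc[1:]   # answers only, description row excluded

print(questions["Q1"])
print(responses["Q1"].value_counts())
```

Keeping the description row in a separate object makes it easy to reuse the question text later, for example as a plot title.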

## Missing data

Let's represent the distribution of available data for all the columns, using a boxplot.
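
The idea can be sketched on a small example first. The snippet below assumes a made-up frame `toy_df` with a few missing answers and computes, per column, the percent of missing and of available values; the names `missing_percent` and `available_percent` are illustrative only.

```python
import numpy as np
import pandas as pd

# Made-up answers; NaN marks a question the respondent did not answer.
toy_df = pd.DataFrame({
    "Q1": ["25-29", "30-34", "25-29", "18-21"],
    "Q8": ["Python", np.nan, "R", np.nan],
    "Q11": [np.nan, np.nan, np.nan, "A laptop"],
})

# The mean of a boolean mask is the fraction of True values per column.
missing_percent = toy_df.isnull().mean() * 100
available_percent = 100 - missing_percent

print(available_percent.sort_values())
```

A boxplot of `available_percent` then summarizes how complete the columns are as a whole.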

In [None]:
def missing_data(data):
    total = data.isnull().sum().sort_values(ascending = False)
    percent = (data.isnull().sum()/data.isnull().count()*100).sort_values(ascending = False)
    return pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])

In [None]:
df = missing_data(multiple_df)

In [None]:
def plot_percent_of_available_data(title):
    # Uses the global `df` returned by missing_data(). Its 'Percent' column holds
    # the percent of missing values, so the available share is 100 minus that.
    trace = go.Box(
        x = 100 - df['Percent'],
        name="Percent",
        marker=dict(
            color='rgba(238,23,11,0.5)',
            line=dict(
                color='tomato',
                width=0.9),
        ),
        orientation='h')
    data = [trace]
    layout = dict(title = 'Percent of available data - all columns ({})'.format(title),
              xaxis = dict(title = 'Percent', showticklabels=True), 
              yaxis = dict(title = 'All columns'),
              hovermode = 'closest',
             )
    fig = dict(data=data, layout=layout)
    iplot(fig, filename='percent')

In [None]:
plot_percent_of_available_data('multiple_df')



<a href="#0"><font size="1">Go to top</font></a>  

## Columns to visualize


Some questions allow several answers; these are stored as groups of columns, one column per answer option, with names like `Q7_Part_1`, `Q7_Part_2`, and so on. 
Let's check which questions have only one column in their group. For this we compose filters like `Q1`, `Q2`, ..., `Q11`, `Q12`, select the columns that belong to each filter and count them. We keep only the questions with a single column in the group. These are the columns we will visualize in the following sections.
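
A minimal sketch of this filtering, on hypothetical column names, is shown below. The list `toy_columns` and the helper `columns_for_question` are assumptions for illustration. The sketch also shows why a plain substring test is risky: the filter `Q1` is contained in `Q11` and `Q12_Part_1` as well.

```python
import pandas as pd

# Hypothetical column names that mimic the survey layout.
toy_columns = pd.Series(
    ["Q1", "Q2", "Q7_Part_1", "Q7_Part_2", "Q11", "Q12_Part_1", "Q12_Part_2"]
)

def columns_for_question(names, number):
    """Return the column names that belong to question `number`."""
    # Anchor at the start and require "_" or end-of-string after the number,
    # so that "Q1" does not also match "Q11" or "Q12_Part_1".
    pattern = r"^Q{}(?:_|$)".format(number)
    return list(names[names.str.contains(pattern, regex=True)])

# A plain substring test over-counts: Q1, Q11, Q12_Part_1, Q12_Part_2.
print(toy_columns.str.contains("Q1", regex=False).sum())  # 4

single_item = [
    "Q{}".format(i) for i in range(1, 13)
    if len(columns_for_question(toy_columns, i)) == 1
]
print(single_item)  # ['Q1', 'Q2', 'Q11']
```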



In [None]:
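col_names = pd.Series(multiple_df.columns.values)
columns = []
for i in range(1,50):
    var = "Q{}".format(i)
    # Match "Q1" or "Q1_..." only; a plain substring search for "Q1"
    # would also count "Q10", "Q11", ... as part of the same group.
    n_items = col_names.str.contains(r"^{}(?:_|$)".format(var)).sum()
    if n_items == 1:
        columns.append(var)

print("The columns with only one item in the column group are:\n",columns)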

We will visualize these single-column questions in the following sections, starting with the demographic questions `Q1` to `Q6`.


<a href="#0"><font size="1">Go to top</font></a>  

## Age interval

Let's show the age intervals, as declared by the respondents. 
Here we also create a function that counts the categories of a categorical column and a function that draws bar plots using Plotly.
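
As a small illustration of the counting step, the sketch below turns a made-up series of answers into a two-column table of categories and counts. The series `toy_answers` and the column names `index` and `Number` are chosen here for the example; naming the category column explicitly avoids depending on how a given pandas version labels the result of `value_counts`.

```python
import pandas as pd

# Made-up age answers.
toy_answers = pd.Series(["25-29", "30-34", "25-29", "18-21", "25-29"], name="Q1")

counts = (
    toy_answers.value_counts()   # categories sorted by frequency
    .rename_axis("index")        # name the category axis explicitly
    .reset_index(name="Number")  # two columns: 'index' and 'Number'
)
print(counts)
```

The resulting frame can be passed directly to a bar plot, with `index` on the x axis and `Number` on the y axis.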

In [None]:
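def get_categories(data, val):
    # Skip row 0 (question text) and count each answer category
    tmp = data[val].iloc[1:].value_counts()
    # Name the category column 'index' explicitly; the plotting code reads data['index']
    return tmp.rename_axis('index').reset_index(name='Number')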

In [None]:
df = get_categories(multiple_df, 'Q1')

In [None]:
def draw_trace_bar(data, title, xlab, ylab,color='Blue'):
    trace = go.Bar(
            x = data['index'],
            y = data['Number'],
            marker=dict(color=color),
            text=data['index']
        )
    data = [trace]

    layout = dict(title = title,
              xaxis = dict(title = xlab, showticklabels=True, tickangle=15,
                          tickfont=dict(
                            size=9,
                            color='black'),), 
              yaxis = dict(title = ylab),
              hovermode = 'closest'
             )
    fig = dict(data = data, layout = layout)
    iplot(fig, filename='draw_trace')

In [None]:
draw_trace_bar(df, 'Number of people/age interval', 'Age interval', 'Number of people' )


<a href="#0"><font size="1">Go to top</font></a>  



## Gender

Let's explore the gender groups.

In [None]:
draw_trace_bar(get_categories(multiple_df,'Q2'), "Number of people in each gender", "Gender", "Number of people", "Green")


<a href="#0"><font size="1">Go to top</font></a>  



## Country

Let's plot the number of responses per country.

In [None]:
df = get_categories(multiple_df, 'Q3')
df.head()

In [None]:
trace = go.Choropleth(
            locations = df['index'],
            locationmode='country names',
            z = df['Number'],
            text = df['index'],
            autocolorscale =False,
            reversescale = True,
            colorscale = 'rainbow',
            marker = dict(
                line = dict(
                    color = 'rgb(0,0,0)',
                    width = 0.5)
            ),
            colorbar = dict(
                title = 'Respondents',
                tickprefix = '')
        )

data = [trace]
layout = go.Layout(
    title = 'Number of respondents per country',
    geo = dict(
        showframe = True,
        showlakes = False,
        showcoastlines = True,
        projection = dict(
            type = 'natural earth'
        )
    )
)

fig = dict( data=data, layout=layout )
iplot(fig)


<a href="#0"><font size="1">Go to top</font></a>  

## Highest level of formal education

In [None]:
draw_trace_bar(get_categories(multiple_df,'Q4'), "Highest level of formal education", "Education", "Number of people", "Magenta")

## Current job title

The next question is about the respondent's current job title.

In [None]:
draw_trace_bar(get_categories(multiple_df,'Q5'), "Current job title", "Current job title", "Number of respondents", "Tomato")


<a href="#0"><font size="1">Go to top</font></a>  


## Years of experience writing code

The next question is about the number of years of experience writing code.

In [None]:
draw_trace_bar(get_categories(multiple_df,'Q6'), "Years of experience", "Years of experience", "Number of respondents", "Red")

<a href="#0"><font size="1">Go to top</font></a>  


## What programming language to learn first?

In [None]:
draw_trace_bar(get_categories(multiple_df,'Q8'), 
               "What programming language would you recommend an aspiring data scientist to learn first?", 
               "Programming language", "Number of respondents", "Lightblue")

<a href="#0"><font size="1">Go to top</font></a>  


## What computing platform is used most often?

In [None]:
draw_trace_bar(get_categories(multiple_df,'Q11'), 
               "What type of computing platform do you use most often for your data science projects?", 
               "Computing platform", "Number of respondents", "Orange")

<a href="#0"><font size="1">Go to top</font></a>  


## Does the employer incorporate ML methods into the business?

In [None]:
draw_trace_bar(get_categories(multiple_df,'Q22'), 
               multiple_df['Q22'].values[0], 
               "ML methods included in the business", "Number of respondents", "Lightgray")

<a href="#0"><font size="1">Go to top</font></a>  


## Individuals responsible for data science at your place of business?

In [None]:
draw_trace_bar(get_categories(multiple_df,'Q21'), 
               multiple_df['Q21'].values[0], 
               "Individuals involved in Data Science", "Number of respondents", "Darkgreen")

## What is your current yearly compensation?

In [None]:
draw_trace_bar(get_categories(multiple_df,'Q24'), multiple_df['Q24'][0], "Option", "Number of respondents", "Lightgreen")

## How much money have you or your team spent on machine learning or cloud computing?

In [None]:
draw_trace_bar(get_categories(multiple_df,'Q25'), multiple_df['Q25'][0], "Option", "Number of respondents", "Gold")

## Big data products used most often?

In [None]:
draw_trace_bar(get_categories(multiple_df,'Q30'), multiple_df['Q30'][0], "Option", "Number of respondents", "Cyan")

## Business intelligence tools used

In [None]:
draw_trace_bar(get_categories(multiple_df,'Q32'), multiple_df['Q32'][0], "Option", "Number of respondents", "Darkblue")

## Primary tool used at work for data analysis

In [None]:
draw_trace_bar(get_categories(multiple_df,'Q38'), multiple_df['Q38'][0], "Option", "Number of respondents", "steelblue")


# <a id='4'>Combine the features</a>


Let's visualize some of the dimensions presented previously in combination. For example, let's look at the combined distribution of gender and age to see how respondents are spread across the two.
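
Before plotting, the combined counts can be sketched on a toy example. The frame `toy_df` below is made up; `pd.crosstab` gives a wide table of counts, while grouping on both columns gives the long format (one row per pair of categories) that suits grouped bar plots. The names `pair_counts` and `long_counts` are illustrative.

```python
import pandas as pd

# Made-up answers for two categorical questions.
toy_df = pd.DataFrame({
    "Q1": ["25-29", "25-29", "30-34", "30-34", "30-34"],
    "Q2": ["Man", "Woman", "Man", "Man", "Woman"],
})

# Wide format: one row per age interval, one column per gender.
pair_counts = pd.crosstab(toy_df["Q1"], toy_df["Q2"])
print(pair_counts)

# Long format: one row per (age interval, gender) pair with its count.
long_counts = toy_df.groupby(["Q1", "Q2"]).size().reset_index(name="Number")
print(long_counts)
```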

## Number of respondents by Gender and Age

In [None]:
def get_categories_group(data, val_group, val):
    tmp = data[1::].groupby(val_group)[val].value_counts()
    return pd.DataFrame(data={'Number': tmp.values}, index=tmp.index).reset_index()

In [None]:
def draw_trace_group_bar(data_df, val_group, val, title, xlab, ylab,color='Blue'):
    data = list()
    groups = (data_df.groupby([val_group])[val_group].nunique()).index
    for group in groups:
        data_group_df = data_df[data_df[val_group]==group]
        trace = go.Bar(
                x = data_group_df[val],
                y = data_group_df['Number'],
                name = group,
                #marker=dict(color=color),
                text=data_group_df[val]
            )
        data.append(trace)

    layout = dict(title = title,
              xaxis = dict(title = xlab, showticklabels=True, tickangle=15,
                          tickfont=dict(
                            size=9,
                            color='black'),), 
              yaxis = dict(title = ylab),
              hovermode = 'closest'
             )
    fig = dict(data = data, layout = layout)
    iplot(fig, filename='draw_trace')

In [None]:
df = get_categories_group(multiple_df, 'Q1', 'Q2')
draw_trace_group_bar(df, 'Q1', 'Q2', 'Number of respondents by Gender and Age', 'Gender', 'Number of respondents')

## Number of respondents by Age and Highest level of formal education

In [None]:
df = get_categories_group(multiple_df, 'Q1', 'Q4')
draw_trace_group_bar(df, 'Q1', 'Q4', 'Number of respondents by Age and Highest level of formal education', 'Highest level of formal education', 'Number of respondents')

## Age and number of years of experience

In [None]:
df = get_categories_group(multiple_df, 'Q1', 'Q6')
draw_trace_group_bar(df, 'Q1', 'Q6', 'Number of respondents by Age and number of years of experience', 
                     'Number of years of experience', 'Number of respondents')

## Highest level of formal education and current yearly compensation

In [None]:
# Q4 = highest level of formal education, Q24 = current yearly compensation
df = get_categories_group(multiple_df, 'Q4', 'Q24')
draw_trace_group_bar(df, 'Q4', 'Q24', 'Number of respondents by Highest level of formal education and Current yearly compensation', 'Current yearly compensation', 'Number of respondents')

## Current title and years of experience

In [None]:
# Q6 = years of experience writing code, Q5 = current job title
df = get_categories_group(multiple_df, 'Q6', 'Q5')
draw_trace_group_bar(df, 'Q6', 'Q5', 'Number of respondents by Current title and years of experience', 'Current title', 'Number of respondents')


# <a id='5'>Final note</a>  

This Kernel is still under construction. Stay tuned, we will update it frequently in the following days.

<a href="#0"><font size="1">Go to top</font></a>  