<h1 style="text-align:center;">Employed Data Professional's EDA - Differences in Survey Responses per Job Title and Gender </h1>

In [None]:
%%html
<div>
<table style="align:center;font-family: sans-serif; border-collapse: collapse;border: 1px solid #ddd;width:40%">
  <tr style="background-color: #f2f2f2;">
    <th style="padding-top: 20px;
  padding-bottom: 15px;padding-left:20px;font-size:25px;
  text-align: center;background-color: #4CAF50;
  color: white;">Table of Contents</th></tr>
    <tr><td style="text-align:left;font-size:16px;padding-top: 10px;padding-bottom: 5px;"><a href="#1">1. Introduction</a></td></tr>
    <tr><td style="text-align:left;font-size:16px;padding-top: 10px;padding-bottom: 5px;"><a href="#2">2. Reshaping Data</a></td></tr>
    <tr><td style="text-align:left;font-size:16px;padding-top: 10px;padding-bottom: 5px;"><a href="#3">3. Data Preprocessing</a></td></tr>
    <tr><td style="text-align:left;font-size:16px;padding-top: 10px;padding-bottom: 5px;"><a href="#4">4. General EDA</a></td></tr>
    <tr><td style="text-align:left;font-size:16px;padding-top: 10px;padding-bottom: 5px;"><a href="#5">5. Job Title and Gender Heatmaps (EDA)</a></td></tr>
        <tr><td style="text-align:left;font-size:16px;padding-top: 10px;padding-bottom: 5px;"><a href="#6">6. Common Tools Per Job Title (EDA)</a></td></tr>
<tr><td style="text-align:left;font-size:16px;padding-top: 10px;padding-bottom: 5px;"><a href="#7">7. Short Summary</a></td></tr>
 </table>
</div>

<div id="1"></div>

# 1. Introduction

We are going to investigate the 2020 Kaggle Survey data about employment in data-related roles. You can find the survey <a href="https://www.kaggle.com/c/kaggle-survey-2020">here</a>.


## 1.1 Goal

This notebook focuses on **employed data professionals**. For these survey respondents, the goals are to:
- Show current market expectations and job requirements in terms of skills, salary, experience, and tasks.
- Explore gender differences in data-related roles.
- Help students make data-driven decisions about their career path.


## 1.2 Tasks 

**General EDA:**  Relative frequency bar graphs of hard skills, salary, and experience of employed data professionals.

**Job Title Pivot Table W/ Select Variables**: Concentration of answers for variables of interest (salary, age, coding experience, etc.) per job title. An additional chart shows how men's survey answers differ from those of their female counterparts.

**Tool Chart**: Top 5 most commonly used technical tools per job title.
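
As a rough illustration of the relative frequencies mentioned above, the sketch below computes them for a multi-select question whose answers are stored as lists. The `answers` series is made-up data, and `explode` plus `value_counts` is only one possible approach (the notebook uses its own `count_values` helper). As in the notebook's `plot` function, the denominator is the total number of selections rather than the number of respondents.

```python
import pandas as pd

# Hypothetical multi-select answers: each row holds a list of tools
answers = pd.Series([["Python", "SQL"], ["Python"], ["R", "Python"], []])

# explode() puts every list element on its own row; empty lists become NaN
flat = answers.explode().dropna()

# Share of all selections (not of respondents) that each answer accounts for
rel_freq = flat.value_counts(normalize=True) * 100
print(rel_freq.round(1))
```

Dividing by the number of respondents instead would answer a different question (what share of people use each tool), so it is worth stating which denominator a chart uses.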

## 1.3 Description of Data
The survey data (```kaggle_survey_2020_responses.csv```) consists of 39+ questions and 20,036 responses. Per the site: 

> "Responses to multiple choice questions (only a **single choice** can be selected) were recorded in **individual columns**. Responses to multiple selection questions (**multiple choices** can be selected) were **split into multiple columns** (with one column per answer choice)."


## 1.4 Packages Used

In [None]:
import pandas as pd
import numpy as np

import gc
import re
import math
from math import log10
from IPython.display import display
import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib import rcParams
rcParams.update({'figure.autolayout': True})
import warnings
warnings.filterwarnings("ignore")
!pip install pycountry_convert
from pycountry_convert import country_alpha2_to_continent_code, country_name_to_country_alpha2

## 1.5 Functions Used

In [None]:
# functions for visualization purposes
def shorten(num, precision=2, suffixes=['', 'K', 'M', 'G', 'T', 'P']):
    try: float(num)
    except: return "Not A Number!"
    if num < 1:
        return num  # avoid dividing by 0 or negatives
    m = int(log10(num) // 3)
    return f'{num/1000.0**m:.{precision}f}{suffixes[m]}'
       
def modifyChartBasic(ax,labelSize,xLabel='',yLabel='',grid=False):
    """
    Function that takes in graph variables for the purpose of customizing basic
    default settings (ie. setting label size)
    """
    if grid:
        ax.grid(axis=grid, alpha=.3)
    # spines
    sns.despine()
    ax.spines['bottom'].set_color('gray')
    ax.spines['left'].set_color('lightgrey')

    # labels
    ax.set_ylabel(yLabel, labelpad=5, fontsize=16)
    ax.set_xlabel(xLabel, labelpad=5, fontsize=16)

    # tick settings
    ax.tick_params(labelsize=labelSize)
    ax.tick_params(axis='both', left=False, bottom=False)
    
def modifyChartExtra(ax,maxs=False,topChild=None):
    """
    Main Function used to create extra modifications to chart.
    """
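    if maxs and topChild is None:
        # widest bar in the chart, stored as a (bar, width) pair
        topChild = max([(bar,bar.get_width()) for bar  in ax.patches],key=lambda x:x[1])
    if topChild is not None:  # nothing to rescale when no bar was given or found
        modifyMaxYTickH(ax,topChild[1])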
    # changing size and color of tick labels (ticks are set to 0 to not display)
    ax.tick_params(size=0)

    
def modifyMaxYTickH(ax, maxValue):
    """
    Function that modifies x-scale to have the top x-tick label represent the maximum value
    whilst 'preventing' overlap between the last and second to last tick labels
    """
     # keep all except last  tick - ensure that the ticks don't overlap
    midCenterQuarter = (ax.get_xticks()[1] - ax.get_xticks()[0]) / 4

    ax.set_xlim(0, maxValue)  # set the xticks
    # add the last tick value
    x_ticks = np.append(
            [i for i in ax.get_xticks() if 
             i < (maxValue - midCenterQuarter)],# skip old max val if near new max val
              [maxValue]) # new max value
    # set the modified x ticks
    ax.set_xticks(x_ticks)

def count_valuesUpdateDictionary(x, counts):
    # skip nulls: the 'nan' string, None, or a numeric NaN
    if x=='nan' or x is None or (type(x) in [float, int] and math.isnan(x)):
        return counts

    # otherwise increment the count for this value
    counts[x] = counts.get(x,0)+1
    return counts

def count_values(df, columnName, noFalse=None):
    """ Custom count values function that takes into account a list's inner values.

    Args:
        df (pd.Dataframe): dataframe of interest
        columnName (str): column of interest
        noFalse(varies, optional): the value to ignore
    Return:
        dict: a dictionary with all the values and their respective counts
    """

    counts={}
    for val in df[columnName].values:
        if isinstance(val,(np.ndarray, list)):#if a list
            for i in val:
                if isinstance(i,(np.ndarray, list)): #if a sublist
                    for sub in i:
                        counts= count_valuesUpdateDictionary(sub, counts)
                else:  #if not a sublist
                    counts = count_valuesUpdateDictionary(i, counts)
        else:
            counts =count_valuesUpdateDictionary(val, counts)

    if noFalse is not None:
        #filter out the value to ignore, typically a False value
        return  {i[0]:i[1] for i in counts.items() if i[0] != noFalse}
    return counts

<div id="2"></div>

# 2. Reshaping Data

In [None]:
df = pd.read_csv("../input/kaggle-survey-2020/kaggle_survey_2020_responses.csv",header=1)

In [None]:
f"There are a total of {df.shape[1]} columns and {df.shape[0]:,} rows"

The goal here is to reduce the 355+ columns in the data frame by merging all columns that encode a question-answer combination (e.g. *What programming languages do you use - Python*) into a single column per question. The resulting data frame contains one column per unique question (~48 columns), which makes it much easier to work with during the analysis and the preprocessing stage that follows.
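
To make the idea concrete, here is a minimal sketch on a toy frame. The column names and answers are made up, and the loop is a simplified stand-in for the cells that follow, not the notebook's exact procedure.

```python
import pandas as pd

# Toy frame mimicking the survey layout (column names and answers are made up)
toy = pd.DataFrame({
    "Language - Selected Choice - Python": ["Python", None, "Python"],
    "Language - Selected Choice - R": [None, None, "R"],
    "Age": ["25-29", "30-34", "22-24"],
})

marker = " - Selected Choice - "
multi_cols = [c for c in toy.columns if marker in c]

# Group the answer columns by the question text that precedes the marker
questions = {}
for c in multi_cols:
    questions.setdefault(c.split(marker)[0], []).append(c)

# Build one list-valued column per question, skipping missing answers
collapsed = toy.drop(columns=multi_cols)
for question, cols in questions.items():
    merged = [
        [v for v in row if pd.notna(v)]
        for row in toy[cols].itertuples(index=False)
    ]
    collapsed[question] = pd.Series(merged, index=toy.index)

print(collapsed)
```

Each respondent ends up with a single `Language` cell holding a list of every selected answer, which is the shape the rest of the analysis relies on.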


In [None]:
#search for columns containing question-answer combination and save into a new dataframe
QSubAnswerColumns = df[df.columns[df.columns.str.contains(" - Selected Choice - ")]]

#for Q-A columns, keep only question (ie. remove "Selected Choice - Python")
pattern = r" \(Select all that apply.+| Selected Choice - .+"  # raw string avoids invalid-escape warnings
keepOnlyQuestion = pd.Series(df.columns[df.columns.str.contains(" - Selected Choice - ")]).apply(
    lambda x: re.sub(pattern,"", x).strip()) #keep only question

QSubAnswerColumns.columns = keepOnlyQuestion  #rename columns

# replace nulls with empty string as it'll be used for merging in the next step
nullValues = ["No / None", "None", None]
QSubAnswerColumns = QSubAnswerColumns.fillna("NULL").applymap(lambda x: '' if x=="NULL" or x in nullValues else (str(x)))
QSubAnswerColumns.head()

In [None]:
def add_mainColumn(df,evaluate):
    # merge all column values into a single column holding a list of the non-null (here, non-empty string) values
    new_column = df.apply(lambda x: [i.strip() for i in x if i !=''] ,  1) 
    
    # length of values in each row must match 
    assert (evaluate == new_column.apply(lambda x: len(x))).all() 
    
    # an empty list becomes the "None" marker; single-element lists stay as lists
    # here and are unpacked later, at the end of preprocessing
    new_column = new_column.apply(lambda x: "None" if x==[] else x)
    return new_column

for mainQuestion in np.unique(keepOnlyQuestion):
    evaluate = QSubAnswerColumns[mainQuestion].applymap(lambda x: 1 if x!='' else 0).sum(1) # sanity check
    new_column = add_mainColumn(QSubAnswerColumns[mainQuestion], evaluate)
    QSubAnswerColumns[mainQuestion+"NEW"] =new_column #add new to distinguish new columns
    
# keep new columns only 
QSubAnswerColumns = QSubAnswerColumns[QSubAnswerColumns.columns[QSubAnswerColumns.columns.str.contains("NEW")]]
QSubAnswerColumns.head()

In [None]:
QSubAnswerColumns.columns = QSubAnswerColumns.columns.str.replace("NEW","")

# merge SINGLE question columns with the NEW Q-A combination columns 
df = pd.merge(df[df.columns[~df.columns.str.contains(" - Selected Choice - ")]], #inverse operation
              QSubAnswerColumns,left_index=True, right_index=True)

# some further cleaning for the column names
df.columns = df.columns.str.replace(": - Selected Choice| - Selected Choice| -","")
df.head()

In [None]:
f"There are a total of {df.shape[1]} columns and {df.shape[0]:,} rows"

We went from 355 columns to 48 columns--nice! 

The row count matches that of the original data frame, and every merged column passed the **sanity check**. Let's move on to the next stage.

<div id="3"></div>

# 3. Data Preprocessing

In this stage, the goal is to (1) keep only the respondents that fit the definitions set in the introduction and (2) reduce the number of unique values in certain columns.


### Employed respondents

Here we'll define employed as respondents who did not select any of the following: ```Student, Other, Currently not employed, nulls.``` 

Furthermore, we'll ignore job titles that are not strictly "data-related," which are the following: ```Product/Project Manager,Software Engineer.```

Finally, we'll group job titles with similar functions.

In [None]:
jobTitleCol = 'Select the title most similar to your current role (or most recent title if retired)'
dataRoles = {"Statistician":"Academic Data Scientist","Research Scientist":"Academic Data Scientist",
                     "Data Analyst":"Analyst","Business Analyst":"Analyst","DBA/Database Engineer":"Data Engineer"}
notEmployed = ['Student', 'Other','Currently not employed',"Product/Project Manager","Software Engineer",np.nan]

df = df[~df[jobTitleCol ].isin(notEmployed)].reset_index(drop=True)
df[jobTitleCol] = df[jobTitleCol].apply(lambda x: dataRoles[x] if x in dataRoles else x)

df[jobTitleCol].value_counts(normalize=True)

We can see that 1/3 of the employed data professionals are data scientists, with analysts (~28%) making up the other large share of respondents.

### Gender

In [None]:
df["What is your gender?"].value_counts(normalize=True)

Unfortunately, given the low percentage of nonbinary responses, only ```Man``` and ```Woman``` respondents will be investigated in this notebook. The remaining values, ```Prefer not to say``` and ```Prefer to self-describe```, are ambiguous and will also be excluded, given that the goal is to study gender differences.

In [None]:
df = df[df['What is your gender?'].isin(["Man","Woman"])].reset_index(drop=True)
df['What is your gender?'].value_counts(normalize=True)*100

It should be mentioned that the share of women among employed data professionals is disappointingly low. Clearly, there is a need for better gender representation.

### Countries into Continents
Next, we'll create a continent column from the country column. Certain countries are renamed first so that the lookup function can be applied to every country.
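
The rename-then-look-up pattern can be sketched as follows. The small `CONTINENT_BY_COUNTRY` dictionary is a hypothetical stand-in for the `pycountry_convert` lookup used in the notebook, and the sample values are illustrative only.

```python
import pandas as pd

# Hypothetical stand-in for the pycountry_convert lookup used in the notebook
CONTINENT_BY_COUNTRY = {
    "India": "Asia",
    "USA": "North America",
    "Iran": "Asia",
    "Brazil": "South America",
}
# Survey spellings that the lookup would not recognise, mapped to a known name
RENAMES = {
    "United States of America": "USA",
    "Iran, Islamic Republic of...": "Iran",
}

countries = pd.Series(
    ["India", "United States of America", "Other", "Iran, Islamic Republic of..."]
)
normalized = countries.replace(RENAMES)

# Values without a continent (such as "Other") are kept unchanged
continent = normalized.map(CONTINENT_BY_COUNTRY).fillna(normalized)
print(continent.tolist())
```

Keeping unmatched values as they are mirrors how the notebook leaves `Other` in the continent column instead of dropping those rows.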


In [None]:
countryCol = 'In which country do you currently reside?'
continents = {
    'NA': 'North America', 'SA': 'South America', 'AS': 'Asia',
    'OC': 'Australia', 'AF': 'Africa', "EU":"Europe"
}
countryMap = {'Iran, Islamic Republic of...':"Iran",'Republic of Korea':"South Korea",
             'United States of America':"USA",'United Kingdom of Great Britain and Northern Ireland':"United Kingdom"}
df['In which country do you currently reside?']=df['In which country do you currently reside?'].apply(
    lambda x: countryMap[x] if x in countryMap else x)

countries = df[countryCol].unique()
continentMap = dict([(country,continents[country_alpha2_to_continent_code(country_name_to_country_alpha2(country))]
                     ) for country in countries if country !="Other"])

df["continent"]  = df[countryCol].apply(lambda x: 
                                        continentMap[x] if x in continentMap else x)
df["continent"].value_counts(normalize=True)*100


Here we can see that Asia is where the largest share (~39%) of respondents reside. One likely factor is the large population of certain countries on that continent (India and China). Such a large share of respondents should be of interest to the Kaggle staff.

### Reducing Media Source Values

Here we remove the examples given within parentheses from the media source values.
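
A minimal sketch of the substitution, using illustrative labels (the exact survey wording may differ) and a hypothetical helper name, `strip_examples`:

```python
import re

def strip_examples(label):
    """Drop everything from the first ' (' onward, e.g. a list of examples."""
    return re.sub(r" \(.*", "", label)

# Illustrative labels; the exact survey wording may differ
sources = [
    "Blogs (Towards Data Science, Analytics Vidhya, etc)",
    "Kaggle (notebooks, forums, etc)",
    "YouTube",
]
cleaned = [strip_examples(s) for s in sources]
print(cleaned)
```

Labels without a parenthetical pass through unchanged, so the function is safe to apply to every value in the column.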

In [None]:
mediaCol = 'Who/what are your favorite media sources that report on data science topics?'

df[mediaCol] = df[mediaCol].apply(lambda x:
                            [re.sub(r" \(.*","",i) for i in x] if type(x)==list else x)

### Grouping salary ranges 

In [None]:
to_modify = {'$0-999':"$0-999",'> $500,000':'> $249,999'}
def salaryRanges(x):
    if x in to_modify.keys(): 
        return to_modify[x]
    # keep the upper bound of a range such as "1,000-1,999"
    if isinstance(x, str) and '-' in x:
        x = x.replace(',','').replace(' ','').split('-')[1]

    x =float(x)
    if x > 999.99 and  x < 9999.99:
        return '$1,000 - $9,999'
    elif x > 9999.99 and  x < 19999.99:
        return '$10,000 - $19,999'
    elif x > 19999.99 and  x < 39999.99:
        return '$20,000 - $39,999'
    elif x > 39999.99 and  x < 69999.99:
        return '$40,000 - $69,999'
    elif x > 69999.99 and x < 99999.99:
        return '$70,000 - $99,999'
    elif x > 99999.99 and x < 149999.99:
        return '$100,000 - $149,999'
    elif x > 149999.99 and x < 249999.99:
        return '$150,000 - $249,999'
    elif x > 249999.99: 
        return '> $249,999'
    
salaryCol = 'What is your current yearly compensation (approximate $USD)?'
df[salaryCol] = df[salaryCol].apply(lambda x: salaryRanges(x) if type(x) != float else x)
df[salaryCol].value_counts(normalize=True)*100

### Reducing unique values for other columns

In [None]:
# Age range
ageRange = {"50-54":"50+", "55-59":"50+", "60-69":"50+", "70+":"50+", "40-44":"40-49", "45-49":"40-49"} 
df['What is your age (# years)?'] = df['What is your age (# years)?'].apply(
                                    lambda x: ageRange[x] if x in ageRange else x)

# Education
eduCol = 'What is the highest level of formal education that you have attained or plan to attain within the next 2 years?'
eduMap = {"No formal education past high school":"Some college or less","I prefer not to answer":"",
          "Some college/university study without earning a bachelor’s degree":"Some college or less"}
df[eduCol] = df[eduCol].apply(lambda x: eduMap[x] if x in eduMap else x)

# Activity at work
activityCol = 'Select any activities that make up an important part of your role at work:'
other = 'None of these activities are an important part of my role at work:'
df[activityCol] = df[activityCol].apply(lambda x: ["Other" if i== other
                                            else i for i in x ] if type(x)==list else x)

# Company Size
companySizeCol = 'What is the size of the company where you are employed?'
df[companySizeCol] = df[companySizeCol].str.replace(" employees",'')

# IDEs
ide = "Which of the following integrated development environments (IDE's) do you use on a regular basis?"
vs = {'Visual Studio Code (VSCode)':"Visual Studio Code", "Visual Studio":"Visual Studio Code"}
df[ide] = df[ide].apply(lambda x: [re.sub(r" \(.*","",i) if not i in vs else vs[i] for i in x] if type(x)==list
                             else (re.sub(r" \(.*","",x) if x!='' else x))
employed =  df.copy()
employed.head()

In [None]:
del [QSubAnswerColumns, df]
gc.collect()

### Further reducing dataframe size

- Turn 'None' and NumPy NaN values into empty strings
- Unpack lists that contain only one element

In [None]:
def reduceList(x):
    if len(x)==1: 
        return x[0]
    return [i for i in x]


employed = employed.applymap(lambda x: reduceList(x) if type(x)==list  
                     else ('' if x=='None' or x is np.nan else x))
employed.head()

<div id="4"></div>

# 4. General Exploratory Data Analysis (EDA)

In [None]:
def findCol(df,words):
    """
    Searches for columns within dataframe. 
    """
    keywords = ''
    for word in words:
        keywords+="(?=.*{:})".format(word)
    cols = df.columns[df.columns.str.contains(keywords)]
    return cols

In [None]:
initial_color ="#20BEFF"

In [None]:
def plot(df, col, initial_color ="#20BEFF",figsize=(15, 5)):
    fig, ax = plt.subplots(figsize=figsize,facecolor="#FAFAFA", constrained_layout=True)
    #data
    counts = count_values(df, col, noFalse='')
    dataRel= (pd.DataFrame(counts.values(),counts.keys())
               .sort_values(0)
              .applymap(lambda x: 
                        x/sum([i for i in counts.values()])*100))
    dataAbs = pd.DataFrame(counts.values(),counts.keys()).sort_values(0)
    #max of 20
    if len(dataAbs)>20:
        dataAbs = dataAbs[-20:]
        dataRel = dataRel[-20:]
    #bar plot 
    ax.barh(dataAbs.index,dataAbs[0].values, color=initial_color,zorder=3) 
    ax.set_yticklabels([i[:30] for i in dataAbs.index],fontfamily='serif')
    ax.grid(axis="x",alpha=.3,zorder=0) 
    params = {"int":False,"percent":True,"decimal":1}
    
    for bar,s in zip(ax.patches,dataRel[0].values):
        x = bar.get_height()/2+ bar.get_width()
        y = bar.get_y()+ bar.get_height()/2 
        pos = ("black","left") if s<5 else ("black","right")
        ax.text(x,y,f"{s:.1f}%",fontsize=13,color=pos[0],ha=pos[1],fontfamily="serif",va="center")
    
    modifyChartExtra(ax,maxs=True) 
    modifyChartBasic(ax,13) 
    ax.set_facecolor("#FAFAFA")  
    ax.set_xticklabels([shorten(i,precision=1) if len(str(int(i)))>3 else int(i) for i in ax.get_xticks()],
                       fontsize=13) 
    ax.get_yticklabels()[-1].set_fontweight("bold")
    ax.spines["bottom"].set_color('#FAFAFA')
    ax.spines['right'].set_visible(True); ax.spines['right'].set_alpha(.1)
    
    plt.title(col,fontfamily='serif',fontsize=16)
    return ax

## Share of Employed Data Professionals by Job Titles

In [None]:
title = findCol(employed,["title"])
ax = plot(employed,title,figsize=(11,7))
ax.get_xticklabels()[-1].set_rotation(45)
plt.suptitle("Share of Job Roles", x = .28,
             y=.95,fontsize=22,fontweight="semibold", fontfamily='serif')
plt.title("~61% of Data Professionals are Data Scientists or Analysts",
                  x = .41, fontsize=16, fontfamily='serif')
plt.show()

<div class="alert alert-info" style="font-size:16px;width:80%">
Key points:
    <br>
<ul>
    <li>Not surprisingly, data science and analyst job titles are common amongst employed data professionals. </li>
    <li> Almost 1/5 of the surveyed data professionals work as data scientists in academic-related positions. </li>
</ul>   
</div>

## Location of Survey Respondents

In [None]:
plot(employed,"continent",figsize=(11,7))
plt.suptitle("60% of Respondents Are Located In Asia or Europe", x = .5,
             y=.90,fontsize=20,fontweight="semibold", fontfamily='serif')
plt.title('');plt.show()

countries = findCol(employed,["countr"])
ax = plot(employed,countries,figsize=(11,7))
ax.get_xticklabels()[-1].set_rotation(45)
plt.suptitle("Top 20 Countries of Survey Respondents", x = .44,
             y=.95,fontsize=22,fontweight="semibold", fontfamily='serif')
plt.title("1/3 of Respondents are Located in India or the U.S.",
                  x = .32, fontsize=15, fontfamily='serif'    )
plt.show()

<div class="alert alert-info" style="font-size:16px;width:80%">
Key points:
    <br>
<ul>
    <li>The large continents of South America and Africa have the fewest survey respondents in comparison to the other continents.  </li>
    <li> India is well represented in this survey with ~1/5 of total respondents being located there. </li>
</ul>   
</div>

## Age

In [None]:
age=  findCol(employed,[r"age \(# years\)"])
ax = plot(employed,age,figsize=(11,7))
ax.get_xticklabels()[-1].set_rotation(45)
plt.suptitle("Age Range", x = .45,
             y=.95,fontsize=19,fontweight="semibold", fontfamily='serif')
plt.title("Data Professionals are Young: 42% are in their Mid-20s to Mid-30s",
                  x = .45, fontsize=14, fontfamily='serif')
plt.show()

## Gender

In [None]:
gender = findCol(employed,["gender"])
genderProp = employed[gender].value_counts(normalize=True) * 100
fig, ax = plt.subplots(facecolor="#FAFAFA")

patches, texts, autotexts = plt.pie(genderProp ,
                                    labels=['Man', 'Woman'],
                                    labeldistance=1.1,
                                    shadow=True,
                                    colors=["#20BEFF",'#ff9999'],
                                    autopct='%1.1f%%',
                                    textprops={
                                        'fontsize': 16,
                                        'fontweight': 'bold'
                                    },wedgeprops=dict(width=0.6),
                                     startangle=90,pctdistance=0.80,
                                    radius=1.4)
 
plt.title("Employed Data Professionals - Gender",
          pad=40,
          size=20, x=.5,
          fontname='sans-serif')

plt.show()

<div class="alert alert-info" style="font-size:16px;width:80%;">
Key points:
    <br>
<ul>
    <li>Roughly one in every five respondents is female. </li>
    <li>Sadly there is a clear disproportion in gender representation amongst the employed data professional respondents.  </li>
</ul>   
</div>



## Yearly Compensation

In [None]:
salary = findCol(employed,["compensation"])
ax = plot(employed,salary,figsize=(11,7))

plt.suptitle("Yearly Compensation ($USD)", x = .35,
             y=.95,fontsize=22,fontweight="semibold", fontfamily='serif')
plt.title("~41% Are Paid Less than \\$10k and ~35% Are Paid between \\$10k and \\$70k",
                  x = .42, fontsize=14, fontfamily='serif')
ax.set_xticklabels([shorten(i) if len(str(int(i)))>3 else int(i) for i in ax.get_xticks()],
                   fontsize=13) 
plt.show()

<div class="alert alert-info" style="font-size:16px;width:80%;">
Key points:
    <br>
<ul>
    <li> Yearly salaries for the respondents are overwhelmingly low (~41% are paid less than $10k). However, the geographic distribution of respondents (e.g. ~21% are from India) is likely a contributing factor.
    </li>
</ul>   
</div>




## Use of and Recommended Programming Languages To Learn

In [None]:
progLang = findCol(employed,["programming"])
suptitles = ["Years Writing Code and/or Programming",
             "First Programming Language Recommendation",
            "Programming Language Used Regularly"]
titles = ["~44% of data professionals have been coding for 1 - 5 years",
         "Python is overwhelmingly recommended as a first programming language to learn",
         "Python, SQL, and R are commonly used amongst data professionals"]
for col,sup,t in zip(progLang,suptitles,titles):
    ax = plot(employed,col,figsize=(12,6))
    plt.suptitle(sup, y=.96,fontsize=21,fontweight="semibold", fontfamily='serif')
    plt.title(t, fontsize=14, fontfamily='serif')
    plt.show()  

<div class="alert alert-info" style="font-size:16px;width:80%;">
Key points:
    <br>
<ul>
    <li> ~44% of data professionals have been coding for 1 - 5 years. </li>
    <li> Python is overwhelmingly recommended as a first programming language to learn. </li>
    <li> Python, SQL, and R are commonly used amongst employed data professionals </li>
</ul>   
</div>





## Education

In [None]:
edu=  findCol(employed,["formal education"])
ax = plot(employed,edu,figsize=(11,7))
ax.get_xticklabels()[-1].set_rotation(45)
plt.suptitle("Edu. Planned/Attained Within the Next 2 Years", x = .45,
             y=.95,fontsize=19,fontweight="semibold", fontfamily='serif')
plt.title("~90% of Data Professionals Have or Plan On Obtaining A Higher Ed Degree",
                  x = .45, fontsize=14, fontfamily='serif')
plt.show()

<div class="alert alert-info" style="font-size:16px;width:80%;">
Key points:
    <br>
<ul>
    <li> About half of the respondents indicated having or planning to obtain a Master's degree. </li>
    <li> It seems like obtaining a higher education, at least by those surveyed (~90% plan on or have obtained one), is important. </li>
</ul>   
</div>

<div id="5"></div>

# 5. Job Title and Gender Heatmaps (EDA)

Variables we'll study per job titles: 
1. Education
2. Common tasks at work
3. Salary Pay

*Note: Some of the charts do not include a gender differences chart, as they pertain to questions that are not expected to be influenced by gender (e.g. programming language preferences).*

For this section we'll retrieve only columns relevant to the aforementioned variables of interest.
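
Before the helper functions, here is a small sketch of the two quantities the heatmaps display: the share of answers within each job title, and the male-minus-female difference of those shares. The data is made up, and `pd.crosstab` is swapped in for brevity; the notebook builds its tables with a custom `pivot_table` function instead.

```python
import pandas as pd

# Made-up respondents: job title, gender, and one answer of interest
toy = pd.DataFrame({
    "title": ["Analyst"] * 4 + ["Data Scientist"] * 4,
    "gender": ["Man", "Man", "Woman", "Woman"] * 2,
    "degree": ["Bachelor", "Master", "Master", "Master",
               "Master", "Doctoral", "Master", "Master"],
})

def row_shares(frame):
    """Percent of each job title's answers that fall into each category."""
    return pd.crosstab(frame["title"], frame["degree"], normalize="index") * 100

men = row_shares(toy[toy["gender"] == "Man"])
women = row_shares(toy[toy["gender"] == "Woman"])

# Positive values: the answer is more common among men; negative: among women.
# fill_value=0 covers categories that only one gender selected.
diff = men.sub(women, fill_value=0)
print(diff)
```

Because both tables are normalized per job title before subtracting, the differences are in percentage points and are not distorted by the larger number of male respondents.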

In [None]:
useThese = [ 'What is your age (# years)?',
       'What is your gender?', 'In which country do you currently reside?',
       'What is the highest level of formal education that you have attained or plan to attain within the next 2 years?',
       'Select the title most similar to your current role (or most recent title if retired)',
       'For how many years have you been writing code and/or programming?',
       'What programming language would you recommend an aspiring data scientist to learn first?',
       'What is the size of the company where you are employed?',
       'Approximately how many individuals are responsible for data science workloads at your place of business?',
       'What is your current yearly compensation (approximate $USD)?',
       'Which of the following big data products (relational database, data warehouse, data lake, or similar) do you use most often?',
       'Which of the following business intelligence tools do you use most often?',
       'Select any activities that make up an important part of your role at work:',
       'What data visualization libraries or tools do you use on a regular basis?',
       'What programming languages do you use on a regular basis?',
       'Which of the following cloud computing platforms do you use on a regular basis?',
       'continent'
]
employedDF = employed[useThese].copy()  # copy so the dtype conversion below does not act on a view

notList = [col for col in employedDF.columns 
           if employedDF[col].apply(lambda x: 1 if type(x)==list else 0).sum() ==0]
employedDF[notList] = employedDF[notList].astype("category")

### Heatmap Functions For Pivot Table

In [None]:
color_map = plt.get_cmap('bone').reversed()
    
def heatmaps(ax,df,rel,cmap=color_map):
    kwargs = {'alpha':.9,'linewidth':2, 'linestyle':'--', 'rasterized':False, 'edgecolor':'w',  "capstyle":'projecting',}
    sns.heatmap(data = df,annot=True,linewidths=3,
                fmt=".1f" if rel else "d", annot_kws={"fontsize":8}, 
                 square=False, ax=ax, cmap= cmap,**kwargs)
    ax.tick_params(length=0,labelsize=12,pad=0)  
    if rel:
        for t in ax.texts: t.set_text(t.get_text() + " %")
    ax.set_ylabel('') 

def relativeVal(male,female):
    maleSum = male.sum(1)
    femaleSum = female.sum(1)
    male = male.apply(lambda x: x/maleSum)*100
    female = female.apply(lambda x: x/femaleSum)*100
    return male, female

def create_plots(ax,col, main, data,male, female,rel):
    if rel:
        male, female = relativeVal(male, female)
    if male.index.nlevels ==1: diff = male -female
    else: diff = male.droplevel(1) - female.droplevel(1) 
    heatmaps(ax,diff,rel,cmap=plt.get_cmap('bwr').reversed())
    if any([True for i in diff.columns if len(i)>15]):
        ax.tick_params(axis='x',labelrotation=60)
    ax.set_title('\n'.join(["","Male Relative to Female"])) 

### Pivot Table Function (Agg Functions Used: Unique Count and % of Total)

In [None]:
def pivot_table(col,main,data=employedDF, relative=False,gender=False):
    #retrieve select gender 
    if gender != False:
        data= data[data["What is your gender?"] == gender]
        data["What is your gender?"] = data["What is your gender?"].cat.remove_unused_categories() #remove category
     
    #create series with select columns. Rows with lists as values are exploded 
    df = data[main+col].explode(col[0]).reset_index()
    df= df[~df.applymap(lambda x: True if x=='' else False).any(1)] #skip none values
    
    # since we have categories, we'll need to remove unused categories after filtering out none values
    cats = [col for col in df.columns if df[col].dtype.name=='category' and '' in  df[col].cat.categories]
    if len(cats)>0: 
        for cols in cats:
            df[cols] = df[cols].cat.remove_unused_categories()
    
    df["index"] = df["index"].astype(np.int16) #convert to int16 type for faster processing
    
    # get unique count per select variables (main) for each unique value in the second variable (col)
    df_pivot = df.pivot_table(index=main, columns=col[0],values="index", aggfunc="nunique")
    df_pivot.index = df_pivot.index.set_names(["" for i in range(len(df_pivot.index.names))]) # no need for column names in plot
    df_pivot.columns = df_pivot.columns.rename("") # no need for column names in plot
    
    if relative: 
        return (df_pivot.sum().div(df_pivot.sum().sum()).sort_values(ascending=False)*100)
    return df_pivot

## Job Titles and Degree

In [None]:
def mainHeatmap(col, main,cols=None):
    fig, ax = plt.subplots(figsize=(15,10),nrows=2,ncols=1)

    df_pivot = pivot_table(col,main,data=employedDF,gender=False)
    df_pivot = df_pivot.apply(lambda x: x/df_pivot.sum(1))*100
    if cols is None: cols= df_pivot.columns
    df_pivot_M = pivot_table(col, main,data=employedDF,gender="Man")[cols]
    df_pivot_W = pivot_table(col, main,data=employedDF,gender="Woman")[cols]
    heatmaps(ax[0],df_pivot[cols], rel=True,cmap='Greens')
    ax[0].figure.axes[-1].set_yticklabels([str(int(i))+"%"
                                        for i in ax[0].figure.axes[-1].get_yticks()])
    ax[0].figure.axes[-1].tick_params(labelsize=14)
    ax[0].figure.axes[-1].set_title("% of Total",x=.6,fontsize=14) 
    create_plots(ax[1],col, main, employedDF,df_pivot_M, df_pivot_W,True)  # use the arguments, not the globals
    ax[1].figure.axes[-1].tick_params(labelsize=14)
    ax[1].figure.axes[-1].set_yticklabels([str(int(i))+"%"
                                        for i in ax[1].figure.axes[-1].get_yticks()])
    ax[1].figure.axes[-1].set_title("% Difference",x=.6,fontsize=14) 
    [t.set_fontsize(15) for i in range(2) for t in ax[i].texts]
    return ax

In [None]:
jobTitle = list(findCol(employedDF,['current role']))
edu = list(findCol(employedDF,['formal education']))
sortedCols = ['Some college or less','Bachelor’s degree', 
              'Professional degree','Master’s degree', 'Doctoral degree']
ax = mainHeatmap(edu, jobTitle, sortedCols)

ax[1].tick_params(labelrotation=0)
titles = ["Degree Planned/Obtained Per Job Title", "\n\nMale Relative to Female"]
[ax[i].set_title(t,fontsize=18,pad=10) for i,t in zip([0,1],titles) ]

plt.show()

<div class="alert alert-info" style="font-size:16px;width:90%;">
Key points:
    <br>
<ul>
     <li> With the exception of academic data scientists, a Master's degree (attained or planned within the next two years) is the most common selection (45% - 52%) across all professional groups. </li>
    <li> Half of academic data scientists have attained or plan to attain a doctoral degree within the next two years. Not surprisingly, these professionals have the highest level of formal education in the group.</li>
    <li> Data Scientists and Machine Learning (ML) Engineers are the next most formally educated professionals, with 17% and 14%, respectively, having obtained or aiming to obtain a doctoral degree.</li>
    <li> Among ML engineers, the share of male respondents who selected having or planning to obtain a Bachelor's degree was 11% higher than that of their female counterparts.
     On the other hand, female ML engineers' responses are more concentrated on the higher-education choices relative to males, with higher Master's and doctoral degree percentages.</li>
         
  
</ul>   
</div> 
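
To make the "male relative to female" figures concrete, here is a minimal sketch of one way such a gap can be computed. It assumes the difference is the male row percentage minus the female row percentage; `create_plots` is not shown in this section, so that formula, the helper `row_percent`, and the counts are all assumptions for illustration.

```python
import pandas as pd

# Hypothetical respondent counts per job title and degree, split by gender
men = pd.DataFrame({"Bachelor": [40, 20], "Master": [60, 80]},
                   index=["ML Engineer", "Analyst"])
women = pd.DataFrame({"Bachelor": [10, 10], "Master": [30, 30]},
                     index=["ML Engineer", "Analyst"])

def row_percent(counts):
    """Convert counts to percentages within each row (job title)."""
    return counts.div(counts.sum(axis=1), axis=0) * 100

# Positive values: the option is relatively more common among men
diff = row_percent(men) - row_percent(women)
print(diff)
```

Because both tables are normalized per row first, the gap is not distorted by there being far more male than female respondents, and each row of differences sums to zero.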



## Job Titles and Activities at Work

In [None]:
activities = list(findCol(employedDF,["activities"]))

activityMap = dict([(k,v) for k,v in 
      zip(count_values(employedDF, activities,noFalse='').keys(),range(8))])

legend = (pd.Series(activityMap.values(),activityMap.keys()).to_frame().rename(columns={
    0:"Legend"
}).T)

ax = mainHeatmap(activities,jobTitle)

[ax[i].tick_params(axis="x",labelrotation=0) for i in range(2)]
titles = ["Activities That Are Important For Job Role", "\n\nMale Relative to Female"]
[ax[i].set_title(t,fontsize=18,pad=10) for i,t in zip([0,1],titles)]
[ax[i].set_xticklabels(
              [activityMap[x.get_text()] for x in ax[i].get_xticklabels()]
                      ) for i in range(2)]
plt.tight_layout()
display(legend)

<div class="alert alert-info" style="font-size:16px;width:90%;">
Key points:
    <br>
<ul> <li> We can see that there is clear variation in the important activities per job title. In general, though, <i>analyzing and understanding data</i> plays an important part in the work of all data professionals. </li>
    <li> Building prototypes for exploring ML applications seems to have higher relative importance for male data engineers and academic data scientists. Relative to their female counterparts, the share of males selecting the <i>building prototypes</i> choice was 6% higher.  </li>
</ul>   
</div>
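
The numbered x-axis labels in the activity heatmaps come from mapping each long answer text to a short integer code and showing the mapping as a legend table. A minimal sketch of that idea, using made-up shortened labels rather than the survey's actual wording:

```python
import pandas as pd

# Hypothetical (shortened) activity labels
labels = ["Analyze and understand data", "Build prototypes", "Do research"]

# Map each label to a short integer code for use as a tick label
code_map = {label: code for code, label in enumerate(labels)}

# One-row legend table: columns are the labels, the row holds their codes
legend = pd.Series(code_map).to_frame("Legend").T
print(legend)
```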


## Job Titles and Salary

In [None]:
salary = list(findCol(employedDF,["compensation"]))
sortedCols = ['$0-999', '$1,000 - $9,999', '$10,000 - $19,999','$20,000 - $39,999',
    '$40,000 - $69,999', '$70,000 - $99,999', '$100,000 - $150,000', 
 '$150,000 - $249,999',  '> $249,999']
ax = mainHeatmap(salary,jobTitle,sortedCols)

[ax[i].tick_params(axis="x",labelrotation=15) for i in range(2)]
titles = ["Yearly Salary Expectations", "\n\nMale Relative to Female"]
[ax[i].set_title(t,fontsize=18,pad=10) for i,t in zip([0,1],titles)]

plt.tight_layout()

<div class="alert alert-info" style="font-size:16px;width:90%;">
Key points:
    <br>
<ul> <li> Across all job titles, yearly salary expectations are concentrated in the \$0 to \$10k range. As previously mentioned, this is likely due to the locations of the participants.</li>
    <li> Across all job titles, males have a higher % of respondents selecting higher salary ranges relative to their female counterparts. This is especially the case for data engineers, among whom males have a ~20% higher concentration of responses in the \$20k to \$70k salary ranges.</li>
    <li> Relative to their male counterparts, the higher concentration of females selecting \$0 to \$999 as their expected yearly salary hints strongly at a gender pay disparity. </li>
</ul>   
</div>



In [None]:
cols = ['For how many years have you been writing code and/or programming?',
'What programming language would you recommend an aspiring data scientist to learn first?',
'What is the size of the company where you are employed?',
'Approximately how many individuals are responsible for data science workloads at your place of business?']

In [None]:
col = [cols[0]]
sortedCols = [ 'I have never written code','< 1 years','1-2 years', 
             '3-5 years', '5-10 years', '10-20 years', '20+ years', ]
ax = mainHeatmap(col, jobTitle,sortedCols)

[ax[i].tick_params(axis="x",labelrotation=15) for i in range(2)]
titles = ["Years Coding", "\n\nMale Relative to Female"]
[ax[i].set_title(t,fontsize=18,pad=10) for i,t in zip([0,1],titles)]

plt.tight_layout()


<div class="alert alert-info" style="font-size:16px;width:90%;">
Key points:
    <br>
<ul> <li> Analysts have the least programming experience, with ~58% having 0 - 2 years of coding experience.</li>
    <li> Academic data scientists and data engineers have the most programming experience, with 1/3 of these professionals reporting more than 10 years of coding experience.</li>
    <li> Male respondents have a higher concentration of responses on the higher end of coding experience; this is especially the case for academic data scientists and data engineers. </li>
    <li>Interestingly enough, we see that women have a much higher concentration of responses in the low coding experience ranges. Taking the previous salary statistic into account, this may have played a role in the lower salary ranges for female data engineers relative to their male counterparts.</li>
</ul>   
</div>
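
Figures such as "0 - 2 years of coding experience" combine several adjacent bins, which only works once the columns are in their natural order. A minimal sketch with made-up percentages (the bin labels follow the survey wording, the numbers do not):

```python
import pandas as pd

# Natural order of the experience bins (a subset, for illustration)
ordered = ["I have never written code", "< 1 years", "1-2 years", "3-5 years"]

# Hypothetical row percentages, deliberately stored out of order
row_pct = pd.DataFrame(
    {"3-5 years": [40.0], "< 1 years": [20.0],
     "I have never written code": [10.0], "1-2 years": [30.0]},
    index=["Analyst"],
)

# Reorder the columns, then accumulate left to right
cumulative = row_pct[ordered].cumsum(axis=1)
print(cumulative)
```

Here the value under `1-2 years` is the share of analysts with at most two years of experience.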





In [None]:
col = [cols[1]]
ax = mainHeatmap(col, jobTitle)

titles = ["Programming Language Recommended to Learn First", "Male Relative to Female"]
[ax[i].set_title(t,fontsize=18,pad=10) for i,t in zip([0,1],titles)]
ax[1].set_visible(False)
ax[1].figure.axes[-1].set_visible(False)
plt.tight_layout()

<div class="alert alert-info" style="font-size:16px;width:90%;">
Key points:
    <br>
<ul> <li> ML engineers have the highest concentration (86%) of respondents who recommend Python as the first programming language to learn.</li>
    <li> The heatmap also shows R and SQL being the top recommendation for some data professionals.</li>
</ul>   
</div>




In [None]:
col = [cols[2]]
sortedCols = ['0-49','50-249', '250-999', '1000-9,999','10,000 or more']
ax = mainHeatmap(col, jobTitle,sortedCols)

titles = ["Size of Company", "\n\nMale Relative to Female"]
[ax[i].set_title(t,fontsize=18,pad=10) for i,t in zip([0,1],titles)]
#ax[1].set_visible(False)
#ax[1].figure.axes[-1].set_visible(False)
plt.tight_layout()

<div class="alert alert-info" style="font-size:16px;width:90%;">
Key points:
    <br>
<ul> 
    <li> Across all job titles, a sizable share of respondents work for companies with fewer than 50 employees. ML engineers, in particular, have a high concentration (~54%) of respondents who work at a small business (< 50 employees).</li>  
    </ul>   
</div>




In [None]:
col = [cols[3]]
sortedCols=['0', '1-2','3-4', '5-9', '10-14', '15-19', '20+']
ax = mainHeatmap(col, jobTitle,sortedCols)

titles = ["Individuals Responsible for Data Science Workloads", "\n\nMale Relative to Female"]
[ax[i].set_title(t,fontsize=18,pad=10) for i,t in zip([0,1],titles)]
ax[1].set_visible(False)
ax[1].figure.axes[-1].set_visible(False)
plt.tight_layout()

<div class="alert alert-info" style="font-size:16px;width:90%;">
Key point:
    <br>
    <ul> <li> Having 10 - 19 individuals working on data science workloads is not common. Data professionals work either in small teams (0 - 9) or in large teams (20+). </li>
</ul>   
</div>






<div id="6"></div>

# 6. Common Tools Per Job Title (EDA)

In [None]:
tools=[]
for w in ["use most often","programming languages","cloud"]:
    tools+= list(findCol(employedDF,[w]))
print(tools)

def toolChart(tool):
    fig, ax = plt.subplots(figsize=(15,7.5),facecolor="#03174D")
    ax.set_facecolor("#03174D")
    df_pivot = pivot_table([tool],jobTitle,data=employedDF,gender=False)
    df_pivot = df_pivot.apply(lambda x: x/df_pivot.sum(1))*100
    top5 = (df_pivot.transform(np.sort)[::-1]).iloc[0].sort_values().tail(5).index
    data = df_pivot[top5]
    sortedRow = data.sum(1).sort_values().index
    data.loc[sortedRow].plot.barh(stacked=True,ax=ax,width=.7,zorder=3)
    for bar in ax.patches:
        x,y,s = bar.get_x()+bar.get_height()*2,bar.get_y()+bar.get_height()/2, bar.get_width()
        rot= 90 if s< 6 else 0
        ax.text(x-bar.get_height() if s<= 6 else x, y ,f"{s:.1f}%",color="white",
                fontsize=14,rotation=rot)
    modifyChartBasic(ax,13,grid='x')
    ax.set_xlim(0,100)
    ax.set_xticklabels([f"{i:.1f}%" for i in ax.get_xticks()])
    ax.tick_params(labelcolor="white",labelsize=14)
    data.loc[sortedRow].applymap(lambda x: 100).iloc[:,-1].rename("").plot.barh(zorder=0,width=.7,edgecolor="teal",
          legend=None, linewidth=2, facecolor= "#03174D")
    legend = ax.legend(loc="upper center", bbox_to_anchor=(0.47, 1.06),ncol=5,fontsize=15)
    
    return ax
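
The `top5` line in `toolChart` keeps the five tools whose highest share across job titles is largest. A minimal sketch of the same selection written with `nlargest`, on made-up row percentages; note that it returns the columns in descending order, whereas the notebook's version ends up in ascending order.

```python
import pandas as pd

# Hypothetical row percentages per job title (rows) and product (columns)
row_pct = pd.DataFrame(
    {"MySQL": [30, 25], "PostgreSQL": [20, 28], "SQLite": [5, 4],
     "MongoDB": [12, 9], "Oracle": [3, 2], "Redshift": [8, 15],
     "Other": [22, 17]},
    index=["Analyst", "Data Engineer"],
)

# For each product take its best share across job titles, then keep the top five
top5 = row_pct.max().nlargest(5).index
data = row_pct[top5]
print(data)
```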


### Big Data Products Used Most Often Per Job Title

In [None]:
toolChart(tools[0])
plt.suptitle("Top 5 Commonly Used Big Data Products Per Job Title",y=1.05,
             color="white",fontsize=25)
plt.show()

<div class="alert alert-info" style="font-size:16px;width:85%;">
Big Data Products Used Most Often:
    <ul> <li><b>Academic Data Scientists: </b>MySQL </li>
        <li><b>Machine Learning Engineer: </b>MySQL and PostgreSQL </li>
        <li><b>Analyst: </b>MySQL and Microsoft SQL Server</li>
        <li><b>Data Scientist: </b>MySQL and PostgreSQL </li>
        <li><b>Data Engineer: </b> Microsoft SQL Server</li>
</ul>   
</div>

### Business Intelligence (BI) Tools Used Most Often Per Job Title

In [None]:
toolChart(tools[1])
plt.suptitle("Top 5 Most Used BI Tools Per Job Title",y=1.05,
             color="white",fontsize=25)
plt.show()

<div class="alert alert-info" style="font-size:16px;width:85%;">
BI Tools Used Most Often:
    <ul> <li><b>Academic Data Scientists: </b>Tableau, Power BI, and Google Data Studio </li>
        <li><b>Machine Learning Engineer: </b>Tableau </li>
        <li><b>Analyst: </b>Tableau and Power BI</li>
        <li><b>Data Scientist: </b>Tableau </li>
        <li><b>Data Engineer: </b> Power BI</li>
</ul>   
</div>

### Programming Languages Used Regularly Per Job Title

In [None]:
toolChart(tools[2])
plt.suptitle("Top 5 Commonly Used Programming Lang. Per Job Title",y=1.05,
             color="white",fontsize=25)
plt.show()

<div class="alert alert-info" style="font-size:16px;width:85%;">
Programming Languages Used Regularly:
    <ul> <li><b>Academic Data Scientists: </b>Python </li>
        <li><b>Machine Learning Engineer: </b>Python </li>
        <li><b>Analyst: </b>Python and SQL</li>
        <li><b>Data Scientist: </b>Python and SQL </li>
        <li><b>Data Engineer: </b>Python and SQL </li>
</ul>   
</div>

### Cloud Platforms Used Regularly Per Job Title

In [None]:
ax = toolChart(tools[3])
ax.legend(loc="upper center", bbox_to_anchor=(0.45, 1.1),ncol=3,fontsize=15)
plt.suptitle("Top 5 Commonly Used Cloud Platforms Per Job Title",
             y=1.075,color="white",fontsize=25)
plt.show()

<div class="alert alert-info" style="font-size:16px;width:85%;">
Cloud Platforms Used Regularly:
    <ul> <li><b>Academic Data Scientists: </b>AWS and Google Cloud Platform</li>
        <li><b>Machine Learning Engineer: </b>AWS and Google Cloud Platform </li>
        <li><b>Analyst: </b>AWS, Google Cloud Platform, and Microsoft Azure </li>
        <li><b>Data Scientist: </b>AWS, Google Cloud Platform, and Microsoft Azure</li>
        <li><b>Data Engineer: </b>AWS, Google Cloud Platform, and Microsoft Azure </li>
</ul>   
</div>

<div id="7"></div>

# 7. Short Summary

General Demographics of Employed Kagglers

While some biases (e.g., response bias) are inherent in the data as well as in my own analysis, the general findings give us some insight into the current landscape for employed data professionals. From the responses, employed data professionals who took the survey are young (42% are in their mid-20s to mid-30s), mostly located in Asia or Europe (60%), predominantly male (81.5%), and formally educated (90% have or plan to obtain a higher-ed degree within the next 2 years).

Recommendation for Technical Tools To Learn

Going by the responses from Kagglers who reported being employed data professionals, Python and SQL are great programming languages to learn, as they are the most commonly used in a professional setting. In addition, learning how to use cloud platforms such as AWS, Google Cloud Platform, and Microsoft Azure might be a good idea if one is considering joining a team that works with "big data." For relational databases, the following are good choices to learn: MySQL, PostgreSQL, or Microsoft SQL Server. Lastly, for those planning on doing work involving BI, knowledge of Tableau and/or Power BI will be great to have.

## Data Used For Tableau Dashboard 
Note: Data shape is in long format (index is used for unique count).

```python
forTableau = employedDF[['What is your age (# years)?', 'What is your gender?',
       'In which country do you currently reside?',
       'What is the highest level of formal education that you have attained or plan to attain within the next 2 years?',
       'Select the title most similar to your current role (or most recent title if retired)',
       'What is your current yearly compensation (approximate $USD)?',
       'Which of the following big data products (relational database, data warehouse, data lake, or similar) do you use most often?',
       'What programming languages do you use on a regular basis?',
       'Approximately how many individuals are responsible for data science workloads at your place of business?',
       'Which of the following cloud computing platforms do you use on a regular basis?','continent']]

# columns whose cells hold lists (multi-select questions)
containsList = [col for col in forTableau.columns
                if forTableau[col].apply(lambda x: isinstance(x, list)).any()]

# one row per selected option; the repeated index identifies the respondent
for col in containsList:
    forTableau = forTableau.explode(col)
forTableau.to_csv("try.csv")
forTableau.head()
```
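
As a minimal sketch of what "long format" means here, the example below explodes a list-valued column on a toy frame and then counts unique respondents through the repeated index. The `wide` frame and its `role`/`languages` columns are made up for illustration.

```python
import pandas as pd

# Hypothetical wide data: one row per respondent, multi-select answers as lists
wide = pd.DataFrame({
    "role": ["Analyst", "Data Engineer"],
    "languages": [["Python", "SQL"], ["SQL"]],
})

# Find the columns that contain lists
list_cols = [c for c in wide.columns
             if wide[c].apply(lambda v: isinstance(v, list)).any()]

# Explode each list column: one row per selected option, index repeated
long = wide
for c in list_cols:
    long = long.explode(c)

# The repeated index acts as the respondent id for unique counts
respondents_per_language = (
    long.reset_index().groupby("languages")["index"].nunique()
)
print(long)
print(respondents_per_language)
```

Counting unique index values rather than rows is what keeps the totals right after exploding, especially when several list columns are expanded in turn.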

# How might your job profile look?
### Use the dashboard below to explore responses by country, job title, gender, and level of education.

In [None]:
%%html
<div class='tableauPlaceholder' id='viz1609977219722' style='position: relative'><noscript><a href='#'><img alt=' ' src='https:&#47;&#47;public.tableau.com&#47;static&#47;images&#47;Ka&#47;KaggleSurvey2020&#47;jobInfographic&#47;1_rss.png' style='border: none' /></a></noscript><object class='tableauViz'  style='display:none;'><param name='host_url' value='https%3A%2F%2Fpublic.tableau.com%2F' /> <param name='embed_code_version' value='3' /> <param name='site_root' value='' /><param name='name' value='KaggleSurvey2020&#47;jobInfographic' /><param name='tabs' value='no' /><param name='toolbar' value='yes' /><param name='static_image' value='https:&#47;&#47;public.tableau.com&#47;static&#47;images&#47;Ka&#47;KaggleSurvey2020&#47;jobInfographic&#47;1.png' /> <param name='animate_transition' value='yes' /><param name='display_static_image' value='yes' /><param name='display_spinner' value='yes' /><param name='display_overlay' value='yes' /><param name='display_count' value='yes' /><param name='language' value='en' /></object></div>                <script type='text/javascript'>                    var divElement = document.getElementById('viz1609977219722');                    var vizElement = divElement.getElementsByTagName('object')[0];                    if ( divElement.offsetWidth > 800 ) { vizElement.style.width='100%';vizElement.style.height=(divElement.offsetWidth*0.75)+'px';} else if ( divElement.offsetWidth > 500 ) { vizElement.style.width='100%';vizElement.style.height=(divElement.offsetWidth*0.75)+'px';} else { vizElement.style.width='100%';vizElement.style.height='2527px';}                     var scriptElement = document.createElement('script');                    scriptElement.src = 'https://public.tableau.com/javascripts/api/viz_v1.js';                    vizElement.parentNode.insertBefore(scriptElement, vizElement);                </script>

# Please don't forget to upvote if you found this notebook helpful and/or insightful! Thanks!!