<br>
<h1 style = "font-size:30px; font-weight : bold; text-align: center; border-radius: 10px 15px;"> 2021 Kaggle ML & DS Survey: "Currently Not Employed" - A Short EDA </h1>
<br>

---

## Introduction
Kaggle is a diverse data science community, composed of people from different countries, cultures and educational backgrounds. Its members are also at different stages of their professional lives.

The Annual Machine Learning and Data Science Survey, conducted every year since 2017, allows us to obtain a comprehensive view of our community and its subsets. Some commonly covered topics include the analysis of Kagglers by country, gender, age and preferred programming language. The question "Select the title most similar to your current role" also encourages the creation of notebooks focused on the most popular professions and on the student community. However, there is one group that doesn't normally receive similar attention: the Kagglers who selected the option "Currently Not Employed".

In this notebook, I explore this subset of the Kaggle community, analyzing its distribution across different demographic features (gender, age and education) and the unemployment rate within each category.

## Notes About the Dataset

Removed Samples: In an attempt to focus on people who are most likely looking for a job, I removed the samples that have the option ‘Student’ as their current role. I also decided to remove the respondents who are younger than 22 or aged 60 and older.

Question #2 (Gender): The options “Nonbinary”, “Prefer not to say” and “Prefer to self-describe” have a low number of respondents. To analyze the unemployment rate by group, I created a new column called “Gender”, where those answers were set as “Other option”.

Question #3 (Country):
Some countries account for a small number of respondents. To analyze unemployment by country, I created a new column called “Country”, where all answers that correspond to nations with fewer than 200 respondents in the remaining dataset were changed to the option “Other”.

Question #4 (Education Level):
-	The options “Doctoral degree” and “Professional doctorate” were changed to “Doctor’s degree”;
-	The options “No formal education past high school” and “Some college/university study without earning a bachelor’s degree” were changed to “No degree”.
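The grouping of rare categories described in these notes can be sketched on toy data. This is a minimal illustration, not the notebook's own code: the helper name `group_rare`, the made-up country labels and the threshold of 3 are all assumptions chosen to keep the example small.

```python
import pandas as pd

# Hypothetical toy data: country labels and threshold are illustrative only.
toy = pd.DataFrame({"Q3": ["A"] * 5 + ["B"] * 3 + ["C"] * 1})


def group_rare(series, min_count, other_label="Other"):
    """Replace categories with fewer than `min_count` rows by `other_label`."""
    counts = series.value_counts()
    frequent = counts[counts >= min_count].index
    # Keep values that belong to a frequent category, relabel the rest.
    return series.where(series.isin(frequent), other_label)


toy["Country"] = group_rare(toy["Q3"], min_count=3)
grouped_counts = toy["Country"].value_counts().to_dict()
print(grouped_counts)
```

The original answer column is left untouched, so the unmodified answers remain available for the distribution plots.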



In [None]:
#Importing Packages

import pandas as pd 
import matplotlib as mat
import matplotlib.pyplot as plt    
import numpy as np
import seaborn as sns
%matplotlib inline

import plotly.express as px
import plotly.figure_factory as ff


import warnings
warnings.filterwarnings('ignore') 

In [None]:
raw_df= pd.read_csv("/kaggle/input/kaggle-survey-2021/kaggle_survey_2021_responses.csv")

In [None]:

df = raw_df.copy()

#Dropping first row, which holds the question text rather than a response
df = df.drop(index = df.index[0])

#Removing students from dataframe 
df = df[~df['Q5'].isin(['Student'])]

#Removing certain age groups and resetting index
df = df[~df['Q1'].isin(['18-21', '60-69', '70+'])].reset_index(drop=True)

#New column where options are Employed or Unemployed
df['Employed'] = df['Q5'].apply(lambda x: 'Unemployed' if x == 'Currently not employed' else 'Employed')

#New column with two age ranges: 22-29 and 30+
df['<29_30>'] = df['Q1'].apply(lambda x: '22-29' if (x == '22-24' or x == '25-29') else '30+')

#New gender column, grouping the three options with fewer samples
other_opt = ['Nonbinary', 'Prefer not to say', 'Prefer to self-describe']
df['Gender'] = df['Q2'].apply(lambda x: 'Other option' if x in other_opt else x)

#Grouping some answers on education level
no_degree_list = ['No formal education past high school', 'Some college/university study without earning a bachelor’s degree']
docs_degree_list = ['Doctoral degree', 'Professional doctorate']

df['Q4'] = df['Q4'].replace(no_degree_list,'No degree')
df['Q4'] = df['Q4'].replace(docs_degree_list,"Doctor's degree")

#Getting list of countries with at least 200 respondents
country_counts = df['Q3'].value_counts()
over200_list = country_counts[country_counts >= 200].index.tolist()
#print(len(over200_list))
#print(over200_list)

#New country column, grouping countries with few respondents into 'other' category
df['Country'] = df['Q3'].apply(lambda x: 'Other' if x not in over200_list else x)


#display(df)

In [None]:
#Creating dataframes used on the analysis by country

Country_unemp = pd.DataFrame(df.groupby(by=['Country'])['Employed'].value_counts(normalize = True).reset_index(name='percentage'))
Country_unemp['percentage'] *= 100
Country_unemp['percentage'] = Country_unemp['percentage'].round(decimals=2)
Country_unemp = Country_unemp[Country_unemp['Employed'] =='Unemployed'].sort_values('percentage', ascending = False)

#display(Country_unemp)

Country_unemp_gen = pd.DataFrame(df.groupby(by=['Country','Gender'])['Employed'].value_counts(normalize = True).reset_index(name='percentage'))
Country_unemp_gen['percentage'] *= 100
Country_unemp_gen['percentage'] = Country_unemp_gen['percentage'].round(decimals=2)

Country_unemp_gen = Country_unemp_gen[Country_unemp_gen['Employed'] =='Unemployed']
Country_unemp_gen = Country_unemp_gen.drop('Employed', axis = 1)
Country_unemp_gen = Country_unemp_gen[Country_unemp_gen['Gender'] != 'Other option']

Country_unemp_gen = Country_unemp_gen.pivot(index='Country', columns='Gender')
Country_unemp_gen.columns = ['Man', 'Woman']
Country_unemp_gen = Country_unemp_gen.sort_values('Woman', ascending = False)

#display(Country_unemp_gen)

Country_unemp_age = pd.DataFrame(df.groupby(by=['Country','<29_30>'])['Employed'].value_counts(normalize = True).reset_index(name='percentage'))
Country_unemp_age['percentage'] *= 100
Country_unemp_age['percentage'] = Country_unemp_age['percentage'].round(decimals=2)

Country_unemp_age = Country_unemp_age[Country_unemp_age['Employed'] =='Unemployed']
Country_unemp_age = Country_unemp_age.drop('Employed', axis = 1)

Country_unemp_age = Country_unemp_age.pivot(index='Country', columns='<29_30>')
Country_unemp_age.columns = ['22-29', '30+']
Country_unemp_age = Country_unemp_age.sort_values('22-29', ascending = False)

#display(Country_unemp_age)

Country_unemp_edu = pd.DataFrame(df.groupby(by=['Country','Q4'])['Employed'].value_counts(normalize = True).reset_index(name='percentage'))
Country_unemp_edu['percentage'] *= 100
Country_unemp_edu['percentage'] = Country_unemp_edu['percentage'].round(decimals=2)

Country_unemp_edu = Country_unemp_edu[Country_unemp_edu['Employed'] =='Unemployed']
Country_unemp_edu = Country_unemp_edu.drop('Employed', axis = 1)
Country_unemp_edu = Country_unemp_edu[Country_unemp_edu['Q4'].isin(['Bachelor’s degree', 'Master’s degree'])]

Country_unemp_edu = Country_unemp_edu.pivot(index='Country', columns='Q4')
Country_unemp_edu.columns = ['Bachelor’s degree', 'Master’s degree']
Country_unemp_edu = Country_unemp_edu.sort_values('Bachelor’s degree', ascending = False)

#display(Country_unemp_edu)

In [None]:
#Plotting functions

def plotly_hist_v(df, column, title, xtitle, height = 400, color = None
                     , categoryorder = 'trace', categoryarray = None, histnorm = None):

    fig = px.histogram(df, x=column, height = height, histnorm = histnorm, color = color)

    fig.update_traces(marker_color = '#365c8d')
    
    fig.update_layout(
        #template = 'plotly_dark',
        paper_bgcolor='#F5F5F5', #background color outside plot area
        plot_bgcolor='#F5F5F5', #background color inside plot area
        title=dict(
            text= title,
            x=0.5,
        ),      
        xaxis=dict(
            title= xtitle,
            titlefont_size=16,
            categoryorder = categoryorder, #default = 'trace' -> order on dataframe
            categoryarray = categoryarray #order based on list when categoryorder = 'array'
        ),    
        yaxis=dict(
            title='% of Respondents',
            titlefont_size=14,
        ),
        barmode = 'group' #default = 'stack' ->stacked bar chart when 'color' is defined
    )
    
    fig.show()
    
    
def plotly_bar_v(df, column, title, xtitle, y, height = 400, color = None
                     , category_orders = None, color_discrete_map = None):

    fig = px.bar(df, x=column, y = y ,height = height, text = y, color = color, 
                  category_orders = category_orders, color_discrete_map = color_discrete_map)
    
    #Single default color when no explicit color map is given
    if not color_discrete_map:
        fig.update_traces(marker_color = '#365c8d')

    fig.update_layout(
        #template = 'plotly_dark',
        paper_bgcolor='#F5F5F5', #background color outside plot area
        plot_bgcolor='#F5F5F5', #background color inside plot area
        title=dict(
            text= title,
            x=0.5,
        ),      
        xaxis=dict(
            title= xtitle,
            titlefont_size=16,
            #categoryorder = categoryorder, #default = 'trace' -> order on dataframe
            #categoryarray = categoryarray #order based on list when categoryorder = 'array'
        ),    
        yaxis=dict(
            title='% of Respondents',
            titlefont_size=14,
        ),
        barmode = 'group' #default = 'stack' ->stacked bar chart when 'color' is defined
    )

    fig.show()


def annotated_heatmap(x, y, z, title, width=1200, height = 400):

    fig = ff.create_annotated_heatmap(z, x=x, y=y, annotation_text=z, colorscale='viridis', xgap=3, ygap=3)

    fig.update_layout(title_text=title,
                      title_x=0.5,
                      titlefont={'size': 24, 'family':'San Serif'},
                      width=width, height = height,
                      xaxis_showgrid=False,
                      xaxis={'side': 'bottom'},
                      yaxis_showgrid=False,
                      yaxis_autorange='reversed',                   
                      paper_bgcolor='#F5F5F5',
                      )
    fig.show() 
    
    
def choropleth(df, locations, color, title):
    fig = px.choropleth(df, locations= locations, locationmode='country names',
                        color= color,
                        hover_name = locations,
                        color_continuous_scale=px.colors.sequential.Viridis,
                        labels= {color : '%'},
                        title = title)

    fig.update(layout=dict(title=dict(x=0.5)))

    fig.show()

In [None]:
#Lists used to define the orders on their respective plots

age_order = ['18-21', '22-24', '25-29', '30-34', '35-39'
             , '40-44', '45-49', '50-54', '55-59', '60-69', '70+']

gender_order = ['Man', 'Woman', 'Nonbinary'
             , 'Prefer not to say', 'Prefer to self-describe']

gender_new_order = ['Man', 'Woman', 'Other option']

education_order = ['I prefer not to answer', 'No degree', 'Bachelor’s degree'
             , 'Master’s degree', "Doctor's degree"]

### Unemployment Rate

After limiting the scope of the dataset to Kagglers between 22 and 59 years old who didn’t declare themselves as ‘students’, 17,152 respondents remain. In this group, 10.34% of the respondents are currently not employed.

In [None]:
#Distribution between Employed and Unemployed
plotly_hist_v(df, 'Employed', 'Distribution by Employment Status', 'Employment Status', histnorm = 'percent')

### Distribution by Gender

Among the Kagglers who declared themselves as not employed, 73.96% are men and 23.96% are women. The other options make up around 2% of those respondents.

In [None]:
plotly_hist_v(df[df['Employed']=='Unemployed'], 'Q2', 'Unemployed Kagglers: Gender Distribution', 'Gender'
                 , categoryorder ='array', categoryarray = gender_order, histnorm = 'percent')

In [None]:
grouped_gender = pd.DataFrame(df.groupby(by=['Gender'])['Employed'].value_counts(normalize = True).reset_index(name='percentage'))
grouped_gender['percentage'] *= 100
grouped_gender['percentage'] = grouped_gender['percentage'].round(decimals=2)
#grouped_gender

### Unemployment Rate by Gender

Given the small number of samples that correspond to the options ‘Nonbinary’, ‘Prefer not to say’ and ‘Prefer to self-describe’, they were combined in an attempt to reduce possible distortions when we analyze the unemployment rate.

We see that less than 9.50% of male respondents are unemployed, while the unemployment rates for women and for the other options are 14.19% and 12.13%, respectively. Given the restrictive nature of the survey and the lack of additional context, we should be careful when drawing conclusions, but this result may be indicative of gender bias.
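The per-group rate discussed here is the share of rows labelled ‘Unemployed’ within each group. A minimal sketch on toy data follows; the helper `unemployment_rate` and the six made-up rows are assumptions for illustration, and the helper takes the mean of a boolean flag instead of the `value_counts(normalize = True)` route used in the notebook cells.

```python
import pandas as pd

# Hypothetical toy rows, not survey data.
toy = pd.DataFrame({
    "Gender":   ["Man", "Man", "Man", "Man", "Woman", "Woman"],
    "Employed": ["Employed", "Employed", "Employed", "Unemployed",
                 "Employed", "Unemployed"],
})


def unemployment_rate(frame, by):
    """Percentage of rows labelled 'Unemployed' within each group of `by`."""
    is_unemployed = frame["Employed"].eq("Unemployed")
    # The mean of a boolean flag is the fraction of True values per group.
    return (is_unemployed.groupby(frame[by]).mean() * 100).round(2)


rates = unemployment_rate(toy, "Gender")
print(rates)
```

In the toy data, one of four men and one of two women are unemployed, so the helper returns 25.0 and 50.0.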

In [None]:
plotly_bar_v(grouped_gender[grouped_gender['Employed']=='Unemployed'], 'Gender', 'Unemployment Rate by Gender', 'Gender', y = 'percentage'
                 , category_orders = {'Gender' : gender_new_order})

### Distribution by Age

Over 50% of the respondents who are currently not employed are 29 years old or younger. From 30 years onward, the share of this subset declines with each successive age range.
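Note that this is a distribution within the unemployed subset, not a rate within each age group. A small sketch of how such a share and its running total can be computed is shown below; the age labels in `toy_age_order` and the eight toy answers are made up for illustration.

```python
import pandas as pd

# Hypothetical age answers of eight unemployed respondents (toy data).
toy_age_order = ["22-24", "25-29", "30-34", "35-39"]
unemployed_ages = pd.Series(
    ["22-24", "22-24", "25-29", "30-34", "25-29", "35-39", "22-24", "30-34"]
)

# Share of each age range among the unemployed, in the intended display order.
share = unemployed_ages.value_counts(normalize=True).reindex(toy_age_order, fill_value=0) * 100

# Running total: the value at '25-29' is the share aged 29 or younger.
cumulative = share.cumsum()
print(cumulative)
```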

In [None]:
plotly_hist_v(df[df['Employed']=='Unemployed'], 'Q1', 'Unemployed Kagglers: Age Distribution', 'Age Range'
                 , categoryorder ='array', categoryarray = age_order, histnorm = 'percent')

In [None]:
grouped_age = pd.DataFrame(df.groupby(by=['Q1'])['Employed'].value_counts(normalize = True).reset_index(name='percentage'))
grouped_age['percentage'] *= 100
grouped_age['percentage'] = grouped_age['percentage'].round(decimals=2)
#grouped_age

### Unemployment Rate by Age

As expected, the highest unemployment rates are found among the youngest Kagglers. The lowest rates are found among respondents between 30 and 49 years old. After that age, we see a slight increase in unemployment.

In [None]:
plotly_bar_v(grouped_age[grouped_age['Employed']=='Unemployed'], 'Q1', 'Unemployment Rate by Age Range', 'Age Range', y = 'percentage'
                 , category_orders = {'Q1' : age_order})

### Distribution by Education Level

The vast majority of currently not employed Kagglers selected a Bachelor’s or Master’s degree as their highest level of education; together, these two options account for over 80% of those respondents.

In [None]:
plotly_hist_v(df[df['Employed']=='Unemployed'], 'Q4', 'Unemployed Kagglers: Education Distribution', 'Education Level'
                 , categoryorder ='array', categoryarray = education_order, histnorm = 'percent')

In [None]:
grouped_education = pd.DataFrame(df.groupby(by=['Q4'])['Employed'].value_counts(normalize = True).reset_index(name='percentage'))
grouped_education['percentage'] *= 100
grouped_education['percentage'] = grouped_education['percentage'].round(decimals=2)
#grouped_education

### Unemployment Rate by Education Level

The highest unemployment rate (17.41%) is found among the respondents who preferred not to state their education level. We can notice that the unemployment rates among Kagglers who have no formal degree and those who have a bachelor’s degree are actually quite close, around 14%. As we move up the ‘formal education ladder’, the unemployment rate seems to decrease, reaching 4.97% among Kagglers with a doctor’s degree.

In [None]:
plotly_bar_v(grouped_education[grouped_education['Employed']=='Unemployed'], 'Q4', 'Unemployment Rate by Education Level', 'Education Level', y = 'percentage'
                 , category_orders = {'Q4': education_order})

In [None]:
grouped_age_gender = pd.DataFrame(df.groupby(by=['<29_30>', 'Gender'])['Employed'].value_counts(normalize = True).reset_index(name='percentage'))
grouped_age_gender['percentage'] *= 100
grouped_age_gender['percentage'] = grouped_age_gender['percentage'].round(decimals=2)

#grouped_age_gender

### Unemployment Rate by Gender and Age

To have a better understanding of the unemployment landscape, we can analyze different demographic features at the same time, checking how each combination of categories affects the unemployment rate. We start with gender and age range.

When combining those features, we can observe an interesting point. The unemployment rates by gender are somewhat close among Kagglers who are under 30 years old. The gap is considerably larger, especially between men and women, among the older subset of respondents.
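A two-feature rate table of this kind can be sketched as follows. The toy rows and the column name `age_band` are assumptions, and the sketch uses the mean of a boolean flag followed by `unstack`, which is a different route from the `value_counts` and filter steps in the notebook cells. One practical difference: a combination with no unemployed respondents stays in the table as 0 instead of dropping out when the rows are filtered to ‘Unemployed’.

```python
import pandas as pd

# Hypothetical toy rows, not survey data.
toy = pd.DataFrame({
    "Gender":   ["Man", "Man", "Man", "Man", "Woman", "Woman", "Woman", "Woman"],
    "age_band": ["22-29", "22-29", "30+", "30+", "22-29", "22-29", "30+", "30+"],
    "Employed": ["Unemployed", "Employed", "Employed", "Employed",
                 "Unemployed", "Employed", "Unemployed", "Employed"],
})

flag = toy["Employed"].eq("Unemployed")

# Rows: gender, columns: age band, values: unemployment rate in percent.
table = (
    flag.groupby([toy["Gender"], toy["age_band"]])
        .mean()
        .unstack("age_band")
        .mul(100)
        .round(2)
)
print(table)
```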


In [None]:
plotly_bar_v(grouped_age_gender[grouped_age_gender['Employed']=='Unemployed'], 'Gender', 'Unemployment Rate by Gender and Age Range', 'Gender', y = 'percentage'
                 , category_orders = {'Gender': gender_new_order}, color = '<29_30>'
            , color_discrete_map= { "22-29": "#46327e", "30+": "#2e6e8e"})

In [None]:
grouped_edu_gender = pd.DataFrame(df.groupby(by=['Q4', 'Gender'])['Employed'].value_counts(normalize = True).reset_index(name='percentage'))
grouped_edu_gender['percentage'] *= 100
grouped_edu_gender['percentage'] = grouped_edu_gender['percentage'].round(decimals=2)
#grouped_edu_gender

### Unemployment Rate by Gender and Education Level

Now let’s take a look at the unemployment rate analyzing the samples both by gender and education level.

The highest unemployment rate is found among the Kagglers who don’t define themselves as either man or woman and don’t have a formal degree. However, it’s important to point out that, given the small number of samples in that group, we should avoid drawing definitive conclusions.
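One way to see why small groups call for caution is to attach an interval to each rate. The sketch below uses the Wilson score interval, which is a technique added here for illustration and not part of the original analysis; the function name `wilson_interval` and the counts are hypothetical and not taken from the survey.

```python
import math


def wilson_interval(unemployed, total, z=1.96):
    """Approximate 95% Wilson score interval for a proportion (z = 1.96)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = unemployed / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return center - half, center + half


# Hypothetical counts: the same 20% rate observed in a small and a large group.
small = wilson_interval(4, 20)
large = wilson_interval(400, 2000)
print(small, large)
```

With these made-up counts, the interval for the small group spans roughly 8% to 42%, while the large group stays within roughly 18% to 22%, so the same observed rate carries very different uncertainty.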

Among the respondents who declared themselves as man or woman, we can see that the unemployment rate is always higher in the female subset. Apart from the ‘no degree’ option, the gap is always substantial. Surprisingly, it doesn’t seem to shrink even as we progress through the education levels. Among respondents with a doctor’s degree, 3.71% of the men are unemployed, while the unemployment rate for women is 9.43%.


In [None]:
plotly_bar_v(grouped_edu_gender[grouped_edu_gender['Employed']=='Unemployed'], 'Q4', 'Unemployment Rate by Gender and Education Level', 'Education Level', y = 'percentage'
                 , category_orders = {'Q4': education_order, 'Gender': gender_new_order}, color = 'Gender'
            , color_discrete_map= { "Man": "#46327e", "Woman": "#365c8d", "Other option": "#2db27d"})

### Unemployment Rate by Country

Now, let’s take a look at the unemployment rate by country.

The five countries with the highest unemployment rate are: Indonesia, Nigeria, India, Pakistan and Egypt.

The country with the lowest unemployment rate is Italy (4.62%) followed by: China, Brazil, Germany and France.

In [None]:
annotated_heatmap(x = list(Country_unemp['Country']), y = ['Country'], z = [list(Country_unemp['percentage'])]
                           , title = 'Unemployment Rate by Country')

To find geographic patterns, we can plot a choropleth map. For instance, we can see that, apart from the United Kingdom, unemployment in Europe is relatively low, with rates ranging from 4.62% to 7.42%.

In [None]:
choropleth(Country_unemp, "Country", "percentage", "Unemployment Rate by Country")

### Men and Women Unemployment Rate by Country

We can also analyze the unemployment rate by country for men and women.

When we focus on the female Kagglers, the top five countries by unemployment rate change to the following: Spain, Canada, Egypt, Nigeria and India.

The last two countries have a high ‘overall’ unemployment rate, with similar values for men and women. However, the first three countries show an alarming difference in unemployment between genders. Spain, ranked first, has a rate of 20% for women, while only 5.62% of Spanish male Kagglers are unemployed.

In almost all listed countries, the unemployment rate for men is either significantly lower than or around the same as the rate for women. An interesting exception is the United Kingdom, where 11.36% of male respondents are unemployed, against 5.75% for women.
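Comparisons like these can be made explicit by adding a gap column to a wide table with one row per country. The sketch below assumes such a table with ‘Man’ and ‘Woman’ columns; the country labels and rates are made up, and the `gap` column is an illustrative addition that the notebook does not compute.

```python
import pandas as pd

# Hypothetical rates (in %) for made-up countries, not survey results.
wide = pd.DataFrame(
    {"Man": [5.0, 10.0, 8.0], "Woman": [15.0, 6.0, 8.5]},
    index=pd.Index(["Country A", "Country B", "Country C"], name="Country"),
)

# Positive gap: women have the higher rate; negative gap: men do.
wide["gap"] = wide["Woman"] - wide["Man"]
ranked = wide.sort_values("gap", ascending=False)

# Countries where the rate for men exceeds the rate for women.
men_higher = ranked.index[ranked["gap"] < 0].tolist()
print(ranked)
print(men_higher)
```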

In [None]:
annotated_heatmap(x = list(Country_unemp_gen.index), y = ['Woman', 'Man'], z = [list(Country_unemp_gen['Woman']), list(Country_unemp_gen['Man'])]
                           , title = 'Men and Women Unemployment Rate by Country')

On the choropleth map for female unemployment rate, it’s noticeable how Spain and Canada differ from the other countries of their respective continents.

In [None]:
choropleth(Country_unemp_gen, Country_unemp_gen.index, "Woman", "Women Unemployment Rate by Country")

### Unemployment Rate by Country and Age Range

The five countries with the highest unemployment rate for younger Kagglers (22 to 29 years old) are: Indonesia, Nigeria, India, Egypt and the United Kingdom.

As expected, the unemployment rate for respondents who are under 30 years old is significantly higher than for older Kagglers in most countries. However, there are interesting exceptions, mainly Italy, Japan and Canada, where the unemployment rate is actually higher for those aged 30 or older.

In [None]:
annotated_heatmap(x = list(Country_unemp_age.index), y = ['22-29', '30+'], z = [list(Country_unemp_age['22-29']), list(Country_unemp_age['30+'])]
                           , title = 'Unemployment Rate (Under and Over 30 years old) by Country')

We can plot the choropleth map for both categories to highlight the contrasts.

In [None]:
choropleth(Country_unemp_age, Country_unemp_age.index, "22-29", "Young Kagglers (29 or less) Unemployment Rate by Country")
choropleth(Country_unemp_age, Country_unemp_age.index, "30+", "Older Kagglers (30+ Years) Unemployment Rate by Country")

### Unemployment Rate by Country for respondents with Bachelor’s and Master’s Degree

Among the five countries with the highest unemployment rate for Kagglers whose highest level of education is a bachelor’s degree, we find Indonesia, Pakistan, Nigeria and India, countries that also have a higher overall unemployment rate. However, first place on this list goes to France, which presents a substantial gap in unemployment rate between respondents with a Bachelor’s and a Master’s degree (25.00% and 7.08%, respectively).

Generally, having a higher formal degree translates to a higher chance of finding a job. Curiously, this doesn’t seem to be true in some countries. In Italy, China, the United Kingdom and Egypt, the unemployment rate is actually higher among respondents with a Master’s degree.


In [None]:
annotated_heatmap(x = list(Country_unemp_edu.index), y = ['Bachelor’s degree', 'Master’s degree'], z = [list(Country_unemp_edu['Bachelor’s degree']), list(Country_unemp_edu['Master’s degree'])]
                           , title = "Unemployment Rate (Bachelor's and Master's Degree) by Country")

We finish this notebook by plotting the choropleth map for both categories. It’s interesting to see how France stands out from the rest on the first plot (Bachelor’s degree) and how the overall landscape changes on the second (Master’s degree).

In [None]:
choropleth(Country_unemp_edu, Country_unemp_edu.index, "Bachelor’s degree", "Unemployment Rate (Kagglers w/Bachelor's Degree) by Country")
choropleth(Country_unemp_edu, Country_unemp_edu.index, "Master’s degree", "Unemployment Rate (Kagglers w/Master’s Degree) by Country")

## Conclusion

This notebook presents a brief exploration of the subset of Kagglers who are currently not employed. It's important to point out that it doesn't attempt to answer specific questions, such as how to maximize the chances of employment. The intent was simply to provide an overview of this particular group.

When it comes to using this dataset to provide some guidance for those who are currently unemployed, notebooks that explore the tools and technologies used by people working in ‘data-related’ jobs are a good place to start. However, I believe the survey can still be improved by including, for instance, questions about the level of knowledge of each of those tools. Such questions would help us understand the knowledge/skill gap between those who are successful in this field and those who aim to enter it.


### Additional Note

The use of Plotly annotated heatmaps and part of the structure adopted in this work were inspired by [@desalegngeb’s](https://www.kaggle.com/desalegngeb) notebook [“How popular is kaggle in Africa?”](https://www.kaggle.com/desalegngeb/how-popular-is-kaggle-in-africa).

## <center> Thank you for reading! </center>