# A SWOT Analysis of the World of Data Science

![](https://drive.google.com/uc?export=download&id=1JDL1L85J_g0uzghs12ekbUHc-reSZy6n)


Throughout my life, I have lived in many countries. Whether you are interested in visiting, living, or working in a country, each country has its own competitive advantage. In other words, each country has some strengths and some weaknesses. This inspired me to create this notebook. In this notebook, we will analyze the data from the Kaggle Survey 2019. We will use the SWOT analysis [[1](#ref_1), [2](#ref_2), [3](#ref_3)] as a framework to guide us. There are 4 components of a SWOT analysis:
* Strengths
* Weaknesses
* Opportunities
* Threats

It is worth mentioning that a SWOT analysis is typically used when making decisions. However, it can also be used to capture the state of a business, a competition, a concept or pretty much anything you want. We will learn more about SWOT in [Section_1](#section_1). 

In this notebook, we will use the World Bank categorization of the countries around the world. The World Bank categorizes countries based on their economy into 4 groups: 
* Low income
* Lower-middle income
* Upper-middle income
* High income

The Kaggle 2019 survey only contains data about countries belonging to the lower-middle income, upper-middle income, and high income categories. There's no data for low income countries (this is likely because the number of responses from each of these countries is below 50). We will perform the SWOT analysis on these 3 income groups. We will learn more about the World Bank classification in [Section_2](#section_2). 

In [Section_3](#section_3), we will learn more about the income groups available in the survey. Then, we will perform the SWOT analysis from four different angles: gender ([Section_4](#section_4)), age ([Section_5](#section_5)), education ([Section_6](#section_6)), and jobs ([Section_7](#section_7)). Finally, we will conclude this notebook in [Section_8](#section_8).

# Table of Contents

* [Section_1: What is a SWOT analysis?](#section_1)
* [Section_2: World Bank Income Categories](#section_2)
* [Section_3: Getting to know the Countries in the Survey](#section_3)
* [Section_4: Gender](#section_4)
* [Section_5: Age](#section_5)
* [Section_6: Education](#section_6)
* [Section_7: Jobs](#section_7)
* [Section_8: Conclusion](#section_8)
* [References](#sec_references)


# Section_1: What is a SWOT analysis?
<a id="section_1"></a>
![](https://drive.google.com/uc?export=download&id=11uzOHGshUb44gkH-cI9gw2vifrbnSIp_)

SWOT analysis [[1](#ref_1), [2](#ref_2), [3](#ref_3)] is a framework to allow someone to identify the **S**trengths, **W**eaknesses, **O**pportunities, and **T**hreats. This tool is typically used during decision making. However, it can also be used to capture the state of a concept or an entity. 
* Strengths:
    * Things the entity does well.
    * Qualities that separate the entity from competitors.
    * Tangible Assets.
    * This is an internal factor related to the entity itself.
    
* Weaknesses: 
    * This is the opposite of Strengths.
    * Things that the entity lacks.
    * Things that the competitors do better than the entity.
    * Lack of assets or resources.
    * This is an internal factor related to the entity itself.
    
* Opportunities: 
    * Potential external opportunities that the entity may take advantage of but hasn't yet. 
    * This is an external factor related to competitors and the ecosystem.
    
* Threats: 
    * Potential external threats that can prevent the entity from doing well in the future.
    * This is an external factor related to competitors and the ecosystem as a whole.

Here's Jared from Silicon Valley to explain this. 

<iframe width="560" height="315" src="https://www.youtube.com/embed/XfB0g_JDIds" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>

If you want to go through some examples of SWOT, please feel free to check out the examples in [[3](#ref_3)]. 

Now that we understand what SWOTing is, we can move on to the next section. In the next section, we will talk about the classification that the World Bank uses for different countries. 

# Section_2: World Bank Income Categories
<a id="section_2"></a>
![](https://drive.google.com/uc?export=download&id=1EhZ6w3LQp8RzJLG4NEkCK1xvrYiJ5j8H)

Bill Gates once said "The Internet is becoming the town square for the global village of tomorrow" [[4](#ref_4)]. Thanks to the internet, we are living in an era of a truly connected world. We can communicate with each other instantly no matter where we are. We can even move from one place to another in less than a day. In the early 19th century, it used to take 6 weeks to cross the Atlantic. Now, we can cross it in less than 6 hours [[5](#ref_5)]. Today, cross-border trade is about 25% of global production [[6](#ref_6)]. This was made possible by advances in transportation and communication technology. Even when it comes to challenges, we are all facing similar ones. The major one is climate change. 

Despite living in a connected global village, we face our own local opportunities and challenges. Indeed, the economy varies across our countries. The World Bank categorizes economies into 4 groups: low, lower-middle, upper-middle, and high [[7](#ref_7)]. This is based on a measure of national income per person. 

Currently, here's how the World Bank classifies countries based on GNI (Gross National Income) per capita [[7](#ref_7)]:
* Low: \$1,025 or less
* Lower middle-income: between \$1,026 and \$3,995
* Upper middle-income: between \$3,996 and \$12,375
* High-income: \$12,376 or more
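
The thresholds above can be turned into a small lookup. The following is only a minimal sketch: the function name `classify_income_group` is hypothetical, the sample values are illustrative rather than real country data, and the cut-offs are simply the ones quoted in the list above.

```python
def classify_income_group(gni_per_capita: float) -> str:
    """Return the income group for a GNI per capita given in US dollars.

    The cut-offs are the ones quoted in the list above; the function
    itself is a hypothetical helper, not part of any World Bank tool.
    """
    if gni_per_capita < 0:
        raise ValueError("GNI per capita cannot be negative")
    if gni_per_capita <= 1025:
        return "Low income"
    if gni_per_capita <= 3995:
        return "Lower middle income"
    if gni_per_capita <= 12375:
        return "Upper middle income"
    return "High income"


# Illustrative values only, not real country figures.
for value in (800, 2000, 9000, 60000):
    print(value, "->", classify_income_group(value))
```

The returned labels match the spelling used in the `Income Group` column that the notebook filters on later.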

Economies evolve over time. Check out the below figure ([Figure_1](#figure_1)) that shows how China started as a low-income economy in 1990 and is moving towards becoming a high-income economy. 
<a id="figure_1"></a>
<script type='text/javascript' src='https://dataviz.worldbank.org/javascripts/api/viz_v1.js'></script><div class='tableauPlaceholder' style='width: 750px; height: 654px;'><object class='tableauViz' width='750' height='654' style='display:none;'><param name='host_url' value='https%3A%2F%2Fdataviz.worldbank.org%2F' /> <param name='embed_code_version' value='3' /> <param name='site_root' value='&#47;t&#47;DECDG' /><param name='name' value='gni_and_thresholds&#47;final' /><param name='tabs' value='no' /><param name='toolbar' value='yes' /><param name='showAppBanner' value='false' /><param name='display_count' value='no' /><param name='display_spinner' value='no' /><param name='filter' value='iframeSizedToWindow=true' /></object></div>
> **Figure_1**: *Source [[7](#ref_7)]. The economy in many countries improves over time. For instance, in this graph, we can see how China moves from being a low-income economy in 1990 towards becoming a high-income economy.*

As shown in [Figure_1](#figure_1), economies tend to improve over time. In 1990, 6 in 10 people lived in low-income economies; now only 1 in 10 still does [[7](#ref_7)]. 

[Figure_2](#figure_2) shows how the share of world population living in low-income countries has decreased from 60% to less than 10%. People who were under the low income category became part of the middle income categories. The share of world population living in high income countries remained almost the same at around 15% because it is very difficult for high income countries to grow at high rates.

<a id="figure_2"></a>
![](https://drive.google.com/uc?export=download&id=1XGzDivLXH9VA3h3E2ZlAiX2vGVYABF72)
> **Figure_2**: *Source [[7](#ref_7)]. The share of world population living in low-income countries has decreased from 60% to less than 10%. The share of world population living in high income countries remained almost the same at around 15%.*

So far, we have talked about what SWOT is and how the World Bank categorizes the economies of the countries around the world. In the next section, we will get to know the countries that are part of the Kaggle Survey - 2019. 

In [None]:
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# Input data files are available in the "../input/" directory.
# Any results you write to the current directory are saved as output.

from plotly.offline import init_notebook_mode, iplot 
from plotly.subplots import make_subplots
from IPython.display import Image
from wordcloud import WordCloud
from folium import plugins
import plotly.figure_factory as ff
import matplotlib.pyplot as plt
import plotly.graph_objs as go
import plotly.offline as py
import seaborn as sns
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import numpy as np # linear algebra
import pycountry
import folium 
import os

py.init_notebook_mode(connected=True)

for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
# Read the Data

schema_2019 = pd.read_csv("../input/kaggle-survey-2019/survey_schema.csv")
multipleChoiceResponses_2019 = pd.read_csv("../input/kaggle-survey-2019/multiple_choice_responses.csv")
freeFormResponses_2019 = pd.read_csv("../input/kaggle-survey-2019/other_text_responses.csv")
questions_only_2019 = pd.read_csv("../input/kaggle-survey-2019/questions_only.csv")

multipleChoiceResponses_2019.columns = multipleChoiceResponses_2019.iloc[0]
multipleChoiceResponses_2019=multipleChoiceResponses_2019.drop([0])

wdi_countries = pd.read_csv('../input/world-development-indicators/wdi-csv-zip-57-mb-/WDICountry.csv', delimiter=',')
wdi_data = pd.read_csv('../input/world-development-indicators/wdi-csv-zip-57-mb-/WDIData.csv')


In [None]:
# Make sure that the countries are all standardized
standard_countries = pycountry.countries
standard_countries_names = [country.name for country in standard_countries]
unique_kaggle_countries_names = sorted(multipleChoiceResponses_2019['In which country do you currently reside?'].unique())

print ('***** Countries in Kaggle but not in Standard')
for country_name in unique_kaggle_countries_names:
    if country_name not in standard_countries_names:
        print (country_name)
        
unique_countries = sorted(multipleChoiceResponses_2019['In which country do you currently reside?'].unique())
print ('There are ', len(unique_countries) , 'countries in the survey.')
print (unique_countries)

# Kaggle_countries: 
## Czech Republic --> Czechia
## Hong Kong (S.A.R.) --> Hong Kong
## Iran, Islamic Republic of... --> Iran, Islamic Republic of
## Republic of Korea --> Korea
## South Korea --> Korea
## Russia --> Russian Federation
## Taiwan --> Taiwan, Province of China
## United Kingdom of Great Britain and Northern Ireland --> United Kingdom
## United States of America --> United States

replace_countries = {
    'Czech Republic': 'Czechia',
    'Hong Kong (S.A.R.)': 'Hong Kong',
    'Iran, Islamic Republic of...': 'Iran, Islamic Republic of',
    'Republic of Korea': 'Korea',
    'South Korea': 'Korea',
    'Russia': 'Russian Federation',
    'Taiwan': 'Taiwan, Province of China',
    'United Kingdom of Great Britain and Northern Ireland': 'United Kingdom',
    'United States of America': 'United States',
}

def replace_value(df, column, old_value, new_value):
    old_before = sum(df[column]==old_value)
    new_before = sum(df[column]==new_value)
    # Assign back instead of using inplace=True on a column selection, which may not update df
    df[column] = df[column].replace(old_value, new_value)
    old_after = sum(df[column]==old_value)
    new_after = sum(df[column]==new_value)
    
    print ('Replacing ', old_value, ' with ', new_value, 'Before Counts - (old, new)', old_before, new_before, 'After Counts - (Old, new)', old_after, new_after)
    
for key in replace_countries:
    replace_value(multipleChoiceResponses_2019, 'In which country do you currently reside?', key, replace_countries[key])
        
unique_countries = sorted(multipleChoiceResponses_2019['In which country do you currently reside?'].unique())
print ('There are ', len(unique_countries) , 'countries in the survey.')
print (unique_countries)

unique_wdi_countries_names = wdi_countries['Short Name'].values

print ('***** Countries in Kaggle but not in WDI')
unique_kaggle_countries_names = sorted(multipleChoiceResponses_2019['In which country do you currently reside?'].unique())
for country_name in unique_kaggle_countries_names:
    if country_name not in unique_wdi_countries_names:
        print (country_name)

# WDI_countries replacements:
# map the WDI short names to the same standardized names used for the Kaggle countries above

replace_countries = {
    'Czech Republic': 'Czechia',
    'Hong Kong SAR, China': 'Hong Kong',
    'Iran': 'Iran, Islamic Republic of',
    'Russia': 'Russian Federation',
    'Taiwan': 'Taiwan, Province of China',
    'Vietnam': 'Viet Nam',
}

for key in replace_countries:
    replace_value(wdi_countries, 'Short Name', key, replace_countries[key])

# Section_3: Getting to know the Countries in the Survey
<a id="section_3"></a>
![](https://drive.google.com/uc?export=download&id=1X0iLDBQZYObLLaQbJd5kfJyaS4S8O3IV)

Let's try to explore the countries that are represented in the Kaggle survey. [Figure_3](#figure_3) shows a map of all the countries around the world along with their categories. For instance, in the figure:
* The US is classified as a High Income country.
* Russia is classified as an Upper-Middle Income country.
* India is classified as a Lower-Middle Income country. 
* Yemen is classified as a Low Income country. 

In [None]:
# Showing the map of 4 groups
# Work on a copy so that adding a column does not touch wdi_countries
countries_income_groups = wdi_countries[['Short Name', 'Income Group']].copy()
# Encode the income groups as integers (1 = Low ... 4 = High) for the choropleth color scale
countries_income_groups['income_group'] = 1*(countries_income_groups['Income Group'] == 'Low income').astype(int)
countries_income_groups['income_group'] += 2*(countries_income_groups['Income Group'] == 'Lower middle income').astype(int)
countries_income_groups['income_group'] += 3*(countries_income_groups['Income Group'] == 'Upper middle income').astype(int)
countries_income_groups['income_group'] += 4*(countries_income_groups['Income Group'] == 'High income').astype(int)

fig = go.Figure(data=go.Choropleth(type = 'choropleth', locationmode = 'country names', colorscale = 'Blues',
                                   locations = countries_income_groups['Short Name'], z = countries_income_groups['income_group'], 
                                   colorbar_title = 'Income<br>Groups', colorbar_tickmode = 'array', colorbar_tickvals = [1,2,3,4], colorbar_ticktext = ['Low', 'Lower-Middle', 'Upper-Middle', 'High']),
               layout = dict(title = 'World Bank 4 Income Groups'))
py.iplot(fig)

<a id="figure_3"></a>
> **Figure_3**: *A map of all the countries around the world along with their income categories. For instance, the US is classified as High Income, Russia as Upper-Middle Income, India as Lower-Middle Income, and Yemen as Low Income.*

The survey received responses from 171 countries. There are two important notes: 
* If a country or territory has fewer than 50 respondents, its responses are grouped under the name "Other". We will ignore the country "Other". 
* This notebook merges "South Korea" and "Republic of Korea" as both refer to the same country, which is South Korea [[5](#ref_5)]. 

As a result, this notebook will analyze 56 countries. 
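
The two notes above can be illustrated on toy data. This is a minimal sketch, not the notebook's actual preprocessing: the column name `country` and the sample rows are made up for the example.

```python
import pandas as pd

# Toy responses; the real survey column has a much longer name.
responses = pd.DataFrame(
    {"country": ["India", "Other", "South Korea", "Republic of Korea", "Brazil", "Other"]}
)

# Merge the two spellings that refer to the same country.
merged = responses["country"].replace(
    {"South Korea": "Korea", "Republic of Korea": "Korea"}
)

# Drop the catch-all "Other" bucket before counting countries.
kept = merged[merged != "Other"]
analyzed_countries = sorted(kept.unique())
print(len(analyzed_countries), analyzed_countries)
```

On the real data, the same two steps are what bring the number of analyzed countries down to the figure quoted above.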

[Figure_4](#figure_4) shows the distribution of unique countries along with their income groups in the Kaggle-Survey-2019. The majority of the countries in the survey are high income countries. The second largest group is the upper-middle income countries followed by the lower-middle income countries. As mentioned before, there are no low income countries in the survey. 

In [None]:
# A bar showing the 3 groups in the Kaggle Survey
kaggle_countries = multipleChoiceResponses_2019['In which country do you currently reside?'].unique()

# Get the countries in the current survey
kaggle_countries_income_groups = countries_income_groups[countries_income_groups['Short Name'].isin(kaggle_countries)]
counts = kaggle_countries_income_groups.groupby('Income Group').count()['Short Name']
colors = ["#083069", "#1F6EB1", "#6BAED6"]# High, Upper, Lower

labels = ['High income', 'Upper middle income', 'Lower middle income']
values = counts.loc[labels]

labels = ['High income', 'Upper-Middle income', 'Lower-Middle income']

fig = go.Figure(data=[go.Pie(labels=labels, values=values, textinfo='label+percent', marker=dict(colors=colors))],
                layout = dict(title='Percentage of Unique Countries in the Survey and Their Income Group'))
fig.show()

<a id="figure_4"></a>
> **Figure_4**: *This chart is based on the number of unique countries in the survey and does not take into consideration the number of participants from each country. The majority of the countries in the survey are high income countries. The second largest group is the upper-middle income countries followed by the lower-middle income countries. There are no low income countries in the survey.* 

The previous pie chart counts each country once, regardless of how many participants it has. So, let's plot the percentage of participants by income group. [Figure_5](#figure_5) shows the distribution of participants across income groups in the Kaggle-Survey-2019. The largest share of participants (47%) comes from high income countries, while 35.4% come from lower-middle income countries. 
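
To see why the two charts can tell different stories, here is a minimal sketch on toy data. The country names `A`, `B`, `C` and their income groups are invented for illustration; only the idea (counting countries versus counting respondents) comes from the text.

```python
import pandas as pd

# Hypothetical mapping from country to income group.
income_by_country = {
    "A": "High income",
    "B": "High income",
    "C": "Lower middle income",
}

# One row per respondent; country C has many more respondents.
respondents = pd.Series(["A", "B", "C", "C", "C", "C"], name="country")

# Share of countries per income group (each country counted once).
country_share = pd.Series(income_by_country).value_counts(normalize=True)

# Share of participants per income group (each respondent counted once).
participant_share = respondents.map(income_by_country).value_counts(normalize=True)

print(country_share)
print(participant_share)
```

In this toy case, high income covers two thirds of the countries but only one third of the respondents, which is the kind of shift the second pie chart is meant to reveal.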

In [None]:
high_countries = kaggle_countries_income_groups[kaggle_countries_income_groups['income_group']==4]['Short Name'].values.tolist()
upper_countries = kaggle_countries_income_groups[kaggle_countries_income_groups['income_group']==3]['Short Name'].values.tolist()
lower_countries = kaggle_countries_income_groups[kaggle_countries_income_groups['income_group']==2]['Short Name'].values.tolist()

number_of_participants_from_high = multipleChoiceResponses_2019[multipleChoiceResponses_2019['In which country do you currently reside?'].isin(high_countries)].shape[0]
number_of_participants_from_upper = multipleChoiceResponses_2019[multipleChoiceResponses_2019['In which country do you currently reside?'].isin(upper_countries)].shape[0]
number_of_participants_from_lower = multipleChoiceResponses_2019[multipleChoiceResponses_2019['In which country do you currently reside?'].isin(lower_countries)].shape[0]

values = [number_of_participants_from_high, number_of_participants_from_upper, number_of_participants_from_lower]

fig = go.Figure(data=[go.Pie(labels=labels, values=values, textinfo='label+percent', marker=dict(colors=colors))],
                layout = dict(title='Percentage of Participants According to Their Income Groups'))
fig.show()

<a id="figure_5"></a>
> **Figure_5**: *47% of the participants come from high income countries, 17.4% from upper-middle income countries, and 35.4% from lower-middle income countries.*

Let's see what each income group is made of. [Figure_6](#figure_6) shows the countries that are part of the three groups. The high income group has 30 countries and includes the US, Canada, and the UK. The upper-middle income group has 14 countries and includes Brazil, China, and Russia. The lower-middle income group includes India, Nigeria, and Indonesia.

In [None]:
# Show the word cloud of the 3 Kaggle Groups

high_countries = kaggle_countries_income_groups[kaggle_countries_income_groups['income_group']==4]['Short Name'].values.tolist()
upper_countries = kaggle_countries_income_groups[kaggle_countries_income_groups['income_group']==3]['Short Name'].values.tolist()
lower_countries = kaggle_countries_income_groups[kaggle_countries_income_groups['income_group']==2]['Short Name'].values.tolist()

high_wcl = multipleChoiceResponses_2019[multipleChoiceResponses_2019['In which country do you currently reside?'].isin(high_countries)]['In which country do you currently reside?'].values.tolist()
upper_wcl = multipleChoiceResponses_2019[multipleChoiceResponses_2019['In which country do you currently reside?'].isin(upper_countries)]['In which country do you currently reside?'].values.tolist()
lower_wcl = multipleChoiceResponses_2019[multipleChoiceResponses_2019['In which country do you currently reside?'].isin(lower_countries)]['In which country do you currently reside?'].values.tolist()

high_wcl = [x.replace(' ','_') for x in high_wcl]
upper_wcl = [x.replace(' ','_').replace('Iran,_Islamic_Republic_of','Iran').replace('Russian_Federation','Russia') for x in upper_wcl]
lower_wcl = [x.replace(' ','_') for x in lower_wcl]

high_wordcloud = WordCloud(relative_scaling=.5, collocations=False, colormap='Blues').generate(" ".join(high_wcl))
upper_wordcloud = WordCloud(relative_scaling=.5, collocations=False, colormap='Blues').generate(" ".join(upper_wcl))
lower_wordcloud = WordCloud(relative_scaling=.5, collocations=False, colormap='Blues').generate(" ".join(lower_wcl))

fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=[50, 10])
ax1.imshow(high_wordcloud)
ax1.axis('off')
ax1.set_title('High Income Countries',fontsize=40);
ax2.imshow(upper_wordcloud)
ax2.axis('off')
ax2.set_title('Upper-Middle Income Countries',fontsize=40);
ax3.imshow(lower_wordcloud)
ax3.axis('off')
ax3.set_title('Lower-Middle Income Countries',fontsize=40);



<a id="figure_6"></a>
> **Figure_6**: *The high income group has 30 countries and includes the US, Canada, and the UK. The upper-middle income group has 14 countries and includes Brazil, China, and Russia. The lower-middle income group includes India, Nigeria, and Indonesia.* 

It should be noted that the number of survey participants from each country varies significantly. For instance:
* In the high income group: 
    * The largest country (by number of survey participants) is the US and it makes up 36% of the high income group.
    * The smallest countries (by number of survey participants) are New Zealand and Norway. Each makes up less than 0.5% of this group.
* In the upper-middle income group:
    * The largest country is Brazil, which makes up around 24% of the upper-middle income group.
    * The smallest country is Algeria, which makes up around 2% of the upper-middle income group.
* In the lower-middle income group:
    * The largest country by far is India, making up more than 70% of this group.
    * The smallest country in this group is the Philippines, making up only 1% of this group.

Now that we know more about each group, we can start doing the SWOT analysis by analyzing: Gender, Age, Education, and Jobs. 
![](https://drive.google.com/uc?export=download&id=1smv4Ak5_P4QA1k5_uycCcgQ6FdvkSRru)

In the next section, we will start with the gender SWOT analysis.

# Section_4: Gender
<a id="section_4"></a>
![](https://drive.google.com/uc?export=download&id=1qUHlXZtlJYyjRRPSbdDsmJVQXIZ0Fc-4)

Addressing gender inequality is very important. Men and women face different expectations when it comes to education and work [[9](#ref_9)]. The gender gap refers to the fact that women earn less than men in many areas. In some countries, women are not given the same opportunity to study and work. 

Furthermore, the gender gap slows economic growth and is costing the world economy trillions of dollars [[10](#ref_10)]. According to the IMF, closing the gap would raise the GDP of the US by 5%, Japan by 9%, and Egypt by 34% [[10](#ref_10)]. 

We can clearly see the gap in [Figure_7](#figure_7). The percentage of participants who identify themselves as female in the survey is only 16% compared to 82% for male.



In [None]:
# Split the responses into 3 DataFrames, one per income group, to make the analysis easier.
# .copy() makes each group independent, so adding a column does not modify a view of the original.
high_df = multipleChoiceResponses_2019[multipleChoiceResponses_2019['In which country do you currently reside?'].isin(high_countries)].copy()
upper_df = multipleChoiceResponses_2019[multipleChoiceResponses_2019['In which country do you currently reside?'].isin(upper_countries)].copy()
lower_df = multipleChoiceResponses_2019[multipleChoiceResponses_2019['In which country do you currently reside?'].isin(lower_countries)].copy()
high_df['Income Group'] = 'High Income'
upper_df['Income Group'] = 'Upper-Middle Income'
lower_df['Income Group'] = 'Lower-Middle Income'

all_df = pd.concat([high_df, upper_df, lower_df])

In [None]:
gender_df = all_df.groupby('What is your gender? - Selected Choice').count()['Duration (in seconds)'].rename('What is your gender? - Selected Choice')
colors = ["#083069", "#1F6EB1", "#6BAED6", "#E7F1FA"]  # One color per gender label below

labels = ['Male', 'Female', 'Prefer not to say', 'Prefer to self-describe']
values = gender_df[labels]

fig = go.Figure(data=[go.Pie(labels=labels, values=values, textinfo='label+percent', marker=dict(colors=colors))],
                layout = dict(title='Most Participants Identify as Male'))
fig.show()

<a id="figure_7"></a>
> **Figure_7**: *The gender breakdown of the survey participants. Males make up 82% of the participants, while only around 16% identify themselves as female.* 

In [Figure_8](#figure_8), we look into which income groups have the largest gap. The percentage of female participation is low across all 3 income groups. Females in lower-middle income economies have the highest participation and the smallest gap. The percentage of female participants in lower-middle income countries (17.08%) is slightly higher than in high income countries (16.93%). The gap is the largest in upper-middle income countries, where female participation is only 15.15%.
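
The per-group percentages can also be computed in one step with a cross-tabulation. The following is a minimal sketch on made-up rows; the column names `income_group` and `gender` are shortened stand-ins for the survey columns, and, like the notebook, it restricts the comparison to the Female and Male answers.

```python
import pandas as pd

# Toy data: one row per respondent.
toy = pd.DataFrame({
    "income_group": ["High", "High", "High", "High",
                     "Lower", "Lower", "Lower", "Lower", "Lower"],
    "gender": ["Male", "Male", "Male", "Female",
               "Male", "Male", "Female", "Female", "Prefer not to say"],
})

# Keep only the two categories compared in the chart.
binary = toy[toy["gender"].isin(["Female", "Male"])]

# normalize="index" makes every row (income group) sum to 1.
shares = pd.crosstab(binary["income_group"], binary["gender"], normalize="index")

# Distance of the female share from the 50% equality line.
gap_to_parity = 0.5 - shares["Female"]

print(shares)
print(gap_to_parity)
```

A positive `gap_to_parity` means female respondents are under-represented relative to the equality line drawn in the figure.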

In [None]:
high_gender_groups = high_df.groupby('What is your gender? - Selected Choice').count()['Duration (in seconds)'].rename('What is your gender? - Selected Choice')
upper_gender_groups = upper_df.groupby('What is your gender? - Selected Choice').count()['Duration (in seconds)'].rename('What is your gender? - Selected Choice')
lower_gender_groups = lower_df.groupby('What is your gender? - Selected Choice').count()['Duration (in seconds)'].rename('What is your gender? - Selected Choice')

gender_indices = ['Female', 'Male']
high_gender_groups = high_gender_groups[gender_indices] / sum(high_gender_groups[gender_indices])
upper_gender_groups = upper_gender_groups[gender_indices] / sum(upper_gender_groups[gender_indices])
lower_gender_groups = lower_gender_groups[gender_indices] / sum(lower_gender_groups[gender_indices])

female_X = [high_gender_groups['Female'], upper_gender_groups['Female'], lower_gender_groups['Female']][::-1]
male_X = [high_gender_groups['Male'], upper_gender_groups['Male'], lower_gender_groups['Male']][::-1]
y = ['High Income', 'Upper-Middle Income', 'Lower-Middle Income'][::-1]

bars1 = go.Bar(x=female_X, y=y, text=[str(round(100*x,2))+'%' for x in female_X], textposition='auto', name='Female', orientation='h', marker=dict(color=colors[1]))
bars2 = go.Bar(x=male_X, y=y, text=[str(round(100*x,2))+'%' for x in male_X], name='Male', orientation='h', marker=dict(color=colors[0]))
line = go.Scatter(x=[.5, .5], name="Equality Line", y=[0,1], mode="lines", marker=dict(color="red"), line=dict(dash='dash'))

fig = make_subplots(specs=[[{"secondary_y": True}]], print_grid=False)
fig.add_trace(bars1, 1, 1, secondary_y=False)
fig.add_trace(bars2, 1, 1, secondary_y=False)
fig.add_trace(line, 1, 1, secondary_y=True)
fig.update_layout(title='Percentage of Female across Income Groups', barmode='stack', xaxis=dict(tickformat=".2%"), yaxis2= dict(fixedrange= True, range= [0, 1], visible= False))
fig.show()



<a id="figure_8"></a>
> **Figure_8**: *The percentage of female participants is low across all 3 income groups. Females in lower-middle income economies have the smallest gap. The percentage of female participants in lower-middle income countries (17.08%) is slightly higher than in high income countries (16.93%). The gap is the largest in upper-middle income countries, where female participation is only 15.15%.* 

Let's look at the gap in terms of salary. [Figure_9](#figure_9) shows the gap in terms of average salary across the different income groups. The average salary gap is \$16K, \$9K, and \$10K for high income countries, upper-middle income countries, and lower-middle income countries, respectively.
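
The survey reports compensation as ranges, so computing an average requires one representative number per range. A minimal sketch of one way to derive such a number is shown below, assuming the plain midpoint of each range and the lower bound for the open-ended top range. The helper `bucket_midpoint` is hypothetical; the notebook itself uses a hand-written dictionary with rounded values that may differ slightly from these midpoints.

```python
def bucket_midpoint(label: str) -> float:
    """Turn a salary range label such as '10,000-14,999' into a number.

    Assumption: use the midpoint of the range, and the lower bound for
    an open-ended label such as '> $500,000'.
    """
    cleaned = label.replace("$", "").replace(",", "").strip()
    if cleaned.startswith(">"):
        return float(cleaned[1:].strip())
    low, high = (float(part) for part in cleaned.split("-"))
    return (low + high) / 2


for label in ("$0-999", "10,000-14,999", "> $500,000"):
    print(label, "->", bucket_midpoint(label))
```

Whatever representative values are chosen, the resulting averages are approximations, since the exact salary within each range is unknown.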

In [None]:
num_salaries = {'$0-999': 500, '1,000-1,999': 1500, '2,000-2,999': 2500, '3,000-3,999': 3500, '4,000-4,999': 4500, '5,000-7,499':6500, '7,500-9,999': 8750, 
                   '10,000-14,999':12500, '15,000-19,999':17500, '20,000-24,999': 22500, '25,000-29,999':27500, '30,000-39,999':35000, '40,000-49,999': 45000, 
                   '50,000-59,999': 55000, '60,000-69,999': 65000, '70,000-79,999': 75000, '80,000-89,999': 85000,  '90,000-99,999': 95000, 
                   '100,000-124,999': 112000, '125,000-149,999':137500, '150,000-199,999':175000, '200,000-249,999': 225000, '250,000-299,999': 275000, '300,000-500,000': 400000, '> $500,000': 500000}

high_female_df = high_df[high_df['What is your gender? - Selected Choice']=='Female']
high_male_df = high_df[high_df['What is your gender? - Selected Choice']=='Male']

upper_female_df = upper_df[upper_df['What is your gender? - Selected Choice']=='Female']
upper_male_df = upper_df[upper_df['What is your gender? - Selected Choice']=='Male']

lower_female_df = lower_df[lower_df['What is your gender? - Selected Choice']=='Female']
lower_male_df = lower_df[lower_df['What is your gender? - Selected Choice']=='Male']

high_female_df['num_salaries'] = high_female_df['What is your current yearly compensation (approximate $USD)?'].replace(num_salaries)
high_male_df['num_salaries'] = high_male_df['What is your current yearly compensation (approximate $USD)?'].replace(num_salaries)

upper_female_df['num_salaries'] = upper_female_df['What is your current yearly compensation (approximate $USD)?'].replace(num_salaries)
upper_male_df['num_salaries'] = upper_male_df['What is your current yearly compensation (approximate $USD)?'].replace(num_salaries)

lower_female_df['num_salaries'] = lower_female_df['What is your current yearly compensation (approximate $USD)?'].replace(num_salaries)
lower_male_df['num_salaries'] = lower_male_df['What is your current yearly compensation (approximate $USD)?'].replace(num_salaries)

male_y = high_male_df['num_salaries'].mean(), upper_male_df['num_salaries'].mean(), lower_male_df['num_salaries'].mean()
female_y = high_female_df['num_salaries'].mean(), upper_female_df['num_salaries'].mean(), lower_female_df['num_salaries'].mean()

high_y = high_male_df['num_salaries'].mean(), high_female_df['num_salaries'].mean()
upper_y = upper_male_df['num_salaries'].mean(), upper_female_df['num_salaries'].mean()
lower_y = lower_male_df['num_salaries'].mean(), lower_female_df['num_salaries'].mean()

fig = go.Figure([go.Bar(x=['High Income<br>Salary Gap (dollars): $16K', 'Upper-Middle Income<br>Salary Gap (dollars): $9K', 'Lower-Middle Income<br>Salary Gap (dollars): $10K'], y=male_y, text=['$'+str(round(x,2)) for x in male_y], textposition='auto', name='Male', marker=dict(color=colors[0])),
                 go.Bar(x=['High Income<br>Salary Gap (dollars): $16K', 'Upper-Middle Income<br>Salary Gap (dollars): $9K', 'Lower-Middle Income<br>Salary Gap (dollars): $10K'], y=female_y, text=['$'+str(round(x,2)) for x in female_y], textposition='auto', name='Female', marker=dict(color=colors[1]))
                 ],
               layout = dict(title = 'Salary Gap across Income Groups', xaxis=dict(title = 'Income'), yaxis=dict(title = 'Yearly Salary (Dollars)'),))
fig.show()

<a id="figure_9"></a>
> **Figure_9**: *shows the gap in average salary across the different income groups. In lower-middle income countries, the average salary gap is \$10K. In high income countries, the average salary gap is \$16K. The gap is \$9K for upper-middle income countries.* 

High income countries have the largest gap in terms of average salary, while upper-middle income countries have the smallest. However, if we normalize the gap with respect to the female salary (male average divided by female average, minus one), a different picture emerges. As shown in [Figure_10](#figure_10), after normalization the largest gap is in lower-middle income countries, where it reaches 90% (males earn almost twice as much as females). By comparison, the percentage gap is 23% and 49% for the high and upper-middle income groups, respectively.
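
To make the normalization concrete, here is a minimal sketch with made-up round numbers (they are not survey values). It computes the absolute gap and the relative gap the same way the code cell below does, that is, the male average divided by the female average, minus one. The helper name `salary_gap` is hypothetical.

```python
def salary_gap(male_mean, female_mean):
    """Return (absolute gap in dollars, gap relative to the female mean)."""
    absolute = male_mean - female_mean
    # Same formula as high_y / upper_y / lower_y in the notebook.
    relative = male_mean / female_mean - 1
    return absolute, relative

# Illustrative numbers only, not taken from the survey.
abs_gap, rel_gap = salary_gap(male_mean=20_000, female_mean=10_000)
print(f"absolute gap: ${abs_gap:,.0f}, relative gap: {rel_gap:.0%}")
```

With this definition, a relative gap of 100% means males earn exactly twice as much as females, so a gap of 90% corresponds to males earning 1.9 times the female average.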

In [None]:
num_salaries = {'$0-999': 500, '1,000-1,999': 1500, '2,000-2,999': 2500, '3,000-3,999': 3500, '4,000-4,999': 4500, '5,000-7,499':6500, '7,500-9,999': 8750, 
                   '10,000-14,999':12500, '15,000-19,999':17500, '20,000-24,999': 22500, '25,000-29,999':27500, '30,000-39,999':35000, '40,000-49,999': 45000, 
                   '50,000-59,999': 55000, '60,000-69,999': 65000, '70,000-79,999': 75000, '80,000-89,999': 85000,  '90,000-99,999': 95000, 
                   '100,000-124,999': 112000, '125,000-149,999':137500, '150,000-199,999':175000, '200,000-249,999': 225000, '250,000-299,999': 275000, '300,000-500,000': 400000, '> $500,000': 500000}

# .copy() makes each subset an independent DataFrame, so adding the
# num_salaries column below does not trigger pandas' SettingWithCopyWarning.
high_female_df = high_df[high_df['What is your gender? - Selected Choice']=='Female'].copy()
high_male_df = high_df[high_df['What is your gender? - Selected Choice']=='Male'].copy()

upper_female_df = upper_df[upper_df['What is your gender? - Selected Choice']=='Female'].copy()
upper_male_df = upper_df[upper_df['What is your gender? - Selected Choice']=='Male'].copy()

lower_female_df = lower_df[lower_df['What is your gender? - Selected Choice']=='Female'].copy()
lower_male_df = lower_df[lower_df['What is your gender? - Selected Choice']=='Male'].copy()

high_female_df['num_salaries'] = high_female_df['What is your current yearly compensation (approximate $USD)?'].replace(num_salaries)
high_male_df['num_salaries'] = high_male_df['What is your current yearly compensation (approximate $USD)?'].replace(num_salaries)

upper_female_df['num_salaries'] = upper_female_df['What is your current yearly compensation (approximate $USD)?'].replace(num_salaries)
upper_male_df['num_salaries'] = upper_male_df['What is your current yearly compensation (approximate $USD)?'].replace(num_salaries)

lower_female_df['num_salaries'] = lower_female_df['What is your current yearly compensation (approximate $USD)?'].replace(num_salaries)
lower_male_df['num_salaries'] = lower_male_df['What is your current yearly compensation (approximate $USD)?'].replace(num_salaries)

male_y = high_male_df['num_salaries'].mean(), upper_male_df['num_salaries'].mean(), lower_male_df['num_salaries'].mean()
female_y = high_female_df['num_salaries'].mean(), upper_female_df['num_salaries'].mean(), lower_female_df['num_salaries'].mean()

high_y = [(high_male_df['num_salaries'].mean() / high_female_df['num_salaries'].mean())-1]
upper_y = [(upper_male_df['num_salaries'].mean() / upper_female_df['num_salaries'].mean())-1]
lower_y = [(lower_male_df['num_salaries'].mean() / lower_female_df['num_salaries'].mean())-1]

fig = go.Figure([go.Bar(x=['High Income'], y=high_y, text=[str(round(100*x,2))+'%' for x in high_y], textposition='auto', name='High Income', marker=dict(color=colors[0])),
                 go.Bar(x=['Upper-Middle Income'], y=upper_y, text=[str(round(100*x,2))+'%' for x in upper_y], textposition='auto', name='Upper-Middle Income', marker=dict(color=colors[1])),
                 go.Bar(x=['Lower-Middle Income'], y=lower_y, text=[str(round(100*x,2))+'%' for x in lower_y], textposition='auto', name='Lower-Middle Income', marker=dict(color=colors[2]))
                 ],
               layout = dict(title = 'Salary Gap Percentage', xaxis=dict(title = 'Income Group'), yaxis=dict(tickformat=".2%", title = 'Percentage'),))
fig.show()

<a id="figure_10"></a>
> **Figure_10**: *the largest gap is in lower-middle income countries, where it reaches 90% (males earn almost twice as much as females). By comparison, the salary percentage gap is 23% and 49% for the high and upper-middle income groups, respectively.* 

Someone may argue that the gap exists because males are more qualified than their female counterparts. However, the survey data do not support this. As shown in [Figure_11](#figure_11), females are more likely to hold a masters degree or a doctoral degree in all three income groups. In lower-middle income countries, where males earn almost double, the share of females holding a doctoral degree is 6 percentage points higher than the share of males. 

In [None]:
# Female Education Gap
gender_q = 'What is your gender? - Selected Choice'
education_q = 'What is the highest level of formal education that you have attained or plan to attain within the next 2 years?'
education_indices = ['Master’s degree', 'Doctoral degree']

high_female_total = high_df[high_df[gender_q]=='Female']
high_male_total = high_df[high_df[gender_q]=='Male']

upper_female_total = upper_df[upper_df[gender_q]=='Female']
upper_male_total = upper_df[upper_df[gender_q]=='Male']

lower_female_total = lower_df[lower_df[gender_q]=='Female']
lower_male_total = lower_df[lower_df[gender_q]=='Male']

high_female_percentages, high_male_percentages, upper_female_percentages, upper_male_percentages, lower_female_percentages, lower_male_percentages = [], [], [], [], [], []
for education in education_indices:
    high_female_percentages.append(high_female_total[high_female_total[education_q]==education].shape[0]/high_female_total.shape[0])
    high_male_percentages.append(high_male_total[high_male_total[education_q]==education].shape[0]/high_male_total.shape[0])
    
    upper_female_percentages.append(upper_female_total[upper_female_total[education_q]==education].shape[0]/upper_female_total.shape[0])
    upper_male_percentages.append(upper_male_total[upper_male_total[education_q]==education].shape[0]/upper_male_total.shape[0])
    
    lower_female_percentages.append(lower_female_total[lower_female_total[education_q]==education].shape[0]/lower_female_total.shape[0])
    lower_male_percentages.append(lower_male_total[lower_male_total[education_q]==education].shape[0]/lower_male_total.shape[0])

high_female_diff = np.array(high_female_percentages) - np.array(high_male_percentages)
upper_female_diff = np.array(upper_female_percentages) - np.array(upper_male_percentages)
lower_female_diff = np.array(lower_female_percentages) - np.array(lower_male_percentages)

fig = go.Figure([go.Bar(x=education_indices, y=high_female_diff, marker_color=colors[0], name='High Income', text=[str(round(100*x,2))+'%' for x in high_female_diff], textposition='auto'),
                 go.Bar(x=education_indices, y=upper_female_diff, marker_color=colors[1], name='Upper-Middle Income', text=[str(round(100*x,2))+'%' for x in upper_female_diff], textposition='auto'),
                 go.Bar(x=education_indices, y=lower_female_diff, marker_color=colors[2], name='Lower-Middle Income', text=[str(round(100*x,2))+'%' for x in lower_female_diff], textposition='auto'),
                 ],
               layout = dict(title = 'Difference Between Female and Male Percentages for Masters and Doctoral', xaxis=dict(title = 'Degree'), yaxis=dict(tickformat=".2%", title = 'Difference Percentage'),))
fig.show()



<a id="figure_11"></a>
> **Figure_11**: *females are more likely to hold a masters degree or a doctoral degree in all three income groups. In lower-middle income countries, where males earn almost double, the share of females holding a doctoral degree is 6 percentage points higher than the share of males.*
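
The quantity plotted above is a difference between two shares, so it is measured in percentage points. The sketch below shows the same calculation on a tiny toy table. The column names `gender` and `degree` and the helper `degree_share_diff` are hypothetical stand-ins; the survey itself uses the full question text as column names.

```python
import pandas as pd

# Toy data, not survey responses.
toy = pd.DataFrame({
    "gender": ["Female", "Female", "Female", "Female",
               "Male", "Male", "Male", "Male"],
    "degree": ["Master’s degree", "Doctoral degree", "Master’s degree", "Bachelor’s degree",
               "Master’s degree", "Bachelor’s degree", "Bachelor’s degree", "Bachelor’s degree"],
})

def degree_share_diff(df, degree):
    """Share of females holding `degree` minus the share of males, in percentage points."""
    # The mean of a boolean Series is the fraction of True values per group.
    shares = (df["degree"] == degree).groupby(df["gender"]).mean()
    return 100 * (shares["Female"] - shares["Male"])

masters_diff = degree_share_diff(toy, "Master’s degree")
doctoral_diff = degree_share_diff(toy, "Doctoral degree")
print(masters_diff, doctoral_diff)
```

A positive value means the degree is more common among female respondents than among male respondents in that group.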

Based on the discussion in this section, we can now create the SWOT matrix. [Figure_12](#figure_12) shows the SWOT matrix for all three income groups. All groups suffer from a large gender gap (weakness). This could lead to gender bias because the female point of view is not being taken into consideration (threat). Also, across all three groups, women are more likely to have an advanced degree such as a masters or a PhD (strength). This presents an opportunity: initiatives to close the salary gap are likely to lead to more female graduates and hires.

In [None]:
values = [['<b>STRENGTHS</b>', '<b>WEAKNESSES</b>', '<b>OPPORTUNITIES</b>', '<b>THREATS</b>'], #1st col
[
# high_strengths
"<b>1. Females are more likely<br>\
to have an advanced<br>\
degree (masters and PhDs.)</b>", 
# high_weaknesses
"<b>1. Large Gender Gap</b>", 
# high_opportunities
"<b>1. Females are more likely<br>\
to have postgraduate degree.<br>\
Initiatives to close the <br>\
salary gap are likely<br>\
to lead to more<br>\
female graduates and hires.</b>", 
# high_threats
"<b>1. Gender Bias issues are<br>\
possible because most people<br>\
doing data science are<br>\
male.</b>"
],
[
# Upper_strengths
"<b>1. Females are more likely<br>\
to have an advanced degree<br>\
(masters and PhDs.)</b>", 
# Upper_weaknesses
"<b>1. Large Gender Gap<br><br>\
2. Has Largest gender gap<br>\
when measured by survey<br>\
participation.</b>", 
# Upper_opportunities
"<b>1. Females are more likely<br>\
to have postgraduate degree.<br>\
Initiatives to close the<br>\
salary gap are likely<br>\
to lead to more female<br>\
graduates and hires.</b>", 
# Upper_threats
"<b>1. Gender Bias issues are<br>\
possible because most people<br>\
doing data science are<br>\
male.</b>"
],
# Lower_strengths
[
"<b>1. Females are more likely<br>\
to have an advanced degree<br>\
(masters and PhDs.)</b>", 
# Lower_weaknesses
"<b>1. Large Gender Gap<br><br>\
2. Has Largest gender gap <br>\
when measured by salary<br>\
(males paid 2x females).</b>", 
# Lower_opportunities
"<b>1. Females are more likely<br>\
to have postgraduate degree.<br>\
Initiatives to close the<br>\
salary gap are likely<br>\
to lead to more<br>\
female graduates and hires.</b>", 
# lower_threats
"<b>1. Gender Bias issues are<br>\
possible because most people<br>\
doing data science are<br>\
male.</b>"],
]

fig = go.Figure(data=[go.Table(columnorder = [1,2,3,4], columnwidth = [13,29,29,29],
                               header = dict(values = [[''], ['<b>HIGH INCOME</b>'], ['<b>UPPER-MIDDLE INCOME</b>'], ['<b>LOWER-MIDDLE INCOME</b>']],
                                             line_color='darkslategray', fill_color=colors[0], font=dict(color='white', size=12)),
                               cells=dict(values=values, line_color='darkslategray', fill=dict(color=[['#00AF50', '#C55B11', '#0071C1', '#FE0000'], 'white']), align=['left', 'left', 'left', 'left'], font_color = ['white','black'], font_size=10)
                              )
                     ],
                layout = dict(margin=go.layout.Margin(l=0, r=0, b=0, t=0), height=500)
               )
fig.show()

<a id="figure_12"></a>
> **Figure_12**: *shows the SWOT matrix for all three income groups based on our gender discussion.*

Now that we have finished talking about gender, let's move to the next section and discuss the age data.

# Section_5: Age
<a id="section_5"></a>
![](https://drive.google.com/uc?export=download&id=1A7TkbctR6UamNVN5hk3vTTTvveg1I8tt)

Economic growth is tightly related to population demographics [[11](#ref_11)]. One of the main reasons economic growth has been slowing down is lower productivity gains (fewer new people entering the workforce). This trend is most visible in high income countries, but it has recently started to appear in middle income groups as well, at a lower rate [[12](#ref_12), [13](#ref_13)]. 

Let's explore the age of the participants in the survey. [Figure_13](#figure_13) shows the distribution of age for all 3 income groups. We can clearly see that lower-middle income countries have a younger population: more than 50% of Kagglers there are younger than 24 years old. In both high income and upper-middle income countries, Kagglers are on average older. More than 55% of the Kagglers in upper-middle income countries are younger than 30 years old. Kagglers in high income countries are the oldest, with 62% aged 30 and above.

In [None]:
high_age_groups = high_df.groupby("What is your age (# years)?").count()["Duration (in seconds)"]
high_age_groups /= sum(high_age_groups)

upper_age_groups = upper_df.groupby("What is your age (# years)?").count()["Duration (in seconds)"]
upper_age_groups /= sum(upper_age_groups)

lower_age_groups = lower_df.groupby("What is your age (# years)?").count()["Duration (in seconds)"]
lower_age_groups /= sum(lower_age_groups)

fig = go.Figure()
fig.add_trace(go.Scatter(y=high_age_groups.values.tolist(), x=high_age_groups.index.tolist(), line_color = colors[0], name="High Income", mode='lines+markers', marker=dict(size=len(high_age_groups.index.tolist())*[18]), line=dict(width=6, shape='spline', smoothing=1) ))
fig.add_trace(go.Scatter(y=upper_age_groups.values.tolist(), x=upper_age_groups.index.tolist(), line_color=colors[1], name="Upper-Middle Income", mode='lines+markers', marker=dict(size=len(upper_age_groups.index.tolist())*[18]), line=dict(width=6, shape='spline', smoothing=1)))
fig.add_trace(go.Scatter(y=lower_age_groups.values.tolist(), x=lower_age_groups.index.tolist(), line_color=colors[2], name="Lower-Middle Income", mode='lines+markers', marker=dict(size=len(lower_age_groups.index.tolist())*[18]), line=dict(width=6, shape='spline', smoothing=1)))

fig.update_layout(xaxis=dict(title = 'Age Groups (years)'), yaxis=dict(tickformat=".2%", title = 'Percentage'), barmode='overlay', title='Distribution of Age Group for the 3 Income Groups')
fig.show()


<a id="figure_13"></a>
> **Figure_13**: *shows the distribution of age for all 3 income groups. A large percentage of Kagglers in lower-middle income countries are younger than 24 years old. In high income countries, most Kagglers are 30 years old or older.*
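
As a side note, the share of respondents per age bracket can also be computed in one step with `value_counts(normalize=True)`. The sketch below uses a handful of invented answers rather than the survey data, and it is only an alternative to the `groupby(...).count()` approach used in the cell above, under the assumption that every row has an age answer.

```python
import pandas as pd

# Hypothetical answers to the age question (the survey reports age as brackets).
ages = pd.Series(["18-21", "22-24", "22-24", "25-29", "30-34"], name="age")

# normalize=True returns shares that sum to 1;
# sort_index keeps the brackets in ascending order for plotting.
age_share = ages.value_counts(normalize=True).sort_index()
print(age_share)
```

The resulting Series can be passed to the same `go.Scatter` calls through `age_share.index.tolist()` and `age_share.values.tolist()`.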

Now that we know the distribution of Kagglers by age, let's check how much money people make based on their age. [Figure_14](#figure_14) shows the average yearly compensation. We can clearly see that as a Kaggler ages, their salary grows significantly in high income countries, from \$35K to \$120K. In middle income groups, the salary grows at a slower rate than in high income countries. The yearly salary grows from around \$15K at age 22 to around \$42K at age 60 in upper-middle income countries. In lower-middle income countries, the salary peaks for people between 45 and 50 years old. Unlike high income countries, the salaries in middle income countries shrink after 60. This is likely because the retirement age in many middle income countries is lower than in high income countries. For instance, the retirement age in India is 60 [[14](#ref_14)]. In Canada, it is 65.

In [None]:
# Age & Salary

num_salaries = {'$0-999': 500, '1,000-1,999': 1500, '2,000-2,999': 2500, '3,000-3,999': 3500, '4,000-4,999': 4500, '5,000-7,499':6500, '7,500-9,999': 8750, 
                   '10,000-14,999':12500, '15,000-19,999':17500, '20,000-24,999': 22500, '25,000-29,999':27500, '30,000-39,999':35000, '40,000-49,999': 45000, 
                   '50,000-59,999': 55000, '60,000-69,999': 65000, '70,000-79,999': 75000, '80,000-89,999': 85000,  '90,000-99,999': 95000, 
                   '100,000-124,999': 112000, '125,000-149,999':137500, '150,000-199,999':175000, '200,000-249,999': 225000, '250,000-299,999': 275000, '300,000-500,000': 400000, '> $500,000': 500000}

# Salary Q = 'What is your current yearly compensation (approximate $USD)?'

high_df['num_salaries'] = high_df['What is your current yearly compensation (approximate $USD)?'].replace(num_salaries)
upper_df['num_salaries'] = upper_df['What is your current yearly compensation (approximate $USD)?'].replace(num_salaries)
lower_df['num_salaries'] = lower_df['What is your current yearly compensation (approximate $USD)?'].replace(num_salaries)

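# Select the numeric column before averaging: recent pandas versions raise an
# error when .mean() is applied to the text columns of the whole DataFrame.
high_age_groups = high_df.groupby("What is your age (# years)?")["num_salaries"].mean()
upper_age_groups = upper_df.groupby("What is your age (# years)?")["num_salaries"].mean()
lower_age_groups = lower_df.groupby("What is your age (# years)?")["num_salaries"].mean()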

fig = go.Figure()
fig.add_trace(go.Scatter(y=high_age_groups.values.tolist(), x=high_age_groups.index.tolist(), line_color = colors[0], name="High Income", mode='lines+markers', marker=dict(size=len(high_age_groups.index.tolist())*[18]), line=dict(width=6, shape='spline', smoothing=1) ))
fig.add_trace(go.Scatter(y=upper_age_groups.values.tolist(), x=upper_age_groups.index.tolist(), line_color=colors[1], name="Upper-Middle Income", mode='lines+markers', marker=dict(size=len(upper_age_groups.index.tolist())*[18]), line=dict(width=6, shape='spline', smoothing=1)))
fig.add_trace(go.Scatter(y=lower_age_groups.values.tolist(), x=lower_age_groups.index.tolist(), line_color=colors[2], name="Lower-Middle Income", mode='lines+markers', marker=dict(size=len(lower_age_groups.index.tolist())*[18]), line=dict(width=6, shape='spline', smoothing=1)))

fig.update_layout(xaxis=dict(title = 'Age Groups (years)'), yaxis=dict(title = 'Salary ($)'), barmode='overlay', title='Salaries Growth by Age Group')
fig.show()

<a id="figure_14"></a>
> **Figure_14**: *shows the average yearly compensation. We can clearly see that as a Kaggler ages, their salary grows significantly in high income countries, from \$35K to \$120K. In middle income countries, the salary grows at a slower rate than in high income countries.*
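
The averages above depend on how each salary bracket is turned into a single number. The notebook does this with a hand-written dictionary; the sketch below derives the value from the bracket label instead. The helper name `bracket_midpoint` is hypothetical, and treating the open-ended top bracket as its lower bound is an assumption that mirrors the dictionary.

```python
def bracket_midpoint(label):
    """Convert a salary bracket such as '1,000-1,999' to a single dollar amount."""
    cleaned = label.replace("$", "").replace(",", "").strip()
    if cleaned.startswith(">"):
        # Open-ended top bracket: fall back to its lower bound, as the notebook does.
        return int(cleaned.lstrip("> "))
    low, high = (int(part) for part in cleaned.split("-"))
    # Brackets are inclusive (1,000 up to 1,999), so adding 1 before halving
    # gives a round midpoint; integer division keeps whole dollars.
    return (low + high + 1) // 2

for label in ["$0-999", "1,000-1,999", "5,000-7,499", "100,000-124,999", "> $500,000"]:
    print(label, "->", bracket_midpoint(label))
```

Computed this way, the brackets `5,000-7,499` and `100,000-124,999` map to 6250 and 112500, whereas the dictionary uses 6500 and 112000. The difference is small, but it is worth keeping in mind when reading the averages.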

We can now update our SWOT matrix with the age analysis. As shown in [Figure_15](#figure_15), the main strength of the high income group is that salaries are the highest and keep growing with age. Unfortunately, the main weakness is the low percentage of younger folks, which could lead to expertise shortages in the future (possible threat). Both middle income groups have a large percentage of young Kagglers (strength). The main weakness of these income groups is that they have lower salaries and slower salary growth. However, it should be noted that in this notebook we are not taking taxes and cost of living into consideration. 

In [None]:
values = [['<b>STRENGTHS</b>', '<b>WEAKNESSES</b>', '<b>OPPORTUNITIES</b>', '<b>THREATS</b>'], #1st col
[
# high_strengths
"1. Females are more likely<br>\
to have an advanced<br>\
degree (masters and PhDs.)<br><br>\
<b>2. Great Salaries that grow<br>\
significantly.</b>", 
# high_weaknesses
"1. Large Gender Gap<br><br>\
<b>2. Low Number of young<br>\
Kagglers.</b>", 
# high_opportunities
"1. Females are more likely<br>\
to have postgraduate degree.<br>\
Initiatives to close the <br>\
salary gap are likely<br>\
to lead to more<br>\
female graduates and hires.", 
# high_threats
"1. Gender Bias issues are<br>\
possible because most people<br>\
doing data science are<br>\
male.<br><br>\
<b> 2. The percentage of younger<br>\
Kagglers is much lower<br>\
than middle income categories.<br>\
This could lead to <br>\
expertise shortage in the<br>\
future. </b>"
],
[
# Upper_strengths
"1. Females are more likely<br>\
to have an advanced degree<br>\
(masters and PhDs.)<br><br>\
<b>2. Youthful Population of<br>\
Kagglers</b>", 
# Upper_weaknesses
"1. Large Gender Gap<br><br>\
2. Has Largest gender gap<br>\
when measured by survey<br>\
participation.<br><br>\
<b>3. Salaries are relatively low<br>\
and they don't grow<br>\
much. </b>", 
# Upper_opportunities
"1. Females are more likely<br>\
to have postgraduate degree.<br>\
Initiatives to close the<br>\
salary gap are likely<br>\
to lead to more<br>\
female graduates and hires.<br><br>\
<b>2. Decent supply of young<br>\
expertise to fill future<br>\
jobs.</b>", 
# Upper_threats
"1. Gender Bias issues are<br>\
possible because most people<br>\
doing data science are<br>\
male."
],
# Lower_strengths
[
"1. Females are more likely<br>\
to have an advanced degree<br>\
(masters and PhDs.)<br><br>\
<b>2. Tend to have the<br>\
youngest Kagglers.</b>", 
# Lower_weaknesses
"1. Large Gender Gap<br><br>\
2. Has Largest gender gap<br>\
when measured by salary<br>\
(males paid 2x females).<br><br>\
<b>3. Salaries are relatively low<br>\
and they don't grow<br>\
much. </b>", 
# Lower_opportunities
"1. Females are more likely<br>\
to have postgraduate degree.<br>\
Initiatives to close the<br>\
salary gap are likely<br>\
to lead to more female<br>\
graduates and hires.<br><br>\
<b>2. Large supply of young<br>\
expertise to fill<br>\
future jobs.</b>", 
# lower_threats
"1. Gender Bias issues are<br>\
possible because most people<br>\
doing data science are<br>\
male."],
]

fig = go.Figure(data=[go.Table(columnorder = [1,2,3,4], columnwidth = [10,30,30,30],
                               header = dict(values = [[''], ['<b>HIGH INCOME</b>'], ['<b>UPPER-MIDDLE INCOME</b>'], ['<b>LOWER-MIDDLE INCOME</b>']],
                                             line_color='darkslategray', fill_color=colors[0], font=dict(color='white', size=12)),
                               cells=dict(values=values, line_color='darkslategray', fill=dict(color=[['#00AF50', '#C55B11', '#0071C1', '#FE0000'], 'white']), align=['left', 'left', 'left', 'left'], font_color = ['white','black'], font_size=10)
                              )
                     ],
                layout = dict(margin=go.layout.Margin(l=0, r=0, b=0, t=0), height=700)# You can update the height like this.
               )
fig.show()

<a id="figure_15"></a>
> **Figure_15**: *shows the updated SWOT matrix with age discussion (in bold)*
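
The table cells above are written by hand, with manual numbering, `<br>` line breaks, and `<b>` tags around the newest items. As a rough sketch of how this could be automated, the hypothetical helper `swot_cell` below numbers the items, wraps long lines, and bolds only the items added in the current section. It is an illustration under those assumptions and not the code that produced the figures.

```python
import textwrap

def swot_cell(previous_items, new_items=(), width=28):
    """Build one table cell: numbered items, <br> line breaks, new items in bold."""
    parts = []
    all_items = list(previous_items) + list(new_items)
    for number, item in enumerate(all_items, start=1):
        # Wrap each numbered item and join the wrapped lines with HTML breaks.
        wrapped = "<br>".join(textwrap.wrap(f"{number}. {item}", width=width))
        if number > len(previous_items):
            # Items introduced in the current section are shown in bold.
            wrapped = f"<b>{wrapped}</b>"
        parts.append(wrapped)
    # A blank line separates consecutive items inside the cell.
    return "<br><br>".join(parts)

cell = swot_cell(["Large Gender Gap"], ["Low number of young Kagglers."])
print(cell)
```

Because the numbers come from `enumerate`, adding or removing an item cannot leave two entries with the same number.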

This is it for the age analysis. In the next section, we will perform SWOT analysis on the education data.

# Section_6: Education
<a id="section_6"></a>
![](https://drive.google.com/uc?export=download&id=1U1Z0FcmsazvKDctIGJnZOJEB-jmnBZKr)

Education is directly tied to economic performance. Industries with a high barrier to entry tend to pay workers more [[15](#ref_15)]. In any ecosystem, the knowledge and skills that workers have drive economic growth. In this section, we will do the following:
* We will discuss the distribution of formal education across all income groups.
* We will study the relationship between the salaries and formal education. 

The distribution of formal education for all income groups is shown in [Figure_16](#figure_16). The figure only shows the formal education of people who work as a Data Scientist. Clearly, a masters degree is strongly associated with being employed as a data scientist. In high and upper-middle income economies, more than half of the data scientists have a masters degree. For middle income countries, the percentage of data scientists who have a doctoral degree is lower than the percentage who have a bachelor degree. In fact, in lower-middle income countries, only 6% of the data scientists have a doctoral degree compared to 26% in high income countries. More than 80% of data scientists in high income countries have either a masters degree or a doctoral degree.

In [None]:
# Formal Education
title_q = 'Select the title most similar to your current role (or most recent title if retired): - Selected Choice'

high_education_groups = high_df[high_df[title_q] == 'Data Scientist'].groupby('What is the highest level of formal education that you have attained or plan to attain within the next 2 years?').count()['Duration (in seconds)'].rename('What is the highest level of formal education that you have attained or plan to attain within the next 2 years?')
upper_education_groups = upper_df[upper_df[title_q] == 'Data Scientist'].groupby('What is the highest level of formal education that you have attained or plan to attain within the next 2 years?').count()['Duration (in seconds)'].rename('What is the highest level of formal education that you have attained or plan to attain within the next 2 years?')
lower_education_groups = lower_df[lower_df[title_q] == 'Data Scientist'].groupby('What is the highest level of formal education that you have attained or plan to attain within the next 2 years?').count()['Duration (in seconds)'].rename('What is the highest level of formal education that you have attained or plan to attain within the next 2 years?')

high_education_groups = high_education_groups/sum(high_education_groups)
upper_education_groups = upper_education_groups/sum(upper_education_groups)
lower_education_groups = lower_education_groups/sum(lower_education_groups)

education_indices = ['No formal education past high school', 'Professional degree', 'Some college/university study without earning a bachelor’s degree', 'Bachelor’s degree', 'Master’s degree', 'Doctoral degree', 'I prefer not to answer']
high_education_counts = high_education_groups[education_indices].values.tolist()
upper_education_counts = upper_education_groups[education_indices].values.tolist()
lower_education_counts = lower_education_groups[education_indices].values.tolist()

education_indices[0] = 'High School'
education_indices[2] = 'Attended University but No Bachelor'
education_indices[3] = 'Bachelor'
education_indices[4] = 'Masters'
education_indices[5] = 'Doctoral'
education_indices[6] = 'No Answer'

fig = go.Figure([go.Bar(x=education_indices, y=high_education_counts, marker_color=colors[0], name='High Income', text=[str(round(100*x,2))+'%' for x in high_education_counts], textposition='auto'),
                 go.Bar(x=education_indices, y=upper_education_counts, marker_color=colors[1], name='Upper-Middle Income', text=[str(round(100*x,2))+'%' for x in upper_education_counts], textposition='auto'),
                 go.Bar(x=education_indices, y=lower_education_counts, marker_color=colors[2], name='Lower-Middle Income', text=[str(round(100*x,2))+'%' for x in lower_education_counts], textposition='auto'),
                ],
               layout = dict(title = 'Distribution of the Formal Education', xaxis=dict(title = 'Formal Education'), yaxis=dict(tickformat=".2%", title = 'Percentage'),))
fig.show()



<a id="figure_16"></a>
> **Figure_16**: *The figure only shows the formal education of people who work as a Data Scientist. In high and upper-middle income economies, more than half of the data scientists have a masters degree.*
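
The three per-group distributions in the figure can also be produced in a single table with `pd.crosstab`. The sketch below uses a small invented DataFrame; the short column names `income_group`, `title`, and `degree` are hypothetical and stand in for the income grouping and the long survey question columns.

```python
import pandas as pd

# Toy data, not survey responses.
toy = pd.DataFrame({
    "income_group": ["High", "High", "High", "High", "Lower-Middle", "Lower-Middle"],
    "title": ["Data Scientist", "Data Scientist", "Data Analyst",
              "Data Scientist", "Data Scientist", "Data Scientist"],
    "degree": ["Masters", "Doctoral", "Bachelor", "Masters", "Bachelor", "Masters"],
})

# Keep only data scientists, as the figure does.
data_scientists = toy[toy["title"] == "Data Scientist"]

# normalize="columns" makes each income group's column sum to 1.
degree_share = pd.crosstab(
    data_scientists["degree"], data_scientists["income_group"], normalize="columns"
)
print(degree_share)
```

A degree that never occurs in one group still appears as a row with a share of 0, which avoids a missing-key error when the same list of degrees is used for every group.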

What about salaries? Well, as we saw before, salaries are much higher in high income countries. As shown in [Figure_17](#figure_17), data scientists earn on average more than \$100K a year in high income countries. In upper-middle income countries, data scientists who have a masters or doctoral degree earn higher salaries than those with only a bachelor degree. In fact, in upper-middle income countries, data scientists who have a doctoral degree can expect to be paid around 50% more than those who only have a bachelor degree. Unfortunately, this trend does not hold in lower-middle income countries, where data scientists who have a doctoral degree reported a lower average yearly salary than those who have a bachelor degree. In lower-middle income economies, the average annual salary of people with a masters degree is higher than that of people with either a bachelor degree or a doctoral degree. In other words, in a lower-middle income economy, the degree that is likely to yield the highest salary is a masters degree.

In [None]:
# Salaries & Degrees

title_q = 'Select the title most similar to your current role (or most recent title if retired): - Selected Choice'

# Select the numeric column before averaging: recent pandas versions raise an
# error when .mean() is applied to the text columns of the whole DataFrame.
high_education_groups = high_df[high_df[title_q] == 'Data Scientist'].groupby('What is the highest level of formal education that you have attained or plan to attain within the next 2 years?')['num_salaries'].mean()
upper_education_groups = upper_df[upper_df[title_q] == 'Data Scientist'].groupby('What is the highest level of formal education that you have attained or plan to attain within the next 2 years?')['num_salaries'].mean()
lower_education_groups = lower_df[lower_df[title_q] == 'Data Scientist'].groupby('What is the highest level of formal education that you have attained or plan to attain within the next 2 years?')['num_salaries'].mean()

education_indices = ['Bachelor’s degree', 'Master’s degree', 'Doctoral degree']
high_education_counts = high_education_groups[education_indices].values.tolist()
upper_education_counts = upper_education_groups[education_indices].values.tolist()
lower_education_counts = lower_education_groups[education_indices].values.tolist()

education_indices[0] = 'Bachelor'
education_indices[1] = 'Masters'
education_indices[2] = 'Doctoral'

fig = go.Figure([go.Bar(x=education_indices, y=high_education_counts, marker_color=colors[0], name='High Income', text=['$'+str(round(x,2)) for x in high_education_counts], textposition='auto'),
                 go.Bar(x=education_indices, y=upper_education_counts, marker_color=colors[1], name='Upper-Middle Income', text=['$'+str(round(x,2)) for x in upper_education_counts], textposition='auto'),
                 go.Bar(x=education_indices, y=lower_education_counts, marker_color=colors[2], name='Lower-Middle Income', text=['$'+str(round(x,2)) for x in lower_education_counts], textposition='auto'),
                ],
               layout = dict(title = 'Salaries of Data Scientists based on Education', xaxis=dict(title = 'Formal Education'), yaxis=dict(title = 'Yearly Salary'),))
fig.show()


<a id="figure_17"></a>
> **Figure_17**: *In high income countries, data scientists earn on average more than \$100K a year. In upper-middle income countries, data scientists who have a masters or doctoral degree earn higher salaries than those with a bachelor degree. In lower-middle income countries, data scientists who have a doctoral degree reported a lower average yearly salary than those who have a bachelor degree.*

Let's update our SWOT matrix with the education analysis. 


In [None]:
values = [['<b>STRENGTHS</b>', '<b>WEAKNESSES</b>', '<b>OPPORTUNITIES</b>', '<b>THREATS</b>'], #1st col
[
# high_strengths
"1. Females are more likely to<br>\
have an advanced degree<br>\
(masters and PhDs.)<br><br>\
2. Great Salaries that grow<br>\
significantly.<br><br>\
<b>3. A strong set of qualifications.<br>\
80% of data scientists have either<br>\
masters or PhD.</b>", 
# high_weaknesses
"1. Large Gender Gap<br><br>\
2. Low Number of young Kagglers.", 
# high_opportunities
"1. Females are more likely<br>\
to have postgraduate degree.<br>\
Initiatives to close the salary<br>\
gap are likely to<br>\
lead to more female<br>\
graduates and hires.<br><br>\
<b>2. Earning a PhD can<br>\
lead to 6 figures<br>\
salary</b>", 
# high_threats
"1. Gender Bias issues are<br>\
possible because most people<br>\
doing data science are male.<br><br>\
2. The percentage of younger<br>\
Kagglers is much lower<br>\
than middle income <br>\
categories. This could lead<br>\
to expertise shortage in the<br>\
future.<br><br>"
],
[
# Upper_strengths
"1. Females are more likely<br>\
to have an advanced<br>\
degree (masters <br>\
and PhDs.)<br><br>\
2. Youthful Population of Kagglers<br><br>\
<b>3. Two thirds of data<br>\
scientists have a<br>\
masters or PhD.</b>", 
# Upper_weaknesses
"1. Large Gender Gap<br><br>\
2. Has Largest gender gap<br>\
when measured by survey<br>\
participation.<br><br>\
3. Salaries are relatively<br>\
low and they don't<br>\
grow much. ", 
# Upper_opportunities
"1. Females are more likely<br>\
to have postgraduate degree.<br>\
Initiatives to close the<br>\
salary gap are likely<br>\
to lead to more<br>\
female graduates and hires.<br><br>\
2. Decent supply of young<br>\
expertise to fill future<br>\
jobs.<br><br>\
<b>3. You can significantly increase<br>\
your salary by having<br>\
a masters or PhD. </b>", 
# Upper_threats
"1. Gender Bias issues are<br>\
possible because most people<br>\
doing data science are male."
],
# Lower_strengths
[
"1. Females are more likely<br>\
to have an advanced<br>\
degree (masters and PhDs.)<br><br>\
2. Tend to have the<br>\
youngest Kagglers.", 
# Lower_weaknesses
"1. Large Gender Gap<br><br>\
2. Has the largest gender gap<br>\
when measured by salary<br>\
(males paid 2x females).<br><br>\
3. Salaries are relatively low<br>\
and they don't grow<br>\
much.<br><br>\
<b>4. Having a doctoral degree<br>\
doesn't lead to higher<br>\
salary.</b>", 
# Lower_opportunities
"1. Females are more likely<br>\
to have postgraduate degree.<br>\
Initiatives to close the<br>\
salary gap are likely<br>\
to lead to more<br>\
female graduates and hires.<br><br>\
2. Large supply of young<br>\
expertise to fill future<br>\
jobs.<br><br>\
<b>3. Earning a masters degree<br>\
can increase your salary.</b>", 
# lower_threats
"1. Gender Bias issues are<br>\
possible because most people<br>\
doing data science are<br>\
male.<br><br>\
<b>2. Low Salaries for data<br>\
scientists with doctoral degrees<br>\
can discourage them from<br>\
working as data scientists. </b>"],
]

fig = go.Figure(data=[go.Table(columnorder = [1,2,3,4], columnwidth = [10,30,30,30],
                               header = dict(values = [[''], ['<b>HIGH INCOME</b>'], ['<b>UPPER-MIDDLE INCOME</b>'], ['<b>LOWER-MIDDLE INCOME</b>']],
                                             line_color='darkslategray', fill_color=colors[0], font=dict(color='white', size=12)),
                               cells=dict(values=values, line_color='darkslategray', fill=dict(color=[['#00AF50', '#C55B11', '#0071C1', '#FE0000'], 'white']), align=['left', 'left', 'left', 'left'], font_color = ['white','black'], font_size=10)
                              )
                     ],
                layout = dict(margin=go.layout.Margin(l=0, r=0, b=0, t=0), height=900)# You can update the height like this.
               )
fig.show()

<a id="figure_18"></a>
> **Figure_18**: *shows the updated SWOT matrix with education discussion (in bold)*
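
The matrix cells above are typed by hand, including every line break and item number, which makes it easy for the numbering to drift as items are added. As a minimal sketch (not the code this notebook uses), a small helper could number and wrap the items automatically. The function name `format_swot_cell`, its parameters, and the placeholder items are assumptions made for illustration.

```python
import textwrap


def format_swot_cell(items, new_items=(), width=32):
    """Build the HTML text for one SWOT cell.

    items     -- findings carried over from earlier sections
    new_items -- findings added in the current section (rendered in bold)
    width     -- approximate number of characters per line before a <br>
    """
    items = list(items)
    all_items = items + list(new_items)
    formatted = []
    for number, item in enumerate(all_items, start=1):
        # Number the item, then wrap it so the table column stays narrow.
        wrapped = "<br>".join(textwrap.wrap(f"{number}. {item}", width=width))
        if number > len(items):
            # Items introduced in the current section are shown in bold.
            wrapped = f"<b>{wrapped}</b>"
        formatted.append(wrapped)
    # A blank line (two <br> tags) separates consecutive items.
    return "<br><br>".join(formatted)


# Placeholder items, used only to show the call.
example_cell = format_swot_cell(
    ["Large gender gap", "Low number of young Kagglers"],
    new_items=["Placeholder item for the current section"],
)
print(example_cell)
```

Each string returned by such a helper could be placed in the `values` list in the same position as the hand-written strings.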

OK. Now that we have finished summarizing the education data in our SWOT matrix, we can move on to jobs. Let's go to the next section.

# Section_7: Jobs
<a id="section_7"></a>
![](https://drive.google.com/uc?export=download&id=1o-N5Ajypcie9-_mNuLwxfpspKXR_iIe4)

So far, we discussed gender bias, age, and education. It is now time to talk about jobs. Specifically, we will do the following:
* We will show the average yearly salaries that Kagglers earn depending on where they live.
* We will show the unemployment rate based on formal education across all three income groups.

[Figure_19](#figure_19) shows the salaries of different professions in the survey. As we have seen before, salaries in high income countries are much higher than in middle income countries. In high income countries, the highest salaries are earned by product/project managers, followed by data scientists and statisticians. In upper-middle income countries, database engineers earn the highest salaries, averaging \$40K a year, followed by data engineers at \$35K a year. Interestingly, although salaries in upper-middle income countries are generally higher than in lower-middle income countries, product/project managers in lower-middle income countries earn more than their counterparts in upper-middle income countries. Data analysts and statisticians earn the lowest salaries in lower-middle income countries.

In [None]:
# Title & Salary


title_q = 'Select the title most similar to your current role (or most recent title if retired): - Selected Choice'

# Select the salary column before averaging so that only numeric data is aggregated.
high_title_groups = high_df.groupby(title_q)['num_salaries'].mean()
upper_title_groups = upper_df.groupby(title_q)['num_salaries'].mean()
lower_title_groups = lower_df.groupby(title_q)['num_salaries'].mean()

#title_indices = high_title_groups.index.tolist()
title_indices = ['Data Scientist', 'Software Engineer', 'Research Scientist', 'Data Analyst', 'Business Analyst', 'Data Engineer', 'Product/Project Manager', 'Statistician', 'DBA/Database Engineer']

high_title_counts = high_title_groups[title_indices]
upper_title_counts = upper_title_groups[title_indices]
lower_title_counts = lower_title_groups[title_indices]

fig = go.Figure([go.Bar(x=title_indices, y=high_title_counts, marker_color=colors[0], name='High Income', text=['$'+str(round(x,2)) for x in high_title_counts], textposition='auto'),
                 go.Bar(x=title_indices, y=upper_title_counts, marker_color=colors[1], name='Upper-Middle Income', text=['$'+str(round(x,2)) for x in upper_title_counts], textposition='auto'),
                 go.Bar(x=title_indices, y=lower_title_counts, marker_color=colors[2], name='Lower-Middle Income', text=['$'+str(round(x,2)) for x in lower_title_counts], textposition='auto'),
                ],
               layout = dict(title = 'Average Salaries by Job Title', xaxis=dict(title = 'Job Title'), yaxis=dict(title = 'Salary ($)'),))
fig.show()



<a id="figure_19"></a>
> **Figure_19**: *a chart showing the salaries of different professions in the survey.*
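
The comparison in Figure_19 boils down to a mean salary per job title and income group. The sketch below shows one way to get all three groups into a single table, assuming the three data frames were combined into one with a hypothetical `income_group` column. The toy numbers are made up for illustration and are not survey results.

```python
import pandas as pd

# Made-up toy data; `income_group` and `title` are hypothetical column names,
# while `num_salaries` mirrors the numeric salary column used in this notebook.
toy = pd.DataFrame({
    "income_group": ["High", "High", "High", "Upper-middle", "Upper-middle"],
    "title": ["Data Scientist", "Data Scientist", "Statistician",
              "Data Scientist", "Statistician"],
    "num_salaries": [100.0, 120.0, 90.0, 30.0, 20.0],
})

# Average salary for every (title, income group) pair, then move the
# income groups into columns so each row is one job title.
mean_salaries = (
    toy.groupby(["title", "income_group"])["num_salaries"]
       .mean()
       .unstack("income_group")
)
print(mean_salaries)
```

With one row per title and one column per income group, each column could be passed to a separate `go.Bar` trace.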

Now that we have a clear idea about salaries, let's look at the unemployment rate. As shown in [Figure_20](#figure_20), the unemployment rate for people with only a high school degree is very high. Earning a university degree significantly increases the likelihood of being employed. In high income countries, the unemployment rate drops from 4% for people with a bachelor's degree to 3% for people with a doctoral degree. In upper-middle income and lower-middle income countries, people with a master's degree reported a higher unemployment rate than those with only a bachelor's degree. However, people with a doctoral degree in upper-middle income countries have the lowest unemployment rate, at only 2.5%. In other words, a doctoral degree in an upper-middle income country is associated with a very low unemployment rate. The same applies to lower-middle income countries, where people with a doctoral degree reported a lower unemployment rate. Let's update our SWOT matrix with this information.

In [None]:
# UnEmployment By Degree
title_q = 'Select the title most similar to your current role (or most recent title if retired): - Selected Choice'
education_q = 'What is the highest level of formal education that you have attained or plan to attain within the next 2 years?'

education_indices = ['No formal education past high school', 'Professional degree', 'Some college/university study without earning a bachelor’s degree', 'Bachelor’s degree', 'Master’s degree', 'Doctoral degree', 'I prefer not to answer']

high_unEmployment_rates, upper_unEmployment_rates, lower_unEmployment_rates = [], [], []
for education in education_indices:
    high_bachelor_total = high_df[high_df[education_q] == education]
    high_bachelor_unEmployed = high_bachelor_total[high_bachelor_total[title_q]=='Not employed']
    high_unEmployment_rates.append(high_bachelor_unEmployed.shape[0]/high_bachelor_total.shape[0])
    
    upper_bachelor_total = upper_df[upper_df[education_q] == education]
    upper_bachelor_unEmployed = upper_bachelor_total[upper_bachelor_total[title_q]=='Not employed']
    upper_unEmployment_rates.append(upper_bachelor_unEmployed.shape[0]/upper_bachelor_total.shape[0])
    
    lower_bachelor_total = lower_df[lower_df[education_q] == education]
    lower_bachelor_unEmployed = lower_bachelor_total[lower_bachelor_total[title_q]=='Not employed']
    lower_unEmployment_rates.append(lower_bachelor_unEmployed.shape[0]/lower_bachelor_total.shape[0])

education_indices[0] = 'High School'
education_indices[2] = 'Attended University but No Bachelor'
education_indices[3] = 'Bachelor'
education_indices[4] = 'Masters'
education_indices[5] = 'Doctoral'
education_indices[6] = 'No Answer'

fig = go.Figure([go.Bar(x=education_indices, y=high_unEmployment_rates, marker_color=colors[0], name='High Income', text=[str(round(100*x,2))+'%' for x in high_unEmployment_rates], textposition='auto'),
                 go.Bar(x=education_indices, y=upper_unEmployment_rates, marker_color=colors[1], name='Upper-Middle Income', text=[str(round(100*x,2))+'%' for x in upper_unEmployment_rates], textposition='auto'),
                 go.Bar(x=education_indices, y=lower_unEmployment_rates, marker_color=colors[2], name='Lower-Middle Income', text=[str(round(100*x,2))+'%' for x in lower_unEmployment_rates], textposition='auto'),
                ],
               layout = dict(title = 'Unemployment By Degree', xaxis=dict(title = 'Degree'), yaxis=dict(tickformat=".2%", title = 'Rate'),))
fig.show()

<a id="figure_20"></a>
> **Figure_20**: *a chart showing the unemployment rate depending on the formal education.*
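
The rate plotted above is the share of respondents at each education level whose title is "Not employed". As a hedged alternative to the explicit loop, the same share can be computed as the mean of a boolean column per group. The function name, the short column names, and the toy rows below are assumptions for illustration only.

```python
import pandas as pd

# Made-up toy rows; the real notebook uses the full survey question text
# as column names.
toy = pd.DataFrame({
    "education": ["Bachelor", "Bachelor", "Bachelor", "Bachelor",
                  "Doctoral", "Doctoral"],
    "title": ["Data Analyst", "Not employed", "Data Scientist", "Student",
              "Not employed", "Statistician"],
})


def unemployment_by_education(df, education_col="education", title_col="title"):
    """Share of respondents per education level whose title is 'Not employed'."""
    # True for unemployed respondents; the mean of a boolean series is a fraction.
    is_unemployed = df[title_col].eq("Not employed")
    return is_unemployed.groupby(df[education_col]).mean()


rates = unemployment_by_education(toy)
print(rates)
```

Education levels with no respondents simply do not appear in the result, so no division by zero can occur. Note that this share is taken over all respondents at that level, students included, so it is not the same as an official labor-force unemployment rate.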

In [None]:
values = [['<b>STRENGTHS</b>', '<b>WEAKNESSES</b>', '<b>OPPORTUNITIES</b>', '<b>THREATS</b>'], #1st col
[
# high_strengths
"1. Females are more likely to<br>\
have an advanced degree<br>\
(masters and PhDs.)<br><br>\
2. Great Salaries that grow<br>\
significantly.<br><br>\
3. A strong set of qualifications.<br>\
80% data scientists have either<br>\
masters or PhD.<br><br>\
<b>4. Data scientists and statisticians<br>\
earn high salaries.<br>\
</b>", 
# high_weaknesses
"1. Large Gender Gap<br><br>\
2. Low Number of young Kagglers.", 
# high_opportunities
"1. Females are more likely<br>\
to have postgraduate degree.<br>\
Initiatives to close the salary<br>\
gap are likely to<br>\
lead to more female<br>\
graduates and hires.<br><br>\
2. Earning a PhD can<br>\
lead to a six-figure<br>\
salary.<br><br>\
<b>3. Earning a PhD can decrease<br>\
chances of being unemployed.</b>",
# high_threats
"1. Gender Bias issues are<br>\
possible because most people<br>\
doing data science are male.<br><br>\
2. The percentage of younger<br>\
Kagglers is much lower<br>\
than middle income <br>\
categories. This could lead<br>\
to expertise shortage in the<br>\
future.<br><br>"
],
[
# Upper_strengths
"1. Females are more likely<br>\
to have an advanced<br>\
degree (masters <br>\
and PhDs.)<br><br>\
2. Youthful Population of Kagglers<br><br>\
3. Two thirds of data<br>\
scientists have a<br>\
masters or PhD.<br><br>\
<b>4. DB engineers earn<br>\
high salaries.<br><br>\
5. Unemployment is only 2.5%<br>\
for people with doctoral<br>\
degrees, which is better than<br>\
any other income group.</b>", 
# Upper_weaknesses
"1. Large Gender Gap<br><br>\
2. Has the largest gender gap<br>\
when measured by survey<br>\
participation.<br><br>\
3. Salaries are relatively<br>\
low and they don't<br>\
grow much.<br><br>\
<b>4. People with masters degree<br>\
reported higher unemployment than<br>\
those with a bachelor degree.</b>", 
# Upper_opportunities
"1. Females are more likely<br>\
to have postgraduate degree.<br>\
Initiatives to close the<br>\
salary gap are likely<br>\
to lead to more<br>\
female graduates and hires.<br><br>\
2. Decent supply of young<br>\
expertise to fill future<br>\
jobs.<br><br>\
3. You can significantly increase<br>\
your salary by having<br>\
a masters or PhD.<br><br>\
<b>4. Earning a doctoral degree<br>\
may lower likelihood of<br>\
being unemployed</b>", 
# Upper_threats
"1. Gender Bias issues are<br>\
possible because most people<br>\
doing data science are male."
],
# Lower_strengths
[
"1. Females are more likely<br>\
to have an advanced<br>\
degree (masters and PhDs.)<br><br>\
2. Tend to have the<br>\
youngest Kagglers.<br><br>\
<b>3. Product/project managers<br>\
earn more than upper-<br>\
middle income counterparts.<br></b>", 
# Lower_weaknesses
"1. Large Gender Gap<br><br>\
2. Has the largest gender gap<br>\
when measured by salary<br>\
(males paid 2x females).<br><br>\
3. Salaries are relatively low<br>\
and they don't grow<br>\
much.<br><br>\
4. Having a doctoral degree<br>\
doesn't lead to higher<br>\
salary.<br><br>\
<b>5. Statisticians earn the<br>\
lowest salaries<br><br>\
6. People with masters degree<br>\
reported higher unemployment than<br>\
those with a bachelor degree.</b>", 
# Lower_opportunities
"1. Females are more likely<br>\
to have postgraduate degree.<br>\
Initiatives to close the<br>\
salary gap are likely<br>\
to lead to more<br>\
female graduates and hires.<br><br>\
2. Large supply of young<br>\
expertise to fill future<br>\
jobs.<br><br>\
3. Earning a masters degree<br>\
can increase your salary.<br><br>\
<b>4. Earning a doctoral degree<br>\
may lower likelihood of<br>\
being unemployed</b>", 
# lower_threats
"1. Gender Bias issues are<br>\
possible because most people<br>\
doing data science are<br>\
male.<br><br>\
2. Low Salaries for data<br>\
scientists with doctoral degrees<br>\
can discourage them from<br>\
working as data scientists.<br><br>\
<b>3. Having low salaries for<br>\
statisticians can lead to<br>\
fewer people pursuing<br>\
degrees in statistics.</b>"],
]

fig = go.Figure(data=[go.Table(columnorder = [1,2,3,4], columnwidth = [10,30,30,30],
                               header = dict(values = [[''], ['<b>HIGH INCOME</b>'], ['<b>UPPER-MIDDLE INCOME</b>'], ['<b>LOWER-MIDDLE INCOME</b>']],
                                             line_color='darkslategray', fill_color=colors[0], font=dict(color='white', size=12)),
                               cells=dict(values=values, line_color='darkslategray', fill=dict(color=[['#00AF50', '#C55B11', '#0071C1', '#FE0000'], 'white']), align=['left', 'left', 'left', 'left'], font_color = ['white','black'], font_size=10)
                              )
                     ],
                layout = dict(margin=go.layout.Margin(l=0, r=0, b=0, t=0), height=1100)# You can update the height like this.
               )
fig.show()

<a id="figure_21"></a>
> **Figure_21**: *shows the SWOT matrix updated with jobs analysis (shown in bold)*

We have now finished the SWOT analysis from 4 different angles: Gender, Age, Education, and Jobs. We will conclude this notebook in the next section.

# Section_8: Conclusion
<a id="section_8"></a>
![](https://drive.google.com/uc?export=download&id=1r6cpw5tmJqyiHOPIhmvSNtqYAu7XY57j)

I really hope that you enjoyed this notebook. In this notebook, we used the World Bank country classification to analyze the Kaggle survey. The World Bank classifies countries by economy into 4 categories: Low income, Lower-middle income, Upper-middle income, and High income. The survey has 30 countries that are classified as high income countries, including the US, Canada, Germany, and the UK. The upper-middle income group in the survey is composed of 14 countries, including Brazil, China, and Russia. Finally, the lower-middle income group is composed of 12 countries, including India, Nigeria, and Indonesia. 

We used the SWOT framework to capture the state of each income group from 4 different angles: Gender, Age, Education, and Jobs. SWOT stands for **S**trengths, **W**eaknesses, **O**pportunities, and **T**hreats. The SWOT Matrix shown in [Figure_22](#figure_22) summarizes the state of the 3 income groups in the survey.

Thank you. 



In [None]:
values = [['<b>STRENGTHS</b>', '<b>WEAKNESSES</b>', '<b>OPPORTUNITIES</b>', '<b>THREATS</b>'], #1st col
[
# high_strengths
"1. Females are more likely to<br>\
have an advanced degree<br>\
(masters and PhDs.)<br><br>\
2. Great Salaries that grow<br>\
significantly.<br><br>\
3. A strong set of qualifications.<br>\
80% data scientists have either<br>\
masters or PhD.<br><br>\
4. Data scientists and statisticians<br>\
earn high salaries.<br>", 
# high_weaknesses
"1. Large Gender Gap<br><br>\
2. Low Number of young Kagglers.", 
# high_opportunities
"1. Females are more likely<br>\
to have postgraduate degree.<br>\
Initiatives to close the salary<br>\
gap are likely to<br>\
lead to more female<br>\
graduates and hires.<br><br>\
2. Earning a PhD can<br>\
lead to a six-figure<br>\
salary.<br><br>\
3. Earning a PhD can decrease<br>\
chances of being unemployed.",
# high_threats
"1. Gender Bias issues are<br>\
possible because most people<br>\
doing data science are male.<br><br>\
2. The percentage of younger<br>\
Kagglers is much lower<br>\
than middle income <br>\
categories. This could lead<br>\
to expertise shortage in the<br>\
future."
],
[
# Upper_strengths
"1. Females are more likely<br>\
to have an advanced<br>\
degree (masters <br>\
and PhDs.)<br><br>\
2. Youthful Population of Kagglers<br><br>\
3. Two thirds of data<br>\
scientists have a<br>\
masters or PhD.<br><br>\
4. DB engineers earn<br>\
high salaries.<br><br>\
5. Unemployment is only 2.5%<br>\
for people with doctoral<br>\
degrees, which is better than<br>\
any other income group.", 
# Upper_weaknesses
"1. Large Gender Gap<br><br>\
2. Has the largest gender gap<br>\
when measured by survey<br>\
participation.<br><br>\
3. Salaries are relatively<br>\
low and they don't<br>\
grow much.<br><br>\
4. People with masters degree<br>\
reported higher unemployment than<br>\
those with a bachelor degree.", 
# Upper_opportunities
"1. Females are more likely<br>\
to have postgraduate degree.<br>\
Initiatives to close the<br>\
salary gap are likely<br>\
to lead to more<br>\
female graduates and hires.<br><br>\
2. Decent supply of young<br>\
expertise to fill future<br>\
jobs.<br><br>\
3. You can significantly increase<br>\
your salary by having<br>\
a masters or PhD.<br><br>\
4. Earning a doctoral degree<br>\
may lower likelihood of<br>\
being unemployed", 
# Upper_threats
"1. Gender Bias issues are<br>\
possible because most people<br>\
doing data science are male."
],
# Lower_strengths
[
"1. Females are more likely<br>\
to have an advanced<br>\
degree (masters and PhDs.)<br><br>\
2. Tend to have the<br>\
youngest Kagglers.<br><br>\
3. Product/project managers<br>\
earn more than upper-<br>\
middle income counterparts.<br>", 
# Lower_weaknesses
"1. Large Gender Gap<br><br>\
2. Has the largest gender gap<br>\
when measured by salary<br>\
(males paid 2x females).<br><br>\
3. Salaries are relatively low<br>\
and they don't grow<br>\
much.<br><br>\
4. Having a doctoral degree<br>\
doesn't lead to higher<br>\
salary.<br><br>\
5. Statisticians earn the<br>\
lowest salaries<br><br>\
6. People with masters degree<br>\
reported higher unemployment than<br>\
those with a bachelor degree.", 
# Lower_opportunities
"1. Females are more likely<br>\
to have postgraduate degree.<br>\
Initiatives to close the<br>\
salary gap are likely<br>\
to lead to more<br>\
female graduates and hires.<br><br>\
2. Large supply of young<br>\
expertise to fill future<br>\
jobs.<br><br>\
3. Earning a masters degree<br>\
can increase your salary.<br><br>\
4. Earning a doctoral degree<br>\
may lower likelihood of<br>\
being unemployed", 
# lower_threats
"1. Gender Bias issues are<br>\
possible because most people<br>\
doing data science are<br>\
male.<br><br>\
2. Low Salaries for data<br>\
scientists with doctoral degrees<br>\
can discourage them from<br>\
working as data scientists.<br><br>\
3. Having low salaries for<br>\
statisticians can lead to<br>\
fewer people pursuing<br>\
degrees in statistics."],
]

fig = go.Figure(data=[go.Table(columnorder = [1,2,3,4], columnwidth = [10,30,30,30],
                               header = dict(values = [[''], ['<b>HIGH INCOME</b>'], ['<b>UPPER-MIDDLE INCOME</b>'], ['<b>LOWER-MIDDLE INCOME</b>']],
                                             line_color='darkslategray', fill_color=colors[0], font=dict(color='white', size=12)),
                               cells=dict(values=values, line_color='darkslategray', fill=dict(color=[['#00AF50', '#C55B11', '#0071C1', '#FE0000'], 'white']), align=['left', 'left', 'left', 'left'], font_color = ['white','black'], font_size=10)
                              )
                     ],
                layout = dict(margin=go.layout.Margin(l=0, r=0, b=0, t=0), height=1100)# You can update the height like this.
               )
fig.show()

<a id="figure_22"></a>
> **Figure_22**: *shows the final SWOT matrix.*

# References
<a id="sec_references"></a>
* [1] <a id="ref_1"></a> SWOT analysis - Wikipedia. https://en.wikipedia.org/wiki/SWOT_analysis
* [2] <a id="ref_2"></a> How to Do a SWOT Analysis for Your Small Business (with Examples). https://www.wordstream.com/blog/ws/2017/12/20/swot-analysis
* [3] <a id="ref_3"></a> What Is a SWOT Analysis, and How to Do It Right (With Examples). https://www.liveplan.com/blog/what-is-a-swot-analysis-and-how-to-do-it-right-with-examples/
* [4] <a id="ref_4"></a> Bill Gates quotes: words of wisdom from the Microsoft mogul. https://www.telegraph.co.uk/technology/0/bill-gates-quotes-words-wisdom-microsoft-mogul/microsoft-founder-gates-addresses-session-world-economic-forum/
* [5] <a id="ref_5"></a> Journey to America. https://spartacus-educational.com/USAEjourney.htm
* [6] <a id="ref_6"></a> Trade and Globalization. https://ourworldindata.org/trade-and-globalization
* [7] <a id="ref_7"></a> Classifying countries by income. https://datatopics.worldbank.org/world-development-indicators/stories/the-classification-of-countries-by-income.html
* [8] <a id="ref_8"></a> South Korea - Wikipedia. https://en.wikipedia.org/wiki/South_Korea
* [9] <a id="ref_9"></a> Frequently asked questions about gender equality - UNFPA. https://www.unfpa.org/resources/frequently-asked-questions-about-gender-equality
* [10] <a id="ref_10"></a> Gender inequality is costing the global economy trillions of dollars a year. https://www.newstatesman.com/economics/2014/02/gender-inequality-costing-global-economy-trillions-dollars-year
* [11] <a id="ref_11"></a> How Demographics Drive The Economy. https://www.investopedia.com/articles/investing/012315/how-demographics-drive-economy.asp
* [12] <a id="ref_12"></a> Population ageing - Wikipedia. https://en.wikipedia.org/wiki/Population_ageing
* [13] <a id="ref_13"></a> Aging Demographics: A Threat To The Economy And To Finance. https://www.forbes.com/sites/miltonezrati/2018/04/23/aging-demographics-a-threat-to-the-economy-and-to-finance/#557b42133f2e
* [14] <a id="ref_14"></a> Retirement age in India: demographic timebomb means changes will be needed to deal with ageing population. http://www.agediscrimination.info/news/2019/7/15/retirement-age-in-india
* [15] <a id="ref_15"></a> How Education and Training Affect the Economy. https://www.investopedia.com/articles/economics/09/education-training-advantages.asp