In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from plotly.offline import init_notebook_mode, iplot
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
import plotly.graph_objs as go
from collections import Counter
import datetime as dt
from plotly.subplots import make_subplots
from matplotlib.patches import ConnectionPatch

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
# read the file, remove the question row and check what it looks like

survey_data = pd.read_csv("../input/kaggle-survey-2021/kaggle_survey_2021_responses.csv", index_col=False, low_memory=False)
#survey_data.head()
survey_data = survey_data.iloc[1:,:]
survey_data.head()
#survey_data.shape
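
The first row of the file holds the question wording, which is why it is dropped above. A minimal sketch of how that row could be kept as a lookup instead of thrown away is below. The tiny in-memory CSV, its two columns and the names `questions` and `responses` are my own stand-ins for illustration, not the real survey file.

```python
import io
import pandas as pd

# Tiny stand-in for the survey file: the first data row holds the question text.
# The columns and answers here are illustrative only.
raw_csv = io.StringIO(
    "Q1,Q3\n"
    "What is your age (# years)?,In which country do you currently reside?\n"
    "25-29,India\n"
    "18-21,Australia\n"
)

raw = pd.read_csv(raw_csv, low_memory=False)

# Keep the question wording as a dict so it can be reused in titles or labels.
questions = raw.iloc[0].to_dict()

# Drop the question row and renumber so the responses start at index 0.
responses = raw.iloc[1:].reset_index(drop=True)

print(questions["Q3"])
print(responses.shape)
```

With the real file, the same two steps would apply to the DataFrame returned by `pd.read_csv` in the cell above.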


# "Data Scientist: The Sexiest Job of the 21st Century"

This was the title of an article in the [Harvard Business Review (HBR)](https://www.hbs.edu/faculty/Pages/item.aspx?num=43110) back in 2012. In it, Data Scientists are said to be the key to unlocking value or discovering the story buried in big data. This means they will be highly sought after by industries around the world.

Nine years on, is the role still in demand, and how have individuals around the globe responded to this demand? This is my first year using kaggle. I am new to programming and studying for a Masters in Data Analytics, so I am excited to see whether the kaggle 2021 survey yields any insights on:
> - **Whether Data Scientist is the most popular role**
> - **Where does Data Analyst rank?**
> - **What are the characteristics of kaggle users?**
> - **Which country/countries have the highest pay for data professionals?**
> - **Which industry is looking for data professionals?**
> - **How popular is the kaggle platform to women?**



 
**The top country in the number of respondents is India** with **34.6% or 7434** responses. 

I am also interested in Australia, but it is ranked 24th with only 1.2% or 264 responses :-(!!!

In terms of questions I posed, applying to India:

>**Whether Data Scientist is the most popular role**
> - Data Scientist is indeed a popular role in the survey, second only to an overwhelming proportion of students. Students represent 38% of responses from India.

>**Where does Data Analyst rank?**
> - Data Analyst ranked 5th in terms of responses from India, with 7.5% of overall India responses.

>**What are the characteristics of kaggle users?**
> - Participants from India are young, with **37%** of responses from the 18 - 21 age group.

> The remaining three questions will be examined later.

>**Which country/countries have the highest pay for data professionals?**

>**Which industry is looking for data professionals?**

>**How popular is the kaggle platform to women?**
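
The percentages quoted above are shares of a column's answers. A minimal sketch of how such a share table could be produced is below. The helper name `share_table` and the ten-row sample are hypothetical. Note that `value_counts` ignores missing answers, so the shares are relative to the people who answered the question.

```python
import pandas as pd

def share_table(series: pd.Series) -> pd.DataFrame:
    """Return counts and percentage share for each answer, largest first."""
    counts = series.value_counts()
    share = (counts / counts.sum() * 100).round(1)
    return pd.DataFrame({"count": counts, "share_pct": share})

# Hypothetical mini-sample standing in for survey_data['Q3'].
sample = pd.Series(["India"] * 5 + ["Japan"] * 3 + ["Australia"] * 2)
table = share_table(sample)
print(table)
```

The same helper could be pointed at `india['Q1']` or `india['Q5']` to get the age and job shares.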


In [None]:
# India responses
india = survey_data.loc[survey_data.loc[:, 'Q3']== 'India']

In [None]:
# Countries that are most responsive to the kaggle survey shown in a pie chart

fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(25, 30))
fig.subplots_adjust(wspace=0)

# Data for pie chart of countries ranking in number of respondents
sns.set_palette("Set3", 24)
country = survey_data['Q3'].value_counts().index
num = survey_data['Q3'].value_counts().values.tolist()
country = country[0:24]
num = num[0:24]

label=['India','USA','Other','Japan','China','Brazil','Russia','Nigeria',
       'UK','Pakistan','Egypt','Germany','Spain','Indonesia','Turkey','France',
       'South Korea','Taiwan','Canada','Bangladesh','Italy','Mexico','Vietnam', 'Australia']
explode=[0.2,0.02,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.01,
         0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.9,]
angle = -185 * num[0]
colors= ('burlywood','tab:orange','tab:cyan', 'tab:olive')
ax1.pie(num, autopct='%1.1f%%', startangle=angle, explode=explode, radius=1.6,
        labels=label, textprops={'fontsize' : 20, 'fontweight' : 'bold'}, 
        colors=colors, rotatelabels=200, labeldistance=1.1)

textstr = 'TOP 24 COUNTRIES RANKING'
props = dict(boxstyle='round', facecolor='white', alpha=0.5)
ax1.text(-0.05, 1.65, textstr, transform=ax1.transAxes, fontsize=35, fontweight='bold',
        verticalalignment='bottom', bbox=props)

#bar chart parameters of India
xpos = 0
bottom = 0

# India age distribution
india_age = india['Q1'].value_counts().index
india_num = india['Q1'].value_counts().values.tolist()

width = 1
for j in range(len(india_age)):
    height = india_num[j]
    ax2.bar(xpos, height, width, bottom=bottom)
    ypos = bottom + ax2.patches[j].get_height() / 2
    bottom += height
    ax2.text(xpos, ypos, "%d" % (ax2.patches[j].get_height()),
             ha='center', fontsize=17, weight='bold')
ax2.set_title('INDIA BREAKDOWN BY AGE',
              fontsize=28, fontweight='semibold')
ax2.legend((india_age),bbox_to_anchor=(0.65,0.96), 
           facecolor='white',fontsize=30)
ax2.axis('off')
ax2.set_xlim(- 2.5 * width, 2.6 * width)


# use ConnectionPatch to draw lines between the two plots
# get the wedge data
theta1, theta2 = ax1.patches[0].theta1, ax1.patches[0].theta2
center, r = ax1.patches[0].center, ax1.patches[0].r
bar_height = sum([item.get_height() for item in ax2.patches])

# draw top connecting line
x = r * np.cos(np.pi / 180 * theta2) + center[0]
y = r * np.sin(np.pi / 180 * theta2) + center[1]
con = ConnectionPatch(xyA=(-width / 2, bar_height), coordsA=ax2.transData,
                      xyB=(x, y), coordsB=ax1.transData)
con.set_color([0, 0, 0])
con.set_linewidth(2)
ax2.add_artist(con)

# draw bottom connecting line
x = r * np.cos(np.pi / 180 * theta1) + center[0]
y = r * np.sin(np.pi / 180 * theta1) + center[1]
con = ConnectionPatch(xyA=(-width / 2, 0), coordsA=ax2.transData,
                      xyB=(x, y), coordsB=ax1.transData)
con.set_color([0, 0, 0])
ax2.add_artist(con)
con.set_linewidth(2)

#Charting the India Job Distribution
india_j = india['Q5'].value_counts().index
india_j_num = india['Q5'].value_counts().values.tolist()
width = 1
bottom = 0  # restart the stack at zero for the job column

for j in range(len(india_j)):
    height = india_j_num[j]
    ax3.bar(xpos, height, width, bottom=bottom)
    ypos = bottom + ax3.patches[j].get_height() / 2
    bottom += height
    ax3.text(xpos, ypos,"%d" % (ax3.patches[j].get_height()),
             ha='center', fontsize=17)

ax3.set_title('BY JOB',
              fontsize=28, fontweight='semibold')
ax3.legend(('1:Student', '2:Data Scientist', '3:Software Engineer',
            '4:Currently not employed', '5:Data Analyst','6:Other',
            '7:Machine Learning Engineer', '8:Business Analyst',
            '9:Program/Project Manager', '10:Research Scientist', 
            '11:Data Engineer','12:Product Manager', '13:Statistician', 
            '14:DBA/Database Engineer','15:Developer Relations/Advocacy'),bbox_to_anchor=(0.60,0.96), 
           facecolor='white',fontsize=28)
ax3.axis('off')
ax3.set_xlim(- 2.5 * width, 2.5 * width)


plt.show()



In [None]:

# Establish the OECD countries that are in the kaggle survey 
# This includes Australia

a2 = survey_data.loc[survey_data.loc[:, 'Q3']=='Austria']  
australia = survey_data.loc[survey_data.loc[:, 'Q3']== 'Australia']
bel = survey_data.loc[survey_data.loc[:, 'Q3']=='Belgium']
can = survey_data.loc[survey_data.loc[:, 'Q3']=='Canada']
chi = survey_data.loc[survey_data.loc[:, 'Q3']== 'Chile']
china = survey_data.loc[survey_data.loc[:, 'Q3']== 'China']  # not an OECD member; goes into non_oecd below
col = survey_data.loc[survey_data.loc[:, 'Q3']=='Colombia']
cez = survey_data.loc[survey_data.loc[:, 'Q3']=='Czech Republic']
den = survey_data.loc[survey_data.loc[:, 'Q3']=='Denmark']
fra = survey_data.loc[survey_data.loc[:, 'Q3']=='France']
ger = survey_data.loc[survey_data.loc[:, 'Q3']=='Germany']
gre = survey_data.loc[survey_data.loc[:, 'Q3']=='Greece']
ire = survey_data.loc[survey_data.loc[:, 'Q3']=='Ireland']
isr = survey_data.loc[survey_data.loc[:, 'Q3']=='Israel']
ita = survey_data.loc[survey_data.loc[:, 'Q3']=='Italy']
japan = survey_data.loc[survey_data.loc[:, 'Q3']== 'Japan']
mex = survey_data.loc[survey_data.loc[:, 'Q3']=='Mexico']
net = survey_data.loc[survey_data.loc[:, 'Q3']=='Netherlands']
nor = survey_data.loc[survey_data.loc[:, 'Q3']=='Norway']
pol = survey_data.loc[survey_data.loc[:, 'Q3']=='Poland']
por = survey_data.loc[survey_data.loc[:, 'Q3']=='Portugal']
sko= survey_data.loc[survey_data.loc[:, 'Q3']=='South Korea']
spa = survey_data.loc[survey_data.loc[:, 'Q3']=='Spain']
swe = survey_data.loc[survey_data.loc[:, 'Q3']=='Sweden']
swi = survey_data.loc[survey_data.loc[:, 'Q3']=='Switzerland']
tur = survey_data.loc[survey_data.loc[:, 'Q3']=='Turkey']
ukn = survey_data.loc[survey_data.loc[:, 'Q3']=='United Kingdom of Great Britain and Northern Ireland']
usa = survey_data.loc[survey_data.loc[:, 'Q3']== 'United States of America']
                       
oecd = pd.concat([a2,australia,bel,can,chi,col,cez,den,fra,ger,gre,ire,isr,ita,japan,
                  mex,net,nor,pol,por,sko,spa,swe,swi,tur,ukn,usa])

# Establish the Non-OECD countries, not including India
# but including 'I do not wish to disclose my location' and 'Other'

alg = survey_data.loc[survey_data.loc[:, 'Q3']=='Algeria']
arg= survey_data.loc[survey_data.loc[:, 'Q3']=='Argentina']
ban= survey_data.loc[survey_data.loc[:, 'Q3']=='Bangladesh']
bea= survey_data.loc[survey_data.loc[:, 'Q3']=='Belarus']
brazil = survey_data.loc[survey_data.loc[:, 'Q3']== 'Brazil']
ecu= survey_data.loc[survey_data.loc[:, 'Q3']=='Ecuador']
egy= survey_data.loc[survey_data.loc[:, 'Q3']=='Egypt']
eth= survey_data.loc[survey_data.loc[:, 'Q3']=='Ethiopia']
gha= survey_data.loc[survey_data.loc[:, 'Q3']=='Ghana']
hks= survey_data.loc[survey_data.loc[:, 'Q3']=='Hong Kong (S.A.R.)']
loc= survey_data.loc[survey_data.loc[:, 'Q3']=='I do not wish to disclose my location']
ind= survey_data.loc[survey_data.loc[:, 'Q3']=='Indonesia']
irs= survey_data.loc[survey_data.loc[:, 'Q3']=="Iran, Islamic Republic of..."]
iraq= survey_data.loc[survey_data.loc[:, 'Q3']=='Iraq']
kaz= survey_data.loc[survey_data.loc[:, 'Q3']=='Kazakhstan']
ken= survey_data.loc[survey_data.loc[:, 'Q3']=='Kenya']
mal= survey_data.loc[survey_data.loc[:, 'Q3']=='Malaysia']
mor= survey_data.loc[survey_data.loc[:, 'Q3']=='Morocco']
nel= survey_data.loc[survey_data.loc[:, 'Q3']=='Nepal']
nig= survey_data.loc[survey_data.loc[:, 'Q3']=='Nigeria']
other= survey_data.loc[survey_data.loc[:, 'Q3']=='Other']
pak= survey_data.loc[survey_data.loc[:, 'Q3']=='Pakistan']
peru= survey_data.loc[survey_data.loc[:, 'Q3']=='Peru']
phi= survey_data.loc[survey_data.loc[:, 'Q3']=='Philippines']
rom= survey_data.loc[survey_data.loc[:, 'Q3']=='Romania']
rus= survey_data.loc[survey_data.loc[:, 'Q3']=='Russia']
sau= survey_data.loc[survey_data.loc[:, 'Q3']=='Saudi Arabia']
sin= survey_data.loc[survey_data.loc[:, 'Q3']=='Singapore']
sou= survey_data.loc[survey_data.loc[:, 'Q3']=='South Africa']
sri= survey_data.loc[survey_data.loc[:, 'Q3']=='Sri Lanka']
tai= survey_data.loc[survey_data.loc[:, 'Q3']=='Taiwan']
tha= survey_data.loc[survey_data.loc[:, 'Q3']=='Thailand']
tun= survey_data.loc[survey_data.loc[:, 'Q3']=='Tunisia']
uga= survey_data.loc[survey_data.loc[:, 'Q3']=='Uganda']
ukr= survey_data.loc[survey_data.loc[:, 'Q3']=='Ukraine']
uae= survey_data.loc[survey_data.loc[:, 'Q3']=='United Arab Emirates']
vie= survey_data.loc[survey_data.loc[:, 'Q3']=='Viet Nam']
                     
non_oecd = pd.concat([alg,arg,ban,bea,brazil,china,ecu,egy,eth,gha,hks,loc,ind,irs,
                      iraq,kaz,ken,mal,mor,nel,nig,other,pak,peru,phi,rom,
                      rus,sau,sin,sou,sri,tai,tha,tun,uga,ukr,uae,vie])


oecd_tot = oecd['Q3'].value_counts().index
oecd_tot1 = oecd['Q3'].value_counts().values.tolist()
oecd2 = sum(oecd_tot1)
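
The one-variable-per-country filtering above works, but the same three groups can also be built from a list of names. A minimal sketch using `Series.isin` follows. The function `split_groups`, the shortened `OECD_SAMPLE` list and the sample rows are hypothetical. Row order within each group follows the original data rather than the concat order used above, and rows with a missing country are left out of every group.

```python
import pandas as pd

def split_groups(df, oecd_countries, country_col="Q3"):
    """Split respondents into OECD, non-OECD (excluding India) and India."""
    country = df[country_col]
    is_india = country == "India"
    is_oecd = country.isin(oecd_countries)
    # Everything else with a known answer, including 'Other', is non-OECD.
    is_other = country.notna() & ~is_india & ~is_oecd
    return df[is_oecd], df[is_other], df[is_india]

# Shortened, illustrative list; the full 27 names from the cell above would go here.
OECD_SAMPLE = ["Australia", "Japan", "United States of America"]

sample = pd.DataFrame(
    {"Q3": ["India", "Japan", "Brazil", "Australia", "India", "Other", None]}
)
oecd_df, non_oecd_df, india_df = split_groups(sample, OECD_SAMPLE)
sizes = {"OECD": len(oecd_df), "NON-OECD": len(non_oecd_df), "INDIA": len(india_df)}
print(sizes)
```

Computing `sizes` this way would also avoid typing the group totals by hand in the pie charts.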

**As I am interested in Australia, I compare Australia to the top 5 countries**

>Unfortunately, Australia is a blip compared with India's 7434 responses, and also compared with the other top countries:
> - **No 1 : India**
> - **No 2 : USA**
> - **No 3 : Japan**
> - **No 4 : China**
> - **No 5 : Brazil**

>The bar charts show Australia on the left, compared with the other top countries on the right, by age distribution. They mainly show how few responses Australia has. They also highlight that there are **very few participants in the over 55 age group.**
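
Because the raw counts differ so much, one option is to compare the shape of each age distribution in percentages instead. A minimal sketch is below. The `AGE_ORDER` list is my assumption of the survey's age brackets, and the two samples are made up. Reindexing to a fixed order also keeps the bars in age order rather than in order of frequency.

```python
import pandas as pd

# Assumed age brackets, listed youngest to oldest.
AGE_ORDER = ["18-21", "22-24", "25-29", "30-34", "35-39", "40-44",
             "45-49", "50-54", "55-59", "60-69", "70+"]

def age_share(ages: pd.Series) -> pd.Series:
    """Percentage of respondents in each age bracket, in age order."""
    counts = ages.value_counts().reindex(AGE_ORDER, fill_value=0)
    return (counts / counts.sum() * 100).round(1)

# Hypothetical samples: a small country and a large one.
small = pd.Series(["25-29", "25-29", "30-34", "55-59"])
large = pd.Series(["18-21"] * 50 + ["25-29"] * 30 + ["30-34"] * 20)

comparison = pd.DataFrame({"small": age_share(small), "large": age_share(large)})
print(comparison)
```

Plotting `comparison` with `sharey=True` would then put a country with a few hundred responses on the same scale as one with thousands.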

In [None]:
# Comparing Australia on the left versus the top 5 countries on the right
# Age breakdown for all genders
usa_a = usa['Q1'].value_counts().index
usa_num = usa['Q1'].value_counts().values.tolist()
    
japan_a = japan['Q1'].value_counts().index
japan_num = japan['Q1'].value_counts().values.tolist()

china_a = china['Q1'].value_counts().index
china_num = china['Q1'].value_counts().values.tolist()

brazil_a = brazil['Q1'].value_counts().index
brazil_num = brazil['Q1'].value_counts().values.tolist()

aus_a = australia['Q1'].value_counts().index
aus_num = australia['Q1'].value_counts().values.tolist()

sns.set_style('whitegrid')
fig, axs = plt.subplots(nrows=1, ncols=2, constrained_layout=True, sharex=True, sharey=True, figsize=(20,8))
fig.suptitle('AUSTRALIA VERSUS USA, JAPAN, CHINA, BRAZIL\n   ', fontsize=25, fontweight='bold')
sns.set_style('whitegrid')
axs = axs.flatten()
axs[0].bar(aus_a, aus_num, color='tab:olive')
axs[0].set_title('Australia Respondents by Age', fontweight='semibold', fontsize=20)
axs[0].set_ylabel('No of Respondents')

axs[1].bar(usa_a, usa_num, color='blue')
axs[1].set_title('USA Respondents by Age', fontweight='semibold', fontsize=20)

fig, axs = plt.subplots(nrows=3, ncols=2,constrained_layout=True, sharex=True, sharey=True, figsize=(20,16))
sns.set_style('whitegrid')
axs = axs.flatten()
axs[0].bar(aus_a, aus_num, color='tab:olive')
axs[0].set_ylabel('No of Respondents')

axs[1].bar(japan_a, japan_num,color="r")
axs[1].set_title('Japan Respondents by Age', fontweight='semibold', fontsize=20)

axs[2].bar(aus_a, aus_num, color='tab:olive')
axs[2].set_ylabel('No of Respondents')

axs[3].bar(china_a, china_num, color='tab:purple')
axs[3].set_title('China Respondents by Age', fontweight='semibold', fontsize=20)

axs[4].bar(aus_a,aus_num,color='tab:olive')
axs[4].set_ylabel('No of Respondents')
axs[4].set_xlabel('Age Group', fontsize=15)

axs[5].bar(brazil_a, brazil_num, color='tab:brown')
axs[5].set_title('Brazil Respondents by Age', fontweight = 'semibold', fontsize=20)
axs[5].set_xlabel('Age Group', fontsize=15)


plt.show()



# OECD, NON-OECD and INDIA GROUPING

To come up with a meaningful set of comparisons and to encompass all countries, I have created the following groupings to assess whether there are any differences according to the questions I have outlined:

> - **Organisation for Economic Co-operation and Development (OECD) Countries**
> - **Non-OECD Countries**
> - **India**

The [OECD](https://www.oecd.org/about/) is an international organisation that aims, through its policies, to "foster prosperity, equality and opportunity for all". It has over 60 years of experience and partners with governments, policy makers and citizens to establish evidence-based international standards and find solutions to a range of social, economic and environmental challenges. It currently has [38 member countries](https://www.oecd.org/about/document/ratification-oecd-convention.htm).

27 of the 38 OECD countries have participated in this survey. This includes Australia - yeah!! There are no kaggle participants from these OECD countries: Costa Rica, Estonia, Finland, Hungary, Iceland, Latvia, Lithuania, Luxembourg, New Zealand, Slovak Republic, Slovenia.

Any country not included in the OECD grouping, other than India, is placed in the Non-OECD group. This group has 38 location categories, because 'Other' and those who prefer not to disclose their location are included.

The three groupings are quite evenly split in terms of number of respondents, which could yield interesting insights. 


## OECD Analysis

>**Whether Data Scientist is the most popular role**
> - Data Scientist is also a popular role in the OECD countries. Again, it comes second to a high proportion of students. Students represent **17%** of the responses from OECD countries.

>**Where does Data Analyst rank?**
> - Data Analyst ranks slightly lower than in India, at 6th place in the OECD responses. It comes after Software Engineer and Research Scientist, which rank 4th and 5th in terms of the number of responses.

>**What are the characteristics of kaggle users?**
> - OECD participants are not as young as those in India. Most respondents are in the age groups between **25 - 44.** The 18 - 21 age group is only **6.5%** of the total number of OECD respondents, compared with **37%** of India's respondents.

> These questions will be examined later.

>**Which country/countries have the highest pay for data professionals?**

>**Which industry is looking for data professionals?**

>**How popular is the kaggle platform to women?**


In [None]:
# Comparing OECD vs Non-OECD vs India Responses
# Show within OECD: 1) The countries and their ranking 2) Age Distribution 3) Job Types 

fig, (ax1, ax2, ax3, ax4) = plt.subplots(1, 4, figsize=(20, 30))
fig.subplots_adjust(wspace=0.2)

oecd_a = oecd['Q1'].value_counts().index
oecd_num = oecd['Q1'].value_counts().values.tolist()

oecd_j = oecd['Q5'].value_counts().index
oecd_j_num = oecd['Q5'].value_counts().values.tolist()

oecd_s = oecd['Q25'].value_counts().index
oecd_s_num = oecd['Q25'].value_counts().values.tolist()

sns.set_palette("Paired", 24)
sns.set(font="Arial")
res = [8973, 9567, 7434]
res_label = ['OECD', 'NON-OECD','India']

explode=[0.3,0.01,0.01,]
angle = -184 * res[0]
colors=('tab:cyan','tab:orange','burlywood')
ax1.pie(res, autopct='%1.1f%%', startangle=angle, explode=explode, radius=1.6,
        labels=res_label, labeldistance = 1.1, textprops={'fontsize': 18, 'fontweight': 'bold'}, 
        colors=colors, rotatelabels=200,)

#bar chart parameters of OECD Countries
xpos = 0
bottom = 0
width = 1

textstr = 'OECD, INDIA, NON-OECD GROUPINGS\n% OF RESPONDENTS'
props = dict(boxstyle='round', facecolor='white', alpha=0.5)
ax1.text(-0.6, 2, textstr, transform=ax1.transAxes, fontsize=20, fontweight='bold',
        verticalalignment='bottom', bbox=props)

# This is to chart countries
for j in range(len(oecd_tot)):
    height = oecd_tot1[j]
    ax2.bar(xpos, height, width, bottom=bottom)
    ypos = bottom + ax2.patches[j].get_height() / 2
    bottom += height
    ax2.text(xpos, ypos, "%d" % (ax2.patches[j].get_height()),
             ha='center', fontsize=14)

ax2.set_title('OECD COUNTRIES \nBREAKDOWN BY RESPONSES',fontsize=20, fontweight='semibold')
ax2.legend(('1: USA','2: Japan','3: UK','4: Germany','5: Spain','6: Turkey','7: France','8: S. Korea',
            '9: Canada','10: Italy','11: Mexico','12: Australia','13: Colombia','14: Poland',
            '15: Netherlands','16: Israel','17: Portugal','18: Greece','19: Chile','20: Ireland',
            '21: Sweden','22: Switzerland','23: Belgium','24: Czech\nRepublic','25: Austria',
            '26: Denmark','27: Norway'), 
           fontsize=20, bbox_to_anchor=(0.65,0.96), 
           facecolor='white')
ax2.axis('off')
ax2.set_xlim(- 2.5 * width, 2.5 * width)

# use ConnectionPatch to draw lines between the two plots
# get the wedge data
theta1, theta2 = ax1.patches[0].theta1, ax1.patches[0].theta2
center, r = ax1.patches[0].center, ax1.patches[0].r
bar_height = sum([item.get_height() for item in ax2.patches])

# draw top connecting line
x = r * np.cos(np.pi / 180 * theta2) + center[0]
y = r * np.sin(np.pi / 180 * theta2) + center[1]
con = ConnectionPatch(xyA=(-width / 2, bar_height), coordsA=ax2.transData,
                      xyB=(x, y), coordsB=ax1.transData)
con.set_color([0, 0, 0])
con.set_linewidth(2)
ax2.add_artist(con)

# draw bottom connecting line
x = r * np.cos(np.pi / 180 * theta1) + center[0]
y = r * np.sin(np.pi / 180 * theta1) + center[1]
con = ConnectionPatch(xyA=(-width / 2, 0), coordsA=ax2.transData,
                      xyB=(x, y), coordsB=ax1.transData)
con.set_color([0, 0, 0])
ax2.add_artist(con)
con.set_linewidth(2)

# This is to chart the age distribution
width = 1
bottom = 0  # restart the stack at zero for this column
for j in range(len(oecd_a)):
    height = oecd_num[j]
    ax3.bar(xpos, height, width, bottom=bottom)
    ypos = bottom + ax3.patches[j].get_height() / 2
    bottom += height
    ax3.text(xpos, ypos, "%d" % (ax3.patches[j].get_height()),
             ha='center', fontsize=14)

ax3.set_title('BY AGE',fontsize=20, fontweight='semibold')
ax3.legend((oecd_a), 
           fontsize=22, bbox_to_anchor=(0.63,0.96), 
           facecolor='white')
ax3.axis('off')
ax3.set_xlim(- 2.5 * width, 2.5 * width)

# This is to chart the job types
bottom = 0  # restart the stack at zero for this column
for j in range(len(oecd_j)):
    height = oecd_j_num[j]
    ax4.bar(xpos, height, width, bottom=bottom)
    ypos = bottom + ax4.patches[j].get_height() / 2
    bottom += height
    ax4.text(xpos, ypos, "%d" % (ax4.patches[j].get_height()),
             ha='center', fontsize=14)

ax4.set_title('BY JOB',fontsize=20, fontweight='semibold')
ax4.legend(('1: Student', '2: Data Scientist', '3: Other', '4: Software Engineer',
       '5: Research Scientist', '6: Data Analyst', '7: Currently not employed',
       '8: Machine Learning Engineer', '9: Program/Project Manager',
       '10: Business Analyst', '11: Data Engineer', '12: Product Manager', '13: Statistician',
       '14: DBA/Database Engineer', '15: Developer Relations/Advocacy'), 
            fontsize=20, bbox_to_anchor=(0.63,0.96), 
           facecolor='white')
ax4.axis('off')
ax4.set_xlim(- 2.5 * width, 2.5 * width)

plt.show()
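
The stacked-column loop is repeated for every chart in this notebook, so it could be wrapped in a small helper. The sketch below is one way to do that. The function name `stacked_bar` and its defaults are my own, and the counts are made up. The helper starts every column at zero and returns the total height, which is the value the connecting lines need.

```python
import matplotlib
matplotlib.use("Agg")  # only needed outside a notebook; drop this line with %matplotlib inline
import matplotlib.pyplot as plt

def stacked_bar(ax, counts, width=1.0, fontsize=12):
    """Draw one stacked column from a sequence of counts and label each segment.

    Returns the total height of the stack.
    """
    bottom = 0  # every call starts from zero, independent of earlier charts
    for height in counts:
        ax.bar(0, height, width, bottom=bottom)
        ax.text(0, bottom + height / 2, "%d" % height,
                ha="center", fontsize=fontsize)
        bottom += height
    ax.axis("off")
    ax.set_xlim(-2.5 * width, 2.5 * width)
    return bottom

# Hypothetical counts for two columns.
fig, (ax_a, ax_b) = plt.subplots(1, 2)
total_a = stacked_bar(ax_a, [30, 20, 10])
total_b = stacked_bar(ax_b, [5, 5])
```

Titles and legends would still be set per chart after each call, as in the cells above.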


## NON-OECD Analysis

There are 36 named countries in this grouping, not counting the Other and No Disclosure categories. The number of respondents is 9,567. China and Brazil ranked 1st and 2nd in this group after the Other category.

>**Whether Data Scientist is the most popular role**
> - Data Scientist is also a popular role in the NON-OECD countries. Again, it comes second to a high proportion of students. Students represent **25%** of the responses from this group.

>**Where does Data Analyst rank?**
> - Data Analyst ranked 3rd, after Data Scientist. This is a much higher ranking than in the INDIA and OECD groups. In the other groupings, Software Engineer is ahead of Data Analyst, whereas in NON-OECD it is ranked 4th, after Data Analyst.

>**What are the characteristics of kaggle users?**
> - The NON-OECD group has a higher proportion of respondents in the 18 - 21 age group, with **17%** as opposed to **6.5%** in the OECD group. However, NON-OECD respondents are not as young as those in INDIA, where **37%** of respondents are in the 18 - 21 category. The bulk of the participants in NON-OECD countries are in the 22 - 29 age group.

> These questions will be examined later.

>**Which country/countries have the highest pay for data professionals?**

>**Which industry is looking for data professionals?**

>**How popular is the kaggle platform to women?**

In [None]:
# Comparing OECD vs Non-OECD vs India Responses
# Show within Non-OECD: 1) The countries and their ranking 2) Age Distribution 3) Job Types

fig, (ax1, ax2, ax3, ax4) = plt.subplots(1, 4, figsize=(22, 30))
fig.subplots_adjust(wspace=0)

non_oecd_tot = non_oecd['Q3'].value_counts().index
non_oecd_tot_num = non_oecd['Q3'].value_counts().values.tolist()
    
non_oecd_a = non_oecd['Q1'].value_counts().index
non_oecd_num = non_oecd['Q1'].value_counts().values.tolist()

non_oecd_j = non_oecd['Q5'].value_counts().index
non_oecd_j_num = non_oecd['Q5'].value_counts().values.tolist()

non_oecd_s = non_oecd['Q25'].value_counts().index
non_oecd_s_num = non_oecd['Q25'].value_counts().values.tolist()

sns.set_palette("Set3",24)
res = [9567,8973,7434]
res_label = ['NON-OECD','OECD','INDIA']

textstr = 'NON-OECD, INDIA, OECD GROUPINGS\n% OF RESPONDENTS'
props = dict(boxstyle='round', facecolor='white', alpha=0.5)
ax1.text(-0.4, 2, textstr, transform=ax1.transAxes, fontsize=20, fontweight='bold',
        verticalalignment='bottom', bbox=props)

explode=[0.3,0.01,0.01,]
angle = -190 * res[0]
colors=('tab:orange','tab:cyan','burlywood')
ax1.pie(res, autopct='%1.1f%%', startangle=angle, explode=explode, radius = 1.2,
        labels=res_label, textprops={'fontsize': 16, 'fontweight' : 'bold'}, 
        colors=colors, rotatelabels=200)

#bar chart parameters of Non-OECD Countries
xpos = 0
bottom = 0
width = 1

for j in range(len(non_oecd_tot)):
    height = non_oecd_tot_num[j]
    ax2.bar(xpos, height, width, bottom=bottom)
    ypos = bottom + ax2.patches[j].get_height() / 2
    bottom += height
    ax2.text(xpos, ypos, "%d" % (ax2.patches[j].get_height()),
             ha='center', fontsize=12)

ax2.set_title('38 NON-OECD\nBREAKDOWN BY RESPONSES',fontsize=20, fontweight='semibold')
ax2.legend(('1: Other', '2: China', '3: Brazil', '4: Russia', '5: Nigeria', '6: Pakistan', '7: Egypt',
       '8: Indonesia', '9: Taiwan', '10: Bangladesh', '11: Vietnam', '12: Kenya',
       '13: Iran,\nIslamic Republic of', '14: Ukraine', '15: Argentina', '16: Singapore',
       '17: Malaysia', '18: South Africa', '19: Morocco', '20: Thailand', '21: Peru',
       '22: UAE', '23: Tunisia', '24: Philippines', '25: Sri Lanka', '26: Ghana',
       '27: Saudi Arabia', '28: Hong Kong', '29: Nepal',
       '30: Not disclose', '31: Romania', '32: Belarus',
       '33: Ecuador', '34: Uganda', '35: Kazakhstan', '36: Algeria', '37: Iraq', '38: Ethiopia'), 
           fontsize=19, bbox_to_anchor=(0.63,0.96), 
           facecolor='white')
ax2.axis('off')
ax2.set_xlim(- 2.5 * width, 2.5 * width)

# use ConnectionPatch to draw lines between the two plots
# get the wedge data
theta1, theta2 = ax1.patches[0].theta1, ax1.patches[0].theta2
center, r = ax1.patches[0].center, ax1.patches[0].r
bar_height = sum([item.get_height() for item in ax2.patches])

# draw top connecting line
x = r * np.cos(np.pi / 180 * theta2) + center[0]
y = r * np.sin(np.pi / 180 * theta2) + center[1]
con = ConnectionPatch(xyA=(-width / 2, bar_height), coordsA=ax2.transData,
                      xyB=(x, y), coordsB=ax1.transData)
con.set_color([0, 0, 0])
con.set_linewidth(2)
ax2.add_artist(con)

# draw bottom connecting line
x = r * np.cos(np.pi / 180 * theta1) + center[0]
y = r * np.sin(np.pi / 180 * theta1) + center[1]
con = ConnectionPatch(xyA=(-width / 2, 0), coordsA=ax2.transData,
                      xyB=(x, y), coordsB=ax1.transData)
con.set_color([0, 0, 0])
ax2.add_artist(con)
con.set_linewidth(2)


width = 1
bottom = 0  # restart the stack at zero for the age column
for j in range(len(non_oecd_a)):
    height = non_oecd_num[j]
    ax3.bar(xpos, height, width, bottom=bottom)
    ypos = bottom + ax3.patches[j].get_height() / 2
    bottom += height
    ax3.text(xpos, ypos, "%d" % (ax3.patches[j].get_height()),
             ha='center', fontsize=12)

ax3.set_title('BY AGE',fontsize=20, fontweight='semibold')
ax3.legend((non_oecd_a), 
           fontsize=22, bbox_to_anchor=(0.75,0.96), 
           facecolor='white')
ax3.axis('off')
ax3.set_xlim(- 2.5 * width, 2.5 * width)

width = 1
bottom = 0  # restart the stack at zero for the job column
for j in range(len(non_oecd_j)):
    height = non_oecd_j_num[j]
    ax4.bar(xpos, height, width, bottom=bottom)
    ypos = bottom + ax4.patches[j].get_height() / 2
    bottom += height
    ax4.text(xpos, ypos, "%d" % (ax4.patches[j].get_height()),
             ha='center', fontsize=12)

ax4.set_title('BY JOB',fontsize=20, fontweight='semibold')
ax4.legend(('1: Student', '2: Data Scientist', '3: Data Analyst', '4: Software Engineer',
       '5: Other', '6: Currently not employed', '7: Machine Learning Engineer',
       '8: Research Scientist', '9: Business Analyst', '10: Program/Project Manager',
       '11: Data Engineer', '12: Statistician', '13: Product Manager',
       '14: DBA/Database Engineer', '15: Developer Relations/Advocacy'), 
           fontsize=22, bbox_to_anchor=(0.63,0.96), 
           facecolor='white')
ax4.axis('off')
ax4.set_xlim(- 2.5 * width, 2.5 * width)

plt.show()


# Which country/countries have the highest pay for data professionals?

Further analysis will be required to ascertain exactly which country or countries within the OECD and NON-OECD groups have the highest pay for data professionals. For now this examination is restricted to these three groupings. 

>The 'Currently not employed' category is within the top 8 job roles of each group:
> - INDIA    : Rank No 4
> - OECD     : Rank No 7
> - NON-OECD : Rank No 6

>Together with students ranking No 1 in all three groups in terms of job role, it is not surprising that the $0-999 salary category has the most respondents.

OECD countries show the highest salary levels achieved by respondents compared with the other two groups. They also show a high number of respondents above the [OECD 2020 Average Annual Wages](https://data.oecd.org/earnwage/average-wages.htm) of 49.2k. This is in line with articles such as this one, which cites [the top 10 most lucrative countries](https://www.analyticsinsight.net/top-10-countries-with-the-highest-salaries-for-data-scientists/) as being within the OECD.
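
For a reference line at the average wage to sit in the right place, the salary brackets need to be in numeric order rather than in order of frequency. A minimal sketch of one way to sort them is below. The helper `salary_lower_bound` is hypothetical, and the label formats in `labels` are my assumption of how the survey writes its brackets.

```python
import re

def salary_lower_bound(label: str) -> int:
    """Extract the lower bound of a salary bracket label as an integer."""
    match = re.search(r"\d[\d,]*", label)
    if match is None:
        raise ValueError(f"no number found in {label!r}")
    return int(match.group().replace(",", ""))

# Assumed label formats, deliberately out of order.
labels = ["$0-999", "100,000-124,999", "1,000-1,999",
          ">$1,000,000", "40,000-49,999", "50,000-59,999"]

ordered = sorted(labels, key=salary_lower_bound)

# Index of the bracket that contains the OECD average wage of 49.2k.
oecd_average = 49_200
bracket_index = max(
    i for i, lab in enumerate(ordered) if salary_lower_bound(lab) <= oecd_average
)
print(ordered)
print(ordered[bracket_index])
```

Reindexing each group's `value_counts()` to `ordered` before plotting would let `bracket_index` be used as the x position of the line.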


In [None]:
# Salary comparisons between the 3 groups
# Add an OECD median salary and a Data Scientist median salary ranking

india_s = india['Q25'].value_counts().index
india_s_num = india['Q25'].value_counts().values.tolist()

fig, axs = plt.subplots(nrows=3, ncols=1,constrained_layout=True, sharex=True, figsize=(20,20))
fig.suptitle('COMPARING INDIA, OECD, NON-OECD SALARY LEVELS\n  ', fontsize=20, fontweight='bold')
axs = axs.flatten()
sns.set_style('whitegrid')

plt.xticks(rotation=90, fontsize=18)
axs[0].bar(india_s, india_s_num, color='burlywood')
axs[0].set_title('INDIA SALARY DISTRIBUTION', fontsize=20, fontweight = 'bold')
axs[0].axvline(12, 0, 1, linewidth=6, linestyle='--', color='r')  # ymin/ymax are axes fractions (0 to 1)
textstr = 'OECD 2020 AVERAGE ANNUAL WAGES @ 49.2K'
props = dict(boxstyle='round', facecolor='white', alpha=0.5)
axs[0].text(0.35, 0.5, textstr, transform=axs[0].transAxes, fontsize=15,
        verticalalignment='bottom', bbox=props)

#axs[0].axvline(label='OECD MEDIAN SALARY')

axs[1].bar(oecd_s, oecd_s_num, color='tab:cyan')
axs[1].set_title('OECD COUNTRIES SALARY DISTRIBUTION', fontsize=20, fontweight = 'bold')
axs[1].axvline(12, 0, 1, linewidth=6, linestyle='--', color='r')  # ymin/ymax are axes fractions (0 to 1)

axs[2].bar(non_oecd_s, non_oecd_s_num, color='tab:orange')
axs[2].set_title('NON-OECD COUNTRIES SALARY DISTRIBUTION', fontsize=20, fontweight = 'bold')
axs[2].axvline(12, 0, 1, linewidth=6, linestyle='--', color='r')  # ymin/ymax are axes fractions (0 to 1)

plt.show()


In [None]:
# Let's look at Gender - Man vs Woman in the 3 Groupings


oecd_wo = oecd.loc[oecd.loc[:, 'Q2']=='Woman']
oecd_ma = oecd.loc[oecd.loc[:, 'Q2']=='Man']
oecd_gender = pd.concat([oecd_wo, oecd_ma])

non_oecd_wo = non_oecd.loc[non_oecd.loc[:, 'Q2']=='Woman']
non_oecd_ma = non_oecd.loc[non_oecd.loc[:, 'Q2']=='Man']
non_oecd_gender = pd.concat([non_oecd_wo,non_oecd_ma]) 
                              
#non_oecd_wo.shape
#oecd_gender.shape

# Which Industry / Industries employ Data Scientists and Analysts?

As we have seen earlier, Data Scientist ranks second after Student in terms of responses in all three groups. The rank of Data Analyst varies between the groups, as follows:

> - INDIA : Rank 5
> - OECD : Rank 6
> - NON-OECD : Rank 3

So here we looked at which roles are deployed in the named industries in the survey. Also, how the two roles compared within an industry. 

> **The top three industries that employ most Data Scientists and Analysts are:**
> - 1) Computers/Technology
> - 2) Academics/Education
> - 3) Accounting/Finance

> **In all industries, Data Scientists are deployed more than Analysts, except for Government/Public Service.**
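
The comparison above can be checked with a cross-tabulation of industry (`Q20`) against role (`Q5`). This is a minimal sketch on a made-up `toy` table, not the survey data, so the counts are illustrative only.

```python
import pandas as pd

# Toy stand-in for the survey: role (Q5) and industry (Q20) per respondent
toy = pd.DataFrame({
    'Q5': ['Data Scientist', 'Data Scientist', 'Data Analyst',
           'Data Analyst', 'Data Analyst', 'Data Scientist'],
    'Q20': ['Computers/Technology', 'Computers/Technology', 'Computers/Technology',
            'Government/Public Service', 'Government/Public Service',
            'Government/Public Service'],
})

# Rows are industries, columns are roles, cells are respondent counts
table = pd.crosstab(toy['Q20'], toy['Q5'])

# Industries where Analysts outnumber Data Scientists
more_analysts = table[table['Data Analyst'] > table['Data Scientist']].index.tolist()
```

Running the same two lines on the real `s_d` frame would list any industry where Analysts are the larger group.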

In [None]:
# Look at which industries employ Data Scientists and Analysts

s_d_s = survey_data.loc[survey_data.loc[:, 'Q5']== 'Data Scientist']
s_d_ana = survey_data.loc[survey_data.loc[:, 'Q5']== 'Data Analyst']
s_d = pd.concat([s_d_s, s_d_ana])

sns.set_style("whitegrid")

plt.figure(figsize=(20,13))

# Order the industries by their number of Data Scientists

order=pd.crosstab(s_d.Q20, s_d.Q5).sort_values('Data Scientist', ascending=False).index
# Fix the hue order explicitly so the legend labels always match the bars
a = sns.countplot(x='Q20', hue='Q5', data=s_d, palette='Paired', order=order,
                  hue_order=['Data Scientist', 'Data Analyst'])
plt.xticks(rotation=90, fontsize=15)
plt.legend(fontsize=20)

plt.title('DATA SCIENTIST AND ANALYST BREAKDOWN BY INDUSTRY', fontweight='bold', fontsize=20)
plt.ylabel('No of Respondents')

plt.show()



# HOW MANY WOMEN RESPONDED COMPARED WITH THE MEN IN EACH GROUP?

**Given the opportunities for data professionals, has this attracted more women?**

In all groups, there are very few women respondents. In particular, there are hardly any women in the over-50 age categories among the participants from India and the Non-OECD countries.

OECD participants are "older" than those from India and the Non-OECD countries, and they do include more women in the over-50 age categories - PHEW!!
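
To put a number on "very few", the share of women in each age band can be computed with a row-normalised cross-tabulation of age (`Q1`) against gender (`Q2`). This is a small sketch on a made-up `toy` frame, so the percentages are illustrative and not survey results.

```python
import pandas as pd

# Toy stand-in for one group: age band (Q1) and gender (Q2) per respondent
toy = pd.DataFrame({
    'Q1': ['18-21', '18-21', '18-21', '18-21', '50-54', '50-54'],
    'Q2': ['Man', 'Man', 'Man', 'Woman', 'Man', 'Man'],
})

# normalize='index' turns each row of counts into proportions that sum to 1
share = pd.crosstab(toy['Q1'], toy['Q2'], normalize='index')

# Proportion of women within each age band
women_share = share['Woman']
```

The same call on `india_gender`, `oecd_gender` or `non_oecd_gender` would give the share of women per age band for that group.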

In [None]:

#oecd_age = oecd['Q1'].value_counts().index
#oecd_a_num = oecd['Q1'].value_counts().values.tolist()

# India Gender 
in_wo = india.loc[india.loc[:, 'Q2']=='Woman']
in_ma = india.loc[india.loc[:, 'Q2']=='Man']
india_gender = pd.concat([in_wo, in_ma])

# OECD gender
oecd_wo = oecd.loc[oecd.loc[:, 'Q2']=='Woman']
oecd_ma = oecd.loc[oecd.loc[:, 'Q2']=='Man']
oecd_gender = pd.concat([oecd_wo, oecd_ma])

# NON_OECD gender
non_oecd_wo = non_oecd.loc[non_oecd.loc[:, 'Q2']=='Woman']
non_oecd_ma = non_oecd.loc[non_oecd.loc[:, 'Q2']=='Man']
non_oecd_gender = pd.concat([non_oecd_wo,non_oecd_ma]) 
                              

In [None]:
# Let's see the distribution of women and men by age in India, OECD and Non-OECD countries

plt.figure(figsize=(20,10))
plt.suptitle('GENDER BREAKDOWN : INDIA, OECD, NON-OECD\n           ', fontsize = 25, fontweight ='bold')
plt.title('INDIA BREAKDOWN BY GENDER AND AGE',fontsize=20, fontweight='semibold')
sns.countplot(x='Q1', hue='Q2', data=india_gender,palette='Set2',
              order=['18-21','22-24','25-29','30-34','35-39','40-44','45-49','50-54','55-59','60-69','70+'])
plt.legend(loc='upper right', fontsize=20)

plt.figure(figsize=(20,6))
plt.title('OECD BREAKDOWN BY GENDER AND AGE',fontsize=20, fontweight='semibold')
sns.countplot(x='Q1', hue='Q2', data=oecd_gender, palette='Paired',
              order=['18-21','22-24','25-29','30-34','35-39','40-44','45-49','50-54','55-59','60-69','70+'])
plt.legend(fontsize=20)

plt.figure(figsize=(20,6))
plt.title('NON-OECD BREAKDOWN BY GENDER AND AGE',fontsize=20, fontweight='semibold')
sns.countplot(x='Q1', hue='Q2', data=non_oecd_gender, palette='OrRd',
              order=['18-21','22-24','25-29','30-34','35-39','40-44','45-49','50-54','55-59','60-69','70+'])
plt.legend(fontsize=20)

plt.show()

# WHAT ARE THE CHARACTERISTICS OF WOMEN KAGGLE USERS?

>**Whether Data Scientist is the most popular role**
> - Among women, Students and Data Scientists also rank first and second respectively in number of responses. 

>**Where does Data Analyst rank?**
> - Data Analysts rank third, after Data Scientists, as in the NON-OECD countries. 

>**What are the characteristics of kaggle users?**
> - The age distribution of the women respondents has already been discussed. Looking at their education levels, it is good to see a high number at the Master's and Bachelor's degree level. 

> - In terms of salary level, most respondents are in the 0-999 category. It is, however, good to see some respondents (fewer than 100) in the high salary ranges of 100K and above. 
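
The rankings quoted above come from counting responses per role. A minimal sketch of how such a rank can be read off `value_counts()`, using a made-up `roles` series in place of a column such as `w_a_age['Q5']`:

```python
import pandas as pd

# Toy stand-in for the role column (Q5) of the women respondents
roles = pd.Series(['Student', 'Student', 'Student',
                   'Data Scientist', 'Data Scientist', 'Data Analyst'])

# Number of responses per role, most frequent first
counts = roles.value_counts()

# Rank 1 is the most common role; ties share the lowest rank
rank = counts.rank(ascending=False, method='min').astype(int)
```

Looking up `rank['Data Analyst']` then gives the position of that role within the group.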


In [None]:
# These charts show women respondents by job type, education level and salary level

plt.figure(figsize=(20,37))

sns.set_style("whitegrid")
w_a_age = survey_data.loc[survey_data.loc[:, 'Q2']=='Woman']

# Women in Job Types
w_j = w_a_age['Q5'].value_counts().index
w_j_num = w_a_age['Q5'].value_counts().values.tolist()

# Women in education level
w_e = w_a_age['Q4'].value_counts().index
w_e_num = w_a_age['Q4'].value_counts().values.tolist()

# Women in salary level
w_s = w_a_age['Q25'].value_counts().index
w_s_num = w_a_age['Q25'].value_counts().values.tolist()

plt.subplot(311)
plt.title('WOMEN BY JOB TYPES', fontsize=20, fontweight='bold')
a = plt.bar(w_j, w_j_num, color='orange')
plt.xticks(rotation=35, fontsize=10)

plt.subplot(312)
plt.title('WOMEN BY HIGHEST EDUCATION LEVEL', fontsize=20, fontweight='bold')
b = plt.bar(w_e, w_e_num, color='tab:cyan')
plt.xticks(rotation=10, fontsize=10)

plt.subplot(313)
plt.title('WOMEN BY SALARY LEVEL', fontsize=20, fontweight='bold')
c = plt.bar(w_s, w_s_num, color='tab:olive')
plt.xticks(rotation=90, fontsize=10)


plt.show()


# IN SUMMARY WHAT DO I FIND?

**In answer to the questions I posed, and based on an analysis of the three groups (INDIA, OECD, NON-OECD), I would conclude the following:**

> **Whether Data Scientist is the most popular role**
>> - The role of Data Scientists is indeed popular
>> - Students are however the top kaggle respondents, which suggests that kaggle is popular for learning 

> **Where does Data Analyst rank?**
>> - Data Analyst has risen in prominence as a popular role among data professionals
>> - This is especially true for the Non-OECD countries

> **What are the characteristics of kaggle users?**
>> - Respondents from India and NON-OECD countries are young, while respondents from OECD countries are more 'mature'

> **Which country/countries have the highest pay for data professionals?**
>> - It would seem that OECD countries have higher pay than India and NON-OECD countries

> **Which industry is looking for data professionals?**
>> The top three industries that employ most Data Scientists and Analysts are:
>> - 1) Computers/Technology
>> - 2) Academics/Education
>> - 3) Accounting/Finance
>>
>> In all industries, Data Scientists are deployed more than Analysts, except for Government/Public Service.

> **How popular is the kaggle platform to women?**
>> - Very few women responded in any of the three groups.
>> - I would really like to assess how to encourage women to take up data-related skills and what obstacles need to be removed. This would need more research, surveys and analysis. 


# Sources of information used are:

(1) OECD (2021), Average wages (indicator). doi: 10.1787/cc3e1387-en (Accessed on 26 November 2021)

(2) Ganguli, D. (2021), Top 10 Countries with the highest salaries for data scientists. https://www.analyticsinsight.net/top-10-countries-with-the-highest-salaries-for-data-scientists/ (Viewed 22nd November)