# Introduction
> This is an analysis of the most comprehensive dataset available on the state of ML and data science. The goal of the competition is to tell a story about a subset of the data science and ML community.
***I therefore decided to focus on Arabic and African countries (I am Egyptian) in my analysis. I will try to answer the following questions:***

1. How do these countries compare against each other? Which country had the best advancements?
2. How do they compare against other ML and data science giants?

**(If you like this notebook, do not forget to upvote it!)**

In [None]:
import pandas as pd
import numpy as np
from IPython.display import display, HTML, Javascript
from string import Template
import json, random
import IPython.display
from plotly.offline import init_notebook_mode, iplot
from plotly import subplots
import plotly.figure_factory as ff
import plotly as py
import plotly.express  # make sure the py.express submodule used below is loaded
import plotly.graph_objects as go
init_notebook_mode(connected=True)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

# Data Wrangling

>We look at the general properties of the data to see whether it needs any cleaning before the analysis begins.

In [None]:
df=pd.read_csv('../input/kaggle-survey-2020/kaggle_survey_2020_responses.csv')
df.head()

In [None]:
df.describe()

In [None]:
# Drop the first row: it holds the question text, not a response.
df = df.drop(df.index[0])
questions = df.columns

In [None]:
q23_all = [question for question in questions if 'Q23' in question]
q23 = q23_all[0:6]

q23_df = df[df[q23_all].isnull().sum(axis=1) != len(q23_all)]
dataprofs = q23_df[q23_df[q23].isnull().sum(axis=1) != len(q23)]

print("Total number of respondents: ", df.shape[0])
print("The number of data professionals in the respondents: ",dataprofs.shape[0])

# EDA

We will first look through the whole dataset, covering **age, gender, and coding experience, as well as degrees of study and programming language per country**.

Then we will revisit these features in our analysis to answer our questions.


In [None]:
#EDA colors
primary_blue = "#f6aa11"
primary_blue2 = "#d47e0f"
primary_blue3 = "#eb721c"
primary_grey = "#c6ccd8"
primary_black = "#202022"
primary_bgcolor = "#f4f0ea"

f1 = "#46f5fa"
f2 = "#a828fa"
f3 = "#01fa12"
f4 = "#fa3928"
f5 = "#fae57c"
f6 = "#c0ffa7"

In [None]:
fields=dataprofs["Q5"].unique()
fields_df = pd.DataFrame()
fields_df["all"] = q23_df["Q5"].value_counts()
fields_df["profs"] = dataprofs["Q5"].value_counts()
fields_df["non_profs"] = fields_df["all"] - fields_df["profs"]
fields_df["ratio"] = fields_df["profs"]/ fields_df["all"]
fields_df["proportion"] = fields_df['profs'] * 100/ fields_df["profs"].sum()

fig=go.Figure()

trace1 = go.Bar(
    y = fields_df.index,
    x = fields_df["proportion"],
    orientation = "h",
    marker = dict(color=[primary_blue] + [primary_grey]*10),
    name = "",
    width= 0.85,    
    customdata = fields_df[["profs","non_profs","proportion"]],
    hoverinfo = "none",
    hovertemplate = ' Work in data related roles: %{customdata[0]}<br> Do not work in data related roles: %{customdata[1]}<br> Contribution to  total number of data professionals: %{customdata[2]:.2f}%'
)

data = [trace1]

layout=dict( yaxis={'categoryorder':'array',
           'categoryarray': fields_df["proportion"].sort_values(ascending=True).keys()},title="Kagglers Jobs")

fig = go.Figure(data=data,layout = layout)

fig.show()



Data scientists come first in this category, followed by data analysts, even though I expected ML engineers to contribute more to the dataset.

In [None]:
fields=dataprofs["Q2"].unique()
fields_df = pd.DataFrame()
fields_df["all"] = q23_df["Q2"].value_counts()
fields_df["profs"] = dataprofs["Q2"].value_counts()
fields_df["non_profs"] = fields_df["all"] - fields_df["profs"]
fields_df["ratio"] = fields_df["profs"]/ fields_df["all"]
fields_df["proportion"] = fields_df['profs'] * 100/ fields_df["profs"].sum()

fig=go.Figure()

trace1 = go.Bar(
    y = fields_df.index,
    x = fields_df["proportion"],
    orientation = "h",
    marker = dict(color=[primary_blue] + [primary_grey]*10),
    name = "",
    width= 0.85,    
    customdata = fields_df[["profs","non_profs","proportion"]],
    hoverinfo = "none",
    hovertemplate = ' Work in data related roles: %{customdata[0]}<br> Do not work in data related roles: %{customdata[1]}<br> Contribution to  total number of data professionals: %{customdata[2]:.2f}%'
)

data = [trace1]

layout=dict( yaxis={'categoryorder':'array',
           'categoryarray': fields_df["proportion"].sort_values(ascending=True).keys()},title="Kagglers Gender")

fig = go.Figure(data=data,layout = layout)

fig.show()

In [None]:
students = df[df["Q5"]=="Student"]
student_ages = students["Q1"].value_counts()
dataprof_ages = dataprofs["Q1"].value_counts()

trace1 = go.Bar(
    x = student_ages.keys(),
    y = student_ages.values,
    name = "Students",
    marker_color = "#29658c",
    text = student_ages[student_ages.keys()],
    textposition = "outside",
)
trace2 = go.Bar(
    x = student_ages.keys(),
    y = - dataprof_ages[student_ages.keys()],
    name = "Data Professionals",
    marker_color = "#cc0000",
    text = dataprof_ages[student_ages.keys()],
    textposition = "outside"
)
layout = dict(
    title = "<span style='font-size:26px'>Study of age groups</span><br><span style='color:#999; font-size: 16px; font-weight:200'>students and professionals</span>",
    plot_bgcolor='#f5f5f5',
    margin = dict(t=50, l=0, r=0),
    legend=dict(yanchor='top',xanchor='right', x=0.992, y=0.98, font=dict(size= 12),traceorder='normal'),
    xaxis = dict(domain=[0,1]),
    barmode="overlay",
    bargap = 0.1,
    width = 765
)
data = [trace1,trace2]

fig = go.Figure(data=data,layout = layout)

fig.show()


1. 70% of respondents aged 18-21 are pursuing a bachelor's degree.
1. At 22-24, the focus moves to master's degrees.
1. Among professionals, the largest group of respondents is aged 25-29.

In [None]:
fields=dataprofs["Q6"].unique()
fields_df = pd.DataFrame()
fields_df["all"] = q23_df["Q6"].value_counts()
fields_df["profs"] = dataprofs["Q6"].value_counts()
fields_df["non_profs"] = fields_df["all"] - fields_df["profs"]
fields_df["ratio"] = fields_df["profs"]/ fields_df["all"]
fields_df["proportion"] = fields_df['profs'] * 100/ fields_df["profs"].sum()

fig=go.Figure()

trace1 = go.Bar(
    y = fields_df.index,
    x = fields_df["proportion"],
    orientation = "h",
    marker = dict(color=[primary_blue] + [primary_grey]*10),
    name = "",
    width= 0.85,    
    customdata = fields_df[["profs","non_profs","proportion"]],
    hoverinfo = "none",
    hovertemplate = ' Work in data related roles: %{customdata[0]}<br> Do not work in data related roles: %{customdata[1]}<br> Contribution to  total number of data professionals: %{customdata[2]:.2f}%'
)

data = [trace1]

layout=dict( yaxis={'categoryorder':'array',
           'categoryarray': fields_df["proportion"].sort_values(ascending=True).keys()},title="Kagglers coding experience")

fig = go.Figure(data=data,layout = layout)

fig.show()



In [None]:
# Education levels of professionals by field
education_df = pd.DataFrame()

fields = dataprofs["Q5"].unique()

for field in fields:
    education_df[field] = dataprofs[dataprofs["Q5"]==field]["Q4"].value_counts()
        
education_df.dropna(inplace = True)
education_df = education_df/education_df.sum()

# Adding spacing and formatting directly to the column names.
education_df.columns = [("<span style='font-size:15px; font-family:Helvetica'>"+label + "</span>  ") for label in education_df.columns]

# plotting stacked bar charts
trace1 = go.Bar(
    y = education_df.columns,
    x = education_df.loc["Doctoral degree"],
    name = "Doctoral Degree",
    marker = dict(color= f1),#"#46f5fa"
    orientation = "h"
)

trace2 = go.Bar(
    y = education_df.columns,
    x = education_df.loc["Master’s degree"],
    orientation = "h",
    marker = dict(color= f2), #"#a828fa"
    name = "Master's degree"
)

trace3 = go.Bar(
    y = education_df.columns,
    x = education_df.loc["Professional degree"],
    marker = dict(color= f3), #"#01fa12"
    name = "Professional degree",
    orientation = "h"    
)

trace4 = go.Bar(
    y = education_df.columns,
    x = education_df.loc["Bachelor’s degree"],
    name = "Bachelor's degree",
    marker = dict(color= f4), #"#fa3928"
    orientation = "h"
)

trace5 = go.Bar(
    y = education_df.columns,
    x = education_df.loc["Some college/university study without earning a bachelor’s degree"],
    name = "Education without a degree",
    marker = dict(color= f5), #"#fae57c"
    orientation = "h"
)

trace6 = go.Bar(
    y = education_df.columns,
    x = education_df.loc["No formal education past high school"],
    name = "No formal education past high school",
    orientation = "h",
    marker = dict(color= primary_blue3), #"#eb721c"
)

# title format
large_title_format = "<span style='font-size:36px; font-family:Times New Roman'>What educational qualifications do I need?</span>"
small_title_format = "<span style='font-size:14px; font-family:Helvetica'><b>Master's and Bachelor's degrees form the majority of all fields</b></span>"

layout = dict(
    title = dict(text = large_title_format + "<br>" + small_title_format,x=0.5, y=0.835),
    margin = dict(t=250, l=0,b=0),
    xaxis = dict(title="<span style='font-size:15px; font-family:Helvetica'><b>Color Key: </b>Educational qualifications of professionals</span>", side="top",title_standoff=0, domain=[0,0.95], showticklabels = False),
    xaxis2 = dict(domain=[0,1]),
    yaxis = dict(domain=[0.85,1], showticklabels = False),
    yaxis2={'categoryorder':'array',
           'categoryarray': education_df.loc["Doctoral degree"].sort_values(ascending=True).keys(),
            'domain':[0,0.83]
           },
    barmode = "stack",
    bargap = 0.05,
    showlegend = False,
    legend = dict(orientation='h',yanchor='top',xanchor='center',y=-0.05,x=0.5,font=dict(size= 12), traceorder='normal'),
    width = 850,
    height = 700,
    plot_bgcolor = "#fff"
)

# Using a heatmap to depict a colour key
colorscale = ff.create_annotated_heatmap(
    z=[[1,2,3,4,5,6]],
        annotation_text =[["<span style='font-size:12px'>"+text+"</span>" for text in ["Doctoral<br>degree","Master's<br>degree","Professional<br>degree","Bachelor's<br>degree","Education<br>without<br>degree","High school<br>education"]]],
    colorscale= [
        [0.000,"#46f5fa"],[0.166,"#46f5fa"],
        [0.166,"#a828fa"],[0.333,"#a828fa"],
        [0.333,"#01fa12"],[0.500,"#01fa12"],
        [0.500,"#fa3928"],[0.666,"#fa3928"],
        [0.666,"#fae57c"],[0.833,"#fae57c"],
        [0.833,primary_blue3],[1.000,primary_blue3],
    ],
    font_colors = ["white", "white", "black", "white", "white", "white"],
    hoverinfo = "none",
    xgap = 1.5,
    showscale = False
)


data = [trace1, trace2, trace3, trace4, trace5, trace6]

# color key - row 1, horizontal stacked bar chart - row2 
fig = subplots.make_subplots(
    rows=2, 
    cols=1, 
    shared_yaxes=True, 
    shared_xaxes=False, 
    horizontal_spacing = 0.02, 
    vertical_spacing = 0.01
)

fig.append_trace(colorscale.data[0],1,1); 

fig.append_trace(trace1,2,1); 
fig.append_trace(trace2,2,1); 
fig.append_trace(trace3,2,1); 
fig.append_trace(trace4,2,1); 
fig.append_trace(trace5,2,1); 
fig.append_trace(trace6,2,1);

# to add figure factory's annotations to main fig
annot1 = list(colorscale.layout.annotations)
for k in range(len(annot1)):
    annot1[k]['xref'] = 'x'
    annot1[k]['yref'] = 'y'
fig.update_layout(annotations=annot1) 


fig.update_layout(layout)
fig.show()

Most people have either a Master's or a Bachelor's degree in their respective field. The doctoral degree comes in close behind these two. If you want to get a job in this field, a Master's is the way to go, followed by a Bachelor's.
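
The per-field education shares behind this chart can also be computed in one step. The following is a minimal sketch, assuming a toy frame with the survey's `Q5` (job title) and `Q4` (education) columns. The rows and the name `education_share` are made up for illustration and are not survey data.

```python
import pandas as pd

# Made-up rows for illustration only (not survey data).
toy = pd.DataFrame({
    "Q5": ["Data Scientist"] * 4 + ["Data Analyst"] * 2,
    "Q4": ["Master's degree", "Master's degree", "Doctoral degree",
           "Bachelor's degree", "Bachelor's degree", "Master's degree"],
})

# normalize="columns" makes every job-title column sum to 1,
# the same normalization as education_df / education_df.sum().
education_share = pd.crosstab(toy["Q4"], toy["Q5"], normalize="columns")
print(education_share.round(2))
```

Each column then reads as the education mix within one job title, which is what the stacked bars display.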

In [None]:
primary_blue = "#496595"
primary_blue2 = "#85a1c1"
primary_blue3 = "#3f4d63"
primary_grey = "#c6ccd8"
primary_black = "#202022"

q7 = [question for question in questions if 'Q7' in question]

languages = []
for qn in q7:
    for val in dataprofs[qn].unique():
        languages.append(val)
        
languages = [lang for lang in languages if str(lang)!='nan']

prof_langs = (dataprofs.shape[0] - dataprofs[q7].isnull().sum()) / dataprofs.shape[0]
student_langs = (students.shape[0] - students[q7].isnull().sum()) / students.shape[0]

prof_langs.index = languages
student_langs.index = languages

texttemplate_white = "<b style='color: #fff'>%{text}% </b>"
texttemplate_black = "<b style='color: #000'> %{text}% </b>"

trace2 = go.Bar(
    y = languages,
    x = prof_langs,
    orientation = "h",
    name = "Professionals",
    marker = dict(color = primary_blue),
    text = np.round(prof_langs*100),
    texttemplate = [texttemplate_white]*7 +[texttemplate_black]*2 + [texttemplate_white]*2 +[texttemplate_black] + [texttemplate_white],
    textposition = ["inside"]*7 +["outside"]*2 + ["inside"]*2 +["outside"] + ["inside"],
)


trace1 = go.Bar(
    y = languages,
    x = student_langs,
    name = "Students",
    orientation = "h",
    marker = dict(color = primary_grey),
    text = np.round(student_langs*100),
    texttemplate = [texttemplate_white]*7 +[texttemplate_black]*2 + [texttemplate_white]*2 +[texttemplate_black] + [texttemplate_white],
    textposition = ["inside"]*7 +["outside"]*2 + ["inside"]*2 +["outside"] + ["inside"],    
)

layout = dict(
    title = "<span style='font-size:26px'>Languages used</span><br><span style='color:#999; font-size: 16px; font-weight:200'>students vs data professionals</span><br>",
    margin = dict(t=150),
    legend=dict(#title = "<span style='font-size:16px'>  Legend</span>",
                orientation="h",
                yanchor='top',xanchor='center',
                y= 1.06,x=0.5,
                font=dict(size= 16),
                traceorder='reversed',
#                 bordercolor=primary_grey,
#                 borderwidth=1, 
#                 bgcolor = "#f4f0ea"
               ),
    yaxis={'categoryorder':'array',
           'categoryarray': prof_langs.sort_values(ascending=True).keys()},
    xaxis=dict(side="top"),
    barmode = "group",
    bargap = 0.05,
    bargroupgap =0.1,
    width = 800,
    height= 1000,
    plot_bgcolor = "#f4f0ea" # "#f6f2e8"
)

data = [trace1, trace2]

fig = go.Figure(data = data, layout = layout)

main_annot_format = "<span style='font-size:12px; font-family:Tahoma;'><b> %s </b><br> %s</span>"

fig["data"][0]["text"][0] = 83.98
fig["data"][1]["text"][0] = 83.64

iplot(fig)

1. The most recommended language to learn is Python. Data science job listings also ask for SQL.
1. If you lean more towards statistics and data analysis, I recommend learning R.
1. Most people learn Java, C, or C++ as students, then move to languages more relevant to their fields.

In [None]:
# Countries we will compare in the Arabic and African analysis
df_Afro_Arab=df[df.Q3.isin(["Tunisia","Morocco","Egypt","United Arab Emirates","Saudi Arabia","Ghana","South Africa","Kenya","Nigeria"])]

Now we head over to our first question: which country made the best advancements in the data science field?

In [None]:
df_Afro_Arab_Full = df_Afro_Arab.copy()
df_Afro_Arab= df_Afro_Arab.iloc[:,0:7]
df_Afro_Arab["Count"] = 1

In [None]:
# Count respondents per country, then express each count as a percentage of the selected group.
# Only the Count column is summed; the remaining columns are not numeric.
df_Afro_Arab_Count=df_Afro_Arab.groupby("Q3")[["Count"]].sum()
df_Afro_Arab_Count = df_Afro_Arab_Count.sort_values("Count")
me =df_Afro_Arab_Count.Count.sum()
df_Afro_Arab_Count['percentage']=(df_Afro_Arab_Count.Count/me)*100
fig=py.express.bar(df_Afro_Arab_Count,x='percentage',title='Kagglers in Arabic and African Countries',color='percentage',labels={
    "Q3":"Country",
    "percentage":"percentage"
})
fig.show()

Nigeria dominates with 35% of respondents, and Egypt comes second with 13%.
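
Shares like these can be double-checked with `value_counts(normalize=True)`. The following is a minimal sketch on a made-up `Q3` column; the rows and the name `country_share` are illustrative assumptions, not survey data.

```python
import pandas as pd

# Made-up country answers for illustration; the notebook reads them from column Q3.
toy = pd.DataFrame({"Q3": ["Nigeria"] * 4 + ["Egypt"] * 3 + ["Kenya"] * 3})

# normalize=True returns fractions of all rows, so multiply by 100 for percentages.
country_share = toy["Q3"].value_counts(normalize=True) * 100
print(country_share)
```

The shares are relative to the selected countries only, as in the chart above, not to the whole survey.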

In [None]:
df_Afro_Arab_Gender = df_Afro_Arab.groupby(["Q3" , "Q2"], as_index=False).sum()
for i in df_Afro_Arab_Gender.Q3.unique():
    pi=df_Afro_Arab_Gender[df_Afro_Arab_Gender.Q3==i]
    me=pi.Count.sum()
    pi.Count=(pi.Count/me)*100
    
    df_Afro_Arab_Gender.drop(pi.index,axis=0,inplace=True)
    df_Afro_Arab_Gender=pd.concat([df_Afro_Arab_Gender,pi])
df_Afro_Arab_Gender_Women=df_Afro_Arab_Gender[df_Afro_Arab_Gender.Q2=="Woman"].sort_values(by="Count")
fig=py.express.bar(df_Afro_Arab_Gender_Women,x='Count',y='Q3',title='Woman representation in Arabic and African Countries',color='Count',labels={
    "Q3":"Country",
    "Count":"Percentage"
})
fig.show()

Women's representation is lower in the non-Arabic countries than in their Arabic counterparts. Tunisia leads with 37%, while the highest non-Arabic African country is South Africa, with 20% female representation.
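
The cell above builds the per-country percentages with a loop. The same within-group shares can be computed with `groupby(...).transform("sum")`. The following is a minimal sketch on made-up rows; the data and the names `counts` and `women` are assumptions for illustration only.

```python
import pandas as pd

# Made-up respondents for illustration only (not survey data).
toy = pd.DataFrame({
    "Q3": ["Tunisia"] * 4 + ["Kenya"] * 5,
    "Q2": ["Woman", "Man", "Man", "Woman", "Man", "Man", "Man", "Woman", "Man"],
})

# Number of respondents per (country, gender) pair.
counts = toy.groupby(["Q3", "Q2"]).size().reset_index(name="Count")

# transform("sum") broadcasts each country's total back onto its rows,
# so the division yields the share of each gender within its own country.
counts["Percentage"] = counts["Count"] / counts.groupby("Q3")["Count"].transform("sum") * 100

women = counts[counts["Q2"] == "Woman"].set_index("Q3")["Percentage"]
print(women)
```

Because `transform` keeps the original row alignment, no rows need to be dropped and re-appended.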

In [None]:
df_Afro_Arab_Age=df_Afro_Arab.groupby(["Q3","Q1"],as_index=False).sum()
for i in df_Afro_Arab_Age.Q3.unique():
    pi=df_Afro_Arab_Age[df_Afro_Arab_Age.Q3==i]
    me=pi.Count.sum()
    pi.Count=(pi.Count/me)*100
    
    df_Afro_Arab_Age.drop(pi.index,axis=0,inplace=True)
    df_Afro_Arab_Age=pd.concat([df_Afro_Arab_Age,pi])
Highlight={"Tunisia":"#D55E00","Egypt":"#CC79A7","United Arab Emirates":"#0072B2","Morocco":"#F0E442","Saudi Arabia":"#009E73"}
fig=go.Figure()
for i in df_Afro_Arab_Age.Q3.unique():
    color=Highlight.get(i,'grey')
    data=df_Afro_Arab_Age[df_Afro_Arab_Age.Q3==i]
    name=data.Q3.iloc[-1]
    fig.add_trace(go.Scatter(x=data.Q1,y=data.Count,name=name))
fig.update_layout(plot_bgcolor='white')
fig.update_xaxes(showgrid=True, gridwidth=0.2, gridcolor='#EEE1FA')
fig.update_yaxes(showgrid=True, gridwidth=0.2, gridcolor='#EEE1FA')
fig.update_layout(
    title='Age in Arabic and African Countries')
fig.show()

For the 18-21 group, Egypt takes it home with 25%, while Tunisia stands out again with 44% in the 22-24 group.

In [None]:
exp = ['I have never written code', '< 1 years' , '1-2 years', '3-5 years' ,'5-10 years','10-20 years', '20+ years']
cat_dtype = pd.api.types.CategoricalDtype(categories=exp, ordered=True)
df_Afro_Arab["Q6"]  = df_Afro_Arab["Q6"].astype(cat_dtype)
df_Afro_Arab_Coding = df_Afro_Arab.groupby(["Q3" , "Q6"], as_index=False).sum()
for i in df_Afro_Arab_Coding.Q3.unique():
    pi = df_Afro_Arab_Coding[df_Afro_Arab_Coding.Q3 == i]
    me = pi.Count.sum()
    pi.Count = (pi.Count/me)*100
    
    df_Afro_Arab_Coding.drop(pi.index, axis=0, inplace=True)
    df_Afro_Arab_Coding = pd.concat([df_Afro_Arab_Coding,pi])
fig=go.Figure()
for i in df_Afro_Arab_Coding.Q3.unique():
    color=Highlight.get(i,'grey')
    data=df_Afro_Arab_Coding[df_Afro_Arab_Coding.Q3==i]
    name=data.Q3.iloc[-1]
    fig.add_trace(go.Scatter(x=data.Q6,y=data.Count,name=name))
fig.update_layout(plot_bgcolor='white')
fig.update_xaxes(showgrid=True, gridwidth=0.2, gridcolor='#EEE1FA')
fig.update_yaxes(showgrid=True, gridwidth=0.2, gridcolor='#EEE1FA')
fig.update_layout(
    title='Coding experience in Arabic and African Countries')
fig.show()

Saudi Arabia has the highest percentage of users who have never written code. In the 3-5 years group Morocco has 43%, Ghana has 37% in the 1-2 years group, and the UAE tops the 5-10 years group with 22%.
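
The experience labels only sort sensibly because the cell above casts `Q6` to an ordered categorical. The following small sketch shows the difference, using the same category list on a made-up Series; the names `exp_dtype`, `alphabetical`, and `by_experience` are illustrative.

```python
import pandas as pd

order = ['I have never written code', '< 1 years', '1-2 years', '3-5 years',
         '5-10 years', '10-20 years', '20+ years']
exp_dtype = pd.api.types.CategoricalDtype(categories=order, ordered=True)

# Made-up answers for illustration only.
toy = pd.Series(['5-10 years', '< 1 years', '20+ years', '1-2 years'])

# Plain strings sort character by character, which scrambles the experience order.
alphabetical = toy.sort_values().tolist()

# The ordered categorical sorts by position in `order` instead.
by_experience = toy.astype(exp_dtype).sort_values().tolist()

print(alphabetical)
print(by_experience)
```

Without the cast, a line chart over `Q6` would connect the groups in alphabetical rather than chronological order.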

In [None]:
df_Afro_Arab1 = df_Afro_Arab.copy()
df_Afro_Arab1 = df_Afro_Arab.dropna()

df_Afro_Arab = df_Afro_Arab[(df_Afro_Arab["Q5"]!= "Other") & (df_Afro_Arab["Q5"]!= "Currently not employed")].copy()
# Merge closely related job titles; a single .replace avoids chained assignment
df_Afro_Arab["Q5"] = df_Afro_Arab["Q5"].replace({
    "Product/Project Manager": "Product/Project Manager or BA",
    "Business Analyst": "Product/Project Manager or BA",
    "Research Scientist": "Statistician or Research Scientist",
    "Statistician": "Statistician or Research Scientist",
    "DBA/Database Engineer": "Data Engineer or DBA",
    "Data Engineer": "Data Engineer or DBA",
})

# Sum only the Count column; the other columns are not numeric
df_Afro_Arab_JobTitle = df_Afro_Arab.groupby(["Q3" , "Q5"], as_index=False)["Count"].sum()

figure=go.Figure()

for country in df_Afro_Arab_JobTitle.Q3.unique():
    color=Highlight.get(country)
    plot_data=df_Afro_Arab_JobTitle[df_Afro_Arab_JobTitle.Q3==country]
    axis=plot_data["Q5"].tolist()
    axis.append(axis[0])
    plot_data=plot_data.Count.tolist()
    plot_data = (np.array(plot_data) / sum(plot_data) * 100).tolist()
    plot_data.append(plot_data[0])
    figure.add_trace(go.Scatterpolar(r=plot_data,theta=axis,showlegend=True,mode='lines',name=country,line_shape='spline',line_smoothing=0.6))
figure.update_layout(polar_bgcolor='white',  polar_radialaxis_visible=True,  polar_radialaxis_showticklabels=True,
    polar_radialaxis_tickfont_color='darkgrey',  polar_angularaxis_color='grey',
    polar_angularaxis_showline=False, polar_radialaxis_showline=False, 
    polar_radialaxis_layer='below traces',polar_radialaxis_gridcolor='#F2F2F2',
    polar_radialaxis_range=(0,60), polar_radialaxis_tickvals=[20, 40], 
    polar_radialaxis_ticktext=['20%', '40%'],polar_radialaxis_tickmode='array',title='Expertise in Arabic and African Countries',width=800,
    height=800
)

figure.show()


We seem to have a low number of Data Engineers in African and Arabic countries; the highest percentage in this category goes to Morocco with 9%. Egypt sits at 22%, far off from any other country in its category. In Data Science, South Africa leads with 23%, with Kenya a close second at 20%. A final point: Ghana has the highest percentage in any category, with 55% in the student category. This could mean one of two things: either Ghana is lagging behind on the professional side, or this is the rise of future Ghanaian data scientists and machine learning engineers.

In [None]:
df_Afro_Arab2=df_Afro_Arab.copy()
df_Afro_Arab2=df_Afro_Arab2.dropna()
df_Afro_Arab2 = df_Afro_Arab2[(df_Afro_Arab2["Q4"]!= 'I prefer not to answer')]
# Group the less common education levels under "Other"
df_Afro_Arab2.loc[df_Afro_Arab2["Q4"].isin(['Professional degree',
                                            'Some college/university study without earning a bachelor’s degree',
                                            'No formal education past high school']), "Q4"] = "Other"
df_Afro_Arab_Education = df_Afro_Arab2.groupby(["Q3" , "Q4"], as_index=False)["Count"].sum()
exp = [ 'Other' , 'Bachelor’s degree', 'Master’s degree', 'Doctoral degree']
cat_dtype = pd.api.types.CategoricalDtype(categories=exp, ordered=True)
# Apply the ordered dtype so the sort follows education level, not the alphabet
df_Afro_Arab_Education["Q4"] = df_Afro_Arab_Education["Q4"].astype(cat_dtype)
df_Afro_Arab_Education=df_Afro_Arab_Education.sort_values("Q4")
figure=go.Figure()

for country in df_Afro_Arab_Education.Q3.unique():
    color = Highlight.get(country)
    plot_data=df_Afro_Arab_Education[df_Afro_Arab_Education.Q3==country]
    axis = plot_data["Q4"].tolist()
    axis.append(axis[0])
    plot_data = plot_data.Count.tolist()
    plot_data = (np.array(plot_data) / sum(plot_data) * 100).tolist()
    plot_data.append(plot_data[0]) 
    figure.add_trace(go.Scatterpolar(r=plot_data,theta=axis,showlegend=True,mode='lines',name=country,line_shape='spline',line_smoothing=0.6))
figure.update_layout(polar_bgcolor='white',  polar_radialaxis_visible=True,  polar_radialaxis_showticklabels=True,
    polar_radialaxis_tickfont_color='darkgrey',  polar_angularaxis_color='grey',
    polar_angularaxis_showline=False, polar_radialaxis_showline=False, 
    polar_radialaxis_layer='below traces',polar_radialaxis_gridcolor='#F2F2F2',
    polar_radialaxis_range=(0,80), polar_radialaxis_tickvals=[20, 40], 
    polar_radialaxis_ticktext=['20%', '40%'],polar_radialaxis_tickmode='array',title='Formal Education Level in Arabic and African Countries'
)

figure.show()

Doctoral degrees are rare in general, with Morocco leading at 25%. For Master's degrees, which as discussed earlier are the way to go to get a job, the UAE leads with 59%, followed by Morocco with 52%. For Bachelor's degrees, Kenya leads with 70% of degree holders. It is no surprise that Ghana has the lowest percentage of Doctoral degrees, with only 2%. Egypt has the lowest percentage of Master's degrees, with 11%.

In [None]:
Language =df_Afro_Arab_Full.iloc[:,7:18]
colname={}
for i in Language.columns:
    colname[i] = Language[i].dropna().unique()[0]
Language = Language.rename(columns = colname)
# Flag each selected language with 1 and each unselected one with 0
Language = Language.notna().astype(int)
Language = Language.join(df["Q3"] , lsuffix='_caller', rsuffix='_other')

Language_group = Language.groupby("Q3").sum()

res = Language_group.div(Language_group.sum(axis=1), axis=0)


fig = go.Figure(data=go.Heatmap(
        z=res.values*100,
        x=res.columns,
        y=res.index,
        colorscale='Viridis'))

fig.update_layout(
    title='Programming Languages in Arabic and African Countries')

fig.show()

Earlier we covered the three most important languages for my analysis: R, Python, and SQL. Python is the most dominant language in every country, with Nigeria taking the lead at 50%. SQL is at a similar level in all countries, ranging from 13% to 20%, with South Africa the highest. For R, Kenya has the highest percentage with 21%, followed by South Africa and Ghana with 11%, so there are not many R users in African and Arabic countries. Instead, countries like Egypt and Morocco have high percentages in C++ and Java.
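
The heatmap values come from multi-select columns, where each column holds either one language name or nothing. The following is a minimal sketch of that computation on made-up rows; the data and the names `flags`, `per_country`, and `share` are assumptions for illustration only.

```python
import pandas as pd

# Made-up multi-select answers: each Q7 column holds one language or is empty.
toy = pd.DataFrame({
    "Q3": ["Kenya", "Kenya", "Egypt", "Egypt"],
    "Q7_Part_1": ["Python", "Python", "Python", None],
    "Q7_Part_2": ["R", None, None, None],
    "Q7_Part_3": [None, "SQL", None, "SQL"],
})
lang_cols = [c for c in toy.columns if c.startswith("Q7")]

# Rename each column after the single answer it contains.
names = {c: toy[c].dropna().unique()[0] for c in lang_cols}

# 1 where the language was selected, 0 otherwise.
flags = toy[lang_cols].notna().astype(int).rename(columns=names)

# Selections per country, then each language's share of that country's selections.
per_country = flags.groupby(toy["Q3"]).sum()
share = per_country.div(per_country.sum(axis=1), axis=0) * 100
print(share)
```

Note that each row is normalized by the total number of language selections in that country, not by the number of respondents, so the percentages in a row always add up to 100.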

In [None]:
df_Comp=df[df.Q3.isin(["Japan","United States of America","Egypt","Russia","Saudi Arabia","Kenya","Morocco"])]

I picked Japan, the USA, and Russia as the advanced countries. Having all 12 countries in the same plot, however, would make for an unpleasant experience, so I picked Saudi Arabia to represent the non-African Arabic countries and Kenya to represent the African non-Arabic countries. Morocco, like Kenya, stood out in many categories. I also included Egypt because I want to see how my home country compares against the advanced countries.

In [None]:
df_Comp_Full = df_Comp.copy()
df_Comp= df_Comp.iloc[:,0:7]
df_Comp["Count"] = 1

In [None]:
df_Comp_Gender = df_Comp.groupby(["Q3" , "Q2"], as_index=False).sum()
for i in df_Comp_Gender.Q3.unique():
    pi=df_Comp_Gender[df_Comp_Gender.Q3==i]
    me=pi.Count.sum()
    pi.Count=(pi.Count/me)*100
    
    df_Comp_Gender.drop(pi.index,axis=0,inplace=True)
    df_Comp_Gender=pd.concat([df_Comp_Gender,pi])
df_Comp_Gender_Women=df_Comp_Gender[df_Comp_Gender.Q2=="Woman"].sort_values(by="Count")
fig=py.express.bar(df_Comp_Gender_Women,x='Count',y='Q3',title='Woman representation',color='Count',labels={
    "Q3":'Country',
    'Count':'percentage'
})
fig.show()

That comes as a surprise. I would have expected Russia and Japan to be higher on female representation. Even Kenya, which is far behind countries like Saudi Arabia, is still higher than both Russia and Japan. The biggest shock for me is that Japan did not even cross 10% in women's representation, with only 6%!

In [None]:
df_Comp_Age=df_Comp.groupby(["Q3","Q1"],as_index=False).sum()
for i in df_Comp_Age.Q3.unique():
    pi=df_Comp_Age[df_Comp_Age.Q3==i]
    me=pi.Count.sum()
    pi.Count=(pi.Count/me)*100
    
    df_Comp_Age.drop(pi.index,axis=0,inplace=True)
    df_Comp_Age=pd.concat([df_Comp_Age,pi])
Highlight={"Kenya":"#D55E00","Egypt":"#CC79A7","Nigeria":"#0072B2","Japan":"#F0E442","Saudi Arabia":"#009E73"}
fig=go.Figure()
for i in df_Comp_Age.Q3.unique():
    color=Highlight.get(i,'grey')
    data=df_Comp_Age[df_Comp_Age.Q3==i]
    name=data.Q3.iloc[-1]
    fig.add_trace(go.Scatter(x=data.Q1,y=data.Count,name=name))
fig.update_layout(plot_bgcolor='white')
fig.update_xaxes(showgrid=True, gridwidth=0.2, gridcolor='#EEE1FA')
fig.update_yaxes(showgrid=True, gridwidth=0.2, gridcolor='#EEE1FA')
fig.update_layout(
    title='Age comparison')
fig.show()

In the age distribution, the difference is not clear right away. Yet the rate of change across age groups is much lower in advanced countries than in Arabic and African countries. For example, Morocco has 30% in the 25-29 group but only 13% in the 30-34 group, while the United States has about the same percentage (17-18%) in both groups.
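
The "rate of change" between neighbouring age groups can be made explicit with a column-wise difference. The following is a small sketch using the percentages quoted in the paragraph above as I read them. The United States value of 17.5 is a stand-in for the quoted 17-18% range, and the names `age_share` and `change` are illustrative.

```python
import pandas as pd

# Percentages as quoted in the text; 17.5 stands in for the "17-18%" range.
age_share = pd.DataFrame(
    {"25-29": [30.0, 17.5], "30-34": [13.0, 17.5]},
    index=["Morocco", "United States of America"],
)

# diff(axis=1) subtracts each age group from the next one along the row.
change = age_share.diff(axis=1)["30-34"]
print(change)
```

A value near zero means the two age groups are about equally represented, while a large negative value marks a sharp drop after the 25-29 group.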

In [None]:
exp = ['I have never written code', '< 1 years' , '1-2 years', '3-5 years' ,'5-10 years','10-20 years', '20+ years']
cat_dtype = pd.api.types.CategoricalDtype(categories=exp, ordered=True)
df_Comp["Q6"]  = df_Comp["Q6"].astype(cat_dtype)
df_Comp_Coding = df_Comp.groupby(["Q3" , "Q6"], as_index=False).sum()
for i in df_Comp_Coding.Q3.unique():
    pi = df_Comp_Coding[df_Comp_Coding.Q3 == i]
    me = pi.Count.sum()
    pi.Count = (pi.Count/me)*100
    
    df_Comp_Coding.drop(pi.index, axis=0, inplace=True)
    df_Comp_Coding = pd.concat([df_Comp_Coding,pi])
fig=go.Figure()
for i in df_Comp_Coding.Q3.unique():
    color=Highlight.get(i,'grey')
    data=df_Comp_Coding[df_Comp_Coding.Q3==i]
    name=data.Q3.iloc[-1]
    fig.add_trace(go.Scatter(x=data.Q6,y=data.Count,name=name))
fig.update_layout(plot_bgcolor='white')
fig.update_xaxes(showgrid=True, gridwidth=0.2, gridcolor='#EEE1FA')
fig.update_yaxes(showgrid=True, gridwidth=0.2, gridcolor='#EEE1FA')
fig.update_layout(
    title='Coding experience')
fig.show()

Three interesting observations:

1. In the "I have never written code" group, the developed countries all share the same percentage (3%) with Morocco. Egypt comes second with 5%.

2. Saudi Arabia has the highest percentage of non-coders with 20%. One possible reason is that a high percentage of Saudi Arabian Kagglers are statisticians/researchers or product managers.

3. Neither Japan nor the USA drops much in the 10-20 years and 20+ years groups, unlike other countries. Even Russia drops to 6% in the 20+ group, and no other country comes close to Japan and the USA, with Morocco coming in at 3%.

In [None]:
df_Comp1 = df_Comp.copy()
df_Comp1 = df_Comp.dropna()

df_Comp = df_Comp[(df_Comp["Q5"]!= "Other") & (df_Comp["Q5"]!= "Currently not employed")].copy()
# Merge closely related job titles; a single .replace avoids chained assignment
df_Comp["Q5"] = df_Comp["Q5"].replace({
    "Product/Project Manager": "Product/Project Manager or BA",
    "Business Analyst": "Product/Project Manager or BA",
    "Research Scientist": "Statistician or Research Scientist",
    "Statistician": "Statistician or Research Scientist",
    "DBA/Database Engineer": "Data Engineer or DBA",
    "Data Engineer": "Data Engineer or DBA",
})

# Sum only the Count column; the other columns are not numeric
df_Comp_JobTitle = df_Comp.groupby(["Q3" , "Q5"], as_index=False)["Count"].sum()

figure=go.Figure()

for country in df_Comp_JobTitle.Q3.unique():
    color=Highlight.get(country)
    plot_data=df_Comp_JobTitle[df_Comp_JobTitle.Q3==country]
    axis=plot_data["Q5"].tolist()
    axis.append(axis[0])
    plot_data=plot_data.Count.tolist()
    plot_data = (np.array(plot_data) / sum(plot_data) * 100).tolist()
    plot_data.append(plot_data[0])
    figure.add_trace(go.Scatterpolar(r=plot_data,theta=axis,showlegend=True,mode='lines',name=country,line_shape='spline',line_smoothing=0.6))
figure.update_layout(polar_bgcolor='white',  polar_radialaxis_visible=True,  polar_radialaxis_showticklabels=True,
    polar_radialaxis_tickfont_color='darkgrey',  polar_angularaxis_color='grey',
    polar_angularaxis_showline=False, polar_radialaxis_showline=False, 
    polar_radialaxis_layer='below traces',polar_radialaxis_gridcolor='#F2F2F2',
    polar_radialaxis_range=(0,60), polar_radialaxis_tickvals=[20, 40], 
    polar_radialaxis_ticktext=['20%', '40%'],polar_radialaxis_tickmode='array',title='Expertise comparison',width=800,
    height=800
)

figure.show()

There is not much difference between the first graph and the comparison graph. Japan has the highest percentage of software engineers. The USA has the highest percentage in the data scientist category.
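
The radar charts rely on two small steps: turning a country's counts into percentages of its own total, and repeating the first point so the line closes into a polygon. A minimal sketch of both steps is below. The helper name `closed_percentages` and the toy numbers are hypothetical and are not part of the notebook.

```python
def closed_percentages(labels, counts):
    """Return (theta, r) for a closed radar trace, with r expressed as percentages."""
    total = sum(counts)
    r = [c / total * 100 for c in counts]
    # Repeat the first point so the line returns to where it started.
    return list(labels) + [labels[0]], r + [r[0]]


# Toy job-title counts for one country, NOT survey results.
theta, r = closed_percentages(["Data Scientist", "Software Engineer", "Student"], [30, 50, 20])
print(theta)
print(r)
```

The returned lists can be passed as `theta` and `r` to a polar line trace. Without the repeated point, the line stops at the last category and leaves a gap.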

In [None]:
df_Comp2 = df_Comp.copy()
df_Comp2 = df_Comp2.dropna()
df_Comp2 = df_Comp2[df_Comp2["Q4"] != 'I prefer not to answer'].copy()
# Merge the less common education levels into a single "Other" bucket.
df_Comp2.loc[df_Comp2["Q4"].isin(['Professional degree',
                                  'Some college/university study without earning a bachelor’s degree',
                                  'No formal education past high school']), "Q4"] = "Other"
df_Comp_Education = df_Comp2.groupby(["Q3" , "Q4"], as_index=False).sum()
# Sort by education level rather than alphabetically.
exp = [ 'Other' , 'Bachelor’s degree', 'Master’s degree', 'Doctoral degree']
cat_dtype = pd.api.types.CategoricalDtype(categories=exp, ordered=True)
df_Comp_Education["Q4"] = df_Comp_Education["Q4"].astype(cat_dtype)
df_Comp_Education=df_Comp_Education.sort_values("Q4")
figure=go.Figure()

for country in df_Comp_Education.Q3.unique():
    color = Highlight.get(country)
    plot_data=df_Comp_Education[df_Comp_Education.Q3==country]
    # Repeat the first education level at the end so the radar line closes.
    axis = plot_data["Q4"].tolist()
    axis.append(axis[0])
    # Convert the country's counts to percentages of its own total.
    plot_data = plot_data.Count.tolist()
    plot_data = (np.array(plot_data) / sum(plot_data) * 100).tolist()
    plot_data.append(plot_data[0])
    figure.add_trace(go.Scatterpolar(r=plot_data,theta=axis,showlegend=True,mode='lines',name=country,line_shape='spline',line_smoothing=0.6))
figure.update_layout(polar_bgcolor='white',  polar_radialaxis_visible=True,  polar_radialaxis_showticklabels=True,
    polar_radialaxis_tickfont_color='darkgrey',  polar_angularaxis_color='grey',
    polar_angularaxis_showline=False, polar_radialaxis_showline=False, 
    polar_radialaxis_layer='below traces',polar_radialaxis_gridcolor='#F2F2F2',
    polar_radialaxis_range=(0,80), polar_radialaxis_tickvals=[20, 40], 
    polar_radialaxis_ticktext=['20%', '40%'],polar_radialaxis_tickmode='array',title='Formal Education Level'
)

figure.show()

Japan, the United States of America, and Russia have the highest percentages of Master's degree holders, joined once again by Morocco.
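
Reading a chart like this is easier when the degree levels appear in a meaningful order rather than alphabetically. One way to get that order in pandas is an ordered categorical dtype. The sketch below uses toy counts, not survey results, and the level list is an assumption that mirrors the labels used in this notebook.

```python
import pandas as pd

# Desired order of the education levels, lowest to highest.
levels = ["Other", "Bachelor’s degree", "Master’s degree", "Doctoral degree"]
cat_dtype = pd.api.types.CategoricalDtype(categories=levels, ordered=True)

# Toy counts in arbitrary order.
toy = pd.DataFrame({
    "Q4": ["Master’s degree", "Other", "Doctoral degree", "Bachelor’s degree"],
    "Count": [40, 10, 5, 45],
})

# Casting to the ordered dtype makes sort_values follow `levels`, not the alphabet.
toy["Q4"] = toy["Q4"].astype(cat_dtype)
ordered = toy.sort_values("Q4")["Q4"].astype(str).tolist()
print(ordered)
```

A plain string sort would put "Bachelor’s degree" first and "Other" last. Any value missing from `levels` becomes NaN after the cast, so the list has to cover every label in the data.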

In [None]:
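# Positions 7 to 17 hold the programming-language answers, one column per language.
Language = df_Comp_Full.iloc[:, 7:18]
# Rename each column to the language it records (its only non-null value).
colname = {}
for i in Language.columns:
    colname[i] = Language[i].dropna().unique()[0]
Language = Language.rename(columns=colname)
# Flag selected languages with 1 and unselected ones with 0.
Language = Language.notna().astype(int)
Language = Language.join(df["Q3"], lsuffix='_caller', rsuffix='_other')

# Count selections per country, then turn each row into shares of that country's selections.
Language_group = Language.groupby("Q3").sum()

res = Language_group.div(Language_group.sum(axis=1), axis=0)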


fig = go.Figure(data=go.Heatmap(
        z=res.values*100,
        x=res.columns,
        y=res.index,
        colorscale='Hot'))

fig.update_layout(
    title='Programming Languages')

fig.show()

As expected, SQL has a high percentage in both the USA and Russia, at 20% each. They are followed by Saudi Arabia with 19%.
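
The heatmap percentages are shares of all language selections made in a country, not shares of respondents, because each row of counts is divided by its own sum. A minimal sketch of that computation on toy data follows. The column names `Q7_Part_1` and `Q7_Part_2` are assumed for illustration, and the answers are made up.

```python
import numpy as np
import pandas as pd

# Toy multi-select answers, NOT survey results: one column per language, NaN if not selected.
toy = pd.DataFrame({
    "Q3": ["A", "A", "B", "B"],
    "Q7_Part_1": ["Python", "Python", "Python", np.nan],
    "Q7_Part_2": [np.nan, "SQL", "SQL", "SQL"],
})

parts = toy[["Q7_Part_1", "Q7_Part_2"]]
# Name each column after the single answer it can contain.
parts = parts.rename(columns={c: parts[c].dropna().unique()[0] for c in parts.columns})

# 1 where the language was selected, 0 otherwise.
picked = parts.notna().astype(int)

# Selections per country, then each row divided by its own total.
counts = picked.groupby(toy["Q3"]).sum()
share = counts.div(counts.sum(axis=1), axis=0) * 100
print(share.round(1))
```

Under this definition a respondent who selects many languages contributes more to the denominator, so a 20% share for SQL means 20% of the selections, not 20% of the people.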

# Conclusion:

## Question 1 (How do these countries compare against each other? Which country had the best advancements?):

Morocco wins this one, as it stands out in many categories, from age to coding experience to degree holders. Kenya is close behind Morocco in many categories and has the highest percentage of R users, but its lack of Master’s degree holders and its high percentage of people who do not write code move it to second place. Special mention goes to Egypt, Tunisia, and Ghana. Egypt has the highest percentage of machine learning engineers, and only 5% of its Kagglers have never written code. Tunisia has the highest female representation in the whole group. Ghana is only starting to make its way into the field, but its future seems bright.

## Question 2 (How do they compare against other ML and data science giants?):

The main differences between advanced countries and African and Arabic countries are the number of professionals, coding experience, and Master’s degree students/holders. As we saw, in most of these countries the number of professionals is low compared to the number of students (for example, in data engineering). Most students are at the bachelor's level, with a share that reaches 78% in Kenya, for example. For coding experience, the problem is the huge drop from 3-5 years to 5-10 years, followed by a further drop at 10-20 years. This confirms the lack of more experienced programmers from these countries, at least on Kaggle.

## Final Conclusion:

The growth in the field in African and Arabic countries was remarkable, but they still have a long road ahead of them to reach the same level as the industry leaders. I hope 2021 is the year these countries shine, but for now we will wait until next year's competition.


## Huge thanks to these notebooks:
- Enthusiast to Data Professional - What changes? https://www.kaggle.com/spitfire2nd/enthusiast-to-data-professional-what-changes
- Kagglers of Middle East (Is Oil forgotten?!!) https://www.kaggle.com/sinatavakolibanizi/kagglers-of-middle-east-is-oil-forgotten