## **How popular is Kaggle in Africa?**
![](https://tse1.mm.bing.net/th?id=OIP.oiEMOXFTdUOCzjSN9xEpmgHaDp&pid=Api&P=0&w=315&h=156)

[Image source :: https://theconversation.com](https://theconversation.com/how-data-science-in-and-for-africa-can-blaze-new-trails-120920)


## **Introduction**

Kaggle is arguably the most popular online platform for data scientists and machine learning engineers, with over 8 million registered users to date. The platform serves people from around 194 countries. Kaggle offers its users four main services: machine learning competitions, Notebooks (kernels), public datasets and Kaggle Learn [1].

Since 2017, Kaggle has hosted an industry-wide survey in an attempt to get a comprehensive view of the state of the data science and machine learning community. Kagglers from African countries have been participating since then. In this notebook I explore the participation of African Kagglers using the 2021 Kaggle survey dataset. I also use the datasets from previous years for a historical overview.

#### **AI/Data Science in Africa**

The excerpts below give a glimpse of the state of AI/DS on the continent, the range of problems and challenges AI/DS can help solve there, and the role Africans can play in the industry. Let's look at them briefly before proceeding to the data exploration.

<blockquote style="margin-right:auto; margin-left:auto; background-color: #f8f8ff; padding: 1em; margin:24px;">
    
"Over the past decade, virtually every university in Europe and North America has responded to the challenges and opportunities of data science by establishing new institutes, departments and degree programs in the field. Academic institutions in Africa have only recently begun to catch up. Some are creating structures, networks and training programs to stimulate research and capacity development in the subject. Examples include the African Center of Excellence in Data Science in Rwanda, the AI & Data Science Research Group at Makerere University in Uganda, Data Science Africa, and the Deep Learning Indaba. But with a time lag of at least half a decade, the question is whether Africa is bound to be trailing behind."   
</blockquote>

[Data science institutions focused on Africa are being built across the continent](https://qz.com/africa/1683977/why-data-science-is-crucial-to-africa-development/)

<blockquote style="margin-right:auto; margin-left:auto; background-color: #f8f8ff; padding: 1em; margin:24px;">
    
"The rapidly developing set of artificial intelligence (AI) technologies has the potential to solve some of the most pressing challenges that impact Sub-Saharan Africa and drive growth and development in core sectors such as agriculture, healthcare, public services and more. Forward thinking policy-makers, innovative startups, and others are already mobilizing to promote the growth of a vibrant AI ecosystem in Africa."    

</blockquote>    
    
[AI In Africa](https://info.microsoft.com/ME-DIGTRNS-WBNR-FY19-11Nov-02-AIinAfrica-MGC0003244_01Registration-ForminBody.htm)


<blockquote style="margin-right:auto; margin-left:auto; background-color: #f8f8ff; padding: 1em; margin:24px;">
    
"Research and development of AI and machine learning technologies is growing in African countries. Programmes such as Data Science Africa, Data Science Nigeria, and the Deep Learning Indaba with its satellite IndabaX events, which have so far been held in 27 different African countries, illustrate the interest and human investment in the fields. The potential of AI and related technologies to promote opportunities for growth, development, and democratization in Africa is a key driver of this research. Yet very few African voices have so far been involved in the international ethical frameworks that aim to guide the research. This might not be a problem if the principles and values in those frameworks have universal application. But it’s not clear that they do."
    
</blockquote>

[Why AI needs input from Africans](https://tribunecontentagency.com/article/why-ai-needs-input-from-africans/)


<blockquote style="margin-right:auto; margin-left:auto; background-color: #f8f8ff; padding: 1em; margin:24px;">
    
"In Africa, AI can help with some of the region’s most pervasive problems: from reducing poverty and improving education, to delivering healthcare and eradicating diseases, addressing sustainability challenges — and from meeting the growing demand for food from fast-growing population to advancing inclusion in societies. AI democratises access to innovative and productivity-boosting technology to fuel the growth the continent needs."

</blockquote>

[Artificial Intelligence for Africa: An Opportunity for Growth, Development, and Democratisation](https://www.up.ac.za/media/shared/7/ZP_Files/ai-for-africa.zp165664.pdf)




Now let's see what this year's Kaggle DS survey tells us about the African respondents. We will look at, among other things, their demographic composition, education, jobs/sectors and experience, and the technologies and DS/AI tools they use to tackle the problems facing the continent that data analytics and AI can help solve. The **analysis** is by and large **descriptive** and focuses on comparing African Kagglers (*Africa*) with the rest of the world (*referred to as World in the chart legends*).



## **Table of Contents**

1. [How does Africa compare with the rest of the world?](#1)
2. [Which African countries are Kaggle-aware?](#2)
3. [Demography : Age and Gender](#3)
4. [Education, Jobs and Experience](#4)
5. [Programming Languages & IDEs](#5)
6. [Machine Learning: Experience, Framework and Algorithms](#6)
7. [Computing Resources](#7)
8. [Employment and role at work](#8)
9. [Learning Platform and Media](#9)
10. [Summary](#10)
11. [Reference](#11)




In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

import plotly.io as pio
import plotly.express as px
import plotly.graph_objects as go
import plotly.figure_factory as ff
from plotly.subplots import make_subplots
from plotly.offline import init_notebook_mode, iplot
init_notebook_mode(connected=True)
pio.templates.default = "none"
# import plotly.offline as py
# py.offline.init_notebook_mode()

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

import warnings
warnings.filterwarnings("ignore")

In [None]:
df17 = pd.read_csv("/kaggle/input/kaggle-survey-2017/multipleChoiceResponses.csv", encoding="ISO-8859-1")
df18 = pd.read_csv("/kaggle/input/kaggle-survey-2018/multipleChoiceResponses.csv")
df19 = pd.read_csv("/kaggle/input/kaggle-survey-2019/multiple_choice_responses.csv")
df20 = pd.read_csv("/kaggle/input/kaggle-survey-2020/kaggle_survey_2020_responses.csv")
df21 = pd.read_csv("/kaggle/input/kaggle-survey-2021/kaggle_survey_2021_responses.csv")

In [None]:
### Helper functions

paper_color = None# '#F5F5F5'
plot_color = None# '#F5F5F5'


## horizontal bar graphs 
def plotly_hBar(df, q, title, height=400,l=250,r=50,b=50,t=100,):   
    fig = px.histogram(df.iloc[1:], 
                       y=q,
                       orientation='h',
                       width=700,
                       height=height,
                       histnorm='percent',
                       color='region',
                       color_discrete_map={
                           "Africa": "gold", "World": "salmon"
                       },
                       opacity=0.6
                       )

    fig.update_layout(title=title, 
                      font_family="San Serif",
                      bargap=0.2,
                      barmode='group',
                      titlefont={'size': 24},
                      template= 'simple_white',
                      paper_bgcolor=paper_color,
                      plot_bgcolor=plot_color,                       
                      legend=dict(
                      orientation="v", 
                          y=1, 
                          yanchor="top", 
                          x=1.250, 
                          xanchor="right",)                 
                      ).update_yaxes(categoryorder='total ascending')
    fig.update_traces(marker_line_color='black',
                  marker_line_width=1.5)
    fig.update_layout(yaxis_title=None,yaxis_linewidth=2.5,
    autosize=False,
    margin=dict(
        l=l,
        r=r,
        b=b,
        t=t,
    ),
    )
    fig.update_xaxes(showgrid=False, linecolor='white')
    fig.update_yaxes(showgrid=False)
    fig.show()   
    

    

###################
## vertical bar graphs##########################
def plotly_vBar(df, q, title, l=50,r=50,b=50,t=100):   
    fig = px.histogram(df.iloc[1:],
                       x=q,
                       orientation='v',
                       width=700,
                       height=450,
                       histnorm='percent',
                       color='region',
                       color_discrete_map={
                           "Africa": "gold", "World": "salmon"
                       },
                       opacity=0.6
                       )

    fig.update_layout(title=title, 
                      font_family="San Serif",
                      bargap=0.2,
                      barmode='group',
                      titlefont={'size': 24},
                      paper_bgcolor=paper_color,
                      plot_bgcolor=plot_color,
                      legend=dict(
                      orientation="v", 
                          y=1, 
                          yanchor="top", 
                          x=1.250, 
                          xanchor="right",)                 
                      ).update_xaxes(categoryorder='total descending')
    fig.update_traces(marker_line_color='black',
                  marker_line_width=1.5)
    fig.update_layout(xaxis_title=None,xaxis_linewidth=2.5,
    autosize=False,
    margin=dict(
        l=l,
        r=r,
        b=b,
        t=t,
    ),
    )
    fig.update_xaxes(showgrid=False)
    fig.update_yaxes(showgrid=False)
    fig.show()   

## head_count function copied from https://www.kaggle.com/ruchi798/kaggle-ml-ds-survey-analysis
def head_count(df, question_num, parts):
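    # Column names of the multi-part question: Q<n>_Part_1 ... Q<n>_Part_<parts-1>, plus Q<n>_OTHER
    questions = ['Q'+ str(question_num) +'_Part_'+ str(j) for j in range(1, parts)]
    questions.append('Q'+ str(question_num) + '_OTHER')
    
    categories = []
    values = []
    for i in questions:
        # Each part column is expected to hold one answer label (or NaN), so the
        # top value_counts entry gives the label and how many times it was picked.
        counts = df[i].value_counts()
        category = counts.index[0]
        val = counts.iloc[0]  # positional access; counts[0] is deprecated on a label-indexed Series
        
        categories.append(category)
        values.append(val)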
       
    combined_df = pd.DataFrame()
    combined_df['Category'] = categories
    combined_df['Value'] = values
    
    #combined_df = combined_df.sort_values(['Value'],ascending=False)
    
    return combined_df

###########################################
## make a dataframe with percentages included 
def df_with_percentages(df, q, n, region):
    
    dff = head_count(df, q, n)
    perc = []

    for item  in dff.Value:          
        
        perc.append(np.round(item/len(df)*100, 2))

    data = pd.DataFrame()
    data['Percentage'] = perc
    df_perc = pd.concat([dff, data], axis=1)
    df_perc['Region'] = region
    
    return df_perc
    

###############################################################
## bargraphs for multipart questions 
def plot_barH_percent(df1, df2, title, l=150, r=50, b=50, t=100):
    fig = go.Figure()
    fig.add_trace(go.Bar(y=df1['Category'],
                         x=df1['Percentage'],
                         orientation= 'h',
                         name="World",
                         marker_color='salmon',
                         opacity =0.6
                        ),
                 )
    fig.add_trace(go.Bar(y=df2['Category'],
                         x=df2['Percentage'],
                         orientation= 'h',
                         name="Africa",
                         marker_color='gold',
                         opacity =0.6
                        ),
                 )
    fig.update_traces(marker_line_color='black',
                  marker_line_width=1.5)
    fig.update_layout(title=title, 
                  font_family="San Serif",
                  bargap=0.2,
                  barmode='group',
                  titlefont={'size': 24},
                  paper_bgcolor=paper_color,
                  plot_bgcolor=plot_color,
                  legend=dict(
                  orientation="v", 
                      y=1, 
                      yanchor="top", 
                      x=1.250, 
                      xanchor="right",)                 
                  ).update_yaxes(categoryorder='total ascending')

    fig.update_layout(xaxis_title='percentage',yaxis_linewidth=2.5,
    autosize=False,
    margin=dict(
        l=l, r=r, b=b, t=t,
    ),
    )
    fig.update_xaxes(showgrid=False)
    fig.update_yaxes(showgrid=False)
    fig.show()     
    
    
###################################################   
def annotated_heatmap(df_w, df_a, title, width=850):
    x =list(df_w['Category'])

    y = ['World', 'Africa']

    z= [list(df_w['Percentage']),list(df_a['Percentage'])]

    fig = ff.create_annotated_heatmap(z, x=x, y=y, annotation_text=z, colorscale='orrd', xgap=3, ygap=3)

    fig.update_layout(title_text=title,
                      title_x=0.5,
                      titlefont={'size': 24, 'family':'San Serif'},
                      width=width, height=350,
                      xaxis_showgrid=False,
                      xaxis={'side': 'bottom'},
                      yaxis_showgrid=False,
                      yaxis_autorange='reversed',                   
                      paper_bgcolor=paper_color,
                      )
    fig.show()
    
    
###################################################################    
def lollypop_plots(df1, df2, title, l=150, r=50, b=50, t=100):
    fig = go.Figure()
    fig.add_trace(go.Scatter(y=df1['Category'],
                         x=df1['Percentage'],
                         orientation= 'h',
                         name="World",
                         mode='markers',
                         marker_color='salmon',
                         marker=dict(
                             size=16,
                             ),
                        ),
                 )
    for i in range(0, len(df1)):
               fig.add_shape(type='line',
                              x0 = 0, y0 = i,
                              x1 = df1["Percentage"][i],
                              y1 = i,
                              line=dict(color='lightgray', width = 3))
    
    fig.add_trace(go.Scatter(y=df2['Category'],
                         x=df2['Percentage'],
                         orientation= 'h',
                         name="Africa",
                         mode='markers',
                         marker_color='gold',
                         marker=dict(
                             size=16,
                             ),
                        ),
                 )
    for i in range(0, len(df2)):
               fig.add_shape(type='line',
                              x0 = 0, y0 = i,
                              x1 = df2["Percentage"][i],
                              y1 = i,
                              line=dict(color='lightgray', width = 3))
             
    
    fig.update_traces(marker_line_color='black',
                  marker_line_width=1.5)
    fig.update_layout(title=title, 
                  font_family="San Serif",
                  bargap=0.2,
                  barmode='group',
                  titlefont={'size': 24},
                  paper_bgcolor=paper_color,
                  plot_bgcolor=plot_color,
                  legend=dict(
                  orientation="v", 
                      y=1, 
                      yanchor="top", 
                      x=1.250, 
                      xanchor="right",)                 
                  ).update_yaxes(categoryorder='total ascending')

    fig.update_layout(xaxis_title='percentage', yaxis_linewidth=0,
    autosize=False,
    margin=dict(
        l=l, r=r, b=b, t=t,
    ),
    )
    fig.update_xaxes(showgrid=False)
    fig.update_yaxes(showgrid=False)
    fig.show() 
    
####################################################################   
def dumbbel_plots(df1, df2, title, l=150, r=50, b=50, t=100):
    fig = go.Figure()
    fig.add_trace(go.Scatter(y=df1['Category'],
                         x=df1['Percentage'],
                         orientation= 'h',
                         name="World",
                         mode='markers',
                         marker_color='salmon',
                         marker=dict(
                             size=16,
                             ),
                        ),
                 )
    fig.add_trace(go.Scatter(y=df2['Category'],
                         x=df2['Percentage'],
                         orientation= 'h',
                         name="Africa",
                         mode='markers',
                         marker_color='gold',
                         marker=dict(
                             size=16,
                             ),
                        ),
                 )

    for i in range(0, len(df1)):
               fig.add_shape(type='line',
                              x0 = df1["Percentage"][i],
                              y0 = i,
                              x1 = df2["Percentage"][i],
                              y1 = i,
                              line=dict(color='lightgray', width = 3))         
    
    fig.update_traces(marker_line_color='black',
                  marker_line_width=1.5)
    fig.update_layout(title=title, 
                  font_family="San Serif",
                  bargap=0.2,
                  barmode='group',
                  titlefont={'size': 24},
                  paper_bgcolor=paper_color,
                  plot_bgcolor=plot_color,
                  legend=dict(
                  orientation="v", 
                      y=1, 
                      yanchor="top", 
                      x=1.250, 
                      xanchor="right",)                 
                  ).update_yaxes(categoryorder='total ascending')

    fig.update_layout(xaxis_title='percentage', yaxis_linewidth=0,
    autosize=False,
    margin=dict(
        l=l, r=r, b=b, t=t,
    ),
    )
    fig.update_xaxes(showgrid=False)
    fig.update_yaxes(showgrid=False)
    fig.show() 
    
    
##############################################################################
def annotated_heatmap_Trans(df_w, df_a, title, width=850, height=750, l=150):
    x =list(df_w['Category'])
    #x = (np.array(x)).T

    y = ['World', 'Africa']
    #y = (np.array(y)).T

    z= [list(df_w['Percentage']),list(df_a['Percentage'])]
    z = (np.array(z)).T

    fig = ff.create_annotated_heatmap(z, x=y, y=x, annotation_text=z, colorscale='orrd', xgap=3, ygap=3)

    fig.update_layout(title_text=title,
                      title_x=0.5,
                      titlefont={'size': 24, 'family':'San Serif'},
                      width=width, height=height,
                      xaxis_showgrid=False,
                      xaxis={'side': 'top'},
                      yaxis_showgrid=False,
                      yaxis_autorange='reversed',                   
                      paper_bgcolor=paper_color,
                      )
    fig.update_layout(xaxis_title=None,
    autosize=False,
    margin=dict(
        l=l,
        r=50,
        b=50,
        t=75,
    ),
    )
    fig.update_xaxes(showgrid=False)
    fig.update_yaxes(showgrid=False)
    fig.show()
    
    
###############################################    
def head_count_suf(df, question_num, part, n):
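    # Column names of the suffixed multi-part question, e.g. Q<n><part>_Part_1 ..., plus Q<n><part>_OTHER
    questions = ['Q'+ str(question_num)+str(part) +'_Part_'+ str(j) for j in range(1, n)]
    questions.append('Q'+ str(question_num) +str(part)+ '_OTHER')
    
    categories = []
    values = []
    for i in questions:
        # Each part column is expected to hold one answer label (or NaN), so the
        # top value_counts entry gives the label and how many times it was picked.
        counts = df[i].value_counts()
        category = counts.index[0]
        val = counts.iloc[0]  # positional access; counts[0] is deprecated on a label-indexed Series
        
        categories.append(category)
        values.append(val)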
       
    combined_df = pd.DataFrame()
    combined_df['Category'] = categories
    combined_df['Value'] = values
    
    #combined_df = combined_df.sort_values(['Value'],ascending=False)
    
    return combined_df

def df_with_percentages_suf(df, q, part, n, region):
    
    dff = head_count_suf(df, q, part, n)
    perc = []

    for item  in dff.Value:          
        
        perc.append(np.round(item/len(df)*100, 2))

    data = pd.DataFrame()
    data['Percentage'] = perc
    df_perc = pd.concat([dff, data], axis=1)
    df_perc['Region'] = region
    
    return df_perc

In [None]:
# grouping african countries
africa17 = ['Nigeria','Kenya', 'South Africa', 'Egypt']
africa18 = ['Nigeria','Kenya', 'South Africa', 'Egypt', 'Tunisia', 'Morocco'] 
africa19 = ['Nigeria','Kenya', 'South Africa', 'Egypt', 'Tunisia', 'Morocco', 'Algeria']
africa20 = ['Nigeria','Kenya', 'South Africa', 'Egypt', 'Tunisia', 'Morocco', 'Ghana']
africa21 = ['Nigeria','Kenya', 'South Africa', 'Egypt', 'Tunisia', 'Morocco', 'Algeria', 'Ghana', 'Uganda', 'Ethiopia']

africa = ['Nigeria', 'Egypt', 'South Africa', 'Algeria', 'Tunisia', 'Morocco', 'Kenya', 'Uganda', 'Ghana', 'Ethiopia']
df21_africa = df21[df21['Q3'].isin(africa)]
df21_world = df21[~df21['Q3'].isin(africa )]
df21['region']=["Africa" if x in africa else "World" for x in df21['Q3']]

df20_africa = df20[df20['Q3'].isin(africa)]
df20_world = df20[~df20['Q3'].isin(africa )]
df20['region']=["Africa" if x in africa else "World" for x in df20['Q3']]

df19_africa = df19[df19['Q3'].isin(africa)]
df19_world = df19[~df19['Q3'].isin(africa)]
df19['region']=["Africa" if x in africa else "World" for x in df19['Q3']]

df18_africa = df18[df18['Q3'].isin(africa)]
df18_world = df18[~df18['Q3'].isin(africa)]
df18['region']=["Africa" if x in africa else "World" for x in df18['Q3']]

df17_africa = df17[df17['Country'].isin(africa)]
df17_world = df17[~df17['Country'].isin(africa )]
df17['region']=["Africa" if x in africa else "World" for x in df17['Country']]       

## **1. How does Africa compare with the rest of the world?** <a class="anchor" id="1"></a>
* In the 2021 Kaggle survey, only **7.9%** of the respondents are from Africa.
* This is low relative to Africa's share of the world's population, which is about **17.5%** according to the latest figures [2].
* The good news is that this number will hopefully keep increasing. The number of African respondents in this year's (2021) survey is a whopping **633%** of the number in the inaugural 2017 survey, i.e. more than six times as many.
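
The growth figure is a simple ratio, and it helps to separate "2021 as a percentage of 2017" from "percentage increase since 2017". A minimal sketch, assuming the yearly African totals are the sums of the per-country counts hard-coded in the heatmap in section 2; the helper names `ratio_pct` and `increase_pct` are illustrative and not part of the notebook:

```python
# Yearly African respondent totals, summed from the per-country counts
# hard-coded in this notebook's heatmap (section 2).
africa_2017 = 73 + 59 + 127 + 66
africa_2021 = 702 + 248 + 146 + 482 + 109 + 140 + 44 + 99 + 47 + 43


def ratio_pct(old: int, new: int) -> float:
    """New count expressed as a percentage of the old count (hypothetical helper)."""
    return new / old * 100


def increase_pct(old: int, new: int) -> float:
    """Percentage increase from the old count to the new count (hypothetical helper)."""
    return (new - old) / old * 100


print(f"2017: {africa_2017}, 2021: {africa_2021}")
print(f"2021 as % of 2017: {ratio_pct(africa_2017, africa_2021):.1f}%")
print(f"Increase since 2017: {increase_pct(africa_2017, africa_2021):.1f}%")
```

The two numbers differ by exactly 100 percentage points, so it is worth stating which one a headline figure refers to.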

In [None]:
afro21 = len(df21_africa)
row21 = len(df21) - afro21

afro20 = len(df20_africa)
row20 = len(df20) - afro20

afro19 = len(df19_africa)
row19 = len(df19) - afro19

afro18 = len(df18_africa)
row18 = len(df18) - afro18

afro17 = len(df17_africa)
row17 = len(df17) - afro17


region = ['Africa', 'Rest of the World']
value = [afro21, row21]
percent =[afro21/(afro21 +row21)*100, row21/(afro21+row21)*100]

fig = go.Figure(data=[go.Bar(
            x=value, y=region,
            text=(np.round(percent,1)),
            textposition=['outside', 'inside'],
            texttemplate = ["<b style='color: #f'>%{text}%</b>"]*2,
            textfont=dict(  family="sans serif",
                            size=16,
                            color="black"),
            orientation='h',
            marker_color=['gold', 'salmon'],
            opacity=0.6,
                    )])
fig.update_traces(marker_line_color='black',
                  marker_line_width=2.5)
fig.update_layout(title='<b>Number of respondents: Africa vs Rest of the world (2021)</b>', 
                  font_family="San Serif",
                  yaxis_linewidth=2.5,
                  bargap=0.2,
                  barmode='group',
                  titlefont={'size': 24},
                  paper_bgcolor=paper_color,
                  plot_bgcolor=plot_color,                  
              
                  )
fig.update_layout(
    autosize=False,
    margin=dict(
        l=150,
        r=50,
        b=50,
        t=100,
    ),
    )

fig.update_xaxes(showgrid=False)
fig.update_yaxes(showgrid=False)
fig.show()



In [None]:
## historical data, all gender
year = ['2017', '2018', '2019', '2020', '2021']
value = [afro17, afro18, afro19, afro20, afro21]

percent =[
    afro17/(afro17 +row17)*100,
    afro18/(afro18 +row18)*100,
    afro19/(afro19 +row19)*100,
    afro20/(afro20 +row20)*100,
    afro21/(afro21 +row21)*100]  
   

color = 5* ['salmon'] 
color[4] = 'gold'
fig = go.Figure(data=[go.Bar(
            y=value, x=year,
            text=np.round(percent, 1), 
            textposition='outside',
            texttemplate = ["<b style='color: #f'>%{text}%</b>"]*5,
            textfont=dict(  family="sans serif",
                            size=16,
                            color="black"),
            orientation='v',
            marker_color= color, 
            opacity=0.6
                    )])
fig.update_traces(marker_line_color='black',
                  marker_line_width=2.5)
fig.update_layout(title='<b>The rise of African kagglers</b>', 
                  font_family="San Serif",
                  xaxis_linewidth=2.5,
                  bargap=0.2,
                  barmode='group',
                  titlefont={'size': 28},
                  template='simple_white',
#                   paper_bgcolor='#F5F5F5',
#                   plot_bgcolor='#F5F5F5',                  
                  )
fig.update_layout(yaxis_title='Number of Respondents',xaxis_title='Year',
    autosize=False,
    margin=dict(
        l=100,
        r=50,
        b=50,
        t=70,
        pad=0,
    ),
    )

fig.show()

## **2. Which African countries are** *Kaggle-aware*? <a class="anchor" id="2"></a>

* **Four** countries appeared in the inaugural Kaggle survey in 2017. That number grew to **ten** in 2021. Slowly but surely, more African countries are coming to Kaggle.
* **Nigeria**, **Kenya**, **Egypt** and **South Africa** have appeared in every survey since the first one in 2017.
* This year's survey saw two debutants: **Ethiopia** and **Uganda**.
* **Nigeria** (the most populous country in Africa) has the most participants, with 34.1% of the African respondents in the 2021 survey; in contrast, **Ethiopia**, the second most populous country on the continent, has the fewest, with a meager 2.1%.
* Not surprisingly, the top four countries in terms of participation are the four that got to *know* Kaggle first.
* **More than 57%** of African Kagglers are from either **Nigeria** or **Egypt**.
* So far, no country from the **Central African region** has entered the Kaggle world. If there are Kagglers there, they are not responding to the Kaggle survey. Maybe **Kaggle's BIPOC initiatives** need to scout the Central African region.
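
The country shares quoted above can be reproduced from the 2021 counts. A minimal sketch, assuming the per-country counts hard-coded in the heatmap cell further down in this section; the variable names are illustrative:

```python
import pandas as pd

# 2021 respondent counts per country, copied from the heatmap cell in this notebook.
counts_2021 = pd.Series({
    "Nigeria": 702, "Kenya": 248, "South Africa": 146, "Egypt": 482,
    "Tunisia": 109, "Morocco": 140, "Algeria": 44, "Ghana": 99,
    "Uganda": 47, "Ethiopia": 43,
})

# Each country's share of all African respondents, in percent.
share = (counts_2021 / counts_2021.sum() * 100).round(1).sort_values(ascending=False)
print(share)

# Combined share of the two largest groups.
top_two = share.loc[["Nigeria", "Egypt"]].sum()
print(f"Nigeria + Egypt: {top_two:.1f}%")
```

With the survey frame loaded, `df21_africa['Q3'].value_counts(normalize=True)` should give the same shares as fractions, provided the hard-coded counts match the data.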

In [None]:
# code copied and adapted from: https://www.kaggle.com/datafan07/kaggle-survey-a-general-picture/notebook
def plotly_choroplethMap(locations,counts,title):
    data = [ dict(
            type = 'choropleth',
            locations = locations,
            z = counts,
            locationmode = 'country names',
            autocolorscale = True,
            marker = dict(
                line = dict(color = '#999000', width = 1.5)),
                colorbar = dict(len=0.75, title = 'Nr. of Participants')  # dropped the invalid 'autotick' and misspelled 'legth' keys
                )
           ]
    layout = dict(
        title = title,
        titlefont={'size': 28, 'family': 'san serif'},
        width=750, 
        height=475,
        paper_bgcolor=paper_color,
        geo = dict(
            showframe = True,
            showcoastlines = True,
            #scope="africa",
            fitbounds="locations",
            ),
        
        annotations = [dict(
        x=0.55,
        y=1.1,
        xref='paper',
        yref='paper',
        text='<b> Central Africa - kaggle is missing you!</b>',
            font=dict(family='san serif', size=20, color='#cc5500'),            
        showarrow = False
    )]
    )
    
    fig = dict(data=data, layout=layout)
    iplot(fig, validate=False, filename='world-map')
    
z = df21_africa['Q3'].value_counts()
plotly_choroplethMap(locations=z.index, counts=z.values,title= '<b> African Countries (2021 survey) </b>')

<blockquote style="margin-right:auto; margin-left:auto; background-color: #f8f8ff; padding: 1em; margin:24px;">
<strong>Note:</strong><br>
No African country whose official language is <strong>French</strong> appears in the survey. Could a language barrier have played a role in keeping people from the Western and Central African regions (where most countries have French as an official language) away from the Kaggle platform? I would not bet against it.

</blockquote>

[List of countries where French is an official language](https://en.wikipedia.org/wiki/List_of_countries_and_territories_where_French_is_an_official_language)

In [None]:
#### Heatmap
fig = go.Figure(data=go.Heatmap(
                   z=[[73, 59, 127, 66, 0, 0, 0, 0, 0, 0],  # ten values per row, one per country in x
                    [208, 85, 147, 96, 74, 71, 0, 0, 0, 0],
                    [395, 114, 120, 122, 68, 123, 58, 0, 0, 0],
                    [476, 153, 141, 179, 99, 133, 0, 52, 0, 0],
                    [702, 248, 146, 482, 109, 140, 44, 99, 47, 43],                      
                     ],
                   x=['Nigeria','Kenya', 'South Africa', 'Egypt', 'Tunisia', 'Morocco', 'Algeria', 'Ghana', 'Uganda', 'Ethiopia'],
                   y=['2017', '2018', '2019', '2020', '2021'],
                   hoverongaps = False,
                   opacity=1.0, xgap=2.5, ygap=2.5, colorscale='orrd'),
                   )
fig.update_layout(title='<b> More countries appearing in the picture </b>',
                  titlefont={'size': 24, 'family': "San Serif"},
                  height=400, width=700,
                  template='simple_white',
                  paper_bgcolor=paper_color,
                  plot_bgcolor=plot_color,
                  autosize=False,
                  margin=dict(l=50,r=50,b=50, t=150,
                             ),
                 )
annotations = []
annotations.append(dict(xref='paper', yref='paper',
                        x=-0.01, y=1.3,
                        text='o <b>Nigeria and Egypt are rising quickly. Nigerian kagglers grew almost tenfold since 2017!</b>',
                             font=dict(family='Arial', size=14, color='darkgray'),
                        showarrow=False))
annotations.append(dict(xref='paper', yref='paper',
                        x=-0.01, y=1.2,
                        text="o <b>Algeria</b> had a <em>gap-year</em> in 2020.",
                             font=dict(family='Arial', size=14, color='darkgray'),
                        showarrow=False))

fig.update_layout(annotations=annotations)
fig.show()


<blockquote style="margin-right:auto; margin-left:auto; background-color: #f8f8ff; padding: 1em; margin:24px;">
<strong>Note:</strong><br>
Many factors could explain why African participation is low compared to the rest of the world. One prominent factor could be <strong>access to the internet</strong>, without which taking part in this very survey is not possible. According to [7], only <strong>43%</strong> of Africa's population has access to the internet, compared with <strong>68.6%</strong> in the rest of the world. The chart below shows the internet penetration of the 10 African countries that appear in the survey. It is <em>fair</em> to say there seems to be a relation between internet access and survey participation.
</blockquote>
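
One rough way to put a number on that relation is a rank correlation between internet penetration and the 2021 respondent counts. A minimal sketch, assuming the penetration figures hard-coded in the chart cell below and the 2021 counts hard-coded in the heatmap cell above; the column names are illustrative. The counts are not normalised by population, so this is a sanity check rather than evidence of a causal link:

```python
import pandas as pd

# Internet penetration (% of population) and 2021 respondent counts,
# both copied from values hard-coded elsewhere in this notebook.
data = pd.DataFrame(
    {
        "internet_pct": [85.5, 73, 68.5, 68.4, 57.3, 57, 52.5, 46.6, 39.3, 17.9],
        "respondents": [248, 702, 140, 109, 146, 44, 482, 99, 47, 43],
    },
    index=["Kenya", "Nigeria", "Morocco", "Tunisia", "South Africa",
           "Algeria", "Egypt", "Ghana", "Uganda", "Ethiopia"],
)

# Spearman's rho is the Pearson correlation of the ranks.
ranks = data.rank()
rho = ranks["internet_pct"].corr(ranks["respondents"])
print(f"Spearman rank correlation: {rho:.2f}")
```

A value near +1 would mean that countries with higher penetration also tend to have more respondents; a value near 0 would mean no monotonic relation.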

In [None]:
africa21 = ['Kenya','Nigeria', 'Morocco', 'Tunisia', 'South Africa', 'Algeria', 'Egypt',  'Ghana', 'Uganda', 'Ethiopia']
internet = [85.5, 73, 68.5, 68.4, 57.3, 57, 52.5, 46.6, 39.3, 17.9 ]

fig = go.Figure()

fig.add_trace(go.Barpolar(r=internet,
                          theta=africa21,
                          name='Internet Access',
                          marker_color=['#C49702','#DEB11B','#F7CA35','#F8D14E','#F9D767','#FADE81','#FBE59A','#FCEBB3','#FDF2CC','#FEF8E6'],
                          marker_line_color="black",
                          marker_line_width=2,
                          opacity=0.8
                          ),
              )
fig.update_layout(polar=dict(radialaxis=dict(visible=False,
                                             )
                             ),
                  showlegend=False,
                  title= '<b><span style="font-size:24px; color: black"> Internet penetration 2021 (% population) </span></b>',
                  title_font={'size': 28, 'family':'Serif'},
                  width=600, height=500,
                 )

annotations = []

annotations.append(dict(xref='paper', yref='paper',
                        x=-0.01, y=-0.2,
                        text="<b>Data source: Internet World Stats [ref. 7]</b>",
                             font=dict(family='Arial', size=12, color='darkgray'),
                        showarrow=False))

fig.update_layout(annotations=annotations)
fig.show()

 
## **3. Demography** : *Age and Gender* <a class="anchor" id="3"></a>
### Age

* Generally, both groups of participants (Africa and the rest of the world) are dominated by younger kagglers: **64.6% of Africa's** and **52.2% of the world's** respondents are under 30. Except for the 18-21 age range, Africa's proportion of kagglers under 40 is higher than that of the rest of the world.
* However, Africa falls short when it comes to older kagglers, with *no respondent aged* **70** or **older**. For a continent whose average life expectancy is just 64 years [3], this figure may not be a surprise at all.

### Gender
* The general trend is the same as in the rest of the world: men dominate the survey. However, **Africa beats the rest of the world in women's representation percentage-wise**, with 20% vs 18.7%. Well done Africa!
* **Tunisia** has the highest proportion of women respondents with 38.5%.
* The participation of African women has been steady (around 20% of the African respondents) since 2018, with a slight increase in 2019. In the inaugural survey year (2017), that figure was 17.8%.
* Another subtle difference is that Africa has only **one** respondent who identified themselves as "**nonbinary**". 
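
The "under 30" shares quoted above are the sum of the three youngest age buckets within each group. A minimal sketch of that calculation follows. It assumes a frame with the age bucket in `Q1` and a `region` column holding `Africa` or `World`, as used in the plots below. The helper name `share_in_buckets` and the toy rows are made up for illustration and are not part of the survey data.

```python
import pandas as pd

# Toy stand-in for the survey frame (invented rows, not real responses)
toy = pd.DataFrame({
    "Q1": ["18-21", "22-24", "25-29", "30-34", "25-29", "40-44", "22-24", "35-39"],
    "region": ["Africa", "Africa", "Africa", "Africa", "World", "World", "World", "World"],
})

# The three survey buckets that make up "under 30"
UNDER_30 = ["18-21", "22-24", "25-29"]

def share_in_buckets(df, buckets, question="Q1", group="region"):
    """Percentage of each group's respondents whose answer falls in `buckets`."""
    in_buckets = df[question].isin(buckets)
    # the mean of a boolean column is the fraction of True values
    return in_buckets.groupby(df[group]).mean().mul(100).round(1)

under_30 = share_in_buckets(toy, UNDER_30)
print(under_30)
```

The same helper could be pointed at any single-choice question (for example gender in `Q2`) by changing `question` and `buckets`.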
    

In [None]:
#def plotly_vBar(df, q, title, l=50,r=50,b=50,t=100):   
title= "<b> Age Distribution <b>"\
'<br><span style="font-size:16px; color: darkgray"> Around 64% of African kagglers are under 30 years of age!'


df = df21
q = 'Q1'
order  = ['18-21', '22-24', '25-29', '30-34', '35-39', '40-44', '45-49', '50-54', '55-59', '60-69', '70+']

fig = px.histogram(df.iloc[1:],
                   x=q,
                   orientation='v',
                   width=700,
                   height=450,
                   histnorm='percent',
                   color='region',                   
                   category_orders = {q: ['18-21', '22-24', '25-29', '30-34', '35-39', '40-44', '45-49', '50-54', '55-59', '60-69', '70+']},
                   color_discrete_map={
                       "Africa": "gold", "World": "salmon"
                   },
                   opacity=0.6
                   )

# shade the three under-30 age buckets (category positions 0 to 2)
fig.add_vrect(
    x0=-0.5, x1=2.5,
    fillcolor="lightgray", opacity=0.2,
    layer="below", line_width=0,
)

fig.update_layout(title=title, 
                  font_family="San Serif",
                  bargap=0.2,
                  barmode='group',
                  title_font={'size': 24},
                  template='simple_white',
                  paper_bgcolor=paper_color,
                  plot_bgcolor=plot_color,
                  legend=dict(
                  orientation="v", 
                      y=1, 
                      yanchor="top", 
                      x=1.250, 
                      xanchor="right",)                 
                  )
fig.update_traces(marker_line_color='black',
              marker_line_width=1.5)

fig.update_layout(xaxis_title=None,xaxis_linewidth=2.5,
autosize=False,
margin=dict(
    l=50,
    r=50,
    b=50,
    t=125,
),
)
fig.update_xaxes(showgrid=False)
fig.update_yaxes(showgrid=False)
fig.show()   

In [None]:
#gender
title= '<b><span style="font-size:24px; color: black"> Gender Distribution <b>'\
 '<br><span style="font-size:16px; color: darkgray">  Kaggle is a male dominated platform! Sadly so is the work place in general' 
plotly_hBar(df21, "Q2", title=title, l=150)

In [None]:
# work on a copy so the helper column does not leak into df21_africa
DF = df21_africa.copy()
DF['all'] = 'Africa 2021'

fig = px.treemap(DF, path=['all','Q3', 'Q2'], color='Q3',
                 
                 color_discrete_map={'(?)': 'lightgray','Nigeria':'#802300','Kenya': '#FF6A33', 'South Africa': "#FF7D4D", 'Egypt': '#CC3700', 'Tunisia': "#FFA280",
                                     'Morocco': '#FF8F66', 'Algeria': '#FFDACC', 'Ghana': '#FFB599', 'Uganda': '#FFC7B3', 'Ethiopia': '#FFECE6'},
                )

fig.update_layout(title="<b> Gender distribution: Africa - 2021</b>",
                  title_font={'size': 24, 'family': "San Serif"},
                  height=500, width=700,
                  template='simple_white',
                  paper_bgcolor=paper_color,
                  plot_bgcolor=plot_color,
                  autosize=False,
                  margin=dict(l=50,r=50,b=50, t=250,
                             ),
                 )
fig.update_layout(margin = dict(t=50, l=50, r=50, b=100))

annotations = []
annotations.append(dict(xref='paper', yref='paper',
                        x=-0.01, y=-0.1,
                        text='<b>1. Tunisia has the highest share of female respondents (38.5%)</b>',
                             font=dict(family='Arial', size=16, color='darkgray'),
                        showarrow=False))
annotations.append(dict(xref='paper', yref='paper',
                        x=-0.01, y=-0.175,
                        text="<b>2. Ethiopia has the lowest number of female respondents (2 out of 43)</b>",
                             font=dict(family='Arial', size=15, color='darkgray'),
                        showarrow=False))

fig.update_layout(annotations=annotations)
fig.show()


In [None]:
## historical data, women

df21_afrowomen = df21_africa[df21_africa['Q2'] == 'Woman']
df20_afrowomen = df20_africa[df20_africa['Q2'] == 'Woman']

df19_afrowomen = df19_africa[df19_africa['Q2'] == 'Female']
df18_afrowomen = df18_africa[df18_africa['Q1'] == 'Female']
df17_afrowomen = df17_africa[df17_africa['GenderSelect'] == 'Female']

afrowomen21 = len(df21_afrowomen)
afrowomen20 = len(df20_afrowomen)
afrowomen19 = len(df19_afrowomen)
afrowomen18 = len(df18_afrowomen)
afrowomen17 = len(df17_afrowomen)

    
year = ['2017', '2018', '2019', '2020', '2021']
value = [afrowomen17, afrowomen18, afrowomen19, afrowomen20, afrowomen21]

percent =[
    afrowomen17/len(df17_africa)*100,
    afrowomen18/len(df18_africa)*100,
    afrowomen19/len(df19_africa)*100,
    afrowomen20/len(df20_africa)*100,
    afrowomen21/len(df21_africa)*100,
    ]  

color = 5* ['#fa8072'] 
color[0] = 'gray'
fig = go.Figure(data=[go.Bar(
            y=value, x=year,
            text=np.round(percent, 1), 
            textposition='outside',
            texttemplate = ["<b style='color: #f'>%{text}%</b>"]*5,
            textfont=dict(  family="sans serif",
                            size=16,
                            color="black"),
            orientation='v',
            marker_color= color, 
            opacity=0.6
                    )])
fig.update_traces(marker_line_color='black',
                  marker_line_width=2.5)

fig.update_layout(title='<b>African women kagglers (2017-2021)</b>', 
                  font_family="San Serif",
                  bargap=0.2,
                  barmode='group',
                  title_font={'size': 24},
                  template='simple_white',
#                   paper_bgcolor='#F5F5F5',
#                   plot_bgcolor='#F5F5F5',                  
                  )
fig.update_layout(yaxis_title='Number of Respondents',xaxis_title='Year',xaxis_linewidth=2.5,
    autosize=False,
    margin=dict(
        l=100,
        r=50,
        b=50,
        t=50,
        ),
    )

annotations = []
annotations.append(dict(xref='paper', yref='paper',
                        x=0.01, y=0.97,
                        text='<b>  <b>',
                             font=dict(family='Arial', size=12, color='#cc5500'),
                        showarrow=False))
annotations.append(dict(xref='paper', yref='paper',
                        x=0.01, y=.93,
                        text="<b> <b> ",
                             font=dict(family='Arial', size=12, color='#cc5500'),
                        showarrow=False))

fig.update_layout(annotations=annotations)
fig.show()

## **4. Education, Jobs and Experience** <a class="anchor" id="4"></a>

### Education:
* Perhaps not surprisingly, the majority of respondents in both groups hold a Bachelor's or Master's degree. PhDs are a rare breed in any field and DS is no different. What is more surprising (at least for me), however, is that the rest of the world has more Master's than Bachelor's degree holders.

### Jobs:

* Anybody who has kaggled for some time knows that students are heavily engaged on the platform, so seeing a large contingent of students is perhaps not a surprise. Nearly a quarter of the respondents (**23.1% of Africa's and 26.5% of the world's**) are students.
* The share of respondents who said they are **not currently employed** is about 3.5% higher in Africa than in the rest of the world. **Data Analyst** and **Statistician** are other job titles where **Africa** has a higher relative percentage.

### Code Experience:
* Over half (51.8%) of the world's and more than two-thirds (66.9%) of African kagglers have coding experience of 3 years or less. Africa lags behind when it comes to experience, with the largest percentage difference coming from the 20+ years experience group.
* In terms of kagglers who have never coded before, Africa leads the world.

In [None]:
title= '<b><span style="font-size:24px; color: black"> Highest level of formal education <b>'
plotly_hBar(df21, "Q4", title=title, l=350)


title= "<b><span style='font-size:24px'> Current role/Job title <b>"
plotly_hBar(df21, "Q5", title=title, l=200)

title= "<b><span style='font-size:24px'> Coding Experience <b>"\
'<br><span style="font-size:16px; color: darkgray"> Africa is light on coding experience.'
plotly_hBar(df21, "Q6", title=title, l=200)

## **5. Programming Languages & IDE's** <a class="anchor" id="5"></a>
### Programming Languages

* As in the rest of the world, **Python** is the most popular programming language in Africa. Around 82% of African respondents prefer Python; the rest of the world is ahead by 2%.
* **SQL** is a distant second, but is the best of the rest at around 40%.
* **Swift** is at the bottom of the list.
* We also see that for *all* programming languages the **rest of the world is marginally ahead of Africa**. The programming language comfort zone of African kagglers is a little narrower.
* **Python** is by far the most recommended programming language for aspiring data scientists to learn first. 

### IDE's 
* With the huge majority of respondents choosing Python as their preferred programming language, it is no surprise that **Jupyter** (labs/notebooks) leads by a significant margin. **Visual Studio Code** comes in second, followed by PyCharm.
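
Language and IDE questions are multi-select, so the percentages above can add up to more than 100%: each option is counted against the total number of respondents. The sketch below shows one way such shares could be computed. It assumes one column per option, named like `Q7_Part_1`, holding the option label when ticked and NaN otherwise. The function `multi_select_share` is a hypothetical illustration and not the `df_with_percentages` helper used in the cells below.

```python
import pandas as pd

def multi_select_share(df, question_number):
    """Percent of respondents who ticked each option of a multi-select question.

    Assumption: option columns are named 'Q<n>_Part_<k>' and contain the
    option label for respondents who selected it, NaN for everyone else.
    """
    prefix = f"Q{question_number}_Part_"
    rows = []
    for col in (c for c in df.columns if c.startswith(prefix)):
        selected = df[col].dropna()
        if selected.empty:
            continue  # nobody picked this option
        rows.append({"Option": selected.iloc[0],
                     "Value": 100 * len(selected) / len(df)})
    return pd.DataFrame(rows).sort_values("Value", ascending=False, ignore_index=True)

# Toy frame with invented answers from four respondents
toy = pd.DataFrame({
    "Q7_Part_1": ["Python", "Python", None, "Python"],
    "Q7_Part_2": [None, "SQL", "SQL", None],
    "Q7_Part_3": [None, None, None, None],
})
shares = multi_select_share(toy, 7)
print(shares)
```

Dividing by all respondents rather than by the number of ticks is what makes the Africa and world columns comparable, since the two groups differ greatly in size.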

In [None]:
# Q7, 
df_w = df_with_percentages(df21_world, 7, 13, 'World').sort_values(['Value'],ascending=False)
df_a = df_with_percentages(df21_africa, 7, 13, 'Africa').sort_values(['Value'],ascending=False)

title = '<b> Programming Language Preference </b>'
#plot_barH_percent(df_w, df_a, title)
annotated_heatmap(df_w, df_a, title)

In [None]:
# # Q9, 

df_w = df_with_percentages(df21_world, 9, 13, 'World').sort_values(['Value'],ascending=False)
df_a = df_with_percentages(df21_africa, 9, 13, 'Africa').sort_values(['Value'],ascending=False)

# merge Jupyter's together
df_w.loc[10] += df_w.loc[0]
df_a.loc[10] += df_a.loc[0]

# drop the duplicate
df_w.drop([0], inplace=True)
df_a.drop([0], inplace=True)

title = "<b> Integrated Development Environment (IDE) Usage </b>"
plot_barH_percent(df_w, df_a, title, l=350)


In [None]:
title= "<b>Programming Language Recommendation <b>"
plotly_hBar(df21, "Q8", title=title, l=150)

## **6. Machine Learning**: *Experience, Framework and Algorithms* <a class="anchor" id="6"></a>
### Experience:
* **Africa** has less ML experience than the rest of the world, with **66.5%** of its respondents having **2 years** or **less** of ML experience. The rest of the world isn't that far ahead either: over half of its kagglers have two or fewer years of ML exposure.
* People who do not use machine learning methods at all constitute a significant portion of the survey population: almost 20% of African respondents and 16% of the rest.
* ML veterans (20+ years of experience) are almost non-existent, with only **0.05%** of the respondents in Africa and <1% of the rest of the world.

### Framework

* **Scikit-learn** is well ahead of the others globally, followed by **TensorFlow** and **Keras**. For every ML framework, Africa's usage is behind the rest of the world.

### Algorithms

* Overall, **Linear/Logistic Regression** and **Decision Trees/Random Forests** are the most used machine learning algorithms. Although all the listed algorithms are less commonly used among African kagglers, the difference is even more visible for gradient boosting machines, CNNs and DNNs. This might suggest that African kagglers are less experienced than their peers from other continents.
* **GAN**, and **Evolutionary Approaches** are the least used algorithms.

### Advanced Methods (CV, NLP)

* Advanced methods require experience and expertise to apply on a regular basis. We already saw that the majority of the respondents are short of experience, and this is exactly what manifests here: only a small percentage of the respondents said they use Computer Vision and Natural Language Processing methods on a regular basis. The story is the same for both groups, although Africa trails in each of the categories.
* *General purpose image/video tools (PIL, cv2, skimage, etc)* is the most used CV method, with 17.2% and 13.1% for the rest of the world and Africa respectively, while *Word embeddings/vectors (GloVe, fastText, word2vec)* is marginally the best of the rest among NLP methods.


In [None]:
title= "<b>Machine Learning Experience <b>"
plotly_hBar(df21, "Q15", title=title)

In [None]:
# Q16, 
df_w = df_with_percentages(df21_world, 16, 18, 'World').sort_values(['Value'],ascending=False)
df_a = df_with_percentages(df21_africa, 16, 18, 'Africa').sort_values(['Value'],ascending=False)
title = "<b> Machine Learning Platform Usage <b>"
plot_barH_percent(df_w, df_a, title, l=150)

In [None]:
# Q17, 
df_w = df_with_percentages(df21_world, 17, 12, 'World').sort_values(['Value'],ascending=False)
df_a = df_with_percentages(df21_africa, 17, 12, 'Africa').sort_values(['Value'],ascending=False)

title = "<b> Machine Learning Algorithms Usage <b>"
plot_barH_percent(df_w, df_a, title, l=300)

In [None]:
# Q18, 
df_w = df_with_percentages(df21_world, 18, 7, 'World')
df_a = df_with_percentages(df21_africa, 18, 7, 'Africa')

title= "<b> Computer Vision Methods Used <b>"

annotated_heatmap_Trans(df_w, df_a, title, width=950, height=550, l=700)


# Q19, 
df_w = df_with_percentages(df21_world, 19, 6, 'World')
df_a = df_with_percentages(df21_africa, 19, 6, 'Africa')

title= "<b> Natural Language Processing Methods Used <b>"

annotated_heatmap_Trans(df_w, df_a, title, width=700, height=550, l=400)

In [None]:
# Q14, 12
df_w = df_with_percentages(df21_world, 14, 12, 'World')#.sort_values(['Value'],ascending=False)
df_a = df_with_percentages(df21_africa, 14, 12, 'Africa')#.sort_values(['Value'],ascending=False)

title = '<b> Data Visualization Libraries </b>'
annotated_heatmap(df_w, df_a, title)

## **7. Computing Resources** <a class="anchor" id="7"></a>

#### Computing Platform

* **Laptops** are by far the most used computing machines among all respondents, albeit more popular in Africa. 79.4% of the African respondents use them compared to 64.5% of the rest of the world. 
* **Deep learning workstations** such as **NVIDIA GTX**, **LambdaLabs** are the least used among all respondents. Africa's usage is very low at less than 1%. 

#### TPU
* Neither group is a heavy TPU user; more than **two-thirds** of the respondents have **never** used a **TPU**.

#### Specialized Hardware
* The use of specialized hardware is not common. More than half of the respondents reported that they do not use specialized hardware. Having already seen that students make up a large portion of the respondents, it is likely that most do not have the privilege of access to specialized hardware. Even a laptop/personal computer is a luxury for many.
* However, among those who do have access, **NVIDIA GPUs** are used more frequently than any other hardware. **Google Cloud TPUs** come in second. Here Africa trails the rest of the world by more than 10%.

#### Cloud
* For Africans, **Google Cloud Platform (GCP)** is the most frequently used cloud computing platform, whereas the rest of the world uses **Amazon Web Services (AWS)** more than the others.
* There is also a distinct difference in the preferred cloud platform between the groups. Around **27.7% (the highest share) of African** respondents did not single out one cloud platform; they enjoyed all of them equally. The rest of the world preferred **Amazon Web Services (AWS)**. The second choice for both groups is **Google Cloud Platform (GCP)**.
* **IBM Cloud/Red Hat** is more popular in Africa than in the rest of the world. Around 6.7% of African respondents preferred IBM Cloud/Red Hat.

#### Big Data

* MySQL and PostgreSQL are the top two used big data products globally.
* Respondents who do not use any big data product at all also rank high: 4th for Africa and 2nd for the rest of the world.

In [None]:
title= "<b> Computing platform usage <b>"
plotly_hBar(df21, "Q11", title=title, l=350)

In [None]:
title= "<b> Frequency of TPU Usage <b>"
plotly_hBar(df21, "Q13", title=title, l=150)

In [None]:
# Q12,

df_w = df_with_percentages(df21_world, 12, 6, 'World')
df_a = df_with_percentages(df21_africa, 12, 6, 'Africa')

title = "<b> Specialized Hardware <b>"

annotated_heatmap(df_w, df_a, title)

In [None]:
title= "<b> Preferred Cloud Platform <b>"
#plotly_hBar(df21, "Q27", title=title, l=300)

#27_A, 12
df_w = df_with_percentages_suf(df21_world, 27, '_A', 12, 'World').sort_values(['Value'],ascending=False)
df_a = df_with_percentages_suf(df21_africa, 27, '_A', 12, 'Africa').sort_values(['Value'],ascending=False)

title = "<b> Frequently Used Cloud Computing Platforms<b>"
plot_barH_percent(df_w, df_a, title, l=250)

In [None]:
title= "<b> Preferred Cloud Platform <b>"
plotly_hBar(df21, "Q28", title=title, l=300)

In [None]:
#29_A, 5
df_w = df_with_percentages_suf(df21_world, 29, '_A', 5, 'World').sort_values(['Value'],ascending=False)
df_a = df_with_percentages_suf(df21_africa, 29, '_A', 5, 'Africa').sort_values(['Value'],ascending=False)

title = "<b>Cloud computing products used on a regular basis<b>"
plot_barH_percent(df_w, df_a, title, l=250)

In [None]:
#32_A, 21
df_w = df_with_percentages_suf(df21_world, 32, '_A', 21, 'World').sort_values(['Value'],ascending=False)
df_a = df_with_percentages_suf(df21_africa, 32, '_A', 21, 'Africa').sort_values(['Value'],ascending=False)

title = "<b> Big Data Product Used <b>"
plot_barH_percent(df_w, df_a, title, l=250)

In [None]:
# Q33
title= "<b> Most Often Used Big Data Product <b>"
plotly_hBar(df21, "Q33", title=title, height=500, l=200)

In [None]:
#34_A, 17
df_w = df_with_percentages_suf(df21_world, 34, '_A', 17, 'World').sort_values(['Value'],ascending=False)
df_a = df_with_percentages_suf(df21_africa, 34, '_A', 17, 'Africa').sort_values(['Value'],ascending=False)

title = "<b> Business Intelligence (BI) Tools <b>"
plot_barH_percent(df_w, df_a, title, l=200)

In [None]:
# Q10, 17
df_w = df_with_percentages(df21_world, 10, 17, 'World').sort_values(['Value'],ascending=False)
df_a = df_with_percentages(df21_africa, 10, 17, 'Africa').sort_values(['Value'],ascending=False)
title = "<b> Hosted notebook products <b>"
plot_barH_percent(df_w, df_a, title, l=300)

## **8. Employment, role at work and payment** <a class="anchor" id="8"></a>

### Employment

* **Computer/Technology** and **Academics/Education** are the standout employment sectors across all respondents. While Africa leads in Academics/Education, the rest of the world is ahead in Computer/Technology.
* **Non-profit/Service** is another sector where African kagglers have a slight percentage lead, whereas the rest of the world leads in the **Manufacturing/Fabrication** sector.
* In terms of company size, we see a noticeable difference at both ends of the scale. Almost half of the respondents from Africa (49%) work for relatively small companies (< 50 employees), while about 30% of the respondents from the rest of the world work for companies of a similar size. At the other end of the scale (> 10,000 employees), we observe quite the opposite: world ~22% vs Africa ~8%.

### Role at work
* More than a third of the respondents described an important part of their role at work as "*analyse and understand data to influence product or business decisions*". The difference between Africa and the rest of the world in this category is marginal. Africa trails in the rest of the categories. Considering that **data scientist** and **data analyst** are the top two job titles (excluding students), this response is consistent with the previous findings.

### Payment
* With students being the largest group of respondents, this category is biased towards the lower end of the pay scale. Around 43% of the African respondents and 20% of the rest of the world reported less than \$1,000 USD as their annual compensation.
* The stark difference between Africa and the rest of the world when it comes to compensation is quite clear! Around 78% of the respondents in Africa earn less than \$10,000 USD per year, whereas around 43% from the rest of the world are in the same category.
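
The "less than \$10,000" figures are cumulative: they add up the seven lowest salary buckets. A minimal sketch of that arithmetic is shown here, assuming the salary answers sit in a single column of bucket labels (such as `Q25`) with NaN for respondents who skipped the question. The helper name `share_below_10k` and the toy values are invented for illustration.

```python
import pandas as pd

# Salary buckets below 10,000 USD, written with the survey's own labels
BELOW_10K = ['$0-999', '1,000-1,999', '2,000-2,999', '3,000-3,999',
             '4,000-4,999', '5,000-7,499', '7,500-9,999']

def share_below_10k(salaries):
    """Percent of answered salary responses that fall below 10,000 USD."""
    answered = salaries.dropna()  # ignore respondents who skipped the question
    return round(100 * answered.isin(BELOW_10K).mean(), 1)

# Toy column: four answers (three of them below 10k) and one skipped response
toy = pd.Series(['$0-999', '$0-999', '5,000-7,499', '10,000-14,999', None])
print(share_below_10k(toy))
```

Applied separately to the African and world subsets, the same function would give the two percentages being compared.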

In [None]:
# Q20
title= "<b> Current Employer <b>"
plotly_hBar(df21, "Q20", title=title, height=600, l=200)

In [None]:
# Q24, 

df_w = df_with_percentages(df21_world, 24, 8, 'World')
df_a = df_with_percentages(df21_africa, 24, 8, 'Africa')

title= "<b> Important Activity As Part of Job/Role At Work <b>"

annotated_heatmap_Trans(df_w, df_a, title, width=950, height=550, l=650)

In [None]:
## Company size

# value_counts() names its result differently across pandas versions,
# so name each Series explicitly before building the frame
dfa = df21_africa.iloc[1:]['Q21'].value_counts().rename("Africa").to_frame()
dfw = df21_world.iloc[1:]['Q21'].value_counts().rename("World").to_frame()

DF = pd.concat([dfa, dfw], axis=1)
af_sum = DF['Africa'].sum()
rw_sum = DF['World'].sum()

DF['per_af'] = DF['Africa']/af_sum*100
DF['per_rw'] = DF['World']/rw_sum*100

#print(DF.index)

order = ['0-49 employees', '50-249 employees', '250-999 employees', '1000-9,999 employees', '10,000 or more employees']

# sort rows from smallest to largest company so they line up with new_index
DF = DF.reindex(order)

new_index = ['0-49', '50-249', '250-999', '1000-9,999', '> 10,000']

title = '<b> Company Size <b>'\
'<br><span style="font-size:16px; color: darkgray"> Almost half of the Africans work for small companies!'


fig = go.Figure()
fig.add_trace(go.Bar(x=new_index,
                     y=DF['per_rw'],
                     orientation= 'v',
                     name="World",
                     marker_color='salmon',
                     opacity =0.6
                    ),
             )
fig.add_trace(go.Bar(x=new_index,
                     y=DF['per_af'],
                     orientation= 'v',
                     name="Africa",
                     marker_color='gold',
                     opacity =0.6
                    ),
             )
fig.update_traces(marker_line_color='black',
              marker_line_width=1.5)
fig.update_layout(title=title,
                  height=400,
                  width=700,
              font_family="San Serif",
              bargap=0.2,
              barmode='group',
              title_font={'size': 28},
              template='simple_white',
              paper_bgcolor=paper_color,
              plot_bgcolor=plot_color,
              legend=dict(
              orientation="v", 
                  y=1, 
                  yanchor="top", 
                  x=1.250, 
                  xanchor="right",)                 
              )


fig.update_yaxes(categoryarray= new_index)
fig.update_layout(yaxis_title='percentage',xaxis_title='Nr. of Employees',
autosize=False,
margin=dict(
l=150, r=50, b=50, t=100,
))

fig.update_xaxes(showgrid=False)
fig.update_yaxes(showgrid=False)
fig.show()     

In [None]:
# value_counts() names its result differently across pandas versions,
# so name each Series explicitly before building the frame
dfa = df21_africa.iloc[1:]['Q25'].value_counts().rename("Africa").to_frame()
dfw = df21_world.iloc[1:]['Q25'].value_counts().rename("World").to_frame()

DF = pd.concat([dfa, dfw], axis=1)
af_sum = DF['Africa'].sum()
rw_sum = DF['World'].sum()

DF['per_af'] = DF['Africa']/af_sum*100
DF['per_rw'] = DF['World']/rw_sum*100

order = ['$0-999', '1,000-1,999', '2,000-2,999',  '3,000-3,999','4,000-4,999','5,000-7,499',
         '7,500-9,999','10,000-14,999', '15,000-19,999','20,000-24,999', '25,000-29,999',
         '30,000-39,999', '40,000-49,999', '50,000-59,999','60,000-69,999','70,000-79,999', '80,000-89,999', '90,000-99,999',
       '100,000-124,999', '125,000-149,999','150,000-199,999', '200,000-249,999', '250,000-299,999', '300,000-499,999', 
       '$500,000-999,999','>$1,000,000',]

# sort rows from lowest to highest pay so they line up with the labels in new_index
DF = DF.reindex(order)

title = '<b> The Payment Pyramid <b>'

new_index = ['0-999', '1,000-1,999', '2,000-2,999',  '3,000-3,999','4,000-4,999','5,000-7,499',
         '7,500-9,999','10,000-14,999', '15,000-19,999','20,000-24,999', '25,000-29,999',
         '30,000-39,999', '40,000-49,999', '50,000-59,999','60,000-69,999','70,000-79,999', '80,000-89,999', '90,000-99,999',
       '100,000-124,999', '125,000-149,999','150,000-199,999', '200,000-249,999', '250,000-299,999', '300,000-499,999', 
       '500,000-999,999','>1,000,000',]

fig = go.Figure()
fig.add_trace(go.Bar(y=new_index,
                     x=-DF['per_rw'],
                     orientation= 'h',
                     name="World",
                     marker_color='salmon',
                     opacity = 0.6,
                    ),
             )
fig.add_trace(go.Bar(y=new_index,
                     x=DF['per_af'],
                     orientation= 'h',
                     name="Africa",
                     marker_color='gold',
                     opacity =0.6,
                    ),
             )
fig.update_traces(marker_line_color='black',
              marker_line_width=1.5)
fig.update_layout(title=title,
                  height=700,
                  width=700,
              font_family="San Serif",
              bargap=0.2,
              barmode='group',
              title_font={'size': 28},
              template='simple_white',
              paper_bgcolor=paper_color,
              plot_bgcolor=plot_color,
              legend=dict(
              orientation="v", 
                  y=1, 
                  yanchor="top", 
                  x=1.250, 
                  xanchor="right",)                 
              )
fig.add_shape(type="rect",
              xref="paper", yref="y",
              x0=0, x1=6, y0=-0.5, y1=6.5,
              fillcolor="lightgray",    
              opacity=0.2
              )  

fig.update_yaxes(categoryarray= new_index)
fig.update_layout(xaxis_title='percentage',
                  xaxis= dict(
                      tickvals = [-20, 0, 20, 40],
                      ticktext = [20, 0, 20, 40],
                  ),
autosize=False,
margin=dict(
l=150, r=50, b=150, t=50,
))

                 
annotations = []
annotations.append(dict(xref='paper', yref='paper',
                        x=0.1, y=0.15,
                        text='<b> 42.7%',
                             font=dict(family='Arial', size=20, color='black'),
                        showarrow=False))
annotations.append(dict(xref='paper', yref='paper',
                        x=0.6, y=0.15,
                        text="<b> 77.5%",
                             font=dict(family='Arial', size=20, color='black'),
                        showarrow=False))

annotations.append(dict(xref='paper', yref='paper',
                        x=-0.2, y=-0.2,
                        text="<i>Note: This payment scale is not adjusted for local purchasing power!",
                             font=dict(family='Arial', size=16, color='black'),
                        showarrow=False))

fig.update_layout(annotations=annotations)

fig.update_xaxes(showgrid=False)
fig.update_yaxes(showgrid=False)
fig.show()     

## **9. Learning Platform and Media** <a class="anchor" id="9"></a>

### Learning platform

* With a lead of around 4% over Kaggle Learn Courses, **Coursera** is the most favoured data science learning platform in the rest of the world. Perhaps much to the delight of kaggle.com, **Kaggle Learn Courses are top of the pile among African kagglers**, albeit marginally (+1.5% advantage over Coursera). Udemy comes in third for both groups.
* It is quite surprising to see that Udacity (not a cheap learning platform) is more popular in Africa than in the rest of the world; 23% of Africa's respondents use Udacity compared to 9.7% of the world's.

### Media
* **Kaggle** is happy here as well! It is the favourite media source for both African respondents (48%) and the rest of the world (43%).
* **YouTube** and blogs such as **Towards Data Science** and **Analytics Vidhya** come in second and third.
* African kagglers' use of **Twitter** for DS information is higher than their peers in the rest of the world. 

<blockquote style="margin-right:auto; margin-left:auto; background-color: #f8f8ff; padding: 1em; margin:24px;">
<strong>Note:</strong><br>
    <strong>Kaggle</strong> featured in the top two preferred platforms in every category where it appeared as an option! For African respondents it was chosen as the preferred platform in <strong>two</strong> instances.
</blockquote>

In [None]:
# Q40, 12
df_w = df_with_percentages(df21_world, 40, 12, 'World').sort_values(['Value'],ascending=False)
df_a = df_with_percentages(df21_africa, 40, 12, 'Africa').sort_values(['Value'],ascending=False)
title = "<b> Data Science Learning Platform<b>"
plot_barH_percent(df_w, df_a, title, l=360)

In [None]:
# Q42, 12, 
df_w = df_with_percentages(df21_world, 42, 12, 'World')
df_a = df_with_percentages(df21_africa, 42, 12, 'Africa')
    
title= "<b> Favourite Media Source for DS Topics  <b>"

annotated_heatmap_Trans(df_w, df_a, title, width=850, height=550, l=500)

In [None]:
# Q39, 10
df_w = df_with_percentages(df21_world, 39, 10, 'World').sort_values(['Value'],ascending=False)
df_a = df_with_percentages(df21_africa, 39, 10, 'Africa').sort_values(['Value'],ascending=False)
title = "<b> Preferred Content Sharing/Deploying Platform for DS/ML</b>"
plot_barH_percent(df_w, df_a, title, l=250)

## **10. Summary** <a class="anchor" id="10"></a>

As I mentioned in the introduction part, the aim of this notebook was to do a descriptive analysis of the the African kagglers in comparision with thier peers from the rest of the world. We did so by looking at the majority of question contained in the survey. As a summary, I have highlighted few of the points where African kagglers have different statistics than the rest of the world.  

- Africa's participation in the kaggle data science/ML community is **steadily increasing** although its disproportionate to the size of the continent. Only eleven out of 54 countries have so far appeared in the survey with two debutant countries this year. Following the global trend African women's involvement on the survey is disappointingly low. On a positive note, they are **slightly more than the global average**. 

- The majority of African kagglers are young: **~65% are under the age of 30**, and there are **no respondents aged 70 and above**. However, at the youngest end of the age spectrum, Africa is behind the rest of the world by about 4%.

- Almost **half** of the African respondents have a **Bachelor's degree**, unlike their peers from the rest of the world, where Master's degree holders slightly outnumber Bachelor's degree holders. As in the global sample, students form the largest group of respondents in Africa, followed by data scientists. **Data analysts** and **unemployed kagglers** make up a larger share (%) of respondents in Africa than in the rest of the world.

- Coding experience is one of the categories where African respondents lag far behind the rest of the world. Nearly **three-quarters (73.3%) of them have less than three years of coding experience or have never coded before**. We see the same trend in machine learning experience.

- Python is the most common programming language, Jupyter labs/notebooks are the favourite IDEs, Scikit-learn is the most popular ML platform, and the more traditional ML algorithms such as linear/logistic regression are widely used in Africa, just like in the rest of the world.

- Unlike the rest of the world, where the majority work in the computer/technology sector, Africa's kagglers mostly work in **research/education**. Another sector where Africans have a larger presence relative to the rest is **non-profit/service**.

- MySQL is the most used big data product not only in Africa but also globally. However, **Google's cloud platform and products** are more popular than the alternatives in Africa, whereas AWS is the most chosen in the rest of the world.

- The **pay** disparity between African kagglers and the rest of the world is huge. Around **77% of Africans earn an annual salary of 10,000 USD or less**, compared to **~43%** of the rest. This comparison uses salaries as reported, without adjusting for purchasing power.

- **Kaggle Learn** is the first choice of Africans as a data science learning platform. Coursera, the first choice for everyone else, comes in second for Africans. **Kaggle forums and notebooks** are the favourite **media** sources for data science topics. Interestingly, **Twitter** is the fourth choice for both Africa and the rest of the world, after YouTube and blogs. However, a larger share of African kagglers use Twitter than kagglers from the rest of the world.


Finally, I sincerely hope that this notebook gave you an overview of African kagglers within the scope of the survey questions: where they lag behind and, in some instances, where they have the edge over the rest of the kaggling world. Although some stats are disappointingly low, the future looks bright for Africa. We have seen steady growth in overall participation and the consistent presence of African women kagglers. With better internet access and the introduction of data science/AI curricula in higher education and training, the field of DS and AI in Africa will surely pick up pace and catch up. I am sure we will see that in future kaggle surveys.
    

## **11. References** <a class="anchor" id="11"></a>

[1]. https://en.wikipedia.org/wiki/Kaggle

[2]. https://www.worldometers.info/world-population/africa-population/

[3]. https://www.statista.com/statistics/1218173/life-expectancy-in-african-countries/

[4]. https://www.kaggle.com/

[5]. https://www.kaggle.com/mhajabri/africai

[6]. https://qz.com/africa/1683977/why-data-science-is-crucial-to-africa-development/

[7]. https://www.internetworldstats.com/stats1.htm

[8]. https://qz.com/africa/1683977/why-data-science-is-crucial-to-africa-development/

[9]. https://tribunecontentagency.com/article/why-ai-needs-input-from-africans/




#### End of Notebook. 

#### Thank you for reading!
