![Behind the codes](https://imittech.com/wp-content/uploads/2019/07/product-page.jpg)

# Behind The Screens

Whether you are here just to check out what Kaggle is all about or you have been toiling away to get closer to the much-coveted Kaggle Grandmaster title, one thing you are probably curious about is the community: who are the Kagglers? Are we a growing community? Is it only a bunch of boys in hoodies, perhaps mostly college dropouts sitting behind their screens in the U.S.? Or are we much more diverse? Well, it's not that hard to find out when you have such an involved community that cares enough to share its data through the annual Kaggle survey. So, what are we waiting for? Let's dig right in!

<font color='green'><u> **I am just a beginner. So your upvote especially means a lot! Please do upvote!** </u></font> 😊🙏


1. **How big is our survey sample?** \
Before dissecting the survey data, it is important to check that the sample is large enough to be representative of the population. The 'meta-kaggle' data comes in handy here. As expected, the Kaggle 'Users' data shows exponential growth in new user registrations since Kaggle's inception in 2010. Starting with a humble 4558 users, the Kaggle user base is now reaching a humongous ~6 million! Of these ~6M users, more than a third registered in 2020 alone.

In [None]:
%reset -f
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import os
import gc
import warnings
from textwrap import fill
import matplotlib.ticker as ticker
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator 
import seaborn as sns


warnings.filterwarnings("ignore")
gc.enable()

pd.options.display.max_colwidth=250

plt.style.use('seaborn-white')
plt.rcParams['axes.labelsize'] = 20
plt.rcParams['axes.titlesize'] = 20
plt.rcParams['font.sans-serif'] = "Comfortaa"


#Number of active Kagglers by year
df_list = ['Users','Submissions','TeamMemberships']
for i in range(len(df_list)):
    globals()[df_list[i]]=pd.read_csv('/kaggle/input/meta-kaggle/'+df_list[i]+'.csv')
    
Users['RegisterDate']=pd.to_datetime(Users['RegisterDate'])
Users['RegYear']=Users['RegisterDate'].dt.year


fig, ax = plt.subplots(1, 1, figsize=(12,4))
fig.suptitle('Number of new Kagglers registering per year', fontsize=20, y=1.0)
    
# Plot once; looping here would redraw the same bars and duplicate every label
Users['RegYear'].value_counts(ascending=True).plot.bar(ax=ax, width=0.3, color='#BDC2BB')
ax.spines["top"].set_visible(False)
ax.spines["right"].set_visible(False)
ax.spines["left"].set_visible(False)

y_axis = ax.axes.get_yaxis()
y_axis.set_visible(False)

# Label each bar with its count (patch heights are floats, so format without decimals)
for p in ax.patches:
    width, height = p.get_width(), p.get_height()
    x, y = p.get_xy()
    ax.annotate(f'{height:.0f}', (x+0.15, height+25000), fontsize=12, ha='center', weight='normal', size='large')
            
plt.tight_layout()
plt.show()    

count=Users.shape[0]
Submissions['SubmissionDate']=pd.to_datetime(Submissions['SubmissionDate'])
Submissions['Year'] = Submissions['SubmissionDate'].dt.year
active=Submissions[Submissions['Year'].isin(range(2017,2021))]
active=active[['TeamId','SubmittedUserId', 'Year']].drop_duplicates(subset=['Year','TeamId'])
active=active.merge(TeamMemberships.drop(columns=['Id', 'RequestDate']), on=['TeamId'])
active=active.drop_duplicates(subset=['Year','UserId'])
active.Year = active.Year.astype(str)
active=pd.DataFrame(active['Year'].value_counts()).sort_index()
active.columns=['Active']


#data preparation
survey_2017MCQ = pd.read_csv('/kaggle/input/kaggle-survey-2017/multipleChoiceResponses.csv',encoding='latin1')
survey_2018MCQ = pd.read_csv('/kaggle/input/kaggle-survey-2018/multipleChoiceResponses.csv',encoding='latin1')
survey_2019MCQ = pd.read_csv('/kaggle/input/kaggle-survey-2019/multiple_choice_responses.csv')
survey_2020MCQ = pd.read_csv('/kaggle/input/kaggle-survey-2020/kaggle_survey_2020_responses.csv')
conversion = pd.read_csv('/kaggle/input/kaggle-survey-2017/conversionRates.csv')


for i in range(2018,2021):
    globals()['survey_'+str(i)+'Q']=pd.DataFrame(eval('survey_'+str(i)+'MCQ').iloc[[0]]).T
    globals()['survey_'+str(i)+'Q'].columns=['questions']
    globals()['survey_'+str(i)+'MCQ'] = eval('survey_'+str(i)+'MCQ').drop(eval('survey_'+str(i)+'MCQ').index[0])


#format gender variable uniformly
survey_2020MCQ.loc[survey_2020MCQ['Q2'] == 'Man', 'Q2'] = 'Male'
survey_2020MCQ.loc[survey_2020MCQ['Q2'] == 'Woman', 'Q2'] = 'Female'

#survey_2017MCQ['Age'] = pd.to_numeric(survey_2017MCQ['Age'], errors='coerce')

age_list=[21,24,29,34,39,44,49,54,59]
a=len(age_list)
x1=x2=0
#Create Age-groups
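for i in range(1, a):
    x2 = age_list[i-1] + 1   # lower bound of the bracket
    x1 = age_list[i]         # upper bound of the bracket (inclusive)
    p = str(x2) + "-" + str(x1)
    # range() excludes its end point, so use x1+1 to keep the upper bound in the bracket
    survey_2017MCQ.loc[survey_2017MCQ['Age'].isin(range(x2, x1+1)), 'Agegroup'] = p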
    
survey_2017MCQ.loc[survey_2017MCQ['Age']<=21, 'Agegroup'] = "<=21"
survey_2017MCQ.loc[survey_2017MCQ['Age']>=60, 'Agegroup'] = ">=60"
        

var_list=['Country','Q3','Q3','Q3']
country_list=['States','Kingdom','China','Iran','Emirates','disclose']
short_list=['U.S.A.','U.K.','China','Iran','U.A.E.','Other']
agevar_list=['Age','Q2','Q1','Q1']
qualvar_list=['FormalEducation','Q4','Q4','Q4']

for i in range(4):
    df=globals()['survey_'+str(i+2017)+'MCQ']
    df=df.fillna('NA')
    for j in range(len(short_list)):
        df.loc[df[var_list[i]].str.contains(country_list[j], na=False), var_list[i]] = short_list[j]
        globals()['survey_'+str(i+2017)+'MCQ']=df
    
    if i>0:
        df.loc[df[agevar_list[i]].isin(['60-69','70-79','70+','80+']), agevar_list[i]] = '>=60'
        df.loc[df[agevar_list[i]]=='18-21', agevar_list[i]] = '<=21'
        
    for j in range(len(qualvar_list)):
        df.loc[df[qualvar_list[i]].str.contains('Some', na=False), qualvar_list[i]] = 'College dropout'
        df.loc[df[qualvar_list[i]].str.contains('high', na=False), qualvar_list[i]] = 'High School'
        df.loc[df[qualvar_list[i]].str.contains('Bach', na=False), qualvar_list[i]] = 'Bachelors'
        df.loc[df[qualvar_list[i]].str.contains('Mast', na=False), qualvar_list[i]] = 'Masters'
        df.loc[df[qualvar_list[i]].str.contains('refer', na=False), qualvar_list[i]] = 'NA'
        df[qualvar_list[i]] = df[qualvar_list[i]].str.replace(' degree','')
    

However, compared to these numbers, the ~20k users who participated in the 2020 survey look tiny, which suggests that only a small fraction of the ~6M registered users are actively Kaggling. To check, let's chart the number of tiered users. As suspected, **less than 2% of all registered users, i.e. 102,419 users on Kaggle, are tiered** (performance tiers 1-4, 4 being the best tier).
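
As a quick sanity check on the "less than 2%" figure, the share can be recomputed from the two numbers quoted above. This is only a rough sketch: the 6,000,000 total is the rounded "~6M" from the text, not the exact user count, so the result is approximate.

```python
# Rough check of the tiered-user share, using the rounded figures quoted in the text.
tiered_users = 102_419          # users in performance tiers 1-4 (from the text)
registered_users = 6_000_000    # assumption: "~6M" rounded to an even 6 million

tiered_share = tiered_users / registered_users
print(f"Tiered share: {tiered_share:.1%}")   # prints "Tiered share: 1.7%"
```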

In [None]:
Users=Users[~Users['PerformanceTier'].isin([0,5])]
user_count=pd.DataFrame(Users['PerformanceTier'].value_counts())
user_count=user_count[user_count.index!=0]
perf0=1-int(user_count.sum())/count
user_count.sort_values(by=['PerformanceTier'], ascending=False,inplace=True)
user_count=user_count.T

fig, ax = plt.subplots(figsize=(15,1))  
ax = plt.axes()
g=sns.heatmap(user_count, annot=True, fmt='g', yticklabels='',  annot_kws={"size": 14},ax=ax)#, cbar_kws={"orientation": "horizontal", "shrink": 10})

footnote='Total no. of registered users on Kaggle = ' + str(count) + '\n' + 'Untiered Kagglers (i.e. Performance Tier=0) =' + str("{:.0%}".format(perf0))
                
ax.set_title('Less than 2% of all registered users are tiered', pad=10)
plt.annotate(footnote, xy=(-0.05, -1.25), xycoords='axes fraction')
plt.xlabel('Performance Tiers', fontsize=12)
plt.tight_layout(pad=2.0)
plt.show()

Similarly, the number of active users in a year, i.e. users who made at least one submission that year, is consistently low. Fewer than 80k users have made a submission on Kaggle in 2020 so far. So it turns out that every year **at least 1 in 5 active users fills in the survey**.\
\
<u>**Takeaway: The survey data is large enough to be representative of the active Kaggler community.**</u>
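
To make the definition of an "active" Kaggler concrete, here is a minimal sketch on made-up data. The column names mirror the meta-kaggle `Submissions` and `TeamMemberships` tables used in this notebook, but every value below is invented for illustration.

```python
import pandas as pd

# Toy stand-ins for the meta-kaggle tables (all values are made up).
submissions = pd.DataFrame({
    "TeamId": [1, 1, 2, 3, 3],
    "SubmissionDate": pd.to_datetime(
        ["2019-03-01", "2019-06-15", "2019-07-04", "2020-01-10", "2020-02-20"]
    ),
})
team_memberships = pd.DataFrame({
    "TeamId": [1, 1, 2, 3],
    "UserId": [10, 11, 10, 12],
})

# One row per (year, team) that submitted at least once.
submissions["Year"] = submissions["SubmissionDate"].dt.year
team_years = submissions[["Year", "TeamId"]].drop_duplicates()

# Expand teams to their members, then count each user once per year.
user_years = team_years.merge(team_memberships, on="TeamId")
user_years = user_years.drop_duplicates(subset=["Year", "UserId"])
active_per_year = user_years.groupby("Year")["UserId"].nunique()

print(active_per_year)
```

User 10 belongs to two teams that both submitted in 2019 but is counted only once for that year, which is the point of the second de-duplication step.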

In [None]:
#Kaggling community - gender-wise participation
var_list=['GenderSelect','Q1','Q2','Q2']

for i in range(4):
    x=str(i+2017)
    globals()['g'+x]=pd.DataFrame(eval('survey_'+x+'MCQ')[var_list[i]].value_counts(dropna=False))
    globals()['g'+x].columns=[x]
    df=globals()['g'+x]
    df.loc['LGBTQA+ /Not-specified']=df[~df.index.str.contains('ale', na=False)].sum(axis=0)
    df=df.loc[['Male','Female','LGBTQA+ /Not-specified']]
    df.loc['Overall']=df.sum(axis=0)
    globals()['g'+x]=df
    
    if i>0:
        gender=gender.merge(df,left_index=True, right_index=True)
    else:
        gender=df
        
gender=gender.T
gender=gender.merge(active, left_index=True, right_index=True)
gender['%Participation']=gender['Overall']/gender['Active']

plt.rcParams['axes.labelsize'] = 20

fig, ax1 = plt.subplots(figsize=(10,5))
ax2 = ax1.twinx()  # set up the 2nd axis
ax1.bar(width=0.3,height=gender['Active'],x=gender.index, color='#A2B59F', label='Kagglers with >=1 submission') 
ax2.plot(gender.index,gender['%Participation'], 
         color='#FF6600', linestyle='--', marker='o', markersize=8,
            label='Survey participation (%)')

y1label = '\n'.join((r'No. of Kagglers with',
                         'at least 1 submission')) 

ax1.set_title('More than 20% of active Kagglers* participate in the survey every year', y=1.1)
ax1.set_ylabel(y1label, fontsize=14)
ax2.set_ylabel('Survey participation(%)', fontsize=14)

# Hide every spine except the bottom one on both axes
for axis in (ax1, ax2):
    for name, spine in axis.spines.items():
        if name != 'bottom':
            spine.set_visible(False)
    
ax2.yaxis.set_major_formatter(ticker.PercentFormatter(decimals=0,xmax=1))

for p in  ax1.patches:
        width, height = p.get_width(), p.get_height()
        x, y = p.get_xy() 
        ax1.annotate(height, (x, height*1.05), fontsize=10)
        
#labels=['Kagglers with >=1 submission','Survey participation (%)']

footnote="\n Note: Users who made at least one submission in the year have been considered to be active Kagglers*"
fig.subplots_adjust(wspace=0.0, hspace=0, top=0.2, bottom=.1)
fig.legend(bbox_to_anchor=(.88, 0.75), frameon=True, prop={'size': 8})
plt.tight_layout(pad=1.0)
plt.annotate(footnote, xy=(0, -0.2), xycoords='axes fraction')
plt.show()

2. **How is the gender-wise development in the community?** \
Charting the count of survey participants over the years and splitting it by gender, we find that the total number of participants has remained almost unchanged from last year, and the same holds for male and LGBTQA+ (including not specified) participants. In contrast, <u>**there is a sharp increase in the number of female participants in the Kaggle survey in 2020.**</u>

In [None]:
colors = ['#E8E7D2', '#C9BA9B', '#BDC2BB', '#FFD0A6']
var_list=['Overall','Female','Male','LGBTQA+ /Not-specified'] 
fig, ax = plt.subplots(2, 2, figsize=(12,8))
fig.suptitle('Female Kagglers are on the rise', fontsize=20, y=1.05)
    
for i in range(4):
    gender[var_list[i]].plot.bar(ax=ax[int(i/2)][i%2], color=[colors[i]], legend=True, width=0.3)
    ax[int(i/2)][i%2].spines["top"].set_visible(False)
    ax[int(i/2)][i%2].spines["right"].set_visible(False)
    ax[int(i/2)][i%2].spines["left"].set_visible(False)
    
    ax[int(i/2)][i%2].legend(loc='upper center', bbox_to_anchor= (0.4, 1.5), ncol=1,  # 'top center' is not a valid loc
                             borderaxespad=0, frameon=False, fontsize=12)
    
    
    y_axis = ax[int(i/2)][i%2].axes.get_yaxis()
    y_axis.set_visible(False)
    
    for p in  ax[int(i/2)][i%2].patches:
        width, height = p.get_width(), p.get_height()
        x, y = p.get_xy() 
        ax[int(i/2)][i%2].annotate(height, (x, height*1.05), fontsize=12)
        if (1<x<2):
            x1=x
            h1=height
            

    ax[int(i/2)][i%2].annotate('', 
                               xy=(x+width*1.2, height*0.85),  xycoords='data',
                               xytext=(x1-width/4, h1*0.85), textcoords='data',
                               arrowprops=dict(facecolor='black', shrink=0.2),
                               horizontalalignment='right', verticalalignment='top',
                               )

    textstr = '\n'.join((r'$2019-20:$',
                         'Total number of Kagglers surveyed remained stable...'))
    ax[0][0].text(0.0, 1.3, textstr, transform=ax[0][0].transAxes, fontsize=14,
        verticalalignment='top', c='#808080')
    
    textstr = '\n'.join((r'$2019-20:$',
                         '...number of female Kagglers increased')) 
    ax[0][1].text(0.18, 1.3, textstr, transform=ax[0][1].transAxes, fontsize=14,
        verticalalignment='top', c='#FF8547')
            
plt.tight_layout(pad=1.0)
plt.show()    

The percentage data makes this picture clearer. In 2020, female participants constituted ~20% of all survey participants. This is a sharp increase from the ~16% level of the previous three years (2017-2019).

<u>**Takeaway: Female Kagglers are on the rise.**</u>

In [None]:
gender['%Female']=gender['Female']/gender['Overall']
gender['%LGBTQA+ /Not-specified']=gender['LGBTQA+ /Not-specified']/gender['Overall']

colors = ['#C9BA9B','#FFD0A6']
var_list = ['%Female', '%LGBTQA+ /Not-specified']

ax = gender[var_list].plot(kind = 'bar',
                           stacked = True,
                           width = 0.3, 
                           align='center', 
                           figsize=(10,4),
                           legend=True,
                           color=['#C9BA9B','#FFD0A6'])

#labels = [fill(l, 15) for l in gender[var_list].columns]
labels= ['%Female', '\n'.join((r'%LGBTQA+ /' ,r'Not specified'))]

ax.legend(labels, loc='lower left', bbox_to_anchor= (1.01, 0.0), ncol=1,
          borderaxespad=0, frameon=False, fontsize=12, labelspacing=2)

for p in ax.patches:
    width, height = p.get_width(), p.get_height()
    x, y = p.get_xy() 
    ax.annotate(f'{height:.1%}', (x + width/2, y + height/2), ha='center', va='center',fontsize=12)
    
    if y==0:
        if (1<x<2):
            x1=x
            h1=height
        elif (x>=2):
            ax.annotate('',
                        xy=(x+width*1.2, height*0.7),  xycoords='data',
                        xytext=(x1-width/4, h1*0.7), textcoords='data',
                        arrowprops=dict(facecolor='black', shrink=0.2),
                        horizontalalignment='right', verticalalignment='top',
                       )
        
    
ax.spines['right'].set_visible(False)
ax.spines['top'].set_visible(False)
ax.spines['left'].set_visible(False)

ax.set_title('Year-wise percentage of female/ LGBTQA+ Kagglers', fontsize=20,pad=20)

y_axis = ax.axes.get_yaxis()
y_axis.set_visible(False)

plt.tight_layout() 
plt.show()

3. **Where do the Kagglers reside?** \
**40% of the community is concentrated in just the top 2 countries, India and the U.S.A.**, and this has been the case since 2017. However, since 2019 India has overtaken the U.S.A. to become the country with the largest number of Kagglers. Given that India is home to more than 1 in 6 people in the world (source: [Worldometer data](https://www.worldometers.info/world-population/#top20)), this seems like a natural progression.
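
The country shares in this section are relative frequencies of the country column. A minimal sketch of that calculation, using an invented list of ten respondents rather than the survey data:

```python
import pandas as pd

# Made-up country answers for ten respondents.
countries = pd.Series([
    "India", "India", "India", "India",
    "U.S.A.", "U.S.A.",
    "Brazil", "Japan", "Germany", "Other",
])

# Share of respondents per country, largest first.
shares = countries.value_counts(normalize=True)

# Drop the catch-all bucket and keep the two largest countries.
top = shares[shares.index != "Other"].head(2)

print(top)
print(f"Combined share of the top 2: {top.sum():.0%}")
```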


In [None]:
#Top 5 Countries where most Kagglers reside in
var_list=['GenderSelect','Q1','Q2','Q2']
countryvar_list = ['Country','Q3','Q3','Q3']
colors = ['#E8E7D2', '#C9BA9B', '#BDC2BB', '#FFD0A6']


for i in range(4):
    x=str(i+2017)
    df=globals()['survey_'+x+'MCQ']
    df=pd.DataFrame(df[countryvar_list[i]].value_counts(normalize=True,ascending=False,dropna=False))
    df.columns=[x]
    df=df[df.index!='Other'].head(n=5)
    df.sort_values(by=[x], inplace=True)
    globals()['all'+x]=df   
    

fig, ax = plt.subplots(2, 2, figsize=(12,6))
fig.suptitle('Top 5 countries where Kagglers reside in', fontsize=20, y=1.05)

        
for i in range(4):
    x=str(i+2017)
    df=globals()['all'+x]
    df[x].plot.barh(ax=ax[int(i/2)][i%2], color=[colors[i]], legend=True, width=0.4)
    ax[int(i/2)][i%2].spines["bottom"].set_visible(False)
    ax[int(i/2)][i%2].spines["top"].set_visible(False)
    ax[int(i/2)][i%2].spines["right"].set_visible(False)
    ax[int(i/2)][i%2].set_yticklabels(df.index, fontsize=12)
    
    ax[int(i/2)][i%2].legend(loc='upper center', bbox_to_anchor= (0.45, 1.02), ncol=1,  # 'top' is not a valid loc
                             borderaxespad=0, frameon=True, fontsize=14)
    
    
    x_axis = ax[int(i/2)][i%2].axes.get_xaxis()
    x_axis.set_visible(False)
    
    for p in  ax[int(i/2)][i%2].patches:
        width, height = p.get_width(), p.get_height()
        x, y = p.get_xy() 
        #print("x=",x, ", y= ", y, " width=", width)
        ax[int(i/2)][i%2].annotate(f'{width:.1%}', (width*1.05, y), fontsize=12)

              
plt.tight_layout(pad=2.0)
plt.show()    

Almost half of all female/ LGBTQA+ Kagglers (45%, to be precise) live in these two countries, India and the U.S.A.

In [None]:
#Top 5 Countries where most female/ LGBTQA+ Kagglers reside in
var_list=['GenderSelect','Q1','Q2','Q2']
countryvar_list = ['Country','Q3','Q3','Q3']
colors = ['#E8E7D2', '#C9BA9B', '#BDC2BB', '#FFD0A6']


for i in range(4):
    x=str(i+2017)
    df=globals()['survey_'+x+'MCQ']
    df=pd.DataFrame(df[df[var_list[i]]!='Male'][countryvar_list[i]].value_counts(normalize=True,ascending=False,dropna=False))
    df.columns=[x]
    df=df[~df.index.isin([np.nan,'Other'])].head(n=5)
    df.sort_values(by=[x], inplace=True)
    globals()['g'+x]=df   
    
    

fig, ax = plt.subplots(2, 2, figsize=(12,6))
fig.suptitle('Top 5 countries where female/ LGBTQA+ Kagglers reside in', fontsize=20, y=1.05)

        
for i in range(4):
    x=str(i+2017)
    df=globals()['g'+x]
    df[x].plot.barh(ax=ax[int(i/2)][i%2], color=[colors[i]], legend=True, width=0.4)
    ax[int(i/2)][i%2].spines["top"].set_visible(False)
    ax[int(i/2)][i%2].spines["right"].set_visible(False)
    ax[int(i/2)][i%2].spines["bottom"].set_visible(False)
    ax[int(i/2)][i%2].set_yticklabels(df.index, fontsize=12)
    
    ax[int(i/2)][i%2].legend(loc='upper center', bbox_to_anchor= (0.45, 1.02), ncol=1,  # 'top' is not a valid loc
                             borderaxespad=0, frameon=True, fontsize=12)
    
    
    x_axis = ax[int(i/2)][i%2].axes.get_xaxis()
    x_axis.set_visible(False)
    
    for p in  ax[int(i/2)][i%2].patches:
        width, height = p.get_width(), p.get_height()
        x, y = p.get_xy() 
        #print("x=",x, ", y= ", y, " width=", width)
        ax[int(i/2)][i%2].annotate(f'{width:.1%}', (width*1.05, y), fontsize=12)

              
plt.tight_layout(pad=2.0)
plt.show()        


Looking at the actual numbers of participants in India and U.S.A, we find that while Indians are on the rise on Kaggle, Americans are receding!

**<u>Takeaways: </u>**
1. **1 in 4 Kagglers is Indian, and 1 in 3 female/ LGBTQA+ Kagglers is Indian.**
2. **The Indian cohort expanded in each of the past 3 years (both in overall numbers and in the number of female/ LGBTQA+ Kagglers).**
3. **The number of Kagglers residing in the U.S.A., overall and female/ LGBTQA+, has been shrinking in absolute terms!**

In [None]:
df_list = ['all', 'g']
counter=0
title_list = ['%Overall', '%female/ LGBTQA+','Overall', 'female/ LGBTQA+']
for i in range (2):
    globals()[df_list[i]]=pd.DataFrame()
    for j in range(4):
        x=str(j+2017)
        df=globals()[df_list[i]+x]
        df=df.loc[['India', 'U.S.A.']]
        if j==0:
            globals()[df_list[i]]=df
        else:
            globals()[df_list[i]]=pd.merge(globals()[df_list[i]],df,left_index=True, right_index=True)
    
    globals()[df_list[i]]=globals()[df_list[i]].T  
    
    
for i in range(4):
    x=str(i+2017)
    df1=globals()['survey_'+x+'MCQ']
    df1=df1[df1[countryvar_list[i]].isin(['India', 'U.S.A.'])]
    df1=pd.DataFrame(df1[countryvar_list[i]].value_counts(dropna=False))
    df1.columns=[x]
    df2=globals()['survey_'+x+'MCQ']
    df2=df2[df2[countryvar_list[i]].isin(['India', 'U.S.A.'])]
    df2=pd.DataFrame(df2[df2[var_list[i]]!='Male'][countryvar_list[i]].value_counts(dropna=False))
    df2.columns=[x]
    if i==0:
        InUS=df1
        fem=df2
    else:
        InUS=InUS.merge(df1,left_index=True, right_index=True)
        fem=fem.merge(df2,left_index=True, right_index=True)

InUS=InUS.T
fem=fem.T

InUS=InUS[['India', 'U.S.A.']]
fem=fem[['India', 'U.S.A.']]
    
c=2
df_list = ['all', 'g','InUS', 'fem']

fig, ax = plt.subplots(2, c, figsize=(12,8))
fig.suptitle('Kagglers residing in India vs. U.S.A.', fontsize=20, y=1.05)

for i in range (c*2):
    df=globals()[df_list[i]]
    l=df.plot.line(ax=ax[int(i/c)][i%c],legend=False, 
                   markersize=8, marker='o',
                   linestyle='--',color=['#FF8547', '#A2B59F'])
    
    ax[int(i/c)][i%c].spines['right'].set_visible(False)
    ax[int(i/c)][i%c].spines['top'].set_visible(False)
    ax[int(i/c)][i%c].spines['left'].set_visible(False)

    ax[int(i/c)][i%c].set_title(title_list[i], fontsize=14,pad=20)

    
    for x, y in zip(range(4),df.iloc[:,1].to_list()):
        
        if i<=1:
            label = "{:.0%}".format(y)
        else: 
            label = "{:.0f}".format(y)
            
        if i==2:
            off=-20
            
            
        ax[int(i/c)][i%c].annotate(label,
                                   (x,y),
                                   textcoords="offset points",
                                   xytext=(0,10),
                                   ha='center',
                                   color='#808080',
                                   fontsize=12)
        
    
    for x, y in zip(range(4),df.iloc[:,0].to_list()):
        off1=0
        
        if i<=1:
            label = "{:.0%}".format(y)
        else: 
            label = "{:.0f}".format(y)
            
        if i==0:
            off=-20
            
        if i in ([1,3]):
            off=10
            
        if i in ([2,4]):
            off1=10
            
            
        ax[int(i/c)][i%c].annotate(label,
                                   (x,y),
                                   textcoords="offset points",
                                   xytext=(off1,off),
                                   ha='center',
                                   color='#FF8547',
                                   fontsize=12)
         


    y_axis = ax[int(i/c)][i%c].axes.get_yaxis()
    y_axis.set_visible(False)
    
plt.legend(bbox_to_anchor=(-0.1, 1.05), loc='upper center', frameon=True, borderaxespad=0.)  # 'top center' is not a valid loc
plt.tight_layout(pad=3.0)
plt.show()

4. **How old are the Kagglers?**\
By age, 25-29 year olds continue to dominate Kaggle, though younger Kagglers are fast catching up.

In [None]:
#Age of Kagglers
var_list=['Agegroup','Q2','Q1','Q1']
colors = ['#E8E7D2', '#C9BA9B', '#BDC2BB', '#FFD0A6']


for i in range(4):
    x=str(i+2017)
    df=globals()['survey_'+x+'MCQ']
    df=pd.DataFrame(df[var_list[i]].value_counts(normalize=True))
    df.columns=[x]
    # DataFrame.append was removed in pandas 2.0; pd.concat gives the same result
    globals()['all'+x]=pd.concat([df[df.index=='<=21'], df[df.index!='<=21'].sort_index()])
    df=globals()['all'+x]
    if i==0:
        all=df
    else:
        all=all.merge(df, left_index=True, right_index=True)    
    

fig, ax = plt.subplots(2, 2, figsize=(12,6))
fig.suptitle('Kagglers by age group', fontsize=20, y=1.05)

        
for i in range(4):
    x=str(i+2017)
    all[x].plot.bar(ax=ax[int(i/2)][i%2], color=[colors[i]], legend=True, width=0.3)
    ax[int(i/2)][i%2].spines["top"].set_visible(False)
    ax[int(i/2)][i%2].spines["right"].set_visible(False)
    ax[int(i/2)][i%2].spines["left"].set_visible(False)
    
    ax[int(i/2)][i%2].legend(loc='upper right', bbox_to_anchor= (0.85, 0.6), ncol=1,  # 'top right' is not a valid loc
                             borderaxespad=0, frameon=False, fontsize=12)
    
    
    y_axis = ax[int(i/2)][i%2].axes.get_yaxis()
    y_axis.set_visible(False)
    
    for p in  ax[int(i/2)][i%2].patches:
        width, height = p.get_width(), p.get_height()
        x, y = p.get_xy() 
        ax[int(i/2)][i%2].annotate(f'{height:.1%}', (x-.15, height+.01), fontsize=12)

            
plt.tight_layout(pad=2.0)
plt.show()    

* Users aged 21 or below grew most consistently.
* The user base in the 22-24 age bracket has grown considerably as well, from 9% in 2017 to 19% in 2020.
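
The age brackets compared above can also be built from raw ages with `pandas.cut`. The sketch below uses made-up ages that sit on the bracket edges; the bracket labels follow the ones used in this notebook, and the bins are right-inclusive so that, for example, age 24 lands in "22-24".

```python
import pandas as pd

# Made-up ages, chosen to sit on the bracket edges.
ages = pd.Series([18, 21, 22, 24, 25, 29, 30, 59, 60, 72])

# Bin edges are right-inclusive: (-inf, 21], (21, 24], (24, 29], ...
edges = [float("-inf"), 21, 24, 29, 34, 39, 44, 49, 54, 59, float("inf")]
labels = ["<=21", "22-24", "25-29", "30-34", "35-39",
          "40-44", "45-49", "50-54", "55-59", ">=60"]

age_group = pd.cut(ages, bins=edges, labels=labels, right=True)

# Share of respondents per bracket, in bracket order.
print(age_group.value_counts(normalize=True).sort_index())
```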

In [None]:
cmap = plt.cm.Dark2
c=5
fig, ax = plt.subplots(2, c, figsize=(15,6))
fig.suptitle('Kagglers by age group', fontsize=20, y=1.02)

for i in range(all.shape[0]):
    colors = [cmap(i) for i in range(all.shape[0])]
    all.iloc[i].T.plot.line(ax=ax[int(i/c)][i%c], legend=True, color=colors[i],
                           markersize=8, marker='o',linestyle='--')
    ax[int(i/c)][i%c].spines["top"].set_visible(False)
    ax[int(i/c)][i%c].spines["right"].set_visible(False)
    ax[int(i/c)][i%c].spines["left"].set_visible(False)
    ax[int(i/c)][i%c].yaxis.set_major_formatter(ticker.PercentFormatter(decimals=1,xmax=1))
    
    ax[int(i/c)][i%c].legend(loc='center', bbox_to_anchor= (0.65, 1.6), ncol=1, 
                             borderaxespad=0, frameon=True, fontsize=12)
    
    
    y_axis = ax[int(i/c)][i%c].axes.get_yaxis()
    y_axis.set_visible(False)
    
    for x, y in zip(range(4),all.iloc[i].to_list()):
        label = "{:.0%}".format(y)
        ax[int(i/c)][i%c].annotate(label,
                                   (x,y),
                                   textcoords="offset points",
                                   xytext=(5,10),
                                   ha='center',
                                   color=colors[i],
                                   fontsize=14)

            
plt.tight_layout()
plt.show()   

* A closer look at the data reveals that the user demographics in India and U.S.A are starkly different.

* Indian users are predominantly below 30 years of age (~80%); the age group with the highest number of users is <=21 years (35% of users).

* Less than 5% of users are aged above 45.

In [None]:
#Age of Kagglers in India
var_list=['Agegroup','Q2','Q1','Q1']
colors = ['#E8E7D2', '#C9BA9B', '#BDC2BB', '#FFD0A6']


for i in range(4):
    x=str(i+2017)
    df=globals()['survey_'+x+'MCQ']
    df=df[df[countryvar_list[i]]=='India']
    df=pd.DataFrame(df[var_list[i]].value_counts(normalize=True))
    df.columns=[x]
    # DataFrame.append was removed in pandas 2.0; pd.concat gives the same result
    globals()['all'+x]=pd.concat([df[df.index=='<=21'], df[df.index!='<=21'].sort_index()])
    df=globals()['all'+x]
    if i==0:
        all=df
    else:
        all=all.merge(df, left_index=True, right_index=True)    
    

cmap = plt.cm.Dark2
c=5
fig, ax = plt.subplots(2, c, figsize=(15,6))
fig.suptitle('India - Kagglers by age group', fontsize=20, y=1.02)

for i in range(all.shape[0]):
    colors = [cmap(i) for i in range(all.shape[0])]
    all.iloc[i].T.plot.line(ax=ax[int(i/c)][i%c], legend=True, color=colors[i],
                           markersize=8, marker='o',linestyle='--')
    ax[int(i/c)][i%c].spines["top"].set_visible(False)
    ax[int(i/c)][i%c].spines["right"].set_visible(False)
    ax[int(i/c)][i%c].spines["left"].set_visible(False)
    ax[int(i/c)][i%c].yaxis.set_major_formatter(ticker.PercentFormatter(decimals=1,xmax=1))
    
    ax[int(i/c)][i%c].legend(loc='center', bbox_to_anchor= (0.65, 1.6), ncol=1, 
                             borderaxespad=0, frameon=True, fontsize=12)
    
    
    y_axis = ax[int(i/c)][i%c].axes.get_yaxis()
    y_axis.set_visible(False)
    
    for x, y in zip(range(4),all.iloc[i].to_list()):
        label = "{:.0%}".format(y)
        ax[int(i/c)][i%c].annotate(label,
                                   (x,y),
                                   textcoords="offset points",
                                   xytext=(5,10),
                                   ha='center',
                                   color=colors[i],
                                   fontsize=14)

            
plt.tight_layout()
plt.show()   

* In contrast, only 5% of users in the U.S.A. are aged <=21 years.
* Users there are much more uniformly distributed across age groups.
* A significant 8% of users are aged >=60 years, and the number of users in the older age groups is also increasing.

In [None]:
#Age of Kagglers in U.S.A.
var_list=['Agegroup','Q2','Q1','Q1']
colors = ['#E8E7D2', '#C9BA9B', '#BDC2BB', '#FFD0A6']


for i in range(4):
    x=str(i+2017)
    df=globals()['survey_'+x+'MCQ']
    df=df[df[countryvar_list[i]]=='U.S.A.']
    df=pd.DataFrame(df[var_list[i]].value_counts(normalize=True))
    df.columns=[x]
    # DataFrame.append was removed in pandas 2.0; pd.concat gives the same result
    globals()['all'+x]=pd.concat([df[df.index=='<=21'], df[df.index!='<=21'].sort_index()])
    df=globals()['all'+x]
    if i==0:
        all=df
    else:
        all=all.merge(df, left_index=True, right_index=True)    
    

cmap = plt.cm.Dark2
c=5
fig, ax = plt.subplots(2, c, figsize=(15,6))
fig.suptitle('U.S.A. - Kagglers by age group', fontsize=20, y=1.02)

for i in range(all.shape[0]):
    colors = [cmap(i) for i in range(all.shape[0])]
    all.iloc[i].T.plot.line(ax=ax[int(i/c)][i%c], legend=True, color=colors[i],
                           markersize=8, marker='o',linestyle='--')
    ax[int(i/c)][i%c].spines["top"].set_visible(False)
    ax[int(i/c)][i%c].spines["right"].set_visible(False)
    ax[int(i/c)][i%c].spines["left"].set_visible(False)
    ax[int(i/c)][i%c].yaxis.set_major_formatter(ticker.PercentFormatter(decimals=1,xmax=1))
    
    ax[int(i/c)][i%c].legend(loc='center', bbox_to_anchor= (0.65, 1.6), ncol=1, 
                             borderaxespad=0, frameon=True, fontsize=12)
    
    
    y_axis = ax[int(i/c)][i%c].axes.get_yaxis()
    y_axis.set_visible(False)
    
    for x, y in zip(range(4),all.iloc[i].to_list()):
        label = "{:.0%}".format(y)
        ax[int(i/c)][i%c].annotate(label,
                                   (x,y),
                                   textcoords="offset points",
                                   xytext=(5,10),
                                   ha='center',
                                   color=colors[i],
                                   fontsize=14)

            
plt.tight_layout()
plt.show()   

5. **How educated are the Kagglers (formally) ?**\
While users with a master's degree have consistently dominated the Kaggling community, the number of users with a bachelor's degree has started to catch up. There is a sharp drop in the percentage of doctoral candidates this year. Very few users (< 7%) possess no formal degree at all, and about 3% of users have a professional degree.

In [None]:
#Formal education
var_list=['FormalEducation','Q4','Q4','Q4']
colors = ['#E8E7D2', '#C9BA9B', '#BDC2BB', '#FFD0A6']


for i in range(4):
    x=str(i+2017)
    df=globals()['survey_'+x+'MCQ'] 
    df=pd.DataFrame(df[qualvar_list[i]].value_counts(normalize=True,ascending=False,dropna=False))
    df.columns=[x]
    df.sort_values(by=[x], inplace=True)
    globals()['all'+x]=df   
    

fig, ax = plt.subplots(2, 2, figsize=(12,6))
fig.suptitle('Formal education level of Kagglers', fontsize=20, y=1.05)

        
for i in range(4):
    x=str(i+2017)
    df=globals()['all'+x]
    df[x].plot.barh(ax=ax[int(i/2)][i%2], color=[colors[i]], legend=True, width=0.4)
    ax[int(i/2)][i%2].spines["bottom"].set_visible(False)
    ax[int(i/2)][i%2].spines["top"].set_visible(False)
    ax[int(i/2)][i%2].spines["right"].set_visible(False)
    ax[int(i/2)][i%2].set_yticklabels(df.index, fontsize=12)
    
    ax[int(i/2)][i%2].legend(loc='upper center', bbox_to_anchor= (0.45, 1.02), ncol=1,  # 'top' is not a valid loc
                             borderaxespad=0, frameon=True, fontsize=14)
    
    
    x_axis = ax[int(i/2)][i%2].axes.get_xaxis()
    x_axis.set_visible(False)
    
    for p in  ax[int(i/2)][i%2].patches:
        width, height = p.get_width(), p.get_height()
        x, y = p.get_xy() 
        #print("x=",x, ", y= ", y, " width=", width)
        ax[int(i/2)][i%2].annotate(f'{width:.1%}', (width*1.05, y), fontsize=12)


plt.tight_layout(pad=2.0)
plt.show()    


<u>**Takeaways:**</u>
1. Users with higher degrees dominate Kaggle for the time being (more than half of the users have doctoral, master's, or professional degrees).
2. Users with only a bachelor's degree or no degree at all are fast catching up (reaching 50%).
3. The trend is similar across genders.

<u> *These observations are in line with the earlier observation that younger users are taking up Kaggling in growing numbers every year.*</u>
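
To make the split behind these takeaways concrete, here is a minimal sketch of how answer counts can be grouped into "higher degree" versus "bachelor's or no degree". The counts are made-up illustrative numbers, not survey results, and `degree_shares` is a hypothetical helper; only the degree labels follow the ones used in this notebook.

```python
import pandas as pd

# Hypothetical answer counts for one survey year (illustrative numbers only).
counts = pd.Series(
    {"Masters": 40, "Bachelors": 35, "Doctoral": 12,
     "Professional": 3, "College dropout": 6, "High School": 4}
)

# Labels treated as a higher degree, as in the analysis in this notebook.
HIGHER = ["Masters", "Doctoral", "Professional"]

def degree_shares(counts: pd.Series) -> pd.Series:
    """Return the share of higher-degree vs. bachelor's/no-degree respondents."""
    total = counts.sum()
    # reindex guards against a label that is missing in a given year
    higher = counts.reindex(HIGHER, fill_value=0).sum() / total
    return pd.Series({"Higher degree": higher, "Bachelors/ No degree": 1 - higher})

shares = degree_shares(counts)
print(shares)
```

Because the two groups are complementary, computing one share is enough; the other is simply one minus it.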

In [None]:
#Formal education
gendvar_list=['GenderSelect','Q1','Q2','Q2']
qualvar_list=['FormalEducation','Q4','Q4','Q4']
colors = ['#A2B59F', '#FF8547']
title_list = ['Overall', 'female/ LGBTQA+']


for i in range(4):
    x=str(i+2017)
    df=globals()['survey_'+x+'MCQ'] 
    df1=df[df[gendvar_list[i]]!='Male']
    df=pd.DataFrame(df[qualvar_list[i]].value_counts(ascending=False,dropna=False))
    df.columns=[x]
    df1=pd.DataFrame(df1[qualvar_list[i]].value_counts(ascending=False,dropna=False))
    df1.columns=[x]
    
    if i==0:
        all=df
        fem=df1
    else:
        all=all.merge(df, left_index=True, right_index=True)
        fem=fem.merge(df1, left_index=True, right_index=True)
        
all=all.T
fem=fem.T
all['All']=all.sum(axis=1)
fem['All']=fem.sum(axis=1)

#all['Higher degree']=(all['Masters']+all['Doctoral']+all['Professional'])/all['All']
all['Bachelors/ No degree']=(all['Bachelors']+all['College dropout']+all['High School']+all['NA'])/all['All']

#fem['Higher degree']=(fem['Masters']+fem['Doctoral']+fem['Professional'])/fem['All']
fem['Bachelors/ No degree']=(fem['Bachelors']+fem['College dropout']+fem['High School']+fem['NA'])/fem['All']


all=all[['Bachelors/ No degree']]
fem=fem[['Bachelors/ No degree']]
    
df_list = ['all', 'fem']

fig, ax = plt.subplots(len(df_list), 1, figsize=(12,8))
fig.suptitle('%Kagglers who either have a Bachelors degree or no degree at all', fontsize=20, y=1.05)
 

for i in range(len(df_list)):
    df=globals()[df_list[i]]
    l=df.plot.line(ax=ax[i],legend=False, 
                   markersize=8, marker='o',
                   linestyle='--',color=colors)
    
    ax[i].spines['right'].set_visible(False)
    ax[i].spines['top'].set_visible(False)
    ax[i].spines['left'].set_visible(False)

    ax[i].set_title(title_list[i], fontsize=16,pad=25)
   

    y_axis = ax[i].axes.get_yaxis()
    y_axis.set_visible(True)
    y_axis.set_major_formatter(ticker.PercentFormatter(1.0, decimals=0))
    
    for j in range(1):
        for x, y in zip(range(4),df.T.iloc[j].to_list()):
            label = "{:.0%}".format(y)
            ax[i].annotate(label,
                                       (x,y),
                                       textcoords="offset points",
                                       xytext=(5,10),
                                       ha='center',
                                       color=colors[j],
                                       fontsize=14)
    
    
# 'top center' is not a valid matplotlib legend location; 'upper center' is
plt.legend(bbox_to_anchor=(0.2, 1.25), loc='upper center', frameon=True, borderaxespad=0., prop={'size': 14})
plt.tight_layout(pad=3.0)
plt.show()  


* The majority of Kagglers in India have only a bachelor's degree or no degree at all.
* In contrast, the majority of Kagglers in the U.S.A. hold a higher degree (master's, doctoral or professional).

<u>**Takeaway**</u>

A lot of very young Indians are getting involved in Kaggling. Given their age, they are mostly students or youngsters with either no degree yet or only a bachelor's degree. Because these young Indians make up a large portion of the user base, their demographics dominate the survey data as a whole.
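
As a rough illustration of the country comparison, the sketch below computes, per country, the share of respondents without a higher degree. The toy rows and the column names `country` and `degree` are placeholders invented for this example; the real survey columns differ by year.

```python
import pandas as pd

HIGHER_DEGREES = ["Masters", "Doctoral", "Professional"]

# Hypothetical toy responses; the column names are placeholders, not the survey's.
toy = pd.DataFrame({
    "country": ["India"] * 4 + ["U.S.A."] * 4,
    "degree": ["Bachelors", "Bachelors", "Masters", "High School",
               "Masters", "Doctoral", "Bachelors", "Masters"],
})

# True where the respondent has a bachelor's degree or no degree at all
no_higher_degree = ~toy["degree"].isin(HIGHER_DEGREES)

# The mean of a boolean column within each country is the share of True values
share_by_country = no_higher_degree.groupby(toy["country"]).mean()
print(share_by_country)
```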

In [None]:
#Formal education
gendvar_list=['GenderSelect','Q1','Q2','Q2']
qualvar_list=['FormalEducation','Q4','Q4','Q4']
countryvar_list = ['Country','Q3','Q3','Q3']
colors = ['#A2B59F', '#FF8547']
clist=['India','U.S.A.']
all=pd.DataFrame()
fem=all

for i in range(4):
    x=str(i+2017)
    df=globals()['survey_'+x+'MCQ'] 
    df1=df[df[gendvar_list[i]]!='Male']
    df=df[df[countryvar_list[i]].isin(clist)]
    df1=df1[df1[countryvar_list[i]].isin(clist)]
    df=pd.DataFrame(df.groupby(countryvar_list[i])[qualvar_list[i]].value_counts(normalize=True,ascending=False,dropna=False))
    df=df.unstack(level=0)
    df=df.droplevel(None, axis=1)
    df.index.name=None
    df.columns.name=None
    df.columns=['India','U.S.A.']
    df=df.loc[~df.index.isin(['Masters','Doctoral','Professional'])]
    df.loc[x]=df.sum()
    df=df.loc[df.index.isin([x])]   
    all=pd.concat([all, df])  # DataFrame.append was removed in pandas 2.0
    
    df1=pd.DataFrame(df1.groupby(countryvar_list[i])[qualvar_list[i]].value_counts(normalize=True,ascending=False,dropna=False))
    df1=df1.unstack(level=0)
    df1=df1.droplevel(None, axis=1)
    df1.index.name=None
    df1.columns.name=None
    df1.columns=['India','U.S.A.']
    df1=df1.loc[~df1.index.isin(['Masters','Doctoral','Professional'])]
    df1.loc[x]=df1.sum()
    df1=df1.loc[df1.index.isin([x])]   
    fem=pd.concat([fem, df1])  # DataFrame.append was removed in pandas 2.0

df_list=['all','fem']
fig, ax = plt.subplots(1, 2, figsize=(12,4))
fig.suptitle('U.S.A. vs. India - %Kagglers with Bachelors/ No degree', fontsize=20, y=1.05)
 

for i in range(len(df_list)):
    df=globals()[df_list[i]]
    l=df.plot.line(ax=ax[i],legend=False, 
                   markersize=8, marker='o',
                   linestyle='--',color=colors)
    
    ax[i].spines['right'].set_visible(False)
    ax[i].spines['top'].set_visible(False)
    ax[i].spines['left'].set_visible(False)

    ax[i].set_title(title_list[i], fontsize=16,pad=25)
   

    y_axis = ax[i].axes.get_yaxis()
    y_axis.set_visible(True)
    y_axis.set_major_formatter(ticker.PercentFormatter(1.0, decimals=0))
    
    for j in range(2):
        for x, y in zip(range(4),df.T.iloc[j].to_list()):
            label = "{:.0%}".format(y)
            ax[i].annotate(label,
                                       (x,y),
                                       textcoords="offset points",
                                       xytext=(5,10),
                                       ha='center',
                                       color=colors[j],
                                       fontsize=14)
    
    
# 'top center' is not a valid matplotlib legend location; 'upper center' is
plt.legend(bbox_to_anchor=(-0.25, 0.75), loc='upper center', frameon=True, borderaxespad=0., prop={'size': 14})
plt.tight_layout(pad=3.0)
plt.show()  


6. **How much do Kagglers make in their day jobs?** \
Given the growing number of younger users with a bachelor's degree or lower, we would not expect the compensation that Kagglers earn in their day jobs to rise much. Before digging into the data, however, it is important to check how much compensation data is available for each year.

 - The percentage of users who disclosed their salary in 2017 is quite low (only 27%).
 - In 2018, this percentage went up sharply to 65%.
 - The percentages in 2019 and 2020 are a bit lower but still consistently above 50%.

<u> *Given the low data availability in 2017, we focus on 2018-20 only for the compensation analysis.* </u>

Since the pay scales vary vastly depending on the country of residence, it is imperative that we dissect the data by country.

We focus on our top 2 countries (i.e. the countries where the highest number of Kagglers reside): India and the U.S.A.
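
The disclosure percentages quoted above boil down to one ratio per year: respondents with a usable compensation answer divided by all respondents. Here is a minimal sketch on toy data. `disclosure_rate` is a hypothetical helper, the sample answers are invented, and treating missing values as the string `'NA'` is an assumption about how the survey frames were prepared earlier.

```python
import pandas as pd

# Answers that count as "not disclosed"
NOT_DISCLOSED = ["NA", "I do not wish to disclose my approximate yearly compensation"]

def disclosure_rate(answers: pd.Series) -> float:
    """Share of respondents who gave a usable compensation answer."""
    answers = answers.fillna("NA")  # assumption: missing values are recorded as 'NA'
    return float((~answers.isin(NOT_DISCLOSED)).mean())

# Hypothetical toy column with five respondents, two of whom disclosed a bucket
toy = pd.Series(["$0-999", None, "NA", "10,000-14,999", NOT_DISCLOSED[1]])
rate = disclosure_rate(toy)
print(f"{rate:.0%}")
```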


In [None]:
survey_2017MCQ = pd.read_csv('/kaggle/input/kaggle-survey-2017/multipleChoiceResponses.csv',encoding='latin1')
survey_2017MCQ['CompensationAmount']=pd.to_numeric(survey_2017MCQ['CompensationAmount'].str.replace(',',''), errors="coerce")
val=['NA','I do not wish to disclose my approximate yearly compensation']
comp_count=pd.DataFrame()
comp_count.loc[2017,'count']=survey_2017MCQ[~survey_2017MCQ.CompensationAmount.isnull()].shape[0]/ survey_2017MCQ.shape[0]
comp_count.loc[2018,'count']=survey_2018MCQ[~survey_2018MCQ.Q9.isin(val)].shape[0]/survey_2018MCQ.shape[0]
comp_count.loc[2019,'count']=survey_2019MCQ[~survey_2019MCQ.Q10.isin(val)].shape[0]/survey_2019MCQ.shape[0]
comp_count.loc[2020,'count']=survey_2020MCQ[~survey_2020MCQ.Q24.isin(val)].shape[0]/survey_2020MCQ.shape[0]

fig, ax = plt.subplots(1, 1, figsize=(7,4))
fig.suptitle('%Kagglers who disclosed their compensation data', fontsize=20, y=1.05)

comp_count.plot.bar(ax=ax, width=0.3, color=colors[0], legend=False)
ax.spines["top"].set_visible(False)
ax.spines["right"].set_visible(False)
ax.spines["left"].set_visible(False)
    
    
y_axis = ax.axes.get_yaxis()
y_axis.set_visible(False)
    
for p in ax.patches:
    width, height = p.get_width(), p.get_height()
    x, y = p.get_xy()
    # fontsize and size are aliases in matplotlib, so only one of them is passed
    ax.annotate(f'{height:0.0%}', (x+.15, height+.02), fontsize=12, ha='center', weight='normal')
            
plt.tight_layout(pad=2.0)
plt.show()

The USD compensation data is also consistent with the user demographics in India and the U.S.A., as well as with the purchasing power parity of the two countries.

In [None]:
comp=survey_2020MCQ[~survey_2020MCQ.Q24.isin(['NA'])][['Q2','Q3','Q24']]
comp=comp[comp.Q3.isin(['India','U.S.A.'])]
user_count=pd.DataFrame(comp.groupby(['Q3','Q24'])['Q24'].count())
user_count.columns=['Count']
user_count=user_count.unstack(level=0)
user_count.columns=['India','U.S.A.']
user_count[['B1','B2']] = user_count.index.to_series().str.split('-',expand=True)

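# regex=False treats '$' as a literal character rather than a regex anchor
user_count['B1'] = user_count['B1'].str.replace('$','', regex=False)
user_count['B1'] = user_count['B1'].str.replace(',','', regex=False)
user_count['B1'] = user_count['B1'].str.replace('>','', regex=False)
user_count['B1'] = user_count['B1'].str.replace('NA','', regex=False)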
user_count.B1=pd.to_numeric(user_count.B1, errors="coerce")

user_count.sort_values(by=['B1'], inplace=True)
user_count.drop(columns=['B1','B2'], inplace=True)
user_count.index.name=None

fig, ax = plt.subplots(2, 1, figsize=(12,8))
fig.suptitle('Number of Kagglers by $compensation, 2020', fontsize=20, y=1.0)

    
for i in range(2):
    user_count.iloc[:,i].plot.bar(ax=ax[i], width=0.3, color=colors[i])
    ax[i].spines["top"].set_visible(False)
    ax[i].spines["right"].set_visible(False)
    ax[i].spines["left"].set_visible(False)
    
    
    y_axis = ax[i].axes.get_yaxis()
    y_axis.set_visible(False)
    
    for p in ax[i].patches:
        width, height = p.get_width(), p.get_height()
        x, y = p.get_xy()
        # Format the count as a whole number; fontsize and size are aliases, so only one is passed
        ax[i].annotate(f'{height:.0f}', (x+.15, height+10), fontsize=12, ha='center', weight='normal')
            
plt.tight_layout(pad=2.0)
plt.show()    
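
The cell above sorts the compensation buckets by their numeric lower bound rather than alphabetically. A minimal standalone sketch of that idea follows. `bucket_lower_bound` is a hypothetical helper, and the example labels are assumed to resemble the survey's bucket strings.

```python
import re

def bucket_lower_bound(label: str) -> float:
    """Return the numeric lower bound of a label such as '$0-999' or '> $500,000'."""
    first = label.split("-")[0]             # text before the dash, or the whole label
    digits = re.sub(r"[^0-9]", "", first)   # drop '$', ',', '>' and spaces
    return float(digits) if digits else float("nan")

# Assumed example labels, deliberately out of order
labels = ["10,000-14,999", "> $500,000", "$0-999", "1,000-1,999"]
ordered = sorted(labels, key=bucket_lower_bound)
print(ordered)
```

A plain string sort would put `'$0-999'` and `'> $500,000'` next to each other, which is why the numeric key is needed.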


7. **Which is the most recommended first programming language for data scientists?** \
No surprises here! 
 - Python has consistently topped the list, and it has only gained in popularity over the years. This year, more than 80% of users recommend it.
 - Python's gain in popularity has come at the expense of R.
 - The top 5 recommended programming languages have always been the same 5 languages, in the same sequence: Python, R, SQL, C/C++/C#, and Matlab.

In [None]:
#Top 5 recommended first languages by Kagglers:
var_list=['GenderSelect','Q1','Q2','Q2']
countryvar_list = ['Country','Q3','Q3','Q3']
prog_list = ['LanguageRecommendationSelect', 'Q18','Q19','Q8']
colors = ['#E8E7D2', '#C9BA9B', '#BDC2BB', '#FFD0A6']


for i in range(4):
    x=str(i+2017)
    df=globals()['survey_'+x+'MCQ']
    df=df[~df[prog_list[i]].isin(['NA','nan',np.nan,'None'])]
    df=pd.DataFrame(df[prog_list[i]].value_counts(normalize=True,ascending=False,dropna=False))
    df.columns=[x]
    df.loc['C/C++/C#']=df[df.index.str.contains('C')].sum()
    # DataFrame.append was removed in pandas 2.0, so use pd.concat instead
    df=pd.concat([df[~df.index.str.contains('C')], df[df.index=='C/C++/C#']])
    df.sort_values(by=[x], inplace=True)
    globals()['all'+x]=df.tail(n=5)   
    

fig, ax = plt.subplots(2, 2, figsize=(12,6))
fig.suptitle('Top 5 - Recommended first language to learn', fontsize=20, y=1.05)

        
for i in range(4):
    x=str(i+2017)
    df=globals()['all'+x]
    df[x].plot.barh(ax=ax[int(i/2)][i%2], color=[colors[i]], legend=True, width=0.4)
    ax[int(i/2)][i%2].spines["bottom"].set_visible(False)
    ax[int(i/2)][i%2].spines["top"].set_visible(False)
    ax[int(i/2)][i%2].spines["right"].set_visible(False)
    ax[int(i/2)][i%2].set_yticklabels(df.index, fontsize=12)
    
    # 'top' is not a valid matplotlib legend location; 'upper center' is
    ax[int(i/2)][i%2].legend(loc='upper center', bbox_to_anchor= (0.45, 1.02), ncol=1,
                             borderaxespad=0, frameon=True, fontsize=14)
    
    
    x_axis = ax[int(i/2)][i%2].axes.get_xaxis()
    x_axis.set_visible(False)
    
    for p in  ax[int(i/2)][i%2].patches:
        width, height = p.get_width(), p.get_height()
        x, y = p.get_xy() 
        #print("x=",x, ", y= ", y, " width=", width)
        ax[int(i/2)][i%2].annotate(f'{width:.1%}', (width*1.05, y), fontsize=12)

              
plt.tight_layout(pad=2.0)
plt.show()    
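
The ranking above folds all C-family answers into a single `C/C++/C#` row before picking the top 5. Here is a minimal sketch of that step on made-up shares. The numbers are illustrative only, and the case-sensitive match on `'C'` is assumed to catch just the C-family labels, which holds for this toy list but is worth checking against the real answer options.

```python
import pandas as pd

# Hypothetical recommendation shares (illustrative numbers, not survey results)
shares = pd.Series({
    "Python": 0.80, "R": 0.07, "SQL": 0.05, "MATLAB": 0.03,
    "C++": 0.02, "C": 0.015, "Java": 0.01, "Other": 0.005,
})

# Case-sensitive literal match: True for 'C' and 'C++' in this toy list
is_c_family = shares.index.str.contains("C", regex=False)

# Replace the individual C-family rows with one combined row
combined = pd.concat([
    shares[~is_c_family],
    pd.Series({"C/C++/C#": shares[is_c_family].sum()}),
])

top5 = combined.sort_values(ascending=False).head(5)
print(top5)
```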