## <img src= "https://upload.wikimedia.org/wikipedia/commons/b/b7/Flag_of_Europe.svg" style='width: 35px; float: left'> <span style = 'color:#003399'> EUROPE</span> : <span style = 'color: #b7b146'>ANY DIFFERENT FROM THE WORLD ?</span>

## <span style = 'color:#003399'>Introduction</span>

> Kaggle is a great platform for data scientists and machine learning practitioners. It has grown a lot since its start and had reached **over 8 million registered users as of 2021**. Most data scientists and machine learning practitioners are familiar with Kaggle, so this survey should give a reasonably accurate picture of the data science and machine learning scene.
> 
> Since we already have plenty of notebooks about the world, I don't want to waste my fellow readers' time with another one. Instead, let me take you to **the largest economic continent on Earth, i.e., Europe**. It is the richest region as measured by assets under management, with over \\$32.7 trillion compared to North America's \\$27.1 trillion in 2008.
> 
> The objective of this notebook is to figure out how the ML and data science scene in Europe differs from that of the world. I will try to give a fairly comprehensive analysis and cover most of the questions asked in the survey.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_theme(style="whitegrid")
import plotly.express as px
import squarify 
from textwrap import wrap
# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
        
import warnings
warnings.filterwarnings("ignore")
# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

## <span style = 'color:#003399'>Data</span>

In [None]:
data_ml = pd.read_csv('../input/kaggle-survey-2021/kaggle_survey_2021_responses.csv')
# European countries considered in this notebook, spelled as in the survey's country column (Q3)
eu = ['Greece', 'Belgium', 'Poland', 'Italy', 'Spain', 'France', 'Sweden', 'Netherlands', 'Romania', 'Austria', 'Ireland', 'Portugal', 'Denmark', 'Germany', 'Czech Republic']
# Keep only the respondents who live in one of these countries
data_eu = data_ml[data_ml.Q3.isin(eu)]
data_ml.head(5)

##  <span style = 'color:#003399'>Countries Participating in the Survey</span>
> It is quite evident from the plot below that the largest number of respondents are from **India, which accounts for nearly 30% of all survey participants, followed by the USA.** India's population and its presence in the IT industry might be factors in its participation on Kaggle. But that is just the absolute figure; a more accurate representation would be the number of participants per 100 people of the respective country's population. 
> 
> The **European countries** start at the **11th position, occupied by Germany**, followed by Spain. **Germany has 1.8%** of the participants, and Spain is not far behind with 1.7%.
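> 
> The per-capita idea mentioned above could be sketched as follows. This is only an illustration, not part of the survey analysis: the country names and every number are hypothetical, and the rate is expressed per 100,000 inhabitants (an assumed scale) so that the values stay readable.

```python
import pandas as pd

# Hypothetical respondent counts and populations (illustrative only, not survey data)
respondents = pd.Series({"Country A": 7000, "Country B": 500, "Country C": 450})
population = pd.Series({"Country A": 1_400_000_000, "Country B": 80_000_000, "Country C": 10_000_000})

# Respondents per 100,000 inhabitants; the two Series align on their country labels
per_100k = (respondents / population * 100_000).sort_values(ascending=False)
print(per_100k.round(2))
```

> With real population figures, a ranking of this kind may look quite different from the ranking by absolute counts.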

In [None]:
# total countries 
countries = data_ml.Q3.unique()

# total no. of users
total_participants = len(data_ml)

# Build dictionaries of each country's share of respondents (%) and of the EU head counts
participants = {}
eu_participants = {}
for i in countries:
    total_participants_country = len(data_ml[data_ml.Q3 == i])
    if i in eu:
        eu_participants[i] = total_participants_country
    participants[i] = (total_participants_country/total_participants)* 100
# Row 0 of the CSV holds the question text rather than an answer, so drop its entry
participants.pop('In which country do you currently reside?')

# Creating dataframe of countries with their percentage users
country_participants = pd.Series(participants)
country_participants = pd.DataFrame(country_participants).reset_index().rename(columns = {'index':'Countries', 0:'%'}).sort_values('%', ascending = False)
# country_participants =country_participants[:50]

fig,axes = plt.subplots(1,1,figsize=(20,5))
sns.barplot(data = country_participants, x = 'Countries', y = '%', palette ='cividis')
axes.bar_label(axes.containers[0], fmt='%1.1f',size = 8)
plt.xticks(rotation=89)
axes.grid(False)
axes.tick_params(bottom=True, left=False)
sns.despine(left = True)
plt.yticks([])
plt.show()



## <span style = 'color:#003399'>EU vs the World</span>
> The **European countries have a share of 10.3%** in this survey. **Germany, Spain and France** are the top three contributors; all of them have **over 400 users**. 
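> 
> A share like the one above can be computed with a boolean mask. The sketch below uses a made-up country column standing in for `Q3` and a shortened, hypothetical `eu` list, so the numbers it prints are not survey results.

```python
import pandas as pd

# Toy stand-in for the survey's country column (Q3); the values are made up
country = pd.Series(["India", "Germany", "Spain", "India", "France", "Japan", "India", "Germany"])
eu = ["Germany", "Spain", "France"]

is_eu = country.isin(eu)                   # boolean mask, True for EU respondents
eu_share = is_eu.mean() * 100              # the mean of a boolean mask is the share of True values
eu_counts = country[is_eu].value_counts()  # respondents per EU country, largest first
print(f"EU share: {eu_share:.1f}%")
print(eu_counts)
```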

In [None]:
eu_df = pd.DataFrame(pd.Series(data = eu_participants)).reset_index().rename(columns = {'index': 'Country', 0:'Users'}).sort_values('Users',ascending = False)
total_eu_users = eu_df.Users.sum()

eu_users_percentage = (total_eu_users/ total_participants)*100
rest_of_world = (total_participants - total_eu_users)/total_participants*100

eu_vs_world_dic = {'Eu':eu_users_percentage ,'World':rest_of_world}

eu_vs_world_df = pd.DataFrame(pd.Series(data = eu_vs_world_dic)).reset_index().rename(columns = {'index': 'EU/World', 0:'% Users'})

# kig = plt.figure(figsize=(18,5))
# axes = plt.subplot2grid((1,2),(0,0))
# sns.barplot(data = eu_vs_world_df, x = 'EU/World', y = '% Users')

kig = plt.figure(figsize=(17,6))
axes = plt.subplot2grid((1,2),(0,0))
plt.pie(data = eu_vs_world_df,x = '% Users',pctdistance=0.50,labels = 'EU/World', autopct='%1.1f%%',colors = ['#cdbe68','#0d2f5d'], labeldistance= 1.1, radius = 0.8,)
centre_circle = plt.Circle((0,0),0.7,fc='white') 
fig = plt.gcf()
fig.gca().add_artist(centre_circle)



axes = plt.subplot2grid((1,2),(0,1))
sns.barplot(data = eu_df, y = 'Country', x = 'Users', palette = 'cividis')
axes.bar_label(axes.containers[0],size = 10)
plt.xticks([])
sns.despine(bottom = True)
axes.grid(False)
axes.tick_params(bottom=False, left=True)

plt.tight_layout()
plt.show()


## <span style = 'color:#003399'>Demography: Age and Gender</span>

> The interesting observation here is that the peak of the world's age distribution lies in the 18-29 range, i.e., the contribution of the young population is quite evident there, whereas the peak of the EU's age distribution is at 25-29. **The EU is missing the contribution of people in the 18-24 age group.**
> 
> However, a peak in the 25-29 age group is still a positive sign for ML and data science practitioners, because it suggests that young people are interested in this field, which should help the field flourish further.
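> 
> One way to compare the two distributions without relying on row positions is to align them on the age-group label. The sketch below uses toy answers in place of `data_eu.Q1` and `data_ml.Q1`; the values are invented for illustration.

```python
import pandas as pd

# Toy age-group answers; the notebook itself would use data_eu.Q1 and data_ml.Q1
eu_age = pd.Series(["25-29", "25-29", "30-34", "22-24"])
world_age = pd.Series(["18-21", "18-21", "22-24", "25-29", "25-29", "18-21"])

# Percentage per age group, aligned on the group label;
# groups that are missing on one side become 0
age_pct = pd.concat(
    {"EU": eu_age.value_counts(normalize=True),
     "World": world_age.value_counts(normalize=True)},
    axis=1,
).fillna(0).mul(100).sort_index()

print(age_pct.round(1))
print(age_pct.idxmax())  # the peak (modal) age group of each column
```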

In [None]:
# for eu
age_dist = data_eu.groupby(['Q1']).count()['Time from Start to Finish (seconds)']
age_dist = pd.DataFrame(age_dist).reset_index().rename(columns = {'Q1': 'Age Group', 'Time from Start to Finish (seconds)': "% Users"})
age_dist['% Users']= (age_dist['% Users']/age_dist['% Users'].sum())*100
# for World
age_dist_w = data_ml.groupby(['Q1']).count()['Time from Start to Finish (seconds)']
age_dist_w = pd.DataFrame(age_dist_w).reset_index().rename(columns = {'Q1': 'Age Group', 'Time from Start to Finish (seconds)': "% Users"})
age_dist_w.drop(labels= 11, inplace = True)
age_dist_w['% Users'] = (age_dist_w['% Users']/age_dist_w['% Users'].sum())*100

kig = plt.figure(figsize=(20,5))
axes = plt.subplot2grid((1,2),(0,0))
sns.barplot(data = age_dist, x = '% Users', y = 'Age Group', palette='cividis')
sns.despine(bottom=True)
axes.grid(False)
axes.tick_params(bottom=False, left=True)
axes.bar_label(axes.containers[0], fmt = '%1.1f',size = 10)
plt.xticks([])

axes = plt.subplot2grid((1,2),(0,1))
sns.barplot(data = age_dist_w, x = '% Users', y = 'Age Group',  palette='rainbow')
sns.despine(bottom=True)
axes.grid(False)
axes.tick_params(bottom=False, left=True)
axes.bar_label(axes.containers[0],size = 10,fmt = '%1.1f')
plt.xticks([])

plt.suptitle('Age Distribution: left:- EU, right:- The World')
plt.show()


> There isn't much divergence in the gender distribution between the EU and the world. The field seems to be **male dominated**, with around **80% of participants being male** and only about 15% being female. 
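> 
> The grouping used for these charts (Man, Woman, and everything else as "Others") can also be written without counting each answer by hand. A minimal sketch on made-up answers standing in for `Q2`:

```python
import pandas as pd

# Toy gender answers; the notebook itself would use data_eu.Q2 or data_ml.Q2
gender = pd.Series(["Man", "Woman", "Man", "Nonbinary", "Prefer not to say", "Man", "Man", "Woman"])

# Keep 'Man' and 'Woman', relabel every other answer as 'Others'
grouped = gender.where(gender.isin(["Man", "Woman"]), "Others")
gender_pct = grouped.value_counts(normalize=True).mul(100)
print(gender_pct)
```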

In [None]:
# for eu
man = len(data_eu[data_eu.Q2 == 'Man'])
prefer_not_to_say = len(data_eu[data_eu.Q2 == 'Prefer not to say'])
woman = len(data_eu[data_eu.Q2 == 'Woman'])
nonbinary = len(data_eu[data_eu.Q2 == 'Nonbinary'])
prefer_to_self_describe = len(data_eu[data_eu.Q2 == 'Prefer to self-describe'])
others = prefer_not_to_say + prefer_to_self_describe + nonbinary
gender_dis ={'Man': man,'Others': others,'Woman': woman}

# for world
man_w = len(data_ml[data_ml.Q2 == 'Man'])
prefer_not_to_say_w = len(data_ml[data_ml.Q2 == 'Prefer not to say'])
woman_w = len(data_ml[data_ml.Q2 == 'Woman'])
nonbinary_w = len(data_ml[data_ml.Q2 == 'Nonbinary'])
prefer_to_self_describe_w = len(data_ml[data_ml.Q2 == 'Prefer to self-describe'])
others_w = prefer_not_to_say_w + prefer_to_self_describe_w + nonbinary_w
gender_dis_w ={'Man': man_w,'Others': others_w,'Woman': woman_w}

a = pd.DataFrame(pd.Series(gender_dis)).reset_index().rename(columns={'index':'Gender', 0: 'Participants'})
b = pd.DataFrame(pd.Series(gender_dis_w)).reset_index().rename(columns={'index':'Gender', 0: 'Participants'})

kig = plt.figure(figsize=(20,5))
axes = plt.subplot2grid((1,2),(0,0))
plt.pie( data = a,x = 'Participants', startangle=90,pctdistance=0.70,labels = 'Gender', autopct='%1.1f%%',colors = ['#0e3362','#c4b66e','#6b6c70'], labeldistance= 1.1, radius = 0.8,)
centre_circle = plt.Circle((0,0),0.7,fc='white') 
fig = plt.gcf()
fig.gca().add_artist(centre_circle)

axes = plt.subplot2grid((1,2),(0,1))
plt.pie( data = b,x = 'Participants', startangle = 90,pctdistance=0.70,labels = 'Gender', autopct='%1.1f%%',colors = ['#69e2c1','#e78759','#6859e5'], labeldistance= 1.1, radius = 0.8,)
centre_circle = plt.Circle((0,0),0.7,fc='white') 
fig = plt.gcf()
fig.gca().add_artist(centre_circle)

plt.suptitle('Gender Distribution: left:- EU, right:- The World')
plt.tight_layout()
plt.show()



In [None]:
new_df = data_ml[['Q6','Q25']].drop(0).dropna()
years_arr = new_df.Q6.unique()

# Count the respondents for every (coding experience, salary band) pair.
# pd.crosstab fills pairs that never occur (e.g. 'I have never written code'
# with '$500,000-999,999') with 0, so no band has to be patched in by hand
# and every list stays aligned with years_arr.
salary_counts = pd.crosstab(new_df.Q6, new_df.Q25)
Q_dict = {band: salary_counts.loc[years_arr, band].tolist() for band in salary_counts.columns}

In [None]:
def df_creater(eu_world, replacing, column_name):
    """Count the non-null answers to each option of a multiple-choice question."""
    # Number of `<question>_Part_N` columns read for each question (11 by default)
    n_parts = {'Q9': 12, 'Q10': 16, 'Q16': 17, 'Q32_A': 20, 'Q18': 6, 'Q19': 5,
               'Q24': 7, 'Q37_A': 7, 'Q30_A': 7}.get(replacing, 11)
    columns = [f'{replacing}_Part_{n}' for n in range(1, n_parts + 1)] + [f'{replacing}_OTHER']
    # The full survey frame carries the question text in row 0; drop it before counting
    frame = eu_world.drop(0) if eu_world is data_ml else eu_world
    Q_dict = {}
    for col in columns:
        answers = frame[col].dropna()
        # Each column holds a single option text, which becomes the row label
        Q_dict[answers.unique()[0]] = len(answers)
    return pd.DataFrame(Q_dict, index = [0]).T.reset_index().rename(columns = {'index': column_name, 0:'% Users'})

## <span style = 'color:#003399'>Education</span>

<span style = 'color: #b7b146'>**1. Highest Level of Formal Education**</span>

> The age distribution is clearly reflected in the education plot. Since the EU is missing people in the 18-24 age group, which is generally the age of a Bachelor's degree, we see a great **divergence in the share of people holding a Bachelor's degree in the EU as compared to the world**. The **difference is roughly 25%**. 
> 
> * The share of people holding a **Master's degree** is roughly **15% higher in the EU** than in the world.
> * Bachelor's and Master's degree holders are almost equal in number when we consider the world. In the EU, on the other hand, there is a **difference of almost 40%**, with far more Master's degree holders.

In [None]:
#world Level
education_level_w = data_ml.groupby(['Q4']).count()['Time from Start to Finish (seconds)']
education_level_w = pd.DataFrame(education_level_w).reset_index().rename(columns = {'Q4': 'Education Level', 'Time from Start to Finish (seconds)': 'Count'}).drop(labels = 7).sort_values('Count', ascending =False)
education_level_w.Count = (education_level_w.Count/education_level_w.Count.sum())*100

# EU
education_level = data_eu.groupby(['Q4']).count()['Time from Start to Finish (seconds)']
education_level = pd.DataFrame(education_level).reset_index().rename(columns = {'Q4': 'Education Level', 'Time from Start to Finish (seconds)': 'Count'})
education_level.Count = (education_level.Count/education_level.Count.sum())*100



fig, ax = plt.subplots(figsize=(15,5))

x = np.arange(len(education_level))
width = 0.4
plt.barh(x-0.2, education_level.reindex([3,0,1,6,2,4,5]).Count, width, color='#0e3362', label='EU') 

plt.barh(x+0.2, education_level_w.Count, width, color='#69e2c1', label='World')


plt.ylabel(None)
plt.yticks(education_level.index, labels =  ['Master’s degree', 'Bachelor’s degree', 'Doctoral degree','Some college/university study','I prefer not to answer', 'No formal education past high school','Professional doctorate'])

plt.xlabel('Percentage', fontsize=10)
plt.yticks(fontsize=9)
sns.despine(bottom=True)
ax.grid(False)
ax.tick_params(bottom=False, left=True)

plt.suptitle('Formal Education')
plt.legend(frameon=False, fontsize=15)
plt.show()



<span style= 'color:#b7b146'>**2. Social Media**</span>

> Prior to Covid-19, online platforms were not used as rigorously, but things have changed: they are now almost as important as offline channels, if not more so. Below is a treemap of the popular media sources for getting information about ML and data science. There are some slight differences in popularity between the EU and the world, but overall the **top 3 places are held by Kaggle, YouTube and Blogs**.
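> 
> The values behind this treemap come from a multiple-choice question, which the survey stores as one column per option (`Q42_Part_1`, `Q42_Part_2`, and so on). The sketch below shows how such columns might be counted; the frame is a made-up miniature with shortened option names, not the real survey data.

```python
import pandas as pd

# Toy frame mimicking a multiple-choice question: one column per option,
# holding the option text when it was selected and None otherwise (made-up data)
answers = pd.DataFrame({
    "Q42_Part_1": ["Twitter", None, "Twitter"],
    "Q42_Part_2": [None, "Reddit", None],
    "Q42_Part_3": ["Kaggle", "Kaggle", "Kaggle"],
})

# Number of respondents who ticked each option, keyed by the option text
counts = pd.Series({answers[col].dropna().iloc[0]: answers[col].notna().sum()
                    for col in answers.columns})
share = (counts / counts.sum() * 100).sort_values(ascending=False)
print(share.round(1))
```

> Note that these shares are relative to the total number of ticks, not to the number of respondents, which matches how the percentages in this notebook are normalised.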

In [None]:
Social_w_df = df_creater(data_ml,'Q42', 'Social Media').sort_values('% Users')
Social_w_df['% Users'] = (Social_w_df['% Users']/Social_w_df['% Users'].sum())*100
Social_w_df['Names'] = ['Other', 'None', 'Slack', 'Podcasts', 'Reddit', 'Course Forums', 'Journal Publications', 'Newsletters', 'Twitter', 'Blogs','YouTube', 'Kaggle']


Social_df = df_creater(data_eu,'Q42', 'Social Media').sort_values('% Users')
Social_df['% Users'] = (Social_df['% Users']/Social_df['% Users'].sum())*100
Social_df['Names'] = ['Other', 'None', 'Slack', 'Course Forums','Podcasts', 'Reddit', 'Twitter','Newsletters','Journal Publications', 'YouTube',  'Blogs', 'Kaggle']


labels = Social_w_df['Names']
sizes = Social_w_df['% Users'].values.tolist()
colors = [plt.cm.viridis(i/float(len(labels))) for i in range(len(labels))]

plt.figure(figsize=(18,6), dpi= 80)
axes = plt.subplot2grid((1,2),(0,1))
squarify.plot(sizes=sizes, label=labels, alpha=.8, color = colors)
plt.axis('off')


labels = Social_df['Names']
sizes = Social_df['% Users'].values.tolist()
axes = plt.subplot2grid((1,2),(0,0))
colors = [plt.cm.cividis(i/float(len(labels))) for i in range(len(labels))]
squarify.plot(sizes=sizes, label=labels, alpha=.8, color = colors)


# Decorate
plt.suptitle('Popular Social Media')
plt.axis('off')
plt.tight_layout()
plt.show()




<span style= 'color: #b7b146'>**3. Platforms for Data Science Courses**</span>

> The most popular platforms for data science courses are Kaggle, followed by Coursera, Udemy and university courses.
> An important observation here is that **people prefer Kaggle courses to university courses**. One reason could be that most university courses are not very practical.

In [None]:
plat_ds_course_w_df = df_creater(data_ml,'Q40', 'Platform for DS Courses').sort_values('% Users', ascending = False)
plat_ds_course_w_df['% Users'] = (plat_ds_course_w_df['% Users']/plat_ds_course_w_df['% Users'].sum())*100 

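# Shorten the two longest labels; assign the result back instead of relying on an in-place replace of a column
plat_ds_course_w_df['Platform for DS Courses'] = plat_ds_course_w_df['Platform for DS Courses'].replace(['University Courses (resulting in a university degree)','Cloud-certification programs (direct from AWS, Azure, GCP, or similar)'],['University', 'Cloud-Certificate'])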

labels = plat_ds_course_w_df['Platform for DS Courses']
sizes = plat_ds_course_w_df['% Users'].values.tolist()
colors = [plt.cm.cividis(i/float(len(labels))) for i in range(len(labels))]

plt.figure(figsize=(20,6))
squarify.plot(sizes=sizes, label=labels, alpha=.8, color = colors)
plt.axis('off')

# Decorate
plt.suptitle('Popular Platform for DS Courses')
plt.axis('off')
plt.tight_layout()
plt.show()


## <span style= 'color:#0e3362'>Programming Language</span>
> There isn't much difference in programming language preference between the EU and the world. Both prefer **Python**, with about **33%** of users, followed by R and SQL.
> 

In [None]:
programming_w_df = df_creater(data_ml,'Q7', 'Language').sort_values('% Users', ascending = False)
programming_w_df['% Users'] = (programming_w_df['% Users']/programming_w_df['% Users'].sum())*100
# programming_w_df

programming_df = df_creater(data_eu,'Q7', 'Language').sort_values('% Users')
programming_df['% Users'] = (programming_df['% Users']/programming_df['% Users'].sum())*100


fig, ax = plt.subplots(figsize=(18,6))

x = np.arange(len(programming_w_df))
width = 0.4
# Reorder the EU rows to follow the World ranking (both frames share the same row labels)
plt.barh(x-0.2, programming_df['% Users'].reindex(programming_w_df.index), width, color='#0e3362', label='EU') 
plt.barh(x+0.2, programming_w_df['% Users'], width, color='#69e2c1', label='World')

plt.ylabel(None)
# Tick positions must be the bar positions (x), not the frame's shuffled row labels
plt.yticks(x, labels = programming_w_df.Language)

plt.xlabel('Percentage', fontsize=7)
plt.yticks(fontsize=10)
sns.despine(bottom=True)
ax.grid(False)
ax.tick_params(bottom=False, left=True)
plt.suptitle('Preferred Programming Language')
plt.legend(frameon=False, fontsize=15)
plt.show()


> Also, when asked which programming language they would recommend to an aspiring data scientist, respondents gave the following answers:

In [None]:
# World: share (%) of each recommended language, ignoring the question-text row (row 0)
recomm_language_w = data_ml.Q8.drop(0).value_counts(normalize=True).mul(100).rename_axis('Recommended Language').reset_index(name='Count')

# EU: same computation, then reordered to follow the World ranking so that
# both bars drawn at a given position refer to the same language
recomm_language = data_eu.Q8.value_counts(normalize=True).mul(100).rename_axis('Recommended Language').reset_index(name='Count')
recomm_language = recomm_language_w[['Recommended Language']].merge(recomm_language, on='Recommended Language', how='left').fillna(0)

fig, ax = plt.subplots(figsize=(18,6))
x = np.arange(len(recomm_language))
width = 0.4
plt.barh(x-0.2, recomm_language['Count'], width, color='#0e3362', label='EU') 
plt.barh(x+0.2, recomm_language_w['Count'], width, color='#69e2c1', label='World')

plt.ylabel(None)
plt.yticks(recomm_language_w.index, labels = recomm_language_w['Recommended Language'])

plt.xlabel('Percentage', fontsize=7)
plt.yticks(fontsize=10)
sns.despine(bottom=True)
ax.grid(False)
ax.tick_params(bottom=False, left=True)

plt.suptitle('Recommended Programming Language')
plt.legend(frameon=False, fontsize=15)
plt.show()


> The dominance of Python is quite clear here: **almost 80% of people would recommend Python**. Since this advice comes from people who are already in the field, it clearly indicates that **Python is the future!**

## <span style = 'color:#0e3362'>Job</span>
> In the EU, the top 3 places are occupied by **Data Scientist, Student** and Research Scientist, whereas Student, Data Scientist and Software Engineer hold the top 3 places in the world.
> 
> The only major divergence between the world and the EU is in the Student and Data Scientist categories.
> 

In [None]:
# for world
job_w = data_ml.groupby(['Q5']).count()['Time from Start to Finish (seconds)']
job_w = pd.DataFrame(job_w).reset_index().rename(columns = {'Q5': 'Current Job', 'Time from Start to Finish (seconds)': '% Users'}).drop(labels = 12)
total_sum_w = job_w['% Users'].sum()
job_w['% Users'] = (job_w['% Users']/ total_sum_w)*100
# for Eu
job = data_eu.groupby(['Q5']).count()['Time from Start to Finish (seconds)']
job = pd.DataFrame(job).reset_index().rename(columns = {'Q5': 'Current Job', 'Time from Start to Finish (seconds)': '% Users'})
total_sum = job['% Users'].sum()
job['% Users'] = (job['% Users'] / total_sum)*100


fig, ax = plt.subplots(figsize=(18,7))


sns.scatterplot(alpha = 0.8, data = job, x = '% Users', y = 'Current Job',sizes=(1000, 1001),size = "% Users", legend=False, color = '#0e3362')

  
sns.scatterplot(data = job_w, x = '% Users', y = 'Current Job',sizes=(1000, 1001), size = "% Users",legend=False, color = '#69e2c1')

ax.tick_params(bottom=False, left=True)

plt.suptitle("Job's Title")
plt.tight_layout()
plt.show()

<span style= 'color: #b7b146'>**1. Industry of the Current Employer**</span>

> The industry mix is roughly the same for the world and the EU. **Computers, Academia and Accounting** take the **top 3** places. However, it is good to see the rest of the industries getting their hands on ML and data science.

In [None]:
# World: share (%) of each industry, ignoring the question-text row (row 0)
industry_w_df = data_ml['Q20'].drop(0).value_counts(normalize=True).mul(100).rename_axis('Industry').reset_index(name='Users')

# EU: same computation
industry_df = data_eu['Q20'].value_counts(normalize=True).mul(100).rename_axis('Industry').reset_index(name='Users')

# The bars below are labelled with the EU industries, so put the World rows in the same order
industry_w_df = industry_df[['Industry']].merge(industry_w_df, on='Industry', how='left').fillna(0)

labels = industry_df['Industry']
width = 0.35       # the width of the bars: can also be len(x) sequence

fig, ax = plt.subplots(figsize=(15,7))

ax.barh(labels, industry_df['Users'], width, label='EU', color='#0e3362')
ax.barh(labels, industry_w_df['Users'], width,  color='#69e2c1',left=industry_df['Users'],label='World')


plt.yticks(fontsize=8)
ax.set_xlabel('% Users')
ax.grid(False)
sns.despine(bottom=True)
# ax.set_title('Scores by group and gender')
ax.tick_params(bottom=False, left=True)

ax.legend()
plt.show()


<span style = 'color: #c4b66e'>**2. Incorporation of ML Methods into Business**</span>

> * Approximately 70% of businesses either already use some form of ML methods in their work or plan to do so in the near future.
> * Of these 70%, 20% have well-established ML methods.
> * The EU and the world are on par with each other in this respect.
> 

In [None]:
# World: share (%) of each answer, ignoring the question-text row (row 0)
ml_business_w_df = data_ml['Q23'].drop(0).value_counts(normalize=True).mul(100).rename_axis('Ml incorporation in Business').reset_index(name='Users')

# EU: same computation
ml_business_df = data_eu['Q23'].value_counts(normalize=True).mul(100).rename_axis('Ml incorporation in Business').reset_index(name='Users')

# The bars below are labelled with the EU answers, so put the World rows in the same order
ml_business_w_df = ml_business_df[['Ml incorporation in Business']].merge(ml_business_w_df, on='Ml incorporation in Business', how='left').fillna(0)

ml_business_df['Ml incorporation in Business'] = ['\n'.join(wrap(x, 45)) for x in  ml_business_df['Ml incorporation in Business']]

labels = ml_business_df['Ml incorporation in Business']
width = 0.35       # the width of the bars: can also be len(x) sequence

fig, ax = plt.subplots(figsize=(13,5))

ax.barh(labels, ml_business_df['Users'], width, label='EU', color='#0e3362')
# labels = ml_business_w_df['Ml incorporation in Business']
ax.barh(labels, ml_business_w_df['Users'], width,  color='#69e2c1',left=ml_business_df['Users'],label='World')

plt.yticks(fontsize=8)
ax.set_xlabel('Users')
ax.grid(False)
sns.despine(bottom=True)
# ax.set_title('Scores by group and gender')
ax.tick_params(bottom=False, left=True)
ax.legend()
plt.show()


<span style = 'color: #c4b66e'>**3. Activities Important for Work**</span>

> **Analysis and understanding of data** to influence product or business decisions takes the first place here with 27% followed by building prototypes to explore **ML in new areas** with 16%.

In [None]:
role_w_df = df_creater(data_ml, 'Q24', 'Role').sort_values('% Users', ascending = False)
role_w_df['% Users'] = (role_w_df['% Users']/role_w_df['% Users'].sum())*100
role_w_df['Role'] = ['\n'.join(wrap(x, 35)) for x in  role_w_df['Role']]

fig, ax = plt.subplots(figsize=(15,6))

ax.hlines(role_w_df['Role'], xmin=0, xmax = role_w_df['% Users']-0.40, colors = '#0e3362')
sns.scatterplot( data = role_w_df, x = '% Users', y = 'Role',sizes=(650, 651),size = "% Users", legend=False, color= '#04a664', alpha = 0.5)

# for j in range(len(hosted_nb_w_df)):
#     plt.annotate( "{hosted_nb_w_df['% Users'][j]: .0%}".format(hosted_nb_w_df['% Users'][j]), (hosted_nb_w_df['% Users'][j], hosted_nb_w_df['Hosted Notebook'][j]),weight = 'bold', color = '#5b5eec')

for j in range(len(role_w_df)):
    plt.annotate( f'{role_w_df["% Users"][j]:1.0f}', (role_w_df['% Users'][j]-0.3, role_w_df['Role'][j]),weight = 'bold', color = 'white')

plt.xticks([]) 
ax.set_xlim(0,30)
sns.despine(bottom=True)
ax.grid(False)
ax.tick_params(bottom=False, left=False)
plt.suptitle('Role at Work')
plt.tight_layout()
plt.show()


In [None]:
salary_year_df = pd.DataFrame(data = Q_dict,index =years_arr ,columns = ['25,000-29,999', '60,000-69,999', '$0-999', '30,000-39,999',
       '15,000-19,999', '70,000-79,999', '2,000-2,999', '10,000-14,999',
       '5,000-7,499', '20,000-24,999', '1,000-1,999', '100,000-124,999',
       '7,500-9,999', '4,000-4,999', '40,000-49,999', '50,000-59,999',
       '3,000-3,999', '300,000-499,999', '200,000-249,999',
       '125,000-149,999', '250,000-299,999', '80,000-89,999',
       '90,000-99,999', '150,000-199,999', '>$1,000,000',
       '$500,000-999,999'])      
salary_year_df = salary_year_df.reindex(['I have never written code','< 1 years','1-3 years','3-5 years','5-10 years','10-20 years','20+ years'])
salary_year_df = salary_year_df.reindex(columns = ['$0-999','1,000-1,999','2,000-2,999','3,000-3,999','4,000-4,999','5,000-7,499', '7,500-9,999','10,000-14,999','15,000-19,999','20,000-24,999','25,000-29,999','30,000-39,999','40,000-49,999','50,000-59,999','60,000-69,999','70,000-79,999','80,000-89,999','90,000-99,999','100,000-124,999','125,000-149,999','150,000-199,999','200,000-249,999','250,000-299,999','300,000-499,999','$500,000-999,999','>$1,000,000'])

<span style = 'color:#c4b66e'>**4. Impact of Coding Experience on Compensation**</span>

> An interesting observation that can be drawn from the heat map below is that most people have **under 3 years of coding experience and earn in the range of \\$0-999**.
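> 
> A table of counts like the one behind this heat map can be built with a cross-tabulation. The sketch below is a minimal example on invented answers standing in for `Q6` (coding experience) and `Q25` (compensation), with only a few of the real categories.

```python
import pandas as pd

# Toy (experience, salary) answers; the notebook itself would use data_ml[['Q6', 'Q25']]
df = pd.DataFrame({
    "Q6":  ["< 1 years", "< 1 years", "1-3 years", "5-10 years", "1-3 years"],
    "Q25": ["$0-999", "$0-999", "$0-999", "100,000-124,999", "1,000-1,999"],
})

# Rows: coding experience, columns: salary band, cells: number of respondents.
# Combinations that never occur are filled with 0 automatically.
table = pd.crosstab(df["Q6"], df["Q25"])

# Impose a meaningful order instead of the default alphabetical one
table = table.reindex(index=["< 1 years", "1-3 years", "5-10 years"],
                      columns=["$0-999", "1,000-1,999", "100,000-124,999"])
print(table)
```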
> 


In [None]:
fig, ax = plt.subplots(figsize=(18,6))
sns.heatmap(data = salary_year_df,cmap = 'viridis', linecolor='white', linewidths=2, annot = True, fmt = 'd')
plt.suptitle('Relation between Coding Experience and Compensation')
plt.show()

<span style = 'color:#c4b66e'>**5. Company Size vs Data Science Workers**</span>

> There is an almost linear relation between company size and the size of the data science team. **The bigger the company, the more data science employees it has**. 
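> 
> Raw counts are influenced by how many respondents fall into each company-size group, so a row-normalised table may make the relation easier to read. This is a suggested variation rather than what the notebook plots, shown on made-up answers standing in for `Q21` and `Q22`.

```python
import pandas as pd

# Toy (company size, data science headcount) answers; made-up data
df = pd.DataFrame({
    "Q21": ["0-49 employees"] * 4 + ["10,000 or more employees"] * 2,
    "Q22": ["0", "1-2", "1-2", "1-2", "20+", "1-2"],
})

# normalize='index' turns each row into shares that sum to 1,
# so company-size groups with different head counts can be compared directly
row_share = pd.crosstab(df["Q21"], df["Q22"], normalize="index").mul(100)
print(row_share.round(1))
```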

In [None]:
company_ds_w_df = data_ml[['Q21','Q22']].drop(0).dropna()

# Rows: company size (Q21), columns: number of data science employees (Q22),
# cells: number of respondents; pairs that never occur are counted as 0
new_company_ds_df = pd.crosstab(company_ds_w_df.Q21, company_ds_w_df.Q22)
# Order both axes from smallest to largest
new_company_ds_df = new_company_ds_df.reindex([ '0-49 employees','50-249 employees', '250-999 employees', '1000-9,999 employees','10,000 or more employees'])
new_company_ds_df = new_company_ds_df.reindex(columns = ['0','1-2','3-4', '5-9', '10-14', '15-19','20+'])

fig, ax = plt.subplots(figsize=(13,6))
sns.heatmap(data = new_company_ds_df, linecolor='white', linewidths=2, annot  =True, fmt = 'd',cmap = 'viridis')
plt.show()

## <span style = 'color:#0e3362'>Visualization Tools</span>

> * **Matplotlib, Seaborn and Plotly are the fan favourites**, with a whopping 33%, 23% and 13% of users respectively.
> * About **8%-10% of people don't use any visualization tool**; they rely on the defaults.
> * Here, too, there isn't much deviation between the EU and the world.

In [None]:
# For World
visualization_w_df = df_creater(data_ml,'Q14', 'Visualization Tool')
visualization_w_df['% Users'] = (visualization_w_df['% Users']/visualization_w_df['% Users'].sum())*100

# For Eu
visualization_df = df_creater(data_eu,'Q14', 'Visualization Tool')
visualization_df['% Users'] = (visualization_df['% Users']/visualization_df['% Users'].sum())*100

labels = visualization_df['Visualization Tool']
width = 0.35       # the width of the bars: can also be len(x) sequence

fig, ax = plt.subplots(figsize=(13,6))

ax.barh(labels, visualization_df['% Users'], width, label='EU', color='#0e3362')
ax.barh(labels, visualization_w_df['% Users'], width,  color='#69e2c1',left=visualization_df['% Users'],
       label='World')
# plt.xticks(rotation = 90)
ax.set_xlabel('% Users')
ax.grid(False)
sns.despine(bottom=True)
# ax.set_title('Scores by group and gender')
ax.tick_params(bottom=False, left=True)
ax.legend()
plt.show()



In [None]:
hosted_nb_w_df = df_creater(data_ml, "Q10", 'Hosted Notebook').sort_values('% Users',ascending = False)
hosted_nb_w_df['% Users'] =( hosted_nb_w_df['% Users']/hosted_nb_w_df['% Users'].sum())* 100

##  <span style = 'color:#0e3362'>Hosted Notebook Products</span>

> * **Google Colab Notebooks and Kaggle Notebooks** are the clear **winners** with 26% and 25% of users respectively.
> * A sizeable share of users, about **19%**, **don't use any hosted notebook**.

In [None]:
fig, ax = plt.subplots(figsize=(15,7))

ax.hlines(hosted_nb_w_df['Hosted Notebook'], xmin=0, xmax = hosted_nb_w_df['% Users']-0.45, colors = '#0e3362')
sns.scatterplot( data = hosted_nb_w_df, x = '% Users', y = 'Hosted Notebook',sizes=(650, 651),size = "% Users", legend=False, color= '#04a664', alpha = 0.5)

# for j in range(len(hosted_nb_w_df)):
#     plt.annotate( "{hosted_nb_w_df['% Users'][j]: .0%}".format(hosted_nb_w_df['% Users'][j]), (hosted_nb_w_df['% Users'][j], hosted_nb_w_df['Hosted Notebook'][j]),weight = 'bold', color = '#5b5eec')

for j in range(len(hosted_nb_w_df)):
    plt.annotate( f'{hosted_nb_w_df["% Users"][j]:.1f}', (hosted_nb_w_df['% Users'][j]-0.4, hosted_nb_w_df['Hosted Notebook'][j]),weight = 'bold', color = 'white')

plt.xticks([]) 
ax.set_xlim(0,30)
sns.despine(bottom=True)
ax.grid(False)
ax.tick_params(bottom=False, left=False)
plt.tight_layout()
plt.show()


## <span style = 'color:#0e3362'>Computing Platform for Data Science Projects</span>

> The majority of users work on a laptop or a personal computer, followed by cloud platforms. Recent advances in laptop computing power, together with their convenience, might be giving this trend a big boost.
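
The comparison in the next cell folds two low-frequency answers into a single 'None/Other' slice by row position. A minimal sketch of doing the same by label, so that it does not depend on the order returned by `value_counts`, is shown here. The helper name `collapse_categories`, the shortened answer labels and the percentages are all made up for illustration.

```python
import pandas as pd

def collapse_categories(shares, to_merge, new_label):
    """Hypothetical helper: sum the named categories into one new category."""
    merged_total = shares[shares.index.isin(to_merge)].sum()
    kept = shares[~shares.index.isin(to_merge)]
    # Append the merged slice after the categories that were kept.
    return pd.concat([kept, pd.Series({new_label: merged_total})])

# Made-up percentages indexed by answer text, for illustration only.
toy = pd.Series({"A laptop": 60.0, "A personal computer": 25.0,
                 "A cloud computing platform": 10.0, "None": 3.0, "Other": 2.0})
collapsed = collapse_categories(toy, ["None", "Other"], "None/Other")
print(collapsed)
```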

In [None]:
computing_plat_w_df = pd.DataFrame(data_ml['Q11'].value_counts()).reset_index().drop(6).rename(columns = {'index':'Computing Platform','Q11': 'Users'})
computing_plat_w_df.Users  = (computing_plat_w_df.Users /computing_plat_w_df.Users .sum())*100
sum_4_5 = computing_plat_w_df.Users[4]+computing_plat_w_df.Users[5]
df2 = pd.DataFrame([['None/Other', sum_4_5 ]], columns=['Computing Platform', 'Users'], index=[4])
computing_plat_w_df = pd.concat([computing_plat_w_df.drop([4,5]), df2])  # DataFrame.append was removed in pandas 2.0
computing_plat_w_df['Computing Platform'] = ['\n'.join(wrap(x, 30)) for x in  computing_plat_w_df['Computing Platform']]

computing_plat_df = pd.DataFrame(data_eu['Q11'].value_counts()).reset_index().rename(columns = {'index':'Computing Platform','Q11': 'Users'})
computing_plat_df.Users  = (computing_plat_df.Users /computing_plat_df.Users .sum())*100
sum_4_5 = computing_plat_df.Users[4]+computing_plat_df.Users[5]
df2 = pd.DataFrame([['None/Other', sum_4_5 ]], columns=['Computing Platform', 'Users'], index=[4])
computing_plat_df = pd.concat([computing_plat_df.drop([4,5]), df2])  # DataFrame.append was removed in pandas 2.0
computing_plat_df['Computing Platform'] = ['\n'.join(wrap(x, 30)) for x in  computing_plat_df['Computing Platform']]

color_list = ['#0e3362','#55658a','#979693','#c9be89','#eddb71']

kig = plt.figure(figsize=(15,5.5))
axes = plt.subplot2grid((1,2),(0,0))
plt.pie( data = computing_plat_df,x = 'Users', startangle=30,pctdistance=0.65,labels = 'Computing Platform', autopct='%1.1f%%',colors = color_list, labeldistance= 1.1, radius = 0.8,)
centre_circle = plt.Circle((0,0),0.7,fc='white') 
fig = plt.gcf()
fig.gca().add_artist(centre_circle)

axes = plt.subplot2grid((1,2),(0,1))
plt.pie( data = computing_plat_w_df,x = 'Users', startangle = 30,pctdistance=0.65,labels = 'Computing Platform', autopct='%1.1f%%',colors = ['#693476','#6275a2','#4da7a3','#7ed381','#d3e64d'], labeldistance= 1.1, radius = 0.8,)
centre_circle = plt.Circle((0,0),0.7,fc='white') 
fig = plt.gcf()
fig.gca().add_artist(centre_circle)

plt.suptitle('Computing Platforms: left:- EU, right:- World')
plt.tight_layout()
plt.show()



## <span style = 'color:#0e3362'>Big Data</span>

> * For Big Data, the most popular products are **MySQL, PostgreSQL and MongoDB**.
> * 13% of people don't use any big data product.

In [None]:
big_data_w_df = df_creater(data_ml, "Q32_A", 'Big Data Products')
big_data_w_df['% Users'] = (big_data_w_df['% Users']/big_data_w_df['% Users'].sum())*100

big_data_df = df_creater(data_eu, "Q32_A", 'Big Data Products')
big_data_df['% Users'] = (big_data_df['% Users']/big_data_df['% Users'].sum())*100

labels = big_data_w_df['Big Data Products']
width = 0.35       # the width of the bars: can also be len(x) sequence

fig, ax = plt.subplots(figsize=(15,7))

ax.barh(labels,big_data_df['% Users'], width, label='EU', color='#0e3362')
ax.barh(labels, big_data_w_df['% Users'], width,  color='#4da7a3',left=big_data_df['% Users'],label='World')

plt.yticks(fontsize=8)
ax.set_xlabel('% Users')
ax.grid(False)
sns.despine(bottom=True)
# ax.set_title('Scores by group and gender')
ax.tick_params(bottom=False, left=True)
ax.legend()
plt.suptitle('Big Data Products')
plt.show()


## <span style = 'color:#0e3362'>Business Intelligence Tools</span>

> * The EU trend matches the world trend regarding Business Intelligence Tools.
> * **Microsoft Power BI and Tableau** are the **top 2** favourites with 35% and 30% of users respectively.

In [None]:
business_intelligence_w_df = pd.DataFrame(data_ml['Q35'].value_counts()).reset_index().drop([15,16]).rename(columns = {'index':'Business Intelligence','Q35': 'Users'})
business_intelligence_w_df.Users = (business_intelligence_w_df.Users/business_intelligence_w_df.Users.sum())*100

business_intelligence_df = pd.DataFrame(data_eu['Q35'].value_counts()).reset_index().rename(columns = {'index':'Business Intelligence','Q35': 'Users'})
business_intelligence_df.Users = (business_intelligence_df.Users/business_intelligence_df.Users.sum())*100

labels = business_intelligence_df['Business Intelligence']
width = 0.35       # the width of the bars: can also be len(x) sequence

fig, ax = plt.subplots(figsize=(15,7))

ax.barh(labels,business_intelligence_df['Users'], width, label='EU', color='#0e3362')
ax.barh(labels, business_intelligence_w_df['Users'], width,  color='#4da7a3',left=business_intelligence_df['Users'],label='World')

plt.yticks(fontsize=8)
ax.set_xlabel('Users')
ax.grid(False)
sns.despine(bottom=True)
# ax.set_title('Scores by group and gender')
ax.tick_params(bottom=False, left=True)
ax.legend()
plt.show()


## <span style = 'color:#0e3362'>Machine Learning</span>

<span style = 'color:#cbbf80'>**1. ML Methods Experience**</span>

> * Most users are quite new to ML methods; only a few are seasoned professionals with 20+ years of experience. This suggests that many newcomers are entering the field, which is good for this growing community.
> * Both the world and the EU follow a **similar trend** here. Almost **30% of users** have **under 1 year of experience**.
> * Also, **13% of users** report **no ML method experience**.

In [None]:
ml_method_w_df = pd.DataFrame(data_ml['Q15'].value_counts()).reset_index().drop(9).rename(columns = {'index':'ML Method','Q15': 'Users'})
ml_method_w_df.Users = (ml_method_w_df.Users / ml_method_w_df.Users.sum())*100
ml_method_w_df['ML Method'].replace('I do not use machine learning methods', 'No Method',inplace =True)


ml_method_df = pd.DataFrame(data_eu['Q15'].value_counts()).reset_index().rename(columns = {'index':'ML Method','Q15': 'Users'})
ml_method_df.Users = (ml_method_df.Users / ml_method_df.Users.sum())*100
ml_method_df['ML Method'].replace('I do not use machine learning methods', 'No Method',inplace =True)

labels = ml_method_df['ML Method']
width = 0.35       # the width of the bars: can also be len(x) sequence

fig, ax = plt.subplots(figsize=(15,6))

ax.barh(labels, ml_method_df['Users'], width, label='EU', color='#0e3362')
ax.barh(labels, ml_method_w_df['Users'], width,  color='#4da7a3',left=ml_method_df['Users'],
       label='World')
# plt.xticks(rotation = 90)
ax.set_xlabel('Users')
ax.grid(False)
sns.despine(bottom=True)
# ax.set_title('Scores by group and gender')
ax.tick_params(bottom=False, left=True)

plt.suptitle('ML Method Experience')
ax.legend()
plt.show()


<span style = 'color:#cbbf80'>**2. Popular ML Framework**</span>

> The **top 4** popular ML frameworks are **Scikit-learn, TensorFlow, Keras and PyTorch**. Scikit-learn being most popular shows that traditional ML algorithms are still being used, but the rising popularity of TensorFlow and PyTorch reflects the growth of Deep Learning.

In [None]:
ml_framework_w_df  = df_creater(data_ml,'Q16', 'ML Framework').sort_values('% Users', ascending = False)
ml_framework_w_df['% Users'] = (ml_framework_w_df['% Users']/ml_framework_w_df['% Users'].sum())*100 

labels = ml_framework_w_df['ML Framework']
sizes = ml_framework_w_df['% Users'] .values.tolist()
colors = [plt.cm.viridis(i/float(len(labels))) for i in range(len(labels))]

plt.figure(figsize=(15,6))
squarify.plot(sizes=sizes, label=labels, alpha=.8, color = colors)
plt.axis('off')

# Decorate
plt.suptitle('Popular ML Framework')
plt.axis('off')
plt.tight_layout()
plt.show()



<span style = 'color:#cbbf80'>**3. Popular ML Algorithms**</span>

> * Most popular ML method: **Linear or Logistic Regression**; CV method: **VGG, Inception, ResNet**, etc.; NLP method: **Word embeddings/vectors**.
> * Both the EU and the world follow the **same pattern**.

In [None]:

ml_algo_w_df = df_creater(data_ml,'Q17', 'ML Algorithm').sort_values('% Users')
ml_algo_w_df['% Users'] = (ml_algo_w_df['% Users']/ml_algo_w_df['% Users'].sum())*100
ml_algo_df = df_creater(data_eu,'Q17', 'ML Algorithm').sort_values('% Users')
ml_algo_df['% Users'] = (ml_algo_df['% Users']/ml_algo_df['% Users'].sum())*100
ml_algo_df['ML Algorithm'] = ['\n'.join(wrap(x, 7)) for x in  ml_algo_df['ML Algorithm']]

# cv
cv_method_w_df = df_creater(data_ml,'Q18', 'CV Method').sort_values('% Users')
cv_method_w_df['% Users'] = (cv_method_w_df['% Users']/cv_method_w_df['% Users'].sum())*100
cv_method_df = df_creater(data_eu,'Q18', 'CV Method').sort_values('% Users')
cv_method_df['% Users'] = (cv_method_df['% Users']/cv_method_df['% Users'].sum())*100
cv_method_w_df['CV Method'].replace(['Image classification and other general purpose networks (VGG, Inception, ResNet, ResNeXt, NASNet, EfficientNet, etc)'],['VGG, Inception, ResNet, ResNeXt, NASNet, EfficientNet, etc'], inplace = True)
cv_method_w_df['CV Method'] = ['\n'.join(wrap(x, 12)) for x in  cv_method_w_df['CV Method']]

# nlp
nlp_method_w_df = df_creater(data_ml,'Q19', 'NLP Method').sort_values('% Users')
nlp_method_w_df['% Users'] = (nlp_method_w_df['% Users']/nlp_method_w_df['% Users'].sum())*100
nlp_method_df = df_creater(data_eu,'Q19', 'NLP Method').sort_values('% Users')
nlp_method_df['% Users'] = (nlp_method_df['% Users']/nlp_method_df['% Users'].sum())*100
nlp_method_w_df['NLP Method'] = ['\n'.join(wrap(x, 12)) for x in nlp_method_w_df['NLP Method']]


labels = ml_algo_df['ML Algorithm']
width = 0.35       # the width of the bars: can also be len(x) sequence

kig = plt.figure(figsize=(20,8))

a = '#0e3362'
b = '#4da7a3'

axes = plt.subplot2grid((1,3),(0,0))
axes.bar(labels, ml_algo_df['% Users'] , width, label='EU', color=a)
axes.bar(labels, ml_algo_w_df['% Users'] , width,  color=b,bottom=ml_algo_df['% Users'],label='World')
axes.set_ylabel('% Users')
axes.grid(False)
axes.tick_params(bottom=True, left=False)
axes.legend()
plt.title('ML Methods')
plt.xticks(fontsize=8)

axes = plt.subplot2grid((1,3),(0,1))
labels = cv_method_w_df['CV Method']
axes.bar(labels, cv_method_df['% Users'] , width, label='EU', color=a)
axes.bar(labels, cv_method_w_df['% Users'] , width,  color=b,bottom=cv_method_df['% Users'],label='World')
axes.set_ylabel('% Users')
axes.grid(False)
axes.tick_params(bottom=True, left=False)
axes.legend()
plt.title('CV Methods')
plt.xticks(fontsize=8)

axes = plt.subplot2grid((1,3),(0,2))
labels = nlp_method_w_df['NLP Method']
axes.bar(labels, nlp_method_df['% Users'] , width, label='EU', color=a)
axes.bar(labels,nlp_method_w_df['% Users'] , width,  color=b,bottom=nlp_method_df['% Users'],label='World')
axes.set_ylabel('% Users')
axes.grid(False)
axes.tick_params(bottom=True, left=False)
axes.legend()
plt.title('NLP Methods')
plt.xticks(fontsize=8)

sns.despine(left=True)
plt.tight_layout()
plt.show()


<span style = 'color:#cbbf80'>**4. Auto ML**</span>

> Both in the EU and in the world, most people **prefer not** to use **any automated ML tools**.
> However, there is a difference of 10% between the world and the EU in the share of people who prefer Google Cloud AutoML.
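
A gap like this is easiest to check when the two regions are aligned on the tool name rather than on row order. The sketch below merges two share tables and reports the difference in percentage points. The function name `share_gap`, the tool names and all numbers are made up for illustration and are not survey results.

```python
import pandas as pd

def share_gap(eu, world, label):
    """Hypothetical helper: align two share tables on `label` and compare them."""
    merged = eu.merge(world, on=label, how="outer", suffixes=(" EU", " World")).fillna(0)
    # Positive values mean the option is more common in the EU table.
    merged["Gap (pp)"] = merged["% Users EU"] - merged["% Users World"]
    # Largest absolute gap first.
    return merged.sort_values("Gap (pp)", key=abs, ascending=False).reset_index(drop=True)

# Made-up shares; the two tables deliberately list the tools in different orders.
eu = pd.DataFrame({"Automated ML Tools": ["No / None", "Tool A", "Tool B"],
                   "% Users": [60.0, 15.0, 25.0]})
world = pd.DataFrame({"Automated ML Tools": ["Tool A", "No / None", "Tool B"],
                      "% Users": [25.0, 55.0, 20.0]})
gaps = share_gap(eu, world, "Automated ML Tools")
print(gaps)
```

Merging on the label guards against the case where the EU and world tables come back in different orders, which a position-based comparison would silently mix up.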

In [None]:
# Q37_A: automated ML tools
auto_ml_w_df = df_creater(data_ml, "Q37_A", 'Automated ML Tools').sort_values('% Users')
auto_ml_w_df['% Users'] = (auto_ml_w_df['% Users']/auto_ml_w_df['% Users'].sum())*100

auto_ml_df = df_creater(data_eu, "Q37_A", 'Automated ML Tools').sort_values('% Users')
auto_ml_df['% Users'] = (auto_ml_df['% Users']/auto_ml_df['% Users'].sum())*100

auto_ml_w_df['Automated ML Tools'] = ['\n'.join(wrap(x, 15)) for x in  auto_ml_w_df['Automated ML Tools']]
auto_ml_df['Automated ML Tools'] = ['\n'.join(wrap(x, 15)) for x in  auto_ml_df['Automated ML Tools']]


color_list = ['#55658a','#797d8a','#979693','#b7b090','#c9be89','#dbcc80','#eddb71','#334e71']
colors = ['#693476','#69619c','#6275a2','#5a86a4','#4da7a3','#7ed381','#a6df69','#d3e64d']

kig = plt.figure(figsize=(20,5.5))
axes = plt.subplot2grid((1,2),(0,0))
plt.pie( data = auto_ml_df,x = '% Users', startangle=90,pctdistance=0.65,labels = 'Automated ML Tools', autopct='%1.1f%%',colors = color_list, labeldistance= 1.1, radius = 0.8,)
centre_circle = plt.Circle((0,0),0.7,fc='white') 
fig = plt.gcf()
fig.gca().add_artist(centre_circle)

axes = plt.subplot2grid((1,2),(0,1))
plt.pie( data = auto_ml_w_df,x = '% Users', startangle = 90,pctdistance=0.65,labels = 'Automated ML Tools', autopct='%1.1f%%',colors = colors, labeldistance= 1.1, radius = 0.8,)
centre_circle = plt.Circle((0,0),0.7,fc='white') 
fig = plt.gcf()
fig.gca().add_artist(centre_circle)

plt.suptitle('Auto ML Tools: left- EU, right- World')
plt.tight_layout()
plt.show()




<span style = 'color:#c9be89'>**5. Tools to Manage ML Experiments**</span>

> * About **60%** of users **prefer not** to use any experiment-management tool.
> * Still, **15%** of users use **MLflow** and about **20%** use **TensorBoard**.

In [None]:
tools_w_df = df_creater(data_ml, 'Q38_A', 'Tools')
tools_w_df['% Users'] = (tools_w_df['% Users']/tools_w_df['% Users'].sum())*100

tools_df = df_creater(data_eu, 'Q38_A', 'Tools')
tools_df['% Users'] = (tools_df['% Users']/tools_df['% Users'].sum())*100

labels = tools_df['Tools']
width = 0.35       # the width of the bars: can also be len(x) sequence

fig, ax = plt.subplots(figsize=(15,6))

ax.barh(labels,tools_df['% Users'], width, label='EU', color='#0e3362')
ax.barh(labels, tools_w_df['% Users'], width,  color='#4da7a3',left=tools_df['% Users'],label='World')

plt.yticks(fontsize=8)
ax.set_xlabel('% Users')
ax.grid(False)
sns.despine(bottom=True)
# ax.set_title('Scores by group and gender')
ax.tick_params(bottom=False, left=True)
ax.legend()
plt.suptitle('ML Managing Tools')
plt.show()


## <span style = 'color:#0e3362'>Cloud Platform</span>

<span style= 'color:#c4b66e'>**1. Usage of Cloud Platform**</span>

The top 3 places in both the EU and the world are occupied by **AWS, Microsoft Azure and Google Cloud Platform**. In the EU, GCP is used more than in the world as a whole, which favours Microsoft Azure a bit more, but overall the cloud platform trends in the EU and the world align with each other.


In [None]:
cloud_plat_w_df = df_creater(data_ml, 'Q27_A', 'Cloud Computing Platform').sort_values('% Users',ascending = False)
cloud_plat_w_df['% Users'] = (cloud_plat_w_df['% Users'] /cloud_plat_w_df['% Users'].sum())*100

cloud_plat_df = df_creater(data_eu, 'Q27_A', 'Cloud Computing Platform').sort_values('% Users',ascending = False).reindex([0,2,10,1,3,4,7,11,5,6,8,9])
cloud_plat_df['% Users'] = (cloud_plat_df['% Users'] /cloud_plat_df['% Users'].sum())*100

fig, ax = plt.subplots(figsize=(15,6))
x = np.arange(len(cloud_plat_df))
width = 0.4
plt.barh(x-0.2, cloud_plat_df['% Users'], width, color='#0e3362', label='EU') 
plt.barh(x+0.2, cloud_plat_w_df['% Users'], width, color='#69e2c1', label='World')

plt.ylabel(None)
plt.yticks(cloud_plat_w_df.index, labels = cloud_plat_w_df['Cloud Computing Platform'])

plt.xlabel('Percentage', fontsize=7)
plt.yticks(fontsize=10)
sns.despine(bottom=True)
ax.grid(False)
ax.tick_params(bottom=False, left=True)
plt.legend(frameon=False, fontsize=15)
plt.show()


<span style= 'color:#c4b66e'>**2. Money Spent on ML and/or Cloud Computing Services**</span>

* **38% of people aren't spending any money** on ML and/or cloud computing services.
* Here too, there isn't much difference between the EU and the world.
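
The spending brackets are ordinal, so the bars read best in bracket order. The next cell does this with a hand-written `reindex`. One alternative is an ordered categorical, sketched here with bracket labels that are assumed to match the `Q26` answer text and with made-up answers.

```python
import pandas as pd

# Assumed bracket labels for Q26, lowest to highest.
spend_order = ["$0 ($USD)", "$1-$99", "$100-$999", "$1000-$9,999",
               "$10,000-$99,999", "$100,000 or more ($USD)"]

# Made-up answers, for illustration only.
answers = pd.Series(["$1-$99", "$0 ($USD)", "$100-$999",
                     "$0 ($USD)", "$1-$99", "$0 ($USD)"])

# An ordered categorical makes value_counts(sort=False) follow bracket order
# and keeps brackets nobody picked, with a count of 0.
ordered = pd.Categorical(answers, categories=spend_order, ordered=True)
shares = pd.Series(ordered).value_counts(sort=False) / len(answers) * 100
print(shares)
```

With the order carried by the data type, the EU and world tables line up bracket by bracket without position-based reindexing.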

In [None]:
cloud_investment_w_df = pd.DataFrame(data_ml['Q26'].value_counts()).reset_index().drop(6).rename(columns = {'index':'Cloud Investment','Q26': 'Users'})
cloud_investment_w_df.Users = (cloud_investment_w_df.Users/cloud_investment_w_df.Users.sum())*100

cloud_investment_df = pd.DataFrame(data_eu['Q26'].value_counts()).reset_index().rename(columns = {'index':'Cloud Investment','Q26': 'Users'})
cloud_investment_df.Users = (cloud_investment_df.Users/cloud_investment_df.Users.sum())*100

fig, ax = plt.subplots(figsize=(15,5))
x = np.arange(len(cloud_investment_df))
width = 0.4
plt.barh(x-0.2, cloud_investment_df.reindex([0,2,1,4,3,5])['Users'], width, color='#0e3362', label='EU') 
plt.barh(x+0.2, cloud_investment_w_df['Users'], width, color='#69e2c1', label='World')

plt.ylabel(None)
plt.yticks(cloud_investment_w_df.index, labels = cloud_investment_w_df['Cloud Investment'])

plt.xlabel('Percentage', fontsize=7)
plt.yticks(fontsize=10)
sns.despine(bottom=True)
ax.grid(False)
ax.tick_params(bottom=False, left=True)
plt.legend(frameon=False, fontsize=15)
plt.show()


<span style= 'color:#c4b66e'>**3. Usage of Data Storage Products**</span>

* Amazon Simple Storage Service takes the lead here with 29%, followed by Google Cloud Storage with 20%.
* 17% of people aren't using any storage product; this could be because most of them are students.

In [None]:
# data_storage_w_df = df_creater(data_ml, 'Q30_A', 'Data Storage Products').sort_values('% Users')
# data_storage_w_df['% Users'] = (data_storage_w_df['% Users']/data_storage_w_df['% Users'].sum())*100

data_storage_df = df_creater(data_eu, 'Q30_A', 'Data Storage Products').sort_values('% Users', ascending = False)
data_storage_df['% Users'] = (data_storage_df['% Users']/data_storage_df['% Users'].sum())*100

fig, ax = plt.subplots(figsize=(13,4))

ax.hlines(data_storage_df['Data Storage Products'], xmin=0, xmax = data_storage_df['% Users']-0.50, colors = '#0e3362')
sns.scatterplot( data = data_storage_df, x = '% Users', y = 'Data Storage Products',sizes=(650, 651),size = "% Users", legend=False, color= '#04a664', alpha = 0.5)

# for j in range(len(hosted_nb_w_df)):
#     plt.annotate( "{hosted_nb_w_df['% Users'][j]: .0%}".format(hosted_nb_w_df['% Users'][j]), (hosted_nb_w_df['% Users'][j], hosted_nb_w_df['Hosted Notebook'][j]),weight = 'bold', color = '#5b5eec')

for j in range(len(data_storage_df)):
    plt.annotate( f'{data_storage_df["% Users"][j]:1.0f}', (data_storage_df['% Users'][j]-0.3, data_storage_df['Data Storage Products'][j]),weight = 'bold', color = 'white')

plt.xticks([]) 
ax.set_xlim(0,30)
sns.despine(bottom=True)
ax.grid(False)
ax.tick_params(bottom=False, left=False)
plt.tight_layout()
plt.show()


## <span style = 'color:#0e3362'>Conclusion</span>

> There isn't any major difference between the EU and the world in how data science and ML are practised, apart from slight variations such as those in the age distribution and the formal education distribution. Even these variations can be easily explained. For example, because the EU has subsidized education systems, people tend to hold more formal degrees than in nations where fees are quite hefty. We can therefore say that the EU is not an outlier and is mostly aligned with the rest of the world.

This is my first notebook ever, and I tried my best. Your feedback is much appreciated!
