# Overview of the Most Frequent Answer to Each Question

The 5th Kaggle Survey includes **63 unique questions** (single- and multiple-choice) about how practitioners currently apply machine learning and data science. To give a broad picture of the state of the field, I listed the most common answer to each question in the **q_a** dataframe. *For multiple-choice questions, only the most frequent response was kept*.
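
As a minimal sketch of the idea, the snippet below applies the same selection rule to a tiny made-up table (the values are invented for illustration and are not survey results). It assumes the survey layout in which a multiple-choice question is spread over one column per option, and relies on `describe()` reporting `top` (most common value) and `freq` (its count) for text columns.

```python
import pandas as pd

# Toy responses (made-up data, not the real survey): one single-choice
# question (Q1) and one multiple-choice question with one column per option
toy = pd.DataFrame({
    'Q1': ['25-29', '18-21', '25-29', '30-34'],
    'Q7_Part_1': ['Python', 'Python', None, 'Python'],
    'Q7_Part_2': [None, 'R', None, None],
    'Q7_Part_3': ['SQL', None, 'SQL', None],
})

# For text columns, describe() reports the most common value and its count
stats = toy.describe().transpose()

# Single-choice question: the answer is the 'top' value of its column
single_answer = stats.loc['Q1', 'top']

# Multiple-choice question: keep the option column with the highest 'freq'
q7_cols = [c for c in toy.columns if c.startswith('Q7_')]
best_col = stats.loc[q7_cols, 'freq'].astype(int).idxmax()
multi_answer = stats.loc[best_col, 'top']

print(single_answer, multi_answer)
```

Here the multiple-choice answer is the option selected by the most respondents, which is how the rule stated above should be read.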

In brief, companies have been seeking ML and DS practitioners in recent years, which drives young professionals and graduates to enter these fields. For instance, the most common answers describe a participant who is **male, aged 25-29, and residing in India**. Most participants are students or young professionals who started coding only 1-3 years ago.

**AWS** leads as a cloud computing services provider. The relational database **MySQL** is the most used database management system. The **Matplotlib** library beats Seaborn as the most popular data visualisation tool. **Excel** is also still frequently used as statistical analysis software. Most of the companies that apply machine learning in their business are SMEs.

In summary, ML and DS adoption in business is still new, and the practitioners also tend to be young.

In [None]:
import numpy as np
import pandas as pd

import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
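# The 'seaborn' style was renamed 'seaborn-v0_8' in Matplotlib 3.6; fall back on older versions
plt.style.use('seaborn-v0_8' if 'seaborn-v0_8' in plt.style.available else 'seaborn')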
mpl.rcParams['figure.dpi'] = 360

import warnings
warnings.simplefilter(action='ignore', category=(FutureWarning))

#pd.set_option('display.max_columns', None)
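df = pd.read_csv('../input/kaggle-survey-2021/kaggle_survey_2021_responses.csv', low_memory=False)
# Row 0 holds the question text, so drop it to keep only the responses
answers = df.drop(0, axis=0)
# describe() on text columns yields count, unique, top (most common value) and freq
stats = answers.describe().transpose()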

q_a = pd.concat([stats,df.loc[0,:]],axis=1).rename(columns={0:'question'}) #concat question row and stats df
q_a = q_a.reset_index()

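def split_q(x):
    # Keep only the text before the first '-' (drops the "- Selected Choice - ..." suffix).
    # str.split never raises ValueError, so no fallback branch is needed:
    # a question without '-' is returned unchanged.
    return x.split('-')[0]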
    
q_a['question'] = q_a['question'].apply(split_q)

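# Multiple-choice questions are spread over several columns (e.g. Q7_Part_1, Q7_Part_2),
# so split the column names and keep those whose first two parts repeat
mul_q = q_a['index'].str.split('_',expand=True).rename(columns={0:'QA',1:'QB'})
mul_q = mul_q.loc[mul_q[['QA','QB']].duplicated(keep=False)]
# Row positions of each multiple-choice question, one list per question
inds = [ind.index.tolist() for _, ind in mul_q.groupby(['QA','QB'])]

# Single-answer questions are the remaining rows (difference() keeps them in sorted order)
q = q_a.index.difference(mul_q.index)
single_q = q_a.loc[q, ['index','question','top']].rename(columns={'top':'answer'})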
# For each multiple-choice question keep the option(s) with the highest frequency.
# Rows are collected in a list because DataFrame.append was removed in pandas 2.0.
multi_rows = []
for i in inds:
    group = q_a.iloc[i]
    is_top = group['freq'] == group['freq'].max()
    multi_rows.append(group.loc[is_top, ['index','question','top']])

multi_answers = pd.concat(multi_rows).rename(columns={'top':'answer'})
q_a = pd.concat([single_q, multi_answers])
q_a = q_a.set_index('index') 

In [None]:
#pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
q_a

Curious to look into the individual questions and answers? Try the `check_detail` function yourself.

In [None]:
def check_detail(question_index):
    print(q_a.loc[question_index,'question'])
    print(q_a.loc[question_index,'answer'])
    
#check_detail('Q26')

# Questions I explored

1. [What is the percentage of each gender identity in different age groups?](#age_gender)
2. [What are the three most popular programming languages in different countries?](#lang_cn)


<a id='age_gender'></a>
## What is the percentage of each gender identity in different age groups?


In [None]:
age_sex = answers[['Q1','Q2']]
# Count respondents for every (age group, gender) pair
age_gender = age_sex.groupby(['Q1','Q2'])['Q2'].count()
age_gender = age_gender.rename('Count').reset_index()
age_gender = pd.pivot(age_gender, index='Q1', columns='Q2', values='Count')
# Divide each row by its total so every age group sums to 1
age_gender = age_gender.div(age_gender.sum(axis=1), axis=0)

ax = age_gender.plot(
    kind='barh', 
    stacked=True, 
    #colormap='coolwarm', 
    figsize=(16,8) 
    )
# With horizontal bars the age groups run along the y-axis
ax.set_ylabel('Age group')
ax.set_xlabel('Share of respondents')
ax.legend(bbox_to_anchor=(1.04,1), loc='upper left')
ax.set_title('Share of each gender identity within each age group', fontdict={'fontsize':14})
plt.show()

In [None]:
age_group_count = age_sex.Q1.value_counts()
sex_count = age_sex.Q2.value_counts()

fig, axes = plt.subplots(1,2,figsize=(11, 4))
axes[0].pie(age_group_count, autopct='%1.1f%%',pctdistance=1.2,)
axes[0].legend(age_group_count.index, bbox_to_anchor=(1,1))
axes[0].set_title('Pct of Age Group')
axes[1].pie(sex_count, autopct='%1.1f%%',pctdistance=1.2)
axes[1].legend(sex_count.index, bbox_to_anchor=(1,1))
axes[1].set_title('Pct of Gender Identity')
plt.tight_layout()
plt.show()

As the survey results show, the share of female coders increases across age groups. However, male coders still dominate the ML and DS fields, constituting at least 79.3% of respondents. More than half of the survey participants are under 30 years old.
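
The row-wise shares behind the stacked bar chart can also be computed in a single step. The sketch below is an alternative, not the method used in this notebook: it runs `pd.crosstab` with `normalize='index'` on a few made-up rows whose column names mirror the survey's `Q1` (age) and `Q2` (gender). One difference worth noting is that `crosstab` fills missing combinations with 0, whereas the pivot approach leaves them as NaN.

```python
import pandas as pd

# Made-up respondents; the column names mirror the survey's Q1 (age) and Q2 (gender)
toy = pd.DataFrame({
    'Q1': ['18-21', '18-21', '25-29', '25-29', '25-29', '25-29'],
    'Q2': ['Man', 'Woman', 'Man', 'Man', 'Man', 'Woman'],
})

# normalize='index' divides every row by its total, so each age group sums to 1
shares = pd.crosstab(toy['Q1'], toy['Q2'], normalize='index')
print(shares)
```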

<a id='lang_cn'></a>
## What are the three most popular programming languages in different countries?

In [None]:
from collections import Counter

def agg_lang(x):
    return [ i for i in x if i != 0]

def get_lang(x):
    return x[0]

loc = answers['Q3']
# Q7 is a multiple-choice question, stored as one column per option
q7 = [c for c in answers.columns if 'Q7' in c]
        
langs = answers.loc[:, q7].fillna(0)
langs = langs.agg(agg_lang, axis=1)
loc_lang = pd.concat([langs, loc], axis=1)
loc_lang = loc_lang.rename(columns={'Q3':'Country',0:'Langs'})
loc_lang = loc_lang.query('(Country!="Other") & (Country!="I do not wish to disclose my location")')
cn_count = loc_lang['Country'].value_counts()

lang_dict = dict()
for g in loc_lang.groupby('Country'):
    lang = g[1].loc[:, 'Langs']# sub series
    lang_lst = list()
    for l in lang:
        lang_lst.extend(l)
    lang_dict[g[0]] = Counter(lang_lst).most_common(3)

lang_cn = pd.DataFrame.from_dict(lang_dict).transpose().rename(columns={0:'Top1',1:'Top2',2:'Top3'})
# Keep only the language name from each (language, count) pair.
# Series.map works across pandas versions (DataFrame.applymap is deprecated since pandas 2.1).
lang_cn = lang_cn.apply(lambda col: col.map(get_lang))
# Name the counts explicitly: value_counts() names its result differently across pandas versions
lang_cn = lang_cn.merge(cn_count.rename('Practitioners'), left_index=True, right_index=True)
lang_cn = lang_cn.rename_axis('Country').reset_index()
# Build the hover text row by row; merging on 'Practitioners' would mix up countries with equal counts
top3 = lang_cn[['Country','Practitioners']].copy()
top3['Top 3'] = lang_cn[['Top1','Top2','Top3']].agg(' '.join, axis=1)

import plotly.express as px 
import json 
with open(r'../input/country-state-geo-location/countries.geo.json','r') as f:
    map_json = json.load(f)

fig = px.choropleth(
    top3,
    geojson = map_json['features'],
    locations = top3.Country,
    locationmode = 'country names',
    hover_name = top3['Country'],
    color = top3.Practitioners,
    color_continuous_scale="Viridis",
    hover_data = [ top3['Top 3']]
)

fig.show()

The map above shows that Python is the most popular programming language in the machine learning and data science fields, and SQL is the second most popular. In addition, more than 2,500 practitioners currently reside in each of India and the United States.
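
To make the per-country ranking easier to follow, here is a minimal standard-library sketch of the same idea: tally every selected language per country with `Counter` and keep the three most common. The rows are made up for illustration and are not survey results.

```python
from collections import Counter, defaultdict

# Made-up (country, selected languages) pairs standing in for the survey rows
rows = [
    ('India', ['Python', 'SQL', 'R', 'C++']),
    ('India', ['Python', 'SQL', 'R']),
    ('India', ['Python', 'SQL']),
    ('India', ['Python']),
    ('Japan', ['Python']),
    ('Japan', ['Python', 'SQL']),
]

# Tally every selected language per country
counts = defaultdict(Counter)
for country, langs in rows:
    counts[country].update(langs)

# most_common(3) returns up to three (language, count) pairs, highest count first
top3 = {
    country: [lang for lang, _ in counter.most_common(3)]
    for country, counter in counts.items()
}
print(top3)
```

A country whose respondents selected fewer than three languages simply gets a shorter list, as the hypothetical `Japan` rows show.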