# Introduction

<p style="font-size:18px">In a dynamic and fast-evolving field like machine learning, where new architectures crop up every month, it can be difficult for experienced people with multiple responsibilities to keep up with the latest developments. In the technology sector, this can lead to older people losing out on employment opportunities, as they may not be able to jump on the bandwagon of the latest framework or technology stack.
This notebook explores whether this is true in the ML community, and how preferences change with age.</p>

<p style="font-size:18px;">We look at the subset of Python users and explore whether the age groups differ in the tools and methods they use.</p>

In [None]:
!pip install venn -q
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib import cm
from scipy.stats import chi2_contingency, chisquare
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error as mse
import lightgbm as lgb
from sklearn.preprocessing import LabelEncoder
from venn import venn
import plotly.express as px
from pandas.plotting import parallel_coordinates as pc
from ipywidgets import interact, interactive
import ipywidgets as widgets
from IPython.display import display
import plotly.graph_objects as go
from plotly.graph_objects import Layout
import re
import warnings
warnings.simplefilter('ignore')

def cleant(string):
    string = string.replace('$',"")
    string = string.replace(',',"")
    string = string.replace('>',"")
    string = string.split("-")[0]
    return string

def restring(x):
    # Strip any parenthesised or bracketed remark; raw string avoids invalid-escape warnings
    x = re.sub(r"[\(\[].*?[\)\]]", "", x)
    return x.strip()

plt.rcParams.update({'font.size': 18})

# Python Users

<p style='font-size: 18px;'>Python is the most popular language in the Machine Learning community owing to its ease of use, and richly developed open source ecosystem with NumPy, SciPy and other computing packages. It offers the ability to speed up programs via C++ interoperability and interfacing with GPU/ TPU.</p>

In [None]:
responses = pd.read_csv('../input/kaggle-survey-2021/kaggle_survey_2021_responses.csv').iloc[1:]
pythonall = responses[(responses.Q7_Part_1 == 'Python')]
pythonusa = responses[(responses.Q7_Part_1 == 'Python') & (responses.Q3 == "United States of America")]
responses['proglangsno'] = responses[[c for c in responses.columns if 'Q7' in c]].count(axis=1)
pythonly = responses[(responses.Q7_Part_1 == 'Python') & (responses.proglangsno == 1)]
pythonusa['salary'] = pythonusa.Q25.fillna("0").apply(lambda x: int(cleant(x)))
pythonall['salary'] = pythonall.Q25.fillna("0").apply(lambda x: int(cleant(x)))
pythonall.to_csv('resp.csv', index=False)

In [None]:
py = responses[(responses.Q7_Part_1 == 'Python')].Q1.value_counts().sort_index()
r = responses[~(responses.Q7_Part_2.isna())].Q1.value_counts().sort_index()
sql = responses[~(responses.Q7_Part_3.isna())].Q1.value_counts().sort_index()
c = responses[~(responses.Q7_Part_4.isna())].Q1.value_counts().sort_index()
cpp = responses[~(responses.Q7_Part_5.isna())].Q1.value_counts().sort_index()
java = responses[~(responses.Q7_Part_6.isna())].Q1.value_counts().sort_index()
js = responses[~(responses.Q7_Part_7.isna())].Q1.value_counts().sort_index()
julia = responses[~(responses.Q7_Part_8.isna())].Q1.value_counts().sort_index()
swift = responses[~(responses.Q7_Part_9.isna())].Q1.value_counts().sort_index()
bash = responses[~(responses.Q7_Part_10.isna())].Q1.value_counts().sort_index()
matlab = responses[~(responses.Q7_Part_11.isna())].Q1.value_counts().sort_index()
other = responses[~(responses.Q7_Part_12.isna())].Q1.value_counts().sort_index()
q7s = [c for c in responses.columns if 'Q7' in c]
q7s.remove('Q7_Part_1')
q7s.remove('Q7_Part_2')
q7s.remove('Q7_Part_3')
others = responses[responses[q7s].count(axis=1)>0]
sums = np.zeros((11,))
colors = ['midnightblue','blue','slateblue','thistle', 'plum', 'fuchsia','mediumvioletred','violet', 'orchid','deeppink','palevioletred','pink']
for each in [py, r, sql, c, cpp, java, js, julia, swift, bash, matlab, other]:
    sums += each.values
preeach = np.zeros((11,))
preeachs = np.zeros((11,))
fig, ax = plt.subplots(1,2, figsize=(30,10))
colori = 0
for each in [py, r, sql, c, cpp, java, js, julia, swift, bash, matlab, other]:
    ax[1].bar(each.index, each.values, bottom = preeach, width=0.5, color=colors[colori])
    preeach += each.values
    each = each/sums * 100
    ax[0].bar(each.index, each.values, bottom = preeachs, width=0.5, color=colors[colori])
    preeachs += each.values
    colori += 1
ax[0].legend(['Python', 'R', 'SQL', 'C', 'C++', 'Java', 'JS', 'Julia', 'Swift', 'Bash', 'Matlab', 'Other'], bbox_to_anchor=(1.1, 1.05), prop={'size': 12})
ax[0].set_title('Distribution of Language Usage by Age')
ax[1].set_title('Absolute Number of Language Users by Age')
for item in ([ax[0].title, ax[0].xaxis.label, ax[0].yaxis.label] + ax[0].get_xticklabels() + ax[0].get_yticklabels()):
    item.set_fontsize(15)
for item in ([ax[1].title, ax[1].xaxis.label, ax[1].yaxis.label] + ax[1].get_xticklabels() + ax[1].get_yticklabels()):
    item.set_fontsize(15)
plt.show()

<p style="font-size:18px">Because the choice of programming language could itself influence several other factors, we focus on Python programmers specifically. This still gives us a large sample that is representative of the ML community.</p>

In [None]:
ccppall = responses[~(responses.Q7_Part_4.isna() & responses.Q7_Part_5.isna())]
sqlall = responses[~(responses.Q7_Part_3.isna())]
fig, ax3 = plt.subplots(1,2, figsize = (25,8))
responses['proglangsno'].value_counts().sort_index().plot(kind='bar', ax = ax3[0], rot=0, title = 'Number of Languages used by Respondents', cmap='coolwarm')
responsesdf = responses.groupby('Q1')['proglangsno'].mean().sort_index()
responsesdf.index.name = 'Age'
responsesdf.plot(kind='bar', ax=ax3[1], title = 'Number of Languages used by Age Groups', colormap=cm.get_cmap('coolwarm'))
otherlang = responses[responses['proglangsno']>0]
plt.show()

<p style="font-size:18px;">Note also that respondents program in two or more languages on average, and this holds for all age groups. Among Python users who know multiple languages, SQL is the preferred additional language, used for data management.</p>
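<p style="font-size:18px;">The per-respondent language count behind this observation can be sketched on toy data. The rows below are made up for illustration and are not survey responses; only the column naming pattern (<code>Q7_Part_*</code>, <code>Q1</code>) follows the notebook.</p>

```python
import numpy as np
import pandas as pd

# Toy multi-select answers: one column per language, NaN when not selected
toy = pd.DataFrame({
    'Q1': ['18-21', '18-21', '40-44'],
    'Q7_Part_1': ['Python', 'Python', 'Python'],
    'Q7_Part_2': [np.nan, 'R', np.nan],
    'Q7_Part_3': ['SQL', 'SQL', np.nan],
})

# count(axis=1) counts the non-null cells in each row, i.e. languages selected
lang_cols = [c for c in toy.columns if c.startswith('Q7_')]
toy['n_languages'] = toy[lang_cols].count(axis=1)

# Average number of languages per age group
mean_langs = toy.groupby('Q1')['n_languages'].mean()
print(mean_langs)
```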

In [None]:
datasetdict = {'Python': set(pythonall.index), 'C/C++': set(ccppall.index), 'SQL': set(sqlall.index), 'Other':set(others.index)}
venn(datasetdict, cmap="coolwarm", fmt="{percentage: 0.1f}%", fontsize=12, legend_loc="upper left")
plt.show()

# Salary Distribution

<p style="font-size:18px">Salary is one of the clearest indicators of whether employers show any age-based preference. For salary comparisons across age groups, we consider only respondents from the USA, as salaries vary widely across geographies owing to different costs of living and Purchasing Power Parity.</p>
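<p style="font-size:18px;">Salaries in the survey are reported as ranges, so the averages in this section rest on mapping each range to a single number. The sketch below assumes the lower bound of each range is used, as the <code>cleant</code> helper does. The function name <code>bucket_lower_bound</code> and the toy rows are hypothetical, and because lower bounds understate true pay, the averages are best read as relative comparisons.</p>

```python
import pandas as pd

def bucket_lower_bound(label: str) -> int:
    """Return the lower bound of a salary range label such as '$0-999'."""
    cleaned = label.replace('$', '').replace(',', '').replace('>', '')
    return int(cleaned.split('-')[0])

# Toy rows (made up for illustration, not survey data)
toy = pd.DataFrame({
    'age': ['25-29', '25-29', '40-44', '40-44'],
    'bucket': ['50,000-59,999', '100,000-124,999', '150,000-199,999', '>$1,000,000'],
})

# Map each range to its lower bound, then average within each age group
toy['salary'] = toy['bucket'].map(bucket_lower_bound)
mean_by_age = toy.groupby('age')['salary'].mean()
print(mean_by_age)
```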

In [None]:
salage = pd.crosstab(index=pythonusa['Q1'], columns=pythonusa['Q25'], normalize='index')
saldict = dict(zip(salage.columns, [int(cleant(x)) for x in salage.columns]))
saldict = dict(sorted(saldict.items(), key=lambda x: x[1]))
salage = salage[list(saldict)]
salage.index.name = 'Age'
fig, ax = plt.subplots(1,2, figsize=(25, 10))
salage.plot(kind='bar', stacked=True, colormap=cm.get_cmap('coolwarm'), ax = ax[0], title = "Salary Distribution by Age for All Python Users", edgecolor='none').legend(loc='lower left', bbox_to_anchor=(0, -.5), ncol=5)
salage = pd.crosstab(index=pythonusa['Q1'], columns=pythonusa['Q25'])
salage = salage[list(saldict)]
# Weighted mean of bucket lower bounds; broadcasting avoids hard-coding the number of age groups
df = (salage.values * np.array(list(saldict.values()))).sum(axis=1) / salage.sum(axis=1)
df.index.name = 'Age'
df.plot(kind='bar', stacked=True, ax = ax[1], title = "Average Salary by Age for All Python Users", colormap=cm.get_cmap('coolwarm'))
for item in ([ax[0].title, ax[0].xaxis.label, ax[0].yaxis.label] + ax[0].get_xticklabels() + ax[0].get_yticklabels()):
    item.set_fontsize(20)
for item in ([ax[1].title, ax[1].xaxis.label, ax[1].yaxis.label] + ax[1].get_xticklabels() + ax[1].get_yticklabels()):
    item.set_fontsize(20)
plt.show()

**<p style="font-size:18px">Salary Distribution for Data Scientists in the USA</p>**

In [None]:
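# na=False treats a missing job title as a non-match instead of producing NaN in the mask
datapythonusa = pythonusa[pythonusa.Q5.str.startswith('Data', na=False)]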
salage = pd.crosstab(index=datapythonusa['Q1'], columns=datapythonusa['Q25'], normalize='index')
saldict = dict(zip(salage.columns, [int(cleant(x)) for x in salage.columns]))
saldict = dict(sorted(saldict.items(), key=lambda x: x[1]))
salage = salage[list(saldict)]
salage.index.name = 'Age'
fig, ax = plt.subplots(1,2, figsize=(25, 6))
salage.plot(kind='bar', stacked=True, colormap=cm.get_cmap('coolwarm'), ax = ax[0], title='Salary Distribution by Age for Data Scientists', edgecolor="none").legend(loc='lower left', bbox_to_anchor=(0, -.8), ncol=5)
salage = pd.crosstab(index=datapythonusa['Q1'], columns=datapythonusa['Q25'])
salage = salage[list(saldict)]
salage.index.name = 'Age'
# Broadcasting keeps this working even if an age group is absent from the filtered data
df = (salage.values * np.array(list(saldict.values()))).sum(axis=1) / salage.sum(axis=1)
df.plot(kind='bar', stacked=True,   ax = ax[1], title='Average Salary by Age for Data Scientists',colormap=cm.get_cmap('coolwarm'))
for item in ([ax[0].title, ax[0].xaxis.label, ax[0].yaxis.label] + ax[0].get_xticklabels() + ax[0].get_yticklabels()):
    item.set_fontsize(20)
for item in ([ax[1].title, ax[1].xaxis.label, ax[1].yaxis.label] + ax[1].get_xticklabels() + ax[1].get_yticklabels()):
    item.set_fontsize(20)
plt.show()

<p style="font-size:18px;">When we take a look at the distribution of salaries by age, we see that earnings peak in the 40s and remain stagnant or fall off slightly thereafter. When we filter specifically for Data Analyst/ Data Engineer/ Data Scientist roles, we do find a peak in the early 40s, followed by a higher peak in the late 50s before retirement.</p>
<p style="font-size:18px;">When considering all respondents, the stagnation in average salary may be due to people continuing in individual contributor (IC) roles without shifting to management, different educational qualifications, outmoded skills, or because respondents from different age groups mostly belong to different industries.</p>

# Data Exploration

**<p style="font-size:20px">Educational Qualifications</p>**

In [None]:
degrees = ['I prefer not to answer','No formal education past high school', 'Some college/university study without earning a bachelor’s degree', 'Bachelor’s degree', 'Master’s degree',
           'Professional doctorate', 'Doctoral degree']
degreedf = pd.crosstab(index=pythonall['Q1'], columns=pythonall['Q4'], normalize='index')
degreedf = degreedf[degrees]
degreedf.index.name = "Age"
degreedf.plot(kind='bar', stacked = True, figsize=(15,10), colormap=cm.get_cmap('coolwarm')).legend(loc='lower center', bbox_to_anchor=(0.4,-.4), ncol=3)
plt.show()

<p style="font-size:18px;">Among the respondents, the older age groups have a significantly higher proportion of people with advanced degrees. Taking this into consideration, we would have expected salaries to increase monotonically with age. But that is not the case, so we will look further.</p>
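<p style="font-size:18px;">The proportions compared here come from a row-normalised cross-tabulation, so each age group sums to 1 regardless of how many people it contains. A minimal sketch on made-up rows (the degree labels are shortened for illustration):</p>

```python
import pandas as pd

# Toy respondents (made up for illustration, not survey data)
toy = pd.DataFrame({
    'age': ['22-24', '22-24', '22-24', '22-24', '50-54', '50-54'],
    'degree': ['Bachelor', 'Bachelor', 'Bachelor', 'Master', 'Master', 'Doctoral'],
})

# normalize='index' divides each row by its total, giving shares within an age group
shares = pd.crosstab(index=toy['age'], columns=toy['degree'], normalize='index')
print(shares)
```

<p style="font-size:18px;">Because every row is rescaled separately, a small age group carries the same visual weight as a large one, so differences in sparsely populated groups should be read with caution.</p>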

**<p style="font-size:20px">Industry</p>**

In [None]:
q20cols = [c for c in responses.columns if 'Q20' in c]
industry = pd.melt(pythonall, id_vars = 'Q1',value_vars = q20cols)
industry = industry.groupby(['Q1','value']).count().reset_index()
industry = pd.pivot(industry, index='Q1', columns='value', values='variable')
industry = industry.div(industry.sum(axis=1), axis=0)
industry.index.name = 'Age'
industry.plot(kind='bar', stacked=True, mark_right=True, cmap='tab20b', figsize=(15,10), edgecolor='none', title='Breakdown of Respondents by Industry').legend(loc='right', bbox_to_anchor=(1.5, .5))
plt.show()

<p style="font-size:18px;">The proportion of respondents from the Computers/ Technology industry decreases slightly with age, possibly because the tech industry has boomed in the recent past with massive hiring sprees.</p>

In [None]:
salindustry = pythonusa.groupby(['Q1','Q20'])['salary'].mean().reset_index()
sns.set(rc={'figure.figsize':(25,15)})
sns.set_style("whitegrid", {"grid.color": ".8", "grid.linestyle": ":", 'axes.edgecolor': 'blue'})
plts = sns.scatterplot(data = salindustry, x='Q1', y='Q20', size='salary', sizes=(1,800))
plts.set_title('Salary by Industry and Age')
plts.set_xlabel("Age",fontsize=30)
plts.set_ylabel("Industry",fontsize=20)
plts.tick_params(labelsize=15)
plt.show()

<p style="font-size:18px;">When it comes to Online Business/ Internet-based Sales, there is a great disparity in salaries with a peak at 40-44 years of age, and a sharp drop thereafter. The situation in other industries is better, with average salaries stabilizing or increasing with age.</p>

**<p style="font-size:18px;">ML Usage by Industry</p>**

In [None]:
mluse = pythonall.groupby(['Q20','Q23'])['Q1'].count().reset_index()
mluse['Q23'] = mluse.Q23.apply(lambda x: restring(x))
mluse['Q23'] = pd.Categorical(mluse.Q23, categories =['I do not know','No','We are exploring ML methods', 'We use ML methods for generating insights',
                                                                                       'We recently started using ML methods','We have well established ML methods' ], ordered=True)
sns.set(rc={'figure.figsize':(30,15)})
sn = sns.scatterplot(data = mluse, x='Q20', y='Q23', size='Q1', sizes=(10,800))
sn.set_xlabel("Industry",fontsize=15)
sn.tick_params(labelsize=35)
sn.set_xticklabels(mluse.Q20.unique(), rotation=90)
sn.set_title("ML Usage by Industry")
plt.show()

<p style="font-size:18px;">While the Online Business sector showed a disparity in salary between age groups, we can also see that respondents from that industry do not use Machine Learning methods at work. Industries such as Computers/Technology and Academia, which use ML heavily and hire Data Scientists, have more equitable salary profiles.</p>

# Hardware and Technology Stacks

<p style="font-size:18px;">Let us take a look at the stack used and see if we can find any differences in technology skills and preferences.</p>

**<p style="font-size:20px;">Workstations</p>**

In [None]:
workstation = pd.crosstab(index=pythonall['Q1'], columns=pythonall['Q11'], normalize='index')
workstation.columns = [restring(c) for c in workstation.columns]
workstation.index.name = 'Age'
workstation.plot(kind='bar', stacked=True, colormap='Set3', figsize=(10, 6), title='Computing Platform for DS', edgecolor='none').legend(loc='right', bbox_to_anchor=(2, .5), title='Hardware')
plt.show()

<p style="font-size:18px;">Older respondents have a marked preference for PCs/ Desktops which are usually set up ergonomically, while a laptop is the workstation of choice for younger respondents who probably prefer the flexibility of working from different locations. </p>

**<p style="font-size:20px;">Hardware</p>**

In [None]:
fig, ax = plt.subplots(1,2, figsize=(25,5))
q12cols = [c for c in responses.columns if 'Q12' in c]
hw = pd.melt(pythonall, id_vars = 'Q1',value_vars = q12cols)
hw = hw.groupby(['Q1','value']).count().reset_index()
hw = pd.pivot(hw, index='Q1', columns='value', values='variable')
hw = hw.div(hw.sum(axis=1), axis=0)
hw.index.name = "Age"
hw.plot(kind='bar', stacked=True, mark_right=True, ax = ax[0], title = "Computing Hardware", colormap=cm.get_cmap('Set3'), edgecolor='none').legend(loc='lower left',bbox_to_anchor=(0, -.4), ncol = 5, prop={'size': 12})
q13cols = [c for c in responses.columns if 'Q13' in c]
tpuse = pd.melt(pythonall, id_vars = 'Q1',value_vars = q13cols)
tpuse = tpuse.groupby(['Q1','value']).count().reset_index()
tpuse = pd.pivot(tpuse, index='Q1', columns='value', values='variable')
tpuse = tpuse.div(tpuse.sum(axis=1), axis=0)
tpuse = tpuse[['Never', 'Once', '2-5 times', '6-25 times', 'More than 25 times']]
tpuse.index.name = "Age"
tpuse.plot(kind='bar', stacked=True, mark_right=True, ax = ax[1], title="TPU Usage", colormap=cm.get_cmap('coolwarm'), edgecolor='none').legend(loc='upper left', bbox_to_anchor=(0, -.2), ncol=5, prop={'size': 12})
plt.show()

<p style="font-size:18px;">When it comes to accelerated computing on GPUs and TPUs, there is no marked difference between the age groups: the proportion of people using these systems stays about the same.</p>
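<p style="font-size:18px;">Most charts in this section follow the same recipe for multi-select questions: reshape the per-option columns into long form, count selections per age group, and normalise each row. The helper below is a hypothetical sketch of that recipe (<code>share_by_age</code> is not part of the notebook, and it uses <code>pd.crosstab</code> in place of the groupby and pivot steps). The toy rows are made up.</p>

```python
import numpy as np
import pandas as pd

def share_by_age(df, age_col, part_cols):
    """Share of each selected option within every age group."""
    # Long form: one row per (respondent, option); unselected options are NaN
    long = df.melt(id_vars=age_col, value_vars=part_cols).dropna(subset=['value'])
    # Count selections per age group and option
    counts = pd.crosstab(index=long[age_col], columns=long['value'])
    # Divide each row by its total so every age group sums to 1
    return counts.div(counts.sum(axis=1), axis=0)

# Toy multi-select answers (made up for illustration)
toy = pd.DataFrame({
    'Q1': ['18-21', '18-21', '30-34'],
    'Q12_Part_1': ['GPU', 'GPU', np.nan],
    'Q12_Part_2': [np.nan, 'TPU', 'TPU'],
})
hw_share = share_by_age(toy, 'Q1', ['Q12_Part_1', 'Q12_Part_2'])
print(hw_share)
```

<p style="font-size:18px;">Note that the denominator is the number of selections, not the number of respondents, so a respondent who ticks several options contributes more than once.</p>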

**<p style="font-size:20px;">IDE</p>**

In [None]:
q9cols = [c for c in responses.columns if 'Q9' in c]
ides = pd.melt(pythonall, id_vars = 'Q1',value_vars = q9cols)
# Drop unselected options so that empty answers do not form their own category
ides = ides.dropna(subset=['value'])
ides['value'] = ides['value'].apply(restring)
ides['value'] = ides.value.replace('Jupyter Notebook', 'Jupyter')
ides = ides.groupby(['Q1','value']).count().reset_index()
ides = pd.pivot(ides, index='Q1', columns='value', values='variable')
ides = ides.div(ides.sum(axis=1), axis=0)
ides.index.name = "Age"
ides.plot(kind='bar', stacked=True, mark_right=True, title = "IDE Preferences", colormap=cm.get_cmap('tab20c'), figsize=(15,5), edgecolor='none').legend(loc='lower left',bbox_to_anchor=(0, -.5), ncol = 6)
plt.show()

<p style="font-size:18px;">Jupyter remains the platform of choice among all age groups.</p>

**<p style="font-size:20px;">Algorithms</p>**

In [None]:
q17cols = [c for c in responses.columns if 'Q17' in c]
algos = pd.melt(pythonall, id_vars = 'Q1',value_vars = q17cols)
algos = algos.groupby(['Q1','value']).count().reset_index()
algos = pd.pivot(algos, index='Q1', columns='value', values='variable')
algos.index.name = 'Age'
algos = algos.div(algos.sum(axis=1), axis=0)
algos.plot(kind='bar', stacked=True, mark_right=True, figsize=(15,5), colormap=cm.get_cmap('coolwarm'), edgecolor='none').legend(loc='right', bbox_to_anchor=(1.4, .5), prop={'size': 10})
plt.show()

<p style="font-size:18px;">Linear/ Logistic Regression and Decision Trees are the most popular approaches across all groups.</p>

**<p style="font-size:20px">Deep Learning Methods</p>**

In [None]:
fig, ax = plt.subplots(1,2, figsize=(20,5))
q18cols = [c for c in responses.columns if 'Q18' in c]
algos = pd.melt(pythonall, id_vars = 'Q1',value_vars = q18cols)
algos = algos.groupby(['Q1','value']).count().reset_index()
algos = pd.pivot(algos, index='Q1', columns='value', values='variable')
algos = algos.div(algos.sum(axis=1), axis=0)
algos.columns = ['General Purpose Img/Vid Tools', 'Generative Networks', 'Img Classif Networks', 'Img Segmentation Methods', 'None', 'Object Detection Methods', 'Other']
algos.plot(kind='bar', stacked=True, mark_right=True, ax = ax[0], title = 'Computer Vision Methods', colormap=cm.get_cmap('Set3'), edgecolor='none').legend(loc='lower center', bbox_to_anchor=(0.5, -.5), ncol=3)
q19cols = [c for c in responses.columns if 'Q19' in c]
algos = pd.melt(pythonall, id_vars = 'Q1',value_vars = q19cols)
algos = algos.groupby(['Q1','value']).count().reset_index()
algos = pd.pivot(algos, index='Q1', columns='value', values='variable')
algos = algos.div(algos.sum(axis=1), axis=0)
algos.columns = [restring(c) for c in algos.columns]
algos.plot(kind='bar', stacked=True, mark_right=True, ax = ax[1], title='Natural Language Processing Methods', colormap=cm.get_cmap('coolwarm'), edgecolor='none').legend(loc='lower center', bbox_to_anchor=(.5, -.45), ncol=3)
plt.show()

<p style="font-size:18px;">Usage of deep learning methods is similar among all the age groups.</p>
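<p style="font-size:18px;">One way to put a number on "similar" is a chi-square test of independence between age group and method, using <code>chi2_contingency</code> from SciPy. The sketch below runs on made-up counts, not on the survey data, and it assumes each count is an independent observation, which multi-select answers only approximate.</p>

```python
import numpy as np
from scipy.stats import chi2_contingency

# Made-up counts: rows are age groups, columns are methods
observed = np.array([
    [40, 30, 30],
    [20, 15, 15],
    [10,  8,  7],
])

# Null hypothesis: the mix of methods does not depend on the age group
chi2, p_value, dof, expected = chi2_contingency(observed)
print(f"chi2={chi2:.3f}, p={p_value:.3f}, dof={dof}")
```

<p style="font-size:18px;">A large p-value means the data give no evidence that the mix differs by age. It does not prove that the groups are identical.</p>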


**<p style="font-size:20px">Advanced Machine Learning Products</p>**

In [None]:
fig, ax = plt.subplots(1,2, figsize=(20,5))
q31acols = [c for c in responses.columns if 'Q31_A' in c]
mltools = pd.melt(pythonall, id_vars = 'Q1',value_vars = q31acols)
mltools = mltools.groupby(['Q1','value']).count().reset_index()
mltools = pd.pivot(mltools, index='Q1', columns='value', values='variable')
mltools = mltools.div(mltools.sum(axis=1), axis=0)
mltools.plot(kind='bar', stacked=True, mark_right=True, ax = ax[0], title = 'Managed ML Products', edgecolor='none', colormap=cm.get_cmap('Set3')).legend(loc='lower center', bbox_to_anchor=(0.5, -.5), ncol=3)
q37acols = [c for c in responses.columns if 'Q37_A' in c]
mltools = pd.melt(pythonall, id_vars = 'Q1',value_vars = q37acols)
mltools = mltools.groupby(['Q1','value']).count().reset_index()
mltools = pd.pivot(mltools, index='Q1', columns='value', values='variable')
mltools = mltools.div(mltools.sum(axis=1), axis=0)
mltools.plot(kind='bar', stacked=True, mark_right=True, ax = ax[1], title='Auto ML Tools', edgecolor='none', colormap=cm.get_cmap('coolwarm')).legend(loc='lower center', bbox_to_anchor=(.5, -.45), ncol=3)
plt.show()

<p style="font-size:18px;">When it comes to managed ML products, there is a slight trend toward lower usage with increasing age.</p>

**<p style="font-size:20px">ML Tools</p>**

In [None]:
agedict = dict(zip(['18-21', '22-24','25-29','30-34','35-39','40-44','45-49','50-54','55-59','60-69','70+'],range(11)))
q36acols = [c for c in pythonall.columns if 'Q36_A' in c]
q37acols = [c for c in pythonall.columns if 'Q37_A' in c]
q36melt = pd.melt(pythonall, id_vars = q36acols+['Q1'],value_vars = q37acols)
q36melt = q36melt.rename(columns={'value':'value1'})
q37melt = pd.melt(q36melt, id_vars = ['value1','Q1'], value_vars = q36acols)
q37melt = q37melt.dropna()
q37melt['value'] = q37melt.value.apply(lambda x: restring(x))
# Assign the filtered frame back; the original expression discarded its result
q37melt = q37melt[(q37melt.value1!="") & (q37melt.value!="")].copy()
q37melt['agemap'] = q37melt.Q1.map(agedict)
colorscale = px.colors.diverging.RdYlGn
figa1 = go.Figure(go.Parcats(dimensions=[{'label':'Age', 'values':q37melt.Q1, 'categoryorder': 'array', 'categoryarray':['18-21', '22-24','25-29','30-34','35-39','40-44','45-49','50-54','55-59','60-69','70+']}
                                         ,{'label':'ML Tools', 'values':q37melt.value}, {'label':'ML Tools Products', 'values': q37melt.value1}],
                            line={'color' : q37melt.agemap, 'colorscale': colorscale}, bundlecolors=True))
figa1 = go.FigureWidget(figa1)
figa1.update_layout()
widgets.HBox([figa1])
figa1.show()

**<p style="font-size:20px">Cloud Usage</p>**

In [None]:
q27acols = [c for c in responses.columns if 'Q27_A' in c]
cloudplatforms = pd.melt(pythonall, id_vars = 'Q1',value_vars = q27acols)
cloudplatforms = cloudplatforms.groupby(['Q1','value']).count().reset_index()
cloudplatforms = pd.pivot(cloudplatforms, index='Q1', columns='value', values='variable')
cloudplatforms = cloudplatforms.div(cloudplatforms.sum(axis=1), axis=0)
cloudplatforms.index.name='Age'
cloudplatforms.plot(kind='bar', stacked=True, mark_right=True, edgecolor='none' , colormap=cm.get_cmap('coolwarm'), figsize=(15,5)).legend(loc='right', bbox_to_anchor=(1.5, .5))
plt.show()

<p style="font-size:18px">There is a decrease in Cloud usage with age.</p>

**<p style="font-size:20px">BI Tools</p>**

In [None]:
q34acols = [c for c in responses.columns if 'Q34_A' in c]
bitools = pd.melt(pythonall, id_vars = 'Q1',value_vars = q34acols)
bitools = bitools.groupby(['Q1','value']).count().reset_index()
bitools = pd.pivot(bitools, index='Q1', columns='value', values='variable')
bitools = bitools.div(bitools.sum(axis=1), axis=0)
bitools.index.name='Age'
bitools.plot(kind='bar', stacked=True, mark_right=True, colormap=cm.get_cmap('tab20'), edgecolor='none', figsize = (15,5), title='BI Tools Usage').legend(loc='right', bbox_to_anchor=(1.5, .5))
plt.show()

<p style="font-size:18px">There is a decrease in usage of newer BI tools such as Tableau among the older respondents.</p>

# Conclusion

<p style="font-size:18px;">Looking at the distribution of data by age, we can see that while there are several similarities across the age-groups, some of the differences stand out. When it comes to industry bias, data scientists do not really face much discrimination in terms of salary as they get older, but the rate of growth in salary can be impacted. When it comes to application of core data science skills such as Computer Vision and Natural Language Processing, respondents across all age groups show similar usage and preferences for algorithms and tools. This is indicative of older respondents building on their foundational knowledge and upskilling in line with the industry. When it comes to managed ML products and Cloud computing preferences, there is a difference between the different age groups with younger people showing greater affinity for these tools.</p>
