<div align="center">
    <h1>Initial years in Data Science Journey</h1>
<img src="https://user-images.githubusercontent.com/48846576/102959431-1d7dac00-44a5-11eb-92fa-69f1e42bca45.jpg"  width="800" height="300">
<span>Photo by <a href="https://unsplash.com/@foxxmd?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">Matt Duncan</a> on <a href="https://unsplash.com/s/photos/start?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">Unsplash</a></span>
</div><br>
<div align="left">
    <h3>What do most people do in the initial years of their data science / machine learning journey?</h3>
    
In this report, I analyze what most people do during the initial years (1-3 years) of their data science career, especially experienced coders (with 5-20 years of coding experience) who are starting a new chapter in data science. As a data science enthusiast who fits this profile, my aim is to understand whether I am aligned with the rest of the community.
</div>
<br>
<div align="left">
I have segmented the participants into these categories for my analysis.
</div>
<br>
<div align="left">
    <ul>
<li>First few years (0-3 years) of using machine learning methods (everyone, across roles and coding experience)</li>
<li>Many years (5-20 years) of coding experience, but using machine learning methods for only a few (0-3) years</li>
<li>Many years (5-20 years) of coding experience, using machine learning methods for only a few (0-3) years, and currently in data science / machine learning jobs such as Data Analyst, Data Engineer, Data Scientist, Machine Learning Engineer, Research Scientist and Statistician</li>
</ul>
</div>
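
The three segments are nested: each one adds a condition to the previous one. A minimal sketch of that idea on a toy DataFrame, assuming the survey columns `Q15` (years using ML), `Q6` (coding experience) and `Q5` (job title); the rows and the names `ml_early`, `coding_long` and `ds_jobs` are made up for illustration only:

```python
import pandas as pd

# Toy rows standing in for survey responses (illustrative values, not real data)
toy = pd.DataFrame({
    "Q15": ["Under 1 year", "1-2 years", "5-10 years", "2-3 years", "1-2 years"],
    "Q6":  ["< 1 years", "10-20 years", "20+ years", "20+ years", "5-10 years"],
    "Q5":  ["Student", "Data Scientist", "Data Scientist", "Software Engineer", "Data Analyst"],
})

# Hypothetical label lists describing each condition
ml_early = ["Under 1 year", "1-2 years", "2-3 years"]
coding_long = ["5-10 years", "10-20 years", "20+ years"]
ds_jobs = ["Data Analyst", "Data Engineer", "Data Scientist",
           "Machine Learning Engineer", "Research Scientist", "Statistician"]

is_early = toy["Q15"].isin(ml_early)
is_coder = toy["Q6"].isin(coding_long)
is_ds = toy["Q5"].isin(ds_jobs)

# Each segment is a subset of the one before it
segment_sizes = {
    "segment_1": int(is_early.sum()),
    "segment_2": int((is_early & is_coder).sum()),
    "segment_3": int((is_early & is_coder & is_ds).sum()),
}
print(segment_sizes)
```

Because the masks are combined with `&`, the sizes can only shrink from one segment to the next, which is what the funnel plot in the next cell visualizes.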

In [None]:
import numpy as np
import pandas as pd
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
import os
sns.set()

survey_df = pd.read_csv("/kaggle/input/kaggle-survey-2020/kaggle_survey_2020_responses.csv" , low_memory=False, index_col=None)
master = survey_df.copy() # Make a copy
master = master.drop(0) # Drop first row containing questions
q15 = master[master['Q15'].notna()]

> The plot below shows the number of survey participants in each segment

In [None]:
import plotly.graph_objects as go
from plotly.offline import iplot, init_notebook_mode
init_notebook_mode(connected=True)

ds_roles = ['Data Analyst','Data Engineer','Data Scientist','Machine Learning Engineer','Research Scientist','Statistician']
# Answer labels that define each segment. isin() compares whole labels, whereas
# str.contains('20+ years') treats '+' as a regex quantifier and never matches the literal '20+ years'.
ml_early = ['Under 1 year', '1-2 years', '2-3 years']
coding_long = ['5-10 years', '10-20 years', '20+ years']
ff_df = q15[q15['Q15'].isin(ml_early)]
coding_df = ff_df[ff_df['Q6'].isin(coding_long)]
#May not be using machine learning methods but doing data science job - I do not use machine learning methods??
ds_df = coding_df[coding_df['Q5'].isin(ds_roles)]
# Under 1 year, 1-2 years, 2-3 years - 11402 (all)
# Segment sizes for the funnel are taken from the filtered frames instead of being hard-coded
#print(ds_df.shape)
fig = go.Figure(go.Funnel(
    y = ["Using ML methods for up to 3 years", "Using ML methods for up to 3 years, <br>have 5-20 years of coding experience", "Using ML methods for up to 3 years, <br> have 5-20 years of coding experience and <br>currently in data science / ML jobs"],
    x = [len(ff_df), len(coding_df), len(ds_df)],
    textposition = "inside",
    textinfo = "value+percent total",
    opacity = 0.85, marker = {"color": ["slategray", "slategray", "slategray" ],
    "line": {"width": [6, 3, 3, ], "color": ["plum", "springgreen", "indianred" ]}},
    connector = {"line": {"color": "black", "dash": "dot", "width": 3}})
    )
fig.update_traces(hoverinfo="x+y+percent total", selector=dict(type='funnel'))
#fig.update_traces(marker_colorbar_ticklabelposition='outside top', selector=dict(type='funnel'))
fig.update_layout(font_size=14)
fig.update_layout(font_color='#055')
fig.update_layout(plot_bgcolor='#9acee7')
fig.update_layout(title_text='Participant segments analyzed in this report',width=700,height=500)

iplot(fig)
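
A note on filtering by answer labels such as `20+ years`: `Series.str.contains` interprets its pattern as a regular expression by default, so the `+` acts as a quantifier rather than a literal character. The sketch below, on a small made-up Series, compares the default behaviour with `regex=False` and with `isin`:

```python
import pandas as pd

labels = pd.Series(["5-10 years", "10-20 years", "20+ years"])

# Default: the pattern is a regex, so "0+" means "one or more zeros"
as_regex = labels.str.contains("20+ years")
# regex=False treats the pattern as plain text
as_literal = labels.str.contains("20+ years", regex=False)
# isin() compares whole labels exactly
exact = labels.isin(["20+ years"])

print(as_regex.tolist())
print(as_literal.tolist())
print(exact.tolist())
```

The regex version matches `10-20 years` (it ends in "20 years") but not `20+ years`, while the literal and `isin` versions select only `20+ years`.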

> The next few plots show the overall participant demographics

In [None]:

# Where there are 300+ participants
country_list = ['India','United States of America','Brazil','Japan','Russia','United Kingdom of Great Britain and Northern Ireland','Nigeria','China','Germany','Turkey','Spain','France','Canada']
#lambda x : True if (x > 10 and x < 20) else False
master['country'] = master['Q3'].apply(lambda x: x if x in country_list else 'Rest of the world!')

q15 = master[master['Q15'].notna()]
q15_q5 = q15.groupby(['Q15','Q5']).count().reset_index()
q15_q5 = q15_q5[['Q15','Q5','Q1']]
q15_q5 = pd.pivot_table(q15_q5, index=['Q15'], columns='Q5', values='Q1', aggfunc=np.sum)
q15_q5.sort_index(level=0, ascending=True, inplace=True)
indx = ['Under 1 year', '1-2 years',  '2-3 years',  '3-4 years', '4-5 years', '5-10 years','10-20 years','20 or more years','I do not use machine learning methods']
cols = [ 'Data Analyst', 'Data Engineer', 'Data Scientist', 'Machine Learning Engineer', 'Research Scientist', 'Software Engineer', 'Statistician','Business Analyst', 'DBA/Database Engineer',  'Product/Project Manager', 'Other', 'Currently not employed', 'Student']
#q15_q5.index = pd.CategoricalIndex(q15_q5.index, categories= ['Under 1 year', '1-2 years',  '2-3 years',  '3-4 years', '4-5 years', '5-10 years','10-20 years','20 or more years','I do not use machine learning methods'])
#q15_q5.sort_index(level=0, inplace=True)
q15_q5 = q15_q5.reindex(index=indx, columns=cols)

q6_cols = [ '< 1 years', '1-2 years','3-5 years','5-10 years', '10-20 years', '20+ years']
q15_q6 = q15.groupby(['Q15','Q6']).count().reset_index()
q15_q6 = q15_q6[['Q15','Q6','Q1']]
q15_q6 = pd.pivot_table(q15_q6, index=['Q15'], columns='Q6', values='Q1', aggfunc=np.sum)
#q15_q6.sort_index(level=0, ascending=True, inplace=True)
q15_q6 = q15_q6.reindex(index=indx, columns=q6_cols)

f = plt.figure(figsize=(15,12))
gs = f.add_gridspec(2, 1)

with sns.axes_style("darkgrid"):
    sns.set_context("notebook", font_scale=1.2, rc={"lines.linewidth": 3.5})
    ax = f.add_subplot(gs[0, 0])
    #g1 = sns.barplot(x='over', y='runs', data=MI_runs);
    g1 = sns.heatmap(q15_q5, annot=True, fmt="g", cmap='Oranges')
    #g1.set_facecolor("#fdb913")
    #g1.set_yticks(range(0,60,10))
    #g1.set_ylim(0,60)
    g1.axes.set_title("Years using ML Methods vs Job Title",fontsize=20)
    g1.set_xlabel(None,fontsize=18)
    g1.set_ylabel(None,fontsize=18)        
    g1.set_xticklabels(g1.get_xticklabels(), rotation=80)
    g1.set_yticklabels(g1.get_yticklabels(), rotation=0)
    ax.hlines([3], *ax.get_xlim())
    ax.vlines([7], *ax.get_ylim())
    
with sns.axes_style("darkgrid"):
    sns.set_context("notebook", font_scale=1.2, rc={"lines.linewidth": 3.5})
    ax = f.add_subplot(gs[1, 0])
    #g1 = sns.barplot(x='over', y='runs', data=MI_runs);
    g1 = sns.heatmap(q15_q6, annot=True, fmt="g", cmap='Purples')
    #g1.set_facecolor("#fdb913")
    #g1.set_yticks(range(0,60,10))
    #g1.set_ylim(0,60)
    g1.axes.set_title("Years using ML Methods vs Coding Experience",fontsize=20)
    g1.set_xlabel(None,fontsize=18)
    g1.set_ylabel(None,fontsize=18)        
    g1.set_xticklabels(g1.get_xticklabels(), rotation=80)
    g1.set_yticklabels(g1.get_yticklabels(), rotation=0)
    ax.hlines([3], *ax.get_xlim())
    ax.vlines([3], *ax.get_ylim())
f.tight_layout(pad=3.0)

#q4_cols = ['Bachelor’s degree','Master’s degree','Doctoral degree', 'Professional degree', 'Some college/university study without earning a bachelor’s degree','No formal education past high school','I prefer not to answer']
q4_cols = ['Bachelor’s degree','Master’s degree','Doctoral degree', 'Professional degree', 'Some college/university study','No formal education past high school','I prefer not to answer']
q15_q4 = q15.groupby(['Q15','Q4']).count().reset_index()
q15_q4 = q15_q4[['Q15','Q4','Q1']]
q15_q4 = pd.pivot_table(q15_q4, index=['Q15'], columns='Q4', values='Q1', aggfunc=np.sum)
q15_q4 = q15_q4.rename(columns={'Some college/university study without earning a bachelor’s degree':'Some college/university study'})
#q15_q6.sort_index(level=0, ascending=True, inplace=True)
q15_q4 = q15_q4.reindex(index=indx, columns=q4_cols)

f = plt.figure(figsize=(12,12))
gs = f.add_gridspec(1, 1)    
with sns.axes_style("darkgrid"):
    sns.set_context("notebook", font_scale=1.2, rc={"lines.linewidth": 3.5})
    ax = f.add_subplot(gs[0,0])
    #g1 = sns.barplot(x='over', y='runs', data=MI_runs);
    g1 = sns.heatmap(q15_q4, annot=True, fmt="g", cmap='Oranges')
    #g1.set_facecolor("#fdb913")
    #g1.set_yticks(range(0,60,10))
    #g1.set_ylim(0,60)
    g1.axes.set_title("Years using ML Methods vs Degree",fontsize=20)
    g1.set_xlabel(None,fontsize=18)
    g1.set_ylabel(None,fontsize=18)        
    g1.set_xticklabels(g1.get_xticklabels(), rotation=90)
    g1.set_yticklabels(g1.get_yticklabels(), rotation=0)
    ax.hlines([3], *ax.get_xlim())
    ax.vlines([7], *ax.get_ylim())    
f.tight_layout(pad=3.0)


In [None]:
f = plt.figure(figsize=(10, 18))
gs = f.add_gridspec(1, 1)
with sns.axes_style("darkgrid"):
    #sns.set_context("notebook", font_scale=1.0, rc={"lines.linewidth": 3.5})    
    ax = f.add_subplot(gs[0, 0])    
    g1 = sns.countplot(y="Q3", data=master,palette="twilight_shifted_r", order = master['Q3'].value_counts().index )
    #ax.set_xticklabels(ax.get_xticklabels(), rotation=90)
    #ax.set_yticklabels(ax.get_yticklabels(), rotation=0)
    #g1.set_facecolor("#fdb913")
    g1.axes.set_title("Participants by Country",fontsize=20)
    g1.set_xlabel("Number of participants",fontsize=18)
    g1.set_ylabel(None,fontsize=20)
    ax.set_xticks(range(0,6000,500))

> Countries with 300+ participants
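
The country list used for this grouping is hard-coded in an earlier cell. As an alternative, the same grouping could be derived from the counts themselves. This is only a sketch on made-up data; `min_participants` and `frequent` are hypothetical names, and the threshold is scaled down to fit the toy example:

```python
import pandas as pd

# Toy country column (illustrative values, not survey data)
responses = pd.Series(["India"] * 5 + ["Japan"] * 3 + ["Chile"] * 1 + ["Peru"] * 2, name="Q3")

min_participants = 3  # the report uses 300 on the real data
counts = responses.value_counts()
frequent = counts[counts >= min_participants].index

# Keep frequent countries as they are and lump everything else together
grouped = responses.where(responses.isin(frequent), "Rest of the world!")
print(grouped.value_counts().to_dict())
```

Deriving the list this way keeps the grouping in sync with the data if the threshold or the input file changes.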

In [None]:
f = plt.figure(figsize=(10, 8))
gs = f.add_gridspec(1, 1)
with sns.axes_style("darkgrid"):
    #sns.set_context("notebook", font_scale=1.0, rc={"lines.linewidth": 3.5})    
    ax = f.add_subplot(gs[0, 0])    
    g1 = sns.countplot(y="country", data=master,palette="Paired", order = master['country'].value_counts().index )
    #ax.set_xticklabels(ax.get_xticklabels(), rotation=90)
    #ax.set_yticklabels(ax.get_yticklabels(), rotation=0)
    #g1.set_facecolor("#fdb913")
    g1.axes.set_title("Participants by Country",fontsize=20)
    g1.set_xlabel("Number of participants",fontsize=18)
    g1.set_ylabel(None,fontsize=20)
    ax.set_xticks(range(0,7000,500))

In [None]:
import json
import random as rd

def generate_color(num):
    clr_list =[]
    for i in range(0, num):
        clr = "rgba({},{},{},{})".format(rd.randint(0,255),rd.randint(0,255),rd.randint(0,255),round(rd.uniform(0, 1),1))
        clr_list.append(clr)
    return clr_list

class SData:
    def __init__(self, df):
        self._nodes = None
        self._data = None
        self._df = df
        self._tup_list = []
        self._json_data = None
    def build_nodes(self,attr1, attr2):
        df =  self._df.groupby([attr1,attr2]).sum().reset_index()
        if self._nodes is None:
            self._nodes  = list(df[attr1].unique())
            #self._nodes.append(self._nodes.pop(self._nodes.index('Other-Countries'))) #Move 'Other' in countries to end of the countries list
        else: 
            self._nodes  += list(df[attr1].unique())
        self._nodes  += list(df[attr2].unique())
        
        self._tup_list.append((attr1,attr2))
        
        
    def get_nodes(self):
        return self._nodes
    
    def create_json(self):
        
        #Step 1 - Create Nodes
        node_list = []
        #i=0 
        #clr_list = generate_color(70)
        for item in self._nodes:
            node = {}
            node['label'] = item
            node['visible'] = True
            #node['color'] = clr_list[i]
            node['color'] = "rgba({},{},{},{})".format(rd.randint(100,200),rd.randint(0,60),rd.randint(0,60),round(rd.uniform(0, 1),1))
            #i+=1
            node_list.append(node)
        
        #Step 2 - Create Links
        links_list = []
        #self._tup_list = [('country','degree'),('degree','job_title'),('job_title','years_coding'),('years_coding','years_using_ml')]
        for tup in self._tup_list:
            #print(tup[0], tup[1])
            df =  self._df.groupby([tup[0],tup[1]]).sum().reset_index()
            #print(df.shape)
            for index, row in df.iterrows():
                link = {}
                source = self._nodes.index(row[tup[0]])
                target = self._nodes.index(row[tup[1]])
                count = row['participants']
                link['source'] = source
                link['target'] = target
                link['value'] = count
                link['label'] = '{}-{}'.format(row[tup[0]],row[tup[1]])
                link['color'] = "rgba({},{},{},{})".format(rd.randint(0,200),rd.randint(0,220),rd.randint(0,230),round(rd.uniform(0, 1),1))
                links_list.append(link)
                

        self._data = {}
        self._data['layout'] = {
            "title": "First few years in Data Science",
            "width": 1600,
            "height": 1000
          }
        self._data['data'] = [
            {
              "type": "sankey",
              "domain": {
                "x": [
                  0,
                  1
                ],
                "y": [
                  0,
                  1
                ]
              },
              "nodes": node_list ,
            "links" : links_list}]
        
        self._json_data = json.dumps(self._data)
        #print(json_data)
        return self._json_data
    
    def get_sankey_data(self):
        data = json.loads(self._json_data)
        label = []
        color = []
        for x in data['data'][0]['nodes']:
            label.append(x['label'])
            color.append(x['color'])

        linkSource = []
        linkValue = []    
        linkTarget = []
        linkLabel = []
        linkColor = []
        for x in data['data'][0]['links']:
            linkLabel.append(x['label'])
            linkSource.append(x['source'])
            linkValue.append(x['value'])
            linkTarget.append(x['target'])  
            linkColor.append(x['color'])              

        data_trace = dict(
            type='sankey',
            domain = dict(
              x =  [0,1],
              y =  [0,1]
            ),
            orientation = "h",
            valueformat = ".0f",
            node = dict(
              pad = 5,
              thickness = 30,
              line = dict(
                color = "black",
                width = 0.5
              ),
              label =  label,
              color = color
            ),
            link = dict(
              source = linkSource,
              target = linkTarget,
              value = linkValue,
              label =  linkLabel,
            color = linkColor
          )
        )    

        layout =  dict(
            #title = "First few years in Data Science",
            height = 800,
            width = 800,
            font = dict(
              size = 10
            ),    
        )

        fig = dict(data=[data_trace], layout=layout)
        return fig
    
def build_menus():
    labels = list(df_group['years_using_ml'].unique())
    btn_lst = []
    
    for label in labels:
        d = SData(df_group[df_group['years_using_ml'] == label])
        d.build_nodes('years_using_ml','degree')
        d.build_nodes('degree','job_title')    
        d.build_nodes('job_title','country')        
        jsn = d.create_json()
        data = d.get_sankey_data()    
        btn_dict = dict(label = label,
                       method = "animate",
                       args = [data])
        btn_lst.append(btn_dict)

    d = SData(df_group) # All values
    d.build_nodes('years_using_ml','degree')
    d.build_nodes('degree','job_title')    
    d.build_nodes('job_title','country')        
    jsn = d.create_json()
    data = d.get_sankey_data()    
    btn_dict = dict(label = "All",
                   method = "animate",
                   args = [data])
    btn_lst.append(btn_dict)
        
    updatemenu = [dict(buttons = btn_lst, pad={"r": 10, "t": 10},
            showactive=True,
            x=0.11,
            xanchor="right",
            y=1.1,
            yanchor="top")]
    
    return updatemenu


> The demographics presented as an interactive Sankey diagram
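
A Sankey trace describes its flows by position: every distinct label becomes a node, and each link stores the index of its source node, the index of its target node and a value. The following is a minimal sketch of that mapping on made-up counts, independent of the `SData` class above; the column names mirror the renamed columns used in this notebook:

```python
import pandas as pd

# Toy grouped counts (illustrative values, not survey data)
flows = pd.DataFrame({
    "degree": ["Bachelor", "Bachelor", "Master"],
    "job_title": ["Data Analyst", "Data Scientist", "Data Scientist"],
    "participants": [4, 2, 5],
})

# Every distinct label becomes a node; links refer to nodes by position
nodes = list(flows["degree"].unique()) + list(flows["job_title"].unique())
node_index = {label: i for i, label in enumerate(nodes)}

sources = [node_index[d] for d in flows["degree"]]
targets = [node_index[j] for j in flows["job_title"]]
values = flows["participants"].tolist()

# Plain-dict trace in the same shape that get_sankey_data() returns
sankey_trace = dict(
    type="sankey",
    node=dict(label=nodes),
    link=dict(source=sources, target=targets, value=values),
)
print(sankey_trace["link"])
```

A dictionary lookup is used here instead of `list.index`, which avoids a linear scan per row; the result is the same as long as node labels are unique.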

In [None]:
# Count participants for every (years using ML, degree, job title, country) combination
#Q15 - years using ML methods
#Q4 - degree
#Q5 - Job
df_group = master.groupby(['Q15','Q4','Q5','country']).count().reset_index().sort_values('Q1',ascending=False)
df_group = df_group[['Q15','Q4','Q5','Q1','country']]
df_group = df_group.rename(columns={'Q1':'participants','Q4':'degree','Q5':'job_title','Q15':'years_using_ml'})
#master['Q15'].value_counts()
df_group

d = SData(df_group)
d.build_nodes('years_using_ml','degree')
d.build_nodes('degree','job_title')    
d.build_nodes('job_title','country')        
d.create_json()
all_data = d.get_sankey_data()
updatemenus = build_menus()
fig = go.Figure(all_data)
fig.update_layout(font_size=15, height = 800,
    width = 800,updatemenus=updatemenus)
fig.show()

In [None]:
ds_roles = ['Data Analyst','Data Engineer','Data Scientist','Machine Learning Engineer','Research Scientist','Statistician']
def get_chart_df(df, qn_substr):
    """Return answer counts and percentages for one question, ready for charting.

    Call this with ff_df, coding_df or ds_df. Percentages are relative to the total
    number of selected answers across all columns of the question."""
    cols = [qn for qn in list(master.columns) if qn_substr in qn]
    # One value_counts per answer column, stacked into (column, answer) -> count
    series = df[cols].apply(lambda x: x.value_counts()).T.stack()
    frm = series.to_frame().reset_index()
    frm.drop(frm.columns[0], axis=1, inplace=True)  # drop the column-name level
    frm.rename(columns=({'level_1':'answer',0:'count'}),inplace=True)
    frm['percent'] = round((frm['count'] / frm['count'].sum()) * 100,1)
    return frm

def get_chart_percent(df, qn_substr):
    """Same result as get_chart_df; kept under its own name so code that calls it keeps working."""
    return get_chart_df(df, qn_substr)

In [None]:
from plotly.subplots import make_subplots

def get_fig(qn_substr, chart_title, x=None, vertical_spacing=0.2, shared_xaxes=True):
    """Grouped bar charts (counts on top, percentages below) comparing the three segments."""
    orientation='v'
    ff_data = get_chart_df(ff_df, qn_substr)
    if x is not None:
        # Custom labels are listed in get_chart_df's unsorted answer order, so keep that order
        x_labels = x
    else:
        ff_data = ff_data.sort_values("count",ascending=False)
        x_labels = ff_data['answer']

    def align(df):
        # Reorder a segment's rows to ff_data's answer order so each bar sits under its own label
        seg = get_chart_df(df, qn_substr).groupby('answer', sort=False)[['count','percent']].sum()
        return seg.reindex(ff_data['answer']).reset_index()

    fig = make_subplots(rows=2, cols=1, vertical_spacing=vertical_spacing,  shared_xaxes=shared_xaxes)

    bar1 = go.Bar(
        x=x_labels,
        y=ff_data['count'],
        orientation=orientation,        
        name='0-3 years of DS/ML Experience',
        marker=dict(color='#484a8a'), legendgroup='lg1'
    )

    coding_data = align(coding_df)

    bar2 = go.Bar(
        x=x_labels,
        y=coding_data['count'],
        orientation=orientation,
        name='0-3 years of DS/ML Experience & <br> 5+ years of coding experience',
        marker=dict(color='#f5ab3d'), legendgroup='lg2'
    )

    ds_data = align(ds_df)
    bar3= go.Bar(
        x=x_labels,
        y=ds_data['count'],
        orientation=orientation,        
        name='0-3 years of DS/ML Experience & <br> 5+ years of coding experience &<br> Currently in DS/ML Jobs',
        marker=dict(color='#860f8c'), legendgroup='lg3'
    )

    fig.add_trace(bar1,row=1, col=1)
    fig.add_trace(bar2,row=1, col=1)
    fig.add_trace(bar3,row=1, col=1)
    fig.update_layout(height=600, width=650)  # the title is set from chart_title below
    fig.update_layout(barmode='group')  

    fig.add_trace(go.Bar(name='Segment 1',orientation=orientation,x=x_labels, y=ff_data['percent'],marker=dict(color='#484a8a'),showlegend=False, legendgroup='lg1'),row=2, col=1)
    fig.add_trace(go.Bar(name='Segment 2', orientation=orientation, x=x_labels, y=coding_data['percent'],marker=dict(color='#f5ab3d'),showlegend=False, legendgroup='lg2'),row=2, col=1)
    fig.add_trace(go.Bar(name='Segment 3', orientation=orientation, x=x_labels, y=ds_data['percent'],marker=dict(color='#860f8c'),showlegend=False, legendgroup='lg3'),row=2, col=1)
    fig.update_layout(
        title=chart_title,
    #    legend_title="Legend Title",
        font=dict(
            family="Computer Modern",
            size=14,
            color="#593232"
        )
    )
    fig['layout']['yaxis']['title']='Number of users'
    fig['layout']['yaxis2']['title']='Percentage of users'
    fig['layout']['yaxis2']['ticksuffix'] = '%'
    fig['layout']['xaxis']['tickangle']= 45
    fig['layout']['xaxis2']['tickangle']= 45
    
    return fig

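## Q7 - What programming languages do you use on a regular basis?
Python accounts for around 30% of all language selections in the segments. Q7 is a multi-select question, so the percentages plotted here are shares of selected answers rather than shares of participants.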
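
For multi-select questions, a share of selections and a share of participants can differ considerably. The sketch below illustrates the difference on a made-up frame that follows the survey's one-column-per-option layout (`Q7_Part_1`, `Q7_Part_2`, ...); the values are purely illustrative:

```python
import pandas as pd

# Toy multi-select answers: one column per option, None when not selected
toy = pd.DataFrame({
    "Q7_Part_1": ["Python", "Python", "Python", None],
    "Q7_Part_2": ["R", None, None, "R"],
    "Q7_Part_3": ["SQL", "SQL", None, None],
})

counts = toy.count()  # selections per option (non-null cells)

# Denominator = all selections (what the 'percent' column in get_chart_df uses)
share_of_selections = (counts / counts.sum() * 100).round(1)
# Denominator = number of respondents
share_of_participants = (counts / len(toy) * 100).round(1)

print(share_of_selections.to_dict())
print(share_of_participants.to_dict())
```

In this toy example Python makes up about 43% of selections but is chosen by 75% of respondents, so the two figures should not be read interchangeably.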

In [None]:
from plotly.subplots import make_subplots

ff_data = get_chart_df(ff_df, 'Q7').sort_values("count",ascending=False)
#plot_data.sort_values("count",ascending=False)
fig = make_subplots(rows=2, cols=1)

bar1 = go.Bar(
    x=ff_data['answer'],
    y=ff_data['count'],
    name='0-3 years of DS/ML Experience',
    marker=dict(color='#484a8a'), legendgroup='lg1'
)

coding_data = get_chart_df(coding_df, 'Q7').sort_values("count",ascending=False)

bar2 = go.Bar(
    x=coding_data['answer'],
    y=coding_data['count'],
    name='0-3 years of DS/ML Experience & <br> 5+ years of coding experience',
    marker=dict(color='#f5ab3d'), legendgroup='lg2'
)

ds_data = get_chart_df(ds_df, 'Q7').sort_values("count",ascending=False)
bar3= go.Bar(
    x=ds_data['answer'],
    y=ds_data['count'],
    name='0-3 years of DS/ML Experience & <br> 5+ years of coding experience &<br> Currently in DS/ML Jobs',
    marker=dict(color='#860f8c'), legendgroup='lg3'
)

fig.add_trace(bar1,row=1, col=1)

fig.add_trace(bar2,row=1, col=1)
fig.add_trace(bar3,row=1, col=1)
fig.update_layout(height=600, width=600, title_text=" programming languages ")
fig.update_layout(barmode='group')  

fig.add_trace(go.Bar(name='Segment 1', x=ff_data['answer'], y=ff_data['percent'],marker=dict(color='#484a8a'),showlegend=False, legendgroup='lg1'),row=2, col=1)
fig.add_trace(go.Bar(name='Segment 2', x=coding_data['answer'], y=coding_data['percent'],marker=dict(color='#f5ab3d'),showlegend=False, legendgroup='lg2'),row=2, col=1)
fig.add_trace(go.Bar(name='Segment 3', x=ds_data['answer'], y=ds_data['percent'],marker=dict(color='#860f8c'),showlegend=False, legendgroup='lg3'),row=2, col=1)
fig.update_layout(
    title="Programming Languages",
#    legend_title="Legend Title",
    font=dict(
        family="Computer Modern",
        size=14,
        color="#593232"
    )
)


fig['layout']['yaxis']['title']='Number of Participants'
fig['layout']['yaxis2']['title']='Percentage of Participants'
fig['layout']['yaxis2']['ticksuffix'] = '%'
fig['layout']['xaxis']['tickangle']= 45
fig['layout']['xaxis2']['tickangle']= 45

iplot(fig);


## Q8 - What programming language would you recommend an aspiring data scientist to learn first?

As expected, Python is recommended by 80% of the participants.

In [None]:
iplot(get_fig('Q8', 'What programming language to learn first?',vertical_spacing=0.04))

## Q9 - Which of the following integrated development environments (IDE's) do you use on a regular basis?

In [None]:
lst = ['Jupyter',
 'RStudio',
 'Visual Studio',
 'VSCode',
 'PyCharm',
 'Spyder',
 'Notepad++',
 'Sublime Text',
 'Vim / Emacs',
 'MATLAB',
 'None',
 'Other']
fig = get_fig('Q9', 'Integrated Development Environments',x=lst, vertical_spacing=0.04)
iplot(fig)

## Q10 - Which of the following hosted notebook products do you use on a regular basis? 

In [None]:
x_labels =  ['Kaggle',
   'Colab',
   'Azure',
   'Paperspace / Gradient',
   'Binder / JupyterHub',
   'Code Ocean',
   'IBM Watson Studio',
   'Amazon Sagemaker',
   'Amazon EMR',
   'Google Cloud AI',
   'Google Cloud Datalab',
   'Databricks Collaborative',
   'None',
   'Other']
iplot(get_fig('Q10', 'Hosted Notebook Products', x=x_labels, vertical_spacing=0.04))

## Q11 - What type of computing platform do you use most often for your data science projects?

In [None]:
x_labels =['Computer / Laptop',
 'Cloud platform',
 'Deep learning workstation',
 'None',
 'Other']
iplot(get_fig('Q11', 'Computing Platform Used',x=x_labels, vertical_spacing=0.04))

## Q12 - Which types of specialized hardware do you use on a regular basis?

In [None]:
iplot(get_fig('Q12', 'Specialized hardware',vertical_spacing=0.04))

## Q13 - Approximately how many times have you used a TPU (tensor processing unit)?

In [None]:
iplot(get_fig('Q13', 'TPU Usage', vertical_spacing=0.04))

### Q14 - What data visualization libraries or tools do you use on a regular basis?

In [None]:
x_labels = [x.strip() for x in list(get_chart_df(ff_df,'Q14')['answer'])]
iplot(get_fig('Q14', 'Visualization libraries',x=x_labels,vertical_spacing=0.04))

### Q16 - Which of the following machine learning frameworks do you use on a regular basis? 

In [None]:
x_labels = [x.strip() for x in list(get_chart_df(ff_df,'Q16')['answer'])]
iplot(get_fig('Q16', 'Machine learning frameworks',x=x_labels,vertical_spacing=0.04))

### Q17 - Which of the following ML algorithms do you use on a regular basis? 

In [None]:
x_labels = [x.strip() for x in list(get_chart_df(ff_df,'Q17')['answer'])]
iplot(get_fig('Q17', 'Machine learning algorithms', x=x_labels,vertical_spacing=0.04))

### Q18 - Which categories of computer vision methods do you use on a regular basis? 

In [None]:
x_labels = ['General purpose image/video tools',
 'Image segmentation methods',
 'Object detection methods',
 'Image classification and <br>other general purpose networks',
 'Generative Networks',
 'None',
 'Other']

fig = get_fig('Q18', 'Computer vision methods ', x=x_labels,vertical_spacing=0.9,shared_xaxes=True)
fig.update_layout(yaxis2={'domain': [0, 0.50]},yaxis={'domain': [0.51, 1]})
iplot(fig)


### Q19 - Which of the following natural language processing (NLP) methods do you use on a regular basis?

In [None]:
x_labels = [x.strip() for x in list(get_chart_df(ff_df,'Q19')['answer'])]

fig = get_fig('Q19', 'Natural Language Processing (NLP) methods', x=x_labels,vertical_spacing=1.0)
fig.update_layout(yaxis2={'domain': [0, 0.49]},yaxis={'domain': [0.55, 1]})
iplot(fig)

### Q20 - What is the size of the company where you are employed?

In [None]:
x_labels = [x.strip() for x in list(get_chart_df(ff_df,'Q20')['answer'])]
fig = get_fig('Q20', 'Company size', x=x_labels,vertical_spacing=1.0)
fig.update_layout(yaxis2={'domain': [0, 0.50]},yaxis={'domain': [0.51, 1]})
iplot(fig)

### Q21 - Approximately how many individuals are responsible for data science workloads at your place of business?

In [None]:
x_labels = [x.strip() for x in list(get_chart_df(ff_df,'Q21')['answer'])]
fig = get_fig('Q21', 'Data Science Workload', x=x_labels,vertical_spacing=1.0)
fig.update_layout(yaxis2={'domain': [0, 0.50]},yaxis={'domain': [0.51, 1]})
iplot(fig)

### Q22 - Does your current employer incorporate machine learning methods into their business?


In [None]:
def get_horizontal_bar(x_data, y_data, colors):
    fig = go.Figure()

    for i in range(0, len(x_data[0])):
        for xd, yd in zip(x_data, y_data):
            fig.add_trace(go.Bar(
                x=[xd[i]], y=[yd],
                orientation='h',
                marker=dict(
                    color=colors[i],
                    line=dict(color='rgb(248, 248, 249)', width=1)
                )
            ))

    fig.update_layout(
        xaxis=dict(
            showgrid=False,
            showline=False,
            showticklabels=False,
            zeroline=False,
            domain=[0.15, 1]
        ),
        yaxis=dict(
            showgrid=False,
            showline=False,
            showticklabels=False,
            zeroline=False,
        ),
        barmode='stack',
        paper_bgcolor='rgb(248, 255, 255)',
        plot_bgcolor='rgb(248, 255, 255)',
        margin=dict(l=120, r=10, t=140, b=80),
        showlegend=False,
    )

    annotations = []

    for yd, xd in zip(y_data, x_data):
        # labeling the y-axis
        annotations.append(dict(xref='paper', yref='y',
                                x=0.14, y=yd,
                                xanchor='right',
                                text=str(yd),
                                font=dict(family='Arial', size=14,
                                          color='rgb(67, 67, 67)'),
                                showarrow=False, align='right'))
        # labeling the first percentage of each bar (x_axis)
        annotations.append(dict(xref='x', yref='y',
                                x=xd[0] / 2, y=yd,
                                text=str(xd[0]) + '%',
                                font=dict(family='Arial', size=14,
                                          color='rgb(248, 248, 255)'),
                                showarrow=False))
        # labeling the first Likert scale (on the top)
        if yd == y_data[-1]:
            annotations.append(dict(xref='x', yref='paper',
                                    x=xd[0] / 2, y=1.1,
                                    text=top_labels[0],
                                    font=dict(family='Arial', size=14,
                                              color='rgb(67, 67, 67)'),
                                    showarrow=False))
        space = xd[0]
        for i in range(1, len(xd)):
            # labeling the rest of percentages for each bar (x_axis)
            annotations.append(dict(xref='x', yref='y',
                                    x=space + (xd[i]/2), y=yd,
                                    text=str(xd[i]) + '%',
                                    font=dict(family='Arial', size=14,
                                              color='rgb(248, 248, 255)'),
                                    showarrow=False))
            # labeling the Likert scale
            if yd == y_data[-1]:
                annotations.append(dict(xref='x', yref='paper',
                                        x=space + (xd[i]/2), y=1.1,
                                        text=top_labels[i],
                                        font=dict(family='Arial', size=14,
                                                  color='rgb(67, 67, 67)'),
                                        showarrow=False))
            space += xd[i]

    fig.update_layout(annotations=annotations)
    
    return fig

In [None]:
import math
def get_h_bar(y_data, df1, df2, df3, x_data):
    fig = go.Figure()

    fig.add_trace(go.Bar(
        y=y_data,
        x=list(df1['percent']),
        name='0-3 years of DS/ML Experience',
        legendgroup='lg1',
        orientation='h',
#        textposition='inside',
#        hoverinfo='none',
        marker=dict(color='#484a8a',
            line=dict(color='#484a8a', width=1)
        )
    ))

    fig.add_trace(go.Bar(
        y=y_data,
        x=list(df2['percent']),
        name='0-3 years of DS/ML Experience & <br> 5+ years of coding experience',
        orientation='h',
        legendgroup='lg2',    
#        textposition='inside',        
#        hoverinfo='none',        
        marker=dict(
            color='#f5ab3d',
            line=dict(color='#f5ab3d', width=1)
        )
    ))
    fig.add_trace(go.Bar(
        y=y_data,
        x=list(df3['percent']),
        name='0-3 years of DS/ML Experience & <br> 5+ years of coding experience &<br> Currently in DS/ML Jobs',
        orientation='h',
        legendgroup='lg3', 
#        textposition='inside',        
#        hoverinfo='none',        
        marker=dict(
            color='#860f8c',
            line=dict(color='#860f8c', width=1)
        )
    ))

    fig.update_layout(
            xaxis=dict(
                showgrid=False,
                showline=False,
                showticklabels=True,
                zeroline=False,
                domain=[0.15, 1]
            ),
            yaxis=dict(
                showgrid=False,
                showline=False,
                showticklabels=True,
                zeroline=False,
            ),
            barmode='stack',
            paper_bgcolor='rgb(248, 255, 255)',
            plot_bgcolor='rgb(248, 255, 255)',
            margin=dict(l=120, r=10, t=40, b=20),
            showlegend=True,
        )

    # Build one percentage label per stacked segment, placed near the segment's centre.
    # x_data holds one row per answer, e.g. [25.4, 25.9, 26.3] for the three groups.
    annotations = []
    for yd, xd in zip(y_data, x_data):
        space = 0  # running total of the segment widths already placed
        for i in range(len(xd)):
            annotations.append(dict(xref='x', yref='y',
                                    x=math.ceil(space + (xd[i] / 2)), y=yd,
                                    text=str(xd[i]) + '%',
                                    font=dict(family='Arial', size=14,
                                              color='rgb(248, 248, 255)'),
                                    showarrow=False))
            space += xd[i]
    # The labels are currently disabled; uncomment the next line to draw them.
    #fig.update_layout(annotations=annotations)
    return fig
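
The label positions computed in the loop above are the midpoints of the stacked segments: each segment starts where the previous one ends. A minimal sketch of that arithmetic on its own, without Plotly, is shown below. The helper name `segment_midpoints` is hypothetical and not part of the notebook, and unlike the notebook code it does not round the result up with `math.ceil`.

```python
from itertools import accumulate


def segment_midpoints(widths):
    """Return the x position at the centre of each stacked segment."""
    # Right edge of every segment is the running total of the widths
    right_edges = list(accumulate(widths))
    # Left edge of a segment is the right edge of the previous one (0 for the first)
    left_edges = [0] + right_edges[:-1]
    return [left + width / 2 for left, width in zip(left_edges, widths)]


# One illustrative row of x_data: the percentages of the three groups for one answer
row = [20.0, 30.0, 10.0]
print(segment_midpoints(row))
```

Keeping the unrounded midpoint places the label exactly in the centre of its segment, which matters most for narrow segments.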

In [None]:
from functools import reduce
qn = 'Q22'
df1 = get_chart_df(ff_df,qn)
df2 = get_chart_df(coding_df,qn)
df3 = get_chart_df(ds_df,qn)
dfs = [df1, df2, df3]
df_final = reduce(lambda left,right: pd.merge(left,right,on='answer'), dfs)
arr = df_final[['percent_x','percent_y','percent']].to_numpy()
x_data = arr.tolist()

y_data = ['We are exploring ML methods <br>(and may one day put a model into production)',
       'No (we do not use ML methods)',
       'We recently started using ML methods <br>(i.e., models in production for less than 2 years)',
       'I do not know',
       'We have well established ML methods <br>(i.e., models in production for more than 2 years)',
       'We use ML methods for generating insights <br>(but do not put working models into production)']


fig = get_h_bar(y_data,df1, df2, df3, x_data)
fig.update_layout(title='Does your current employer incorporate <br>machine learning methods into their business?')
fig['layout']['xaxis']['ticksuffix'] = '%'
fig['layout']['xaxis']['tickangle']= 20

iplot(fig)
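
The `reduce` plus `pd.merge` pattern used in this cell is what produces the `percent_x`, `percent_y` and `percent` column names. The sketch below reproduces it on small made-up tables. It assumes that `get_chart_df` returns one row per answer with `answer` and `percent` columns; the frames `g1`, `g2`, `g3` and their values are purely illustrative.

```python
from functools import reduce

import pandas as pd

# Hypothetical per-group tables shaped like the assumed output of get_chart_df
g1 = pd.DataFrame({'answer': ['Yes', 'No', 'Maybe'], 'percent': [50.0, 30.0, 20.0]})
g2 = pd.DataFrame({'answer': ['Yes', 'No'], 'percent': [60.0, 40.0]})
g3 = pd.DataFrame({'answer': ['No', 'Yes'], 'percent': [25.0, 75.0]})

# The first merge renames the clashing 'percent' columns to percent_x / percent_y;
# the second merge has no clash left, so the third group's column stays 'percent'.
merged = reduce(lambda left, right: pd.merge(left, right, on='answer'), [g1, g2, g3])
rows = merged[['percent_x', 'percent_y', 'percent']].to_numpy().tolist()

print(list(merged.columns))
print(merged['answer'].tolist())
print(rows)
```

Because `pd.merge` defaults to an inner join, an answer that is missing from any one group (here `'Maybe'`) drops out of the merged table, so it is worth checking that the number of rows still matches the hand-written `y_data` labels.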

### Q23 - Select any activities that make up an important part of your role at work

In [None]:
import plotly.graph_objects as go

qn = 'Q23'
df1 = get_chart_df(ff_df,qn)
df2 = get_chart_df(coding_df,qn)
df3 = get_chart_df(ds_df,qn)
dfs = [df1, df2, df3]
df_final = reduce(lambda left,right: pd.merge(left,right,on='answer'), dfs)
arr = df_final[['percent_x','percent_y','percent']].to_numpy()
df_final
x_data = arr.tolist()

y_data = ['Analyze and understand data to <br> influence product or business decisions',
       'Build and/or run the data infrastructure that my business <br> uses for storing, analyzing, and operationalizing data',
       'Build prototypes to explore applying <br>machine learning to new areas',
       'Build and/or run a machine learning service <br>that operationally improves my product or workflows',
       'Experimentation and iteration to <br> improve existing ML models',
       'Do research that advances the <br>state of the art of machine learning',
       'None of these activities are an <br>important part of my role at work',
       'Other']


fig = get_h_bar(y_data,df1, df2, df3, x_data)
fig.update_layout(title='Select any activities that make up an important part of your role at work')
fig['layout']['xaxis']['ticksuffix'] = '%'
fig['layout']['xaxis']['tickangle']= 20
iplot(fig)

### Q24 - What is your current yearly compensation (approximate $USD)?


In [None]:
x_labels = [x.strip() for x in list(get_chart_df(ff_df,'Q24')['answer'])]
fig = get_fig('Q24', 'Compensation', x=x_labels,vertical_spacing=1.0)
fig.update_layout(yaxis2={'domain': [0, 0.45]},yaxis={'domain': [0.51, 1]})
iplot(fig)

In [None]:
# Share of respondents in each compensation band, one trace per group.
# histnorm='percent' normalises every trace to its own group size.
fig = go.Figure()
fig.add_trace(go.Histogram(x=ff_df['Q24'], y=ff_df['Q1'], histfunc="count", 
                                   name='0-3 years of DS/ML Experience',
                        marker=dict(color='#484a8a'), legendgroup='lg1', histnorm='percent'))
fig.add_trace(go.Histogram(x=coding_df['Q24'], y=coding_df['Q1'], histfunc="count", 
                            name='0-3 years of DS/ML Experience & <br> 5+ years of coding experience',
        marker=dict(color='#f5ab3d'), legendgroup='lg2', histnorm='percent'))
fig.add_trace(go.Histogram(x=ds_df['Q24'], y=ds_df['Q1'], histfunc="count",
                           name='0-3 years of DS/ML Experience & <br> 5+ years of coding experience &<br> Currently in DS/ML Jobs',
        marker=dict(color='#860f8c'), legendgroup='lg3', histnorm='percent'))

# Stack the three groups' bars within each compensation band
fig.update_layout(barmode='stack')
fig.update_layout(title='Yearly compensation Stacked Histogram', paper_bgcolor='rgb(248, 255, 255)', plot_bgcolor='rgb(248, 248, 255)')
fig['layout']['yaxis']['ticksuffix'] = '%'
fig['layout']['xaxis']['tickangle']= 45
iplot(fig)
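
With `histnorm='percent'`, each histogram trace is normalised on its own, so the bars of a single group add up to 100. The plain-Python sketch below shows the equivalent calculation for two made-up groups; the function name `percent_per_band` and the sample values are hypothetical.

```python
from collections import Counter


def percent_per_band(values):
    """Share (in %) of responses that fall in each band, for one group."""
    counts = Counter(values)
    total = len(values)
    return {band: 100 * n / total for band, n in counts.items()}


# Illustrative compensation bands for two groups of four respondents each
group_a = ['0-999', '0-999', '0-999', '1,000-1,999']
group_b = ['0-999', '1,000-1,999', '1,000-1,999', '1,000-1,999']

print(percent_per_band(group_a))
print(percent_per_band(group_b))
```

Since every group sums to 100 separately, the stacked bars across all bands add up to 100 times the number of groups, so the chart is best read as a comparison of shapes rather than of absolute head counts.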

### Q25 - Approximately how much money have you (or your team) spent on machine learning and/or cloud computing services at home (or at work) in the past 5 years (approximate $USD)?


In [None]:
x_labels = [x.strip() for x in list(get_chart_df(ff_df,'Q25')['answer'])]
x_labels = ['0', '100-999', '1000-9,999', '1-99', '10,000-99,999', '100,000 or more']
fig = get_fig('Q25', 'Money spent in the past 5 years (USD)', x=x_labels,vertical_spacing=1.0)
fig.update_layout(yaxis2={'domain': [0, 0.50]},yaxis={'domain': [0.51, 1]})
iplot(fig)
#print(x_labels)

### Q26 - Which of the following cloud computing platforms do you use on a regular basis?

In [None]:
x_labels = [x.strip() for x in list(get_chart_df(ff_df,'Q26')['answer'])]
fig = get_fig('Q26', 'Cloud computing platforms', x=x_labels,vertical_spacing=1.0)
fig.update_layout(yaxis2={'domain': [0, 0.50]},yaxis={'domain': [0.51, 1]})
iplot(fig)

### Q27 - Do you use any of the following cloud computing products on a regular basis?

In [None]:
x_labels = [x.strip() for x in list(get_chart_df(ff_df,'Q27')['answer'])]
fig = get_fig('Q27', 'Cloud computing products', x=x_labels,vertical_spacing=1.0)
fig.update_layout(yaxis2={'domain': [0, 0.50]},yaxis={'domain': [0.51, 1]})
iplot(fig)

### Q28 - Do you use any of the following machine learning products on a regular basis? 

In [None]:
x_labels = [x.strip() for x in list(get_chart_df(ff_df,'Q28')['answer'])]
fig = get_fig('Q28', 'Machine learning Products', x=x_labels,vertical_spacing=1.0)
fig.update_layout(yaxis2={'domain': [0, 0.45]},yaxis={'domain': [0.51, 1]})
iplot(fig)

### Q29 - Which of the following big data products (relational databases, data warehouses, data lakes, or similar) do you use on a regular basis?

In [None]:
x_labels = [x.strip() for x in list(get_chart_df(ff_df,'Q29')['answer'])]
fig = get_fig('Q29', 'Big Data Products', x=x_labels,vertical_spacing=1.0)
fig.update_layout(yaxis2={'domain': [0, 0.50]},yaxis={'domain': [0.51, 1]})
iplot(fig)

### Q30 - Which of the following big data products (relational database, data warehouse, data lake, or similar) do you use most often? 

In [None]:
x_labels = [x.strip() for x in list(get_chart_df(ff_df,'Q30')['answer'])]
fig = get_fig('Q30', 'Most often used big data products', x=x_labels,vertical_spacing=1.0)
fig.update_layout(yaxis2={'domain': [0, 0.50]},yaxis={'domain': [0.51, 1]})
iplot(fig)

### Q31 - Which of the following business intelligence tools do you use on a regular basis?

In [None]:
x_labels = [x.strip() for x in list(get_chart_df(ff_df,'Q31')['answer'])]
fig = get_fig('Q31', 'Business Intelligence Tools used on regular basis', x=x_labels,vertical_spacing=1.0)
fig.update_layout(yaxis2={'domain': [0, 0.50]},yaxis={'domain': [0.51, 1]})
iplot(fig)

### Q32 - Which of the following business intelligence tools do you use most often?

In [None]:
x_labels = [x.strip() for x in list(get_chart_df(ff_df,'Q32')['answer'])]
fig = get_fig('Q32', 'Business Intelligence Tools used most often', x=x_labels,vertical_spacing=1.0)
fig.update_layout(yaxis2={'domain': [0, 0.50]},yaxis={'domain': [0.51, 1]})
iplot(fig)

### Q33 - Do you use any automated machine learning tools (or partial AutoML tools) on a regular basis?

In [None]:
import math
def get_simple_h_bar(y_data, df1, df2, df3, x_data):
    fig = go.Figure()

    fig.add_trace(go.Bar(
        y=y_data,
        x=list(df1['percent']),
        name='0-3 years of DS/ML Experience',
        legendgroup='lg1',
        orientation='h',
#        textposition='inside',
        hoverinfo='none',
        marker=dict(color='#484a8a',
            line=dict(color='#484a8a', width=1)
        )
    ))

    fig.add_trace(go.Bar(
        y=y_data,
        x=list(df2['percent']),
        name='0-3 years of DS/ML Experience & <br> 5+ years of coding experience',
        orientation='h',
        legendgroup='lg2',    
#        textposition='inside',        
#        hoverinfo='none',        
        marker=dict(
            color='#f5ab3d',
            line=dict(color='#f5ab3d', width=1)
        )
    ))
    fig.add_trace(go.Bar(
        y=y_data,
        x=list(df3['percent']),
        name='0-3 years of DS/ML Experience & <br> 5+ years of coding experience &<br> Currently in DS/ML Jobs',
        orientation='h',
        legendgroup='lg3', 
#        textposition='inside',        
#        hoverinfo='none',        
        marker=dict(
            color='#860f8c',
            line=dict(color='#860f8c', width=1)
        )
    ))

    fig.update_layout(
            barmode='stack',
            paper_bgcolor='rgb(248, 255, 255)',
            #plot_bgcolor='rgb(248, 255, 255)',
            margin=dict(l=120, r=10, t=40, b=20),
            showlegend=True,
        )

    # Build one percentage label per stacked segment, placed near the segment's centre.
    # x_data holds one row per answer, e.g. [25.4, 25.9, 26.3] for the three groups.
    annotations = []
    for yd, xd in zip(y_data, x_data):
        space = 0  # running total of the segment widths already placed
        for i in range(len(xd)):
            annotations.append(dict(xref='x', yref='y',
                                    x=math.ceil(space + (xd[i] / 2)), y=yd,
                                    text=str(xd[i]) + '%',
                                    font=dict(family='Arial', size=14,
                                              color='rgb(248, 248, 255)'),
                                    showarrow=False))
            space += xd[i]
    # The labels are currently disabled; uncomment the next line to draw them.
    #fig.update_layout(annotations=annotations)
    return fig

qn = 'Q33'
df1 = get_chart_df(ff_df,qn)
df2 = get_chart_df(coding_df,qn)
df3 = get_chart_df(ds_df,qn)
dfs = [df1, df2, df3]
df_final = reduce(lambda left,right: pd.merge(left,right,on='answer'), dfs)
arr = df_final[['percent_x','percent_y','percent']].to_numpy()
df_final
x_data = arr.tolist()


y_data = ['Automated data augmentation <br>(e.g. imgaug, albumentations)',
       'Automated feature engineering/selection <br>(e.g. tpot, boruta_py)',
       'Automated model selection<br> (e.g. auto-sklearn, xcessiv)',
       'Automated model architecture searches <br>(e.g. darts, enas)',
       'Automated hyperparameter tuning <br>(e.g. hyperopt, ray.tune, Vizier)',
       'Automation of full ML pipelines <br>(e.g. Google AutoML, H20 Driverless AI)',
       'No / None', 'Other',
       'Automation of full ML pipelines <br>(e.g. Google Cloud AutoML, H20 Driverless AI)',
       'None']


fig = get_simple_h_bar(y_data,df1, df2, df3, x_data)
fig.update_layout(title='Use of automated machine learning tools (or partial AutoML tools)')
fig.update_layout(
            xaxis=dict(
                showticklabels=True,
            ),
            plot_bgcolor='rgb(248, 255, 255)'
)
fig['layout']['xaxis']['ticksuffix'] = '%'
fig['layout']['xaxis']['tickangle']= 20
iplot(fig)


### Q34 - Which of the following automated machine learning tools (or partial AutoML tools) do you use on a regular basis?

In [None]:
x_labels = [x.strip() for x in list(get_chart_df(ff_df,'Q34')['answer'])]
fig = get_fig('Q34', 'Automated machine learning tools used', x=x_labels,vertical_spacing=1.0)
fig.update_layout(yaxis2={'domain': [0, 0.50]},yaxis={'domain': [0.51, 1]})
iplot(fig)

### Q35 - Do you use any tools to help manage machine learning experiments?

In [None]:
x_labels = [x.strip() for x in list(get_chart_df(ff_df,'Q35')['answer'])]
fig = get_fig('Q35', 'Tools to help manage machine learning experiments', x=x_labels,vertical_spacing=1.0)
fig.update_layout(yaxis2={'domain': [0, 0.50]},yaxis={'domain': [0.51, 1]})
iplot(fig)

### Q36 - Where do you publicly share or deploy your data analysis or machine learning applications?

In [None]:
x_labels = [x.strip() for x in list(get_chart_df(ff_df,'Q36')['answer'])]
fig = get_fig('Q36', 'Where do you publicly share or deploy your data analysis or machine learning applications?', x=x_labels,vertical_spacing=1.0)
fig.update_layout(yaxis2={'domain': [0, 0.50]},yaxis={'domain': [0.51, 1]})
iplot(fig)

### Q37 - On which platforms have you begun or completed data science courses?

In [None]:
x_labels = [x.strip() for x in list(get_chart_df(ff_df,'Q37')['answer'])]
x_labels = ['Coursera',
 'edX',
 'Kaggle Learn Courses',
 'DataCamp',
 'Fast.ai',
 'Udacity',
 'Udemy',
 'LinkedIn Learning',
 'Cloud-certification programs',
 'University Courses',
 'None',
 'Other']
fig = get_fig('Q37', 'On which platforms have you begun or completed data science courses?', x=x_labels,vertical_spacing=1.0)
fig.update_layout(yaxis2={'domain': [0, 0.50]},yaxis={'domain': [0.51, 1]})
iplot(fig)

### Q38 - What is the primary tool that you use at work or school to analyze data?

In [None]:
qn = 'Q38'
df1 = get_chart_df(ff_df,qn)
df2 = get_chart_df(coding_df,qn)
df3 = get_chart_df(ds_df,qn)
dfs = [df1, df2, df3]
df_final = reduce(lambda left,right: pd.merge(left,right,on='answer'), dfs)
arr = df_final[['percent_x','percent_y','percent']].to_numpy()
df_final
x_data = arr.tolist()

y_data = ['Local development environments <br>(RStudio, JupyterLab, etc.)',
       'Basic statistical software <br>(Microsoft Excel, Google Sheets, etc.)',
       'Business intelligence software <br>(Salesforce, Tableau, Spotfire, etc.)',
       'Cloud-based data software & APIs <br>(AWS, GCP, Azure, etc.)',
       'Advanced statistical software <br>(SPSS, SAS, etc.)', 'Other']

fig = get_simple_h_bar(y_data,df1, df2, df3, x_data)
fig.update_layout(title='Primary tool to analyze data')
fig.update_layout(
            xaxis=dict(
                showgrid=False,
                showline=False,
                showticklabels=False,
            ),
            yaxis=dict(
                showgrid=False,
                showline=False,
                showticklabels=True,
                zeroline=False,
            ),
            barmode='stack',
            paper_bgcolor='rgb(248, 255, 255)',
            plot_bgcolor='rgb(248, 255, 255)',
            margin=dict(l=120, r=10, t=40, b=20),
            showlegend=True,
        )
fig['layout']['xaxis']['ticksuffix'] = '%'
fig['layout']['xaxis']['tickangle']= 20
iplot(fig)

### Q39 - Who/what are your favorite media sources that report on data science topics?

In [None]:
x_labels = ['Twitter',
 "Email newsletters",
 'Reddit',
 'Kaggle',
 'Course Forums',
 'YouTube',
 'Podcasts',
 'Blogs',
 'Journal Publications',
 'Slack Communities',
 'None',
 'Other']
fig = get_fig('Q39', 'Who/what are your favorite media sources that report on data science topics?', x=x_labels,vertical_spacing=1.0)
fig.update_layout(yaxis2={'domain': [0, 0.50]},yaxis={'domain': [0.51, 1]})
iplot(fig)

Full text of the answer options abbreviated in the chart above:
* Twitter (data science influencers)
* Email newsletters (Data Elixir, O'Reilly Data & AI, etc)
* Reddit (r/machinelearning, etc)
* Kaggle (notebooks, forums, etc)
* Course Forums (forums.fast.ai, Coursera forums, etc)
* YouTube (Kaggle YouTube, Cloud AI Adventures, etc)
* Podcasts (Chai Time Data Science, O’Reilly Data Show, etc)
* Blogs (Towards Data Science, Analytics Vidhya, etc)
* Journal Publications (peer-reviewed journals, conference proceedings, etc)
* Slack Communities (ods.ai, kagglenoobs, etc)

### Q26 B - Which of the following cloud computing platforms do you hope to become more familiar with in the next 2 years?

In [None]:
qn = 'Q26_B_'
x_labels = [x.strip() for x in list(get_chart_df(ff_df,qn)['answer'])]
fig = get_fig(qn, 'Which of the following cloud computing platforms do you hope to become more familiar with in the next 2 years?', x=x_labels,vertical_spacing=1.0)
fig.update_layout(yaxis2={'domain': [0, 0.50]},yaxis={'domain': [0.51, 1]})
iplot(fig)

### Q27 B - In the next 2 years, do you hope to become more familiar with any of these specific cloud computing products?

In [None]:
qn = 'Q27_B_'
x_labels = [x.strip() for x in list(get_chart_df(ff_df,qn)['answer'])]
fig = get_fig(qn, 'In the next 2 years, do you hope to become more familiar with any of these <br>specific cloud computing products?', x=x_labels,vertical_spacing=1.0)
fig.update_layout(yaxis2={'domain': [0, 0.50]},yaxis={'domain': [0.51, 1]})
iplot(fig)

### Q28 B - In the next 2 years, do you hope to become more familiar with any of these specific machine learning products?

In [None]:
qn = 'Q28_B_'
x_labels = [x.strip() for x in list(get_chart_df(ff_df,qn)['answer'])]
fig = get_fig(qn, 'In the next 2 years, do you hope to become more familiar with any of these <br>specific machine learning products?', x=x_labels,vertical_spacing=1.0)
fig.update_layout(yaxis2={'domain': [0, 0.50]},yaxis={'domain': [0.51, 1]})
iplot(fig)

### Q29 B - Which of the following big data products (relational databases, data warehouses, data lakes, or similar) do you hope to become more familiar with in the next 2 years?

In [None]:
qn = 'Q29_B_'
x_labels = [x.strip() for x in list(get_chart_df(ff_df,qn)['answer'])]
fig = get_fig(qn, 'Which of the following big data products (relational databases, data warehouses, data lakes, <br>or similar) do you hope to become more familiar with in the next 2 years?', x=x_labels,vertical_spacing=1.0)
fig.update_layout(yaxis2={'domain': [0, 0.50]},yaxis={'domain': [0.51, 1]})
iplot(fig)

### Q31 B - Which of the following business intelligence tools do you hope to become more familiar with in the next 2 years?

In [None]:
qn = 'Q31_B_'
x_labels = [x.strip() for x in list(get_chart_df(ff_df,qn)['answer'])]
fig = get_fig(qn, 'Which of the following business intelligence tools do you hope to become more <br>familiar with in the next 2 years?', x=x_labels,vertical_spacing=1.0)
fig.update_layout(yaxis2={'domain': [0, 0.50]},yaxis={'domain': [0.51, 1]})
iplot(fig)

### Q33 B - Which categories of automated machine learning tools (or partial AutoML tools) do you hope to become more familiar with in the next 2 years?

In [None]:
qn = 'Q33_B_'
df1 = get_chart_df(ff_df,qn)
df2 = get_chart_df(coding_df,qn)
df3 = get_chart_df(ds_df,qn)
dfs = [df1, df2, df3]
df_final = reduce(lambda left,right: pd.merge(left,right,on='answer'), dfs)
arr = df_final[['percent_x','percent_y','percent']].to_numpy()
df_final
x_data = arr.tolist()

y_data = ['Automated data augmentation <br>(e.g. imgaug, albumentations)',
       'Automated feature engineering/selection <br>(e.g. tpot, boruta_py)',
       'Automated model selection <br>(e.g. auto-sklearn, xcessiv)',
       'Automated model architecture searches <br>(e.g. darts, enas)',
       'Automated hyperparameter tuning <br>(e.g. hyperopt, ray.tune, Vizier)',
       'Automation of full ML pipelines <br>(e.g. Google Cloud AutoML, H20 Driverless AI)',
       'None', 'Other']

fig = get_simple_h_bar(y_data,df1, df2, df3, x_data)
fig.update_layout(title=' Which categories of automated machine learning tools (or partial AutoML tools) <br>do you hope to become more familiar with in the next 2 years?')
fig.update_layout(
            xaxis=dict(
                showgrid=False,
                showline=False,
                showticklabels=True,
            ),
            yaxis=dict(
                showgrid=False,
                showline=False,
                showticklabels=True,
                zeroline=False,
            ),
            barmode='stack',
            paper_bgcolor='rgb(248, 255, 255)',
            plot_bgcolor='rgb(248, 255, 255)',
            margin=dict(l=120, r=10, t=40, b=20),
            showlegend=True,
        )
fig['layout']['xaxis']['ticksuffix'] = '%'
fig['layout']['xaxis']['tickangle']= 20
iplot(fig)

### Q34 B - Which specific automated machine learning tools (or partial AutoML tools) do you hope to become more familiar with in the next 2 years? 

In [None]:
qn = 'Q34_B_'
x_labels = [x.strip() for x in list(get_chart_df(ff_df,qn)['answer'])]
fig = get_fig(qn, 'Which specific automated machine learning tools (or partial AutoML tools) <br>do you hope to become more familiar with in the next 2 years?', x=x_labels,vertical_spacing=1.0)
fig.update_layout(yaxis2={'domain': [0, 0.50]},yaxis={'domain': [0.51, 1]})
iplot(fig)

### Q35 B - In the next 2 years, do you hope to become more familiar with any of these tools for managing ML experiments? 

In [None]:
qn = 'Q35_B_'
x_labels = [x.strip() for x in list(get_chart_df(ff_df,qn)['answer'])]
fig = get_fig(qn, 'In the next 2 years, do you hope to become more familiar <br>with any of these tools for managing ML experiments?', x=x_labels,vertical_spacing=1.0)
fig.update_layout(yaxis2={'domain': [0, 0.50]},yaxis={'domain': [0.51, 1]})
iplot(fig)

### Conclusion

In [None]:
# Question IDs whose answers feed the word clouds below
lst_cols = ['Q7', 'Q9', 'Q10', 'Q12', 'Q14', 'Q16', 'Q17', 'Q18', 'Q19', 'Q23', 'Q26', 'Q27', 'Q28', 'Q29', 'Q31', 'Q33', 'Q34', 'Q35', 'Q36', 'Q37', 'Q39', 'Q26_B', 'Q27_B', 'Q28_B', 'Q29_B', 'Q31_B', 'Q33_B', 'Q34_B', 'Q35_B']

### Areas to focus on during the initial years of a data science expedition (1-3 years)

In [None]:
wc_dict = {}

for q in lst_cols:
    sub_qns = [qn for qn in list(master.columns) if q in qn]
    for sub_q in sub_qns:
        dct = ff_df[sub_q].value_counts().to_dict()
        for key in dct:
            if 'None' in key or 'Other' in key:
                continue
            wc_dict[key.strip()] = dct[key]
#wc_dict
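
In the loop above, `wc_dict[key.strip()] = dct[key]` assigns rather than adds, so if the same answer text occurs in more than one column, the last count seen replaces the earlier ones. If summing the counts is what is wanted, a sketch along the following lines could be used. The names `collect_counts` and `per_column_counts` are hypothetical, the sample numbers are made up, and the prefix test is a slightly stricter alternative to the substring match in the notebook.

```python
from collections import Counter


def collect_counts(per_column_counts, question_ids):
    """Sum answer counts over every column that belongs to the given questions."""
    totals = Counter()
    for column, counts in per_column_counts.items():
        # Match 'Q26' to 'Q26' or 'Q26_A_Part_1', but 'Q3' does not match 'Q33_A_Part_1'
        belongs = any(column == q or column.startswith(q + '_') for q in question_ids)
        if not belongs:
            continue
        for answer, n in counts.items():
            if 'None' in answer or 'Other' in answer:
                continue
            totals[answer.strip()] += n
    return dict(totals)


# Hypothetical stand-in for {column: df[column].value_counts().to_dict()}
per_column_counts = {
    'Q26_A_Part_1': {' Amazon Web Services (AWS) ': 10},
    'Q26_B_Part_1': {' Amazon Web Services (AWS) ': 25},
    'Q26_A_Part_11': {'None': 40},
    'Q3': {'India': 99},
}
print(collect_counts(per_column_counts, ['Q26', 'Q26_B']))
```

Iterating over columns rather than over question IDs also means a column is counted once even when two IDs in the list (such as `'Q26'` and `'Q26_B'`) both match it.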

In [None]:
import wordcloud as w
#wordcloud = w.WordCloud(collocations=False).generate_from_frequencies(wc_dict)
wordcloud = w.WordCloud(collocations=False, repeat=True, max_font_size=100, max_words=len(wc_dict), 
                        background_color="#bdb7ac", colormap='gist_heat',
                        prefer_horizontal=1.0,
                        min_font_size = 1).generate_from_frequencies(wc_dict)

#wordcloud = w.WordCloud(collocations = False).generate(text)
plt.figure(figsize = (60,30))
plt.imshow(wordcloud, interpolation = 'bilinear')
plt.axis("off")
plt.show()
#fig.tight_layout(pad=3.0)

### After 3 years

In [None]:
# Respondents who have used machine learning methods for more than 3 years
expds_df = q15[(q15['Q15'].str.contains('3-4 years')
                | q15['Q15'].str.contains('4-5 years')
                | q15['Q15'].str.contains('5-10 years')
                | q15['Q15'].str.contains('10-20 years')
                | q15['Q15'].str.contains('20 or more years'))]
wc_dict = {}

for q in lst_cols:
    sub_qns = [qn for qn in list(master.columns) if q in qn]
    for sub_q in sub_qns:
        dct = expds_df[sub_q].value_counts().to_dict()
        for key in dct:
            if 'None' in key or 'Other' in key:
                continue
            wc_dict[key.strip()] = dct[key]
#wc_dict

In [None]:
#wordcloud = w.WordCloud(collocations=False).generate_from_frequencies(wc_dict)
wordcloud = w.WordCloud(collocations=False, repeat=True, max_font_size=100, max_words=len(wc_dict), 
                        background_color="#bdb7ac", colormap='summer',
                        prefer_horizontal=1.0,
                        min_font_size = 1).generate_from_frequencies(wc_dict)

#wordcloud = w.WordCloud(collocations = False).generate(text)
plt.figure(figsize = (60,30))
plt.imshow(wordcloud, interpolation = 'bilinear')
plt.axis("off")
plt.show()
#fig.tight_layout(pad=3.0)

The word clouds make it evident (and it is hardly surprising) that it pays to start by spending time on the following:

* Python
* Matplotlib
* Scikit-learn
* Linear / Logistic Regression
* Decision Trees / Random Forests
* Jupyter (Lab/Notebooks)
* Kaggle (notebooks, forums)

and then move on to explore other areas such as:

* Seaborn
* TensorFlow, Keras, PyTorch
* Gradient Boosting Machines (xgboost, lightgbm, etc)
* Convolutional Neural Networks

Of course, this is not a comprehensive list, but it is a good starting point for plunging into the data science world in a way that aligns with the community!

##### References

* https://chart-studio.plotly.com/~alishobeiri/1591/plotly-sankey-diagrams/#/
* https://www.kaggle.com/iyadavvaibhav/plotly-sankey-with-filters (Inspiration for Interactive Sankey)
* https://plotly.com/python/