In [None]:
%%HTML
<style type="text/css">

div.h2 {
    background-color: white; 
    color: #2C3E50; 
    padding: auto;
    margin: 5px;
    font-size: 24px; 
}
span.high_all{
    font-weight: bold;
    /* background-color: #ffffb3; */
    font-size: 16px; 
}
span.high_prod{
    font-weight: bold;
    color: #45a1a1;
    font-size: 16px; 
}
span.high_expl{
    font-weight: bold;
    color: #a1a112;
    font-size: 16px; 
}
span.high {
    background-color: #f5f5ef; 
    color: #000000; 
    font-size: 16px; 
    margin-top: 1px;
}
span.high_p {
    background-color: #a6d9d9; 
    color: #000000;  
    font-size: 16px; 
    margin-top: 1px;
}
span.high_e {
    background-color: #f2f28c; 
    color: #000000;  
    font-size: 16px; 
    margin-top: 1px;
}

span.high_1{    
    background-color: #E8B151; 
    color: #37322D; 
    font-size: 36px; 
    max-width: 1500px; 
    margin-top: 1px;}

p.normal {
    padding-bottom: 3px;
    color: #383b40; 
    font-size: 16px;
    display: table;
    margin:auto;
    /* border-bottom: 1px dotted; */
}
p.heading1 {
    padding-bottom: 3px;
    border-bottom: 1px solid #355f96;
    color: #355f96; /* #b54002; */
    font-size: 30px;
    display: inline-block;
    text-align: center;

}
p.heading2 {
    padding-bottom: 3px;
    border-bottom: 1px solid #355f96;
    color: #355f96; /* #b54002; */
    font-size: 20px;
    display: inline-block;

}
div.a {
  text-indent: 50px;
}
div.h {
  text-align: center;
}
div.p {
  text-align: justify;
  content: "";
  display: inline-block;
  width: 100%;
}
</style>

<p class='heading1'>Handling machine learning methods</p>


Data science and machine learning attract broad interest because of their popularity.
However, the domain is complex, and many multidisciplinary pieces have to be put together for the entire "machine" to work well.

My objective in this analysis is to identify and profile the Kaggle survey respondents who have brought models into production, in contrast with professionals who analyze data or use ML methods but have not deployed their work in production. In the first case, the work is operationalized and implemented in production. The latter group either has no need to deploy a model (when the final 'product' is a set of insights that helps business decision makers) or was not successful in deploying the model in production for various reasons.

Profiling the two groups also serves a secondary objective: to better understand and assess the maturity of professionals, teams, or enterprises in using ML methods.


Survey question 'Q22 Does your current employer incorporate machine learning methods into their business' separates these two populations, which both use machine learning methods but with different approaches and purposes.

Based on the Q22 answers, the two groups are segmented as follows:
* <p class='normal'><span class='high_expl'>EXPLORERS</span> - respondents choosing one of the two options below:</p>
    <div class ='a'><span class='high_expl'>We are exploring ML methods (and may one day put a model into production)</span></div>
    <div class ='a'><span class='high_expl'>We use ML methods for generating insights (but do not put working models into production)</span></div>


* <p class='normal'><span class='high_prod'>PRODUCERS</span> - respondents choosing one of the two options below: </p>
    <div class ='a'><span class='high_prod'>We have well established ML methods (i.e., models in production for more than 2 years)</span></div>
    <div class ='a'><span class='high_prod'>We recently started using ML methods (i.e., models in production for less than 2 years)</span></div>

The analysis explores 2020 survey data and highlights the differences between the two groups. This notebook is structured around three main areas:
    <div class ='a'>[General profile](#general_profile)</div>
    <div class ='a'>[Skills](#skills)</div>
    <div class ='a'>[Infrastructure used](#infrastructure)</div>

Note: I have excluded from the analysis those who answered in Q22 that they do not use ML methods or do not know whether their company has incorporated ML methods.
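
The segmentation described above can be sketched on a few made-up answers. This is only a minimal illustration: the names `GROUPS` and `toy` are hypothetical, the answers in `toy` are invented, and the last two rows stand in for the excluded respondents.

```python
import pandas as pd

# Q22 answer options quoted in the text above, mapped to the two groups.
GROUPS = {
    'We are exploring ML methods (and may one day put a model into production)': 'Explorers',
    'We use ML methods for generating insights (but do not put working models into production)': 'Explorers',
    'We have well established ML methods (i.e., models in production for more than 2 years)': 'Producers',
    'We recently started using ML methods (i.e., models in production for less than 2 years)': 'Producers',
}

# Toy answers (made up); the real answers come from the survey file.
toy = pd.DataFrame({'Q22': [
    'We are exploring ML methods (and may one day put a model into production)',
    'We have well established ML methods (i.e., models in production for more than 2 years)',
    'I do not know',  # placeholder for an excluded answer
    None,             # question not answered
]})

# Answers outside the mapping, and missing answers, become NaN
# and therefore drop out of any group-level table.
toy['q22_groups'] = toy['Q22'].map(GROUPS)
print(toy['q22_groups'].value_counts(dropna=False))
```

A dictionary lookup keeps the four answer strings in one place, which may be easier to maintain than repeating them in several `isin` filters.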

In [None]:
import numpy as np
import pandas as pd
import re
from itertools import chain
import matplotlib.pyplot as plt
from matplotlib.patches import Rectangle
import matplotlib.ticker as mtick
import seaborn as sns

# load 2020 data
df_2020 = pd.read_csv('/kaggle/input/kaggle-survey-2020/kaggle_survey_2020_responses.csv', low_memory=False)

# 2019 data
df_2019 = pd.read_csv("../input/kaggle-survey-2019/multiple_choice_responses.csv", low_memory=False)

# 2018 data
df_2018 = pd.read_csv("../input/kaggle-survey-2018/multipleChoiceResponses.csv", low_memory=False)

# 2017 data
df_2017 = pd.read_csv("../input/kaggle-survey-2017/multipleChoiceResponses.csv",encoding='ISO-8859-1', low_memory=False)



#schema_2019 = pd.read_csv('/kaggle/input/kaggle-survey-2019/survey_schema.csv')

# capture column names; remove col names from main data 
df_2020_cols = df_2020.loc[0,:]
df_2020.drop(0, inplace=True)

# capture column names; remove col names from main data 
df_2019_cols = df_2019.loc[0,:]
df_2019.drop(0, inplace=True)

df_2018_cols = df_2018.loc[0,:]
df_2018.drop(0, inplace=True)

# new variables
df_2020.loc[:,'All'] = 'All'
df_2020_cols = pd.concat([df_2020_cols, pd.Series('All', index=['All'])])

# explorers vs production
df_2020.loc[df_2020.loc[:,'Q22'].isin(['We are exploring ML methods (and may one day put a model into production)',
                                       'We use ML methods for generating insights (but do not put working models into production)']),'q22_groups'] = 'Explorers'
df_2020.loc[df_2020.loc[:,'Q22'].isin(['We have well established ML methods (i.e., models in production for more than 2 years)',
                                       'We recently started using ML methods (i.e., models in production for less than 2 years)']),'q22_groups'] = 'Producers'
df_2020_cols = pd.concat([df_2020_cols, pd.Series('q22_groups', index=['q22_groups'])])

# 2019
df_2019.loc[df_2019.loc[:,'Q8'].isin(['We are exploring ML methods (and may one day put a model into production)',
                                       'We use ML methods for generating insights (but do not put working models into production)']),'Q8_groups'] = 'Explorers'
df_2019.loc[df_2019.loc[:,'Q8'].isin(['We have well established ML methods (i.e., models in production for more than 2 years)',
                                       'We recently started using ML methods (i.e., models in production for less than 2 years)']),'Q8_groups'] = 'Producers'
df_2019_cols = pd.concat([df_2019_cols, pd.Series('Q8_groups', index=['Q8_groups'])])

# 2018
df_2018.loc[df_2018.loc[:,'Q10'].isin(['We are exploring ML methods (and may one day put a model into production)',
                                       'We use ML methods for generating insights (but do not put working models into production)']),'Q10_groups'] = 'Explorers'
df_2018.loc[df_2018.loc[:,'Q10'].isin(['We have well established ML methods (i.e., models in production for more than 2 years)',
                                       'We recently started using ML methods (i.e., models in production for less than 2 years)']),'Q10_groups'] = 'Producers'
df_2018_cols = pd.concat([df_2018_cols, pd.Series('Q10_groups', index=['Q10_groups'])])


# income recode
# keep the lower bound of each salary bracket as a number (e.g. '1,000-1,999' -> 1000.0)
df_2020.loc[:,'Q24r']=df_2020.loc[:,'Q24'].astype(str).replace(regex=r'^\$0', value='0') \
    .replace(regex=r'^> \$500,000', value='500,000-').apply(lambda x: x.split('-')[0]).str.replace(',', '', regex=False).astype(float)

df_2020.loc[(df_2020['Q24r']>=0) & (df_2020['Q24r']<50000),'Q24rec'] = 'a) below 50k USD'
df_2020.loc[(df_2020['Q24r']>=50000) & (df_2020['Q24r']<100000),'Q24rec'] = 'b) 50k-100k USD'
df_2020.loc[(df_2020['Q24r']>=100000) & (df_2020['Q24r']<150000),'Q24rec'] = 'c) 100k-150k USD'
df_2020.loc[(df_2020['Q24r']>=150000) & (df_2020['Q24r']<200000),'Q24rec'] = 'd) 150k-200k USD'
df_2020.loc[(df_2020['Q24r']>=200000),'Q24rec'] = 'e) >=200k USD'
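
The income recode at the end of the cell above assigns each salary lower bound to a band with a chain of boolean masks. As an alternative sketch (not the approach the notebook uses), `pd.cut` can express the same left-closed bands in one call. The values in `lower_bound` and the short band labels are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical lower bounds of salary brackets in USD (made-up values).
lower_bound = pd.Series([0, 49999, 50000, 125000, 500000, np.nan])

# right=False makes every bin closed on the left: [0, 50000), [50000, 100000), ...
bins = [0, 50000, 100000, 150000, 200000, np.inf]
labels = ['<50k', '50k-100k', '100k-150k', '150k-200k', '>=200k']
income_band = pd.cut(lower_bound, bins=bins, labels=labels, right=False)

# Missing salaries stay missing instead of falling into a band.
print(income_band.tolist())
```

The result is an ordered categorical, so tables built from it sort by band without needing letter prefixes in the labels.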

In [None]:
plt.style.use('default')
## helper functions
## tables - counts and %
def tab_counts_perc(df, schema, q_prefix, multi=True):
    cnt=0
    # find columns by prefix
    if multi:
        for ele in schema.index:
            mtch = re.match(q_prefix, ele)
            if (mtch): 
                cnt=cnt+1
                
                agg = df.groupby(ele).agg({
                        ele: 'count'})
                if q_prefix.find("_")>-1:
                    fin_name = q_prefix[:q_prefix.find("_")]
                else:
                    fin_name = q_prefix
                agg.columns = [fin_name]
                if cnt==1:
                    df_fin = agg
                else:
                    df_fin = pd.concat([df_fin, agg])
                #print(fin_name)
    else:
        fin_name = q_prefix
        df_fin = df.groupby(q_prefix).agg({q_prefix: 'count'})
        
    #df_fin['percent'] = df_fin[fin_name]/int(schema.loc[1,fin_name])
    # the count column is named fin_name (equal to q_prefix for single-choice questions)
    df_fin['percent'] = (df_fin[fin_name] / df_fin[fin_name].sum() * 100).round(1)
    df_fin.index.name = schema[fin_name][0]

    return(df_fin)


def tab_multi_cross_perc(df, schema, q_prefix, col_var, multi=True):
    cnt=0
    lst_row_names = []
    # find columns by prefix
    if multi:
        for ele in schema.index:
            mtch = re.match(q_prefix, ele)
            if (mtch): 
                cnt=cnt+1
                valid_cols=pd.crosstab(df[ele],df[col_var]).columns
                if cnt==1:
                    all_cols=valid_cols
                elif (int(cnt) > int(1)) and (len(pd.crosstab(df[ele],df[col_var]).columns) > len(valid_cols)):
                    all_cols=valid_cols
                    
                agg = pd.crosstab(df[ele],df[col_var]).apply(lambda r: r/ tab_counts_perc(df, schema, col_var, multi=False)[col_var][valid_cols].values, axis=1)

                if q_prefix.find("_")>-1:
                    #print(q_prefix[:q_prefix.find("_")])
                    fin_name = q_prefix[:q_prefix.find("_")]
                else:
                    fin_name = q_prefix
                #print(fin_name)
                #print(agg.values)
                #print(agg.index)
                #print(agg.index.name)
                lst_row_names.append(list(agg.index.values))
                #agg.index.name = [fin_name]
                
                if cnt==1:
                    df_fin = pd.DataFrame(agg.values,columns=valid_cols)
                else:
                    df_fin = pd.concat([df_fin, pd.DataFrame(agg.values, columns=valid_cols)])
                    #df_fin = df_fin.append(pd.DataFrame(agg.values))
                #print(fin_name)
    else:
        fin_name = q_prefix
        valid_cols=pd.crosstab(df[q_prefix],df[col_var]).columns
        agg = pd.crosstab(df[q_prefix],df[col_var]).apply(lambda r: r/ tab_counts_perc(df, schema, col_var, multi=False)[col_var][valid_cols].values, axis=1)
        df_fin = pd.DataFrame(agg.values,columns=agg.columns)
        lst_row_names.append(list(agg.index.values))
        
    #df_fin.index.name = schema[fin_name][0]
    #df_fin.columns = all_cols #list(set(df.loc[df[col_var].notna(),col_var]))
    df_fin.index = list(chain(*lst_row_names))
    return(df_fin)

## cross tab - counts only
def tab_cross_cnt(df, schema, q_prefix, col_var, multi=True):
    df_cross = pd.crosstab(df[q_prefix],df[col_var])
    df_cross.index.name = schema[q_prefix][0]
    return(df_cross)

## cross tab - %
def tab_cross_perc(df, schema, q_prefix, col_var, multi=True):
    valid_cols=pd.crosstab(df[q_prefix],df[col_var]).columns
    df_cross = pd.crosstab(df[q_prefix],df[col_var]).apply(lambda r: r/ tab_counts_perc(df, schema, col_var, multi=False)[col_var][valid_cols].values, axis=1)
    df_cross.index.name = schema[q_prefix][0]
    return(df_cross)




# custom function to return only % target values - by category
def select_indicator_by_group(main_df, quest_no, lst_exact_match, exact_match_label, find_word, lst_words, words_label, has_col_item, label_col=None,
                                has_recode=None):
    if (has_col_item):
        df_q = main_df.loc[(main_df.loc[:,'question_number']==quest_no) & (main_df.loc[:,'column_name']==label_col),:].copy()  # target question
    else:
        df_q = main_df.loc[main_df.loc[:,'question_number']==quest_no,:].copy()  # target question
    df_q['response_value'] = df_q['response_value'].fillna('No Response')
    df_ans = df_q['response_value'].value_counts() / df_q.shape[0]  # results % values
    
    if(has_recode):
        df_cross = pd.crosstab(df_q['response_recode'], df_q['primary_industry']).apply(lambda r: r/r.sum(), axis=0)
    else:
        df_cross = pd.crosstab(df_q['response_value'], df_q['primary_industry']).apply(lambda r: r/r.sum(), axis=0)
        
    n_e=0
    for exact_m in lst_exact_match:
        n_e+=1
        exact_mtch = df_cross.loc[df_cross.index==exact_m,:].copy()
        exact_mtch.loc[:,'indicator'] = exact_match_label[n_e-1]
        if (n_e==1): exact_mtch_fin=exact_mtch
        else: exact_mtch_fin=pd.concat([exact_mtch_fin, exact_mtch])
    
    if (find_word):
        n_w = 0
        for word in lst_words:
            n_w += 1
            find_wrd = df_cross.loc[df_cross.index.str.contains(word,regex=True),:].copy()
            d_fin = pd.DataFrame(find_wrd.sum()).transpose()
            d_fin.loc[:,'indicator'] = words_label[n_w-1]
            # start from all exact matches collected above, then add one row per word group
            if (n_w==1): d_fin2 = pd.concat([exact_mtch_fin, d_fin])
            else: d_fin2 = pd.concat([d_fin2, d_fin])
    else:
        d_fin2 = exact_mtch_fin

    return(d_fin2.reset_index())


   
def highlight_max(s):
    '''
    highlight the maximum in a Series.
    '''
    is_max = s == s.max()
    return ['background-color: #fff5cc' if v else '' for v in is_max]

def title_style():
    return [{
    'selector': 'caption',
    'props': [
        ('color', '#496FA0'),
        ('font-size', '16px')
    ]
}]


def delta_cross_table(df, q_prx, head_var, q_tit, multi_val=False):
    dlt_df = tab_multi_cross_perc(df, df_2020_cols, q_prefix=q_prx, col_var=head_var, multi=multi_val)
    dlt_df.loc[:,'delta'] = dlt_df.loc[:,'Producers'] - dlt_df.loc[:,'Explorers']
    dlt_df.columns.name=''
    dlt_df = dlt_df.sort_values('delta', ascending=False)
    return(dlt_df)

def apply_style_tab(df, q_tit):
    dff = df.style.bar(subset=['delta'], align='mid', color=['#EDA39A', '#99DAAE']) \
                    .format("{:.1%}", subset=['Producers', 'Explorers','delta']).set_caption(q_tit) \
                    .set_properties(**{'width': '100px', 'height': '25px', 'font-size': '10pt'}).set_table_styles(title_style())
    return(dff)


## shift the y positions of series labels in a scatter plot so that the labels do not overlap
def position_series_labels(df, target_value):
    use_itemy = df
    label_height = 0.03

    y2_df = pd.DataFrame(use_itemy[target_value])
    y2_df.sort_values(by=[target_value], inplace=True)

    y2_names = y2_df.index
    y2 = (y2_df[target_value].values)

    y_fst = y2[0]-0.006
    ser_y2_dup =[]
    ser_y2_dup.append(y_fst)
    ser_y2_dup.extend(y2[1:])

    dif_y2 = []
    dif_y2.append(ser_y2_dup[0])
    dif_y2.extend(np.array(ser_y2_dup)[1:] - np.array(ser_y2_dup)[:-1])

    err=0

    while err < 20:
        n = -1
        err = err + 1

        for i in dif_y2:
            #print(i)
            n = n+1
            if (n == 0): 
                ser_y2_dup[n] = ser_y2_dup[n]
            else:
                if (i < label_height):
                    #print(i)
                    ser_y2_dup[n] = ser_y2_dup[n] + (label_height - dif_y2[n])
                    dif_y2[n] = round(ser_y2_dup[n] - ser_y2_dup[n-1],4)  # update dif value
                else:
                    ser_y2_dup[n] = ser_y2_dup[n]

    final_frame = pd.DataFrame({'index' : y2_names, 'values_y' : ser_y2_dup})
    
    final_join = use_itemy.join(final_frame.set_index('index'))
                        
    return(final_join['values_y'].values)

<p class='heading2'>Proportion of target population</p>

<p class='normal'>About two thirds of survey participants fall into these two categories, which overlap with the professionally engaged population.</p>

In [None]:
prop_q22 = pd.DataFrame(df_2020['Q22'].value_counts() / df_2020['Q22'].value_counts().sum()).sort_index(ascending=False)
prop_q22['Does your current employer incorporate machine learning methods into their business?'] = prop_q22.index
prop_q22.style.apply(lambda x: ['background: #a6d9d9' if x.name in ['We have well established ML methods (i.e., models in production for more than 2 years)',
                                                                   'We recently started using ML methods (i.e., models in production for less than 2 years)'] 
                                
                              else ('background: #f2f28c' if x.name in ['We are exploring ML methods (and may one day put a model into production)',
                                                                   'We use ML methods for generating insights (but do not put working models into production)']  
                                    else 'background: #f5f5f0')
                            for i in x], 
                   axis=1).format("{:.1%}",  subset=['Q22']).set_properties(**{'width': '210px', 'height': '9px', 'font-size': '10pt'}).hide_index()

<p class='normal'>The proportion of explorers and producers is relatively stable over time, with each group representing about half of the ML user population.</p>

In [None]:
q22_gr = tab_counts_perc(df_2020, df_2020_cols ,"q22_groups", multi=False).sort_values(by='percent', ascending=False)
q8_gr = tab_counts_perc(df_2019, df_2019_cols ,"Q8_groups", multi=False).sort_values(by='percent', ascending=False)
q10_gr = tab_counts_perc(df_2018, df_2018_cols ,"Q10_groups", multi=False).sort_values(by='percent', ascending=False)
q22_gr.index.name='2020'
q8_gr.index.name='2019'
q10_gr.index.name = '2018'
q22_evo = pd.concat([q10_gr['percent']/100, q8_gr['percent']/100, q22_gr['percent']/100], axis=1)
q22_evo.columns = ['2018', '2019', '2020']
q22_evo.style.apply(lambda x: ['background: #a6d9d9' if x.name in ['Producers'] 
                              else ('background: #f2f28c' if x.name in ['Explorers'] else '')
                            for i in x], axis=1) \
        .set_caption('Explorers vs Producers evolution').format("{:.1%}").set_properties(**{'width': '160px', 'height': '9px', 'font-size': '10pt'})

In [None]:
class create_basket_var:
    def __init__(self, q_prefix, q_prefix_short, cross_q, df, df_cols, limit):
        self.q_prefix = q_prefix
        self.q_prefix_short = q_prefix_short
        self.cross_q = cross_q
        self.new_df=df
        self.df_cols=df_cols
        
        
        var_list = [x for x in df_cols.index if re.match(r'^{0}.+$'.format(q_prefix),x)]
        self.new_df['{0}_ans_basket'.format(q_prefix_short)] = df.loc[:,var_list].astype(str).apply(lambda x: '|'.join(x), axis=1) \
            .replace(regex=r'nan', value='').replace(regex=r'\|+', value='|').replace(regex=r'\|', value='\n')
        self.new_df['{0}_ans_cnt'.format(q_prefix_short)] = df.loc[:,var_list].count(axis=1)
        self.df_cols = df_cols.append(pd.Series('{0} answer basket'.format(q_prefix_short), index=['{0}_ans_basket'.format(q_prefix_short)]))
        
        check_lst = var_list
        check_lst.append('{0}_ans_basket'.format(q_prefix_short))
        check_lst.append('{0}_ans_cnt'.format(q_prefix_short))

        # include in analysis only baskets with a sample of >= 30 cases
        lst_keep = list(self.new_df.loc[self.new_df['{0}_ans_cnt'.format(self.q_prefix_short)]>1,'{0}_ans_basket'.format(q_prefix_short)].value_counts()[self.new_df['{0}_ans_basket'.format(q_prefix_short)].value_counts()>=30].index)
        # drop empty and 'None'-only baskets if they are present
        for empty_basket in ('\n', '\nNone'):
            if empty_basket in lst_keep:
                lst_keep.remove(empty_basket)
        self.lst_keep=lst_keep[0:limit]  # select top n baskets
        #print(lst_keep)
        #print(len(self.lst_keep))

    def cross_df(self):
        cross_df = tab_multi_cross_perc(self.new_df.loc[(self.new_df['{0}_ans_cnt'.format(self.q_prefix_short)]>1) & 
                                                        (self.new_df['{0}_ans_basket'.format(self.q_prefix_short)].isin(self.lst_keep)),:], self.df_cols, q_prefix=self.cross_q, col_var='{0}_ans_basket'.format(self.q_prefix_short), multi=False).sort_index(axis=0).sort_index(axis=1)
        #print(self.new_df['{0}_ans_basket'.format(self.q_prefix_short)].value_counts()[35:70])
        return(cross_df)
    
    def counts_answer_basket(self):
        return(self.new_df.loc[(self.new_df['{0}_ans_cnt'.format(self.q_prefix_short)]>1) & (self.new_df['{0}_ans_basket'.format(self.q_prefix_short)].isin(self.lst_keep)),
                        '{0}_ans_basket'.format(self.q_prefix_short)].value_counts())
    def counts_total(self):
        return(self.new_df.loc[:,'{0}_ans_cnt'.format(self.q_prefix_short)].value_counts())
    
    def counts_more_than_one(self):
        return(self.new_df.loc[(self.new_df['{0}_ans_cnt'.format(self.q_prefix_short)]>1),
                        '{0}_ans_basket'.format(self.q_prefix_short)].value_counts())

In [None]:
# STACKED BAR function
# color codes
color_dict={'Producers': '#40a6bf', 'Explorers': '#f9f986',
            'Alteryx ': "#99ccff", 
            'Google Data Studio': "#0059b3",
            'Microsoft Power BI': "#ffcc00", 
            'Qlik': "#00802b",
            'Salesforce':"#0099e6", 
            'Tableau': "#b30000", 
            'Other': "#a3a375",
           'Bash': '#1f1f14',
           'C': '#4d4dff',
           'C++': '#ccddff',
           'Java': '#ffff33',
           'Javascript': "#ffcc00",
           'Julia': '#d65cad',
           'MATLAB': "#008080",
           'None': '#ebebe0',
           'Python': '#005580',
           'R': '#ffff66',
           'SQL': '#ff704d',
           'Swift': "#ff6600",
           'Microsoft Access ': '#ff8000',
           'Microsoft Azure Data Lake Storage ': '#ffbf00',
           'Microsoft SQL Server ': '#ffff66',
           'MongoDB ': '#70db70',
           'MySQL ': '#0080ff',
           'Oracle Database ': '#ff3333',
           'PostgresSQL ': '#4d6680',
           'SQLite ': '#80e5ff'}

def bar_stacked(txt_index, plotdata, title, xlab, ylab, bar_orient='bar', add_rectang=False, categ_prop=None):
    plt.style.use('seaborn-bright')
    N = len(txt_index)
    if(bar_orient=='bar'):
        fsize = (25,8.5)
        if not categ_prop is None:
            fig, (ax, ax0) = plt.subplots(nrows=2, ncols=1, gridspec_kw={'height_ratios': [4, 1]}) 
            plt.subplots_adjust(hspace=0.7)
        else:
            fig, ax = plt.subplots()
    elif(bar_orient=='barh'):
        fsize = (12,14)
        if not categ_prop is None:
            fig, (ax, ax0) = plt.subplots(nrows=1, ncols=2, gridspec_kw={'width_ratios': [4, 1]}) 
            plt.subplots_adjust(hspace=0.5)
        else:
            fig, ax = plt.subplots()
        
    sort_lst = list(plotdata.transpose().index.values)
    # "Other" is not used as a sort key
    if "Other" in sort_lst:
        sort_lst.remove("Other")
    
    if 'Explorers' in sort_lst and bar_orient=='barh':
        use_data = plotdata.sort_values(by=sort_lst, ascending=False)
        if not categ_prop is None:
            use_data0 = categ_prop.sort_index(axis=0, level=sort_lst)
    else:
        use_data = plotdata.sort_values(by=sort_lst, ascending=True)
        if not categ_prop is None:
            use_data0 = categ_prop.sort_index(axis=0, level=sort_lst)
    
    #ax=use_data.plot(kind=bar_orient,stacked=True,color=color_dict, figsize=fsize, width=0.95, alpha=0.70)
    ax=use_data.plot(ax=ax, kind=bar_orient,stacked=True,color=color_dict, figsize=fsize, width=0.95, alpha=0.70); #axes[0,0].set_title('A')
    #ax.plot(data=use_data, kind=bar_orient, stacked=True,color=color_dict, figsize=fsize, width=0.95, alpha=0.70)
    ax.set_title(title+'\n', fontsize=18, color='#5c5c3d', fontweight='bold')
    ax.set_xlabel(xlab, fontsize=14, color='#5c5c3d')
    ax.set_ylabel(ylab, fontsize=14, color='#5c5c3d')
    ax.spines['bottom'].set_color('#000000')
    ax.spines['top'].set_visible(False) 
    ax.spines['right'].set_visible(False) 
    ax.spines['left'].set_visible(False) 
    #ax.set_xticklabels(['0%','20%','40%','60%','80%','100%'])
    if(bar_orient=='bar'):
        ax.yaxis.set_major_formatter(mtick.PercentFormatter(1.0))
    elif(bar_orient=='barh'):
        ax.xaxis.set_major_formatter(mtick.PercentFormatter(1.0))
        
    for tick in ax.get_xticklabels():
                tick.set_rotation(0)
        
    #print(len(sort_lst))
    if len(sort_lst)>=8:
        pos_x_legend = -0.05
    else:
        pos_x_legend = 0.05
    ax.legend(ncol=len(list(plotdata.transpose().index.values)), 
              loc='upper center', fontsize=12, borderaxespad = -1 ) # bbox_to_anchor=(pos_x_legend, 1),
    
    if not categ_prop is None:          
        # draw the counts on the second subplot explicitly instead of relying on the current axes
        ax0=use_data0.plot(ax=ax0, kind=bar_orient,stacked=False,color="#d6d6c2", figsize=fsize, width=0.95, alpha=0.70)
        ax0.spines["top"].set_visible(False)
        ax0.spines["left"].set_visible(False)
        ax0.spines["right"].set_visible(False)
        ax0.set_ylabel('# observations', fontsize=14, color='#5c5c3d')
        for tick in ax0.get_xticklabels():
            tick.set_rotation(0)
        plt.setp(ax0.get_xticklabels(), visible=False)
        plt.setp(ax0.get_yticklabels(), visible=False)

    if not categ_prop is None:
        tar_ax = [ax,ax0]
    else:
        tar_ax = [ax]
    cnt = -1
    for ax in tar_ax:
        cnt+=1
        for rect in ax.patches:
            # Find where everything is located
            height = rect.get_height()
            width = rect.get_width()
            x = rect.get_x()
            y = rect.get_y()
            prev_y = y
            
            # The height of the bar is the data value [bar] | width [barh]
            if(bar_orient=='bar'):
                if cnt==0:
                    label_text = f'{height*100:.0f}%'  # f'{height:.2f}' 
                else:
                    label_text = f'{height:.0f}'
            elif(bar_orient=='barh'):
                if cnt==0:
                    label_text = f'{width*100:.0f}%'  # f'{height:.2f}' 
                else:
                    label_text = f'{width:.0f}'

            # ax.text(x, y, text)
            label_x = x + width / 2
            label_y = y + height / 2

            # plot only when height is greater than specified value
            if height > 0.1:
                ax.text(label_x, label_y, label_text, ha='center', va='center', fontsize=14, color='#0f0f0a', fontweight='bold', alpha=0.4)

    #plt.setp([a.get_xticklabels() for a in fig.axes[:-1]], visible=True)

<a id='general_profile'></a>
<p class='heading2'>Profile</p>

<p class='normal'>Concerning the main work activities, <span class='high_all'>experimentation, improving product & workflow and prototyping</span> are the three key aspects that differentiate Producers from Explorers.</p>
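
The comparison behind this kind of table can be sketched as follows: for each answer option of a multi-select question, take the share of each group that selected it, then subtract the explorers' share from the producers' share. The data, the column names `Q_Part_1` and `Q_Part_2`, and the variable names are made up for illustration; the notebook's own helper `tab_multi_cross_perc` does the equivalent work on the survey columns.

```python
import pandas as pd

# Toy data (made up): one column per answer option, missing when not selected.
toy = pd.DataFrame({
    'group': ['Producers', 'Producers', 'Producers', 'Explorers', 'Explorers'],
    'Q_Part_1': ['Analyze data', 'Analyze data', None, 'Analyze data', None],
    'Q_Part_2': ['Build prototypes', None, 'Build prototypes', None, None],
})

group_size = toy['group'].value_counts()

# Share of each group that selected each option: non-missing answers / group size.
share = toy.groupby('group')[['Q_Part_1', 'Q_Part_2']].count().div(group_size, axis=0).T

# Label the rows with the answer text stored in each column.
share.index = [toy[c].dropna().iloc[0] for c in share.index]

# Difference between the two shares (positive = more common among producers).
share['delta'] = share['Producers'] - share['Explorers']
print(share.sort_values('delta', ascending=False))
```

Because the shares are fractions of each group, the delta is read in percentage points once formatted as a percentage, not as a relative change.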

In [None]:
q23_22group = tab_multi_cross_perc(df_2020, df_2020_cols, q_prefix='Q23_Part', col_var='q22_groups')
q23_22group.loc[:,'delta'] = q23_22group.loc[:,'Producers'] - q23_22group.loc[:,'Explorers']
q23_22group.columns.name =''
q23_22group.sort_values('delta', ascending=False).style.bar(subset=['delta'], align='mid', color=['#EDA39A', '#99DAAE']) \
                .format("{:.1%}", subset=['Producers', 'Explorers','delta']).set_caption("Q23 main work activities") \
                .set_properties(**{'width': '100px', 'height': '25px', 'font-size': '10pt'}).set_table_styles(title_style())

<p class='normal'>We can also look at the main work activities in combination. The categories are ordered by the share of producers, highest first, and the absolute frequency of each group is shown in grey on the right side to indicate how popular each 'basket' of answers is. Producers generally perform more activities: baskets with a higher proportion of producers contain more activities. The most popular basket is a combination of analyzing data and building/running the data infrastructure.
</p>
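
A 'basket' here means the full combination of options one respondent selected. The idea can be sketched on made-up data as below; the column names and values are hypothetical, and the notebook's `create_basket_var` class additionally keeps only baskets with at least 30 cases and the top n of those.

```python
import pandas as pd

# Toy multi-select answers (made up): each column holds one option or is missing.
toy = pd.DataFrame({
    'Q_Part_1': ['Analyze data', 'Analyze data', None, 'Analyze data'],
    'Q_Part_2': ['Build prototypes', None, 'Build prototypes', 'Build prototypes'],
})

# Join the options each respondent selected into a single basket label.
basket = toy.apply(lambda row: ' + '.join(row.dropna()), axis=1)
n_selected = toy.notna().sum(axis=1)

# Keep respondents with more than one selection, then count identical baskets.
basket_counts = basket[n_selected > 1].value_counts()
print(basket_counts)
```

Counting whole baskets rather than single options shows which activities tend to be carried out together by the same person.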

In [None]:
# Q23 main work activities - producers/explorers
q23basket_obj = create_basket_var('Q23_Part', 'q23', 'q22_groups', df_2020.loc[df_2020.loc[:,'q22_groups'].isin(['Producers','Explorers']),:].copy(), df_2020_cols, limit=10)
q23_by_q22gr = q23basket_obj.cross_df()
q23_prop = q23basket_obj.counts_answer_basket()
bar_stacked(q23_by_q22gr.transpose().index, q23_by_q22gr.transpose(), 
            'Main work activities - producers vs explorers distribution','%','Work activities mix', 'barh', False, q23_prop)

<p class='normal'>There is a higher share of producers in the US compared to other countries.</p>

In [None]:
q3_22group = tab_multi_cross_perc(df_2020, df_2020_cols, q_prefix='Q3', col_var='q22_groups', multi=False)
q3_22group.loc[:,'delta'] = q3_22group.loc[:,'Producers'] - q3_22group.loc[:,'Explorers']
q3_22group.columns.name=''
pd.concat([q3_22group.sort_values('delta', ascending=False).head(), q3_22group.sort_values('delta', ascending=False).tail()]) \
    .style.bar(subset=['delta'], align='mid', color=['#EDA39A', '#99DAAE']).format("{:.1%}", subset=['Producers', 'Explorers','delta']) \
    .set_caption("Q3 Country") \
    .set_properties(**{'width': '100px', 'height': '25px', 'font-size': '10pt'}).set_table_styles(title_style())

<p class='normal'>Interestingly, neither machine learning engineers nor data engineers are strongly concentrated in the producers group. The <span class='high_all'>data scientist</span> job, however, shows the biggest gap between the two groups (<span class='high_all'>+15pp</span> for producers vs. explorers).</p>


In [None]:
q5_22group = tab_multi_cross_perc(df_2020, df_2020_cols, q_prefix='Q5', col_var='q22_groups', multi=False)
q5_22group.loc[:,'delta'] = q5_22group.loc[:,'Producers'] - q5_22group.loc[:,'Explorers']

q5_22group.sort_values('delta', ascending=False).style.bar(subset=['delta'], align='mid', color=['#EDA39A', '#99DAAE']).format("{:.1%}", subset=['Producers', 'Explorers','delta']) \
    .set_caption("Q5 Title").set_properties(**{'width': '100px', 'height': '25px', 'font-size': '10pt'}).set_table_styles(title_style())



<p class='normal'><span class='high_prod'>Producers</span> have <span class='high_all'>more coding experience</span> and especially <span class='high_all'>more ML experience</span> (the biggest gap between the two groups is in the 'less than 1 year' ML experience category, where the two groups differ by <span class='high_all'>15pp</span>). Producers also use cloud computing platforms more.</p>

In [None]:
q6_22group = tab_multi_cross_perc(df_2020, df_2020_cols, q_prefix='Q6', col_var='q22_groups', multi=False)
q6_22group.loc[:,'delta'] = q6_22group.loc[:,'Producers'] - q6_22group.loc[:,'Explorers']
q6_22group.columns.name=''
q6_22group.sort_values('delta', ascending=False).style.bar(subset=['delta'], align='mid', color=['#EDA39A', '#99DAAE']).format("{:.1%}", subset=['Producers', 'Explorers','delta']) \
.set_caption("Q6 Coding experience").set_properties(**{'width': '100px', 'height': '25px', 'font-size': '10pt'}).set_table_styles(title_style())

In [None]:

q15_22group = tab_multi_cross_perc(df_2020, df_2020_cols, q_prefix='Q15', col_var='q22_groups', multi=False)
q15_22group.loc[:,'delta'] = q15_22group.loc[:,'Producers'] - q15_22group.loc[:,'Explorers']
q15_22group.columns.name=''
q15_22group.sort_values('delta', ascending=False).style.bar(subset=['delta'], align='mid', color=['#EDA39A', '#99DAAE']).format("{:.1%}", subset=['Producers', 'Explorers','delta']) \
.set_caption("Q15 ML experience").set_properties(**{'width': '100px', 'height': '25px', 'font-size': '10pt'}).set_table_styles(title_style())


In [None]:
q11_22group = tab_multi_cross_perc(df_2020, df_2020_cols, q_prefix='Q11', col_var='q22_groups', multi=False)
q11_22group.loc[:,'delta'] = q11_22group.loc[:,'Producers'] - q11_22group.loc[:,'Explorers']
q11_22group.columns.name=''
q11_22group.sort_values('delta', ascending=False).style.bar(subset=['delta'], align='mid', color=['#EDA39A', '#99DAAE']).format("{:.1%}", subset=['Producers', 'Explorers','delta']) \
    .set_caption("Q11 Computing platform").set_properties(**{'width': '100px', 'height': '25px', 'font-size': '10pt'}).set_table_styles(title_style())

<p class='normal'><span class='high_prod'>Producers</span> are definitely <span class='high_all'>employed in bigger companies</span> and are part of <span class='high_all'>bigger data science teams</span>: producers are more often part of teams of 20+ members (+22pp vs. explorers), and almost 30% of producers work in companies with more than 10k employees (+15pp vs. explorers). Producers are also <span class='high_all'>earning higher salaries</span> than explorers.</p>

In [None]:
q20_22group = tab_multi_cross_perc(df_2020, df_2020_cols, q_prefix='Q20', col_var='q22_groups', multi=False)
q20_22group.loc[:,'delta'] = q20_22group.loc[:,'Producers'] - q20_22group.loc[:,'Explorers']
q20_22group.columns.name=''
q20_22group.sort_values('delta', ascending=False).style.bar(subset=['delta'], align='mid', color=['#EDA39A', '#99DAAE']) \
                .format("{:.1%}", subset=['Producers', 'Explorers','delta']).set_caption("Q20 Company size") \
                .set_properties(**{'width': '100px', 'height': '25px', 'font-size': '10pt'}).set_table_styles(title_style())


In [None]:
q21_22group = tab_multi_cross_perc(df_2020, df_2020_cols, q_prefix='Q21', col_var='q22_groups', multi=False)
q21_22group.loc[:,'delta'] = q21_22group.loc[:,'Producers'] - q21_22group.loc[:,'Explorers']
q21_22group.columns.name=''
q21_22group.sort_values('delta', ascending=False).style.bar(subset=['delta'], align='mid', color=['#EDA39A', '#99DAAE']) \
                .format("{:.1%}", subset=['Producers', 'Explorers','delta']).set_caption("Q21 Data science team size") \
                .set_properties(**{'width': '100px', 'height': '25px', 'font-size': '10pt'}).set_table_styles(title_style())

In [None]:
q24_22group = tab_multi_cross_perc(df_2020, df_2020_cols, q_prefix='Q24rec', col_var='q22_groups', multi=False)
q24_22group.loc[:,'delta'] = q24_22group.loc[:,'Producers'] - q24_22group.loc[:,'Explorers']
q24_22group.columns.name='salary ranges'
q24_22group.style.bar(subset=['delta'], align='mid', color=['#EDA39A', '#99DAAE']) \
                .format("{:.1%}", subset=['Producers', 'Explorers','delta']).set_caption("Q24 Salary") \
                .set_properties(**{'width': '100px', 'height': '25px', 'font-size': '10pt'}).set_table_styles(title_style())


**<p class=normal><span class='high_all'>Going in production requires more resources</span>, more infrastructure and running services, and therefore more spending on ML products: about <span class='high_prod'>15% of all producers</span> have spent <span class='high_prod'>over 100k USD</span>.</p>**

In [None]:
q25_22group = tab_multi_cross_perc(df_2020, df_2020_cols, q_prefix='Q25', col_var='q22_groups', multi=False)
q25_22group.loc[:,'delta'] = q25_22group.loc[:,'Producers'] - q25_22group.loc[:,'Explorers']
q25_22group.insert(0,'',q25_22group.index)

q25_22group.sort_values('delta', ascending=False).style.bar(subset=['delta'], align='mid', color=['#EDA39A', '#99DAAE']) \
                .format("{:.1%}", subset=['Producers', 'Explorers','delta']).set_caption("Q25 Spend on ML") \
                .set_table_styles(title_style()) \
                .hide_index() \
                .set_properties(**{'width': '160px', 'height': '16px', 'font-size': '10pt'})

<p class='normal'><span class='high_expl'>Explorers</span> share their work more on <span class='high_all'>Kaggle</span>, while <span class='high_prod'>producers</span> either <span class='high_all'>don't share their work publicly</span> or they share it on <span class='high_all'>GitHub</span>.</p>

In [None]:
q36_22group = tab_multi_cross_perc(df_2020, df_2020_cols, q_prefix='Q36', col_var='q22_groups', multi=True)
q36_22group.loc[:,'delta'] = q36_22group.loc[:,'Producers'] - q36_22group.loc[:,'Explorers']
q36_22group.columns.name=''
q36_22group.sort_values('delta', ascending=False).style.bar(subset=['delta'], align='mid', color=['#EDA39A', '#99DAAE']) \
                .format("{:.1%}", subset=['Producers', 'Explorers','delta']).set_caption("Q36 Public share") \
                .set_properties(**{'width': '100px', 'height': '25px', 'font-size': '10pt'}).set_table_styles(title_style())


In [None]:
# Q36 public share - producers/explorers
q36basket_obj = create_basket_var('Q36_Part', 'q36', 'q22_groups', df_2020.loc[df_2020.loc[:,'q22_groups'].isin(['Producers','Explorers']),:].copy(), df_2020_cols, limit=15)
q36_by_q22gr = q36basket_obj.cross_df()
q36_prop = q36basket_obj.counts_answer_basket()
bar_stacked(q36_by_q22gr.transpose().index, q36_by_q22gr.transpose(), 
            'Public share platforms - producers vs explorers distribution','','%', 'bar', False, q36_prop)

<p class='normal'>Coursera is used for data science courses more by producers than by explorers. The answer 'baskets' show something interesting: the <span class='high_all'>Coursera & Fast.ai</span> combination is the most popular choice among <span class='high_prod'>producers</span> (<span class='high_all'>78%</span> of respondents that chose this combination are producers).</p>
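
<p class='normal'>The 'basket' charts rely on the notebook's own <code>create_basket_var</code> helper, which is defined elsewhere. As a rough sketch of the idea, assuming a basket is simply the combination of all options a respondent ticked, it could be built like this. The toy data below is hypothetical.</p>

```python
import pandas as pd

# Hypothetical multi-select answers: one column per option, None when not selected.
toy = pd.DataFrame({
    "group": ["Producers", "Producers", "Producers", "Explorers", "Explorers"],
    "Q37_Part_1": ["Coursera", "Coursera", "Coursera", "Coursera", None],
    "Q37_Part_2": ["Fast.ai", "Fast.ai", None, "Fast.ai", "Udemy"],
})

part_cols = [c for c in toy.columns if c.startswith("Q37_Part")]

# A "basket" is the sorted combination of all options a respondent selected.
toy["basket"] = toy[part_cols].apply(
    lambda row: " & ".join(sorted(row.dropna())), axis=1
)

# Row-normalized shares: for each basket, the split between the two groups.
basket_split = pd.crosstab(toy["basket"], toy["group"], normalize="index")

print(basket_split)
```

<p class='normal'>Normalizing by row answers the question "of the respondents who chose this combination, how many are producers?", which is how the percentages quoted for the baskets should be read.</p>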

In [None]:
q37_22group = tab_multi_cross_perc(df_2020, df_2020_cols, q_prefix='Q37', col_var='q22_groups', multi=True)
q37_22group.loc[:,'delta'] = q37_22group.loc[:,'Producers'] - q37_22group.loc[:,'Explorers']
q37_22group.columns.name=''
q37_22group.sort_values('delta', ascending=False).style.bar(subset=['delta'], align='mid', color=['#EDA39A', '#99DAAE']) \
                .format("{:.1%}", subset=['Producers', 'Explorers','delta']).set_caption("Q37 Platforms used for data science courses") \
                .set_properties(**{'width': '100px', 'height': '25px', 'font-size': '10pt'}).set_table_styles(title_style())


In [None]:
# Q37 data science courses platforms used - producers/explorers
q37basket_obj = create_basket_var('Q37_Part', 'q37', 'q22_groups', df_2020.loc[df_2020.loc[:,'q22_groups'].isin(['Producers','Explorers']),:].copy(), df_2020_cols, limit=15)
q37_by_q22gr_expl = q37basket_obj.cross_df()
q37_prop = q37basket_obj.counts_answer_basket()
bar_stacked(q37_by_q22gr_expl.transpose().index, q37_by_q22gr_expl.transpose(), 
            'Platforms used for data science courses','%',
            ' ', 'barh', False, q37_prop)

<p class='normal'><span class='high_all'>Blogs</span> and <span class='high_all'>Journal publications</span> are the sources of information that <span class='high_prod'>producers</span> use to a greater degree than explorers.</p>

In [None]:
q39_22group = tab_multi_cross_perc(df_2020, df_2020_cols, q_prefix='Q39', col_var='q22_groups', multi=True)
q39_22group.loc[:,'delta'] = q39_22group.loc[:,'Producers'] - q39_22group.loc[:,'Explorers']
q39_22group.columns.name = ''
q39_22group.sort_values('delta', ascending=False).style.bar(subset=['delta'], align='mid', color=['#EDA39A', '#99DAAE']) \
                .format("{:.1%}", subset=['Producers', 'Explorers','delta']).set_caption("Q39 Media sources of information on data science topics") \
                .set_properties(**{'width': '100px', 'height': '25px', 'font-size': '10pt'}).set_table_styles(title_style())


In [None]:
# Q39 media sources of information - producers/explorers
q39basket_obj = create_basket_var('Q39_Part', 'q39', 'q22_groups', df_2020.loc[df_2020.loc[:,'q22_groups'].isin(['Producers','Explorers']),:].copy(), df_2020_cols, limit=15)
q39_by_q22gr_expl = q39basket_obj.cross_df()
q39_prop = q39basket_obj.counts_answer_basket()
bar_stacked(q39_by_q22gr_expl.transpose().index, q39_by_q22gr_expl.transpose(), 
            'Media sources of information','%',
            ' ', 'barh', False, q39_prop)

<a id='skills'></a>
<p class='heading2'>Skills</p>

<p class ='normal'><span class='high_all'>Bash, Python</span> and <span class='high_all'>SQL</span> are used to a greater degree by <span class='high_prod'>producers</span>. We also observe that <span class='high_expl'>explorers</span> use Javascript, C or Matlab more than producers do.</p>

In [None]:
q7_22group = tab_multi_cross_perc(df_2020, df_2020_cols, q_prefix='Q7_Part', col_var='q22_groups')
q7_22group.loc[:,'delta'] = q7_22group.loc[:,'Producers'] - q7_22group.loc[:,'Explorers']
q7_22group.columns.name=''
q7_22group.sort_values('delta', ascending=False).style.bar(subset=['delta'], align='mid', color=['#EDA39A', '#99DAAE']) \
                .format("{:.1%}", subset=['Producers', 'Explorers','delta']).set_caption("Q7 Programming languages") \
                .set_properties(**{'width': '100px', 'height': '25px', 'font-size': '10pt'}).set_table_styles(title_style())

<p class='normal'>While R and SQL are the most popular languages to use together, <span class='high_prod'>70% of producers</span> use one of three combinations: Python and Bash; Python, SQL and Bash; or Python, Bash, SQL and R. Either way, <span class='high_all'>Python and Bash</span> seem to be part of producers' routine use of programming languages.</p>

In [None]:
# Q7 prog lang - producers/explorers
q7basket_obj_ = create_basket_var('Q7_Part', 'q7', 'q22_groups', df_2020.loc[df_2020.loc[:,'q22_groups'].isin(['Producers','Explorers']),:].copy(), df_2020_cols, limit=15)
q7_by_q22gr = q7basket_obj_.cross_df()
q7_prop_ = q7basket_obj_.counts_answer_basket()
bar_stacked(q7_by_q22gr.transpose().index, q7_by_q22gr.transpose(), 
            'Programming languages usage basket',' ','%', 'bar',add_rectang=False, categ_prop=q7_prop_)

<p class='normal'>I've also checked, for Explorers and Producers combined, which programming language is recommended to an aspiring data scientist, based on the most frequent choices. Python is clearly the leading recommendation for a first language, and it's interesting that even one third of the users who use only R and SQL also recommend Python to aspiring data scientists. It is also interesting to see that the most frequent combination of programming languages used is Python, SQL and Javascript (881 cases).</p>

In [None]:
# Q7 Programming languages do you use on a regular basis vs Q8 Programming language recommended to aspiring data scientist
q7basket_obj = create_basket_var('Q7_Part', 'q7', 'Q8', df_2020.loc[df_2020.loc[:,'q22_groups'].isin(['Producers','Explorers']),:].copy(), df_2020_cols, limit=15)
q7basket_df = q7basket_obj.cross_df()
q7_prop = q7basket_obj.counts_answer_basket()
bar_stacked(q7basket_df.transpose().index, q7basket_df.transpose(), 'Language recommended to aspiring DS by people using multiple programming languages on a regular basis',
            '','Programming language recommended to aspiring data scientist', 'bar', add_rectang=False, categ_prop=q7_prop)

<p class ='normal'><span class='high_all'>Seaborn, Plotly</span> and <span class='high_all'>Matplotlib</span> visualization libraries are in higher use by <span class='high_prod'>producers</span> than by explorers.</p>

In [None]:
q14_22group = tab_multi_cross_perc(df_2020, df_2020_cols, q_prefix='Q14_Part', col_var='q22_groups')
q14_22group.loc[:,'delta'] = q14_22group.loc[:,'Producers'] - q14_22group.loc[:,'Explorers']
q14_22group.columns.name=''
q14_22group.sort_values('delta', ascending=False).style.bar(subset=['delta'], align='mid', color=['#EDA39A', '#99DAAE']) \
                .format("{:.1%}", subset=['Producers', 'Explorers','delta']).set_caption("Q14 Data visualization libraries/tools used") \
                .set_properties(**{'width': '100px', 'height': '25px', 'font-size': '10pt'}).set_table_styles(title_style())

<p class='normal'><span class='high_all'>Xgboost</span> is very popular among <span class='high_prod'>producers</span>, who use this framework <span class='high_all'>+14pp</span> more than explorers do (it must be working great in production).</p>

In [None]:
q16_22group = tab_multi_cross_perc(df_2020, df_2020_cols, q_prefix='Q16_Part', col_var='q22_groups')
q16_22group.loc[:,'delta'] = q16_22group.loc[:,'Producers'] - q16_22group.loc[:,'Explorers']
q16_22group.columns.name=''
q16_22group.sort_values('delta', ascending=False).style.bar(subset=['delta'], align='mid', color=['#EDA39A', '#99DAAE']) \
                .format("{:.1%}", subset=['Producers', 'Explorers','delta']).set_caption("Q16 ML Frameworks used on a regular basis") \
                .set_properties(**{'width': '100px', 'height': '25px', 'font-size': '10pt'}).set_table_styles(title_style())

In [None]:
# Q16 ML frameworks used on a regular basis - producers/explorers
q16basket_obj = create_basket_var('Q16_Part', 'q16', 'q22_groups', df_2020.loc[df_2020.loc[:,'q22_groups'].isin(['Producers','Explorers']),:].copy(), df_2020_cols, limit=20)
q16_by_q22gr_expl = q16basket_obj.cross_df()
q16_prop = q16basket_obj.counts_answer_basket()
bar_stacked(q16_by_q22gr_expl.transpose().index, q16_by_q22gr_expl.transpose(), 
            'ML frameworks used - top 20 combinations',' ',
            '%', 'bar', False, q16_prop)

<p class=normal>Producers make more use of algorithms and frameworks than explorers, but <span class='high_all'>gradient boosting algorithms</span> are by far the most used solution in production environments.</p>

In [None]:
q17_22group = tab_multi_cross_perc(df_2020, df_2020_cols, q_prefix='Q17_Part', col_var='q22_groups')
q17_22group.loc[:,'delta'] = q17_22group.loc[:,'Producers'] - q17_22group.loc[:,'Explorers']
q17_22group.columns.name=''
q17_22group.sort_values('delta', ascending=False).style.bar(subset=['delta'], align='mid', color=['#EDA39A', '#99DAAE']) \
                .format("{:.1%}", subset=['Producers', 'Explorers','delta']).set_caption("Q17 ML Algorithms used") \
                .set_properties(**{'width': '100px', 'height': '25px', 'font-size': '10pt'}).set_table_styles(title_style())

In [None]:
# Q17 ML algorithms used on a regular basis - producers/explorers
q17basket_obj = create_basket_var('Q17_Part', 'q17', 'q22_groups', df_2020.loc[df_2020.loc[:,'q22_groups'].isin(['Producers','Explorers']),:].copy(), df_2020_cols, limit=10)
q17_by_q22gr_expl = q17basket_obj.cross_df()
q17_prop = q17basket_obj.counts_answer_basket()
bar_stacked(q17_by_q22gr_expl.transpose().index, q17_by_q22gr_expl.transpose(), 
            'ML algorithms used - top 10 combinations','%',
            ' ', 'barh', False, q17_prop)

<p class='normal'>In general, computer vision methods and NLP methods are used more by producers than by explorers.</p>

In [None]:
q18_22group = tab_multi_cross_perc(df_2020, df_2020_cols, q_prefix='Q18_Part', col_var='q22_groups')
q18_22group.loc[:,'delta'] = q18_22group.loc[:,'Producers'] - q18_22group.loc[:,'Explorers']
q18_22group.columns.name=''
q18_22group.sort_values('delta', ascending=False).style.bar(subset=['delta'], align='mid', color=['#EDA39A', '#99DAAE']) \
                .format("{:.1%}", subset=['Producers', 'Explorers','delta']).set_caption("Q18 Computer vision methods used") \
                .set_properties(**{'width': '100px', 'height': '25px', 'font-size': '10pt'}).set_table_styles(title_style())

In [None]:
q19_22group = tab_multi_cross_perc(df_2020, df_2020_cols, q_prefix='Q19_Part', col_var='q22_groups')
q19_22group.loc[:,'delta'] = q19_22group.loc[:,'Producers'] - q19_22group.loc[:,'Explorers']
q19_22group.columns.name=''
q19_22group.sort_values('delta', ascending=False).style.bar(subset=['delta'], align='mid', color=['#EDA39A', '#99DAAE']) \
                .format("{:.1%}", subset=['Producers', 'Explorers','delta']).set_caption("Q19 NLP methods used") \
                .set_properties(**{'width': '100px', 'height': '25px', 'font-size': '10pt'}).set_table_styles(title_style())

In [None]:
q38_22group = tab_multi_cross_perc(df_2020, df_2020_cols, q_prefix='Q38', col_var='q22_groups', multi=False)
q38_22group.loc[:,'delta'] = q38_22group.loc[:,'Producers'] - q38_22group.loc[:,'Explorers']
q38_22group.columns.name=''
q38_22group.sort_values('delta', ascending=False).style.bar(subset=['delta'], align='mid', color=['#EDA39A', '#99DAAE']) \
                .format("{:.1%}", subset=['Producers', 'Explorers','delta']).set_caption("Q38 Primary analysis tool") \
                .set_properties(**{'width': '100px', 'height': '25px', 'font-size': '10pt'}).set_table_styles(title_style())

<a id='infrastructure'></a>
<p class='heading2'>Infrastructure used</p>

<p class='normal'>The combination of Jupyter, VSCode and Vim/Emacs is preferred mostly by producers, while Jupyter with PyCharm is the most popular choice in general.</p>

In [None]:
q9_22group = tab_multi_cross_perc(df_2020, df_2020_cols, q_prefix='Q9_Part', col_var='q22_groups')
q9_22group.loc[:,'delta'] = q9_22group.loc[:,'Producers'] - q9_22group.loc[:,'Explorers']
q9_22group.columns.name=''
q9_22group.sort_values('delta', ascending=False).style.bar(subset=['delta'], align='mid', color=['#EDA39A', '#99DAAE']) \
                .format("{:.1%}", subset=['Producers', 'Explorers','delta']).set_caption("Q9 IDE's used") \
                .set_properties(**{'width': '100px', 'height': '25px', 'font-size': '10pt'}).set_table_styles(title_style())

In [None]:
# Q9 IDE's used on a regular basis vs producers/explorers
q9basket_obj = create_basket_var('Q9_Part', 'q9', 'q22_groups', df_2020.loc[df_2020.loc[:,'q22_groups'].isin(['Producers','Explorers']),:].copy(), df_2020_cols, limit=15)
q9_by_q22gr_expl = q9basket_obj.cross_df()
q9_prop = q9basket_obj.counts_answer_basket()
bar_stacked(q9_by_q22gr_expl.transpose().index, q9_by_q22gr_expl.transpose(), 
            'Multiple IDE users','%','', 'barh', False, q9_prop)

In [None]:
q10_22group = tab_multi_cross_perc(df_2020, df_2020_cols, q_prefix='Q10_Part', col_var='q22_groups')
q10_22group.loc[:,'delta'] = q10_22group.loc[:,'Producers'] - q10_22group.loc[:,'Explorers']
q10_22group.columns.name=''
q10_22group.sort_values('delta', ascending=False).style.bar(subset=['delta'], align='mid', color=['#EDA39A', '#99DAAE']) \
                .format("{:.1%}", subset=['Producers', 'Explorers','delta']).set_caption("Q10 hosted notebooks used") \
                .set_properties(**{'width': '100px', 'height': '25px', 'font-size': '10pt'}).set_table_styles(title_style())

In [None]:
q12_22group = delta_cross_table(df=df_2020, q_prx = 'Q12', head_var='q22_groups', q_tit="Q12 specialized hardware used", multi_val=True)
apply_style_tab(q12_22group, q_tit="Q12 specialized hardware used")

<p class='normal'>Amazon Web Services is preferred by people with models in production.</p>

In [None]:
q26a_22group = delta_cross_table(df_2020,q_prx = 'Q26_A_', head_var='q22_groups', q_tit="Q26A cloud computing platform used", multi_val=True)
apply_style_tab(q26a_22group, q_tit="Q26A cloud computing platform used")

In [None]:
# Q26 cloud computing platforms used on a regular basis - producers/explorers
q26basket_obj = create_basket_var('Q26_A_Part', 'q26', 'q22_groups', df_2020.loc[df_2020.loc[:,'q22_groups'].isin(['Producers','Explorers']),:].copy(), df_2020_cols, limit=10)
q26_by_q22gr_expl = q26basket_obj.cross_df()
q26_prop = q26basket_obj.counts_answer_basket()
bar_stacked(q26_by_q22gr_expl.transpose().index, q26_by_q22gr_expl.transpose(), 
            'Cloud computing platforms used - producers vs explorers distribution',' ',
            '%', 'bar', False, q26_prop)

In [None]:
q27a_22group = delta_cross_table(df_2020,q_prx = 'Q27_A_Part', head_var='q22_groups', q_tit="Q27A Cloud computing products used", multi_val=True)
apply_style_tab(q27a_22group, q_tit="Q27A cloud computing products used")

In [None]:
# Q27 cloud computing products used on a regular basis - producers/explorers
q27basket_obj = create_basket_var('Q27_A_Part', 'q27', 'q22_groups', df_2020.loc[df_2020.loc[:,'q22_groups'].isin(['Producers','Explorers']),:].copy(), df_2020_cols, limit=15)
q27_by_q22gr_expl = q27basket_obj.cross_df()
q27_prop = q27basket_obj.counts_answer_basket()
bar_stacked(q27_by_q22gr_expl.transpose().index, q27_by_q22gr_expl.transpose(), 
            'Cloud computing products used - producers vs explorers distribution','%',' ', 
            'barh', False,q27_prop)

In [None]:
q28a_22group = delta_cross_table(df_2020, q_prx = 'Q28_A_Part', head_var='q22_groups', q_tit="Q28A ML products used", multi_val=True)
apply_style_tab(q28a_22group, q_tit="Q28A ML products used")

In [None]:
q29a_22group = delta_cross_table(df_2020,q_prx = 'Q29_A_Part', head_var='q22_groups', q_tit="Q29A big data products used", multi_val=True)
apply_style_tab(q29a_22group, q_tit="Q29A Big data products used")

<p class='normal'>Among ML users of big data infrastructure, it was interesting to see how different big data products compete with each other. For example, MySQL is quite frequent as the main big data product used, but its share tends to drop when it is used together with PostgreSQL.</p>

In [None]:
# Q29 big data products used on a regular basis vs Q30 big data product used most often
q29basket_obj = create_basket_var('Q29_A_Part', 'q29', 'Q30', df_2020.loc[df_2020.loc[:,'q22_groups'].isin(['Producers','Explorers']),:].copy(), df_2020_cols, limit=20)
q29_by_q30 = q29basket_obj.cross_df()
bar_stacked(q29_by_q30.transpose().index, q29_by_q30.transpose(), 'Big data products used on a regular basis vs. most often used',
            'Big data products used on a regular basis','Big data product used most often', 'bar')

In [None]:
# Q29A big data tools - producers/explorers
q29basket_obj = create_basket_var('Q29_A_Part', 'q29', 'q22_groups', df_2020.loc[df_2020.loc[:,'q22_groups'].isin(['Producers','Explorers']),:].copy(), df_2020_cols, limit=15)
q29_by_q22gr = q29basket_obj.cross_df()

q29_prop = q29basket_obj.counts_answer_basket()
bar_stacked(q29_by_q22gr.transpose().index, q29_by_q22gr.transpose(), 
            'Big data tools usage mix - producers vs explorers distribution',' ','%', 'bar', add_rectang=False, categ_prop=q29_prop)

<p class='normal'>BI tools are generally used more by explorers. One exception may be Tableau, which seems to be used a bit more by producers.</p>

In [None]:
q31a_22group = delta_cross_table(df_2020,q_prx = 'Q31_A_Part', head_var='q22_groups', q_tit="Q31A BI tools used", multi_val=True)
apply_style_tab(q31a_22group, q_tit="Q31A BI tools used")

In [None]:
# Q31 BI tools used on a regular basis vs Q32 BI tool used most often
q31basket_obj = create_basket_var('Q31_A_Part', 'q31', 'Q32', df_2020.loc[df_2020.loc[:,'q22_groups'].isin(['Producers','Explorers']),:].copy(), df_2020_cols, limit=20)
q31_by_q32 = q31basket_obj.cross_df()
bar_stacked(q31_by_q32.transpose().index, q31_by_q32.transpose(), 'Preferred tool when using multiple BI tools','BI tools usage mix','BI tool used most often',
            'bar', add_rectang=False, categ_prop=None)

In [None]:
# Q31 BI tools - producers/explorers
q31basket_obj_ = create_basket_var('Q31_A_Part', 'q31', 'q22_groups', df_2020.loc[df_2020.loc[:,'q22_groups'].isin(['Producers','Explorers']),:].copy(), df_2020_cols, limit=15)
q31_by_q22gr = q31basket_obj_.cross_df()

q31_prop = q31basket_obj_.counts_answer_basket() #/ q31basket_obj_.counts_answer_basket().sum()
bar_stacked(q31_by_q22gr.transpose().index, q31_by_q22gr.transpose(), 
            'BI tools usage mix - producers vs explorers distribution',' ','%', 'bar', add_rectang=False, categ_prop=q31_prop)

In [None]:
# Q33A AutoML tools - producers/explorers
q33basket_obj = create_basket_var('Q33_A_Part', 'q33', 'q22_groups', df_2020.loc[df_2020.loc[:,'q22_groups'].isin(['Producers','Explorers']),:].copy(), df_2020_cols, limit=15)
q33_by_q22gr = q33basket_obj.cross_df()
q33_prop = q33basket_obj.counts_answer_basket()
bar_stacked(q33_by_q22gr.transpose().index, q33_by_q22gr.transpose(), 
            'AutoML tools - producers vs explorers distribution','%','AutoML tool usage mix', 'barh', False, q33_prop)

In [None]:
import matplotlib.pyplot as plt
import numpy as np
from numpy.linalg import svd


class CA(object):
    """Simple correspondence analysis.
    
    Inputs
    ------
    ct : array_like
      Two-way contingency table. If `ct` is a pandas DataFrame object,
      the index and column values are used for plotting.
    Notes
    -----
    The implementation follows that presented in 'Correspondence
    Analysis in R, with Two- and Three-dimensional Graphics: The ca
    Package,' Journal of Statistical Software, May 2007, Volume 20,
    Issue 3.
    """

    def __init__(self, ct):
        self.rows = ct.index.values if hasattr(ct, 'index') else None
        self.cols = ct.columns.values if hasattr(ct, 'columns') else None
        
        # contingency table
        N = np.matrix(ct, dtype=float)

        # correspondence matrix from contingency table
        P = N / N.sum()

        # row and column marginal totals of P as vectors
        r = P.sum(axis=1)
        c = P.sum(axis=0).T

        # diagonal matrices of row/column sums
        D_r_rsq = np.diag(1. / np.sqrt(r.A1))
        D_c_rsq = np.diag(1. / np.sqrt(c.A1))

        # the matrix of standardized residuals
        S = D_r_rsq * (P - r * c.T) * D_c_rsq

        # compute the SVD
        U, D_a, V = svd(S, full_matrices=False)
        D_a = np.asmatrix(np.diag(D_a))
        V = V.T

        # principal coordinates of rows
        F = D_r_rsq * U * D_a

        # principal coordinates of columns
        G = D_c_rsq * V * D_a

        # standard coordinates of rows
        X = D_r_rsq * U

        # standard coordinates of columns
        Y = D_c_rsq * V

        # the total variance of the data matrix
        inertia = sum([(P[i,j] - r[i,0] * c[j,0])**2 / (r[i,0] * c[j,0])
                       for i in range(N.shape[0])
                       for j in range(N.shape[1])])

        self.F = F.A
        self.G = G.A
        self.X = X.A
        self.Y = Y.A
        self.inertia = inertia
        self.eigenvals = np.diag(D_a)**2

    def plot_ca(self):
        """Plot the first and second dimensions."""
        xmin, xmax = None, None
        ymin, ymax = None, None
        fig, ax = plt.subplots(1,1,figsize=(8,10))
        if self.rows is not None:
            for i, t in enumerate(self.rows):
                x, y = self.F[i,0], self.F[i,1]
                plt.text(x, y, t, va='center', ha='center', color='#7d7d4f', size=9)
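                # compare against None explicitly: a coordinate of 0.0 is a valid bound
                xmin = min(x, xmin if xmin is not None else x)
                xmax = max(x, xmax if xmax is not None else x)
                ymin = min(y, ymin if ymin is not None else y)
                ymax = max(y, ymax if ymax is not None else y)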
        else:
            plt.plot(self.F[:, 0], self.F[:, 1], 'ro')  #plt

        if self.cols is not None:
            for i, t in enumerate(self.cols):
                x, y = self.G[i,0], self.G[i,1]
                plt.text(x, y, t, va='center', ha='center', color='#a81608', fontweight='bold', size=14)
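                # compare against None explicitly: a coordinate of 0.0 is a valid bound
                xmin = min(x, xmin if xmin is not None else x)
                xmax = max(x, xmax if xmax is not None else x)
                ymin = min(y, ymin if ymin is not None else y)
                ymax = max(y, ymax if ymax is not None else y)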
        else:
            plt.plot(self.G[:, 0], self.G[:, 1], 'bs')
            

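        # set padded axis limits whenever bounds were collected (0.0 is a valid bound)
        if xmin is not None and xmax is not None:
            pad = (xmax - xmin) * 0.05
            plt.xlim(xmin - pad, xmax + pad)
        if ymin is not None and ymax is not None:
            pad = (ymax - ymin) * 0.05
            plt.ylim(ymin - pad, ymax + pad)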

        plt.grid(color='#f5f5ef', linestyle="-")
        #plt.xlabel('Dim 1' + str(self.inertia))
        #plt.ylabel('Dim 2' + str(self.eigenvals))
        ax.axes.xaxis.set_visible(False)
        ax.axes.yaxis.set_visible(False)
        
        ax.spines['right'].set_visible(False)
        ax.spines['top'].set_color("#e2e2d0")
        ax.spines['left'].set_visible(False)
        #ax.spines['bottom'].set_visible(False)
        ax.spines['bottom'].set_color('#e2e2d0')
        

<p class='heading2'>Conclusion</p>
<p class='normal'>To conclude, the two maps below summarize the main differences between explorers and producers. The visuals are produced by a correspondence analysis that takes as input the largest dissimilarities between the two populations on the analyzed indicators.</p>
<p class='normal'>The separation between the two groups runs mainly along the X axis. In the general profile map, the items revolving around <span class='high_prod'>producers</span> are related to higher salary ranges, bigger team size, bigger company size and more experience with code and ML, and they are closely related to a data science job title.</p>
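
<p class='normal'>For readers unfamiliar with correspondence analysis, the steps behind such a map can be sketched in a few lines of plain NumPy. This is a simplified illustration on a made-up 3x2 contingency table, not the notebook's <code>CA</code> class, and the variable names are my own.</p>

```python
import numpy as np

# Hypothetical 3x2 contingency table (rows: indicators, columns: two groups).
N = np.array([[30.0, 10.0],
              [20.0, 20.0],
              [10.0, 30.0]])

P = N / N.sum()          # correspondence matrix
r = P.sum(axis=1)        # row masses
c = P.sum(axis=0)        # column masses

# Standardized residuals: (P - r c^T) scaled element-wise by 1 / sqrt(r_i * c_j).
expected = np.outer(r, c)
S = (P - expected) / np.sqrt(expected)

# Singular value decomposition of the residual matrix.
U, sing, Vt = np.linalg.svd(S, full_matrices=False)

# Principal coordinates of rows (F) and columns (G).
F = (U / np.sqrt(r)[:, None]) * sing
G = (Vt.T / np.sqrt(c)[:, None]) * sing

# Total inertia equals the sum of squared singular values.
total_inertia = (S ** 2).sum()

print(F[:, 0], G[:, 0], total_inertia)
```

<p class='normal'>Note that a table with only two columns has a single non-trivial dimension, so the second singular value is numerically zero and all of the structure lies on the first axis.</p>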

In [None]:
# general profile
q5_22group_use = pd.concat([q5_22group.sort_values('delta', ascending=False).head(1),q5_22group.sort_values('delta', ascending=False).tail(1)])
q5_22group_use.index = '' + q5_22group_use.index

q6_22group_use = pd.concat([q6_22group.sort_values('delta', ascending=False).head(1),q6_22group.sort_values('delta', ascending=False).tail(1)])
q6_22group_use.index = 'Code exp: ' + q6_22group_use.index

q15_22group_use = pd.concat([q15_22group.sort_values('delta', ascending=False).head(1),q15_22group.sort_values('delta', ascending=False).tail(1)]) 
q15_22group_use.index = 'ML exp: ' + q15_22group_use.index

q20_22group_use = pd.concat([q20_22group.sort_values('delta', ascending=False).head(1),q20_22group.sort_values('delta', ascending=False).tail(1)]) 
q20_22group_use.index = 'Comp.size: ' + q20_22group_use.index

q21_22group_use = pd.concat([q21_22group.sort_values('delta', ascending=False).head(1),q21_22group.sort_values('delta', ascending=False).tail(1)]) 
q21_22group_use.index = 'Team size: ' + q21_22group_use.index

q24_22group_use = pd.concat([q24_22group.sort_values('delta', ascending=False).head(1),q24_22group.sort_values('delta', ascending=False).tail(1)]) 
q24_22group_use.index = 'Salary: ' + q24_22group_use.index



q11_22group_use = q11_22group.loc[~q11_22group.index.isin(['Other','None']),:]
q11_22group_use.index = '' + q11_22group_use.index

q23_22group_use = pd.concat([q23_22group.sort_values('delta', ascending=False).head(1),q23_22group.sort_values('delta', ascending=False).tail(1)]) 
q23_22group_use.index = ' ' + q23_22group_use.index

q36_22group_use = pd.concat([q36_22group.sort_values('delta', ascending=False).head(1),q36_22group.sort_values('delta', ascending=False).tail(1)]) 
q36_22group_use.index = 'Public share: ' + q36_22group_use.index

q37_22group_use = pd.concat([q37_22group.sort_values('delta', ascending=False).head(1),q37_22group.sort_values('delta', ascending=False).tail(1)]) 
q37_22group_use.index = 'Courses: ' + q37_22group_use.index

q39_22group_use = pd.concat([q39_22group.sort_values('delta', ascending=False).head(1),q39_22group.sort_values('delta', ascending=False).tail(1)]) 
q39_22group_use.index = 'Info sources: ' + q39_22group_use.index


p1_df = pd.concat([q5_22group_use,q6_22group_use,q15_22group_use,q20_22group_use,q21_22group_use,q24_22group_use,
                   q11_22group_use,q23_22group_use, q36_22group_use,q37_22group_use]) #q39_22group_use

p1_df['indicator'] = p1_df.index

p1_df_afc = pd.DataFrame(p1_df[['Explorers','Producers']].reset_index().copy())

df = p1_df_afc.iloc[:,:]
df = df.set_index('index')

ca = CA(df)
ca.plot_ca()
plt.title('General profile map')
plt.show()

<p class='normal'>The second map summarizes the producers' interest in cloud computing platforms, gradient boosting machines in particular, and the Bash programming language.</p>

In [None]:
q7_22group_use = pd.concat([q7_22group.sort_values('delta', ascending=False).head(1),q7_22group.sort_values('delta', ascending=False).tail(1)]) 
q7_22group_use.index = 'Prog.lang: ' + q7_22group_use.index

q14_22group_use = pd.concat([q14_22group.sort_values('delta', ascending=False).head(1),q14_22group.sort_values('delta', ascending=False).tail(1)]) 
q14_22group_use.index = 'Viz.libraries: ' + q14_22group_use.index

q17_22group_use = pd.concat([q17_22group.sort_values('delta', ascending=False).head(1),q17_22group.sort_values('delta', ascending=False).tail(1)]) 
q17_22group_use.index = 'ML algorithms: ' + q17_22group_use.index + '\n'

q18_22group_use = pd.concat([q18_22group.sort_values('delta', ascending=False).head(1),q18_22group.sort_values('delta', ascending=False).tail(1)]) 
q18_22group_use.index = 'Computer vision: ' + q18_22group_use.index

q19_22group_use = pd.concat([q19_22group.sort_values('delta', ascending=False).head(1),q19_22group.sort_values('delta', ascending=False).tail(1)]) 
q19_22group_use.index = 'NLP: ' + q19_22group_use.index

q38_22group_use = pd.concat([q38_22group.sort_values('delta', ascending=False).head(1),q38_22group.sort_values('delta', ascending=False).tail(1)]) 
q38_22group_use.index = 'Main tool: ' + q38_22group_use.index


p2_df = pd.concat([q7_22group_use,q14_22group_use,q17_22group_use, q18_22group_use, q19_22group_use, q38_22group_use]) 
p2_df['indicator'] = p2_df.index


p2_df_afc = pd.DataFrame(p2_df[['Explorers','Producers']].reset_index().copy())

#df = p2_df_afc.iloc[:,:]
#df = df.set_index('index')
#ca = CA(df)
#ca.plot_ca()
#plt.show()

q9_22group_use = pd.concat([q9_22group.sort_values('delta', ascending=False).head(1),q9_22group.sort_values('delta', ascending=False).tail(1)]) 
q9_22group_use.index = 'IDE: ' + q9_22group_use.index

q10_22group_use = pd.concat([q10_22group.sort_values('delta', ascending=False).head(1),q10_22group.sort_values('delta', ascending=False).tail(1)]) 
q10_22group_use.index = 'Hosted notebooks: ' + q10_22group_use.index

q12_22group_use = pd.concat([q12_22group.sort_values('delta', ascending=False).head(1),q12_22group.sort_values('delta', ascending=False).tail(1)]) 
q12_22group_use.index = 'Specialized hardware: ' + q12_22group_use.index  # was 'Hosted notebooks: ', a duplicate of the Q10 label

q26a_22group_use = pd.concat([q26a_22group.sort_values('delta', ascending=False).head(1),q26a_22group.sort_values('delta', ascending=False).tail(1)]) 
q26a_22group_use.index = 'Cloud platforms: ' + q26a_22group_use.index

q27a_22group_use = pd.concat([q27a_22group.sort_values('delta', ascending=False).head(1),q27a_22group.sort_values('delta', ascending=False).tail(1)]) 
q27a_22group_use.index = 'Cloud products: ' + q27a_22group_use.index

q28a_22group_use = pd.concat([q28a_22group.sort_values('delta', ascending=False).head(1),q28a_22group.sort_values('delta', ascending=False).tail(1)]) 
q28a_22group_use.index = 'ML products: ' + q28a_22group_use.index

q29a_22group_use = pd.concat([q29a_22group.sort_values('delta', ascending=False).head(1),q29a_22group.sort_values('delta', ascending=False).tail(1)]) 
q29a_22group_use.index = 'Big data: ' + q29a_22group_use.index

q31a_22group_use = pd.concat([q31a_22group.sort_values('delta', ascending=False).head(1),q31a_22group.sort_values('delta', ascending=False).tail(1)]) 
q31a_22group_use.index = 'BI: ' + q31a_22group_use.index

#q33a_22group_use = pd.concat([q33a_22group.sort_values('delta', ascending=False).head(2),q33a_22group.sort_values('delta', ascending=False).tail(2)]) 
#q33a_22group_use.index = 'Auto ML: ' + q33a_22group_use.index

#q34a_22group_use = pd.concat([q34a_22group.sort_values('delta', ascending=False).head(2),q34a_22group.sort_values('delta', ascending=False).tail(2)]) 
#q34a_22group_use.index = 'Auto ML: ' + q34a_22group_use.index

#q35a_22group_use = pd.concat([q35a_22group.sort_values('delta', ascending=False).head(1),q35a_22group.sort_values('delta', ascending=False).tail(1)]) 
#q35a_22group_use.index = 'Experiments: ' + q35a_22group_use.index


p3_df = pd.concat([q9_22group_use, q10_22group_use, q12_22group_use, q26a_22group_use, q27a_22group_use, q28a_22group_use, q29a_22group_use,
                   q31a_22group_use])
p3_df['indicator'] = p3_df.index
p3_df_afc = pd.DataFrame(p3_df[['Explorers','Producers']].reset_index().copy())

df = pd.concat([p2_df_afc,p3_df_afc])

#df = p3_df_afc.iloc[:,:]
df = df.set_index('index')

ca = CA(df)
ca.plot_ca()
plt.title('Skills and infrastructure map')
plt.show()
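
<p class='normal'>The maps above come from a correspondence analysis of the answers-by-group table. As a rough illustration of the underlying computation, the sketch below derives principal coordinates from the SVD of the standardized residuals using NumPy. It is a textbook formulation under stated assumptions, not necessarily what the <code>CA</code> class used in this notebook does internally. The function name and the toy counts are hypothetical.</p>

```python
import numpy as np


def correspondence_coordinates(table):
    """Textbook correspondence analysis via SVD (illustrative sketch).

    Returns row principal coordinates, column principal coordinates,
    and the principal inertia of each dimension.
    """
    N = np.asarray(table, dtype=float)
    P = N / N.sum()                      # correspondence matrix
    r = P.sum(axis=1)                    # row masses
    c = P.sum(axis=0)                    # column masses
    expected = np.outer(r, c)            # values expected under independence
    S = (P - expected) / np.sqrt(expected)   # standardized residuals
    U, sv, Vt = np.linalg.svd(S, full_matrices=False)
    row_coords = (U * sv) / np.sqrt(r)[:, None]
    col_coords = (Vt.T * sv) / np.sqrt(c)[:, None]
    return row_coords, col_coords, sv ** 2


# Made-up counts: rows are answers, columns are (Explorers, Producers).
toy_counts = np.array([[30, 10],
                       [20, 25],
                       [10, 40]])

row_coords, col_coords, inertia = correspondence_coordinates(toy_counts)
print(row_coords[:, 0])   # position of each answer on the first axis
print(col_coords[:, 0])   # position of each group on the first axis
print(inertia)
```

<p class='normal'>In this formulation a table with only two columns carries all of its inertia on the first axis, so the horizontal position of an answer relative to the two groups is what separates it.</p>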

<p class='normal'>I hope this analysis has brought more clarity to the way ML methods are being used. Happy production or exploration to all ML users!</p>

<p class='normal'>References</p>

[Correspondence analysis implementation in Python](https://okomestudio.net/biboroku/2014/05/brand-positioning-by-correspondence-analysis/)