# Answering Questions on How to Start Your Data Career Based on Current Trends

Stuck in your career? Want a challenge? Or just want to follow the hype around current trends? Try the data industry!

"But how do I get there? I mean, I don't even know where to start or what to do..."

If that thought has crossed your mind, the following research might help you find your answers.

## 0. Preparation: Data, Module, Utility Function Loading, and Data Overview

In [None]:
import numpy as np 
import itertools
import pandas as pd 

from matplotlib.offsetbox import OffsetImage,AnnotationBbox
import seaborn as sns
sns.set()
sns.set_style(style='white')
import matplotlib.pyplot as plt
import plotly
import plotly.graph_objects as go

import scipy.stats as ss

In [None]:
df = pd.read_csv("/kaggle/input/kaggle-survey-2020/kaggle_survey_2020_responses.csv")

In [None]:
df.head()

In [None]:
# np.str was only an alias of the builtin str and was removed in NumPy 1.24
question = np.array([x+"_" for x in df.columns], str)
sub_question = np.array(list(df.iloc[0].values), str)
joined_question = np.char.add(question,sub_question)
df.columns = joined_question
df.drop(0, axis=0, inplace=True)
df.replace(np.nan, "", inplace=True)

The data contains some attributes:

Respondent profile attributes:

1. Age (Q1) - Categorical
2. Gender (Q2) - Categorical
3. Country (Q3) - Categorical
4. Education (Q4) - Categorical

Profession-related attributes:

1. Job (Q5) - Categorical
2. Company Size (Q20) - Categorical
3. Coding Experience (Q6) - Categorical
4. Machine Learning Experience (Q15) - Categorical
5. DS Team in Company (Q21) - Categorical
6. Incorporation of ML Methods in Company (Q22) - Categorical
7. Yearly Compensation (Q24) - Categorical
8. Money spent on ML or Cloud Computing (Q25) - Categorical

In [None]:
ATTRIBUTE_JOB = 'Q5_Select the title most similar to your current role (or most recent title if retired): - Selected Choice'
ATTRIBUTE_COMP_SIZE = 'Q20_What is the size of the company where you are employed?'
ATTRIBUTE_CODING_EXP = 'Q6_For how many years have you been writing code and/or programming?'
ATTRIBUTE_ML_EXP = 'Q15_For how many years have you used machine learning methods?'
ATTRIBUTE_DS_TEAMSIZE = 'Q21_Approximately how many individuals are responsible for data science workloads at your place of business?'
ATTRIBUTE_ML_USAGE_COMPANNY = 'Q22_Does your current employer incorporate machine learning methods into their business?'
ATTRIBUTE_SALARY = 'Q24_What is your current yearly compensation (approximate $USD)?'
ATTRIBUTE_MONEY_CLOUDML = 'Q25_Approximately how much money have you (or your team) spent on machine learning and/or cloud computing services at home (or at work) in the past 5 years (approximate $USD)?'
ALL_ATTRIBUTE = [ATTRIBUTE_JOB,ATTRIBUTE_COMP_SIZE,
                 ATTRIBUTE_CODING_EXP,ATTRIBUTE_ML_EXP,ATTRIBUTE_DS_TEAMSIZE,ATTRIBUTE_ML_USAGE_COMPANNY,ATTRIBUTE_SALARY,ATTRIBUTE_MONEY_CLOUDML]

In [None]:
COURSE_COLOR ={"Coursera" : "#2a73cc", "Kaggle":"#88ddff", "Udemy":"#ea5656", "University":"#074b5c"}
MEDIA_COLOR = {"Kaggle":"#88ddff", "YouTube":"#ff0000", "Blogs":"#355876"}
VIZ_FRAMEWORK_COLOR ={"Matplotlib" : "#ffde71", "Seaborn" : "#7db0bc", "None" : "grey", "Plotly" : "#3d4d71"}
ML_FRAMEWORK_COLOR = {"Sklearn" : "#3294c7", "TF" : "#f09437"}

In [None]:
DICTIONARY_CHANGE_VALUE = {
    ATTRIBUTE_ML_EXP : {
        "I do not use machine learning methods" : "No"
    },
    ATTRIBUTE_ML_USAGE_COMPANNY : {
        "We are exploring ML methods (and may one day put a model into production)" : "Exploration and Insights",
        "We use ML methods for generating insights (but do not put working models into production)" : "Exploration and Insights",
        "No (we do not use ML methods)" : "Not using",
        "We have well established ML methods (i.e., models in production for more than 2 years)" : "Production > 2 years",
        "We recently started using ML methods (i.e., models in production for less than 2 years)" : "Production < 2 years",
        "I do not know" : "Unknown"
    },
    ATTRIBUTE_MONEY_CLOUDML : {
        "$100,000 or more ($USD)" : "More than $100,000 ($USD)"
    },
    ATTRIBUTE_CODING_EXP : {
        "I have never written code" : "No or < 1 year",
        "< 1 years" : "No or < 1 year"
    },
    "Big Data Technology" : {
        "Microsoft SQL Server" : "Ms. SQLServer",
        "Google Cloud BigQuery" : "BigQuery",
        "Microsoft Azure Data" : "Azure DataLake",
        "Google Cloud Firestore" : "Firestore",
        "Oracle Database" : "Oracle DB",
        "Microsoft Access" : "Ms. Access",
        "Amazon Redshift" : "Redshift",
        "Amazon Athena" : "Athena",
        "Amazon DynamoDB" : "DynamoDB",
        "Google Cloud SQL" : "GC SQL"
    },
    "BI Tools" : {
        "Amazon QuickSight" : "QuickSight",
        "Microsoft Power BI" : "Power BI",
        "Google Data Studio" : "Data Studio",
        "SAP Analytics Cloud" : "SAP",
        "TIBCO Spotfire" : "Spotfire"
    },
    "ML Frameworks" : {
        "Scikit-learn" : "Sklearn",
        "TensorFlow" : "TF",
        "Xgboost" : "XGB",
        "LightGBM" : "LGBM",
        "Tidymodels" : "Tdymdls"
        
    },
    "ML Algorithms" : {
        "Decision Trees or Random Forests" : "DT/RF",
        "Gradient Boosting Machines (xgboost, lightgbm, etc)" : "GBM",
        "Bayesian Approaches" : "Bayesian",
        "Dense Neural Networks (MLPs, etc)" : "MLP",
        "Convolutional Neural Networks" : "CNN",
        "Recurrent Neural Networks" : "RNN",
        "Linear or Logistic Regression" : "LR",
        "Transformer Networks (BERT, gpt-3, etc)" : "Transformer",
        "Evolutionary Approaches" : "Evol. Appr",
        "Generative Adversarial Networks" : "GAN"
    },
    "AutoML Workflow" : {
        "Automated data augmentation (e.g. imgaug, albumentations)" : "Augmentation",
        "Automated feature engineering/selection (e.g. tpot, boruta_py)" : "Feature Engineering",
        "Automated model selection (e.g. auto-sklearn, xcessiv)" : "Model Selection",
        "Automated model architecture searches (e.g. darts, enas)" : "Architecture Search",
        "Automated hyperparameter tuning (e.g. hyperopt, ray.tune, Vizier)" : "Hyperparams Tuning",
        "Automation of full ML pipelines (e.g. Google AutoML, H20 Driverless AI)" : "Full Pipeline"
    },
    "Data Role" : {
        "Analyze and understand data to influence product or business decisions" : "Analyzing Business",
        "Do research that advances the state of the art of machine learning" : "Doing Research",
        "Build and/or run a machine learning service that operationally improves my product or workflows" : "Building Service",
        "Experimentation and iteration to improve existing ML models" : "Doing Research",
        "Build prototypes to explore applying machine learning to new areas" : "Building Service",
        "Build and/or run the data infrastructure that my business uses for storing, analyzing, and operationalizing data" : "Building Infras"
    },
    "Cloud" : {
        "Amazon Web Services" : "AWS",
        "Azure" : "Azure",
        "Google Cloud Platform" : "GCP",
        "Salesforce Cloud" : "Salesforce",
        "VMware" : "VMware",
        "IBM Cloud /" : "IBM",
        "Alibaba" : "Alibaba",
        "SAP Cloud" : "SAP"
    }
}
DICT_ZOOM_ICON = {"Coursera":0.05,"Kaggle":0.07,"Udemy":0.07,"University":0.0425,"DataCamp":0.065,"edX":0.08,"Udacity":0.035,"LinkedIn":0.095,"Fast.ai":0.035}

In [None]:
def get_first_word(x, cnt_word) :
    x = x.split()[:cnt_word]
    x = " ".join(x)
    return x
get_first_word = np.vectorize(get_first_word)

def sort_two_arrays(x,y, reverse=True) :
    # Sort both arrays together by the values of x, honouring the reverse flag
    sorted_x_y = list(zip(*sorted(zip(x, y), reverse=reverse)))
    return sorted_x_y

def show_values_on_bars(axs, threshold, unit, bartype):
    def _show_on_single_plot(ax, bartype):
        if bartype == "vertical" : 
            for idx, p in enumerate(ax.patches):
                if idx <= threshold : 
                    _x = p.get_x() + p.get_width() / 2
                    _y = p.get_y() + p.get_height() + 0.55
                    value = '{:.2f} {}'.format(p.get_height(), unit)
                    ax.text(_x, _y, value, ha="center")
        if bartype == "horizontal" :
            for idx, p in enumerate(ax.patches):
                if idx <= threshold :
                    _x = p.get_width() + 1.5
                    _y = p.get_y() + p.get_height() / 2
                    value = '{:.2f} {}'.format(p.get_width(), unit)
                    ax.text(_x, _y, value, ha="center")
                    
    if isinstance(axs, np.ndarray):
        for idx, ax in np.ndenumerate(axs):
            _show_on_single_plot(ax, bartype)
    else:
        _show_on_single_plot(axs, bartype)

def barplot(unique, count, title, y_title, x_title, annotation_unit, category_to_show, colors, bartype, title_position=(0,1)) :
    
    # Pass x and y by keyword: recent seaborn versions no longer accept them positionally
    if bartype == "vertical" : 
        ax = sns.barplot(x=unique, y=count, palette=colors)
        ax.set_title(title, fontsize=20, pad=20)
    elif bartype == "horizontal" :
        ax = sns.barplot(x=count, y=unique, palette=colors)
        ax.set_title(title, fontsize=20, pad=100, position=title_position)
    ax.set_ylabel(y_title)
    ax.set_xlabel(x_title)
    ax.yaxis.labelpad = 20
    ax.xaxis.labelpad = 20
    ax.tick_params(axis='both', which='major', pad=15)
    show_values_on_bars(ax, category_to_show, annotation_unit, bartype)
    ax.spines["right"].set_visible(False)
    ax.spines["top"].set_visible(False)
        
    return ax


def important_attribute(attributes, measured_question, data, num_attribute) :
    def cramers_corr(x,y):

        """ Calculate Cramér's V statistic for categorical-categorical association.
            Uses the bias correction from Bergsma, Wicher,
            Journal of the Korean Statistical Society 42 (2013): 323-328
        """
        result=-1
        x = data[x]
        y = data[y]
        if len(x.value_counts())==1 :
            print("First variable is constant")
        elif len(y.value_counts())==1:
            print("Second variable is constant")
        else:   
            conf_matrix=pd.crosstab(x, y)

            if conf_matrix.shape[0]==2:
                correct=False
            else:
                correct=True

            chi2 = ss.chi2_contingency(conf_matrix, correction=correct)[0]

            n = sum(conf_matrix.sum())
            phi2 = chi2/n
            r,k = conf_matrix.shape
            phi2corr = max(0, phi2 - ((k-1)*(r-1))/(n-1))    
            rcorr = r - ((r-1)**2)/(n-1)
            kcorr = k - ((k-1)**2)/(n-1)
            result=np.sqrt(phi2corr / min( (kcorr-1), (rcorr-1)))
        return abs(round(result,6))
    
    
    def cramers_corr_total(x,y) :
        # Average Cramér's V between attribute x and every column of y
        list_cramers = [cramers_corr(x, col) for col in y.columns]
        average_cramers = sum(list_cramers) / len(list_cramers)
        return average_cramers
    
    total_cramers_overall = []
    # Select the measured columns from the data argument, not the global df
    measured_vars = data[[c for c in data.columns if measured_question in c]]
    
    for att in attributes :
        total_cramers_per_attribute = cramers_corr_total(att, measured_vars)
        total_cramers_overall.append(total_cramers_per_attribute)
        
    sorted_idx = sorted(range(len(total_cramers_overall)), key=lambda k: total_cramers_overall[k], reverse=True)[:num_attribute]
    
    
    important_attributes = [attributes[i] for i in sorted_idx]
    
    
    return important_attributes

def group_least_category(series, group) :
    
    unique_val = list(series.value_counts().index)
    # Drop the empty (unanswered) category only if it is present
    if '' in unique_val :
        unique_val.remove('')
    least_category = unique_val[4:]
    series = series.apply(lambda x : group if x in least_category else x)
    return series

def data_for_plotting(data, question, cnt_word, focused_category=None ,focused_value=None):
    if focused_value is not None : 
        selected_data = data[data[focused_category]==focused_value][[c for c in data.columns if question in c]]
    else :
        selected_data = data[[c for c in data.columns if question in c]]
    all_data = np.concatenate(selected_data.values)
    all_data = np.unique(all_data, return_counts=True)
    all_data_unique = all_data[0]
    all_data_cnt =  all_data[1]
    nan = np.where(all_data_unique=="")[0][0]
    all_data_unique = np.delete(all_data_unique, nan)
    if cnt_word is not None : 
        all_data_unique = get_first_word(all_data_unique, cnt_word)
    all_data_cnt = np.delete(all_data_cnt, nan)
    all_data_cnt = all_data_cnt / selected_data.shape[0] * 100
    sorted_data = sort_two_arrays(all_data_cnt, all_data_unique)
    all_data_cnt, all_data_unique = np.asarray(sorted_data[0]), np.asarray(sorted_data[1])
    
    return all_data_cnt, all_data_unique

def get_color_barplot(values, dictionary_color=None) : 
    
    if dictionary_color is None :
        color = None
    else : 
        selected_color = [dictionary_color[values[0]], dictionary_color[values[1]], dictionary_color[values[2]]]
        other_color = ['grey'] * (values.shape[0] - 3)
        color = selected_color + other_color
    return color

def change_value(series, attribute) :
    series = series.replace(DICTIONARY_CHANGE_VALUE[attribute])
    return series

def barplot_comparison_2_year(width ,input1, unique1, input2, label1, label2, xaxis_label, yaxis_label, title, ind_growth, growth, num_to_show, title_position) :
    
    ind = np.arange(start = 0, stop = input1.shape[0]*1.5, step=1.5)
    # Plotting
    ax = plt.bar(ind, input1 , width, label=label1)
    ax = plt.bar(ind + width, input2, width, label=label2)
    plt.xlabel(xaxis_label, labelpad=20)
    plt.ylabel(yaxis_label, labelpad=20)
    plt.title(title, fontsize=20, pad=20, position=title_position)
    plt.xticks(ind + width / 2, unique1)
    plt.tick_params(axis='both', which='major', pad=15)
    
    rects = ax.patches
    
    
    i = 0
    for rect in rects:
        if i in ind_growth[:num_to_show] : 
            height = rect.get_height()
            plt.text(rect.get_x() + rect.get_width() / 2, height + 1, f"Growth: {round(growth[i],2)}%",
                    ha='center', va='bottom')
        i+=1

    # Finding the best position for legends and putting it
    plt.legend(loc='best')
    # Dashed horizontal grid lines (the old `b` argument is gone from recent matplotlib)
    plt.grid(axis='y', linestyle="--", linewidth=0.5, color='black')
    
    return ax

def make_crosstab(input1, input2, special_case=False) :
    
        
    tmp1 = df[[c for c in df.columns if (input1 in c)]]
    tmp2 = df[[c for c in df.columns if (input2 in c)]]
    tmp1_unique = pd.unique(tmp1.values.flatten())
    if tmp1_unique[0]=="" :
        tmp1_unique = np.delete(tmp1_unique,0)
    tmp2_unique = pd.unique(tmp2.values.flatten())
    if tmp2_unique[0]=="" :
        tmp2_unique = np.delete(tmp2_unique,0)
    dict_tmp = {}
    list_len = []
    for i in tmp1_unique :
        try : 
            if ("None" not in i) and (i!="") :
                if special_case : 
                    grouper = tmp1.iloc[:,0].apply(lambda x : x if x==i else "")
                else : 
                    grouper = tmp1[[c for c in tmp1.columns if (i in c)]].iloc[:,0]
                list_tmp = []
                for j in tmp2_unique :
                    if ("None" not in j) and (j!="") : 
                        column = tmp2[[c for c in tmp2.columns if (j in c)]].iloc[:,0]
                        value = pd.crosstab(grouper, [column]).iloc[1,1]
                        list_tmp.append(value)
                dict_tmp[i] = list_tmp
        except Exception :  # skip choices whose crosstab cannot be built
            pass
    for j in tmp2_unique :
        if ("None" not in j) and (j!="") : 
            column = tmp2[[c for c in tmp2.columns if (j in c)]].iloc[:,0]
            length = len([c for c in column.tolist() if c!=""])
            list_len.append(length)
    crosstab = pd.DataFrame.from_dict(dict_tmp)
    crosstab["number"] = list_len
    crosstab.loc[:,:] =  crosstab.loc[:,:].div(crosstab["number"], axis=0) * 100
    crosstab.drop(columns="number", axis=1, inplace=True)
    crosstab.index = [c for c in tmp2_unique if ("None" not in c) and (c!="")]
    crosstab = crosstab.round(2)
    return crosstab

def render_table(data, dict_change_value, col_width=3.0, row_height=0.625, font_size=12,
                     header_color='#40466e', row_colors=['#f1f1f2', 'w'], edge_color='w',
                     bbox=[0, 0, 1, 1], header_columns=0,
                     ax=None, **kwargs):
    
    data = data.reset_index().rename(columns={"index":""})
    maximum = data.iloc[:,1:].idxmax() + 1
    maximum.index = list(range(1,len(data.columns)))
    list_maximum = list(zip(maximum, maximum.index))
    data = data.applymap(lambda x : f"{x} %" if not isinstance(x, str) else x)
    
    if ax is None:
        size = (np.array(data.shape[::-1]) + np.array([0, 1])) * np.array([col_width, row_height])
        fig, ax = plt.subplots(figsize=size)
        ax.axis('off')
    table = ax.table(cellText=data.values, cellLoc='center', bbox=bbox, colLabels=[dict_change_value[c.strip()] if c.strip() in dict_change_value else c for c in data.columns], **kwargs)
    table.auto_set_font_size(False)
    table.set_fontsize(font_size)

    for k, cell in table._cells.items():
        cell.set_edgecolor(edge_color)
        if ((k[0] == 0) and (k[1] > 0)) or ((k[0] > 0) and (k[1]==0)):
            cell.set_text_props(weight='bold', color='w')
            cell.set_facecolor(header_color)
        elif (k[0] == 0) and (k[1] == 0):
            cell.set_facecolor('white')
        elif (k[0]!=0) and (k[1] > 0) and (k not in list_maximum):
            cell.set_facecolor(row_colors[k[0]%len(row_colors) ])
        elif (k[0]!=0) and (k[1] > 0) and (k in list_maximum): 
            cell.set_text_props(weight="bold", color="white")
            cell.set_facecolor("green")   
    return ax.get_figure(), ax

def genSankey(df,cat_cols=[],value_cols='',title='Sankey Diagram'):
    # one color per category column; the palette below supports up to 5
    colorPalette = ['#4B8BBE','#306998','#FFE873','#FFD43B','#646464']
    labelList = []
    colorNumList = []
    for catCol in cat_cols:
        labelListTemp =  list(set(df[catCol].values))
        colorNumList.append(len(labelListTemp))
        labelList = labelList + labelListTemp
        
    # remove duplicates from labelList
    labelList = list(dict.fromkeys(labelList))
    
    # define colors based on number of levels
    colorList = []
    for idx, colorNum in enumerate(colorNumList):
        colorList = colorList + [colorPalette[idx]]*colorNum
        
    # transform df into a source-target pair
    for i in range(len(cat_cols)-1):
        if i==0:
            sourceTargetDf = df[[cat_cols[i],cat_cols[i+1],value_cols]]
            sourceTargetDf.columns = ['source','target','count']
        else:
            tempDf = df[[cat_cols[i],cat_cols[i+1],value_cols]]
            tempDf.columns = ['source','target','count']
            sourceTargetDf = pd.concat([sourceTargetDf,tempDf])
        sourceTargetDf = sourceTargetDf.groupby(['source','target']).agg({'count':'sum'}).reset_index()
        
    # add index for source-target pair
    sourceTargetDf['sourceID'] = sourceTargetDf['source'].apply(lambda x: labelList.index(x))
    sourceTargetDf['targetID'] = sourceTargetDf['target'].apply(lambda x: labelList.index(x))
    
    # creating the sankey diagram
    data = dict(
        type='sankey',
        node = dict(
          pad = 15,
          thickness = 20,
          line = dict(
            color = "black",
            width = 0.5
          ),
          label = labelList,
          color = colorList
        ),
        link = dict(
          source = sourceTargetDf['sourceID'],
          target = sourceTargetDf['targetID'],
          value = sourceTargetDf['count']
        )
      )
    
    layout =  dict(
        title = title,
        font = dict(
          size = 10
        )
    )
       
    fig = dict(data=[data], layout=layout)
    return fig

def get_flag(name, path):
    path = f"{path}/{name}.png"
    im = plt.imread(path)
    return im

def offset_image(coord, name, ax, path):
    img = get_flag(name, path)
    if name in DICT_ZOOM_ICON : 
        im = OffsetImage(img, zoom=DICT_ZOOM_ICON[name])
    else : 
        im = OffsetImage(img, zoom=0.05)
    im.image.axes = ax

    ab = AnnotationBbox(im, (0, coord), xybox=(-50, -5.), frameon=False,
                        xycoords='data',  boxcoords="offset points", pad=0)
    
    list(ax.get_yticklabels())[coord].set_color('white')

    ax.add_artist(ab)

In [None]:
df[[c for c in df.columns if 'Q23_Part' in c]] = df[[c for c in df.columns if 'Q23_Part' in c]].replace(DICTIONARY_CHANGE_VALUE["Data Role"])

df["Q23_Part_3_Select any activities that make up an important part of your role at work: (Select all that apply) - Selected Choice - Build prototypes to explore applying machine learning to new areas"]=df[[c for c in df.columns if ('Q23_Part_3' in c) or ('Q23_Part_4' in c)]].max(axis=1)
df.drop(columns="Q23_Part_4_Select any activities that make up an important part of your role at work: (Select all that apply) - Selected Choice - Build and/or run a machine learning service that operationally improves my product or workflows", axis=1,inplace=True)

df["Q23_Part_5_Select any activities that make up an important part of your role at work: (Select all that apply) - Selected Choice - Experimentation and iteration to improve existing ML models"]=df[[c for c in df.columns if ('Q23_Part_5' in c) or ('Q23_Part_6' in c)]].max(axis=1)
df.drop(columns="Q23_Part_6_Select any activities that make up an important part of your role at work: (Select all that apply) - Selected Choice - Do research that advances the state of the art of machine learning", axis=1,inplace=True)

df = df.rename(columns = {
    "Q23_Part_3_Select any activities that make up an important part of your role at work: (Select all that apply) - Selected Choice - Build prototypes to explore applying machine learning to new areas" : "Q23_Part_3_Select any activities that make up an important part of your role at work: (Select all that apply) - Selected Choice - Building Service",
    "Q23_Part_5_Select any activities that make up an important part of your role at work: (Select all that apply) - Selected Choice - Experimentation and iteration to improve existing ML models":"Q23_Part_4_Select any activities that make up an important part of your role at work: (Select all that apply) - Selected Choice - Doing Research",
    "Q23_Part_1_Select any activities that make up an important part of your role at work: (Select all that apply) - Selected Choice - Analyze and understand data to influence product or business decisions":"Q23_Part_1_Select any activities that make up an important part of your role at work: (Select all that apply) - Selected Choice - Analyzing Business",
    "Q23_Part_2_Select any activities that make up an important part of your role at work: (Select all that apply) - Selected Choice - Build and/or run the data infrastructure that my business uses for storing, analyzing, and operationalizing data" : "Q23_Part_2_Select any activities that make up an important part of your role at work: (Select all that apply) - Selected Choice - Building Infras"
})

df["Q9_Part_3_Which of the following integrated development environments (IDE's) do you use on a regular basis? (Select all that apply) - Selected Choice - Visual Studio / Visual Studio Code"] = df[[c for c in df.columns if ("Q9_Part_3" in c) or ("Q9_Part_4" in c)]].max(axis=1)
df.drop(columns="Q9_Part_4_Which of the following integrated development environments (IDE's) do you use on a regular basis?  (Select all that apply) - Selected Choice - Click to write Choice 13",axis=1,inplace=True)
df.drop(columns="Q9_Part_3_Which of the following integrated development environments (IDE's) do you use on a regular basis?  (Select all that apply) - Selected Choice -  Visual Studio / Visual Studio Code ",axis=1,inplace=True)
df = df.rename(columns={"Q9_Part_3_Which of the following integrated development environments (IDE's) do you use on a regular basis? (Select all that apply) - Selected Choice - Visual Studio / Visual Studio Code":
                       "Q9_Part_3_Which of the following integrated development environments (IDE's) do you use on a regular basis? (Select all that apply) - Selected Choice - Visual Studio / Visual Studio Code (VSCode)"})
df["Q9_Part_3_Which of the following integrated development environments (IDE's) do you use on a regular basis? (Select all that apply) - Selected Choice - Visual Studio / Visual Studio Code (VSCode)"] = df["Q9_Part_3_Which of the following integrated development environments (IDE's) do you use on a regular basis? (Select all that apply) - Selected Choice - Visual Studio / Visual Studio Code (VSCode)"].replace({"Visual Studio":"VSCode","Visual Studio Code (VSCode)":"VSCode"})

## 1. Where should I learn?

In [None]:
all_course_cnt, all_course_unique = data_for_plotting(df, 'Q37', 1)
course_colors = get_color_barplot(all_course_unique, COURSE_COLOR)
plt.figure(figsize=(16,10))
ax = barplot(all_course_unique, all_course_cnt, title="Source to Study Data Science", 
                         y_title="Source", x_title="Percentage of User (%)", annotation_unit='%', category_to_show=3, colors=course_colors, bartype="horizontal", title_position=(0,1))
ax.text(-6.5,-2, "The most favourite sources are:")
ax.text(2.5,-2, "Coursera, Kaggle, and Udemy", bbox={'facecolor': 'green', 'alpha': 0.5, 'pad': 10}, color="white", weight='bold')
for i, c in enumerate(all_course_unique):
    try : 
        offset_image(i, c, ax, "../input/ds-study-sources-icons")
    except :
        pass
plt.show()

36.84% of respondents chose **Coursera as their favourite place to learn data science**, followed by Kaggle and Udemy. But we can dig deeper into this information: which factor affects the choice of learning place the most?

In [None]:
important_attribute(ALL_ATTRIBUTE, 'Q37', df, 1)

Applying the function `important_attribute` gives the factor most strongly associated with the choice of learning place: **years of experience in using machine learning methods**.
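To make the ranking less of a black box, here is a minimal sketch of the statistic behind `important_attribute`: the bias-corrected Cramér's V between two categorical variables. It is written in plain Python rather than with `pd.crosstab` and `scipy.stats`, and the function name `cramers_v_corrected` and the toy data are assumptions made for illustration, not part of the notebook or the survey.

```python
from collections import Counter
from math import sqrt


def cramers_v_corrected(xs, ys):
    """Bias-corrected Cramér's V for two equal-length lists of categories."""
    n = len(xs)
    row_totals = Counter(xs)
    col_totals = Counter(ys)
    observed = Counter(zip(xs, ys))
    r, k = len(row_totals), len(col_totals)
    if r < 2 or k < 2:
        return 0.0  # a constant variable carries no association

    # Pearson chi-squared over every cell, including empty ones
    chi2 = 0.0
    for a, row_n in row_totals.items():
        for b, col_n in col_totals.items():
            expected = row_n * col_n / n
            chi2 += (observed[(a, b)] - expected) ** 2 / expected

    # Same correction terms as in cramers_corr
    phi2 = chi2 / n
    phi2_corr = max(0.0, phi2 - (k - 1) * (r - 1) / (n - 1))
    r_corr = r - (r - 1) ** 2 / (n - 1)
    k_corr = k - (k - 1) ** 2 / (n - 1)
    return sqrt(phi2_corr / min(k_corr - 1, r_corr - 1))


# Hypothetical toy data: experience group vs. whether a platform was picked
experience = ["< 1 year"] * 10 + ["3+ years"] * 10
picked_always_by_juniors = ["Kaggle"] * 10 + [""] * 10
picked_evenly = ["Kaggle", ""] * 10

v_strong = cramers_v_corrected(experience, picked_always_by_juniors)
v_none = cramers_v_corrected(experience, picked_evenly)
print(round(v_strong, 6), v_none)
```

A value near 1 means the answer is almost fully determined by the attribute, and a value of 0 means no measurable association. `important_attribute` averages this statistic over all answer columns of a question and keeps the attributes with the highest average.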

In [None]:
temp_data = df.copy()
temp_data[ATTRIBUTE_ML_EXP] = group_least_category(temp_data[ATTRIBUTE_ML_EXP], 'More than 3 years')
temp_data[ATTRIBUTE_ML_EXP] = change_value(temp_data[ATTRIBUTE_ML_EXP], ATTRIBUTE_ML_EXP)
unique_value = list(temp_data[ATTRIBUTE_ML_EXP].unique())
unique_value.remove('')
idx=np.array([1,3,0,4,2])
unique_value = np.array(unique_value)[idx]
temp_data = temp_data[temp_data[ATTRIBUTE_ML_EXP]!='']
plt.figure(figsize=(26,20))
for i, value in enumerate(unique_value) : 
    all_course_grouped_cnt, all_course_grouped_unique = data_for_plotting(temp_data, 'Q37', 1, ATTRIBUTE_ML_EXP, value)
    course_colors = get_color_barplot(all_course_grouped_unique, COURSE_COLOR)
    i+=1
    plt.subplot(3,2,i)
    ax = barplot(all_course_grouped_unique, all_course_grouped_cnt, title=f"Source to Study Data Science \n {value} Experiences in Machine Learning", 
                         y_title="Percentage of User (%)", x_title="Source", annotation_unit='%', category_to_show=3, colors=course_colors, bartype="vertical")
    plt.subplots_adjust(hspace=0.5)
plt.show()

The data clearly show that Coursera remains the top choice among all the options. However, people with < 2 years of machine learning experience favor Kaggle over Udemy and University.

This suggests that people who started earlier relied mainly on learning platforms and formal education, whereas Kaggle has since made data science easier to pick up. One possible reason is that we learn **more "real-case-like" data science from Kaggle**, while the **need to study the "basic concepts" of data science can still be met by Coursera**.

In [None]:
all_media_cnt, all_media_unique = data_for_plotting(df, 'Q39', 1)
media_colors = get_color_barplot(all_media_unique, MEDIA_COLOR)
plt.figure(figsize=(16,10))
ax = barplot(all_media_unique,all_media_cnt ,title="Media to Get Data Science Latest Topics", 
                         y_title="Media", x_title="Percentage of User (%)", annotation_unit='%', category_to_show=3, colors=media_colors, bartype="horizontal", title_position=(0.1,1))
ax.text(-5.25,-2, "The most favourite media are:")
ax.text(4.5,-2, "Kaggle, YouTube, and Blogs", bbox={'facecolor': 'blue', 'alpha': 0.5, 'pad': 10}, color="white", weight='bold')
plt.show()

In [None]:
important_attribute(ALL_ATTRIBUTE, 'Q39', df, 6)

In [None]:
temp_data = df.copy()
temp_data[ATTRIBUTE_ML_EXP] = group_least_category(temp_data[ATTRIBUTE_ML_EXP], 'More than 3 years')
temp_data[ATTRIBUTE_ML_EXP] = change_value(temp_data[ATTRIBUTE_ML_EXP], ATTRIBUTE_ML_EXP)
unique_value = list(temp_data[ATTRIBUTE_ML_EXP].unique())
unique_value.remove('')
idx=np.array([1,3,0,4,2])
unique_value = np.array(unique_value)[idx]
temp_data = temp_data[temp_data[ATTRIBUTE_ML_EXP]!='']
plt.figure(figsize=(26,20))
for i, value in enumerate(unique_value) :
    all_media_grouped_cnt, all_media_grouped_unique = data_for_plotting(temp_data, 'Q39', 1, ATTRIBUTE_ML_EXP, value)
    course_colors = get_color_barplot(all_media_grouped_unique, MEDIA_COLOR)
    i+=1
    plt.subplot(3,2,i)
    ax = barplot(all_media_grouped_unique, all_media_grouped_cnt, title=f"Media to Get Data Science Latest Topics \n {value} Experiences in Machine Learning", 
                         y_title="Percentage of User (%)", x_title="Source", annotation_unit='%', category_to_show=3, colors=course_colors, bartype="vertical")
    plt.subplots_adjust(hspace=0.5)
plt.show()

The trend of "need for practical cases in the easiest way" continues when it comes to choosing media for the latest data science topics. People with less experience tend to choose YouTube, probably because it is easier to consume: we can get the information just by sitting down and watching other people explain it to us.

People with more experience, however, tend to choose Kaggle, where they can directly read the tricks and latest code implementations used in various competitions, which makes it more practical. This trend stops for people with more than 3 years of machine learning experience, who tend to choose blogs (e.g. Towards Data Science). Personally, I think this is because they can implement an idea from a blog post without reading a full code implementation like those on Kaggle. They just need an overview of the idea and a simple code implementation (without all of the data processing, etc.), and voila! They can build cool stuff from it.

## 2. I have already taken some courses, read books, and subscribed to articles, but there are a bunch of tools out there! Which one should I pick?

OK, now we will go deeper into the data science workflow.
Generally there are 4 processes: ETL, Exploration, Preprocessing and Modelling, and Deployment.

Which tools are needed for each of these processes?

### We begin with Big Data Tools (ETL tools).

In [None]:
all_etl_cnt, all_etl_unique = data_for_plotting(df, 'Q29_A', 3)
all_etl2_cnt, all_etl2_unique = data_for_plotting(df, 'Q29_B', 3)
indices = list(np.where(all_etl_unique[:, None] == all_etl2_unique[None, :])[1])
all_etl2_cnt = all_etl2_cnt[indices]
all_etl_unique = np.array([DICTIONARY_CHANGE_VALUE["Big Data Technology"][x] if x in DICTIONARY_CHANGE_VALUE["Big Data Technology"] else x for x in all_etl_unique])
growth = (all_etl2_cnt - all_etl_cnt)/all_etl_cnt * 100
ind_growth = np.argsort(growth)[::-1]

In [None]:
plt.figure(figsize=(26,10))
ax = barplot_comparison_2_year(0.6,all_etl_cnt, all_etl_unique, all_etl2_cnt, "Now", "Next 2 Years", "Tech Stack", "Percentage of User (%)", "Commonly Used Big Data Technologies", ind_growth, growth, 3, (0.083,1.1))
plt.text(-3.2,23,"Most commonly used Big Data Technologies in the future (prediction) are:")
plt.text(6.3,23,"Google Cloud SQL, Google Firestore (Firebase), and Google BigQuery", bbox={'facecolor': '#00A5FF', 'pad': 10}, color="white", weight='bold')
sns.despine()
plt.show()

We can see that **MySQL and PostgreSQL are still the champions of the market**. But what about 2 years from now? **Google Cloud SQL, Google Firestore (Firebase), and Google BigQuery are predicted to have the steepest user growth**. All Google technologies!
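The growth figures come from matching the "Next 2 Years" tools to the "Now" tools and then taking the relative change. Here is a minimal sketch of that idea on made-up numbers (the tool shares below are hypothetical, not survey results). Note that the alignment trick assumes every tool in the first list appears exactly once in the second list.

```python
import numpy as np

# Hypothetical toy data: share of respondents (%) per tool, now and in 2 years.
now_names = np.array(["MySQL", "PostgreSQL", "BigQuery"])
now_pct = np.array([40.0, 30.0, 10.0])
next_names = np.array(["BigQuery", "MySQL", "PostgreSQL"])  # different order
next_pct = np.array([25.0, 44.0, 36.0])

# Compare every "now" name against every "next" name (3x3 boolean matrix).
# The column index of each True is the position of that tool in next_names.
match = now_names[:, None] == next_names[None, :]
indices = np.where(match)[1]
next_pct_aligned = next_pct[indices]  # same order as now_names

# Relative growth in percent, then tool names sorted from fastest to slowest growth.
growth = (next_pct_aligned - now_pct) / now_pct * 100
order = np.argsort(growth)[::-1]
fastest_growing = now_names[order]

print(dict(zip(fastest_growing, np.round(growth[order], 1))))
```

A small share today makes a big relative growth easy to reach, so a tool with few current users can top this ranking while still being far behind the market leaders in absolute terms.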

### After getting the data, we continue with Exploratory Data Analysis. Here are the tools.

In [None]:
all_bi_cnt, all_bi_unique = data_for_plotting(df, 'Q31_A', 3)
all_bi2_cnt, all_bi2_unique = data_for_plotting(df, 'Q31_B', 3)
indices = list(np.where(all_bi_unique[:, None] == all_bi2_unique[None, :])[1])
all_bi2_cnt = all_bi2_cnt[indices]
all_bi_unique = np.array([DICTIONARY_CHANGE_VALUE["BI Tools"][x] if x in DICTIONARY_CHANGE_VALUE["BI Tools"] else x for x in all_bi_unique])
growth = (all_bi2_cnt - all_bi_cnt)/all_bi_cnt * 100
ind_growth = np.argsort(growth)[::-1]

In [None]:
plt.figure(figsize=(26,10))
ax = barplot_comparison_2_year(0.6,all_bi_cnt, all_bi_unique, all_bi2_cnt, "Now", "Next 2 Years", "Tech Stack", "Percentage of User (%)", "Commonly Used Business Intelligence Tools", ind_growth, growth, 4, (0.089,1.1))
plt.text(-2.9,21,"Most commonly used BI Tools in the future (prediction) are:")
plt.text(3.5,21,"Einstein Analytics, SAP, Amazon QuickSight, Google Data Studio", bbox={'facecolor': '#FF8C00', 'pad': 10}, color="white", weight='bold')
sns.despine()
plt.show()

For data exploration with BI tools, **Tableau and Power BI are still the champions of the market**. But what about 2 years from now? You had better also be prepared for the era of **SAP, Amazon QuickSight, Einstein Analytics**, and... of course our strongest contender, **Google Data Studio**!

### Already gained insights from the data? Go on to the preprocessing and modelling part!

In [None]:
crosstab = make_crosstab('Q17','Q16')
fig,ax = render_table(crosstab, DICTIONARY_CHANGE_VALUE["ML Algorithms"], header_columns=0, col_width=2.0)

The table above shows what percentage of the users of a certain workflow/library also use a certain algorithm on a daily basis. I hope this can serve as a proxy for how comfortable it is to use a given workflow/library for a given algorithm. For example, 89.56% of Caret users use Decision Tree / Random Forest on a daily basis.
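To make the meaning of each cell concrete, here is a minimal sketch of how such a "share of library users who also use an algorithm" table could be computed. The respondents and selections below are hypothetical toy data, and this is not necessarily how the `make_crosstab` helper works internally.

```python
import pandas as pd

# Hypothetical toy survey: one row per respondent, True if the option was selected.
frameworks = pd.DataFrame({
    "Scikit-learn": [True, True, True, False],
    "Caret": [False, True, False, True],
})
algorithms = pd.DataFrame({
    "Decision Tree / Random Forest": [True, True, False, True],
    "Neural Network": [True, False, False, False],
})

# counts.loc[f, a] = number of respondents who selected both framework f and algorithm a.
counts = frameworks.astype(int).T.dot(algorithms.astype(int))

# Divide each row by the number of users of that framework to get a percentage.
users_per_framework = frameworks.sum(axis=0)
share = counts.div(users_per_framework, axis=0) * 100

print(share.round(2))
```

Each row is normalised by its own number of users, so a library with very few users can show very high percentages.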

Of course this method is not perfect. A high percentage does not mean that the library itself can build the algorithm; for example, it does not mean that TensorFlow can be used to build a Decision Tree / Random Forest. There is another drawback: some of the libraries/workflows are for R while others are for Python, so this is not really an apples-to-apples comparison.

Despite all of these drawbacks, with this method we can see that **JAX from Google secures the highest percentage for almost all of the ML algorithms**. Once again, Google technologies come out on top here.

### How about automated machine learning workflows? I've heard there are so many convenient tools out there.

#### First, let's see which parts of the machine learning workflow people would like to automate

In [None]:
tmp = df[[c for c in df.columns if "Q33_A" in c]].iloc[:,:5].replace(DICTIONARY_CHANGE_VALUE["AutoML Workflow"])
tmp.columns = ["Augmentation","Feature Engineering","Model Selection","Architecture Search","Hyperparams Tuning"]
all_user = tmp[(tmp["Augmentation"]!="")|(tmp["Feature Engineering"]!="")|(tmp["Model Selection"]!="")|(tmp["Architecture Search"]!="")|(tmp["Hyperparams Tuning"]!="")].shape[0]
tmp = pd.DataFrame(tmp.groupby(list(tmp.columns)).size()).reset_index().rename(columns={0:"Percentage"}).iloc[1:,:]
tmp["Percentage"] = round(tmp["Percentage"] / all_user * 100,2)
for c in tmp.columns :
    tmp[c] = tmp[c].apply(lambda x : f"No {c}" if x=="" else x)
sorted_tmp = tmp.sort_values(by="Percentage",ascending=False)

size = (np.array(sorted_tmp.shape[::-1]) + np.array([0, 1])) * np.array([3.5, 0.1])
fig, ax = plt.subplots(figsize=size)
ax.axis('off')
# Copy the top 6 rows so the formatting below does not modify a view of sorted_tmp
tmp_edited = sorted_tmp.iloc[:6,:].copy()
tmp_edited["Percentage"] = tmp_edited["Percentage"].apply(lambda x : f"{x}%")
table = ax.table(cellText=tmp_edited.values, cellLoc='center', bbox=[0, 0, 1, 1], colLabels=sorted_tmp.columns)
table.auto_set_font_size(False)
table.set_fontsize(14)

for k, cell in table.get_celld().items():
    text = cell.get_text()
    label = text.get_text()
    if k[0]==0 :
        # Header row: bold white text on an orange background
        cell.set_text_props(weight='bold', color='w')
        cell.set_facecolor("#FF4500")
    if "No" in label :
        # Workflow part that is not automated
        text.set_color("red")
    elif ("%" not in label) and (k[0]!=0):
        # Workflow part that is automated
        text.set_color("green")

Is it clear now? Most automated pipeline users only use the Model Selection process. So, in the context of automated ML workflows, the most important part for us is Model Selection.
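The table is built by counting how often each combination of automated parts occurs among respondents who automate at least one part. Here is a minimal sketch of that counting on hypothetical toy answers with only two workflow parts. It filters out the "nothing selected" rows explicitly, which is an assumption of this sketch rather than the exact steps used in the cell above.

```python
import pandas as pd

# Hypothetical toy answers: "" means the respondent did not select that part.
answers = pd.DataFrame({
    "Model Selection": ["Model Selection", "Model Selection", "", "Model Selection", ""],
    "Hyperparams Tuning": ["", "Hyperparams Tuning", "", "", "Hyperparams Tuning"],
})

# Keep only respondents who automate at least one part of the workflow.
selected_any = (answers != "").any(axis=1)
users = answers[selected_any]

# Count each distinct combination and express it as a share of those users.
combos = users.groupby(list(users.columns)).size().reset_index(name="Percentage")
combos["Percentage"] = (combos["Percentage"] / len(users) * 100).round(2)
combos = combos.sort_values("Percentage", ascending=False).reset_index(drop=True)

print(combos)
```

In this toy data the most common combination is "Model Selection only", which is the kind of pattern the table above highlights.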

In [None]:
# data = genSankey(tmp,cat_cols=['Augmentation','Feature Engineering','Model Selection','Architecture Search',"Hyperparams Tuning"],value_cols='Percentage',title='AutoML')

# fig = go.Figure(data=[go.Sankey(
#     valueformat = ".0f",
#     # Define nodes
#     node = dict(
#       pad = 15,
#       thickness = 15,
#       line = dict(color = "black", width = 0.5),
#       label =  data['data'][0]['node']['label'],
#       color =  data['data'][0]['node']['color']
#     ),
#     # Add links
#     link = dict(
#       source =  data['data'][0]['link']['source'],
#       target =  data['data'][0]['link']['target'],
#       value =  data['data'][0]['link']['value'],
#       label =  [""]*len(data['data'][0]['link']['source']),
#       color =  ["yellow","yellow","green","green","blue","blue","red","red","green","green","red","red","green","green","yellow","yellow","grey"]+["grey"]*(len(data['data'][0]['link']['source'])-17)
# ))])
# fig.show()

In [None]:
crosstab = make_crosstab('Q33_A','Q34_A')
fig,ax = render_table(crosstab, DICTIONARY_CHANGE_VALUE["AutoML Workflow"], header_columns=0, col_width=3.0)

And after all, **Tpot and Xcessiv win the race of automated ML workflows**. But don't forget that the most important part, **Model Selection, is led by Auto-sklearn**.

## 3. By the way, there are so many buzzwords related to cloud computing out there. Do I also need to learn them all?

In [None]:
crosstab = make_crosstab('Q23','Q26_A')
fig,ax = render_table(crosstab, DICTIONARY_CHANGE_VALUE["Data Role"], header_columns=0, col_width=4)

The results for the Analyzing Business and Doing Research roles don't surprise me. Salesforce is software focused on Customer Relationship Management, and AWS gives you a simple way of doing ML research. But Tencent might be the best new player that I have heard of, and it may rise soon!

In [None]:
all_cloud_cnt, all_cloud_unique = data_for_plotting(df, 'Q26_A', 3)
all_cloud2_cnt, all_cloud2_unique = data_for_plotting(df, 'Q26_B', 3)
indices = list(np.where(all_cloud_unique[:, None] == all_cloud2_unique[None, :])[1])
all_cloud2_cnt = all_cloud2_cnt[indices]
all_cloud_unique = np.array([DICTIONARY_CHANGE_VALUE["Cloud"][x] if x in DICTIONARY_CHANGE_VALUE["Cloud"] else x for x in all_cloud_unique])
growth = (all_cloud2_cnt - all_cloud_cnt)/all_cloud_cnt * 100
ind_growth = np.argsort(growth)[::-1]

In [None]:
plt.figure(figsize=(26,10))
ax = barplot_comparison_2_year(0.6,all_cloud_cnt, all_cloud_unique, all_cloud2_cnt, "Now", "Next 2 Years", "Tech Stack", "Percentage of User (%)", "Commonly Used Cloud Technologies", ind_growth, growth, 3, (0.083,1.1))
plt.text(-2,28.75,"Most commonly used Cloud Technologies in the future (prediction) are:")
plt.text(4,28.75,"Tencent, IBM, and Oracle", bbox={'facecolor': '#00A5FF', 'pad': 10}, color="white", weight='bold')
sns.despine()
plt.show()

As for growth, Tencent is also predicted to grow by 300% over the next 2 years. We will see...

## 4. OK, cool. Is there anything else I need to know?

### Many people debate about IDEs. Here are the most used IDEs

In [None]:
all_tool_cnt, all_tool_unique = data_for_plotting(df, 'Q9', 1)
tool_colors = get_color_barplot(all_tool_unique, None)
plt.figure(figsize=(16,10))
ax = barplot(all_tool_unique,all_tool_cnt ,title="Most Used Text Editor", 
                         y_title="tool", x_title="Percentage of User (%)", annotation_unit='%', category_to_show=3, colors=tool_colors, bartype="horizontal", title_position=(0,1))
ax.text(-7.1,-1.5, "The most favourite IDEs are:")
ax.text(6,-1.5, "Jupyter, VSCode, and PyCharm", bbox={'facecolor': 'blue', 'alpha': 0.5, 'pad': 10}, color="white", weight='bold')
plt.show()

### And these are the IDEs by the role of the person

In [None]:
df[ATTRIBUTE_JOB] = df[ATTRIBUTE_JOB].replace({"DBA/Database Engineer" : "Data Engineer","Product/Project Manager":"Project Manager", "Machine Learning Engineer" : "ML Engineer", "Business Analyst":"Data Analyst"})
crosstab = make_crosstab('Q5','Q9', special_case=True)
fig,ax = render_table(crosstab, DICTIONARY_CHANGE_VALUE["Data Role"], header_columns=0, col_width=2.95)