<img src="https://i.imgur.com/XoU1Jmh.png">

<h1><center>- Building a Smarter and Faster Path to Greatness -</center></h1>

#### Functions below ⬇

In [None]:
# --- CSS STYLE ---
from IPython.display import HTML
def css_styling():
    # Read the stylesheet with a context manager so the file handle is closed
    with open("../input/2020-cost-of-living/alerts.css", "r") as f:
        styles = f.read()
    return HTML("<style>" + styles + "</style>")
css_styling()

In [None]:
HTML("""
<style>
@import url('https://fonts.googleapis.com/css2?family=Source+Code+Pro&display=swap');
</style>
""")

In [None]:
def add_unique_id(data):
    '''Adds new column with a unique id to the original survey data.'''
    
    # Add unique ID
    data["id"] = data.index + 1
    return data


def get_McMeal_units(data):
    '''Returns a table containing per ID the pay, the middle value and the McMeal units - our new currency.
    data: the raw survey data'''
    
    # There is no missing data for country
    pay = "What is your current yearly compensation (approximate $USD)?"
    country = "In which country do you currently reside?"

    df = data[['id', pay, country]]

    # Remove people with NA for yearly compensation
    # This operation removes half of the data
    df = df[df[pay].notna()].reset_index(drop=True)

    # Remove Country == Other
    df = df[df[country]!='Other'].reset_index(drop=True)

    # Compute middle pay (the middle of the interval)
    df[pay] = df[pay].apply(lambda x: x.replace('$', '').strip())
    df[pay] = df[pay].apply(lambda x: x.replace('>', '').strip())

    df['lower_pay'] = df[pay].apply(lambda x: int(x.split('-')[0].replace(',', '')))
    df['upper_pay'] = df[pay].apply(lambda x: int(x.split('-')[1].replace(',', '')) 
                                    if x.split('-')[0] != '500,000' 
                                    else 1000000)
    df['middle_pay'] = -(-(df['upper_pay'] + df['lower_pay']) // 2)

    # Rename columns
    df.columns = ['id', 'Pay', 'Country', 'lower_pay', 'upper_pay', 'middle_pay']

    # Add the price of a McDonalds Menu
    df = df.replace({'United States of America':'United States',
                     'Viet Nam':'Vietnam',
                     'United Kingdom of Great Britain and Northern Ireland' : 'United Kingdom',
                     'Iran, Islamic Republic of...' : 'Iran',
                     'Republic of Korea' : 'South Korea'})

    merged = pd.merge(left=df, right=col_data[['Country', 'McMeal($)']], how='inner', on='Country')

    # Remove countries where no McMeal price was found (meaning Ghana)
    merged = merged[merged['Country'] != 'Ghana']

    # Compute the McMeal(units)
    merged['McMeal(units)'] = -(-merged['middle_pay'] // merged['McMeal($)']).astype(int)
    
    return merged[['id', 'Country', 'middle_pay', 'McMeal(units)']]

      

        

def show_values_on_bars(axs, h_v="v", space=0.4):
    '''Plots the value at the end of each bar of a seaborn barplot.
    axs: the ax of the plot
    h_v: whether the barplot is vertical ("v") or horizontal ("h")'''
    
    def _show_on_single_plot(ax):
        if h_v == "v":
            for p in ax.patches:
                _x = p.get_x() + p.get_width() / 2
                _y = p.get_y() + p.get_height()
                value = int(p.get_height())
                ax.text(_x, _y, format(value, ','), ha="center") 
        elif h_v == "h":
            for p in ax.patches:
                _x = p.get_x() + p.get_width() + float(space)
                _y = p.get_y() + p.get_height()
                value = int(p.get_width())
                ax.text(_x, _y, format(value, ','), ha="left")

    if isinstance(axs, np.ndarray):
        for idx, ax in np.ndenumerate(axs):
            _show_on_single_plot(ax)
    else:
        _show_on_single_plot(axs)
        
        

def segment_units(x):
    '''Segments the McMeal units into Other, High, Very High and Crazy High.
    Use with the .apply() function on the McMeal(units) column.'''
    
    if x < 10000:
        return 'Other'
    elif x >= 10000 and x < 20000:
        return 'High'
    elif x >= 20000 and x < 50000:
        return 'Very High'
    else:
        return 'Crazy High'
    
    
    
def frame_image(img, frame_width):
    '''Adds a border to the image.
    source: https://stackoverflow.com/questions/43261338/adding-a-border-to-and-image-in-my-code'''
    
    b = frame_width # border size in pixel
    ny, nx = img.shape[0], img.shape[1] # resolution / number of pixels in x and y
    if img.ndim == 3: # rgb or rgba array
        framed_img = np.zeros((b+ny+b, b+nx+b, img.shape[2]))
    elif img.ndim == 2: # grayscale image
        framed_img = np.zeros((b+ny+b, b+nx+b))
    framed_img[b:-b, b:-b] = img
    return framed_img



def offset_image(x, y, label, bar_is_too_short, ax, zoom, offset):
    '''For adding flags to the graph.
    source: https://stackoverflow.com/questions/61971090/how-can-i-add-images-to-bars-in-axes-matplotlib'''
    
    img = plt.imread(f'../input/2020-cost-of-living/flags/{label} flag-icon-256.png')
    img = frame_image(img, 2)
    im = OffsetImage(img, zoom=zoom)
    im.image.axes = ax
    x_offset = offset
#     if bar_is_too_short:
#         x = 0
    ab = AnnotationBbox(im, (x, y), xybox=(x_offset, 0), frameon=False,
                        xycoords='data', boxcoords="offset points", pad=0)
    ax.add_artist(ab)
    
    
def offset_png(x, y, path, ax, zoom, offset, border=2):
    '''For adding other .png images to the graph.
    source: https://stackoverflow.com/questions/61971090/how-can-i-add-images-to-bars-in-axes-matplotlib'''
    
    img = plt.imread(path)
    img = frame_image(img, border)
    im = OffsetImage(img, zoom=zoom)
    im.image.axes = ax
    x_offset = offset
    ab = AnnotationBbox(im, (x, y), xybox=(x_offset, 0), frameon=False,
                        xycoords='data', boxcoords="offset points", pad=0)
    ax.add_artist(ab)
    
    
    
def gender_coding(x):
    '''To encode gender - for pie graph readability.'''
    if x == "Man":
        return "M"
    elif x == "Woman":
        return "W"
    elif x == "Prefer not to say":
        return "An"
    elif x == "Nonbinary":
        return "N"
    else:
        return "SD"
    
    
def code_education(x):
    if x == "Bachelor’s degree":
        return "Bachelor"
    elif x == "Doctoral degree":
        return "Doctoral"
    elif x == "I prefer not to answer":
        return "Anon"
    elif x == "Master’s degree":
        return "Master"
    elif x == "No formal education past high school":
        return "HS"
    elif x == "Professional degree":
        return "Prof"
    else:
        return "College"
    
    
    
def encode_mlLong(x):
    '''Encoding for question: how long have you been writing ML code?'''
    if x == "Under 1 year":
        return "< 1 years"
    elif x == "I do not use machine learning methods":
        return "Never"
    elif x == "20 or more years":
        return "20+ years"
    else:
        return x
    

def encode_codeLong(x):
    '''Encoding for question: how long have you been writing code?'''
    if x == "I have never written code":
        return "Never"
    else:
        return x
    
    
    
def code_role(x):
    if x == "Product/Project Manager":
        return "Project Manager"
    elif x == "Machine Learning Engineer":
        return "ML Engineer"
    elif x == "DBA/Database Engineer":
        return "DB Engineer"
    elif x == 'Data Analyst':
        return "Analyst"
    elif x == 'Business Analyst':
        return "Analyst"
    elif x == 'Statistician':
        return "Analyst"
    else:
        return x
    
    

def prep_role_duties_data(df):
    '''Function to prepare data for the Sankey - Role and Duties at Work subchapter.'''
    
    # Columns
    role = "Select the title most similar to your current role (or most recent title if retired): - Selected Choice"
    dutie = ["Analyze and understand data to influence product or business decisions",
         "Build and/or run the data infrastructure that my business uses for storing, analyzing, and operationalizing data",
         "Build prototypes to explore applying machine learning to new areas",
         "Build and/or run a machine learning service that operationally improves my product or workflows",
         "Do research that advances the state of the art of machine learning",
         "None of these activities are an important part of my role at work"]
    question = "Select any activities that make up an important part of your role at work: (Select all that apply) - Selected Choice - "
    duties = [question + d for d in dutie]

    # Select data
    columns = duties + [role]
    dt = df[df[role] != "Other"][columns]

    # Rename
    dt.columns = ["dutie1", "dutie2", "dutie3", "dutie4", "dutie5", "dutie6", "role"]
    
    # Apply role coding
    dt["role"] = dt["role"].apply(lambda x: code_role(x))
    
    return dt


def code_ml_spent(x):
    if x == "$0 ($USD)":
        return "0"
    elif x == "$100,000 or more ($USD)":
        return ">100,000"
    else:
        return x
    
    
def code_advice(x):
    if x == "Python":
        return "Python"
    elif x == "R":
        return "R"
    elif x == "SQL":
        return "SQL"
    else:
        return "Other"
    
    
    
def get_environment_questions():
    '''Returns all column names for the environment portion.'''
    
    # Environment
    ides = ["Which of the following integrated development environments (IDE's) do you use on a regular basis?  (Select all that apply) - Selected Choice - Jupyter (JupyterLab, Jupyter Notebooks, etc) ",
            "Which of the following integrated development environments (IDE's) do you use on a regular basis?  (Select all that apply) - Selected Choice -  RStudio ",
            "Which of the following integrated development environments (IDE's) do you use on a regular basis?  (Select all that apply) - Selected Choice -  Visual Studio / Visual Studio Code ",
            "Which of the following integrated development environments (IDE's) do you use on a regular basis?  (Select all that apply) - Selected Choice -  PyCharm ",
            "Which of the following integrated development environments (IDE's) do you use on a regular basis?  (Select all that apply) - Selected Choice -   Spyder  ",
            "Which of the following integrated development environments (IDE's) do you use on a regular basis?  (Select all that apply) - Selected Choice -   Notepad++  ",
            "Which of the following integrated development environments (IDE's) do you use on a regular basis?  (Select all that apply) - Selected Choice -   Sublime Text  ",
            "Which of the following integrated development environments (IDE's) do you use on a regular basis?  (Select all that apply) - Selected Choice -   Vim / Emacs  ",
            "Which of the following integrated development environments (IDE's) do you use on a regular basis?  (Select all that apply) - Selected Choice -  MATLAB "]


    notebooks = ["Which of the following hosted notebook products do you use on a regular basis?  (Select all that apply) - Selected Choice -  Kaggle Notebooks",
                 "Which of the following hosted notebook products do you use on a regular basis?  (Select all that apply) - Selected Choice - Colab Notebooks",
                 "Which of the following hosted notebook products do you use on a regular basis?  (Select all that apply) - Selected Choice - Azure Notebooks",
                 "Which of the following hosted notebook products do you use on a regular basis?  (Select all that apply) - Selected Choice -  Paperspace / Gradient ",
                 "Which of the following hosted notebook products do you use on a regular basis?  (Select all that apply) - Selected Choice -  Binder / JupyterHub ",
                 "Which of the following hosted notebook products do you use on a regular basis?  (Select all that apply) - Selected Choice -  Code Ocean ",
                 "Which of the following hosted notebook products do you use on a regular basis?  (Select all that apply) - Selected Choice -  IBM Watson Studio ",
                 "Which of the following hosted notebook products do you use on a regular basis?  (Select all that apply) - Selected Choice -  Amazon Sagemaker Studio ",
                 "Which of the following hosted notebook products do you use on a regular basis?  (Select all that apply) - Selected Choice -  Amazon EMR Notebooks ",
                 "Which of the following hosted notebook products do you use on a regular basis?  (Select all that apply) - Selected Choice - Google Cloud AI Platform Notebooks ",
                 "Which of the following hosted notebook products do you use on a regular basis?  (Select all that apply) - Selected Choice - Google Cloud Datalab Notebooks",
                 "Which of the following hosted notebook products do you use on a regular basis?  (Select all that apply) - Selected Choice -  Databricks Collaborative Notebooks "]

    computation = "What type of computing platform do you use most often for your data science projects? - Selected Choice"
    
    all_columns = ides.copy()
    all_columns.extend(notebooks)
    all_columns.append(computation)
    
    return all_columns



def multiple_choice_prep(data, category_name="Categ", var_name="IDE"):
    '''Receives a dataframe whose columns are the multiple choice questions plus the category column.
    Returns the cleaned dataframe.'''
    
    data = data.melt(id_vars=[category_name], var_name="Name", value_name="count").dropna().reset_index(drop=True)
    data = data.groupby(category_name)["count"].value_counts().unstack().reset_index()
    data = data.melt(id_vars=[category_name], var_name=var_name, value_name="count").dropna().reset_index(drop=True)
    
    return data

def one_choice_prep(data, category_name="Categ", count_on="Computing Platform", var_name="Computing"):
    '''Receives a dataframe containing the choices in the first column and the category in the second.
    Returns the cleaned, grouped version of the data.'''
    
    data = data.groupby(category_name)[count_on].value_counts().unstack().reset_index()
    data = data.melt(id_vars=[category_name], var_name=var_name, value_name="count").dropna().reset_index(drop=True)
    
    return data




def code_ide(x):
    if x == "Jupyter (JupyterLab, Jupyter Notebooks, etc) ":
        return "Jupyter"
    else:
        return x
    
    
def code_notebook(x):
    if x == "Colab Notebooks":
        return "Colab"
    elif x == " Kaggle Notebooks":
        return "Kaggle"
    elif x == "Google Cloud Datalab Notebooks":
        return "Google Cloud Datalab"
    elif x == "Azure Notebooks":
        return "Azure"
    elif x == "Google Cloud AI Platform Notebooks ":
        return "Google Cloud AI Platform"
    elif x == " Databricks Collaborative Notebooks ":
        return "Databricks Collaborative"
    elif x == " Amazon EMR Notebooks ":
        return "Amazon EMR"
    else:
        return x.strip()
    
    
def code_course(x):
    if x == "Cloud-certification programs (direct from AWS, Azure, GCP, or similar)":
        return "Cloud Programs"
    elif x == "Kaggle Learn Courses":
        return "Kaggle Learn"
    elif x == "University Courses (resulting in a university degree)":
        return "University Courses"
    else:
        return x
    

    
def code_media(x):
    if x == "Blogs (Towards Data Science, Analytics Vidhya, etc)":
        return "Blogs"
    elif x == "Course Forums (forums.fast.ai, Coursera forums, etc)":
        return "Course Forums"
    elif x == "Email newsletters (Data Elixir, O'Reilly Data & AI, etc)":
        return "Newsletters"
    elif x == "Journal Publications (peer-reviewed journals, conference proceedings, etc)":
        return "Publications"
    elif x == "Kaggle (notebooks, forums, etc)":
        return "Kaggle"
    elif x == "Podcasts (Chai Time Data Science, O’Reilly Data Show, etc)":
        return "Podcasts"
    elif x == "Reddit (r/machinelearning, etc)":
        return "Reddit"
    elif x == "Slack Communities (ods.ai, kagglenoobs, etc)":
        return "Slack"
    elif x == "Twitter (data science influencers)":
        return "Twitter"
    else:
        return "YouTube"


    
def get_reliable_sources(df):
    '''Input the original dataframe and returns 2 tables containing the courses & media information.
    Used for the "Reliable Sources" chapter.'''

    # Social Media and First Steps

    # Columns
    courses = ["On which platforms have you begun or completed data science courses? (Select all that apply) - Selected Choice - Coursera",
               "On which platforms have you begun or completed data science courses? (Select all that apply) - Selected Choice - edX",
               "On which platforms have you begun or completed data science courses? (Select all that apply) - Selected Choice - Kaggle Learn Courses",
               "On which platforms have you begun or completed data science courses? (Select all that apply) - Selected Choice - DataCamp",
               "On which platforms have you begun or completed data science courses? (Select all that apply) - Selected Choice - Fast.ai",
               "On which platforms have you begun or completed data science courses? (Select all that apply) - Selected Choice - Udacity",
               "On which platforms have you begun or completed data science courses? (Select all that apply) - Selected Choice - Udemy",
               "On which platforms have you begun or completed data science courses? (Select all that apply) - Selected Choice - LinkedIn Learning",
               "On which platforms have you begun or completed data science courses? (Select all that apply) - Selected Choice - Cloud-certification programs (direct from AWS, Azure, GCP, or similar)",
               "On which platforms have you begun or completed data science courses? (Select all that apply) - Selected Choice - University Courses (resulting in a university degree)"]
    media = ["Who/what are your favorite media sources that report on data science topics? (Select all that apply) - Selected Choice - Twitter (data science influencers)",
             "Who/what are your favorite media sources that report on data science topics? (Select all that apply) - Selected Choice - Email newsletters (Data Elixir, O\'Reilly Data & AI, etc)",
             "Who/what are your favorite media sources that report on data science topics? (Select all that apply) - Selected Choice - Reddit (r/machinelearning, etc)",
             "Who/what are your favorite media sources that report on data science topics? (Select all that apply) - Selected Choice - Kaggle (notebooks, forums, etc)",
             "Who/what are your favorite media sources that report on data science topics? (Select all that apply) - Selected Choice - Course Forums (forums.fast.ai, Coursera forums, etc)",
             "Who/what are your favorite media sources that report on data science topics? (Select all that apply) - Selected Choice - YouTube (Kaggle YouTube, Cloud AI Adventures, etc)",
             "Who/what are your favorite media sources that report on data science topics? (Select all that apply) - Selected Choice - Podcasts (Chai Time Data Science, O’Reilly Data Show, etc)",
             "Who/what are your favorite media sources that report on data science topics? (Select all that apply) - Selected Choice - Blogs (Towards Data Science, Analytics Vidhya, etc)",
             "Who/what are your favorite media sources that report on data science topics? (Select all that apply) - Selected Choice - Journal Publications (peer-reviewed journals, conference proceedings, etc)",
             "Who/what are your favorite media sources that report on data science topics? (Select all that apply) - Selected Choice - Slack Communities (ods.ai, kagglenoobs, etc)"]

    courses.append(categ)
    media.append(categ)

    courses_data = df[courses]
    media_data = df[media]

    courses_data.columns = ["Coursera", "edX", "Kaggle Learn", "DataCamp", "Fast.ai",
                            "Udacity", "Udemy", "LinkedIn Learn", "Cloud programs", "University", "Categ"]
    media_data.columns = ["Twitter", "Newsletter", "Reddit", "Kaggle", "Course Forums",
                          "YouTube", "Podcasts", "Blogs", "Publications", "Slack", "Categ"]

    courses_data = multiple_choice_prep(courses_data, category_name="Categ", var_name="Course")
    media_data = multiple_choice_prep(media_data, category_name="Categ", var_name="Media")

    courses_data["Course"] = courses_data["Course"].apply(lambda x: code_course(x))
    media_data["Media"] = media_data["Media"].apply(lambda x: code_media(x))
    
    return courses_data, media_data



def code_viz(x):
    if x == " Ggplot / ggplot2 ":
        return "Ggplot"
    elif x == " Leaflet / Folium ":
        return "Folium"
    elif x == " Plotly / Plotly Express ":
        return "Plotly"
    else:
        return x.strip()
    
    
def get_frameworks(df):
    '''Prepares data for framework subchapter.
    Returns viz_data and framework_data.'''
    viz_cols = [col for col in df.columns if 
                "What data visualization libraries or tools do you use on a regular basis?" in col]
    framework_cols = [col for col in df.columns if 
                      "Which of the following machine learning frameworks do you use on a regular basis?" in col]

    # Exclude none & others
    viz_cols = viz_cols[:-2] ; viz_cols.append(categ)
    framework_cols = framework_cols[:-2] ; framework_cols.append(categ)

    # Data
    viz_data = df[viz_cols]
    framework_data = df[framework_cols]

    viz_data = multiple_choice_prep(viz_data, category_name="Category", var_name="Visualization")
    framework_data = multiple_choice_prep(framework_data, category_name="Category", var_name="Frameworks")

    viz_data["Visualization"] = viz_data["Visualization"].apply(lambda x: code_viz(x))
    
    return viz_data, framework_data



def code_ml(x):
    if x == "Bayesian Approaches":
        return "Bayesian"
    elif x == "Convolutional Neural Networks":
        return "CNNs"
    elif x == "Decision Trees or Random Forests":
        return "Tree Based"
    elif x == "Dense Neural Networks (MLPs, etc)":
        return "Dense NNs"
    elif x == "Evolutionary Approaches":
        return "Evolutionary"
    elif x == "Generative Adversarial Networks":
        return "GANs"
    elif x == "Gradient Boosting Machines (xgboost, lightgbm, etc)":
        return "Gradient Boosting"
    elif x == "Linear or Logistic Regression":
        return "Regressions"
    elif x == "Recurrent Neural Networks":
        return "RNNs"
    else:
        return "Transformers"
    
    
    
def code_compvis(x):
    if x == "General purpose image/video tools (PIL, cv2, skimage, etc)":
        return "General Image"
    elif x == "Generative Networks (GAN, VAE, etc)":
        return "GANs"
    elif x == "Image classification and other general purpose networks (VGG, Inception, ResNet, ResNeXt, NASNet, EfficientNet, etc)":
        return "Image Classification"
    elif x == "Image segmentation methods (U-Net, Mask R-CNN, etc)":
        return "Image Segmentation"
    else:
        return "Object Detection Methods"
    
    
def code_nlp(x):
    if x == "Contextualized embeddings (ELMo, CoVe)":
        return "Context Embeddings"
    elif x == "Encoder-decorder models (seq2seq, vanilla transformers)":
        return "Encoder-Decoder"
    elif x == "Transformer language models (GPT-3, BERT, XLnet, etc)":
        return "Transformers"
    else:
        return "Word Embeddings"
    
    
def get_ml_dl_data(df):
    '''Returns 3 tables for the ML and DL subchapter.'''

    ml_cols = [col for col in df.columns if 
               "Which of the following ML algorithms do you use on a regular" in col][:-2]
    compvision_cols = [col for col in df.columns if 
                       "Which categories of computer vision methods do you use on a regular basis?" in col][:-2]
    nlp_cols = [col for col in df.columns if
                "Which of the following natural language processing (NLP) methods do you use on a regular basis?" in col][:-2]

    ml_cols.append(categ)
    compvision_cols.append(categ)
    nlp_cols.append(categ)

    ml_data = df[ml_cols]
    compvis_data = df[compvision_cols]
    nlp_data = df[nlp_cols]

    # Data prep
    ml_data = multiple_choice_prep(ml_data, category_name="Category", var_name="ML")
    compvis_data = multiple_choice_prep(compvis_data, category_name="Category", var_name="CompVis")
    nlp_data = multiple_choice_prep(nlp_data, category_name="Category", var_name="NLP")

    ml_data["ML"] = ml_data["ML"].apply(lambda x: code_ml(x))
    compvis_data["CompVis"] = compvis_data["CompVis"].apply(lambda x: code_compvis(x))
    nlp_data["NLP"] = nlp_data["NLP"].apply(lambda x: code_nlp(x))
    
    return ml_data, compvis_data, nlp_data



def prep_ml_extended(df):

    # ML Products
    ml_prods = [col for col in df.columns if 
                "Do you use any of the following machine learning products on a regular basis?" in col]
    ml_prods_f = [col for col in df.columns if 
                  "In the next 2 years, do you hope to become more familiar with any of these specific machine learning" in col]

    # ML Experiments
    ml_experm = [col for col in df.columns if "Do you use any tools to help manage machine learning experiments?" in col]
    ml_experm_f = [col for col in df.columns if 
                   "In the next 2 years, do you hope to become more familiar with any of these tools for managing ML experiments?" in col]

    # Auto ML
    automl1 = [col for col in df.columns if 
               "Do you use any automated machine learning tools (or partial AutoML tools) on a regular basis?" in col]
    automl2 = [col for col in df.columns if 
               "Which of the following automated machine learning tools (or partial AutoML tools) do you use" in col]
    automl1_f = [col for col in df.columns if 
                 "Which categories of automated machine learning tools (or partial AutoML tools) do you hope to" in col]
    automl2_f = [col for col in df.columns if 
                 "Which specific automated machine learning tools (or partial AutoML tools) do you hope to become" in col]

    # Append category
    ml_prods.append(categ)
    ml_prods_f.append(categ)
    ml_experm.append(categ)
    ml_experm_f.append(categ)
    automl1.append(categ)
    automl2.append(categ)
    automl1_f.append(categ)
    automl2_f.append(categ)


    # Create dataframes
    ml_prods_data = df[ml_prods]
    ml_prods_f_data = df[ml_prods_f]
    ml_experm_data = df[ml_experm]
    ml_experm_f_data = df[ml_experm_f]
    automl1_data = df[automl1]
    automl2_data = df[automl2]
    automl1_f_data = df[automl1_f]
    automl2_f_data = df[automl2_f]

    # Change dataframes
    var_names = ["Prods", "Prods F", "Experiment", "Experiment F",
                 "AutoML1", "AutoML2", "AutoML1 F", "AutoML2 F"]

    ml_prods_data = multiple_choice_prep(data=ml_prods_data, category_name="Category", var_name=var_names[0])
    ml_prods_f_data = multiple_choice_prep(data=ml_prods_f_data, category_name="Category", var_name=var_names[1])
    ml_experm_data = multiple_choice_prep(data=ml_experm_data, category_name="Category", var_name=var_names[2])
    ml_experm_f_data = multiple_choice_prep(data=ml_experm_f_data, category_name="Category", var_name=var_names[3])
    automl1_data = multiple_choice_prep(data=automl1_data, category_name="Category", var_name=var_names[4])
    automl2_data = multiple_choice_prep(data=automl2_data, category_name="Category", var_name=var_names[5])
    automl1_f_data = multiple_choice_prep(data=automl1_f_data, category_name="Category", var_name=var_names[6])
    automl2_f_data = multiple_choice_prep(data=automl2_f_data, category_name="Category", var_name=var_names[7])
    
    return ml_prods_data, ml_prods_f_data, ml_experm_data, ml_experm_f_data, automl1_data, automl2_data, automl1_f_data, automl2_f_data


def code_mlprods(x):
    x = x.strip()
    if x == "Azure Cognitive Services":
        return "Azure Cognitive S."
    elif x == "Azure Machine Learning Studio":
        return "Azure ML Studio"
    elif x == "Google Cloud AI Platform / Google Cloud ML Engine":
        return "Google Cloud AI/ML"
    elif x == "Google Cloud Natural Language":
        return "Google Cloud NLP"
    elif x == "No / None":
        return "None"
    else:
        return x
    
    
def code_mlexper(x):
    x = x.strip()
    if x == "Sacred + Omniboard":
        return "Sacred+Omniboard"
    elif x == "Weights & Biases":
        return "Weights&Biases"
    elif x == "No / None":
        return "None"
    else:
        return x


def code_automl(x):
    x = x.strip()
    
    if x == "Automated data augmentation (e.g. imgaug, albumentations)":
        return "Data Augmentation"
    elif x == "Automated feature engineering/selection (e.g. tpot, boruta_py)":
        return "Feature Engineering"
    elif x == "Automated hyperparameter tuning (e.g. hyperopt, ray.tune, Vizier)":
        return "Hyperparam. Tuning"
    elif x == "Automated model architecture searches (e.g. darts, enas)":
        return "Model Architectures"
    elif x == "Automated model selection (e.g. auto-sklearn, xcessiv)":
        return "Model Selection"
    elif x == "Automation of full ML pipelines (e.g. Google AutoML, H20 Driverless AI)":
        return "Full ML Pipeline"
    else:
        return "None"
    
    
def code_deploy(x):
    x = x.strip()
    
    if x == "I do not share my work publicly":
        return "I don't share"
    else:
        return x
    
def code_tool(x):
    if x == "Local development environments (RStudio, JupyterLab, etc.)":
        return "RStudio/Jupyter etc."
    elif x == "Basic statistical software (Microsoft Excel, Google Sheets, etc.)":
        return "Excel/Google Sheets etc."
    elif x == "Cloud-based data software & APIs (AWS, GCP, Azure, etc.)":
        return "AWS/Azure etc."
    elif x == "Advanced statistical software (SPSS, SAS, etc.)":
        return "SPSS/SAS etc."
    elif x == "Business intelligence software (Salesforce, Tableau, Spotfire, etc.)":
        return "Salesforce/Tableau etc."
    else:
        return "Other"
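
To make the reshaping in `multiple_choice_prep` concrete, here is a minimal sketch on made-up data. The toy column names and values are assumptions for illustration only; the three steps mirror the helper above.

```python
import pandas as pd

# Toy stand-in for the survey (hypothetical data): one column per answer
# option, missing when the respondent did not select that option.
toy = pd.DataFrame({
    "IDE - Jupyter": ["Jupyter", None, "Jupyter"],
    "IDE - PyCharm": [None, "PyCharm", "PyCharm"],
    "Categ": ["High", "Other", "High"],
})

# Step 1: wide -> long, dropping the options a respondent did not select.
long_form = (toy.melt(id_vars=["Categ"], var_name="Name", value_name="count")
                .dropna()
                .reset_index(drop=True))

# Step 2: count how often each option was selected within each category.
counts = long_form.groupby("Categ")["count"].value_counts().unstack().reset_index()

# Step 3: back to long form, one row per (category, option) pair that occurs.
tidy = (counts.melt(id_vars=["Categ"], var_name="IDE", value_name="count")
              .dropna()
              .reset_index(drop=True))

print(tidy)
```

Because `unstack()` fills combinations that never occur with NaN, the counts come back as floats, and the final `dropna()` removes those empty pairs.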

# 1. Introducing the subject 
**Ahoy**! If you are like me, you've always looked up to the **top Data Scientists**, the **cream of the community**, the very few who, through some *magic* only they know, manage to understand, teach and perform like no others.

**But how**? This survey is an opportunity to get *up close and personal* with this curiosity, in the <span style="background:#fed56f; font-weight:bold; color:black">search for the treasure</span> that might reveal to us the steps to reach that greatness.

**Buckle up, pirates; the treasure hunt is <span style="background:#8EEA7D; font-weight:bold; color:black">ON</span>**.

<h3 style="font-family: 'Source Code Pro', monospace">Criteria</h3>

Unfortunately, some *bias* is unavoidable.

In the Kaggle Survey, there is no feature that identifies which respondents are Masters/Grandmasters: neither rank nor any measure of performance within the Kaggle community.

Hence, I had to *define* what a *successful* data scientist is. The criteria I considered were:


<style type="text/css">
.tg  {border-collapse:collapse;
     border-spacing:0;}
.tg td{border-color:#010307;
    border-style:solid;
    border-width:1px;
    font-family:'Source Code Pro', monospace;
    font-size:14px;
    overflow:hidden;
    padding:10px 5px;
    word-break:normal;}
.tg th{border-color:#010307;
    border-style:solid;
    border-width:1px;
    font-family:'Source Code Pro', monospace;
    font-size:14px;
    font-weight:normal;
    overflow:hidden;
    padding:10px 5px;
    word-break:normal;}
.tg .tg-c3ow{border-color:#010307;
    text-align:center;
    vertical-align:top}
</style>
<table class="tg">
<thead>
  <tr>
    <th class="tg-c3ow">Criteria</th>
    <th class="tg-c3ow">Value</th>
    <th class="tg-c3ow">Strength</th>
    <th class="tg-c3ow">Flaw</th>
  </tr>
</thead>
<tbody>
  <tr>
    <td class="tg-c3ow">Level of Education</td>
    <td class="tg-c3ow">Masters / Doctoral</td>
    <td class="tg-c3ow">The majority of Grandmasters have<br>some level of higher education</td>
    <td class="tg-c3ow">Many bright data scientists are<br>self-taught</td>
  </tr>
  <tr>
    <td class="tg-c3ow">Years of Programming/ML</td>
    <td class="tg-c3ow">4 - 5+ years</td>
    <td class="tg-c3ow">Might show increased experience</td>
    <td class="tg-c3ow">The passing of time doesn't reflect<br>how much one has learned</td>
  </tr>
  <tr>
    <td class="tg-c3ow">Pay</td>
    <td class="tg-c3ow">USD 100,000+</td>
    <td class="tg-c3ow">Very high pay can signal high skill</td>
    <td class="tg-c3ow">High pay doesn't guarantee skill;<br>Regional bias also involved</td>
  </tr>
  <tr>
    <td class="tg-c3ow">Spending</td>
    <td class="tg-c3ow">USD 100,000+</td>
    <td class="tg-c3ow">A company investing thousands in ML<br>is investing in bright DS employees</td>
    <td class="tg-c3ow">How much a company spends on ML doesn't<br>define the skill of the employee</td>
  </tr>
</tbody>
</table>

<div class="alert simple-alert">
  <p>In the end, I decided to use <span style="background:#fed56f; font-weight:bold; color:black">pay</span> as my delimiter. First, I consider it the least biased of the four. Second, I can reduce its regional bias significantly (more on that in the next chapter). I also believe that highly skilled people usually have very high pay, and that the exceptions to that rule are few.</p>
</div>

#### Libraries below ⬇

In [None]:
# Results: https://www.kaggle.com/kaggle-survey-2020

# Libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib
import matplotlib.pyplot as plt
import matplotlib.pylab as pylab
import matplotlib.patches as patches
from pylab import text
import matplotlib.image as mpimg
from matplotlib.offsetbox import AnnotationBbox, OffsetImage
import plotly.graph_objects as go
from plotly.offline import init_notebook_mode, iplot
import plotly

import warnings
warnings.filterwarnings("ignore")
plotly.offline.init_notebook_mode(connected = True)


%matplotlib inline

# --- Set defaults of the notebook ---
sns.set(font="'Source Code Pro', monospace")
sns.set_style("whitegrid")
plt.rcParams["font.family"] = "'Source Code Pro', monospace"

# Color Palettes
treasure_colors = ["#703728", "#c86b25", "#dc9555", "#fed56f", "#c89a37"]
pirate_colors = ["#010307", "#395461", "#449FAF", "#B1F4FC", 
                 "#F4D499", "#835211"]
sns.palplot(sns.color_palette(treasure_colors))
sns.palplot(sns.color_palette(pirate_colors))

all_colors = treasure_colors.copy()
all_colors.extend(pirate_colors[1:])


# --- Data Import ---
data = pd.read_csv("../input/kaggle-survey-2020/kaggle_survey_2020_responses.csv", 
                   skiprows=1)
# Cost of Living Data
col_data = pd.read_csv("../input/2020-cost-of-living/cost of living 2020.csv")

# 2. How many McMeal menus can you buy?

## 2.1 The problem
Ok, so we agreed on using <span style="background:#fed56f; font-weight:bold; color:black">pay</span> as the indicator. However, this feature has many issues of its own:

<style type="text/css">
.tg  {border-collapse:collapse;
     border-spacing:0;}
.tg td{border-color:"#010307";
    border-style:solid;
    border-width:1px;
    font-family:'Source Code Pro', monospace;
    font-size:14px;
    overflow:hidden;
    padding:10px 5px;
    word-break:normal;}
.tg th{border-color:"#010307";
    border-style:solid;
    border-width:1px;
    font-family:'Source Code Pro', monospace;
    font-size:14px;
    font-weight:normal;
    overflow:hidden;
    padding:10px 5px;
    word-break:normal;}
.tg .tg-c3ow{border-color:"#010307";
    text-align:center;
    vertical-align:top}
</style>
<table class="tg">
<thead>
  <tr>
    <th class="tg-c3ow">No.</th>
    <th class="tg-c3ow">Problem</th>
  </tr>
</thead>
<tbody>
  <tr>
    <td class="tg-c3ow">1.</td>
    <td class="tg-c3ow">It doesn't take into account the purchasing power (e.g., a person in Ukraine with 90k a year may buy more "valuables" in their country than another person in Japan with 150k a year).</td>
  </tr>
  <tr>
    <td class="tg-c3ow">2.</td>
    <td class="tg-c3ow">It might be misunderstood: some people may have entered their salary in their local currency, even though this column is expressed in US dollars.</td>
  </tr>
  <tr>
    <td class="tg-c3ow">3.</td>
    <td class="tg-c3ow">More than 50% of the respondents didn't answer this question at all, so only half remain available.</td>
  </tr>
  <tr>
    <td class="tg-c3ow">4.</td>
    <td class="tg-c3ow">There are the occasional trolls who might give a lower or a much higher salary than they actually have.</td>
  </tr>
</tbody>
</table>

In [None]:
# Data
q = "What is your current yearly compensation (approximate $USD)?"
order = ['$0-999', '1,000-1,999', '2,000-2,999', '3,000-3,999', '4,000-4,999',
         '5,000-7,499', '7,500-9,999', '10,000-14,999',  '15,000-19,999',
         '20,000-24,999', '25,000-29,999', '30,000-39,999', '40,000-49,999', 
         '50,000-59,999', '60,000-69,999', '70,000-79,999', '80,000-89,999',
         '90,000-99,999', '100,000-124,999', '125,000-149,999', '150,000-199,999', 
         '200,000-249,999', '250,000-299,999', '300,000-500,000', '> $500,000']

# Plot
paper = mpimg.imread('../input/2020-cost-of-living/paper_scroll1.jpg')
imagebox = OffsetImage(paper, zoom=0.25)
xy = (0.5, 0.7)
ab = AnnotationBbox(imagebox, xy, frameon=False, pad=1, xybox=(1900, 19.7))

plt.figure(figsize=(16, 12))
plt.rcParams['figure.dpi'] = 360

ax = sns.countplot(y = data[q], order = order, orient="v", palette = "YlOrBr_r",
                   saturation=1)
show_values_on_bars(ax, h_v="h", space=0.4)
# ax.add_artist(ab)

plt.text(670, 20, '???', size=18, color=pirate_colors[1])
plt.plot([0, 650], [15.5, 15.5], lw=2, color=pirate_colors[1])
plt.plot([0, 650], [24.5, 24.5], lw=2, color=pirate_colors[1])
plt.plot([650, 650], [15.5, 24.5], lw=2, color=pirate_colors[1])
plt.plot([3, 3], [15.5, 24.5], lw=2, color=pirate_colors[1])

# Arrow
style = "Simple, tail_width=5, head_width=16, head_length=23"
kw = dict(arrowstyle=style, color=pirate_colors[1])
arrow = patches.FancyArrowPatch((1850, 1), (1850, 24),
                             connectionstyle="arc3,rad=-.15", **kw)
plt.gca().add_patch(arrow)

plt.text(1890, 1.5, 'less pay', size=14, color=pirate_colors[1])
plt.text(1650, 23, 'more pay', size=14, color=pirate_colors[1])


ax.set_xlabel("Frequency", size = 18, color = pirate_colors[0])
ax.set_ylabel("Pay", size = 18, color = pirate_colors[0])
ax.set_title("- Pay distribution among respondents -", 
             size = 26, color = treasure_colors[0], weight='bold')
plt.xticks([])
plt.yticks(fontsize=11)
sns.despine(left=True, bottom=True);

> Hence ... **what do we choose**? We can't just take everyone with a salary > 80k: this would be highly biased and inefficient (80k is lots of money in the UK, but not that much in the US). Unfortunately, points *2.*, *3.* and *4.* are systematic issues, so we'll have to go ahead and trust that the Kagglers answered the *pay* question to the best of their abilities.

## 2.2 The Solution 🍟

**Thanks to a good friend** who gave me this idea, we can all have a snack break now.

<img src="https://i.imgur.com/Tkhvnyg.png" width=700>

A McMeal may solve our regional bias problem. Instead of using <span style="background:#fed56f; font-weight:bold; color:black">pay</span> directly, we can look at **how many McMeals each respondent can buy with their salary in their own country**. Afterward, we can use the <span style="background:#fed56f; font-weight:bold; color:black">McMeal(units)</span> as our regionally adjusted indicator.

Now, the only thing that remains is deciding where to draw the line in McMeal units.
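
To make the conversion concrete, here is a minimal sketch of the idea. It is not the notebook's actual `get_McMeal_units` helper: the function names and the meal prices in `example_meal_price_usd` are made-up placeholders for illustration, not values from the cost-of-living dataset. It takes the midpoint of a pay bracket and divides it by a local meal price.

```python
import math

# Hypothetical meal prices in USD, for illustration only (not real data).
example_meal_price_usd = {"Country A": 8.0, "Country B": 4.0}


def bracket_midpoint(bracket: str) -> int:
    """Return the (rounded-up) midpoint of a pay bracket such as '10,000-14,999'."""
    cleaned = bracket.replace("$", "").replace(",", "").strip()
    lower, upper = (int(part) for part in cleaned.split("-"))
    return math.ceil((lower + upper) / 2)


def mcmeal_units_sketch(bracket: str, country: str) -> int:
    """How many meals the bracket midpoint buys at the country's meal price."""
    return int(bracket_midpoint(bracket) / example_meal_price_usd[country])


# Same nominal pay, very different purchasing power.
same_pay = "40,000-49,999"
units_a = mcmeal_units_sketch(same_pay, "Country A")
units_b = mcmeal_units_sketch(same_pay, "Country B")
print(units_a, units_b)
```

Under these placeholder prices, the same pay bracket buys twice as many meals in the cheaper country, which is exactly the regional effect we want the indicator to capture.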

In [None]:
# Get mcmeal_units
data = add_unique_id(data)
mcmeal_units = get_McMeal_units(data)
q3 = int(mcmeal_units['McMeal(units)'].quantile([0.75]).item())

# Plot
plt.figure(figsize=(16, 11))
plt.rcParams['figure.dpi'] = 360

ax = sns.boxplot(x = mcmeal_units['McMeal(units)'], color=pirate_colors[2])

plt.text(q3 + 800, 0.46, f'Q3 value: {q3:,} meals', size=13, color=treasure_colors[1], weight='bold')
plt.text(120000, 0.085, 'outliers we want to analyze', size=13, color=treasure_colors[1], weight='bold')

# Arrow
style = "Simple, tail_width=0.5, head_width=6, head_length=5"
kw = dict(arrowstyle=style, color=treasure_colors[1])
arrow = patches.FancyArrowPatch((35000, 0.42), (q3, 0.3),
                             connectionstyle="arc3,rad=.3", **kw)
plt.gca().add_patch(arrow)

# Fries
path='../input/2020-cost-of-living/fries.png'
offset_png(x=240000, y=0, path=path, ax=ax, zoom=0.05, offset=0, border=1)
offset_png(x=215000, y=0, path=path, ax=ax, zoom=0.05, offset=0, border=1)
offset_png(x=205000, y=0, path=path, ax=ax, zoom=0.05, offset=0, border=1)
offset_png(x=185000, y=0, path=path, ax=ax, zoom=0.05, offset=0, border=1)
offset_png(x=190000, y=0, path=path, ax=ax, zoom=0.05, offset=0, border=1)
offset_png(x=150000, y=0, path=path, ax=ax, zoom=0.05, offset=0, border=1)
offset_png(x=141000, y=0, path=path, ax=ax, zoom=0.05, offset=0, border=1)
offset_png(x=129000, y=0, path=path, ax=ax, zoom=0.05, offset=0, border=1)
offset_png(x=120000, y=0, path=path, ax=ax, zoom=0.05, offset=0, border=1)
offset_png(x=115000, y=0, path=path, ax=ax, zoom=0.05, offset=0, border=1)
offset_png(x=98000, y=0, path=path, ax=ax, zoom=0.05, offset=0, border=1)
offset_png(x=92500, y=0, path=path, ax=ax, zoom=0.05, offset=0, border=1)
offset_png(x=86000, y=0, path=path, ax=ax, zoom=0.05, offset=0, border=1)
offset_png(x=83500, y=0, path=path, ax=ax, zoom=0.05, offset=0, border=1)
offset_png(x=80500, y=0, path=path, ax=ax, zoom=0.05, offset=0, border=1)
offset_png(x=77500, y=0, path=path, ax=ax, zoom=0.05, offset=0, border=1)
offset_png(x=74500, y=0, path=path, ax=ax, zoom=0.05, offset=0, border=1)
offset_png(x=66500, y=0, path=path, ax=ax, zoom=0.05, offset=0, border=1)


plt.plot([26000, 245000], [0.045, 0.045], lw=2, color=treasure_colors[1])
plt.plot([26000, 25300], [0.045, 0.02], lw=2, color=treasure_colors[1])
plt.plot([245000, 245700], [0.045, 0.02], lw=2, color=treasure_colors[1])

ax.set_xlabel("McMeals", size = 15, color = pirate_colors[0])
ax.set_ylabel("")
ax.set_title("- McMeal : Units variability -", size = 26, color = pirate_colors[0], weight='bold')
# plt.xticks([])
plt.yticks(fontsize=11)
sns.despine(left=True)
ax.get_xaxis().set_major_formatter(
    matplotlib.ticker.FuncFormatter(lambda x, p: format(int(x), ',')));

About 2,300 people can buy at least 10,000 meals (the 10,000 threshold was chosen by looking at Q3). I will be analyzing these people from now on, segmented into <span style="background:#fed56f; font-weight:bold; color:black">three categories</span>:

<style type="text/css">
.tg  {border-collapse:collapse;
     border-spacing:0;}
.tg td{border-color:"#010307";
    border-style:solid;
    border-width:1px;
    font-family:'Source Code Pro', monospace;
    font-size:14px;
    overflow:hidden;
    padding:10px 5px;
    word-break:normal;}
.tg th{border-color:"#010307";
    border-style:solid;
    border-width:1px;
    font-family:'Source Code Pro', monospace;
    font-size:14px;
    font-weight:normal;
    overflow:hidden;
    padding:10px 5px;
    word-break:normal;}
.tg .tg-c3ow{border-color:"#010307";
    text-align:center;
    vertical-align:top}
</style>
<table class="tg">
<thead>
  <tr>
    <th class="tg-c3ow"><font>Category Name</font></th>
    <th class="tg-c3ow"><font>Meaning</font></th>
    <th class="tg-c3ow"><font>Number of respondents</font></th>
  </tr>
</thead>
<tbody>
  <tr>
    <td class="tg-c3ow">High</td>
    <td class="tg-c3ow"><b>10,000 - 20,000 units</b>: These people can buy more McMeals than at least <b>75%</b> of our base users.</td>
    <td class="tg-c3ow">~1600</td>
  </tr>
  <tr>
    <td class="tg-c3ow">Very High</td>
    <td class="tg-c3ow"><b>20,000 - 50,000 units</b>: These people can buy more McMeals than at least <b>90%</b> of our base users.</td>
    <td class="tg-c3ow">~600</td>
  </tr>
  <tr>
    <td class="tg-c3ow">Crazy High</td>
    <td class="tg-c3ow"><b>50,000 + units</b>: These people can buy more McMeals than at least <b>99%</b> of our base users.</td>
    <td class="tg-c3ow">~100</td>
  </tr>
</tbody>
</table>
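
The segmentation in the table above can be sketched as a small function. This is only an illustration: the notebook's own `segment_units` helper is defined elsewhere and may differ, and both the name `segment_units_sketch` and the "Below threshold" label are hypothetical. I am also assuming that each lower bound is inclusive.

```python
def segment_units_sketch(units: float) -> str:
    """Map a McMeal unit count to a pay category using the table's thresholds."""
    if units >= 50_000:
        return "Crazy High"
    if units >= 20_000:
        return "Very High"
    if units >= 10_000:
        return "High"
    # Hypothetical label for everyone below the 10,000-unit cut-off.
    return "Below threshold"


examples = [9_500, 10_000, 25_000, 120_000]
labels_for_examples = [segment_units_sketch(u) for u in examples]
print(labels_for_examples)
```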

In [None]:
x = mcmeal_units[mcmeal_units["McMeal(units)"] <= 100000]
x0 = 10000
x1 = 20000
x2 = 50000

plt.figure(figsize=(16, 11))

ax = sns.distplot(x["McMeal(units)"][::20], hist=False, kde=True, kde_kws = {'lw':3}, color= treasure_colors[0])
kde_x, kde_y = ax.lines[0].get_data()

p1 = plt.axvline(x=x0,color=pirate_colors[1], lw=2, ls="--")
p2 = plt.axvline(x=x1,color=pirate_colors[1], lw=2, ls="--")
p3 = plt.axvline(x=x2,color=pirate_colors[1], lw=2, ls="--")

ax.fill_between(kde_x, kde_y, where=(kde_x>x0) & (kde_x<x1) , 
                interpolate=True, color=treasure_colors[1])
ax.fill_between(kde_x, kde_y, where=(kde_x>x1) & (kde_x<x2) , 
                interpolate=True, color=treasure_colors[2])
ax.fill_between(kde_x, kde_y, where=(kde_x>x2) , 
                interpolate=True, color=treasure_colors[3])

plt.text(10350, 0.000075, '75%', size=13, color=pirate_colors[1], weight='bold')
plt.text(20350, 0.000075, '90%', size=13, color=pirate_colors[1], weight='bold')
plt.text(50350, 0.000075, '99%', size=13, color=pirate_colors[1], weight='bold')

ax.set_xlabel("Units", size = 15, color = pirate_colors[0])
ax.set_ylabel("")
ax.set_title("- McMeal Units Distribution -", size = 26, color = pirate_colors[0], weight='bold')
plt.yticks([])
plt.xticks(np.arange(min(x["McMeal(units)"])-34, max(x["McMeal(units)"])+1, 10000))
sns.despine(left=True)
ax.get_xaxis().set_major_formatter(
    matplotlib.ticker.FuncFormatter(lambda x, p: format(int(x), ',')));

## The Pareto Principle

Our McMeal units distribution matches the Pareto Principle very well: [roughly 80% of consequences come from 20% of the causes](https://en.wikipedia.org/wiki/Pareto_principle).

Hence, we are seeing something close to a **"top 20%" of the "top 20%" of the "top 20%"** situation, meaning:
* out of 9,893 total respondents, 2,301 (*~ 23%*) have at least 10,000 units
* out of the 2,301 respondents, 705 (*~ 31%*) have at least 20,000 units
* out of the 705 respondents, 100 (*~ 14%*) have at least 50,000 units

> What does this mean? No matter which category you look at, the distribution is always going to be **skewed to the right**. This is why <span style="background:#fed56f; font-weight:bold; color:black">segmentation</span> is so important here.
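
The nested shares can be checked directly from the counts quoted in the bullets above. This is just a quick arithmetic sketch; the variable names are mine, and the counts are copied from the text rather than recomputed from the survey file.

```python
# Counts quoted in the bullets above: all respondents, then each nested subgroup.
counts = [9893, 2301, 705, 100]
labels = ["all respondents", ">= 10,000 units", ">= 20,000 units", ">= 50,000 units"]

nested_shares = []
for parent, child, label in zip(counts, counts[1:], labels[1:]):
    share = child / parent * 100  # share of the previous (parent) group
    nested_shares.append(round(share, 1))
    print(f"{label}: {child:,} of {parent:,} = {share:.1f}%")
```

Each subgroup is a minority of the group it is drawn from, which is the repeated "small top slice" pattern that the Pareto comparison is pointing at.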

In [None]:
# Data
dist1 = mcmeal_units[mcmeal_units["McMeal(units)"] >= 10000]
dist2 = mcmeal_units[mcmeal_units["McMeal(units)"] >= 20000]
dist3 = mcmeal_units[mcmeal_units["McMeal(units)"] >= 50000]

# Plot
fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(16, 11))
fig.suptitle('- Pareto Principle -', size = 26, color = pirate_colors[0], weight='bold')
axs = [ax1, ax2, ax3]

sns.distplot(dist1["McMeal(units)"][::30], hist=False, kde=True, kde_kws = {'lw':3}, color= treasure_colors[0], ax=ax1)
sns.distplot(dist2["McMeal(units)"][::10], hist=False, kde=True, kde_kws = {'lw':3}, color= treasure_colors[0], ax=ax2)
sns.distplot(dist3["McMeal(units)"], hist=False, kde=True, kde_kws = {'lw':3}, color= treasure_colors[0], ax=ax3)

ax1.axvspan(20000, 200000, color=pirate_colors[4], alpha=0.5)
ax2.axvspan(50000, 200000, color=pirate_colors[4], alpha=0.5)

ax1.text(110000, 0.000035, '>= 20,000 --->', size=13, color=treasure_colors[2], weight='bold')
ax2.text(110000, 0.000026, '>= 50,000 --->', size=13, color=treasure_colors[2], weight='bold')

ax1.set_title('>= 10,000 units distribution', size = 13, color = treasure_colors[0], weight='bold')
ax2.set_title('>= 20,000 units distribution', size = 13, color = treasure_colors[0], weight='bold')
ax3.set_title('>= 50,000 units distribution', size = 13, color = treasure_colors[0], weight='bold')
for ax in axs:
    ax.set_ylabel("")
    ax.set_xlabel("")
    ax.set_yticks([])
    ax.get_xaxis().set_major_formatter(
        matplotlib.ticker.FuncFormatter(lambda x, p: format(int(x), ',')))
    
every_nth = 2
for n, label in enumerate(ax3.xaxis.get_ticklabels()):
    if n % every_nth != 0:
        label.set_visible(False)
        
sns.despine(left=True)
plt.subplots_adjust(left=None, bottom=None, right=None, top=0.86, wspace=None, hspace=None);


# --- Finalize Preprocessing ---
# Apply the Category and merge data together
mcmeal_units["Category"] = mcmeal_units["McMeal(units)"].apply(lambda x: segment_units(x))
df_all = pd.merge(left=data, right=mcmeal_units, how="inner", on="id")

# 3. On our way to finding the treasure

Alright! We established our feature, normalized it, excluded the regional bias, and segmented our target respondents. We can say **our map** is already laid out in front of us (well structured and ready, but still empty); hence we now need to follow the steps to our <span style="background:#fed56f; font-weight:bold; color:black">treasure</span>.

<div class="alert simple-alert">
  <p>The map is BLANK for the moment, but it will start revealing itself once we begin discovering new Realms.</p>
</div>


*📌 Note: From now on, when I mention **pay**, I am referring to the McMeal units obtained after normalizing pay :)*

<h2><span style="background:#F3E9D9; font-weight:normal; color:black">3.1 The Personal Profile</span></h2>

<h3 style="font-family: 'Source Code Pro', monospace">Where do they reside?</h3>

Some pointers we observe here:
* The top countries do **match the profile** of the majority **of masters/grandmasters** on Kaggle.
* The top 2 countries for all categories are **the USA** and **India**.
* **Japan, China, Indonesia, Russia, Canada, and the UK** are the other 6 places where these highly skilled people reside.
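
For readers who want to reproduce a "top N countries per category" ranking, here is a minimal sketch of the pattern on toy data. The country names and unit values are invented for illustration, and `top_n` is set to 2 only to keep the example small.

```python
import pandas as pd

# Toy data, invented for illustration (not survey values).
toy = pd.DataFrame({
    "Country": ["A", "B", "C", "A", "B", "C"],
    "Category": ["High", "High", "High", "Very High", "Very High", "Very High"],
    "McMeal(units)": [300, 100, 200, 50, 70, 60],
})

top_n = 2
top = (
    toy.groupby(["Country", "Category"], as_index=False)["McMeal(units)"].sum()
       .sort_values("McMeal(units)", ascending=False)  # rank globally first ...
       .groupby("Category")
       .head(top_n)                                    # ... then keep the top N rows per category
       .reset_index(drop=True)
)
print(top)
```

Sorting before the second `groupby` matters: `head` keeps rows in their current order, so each category ends up with its largest totals.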

In [None]:
# --- Filter on target respondents ---
df = df_all[df_all['McMeal(units)'] >= 10000]

# Data
country = 'Country'
units = 'McMeal(units)'
categ = 'Category'

x = df.groupby([country, categ])[units].sum().reset_index()\
                                        .sort_values(units, ascending=False).groupby(categ).head(5).reset_index(drop=True)
x["McMeal(units)"] = x["McMeal(units)"]/100000

# Incorporate all
datas = [x[x[categ] == "High"], x[x[categ] == "Very High"], x[x[categ] == "Crazy High"]]
labels = [datas[0][country].unique(), datas[1][country].unique(), datas[2][country].unique()]
height = 0.9


# Plot
fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(16,11))
fig.suptitle('- Top 5 Countries on Categories (in 100k) -', size = 26, color = pirate_colors[0],
             weight='bold')
axs = [ax1, ax2, ax3]

for data, ax in zip(datas, axs):
    sns.barplot(y = data[country], x = data[units], ax=ax, color = treasure_colors[2])

ax1.set_title('High Pay', size = 13, color = treasure_colors[0], weight='bold')
ax2.set_title('Very High Pay', size = 13, color = treasure_colors[0], weight='bold')
ax3.set_title('Crazy High Pay', size = 13, color = treasure_colors[0], weight='bold')


for ax in axs:
    ax.set_ylabel("")
    ax.set_xlabel("")
    ax.set_xticks([])
    ax.tick_params(axis='y', rotation=80)
    show_values_on_bars(ax, h_v="h", space=1)
    
for i in range(len(datas)):
    
    max_value = datas[i][units].max()
    for k, (label, value) in enumerate(zip(labels[i], datas[i][units].values)):
        offset_image(value, k, label, bar_is_too_short=value < max_value / 10, ax=axs[i],
                     zoom=0.14, offset=17)
        
sns.despine(left=True, bottom=True)
plt.subplots_adjust(left=None, bottom=None, right=None, top=None, wspace=0.3, hspace=None);

<h3 style="font-family: 'Source Code Pro', monospace">Who are they?</h3>

Let's take it step by step:
* As the survey population as a whole is **mostly male**, the "high end" people we're studying follow the same pattern.
* Looking at age, we can observe that the *average* age increases by category:
    * High Pay: the average age is around *35* yo
    * Very High Pay: the average age starts moving towards *40 yo and 45 yo* (these bars start to rise)
    * Crazy High Pay: the 30s drop suddenly, whereas the early *40s* stay the same. Later ages (*45, 50s*) are also visible.
* Hence, there is a clear, **direct correlation between high pay and age** - <span style="background:#fed56f; font-weight:bold; color:black">the older, the wiser, the wealthier</span>.

My personal opinion is that the 20s and 30s are hectic anyway, and the golden years come after 40. Glad that this survey also reflects that. 😁 [@Dieter](https://www.kaggle.com/christofhenkel) might agree with me as well. 👀

In [None]:
# --- OVERALL DATA ---
age = "What is your age (# years)?"
gender = "What is your gender? - Selected Choice"
categ = 'Category'

agc = df[[age, gender, categ]]



# Gender plot data 
d = agc[[gender, categ]].value_counts().reset_index()
d[gender] = d[gender].apply(lambda x: gender_coding(x))
datas = [d[d[categ] == "High"], d[d[categ] == "Very High"], d[d[categ] == "Crazy High"]]

# Plot
fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(16,11))
fig.suptitle('- Gender on Pay Categories -', size = 26, color = pirate_colors[0], weight='bold')
axs = [ax1, ax2, ax3]

for k, (data, ax) in enumerate(zip(datas, axs)):
    labels = data[gender].values
    sizes = data[0].values
    explode = (0.2, 0, 0, 0)
    
    if k == 0:
        explode = (0.2, 0, 0, 0, 0)
    
    colors = treasure_colors
    patch, texts = ax.pie(sizes, explode=explode,
                            colors=colors, startangle=90)
    ax.legend(patch, labels, loc="best",
              bbox_to_anchor=(0.3, 0.9))
    
    ax.axis('equal')

ax1.set_title('High Pay', size = 13, color = pirate_colors[0], weight='bold',pad=-30)
ax2.set_title('Very High Pay', size = 13, color = pirate_colors[0], weight='bold',pad=-30)
ax3.set_title('Crazy High Pay', size = 13, color = pirate_colors[0], weight='bold',pad=-30)

fig.text(0.15, 0.25, 'M:Male | W:Woman | An:Prefer not to say | N:Nonbinary | SD:Prefer to self describe', 
         size=12, color="#BDC3C7", weight='bold');
plt.subplots_adjust(left=None, bottom=None, right=None, top=0.93, wspace=None, hspace=None);

In [None]:
# Man & Woman View
y = agc[agc[gender].isin(['Man', 'Woman'])].reset_index(drop=True)
y = pd.concat([y, pd.get_dummies(y[gender])],
              axis=1)
y = y.groupby([categ, age])[['Man', 'Woman']].sum().reset_index()
datas = [y[y[categ] == "High"], y[y[categ] == "Very High"], y[y[categ] == "Crazy High"]]

# Plot
fig, axs = plt.subplots(ncols=6, sharey=True, figsize=(16, 11))
fig.suptitle('- Population Pyramid : view over Pay Categories -', size = 26, color = pirate_colors[0], 
             weight='bold')
pairs = [[0, 1], [2, 3], [4, 5]]

for pair, data in zip(pairs, datas):

    #define x and y limits
    y_axis = range(0, len(data))
    x_male = data['Man']
    x_female = data['Woman']

    #define male and female bars
    axs[pair[0]].barh(y_axis, x_male, align='center', color=pirate_colors[1])
    axs[pair[0]].set_title(f'Male - {data[categ].unique()[0]}',
                     color = treasure_colors[0], weight='bold', size=12)
    axs[pair[1]].barh(y_axis, x_female, align='center', color=treasure_colors[1])
    axs[pair[1]].set_title(f'Female - {data[categ].unique()[0]}',
                     color = treasure_colors[0], weight='bold', size=12)

    #adjust grid parameters and specify labels for y-axis
    axs[pair[1]].grid()
    axs[pair[0]].set(yticks=y_axis, yticklabels=data[age])
    axs[pair[0]].invert_xaxis()
    axs[pair[0]].grid()

    
plt.rcParams['figure.dpi'] = 360;

<div class="alert success-alert">
  <p><b>Takeaway</b>: We now know that the high-end respondents are mostly from the US and India, but are also located in Japan, China, Indonesia, the UK, Russia, or Canada, and fit a male, 30-45 yo profile.</p>
  <p>However, these "personal" aspects don't define an outstanding Data Scientist. Hence, we'll start from now on to look at education, expertise, work ethic, and knowledge, rather than focusing on biological, racial, or other environmental aspects.</p>
</div>

> Oh, and look! The first portion of the map is clear now!

<img src="https://i.imgur.com/MA7TBEF.png">

> Closer look 🔎
<img src="https://i.imgur.com/UlLtWKF.jpg" width=700>

<h2><span style="background:#F3E9D9; font-weight:normal; color:black">3.2 The Education</span></h2>

As for education, the 3 Pay Categories differ as follows:
* *High Pay* and *Very High Pay* have very similar distributions. **50% of the respondents** have or plan to complete a **Master's degree** in the next 2 years.
* However, *Crazy High Pay* takes ~15 percentage points from the *Master* category and adds them to the **Doctoral** category.
* Hence, a much larger share of extremely <span style="background:#fed56f; font-weight:bold; color:black">well-paid respondents choose to continue their studies to a Doctorate</span>.
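
The percentage-per-category comparison above can be sketched with `pd.crosstab`. This is an alternative to a groupby-and-merge approach, shown on toy data that I invented for illustration; the column names `Category` and `Education` are shortened stand-ins for the real survey columns.

```python
import pandas as pd

# Toy data, invented for illustration (not survey values).
toy = pd.DataFrame({
    "Category": ["High"] * 4 + ["Crazy High"] * 4,
    "Education": ["Master", "Master", "Doctoral", "Bachelor",
                  "Master", "Doctoral", "Doctoral", "Bachelor"],
})

# normalize="index" makes every row (pay category) sum to 1, so the
# categories can be compared even though their sizes differ.
share = pd.crosstab(toy["Category"], toy["Education"], normalize="index") * 100
share = share.round(0).astype(int)
print(share)
```

Comparing row percentages rather than raw counts is what lets us talk about "percentage points moving" from one degree to another across categories of very different size.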

In [None]:
education = "What is the highest level of formal education that \
you have attained or plan to attain within the next 2 years?"
categ = 'Category'

dt = df[[education, categ]].reset_index(drop=True)
dt = dt.groupby(categ)[education].value_counts().unstack(fill_value=0).reset_index()
dt = dt.melt(id_vars=[categ], var_name=education, value_name="count")
dt[education] = dt[education].apply(lambda x: code_education(x))

total = dt.groupby(categ)["count"].sum().reset_index()
dt = pd.merge(left=dt, right=total, on=categ)
dt["perc"] = round(dt["count_x"]/dt["count_y"]*100, 0).astype(int)

datas = [dt[dt[categ] == "High"], dt[dt[categ] == "Very High"], dt[dt[categ] == "Crazy High"]]



# Plot
fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(16,14))
fig.suptitle('- Education on Pay Categories -', size = 26, color = pirate_colors[0],
             weight='bold')
axs = [ax1, ax2, ax3]

order = ["Doctoral", "Master", "Bachelor", "Prof", "College", "HS", "Anon"]
for data, ax in zip(datas, axs):
    sns.barplot(y = data[education], x = data["perc"], ax=ax, color = treasure_colors[0],
                order = order)

ax1.set_title('High Pay (%)', size = 13, color = pirate_colors[0], weight='bold')
ax2.set_title('Very High Pay (%)', size = 13, color = pirate_colors[0], weight='bold')
ax3.set_title('Crazy High Pay (%)', size = 13, color = pirate_colors[0], weight='bold')

path='../input/2020-cost-of-living/education_white.png'
offset_png(x=50, y=1, path=path, ax=ax1, zoom=0.04, offset=-22, border=1)
offset_png(x=50, y=1, path=path, ax=ax2, zoom=0.04, offset=-22, border=1)
offset_png(x=31, y=0.03, path=path, ax=ax3, zoom=0.04, offset=-22, border=1)
offset_png(x=36, y=1, path=path, ax=ax3, zoom=0.04, offset=-22, border=1)

for ax in axs:
    ax.set_ylabel("")
    ax.set_xlabel("")
    ax.set_xticks([])
    ax.tick_params(axis='y', rotation=80)
    show_values_on_bars(ax, h_v="h", space=1)
        
sns.despine(left=True, bottom=True)
plt.subplots_adjust(left=None, bottom=0.3, right=None, top=None, wspace=0.22, hspace=None)
fig.text(0.10, 0.25, 
         'Anon:Prefer not to answer | HS:No formal education past high school | College:College study w/o earning a degree | Prof:Professional degree', 
         size=12, color="#BDC3C7", weight='bold');

<div class="alert success-alert">
  <p><b>Takeaway</b>: The high-end respondents mostly have a high to very high level of education; the majority have completed, or plan to complete, at least a Master's degree. The difference between High Pay and Crazy High Pay lies in how many choose to pursue their passion further to a Doctorate.</p>
  <p>As you may know, indeed, many Masters/Grandmasters mention (on Twitter, podcasts, etc.) that they hold a Doctorate in a field supported by Data Science work.</p>
</div>

> And the Education Mountains revealed themselves!

<img src="https://i.imgur.com/swURPrJ.jpg">

> Closer look 🔎
<img src="https://i.imgur.com/2tGUCaZ.jpg" width=700>

<h2><span style="background:#F3E9D9; font-weight:normal; color:black">3.3 The Expertise and Work Environment</span></h2>

Here is our chance to take a glimpse into the work environment and skills acquired. How long does it take to get there? How hard do we need to work? What job roles should we pursue?

<h3 style="font-family: 'Source Code Pro', monospace">For how long have they been practicing?</h3>

The most important points here:
* These exceptional people have **more coding experience** than ML in terms of time.
* Looking at the coding expertise, we see more than 50% of them having **10+ years of coding experience**.
* Looking at the ML expertise, the donut is more evenly split between **1 and 10 years**, with *less than 10% having 10+ years of experience*.
* What does this mean? It means that <span style="background:#fed56f; font-weight:bold; color:black">exceptional people do have lots of coding experience, but not necessarily ML experience</span>, and most of them may be on Kaggle because they found their passion in data recently (or in the last few years).
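
When counting experience bands like these, one way to keep the slices in a meaningful order is to reindex the counts by label rather than by position. The sketch below uses hypothetical band labels and toy answers; the labels produced by the notebook's encoding helpers may be different.

```python
import pandas as pd

# Hypothetical band labels, listed from least to most experience.
band_order = ["< 1", "1-2", "3-5", "5-10", "10-20", "20+"]

# Toy answers, invented for illustration.
answers = pd.Series(["10-20", "3-5", "10-20", "< 1", "20+", "10-20", "5-10"])

# Reindexing by label gives a stable order and fills missing bands with 0.
counts = answers.value_counts().reindex(band_order, fill_value=0)
share_10_plus = (counts["10-20"] + counts["20+"]) / counts.sum()

print(counts)
print(f"10+ years: {share_10_plus:.0%}")
```

Ordering by label does not depend on how `value_counts` happens to sort the bands, so the chart keeps the same layout if the underlying counts change.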

In [None]:
# --- OVERALL DATA ---
code = "For how many years have you been writing code and/or programming?"
ml = "For how many years have you used machine learning methods?"

dt = df[[code, ml]].copy()  # copy so the assignments below don't touch a slice of df
dt[code] = dt[code].apply(lambda x: encode_codeLong(x))
dt[ml] = dt[ml].apply(lambda x: encode_mlLong(x))
datas = [dt[code].value_counts().reset_index(), dt[ml].value_counts().reset_index()]
datas[0] = datas[0].reindex([6, 5, 4, 3, 1, 0, 2]).reset_index(drop=True)
datas[1] = datas[1].reindex([6, 1, 2, 3, 5, 4, 0, 7, 8]).reset_index(drop=True)

# Plot
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 11))
fig.suptitle('- For how long have they been practicing? -', size = 26, color = pirate_colors[0], weight='bold')
axs = [ax1, ax2]

# Pie 1
labels = datas[0]["index"].values
sizes = datas[0][code].values
explode = (0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05)
colors = ["#D7C8C0", "#CBBBB2", "#AB978D", "#C77952", "#C36130", "#A8410E", "#802C02"]

ax1.pie(sizes, explode=explode, colors=colors, startangle=90, labels=labels, 
        autopct='%1.0f%%', pctdistance=0.8)

#draw circle
centre_circle = plt.Circle((0,0),0.65,fc='white')
ax1.add_artist(centre_circle)
ax1.axis('equal')

# Pie 2
labels = datas[1]["index"].values
sizes = datas[1][ml].values
explode = (0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05)
colors = ["#A6ADB1", "#80AABD", "#6196AE", "#4C829A", "#2E7493", "#255D76", "#0C5372", "#BBC2C5", "#D0D9DC"]

ax2.pie(sizes, explode=explode, colors=colors, startangle=90, labels=labels,
        autopct='%1.0f%%', pctdistance=0.8)

centre_circle = plt.Circle((0,0),0.65,fc='white')
ax2.add_artist(centre_circle)
ax2.axis('equal')

# Settings
ax1.set_title('Coding in General', size = 14, color = pirate_colors[0], weight='bold', pad=20)
ax2.set_title('Used ML Methods', size = 14, color = pirate_colors[0], weight='bold', pad=20)

# Image
offset_png(x=0, y=0, path="../input/2020-cost-of-living/icons/code_icon.png", 
           ax=ax1, zoom=0.06, offset=0, border=1)
offset_png(x=0, y=0, path="../input/2020-cost-of-living/icons/ml_icon.png",
           ax=ax2, zoom=0.05, offset=0, border=1)

plt.subplots_adjust(left=None, bottom=None, right=None, top=0.8, wspace=0.4, hspace=None);

<h3 style="font-family:'Source Code Pro', monospace">Role and Duties at Work</h3>

Pointers here:
* The most frequent jobs for the top respondents are <span style="background:#fed56f; font-weight:bold; color:black">Data Scientist, ML Engineer, or Software Engineer</span> (Analyst combines 3 jobs: Data Analyst, Business Analyst, and Statistician).
* Some of the most crucial duties are:
    * *Software Engineer*: more oriented towards **data infrastructure, exploration, and creating ML models**.
    * *ML Engineer*: the most prominent duties are **building ML models and exploring new ideas for them**.
    * *Data Scientist*: the **most versatile** of all, covering all the duties almost equally (though less so in the research areas). Basically, they need to know everything 👀.
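
A Sankey diagram describes each role-to-duty flow by the integer positions of its two endpoints in a shared label list. The sketch below shows one way to derive those positions from a long-format count table instead of typing them by hand. The `links` table and its values are hypothetical, not survey results.

```python
import pandas as pd

# Hypothetical role/duty counts in long format (illustration only).
links = pd.DataFrame({
    "role":  ["Analyst", "Analyst", "Data Scientist", "Data Scientist"],
    "duty":  ["Analyze data", "Build ML", "Analyze data", "Build ML"],
    "count": [30, 5, 40, 35],
})

# Node labels: all roles first, then all duties, in order of appearance.
roles = list(dict.fromkeys(links["role"]))
duties = list(dict.fromkeys(links["duty"]))
labels = roles + duties

# Map every label to its position in the node list.
# Assumes no role shares its name with a duty.
index_of = {name: i for i, name in enumerate(labels)}

# Each link is (source position, target position, flow size).
source = [index_of[r] for r in links["role"]]
target = [index_of[d] for d in links["duty"]]
value = links["count"].tolist()

print(labels)
print(source, target, value)
```

Deriving `source` and `target` from the table keeps the links aligned with the counts even if the row order of the table changes.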

In [None]:
# Get data
dt = prep_role_duties_data(df)

# Filter out none
dt = dt.drop(columns=["dutie6"], axis=1)

# Prep data for Sankey
role_data = dt["role"].value_counts().reset_index()
role_data.columns = ["role", "count"]

dutie_data = dt.melt(id_vars=["role"], var_name="Name", value_name="Dutie")
dutie_data.dropna(inplace=True)

dutie_data = dutie_data.groupby("role")["Dutie"].value_counts().unstack().reset_index()
dutie_data = dutie_data.melt(id_vars=["role"], var_name="dutie", value_name="count")
dutie_data = dutie_data.sort_values(['count'], ascending=False).groupby('role').head(5).sort_values(['role', 'dutie'])

# Keep only the four most frequent roles
excluded_roles = ["DB Engineer", "Data Engineer", "Research Scientist", "Project Manager"]
dutie_data = dutie_data[~dutie_data["role"].isin(excluded_roles)]


# Get the graph data
label = ['Analyst', 'Data Scientist', 'ML Engineer', 'Software Engineer',
          
         'Analyze and Understand Data',
         'Build and run ML',
         'Build and run data infrastructure',
         'Create ML to explore new areas',
         'Research to advance the state of ML']

source = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3]
target = [4, 5, 6, 7, 8, 4, 5, 6, 7, 8, 4, 5, 6, 7, 8, 4, 5, 6, 7, 8]
value = dutie_data["count"].values

# Colors
color_node = ["#CC5600", "#9D4800", "#91281A", "#DA9300",
              "#325C6E", "#325C6E", "#325C6E", "#325C6E", "#325C6E"]

color_link = ["#F8E8DC","#CC5600", "#F8E8DC", "#F8E8DC", "#CC5600",
              "#EBD5C3","#EBD5C3", "#9D4800", "#EBD5C3", "#9D4800",
              "#DDCECC", "#DDCECC", "#91281A", "#DDCECC", "#91281A",
              "#F8EED9", "#DA9300", "#F8EED9", "#F8EED9", "#DA9300"]

# Data to dict, dict to sankey
link = dict(source = source, target = target, value = value, color = color_link)
node = dict(label = label, pad=10, thickness=21, color=color_node, 
            line = dict(color = "black", width = [5,10]))
data = [go.Sankey(link = link, node=node, arrangement='snap')]

layout = go.Layout(hovermode = 'x',
                   title="- Top Roles and Duties at Work -",
                   font=dict(size = 14, color = pirate_colors[0], family="Source Code Pro",),
                   paper_bgcolor='#FCFCFC',
                   height=900)

# PLOT
fig = go.Figure(data, layout)

fig.add_annotation(x=0, y=1.017,
                   text="Roles",
                    showarrow=False,
                   font=dict(
                       family="Source Code Pro",
                       size=14,
                       color="#325C6E"))
fig.add_annotation(x=1, y=1.033,
                   text="Duties",
                    showarrow=False,
                   font=dict(
                       family="Source Code Pro",
                       size=14,
                       color="#325C6E"))

fig.show()

<h3 style="font-family:'Source Code Pro', monospace">How large are the company and the team?</h3>

The 2 plots <span style="background:#fed56f; font-weight:bold; color:black">closely match the overall distribution of the respondents</span> ([you can see the overall summary here](https://www.kaggle.com/kaggle-survey-2020)):
* Regardless of pay, about **20% of respondents work in small companies**, while **more than 50% work in large ones** (1000+ employees).
* The DS team is **larger than 10 people about half the time**; however, **40% of respondents are in small teams of at most 4 people**. This follows directly from company size: many respondents work in tiny companies, which have smaller DS teams, and vice versa.
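
Shares like the ones above come from a normalized frequency count. Here is a minimal sketch, assuming a toy `company_size` Series with made-up answers (not the survey data):

```python
import pandas as pd

# Hypothetical answers; the notebook reads the real ones from the survey table.
company_size = pd.Series(
    ["0-49 employees", "10,000 or more employees", "1000-9,999 employees",
     "0-49 employees", "10,000 or more employees", "250-999 employees",
     "50-249 employees", "1000-9,999 employees", "10,000 or more employees",
     "1000-9,999 employees"],
    name="company_size",
)

# normalize=True returns fractions instead of raw counts.
shares = (company_size.value_counts(normalize=True) * 100).round(0)

# Add the two largest brackets to get the "1000+ employees" share.
large = shares["1000-9,999 employees"] + shares["10,000 or more employees"]

print(shares)
print(f"Large companies (1000+): {large:.0f}%")
```

Adding brackets together only makes sense for single-choice questions like this one, where each respondent falls into exactly one bracket.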

In [None]:
# --- OVERALL DATA ---
size = "What is the size of the company where you are employed?"
ds_team = "Approximately how many individuals are responsible for data science workloads at your place of business?"

dt = df[[size, ds_team]]
datas = [dt[size].value_counts().reset_index(), dt[ds_team].value_counts().reset_index()]
datas[0] = datas[0].reindex([2, 3, 4, 1, 0]).reset_index(drop=True)
datas[1] = datas[1].reindex([4, 1, 2, 3, 5, 6, 0]).reset_index(drop=True)

# Plot
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 11))
fig.suptitle('- How large are the company and the team? -', size = 26, color = pirate_colors[0], weight='bold')
axs = [ax1, ax2]

# Pie 1
labels = datas[0]["index"].values
sizes = datas[0][size].values
explode = (0.05, 0.05, 0.05, 0.05, 0.05)
colors = ["#6E2D0A", "#DEC4B3", "#C2A99A", "#C5551A", "#913707"]

ax1.pie(sizes, explode=explode, colors=colors, startangle=90, labels=labels, 
        autopct='%1.0f%%', pctdistance=0.8)

#draw circle
centre_circle = plt.Circle((0,0),0.65,fc='white')
ax1.add_artist(centre_circle)
ax1.axis('equal')

# Pie 2
labels = datas[1]["index"].values
sizes = datas[1][ds_team].values
explode = (0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05)
colors = ["#ADC2CF", "#69BDD7", "#3C9FBD", "#1785A6", "#99ADBA", "#839AA9", "#055E79"]

ax2.pie(sizes, explode=explode, colors=colors, startangle=90, labels=labels,
        autopct='%1.0f%%', pctdistance=0.8)

centre_circle = plt.Circle((0,0),0.65,fc='white')
ax2.add_artist(centre_circle)
ax2.axis('equal')

# Settings
ax1.set_title('Company Size', size = 14, color = pirate_colors[0], weight='bold', pad=20)
ax2.set_title('Team Size', size = 14, color = pirate_colors[0], weight='bold', pad=20)

# Image
offset_png(x=0, y=0, path="../input/2020-cost-of-living/icons/company_icon.png", 
           ax=ax1, zoom=0.06, offset=0, border=1)
offset_png(x=0, y=0, path="../input/2020-cost-of-living/icons/team_icon.png",
           ax=ax2, zoom=0.05, offset=0, border=1)

plt.subplots_adjust(left=None, bottom=None, right=None, top=0.8, wspace=0.4, hspace=None);

<h3 style="font-family:'Source Code Pro', monospace">The Interest of the Company in Machine Learning</h3>

This question *might have some bias* because some respondents may have given only a rough guess if they are not familiar with the "business side" of the company. However, because we're talking about high-end Data Scientists, who are most certainly involved in their projects' finances, this bias might be lower.

*Or are they?*

The following graphs show some fascinating insights:
* Firstly, respondents across **all groups** agreed that their company either uses well-established ML or has just started implementing ML in the business ([and these numbers are increasing as the years pass, according to the general summary](https://www.kaggle.com/kaggle-survey-2020)).
* We can see that the respondents with **High Pay** work in companies that spend much less on ML than the other 2 groups. In the **Very High Pay** and **Crazy High Pay** groups, *more than 50% of the respondents* are employed in companies that *spend tens of thousands of dollars* on their ML equipment and team.
* Hence, the graph shows that <span style="background:#fed56f; font-weight:bold; color:black">the bigger the individual pay, the more the company invests in ML in its business model</span>.
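
To compare spending across pay groups of different sizes, the counts have to be converted into within-group percentages. A compact way to sketch this is `pd.crosstab` with `normalize="index"`. The `toy` table, its column names, and its values are hypothetical, used for illustration only.

```python
import pandas as pd

# Hypothetical toy data: one row per respondent (not survey results).
toy = pd.DataFrame({
    "pay_category": ["High", "High", "High", "High",
                     "Very High", "Very High",
                     "Crazy High", "Crazy High"],
    "ml_spend": ["$0", "$1-$99", "$0", "$10,000-$99,999",
                 "$10,000-$99,999", "$0",
                 "$10,000-$99,999", "$10,000-$99,999"],
})

# normalize="index" divides each row by its row total, so every pay
# category sums to 100% no matter how many respondents it contains.
perc = pd.crosstab(toy["pay_category"], toy["ml_spend"], normalize="index") * 100

print(perc.round(0))
```

Normalizing within each group is what makes the three bar charts comparable. With raw counts, the largest pay group would dominate every spending bracket.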

In [None]:
ml_business = "Does your current employer incorporate machine learning methods into their business?"
ml_spent = "Approximately how much money have you (or your team) spent on machine learning and/or cloud computing services at home (or at work) in the past 5 years (approximate $USD)?"

dt = df[[ml_business, ml_spent, categ]].reset_index(drop=True)
dt[ml_spent] = dt[ml_spent].apply(lambda x: code_ml_spent(x))


# ======= ML SPENT =======
dt1 = dt.groupby(categ)[ml_spent].value_counts().unstack(fill_value=0).reset_index()
dt1 = dt1.melt(id_vars=[categ], var_name=ml_spent, value_name="count")

total = dt1.groupby(categ)["count"].sum().reset_index()
dt1 = pd.merge(left=dt1, right=total, on=categ)
dt1["perc"] = round(dt1["count_x"]/dt1["count_y"]*100, 0).astype(int)

datas1 = [dt1[dt1[categ] == "High"], dt1[dt1[categ] == "Very High"], dt1[dt1[categ] == "Crazy High"]]

order = ["0", "$1-$99", "$100-$999", "$1000-$9,999", "$10,000-$99,999", ">100,000"]

# Plot
fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(16,14))
fig.suptitle('- Company Spending on ML Technology (on Pay Category) -', size = 26, color = pirate_colors[0],
             weight='bold')
axs = [ax1, ax2, ax3]


for data, ax in zip(datas1, axs):
    sns.barplot(y = data[ml_spent], x = data["perc"], ax=ax, color = pirate_colors[1],
                order = order)
    
ax1.set_title('High Pay (%)', size = 13, color = treasure_colors[0], weight='bold')
ax2.set_title('Very High Pay (%)', size = 13, color = treasure_colors[0], weight='bold')
ax3.set_title('Crazy High Pay (%)', size = 13, color = treasure_colors[0], weight='bold')

for ax in axs:
    ax.set_ylabel("")
    ax.set_xlabel("")
    ax.set_xticks([])
    ax.tick_params(axis='y', rotation=0)
    show_values_on_bars(ax, h_v="h", space=1)
    
path='../input/2020-cost-of-living/moneybag.png'
offset_png(x=25, y=0, path=path, ax=ax1, zoom=0.05, offset=-22, border=1)
offset_png(x=21, y=3, path=path, ax=ax1, zoom=0.05, offset=-22, border=1)
offset_png(x=26, y=5, path=path, ax=ax2, zoom=0.05, offset=-22, border=1)
offset_png(x=21, y=4, path=path, ax=ax2, zoom=0.05, offset=-22, border=1)
offset_png(x=44, y=5, path=path, ax=ax3, zoom=0.05, offset=-22, border=1)
offset_png(x=22, y=4, path=path, ax=ax3, zoom=0.05, offset=-22, border=1)

# Arrow
style = "Simple, tail_width=0.5, head_width=6, head_length=13"
kw = dict(arrowstyle=style, color="#828E9D")
arrow = patches.FancyArrowPatch((23, 2), (32, 4.3),
                                connectionstyle="arc3,rad=-.3", **kw)
plt.gca().add_patch(arrow)
    
sns.despine(left=True, bottom=True)
plt.subplots_adjust(left=None, bottom=None, right=None, top=0.85, wspace=0.4, hspace=None)
fig.text(0.04, 0.91, 
         'The majority of **all** groups agreed that they either have well established ML methods - or - they have just started using ML.', 
         size=14, color="#B76148", weight='bold')
fig.text(0.76, 0.56, '-more than 65%-', 
         size=12, color="#828E9D", weight='bold');

<div class="alert success-alert">
  <p><b>Takeaway</b>: We've learned that our high-end respondents are usually Data Scientists or Software/Machine Learning Engineers, with lots of coding experience, but not necessarily ML seniority.</p>
  <p> They come from large and small companies, but the higher the investment and interest of the company in ML, the higher the individual income received.</p>
</div>

> And another area of the map has revealed itself: the Work Habitat!

<img src="https://i.imgur.com/e05nc2Z.jpg">

> Closer look 🔎
<img src="https://i.imgur.com/6EkCvou.jpg" width=700>

<h2><span style="background:#F3E9D9; font-weight:normal; color:black">3.4 The Coding Preferences</span></h2>

Now we know that more than half of our respondents have 10+ years of coding experience. But let's discover which languages they use most, what their DS setup is, how they deal with cloud and big data, and what advice they have to share.

<h3 style="font-family:'Source Code Pro', monospace">Languages Used and Advice to Community</h3>

Most coders secretly root for Python, but are all the other languages obsolete?
* **Python** is by far the most popular language in most areas, both in **usage** and as **a recommendation** for future users.
* The next 3 most used languages are **SQL, R, and Bash**, all very useful in the Data Science discipline.
* Hence, all our top respondents (regardless of the Pay Category) <span style="background:#fed56f; font-weight:bold; color:black">use and recommend Python the most, followed by SQL, R, and Bash</span>.
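
The language question is multi-select: the survey stores one column per language, filled only when the respondent ticked that option. Counting usage then comes down to counting non-missing cells per column. A minimal sketch on a hypothetical `toy` table:

```python
import pandas as pd

# Hypothetical multi-select answers: one column per option, holding the
# option name when ticked and a missing value otherwise.
toy = pd.DataFrame({
    "Python": ["Python", "Python", "Python", None],
    "SQL":    ["SQL", None, "SQL", None],
    "R":      [None, None, None, "R"],
})

# Number of respondents who ticked each language, most used first.
usage = toy.notna().sum().sort_values(ascending=False)

# Share of respondents per language. Each respondent can tick several
# options, so these shares may add up to more than 100%.
share = usage / len(toy) * 100

print(usage)
print(share)
```

Because the shares overlap, a bar chart suits this kind of question better than a pie chart, while the single-choice "learn first" question can safely be drawn as a pie.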

In [None]:
# Language
languages = ["What programming languages do you use on a regular basis? (Select all that apply) - Selected Choice - Python",
             "What programming languages do you use on a regular basis? (Select all that apply) - Selected Choice - R",
             "What programming languages do you use on a regular basis? (Select all that apply) - Selected Choice - SQL",
             "What programming languages do you use on a regular basis? (Select all that apply) - Selected Choice - C",
             "What programming languages do you use on a regular basis? (Select all that apply) - Selected Choice - C++",
             "What programming languages do you use on a regular basis? (Select all that apply) - Selected Choice - Java",
             "What programming languages do you use on a regular basis? (Select all that apply) - Selected Choice - Javascript",
             "What programming languages do you use on a regular basis? (Select all that apply) - Selected Choice - Julia",
             "What programming languages do you use on a regular basis? (Select all that apply) - Selected Choice - Swift",
             "What programming languages do you use on a regular basis? (Select all that apply) - Selected Choice - Bash",
             "What programming languages do you use on a regular basis? (Select all that apply) - Selected Choice - MATLAB"]
advice = "What programming language would you recommend an aspiring data scientist to learn first? - Selected Choice"

all_columns = languages.copy()
all_columns.append(advice)
all_columns.append(categ)


# Data
dt = df[all_columns]
dt.columns = ["Python", "R", "SQL", "C", "C++", "Java", "Javascript", "Julia", 
              "Swift", "Bash", "MATLAB", "Advice", "Category"]

advice_data = pd.DataFrame(dt.groupby("Advice")["Category"].count()).reset_index()
advice_data["Advice"] = advice_data["Advice"].apply(lambda x: code_advice(x))
advice_data = advice_data.groupby("Advice")["Category"].sum().reset_index()

lang_data = dt.drop(columns="Advice", axis=1)
lang_data = lang_data.melt(id_vars=["Category"], var_name="Language", value_name="count").dropna().reset_index(drop=True)
lang_data = lang_data.groupby("Category")["Language"].value_counts().unstack().reset_index()
lang_data = lang_data.melt(id_vars=["Category"], var_name="Language", value_name="count").dropna().reset_index(drop=True)
lang_data = lang_data.groupby("Language")["count"].sum().reset_index().sort_values('count', ascending=False)



# ===== PLOT ======
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 12))
fig.suptitle('- What are the most popular languages? -', size = 26, color = pirate_colors[0], weight='bold')
axs = [ax1, ax2]

# The barplot
sns.barplot(y = lang_data["Language"], x = lang_data["count"], ax=ax1, color = treasure_colors[0])


# Pie
labels = advice_data["Advice"].values
sizes = advice_data["Category"].values
explode = (0.05, 0.05, 0.05, 0.05)
colors = ["#8BADBC", "#09516F", "#267292", "#508DA6"]

ax2.pie(sizes, explode=explode, colors=colors, startangle=90, labels=labels,
        autopct='%1.0f%%', pctdistance=0.8)

centre_circle = plt.Circle((0,0),0.65,fc='white')
ax2.add_artist(centre_circle)
ax2.axis('equal')

# Beautify

ax1.set_title('Languages used on a regular basis', size = 13, color = treasure_colors[0], weight='bold')
ax2.set_title('What should you learn first?', size = 13, color = treasure_colors[0], weight='bold')

ax1.set_ylabel("")
ax1.set_xlabel("")
ax1.set_xticks([])
ax1.tick_params(axis='y', rotation=0)
show_values_on_bars(ax1, h_v="h", space=1)

path='../input/2020-cost-of-living/language_logos'
offset_png(x=1901, y=0, path=f'{path}/python_logo.png', ax=ax1, zoom=0.05, offset=-22, border=1)
offset_png(x=1212, y=1, path=f'{path}/sql_logo.png', ax=ax1, zoom=0.05, offset=-22, border=1)
offset_png(x=706, y=2, path=f'{path}/r_logo.png', ax=ax1, zoom=0.025, offset=-22, border=1)
offset_png(x=445, y=3, path=f'{path}/bash_logo.png', ax=ax1, zoom=0.05, offset=-22, border=1)
offset_png(x=374, y=4, path=f'{path}/javascript_logo.png', ax=ax1, zoom=0.05, offset=-22, border=1)
offset_png(x=366, y=5, path=f'{path}/java_logo.png', ax=ax1, zoom=0.05, offset=-22, border=1)
offset_png(x=323, y=6, path=f'{path}/c++_logo.png', ax=ax1, zoom=0.03, offset=-22, border=1)
offset_png(x=260, y=7, path=f'{path}/c_logo.png', ax=ax1, zoom=0.02, offset=-22, border=1)
offset_png(x=181, y=8, path=f'{path}/matlab_logo.png', ax=ax1, zoom=0.04, offset=33, border=1)
offset_png(x=56, y=9, path=f'{path}/julia_logo.png', ax=ax1, zoom=0.025, offset=33, border=1)
offset_png(x=36, y=10, path=f'{path}/swift_logo.png', ax=ax1, zoom=0.04, offset=33, border=1)


offset_png(x=0, y=0, path="../input/2020-cost-of-living/snake.png", ax=ax2, zoom=0.2, offset=0, border=1)

sns.despine(left=True, bottom=True)
plt.subplots_adjust(left=None, bottom=None, right=None, top=0.89, wspace=None, hspace=None)

<h3 style="font-family:'Source Code Pro', monospace">What's the best setup?</h3>

The graphs below are about the same for all 3 Pay Categories:
* Most work on their **personal laptop/computer**.
* On their personal computer, they mostly stick to the classics: **Jupyter, RStudio, and PyCharm**.
* They also use cloud computing services, like **Google Cloud Datalab/AI Platform**, but most extensively **Colab** or **Kaggle Notebooks**.
* To conclude, the top data scientists *behave much like the average Kaggler in terms of environments and IDEs*: they mostly use <span style="background:#fed56f; font-weight:bold; color:black">personal gear, Colab, and Kaggle Notebooks, and work in a combination of Jupyter, PyCharm, and RStudio</span> environments.

In [None]:
# Environment
all_columns = get_environment_questions()

ide = all_columns[:9] ; ide.append(categ)
notebook = all_columns[9:21] ; notebook.append(categ)
computing = all_columns[21:22] ; computing.append(categ)

ide_names = ["Jupyter", "R Studio", "Visual Studio", "PyCharm", 
             "Spyder", "Notepad++", "Sublime Text", "Vim/Emacs", "MATLAB", "Categ"]
notebook_names = ["Kaggle", "Colab", "Azure", "Paperspace", "Binder", 
                  "Code Ocean", "IBM Watson Studio", "Amazon Sagemaker Studio",
                  "Amazon EMR", "Google Cloud AI Platform", "Google Cloud Datalab", "Databrick", 'Categ']
computing_names = ["Computing Platform", "Categ"]

# Data
ide_data = df[ide].reset_index(drop=True) ; ide_data.columns = ide_names
notebook_data = df[notebook].reset_index(drop=True) ; notebook_data.columns = notebook_names
# .copy() avoids renaming columns on a slice of df (SettingWithCopyWarning)
computing_data = df[computing].copy() ; computing_data.columns = computing_names
computing_data = computing_data.dropna(axis=0)

# Prepping
ide_data = multiple_choice_prep(data=ide_data, category_name="Categ", var_name="IDE")
notebook_data = multiple_choice_prep(data=notebook_data, category_name="Categ", var_name="Notebook")
computing_data = one_choice_prep(data=computing_data, category_name="Categ", 
                                 count_on="Computing Platform", var_name="Computing")

ide_data["IDE"] = ide_data["IDE"].apply(lambda x: code_ide(x).strip())
notebook_data["Notebook"] = notebook_data["Notebook"].apply(lambda x: code_notebook(x))
notebook_data = notebook_data.groupby("Notebook")["count"].sum().reset_index().\
                    sort_values(by="count", ascending=False)
ide_data = ide_data.groupby("IDE")["count"].sum().reset_index().\
                            sort_values(by="count", ascending=False)



# ===== PLOT ======
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 14))
fig.suptitle("- What's the best setup? -", size = 26, color = pirate_colors[0], weight='bold')
axs = [ax1, ax2]

# The barplot
sns.barplot(y = notebook_data["Notebook"], x = notebook_data["count"], ax=ax1, color = pirate_colors[1])



# Pie
labels = ide_data["IDE"].values
sizes = ide_data["count"].values
explode = (0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05)
colors = ["#771C01", "#8F2C0E", "#AB4121", "#C6684C", "#C68A78", 
          "#C79383", "#CAA194", "#D0B7AF", "#D1C5C1"]

ax2.pie(sizes, explode=explode, colors=colors, startangle=90, labels=labels,
        autopct='%1.0f%%', pctdistance=0.8)

centre_circle = plt.Circle((0,0),0.65,fc='white')
ax2.add_artist(centre_circle)
ax2.axis('equal')

# Beautify

ax1.set_title('Choice of Notebooks', size = 13, color = treasure_colors[0], weight='bold')
ax2.set_title('Choice of IDE', size = 13, color = treasure_colors[0], weight='bold')

ax1.set_ylabel("")
ax1.set_xlabel("")
ax1.set_xticks([])
ax1.tick_params(axis='y', rotation=0)
show_values_on_bars(ax1, h_v="h", space=1)

path='../input/2020-cost-of-living/language_logos'
offset_png(x=651, y=0, path=f'{path}/colab_logo.png', ax=ax1, zoom=0.07, offset=-25, border=1)
offset_png(x=608, y=1, path=f'{path}/kaggle_logo.png', ax=ax1, zoom=0.07, offset=-26, border=1)
offset_png(x=0, y=0, path=f'{path}/ide_logo.png', ax=ax2, zoom=0.15, offset=0, border=1)

sns.despine(left=True, bottom=True)
plt.subplots_adjust(left=None, bottom=None, right=None, top=0.86, wspace=0.05, hspace=None)

# # Arrow
# style = "Simple, tail_width=4, head_width=8, head_length=16"
# kw = dict(arrowstyle=style, color="#000000")
# arrow = patches.FancyArrowPatch((-8, 0), (-1, 1),
#                                 connectionstyle="arc3,rad=0", **kw)
# plt.gca().add_patch(arrow)

fig.text(0.19, 0.92, '* Most use either a personal laptop/computer - and/or - a cloud computing platform *', 
         size=14, color=treasure_colors[2], weight='bold')

fig.text(0.33, 0.50, '30% also use', 
         size=12, color=pirate_colors[1], weight='bold')
fig.text(0.32, 0.47, 'Cloud Instances', 
         size=12, color=pirate_colors[1], weight='bold'); 

<h3 style="font-family:'Source Code Pro', monospace">Accelerators : Yay or Nay?</h3>

First, let's understand some concepts:
<style type="text/css">
.tg  {border-collapse:collapse;
     border-spacing:0;}
.tg td{border-color:#010307;
    border-style:solid;
    border-width:1px;
    font-family:'Source Code Pro', monospace;
    font-size:14px;
    overflow:hidden;
    padding:10px 5px;
    word-break:normal;}
.tg th{border-color:#010307;
    border-style:solid;
    border-width:1px;
    font-family:'Source Code Pro', monospace;
    font-size:14px;
    font-weight:normal;
    overflow:hidden;
    padding:10px 5px;
    word-break:normal;}
.tg .tg-c3ow{border-color:#010307;
    text-align:center;
    vertical-align:top}
</style>
<table class="tg">
<thead>
  <tr>
    <th class="tg-c3ow"><font>Accelerator</font></th>
    <th class="tg-c3ow"><font>Explanation</font></th>
  </tr>
</thead>
<tbody>
  <tr>
    <td class="tg-c3ow">GPU (Graphics Processing Unit)</td>
    <td class="tg-c3ow">Designed to rapidly manipulate and alter memory to accelerate the creation of images. Used greatly in all areas of Data Science.</td>
  </tr>
  <tr>
    <td class="tg-c3ow">TPU (Tensor Processing Unit)</td>
    <td class="tg-c3ow">AI accelerator developed by Google specifically for neural network machine learning. Mostly used in Deep Learning problems.</td>
  </tr>
</tbody>
</table>


Is GPU better? Technically no - whether you choose CPU, GPU, or TPU, each outperforms in some areas and underperforms in others. [Learn more about this comparison here.](https://analyticsindiamag.com/tpu-vs-gpu-vs-cpu-which-hardware-should-you-choose-for-deep-learning/#:~:text=TPU%20vs%20GPU%20vs%20CPU%3A%20A%20Cross%2DPlatform%20Comparison&text=TPU%3A%20Tensor%20Processing%20Unit%20is,small%20batches%20and%20nonMatMul%20computations.)

So, what are the top Data Scientists using to move fast during competitions and achieve the best scores? We know that they mostly use their personal laptop/workstation, but what do these look like?
* The visualization below is *representative* of all 3 Pay Categories.
* Respondents are *split in half* (with very few exceptions): **~44% use no acceleration**, while another **~44% use GPUs** most often.
* Most of this **88% have never even tried TPUs**, and those who did used them only 2 to 5 times.
* TPU users are **~7%** of all respondents; however, half of them have used TPUs more than 6 times (and around a quarter of them more than 25 times).
* Hence, <span style="background:#fed56f; font-weight:bold; color:black">GPUs are by far the most used accelerator among our respondents</span>; however, there are a few <span style="background:#fed56f; font-weight:bold; color:black">very experienced TPU users who rely mostly on TPUs</span> during their work/competitions. Moreover, this graph is also a relief for beginners: <span style="background:#fed56f; font-weight:bold; color:black">you don't necessarily need heavy computing power to be very good or earn well</span>.
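
The flows in the diagram pair a multi-select question (which accelerators are used) with a single-choice one (how often TPUs were used). One way to sketch that pairing is to melt the multi-select columns into long format and count each combination. The `toy` table below is hypothetical, and this is not necessarily how the notebook's own `multiple_choice_prep` helper works.

```python
import pandas as pd

# Hypothetical answers: three multi-select columns plus one single-choice column.
toy = pd.DataFrame({
    "GPUs": ["GPUs", None, "GPUs", None, None],
    "TPUs": [None, None, "TPUs", None, None],
    "None": [None, "None", None, "None", "None"],
    "times_tpu_used": ["Never", "Never", "6-25 times", "Never", "Once"],
})

# Long format: one row per (respondent, accelerator) pair, ticked ones only.
long = (toy.melt(id_vars="times_tpu_used", var_name="accelerator", value_name="ticked")
           .dropna(subset=["ticked"]))

# Count every accelerator / TPU-usage combination: these are the link sizes.
flows = (long.groupby(["accelerator", "times_tpu_used"])
             .size()
             .reset_index(name="count"))

print(flows)
```

A respondent who ticked both GPUs and TPUs contributes to two flows, so the link sizes can add up to more than the number of respondents.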

In [None]:
# Set columns
accelerators = ["Which types of specialized hardware do you use on a regular basis?  (Select all that apply) - Selected Choice - GPUs",
                "Which types of specialized hardware do you use on a regular basis?  (Select all that apply) - Selected Choice - TPUs",
                "Which types of specialized hardware do you use on a regular basis?  (Select all that apply) - Selected Choice - None",
                "Which types of specialized hardware do you use on a regular basis?  (Select all that apply) - Selected Choice - Other"]
tpu = "Approximately how many times have you used a TPU (tensor processing unit)?"

all_columns = accelerators.copy()
all_columns.append(tpu)
all_columns.append(categ)

dt = df[all_columns]
dt.columns = ["GPUs", "TPUs", "None", "Other", "times_tpu_used", "categ"]

# Specific data
acc_data = dt[["GPUs", "TPUs", "None", "Other", "categ"]]
tpu_data = dt[["times_tpu_used", "categ"]]

acc_data = multiple_choice_prep(data=acc_data, category_name="categ", var_name="accelerator")
tpu_data = one_choice_prep(data=tpu_data, category_name="categ", 
                           count_on="times_tpu_used", var_name="tpu_used")

# Prep
acc_data = acc_data.groupby("accelerator")["count"].sum().reset_index().sort_values("count", ascending=False)
tpu_data = tpu_data.groupby("tpu_used")["count"].sum().reset_index()


# Sankey Data
sankey_data = dt[["GPUs", "TPUs", "None", "Other", "times_tpu_used"]]
sankey_data = multiple_choice_prep(sankey_data, category_name="times_tpu_used", var_name="Acc")
sankey_data = sankey_data.reindex([4, 0, 1, 2, 3,
                                     19, 15, 16, 17, 18,
                                     9, 5, 6, 7, 8,
                                     14, 10, 11, 12, 13])
sankey_data = sankey_data[["Acc", "times_tpu_used", "count"]]

# Get the graph data
label2 = ['GPUs', 'TPUs', 'None', 'Other',
         'Once', '2-5 times', '6-25 times', '>25 times', 'Never']
source2 = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3]
target2 = [4, 5, 6, 7, 8, 4, 5, 6, 7, 8, 4, 5, 6, 7, 8, 4, 5, 6, 7, 8]
value2 = sankey_data["count"].values

# Colors
color_node = ["#CC5600", "#DA9300", "#91281A", "#9D4800",
              "#325C6E", "#325C6E", "#325C6E", "#325C6E", "#325C6E"]

color_link = ["#CC5600","#F8E8DC", "#CC5600", "#CC5600", "#F8E8DC",
              "#DA9300", "#DA9300", "#DA9300", "#DA9300", "#DA9300",
              "#DDCECC", "#91281A", "#91281A", "#91281A", "#DDCECC",
              "#D0BBA7","#D0BBA7", "#D0BBA7", "#D0BBA7", "#D0BBA7"]

# Data to dict, dict to sankey
link = dict(source = source2, target = target2, value = value2, color = color_link)
node = dict(label = label2, pad=10, thickness=21, color=color_node, 
            line = dict(color = "black", width = [5,10]))
data = [go.Sankey(link = link, node=node, arrangement='snap')]

layout = go.Layout(hovermode = 'x',
                   title="- Accelerators -",
                   font=dict(size = 14, color = pirate_colors[0], family="Source Code Pro",),
                   paper_bgcolor='#FCFCFC',
                   height=850)

# PLOT
fig2 = go.Figure(data, layout)
fig2.add_annotation(x=0.5, y=1.07,
                   text="* more than 88% are divided between -no Acceleration- and -GPU Acceleration- *",
                    showarrow=False,
                   font=dict(
                       family="Source Code Pro",
                       size=14,
                       color="#CC5600"))
fig2.add_annotation(x=0, y=1.017,
                   text="Accelerator",
                    showarrow=False,
                   font=dict(
                       family="Source Code Pro",
                       size=14,
                       color="#325C6E"))
fig2.add_annotation(x=1, y=1.033,
                   text="Times TPU used",
                    showarrow=False,
                   font=dict(
                       family="Source Code Pro",
                       size=14,
                       color="#325C6E"))
fig2.show()

In [None]:
# Cloud
cloud_platform = "Which of the following cloud computing platforms do you use on a regular basis?\
            (Select all that apply) - Selected Choice -  ______ "
cloud_prod = "Do you use any of the following cloud computing products on a regular basis?\
            (Select all that apply) - Selected Choice -  ______"
cloud_platform_f = "Which of the following cloud computing platforms do you hope to become more familiar with\
                in the next 2 years? - Selected Choice - _____"
cloud_prod_f = "In the next 2 years, do you hope to become more familiar with any of these specific cloud \
                computing products? (Select all that apply) - Selected Choice - ____"


# Big Data
bigdata_regular = "Which of the following big data products (relational databases, data warehouses, \
                    data lakes, or similar) do you use on a regular basis? (Select all that apply)\
                        - Selected Choice - ____ "
bigdata_often = "Which of the following big data products (relational database, data warehouse, \
                    data lake, or similar) do you use most often? - Selected Choice"
bigdata_f = "Which of the following big data products (relational databases, data warehouses, data \
                    lakes, or similar) do you hope to become more familiar with in the next 2 years? \
                            (Select all that apply) - Selected Choice - _____ "


# BI Tools
bi_regular = "Which of the following business intelligence tools do you use on a \
            regular basis? (Select all that apply) - Selected Choice - _____ "
bi_often = "Which of the following business intelligence tools do you use most often? - Selected Choice"
bi_f = "Which of the following business intelligence tools do you hope to become more familiar \
                with in the next 2 years? (Select all that apply) - Selected Choice - ______ "

<div class="alert success-alert">
  <p><b>Takeaway</b>: We've learned that our high-end respondents are Python people, usually coding on their personal gear. The usual environments are Jupyter Notebooks, PyCharm, Colab, and Kaggle.</p>
  <p>They most often use GPUs, but no worries: using no accelerator at all is a popular choice too, meaning skill alone can bring you some pretty awesome results. ;)</p>
</div>

> The Code Waters is now visible!

<img src="https://i.imgur.com/UgwKjQH.jpg">

> Closer look 🔎
<img src="https://i.imgur.com/LlgBbjN.jpg" width=700>

<h2><span style="background:#F3E9D9; font-weight:normal; color:black">3.5 The Machine Learning Preferences</span></h2>

In this last chapter, we'll find out what machine learning tools and resources these Data Science veterans use, as well as what we need to focus on when we start our Machine Learning journey.

Let the hunt conclude in elegance!

<h3 style="font-family:'Source Code Pro', monospace">Reliable Sources to Keep Informed</h3>

The learnings below are representative for all 3 levels of Pay Categories (meaning that all behave in the same manner, so the split wasn't shown):
* For *learning*, the most popular choice was **Coursera**; as you may know, Coursera has the most popular [Machine Learning course](https://www.coursera.org/learn/machine-learning) from Stanford University, with the top instructor being the King of ML, **Andrew Ng**. This aspect might have significantly influenced the overall decision.
* Other very popular Course platforms were **Kaggle Learn, Udemy, and University-specific courses**.
* As *resources for the daily Data Science intake*, **YouTube** took the lead (possibly thanks to the multitude of videos on abstract topics, which are easier to grasp from animated content than from reading alone). **Blogs** (such as Towards Data Science), **Twitter, and Publications** were popular choices as well.
* Hence, the "go-to" reliable sources for Data Science information were <span style="background:#fed56f; font-weight:bold; color:black">Coursera, Kaggle Learn, Udemy, YouTube, and Blogs</span>.
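
The claim that all Pay Categories behave alike can be checked by ranking the choices within each category and comparing the orderings. Here is a minimal sketch in plain Python, assuming long-format rows of (category, choice, count). The category names and the counts are invented placeholders, not survey results, and `top_choices` is a hypothetical helper that does not exist in this notebook.

```python
from collections import defaultdict

# Illustrative rows only: (pay category, course platform, count).
# The category names and numbers are made up for this sketch.
rows = [
    ("Low", "Coursera", 50), ("Low", "Kaggle Learn", 30), ("Low", "Udemy", 20),
    ("Medium", "Coursera", 40), ("Medium", "Kaggle Learn", 25), ("Medium", "Udemy", 15),
    ("High", "Coursera", 30), ("High", "Kaggle Learn", 18), ("High", "Udemy", 12),
]

def top_choices(rows, k=3):
    """Return, per category, the k most frequent choices in descending order."""
    per_category = defaultdict(dict)
    for category, choice, count in rows:
        per_category[category][choice] = per_category[category].get(choice, 0) + count
    return {
        category: [name for name, _ in sorted(counts.items(), key=lambda kv: -kv[1])[:k]]
        for category, counts in per_category.items()
    }

rankings = top_choices(rows)
# If every category yields the same ordering, showing the split adds little.
same_order = len({tuple(r) for r in rankings.values()}) == 1
print(rankings)
print("Same ranking in every category:", same_order)
```

When the orderings match, as in this toy data, summing the counts over the categories loses no information about which choice leads.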

In [None]:
# Data Prep
courses_data, media_data = get_reliable_sources(df)

courses_data = courses_data.groupby("Course")["count"].sum().reset_index()
media_data = media_data.groupby("Media")["count"].sum().reset_index()

x = [2, 3.4, 3, 6, 5, 5.5, 8, 8.5, 5.3, 8]
y = [2, 8, 4, 4.5, 6, 2.5, 7, 9, 9.7, 3.7]
courses_data["x"] = x
courses_data["y"] = y

x = [4, 2.5, 3, 6, 5, 7.5, 8, 7.5, 4.9]
y = [2.8, 7, 4, 4.5, 6, 3.5, 5.3, 7.7, 8.3]
media_data["x"] = x
media_data["y"] = y

# ===== Plots =====
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 12))
fig.suptitle("- Reliable Sources -", size = 26, color = pirate_colors[0], weight='bold')
axs = [ax1, ax2]

# Plot 1
a1 = sns.scatterplot(x=courses_data["x"], y = courses_data["y"],
                     size = courses_data["count"], sizes=(150,20000),
                     hue = courses_data["Course"], ax=ax1,
                     palette = all_colors[:10])

a1.legend([],[], frameon=False)
for line in range(0,courses_data.shape[0]):
     ax1.text(courses_data.x[line]+0.4, courses_data.y[line], courses_data.Course[line], 
             horizontalalignment='left', size=10, color='#1B2631', weight='bold')

# Plot 2
a2 = sns.scatterplot(x=media_data["x"], y = media_data["y"],
                     size = media_data["count"], sizes=(150,30000),
                     hue = media_data["Media"], ax=ax2,
                     palette = all_colors[:9])

a2.legend([],[], frameon=False)
for line in range(0,media_data.shape[0]):
     ax2.text(media_data.x[line]+0.5, media_data.y[line], media_data.Media[line], 
             horizontalalignment='left', size=10, color='#1B2631', weight='bold')
        

# Beautify
ax1.set_title('Favorite Platforms to learn DS', size = 15, color = pirate_colors[0], weight='bold',
              pad=14)
ax2.set_title('Favorite Media Sources', size = 15, color = pirate_colors[0], weight='bold',
              pad=14)

ax1.set_xlim(1.8, 10)
ax1.set_ylim(1.5, 10.5)
ax2.set_xlim(1.9, 9.5)
ax2.set_ylim(1.9, 9.3)


for ax in axs:
    ax.set_ylabel("")
    ax.set_xlabel("")
    ax.set_xticks([])
    ax.set_yticks([])
    
offset_png(x=3.5, y=9, path=f'../input/2020-cost-of-living/pirate_hat.png', 
           ax=ax1, zoom=0.05, offset=0, border=1)
offset_png(x=2.8, y=8.4, path=f'../input/2020-cost-of-living/pirate_patch.png', 
           ax=ax1, zoom=0.05, offset=0, border=1)

offset_png(x=4.9, y=9.2, path=f'../input/2020-cost-of-living/pirate_hat.png', 
           ax=ax2, zoom=0.05, offset=0, border=1)
offset_png(x=4.2, y=8.7, path=f'../input/2020-cost-of-living/pirate_patch.png', 
           ax=ax2, zoom=0.055, offset=0, border=1)

offset_png(x=4, y=3.5, path=f'../input/2020-cost-of-living/pirate_hat.png', 
           ax=ax2, zoom=0.05, offset=0, border=1)
offset_png(x=3.4, y=3, path=f'../input/2020-cost-of-living/pirate_patch_white.png', 
           ax=ax2, zoom=0.05, offset=0, border=1)

    
sns.despine(left=True, bottom=True)
plt.subplots_adjust(left=None, bottom=None, right=None, top=0.9, wspace=0.05, hspace=None)

ax2.plot([0.5, 0.5], [0.13, 0.9], color='#D0D3D4', lw=5,
         transform=plt.gcf().transFigure, clip_on=False);

<h3 style="font-family:'Source Code Pro', monospace">What Frameworks to have in your Pocket?</h3>

The graphs and explanations below are representative of all 3 Pay Categories:
* For the *overall Frameworks*, **Scikit-learn** has the lead. It is also **the oldest** (13 years) and the **most versatile** library, so its popularity isn't a shock.
* Among the deep learning frameworks, **Tensorflow** is the most popular, but PyTorch is starting to gain some visibility itself (it is *1 year younger* than Tensorflow, so it's only natural that it is a little bit behind).
* As for the *visualization libraries*, our top-end respondents mainly prefer and use **the originals**: Matplotlib, Seaborn, and Plotly, with Ggplot for R.
* Hence, the industry's top data scientists use lots of <span style="background:#fed56f; font-weight:bold; color:black">Scikit-learn, Tensorflow, Matplotlib, and Seaborn</span>. However, we mustn't lose sight of the emerging libraries that have the potential to soon overtake the giants, such as <span style="background:#fed56f; font-weight:bold; color:black">PyTorch</span> (can you already tell I'm a fan of PyTorch? 👀)

In [None]:
# Libraries
viz_data, framework_data = get_frameworks(df)

viz_data = viz_data.groupby("Visualization")["count"].sum().reset_index()
framework_data = framework_data.groupby("Frameworks")["count"].sum().reset_index()
framework_data = framework_data.sort_values("count", ascending=False)


# ===== PLOT ======
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 12))
fig.suptitle("- Frameworks in your Pocket -", size = 26, color = pirate_colors[0], weight='bold')
axs = [ax1, ax2]

# The barplot
a1 = sns.barplot(y = framework_data["Frameworks"], x = framework_data["count"], 
                 ax=ax1, color = treasure_colors[2])

# Bubbles
x = [2.5, 5.5, 3, 6, 3.7, 5.5, 6, 7.7, 4, 8]
y = [2.5, 9, 4, 4.5, 5.5, 3.1, 6.5, 8, 8, 3.7]
viz_data["x"] = x
viz_data["y"] = y

a2 = sns.scatterplot(data = viz_data,
                     x="x", y = "y", size = "count", sizes=(150,20000),
                     hue = "Visualization", ax=ax2,
                     palette = all_colors[:10])

a2.legend([],[], frameon=False)
for line in range(0,viz_data.shape[0]):
     ax2.text(viz_data.x[line]+0.4, viz_data.y[line], viz_data.Visualization[line], 
             horizontalalignment='left', size=10, color='#1B2631', weight='bold')
        
ax2.set_xlim(2, 9.3)
ax2.set_ylim(2, 9.7)


# Beautify

ax1.set_title('Overall Frameworks', size = 13, color = treasure_colors[0], weight='bold')
ax2.set_title('Visualization Frameworks', size = 13, color = treasure_colors[0], weight='bold')

ax1.set_ylabel("")
ax1.set_xlabel("")
ax1.set_xticks([])
ax1.tick_params(axis='y', rotation=20)
show_values_on_bars(ax1, h_v="h", space=1)

ax2.set_ylabel("")
ax2.set_xlabel("")
ax2.set_xticks([])
ax2.set_yticks([])

# Images
offset_png(x=6.0, y=7.3, path=f'../input/2020-cost-of-living/pirate_hat.png', 
           ax=ax2, zoom=0.05, offset=0, border=1)
offset_png(x=5.4, y=6.8, path=f'../input/2020-cost-of-living/pirate_patch.png', 
           ax=ax2, zoom=0.05, offset=0, border=1)

offset_png(x=4, y=8.75, path=f'../input/2020-cost-of-living/pirate_hat.png', 
           ax=ax2, zoom=0.03, offset=0, border=1)
offset_png(x=7.7, y=8.6, path=f'../input/2020-cost-of-living/pirate_hat.png', 
           ax=ax2, zoom=0.03, offset=0, border=1)
offset_png(x=5.5, y=3.7, path=f'../input/2020-cost-of-living/pirate_hat.png', 
           ax=ax2, zoom=0.03, offset=0, border=1)
path='../input/2020-cost-of-living/language_logos'
offset_png(x=1487, y=0, path=f'{path}/scikit_logo.png', ax=ax1, zoom=0.04, offset=-27, border=1)
offset_png(x=949, y=1, path=f'{path}/tensorflow_logo.png', ax=ax1, zoom=0.034, offset=-27, border=1)
offset_png(x=868, y=2, path=f'{path}/keras_logo.png', ax=ax1, zoom=0.22, offset=-27, border=1)
offset_png(x=755, y=3, path=f'{path}/xgboost_logo.png', ax=ax1, zoom=0.07, offset=-27, border=1)
offset_png(x=648, y=4, path=f'{path}/pytorch_logo.png', ax=ax1, zoom=0.08, offset=-26, border=1)

sns.despine(left=True, bottom=True)
plt.subplots_adjust(left=None, bottom=None, right=None, top=None, wspace=0.04, hspace=None);

<h3 style="font-family:'Source Code Pro', monospace">Machine Learning & Deep Learning Basics</h3>

The graphs and explanations below are representative of all 3 Pay Categories:
* Surprisingly enough, the most frequent choices for *ML Methods* were **Regressions and Tree-Based methods**, which are more straightforward approaches than Neural Nets. Hence, complicated methodologies aren't necessarily better.
* **Classification** and **General Image Methods** (like cv2, PIL, or skimage) are the most popular in the *Computer Vision* department.
* Regarding *NLP*, the top choices were **Word Embeddings** and **Transformers**.
* Hence, the most popular choices were the simpler ones, such as <span style="background:#fed56f; font-weight:bold; color:black">Regressions, Tree-Based Methods, General & Classification methods, and Word Embeddings</span>.
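
To make the "simple is not worse" point concrete, here is a minimal sketch of a one-feature least-squares regression in plain Python. The data points are made up, and `fit_line` is a hypothetical helper written for this illustration; it is not taken from the survey or from any library.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept (one feature)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope = covariance(x, y) / variance(x)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Toy data that lies exactly on y = 2x + 1.
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]
slope, intercept = fit_line(xs, ys)
print(slope, intercept)
```

The whole model fits in a dozen lines and every step can be checked by hand, which is part of why such methods remain a sensible first choice.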

In [None]:
colors1 = ["#703428", "#7c3d2c", "#884630", "#934f34", "#9f5938", 
          "#ab633c", "#b66d40", "#c17844", "#cc8347", "#d68e4b"]
colors2 = ["#c89a37", "#cfa651", "#d6b269", "#dcbe81", "#e2ca99"]
colors3 = ["#449faf", "#6abbc8", "#8ed7e2", "#b1f4fc"]


# ML & DL
ml_data, compvis_data, nlp_data = get_ml_dl_data(df)
ml_data = ml_data.groupby("ML")["count"].sum().reset_index()
compvis_data = compvis_data.groupby("CompVis")["count"].sum().reset_index()
nlp_data = nlp_data.groupby("NLP")["count"].sum().reset_index()

ml_data = ml_data.sort_values("count", ascending=False)
compvis_data = compvis_data.sort_values("count", ascending=False)
nlp_data = nlp_data.sort_values("count", ascending=False)


# Plot
fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(16, 11))
fig.suptitle('- What ML/DL Methods to use? -', size = 26, weight='bold')
axs = [ax1, ax2, ax3]

# Bar1
sns.barplot(data=ml_data, y="ML", x="count", ax=ax1, palette = sns.color_palette(colors1, 10))

# Bar2
sns.barplot(data=compvis_data, y="CompVis", x="count", ax=ax2, palette=sns.color_palette(colors2, 5))

# Bar3
sns.barplot(data=nlp_data, y="NLP", x="count", ax=ax3, palette=sns.color_palette(colors3, 4))

# Beautify
ax1.set_title('Machine Learning Methods', size = 14, color = pirate_colors[0], weight='bold', pad=20)
ax2.set_title('Computer Vision Methods', size = 14, color = pirate_colors[0], weight='bold', pad=20)
ax3.set_title('Natural Language Processing Methods', size = 14, color = pirate_colors[0], weight='bold', pad=20)

for ax in axs:
    ax.set_ylabel("")
    ax.set_xlabel("")
    ax.set_xticks([])
    ax.tick_params(axis='y', rotation=40)
    show_values_on_bars(ax, h_v="h", space=1)
    
    
offset_png(x=1650, y=0, path='../input/2020-cost-of-living/pirate_hat2.png', ax=ax1, zoom=0.14, offset=-16, border=1)
offset_png(x=552, y=0, path='../input/2020-cost-of-living/pirate_hat2.png', ax=ax2, zoom=0.2, offset=-29, border=1)
offset_png(x=434, y=0, path='../input/2020-cost-of-living/pirate_hat2.png', ax=ax3, zoom=0.25, offset=-32, border=1)

sns.despine(left=True, bottom=True)
plt.subplots_adjust(left=None, bottom=None, right=None, top=None, wspace=0.6, hspace=None);

<h3 style="font-family:'Source Code Pro', monospace">Analysis at Work and Deployment</h3>

The visualization and discussion below are representative of all 3 Pay Categories:
* Regarding *Deployment*, most of our top respondents prefer to either deploy on **GitHub** or **keep their models locally on their personal gear**. **Kaggle and Colab**, however, are popular options as well.
* For *Analysis Tools*, the most popular are the **Local Development Environments**, like *Jupyter Notebooks* or *R Studio*.
* So, our top respondents analyze data mostly on <span style="background:#fed56f; font-weight:bold; color:black">Local Environments</span> and either deploy models on <span style="background:#fed56f; font-weight:bold; color:black">GitHub or keep them on their local machine</span>.
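
The donut charts report each option as a share of all answers rather than as a raw count. The sketch below shows that conversion with the same `'%1.0f%%'` format string passed to `autopct` in the plotting cell. The counts are invented for illustration, and `shares` is a hypothetical helper, not part of the notebook.

```python
def shares(counts):
    """Turn raw counts into whole-number percentage labels."""
    total = sum(counts.values())
    return {name: "%1.0f%%" % (100 * count / total) for name, count in counts.items()}

# Made-up counts, not the survey numbers.
deploy_counts = {"GitHub": 50, "Local machine": 30, "Kaggle": 20}
labels = shares(deploy_counts)
print(labels)
```

Because each label is rounded to a whole percent, the displayed values may not add up to exactly 100 on real data.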

In [None]:
# ML at Work/School

# Columns
deploy = [col for col in df.columns if 
          "Where do you publicly share or deploy your data analysis or machine learning applications?" in col]
deploy.append(categ)
tool = ["What is the primary tool that you use at work or school to analyze data? (Include text response) - Selected Choice"]
tool.append(categ)

#Data
deploy_data = df[deploy]
tool_data = df[tool]

deploy_data = multiple_choice_prep(data=deploy_data, category_name="Category", var_name="Deploy")
tool_data = multiple_choice_prep(data=tool_data, category_name="Category", var_name="Tool")

deploy_data = deploy_data.groupby("Deploy")["count"].sum().reset_index().sort_values("count", ascending=False)
tool_data = tool_data.groupby("Tool")["count"].sum().reset_index().sort_values("count", ascending=False)

deploy_data["Deploy"] = deploy_data["Deploy"].apply(lambda x: code_deploy(x))
tool_data["Tool"] = tool_data["Tool"].apply(lambda x: code_tool(x))

deploy_data = deploy_data[deploy_data["Deploy"]!= "NBViewer"]
deploy_data = deploy_data[deploy_data["Deploy"]!= "Streamlit"]
deploy_data = deploy_data[deploy_data["Deploy"]!= "Other"]
deploy_data = deploy_data[deploy_data["Deploy"]!= "Plotly Dash"]

# ===== Plot =====
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 11))
fig.suptitle('- Deployment and Work Analysis -', size = 26, color = pirate_colors[0], weight='bold')
axs = [ax1, ax2]

# Pie 1
labels = deploy_data["Deploy"].values
sizes = deploy_data["count"].values
explode = (0.05, 0.05, 0.05, 0.05, 0.05, 0.05)
colors = ["#7D3100", "#A34000", "#BA5B1D", "#CD753C", "#D99162", "#E5AE8A"]

ax1.pie(sizes, explode=explode, colors=colors, startangle=90, labels=labels, 
        autopct='%1.0f%%', pctdistance=0.8)

#draw circle
centre_circle = plt.Circle((0,0),0.65,fc='white')
ax1.add_artist(centre_circle)
ax1.axis('equal')

# Pie 2
labels = tool_data["Tool"].values
sizes = tool_data["count"].values
explode = (0.05, 0.05, 0.05, 0.05, 0.05, 0.05)
colors = ["#005071", "#006995", "#118BBE", "#3BA0CA", "#65AECC", "#88B8CD"]

ax2.pie(sizes, explode=explode, colors=colors, startangle=90, labels=labels,
        autopct='%1.0f%%', pctdistance=0.8)

centre_circle = plt.Circle((0,0),0.65,fc='white')
ax2.add_artist(centre_circle)
ax2.axis('equal')

# Settings
ax1.set_title('Deployment', size = 14, color = pirate_colors[0], weight='bold', pad=20)
ax2.set_title('Analysis at Work', size = 14, color = pirate_colors[0], weight='bold', pad=20)

# Image
offset_png(x=0, y=0, path="../input/2020-cost-of-living/deployment.png", 
           ax=ax1, zoom=0.13, offset=0, border=1)
offset_png(x=0, y=0, path="../input/2020-cost-of-living/analysis.png",
           ax=ax2, zoom=0.1, offset=0, border=1)

plt.subplots_adjust(left=None, bottom=None, right=None, top=0.8, wspace=0.5, hspace=None);

<h3 style="font-family:'Source Code Pro', monospace">Machine Learning - Miscellaneous Tools</h3>

Finally, the visualization and discussion below are representative of all 3 Pay Categories as well:
* For *ML Products*, there is a general lack of interest in all the presented options, as **None** was the most popular choice. However, most respondents say that they would like to gain knowledge of **Google Cloud Products** in the next 2 years.
* Regarding *ML Experiments*, the lack of interest is even more prominent, as **None** is the predominant answer, followed at a distance by **TensorBoard**. The **indifference continues** in the projection over the next 2 years. So, ML Experiments: not so popular at the moment. 🤷‍♀️
* *Automated ML* isn't used frequently either; however, there is a general trend towards **Automated Model Selection** for the next 2 years.
* Hence, *ML products, Experiments, or Automated ML* aren't popular choices among our top respondents. One reason could be an inclination towards old-school coding: they might <span style="background:#fed56f; font-weight:bold; color:black">prefer writing everything from scratch</span>, going through the data, the patterns, and the models with "their own hands", <span style="background:#fed56f; font-weight:bold; color:black">instead of relying on a service</span>. But I am sure this will change over the years.

In [None]:
# Get Data
ml_prods_data, ml_prods_f_data, ml_experm_data, ml_experm_f_data, automl1_data, automl2_data, \
        automl1_f_data, automl2_f_data = prep_ml_extended(df)

# Erase "Other" category
ml_prods_data = ml_prods_data[ml_prods_data["Prods"] != "Other"]
ml_experm_data = ml_experm_data[ml_experm_data["Experiment"] != "Other"]
automl1_data = automl1_data[automl1_data["AutoML1"] != "Other"]

# Group by
ml_prods_data = ml_prods_data.groupby("Prods")["count"].sum().reset_index()
ml_experm_data = ml_experm_data.groupby("Experiment")["count"].sum().reset_index()
automl1_data = automl1_data.groupby("AutoML1")["count"].sum().reset_index()

# Code values
ml_prods_data["Prods"] = ml_prods_data["Prods"].apply(lambda x: code_mlprods(x))
ml_experm_data["Experiment"] = ml_experm_data["Experiment"].apply(lambda x: code_mlexper(x))
automl1_data["AutoML1"] = automl1_data["AutoML1"].apply(lambda x: code_automl(x))

x = [1.2, 3.4, 4.5, 1.8, 5, 2.5, 2, 5, 4.3, 8]
y = [1.2, 8, 4, 5, 6, 2.5, 7, 9, 9.7, 1.7]
ml_prods_data["x"] = x
ml_prods_data["y"] = y

x = [1, 3.4, 4.5, 1.8, 5, 2.5, 2, 5, 4.3, 8]
y = [1, 8, 4, 5, 6, 2.5, 7, 9, 9.7, 1.7]
ml_experm_data["x"] = x
ml_experm_data["y"] = y

x = [1.5, 4.4, 4, 5, 2, 6, 8]
y = [1.5, 8, 4, 5, 6.5, 9.4, 3]
automl1_data["x"] = x
automl1_data["y"] = y

# ===== Plots =====
fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(16, 12))
fig.suptitle("- Miscellaneous ML Tools -", size = 26, color = pirate_colors[0], weight='bold')
axs = [ax1, ax2, ax3]

# Plot 1
a1 = sns.scatterplot(x=ml_prods_data["x"], y = ml_prods_data["y"],
                     size = ml_prods_data["count"], sizes=(30,3000),
                     hue = ml_prods_data["Prods"], ax=ax1,
                     palette = all_colors[:10])

a1.legend([],[], frameon=False)
for line in range(0,ml_prods_data.shape[0]):
     ax1.text(ml_prods_data.x[line]+0.4, ml_prods_data.y[line], ml_prods_data.Prods[line], 
             horizontalalignment='left', size=10, color='#1B2631', weight='bold')

# Plot 2
a2 = sns.scatterplot(x=ml_experm_data["x"], y = ml_experm_data["y"],
                     size = ml_experm_data["count"], sizes=(30,3000),
                     hue = ml_experm_data["Experiment"], ax=ax2,
                     palette = all_colors[:10])

a2.legend([],[], frameon=False)
for line in range(0,ml_experm_data.shape[0]):
     ax2.text(ml_experm_data.x[line]+0.5, ml_experm_data.y[line], ml_experm_data.Experiment[line], 
             horizontalalignment='left', size=10, color='#1B2631', weight='bold')
        
# Plot 3
a3 = sns.scatterplot(x=automl1_data["x"], y = automl1_data["y"],
                     size = automl1_data["count"], sizes=(30,3000),
                     hue = automl1_data["AutoML1"], ax=ax3,
                     palette = all_colors[:7])

a3.legend([],[], frameon=False)
for line in range(0,automl1_data.shape[0]):
     ax3.text(automl1_data.x[line]+0.5, automl1_data.y[line], automl1_data.AutoML1[line], 
             horizontalalignment='left', size=10, color='#1B2631', weight='bold')
        

# Beautify
ax1.set_title('Favorite ML Products', size = 15, color = pirate_colors[0], weight='bold',
              pad=35)
ax2.set_title('Favorite ML Experiments', size = 15, color = pirate_colors[0], weight='bold',
              pad=35)
ax3.set_title('Favorite Auto ML Solutions', size = 15, color = pirate_colors[0], weight='bold',
              pad=35)


for ax in axs:
    ax.set_ylabel("")
    ax.set_xlabel("")
    ax.set_xticks([])
    ax.set_yticks([])
    ax.set_xlim(0.9, 10)
    ax.set_ylim(0.9, 10)
    
# offset_png(x=3.5, y=9, path=f'../input/2020-cost-of-living/pirate_hat.png', 
#            ax=ax1, zoom=0.05, offset=0, border=1)
# offset_png(x=2.8, y=8.4, path=f'../input/2020-cost-of-living/pirate_patch.png', 
#            ax=ax1, zoom=0.05, offset=0, border=1)

    
sns.despine(left=True, bottom=True)
plt.subplots_adjust(left=None, bottom=None, right=None, top=0.84, wspace=None, hspace=None)

fig.text(0.16, 0.865, '*2 years knowledge resolution:', 
         size=10, color=treasure_colors[2], weight='bold')
fig.text(0.18, 0.85, 'Google Cloud Products', 
         size=10, color=treasure_colors[2], weight='bold')

fig.text(0.435, 0.865, '*2 years knowledge resolution:', 
         size=10, color=treasure_colors[2], weight='bold')
fig.text(0.455, 0.85, 'NONE (&TensorBoard)', 
         size=10, color=treasure_colors[2], weight='bold')

fig.text(0.71, 0.865, '*2 years knowledge resolution:', 
         size=10, color=treasure_colors[2], weight='bold')
fig.text(0.75, 0.85, 'Model Selection', 
         size=10, color=treasure_colors[2], weight='bold')

ax2.plot([0.37, 0.37], [0.13, 0.9], color='#D0D3D4', lw=5,
         transform=plt.gcf().transFigure, clip_on=False);
ax2.plot([0.65, 0.65], [0.13, 0.9], color='#D0D3D4', lw=5,
         transform=plt.gcf().transFigure, clip_on=False);

<div class="alert success-alert">
  <p><b>Takeaway</b>: Our top respondents use YouTube, Personal Blogs, and Twitter to keep informed on the latest Data Science "gossip" while having lots of trust in Coursera as a reliable source of learning.</p>
  <p>As for tools and methods, even though one would expect some complicated answers, they still use and rely the most on simple structures, like Regressions, Tree-Based models, Image Classification, or Word Embeddings for NLP. The most loved frameworks are Scikit-learn, Tensorflow, and Matplotlib, with no strong current preference for any of the miscellaneous tools.</p>
  <p>There IS something to learn here. There was a tweet from Grandmaster <a href="https://www.kaggle.com/tunguz">Bojan Tunguz</a> that sarcastically stated that all models besides Neural Nets are obsolete and that you should focus only on these. The takeaway here is that no simple method, tool, or model is "redundant", not even for the greatest of the great. And this chapter shows precisely that.</p>
</div>

<img src="https://i.imgur.com/OmlXHgw.png" width=700>

> And finally, our map is completely revealed!

<img src="https://i.imgur.com/0EanMig.jpg">

> Closer look 🔎
<img src="https://i.imgur.com/kv3fJhx.jpg" width=700>


# 4. Conclusion

I for one would have loved to see a survey like this when I started, about a year and a half ago. I had no experience and not the slightest clue about what data science is, which programming language I needed to know, how to do machine learning, what a notebook is, how deep deep learning is, and ... for the love of God, GPUs? And with the internet full of healthy and diverse opinions, it is easy to get lost and to start distrusting the sources.

Hence, I truly hope that this analysis sheds some light on whatever questions you might have. However small or big the question, I think it's best to ask a professional, a great veteran in the discipline, a guru, a <span style="background:#fed56f; font-weight:bold; color:black">Grandmaster :)</span>, what the best way to learn that specific subject is. And if this notebook brought you even a bit closer to your answers, or to a mentor, I can declare myself happy and fulfilled. I know it helped me, at least.

Good luck, and may we all succeed with grace. Happy Data Sciencin'!

# 5. References 📜

* [Numbeo Cost of Living Index](https://www.numbeo.com/cost-of-living/rankings_by_country.jsp)
* [Pareto Principle](https://en.wikipedia.org/wiki/Pareto_principle)
* [TPU vs GPU](https://analyticsindiamag.com/tpu-vs-gpu-vs-cpu-which-hardware-should-you-choose-for-deep-learning/#:~:text=TPU%20vs%20GPU%20vs%20CPU%3A%20A%20Cross%2DPlatform%20Comparison&text=TPU%3A%20Tensor%20Processing%20Unit%20is,small%20batches%20and%20nonMatMul%20computations.)
* [Machine Learning Course - Stanford](https://www.coursera.org/learn/machine-learning)

<img src="https://i.imgur.com/0eIXf89.png">

# Specs ⌨️🎨
### (*tools that helped visualisation & creating the pirate map*)
* Z8 G4 Workstation 🖥
* 2 CPUs & 96GB Memory 💾
* NVIDIA Quadro RTX 8000 🎮
* RAPIDS version 0.17 🏃🏾‍♀️