In [None]:
import pandas as pd
import numpy as np
import plotly.graph_objects as go
import plotly.io as pio
from plotly.subplots import make_subplots
from sklearn.cluster import KMeans

pio.templates.default = "plotly_white"

path = '../input/kaggle-survey-2019/'

questions_key = pd.read_csv(path+'multiple_choice_responses.csv', nrows=1)
questions_key = questions_key.transpose().reset_index()
questions_key.columns = ['q_num', 'q_text']
q_text = questions_key.q_text.str.split('-', n=1, expand=True)
questions_key['q_text_1'] = q_text[0].str.strip()
questions_key['q_text_2'] = q_text[1].str.strip()
questions_key.drop(columns=['q_text'], inplace=True)

response_all = pd.read_csv(path+'multiple_choice_responses.csv', skiprows=2, header=None)

In [None]:
def new_job_label (row):
    if row['job_title'] == 'Data Scientist':
        return 'Data Scientists'
    if row['job_title'] == 'Student':
        return 'Students/Not employed/Others'
    if row['job_title'] == 'Not employed':
        return 'Students/Not employed/Others'
    if row['job_title'] == 'Other':
        return 'Students/Not employed/Others'
    else :
        return 'Non Data Scientist'

def get_subset (data, col_from, col_to):
    df_sub = data.iloc[:, col_from:col_to]
    df_sub = pd.get_dummies(df_sub)
    df_sub.columns = [col.split('_')[-1].strip().split('(')[0].strip() for col in df_sub.columns]
    df_sub['job_title'] = data[6]
    df_sub['job_ds'] = df_sub.apply(new_job_label, axis=1)
    marker_valid = (df_sub.sum(axis=1, numeric_only=True) > 0)
    return df_sub[marker_valid]


media = get_subset(response_all, 22, 34)
algorithms = get_subset(response_all, 118, 130)
frameworks = get_subset(response_all, 155, 167)

media_table = response_all.iloc[:, 22:34]
media_table = pd.get_dummies(media_table)
media_table.columns = [col.split('_')[-1].strip() for col in media_table.columns]
media_table = media_table[(media_table.sum(axis=1, numeric_only=True) > 0)]
# media_table.drop(columns=['None', 'Other'], inplace=True)

algorithms_media = pd.merge(algorithms.drop(columns=['None', 'Other']), 
         media.drop(columns=['None', 'Other', 'job_title', 'job_ds']), 
         left_index=True, right_index=True)

frameworks_media = pd.merge(frameworks.drop(columns=['None', 'Other']), 
         media.drop(columns=['None', 'Other', 'job_title', 'job_ds']), 
         left_index=True, right_index=True)


# Media Choices and Machine Learning Tools Usage Pattern or: How I learnt to read more blogs and watch less YouTube

# Overview

Data science (DS) and machine learning (ML) form a fast-growing discipline. As someone who has been preparing to transition into data science from a different profession, one daunting realization I had is that there is SO. MUCH. STUFF. TO. LEARN.

So I have been wondering, how do people keep up with all these new tools and learn how to use them? 

*(Trying to not think of all the unread Towards Data Science newsletters piling up in my inbox)*

In this short article, I will explore a very simple and narrow aspect of this question: does the media source we choose to get data science content from affect our data science/machine learning skillsets? More specifically, I will look at whether people's media choices (e.g. reading blogs or not) and media diets (i.e. different combinations of preferred media sources) relate to which machine learning algorithms and frameworks they know and use. 

To explore this relationship between media choices and how people use different ML tools, I will be primarily relying on the following questions from the [2019 Kaggle Machine Learning & Data Science Survey](https://www.kaggle.com/c/kaggle-survey-2019/overview).

* Q12, Who/what are your favorite media sources that report on data science topics? 
* Q24, Which of the following ML algorithms do you use on a regular basis? 
* Q28, Which of the following machine learning frameworks do you use on a regular basis?


> ### Summary of findings

* People who prefer Kaggle, Blogs, etc. use ML tools at a higher rate, but the same is not true of YouTube users.
* People with blogs in their media diet use various ML algorithms and tools at a higher rate than people with YouTube in their media diet.
* Aspiring data scientists (those who are students, currently unemployed, or have non-data science related jobs) whose media diet consists of Kaggle and blogs use various tools at a much higher rate.


# Media Choices at a Glance

## Popularity of Media Sources

In [None]:
media_mean = media_table.mean()
media_sum = media_table.sum()
media_table_sum = pd.DataFrame({'Count': media_sum,
                               'Percentage': media_mean})
media_table_sum.columns.name = 'Favorite media source reporting on data science topics'
media_table_sum.sort_values(by='Percentage', ascending=False, inplace=True)
media_table_sum.style.format({'Percentage': "{:.1%}"}).bar(subset=['Percentage'], color='skyblue')

First, let’s take a look at which media sources people tend to gravitate to. The table shows the number and the proportion of survey respondents who selected each media source as a favorite way to get data science related content. 

There is a ‘Big Three’ on the leaderboard - Kaggle, Blogs, and YouTube. For a majority of us, these are the go-to choices. This is not surprising, since these platforms are rich in content and easy to access. 

Of course, most people use more than one media source. The histogram below shows the distribution of the total number of media sources respondents selected. We see that most people use fewer than 5 sources, and the median number is 3. 
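
As a minimal sketch of how such a table and the per-respondent source count can be derived, the snippet below uses a tiny made-up 0/1 indicator table (the `toy` frame and its values are hypothetical, not survey data). With one row per respondent and one column per source, the column mean is the share of respondents selecting that source, and the row sum is the number of sources each respondent selected.

```python
import pandas as pd

# Hypothetical indicator table: 1 = respondent selected the source
toy = pd.DataFrame({
    "Kaggle":  [1, 1, 0, 1],
    "Blogs":   [1, 0, 1, 1],
    "YouTube": [0, 1, 0, 1],
})

# Per-source count and share of respondents (column sum and column mean)
summary = pd.DataFrame({"Count": toy.sum(), "Percentage": toy.mean()})

# Number of sources selected by each respondent (row sum), and its median
sources_per_respondent = toy.sum(axis=1)
median_sources = sources_per_respondent.median()

print(summary)
print(median_sources)
```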

In [None]:
media_count = media_table.drop(columns=['None']).sum(axis=1)
algo_count = algorithms.iloc[:, 0:10].sum(axis=1)
fm_count = frameworks.iloc[:, 0:10].sum(axis=1)

count_df = pd.DataFrame({'media_count': media_count,
                        'algo_count': algo_count,
                        'fm_count': fm_count})
count_df.dropna(inplace=True)
count_df['media_count_cat'] = pd.qcut(count_df['media_count'],
                                      [0, 0.25, 0.75, 1],
                                      labels=['Low','Medium','High'])
# count_df['media_count_cat'] = pd.qcut(count_df['media_count'], 2, 
#                                       labels=['Low','High'])

media_cat_n = count_df['media_count_cat'].value_counts(sort=False)


def share_by_tool_count(count_col, category):
    """Share of respondents in a media-count category at each tool count."""
    in_category = count_df['media_count_cat'] == category
    counts = count_df.loc[in_category, count_col].value_counts().sort_index()
    return counts / media_cat_n[category]


algo_low = share_by_tool_count('algo_count', 'Low')
algo_med = share_by_tool_count('algo_count', 'Medium')
algo_hi = share_by_tool_count('algo_count', 'High')

fm_low = share_by_tool_count('fm_count', 'Low')
fm_med = share_by_tool_count('fm_count', 'Medium')
fm_hi = share_by_tool_count('fm_count', 'High')

In [None]:
fig = go.Figure(
    data=[go.Histogram(
        x=media_count,
        histnorm='percent',
        marker= dict(
            color='skyblue',
            opacity=0.6,
            line= {"color": "white", "width": 2}
        ))])

fig.update_xaxes(showgrid=False, zeroline=False)
fig.update_yaxes(showgrid=False, zeroline=False, ticksuffix="%")
fig.update_layout(
    title='Distribution of the number of media sources used',
    width=600,
    height=400,
    yaxis=dict(title="% of respondents",)
)
fig.show()

In [None]:
fig = make_subplots(rows=1, cols=2,
                    subplot_titles=("Number of ML Algorithms",
                                    "Number of ML Frameworks"))

# Algorithms
fig.add_trace(go.Scatter(
    x=algo_low.index,
    y=algo_low.values,
    name='Low',
    fill='tozeroy',
    marker=dict(color='gray'),
    showlegend=True,
    opacity=0.2), 1, 1
)

fig.add_trace(go.Scatter(
    x=algo_med.index,
    y=algo_med.values,
    name='Medium',
    fill='tozeroy',
    marker=dict(color='salmon'),
    showlegend=True,
    opacity=0.2), 1, 1
)

fig.add_trace(go.Scatter(
    x=algo_hi.index,
    y=algo_hi.values,
    name='High',
    fill='tozeroy',
    marker=dict(color='dodgerblue'),
    showlegend=True,
    opacity=0.2), 1, 1
)

# Frameworks
fig.add_trace(go.Scatter(
    x=fm_low.index,
    y=fm_low.values,
    name='Low',
    fill='tozeroy',
    marker=dict(color='grey'),
    showlegend=False,), 1, 2
)

fig.add_trace(go.Scatter(
    x=fm_med.index,
    y=fm_med.values,
    name='Medium',
    fill='tozeroy',
    marker=dict(color='salmon'),
    showlegend=False,), 1, 2
)

fig.add_trace(go.Scatter(
    x=fm_hi.index,
    y=fm_hi.values,
    name='High',
    fill='tozeroy',
    marker=dict(color='dodgerblue'),
    showlegend=False,), 1, 2
)

fig.update_traces(opacity=0.2, mode='lines')

fig.update_layout(
    width=800,
    height=400,
    yaxis1=dict(title="% of respondents",
                tickformat='%'),
    yaxis2=dict(tickformat='%'),
)

fig.update_xaxes(showgrid=False, zeroline=False)
fig.update_yaxes(showgrid=False, range=[0, 0.3])
fig.show()

Now, let’s see if there is a relationship between the diversity of our media diet (total number of media choices) and the diversity of our skillsets (total number of ML tools regularly used).

Here we group the respondents into three categories according to the number of media sources they use:

* Low: bottom 25% of the distribution (0-2 sources)
* Medium: 25th-75th percentile (3-4 sources)
* High: top 25% (5 or more sources)

We see that people with more diverse media choices tend to have slightly more variety in their toolboxes - they tend to know and use one or two more ML algorithms and libraries. 
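
The three-way grouping described above can be sketched with `pd.qcut`, which cuts a variable at quantiles rather than at fixed values. The counts below are made up for illustration (they are not the survey data), so the resulting group sizes are illustrative only.

```python
import pandas as pd

# Hypothetical number of media sources selected by 12 respondents
toy_counts = pd.Series([1, 1, 2, 2, 3, 3, 3, 4, 4, 5, 6, 7])

# Cut at the 25th and 75th percentiles: bottom quarter, middle half, top quarter
toy_groups = pd.qcut(toy_counts, [0, 0.25, 0.75, 1],
                     labels=["Low", "Medium", "High"])

# Number of respondents that fall in each group
group_sizes = toy_groups.value_counts(sort=False)
print(group_sizes)
```

Because many respondents share the same count, the groups are not exactly 25% / 50% / 25% of the sample: everyone tied at a cut point lands on the same side of it.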

Next, let's zoom in a bit closer to see if there is any relationship between particular media choices and which specific ML tools we use.


# Media Choice and ML Tool Usage

Here we want to see if people with particular media choices (e.g. read blogs vs do not read blogs) adopt various tools (e.g. use Random Forests vs do not use Random Forests) at different rates.

To keep the scope of this article manageable, we will focus our discussion on the ‘Big Three’ (Kaggle, Blogs, YouTube) from now on.

The plots below compare respondents who favor Kaggle/Blogs/YouTube with those who do not, showing the proportion of each group who regularly use the various ML models and libraries listed. Differences of 5 percentage points or more are highlighted.
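
As a rough sketch of the comparison behind these plots, the snippet below computes usage rates for users and non-users of one media source and flags differences of at least 5 percentage points. The `toy` table and its values are hypothetical, chosen only to make the arithmetic easy to follow.

```python
import pandas as pd

# Hypothetical 0/1 indicators: one row per respondent
toy = pd.DataFrame({
    "Blogs":         [1, 1, 1, 0, 0, 0, 1, 0],
    "Random Forest": [1, 1, 0, 0, 1, 0, 1, 0],
    "CNN":           [1, 0, 0, 1, 0, 0, 0, 0],
})

# Mean of a 0/1 column within each group = share of that group using the tool
usage = toy.groupby("Blogs").mean().T

# Difference between those who favor the source (1) and those who do not (0)
usage["diff"] = usage[1] - usage[0]

# Keep only tools where the gap is at least 5 percentage points either way
highlighted = usage[usage["diff"].abs() >= 0.05]
print(highlighted)
```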

In [None]:
def draw_segments(data, thres_high, thres_low, xref, color='skyblue'):
    """
    Draw line segment connecting the two dots. 
    - If difference is greater than high/low threshold, draw segment
    - Otherwise draw near invisible segment
    """
    segment_list = []
    for i in range(len(data[0])):
        value = data['diff'].iloc[i]
        if (value >= thres_high):
            segment = dict(
                type='line',
                x0=data[0].iloc[i]*1.01,
                y0=i,
                x1=data[1].iloc[i]*0.99,
                y1=i,
                xref=xref,
                line=dict(
                    color=color,
                    width=2,
                )
            )
        elif (value <= thres_low):
            segment = dict(
                type='line',
                x0=data[0].iloc[i]*0.99,
                y0=i,
                x1=data[1].iloc[i]*1.01,
                y1=i,
                xref=xref,
                line=dict(
                    color='grey',
                    width=2,
                )
            )
        else:
            segment = dict(
                type='line',
                x0=data[0].iloc[i]+0.01,
                y0=i,
                x1=data[1].iloc[i]-0.01,
                y1=i,
                xref=xref,
                line=dict(
                    color='whitesmoke',
                    width=0.1
                )
            )
        segment_list.append(segment)
    return segment_list

def draw_annotation(data, thres_high, thres_low, xref):
    """
    Draw annotation 
    - If difference is greater than {thres}, annotate with value
    """
    annot_list = []
    for i in range(len(data[0])):
        value = data['diff'].iloc[i]
        if (value >= thres_high):
            annot = dict(
                x=data[1].iloc[i],
                y=i,
                xref=xref,
                xshift=30,
                text=f"+{value:0.1%}",
                showarrow=False
            )
        elif (value <= thres_low): 
            annot = dict(
                x=data[1].iloc[i],
                y=i,
                xref=xref,
                xshift=-30,
                text=f"{value:0.1%}",
                showarrow=False
            )
        else:
            annot = dict(
                x=data[1].iloc[i],
                y=i,
                text="",
                showarrow=False
            )
        annot_list.append(annot)
    return annot_list


def dotplot_diff_media (media_type):
    data_list = []
    for data_input in [algorithms_media, frameworks_media]:
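        # numeric_only=True skips the text columns (job_title, job_ds),
        # which newer pandas versions refuse to average
        data = data_input.groupby(media_type).mean(numeric_only=True).T.iloc[0:10]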
        data['Average'] = data.mean(axis=1)
        data.sort_values(by='Average', inplace=True)
        data['diff'] = data[1]-data[0]
        data_list.append(data)

    fig = make_subplots(
        rows=1, cols=2,
        subplot_titles=['ML Algorithms/Models', 
                        'ML Frameworks/Libraries'],
        horizontal_spacing=0.2,
        x_title='% in each group regularly using the given ML Algorithm/Framework'
    )

    # Algorithms
    fig.add_trace(
        go.Scatter(x=data_list[0][1],
                   y=data_list[0].index,
                   name='Yes',
                   mode="markers",
                   marker=dict(color='#4ec2f7', line_width=1, size=9)
                   ), 1, 1)

    fig.add_trace(
        go.Scatter(x=data_list[0][0],
                   y=data_list[0].index,
                   name='No',
                   mode="markers",
                   marker=dict(color='grey', symbol='x',
                               line_width=0.5, size=8)
                   ), 1, 1)

    # Frameworks
    fig.add_trace(
        go.Scatter(x=data_list[1][1],
                   y=data_list[1].index,
                   name='Yes',
                   showlegend=False,
                   mode="markers",
                   marker=dict(color='#4ec2f7', line_width=1, size=9)
                   ), 1, 2)

    fig.add_trace(
        go.Scatter(x=data_list[1][0],
                   y=data_list[1].index,
                   name='No',
                   showlegend=False,
                   mode="markers",
                   marker=dict(color='grey', symbol='x',
                               line_width=0.5, size=8)
                   ), 1, 2)

    # Annotations
    segment_1 = draw_segments(data_list[0], 0.05, -0.05, 'x1')
    annot_1 = draw_annotation(data_list[0], 0.05, -0.05, 'x1')

    segment_2 = draw_segments(data_list[1], 0.05, -0.05, 'x2')
    annot_2 = draw_annotation(data_list[1], 0.05, -0.05, 'x2')

    segments = segment_1 + segment_2
    annotations = annot_1 + annot_2

    title = {
        'text': f'{media_type} is a favorite media source',
        'y': 0.9,
        'x': 0.5,
        'font': {'size': 20},
        'xanchor': 'center',
        'yanchor': 'top'}

    fig.update_layout(title=title,
                      shapes=segments,
                      annotations=annotations,
                      width=900,
                      height=500,
                      xaxis_tickformat='%',
                      xaxis2_tickformat='%',
                      margin=dict(t=140,),
                      legend_orientation="h",
                      legend=dict(x=0.4,
                                  xanchor='center',
                                  y=1.2)
                      )

    fig.update_xaxes(range=[0, 0.85], showgrid=False, zeroline=False)
    fig.update_yaxes(showgrid=True, gridcolor='aliceblue')
    return fig

In [None]:
dotplot_diff_media('Kaggle')

We see that people who favor Kaggle use a variety of tools at a higher rate than those who do not, and the difference is non-trivial. For tree-based models (Decision Trees, Random Forest) and Gradient Boosting Machines (XGBoost, LightGBM) in particular, the difference is more than 10 percentage points.

This is probably not surprising to anyone who has spent time on Kaggle. Tree-based models and GBM algorithms are regularly used in winning kernels in Kaggle competitions. They are also frequently discussed in Kaggle forums, exposing Kaggle users to these tools.

Kaggle also makes it incredibly easy to learn and use new ML tools. All the great notebooks are just one click away - fork them and you will have access to all the cutting edge knowledge. 

![](https://assets-auto.rbl.ms/26d9dccee2e5fb753e582143d887582e1dc8cfd119981c1b5a2c6d837b952697)

I recently learnt about CatBoost from a Kaggle notebook. I felt like a cool kid.

In [None]:
dotplot_diff_media('Blogs')

Blog readers similarly are much more likely to use various tools than non-blog readers. We see this for both the foundational must-haves (e.g. linear/tree-based models, Scikit-learn), and the more advanced methods such as GBM (XGBoost, LightGBM) and Deep Learning tools (CNN, Keras).

This is likely because data science blogs (e.g. Towards Data Science, Medium) are a treasure trove of content. Whether you are a beginner or a veteran, I’m sure you will find something new to learn. Speaking anecdotally, reading Medium has become my personal [Wiki rabbit hole](http://).

In [None]:
dotplot_diff_media('YouTube')

YouTube, however, shows a very different picture. There is NO discernible difference in the rate of using any particular ML algorithm between people who favor YouTube and those who do not. And in terms of ML libraries, YouTube fans are only more likely to use TensorFlow and Keras.

> ### Summary
> * People who prefer Kaggle or Blogs as their favorite source of data science content tend to use various ML tools at a higher rate, both for foundational and more advanced tools.
> * However, people who prefer YouTube as their favorite data science related media source are not more likely to use any particular ML algorithm, and among ML libraries are only more likely to use TensorFlow and Keras.


#### ...Does this mean YouTube is useless?

![](https://data.whicdn.com/images/261402563/original.gif)

First, as we have seen above, most people use more than one media source, and likely in a wide variety of combinations. This means that a media diet consisting of YouTube and some other sources could still increase our skillsets, so simply comparing “YouTube vs No YouTube” does not present a complete picture. 

Secondly, as with any claim about whether media affects our behavior, we run into a [self selection problem](https://en.wikipedia.org/wiki/Self-selection_bias). In other words, is YouTube not useful in helping people pick up ML tools, or do people who are less well versed in ML tools tend to gravitate towards YouTube? 

Although this article cannot fully address these two problems, in the following sections I will try to dig a little deeper to explore these two possibilities.

* Next up, we will explore whether using Kaggle/YouTube/Blogs has any added effect when combined with other media choices. 
* Then, we will zoom in on a sub-category of the respondents whom I assume to have the least experience in data science - those who are students, currently unemployed, or have non-data science related jobs (‘Others’). From there, we will see whether differences in media diets make any difference to their skillsets, taking into account their similar lack of on-the-job experience.

# Media Diets Profile and ML Tool Usage

In this part of the analysis, we want to see if using Kaggle/YouTube/Blogs increases ML tool usage when combined with other media choices. To do this, we use KMeans clustering to see if there are distinct patterns of media diets among the respondents, and from there try to isolate the added impact of Kaggle/YouTube/Blogs.

## Finding profiles using KMeans clustering

After performing KMeans clustering, we have 4 distinct types of media diets among the survey respondents:

* Mostly Blogs, no Kaggle, and no YouTube
* Kaggle and Blogs
* Kaggle and YouTube
* Kaggle, Blogs, and YouTube
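
To illustrate how profiles like these can be read off a clustering, here is a minimal sketch on a made-up indicator table (the `toy` frame is hypothetical, and `n_init=10` is my choice for the sketch rather than a setting taken from this analysis). After KMeans assigns each respondent to a cluster, the mean of each 0/1 column within a cluster is the share of that cluster selecting the source. Note that the cluster numbers themselves are arbitrary.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical respondents: five favor Kaggle + Blogs, five favor Kaggle + YouTube
toy = pd.DataFrame(
    [[1, 1, 0]] * 5 + [[1, 0, 1]] * 5,
    columns=["Kaggle", "Blogs", "YouTube"],
)

# Assign each respondent to one of two clusters
km = KMeans(n_clusters=2, n_init=10, random_state=0)
toy_labels = km.fit_predict(toy.values)

# Share of each cluster selecting each source: this is the cluster's "profile"
profile = toy.groupby(toy_labels).mean()
print(profile)
```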


In [None]:
# Copy the slice so that adding columns below does not trigger
# pandas' SettingWithCopyWarning
media_cluster = media.iloc[:, 0:10].copy()
y_pred = KMeans(n_clusters=4, random_state=42, max_iter=10000).fit_predict(
    media_cluster.values)
media_cluster['cluster_number'] = y_pred
# Human-readable names for the four cluster numbers
media_cluster['cluster_label'] = media_cluster['cluster_number'].map(
    {0: 'blogs',
     1: 'kaggle_blogs',
     2: 'kaggle_youtube',
     3: 'kaggle_youtube_blogs'})

In [None]:
# numeric_only=True skips the text column cluster_label
data = media_cluster.groupby('cluster_number').mean(numeric_only=True).T
cluster_count = media_cluster.cluster_number.value_counts(sort=False)

fig = make_subplots(
    rows=2,
    cols=2,
    subplot_titles=(f"1: Blogs<br> (N={cluster_count[0]})<br> ", 
                    f"2: Kaggle + Blogs<br> (N={cluster_count[1]})<br> ", 
                    f"3: Kaggle + YouTube<br> (N={cluster_count[2]})<br> ", 
                    f"4: Kaggle + Blogs + YouTube<br> (N={cluster_count[3]})<br> "),
    specs=[[{'type': 'polar'}]*2]*2,
)

fig.add_trace(
    go.Barpolar(
        r=data[0]*100,
        theta=data.index,
        name='Cluster 0',
        marker_color='teal',
        opacity=0.6,
    ),
    row=1, col=1
)

fig.add_trace(
    go.Barpolar(
        r=data[1]*100,
        theta=data.index,
        name='Cluster 1',
        marker_color='gold',
        opacity=0.6,
    ),
    row=1, col=2
)

fig.add_trace(
    go.Barpolar(
        r=data[2]*100,
        theta=data.index,
        name='Cluster 2',
        marker_color='tomato',
        opacity=0.6,
    ),
    row=2, col=1
)

fig.add_trace(
    go.Barpolar(
        r=data[3]*100,
        theta=data.index,
        name='Cluster 3',
        marker_color='skyblue',
        opacity=0.6,
    ),
    row=2, col=2
)

fig.update_layout(
    title={'text': '4 Profiles of Media Diets',
           'font_size': 22,
           'x': 0.5,
           'y': 0.95},
    showlegend=False,
    title_font_color='#333333',
    margin=dict(t=150, l=20, r=20),
    legend_font_color='gray',
    legend_itemclick=False,
    legend_itemdoubleclick=False,
    width=850,
    height=700,
    polar=dict(
        angularaxis=dict(
            direction='clockwise',
            rotation=110,
            color='grey',
            visible=True,
            showline=True,
        ),
        radialaxis=dict(
            ticksuffix='%',
            tickvals=[25, 50, 75],
            range=[0, 100],
            visible=True,
            showline=True,
        )),
    polar2=dict(
        angularaxis=dict(
            direction='clockwise',
            rotation=110,
            color='grey',
            visible=True,
            showline=True,
        ),
        radialaxis=dict(
            ticksuffix='%',
            tickvals=[25, 50, 75],
            range=[0, 100],
            visible=True,
            showline=True,
        )),
    polar3=dict(
        angularaxis=dict(
            direction='clockwise',
            rotation=110,
            color='grey',
            visible=True,
            showline=True,
        ),
        radialaxis=dict(
            ticksuffix='%',
            tickvals=[25, 50, 75],
            range=[0, 100],
            visible=True,
            showline=True,
        )),
    polar4=dict(
        angularaxis=dict(
            direction='clockwise',
            rotation=110,
            color='grey',
            visible=True,
            showline=True,
        ),
        radialaxis=dict(
            ticksuffix='%',
            tickvals=[25, 50, 75],
            range=[0, 100],
            visible=True,
            showline=True,
        )),
)

fig.show()

Given these combinations of media choice profiles, we can do the following pairs of comparisons to answer a few questions: 

* Does reading Blogs add to your skillsets if you already use Kaggle and YouTube?
    * compare Kaggle + YouTube vs Kaggle + YouTube + Blogs
* Does using YouTube add to your skillsets if you already use Kaggle and Blogs?
    * compare Kaggle + Blogs vs Kaggle + YouTube + Blogs
* If you already use Kaggle, does Blogs or YouTube add more to what you know?
    * compare Kaggle + Blogs vs Kaggle + YouTube


In [None]:
algorithms_media_cluster = pd.merge(algorithms.drop(columns=['None', 'Other']),
                                    media_cluster[['cluster_label']],
                                    left_index=True, right_index=True)

frameworks_media_cluster = pd.merge(frameworks.drop(columns=['None', 'Other']),
                                    media_cluster[['cluster_label']],
                                    left_index=True, right_index=True)

In [None]:
def dotplot_diff_cluster(group_1, group_2, title_text, line_color):

    cluster_colors = {
        'blogs': 'teal',
        'kaggle_blogs': 'gold',
        'kaggle_youtube': 'tomato',
        'kaggle_youtube_blogs': 'skyblue'
    }

    cluster_labels = {
        'blogs': 'Blogs only',
        'kaggle_blogs': 'Kaggle + Blogs',
        'kaggle_youtube': 'Kaggle + YouTube',
        'kaggle_youtube_blogs': 'Kaggle + Blogs + YouTube'
    }

    data_list = []
    for data_input in [algorithms_media_cluster, frameworks_media_cluster]:
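        data = pd.DataFrame()
        # numeric_only=True skips the text columns (job_title, job_ds,
        # cluster_label), which newer pandas versions refuse to average
        data[0] = data_input.query(
            'cluster_label == @group_1').mean(numeric_only=True).T.iloc[0:-1]
        data[1] = data_input.query(
            'cluster_label == @group_2').mean(numeric_only=True).T.iloc[0:-1]
        data['Average'] = data_input.mean(numeric_only=True)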
        data.sort_values(by='Average', inplace=True)
        data['diff'] = data[1]-data[0]
        data_list.append(data)

    fig = make_subplots(
        rows=1, cols=2,
        subplot_titles=['ML Algorithms/Models', 'ML Frameworks/Libraries'],
        horizontal_spacing=0.2,
        x_title='% in each group regularly using the given ML Algorithm/Framework'
    )

    # Algorithms
    fig.add_trace(
        go.Scatter(x=data_list[0][0],
                   y=data_list[0].index,
                   name=cluster_labels.get(group_1),
                   mode="markers",
                   marker=dict(
            color=cluster_colors.get(group_1),
            line_width=1, size=10)
        ), 1, 1)

    fig.add_trace(
        go.Scatter(x=data_list[0][1],
                   y=data_list[0].index,
                   name=cluster_labels.get(group_2),
                   mode="markers",
                   marker=dict(color=cluster_colors.get(group_2),
                               line_width=1, size=10)
                   ), 1, 1)

    # Frameworks
    fig.add_trace(
        go.Scatter(x=data_list[1][0],
                   y=data_list[1].index,
                   name=cluster_labels.get(group_1),
                   mode="markers",
                   showlegend=False,
                   marker=dict(
            color=cluster_colors.get(group_1),
            line_width=1, size=10)
        ), 1, 2)

    fig.add_trace(
        go.Scatter(x=data_list[1][1],
                   y=data_list[1].index,
                   name=cluster_labels.get(group_2),
                   showlegend=False,
                   mode="markers",
                   marker=dict(color=cluster_colors.get(group_2),
                               line_width=1, size=10)
                   ), 1, 2)

    # Annotations
    segment_1 = draw_segments(data_list[0], 0.05, -0.05, 'x1', line_color)
    annot_1 = draw_annotation(data_list[0], 0.05, -0.05, 'x1')

    segment_2 = draw_segments(data_list[1], 0.05, -0.05, 'x2', line_color)
    annot_2 = draw_annotation(data_list[1], 0.05, -0.05, 'x2')

    segments = segment_1 + segment_2
    annotations = annot_1 + annot_2

    title = {
        'text': title_text,
        'y': 0.9,
        'x': 0.5,
        'font': {'size': 20},
        'xanchor': 'center',
        'yanchor': 'top'}

    fig.update_layout(title=title,
                      shapes=segments,
                      annotations=annotations,
                      width=900,
                      height=500,
                      xaxis_tickformat='%',
                      xaxis2_tickformat='%',
                      margin=dict(t=140,),
                      legend_orientation="h",
                      legend=dict(x=0.4,
                                  xanchor='center',
                                  y=1.25)
                      )

    fig.update_xaxes(range=[0, 0.85], showgrid=False, zeroline=False)
    fig.update_yaxes(showgrid=True, gridcolor='aliceblue')
    return fig

In [None]:
dotplot_diff_cluster('kaggle_youtube', 'kaggle_youtube_blogs', 'Additional Values from Blogs', 'skyblue')

The pattern here echoes what we saw in the previous section - reading Blogs does indeed correlate with a higher usage rate for a wide variety of ML tools. Its benefits persist even when we take Kaggle and YouTube usage into account. 

In [None]:
dotplot_diff_cluster('kaggle_blogs', 'kaggle_youtube_blogs', 'Additional Values from YouTube', 'skyblue')

The pattern here unfortunately agrees with what we see in the analysis above - other than TensorFlow and Keras, YouTube videos do not seem to add much to people's skillsets. If you already regularly use Kaggle and read Blogs, YouTube does not seem to add much more. 

In [None]:
dotplot_diff_cluster('kaggle_youtube', 'kaggle_blogs', 'Blogs vs YouTube', 'gold')

Again, YouTube is losing out in this comparison. For those who already use Kaggle, Blogs is a more powerful addition than YouTube.

> #### Summary
> 
> What we see from comparing groups with distinct media diets (i.e. different combinations of media choices) largely echoes what we observe when simply comparing users and non-users of a single media source. People who use Kaggle and Blogs have more diverse skillsets and use various ML tools at a higher rate, but we do not see the same effect for YouTube users. 
> 
> * *Does reading Blogs add to your skillsets if you already use Kaggle and YouTube?* Yes
> * *Does using YouTube add to your skillsets if you already use Kaggle and Blogs?* Not really
> * *If you already use Kaggle, does Blogs or YouTube add more to what you know?* Blogs



# What is a Good Media Diet for Aspiring Data Scientists?

Now we turn our attention to a sub-category of the respondents - those who selected ‘Student’, ‘Not employed’, or ‘Other’ as their job title. From here on I will refer to this group as ‘Aspiring Data Scientists’. 

Recall our earlier speculation that perhaps the reason for YouTube’s lack of impact on our skillsets is not YouTube per se, but who chooses to use YouTube. If people who are newer to the data science discipline (e.g. students, career switchers) are more attracted to YouTube because of the video format and accessibility, and they are also less likely to know or have used some of the advanced ML tools, then that might explain why tool usage rates among YouTube users are relatively low. 

To partly get around this, we can do a within-group comparison - comparing only among the Aspiring Data Scientists group. Assuming the members of this group are similarly inexperienced with the various ML tools, we can then explore whether different media diets still correlate with different skillsets, despite this shared feature. 
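
To make the self-selection argument concrete, here is a small sketch with entirely invented numbers (the group sizes and usage counts in `cells` are hypothetical and are not taken from the survey). It compares tool usage between YouTube users and non-users, first pooled over everyone and then within each experience group.

```python
import pandas as pd

# Hypothetical cells: (experience, favors YouTube, group size, number using the tool)
cells = [
    ("newcomer",    1, 60, 12),
    ("newcomer",    0, 20,  4),
    ("experienced", 1, 20, 12),
    ("experienced", 0, 60, 36),
]

# Expand the cell counts into one row per (made-up) respondent
rows = []
for experience, youtube, n, n_tool in cells:
    rows += [(experience, youtube, 1)] * n_tool
    rows += [(experience, youtube, 0)] * (n - n_tool)
toy = pd.DataFrame(rows, columns=["experience", "youtube", "uses_tool"])

# Pooled comparison: ignores experience
pooled = toy.groupby("youtube")["uses_tool"].mean()

# Within-group comparison: same contrast, separately for each experience level
within = toy.groupby(["experience", "youtube"])["uses_tool"].mean().unstack()

print(pooled)
print(within)
```

In this invented example the pooled usage rate is lower among YouTube users (30% vs 50%), even though within each experience group the rates are identical for users and non-users. The gap comes entirely from who chooses YouTube, which is the pattern a within-group comparison is meant to expose.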

First, let’s confirm if our intuition is correct - are less experienced respondents more likely to choose YouTube as their preferred media source?

Indeed, compared to data scientists in the survey, aspiring data scientists are much more likely to use YouTube, and much less likely to read Blogs.

In [None]:
def get_slope_color (media):
    if media == 'Kaggle':
        return 'deepskyblue'
    if media == 'Blogs':
        return 'gold'
    if media == 'YouTube':
        return 'tomato'
    else :
        return 'lightgrey'

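# Share of each job group using each media source.
# numeric_only=True skips the text column 'job_title', which recent pandas
# versions refuse to average.
media_rank = media.query('job_ds != "Non Data Scientist"').groupby(
    'job_ds').mean(numeric_only=True)
media_rank.drop(columns=['None', 'Other'], inplace=True)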

In [None]:
fig = go.Figure()

for col in media_rank.columns:
    # Slope
    fig.add_trace(go.Scatter(
        x=media_rank.index,
        y=media_rank[col],
        mode='lines+markers',
        name=col,
        line=dict(
            color=get_slope_color(col),
            #color= 'dodgerblue' if col == 'Blogs' or col == 'YouTube' else 'lightgrey',
            width=2.5
        ),
    ))


fig.update_layout(
    width=600,
    height=600,
    xaxis=dict(
        showline=False,
        showgrid=False,
        showticklabels=False,
    ),
    yaxis=dict(
        showgrid=False,
        zeroline=False,
        showline=False,
        showticklabels=False,
    ),
    showlegend=False,
    autosize=False,
    margin=dict(
        autoexpand=False,
        l=200,
        r=100,
        t=110,
    ),
)

# Adding labels
# Rows of media_rank are ordered by group name: 0 = Data Scientists,
# 1 = Students/Not employed/Others. Use .iloc for positional access.
annotations = []
for col in ['Kaggle', 'Blogs', 'YouTube']:
    # labeling the left side of the plot
    annotations.append(dict(xref='paper', x=0.0, y=media_rank[col].iloc[0],
                            xanchor='right', yanchor='middle',
                            text=f'{col} {media_rank[col].iloc[0]*100:0.1f}%',
                            showarrow=False))
    # labeling the right side of the plot
    annotations.append(dict(xref='paper', x=1.0, y=media_rank[col].iloc[1],
                            xanchor='left', yanchor='middle',
                            text=f'{col} {media_rank[col].iloc[1]*100:0.1f}%',
                            showarrow=False))

for col in ['Twitter', 'Hacker News', 'Reddit', 'Course Forums',
            'Podcasts', 'Journal Publications',
            'Slack Communities']:
    # labeling the left side of the plot (name only, no percentage)
    annotations.append(dict(xref='paper', x=0.0, y=media_rank[col].iloc[0],
                            xanchor='right', yanchor='middle',
                            text=f'{col}',
                            font=dict(size=10),
                            showarrow=False))

# Job labels
annotations.append(dict(xref='paper', yref='paper', x=0.0, y=0.98,
                        xanchor='center', yanchor='bottom',
                        text='% of Data Scientists<br> ',
                        font=dict(size=14),
                        showarrow=False))

annotations.append(dict(xref='paper', yref='paper', x=0.9, y=0.98,
                        xanchor='center', yanchor='bottom',
                        text='% of Students/<br>Not employed/Others',
                        font=dict(size=14),
                        showarrow=False))

title = {
        'text': 'Aspiring Data Scientists have Different<br> Media Choices from Data Scientists',
        'y': 0.95,
        'x': 0.5,
        'font': {'size': 18},
        'xref': 'paper',
        'yanchor': 'top'}

fig.update_layout(
    title=title,
    width=500,
    height=450,
    annotations=annotations,
    margin=dict(t=100, b=10))

fig.show()

In [None]:
media_cluster_by_job = pd.merge(media[['job_ds']],
                                media_cluster[['cluster_label']],
                                left_index=True, right_index=True)

media_cluster_by_job = pd.get_dummies(media_cluster_by_job.query('job_ds != "Non Data Scientist"'), columns=[
                                      'cluster_label']).groupby('job_ds').mean().T

In [None]:
fig = go.Figure()

cluster_labels = ['Blogs only', 
                  'Kaggle +<br>Blogs',
                  'Kaggle +<br>YouTube', 
                  'Kaggle +<br>Blogs + YouTube']

cluster_colors = ['teal', 'gold', 'tomato', 'skyblue']

x_data = [media_cluster_by_job['Students/Not employed/Others']*100,
         media_cluster_by_job['Data Scientists']*100]

y_data = ['Students/ <br>Not employed/Others',
         'Data Scientists']

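# One stacked segment per media diet profile (i) and job group (yd).
# xd is indexed by cluster name, so use .iloc for positional access.
for i in range(0, len(x_data[0])):
    for xd, yd in zip(x_data, y_data):
        fig.add_trace(go.Bar(
            x=[xd.iloc[i]], y=[yd],
            orientation='h',
            width=0.5,
            marker=dict(
                color=cluster_colors[i],
                opacity=0.7,
                line=dict(color='white', width=1)
            )
        ))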
        
fig.update_layout(
    xaxis=dict(
        showgrid=False,
        showline=False,
        showticklabels=False,
        zeroline=False,
        domain=[0.15, 1]
    ),
    yaxis=dict(
        showgrid=False,
        showline=False,
        showticklabels=False,
        zeroline=False,
    ),
    barmode='stack',
    margin=dict(l=120, r=10, t=140, b=80),
    showlegend=False,
)

annotations = []

for yd, xd in zip(y_data, x_data):
    # labeling the y-axis
    annotations.append(dict(xref='paper', yref='y',
                            x=0.14, y=yd,
                            xanchor='right',
                            text=str(yd),
                            showarrow=False, align='right'))
    # labeling the first percentage of each bar (x_axis)
    annotations.append(dict(xref='x', yref='y',
                            x=xd.iloc[0] / 2, y=yd,
                            text=f'{xd.iloc[0]:0.1f}%',
                            showarrow=False))
    # labeling the first media diet profile (on the top)
    if yd == y_data[-1]:
        annotations.append(dict(xref='x', yref='paper',
                                x=xd.iloc[0] / 2, y=1.1,
                                text=cluster_labels[0],
                                showarrow=False))
    # running total of the segments already placed on this bar
    space = xd.iloc[0]
    for i in range(1, len(xd)):
        # labeling the rest of the percentages for each bar (x_axis)
        annotations.append(dict(xref='x', yref='y',
                                x=space + (xd.iloc[i]/2), y=yd,
                                text=f'{xd.iloc[i]:0.1f}%',
                                showarrow=False))
        # labeling the remaining media diet profiles (on the top)
        if yd == y_data[-1]:
            annotations.append(dict(xref='x', yref='paper',
                                    x=space + (xd.iloc[i]/2), y=1.1,
                                    text=cluster_labels[i],
                                    showarrow=False))
        space += xd.iloc[i]

title = {
        'text': 'Distribution of Media Diet Profiles <br>among Aspiring Data Scientists and Data Scientists',
        'y': 0.9,
        'x': 0.5,
        'font': {'size': 16},
        'xref': 'paper',
        'yanchor': 'top'}           
            
fig.update_layout(
    title=title,
    width=700,
    height=300,
    margin=dict(t=100,b=20,),
    annotations=annotations)
fig.show()

The distribution of media diet profiles shows a similar pattern: aspiring data scientists are much more likely to choose a combination of Kaggle + YouTube, while data scientists rarely do. 
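
As a rough illustration of what the stacked bars encode, the share of each media diet profile within a job group can also be computed with `pd.crosstab`. This is a sketch on invented rows, not the notebook's own code path (which uses `get_dummies` followed by a group mean); `toy_profiles` and its contents are hypothetical.

```python
import pandas as pd

# Hypothetical respondents with a job group and an assigned media diet profile.
toy_profiles = pd.DataFrame({
    "job_ds": ["Data Scientists"] * 4 + ["Students/Not employed/Others"] * 4,
    "cluster_label": [
        "kaggle_blogs", "kaggle_blogs", "blogs", "kaggle_youtube",
        "kaggle_youtube", "kaggle_youtube", "kaggle_blogs", "kaggle_youtube_blogs",
    ],
})

# normalize="index" turns the counts into within-row shares, so each job
# group's profiles sum to 1 (one full stacked bar).
profile_share = pd.crosstab(
    toy_profiles["job_ds"], toy_profiles["cluster_label"], normalize="index"
)
print((profile_share * 100).round(1))
```

Normalizing by row rather than by column matters here: the question is how each job group splits across profiles, not how each profile splits across job groups.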

Now, let’s repeat some of the earlier comparisons, but this time limited to the Aspiring Data Scientists group, using Data Scientists as the baseline for comparison. 

In the following two plots, the percentage of data scientists using each tool serves as the zero-line. The vertical distance from the zero-line shows the difference in usage rate between a given group of Aspiring Data Scientists and Data Scientists.

This means that groups closer to this baseline use the particular tools at a rate more similar to data science practitioners. Again, differences greater than 5 percentage points between the groups are highlighted.
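
The arithmetic behind these plots can be sketched as follows. All numbers below are invented for illustration, and the names `ds_baseline`, `aspiring`, and `classify` are hypothetical; the 0.05 threshold is the 5-point cut-off described above.

```python
import pandas as pd

# Hypothetical usage rates (share of each group using an algorithm).
ds_baseline = pd.Series({"CNN": 0.45, "GBM": 0.50, "Linear/Logistic": 0.80})
aspiring = pd.DataFrame({
    "kaggle_youtube": {"CNN": 0.30, "GBM": 0.20, "Linear/Logistic": 0.60},
    "kaggle_blogs":   {"CNN": 0.42, "GBM": 0.35, "Linear/Logistic": 0.62},
})

# Distance of each aspiring group from the Data Scientist baseline;
# 0 would sit exactly on the zero-line.
gap_to_ds = aspiring.sub(ds_baseline, axis=0)

# Change when moving from the left group to the right group of one panel.
change = gap_to_ds["kaggle_blogs"] - gap_to_ds["kaggle_youtube"]


def classify(delta, threshold=0.05):
    """Label a change as higher, lower, or similar using a 5-point threshold."""
    if delta >= threshold:
        return "higher"
    if delta <= -threshold:
        return "lower"
    return "similar"


highlight = change.apply(classify)
print(pd.concat([gap_to_ds, highlight.rename("highlight")], axis=1))
```

Because both groups are measured against the same baseline, the change between them is the same whether computed on raw rates or on gaps; subtracting the baseline only moves the zero-line so that closeness to data scientists can be read directly off the vertical axis.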

In [None]:
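def get_slope_color_2(data):
    # Change from the first (left) column to the second (right) column
    diff = data.iloc[:, 1] - data.iloc[:, 0]
    color_list = []
    # Iterate over values directly; diff is indexed by tool name, not position
    for d in diff:
        if d >= 0.05:
            color = 'dodgerblue'  # right group at least 5 points higher
        elif d <= -0.05:
            color = 'coral'  # right group at least 5 points lower
        else:
            color = 'lightgrey'  # difference within 5 points
        color_list.append(color)
    return color_list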


def gen_subplot_data (data_input, label):
    # Usage rate per media diet cluster among Aspiring Data Scientists (rows = tools)
    data = data_input.query('job_ds == "Students/Not employed/Others"').groupby(
        'cluster_label').mean(numeric_only=True).T
    # Baseline: usage rate among all Data Scientists
    data['ds_avg'] = data_input.query('job_ds == "Data Scientists"').mean(numeric_only=True)
    data.sort_values(by='ds_avg', inplace=True)
    # Labels must be listed in the same order as the sorted rows
    data['label'] = label
    data['diff_ds_b'] = data['blogs']-data['ds_avg']
    data['diff_ds_kb'] = data['kaggle_blogs']-data['ds_avg']
    data['diff_ds_ky'] = data['kaggle_youtube']-data['ds_avg']
    data['diff_ds_kyb'] = data['kaggle_youtube_blogs']-data['ds_avg']

    # .copy() avoids assigning into a view of `data` (SettingWithCopyWarning)
    data_sub_yt_blog = data[['diff_ds_ky', 'diff_ds_kb', 'label']].copy()
    data_sub_yt_blog['color'] = get_slope_color_2(data_sub_yt_blog)
    data_sub_yt_blog = data_sub_yt_blog.T

    data_sub_blog = data[['diff_ds_ky', 'diff_ds_kyb', 'label']].copy()
    data_sub_blog['color'] = get_slope_color_2(data_sub_blog)
    data_sub_blog = data_sub_blog.T

    data_sub_yt = data[['diff_ds_kb', 'diff_ds_kyb', 'label']].copy()
    data_sub_yt['color'] = get_slope_color_2(data_sub_yt)
    data_sub_yt = data_sub_yt.T
    return data_sub_yt_blog, data_sub_blog, data_sub_yt
    
    
def annotate_subplot_left(data, column, xref):
    # labeling the left side of the plot
    # Each column holds [left value, right value, label, color]; the first
    # two are read by position with .iloc
    label_l = data[column]['label']
    y_left = data[column].iloc[0]
    if data[column]['color'] == 'lightgrey':
        annot_l = dict(xref=xref, x=-0.1, y=y_left,
                       xanchor='right', yanchor='middle',
                       text=f'{label_l}',
                       font=dict(color="lightgrey"),
                       showarrow=False)
    else:
        annot_l = dict(xref=xref, x=-0.1, y=y_left,
                       xanchor='right', yanchor='middle',
                       text=f'{label_l}<br> {y_left*100:0.1f}%',
                       font=dict(color=data[column]['color']),
                       showarrow=False)
    return annot_l


def annotate_subplot_right(data, column, xref):
    # labeling the right side of the plot (blank for non-highlighted lines)
    y_right = data[column].iloc[1]
    if data[column]['color'] == 'lightgrey':
        annot_r = dict(xref=xref, x=1.1, y=y_right,
                       xanchor='left', yanchor='middle',
                       text='',
                       showarrow=False)
    else:
        annot_r = dict(xref=xref, x=1.1, y=y_right,
                       xanchor='left', yanchor='middle',
                       text=f'{y_right*100:0.1f}%',
                       font=dict(color=data[column]['color']),
                       showarrow=False)

    return annot_r


algo_label = ['GAN', 'Evolutionary', 'Transformer N.', 'RNN',
              'Bayesian', 'DNN', 'CNN', 'GBM', 'DT/RF', 'Linear/Logistic']

framework_label = ['Fast.ai', 'Spark MLlib', 'Caret', 'PyTorch', 'LightGBM', 'RandomForest',
                   'TensorFlow', 'Xgboost', 'Keras', 'Scikit-learn']

algo_sub_yt_blog, algo_sub_blog, algo_sub_yt = gen_subplot_data(algorithms_media_cluster, algo_label)
framework_sub_yt_blog, framework_sub_blog, framework_sub_yt = gen_subplot_data(frameworks_media_cluster, framework_label)


In [None]:
data_sub_yt_blog = algo_sub_yt_blog
data_sub_blog = algo_sub_blog
data_sub_yt = algo_sub_yt

fig = make_subplots(
    rows=1,
    cols=3,
    shared_xaxes=True,
    shared_yaxes=True,
)

# Blogs vs YouTube
for col in data_sub_yt_blog.columns:
    # Slope
    fig.add_trace(go.Scatter(
        x=['Kaggle +<br>YouTube', 'Kaggle +<br>Blogs'],
        y=data_sub_yt_blog[col],
        mode='lines+markers',
        name=col,
        line=dict(color=data_sub_yt_blog[col]['color'],
                  width=2.5),
    ), row=1, col=1)


# Adding Blogs
for col in data_sub_blog.columns:
    # Slope
    fig.add_trace(go.Scatter(
        x=['Kaggle +<br>YouTube', 'Kaggle + <br>Blogs + YouTube'],
        y=data_sub_blog[col],
        mode='lines+markers',
        name=col,
        line=dict(color=data_sub_blog[col]['color'],
                  width=2.5),
    ), row=1, col=2)

    
# Adding YouTube
for col in data_sub_yt.columns:
    # Slope
    fig.add_trace(go.Scatter(
        x=['Kaggle +<br>Blogs', 'Kaggle +<br>Blogs + YouTube'],
        y=data_sub_yt[col],
        mode='lines+markers',
        name=col,
        line=dict(color=data_sub_yt[col]['color'],
                  width=2.5),
    ), row=1, col=3)


# Adding annotations
annotations = []
for col in data_sub_yt_blog.columns:
    # Left side label
    annot_l = annotate_subplot_left(data_sub_yt_blog, col, 'x1')
    annotations.append(annot_l)
    # Right side label
    annot_r = annotate_subplot_right(data_sub_yt_blog, col, 'x1')
    annotations.append(annot_r)

    
for col in data_sub_blog.columns:
    # Left side label
    annot_l = annotate_subplot_left(data_sub_blog, col, 'x2')
    annotations.append(annot_l)
    # Right side label
    annot_r = annotate_subplot_right(data_sub_blog, col, 'x2')
    annotations.append(annot_r)


for col in data_sub_yt.columns:
    # Left side label
    annot_l = annotate_subplot_left(data_sub_yt, col, 'x3')
    annotations.append(annot_l)
    # Right side label
    annot_r = annotate_subplot_right(data_sub_yt, col, 'x3')
    annotations.append(annot_r)
    

annotations.append(dict(xref='paper', yref='y', x=0.15, y=0.0,
                        xanchor='center', yanchor='middle',
                        text='Baseline: % of Data Scientists who regularly<br> use this algorithm',
                        font=dict(size=11),
                        showarrow=True))


annotations.append(dict(xref='paper', yref='y', x=0.1, y=0.1,
                        xanchor='center', yanchor='top',
                        text='Kaggle +<br>YouTube',
                        font=dict(size=13),
                        showarrow=False))

annotations.append(dict(xref='paper', yref='y', x=0.25, y=0.1,
                        xanchor='center', yanchor='top',
                        text='Kaggle +<br>Blogs',
                        font=dict(size=13),
                        showarrow=False))

annotations.append(dict(xref='paper', yref='y', x=0.42, y=0.1,
                        xanchor='center', yanchor='top',
                        text='Kaggle +<br>YouTube',
                        font=dict(size=13),
                        showarrow=False))

annotations.append(dict(xref='paper', yref='y', x=0.58, y=0.1,
                        xanchor='center', yanchor='top',
                        text='Kaggle +<br>YouTube + Blogs',
                        font=dict(size=13),
                        showarrow=False))

annotations.append(dict(xref='paper', yref='y', x=0.78, y=0.1,
                        xanchor='center', yanchor='top',
                        text='Kaggle +<br>Blogs',
                        font=dict(size=13),
                        showarrow=False))

annotations.append(dict(xref='paper', yref='y', x=0.95, y=0.1,
                        xanchor='center', yanchor='top',
                        text='Kaggle +<br>YouTube + Blogs',
                        font=dict(size=13),
                        showarrow=False))

title_text = 'ML Algorithms Usage:<br>Aspiring Data Scientists with Different Media Diets<br>and How Close They are to Data Scientists <br><br>'
title = {
        'text': title_text,
        'y': 0.95,
        'x': 0.5,
        'font': {'size': 18},
        'xref': 'paper', #'yref': 'paper',
        'yanchor': 'top'} 

# Axis formatting
layout_xaxis = dict(
        showline=False,
        showgrid=False,
        showticklabels=False,
    )
layout_y_axis = dict(
        showgrid=False,
        zeroline=True,
        zerolinewidth=2.5, 
        zerolinecolor='dimgrey',
        showline=False,
        showticklabels=False,
        #range=[-0.45, 0.05]
    )

fig.update_layout(
    title=title,
    xaxis=layout_xaxis,
    yaxis=layout_y_axis,
    xaxis2=layout_xaxis,
    yaxis2=layout_y_axis,
    xaxis3=layout_xaxis,
    yaxis3=layout_y_axis,
)

fig.update_layout(annotations=annotations,
                  width=850,
                  height=650,
                  showlegend=False,
                  autosize=False,
                  margin=dict(
                      t=100,
                      l=10,
                      r=10,
                  ),
                  )
fig.show()

In [None]:
data_sub_yt_blog = framework_sub_yt_blog
data_sub_blog = framework_sub_blog
data_sub_yt = framework_sub_yt

fig = make_subplots(
    rows=1,
    cols=3,
    shared_xaxes=True,
    shared_yaxes=True,
)

# Blogs vs YouTube
for col in data_sub_yt_blog.columns:
    # Slope
    fig.add_trace(go.Scatter(
        x=['Kaggle +<br>YouTube', 'Kaggle +<br>Blogs'],
        y=data_sub_yt_blog[col],
        mode='lines+markers',
        name=col,
        line=dict(color=data_sub_yt_blog[col]['color'],
                  width=2.5),
    ), row=1, col=1)


# Adding Blogs
for col in data_sub_blog.columns:
    # Slope
    fig.add_trace(go.Scatter(
        x=['Kaggle +<br>YouTube', 'Kaggle + <br>Blogs + YouTube'],
        y=data_sub_blog[col],
        mode='lines+markers',
        name=col,
        line=dict(color=data_sub_blog[col]['color'],
                  width=2.5),
    ), row=1, col=2)

    
# Adding YouTube
for col in data_sub_yt.columns:
    # Slope
    fig.add_trace(go.Scatter(
        x=['Kaggle +<br>Blogs', 'Kaggle +<br>Blogs + YouTube'],
        y=data_sub_yt[col],
        mode='lines+markers',
        name=col,
        line=dict(color=data_sub_yt[col]['color'],
                  width=2.5),
    ), row=1, col=3)


# Adding annotations
annotations = []
for col in data_sub_yt_blog.columns:
    # Left side label
    annot_l = annotate_subplot_left(data_sub_yt_blog, col, 'x1')
    annotations.append(annot_l)
    # Right side label
    annot_r = annotate_subplot_right(data_sub_yt_blog, col, 'x1')
    annotations.append(annot_r)

    
for col in data_sub_blog.columns:
    # Left side label
    annot_l = annotate_subplot_left(data_sub_blog, col, 'x2')
    annotations.append(annot_l)
    # Right side label
    annot_r = annotate_subplot_right(data_sub_blog, col, 'x2')
    annotations.append(annot_r)


for col in data_sub_yt.columns:
    # Left side label
    annot_l = annotate_subplot_left(data_sub_yt, col, 'x3')
    annotations.append(annot_l)
    # Right side label
    annot_r = annotate_subplot_right(data_sub_yt, col, 'x3')
    annotations.append(annot_r)
    

annotations.append(dict(xref='paper', yref='y', x=0.15, y=0.0,
                        xanchor='center', yanchor='middle',
                        text='Baseline: % of Data Scientists who regularly<br> use this framework',
                        font=dict(size=11),
                        showarrow=True))


annotations.append(dict(xref='paper', yref='y', x=0.1, y=0.1,
                        xanchor='center', yanchor='top',
                        text='Kaggle +<br>YouTube',
                        font=dict(size=13),
                        showarrow=False))

annotations.append(dict(xref='paper', yref='y', x=0.25, y=0.1,
                        xanchor='center', yanchor='top',
                        text='Kaggle +<br>Blogs',
                        font=dict(size=13),
                        showarrow=False))

annotations.append(dict(xref='paper', yref='y', x=0.42, y=0.1,
                        xanchor='center', yanchor='top',
                        text='Kaggle +<br>YouTube',
                        font=dict(size=13),
                        showarrow=False))

annotations.append(dict(xref='paper', yref='y', x=0.58, y=0.1,
                        xanchor='center', yanchor='top',
                        text='Kaggle +<br>YouTube + Blogs',
                        font=dict(size=13),
                        showarrow=False))

annotations.append(dict(xref='paper', yref='y', x=0.78, y=0.1,
                        xanchor='center', yanchor='top',
                        text='Kaggle +<br>Blogs',
                        font=dict(size=13),
                        showarrow=False))

annotations.append(dict(xref='paper', yref='y', x=0.95, y=0.1,
                        xanchor='center', yanchor='top',
                        text='Kaggle +<br>YouTube + Blogs',
                        font=dict(size=13),
                        showarrow=False))

title_text = 'ML Frameworks Usage:<br>Aspiring Data Scientists with Different Media Diets<br>and How Close They are to Data Scientists<br><br> '
title = {
        'text': title_text,
        'y': 0.95,
        'x': 0.5,
        'font': {'size': 18},
        'xref': 'paper', #'yref': 'paper',
        'yanchor': 'top'} 


# Axis formatting
layout_xaxis = dict(
        showline=False,
        showgrid=False,
        showticklabels=False,
    )
layout_y_axis = dict(
        showgrid=False,
        zeroline=True,
        zerolinewidth=2.5, 
        zerolinecolor='dimgrey',
        showline=False,
        showticklabels=False,
        #range=[-0.45, 0.05]
    )

fig.update_layout(
    title=title,
    xaxis=layout_xaxis,
    yaxis=layout_y_axis,
    xaxis2=layout_xaxis,
    yaxis2=layout_y_axis,
    xaxis3=layout_xaxis,
    yaxis3=layout_y_axis,
)

fig.update_layout(annotations=annotations,
                  width=850,
                  height=650,
                  showlegend=False,
                  autosize=False,
                  margin=dict(
                      l=10,
                      r=10,
                  ),
                  )
fig.show()

Once again, we see a similarly weak impact from adding YouTube to our media diets: there is barely any boost for almost all the tools. On the other hand, aspiring data scientists with Blogs in their media diets use tools at a rate much closer to that of data science practitioners.

# Closing Thoughts

To recap what we have seen so far:

* People who prefer Kaggle, Blogs, etc. use tools at a higher rate, but the same is not true for YouTube users.
* People with Blogs in their media diet use various ML algorithms and tools at a higher rate than people with YouTube in their media diet.
* Aspiring data scientists (those who are students, currently unemployed, or in non-data-science jobs) whose media diet consists of Kaggle and Blogs use various tools at a rate much closer to that of data scientists.


Why is YouTube not an effective media source to boost our skillsets? Perhaps video formats are not conducive to learning by doing, in contrast to Kaggle and blogs (where we have code snippets and links to GitHub repos). Or perhaps the content is lower in quality? The Siraj Raval scandal comes to mind. 

One thing to note is that people have different learning styles. So if you are a visual learner who prefers videos, then watch all the YouTube videos you like! But do follow a couple of blogs as well; they will definitely help. And if you are deciding whether to spend your time on YouTube or Blogs? I'd say Blogs!
