Procrastination is prevalent in our lives. In the famous TED Talk __[Inside the mind of a master procrastinator | Tim Urban](https://www.youtube.com/watch?v=arj7oStGLkU&t=1s)__ [1], the speaker talks about the risk of letting the "Instant Gratification Monkey" take the wheel. On my own journey of self-learning data science, however, I ran into another major obstacle: a hesitant "Rational Decision Maker".

Not long ago, I was offered a new position when my institution initiated transformation towards data-informed models, and later became the first and only "data" person in the institution. I started with my purely-theoretical statistics knowledge from college and zero coding experience, and then embarked on a journey of self-regulated "on-the-job training".

It did not take long for me to realize how serious a challenge self-learning can be for a procrastinator, especially when the self-doubts crept in: "Can I do it without a dedicated college education?", "Is it too late for me to make this career shift?", "What lies ahead on this journey?"

This story tries to shed light on these questions using the 2019 Kaggle survey results. Focusing on respondents who were not formally trained in data-related disciplines, it aims to give people in a similar situation a determination (if you are interested in data but worried about missing the formal academic background), a destination (if you do not know where this interest may lead you) and a plan (how to get where you want to go).

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
survey_raw_data = pd.read_csv('../input/kaggle-survey-2019/multiple_choice_responses.csv')

survey_raw_data_index = survey_raw_data.drop(survey_raw_data.index[0])

# Table of Contents

1. [Beginning Of The Journey](#p1)
1. [Who We Are](#p2)
1. [Where We Are Heading](#p3) 
1. [How To Get There](#p4)
1. [What Is Unknown About Us](#p6)
1. [References](#p7)

<a id="p1"></a>
# Beginning Of The Journey

To leverage the advantages of digitalization, organizations have a growing demand for data talent. However, there is still a shortage of skilled staff who can perform data-related work [2].

Two reasons make this a great opportunity for people who are interested in data. First, the shortage of data talent results partly from a lack of university degree programs in data science, or from a lack of focus on applying the knowledge to meet industry needs [3]. Instead of merely waiting for the formal education system to mature, many institutions are becoming more willing to train their own data experts [4], either from current staff who know the institution well, or from people who have accumulated professional experience in the industry.

Second, the growing availability of open resources for data training makes it possible to self-learn data-related skills at relatively low opportunity cost, so people do not need to quit their jobs or enroll in a costly education program to gain the relevant knowledge.

<a id="p2"></a>
# Who We Are

Respondents are split into two groups, those who learned data science through university courses and those who did not, in order to reveal the characteristics of each. On one hand, this provides insights into the differences and similarities between the groups. On the other hand, and more importantly, I hope that seeing what they share with self-trained data workers can strengthen others' determination to dive into this new field even without a relevant degree.

"Student" and "Not employed" are excluded from the analysis, in order to focus on current data workers.

In [None]:
survey_raw_data_index = survey_raw_data_index[(survey_raw_data_index.Q5 != 'Student') & (survey_raw_data_index.Q5 != 'Not employed')]

In [None]:
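# Work on a copy so that adding a column does not write into a filtered view
survey_raw_data_index = survey_raw_data_index.copy()
# Column 44 is Q13_Part_10, the "University Courses" option of the course-platform question
survey_raw_data_index['Group'] = np.where(survey_raw_data_index.iloc[:,44]=='University Courses (resulting in a university degree)', 'University-trained Data Worker', 'Self-trained Data Worker')
survey_raw_data_group = survey_raw_data_index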

## Age Starting Writing Code To Analyze Data

One of the major barriers to entering a new field or learning a new skill is age. The longer one has been devoted to a field, the higher the sunk costs, and the harder it is to switch focus.

Both age (Q1) and experience writing code to analyze data (Q15) are used to calculate the age at which respondents started this data journey. Since all responses are given as ranges, they are converted to specific numbers for calculation, using the midpoint of each range. Open intervals with one side unspecified, such as 70+, are replaced by the number itself, e.g. 70 (likewise, "< 1 years" is counted as 1 and "I have never written code" as 0). Although this may skew the results slightly downward, it is the best available option given the dataset, and it should not distort the interpretation, since ages beyond 70 are unlikely to have a significantly different impact on one's professional career.
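The conversion rule described above can be sketched as a small standalone helper. This is only an illustration, not the code used in the cells below: the function name `range_to_number` is hypothetical, and the sample strings simply mimic the survey's answer format.

```python
import re

def range_to_number(answer):
    """Convert a survey range string to a single number.

    Closed ranges such as '25-29' become their midpoint; one-sided
    answers such as '70+' or '< 1 years' become the stated number.
    Returns None when the answer contains no digits.
    """
    numbers = [int(n) for n in re.findall(r'\d+', answer)]
    if not numbers:
        return None
    if len(numbers) == 1:
        return float(numbers[0])
    return (numbers[0] + numbers[1]) / 2

for answer in ['25-29', '70+', '3-5 years', '< 1 years']:
    print(answer, '->', range_to_number(answer))
```

Note that this helper returns `None` for "I have never written code", whereas the notebook assigns 0 years to that answer.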

In [None]:
# '70+' is an open interval, so it is replaced by the number itself
survey_raw_data_group_a1 = survey_raw_data_group[survey_raw_data_group.iloc[:,1] == '70+'].copy()
survey_raw_data_group_a1['age_modified'] = 70

# All other answers are closed ranges such as '25-29'; use the midpoint of the two bounds
survey_raw_data_group_a2 = survey_raw_data_group[survey_raw_data_group.iloc[:,1] != '70+'].copy()
survey_raw_data_group_a2['age_modified_1'] = survey_raw_data_group_a2.iloc[:,1].str.extract(r'(\d\d)', expand=False)
survey_raw_data_group_a2['age_modified_2'] = survey_raw_data_group_a2.iloc[:,1].str.extract(r'-(\d\d)', expand=False)
survey_raw_data_group_a2['age_modified_1'] = pd.to_numeric(survey_raw_data_group_a2['age_modified_1'])
survey_raw_data_group_a2['age_modified_2'] = pd.to_numeric(survey_raw_data_group_a2['age_modified_2'])
survey_raw_data_group_a2['age_modified'] = (survey_raw_data_group_a2['age_modified_1'] + survey_raw_data_group_a2['age_modified_2']) / 2
survey_raw_data_group_a2 = survey_raw_data_group_a2.drop(['age_modified_1','age_modified_2'], axis = 1)

survey_raw_data_age_frame = [survey_raw_data_group_a1,survey_raw_data_group_a2]
survey_raw_data_age = pd.concat(survey_raw_data_age_frame)

In [None]:
# '< 1 years' is counted as 1 year
survey_raw_data_age_y1 = survey_raw_data_age[survey_raw_data_age.iloc[:,55] == '< 1 years'].copy()
survey_raw_data_age_y1['yr_coding'] = survey_raw_data_age_y1.iloc[:,55].str.extract(r'(\d)', expand=False)

# '20+ years' is counted as 20 years
survey_raw_data_age_y2 = survey_raw_data_age[survey_raw_data_age.iloc[:,55] == '20+ years'].copy()
survey_raw_data_age_y2['yr_coding'] = survey_raw_data_age_y2.iloc[:,55].str.extract(r'(\d\d)+', expand=False)

# No coding experience is counted as 0 years
survey_raw_data_age_y3 = survey_raw_data_age[survey_raw_data_age.iloc[:,55] == 'I have never written code'].copy()
survey_raw_data_age_y3['yr_coding'] = 0

# Remaining answers are closed ranges such as '3-5 years'; use the midpoint (missing answers stay NaN)
survey_raw_data_age_y4 = survey_raw_data_age[(survey_raw_data_age.iloc[:,55] != '20+ years') & (survey_raw_data_age.iloc[:,55] != '< 1 years') & (survey_raw_data_age.iloc[:,55] != 'I have never written code')].copy()
survey_raw_data_age_y4['yr_coding_1'] = survey_raw_data_age_y4.iloc[:,55].str.extract(r'(\d+)-', expand=False)
survey_raw_data_age_y4['yr_coding_2'] = survey_raw_data_age_y4.iloc[:,55].str.extract(r'-(\d+)', expand=False)
survey_raw_data_age_y4['yr_coding_1'] = pd.to_numeric(survey_raw_data_age_y4['yr_coding_1'])
survey_raw_data_age_y4['yr_coding_2'] = pd.to_numeric(survey_raw_data_age_y4['yr_coding_2'])
survey_raw_data_age_y4['yr_coding'] = (survey_raw_data_age_y4['yr_coding_1'] + survey_raw_data_age_y4['yr_coding_2']) / 2
survey_raw_data_age_y4 = survey_raw_data_age_y4.drop(['yr_coding_1','yr_coding_2'], axis = 1)

survey_raw_data_year_frame = [survey_raw_data_age_y1,survey_raw_data_age_y2,survey_raw_data_age_y3,survey_raw_data_age_y4]
survey_raw_data_year = pd.concat(survey_raw_data_year_frame)

Then the two fields are used to generate the derived field, age starting writing code to analyze data, by subtracting the number of years writing code from the current age.

In [None]:
# Keep only respondents who answered Q15; copy so new columns can be added safely
survey_raw_data_year_nonull = survey_raw_data_year[survey_raw_data_year.loc[:,'Q15'].notna()].copy()

survey_raw_data_year_nonull['age_modified'] = pd.to_numeric(survey_raw_data_year_nonull['age_modified'])
survey_raw_data_year_nonull['yr_coding'] = pd.to_numeric(survey_raw_data_year_nonull['yr_coding'])
survey_raw_data_year_nonull['age_coding'] = survey_raw_data_year_nonull['age_modified'] - survey_raw_data_year_nonull['yr_coding']

survey_raw_data_year_nonull['age_coding'].value_counts().sort_index()

After initial exploration, the respondents are grouped into the following bins to reflect different life stages when people started to code to analyze data.

- Before college: 17 or younger
- Higher and graduate education: 18-25
- Early career: 26-35
- Mid career: 36-50
- Late career: 51-70
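The bins above can also be expressed with `pd.cut`, a more compact alternative to spelling out each condition. The following is a minimal sketch on made-up starting ages; the values in `ages` are illustrative, not survey data.

```python
import numpy as np
import pandas as pd

# Hypothetical starting ages, for illustration only
ages = pd.Series([15.5, 17, 23, 30.5, 42, 66])

# Right-inclusive edges: (-inf, 17], (17, 25], (25, 35], (35, 50], (50, 70]
edges = [-np.inf, 17, 25, 35, 50, 70]
labels = ['Before college', 'Higher and graduate education',
          'Early career', 'Mid career', 'Late career']

age_groups = pd.cut(ages, bins=edges, labels=labels)
print(age_groups.tolist())
```

Because `pd.cut` closes each interval on the right by default, an age of exactly 17 falls into "Before college", matching the `<= 17` condition used in the notebook.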

In [None]:
conditions = [
    (survey_raw_data_year_nonull['age_coding'] <= 17),
    (survey_raw_data_year_nonull['age_coding'] > 17) & (survey_raw_data_year_nonull['age_coding'] <= 25),
    (survey_raw_data_year_nonull['age_coding'] > 25) & (survey_raw_data_year_nonull['age_coding'] <= 35),
    (survey_raw_data_year_nonull['age_coding'] > 35) & (survey_raw_data_year_nonull['age_coding'] <= 50),
    (survey_raw_data_year_nonull['age_coding'] > 50) & (survey_raw_data_year_nonull['age_coding'] <= 70)
    ]
choices = ['Before college', 'Higher and graduate education', 'Early career','Mid career','Late career']

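# Rows matching no condition get an explicit string label, so the result stays a string column
survey_raw_data_year_nonull['age_group_coding'] = np.select(conditions, choices, default='Unknown')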

In [None]:
df_age_coding = survey_raw_data_year_nonull.loc[:,['Group','age_group_coding']]
df_age_coding.sort_index().head()

In [None]:
self_trained = survey_raw_data_year_nonull[survey_raw_data_year_nonull.loc[:,'Group'] == 'Self-trained Data Worker'].loc[:,'age_group_coding'].value_counts()
university_trained = survey_raw_data_year_nonull[survey_raw_data_year_nonull.loc[:,'Group'] == 'University-trained Data Worker'].loc[:,'age_group_coding'].value_counts()
df = pd.DataFrame({'Self-trained Data Worker': self_trained,
                   'University-trained Data Worker': university_trained})
df = df.rename_axis("fields", axis='columns').rename_axis("age_group", axis = 'rows').reset_index()
df

From the table, we can see that most people, whether self-trained or university-trained, started writing code to analyze data between higher education and mid career. However, it is hard to see further trends or details from raw counts alone.
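Because the two groups differ in size, raw counts are easier to compare once they are converted to within-group shares. Here is a minimal sketch with `pd.crosstab` on a toy frame; the rows are invented for illustration, and only the column names follow the notebook.

```python
import pandas as pd

toy = pd.DataFrame({
    'Group': ['Self-trained Data Worker'] * 4 + ['University-trained Data Worker'] * 2,
    'age_group_coding': ['Early career', 'Early career', 'Mid career',
                         'Higher and graduate education',
                         'Higher and graduate education', 'Early career'],
})

# normalize='index' divides each row by its total, so every group sums to 1
shares = pd.crosstab(toy['Group'], toy['age_group_coding'], normalize='index')
print(shares.round(2))
```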

In [None]:
age_data = df_age_coding.groupby('Group').age_group_coding.value_counts(normalize = True).to_frame('percent').reset_index()

plt.figure(figsize=(21,10))
sns.barplot(x = age_data['age_group_coding'], y = age_data['percent'], hue = age_data['Group'], order = ['Before college','Higher and graduate education','Early career','Mid career','Late career'])
plt.xlabel('Age Start Coding to Analyze Data')
plt.ylabel('Percent in Group')

As noted, it is not surprising that most people, whether or not they completed data courses at university, started coding for data analysis between the higher-education and early-career phases. However, compared with the university-trained group, a larger proportion of the self-trained group started learning to code during their professional career, from the early to the late phases.

This indicates that although it is common for people to start coding to analyze data during the education period, there is still a significant number of people making the shift in early or even mid-to-late career stages. For the latter, it may not be a complete career switch, but is likely to be a career advancement by adding data and coding to one's own professional skill set.

As the saying goes, "the best time to plant a tree is twenty years ago - the second best time is now". It is never too late to start learning to code, but the second-worst choice after "never" is "later".

## Gender

Next, let's look at the gender distribution of respondents in the university-trained and self-trained groups.

In [None]:
df_gender = survey_raw_data_year_nonull.loc[:,['Group','Q2']]

gender_data = df_gender.groupby('Group').Q2.value_counts(normalize = True).to_frame('percent').reset_index()

plt.figure(figsize = (21,10))
sns.barplot(x = gender_data['Q2'], y = gender_data['percent'], hue = gender_data['Group'], order = ['Male','Female','Prefer to self-describe','Prefer not to say'])
plt.xlabel('Gender')
plt.ylabel('Percent in Group')

One thing that interests me is that the proportion of female respondents is larger in the university-trained group, as compared to those self-trained.
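Whether a gap like this is statistically significant can be checked with a two-proportion z-test. The sketch below uses only the standard library. The counts are hypothetical placeholders, not figures from the survey, and the function name `two_proportion_z_test` is illustrative.

```python
import math

def two_proportion_z_test(success_a, n_a, success_b, n_b):
    """Return (z, two-sided p-value) for H0: both proportions are equal."""
    p_a = success_a / n_a
    p_b = success_b / n_b
    # Pooled proportion under the null hypothesis
    pooled = (success_a + success_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / std_err
    # Two-sided p-value from the standard normal distribution
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical counts: women / all respondents in each group
z, p = two_proportion_z_test(180, 1000, 150, 1000)
print(f'z = {z:.2f}, p = {p:.3f}')
```

A small p-value would suggest the difference in female share is unlikely to be due to chance alone. It would say nothing about why the difference exists.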

Although the difference may not be statistically significant, it is consistent with the common impression that men are disproportionately represented in the fields of data and coding, which is also reflected in the fact that the majority of respondents, in both groups, are male. Men may be more confident about diving into these fields even if they did not receive relevant formal training in college, while women might encounter other types of discouragement.

"Gender is irrelevant. Certainly the tennis ball doesn't know what the gender was of the tennis coach." This is reinforced by the significant number of women who participated in the survey even though they never took data courses during formal education. Data, as a tool, is neutral. It is the person using it who makes the difference, and one should never be discouraged by a neutral object.

## Educational Attainment

Educational attainment is correlated with age. However, it provides more direct insights into the potential sunk costs of previous educational investments.

In [None]:
df_edu = survey_raw_data_year_nonull.loc[:,['Group','Q4']]

edu_data = df_edu.groupby('Group').Q4.value_counts(normalize = True).to_frame('percent').reset_index()

plt.figure(figsize = (21,10))
chart2 = sns.barplot(x = edu_data['Q4'], y = edu_data['percent'], hue = edu_data['Group'], order = ['Doctoral degree',"Master’s degree",'Professional degree',"Bachelor’s degree","Some college/university study without earning a bachelor’s degree",'No formal education past high school','I prefer not to answer'])
chart2.set_xticklabels(labels = ['Doctoral degree',"Master’s degree",'Professional degree',"Bachelor’s degree","Some college/university study without earning a bachelor’s degree",'No formal education past high school','I prefer not to answer'], rotation=60)
plt.xlabel('Educational Attainment')
plt.ylabel('Percent in Group')

The result is consistent with intuition. People who have advanced degrees, e.g. master's and doctoral degrees, are more likely to have been trained in data during formal education, since many data science or related programs are offered at master's and doctoral levels. Spending more time in formal education also makes it more likely that one takes data-related courses.

However, among respondents with a bachelor's degree, a larger proportion did not take any data courses in college. The proportion of people with advanced degrees but no university data training is also large enough to suggest that acting on an interest in data is not irrational, even though the higher the level of education achieved, the larger the opportunity cost of such a career switch or seemingly unrelated investment in data may be.

<a id="p3"></a>
# Where We Are Heading

Having the determination to embark on the journey is the first step of everything. However, without a clear destination, one may just get lost on the way. To establish specific goals, it is helpful to learn about the experiences of others who have gone through the same process.

## Professional Titles

First, let's look at which professional titles are currently held by the self-trained group of respondents, compared with their university-trained counterparts.

In [None]:
df_title = survey_raw_data_year_nonull.loc[:,['Group','Q5']]

title_data = df_title.groupby('Group').Q5.value_counts(normalize = True).to_frame('percent').reset_index()

plt.figure(figsize = (21,10))
chart3 = sns.barplot(x = title_data['Q5'], y = title_data['percent'], hue = title_data['Group'])
# Rotate the existing tick labels; title_data['Q5'] lists every title once per group, so it cannot serve as the label list
plt.xticks(rotation=60)
plt.xlabel('Professional Title')
plt.ylabel('Percent in Group')

Although it is harder for people to teach themselves into highly technical positions that may require systematic training (usually obtained through formal education), such as statistician and research scientist, a significant proportion of respondents now work as analysts, even without formal data training at university.

The most popular professional title, for both university-trained and self-trained data workers, is data scientist. Although a slightly larger proportion of university-trained data workers ended up as data scientists, the proportion is also significant for self-trained people. This should not only instill confidence in people who are still hesitating, but may also serve as a reference for which positions to target if the goal is a career switch.

## Salary

Salary is another important indicator of what to expect in a career that involves data. Considering the high expense of completing a university degree in data, does the cost pay off when we look at the earnings of the two groups?

In [None]:
df_salary = survey_raw_data_year_nonull.loc[:,['Group','Q10']]
salary_data = df_salary.groupby('Group').Q10.value_counts(normalize = True).to_frame('percent').reset_index()
# Salary answers in ascending order, used for both the bar order and the tick labels
salary_order = ['$0-999','1,000-1,999','2,000-2,999','3,000-3,999','4,000-4,999','5,000-7,499','7,500-9,999','10,000-14,999','15,000-19,999','20,000-24,999','25,000-29,999','30,000-39,999','40,000-49,999','50,000-59,999','60,000-69,999','70,000-79,999','80,000-89,999','90,000-99,999','100,000-124,999','125,000-149,999','150,000-199,999','200,000-249,999','250,000-299,999','300,000-500,000','> $500,000']
plt.figure(figsize = (21,10))
chart4 = sns.barplot(x = salary_data['Q10'], y = salary_data['percent'], hue = salary_data['Group'], order = salary_order)
chart4.set_xticklabels(labels = salary_order, rotation=60)
plt.xlabel('Salary')
plt.ylabel('Percent in Group')

To avoid getting lost in too many details, the salary ranges are divided into the following groups, based on a quick Data Analyst salary reference [5].

- Low outlier: < 20,000
- Below regular: 20,000 - 39,999
- Bottom half regular: 40,000 - 59,999
- Top half regular: 60,000 - 79,999
- Beyond regular: 80,000 - 199,999
- High outlier: 200,000 and above
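The grouping above can be sketched as a lookup on the lower bound of each answer. This is an illustrative alternative rather than the notebook's own approach: the names `BAND_EDGES`, `BAND_NAMES` and `salary_band` are hypothetical, and the band labels reuse the spelling from the notebook's code.

```python
import bisect

# Lower edges (in US dollars) at which a new band starts
BAND_EDGES = [20_000, 40_000, 60_000, 80_000, 200_000]
BAND_NAMES = ['Low outlier', 'Below regular', 'Bottom Half regular',
              'Top Half regular', 'Beyond regular', 'High outlier']

def salary_band(answer):
    """Map a salary answer such as '40,000-49,999' to its band."""
    # Keep the text before the first '-' and strip everything but digits
    lower_text = answer.split('-')[0]
    lower = int(''.join(ch for ch in lower_text if ch.isdigit()))
    # bisect_right counts how many band edges are <= the lower bound
    return BAND_NAMES[bisect.bisect_right(BAND_EDGES, lower)]

for answer in ['$0-999', '20,000-24,999', '150,000-199,999', '> $500,000']:
    print(answer, '->', salary_band(answer))
```

Using the lower bound works here because every survey range lies entirely inside one band.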

In [None]:
# Group the raw Q10 answers into the six salary bands listed above
salary_bands = {
    'Low outlier': ['$0-999', '1,000-1,999', '2,000-2,999', '3,000-3,999', '4,000-4,999', '5,000-7,499', '7,500-9,999', '10,000-14,999', '15,000-19,999'],
    'Below regular': ['20,000-24,999', '25,000-29,999', '30,000-39,999'],
    'Bottom Half regular': ['40,000-49,999', '50,000-59,999'],
    'Top Half regular': ['60,000-69,999', '70,000-79,999'],
    'Beyond regular': ['80,000-89,999', '90,000-99,999', '100,000-124,999', '125,000-149,999', '150,000-199,999'],
    'High outlier': ['200,000-249,999', '250,000-299,999', '300,000-500,000', '> $500,000'],
}

# Copy so that adding a column does not write into a slice of the full table
df_salary = df_salary.copy()
conditions = [df_salary['Q10'].isin(answers) for answers in salary_bands.values()]
choices = list(salary_bands.keys())

# Respondents without a salary answer match no band and are labelled 'Unknown'
df_salary['salary_range'] = np.select(conditions, choices, default='Unknown')
salary_data_2 = df_salary.groupby('Group').salary_range.value_counts(normalize = True).to_frame('percent').reset_index()

In [None]:
plt.figure(figsize = (21,10))
sns.barplot(x = salary_data_2['salary_range'], y = salary_data_2['percent'], hue = salary_data_2['Group'], order = ['Low outlier', 'Below regular', 'Bottom Half regular','Top Half regular','Beyond regular','High outlier'])
plt.xlabel('Salary Range')
plt.ylabel('Percent in Group')

The trends in both groups, whether university-trained or self-trained, are similar. However, those who are self-trained are currently slightly more likely to have low-outlier salaries. This may in fact be one reason why people want a career switch through learning data and coding. Note that this does not take into account any potential geographical or positional differences. What is encouraging is that a decent proportion of self-trained people have salaries in the top-half ranges (including beyond regular).

<a id="p4"></a>
# How To Get There

With a determination and a destination in mind, the unavoidable question becomes how to achieve the goal. With so many educational and practical resources on the internet, which platforms to choose becomes a question.

## Course Platforms

Let's see which course platforms recommended by respondents are the most popular.

In [None]:
# Q13_Part_1 ... Q13_Part_12 hold one answer option each; a non-null cell means the option was selected
course_cols = ['Q13_Part_' + str(i) for i in range(1, 13)]
course_labels = [
    'Udacity',
    'Coursera',
    'edX',
    'DataCamp',
    'DataQuest',
    'Kaggle Courses (i.e. Kaggle Learn)',
    'Fast.ai',
    'Udemy',
    'LinkedIn Learning',
    'University Courses (resulting in a university degree)',
    'None',
    'Other'
]

# count() tallies the non-null cells per column within each group
course_counts = survey_raw_data_year_nonull.groupby('Group')[course_cols].count().T
course_counts.index = course_labels
course_counts.columns.name = None
course_counts = course_counts[['Self-trained Data Worker', 'University-trained Data Worker']]

course_counts

In [None]:
course_counts['Total'] = course_counts['Self-trained Data Worker'] + course_counts['University-trained Data Worker']
plt.figure(figsize=(21,10))
sns.barplot(x = course_counts.index, y = course_counts.Total, color = "C1").set_xticklabels(labels = course_counts.index, rotation = 60)
sns.barplot(x = course_counts.index, y = course_counts['Self-trained Data Worker'], color = "C0").set_xticklabels(labels = course_counts.index, rotation = 60)

topbar = plt.Rectangle((0,0),1,1,fc="C1", edgecolor = 'none')
bottombar = plt.Rectangle((0,0),1,1,fc='C0',  edgecolor = 'none')
plt.legend([bottombar, topbar], ['Self-trained Data Worker', 'University-trained Data Worker'], loc=1, ncol = 2, prop={'size':16})
plt.xlabel('Course Platforms')

The major difference between the two groups is (of course) in "University Courses". Apart from that, the popularity ranking of the course platforms is the same in both groups, with Coursera the most popular, closely followed by Kaggle, Udemy, DataCamp and edX.
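Since the two groups are not the same size, platform popularity is easier to compare as a share of each group than as a raw count. Below is a sketch on a toy frame in which each `Q13_Part_*` column holds the platform name if it was selected and is empty otherwise. The rows are invented, and only the column naming follows the survey.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'Group': ['Self-trained Data Worker'] * 3 + ['University-trained Data Worker'] * 2,
    'Q13_Part_2': ['Coursera', np.nan, 'Coursera', 'Coursera', 'Coursera'],
    'Q13_Part_8': ['Udemy', 'Udemy', np.nan, np.nan, 'Udemy'],
})

platform_cols = ['Q13_Part_2', 'Q13_Part_8']

# 1 marks a selected option; the group mean is then the share who selected it
selected = toy[platform_cols].notna().astype(int)
shares = selected.groupby(toy['Group']).mean()
shares.columns = ['Coursera', 'Udemy']
print(shares.round(2))
```

Ranking each row of `shares` would then show whether the two groups order the platforms the same way.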

## Media Sources

Similarly, we also look into the recommended media sources that report on data science topics.

In [None]:
# Q12_Part_1 ... Q12_Part_12 hold one answer option each; a non-null cell means the option was selected
media_cols = ['Q12_Part_' + str(i) for i in range(1, 13)]
media_labels = [
    'Twitter (data science influencers)',
    'Hacker News (https://news.ycombinator.com/)',
    'Reddit (r/machinelearning, r/datascience, etc)',
    'Kaggle (forums, blog, social media, etc)',
    'Course Forums (forums.fast.ai, etc)',
    'YouTube (Cloud AI Adventures, Siraj Raval, etc)',
    'Podcasts (Chai Time Data Science, Linear Digressions, etc)',
    'Blogs (Towards Data Science, Medium, Analytics Vidhya, KDnuggets etc)',
    'Journal Publications (traditional publications, preprint journals, etc)',
    'Slack Communities (ods.ai, kagglenoobs, etc)',
    'None',
    'Other'
]

# count() tallies the non-null cells per column within each group
media_counts = survey_raw_data_year_nonull.groupby('Group')[media_cols].count().T
media_counts.index = media_labels
media_counts.columns.name = None
media_counts = media_counts[['Self-trained Data Worker', 'University-trained Data Worker']]

media_counts

In [None]:
media_counts['Total'] = media_counts['Self-trained Data Worker'] + media_counts['University-trained Data Worker']
plt.figure(figsize=(21,10))
# Use the media labels (not the course labels) on the x axis
sns.barplot(x = media_counts.index, y = media_counts.Total, color = "C1").set_xticklabels(labels = media_counts.index, rotation = 60)
sns.barplot(x = media_counts.index, y = media_counts['Self-trained Data Worker'], color = "C0").set_xticklabels(labels = media_counts.index, rotation = 60)

topbar = plt.Rectangle((0,0),1,1,fc="C1", edgecolor = 'none')
bottombar = plt.Rectangle((0,0),1,1,fc='C0',  edgecolor = 'none')
plt.legend([bottombar, topbar], ['Self-trained Data Worker', 'University-trained Data Worker'], loc=1, ncol = 2, prop={'size':16})
plt.xlabel('Media Platforms')

The popularity of the media sources is also basically the same in both groups, with Kaggle, blogs and YouTube being the three most popular sources. The only small difference is in course forums. This is understandable, since people who learn data through online course platforms instead of university courses may rely more on the forums when completing the courses.

<a id="p6"></a>
# What Is Unknown About Us

There is still a lot unknown about this group of self-trained data people. Among all the possible directions for exploration, I am specifically interested in what is unknown about us beyond data.

I view my largest advantage as being not my to-be-honed data skills, of course, but the valuable experience and knowledge I gained earlier as a non-data worker: being able to view issues from a business angle, the ability to communicate with and between technical and non-technical audiences, and first-hand knowledge gained at the frontline rather than through data warehouses.

I am also interested in the industry difference that might affect people's judgement on the value of investing in their interest in data. Working in a non-profit organization, I often question myself how to monetize values of what I am doing if monetization is not the purpose in the first place.

As mentioned at the beginning, this story is not a rigorous study, given the untested assumptions embedded throughout the analysis. Instead, it aims to show people who are hesitating over whether to take their interest in data any further that there are many people like us, no matter how old we are, what gender we identify as, or what education level we have attained. Some have achieved a lot in their current positions, while others are still trying to figure things out. With the recommended ways and platforms to learn data, we are not alone, and so we should not be discouraged by the worry that we are not on the right track.

<a id="p7"></a>
# References

1. https://www.youtube.com/watch?v=arj7oStGLkU&t=1s
2. https://betanews.com/2019/10/18/data-analytics-talent-shortage/
3. https://techcrunch.com/2015/12/31/how-to-stem-the-global-shortage-of-data-scientists/
4. https://www.infoworld.com/article/3131605/why-you-should-retrain-your-employees-to-become-your-data-scientists.html
5. https://www.payscale.com/research/US/Job=Data_Analyst/Salary