# The Path into Data Science Career
<a id="intro0"></a>
***

We live in a time when learning resources are easy to find online. A search on any search engine or platform turns up countless blogs, websites and courses. After scrolling through a few pages, though, it can be hard to choose which one to follow, learn from and act on. Everyone shares advice based on their own valuable experience, but that advice may be biased by the environment they live in.

Thanks to the 'Kaggle 2019 ML & DS survey', a survey on subjects related to data science, we can gain broader insight into what is really going on in data science careers, and into what those already at the center of the field are doing.

I hope this analysis helps you a little in deciding what you want to do in your career, which skills to learn, and how much experience to gain for the job.


# Table of Contents
<a id="outline"></a>
***
* [The path into data science career (intro)](#intro0)

* [Roles in data science](#roles)

* [Career environment](#environments)
   * [Job titles](#titles)
   * [Amount of duties](#dutiesamount)
       * [Relationship between duty amounts and data team size](#dutyvsteamsize)
       * [Relationship between data team size and company status](#teamsizevsstatus)

* [Career qualification](#qualification)
   * [Education degree](#education)
   * [Experiences](#experience)
       * [Data analysis with coding](#codingyears)
       * [Machine learning](#mlyears)
   * [Skills](#skills)
       * [Programming language](#program)   
       * [Analysis](#analysis)
       * [Machine learning](#ml)
           * [Machine learning framework](#mlframework)
           * [Machine learning model](#mlmodel)

* [Path into career of data science](#path)
   * [Platforms for data science courses](#course)
   * [Favorite media sources reported on data science topics](#media)

* [Conclusion](#conclusion)

* [Appendix](#appendix)
   * [Tools for analysis](#analysisappendix)
   * [Tools for ml](#mlappendix)

# Roles in Data Science
<a id="roles"></a>
***

Using the 2019 Kaggle DS & ML Survey, we look at the subfields of respondents who are **working in a role related to data** and whose **business incorporates machine learning**, 8570 subjects in total. The survey question on job content asks '**Select any activities that make up an important part of your role at work (Select all that apply)**', with the following options:

1) **Analyze and understand data** to influence product or business decisions 

2) Build and/or run the **data infrastructure** that my business uses for storing, analyzing, and operationalizing data

3) Build **prototypes** to explore applying **machine learning** to new areas

4) Build and/or run a **machine learning service** that operationally improves my product or workflows

5) **Experimentation** and iteration to improve existing ML models

6) Do **research** that advances the state of the art of machine learning

7) None of these activities are an important part of my role at work

8) Others

Work in data science includes acquiring data, understanding data, and using data to achieve different goals. Apart from acquiring and cleaning data, the survey options cover the basic workflow of the whole data science industry. We therefore first look at the number of subjects in each job role. (Subjects who selected only 'Other' or 'None' are excluded.)
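Because the question is "select all that apply", the survey file stores it as one column per option (`Q9_Part_1`, `Q9_Part_2`, ...): a cell holds the option text when the respondent ticked it and is empty otherwise. The sketch below shows, on a made-up four-row table with shortened option texts (both are assumptions for illustration only), how such columns turn into overlapping role groups.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the survey table (hypothetical rows, shortened option texts).
toy = pd.DataFrame({
    "Q9_Part_1": ["Analyze data", "Analyze data", np.nan, "Analyze data"],
    "Q9_Part_2": [np.nan, "Data infrastructure", np.nan, np.nan],
    "Q9_Part_3": ["ML prototypes", np.nan, "ML prototypes", np.nan],
})

# A non-empty cell means the respondent selected that option.
selected = toy.notna()

# Size of each role group, as a count and as a percentage of all rows.
group_size = selected.sum()
group_pct = group_size / len(toy) * 100

print(group_pct)
```

Since one respondent can belong to several groups, the group percentages can add up to more than 100%. The same holds for the role groups compared throughout this analysis.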

In [None]:
import numpy as np
import pandas as pd
import matplotlib
import seaborn as sns
matplotlib.style.use('ggplot')
import matplotlib.pyplot as plt

################################################

### target group division

def dividetargetgroup(answer):
    
    # Q5 -> career titles (no student and unemployed)

    q5data = answer.dropna(subset=['Q5'])
    studentsdata = answer[answer['Q5']=='Student']
    unemployeddata = answer[answer['Q5']=='Not employed']
    employeddata = q5data[~q5data['Q5'].isin(['Student','Not employed'])]
    
    # Q8 -> business w. ML
    
    employeddatanoML = employeddata.loc[(employeddata['Q8'] == 'No (we do not use ML methods)') | (employeddata['Q8'] == 'I do not know') ]
    employeddataML = employeddata.loc[(employeddata['Q8'] != 'No (we do not use ML methods)') & (employeddata['Q8'] != 'I do not know') ]
    employeddataML = employeddataML.dropna(subset=['Q8'])
    
    # Q9 -> job contents
    
    none_text = 'None of these activities are an important part of my role at work'
    employeddataML_none = employeddataML[employeddataML['Q9_Part_7'] == none_text]
    # .copy() so the column assignments below act on an independent frame
    employeddataML = employeddataML[employeddataML['Q9_Part_7'] != none_text].copy()

    # number of unselected options among Parts 1-6 and Part 8 ("Other")
    employeddataML['Q9nans'] = employeddataML[['Q9_Part_1','Q9_Part_2','Q9_Part_3','Q9_Part_4','Q9_Part_5','Q9_Part_6','Q9_Part_8']].isnull().sum(axis=1)
    # respondents who selected nothing, or only "Other"
    employeddataML_nan_onlyothers = employeddataML[(employeddataML['Q9nans']==7) | ((employeddataML['Q9nans']==6) & (employeddataML['Q9_Part_8']=='Other'))]
    employeddataML.drop(index=employeddataML_nan_onlyothers.index, inplace=True) # 8373 -> 7912
    # shift so that 6 - Q9nans equals the number of selected duties ("Other" included)
    employeddataML['Q9nans'] = employeddataML['Q9nans']-1
    
    return employeddataML

### Plots


def pd_plots_6groups_multi1(Qi, nsub, inputdatagroups, first, origdatatype=None):

    datatype = []
    datacount = []
    
    # nans -> find total respondents (rows with at least one selected option)
    
    columnlist = ['Q'+str(Qi)+'_Part_'+str(i+1) for i in range(nsub)] # create list
    nanscount = inputdatagroups[columnlist].isnull().sum(axis=1)
    groupqueslen = (nanscount < nsub).sum() # count the True values; len() would count every row
    
    for i in range(nsub):
        
        # datatype
        
        if first:
            datatype.append(inputdatagroups['Q'+str(Qi)+'_Part_'+str(i+1)].value_counts().index.to_list()[0])
        else:
            datatype = origdatatype
        
        # datacount
        
        if len(inputdatagroups['Q'+str(Qi)+'_Part_'+str(i+1)].value_counts().to_list())>0:
            datacount.append(inputdatagroups['Q'+str(Qi)+'_Part_'+str(i+1)].value_counts().to_list()[0])
        else:
            datacount.append(0)
        
    pddata = pd.DataFrame({'counts': [x/groupqueslen*100 for x in datacount]}, index = datatype) # percentage
    
    if first:
        return pddata, datatype
    else:
        return pddata, None
    
def plots_multi_1group(inputdatagroups, inputtitle, Qi, nsub, first, origdatatype): 
    datatype = []
    datacount = []
    for i in range(nsub):
        if first:
            datatype.append(inputdatagroups['Q'+str(Qi)+'_Part_'+str(i+1)].value_counts().index.to_list()[0])
        else:
            datatype = origdatatype
        if len(inputdatagroups['Q'+str(Qi)+'_Part_'+str(i+1)].value_counts().to_list())>0:
            datacount.append(inputdatagroups['Q'+str(Qi)+'_Part_'+str(i+1)].value_counts().to_list()[0])
        else:
            datacount.append(0)
    pddata = pd.DataFrame({'counts': datacount, 'datatype': datatype})
    pddata.set_index('datatype').sort_values(by='counts').plot(kind='barh', title=inputtitle)
    
    return datatype

def plots_single_1group(inputdatagroups, inputtitle, Qi, charttype):
    
    plt.figure()
    pddata = inputdatagroups['Q'+str(Qi)].value_counts(ascending=True) # counts per answer
    pddata.plot(kind=charttype, title=inputtitle)

    return pddata
    
def plots_6groups(Qi, Q9_choice_select, nsub=None, multichoice=None, categoryorders=None):
    
    # Build table
 
    grouplist = []
    for i in range(6):
        if multichoice:
            if i==0:
                pddata, origdatatype = pd_plots_6groups_multi1(Qi, nsub, Q9_choice_select[i],1)
            else:
                pddata, _ = pd_plots_6groups_multi1(Qi, nsub, Q9_choice_select[i], 0, origdatatype)
            grouplist.append(pddata)
        else:
            grouplist.append(Q9_choice_select[i]['Q'+str(Qi)].value_counts(normalize=True)*100) # normalize within group   
    
    grouptable = pd.concat([grouplist[0],grouplist[1]],axis=1, sort=True)
    for i in range(4):
        grouptable = pd.concat([grouptable,grouplist[i+2]],axis=1, sort=True)
    grouptable.columns = targetgroupnames
    
    # Plot

    ax1 = grouptable.reindex(categoryorders).transpose().plot(kind='bar', figsize=(11,5), rot=0) 
    ax1.set(ylabel="percentage")
    plt.legend(loc='center left', bbox_to_anchor=(1.0, 0.5))
    
################################################

### Read files 

answer = pd.read_csv('/kaggle/input/kaggle-survey-2019/multiple_choice_responses.csv')
answer_other = pd.read_csv('/kaggle/input/kaggle-survey-2019/other_text_responses.csv')
survey_schema = pd.read_csv('/kaggle/input/kaggle-survey-2019/survey_schema.csv')
answer = answer[1:]

### Target group

employeddataML = dividetargetgroup(answer)
targetgroupnames = ['Analyze data', 'Data infrastructure', 'ML prototypes', 'ML service', 'ML experimentation', 'ML research']

### Target group division

# Q9 groups

Q9_choice_none = []
Q9_choice_select = []
Q9_choice_only = []
Q9_choice_more = []

for i in range(6):
    Q9_choice_none.append(employeddataML[employeddataML['Q9_Part_'+str(i+1)].isnull()])
    Q9_choice_select.append(employeddataML.drop(index=Q9_choice_none[-1].reset_index()['index'].to_list()))
    Q9_choice_select_data = Q9_choice_select[-1]
    Q9_choice_only.append(Q9_choice_select_data[Q9_choice_select_data['Q9nans']==5])
    Q9_choice_more.append(Q9_choice_select_data[Q9_choice_select_data['Q9nans']<5])

In [None]:
groupnum = []
for i in range(6):
    groupnum.append(len(Q9_choice_select[i]))
groupnumpercentage = [x/len(employeddataML)*100 for x in groupnum]
groupnum = pd.DataFrame(groupnumpercentage, index = targetgroupnames).sort_values(by=0, ascending=True).rename(columns={0:'subject count percentage'})
ax = groupnum.plot(kind='barh')
_ = ax.set(xlabel="percentage")

As data science is the science of data, a deep understanding of the data is needed before any decision or implementation starts. Indeed, among the 8.5K subjects, 70% have roles that include analyzing and understanding data.

Data infrastructure, the foundation of data operations, is part of the role for 40% of subjects.
Among the machine learning roles, 60% of subjects build machine learning prototypes; the results of these prototypes then feed further business decisions and technology development. Machine learning research has the lowest share of the 6 groups, at less than 30%.

--

Roles in marketing might be segmented into market analyst, branding, advertising, etc. Roles in website development might be segmented into front-end, back-end, UI/UX, etc. The roles and responsibilities of job positions can be split into more divisions once a business has grown to a certain scale.

As data science is a field that has grown up over the last 10 years, it may be segmented into more specialized parts once the industry becomes more developed in the next decade. Different people may play different but more specific roles in the industry. The responsibilities of each role may become narrower, and more specialized skills and background may be needed. For example, there might be roles at Google focused on researching and developing the newest deep learning algorithms, such as a state-of-the-art text-to-speech algorithm built on sophisticated deep learning models. In addition, there may be particular duties and roles that we expect to take on in the field. For these reasons, many of the discussions below on career environments and career qualifications are broken down by job role.

# Career Environments
<a id="environments"></a>
***

What is it like to work in data science (DS)? What are the position names on job sites? What duties come with a position? Who are my colleagues? What is the team I work with like?

The questions start with the job search (position names) and move on to the working situation. The environment plays an important role at work: your own duties and responsibilities and the status of the company can affect the atmosphere and, in turn, your mood at work. Whether colleagues work on similar problems can also affect efficiency. We therefore next look at the career environment of data science. Following the questions above, we discuss job titles, the number of duties, and the data team size (defined here as the number of employees responsible for DS).

## Job Titles
<a id="titles"></a>

Different companies have different requirements and definitions for data science job titles. When scrolling through 'Data Scientist' jobs on LinkedIn or Glassdoor, job responsibilities vary by team and company. Some require data analysis and building ML prototypes,
and others also include building data infrastructure or researching state-of-the-art DL methods.

Since we may have expectations about the roles and duties in the data science industry, and may have prepared skills and projects for those roles, we break down the job title distribution by job role.


In [None]:
plots_6groups(5, Q9_choice_select)

When respondents are grouped by job role, data scientist is still the most common title, surpassing 40% in each group. Software engineer is the second or third most common title in each job role group, far ahead of data engineer and database engineer, which suggests how flexible this title is for career decisions within the software industry.

In the data analysis role, data analyst and software engineer are the second and third most common titles (more than 10% each). The role of running or building data infrastructure also largely falls to software engineers and data analysts. This indicates that some data analyst duties need not only analysis skills but also knowledge of data infrastructure.

Subjects with the title of research scientist also play an important part in duties involving machine learning, ranking second or third in the ML-related groups, especially in machine learning research.

In addition, large amounts of data are sometimes used to drive business decisions. Roles involving business insight, such as business analyst and product manager, therefore also require data science skills (around 5% in each group). Statisticians surprisingly make up a very small share of every group, even the analysis group, while other (perhaps more specific) job titles account for a larger share of these job role groups.

## Amount of duties
<a id="dutiesamount"></a>

In technology development, a start-up might have only 3 developers, while a big tech company
might have separate teams for end-to-end development, R&D, platform, quality assessment, etc. Depending on the condition of the team and the company, job duties in data science also differ.

We look at the number of duties assigned to the subjects, then discuss the relationship between duties and data team size, and between data team size and the status of the business.

In [None]:
### duty 

## amounts

fig, axes = plt.subplots(nrows=1, ncols=2)
fig.set_figheight(3)
fig.set_figwidth(10)
df1 = (employeddataML['Q9nans'].value_counts(normalize=True)*100).reset_index()
df1['duty amounts'] = 6-df1['index']
df1 = df1.sort_values(by=['duty amounts'])
ax1 = df1.set_index('duty amounts').drop(columns=['index']).rename(columns={"Q9nans": "subject counts percentage"}).plot(kind='barh',ax=axes[0])
ax1.set(xlabel="percentage", ylabel="duty amounts")

## ratio of only vs more

only = []
more = []
# print('none','select','only','more')
for i in range(6):
    only.append(len(Q9_choice_only[i]))
    more.append(len(Q9_choice_more[i]))
    #print(len(Q9_choice_none[i]),len(Q9_choice_select[i]),len(Q9_choice_only[i]),len(Q9_choice_more[i]))
df = pd.DataFrame(data={'single duty': only, 'more duties': more, 'index': targetgroupnames}).set_index('index')
df = df.div(df.sum(axis=1), axis=0)*100
ax2 = df.plot(kind='barh', stacked=True, rot=0, ax=axes[1])
ax2.set(xlabel="percentage", ylabel="duty type")

plt.tight_layout()

The left chart counts the subjects (working in DS) by number of duties, shown as percentages. Around 25% of subjects focus on a single duty at work. Two and three duties are the next most common, each surpassing 20%. In total, more than 60% of subjects have no more than 3 duties. As the number of duties increases, the number of subjects in the group decreases. Only 12 (of 8570) subjects have 7 duties (other duties included).

-

The right chart shows the ratio of subjects with a single duty to those with more than one duty, calculated within each job role.

In the group whose roles include analysis, almost 20% of subjects focus only on analyzing data, the highest share among the 6 job role groups. In fact, around 1000 subjects work only on analyzing data. As data analysis is also the most selected duty in the [previous section](#roles), this indicates that the job market currently has strong demand for data analysis skills.

Among the job roles involving machine learning, ML research has the second highest single-duty ratio of all groups. As [ML research is the least selected of the 6 duties](#roles), those who enter this role are more likely to focus on research alone. ML experimentation has the lowest single-duty ratio; this duty is usually combined with other roles.
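The single-duty ratio discussed above can be sketched on a small boolean selection matrix. This is only an illustration under assumed data: the four rows and the three role columns are made up, and the real analysis derives the same quantities from the `Q9_Part_*` columns.

```python
import pandas as pd

# Hypothetical selection matrix: True where a respondent picked the role.
picked = pd.DataFrame({
    "Analyze data":  [True,  True,  False, True],
    "ML prototypes": [False, True,  True,  False],
    "ML research":   [False, False, True,  False],
})

n_duties = picked.sum(axis=1)   # number of duties per respondent
single = n_duties == 1          # respondents with exactly one duty

# For each role: of the respondents who picked it,
# the percentage who picked nothing else.
single_share = {
    role: 100 * (picked[role] & single).sum() / picked[role].sum()
    for role in picked.columns
}

print(single_share)
```

Note that the denominator is the size of each role group, not the whole sample, which is why the ratios of different roles are comparable even when the groups differ a lot in size.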

### Relationship between duty amounts and data team size
<a id="dutyvsteamsize"></a>

In [None]:
# people with different duty amounts vs data team size

employeddataML['duty counts'] = 6-employeddataML['Q9nans']
df3 = employeddataML.copy().rename(columns={'Q7':'employees responsible in DS'}).groupby(['employees responsible in DS','duty counts']).size().unstack(fill_value=0)
df3 = df3.reindex(['0','1-2','3-4','5-9','10-14','15-19','20+']).iloc[:, ::-1].transpose()
sns.heatmap(df3, annot=True, fmt=".1f", cmap="Blues")
plt.show()

The heatmap shows the number of subjects for each combination of duty count and data team size.
Each data team size follows a trend similar to the overall distribution of subjects by duty count in the [previous section](#dutiesamount).

Looking at the data team size distribution for each duty count, the distributions are also similar across groups. Most subjects are in data teams of 1-4 or 20+ employees, so there are more positions in the smallest and the largest teams. Companies with big data teams employ many people, but many smaller companies also need data science employees, in smaller teams.

Subjects with more duties (more than 3) are found more often in the smallest or the largest data teams. On the other hand, within data teams of 1-2, there are surprisingly more than 560 positions (subjects) that focus on only one duty, the highest count of any cell in this heatmap.
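Raw counts make it hard to compare rows with very different totals. As a complementary view (not part of the original analysis), `pd.crosstab` with `normalize="index"` converts each duty-count row into percentages, so the shape of the team-size distribution can be compared row by row. The six-row table below is made up for illustration.

```python
import pandas as pd

# Hypothetical respondents: number of duties and data team size.
toy = pd.DataFrame({
    "duty counts": [1, 1, 1, 2, 2, 3],
    "team size":   ["1-2", "1-2", "20+", "1-2", "20+", "20+"],
})

# Raw counts per (duty count, team size) cell, as in the heatmap.
counts = pd.crosstab(toy["duty counts"], toy["team size"])

# Same table with every row rescaled to sum to 100%.
row_pct = pd.crosstab(toy["duty counts"], toy["team size"], normalize="index") * 100

print(counts)
print(row_pct.round(1))
```

Either table can be passed to `sns.heatmap` in the same way as `df3` above.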




### Relationship between data team size and company business status on ML
<a id="teamsizevsstatus"></a>

In [None]:
df = employeddataML.copy().rename(columns={'Q7':'Employees responsible in DS'}).pivot_table(index='Q8', columns='Employees responsible in DS', values='duty counts',aggfunc=len)

df['Business ML Condition'] = ['Exploring ML methods', 'Well established ML methods', 'Recently starting ML methods', 'Generating insights with ML']
df = df.set_index('Business ML Condition').reindex(['Generating insights with ML', 'Recently starting ML methods', 'Exploring ML methods', 'Well established ML methods'])
df = df.reindex(columns=['0','1-2','3-4','5-9','10-14','15-19','20+'])

sns.heatmap(df, annot=True, fmt=".1f", cmap="Blues")
plt.show()

The figure shows the number of subjects for each combination of data team size and business ML status.

More than 1000 subjects (more than 10% of all subjects) work in companies that have well-established ML methods and a data team of more than 20 people.

More than 800 subjects are in companies that are exploring ML methods but whose data team includes only 1-2 employees. This suggests that some companies want to use data to benefit the business but have fewer resources, or do not yet plan to invest more in the data team.

The numbers of subjects across data team sizes are more even in companies that have just started using ML methods. In companies that only generate insights from data, most subjects are in very small data teams.

Note that, apart from businesses with well-established ML, the most subjects are in companies with only 1-2 DS employees, indicating that businesses without well-developed ML technology tend to hire fewer data science employees.


# Career qualification
<a id="qualification"></a>
***

When you look for jobs on job sites, the descriptions cover not only the company, the team and the role on offer, but also the preferred qualifications for candidates. The more of your qualifications match the demands of the job, the higher your chances of being selected for interviews and getting offers.

What qualifications are needed to get a DS job? Do we need at least a master's degree? How many years of experience should we have? How many skills do we need before we get a chance at an interview ... and maybe an offer ...?

We therefore discuss the personal backgrounds of the subjects (education, experience and skills) to get a sense of the likely qualifications for a DS career. Assuming we have a specific role in mind in the DS field, the discussions below are based on the percentage distributions within each job role.

## Education Degree
<a id="education"></a>

In [None]:
categoryorders = ['No formal education past high school', 'Some college/university study without earning a bachelor\'s degree', 'Bachelors degree ', 'Masters degree', 'Doctoral degree', 'Professional degree', 'I prefer not to answer'] # categoryrename
plots_6groups(4, Q9_choice_select)

In each job role group, more than 90% of subjects have education beyond high school. Subjects with a master's degree surpass 40% in each job role group.

In the roles of analyzing data, data infrastructure and ML service, subjects with a bachelor's degree have a higher share than those with a doctoral degree. In the roles closer to machine learning implementation (building ML prototypes, experimentation and research), doctoral degrees have a higher share than bachelor's degrees, which suggests more chances to keep doing research after a PhD.

On the other hand, some subjects without higher education have also entered data science careers.

## Experience
<a id="experience"></a>

Job requirements may state the expected years of experience in a related domain or industry, or a few years of experience in programming, machine learning, deep learning, etc. As coding and machine learning are commonly required skills, we discuss the subjects' coding and machine learning experience.

### Data analysis with coding
<a id="codingyears"></a>

In [None]:
categoryorders = ['I have never written code', '< 1 years', '1-2 years', '3-5 years', '5-10 years', '10-20 years', '20+ years']
plots_6groups(15, Q9_choice_select, categoryorders=categoryorders)

The distributions follow a similar trend across all job role groups. Subjects with 3-5 years of coding experience are the largest share in each group, around 30%. The share falls off for both longer and shorter experience. Only around 8% of subjects have less than a year of coding experience, and a small number have never written code yet are still responsible for DS roles.
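Experience brackets are text labels, so sorting them alphabetically would scramble the time order. The `categoryorders` argument used above fixes the order by reindexing the percentage table. A minimal sketch of that step on made-up answers, with a shortened bracket list assumed for brevity:

```python
import pandas as pd

# Desired display order of the experience brackets (shortened list).
order = ['< 1 years', '1-2 years', '3-5 years', '5-10 years']

# Hypothetical answers of five respondents.
answers = pd.Series(['3-5 years', '1-2 years', '3-5 years', '< 1 years', '3-5 years'])

# Percentage per bracket, re-ordered; brackets nobody chose appear as 0.
pct = answers.value_counts(normalize=True).reindex(order, fill_value=0) * 100

print(pct)
```

One caveat of this approach: a label in the order list that does not match the survey text exactly (for example a different apostrophe or a trailing space) silently becomes an empty row, so the labels are worth checking against `value_counts().index`.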

### Machine learning
<a id="mlyears"></a>

In [None]:
categoryorders = ['< 1 years', '1-2 years', '2-3 years', '3-4 years', '4-5 years', '5-10 years', '10-15 years', '20+ years'] # categoryrename
plots_6groups(23, Q9_choice_select, categoryorders=categoryorders)

The distributions again follow a similar trend across all job role groups, with 1-2 years of experience the most common, followed by 2-3 years. Few subjects have more than 10 years of ML experience. Subjects with less than a year of ML experience make up around 10% of each job role group, which is not low! The share surpasses 10% in the data analysis and data infrastructure groups, since these roles may not require ML knowledge.

## Skills
<a id="skills"></a>

With so much data available, understanding the data leads to the next step: using it in the real world by implementing data-driven ideas with algorithms, ML and DL. These procedures are carried out through programming. From acquiring, cleaning and exploring data to using it, the process would be tough without a programming language. In this section, we discuss the commonly required skills from the perspective of programming, data analysis and machine learning.

### Programming language
<a id="program"></a>

In [None]:
plots_6groups(18, Q9_choice_select, nsub=12, multichoice=1)

In every job role group, more than 75% of subjects regularly use Python, more than 40% use SQL, and more than 30% use R. Python, SQL and R are the three most used programming languages among subjects working in DS. The next most used language is Bash, at around 20%, followed in varying order by JavaScript, Java and C++.
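For a select-all question, the reported percentage depends on the denominator. One reasonable choice, assumed in this sketch, is to divide by the respondents who ticked at least one option, so that people who skipped the question do not pull every share down. The table below is made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical answers; the third respondent skipped the question entirely.
toy = pd.DataFrame({
    "Q18_Part_1": ["Python", "Python", np.nan, "Python"],
    "Q18_Part_2": [np.nan, "R", np.nan, np.nan],
    "Q18_Part_3": ["SQL", "SQL", np.nan, np.nan],
})

ticked = toy.notna()

# Respondents who answered the question at all (at least one option ticked).
n_answered = int(ticked.any(axis=1).sum())

pct_of_answered = ticked.sum() / n_answered * 100   # skipped rows excluded
pct_of_all = ticked.sum() / len(toy) * 100          # skipped rows included

print(pct_of_answered.round(1))
print(pct_of_all.round(1))
```

The two versions differ only when some respondents left the whole question blank; the ranking of the options is the same either way.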

### Analysis 
<a id="analysis"></a>

In [None]:
plots_6groups(14, Q9_choice_select)

More than 50% of subjects use local development environments (RStudio, JupyterLab, etc.) as their primary tool for data analysis. In the data analysis role, basic statistical software and cloud-based data software each have a share of around 10%, while in the other roles cloud-based data software is used more. Business intelligence software has relatively few users, even in the data analysis role.

For more analysis tools, such as data visualization tools, big data/analytics products, or relational database products, please go to [Tools for analysis](#analysisappendix).

### Machine learning
<a id="ml"></a>

#### Machine learning framework
<a id="mlframework"></a>

In [None]:
plots_6groups(28, Q9_choice_select, nsub=12, multichoice=1)

Scikit-learn, TensorFlow and Keras have the highest usage in every group, followed by Xgboost and RandomForest. Among the deep learning frameworks, PyTorch has relatively low usage compared to TensorFlow, trailing by more than 10 percentage points.

#### Machine learning model
<a id="mlmodel"></a>

In [None]:
plots_6groups(24, Q9_choice_select, nsub=12, multichoice=1)

Linear or logistic regression has a usage ratio of more than 60%, followed by decision trees / random forests, gradient boosting machines and convolutional neural networks (CNN). In the ML research role, deep learning models (CNN, DNN, GAN) are used more often than in the other job roles.

For more on ML tooling, such as ML tools, automated ML tools or cloud computing platforms, please go to [Tools for ml](#mlappendix).

# Path into Data Science Career
<a id="path"></a>
***

To get into a data science career, we need to prepare a sufficient background to qualify for interviews and win job opportunities. Thankfully, the 'Kaggle survey on DS & ML' has given us a glimpse of the job environment and skill requirements in the discussions above. The working environments of data science and the common personal backgrounds were discussed based on the job content options provided by Kaggle.

The road to data science involves a great deal of learning and practice. We might start by getting familiar with the field and then dive deeper into several subfields. Understanding the field can start with courses at school or online, where basic theory and some of the required skills are taught and practiced through assignments or projects. (With some luck, these projects and skills may even help you get interviews.)

After gaining some knowledge and skills through courses, if you have the chance to decide for yourself, it is a good time to think hard about which subfields or skill sets you want to dive into. **Design your own path to the career!** Imagining the kind of role you want and considering which ways of learning you prefer can also help with the decision.

Depending on the role you want to play in the industry, more learning and practice is needed to improve your skill set, knowledge and number of projects, so that you stand out among millions of resumes and perform well in interviews.

Learning may not stop even after you enter the career. As the team's responsibilities change over time, you may need to do other tasks with tools you have never used before. As new technology keeps emerging, you may need to dig into the mathematics to understand the mechanisms of the newest machine learning algorithms. In addition, some positions need not only hard skills but also soft skills, such as presentation, communication, or even business insight. **The learning process never stops.** Following experts in the field on media platforms helps us keep track of the latest news in the industry, see what those experts are thinking, and learn with them.

Below are learning platforms that can serve as references when choosing where to learn.

## Platforms for Data Science Courses
<a id="course"></a>

In [None]:
plots_6groups(13, Q9_choice_select, nsub=12, multichoice=1)

Coursera has a share of more than 55% in every group. DataCamp, Kaggle Courses, Udacity and Udemy are also competitive, each taken by around 30% of subjects in every group.

## Favorite Media Sources Reported on Data Science Topics
<a id="media"></a>

In [None]:
plots_6groups(12, Q9_choice_select, nsub=12, multichoice=1)

Blogs, Kaggle and YouTube are the top 3 media sources for data science topics, followed by journal publications and Twitter. In the ML research group, journal publications play a relatively important role compared to the other job role groups.

# Conclusion
<a id="conclusion"></a>
***

Many online resources share experience and advice on getting into a data science career. Because different people hold different opinions based on their backgrounds, we may be confused about what to do after scrolling through many web pages. We therefore looked into the annual data science survey conducted by Kaggle and observed what those already working in data science are doing. Job environments, job requirements and qualifications, and learning methods and platforms were discussed by role and duty, in order to learn which skills to acquire from the 8500 people working in the DS field.

Thank you for your patience in reading this work. I really hope it helps a little with decisions about your career path and learning process. Good luck and best wishes to all!

# Appendix
<a id="appendix"></a>
***


## Analysis tools
<a id="analysisappendix"></a>

### Data visualization libraries

In [None]:
plots_6groups(20, Q9_choice_select, nsub=12, multichoice=1)

### Big data / analytics products

In [None]:
plots_6groups(31, Q9_choice_select, nsub=12, multichoice=1)

### Relational database products

In [None]:
plots_6groups(34, Q9_choice_select, nsub=12, multichoice=1)

## Machine learning
<a id="mlappendix"></a>

### ML tools

In [None]:
plots_6groups(25, Q9_choice_select, nsub=8, multichoice=1)

### Automated ML tools

In [None]:
plots_6groups(33, Q9_choice_select, nsub=12, multichoice=1)

### Cloud computing platform

In [None]:
plots_6groups(29, Q9_choice_select, nsub=8, multichoice=1)