# Introduction

I began my Data Science journey at the beginning of the Covid-19 pandemic. I was completing my final quarter at the University of California, Santa Barbara as a Mathematical Sciences major, and my curiosity about Data Science was growing. Now I am enrolled in a Master of Science in Data Science program at Cabrini University, and I hope to gain enough experience to finally break into the industry! My goal for this project is to discover the most valuable skills that Data Scientists in the United States have. Specifically, I will compare Data Scientists whose highest degree is a bachelor's with those whose highest degree is a master's. I hope to learn more about what Data Scientists are doing so that I can focus my own studying on relevant skills.

# Exploring the Data

Let's first take a look at the first 5 rows of the data.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt 
%matplotlib inline
import seaborn as sns
sns.set(style = 'darkgrid', color_codes=True)

In [None]:
data = pd.read_csv('../input/kaggle-survey-2020/kaggle_survey_2020_responses.csv', low_memory = False)
data.head()

We can see that the first row of the dataset contains the questions that were asked. Before continuing, I'd like to remove that row. I also want to preserve this dataset as is, so I will make a copy.

In [None]:
data.drop([0], axis = 0, inplace = True)
my_df = data.copy()
data.head()
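
Since that first row holds the question text, it can be handy to keep it around as a lookup before dropping it. Below is a minimal sketch of that idea on a tiny made-up CSV. The names `raw`, `questions`, and `responses` are only for illustration, and the two sample columns stand in for the real survey file.

```python
import io
import pandas as pd

# Tiny stand-in for the survey CSV: the row right after the header holds the question text
csv_text = io.StringIO(
    "Q1,Q2\n"
    "What is your age (# years)?,What is your gender?\n"
    "25-29,Man\n"
    "30-34,Woman\n"
)
raw = pd.read_csv(csv_text)

# keep the question text as a {column name: question} lookup before dropping that row
questions = raw.iloc[0].to_dict()
responses = raw.drop(index=0).reset_index(drop=True)

print(questions['Q1'])
print(responses)
```

With the real file, the same lookup would make it easy to double-check what each `Q` column actually asked before plotting it.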

Great! Now, I want to see the distribution in education among the Data Scientists. I imagine that most Data Scientists have a bachelor's or master's degree as their highest education achieved.

In [None]:
# 'some college without...' is too long, i'll rename it
my_df.loc[lambda df: df['Q4'] == 'Some college/university study without earning a bachelor’s degree', 'Q4'] = 'Some college'
my_df.loc[lambda df: df['Q4'] == 'No formal education past high school', 'Q4'] = 'High School'

In [None]:
# education frequency plot
g = sns.catplot(data = my_df, y = 'Q4', kind = 'count', palette = 'rocket', aspect = 2, order = my_df['Q4'].value_counts().index)
plt.title('Education Ordered by Frequency')
plt.ylabel('Highest Formal Education')
plt.xlabel('Count')
plt.show()

My intuition is correct! A large number of Data Scientists have a Bachelor's or Master's Degree. The next highest education is a Doctoral degree.

Next, I would like to see the distribution of Data Scientists by country.

In [None]:
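india = sum(my_df['Q3'] == 'India')/len(my_df)
# the long country name is only shortened to 'USA' in the next cell, so match both spellings
us = sum(my_df['Q3'].isin(['USA', 'United States of America']))/len(my_df)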
print('India : {:.2%}'.format(india))
print('USA : {:.2%}'.format(us))

In [None]:
# rename the long names
my_df.loc[lambda df: df['Q3'] == 'United States of America', 'Q3'] = 'USA'
my_df.loc[lambda df: df['Q3'] == 'United Kingdom of Great Britain and Northern Ireland', 'Q3'] = 'Great Britain'
my_df.loc[lambda df: df['Q3'] == 'Iran, Islamic Republic of...', 'Q3'] = 'Iran'
my_df.loc[lambda df: df['Q3'] == 'United Arab Emirates', 'Q3'] = 'Emirates'
my_df.loc[lambda df: df['Q3'] == 'Republic of Korea', 'Q3'] = 'Korea'

# plot the country
g = sns.catplot(y = 'Q3', kind = 'count', data = my_df, palette = 'rocket', height = 10, aspect = 1, order = my_df['Q3'].value_counts().index)
plt.title('Countries Ordered by Frequency')
plt.xscale('log')
plt.xlabel('Count (log scale)')
plt.ylabel('Country')
plt.show()

Wow! India has a lot of Data Scientists. (It's important to remember that this is a survey that was conducted by Kaggle. This does not represent the entire Data Science community. A better phrase may be "Data Scientists in India participated the most in the survey"). India represents about 29% of the responses and USA represents about 11% of the responses.
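
The shares quoted above can also be read off in one step with `value_counts(normalize=True)`. Here is a minimal sketch on made-up data (the `toy` frame is not the survey). One thing to keep in mind: missing values are dropped by default, so this divides by the number of people who answered the question and not by the number of rows.

```python
import pandas as pd

# Toy stand-in for the country column (Q3); the values are made up
toy = pd.DataFrame({'Q3': ['India', 'India', 'India', 'USA', 'Brazil', 'USA', 'India', 'Japan']})

# normalize=True returns each country's share of the answers instead of raw counts
country_share = toy['Q3'].value_counts(normalize=True)

for country, share in country_share.head(2).items():
    print('{} : {:.2%}'.format(country, share))
```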

Maybe most of the respondents in India have master's degrees and that is skewing our data? Maybe the US Data Scientists have more bachelor's degrees than master's degrees! To find out, we'll plot the highest education achieved once more by India and the USA.

In [None]:
# make tmp df of USA and India
countries = ['USA','India']
tmp = my_df.loc[lambda x: x['Q3'].isin(countries), :]

# plot
g = sns.catplot(data = tmp, y = 'Q4', kind = 'count', palette = 'rocket',aspect = 1.5
                ,order = tmp['Q4'].value_counts().index, row = 'Q3')

g.set_xlabels('Count')
g.set_ylabels('Highest Formal Education')
g.set_titles(row_template = '{row_name}')

plt.show()

This is interesting. My intuition was incorrect. In India, more respondents hold a bachelor's degree than a master's, but the reverse is true in the USA. In my personal experience applying to Data Science jobs in the US, many job postings require a master's degree, which is consistent with the US graph above (I'm not sure if the same is true for India).
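
Because India has far more responses than the USA, raw counts are hard to compare across the two panels. One way to put both countries on the same footing is a row-normalized cross-tabulation. This is a minimal sketch on made-up data, and `toy` and `education_share` are illustrative names only.

```python
import pandas as pd

BACHELOR = 'Bachelor’s degree'
MASTER = 'Master’s degree'

# Toy data: 6 respondents from India and 4 from the USA (made-up numbers)
toy = pd.DataFrame({
    'Q3': ['India'] * 6 + ['USA'] * 4,
    'Q4': [BACHELOR] * 4 + [MASTER] * 2 + [BACHELOR] * 1 + [MASTER] * 3,
})

# normalize='index' turns each row (country) into shares that sum to 1
education_share = pd.crosstab(toy['Q3'], toy['Q4'], normalize='index')
print(education_share.round(2))
```

Each row then reads as "of the respondents in this country, what fraction holds each degree", which is the comparison the paragraph above is making.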

Now I'd like to focus on US Data Scientists and filter by bachelor's and master's degrees.

In [None]:
data = my_df.loc[lambda x: x['Q3'] == 'USA', :]
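# .copy() so that the in-place renames done later on `data` do not trigger pandas' SettingWithCopyWarning
data = data.loc[lambda x: x['Q4'].isin(['Bachelor’s degree','Master’s degree']), :].copy()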

## Exploring US Data Scientists
### Age/Gender/Pay
Let's get to know the difference between bachelor's and master's US Data Scientists.

First, we take a look at their age and gender.

In [None]:
order = sorted(data['Q1'].unique()) # display in increasing age

# plot
g = sns.catplot(data = data,y = 'Q1', kind = 'count', palette = 'rocket', order = order, row = 'Q4', aspect = 1.5)
plt.subplots_adjust(top=0.9)
g.fig.suptitle('US Data Scientists Age Distribution')
g.set_xlabels('Count')
g.set_ylabels('Age')
g.set_titles(row_template = '{row_name}')
plt.show()

In [None]:
# plot the gender distributions
g = sns.catplot(data = data, y = 'Q2', kind = 'count', 
                  palette = 'rocket', order = data['Q2'].value_counts().index, row = 'Q4', aspect = 1.5)
plt.subplots_adjust(top=0.9)
g.fig.suptitle('US Data Scientists Gender Distribution') 
g.set_xlabels('Count (log scale)')
g.set_ylabels('Gender')
g.set_titles(row_template = '{row_name}')
plt.xscale('log')
plt.show()

It looks like most bachelor's degree Data Scientists are between 18 and 39 years old, and most master's degree Data Scientists are between 22 and 39 years old. There is an interesting increase at ages 60-69 for both bachelor's and master's degrees.

We can also notice that most Data Scientists are male.

Now, let's take a look at their annual pay.

In [None]:
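def get_max(df, Q):
    """
    Return the most common answer to question Q and its share of all rows in df.
    Example:
    Q = 'Q2'
    """
    n = len(df) # note: this also counts respondents who skipped the question
    answer_choices = df[Q].value_counts()
    return answer_choices.index[0], answer_choices.iloc[0]/n

def get_top_perc(df, Q, perc = 0.5):
    """
    Collect the most common answers to Q until their combined share reaches perc.
    """
    n = len(df)
    answer_choices = df[Q].value_counts()
    total_perc = 0
    categories = []
    for key in answer_choices.index:
        if total_perc >= perc:
            break
        total_perc += answer_choices[key]/n
        categories.append(key)
    # always return, even if perc is only reached on the last answer (or never reached)
    return categories, total_perc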

In [None]:
bachelor = data.loc[lambda df: df['Q4'] == 'Bachelor’s degree', :]
master = data.loc[lambda df: df['Q4'] == 'Master’s degree',:]

In [None]:
# master pay
category, perc = get_max(master, 'Q24')
print('master pay')
print(category, perc)
print('\n')

categories, total_perc = get_top_perc(master, 'Q24')
print('top 50%')
print(categories, total_perc)
print('\n')

# bachelor pay
category, perc = get_max(bachelor, 'Q24')
print('bachelor pay')
print(category, perc)
print('\n')

categories, total_perc = get_top_perc(bachelor, 'Q24')
print('top 50%')
print(categories, total_perc)

In [None]:
# plot the pay distributions
order = ['$0-999', '1,000-1,999','2,000-2,999','3,000-3,999','4,000-4,999','5,000-7,499','7,500-9,999',
         '10,000-14,999','15,000-19,999','20,000-24,999','25,000-29,999','30,000-39,999','40,000-49,999',
         '50,000-59,999','60,000-69,999','70,000-79,999','80,000-89,999','90,000-99,999','100,000-124,999',
         '125,000-149,999','150,000-199,999','200,000-249,999','250,000-299,999','300,000-500,000','> $500,000']

g = sns.catplot(data = data, y = 'Q24', kind = 'count', order = order,
                  palette = 'rocket', row = 'Q4', aspect = 2)
plt.subplots_adjust(top=0.9)
g.fig.suptitle('US Data Scientists Yearly Compensation Distribution') 
g.set_xlabels('Count')
g.set_ylabels('Yearly Compensation')
g.set_titles(row_template = '{row_name}')
plt.show()

Generally, Data Scientists get paid well and we can clearly see that above. For both master's and bachelor's degree Data Scientists, there is a spike at 100k-125k.

Over 50% of master's degree Data Scientists make between 70k and 250k. For bachelor's degree Data Scientists the range is wider, with over 50% making between 50k and 250k.

This makes sense: jobs that require a master's degree often offer higher yearly compensation than jobs that don't.
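
Because `get_top_perc` collects the most frequent brackets first, the brackets it returns are not guaranteed to sit next to each other on the pay scale. One alternative is to accumulate the shares in pay order and read off the bracket where the running total crosses 50%. The sketch below uses made-up data with simplified, hypothetical brackets (not the survey's real ones).

```python
import pandas as pd

# Simplified, hypothetical pay brackets listed from lowest to highest
pay_order = ['0-49,999', '50,000-99,999', '100,000-149,999', '150,000+']

# Made-up answers for eight respondents
toy_pay = pd.Series(['50,000-99,999', '100,000-149,999', '100,000-149,999', '0-49,999',
                     '150,000+', '100,000-149,999', '50,000-99,999', '100,000-149,999'])

# share of each bracket, re-ordered from lowest to highest pay
share = toy_pay.value_counts(normalize=True).reindex(pay_order, fill_value=0)

# running total of the shares; the first bracket at or above 0.5 contains the median
cumulative = share.cumsum()
median_bracket = cumulative[cumulative >= 0.5].index[0]

print(cumulative)
print('median bracket:', median_bracket)
```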

### Programming Experience
Let's take a look at these Data Scientists' programming experience and the languages they use on a regular basis (question Q7).

In [None]:
# define the order for the plots
order = ['I have never written code', '< 1 years', '1-2 years', '3-5 years', '5-10 years', '10-20 years', '20+ years']

# plot
g = sns.catplot(data = data, y = 'Q6', kind = 'count', aspect = 2, palette = 'rocket', order = order, row = 'Q4')
plt.subplots_adjust(top=0.9)
g.fig.suptitle('US Data Scientists Programming Experience') 
g.set_xlabels('Count')
g.set_ylabels('Years Writing Code/Programming')
g.set_titles(row_template = '{row_name}')
plt.show()

In [None]:
# make function that selects the question parts for multiple selected questions
def get_questions(Q):
    # match the exact question or its 'Q7_...' parts, so that e.g. 'Q1' does not also pick up 'Q14'
    return [column for column in my_df.columns if column == Q or column.startswith(Q + '_')]

# make function that makes frequencies of categories for the multiple selected questions
def make_dict_freq(df, sort = True):
    tmp_dict = dict()
    for column in df.columns:
        answers = df[column].dropna()
        if answers.empty:
            continue # nobody in this subset selected the option
        # each column holds a single answer text, so the first non-missing value names the option
        tmp_dict[answers.iloc[0]] = len(answers)
    if sort:
        # now sort the dictionary
        tmp_dict = dict(sorted(tmp_dict.items(), key = lambda x: x[1], reverse = True))
    return tmp_dict

fig, axs = plt.subplots(2,1, figsize = (10,10))

# make the df for this plot (Masters)
q7 = get_questions('Q7')
Q7 = data.loc[lambda x: x['Q4'] == 'Master’s degree', q7]
tmp_dict = make_dict_freq(Q7)
keys = list(tmp_dict.keys())
values = list(tmp_dict.values())

# plot
sns.barplot(x = values, y = keys, palette = 'rocket', ax = axs[0])
axs[0].set_title('Master\'s degree')

# make the df for this plot (Bachelors)
q7 = get_questions('Q7')
Q7 = data.loc[lambda x: x['Q4'] == 'Bachelor’s degree', q7]
tmp_dict = make_dict_freq(Q7)
keys = list(tmp_dict.keys())
values = list(tmp_dict.values())

# plot
sns.barplot(x = values, y = keys, palette = 'rocket', ax = axs[1])
axs[1].set_title('Bachelor\'s degree')

# add x and y labels and remove the extras
for ax in axs.flat:
    ax.set(xlabel='Count', ylabel='Regularly Used Programming Languages')

for ax in axs.flat:
    ax.label_outer()
fig.show()

In [None]:
# master experience coding
category, perc = get_max(master, 'Q6')
print('master experience coding')
print(category, perc)

In [None]:
# bachelor experience coding
category, perc = get_max(bachelor, 'Q6')
print('Bachelor experience coding')
print(category, perc)

In [None]:
def get_max_dict(tmp_dict, n):
    return list(tmp_dict.keys())[0], list(tmp_dict.values())[0]/n

In [None]:
# make the df for this plot (Masters)
q7 = get_questions('Q7')
Q7 = data.loc[lambda x: x['Q4'] == 'Master’s degree', q7]
tmp_dict = make_dict_freq(Q7)

# master most used language
category, perc = get_max_dict(tmp_dict, n = len(Q7))
print('master most used language')
print(category, perc)

In [None]:
# make the df for this plot (Bachelors)
q7 = get_questions('Q7')
Q7 = data.loc[lambda x: x['Q4'] == 'Bachelor’s degree', q7]
tmp_dict = make_dict_freq(Q7)

# bachelor most used language
category, perc = get_max_dict(tmp_dict, n = len(Q7))
print('bachelor most used language')
print(category, perc)

We have some interesting results here!

26% of Data Scientists with a master's degree and 26% of Data Scientists with a bachelor's degree have been coding for 3-5 years.

Python is popular! 79% and 73% of Data Scientists with a master's and bachelor's degree, respectively, selected Python in Q7, which asks about the languages used on a regular basis.

Would these Data Scientists recommend others to learn Python first? Let's find out!

In [None]:
g = sns.catplot(data = data, y = 'Q8', kind = 'count', palette = 'rocket', order = data['Q8'].value_counts().index, aspect = 2, row = 'Q4')
plt.subplots_adjust(top=0.9)
g.fig.suptitle('US Data Scientists\' Recommended First Language') 
g.set_xlabels('Count (log scale)')
g.set(xscale = 'log')
g.set_ylabels('Recommended First Language')
g.set_titles(row_template = '{row_name}')
plt.show()

In [None]:
# master recommended coding language
category, perc = get_max(master, 'Q8')
print('master recommended coding language')
print(category, perc)

In [None]:
# bachelor recommended coding language
category, perc = get_max(bachelor, 'Q8')
print('bachelor recommended coding language')
print(category, perc)

About 70% of master's and 67% of bachelor's recommended learning Python as a first programming language. The next most recommended languages are R and SQL.

### Machine Learning Experience
What else do these Data Scientists have in common? Let's take a look at their Machine Learning experience. 

In [None]:
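order = ['No ML Experience', 'Under 1 year','1-2 years','2-3 years','3-4 years','4-5 years','5-10 years','10-20 years','20 or more years']
# assumption: the survey stores "no experience" as 'I do not use machine learning methods'
# (worth checking with data['Q15'].unique()); relabel it so it lines up with `order`
plot_df = data.replace({'Q15': {'I do not use machine learning methods': 'No ML Experience'}})
g = sns.catplot(data = plot_df, y = 'Q15', kind = 'count', palette = 'rocket', order = order, aspect = 2, row = 'Q4')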
plt.subplots_adjust(top=0.9)
g.fig.suptitle('US Data Scientists\' Machine Learning Experience') 
g.set_xlabels('Count')
g.set_ylabels('ML Experience')
g.set_titles(row_template = '{row_name}')
plt.show()

In [None]:
# master ml experience
category, perc = get_max(master, 'Q15')
print('master ML experience')
print(category, perc)

In [None]:
# bachelor ml experience
category, perc = get_max(bachelor, 'Q15')
print('bachelor ML experience')
print(category, perc)

To me, Machine Learning is a big part of Data Science and it is interesting to see that about 20% and 30% of master's and bachelor's degree holders, respectively, have less than 1 year of Machine Learning experience.

There are many Machine Learning algorithms out there. Which algorithms are used most? Which ML algorithms should an aspiring Data Scientist focus on? Let's take a look.

In [None]:
fig, axs = plt.subplots(2,1, figsize = (10,10))

# make the df for this plot (Masters)
q17 = get_questions('Q17')
Q17 = data.loc[lambda x: x['Q4'] == 'Master’s degree', q17]
tmp_dict = make_dict_freq(Q17)
keys = list(tmp_dict.keys())
values = list(tmp_dict.values())

# plot
sns.barplot(x = values, y = keys, palette = 'rocket', ax = axs[0])
axs[0].set_title('Master\'s degree')

# make the df for this plot (Bachelors)
q17 = get_questions('Q17')
Q17 = data.loc[lambda x: x['Q4'] == 'Bachelor’s degree', q17]
tmp_dict = make_dict_freq(Q17)
keys = list(tmp_dict.keys())
values = list(tmp_dict.values())

# plot
sns.barplot(x = values, y = keys, palette = 'rocket', ax = axs[1])
axs[1].set_title('Bachelor\'s degree')

# add x and y labels and remove the extras
for ax in axs.flat:
    ax.set(xlabel='Count', ylabel='Most Used ML Algorithms')

for ax in axs.flat:
    ax.label_outer()
fig.show()

In [None]:
# make the df for this plot (Masters)
q17 = get_questions('Q17')
Q17 = data.loc[lambda x: x['Q4'] == 'Master’s degree', q17]
tmp_dict = make_dict_freq(Q17)

# master most used ML algorithm
category, perc = get_max_dict(tmp_dict, n = len(Q17))
print('master most used ML algorithm')
print(category, perc)

print('\n')
# make the df for this plot (bachelor)
q17 = get_questions('Q17')
Q17 = data.loc[lambda x: x['Q4'] == 'Bachelor’s degree', q17]
tmp_dict = make_dict_freq(Q17)

# bachelor most used ML algorithm
category, perc = get_max_dict(tmp_dict, n = len(Q17))
print('bachelor most used ML algorithm')
print(category, perc)

For master's degree Data Scientists, Linear/Logistic Regression is the most used algorithm, selected by about 63%. The next two most used algorithms are Decision Trees/Random Forests and Gradient Boosting Machines.

For bachelor's degree Data Scientists, Linear/Logistic Regression is also the most used algorithm, selected by about 47%. The next two most used algorithms are Decision Trees/Random Forests and Bayesian Approaches.

What frameworks do these data scientists use to apply these algorithms? Let's take a look.

In [None]:
fig, axs = plt.subplots(2,1, figsize = (10,10))

# make the df for this plot (Masters)
q16 = get_questions('Q16')
Q16 = data.loc[lambda x: x['Q4'] == 'Master’s degree', q16]
tmp_dict = make_dict_freq(Q16)
keys = list(tmp_dict.keys())
values = list(tmp_dict.values())

# plot
sns.barplot(x = values, y = keys, palette = 'rocket', ax = axs[0])
axs[0].set_title('Master\'s degree')

# make the df for this plot (Bachelors)
q16 = get_questions('Q16')
Q16 = data.loc[lambda x: x['Q4'] == 'Bachelor’s degree', q16]
tmp_dict = make_dict_freq(Q16)
keys = list(tmp_dict.keys())
values = list(tmp_dict.values())

# plot
sns.barplot(x = values, y = keys, palette = 'rocket', ax = axs[1])
axs[1].set_title('Bachelor\'s degree')

# add x and y labels and remove the extras
for ax in axs.flat:
    ax.set(xlabel='Count', ylabel='Most Used ML Frameworks')

for ax in axs.flat:
    ax.label_outer()
fig.show()

Both master's and bachelor's degree Data Scientists have the same top three most used Machine Learning Frameworks: Scikit-Learn, TensorFlow, and Keras!
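
The percentages quoted in this section come from multiple-selection questions, where every answer option has its own column that holds the option text if it was selected and is empty otherwise. Here is a minimal sketch of turning such a block of columns into shares in one go. The `option_shares` helper and the toy frame are hypothetical and only mirror the layout described above.

```python
import numpy as np
import pandas as pd

# Toy multi-select block: one column per option, NaN where the option was not selected
toy = pd.DataFrame({
    'Q17_Part_1': ['Linear or Logistic Regression', 'Linear or Logistic Regression',
                   np.nan, 'Linear or Logistic Regression'],
    'Q17_Part_2': ['Decision Trees or Random Forests', np.nan,
                   np.nan, 'Decision Trees or Random Forests'],
    'Q17_Part_3': [np.nan, np.nan, np.nan, np.nan],  # an option nobody picked
})

def option_shares(df):
    """Share of rows in df that selected each option, sorted from most to least common."""
    shares = {}
    for column in df.columns:
        selected = df[column].dropna()
        if selected.empty:
            continue  # skip options that nobody in this subset selected
        shares[selected.iloc[0]] = len(selected) / len(df)
    return pd.Series(shares).sort_values(ascending=False)

shares = option_shares(toy)
print(shares)
```

Note that the denominator here is every row in the subset, including respondents who skipped the question entirely, which matches how the percentages in this notebook are computed.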

### Data Visualization Tools

Programming and Machine Learning are huge components of Data Science, but we can't forget about Data Visualization! Let's take a look at the most used Data Visualization tools.

In [None]:
fig, axs = plt.subplots(2,1, figsize = (10,10))

# make the df for this plot (Masters)
q14 = get_questions('Q14')
Q14 = data.loc[lambda x: x['Q4'] == 'Master’s degree', q14]
tmp_dict = make_dict_freq(Q14)
keys = list(tmp_dict.keys())
values = list(tmp_dict.values())

# plot
sns.barplot(x = values, y = keys, palette = 'rocket', ax = axs[0])
axs[0].set_title('Master\'s degree')

# make the df for this plot (Bachelors)
q14 = get_questions('Q14')
Q14 = data.loc[lambda x: x['Q4'] == 'Bachelor’s degree', q14]
tmp_dict = make_dict_freq(Q14)
keys = list(tmp_dict.keys())
values = list(tmp_dict.values())

# plot
sns.barplot(x = values, y = keys, palette = 'rocket', ax = axs[1])
axs[1].set_title('Bachelor\'s degree')

# add x and y labels and remove the extras
for ax in axs.flat:
    ax.set(xlabel='Count', ylabel='Most Used Data Visualization Tools')

for ax in axs.flat:
    ax.label_outer()
fig.show()

Both master's and bachelor's Data Scientists have the same top three Data Visualization tool preferences: Matplotlib, seaborn, ggplot/ggplot2. The first two tools are Python Data Visualization tools while the final, ggplot/ggplot2, belongs to R.

### Sharing Projects/Work

I believe all Data Scientists could benefit from sharing their work! It is important to let others see your discoveries, but also to let others critique your work so that you are continually improving.

Let's take a look at how these Data Scientists share their work.

In [None]:
fig, axs = plt.subplots(2,1, figsize = (10,10))

# make the df for this plot (Masters)
q36 = get_questions('Q36')
Q36 = data.loc[lambda x: x['Q4'] == 'Master’s degree', q36]
tmp_dict = make_dict_freq(Q36)
keys = list(tmp_dict.keys())
values = list(tmp_dict.values())

# plot
sns.barplot(x = values, y = keys, palette = 'rocket', ax = axs[0])
axs[0].set_title('Master\'s degree')

# make the df for this plot (Bachelors)
q36 = get_questions('Q36')
Q36 = data.loc[lambda x: x['Q4'] == 'Bachelor’s degree', q36]
tmp_dict = make_dict_freq(Q36)
keys = list(tmp_dict.keys())
values = list(tmp_dict.values())

# plot
sns.barplot(x = values, y = keys, palette = 'rocket', ax = axs[1])
axs[1].set_title('Bachelor\'s degree')

# add x and y labels and remove the extras
for ax in axs.flat:
    ax.set(xlabel='Count', ylabel= 'Sharing Projects Sites')

for ax in axs.flat:
    ax.label_outer()
fig.show()

In [None]:
# make the df for this plot (Masters)
q36 = get_questions('Q36')
Q36 = data.loc[lambda x: x['Q4'] == 'Master’s degree', q36]
tmp_dict = make_dict_freq(Q36)

# master sharing
category, perc = get_max_dict(tmp_dict, n = len(Q36))
print('master sharing work')
print(tmp_dict['I do not share my work publicly']/len(Q36))
print(category, perc)

# make the df for this plot (bachelor)
q36 = get_questions('Q36')
Q36 = data.loc[lambda x: x['Q4'] == 'Bachelor’s degree', q36]
tmp_dict = make_dict_freq(Q36)

# bachelor sharing
category, perc = get_max_dict(tmp_dict, n = len(Q36))
print('bachelor sharing work')
print(tmp_dict['I do not share my work publicly']/len(Q36))
print(category, perc)

About 22.4% of master's degree Data Scientists share their work on GitHub. Interestingly, about 22.1% and 19% of master's and bachelor's degree Data Scientists, respectively, do *not* share their work! I am shocked by this, but the question may refer to their professional work rather than their personal work, and sharing a company's work online could violate confidentiality. I'd like to think that these Data Scientists *do* share their personal work online.
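
One caveat about these percentages: they divide by everyone in the group, including people who skipped the question altogether. A minimal sketch of how much the denominator can matter, on made-up data (`toy`, `share_all`, and `share_answered` are illustrative names):

```python
import numpy as np
import pandas as pd

NO_SHARE = 'I do not share my work publicly'

# Toy multi-select block for five respondents; the last two skipped or left everything blank
toy = pd.DataFrame({
    'Q36_Part_1': ['GitHub', np.nan, np.nan, 'GitHub', np.nan],
    'Q36_Part_2': [np.nan, NO_SHARE, np.nan, np.nan, np.nan],
})

# True for respondents who selected at least one option
answered = toy.notna().any(axis=1)

# share of "do not share" among all rows vs. among those who answered the question
share_all = toy['Q36_Part_2'].notna().mean()
share_answered = toy.loc[answered, 'Q36_Part_2'].notna().mean()

print('{:.1%} of all rows, {:.1%} of those who answered'.format(share_all, share_answered))
```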

### Learning More
As Data Scientists, we are constantly learning. This is the part that makes the field fun and challenging. How do these Data Scientists further their understanding and sharpen their skills? What resources should we be focusing on? Let's find out!

In [None]:
# rename some answers
data.loc[lambda df: df['Q37_Part_9'] == 'Cloud-certification programs (direct from AWS, Azure, GCP, or similar)'
         , 'Q37_Part_9'] = 'Cloud-Certification Programs'
data.loc[lambda df: df['Q37_Part_10'] == 'University Courses (resulting in a university degree)', 'Q37_Part_10'] = 'University Courses' 

fig, axs = plt.subplots(2,1, figsize = (10,10))

# make the df for this plot (Masters)
q37 = get_questions('Q37')
Q37 = data.loc[lambda x: x['Q4'] == 'Master’s degree', q37]
tmp_dict = make_dict_freq(Q37)
keys = list(tmp_dict.keys())
values = list(tmp_dict.values())

# plot
sns.barplot(x = values, y = keys, palette = 'rocket', ax = axs[0])
axs[0].set_title('Master\'s degree')

# make the df for this plot (Bachelors)
q37 = get_questions('Q37')
Q37 = data.loc[lambda x: x['Q4'] == 'Bachelor’s degree', q37]
tmp_dict = make_dict_freq(Q37)
keys = list(tmp_dict.keys())
values = list(tmp_dict.values())

# plot
sns.barplot(x = values, y = keys, palette = 'rocket', ax = axs[1])
axs[1].set_title('Bachelor\'s degree')

# add x and y labels and remove the extras
for ax in axs.flat:
    ax.set(xlabel='Count', ylabel= 'Most Used Educational Resources')

for ax in axs.flat:
    ax.label_outer()
fig.show()

Coursera takes the lead for both master's and bachelor's Data Scientists. Even though Coursera is the most used among these Data Scientists, all resources are helpful!

# Insights Gained

Let's summarize what we've learned from these Data Scientists and how we can apply these discoveries to our own studies. Remember, we filtered the data to responses from the USA where the highest degree earned is a master's or bachelor's.

**Programming:**
* As Data Scientists, we need to program! The most common answer for programming experience was 3-5 years. Start small, make coding fun, and learn from others.

* Python, SQL, or R? Python is a crowd favorite, but the order that you learn these languages in does not matter too much. Flip a coin or learn all at once! If you want to learn Python (the most used programming language among these Data Scientists), I recommend this book: https://www.oreilly.com/library/view/python-for-data/9781119547624/c00.xhtml . You can also try the daily challenges from https://leetcode.com or https://www.hackerrank.com .

**Machine Learning:**
* I was amazed to see how many of our Data Scientists had less than 1 year of Machine Learning experience. What does that mean for a new Data Scientist? It means that if you start your Machine Learning journey today, you'd be in good company! Reinforce your Machine Learning knowledge with projects that use Linear/Logistic Regression, Decision Trees/Random Forests, Gradient Boosting Machines, and Bayesian Approaches.

* Scikit-Learn, TensorFlow, and Keras are popular Machine Learning Frameworks. Lucky for us, there is a book for learning all three of these frameworks! Check it out here: https://www.oreilly.com/library/view/hands-on-machine-learning/9781492032632/ .

**Data Visualization:**
* Visualizing our data is fun and informative. Get comfortable using matplotlib and seaborn for Python, or ggplot/ggplot2 for R! Confused or stuck? Read the documentation! Also, there is a good chance your issue has been solved by someone already, and you may find the answer to your question at https://stackoverflow.com .

**Share Your Work!:**
* Whether it is on Twitter, LinkedIn, Facebook, GitHub, Kaggle, or anywhere else, get your work out there and have people see it. This is an opportunity for valuable feedback so that you can keep improving.

**Continue Learning:**
* There are so many fantastic courses that can help you get started on your Data Science journey! Kaggle has incredible (and free) mini-courses: https://www.kaggle.com/learn/overview . A lot of US Data Scientists like Coursera. Personally, I have used edX and completed a Professional Data Science Certificate to get a better understanding of Data Science (check it out here: https://www.edx.org/professional-certificate/harvardx-data-science). 


# Conclusion

Thank you Kaggle for hosting this competition! I hope some of you gained some valuable insights from this notebook.

If you'd like to offer feedback, please leave a comment below. If you liked this notebook, please upvote! If you'd like to connect with me, my LinkedIn is here: https://www.linkedin.com/in/knolasco95/ . If you'd like to see my other projects, my GitHub is here: https://github.com/knolasco .