# **Does Data Science have a Future?**

![beg_and_end](https://y.yarn.co/96b749d9-3268-4274-96ae-62c5ff3470b8_text.gif)

***"Everything that has a beginning, has an end"*** - Every day, people come up with new ideas and innovations that shape our future. Every discovery and invention makes our lives easier and better in some way. With smart devices such as Alexa, you don't even need to get up to switch off the lights or turn on the music. As new inventions are born, the older ones slowly fade out of existence.

The discovery of new scientific methods and algorithms, together with the invention of new and more powerful hardware, has led to the rise of Data Science. At present, Data Science (DS) is one of the most widely used terms. So, *what is Data Science and why do we need such a field of study?*

Simply put, **Data Science** is a multi-disciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from structured and unstructured data. As the world entered the era of big data, the need for storage also grew, so the main focus was on building frameworks and solutions to store data. Once that was done, data started to pour in at a tremendous rate, and its sheer volume made it impossible to structure all of it. As a result, most stored data today is either unstructured or semi-structured, as shown in the following figure.

![data](https://miro.medium.com/max/893/1*JeIC6PreHjgh06w3WqkXMA.jpeg)

The large volume of unstructured data required more complex and advanced analytical tools and algorithms for processing it, analyzing it and drawing meaningful insights from it. **Data Science was the answer to this problem!** The idea of Data Science came into existence before the 2000s, but it is only recently, with the discovery of new algorithms and analytical tools, that Data Science has become so popular.

Marked as the **highest paying job** of 2016 by **Glassdoor**, the field of Data Science has witnessed immense growth in recent years. Employers are searching for data scientists more than ever. A report by **Indeed** indicated a 29% increase in demand for data scientists in a year.

Is the situation the same today, or has the popularity of Data Science gone down as the years have passed? And what trends are seen in this field now?

![DS_Future](https://vivente.com.au/wp-content/uploads/2017/11/future.jpeg)

In [None]:
import numpy as np 
import pandas as pd
import os
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
import seaborn as sns
import math
import warnings
from matplotlib.lines import Line2D
from bokeh.layouts import row, column
from bokeh.transform import cumsum, transform
from bokeh.transform import dodge, factor_cmap
from bokeh.plotting import figure, show, gridplot
from bokeh.io import output_notebook
from bokeh.core.properties import value
from bokeh.palettes import d3, brewer, plasma, Plasma256
from bokeh.models import LabelSet, ColumnDataSource, LinearColorMapper, ColorBar, BasicTicker, FactorRange

warnings.filterwarnings('ignore')

In [None]:
output_notebook()

In [None]:
multiplechoice_beg = pd.read_csv('../input/kaggle-survey-2017/multipleChoiceResponses.csv', low_memory=False, encoding='ISO-8859-1')
multiplechoice_old = pd.read_csv('../input/kaggle-survey-2018/multipleChoiceResponses.csv', low_memory= False)
multiplechoice_new = pd.read_csv('../input/kaggle-survey-2019/multiple_choice_responses.csv', low_memory= False)

multiplechoice_old.columns = [x.split('_')[0] for x in list(multiplechoice_old.columns)]
multiplechoice_new.columns = [x.split('_')[0] for x in list(multiplechoice_new.columns)]

multiplechoice_old.columns = multiplechoice_old.columns + '_' + multiplechoice_old.iloc[0]
multiplechoice_new.columns = multiplechoice_new.columns + '_' + multiplechoice_new.iloc[0]
multiplechoice_old = multiplechoice_old.drop([0])
multiplechoice_new = multiplechoice_new.drop([0])

multiplechoice_old['Time from Start to Finish (seconds)_Duration (in seconds)'] = multiplechoice_old['Time from Start to Finish (seconds)_Duration (in seconds)'].astype('float')
multiplechoice_new['Time from Start to Finish (seconds)_Duration (in seconds)'] = multiplechoice_new['Time from Start to Finish (seconds)_Duration (in seconds)'].astype('float')

multiplechoice_old['Time from Start to Finish (seconds)_Duration (in seconds)'] = multiplechoice_old['Time from Start to Finish (seconds)_Duration (in seconds)'].apply(lambda x:x/3600)
multiplechoice_new['Time from Start to Finish (seconds)_Duration (in seconds)'] = multiplechoice_new['Time from Start to Finish (seconds)_Duration (in seconds)'].apply(lambda x:x/3600)

time_old = str(round(multiplechoice_old['Time from Start to Finish (seconds)_Duration (in seconds)'].median()*60, 1)) + ' min'
time_new = str(round(multiplechoice_new['Time from Start to Finish (seconds)_Duration (in seconds)'].median()*60, 2)) + ' min'

TOOLS="pan,wheel_zoom,zoom_in,zoom_out,undo,redo,reset,tap,save"

Since 2017, **Kaggle**, one of the biggest online communities of data scientists and machine learners, has been conducting an industry-wide survey that presents a truly comprehensive view of the state of data science and machine learning. While not all data scientists take part in Kaggle competitions or have a Kaggle account, and not all Kagglers work in data science, it is reasonable to assume a large overlap. The survey is conducted yearly, usually over a period of about 3 weeks, and the survey data is later made publicly available. This study uses the survey data from 2017 to 2019 to analyze where Data Science is headed.

> On two occasions I have been asked, "Pray, Mr. Babbage, if you put into the machine wrong figures, will the right answers come out?". I am not able rightly to apprehend the kind of confusion of ideas that could provoke such a question.       
- **Charles Babbage, Passages from the Life of a Philosopher**

**Charles Babbage's** remark captures the idea of **GIGO (garbage in, garbage out)**, an important concept in computer science and mathematics which implies that *the quality of the output is determined by the quality of the input*. The same applies to this study, so let's begin by taking a look at the basic details of the survey.


### **1. A look at the survey data over the years**

The basic survey details are listed in the table below:

In [None]:
survey_data = pd.DataFrame({'Number of Respondents' : [len(multiplechoice_beg), len(multiplechoice_old), len(multiplechoice_new)],
                            'Number of Questions' :  ['64', '50', '34'],
                            'Median Response Time' : ['16.4 min', time_old, time_new]},
                            index = ['2017', '2018', '2019'])
survey_data

In [None]:
p0 = figure(x_range = survey_data.index.values, y_range = (0,26000), plot_width = 400, plot_height = 500, tools = TOOLS, title = "Number Of Respondents in Survey")
source0 = ColumnDataSource(dict(x=survey_data.index.values, y=survey_data.iloc[:,0].values.reshape(len(survey_data))))
labels0 = LabelSet(x='x', y='y', text='y', level='glyph', x_offset=-22, y_offset=0, source=source0, render_mode='canvas')

p0.vbar(survey_data.index.values, width = 0.9, top = survey_data.iloc[:,0].values.reshape(len(survey_data)), color='mediumseagreen')
p0.xaxis.axis_label = 'Year'
p0.yaxis.axis_label = 'Number of Respondents'
p0.yaxis.axis_label_text_font = 'times'
p0.yaxis.axis_label_text_font_size = '12pt'
p0.xaxis.axis_label_text_font = 'times'
p0.xaxis.axis_label_text_font_size = '12pt'
p0.ygrid.grid_line_color = None
p0.xgrid.grid_line_color = None
p0.add_layout(labels0)

years = ['2017', '2018', '2019']
col_name = ['No of Questions', 'Median Response Time']

plot_data = [(year, analysis_type) for year in years for analysis_type in col_name]
survey_data['Median Response Time'] = survey_data['Median Response Time'].apply(lambda x:x.replace(' min', ''))
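# The table stores these values as strings; convert them to floats so the bar heights are numeric
counts = [float(v) for v in list(survey_data.iloc[0,1:].values) + list(survey_data.iloc[1,1:].values) + list(survey_data.iloc[2,1:].values)]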

source = ColumnDataSource(data=dict(x=plot_data, counts=counts))
source1 = ColumnDataSource(dict(x=survey_data.index.values, y=survey_data.iloc[:,1].values.reshape(len(survey_data))))
source2 = ColumnDataSource(dict(x=survey_data.index.values, y=survey_data.iloc[:,2].values.reshape(len(survey_data))))

labels1 = LabelSet(x='x', y='y', text='y', level='glyph', x_offset=-30, y_offset=0, source=source1, render_mode='canvas')
labels2 = LabelSet(x='x', y='y', text='y', level='glyph', x_offset=5, y_offset=0, source=source2, render_mode='canvas')

p1 = figure(x_range = FactorRange(*plot_data), y_range = (0,70), plot_width = 400, plot_height = 500, tools = TOOLS, title = "Number Of Questions v/s Median Response Time in Survey")
p1.vbar(x='x', top='counts', width=0.9, source=source, fill_color=factor_cmap('x', palette=['mediumslateblue', 'burlywood'], factors=col_name, start=1, end=2))
p1.xaxis.axis_label = 'Year'
p1.yaxis.axis_label_text_font = 'times'
p1.yaxis.axis_label_text_font_size = '12pt'
p1.xaxis.axis_label_text_font = 'times'
p1.xaxis.axis_label_text_font_size = '12pt'
p1.ygrid.grid_line_color = None
p1.xgrid.grid_line_color = None
p1.xaxis.major_label_orientation = math.pi/2
p1.add_layout(labels1)
p1.add_layout(labels2)

show(row(p0,p1))

In [None]:
sns.set_style('darkgrid')
plt.figure(figsize= (16,16))

### Histogram plot
plt.subplot(421)
plt.hist(multiplechoice_old['Time from Start to Finish (seconds)_Duration (in seconds)'], bins = 50, color= 'indianred')
plt.yscale('log')
# plt.xlabel('Duration (in hrs)', fontsize = 'large')
plt.ylabel('Number of Respondents', fontsize = 'large')
plt.title('2018', fontsize = 'x-large', fontweight = 'roman')

### Density plot
plt.subplot(423)
ax = sns.kdeplot(multiplechoice_old['Time from Start to Finish (seconds)_Duration (in seconds)'], color= 'indianred')
ax.legend_.remove()
plt.xlabel('Duration (in hrs)', fontsize = 'large')
plt.ylabel('Density', fontsize = 'large')
# plt.title('2018', fontsize = 'x-large', fontweight = 'roman')

### Histogram plot
plt.subplot(422)
plt.hist(multiplechoice_new['Time from Start to Finish (seconds)_Duration (in seconds)'], bins = 50, color= 'darkslateblue')
plt.yscale('log')
# plt.xlabel('Duration (in hrs)', fontsize = 'large')
plt.ylabel('Number of Respondents', fontsize = 'large')
plt.title('2019', fontsize = 'x-large', fontweight = 'roman')

### Density plot
plt.subplot(424)
ax = sns.kdeplot(multiplechoice_new['Time from Start to Finish (seconds)_Duration (in seconds)'], color= 'darkslateblue')
ax.legend_.remove()
plt.xlabel('Duration (in hrs)', fontsize = 'large')
plt.ylabel('Density', fontsize = 'large')
# plt.title('2019', fontsize = 'x-large', fontweight = 'roman')

**Note** - *The y-axes of the histograms are on a log scale.*
  
  
Some insights from the analysis of the survey data are:

* The number of respondents in the ML & DS survey **increased by about 43%** from 16716 in 2017 to 23859 in 2018. This is an indicator of increasing interest in Data Science, but the number then went down to 19717 in 2019.


* In both 2018 and 2019, the survey was conducted in October *(Oct 22-29 in 2018 and Oct 8-28 in 2019)*. Even though the survey stayed live for longer in 2019, *participation dropped by over 17%*. This is an interesting observation and might indicate declining popularity of Data Science, but more analysis is needed to confirm this.


* The median response time was *16.4 min* in the 2017 survey, *17 min* in 2018 and *9 min* in 2019. Looking at the number of questions, the response time broadly tracks the length of the survey: the 2019 survey had the fewest questions (34) and by far the shortest median time, although the 2017 survey (64 questions) took slightly less time than the 2018 survey (50 questions).


* A larger median survey time suggests that most respondents spent time reading and understanding the survey questions before answering. **Thus we can consider the survey data to be genuine**.


* The histograms and KDE plots show the distribution of the response time for the survey. As expected, the number of respondents decreases roughly exponentially as we move right along the time axis, but unexpectedly there is a small spike around 50 hrs in the 2018 survey and around 150 hrs in the 2019 survey. The cause is unknown.
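
The year-over-year changes quoted above follow directly from the respondent counts. A quick check, using only the three counts stated in the text (the helper name `pct_change` is ours, for illustration only):

```python
def pct_change(old, new):
    """Percentage change from old to new, rounded to one decimal place."""
    return round(100 * (new - old) / old, 1)

# Respondent counts as quoted in the text.
respondents = {'2017': 16716, '2018': 23859, '2019': 19717}

growth_2018 = pct_change(respondents['2017'], respondents['2018'])
growth_2019 = pct_change(respondents['2018'], respondents['2019'])

print(f"2017 -> 2018: {growth_2018:+.1f}%")
print(f"2018 -> 2019: {growth_2019:+.1f}%")
```

The first value rounds to the "about 43%" increase and the second to the "over 17%" drop mentioned in the list.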

**Note** - *As the analysis of the survey data shows, the number of respondents is different in each of the three years. So, for every analysis that compares survey data from different years, normalized values (percentages of respondents) are used, which makes the comparison more meaningful.*
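
As a minimal sketch of what this normalization means, the snippet below converts raw answer counts into percentage shares so that samples of different sizes become comparable. It uses only the standard library and made-up toy answers (`'A'`, `'B'`); the helper `percent_shares` is hypothetical and not part of the notebook, which does the equivalent with pandas.

```python
from collections import Counter

def percent_shares(responses, ndigits=0):
    """Share of each answer among all non-missing responses, in percent."""
    counts = Counter(r for r in responses if r is not None)
    total = sum(counts.values())
    # Scale to percent first and round last, so the result is a clean number.
    return {answer: round(100 * n / total, ndigits) for answer, n in counts.items()}

# Toy samples of different sizes with the same underlying split (illustrative only).
small_sample = ['A'] * 8 + ['B'] * 2
large_sample = ['A'] * 40 + ['B'] * 10 + [None] * 5   # None marks a skipped question

shares_small = percent_shares(small_sample)
shares_large = percent_shares(large_sample)
print(shares_small, shares_large)
```

The raw counts differ by a factor of five, yet both samples yield the same shares, which is what makes a year-to-year comparison fair.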

### **2. The effect of Gender on Data Science**

**There should be no shortage of inspirational role models for young girls dreaming of a career in science**. Women have been responsible for some of the most important scientific breakthroughs that shaped the modern world, from Marie Curie’s discoveries about radiation, to Grace Hopper’s groundbreaking work on computer programming, and Barbara McClintock’s pioneering approach to genetics.

But too often their stories aren’t just about the difficulties they faced in cracking some of the toughest problems in science, but also about overcoming social and professional obstacles just because of their gender. And many of those obstacles still face women working and studying in science today. 

Is Data Science a male-dominated domain, or do women have a good share in its progress?

![gender](https://www.ft.com/paidpost/CBS/gender_differences/img/gender.jpg)

In [None]:
gender_beg = multiplechoice_beg['GenderSelect'].value_counts().to_frame()
gender_old = multiplechoice_old['Q1_What is your gender? - Selected Choice'].value_counts().to_frame()
gender_new = multiplechoice_new['Q2_What is your gender? - Selected Choice'].value_counts().to_frame()
gender_beg.index = gender_old.index.values

gender_beg = round(gender_beg/gender_beg.sum(), 2)*100
gender_old = round(gender_old/gender_old.sum(), 2)*100
gender_new = round(gender_new/gender_new.sum(), 2)*100

In [None]:
p0 = figure(x_range = gender_beg.index.values, y_range = (0,90), plot_width = 265, plot_height = 400, tools = TOOLS, )
source0 = ColumnDataSource(dict(x=gender_beg.index.values, y=gender_beg.values.reshape(len(gender_beg))))
labels0 = LabelSet(x='x', y='y', text='y', level='glyph', x_offset=-10, y_offset=0, source=source0, render_mode='canvas')

p0.vbar(gender_beg.index.values, width = 0.5, top = gender_beg.values.reshape(len(gender_beg)), color=d3['Category20b'][len(gender_beg)])
p0.xaxis.axis_label = 'Gender'
p0.yaxis.axis_label = 'Percentage of Respondents -2017'
p0.yaxis.axis_label_text_font = 'times'
p0.yaxis.axis_label_text_font_size = '12pt'
p0.xaxis.axis_label_text_font = 'times'
p0.xaxis.axis_label_text_font_size = '12pt'
p0.ygrid.grid_line_color = None
p0.xgrid.grid_line_color = None
p0.xaxis.major_label_orientation = math.pi/4
p0.add_layout(labels0)

p1 = figure(x_range = gender_old.index.values, y_range = (0,90), plot_width = 265, plot_height = 400, tools = TOOLS, )
source1 = ColumnDataSource(dict(x=gender_old.index.values, y=gender_old.values.reshape(len(gender_old))))
labels1 = LabelSet(x='x', y='y', text='y', level='glyph', x_offset=-10, y_offset=0, source=source1, render_mode='canvas')

p1.vbar(gender_old.index.values, width = 0.5, top = gender_old.values.reshape(len(gender_old)), color=d3['Category20b'][len(gender_old)])
p1.xaxis.axis_label = 'Gender'
p1.yaxis.axis_label = 'Percentage of Respondents -2018'
p1.yaxis.axis_label_text_font = 'times'
p1.yaxis.axis_label_text_font_size = '12pt'
p1.xaxis.axis_label_text_font = 'times'
p1.xaxis.axis_label_text_font_size = '12pt'
p1.ygrid.grid_line_color = None
p1.xgrid.grid_line_color = None
p1.xaxis.major_label_orientation = math.pi/4
p1.add_layout(labels1)

p2 = figure(x_range = gender_new.index.values, y_range = (0,90), plot_width = 265, plot_height = 400, tools = TOOLS, )
source2 = ColumnDataSource(dict(x=gender_new.index.values, y=gender_new.values.reshape(len(gender_new))))
labels2 = LabelSet(x='x', y='y', text='y', level='glyph', x_offset=-10, y_offset=0, source=source2, render_mode='canvas')

p2.vbar(gender_new.index.values, width = 0.5, top = gender_new.values.reshape(len(gender_new)), color=d3['Category20b'][len(gender_new)])
p2.xaxis.axis_label = 'Gender'
p2.yaxis.axis_label = 'Percentage of Respondents -2019'
p2.yaxis.axis_label_text_font = 'times'
p2.yaxis.axis_label_text_font_size = '12pt'
p2.xaxis.axis_label_text_font = 'times'
p2.xaxis.axis_label_text_font_size = '12pt'
p2.ygrid.grid_line_color = None
p2.xgrid.grid_line_color = None
p2.xaxis.major_label_orientation = math.pi/4
p2.add_layout(labels2)

show(row(p0,p1,p2))

* The analysis of the survey data reveals that Data Science has been, and still is, a male-dominated domain.


* The percentages of Male and Female respondents stayed almost the same from 2017 to 2019.


* There are **almost five Male Kagglers for every Female Kaggler**, indicating Male dominance in the field. There hasn't been a surge of women Kagglers so far, at least over the past 3 years. Women have always played an important role in major scientific breakthroughs, and thus more women should step forward to be a part of Data Science.

### **3. What is the impact of Data Science across the globe?**

The 2019 survey had **respondents from 171 countries and territories**. Kaggle, and with it Data Science, is spreading across the globe, encouraging people to be a part of it. Does Data Science have a big impact in all of those countries, or do only a few countries have expertise in the domain? This section looks at the impact of DS across the globe.

To keep this analysis simple, only the top 10 countries by number of survey respondents are selected.
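
A minimal sketch of how such a top-N selection could be derived rather than hard-coded. The alias table reuses the country spellings that the notebook itself harmonizes; the helper `top_countries` and the toy response list are hypothetical and for illustration only.

```python
from collections import Counter

# Spellings that differ between survey years, mapped to one short name.
ALIASES = {
    "People 's Republic of China": 'China',
    'United Kingdom': 'UK',
    'United Kingdom of Great Britain and Northern Ireland': 'UK',
    'United States': 'USA',
    'United States of America': 'USA',
}

def top_countries(responses, n=10):
    """Return the n most frequent countries after harmonizing their names."""
    cleaned = (ALIASES.get(c, c) for c in responses if c is not None)
    return [name for name, _ in Counter(cleaned).most_common(n)]

# Toy responses (made up): two spellings of the USA are counted together.
toy = ['United States'] * 3 + ['United States of America'] * 2 + ['India'] * 4 + ['Japan']
top_two = top_countries(toy, n=2)
print(top_two)
```

Harmonizing names before counting matters here: without it, the two US spellings would be counted separately and each would rank below India.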

![country](https://knowledge.wharton.upenn.edu/wp-content/uploads/2019/01/country-flags-rankings.jpg)

In [None]:
country = ['France', 'Canada','UK', 'Germany', 'Brazil', 'Russia', 'China', 'India', 'USA', 'Japan']

multiplechoice_beg['Country'] = multiplechoice_beg['Country'].replace("People 's Republic of China", 'China').replace('United Kingdom', 'UK').replace('United States', 'USA')
multiplechoice_old['Q3_In which country do you currently reside?'] = multiplechoice_old['Q3_In which country do you currently reside?'].replace('United Kingdom of Great Britain and Northern Ireland', 'UK').replace('United States of America', 'USA')
multiplechoice_new['Q3_In which country do you currently reside?'] = multiplechoice_new['Q3_In which country do you currently reside?'].replace('United Kingdom of Great Britain and Northern Ireland', 'UK').replace('United States of America', 'USA')

In [None]:
top_countries_beg = multiplechoice_beg['Country'].value_counts().to_frame().loc[country].sort_values('Country')
top_countries_old = multiplechoice_old['Q3_In which country do you currently reside?'].value_counts().to_frame().loc[country].sort_values('Q3_In which country do you currently reside?')
top_countries_new = multiplechoice_new['Q3_In which country do you currently reside?'].value_counts().to_frame().loc[country].sort_values('Q3_In which country do you currently reside?')

top_countries_beg = round(top_countries_beg/top_countries_beg.sum(), 2)*100
top_countries_old = round(top_countries_old/top_countries_old.sum(), 2)*100
top_countries_new = round(top_countries_new/top_countries_new.sum(), 2)*100

top_countries_old = top_countries_old.reindex(list(top_countries_beg.index))
top_countries_new = top_countries_new.reindex(list(top_countries_beg.index))

In [None]:
country_list = list(top_countries_beg.index)
beg = list(top_countries_beg['Country'])
old = list(top_countries_old['Q3_In which country do you currently reside?'])
new = list(top_countries_new['Q3_In which country do you currently reside?'])

dot = figure(title="Participants by Country", tools=TOOLS, plot_width = 800, plot_height = 400, y_range=country_list, x_range=[0,42])

dot.segment(0, country_list, beg, country_list, line_width=2, line_color="sienna", legend='2017')
dot.circle(beg, country_list, size=15, fill_color="plum", line_color="sienna", line_width=1, legend='2017')
dot.segment(0, country_list, old, country_list, line_width=2, line_color="sienna", legend='2018')
dot.circle(old, country_list, size=15, fill_color="skyblue", line_color="sienna", line_width=1, legend='2018')
dot.segment(0, country_list, new, country_list, line_width=2, line_color="sienna", legend='2019')
dot.circle(new, country_list, size=15, fill_color="yellowgreen", line_color="sienna", line_width=1, legend='2019')

dot.xaxis.axis_label = 'Percentage of Respondents'
dot.yaxis.axis_label = 'Country'
dot.yaxis.axis_label_text_font = 'times'
dot.yaxis.axis_label_text_font_size = '12pt'
dot.xaxis.axis_label_text_font = 'times'
dot.xaxis.axis_label_text_font_size = '12pt'
dot.ygrid.grid_line_color = None
dot.xgrid.grid_line_color = None
dot.legend.location = "bottom_right"
dot.legend.click_policy="hide"
show(dot)

**Note -** *The legends are interactive. Click on a legend entry to show or hide the values associated with it.*


* Data Science is not the same across the globe. **The opportunities you get with DS depend a lot on where you live**.


* The share of survey respondents from the USA dropped considerably, going down every year from 2017 to 2019. The same can be observed for the UK and France, even though the decline is not as steep as in the United States.


* On the other hand, survey responses from India went up a lot. The increase is particularly large between 2018 and 2019, suggesting an increasing demand for DS in the country in the past year. Japan and Brazil follow a similar upward trend.


* DS has stayed at the same level in Canada during the 3 years of the survey.


* The other interesting case is China. China saw a huge surge of people into DS between 2017 and 2018, but surprisingly that number dipped sharply between 2018 and 2019.

### **4. Do people start young?**

The future of DS is of course dependent on the people in it. **Youths represent the future, and it is only through their engagement that the field can have a good future**. The percentage of young people in Data Science is a great indicator of where the field is headed: the higher the percentage, the better the future.

What is the impact of DS on the youth population, and how do non-youths respond to it?

![age](https://cdn.psychologytoday.com/sites/default/files/styles/image-article_inline_full/public/field_blog_entry_images/Longevity%20Cartoon_1.jpg?itok=X89Hn_1J)
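
The 2017 survey recorded age as a number, while the later surveys used age brackets, so the 2017 ages have to be bucketed before the years can be compared. A minimal sketch of such bucketing with inclusive bracket bounds, using only the standard library (the helper `age_bracket` is hypothetical; the bracket labels are the ones used in the chart):

```python
from bisect import bisect_right

# Lower bound of each bracket, paired with the bracket labels used in the chart.
LOWER_BOUNDS = [18, 22, 25, 30, 35, 40, 45, 50, 55, 60, 70]
LABELS = ['18-21', '22-24', '25-29', '30-34', '35-39', '40-44',
          '45-49', '50-54', '55-59', '60-69', '70+']

def age_bracket(age):
    """Map a numeric age to its bracket label; None if missing or under 18."""
    if age is None or age != age or age < 18:   # `age != age` is true only for NaN
        return None
    # bisect_right counts how many lower bounds are <= age.
    return LABELS[bisect_right(LOWER_BOUNDS, age) - 1]

for sample_age in (18, 21, 22, 69, 70, 85):
    print(sample_age, '->', age_bracket(sample_age))
```

Boundary ages such as 18, 21 and 70 are the ones most easily lost with strict `<` and `>` comparisons, so they are worth checking explicitly.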

In [None]:
age_beg_dict = {
'18-21' : len(multiplechoice_beg[(multiplechoice_beg['Age'] > 17) & (multiplechoice_beg['Age'] < 22)]['Age']),  # bounds include ages 18 and 21
'22-24' : len(multiplechoice_beg[(multiplechoice_beg['Age'] > 21) & (multiplechoice_beg['Age'] < 25)]['Age']),
'25-29' : len(multiplechoice_beg[(multiplechoice_beg['Age'] > 24) & (multiplechoice_beg['Age'] < 30)]['Age']),
'30-34' : len(multiplechoice_beg[(multiplechoice_beg['Age'] > 29) & (multiplechoice_beg['Age'] < 35)]['Age']),
'35-39' : len(multiplechoice_beg[(multiplechoice_beg['Age'] > 34) & (multiplechoice_beg['Age'] < 40)]['Age']),
'40-44' : len(multiplechoice_beg[(multiplechoice_beg['Age'] > 39) & (multiplechoice_beg['Age'] < 45)]['Age']),
'45-49' : len(multiplechoice_beg[(multiplechoice_beg['Age'] > 44) & (multiplechoice_beg['Age'] < 50)]['Age']),
'50-54' : len(multiplechoice_beg[(multiplechoice_beg['Age'] > 49) & (multiplechoice_beg['Age'] < 55)]['Age']),
'55-59' : len(multiplechoice_beg[(multiplechoice_beg['Age'] > 54) & (multiplechoice_beg['Age'] < 60)]['Age']),
'60-69' : len(multiplechoice_beg[(multiplechoice_beg['Age'] > 59) & (multiplechoice_beg['Age'] < 70)]['Age']),
'70+' : len(multiplechoice_beg[(multiplechoice_beg['Age'] > 69)])}  # includes age 70

In [None]:
ylab_old = multiplechoice_old['Q2_What is your age (# years)?'].sort_values().unique()
ylab_new = multiplechoice_new['Q1_What is your age (# years)?'].sort_values().unique()

age_df_beg = pd.Series(age_beg_dict)
age_df_old = multiplechoice_old['Q2_What is your age (# years)?'].value_counts().to_frame().loc[ylab_old]
age_df_new = multiplechoice_new['Q1_What is your age (# years)?'].value_counts().to_frame().loc[ylab_new]

age_df_old_last_row = age_df_old.loc['70-79'] + age_df_old.loc['80+']
age_df_old = age_df_old.drop(['70-79','80+'])
# pd.concat replaces DataFrame.append, which was removed in pandas 2.0
age_df_old = pd.concat([age_df_old, pd.DataFrame([age_df_old_last_row], columns=['Q2_What is your age (# years)?'], index=['70+'])])

age_df_beg = round(age_df_beg/age_df_beg.sum(), 2)*100
age_df_old = round(age_df_old/age_df_old.sum(), 2)*100
age_df_new = round(age_df_new/age_df_new.sum(), 2)*100

In [None]:
age_list = list(age_df_beg.index)[::-1]
beg = list(age_df_beg)[::-1]
old = list(age_df_old['Q2_What is your age (# years)?'])[::-1]
new = list(age_df_new['Q1_What is your age (# years)?'])[::-1]

dot = figure(title="Age Group of Respondents", tools=TOOLS, plot_width = 800, plot_height = 400, y_range=age_list, x_range=[0,30])

dot.segment(0, age_list, beg, age_list, line_width=2, line_color="sienna", legend='2017')
dot.circle(beg, age_list, size=15, fill_color="plum", line_color="sienna", line_width=1, legend='2017')
dot.segment(0, age_list, old, age_list, line_width=2, line_color="sienna", legend='2018')
dot.circle(old, age_list, size=15, fill_color="skyblue", line_color="sienna", line_width=1, legend='2018')
dot.segment(0, age_list, new, age_list, line_width=2, line_color="sienna", legend='2019')
dot.circle(new, age_list, size=15, fill_color="yellowgreen", line_color="sienna", line_width=1, legend='2019')

dot.xaxis.axis_label = 'Percentage of Respondents'
dot.yaxis.axis_label = 'Age Group'
dot.yaxis.axis_label_text_font = 'times'
dot.yaxis.axis_label_text_font_size = '12pt'
dot.xaxis.axis_label_text_font = 'times'
dot.xaxis.axis_label_text_font_size = '12pt'
dot.ygrid.grid_line_color = None
dot.xgrid.grid_line_color = None
dot.legend.location = "bottom_right"
dot.legend.click_policy="hide"
show(dot)

* **Young people (under 30) make up over 50% of the respondents**. The 18-21 group is the youngest among them and probably represents the student population. Currently, it makes up around 15% of the respondents. The *youth population in DS has seen a massive surge* from under 5% in 2017 to around 15% in 2018 and 2019. Youth represent the future, and this is indeed a good sign.


* Now, move a bit down along the y-axis and look at the older respondents in the 45+ age group. They represent the **experts** in their respective fields, with 15+ years of experience. The percentage of respondents in this age group went up in the past year. This is great news: if experienced people are switching to DS, they definitely see a future here.


* The percentage of respondents between the ages of 25 and 45 dropped a lot from 2017 to 2018, but in the past year the percentage improved in the age group 35-45, while it remained constant in the age group 30-34. Looking at the trend, it is clear that, in the past year, more people in these age groups have moved into DS.


* The age group 22-29 represents the young working population and makes up more than 40% of the total respondents. It is the only age group in the survey with a massive decline in the percentage of respondents in the past year. Even though there was a good increase in respondents aged 22-24 from 2017 to 2018, the percentage went down a lot in 2019. The percentage of respondents in the 25-29 age group has been in decline since 2017.

### **5. What qualification do you need?**

**Education is a weapon to improve one’s life. It is probably the most important tool to change one’s life**. Education is very important for landing a good job with excellent compensation. Educational qualification is a measure of what you have learned, and the more qualified you are, the more opportunities you will have.

Do you need a higher degree to do Data Science, or can you start early? The survey responses help to answer that.

![education](https://s3.ap-southeast-1.amazonaws.com/images.deccanchronicle.com/dc-Cover-bsnudco08r3igtj44duecnr7m4-20180630063055.Medi.jpeg)

In [None]:
educ_lvl_beg = multiplechoice_beg['FormalEducation'].value_counts().to_frame()
educ_lvl_old = multiplechoice_old['Q4_What is the highest level of formal education that you have attained or plan to attain within the next 2 years?'].value_counts().to_frame()
educ_lvl_new = multiplechoice_new['Q4_What is the highest level of formal education that you have attained or plan to attain within the next 2 years?'].value_counts().to_frame()
educ_lvl_beg.index = ['Master’s degree', 'Bachelor’s degree', 'Doctoral degree','Some college/university study without earning a bachelor’s degree',
                      'Professional degree', 'No formal education past high school', 'I prefer not to answer']

educ_lvl_beg = educ_lvl_beg.drop('I prefer not to answer')
educ_lvl_old = educ_lvl_old.drop('I prefer not to answer')
educ_lvl_new = educ_lvl_new.drop('I prefer not to answer')

educ_lvl_beg = round(educ_lvl_beg/educ_lvl_beg.sum(), 2)*100
educ_lvl_old = round(educ_lvl_old/educ_lvl_old.sum(), 2)*100
educ_lvl_new = round(educ_lvl_new/educ_lvl_new.sum(), 2)*100

educ_lvl_beg = educ_lvl_beg.reindex(list(educ_lvl_old.index))

In [None]:
educ_list = list(educ_lvl_beg.index)[::-1]
beg = list(educ_lvl_beg['FormalEducation'])[::-1]
old = list(educ_lvl_old['Q4_What is the highest level of formal education that you have attained or plan to attain within the next 2 years?'])[::-1]
new = list(educ_lvl_new['Q4_What is the highest level of formal education that you have attained or plan to attain within the next 2 years?'])[::-1]

dot = figure(title="Education Level", tools=TOOLS, plot_width = 800, plot_height = 400, y_range=educ_list, x_range=[0,65])

dot.segment(0, educ_list, beg, educ_list, line_width=2, line_color="sienna", legend='2017')
dot.circle(beg, educ_list, size=15, fill_color="plum", line_color="sienna", line_width=1, legend='2017')
dot.segment(0, educ_list, old, educ_list, line_width=2, line_color="sienna", legend='2018')
dot.circle(old, educ_list, size=15, fill_color="skyblue", line_color="sienna", line_width=1, legend='2018')
dot.segment(0, educ_list, new, educ_list, line_width=2, line_color="sienna", legend='2019')
dot.circle(new, educ_list, size=15, fill_color="yellowgreen", line_color="sienna", line_width=1, legend='2019')

dot.xaxis.axis_label = 'Percentage of Respondents'
dot.yaxis.axis_label = 'Education Level'
dot.yaxis.axis_label_text_font = 'times'
dot.yaxis.axis_label_text_font_size = '12pt'
dot.xaxis.axis_label_text_font = 'times'
dot.xaxis.axis_label_text_font_size = '12pt'
dot.ygrid.grid_line_color = None
dot.xgrid.grid_line_color = None
dot.legend.location = "bottom_right"
dot.legend.click_policy="hide"
show(dot)

* Over 40% of the respondents have a Master's degree. Most of the other respondents have a Bachelor's degree. Together they make up over 75% of the total respondents, which is huge. So it is **important to have at least a Bachelor's degree** to easily become a part of the DS community.


* As seen in the analysis of the age groups, over 50% of the respondents are below the age of 30. That might be because people start to get into DS after completing either a Bachelor's or a Master's degree.


* A good percentage (about 20%) of respondents have a Doctoral degree. According to the [census](https://www.census.gov/library/stories/2019/02/number-of-people-with-masters-and-phd-degrees-double-since-2000.html) by the U.S. Census Bureau, there were only 4.5 million people with a Doctoral degree in the US in 2018, compared to around 70 million with either a Bachelor's or a Master's degree. **The 20% share of respondents with a Doctoral degree is definitely an indicator that it is a lot easier to land a DS job with a Doctoral degree**.


* The percentage of people without a degree is very low. Students in universities also make up a fair percentage of the total respondents. So, generally, the higher your degree, the more likely you are to end up in a DS job.
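
A rough back-of-the-envelope check of the Doctoral-degree comparison, using only the figures quoted above. It assumes, as a simplification, that the two US groups form the whole comparison pool; note also that the survey is global while the census figures are for the US only, so this is indicative at best. The variable names are illustrative.

```python
# Figures as quoted in the text (US, 2018).
doctoral_us = 4.5e6            # people with a Doctoral degree
bachelors_masters_us = 70e6    # people with a Bachelor's or Master's degree

# Share of Doctoral degrees among these degree holders, in percent.
doctoral_share_us = 100 * doctoral_us / (doctoral_us + bachelors_masters_us)

# Approximate share of survey respondents with a Doctoral degree, in percent.
doctoral_share_survey = 20.0

over_representation = doctoral_share_survey / doctoral_share_us

print(f"Doctoral share among US degree holders: {doctoral_share_us:.1f}%")
print(f"Over-representation in the survey: {over_representation:.1f}x")
```

Under these assumptions, Doctoral degree holders are roughly three times as common among respondents as among US degree holders.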

### **6. What do DS people earn??**

This is perhaps the most important question. Data Science is viewed as the sexiest job of the 21st century. The growing demand for Data Science is what earned it that title. As the demand increases, so should the job openings and pay. Good pay is what attracts people the most.

Is Data Science the sexiest job in terms of pay? As more and more people entered this domain, what impact did that have on the pay scale?

![salary](http://laschoolreport.com/wp-content/uploads/2014/09/Teacher-salary-LAUSD.jpg)

In [None]:
yearly_comp_old = multiplechoice_old['Q9_What is your current yearly compensation (approximate $USD)?'].value_counts().to_frame()[1:]
yearly_comp_new = multiplechoice_new['Q10_What is your current yearly compensation (approximate $USD)?'].value_counts().to_frame()[1:]

yearly_comp_old = round(yearly_comp_old/yearly_comp_old.sum(), 2)*100
yearly_comp_new = round(yearly_comp_new/yearly_comp_new.sum(), 2)*100

In [None]:
old_idx_sort = ['0-10,000', '10-20,000', '20-30,000', '30-40,000', '40-50,000','50-60,000', '60-70,000', '70-80,000', '80-90,000', '90-100,000',\
'100-125,000', '125-150,000',  '150-200,000', '200-250,000', '250-300,000', '300-400,000', '400-500,000', '500,000+']

new_idx_sort = ['1,000-1,999', '2,000-2,999', '3,000-3,999', '4,000-4,999', '5,000-7,499', '7,500-9,999', '10,000-14,999', '15,000-19,999',
'20,000-24,999', '25,000-29,999', '30,000-39,999', '40,000-49,999','50,000-59,999', '60,000-69,999', '70,000-79,999',
'80,000-89,999', '90,000-99,999', '100,000-124,999', '125,000-149,999', '150,000-199,999', '200,000-249,999', '250,000-299,999',
'300,000-500,000', '> $500,000']

yearly_comp_old = yearly_comp_old.reindex(index = old_idx_sort)
yearly_comp_new = yearly_comp_new.reindex(index = new_idx_sort)

comb_idx = ['0-10,000', '10-20,000', '20-30,000', '30-40,000', '40-50,000','50-60,000', '60-70,000', '70-80,000', '80-90,000', '90-100,000',\
'100-125,000', '125-150,000',  '150-200,000', '200-250,000', '250-300,000', '300-500,000', '500,000+']

In [None]:
yearly_comp_old = pd.DataFrame({'salary': comb_idx,
                                'count': [x[0] for x in yearly_comp_old[:-3].values] + [yearly_comp_old.iloc[-3:-1].values.sum()] +\
                                [yearly_comp_old.iloc[-1][0]]}).set_index('salary')

yearly_comp_new = pd.DataFrame({'salary': comb_idx,
                                'count': [yearly_comp_new.iloc[:6].values.sum(), yearly_comp_new.iloc[6:8].values.sum(),
                                 yearly_comp_new.iloc[8:10].values.sum()] + [x[0] for x in yearly_comp_new[10:].values]}).set_index('salary')

In [None]:
yearly_comp_list = list(yearly_comp_old.index)[::-1]
old = list(yearly_comp_old['count'])[::-1]
new = list(yearly_comp_new['count'])[::-1]

dot = figure(title="Yearly Compensation", tools=TOOLS, plot_width = 800, plot_height = 400, y_range=yearly_comp_list, x_range=[0,31])

dot.segment(0, yearly_comp_list, old, yearly_comp_list, line_width=2, line_color="sienna", legend='2018')
dot.circle(old, yearly_comp_list, size=15, fill_color="skyblue", line_color="sienna", line_width=1, legend='2018')
dot.segment(0, yearly_comp_list, new, yearly_comp_list, line_width=2, line_color="sienna", legend='2019')
dot.circle(new, yearly_comp_list, size=15, fill_color="yellowgreen", line_color="sienna", line_width=1, legend='2019')

dot.xaxis.axis_label = 'Percentage of Respondents'
dot.yaxis.axis_label = 'Yearly Compensation (approx $USD)'
dot.yaxis.axis_label_text_font = 'times'
dot.yaxis.axis_label_text_font_size = '12pt'
dot.xaxis.axis_label_text_font = 'times'
dot.xaxis.axis_label_text_font_size = '12pt'
dot.ygrid.grid_line_color = None
dot.xgrid.grid_line_color = None
dot.legend.location = "bottom_right"
dot.legend.click_policy="hide"
show(dot)

**Note** - *The analysis ignores those who did not wish to disclose their yearly compensation*.

* The analysis only compares the yearly compensation of the respondents in 2018 and 2019. The graph is a pleasant sight for all those who are interested in the pay.


* The *percentage of respondents increased in all compensation brackets except the 0-10k and 90-100k ranges*, though the decrease in the 90-100k range isn't big enough to worry about.


* It is delightful to look at the percentage of respondents who earn more than 100k yearly. It increased from around 12% in 2018 to around 20% in 2019; that is, **1 out of every 5 respondents earns over 100k yearly!!**


* The percentage of those who earn less than 10k yearly dropped from 30% in 2018 to less than 25% in 2019. In summary, the percentage of those who earn less went down and the percentage of those who earn more rocketed. **Data Science is indeed the sexiest job in terms of yearly compensation** and the demand for people in DS is still high!!
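
The 2018 and 2019 surveys use different compensation brackets, so the comparison above depends on first folding both years into a common set of brackets. The cells above do this with positional slices. A minimal sketch of an alternative, assuming a label-to-label mapping is acceptable, is shown below; the percentages are made up and `to_common` is a hypothetical name introduced only for illustration.

```python
import pandas as pd

# Made-up percentages for a few fine-grained 2019-style brackets (illustrative only).
fine = pd.Series({
    '1,000-1,999': 8.0,
    '5,000-7,499': 4.0,
    '10,000-14,999': 6.0,
    '15,000-19,999': 5.0,
    '20,000-24,999': 3.0,
})

# Hypothetical mapping from each fine bracket to a coarser bracket shared by both years.
to_common = {
    '1,000-1,999': '0-10,000',
    '5,000-7,499': '0-10,000',
    '10,000-14,999': '10-20,000',
    '15,000-19,999': '10-20,000',
    '20,000-24,999': '20-30,000',
}

# Group the index labels through the mapping, sum each group,
# then reindex to restore the natural bracket order.
common = fine.groupby(to_common).sum().reindex(['0-10,000', '10-20,000', '20-30,000'])
print(common)
```

Mapping by label rather than by position means a reordered or missing bracket produces a visible gap instead of silently shifting every value by one row.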

### **7. How do companies respond to Data Science?**

The demand for Data Science appears to be increasing year by year, so how do companies respond to it? Has every company started building a DS team, and do you have to join a large company to be part of one?

![ds_comp](http://ehacking.in/images/Cybersecurityeng.jpg)

In [None]:
comb_df = multiplechoice_new.groupby(['Q6_What is the size of the company where you are employed?', 'Q7_Approximately how many individuals are responsible for data science workloads at your place of business?']).count().iloc[:,0]
comb_df = comb_df.unstack().reindex(['0-49 employees', '50-249 employees', '250-999 employees', '1000-9,999 employees',
                                     '> 10,000 employees'])
comb_df.columns = ['0', '1-2', '3-4', '5-9', '10-14', '15-19', '20+']
comb_df['1-10'] = comb_df['1-2'] + comb_df['3-4'] + comb_df['5-9']
comb_df['10+'] = comb_df['10-14'] + comb_df['15-19'] + comb_df['20+']
comb_df = comb_df[['0', '1-10', '10+']].reset_index()

patch1 = mpatches.Patch(color='sienna', label='0')
patch2 = mpatches.Patch(color='olive', label='1-10')
patch3 = mpatches.Patch(color='slategrey', label='10+')

fig, ax = plt.subplots(figsize=(15,7))
sns.pointplot(x="Q6_What is the size of the company where you are employed?", y="0", data=comb_df, color= 'sienna')
sns.pointplot(x="Q6_What is the size of the company where you are employed?", y="1-10", data=comb_df, color= 'olive')
sns.pointplot(x="Q6_What is the size of the company where you are employed?", y="10+", data=comb_df, color= 'slategrey')
plt.xticks(rotation=45)
plt.ylabel('Count', fontsize = 'large')
plt.xlabel('Company Size', fontsize = 'large')
plt.legend(title = "Data Science team size", handles=[patch1, patch2, patch3])
plt.title('Company size vs Data Science team size', fontsize = 'large')

* The good news is that most companies have a Data Science team. Every company wants to be a part of this growing field. The larger the company, the higher the probability of finding a DS team within it. The size of the DS team also depends on the company size.


* For companies with over 1000 employees, the number of companies that have a DS team of over 10 members increases almost linearly with company size. Even for smaller companies with fewer than 50 employees, the chance of finding a DS team is high. These companies mostly have a smaller DS team of 1-10 members.
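
The plot above shows raw counts, so each line also reflects how many respondents come from each company size. To read the result as a probability of finding a DS team, the counts can be normalized within each company size. The following is a minimal sketch on made-up toy data (the column names `company_size` and `ds_team` are illustrative, not the survey's), using `pd.crosstab` with `normalize='index'`.

```python
import pandas as pd

# Toy responses (made up): company size and number of people doing DS work.
df = pd.DataFrame({
    'company_size': ['0-49', '0-49', '0-49', '0-49',
                     '> 10,000', '> 10,000', '> 10,000', '> 10,000'],
    'ds_team':      ['0', '1-10', '1-10', '1-10',
                     '0', '1-10', '10+', '10+'],
})

# normalize='index' turns each row into shares that sum to 1,
# so small and large companies become directly comparable.
shares = pd.crosstab(df['company_size'], df['ds_team'], normalize='index')

# Share of respondents per company size who report any DS team at all.
has_team = 1 - shares['0']

print(shares.round(2))
print(has_team)
```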

It is clear from the study that Data Science does indeed have a bright future ahead and it's never too late to be a part of it.

> The Best Way To Predict The Future Is To Create It  
       **- Peter Drucker**
       

If you are still uncertain about the future of DS, then the quote is for you. As Peter Drucker said, **the best way to know the future is to be among those creating it**.

Finally, if you want to be a part of DS, or if you are just new to this domain, there is always uncertainty about what you should learn. Learning what is important and keeping up with present trends can help you become successful in any domain. The last section looks at the present trends in DS to give you a good start.

### **8. Data Science - Where and What to Learn?**

The demand for Data Science is on the rise and more and more companies are stepping into this domain. The increasing demand attracts people to be a part of it. This section is for those who are not yet a part of DS but want in, or those who are new to this domain and unaware of its present trends.

![much_to_learn](https://media0.giphy.com/media/3ohuAxV0DfcLTxVh6w/giphy.gif)

#### **a. Where do people learn Data Science?**

Nobody ever talks about motivation in learning. Data science is a broad and fuzzy field, which makes it hard to learn. Really hard. Without proper motivation, you’ll end up stopping halfway through and believing you can’t do it. So, how did those in the field manage to overcome these troubles and what kept them motivated to push further?

In [None]:
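source = multiplechoice_new.iloc[:,22:32].copy()  # copy so the assignments below do not write into a view
for col in source.columns:
    # count of the column's most frequent answer, i.e. how often this option was selected
    source[col] = source[col].value_counts().iloc[0]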
src_name = [col.split('Choice - ')[1].split(' (')[0] for col in source.columns]
source.columns = src_name
source = source.drop_duplicates().T
source = source.sort_values(by=1)

p1 = figure(x_range = source.index.values, y_range = (0,11500), plot_width = 400, plot_height = 400, title = 'Sources people follow to learn Data Science', tools = TOOLS)
p1.vbar(source.index.values, width = 0.5, top = source.values.reshape(len(source)), color=['peru']*10 + ['goldenrod'])
p1.yaxis.axis_label = 'Number of Respondents'
p1.xaxis.axis_label = 'Source for learning DS'
p1.yaxis.axis_label_text_font = 'times'
p1.yaxis.axis_label_text_font_size = '12pt'
p1.xaxis.axis_label_text_font = 'times'
p1.xaxis.axis_label_text_font_size = '12pt'
p1.xaxis.major_label_orientation = math.pi/4
p1.ygrid.grid_line_color = None

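platform = multiplechoice_new.iloc[:,35:45].copy()  # copy so the assignments below do not write into a view
for col in platform.columns:
    # count of the column's most frequent answer, i.e. how often this option was selected
    platform[col] = platform[col].value_counts().iloc[0]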
plt_name = [col.split('Choice - ')[1].split(' (')[0] for col in platform.columns]
platform.columns = plt_name
platform = platform.drop_duplicates().T
platform = platform.sort_values(by=1)

p2 = figure(x_range = platform.index.values, y_range = (0,9500), plot_width = 400, plot_height = 400, title = 'Platforms used for learning Data Science courses', tools = TOOLS)
p2.vbar(platform.index.values, width = 0.5, top = platform.values.reshape(len(platform)), color=['slateblue']*10 + ['goldenrod', 'slategrey'])
p2.yaxis.axis_label = 'Number of Respondents'
p2.xaxis.axis_label = 'Platforms used'
p2.yaxis.axis_label_text_font = 'times'
p2.yaxis.axis_label_text_font_size = '12pt'
p2.xaxis.axis_label_text_font = 'times'
p2.xaxis.axis_label_text_font_size = '12pt'
p2.xaxis.major_label_orientation = math.pi/4
p2.ygrid.grid_line_color = None
show(row(p1,p2))

* The most followed source for learning DS is **Kaggle** which is closely followed by Blogs and YouTube. Kaggle is a platform for data science and analytics competitions. It claims to be the world’s largest community of active data scientists and it is where you can work with them, learn from them and stay motivated. Kaggle competitions help the participants prepare for real world problems.


* Blogs and YouTube are among the top sources used for learning DS. Details on 10 of the top DS blog sites can be found [here](https://www.tableau.com/learn/articles/data-science-blogs). A few of the most followed YouTube channels for learning DS include Datacamp and Sentdex, where you can understand certain topics a lot better than by reading a book or a journal.


* In the present world, almost all content can be found online. **The trend of learning has shifted from attending universities to learning online**. Looking at the most used platforms for learning DS, twice as many respondents learn through [Coursera](https://www.coursera.org/) as attend university courses. Coursera has tons of courses related to Data Science, supported by universities around the globe with expert faculty, and you can learn free of cost unless you need the certification.
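
Both charts above come from multi-select questions, which the survey stores as one column per option. Assuming each option column holds only that option's label or a missing value, the number of selections per option is simply the number of non-null cells in the column. A minimal sketch on made-up toy data:

```python
import pandas as pd

# Toy multi-select answers (made up): one column per option,
# holding the option's name when ticked and a missing value otherwise.
answers = pd.DataFrame({
    'Kaggle':  ['Kaggle', 'Kaggle', None, 'Kaggle'],
    'Blogs':   ['Blogs', None, None, 'Blogs'],
    'YouTube': [None, 'YouTube', None, None],
})

# count() tallies the non-null cells in each column,
# i.e. the number of respondents who ticked each option.
selections = answers.count().sort_values()
print(selections)
```

Because respondents can tick several options, these counts do not add up to the number of respondents.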

#### **b. What should you learn to start coding?**

In [None]:
prog_lang = multiplechoice_new.iloc[:,[55] + list(range(82,92))]
prog_lang.columns = ['Coding exp'] + [x.split('Choice -')[1].split(' (')[0] for x in prog_lang.columns[1:]]
prog_lang = prog_lang.reindex(list(prog_lang['Coding exp'].dropna().index))
prog_lang = prog_lang.groupby('Coding exp').count().iloc[:-1].reindex(['< 1 years', '1-2 years', '3-5 years', '5-10 years', '10-20 years', '20+ years']).reset_index()
prog_lang = pd.melt(prog_lang, id_vars=['Coding exp'])
prog_lang['value'] = prog_lang['value']/90

p = figure(plot_width = 800, plot_height = 650, x_range = prog_lang['Coding exp'].unique(), y_range = prog_lang['variable'].unique(), title="Programming Language v/s Coding Experience", tools = TOOLS)
source = ColumnDataSource(prog_lang)
color_mapper = LinearColorMapper(palette = Plasma256[::-1], low = prog_lang['value'].min(), high = prog_lang['value'].max())
color_bar = ColorBar(color_mapper = color_mapper, location = (0, 0), ticker = BasicTicker())
p.add_layout(color_bar, 'right')
p.scatter(x = 'Coding exp', y = 'variable', size = 'value', legend = None, fill_color = transform('value', color_mapper), source = source)
p.xaxis.axis_label = 'Coding Experience'
p.yaxis.axis_label = 'Programming Language used'
p.yaxis.axis_label_text_font = 'times'
p.yaxis.axis_label_text_font_size = '12pt'
p.xaxis.axis_label_text_font = 'times'
p.xaxis.axis_label_text_font_size = '12pt'
p.xaxis.major_label_orientation = math.pi/4
p.xgrid.grid_line_color = None
show(p)

In [None]:
lang_recom = multiplechoice_new.iloc[:,95].value_counts().to_frame().drop(index=['Other','None'])
lang_recom.columns = ['Language Recommended']

lang_recom.plot(kind='bar', figsize=(16,8), color = 'mediumslateblue', legend=False)
plt.yscale('log')
plt.xlabel('Programming Languages', fontsize = 'large')
plt.ylabel('Number of Respondents', fontsize = 'large')
plt.title('Language Recommended', fontsize = 'x-large', fontweight = 'roman')

**Note** - *The number of respondents is shown on a log scale*

* **Python is the most used and recommended language**. Python has now become one of the most popular coding languages in the world. The differentiating factor that Python brings to the table is that it enables programmers to flesh out concepts by writing less, and more readable, code. Developers can further take advantage of several Python frameworks to reduce the time and effort required for building large and complex software applications. According to [GitHub’s 2019 State of the Octoverse](https://octoverse.github.com/), for the first time, Python outranked Java as the second most popular language on GitHub by repository contributors.


* SQL is the second most used language after Python and is closely followed by R. The situation is reversed when you look at the languages recommended. The respondents believe it is *better to learn R than SQL* to have a better career option in Data Science. R is commonly used in academia for statistical analysis.


* As observed in the analysis of the respondents' age groups, most of them are under 30 years of age. This is observed again in the bubble chart, where the size of the bubbles decreases as coding experience increases. This is just an indication that there are only a few people with over 10 years of experience among the respondents.

Now that you have found the languages that you should learn to have a good start, the next objective is to find some of the most used IDEs (Integrated Development Environments) where you can practice coding.

In [None]:
ide = multiplechoice_new.iloc[:,55:66]
ide.columns = ['Coding exp'] + [x.split('Choice -')[1].split(' (')[0] for x in ide.columns[1:]]
ide = ide.reindex(list(ide['Coding exp'].dropna().index))
ide = ide.groupby('Coding exp').count().iloc[:-1].reindex(['< 1 years', '1-2 years', '3-5 years', '5-10 years', '10-20 years', '20+ years'])
ide.columns = ['Jupyter', 'RStudio', 'PyCharm', 'Atom', 'MATLAB', 'Visual Studio / VS Code', 'Spyder', 'Vim / Emacs', 'Notepad++', 'Sublime Text']

fig, ax = plt.subplots(figsize=(16,8))
sns.heatmap(ide, annot= True, fmt="d", linewidths=.5, cmap='YlGnBu')
plt.xticks(rotation=90)
plt.yticks(rotation=0)
plt.xlabel('IDE used', fontsize = 'large')
plt.ylabel('Coding Experience', fontsize = 'large')
plt.title('IDE v/s Coding Experience', fontsize = 'large')

* [**Jupyter**](https://jupyter.org/) **is the most used IDE among the respondents**. Other Python IDEs such as PyCharm and Spyder, and R IDEs such as RStudio, are also used by many of them for coding. Jupyter supports several coding languages including Python and R, which are the most common among people in DS, and this explains its higher usage.


* Source code editors such as **VS Code** are also used by many. VS Code features a lightning fast source code editor, perfect for day-to-day use. With support for hundreds of languages, VS Code helps you be instantly productive with syntax highlighting, bracket-matching, auto-indentation, box-selection, snippets, and more which is exactly what every programmer wants.

#### **c. How do you generate insights from data?**

Data visualization is a great way to generate insights from data. Data visualization is the presentation of data in a pictorial or graphical format. It enables decision makers to see analytics presented visually, so they can grasp difficult concepts or identify new patterns. Because of the way the human brain processes information, using charts or graphs to visualize large amounts of complex data is easier than poring over spreadsheets or reports. Data visualization is a quick, easy way to convey concepts in a universal manner. What are the common libraries used for visualization?

In [None]:
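vis_lib = multiplechoice_new.iloc[:,97:107].copy()  # copy so the assignments below do not write into a view
vis_lib.columns = [x.split('Choice -  ')[1].split(' (')[0] for x in vis_lib.columns]
for col in vis_lib.columns:
    # count of the column's most frequent answer, i.e. how often this library was selected
    vis_lib[col] = vis_lib[col].value_counts().iloc[0]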
vis_lib = vis_lib.drop_duplicates().T
vis_lib = vis_lib.sort_values(by=1)

color_map = ['cadetblue']*4 + ['plum'] + ['rosybrown'] + ['cadetblue'] + ['rosybrown'] + ['cadetblue']*2
vis_lib[1].plot(kind='bar', color=tuple(color_map), figsize=(15,7))
custom_lines = [Line2D([0], [0], color='cadetblue', lw=4, label='Python'), Line2D([0], [0], color='plum', lw=4, label='Javascript'), Line2D([0], [0], color='rosybrown', lw=4, label='R')]
# the handles already carry their labels, so passing a separate label list is unnecessary
plt.legend(handles = custom_lines, title = 'Programming Language', title_fontsize = 'large')
plt.xticks(rotation=45)
plt.xlabel('Visualization Libraries', fontsize = 'large')
plt.ylabel('Count', fontsize = 'large')
plt.title('Visualization Libraries used', fontsize = 'large')
plt.show()

* Among the top 10 visualization libraries that the survey respondents use, *seven are Python libraries, two belong to R and the remaining one is a Javascript library*. [**Matplotlib**](https://matplotlib.org/) and [**Seaborn**](https://seaborn.pydata.org) (a high-level interface, based on matplotlib, for drawing attractive and informative statistical graphics) are the most commonly used libraries. If you are interested in R, then [**ggplot**](http://ggplot.yhathq.com/) is what you need to learn and, for those interested in Javascript, [**D3.js**](https://d3js.org/) is the most used Javascript library in DS.

#### **d. Where do you store all the data?**

The most important part of Data Science is the data itself! As the volume of data increases, there is a need for better storage methods where the data is properly structured, enabling easy access to it. Databases are used for this purpose. In simple terms, a database is an organized collection of data. To work with a database, you need a DBMS (Database Management System), which is a software system that enables users to define, create, maintain and control access to the database. Learning a language that helps in managing the data in a DBMS is important and is what you need to learn next.

In [None]:
db = multiplechoice_new.iloc[:,233:240]
db.columns = [x.split('Choice -')[1].split(' (')[0] for x in db.columns]
db = pd.melt(db).dropna().groupby('variable')['value'].count().sort_values()[::-1]
db.rename(index={' AWS Relational Database Service':'Amazon RDS'},inplace=True)

fig, ax = plt.subplots(figsize=(15,7))
sns.barplot(x=list(db.index), y=list(db.values), palette="rocket")
plt.axhline(0, color="k", clip_on=False)
plt.xticks(rotation=45)
plt.xlabel('Database', fontsize = 'large')
plt.ylabel('Count', fontsize = 'large')
plt.title('Database Usage', fontsize = 'large')

* **MySQL is the most used DBMS**, followed by PostgreSQL and MS SQL Server. According to the [survey](https://insights.stackoverflow.com/survey/2018/#technology) conducted by Stack Overflow, the top three DBMS among developers are MySQL, MS SQL Server and PostgreSQL. PostgreSQL has been gaining a lot of traction in the last few years. Developers working with Postgres are very pleased with the product, both in terms of capabilities and performance.
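
All three of these systems are queried with SQL, so the basics carry over from one to another. As a minimal sketch of what that looks like from Python, the example below uses SQLite through the standard-library `sqlite3` module as a stand-in for a server-based DBMS; the `respondents` table and its values are made up for illustration and are not part of the survey data.

```python
import sqlite3

# An in-memory SQLite database stands in for a server-based DBMS here.
conn = sqlite3.connect(':memory:')
conn.execute(
    'CREATE TABLE respondents (id INTEGER PRIMARY KEY, country TEXT, salary_usd INTEGER)'
)

# Made-up rows, inserted with placeholders rather than string formatting.
conn.executemany(
    'INSERT INTO respondents (country, salary_usd) VALUES (?, ?)',
    [('India', 12000), ('India', 18000), ('USA', 110000), ('USA', 130000)],
)

# Aggregate with plain SQL: average salary per country, highest first.
rows = conn.execute(
    'SELECT country, AVG(salary_usd) FROM respondents '
    'GROUP BY country ORDER BY AVG(salary_usd) DESC'
).fetchall()
conn.close()

print(rows)
```

The same `CREATE TABLE`, `INSERT` and `SELECT ... GROUP BY` statements work with little or no change on MySQL, PostgreSQL and MS SQL Server, although each needs its own Python driver.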

#### **e. What role does Machine Learning have in Data Science?**

**Machine learning** is an application of artificial intelligence (AI) that provides systems the ability to automatically learn and improve from experience without being explicitly programmed. It is one of the most important elements of Data Science and can be seen as the reason for the rise in the demand for Data Science. What Machine Learning (ML) skills should you develop to have better opportunities in DS?

The analysis looks at the most used ML algorithms and frameworks to help you get started in ML.

In [None]:
ml_alg = multiplechoice_new.iloc[:,117:128]
ml_alg.columns = ['ML exp'] + [x.split('Choice -')[1].split(' (')[0] for x in ml_alg.columns[1:]]
ml_alg = ml_alg.reindex(list(ml_alg['ML exp'].dropna().index))
ml_alg = ml_alg.groupby('ML exp').count().iloc[:-1].reindex(['< 1 years', '1-2 years', '2-3 years', '3-4 years', '4-5 years', '5-10 years', '10-15 years', '20+ years'])
ml_alg = ml_alg.fillna(0).astype('int').iloc[1:]

fig, ax = plt.subplots(figsize=(16,8))
sns.heatmap(ml_alg, annot= True, fmt="d", linewidths=.5, cmap='YlOrBr')
plt.xticks(rotation=90)
plt.yticks(rotation=0)
plt.xlabel('ML Algorithms used', fontsize = 'large')
plt.ylabel('ML Experience', fontsize = 'large')
plt.title('ML Experience v/s Algorithms used', fontsize = 'large')

* Simpler algorithms such as **Linear or Logistic regression and Decision trees or random forests** are the most used ML algorithms. These algorithms are easier to learn and give reasonably good results, making them the most used ones in all categories of ML experience. ML experience only takes into account the experience of working with ML algorithms or frameworks.


* Among those with under 3 years of ML experience, deep learning methods such as Convolutional Neural Networks (CNN) and Recurrent Neural Networks (RNN) are used more often than Gradient Boosting Machines (GBM) and Bayesian approaches. **It showcases the interest of newcomers in being a part of the deep learning community**. Among those with over 3 years of ML experience, the usage of algorithms such as GBM and Bayesian approaches is a lot higher compared to other algorithms.

How do you implement these algorithms? Do you have to build them every time you work on a problem, or are they pre-implemented somewhere for you to use with ease? Machine Learning frameworks are the answer to these questions.

In [None]:
ml_fw = multiplechoice_new.iloc[:,[117] + list(range(155,165))]
ml_fw.columns = ['ML exp'] + [x.split('Choice -')[1].split(' (')[0] for x in ml_fw.columns[1:]]
ml_fw = ml_fw.reindex(list(ml_fw['ML exp'].dropna().index))
ml_fw = ml_fw.groupby('ML exp').count().iloc[:-1].reindex(['< 1 years', '1-2 years', '2-3 years', '3-4 years', '4-5 years', '5-10 years', '10-15 years', '20+ years'])
ml_fw = ml_fw.fillna(0).astype('int').iloc[1:]

fig, ax = plt.subplots(figsize=(16,8))
sns.heatmap(ml_fw, annot= True, fmt="d", linewidths=.5, cmap='GnBu')
plt.xticks(rotation=90)
plt.yticks(rotation=0)
plt.xlabel('ML Frameworks used', fontsize = 'large')
plt.ylabel('ML Experience', fontsize = 'large')
plt.title('ML Experience v/s Frameworks used', fontsize = 'large')

* **Machine Learning Framework** refers to an interface, library or tool which allows developers to more easily and quickly build machine learning models, without getting into the details of the underlying algorithms. After learning ML algorithms, it is important to understand and learn the ML frameworks used to build and deploy ML algorithms.


* **Scikit-learn** is the most used ML framework. It implements almost all of the most used ML algorithms, including Linear or Logistic regression, Decision trees and many more. Tree ensemble methods such as Random Forest (a bagging method) and Xgboost (a boosting method) are also used by many to build ML models.


* Tensorflow and Keras are the most used frameworks after Scikit-learn. Both are used to build deep learning models such as Convolutional Neural Networks (CNN) and Recurrent Neural Networks (RNN). PyTorch is another popular deep learning framework and its popularity is expected to grow in the coming years.
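
To show what "without getting into the details of the underlying algorithms" looks like in practice, here is a minimal sketch that trains two of the most used algorithms with Scikit-learn. It assumes scikit-learn is installed and uses the Iris dataset that ships with it; the dataset, the 70/30 split and the `max_iter` setting are choices made for this illustration and are unrelated to the survey.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Iris ships with scikit-learn, so no download is needed.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

models = {
    'logistic regression': LogisticRegression(max_iter=1000),
    'decision tree': DecisionTreeClassifier(random_state=0),
}

# Every estimator exposes the same fit / score interface,
# so swapping algorithms means changing a single line.
scores = {
    name: model.fit(X_train, y_train).score(X_test, y_test)
    for name, model in models.items()
}
print(scores)
```

The shared `fit` and `score` interface is what makes it practical to try several algorithms on the same problem before settling on one.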

#### **f. Can Data Science be moved into the cloud?**

Cloud services are a rapidly growing market. Modern technologies like big data analytics, IoT, artificial intelligence and even web and mobile app hosting all need heavy computing power. Cloud computing offers enterprises an alternative to building their own in-house infrastructure. **With cloud computing, anybody using the internet can enjoy scalable computing power on a plug and play basis**. Since this saves organizations from the need to invest in and maintain costly infrastructure, it has become a very popular solution. What cloud platforms do people in DS use?

In [None]:
cloud_plat = multiplechoice_new.iloc[:,168:178]
cloud_plat.columns = [x.split('Choice -')[1].split(' (')[0] for x in cloud_plat.columns]
cloud_plat = pd.melt(cloud_plat).dropna().groupby('variable')['value'].count().sort_values()[::-1]

fig, ax = plt.subplots(figsize=(15,7))
sns.barplot(x=list(cloud_plat.index), y=list(cloud_plat.values), palette="rocket")
plt.axhline(0, color="k", clip_on=False)
plt.xticks(rotation=45)
plt.xlabel('Cloud Platforms', fontsize = 'large')
plt.ylabel('Count', fontsize = 'large')
plt.title('Cloud Platform Usage', fontsize = 'large')

* [**Amazon Web Services (AWS)**](https://aws.amazon.com/) is the most used cloud platform, followed by Google Cloud Platform (GCP) and Microsoft Azure. Under AWS, Amazon provides on-demand cloud computing services such as storage and data analysis. AWS gives its subscribers access to a full-fledged virtual cluster of computers, at any time, based on their requirements. The entire service is delivered over the internet. AWS's virtual machines come with most of the attributes of an actual computer, including hardware (CPU(s) and GPU(s) for processing, hard disk/SSD for storage and RAM for memory), a choice of operating system and pre-loaded applications such as web servers, databases and CRM.

* GCP offers services in all major spheres including compute, networking, storage, machine learning (ML) and the internet of things (IoT). It also includes tools for cloud management, security, and development. Its storage offerings cover both SQL (Cloud SQL) and NoSQL (Cloud Datastore) databases.

* Microsoft Azure is used to deploy code on Microsoft's servers. This code has access to local storage resources (blobs, queues, and tables). While SQL Azure is not a full SQL Server instance, it can be integrated with SQL Server. Security features such as authentication are supported through Azure AppFabric, which allows applications within your LAN to communicate with the Azure cloud.

We started with a quote, ***"Everything that has a beginning, has an end"***, and, as the saying goes, this study has come to its end. We looked at some of the most common questions, including the uncertainty about the future of Data Science and what you should learn to have a successful career in Data Science, and found answers to them. Data Science will be around for many years and it would be great to see many more people being a part of it. So, *now pass on what you have learnt so that others benefit from it as well!*

![pass on](https://i.imgur.com/33FHrJT.gif)

### **9. References**

1. https://www.edureka.co/blog/what-is-data-science/
2. https://dare2compete.com/bites/the-rise-of-data-science/
3. https://www.weforum.org/agenda/2019/03/gender-equality-in-stem-is-possible/
4. https://www.itproportal.com/features/a-snapshot-of-data-scientist-jobs-around-the-world/
5. https://www.census.gov/library/stories/2019/02/number-of-people-with-masters-and-phd-degrees-double-since-2000.html
6. https://www.kdnuggets.com/2018/09/how-many-data-scientists-are-there.html
7. https://octoverse.github.com/
8. https://insights.stackoverflow.com/survey/2018/#technology
9. https://www.newgenapps.com/blog/top-5-cloud-platforms-and-solutions-to-choose-from