In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Any results you write to the current directory are saved as output.

***
# Choosing the right company
***

From a novice to an experienced data scientist, there’s a vast range of potential employers out there, from small start-ups to the biggest multinationals. Which type best suits you is for you to decide, based on your personality, your likes and dislikes and, of course, what the company has to offer. For example:

1. **Work for a company with great data** -- In determining what will be a great company to work for, data-science-as-a-strategic-differentiator is a necessary criterion, but it is not sufficient. The company must also have world-class data to work with.

1. **Work for a company with greenfield opportunities** -- When evaluating opportunities, find a company that doesn’t have it all figured out yet.


With this notebook, using the data available in the Kaggle survey, I will try to explain the traits of different organizations when it comes to the data science community, and to answer questions like 

    - What kind of data science tools and practices are used in different types of organizations?
    - How mature are these organizations when it comes to machine learning and what kind of frameworks are being used?
    - What should you expect when looking for a job opportunity? Will the role allow you to hone skills you already have or add new ones?
 

![](https://multithreaded.stitchfix.com/assets/images/blog/three_data_scientists.jpg)
*Pic Credits:[blog](https://multithreaded.stitchfix.com/blog/2015/03/31/advice-for-data-scientists/)*

***
# Table of contents
***

* [Introduction](#introduction)


* [1. Job Categorization](#job)
    * [1.1. Data Science Community: Our Cohort under study](#title)
    * [1.2. Similarities & Difference: DS Community Vs. Others](#snd)


* [2. Enterprise Classification and Data Science Community](#enterprise)
   * [2.1. How to classify and why?](#classify)
   * [2.2. Geographical Distribution of Enterprises](#geo)
   
   
* [3. Data Science Community at work](#dsc)
    * [3.1. Educational & Compensation difference](#experience)
    * [3.2. Exposure to Data Analytics Tools & ML](#analytics)
    * [3.3. Data Storage & Cloud Computing](#storage)
    * [3.4. Deep dive into ML & DL Framework](#mldl)
    
    

* [Conclusion](#conclusion)

* [References](#references)

***
# Introduction
#### What am I aiming to achieve?
<a id="introduction"></a>
***

Evaluating the companies you are considering on brand image, annual compensation, long-term payoff, location and so on is important. While keeping these things in mind, however, we often forget to ask, or simply don't know, whether the company we are considering is good for our professional development and future growth. 

<i>"Data scientist" is often used as a blanket title to describe jobs that are drastically different</i>

Understanding a company's culture and its approach to data is very important for your career growth in data science. I have heard about companies where data scientists are employed to confirm the opinions of interested parties. This may be acceptable in some professions; however, a data scientist is expected to form theories, test hunches, and find patterns, eventually creating actionable insights.

We will be using a subset of the Kaggle survey 2019 data: only respondents whose titles belong to the data science community. The aim is to understand the nature of their work, their exposure to ML/AI methods, and the technology they use, based on the type of organization they work for.

Plan of action: 

1. First, we will separate our cohort under analysis from the Kaggle survey and examine how it differs from other cohorts
2. Enterprise classification: classify businesses based on the survey data
3. Examine the characteristics of these enterprises from the point of view of the data science community 

***
# 1. Job Categorization
<a id="job"></a>
***

In this section we will analyze different groups of people based on their job titles in the Kaggle 2019 survey. We will do this by looking at 

1. How can titles be categorized, and why?
2. What factors make them similar or different?
3. How do they differ from other groups based on the survey data?

### 1.1. Data Science Community: Our Cohort under study
<a id="title"></a>

Data science combines several disciplines, including statistics, data analysis, machine learning, and computer science. This can be daunting if you’re new to data science, but keep in mind that different roles and companies will emphasize some skills over others, so you don’t have to be an expert at everything.


<img src="https://i2.wp.com/blog.udacity.com/wp-content/uploads/2014/11/Data-Science-Skills-Udacity-Matrix.png?resize=640%2C521&ssl=1" height="500" width="500">


Because of the similarities in their roles and responsibilities (see the example in the skill matrix above), we will use the titles from the Kaggle survey for grouping. Below are the cohorts used in this chapter:  

> Q5: Select the title most similar to your current role (or most recent title if retired) 


<img src="https://i.ibb.co/m6SHLqc/cohort.png" height="700" width="700">


<ul>
<li>Data Science cohort --> 39% of total survey data</li>
<li>Data/SW Engineering cohort --> 17.6% of total survey data</li>
<li>Business and Management cohort --> 7.6% of total survey data</li>
</ul>
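
The cohort shares listed above come from mapping each Q5 title to a cohort and counting. Below is a minimal sketch of that idea on a hypothetical toy series (the titles in `titles` are made up for illustration; the lookup mirrors the grouping used later in this notebook, written as a dictionary instead of `np.select`):

```python
import pandas as pd

# Hypothetical toy answers to Q5; the real survey has far more rows.
titles = pd.Series([
    "Data Scientist", "Data Analyst", "Software Engineer",
    "Product/Project Manager", "Student", "Statistician",
    "Other", "Data Engineer",
])

# Title -> cohort lookup mirroring the grouping used in this notebook.
COHORT_BY_TITLE = {
    "Data Scientist": "Data Science",
    "Statistician": "Data Science",
    "Data Analyst": "Data Science",
    "Research Scientist": "Data Science",
    "Software Engineer": "Data/SW Engineering",
    "Data Engineer": "Data/SW Engineering",
    "DBA/Database Engineer": "Data/SW Engineering",
    "Business Analyst": "Business/Management",
    "Product/Project Manager": "Business/Management",
    "Student": "Student",
    "Not employed": "Student",
}

# Titles that are not in the lookup fall into "Others".
cohorts = titles.map(COHORT_BY_TITLE).fillna("Others")

# Share of each cohort as a percentage of all toy respondents.
cohort_share = cohorts.value_counts(normalize=True).mul(100).round(1)
print(cohort_share)
```

A dictionary lookup keeps the title-to-cohort mapping in one place, which makes it easier to audit than a chain of boolean conditions.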


In [None]:
import numpy as np 
import pandas as pd 
import seaborn as sns
#sns.set(color_codes=True)
#sns.set_style({"axes.facecolor": "1.0", 'grid.linestyle': '--', 'grid.color': '.8'})
sns.set_style("whitegrid")
#colors = ["#F28E2B", "#4E79A7","#79706E"]

colors = {'Data Science': "#F28E2B", 'Data/SW Engineering': "#4E79A7", 'Business/Management': "#79706E"}
colors_entr = {'Large Enterprise': "#17BECF", 'SME': "#BCBD22", 'SMB': "#C7C7C7", 'NA': "#FF7F0E"}

import matplotlib.pyplot as plt
import plotly.graph_objs as go
from plotly.offline import init_notebook_mode, iplot
from IPython.display import display, HTML
init_notebook_mode(connected=True)
display(HTML("""
<style>
.output {
    display: flex;
    align-items: left;
    text-align: center;
}
</style>
"""))

data_19 = pd.read_csv("/kaggle/input/kaggle-survey-2019/multiple_choice_responses.csv", skiprows = range(1,2))
other_responses = pd.read_csv('/kaggle/input/kaggle-survey-2019/other_text_responses.csv')

conditions = [
    (data_19['Q5'] == 'Data Scientist') | (data_19['Q5'] == 'Statistician') | (data_19['Q5'] == 'Data Analyst') | (data_19['Q5'] == 'Research Scientist'), 
    (data_19['Q5'] == 'Software Engineer') | (data_19['Q5'] == 'Data Engineer') | (data_19['Q5'] == 'DBA/Database Engineer'),
    (data_19['Q5'] == 'Business Analyst') | (data_19['Q5'] == 'Product/Project Manager'),
    (data_19['Q5'] == 'Student') | (data_19['Q5'] == 'Not employed')]
choices = ['Data Science', 'Data/SW Engineering', 'Business/Management','Student']
data_19['JobDomain'] = np.select(conditions, choices, default='Others')


conditions = [
    (data_19['Q6'] == '1000-9,999 employees') | (data_19['Q6'] == '> 10,000 employees') , 
    (data_19['Q6'] == '50-249 employees') | (data_19['Q6'] == '250-999 employees') ,
    (data_19['Q6'] == '0-49 employees') ]
choices = ['Large Enterprise','SME', 'SMB', ]
data_19['Vertical'] = np.select(conditions, choices, default='NA')

conditions = [
    (data_19['Q3'] == 'United States of America') , 
    (data_19['Q3'] == 'India') ,
    (data_19['Q3'] == 'Russia'),
    (data_19['Q3'] == 'Japan'),
    (data_19['Q3'] == 'Brazil'),]
choices = ['USA','India', 'Russia','Japan','Brazil' ]
data_19['CountryGroup'] = np.select(conditions, choices, default='NA')

compensation_replace_dict = {
    '$0-999': '< 10,000','1,000-1,999': '< 10,000','2,000-2,999': '< 10,000','3,000-3,999': '< 10,000',
    '4,000-4,999': '< 10,000','5,000-7,499': '< 10,000','7,500-9,999': '< 10,000','10,000-14,999': '10,000 - 50,000',
    '15,000-19,999': '10,000 - 50,000','20,000-24,999': '10,000 - 50,000','25,000-29,999': '10,000 - 50,000',
    '30,000-39,999': '10,000 - 50,000','40,000-49,999': '10,000 - 50,000','50,000-59,999': '50,000 - 99,000',
    '60,000-69,999': '50,000 - 99,000','70,000-79,999': '50,000 - 99,000','80,000-89,999': '50,000 - 99,000',
    '90,000-99,999': '50,000 - 99,000','100,000-124,999': '> 100,000','125,000-149,999': '> 100,000',
    '150,000-199,999': '> 100,000','200,000-249,999': '> 100,000','250,000-299,999': '> 100,000',
    '300,000-500,000': '> 100,000','> $500,000': '> 100,000'}

data_19['Q10'] = data_19['Q10'].replace(compensation_replace_dict)

df = data_19.query(" JobDomain != 'Student' & JobDomain != 'Others'")

In [None]:
ax = sns.countplot(data=df, x="JobDomain",palette=colors)#sns.color_palette(colors))

ax.set_title('Number of Respondents by Title/Job Category\n')
ax.set_ylabel('')
ax.set_xlabel('')


plt.show()

No surprise here: since Kaggle is a data science platform, most of the people who use it are from the data science domain. Then there are people like me, who are part of the business & management group but, out of interest in data science, use this platform either to keep their skill set up to date or because they don't have any social life :)

Even if managers, marketers and business leaders don’t have hands-on experience in DS, they should be aware of new improvements and developments in this field. This will make their life easier at the workplace, or at least they can sound smart in front of their peers.

### 1.2. Similarities & Difference: DS Community Vs. Others
<a id="snd"></a>

Now that we have defined our groups, let's examine some attributes available in the survey data to understand them better

#### Job Categories by Country
Overall survey distribution by country
<ul>
<li>The largest share of survey respondents is from India, at 24.6% of the total</li>
<li>Followed by the USA at 15.4% </li>
</ul>

However, when we analyze the distribution by country for our three cohorts under study, we see a different picture

In [None]:
fig, axs = plt.subplots(figsize=(10, 6),sharey=True)
country = (df.groupby(['JobDomain'])['CountryGroup']
                     .value_counts(normalize=True)
                     .rename('Percentage')
                     .mul(100)
                     .reset_index()
                     .sort_values('CountryGroup'))

r = sns.barplot(x="CountryGroup", y="Percentage", hue="JobDomain", data=country[country['CountryGroup']!= 'NA'], palette=colors)
r.set_title('Job Groups Compared by Countries\n')
r.legend(loc='upper center', bbox_to_anchor=(0.5, -0.4),ncol=2)
_ = plt.setp(r.get_xticklabels(), rotation=90)

We see a major difference when it comes to geographical distribution
<ul>
<li>For the Data Science cohort, the top 2 countries are still India & USA, but with similar contributions: USA is slightly higher at 17.7% vs. India at 17.4%</li>
<li>The largest share of the Data/SW Engineering group resides in India, at 25%, compared to the USA in second place at 13.4% </li>
<li>The Business/Management group is largest in India (23%), compared to second-place USA (16%)</li>
</ul>
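
The percentages above are shares within each cohort, not shares of the whole survey. A minimal sketch of that calculation on hypothetical toy data (the helper name `share_within_group` and the rows in `toy` are assumptions made for illustration):

```python
import pandas as pd

# Hypothetical toy data: one row per respondent.
toy = pd.DataFrame({
    "JobDomain": ["Data Science"] * 4 + ["Data/SW Engineering"] * 4,
    "CountryGroup": ["USA", "USA", "India", "Japan",
                     "India", "India", "India", "USA"],
})


def share_within_group(frame, group_col, value_col):
    """Percentage of each value_col category inside every group_col group."""
    counts = frame.groupby([group_col, value_col]).size()
    # Total respondents per group, aligned back to every (group, value) row.
    totals = counts.groupby(level=group_col).transform("sum")
    return counts.div(totals).mul(100).rename("Percentage").reset_index()


shares = share_within_group(toy, "JobDomain", "CountryGroup")
print(shares)
```

Because each cohort is normalized separately, the percentages add up to 100 within a cohort, which is what makes cohorts of very different sizes comparable.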

It's no surprise that India is among the top 5 countries when it comes to scientific research and technological investment. The country has improved its reputation in terms of the risk posed to foreign investments and, in 2019, ranked third in the world in terms of [attracting investment for technology transactions.](https://www.ibef.org/industry/science-and-technology.aspx)



The higher Data/SW Engineering percentage in India may be because many American & European companies have their offshore development and technical support offices in India. The country’s outsourcing industry was recently [valued at 150 billion.](https://economictimes.indiatimes.com/tech/ites/indias-technology-vendors-paddling-shaky-boats/articleshow/56543653.cms) 

#### Job Categories by Education, Coding Experience & Difference in Compensation

In [None]:
edu = (df.groupby(['JobDomain'])['Q4']
                     .value_counts(normalize=True)
                     .rename('Percentage')
                     .mul(100)
                     .reset_index()
                     .sort_values('Q4'))

coding = (df.groupby(['JobDomain'])['Q15']
                     .value_counts(normalize=True)
                     .rename('Percentage')
                     .mul(100)
                     .reset_index()
                     .sort_values('Q15'))



fig, axs = plt.subplots(ncols=2,figsize=(20, 6),sharey=True)
plt.subplots_adjust(wspace=0.4)
p = sns.barplot(x="Q4", y="Percentage", hue="JobDomain", data=edu, ax=axs[0],palette=colors)
q = sns.barplot(x="Q15", y="Percentage", hue="JobDomain", data=coding, ax=axs[1],palette=colors)

p.set_title('Comparison by Education \n')
q.set_title('Years of coding experience for data analysis\n')
_ = plt.setp(p.get_xticklabels(), rotation=90)
_ = plt.setp(q.get_xticklabels(), rotation=90)

In [None]:
fig, axs = plt.subplots(figsize=(10, 6),sharey=True)
pay = (df.groupby(['JobDomain'])['Q10']
                     .value_counts(normalize=True)
                     .rename('Percentage')
                     .mul(100)
                     .reset_index()
                     .sort_values('Q10'))

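# Plot every compensation band: value_counts already excludes missing answers,
# so slicing off the last row would drop a real '> 100,000' bar.
r = sns.barplot(x="Q10", y="Percentage", hue="JobDomain", data=pay, palette=colors)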
r.set_title('Annual Compensation\n')
r.legend(loc='upper center', bbox_to_anchor=(0.5, -0.4),ncol=2)
_ = plt.setp(r.get_xticklabels(), rotation=90)


##### Business & Management Cohort
<ul>
<li>The majority hold a master's degree, most probably an MBA or a project management degree</li>
<li>The majority have fewer years of experience writing code to analyze data </li>
<li>This group has a higher percentage in the 50-99k and 100k+ bands. This is because of the product/project managers in this group: around 45% of them have yearly compensation > 50k </li>
<li>According to payscale.com, the median salaries of product/project managers and data scientists are almost the same</li>
</ul>

##### Data Science Cohort
<ul>
<li>Highest share of doctoral degree holders</li>
<li>More years of coding experience</li>
<li>This group is spread fairly evenly across the annual compensation bands compared to the others, probably because of differences in years of experience: the more experience you have, the more you get paid </li>
</ul>

##### Data/SW Engineering Cohort
<ul>
<li>Similar to the business group, most people hold a master's degree, followed by a bachelor's</li>
<li>As Q15 asks about coding experience for analyzing data, this group is second to DS in the > 3 years range </li>
<li>The majority, around 60%, earn under 50k a year. This came as a surprise because payscale.com and glassdoor.com show their salary range as 70k+</li>
</ul>

Looking at the results from this survey, and based on experience gained in industry, it is important that people in the data science group understand the engineering and project management areas, and vice versa. At the end of the day, we all have to work together as a team.

<i>"If everyone is moving forward together, then success takes care of itself." --Henry Ford</i>

***
# 2. Enterprise Classification and Data Science Community
<a id="enterprise"></a>
***

In this section we will work only with our data science community cohort and analyze how being part of a different type of organization can affect the nature of their work

1. Classify enterprises based on their employee count
2. Examine whether geography plays any role in making them different

### 2.1. How to classify and why?
<a id="classify"></a>

Depending on whom you ask, there are several definitions and key differentiators that determine how a business is classified. The most widely accepted definitions of business size are based on the number of employees and annual revenue. Since the number of employees is the only identifier available in our survey data, we will use it to create our enterprise classifier. 

> Q6: What is the size of the company where you are employed?

There are three main types of businesses that you can work for in the private sector, each with its own pros and cons:

1. SMB (Small and Medium-Sized Businesses)
**0-49 employees**
2. SME (Small and Medium Enterprises)
**50-999 employees**
3. Large Enterprise
**1000 or more employees**
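
The classification above can be expressed as a simple lookup from the Q6 answer to an enterprise class. A minimal sketch, assuming the Q6 answer strings used elsewhere in this notebook (the helper name `classify_enterprise` and the toy answers are hypothetical):

```python
import pandas as pd

# Q6 size buckets mapped to the three classes used in this notebook.
VERTICAL_BY_SIZE = {
    "0-49 employees": "SMB",
    "50-249 employees": "SME",
    "250-999 employees": "SME",
    "1000-9,999 employees": "Large Enterprise",
    "> 10,000 employees": "Large Enterprise",
}


def classify_enterprise(size_answers):
    """Map Q6 answers to a class; missing or unknown answers become 'NA'."""
    return size_answers.map(VERTICAL_BY_SIZE).fillna("NA")


# Hypothetical toy answers, including one missing response.
q6 = pd.Series(["0-49 employees", "250-999 employees", "> 10,000 employees", None])
print(classify_enterprise(q6).tolist())
```

Respondents who skipped Q6 end up in the 'NA' class, which is why the later sections filter that class out before comparing enterprises.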

In [None]:
#will only work with data science community created in previous sections
df_ds = data_19.query(" JobDomain == 'Data Science' ")

ax = sns.countplot(data=df_ds, x="Vertical",palette=colors_entr)

ax.set_title('Number of people from DS Cohort working in different businesses\n')
ax.set_ylabel('')
ax.set_xlabel('')


plt.show()

From the data science community data available in the survey (categorized in section 1) 
<ul>
<li>39.6% working in Large Enterprises</li>
<li>29.7% working in SME</li>
<li>29.2% working in SMB</li>
</ul>

### 2.2. Geographical Distribution of Enterprises
<a id="geo"></a>

Here we explore where these enterprises are located by analyzing the Data Science cohort only.

In [None]:
df_ds_Excl = df_ds.query(" Vertical != 'NA' ")

fig, axs = plt.subplots(figsize=(10, 6),sharey=True)
country1 = (df_ds_Excl.groupby(['Vertical'])['CountryGroup']
                     .value_counts(normalize=True)
                     .rename('Percentage')
                     .mul(100)
                     .reset_index()
                     .sort_values('CountryGroup'))

r = sns.barplot(x="CountryGroup", y="Percentage", hue="Vertical", data=country1[country1['CountryGroup']!= 'NA'], palette=colors_entr)
r.set_title('Job Cohorts Compared by Top 5 Countries\n')
r.legend(loc='upper center', bbox_to_anchor=(0.5, -0.4),ncol=4)
_ = plt.setp(r.get_xticklabels(), rotation=90)

<ul>
<li>Among the Data Science community working in large organizations, the largest share is in the USA at 24%, with India second at 16.6%</li>
<li>SMEs are equally distributed between India and the USA at around 15%</li>
<li>Contrary to large enterprises, the Data Science community in SMBs is larger in India, at 19.6%, than in any other country</li>
</ul>


***Let's step out of the survey and look at overall footprint of digital/tech companies on the globe***

The USA is the obvious leader when it comes to technology innovation, based on the [figure](https://www.brookings.edu/research/trends-in-the-information-technology-sector/) below. Our survey data is skewed towards respondents from India, which is why we don't see many enterprises from Europe and China.

Looking at the top 100 digital [companies](https://www.forbes.com/top-digital-companies/list/2/#tab:rank):

<ul>
<li>Out of 100 Digital companies, 38 are from USA</li>
<li>10 are from China and only 2 are from India</li>
<li>17 are from Europe</li>
</ul>



Figure: Global distribution of top 100 digital companies and market capitalization (US $billion)
<img src="kaggle-survey-2019/pics/digital companies.jpg" height="600" width="600">

***
# 3. Data Science Community at work
<a id="dsc"></a>
***

In this section we analyze 

1. How does education level differ between DS community peers employed in different types of organizations?
2. What difference in compensation can they experience from small to large enterprises?
3. What kind of exposure does our data science community get to different ML & DL frameworks, data storage & computing etc., based on the type of organization they work in?



### 3.1. Educational & Compensation difference
<a id="experience"></a>

In this section let's look at some basic differences by enterprise type, such as 

1. Education level of the data science employees in the company 
2. Difference in income level 

#### Enterprises by Education & Difference in Compensation

In [None]:
educ = (df_ds_Excl.groupby(['Vertical'])['Q4']
                     .value_counts(normalize=True)
                     .rename('Percentage')
                     .mul(100)
                     .reset_index()
                     .sort_values('Q4'))

sal = (df_ds_Excl.groupby(['Vertical'])['Q10']
                     .value_counts(normalize=True)
                     .rename('Percentage')
                     .mul(100)
                     .reset_index()
                     .sort_values('Q10'))



fig, axs = plt.subplots(ncols=2,figsize=(18, 6),sharey=True)
plt.subplots_adjust(wspace=0.2)
p = sns.barplot(x="Q4", y="Percentage", hue="Vertical", data=educ, ax=axs[0],palette=colors_entr)
q = sns.barplot(x="Q10", y="Percentage", hue="Vertical", data=sal, ax=axs[1],palette=colors_entr)

p.set_title('Education difference in DS community by Org type\n')
q.set_title('Salary difference in DS community by Org type\n')
_ = plt.setp(p.get_xticklabels(), rotation=90)
_ = plt.setp(q.get_xticklabels(), rotation=90)

Differences are evident when compared by Education & Compensation levels of Data Science Community among these enterprises

#### SMB (Small and Medium-Sized Businesses)
<ul>
<li>Holds a higher percentage (68%) of master's and bachelor's degree holders</li>
<li>The majority (63%) of the data science community working in these companies is paid under 50k a year. It seems that these companies employ fresh graduates who have just finished their bachelor's/master's degree and have less coding experience</li>
</ul>

#### SME (Small and Medium Enterprises)
<ul>
<li>SMEs have the highest percentage of doctoral degree holders compared to the other enterprise types. </li>
<li>Most people (56%) are paid under 50k a year, but because these companies employ PhDs and people with more years of experience compared to SMBs, some are paid more than 50k or even 100k+ a year </li>
</ul>

#### Large Enterprise
<ul>
<li>50% of the data science community working in these companies have a master’s degree</li>
<li>Because these organizations employ people with a sound background and experience, the annual compensation of this group is higher compared to the other 2 groups</li>
</ul>

### 3.2. Exposure to Data Analytics Tools & ML
<a id="analytics"></a>

In this section let's look at some very high-level traits

1. Tools used for data analysis 
2. Usage of ML 

#### Enterprises compared by Data Analytics Tools & ML Usage Tenure

In [None]:
df_ds_Excl = df_ds.query(" Vertical != 'NA' ")


ML = (df_ds_Excl.groupby(['Vertical'])['Q8']
                     .value_counts(normalize=True)
                     .rename('Percentage')
                     .mul(100)
                     .reset_index()
                     .sort_values('Q8'))

tool = (df_ds_Excl.groupby(['Vertical'])['Q14']
                     .value_counts(normalize=True)
                     .rename('Percentage')
                     .mul(100)
                     .reset_index()
                     .sort_values('Q14'))



fig, axs = plt.subplots(ncols=2,figsize=(18, 6),sharey=True)
plt.subplots_adjust(wspace=0.2)
p = sns.barplot(x="Q8", y="Percentage", hue="Vertical", data=ML, ax=axs[0],palette=colors_entr)
q = sns.barplot(x="Q14", y="Percentage", hue="Vertical", data=tool, ax=axs[1],palette=colors_entr)

p.set_title('ML methods adoption by Org type\n')
q.set_title('Tools used to analyze data by Org type\n')
_ = plt.setp(p.get_xticklabels(), rotation=90)
_ = plt.setp(q.get_xticklabels(), rotation=90)

#### SMB (Small and Medium-Sized Businesses)
<ul>
<li>High on local environments for data analysis and lower on advanced/enterprise-ready tools. This may be because they are still exploring the possibilities of ML or have limited use for it, so they use open source/less expensive tools</li>
<li>36% of this group have ML in production, the lowest of all 3 categories of companies</li>
</ul>

#### SME (Small and Medium Enterprises)
<ul>
<li>This group of companies over-indexes on cloud-based data software (8.4%), about 1% higher than the other 2 groups</li>
<li>40% of the companies in this group are using ML in production</li>
</ul>

#### Large Enterprise
<ul>
<li>Companies in this group over-index on Advanced Statistics Software (by 1.8%) and BI Software (by 1%)</li>
<li>55% of companies in this group are using ML in production</li>
</ul>

Looking at the characteristics mentioned above, everyone would want to join Large Enterprises, as they pay more and actively use ML. 

But at large, established enterprises, most roles are highly specialized, and when a new problem occurs, it will be addressed by the person or team with that specific set of skills. If you try to solve a problem that isn’t “owned” by your department, you’re likely to step on some toes.
At a startup, especially a very small one, nearly every problem is an opportunity for you to step in and add value. Your coworkers are more likely to appreciate an action-oriented, problem-solving approach.


### 3.3. Data Storage & Cloud Computing
<a id="storage"></a>

Here we will examine what to expect from these enterprises when it comes to Data Storage and Cloud Computing platforms/products

#### When it comes to Enterprise level Data Storage platform

The most-used DB engine among our DS community in this survey is MySQL; for a wider industry comparison, see the [DB Engine Ranking](https://db-engines.com/en/ranking)


In [None]:
mysql_ds = 100 * df_ds_Excl.groupby(['Q34_Part_1']).size()/len(df_ds_Excl)
sql_ds = 100 * df_ds_Excl.groupby(['Q34_Part_4']).size()/len(df_ds_Excl)
orc_ds = 100 * df_ds_Excl.groupby(['Q34_Part_5']).size()/len(df_ds_Excl)
post_ds = 100 * df_ds_Excl.groupby(['Q34_Part_2']).size()/len(df_ds_Excl)
lite_ds = 100 * df_ds_Excl.groupby(['Q34_Part_3']).size()/len(df_ds_Excl)
db_perc_all = pd.concat([mysql_ds, sql_ds,orc_ds,post_ds,lite_ds], axis=0)

db_perc_all_srt = db_perc_all.sort_values(ascending=False)

# Pass x and y as keywords; recent seaborn releases reject positional data arguments
q = sns.barplot(x=db_perc_all_srt.index, y=db_perc_all_srt.values)
q.set_title('Top DB Engines usage by DS Community\n')
q.set(ylabel='Percentage')
_ = plt.setp(q.get_xticklabels(), rotation=90)

In [None]:
ora_ds = 100 * df_ds_Excl.groupby(['Q34_Part_5']).size()/len(df_ds_Excl)
sql_ds = 100 * df_ds_Excl.groupby(['Q34_Part_4']).size()/len(df_ds_Excl)
db_perc = pd.concat([ora_ds, sql_ds], axis=0)

df_ds_large = df_ds_Excl.query(" Vertical == 'Large Enterprise' ")

ora_ds_large = 100 * df_ds_large.groupby(['Q34_Part_5']).size()/len(df_ds_large)
sql_ds_large = 100 * df_ds_large.groupby(['Q34_Part_4']).size()/len(df_ds_large)
db_perc_large = pd.concat([ora_ds_large, sql_ds_large], axis=0)


fig, axs = plt.subplots(ncols=2,figsize=(18, 6),sharey=True)
plt.subplots_adjust(wspace=0.2)
# Pass x and y as keywords; recent seaborn releases reject positional data arguments
p = sns.barplot(x=db_perc.index, y=db_perc.values, ax=axs[0])
q = sns.barplot(x=db_perc_large.index, y=db_perc_large.values, ax=axs[1])

p.set_title('SQL Server/Oracle DB usage by DS Community\n')
q.set_title('SQL Server/Oracle DB usage by DS Community in Large Organisations\n')

p.set(ylabel='Percentage')

_ = plt.setp(p.get_xticklabels(), rotation=0)
_ = plt.setp(q.get_xticklabels(), rotation=0)

Large enterprises still rely on traditional database engines (such as Oracle, DB2 and even SQL Server) to run their mission-critical systems. They over-index when it comes to enterprise-level DB storage platforms
<ul>
<li>For Oracle: they over-index by 4%</li>
<li>For SQL Server: they over-index by 3.3%</li>
</ul>
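
"Over-index" here is read as the usage rate inside one enterprise class minus the usage rate across the whole DS community, in percentage points. A minimal sketch under that assumption (the toy frame, the `uses_oracle` flag and the helper `over_index` are hypothetical and not part of the survey data):

```python
import pandas as pd

# Hypothetical toy data: whether each respondent uses a given DB engine.
toy = pd.DataFrame({
    "Vertical": ["Large Enterprise"] * 4 + ["SME"] * 4 + ["SMB"] * 2,
    "uses_oracle": [True, True, False, False,
                    True, False, False, False,
                    False, False],
})


def over_index(frame, segment_col, flag_col):
    """Segment usage rate minus overall usage rate, in percentage points."""
    overall = frame[flag_col].mean() * 100
    by_segment = frame.groupby(segment_col)[flag_col].mean() * 100
    return (by_segment - overall).rename("over_index_pp")


print(over_index(toy, "Vertical", "uses_oracle"))
```

A positive value means the segment uses the engine more than the community as a whole, and a negative value means it uses it less.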

The likely reasons:

<ul>
<li>The up-front cost doesn’t bother these large companies; even a million dollars isn’t very much compared to the costs of running a $10bn company.    </li>
<li>Large organizations have to deal with large volumes of structured data, so relying on a platform that is enterprise ready and has been used in industry for a long time is a reasonable choice.</li>
<li>Support is a big deal: large companies want to pay for support contracts, they want 24/7 support, SLAs, and reassurance that when things go wrong, support is available to sort things out.</li>
</ul>

For SMEs (mid-sized organizations), the distribution is similar to that of the total DS community. However, there is one DB engine where SMEs over-index: PostgreSQL.

In [None]:
post_ds = 100 * df_ds_Excl.groupby(['Q34_Part_2']).size()/len(df_ds_Excl)


df_ds_med = df_ds_Excl.query(" Vertical == 'SME' ")
post_ds_med = 100 * df_ds_med.groupby(['Q34_Part_2']).size()/len(df_ds_med)

fig, axs = plt.subplots(ncols=2,figsize=(18, 6),sharey=True)
plt.subplots_adjust(wspace=0.2)

# Pass x and y as keywords; recent seaborn releases reject positional data arguments
p = sns.barplot(x=post_ds.index, y=post_ds.values, ax=axs[0])
q = sns.barplot(x=post_ds_med.index, y=post_ds_med.values, ax=axs[1])



p.set_title('Enterprise DB usage by DS Community\n')
q.set_title('Enterprise DB usage by DS Community in SME Organisations\n')

p.set(ylabel='Percentage')
p.set(xlabel='')
q.set(xlabel='')

_ = plt.setp(p.get_xticklabels(), rotation=0)
_ = plt.setp(q.get_xticklabels(), rotation=0)

Medium-sized enterprises over-index on PostgreSQL by 1.2%

The likely reasons:

<ul>
<li>According to DB-Engines research (shared above), PostgreSQL has significantly increased its score over the last year with an almost 50 point increase. </li>
<li>It’s a favorite because it’s open-source, free to use, community-driven without being owned by a single company, standards-compliant, filled with useful features, and very extensible.</li>
<li>Many companies have been built around Postgres itself like CitusDB, Timescale, PipelineDB and others. Even AWS Redshift is built on Postgres code.</li>
<li>Recently it has gotten significantly better with features like full-text search, JSON columns, logical replication, upsert, and better scalability.</li>
</ul>

Add all that together and you have a powerful data platform that’s hard to beat, especially for startups and smaller organizations that need a reliable choice without tons of effort or cost.

In [None]:
mysql_ds = 100 * df_ds_Excl.groupby(['Q34_Part_1']).size()/len(df_ds_Excl)
lite_ds = 100 * df_ds_Excl.groupby(['Q34_Part_3']).size()/len(df_ds_Excl)
db_perc = pd.concat([mysql_ds, lite_ds], axis=0)

df_ds_small = df_ds_Excl.query(" Vertical == 'SMB' ")

mysql_small = 100 * df_ds_small.groupby(['Q34_Part_1']).size()/len(df_ds_small)
lite_ds_small = 100 * df_ds_small.groupby(['Q34_Part_3']).size()/len(df_ds_small)
db_perc_small = pd.concat([mysql_small, lite_ds_small], axis=0)


fig, axs = plt.subplots(ncols=2,figsize=(18, 6),sharey=True)
#plt.subplots_adjust(wspace=0.2)
# Pass x and y as keywords; recent seaborn releases reject positional data arguments
p = sns.barplot(x=db_perc.index, y=db_perc.values, ax=axs[0])
q = sns.barplot(x=db_perc_small.index, y=db_perc_small.values, ax=axs[1])

p.set_title('Enterprise DB usage by DS Community\n')
q.set_title('Enterprise DB usage by DS Community in SMB (Small Size) Organisations\n')

p.set(ylabel='Percentage')

_ = plt.setp(p.get_xticklabels(), rotation=0)
_ = plt.setp(q.get_xticklabels(), rotation=0)

SQLite is mostly used in [small organisations](https://enlyft.com/tech/products/sqlite)

MySQL is deployed in small to medium-sized enterprises and at the departmental level in large enterprises

<ul>
<li>For MySQL: SMBs over-index by 1.4%</li>
<li>For SQLite: SMBs over-index by 1.8%</li>
</ul>

The likely reasons:

<ul>
<li>Both are open source DB engines</li>
<li>MySQL is used in small- and medium-scale web applications for database management</li>
<li>MySQL is the world’s most used client-server RDBMS (SQLite has more installations, because it is bundled and distributed with smaller client-application databases that reside on personal computing devices) and the de-facto standard for Linux-based systems</li>
</ul>
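
The over-index figures quoted above can be read as the gap between a segment's usage share and the overall share. The sketch below assumes that "over-index by X%" means a difference in percentage points (the notebook does not state this explicitly); the helper name `over_index` and the numbers are made up for illustration and are not survey results.

```python
import pandas as pd


def over_index(segment_pct: pd.Series, overall_pct: pd.Series) -> pd.Series:
    """Percentage-point gap between a segment's usage share and the overall share.

    Positive values mean the segment over-indexes, negative values mean it under-indexes.
    Assumes both series are indexed by the same option labels.
    """
    return (segment_pct - overall_pct).round(1)


# Made-up shares, for illustration only (not survey results).
overall = pd.Series({"MySQL": 30.0, "SQLite": 10.0})
small = pd.Series({"MySQL": 31.5, "SQLite": 12.0})

gap = over_index(small, overall)
print(gap)
```

With the series computed in the cell above, the same idea would be `db_perc_small - db_perc`, provided both series carry the same index labels.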


#### When it comes to Enterprise level Cloud Computing Platforms

The most used cloud computing platform among the DS community in this survey is AWS.

In [None]:
gcp_ds = 100 * df_ds_Excl.groupby(['Q29_Part_1']).size()/len(df_ds_Excl)
aws_ds = 100 * df_ds_Excl.groupby(['Q29_Part_2']).size()/len(df_ds_Excl)
azure_ds = 100 * df_ds_Excl.groupby(['Q29_Part_3']).size()/len(df_ds_Excl)
db_perc_all = pd.concat([gcp_ds, aws_ds,azure_ds], axis=0)

db_perc_all_srt = db_perc_all.sort_values(ascending=False)

q = sns.barplot(x=db_perc_all_srt.index, y=db_perc_all_srt.values)
q.set_title('Top 3 Cloud Computing Platforms usage by DS Community\n')
q.set(ylabel='Percentage')
_ = plt.setp(q.get_xticklabels(), rotation=90)

In [None]:
gcp_ds_lg = 100 * df_ds_large.groupby(['Q29_Part_1']).size()/len(df_ds_large)
aws_ds_lg = 100 * df_ds_large.groupby(['Q29_Part_2']).size()/len(df_ds_large)
azure_ds_lg = 100 * df_ds_large.groupby(['Q29_Part_3']).size()/len(df_ds_large)
db_perc_lg = pd.concat([gcp_ds_lg, aws_ds_lg,azure_ds_lg], axis=0)

db_perc_lg = db_perc_lg.sort_values(ascending=False)

gcp_ds_med = 100 * df_ds_med.groupby(['Q29_Part_1']).size()/len(df_ds_med)
aws_ds_med = 100 * df_ds_med.groupby(['Q29_Part_2']).size()/len(df_ds_med)
azure_ds_med = 100 * df_ds_med.groupby(['Q29_Part_3']).size()/len(df_ds_med)
db_perc_med = pd.concat([gcp_ds_med, aws_ds_med,azure_ds_med], axis=0)

db_perc_med = db_perc_med.sort_values(ascending=False)

gcp_ds_sm = 100 * df_ds_small.groupby(['Q29_Part_1']).size()/len(df_ds_small)
aws_ds_sm = 100 * df_ds_small.groupby(['Q29_Part_2']).size()/len(df_ds_small)
azure_ds_sm = 100 * df_ds_small.groupby(['Q29_Part_3']).size()/len(df_ds_small)
db_perc_sm = pd.concat([gcp_ds_sm, aws_ds_sm,azure_ds_sm], axis=0)

db_perc_sm = db_perc_sm.sort_values(ascending=False)

fig, axs = plt.subplots(ncols=3,figsize=(18, 6),sharey=True)
#plt.subplots_adjust(wspace=0.2)
l = sns.barplot(x=db_perc_lg.index, y=db_perc_lg.values, ax=axs[0])
m = sns.barplot(x=db_perc_med.index, y=db_perc_med.values, ax=axs[1])
s = sns.barplot(x=db_perc_sm.index, y=db_perc_sm.values, ax=axs[2])


l.set_title('CCP usage by DS Community in Large Enterprise\n')
m.set_title('CCP usage by DS Community in SME\n')
s.set_title('CCP usage by DS Community in SMB\n')

l.set(ylabel='Percentage')

_ = plt.setp(l.get_xticklabels(), rotation=90)
_ = plt.setp(m.get_xticklabels(), rotation=90)
_ = plt.setp(s.get_xticklabels(), rotation=90)

#### SMB (Small and Medium-Sized Businesses)
<ul>
<li>SMBs use GCP the most among our groups, about 2% higher than average usage</li>
<li>SMBs under-index on MS Azure usage, about 2% lower than the average distribution</li>
</ul>

#### SME (Small and Medium Enterprises)
<ul>
<li>Over-index on AWS usage by 1% compared to Large and SMB organizations</li>
</ul>

#### Large Enterprise
<ul>
<li>Over-index on MS Azure usage, at 11.5% compared to 9.8%</li>
<li>Under-index on Google Cloud Platform usage by 1.2%</li>
</ul>
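
The three per-segment comparisons above can also be produced in a single table. The sketch below is one possible way to do it, not the method used in this notebook: it assumes each `Q29_Part_*` column holds the option label when selected and NaN otherwise, and it uses a toy frame in place of `df_ds_Excl`. The value `'Large'` for `Vertical` and the short option labels are hypothetical; only `'SMB'` appears in the code above.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df_ds_Excl: a selected option holds its label, an unselected one is NaN.
toy = pd.DataFrame({
    "Vertical": ["SMB", "SMB", "Large", "Large"],
    "Q29_Part_1": ["GCP", np.nan, np.nan, np.nan],
    "Q29_Part_2": ["AWS", "AWS", "AWS", np.nan],
    "Q29_Part_3": [np.nan, np.nan, "Azure", "Azure"],
})

cloud_cols = ["Q29_Part_1", "Q29_Part_2", "Q29_Part_3"]

# notna() marks selections; the mean of a boolean column is the share of rows selecting it.
share_by_segment = toy[cloud_cols].notna().groupby(toy["Vertical"]).mean() * 100
print(share_by_segment)
```

Each row of `share_by_segment` is one organisation type and each column one platform, so over- and under-indexing can be read off by comparing rows.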

#### When it comes to Enterprise level Cloud Computing Products
The most used cloud computing product among the DS community in this survey is AWS EC2.

In [None]:
ec2_ds = 100 * df_ds_Excl.groupby(['Q30_Part_1']).size()/len(df_ds_Excl)
gce_ds = 100 * df_ds_Excl.groupby(['Q30_Part_2']).size()/len(df_ds_Excl)
#lamb_ds = 100 * df_ds_Excl.groupby(['Q30_Part_3']).size()/len(df_ds_Excl)
azure_vm_ds = 100 * df_ds_Excl.groupby(['Q30_Part_4']).size()/len(df_ds_Excl)
#g_ae_ds = 100 * df_ds_Excl.groupby(['Q30_Part_5']).size()/len(df_ds_Excl)
#g_cf_ds = 100 * df_ds_Excl.groupby(['Q30_Part_6']).size()/len(df_ds_Excl)
#aws_eb_ds = 100 * df_ds_Excl.groupby(['Q30_Part_7']).size()/len(df_ds_Excl)
#gk_ds = 100 * df_ds_Excl.groupby(['Q30_Part_8']).size()/len(df_ds_Excl)
#aws_b_ds = 100 * df_ds_Excl.groupby(['Q30_Part_9']).size()/len(df_ds_Excl)
#azure_c_ds = 100 * df_ds_Excl.groupby(['Q30_Part_10']).size()/len(df_ds_Excl)

cc_prod_all = pd.concat([ec2_ds, gce_ds,azure_vm_ds], axis=0)

cc_prod_all_srt = cc_prod_all.sort_values(ascending=False)

q = sns.barplot(x=cc_prod_all_srt.index, y=cc_prod_all_srt.values)
q.set_title('Top 3 Cloud Computing Products usage by DS Community\n')
q.set(ylabel='Percentage')
_ = plt.setp(q.get_xticklabels(), rotation=90)

In [None]:
ec2_ds_lg = 100 * df_ds_large.groupby(['Q30_Part_1']).size()/len(df_ds_large)
gce_ds_lg = 100 * df_ds_large.groupby(['Q30_Part_2']).size()/len(df_ds_large)
azure_vm_ds_lg = 100 * df_ds_large.groupby(['Q30_Part_4']).size()/len(df_ds_large)
db_perc_lg = pd.concat([ec2_ds_lg, gce_ds_lg,azure_vm_ds_lg], axis=0)

db_perc_lg = db_perc_lg.sort_values(ascending=False)

ec2_ds_med = 100 * df_ds_med.groupby(['Q30_Part_1']).size()/len(df_ds_med)
gce_ds_med = 100 * df_ds_med.groupby(['Q30_Part_2']).size()/len(df_ds_med)
azure_vm_ds_med = 100 * df_ds_med.groupby(['Q30_Part_4']).size()/len(df_ds_med)
db_perc_med = pd.concat([ec2_ds_med, gce_ds_med,azure_vm_ds_med], axis=0)

db_perc_med = db_perc_med.sort_values(ascending=False)

ec2_ds_sm = 100 * df_ds_small.groupby(['Q30_Part_1']).size()/len(df_ds_small)
gce_ds_sm = 100 * df_ds_small.groupby(['Q30_Part_2']).size()/len(df_ds_small)
azure_vm_ds_sm = 100 * df_ds_small.groupby(['Q30_Part_4']).size()/len(df_ds_small)
db_perc_sm = pd.concat([ec2_ds_sm, gce_ds_sm,azure_vm_ds_sm], axis=0)

db_perc_sm = db_perc_sm.sort_values(ascending=False)

fig, axs = plt.subplots(ncols=3,figsize=(18, 6),sharey=True)
#plt.subplots_adjust(wspace=0.2)
l = sns.barplot(x=db_perc_lg.index, y=db_perc_lg.values, ax=axs[0])
m = sns.barplot(x=db_perc_med.index, y=db_perc_med.values, ax=axs[1])
s = sns.barplot(x=db_perc_sm.index, y=db_perc_sm.values, ax=axs[2])


l.set_title('Cloud products used by DS Community in Large Enterprise\n')
m.set_title('Cloud products used by DS Community in SME\n')
s.set_title('Cloud products used by DS Community in SMB\n')

l.set(ylabel='Percentage')

_ = plt.setp(l.get_xticklabels(), rotation=90)
_ = plt.setp(m.get_xticklabels(), rotation=90)
_ = plt.setp(s.get_xticklabels(), rotation=90)

#### SMB (Small and Medium-Sized Businesses)
<ul>
<li>Again, Google products see more use in this group</li>
<li>SMBs under-index on EC2 usage by 1.5%</li>
</ul>

#### SME (Small and Medium Enterprises)
<ul>
<li>Over-index on GCE usage</li>
<li>Slightly under-index on EC2 usage</li>
</ul>

#### Large Enterprise
<ul>
<li>Over-index on EC2 compared to the average distribution across the DS community</li>
<li>Slightly over-index on Azure VM usage</li>
</ul>


From the numbers on the cloud computing questions, we can conclude:

<ul>
<li>All organizations lean towards AWS, probably because it is the oldest of the three platforms and the most mature.</li>
<li>Large organizations prefer Azure more than the other groups (SMB and SME), possibly because it is expensive and large organizations are more likely to afford it. Its integration with other Microsoft products, such as MS SQL Server and Power BI, can also be a plus.</li>
<li>Google platform and products are popular in small businesses, likely because they are free for the first 12 months.</li>
</ul>

### 3.4. Deep dive into ML & DL Framework
<a id="mldl"></a>

Let's take a deep dive into machine learning and deep learning framework usage across these organizations, focusing on the Data Science community.

#### Let's start with ML Algorithms

In this section we analyze how the Data Science community's exposure to ML algorithms differs across organization types.

In [None]:
q1_ds = 100 * df_ds_Excl.groupby(['Q24_Part_1']).size()/len(df_ds_Excl)
q2_ds = 100 * df_ds_Excl.groupby(['Q24_Part_2']).size()/len(df_ds_Excl)
q3_ds = 100 * df_ds_Excl.groupby(['Q24_Part_3']).size()/len(df_ds_Excl)
q4_ds = 100 * df_ds_Excl.groupby(['Q24_Part_4']).size()/len(df_ds_Excl)
q5_ds = 100 * df_ds_Excl.groupby(['Q24_Part_5']).size()/len(df_ds_Excl)
q6_ds = 100 * df_ds_Excl.groupby(['Q24_Part_6']).size()/len(df_ds_Excl)
q7_ds = 100 * df_ds_Excl.groupby(['Q24_Part_7']).size()/len(df_ds_Excl)
q8_ds = 100 * df_ds_Excl.groupby(['Q24_Part_8']).size()/len(df_ds_Excl)
q9_ds = 100 * df_ds_Excl.groupby(['Q24_Part_9']).size()/len(df_ds_Excl)
q10_ds = 100 * df_ds_Excl.groupby(['Q24_Part_10']).size()/len(df_ds_Excl)

algo_prod_all = pd.concat([q1_ds, q2_ds,q3_ds,q4_ds,q5_ds,q6_ds,q7_ds,q8_ds,q9_ds,q10_ds], axis=0)

algo_prod_all_srt = algo_prod_all.sort_values(ascending=False)

q = sns.barplot(x=algo_prod_all_srt.index, y=algo_prod_all_srt.values)
q.set_title('Usage of ML Algos by DS Community\n')
q.set(ylabel='Percentage')
_ = plt.setp(q.get_xticklabels(), rotation=90)

If we plot the above distribution separately for each enterprise category, we see more or less the same shape. So we need to compare the categories algorithm by algorithm; that way we can tell whether a given algorithm is used more in a particular organization type.
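
The per-part percentages used throughout this section follow one pattern: count the non-missing answers in each `Q<n>_Part_<i>` column and divide by the number of respondents. A minimal sketch of that pattern as a reusable helper is shown below; `multi_select_share` is a hypothetical name, and the toy frame and its option labels are made up for illustration.

```python
import numpy as np
import pandas as pd


def multi_select_share(df: pd.DataFrame, question: str, n_parts: int) -> pd.Series:
    """Percentage of rows in df that selected each option of a multi-select question."""
    cols = [f"{question}_Part_{i}" for i in range(1, n_parts + 1)]
    # Each part column holds a single label (or NaN), so value_counts yields one entry per column.
    counts = pd.concat([df[c].value_counts() for c in cols])
    return 100 * counts / len(df)


# Toy stand-in for a survey frame (labels are illustrative, not the survey's wording).
toy = pd.DataFrame({
    "Q24_Part_1": ["Regression", "Regression", np.nan, "Regression"],
    "Q24_Part_2": ["Trees", np.nan, np.nan, "Trees"],
})

shares = multi_select_share(toy, "Q24", 2)
print(shares)
```

Called once per data frame (for example on `df_ds_large`, `df_ds_med` and `df_ds_small`), a helper like this would give the same kind of series as the repeated `groupby(...).size()` lines.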

In [None]:
q1_ds = 100 * df_ds_large.groupby(['Q24_Part_1']).size()/len(df_ds_large)
q2_ds = 100 * df_ds_large.groupby(['Q24_Part_2']).size()/len(df_ds_large)
q3_ds = 100 * df_ds_large.groupby(['Q24_Part_3']).size()/len(df_ds_large)
q4_ds = 100 * df_ds_large.groupby(['Q24_Part_4']).size()/len(df_ds_large)
q5_ds = 100 * df_ds_large.groupby(['Q24_Part_5']).size()/len(df_ds_large)
q6_ds = 100 * df_ds_large.groupby(['Q24_Part_6']).size()/len(df_ds_large)
q7_ds = 100 * df_ds_large.groupby(['Q24_Part_7']).size()/len(df_ds_large)
q8_ds = 100 * df_ds_large.groupby(['Q24_Part_8']).size()/len(df_ds_large)
q9_ds = 100 * df_ds_large.groupby(['Q24_Part_9']).size()/len(df_ds_large)
q10_ds = 100 * df_ds_large.groupby(['Q24_Part_10']).size()/len(df_ds_large)

algo_prod_lg = pd.concat([q1_ds, q2_ds,q3_ds,q4_ds,q5_ds,q6_ds,q7_ds,q8_ds,q9_ds,q10_ds], axis=0)


q1_ds = 100 * df_ds_med.groupby(['Q24_Part_1']).size()/len(df_ds_med)
q2_ds = 100 * df_ds_med.groupby(['Q24_Part_2']).size()/len(df_ds_med)
q3_ds = 100 * df_ds_med.groupby(['Q24_Part_3']).size()/len(df_ds_med)
q4_ds = 100 * df_ds_med.groupby(['Q24_Part_4']).size()/len(df_ds_med)
q5_ds = 100 * df_ds_med.groupby(['Q24_Part_5']).size()/len(df_ds_med)
q6_ds = 100 * df_ds_med.groupby(['Q24_Part_6']).size()/len(df_ds_med)
q7_ds = 100 * df_ds_med.groupby(['Q24_Part_7']).size()/len(df_ds_med)
q8_ds = 100 * df_ds_med.groupby(['Q24_Part_8']).size()/len(df_ds_med)
q9_ds = 100 * df_ds_med.groupby(['Q24_Part_9']).size()/len(df_ds_med)
q10_ds = 100 * df_ds_med.groupby(['Q24_Part_10']).size()/len(df_ds_med)

algo_prod_med = pd.concat([q1_ds, q2_ds,q3_ds,q4_ds,q5_ds,q6_ds,q7_ds,q8_ds,q9_ds,q10_ds], axis=0)


q1_ds = 100 * df_ds_small.groupby(['Q24_Part_1']).size()/len(df_ds_small)
q2_ds = 100 * df_ds_small.groupby(['Q24_Part_2']).size()/len(df_ds_small)
q3_ds = 100 * df_ds_small.groupby(['Q24_Part_3']).size()/len(df_ds_small)
q4_ds = 100 * df_ds_small.groupby(['Q24_Part_4']).size()/len(df_ds_small)
q5_ds = 100 * df_ds_small.groupby(['Q24_Part_5']).size()/len(df_ds_small)
q6_ds = 100 * df_ds_small.groupby(['Q24_Part_6']).size()/len(df_ds_small)
q7_ds = 100 * df_ds_small.groupby(['Q24_Part_7']).size()/len(df_ds_small)
q8_ds = 100 * df_ds_small.groupby(['Q24_Part_8']).size()/len(df_ds_small)
q9_ds = 100 * df_ds_small.groupby(['Q24_Part_9']).size()/len(df_ds_small)
q10_ds = 100 * df_ds_small.groupby(['Q24_Part_10']).size()/len(df_ds_small)

algo_prod_sm = pd.concat([q1_ds, q2_ds,q3_ds,q4_ds,q5_ds,q6_ds,q7_ds,q8_ds,q9_ds,q10_ds], axis=0)


al_usg = pd.concat([algo_prod_all,algo_prod_lg,algo_prod_med,algo_prod_sm], axis=1)
al_usg.columns = ['All', 'Large', 'SME','SMB']

from matplotlib.colors import ListedColormap
fig, axs = plt.subplots(ncols=2,figsize=(14, 8),sharey=True)

with sns.axes_style('white'):
      p = sns.heatmap(al_usg,
                cbar=False,
                square=False,
                annot=True,
                fmt='g',
                cmap=ListedColormap(['white']),
                linewidths=0.2,ax=axs[0])

tab_n = al_usg.div(al_usg.max(axis=1), axis=0)
q = sns.heatmap(tab_n,annot=False,cmap="YlGnBu", cbar=False, linewidths=0.5,ax=axs[1])
bottom, top = q.get_ylim()
q.set_ylim(bottom + 0.5, top - 0.5)

p.set(ylabel='Percentage')
p.set(title='% of Usage of Algos by Enterprise Type')

_ = plt.setp(p.get_xticklabels(), rotation=90)
_ = plt.setp(q.get_xticklabels(), rotation=90)

Clear differences can be seen above.

#### SMB (Small and Medium-Sized Businesses)
<ul>
<li>SMBs in our dataset over-index on NN usage, suggesting that the data science community in these companies deals with image processing or NLP problems more than in the other two categories</li>
</ul>

#### SME (Small and Medium Enterprises)
<ul>
<li>Not much different from the overall distribution, but slightly higher on Bayesian and Evolutionary approaches</li>
</ul>

#### Large Enterprise
<ul>
<li>Large enterprises over-index on algorithms like regression, RF, and GBM. This suggests that large enterprises mostly apply ML to structured-data problems, where algorithms like RF or GBM perform well</li>
<li>They also over-index on MLP (possibly again for tabular data) and BERT (for NLP)</li>
</ul>
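
The coloured heatmap above scales each row by its largest value (`al_usg.div(al_usg.max(axis=1), axis=0)`), so the darkest cell in a row marks the group that uses that algorithm the most, regardless of how popular the algorithm is overall. A small sketch with made-up numbers (not survey results) shows the effect:

```python
import pandas as pd

# Made-up usage percentages, for illustration only.
usage = pd.DataFrame(
    {"All": [40.0, 10.0], "Large": [50.0, 8.0], "SMB": [25.0, 16.0]},
    index=["Regression", "Neural nets"],
)

# Divide every row by its own maximum: the top group in each row becomes 1.0.
scaled = usage.div(usage.max(axis=1), axis=0)
print(scaled)
```

Here "Neural nets" has low absolute usage, yet after scaling its SMB cell is 1.0, which is why rarely used algorithms can still stand out in the coloured panel.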

#### Let's move on to ML/DL Frameworks

In [None]:
q1_ds = 100 * df_ds_Excl.groupby(['Q28_Part_1']).size()/len(df_ds_Excl)
q2_ds = 100 * df_ds_Excl.groupby(['Q28_Part_2']).size()/len(df_ds_Excl)
q3_ds = 100 * df_ds_Excl.groupby(['Q28_Part_3']).size()/len(df_ds_Excl)
q4_ds = 100 * df_ds_Excl.groupby(['Q28_Part_4']).size()/len(df_ds_Excl)
q5_ds = 100 * df_ds_Excl.groupby(['Q28_Part_5']).size()/len(df_ds_Excl)
q6_ds = 100 * df_ds_Excl.groupby(['Q28_Part_6']).size()/len(df_ds_Excl)
q7_ds = 100 * df_ds_Excl.groupby(['Q28_Part_7']).size()/len(df_ds_Excl)
q8_ds = 100 * df_ds_Excl.groupby(['Q28_Part_8']).size()/len(df_ds_Excl)
q9_ds = 100 * df_ds_Excl.groupby(['Q28_Part_9']).size()/len(df_ds_Excl)
q10_ds = 100 * df_ds_Excl.groupby(['Q28_Part_10']).size()/len(df_ds_Excl)

ml_fw_all = pd.concat([q1_ds, q2_ds,q3_ds,q4_ds,q5_ds,q6_ds,q7_ds,q8_ds,q9_ds,q10_ds], axis=0)

ml_fw_all_srt = ml_fw_all.sort_values(ascending=False)

q = sns.barplot(x=ml_fw_all_srt.index, y=ml_fw_all_srt.values)
q.set_title('Usage of ML Frameworks by DS Community\n')
q.set(ylabel='Percentage')
_ = plt.setp(q.get_xticklabels(), rotation=90)

In [None]:
q1_ds = 100 * df_ds_large.groupby(['Q28_Part_1']).size()/len(df_ds_large)
q2_ds = 100 * df_ds_large.groupby(['Q28_Part_2']).size()/len(df_ds_large)
q3_ds = 100 * df_ds_large.groupby(['Q28_Part_3']).size()/len(df_ds_large)
q4_ds = 100 * df_ds_large.groupby(['Q28_Part_4']).size()/len(df_ds_large)
q5_ds = 100 * df_ds_large.groupby(['Q28_Part_5']).size()/len(df_ds_large)
q6_ds = 100 * df_ds_large.groupby(['Q28_Part_6']).size()/len(df_ds_large)
q7_ds = 100 * df_ds_large.groupby(['Q28_Part_7']).size()/len(df_ds_large)
q8_ds = 100 * df_ds_large.groupby(['Q28_Part_8']).size()/len(df_ds_large)
q9_ds = 100 * df_ds_large.groupby(['Q28_Part_9']).size()/len(df_ds_large)
q10_ds = 100 * df_ds_large.groupby(['Q28_Part_10']).size()/len(df_ds_large)

ml_fw_lg = pd.concat([q1_ds, q2_ds,q3_ds,q4_ds,q5_ds,q6_ds,q7_ds,q8_ds,q9_ds,q10_ds], axis=0)


q1_ds = 100 * df_ds_med.groupby(['Q28_Part_1']).size()/len(df_ds_med)
q2_ds = 100 * df_ds_med.groupby(['Q28_Part_2']).size()/len(df_ds_med)
q3_ds = 100 * df_ds_med.groupby(['Q28_Part_3']).size()/len(df_ds_med)
q4_ds = 100 * df_ds_med.groupby(['Q28_Part_4']).size()/len(df_ds_med)
q5_ds = 100 * df_ds_med.groupby(['Q28_Part_5']).size()/len(df_ds_med)
q6_ds = 100 * df_ds_med.groupby(['Q28_Part_6']).size()/len(df_ds_med)
q7_ds = 100 * df_ds_med.groupby(['Q28_Part_7']).size()/len(df_ds_med)
q8_ds = 100 * df_ds_med.groupby(['Q28_Part_8']).size()/len(df_ds_med)
q9_ds = 100 * df_ds_med.groupby(['Q28_Part_9']).size()/len(df_ds_med)
q10_ds = 100 * df_ds_med.groupby(['Q28_Part_10']).size()/len(df_ds_med)

ml_fw_med = pd.concat([q1_ds, q2_ds,q3_ds,q4_ds,q5_ds,q6_ds,q7_ds,q8_ds,q9_ds,q10_ds], axis=0)


q1_ds = 100 * df_ds_small.groupby(['Q28_Part_1']).size()/len(df_ds_small)
q2_ds = 100 * df_ds_small.groupby(['Q28_Part_2']).size()/len(df_ds_small)
q3_ds = 100 * df_ds_small.groupby(['Q28_Part_3']).size()/len(df_ds_small)
q4_ds = 100 * df_ds_small.groupby(['Q28_Part_4']).size()/len(df_ds_small)
q5_ds = 100 * df_ds_small.groupby(['Q28_Part_5']).size()/len(df_ds_small)
q6_ds = 100 * df_ds_small.groupby(['Q28_Part_6']).size()/len(df_ds_small)
q7_ds = 100 * df_ds_small.groupby(['Q28_Part_7']).size()/len(df_ds_small)
q8_ds = 100 * df_ds_small.groupby(['Q28_Part_8']).size()/len(df_ds_small)
q9_ds = 100 * df_ds_small.groupby(['Q28_Part_9']).size()/len(df_ds_small)
q10_ds = 100 * df_ds_small.groupby(['Q28_Part_10']).size()/len(df_ds_small)

ml_fw_sm = pd.concat([q1_ds, q2_ds,q3_ds,q4_ds,q5_ds,q6_ds,q7_ds,q8_ds,q9_ds,q10_ds], axis=0)


al_usg = pd.concat([ml_fw_all,ml_fw_lg,ml_fw_med,ml_fw_sm], axis=1)
al_usg.columns = ['All', 'Large', 'SME','SMB']

from matplotlib.colors import ListedColormap
fig, axs = plt.subplots(ncols=2,figsize=(14, 8),sharey=True)

with sns.axes_style('white'):
      p = sns.heatmap(al_usg,
                cbar=False,
                square=False,
                annot=True,
                fmt='g',
                cmap=ListedColormap(['white']),
                linewidths=0.2,ax=axs[0])

tab_n = al_usg.div(al_usg.max(axis=1), axis=0)
q = sns.heatmap(tab_n,annot=False,cmap="YlGnBu", cbar=False, linewidths=0.5,ax=axs[1])
bottom, top = q.get_ylim()
q.set_ylim(bottom + 0.5, top - 0.5)

p.set(ylabel='Percentage')
p.set(title='% of Usage of ML Framework by Enterprise Type')

_ = plt.setp(p.get_xticklabels(), rotation=90)
_ = plt.setp(q.get_xticklabels(), rotation=90)

#### SMB (Small and Medium-Sized Businesses)
<ul>
<li>Over-index on deep learning libraries, consistent with what we saw before: these companies deal with image processing and NLP problems more than the other categories</li>
</ul>

#### SME (Small and Medium Enterprises)
<ul>
<li>Not much different from the overall distribution</li>
</ul>

#### Large Enterprise
<ul>
<li>Large enterprises over-index on frameworks that are widely used for structured data.</li>
<li>That said, some large companies are using DL frameworks as well</li>
</ul>


These results are similar to what we saw in the analysis of algorithm usage. They further clarify how these organizations differ in the kind of machine learning work they do.

***
# Conclusion
<a id="conclusion"></a>
*** 

Of course, there are a lot of considerations: domain, the company's brand, the culture, the people, location, the specific technology in use, and so forth. All of these are equally important. Our analysis is based on survey results from the data science community working in different organizations, and is meant to help readers looking for data science opportunities choose what is right for them.

In this notebook, we have focused on the 2019 Kaggle survey to analyze the differences between companies when it comes to the Data Science community's exposure to data analytics tools, ML methods and frameworks. We have seen clear differences between our 3 groups (Large, SME, and SMB).

The biggest plus with a large company is usually security and benefits, together with ample opportunity to move your career in the direction you want. Machine learning practices are fairly mature in large organizations, which mostly deal with structured data: the use of RDBMS systems and BI tools over-indexes there. Scikit-learn and algorithms such as GBM and RF also over-index in large organizations.

Mid-sized companies (SME) hold the highest percentage of PhD holders from the Data Science community compared to the other 2 groups. You can expect to receive better benefits than in a small company, such as health care or a contributory pension, depending on your experience and qualifications. Compared with the overall usage of the data science community, Data Science employees in these organizations are equally exposed to all technologies, ranging from RDBMS systems to cloud computing platforms to deep learning frameworks.

Companies with up to 50 employees are considered small businesses (SMB). In our data, these companies show the lowest ML usage of the three groups. However, the ones that do use ML are more inclined towards deep learning frameworks. With generally smaller revenues and profits, pay and benefits are often lower in smaller companies. On the other hand, you'll almost certainly be involved in a wider range of tasks than in a bigger company, along with the chance of quick promotion if you prove yourself. Working for a small company is also an excellent way of acquiring new, transferable skills.

If you are at all unsure about what you're getting into, it's a good idea to arrange a trial period before committing yourself. An internship or temporary position offers the perfect opportunity to get a feel for people and the company. Part-time work is also an option, as it gives you the chance to try out two companies at once.

***
# References 
<a id="references"></a>
***

[1] Different data science jobs: https://blog.udacity.com/2018/01/4-types-data-science-jobs.html

[2] Business size classification: https://www.sangoma.com/articles/smb-sme-large-enterprise-size-business-matters/

[3] Choosing the right company based on business size: https://www.monster.ie/career-advice/article/how-can-i-choose-the-right-company

[4] Stats about India's tech industry: https://www.ibef.org/industry/science-and-technology.aspx

[5] DB Engines: https://db-engines.com/en/

[6] Trends in the Information Technology sector: https://www.brookings.edu/research/trends-in-the-information-technology-sector/

[7] Top 100 digital companies: https://www.forbes.com/top-digital-companies/list/3/#tab:rank