# Exploring Machine Learning Adoption in Companies

![](https://www.mobinius.com/wp-content/uploads/2020/04/Machine-learning-banner-img.png)
Image Source: mobinius.com

Interest in the machine learning field has grown over the years, and many companies are aware of the benefits of machine learning technologies. Research and development in machine learning is expensive, and the expertise needed to create and train models can be intimidating. Despite these difficulties, some companies are investing in machine learning because they believe that adopting it leads to a bigger competitive advantage, while other companies still fear that machine learning is out of reach for them. The degree of machine learning utilization also differs between companies: some use machine learning only to get insights from their data, and others create and deploy machine learning models in production.


This notebook will take us on a journey to **learn about the work environments** that differ in how they adopt machine learning. We will also get to know the **skills and experience of the companies' employees**. The notebook will benefit both *people* who want to learn more about machine learning adoption in work environments and *companies* that want to start using machine learning.


The notebook is organized as follows: it starts with a look at the dataset of the Kaggle Machine Learning & Data Science Survey. Then, we will see how companies adopt machine learning. After that, we will explore the employees' machine learning experience and skills, and the companies' environments.

Let's first take a look at the dataset of the Kaggle Machine Learning & Data Science Survey.

## 1. Dataset First Look

In [None]:
# Import the required libraries
import numpy as np 
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from matplotlib import pyplot as plt # visualization 
%matplotlib inline
import matplotlib.ticker as ticker
import seaborn as sns
import os

# list the data files
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
# read csv file 
df = pd.read_csv('/kaggle/input/kaggle-survey-2020/kaggle_survey_2020_responses.csv')
df.head()

The first row in the dataset contains the question text rather than a response. We will not use it, so we remove it.

In [None]:
df = df.drop(df.index[0])
df.head()
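
Before dropping the question row, it can be handy to keep the question text around for later lookups, for example when labeling plots. The following is a minimal sketch on a toy frame, not part of the original analysis: the question strings are placeholders, and `raw`, `questions`, and `responses` are hypothetical names. It also resets the index, which the cell above does not do.

```python
import pandas as pd

# Toy stand-in for the survey file: row 0 holds the question text (placeholder strings).
raw = pd.DataFrame({
    "Q5": ["question text for Q5", "Data Scientist", "Student"],
    "Q22": ["question text for Q22", "I do not know", None],
})

# Keep the question text in a Series indexed by column name.
questions = raw.iloc[0]

# Drop the question row and renumber the remaining rows from 0.
responses = raw.iloc[1:].reset_index(drop=True)

print(questions["Q22"])
print(responses.shape)
```

With this in place, `questions['Q22']` returns the wording of a question without having to reread the CSV file.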

### 1.1. Machine Learning Adoption

In [None]:
plt.pie(df['Q22'].value_counts(), labels = df['Q22'].value_counts().index, 
        colors = ['#1d79f3', '#4e96f6', '#7eb2f8', '#afcffb', '#e0ecfd'], autopct='%0.f%%')
plt.show()

Nearly 20% of the participants said their employer does not use machine learning algorithms. The largest group of companies is exploring machine learning models and has not put them into production yet, although they may do so one day. About one third of the participants work at companies that use machine learning in production: around 17% say their employer has had well-established machine learning methods for more than two years, and 16% say their employer started using machine learning models less than two years ago. Another 11% of participants say their employer uses machine learning only for getting insights. Finally, some participants do not know whether their employer uses machine learning.
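
The shares read off the pie chart can also be computed directly. The sketch below uses toy answers with shortened placeholder labels (not the survey's exact wording) to show how `value_counts(normalize=True)` produces percentages, and how missing answers are handled.

```python
import pandas as pd

# Toy answers; the labels are shortened placeholders, not the survey's exact wording.
answers = pd.Series(["exploring", "exploring", "no ML", "in production", None, "exploring"])

# normalize=True returns fractions; missing answers are excluded by default.
shares = (answers.value_counts(normalize=True) * 100).round(1)
print(shares)

# dropna=False keeps missing answers as their own category in the denominator.
shares_with_missing = (answers.value_counts(normalize=True, dropna=False) * 100).round(1)
print(shares_with_missing)
```

Whether missing answers belong in the denominator is a design choice; the pie chart above follows the default and ignores them.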

We do not need the "*I do not know*" answer, so we remove it from the dataframe.

In [None]:
# Remove "I do not know" answer from the dataframe
ML_df = df[df['Q22'] != 'I do not know']

In [None]:
# Show answers percentage in pie chart
q22 = ML_df['Q22'].value_counts()
plt.pie(q22, labels = q22.index, colors = ['#1d79f3', '#4e96f6', '#7eb2f8', '#afcffb', '#e0ecfd'], autopct='%0.f%%')
plt.title('Company Machine Learning Adoption')
plt.show()

Now, we are going to take a deeper look at the participants' experience and skills.

## 2. Machine Learning Experience and the Used Technologies

In this section, we will explore the current roles of the employees and their experience in machine learning, programming, and different technologies. We will investigate the relationships between machine learning adoption in companies and some features related to the employees' experience.

### 2.1. Current Role

In [None]:
# Show the answers of "Select the title most similar to your current role (or most recent title if retired):"
for ansr in q22.index:
    ML_use = ML_df[ML_df['Q22'] == ansr]
    (ML_use['Q5'].value_counts()/ML_use.shape[0]*100).plot(kind='barh', color = '#1d79f3')
    plt.gca().xaxis.set_major_formatter(ticker.PercentFormatter())
    plt.title (ansr)
    plt.xlabel('% of participants')
    plt.show()

From the figures above, we find that software engineer, data scientist, and data analyst are the top three roles of participants who belong to companies that explore machine learning algorithms. We also find that the three most common roles for participants who belong to companies that use machine learning in production are data scientist, machine learning engineer, and software engineer. Moreover, over half of the participants from companies that use machine learning only for getting insights work as researchers, data scientists, or data analysts.

In short, data scientists usually use machine learning models, but do not necessarily use them in production. If you want to work on building and deploying machine learning models, consider companies that have a machine learning engineer job title. Moreover, the research scientist role focuses on using machine learning only for getting insights. Software engineers may also develop machine learning models.

### 2.2. Machine Learning Experience

In [None]:
# Show the answer of "For how many years have you used machine learning methods?"
fig, axs = plt.subplots()
cross = pd.crosstab(ML_df['Q15'], ML_df['Q22'], normalize='columns', margins=False)
sns.heatmap(cross, cmap="Blues", annot=True, cbar=False, linewidths=0.5)
axs.set_title('Years of Experience in Machine Learning')
axs.set_xlabel('Machine Learning Adoption')
axs.set_ylabel('Experience')
plt.show()

There is a relationship between incorporating machine learning methods into a business and the number of years of machine learning experience its employees have.

It is noticeable that almost a quarter of the participants who belong to companies that do not use machine learning have no experience at all in machine learning, and more than 70% of them have less than two years of experience in machine learning.

Moreover, in companies that only explore machine learning methods or use machine learning for getting insights, the largest share of employees has less than two years of experience in machine learning. In companies that recently started using machine learning in production, more than 75% of employees have more than two years of experience. Finally, employees of companies with well-established machine learning methods have the most years of experience in machine learning.
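
The percentages in the heatmap are column shares, because the crosstab is built with `normalize='columns'`: each adoption group sums to 1, so the cells describe the experience mix within that group. A small sketch on toy data (the labels are placeholders, not the survey's answers) makes this explicit.

```python
import pandas as pd

# Toy data: three respondents per adoption group.
toy = pd.DataFrame({
    "experience": ["< 1 year", "< 1 year", "5+ years", "5+ years", "5+ years", "< 1 year"],
    "adoption":   ["no ML",    "no ML",    "no ML",    "production", "production", "production"],
})

# normalize="columns" divides each cell by its column total.
cross = pd.crosstab(toy["experience"], toy["adoption"], normalize="columns")
print(cross)

# Every column sums to 1, so values compare experience mixes across groups.
column_totals = cross.sum(axis=0)
print(column_totals)
```

With `normalize='index'` the rows would sum to 1 instead, which answers a different question: where people with a given level of experience work.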

### 2.3. Programming Language

In [None]:
# Show the answer of "What programming languages do you use on a regular basis?"
for ansr in q22.index:
    # "I do not know" was already removed from ML_df, so no extra check is needed here
    ML_use = ML_df[ML_df['Q22'] == ansr]
    plt.figure(figsize=(10, 5))
    plt.bar(['Python', 'R', 'SQL', 'C', 'C++', 'Java', 'Javascript', 'Julia', 'Swift', 'Bash', 'MATLAB', 'None', 'Other'], 
            [ML_use['Q7_Part_1'].count(), ML_use['Q7_Part_2'].count(), ML_use['Q7_Part_3'].count(),
             ML_use['Q7_Part_4'].count(), ML_use['Q7_Part_5'].count(), ML_use['Q7_Part_6'].count(), 
             ML_use['Q7_Part_7'].count(), ML_use['Q7_Part_8'].count(), ML_use['Q7_Part_9'].count(), 
             ML_use['Q7_Part_10'].count(), ML_use['Q7_Part_11'].count(), ML_use['Q7_Part_12'].count(), 
             ML_use['Q7_OTHER'].count()], color = '#1d79f3')
    plt.title(ansr)
    plt.xticks(rotation=90)
    plt.show()

The three most used programming languages are Python, SQL, and R. Notably, Bash is used more by participants from companies with well-established machine learning methods.
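
The bar charts in this section show raw counts, so groups of different sizes are hard to compare directly. One possible alternative is a small helper that returns the percentage of respondents who selected each option. This is a sketch only: `multiselect_share` is a hypothetical name, and it assumes each `<question>_Part_n` column holds the answer text when selected and is empty otherwise.

```python
import pandas as pd

def multiselect_share(frame, prefix):
    """Percentage of rows in `frame` that selected each `<prefix>_*` column."""
    cols = [c for c in frame.columns if c.startswith(prefix + "_")]
    picked = frame[cols].notna()
    # Label each option with the answer text stored in its column, if any row has it.
    labels = {
        c: frame[c].dropna().iloc[0].strip() if picked[c].any() else c
        for c in cols
    }
    return (picked.mean() * 100).rename(index=labels).sort_values(ascending=False)

# Toy data with made-up answers; real column names follow the same pattern.
toy = pd.DataFrame({
    "Q22": ["no ML", "no ML", "production", "production"],
    "Q7_Part_1": ["Python", None, "Python", "Python"],
    "Q7_Part_2": [None, "R", None, "R"],
    "Q7_Part_10": [None, None, "Bash", None],
})

for answer, group in toy.groupby("Q22"):
    print(answer)
    print(multiselect_share(group, "Q7"))
```

The denominator here is every respondent in the group, including those who skipped the question, which is one of several reasonable choices.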

### 2.4. Development Environment

In [None]:
# Show the answer of "Which of the following integrated development environments (IDE's) do you use on a regular basis?"
for ansr in q22.index:
    ML_use = ML_df[ML_df['Q22'] == ansr]
    plt.figure(figsize=(10, 5))
    plt.bar(['JupyterLab', 'RStudio', 'Visual Studio', 'Visual Studio Code (VSCode)', 'PyCharm', 'Spyder', 'Notepad++',
             'Sublime Text', 'Vim, Emacs, or similar', 'MATLAB', 'None', 'Other'], 
            [ML_use['Q9_Part_1'].count(), ML_use['Q9_Part_2'].count(), ML_use['Q9_Part_3'].count(),
             ML_use['Q9_Part_4'].count(), ML_use['Q9_Part_5'].count(), ML_use['Q9_Part_6'].count(), 
             ML_use['Q9_Part_7'].count(), ML_use['Q9_Part_8'].count(), ML_use['Q9_Part_9'].count(), 
             ML_use['Q9_Part_10'].count(), ML_use['Q9_Part_11'].count(), ML_use['Q9_OTHER'].count()], color = '#1d79f3')
    plt.title(ansr)
    plt.xticks(rotation=90)
    plt.show()

The three most used development environments are JupyterLab, Visual Studio Code, and PyCharm. It is also notable that Vim, Emacs, or similar editors are used more by participants from companies with well-established machine learning methods.

### 2.5. Notebook Products

In [None]:
# Show the answer of "Which of the following hosted notebook products do you use on a regular basis?"
for ansr in q22.index:
    ML_use = ML_df[ML_df['Q22'] == ansr]
    plt.figure(figsize=(10, 5))
    plt.bar(['Kaggle', 'Colab', 'Azure', 'Paperspace / Gradient', 'Binder / JupyterHub', 'Code Ocean',
             'IBM Watson', 'Amazon Sagemaker', 'Amazon EMR', 'Google Cloud AI Platform', 'Google Cloud Datalab',
             'Databricks Collaborative', 'None', 'Other'], 
            [ML_use['Q10_Part_1'].count(), ML_use['Q10_Part_2'].count(), ML_use['Q10_Part_3'].count(),
             ML_use['Q10_Part_4'].count(), ML_use['Q10_Part_5'].count(), ML_use['Q10_Part_6'].count(), 
             ML_use['Q10_Part_7'].count(), ML_use['Q10_Part_8'].count(), ML_use['Q10_Part_9'].count(), 
             ML_use['Q10_Part_10'].count(), ML_use['Q10_Part_11'].count(), ML_use['Q10_Part_12'].count(),
             ML_use['Q10_Part_13'].count(), ML_use['Q10_OTHER'].count()], color = '#1d79f3')
    plt.title(ansr)
    plt.xticks(rotation=90)
    plt.show()

Most participants use Colab or Kaggle notebooks, or do not use hosted notebook products at all. Furthermore, employees of companies that use machine learning in production use Amazon SageMaker more than other employees do.

### 2.6. Machine Learning Frameworks

In [None]:
# Show the answer of "Which of the following machine learning frameworks do you use on a regular basis?"
for ansr in q22.index:
    ML_use = ML_df[ML_df['Q22'] == ansr]
    plt.figure(figsize=(10, 5))
    plt.bar(['Scikit-learn', 'TensorFlow', 'Keras', 'PyTorch', 'Fast.ai', 'MXNet','Xgboost', 'LightGBM',
             'CatBoost', 'Prophet', 'H2O 3','Caret', 'Tidymodels', 'JAX', 'None', 'Other'], 
            [ML_use['Q16_Part_1'].count(), ML_use['Q16_Part_2'].count(), ML_use['Q16_Part_3'].count(),
             ML_use['Q16_Part_4'].count(), ML_use['Q16_Part_5'].count(), ML_use['Q16_Part_6'].count(), 
             ML_use['Q16_Part_7'].count(), ML_use['Q16_Part_8'].count(), ML_use['Q16_Part_9'].count(), 
             ML_use['Q16_Part_10'].count(), ML_use['Q16_Part_11'].count(), ML_use['Q16_Part_12'].count(),
             ML_use['Q16_Part_13'].count(), ML_use['Q16_Part_14'].count(), ML_use['Q16_Part_15'].count(),
             ML_use['Q16_OTHER'].count()], color = '#1d79f3')
    plt.title(ansr)
    plt.xticks(rotation=90)
    plt.show()

In general, the most used machine learning frameworks are Scikit-learn, TensorFlow, Keras, PyTorch, and Xgboost. All five are available as Python libraries.

### 2.7. Machine Learning Algorithms

In [None]:
# Show the answer of "Which of the following ML algorithms do you use on a regular basis? "
for ansr in q22.index:
    ML_use = ML_df[ML_df['Q22'] == ansr]
    plt.figure(figsize=(10, 5))
    plt.bar(['Linear or Logistic Regression', 'Trees or Random Forest', 'Gradient Boosting', 'Bayesian', 'Evolutionary', 'DNN',
             'CNN', 'GAN', 'RNN', 'Transformer Networks','None', 'Other'], 
            [ML_use['Q17_Part_1'].count(), ML_use['Q17_Part_2'].count(), ML_use['Q17_Part_3'].count(),
             ML_use['Q17_Part_4'].count(), ML_use['Q17_Part_5'].count(), ML_use['Q17_Part_6'].count(), 
             ML_use['Q17_Part_7'].count(), ML_use['Q17_Part_8'].count(), ML_use['Q17_Part_9'].count(), 
             ML_use['Q17_Part_10'].count(), ML_use['Q17_Part_11'].count(), ML_use['Q17_OTHER'].count()], color = '#1d79f3')
    plt.title(ansr)
    plt.xticks(rotation=90)
    plt.show()

As for machine learning algorithms, participants commonly use both simple and complex algorithms. The most common algorithms across all companies are linear or logistic regression, decision trees or random forests, gradient boosting, and convolutional neural networks.

### 2.8. Cloud Computing Platform

In [None]:
# Show the answer of "Which of the following cloud computing platforms do you use on a regular basis?"
for ansr in q22.index:
    ML_use = ML_df[ML_df['Q22'] == ansr]
    plt.figure(figsize=(10, 5))
    plt.bar(['AWS', 'Azure', 'Google Cloud Platform', 'IBM / Red Hat', 'Oracle', 'SAP','Salesforce',
             'VMware', 'Alibaba', 'Tencent','None', 'Other'], 
            [ML_use['Q26_A_Part_1'].count(), ML_use['Q26_A_Part_2'].count(), ML_use['Q26_A_Part_3'].count(),
             ML_use['Q26_A_Part_4'].count(), ML_use['Q26_A_Part_5'].count(), ML_use['Q26_A_Part_6'].count(), 
             ML_use['Q26_A_Part_7'].count(), ML_use['Q26_A_Part_8'].count(), ML_use['Q26_A_Part_9'].count(), 
             ML_use['Q26_A_Part_10'].count(), ML_use['Q26_A_Part_11'].count(), ML_use['Q26_A_OTHER'].count()], 
            color = '#1d79f3')
    plt.title(ansr)
    plt.xticks(rotation=90)
    plt.show()

AWS, Google Cloud Platform, and Microsoft Azure are the top three cloud computing platforms used across the different companies. Participants who belong to companies that do not use machine learning usually do not use cloud platforms.

### 2.9. Machine Learning Experiment Management Tools

In [None]:
# Show the answer of "Do you use any tools to help manage machine learning experiments?"
for ansr in q22.index:
    ML_use = ML_df[ML_df['Q22'] == ansr]
    plt.figure(figsize=(10, 5))
    plt.bar(['Neptune.ai', 'Weights & Biases', 'Comet.ml', 'Sacred + Omniboard', 'TensorBoard',
             'Guild.ai', 'Polyaxon', 'Trains', 'Domino Model Monitor', 'None', 'Other'], 
            [ML_use['Q35_A_Part_1'].count(), ML_use['Q35_A_Part_2'].count(), ML_use['Q35_A_Part_3'].count(),
             ML_use['Q35_A_Part_4'].count(), ML_use['Q35_A_Part_5'].count(), ML_use['Q35_A_Part_6'].count(), 
             ML_use['Q35_A_Part_7'].count(), ML_use['Q35_A_Part_8'].count(), ML_use['Q35_A_Part_9'].count(), 
             ML_use['Q35_A_Part_10'].count(), ML_use['Q35_A_OTHER'].count()], color = '#1d79f3')
    plt.title(ansr)
    plt.xticks(rotation=90)
    plt.show()

Most of the participants do not use tools to manage machine learning experiments. TensorBoard is the most popular tool, especially in companies that use machine learning.

## 3. Company Environments

In this section, we will explore the companies' environments and check whether there are relationships between machine learning adoption and company size and data science team size.

### 3.1. Company Size

In [None]:
# Show the answer of "What is the size of the company where you are employed?"
fig, axs = plt.subplots()
cross = pd.crosstab(ML_df['Q20'], ML_df['Q22'], normalize='columns', margins=False)
sns.heatmap(cross, cmap="Blues", annot=True, cbar=False, linewidths=0.5)
axs.set_title('Company Size and Machine Learning')
axs.set_xlabel('Machine Learning Adoption')
axs.set_ylabel('Company Size')
plt.show()

We can see that the largest share of companies with well-established machine learning methods are large companies with more than 1,000 employees. We can also see that 18% of these companies are small startups.
In general, large companies and small startups tend to use machine learning. This could be because large companies have the capabilities and resources to work on machine learning, while startups look to machine learning to further optimize their services.
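
One caveat when reading such heatmaps: `pd.crosstab` sorts the row labels as strings, so size ranges may not appear in increasing order. The sketch below shows one way to restore the intended order with `reindex`. The labels in `size_order` are assumed and should be checked against `df['Q20'].unique()`; the toy data is made up.

```python
import pandas as pd

# Assumed answer labels for the company-size question; verify against the data.
size_order = ["0-49 employees", "50-249 employees", "250-999 employees",
              "1000-9,999 employees", "10,000 or more employees"]

# Toy data with placeholder adoption labels.
toy = pd.DataFrame({
    "size": ["1000-9,999 employees", "0-49 employees", "250-999 employees", "0-49 employees"],
    "adoption": ["production", "no ML", "production", "production"],
})

cross = pd.crosstab(toy["size"], toy["adoption"], normalize="columns")
# Default row order is alphabetical on the label strings.
print(list(cross.index))

# reindex puts rows in the intended order; sizes absent from the data become 0.
ordered = cross.reindex(size_order, fill_value=0.0)
print(ordered)
```

Passing `ordered` to `sns.heatmap` instead of `cross` would then list the sizes from smallest to largest.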

### 3.2. Data Science Team Size

In [None]:
# Show the answer of "Approximately how many individuals are responsible for data science workloads at your place of business?"
fig, axs = plt.subplots()
cross = pd.crosstab(ML_df['Q21'], ML_df['Q22'], normalize='columns', margins=False)
sns.heatmap(cross, cmap="Blues", annot=True, cbar=False, linewidths=0.5)
axs.set_title('Data Science Team Size and Machine Learning')
axs.set_xlabel('Machine Learning Adoption')
axs.set_ylabel('Data Science Team Size')
plt.show()

From the heatmap, we can see that nearly half of the well-established machine learning companies have large data science teams of 20+. We can also see that nearly half of the companies that do not use machine learning do not have a data science team.

### 3.3. Salary

In [None]:
print("The most common salary range in each group of companies:")
print()
for ansr in q22.index:
    ML_use = ML_df[ML_df['Q22'] == ansr]
    print(ansr,': ' ,ML_use['Q24'].mode()[0])

Judging by the most common salary range in each group, companies with well-established machine learning methods generally pay considerably more than the other companies.
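
The mode only reports the single most frequent salary range, which can hide the rest of the distribution. As a rough complement, the range labels can be mapped to their lower bounds and summarized with a median. This is a sketch under assumptions: `salary_lower_bound` is a hypothetical helper, and the label formats below are assumed and should be checked against `df['Q24'].unique()`.

```python
import pandas as pd

def salary_lower_bound(label):
    """Lower bound in USD of a range label such as '10,000-14,999'."""
    cleaned = label.replace("$", "").replace(",", "").replace(">", "").strip()
    return int(cleaned.split("-")[0])

# Toy answers in the assumed label formats.
toy = pd.Series(["$0-999", "10,000-14,999", "> $500,000", "10,000-14,999", "100,000-124,999"])

# Convert each label to a number so the ranges can be ordered and summarized.
lower = toy.map(salary_lower_bound)
median_lower = lower.median()
print(median_lower)
```

Because the lower bound understates every salary, the result is best used to compare groups with each other rather than as an estimate of actual pay.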

## Conclusion
In summary, this notebook explored the different ways companies adopt machine learning, along with their employees' experience. From the analysis, we have seen that company size, data science team size, and employees' machine learning experience are related to the adoption of machine learning in companies.