# Who is using Automatic Machine Learning pipelines?

Automatic ML systems seek to "democratize" Machine Learning by making it more accessible to non-experts. For example, [Google Cloud AutoML](https://cloud.google.com/automl/)'s main selling point is that:

> Cloud AutoML enables developers with limited machine learning expertise to train high-quality models specific to their business needs. 

I would like to explore whether Automatic ML services - in particular, services that automate the entire ML pipeline - seem to be achieving this goal. To do this, I want to compare the subset of the Kaggle community who uses AutoML to the general population. 
How do AutoML users differ from Kaggle users at large in terms of:

- The amount of experience they tend to have?
- The ways they use ML in the workplace?



I'm limiting the use of ML to the workplace, rather than personal and hobby use, for two simple reasons:

1. That is the only data available in the Kaggle Survey dataset.
2. AutoML products are currently prohibitively expensive for personal use, and they tend to be explicitly targeted at businesses seeking to use ML for making business decisions.

# Data

In [None]:
# Copy-pasted map between questions and columns

q7_list_of_columns = ['Q7_Part_1',
                      'Q7_Part_2',
                      'Q7_Part_3',
                      'Q7_Part_4',
                      'Q7_Part_5',
                      'Q7_Part_6',
                      'Q7_Part_7',
                      'Q7_Part_8',
                      'Q7_Part_9',
                      'Q7_Part_10',
                      'Q7_Part_11',
                      'Q7_Part_12',
                      'Q7_OTHER']

q9_list_of_columns = ['Q9_Part_1',
                      'Q9_Part_2',
                      'Q9_Part_3',
                      'Q9_Part_4',
                      'Q9_Part_5',
                      'Q9_Part_6',
                      'Q9_Part_7',
                      'Q9_Part_8',
                      'Q9_Part_9',
                      'Q9_Part_10',
                      'Q9_Part_11',
                      'Q9_OTHER']

q10_list_of_columns = ['Q10_Part_1',
                       'Q10_Part_2',
                       'Q10_Part_3',
                       'Q10_Part_4',
                       'Q10_Part_5',
                       'Q10_Part_6',
                       'Q10_Part_7',
                       'Q10_Part_8',
                       'Q10_Part_9',
                       'Q10_Part_10',
                       'Q10_Part_11',
                       'Q10_Part_12',
                       'Q10_Part_13',
                       'Q10_OTHER']

q12_list_of_columns = ['Q12_Part_1',
                            'Q12_Part_2',
                            'Q12_Part_3',
                            'Q12_OTHER']

q14_list_of_columns = ['Q14_Part_1',
                            'Q14_Part_2',
                            'Q14_Part_3',
                            'Q14_Part_4',
                            'Q14_Part_5',
                            'Q14_Part_6',
                            'Q14_Part_7',
                            'Q14_Part_8',
                            'Q14_Part_9',
                            'Q14_Part_10',
                            'Q14_Part_11',
                            'Q14_OTHER']

q16_list_of_columns = ['Q16_Part_1',
                       'Q16_Part_2',
                       'Q16_Part_3',
                       'Q16_Part_4',
                       'Q16_Part_5',
                       'Q16_Part_6',
                       'Q16_Part_7',
                       'Q16_Part_8',
                       'Q16_Part_9',
                       'Q16_Part_10',
                       'Q16_Part_11',
                       'Q16_Part_12',
                       'Q16_Part_13',
                       'Q16_Part_14',
                       'Q16_Part_15',
                       'Q16_OTHER']

q17_list_of_columns = ['Q17_Part_1',
                       'Q17_Part_2',
                       'Q17_Part_3',
                       'Q17_Part_4',
                       'Q17_Part_5',
                       'Q17_Part_6',
                       'Q17_Part_7',
                       'Q17_Part_8',
                       'Q17_Part_9',
                       'Q17_Part_10',
                       'Q17_Part_11',
                       'Q17_OTHER']

q18_list_of_columns = ['Q18_Part_1',
                       'Q18_Part_2',
                       'Q18_Part_3',
                       'Q18_Part_4',
                       'Q18_Part_5',
                       'Q18_Part_6',
                       'Q18_OTHER']

q19_list_of_columns = ['Q19_Part_1',
                       'Q19_Part_2',
                       'Q19_Part_3',
                       'Q19_Part_4',
                       'Q19_Part_5',
                       'Q19_OTHER']

q23_list_of_columns = ['Q23_Part_1',
                       'Q23_Part_2',
                       'Q23_Part_3',
                       'Q23_Part_4',
                       'Q23_Part_5',
                       'Q23_Part_6',
                       'Q23_Part_7',
                       'Q23_OTHER']

q26a_list_of_columns = ['Q26_A_Part_1',
                        'Q26_A_Part_2',
                        'Q26_A_Part_3',
                        'Q26_A_Part_4',
                        'Q26_A_Part_5',
                        'Q26_A_Part_6',
                        'Q26_A_Part_7',
                        'Q26_A_Part_8',
                        'Q26_A_Part_9',
                        'Q26_A_Part_10',
                        'Q26_A_Part_11',
                        'Q26_A_OTHER']

q26b_list_of_columns = ['Q26_B_Part_1',
                        'Q26_B_Part_2',
                        'Q26_B_Part_3',
                        'Q26_B_Part_4',
                        'Q26_B_Part_5',
                        'Q26_B_Part_6',
                        'Q26_B_Part_7',
                        'Q26_B_Part_8',
                        'Q26_B_Part_9',
                        'Q26_B_Part_10',
                        'Q26_B_Part_11',
                        'Q26_B_OTHER']

q27a_list_of_columns = ['Q27_A_Part_1',
                        'Q27_A_Part_2',
                        'Q27_A_Part_3',
                        'Q27_A_Part_4',
                        'Q27_A_Part_5',
                        'Q27_A_Part_6',
                        'Q27_A_Part_7',
                        'Q27_A_Part_8',
                        'Q27_A_Part_9',
                        'Q27_A_Part_10',
                        'Q27_A_Part_11',
                        'Q27_A_OTHER']

q27b_list_of_columns = ['Q27_B_Part_1',
                        'Q27_B_Part_2',
                        'Q27_B_Part_3',
                        'Q27_B_Part_4',
                        'Q27_B_Part_5',
                        'Q27_B_Part_6',
                        'Q27_B_Part_7',
                        'Q27_B_Part_8',
                        'Q27_B_Part_9',
                        'Q27_B_Part_10',
                        'Q27_B_Part_11',
                        'Q27_B_OTHER']

q28a_list_of_columns = ['Q28_A_Part_1',
                        'Q28_A_Part_2',
                        'Q28_A_Part_3',
                        'Q28_A_Part_4',
                        'Q28_A_Part_5',
                        'Q28_A_Part_6',
                        'Q28_A_Part_7',
                        'Q28_A_Part_8',
                        'Q28_A_Part_9',
                        'Q28_A_Part_10',
                        'Q28_A_OTHER']

q28b_list_of_columns = ['Q28_B_Part_1',
                        'Q28_B_Part_2',
                        'Q28_B_Part_3',
                        'Q28_B_Part_4',
                        'Q28_B_Part_5',
                        'Q28_B_Part_6',
                        'Q28_B_Part_7',
                        'Q28_B_Part_8',
                        'Q28_B_Part_9',
                        'Q28_B_Part_10',
                        'Q28_B_OTHER']

q29a_list_of_columns = ['Q29_A_Part_1',
                        'Q29_A_Part_2',
                        'Q29_A_Part_3',
                        'Q29_A_Part_4',
                        'Q29_A_Part_5',
                        'Q29_A_Part_6',
                        'Q29_A_Part_7',
                        'Q29_A_Part_8',
                        'Q29_A_Part_9',
                        'Q29_A_Part_10',
                        'Q29_A_Part_11',
                        'Q29_A_Part_12',
                        'Q29_A_Part_13',
                        'Q29_A_Part_14',
                        'Q29_A_Part_15',
                        'Q29_A_Part_16',
                        'Q29_A_Part_17',
                        'Q29_A_OTHER']

q29b_list_of_columns = ['Q29_B_Part_1',
                        'Q29_B_Part_2',
                        'Q29_B_Part_3',
                        'Q29_B_Part_4',
                        'Q29_B_Part_5',
                        'Q29_B_Part_6',
                        'Q29_B_Part_7',
                        'Q29_B_Part_8',
                        'Q29_B_Part_9',
                        'Q29_B_Part_10',
                        'Q29_B_Part_11',
                        'Q29_B_Part_12',
                        'Q29_B_Part_13',
                        'Q29_B_Part_14',
                        'Q29_B_Part_15',
                        'Q29_B_Part_16',
                        'Q29_B_Part_17',
                        'Q29_B_OTHER']

q31a_list_of_columns = ['Q31_A_Part_1',
                        'Q31_A_Part_2',
                        'Q31_A_Part_3',
                        'Q31_A_Part_4',
                        'Q31_A_Part_5',
                        'Q31_A_Part_6',
                        'Q31_A_Part_7',
                        'Q31_A_Part_8',
                        'Q31_A_Part_9',
                        'Q31_A_Part_10',
                        'Q31_A_Part_11',
                        'Q31_A_Part_12',
                        'Q31_A_Part_13',
                        'Q31_A_Part_14',
                        'Q31_A_OTHER']

q31b_list_of_columns = ['Q31_B_Part_1',
                        'Q31_B_Part_2',
                        'Q31_B_Part_3',
                        'Q31_B_Part_4',
                        'Q31_B_Part_5',
                        'Q31_B_Part_6',
                        'Q31_B_Part_7',
                        'Q31_B_Part_8',
                        'Q31_B_Part_9',
                        'Q31_B_Part_10',
                        'Q31_B_Part_11',
                        'Q31_B_Part_12',
                        'Q31_B_Part_13',
                        'Q31_B_Part_14',
                        'Q31_B_OTHER']

q33a_list_of_columns = ['Q33_A_Part_1',
                        'Q33_A_Part_2',
                        'Q33_A_Part_3',
                        'Q33_A_Part_4',
                        'Q33_A_Part_5',
                        'Q33_A_Part_6',
                        'Q33_A_Part_7',
                        'Q33_A_OTHER']

q33b_list_of_columns = ['Q33_B_Part_1',
                        'Q33_B_Part_2',
                        'Q33_B_Part_3',
                        'Q33_B_Part_4',
                        'Q33_B_Part_5',
                        'Q33_B_Part_6',
                        'Q33_B_Part_7',
                        'Q33_B_OTHER']

q34a_list_of_columns = ['Q34_A_Part_1',
                        'Q34_A_Part_2',
                        'Q34_A_Part_3',
                        'Q34_A_Part_4',
                        'Q34_A_Part_5',
                        'Q34_A_Part_6',
                        'Q34_A_Part_7',
                        'Q34_A_Part_8',
                        'Q34_A_Part_9',
                        'Q34_A_Part_10',
                        'Q34_A_Part_11',
                        'Q34_A_OTHER']

q34b_list_of_columns = ['Q34_B_Part_1',
                        'Q34_B_Part_2',
                        'Q34_B_Part_3',
                        'Q34_B_Part_4',
                        'Q34_B_Part_5',
                        'Q34_B_Part_6',
                        'Q34_B_Part_7',
                        'Q34_B_Part_8',
                        'Q34_B_Part_9',
                        'Q34_B_Part_10',
                        'Q34_B_Part_11',
                        'Q34_B_OTHER']


q35a_list_of_columns = ['Q35_A_Part_1',
                        'Q35_A_Part_2',
                        'Q35_A_Part_3',
                        'Q35_A_Part_4',
                        'Q35_A_Part_5',
                        'Q35_A_Part_6',
                        'Q35_A_Part_7',
                        'Q35_A_Part_8',
                        'Q35_A_Part_9',
                        'Q35_A_Part_10',
                        'Q35_A_OTHER']

q35b_list_of_columns = ['Q35_B_Part_1',
                        'Q35_B_Part_2',
                        'Q35_B_Part_3',
                        'Q35_B_Part_4',
                        'Q35_B_Part_5',
                        'Q35_B_Part_6',
                        'Q35_B_Part_7',
                        'Q35_B_Part_8',
                        'Q35_B_Part_9',
                        'Q35_B_Part_10',
                        'Q35_B_OTHER']

q36_list_of_columns = ['Q36_Part_1',
                       'Q36_Part_2',
                       'Q36_Part_3',
                       'Q36_Part_4',
                       'Q36_Part_5',
                       'Q36_Part_6',
                       'Q36_Part_7',
                       'Q36_Part_8',
                       'Q36_Part_9',
                       'Q36_OTHER']

q37_list_of_columns = ['Q37_Part_1',
                       'Q37_Part_2',
                       'Q37_Part_3',
                       'Q37_Part_4',
                       'Q37_Part_5',
                       'Q37_Part_6',
                       'Q37_Part_7',
                       'Q37_Part_8',
                       'Q37_Part_9',
                       'Q37_Part_10',
                       'Q37_Part_11',
                       'Q37_OTHER']

q39_list_of_columns = ['Q39_Part_1',
                       'Q39_Part_2',
                       'Q39_Part_3',
                       'Q39_Part_4',
                       'Q39_Part_5',
                       'Q39_Part_6',
                       'Q39_Part_7',
                       'Q39_Part_8',
                       'Q39_Part_9',
                       'Q39_Part_10',
                       'Q39_Part_11',
                       'Q39_OTHER']
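
The lists above follow a regular naming pattern, so they could also be derived from the column names instead of being copy-pasted. Here is a minimal sketch under that assumption; `question_columns` and the demo column names are hypothetical helpers for illustration, not part of the original notebook:

```python
def question_columns(columns, question):
    """Return the columns belonging to one multi-select question.

    Assumes the naming pattern seen in the lists above:
    '<question>_Part_<n>' plus an optional '<question>_OTHER'.
    `question` is a prefix such as 'Q7' or 'Q26_A'.
    """
    prefix = question + '_'
    return [
        col for col in columns
        if col.startswith(prefix + 'Part_') or col == prefix + 'OTHER'
    ]

# Hypothetical column names that mimic the survey's naming scheme.
demo_columns = [
    'Q3', 'Q7_Part_1', 'Q7_Part_2', 'Q7_OTHER',
    'Q26_A_Part_1', 'Q26_A_OTHER', 'Q26_B_Part_1',
]

q7_demo_columns = question_columns(demo_columns, 'Q7')
q26a_demo_columns = question_columns(demo_columns, 'Q26_A')
print(q7_demo_columns)
print(q26a_demo_columns)
```

Once the survey is loaded, a call such as `question_columns(survey_df.columns, 'Q17')` should give the same list as `q17_list_of_columns`, provided the CSV's columns really follow this pattern.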

In [None]:
answer_order = {
    'Q25': [ 
            '$100,000 or more ($USD)', 
            '$10,000-$99,999',
            '$1000-$9,999', 
            '$100-$999',
            '$1-$99', 
            '$0 ($USD)', 
           ],
    'Q21': [
            '20+', 
            '15-19',
            '10-14', 
            '5-9', 
            '3-4', 
            '1-2', 
            '0',         
           ],
    'Q6': [
            '20+ years', 
            '10-20 years', 
            '5-10 years', 
            '3-5 years', 
            '1-2 years',
            '< 1 years', 
            'I have never written code',
          ]
}

In [None]:

import numpy as np
import pandas as pd 
import matplotlib.pyplot as plt

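survey_df = pd.read_csv('../input/kaggle-survey-2020/kaggle_survey_2020_responses.csv', low_memory=False)
# Number of Q17 options each respondent selected (non-null cells across the Q17 columns)
survey_df['num_algs'] = survey_df[q17_list_of_columns].count(axis=1)
# Row 0 of the CSV holds the question text, so the actual responses start at row 1
responses_df = survey_df[1:]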

In [None]:
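# helper functions

def get_counts_for_multicol_q(cols, df=responses_df):
    """Count non-null answers per column of a multi-select question.

    Keys are the answer text stored in each column. Columns with no answers
    in `df` are skipped, since their answer text cannot be recovered.
    """
    counts = {}
    for col in cols:
        answered = df[col].dropna()
        if not answered.empty:
            counts[answered.unique()[0]] = len(answered)
    return counts

def get_counts_for_singlecol_q(col, df=responses_df):
    """Count respondents per answer of a single-select question."""
    return df.groupby(col)[col].count().to_dict()

def bar_chart(ans_dict, ax=plt, comparison=None, labels=None, ans_order=None):
    """Horizontal bar chart of answer counts, or of normalized shares when
    `comparison` is given. `ax` must expose the pyplot interface (yticks, xlabel)."""
    answers = list(ans_dict.keys())
    if ans_order:
        answers = [a for a in ans_order if a in ans_dict]
    x_pos = np.arange(len(answers))
    ax.yticks(x_pos, answers)

    y = [ans_dict[a] for a in answers]
    width = 0.8
    bars = []

    if comparison:
        ax.xlabel('Fraction of respondents')
        y = [y_i/sum(y) for y_i in y]
        comparison_y = [comparison.get(k, 0) for k in answers]
        comparison_y = [y_i/sum(comparison_y) for y_i in comparison_y]
        width = 0.4
        bars.append(ax.barh(x_pos-width, comparison_y, height=width))

    # The main series goes first so that it lines up with the first legend label
    bars.insert(0, ax.barh(x_pos, y, height=width))
    if labels:
        ax.legend(bars, labels)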

Due to the nature of the survey, I limited the set of respondents using two conditions:

- Only consider those respondents who spent more than $0 on cloud services
- Only consider those respondents who are not students or unemployed

I needed to do this because the questions I'm interested in - whether you use AutoML, and how you use ML at work - were only asked of people who have spent money on cloud services and who did not say they were a student or unemployed, respectively.

Other respondents may also have been able to provide relevant data. For example, it's entirely possible that a PhD student with industry experience has never spent a dollar on cloud services, but has nevertheless explored the possibility of auto-sklearn or even played around with the free trial version of Cloud AutoML. But we simply don't have that data.

In [None]:
all_applicable = responses_df[
    (~responses_df['Q5'].isin({'Student', 'Currently not employed'}))  # Not Students or Unemployed
    & (responses_df['Q25'] != '$0 ($USD)')
]
print(f'limited responses to {len(all_applicable)} applicable respondents out of {len(responses_df)} total')

In [None]:
automl_users = all_applicable[all_applicable['Q33_A_Part_6'].notnull()]
print(f'{len(automl_users)}\
 applicable respondents ({len(automl_users)/len(all_applicable)*100:.1f}%) said they have used full-pipeline AutoML tools.')

It's interesting that only a very small percentage of Kaggle users who *could* use AutoML are actually using it. This could indicate that AutoML is still in its nascency, or it could mean that the intersection of people who are interested in AutoML and those who use Kaggle is small.

In May 2018, the CEO of Google stated in his [announcement of Cloud AutoML](https://blog.google/technology/ai/making-ai-work-for-everyone/):

> We hope AutoML will take an ability that a few PhDs have today and will make it possible **in three to five years** for hundreds of thousands of developers to design new neural nets for their particular needs. 

(emphasis mine)

Considering that this was about 2.5 years ago, it appears some part of that hope is not quite on track. AutoML in general, and Cloud AutoML in particular, is **not yet** being adopted widely by large numbers of developers who were previously unable to break into ML.

Nevertheless, it's interesting to see who **is** currently adopting full-pipeline AutoML.

# Analysis

In [None]:
def compare_multicol(col_list, ax=plt):
    # Compare all applicable respondents with AutoML users on a multi-select question
    bar_chart(
        get_counts_for_multicol_q(col_list, all_applicable),
        ax=ax,
        comparison=get_counts_for_multicol_q(col_list, automl_users),
        labels=('All', 'AutoML users')
    )

def compare_singlecol(col, ax=plt):
    # Same comparison for a single-select question, using a fixed answer order if one is defined
    bar_chart(
        get_counts_for_singlecol_q(col, all_applicable),
        ax=ax,
        comparison=get_counts_for_singlecol_q(col, automl_users),
        labels=('All', 'AutoML users'),
        ans_order=answer_order.get(col)
    )

First, let's try to validate the idea that AutoML might be prohibitively expensive for personal use (or even smaller-scale business use):

In [None]:
plt.title('Approximately how much money have you (or your team) spent on machine learning and/or cloud computing services ... in the past 5 years?')
compare_singlecol('Q25')

Indeed, AutoML users tend to skew toward spending more money on ML than the general population. This suggests that the companies able to afford AutoML tend to be those that already have a well-funded ML department or program.
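
To read the skew as numbers rather than bar lengths, the two normalized distributions can be put side by side. This is a minimal sketch, not part of the original analysis; `compare_shares` is a hypothetical helper, and the demo counts are made up for illustration only:

```python
def compare_shares(all_counts, subset_counts, order=None):
    """Return (answer, share_all, share_subset, difference) rows.

    Both arguments are {answer: count} dictionaries, i.e. the shape that
    get_counts_for_singlecol_q returns in this notebook.
    """
    answers = order or list(all_counts)
    total_all = sum(all_counts.get(a, 0) for a in answers)
    total_subset = sum(subset_counts.get(a, 0) for a in answers)
    if total_all == 0 or total_subset == 0:
        raise ValueError('both groups need at least one counted answer')
    rows = []
    for answer in answers:
        share_all = all_counts.get(answer, 0) / total_all
        share_subset = subset_counts.get(answer, 0) / total_subset
        rows.append((answer, share_all, share_subset, share_subset - share_all))
    return rows

# Hypothetical counts, for illustration only; these are NOT survey results.
demo_all = {'$1-$99': 50, '$100-$999': 30, '$1000-$9,999': 20}
demo_automl = {'$1-$99': 2, '$100-$999': 3, '$1000-$9,999': 5}

demo_rows = compare_shares(demo_all, demo_automl)
for answer, share_all, share_subset, diff in demo_rows:
    print(f'{answer:>14}  all={share_all:.2f}  automl={share_subset:.2f}  diff={diff:+.2f}')
```

With the real data, the inputs would be `get_counts_for_singlecol_q('Q25', all_applicable)` and `get_counts_for_singlecol_q('Q25', automl_users)`, with `answer_order['Q25']` as the order. A positive difference in the higher spending brackets would correspond to the skew visible in the chart.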

Moreover, as we can see from the next two graphs, people who use AutoML are more likely to belong to organizations with mature, well-established ML methods, with larger dedicated data science teams:

In [None]:
plt.title('Does your current employer incorporate machine learning methods into their business?')
compare_singlecol('Q22')

In [None]:
plt.title('Approximately how many individuals are responsible for data science workloads at your place of business?')
compare_singlecol('Q21')

So, companies which employ people with AutoML experience tend to have **bigger, better-funded, more mature ML and Data Science teams**. 

But what about the AutoML users themselves? Who are they, and how do they differ from the average population of Kagglers?

In [None]:

plt.title('Select the title most similar to your current role (or most recent title if retired)')
compare_singlecol('Q5')


AutoML users are much more likely to be Data Scientists or ML Engineers (and slightly more likely to be Research Scientists) - professionals who specialize in DS and ML, rather than generalist engineers trying to apply ML to their problems.

We can also verify the hypothesis that AutoML users are more likely to be ML specialists by analyzing how many different ML algorithms people in either group use on a regular basis:

In [None]:
fig, ax = plt.subplots(2,1)
ax[0].hist(all_applicable['num_algs'], bins=12)
ax[0].set_title('All applicable respondents')
ax[1].hist(automl_users['num_algs'], bins=12)
ax[1].set_title('AutoML users')
plt.suptitle('Total number of algorithms selected as being "used on a regular basis" (Q17)')
plt.tight_layout()
fig.subplots_adjust(top=0.8)
plt.show()

We can expect people with more ML expertise to be able to use many different algorithms, and choose algorithms that are most suitable for a given application. Non-experts, on the other hand, may learn one or two algorithms and apply them to everything (the [Maslow's Hammer](https://en.wikipedia.org/wiki/Law_of_the_instrument) phenomenon).

Indeed, AutoML users do tend to skew toward using more algorithms than the general population. In fact, the distribution for AutoML users is slightly bimodal, with a second spike at regularly using 10 algorithms (a.k.a. "all of the above"). Meanwhile, the distribution for the general population has a huge spike at 0 algorithms. In other words, a lot of Kagglers who have spent at least some money on ML-related cloud services nevertheless report not using any of the common ML algorithms.
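
A few summary numbers can back up what the histograms show, such as the share of respondents who selected no algorithm at all. This is a minimal sketch; `summarize_counts` is a hypothetical helper and the demo lists are invented values, not survey data:

```python
from statistics import mean, median

def summarize_counts(values):
    """Summarize per-respondent counts (e.g. the num_algs column)."""
    values = list(values)
    return {
        'n': len(values),
        'mean': mean(values),
        'median': median(values),
        # Fraction of respondents who selected nothing at all
        'share_zero': sum(1 for v in values if v == 0) / len(values),
    }

# Hypothetical values, for illustration only; these are NOT survey results.
demo_all_algs = [0, 0, 1, 2, 2, 3, 4, 10]
demo_automl_algs = [1, 3, 4, 5, 10]

demo_all_summary = summarize_counts(demo_all_algs)
demo_automl_summary = summarize_counts(demo_automl_algs)
print(demo_all_summary)
print(demo_automl_summary)
```

In the notebook this could be called as `summarize_counts(all_applicable['num_algs'].tolist())` and `summarize_counts(automl_users['num_algs'].tolist())`. A higher median and a lower `share_zero` for AutoML users would match the reading of the histograms above.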

In [None]:
plt.title('For how many years have you been writing code and/or programming?')
compare_singlecol('Q6')



AutoML users also tend to skew toward being more experienced programmers - even though one of the major selling points of AutoML tools is the ability to "do Machine Learning" without writing any code.

In [None]:
plt.title('Select any activities that make up an important part of your role at work: (Select all that apply)')
compare_multicol(q23_list_of_columns)


Perhaps most notably, AutoML users are less likely to be responsible for "influenc[ing] product or business decisions". Rather than focusing on **the application of ML** to business needs, they tend to have roles which focus on **ML directly** (advancing the state of the art of ML, improving existing ML models, building and running ML services).

I find this notable because, again, it seems to go against AutoML's goal of enabling many different types of people to apply ML to their business needs. Rather, at least for now, it seems the primary users of AutoML are ML experts trying to enhance or build on their already existing expertise.
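
One caveat when reading the multi-select chart: `bar_chart` normalizes each group by its total number of selections (`y_i/sum(y)`). Because respondents can tick several options, the bars are shares of all selections rather than fractions of respondents. The sketch below contrasts the two normalizations; both helper functions and the demo numbers are hypothetical, and the option labels are only placeholders:

```python
def selection_shares(counts):
    """Share of all selections: the normalization bar_chart applies."""
    total = sum(counts.values())
    return {option: n / total for option, n in counts.items()}

def respondent_fractions(counts, n_respondents):
    """Fraction of respondents in the group who ticked each option."""
    return {option: n / n_respondents for option, n in counts.items()}

# Hypothetical group of 10 respondents; with multi-select answers the
# counts can add up to more than the number of respondents.
demo_counts = {'Option A': 8, 'Option B': 6, 'Option C': 2}
demo_n_respondents = 10

demo_selection = selection_shares(demo_counts)
demo_respondent = respondent_fractions(demo_counts, demo_n_respondents)
print(demo_selection)
print(demo_respondent)
```

For the AutoML group, the per-respondent version would be `respondent_fractions(get_counts_for_multicol_q(q23_list_of_columns, automl_users), len(automl_users))`. If one group ticks more options per person on average, the two normalizations can rank the groups differently for the same option, so it is worth checking both before drawing conclusions.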

# Is that good?

<p style="
    text-align: center;
"><img src="https://imgs.xkcd.com/comics/machine_learning.png" alt="xkcd: Machine Learning">
<a href="https://xkcd.com/1838/">xkcd: Machine Learning</a></p>

The evidence suggests that so far, AutoML is being used more by experts to help make their job easier and faster than by non-experts hoping to democratize ML. There could be several reasons for this apparent disconnect with the motivation behind AutoML. Part of this may be because AutoML is still in the "early adopter" phase, and early adopters tend to be experts. Another (related) reason may be that using these systems still requires expertise: even in systems which require no programming, the user is expected to make decisions like which columns to drop and how to balance performance vs. speed before they can even get started on the ML part. 

Whether this apparent disconnect is bad or good - whether it's a good idea to enable "hundreds of thousands of developers" to use ML at this stage in the evolution of ML as a field - is a separate question. And I think the answer depends on how ML is used now, and how that will change with the advent of AutoML.

Certain aspects of Machine Learning will be impossible to automate until we invent Skynet. These are human, creative aspects which require the developer to understand the context of the problem being solved, such as:

- Collecting the right data that will enable you to solve your problem; ensuring your input data contains the information you want the system to learn
- Choosing a target metric which actually represents the problem you're trying to solve, and does not fall prey to the [streetlight effect](https://en.wikipedia.org/wiki/Streetlight_effect) or [Goodhart's law](https://en.wikipedia.org/wiki/Goodhart%27s_law)
- Incorporating domain knowledge into feature engineering and model selection

But these human aspects of ML can easily be neglected, even by experts. In the wild, it is easy to find examples of:

- Optimizing for something that's easy to quantify instead of the thing you're actually trying to improve
- Making initial assumptions (e.g. of linearity) which cripple your advanced algorithm and make it entirely unnecessary
- Conversely, accepting hyperparameters suggested by a grid search even though they invalidate your initial assumptions 
- Focusing on specific success metrics (e.g. AUC/ROC) because they are familiar to you, regardless of how applicable they are to the problem at hand
- Trying to extract information that's simply not present in the underlying data (and almost inevitably falling into the overfitting/p-hacking trap as you try to get at least *some* result)
- Trying to naively apply ML to extrapolation problems, where the patterns that were common in the past are not likely to be applicable to the future

Would "democratizing" ML make these kinds of mistakes more or less likely to occur? 

Would AutoML open doors for people who may have great ideas about how to address these human aspects, but haven't had a chance to try until now? Or would it encourage many more people to take a "black box" approach to ML, by making it easy to do well on the standard metrics without having to reason about whether a solution makes sense from the "human aspects" perspective? 

H2O.ai's [Driverless AI](https://docs.h2o.ai/driverless-ai/latest-stable/docs/userguide/index.html) is currently one of the more popular AutoML solutions. The name "Driverless" certainly does seem to suggest a "no need to worry about what is happening and how" philosophy, which does not bode well in this context.

[This blog post](https://www.h2o.ai/blog/kaggle-grand-masters-recipes-production-ready-clicks/) describes the parts of ML that Driverless AI seeks to automate. In particular, it says:

> we are trying to mimic what top data science teams would do when they need to develop a new machine learning pipeline

---

> We call this part of Driverless AI “Kaggle Grand Masters in a Box”. It is essentially the best data science practices, tricks and **creative feature engineering** of our Kaggle Grand Masters translated into an artificial intelligence (AI) platform.

(again, emphasis mine)

The idea of automating the **creative** process of "top data science teams" - of putting Grand Masters "in a box" - necessarily implies that the main contributions of these data scientists are automatable: that replicating the "recipes" that top-level Kaggle competitors tend to use, and then seeing which ones stick, is sufficient for capturing their expertise in an automated system.

This, again, does not bode well for the idea of putting more rather than less emphasis on the importance of human aspects of ML. Although the documentation and marketing of Driverless AI do not claim to automate aspects like data collection or selecting a metric for optimization, they also tend to place no emphasis on these aspects at all. The documentation *does* come dangerously close to making the seductive claim that users can leave creative and subjective parts of feature engineering to an automated system. A system that, no matter how clever it is, cannot have any domain knowledge when it makes these decisions.


A non-expert drawn in by the promise of democratized ML may well have some trouble noticing the distinction between what Driverless AI does or does not claim to do. So it is perhaps good that, at least in [its current documentation](http://docs.h2o.ai/driverless-ai/latest-stable/docs/userguide/why_dai.html), Driverless AI doesn't actually try to sell itself as a "democratization" product - in fact, it seems directly geared toward empowering existing experts and making their job easier:

> With not enough data scientists to fill the increasing demand for data-driven business processes, H2O.ai offers Driverless AI, which automates several time consuming aspects of a typical data science workflow, including data visualization, feature engineering, predictive modeling, and model explanation.

---

> Driverless AI empowers data scientists or data analysts to work on projects faster and more efficiently by using automation and state-of-the-art computing power to accomplish tasks in just minutes or hours instead of the weeks or months that it can take humans.



So, for at least one AutoML system, the trends we see in the current AutoML userbase are evidence of the product working as intended. This driverless system should be operated by an experienced driver who is paying attention. [This is probably fine.](https://cal.streetsblog.org/2020/06/03/surprise-even-partial-automation-is-encouraging-drivers-not-to-pay-attention/)


Then again, perhaps none of it matters. After all, humans are very adept at making these same mistakes in optimization problems that *don't* involve ML. From startups focusing on "day-one retention" instead of whether their app is actually serving the intended purpose, to individuals becoming obsessed with climbing leaderboards or corporate ladders - to err in the human aspects of decision-making is human.