In [None]:
import matplotlib.pyplot as plt
import numpy as np
import os
import re
import pandas as pd
import seaborn as sns
from textwrap import wrap

plt.rcParams["figure.dpi"] = 100

# Note
This notebook is a general overview of the survey, before we pick one single topic to analyze in depth. 

In [None]:
df = pd.read_csv("../input/kaggle-survey-2021/kaggle_survey_2021_responses.csv")
df.head()

In [None]:
df.dtypes

In [None]:
df.isnull().to_numpy().sum(axis=0) / len(df) * 100

In [None]:
# Indices of the columns with less than 50% NaN values.
np.where(df.isnull().to_numpy().sum(axis=0) / len(df) * 100 < 50)
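
The two cells above return bare arrays, so the column names are lost. A minimal sketch of the same check that keeps the names, run on a toy table (the columns and values below are made up for illustration, not taken from the survey):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the survey table (hypothetical columns and values).
toy = pd.DataFrame({
    "Q1": ["25-29", "30-34", "18-21", "22-24"],
    "Q7_Part_1": ["Python", np.nan, "Python", "Python"],
    "Q7_Part_2": [np.nan, np.nan, np.nan, "R"],
})

# Share of missing values per column, in percent, indexed by column name.
missing_pct = toy.isna().mean() * 100

# Names of the columns with less than 50% missing values.
mostly_filled = missing_pct[missing_pct < 50].index.tolist()
print(mostly_filled)
```

Working with a Series instead of a NumPy array means the result can be sorted or filtered by column name directly.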

In [None]:
df["Q5"].unique()

In [None]:
df["Q5"].to_numpy()[1:]
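
The `[1:]` slice is there because the first row of the table holds the question text rather than an answer. A small sketch of splitting the two apart once, assuming that layout (the toy question and answers are placeholders):

```python
import pandas as pd

# Toy column laid out like the survey: row 0 is the question text.
toy = pd.DataFrame({
    "Q5": [
        "Select the title most similar to your current role",
        "Student",
        "Data Scientist",
        "Student",
    ]
})

# Keep the question text separately, then drop it from the answers.
questions = toy.iloc[0]
responses = toy.iloc[1:].reset_index(drop=True)

# Count answers without the question row getting in the way.
counts = responses["Q5"].value_counts()
print(questions["Q5"])
print(counts)
```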

In [None]:
# By Andrada Olteanu
# https://www.kaggle.com/andradaolteanu/siim-covid-19-box-detect-dcm-metadata
def show_values_on_bars(axs, h_v="v", space=0.4):
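    '''Writes each bar's value at the end of the bar in a seaborn barplot.
    axs: a single Axes or a numpy array of Axes
    h_v: whether the barplot is vertical ("v") or horizontal ("h")
    space: horizontal offset of the label, only used when h_v == "h"'''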
    
    def _show_on_single_plot(ax):
        if h_v == "v":
            for p in ax.patches:
                _x = p.get_x() + p.get_width() / 2
                _y = p.get_y() + p.get_height()
                value = int(p.get_height())
                ax.text(round(_x, 5), round(_y, 5), format(round(value, 5), ','), ha="center") 
        elif h_v == "h":
            for p in ax.patches:
                _x = p.get_x() + p.get_width() + float(space)
                _y = p.get_y() + p.get_height()
                value = int(p.get_width())
                ax.text(_x, _y, format(value, ','), ha="left")

    if isinstance(axs, np.ndarray):
        for idx, ax in np.ndenumerate(axs): _show_on_single_plot(ax)
    else: _show_on_single_plot(axs)

In [None]:
def our_countplot(column):
    ax = sns.countplot(x=df[column].to_numpy()[1:])
    plt.title("\n".join(wrap(df[column].to_numpy()[0], 60)))
    plt.grid(axis="y")
    show_values_on_bars(ax)
    _ = plt.xticks(rotation=90)
    
    
def re_filter(query):
    # Collect every column whose name starts with the query (e.g. "Q7_Part_1", ...).
    r = re.compile(query)
    newlist = list(filter(r.match, df.columns.to_list()))
    
    # Use the first non-NaN answer of each column as its label.
    columns = [df[col].iloc[1:].unique().astype("str") for col in newlist]
    columns = [g[g != "nan"][0] for g in columns]
    
    g = df[newlist].count() - 1  # subtract 1 to exclude the first row (the question text)
    g.index = columns
    g = g.reset_index(level=0)
    g.columns = ["index", "count"]
    return g, columns, newlist[0]
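
To make the counting in `re_filter` easier to follow, here is a minimal sketch of the same idea on a toy multi-select question. It assumes one column per option, NaN where the option was not ticked, and uses `DataFrame.filter` instead of the `re` module; the question text and answers are invented:

```python
import numpy as np
import pandas as pd

# Toy multi-select question: one column per option, NaN when not selected.
toy = pd.DataFrame({
    "Q7_Part_1": ["What languages do you use? - Python", "Python", np.nan, "Python"],
    "Q7_Part_2": ["What languages do you use? - SQL", np.nan, "SQL", np.nan],
    "Q8": ["Which language would you recommend?", "Python", "Python", "R"],
})

# Drop the question-text row, then keep only the Q7 option columns.
answers = toy.iloc[1:]
part_cols = answers.filter(regex=r"^Q7_").columns

# Non-null cells per column = number of respondents who picked that option.
counts = answers[part_cols].count()

# Label each column with its first non-missing answer.
counts.index = [answers[col].dropna().iloc[0] for col in part_cols]
print(counts)
```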

In [None]:
def our_plot(query):
    try: our_countplot(query)
    except KeyError:
        out, columns, title_col = re_filter(query)
        ax = sns.barplot(x="index", y="count", data=out)
        show_values_on_bars(ax)
        plt.xticks(rotation=90)
        plt.grid(axis="y")
        plt.title("\n".join(wrap(df[title_col].to_numpy()[0].split("?")[0] + "?", 60)))

In [None]:
our_plot("Q1")

In [None]:
our_plot("Q2")

Most respondents doing ML are men. 

In [None]:
plt.figure(figsize=(12, 7))
our_plot("Q3")

And they are mainly from India and the US. 

In [None]:
our_plot("Q4")

In [None]:
our_plot("Q5")

In [None]:
our_plot("Q6")

In [None]:
our_plot("Q7")

Python and SQL are among the most popular languages.

In [None]:
our_plot("Q9")

Jupyter Notebook and VSCode are the most popular editors.

In [None]:
our_plot("Q10")

Kaggle notebooks, Colab notebooks, and "None" (none of the above, `Q10_Part_16`) are among the top 3. 

In [None]:
our_plot("Q11")

In [None]:
our_plot("Q12")

There are still lots of people who don't use any GPU. For training, NVIDIA GPUs are the most popular, followed by Cloud TPUs. 

In [None]:
our_plot("Q13")

TPUs aren't very popular among people yet. People who do competitions may start using them. Most have only tried them out because they heard of them, or followed a fad, used them a few times and perhaps found them too difficult to use. 

In [None]:
our_plot("Q14")

Although some people use the fancier plotly or ggplot, most still stick to matplotlib and seaborn as their main plotting libraries because they are easy to use and popular. 

In [None]:
our_plot("Q15")

ML is relatively new and most people have at most 3 years of experience. 

In [None]:
our_plot("Q16")

Despite the rising popularity of DL frameworks, a basic ML framework like scikit-learn is still the most used, as it has a long history; and for tabular data, XGBoost is still very popular. 

In [None]:
our_plot("Q17")

This really reflects the kinds of problems people are facing. Most usage is linear/logistic regression and decision trees/random forests, so perhaps most ML problems people encounter are in that area (one guesses they involve tabular data). For image data, people use CNNs, for example. Image problems also seem to be tackled more often than NLP ones, as fewer people use Transformer networks. 

However, this analysis isn't accurate, as Transformers aren't limited to NLP anymore, and we could easily change the context, say by representing tabular data as images, for easier and more accurate predictions. 

In [None]:
our_plot("Q18")

It seems like this is still problem-dependent. 

In [None]:
our_plot("Q19")

Same here.

In [None]:
our_plot("Q20")

## Important
Where is ML used the most (apart from respondents who are students, PhD or not)? Where is ML lacking people? **Where would we want to develop ML in the future?** 

In [None]:
our_plot("Q21")

"Quite" balanced.

In [None]:
our_plot("Q22")

Most are either small teams or very large teams. Rarely in the middle. 

In [None]:
our_plot("Q23")

This requires looking at past data to see whether there is an increase in the industry, i.e. whether ML is used more as time passes. Then we can decide whether to provide more guidance on specifics, or identify what is lacking. 
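
A rough sketch of what such a year-over-year comparison could look like. It assumes an earlier survey is available with a comparable question; the two toy frames, the `year` column and the answer labels are all hypothetical and do not reflect real results:

```python
import pandas as pd

# Hypothetical answers from two survey years (made-up values, not real results).
survey_2020 = pd.DataFrame({"Q23": ["No", "No", "Exploring", "Yes"]})
survey_2021 = pd.DataFrame({"Q23": ["No", "Exploring", "Yes", "Yes"]})

# Stack both years into one table, tagging each row with its year.
combined = pd.concat(
    [survey_2020.assign(year=2020), survey_2021.assign(year=2021)],
    ignore_index=True,
)

# Share of each answer within its year: one row per year, one column per answer.
shares = (
    combined.groupby("year")["Q23"]
    .value_counts(normalize=True)
    .unstack(fill_value=0)
)
print(shares)
```

Comparing shares rather than raw counts keeps the years comparable even if the number of respondents differs.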

In [None]:
our_plot("Q24")

Although there is a variety of use cases, ML is still mostly used to analyze and understand data, so it serves more as an add-on to EDA/data analysis than for deployment per se. 

MLaaS is not yet very popular compared to understanding data. 

It also seems that people doing research either didn't take part in this survey or aren't very active on Kaggle (if they have an account). Perhaps Kaggle is mostly for practical ML? 

In [None]:
our_plot("Q25")

Previously we saw there are 6804 students, and one can guess most students don't earn a lot (or have no earnings). 

In [None]:
our_plot("Q26")

This depends on what people are using it for. Hobbyists or people with lower earnings don't particularly want to spend money. People with more expert use cases might decide to spend more. People in large companies (teams) spend a lot (and earn a lot). To check such relationships, the data has to be analyzed on a per-row (instead of per-column) basis. 
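
One way to do that per-row analysis is a cross-tabulation of two answers from the same respondent. The sketch below uses `pd.crosstab` on toy data, assuming Q25 holds compensation and Q26 holds spending as in the plots above; the bucket labels are placeholders, not the survey's exact wording:

```python
import pandas as pd

# Toy respondents: each row pairs one person's compensation with their spending.
toy = pd.DataFrame({
    "Q25": ["$0-999", "$0-999", "$100,000-124,999", "$100,000-124,999", "$0-999"],
    "Q26": ["$0 ($USD)", "$0 ($USD)", "$10,000-$99,999", "$0 ($USD)", "$1-$99"],
})

# Rows: compensation bucket. Columns: spending bucket.
# normalize="index" turns each row into shares that sum to 1.
table = pd.crosstab(toy["Q25"], toy["Q26"], normalize="index")
print(table)
```

Each row then reads as "of the people in this compensation bucket, what fraction spends how much", which is the relationship the column-wise plots cannot show.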

In [None]:
our_plot("Q27_A")

AWS, GCP and Azure are the most popular cloud platforms. Which one to use depends on the situation, personal preference, which platform one was introduced to first, and which is easier to use based on past experience. 

In [None]:
our_plot("Q28")

Some people prefer one platform over another, perhaps because it was the first one they were introduced to, they find it easy to use, and they have grown fond of that particular service. Others may be able to switch between platforms and have a similarly enjoyable experience on each (expertise is really what makes you enjoy things: if you are expert enough to switch between them, you won't feel any difficulty). The starting point for each service also matters, as some overwhelm newcomers with their interface and everything that can be done more than others do. 

In [None]:
our_plot("Q29_A")

In [None]:
print(f"Uses Amazon EC2: {2270/3721*100:.2f}%")
print(f"Uses Azure VM: {1503/2450*100:.2f}%")
print(f"Uses GCE: {1960/3142*100:.2f}%")

VM usage is quite balanced across providers. VMs are good because, unlike a notebook interface, they let you train something overnight. It's also nice that you can "see with your own eyes" how the training progresses. With a background training platform like Azure ML Studio, you can't make changes once everything has started (you can still see the progress, but only via logs rather than through a Jupyter notebook). 
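
The percentages printed above can be computed in one pass instead of line by line. This sketch reuses the same numbers and assumes the denominators are the per-platform user counts read off the Q27_A plot; the dictionary names are only for illustration:

```python
# Users per cloud platform (assumed to be the Q27_A counts used as denominators above).
platform_users = {"AWS": 3721, "Azure": 2450, "GCP": 3142}

# Users of each platform's VM product (the numerators used above).
vm_users = {"AWS": 2270, "Azure": 1503, "GCP": 1960}

# Share of each platform's users who also use its VM product, in percent.
share = {
    name: vm_users[name] / platform_users[name] * 100
    for name in platform_users
}

for name, pct in share.items():
    print(f"{name}: {pct:.2f}%")
```

The same dictionary of denominators can be reused for other product families by swapping the numerators.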

In [None]:
our_plot("Q30_A")

S3 and GCS are the most used for storage. They are perhaps also the best choice, especially if you're training on the cloud. 

In [None]:
our_plot("Q31_A")

By comparison, more people use VMs than ML products. Let's look at the main ones. 

In [None]:
print(f"Uses Amazon Sagemaker: {991/3721*100:.2f}%")
print(f"Uses Azure ML Studio: {945/2450*100:.2f}%")
print(f"Uses GC Vertex AI: {714/3142*100:.2f}%")

Relative to each platform's user base, Azure users make noticeably more use of their ML product (Azure ML Studio) than AWS or GCP users do. 

In [None]:
our_plot("Q32_A")

Most people use MySQL, but others such as PostgreSQL, SQLite, MongoDB and Microsoft SQL Server are slightly less used and still have lots of users. This may come down to different use cases, which type of SQL people were first exposed to, and how many products support each dialect. For example, Google Cloud BigQuery is only used with GCP, and its dialect differs slightly from MySQL. One admits it's easier to use, but it only runs on the BigQuery engine, which requires payment, and that isn't what people with low earnings or who aren't willing to pay want. Furthermore, it's restricted to the BigQuery interface and not very transferable. MySQL skills, on the other hand, transfer easily to BigQuery, as BigQuery adds built-ins for easier and shorter queries but is still very similar to MySQL. 

There is also a comparable number of people who don't use any SQL. Perhaps they don't have large datasets or never use them (large datasets really are easier to handle with SQL than with Python). 

In [None]:
our_plot("Q33")

MySQL, PostgreSQL and Microsoft SQL Server are among the most popular big data products nowadays. Past data is needed to see whether the use of other products, like BigQuery and MongoDB, is rising. Again, if there is a common language that everyone can learn (MySQL), skills become more transferable, and it will be used more and be more worth learning. 

In [None]:
our_plot("Q34_A")

A large group doesn't use any BI tool. BI tools are very specific in their use cases, and if there is no reason to use one, people won't. Among those who do, Microsoft Power BI and Tableau are the most popular. Past data is needed to check whether Google Data Studio is catching up. 

In [None]:
our_plot("Q35")

Whether to use Tableau or Power BI (or others) really depends on the use case. What does your company prefer? What is your personal preference? Which is easier to use? Does the product allow you to analyze what you want/expect? What is the cost of the product (if you're paying for it)? 

As of now, 
- Tableau Creator (for individuals) costs 70 USD per user per month.  
- Power BI Pro costs 9.99 USD per user per month. 
- Google Data Studio is free.

In [None]:
our_plot("Q36_A")

Some AutoML tools target particular aspects. For example, automated data augmentation is mostly for images, so people who don't face image problems won't use it much. Not many people use model architecture search, perhaps because model selection is sufficient. It's worth asking whether architecture search improves competition scores a lot, or whether it really matters if you just need something good enough. 

In [None]:
our_plot("Q37_A")

From Q36_A and Q37_A, it seems AutoML isn't very popular among Kaggle users. It may be more popular among business users who don't want to code or who want something quick and easy, but we can't know that without the data. Also, AutoML requires payment (renting the provider's VMs to perform the computation). 

In [None]:
our_plot("Q38_A")

Lots of people still don't use tools to manage ML experiments. Either it isn't required, because the experiments are small and easily recorded in Excel, a notepad or similar, or it requires learning a library and they haven't had time to touch it yet (until required). TensorBoard seems to be a good visualization tool that comes with TensorFlow callbacks (and PyTorch has integrated it as well). MLflow comes second, and WandB is a rising star. 

- TensorBoard v0.1.4 was released on 17 Aug 2017, v1.5 (a jump to catch up with the TF version) on 26 Jan 2018, and v2 on 20 Sep 2019.
- MLflow v0.2.0 was released on 28 Jun 2018, v1 on 04 Jun 2019.
- WandB v0.8.15 was released on 07 Nov 2019 and it is currently on v0.12.5. 

It seems that the older the release, the more people use the tool for now. Newer tools are catching up, while others are less popular, perhaps for other reasons: they are not easy to use, require payment, face more competitive products, etc. 

In [None]:
our_plot("Q39")

Among people who share their work publicly, GitHub is the most popular platform, followed by Kaggle and Colab. There are also people writing personal blogs to teach others. 

In [None]:
our_plot("Q40")

It's worth noting that university courses aren't very popular yet (and are still being refined at most universities, which have only just started teaching AI courses), hence a lot of the learning is self-taught. Or maybe people just rush in to play with it directly and learn the hard way! 

In [None]:
our_plot("Q41")

People who write code use their development environment to analyze data. For people who don't write code (or for tasks more easily done without code), an Excel-like platform makes things easier. Other software requires expertise to use, though it may also be a good skill to learn! 

In [None]:
our_plot("Q42")

Really, learning from the most used platforms is also the easiest, with no need to search further. It also depends on the person, for example whether they already use Twitter and follow data science influencers. It's worth noting that blogs contain lots of useful information, so if people are so used to reading blogs, perhaps you could consider writing one to help others out. Otherwise, YouTube channels with video teaching are also useful, especially when text-based teaching is confusing and having someone spoon-feed you is easier. 

In [None]:
our_plot("Q27_B")

In [None]:
our_plot("Q29_B")

In [None]:
# our_plot("Q30_B")

We don't plot the one above as it raises an error. In fact, there is no information in it. That may mean that non-professionals have no intention of becoming more familiar with these data storage products, or that, because they require payment, it's not worth becoming familiar with them. 

Although one would say they are worth learning, for business and to make yourself more skillful.
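
The exact error isn't recorded here. One plausible cause, assuming the matched columns contain no answers at all, is that the label lookup `g[g != "nan"][0]` in `re_filter` indexes into an empty array. A minimal sketch of a guard for that case, where `first_label` is a hypothetical helper and the toy columns are invented:

```python
import numpy as np
import pandas as pd


def first_label(column):
    """Return the first non-missing answer below the question row, or None."""
    answers = column.iloc[1:].dropna()
    return answers.iloc[0] if len(answers) else None


# Toy columns: row 0 is the question text; Part_1 has no answers at all.
toy = pd.DataFrame({
    "Q30_B_Part_1": ["Question text - Option A", np.nan, np.nan],
    "Q30_B_Part_2": ["Question text - Option B", "Option B", np.nan],
})

labels = {col: first_label(toy[col]) for col in toy.columns}

# Columns without any answer can then be skipped instead of raising.
plottable = [col for col, label in labels.items() if label is not None]
print(labels)
print(plottable)
```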

In [None]:
our_plot("Q31_B")

Hah, is SageMaker not something people wish to become more familiar with? When one used it in April 2021, starting up a VM on SageMaker took very long. Also, it doesn't retain the state of the machine (you need to pip install whatever is not pre-installed, just like on Colab). ML Studio, on the other hand, takes quite some time to create the VM initially, but afterwards, if the VM is not deleted, it starts up quite quickly. Vertex AI's Jupyter notebook is the fastest to start up, in one's experience. 

In [None]:
our_plot("Q32_B")

MongoDB is indeed a rising star in popularity. Good old MySQL stays popular. 

In [None]:
our_plot("Q34_B")

And yes, Tableau is a very popular platform for BI. Data Studio is worth learning as it is free. Microsoft users get Power BI if they buy its package, hence they prefer Power BI over Tableau (in terms of pricing, unless they have enough money to spend on both). 

In [None]:
our_plot("Q36_B")

In [None]:
our_plot("Q37_B")

In [None]:
our_plot("Q38_B")