# An analysis of the 2019 Kaggle ML and DS Survey for Vim/Emacs users
![](https://www.kialo.com/images/f8ba939f-e548-4bb4-92a6-9359e7cf9088_1200x630_stretched.jpeg)

## Introduction

Before Atom, before Sublime Text, before Notepad++, before Microsoft Word, before Notepad, and even before WordPerfect, there were Vi and Emacs, both of which came out in 1976. Programmers of that era had to write inside a terminal, which meant they could not use a mouse to click and move their cursor. They wrote code using just a keyboard, in either Vi (the predecessor of Vim) or Emacs, two of the oldest text editors. In fact, many people still prefer these text editors despite their steep learning curve, because when you don't have to reach for a mouse, you can write and edit code a lot faster.

The arguments over which is better, Vim or Emacs, are fiercer than the arguments over tabs versus spaces. However, Kaggle has decided to group the two text editors together (which likely disappointed those who consider them very different). Even so, grouping them makes sense, since they are the two most popular text editors that run inside the terminal. The people who use Vim/Emacs as their main text editor are somewhat legendary in the coding community because they have put in the years required to learn to use them efficiently.

As an anecdote, my professors in Computer Science all used Vim and they encouraged us to use it too, but I would say > 90% of the class (including me) did not use it because your typing is slowed so much when you have to look up so many different keyboard shortcuts/commands. Nowadays, I use Vim rarely, and only when I am SSHed into an EC2 server and need to edit 1 or 2 lines of code (as opposed to writing ALL the code inside of it). I instead prefer to upload my completed code as a file to the cloud, or work in a Jupyter Notebook in the cloud and then convert the .ipynb to a Python file. Therefore I expect that many people do not use Vim/Emacs because of their incredibly steep learning curve.

## Hypotheses

So, who are these legends who still prefer to use Vim/Emacs? I hypothesize that they are either very old people (which means they used Vim/Emacs when they first came out), or they are people who almost exclusively use servers/cloud services (since many times when you SSH into a server, you may only have access to the terminal; there is often no ability for you to launch an application like Notepad). Additionally, I hypothesize that these people likely write code for the majority of their work day and do not hold manager positions but instead are full-time programmers (because why would a manager ever learn Vim/Emacs?).

## Methodology
This notebook focuses on the Vim/Emacs users in ML and DS and compares their responses to those of people who do not use these text editors. As we step through, we will test the hypotheses through data analysis.

In [None]:
# import the necessary libraries
import numpy as np 
import pandas as pd

# Visualisation libraries
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
sns.set()

# Graphics in retina format 
%config InlineBackend.figure_format = 'retina' 

# Increase the default plot size and set the color scheme
plt.rcParams['figure.figsize'] = 16, 10
#plt.rcParams['image.cmap'] = 'viridis'


import os

# Disable warnings in Anaconda
import warnings
warnings.filterwarnings('ignore')

In [None]:
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)

In [None]:
os.listdir('../input/kaggle-survey-2019/')

In [None]:
# Importing the 2019 Dataset
df_2019 = pd.read_csv('../input/kaggle-survey-2019/multiple_choice_responses.csv')
df_2019.columns = df_2019.iloc[0]
df_2019=df_2019.drop([0]) # The first row just contains the column names, so we can drop it.

In [None]:
# Create a boolean column if they use Vim/Emacs.
# This is mainly so we don't have to refer to such a long column name every time.
df_2019['vim/emacs_user'] = '  Vim / Emacs  ' == df_2019["Which of the following integrated development environments (IDE's) do you use on a regular basis?  (Select all that apply) - Selected Choice -   Vim / Emacs  "] 

# Hypothesis 1
Hypothesis 1: Very old people use Vim/Emacs, because they used it when it came out and therefore don't see the need to change to newer text editors.

To explore this, we will convert the "Age" column from categorical to numerical (by replacing each age bracket with an age drawn uniformly at random from within that bracket) and compare the kernel density plot of the Vim/Emacs users against that of the non-Vim/Emacs users. A kernel density plot is essentially a smoothed estimate of the probability density function (PDF) of a variable.

In [None]:
df_2019['numerical_age'] = 0 # Initialize the numerical_age column
np.random.seed(2019) # Set random seed for reproducibility
df_2019.loc[df_2019['What is your age (# years)?']=='18-21', 'numerical_age'] = np.random.uniform(low=18,high=22,size=df_2019.loc[df_2019['What is your age (# years)?']=='18-21'].shape[0])
df_2019.loc[df_2019['What is your age (# years)?']=='22-24', 'numerical_age'] = np.random.uniform(low=22,high=25,size=df_2019.loc[df_2019['What is your age (# years)?']=='22-24'].shape[0])
df_2019.loc[df_2019['What is your age (# years)?']=='25-29', 'numerical_age'] = np.random.uniform(low=25,high=30,size=df_2019.loc[df_2019['What is your age (# years)?']=='25-29'].shape[0])
df_2019.loc[df_2019['What is your age (# years)?']=='30-34', 'numerical_age'] = np.random.uniform(low=30,high=35,size=df_2019.loc[df_2019['What is your age (# years)?']=='30-34'].shape[0])
df_2019.loc[df_2019['What is your age (# years)?']=='35-39', 'numerical_age'] = np.random.uniform(low=35,high=40,size=df_2019.loc[df_2019['What is your age (# years)?']=='35-39'].shape[0])
df_2019.loc[df_2019['What is your age (# years)?']=='40-44', 'numerical_age'] = np.random.uniform(low=40,high=45,size=df_2019.loc[df_2019['What is your age (# years)?']=='40-44'].shape[0])
df_2019.loc[df_2019['What is your age (# years)?']=='45-49', 'numerical_age'] = np.random.uniform(low=45,high=50,size=df_2019.loc[df_2019['What is your age (# years)?']=='45-49'].shape[0])
df_2019.loc[df_2019['What is your age (# years)?']=='50-54', 'numerical_age'] = np.random.uniform(low=50,high=55,size=df_2019.loc[df_2019['What is your age (# years)?']=='50-54'].shape[0])
df_2019.loc[df_2019['What is your age (# years)?']=='55-59', 'numerical_age'] = np.random.uniform(low=55,high=60,size=df_2019.loc[df_2019['What is your age (# years)?']=='55-59'].shape[0])
df_2019.loc[df_2019['What is your age (# years)?']=='60-69', 'numerical_age'] = np.random.uniform(low=60,high=70,size=df_2019.loc[df_2019['What is your age (# years)?']=='60-69'].shape[0])
df_2019.loc[df_2019['What is your age (# years)?']=='70+', 'numerical_age']   = np.random.uniform(low=70,high=90,size=df_2019.loc[df_2019['What is your age (# years)?']=='70+'].shape[0])
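
The repeated assignments above all follow one pattern: pick the rows in a bracket, then draw one uniform value per row. As a rough sketch only, the same idea can be written as a loop over a bracket-to-bounds mapping. The names `toy`, `AGE_BOUNDS`, and `sample_within_brackets` are hypothetical and not part of the original notebook, and this version is not guaranteed to reproduce the exact draws of the cell above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for df_2019; only an age-bracket column is needed.
toy = pd.DataFrame({'age_bracket': ['18-21', '25-29', '70+', '25-29', '60-69']})

# Bracket label -> (low, high) bounds, mirroring the bounds used in the cell above.
AGE_BOUNDS = {
    '18-21': (18, 22), '22-24': (22, 25), '25-29': (25, 30), '30-34': (30, 35),
    '35-39': (35, 40), '40-44': (40, 45), '45-49': (45, 50), '50-54': (50, 55),
    '55-59': (55, 60), '60-69': (60, 70), '70+': (70, 90),
}

def sample_within_brackets(labels, bounds, seed=2019):
    """Draw one uniform value per row from the range of that row's bracket."""
    rng = np.random.RandomState(seed)
    # Rows whose label matches no bracket stay NaN (the cell above leaves them at 0).
    out = pd.Series(np.nan, index=labels.index, dtype=float)
    for label, (low, high) in bounds.items():
        mask = labels == label
        if mask.any():
            out.loc[mask] = rng.uniform(low=low, high=high, size=int(mask.sum()))
    return out

toy['numerical_age'] = sample_within_brackets(toy['age_bracket'], AGE_BOUNDS)
print(toy)
```

Leaving unmatched rows as NaN instead of 0 is a design choice of this sketch: it keeps missing answers out of the density plot rather than piling them up at age 0.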

In [None]:
plt.rcParams['figure.figsize'] = 16, 10
ax = sns.kdeplot(df_2019.loc[~df_2019['vim/emacs_user'],'numerical_age'], shade=True, color='blue', label='Non-Vim/Emacs')
sns.kdeplot(df_2019.loc[df_2019['vim/emacs_user'], 'numerical_age'], shade=True, color='orange', label='Vim/Emacs', ax=ax)

Initially, from the kernel density plots we can see that the ages of Vim/Emacs users skew higher than those of non-users. However, the hypothesis was really focused on older people. Can we zoom in on people who are over 60 years old and view the kernel density plots again?

In [None]:
ax = sns.kdeplot(df_2019.loc[~df_2019['vim/emacs_user'],'numerical_age'], shade=True, color='blue', label='Non-Vim/Emacs')
sns.kdeplot(df_2019.loc[df_2019['vim/emacs_user'], 'numerical_age'], shade=True, color='orange', label='Vim/Emacs', ax=ax)
ax.set_ylim(bottom=0,top=0.003) # Zoom in where the plot is low
ax.set_xlim(left=55,right=95) # Zoom in for the high ages

It may look like there is a large number of people aged 80+ clumped on the right for Vim/Emacs users. However, remember that I sampled these ages from a uniform distribution, so the exact values within a bracket don't matter; what matters is the '70+' threshold. (Try changing the random seed and see how this plot changes!)

In fact, from this plot it appears that within this small cohort, the Vim/Emacs users have proportionately fewer people aged 60+ than non-Vim/Emacs users.
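
Because the sampled ages inside a bracket are arbitrary, one way to sanity-check this reading is to compare the bracket shares directly, with no random sampling involved. The following is only a sketch on made-up data: `toy`, `group`, `shares`, and `share_60_plus` are hypothetical names, and in the notebook the inputs would be the age question and the 'vim/emacs_user' flag of df_2019.

```python
import pandas as pd

# Hypothetical toy data standing in for the survey's age bracket and editor flag.
toy = pd.DataFrame({
    'age_bracket': ['25-29', '60-69', '30-34', '70+', '25-29', '30-34', '60-69', '25-29'],
    'vim/emacs_user': [True, False, True, False, False, True, False, False],
})

# Readable group labels instead of True/False.
toy['group'] = toy['vim/emacs_user'].map({True: 'Vim/Emacs', False: 'Non-Vim/Emacs'})

# Share of each age bracket within each group (every column sums to 1).
shares = pd.crosstab(toy['age_bracket'], toy['group'], normalize='columns')

# Combined share of respondents aged 60 or over in each group.
share_60_plus = shares.loc[['60-69', '70+']].sum()
print(share_60_plus)
```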

## Hypothesis 1 Conclusion

Given the kernel density plots, I conclude that the Vim/Emacs users are not the hoary legends I originally believed them to be. Remember, this only applies to those participants who filled out the Kaggle survey; therefore this population may not represent the entire population of people aged 60+ who use text editors.

What is most interesting to me is that the proportion of people aged 30-60 is higher for Vim/Emacs people than non-Vim/Emacs people, yet this is not true for people aged 60+. How can we explain this? Well, it takes many years to get used to these text editors due to their learning curve, and many young Kagglers simply don't have the time/effort to learn a new text editor when there are much more important things to learn (for example, I'd rather young people learn Deep Learning than learn to use an old text editor). But why are there proportionately fewer people aged 60+? Maybe because most of these people take on manager roles/no longer code due to their experience. Therefore they don't need to use something so technical like a text editor within the terminal but instead can rely on using more agreeable software (like Notepad or Microsoft Word).

Since the hypothesis originally stated that "very old" people use Vim/Emacs, I reject this and instead claim that people who use Vim/Emacs are more likely to be 30-60 than non-Vim/Emacs users.

# Hypothesis 2

Hypothesis 2: Vim/Emacs users almost exclusively use servers/cloud services (since many times when you SSH into a server, you may only have access to the terminal; there is often no ability for you to launch an application like Notepad). This hypothesis is mainly based on my own experience. Namely, why would I use Vim/Emacs when I can use something that supports a mouse like Atom or Sublime Text? The only reason I would use Vim/Emacs is if I'm forced to because I'm SSHed into a terminal and I can't launch an application from there. 

To explore this, we will look at the "Approximately how much money have you spent on machine learning and/or cloud computing products at your work in the past 5 years?" question (because high cloud usage requires high spending on cloud computing products). Additionally, we will look at the question "What is the primary tool that you use at work or school to analyze data?" for people who responded with a non-Jupyter notebook answer to 'Cloud-based data software & APIs (AWS, GCP, Azure, etc.)'. Additionally, we will look at the question "Which of the following cloud computing platforms do you use on a regular basis?" and look at the bias for people who answered 'None' versus answering with something else (e.g. GCP, AWS, Microsoft Azure, etc.)

First, let's explore the "how much money" question. As with age, we need to replace the categorical ranges with a numerical column sampled using np.random.uniform.

In [None]:
np.random.seed(2019)
df_2019['cloud_money'] = 0 # Initialize numerical column
df_2019.loc[df_2019['Approximately how much money have you spent on machine learning and/or cloud computing products at your work in the past 5 years?']=='$1-$99', 'cloud_money'] = np.random.uniform(low=1,high=100,size=df_2019.loc[df_2019['Approximately how much money have you spent on machine learning and/or cloud computing products at your work in the past 5 years?']=='$1-$99'].shape[0])
df_2019.loc[df_2019['Approximately how much money have you spent on machine learning and/or cloud computing products at your work in the past 5 years?']=='$100-$999', 'cloud_money'] = np.random.uniform(low=100,high=1000,size=df_2019.loc[df_2019['Approximately how much money have you spent on machine learning and/or cloud computing products at your work in the past 5 years?']=='$100-$999'].shape[0])
df_2019.loc[df_2019['Approximately how much money have you spent on machine learning and/or cloud computing products at your work in the past 5 years?']=='$1000-$9,999', 'cloud_money'] = np.random.uniform(low=1000,high=10000,size=df_2019.loc[df_2019['Approximately how much money have you spent on machine learning and/or cloud computing products at your work in the past 5 years?']=='$1000-$9,999'].shape[0])
df_2019.loc[df_2019['Approximately how much money have you spent on machine learning and/or cloud computing products at your work in the past 5 years?']=='$10,000-$99,999', 'cloud_money'] = np.random.uniform(low=10000,high=100000,size=df_2019.loc[df_2019['Approximately how much money have you spent on machine learning and/or cloud computing products at your work in the past 5 years?']=='$10,000-$99,999'].shape[0])
df_2019.loc[df_2019['Approximately how much money have you spent on machine learning and/or cloud computing products at your work in the past 5 years?']=='> $100,000 ($USD)', 'cloud_money'] = np.random.uniform(low=100000,high=200000,size=df_2019.loc[df_2019['Approximately how much money have you spent on machine learning and/or cloud computing products at your work in the past 5 years?']=='> $100,000 ($USD)'].shape[0])

In [None]:
ax = sns.kdeplot(df_2019.loc[~df_2019['vim/emacs_user'],'cloud_money'], shade=True, color='blue', label='Non-Vim/Emacs')
sns.kdeplot(df_2019.loc[df_2019['vim/emacs_user'], 'cloud_money'], shade=True, color='orange', label='Vim/Emacs', ax=ax)

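Wow! We can see that many Vim/Emacs users have actually spent 0 dollars in the past 5 years on ML/cloud computing products. That is kind of unexpected. Keep in mind that the cloud_money column starts at 0 for every row, so this peak also includes respondents who did not answer the question.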

What about the cloud services? I expect the majority of them NOT to use Notebook services.

In [None]:
# Some questions are text responses which therefore require a lookup into the 'other_text_responses.csv'
text_responses = pd.read_csv('../input/kaggle-survey-2019/other_text_responses.csv')
text_responses.columns = text_responses.iloc[0]
text_responses=text_responses.drop([0]) # The first row just contains the column names, so we can drop it.

In [None]:
df_2019['What is the primary tool that you use at work or school to analyze data? (Include text response) - Cloud-based data software & APIs (AWS, GCP, Azure, etc.) - Text'] = text_responses['What is the primary tool that you use at work or school to analyze data? (Include text response) - Cloud-based data software & APIs (AWS, GCP, Azure, etc.) - Text'].astype(str).str.upper()

In [None]:
# View the responses sorted by popularity
df_2019.loc[df_2019['vim/emacs_user'], 'What is the primary tool that you use at work or school to analyze data? (Include text response) - Cloud-based data software & APIs (AWS, GCP, Azure, etc.) - Text'].value_counts()

Interestingly, the majority of responses are NAN (indicating that they do not use cloud services). That is honestly shocking - that means that many of the Vim/Emacs users are actually using it locally! Wow!
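
The NAN entry shows up in the counts because the column was cast to strings before counting, which turned the missing values into text. A small sketch, on made-up answers, of how the blanks can be kept as genuinely missing and measured separately; `answers`, `normalized`, and `blank_share` are hypothetical names used only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical free-text answers; np.nan marks respondents who left the field blank.
answers = pd.Series(['aws sagemaker', np.nan, 'Google Colab', np.nan, 'google colab'])

# .str.upper() normalizes the case and leaves missing values as missing.
normalized = answers.str.upper()

# By default value_counts ignores missing values, so only real answers are ranked.
counts = normalized.value_counts()

# Passing dropna=False adds the missing values back as their own entry.
counts_with_missing = normalized.value_counts(dropna=False)

# Share of respondents who left the field blank.
blank_share = normalized.isna().mean()

print(counts)
print(blank_share)
```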

Furthermore, we can see from the responses that there are many Notebook-based options that people have responded with: GOOGLE COLAB, AWS SAGEMAKER, JUPYTER NOTEBOOK, and JUPYTER LAB.

The final response, "AZURE - UNFORTUNATELY" is absolutely hilarious. Come on, what's so bad? Granted, I am an AWS/GCP user, and I have used Azure once in my life, so maybe I have never been exposed to the "unfortunate" sides of it 😂

Finally, let's look at the question "Which of the following cloud computing platforms do you use on a regular basis?" and look for bias of people answering None versus not. As we saw in the previous question, it actually appears that the overwhelming majority of Vim/Emacs users do NOT use cloud services. Let's verify.

In [None]:
df_2019['uses_cloud_compute'] = ((df_2019['Which of the following cloud computing platforms do you use on a regular basis? (Select all that apply) - Selected Choice -  Google Cloud Platform (GCP) '].notnull()) |
                                 (df_2019['Which of the following cloud computing platforms do you use on a regular basis? (Select all that apply) - Selected Choice -  Amazon Web Services (AWS) '].notnull()) |
                                 (df_2019['Which of the following cloud computing platforms do you use on a regular basis? (Select all that apply) - Selected Choice -  Microsoft Azure '].notnull()) |
                                 (df_2019['Which of the following cloud computing platforms do you use on a regular basis? (Select all that apply) - Selected Choice -  IBM Cloud '].notnull()) |
                                 (df_2019['Which of the following cloud computing platforms do you use on a regular basis? (Select all that apply) - Selected Choice -  Alibaba Cloud '].notnull()) |
                                 (df_2019['Which of the following cloud computing platforms do you use on a regular basis? (Select all that apply) - Selected Choice -  Salesforce Cloud '].notnull()) |
                                 (df_2019['Which of the following cloud computing platforms do you use on a regular basis? (Select all that apply) - Selected Choice -  Oracle Cloud '].notnull()) |
                                 (df_2019['Which of the following cloud computing platforms do you use on a regular basis? (Select all that apply) - Selected Choice -  SAP Cloud '].notnull()) |
                                 (df_2019['Which of the following cloud computing platforms do you use on a regular basis? (Select all that apply) - Selected Choice -  VMware Cloud '].notnull()) |
                                 (df_2019['Which of the following cloud computing platforms do you use on a regular basis? (Select all that apply) - Selected Choice -  Red Hat Cloud '].notnull()) |
                                 (df_2019['Which of the following cloud computing platforms do you use on a regular basis? (Select all that apply) - Selected Choice - Other'].notnull())
                                )   
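
The chain of conditions above asks one question: did the respondent tick at least one platform column? A more compact sketch of the same idea selects the columns by their shared question prefix and uses any(axis=1). The toy frame below is made up, and the exact column names (including the "None" and "Other - Text" columns, which must be excluded) are assumptions about the survey file rather than something confirmed here.

```python
import numpy as np
import pandas as pd

PREFIX = 'Which of the following cloud computing platforms do you use on a regular basis?'
CHOICE = PREFIX + ' (Select all that apply) - Selected Choice - '

# Hypothetical three-respondent frame with a few one-column-per-choice fields.
toy = pd.DataFrame({
    CHOICE + ' Amazon Web Services (AWS) ': [' Amazon Web Services (AWS) ', np.nan, np.nan],
    CHOICE + ' Microsoft Azure ': [np.nan, np.nan, np.nan],
    CHOICE + 'None': [np.nan, 'None', np.nan],
    CHOICE + 'Other - Text': [-1, -1, -1],  # assumed placeholder values, never null
})

# Keep only real platform columns: same question, but not the 'None' or free-text column.
platform_cols = [
    c for c in toy.columns
    if c.startswith(PREFIX) and not c.rstrip().endswith(('None', 'Text'))
]

# True when at least one platform column holds a (non-null) answer.
toy['uses_cloud_compute'] = toy[platform_cols].notnull().any(axis=1)
print(toy['uses_cloud_compute'])
```

In this sketch the second respondent answered "None" and the third skipped the question; both end up as False, which matches how the original chain treats them.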

In [None]:
# The percentage of people who use cloud compute, grouped by if they use Vim/Emacs or not.
df_2019.groupby('vim/emacs_user')['uses_cloud_compute'].mean().to_frame('Probability of Using Cloud Compute').reset_index()

First of all, we can see that Vim/Emacs users are far more likely to use cloud services than non-Vim/Emacs users (more than twice as likely!). However, we can also see that the majority of Vim/Emacs users do not use the cloud. This is still extremely surprising because in my use cases, Vim/Emacs are only used when SSHing into a cloud server. Yet these results imply that more than half of all Vim/Emacs users use them locally!
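
"More than twice as likely" and "the majority do not" are two separate statements: the first compares the two groups, the second looks at one group on its own. A tiny sketch with invented numbers (not the survey's) shows how both can hold at once; `toy`, `rate_vim`, `rate_other`, and `ratio` are hypothetical names.

```python
import pandas as pd

# Hypothetical toy data: 5 Vim/Emacs users and 5 non-users.
toy = pd.DataFrame({
    'vim/emacs_user':     [True, True, True, True, True, False, False, False, False, False],
    'uses_cloud_compute': [True, True, False, False, False, True, False, False, False, False],
})

is_vim = toy['vim/emacs_user']

# Share of cloud users inside each group.
rate_vim = toy.loc[is_vim, 'uses_cloud_compute'].mean()
rate_other = toy.loc[~is_vim, 'uses_cloud_compute'].mean()

# How many times more likely Vim/Emacs users are to use the cloud.
ratio = rate_vim / rate_other

print(rate_vim, rate_other, ratio)
```

Here the Vim/Emacs group is twice as likely to use the cloud, yet its own rate is still below one half.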

## Hypothesis 2 Conclusion
Hypothesis 2 proposed that people who use Vim/Emacs almost exclusively use servers/cloud services. We instead discover that only 46.8% of Vim/Emacs users use cloud computing platforms on a regular basis. The majority of them do not! However, to be fair, the probability of using cloud services is more than twice as high for Vim/Emacs users as for non-Vim/Emacs users. This implies that Vim/Emacs users are biased towards cloud services in comparison to the rest of the population.

It was also very surprising to see from the kernel density plot that Vim/Emacs users rarely spend money for cloud purposes. Given the probability of using cloud services is more than twice as high as non-Vim/Emacs users, I would have thought that they spend more money too, but it appears not to be the case.

# Hypothesis 3
Hypothesis 3: Vim/Emacs users likely write code for the majority of their work day and do not hold manager positions but instead are full-time programmers. This is mainly because people who use Vim/Emacs learned all the keyboard shortcuts to code faster; therefore, they likely have enjoyed programming their whole life and don't want to give it up for a managerial position.

To look at this, we will study the "Select the title most similar to your current role (or most recent title if retired)" question. Let's investigate!

In [None]:
# Impute the Other with the freeform text
df_2019['Select the title most similar to your current role (or most recent title if retired): - Other - Text'] = text_responses['Select the title most similar to your current role (or most recent title if retired): - Other - Text'].astype(str).str.upper()

In [None]:
vc_vim = df_2019.loc[df_2019['vim/emacs_user'], 'Select the title most similar to your current role (or most recent title if retired): - Selected Choice'].value_counts(normalize=True)
vc_notvim = df_2019.loc[~df_2019['vim/emacs_user'], 'Select the title most similar to your current role (or most recent title if retired): - Selected Choice'].value_counts(normalize=True)

w = pd.DataFrame(data = [vc_vim, vc_notvim],index = ['Vim/Emacs','Non-Vim/Emacs'])

ax = w.T[['Non-Vim/Emacs']].plot(subplots=True, layout=(1,1),kind='bar',color='blue',linewidth=1,edgecolor='k',legend=True, label='Non-Vim/Emacs',alpha=0.25)
w.T[['Vim/Emacs']].plot(subplots=True, layout=(1,1),kind='bar',color='orange',linewidth=1,edgecolor='k',legend=True, label='Vim/Emacs',alpha=0.25, ax=ax)

plt.gcf().set_size_inches(10,8)
plt.title('Job Title of Vim/Emacs users vs. non-Vim/Emacs users',fontsize=15)
plt.xticks(rotation=45,fontsize='10', horizontalalignment='right')
plt.yticks( fontsize=10)
plt.xlabel('Job Title',fontsize=15)
plt.ylabel('Proportion of Users',fontsize=15)
plt.show()

## Hypothesis 3 Conclusion
From this bar plot we can see that Vim/Emacs users are more likely to be Data Scientists, Software Engineers, Research Scientists, Data Engineers, and DBA/Database Engineers (which are all NOT managerial roles). We can see that the proportion of Product/Project Manager and Business Analyst (both of which are considered "high-level" roles) is higher for Non-Vim/Emacs users.

Therefore the hypothesis that Vim/Emacs users hold more technical positions than managerial ones is true, and furthermore they hold more technical positions proportionately than non-Vim/Emacs users!
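
Reading differences off two overlaid bar charts can be hard, so it may help to put the two sets of shares side by side and subtract them. This is only a sketch on invented job titles; `vim_titles`, `other_titles`, and `shares` are hypothetical names, and in the notebook the inputs would be vc_vim and vc_notvim.

```python
import pandas as pd

# Hypothetical job titles for a handful of respondents in each group.
vim_titles = pd.Series(['Data Scientist', 'Data Scientist',
                        'Software Engineer', 'Product/Project Manager'])
other_titles = pd.Series(['Data Scientist', 'Business Analyst',
                          'Product/Project Manager', 'Product/Project Manager'])

# Share of each title within each group, aligned on the title.
shares = pd.concat(
    {
        'Vim/Emacs': vim_titles.value_counts(normalize=True),
        'Non-Vim/Emacs': other_titles.value_counts(normalize=True),
    },
    axis=1,
).fillna(0)  # a title missing from one group has a share of 0 there

# Positive values mean the title is more common among Vim/Emacs users.
shares['difference'] = shares['Vim/Emacs'] - shares['Non-Vim/Emacs']
print(shares.sort_values('difference', ascending=False))
```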

So, after reading this kernel, are you going to learn to use Vim/Emacs? Or maybe you already use it? Personally, I find the ease of Jupyter Notebooks to be so amazing that I don't think I could ever go back to using a text editor, or even an IDE, full-time. Maybe I'll change my mind when I enter the 30-60 age range, though :)