# Data Science in 2020: NLP vs CV

In my experience*, people who work in data science are mostly specialized in either CV or NLP, but rarely work in both fields at the same time.

**TL;DR**: This notebook tries to find and explain any significant differences between people working in the two major subfields of data science.


#### Testable hypotheses
Before looking at the data, I came up with the following ideas:
* Subfield **specialisation will increase with years of ML experience / work experience / level of education**. For example, while DS students receive a general education and don't associate themselves with any particular field, DS professionals devote their time to particular sets of problems inside one field.

* **NLP is more heterogeneous** (in different ways, e.g., programming languages of choice). I believe this because, in my experience, many more things are loosely related to NLP than to CV.

* There are **more female specialists in NLP** than in CV. I think so because NLP is somewhat related to linguistics, which is popular among women, while CV doesn't have any such related field.


#### Methodology
My judgement of whether a person works in CV or NLP comes from the answers to questions Q18 and Q19 (which CV / NLP methods do you use, respectively). If a person has chosen at least one item on either list, I count it as evidence that they are somehow related to the corresponding field.


\* I am an NLP specialist from Russia. It might well be that my experience is not representative.

### Data preparation

In [None]:
import pandas as pd
import numpy as np

df = pd.read_csv('../input/kaggle-survey-2020/kaggle_survey_2020_responses.csv')
questions = df.iloc[0, :].T
data = df.iloc[1:, :].copy()  # copy, so that new columns can be added without a SettingWithCopyWarning

In [None]:
# these are the columns responsible for CV / NLP methods or tools
q_18_cv_tools = ['Q18_Part_' + str(i) for i in range(1, 7)] + ['Q18_OTHER']
q_19_nlp_tools = ['Q19_Part_' + str(i) for i in range(1, 6)] + ['Q19_OTHER']

I count a respondent as an NLP / CV person if they have chosen at least one option in the corresponding list, except for `None`.

In [None]:
def get_people_indices(subset, columns):
    people_ids = []
    for i, line in enumerate(subset[columns].values):
        
        # nothing_selected: either the person selected no option (every cell is NaN, i.e. a float)
        # or they selected the `None` option, which is the second-to-last column in both lists
        nothing_selected = all(isinstance(answer, float) for answer in line) or line[-2] == 'None'
        
        # otherwise, the person is related to the field
        if not nothing_selected:
            people_ids.append(i)
    return people_ids

In [None]:
# sets make the membership tests below O(1) instead of O(n)
cv_people_ids = set(get_people_indices(data, q_18_cv_tools))
nlp_people_ids = set(get_people_indices(data, q_19_nlp_tools))

data['CV_person'] = [i in cv_people_ids for i in range(len(data))]
data['NLP_person'] = [i in nlp_people_ids for i in range(len(data))]
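The same labelling rule can also be written without an explicit loop. The sketch below is only an illustration on made-up data: `label_field_users` and the toy frame are hypothetical names, and unlike the cell above it looks for the `None` answer in any of the columns rather than in the second-to-last one.

```python
import pandas as pd

def label_field_users(frame, columns, none_label="None"):
    """Boolean Series: True where a respondent picked at least one option
    in `columns` and did not pick the `none_label` option."""
    answers = frame[columns]
    picked_any = answers.notna().any(axis=1)          # at least one cell is filled
    picked_none = answers.eq(none_label).any(axis=1)  # the explicit "None" option
    return picked_any & ~picked_none

# Hypothetical toy data, not taken from the survey
toy = pd.DataFrame({
    "Q18_Part_1": ["Image classification", None, None, None],
    "Q18_Part_6": [None, "None", None, None],
    "Q18_OTHER": [None, None, "Other", None],
})
toy_flags = label_field_users(toy, list(toy.columns))
print(toy_flags.tolist())
```

Only the first and third toy respondents are labelled as users: the second one chose `None`, and the fourth one chose nothing.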

## Ratio of CV / NLP specialists

It's time to visualize! Now we can compare the ratios.

In [None]:
import matplotlib.pyplot as plt

plt.rcdefaults()  # reset matplotlib settings to their defaults

### Among all respondents

In [None]:
groups = ('CV only', 'NLP only', 'Both CV and NLP')
y_pos = range(len(groups))
sizes = [
    len(data[data['CV_person'] & ~data['NLP_person']]),
    len(data[data['NLP_person'] & ~data['CV_person']]),
    len(data[data['NLP_person'] & data['CV_person']])
]

plt.bar(y_pos, sizes, align='center', alpha=0.5, color=('b', 'g', 'c'))
plt.xticks(y_pos, groups)
plt.ylabel('Number of people')
plt.title('People who use CV / NLP methods')

plt.show()

From this graph, my observation that people work only in either CV or NLP does not seem to be correct: the group of people who chose both NLP and CV methods is actually larger than the NLP-only group.

Apart from that, we see that **CV seems to be much more popular than NLP**.

### Non-students only

It seems reasonable to remove students from our data, since they may still be undecided about their professional career.

In [None]:
non_students = data[data['Q5'] != 'Student']
non_students.shape

In [None]:
groups = ('CV only', 'NLP only', 'Both CV and NLP')
y_pos = range(len(groups))
sizes = [
    len(non_students[non_students['CV_person'] & ~non_students['NLP_person']]),
    len(non_students[non_students['NLP_person'] & ~non_students['CV_person']]),
    len(non_students[non_students['NLP_person'] & non_students['CV_person']])
]

plt.bar(y_pos, sizes, align='center', alpha=0.5, color=('b', 'g', 'c'))
plt.xticks(y_pos, groups)
plt.ylabel('Number of people')
plt.title('Non-students only')

plt.show()

Removing students doesn't really seem to change things. Let's compare the two distributions.

To compare the distributions better, this time I divide all counts by the total size of the corresponding group (everyone in it who uses CV or NLP methods).

In [None]:
def get_field_sizes(dataset):
    return [
        len(dataset[dataset['CV_person'] & ~dataset['NLP_person']]),
        len(dataset[dataset['NLP_person'] & ~dataset['CV_person']]),
        len(dataset[dataset['NLP_person'] & dataset['CV_person']])
    ]

def get_field_ratios(dataset):
    sizes = get_field_sizes(dataset)
    return [s / len(dataset[dataset['CV_person'] | dataset['NLP_person']]) for s in sizes]

In [None]:
groups = ('CV only', 'NLP only', 'Both CV and NLP')
y_pos = np.arange(len(groups))
bar_width = 0.35
opacity=0.8

total_ratios = get_field_ratios(data)
total_plot = plt.bar(y_pos, total_ratios, bar_width,
                    align='center', alpha=opacity,
                    label='Among all',
                    color='darkcyan')

nonst_ratios = get_field_ratios(non_students)
nonst_plot = plt.bar(y_pos + bar_width, nonst_ratios, bar_width,
                      align='center', alpha=opacity,
                      label='Among non-students',
                      color='cyan')


plt.xticks(y_pos + bar_width / 2, groups)
plt.ylabel('Fraction of those who answered')
plt.title('Total vs Non-students')
plt.legend()

plt.tight_layout()
plt.show()

To a first approximation, my hypothesis about increasing specialization is not confirmed.

But let's go further and see how this distribution changes as we restrict the group to more and more experienced / professional respondents.

### Data Scientists / Research Scientists only

Let's narrow our group down to these professions. 

In [None]:
ds_data = data[data['Q5'] == 'Data Scientist']
rs_data = data[data['Q5'] == 'Research Scientist']

len(ds_data), len(rs_data)

The subsets seem rather small, but are probably large enough. Let's take a look at the ratios, compared to the total.

In [None]:
bar_width = 0.25

total_ratios = get_field_ratios(data)
total_plot = plt.bar(y_pos, total_ratios, bar_width,
                    align='center', alpha=opacity,
                    label='Among all',
                    color='darkcyan')

ds_ratios = get_field_ratios(ds_data)
ds_plot = plt.bar(y_pos + bar_width, ds_ratios, bar_width,
                      align='center', alpha=opacity,
                      label='Among data scientists',
                      color='cyan')

rs_ratios = get_field_ratios(rs_data)
rs_plot = plt.bar(y_pos + bar_width * 2, rs_ratios, bar_width,
                      align='center', alpha=opacity,
                      label='Among research scientists',
                      color='deepskyblue')


plt.xticks(y_pos + bar_width, groups)
plt.ylabel('Fraction of subgroup size')
plt.title('Total vs Data Scientists vs Research Scientists')
plt.legend()

plt.tight_layout()
plt.show()

For the **Data Scientist** group, we see that the **share of NLP people increased noticeably**. However, the share of those who use both NLP and CV methods has also increased.

For **Research Scientists, CV looks much more popular** for some reason.

However, I still **don't see any strong evidence for increased specialization**.

### Experience / age / salary vs specialization

What if we introduce a `specialization` parameter: the share of single-field respondents among all respondents who use CV or NLP methods?

`specialization` is defined as: $$1 - \frac{|\text{CV} \cap \text{NLP}|}{|\text{CV} \cup \text{NLP}|}$$ which is the fraction of those who use either NLP or CV methods, but not both.

In [None]:
def count_specialization(dataset):
    both_size = len(dataset[dataset['CV_person'] & dataset['NLP_person']])
    union_size = len(dataset[dataset['CV_person'] | dataset['NLP_person']])
    both_frac = both_size / union_size if union_size else 0
    return 1 - both_frac
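The ratio of intersection size to union size is the Jaccard index of the two groups, so `specialization` is one minus the Jaccard index. The sketch below works through the formula on made-up respondent ids; the function name `specialization` and the ids are hypothetical.

```python
def specialization(cv_users, nlp_users):
    """1 - |CV ∩ NLP| / |CV ∪ NLP| for two sets of respondent ids."""
    union = cv_users | nlp_users
    if not union:
        # mirrors count_specialization, which returns 1 for an empty group
        return 1.0
    return 1 - len(cv_users & nlp_users) / len(union)

# Hypothetical respondent ids, for illustration only
cv_users = {1, 2, 3, 4}
nlp_users = {3, 4, 5}
score = specialization(cv_users, nlp_users)
print(round(score, 2))  # 0.6: 5 people in the union, 2 of them use both
```

Note that an empty subgroup gets a score of 1, so a filter that matches no respondents shows up as "fully specialized" rather than as missing data.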

#### Programming experience

In [None]:
answers = [
    '< 1 years', '1-2 years', '3-5 years', '5-10 years', '10-20 years', '20+ years',
]
scores = []
for answer in answers:
    scores.append(count_specialization(data[data['Q6'] == answer]))

In [None]:
y_pos = range(len(answers))

plt.plot(y_pos, scores, 'bo-')
plt.xticks(y_pos, answers)
plt.ylabel('Specialization')
plt.title('Programming experience')

plt.show()

The trend seems to be quite the opposite of what I thought.

#### ML experience

In [None]:
answers = [
    'Under 1 year', '1-2 years', '2-3 years', '3-4 years', '4-5 years', '5-10 years', '10-20 years', '20 or more years',
]
scores = []
for answer in answers:
    # Q15 is the ML-experience question; Q6 is programming experience
    scores.append(count_specialization(data[data['Q15'] == answer]))
    
y_pos = range(len(answers))

plt.plot(y_pos, scores, 'bo-')
plt.xticks(y_pos, answers, rotation=45)
plt.ylabel('Specialization')
plt.title('ML experience')

plt.show()

And here we don't see any clear trend.

## Relation to other answers

For this part, I use non-students only, as a more reliable source of data.

In [None]:
cv_people = non_students[non_students['CV_person'] & ~non_students['NLP_person']]
nlp_people = non_students[non_students['NLP_person'] & ~non_students['CV_person']]
both = non_students[non_students['NLP_person'] & non_students['CV_person']]

len(cv_people), len(nlp_people), len(both)

In [None]:
def get_feature_sizes(feature_col, feature_name):
    return [
        len(cv_people[cv_people[feature_col] == feature_name]),
        len(nlp_people[nlp_people[feature_col] == feature_name]),
        len(both[both[feature_col] == feature_name]),
    ]

def get_feature_ratios(feature_col, feature_name):
    sizes = get_feature_sizes(feature_col, feature_name)
    return [
        sizes[0] / len(cv_people),
        sizes[1] / len(nlp_people),
        sizes[2] / len(both),
    ]


def get_ingroup_feature_sizes(dataset, feature_col, feature_names):
    return [
        len(dataset[dataset[feature_col] == feature_name])
        for feature_name in feature_names
    ]

def get_ingroup_feature_ratios(dataset, feature_col, feature_names):
    # use the in-group helper: get_feature_sizes takes different arguments
    sizes = get_ingroup_feature_sizes(dataset, feature_col, feature_names)
    return [s / len(dataset) for s in sizes]

### Gender

In [None]:
groups = ('CV', 'NLP', 'Both')
y_pos = np.arange(len(groups))
bar_width = 0.35
opacity=0.8

female_ratios = get_feature_ratios('Q2', 'Woman')
fem_plot = plt.bar(y_pos, female_ratios, bar_width,
                    align='center', alpha=opacity,
                    label='Female',
                    color='cyan')

male_ratios = get_feature_ratios('Q2', 'Man')
male_plot = plt.bar(y_pos + bar_width, male_ratios, bar_width,
                    align='center', alpha=opacity,
                    label='Male',
                    color='red')

plt.xticks(y_pos + bar_width / 2, groups)
plt.ylabel('Fraction of respondents')
plt.title('Male / female specialists')
plt.legend()

plt.tight_layout()
plt.show()

This looks interesting. There are actually slightly more female specialists in NLP than in CV, but the difference seems to be insignificant. However, using both CV and NLP methods appears to be a very male thing to do.

(I did not analyze data for non-binary people since their percentage is too small to make any reliable judgement).
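Whether a difference between two shares is "insignificant" can be checked with a two-proportion z-test. The sketch below is not part of the original analysis: the helper `two_proportion_z_test` and the counts are hypothetical and only show how such a check could look.

```python
import math

def two_proportion_z_test(hits_a, n_a, hits_b, n_b):
    """Two-sided z-test for the difference between two proportions."""
    p_a, p_b = hits_a / n_a, hits_b / n_b
    pooled = (hits_a + hits_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / std_err
    # two-sided tail probability of the standard normal distribution
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical counts: women among NLP-only vs CV-only respondents
z, p_value = two_proportion_z_test(hits_a=60, n_a=300, hits_b=180, n_b=1000)
print(f"z = {z:.2f}, p = {p_value:.3f}")
```

With the real counts from `nlp_people` and `cv_people` plugged in, a large p-value would support the impression that the gap is within noise.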

### Programming languages 

In [None]:
pl_cols = ['Q7_Part_' + str(i + 1) for i in range(11)]

#### Preferred languages

In [None]:
from collections import Counter


def get_multiple_choice_counts(dataset, cols):
    return Counter(dataset[cols].values.reshape(-1))

def get_multiple_choice_ratios(dataset, cols):
    lang_counts = get_multiple_choice_counts(dataset, cols)
    return {lang: count / len(dataset) for (lang, count) in lang_counts.items()}

In [None]:
languages = get_multiple_choice_counts(data, pl_cols)
languages = sorted([l for l in languages if not type(l) == float])
y_pos = np.arange(len(languages))

In [None]:
bar_width = 0.35

nlp_lang_ratios = get_multiple_choice_ratios(nlp_people, pl_cols)
# .get(..., 0) guards against a language that nobody in the group selected
nlp_lang_ratios = [nlp_lang_ratios.get(l, 0) for l in languages]
plt.bar(y_pos, nlp_lang_ratios, bar_width, align='center', alpha=0.8,
        label='NLP',
        color='deepskyblue')

cv_lang_ratios = get_multiple_choice_ratios(cv_people, pl_cols)
cv_lang_ratios = [cv_lang_ratios.get(l, 0) for l in languages]
plt.bar(y_pos + bar_width, cv_lang_ratios, bar_width, align='center', alpha=0.8,
        label='CV',
        color='indigo')


plt.xticks(y_pos + bar_width * 0.5, languages, rotation=45)
plt.ylabel('Fraction of group size')
plt.title('Languages used by NLP / CV people')
plt.legend()

plt.show()

Some observations:
* C / C++ are a bit more popular among CV people
* on the other hand, R and SQL are much more popular among NLP people
* Python is very popular among both groups, as expected

I was also interested in the average number of languages used by either group. It turned out that **both groups are roughly equally multilingual**, so nothing unexpected here (see the cells below).

In [None]:
def get_lang_per_person(dataset):
    return [
        len([lang for lang in langs if type(lang) != float])
        for langs in dataset[pl_cols].values
    ]

In [None]:
bar_width = 0.35
groups = ('NLP people', 'CV people')
y_pos = np.arange(len(groups))

nlp_langs_count = get_lang_per_person(nlp_people)
nlp_avg = np.average(nlp_langs_count)

cv_langs_count = get_lang_per_person(cv_people)
cv_avg = np.average(cv_langs_count)

plt.bar(y_pos, [nlp_avg, cv_avg], bar_width, align='center', alpha=0.8,
        color=('deepskyblue', 'indigo'))
plt.xticks(y_pos, groups, rotation=45)
plt.ylabel('Average number of languages')
plt.title('Number of languages used by NLP / CV people')

plt.show()

### ML / DL frameworks

In [None]:
framework_cols = ['Q16_Part_' + str(i + 1) for i in range(14)]

frameworks = get_multiple_choice_counts(data, framework_cols)
frameworks = sorted([l for l in frameworks if not type(l) == float])
y_pos = np.arange(len(frameworks))

In [None]:
bar_width = 0.35

nlp_lang_ratios = get_multiple_choice_ratios(nlp_people, framework_cols)
# .get(..., 0) guards against a framework that nobody in the group selected
nlp_lang_ratios = [nlp_lang_ratios.get(l, 0) for l in frameworks]
plt.bar(y_pos, nlp_lang_ratios, bar_width, align='center', alpha=0.8,
        label='NLP',
        color='deepskyblue')

cv_lang_ratios = get_multiple_choice_ratios(cv_people, framework_cols)
cv_lang_ratios = [cv_lang_ratios.get(l, 0) for l in frameworks]
plt.bar(y_pos + bar_width, cv_lang_ratios, bar_width, align='center', alpha=0.8,
        label='CV',
        color='indigo')


plt.xticks(y_pos + bar_width * 0.5, [f.strip() for f in frameworks], rotation=45)
plt.ylabel('Fraction of group size')
plt.title('Frameworks used by NLP / CV people')
plt.legend()

plt.show()

My observations:

1. DL libraries: `Keras` and `TensorFlow` are more popular among CV people, while NLP people give a slight preference to `PyTorch`.
2. For some reason, NLP people like `Xgboost` much more than CV people do.

## Most distinctive features

So far we've seen lots of answer comparisons. But which answers almost certainly tell that a person works in NLP / CV (except for the obvious ones)?

Let's build a binary classification model and see.

### Answer Vectorizer

First, we build a vectorizer that transforms the responses into vectors of 0s and 1s. Each selection question is transformed into a one-hot encoding vector and then concatenated with the other answers.

Naturally, we remove the answers to the defining questions (the ones about CV and NLP tools).
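For comparison, a similar one-hot encoding can be sketched with `pandas.get_dummies`. This is an alternative illustration on made-up answers, not the vectorizer used in this notebook; the toy frame and the `=` separator are arbitrary choices.

```python
import pandas as pd

# Hypothetical toy answers, not taken from the survey
toy = pd.DataFrame({
    "Q5": ["Data Scientist", "Student", "Data Scientist"],
    "Q7_Part_1": ["Python", None, "Python"],
})

# One 0/1 column per (question, answer) pair; missing answers get no column
one_hot = pd.get_dummies(toy, prefix_sep="=", dtype=int)
print(one_hot.columns.tolist())
print(one_hot.values.tolist())
```

The custom class keeps an explicit mapping from vector positions back to `(question, answer)` pairs, which is what the feature ranking later relies on; with `get_dummies` the column names play that role.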

In [None]:
STOP_LIST = {
    'Time from Start to Finish (seconds)', 'CV_person', 'NLP_person',
    'Q17_Part_7', 'Q17_Part_8', 'Q17_Part_9', 'Q17_Part_10'
}
STOP_LIST.update(q_18_cv_tools)
STOP_LIST.update(q_19_nlp_tools)
SELECTION_QUESTIONS = {
    'Q1', 'Q2', 'Q3', 'Q4', 'Q5', 'Q6', 'Q8', 'Q11', 'Q13', 'Q15',
    'Q20', 'Q21', 'Q22', 'Q24', 'Q25', 'Q30', 'Q32', 'Q38'
}

In [None]:
class AnswerVectorizer:
    """Maps every (column, answer) pair to one slot of a 0/1 vector."""

    def __init__(self):
        self.size = 0
        self.ans_to_id = {}
        self.id_to_ans = []
        
    def add_selection_slots(self, data, col):
        # one slot per distinct non-empty answer in this column
        answers = sorted(set(data[col]))
        for ans in answers:
            if not ans:
                continue
            self.ans_to_id[(col, ans)] = self.size
            self.id_to_ans.append((col, ans))
            self.size += 1
        
    def fit(self, data):
        for col in data:
            if col in STOP_LIST:
                continue
            # single-choice questions (SELECTION_QUESTIONS) and the per-option
            # columns of multiple-choice questions are encoded the same way
            self.add_selection_slots(data, col)
                
    def transform(self, data):
        vectors = np.zeros((len(data), self.size))
        for col in data:
            if col in STOP_LIST:
                continue
            # set the slot of each respondent's answer to 1
            for i, answer in enumerate(data[col]):
                if (col, answer) in self.ans_to_id:
                    vectors[i][self.ans_to_id[(col, answer)]] = 1
            
        return vectors

In [None]:
# reassign instead of inplace=True: non_students is a slice of data
non_students = non_students.fillna('')

In [None]:
# Now, we vectorize!
vec = AnswerVectorizer()
vec.fit(non_students)
vectors = vec.transform(non_students)

### Training a classifier

Now we train a simple logistic regression classifier.

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

First, we make a model for predicting NLP people. We are interested in significant features, but let's also track how well the classifier performs.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    vectors, non_students['NLP_person'], test_size=0.1, random_state=42)

In [None]:
nlp_prediction_model = LogisticRegression(max_iter=1000)
nlp_prediction_model.fit(X_train, y_train)

In [None]:
nlp_predicted = nlp_prediction_model.predict(X_test)
print(classification_report(y_test, nlp_predicted))

It doesn't work very well. But what about the features?

Here are the 5 most predictive answers for NLP, ranked starting from the most important:

In [None]:
# indices of the 5 largest coefficients, in ascending order
indices = nlp_prediction_model.coef_[0].argsort()[-5:]
for i, ind in enumerate(reversed(indices), start=1):
    q, a = vec.id_to_ans[ind]
    print('{}. Question {}. Answer: {}.\n'.format(
        i, questions[q], a))

Not very insightful. It looks like those are just a bunch of random answers.

It works the same way for CV people:

In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    vectors, non_students['CV_person'], test_size=0.1, random_state=42)

cv_prediction_model = LogisticRegression(max_iter=1000)
cv_prediction_model.fit(X_train, y_train)

cv_predicted = cv_prediction_model.predict(X_test)
print(classification_report(y_test, cv_predicted))

In [None]:
# indices of the 5 largest coefficients, in ascending order
indices = cv_prediction_model.coef_[0].argsort()[-5:]
for i, ind in enumerate(reversed(indices), start=1):
    q, a = vec.id_to_ans[ind]
    print('{}. Question {}. Answer: {}.\n'.format(
        i, questions[q], a))
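The ranking cells above keep only the largest positive coefficients. Answers with strongly negative weights (those that speak against belonging to the group) can be just as telling. A minimal sketch of ranking both ends, using made-up weights and a hypothetical helper `rank_features`:

```python
import numpy as np

def rank_features(coefs, names, k=2):
    """Return the k most positive and the k most negative (name, weight) pairs."""
    order = np.argsort(coefs)  # indices sorted by ascending weight
    top_pos = [(names[i], float(coefs[i])) for i in order[::-1][:k]]
    top_neg = [(names[i], float(coefs[i])) for i in order[:k]]
    return top_pos, top_neg

# Hypothetical weights and feature names, for illustration only
coefs = np.array([0.1, -1.2, 0.8, 0.0, -0.3])
names = ["a", "b", "c", "d", "e"]
top_pos, top_neg = rank_features(coefs, names)
print(top_pos)  # [('c', 0.8), ('a', 0.1)]
print(top_neg)  # [('b', -1.2), ('e', -0.3)]
```

In the notebook's setting, `coefs` would be `nlp_prediction_model.coef_[0]` or `cv_prediction_model.coef_[0]`, and `names` would be `vec.id_to_ans`.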

## Conclusions

1. My hypothesis about increasing specialization was not confirmed. It is also possible that specialization with age / experience exists, but many people used some methods of the neighbouring field anyway, even though they didn't do it frequently. This last hypothesis cannot be tested on this survey.

2. My hypothesis about NLP being a more female sphere was also not confirmed. However, I found that men tend to be more interdisciplinary for some reason.

3. C / C++ are a bit more popular among CV people while R and SQL are much more popular among NLP people. As expected, Python is the most popular language for both groups.

4. TensorFlow is more popular among CV people while PyTorch is more popular among NLP people (I don't have any idea why).