This kernel is based on the kernel https://www.kaggle.com/artgor/a-look-at-russian-kagglers-over-time

<a class="anchor" id="0"></a>
# Kaggle ML & DS Survey

We have a Machine Learning and Data Science Survey from Kaggle (2017-2019).

Let's analyze information about Ukrainian kagglers (I'm Ukrainian myself), including the best of them as of now (2020): Grandmasters, Masters, Gold Medalists, etc.

<a class="anchor" id="0.1"></a>
## Lists of those whom I found in the Kaggle ranking (the information is still being updated):

1. [Kaggle Competitions Grandmasters and Masters from Ukraine](#1)  
1. [Kaggle Notebooks Grandmasters, Masters and Experts from Ukraine](#2)
1. [Kaggle Competitions Gold Medal (or Medals)](#3)

In [None]:
!pip install -U vega_datasets notebook vega

In [None]:
# importing libraries
import numpy as np 
import pandas as pd 
import os
import math
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
from plotly import tools
import plotly.graph_objs as go
from plotly.offline import init_notebook_mode, iplot
from plotly.subplots import make_subplots
import warnings
warnings.filterwarnings("ignore")
init_notebook_mode(connected=True)
from ipywidgets import interact, interactive, interact_manual
import ipywidgets as widgets
import colorlover as cl

In [None]:
# loading data from different years
DIR = '/kaggle/input/kaggle-survey-2018/'
df_free_18 = pd.read_csv(DIR + 'freeFormResponses.csv', low_memory=False, header=[0,1])
df_choice_18 = pd.read_csv(DIR + 'multipleChoiceResponses.csv', low_memory=False, header=[0,1])
# Format Dataframes
df_free_18.columns = ['_'.join(col) for col in df_free_18.columns]
df_choice_18.columns = ['_'.join(col) for col in df_choice_18.columns]

DIR = '/kaggle/input/kaggle-survey-2019/'
df_free_19 = pd.read_csv(DIR + 'other_text_responses.csv', low_memory=False)
df_choice_19 = pd.read_csv(DIR + 'multiple_choice_responses.csv', low_memory=False, encoding='latin-1', header=[0,1])
df_choice_19.columns = ['_'.join(col) for col in df_choice_19.columns]

DIR = '/kaggle/input/kaggle-survey-2017/'
df_free_17 = pd.read_csv(DIR + 'freeformResponses.csv', low_memory=False)
df_choice_17 = pd.read_csv(DIR + 'multipleChoiceResponses.csv', low_memory=False, encoding='latin-1')

In [None]:
# processing data for visualizations

top_count = df_choice_19['Q3_In which country do you currently reside?'].value_counts().head(21).reset_index().rename(columns={'Q3_In which country do you currently reside?': 'count', 'index': 'Country'})

# taking only Ukrainian respondents; .copy() makes each subset an independent DataFrame,
# so the column assignments below do not act on a view of the full survey
df_choice_17 = df_choice_17.loc[df_choice_17['Country'] == 'Ukraine'].copy()
df_choice_18 = df_choice_18.loc[df_choice_18['Q3_In which country do you currently reside?'] == 'Ukraine'].copy()
df_choice_19 = df_choice_19.loc[df_choice_19['Q3_In which country do you currently reside?'] == 'Ukraine'].copy()

def get_age(x: int):
    """
    Convert numerical age to categories.
    """
    if 18 <= x <= 21:
        return '18-21'
    elif 22 <= x <= 24:
        return '22-24'
    elif 25 <= x <= 29:
        return '25-29'
    elif 30 <= x <= 34:
        return '30-34'
    elif 35 <= x <= 39:
        return '35-39'
    elif 40 <= x <= 44:
        return '40-44'
    elif 45 <= x <= 49:
        return '45-49'
    elif 50 <= x <= 54:
        return '50-54'
    elif 55 <= x <= 59:
        return '55-59'
    elif 60 <= x <= 69:
        return '60-69'
    elif x >= 70:
        return '70+'
    
# create a new age column with the same name and unique values in all datasets
df_choice_17['Age_'] = df_choice_17['Age'].apply(get_age)
df_choice_18['Age_'] = df_choice_18['Q2_What is your age (# years)?']
df_choice_18.loc[df_choice_18['Age_'].isin(['70-79', '80+']), 'Age_'] = '70+'
df_choice_19['Age_'] = df_choice_19['Q1_What is your age (# years)?']

# renaming columns so that it would be easier to work with them
df_choice_17 = df_choice_17.rename(columns={'GenderSelect': 'Gender', 'FormalEducation': 'Degree'})
df_choice_18 = df_choice_18.rename(columns={'Q1_What is your gender? - Selected Choice': 'Gender', 'Q9_What is your current yearly compensation (approximate $USD)?': 'Salary',
                                            'Q4_What is the highest level of formal education that you have attained or plan to attain within the next 2 years?': 'Degree'})
df_choice_19 = df_choice_19.rename(columns={'Q2_What is your gender? - Selected Choice': 'Gender', 'Q10_What is your current yearly compensation (approximate $USD)?': 'Salary',
                                            'Q4_What is the highest level of formal education that you have attained or plan to attain within the next 2 years?': 'Degree'})
df_choice_19['Degree'] = df_choice_19['Degree'].replace({'Masterâs degree': 'Master’s degree', 'Bachelorâs degree': 'Bachelor’s degree',
                                                         'Some college/university study without earning a bachelorâs degree': 'Some college/university study without earning a bachelor’s degree'})
df_choice_17['Degree'] = df_choice_17['Degree'].replace({"Master's degree": 'Master’s degree', "Bachelor's degree": 'Bachelor’s degree',
                                                         "Some college/university study without earning a bachelor's degree": 'Some college/university study without earning a bachelor’s degree',
                                                         "I did not complete any formal education past high school": "No formal education past high school"})

# changing salary values to the same categories
df_choice_19.loc[df_choice_19['Salary'].isin(['$0-999', '1,000-1,999', '2,000-2,999', '3,000-3,999', '4,000-4,999', '5,000-7,499', '7,500-9,999']), 'Salary'] = '0-10,000'
df_choice_19.loc[df_choice_19['Salary'].isin(['10,000-14,999', '15,000-19,999']), 'Salary'] = '10-20,000'
df_choice_19.loc[df_choice_19['Salary'].isin(['20,000-24,999', '25,000-29,999']), 'Salary'] = '20-30,000'
df_choice_19.loc[df_choice_19['Salary'] == '30,000-39,999', 'Salary'] = '30-40,000'
df_choice_19.loc[df_choice_19['Salary'] == '40,000-49,999', 'Salary'] = '40-50,000'
df_choice_19.loc[df_choice_19['Salary'] == '50,000-59,999', 'Salary'] = '50-60,000'
df_choice_19.loc[df_choice_19['Salary'] == '60,000-69,999', 'Salary'] = '60-70,000'
df_choice_19.loc[df_choice_19['Salary'] == '70,000-79,999', 'Salary'] = '70-80,000'
df_choice_19.loc[df_choice_19['Salary'] == '80,000-89,999', 'Salary'] = '80-90,000'
df_choice_19.loc[df_choice_19['Salary'] == '90,000-99,999', 'Salary'] = '90-100,000'
df_choice_19.loc[df_choice_19['Salary'] == '100,000-124,999', 'Salary'] = '100-125,000'
df_choice_19.loc[df_choice_19['Salary'] == '125,000-149,999', 'Salary'] = '125-150,000'
df_choice_19.loc[df_choice_19['Salary'] == '200,000-249,999', 'Salary'] = '200-250,000'
df_choice_18.loc[df_choice_18['Salary'].isin(['400-500,000', '300-400,000']), 'Salary'] = '300,000-500,000'

In [None]:
# Functions

def plot_gender_vars(var1: str = '', title_name: str = ''):
    """
    Make separate count plots for genders over years.
    """
    colors = cl.scales['3']['qual']['Paired']
    names = {0: '2017', 1: '2018', 2: '2019'}
    fig = make_subplots(rows=1, cols=2, subplot_titles=('Male', 'Female'), print_grid=False)
    # too few respondents identify as anything other than Male/Female, so only the two most common genders are plotted
    for j, c in enumerate(['Male', 'Female']):
        for i, df in enumerate([df_choice_17, df_choice_18, df_choice_19]):
            grouped = df.loc[(df['Gender'] == c), var1].value_counts().sort_index().reset_index()
            # use var1 instead of a hard-coded 'Age_' so the function works for any column
            grouped[var1] = grouped[var1] / np.sum(grouped[var1])
            trace = go.Bar(
                x=grouped['index'],
                y=grouped[var1],
                name=names[i],
                marker=dict(color=colors[i]),
                showlegend=(j == 0),
                legendgroup=i
            )
            fig.append_trace(trace, 1, j + 1)    

    fig['layout'].update(height=400, width=800, paper_bgcolor='rgba(0,0,0,0)', plot_bgcolor='rgba(0,0,0,0)', title=f'Rate of Ukrainian kagglers by {title_name} and gender');
    return fig


def plot_var(var1: str = '', title_name: str = ''):
    """
    Plot one variable over years.
    """
    colors = cl.scales['3']['qual']['Paired']
    names = {0: '2017', 1: '2018', 2: '2019'}
    
    data = []
    for i, df in enumerate([df_choice_17, df_choice_18, df_choice_19]):
        grouped = df[var1].value_counts().sort_index().reset_index()
        grouped[var1] = grouped[var1] / np.sum(grouped[var1])
        trace = go.Bar(
            x=grouped['index'],
            y=grouped[var1],
            name=names[i],
            marker=dict(color=colors[i]),
            legendgroup=i
        )
        data.append(trace)
    layout = dict(height=400, width=800, paper_bgcolor='rgba(0,0,0,0)', plot_bgcolor='rgba(0,0,0,0)', title=f'Ukrainian kagglers by {title_name}');  
    fig = dict(data=data, layout=layout)
    return fig


def plot_var_salary(var1: str = '', title_name: str = '', normalize: bool = False):
    """
    Plot salary over years. This is a separate function, because
    it is necessary to add code for sorting.
    """
    colors = cl.scales['3']['qual']['Paired']
    names = {0: '2018', 1: '2019'}
    
    data = []
    for i, df in enumerate([df_choice_18, df_choice_19]):
        grouped = df[var1].value_counts().sort_index().reset_index()
        if normalize:
            grouped[var1] = grouped[var1] / np.sum(grouped[var1])
        map_dict = {'0-10,000': 0,
                    '10-20,000': 1,
                    '100-125,000': 10,
                    '125-150,000' : 11,
                    '150-200,000': 12,
                    '20-30,000': 2,
                    '200-250,000': 13,
                    '250,000-299,999': 14,
                    '30-40,000': 3,
                    '300,000-500,000': 15,
                    '40-50,000': 4,
                    '50-60,000': 5,
                    '60-70,000': 6,
                    '70-80,000': 7,
                    '80-90,000': 8,
                    '90-100,000': 9,
                    '> $500,000': 16,
                    'I do not wish to disclose my approximate yearly compensation': 17}
        grouped['sorting'] = grouped['index'].apply(lambda x: map_dict[x])
        grouped = grouped.loc[grouped['index'] != 'I do not wish to disclose my approximate yearly compensation']
        grouped = grouped.sort_values('sorting', ascending=True)
        trace = go.Bar(
            x=grouped['index'],
            y=grouped[var1],
            name=names[i],
            marker=dict(color=colors[i]),
            legendgroup=i
        )
        data.append(trace)
    layout = dict(height=500, width=800, paper_bgcolor='rgba(0,0,0,0)', plot_bgcolor='rgba(0,0,0,0)', title=f'Ukrainian kagglers by {title_name}');  
    fig = dict(data=data, layout=layout)
    return fig


def plot_choice_var(var: str = '', title_name: str = ''):
    """
    Plot a variable, in which responders could select several answers.
    """
    col_names = [col for col in df_choice_19.columns if f'{var}_Part' in col]
    data = []
    small_df = df_choice_19[col_names]
    text_values = [col.split('- ')[2] for col in col_names]
    counts = []
    for m, n in zip(col_names, text_values):
        if small_df[m].nunique() == 0:
            counts.append(0)
        else:
            counts.append(sum(small_df[m] == n))
            
    trace = go.Bar(
        x=text_values,
        y=counts,
        name='c',
        marker=dict(color='blue'),
        showlegend=False
    )
    data.append(trace)    
    fig = go.Figure(data=data)
    fig['layout'].update(height=600, width=800, paper_bgcolor='rgba(0,0,0,0)', plot_bgcolor='rgba(0,0,0,0)', title=f'Popular {title_name}');
    return fig
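
The helpers above read `grouped['index']` after `value_counts().reset_index()`. As far as I know, the column names produced by that chain depend on the pandas version (pandas 2.0 and later name them differently), so a version-independent variant may be useful. This is only a sketch on toy data; the helper name `share_table` and the column names `category` and `share` are my own choices, not part of the original notebook:

```python
import pandas as pd

def share_table(series: pd.Series) -> pd.DataFrame:
    """Return the share of each category with fixed column names."""
    return (
        series.value_counts(normalize=True)  # shares instead of raw counts
        .sort_index()                        # sort categories alphabetically
        .rename_axis('category')             # name the index explicitly
        .reset_index(name='share')           # columns: 'category', 'share'
    )

# toy data standing in for a survey column such as 'Degree'
toy_degrees = pd.Series(['Master', 'Bachelor', 'Master', 'Doctoral'])
table = share_table(toy_degrees)
print(table)
```

With fixed column names, a bar trace could use `x=table['category']` and `y=table['share']` regardless of the installed pandas version.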

## General information

The Ukrainian community on Kaggle is not very strong yet. We started taking part in Kaggle competitions a long time ago, many competitions have at least several teams in the medal zone, and there are currently many Ukrainian Masters and Experts. Still, only 191 people took part in this year's survey (20th place among countries, excluding "others").

Why is this number (about 200) lower than last year? I think news of the survey spread less widely this year, though I don't know why.

In [None]:
df_count = pd.DataFrame({'Year': [2017, 2018, 2019], 'Count': [df_choice_17.shape[0], df_choice_18.shape[0], df_choice_19.shape[0]]})
top_count = top_count.sort_values('count')

fig = make_subplots(rows=1, cols=2)
fig.add_trace(go.Bar(y=top_count['Country'], x=top_count['count'], orientation='h', name='Number of respondents by country in 2019'), row=1, col=1)
fig.add_trace(go.Bar(x=df_count['Year'], y=df_count['Count'], name='Number of Ukrainian responders by year'), row=1, col=2)

fig['layout'].update(height=600, width=800,paper_bgcolor='rgba(0,0,0,0)',
    plot_bgcolor='rgba(0,0,0,0)');
iplot(fig);
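
The country ranking mentioned above (20th place, excluding "others") can be checked with a few lines of pandas. A minimal sketch on toy data, assuming the catch-all answer is labelled `'Other'`; the function name `country_rank` and the toy values are hypothetical:

```python
import pandas as pd

def country_rank(countries: pd.Series, country: str, exclude=('Other',)) -> int:
    """Rank of `country` by number of respondents (1 = most respondents)."""
    # drop catch-all answers before counting
    counts = countries[~countries.isin(list(exclude))].value_counts()
    # ties share the best (lowest) rank
    ranks = counts.rank(ascending=False, method='min')
    return int(ranks[country])

# toy data, not real survey counts
toy = pd.Series(['India'] * 5 + ['USA'] * 4 + ['Other'] * 3 + ['Ukraine'] * 2)
print(country_rank(toy, 'Ukraine'))
```

On the real data, the same function would need to be applied to the full 2019 country column before it is filtered down to Ukraine.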

Not surprisingly, the average age is increasing over time. One reason is that people grow older over time (obviously :)). Another is that more older experts are switching careers from other fields to DS.

It is interesting to note that men are interested in DS up to age 69 and women up to age 54. The largest group of specialists is up to 44 among men and up to 34 among women.

In [None]:
fig = plot_gender_vars(var1='Age_', title_name='age')
iplot(fig);
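
Since the age answers are binned, the claim about the average age can only be checked approximately. One rough way is to replace each bin by its midpoint. This is a sketch under that assumption; the function names, the 10-year width used for the open-ended `70+` bin, and the toy counts are all hypothetical:

```python
def bin_midpoint(label: str, open_ended_width: int = 10) -> float:
    """Midpoint of an age bin such as '25-29'; '70+' is treated as 70..80."""
    if label.endswith('+'):
        low = int(label[:-1])
        return low + open_ended_width / 2
    low, high = label.split('-')
    return (int(low) + int(high)) / 2

def approx_mean_age(counts: dict) -> float:
    """Weighted mean of bin midpoints, weighted by respondents per bin."""
    total = sum(counts.values())
    return sum(bin_midpoint(label) * n for label, n in counts.items()) / total

# toy counts per age bin, not real survey numbers
toy_2018 = {'18-21': 10, '22-24': 20, '25-29': 30}
toy_2019 = {'22-24': 10, '25-29': 30, '30-34': 20}
print(approx_mean_age(toy_2018), approx_mean_age(toy_2019))
```

The midpoint approximation ignores how ages are distributed inside each bin, so small year-to-year differences should not be over-interpreted.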

Not only age progresses but also the level of education. We can see that more and more people get master's and doctoral degrees and provide valuable expertise.

But it may be that fewer and fewer bachelors are participating in this survey.

In [None]:
fig = plot_var(var1='Degree', title_name='degree')
iplot(fig);
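
The caveat about bachelors matters because the chart shows shares, not counts: a share can grow even when the underlying count does not. A small sketch with made-up numbers (the helper `shares` and both toy dictionaries are hypothetical):

```python
def shares(counts: dict) -> dict:
    """Convert raw counts per category into shares of the total."""
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}

# toy numbers: the Master count is unchanged, only Bachelor drops
toy_2018 = {'Bachelor': 60, 'Master': 100, 'Doctoral': 20}
toy_2019 = {'Bachelor': 30, 'Master': 100, 'Doctoral': 20}

print(shares(toy_2018)['Master'])
print(shares(toy_2019)['Master'])
```

Here the Master share rises only because fewer bachelors answered, which is why looking at raw counts next to the shares helps when reading the chart.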

If we talk about salaries, it is important to understand several things:
- some Ukrainians work in the USA and other countries and as a result have quite high salaries;
- those working in Kyiv and some other big cities have good salaries;
- people working in other cities usually have much lower salaries.

Rough salary levels:
- 100k is possible but difficult to achieve; this is usually the level of a head of DS;
- 50k is the level of a senior DS;
- ~30k is the level of a middle DS.

By the way, did you notice a rich person with $500k+? I wonder who that person is :)

In [None]:
fig = plot_var_salary(var1='Salary', title_name='salary')
iplot(fig);
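
The salary bins are text labels, so they do not sort correctly by default, which is why the plotting function carries a hand-written sort key. An alternative sketch is to keep one ordered list of labels and `reindex` the counts against it. The labels below are the ones used in this notebook; the function name `ordered_salary_counts` and the toy answers are hypothetical:

```python
import pandas as pd

# salary bins in ascending order, as used in this notebook
SALARY_ORDER = [
    '0-10,000', '10-20,000', '20-30,000', '30-40,000', '40-50,000',
    '50-60,000', '60-70,000', '70-80,000', '80-90,000', '90-100,000',
    '100-125,000', '125-150,000', '150-200,000', '200-250,000',
    '250,000-299,999', '300,000-500,000', '> $500,000',
]

def ordered_salary_counts(salaries: pd.Series) -> pd.Series:
    """Count answers per bin and return them in ascending salary order."""
    counts = salaries.value_counts()
    # keep only known bins that actually occur; other answers are dropped
    present = [label for label in SALARY_ORDER if label in counts.index]
    return counts.reindex(present)

# toy answers, not real survey data
toy = pd.Series([
    '100-125,000', '0-10,000', '20-30,000', '0-10,000',
    'I do not wish to disclose my approximate yearly compensation',
])
print(ordered_salary_counts(toy))
```

Answers that are not in `SALARY_ORDER`, such as the "do not wish to disclose" option, are dropped by this approach rather than raising an error.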

## Professional skills

It isn't surprising that Kaggle is one of the main sources of information :)
But it is worth noting that Slack communities aren't far behind.

In [None]:
fig = plot_choice_var(var='Q12', title_name='resources')
iplot(fig);
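
Questions like this one are multi-select: each answer option has its own column, which holds the option text if the respondent ticked it and is empty otherwise. Counting non-empty cells per column is therefore enough to rank the options. A minimal sketch on toy data; the column names are simplified, hypothetical stand-ins for the real survey columns, and `count_selected` is not part of the original notebook:

```python
import pandas as pd

def count_selected(df: pd.DataFrame, question: str) -> pd.Series:
    """Count how many respondents ticked each option of a multi-select question."""
    cols = [c for c in df.columns if c.startswith(f'{question}_Part')]
    counts = df[cols].notna().sum()                      # non-empty cell = ticked
    counts.index = [c.split(' - ')[-1] for c in cols]    # keep the option text only
    return counts.sort_values(ascending=False)

# toy frame with three respondents and three options
toy = pd.DataFrame({
    'Q12_Part_1 - Selected Choice - Kaggle': ['Kaggle', None, 'Kaggle'],
    'Q12_Part_2 - Selected Choice - Slack': [None, 'Slack', None],
    'Q12_Part_3 - Selected Choice - Reddit': [None, None, None],
})
print(count_selected(toy, 'Q12'))
```

Sorting the counts before plotting makes the most popular options easier to spot than the original column order does.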

There are several interesting things about popular programming languages in Ukraine:
- Python and SQL are, of course, the most used languages;
- R is the next most popular DS language;
- a lot of kagglers have a background in software development, so a lot of people know C++, JavaScript, Bash and other languages.

In [None]:
fig = plot_choice_var(var='Q18', title_name='languages')
iplot(fig);

I think it is quite amazing that a lot of people do some kind of research or try to apply new methods, while others build technologically sound solutions.

In [None]:
fig = plot_choice_var(var='Q9', title_name='additional activities')
iplot(fig);


Not surprisingly, word embeddings continue to be popular. They often don't require a GPU at all or can be used in small neural nets (not requiring huge resources). On the other hand, BERT and other transformers make it possible to get great results rather easily, provided of course that you have enough resources.

In [None]:
fig = plot_choice_var(var='Q27', title_name='nlp tools')
iplot(fig);

While scikit-learn is the most popular library for obvious reasons, there are other interesting libraries.
- LGB and XGB are very popular as they provide great results and are easy to use.
- Keras and TensorFlow are still ahead of PyTorch by the number of users.

In [None]:
fig = plot_choice_var(var='Q28', title_name='libraries')
iplot(fig);

Popular categories of automated ML tools are:
* Automated data augmentation
* Automated model selection
* Automated hyperparameter tuning, etc.

In [None]:
fig = plot_choice_var(var='Q25', title_name='ML tools')
iplot(fig);

## 1. Kaggle Competitions Grandmasters and Masters from Ukraine <a class="anchor" id="1"></a>

[Back to Table of Contents](#0.1)

* @aispiriants
* @alexkirnas
* @bloodaxe
* @bohdan
* @bpavlyshenko
* @cherednychenko
* @damsgroup
* @ddanevskyi (**Grandmaster**)
* @geneva
* @golondrina
* @igorkrashenyi
* @firenero
* @kesjien
* @kulishyurii
* @maksimovka
* @olenan
* @opanichev
* @sakvaua
* @spodarets
* @spsancti
* @tetyanayatsenko
* @vshmyhlo
* @x0x0w1
* @yaroshevskiy
* @szhitansky

**Who knows others? Please write in the comments**

## 2. Kaggle Notebooks Grandmasters, Masters and Experts from Ukraine <a class="anchor" id="2"></a>

[Back to Table of Contents](#0.1)

## Grandmasters:
* @vbmokin

## Masters:
* @isaienkov

## Experts:

* @govoruha 
* @gvyshnya
* @leighplt
* @opanichev

**Who knows others? Please write in the comments**

## 3. Kaggle Competitions Gold Medal (or Medals) <a class="anchor" id="3"></a>

[Back to Table of Contents](#0.1)

* @abmokin
* @aispiriants
* @alexkirnas
* @ayaroshevskiy
* @bloodaxe
* @bohdan
* @bpavlyshenko
* @cherednychenko
* @cutlass90
* @damsgroup
* @ddanevskyi
* @geneva
* @georgepashchenko
* @golondrina
* @hokmund
* @igorkrashenyi
* @irinaai
* @firenero
* @kavladglgm
* @kesjien
* @kulishyurii
* @ladler0320
* @maksimovka
* @natashamanzhos
* @olenan
* @opanichev
* @orgunova
* @pajari
* @sakvaua
* @spodarets
* @spsancti
* @szhitansky
* @tetyanayatsenko
* @vshmyhlo
* @x0x0w1
* @yaroshevskiy

**Who knows others? Please write in the comments**

Your comments and feedback are most welcome.

**I will continue these lists. If you know others, please write about them in the comments.**

[Go to Top](#0)