<h1><center><font size="6">Is Romania on the Kaggle map?</font></center></h1>

<center><img src="https://upload.wikimedia.org/wikipedia/commons/thumb/c/c5/EU-Romania_%28orthographic_projection%29.svg/1024px-EU-Romania_%28orthographic_projection%29.svg.png" width="400"></img></center>  


## <a id='0'>Content</a>

- <a href='#1'>Introduction</a>  
- <a href='#2'>Prepare the data analysis</a>  
 - <a href='#21'>Load packages</a>  
 - <a href='#22'>Load the data</a>  
- <a href='#3'>Data exploration</a>   
- <a href='#4'>Combine the features</a>   
- <a href='#5'>Final note</a>   



# <a id='1'>Introduction</a>  

We will analyze the `2020 Kaggle ML & DS Survey` dataset, which contains the answers given by Kaggle users to the 2020 survey.  

Out of all the data, we will focus only on the Romanian respondents. Romania is a small country in Europe, with a population of 19.2 million (according to the latest UN estimates).

Romania's population represents 0.25% of the world population.




# <a id='2'>Prepare the data analysis</a>   


Before starting the analysis, we need a few preparation steps: load the packages, then load and inspect the data.



# <a id='21'>Load packages</a>

We load the packages used for the analysis.

In [None]:
import pandas as pd
import numpy as np
import sys
import os
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.graph_objs as go
import plotly.figure_factory as ff
from plotly import tools
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
init_notebook_mode(connected=True)

<a href="#0"><font size="1">Go to top</font></a>  


# <a id='22'>Load the data</a>  


There are three dataset files. Let's load the responses only.

In [None]:
data_df = pd.read_csv('../input/kaggle-survey-2020/kaggle_survey_2020_responses.csv', low_memory=False)

In [None]:
print("Multiple choice response - rows: {} columns: {}".format(data_df.shape[0], data_df.shape[1]))


<a href="#0"><font size="1">Go to top</font></a>  


# <a id='3'>Data exploration</a>  


We will first select only data from Romanian contributors.

We will also glimpse the free format response dataset.

## Glimpse the data

In [None]:
multiple_df = data_df.loc[data_df.Q3=="Romania"]
print(f"Data entries about Romania: {multiple_df.shape[0]}. Percent from total answers: {round(multiple_df.shape[0] / data_df.shape[0] * 100, 2)}%")

With 0.3% of the answers, Romania seems to be slightly better represented than the average country, since its share of respondents is larger than its share of the world population (0.25%). But let's compare this with the USA, China and India.

In [None]:
us_responses = data_df.loc[data_df.Q3=='United States of America'].shape[0]
china_responses = data_df.loc[data_df.Q3=='China'].shape[0]
india_responses = data_df.loc[data_df.Q3=='India'].shape[0]
print(f"Data entries about USA: {us_responses}. Percent from total answers: {round(us_responses / data_df.shape[0] * 100, 2)}%")
print(f"Data entries about China: {china_responses}. Percent from total answers: {round(china_responses / data_df.shape[0] * 100, 2)}%")
print(f"Data entries about India: {india_responses}. Percent from total answers: {round(india_responses / data_df.shape[0] * 100, 2)}%")

The USA, with a population of 331 million, accounts for 4.25% of the world population and has 11.16% of the total answers. Romania, with an over-representation factor of about 120% on Kaggle, is therefore well behind the USA, with about 260%.  

Romania is better represented than China, which accounts for 18.47% of the world population and is underrepresented in the survey. This might be due to multiple factors: under-representation of China on Kaggle (we know that many competitions are hosted locally there) or reluctance of Chinese Kagglers to fill in a survey.

Romania ranks lower than India, which has 29.2% of the answers, while India's population accounts for 17.7% of the world population.
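
The representation factors discussed here are the ratio between a country's share of survey answers and its share of the world population. A minimal sketch of that arithmetic, using only the rounded percentages quoted in the text (the helper name `representation_factor` is introduced here for illustration):

```python
def representation_factor(answers_pct, population_pct):
    """Ratio of the share of survey answers to the share of world population."""
    return answers_pct / population_pct

# Rounded percentages quoted in the text: (share of answers, share of world population)
quoted = {
    "Romania": (0.3, 0.25),
    "United States of America": (11.16, 4.25),
    "India": (29.2, 17.7),
}

factors = {country: representation_factor(a, p) for country, (a, p) in quoted.items()}
for country, factor in factors.items():
    print(f"{country}: {factor:.2f}")
```

A factor above 1 means the country contributes a larger share of answers than its share of population would suggest.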

I will use UN Data to get more insight into these disparities.



<a href="#0"><font size="1">Go to top</font></a>  

## Columns to visualize


Some of the following columns are grouped, to capture the multiple choice answers where the order of the answers gives the order of preferences. 
Let's check which column groups have only one item (columns in groups with multiple items are named like `Q11_Part1`, `Q11_Part2`, [...]). For this, we compose filters like `Q1`, `Q2`, ..., `Q11`, `Q12` etc., select the columns matching each filter, and count them. We keep only the groups with a single column. These are the columns we will represent further.



In [None]:
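# Count, for each question number, how many columns belong to its group.
# The name must be exactly "Q<i>" or start with "Q<i>_", so that "Q1"
# does not also match "Q10", "Q11", ...
col_names = pd.Series(multiple_df.columns.values)
columns = []
for i in range(1, 50):
    var = "Q{}".format(i)
    n_cols = col_names.str.match(r"{}(_|$)".format(var)).sum()
    if n_cols == 1:
        columns.append(var)

print("The columns with only one item in the column group are:\n",columns)

We will visualize these single-column questions in the following sections, starting with the first ones (`Q1`, `Q2`, ...).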
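
When counting the columns of a group by question name, a plain substring test can over-count, because `Q1` is also a substring of `Q10`, `Q11` and so on. The sketch below compares a substring test with an anchored pattern on a handful of toy column names; the names and the two helper functions are illustrative assumptions, not the full survey schema.

```python
import pandas as pd

# Toy column names in the same style as the survey file (not the full list)
cols = pd.Series(["Q1", "Q2", "Q10_Part_1", "Q10_Part_2", "Q11", "Q12_Part_1", "Q12_OTHER"])

def group_size_substring(cols, i):
    # Plain substring test: "Q1" is also found inside "Q10_Part_1", "Q11", ...
    return int(cols.str.contains(f"Q{i}", regex=False).sum())

def group_size_exact(cols, i):
    # Anchored test: the name is exactly "Q<i>" or starts with "Q<i>_"
    return int(cols.str.match(rf"Q{i}(_|$)").sum())

print("substring:", group_size_substring(cols, 1))
print("anchored: ", group_size_exact(cols, 1))
```

With the anchored pattern, each question number is counted only against its own columns.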


<a href="#0"><font size="1">Go to top</font></a>  

## Age interval

Let's show the age intervals, as declared by the respondents. 
We will also create a function to count the categories of a categorical column and a function to draw bar plots using Plotly.

In [None]:
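def get_categories(data, val):
    # Drop the question-text row (index label 0 of the raw survey file) if it is present;
    # a country-filtered frame does not contain it, so no respondent is lost
    tmp = data.drop(index=0, errors='ignore')[val].value_counts()
    # Name the category column 'index' explicitly, as expected by draw_trace_bar
    return tmp.rename_axis('index').reset_index(name='Number')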

In [None]:
df = get_categories(multiple_df, 'Q1')

In [None]:
def draw_trace_bar(data, title, xlab, ylab,color='Blue'):
    trace = go.Bar(
            x = data['index'],
            y = data['Number'],
            marker=dict(color=color),
            text=data['index']
        )
    data = [trace]

    layout = dict(title = title,
              xaxis = dict(title = xlab, showticklabels=True, tickangle=15,
                          tickfont=dict(
                            size=9,
                            color='black'),), 
              yaxis = dict(title = ylab),
              hovermode = 'closest'
             )
    fig = dict(data = data, layout = layout)
    iplot(fig, filename='draw_trace')

In [None]:
draw_trace_bar(df, 'Number of people/age interval', 'Age interval', 'Number of people' )


<a href="#0"><font size="1">Go to top</font></a>  



## Gender

Let's explore the gender groups.

In [None]:
draw_trace_bar(get_categories(multiple_df,'Q2'), "Number of people in each gender", "Gender", "Number of people", "Green")

The majority of the respondents from Romania are men (80%); only 20% are women.
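
The percentages can be read from the bar plot, but they can also be computed directly. A minimal sketch on toy answers (the five values below are made up for illustration and are not survey data), using `value_counts(normalize=True)`:

```python
import pandas as pd

# Toy answers standing in for the Romanian subset of `Q2` (not survey data)
toy_q2 = pd.Series(["Man", "Man", "Man", "Man", "Woman"], name="Q2")

# normalize=True returns shares instead of counts
gender_pct = (toy_q2.value_counts(normalize=True) * 100).round(1)
print(gender_pct)
```

On the real data, the same call would be applied to `multiple_df['Q2']`.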


<a href="#0"><font size="1">Go to top</font></a>  



## Country

We will show here how Romania ranks among the countries of the world in terms of number of answers.

In [None]:
df = get_categories(data_df, 'Q3')
df.head()

In [None]:
trace = go.Choropleth(
            locations = df['index'],
            locationmode='country names',
            z = df['Number'],
            text = df['index'],
            autocolorscale =False,
            reversescale = True,
            colorscale = 'rainbow',
            marker = dict(
                line = dict(
                    color = 'rgb(0,0,0)',
                    width = 0.5)
            ),
            colorbar = dict(
                title = 'Respondents',
                tickprefix = '')
        )

data = [trace]
layout = go.Layout(
    title = 'Number of respondents per country',
    geo = dict(
        showframe = True,
        showlakes = False,
        showcoastlines = True,
        projection = dict(
            type = 'natural earth'
        )
    )
)

fig = dict( data=data, layout=layout )
iplot(fig)

Let's use UN Data to try to understand if there are correlations between GDP, GDP per capita, or other development indicators and percent of answers/country.

In [None]:
undata_df = pd.read_csv("../input/undata-country-profiles/country_profile_variables.csv")
undata_df['country'] = undata_df['country'].apply(lambda x: x.replace('United Kingdom', 'United Kingdom of Great Britain and Northern Ireland'))
undata_df['country'] = undata_df['country'].apply(lambda x: x.replace('Iran (Islamic Republic of)', 'Iran, Islamic Republic of...'))
data_df['country'] = data_df['Q3']
data_df = data_df.merge(undata_df, on='country')
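
An inner merge silently drops survey rows whose country name has no match in the UN data. One way to check for such cases, sketched here on toy frames (the values are illustrative, not the real files), is a left merge with `indicator=True`:

```python
import pandas as pd

# Toy stand-ins for the survey answers and the UN country profiles (not real data)
survey = pd.DataFrame({"country": ["Romania", "Romania", "India", "Other"]})
un = pd.DataFrame({"country": ["Romania", "India"],
                   "Region": ["EasternEurope", "SouthernAsia"]})

# A left merge with indicator=True keeps every survey row and adds a `_merge`
# column that says whether a matching UN row was found
checked = survey.merge(un, on="country", how="left", indicator=True)
unmatched = sorted(checked.loc[checked["_merge"] == "left_only", "country"].unique())
print("Survey countries without a UN match:", unmatched)
```

Names that appear in the unmatched list are candidates for the kind of renaming applied to the United Kingdom and Iran.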

In [None]:
agg_data_df = data_df.groupby(['country', 'Population in thousands (2017)', 'GDP per capita (current US$)', 'Education: Government expenditure (% of GDP)', 'Region'])['Q1'].count().reset_index()
agg_data_df.columns = ['country', 'Population', 'GDP_per_capita', 'Education: Government expenditure (% of GDP)', 'Region','Answers']
total_population = np.sum(agg_data_df.Population)
total_answers = np.sum(agg_data_df.Answers)
#print(total_population, total_answers)
agg_data_df['Answers Factor'] = (agg_data_df['Answers'] / total_answers) / (agg_data_df['Population'] / total_population)
agg_data_df.head()

In [None]:
import plotly.express as px

fig = px.scatter(agg_data_df, x="Answers Factor", y="GDP_per_capita", size="Answers", color="Region",
           hover_name="country", size_max=60)
fig.show()

In [None]:
trace = go.Choropleth(
            locations = agg_data_df['country'],
            locationmode='country names',
            z = agg_data_df['Answers Factor'],
            text = agg_data_df['country'],
            autocolorscale =False,
            reversescale = True,
            colorscale = 'viridis',
            marker = dict(
                line = dict(
                    color = 'rgb(0,0,0)',
                    width = 0.5)
            ),
            colorbar = dict(
                title = 'Answers factor',
                tickprefix = '')
        )

data = [trace]
layout = go.Layout(
    title = 'Answer factor (percent of respondents from total / percent of population from total) per country',
    geo = dict(
        showframe = True,
        showlakes = False,
        showcoastlines = True,
        projection = dict(
            type = 'natural earth'
        )
    )
)

fig = dict( data=data, layout=layout )
iplot(fig)

It looks like there is no strong correlation between GDP per capita and the answer factor (calculated as the ratio between the percent of answers and the percent of population). 

At the same time, with the exception of Saudi Arabia and the Republic of Korea, there are no countries with a GDP per capita larger than $20,000 and an answer factor under 1. 

As for the countries with a GDP per capita larger than $40,000, all of them have an answer factor > 2, with the exception of Germany and Belgium. The largest answer factor is that of Singapore, at almost 9.


Romania, with a GDP per capita (as of 2017) < $10,000, has an answer factor of 1.04.
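
The visual impression about the correlation can be complemented with a number. A minimal sketch, assuming a frame with the same two column names as `agg_data_df`; the values below are toy numbers, not the real aggregates:

```python
import pandas as pd

# Toy values standing in for `agg_data_df` (not the real aggregates)
toy = pd.DataFrame({
    "GDP_per_capita": [1_500, 9_000, 20_000, 45_000, 55_000],
    "Answers Factor": [1.6, 1.0, 0.9, 2.5, 8.0],
})

# Pearson correlation coefficient between the two columns (pandas default method)
r = toy["GDP_per_capita"].corr(toy["Answers Factor"])
print(f"Pearson r = {r:.2f}")
```

Pearson's coefficient is sensitive to outliers, so a country with a very large answer factor can dominate the result.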


<a href="#0"><font size="1">Go to top</font></a>  

## Highest level of formal education

In [None]:
draw_trace_bar(get_categories(multiple_df,'Q4'), "Highest level of formal education", "Education", "Number of people", "Magenta")

## Current job title

The next question is about the current job title of the respondent.

In [None]:
draw_trace_bar(get_categories(multiple_df,'Q5'), "Current job title", "Current job title", "Number of respondents", "Tomato")


<a href="#0"><font size="1">Go to top</font></a>  


## Years of experience writing code

The next question is about the number of years of experience writing code.

In [None]:
draw_trace_bar(get_categories(multiple_df,'Q6'), "Years of experience", "Years of experience", "Number of respondents", "Red")

<a href="#0"><font size="1">Go to top</font></a>  


## What programming language to learn first?

In [None]:
draw_trace_bar(get_categories(multiple_df,'Q8'), 
               "What programming language would you recommend an aspiring data scientist to learn first?", 
               "Programming language", "Number of respondents", "Lightblue")

<a href="#0"><font size="1">Go to top</font></a>  


## What computing platform is used most often?

In [None]:
draw_trace_bar(get_categories(multiple_df,'Q11'), 
               "What type of computing platform do you use most often for your data science projects?", 
               "Computing platform", "Number of respondents", "Orange")

<a href="#0"><font size="1">Go to top</font></a>  


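## ML methods included in the business

In [None]:
# Use an explicit title: multiple_df['Q22'].values[0] is the first respondent's answer, not the question text
draw_trace_bar(get_categories(multiple_df,'Q22'), 
               "Does the current employer incorporate ML methods into the business?", 
               "ML methods included in the business", "Number of respondents", "Lightgray")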

<a href="#0"><font size="1">Go to top</font></a>  


## Individuals responsible for data science at your place of business?

In [None]:
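# Use an explicit title: multiple_df['Q21'].values[0] is the first respondent's answer, not the question text
draw_trace_bar(get_categories(multiple_df,'Q21'), 
               "Individuals responsible for data science at your place of business", 
               "Individuals involved in Data Science", "Number of respondents", "Darkgreen")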


# <a id='4'>Combine the features</a>


Let's visualize some of the dimensions presented previously in combination. For example, let's look at the combined distribution of sex and age.
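
Before plotting, a combined distribution can also be inspected as a table. The sketch below uses `pd.crosstab` on toy respondents (made-up values, not survey data) as an alternative view of the same kind of counts:

```python
import pandas as pd

# Toy respondents (not survey data); Q1 is the age interval, Q2 the gender
toy = pd.DataFrame({
    "Q1": ["18-21", "18-21", "25-29", "25-29", "25-29"],
    "Q2": ["Man", "Woman", "Man", "Man", "Woman"],
})

# Rows are age intervals, columns are genders, cells are respondent counts
counts = pd.crosstab(toy["Q1"], toy["Q2"])
print(counts)
```

On the real data, the same call would take `multiple_df['Q1']` and `multiple_df['Q2']`.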

## Number of respondents by Sex and Age

In [None]:
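def get_categories_group(data, val_group, val):
    # Drop the question-text row (index label 0 of the raw survey file) only if it is present
    tmp = data.drop(index=0, errors='ignore').groupby(val_group)[val].value_counts()
    return pd.DataFrame(data={'Number': tmp.values}, index=tmp.index).reset_index()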

In [None]:
def draw_trace_group_bar(data_df, val_group, val, title, xlab, ylab,color='Blue'):
    data = list()
    groups = (data_df.groupby([val_group])[val_group].nunique()).index
    for group in groups:
        data_group_df = data_df[data_df[val_group]==group]
        trace = go.Bar(
                x = data_group_df[val],
                y = data_group_df['Number'],
                name = group,
                #marker=dict(color=color),
                text=data_group_df[val]
            )
        data.append(trace)

    layout = dict(title = title,
              xaxis = dict(title = xlab, showticklabels=True, tickangle=15,
                          tickfont=dict(
                            size=9,
                            color='black'),), 
              yaxis = dict(title = ylab),
              hovermode = 'closest'
             )
    fig = dict(data = data, layout = layout)
    iplot(fig, filename='draw_trace')

In [None]:
df = get_categories_group(multiple_df, 'Q1', 'Q2')
draw_trace_group_bar(df, 'Q1', 'Q2', 'Number of respondents by Sex and age', 'Sex', 'Number of respondents')

## Number of respondents by Age and Highest level of formal education

In [None]:
df = get_categories_group(multiple_df, 'Q1', 'Q4')
draw_trace_group_bar(df, 'Q1', 'Q4', 'Number of respondents by Age and Highest level of formal education', 'Highest level of formal education', 'Number of respondents')

## Age and number of years of experience

In [None]:
df = get_categories_group(multiple_df, 'Q1', 'Q6')
draw_trace_group_bar(df, 'Q1', 'Q6', 'Number of respondents by Age and number of years of experience', 
                     'Number of years of experience', 'Number of respondents')

## Highest level of formal education and current yearly compensation

In [None]:
# Q24 is the yearly compensation question; Q6 is years of coding experience
df = get_categories_group(multiple_df, 'Q4', 'Q24')
draw_trace_group_bar(df, 'Q4', 'Q24', 'Number of respondents by Highest level of formal education and Current yearly compensation', 'Current yearly compensation', 'Number of respondents')

## Current title and years of experience

In [None]:
# Q5 is the current job title; Q4 is the highest level of formal education
df = get_categories_group(multiple_df, 'Q6', 'Q5')
draw_trace_group_bar(df, 'Q6', 'Q5', 'Number of respondents by Current title and years of experience', 'Current title', 'Number of respondents')