# Kaggle 2020 State of Data Science Survey
Kaggle is the world leader for competitive Data Science. Since 2017 it has conducted an annual survey of the state of Data Science. This notebook presents an exploratory data analysis (EDA) of the 2020 survey data, focusing on how Data Science varies by country.

Please leave any feedback in the comments below and upvote if you liked this notebook.

First we will see how many responses there were in each country.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import altair as alt
#enable altair to plot all the data
alt.data_transformers.disable_max_rows()
alt.themes.enable('fivethirtyeight')

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
# Load the data
df_survey_responses = pd.read_csv("/kaggle/input/kaggle-survey-2020/kaggle_survey_2020_responses.csv")

In [None]:
col_list = df_survey_responses.columns.to_list()
col_list
#create a mapping dict from question id to question content
question_mapping = df_survey_responses.loc[0].to_dict()
print(question_mapping)

df_survey_responses = df_survey_responses.drop(0, axis=0)
#shorten longest country names
df_survey_responses.loc[df_survey_responses['Q3']=="United Kingdom of Great Britain and Northern Ireland", "Q3"] = "United Kingdom"
df_survey_responses.loc[df_survey_responses['Q3']=="Iran, Islamic Republic of...", "Q3"] = "Iran"
df_survey_responses

In [None]:
# Q3 holds the respondent's country (Q5 is the job title), so group by Q3
country = df_survey_responses.groupby(['Q3'])
df_country = country.agg('count')
df_country.reset_index(inplace=True)
df_country=df_country[['Q3', 'Q2']]
df_country.columns=('Country', 'Count')
df_country.tail()

In [None]:
def chart_country(df, sort=None):
    # Bar chart of response counts; `sort` optionally fixes the order of the countries
    chart_country_groups = alt.Chart(df).mark_bar().encode(
        alt.X('Country:N', sort=sort, axis=alt.Axis(grid=False)),
        alt.Y('Count:Q', title="Count of Responses", axis=alt.Axis(grid=False))
    ).properties(
        width=680,
        background='#f5f5f5',
        title={
            "text": ["Number of Respondents per Country"],
            "fontSize": 20, "fontWeight": 600, "subtitleFontSize": 14
        }
    )

    return chart_country_groups

chart_country_groups = chart_country(df_country)

In [None]:
#countries sorted by count, largest first
df_country = df_country.sort_values(by=['Count'], ascending=False)
country_sort = df_country['Country'].to_list()
chart_country_groups = chart_country(df_country, sort=country_sort)

chart_country_groups

Here we can see responses by country. India leads by a large margin, followed by the USA. In third place is "Other", which aggregates all countries not on the list into one category. Let's dive into the data a bit more deeply.
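As a rough complement to the raw counts, each country's share of all responses can also be computed directly with `value_counts`. The sketch below uses a small made-up series standing in for the `Q3` column, so the names and numbers are illustrative only and are not taken from the survey.

```python
import pandas as pd

# Toy stand-in for the Q3 (country) column; the values are made up for illustration
toy_countries = pd.Series(
    ["India", "India", "India", "USA", "USA", "Other", "Brazil", "Japan"],
    name="Q3",
)

# value_counts sorts by descending count by default
counts = toy_countries.value_counts()
# normalize=True returns each country's share of all responses instead of a count
shares = toy_countries.value_counts(normalize=True)

print(counts.head(3))
print(shares.round(3).head(3))
```

On the real data the same two calls on `df_survey_responses['Q3']` should give the counts behind the bar chart and their proportions.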

In [None]:
age_country = df_survey_responses.groupby(['Q1', 'Q3'])
df_age_country = age_country.agg('count')
df_age_country.reset_index(inplace=True)
df_age_country=df_age_country[['Q1', 'Q3', 'Q2']]
df_age_country.columns=('Age', 'Country', 'Count')
df_age_country['Count'] = np.log(df_age_country['Count'])

#sort by count
df_age_country = df_age_country.sort_values(by=['Count'], ascending=False)

df_age_country.head()

### Breakdown by Country and Age

We break down the responses by age and country and count the entries in each. We log transform the count values to get a more informative visualization.
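To illustrate why the log transform helps, here is a minimal sketch with three hypothetical cell counts (not survey values). On the raw scale the largest cell dwarfs the others, so a colour scale would be dominated by it; after the transform the cells are much closer together.

```python
import numpy as np

# Hypothetical respondent counts for three age/country cells, chosen only for illustration
raw_counts = np.array([5, 50, 5000])

log_counts = np.log(raw_counts)  # natural log, as in the cell above

# Ratio between the largest and smallest cell before and after the transform
raw_ratio = raw_counts.max() / raw_counts.min()
log_ratio = log_counts.max() / log_counts.min()

print(raw_ratio)             # 1000.0
print(round(log_ratio, 2))   # 5.29
```

Note that log-transformed counts are no longer additive, so they suit a heatmap colour scale better than a stacked or normalized bar chart.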

In [None]:
alt.Chart(df_age_country).mark_rect().encode(
    alt.X('Country:N', sort=None, title="Country sorted by descending count of aggregate responses"),
    alt.Y('Age:N'),
    color=alt.Color('Count:Q', legend=alt.Legend(title="Log Count of Respondents", orient="top"))
).properties(width=680,
    background='#f5f5f5',title={
"text": ["Age of Respondents"],
"subtitle": ["Broken Down by Country"],
"fontSize":20, "fontWeight":600, "subtitleFontSize":14
}
)

We can see that in India many respondents are in the youngest age categories. Respondents in other nations tend to be older, most noticeably in the USA. We can see this more clearly if we focus on the nine largest country categories by number of respondents (one of which is the aggregated "Other" category).

In [None]:
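#only top countries
# Count was log-transformed above; undo that so the normalized bars reflect real respondent shares
df_age_top = df_age_country[df_age_country['Country'].isin(country_sort[:9])].copy()
df_age_top['Count'] = np.exp(df_age_top['Count']).round()

alt.Chart(df_age_top).mark_bar(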
    cornerRadiusTopLeft=3,
    cornerRadiusTopRight=3
).encode(
    x=alt.X('Count:Q', stack="normalize", title="Normalized Count"),
    y=alt.Y('Age:O'),
    color=alt.Color('Country:N', legend=alt.Legend(orient="right"))
).properties(width=500,
             background='#f5f5f5',title={
"text": ["Age of Respondents"],
"subtitle": ["Broken Down Proportionally by Country"],
"fontSize":20, "fontWeight":600, "subtitleFontSize":14
})

### Breakdown by Country and Highest Educational Achievement

Let's take a look at the breakdown of highest educational achievement by country in this Kaggle survey. Remember that selection bias may affect these results, especially given the age of respondents. Respondents aged 18-21 are unlikely to have a PhD, for example!

If there are respondents in that age group with a PhD please feel free to contact me about teaming up in future competitions ;-)
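One way to check how age and education interact is a row-normalized cross-tabulation. The sketch below runs on a handful of made-up rows; only the column names `Q1` (age) and `Q4` (education) follow the survey, and the values are hypothetical.

```python
import pandas as pd

# Made-up responses: Q1 is the age band, Q4 the highest education level
toy = pd.DataFrame({
    "Q1": ["18-21", "18-21", "18-21", "30-34", "30-34", "30-34"],
    "Q4": ["Bachelor’s degree", "Bachelor’s degree", "Master’s degree",
           "Master’s degree", "Doctoral degree", "Doctoral degree"],
})

# normalize="index" turns each age row into proportions that sum to 1
age_by_education = pd.crosstab(toy["Q1"], toy["Q4"], normalize="index")

print(age_by_education.round(2))
```

Applied to `df_survey_responses`, a table like this would show directly how rare doctoral degrees are in the youngest age bands.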

In [None]:
education_country = df_survey_responses.groupby(['Q4', 'Q3'])
df_education_country = education_country.agg('count')
df_education_country.reset_index(inplace=True)
df_education_country=df_education_country[['Q4', 'Q3', 'Q2']]
df_education_country.columns=('education', 'country', 'Count')
df_education_country['rank'] = df_education_country['education']
education_sort_map = {"Doctoral degree": 6, 'Master’s degree': 5, "Professional degree": 4, 'Bachelor’s degree': 3, 
                  "Some college/university study without earning a bachelor’s degree": 2,
                  "No formal education past high school": 1, "I prefer not to answer": 0
                 }
df_education_country['rank'] = df_education_country['rank'].replace(education_sort_map)
df_education_country['education_rank'] = df_education_country['rank'].astype(str) + ' - ' + df_education_country['education']

In [None]:
alt.Chart(df_education_country).mark_bar(
    cornerRadiusTopLeft=3,
    cornerRadiusTopRight=3
).encode(
    x=alt.X('country:N'),
    y=alt.Y('Count:Q', stack="normalize", title="Normalized Count"),
    color=alt.Color('education_rank:N', legend=alt.Legend(orient="left"))
).properties(width=520,
             background='#f5f5f5',title={
"text": ["Highest Educational Achievement Held by Respondents"],
"subtitle": ["Broken Down Proportionally by Country"],
"fontSize":18, "fontWeight":600, "subtitleFontSize":14
})

### Breakdown by Country and Job Title

Let's take a look at the breakdown of job title by country.

In [None]:
job_country = df_survey_responses.groupby(['Q5', 'Q3'])
df_job_country = job_country.agg('count')
df_job_country.reset_index(inplace=True)
df_job_country=df_job_country[['Q5', 'Q3', 'Q2']]
df_job_country.columns=('job', 'country', 'Count')

In [None]:
alt.Chart(df_job_country).mark_bar(
    cornerRadiusTopLeft=3,
    cornerRadiusTopRight=3
).encode(
    x=alt.X('country:N'),
    y=alt.Y('Count:Q', stack="normalize", title="Normalized Count"),
    color=alt.Color('job:N', legend=alt.Legend(orient="left"))
).properties(width=550,
             background='#f5f5f5',title={
"text": ["Job Titles Held by Respondents"],
"subtitle": ["Broken Down Proportionally by Country"],
"fontSize":18, "fontWeight":600, "subtitleFontSize":14
})

We see that in some countries approximately half of respondents are students. Let's just focus on the job of "Data Scientist".

In [None]:
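# Keep only the "Data Scientist" rows, then attach each country's total response count
df_data_scientists = df_job_country[df_job_country['job'] == "Data Scientist"]
df_job_proportions = df_data_scientists.merge(df_country, left_on='country', right_on='Country',
          suffixes=(None, '_total'))
df_job_proportions.drop(columns=['country'], inplace=True)
df_job_proportions['job_proportion'] = df_job_proportions['Count'] / df_job_proportions['Count_total']
df_job_proportions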

In [None]:
alt.Chart(df_job_proportions).mark_bar(color="grey").encode(
    alt.Y('Country:N', axis=alt.Axis(grid=False)),
    alt.X('job_proportion:Q', title="Proportion of Respondents", axis=alt.Axis(grid=True))
    ).properties(width=680, height=440,
    background='#f5f5f5',title={
"text": ["Proportion of Respondents who are Data Scientists"],
"subtitle": ["Broken Down by Country"],
"fontSize":20, "fontWeight":600, "subtitleFontSize":14
})


### Breakdown by Years of Coding Experience

In [None]:
experience_country = df_survey_responses.groupby(['Q6', 'Q3'])
df_experience_country = experience_country.agg('count')
df_experience_country.reset_index(inplace=True)
df_experience_country=df_experience_country[['Q6', 'Q3', 'Q2']]
df_experience_country.columns=('experience', 'country', 'Count')

df_experience_country['rank'] = df_experience_country['experience']
experience_sort_map = {"20+ years": 6, '10-20 years': 5, "5-10 years": 4, '3-5 years': 3, 
                  "1-2 years": 2,
                  "< 1 years": 1, "I have never written code": 0
                 }
df_experience_country['rank'] = df_experience_country['rank'].replace(experience_sort_map)
df_experience_country['experience_rank'] = df_experience_country['rank'].astype(str) + ' - ' + df_experience_country['experience']
df_experience_country

In [None]:
alt.Chart(df_experience_country).mark_bar(
    cornerRadiusTopLeft=3,
    cornerRadiusTopRight=3
).encode(
    y=alt.Y('country:N', axis=alt.Axis(grid=False, titleAngle=0)),
    x=alt.X('Count:Q', stack="normalize", title="Normalized Count"),
    color=alt.Color('experience_rank:N', legend=alt.Legend(orient="bottom"), title="Years of Experience")
).properties(width=550, height=500,
             background='#f5f5f5',title={
"text": ["Years of Experience"],
"subtitle": ["Broken Down Proportionally by Country"],
"fontSize":18, "fontWeight":600, "subtitleFontSize":14
})

### What Language Would you Recommend an aspiring Data Scientist Learn First?
This gives a view of where Data Scientists think Data Science is heading, unclouded by the legacy of languages that Data Scientists may or may not already know. Again, following our theme, we will look at this by country.

In [None]:
# Named prog_lang_groups to avoid clashing with the prog_lang_country() chart function
prog_lang_groups = df_survey_responses.groupby(['Q8', 'Q3'])
df_prog_lang_country = prog_lang_groups.agg('count')
df_prog_lang_country.reset_index(inplace=True)
df_prog_lang_country=df_prog_lang_country[['Q8', 'Q3', 'Q2']]
df_prog_lang_country.columns=('prog_lang', 'country', 'Count')
#sort by count
df_prog_lang_country = df_prog_lang_country.sort_values(by=['Count'], ascending=False)
df_prog_lang_country.head()

In [None]:
def prog_lang_country(df):
    chart_prog_lang_country = alt.Chart(df).mark_bar(
        cornerRadiusTopLeft=3,
        cornerRadiusTopRight=3
    ).encode(
        y=alt.Y('country:N', axis=alt.Axis(grid=False, titleAngle=0)),
        x=alt.X('Count:Q', stack="normalize", title="Normalized Count"),
        color=alt.Color('prog_lang:N', legend=alt.Legend(orient="bottom"), title="Programming Language")
    ).properties(width=550, height=500,
                 background='#f5f5f5',title={
    "text": ["Recommended Starting Programming Language for Data Science"],
    "subtitle": ["Broken Down Proportionally by Country"],
    "fontSize":18, "fontWeight":600, "subtitleFontSize":14
    })
    
    return chart_prog_lang_country

In [None]:
prog_lang_country(df_prog_lang_country)

We can see that Python is by far the most recommended language in all countries. Let's try excluding Python from the list to get a clearer view of which other languages an aspiring data scientist could consider learning.

In [None]:
prog_lang_country(df_prog_lang_country.loc[df_prog_lang_country['prog_lang']!="Python"])

Perhaps not surprisingly we can see that SQL and R are both widely recommended in all countries. 

Aspiring data scientists may find more detail on skills in demand in their own country, but remember that the sample size for many of the countries listed may be small, so those results may not be very reliable.
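A simple guard against small samples is to drop countries whose total response count falls below a threshold before charting. The sketch below assumes a frame shaped like `df_prog_lang_country`. The rows are made up, and `MIN_RESPONSES` is a hypothetical cut-off rather than a value used in this notebook.

```python
import pandas as pd

# Made-up per-country, per-language counts in the same shape as df_prog_lang_country
toy = pd.DataFrame({
    "prog_lang": ["Python", "R", "SQL", "Python", "R", "Python"],
    "country":   ["India", "India", "India", "Peru", "Peru", "Nepal"],
    "Count":     [400, 60, 40, 12, 3, 5],
})

MIN_RESPONSES = 100  # hypothetical cut-off; pick one that suits your analysis

# Total responses per country, broadcast back onto every row of that country
country_totals = toy.groupby("country")["Count"].transform("sum")

# Keep only rows belonging to countries that meet the threshold
reliable = toy[country_totals >= MIN_RESPONSES]

print(reliable["country"].unique())
```

The filtered frame can then be passed to the same chart function, so that only countries with enough responses appear.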

### Conclusion
Data Science is flourishing in 2020. We can see respondents to the Kaggle survey from over 50 different countries. By breaking the survey data down by country, we have found many interesting insights. For example, respondents from India tended to be younger than those from the US. Respondents from Iran and Switzerland were more likely to have a PhD. Respondents from Sweden were more likely to have 20+ years of coding experience. I hope this notebook was interesting and insightful and that you learnt something new about the state of Data Science in your own country.

Let me know your thoughts and any feedback in the comments below. If you liked the notebook or found it helpful please upvote.