# Kaggle Machine Learning & Data Science Survey EDA
## Does a degree matter for earnings in the data science field?

In this notebook I explore and analyze various properties of the survey data. The data covers many relevant attributes of Kaggle users, including programming language, education, and earnings, which are the ones I am most interested in. To be precise, there are two main questions that I would like to answer.

The first is which programming languages matter for users of different ages. With so many programming languages available today, where should a new data science enthusiast focus? I would like to know which languages Kaggle users use regularly, broken down by age.

I would also like to know whether an education degree matters in determining earnings. With so many data science and machine learning resources available on the internet, does a degree still matter for one's earnings? Does one need to pursue a higher degree in order to earn a better salary?

The tools used in this notebook are pandas, Plotly, and SciPy.

In [None]:
import numpy as np
import pandas as pd
from plotly.subplots import make_subplots
import plotly.graph_objects as go
from plotly.offline import plot, iplot, init_notebook_mode
init_notebook_mode(connected=True)

In [None]:
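df = pd.read_csv("/kaggle/input/kaggle-survey-2021/kaggle_survey_2021_responses.csv")
columns = df.columns
# The first data row holds the question text, not a response.
questions = df.iloc[0, :].values
# Keep only the responses; copy so later in-place edits do not act on a slice.
df = df.iloc[1:, :].copy()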
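
Because the rest of the notebook addresses columns by position, it can help to keep a lookup from column code to question text. Below is a minimal sketch on a toy CSV, assuming the same layout as the survey file (codes in the header, question text in the first data row). The question wording and the names `toy_csv`, `question_text`, and `responses` are made up for illustration.

```python
import io
import pandas as pd

# Toy stand-in for the survey file: header codes, one row of question text,
# then the responses. The question wording here is hypothetical.
toy_csv = io.StringIO(
    "Q1,Q4\n"
    "Age question text,Education question text\n"
    "25-29,Master's degree\n"
    "18-21,Bachelor's degree\n"
)
raw = pd.read_csv(toy_csv)

# Map each column code to its question text so columns can be looked up by meaning.
question_text = raw.iloc[0].to_dict()

# Everything after the first row is an actual response.
responses = raw.iloc[1:].reset_index(drop=True)

print(question_text["Q4"])
print(len(responses))
```

With such a mapping, a positional index like `columns[4]` can be checked against the question it is supposed to represent before plotting.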

## User age distribution

In [None]:
age_counts = df[columns[1]].value_counts().sort_index()
fig = go.Figure(data=go.Bar(x=age_counts.index, y=age_counts.values))

fig.show()

## User age distribution by programming language

In [None]:
pro_langs = ["Python", "R", "SQL", "C", "C++", "Java", "Javascript", "Julia", "Swift", "Bash", "Matlab"]
rows = 3
cols = 4

fig = make_subplots(
    rows=rows, cols=cols, 
    subplot_titles=pro_langs)

# One sub-frame per language: respondents who selected it (language columns start at index 7).
df_subs = []
for idx, pro_lang in enumerate(pro_langs):
    df_sub = df.dropna(subset=[columns[7+idx]])
    df_subs.append(df_sub)

# Fill the grid row by row with each language's age distribution.
for idx, df_sub in enumerate(df_subs):
    row, col = divmod(idx, cols)
    age_counts = df_sub[columns[1]].value_counts().sort_index()
    fig.add_trace(
        go.Bar(x=age_counts.index, y=age_counts.values),
        row=row + 1, col=col + 1)

fig.update_layout(height=1000, width=1400, title_text="Age distribution by programming language")
fig.show()

As we can see from the charts above, there are several takeaways:

- Of all the programming languages in the survey, Python is the most well-rounded. This can be seen from how the Python distribution is the flattest compared to the other languages. Other languages that are used across all age groups, though by fewer users than Python, are Matlab and Javascript.

- Julia, Bash, and Swift are used more by older Kaggle users than by younger ones.

- C, Java, and C++ are used more by younger people. While it may be true that the newer generation needs these three languages, it could also mean that the education system does not reflect the skills needed in working life. Why? Those aged 18-24 are generally still in school, while those older than 24 have usually finished school. Note that while I could use education degree for this, that data only indicates the highest education level attained, so it might not be 100% accurate.
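
Judging how flat each panel is can be easier with shares than with raw counts, since the languages differ a lot in total number of users. Below is a minimal sketch of that normalization in plain Python. The helper `age_shares` and the sample answers are hypothetical and not taken from the survey.

```python
from collections import Counter

def age_shares(ages):
    """Return the share of respondents in each age group, sorted by group label."""
    counts = Counter(ages)
    total = sum(counts.values())
    return {group: counts[group] / total for group in sorted(counts)}

# Made-up answers from users of one language.
sample_ages = ["18-21", "18-21", "22-24", "25-29"]
print(age_shares(sample_ages))
```

Plotting these shares per language would put all panels on the same 0 to 1 scale, so a flat profile and a skewed one can be compared directly.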

## Education level

In [None]:
df[columns[4]].value_counts().sort_index()

In [None]:
fig = go.Figure(data=[go.Pie(labels=df[columns[4]].value_counts().sort_index().index, values=df[columns[4]].value_counts().sort_index().values)])

fig.update_traces(marker=dict(line=dict(color='#000000', width=2)))
fig.show()

- A high percentage of Kaggle users hold a master's or bachelor's degree.

In [None]:
df_edu = df.loc[df[columns[4]] != 'I prefer not to answer']

In [None]:
df_edu[columns[4]].value_counts().sort_index().index[0]

In [None]:
pro_langs = ["Python", "R", "SQL", "C", "C++", "Java", "Javascript", "Julia", "Swift", "Bash", "Matlab"]
rows = 3
cols = 4

fig = make_subplots(
    rows=rows, cols=cols, 
    subplot_titles=pro_langs)

df_subs = []
for idx, pro_lang in enumerate(pro_langs):
    df_sub = df.dropna(subset=[columns[7+idx]])
    df_subs.append(df_sub)

# Fill the grid row by row with each language's education-level distribution.
for idx, df_sub in enumerate(df_subs):
    row, col = divmod(idx, cols)
    edu_counts = df_sub[columns[4]].value_counts().sort_index()
    fig.add_trace(
        go.Bar(x=edu_counts.index, y=edu_counts.values),
        row=row + 1, col=col + 1)

fig.update_layout(height=2500, width=1500, title_text="Programming language used by education level")
fig.show()


## Does education level matter in determining earnings?

The earnings column needs a little cleaning before we can go further.

In [None]:
df[columns[127]].value_counts().sort_index()

In [None]:
df.dropna(subset=[columns[127]], inplace=True)

In [None]:
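# Strip currency symbols, separators, and spaces. regex=False treats '$' as a literal character.
df[columns[127]] = df[columns[127]].str.replace('$', '', regex=False)
df[columns[127]] = df[columns[127]].str.replace(',', '', regex=False)
df[columns[127]] = df[columns[127]].str.replace('>', '', regex=False)
df[columns[127]] = df[columns[127]].str.replace(' ', '', regex=False)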

In [None]:
# Split each range into lower and upper bound; n must be passed by keyword in recent pandas.
df[['Earning1', 'Earning2']] = df[columns[127]].str.split('-', n=1, expand=True)

In [None]:
df['Earning1'].value_counts().sort_index()

In [None]:
df['Earning1'] = pd.to_numeric(df['Earning1'])
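
The cleaning steps above (strip symbols, split on the dash, convert the lower bound to a number) can also be sketched as one small function in plain Python. The helper `parse_lower_bound` is hypothetical, and the labels below are only examples in the style of the survey's salary buckets.

```python
def parse_lower_bound(label):
    """Return the lower bound of a salary range label as an integer."""
    cleaned = label
    # Remove the currency symbol, thousands separators, '>' and spaces.
    for char in "$,> ":
        cleaned = cleaned.replace(char, "")
    # Labels without a dash (open-ended top bucket) keep their single value.
    lower, _, _upper = cleaned.partition("-")
    return int(lower)

for label in ["$0-999", "1,000-1,999", "> $1,000,000"]:
    print(label, "->", parse_lower_bound(label))
```

Note that using the lower bound of each bucket turns earnings into an ordinal scale rather than exact amounts, which is one more reason to prefer rank-based tests later on.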

In [None]:
fig = go.Figure([
    go.Bar(name='Bachelor’s degree', x=df[df[columns[4]] == 'Bachelor’s degree']['Earning1'].value_counts().sort_index().index, y=df[df[columns[4]] == 'Bachelor’s degree']['Earning1'].value_counts().sort_index().values),
    go.Bar(name='Master’s degree', x=df[df[columns[4]] == 'Master’s degree']['Earning1'].value_counts().sort_index().index, y=df[df[columns[4]] == 'Master’s degree']['Earning1'].value_counts().sort_index().values)
])
fig.update_layout(barmode='group')
fig.show()

First, I looked at the distribution of earnings for bachelor's and master's degree holders. Because of the nature of the data, it is quite hard to see the difference visually, so we need some statistical tools.

In [None]:
df[df[columns[4]] == 'Bachelor’s degree']['Earning1'].mean()

In [None]:
df[df[columns[4]] == 'Master’s degree']['Earning1'].mean()

In [None]:
df[df[columns[4]] == 'Bachelor’s degree']['Earning1'].std()

In [None]:
df[df[columns[4]] == 'Master’s degree']['Earning1'].std()

In [None]:
df[df[columns[4]] == 'Bachelor’s degree']['Earning1'].median()

In [None]:
df[df[columns[4]] == 'Master’s degree']['Earning1'].median()
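
The six cells above could also be condensed into a single summary table with `groupby`. Below is a minimal sketch on toy data. The frame `toy`, its column name `degree`, and the numbers are made up for illustration and are not survey results.

```python
import pandas as pd

# Toy data standing in for the education and lower-bound earnings columns.
toy = pd.DataFrame({
    "degree": ["Bachelor", "Bachelor", "Bachelor", "Master", "Master", "Master"],
    "Earning1": [0, 1000, 5000, 1000, 10000, 40000],
})

# One row per degree, one column per statistic.
summary = toy.groupby("degree")["Earning1"].agg(["mean", "median", "std"])
print(summary)
```

In the notebook itself, the same call would use `columns[4]` as the grouping column, which also makes it easy to include the other education levels.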

From the descriptive statistics above, we can see a difference in central tendency between the two categories, while both have almost the same spread. To test whether this difference is significant, I used the Mann-Whitney U test on these two categories. I chose Mann-Whitney because earnings are not normally distributed, so a non-parametric test is appropriate. For this test I used a significance level of 0.05.

### Mann-Whitney U

In [None]:
from scipy.stats import mannwhitneyu
U1, p = mannwhitneyu(x=df[df[columns[4]] == 'Bachelor’s degree']['Earning1'], y=df[df[columns[4]] == 'Master’s degree']['Earning1'])
print(U1)

In [None]:
p

The resulting p-value is below 0.05, so we reject the null hypothesis. Thus the earnings distributions of bachelor's and master's degree holders differ.
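
To give some intuition for what the U statistic measures, here is a minimal sketch that computes it by hand from ranks, with tied values sharing their average rank. The helpers `average_ranks` and `mann_whitney_u` and the sample numbers are hypothetical. The sketch only computes the statistic, not the p-value, and I believe recent SciPy versions report the value for the first sample, as done here.

```python
def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to cover the whole run of values tied with position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def mann_whitney_u(x, y):
    """Return (U for x, U for y) from the rank sum of the pooled sample."""
    ranks = average_ranks(list(x) + list(y))
    rank_sum_x = sum(ranks[:len(x)])
    u_x = rank_sum_x - len(x) * (len(x) + 1) / 2
    u_y = len(x) * len(y) - u_x
    return u_x, u_y

# Made-up lower-bound earnings for two small groups.
u_x, u_y = mann_whitney_u([1000, 2000, 3000], [2000, 4000, 5000])
print(u_x, u_y)
```

The U value for `x` counts the pairs in which a value from `x` exceeds a value from `y`, with ties counted as one half, so a value far from half of all pairs points to one group tending to earn more.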

We can also use the Kruskal-Wallis test to determine whether earnings differ across all education categories.

### Kruskal-Wallis

In [None]:
from scipy import stats

stats.kruskal(
    df[df[columns[4]] == 'Bachelor’s degree']['Earning1'],
    df[df[columns[4]] == 'Master’s degree']['Earning1'],
    df[df[columns[4]] == 'Doctoral degree']['Earning1'],
    df[df[columns[4]] == 'No formal education past high school']['Earning1'],
    df[df[columns[4]] == 'Professional doctorate']['Earning1'],
    df[df[columns[4]] == 'Some college/university study without earning a bachelor’s degree']['Earning1'],
    )

From the result above, we conclude that there are differences in earnings between the education groups.
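
For reference, this is the textbook form of the Kruskal-Wallis statistic without tie correction. As far as I know, SciPy's `kruskal` additionally applies a correction for ties, so its reported value can differ slightly from this formula.

```latex
H = \frac{12}{N(N+1)} \sum_{i=1}^{k} \frac{R_i^2}{n_i} - 3(N+1)
```

Here $k$ is the number of education groups, $n_i$ is the size of group $i$, $N = \sum_i n_i$ is the total number of respondents, and $R_i$ is the sum of the pooled ranks in group $i$. Under the null hypothesis, $H$ approximately follows a chi-square distribution with $k - 1$ degrees of freedom. A significant result only says that at least one group differs, not which one.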

Thank you for reading my work! :) Please leave a comment, or if you like my work, please leave an upvote.