![Banner](https://i.imgur.com/KZtUCZM.jpg)

Image generated with [Pixiz](https://en.pixiz.com/result?workId=3f4085e63440f7f889154356a8d298a8&resultId=6b6601156abf7d02d1462dfbec75f019&resultFormat=jpg&resultStore=m&templateId=2224&time=1.03&text[]=Correlation+of+Happiness+and+Salary+Level&color0=rgba%28255%2C+255%2C+255%2C+0.8%29)

<center><u style="font-size:30px;">Prepared by Jan Alvin Dimla</u></center>

# Sections

1. [Executive Summary & Introduction](#executive-summary-introduction)
2. [Data Overview](#data-overview)
3. [Demographics](#demographics)
4. [The Average Salary](#average-salary)
5. [Happiness Score Correlation to Average Salary per Country](#happiness-salary-correlation)
6. [GDP Score Correlation to Average Salary per Country](#gdp-salary-correlation)
7. [Equal Opportunity to all Genders](#equal-opportunity-genders)
8. [Does the number of respondents in a country equate to a higher average salary?](#respondent-salary-correlation)
9. [Conclusion](#conclusion)
10. [Resources](#resources)

<a id="executive-summary-introduction"></a>
# 1. Executive Summary & Introduction

**What is Correlation?** According to Wikipedia, correlation is any statistical relationship, whether causal or not, between two random variables or bivariate data. In this analysis, we use it to measure how closely two sets of values are related, e.g. the happiness score of each country and the average salary level of its respondents.

According to a research paper published by Princeton University, **"high income buys life satisfaction but not happiness"**. In contrast, a study done at the Wharton Business School of the University of Pennsylvania found that **money does influence happiness and, contrary to previous influential research on the subject suggesting that this plateaus above $75,000**, the effect does not level off at that income.

We will also look at whether certain aspects of a country influence the average salary level, which may in turn affect the country's happiness score. We will also look at how diverse the survey respondents were, based on demographics and other metrics included in the survey, e.g. education, coding experience, country of origin and gender.

**Key Findings**

1. The correlation between a country's happiness score and the average salary of its respondents is only **0.58**, a moderate positive correlation that is not strong enough on its own to support a solid conclusion.
2. Although **79.3%** of the respondents were **Male**, respondents in the **Nonbinary** group had the highest average salary at **62002.59**.
3. The GDP score and the salary level of each country have a correlation of only **0.39**, which is fairly low.
4. Product Managers have the highest average salary among all roles at **72091.25**.
5. Switzerland has the highest average salary at **96974.99**, the second highest happiness score and the 3rd highest GDP.

**Methodologies**

1. We use **Python** for all metrics and computations in this kernel.
2. The currency used in the survey is the **US Dollar**.
3. **Pearson's r**, the Pearson correlation coefficient developed by **Karl Pearson**, is used to measure the correlation between pairs of number series in this study.
4. We converted column **Q25** from strings to numerical values, see **get_average_from_str()**. For a range such as **25,000-29,999** we take the midpoint of the two bounds (the ***median*** of the range) to get a single value. Missing answers in Q25 are counted as 0. We then calculate the average, or ***mean***, of the salary column for each group of interest, e.g. country, gender or role.
5. We also read 2 contrasting research papers from 2 different sources **(links in the resources)** to reduce bias in this study.

<a id="data-overview"></a>
# 2. Data Overview

- Import Modules
- Reusable Functions and Variables
- Datasets - Kaggle Survey 2021 (Cleaning & Preparation)
- Datasets - World Happiness Report 2021 (Cleaning & Preparation)

In [None]:
#
# Import Modules
#

import os
import numpy as np
import pandas as pd
import matplotlib
from matplotlib.pyplot import figure
import matplotlib.pyplot as plt

import warnings
warnings.filterwarnings('ignore')

In [None]:
#
# Reusable Functions and Variables
#

pie_colors = ['#003ACC', '#0045F5', '#1F5EFF', '#477BFF', '#B36919', '#D67D1F', '#E3923B', '#E8A55E', '#06B145', '#08D954', '#13F666', '#3AF880', '#FF1F1F', '#FF3737', '#FF7070', '#FF9999']
pie_colors_gender = ['#002992', '#FF5F02', '#FF7E33', '#FF985C']

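#
# count how many rows of df[column] equal each value in series
#
def count_str(series,df,column):
    
    return_this = []
    for s in series:
        # Exact match instead of str.count(s): str.count treats s as a regex,
        # so a value such as 'Hong Kong (S.A.R.)' would not be matched literally.
        return_this.append( int((df[column] == s).sum()) )
    return return_this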
    
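#
# convert a survey answer such as '25,000-29,999', '$0-999' or '5-10 years'
# into a single number (the midpoint for ranges)
#
def get_average_from_str(x):
    # strip units and symbols so only digits and the range dash remain
    x = str(x)
    for token in [' years','+','<','>','$',',']:
        x = x.replace(token, '')
            
    if('I have never written code' in x):
        x = x.replace('I have never written code', '0')
        
    if('-' in x):
        # a range: return the midpoint of the lower and upper bound
        parts = x.split('-')
        return ( int(parts[0]) + int(parts[1]) ) / 2
        
    return float(x)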

#
# mean of 'average_salary' for each value in df_series, matched against df[column]
# (note: this reads the global df loaded in the Kaggle survey cell)
#
def average_salary(df_series,column):
    salary_ave = []
    
    for gs in df_series:
        a = round( df.loc[df[column] == gs]['average_salary'].mean(), 2 )
        salary_ave.append(a)
    return salary_ave

#
# draw a pie chart of the counts, labelled with the series values
#
def plot_pie_chart(count,series,_title,colors):
    fig = plt.figure(figsize=(10,7), dpi=150)
    fig.suptitle(_title,fontsize=18)
    ax = fig.add_axes([0,0,1,1])
    border_wedge = {"edgecolor":"white",'linewidth': 2,'antialiased': True}
    ax.pie(count, labels=series, autopct='%1.1f%%', colors=colors, wedgeprops=border_wedge)
    fig.show()

#
# draw a horizontal bar chart, optionally annotating each bar with its value
#
def plot_barh_chart(count,series,_title,_width,_height,titlesize,_dpi,_annotation=True):

    fig = plt.figure(figsize=(_width,_height), dpi=_dpi)
    fig.suptitle(_title,fontsize=titlesize)
    ax = fig.add_axes([0,0,1,1])
    ax.barh(series,count,color=['#003f5c'])
    
    # Remove axes spines
    for s in ['top','bottom','left','right']:
        ax.spines[s].set_visible(False)
        
    # Remove x,y Ticks
    ax.xaxis.set_ticks_position('none')
    ax.yaxis.set_ticks_position('none')
    
    # Add padding between axes and labels
    ax.xaxis.set_tick_params(pad=5)
    ax.yaxis.set_tick_params(pad=10)
    
    # Add x,y gridlines
    ax.grid(True, color='grey', linestyle='-.', linewidth=0.5, alpha=0.2)
    
    # Show top values 
    ax.invert_yaxis()
    
    # Add annotation to bars
    if _annotation:
        for i in ax.patches:
            ax.text(i.get_width()+500, i.get_y()+0.5, str(round((i.get_width()), 2)), fontsize=10, fontweight='bold', color='grey')

    fig.show()

#
# sort the series and its values together, highest value first
#
def change_order(_series,_values):
    
    new_df = pd.DataFrame()
    new_df['series'] = _series
    new_df['values'] = _values
    new_df = new_df.sort_values(by = 'values', ascending = False)
    return {'series': new_df['series'].to_list(), 'values': new_df['values'].to_list()}

#
# look up one column of the happiness report for a country (empty list if not found)
#
def get_happinessdf_data(country,happiness_df,column_name):
    
    return happiness_df.loc[happiness_df['Country name'] == country][column_name].to_list()

#
# Pearson correlation matrix between the average salary and another per-country series
#
def salary_correlation(column_name,c_series,s_series,x_series):
    
    s_df = pd.DataFrame()
    s_df['country'] = c_series
    s_df['ave. salary'] = s_series
    s_df[column_name] = x_series
    s_df = s_df.set_index('country')
    return s_df.corr().round(2)

In [None]:
#
# Datasets - Kaggle Survey 2021 (Cleaning & Preparation)
#

df = pd.read_csv("../input/kaggle-survey-2021/kaggle_survey_2021_responses.csv")
columns = df.columns
row_count = len(df)
column_description = []
for c in df.iloc[0]:
    column_description.append(c)
df = df.drop(index=0)
df.loc[df.Q2 == "Man", "Q2"] = 'Male'
df.loc[df.Q2 == "Woman", "Q2"] = 'Female'
df.loc[df.Q2 == "Prefer not to say", "Q2"] = 'Others'
df.loc[df.Q2 == "Prefer to self-describe", "Q2"] = 'Others'

df.loc[df.Q3 == "United Kingdom of Great Britain and Northern Ireland", "Q3"] = 'United Kingdom'
df.loc[df.Q3 == 'Iran, Islamic Republic of...', "Q3"] = 'Iran'
df.loc[df.Q3 == 'United States of America', "Q3"] = 'United States'
df.loc[df.Q3 == 'I do not wish to disclose my location', "Q3"] = 'Undisclosed Location'
df.loc[df.Q3 == 'Viet Nam', "Q3"] = 'Vietnam'

df.loc[df.Q4 == "I prefer not to answer", "Q4"] = 'No answer'
df.loc[df.Q4 == "Some college/university study without earning a bachelor’s degree", "Q4"] = 'Has education, no degree'
df.loc[df.Q4 == "No formal education past high school", "Q4"] = 'No formal education'

df.loc[df.Q5 == 'Program/Project Manager', "Q5"] = 'Program/Proj Mngr'
df.loc[df.Q5 == 'Software Engineer', "Q5"] = 'Software Engr'
df.loc[df.Q5 == 'Currently not employed', "Q5"] = 'Unemployed'
df.loc[df.Q5 == 'Machine Learning Engineer', "Q5"] = 'ML Engr'
df.loc[df.Q5 == 'Product Manager', "Q5"] = 'Product Mngr'
df.loc[df.Q5 == 'DBA/Database Engineer', "Q5"] = 'DBA/Database Engr'

df.loc[df.Q6 == 'I have never written code', "Q6"] = 'No Experience'

# missing salary answers are counted as 0, which can lower the averages computed later
df.Q25 = df.Q25.fillna(0)
df.isnull()

In [None]:
#
# Datasets - World Happiness Report 2021 (Cleaning & Preparation)
#

happiness_df = pd.read_csv("../input/world-happiness-report-2021/world-happiness-report-2021.csv", low_memory=False)
happiness_row_count = len(happiness_df)

happiness_df.loc[happiness_df['Country name'] == "Taiwan Province of China", "Country name"] = 'Taiwan'
happiness_df.loc[happiness_df['Country name'] == "Hong Kong S.A.R. of China", "Country name"] = 'Hong Kong (S.A.R.)'

happiness_df.isnull()

<a id="demographics"></a>
# 3. Demographics

The charts below show that a wide range of people participated in the survey, which suggests that Data Science is gaining popularity. As for the Company Roles, the largest group of respondents are still students.
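The counts behind charts like these can also be produced with pandas' built-in `value_counts()`. The snippet below is only a minimal sketch on a toy column with made-up values, not the survey data, and it is not the `count_str()` helper used in this kernel.

```python
import pandas as pd

# Toy stand-in for one survey column (hypothetical values, not the real data).
toy = pd.DataFrame({"Q2": ["Male", "Female", "Male", "Nonbinary", "Male", None]})

# value_counts() tallies exact matches per category and skips missing values.
counts = toy["Q2"].value_counts()

# Shares in percent, rounded to one decimal like the pie chart labels.
shares = (counts / counts.sum() * 100).round(1)

print(counts.to_dict())
print(shares.to_dict())
```

Because `value_counts()` compares whole values, labels that contain characters such as `(` or `+` are counted literally.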

In [None]:
#
# Age
#
age_series = pd.Series(df.Q1).drop_duplicates().to_list()
age_count = count_str(age_series, df, 'Q1')

#
# Gender
#
gender_series = pd.Series(df.Q2).drop_duplicates().to_list()
gender_count = count_str(gender_series, df, 'Q2')

#
# Country
#
country_series = pd.Series(df.Q3).drop_duplicates().to_list()
country_count = count_str(country_series, df, 'Q3')

#
# Education Level
#
education_series = pd.Series(df.Q4).drop_duplicates().to_list()
education_count = count_str(education_series, df, 'Q4')

#
# charts
#
plot_pie_chart(age_count,age_series,'fig.1.a Age Bracket',pie_colors)
plot_pie_chart(gender_count,gender_series,'fig.1.b Gender',pie_colors_gender)
new_country_order = change_order(country_series,country_count)
plot_barh_chart(new_country_order['values'],new_country_order['series'],'fig.1.c Countries',10,10,18,100)
plot_pie_chart(education_count,education_series,'fig.1.d Education Level',pie_colors)

In [None]:
#
# Roles
#
role_series = pd.Series(df.Q5).drop_duplicates().to_list()
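# move the first entry to the end of the list
# (this assumes the first unique role in the data is 'Other')
role_series.pop(0)
role_series.append('Other')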
role_count = count_str(role_series, df, 'Q5')

#
# chart
#
plot_pie_chart(role_count,role_series,'fig.2 Company Roles',pie_colors)

<a id="average-salary"></a>
# 4. The Average Salary

<div style="width:100%;text-align: center;"><img align=middle src="https://i.imgur.com/78Lj6hU.jpeg" alt="Salarymen Eating at an Izakaya" style="margin: 0 auto;"></div>

We will create a new column in the Kaggle 2021 survey dataframe named **"average_salary"**, based on the cleaned-up data from column **Q25**, **"What is your current yearly compensation (approximate $USD)?"**.

Clean up process:

1. Remove string elements, e.g. ' years', '+', '<', '>', '$', ','
2. For a range such as **25,000-29,999**, take the midpoint of the two bounds (the median of the range) to get a single numerical value.
3. For later use, calculate the ***mean*** of the salary column for the group we are interested in, e.g. per country, gender or role.
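The steps above can be sketched on a few toy rows. This is an illustration under assumptions, not the kernel's own code: `range_midpoint` is a hypothetical helper, and the country labels and salaries are made up.

```python
import pandas as pd

def range_midpoint(label):
    """Turn a label such as '25,000-29,999' or '>$1,000,000' into one number."""
    cleaned = str(label)
    # Step 1: remove the symbols that surround the digits.
    for token in ("$", ",", ">", "<", "+"):
        cleaned = cleaned.replace(token, "")
    # Step 2: a range becomes the midpoint of its two bounds.
    if "-" in cleaned:
        low, high = cleaned.split("-")
        return (float(low) + float(high)) / 2
    return float(cleaned)

# Toy rows (hypothetical countries "A" and "B").
toy = pd.DataFrame({
    "Q3": ["A", "A", "B"],
    "Q25": ["25,000-29,999", "$0-999", ">$1,000,000"],
})
toy["average_salary"] = toy["Q25"].apply(range_midpoint)

# Step 3: mean salary per group.
per_country = toy.groupby("Q3")["average_salary"].mean().round(2)
print(per_country)
```

Here `groupby(...).mean()` does in one call what a loop over each group value does by hand.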

In [None]:
df['average_salary'] = df['Q25'].apply(get_average_from_str)
df['average_salary']

Using that column, we can visualize these:

**Average Salary per Role**

In [None]:
#
# Get average salary per role
#
role_salary_ave = average_salary(role_series,'Q5')

#
# Change order of items
#
new_role_order = change_order(role_series,role_salary_ave)

#
# chart
#
plot_barh_chart(new_role_order['values'],new_role_order['series'],'fig.3 Ave. Salary per Role',10,10,18,100)

**Average Salary per Country**

In [None]:
#
# Get average salary per country
#
salary_series = average_salary(country_series,'Q3')

#
# Change order of items
#
new_country_order = change_order(country_series,salary_series)

#
# chart
#
plot_barh_chart(new_country_order['values'],new_country_order['series'],'fig.4 Ave. Salary per Country',10,10,14,70)

**Average Salary based on Education Level**

In [None]:
#
# Get average salary based on education level
#
education_salary_series = average_salary(education_series,'Q4')

#
# Change order of items
#
new_education_order = change_order(education_series,education_salary_series)

#
# chart
#
plot_barh_chart(new_education_order['values'],new_education_order['series'],'fig.5 Ave. Salary based on Education Level',10,10,16,70)

**Average Salary based on Coding Experience**

In [None]:
#
# Code Experience
#
coder_exp_series = pd.Series(df.Q6).drop_duplicates().to_list()

#
# Get average salary based on experience
#
coder_salary_series = average_salary(coder_exp_series,'Q6')

#
# Change order of items
#
new_coder_order = change_order(coder_exp_series,coder_salary_series)

#
# chart
#
plot_barh_chart(new_coder_order['values'],new_coder_order['series'],'fig.6 Ave. Salary based on Coding Experience',8,8,10,60)

<a id="happiness-salary-correlation"></a>
# 5. Happiness Score Correlation to Average Salary per Country

<div style="width:100%;text-align: center;"><img align=middle src="https://i.imgur.com/dY543yo.jpeg" alt="Happy man" style="margin: 0 auto;"></div>

To get the correlation between the average salary and happiness per country, we use the formula developed by Karl Pearson, known as Pearson's r or the Pearson correlation coefficient. The value it produces ranges from -1.0 (-100%), a perfect inverse relationship, up to 1.0 (100%), a perfect direct relationship, with 0.0 (0%) in the middle meaning no linear relationship. It is computed by one of our reusable functions, **salary_correlation()**.
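To make the coefficient concrete, here is a minimal sketch that computes Pearson's r from its definition, the sum of the products of the deviations divided by the square root of the product of the summed squared deviations. The function name `pearson_r` and the toy numbers are assumptions for illustration. The kernel itself relies on pandas' `DataFrame.corr()`, which uses the Pearson method by default.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long number series."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()  # deviations from the mean of x
    dy = y - y.mean()  # deviations from the mean of y
    return float((dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum()))

# Toy values (made up, not taken from the datasets).
toy_salary = [10.0, 20.0, 30.0, 40.0]
toy_score = [1.0, 3.0, 2.0, 5.0]

r = pearson_r(toy_salary, toy_score)
print(round(r, 2))  # 0.83
```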

In [None]:
#
# country happiness score
#
happiness_series = []

for cs in country_series:
    values = get_happinessdf_data(cs,happiness_df,'Ladder score')
    # countries with no match in the happiness report get 0.0,
    # and that placeholder is included in the correlation below
    x = float(values[0]) if values else 0.0
    happiness_series.append(x)

#
# correlation
#
salary_correlation('happiness',country_series,salary_series,happiness_series)

**Result**

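It turns out there's only a 0.58 or **58%** correlation between the average salary and the happiness score. Although the number may seem good at first glance, it is only a moderate correlation, and correlation by itself does not show that salary drives happiness, so it is not enough for us to derive a conclusion. Keep in mind that any country without a match in the World Happiness Report enters the calculation with a happiness score of 0.0.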
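Whether a correlation is statistically significant depends on the number of data points as well as on the size of r, and this kernel does not run such a test. One possible check is a permutation test: shuffle one series many times and count how often the shuffled data produces a correlation at least as strong as the observed one. The sketch below is an assumption-laden illustration, with a hypothetical function name and made-up toy numbers rather than the survey data.

```python
import numpy as np

def permutation_p_value(x, y, n_permutations=10_000, seed=0):
    """Two-sided permutation test for the Pearson correlation of x and y."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    observed = np.corrcoef(x, y)[0, 1]
    hits = 0
    for _ in range(n_permutations):
        # Shuffling y breaks any real link between the two series.
        shuffled = rng.permutation(y)
        if abs(np.corrcoef(x, shuffled)[0, 1]) >= abs(observed):
            hits += 1
    # Add-one correction so the estimated p-value is never exactly zero.
    return float(observed), (hits + 1) / (n_permutations + 1)

# Toy values (made up for illustration only).
toy_salary = [12, 18, 25, 31, 40, 44, 52, 60]
toy_score = [4.1, 5.0, 4.8, 5.9, 6.2, 5.7, 7.0, 6.8]

r, p = permutation_p_value(toy_salary, toy_score, n_permutations=2000)
print(f"r = {r:.2f}, p = {p:.4f}")
```

A small p-value means a correlation this strong would rarely appear by chance alone. It still says nothing about which variable, if any, causes the other.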

Here's the visualization of the Happiness Score per Country

In [None]:
#
# Change order of items
#
happiness_order = change_order(country_series,happiness_series)

#
# chart
#
plot_barh_chart(happiness_order['values'],happiness_order['series'],'fig.7 Happiness Score per Country',10,10,14,70,False)

<a id="gdp-salary-correlation"></a>
# 6. GDP Score Correlation to Average Salary per Country

<div style="width:100%;text-align: center;"><img align=middle src="https://i.imgur.com/tEsRbNi.jpeg" alt="Money on a table" style="margin: 0 auto;"></div>

As before, we compute the correlation, this time between the GDP score and the average salary per country.

In [None]:
#
# GDP per capita for each country
#
gdp_series = []

for cs in country_series:
    values = get_happinessdf_data(cs,happiness_df,'Logged GDP per capita')
    # countries with no match in the happiness report get 0.0,
    # and that placeholder is included in the correlation below
    x = float(values[0]) if values else 0.0
    gdp_series.append(x)

#
# correlation
#
salary_correlation('gdp per capita',country_series,salary_series,gdp_series)

**Result**

The resulting correlation is even lower than the previous one. With a score of 0.39 or **39%**, the GDP score (the logged GDP per capita from the World Happiness Report) has only a weak correlation with the average salary of the respondents.
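The loops in this kernel give a value of 0.0 to any country that is missing from the World Happiness Report, and those placeholders take part in the correlation. An alternative, sketched here on toy data, is to merge the two tables and keep only countries present in both. The frame names, country labels and numbers below are hypothetical. Only the column names `Country name`, `Ladder score` and `Logged GDP per capita` come from the report.

```python
import pandas as pd

# Toy per-country salaries (made-up labels and values).
salary = pd.DataFrame({
    "country": ["A", "B", "C", "Other"],
    "ave_salary": [50000.0, 20000.0, 35000.0, 30000.0],
})

# Toy slice shaped like the happiness report.
report = pd.DataFrame({
    "Country name": ["A", "B", "C", "D"],
    "Ladder score": [7.5, 5.0, 6.0, 4.0],
    "Logged GDP per capita": [11.0, 9.0, 10.0, 8.5],
})

# An inner merge keeps only countries found in both tables.
merged = salary.merge(report, left_on="country", right_on="Country name", how="inner")

# Countries that would otherwise have received a 0.0 placeholder.
unmatched = sorted(set(salary["country"]) - set(report["Country name"]))

corr = merged[["ave_salary", "Ladder score", "Logged GDP per capita"]].corr().round(2)
print(unmatched)
print(corr)
```

Listing the unmatched names also makes it easy to spot spelling differences between the two datasets before computing anything.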

Here's the visualization of the GDP Score per Country

In [None]:
#
# Change order of items
#
gdp_order = change_order(country_series,gdp_series)

#
# chart
#
plot_barh_chart(gdp_order['values'],gdp_order['series'],'fig.8 GDP per Capita for each Country',10,10,14,70,False)

<a id="equal-opportunity-genders"></a>
# 7. Equal Opportunity to all Genders

<div style="width:100%;text-align: center;"><img align=middle src="https://i.imgur.com/x8YUimE.jpeg" alt="Equality on typewriter" style="margin: 0 auto;"></div>

Earlier in the **Demographics** section, we found the share of respondents for each gender.

- Male - 20598 **(79.3%)**
- Female - 4890 **(18.8%)**
- Nonbinary - 88 **(0.3%)**
- Others - 397 **(1.5%)**

Now we will analyze whether the number of respondents per gender is related to the average salary of each gender, that is, whether a smaller group tends to report a lower or higher salary. We will use the same formula, **Pearson's r**, to calculate the correlation.

In [None]:
#
# Get average salary per gender
#
gender_salary_series = average_salary(gender_series,'Q2')

#
# correlation
#
c_s_df = pd.DataFrame()
c_s_df['genders'] = gender_series
c_s_df['gender_ave_salary'] = gender_salary_series
c_s_df['gender_count'] = gender_count
c_s_df = c_s_df.set_index('genders')
c_s_df.corr().round(2)

**Result**

The correlation we get here differs from the first two. This time we get -0.45 or **-45%**, which suggests an inverse correlation between the number of respondents per gender and the average salary of each gender. With only four gender groups, however, this value rests on four data points and should be read with caution. We used the same formula as for the first two correlations.

Here's the visualization of the average salary per gender

In [None]:
#
# Change order of items
#
new_gender_order = change_order(gender_series,gender_salary_series)

#
# chart
#
plot_barh_chart(new_gender_order['values'],new_gender_order['series'],'fig.9 Ave. Salary per Gender',8,8,10,60)

<a id="respondent-salary-correlation"></a>
# 8. Does the number of respondents in a country equate to a higher average salary?

<div style="width:100%;text-align: center;"><img align=middle src="https://i.imgur.com/bWf2vmn.jpeg" alt="A group of people talking" style="margin: 0 auto;"></div>

As a bonus, let's check whether having a higher number of respondents in the survey is equivalent to having a higher average salary.

In [None]:
salary_correlation('No. of Respondents per Country',country_series,salary_series,country_count)

**Result**

The result shows almost no correlation between the number of respondents in a country and their average salary. The value is -0.1 or **-10%**.

<a id="conclusion"></a>
# 9. Conclusion

- At a correlation of **0.58**, we cannot conclusively say that salary affects a person's happiness. **Happiness differs from person to person**, and outside influences may keep someone from being happy despite a high salary.
- The participants' countries of residence are very diverse, but the largest group of respondents came from **India**.
- Although **Males** are still the majority, it's the **Nonbinary** group that has the highest average salary.
- **Switzerland** has the highest average salary, the second highest happiness score and the 3rd highest GDP.
- GDP per capita has a **low correlation** with the average salary, so moving to a country with a high GDP per capita is no guarantee of a higher salary.
- Having a higher level of education helps in the Data Science field.

And before you finish reading this kernel, please answer this question:

<center style="font-size:30px;">Are you happy with your experience in Data Science?</center>

<a id="resources"></a>
# 10. Resources
- [2021 Kaggle Machine Learning & Data Science Survey](https://www.kaggle.com/c/kaggle-survey-2021/data)
- [World Happiness Report 2021](https://www.kaggle.com/ajaypalsinghlo/world-happiness-report-2021)
- (Princeton University) [High income improves evaluation of life but not emotional well-being](https://www.princeton.edu/~deaton/downloads/deaton_kahneman_high_income_improves_evaluation_August2010.pdf) (first bolded paragraph)
- (Wharton Business School of the University of Pennsylvania) [Money matters to happiness—perhaps more than previously thought](https://penntoday.upenn.edu/news/money-matters-to-happiness-perhaps-more-than-previously-thought) (first few paragraphs)
- [Wikipedia - Pearson correlation coefficient](https://en.wikipedia.org/wiki/Pearson_correlation_coefficient)
- [Wikipedia - What is Correlation?](https://en.wikipedia.org/wiki/Correlation)
- [Unsplash](https://unsplash.com/) (Free Stock Images)