# Religion and happiness

### Are religious countries happier?

We attempt to answer this question by looking for a correlation between how religious a country is and its happiness score (World Happiness Report).

We find that in general there is not enough evidence to assume a positive or negative correlation. But when looking at religions individually it seems that countries with high Buddhist populations score poorly in happiness rating even when adjusting for quality of life variables like GDP.

## Contents

1. [The Data](#1.-The-Data)

2. [Data preparation](#2.-Data-preparation)

3. [Analysis](#3.-Analysis)

4. [Summary](#4.-Summary)

## 1. The Data
<a id='The Data'></a>

We use two separate datasets to make our inferences:

* Countries by religion and population

https://www.kaggle.com/vibhorsen/countries-by-population-happiness-index-religion

* World happiness report 2020

https://www.kaggle.com/mathurinache/world-happiness-report


### Countries by religion and population

This dataset contains 151 countries detailing their total population size and the size of individual religious populations within the country. Here's how it looks once ordered by happiness world ranking.

In [None]:
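import pandas as pd
import numpy as np

# Load the religion dataset, drop the unused index and score columns, and order by happiness rank.
df = pd.read_csv('/kaggle/input/countries-by-population-happiness-index-religion/GfG.csv').drop(columns=['Unnamed: 0', 'score']).sort_values(by='happinessRank')

# The rank was only needed for the ordering above.
del df['happinessRank']

# Fix a typo in the source column name.
df.rename(columns={'chistians':'christians'}, inplace=True)

df.head()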

### World happiness report 2020

This is a dataset of 153 countries with lots of columns detailing the contributors to the primary feature 'Ladder score'. Ladder score is the happiness rating given to a particular country. Full details of the report's methods and findings can be found here:

https://happiness-report.s3.amazonaws.com/2020/WHR20.pdf

I will give a brief summary of the less obvious columns to be used as a key if the reader needs to quickly check what is meant by a column at any time.

* **Ladder score** - Happiness rating.
* **Standard error** - The predicted standard deviation for the Ladder score.
* **upper/lowerwhisker** - Two standard errors above/below ladder score.
* **Logged GDP per capita** - Log of the GDP per capita in terms of Purchasing Power Parity (PPP) adjusted to constant 2011 international dollar.
* **Social support** - National average response to “If you were in trouble, do you have relatives or friends you can count on to help you whenever you need them, or not?”.
* **Healthy life expectancy** - Life expectancy based on the Global Health Observatory data repository.
* **Freedom to make life choices** - National average response to “Are you satisfied or dissatisfied with your freedom to choose what you do with your life?”.
* **Generosity** - The difference between the actual average responses and predicted average responses to "Have you donated money to a charity in the past month?”.
* **Perceptions of corruption** - National average response to “Is corruption widespread throughout the government or not?”.
* **Ladder score in Dystopia** - Dystopia is a dummy country given the worst possible score.
* **Explained by: (column)** - The amount that (column) has contributed to the Ladder score.

Here's how the dataset looks.

In [None]:
happiness_report = pd.read_csv('/kaggle/input/world-happiness-report/2020.csv')

happiness_report.head()

## 2. Data preparation
<a id='Data preparation'></a>

We prepare and tidy up both datasets before merging them together beginning with the 'Countries by religion and population' dataset.

### Countries by religion and population


When deciding how a religious population might impact a country, it's more helpful to know the percentage of the total population rather than the exact numbers, so each religion column is converted to a percentage of 'pop2021'. A new column called 'religious pop (%)' is also created, which is simply 100 - 'unaffiliated (%)', to be used in graphs later on.

In [None]:
# Columns 1 to 8 hold the head counts for the individual religious groups.
columns = df.iloc[:, 1:9].columns

# Convert each head count to a percentage of the total population.
for religion in columns:
    df[religion] = (df[religion]/df['pop2021'])*100
    df.rename(columns={religion:religion + ' (%)'}, inplace=True)

df['religious pop (%)'] = 100 - df['unaffiliated (%)']

df.head()
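
The loop above can also be written without iterating over the columns. Here is a minimal sketch on made-up data; the `toy` frame, its values and the two-column subset are assumptions for illustration and are not taken from the real dataset.

```python
import pandas as pd

# Toy data standing in for the real dataset (values are made up).
toy = pd.DataFrame({
    "country": ["A", "B"],
    "christians": [50, 10],
    "unaffiliated": [50, 30],
    "pop2021": [200, 40],
})

count_cols = ["christians", "unaffiliated"]

# Divide every count column by the population row by row, then scale to percent.
pct = toy[count_cols].div(toy["pop2021"], axis=0) * 100
pct.columns = [c + " (%)" for c in count_cols]

toy_pct = pd.concat([toy[["country"]], pct], axis=1)
toy_pct["religious pop (%)"] = 100 - toy_pct["unaffiliated (%)"]

print(toy_pct)
```

`DataFrame.div` with `axis=0` aligns the population column with the rows, so all count columns are converted in one step.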

That's all the data prep required for the first dataset, so let's move on to the happiness report.

### World happiness report 2020

We're not interested in the 'Explained by' and 'Dystopia' columns, or in the 'error' and 'whisker' columns, since they describe the contributions and uncertainty of the other columns (see the Data section), which we do not use.

We also rename a few countries so that our two datasets will merge smoothly without incurring any null values. Here's how it looks.

In [None]:
happiness_report = happiness_report.iloc[:, 0:12]

happiness_report.drop(columns=['upperwhisker','lowerwhisker', 'Standard error of ladder score'], inplace=True)

happiness_report.rename(columns={'Country name': 'country'}, inplace=True)

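# Assign the result back rather than replacing in place on a selected column, which newer pandas versions warn about.
happiness_report['country'] = happiness_report['country'].replace({'Taiwan Province of China': 'Taiwan', 'Hong Kong S.A.R. of China':'Hong Kong', 'Congo (Brazzaville)':'Republic of the Congo', 'Palestinian Territories':'Palestine', 'Congo (Kinshasa)':'DR Congo'})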

happiness_report.head()

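It will be useful to categorise our countries by GDP per capita. The 3 categories are **wealthy**, **moderate** and **poor**. The cut-offs sit at 33% and 66% of the range between the lowest and highest logged GDP per capita: a country is classed as **wealthy** if its value is above the 66% mark and as **poor** if it is at or below the 33% mark. These are points along the range rather than percentiles, so the three classes need not contain equal numbers of countries.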

In [None]:
GDP_max = happiness_report['Logged GDP per capita'].max()
GDP_min = happiness_report['Logged GDP per capita'].min()

GDP_33 = ((GDP_max - GDP_min) * 0.33) + GDP_min
GDP_66 = ((GDP_max - GDP_min) * 0.66) + GDP_min

def GDP_Category(GDP):
    if GDP > GDP_66:
        return 'Wealthy'
    elif GDP > GDP_33:
        return 'Moderate'
    else:
        return 'Poor'

happiness_report['GDP category'] = happiness_report['Logged GDP per capita'].apply(GDP_Category)

happiness_report.head()
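
For comparison, here is a sketch of how a percentile-based split would differ from the range-based cut points used in the cell above. The `gdp` values are made up and deliberately skewed so that the two approaches disagree; the variable names are assumptions for illustration.

```python
import pandas as pd

# Made-up logged GDP values, skewed towards the low end.
gdp = pd.Series([6.5, 7.0, 7.2, 7.4, 7.5, 7.6, 9.0, 10.5, 11.4])

# Range-based cut points, as in the cell above.
lo, hi = gdp.min(), gdp.max()
range_33 = lo + (hi - lo) * 0.33
range_66 = lo + (hi - lo) * 0.66

# Percentile-based cut points: roughly a third of the values fall in each class.
pct_33, pct_66 = gdp.quantile([0.33, 0.66])

labels = ["Poor", "Moderate", "Wealthy"]
inf = float("inf")
by_range = pd.cut(gdp, [-inf, range_33, range_66, inf], labels=labels)
by_percentile = pd.cut(gdp, [-inf, pct_33, pct_66, inf], labels=labels)

print(by_range.value_counts().sort_index())
print(by_percentile.value_counts().sort_index())
```

On skewed data the range-based split puts most values in one class, while the percentile-based split keeps the classes about the same size. Which one is preferable depends on whether the classes should reflect absolute wealth or relative rank.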

### Merging of the two dataframes

As stated in the Data section, the happiness report has 2 more countries than the population and religion dataset. We can deal with this by joining the two dataframes with an inner join so that those two countries are ignored rather than having to deal with missing data.

In [None]:
# Inner join: keep only countries that appear in both datasets.
df = pd.merge(df, happiness_report, how='inner', on='country')

df.head()
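
Before settling on a join type, it can help to see exactly which countries fail to match. A minimal sketch using the `indicator` option of `pd.merge` follows; the two small frames, their country labels and their values are made up for illustration.

```python
import pandas as pd

# Toy frames standing in for the two datasets (labels and values are made up).
religion = pd.DataFrame({"country": ["A", "B", "C"],
                         "religious pop (%)": [80.0, 99.0, 50.0]})
report = pd.DataFrame({"country": ["A", "B", "D"],
                       "Ladder score": [6.0, 5.0, 4.0]})

# An outer join with indicator=True labels each row by where it was found.
check = pd.merge(religion, report, how="outer", on="country", indicator=True)

# Rows found in only one of the two frames point to naming mismatches or missing countries.
unmatched = check.loc[check["_merge"] != "both", ["country", "_merge"]]
print(unmatched)
```

Any name that shows up as `left_only` or `right_only` is a candidate for the renaming step earlier in this section.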

## 3. Analysis

### Does religion correlate with happiness?

Here's a simple scatter plot comparing happiness with the percentage of the population being religious.

Note that the x-axis is a measure of a country's **religious** population, i.e. further to the right is more religious.

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

sns.set(rc={'figure.figsize':(8,8)})

h = sns.scatterplot(data=df, x="religious pop (%)", y="Ladder score")
plt.title('Comparison of happiness with religion')
# Set y-axis label
plt.ylabel('Happiness score')
# Set x-axis label
plt.xlabel('% of population who are religious')
plt.xlim(0, 100)
plt.ylim(0, 10)
plt.text(x=24, y=7.2, s='Czech Republic')
plt.text(x=75, y=3, s='Botswana');

Given that all of the unhappy countries (happiness score < 5) have a majority religious population, one might think that being non-religious implies happiness. But we need to take into account that the poorer a country, the more likely it is to have a religious population. So let's have another look at the same data, but taking into account a country's wealth.

In [None]:
h = sns.jointplot(data=df, x="religious pop (%)", y="Ladder score", height=8, hue='GDP category')
h.set_axis_labels('% of country that is religious', 'Happiness score', fontsize=16)
h.ax_marg_x.set_xlim(0, 100)
h.ax_marg_y.set_ylim(0, 10);

Now we see that although non-religious countries are indeed generally happy, they are all quite well off. When only looking at the wealthy countries, religion doesn't seem to make much difference. Let's take a look at the population of individual religions to see if happiness is affected there. 


Note: Judaism, Folk and Other religions are not graphed given the tiny populations in most countries.

In [None]:
g = sns.PairGrid(df, y_vars=["Ladder score"], x_vars=["christians (%)", "muslims (%)", 'hindus (%)', 'buddhists (%)'], height=5, hue='GDP category')
g.map(sns.regplot)
g.set(ylim=(-1, 11), yticks=[0, 2, 4, 6, 8, 10])
g.axes[0,0].set_xlim(0,100)
g.axes[0,1].set_xlim(0,100)
g.axes[0,2].set_xlim(0,100)
g.axes[0,3].set_xlim(0,100)
titles = ['Christian', 'Muslim', 'Hindu', 'Buddhist']
for ax, title in zip(g.axes.flat, titles):
    ax.set_title('Happiness by ' + title + ' population')
    if title == 'Hindu':
        ax.text(70, 3.8,'India', fontsize=14)
    if title == 'Buddhist':
        ax.text(80, 6.3,'Thailand', fontsize=14)

The coloured lines are the corresponding regression lines for each wealth bracket. The shaded areas are the confidence intervals for the regression lines (95% by default in seaborn). Notice how the interval is widest where there are fewer countries, as one might expect. Judging from the regression lines, there's no clear sign either way that any religion is associated with happiness or sadness.

### How can we improve these graphs?

It would be nice if the happiness rating was not dependent on GDP so that we could look at all the points together. We can do this by predicting the happiness score of each country based on its GDP (and other factors), then comparing a country's actual happiness with its 'GDP predicted' happiness. This would even out the playing field and allow us to combine all countries when drawing the regression lines.

### Adjusting for GDP

To predict the happiness of a country with a linear regression model we must check that there is in fact a roughly linear relationship. To check this we'll plot happiness alongside GDP.

In [None]:
sns.scatterplot(data=df, x='Logged GDP per capita', y='Ladder score')

plt.title('Happiness by GDP per capita', fontsize=16);

Although not perfectly correlated, linear regression will do a good enough job of predicting ladder score.

The predictions are stored as 'predicted happiness' in the dataframe. We then take the difference between the actual happiness and the predicted happiness, scale it by the largest absolute difference so that values lie between -1 and 1, and place the results into the 'adjusted happiness' column to get a good idea of how happy a country is with respect to its wealth.

In [None]:
from sklearn.linear_model import LinearRegression

X = df[['Logged GDP per capita', 'Social support', 'Healthy life expectancy']]

y = df['Ladder score']

reg = LinearRegression().fit(X, y)

df['predicted happiness'] = reg.predict(X)

df['adjusted happiness'] = df['Ladder score'] - df['predicted happiness']

df['adjusted happiness'] = df['adjusted happiness']/max(abs(df['adjusted happiness'].min()), df['adjusted happiness'].max())

df[['country', 'Ladder score', 'adjusted happiness']].head()
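
To see why the residual of a regression is a fair 'adjusted' score, here is a small sketch on synthetic data. It uses a single predictor and NumPy's least-squares solver instead of scikit-learn; the data, the coefficients and the variable names are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: "happiness" depends on a wealth-like predictor plus noise.
wealth = rng.normal(size=200)
happiness = 5.0 + 0.8 * wealth + rng.normal(scale=0.5, size=200)

# Ordinary least squares with an intercept column.
X = np.column_stack([np.ones_like(wealth), wealth])
coef, *_ = np.linalg.lstsq(X, happiness, rcond=None)

# Residual: actual value minus the value predicted from wealth.
residual = happiness - X @ coef

# Scale by the largest absolute residual so values lie in [-1, 1].
adjusted = residual / np.abs(residual).max()

print(np.corrcoef(wealth, happiness)[0, 1])  # clearly positive
print(np.corrcoef(wealth, adjusted)[0, 1])   # approximately zero
```

By construction the residuals are uncorrelated with the predictors used in the fit, so any remaining pattern against religion cannot be explained by a linear effect of those predictors.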

Let's again compare religion in general with adjusted happiness.

In [None]:
h = sns.jointplot(data=df, x="religious pop (%)", y="adjusted happiness",hue='GDP category', height=8)
h.set_axis_labels('% of country that is religious', 'Adjusted happiness', fontsize=16)
h.ax_marg_x.set_xlim(0, 100)
h.ax_marg_y.set_ylim(-1, 1);

Each country is now represented fairly regardless of wealth. The less religious countries that seemed quite happy before are now noticeably below average. Even so, there are no grounds for assuming a correlation between happiness and religion.
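
A visual impression like this can be backed up with a number. The sketch below computes a Pearson correlation and a rough bootstrap interval around it. It runs on synthetic data generated with no built-in relationship, so the arrays and their values are assumptions and say nothing about the real result.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic stand-ins for 'religious pop (%)' and 'adjusted happiness'.
religious_pct = rng.uniform(20, 100, size=150)
adjusted = rng.normal(scale=0.4, size=150)  # generated independently of religious_pct

r = np.corrcoef(religious_pct, adjusted)[0, 1]

# Bootstrap: resample countries with replacement and recompute r each time.
boot = np.empty(2000)
for i in range(boot.size):
    idx = rng.integers(0, religious_pct.size, size=religious_pct.size)
    boot[i] = np.corrcoef(religious_pct[idx], adjusted[idx])[0, 1]

low, high = np.percentile(boot, [2.5, 97.5])
print(f"r = {r:.3f}, 95% bootstrap interval = [{low:.3f}, {high:.3f}]")
```

If the interval comfortably includes zero, the data give little reason to claim a linear correlation in either direction.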

Let's once again delve into each religion by itself, now with more clarity.

In [None]:
g = sns.PairGrid(df, y_vars=['adjusted happiness'], x_vars=["christians (%)", "muslims (%)", 'hindus (%)', 'buddhists (%)'], height=6)
g.map(sns.regplot)
g.set(ylim=(-1, 1), yticks=[-1, -0.5, 0, 0.5, 1])
g.axes[0,0].set_xlim(0,100)
g.axes[0,1].set_xlim(0,100)
g.axes[0,2].set_xlim(0,100)
g.axes[0,3].set_xlim(0,100)
titles = ['Christian', 'Muslim', 'Hindu', 'Buddhist']
for ax, title in zip(g.axes.flat, titles):
    ax.set_title('Adjusted happiness by ' + title + ' population')
    if title == 'Christian':
        ax.text(34, 0.71,'Ivory Coast', fontsize=14)
    if title == 'Muslim':
        ax.text(80, -0.71,'Egypt', fontsize=14)
    if title == 'Hindu':
        ax.text(80, 0.1,'Nepal', fontsize=14)
    if title == 'Buddhist':
        ax.text(65, -0.71, 'Sri Lanka', fontsize=14)

Astonishingly, all countries that have a significant Buddhist population seem to be below the average happiness score. Of course it may just be chance that the countries with a significant Buddhist population all fall short in happiness. To verify this with confidence let's conduct a z-test.

In [None]:
import math

# Countries where Buddhists make up more than 5% of the population.
df_buddhists = df[df['buddhists (%)'] > 5]

mean_buddhist_happiness = df_buddhists['adjusted happiness'].mean()

mean_total_happiness = df['adjusted happiness'].mean()

SD_total_happiness = df['adjusted happiness'].std()

SD_buddhist_happiness = df_buddhists['adjusted happiness'].std()

# Standard error of a group mean: SD of all countries divided by the square root of the group size.
standard_error = SD_total_happiness/math.sqrt(df_buddhists.shape[0])

# Upper bound of a one-sided 99% confidence interval for the Buddhist group mean (z = 2.33).
upper_bound = mean_buddhist_happiness + 2.33 * standard_error

# Share of all countries whose adjusted happiness falls below that bound.
percentile = (df[df['adjusted happiness'] < upper_bound].shape[0]/df.shape[0])*100

print('The mean adjusted happiness amongst all countries is {:.2f} with standard deviation {:.4f}.\n'.format(mean_total_happiness, SD_total_happiness))
print('Whereas the mean adjusted happiness amongst countries with significant Buddhist population (>5%) is {:.4f} with standard error {:.4f}.'.format(mean_buddhist_happiness, standard_error))
print('\nThis equates to a confidence of 99% that the mean adjusted happiness of countries with significant Buddhist population is below percentile {:.1f} of all countries in terms of adjusted happiness.'.format(percentile))
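
For reference, a one-sided z-test for a group mean divides the gap between the group mean and the population mean by the standard error, which is the population standard deviation over the square root of the group size. The helper below is a minimal sketch using only the standard library; the function name `one_sided_z_test` and the input numbers are hypothetical and chosen purely for illustration.

```python
import math

def one_sided_z_test(sample_mean, pop_mean, pop_sd, n):
    """Return (z, p) for the alternative that the group mean is below the population mean."""
    standard_error = pop_sd / math.sqrt(n)
    z = (sample_mean - pop_mean) / standard_error
    # Standard normal CDF evaluated at z, via the error function.
    p = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return z, p

# Made-up numbers for illustration only.
z, p = one_sided_z_test(sample_mean=-0.3, pop_mean=0.0, pop_sd=0.4, n=16)
print(f"z = {z:.2f}, one-sided p = {p:.4f}")
```

A small p-value means a group mean this low would rarely arise if the group were a random draw of countries.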

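We just assumed that the mean adjusted happiness of the countries with significant Buddhist population is normally distributed. By the central limit theorem this is approximately true for large enough samples, although the approximation is rougher for a small group of countries. Here's how the adjusted happiness scores of all countries are distributed.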

In [None]:
df['happiness bins'] = pd.cut(df['adjusted happiness'], 10)

sns.countplot(x="happiness bins", data=df)
plt.xticks(rotation=45)
plt.title('Counts of adjusted happiness scores', fontsize=16)
# Set y-axis label
plt.ylabel('Count')
# Set x-axis label
plt.xlabel('Adjusted happiness bins');
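
When the group is small, a resampling check avoids leaning on the normal approximation. The sketch below draws random groups of the same size and asks how often their mean is as low as the observed one. It uses synthetic data with a deficit deliberately built in; the group size, the deficit and all names are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic stand-ins: 150 "countries", of which 12 form the group of interest.
scores = rng.normal(loc=0.0, scale=0.4, size=150)
in_group = np.zeros(150, dtype=bool)
in_group[:12] = True
scores[in_group] -= 0.3  # build in a deficit so there is something to detect

observed = scores[in_group].mean()
group_size = int(in_group.sum())

# Draw many random groups of the same size and record their means.
n_draws = 5000
random_means = np.array([
    rng.choice(scores, size=group_size, replace=False).mean()
    for _ in range(n_draws)
])

# Share of random groups with a mean at least as low as the observed one.
p_value = (random_means <= observed).mean()
print(f"observed group mean = {observed:.3f}, resampling p = {p_value:.4f}")
```

This makes no assumption about the shape of the distribution, at the cost of some computation.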

### Religion and generosity

The generosity column was calculated just as we have done with happiness (residual of regression). Therefore generosity is already in a convenient format for us to compare with religion.


In [None]:
h = sns.jointplot(data=df, x="religious pop (%)", y="Generosity", height=8, hue='GDP category')
h.set_axis_labels('% of country that is religious', 'Generosity score', fontsize=16)
h.ax_marg_x.set_xlim(0, 100)
h.ax_marg_y.set_ylim(-1, 1);

As expected, there is no visible bias by the GDP of a country. There is perhaps a slight degradation of generosity as a country loses religious population. Let's take a look at each religion individually.

In [None]:
g = sns.PairGrid(df, y_vars=['Generosity'], x_vars=["christians (%)", "muslims (%)", 'hindus (%)', 'buddhists (%)'], height=5)
g.map(sns.regplot)
g.set(ylim=(-1, 1), yticks=[-1, -0.5, 0, 0.5, 1])
g.axes[0,0].set_xlim(0,100)
g.axes[0,1].set_xlim(0,100)
g.axes[0,2].set_xlim(0,100)
g.axes[0,3].set_xlim(0,100);

Christianity seems to be the only example of a slightly downward trend. Also there seems to be a positive correlation between generosity and Buddhism and Hinduism.

## 4. Summary

Religion is not a good indicator of happiness or unhappiness, which is consistent with the absence of religion in the World Happiness Report. When looking at religions individually, however, there is a worrying trend for countries with significant Buddhist populations to be less happy. Of course this might be due to external factors, but it might be worth looking into the reasons for this.

In terms of generosity, the trend for a particular religion tended to run in the opposite direction to its happiness trend. For example, a higher Christian population trended towards a happier population but a less generous population.