![](http://worldhappiness.report/assets/images/icons/whr-cover-ico.png)

# Context

"The World Happiness Report is a landmark survey of the state of global happiness. The first report was published in 2012, the second in 2013, the third in 2015, and the fourth in the 2016 Update. The World Happiness 2017, which ranks 155 countries by their happiness levels, was released at an event celebrating International Day of Happiness on March 20th. The report continues to gain global recognition as governments, organizations and civil society increasingly use happiness indicators to inform their policy-making decisions. Leading experts across fields – economics, psychology, survey analysis, national statistics, health, public policy and more – describe how measurements of well-being can be used effectively to assess the progress of nations. The reports review the state of happiness in the world today and show how the new science of happiness explains personal and national variations in happiness."

# Data Content

"The happiness scores and rankings use data from the Gallup World Poll.
The columns following the happiness score estimate the extent to which each of six factors – economic production, social support, life expectancy, freedom, absence of corruption, and generosity – contribute to making life evaluations higher in each country than they are in Dystopia, a hypothetical country that has values equal to the world’s lowest national averages for each of the six factors. They have no impact on the total score reported for each country, but they do explain why some countries rank higher than others."

* Ladder score: Happiness score or subjective well-being. This is the national average response to the question of life evaluations.
* Logged GDP per capita: The GDP-per-capita time series from one year to another using country-specific forecasts of real GDP growth in the following year.
* Social support: Social support refers to assistance or support provided by members of social networks (like government) to an individual.
* Healthy life expectancy: Healthy life expectancy is the average life in good health - that is to say without irreversible limitation of activity in daily life or incapacities - of a fictitious generation subject to the conditions of mortality and morbidity prevailing that year.
* Freedom to make life choices: Freedom to make life choices is the national average of binary responses to the GWP question “Are you satisfied or dissatisfied with your freedom to choose what you do with your life?” ... It is defined as the average of laughter and enjoyment for other waves where the happiness question was not asked
* Generosity: Generosity is the residual of regressing national average of response to the GWP question “Have you donated money to a charity in the past month?” on GDP per capita.
* Perceptions of corruption: The measure is the national average of the survey responses to two questions in the GWP: “Is corruption widespread throughout the government or not” and “Is corruption widespread within businesses or not?”
* Ladder score in Dystopia: It has values equal to the world’s lowest national averages. Dystopia as a benchmark against which to compare contributions from each of the six factors. Dystopia is an imaginary country that has the world's least-happy people. ... Since life would be very unpleasant in a country with the world's lowest incomes, lowest life expectancy, lowest generosity, most corruption, least freedom, and least social support, it is referred to as “Dystopia,” in contrast to Utopia.

World Happiness Report Official Website: https://worldhappiness.report/

# Does Money Buy Happiness?

"For many Americans, the pursuit of happiness and the pursuit of money come to much the same thing. More money means more goods (inflation aside) and thus more of the material benefits of life. As it is for the individual, so it is for society as a whole. National economic growth - a steady upward march in average income, year after year, decade after decade - means, it is supposed, greater well-being and a happier society."

# Hypothesis:

$H_0: \beta_1 = 0$ vs. $H_a: \beta_1 \neq 0$

$H_0$: There is not a significant statistical association between GDP per capita and happiness score.

$H_a$: There is a significant statistical association between GDP per capita and happiness score.
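
For reference, this test corresponds to a simple linear regression of the ladder score on logged GDP per capita. The notation is assumed here for illustration: $i$ indexes the country-year observations and $\varepsilon_i$ is the error term.

$$\text{Ladder\_score}_i = \beta_0 + \beta_1 \, \text{GDP\_per\_capita}_i + \varepsilon_i$$

Under $H_0$, and assuming the usual OLS conditions hold, the statistic $t = \hat{\beta}_1 / \operatorname{SE}(\hat{\beta}_1)$ follows a $t$ distribution with $n - 2$ degrees of freedom, where $n$ is the number of observations.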

# Importing Libraries / Reading in the Data

In [None]:
# Import Libraries
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
import plotly.express as px
import plotly.graph_objs as go
from statsmodels.formula.api import ols
import scipy.stats as st

**Reading in data**

In [None]:
# Read in the data - 

# Import Local CSV files sourced from - "https://www.kaggle.com/ajaypalsinghlo/world-happiness-report-2021?select=world-happiness-report.csv"

world1 = pd.read_csv("../input/world-happiness-report-2021/world-happiness-report.csv")
world2 = pd.read_csv("../input/world-happiness-report-2021/world-happiness-report-2021.csv")

# Preview of Dataset 1 / Cleaning

In [None]:
# Preview Data World 1 -

# Preview the first dataset
world1

In [None]:
# Preview of info on the dataframe
print(world1.info())
print('-----------------------------------------------------------------------')

# Check for total count of null values
print('Is null count:',world1.isnull().sum().sum())

In [None]:
# Unique countries used within data set
print('Unique countries used:',world1['Country name'].unique())
print('-----------------------------------------------------------------------')

# Unique years used within data set
print('Unique years used:',world1['year'].unique())

In [None]:
# Clean Data from world1 so there's no NaN values
world1_clean_ = world1.dropna(axis=0)

# Clean Data from world1 so column names fit better
world1_clean = world1_clean_.rename(columns=
                                    {"Country name": "Country_name",
                                     "year": "Year",
                                     "Life Ladder": "Ladder_score",
                                     "Log GDP per capita": 
                                     "GDP_per_capita", "Social support":
                                     "Social_support",
                                     "Healthy life expectancy at birth":
                                     "Life_expectancy", 
                                     "Freedom to make life choices":
                                     "Freedom_of_choice", 
                                     "Perceptions of corruption":
                                     "Corruption","Positive affect":
                                     "Positive_affect", "Negative affect": 
                                     "Negative_affect"})

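# Remove 2005 from dataset since there's limited to no data

# Filter on the Year value instead of hard-coded row labels,
# which can point at the wrong rows once dropna() has removed data
world1_clean_ = world1_clean[world1_clean['Year'] != 2005]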

# Organize set to present columns in a desired fashion
world1_final_ = world1_clean_[['Year','Country_name',
                               'Ladder_score','GDP_per_capita',
                               'Generosity','Social_support',
                               'Life_expectancy','Freedom_of_choice',
                               'Corruption', 'Positive_affect',
                               'Negative_affect']]

# drop columns to match world2
world1_final = world1_final_.drop(['Positive_affect',
                                   'Negative_affect'],axis=1)

**Cleaning Data From Set 1 - Cleaning Columns, NAN values, and Unnecessary Rows**

* First I started by dropping all rows with NaN values since there were quite a few.
* I renamed the columns so they fit better for presentation and would be easier to type.
* I wanted to organize the columns in an order that was a bit more structured.
* I decided to drop the positive and negative affect columns since they weren't included in the second set.
* I decided to drop the year 2005, since there was only one line item of data.
  * I didn't want the visualizations to only include one line item when doing multi-year timespans.
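
The cleaning steps listed above can be sketched as a single chain on a toy dataframe. This is a minimal illustration with made-up values, not the real dataset, and removing the sparse year by filtering on its value is a choice made for this sketch.

```python
import pandas as pd

# Toy stand-in for the raw data; the values are made up for illustration.
toy = pd.DataFrame({
    "Country name": ["A", "A", "B", "C"],
    "year": [2005, 2006, 2006, 2007],
    "Life Ladder": [5.0, 5.2, None, 6.1],
})

cleaned = (
    toy.dropna(axis=0)                                    # drop rows with any NaN
       .rename(columns={"Country name": "Country_name",
                        "year": "Year",
                        "Life Ladder": "Ladder_score"})    # easier-to-type names
       .loc[lambda df: df["Year"] != 2005]                 # remove the sparse year by value
       .reset_index(drop=True)
)
print(cleaned)
```

Filtering on the year value keeps working even if the row labels shift after earlier cleaning steps.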

# Preview of Dataset 2 / Cleaning

In [None]:
# Preview Data World 2 -

# Preview the second data set
world2

In [None]:
# Preview of info on the dataframe
print(world2.info())
print('-----------------------------------------------------------------------')

# Check for total count of null values
print('Null count:',world2.isnull().sum().sum())

In [None]:
# Cleaning world 2 Data -

# Drop columns that don't pertain to top dataset
world2_clean_ = world2.drop(['Standard error of ladder score',
                             'upperwhisker','lowerwhisker',
                             'Ladder score in Dystopia',
                             'Explained by: Log GDP per capita',
                             'Explained by: Social support',
                             'Explained by: Freedom to make life choices', 
                             'Explained by: Generosity', 
                             'Explained by: Perceptions of corruption', 
                             'Dystopia + residual', 
                             'Explained by: Healthy life expectancy'],axis=1)

# Add a Year column, since this dataset only covers 2021
world2_clean_['Year'] = 2021

# Taking previous dataset, including 2021, then organizing columns
world2_final_ = world2_clean_[['Year','Country name',
                               'Regional indicator',
                               'Ladder score',
                               'Logged GDP per capita',
                               'Generosity',
                               'Social support',
                               'Healthy life expectancy',
                               'Freedom to make life choices',
                               'Perceptions of corruption']]

# Clean up the names
world2_final = world2_final_.rename(columns={"Country name": 
                                             "Country_name",
                                             "Regional indicator":
                                             "Region_indicator", 
                                             "Ladder score": 
                                             "Ladder_score", 
                                             "Logged GDP per capita":
                                             "GDP_per_capita", 
                                             "Social support": 
                                             "Social_support", 
                                             "Healthy life expectancy": 
                                             "Life_expectancy", 
                                             "Freedom to make life choices": 
                                            "Freedom_of_choice", 
                                             "Perceptions of corruption":
                                             "Corruption"})

**Cleaning Data From Set 2 - Cleaning Columns**

* There were quite a few columns that weren't included in the first set, so I decided to drop them.
  * I anticipated merging both sets later, so I tried to get the dataframes to match exactly.
* I noticed that there wasn't a year column for the 2021 dataset, so I decided to create one to match the year column in the first set.
* I decided to organize the columns to match the first dataset.

# Merging Both Sets Together / Cleaning Final Set

In [None]:
# Merge World1_final + world 2 Final, try to use country as the identifier so region gets applied to every existing country -
world_test = pd.merge(world1_final, world2_final[['Country_name','Region_indicator']], how='left', on = 'Country_name')

# Is null test to see which values are missing, then assign them to a column for revisions.
world_test['Missing'] = world_test['Region_indicator'].isnull()

# Assign these missing values to a new dataframe
world_missing = pd.DataFrame(world_test.loc[world_test['Missing'] == True])

# calculate the names of the countries missing values
world_missing['Country_name'].value_counts(ascending=False)

**Creating Final Set - Merge / NaN check**

* After merging both dataframes into one, a few countries were missing a value in the region indicator column.
* I then assigned all the rows with missing values to a temp dataframe, where I planned on filling the NaN values as opposed to dropping them.
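
As a small illustration of this check, the sketch below runs a left merge on toy data and lists the countries that end up without a region. The frames `history` and `regions` and their values are hypothetical stand-ins, not the real sets.

```python
import pandas as pd

# Hypothetical stand-ins for the multi-year set and the 2021 region lookup.
history = pd.DataFrame({"Country_name": ["A", "B", "C"],
                        "Year": [2019, 2019, 2019]})
regions = pd.DataFrame({"Country_name": ["A", "B"],
                        "Region_indicator": ["R1", "R2"]})

# A left merge keeps every row of `history`; unmatched countries get NaN.
merged = history.merge(regions, how="left", on="Country_name")

# Rows whose region is still missing after the merge.
missing = merged[merged["Region_indicator"].isnull()]
missing_countries = sorted(missing["Country_name"].unique())
print(missing_countries)
```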

In [None]:
# Use a for loop to build a region value for every row that is missing one, then add the list to the df as a new column
region = []
for i in range(len(world_test)):
    if world_test['Country_name'][i] == 'Angola':
        region.append('Sub-Saharan Africa')
    elif world_test['Country_name'][i] == 'Belize':
        region.append('Latin America and Caribbean')
    elif world_test['Country_name'][i] == 'Congo (Kinshasa)':
        region.append('Sub-Saharan Africa')
    elif world_test['Country_name'][i] == 'Syria':
        region.append('Middle East and North Africa')
    elif world_test['Country_name'][i] == 'Trinidad and Tobago':
        region.append('Latin America and Caribbean')
    elif world_test['Country_name'][i] == 'Qatar':
        region.append('Middle East and North Africa')
    elif world_test['Country_name'][i] == 'Sudan':
        region.append('Middle East and North Africa')
    elif world_test['Country_name'][i] == 'Central African Republic':
        region.append('Sub-Saharan Africa')
    elif world_test['Country_name'][i] == 'Djibouti':
        region.append('Sub-Saharan Africa')
    elif world_test['Country_name'][i] == 'Guyana':
        region.append('Latin America and Caribbean')
    elif world_test['Country_name'][i] == 'Bhutan':
        region.append('South Asia')
    elif world_test['Country_name'][i] == 'Suriname':
        region.append('Latin America and Caribbean')
        
world_missing['Region'] = region

**Creating Final Set - Filling NaN Values**

* I decided to create a for loop that scanned the missing values by country name, then added the country's region indicator.
  * I did this so I didn't have to drop a small chunk of values, which helps preserve the integrity of my data.
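
An alternative to the loop, shown here only as a sketch, is a dictionary lookup combined with `Series.map` and `fillna`. The `region_lookup` dictionary is a hypothetical, shortened version of the country list, and the dataframe is toy data.

```python
import pandas as pd

# Toy frame: two rows are missing a region, one already has it.
df = pd.DataFrame({
    "Country_name": ["Angola", "Belize", "Finland"],
    "Region_indicator": [None, None, "Western Europe"],
})

# Hypothetical lookup; only two of the countries are shown.
region_lookup = {
    "Angola": "Sub-Saharan Africa",
    "Belize": "Latin America and Caribbean",
}

# map() translates each country to its region (NaN when not in the lookup);
# fillna() then fills only the rows where the region was missing.
df["Region_indicator"] = df["Region_indicator"].fillna(
    df["Country_name"].map(region_lookup)
)
print(df)
```

With this approach each region string is typed once per country, so a spelling slip is easier to spot.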

In [None]:
# Drop the old Region column, and the missing values column
world_missing_cleaner_ = world_missing.drop('Region_indicator',axis=1)
world_missing_cleaner = world_missing_cleaner_.drop('Missing',axis=1)

# Rename the filled-in Region column so it matches the other dataframes
world_missing_final_ = world_missing_cleaner.rename(columns={'Region': 'Region_indicator'})

# Re-order columns to match other dataframes
world_missing_final = world_missing_final_[['Year','Country_name',
                                            'Region_indicator',
                                            'Ladder_score',
                                            'GDP_per_capita',
                                            'Generosity',
                                            'Social_support',
                                            'Life_expectancy',
                                            'Freedom_of_choice',
                                            'Corruption']]

# Drop the missing column from the previous calculations
world_test = world_test.drop(['Missing'],axis=1)

#Organize columns to match other dataframes 
world_test = world_test[['Year','Country_name',
                         'Region_indicator',
                         'Ladder_score',
                         'GDP_per_capita',
                         'Generosity',
                         'Social_support',
                         'Life_expectancy',
                         'Freedom_of_choice',
                         'Corruption']]

**Creating Final Set - Column / Final Dataframe Cleanup**

* First I dropped all of the old columns that were duplicated, or used in the cleaning process.
* I then renamed and reordered all columns so all dataframes were consistent.

In [None]:
# Concat both dataframes
world_final_close = pd.DataFrame(pd.concat([world_test, world_missing_final]))

# Dropping duplicate, na values from the original set, and replacing them with the new values
world_final_ = world_final_close.dropna(axis=0)

# Merging the cleaned dataframes to a final dataframe
world_final = pd.DataFrame(pd.concat([world_final_, world2_final]))

# Sort rows by year (sort_values defaults to ascending order)
world_final.sort_values(by = 'Year', inplace = True)

**Creating Final Set - Final Merge**

* There were a couple of NA values that were included from world_test in the initial merge, and when I concatenated the cleaned values it created duplicates.
  * These were removed from the set using the dropna function.
* I then merged the final two versions of world 1 and world 2 to create one final dataframe.
* There were a few complications when creating visualizations, so I had to sort the values in the year column in order to work out the kink.
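
The concat, dropna, and sort sequence can be illustrated on toy data. The three small frames here are hypothetical stand-ins for the partially filled set, the filled-in rows, and the 2021 set.

```python
import pandas as pd

# Hypothetical stand-ins; values are made up for illustration.
partial = pd.DataFrame({"Year": [2020, 2019],
                        "Country_name": ["A", "C"],
                        "Region_indicator": ["R1", None]})   # C still lacks a region
filled = pd.DataFrame({"Year": [2019],
                       "Country_name": ["C"],
                       "Region_indicator": ["R3"]})          # C with its region filled in
latest = pd.DataFrame({"Year": [2021],
                       "Country_name": ["A"],
                       "Region_indicator": ["R1"]})

combined = (
    pd.concat([partial, filled, latest], ignore_index=True)
      .dropna(axis=0)            # removes the unfilled copy of the duplicated row
      .sort_values(by="Year")    # keeps the years in order for plotting
      .reset_index(drop=True)
)
print(combined)
```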

# Preview and Analytics of Final Set

In [None]:
# Preview of dataframe final
world_final

In [None]:
# Final dataframe describe
round(world_final.describe(),4)

**Analytics - Data Describe Visualization**

* The max ladder score appears to be 7.97 and the highest life expectancy is 77.1.
* The lowest score for corruption is .035.
* The lowest average life expectancy age is 32.3.

In [None]:
# Pairplot to show an overview of all of the data, and their distributions.
sns.pairplot(world_final[['Ladder_score', 
                          'GDP_per_capita',
                          'Generosity',
                          'Social_support',
                          'Freedom_of_choice']])

# Show plot
plt.show();

**Analytics - Pairplot Visualization**

* Ladder score appears to have a unimodal distribution.
* GDP per capita appears to have an asymmetric bimodal distribution.
* Generosity appears to have a distribution skewed to the left.
* Social support appears to have a distribution skewed to the right.
* Freedom of choice appears to have a distribution skewed to the right.

In [None]:
# Heat map customization
plt.figure(figsize = (15,12.5))
# Correlation is only defined for numeric columns, so the text columns
# (Country_name, Region_indicator) are left out of the selection
sns.heatmap(world_final[['Ladder_score',
                         'GDP_per_capita',
                         'Generosity',
                         'Social_support',
                         'Life_expectancy',
                         'Freedom_of_choice',
                         'Corruption']].corr(),
            annot=True,
            cmap='Blues',
            linewidths=.9)

# Axis ticks rotated so full column is displayed
plt.yticks(rotation=45)
plt.xticks(rotation=45)

# Create title for plot, and show plot
plt.title('Relationship Between Columns')
plt.show();

**Analytics - Correlation Between Columns, and Visualization**

* GDP per capita, social support, and life expectancy seem to have the highest correlation to ladder score.
  * After checking the statistical correlation between GDP per capita and happiness, I plan to check the relationship of social support compared to happiness after accounting for GDP per capita.
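
To show how columns can be ranked by their correlation with the ladder score, here is a minimal sketch on made-up numbers. The toy values are hypothetical, and `select_dtypes` is used so that text columns never reach `corr()`.

```python
import pandas as pd

# Made-up values for illustration only.
toy = pd.DataFrame({
    "Country_name": ["A", "B", "C", "D"],
    "Ladder_score": [4.0, 5.0, 6.0, 7.0],
    "GDP_per_capita": [8.0, 9.0, 10.0, 11.0],
    "Corruption": [0.9, 0.5, 0.7, 0.3],
})

# Keep numeric columns only, then compute pairwise Pearson correlations.
corr = toy.select_dtypes(include="number").corr()

# Rank the other columns by the absolute size of their correlation with happiness.
ranked = (
    corr["Ladder_score"]
    .drop("Ladder_score")
    .abs()
    .sort_values(ascending=False)
)
print(ranked)
```

Ranking on the absolute value treats strong negative correlations, such as corruption in this toy data, the same as strong positive ones.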

In [None]:
# Linear regression of GDP per capita vs Happiness, and Social Support vs Happiness
g = sns.pairplot(world_final,
                 x_vars=['GDP_per_capita','Social_support'],
                 y_vars=['Ladder_score'],
                 height=9,
                 aspect=.75,
                 kind='reg')

# Create a title for the whole figure (plt.title would only label the last panel), and show plot
g.fig.suptitle('Linear Regression Models of GDP per Capita, and Social Support Compared to Happiness', y=1.02)
plt.show();

**Analytics - GDP per Capita, and Social Support Linear Regression Visualization**

* The slope of the linear regression of GDP compared to happiness appears to be > 0.
* The slope of the linear regression of social support to happiness appears to be > 0.
* Both positive slopes point to a positive association with happiness; the strength of each association comes from the correlation values and the OLS models rather than from the slope alone.
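
As a side note on reading these plots, a positive slope gives the direction of a relationship, while the correlation coefficient gives its strength. The sketch below uses synthetic data (not the happiness dataset) with an arbitrary true slope of 0.7 and two noise levels chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(7, 11, 200)

# Same underlying slope, very different amounts of scatter.
tight = 0.7 * x + rng.normal(0, 0.1, x.size)
noisy = 0.7 * x + rng.normal(0, 3.0, x.size)

# np.polyfit returns coefficients from highest degree down, so [0] is the slope.
slope_tight = np.polyfit(x, tight, 1)[0]
slope_noisy = np.polyfit(x, noisy, 1)[0]

# Pearson correlation measures how closely the points follow the line.
r_tight = np.corrcoef(x, tight)[0, 1]
r_noisy = np.corrcoef(x, noisy)[0, 1]

print("slopes:", slope_tight, slope_noisy)
print("correlations:", r_tight, r_noisy)
```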

In [None]:
# OLS model statistics for GDP compared to happiness
model = ols('Ladder_score ~ GDP_per_capita',data=world_final).fit()

# Label slope, and intercept (looked up by name rather than by position)
slope = model.params['GDP_per_capita']
intercept = model.params['Intercept']

# Print Slope / Intercept
print('Slope is:',slope)
print('------------------------------------------------------------------------------')
print('intercept is:',intercept)
print('==============================================================================')

# Print the model summary
print(model.summary())

**Analytics - GDP per Capita Compared to Happiness**

* We reject our null hypothesis at the .01 significance level and conclude there is a significant statistical association between GDP per capita and happiness score.
* Now, let's see if there's a relationship between social support and happiness after accounting for GDP.


 **$H_0: \beta_2 = 0$ vs. $H_a: \beta_2 \neq 0$**

* $H_0$: There is not a significant statistical association between social support and happiness score after accounting for GDP per capita.
* $H_a$: There is a significant statistical association between social support and happiness score after accounting for GDP per capita.
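
Written out, this second test refers to a regression with two predictors. As before, the subscript $i$ and the error term $\varepsilon_i$ are notation assumed for illustration.

$$\text{Ladder\_score}_i = \beta_0 + \beta_1 \, \text{GDP\_per\_capita}_i + \beta_2 \, \text{Social\_support}_i + \varepsilon_i$$

Here $\beta_2$ is the expected change in ladder score for a one-unit change in social support with GDP per capita held fixed.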

In [None]:
# OLS model statistics for GDP + Social support compared to happiness
model = ols('Ladder_score ~ GDP_per_capita + Social_support',data=world_final).fit()

# Label the GDP slope, and intercept (looked up by name rather than by position)
slope = model.params['GDP_per_capita']
intercept = model.params['Intercept']

# Print Slope / Intercept
print('Slope is:',slope)
print('------------------------------------------------------------------------------')
print('intercept is:',intercept)
print('==============================================================================')

# Print the model summary
print(model.summary())

**Analytics - GDP per Capita + Social Support Compared to Happiness**

* We reject our null hypothesis at the .01 significance level and conclude there is a significant statistical association between social support and happiness score after accounting for GDP per capita.
* The adj. r-squared value increased by about 5% compared to the GDP-only model.
  * Countries with higher levels of GDP appear to have higher levels of social support.
  * One possible explanation is that people in wealthier countries have fewer work obligations, which allows them to focus on their social lives.
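
To make the adjusted r-squared comparison concrete, here is a minimal sketch on synthetic data (not the happiness dataset). The helper `adjusted_r2` is hypothetical and written with NumPy instead of statsmodels; it uses $\bar{R}^2 = 1 - (1 - R^2)\frac{n-1}{n-p-1}$ with $n$ observations and $p$ predictors.

```python
import numpy as np

def adjusted_r2(y, X):
    """Adjusted R-squared for an OLS fit of y on X (an intercept is added here)."""
    n, p = X.shape
    design = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta
    r2 = 1 - (resid @ resid) / np.sum((y - y.mean()) ** 2)
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Synthetic variables; the coefficients are arbitrary choices for this sketch.
rng = np.random.default_rng(1)
gdp = rng.normal(9.0, 1.0, 300)
support = 0.05 * gdp + rng.normal(0.35, 0.1, 300)
ladder = 0.7 * gdp + 3.0 * support + rng.normal(0, 0.5, 300)

adj_one = adjusted_r2(ladder, gdp.reshape(-1, 1))
adj_two = adjusted_r2(ladder, np.column_stack([gdp, support]))
print("GDP only:", adj_one)
print("GDP + social support:", adj_two)
```

Because the adjustment penalizes each extra predictor, an increase from the first model to the second suggests the added variable explains variation beyond what GDP already captures.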

In [None]:
# Boxplot of GDP per capita per year
plt.figure(figsize=(15,7))
sns.boxplot(x='Year',
            y='GDP_per_capita',
            data=world_final)

# Create title for plot, and show plot
plt.title('Box Plot of GDP per Capita by Year')
plt.show();

**Analytics - GDP per Capita Box Plot Visualization**

* The median GDP per capita appears to be > 9 on average across the years, with 2020 coming in at the highest.
  * This might be due to the limited data provided for that year skewing the results.
* Most of the interquartile range of GDP per capita throughout the years lies between 8 and 10.
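
The quantities read off the box plot (median and interquartile range per year) can also be computed directly. This is a minimal sketch on made-up numbers, not the real data.

```python
import pandas as pd

# Made-up values for illustration only.
toy = pd.DataFrame({
    "Year": [2019, 2019, 2019, 2020, 2020, 2020],
    "GDP_per_capita": [8.0, 9.0, 10.0, 9.0, 10.0, 11.0],
})

# Quartiles per year; unstack() turns the quantile level into columns.
summary = (
    toy.groupby("Year")["GDP_per_capita"]
       .quantile([0.25, 0.5, 0.75])
       .unstack()
)
summary.columns = ["q1", "median", "q3"]
summary["iqr"] = summary["q3"] - summary["q1"]
print(summary)
```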


In [None]:
# Boxplot of Social Support per year
plt.figure(figsize=(15,7))
sns.boxplot(x='Year',
            y='Social_support',
            data=world_final)

# Create title for plot, and show plot
plt.title('Box Plot of Social Support by Year')
plt.show();

**Analytics - Social Support Box Plot Visualization**

* The median social support appears to be > .8 on average across the years, with 2006 coming in at the highest.
* Most of the interquartile range of social support throughout the years lies between .7 and .9.
* There appear to be a few outliers across the years, with the most outliers in 2010 and 2011.

In [None]:
# Strip plot to show Generosity per regions
plt.figure(figsize = (15,12.5))
sns.stripplot(x='Year',
              y='Generosity',
              data=world_final,
              hue='Region_indicator')

# Create title for plot, and show plot
plt.title('Generosity Per Region by Year')
plt.show();

**Analytics - Generosity Strip Plot Visualization**


* Southeast Asian countries appear to be the most generous countries consistently throughout the years.
* The Commonwealth of Independent States countries seemed to be the least generous, but slightly improved over time.

In [None]:
# Animated scatter plot to present GDP per capita in comparison to happiness rating per year
# Also plot points based on size of social score
fig = px.scatter(world_final,
                x='GDP_per_capita',
                y='Ladder_score',
                animation_frame = 'Year',
                animation_group = 'Country_name',
                template = 'plotly_white',
                color='Region_indicator',
                size='Social_support',
                size_max= 20,
                title='GDP per Capita + Social Support per Region Compared to Happiness')

# Show plot
fig.show();

# Conclusions:

* We rejected our first null hypothesis and conclude there is a statistically significant relationship between GDP per capita and happiness score.
* We rejected our second null hypothesis and conclude there is a significant statistical association between social support and happiness score after accounting for GDP per capita.
* After analyzing the data I've come to the conclusion that countries with higher levels of GDP are happier countries. Additionally, I've found that countries with higher GDP have higher social support, leading me to conclude that wealthier countries have fewer obligations and they can focus on their social lives. This leads to happier well-being as a general population.

# Resources / Sources - 

https://www.proquest.com/openview/6ce2d5a919778d8fada2059d39b7ff89/1?pq-origsite=gscholar&cbl=1817076

https://worldhappiness.report/

https://www.kaggle.com/ajaypalsinghlo/world-happiness-report-2021?select=world-happiness-report.csv

https://www.kaggle.com/unsdsn/world-happiness

https://colab.research.google.com/github/binnisb/blog/blob/master/_notebooks/2020-04-02-Plotly-in-lab.ipynb#scrollTo=oJ8EFTtlwGBa

https://plotly.com/python/

https://www.python-graph-gallery.com/

https://pandas.pydata.org/pandas-docs/stable/user_guide/index.html#user-guide

https://seaborn.pydata.org/index.html