# Assignment 6: Visualization of the Happiness Score Datasets (2015-2019)

Prepared by: Sondos Aabed

Student ID: 1190652

Instructor: Dr. Hussien Suboh

Section: 2

## Abstract

## Table of Contents

## Introduction
- Pipeline of the analysis process that I followed.
- About the questions asked.

### About the Datasets (2015-2019)

### Analysis Questions

## Tools and Versions

## Data Analysis Process

### Data Wrangling

#### Loading the Dataset

In [None]:
import pandas as pd
import os
from IPython.display import display
import matplotlib.pyplot as plt

In [None]:
def load_data(path="./happiness-score-datasets"):
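    """
    Load every CSV file in `path` into a pandas data frame and add a year column.
    Args:
        path (string): path to the data, default value is the directory name
    Returns:
        (list): list of data frames (pd.DataFrame)
    """
    dfs = []
    # Sort the file names so the frames are loaded in a predictable order
    for file in sorted(os.listdir(path)):
        if file.endswith(".csv"):
            data = pd.read_csv(os.path.join(path, file))
            # str.strip(".csv") removes any of the characters '.', 'c', 's', 'v'
            # from both ends, so take the name without its extension instead
            data['year'] = os.path.splitext(file)[0]
            dfs.append(data)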
    return dfs

Each dataframe is loaded separately so that we can first check whether the columns are identical, and then optimize the data types with NumPy.
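
The column check described above could be sketched as follows. This is a minimal illustration on two made-up frames (`df_2015`, `df_2019` and their contents are assumptions, not the real files); with the real data, the frames returned by `load_data()` would take their place.

```python
import pandas as pd

# Hypothetical stand-ins for two yearly files
df_2015 = pd.DataFrame({"Country": ["A"], "Happiness Score": [7.5], "Region": ["X"]})
df_2019 = pd.DataFrame({"Country or region": ["A"], "Score": [7.6]})
frames = {"2015": df_2015, "2019": df_2019}

# Columns that appear in every frame
common_cols = set.intersection(*(set(df.columns) for df in frames.values()))

# Columns that are specific to each frame
extra_cols = {year: sorted(set(df.columns) - common_cols) for year, df in frames.items()}

print("Shared columns:", common_cols)
for year, cols in extra_cols.items():
    print(year, "only:", cols)
```

An empty set of shared columns, as in this toy case, is a sign that the names need to be standardized before merging.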

In [None]:
dfs = load_data()

#### Assessing and Cleaning the datasets

- Assess and handle Columns and Data types
- Assess and handle Duplicates
- Assess and handle Missing Values
- Assess and handle Outliers

##### Assessing and handling Columns and Data types

Fix the structural issues so that the datasets can be merged.

In [None]:
for df in dfs:
    display(df.head())

In [None]:
for df in dfs:
    print(df.columns)

> Inspecting the head of each dataset shows inconsistent column names across the years. This has to be handled before the datasets can be merged.

Here is the renaming map, which follows a single lowercase, underscore-separated naming convention:

In [None]:
rename_mapping = {
    'Happiness.Rank': 'happiness_rank',
    'Happiness.Score': 'happiness_score',
    'Happiness Rank':'happiness_rank',
    'Happiness Score':'happiness_score',
    'Whisker.high': 'upper_confidence_interval',
    'Upper Confidence Interval': 'upper_confidence_interval',
    'Whisker.low': 'lower_confidence_interval',
    'Lower Confidence Interval': 'lower_confidence_interval',
    'Economy..GDP.per.Capita.': 'economy_gdp_per_capita',
    'Economy (GDP per Capita)':'economy_gdp_per_capita',
    'Health..Life.Expectancy.': 'health_life_expectancy',
    'Trust..Government.Corruption.': 'trust_government_corruption',
    'Dystopia.Residual': 'dystopia_residual',
    'Dystopia Residual': 'dystopia_residual',
    'Overall rank': 'happiness_rank',
    'Country or region': 'country',
    'Country':'country',
    'Region':'region',
    'Standard Error':'standard_error',
    'Score': 'happiness_score',
    'GDP per capita': 'economy_gdp_per_capita',
    'Social support': 'family',
    'Family':'family',
    'Healthy life expectancy': 'health_life_expectancy',
    'Health (Life Expectancy)': 'health_life_expectancy',
    'Freedom to make life choices': 'freedom',
    'Freedom': 'freedom',
    'Perceptions of corruption': 'trust_government_corruption',
    'Trust (Government Corruption)': 'trust_government_corruption'
}

In [None]:
def standardize_columns(dfs):
    """
    Standardize the column names of the dataframes
    Args:
        dfs (list): list of dataframes
    Returns:
        (list): list of standardized (columns names) dataframes
    """
    standardized_dfs = []
    
    for df in dfs:
        df = df.rename(columns=rename_mapping)
        standardized_dfs.append(df)
    
    return standardized_dfs
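
One way to check whether a mapping covers every column variant is to look for names that still do not follow the lowercase, underscore-separated convention after renaming. The sketch below uses a shortened, hypothetical mapping (`toy_mapping`) and made-up frames (`toy_dfs`); it is only an illustration of the idea.

```python
import pandas as pd

# Hypothetical, shortened mapping and frames for illustration only
toy_mapping = {
    "Happiness Score": "happiness_score",
    "Score": "happiness_score",
    "Country": "country",
    "Country or region": "country",
}
toy_dfs = [
    pd.DataFrame({"Country": ["A"], "Happiness Score": [7.5]}),
    pd.DataFrame({"Country or region": ["A"], "Score": [7.6], "Generosity": [0.2]}),
]

renamed = [df.rename(columns=toy_mapping) for df in toy_dfs]

# A name with uppercase letters, spaces or dots was not touched by the mapping
unmapped = sorted(
    {c for df in renamed for c in df.columns if c != c.lower() or " " in c or "." in c}
)
print(unmapped)
```

Columns listed in `unmapped` either need an entry in the mapping or are deliberately kept under their original name.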

> Since the datasets are standardized now, we're able to concatenate them.

In [None]:
merged_df = pd.concat(standardize_columns(dfs), axis=0, ignore_index=True)
merged_df.columns

In [None]:
merged_df.shape

> The merged dataset now has 782 rows and 15 features. Let's work on the data types:

In [None]:
merged_df.info()

In [None]:
merged_df.nunique()

In [None]:
merged_df['year'] = pd.to_datetime(merged_df['year'], format='%Y').dt.year
merged_df.head()

> Only the year column was converted to an integer type, which reduces the memory usage.
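
If further memory savings are wanted, numeric columns could be downcast to smaller NumPy types. The following is a minimal sketch on a made-up frame (`toy` is hypothetical); it uses `pd.to_numeric` with the `downcast` option and is not the step applied to `merged_df` above.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with the default 64-bit types
toy = pd.DataFrame({
    "year": np.array([2015, 2016, 2017], dtype="int64"),
    "happiness_score": np.array([7.5, 7.4, 7.3], dtype="float64"),
})
before = toy.memory_usage(deep=True).sum()

# Downcast to the smallest integer / float type that can hold the values
toy["year"] = pd.to_numeric(toy["year"], downcast="integer")
toy["happiness_score"] = pd.to_numeric(toy["happiness_score"], downcast="float")
after = toy.memory_usage(deep=True).sum()

print(toy.dtypes)
print("bytes before:", before, "bytes after:", after)
```

Downcasting floats loses some precision, so it is worth checking that the scores still carry enough decimals for the analysis.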

In [None]:
## info after conversion
merged_df.info()

##### Assess and handle Duplicates
Now let's check for duplicates and handle them

In [None]:
merged_df.duplicated().any()

> No duplicate records were found.
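
`duplicated()` without arguments only flags rows that are identical in every column. A stricter check, sketched here on a made-up frame (`toy` is hypothetical), is to look for repeated country and year pairs, assuming each country should appear at most once per year.

```python
import pandas as pd

# Hypothetical frame: country B appears twice in 2015 with different scores
toy = pd.DataFrame({
    "country": ["A", "A", "B", "B"],
    "year": [2015, 2016, 2015, 2015],
    "happiness_score": [7.5, 7.4, 5.0, 5.1],
})

# No row is a full copy of another one
full_row_dupes = toy.duplicated().any()

# keep=False marks every member of a repeated (country, year) pair
key_dupes = toy.duplicated(subset=["country", "year"], keep=False)

print("Full-row duplicates:", full_row_dupes)
print(toy[key_dupes])
```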

##### Assess and handle Outliers
Now let's check for outliers with visualization using boxplot.

In [None]:
merged_df.plot(kind='box',figsize=(15, 6));
plt.xlabel('Columns')  
plt.ylabel('Values') 
plt.grid(True, alpha=0.2)
plt.minorticks_on()
plt.title('Box Plot of the Dataset')
plt.tick_params(axis='x', rotation=70) 
plt.show()

> Some columns have outlier values; let's visualize the ones that do.

In [None]:
outliers_cols = ['Generosity', 'family','standard_error', 'trust_government_corruption']

In [None]:
fig, axes = plt.subplots(2, 2, figsize=(18, 10))
axes_flat = axes.ravel() 

for i, col in enumerate(outliers_cols):
  ax = axes_flat[i] 
  merged_df[col].plot(kind='box', ax=ax)
  ax.set_ylabel('Values')
  ax.set_title(col + ' Boxplot')

plt.show()

> This visualization shows that the outliers in these columns are high values, except for the family column, whose outliers are low.

Let's now check which regions have countries with extreme (**outlier**) happiness scores.
Let's show those regions and countries, if there are any. (Requirement 1)

In [None]:
merged_df['happiness_score'].plot(kind='box');

> The happiness score column has no outliers.
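
The box plot check can also be done numerically. By default, matplotlib draws as outliers the points lying more than 1.5 times the interquartile range (IQR) beyond the quartiles. The sketch below applies the same rule to a made-up frame and lists the regions and countries that would be flagged; `iqr_outliers` and `toy` are hypothetical names introduced for this illustration.

```python
import pandas as pd

def iqr_outliers(series, k=1.5):
    """Return a boolean mask of values beyond k * IQR from the quartiles."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)

# Hypothetical data with one clearly extreme score
toy = pd.DataFrame({
    "country": list("ABCDEF"),
    "region": ["X", "X", "X", "Y", "Y", "Y"],
    "happiness_score": [5.0, 5.2, 5.1, 4.9, 5.3, 9.9],
})

mask = iqr_outliers(toy["happiness_score"])
print(toy.loc[mask, ["region", "country", "happiness_score"]])
```

An empty result from this filter corresponds to a box plot without any points drawn beyond the whiskers.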

> The outliers were handled using

##### Assess and handle Missing Values
This is the final step of cleaning the dataset: detecting and handling the missing values.

In [None]:
merged_df.isna().sum().sort_values()

> The following columns have missing values:

In [None]:
missing_vals_columns = merged_df.columns[merged_df.isna().any()]
missing_vals_columns

> For the trust_government_corruption column, let's drop the row with the missing value, since there is only one.

> The missing regions will be imputed from the other rows of the dataset, since a country always belongs to the same region.

In [None]:
## Extracting the region, country pairs
country_to_region = {}

for index, row in merged_df.iterrows():
    country = row['country']
    region = row['region']
    if pd.notna(region):
        if country not in country_to_region:
            country_to_region[country] = region

print(country_to_region)

In [None]:
merged_df['region'] = merged_df.apply(lambda row: country_to_region.get(row['country'], row['region']), axis=1)
print("Missing regions after imputation: ", merged_df['region'].isna().sum())
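
The same imputation could also be written without looping over the rows. This is a sketch of an alternative on a made-up frame (`toy` and `lookup` are hypothetical), not the approach used above: it builds a country-to-region lookup from the rows where the region is known and fills only the missing entries.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: later rows lack the region, country C never has one
toy = pd.DataFrame({
    "country": ["A", "B", "A", "B", "C"],
    "region": ["X", "Y", np.nan, np.nan, np.nan],
})

# First known region per country
lookup = (
    toy.dropna(subset=["region"])
       .drop_duplicates("country")
       .set_index("country")["region"]
)

# Fill only the missing regions from the lookup
toy["region"] = toy["region"].fillna(toy["country"].map(lookup))
print(toy)
```

Countries that never appear with a region, like C here, stay missing and have to be handled separately.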

> After imputing the region, 8 rows are still missing a value in that column, so let's drop them.
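
Dropping the rows described above could look like the sketch below. It runs on a made-up frame (`toy` is hypothetical) and assumes that only rows missing a region or a trust_government_corruption value should be removed.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing region and one missing trust value
toy = pd.DataFrame({
    "country": ["A", "B", "C"],
    "region": ["X", np.nan, "Y"],
    "trust_government_corruption": [0.1, 0.2, np.nan],
})

# Drop rows that miss a value in either column, then rebuild the index
cleaned = (
    toy.dropna(subset=["region", "trust_government_corruption"])
       .reset_index(drop=True)
)
print(cleaned)
```

Restricting `dropna` with `subset` keeps rows whose only gaps are in columns that will be dropped anyway.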

> The following columns are dropped because they have many missing values and are not related to the task and its requirements.

In [None]:
to_drop_col= ['dystopia_residual', 'lower_confidence_interval', 'upper_confidence_interval','standard_error']
merged_df.drop(columns=to_drop_col, inplace=True)

> Finally, the number of missing values in the whole dataset is zero:

In [None]:
merged_df.isna().sum().sort_values()

### Exploratory Data Analysis (EDA)

#### Highest and Lowest happiness Scores across all years

Let's show the happiness scores by region, aggregated by the median.

In [None]:
merged_df.groupby('region')['happiness_score'].median().sort_values().plot(kind='barh', color='orange',  figsize=(15, 6))
plt.title('Regions Vs. happiness scores median')
plt.xlabel('Happiness Score')
plt.ylabel('Regions');
plt.grid(True, alpha=0.3)

> The region with the highest median happiness score across all years is Australia and New Zealand.

> The region with the lowest median happiness score across all years is Sub-Saharan Africa.

#### Global Happiness Change Over the Years

Let's look into how the global happiness score changes over the years.

In [None]:
merged_df.groupby('year')['happiness_score'].median().plot(kind='line', color='red', figsize=(15, 6))
plt.xticks(merged_df['year'].unique());
plt.title('Global happiness scores over the years')
plt.ylabel('Happiness Score')
plt.xlabel('Years')
plt.grid(True, alpha=0.3)

> The global happiness score changes slowly over the years, with a slight upward trend.

> In 2017, however, it was lower than the year before.
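
The year-to-year change can be read off the plot, but it can also be computed directly. A minimal sketch on made-up numbers (`toy` is hypothetical), using the same median aggregation followed by `diff()`:

```python
import pandas as pd

# Hypothetical scores for three years
toy = pd.DataFrame({
    "year": [2015, 2015, 2016, 2016, 2017, 2017],
    "happiness_score": [5.0, 6.0, 5.5, 6.5, 5.0, 6.0],
})

yearly_median = toy.groupby("year")["happiness_score"].median()

# Difference with the previous year; the first year has no predecessor (NaN)
yearly_change = yearly_median.diff()
print(yearly_change)
```

A negative entry marks a year in which the median dropped compared with the year before.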

#### The years 2020-2022 Global Happiness scores

This raises a question about the global happiness scores of the following years: how will they change?

> Based on my experience and the context of 2020-2022: these were the years when the Covid-19 pandemic was spreading, and there was a lot of suffering around the world and distrust in governments. I would say that global happiness was getting lower in those years. Economies were close to collapse and a lot of people lost their jobs.

- Based on the Data:

> Based on the figure above, which shows the changes over the years 2015-2019, the score follows a slight upward trend. Going by the data alone, the global happiness score would keep rising in later years; 2017 seems to be an exception.

> Let's take a look into the correlation with the happiness scores:

In [None]:
merged_df.corr(numeric_only=True)[['happiness_score']].sort_values(by='happiness_score')

> These factors have a high positive correlation with the target happiness score:
- freedom: it was constrained by quarantines during 2020-2022.
- family
- health_life_expectancy
- economy_gdp_per_capita
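
Picking the strongly correlated factors can be made explicit with a threshold. The sketch below uses a made-up frame (`toy`, including its `noise` column, is hypothetical) and an arbitrary cut-off of 0.5 on the absolute Pearson correlation; both are assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical data: GDP rises linearly with the score, noise does not
toy = pd.DataFrame({
    "happiness_score": [4.0, 5.0, 6.0, 7.0],
    "economy_gdp_per_capita": [0.5, 0.8, 1.1, 1.4],
    "noise": [1.0, -1.0, -1.0, 1.0],
})

# Correlation of every other numeric column with the target
corr = toy.corr(numeric_only=True)["happiness_score"].drop("happiness_score")

# Keep the factors whose absolute correlation reaches the chosen threshold
strong = corr[corr.abs() >= 0.5].sort_values(ascending=False)
print(strong)
```

Correlation only measures linear association, so a high value does not by itself show that a factor causes higher happiness.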

#### Happiness score change by Region

Let's see how the happiness score changes in each region.

In [None]:
grouped_year_region = merged_df.groupby(by=['year', 'region'])['happiness_score'].median()
unstacked_df = grouped_year_region.unstack()

In [None]:
unstacked_df.plot(kind='line', marker='o',  figsize=(15, 6))
plt.xlabel('Year')
plt.ylabel('Median Happiness Score')
plt.title('Comparison of Median Happiness Scores by Region Over the Years')
# One tick per year, taken from the index of the plotted frame
plt.xticks(unstacked_df.index, rotation=45)
plt.grid(True, alpha=0.3)
# Keep the labels that pandas derives from the column names: relabelling with
# merged_df['region'].unique() follows a different order and mislabels the lines
plt.legend(loc='upper left', bbox_to_anchor=(1, 1))
plt.show()

> The regions kept roughly the same levels of happiness scores, which is expected if the world was stable during 2015-2019. The changes over the years were slow and slight.

> Western Europe kept nearly the same happiness score throughout. Sub-Saharan Africa had the lowest happiness scores of all regions.

> Even though most regions saw lower happiness scores in 2017, Central and Eastern Europe had a higher happiness score than the year before.
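
To back these observations with numbers, the change of each region between the first and the last year can be computed from the same year-by-region table. A minimal sketch on made-up data (`toy` is hypothetical):

```python
import pandas as pd

# Hypothetical scores for two regions in the first and the last year
toy = pd.DataFrame({
    "year": [2015, 2015, 2019, 2019],
    "region": ["X", "Y", "X", "Y"],
    "happiness_score": [5.0, 4.0, 5.5, 3.8],
})

# Rows are years, columns are regions
by_year_region = toy.groupby(["year", "region"])["happiness_score"].median().unstack()

# Net change per region, from the largest drop to the largest gain
net_change = (by_year_region.loc[2019] - by_year_region.loc[2015]).sort_values()
print(net_change)
```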

## Insights and Conclusion