# **WORLD HAPPINESS REPORT**


## Data Cleaning & Preprocessing

This notebook outlines the structured approach taken to clean and preprocess the dataset, ensuring it is properly formatted for visualization and further analysis.

🔹 Why is Data Cleaning and Preprocessing Important?
Data cleaning and preprocessing are crucial steps in any data analysis pipeline. Raw data often contains errors, inconsistencies, and missing values, which can lead to misleading insights if left unaddressed. By refining and standardizing the dataset, we enhance the accuracy, reliability, and efficiency of subsequent analyses.

## Objectives

✅ Handling Missing Values – Identifying and addressing gaps in the data to maintain its completeness and integrity.

✅ Removing Duplicates – Eliminating redundant records to ensure data uniqueness and prevent distortions in analysis.

✅ Standardizing Column Names – Renaming columns for consistency and readability across different datasets.

✅ Formatting Date and Time Variables – Converting date-time fields into structured formats suitable for time-based analysis.

✅ Encoding Categorical Variables – Transforming categorical data into numerical formats where required for analysis and modeling.

✅ Data Type Standardization – Ensuring each variable is stored in its appropriate format to optimize computational efficiency.

## Inputs

* Initial data sources: 2020.csv, 2021.csv, 2022.csv, 2023.csv, 2024.csv.


## Outputs

* Cleaned and processed data: Combined_Cleaned.csv

## Additional Comments

* To ensure the data is accurate and suitable for analysis, preprocessing is crucial. In this notebook it involves filling missing values, checking for duplicate records, and standardizing column names and data types, thus improving the reliability and analytical value of the dataset.

# Change working directory

We need to change the working directory from its current folder to its parent folder
* We access the current directory with os.getcwd()

In [82]:
import os
current_dir = os.getcwd()
current_dir

'c:\\Users'

We want to make the parent of the current directory the new current directory
* os.path.dirname() gets the parent directory
* os.chdir() defines the new current directory

In [83]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [84]:
current_dir = os.getcwd()
current_dir

'c:\\'
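The same parent-directory switch can also be written with `pathlib`, which makes it easier to build data paths relative to the project folder instead of hard-coding absolute ones. This is only a sketch: the `Data` folder name is taken from the paths used later in this notebook, and `start_dir`, `parent_dir`, and `data_dir` are illustrative names.

```python
import os
from pathlib import Path

# Remember where we started so the example can restore it afterwards
start_dir = Path.cwd()

# Path.parent plays the role of os.path.dirname()
parent_dir = start_dir.parent
os.chdir(parent_dir)

# Build a path relative to the new working directory (folder name is an assumption)
data_dir = Path.cwd() / "Data"
print(f"Working directory: {Path.cwd()}")
print(f"Data folder would be: {data_dir}")

# Restore the original working directory
os.chdir(start_dir)
```

With paths built this way, a call such as `pd.read_csv(data_dir / "2024.csv")` would work on any machine where the folder layout is the same.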

* 1.1 Import Data

The data is loaded from CSV files into Pandas DataFrames, because the Pandas library provides helpful methods for loading, cleaning, and transforming data.

In [85]:
''' Importing necessary libraries
    1) Pandas: for data manipulation and analysis
    2) NumPy: for numerical operations
'''
import pandas as pd  
import numpy as np

After importing these libraries, each raw dataset is loaded with the read_csv() function.

* 1.2 Understanding Data

Each CSV file is checked with the following methods.

The info() method prints the data types and non-null counts for each column.

The head() method displays the first five rows for inspection.

* 1.3 Handling Missing & Incorrect Values

The isnull() function identifies missing (null or NaN) values in a dataset; chaining .sum() counts them per column.
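As a small illustration of what info(), head(), and isnull() report, the sketch below runs them on a made-up three-row frame. The values are invented for the example and are not taken from the report.

```python
import numpy as np
import pandas as pd

# Toy frame with one missing value in each numeric column (illustrative values only)
toy = pd.DataFrame({
    "Country name": ["A", "B", "C"],
    "Happiness score": [7.1, np.nan, 5.4],
    "Generosity": [0.2, 0.1, np.nan],
})

toy.info()                            # dtypes and non-null counts per column
print(toy.head())                     # first rows (up to five)
missing_counts = toy.isnull().sum()   # number of NaN values per column
print(missing_counts)
```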


In [86]:
df_2024 = pd.read_csv(r"C:\Users\balla\OneDrive\Documents\CapStoneProject_2025\Data\2024.csv")
print(df_2024.info())
print(df_2024.isnull().sum())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 143 entries, 0 to 142
Data columns (total 11 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   Country name                  143 non-null    object 
 1   Happiness Rank                143 non-null    int64  
 2   Happiness score               143 non-null    float64
 3   Upperwhisker                  143 non-null    float64
 4   Lowerwhisker                  143 non-null    float64
 5   Economy (GDP per Capita)	     140 non-null    float64
 6   Social support                140 non-null    float64
 7   Healthy life expectancy       140 non-null    float64
 8   Freedom to make life choices  140 non-null    float64
 9   Generosity                    140 non-null    float64
 10  Perceptions of corruption     140 non-null    float64
dtypes: float64(9), int64(1), object(1)
memory usage: 12.4+ KB
None
Country name                    0
Happiness Rank            

In [87]:
df_2023 = pd.read_csv(r"C:\Users\balla\OneDrive\Documents\CapStoneProject_2025\Data\2023.csv")

print(df_2023.info())
print(df_2023.isnull().sum())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 137 entries, 0 to 136
Data columns (total 11 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   Country name                  137 non-null    object 
 1   Happiness Rank                137 non-null    int64  
 2   Happiness score               137 non-null    float64
 3   Upperwhisker                  137 non-null    float64
 4   Lowerwhisker                  137 non-null    float64
 5   Economy (GDP per Capita)	     137 non-null    float64
 6   Social support                137 non-null    float64
 7   Healthy life expectancy       136 non-null    float64
 8   Freedom to make life choices  137 non-null    float64
 9   Generosity                    137 non-null    float64
 10  Perceptions of corruption     137 non-null    float64
dtypes: float64(9), int64(1), object(1)
memory usage: 11.9+ KB
None
Country name                    0
Happiness Rank            

In [88]:
df_2022 = pd.read_csv(r"C:\Users\balla\OneDrive\Documents\CapStoneProject_2025\Data\2022.csv")

print(df_2022.info())
print(df_2022.isnull().sum())


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 146 entries, 0 to 145
Data columns (total 11 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   Country name                  146 non-null    object 
 1   Happiness Rank                146 non-null    int64  
 2   Happiness score               146 non-null    float64
 3   Upperwhisker                  146 non-null    float64
 4   Lowerwhisker                  146 non-null    float64
 5   Economy (GDP per Capita)	     146 non-null    float64
 6   Social support                146 non-null    float64
 7   Healthy life expectancy       146 non-null    float64
 8   Freedom to make life choices  146 non-null    float64
 9   Generosity                    146 non-null    float64
 10  Perceptions of corruption     146 non-null    float64
dtypes: float64(9), int64(1), object(1)
memory usage: 12.7+ KB
None
Country name                    0
Happiness Rank            

In [89]:
df_2021 = pd.read_csv(r"C:\Users\balla\OneDrive\Documents\CapStoneProject_2025\Data\2021.csv")
print(df_2021.info())
print(df_2021.isnull().sum())


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 149 entries, 0 to 148
Data columns (total 11 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   Country name                  149 non-null    object 
 1   Happiness Rank                149 non-null    int64  
 2   Happiness score               149 non-null    float64
 3   Upperwhisker                  149 non-null    float64
 4   Lowerwhisker                  149 non-null    float64
 5   Economy (GDP per Capita)	     149 non-null    float64
 6   Social support                149 non-null    float64
 7   Healthy life expectancy       149 non-null    float64
 8   Freedom to make life choices  149 non-null    float64
 9   Generosity                    149 non-null    float64
 10  Perceptions of corruption     149 non-null    float64
dtypes: float64(9), int64(1), object(1)
memory usage: 12.9+ KB
None
Country name                    0
Happiness Rank            

In [90]:
df_2020 = pd.read_csv(r"C:\Users\balla\OneDrive\Documents\CapStoneProject_2025\Data\2020.csv")
print(df_2020.info())
print(df_2020.isnull().sum())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 153 entries, 0 to 152
Data columns (total 11 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   Country name                  153 non-null    object 
 1   Happiness Rank                153 non-null    int64  
 2   Happiness score               153 non-null    float64
 3   Upperwhisker                  153 non-null    float64
 4   Lowerwhisker                  153 non-null    float64
 5   Economy (GDP per Capita)	     153 non-null    float64
 6   Social support                153 non-null    float64
 7   Healthy life expectancy       153 non-null    float64
 8   Freedom to make life choices  153 non-null    float64
 9   Generosity                    153 non-null    float64
 10  Perceptions of corruption     153 non-null    float64
dtypes: float64(9), int64(1), object(1)
memory usage: 13.3+ KB
None
Country name                    0
Happiness Rank            
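The five per-file checks above repeat the same lines with a different file name. One possible way to condense them is a small helper called in a loop. The sketch below writes two tiny stand-in CSV files to a temporary folder so that it runs anywhere; `missing_summary` is a hypothetical helper name and the file contents are invented.

```python
import tempfile
from pathlib import Path

import pandas as pd

def missing_summary(csv_path):
    """Return (row_count, per-column NaN counts) for one CSV file."""
    frame = pd.read_csv(csv_path)
    return len(frame), frame.isnull().sum()

# Stand-in files so the sketch runs anywhere; replace with the real yearly CSVs
with tempfile.TemporaryDirectory() as tmp:
    tmp_dir = Path(tmp)
    (tmp_dir / "2023.csv").write_text("Country name,Generosity\nA,0.1\nB,\n")
    (tmp_dir / "2024.csv").write_text("Country name,Generosity\nA,0.2\nB,0.3\n")

    summaries = {}
    for csv_path in sorted(tmp_dir.glob("*.csv")):
        rows, missing = missing_summary(csv_path)
        summaries[csv_path.stem] = (rows, int(missing.sum()))
        print(f"{csv_path.name}: {rows} rows, {int(missing.sum())} missing values")
```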

In [91]:
import os
import pandas as pd

In [92]:
file_paths = [
    r"C:\Users\balla\OneDrive\Documents\CapStoneProject_2025\Data\2020.csv",
    r"C:\Users\balla\OneDrive\Documents\CapStoneProject_2025\Data\2021.csv",
    r"C:\Users\balla\OneDrive\Documents\CapStoneProject_2025\Data\2022.csv",
    r"C:\Users\balla\OneDrive\Documents\CapStoneProject_2025\Data\2023.csv",
    r"C:\Users\balla\OneDrive\Documents\CapStoneProject_2025\Data\2024.csv"
]

In [93]:
df_list = []  # create empty list to store dataframes
for file in file_paths:
    try:
        df = pd.read_csv(file)
        year = os.path.basename(file).split('.')[0]  # extract year from filename (example "2020.csv" → "2020")
        df["Year"] = int(year)  # Add new 'Year' column
        df_list.append(df)  # Add DataFrame to the list
        print(f"✅ {file} successfully loaded! Rows-Columns: {df.shape}")
    except Exception as e:
        print(f"❌ Error occurred:{file} could not be loaded! Error: {e}")

# If no files were loaded, report it and skip the combine/save steps
if len(df_list) == 0:
    print("❌ No files were loaded! Please check the file paths and formats.")
else:
    df_combined = pd.concat(df_list, ignore_index=True)
    print("✅ All files successfully combined!")

    # File saving path
    output_path = r"C:\Users\balla\OneDrive\Documents\CapStoneProject_2025\Data\combined_20202024.csv"
    df_combined.to_csv(output_path, index=False)
    print(f"✅ Combined file successfully saved:{output_path}")

✅ C:\Users\balla\OneDrive\Documents\CapStoneProject_2025\Data\2020.csv successfully loaded! Rows-Columns: (153, 12)
✅ C:\Users\balla\OneDrive\Documents\CapStoneProject_2025\Data\2021.csv successfully loaded! Rows-Columns: (149, 12)
✅ C:\Users\balla\OneDrive\Documents\CapStoneProject_2025\Data\2022.csv successfully loaded! Rows-Columns: (146, 12)
✅ C:\Users\balla\OneDrive\Documents\CapStoneProject_2025\Data\2023.csv successfully loaded! Rows-Columns: (137, 12)
✅ C:\Users\balla\OneDrive\Documents\CapStoneProject_2025\Data\2024.csv successfully loaded! Rows-Columns: (143, 12)
✅ All files successfully combined!
✅ Combined file successfully saved:C:\Users\balla\OneDrive\Documents\CapStoneProject_2025\Data\combined_20202024.csv
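A quick consistency check after concatenation is that the combined row count equals the sum of the per-file counts (153 + 149 + 146 + 137 + 143 = 728 here) and that each year contributes its own number of rows. The sketch below uses stand-in frames of the reported sizes rather than the real files; `rows_per_year` simply restates the shapes printed by the loading loop.

```python
import pandas as pd

# Row counts reported by the loading loop above
rows_per_year = {2020: 153, 2021: 149, 2022: 146, 2023: 137, 2024: 143}

# Stand-in frames of the same sizes (only the 'Year' column matters for this check)
frames = [pd.DataFrame({"Year": [year] * n}) for year, n in rows_per_year.items()]
combined = pd.concat(frames, ignore_index=True)

# The combined frame should contain exactly the sum of the parts
assert len(combined) == sum(rows_per_year.values())

# Each year should contribute its own row count
counts = combined["Year"].value_counts().sort_index()
print(counts.to_dict())
```

On the real data, the same `value_counts()` call on the `Year` column of the combined frame would serve as the check.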


In [94]:
df_combined_20202024 = pd.read_csv(r"C:\Users\balla\OneDrive\Documents\CapStoneProject_2025\Data\combined_20202024.csv")
print(df_combined_20202024.info())
print(df_combined_20202024.isnull().sum())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 728 entries, 0 to 727
Data columns (total 12 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   Country name                  728 non-null    object 
 1   Happiness Rank                728 non-null    int64  
 2   Happiness score               728 non-null    float64
 3   Upperwhisker                  728 non-null    float64
 4   Lowerwhisker                  728 non-null    float64
 5   Economy (GDP per Capita)	     725 non-null    float64
 6   Social support                725 non-null    float64
 7   Healthy life expectancy       724 non-null    float64
 8   Freedom to make life choices  725 non-null    float64
 9   Generosity                    725 non-null    float64
 10  Perceptions of corruption     725 non-null    float64
 11  Year                          728 non-null    int64  
dtypes: float64(9), int64(2), object(1)
memory usage: 68.4+ KB
None
C

In [95]:
import pandas as pd
data = pd.read_csv(r"C:\Users\balla\OneDrive\Documents\CapStoneProject_2025\Data\combined_20202024.csv")
numeric_columns = data.select_dtypes(include=['float64', 'int64']).columns
data[numeric_columns] = data[numeric_columns].fillna(data[numeric_columns].mean())
print(data.isnull().sum())

Country name                    0
Happiness Rank                  0
Happiness score                 0
Upperwhisker                    0
Lowerwhisker                    0
Economy (GDP per Capita)\t      0
Social support                  0
Healthy life expectancy         0
Freedom to make life choices    0
Generosity                      0
Perceptions of corruption       0
Year                            0
dtype: int64
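The cell above fills each gap with the column mean taken over all five years. Because the missing values sit in specific yearly files, one alternative is to fill each gap with the mean of its own year. The sketch below compares the two approaches on invented numbers; whether the per-year variant suits this dataset better is an assumption, not something the notebook establishes.

```python
import numpy as np
import pandas as pd

# Invented values: one gap in each year
toy = pd.DataFrame({
    "Year": [2023, 2023, 2023, 2024, 2024, 2024],
    "Generosity": [0.10, 0.30, np.nan, 0.50, np.nan, 0.70],
})

# Global mean imputation, as in the cell above
global_fill = toy["Generosity"].fillna(toy["Generosity"].mean())

# Per-year mean imputation: each NaN gets the mean of its own year
year_means = toy.groupby("Year")["Generosity"].transform("mean")
per_year_fill = toy["Generosity"].fillna(year_means)

print(pd.DataFrame({"global": global_fill, "per_year": per_year_fill}))
```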


In [96]:
print(df_combined_20202024.columns)

Index(['Country name', 'Happiness Rank', 'Happiness score', 'Upperwhisker',
       'Lowerwhisker', 'Economy (GDP per Capita)\t', 'Social support',
       'Healthy life expectancy', 'Freedom to make life choices', 'Generosity',
       'Perceptions of corruption', 'Year'],
      dtype='object')


In [110]:
df_combined_20202024.columns = df_combined_20202024.columns.str.replace(r"[\t\n]", "", regex=True)  # remove tabs and newlines
df_combined_20202024.columns = df_combined_20202024.columns.str.strip()  # remove any leading/trailing spaces/tabs
df_combined_20202024.columns = df_combined_20202024.columns.str.replace(" ", "_")  # replace spaces with "_"
df_combined_20202024.columns = df_combined_20202024.columns.str.replace(r"[\(\)]", "", regex=True)  # remove parentheses

print(df_combined_20202024.columns)

Index(['Country_name', 'Happiness_Rank', 'Happiness_score', 'Upperwhisker',
       'Lowerwhisker', 'Economy_GDP_per_Capita', 'Social_support',
       'Healthy_life_expectancy', 'Freedom_to_make_life_choices', 'Generosity',
       'Perceptions_of_corruption', 'Year'],
      dtype='object')
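The renaming steps can be gathered into one helper so that the same rule is applied wherever the data is reloaded. This is a sketch: `clean_column_names` is a hypothetical name, and the rule it applies (trim whitespace, drop parentheses, turn spaces into underscores) mirrors the intent of the cell above.

```python
import pandas as pd

def clean_column_names(frame):
    """Return a copy of `frame` with trimmed, parenthesis-free, underscore-separated column names."""
    cleaned = (
        frame.columns
        .str.strip()                             # drop leading/trailing spaces, tabs, newlines
        .str.replace(r"[()]", "", regex=True)    # drop parentheses
        .str.replace(r"\s+", "_", regex=True)    # whitespace runs become one underscore
    )
    result = frame.copy()
    result.columns = cleaned
    return result

# Empty frame carrying three of the raw headers, including the one with a trailing tab
raw = pd.DataFrame(columns=["Country name", "Economy (GDP per Capita)\t", "Year"])
cleaned_names = list(clean_column_names(raw).columns)
print(cleaned_names)
```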


In [98]:
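# Find duplicate rows in the combined data
# (`df` only holds the last file read in the loading loop, so check `data` instead)
duplicates = data[data.duplicated()]
print(duplicates)  # Displays duplicate rows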


Empty DataFrame
Columns: [Country name, Happiness Rank, Happiness score, Upperwhisker, Lowerwhisker, Economy (GDP per Capita)	, Social support, Healthy life expectancy, Freedom to make life choices, Generosity, Perceptions of corruption, Year]
Index: []
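`duplicated()` with no arguments flags only rows that match in every column. For a country-by-year table it can also be useful to check the key columns alone, since a country repeated within one year with slightly different values would otherwise go unnoticed. The sketch below uses invented rows; the column names follow the combined file.

```python
import pandas as pd

# Invented rows: country A appears twice for 2023 with different scores
toy = pd.DataFrame({
    "Country name": ["A", "B", "A", "A"],
    "Year": [2023, 2023, 2024, 2023],
    "Happiness score": [7.0, 6.0, 7.2, 7.1],
})

# No row is identical in every column, so the full-row check finds nothing
full_row_dupes = toy[toy.duplicated()]

# Checking the (country, year) key alone reveals the repeated entry for A in 2023
key_dupes = toy[toy.duplicated(subset=["Country name", "Year"], keep=False)]

print(len(full_row_dupes), len(key_dupes))

# drop_duplicates keeps the first occurrence of each key by default
deduplicated = toy.drop_duplicates(subset=["Country name", "Year"])
```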


In [103]:
# Fill missing values in numerical columns with the mean.
# `data` still carries the raw CSV headers, including the trailing tab in the GDP column.
mean_fill_columns = [
    'Economy (GDP per Capita)\t',
    'Social support',
    'Healthy life expectancy',
    'Freedom to make life choices',
    'Generosity',
    'Perceptions of corruption',
]
for column in mean_fill_columns:
    data[column] = data[column].fillna(data[column].mean())

# Fill missing values in categorical columns with the mode
data['Country name'] = data['Country name'].fillna(data['Country name'].mode()[0])

# Drop rows with missing values in 'Year' column
data = data.dropna(subset=['Year'])

print(data.isnull().sum())

In [None]:
# Convert the 'Year' column to datetime
df_combined_20202024['Year'] = pd.to_datetime(df_combined_20202024['Year'], format='%Y')

# Check the dtype of the 'Year' column
print(df_combined_20202024['Year'].dtype)


datetime64[ns]
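`pd.to_datetime` with `format='%Y'` maps each year to midnight on 1 January of that year, and the year itself stays available through the `.dt` accessor. A small sketch on three invented values:

```python
import pandas as pd

years = pd.Series([2020, 2021, 2024], name="Year")

# Convert via strings so the '%Y' format is applied to text such as "2020"
as_datetime = pd.to_datetime(years.astype(str), format="%Y")

print(as_datetime.dtype)
print(as_datetime.dt.year.tolist())
```

Note that a CSV file stores these dates as text, so the column has to be parsed again (for example with `parse_dates=['Year']` in `read_csv`) when the file is reloaded.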


In [None]:
# Save new csv file
data.to_csv(r"C:\Users\balla\OneDrive\Documents\CapStoneProject_2025\Data\Combined_Cleaned.csv", index=False)

print("✅ file successfully saved!")


✅ file successfully saved!


In [None]:
import pandas as pd


df = pd.read_csv(r'C:\Users\balla\OneDrive\Documents\CapStoneProject_2025\Data\Combined_Cleaned.csv')

# Trim stray whitespace (the GDP header ends with a tab), drop parentheses,
# and replace spaces with underscores
df.columns = (
    df.columns.str.strip()
    .str.replace(r'[()]', '', regex=True)
    .str.replace(' ', '_')
)


df.to_csv(r'C:\Users\balla\OneDrive\Documents\CapStoneProject_2025\Data\Combined_Cleaned.csv', index=False)

print(df.columns)


Index(['Country_name', 'Happiness_Rank', 'Happiness_score', 'Upperwhisker',
       'Lowerwhisker', 'Economy_GDP_per_Capita', 'Social_support',
       'Healthy_life_expectancy', 'Freedom_to_make_life_choices', 'Generosity',
       'Perceptions_of_corruption', 'Year'],
      dtype='object')


---
