This dataset contains country-level data from the **World Happiness Report** (https://worldhappiness.report/data-sharing/) and is used to explain the key factors that contribute to a nation's happiness. Each variable represents a component that contributes to a country's average **Happiness Score** (the `life evaluation` column, on a 0 to 10 scale), which is the target variable in this analysis.

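### Dataset Features:

The columns used here are the report's `Explained by:` values: each factor's estimated contribution to the score, in score points. The descriptions and ranges below refer to the underlying indicators, not to the values stored in these columns.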

- **Log GDP per capita**  
  The natural logarithm of gross domestic product per person, adjusted for Purchasing Power Parity (PPP) and expressed in constant 2017 international dollars. This reflects the average standard of living in each country.  
  *Range (GDP per capita before taking the log): ~0 to 80,000 international dollars*

- **Social Support**  
  Measures the percentage of people who report having someone they can count on in times of trouble (e.g., family, friends).  
  *Range: 0% to 100%*

- **Healthy Life Expectancy**  
  Average number of years a person can expect to live in good health, based on WHO estimates.  
  *Unit: years*

- **Freedom to Make Life Choices**  
  Indicates the percentage of people who feel they have freedom in making life decisions.  
  *Range: 0% to 100%*

- **Generosity**  
  Based on the share of people who reported donating to charity in the past month, used as a proxy for altruism. In the report, this share is adjusted for GDP per capita.  
  *Range: 0% to 70%*

- **Perceptions of Corruption**  
  An index based on survey responses to questions about the perceived level of corruption in government and business. If government data is unavailable, only business perception is used.  
  *Range: 0% to 100%*

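- **Dystopia + Residual**  
  Dystopia is a hypothetical country with the world's lowest national averages for each of the six factors above; it serves as a common baseline. The residual is the part of a country's score that the six factors do not explain.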
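
As a quick sanity check on how these columns relate to the score, the six `Explained by:` contributions plus `Dystopia + residual` should add up to the life evaluation. The sketch below checks this for the 2024 rows for Finland and Denmark as they appear in this dataset. The names `rows` and `tolerance` are illustrative, and the tolerance is an assumption based on the values being rounded to three decimals.

```python
# Values copied from the 2024 rows for Finland and Denmark in this dataset.
# Component order: GDP, social support, healthy life expectancy, freedom,
# generosity, perceptions of corruption, dystopia + residual.
rows = {
    "Finland": (7.736, [1.749, 1.783, 0.824, 0.986, 0.110, 0.502, 1.782]),
    "Denmark": (7.521, [1.825, 1.748, 0.820, 0.955, 0.150, 0.488, 1.535]),
}

tolerance = 0.005  # assumed allowance for three-decimal rounding

for country, (score, parts) in rows.items():
    total = sum(parts)
    matches = abs(total - score) < tolerance
    print(f"{country}: components sum to {total:.3f}, score is {score:.3f}, match: {matches}")
```

Because the components sum to the target, `Dystopia + residual` carries the part of the score that the six factors leave unexplained.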


In [1]:
import pandas as pd

### Load the World Happiness Report dataset

In [None]:
happiness_df = pd.read_csv("/Users/admin/Desktop/Projects/HappyLens_NN/data/World Happiness Report.csv")

In [3]:
happiness_df

Unnamed: 0,Year,Rank,Country name,life evaluation,upperwhisker,lowerwhisker,Explained by: Log GDP per capita,Explained by: Social support,Explained by: Healthy life expectancy,Explained by: Freedom to make life choices,Explained by: Generosity,Explained by: Perceptions of corruption,Dystopia + residual
0,2024,1,Finland,7736,7810,7662,1749,1783,0824,0986,0110,0502,1782
1,2024,2,Denmark,7521,7611,7431,1825,1748,0820,0955,0150,0488,1535
2,2024,3,Iceland,7515,7606,7425,1799,1840,0873,0971,0201,0173,1659
3,2024,4,Sweden,7345,7427,7262,1783,1698,0889,0952,0170,0467,1385
4,2024,5,Netherlands,7306,7372,7240,1822,1667,0844,0860,0186,0344,1583
...,...,...,...,...,...,...,...,...,...,...,...,...,...
1964,2011,152,Burundi,3678,,,,,,,,,
1965,2011,153,Sierra Leone,3586,,,,,,,,,
1966,2011,154,Central African Republic,3568,,,,,,,,,
1967,2011,155,Benin,3493,,,,,,,,,


### Display dataset information

In [4]:
happiness_df.info()

print(f"Dataset shape: {happiness_df.shape}")


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1969 entries, 0 to 1968
Data columns (total 13 columns):
 #   Column                                      Non-Null Count  Dtype 
---  ------                                      --------------  ----- 
 0   Year                                        1969 non-null   int64 
 1   Rank                                        1969 non-null   int64 
 2   Country name                                1969 non-null   object
 3   life evaluation                             1969 non-null   object
 4   upperwhisker                                875 non-null    object
 5   lowerwhisker                                875 non-null    object
 6   Explained by: Log GDP per capita            872 non-null    object
 7   Explained by: Social support                872 non-null    object
 8   Explained by: Healthy life expectancy       870 non-null    object
 9   Explained by: Freedom to make life choices  871 non-null    object
 10  Explained by: Generosity

### Convert numeric columns from object to float

In [5]:
for col in happiness_df.columns:
    if col not in ['Country name', 'Year']:
        happiness_df[col] = happiness_df[col].astype(str).str.replace(',', '.').str.strip()
        happiness_df[col] = pd.to_numeric(happiness_df[col], errors='coerce')  # Convert errors to NaN

# Check updated data types
print(happiness_df.dtypes)


Year                                            int64
Rank                                            int64
Country name                                   object
life evaluation                               float64
upperwhisker                                  float64
lowerwhisker                                  float64
Explained by: Log GDP per capita              float64
Explained by: Social support                  float64
Explained by: Healthy life expectancy         float64
Explained by: Freedom to make life choices    float64
Explained by: Generosity                      float64
Explained by: Perceptions of corruption       float64
Dystopia + residual                           float64
dtype: object
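
To make the conversion step easier to follow, here is a minimal sketch of the same replace-then-parse pattern on a small, hypothetical sample. The strings in `raw` are made up to mimic decimal commas, stray whitespace, and a non-numeric placeholder; they are not taken from the file.

```python
import pandas as pd

# Hypothetical sample: decimal commas, padding, a placeholder, and a missing value.
raw = pd.Series(["7,736", " 1,749 ", "n/a", None])

# Same pattern as the loop above: swap the decimal comma for a dot, trim, then parse.
cleaned = raw.astype(str).str.replace(",", ".", regex=False).str.strip()
numeric = pd.to_numeric(cleaned, errors="coerce")  # unparseable entries become NaN

print(numeric)
```

With `errors="coerce"`, anything that cannot be parsed is silently turned into `NaN`, so it is worth comparing the missing-value counts before and after the conversion.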


### Rename columns for clarity

In [6]:
happiness_df = happiness_df.rename(columns={
    'Country name': 'Country',
    'Explained by: Log GDP per capita': 'GDP',
    'Explained by: Social support': 'SocialSupport',
    'Explained by: Healthy life expectancy': 'LifeExpectancy',
    'Explained by: Freedom to make life choices': 'Freedom',
    'Explained by: Generosity': 'Generosity',
    'Explained by: Perceptions of corruption': 'Corruption',
    'life evaluation': 'HappinessScore'
})
happiness_df

Unnamed: 0,Year,Rank,Country,HappinessScore,upperwhisker,lowerwhisker,GDP,SocialSupport,LifeExpectancy,Freedom,Generosity,Corruption,Dystopia + residual
0,2024,1,Finland,7.736,7.810,7.662,1.749,1.783,0.824,0.986,0.110,0.502,1.782
1,2024,2,Denmark,7.521,7.611,7.431,1.825,1.748,0.820,0.955,0.150,0.488,1.535
2,2024,3,Iceland,7.515,7.606,7.425,1.799,1.840,0.873,0.971,0.201,0.173,1.659
3,2024,4,Sweden,7.345,7.427,7.262,1.783,1.698,0.889,0.952,0.170,0.467,1.385
4,2024,5,Netherlands,7.306,7.372,7.240,1.822,1.667,0.844,0.860,0.186,0.344,1.583
...,...,...,...,...,...,...,...,...,...,...,...,...,...
1964,2011,152,Burundi,3.678,,,,,,,,,
1965,2011,153,Sierra Leone,3.586,,,,,,,,,
1966,2011,154,Central African Republic,3.568,,,,,,,,,
1967,2011,155,Benin,3.493,,,,,,,,,


### Check for duplicate rows

In [7]:
print(f"Number of duplicated rows: {happiness_df.duplicated().sum()}")

Number of duplicated rows: 0
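
`duplicated()` with no arguments only flags rows that match in every column. For country-year data it can also be useful to check whether any (Country, Year) pair occurs more than once. The sketch below uses a small hypothetical frame (`sample`) to show the difference; it assumes the renamed `Country` and `Year` columns.

```python
import pandas as pd

# Hypothetical frame: the second Finland/2024 row differs only in its score,
# so a full-row comparison does not flag it.
sample = pd.DataFrame({
    "Country": ["Finland", "Finland", "Denmark"],
    "Year": [2024, 2024, 2024],
    "HappinessScore": [7.736, 7.741, 7.521],
})

full_row_dupes = sample.duplicated().sum()
key_dupes = sample.duplicated(subset=["Country", "Year"]).sum()

print(f"Full-row duplicates: {full_row_dupes}")
print(f"Duplicate (Country, Year) pairs: {key_dupes}")
```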


### Check for outliers in the HappinessScore

In [8]:
# A notebook cell only displays its last expression, so both checks are combined into one filter
happiness_df[(happiness_df["HappinessScore"] < 0) | (happiness_df["HappinessScore"] > 10)]  # Scores outside the 0-10 range

Unnamed: 0,Year,Rank,Country,HappinessScore,upperwhisker,lowerwhisker,GDP,SocialSupport,LifeExpectancy,Freedom,Generosity,Corruption,Dystopia + residual


### Check for missing values

In [9]:
print(happiness_df.isnull().sum())

Year                      0
Rank                      0
Country                   0
HappinessScore            0
upperwhisker           1094
lowerwhisker           1094
GDP                    1097
SocialSupport          1097
LifeExpectancy         1099
Freedom                1098
Generosity             1097
Corruption             1098
Dystopia + residual    1101
dtype: int64
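
Column totals do not show where the gaps are. A per-year breakdown, sketched below on a small hypothetical frame, can reveal whether the missing values are concentrated in particular report years. The frame `sample` and its values are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: older rows lack the factor columns entirely.
sample = pd.DataFrame({
    "Year": [2024, 2024, 2011, 2011],
    "HappinessScore": [7.736, 7.521, 3.678, 3.586],
    "GDP": [1.749, 1.825, np.nan, np.nan],
    "Freedom": [0.986, np.nan, np.nan, np.nan],
})

# Count missing values per column within each year.
missing_by_year = sample.drop(columns="Year").isna().groupby(sample["Year"]).sum()
print(missing_by_year)
```

If most of the gaps fall in a block of years, that affects how much any neighbour-based imputation can be trusted for those rows.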


### Fill missing values with a local mean

In [10]:
def fillna_with_local_mean(df, columns, window=10):
    """
    Fill missing values (NaN) in the specified columns with the mean of the
    non-missing values in a window of neighbouring rows.

    Rows are processed top to bottom and filled in place, so a value filled
    earlier can contribute to the mean used for a later row. Assumes the
    default RangeIndex (labels 0..len(df)-1).

    Parameters:
    df (pd.DataFrame): Input DataFrame with missing values.
    columns (list): List of column names to fill missing values in.
    window (int): Number of rows above and below to consider (default is 10).

    Returns:
    pd.DataFrame: Copy of the DataFrame with missing values filled.
    """
    df = df.copy()
    for col in columns:
        for i in range(len(df)):
            if pd.isna(df.loc[i, col]):
                # Window boundaries; .loc slicing includes the end label,
                # so the slice reaches up to window + 1 rows below row i
                start = max(0, i - window)
                end = min(len(df), i + window + 1)

                # Extract window values and drop NaN
                window_vals = df.loc[start:end, col].dropna()

                if len(window_vals) > 0:
                    df.loc[i, col] = window_vals.mean()
    return df

# Identify columns with missing values
columns_with_na = happiness_df.columns[happiness_df.isnull().any()].tolist()

# Fill missing values
happiness_df = fillna_with_local_mean(happiness_df, columns_with_na)
happiness_df

Unnamed: 0,Year,Rank,Country,HappinessScore,upperwhisker,lowerwhisker,GDP,SocialSupport,LifeExpectancy,Freedom,Generosity,Corruption,Dystopia + residual
0,2024,1,Finland,7.736,7.810000,7.6620,1.7490,1.783000,0.8240,0.986000,0.110000,0.502000,1.782000
1,2024,2,Denmark,7.521,7.611000,7.4310,1.8250,1.748000,0.8200,0.955000,0.150000,0.488000,1.535000
2,2024,3,Iceland,7.515,7.606000,7.4250,1.7990,1.840000,0.8730,0.971000,0.201000,0.173000,1.659000
3,2024,4,Sweden,7.345,7.427000,7.2620,1.7830,1.698000,0.8890,0.952000,0.170000,0.467000,1.385000
4,2024,5,Netherlands,7.306,7.372000,7.2400,1.8220,1.667000,0.8440,0.860000,0.186000,0.344000,1.583000
...,...,...,...,...,...,...,...,...,...,...,...,...,...
1964,2011,152,Burundi,3.678,3.295182,3.0316,0.3674,0.627745,0.3348,0.299345,0.182982,0.132473,1.218473
1965,2011,153,Sierra Leone,3.586,3.295182,3.0316,0.3674,0.627745,0.3348,0.299345,0.182982,0.132473,1.218473
1966,2011,154,Central African Republic,3.568,3.295182,3.0316,0.3674,0.627745,0.3348,0.299345,0.182982,0.132473,1.218473
1967,2011,155,Benin,3.493,3.295182,3.0316,0.3674,0.627745,0.3348,0.299345,0.182982,0.132473,1.218473
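
The identical filled values in the last rows follow from the sequential fill: once a run of missing rows is longer than the window, each new value is an average of previously filled ones. For comparison, here is a vectorized variant built on `Series.rolling`. It is not the function above: it averages only the original values, so gaps longer than the window stay `NaN` instead of being propagated. The name `fillna_with_rolling_mean` and the sample series are hypothetical.

```python
import numpy as np
import pandas as pd

def fillna_with_rolling_mean(series, window=10):
    """Fill NaN with the mean of the original values within `window` rows on each side."""
    # A centered window of 2 * window + 1 rows; NaN entries are skipped by mean(),
    # and min_periods=1 requires at least one real value in the window.
    local_mean = series.rolling(2 * window + 1, center=True, min_periods=1).mean()
    return series.fillna(local_mean)

s = pd.Series([1.0, np.nan, 3.0, np.nan, np.nan, np.nan, 8.0])
filled = fillna_with_rolling_mean(s, window=1)
print(filled.tolist())  # [1.0, 2.0, 3.0, 3.0, nan, 8.0, 8.0]
```

The remaining `NaN` in the middle of the gap makes it explicit where no nearby observation exists, which the sequential version hides.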


### Verify no missing values remain

In [11]:
print(happiness_df.isnull().sum())

Year                   0
Rank                   0
Country                0
HappinessScore         0
upperwhisker           0
lowerwhisker           0
GDP                    0
SocialSupport          0
LifeExpectancy         0
Freedom                0
Generosity             0
Corruption             0
Dystopia + residual    0
dtype: int64


### Remove rows for Russia

In [12]:
happiness_df = happiness_df[~happiness_df["Country"].isin(["Russia", "Russian Federation"])]  # Remove rows related to Russia

### Drop unnecessary columns

In [14]:
happiness_df = happiness_df.drop(columns=['upperwhisker', 'lowerwhisker', 'Dystopia + residual'])  # Drop unnecessary columns
happiness_df

Unnamed: 0,Year,Rank,Country,HappinessScore,GDP,SocialSupport,LifeExpectancy,Freedom,Generosity,Corruption
0,2024,1,Finland,7.736,1.7490,1.783000,0.8240,0.986000,0.110000,0.502000
1,2024,2,Denmark,7.521,1.8250,1.748000,0.8200,0.955000,0.150000,0.488000
2,2024,3,Iceland,7.515,1.7990,1.840000,0.8730,0.971000,0.201000,0.173000
3,2024,4,Sweden,7.345,1.7830,1.698000,0.8890,0.952000,0.170000,0.467000
4,2024,5,Netherlands,7.306,1.8220,1.667000,0.8440,0.860000,0.186000,0.344000
...,...,...,...,...,...,...,...,...,...,...
1964,2011,152,Burundi,3.678,0.3674,0.627745,0.3348,0.299345,0.182982,0.132473
1965,2011,153,Sierra Leone,3.586,0.3674,0.627745,0.3348,0.299345,0.182982,0.132473
1966,2011,154,Central African Republic,3.568,0.3674,0.627745,0.3348,0.299345,0.182982,0.132473
1967,2011,155,Benin,3.493,0.3674,0.627745,0.3348,0.299345,0.182982,0.132473


### Save the cleaned dataset

In [None]:
happiness_df.to_csv("/Users/admin/Desktop/Projects/HappyLens_NN/data/happiness_data.csv", index=False)

print("File successfully saved!")

File successfully saved!
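
A simple way to confirm the export is to read the file back and compare it with what was written. The sketch below does this in a temporary directory with a small hypothetical frame (`cleaned`), so it does not touch the real output path. With `index=False` the row index is not written, so no extra `Unnamed: 0` column should appear on reload.

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for the cleaned DataFrame.
cleaned = pd.DataFrame({
    "Year": [2024, 2024],
    "Country": ["Finland", "Denmark"],
    "HappinessScore": [7.736, 7.521],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "happiness_data.csv")
    cleaned.to_csv(path, index=False)  # same option as the save step above
    reloaded = pd.read_csv(path)

print(list(reloaded.columns))
print(f"Rows written: {len(cleaned)}, rows read back: {len(reloaded)}")
```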
