1. **Load the Dataset:**
   - The code sets the file path to the dataset (`2020.csv`) and loads it with `pd.read_csv()`. If parsing fails, the `pd.errors.ParserError` is reported and re-raised.

2. **Check for Missing or Invalid Values:**
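   - It checks for missing (NaN) values using `.isnull().values.any()`. This check does not detect values that are present but invalid (for example, out-of-range numbers), and when missing values are found the code only prints a message; it does not drop or impute them.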

3. **Define Columns to Keep:**
   - Based on the analysis plan, it defines the specific columns to keep, including:
     - "Logged GDP per capita" for the economic factor,
     - "Perceptions of corruption" for perceived government corruption,
     - "Healthy life expectancy" for life expectancy,
     - "Social support" for the social factor,
     - "Country name" for country identification,
     - "Ladder score", which corresponds to the Happiness Score.

4. **Filter the DataFrame:**
   - Filters the DataFrame to retain only the desired columns using DataFrame indexing (`df[columns_to_keep]`).

5. **Save Filtered Data:**
   - Saves the filtered DataFrame to a new CSV file named "2020_data.csv" using the `.to_csv()` method with `index=False`, so the row index is not written to the file.

6. **Print Confirmation:**
   - Prints a message confirming that the filtered data was saved, along with the output path.

7. **Output:**
   - The final dataset "2020_data.csv" now contains the selected columns for further analysis.
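
The missing-value check in step 2 yields only a single True/False for the whole dataset. As a minimal sketch of how the check could be made more informative, the example below counts missing values per column and then drops incomplete rows. The `sample` frame and its values are made up for illustration and are not taken from 2020.csv; dropping rows is just one possible policy, not necessarily the one this analysis should use.

```python
import numpy as np
import pandas as pd

# Tiny illustrative frame; the values are invented, not from 2020.csv.
sample = pd.DataFrame({
    "Country name": ["A", "B", "C"],
    "Ladder score": [7.1, np.nan, 5.4],
    "Social support": [0.9, 0.8, np.nan],
})

# Count missing values per column instead of a single True/False.
missing_per_column = sample.isnull().sum()
print(missing_per_column)

# One possible policy (an assumption, not the notebook's choice):
# drop every row that has at least one missing value.
cleaned = sample.dropna()
print(len(sample), "rows before,", len(cleaned), "rows after dropna()")
```

Imputation (for example with `fillna()`) would be the alternative when dropping rows loses too much data.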


In [1]:
import pandas as pd

# Load the dataset
file_path = "..\\..\\resources\\2020.csv"  # Update with your file path

# Load the data; report and re-raise any CSV parsing error
try:
    df = pd.read_csv(file_path)
except pd.errors.ParserError as e:
    print("Error while parsing the CSV file:", str(e))
    raise

# Check for missing (NaN) values; isnull() does not flag values that are present but invalid
if df.isnull().values.any():
    print("The dataset contains missing values.")
    # Optionally, you can drop rows with missing values or impute them.

# Define the columns you want to keep based on the provided options
columns_to_keep = [
    "Logged GDP per capita",
    "Perceptions of corruption",
    "Healthy life expectancy",
    "Social support",
    "Country name",
    "Ladder score"  # Happiness Score
]

# Filter the DataFrame to keep only the desired columns
filtered_df = df[columns_to_keep]

# Save the filtered DataFrame to a new CSV file (as 2020_data.csv)
output_file_path = "..\\data\\2020_data.csv"
filtered_df.to_csv(output_file_path, index=False)

print("Filtered data saved to", output_file_path)


Filtered data saved to ..\data\2020_data.csv
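
The cell above assumes that every name in `columns_to_keep` exists in the CSV; a misspelled or renamed column would raise a `KeyError` at the filtering step. The sketch below shows one way to check the column names up front and to confirm the saved file by reading it back. It uses a small made-up frame, a shortened column list, and a temporary directory instead of the real `..\data` path, all of which are assumptions for the sake of a self-contained example.

```python
import os
import tempfile

import pandas as pd

# Shortened column list and invented data, for illustration only.
columns_to_keep = ["Country name", "Ladder score"]
df = pd.DataFrame({
    "Country name": ["A", "B"],
    "Ladder score": [7.1, 6.3],
    "Extra column": [1, 2],
})

# Fail early with a clear message if any requested column is absent.
missing_columns = [c for c in columns_to_keep if c not in df.columns]
if missing_columns:
    raise KeyError(f"Columns not found in dataset: {missing_columns}")

filtered_df = df[columns_to_keep]

# Write to a temporary directory, then read the file back to confirm its shape.
with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, "2020_data.csv")
    filtered_df.to_csv(out_path, index=False)
    reloaded = pd.read_csv(out_path)

print(list(reloaded.columns))
print(reloaded.shape)
```

Building the path with `os.path.join` also avoids hard-coding Windows-style backslashes, which keeps the example portable across operating systems.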
