# 01 Data Ingestion and Quality Checks

## Objectives

- Load the raw dataset from the versioned folder
- Perform basic data quality checks and standardization
- Export a cleaned dataset to the processed folder

## Inputs

- data/raw/v1/environmental_trends.csv

## Outputs

- data/processed/v1/environmental_trends_clean.csv

## Additional Comments

- Keep all changes reproducible and logged in the notebook

## Notebook layout

- Section 1: Setup and load raw data
- Section 2: Data quality checks
- Section 3: Save cleaned output

## Business context

This notebook supports all downstream analysis and the Streamlit dashboard by ensuring the data foundation is reliable and reproducible.

## Purpose and Context

This notebook is the first step in our Global Environmental Trends analysis pipeline. It ensures our data foundation is solid before we draw any conclusions or make predictions.

The notebook connects to the project guidelines in three areas:

- Ethics: transparent data quality checks prevent misleading conclusions that could affect public understanding.
- Communication: clear documentation helps both technical reviewers and non-technical stakeholders understand our process.
- Project planning: a clean, versioned dataset enables reproducible analysis and future updates.

This approach is responsible because we:

- document all quality issues, not just the ones we fix
- preserve the raw data unchanged by separating raw from processed files
- version outputs so anyone can reproduce our results
- keep the cleaning process visible, with no hidden transformations

---

## Section 1 - Setup and load raw data

This section sets the project root as the working directory and loads the raw CSV for inspection.

### Change working directory

Notebooks are stored in the `jupyter_notebooks/` subfolder. This cell checks the current working directory and navigates to the project root (`global_env_trend/`) if needed, ensuring relative paths to `data/` work correctly.

In [17]:
import os
current_dir = os.getcwd()
current_dir

'c:\\Users\\sergi\\OneDrive\\Documents\\Code Institute Data analytics\\Capstone project 3\\Global_environmental_trends_2000_2024\\global_env_trend'

In [18]:
# Change to project root if we're in jupyter_notebooks folder
if current_dir.endswith('jupyter_notebooks'):
    os.chdir(os.path.dirname(current_dir))
    print("Changed working directory from jupyter_notebooks to project root")
elif os.path.exists('jupyter_notebooks'):
    print("Already at project root")
else:
    # Navigate up until we find the project root (contains data folder)
    while not os.path.exists('data') and os.path.dirname(os.getcwd()) != os.getcwd():
        os.chdir('..')
    if os.path.exists('data'):
        print("Found project root")
    else:
        print("Warning: Could not find project root with data folder")

Already at project root


In [19]:
os.getcwd()

'c:\\Users\\sergi\\OneDrive\\Documents\\Code Institute Data analytics\\Capstone project 3\\Global_environmental_trends_2000_2024\\global_env_trend'
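As an alternative to changing the working directory, the project root can be located with `pathlib` and used to build absolute paths. The sketch below is only an illustration: `find_project_root` is a hypothetical helper, and it assumes, as the cell above does, that the project root is the nearest folder containing a `data` directory.

```python
from pathlib import Path
from typing import Optional


def find_project_root(start: Path, marker: str = "data") -> Optional[Path]:
    """Return the nearest directory at or above `start` that contains `marker`.

    `marker` is an assumption: this project keeps a `data/` folder at its root.
    """
    start = start.resolve()
    # Check the starting directory first, then each parent up to the filesystem root.
    for candidate in [start, *start.parents]:
        if (candidate / marker).is_dir():
            return candidate
    return None


root = find_project_root(Path.cwd())
if root is None:
    print("Warning: could not find a project root containing a data folder")
else:
    print(f"Project root: {root}")
```

With a root in hand, a path such as `root / "data" / "raw" / "v1" / "environmental_trends.csv"` works regardless of where the notebook was launched, and the process-wide working directory is left untouched.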

### Load raw data

In [20]:
import pandas as pd
raw_path = "data/raw/v1/environmental_trends.csv"
df = pd.read_csv(raw_path)
df.head()

| | Year | Country | Avg_Temperature_degC | CO2_Emissions_tons_per_capita | Sea_Level_Rise_mm | Rainfall_mm | Population | Renewable_Energy_pct | Extreme_Weather_Events | Forest_Area_pct |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2000 | United States | 13.5 | 20.2 | 0 | 715 | 282500000 | 6.2 | 38 | 33.1 |
| 1 | 2000 | China | 12.8 | 2.7 | 0 | 645 | 1267000000 | 16.5 | 24 | 18.8 |
| 2 | 2000 | Germany | 9.3 | 10.1 | 0 | 700 | 82200000 | 6.6 | 12 | 31.8 |
| 3 | 2000 | Brazil | 24.9 | 1.9 | 0 | 1760 | 175000000 | 83.7 | 18 | 65.4 |
| 4 | 2000 | Australia | 21.7 | 17.2 | 0 | 534 | 19200000 | 8.8 | 11 | 16.2 |


**What this code does:**

Loads the raw CSV from the versioned folder and previews the first five rows so we can verify column names and sample values before any cleaning. Data types are checked in Section 2.
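The visual check of column names can also be made explicit. The following is a minimal sketch, not part of the original notebook: `EXPECTED_COLUMNS` lists the ten columns this dataset contains, and `missing_columns` is a hypothetical helper that reports any that are absent after loading.

```python
import pandas as pd

# Column names as they appear in the raw dataset.
EXPECTED_COLUMNS = [
    "Year",
    "Country",
    "Avg_Temperature_degC",
    "CO2_Emissions_tons_per_capita",
    "Sea_Level_Rise_mm",
    "Rainfall_mm",
    "Population",
    "Renewable_Energy_pct",
    "Extreme_Weather_Events",
    "Forest_Area_pct",
]


def missing_columns(frame: pd.DataFrame, expected: list) -> list:
    """Return the expected column names that are not present in `frame`."""
    return [name for name in expected if name not in frame.columns]


# Small illustrative frame (not the real data) that lacks a Population column.
sample = pd.DataFrame({"Year": [2000], "Country": ["Brazil"]})
print(missing_columns(sample, ["Year", "Country", "Population"]))  # ['Population']
```

In the notebook this could be called as `missing_columns(df, EXPECTED_COLUMNS)` right after loading; an empty list means the file has the expected schema.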

## Section 2 - Data quality checks

We assess missing values, duplicates, and data types to validate reliability before analysis.

### Quality checks

**What we're checking and why:**

Data quality is the foundation of reliable analysis. Before we can trust any insights or predictions, we need to verify that our data is complete (checking for missing values that could bias results), accurate (reviewing data types to ensure calculations will work correctly), and consistent (identifying duplicates that could skew statistics).

From an ethical perspective, poor data quality can lead to misleading conclusions that affect public understanding of climate issues. By documenting these checks, we ensure transparency and accountability.

In [21]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 156 entries, 0 to 155
Data columns (total 10 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   Year                           156 non-null    int64  
 1   Country                        156 non-null    object 
 2   Avg_Temperature_degC           156 non-null    float64
 3   CO2_Emissions_tons_per_capita  156 non-null    float64
 4   Sea_Level_Rise_mm              156 non-null    int64  
 5   Rainfall_mm                    156 non-null    int64  
 6   Population                     156 non-null    int64  
 7   Renewable_Energy_pct           156 non-null    float64
 8   Extreme_Weather_Events         156 non-null    int64  
 9   Forest_Area_pct                156 non-null    float64
dtypes: float64(4), int64(5), object(1)
memory usage: 12.3+ KB


**What this shows:**

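The `info()` method reports column names and data types (are temperatures stored as numbers?), non-null counts (how many valid values per column?), and memory usage (dataset size). Look for unexpected types, such as numbers stored as text, or excessive missing data. In this dataset, all 10 columns have 156 non-null values. `Country` is the only text (`object`) column, and every measurement column is stored as `int64` or `float64`, as expected.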
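The "numbers stored as text" check can be automated as well. This is a sketch under one assumption: that `Country` is the only column expected to hold text. `non_numeric_columns` is a hypothetical helper name; `is_numeric_dtype` comes from `pandas.api.types`.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def non_numeric_columns(frame: pd.DataFrame, text_columns=("Country",)) -> list:
    """Return columns that should be numeric but are not.

    `text_columns` names the columns allowed to hold text (an assumption
    based on this dataset, where only Country is textual).
    """
    return [
        column
        for column in frame.columns
        if column not in text_columns and not is_numeric_dtype(frame[column])
    ]


# Illustrative frame (not the real data): Rainfall_mm was read in as text.
sample = pd.DataFrame(
    {
        "Country": ["Brazil", "Germany"],
        "Rainfall_mm": ["1760", "700"],
        "Population": [175000000, 82200000],
    }
)
print(non_numeric_columns(sample))  # ['Rainfall_mm']
```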

In [22]:
df.isna().sum().sort_values(ascending=False).head(10)

Year                             0
Country                          0
Avg_Temperature_degC             0
CO2_Emissions_tons_per_capita    0
Sea_Level_Rise_mm                0
Rainfall_mm                      0
Population                       0
Renewable_Energy_pct             0
Extreme_Weather_Events           0
Forest_Area_pct                  0
dtype: int64

**Interpreting missing values:**

A missing rate below 5% of rows is usually acceptable and may indicate incomplete reporting for specific countries or years. A rate above 20% requires caution, so consider whether the variable can still be used reliably. Also watch for patterns where missing data is concentrated in certain countries or years, as this may introduce bias.

In this dataset every column has zero missing values, so no imputation or row removal is needed. Had any column shown significant missing data, the action would be to document it and note it as a limitation in the final dashboard.
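Because the thresholds above are percentages while `isna().sum()` returns counts, it can help to report both. The sketch below is illustrative only: `missing_report` and `classify_missing` are hypothetical helpers, and the "moderate" label for the 5% to 20% band is an assumption, since the text only names the low and high ends.

```python
import pandas as pd


def classify_missing(pct: float) -> str:
    """Label a missing-value percentage using the 5% and 20% thresholds."""
    if pct > 20:
        return "high"
    if pct >= 5:
        return "moderate"  # assumed label for the band between the two thresholds
    return "low"


def missing_report(frame: pd.DataFrame) -> pd.DataFrame:
    """Return missing counts, percentages, and a level label for each column."""
    report = pd.DataFrame(
        {
            "missing_count": frame.isna().sum(),
            # isna().mean() gives the fraction of missing rows per column.
            "missing_pct": frame.isna().mean().mul(100).round(1),
        }
    )
    report["level"] = report["missing_pct"].map(classify_missing)
    return report.sort_values("missing_pct", ascending=False)


# Illustrative frame (not the real data): half of the rainfall values are missing.
sample = pd.DataFrame(
    {
        "Year": [2000, 2001, 2002, 2003],
        "Rainfall_mm": [715.0, None, 700.0, None],
    }
)
report = missing_report(sample)
print(report)
```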

In [23]:
df.duplicated().sum()

0

**Understanding duplicates:**

Duplicate rows can inflate statistics (making trends appear stronger than they are), create misleading visualizations, and violate assumptions in statistical tests.

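If duplicates exist, we need to investigate whether they're true duplicates (identical entries to remove), valid repeated measurements (keep but document), or data entry errors (needs correction). Note that `df.duplicated()` only flags rows that are identical across every column. The count here is 0, so no rows need to be removed.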
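Full-row comparison does not catch two rows for the same country and year that differ in a measurement. The sketch below shows one way to look for those. It assumes the dataset should hold exactly one row per `Country` and `Year` pair, which the original notebook does not state, and `duplicate_keys` is a hypothetical helper.

```python
import pandas as pd


def duplicate_keys(frame: pd.DataFrame, keys=("Country", "Year")) -> pd.DataFrame:
    """Return every row whose key combination appears more than once."""
    key_list = list(keys)
    # keep=False marks all members of a duplicated group, not just the repeats.
    mask = frame.duplicated(subset=key_list, keep=False)
    return frame.loc[mask].sort_values(key_list)


# Illustrative frame (not the real data): Brazil 2000 appears twice with different rainfall.
sample = pd.DataFrame(
    {
        "Country": ["Brazil", "Brazil", "Germany"],
        "Year": [2000, 2000, 2000],
        "Rainfall_mm": [1760, 1755, 700],
    }
)
print(sample.duplicated().sum())    # 0 full-row duplicates
print(len(duplicate_keys(sample)))  # 2 rows share the same Country and Year
```

Rows returned by this check are candidates for the "data entry error" or "valid repeated measurement" cases described above and need a manual decision before any are dropped.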

## Section 3 - Save cleaned output

We store a versioned cleaned dataset for downstream EDA, hypothesis tests, and modeling.

### Save cleaned data

**Why versioning matters:**

We save the cleaned dataset to data/processed/v1/ for several important reasons. First, reproducibility ensures anyone can trace results back to the exact data we used. Second, version control means future updates go to v2, v3, and so on, preserving history. Third, transparency is maintained by separating raw from processed data, making our workflow clear. Fourth, we follow data governance best practices for accountability and compliance.

The cleaned data becomes the input for all subsequent analysis and modeling.

In [24]:
clean_path = "data/processed/v1/environmental_trends_clean.csv"
df.to_csv(clean_path, index=False)
clean_path

'data/processed/v1/environmental_trends_clean.csv'

**Outcome:**

Because the checks found no missing values and no duplicate rows, the data is written without modification. The file in the processed folder becomes the single source of truth for the rest of the project.
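On a fresh clone the `data/processed/v1/` folder may not exist yet, in which case `to_csv` fails. The sketch below creates the folder first and then reads the file back as a basic check. `save_and_verify` is a hypothetical helper, and comparing only shape and column names is a deliberately light verification, not a full equality test.

```python
from pathlib import Path

import pandas as pd


def save_and_verify(frame: pd.DataFrame, path) -> bool:
    """Write `frame` to CSV, creating parent folders, and confirm it reloads.

    Returns True when the reloaded file has the same shape and column names.
    """
    path = Path(path)
    # Create data/processed/v1/ (or any other parent folder) if it is missing.
    path.parent.mkdir(parents=True, exist_ok=True)
    frame.to_csv(path, index=False)

    reloaded = pd.read_csv(path)
    same_shape = reloaded.shape == frame.shape
    same_columns = list(reloaded.columns) == list(frame.columns)
    return same_shape and same_columns


# Example call in the notebook (commented out so this sketch writes nothing by itself):
# save_and_verify(df, "data/processed/v1/environmental_trends_clean.csv")
```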