# 01 Data Ingestion and Quality Checks

## Objectives

- Load the raw dataset from the versioned folder
- Perform basic data quality checks and standardization
- Export a cleaned dataset to the processed folder

## Inputs

- data/raw/v1/environmental_trends.csv

## Outputs

- data/processed/v1/environmental_trends_clean.csv

## Additional Comments

- Keep all changes reproducible and logged in the notebook

## Purpose and Context

This notebook is the **first step** in our Global Environmental Trends analysis pipeline. It ensures our data foundation is solid before we draw any conclusions or make predictions.

**Connection to project guidelines:**

- **Ethics (LO1.1)**: Transparent data quality checks prevent misleading conclusions that could affect public understanding
- **Communication (LO2.3)**: Clear documentation helps both technical reviewers and non-technical stakeholders understand our process
- **Project planning (LO3.1)**: Establishing a clean, versioned dataset enables reproducible analysis and future updates

**What makes this approach responsible:**

1. We document *all* quality issues (not just the ones we fix)
2. We preserve the raw data unchanged (separating raw from processed)
3. We version outputs so anyone can reproduce our results
4. We make the cleaning process visible (no hidden transformations)

---

# Change working directory

We store notebooks in a subfolder, so we set the project root as the working directory.

In [1]:
import os
current_dir = os.getcwd()
current_dir

'c:\\Users\\sergi\\OneDrive\\Documents\\Code Institute Data analytics\\Capstone project 3\\Global_environmental_trends_2000_2024\\global_env_trend\\jupyter_notebooks'

In [2]:
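# current_dir is the notebooks folder, so its parent is the project root.
# Re-running the first cell after this one would shift the target up a level.
os.chdir(os.path.dirname(current_dir))
print("Current working directory set to project root")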

Current working directory set to project root


In [3]:
os.getcwd()

'c:\\Users\\sergi\\OneDrive\\Documents\\Code Institute Data analytics\\Capstone project 3\\Global_environmental_trends_2000_2024\\global_env_trend'
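
The cells above assume the notebook is always started from the `jupyter_notebooks` folder. A more defensive variant could search upwards for a known marker instead. The following is a minimal sketch, assuming the project root is the nearest folder that contains a `data` directory; the helper name `find_project_root` and the choice of marker are illustrative and not part of the original notebook.

```python
from pathlib import Path


def find_project_root(start: Path, marker: str = "data") -> Path:
    """Return the nearest directory at or above `start` that contains `marker`.

    Assumption: a folder named `data` identifies the project root.
    """
    start = start.resolve()
    for candidate in (start, *start.parents):
        if (candidate / marker).is_dir():
            return candidate
    raise FileNotFoundError(f"No folder at or above {start} contains '{marker}'")
```

Calling `os.chdir(find_project_root(Path.cwd()))` would then give the same result whether the kernel starts in the notebooks folder or in the project root, and re-running the cell would not move the working directory further up.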

# Load raw data

In [4]:
import pandas as pd
raw_path = "data/raw/v1/environmental_trends.csv"
df = pd.read_csv(raw_path)
df.head()

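|   | Year | Country | Avg_Temperature_degC | CO2_Emissions_tons_per_capita | Sea_Level_Rise_mm | Rainfall_mm | Population | Renewable_Energy_pct | Extreme_Weather_Events | Forest_Area_pct |
|---|------|---------|----------------------|-------------------------------|-------------------|-------------|------------|----------------------|------------------------|-----------------|
| 0 | 2000 | United States | 13.5 | 20.2 | 0 | 715 | 282500000 | 6.2 | 38 | 33.1 |
| 1 | 2000 | China | 12.8 | 2.7 | 0 | 645 | 1267000000 | 16.5 | 24 | 18.8 |
| 2 | 2000 | Germany | 9.3 | 10.1 | 0 | 700 | 82200000 | 6.6 | 12 | 31.8 |
| 3 | 2000 | Brazil | 24.9 | 1.9 | 0 | 1760 | 175000000 | 83.7 | 18 | 65.4 |
| 4 | 2000 | Australia | 21.7 | 17.2 | 0 | 534 | 19200000 | 8.8 | 11 | 16.2 |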


# Quality checks

**What we're checking and why:**

Data quality is the foundation of reliable analysis. Before we can trust any insights or predictions, we need to verify that our data is:

1. **Complete**: Check for missing values that could bias results
2. **Accurate**: Review data types so that numeric columns are stored as numbers and calculations work correctly
3. **Consistent**: Identify duplicates that could skew statistics

**Ethical consideration**: Poor data quality can lead to misleading conclusions that affect public understanding of climate issues. By documenting these checks, we ensure transparency and accountability.

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 156 entries, 0 to 155
Data columns (total 10 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   Year                           156 non-null    int64  
 1   Country                        156 non-null    object 
 2   Avg_Temperature_degC           156 non-null    float64
 3   CO2_Emissions_tons_per_capita  156 non-null    float64
 4   Sea_Level_Rise_mm              156 non-null    int64  
 5   Rainfall_mm                    156 non-null    int64  
 6   Population                     156 non-null    int64  
 7   Renewable_Energy_pct           156 non-null    float64
 8   Extreme_Weather_Events         156 non-null    int64  
 9   Forest_Area_pct                156 non-null    float64
dtypes: float64(4), int64(5), object(1)
memory usage: 12.3+ KB


**What this shows:**

The `.info()` method reveals:
- Column names and data types (are temperatures stored as numbers?)
- Non-null counts (how many valid values per column?)
- Memory usage (dataset size)

Look for unexpected types (e.g., numbers stored as text) or excessive missing data.
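
The visual check of `.info()` can be backed by a small programmatic test. The sketch below is one possible way to do it, not the notebook's original method: the list `EXPECTED_NUMERIC` is an illustrative subset of the columns shown above, the helper `non_numeric_columns` is hypothetical, and the sample frame is made up to show what a failure looks like.

```python
import pandas as pd

# Illustrative subset of the numeric columns listed by df.info(); extend as needed.
EXPECTED_NUMERIC = ["Year", "Avg_Temperature_degC", "Rainfall_mm"]


def non_numeric_columns(frame: pd.DataFrame, expected: list[str]) -> list[str]:
    """Return the expected-numeric columns that are missing or not numeric."""
    return [
        col
        for col in expected
        if col not in frame.columns
        or not pd.api.types.is_numeric_dtype(frame[col])
    ]


# Made-up frame in which Rainfall_mm was read as text.
sample = pd.DataFrame(
    {
        "Year": [2000, 2001],
        "Avg_Temperature_degC": [13.5, 13.7],
        "Rainfall_mm": ["715", "n/a"],
    }
)
problems = non_numeric_columns(sample, EXPECTED_NUMERIC)
print(problems)  # ['Rainfall_mm']
```

Run against the real `df`, an empty list would confirm that the listed columns were parsed as numbers.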

In [6]:
df.isna().sum().sort_values(ascending=False).head(10)

Year                             0
Country                          0
Avg_Temperature_degC             0
CO2_Emissions_tons_per_capita    0
Sea_Level_Rise_mm                0
Rainfall_mm                      0
Population                       0
Renewable_Energy_pct             0
Extreme_Weather_Events           0
Forest_Area_pct                  0
dtype: int64

**Interpreting missing values:**

- **Low share (< 5% of rows)**: Usually acceptable; may indicate incomplete reporting for specific countries or years
- **High share (> 20% of rows)**: Requires caution. Consider whether the variable can still be used reliably
- **Patterns**: Missing data concentrated in certain countries or years may introduce bias

**Action**: Document any columns with significant missing data and note this as a limitation in the final dashboard. In this dataset every column reports 0 missing values, so there is nothing to flag.
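
The thresholds above are percentages, while `df.isna().sum()` returns raw counts. A minimal sketch of a report that converts counts to percentages and labels them follows. The helper `missing_report`, the label names, and the `"review"` label for the band between 5% and 20% are assumptions added for illustration; the sample data is made up.

```python
import numpy as np
import pandas as pd


def missing_report(
    frame: pd.DataFrame, low: float = 5.0, high: float = 20.0
) -> pd.DataFrame:
    """Percentage of missing values per column, labelled against two thresholds."""
    pct = frame.isna().mean() * 100

    def label(value: float) -> str:
        if value == 0:
            return "none"
        if value < low:
            return "low"
        if value > high:
            return "high"
        return "review"  # assumption: the text does not name this middle band

    return pd.DataFrame({"missing_pct": pct, "level": pct.map(label)})


# Made-up frame: 40 rows, 1 missing rainfall value, 12 missing population values.
n = 40
sample = pd.DataFrame(
    {
        "Year": list(range(2000, 2000 + n)),
        "Rainfall_mm": [np.nan] + [700.0] * (n - 1),
        "Population": [np.nan] * 12 + [1.0e6] * (n - 12),
    }
)
report = missing_report(sample)
print(report)
```

To look for the patterns mentioned above, the same calculation can be grouped, for example `df.isna().groupby(df["Country"]).mean()`, which gives the fraction of missing values per country and column.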

In [7]:
df.duplicated().sum()

0

**Understanding duplicates:**

Duplicate rows can:
- Inflate statistics (make trends appear stronger than they are)
- Create misleading visualizations
- Violate assumptions in statistical tests

If duplicates exist, investigate whether they're:
- True duplicates (identical entries to remove)
- Valid repeated measurements (keep but document)
- Data entry errors (needs correction)
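
Note that `df.duplicated()` only flags rows that are identical in every column. A complementary check, sketched below on made-up rows, looks for repeated keys. It assumes each `Country` and `Year` pair should appear at most once, which the sample output suggests but the notebook does not state.

```python
import pandas as pd

# Made-up rows: Brazil/2000 appears twice with different rainfall values.
sample = pd.DataFrame(
    {
        "Year": [2000, 2000, 2000, 2001],
        "Country": ["Brazil", "Brazil", "China", "Brazil"],
        "Rainfall_mm": [1760, 1755, 645, 1748],
    }
)

# Rows identical in every column (what df.duplicated() reports).
exact_duplicates = int(sample.duplicated().sum())

# Rows that share a Country/Year key; keep=False marks every row in a conflicting group.
key_conflicts = sample[sample.duplicated(subset=["Country", "Year"], keep=False)]

print(exact_duplicates)    # 0
print(len(key_conflicts))  # 2
```

Here the exact-duplicate count is 0 even though two rows describe the same country and year, which is the case that would need investigation as a repeated measurement or an entry error.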

# Save cleaned data

**Why versioning matters:**

We save the cleaned dataset to `data/processed/v1/` for several reasons:

1. **Reproducibility**: Anyone can trace results back to the exact data we used
2. **Version control**: Future updates go to v2, v3, etc., preserving history
3. **Transparency**: Separating raw from processed data makes our workflow clear
4. **Compliance**: Follows data governance best practices for accountability

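The cleaned data becomes the input for all subsequent analysis and modeling. Because the checks above found no missing values or duplicate rows, the v1 file is written without modifications to the raw records.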

In [8]:
clean_path = "data/processed/v1/environmental_trends_clean.csv"
df.to_csv(clean_path, index=False)
clean_path

'data/processed/v1/environmental_trends_clean.csv'
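
`to_csv` does not create missing folders, so the save step fails on a fresh clone where `data/processed/v1/` does not exist yet. The sketch below shows one way to guard against that and to confirm the file reads back with the same shape. The helper `save_versioned` is hypothetical, and the demonstration writes to a temporary folder with made-up data so that real project files are not touched.

```python
import tempfile
from pathlib import Path

import pandas as pd


def save_versioned(frame: pd.DataFrame, path: Path) -> pd.DataFrame:
    """Write `frame` to `path`, creating parent folders, and return the file read back."""
    path.parent.mkdir(parents=True, exist_ok=True)
    frame.to_csv(path, index=False)
    return pd.read_csv(path)


# Demonstration in a temporary folder with made-up rows.
sample = pd.DataFrame({"Year": [2000, 2001], "Country": ["Brazil", "China"]})
out_path = Path(tempfile.mkdtemp()) / "processed" / "v1" / "sample_clean.csv"
reloaded = save_versioned(sample, out_path)

# A matching shape is a quick sanity check that nothing was dropped or added on write.
print(reloaded.shape == sample.shape)
```

In the notebook itself, the equivalent call would be `save_versioned(df, Path(clean_path))`.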