In [21]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Load cleaned dataset
df = pd.read_csv("/Users/rasheedmehrinfar/Desktop/springboard/capstone-two/Data/cleaned_co2_data.csv")
df.head()

Unnamed: 0,country,iso_code,year,population,gdp,cement_co2,cement_co2_per_capita,co2,co2_growth_abs,co2_growth_prct,...,share_global_luc_co2,share_global_oil_co2,share_of_temperature_change_from_ghg,temperature_change_from_ch4,temperature_change_from_co2,temperature_change_from_ghg,temperature_change_from_n2o,total_ghg,total_ghg_excluding_lucf,co2_change
0,Afghanistan,AFG,1990,12045664.0,13065980000.0,0.046,0.004,2.024,-0.741,-26.784,...,0.003,0.014,0.094,0.0,0.0,0.001,0.0,13.892,4.218,
1,Afghanistan,AFG,1991,12238879.0,12047360000.0,0.046,0.004,1.914,-0.11,-5.435,...,0.0,0.012,0.092,0.0,0.0,0.001,0.0,14.178,4.207,-0.054348
2,Afghanistan,AFG,1992,13278982.0,12677540000.0,0.046,0.003,1.482,-0.432,-22.58,...,-0.032,0.011,0.09,0.0,0.0,0.001,0.0,12.514,3.853,-0.225705
3,Afghanistan,AFG,1993,14943174.0,9834582000.0,0.047,0.003,1.487,0.005,0.33,...,-0.082,0.011,0.089,0.0,0.0,0.001,0.0,11.804,4.021,0.003374
4,Afghanistan,AFG,1994,16250799.0,7919857000.0,0.047,0.003,1.454,-0.033,-2.227,...,-0.059,0.011,0.087,0.0,0.0,0.001,0.0,12.282,4.159,-0.022192


## Step 1: Drop Unused Columns

In [38]:
df_model = df.copy()
target = 'co2'
# Drop identifier/redundant columns
cols_to_drop = ['iso_code', 'co2_change'] 
df_model = df_model.drop(columns=cols_to_drop)

iso_code is just an identifier, and co2_change is derived from your target (co2), so keeping it would leak target information into the model and distort training.

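This step removes columns that should not be used as predictors. The target itself stays in df_model until Step 3. Note that other columns derived from co2 are still present: in the preview rows, co2_growth_prct equals co2_change × 100 and co2_growth_abs is the year-over-year difference in co2, so they can leak the target in the same way.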
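To see why co2_change counts as leakage, the sketch below rebuilds it from co2 alone. It assumes co2_change is the fractional year-over-year change, which is an inference from the preview rows rather than something the notebook states; the toy series just reuses the first three co2 values shown above.

```python
import pandas as pd

# First three co2 values for Afghanistan, copied from the df.head() preview
co2 = pd.Series([2.024, 1.914, 1.482], name="co2")

# Assumption: co2_change is the fractional change from the previous year
co2_change = co2.pct_change()

# The first value is NaN (no previous year); the rest can be compared
# against the co2_change column in the preview
print(co2_change.round(6).tolist())
```

On the full dataset the same calculation would need to be done per country (for example with a groupby on 'country') so that one country's last year does not feed into the next country's first year.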

## Step 2: Encode Categorical Variables

In [41]:
# Convert country into dummy variables
df_model = pd.get_dummies(df_model, columns=['country'], drop_first=True)

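Most ML models can't handle text or categorical data like 'country' directly; they need numeric inputs.

Dummy encoding turns 'country' into one binary flag column per country.

drop_first=True drops the column for the first category, which avoids perfect multicollinearity among the dummies (also called the dummy variable trap). The dropped country becomes the baseline, represented by all flags being 0.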
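A small illustration of what the encoding produces, using a made-up three-country frame (the toy data and the names `toy` and `encoded` are only for this example):

```python
import pandas as pd

# Hypothetical toy frame with three distinct countries
toy = pd.DataFrame({
    "country": ["Afghanistan", "Brazil", "Chile", "Brazil"],
    "year": [1990, 1990, 1990, 1991],
})

# Same call as in the notebook: one flag column per country,
# minus the first category in sorted order (Afghanistan here)
encoded = pd.get_dummies(toy, columns=["country"], drop_first=True)

print(encoded.columns.tolist())
print(encoded)
```

Rows for Afghanistan have both flags unset, which is how the baseline category is represented.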

## Step 3: Split Features and Target

In [44]:
X = df_model.drop(columns=[target])
y = df_model[target]

This clearly separates your input variables (X) from your target (y), which is what the model will learn to predict.

## Step 4: Feature Scaling

In [47]:
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
X_scaled = pd.DataFrame(X_scaled, columns=X.columns)

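CO₂, GDP, and population have very different scales.
Many models are sensitive to scale, for example distance-based methods and regularized linear models; tree-based models are largely unaffected.

Without scaling, features with large values can dominate model learning. StandardScaler rescales each feature to zero mean and unit variance.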
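As a quick check of what StandardScaler does, the sketch below scales two made-up columns with very different magnitudes (the numbers are illustrative, not taken from the dataset) and prints the per-column mean and standard deviation afterwards.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical values: a population-like column and a small per-capita column
toy_X = np.array([
    [1.2e7, 0.004],
    [1.3e7, 0.003],
    [1.6e7, 0.005],
    [1.5e7, 0.002],
])

toy_scaled = StandardScaler().fit_transform(toy_X)

# After scaling, each column has mean ~0 and (population) std ~1
print(toy_scaled.mean(axis=0))
print(toy_scaled.std(axis=0))
```

Both columns end up on the same scale, so neither dominates purely because of its units.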

## Step 5: Train-Test Split

In [50]:
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.2, random_state=42
)

# Confirm shapes
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((5848, 278), (1462, 278), (5848,), (1462,))

Separates data into:

- Training set: what the model learns from
- Test set: what we use to evaluate how well the model performs on unseen data

The split does not prevent overfitting by itself, but it lets us detect it: evaluating on data the model has not seen gives a fairer estimate of how the model will perform in real-world use.
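One caveat about the order of Steps 4 and 5: the scaler is fitted on the full dataset before the split, so the test rows contribute to the means and standard deviations used for scaling. A common alternative is to split first and fit the scaler on the training rows only. The sketch below shows that ordering on synthetic data; the `*_demo` names and the generated values are placeholders for illustration, not the notebook's actual variables.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for X and y (hypothetical values)
rng = np.random.default_rng(42)
X_demo = pd.DataFrame({
    "gdp": rng.normal(1e10, 2e9, 100),
    "population": rng.normal(1e7, 1e6, 100),
})
y_demo = pd.Series(rng.normal(2.0, 0.5, 100), name="co2")

# 1) Split the unscaled data first
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# 2) Fit the scaler on the training rows only, then apply it to both sets
scaler_demo = StandardScaler()
X_tr_scaled = pd.DataFrame(
    scaler_demo.fit_transform(X_tr), columns=X_tr.columns, index=X_tr.index
)
X_te_scaled = pd.DataFrame(
    scaler_demo.transform(X_te), columns=X_te.columns, index=X_te.index
)

print(X_tr_scaled.shape, X_te_scaled.shape)
```

Passing `index=` keeps the scaled frames aligned with `y_tr` and `y_te`, whose indices are shuffled by the split.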