# Power Plants Regression

In [1]:
import pandas as pd

# Load dataset
# (file copied into this repo as `usina_with_outliers.csv`)
df = pd.read_csv('usina_with_outliers.csv')

df.head()

Unnamed: 0,AT,V,AP,RH,PE
0,14.96,41.76,1024.07,73.17,463.26
1,25.18,62.96,1020.04,59.08,444.37
2,5.11,39.4,1012.16,92.14,488.56
3,20.86,57.32,1010.24,76.64,446.48
4,10.82,37.5,1009.23,96.62,473.9


### Model Choice: Linear Regression (OLS)
- I chose OLS for this dataset because regularized models such as Lasso shrink the coefficients, which can dampen the effect individual points have on the fit. This notebook's analysis relies on Cook's distance, which measures how much the fitted model changes when a single data point is removed, so the outliers need to influence the fit rather than be smoothed over. OLS is sensitive to influential observations, which lets Cook's distance identify the points that disproportionately affect the fitted model.
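To make the definition concrete, here is a minimal pure-Python sketch of Cook's distance for a one-feature OLS fit, using the standard closed form based on residuals and leverage. The toy data and the helper name `cooks_distance_simple` are hypothetical and are not part of this notebook's pipeline, which uses Statsmodels instead.

```python
# Minimal sketch: Cook's distance for the fit y = b0 + b1 * x, in pure Python.
# The toy data below is hypothetical and not taken from the power plant dataset.

def cooks_distance_simple(x, y):
    """Return Cook's distance for each point of a one-feature OLS fit."""
    n = len(x)
    p = 2  # fitted parameters: intercept and slope
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_mean - b1 * x_mean

    residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    s2 = sum(e ** 2 for e in residuals) / (n - p)  # residual variance estimate

    distances = []
    for xi, e in zip(x, residuals):
        h = 1 / n + (xi - x_mean) ** 2 / sxx  # leverage of this point
        distances.append((e ** 2 / (p * s2)) * h / (1 - h) ** 2)
    return distances


x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [2.1, 3.9, 6.2, 8.0, 9.9, 12.1, 14.0, 30.0]  # last point is an injected outlier

d = cooks_distance_simple(x, y)
threshold = 4 / len(x)  # same 4/n rule of thumb used later in this notebook
flagged = [i for i, di in enumerate(d) if di > threshold]
print(flagged)
```

A point gets a large distance only when it combines a large residual with high leverage, which is why the injected outlier at the edge of the `x` range stands out.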

### Library Choice: Statsmodels OLS
- I chose Statsmodels because it is built for statistical diagnostics, whereas scikit-learn is built for prediction. Statsmodels exposes the influence measures that Cook's distance depends on through `get_influence()`.



In [2]:
from sklearn.model_selection import train_test_split

# 70/30 random split (reproducible)
train_df, test_df = train_test_split(df, test_size=0.30, random_state=42)

# Quick sanity-check on sizes
print(f"Total rows: {len(df)}")
print(f"Train rows: {len(train_df)} ({len(train_df)/len(df)*100:.1f}%)")
print(f"Test rows:  {len(test_df)} ({len(test_df)/len(df)*100:.1f}%)")

# Reset indices for clean train/test frames
train_df = train_df.reset_index(drop=True)
test_df = test_df.reset_index(drop=True)

Total rows: 9568
Train rows: 6697 (70.0%)
Test rows:  2871 (30.0%)
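The row counts above can be reproduced by hand. The sketch below assumes that, when only `test_size` is given as a fraction, `train_test_split` rounds the test count up and assigns the remaining rows to train; the helper name `split_sizes` is hypothetical.

```python
import math

def split_sizes(n_rows, test_size):
    """Assumed rounding: test rows rounded up, train rows get the remainder."""
    n_test = math.ceil(test_size * n_rows)
    n_train = n_rows - n_test
    return n_train, n_test

print(split_sizes(9568, 0.30))
```

Under that assumption, 0.30 × 9568 = 2870.4 rounds up to 2871 test rows, leaving 6697 for training, which matches the printed output.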


In [3]:
import statsmodels.api as sm

# Fit OLS on training data: predict PE from AT, V, AP, RH
X = train_df[['AT', 'V', 'AP', 'RH']]
y = train_df['PE']

X_const = sm.add_constant(X)
ols_model = sm.OLS(y, X_const).fit()

# Cook's distance for each training point
influence = ols_model.get_influence()
cooks_d = influence.cooks_distance[0]

# Threshold rule: 4/n
n = len(train_df)
threshold = 4 / n

train_df_with_cooks = train_df.copy()
train_df_with_cooks['cooks_distance'] = cooks_d
train_df_with_cooks['is_outlier'] = train_df_with_cooks['cooks_distance'] > threshold

print(f"n (train) = {n}")
print(f"Cook's distance threshold (4/n) = {threshold:.6f}")
print(f"Outliers detected = {train_df_with_cooks['is_outlier'].sum()}")

# Remove outliers
train_df_no_outliers = train_df_with_cooks.loc[~train_df_with_cooks['is_outlier']].drop(columns=['is_outlier'])

# Save cleaned training data
train_df_no_outliers.to_csv('usina.csv', index=False)
print("Saved cleaned dataset to usina.csv")

n (train) = 6697
Cook's distance threshold (4/n) = 0.000597
Outliers detected = 85
Saved cleaned dataset to usina.csv
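As a quick arithmetic check on the output above, the following sketch recomputes the 4/n cutoff, the share of training rows flagged, and the number of rows that remain in `usina.csv`. It only uses the counts printed by the cell.

```python
n_train = 6697   # training rows, from the output above
n_flagged = 85   # points with Cook's distance above the cutoff

threshold = 4 / n_train
share_flagged = n_flagged / n_train
rows_kept = n_train - n_flagged

print(f"threshold     = {threshold:.6f}")
print(f"share flagged = {share_flagged:.2%}")
print(f"rows kept     = {rows_kept}")
```

The 4/n rule is a common rule of thumb rather than a formal test, so the flagged count depends on the training set size as well as on the data.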


## Model Training & Evaluation — OLS, Ridge, Lasso

Below we train OLS (LinearRegression), Ridge, and Lasso on both `usina_with_outliers.csv` and the cleaned `usina.csv`, and report train/test MSE, MAE, and R².

In [4]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
import pandas as pd

ALPHAS = [0.01, 0.1, 1, 10, 100]


def evaluate_and_display(path):
    # Load and drop any NA rows
    df = pd.read_csv(path).dropna()

    # Features / target
    X = df[['AT', 'V', 'AP', 'RH']]
    y = df['PE']

    # 70/30 split (same random_state used earlier)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=42)

    rows = []

    def record(model_name, model):
        y_pred_train = model.predict(X_train)
        y_pred_test = model.predict(X_test)
        rows.append({
            'Model': model_name,
            'MSE (Train)': mean_squared_error(y_train, y_pred_train),
            'MAE (Train)': mean_absolute_error(y_train, y_pred_train),
            'R2 (Train)': r2_score(y_train, y_pred_train),
            'MSE (Test)': mean_squared_error(y_test, y_pred_test),
            'MAE (Test)': mean_absolute_error(y_test, y_pred_test),
            'R2 (Test)': r2_score(y_test, y_pred_test),
        })

    # OLS (LinearRegression)
    lr = LinearRegression()
    lr.fit(X_train, y_train)
    record('LinearRegression', lr)

    # Ridge for requested alphas
    for a in ALPHAS:
        r = Ridge(alpha=a)
        r.fit(X_train, y_train)
        record(f'Ridge (alpha={a})', r)

    # Lasso for requested alphas
    for a in ALPHAS:
        l = Lasso(alpha=a, max_iter=10000)
        l.fit(X_train, y_train)
        record(f'Lasso (alpha={a})', l)

    results_df = pd.DataFrame(rows)
    # Round for display
    results_df = results_df.round({
        'MSE (Train)': 4, 'MAE (Train)': 4, 'R2 (Train)': 4,
        'MSE (Test)': 4, 'MAE (Test)': 4, 'R2 (Test)': 4
    })

    print('\n' + '='*80)
    print(f"Dataset: {path} — rows={len(df)} — train/test: {len(X_train)}/{len(X_test)}")
    display(results_df)
    return results_df

# Run for both files
results_with_outliers = evaluate_and_display('usina_with_outliers.csv')
results_cleaned = evaluate_and_display('usina.csv')

# Optionally, save results to CSV inside notebook
results_with_outliers.to_csv('usina_with_outliers_metrics.csv', index=False)
results_cleaned.to_csv('usina_metrics.csv', index=False)
print('\nSaved metrics to usina_with_outliers_metrics.csv and usina_metrics.csv')


Dataset: usina_with_outliers.csv — rows=9568 — train/test: 6697/2871


Unnamed: 0,Model,MSE (Train),MAE (Train),R2 (Train),MSE (Test),MAE (Test),R2 (Test)
0,LinearRegression,123.3842,5.1987,0.6502,125.1134,5.0525,0.6426
1,Ridge (alpha=0.01),123.3842,5.1987,0.6502,125.1134,5.0525,0.6426
2,Ridge (alpha=0.1),123.3842,5.1987,0.6502,125.1134,5.0525,0.6426
3,Ridge (alpha=1),123.3842,5.1987,0.6502,125.1134,5.0525,0.6426
4,Ridge (alpha=10),123.3842,5.1987,0.6502,125.1137,5.0525,0.6426
5,Ridge (alpha=100),123.3842,5.1993,0.6502,125.1165,5.053,0.6426
6,Lasso (alpha=0.01),123.3842,5.1989,0.6502,125.1153,5.0526,0.6426
7,Lasso (alpha=0.1),123.3846,5.2017,0.6502,125.1345,5.0553,0.6425
8,Lasso (alpha=1),123.4212,5.2295,0.6501,125.3384,5.0831,0.6419
9,Lasso (alpha=10),126.1415,5.5404,0.6424,128.9021,5.4088,0.6318



Dataset: usina.csv — rows=6612 — train/test: 4628/1984


Unnamed: 0,Model,MSE (Train),MAE (Train),R2 (Train),MSE (Test),MAE (Test),R2 (Test)
0,LinearRegression,19.4535,3.5712,0.9334,20.1667,3.6183,0.9283
1,Ridge (alpha=0.01),19.4535,3.5712,0.9334,20.1667,3.6183,0.9283
2,Ridge (alpha=0.1),19.4535,3.5712,0.9334,20.1667,3.6183,0.9283
3,Ridge (alpha=1),19.4535,3.5712,0.9334,20.1668,3.6183,0.9283
4,Ridge (alpha=10),19.4535,3.5712,0.9334,20.1673,3.6184,0.9283
5,Ridge (alpha=100),19.4537,3.5716,0.9334,20.1727,3.6193,0.9283
6,Lasso (alpha=0.01),19.4535,3.5711,0.9334,20.1669,3.6183,0.9283
7,Lasso (alpha=0.1),19.4541,3.5715,0.9334,20.1768,3.6198,0.9283
8,Lasso (alpha=1),19.5181,3.58,0.9331,20.3322,3.6402,0.9277
9,Lasso (alpha=10),25.0543,4.0404,0.9142,26.5588,4.1813,0.9056



Saved metrics to usina_with_outliers_metrics.csv and usina_metrics.csv


### Do outliers change train error? Test error?
- Train: Across all fitted models, train MAE ranges from 5 to 11 and train MSE from 123 to 234 on the dataset with outliers. Both ranges are higher than on the dataset without outliers, where MAE ranges from 3 to 9 and MSE from 19 to 131. Similarly, the train R2 values for `usina_with_outliers.csv` (around 0.6) are much lower than those for `usina.csv` (around 0.9), which are far closer to 1.
- Test: The same pattern holds on the test split. Test MAE (5 to 11) and MSE (125 to 233) are much higher on the dataset with outliers than test MAE (3 to 9) and MSE (20 to 127) on the dataset without outliers.

These observations show that the outliers change both train and test errors: errors are higher when the outliers are present in the dataset.
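To put a number on the difference, the sketch below compares the `LinearRegression` rows of the two results tables and reports the relative drop in each error metric after outlier removal. The values are copied from the tables above; the helper name `relative_drop` is hypothetical.

```python
# LinearRegression rows copied from the two results tables above.
with_outliers = {"mse_train": 123.3842, "mae_train": 5.1987,
                 "mse_test": 125.1134, "mae_test": 5.0525}
cleaned = {"mse_train": 19.4535, "mae_train": 3.5712,
           "mse_test": 20.1667, "mae_test": 3.6183}

def relative_drop(before, after):
    """Fraction by which a metric decreased after removing outliers."""
    return (before - after) / before

drops = {key: relative_drop(with_outliers[key], cleaned[key]) for key in with_outliers}
for key, value in drops.items():
    print(f"{key}: {value:.1%}")
```

MSE falls far more than MAE because squaring the residuals gives the largest errors, which the outliers produce, a much heavier weight.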

### Which dataset (with outliers vs without outliers) shows better generalization?
- The dataset without outliers shows better generalization. It has lower MAE and MSE and higher R2 on both splits, so its absolute error is smaller than that of the dataset with outliers. Its train and test metrics are also close to each other (MSE ~20, MAE ~3.6, R2 ~0.9), which suggests the model is neither underfitting nor overfitting. The train and test metrics are similarly close for the dataset with outliers, but its absolute error is much larger, which is why we conclude that the dataset without outliers generalizes better.
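The "train and test metrics are similar" part of this argument can be checked directly by computing the train/test gap for the `LinearRegression` rows. This is a small sketch using values copied from the tables above; the dictionary layout is only for illustration.

```python
# LinearRegression rows copied from the two results tables above.
rows = {
    "with outliers": {"mse_train": 123.3842, "mse_test": 125.1134,
                      "r2_train": 0.6502, "r2_test": 0.6426},
    "without outliers": {"mse_train": 19.4535, "mse_test": 20.1667,
                         "r2_train": 0.9334, "r2_test": 0.9283},
}

gaps = {}
for name, m in rows.items():
    mse_gap = m["mse_test"] - m["mse_train"]
    gaps[name] = {
        "mse_gap": mse_gap,                            # absolute gap
        "mse_gap_relative": mse_gap / m["mse_train"],  # gap as a share of train MSE
        "r2_gap": m["r2_train"] - m["r2_test"],
    }
    print(f"{name}: MSE gap={mse_gap:.4f} "
          f"({gaps[name]['mse_gap_relative']:.1%} of train), "
          f"R2 gap={gaps[name]['r2_gap']:.4f}")
```

Both gaps are small. The cleaned dataset has the smaller absolute MSE gap but a slightly larger gap relative to its train MSE, so the case for better generalization rests mainly on its much lower absolute error.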

### Do Ridge/Lasso appear to help relative to standard linear regression?
- No. On both datasets, Ridge and Lasso do not meaningfully improve any error metric over standard linear regression. Ridge is essentially identical to OLS for every alpha tested, and Lasso performs progressively worse as alpha increases. Regularization mainly helps when a model is overfitting, and as the previous answer shows, our OLS model is not overfitting. Ridge and Lasso therefore do not appear to help relative to standard linear regression.
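One way to see why Ridge barely moves while Lasso degrades at large alpha is to look at the one-feature closed forms. The sketch below assumes scikit-learn's documented objectives as I understand them: Ridge minimizes the residual sum of squares plus `alpha * w**2`, while Lasso minimizes the residual sum of squares divided by `2n` plus `alpha * |w|`. The data is synthetic and hypothetical, and the helper names are made up for this illustration, so the numbers will not match the tables above.

```python
import random

# Hypothetical single-feature data; not drawn from the power plant dataset.
rng = random.Random(0)
n = 5000
x = [rng.gauss(0, 7) for _ in range(n)]
y = [-2.0 * xi + rng.gauss(0, 4) for xi in x]

# Center both variables so the intercept drops out of the closed forms.
x_mean, y_mean = sum(x) / n, sum(y) / n
xc = [xi - x_mean for xi in x]
yc = [yi - y_mean for yi in y]
sxx = sum(v * v for v in xc)
sxy = sum(a * b for a, b in zip(xc, yc))

def ridge_slope(alpha):
    # Minimizer of sum((y - w*x)^2) + alpha * w^2.
    return sxy / (sxx + alpha)

def lasso_slope(alpha):
    # Minimizer of sum((y - w*x)^2) / (2n) + alpha * |w| (soft-thresholding).
    rho = sxy / n
    shrunk = max(abs(rho) - alpha, 0.0)
    if shrunk == 0.0:
        return 0.0
    return (1.0 if rho > 0 else -1.0) * shrunk / (sxx / n)

ols_slope = sxy / sxx
print(f"OLS slope: {ols_slope:.4f}")
for alpha in [0.01, 0.1, 1, 10, 100]:
    print(f"alpha={alpha:>6}: ridge={ridge_slope(alpha):.4f}  lasso={lasso_slope(alpha):.4f}")
```

With thousands of rows, the Ridge penalty is tiny next to the sum of squares it is added to, so the slope hardly changes. The Lasso penalty is compared against a per-sample average instead, so the same alpha shrinks the slope much more and can push it to zero.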