### Power Plant Regression

In [None]:
import pandas as pd

# Load dataset
# (file copied into this repo as `usina_with_outliers.csv`)
df = pd.read_csv('usina_with_outliers.csv')

df.head()

### 1.1 Model Choice: Linear Regression (OLS)
- I chose OLS for this dataset rather than a regularized model such as Lasso regression. Lasso's L1 penalty shrinks the coefficients, which can mask the effect that individual observations have on the fit. For this notebook's analysis to be relevant to Cook's distance (which measures how much the fitted model changes when a single data point is removed), the outliers need to stay in the analysis rather than be smoothed over. OLS is sensitive to influential observations, which lets Cook's distance identify the points that disproportionately affect the fitted model.
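
To make the "remove one point and refit" idea concrete, here is a minimal sketch in plain Python for a one-predictor regression. The data and the helper names (`fit_line`, `cooks_distance_loo`) are made up for illustration and are not part of this notebook's pipeline, which uses Statsmodels on the full dataset.

```python
# Illustrative only: Cook's distance for simple linear regression,
# computed by literally refitting the line without each point in turn.

def fit_line(xs, ys):
    """Return (intercept, slope) of the least-squares line through the points."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return my - slope * mx, slope

def cooks_distance_loo(xs, ys):
    """Cook's distance for every point via leave-one-out refits."""
    n, p = len(xs), 2  # p = number of fitted parameters (intercept + slope)
    b0, b1 = fit_line(xs, ys)
    fitted = [b0 + b1 * x for x in xs]
    # Mean squared error of the full model
    mse = sum((y - f) ** 2 for y, f in zip(ys, fitted)) / (n - p)
    distances = []
    for i in range(n):
        # Refit without point i
        a0, a1 = fit_line(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        # Squared change in all n fitted values caused by dropping point i
        shift = sum((f - (a0 + a1 * x)) ** 2 for x, f in zip(xs, fitted))
        distances.append(shift / (p * mse))
    return distances

xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.1, 3.9, 6.2, 8.1, 9.8, 20.0]  # the last point is a deliberate outlier
d = cooks_distance_loo(xs, ys)
most_influential = max(range(len(d)), key=d.__getitem__)
print(most_influential, [round(v, 3) for v in d])
```

The outlying last point combines a large residual with high leverage, so it receives the largest distance. Refitting once per point is only practical for small examples like this one.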

### Library Choice: Statsmodels OLS
- I chose Statsmodels because it is built for statistical diagnostics, whereas scikit-learn is built for prediction. Statsmodels exposes the quantities Cook's distance depends on (residuals and leverage) and computes Cook's distance directly through `get_influence()`.
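
For reference, the standard closed form of Cook's distance shows which quantities are involved. The notation is introduced here for illustration and does not appear elsewhere in this notebook: $e_i$ is the residual of observation $i$, $h_{ii}$ its leverage (the $i$-th diagonal entry of the hat matrix), $p$ the number of fitted parameters including the intercept, and $s^2$ the mean squared error of the full model.

$$
D_i = \frac{e_i^2}{p\,s^2} \cdot \frac{h_{ii}}{(1 - h_{ii})^2}
$$

In Statsmodels these pieces are available on a fitted OLS result as `resid`, `mse_resid`, and `get_influence().hat_matrix_diag`. For the model in this notebook, $p = 5$ (four predictors plus the constant).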



In [None]:
from sklearn.model_selection import train_test_split

# 70/30 random split (reproducible)
train_df, test_df = train_test_split(df, test_size=0.30, random_state=42)

# Quick sanity-check on sizes
print(f"Total rows: {len(df)}")
print(f"Train rows: {len(train_df)} ({len(train_df)/len(df)*100:.1f}%)")
print(f"Test rows:  {len(test_df)} ({len(test_df)/len(df)*100:.1f}%)")

# Reset indices for clean train/test frames
train_df = train_df.reset_index(drop=True)
test_df = test_df.reset_index(drop=True)

In [None]:
import statsmodels.api as sm

# Fit OLS on training data: predict PE from AT, V, AP, RH
X = train_df[['AT', 'V', 'AP', 'RH']]
y = train_df['PE']

X_const = sm.add_constant(X)
ols_model = sm.OLS(y, X_const).fit()

# Cook's distance for each training point.
# `cooks_distance` returns a tuple (distances, p-values); keep only the distances.
influence = ols_model.get_influence()
cooks_d = influence.cooks_distance[0]

# Rule-of-thumb threshold: flag points whose Cook's distance exceeds 4/n
n = len(train_df)
threshold = 4 / n

train_df_with_cooks = train_df.copy()
train_df_with_cooks['cooks_distance'] = cooks_d
train_df_with_cooks['is_outlier'] = train_df_with_cooks['cooks_distance'] > threshold

print(f"n (train) = {n}")
print(f"Cook's distance threshold (4/n) = {threshold:.6f}")
print(f"Outliers detected = {train_df_with_cooks['is_outlier'].sum()}")

# Remove outliers
train_df_no_outliers = train_df_with_cooks.loc[~train_df_with_cooks['is_outlier']].drop(columns=['is_outlier'])

# Save cleaned training data
train_df_no_outliers.to_csv('usina.csv', index=False)
print("Saved cleaned dataset to usina.csv")