# 🚀 Econometric Analysis Report
## Generated Analysis Replication

**Generated on:** 2025-09-11 15:04:54  
**Method:** OLS  
**Problem Type:** Regression  
**Features:** 10

---

This notebook replicates your exact analysis from the econometric app, including all preprocessing steps, model configuration, and evaluation metrics.

## ✅ Options Tracking Checklist

This analysis tracks **ALL** the options you selected in the app:

### 📊 **Basic Configuration**
- ✅ **Method:** OLS
- ✅ **Problem Type:** Regression
- ✅ **Target Variable:** `promotion`
- ✅ **Features:** 10 variables
  - `education_Master`, `age`, `experience`, `education_High School`, `high_earner`...
- ✅ **Random State:** 42 (for reproducibility)

### 🔧 **Data Processing Options**
- ✅ **Data File:** test_dataset_classification.csv
- ✅ **Missing Data:** Listwise Deletion
- ❌ **Data Filtering:** 0 filters applied
- ❌ **Feature Scaling:** Disabled
- ❌ **Sample Range:** Full dataset

### 🤖 **Model-Specific Options**

### 📈 **Analysis Options**
- ✅ **Include Constant:** Yes
- ✅ **Generate Plots:** Enabled
- ❌ **Stratified Split:** No

### 🔍 **Advanced Options**
- ❌ **Parameter Input Method:** Default
- ❌ **Class Weight Option:** None
- ❌ **Filter Method:** Standard

---

💡 **The code below is intended to replicate these options.** Note that the Listwise Deletion setting does not appear in any cell.

In [1]:
# Import required libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
# Load your dataset
df = pd.read_csv('test_dataset_classification.csv')

print(f'Dataset shape: {df.shape}')
print(f'Columns: {list(df.columns)}')
df.head()

Dataset shape: (500, 12)
Columns: ['age', 'experience', 'income', 'hours_worked', 'is_urban', 'education_Bachelor', 'education_High School', 'education_Master', 'education_PhD', 'high_earner', 'promotion', 'survey_date']


|   | age | experience | income | hours_worked | is_urban | education_Bachelor | education_High School | education_Master | education_PhD | high_earner | promotion | survey_date |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 46 | 26.7 | 73107.0 | 41.5 | 1 | False | True | False | False | 1 | 1 | 2021-12-23 |
| 1 | 38 | 22.1 | 57660.0 | 50.9 | 1 | False | False | True | False | 1 | 1 | 2022-02-18 |
| 2 | 48 | 21.6 | 37415.0 | 43.3 | 1 | True | False | False | False | 1 | 0 | 2023-05-18 |
| 3 | 58 | 38.0 | NaN | 29.8 | 1 | False | True | False | False | 1 | 1 | 2021-06-12 |
| 4 | 37 | 13.2 | 51488.0 | 44.0 | 1 | False | True | False | False | 0 | 1 | 2022-03-08 |
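
Row 3 of the preview has no `income` value, so the dataset contains missing data. A minimal sketch of how missingness could be inspected before modelling is shown below. The small frame `preview` is a hypothetical stand-in that copies three columns from the five rows above; it is not the real file.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df: three columns copied from the five preview rows
preview = pd.DataFrame({
    'age': [46, 38, 48, 58, 37],
    'income': [73107.0, 57660.0, 37415.0, np.nan, 51488.0],
    'promotion': [1, 1, 0, 1, 1],
})

# Number of missing values in each column
missing_counts = preview.isna().sum()
print(missing_counts)

# Number of rows with at least one missing value
rows_with_missing = int(preview.isna().any(axis=1).sum())
print(f'Rows with at least one missing value: {rows_with_missing}')
```

Run on the model variables of the real `df`, the same two calls would show how many rows listwise deletion removes.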


## 📊 Variable Definition and Preprocessing

Defining features and target variable:

In [3]:
# Define variables (matching your analysis)
independent_vars = ['education_Master', 'age', 'experience', 'education_High School', 'high_earner', 'hours_worked', 'income', 'education_PhD', 'is_urban', 'education_Bachelor']
dependent_var = 'promotion'

# Extract features and target
X = df[independent_vars].copy()
y = df[dependent_var].copy()

print(f'Feature matrix shape: {X.shape}')
print(f'Target variable shape: {y.shape}')
print(f'Features: {list(X.columns)}')

Feature matrix shape: (500, 10)
Target variable shape: (500,)
Features: ['education_Master', 'age', 'experience', 'education_High School', 'high_earner', 'hours_worked', 'income', 'education_PhD', 'is_urban', 'education_Bachelor']
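
The feature list contains all four `education_*` indicators, and the checklist sets Include Constant to Yes. If those four indicators cover every education level in the data, they sum to one in every row and are perfectly collinear with the intercept, so the individual education coefficients are not uniquely identified. That is an assumption: the preview only shows that each of its five rows has exactly one indicator set. A minimal sketch of the check, using hypothetical rows copied from the preview:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for X: the education indicators from the five preview rows
edu = pd.DataFrame({
    'education_Bachelor':    [False, False, True,  False, False],
    'education_High School': [True,  False, False, True,  True],
    'education_Master':      [False, True,  False, False, False],
    'education_PhD':         [False, False, False, False, False],
})

# If every row sums to 1, the indicators add up to the constant column
row_sums = edu.astype(int).sum(axis=1)
dummies_exhaustive = bool((row_sums == 1).all())
print(f'Indicators sum to 1 in every row: {dummies_exhaustive}')

# Design matrix with an explicit constant column:
# a rank below the number of columns signals collinearity
design = np.column_stack([np.ones(len(edu)), edu.astype(float).to_numpy()])
rank = int(np.linalg.matrix_rank(design))
print(f'Columns: {design.shape[1]}, rank: {rank}')
```

If the check also holds on the full data, the usual remedy is to drop one indicator (for example `education_High School`) and treat it as the reference category.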


## 🤖 Model Training: OLS

Training with your exact settings:

In [4]:
# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(f'Training set: {X_train.shape[0]} samples')
print(f'Test set: {X_test.shape[0]} samples')

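# Ordinary Least Squares regression.
# fit_intercept=True is the scikit-learn default and matches "Include Constant: Yes".
model = LinearRegression(fit_intercept=True)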

# Train the model
model.fit(X_train, y_train)
print('✓ Model trained successfully')

Training set: 400 samples
Test set: 100 samples


ValueError: Input X contains NaN.
LinearRegression does not accept missing values encoded as NaN natively. For supervised learning, you might want to consider sklearn.ensemble.HistGradientBoostingClassifier and Regressor which accept missing values encoded as NaNs natively. Alternatively, it is possible to preprocess the data, for instance by using an imputer transformer in a pipeline or drop samples with missing values. See https://scikit-learn.org/stable/modules/impute.html You can find a list of all estimators that handle NaN values at the following page: https://scikit-learn.org/stable/modules/impute.html#estimators-that-handle-nan-values
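
The error arises because the Listwise Deletion option from the checklist is never applied in the code. `income` is missing in at least one row (row 3 of the preview), and `LinearRegression` rejects NaN inputs. The sketch below shows one way listwise deletion could be applied before the split, assuming it means dropping every row with a missing value in any model variable. The frame `demo` is hypothetical: its first five rows follow the preview and the rest are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for df; the last five rows are invented for illustration
demo = pd.DataFrame({
    'age':       [46, 38, 48, 58, 37, 29, 51, 43, 35, 60],
    'income':    [73107.0, 57660.0, 37415.0, np.nan, 51488.0,
                  42000.0, 66000.0, 58000.0, 39000.0, 71000.0],
    'promotion': [1, 1, 0, 1, 1, 0, 1, 0, 0, 1],
})
features = ['age', 'income']
target = 'promotion'

# Listwise deletion: keep only rows that are complete in every model variable
complete = demo.dropna(subset=features + [target])
print(f'Rows before: {len(demo)}, after listwise deletion: {len(complete)}')

X_demo = complete[features]
y_demo = complete[target]

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# fit() now succeeds because no NaN remains in the inputs
demo_model = LinearRegression().fit(X_tr, y_tr)
print('Model trained on complete cases')
```

In this notebook, the equivalent change would be to build `X` and `y` from `df.dropna(subset=independent_vars + [dependent_var])` in cell 3. The split sizes would then be smaller than the 400/100 printed above.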

## 📈 Model Evaluation

Calculate performance metrics:

In [None]:
# Make predictions
y_train_pred = model.predict(X_train)
y_test_pred = model.predict(X_test)

# Regression metrics
train_mse = mean_squared_error(y_train, y_train_pred)
test_mse = mean_squared_error(y_test, y_test_pred)
train_r2 = r2_score(y_train, y_train_pred)
test_r2 = r2_score(y_test, y_test_pred)
train_mae = mean_absolute_error(y_train, y_train_pred)
test_mae = mean_absolute_error(y_test, y_test_pred)
train_rmse = np.sqrt(train_mse)
test_rmse = np.sqrt(test_mse)

print('\n' + '='*60)
print('🎯 KEY RESULTS (Should match main window):')
print('='*60)
print(f'📊 Training R²: {train_r2:.4f}')
print(f'📊 Test R²: {test_r2:.4f}')
print(f'📊 Training RMSE: {train_rmse:.4f}')
print(f'📊 Test RMSE: {test_rmse:.4f}')
print(f'📊 Training MAE: {train_mae:.4f}')
print(f'📊 Test MAE: {test_mae:.4f}')
print('='*60)

print('\n=== DETAILED MODEL PERFORMANCE ===')
print(f'Training R²: {train_r2:.6f}')
print(f'Test R²: {test_r2:.6f}')
print(f'Training MSE: {train_mse:.6f}')
print(f'Test MSE: {test_mse:.6f}')
print(f'Training RMSE: {train_rmse:.6f}')
print(f'Test RMSE: {test_rmse:.6f}')
print(f'Training MAE: {train_mae:.6f}')
print(f'Test MAE: {test_mae:.6f}')
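
For reference, the metrics reported by this cell all follow directly from the residuals. The sketch below computes them with NumPy on hypothetical values. `y_true` and `y_hat` are made up and are not results from this analysis.

```python
import numpy as np

# Hypothetical actual and predicted values, for illustration only
y_true = np.array([1.0, 0.0, 1.0, 1.0, 0.0])
y_hat = np.array([0.8, 0.3, 0.6, 0.9, 0.1])

resid = y_true - y_hat

mse = np.mean(resid ** 2)        # mean squared error
rmse = np.sqrt(mse)              # root mean squared error
mae = np.mean(np.abs(resid))     # mean absolute error

# R² = 1 - SS_res / SS_tot
ss_res = np.sum(resid ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot

print(f'MSE: {mse:.4f}  RMSE: {rmse:.4f}  MAE: {mae:.4f}  R²: {r2:.4f}')
```

These are the same definitions that `mean_squared_error`, `mean_absolute_error`, and `r2_score` apply with their default arguments.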

## 🔥 Feature Analysis

Analyzing feature importance or coefficients:

In [None]:
# Model coefficients analysis
# Unwrap a fitted search object (e.g. GridSearchCV) when present; otherwise use the model itself
estimator = getattr(model, 'best_estimator_', model)
if hasattr(estimator, 'coef_'):
    coef_values = estimator.coef_
    
    coefficients = pd.DataFrame({
        'feature': X.columns,
        'coefficient': coef_values
    }).sort_values('coefficient', key=abs, ascending=False)
    
    print('\n🔥 TOP COEFFICIENTS (Most influential features):')
    print('='*60)
    for idx, row in coefficients.head().iterrows():
        print(f'📈 {row["feature"]:<25}: {row["coefficient"]:>12.6f}')
    print('='*60)
    
    # Display complete coefficients
    display(coefficients)
else:
    print('\n⚠️  Coefficient analysis not available for this model type')
    print('This model does not have linear coefficients.')
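
Ranking raw coefficients by absolute value can mislead when features are on different scales, as they are here with Feature Scaling disabled: `income` is measured in the tens of thousands while the education indicators are 0/1. One common way to make magnitudes comparable is to multiply each coefficient by the standard deviation of its feature. A minimal sketch on synthetic data follows; all names and values in it are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
n = 200

# Synthetic features on very different scales
toy = pd.DataFrame({
    'income': rng.normal(50_000, 15_000, n),
    'is_urban': rng.integers(0, 2, n).astype(float),
})
# Synthetic target: both features matter, but income has a tiny raw coefficient
toy_y = 0.00001 * toy['income'] + 0.1 * toy['is_urban'] + rng.normal(0, 0.05, n)

toy_model = LinearRegression().fit(toy, toy_y)

coef_table = pd.DataFrame({
    'feature': toy.columns,
    'coefficient': toy_model.coef_,
    # Change in the target for a one-standard-deviation change in the feature
    'per_std': toy_model.coef_ * toy.std().to_numpy(),
}).sort_values('per_std', key=abs, ascending=False)

print(coef_table)
```

In this synthetic case the raw coefficient of `is_urban` is far larger than that of `income`, yet `income` moves the target more per standard deviation, so the two rankings disagree.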

## 📊 Visualization

Generate comprehensive plots:

In [None]:
# Set up plotting style
plt.style.use('default')
sns.set_palette('husl')

# Regression plots
fig, axes = plt.subplots(2, 2, figsize=(15, 12))
fig.suptitle('OLS - Regression Analysis Plots', fontsize=16)

# 1. Actual vs Predicted
axes[0, 0].scatter(y_test, y_test_pred, alpha=0.6)
axes[0, 0].plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--', lw=2)
axes[0, 0].set_xlabel('Actual Values')
axes[0, 0].set_ylabel('Predicted Values')
axes[0, 0].set_title('Actual vs Predicted Values')
axes[0, 0].grid(True, alpha=0.3)

# 2. Residual plot
residuals = y_test - y_test_pred
axes[0, 1].scatter(y_test_pred, residuals, alpha=0.6)
axes[0, 1].axhline(y=0, color='r', linestyle='--')
axes[0, 1].set_xlabel('Predicted Values')
axes[0, 1].set_ylabel('Residuals')
axes[0, 1].set_title('Residual Plot')
axes[0, 1].grid(True, alpha=0.3)

# 3. Residual distribution
axes[1, 0].hist(residuals, bins=20, alpha=0.7, edgecolor='black')
axes[1, 0].set_xlabel('Residuals')
axes[1, 0].set_ylabel('Frequency')
axes[1, 0].set_title('Distribution of Residuals')
axes[1, 0].grid(True, alpha=0.3)

# 4. Q-Q plot for residuals
from scipy import stats
stats.probplot(residuals, dist='norm', plot=axes[1, 1])
axes[1, 1].set_title('Q-Q Plot of Residuals')
axes[1, 1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Coefficient Plot
plt.figure(figsize=(10, 6))
coef_abs_sorted = coefficients.reindex(coefficients['coefficient'].abs().sort_values(ascending=True).index)
colors = ['red' if x < 0 else 'blue' for x in coef_abs_sorted['coefficient']]
plt.barh(coef_abs_sorted['feature'], coef_abs_sorted['coefficient'], color=colors, alpha=0.7)
plt.xlabel('Coefficient Value')
plt.title('OLS - Feature Coefficients')
plt.axvline(x=0, color='black', linestyle='-', alpha=0.3)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

## 🎯 Analysis Summary

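⚠️ **Analysis did not complete in this run.** Model training in cell 4 stopped with `ValueError: Input X contains NaN`, so the evaluation, feature analysis, and visualization cells were never executed.

**Key Information:**
- **Method:** OLS
- **Problem Type:** Regression
- **Features:** 10
- **Preprocessing:** Listwise deletion was selected but is not applied in the code above
- **Cross-validation:** No
- **Plots:** Enabled, but not generated in this run

⚠️ **Important:** Once the notebook runs end to end, compare the KEY RESULTS with your main window to verify accuracy.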

🔄 **Reproducibility:** This notebook uses `random_state=42` for consistent results.
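
A minimal sketch of what the fixed seed guarantees: calling `train_test_split` twice with the same `random_state` on the same data returns the same partition. The array `values` is a hypothetical placeholder, not data from this analysis.

```python
import numpy as np
from sklearn.model_selection import train_test_split

values = np.arange(20)  # hypothetical placeholder data

train_a, test_a = train_test_split(values, test_size=0.2, random_state=42)
train_b, test_b = train_test_split(values, test_size=0.2, random_state=42)

# Both calls should produce exactly the same train and test sets
same_split = np.array_equal(train_a, train_b) and np.array_equal(test_a, test_b)
print(f'Identical splits: {same_split}')
```

The guarantee holds only when the input rows are the same, so a change in preprocessing (such as dropping rows with missing values) changes which rows land in each set.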