# 🚀 Econometric Analysis Report
## Generated by Quick Learning Analytics

**Generated on:** 2025-09-11 20:45:43  
**Method:** Random Forest  
**Problem Type:** Regression  
**Features:** 10  
**Source:** Supervised Learning Tool by Ren Zhang, McCoy College of Business, Texas State University

---

This notebook replicates your exact analysis from the Quick Learning Analytics econometric app, including all preprocessing steps, model configuration, and evaluation metrics.

**Visit:** [Quick Learning Analytics](https://quicklearninganalytics.streamlit.app/) for more information and tools.

## ✅ Options Tracking Checklist

This analysis tracks **ALL** the options you selected in the app:

### 📊 **Basic Configuration**
- ✅ **Method:** Random Forest
- ✅ **Problem Type:** Regression
- ✅ **Target Variable:** `age`
- ✅ **Features:** 10 variables
  - `high_earner`, `is_urban`, `education_PhD`, `experience`, `promotion`...
- ✅ **Random State:** 42 (for reproducibility)

### 🔧 **Data Processing Options**
- ✅ **Data File:** test_dataset_classification.csv
- ✅ **Missing Data:** KNN Imputation
- ✅ **Data Filtering:** 1 filter applied
- ❌ **Feature Scaling:** Disabled
- ✅ **Sample Range:** Full dataset

### 🤖 **Model-Specific Options**
- ✅ **Max Depth:** 5
- ✅ **Min Samples Split:** 2
- ✅ **Min Samples Leaf:** 1
- ✅ **Number of Trees:** 120
- ✅ **Pruning:** Automatic (CV)
- ❌ **Manual Alpha:** Auto

### 📈 **Analysis Options**
- ✅ **Include Constant:** Yes
- ✅ **Test Size:** 0.2 (20% for testing)
- ✅ **Generate Plots:** Enabled
- ❌ **Stratified Split:** No

### 🔍 **Advanced Options**
- ❌ **Parameter Input Method:** Default
- ✅ **Class Weight Option:** None
- ✅ **Filter Method:** Column Values

---

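💡 **These options are reflected in the code below.** Note that the `RandomForestRegressor` call sets no pruning parameter and no constant term, so the **Pruning** and **Include Constant** selections have no visible counterpart in the generated code.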

In [None]:
# Import required libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import KNNImputer
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
# Load your dataset
df = pd.read_csv('test_dataset_classification.csv')

print(f'Dataset shape: {df.shape}')
print(f'Columns: {list(df.columns)}')
df.head()

## 🔍 Data Filtering

Applying the same filters you used in your analysis:

In [None]:
# Apply data filters (replicating your selections)
# Filter 1: hours_worked between 27.4 and 51.43
df = df[(df['hours_worked'] >= 27.4) & (df['hours_worked'] <= 51.43)]

print(f'After filtering: {df.shape}')
df.head()
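
The range filter above can also be written with `Series.between`, which is inclusive on both ends by default. The sketch below uses a small hypothetical frame (only the column name `hours_worked` comes from this notebook) and also shows a side effect worth knowing: because the filter runs before imputation, rows with a missing `hours_worked` fail the comparison and are dropped rather than imputed.

```python
import pandas as pd

# Hypothetical toy data; only the column name is taken from the notebook.
toy = pd.DataFrame({"hours_worked": [20.0, 27.4, 40.0, 51.43, 60.0, None]})

# between() is inclusive on both bounds, matching the >= / <= filter above.
mask = toy["hours_worked"].between(27.4, 51.43)
filtered = toy[mask]

print(filtered["hours_worked"].tolist())  # [27.4, 40.0, 51.43]
print(f"Rows dropped: {len(toy) - len(filtered)}")  # Rows dropped: 3 (two out of range, one missing)
```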

## 🔧 Missing Data Handling

Method: **KNN Imputation**

In [None]:
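# Handle missing values using KNN imputation
# Note: numeric_cols also contains the target (`age`) when it is numeric, and the
# imputer is fit on the full dataset before the train-test split, so test rows
# influence the imputed values.
numeric_cols = df.select_dtypes(include=[np.number]).columns.tolist()
knn_imputer = KNNImputer(n_neighbors=5)
df[numeric_cols] = knn_imputer.fit_transform(df[numeric_cols])
print('✓ Applied KNN imputation to numeric columns')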

# Check for remaining missing values
print(f'Missing values after imputation: {df.isnull().sum().sum()}')
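
The cell above fits the imputer on the whole dataset. A common alternative, shown here only as a sketch on synthetic data and not as what the app does, is to split first and fit `KNNImputer` on the training rows alone, then reuse it on the test rows. The frame `demo` and its column names are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.model_selection import train_test_split

# Synthetic data with some missing values in column "b" (illustration only).
rng = np.random.default_rng(42)
demo = pd.DataFrame(rng.normal(size=(50, 3)), columns=["a", "b", "c"])
demo.iloc[::7, 1] = np.nan

# Split first, so the test rows never influence the imputer.
train, test = train_test_split(demo, test_size=0.2, random_state=42)

imputer = KNNImputer(n_neighbors=5)
train_imp = pd.DataFrame(
    imputer.fit_transform(train), columns=train.columns, index=train.index
)
# transform() reuses neighbours found in the training data only.
test_imp = pd.DataFrame(
    imputer.transform(test), columns=test.columns, index=test.index
)

print(f"Missing in train after imputation: {train_imp.isna().sum().sum()}")
print(f"Missing in test after imputation: {test_imp.isna().sum().sum()}")
```

Results from this variant will generally differ slightly from the app's numbers, so it is best treated as a robustness check rather than a replacement.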

## 📊 Variable Definition and Preprocessing

Defining features and target variable:

In [None]:
# Define variables (matching your analysis)
independent_vars = ['high_earner', 'is_urban', 'education_PhD', 'experience', 'promotion', 'education_Bachelor', 'education_High School', 'education_Master', 'income', 'hours_worked']
dependent_var = 'age'

# Extract features and target
X = df[independent_vars].copy()
y = df[dependent_var].copy()

# Define model type for visualization and logic
model_type = 'regression'

print(f'Feature matrix shape: {X.shape}')
print(f'Target variable shape: {y.shape}')
print(f'Features: {list(X.columns)}')

## 🤖 Model Training: Random Forest

Training with your exact settings:

In [None]:
# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(f'Training set: {X_train.shape[0]} samples')
print(f'Test set: {X_test.shape[0]} samples')

# Random Forest Regression
model = RandomForestRegressor(
    n_estimators=120,
    max_depth=5,
    min_samples_split=2,
    min_samples_leaf=1,
    random_state=42
)

# Train the model
model.fit(X_train, y_train)
print('✓ Model trained successfully')
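
The checklist lists **Pruning: Automatic (CV)**, but the constructor call above sets no pruning parameter. How the app implements that option is not shown here. One possible reading, offered purely as an assumption, is cost-complexity pruning with `ccp_alpha` chosen by cross-validation. The sketch below does this with `GridSearchCV` on synthetic data; the alpha grid is hypothetical.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic regression data (illustration only).
X_demo, y_demo = make_regression(
    n_samples=200, n_features=5, noise=10.0, random_state=42
)

# Hypothetical grid of cost-complexity pruning strengths; 0.0 means no pruning.
param_grid = {"ccp_alpha": [0.0, 0.1, 1.0, 10.0]}

search = GridSearchCV(
    RandomForestRegressor(n_estimators=20, max_depth=5, random_state=42),
    param_grid,
    cv=5,
    scoring="neg_mean_squared_error",
)
search.fit(X_demo, y_demo)

best_alpha = search.best_params_["ccp_alpha"]
print(f"Selected ccp_alpha: {best_alpha}")
```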

## 📈 Model Evaluation

Calculate performance metrics:

In [None]:
# Make predictions
y_train_pred = model.predict(X_train)
y_test_pred = model.predict(X_test)

# Regression metrics
train_mse = mean_squared_error(y_train, y_train_pred)
test_mse = mean_squared_error(y_test, y_test_pred)
train_r2 = r2_score(y_train, y_train_pred)
test_r2 = r2_score(y_test, y_test_pred)
train_mae = mean_absolute_error(y_train, y_train_pred)
test_mae = mean_absolute_error(y_test, y_test_pred)
train_rmse = np.sqrt(train_mse)
test_rmse = np.sqrt(test_mse)

print('\n' + '='*60)
print('🎯 KEY RESULTS (Should match main window):')
print('='*60)
print(f'📊 Training R²: {train_r2:.4f}')
print(f'📊 Test R²: {test_r2:.4f}')
print(f'📊 Training RMSE: {train_rmse:.4f}')
print(f'📊 Test RMSE: {test_rmse:.4f}')
print(f'📊 Training MAE: {train_mae:.4f}')
print(f'📊 Test MAE: {test_mae:.4f}')
print('='*60)

print('\n=== DETAILED MODEL PERFORMANCE ===')
print(f'Training R²: {train_r2:.6f}')
print(f'Test R²: {test_r2:.6f}')
print(f'Training MSE: {train_mse:.6f}')
print(f'Test MSE: {test_mse:.6f}')
print(f'Training RMSE: {train_rmse:.6f}')
print(f'Test RMSE: {test_rmse:.6f}')
print(f'Training MAE: {train_mae:.6f}')
print(f'Test MAE: {test_mae:.6f}')
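
To make the reported metrics concrete, the following sketch recomputes MSE, RMSE, MAE, and R² by hand on four made-up values and checks them against the scikit-learn functions used above. The arrays `y_true` and `y_pred` are hypothetical.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Made-up actual and predicted values (illustration only).
y_true = np.array([30.0, 40.0, 50.0, 60.0])
y_pred = np.array([32.0, 38.0, 53.0, 57.0])

errors = y_true - y_pred                  # [-2, 2, -3, 3]
mse = np.mean(errors ** 2)                # (4 + 4 + 9 + 9) / 4 = 6.5
rmse = np.sqrt(mse)                       # square root of the MSE
mae = np.mean(np.abs(errors))             # (2 + 2 + 3 + 3) / 4 = 2.5

ss_res = np.sum(errors ** 2)                     # residual sum of squares: 26
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares: 500
r2 = 1 - ss_res / ss_tot                         # 1 - 26/500 = 0.948

# The hand-computed values agree with scikit-learn.
assert np.isclose(mse, mean_squared_error(y_true, y_pred))
assert np.isclose(mae, mean_absolute_error(y_true, y_pred))
assert np.isclose(r2, r2_score(y_true, y_pred))

print(f"MSE={mse:.3f}  RMSE={rmse:.3f}  MAE={mae:.3f}  R2={r2:.3f}")
```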

## 🔥 Feature Analysis

Analyzing feature importance (a Random Forest exposes impurity-based importances, not coefficients):

In [None]:
# Feature importance analysis
feature_importance = pd.DataFrame({
    'feature': X.columns,
    'importance': model.feature_importances_
}).sort_values('importance', ascending=False)

print('\n🔥 FEATURE IMPORTANCE (Top features):')
print('='*50)
for idx, row in feature_importance.head().iterrows():
    print(f'📈 {row["feature"]:<25}: {row["importance"]:.6f}')
print('='*50)

# Display complete feature importance
feature_importance
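
`feature_importances_` is computed from impurity reductions on the training data. As a cross-check that is not part of the app's output, permutation importance on held-out data measures how much the score drops when a feature is shuffled. This is a minimal sketch on synthetic data; the names `rf`, `X_te`, and `y_te` are hypothetical stand-ins for the fitted model and test split.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data with 2 informative features out of 4 (illustration only).
X_demo, y_demo = make_regression(
    n_samples=300, n_features=4, n_informative=2, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

rf = RandomForestRegressor(n_estimators=30, max_depth=5, random_state=42)
rf.fit(X_tr, y_tr)

# Shuffle each feature 5 times on the test set and record the drop in R².
result = permutation_importance(rf, X_te, y_te, n_repeats=5, random_state=42)

for i in result.importances_mean.argsort()[::-1]:
    print(
        f"feature_{i}: impurity={rf.feature_importances_[i]:.4f}  "
        f"permutation={result.importances_mean[i]:.4f} "
        f"+/- {result.importances_std[i]:.4f}"
    )
```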

## 📊 Visualization

Generate comprehensive plots:

In [None]:
# Set up plotting style
plt.style.use('default')
sns.set_palette('husl')

# Regression plots
fig, axes = plt.subplots(2, 2, figsize=(15, 12))
fig.suptitle('Random Forest - Regression Analysis Plots', fontsize=16)

# 1. Actual vs Predicted
axes[0, 0].scatter(y_test, y_test_pred, alpha=0.6)
axes[0, 0].plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--', lw=2)
axes[0, 0].set_xlabel('Actual Values')
axes[0, 0].set_ylabel('Predicted Values')
axes[0, 0].set_title('Actual vs Predicted Values')
axes[0, 0].grid(True, alpha=0.3)

# 2. Residual plot
residuals = y_test - y_test_pred
axes[0, 1].scatter(y_test_pred, residuals, alpha=0.6)
axes[0, 1].axhline(y=0, color='r', linestyle='--')
axes[0, 1].set_xlabel('Predicted Values')
axes[0, 1].set_ylabel('Residuals')
axes[0, 1].set_title('Residual Plot')
axes[0, 1].grid(True, alpha=0.3)

# 3. Residual distribution
axes[1, 0].hist(residuals, bins=20, alpha=0.7, edgecolor='black')
axes[1, 0].set_xlabel('Residuals')
axes[1, 0].set_ylabel('Frequency')
axes[1, 0].set_title('Distribution of Residuals')
axes[1, 0].grid(True, alpha=0.3)

# 4. Q-Q plot for residuals
from scipy import stats
stats.probplot(residuals, dist='norm', plot=axes[1, 1])
axes[1, 1].set_title('Q-Q Plot of Residuals')
axes[1, 1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Feature Importance Plot
plt.figure(figsize=(10, 6))
feature_importance_sorted = feature_importance.sort_values('importance', ascending=True)
plt.barh(feature_importance_sorted['feature'], feature_importance_sorted['importance'])
plt.xlabel('Feature Importance')
plt.title('Random Forest - Feature Importance')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

# Random Forest - Individual Tree Visualization
from sklearn.tree import plot_tree

# Plot the first tree from the forest
plt.figure(figsize=(20, 12))
plot_tree(model.estimators_[0],
          feature_names=list(X.columns),  # a plain list is accepted across scikit-learn versions
          class_names=None if model_type == 'regression' else True,
          filled=True,
          rounded=True,
          fontsize=8)
# f-string so that the number of trees is interpolated into the title
plt.title(f'Random Forest - Sample Tree (Tree #1 of {model.n_estimators})', fontsize=16)
plt.tight_layout()
plt.show()

# Random Forest Tree Statistics
print('\n🌳 RANDOM FOREST PROPERTIES:')
print('='*40)
print(f'🔢 Number of Trees: {model.n_estimators}')
print(f'🍃 Max Features per Split: {model.max_features}')
print(f'📏 Max Depth Setting: {model.max_depth}')
print(f'🔀 Min Samples Split: {model.min_samples_split}')
print(f'🍀 Min Samples Leaf: {model.min_samples_leaf}')
print(f'🎲 Bootstrap Samples: {model.bootstrap}')
print('='*40)
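
The properties printed above are the settings passed to the forest. To see what the fitted trees actually look like, each element of `estimators_` can be queried with `get_depth()` and `get_n_leaves()`. The sketch below does this on synthetic data with a small hypothetical forest named `rf`.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic data and a small forest (illustration only).
X_demo, y_demo = make_regression(
    n_samples=200, n_features=5, noise=5.0, random_state=42
)
rf = RandomForestRegressor(n_estimators=10, max_depth=5, random_state=42)
rf.fit(X_demo, y_demo)

# Realized depth and leaf count of every fitted tree in the forest.
depths = np.array([tree.get_depth() for tree in rf.estimators_])
leaves = np.array([tree.get_n_leaves() for tree in rf.estimators_])

print(f"Depth  min/mean/max: {depths.min()} / {depths.mean():.1f} / {depths.max()}")
print(f"Leaves min/mean/max: {leaves.min()} / {leaves.mean():.1f} / {leaves.max()}")
```

With `max_depth=5`, no tree can exceed depth 5 or 32 leaves, so values well below those limits suggest the other stopping rules are binding.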

## 🎯 Analysis Summary

✅ **Analysis completed successfully!**

**Key Information:**
- **Method:** Random Forest
- **Problem Type:** Regression
- **Features:** 10
- **Preprocessing:** 1 filter on `hours_worked`, KNN imputation, no feature scaling
- **Cross-validation:** No
- **Plots:** Generated

⚠️ **Important:** Compare the KEY RESULTS above with your main window to verify accuracy!

🔄 **Reproducibility:** This notebook uses `random_state=42` for consistent results.