# Server Crash Prediction - Data Exploration

This notebook explores the cloud workload dataset for server crash prediction using machine learning.

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, precision_score, recall_score, f1_score
from imblearn.over_sampling import SMOTE
import matplotlib.pyplot as plt
import seaborn as sns
import joblib

## 1. Data Loading

In [None]:
# Load the cloud workload dataset
file_path = '../data/cloud_workload_dataset.csv'
df = pd.read_csv(file_path)
print(f"Successfully loaded data from {file_path}")
print(f"Dataset shape: {df.shape}")
print("\nFirst few rows:")
print(df.head())
print("\nDataset info:")
print(df.info())

## 2. Data Exploration and Feature Engineering

In this section we will:
- Check for missing values
- Explore the dataset structure and columns
- Create a binary target variable from `Error_Rate (%)` (high error rate = 1, low = 0)
- Identify numerical and categorical features
- One-hot encode categorical variables (Data_Source, Job_Priority, Scheduler_Type, Resource_Allocation_Type)
- Prepare features (X) and target (y) for model training
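The percentile-based target described above can be sketched on a toy array. The values below are hypothetical (not taken from the dataset) and only illustrate how a strict "greater than" comparison against the 75th percentile behaves, including the edge case where many rows tie at the threshold.

```python
import numpy as np

# Hypothetical error rates (%), chosen for illustration; not taken from the dataset
error_rates = np.array([0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0])

# 75th percentile with linear interpolation (the default in NumPy and pandas)
threshold = np.percentile(error_rates, 75)

# Strict ">" comparison: values equal to the threshold are labeled 0 (normal)
high_error = (error_rates > threshold).astype(int)
positive_share = high_error.mean()

# With heavy ties at the top, the strict comparison can leave no positives at all
tied_rates = np.array([1.0, 2.0, 3.0, 4.0, 4.0, 4.0, 4.0, 4.0])
tied_threshold = np.percentile(tied_rates, 75)
tied_positive_share = (tied_rates > tied_threshold).mean()

print(f"threshold={threshold}, positive share={positive_share}")
print(f"tied threshold={tied_threshold}, positive share={tied_positive_share}")
```

For this reason it is worth checking the printed target distribution rather than assuming an exact 75/25 split.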

In [None]:
# Clean column names for easier access
df.columns = (
    df.columns
    .str.strip()
    .str.replace(' ', '_')
    .str.replace('(', '', regex=False)
    .str.replace(')', '', regex=False)
    .str.replace('%', 'pct')
)

# Check for missing values
print("Missing values:")
print(df.isnull().sum())
print(f"\nCleaned column names: {df.columns.tolist()}")

# Create binary target variable from Error_Rate
# High error rate (above 75th percentile) indicates potential server crashes/issues
error_threshold = df['Error_Rate_pct'].quantile(0.75)
df['High_Error'] = (df['Error_Rate_pct'] > error_threshold).astype(int)

print(f"\nError rate threshold (75th percentile): {error_threshold:.2f}%")
print(f"\nTarget distribution:")
print(df['High_Error'].value_counts())
print(f"\nClass balance: {df['High_Error'].value_counts(normalize=True)}")

# Drop unnecessary columns (identifiers and timestamps)
df = df.drop(['Job_ID', 'Task_Start_Time', 'Task_End_Time'], axis=1)

# One-hot encode categorical variables
categorical_cols = ['Data_Source', 'Job_Priority', 'Scheduler_Type', 'Resource_Allocation_Type']
df_encoded = pd.get_dummies(df, columns=categorical_cols, drop_first=True)

# Prepare features (X) and target (y)
X = df_encoded.drop(['High_Error', 'Error_Rate_pct'], axis=1)
y = df_encoded['High_Error']

print(f"\nFinal feature count: {X.shape[1]} features")
print(f"Final dataset shape: X={X.shape}, y={y.shape}")

## 3. Train/Test Split and SMOTE Resampling

**Challenge:** Our dataset has class imbalance (roughly 75% normal, 25% high error, a direct consequence of the 75th-percentile threshold).

**Solution:** We use SMOTE (Synthetic Minority Over-sampling Technique) to balance the training set by generating synthetic examples of the minority class (high error cases).
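To make the idea of "synthetic examples" concrete, here is a simplified NumPy sketch of the interpolation step behind SMOTE. It is an illustration only, not the imbalanced-learn implementation; the function name `smote_like_sample`, its parameters, and the toy points are all hypothetical.

```python
import numpy as np

def smote_like_sample(X_min, k=2, n_new=4, seed=0):
    """Create synthetic points by interpolating between minority-class neighbors."""
    rng = np.random.default_rng(seed)
    synthetic = []
    for _ in range(n_new):
        # Pick a random minority sample
        i = rng.integers(len(X_min))
        # Euclidean distances from sample i to every minority sample
        dists = np.linalg.norm(X_min - X_min[i], axis=1)
        # k nearest neighbors, skipping position 0 (the sample itself, assuming distinct points)
        neighbors = np.argsort(dists)[1:k + 1]
        j = rng.choice(neighbors)
        # Random position along the segment between sample i and neighbor j
        gap = rng.random()
        synthetic.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.array(synthetic)

# Hypothetical minority-class points with two features
X_min = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 2.5], [3.0, 3.0]])
new_points = smote_like_sample(X_min)
print(new_points)
```

Because every synthetic point lies on a segment between two real minority samples, interpolating one-hot encoded columns can produce fractional values, which is one reason to inspect the resampled data before training.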

In [None]:
# Train/test split with stratification
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(f"Training set size: {len(X_train)}")
print(f"Test set size: {len(X_test)}")
print(f"\nTraining set class distribution:\n{y_train.value_counts()}")

# Apply SMOTE to balance the training set
# Note: We only apply SMOTE to training data to avoid data leakage
sm = SMOTE(random_state=42)
X_train_resampled, y_train_resampled = sm.fit_resample(X_train, y_train)

print(f"\nAfter SMOTE resampling:")
print(f"Resampled training set size: {len(X_train_resampled)}")
print(f"Resampled class distribution: {np.bincount(y_train_resampled)}")
print("✓ Training data is now balanced!")

## 4. Model Training

We train a RandomForestClassifier with 600 trees. Random Forests are a reasonable fit for this task because they:
- Work with numerical features and one-hot encoded categorical features without further preprocessing
- Reduce overfitting relative to a single decision tree by averaging many trees
- Provide feature importance metrics
- Don't require feature scaling (unlike neural networks)

In [None]:
# Train RandomForest model on SMOTE-resampled data
print("Training RandomForestClassifier...")
model = RandomForestClassifier(
    n_estimators=600,        # 600 decision trees
    max_depth=None,          # No depth limit (trees grow until pure)
    min_samples_split=2,     # Minimum samples to split a node
    min_samples_leaf=1,      # Minimum samples in leaf nodes
    random_state=42,
    n_jobs=-1                # Use all CPU cores
)

model.fit(X_train_resampled, y_train_resampled)
print("✓ Model trained successfully!")

## 5. Model Evaluation

We evaluate the model using multiple metrics:
- **Accuracy**: Overall correctness
- **Precision**: Of predicted high-error cases, how many were actually high-error?
- **Recall**: Of actual high-error cases, how many did we catch?
- **F1-Score**: Harmonic mean of precision and recall
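The four metrics can be computed by hand from confusion-matrix counts. The counts below are hypothetical and chosen only to show how accuracy can look acceptable while recall stays low on imbalanced data.

```python
# Hypothetical confusion-matrix counts, chosen for illustration only
tn, fp, fn, tp = 700, 50, 200, 50

# Accuracy: share of all predictions that are correct
accuracy = (tp + tn) / (tp + tn + fp + fn)
# Precision: share of predicted high-error cases that are truly high-error
precision = tp / (tp + fp)
# Recall: share of actual high-error cases that were caught
recall = tp / (tp + fn)
# F1: harmonic mean of precision and recall
f1_value = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1_value:.3f}")
```

With these counts the classifier is 75% accurate yet catches only 20% of the high-error cases, which is why accuracy should not be read in isolation.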

In [None]:
# Make predictions on test set
y_pred = model.predict(X_test)
y_proba = model.predict_proba(X_test)[:, 1]  # Probability of high error

# Calculate metrics
acc = accuracy_score(y_test, y_pred)
prec = precision_score(y_test, y_pred)
rec = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

print("="*60)
print("MODEL PERFORMANCE (Default 0.5 Threshold)")
print("="*60)
print(f"Accuracy:  {acc:.4f} ({acc*100:.2f}%)")
print(f"Precision: {prec:.4f} ({prec*100:.2f}%)")
print(f"Recall:    {rec:.4f} ({rec*100:.2f}%)")
print(f"F1-Score:  {f1:.4f}")

print("\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=['Normal', 'High Error']))

# Confusion Matrix
cm = confusion_matrix(y_test, y_pred)
print("\nConfusion Matrix:")
print(cm)

In [None]:
# Visualize Confusion Matrix
fig, ax = plt.subplots(1, 1, figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', cbar=False, ax=ax,
            xticklabels=['Predicted Normal', 'Predicted High Error'],
            yticklabels=['Actual Normal', 'Actual High Error'])
ax.set_title('Confusion Matrix - Server Error Prediction', fontsize=14, fontweight='bold')
ax.set_xlabel('Predicted Label', fontsize=12)
ax.set_ylabel('True Label', fontsize=12)
plt.tight_layout()
plt.show()

# Interpretation
tn, fp, fn, tp = cm.ravel()
print(f"\nConfusion Matrix Breakdown:")
print(f"True Negatives (TN):  {tn} - Correctly predicted normal")
print(f"False Positives (FP): {fp} - Incorrectly predicted high error")
print(f"False Negatives (FN): {fn} - Missed high error cases")
print(f"True Positives (TP):  {tp} - Correctly caught high error cases")

In [None]:
# Feature Importance Analysis
feature_importance = pd.DataFrame({
    'feature': X.columns,
    'importance': model.feature_importances_
}).sort_values('importance', ascending=False)

print("Top 10 Most Important Features:")
print(feature_importance.head(10).to_string(index=False))

# Visualize feature importance
fig, ax = plt.subplots(figsize=(10, 6))
top_features = feature_importance.head(15)
colors = plt.cm.viridis(np.linspace(0.3, 0.9, len(top_features)))
ax.barh(range(len(top_features)), top_features['importance'], color=colors)
ax.set_yticks(range(len(top_features)))
ax.set_yticklabels(top_features['feature'])
ax.set_xlabel('Feature Importance', fontsize=12)
ax.set_title('Top 15 Most Important Features for Server Error Prediction', 
             fontsize=14, fontweight='bold')
ax.invert_yaxis()
plt.tight_layout()
plt.show()

print(f"\n📊 Key Insight: The top 7 features are all numerical system metrics,")
print(f"   showing that performance indicators are the strongest predictors.")

## 6. Model Persistence

Save the trained model and artifacts for future use and deployment.

### Sources of Bias (Important Considerations)
When interpreting results, be aware of these documented limitations:
- **Server type bias**: Logs from limited machine types - uncommon server patterns may be underrepresented
- **Workload variation**: CPU-intensive, memory-intensive, and network-intensive jobs may not be equally captured
- **Anomaly frequency**: Rare events (maintenance windows, unexpected traffic spikes) appear infrequently
- **Configuration diversity**: Not all server types and configurations are equally represented

### Future Improvements
1. **Hyperparameter Tuning**: GridSearchCV or RandomizedSearchCV to optimize model parameters
2. **Cross-Validation**: K-fold CV for more robust performance estimates
3. **Threshold Optimization**: Adjust decision threshold based on business needs (precision vs recall)
4. **Additional Models**: Compare with GradientBoosting, XGBoost, or LightGBM
5. **Feature Engineering**: Create interaction features (e.g., CPU × Memory utilization)
6. **More Data**: Expand to Google Cluster Data, Azure Public Dataset, or Alibaba ClusterData
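As a sketch of the threshold optimization idea in item 3, the snippet below sweeps a few decision thresholds over predicted probabilities and reports precision and recall at each. The labels and probabilities are hypothetical stand-ins for `y_test` and `y_proba`, and the helper `precision_recall_at` is an illustrative name, not part of any library.

```python
import numpy as np

# Hypothetical labels and predicted probabilities (stand-ins for y_test / y_proba)
y_true_demo = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 0])
proba_demo = np.array([0.10, 0.20, 0.25, 0.30, 0.35, 0.45, 0.40, 0.55, 0.30, 0.60])

def precision_recall_at(threshold):
    """Return (precision, recall) when predicting 1 for probabilities >= threshold."""
    y_pred = (proba_demo >= threshold).astype(int)
    tp = int(((y_pred == 1) & (y_true_demo == 1)).sum())
    fp = int(((y_pred == 1) & (y_true_demo == 0)).sum())
    fn = int(((y_pred == 0) & (y_true_demo == 1)).sum())
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Lowering the threshold trades precision for recall
for threshold in (0.5, 0.4, 0.3):
    p, r = precision_recall_at(threshold)
    print(f"threshold={threshold:.1f} precision={p:.2f} recall={r:.2f}")
```

In practice the threshold would be chosen on a validation split rather than the test set, so that the reported test metrics remain an honest estimate.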

### Technologies Used
- **Python**: Core programming language
- **scikit-learn**: Machine learning framework (RandomForest, train/test split, metrics)
- **imbalanced-learn**: SMOTE implementation for handling class imbalance
- **Pandas**: Data manipulation and feature engineering
- **Matplotlib & Seaborn**: Visualization
- **Joblib**: Model serialization

### Project Team
**AI/ML Team Lead**: AI4ALL Ignite Program  
**Contributors**: Leilany Rojas, Ammar Salama, Salvador Frias, Mashel Khan  
**Presentation**: AI4ALL Research Symposium

## 7. Summary and Key Findings

### Project Overview
This project predicts server crashes and high error rates using machine learning on cloud workload data. We trained a RandomForestClassifier on 5,000 job records to classify jobs as "high error" (potential crashes/bottlenecks) vs "normal" based on CPU, memory, network, and other system metrics.

### Research Question
**How can AI models predict spikes in server load to prevent slowdowns or crashes, while accounting for variations in workload and server types?**

### Methodology
1. **Data**: 5,000 cloud job records with 15 features (CPU, memory, network metrics, etc.)
2. **Target**: Binary classification - error rate above the 75th percentile (3.8%) = high error
3. **Class Imbalance Solution**: SMOTE resampling to balance training data (3,002 normal / 3,002 high error)
4. **Model**: RandomForestClassifier with 600 estimators
5. **Evaluation**: 80/20 train/test split with stratification

### Model Performance
- **Accuracy**: ~71%, slightly below the ~75% that always predicting "normal" would score on a 75/25 class split
- **Recall**: 4-18% - Currently catches only a small fraction of high-error cases
- **Challenge**: Imbalanced data leads to conservative predictions (model prefers "safe" normal predictions)
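A useful reference point for these numbers is a trivial baseline that labels every job "normal". The arithmetic below assumes a hypothetical test set of 1,000 jobs with the 75/25 class split described earlier.

```python
# Hypothetical test set mirroring the 75/25 class split described earlier
n_normal, n_high_error = 750, 250

# A baseline that always predicts "normal" never raises an alarm
tn, fp, fn, tp = n_normal, 0, n_high_error, 0

baseline_accuracy = (tp + tn) / (n_normal + n_high_error)
baseline_recall = tp / (tp + fn)

print(f"baseline accuracy={baseline_accuracy:.2f}, baseline recall={baseline_recall:.2f}")
```

Any model worth deploying should beat this baseline on recall without giving up much accuracy, so both figures are worth reporting side by side.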

### Top Predictive Features
1. System Throughput (tasks/sec) - 10.4%
2. Memory Consumption (MB) - 10.3%
3. Network Bandwidth Utilization (Mbps) - 10.3%
4. Task Execution Time (ms) - 10.1%
5. Number of Active Users - 10.1%

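**Key Insight**: Numerical performance metrics rank above the categorical features (scheduler type, job priority, etc.), but the top importances are nearly identical (10.1-10.4%), so no single metric dominates. Impurity-based importances also tend to favor continuous features over one-hot encoded ones, so this ranking should be read with caution.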

### Real-World Impact
- **Potential Use Case**: Early warning system for cloud infrastructure teams
- **Business Value**: Prevent downtime by flagging high-risk jobs before they cause outages
- **Trade-offs**: Current model favors fewer false alarms over recall: it misses most high-error cases (recall of 4-18%)

In [None]:
# Save the trained model
import os

model_dir = '../models'
os.makedirs(model_dir, exist_ok=True)  # joblib.dump fails if the directory is missing

model_path = os.path.join(model_dir, 'random_forest_smote_model.pkl')
joblib.dump(model, model_path)
print(f"✓ Model saved to {model_path}")

# Save feature importance alongside the model
importance_path = os.path.join(model_dir, 'feature_importance.csv')
feature_importance.to_csv(importance_path, index=False)
print(f"✓ Feature importance saved to {importance_path}")

print("\n💾 Model artifacts saved successfully!")