# Predicting Defect-Prone Modules with ML

In this notebook, we simulate per-module software metrics (size, complexity, commit activity, defect history) and train a classifier to estimate how likely each module is to be defect-prone.

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

# Create a synthetic dataset of per-module metrics
# (np.random.randint excludes the upper bound, e.g. complexity is 1..19)
np.random.seed(42)
n = 1000
df = pd.DataFrame({
    'lines_of_code': np.random.randint(50, 2000, n),
    'complexity': np.random.randint(1, 20, n),
    'num_commits': np.random.randint(1, 100, n),
    'past_defects': np.random.randint(0, 5, n),
    'module_age_months': np.random.randint(1, 48, n),
})

# Target: defect-prone (1) or not (0)
# Note: the label is a fixed rule on two features, so the model can learn it almost perfectly
df['is_defective'] = ((df['complexity'] > 10) & (df['past_defects'] > 0)).astype(int)

df.head()

In [None]:
# Split data
X = df.drop('is_defective', axis=1)
y = df['is_defective']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train classifier
model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)

# Evaluate
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))

In [None]:
# Rank test modules by predicted probability of being defective (class 1)
df_probs = X_test.copy()
df_probs['predicted_prob'] = model.predict_proba(X_test)[:, 1]
df_probs['true_label'] = y_test.values
df_probs.sort_values(by='predicted_prob', ascending=False).head(10)

### ✅ Summary

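- Simulated a dataset of software module metrics (lines of code, complexity, commits, past defects, module age)
- Trained a random forest classifier to predict defect-proneness
- Ranked test-set modules by predicted defect probability, which can be used to prioritize testing (e.g., modules with >0.8 defect probability)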

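Ranking modules this way is the basic idea behind ML-based defect prediction: testing effort goes first to the modules most likely to contain defects. Because the labels here come from a fixed rule on `complexity` and `past_defects`, the reported scores will be much higher than anything you should expect on real project data.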
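
The prioritization step from the summary can be sketched as follows. This is a minimal, self-contained sketch under a few assumptions: `select_high_risk` is a hypothetical helper (not part of scikit-learn), the 0.8 cutoff is simply the example value from the summary, and the data is regenerated with the same ranges and labeling rule as the notebook so the cell runs on its own.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split


def select_high_risk(model, features, threshold=0.8):
    """Hypothetical helper: rows whose predicted defect probability exceeds `threshold`."""
    ranked = features.copy()
    # Column 1 of predict_proba corresponds to class 1 (defective)
    ranked["predicted_prob"] = model.predict_proba(features)[:, 1]
    high_risk = ranked[ranked["predicted_prob"] > threshold]
    # Most risky modules first
    return high_risk.sort_values(by="predicted_prob", ascending=False)


# Regenerate the synthetic data with the same ranges and labeling rule as the notebook
np.random.seed(42)
n = 1000
df = pd.DataFrame({
    "lines_of_code": np.random.randint(50, 2000, n),
    "complexity": np.random.randint(1, 20, n),
    "num_commits": np.random.randint(1, 100, n),
    "past_defects": np.random.randint(0, 5, n),
    "module_age_months": np.random.randint(1, 48, n),
})
df["is_defective"] = ((df["complexity"] > 10) & (df["past_defects"] > 0)).astype(int)

X = df.drop("is_defective", axis=1)
y = df["is_defective"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)

# Modules to test first: predicted probability above the (assumed) 0.8 cutoff
high_risk = select_high_risk(model, X_test, threshold=0.8)
print(f"{len(high_risk)} of {len(X_test)} test modules exceed the 0.8 threshold")
print(high_risk.head(10))
```

The threshold is a policy choice rather than a property of the model: lowering it flags more modules for testing at the cost of more false alarms, so it is worth tuning against the team's available QA capacity.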
