# 🛡️ GuardNet - Random Forest Training for Hybrid Detection

This notebook trains a lightweight Random Forest model for the hybrid phishing detection system in the GuardNet Chrome extension.

## Workflow:
1. Upload the dataset `PhiUSIIL_Phishing_URL_Dataset.csv`
2. Preprocess the features (column selection, missing-value filling, numeric coercion)
3. Train a Random Forest with hyperparameters chosen for in-browser use
4. Export the model to JSON for JavaScript
5. Download the file `rf_model.json`

---

## 📦 Step 1: Install Dependencies & Import Libraries

In [None]:
import pandas as pd
import numpy as np
import json
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from google.colab import files
import warnings
warnings.filterwarnings('ignore')

print("✅ Libraries imported successfully!")

## 📤 Step 2: Upload Dataset

Upload the file `PhiUSIIL_Phishing_URL_Dataset.csv` from the `code/` folder of the GuardNet project.

In [None]:
# Upload dataset
print("📂 Please upload the dataset file (PhiUSIIL_Phishing_URL_Dataset.csv)...")
uploaded = files.upload()

# Get the filename
filename = list(uploaded.keys())[0]
print(f"\n✅ File uploaded: {filename}")

In [None]:
# Load dataset
df = pd.read_csv(filename)
print(f"📊 Dataset shape: {df.shape}")
print(f"\n📋 Columns ({len(df.columns)} total):")
print(df.columns.tolist())
print("\n🔍 First 3 rows:")
df.head(3)

## 🔧 Step 3: Data Preprocessing

In [None]:
# Define the 50 features that match sandbox.js extraction
FEATURE_NAMES = [
    'URLLength', 'DomainLength', 'IsDomainIP', 'URLSimilarityIndex',
    'CharContinuationRate', 'TLDLegitimateProb', 'URLCharProb', 'TLDLength',
    'NoOfSubDomain', 'HasObfuscation', 'NoOfObfuscatedChar', 'ObfuscationRatio',
    'NoOfLettersInURL', 'LetterRatioInURL', 'NoOfDigitsInURL', 'DigitRatioInURL',
    'NoOfEqualsInURL', 'NoOfQMarkInURL', 'NoOfAmpersandInURL', 'NoOfOtherSpecialCharsInURL',
    'SpacialCharRatioInURL', 'IsHTTPS', 'LineOfCode', 'LargestLineLength',
    'HasTitle', 'DomainTitleMatchScore', 'URLTitleMatchScore', 'HasFavicon',
    'Robots', 'IsResponsive', 'NoOfURLRedirect', 'NoOfSelfRedirect',
    'HasDescription', 'NoOfPopup', 'NoOfiFrame', 'HasExternalFormSubmit',
    'HasSocialNet', 'HasSubmitButton', 'HasHiddenFields', 'HasPasswordField',
    'Bank', 'Pay', 'Crypto', 'HasCopyrightInfo',
    'NoOfImage', 'NoOfCSS', 'NoOfJS', 'NoOfSelfRef', 'NoOfEmptyRef', 'NoOfExternalRef'
]

# Check which features exist in the dataset
available_features = [f for f in FEATURE_NAMES if f in df.columns]
missing_features = [f for f in FEATURE_NAMES if f not in df.columns]

print(f"✅ Available features: {len(available_features)}/{len(FEATURE_NAMES)}")
if missing_features:
    print(f"⚠️ Missing features: {missing_features}")

In [None]:
# Prepare features and target
# Keep the expected column order; features absent from the dataset become 0
X = df.reindex(columns=FEATURE_NAMES, fill_value=0)

# Target variable: take the first matching column name (adjust the list as needed)
target_candidates = ['label', 'Label', 'phishing', 'Phishing', 'is_phishing', 'class', 'Class']
target_col = next((col for col in target_candidates if col in df.columns), None)

if target_col is None:
    # Stop here: later cells need y, so continuing would only fail with a NameError
    raise KeyError(f"❌ Target column not found! Available columns: {df.columns.tolist()}")

y = df[target_col]
print(f"✅ Target column: '{target_col}'")
print("\n📊 Class distribution:")
print(y.value_counts())

In [None]:
# Handle missing values
X = X.fillna(0)

# Convert any non-numeric to numeric
for col in X.columns:
    X[col] = pd.to_numeric(X[col], errors='coerce').fillna(0)

print(f"✅ Features prepared: {X.shape}")
print("\n📊 Feature statistics:")
X.describe().T.head(10)
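
To make the cleaning step concrete, here is a minimal sketch of how `pd.to_numeric(..., errors='coerce')` followed by `fillna(0)` treats bad values. The toy frame and its values are invented for illustration and are not taken from the dataset.

```python
import pandas as pd

# Toy frame (hypothetical values): a numeric string, junk text, and a missing value
toy = pd.DataFrame({
    "URLLength": ["31", "n/a", None],
    "IsHTTPS": [1, 0, 1],
})

# Same cleaning idea as above: unparseable entries become NaN, then NaN becomes 0
cleaned = toy.apply(lambda col: pd.to_numeric(col, errors="coerce")).fillna(0)

print(cleaned)
```

Note that after filling with 0, a missing value can no longer be told apart from a genuine zero, which is worth keeping in mind for count-style features.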

## 🎯 Step 4: Train Random Forest Model

The hyperparameters are chosen to keep the model file small and inference fast in the browser:
- `n_estimators=10` - few trees, to keep the exported file small
- `max_depth=5` - shallow trees, which limits both overfitting and the number of nodes
- `min_samples_leaf=20` - avoids leaf nodes backed by too few samples
- `min_samples_split=40` - a node needs at least 40 samples before it can be split
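
As a rough size check, a binary tree limited to depth `d` has at most `2^(d+1) - 1` nodes, so 10 trees of depth 5 contain at most 10 x 63 = 630 nodes. The sketch below computes that bound and compares it with a forest trained on synthetic data; `make_classification` stands in for the real dataset here, which is an assumption made only to keep the example self-contained.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

N_TREES, MAX_DEPTH = 10, 5

# Upper bound on nodes in one binary tree of the given depth: 2^(d+1) - 1
max_nodes_per_tree = 2 ** (MAX_DEPTH + 1) - 1
max_nodes_total = N_TREES * max_nodes_per_tree

# Synthetic stand-in data (not the PhiUSIIL dataset)
X_demo, y_demo = make_classification(n_samples=2000, n_features=50, random_state=42)

demo_rf = RandomForestClassifier(
    n_estimators=N_TREES,
    max_depth=MAX_DEPTH,
    min_samples_leaf=20,
    min_samples_split=40,
    random_state=42,
).fit(X_demo, y_demo)

# Count the nodes that actually ended up in the fitted trees
actual_nodes = sum(est.tree_.node_count for est in demo_rf.estimators_)
print(f"Upper bound: {max_nodes_total} nodes, actual: {actual_nodes} nodes")
```

The actual count is usually well below the bound, because `min_samples_leaf` and `min_samples_split` stop many branches before they reach the maximum depth.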

In [None]:
# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(f"📊 Training set: {X_train.shape[0]} samples")
print(f"📊 Test set: {X_test.shape[0]} samples")
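
The `stratify=y` argument keeps the class ratio the same in the training and test sets. A minimal sketch with an invented, imbalanced label vector (the 80/20 ratio is an assumption for illustration only):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical labels: 80% class 0, 20% class 1
y_toy = np.array([0] * 800 + [1] * 200)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

_, _, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# Both shares should sit at (or very near) the overall 20%
print(f"Class 1 share in train: {y_tr.mean():.3f}")
print(f"Class 1 share in test:  {y_te.mean():.3f}")
```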

In [None]:
# Train Random Forest with browser-optimized hyperparameters
rf_model = RandomForestClassifier(
    n_estimators=10,        # Few trees to keep the file small
    max_depth=5,            # Shallow trees
    min_samples_leaf=20,    # Avoid tiny leaf nodes
    min_samples_split=40,   # Minimum samples required to split a node
    random_state=42,
    n_jobs=-1               # Use all CPU cores
)

print("🚀 Training Random Forest...")
rf_model.fit(X_train, y_train)
print("✅ Training complete!")

In [None]:
# Evaluate model
y_pred = rf_model.predict(X_test)
y_pred_proba = rf_model.predict_proba(X_test)

print("📊 Model Evaluation:")
print("=" * 50)
print(f"\n✅ Accuracy: {accuracy_score(y_test, y_pred):.4f}")
print("\n📋 Classification Report:")
print(classification_report(y_test, y_pred))
print("\n🔢 Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))
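
For a phishing detector the two error types carry different costs, so it can help to read the false-positive and false-negative rates off the confusion matrix. The labels below are invented, and treating `1` as the positive class is an assumption; check how the dataset encodes its label column before reusing this on real predictions.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Invented labels for illustration; 1 is assumed to be the positive class
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])
y_hat = np.array([0, 0, 0, 1, 1, 1, 1, 1, 0, 0])

# With labels=[0, 1], ravel() yields tn, fp, fn, tp in that order
tn, fp, fn, tp = confusion_matrix(y_true, y_hat, labels=[0, 1]).ravel()

false_positive_rate = fp / (fp + tn)  # negatives wrongly flagged as positive
false_negative_rate = fn / (fn + tp)  # positives that were missed

print(f"FPR: {false_positive_rate:.2f}, FNR: {false_negative_rate:.2f}")
```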

In [None]:
# Feature importance
feature_importance = pd.DataFrame({
    'feature': FEATURE_NAMES,
    'importance': rf_model.feature_importances_
}).sort_values('importance', ascending=False)

print("🏆 Top 15 Most Important Features:")
print("=" * 50)
for _, row in feature_importance.head(15).iterrows():
    bar = "█" * int(row['importance'] * 50)
    print(f"{row['feature']:25s} {row['importance']:.4f} {bar}")

## 📦 Step 5: Export Model to JSON

Convert the sklearn Random Forest to a JSON format that JavaScript can read in the browser.

In [None]:
def tree_to_json(tree, feature_names):
    """
    Convert an sklearn DecisionTree to a nested dict for the JavaScript classifier.

    Split nodes carry featureIndex/threshold/left/right; leaves carry the
    class probabilities, ordered like the classifier's classes_ attribute.
    feature_names is kept for call compatibility; nodes store feature indices.
    """
    tree_ = tree.tree_

    def recurse(node):
        left_child = int(tree_.children_left[node])
        right_child = int(tree_.children_right[node])

        if left_child != right_child:  # Split node (a leaf has both children set to -1)
            return {
                "featureIndex": int(tree_.feature[node]),
                # Rounding keeps the file small; values within 1e-4 of a threshold may route differently
                "threshold": round(float(tree_.threshold[node]), 4),
                "left": recurse(left_child),
                "right": recurse(right_child)
            }

        # Leaf node: normalize the stored class values into probabilities
        value = tree_.value[node][0]
        total = value.sum()
        return {"value": [round(float(v / total), 4) for v in value]}

    return recurse(0)


import os


def export_rf_to_json(rf_model, feature_names, accuracy, output_path='rf_model.json'):
    """
    Export Random Forest model to JSON format for JavaScript.

    accuracy is passed in explicitly so the function does not depend on
    notebook globals such as y_test and y_pred.
    """
    trees_json = []
    for i, tree in enumerate(rf_model.estimators_):
        tree_json = tree_to_json(tree, feature_names)
        trees_json.append(tree_json)
        print(f"  Tree {i+1}/{len(rf_model.estimators_)} exported")

    model_json = {
        "model_info": {
            "name": "GuardNet Random Forest",
            "version": "1.0.0",
            "description": "Random Forest for phishing detection - trained on PhiUSIIL dataset",
            "n_estimators": len(rf_model.estimators_),
            "max_depth": rf_model.max_depth,
            "accuracy": round(float(accuracy), 4),
            "trained_on": "PhiUSIIL_Phishing_URL_Dataset"
        },
        "feature_names": feature_names,
        "n_estimators": len(rf_model.estimators_),
        "max_depth": rf_model.max_depth,
        "trees": trees_json
    }

    with open(output_path, 'w') as f:
        json.dump(model_json, f, indent=2)

    # Calculate file size in KB
    file_size = os.path.getsize(output_path) / 1024

    return output_path, file_size


print("📦 Exporting Random Forest to JSON...")
test_accuracy = accuracy_score(y_test, y_pred)
output_path, file_size = export_rf_to_json(rf_model, FEATURE_NAMES, test_accuracy)
print(f"\n✅ Model exported to: {output_path}")
print(f"📁 File size: {file_size:.2f} KB")

In [None]:
# Verify the exported JSON
with open('rf_model.json', 'r') as f:
    exported_model = json.load(f)

print("📋 Exported Model Structure:")
print("=" * 50)
print(f"  n_estimators: {exported_model['n_estimators']}")
print(f"  max_depth: {exported_model['max_depth']}")
print(f"  features: {len(exported_model['feature_names'])}")
print(f"  trees: {len(exported_model['trees'])}")
print("\n📊 Model Info:")
for key, value in exported_model['model_info'].items():
    print(f"  {key}: {value}")
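
One way to gain confidence in the export is to walk the JSON trees in Python the way a JavaScript classifier presumably would (go left when the feature value is `<=` the threshold, then average the leaf probabilities across trees) and compare the result with `predict_proba`. The sketch below is self-contained: it trains on synthetic data and uses its own small `to_dict` helper that mirrors the node layout of `tree_to_json`. The names `to_dict` and `predict_from_json` are hypothetical and not part of GuardNet, and the actual traversal in the extension may differ.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier


def to_dict(tree_, node=0):
    """Mirror of the export layout: split nodes and leaf 'value' lists."""
    left, right = tree_.children_left[node], tree_.children_right[node]
    if left == right:  # leaf: both child ids are -1
        counts = tree_.value[node][0]
        return {"value": [float(v) for v in counts / counts.sum()]}
    return {
        "featureIndex": int(tree_.feature[node]),
        "threshold": float(tree_.threshold[node]),  # unrounded, to isolate the traversal logic
        "left": to_dict(tree_, left),
        "right": to_dict(tree_, right),
    }


def predict_from_json(trees, x):
    """Average the leaf probabilities over all trees for one feature vector."""
    probs = []
    for node in trees:
        while "value" not in node:
            # sklearn sends a sample left when its feature value <= threshold
            feature_value = float(x[node["featureIndex"]])
            node = node["left"] if feature_value <= node["threshold"] else node["right"]
        probs.append(node["value"])
    return np.mean(probs, axis=0)


# Synthetic stand-in data (not the PhiUSIIL dataset)
X_demo, y_demo = make_classification(n_samples=1000, n_features=10, random_state=0)
# sklearn trees evaluate features as float32, so the manual walk does the same
X32 = X_demo.astype(np.float32)

demo_rf = RandomForestClassifier(n_estimators=5, max_depth=4, random_state=0).fit(X32, y_demo)

trees = [to_dict(est.tree_) for est in demo_rf.estimators_]
manual = np.array([predict_from_json(trees, row) for row in X32])
max_diff = np.abs(manual - demo_rf.predict_proba(X32)).max()
print(f"Largest probability difference: {max_diff:.2e}")
```

Rerunning the same comparison with thresholds and probabilities rounded to 4 decimals, as in the export above, shows how much the rounding costs, if anything. The order of the probabilities in each leaf follows `rf_model.classes_`, so the JavaScript side needs to know which index corresponds to the phishing class.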

## 📥 Step 6: Download Model File

Download the file `rf_model.json` and place it in the `models/` folder of the GuardNet project.

In [None]:
# Download the model file
print("📥 Downloading rf_model.json...")
files.download('rf_model.json')
print("\n✅ Download complete!")
print("\n📝 Next steps:")
print("  1. Place rf_model.json in the folder: GuardNet Test 1.2/models/")
print("  2. Replace the existing placeholder rf_model.json")
print("  3. Reload the extension in Chrome")
print("  4. Test with phishing and legitimate URLs")

---

## 🎉 Done!

The Random Forest model has been exported to `rf_model.json`. The JavaScript classifier in the GuardNet extension uses this file for hybrid detection alongside the Logistic Regression model.

### Model File Structure:
```
models/
├── model.json           # TensorFlow.js (Logistic Regression)
├── scaler_params.json   # StandardScaler parameters
├── rf_model.json        # Random Forest (from this notebook)
└── group1-shard1of1.bin # TF.js weights
```