# Homework 5: Random Forest Accuracy Improvement

This assignment is inspired by examples from Shan-Hung Wu of National Tsing Hua University.

Requirement: improve the accuracy per feature (test accuracy divided by the number of features used) of the following code from 0.03 to at least 0.45, while keeping the accuracy above 0.92.
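
To make the target concrete, the arithmetic behind the requirement can be worked through in a few lines. This is only a sketch: the helper names `accuracy_per_feature` and `max_features_for_target` are hypothetical, and the 0.96 baseline accuracy is an assumed illustrative value, not a measured one. The only fixed facts used are that the breast cancer dataset has 30 features and that accuracy cannot exceed 1.0.

```python
import math


def accuracy_per_feature(accuracy: float, n_features: int) -> float:
    """Ratio used in this assignment: accuracy divided by the feature count."""
    if n_features < 1:
        raise ValueError("n_features must be at least 1")
    return accuracy / n_features


def max_features_for_target(accuracy: float, target_ratio: float) -> int:
    """Largest feature count that still reaches target_ratio at this accuracy."""
    return math.floor(accuracy / target_ratio)


# Using all 30 features with an assumed accuracy of 0.96 gives a ratio near 0.03.
print(round(accuracy_per_feature(0.96, 30), 3))  # 0.032

# Even a perfect classifier (accuracy 1.0) can use at most 2 features
# and still reach a ratio of 0.45.
print(max_features_for_target(1.0, 0.45))   # 2
print(max_features_for_target(0.92, 0.45))  # 2
```

Under these assumptions, the requirement amounts to finding one or two features that keep the accuracy above 0.92.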

Here are three hints:

1. You can improve the ratio (accuracy per feature) by selecting a small subset of features or by "creating" new ones.
2. Tune the hyperparameters.
3. The ratio can be improved from 0.03 up to 0.47.
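
The first hint mentions "creating" features. One possible reading, shown here as a sketch and not as the intended solution, is to build a small number of new features from the original 30 with PCA and train the forest on those. The choice of PCA, the `StandardScaler` step, and `n_created = 2` are all assumptions made for illustration; the resulting accuracy depends on the scikit-learn version and is not claimed here.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Assumed number of created features (hypothetical choice for this sketch).
n_created = 2

# Scale, project onto n_created principal components, then classify.
# The pipeline fits the scaler and PCA on the training data only.
model = make_pipeline(
    StandardScaler(),
    PCA(n_components=n_created, random_state=42),
    RandomForestClassifier(random_state=42),
)
model.fit(X_train, y_train)

accuracy = accuracy_score(y_test, model.predict(X_test))
print("Accuracy:", accuracy)
print("Accuracy per created feature:", accuracy / n_created)
```

Note that each principal component is a combination of all 30 original measurements, so whether this counts as "two features" is a matter of how the assignment is interpreted.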

from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Load the breast cancer dataset as a feature matrix X and a label vector y
X, y = load_breast_cancer(return_X_y=True)

# Split the data into training and testing sets using train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train RandomForestClassifier to get feature importances
rf_classifier = RandomForestClassifier(random_state=42)
rf_classifier.fit(X_train, y_train)

# Keep only the features whose importance is at least 0.12
# (SelectFromModel refits its own clone of the classifier)
sfm = SelectFromModel(rf_classifier, threshold=0.12)
sfm.fit(X_train, y_train)
X_train_selected = sfm.transform(X_train)
X_test_selected = sfm.transform(X_test)

# 5-fold cross-validation of the default model on all training features (baseline)
cv_scores = cross_val_score(rf_classifier, X_train, y_train, cv=5, scoring='accuracy')
print("Cross-Validation Scores:", cv_scores)

# Tune parameters for RandomForestClassifier
rf_classifier = RandomForestClassifier(random_state=42)

# Define parameter grid
param_grid = {
    'n_estimators': [10, 25, 50],  # number of trees in the forest
    'max_depth': [None, 5, 10],
    'min_samples_split': [2, 5, 10],
}

# Use GridSearchCV to find the best parameters
grid_search = GridSearchCV(rf_classifier, param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train_selected, y_train)

# Get the best parameters
best_params = grid_search.best_params_
print("Best Parameters:", best_params)

# Train RandomForestClassifier with the best parameters
best_rf_classifier = RandomForestClassifier(random_state=42, **best_params)
best_rf_classifier.fit(X_train_selected, y_train)

# Make predictions and calculate accuracy on the test set
y_pred_best = best_rf_classifier.predict(X_test_selected)

# Calculate accuracy score
accuracy_best = accuracy_score(y_test, y_pred_best)
print("Best Accuracy:", accuracy_best)

# Calculate accuracy per feature (test accuracy / number of selected features)
average_accuracy_per_feature = accuracy_best / X_test_selected.shape[1]
print("Average (accuracy per feature):", average_accuracy_per_feature)



Cross-Validation Scores: [0.97802198 0.94505495 0.97802198 0.95604396 0.93406593]
Best Parameters: {'max_depth': 5, 'min_samples_split': 2, 'n_estimators': 50}
Best Accuracy: 0.956140350877193
Average (accuracy per feature): 0.4780701754385965
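
The printed ratio is exactly half of the printed accuracy, which implies that two features passed the 0.12 importance threshold. To see which ones, the fitted selector can be inspected. The following is a minimal sketch that repeats the selection step from the code above with the same split and seed; the variable names `mask`, `selected_names`, and `importances` are introduced here for illustration, and the exact importances may vary with the scikit-learn version.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
X, y = data.data, data.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Same selector as in the assignment code: importance threshold of 0.12.
sfm = SelectFromModel(RandomForestClassifier(random_state=42), threshold=0.12)
sfm.fit(X_train, y_train)

# Boolean mask over the 30 original features: True means the feature is kept.
mask = sfm.get_support()
selected_names = data.feature_names[mask]
importances = sfm.estimator_.feature_importances_[mask]

for name, importance in zip(selected_names, importances):
    print(f"{name}: {importance:.3f}")
print("Number of selected features:", int(mask.sum()))
```

Because the threshold acts on importances and not on a fixed count, raising or lowering it is the knob that changes how many features survive.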
