# Anemia Classification using Random Forest

## 1. Install Required Libraries

Before running the notebook, make sure to install all required libraries.



In [2]:
%pip install pandas scikit-learn numpy

Note: you may need to restart the kernel to use updated packages.


In [3]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV, StratifiedKFold, cross_val_score


In [4]:
# Load the unprocessed data
data = pd.read_csv('../Datas/anemia.csv')

# Separate features (X) and target variable (y)
X = data.drop('Result', axis=1)
y = data['Result']

# Apply Min-Max Scaling to every feature column except 'Gender'.
# Note: the scaler is fit on the full dataset, before the train/test split.
min_max_scaler = MinMaxScaler()
cols_to_scale = [col for col in X.columns if col != 'Gender']
X_scaled = X.copy()  # 'Gender' is carried over unchanged by the copy
X_scaled[cols_to_scale] = min_max_scaler.fit_transform(X[cols_to_scale])

# Add the target variable back to the processed dataset
processed_data = X_scaled
processed_data['Result'] = y

# Save the processed data to a new CSV file
processed_data.to_csv('../Datas/processed_data.csv', index=False)
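
The cell above fits `MinMaxScaler` on every row before the train/test split, so the test rows contribute to the scaling range. One possible alternative, sketched here, wraps the scaler and the classifier in a `Pipeline` so that the scaler is fit on the training rows only. The data is synthetic, and the column names `feature_a` and `feature_b` are hypothetical stand-ins for the non-`Gender` columns of `anemia.csv`; treat this as an illustration under those assumptions, not as the notebook's own method.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for anemia.csv (assumption: 'feature_a'/'feature_b' are made-up names).
rng = np.random.default_rng(42)
demo = pd.DataFrame({
    "Gender": rng.integers(0, 2, size=200),
    "feature_a": rng.normal(13.0, 2.0, size=200),
    "feature_b": rng.normal(30.0, 4.0, size=200),
})
demo["Result"] = (demo["feature_a"] < 12.5).astype(int)

X_demo = demo.drop("Result", axis=1)
y_demo = demo["Result"]
numeric_cols = [c for c in X_demo.columns if c != "Gender"]

# Scale only the numeric columns; 'Gender' passes through untouched.
preprocess = ColumnTransformer(
    transformers=[("scale", MinMaxScaler(), numeric_cols)],
    remainder="passthrough",
)
pipeline = Pipeline([
    ("preprocess", preprocess),
    ("model", RandomForestClassifier(n_estimators=100, random_state=42)),
])

# Split first, then fit: the scaler only ever sees the training rows.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42
)
pipeline.fit(X_tr, y_tr)
test_accuracy = pipeline.score(X_te, y_te)
print(f"Test accuracy: {test_accuracy:.2f}")
```

With this layout the same fitted pipeline can be applied to new rows via `pipeline.predict`, without re-running the scaling step by hand.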


In [5]:

# Load the processed dataset
data = pd.read_csv('../Datas/processed_data.csv')

# Separate features (X) and target variable (y)
X = data.drop('Result', axis=1)
y = data['Result']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
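
The split above does not pass `stratify`, so the class proportions in the training and test sets depend on the random shuffle. A minimal sketch of a stratified split follows, using synthetic labels (an assumption; the real `Result` column is not used here) to show that `stratify=y` keeps the class ratio nearly identical in both subsets.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 56 negatives and 44 positives (illustrative proportions only).
X_all = np.arange(100).reshape(-1, 1)
y_all = np.array([0] * 56 + [1] * 44)

# stratify=y_all asks train_test_split to preserve the class ratio in each subset.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_all, y_all, test_size=0.3, stratify=y_all, random_state=42
)

print("Positive rate overall:", y_all.mean())
print("Positive rate in train:", y_tr.mean())
print("Positive rate in test:", y_te.mean())
```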

In [6]:


# Initialize models
models = {
    'Logistic Regression': LogisticRegression(max_iter=1000),
    'Decision Tree': DecisionTreeClassifier(),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
    'Support Vector Machine': SVC(),
    'K-Nearest Neighbors': KNeighborsClassifier(),
    'Gradient Boosting': GradientBoostingClassifier()
}

# Train, predict and evaluate each model
for name, model in models.items():
    # Train the model
    model.fit(X_train, y_train)
    
    # Make predictions
    y_pred = model.predict(X_test)
    
    # Evaluate the model
    accuracy = accuracy_score(y_test, y_pred)
    conf_matrix = confusion_matrix(y_test, y_pred)
    class_report = classification_report(y_test, y_pred)
    
    print(f"Model: {name}")
    print(f"Accuracy: {accuracy:.2f}")
    print("Confusion Matrix:")
    print(conf_matrix)
    print("Classification Report:")
    print(class_report)
    print("\n" + "-"*50 + "\n")


Model: Logistic Regression
Accuracy: 0.98
Confusion Matrix:
[[237   8]
 [  0 182]]
Classification Report:
              precision    recall  f1-score   support

           0       1.00      0.97      0.98       245
           1       0.96      1.00      0.98       182

    accuracy                           0.98       427
   macro avg       0.98      0.98      0.98       427
weighted avg       0.98      0.98      0.98       427


--------------------------------------------------

Model: Decision Tree
Accuracy: 1.00
Confusion Matrix:
[[245   0]
 [  0 182]]
Classification Report:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00       245
           1       1.00      1.00      1.00       182

    accuracy                           1.00       427
   macro avg       1.00      1.00      1.00       427
weighted avg       1.00      1.00      1.00       427


--------------------------------------------------

Model: Random Forest
Accuracy: 1.0
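
Accuracy at or near 1.00 on a single 70/30 split is worth cross-checking on more than one partition of the data. The imports at the top of the notebook include `StratifiedKFold` and `cross_val_score`, which could be used for that. The sketch below shows one way to do it; it runs on synthetic data from `make_classification` (an assumption, so that the example is self-contained) and compares only three of the models.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data standing in for the processed dataset.
X_cv, y_cv = make_classification(
    n_samples=300, n_features=5, n_informative=3, n_redundant=0, random_state=42
)

# Five stratified folds, shuffled once with a fixed seed for repeatability.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

candidates = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=42),
}

cv_results = {}
for name, model in candidates.items():
    # One accuracy score per fold; mean and spread summarize stability.
    scores = cross_val_score(model, X_cv, y_cv, cv=cv, scoring="accuracy")
    cv_results[name] = (scores.mean(), scores.std())
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```

A large spread across folds, or a mean well below the single-split score, would suggest that the single-split result was optimistic.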

## Optimal Machine Learning Algorithm

In [8]:
# Load the processed dataset
data = pd.read_csv('../Datas/processed_combined_data.csv')
X = data.drop(columns=['Result'])
y = data['Result']


# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Define the model
model = RandomForestClassifier(n_estimators=100, random_state=42)

# Fit the model
model.fit(X_train, y_train)

# Make predictions
y_train_pred = model.predict(X_train)
y_test_pred = model.predict(X_test)

# Evaluate the model
print("Training Accuracy:", accuracy_score(y_train, y_train_pred))
print("Test Accuracy:", accuracy_score(y_test, y_test_pred))

print("\nTraining Classification Report:")
print(classification_report(y_train, y_train_pred))

print("\nTest Classification Report:")
print(classification_report(y_test, y_test_pred))

print("\nConfusion Matrix (Test Data):")
print(confusion_matrix(y_test, y_test_pred))

Training Accuracy: 1.0
Test Accuracy: 1.0

Training Classification Report:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00       556
           1       1.00      1.00      1.00       438

    accuracy                           1.00       994
   macro avg       1.00      1.00      1.00       994
weighted avg       1.00      1.00      1.00       994


Test Classification Report:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00       245
           1       1.00      1.00      1.00       182

    accuracy                           1.00       427
   macro avg       1.00      1.00      1.00       427
weighted avg       1.00      1.00      1.00       427


Confusion Matrix (Test Data):
[[245   0]
 [  0 182]]
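
The model above uses fixed hyperparameters (`n_estimators=100`). The notebook imports `GridSearchCV` but the cells shown here do not call it, so the following is only a sketch of how a small search could look. The data is synthetic and the parameter grid is an arbitrary illustrative choice, not a recommendation for the anemia dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split

# Synthetic stand-in for the processed dataset.
X_gs, y_gs = make_classification(
    n_samples=300, n_features=5, n_informative=3, n_redundant=0, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_gs, y_gs, test_size=0.3, stratify=y_gs, random_state=42
)

# Small, illustrative grid: 2 x 2 = 4 candidate settings.
param_grid = {"n_estimators": [25, 50], "max_depth": [3, None]}

search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=42),
    scoring="accuracy",
)
# The search only sees the training split; the test split stays held out.
search.fit(X_tr, y_tr)

print("Best parameters:", search.best_params_)
print("Best cross-validated accuracy:", round(search.best_score_, 3))
print("Held-out test accuracy:", round(search.score(X_te, y_te), 3))
```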


## Why Random Forest

### Selection Criteria

#### Generalization Ability

Random Forest and Gradient Boosting usually generalize better than a single Decision Tree because they combine the predictions of many trees. Random Forest averages trees trained on bootstrap samples, which reduces variance; Gradient Boosting adds trees sequentially, each one correcting the errors of the previous ones.

#### Computational Resources

If computational resources are limited, a Decision Tree may be more suitable because it trains a single tree and is therefore simpler and faster.

#### Model Complexity

The ensemble models (Random Forest and Gradient Boosting) often perform better, but they require more computation and training time.

### Recommendation

If your dataset is small or medium-sized and computational resources are not a constraint, Random Forest or Gradient Boosting is the recommended choice. These models usually provide better overall performance.

If model interpretability matters or computational resources are limited, a Decision Tree may be a good option. However, a single decision tree is best suited to simpler scenarios; on more complex datasets, ensemble methods such as Random Forest or Gradient Boosting tend to perform better.
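
The trade-off between training cost and model type can be measured rather than assumed. The sketch below times `fit` for the three tree-based models on synthetic data (an assumption made so the example runs on its own). Absolute timings depend on the machine and dataset size, so only the relative ordering is informative.

```python
import time

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic data; size chosen only to keep the example quick.
X_t, y_t = make_classification(
    n_samples=1000, n_features=5, n_informative=3, n_redundant=0, random_state=42
)

timed_models = {
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "Gradient Boosting": GradientBoostingClassifier(random_state=42),
}

fit_times = {}
for name, model in timed_models.items():
    start = time.perf_counter()
    model.fit(X_t, y_t)
    fit_times[name] = time.perf_counter() - start
    print(f"{name}: fit in {fit_times[name]:.4f} s")
```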

