# Data Processing and Machine Learning Pipeline

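This notebook preprocesses the data, engineers additional features, and trains and evaluates several classification models. Although SMOTE is named as the balancing method, none of the cells below resamples the data, so the models see the original class distribution.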

The following steps are included:
1. Data loading and preprocessing
2. Feature engineering and scaling
3. Model training and evaluation
4. Cross-validation for Random Forest

In [9]:
%pip install pandas numpy scikit-learn

Note: you may need to restart the kernel to use updated packages.


In [10]:
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import LabelEncoder, MinMaxScaler, OneHotEncoder, StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier


## Data Loading and Preprocessing

This section loads the data, encodes the categorical Gender and Anaemic columns, and min-max scales the pixel and Hb columns. Derived features are created in the next section.

In [11]:
# Load the dataset
data = pd.read_csv('..//Datas/output.csv')

# Convert 'Gender' to 0/1, stripping stray whitespace such as 'F ' or 'M ' first
data['Gender'] = data['Gender'].str.strip().map({'F': 0, 'M': 1})

# Encode the 'Anaemic' column
label_encoder = LabelEncoder()
data['Anaemic'] = label_encoder.fit_transform(data['Anaemic'])

# One-hot encode 'Anaemic'; with drop='first', a two-class column yields the single column 'Anaemic_1'
one_hot_encoder = OneHotEncoder(drop='first', sparse_output=False)
anaemic_encoded = one_hot_encoder.fit_transform(data[['Anaemic']])
anaemic_encoded_df = pd.DataFrame(anaemic_encoded, columns=one_hot_encoder.get_feature_names_out(['Anaemic']))

# Drop the old 'Anaemic' column and add new one-hot encoded columns
data = pd.concat([data.drop(columns=['Anaemic']), anaemic_encoded_df], axis=1)

# Min-max scale the pixel and Hb columns to [0, 1]; the scaler is fit on the full dataset, before any train/test split
min_max_scaler = MinMaxScaler()
columns_to_scale = ['%Red Pixel', '%Green pixel', '%Blue pixel', 'Hb']
data[columns_to_scale] = min_max_scaler.fit_transform(data[columns_to_scale])
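
The Gender conversion has to cope with values that carry stray whitespace. A minimal sketch of one way to handle this, assuming surrounding whitespace is the only irregularity, strips the strings before mapping and then checks that nothing was left unmapped. The toy frame is made up for illustration and does not come from output.csv.

```python
import pandas as pd

# Hypothetical toy data; the real values come from output.csv
toy = pd.DataFrame({'Gender': ['F', 'M ', ' F', 'M']})

# Strip surrounding whitespace so variants like 'M ' collapse to 'M' before mapping
gender_clean = toy['Gender'].str.strip().map({'F': 0, 'M': 1})

# Any value outside the mapping becomes NaN; fail early if that happens
assert gender_clean.notna().all(), 'unmapped Gender values found'

print(gender_clean.tolist())
```

The explicit check matters because map silently produces NaN for any label it does not know, and those NaN values would otherwise only surface later, during model fitting.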


### Feature Engineering

This section derives additional features: the average of the three pixel channels, absolute differences between channels, and a log transform of Hb.

In [12]:
# Average of the three (already scaled) colour channels
data['Average Pixel'] = data[['%Red Pixel', '%Green pixel', '%Blue pixel']].mean(axis=1)

# Absolute differences between channels
data['Red-Blue Difference'] = (data['%Red Pixel'] - data['%Blue pixel']).abs()
data['Green-Red Difference'] = (data['%Green pixel'] - data['%Red Pixel']).abs()

# Apply MinMaxScaler to the difference columns
data[['Red-Blue Difference', 'Green-Red Difference']] = min_max_scaler.fit_transform(data[['Red-Blue Difference', 'Green-Red Difference']])

# Calculate log(1 + Hb); Hb is already min-max scaled to [0, 1] at this point
data['Log Hb'] = np.log1p(data['Hb'])

# Replace zero values in 'Log Hb' with the minimum positive value
min_positive_value = data[data['Log Hb'] > 0]['Log Hb'].min()
data['Log Hb'] = data['Log Hb'].replace(0, min_positive_value)
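
To make the feature definitions concrete, here is a small worked sketch on made-up, already-scaled values (the numbers are hypothetical, not taken from the dataset). It shows that a scaled Hb of 0 gives a log value of exactly 0, which is the case the zero-replacement step is there to handle.

```python
import numpy as np
import pandas as pd

# Hypothetical values, assumed to be min-max scaled to [0, 1] already
toy = pd.DataFrame({
    '%Red Pixel':   [0.2, 0.8, 0.5],
    '%Green pixel': [0.5, 0.1, 0.5],
    '%Blue pixel':  [0.8, 0.3, 0.5],
    'Hb':           [0.0, 0.5, 1.0],
})

# Row-wise mean of the three channels
toy['Average Pixel'] = toy[['%Red Pixel', '%Green pixel', '%Blue pixel']].mean(axis=1)

# Absolute channel difference
toy['Red-Blue Difference'] = (toy['%Red Pixel'] - toy['%Blue pixel']).abs()

# log(1 + Hb); the row with the minimum Hb (scaled to 0) maps to exactly 0
toy['Log Hb'] = np.log1p(toy['Hb'])

# Replace exact zeros with the smallest positive log value, as in the cell above
min_positive = toy.loc[toy['Log Hb'] > 0, 'Log Hb'].min()
toy['Log Hb'] = toy['Log Hb'].replace(0, min_positive)

print(toy[['Average Pixel', 'Red-Blue Difference', 'Log Hb']])
```

After the replacement, the row with the lowest Hb shares its Log Hb value with the next-lowest row, so the two become indistinguishable on that feature.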


### Additional Columns and Risk Calculation

This section adds two interaction features (a red-blue product and an Hb-to-red ratio) and a Risk flag, which is set when the raw Hb value is below 12 and the raw %Red Pixel value exceeds 0.7.

In [13]:
# Additional columns and risk calculations
data['Red-Blue Product'] = data['%Red Pixel'] * data['%Blue pixel']
data['Hb to Red Ratio'] = data['Hb'] / (data['%Red Pixel'] + 1)

# Reload the original dataset so the risk rule uses raw (unscaled) Hb and %Red Pixel values
original_data = pd.read_csv('..//Datas/output.csv')
original_data['Risk'] = np.where((original_data['Hb'] < 12) & (original_data['%Red Pixel'] > 0.7), 1, 0)

# Map the risk values to the processed data
original_data.set_index('Number', inplace=True)
risk_mapping = original_data['Risk'].to_dict()
data['Risk'] = data['Number'].map(risk_mapping)

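# Recode the no-risk value from 0 to 0.5; rows flagged as at-risk stay at 1
data['Risk'] = np.where(data['Risk'] == 0, 0.5, data['Risk'])

# Save the processed data; despite the name, no resampling has been applied at this point
balanced_data = data
balanced_data.to_csv('..//Datas/Balanced_Processed_Data.csv', index=False)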
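
The risk rule and the lookup by Number can be traced on a few rows. The sketch below uses made-up values; in particular, whether %Red Pixel is stored as a fraction or as a percentage in output.csv is not stated, so the toy numbers are an assumption chosen to exercise both branches of the rule.

```python
import numpy as np
import pandas as pd

# Hypothetical raw (unscaled) rows
raw = pd.DataFrame({
    'Number':     [1, 2, 3],
    'Hb':         [10.5, 13.2, 11.0],
    '%Red Pixel': [0.75, 0.80, 0.60],
})

# 1.0 when both conditions hold, otherwise 0.5 (the notebook's "normal risk" value)
raw['Risk'] = np.where((raw['Hb'] < 12) & (raw['%Red Pixel'] > 0.7), 1.0, 0.5)

# Processed rows may be in a different order; align on 'Number' rather than position
processed = pd.DataFrame({'Number': [3, 1, 2]})
risk_by_number = raw.set_index('Number')['Risk']
processed['Risk'] = processed['Number'].map(risk_by_number)

print(processed)
```

Mapping through Number keeps the risk values attached to the right rows even if the processed frame has been reordered or filtered since loading.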


## Model Training and Evaluation

This section trains five classifiers (logistic regression, random forest, support vector machine, k-nearest neighbours, and decision tree) on a 70/30 train/test split of the processed data. Each model is assessed on the test set using accuracy, a confusion matrix, and a classification report.

In [14]:
# Load the updated dataset
data = pd.read_csv('..//Datas/Balanced_Processed_Data.csv')

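# Separate features and target variable.
# Note: X still contains 'Number' (a row identifier) and the Hb-derived columns
# ('Hb', 'Log Hb', 'Hb to Red Ratio', 'Risk'); if the Anaemic label was assigned
# from Hb, these columns leak the target into the features.
X = data.drop(columns=['Anaemic_1'])
y = data['Anaemic_1']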

# Split data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Scale features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Define the models to compare (all with default hyperparameters)
models = {
    'Logistic Regression': LogisticRegression(),
    'Random Forest': RandomForestClassifier(),
    'Support Vector Machine': SVC(),
    'K-Nearest Neighbors': KNeighborsClassifier(),
    'Decision Tree': DecisionTreeClassifier()
}

# Train each model and evaluate it on the held-out test set
for model_name, model in models.items():
    print(f'Training {model_name}...')
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    
    print(f'Results for {model_name}:')
    print('Accuracy:', accuracy_score(y_test, y_pred))
    print('Confusion Matrix:\n', confusion_matrix(y_test, y_pred))
    print('Classification Report:\n', classification_report(y_test, y_pred))
    print('='*50)


Training Logistic Regression...
Results for Logistic Regression:
Accuracy: 0.9375
Confusion Matrix:
 [[24  1]
 [ 1  6]]
Classification Report:
               precision    recall  f1-score   support

         0.0       0.96      0.96      0.96        25
         1.0       0.86      0.86      0.86         7

    accuracy                           0.94        32
   macro avg       0.91      0.91      0.91        32
weighted avg       0.94      0.94      0.94        32

Training Random Forest...
Results for Random Forest:
Accuracy: 1.0
Confusion Matrix:
 [[25  0]
 [ 0  7]]
Classification Report:
               precision    recall  f1-score   support

         0.0       1.00      1.00      1.00        25
         1.0       1.00      1.00      1.00         7

    accuracy                           1.00        32
   macro avg       1.00      1.00      1.00        32
weighted avg       1.00      1.00      1.00        32

Training Support Vector Machine...
Results for Support Vector Machine:
Ac
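
A perfect Random Forest score on a 32-row test set is worth a sanity check. The sketch below, on made-up data, shows one way to set that up: hold out the identifier and Hb-derived columns, stratify the split, and put the scaler inside a pipeline so that it is fit on the training rows only. The labelling rule (Hb below a threshold) and the list of excluded columns are assumptions for illustration, not something the dataset documents.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical data with the same kind of columns as the processed file
rng = np.random.default_rng(0)
n = 100
toy = pd.DataFrame({
    'Number': np.arange(n),
    '%Red Pixel': rng.random(n),
    '%Green pixel': rng.random(n),
    'Hb': rng.random(n),
})
toy['Log Hb'] = np.log1p(toy['Hb'])
# Assumed labelling rule, used only to build the toy target
toy['Anaemic_1'] = (toy['Hb'] < 0.3).astype(int)

# Assumption: these columns identify the row or encode Hb, so they are held out
excluded = ['Number', 'Hb', 'Log Hb', 'Anaemic_1']
X_demo = toy.drop(columns=excluded)
y_demo = toy['Anaemic_1']

# stratify=y keeps the class ratio the same in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

# The scaler lives inside the pipeline, so it only ever sees training rows when fit
clf = make_pipeline(StandardScaler(), LogisticRegression())
clf.fit(X_tr, y_tr)
acc = clf.score(X_te, y_te)
print('Accuracy without Hb-derived features:', acc)
```

If accuracy on the real data drops sharply once the Hb-derived columns are removed, the earlier scores were probably driven by those columns rather than by the pixel features.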

## Cross-Validation for Random Forest

This section runs 5-fold cross-validation on the Random Forest model over the full dataset, which gives an estimate that depends less on one particular train/test split.

In [15]:
# Evaluate Random Forest with 5-fold cross-validation on all rows of X (not standardized; includes the earlier test rows)
rf_model = RandomForestClassifier(random_state=42)
cv_scores = cross_val_score(rf_model, X, y, cv=5)

# Print cross-validation results
print('Cross-validation Accuracy Scores: ', cv_scores)
print('Average Accuracy: ', cv_scores.mean())


Cross-validation Accuracy Scores:  [0.95238095 1.         1.         1.         0.9       ]
Average Accuracy:  0.9704761904761906
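
For a classifier, passing an integer cv to cross_val_score already uses stratified folds, but without shuffling. A minimal sketch, on synthetic data from make_classification rather than the notebook's dataset, passes an explicit StratifiedKFold with shuffling and reports the spread alongside the mean. The sample count and class weights are assumptions chosen to roughly resemble a small, imbalanced dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in data: small and imbalanced (assumed shape, not the real dataset)
X_syn, y_syn = make_classification(
    n_samples=104, n_features=8, weights=[0.75, 0.25], random_state=42
)

# Shuffled, stratified folds with a fixed seed for reproducibility
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(
    RandomForestClassifier(random_state=42), X_syn, y_syn, cv=cv, scoring='accuracy'
)

# Report the spread as well as the mean; with ~20 rows per fold, one error moves a fold by ~5 points
print(f'Accuracy: {scores.mean():.3f} +/- {scores.std():.3f}')
```

With folds this small, the standard deviation is often as informative as the mean, since a single misclassified row changes a fold's accuracy noticeably.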
