In [None]:
!pip install kaggle

import os
# Kaggle credentials: substitute your own values and keep real keys out of shared notebooks
os.environ['KAGGLE_USERNAME'] = 'YOUR_KAGGLE_USERNAME'
os.environ['KAGGLE_KEY'] = 'YOUR_KAGGLE_API_KEY'

!kaggle datasets download -d andrewmvd/heart-failure-clinical-data

# Unzip the downloaded file
!unzip heart-failure-clinical-data.zip


In [2]:
import pandas as pd

df = pd.read_csv('heart_failure_clinical_records_dataset.csv')

In [18]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

X = df.drop('DEATH_EVENT', axis=1)
y = df['DEATH_EVENT']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create a column transformer for preprocessing
# Standardize numerical features and one-hot encode categorical features
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), ['age', 'creatinine_phosphokinase', 'ejection_fraction', 'platelets', 'serum_creatinine', 'serum_sodium', 'time']),
        ('cat', OneHotEncoder(), ['sex', 'smoking', 'anaemia', 'diabetes', 'high_blood_pressure']),
    ]
)

# Create a logistic regression model
model = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression(random_state=42)),
])

# Train the model
model.fit(X_train, y_train)

# Make predictions on the test set
y_pred = model.predict(X_test)

# Evaluate the model
print(classification_report(y_test, y_pred))


              precision    recall  f1-score   support

           0       0.76      0.97      0.85        35
           1       0.93      0.56      0.70        25

    accuracy                           0.80        60
   macro avg       0.84      0.77      0.78        60
weighted avg       0.83      0.80      0.79        60
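
The class 1 row of the report can be cross-checked by hand. The sketch below rebuilds a confusion matrix that is consistent with the rounded figures above. The four counts are inferred from the printed precision, recall, and support; they are an assumption, not values read from the notebook's actual predictions.

```python
import numpy as np
from sklearn.metrics import f1_score, precision_score, recall_score

# Inferred counts (assumption): 35 true class-0 rows, 25 true class-1 rows.
tn, fp, fn, tp = 34, 1, 11, 14

# Rebuild label vectors that produce exactly these counts.
y_true_demo = np.array([0] * (tn + fp) + [1] * (fn + tp))
y_pred_demo = np.array([0] * tn + [1] * fp + [0] * fn + [1] * tp)

# Metrics for the positive class (DEATH_EVENT = 1).
precision_1 = precision_score(y_true_demo, y_pred_demo)  # tp / (tp + fp)
recall_1 = recall_score(y_true_demo, y_pred_demo)        # tp / (tp + fn)
f1_1 = f1_score(y_true_demo, y_pred_demo)

print(round(precision_1, 2), round(recall_1, 2), round(f1_1, 2))
```

Read this way, the low recall for class 1 means that 11 of the 25 death events in the test set would be missed, which the 0.80 accuracy alone does not reveal.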




Data Cleaning:

Check for any missing values in the dataset.
If there are missing values, decide on a strategy for handling them (e.g., imputation or removal).

Feature Scaling:

Standardize or normalize numerical features to ensure that they are on a similar scale. This is important for some machine learning algorithms, including logistic regression.

Encoding Categorical Variables:

If there are categorical variables (like 'sex' and 'smoking'), encode them into numerical format. You can use one-hot encoding for this purpose.

Train-Test Split:

Split the dataset into training and testing sets. This is crucial for evaluating the model's performance on unseen data.

Feature Selection (Optional):

If your dataset is large or has redundant features, consider performing feature selection to improve the model's efficiency.

Model Training:

Train the logistic regression model using the training data.

Model Evaluation:

Evaluate the model's performance on the testing set using metrics like accuracy, precision, recall, and F1-score.
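
The data cleaning step in the list above is the only one the notebook's code does not show. A minimal sketch of it follows, using a small toy frame with made-up values as a stand-in for the real dataset; median imputation is just one possible strategy, chosen here for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real dataset (hypothetical values).
toy = pd.DataFrame({
    "age": [60.0, 72.0, np.nan, 55.0],
    "serum_sodium": [137.0, np.nan, 140.0, 134.0],
    "sex": [1, 0, 1, 1],
})

# Count missing values per column.
missing_counts = toy.isna().sum()
print(missing_counts)

# One possible strategy: fill numeric gaps with the column median.
filled = toy.fillna(toy.median(numeric_only=True))
print("Missing values after imputation:", int(filled.isna().sum().sum()))
```

On the real data, the same check would be `df.isna().sum()`. If imputation is needed, it is safer to fit it on the training split only, for example as a step inside the pipeline.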

In [19]:
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
from sklearn.metrics import classification_report

# Assuming df is your DataFrame with the dataset
# X contains the features, y contains the target variable
X = df.drop('DEATH_EVENT', axis=1)
y = df['DEATH_EVENT']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create a column transformer for preprocessing
# Standardize numerical features and one-hot encode categorical features
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), ['age', 'creatinine_phosphokinase', 'ejection_fraction', 'platelets', 'serum_creatinine', 'serum_sodium', 'time']),
        ('cat', OneHotEncoder(), ['sex', 'smoking', 'anaemia', 'diabetes', 'high_blood_pressure']),
    ]
)

# Create an SVM model
model = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', SVC(random_state=42)),
])

# Define hyperparameters for tuning
param_grid = {
    'classifier__C': [0.1, 1, 10, 100],
    'classifier__kernel': ['linear', 'rbf', 'poly'],
}

# Use GridSearchCV for hyperparameter tuning
grid_search = GridSearchCV(model, param_grid, scoring='f1', cv=5)
grid_search.fit(X_train, y_train)

# Get the best model from the grid search
best_model = grid_search.best_estimator_

# Make predictions on the test set
y_pred = best_model.predict(X_test)

# Evaluate the model
print("Best Parameters:", grid_search.best_params_)
print(classification_report(y_test, y_pred))


Best Parameters: {'classifier__C': 1, 'classifier__kernel': 'linear'}
              precision    recall  f1-score   support

           0       0.77      0.94      0.85        35
           1       0.88      0.60      0.71        25

    accuracy                           0.80        60
   macro avg       0.82      0.77      0.78        60
weighted avg       0.82      0.80      0.79        60




This cell trains an SVM classifier to predict death events. It splits the dataset into training and testing sets, then applies a preprocessing pipeline that standardizes the numerical features and one-hot encodes the categorical ones. A grid search over C and the kernel type, scored by cross-validated F1, selects the best model (here a linear kernel with C=1). That model is evaluated on the test set with a classification report of precision, recall, and F1-score. The target is an F1-score above 0.8 as a balanced measure of precision and recall; the positive class reaches 0.71 on this split, so the target is not yet met.
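
Because the grid search picks its winner by cross-validated F1, it can help to print that score next to the test-set F1. The sketch below does this on synthetic data from `make_classification` rather than the heart failure dataset. The smaller grid, the `stratify` argument, and all `_demo` names are assumptions added for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic, mildly imbalanced stand-in for the real data (assumption).
X_demo, y_demo = make_classification(
    n_samples=200, n_features=8, weights=[0.68, 0.32], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=0
)

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("classifier", SVC(random_state=0)),
])
grid = {
    "classifier__C": [0.1, 1, 10],
    "classifier__kernel": ["linear", "rbf"],
}

search = GridSearchCV(pipe, grid, scoring="f1", cv=5)
search.fit(X_tr, y_tr)

# Mean cross-validated F1 of the winning candidate vs. F1 on held-out data.
cv_f1 = search.best_score_
test_f1 = f1_score(y_te, search.predict(X_te))
print("Best parameters:", search.best_params_)
print(f"CV F1: {cv_f1:.2f}  test F1: {test_f1:.2f}")
```

A large gap between the two numbers suggests that the test split differs from the training folds or that the search has overfit to them.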

In [20]:
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
from sklearn.metrics import classification_report

# Assuming df is your DataFrame with the dataset
# X contains the features, y contains the target variable
X = df.drop('DEATH_EVENT', axis=1)
y = df['DEATH_EVENT']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create a column transformer for preprocessing
# Standardize numerical features and one-hot encode categorical features
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), ['age', 'creatinine_phosphokinase', 'ejection_fraction', 'platelets', 'serum_creatinine', 'serum_sodium', 'time']),
        ('cat', OneHotEncoder(), ['sex', 'smoking', 'anaemia', 'diabetes', 'high_blood_pressure']),
    ]
)

# Create a kernel SVM model (SVC uses the RBF kernel by default)
model = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', SVC(random_state=42, probability=True)),  # probability=True for later ROC analysis
])

# Define hyperparameters for tuning
param_grid = {
    'classifier__C': [0.1, 1, 10, 100],
    'classifier__gamma': [0.01, 0.1, 1, 10],
}

# Use GridSearchCV for hyperparameter tuning
grid_search = GridSearchCV(model, param_grid, scoring='f1', cv=5)
grid_search.fit(X_train, y_train)

# Get the best model from the grid search
best_model = grid_search.best_estimator_

# Make predictions on the test set
y_pred = best_model.predict(X_test)

# Evaluate the model
print("Best Parameters:", grid_search.best_params_)
print(classification_report(y_test, y_pred))


Best Parameters: {'classifier__C': 10, 'classifier__gamma': 0.01}
              precision    recall  f1-score   support

           0       0.77      0.94      0.85        35
           1       0.88      0.60      0.71        25

    accuracy                           0.80        60
   macro avg       0.82      0.77      0.78        60
weighted avg       0.82      0.80      0.79        60




This cell applies a kernel SVM (Support Vector Machine) with a radial basis function (RBF) kernel to predict death events. It splits the data into training and testing sets and uses a preprocessing pipeline to standardize numerical features and one-hot encode categorical variables. The SVC is created with probability=True so that class probabilities are available for a later ROC analysis, which this cell itself does not perform. A grid search over the regularization parameter (C) and the kernel coefficient (gamma), scored by cross-validated F1, selects the best model (C=10, gamma=0.01). On the test set the positive class reaches an F1-score of 0.71, below the 0.8 target.
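
The ROC analysis that `probability=True` prepares for could look like the following sketch. It runs on synthetic data, and the fixed `C` and `gamma` values are simply the ones reported by the grid search above, reused here as an assumption. On the real data, `best_model.predict_proba(X_test)` would take the place of the demo classifier.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the real data (assumption).
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

clf = Pipeline([
    ("scale", StandardScaler()),
    ("classifier", SVC(kernel="rbf", C=10, gamma=0.01,
                       probability=True, random_state=0)),
])
clf.fit(X_tr, y_tr)

# Probability of the positive class for each test row.
proba_pos = clf.predict_proba(X_te)[:, 1]

# ROC curve points and the area under the curve.
fpr, tpr, thresholds = roc_curve(y_te, proba_pos)
auc = roc_auc_score(y_te, proba_pos)
print(f"ROC AUC: {auc:.2f}")
```

Unlike the classification report, the ROC curve does not depend on the default 0.5 decision threshold, so it shows whether a lower threshold could recover some of the missed positive cases.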

In [26]:
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report

# Assuming df is your DataFrame with the dataset
# X contains the features, y contains the target variable
X = df.drop('DEATH_EVENT', axis=1)
y = df['DEATH_EVENT']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

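# Calculate class weights: n_samples / (n_classes * class_count)
# Note: KNeighborsClassifier has no class_weight parameter, so these are not used below
class_weights = len(y_train) / (2 * y_train.value_counts())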

# Create a column transformer for preprocessing
# Standardize numerical features and one-hot encode categorical features
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), ['age', 'creatinine_phosphokinase', 'ejection_fraction', 'platelets', 'serum_creatinine', 'serum_sodium', 'time']),
        ('cat', OneHotEncoder(), ['sex', 'smoking', 'anaemia', 'diabetes', 'high_blood_pressure']),
    ]
)

# Create a KNN model (n_neighbors and weights are starting values; the grid search overrides them)
model = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', KNeighborsClassifier(weights='distance', p=1, n_neighbors=5)),
])

# Define hyperparameters for tuning
param_grid = {
    'classifier__n_neighbors': [3, 5, 7, 9],
    'classifier__weights': ['uniform', 'distance'],
}

# Use GridSearchCV for hyperparameter tuning
grid_search = GridSearchCV(model, param_grid, scoring='f1', cv=5)
grid_search.fit(X_train, y_train)

# Get the best model from the grid search
best_model = grid_search.best_estimator_

# Make predictions on the test set
y_pred = best_model.predict(X_test)

# Evaluate the model
print("Best Number of Neighbors (k):", best_model.named_steps['classifier'].n_neighbors)
print("Best Parameters:", grid_search.best_params_)
print(classification_report(y_test, y_pred))


Best Number of Neighbors (k): 3
Best Parameters: {'classifier__n_neighbors': 3, 'classifier__weights': 'uniform'}
              precision    recall  f1-score   support

           0       0.68      0.91      0.78        35
           1       0.77      0.40      0.53        25

    accuracy                           0.70        60
   macro avg       0.73      0.66      0.65        60
weighted avg       0.72      0.70      0.67        60




This cell performs K-Nearest Neighbors (KNN) classification to predict death events. The dataset is split into training and testing sets, and class weights are calculated, but they are never passed to the model: KNeighborsClassifier has no class-weight option, and its weights parameter controls how neighbors are weighted by distance, not how classes are weighted. A preprocessing pipeline standardizes numerical features and one-hot encodes categorical variables. The KNN model starts with distance weighting (weights='distance'), the Manhattan distance metric (p=1), and 5 neighbors. GridSearchCV then tunes the number of neighbors and the weighting scheme, keeping p=1 fixed, and selects the best model by F1-score. The cell prints the chosen number of neighbors, the best hyperparameters, and a classification report. With k=3 and uniform weights, the positive class reaches a recall of 0.40 and an F1-score of 0.53 on the test set.
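
The class-weight formula used in the cell, n_samples / (2 * class_count), matches scikit-learn's "balanced" heuristic for two classes. The sketch below checks this on a toy label vector (made-up values). KNeighborsClassifier cannot consume such weights; estimators with a `class_weight` parameter, such as SVC or DecisionTreeClassifier, can.

```python
import numpy as np
import pandas as pd
from sklearn.utils.class_weight import compute_class_weight

# Toy labels (hypothetical): 7 negatives, 3 positives.
y_toy = pd.Series([0, 0, 0, 0, 0, 0, 1, 1, 1, 0])

# Manual formula, ordered by class label.
manual = len(y_toy) / (2 * y_toy.value_counts().sort_index())

# scikit-learn's "balanced" heuristic: n_samples / (n_classes * class_count).
balanced = compute_class_weight(
    class_weight="balanced", classes=np.array([0, 1]), y=y_toy.to_numpy()
)

print(manual.to_numpy())
print(balanced)
```

Both lines print the same pair of weights, with the minority class weighted more heavily.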

In [27]:
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report

# Assuming df is your DataFrame with the dataset
# X contains the features, y contains the target variable
X = df.drop('DEATH_EVENT', axis=1)
y = df['DEATH_EVENT']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create a column transformer for preprocessing
# Standardize numerical features and one-hot encode categorical features
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), ['age', 'creatinine_phosphokinase', 'ejection_fraction', 'platelets', 'serum_creatinine', 'serum_sodium', 'time']),
        ('cat', OneHotEncoder(), ['sex', 'smoking', 'anaemia', 'diabetes', 'high_blood_pressure']),
    ]
)

# Create a Decision Tree model
model = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', DecisionTreeClassifier(random_state=42)),
])

# Define hyperparameters for tuning
param_grid = {
    'classifier__max_depth': [None, 5, 10, 15, 20],
    'classifier__min_samples_split': [2, 5, 10],
    'classifier__min_samples_leaf': [1, 2, 4],
}

# Use GridSearchCV for hyperparameter tuning
grid_search = GridSearchCV(model, param_grid, scoring='f1', cv=5)
grid_search.fit(X_train, y_train)

# Get the best model from the grid search
best_model = grid_search.best_estimator_

# Make predictions on the test set
y_pred = best_model.predict(X_test)

# Evaluate the model
print("Best Parameters:", grid_search.best_params_)
print(classification_report(y_test, y_pred))


Best Parameters: {'classifier__max_depth': 5, 'classifier__min_samples_leaf': 1, 'classifier__min_samples_split': 5}
              precision    recall  f1-score   support

           0       0.76      0.80      0.78        35
           1       0.70      0.64      0.67        25

    accuracy                           0.73        60
   macro avg       0.73      0.72      0.72        60
weighted avg       0.73      0.73      0.73        60




This cell implements a Decision Tree classifier for predicting death events and tunes its hyperparameters to limit overfitting. It does not address class imbalance: no class weights or resampling are applied. The dataset is split into training and testing sets, and a preprocessing pipeline standardizes numerical features and one-hot encodes categorical variables. A grid search, scored by F1, tunes the maximum depth of the tree, the minimum number of samples required to split a node, and the minimum number of samples required in a leaf. The cell prints the best hyperparameters and a classification report for the test data. The target is an F1-score above 0.8, indicating a balance between precision and recall; the positive class reaches 0.67 on this split.

In [28]:
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

# Assuming df is your DataFrame with the dataset
# X contains the features, y contains the target variable
X = df.drop('DEATH_EVENT', axis=1)
y = df['DEATH_EVENT']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create a column transformer for preprocessing
# Standardize numerical features and one-hot encode categorical features
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), ['age', 'creatinine_phosphokinase', 'ejection_fraction', 'platelets', 'serum_creatinine', 'serum_sodium', 'time']),
        ('cat', OneHotEncoder(), ['sex', 'smoking', 'anaemia', 'diabetes', 'high_blood_pressure']),
    ]
)

# Create a Random Forest model
model = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier(random_state=42)),
])

# Define hyperparameters for tuning
param_grid = {
    'classifier__n_estimators': [50, 100, 150],
    'classifier__max_depth': [None, 5, 10, 15],
    'classifier__min_samples_split': [2, 5, 10],
    'classifier__min_samples_leaf': [1, 2, 4],
}

# Use GridSearchCV for hyperparameter tuning
grid_search = GridSearchCV(model, param_grid, scoring='f1', cv=5)
grid_search.fit(X_train, y_train)

# Get the best model from the grid search
best_model = grid_search.best_estimator_

# Make predictions on the test set
y_pred = best_model.predict(X_test)

# Evaluate the model
print("Best Parameters:", grid_search.best_params_)
print(classification_report(y_test, y_pred))


Best Parameters: {'classifier__max_depth': 5, 'classifier__min_samples_leaf': 2, 'classifier__min_samples_split': 5, 'classifier__n_estimators': 50}
              precision    recall  f1-score   support

           0       0.73      0.94      0.83        35
           1       0.87      0.52      0.65        25

    accuracy                           0.77        60
   macro avg       0.80      0.73      0.74        60
weighted avg       0.79      0.77      0.75        60




This cell implements a Random Forest classifier for predicting death events, with a target F1-score above 0.85 as a balanced measure of precision and recall. The dataset is divided into training and testing sets, and a preprocessing pipeline standardizes numerical features and one-hot encodes categorical variables. The Random Forest, an ensemble of decision trees, is built with scikit-learn. A grid search, scored by cross-validated F1, tunes the number of trees, the maximum depth of individual trees, and the minimum number of samples required to split a node or form a leaf. The cell prints the best hyperparameters and a classification report for the test data. The positive class reaches an F1-score of 0.65 (precision 0.87, recall 0.52), so the target is not met on this split.
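
The cost of this grid search can be worked out before running it. The sketch below counts the candidates with `ParameterGrid`, using the same grid as the cell above and the same `cv=5`.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the Random Forest cell.
param_grid = {
    'classifier__n_estimators': [50, 100, 150],
    'classifier__max_depth': [None, 5, 10, 15],
    'classifier__min_samples_split': [2, 5, 10],
    'classifier__min_samples_leaf': [1, 2, 4],
}

# 3 * 4 * 3 * 3 parameter combinations.
n_candidates = len(ParameterGrid(param_grid))

# Each candidate is fitted once per cross-validation fold.
n_cv_fits = n_candidates * 5

print(n_candidates, n_cv_fits)  # prints: 108 540
```

Each of those 540 fits trains between 50 and 150 trees, which is why this cell takes noticeably longer than the single Decision Tree search.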

Three techniques to regularize the training process for decision trees:

Set a Maximum Depth: This technique prevents decision trees from growing past a maximum depth. For example, you might limit the depth of the tree to 10 levels. This helps to control the complexity of the model and prevent overfitting.

Set a Minimum Number of Examples in Leaf: This technique requires every leaf to contain at least a minimum number of examples. A split is only considered if each resulting child node would keep at least that many examples, such as 5. This also helps to prevent overfitting.

Pruning: This is a technique where you selectively remove certain branches of the tree after training. This is done by converting certain non-leaf nodes to leaves. A common solution to select the branches to remove is to use a validation dataset. That is, if removing a branch improves the quality of the model on the validation dataset, then the branch is removed.

These techniques help to control the complexity of the decision tree, prevent overfitting, and improve the generalization ability of the model.
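
The three techniques can be sketched with scikit-learn's DecisionTreeClassifier on synthetic data. One caveat: scikit-learn does not implement the validation-based branch removal described above directly. It offers minimal cost-complexity pruning through `ccp_alpha`, so the sketch swaps in that technique and uses a validation set only to choose the pruning strength. The data and all variable names are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, noisy data (assumption) so that an unrestricted tree overfits.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=4, flip_y=0.1, random_state=0
)
X_tr, X_val, y_tr, y_val = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)

# Baseline: no regularization.
full = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)

# 1. Maximum depth.
shallow = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_tr, y_tr)

# 2. Minimum number of examples per leaf.
big_leaves = DecisionTreeClassifier(min_samples_leaf=5, random_state=0).fit(X_tr, y_tr)

# 3. Pruning: try each candidate alpha and keep the one that scores best
#    on the validation set (ties go to the larger alpha, i.e. the smaller tree).
path = full.cost_complexity_pruning_path(X_tr, y_tr)
best_alpha, best_score = 0.0, full.score(X_val, y_val)
for alpha in path.ccp_alphas:
    alpha = max(float(alpha), 0.0)  # guard against tiny negative round-off
    candidate = DecisionTreeClassifier(ccp_alpha=alpha, random_state=0).fit(X_tr, y_tr)
    score = candidate.score(X_val, y_val)
    if score >= best_score:
        best_alpha, best_score = alpha, score
pruned = DecisionTreeClassifier(ccp_alpha=best_alpha, random_state=0).fit(X_tr, y_tr)

for name, tree in [("full", full), ("max_depth=3", shallow),
                   ("min_samples_leaf=5", big_leaves), ("pruned", pruned)]:
    print(f"{name:20s} depth={tree.get_depth():2d} leaves={tree.get_n_leaves():3d} "
          f"val_acc={tree.score(X_val, y_val):.2f}")
```

In a pipeline like the ones above, `ccp_alpha` could be added to the grid as `classifier__ccp_alpha` so that cross-validation chooses the pruning strength instead of a single validation split.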