# Description

This notebook is a Python example of predicting heart disease with the scikit-learn library. We'll use the Cleveland Heart Disease dataset from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/dataset/45/heart+disease), a popular dataset for this kind of task.

### Install all necessary libraries

In [None]:
%pip install pandas scikit-learn

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

### Step 1. Load the dataset

After the libraries have been imported, we use the `read_csv` function from pandas. **This function reads a comma-separated values (CSV) file into a DataFrame** from the path passed as its argument.

In [None]:
url = "/Users/roberto.lezama/Downloads/JetBrains/PycharmProjects/Glbnt/dojo/AI-academy-Glbnt/Module_0_Fundamentals/02_Python_Libraries_for_Data_Science/05_Demos/03_DemoScikit-learn/heart.csv"
df = pd.read_csv(url)

# Display the first few rows of the dataset
print(df.head())
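
To try `read_csv` without the file above, here is a minimal sketch that parses a small in-memory CSV instead. The three rows and the column names other than `target` are hypothetical stand-ins, not values taken from `heart.csv`.

```python
import io

import pandas as pd

# Hypothetical three-row sample; the real heart.csv has more rows and columns.
csv_text = """age,sex,chol,target
63,1,233,1
37,1,250,0
41,0,204,1
"""

# read_csv accepts a file path, a URL, or a file-like object such as StringIO.
sample_df = pd.read_csv(io.StringIO(csv_text))

print(sample_df.shape)   # (number of rows, number of columns)
print(sample_df.dtypes)  # column types inferred from the data
```

Checking `shape` and `dtypes` right after loading is a quick way to confirm that the file was parsed as expected.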

### Step 2. Define features (X) and target (y)

This separates the columns the model learns from (X) from the column it should predict (y). The `drop` method, as the name suggests, drops the specified labels from rows or columns: you remove rows or columns either by giving label names together with the corresponding axis, or by giving index or column names directly. Here, `axis=1` tells `drop` to remove the `target` column.

More info: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop.html

![image.png](attachment:image.png)

In [None]:
X = df.drop('target', axis=1)
y = df['target']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
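
The effect of `drop` and `train_test_split` is easier to see on a tiny table. The sketch below uses a hypothetical 10-row DataFrame (the `age` and `chol` column names and all values are made up for illustration) and prints the resulting shapes.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 rows, two feature columns and a binary target.
toy_df = pd.DataFrame({
    "age": [63, 37, 41, 56, 57, 44, 52, 54, 48, 49],
    "chol": [233, 250, 204, 236, 354, 263, 199, 239, 275, 266],
    "target": [1, 0, 1, 0, 1, 0, 1, 0, 1, 0],
})

X_toy = toy_df.drop("target", axis=1)  # every column except the label
y_toy = toy_df["target"]               # the label column only

# test_size=0.2 keeps 20% of the rows for testing; random_state fixes the shuffle.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42
)

print(list(X_toy.columns))     # feature columns, without "target"
print(X_tr.shape, X_te.shape)  # 8 training rows and 2 test rows
```

Note that `drop` returns a new DataFrame, so `toy_df` itself still contains the `target` column.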

### Step 3. Standardize the features

`StandardScaler` is a class in the scikit-learn library. It standardizes features by removing the mean and scaling to unit variance.

The standard score of a sample x is calculated as:

$$z = \frac{x - u}{s}$$

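where **_u_** is the mean of the training samples or zero if `with_mean=False`, and **_s_** is the standard deviation of the training samples or one if `with_std=False`. The scaler is fitted on the training set only and then applied to the test set, so the test data does not influence the values of **_u_** and **_s_**.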

More info: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html

In [None]:
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
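
As a sanity check on the standard score, the following sketch recomputes it by hand with NumPy and compares the result with `StandardScaler`. The single-feature arrays are hypothetical values chosen for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical training and test values for a single feature.
train = np.array([[1.0], [2.0], [3.0], [4.0]])
test = np.array([[5.0]])

scaler_demo = StandardScaler()
train_scaled = scaler_demo.fit_transform(train)  # learns u and s, then scales
test_scaled = scaler_demo.transform(test)        # reuses the training u and s

u = train.mean(axis=0)  # mean of the training samples
s = train.std(axis=0)   # population standard deviation (ddof=0)

manual_train = (train - u) / s
manual_test = (test - u) / s

print(np.allclose(train_scaled, manual_train))  # manual result matches the scaler
print(np.allclose(test_scaled, manual_test))    # test data uses training statistics
```

The test value is scaled with the mean and standard deviation learned from the training data, which is why `transform` (not `fit_transform`) is called on the test set.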

### Step 4. Initialize the model

The **RandomForestClassifier** is a meta-estimator that fits a number of decision tree classifiers on various sub-samples of the dataset and uses averaging to improve predictive accuracy and control over-fitting.

More info: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html

In [None]:
model = RandomForestClassifier(n_estimators=100, random_state=42)
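
To make the "averaging" part concrete, this sketch trains a small forest on synthetic data and checks that the forest's class probabilities equal the mean of the probabilities from its individual trees. The data from `make_classification` and the choice of 10 trees are assumptions made for illustration, not part of the heart disease demo.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data standing in for the heart disease features (an assumption).
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

forest = RandomForestClassifier(n_estimators=10, random_state=0)
forest.fit(X_demo, y_demo)

# After fitting, each decision tree is available in forest.estimators_.
per_tree = np.stack([tree.predict_proba(X_demo) for tree in forest.estimators_])
averaged = per_tree.mean(axis=0)  # average the class probabilities over the trees

print(len(forest.estimators_))                              # number of trees
print(np.allclose(averaged, forest.predict_proba(X_demo)))  # averages agree
```

The predicted class is then the one with the highest averaged probability.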

### Step 5. Train the model

In [None]:
model.fit(X_train, y_train)

### Step 6. Make predictions

In [None]:
y_pred = model.predict(X_test)

### Step 7. Calculate accuracy

The **accuracy_score** function returns either the fraction of correct predictions (the default, `normalize=True`) or their count (`normalize=False`). For binary classification, the fraction is the number of true positives plus true negatives divided by the total number of predictions.

More info: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html

In [None]:
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy:.2f}')
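
The two return modes of `accuracy_score` can be compared on a handful of labels. The arrays below are hypothetical: 8 predictions, 6 of which match the true labels.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical labels: 8 predictions, 6 of them correct.
y_true_demo = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred_demo = np.array([1, 0, 0, 1, 0, 1, 1, 0])

fraction = accuracy_score(y_true_demo, y_pred_demo)                # default: fraction
count = accuracy_score(y_true_demo, y_pred_demo, normalize=False)  # number correct
manual = (y_true_demo == y_pred_demo).mean()                       # same fraction by hand

print(fraction)  # 0.75
print(count)     # 6
print(manual)    # 0.75
```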

### Step 8 (Optional). Generate reports

In [None]:
# Generate a classification report
print('Classification Report:')
print(classification_report(y_test, y_pred))

# Generate a confusion matrix
print('Confusion Matrix:')
print(confusion_matrix(y_test, y_pred))
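
For a binary problem, the confusion matrix can be unpacked into its four cells, and the precision and recall shown in the classification report follow directly from them. The labels in this sketch are hypothetical and only illustrate the layout.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical labels for a binary problem (1 = disease, 0 = no disease).
y_true_cm = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred_cm = np.array([1, 0, 0, 1, 0, 1, 1, 0])

# Rows are true classes and columns are predicted classes, so for labels
# [0, 1] the flattened order is: tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true_cm, y_pred_cm).ravel()

precision = tp / (tp + fp)  # of the predicted positives, how many were right
recall = tp / (tp + fn)     # of the actual positives, how many were found

print(tn, fp, fn, tp)
print(precision, precision_score(y_true_cm, y_pred_cm))
print(recall, recall_score(y_true_cm, y_pred_cm))
```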

# Conclusion

In this exercise, we saw how to take a subset of stored medical data about heart disease, use it to train a model, and make predictions for future patients who have the same or similar symptoms.

The image below shows that, after training, the model reaches an accuracy of 0.99 on the test set.

![image.png](attachment:image.png)

You can find the complete Python code in the file **02_DemoScikit-learn_Final.py**.