# Classification with Python


To be completed soon ...


Let's take a deeper look at how we can use Python to classify data.
Python provides many tools for implementing classification.
In this tutorial, we'll use `scikit-learn`, a widely used open-source Python machine learning library, to build a simple classifier.

Let’s learn how to use `scikit-learn` to perform classification in simple terms.



As mentioned, there are many classification algorithms available. In this tutorial, we will use the following:

- Decision Trees (C4.5/ID3, CART)
- Naive Bayes
- AdaBoost

> [!IMPORTANT]
> **An end-to-end Scikit-Learn workflow:**
>
> Here’s a high-level overview of each step:
>
> 1. Get the data ready (split into features and labels, prepare training and test sets)
> 2. Choose a model for our problem
> 3. Fit the model to the data and use it to make a prediction
> 4. Evaluate the model
> 5. Experiment to improve
> 6. Save a model for someone else to use
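
To make the six steps concrete, here is a minimal sketch of the whole workflow. It is an illustration under assumptions, not the tutorial's final code: the data is synthetic (generated with `make_classification` rather than loaded from a file), the `max_depth=3` experiment is an arbitrary choice, and the model is serialized in memory with `pickle` instead of being written to disk.

```python
import pickle

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# 1. Get the data ready (synthetic stand-in data, not the tutorial's dataset)
X, y = make_classification(n_samples=300, n_features=13, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# 2. Choose a model for the problem
model = DecisionTreeClassifier(random_state=42)

# 3. Fit the model to the data and use it to make a prediction
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# 4. Evaluate the model (mean accuracy on the test set)
baseline_accuracy = model.score(X_test, y_test)

# 5. Experiment to improve (here: an arbitrary limit on the tree depth)
shallow_model = DecisionTreeClassifier(max_depth=3, random_state=42)
shallow_model.fit(X_train, y_train)
shallow_accuracy = shallow_model.score(X_test, y_test)

# 6. Save the model (serialized in memory here) and load it back
model_bytes = pickle.dumps(model)
restored_model = pickle.loads(model_bytes)

print(f"Baseline accuracy:    {baseline_accuracy:.3f}")
print(f"max_depth=3 accuracy: {shallow_accuracy:.3f}")
```

To share the model as a file, `pickle.dump(model, file_object)` would be used in place of `pickle.dumps(model)`.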


## Building a Decision Tree Classifier in Python


### Step 1. Get the data ready

As an example dataset, we'll import `heart-disease.csv`. (You can find the dataset here: https://github.com/kb22/Heart-Disease-Prediction/tree/master)

This file contains anonymized patient medical records, along with whether or not each patient has heart disease, so we can use it to look for patterns.

(Side note: this is a classification problem since we're trying to predict whether something is one thing or another. Do they have it or not?).


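```python
# Suppress library warnings to keep the notebook output readable
import warnings
warnings.filterwarnings('ignore')

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import sklearn
```

```python
# Load the dataset into a DataFrame and show the first five rows
heart_disease = pd.read_csv('heart-disease.csv')
heart_disease.head()
```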

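|   | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | target |
|---|-----|-----|----|----------|------|-----|---------|---------|-------|---------|-------|----|------|--------|
| 0 | 63 | 1 | 3 | 145 | 233 | 1 | 0 | 150 | 0 | 2.3 | 0 | 0 | 1 | 1 |
| 1 | 37 | 1 | 2 | 130 | 250 | 0 | 1 | 187 | 0 | 3.5 | 0 | 0 | 2 | 1 |
| 2 | 41 | 0 | 1 | 130 | 204 | 0 | 0 | 172 | 0 | 1.4 | 2 | 0 | 2 | 1 |
| 3 | 56 | 1 | 1 | 120 | 236 | 0 | 1 | 178 | 0 | 0.8 | 2 | 0 | 2 | 1 |
| 4 | 57 | 0 | 0 | 120 | 354 | 0 | 1 | 163 | 1 | 0.6 | 2 | 0 | 2 | 1 |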


Here, each row is a different patient and all columns except `target` are different patient characteristics.

The `target` column indicates whether the patient has heart disease (`target=1`) or not (`target=0`).

This is our "label" column and is the variable we're going to try and predict. The rest of the columns (often called features) are what we'll be using to predict the `target` value.

> Note: It's a common convention to store the features in a variable `X` and the labels in a variable `y`. In practice, we use `X` (features) to build a predictive model that predicts `y` (labels).

```python
# Create X (all the feature columns)
X = heart_disease.drop("target", axis=1)

# Create y (the target column)
y = heart_disease["target"]

# Check the head of the features DataFrame
X.head()
```

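|   | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal |
|---|-----|-----|----|----------|------|-----|---------|---------|-------|---------|-------|----|------|
| 0 | 63 | 1 | 3 | 145 | 233 | 1 | 0 | 150 | 0 | 2.3 | 0 | 0 | 1 |
| 1 | 37 | 1 | 2 | 130 | 250 | 0 | 1 | 187 | 0 | 3.5 | 0 | 0 | 2 |
| 2 | 41 | 0 | 1 | 130 | 204 | 0 | 0 | 172 | 0 | 1.4 | 2 | 0 | 2 |
| 3 | 56 | 1 | 1 | 120 | 236 | 0 | 1 | 178 | 0 | 0.8 | 2 | 0 | 2 |
| 4 | 57 | 0 | 0 | 120 | 354 | 0 | 1 | 163 | 1 | 0.6 | 2 | 0 | 2 |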


```python
# Check the head of the labels
y.head()
```

```text
0    1
1    1
2    1
3    1
4    1
Name: target, dtype: int64
```
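
Besides the first few labels, it is often useful to check how many samples belong to each class. The sketch below uses a small hypothetical Series (`y_demo`, made up for illustration) in place of the real `heart_disease["target"]` column, so the numbers it prints say nothing about the actual dataset.

```python
import pandas as pd

# Hypothetical labels standing in for heart_disease["target"]
y_demo = pd.Series([1, 1, 0, 1, 0, 1, 1, 0], name="target")

# Absolute number of samples per class
class_counts = y_demo.value_counts()

# Share of each class (fractions that sum to 1)
class_shares = y_demo.value_counts(normalize=True)

print(class_counts)
print(class_shares)
```

On the real data, the same calls would be `y.value_counts()` and `y.value_counts(normalize=True)`. A strongly imbalanced result would be a reason to look at metrics other than plain accuracy.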

One of the most important practices in Machine Learning is to split datasets into training and test sets.

This way, a model will **train on the training set** to learn patterns, and then those patterns can be **evaluated on the test set**.

It’s important that a model never sees testing data during training. This is equivalent to a student studying course materials during the semester (training set) and then testing their abilities on the following exam (testing set).

Scikit-learn provides the `sklearn.model_selection.train_test_split` function to split datasets into training and test sets.

> Note: It is common practice to use an 80/20, 70/30, or 75/25 split for training/testing data. A third set, known as a validation set (e.g. 70/15/15 for training/validation/test), is often used for hyperparameter tuning, but for now we'll focus on training and test sets.
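
For reference, one possible way to obtain a 70/15/15 training/validation/test split is to call `train_test_split` twice. This is a sketch under assumptions: the data is synthetic, the variable names (`X_tr`, `X_val`, `X_te`, ...) are chosen for this example only, and `stratify` and `random_state` are optional extras that keep the class proportions similar across the splits and make the result repeatable.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (200 samples, 13 features)
X_demo, y_demo = make_classification(n_samples=200, n_features=13, random_state=0)

# First split: keep 70% for training, hold out 30%
X_tr, X_hold, y_tr, y_hold = train_test_split(
    X_demo, y_demo, test_size=0.30, stratify=y_demo, random_state=0
)

# Second split: divide the held-out 30% equally into validation and test sets
X_val, X_te, y_val, y_te = train_test_split(
    X_hold, y_hold, test_size=0.50, stratify=y_hold, random_state=0
)

print(len(X_tr), len(X_val), len(X_te))
```

The validation set would then be used to compare hyperparameter settings, leaving the test set untouched until the final evaluation.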

```python
# Split the data into training and test sets
from sklearn.model_selection import train_test_split

# test_size=0.25 is also what train_test_split uses by default
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)

X_train.shape, X_test.shape, y_train.shape, y_test.shape
```

```text
((227, 13), (76, 13), (227,), (76,))
```

### Step 2. Choose the model and hyperparameters

Scikit-Learn refers to models as "estimators". In code, an estimator instance is often named `model` or `clf` (short for classifier).

A model's hyperparameters are settings you can change to adjust it for your problem, much like knobs on an oven you can tune to cook your favorite dish.

Since we're working on a classification problem, we'll start with a `DecisionTreeClassifier`.

```python
# Import the DecisionTreeClassifier model class from the tree module
from sklearn.tree import DecisionTreeClassifier

# Instantiate the model (using the default parameters)
model = DecisionTreeClassifier()
```


We can see the current hyperparameters of a model with the `get_params()` method.

```python
# Get the parameters of this model
model.get_params()
```

```text
{'ccp_alpha': 0.0,
 'class_weight': None,
 'criterion': 'gini',
 'max_depth': None,
 'max_features': None,
 'max_leaf_nodes': None,
 'min_impurity_decrease': 0.0,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'random_state': None,
 'splitter': 'best'}
```
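
Hyperparameters can be passed when the model is created or changed afterwards with `set_params()`. The short sketch below shows the second option. The values `max_depth=3` and `criterion="entropy"` are arbitrary examples rather than recommendations, and `tree` is a throwaway name used only here.

```python
from sklearn.tree import DecisionTreeClassifier

# Start from the default hyperparameters
tree = DecisionTreeClassifier()

# Update selected hyperparameters on the existing (not yet fitted) estimator
tree.set_params(max_depth=3, criterion="entropy")

# Read the values back to confirm the change
params = tree.get_params()
print(params["max_depth"], params["criterion"])
```

The equivalent at construction time would be `DecisionTreeClassifier(max_depth=3, criterion="entropy")`.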


```{note}
A brief explanation of the parameters:
- criterion: the function used to measure the quality of a split. Supported criteria include “gini” for the Gini impurity and “entropy” for the information gain. Information gain is the split measure associated with ID3 (C4.5 uses the related gain ratio), while Gini impurity is the measure associated with CART. Note that, according to its documentation, scikit-learn implements an optimized version of CART whichever criterion is chosen, so setting the criterion to "entropy" changes the split measure but does not turn the model into ID3 or C4.5.
- splitter: the strategy used to choose the split at each node. Supported strategies are “best” to choose the best split and “random” to choose the best random split.
- max_depth: the maximum depth of the tree. If None, nodes are expanded until all leaves are pure or until all leaves contain fewer than min_samples_split samples.
- min_samples_split: the minimum number of samples required to split an internal node.
- min_samples_leaf: the minimum number of samples required to be at a leaf node.
- min_weight_fraction_leaf: the minimum weighted fraction of the sum total of weights (of all the input samples) required to be at a leaf node.
- max_features: the number of features to consider when looking for the best split. If None, all features are considered.
- random_state: the seed used by the random number generator.
- max_leaf_nodes: grow a tree with max_leaf_nodes in best-first fashion. Best nodes are defined by their relative reduction in impurity. If None, the number of leaf nodes is unlimited.
- min_impurity_decrease: a node will be split if the split induces a decrease of the impurity greater than or equal to this value.
- class_weight: weights associated with the classes. If None, all classes are given weight one.
- ccp_alpha: the complexity parameter used for minimal cost-complexity pruning. With the default of 0.0, no pruning is performed.
- min_impurity_split: a threshold for early stopping in tree growth used by older versions. It was deprecated in favor of min_impurity_decrease and removed in scikit-learn 1.0, which is why it does not appear in the `get_params()` output above.
```
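
To get a feel for what these settings do, the following sketch fits the same kind of tree with two criteria and two depth limits and records the resulting tree size and test accuracy. It is an illustration under assumptions: the data is synthetic, and the grid of values (`"gini"`/`"entropy"`, `max_depth` of 2 or None) is an arbitrary choice for demonstration, not a tuning recipe.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (300 samples, 13 features)
X_demo, y_demo = make_classification(n_samples=300, n_features=13, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=1
)

results = {}
for criterion in ("gini", "entropy"):
    for max_depth in (2, None):
        tree = DecisionTreeClassifier(
            criterion=criterion, max_depth=max_depth, random_state=1
        )
        tree.fit(X_tr, y_tr)
        # Record how large the fitted tree is and how well it generalizes
        results[(criterion, max_depth)] = {
            "depth": tree.get_depth(),
            "leaves": tree.get_n_leaves(),
            "test_accuracy": tree.score(X_te, y_te),
        }

for setting, summary in results.items():
    print(setting, summary)
```

Trees limited with `max_depth=2` can have at most four leaves, while an unrestricted tree keeps growing until its leaves are pure, which usually makes it much larger.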

### Step 3. Fit the model to the data and use it to make a prediction

Fitting a model to a dataset involves passing it the data and asking it to figure out the patterns:

- If there are labels (supervised learning), the model tries to work out the relationship between the data and the labels.
- If there are no labels (unsupervised learning), the model tries to find patterns and group similar samples together.

Most Scikit-Learn models have a built-in `fit(X, y)` method, where the `X` parameter is the features and the `y` parameter is the labels.

In our case, we start by fitting a model on the training split (`X_train, y_train`).

```python
# Fit the model to the training data
model.fit(X_train, y_train)
```
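
Once a model is fitted, it can be used to make predictions on data it has not seen. The sketch below shows what that might look like. It is self-contained and therefore uses synthetic data and its own names (`demo_model`, `X_tr`, `X_te`, ...) rather than the `model`, `X_train`, and `X_test` objects from the cells above.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (300 samples, 13 features, two classes)
X_demo, y_demo = make_classification(n_samples=300, n_features=13, random_state=7)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=7
)

# Fit on the training split only
demo_model = DecisionTreeClassifier(random_state=7)
demo_model.fit(X_tr, y_tr)

# Predicted class label for each test sample
y_pred = demo_model.predict(X_te)

# Predicted class probabilities (one column per class)
y_proba = demo_model.predict_proba(X_te)

# Fraction of test samples whose predicted label matches the true label
test_accuracy = accuracy_score(y_te, y_pred)
print(f"Test accuracy: {test_accuracy:.3f}")
```

With the tutorial's own objects, the corresponding calls would be `model.predict(X_test)` and `model.score(X_test, y_test)`, where `score` returns the same mean accuracy that `accuracy_score` computes here.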


## Building an Adaptive Boosting (AdaBoost) Classifier in Python


## Building a Naive Bayes Classifier in Python
