# Classification Algorithms
We spot-check several untuned classification algorithms to compare their baseline performance **before hyperparameter tuning**. This first step in model selection narrows down which models are worth tuning further.

We will use the Pima Indians Diabetes dataset, which is commonly used for binary classification.

In [15]:
from pandas import read_csv

url = 'https://raw.githubusercontent.com/erojaso/MLMasteryEndToEnd/master/data/pima-indians-diabetes.data.csv'
column_names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = read_csv(url, names=column_names)

# Split into input features (first 8 columns) and the class label (last column)
array = data.values
X1 = array[:, 0:8]
Y1 = array[:, 8]

print("Shape of X1:", X1.shape)
print("Shape of Y1:", Y1.shape)

Shape of X1: (768, 8)
Shape of Y1: (768,)


### Evaluation Strategy
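We use **10-fold cross-validation** to evaluate model performance. The data is split into 10 folds; each fold serves once as the test set while the other nine are used for training, and we report the mean accuracy across folds. Averaging over folds gives a more stable estimate than a single train/test split and makes the result less dependent on one particular split.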
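To make the splitting concrete, here is a minimal pure-Python sketch of how unshuffled 10-fold splitting partitions 768 rows. The helper `kfold_indices` is a hypothetical name used only for illustration. It mirrors the contiguous, unshuffled splitting that `KFold(n_splits=10)` performs, but it is not scikit-learn's implementation.

```python
def kfold_indices(n_samples, n_splits):
    """Yield (train_idx, test_idx) pairs for contiguous, unshuffled folds."""
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        # The first `extra` folds receive one additional sample.
        size = base + 1 if fold < extra else base
        stop = start + size
        test_idx = list(range(start, stop))
        train_idx = list(range(0, start)) + list(range(stop, n_samples))
        yield train_idx, test_idx
        start = stop


# Sizes of the 10 test folds for a 768-row dataset
fold_sizes = [len(test_idx) for _, test_idx in kfold_indices(768, 10)]
print(fold_sizes)  # [77, 77, 77, 77, 77, 77, 77, 77, 76, 76]
```

Every row appears in exactly one test fold, so each model is scored on all 768 rows, but never on rows it was trained on within the same fold.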

### Logistic Regression
Logistic regression is a linear model often used for binary classification. It outputs class probabilities, and its coefficients are straightforward to interpret.
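As a rough illustration of how the model turns a linear score into a probability, the sketch below applies the sigmoid function to a weighted sum of features. The weights, bias, and input values are made up for the example and are not fitted to the diabetes data; `predict_proba` here is a hypothetical helper, not the scikit-learn method of the same name.

```python
import math


def sigmoid(z):
    """Map any real-valued score to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(weights, bias, features):
    """Probability of the positive class for a single sample."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)


# Hypothetical coefficients, chosen only for illustration
weights = [0.8, -0.5]
bias = 0.1

p = predict_proba(weights, bias, [1.0, 2.0])
label = 1 if p >= 0.5 else 0
print(f"P(class=1) = {p:.3f}, predicted label = {label}")  # P(class=1) = 0.475, predicted label = 0
```

A positive weight pushes the probability up as that feature grows and a negative weight pushes it down, which is what makes the coefficients readable.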

In [16]:
from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import LogisticRegression

# 10 contiguous folds; KFold does not shuffle by default
kfold = KFold(n_splits=10)
model = LogisticRegression(solver='liblinear')
results = cross_val_score(model, X1, Y1, cv=kfold)
print("Logistic Regression Accuracy: %.3f" % results.mean())

Logistic Regression Accuracy: 0.770


### Linear Discriminant Analysis (LDA)
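LDA is another linear classifier. It assumes that the features within each class are normally distributed and that all classes share the same covariance matrix, and it works well when those assumptions approximately hold.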


In [17]:
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

model = LinearDiscriminantAnalysis()
results = cross_val_score(model, X1, Y1, cv=kfold)
print("LDA Accuracy: %.3f" % results.mean())

LDA Accuracy: 0.773


### K-Nearest Neighbors (KNN)
KNN is a non-parametric method. It predicts the class of a new sample by majority vote among its k closest neighbors in the training data.
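The idea can be sketched in a few lines of plain Python. This is an illustrative toy, assuming Euclidean distance and a simple majority vote; `knn_predict` and the sample points are hypothetical. The default `k=5` matches the default `n_neighbors` of `KNeighborsClassifier`.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, query, k=5):
    """Predict the label of `query` by majority vote among its k nearest points."""
    # Pair each training point's distance to the query with its label, nearest first
    distances = sorted(
        (math.dist(x, query), label) for x, label in zip(train_X, train_y)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]


# Toy training data: two well-separated clusters (made up for illustration)
train_X = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
train_y = [0, 0, 0, 1, 1, 1]

print(knn_predict(train_X, train_y, (1.1, 1.0), k=3))  # 0
print(knn_predict(train_X, train_y, (5.1, 5.0), k=3))  # 1
```

Because the prediction depends directly on distances, features measured on larger scales dominate the vote unless the data is rescaled.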

In [18]:
from sklearn.neighbors import KNeighborsClassifier

model = KNeighborsClassifier()
results = cross_val_score(model, X1, Y1, cv=kfold)
print("KNN Accuracy: %.3f" % results.mean())

KNN Accuracy: 0.727


### Naive Bayes
Naive Bayes applies Bayes' theorem under the assumption that the features are conditionally independent given the class. It is extremely fast to train and to predict with.
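The following is a minimal sketch of the Gaussian variant, assuming each feature follows a normal distribution within each class. The function names and the toy data are hypothetical, and the sketch omits numerical safeguards such as the variance smoothing that `GaussianNB` applies.

```python
import math
from statistics import mean, pvariance


def fit_gaussian_nb(X, y):
    """Return {label: (prior, [(mean, variance) for each feature])}."""
    model = {}
    for label in set(y):
        rows = [x for x, target in zip(X, y) if target == label]
        stats = [(mean(col), pvariance(col)) for col in zip(*rows)]
        model[label] = (len(rows) / len(X), stats)
    return model


def log_gaussian(value, mu, var):
    """Log density of a normal distribution with mean `mu` and variance `var`."""
    return -0.5 * math.log(2 * math.pi * var) - (value - mu) ** 2 / (2 * var)


def nb_predict(model, x):
    """Pick the class with the highest log prior plus summed log likelihoods."""
    scores = {}
    for label, (prior, stats) in model.items():
        # Independence assumption: per-feature log likelihoods simply add up
        likelihood = sum(log_gaussian(v, mu, var) for v, (mu, var) in zip(x, stats))
        scores[label] = math.log(prior) + likelihood
    return max(scores, key=scores.get)


# Toy data, made up for illustration
X = [(1.0, 2.0), (1.5, 1.8), (0.8, 2.4), (5.0, 8.0), (5.5, 7.5), (4.7, 8.3)]
y = [0, 0, 0, 1, 1, 1]

nb_model = fit_gaussian_nb(X, y)
print(nb_predict(nb_model, (1.2, 2.1)))  # 0
print(nb_predict(nb_model, (5.1, 7.9)))  # 1
```

Training only requires per-class means and variances, which is why the method is so fast.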

In [19]:
from sklearn.naive_bayes import GaussianNB

model = GaussianNB()
results = cross_val_score(model, X1, Y1, cv=kfold)
print("Naive Bayes Accuracy: %.3f" % results.mean())

Naive Bayes Accuracy: 0.755


### Classification and Regression Trees (CART)
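CART is a decision-tree model that recursively splits the data on feature thresholds, choosing at each node the split that most reduces impurity. scikit-learn's `DecisionTreeClassifier` measures impurity with the Gini index by default; information gain (entropy) is available via `criterion='entropy'`.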
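To show how a candidate split is scored, the sketch below computes the impurity decrease for one split under both Gini impurity (the default criterion of `DecisionTreeClassifier`) and entropy, where the decrease is the information gain. The helper names and the label lists are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def entropy(labels):
    """Shannon entropy of the class distribution, in bits."""
    n = len(labels)
    return -sum((count / n) * math.log2(count / n) for count in Counter(labels).values())


def impurity_decrease(parent, left, right, impurity):
    """Parent impurity minus the size-weighted impurity of the two children."""
    n = len(parent)
    weighted = len(left) / n * impurity(left) + len(right) / n * impurity(right)
    return impurity(parent) - weighted


# A made-up split of 8 samples into two child nodes
parent = [0, 0, 0, 0, 1, 1, 1, 1]
left = [0, 0, 0, 1]
right = [0, 1, 1, 1]

print(f"Gini decrease:    {impurity_decrease(parent, left, right, gini):.3f}")     # 0.125
print(f"Information gain: {impurity_decrease(parent, left, right, entropy):.3f}")  # 0.189
```

The tree evaluates many candidate thresholds this way and keeps the one with the largest decrease.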

In [20]:
from sklearn.tree import DecisionTreeClassifier

model = DecisionTreeClassifier()
results = cross_val_score(model, X1, Y1, cv=kfold)
print("Decision Tree Accuracy: %.3f" % results.mean())

Decision Tree Accuracy: 0.706


### Support Vector Machines (SVM)
SVM finds the hyperplane that separates the classes with the largest margin. Depending on the kernel, it can model both linear and non-linear decision boundaries; scikit-learn's `SVC` uses the RBF kernel by default.
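The kernel is the function the SVM uses to measure similarity between two samples. As a small sketch, here are the linear kernel and the RBF kernel (the default for `SVC`) written in plain Python. The function names, sample points, and `gamma` value are assumptions made for illustration.

```python
import math


def linear_kernel(a, b):
    """Dot product: similarity in the original feature space."""
    return sum(x * y for x, y in zip(a, b))


def rbf_kernel(a, b, gamma):
    """RBF kernel: exp(-gamma * squared Euclidean distance)."""
    squared_distance = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-gamma * squared_distance)


point = (1.0, 2.0)
near = (1.1, 2.1)
far = (4.0, 6.0)
gamma = 0.5  # hypothetical value

print(linear_kernel(point, (3.0, 4.0)))  # 11.0
print(rbf_kernel(point, point, gamma))   # 1.0
print(rbf_kernel(point, near, gamma) > rbf_kernel(point, far, gamma))  # True
```

With the RBF kernel, similarity decays with distance, which lets the decision boundary bend around clusters instead of being a single straight line.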

In [21]:
from sklearn.svm import SVC

model = SVC()
results = cross_val_score(model, X1, Y1, cv=kfold)
print("SVM Accuracy: %.3f" % results.mean())

SVM Accuracy: 0.760


### Summary of Results

| Model                | Accuracy (mean of 10 folds) |
|----------------------|-----------------------------|
| Logistic Regression  | 0.770                       |
| LDA                  | 0.773                       |
| K-Nearest Neighbors  | 0.727                       |
| Naive Bayes          | 0.755                       |
| Decision Tree (CART) | 0.706                       |
| SVM                  | 0.760                       |
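The per-model cells can also be folded into a single loop. The sketch below is one possible way to do that, not the notebook's original code. To stay self-contained it uses synthetic data from `make_classification` as a stand-in (an assumption), so its scores will not match the table; passing `X1` and `Y1` from the cells above in place of `X` and `y` would compare the models on the diabetes data. It also fixes `random_state` on the decision tree, since an unseeded tree can give slightly different accuracy from run to run.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption, so the sketch runs offline)
X, y = make_classification(n_samples=200, n_features=8, random_state=7)

models = {
    "Logistic Regression": LogisticRegression(solver="liblinear"),
    "LDA": LinearDiscriminantAnalysis(),
    "K-Nearest Neighbors": KNeighborsClassifier(),
    "Naive Bayes": GaussianNB(),
    "Decision Tree (CART)": DecisionTreeClassifier(random_state=7),
    "SVM": SVC(),
}

kfold = KFold(n_splits=10)
summary = {}
for name, model in models.items():
    # One accuracy score per fold; keep the mean and the spread
    scores = cross_val_score(model, X, y, cv=kfold)
    summary[name] = (scores.mean(), scores.std())

# Report models from best to worst mean accuracy
for name, (mean_acc, std_acc) in sorted(
    summary.items(), key=lambda item: item[1][0], reverse=True
):
    print(f"{name:<22} {mean_acc:.3f} (+/- {std_acc:.3f})")
```

Reporting the standard deviation alongside the mean helps judge whether small gaps between models, such as 0.770 versus 0.773, are meaningful.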
