# Classification

## Table of contents

### 1 Introduction to Classification

* What is classification?
* Types of classification problems
* Real-world examples of classification tasks

### 2 Getting Started with scikit-learn
* Loading datasets and preprocessing

### 3 Supervised Learning: Classification Techniques

* a. Logistic Regression
* b. K-Nearest Neighbors (KNN)
* c. Support Vector Machines (SVM)
* d. Decision Trees
* e. Random Forest

### 4 Model Evaluation and Selection

* Train-test split
* Cross-validation
* Performance metrics (accuracy, precision, recall, F1-score, ROC-AUC)
* Hyperparameter tuning (Grid search, Random search)
* Model selection and comparison

### 5 Advanced Classification Techniques
* a. Imbalanced Classification

### 6 Practical Project

* Choosing a classification dataset
* Data preprocessing and exploration
* Model selection, training, and evaluation
* Hyperparameter tuning and model optimization
* Presenting the final results  

### 7 Assignment

## 1 Introduction to Classification

### What is classification?

Classification is a type of supervised machine learning task in which the goal is to assign objects or instances to predefined categories or classes. In supervised learning, the model learns from a dataset that contains input-output pairs, where the output (or target) is a discrete value representing the class label. Classification models can be used to predict the class of an object based on its input features.

### Types of classification problems:

There are two main types of classification problems:

* a. Binary Classification: In binary classification, there are only two possible classes. The model is trained to distinguish between these two classes. For example, classifying emails as spam or not spam.

* b. Multiclass Classification: In multiclass classification, there are more than two possible classes. The model is trained to classify instances into one of the multiple classes. For example, classifying handwritten digits into one of the ten classes (0 to 9).

In some cases, you might also encounter multilabel classification problems, where each instance can be assigned to multiple classes simultaneously. For example, classifying a text document into multiple topics.
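The three problem types differ mainly in the shape of the target. As a rough illustration, scikit-learn's `type_of_target` utility reports how it interprets a label array; the toy arrays below are invented for this example.

```python
import numpy as np
from sklearn.utils.multiclass import type_of_target

# Toy label arrays (invented for illustration)
y_binary = np.array([0, 1, 1, 0])                # spam / not spam
y_multiclass = np.array([0, 3, 7, 9])            # one digit per sample
y_multilabel = np.array([[1, 0, 1], [0, 1, 1]])  # several topics per document

for name, labels in [("binary", y_binary),
                     ("multiclass", y_multiclass),
                     ("multilabel", y_multilabel)]:
    print(name, "->", type_of_target(labels))
```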

### Real-world examples of classification tasks:


Here are some real-world examples of classification tasks:

* a. Email spam detection: Identifying whether an email is spam or not based on its content and other features.

* b. Medical diagnosis: Predicting the presence or absence of a disease based on patient data (e.g., symptoms, lab results).

* c. Sentiment analysis: Determining the sentiment (positive, negative, or neutral) of a given text or document.

* d. Handwritten digit recognition: Identifying the digit (0 to 9) represented by a handwritten image.

* e. Fraud detection: Detecting fraudulent transactions in a financial dataset based on transaction data and user behavior.

* f. Image classification: Categorizing images into predefined classes, such as animals, objects, or scenes.

* g. Customer segmentation: Classifying customers into groups based on their behavior or preferences for targeted marketing.

These are just a few examples; classification problems are widespread across various domains and industries.

## 2 Getting Started with scikit-learn

Scikit-learn is a popular Python library for machine learning that provides simple and efficient tools for data mining and data analysis. To install scikit-learn with pip, run `pip install scikit-learn`.

### Loading datasets and preprocessing:

Scikit-learn provides various utilities for loading datasets and preprocessing the data. Some common tasks include:

* a. **Loading datasets**: Scikit-learn comes with several built-in datasets (e.g., iris, digits, breast cancer) that can be loaded using the datasets module.

In [None]:
from sklearn import datasets
iris = datasets.load_iris()
#iris.data

* b. **Data splitting**: Split the data into training and testing sets using the train_test_split function from the model_selection module.

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.3, random_state=42, stratify=iris.target)


* c. **Feature scaling:** Standardize or normalize the data using transformers like StandardScaler or MinMaxScaler from the preprocessing module.

In [None]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)


* d. **Handling missing values**: Impute missing values using transformers like SimpleImputer from the impute module.

In [None]:
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy='mean')
X_train_imputed = imputer.fit_transform(X_train)
X_test_imputed = imputer.transform(X_test)


By understanding the scikit-learn API and its utilities, you can load, preprocess, and prepare your data for various classification tasks.
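The preprocessing steps above can also be chained with a classifier in a single `Pipeline`, so that each transformer is fitted on the training data only. The following is a minimal sketch on the iris data; the step names (`impute`, `scale`, `clf`) and the choice of classifier are assumptions made for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Each step is fitted on the training data, then reused unchanged on the test data
pipe = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])
pipe.fit(X_train, y_train)
test_accuracy = pipe.score(X_test, y_test)
print("Test accuracy:", test_accuracy)
```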

## 3 Supervised Learning: Classification Techniques

### a. Logistic Regression

#### Understanding logistic regression: 

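Logistic regression is a linear model for classification. In the binary case, it estimates the probability that an instance belongs to the positive class by passing a linear combination of the features through the logistic (sigmoid) function, and training finds the linear decision boundary that best separates the two classes. scikit-learn's `LogisticRegression` also handles multiclass targets such as the three iris species used here.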
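To make the role of the sigmoid concrete, here is a small NumPy sketch of how a linear score becomes a probability and then a label. The weights, bias, and input are hypothetical values chosen for illustration, not the output of a fitted model.

```python
import numpy as np

def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical weights, bias, and input for a two-feature model
w = np.array([1.5, -0.5])
b = 0.25
x = np.array([2.0, 1.0])

z = w @ x + b          # linear score
p = sigmoid(z)         # estimated probability of the positive class
label = int(p >= 0.5)  # 0.5 is the usual default threshold
print(z, p, label)
```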

#### Implementing logistic regression with scikit-learn:

In [None]:
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()
logreg.fit(X_train_scaled, y_train)
predictions = logreg.predict(X_test_scaled)

### b. K-Nearest Neighbors (KNN)

#### Understanding KNN: 

K-Nearest Neighbors is a non-parametric, instance-based learning algorithm used for classification tasks. Given a new instance, KNN finds the k nearest training instances in the feature space and assigns the majority class label among these neighbors.
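The neighbor search and majority vote can be sketched in a few lines of NumPy. This is a simplified illustration with Euclidean distance and invented toy data; `knn_predict` is a hypothetical helper, not part of scikit-learn.

```python
from collections import Counter
import numpy as np

def knn_predict(X_train, y_train, x_new, k=3):
    """Predict the label of x_new by majority vote among its k nearest neighbors."""
    distances = np.linalg.norm(X_train - x_new, axis=1)  # Euclidean distances
    nearest = np.argsort(distances)[:k]                   # indices of the k closest points
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]

# Toy data invented for illustration: two well-separated clusters
X_toy = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                  [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict(X_toy, y_toy, np.array([0.3, 0.3])))
print(knn_predict(X_toy, y_toy, np.array([4.8, 5.0])))
```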

#### Implementing KNN with scikit-learn:

In [None]:
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train_scaled, y_train)
predictions = knn.predict(X_test_scaled)


### c. Support Vector Machines (SVM)

#### Understanding SVM:
Support Vector Machines is a powerful classification algorithm that can be used for linear or non-linear classification tasks. The main idea of SVM is to find the hyperplane that best separates the classes with the maximum margin, which is the distance between the hyperplane and the nearest instances from each class (support vectors).
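For a linear SVM, the margin can be read off the fitted model: with weight vector `w`, the distance from the hyperplane to the nearest instances is `1 / ||w||`. The sketch below uses invented, linearly separable toy data and a large `C` to approximate a hard margin; both choices are assumptions made for illustration.

```python
import numpy as np
from sklearn.svm import SVC

# Toy, linearly separable data invented for illustration
X_toy = np.array([[0.0, 0.0], [0.0, 1.0], [2.0, 0.0], [2.0, 1.0]])
y_toy = np.array([0, 0, 1, 1])

# A large C approximates a hard margin on separable data
clf = SVC(kernel="linear", C=1e6).fit(X_toy, y_toy)

w = clf.coef_[0]                  # normal vector of the separating hyperplane
margin = 1.0 / np.linalg.norm(w)  # distance from the hyperplane to the support vectors
print("Support vectors:")
print(clf.support_vectors_)
print("Margin:", margin)
```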

#### Implementing SVM with scikit-learn:

In [None]:
from sklearn.svm import SVC
svm = SVC(kernel='linear', C=1)
svm.fit(X_train_scaled, y_train)
predictions = svm.predict(X_test_scaled)


### d. Decision Trees

#### Understanding decision trees: 
Decision trees are a type of flowchart-like structure used for classification tasks. The tree consists of nodes, which represent features or decisions, and branches, which represent the outcome of a decision. The model is trained to recursively split the data based on the feature that provides the best separation of the classes.
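One common way to score "best separation" is Gini impurity, which is the default criterion of scikit-learn's `DecisionTreeClassifier`. The sketch below computes it by hand for a few candidate splits; `gini` and `split_impurity` are hypothetical helper names and the labels are toy values.

```python
import numpy as np

def gini(labels):
    """Gini impurity: 0 for a pure node, larger when classes are mixed."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def split_impurity(left, right):
    """Weighted average impurity of the two child nodes produced by a split."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

parent = [0, 0, 0, 1, 1, 1]
print(gini(parent))                          # mixed parent node
print(split_impurity([0, 0, 0], [1, 1, 1]))  # perfect split
print(split_impurity([0, 0, 1], [0, 1, 1]))  # weaker split
```

A split is preferred when it lowers the weighted impurity of the children relative to the parent.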

#### Implementing decision trees with scikit-learn:

In [None]:
from sklearn.tree import DecisionTreeClassifier
dtree = DecisionTreeClassifier(max_depth=3, random_state=42)
dtree.fit(X_train_scaled, y_train)
predictions = dtree.predict(X_test_scaled)


### e. Random Forest

#### Understanding random forest: 

Random Forest is an ensemble learning method that trains many decision trees, each on a bootstrap sample of the data with a random subset of features considered at each split, and combines their predictions by majority voting (scikit-learn averages the trees' predicted class probabilities). Combining many decorrelated trees reduces the overfitting of a single decision tree and usually generalizes better.
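In scikit-learn, the forest's prediction is obtained by averaging the class probabilities predicted by the individual trees, a soft form of voting. The following sketch checks this on the iris data; the number of trees and the random seed are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
forest = RandomForestClassifier(n_estimators=25, random_state=0).fit(X, y)

# Collect the class probabilities predicted by each individual tree
per_tree = np.array([tree.predict_proba(X) for tree in forest.estimators_])
averaged = per_tree.mean(axis=0)

# The forest's own probabilities should match the average over its trees
matches = np.allclose(averaged, forest.predict_proba(X))
print("Matches forest.predict_proba:", matches)
```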

#### Implementing random forest with scikit-learn:

In [None]:
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators=100, max_depth=3, random_state=42)
rf.fit(X_train_scaled, y_train)
predictions = rf.predict(X_test_scaled)


## 4 Model Evaluation and Selection

### Train-test split:
The train-test split is a technique used to divide the dataset into two parts, one for training the model and the other for testing the model's performance. This helps to evaluate the model's ability to generalize to unseen data.

In [None]:
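from sklearn.model_selection import train_test_split
# X and y are the features and labels of the iris dataset loaded earlier
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)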


### Cross-validation:
Cross-validation is a more robust technique for evaluating the model's performance by dividing the dataset into multiple folds. The model is trained and tested multiple times, using different combinations of training and testing folds. The most common method is k-fold cross-validation.

In [None]:
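from sklearn.model_selection import cross_val_score
# 5-fold cross-validation of the logistic regression model defined earlier
scores = cross_val_score(logreg, X_train_scaled, y_train, cv=5)
print("Mean accuracy:", scores.mean())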
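To show what happens inside k-fold cross-validation, the sketch below writes the loop out by hand and compares it with `cross_val_score`. It assumes a decision tree on the iris data purely for illustration; for classifiers with an integer `cv`, scikit-learn uses stratified folds, so `StratifiedKFold` is used here as well.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
model = DecisionTreeClassifier(max_depth=3, random_state=42)

# Manual 5-fold loop: train on four folds, score on the held-out fold
manual_scores = []
for train_idx, test_idx in StratifiedKFold(n_splits=5).split(X, y):
    fold_model = clone(model).fit(X[train_idx], y[train_idx])
    manual_scores.append(fold_model.score(X[test_idx], y[test_idx]))

auto_scores = cross_val_score(model, X, y, cv=5)
print("Manual:", np.round(manual_scores, 3))
print("cross_val_score:", np.round(auto_scores, 3))
```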


### Performance metrics:

Various performance metrics can be used to evaluate the quality of a classification model. Some common metrics include:

* **Accuracy**: The proportion of correct predictions among the total number of instances.
* **Precision**: The proportion of true positives among the instances predicted as positive.
* **Recall (Sensitivity)**: The proportion of true positives among the instances that are actually positive.
* **F1-score:** The harmonic mean of precision and recall.
* **ROC-AUC:** The area under the receiver operating characteristic (ROC) curve, which plots the true positive rate against the false positive rate.
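The definitions above can be checked by hand from the counts of true and false positives and negatives. The labels below are toy values invented for illustration, with 1 as the positive class.

```python
import numpy as np
from sklearn.metrics import f1_score, precision_score, recall_score

# Toy binary labels invented for illustration (1 = positive class)
y_true = np.array([1, 1, 1, 0, 0, 0, 1, 0])
y_pred = np.array([1, 1, 0, 0, 1, 0, 1, 0])

tp = np.sum((y_true == 1) & (y_pred == 1))  # true positives
fp = np.sum((y_true == 0) & (y_pred == 1))  # false positives
fn = np.sum((y_true == 1) & (y_pred == 0))  # false negatives
tn = np.sum((y_true == 0) & (y_pred == 0))  # true negatives

accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
print(accuracy, precision, recall, f1)
```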

In [13]:
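from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, classification_report
# The iris target has three classes, so precision, recall, and F1 need an averaging strategy
accuracy = accuracy_score(y_test, predictions)
precision = precision_score(y_test, predictions, average='macro')
recall = recall_score(y_test, predictions, average='macro')
f1 = f1_score(y_test, predictions, average='macro')
# Multiclass ROC-AUC needs class probabilities rather than hard labels
roc_auc = roc_auc_score(y_test, rf.predict_proba(X_test_scaled), multi_class='ovr')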


In [15]:
report = classification_report(y_test, predictions)
print(report)

              precision    recall  f1-score   support

           0       1.00      1.00      1.00        15
           1       0.82      0.93      0.87        15
           2       0.92      0.80      0.86        15

    accuracy                           0.91        45
   macro avg       0.92      0.91      0.91        45
weighted avg       0.92      0.91      0.91        45

### Hyperparameter tuning:

Hyperparameter tuning is the process of finding the optimal values for the hyperparameters of a model. Two popular methods for hyperparameter tuning are grid search and random search.

#### Grid search: 
Exhaustively evaluates every combination of the hyperparameter values listed in the grid, scoring each one with cross-validation.

In [None]:
from sklearn.model_selection import GridSearchCV
param_grid = {'C': [0.1, 1, 10], 'kernel': ['linear', 'rbf']}
grid_search = GridSearchCV(SVC(), param_grid, cv=5)
grid_search.fit(X_train_scaled, y_train)
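The cost of a grid search grows with the product of the list lengths. As a hedged sketch, the snippet below counts the candidates with `ParameterGrid` and then reads the results of a fitted search; it reloads the iris data so that it can run on its own.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, ParameterGrid
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
param_grid = {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}

# 3 values of C x 2 kernels = 6 candidates, each fitted once per fold
n_candidates = len(ParameterGrid(param_grid))
print("Candidates:", n_candidates, "-> fits with cv=5:", n_candidates * 5)

search = GridSearchCV(SVC(), param_grid, cv=5).fit(X, y)
print("Best parameters:", search.best_params_)
print("Best mean CV accuracy:", round(search.best_score_, 3))
```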

#### Random search: 

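Samples a fixed number of random combinations (`n_iter`) from the specified lists or distributions, which is cheaper than grid search when the search space is large.

In [None]:
from scipy.stats import loguniform
from sklearn.model_selection import RandomizedSearchCV
# C is drawn from a continuous log-uniform range; kernel is drawn from a list
param_dist = {'C': loguniform(0.1, 10), 'kernel': ['linear', 'rbf']}
random_search = RandomizedSearchCV(SVC(), param_dist, n_iter=10, cv=5, random_state=42)
random_search.fit(X_train_scaled, y_train)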


### Model selection and comparison:

After evaluating the performance of different models and tuning their hyperparameters, you can compare the models and select the one that performs best on the chosen metrics. This will help you choose the most suitable model for your classification task.

In [None]:
# Example: Comparing the accuracy of two models
predictions_logreg = logreg.predict(X_test_scaled)
predictions_knn = knn.predict(X_test_scaled)
accuracy_logreg = accuracy_score(y_test, predictions_logreg)
accuracy_knn = accuracy_score(y_test, predictions_knn)

print("Logistic Regression Accuracy:", accuracy_logreg)
print("K-Nearest Neighbors Accuracy:", accuracy_knn)
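A single test-set score can be noisy on a small dataset, so comparing cross-validated scores is often more reliable. The sketch below is one way to do this, assuming the iris data and two of the models from section 3; scaling is placed inside each pipeline so it is refitted on every training fold.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

candidates = {
    "Logistic Regression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "K-Nearest Neighbors": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=3)),
}

# Mean and spread of 5-fold accuracy for each candidate model
results = {}
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5)
    results[name] = scores.mean()
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```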


## 5 Advanced Classification Techniques

### a. Imbalanced Classification
Imbalanced classification deals with datasets where one class is significantly under-represented compared to the other classes. This can lead to biased models that perform poorly on the minority class.

#### Understanding imbalanced datasets: 

Imbalanced datasets can occur in real-world problems like fraud detection, medical diagnosis, and rare event prediction. The imbalance can lead to a higher misclassification rate for the minority class, as the model is biased towards the majority class.

#### Resampling techniques: 

Resampling techniques can be used to balance the class distribution by either oversampling the minority class or undersampling the majority class.

* **Oversampling**: Randomly replicating instances from the minority class to increase its representation.

In [None]:
from imblearn.over_sampling import RandomOverSampler
ros = RandomOverSampler(sampling_strategy='minority')
X_resampled, y_resampled = ros.fit_resample(X_train, y_train)


* **Undersampling**: Randomly removing instances from the majority class to decrease its representation.


In [None]:
from imblearn.under_sampling import RandomUnderSampler
rus = RandomUnderSampler(sampling_strategy='majority')
X_resampled, y_resampled = rus.fit_resample(X_train, y_train)
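If imbalanced-learn is not installed, the same idea can be sketched with `sklearn.utils.resample`. The toy data below is invented for illustration; in practice, resampling is usually applied to the training set only, so the test set keeps the original class distribution.

```python
import numpy as np
from sklearn.utils import resample

# Toy imbalanced data invented for illustration: 8 majority vs 2 minority samples
X = np.arange(20, dtype=float).reshape(10, 2)
y = np.array([0] * 8 + [1] * 2)

X_min, X_maj = X[y == 1], X[y == 0]

# Draw minority rows with replacement until both classes have the same size
X_min_up = resample(X_min, replace=True, n_samples=len(X_maj), random_state=42)

X_balanced = np.vstack([X_maj, X_min_up])
y_balanced = np.array([0] * len(X_maj) + [1] * len(X_min_up))
print(np.bincount(y_balanced))
```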

#### Evaluation metrics for imbalanced datasets: 

On imbalanced datasets, accuracy alone can be misleading: a model that always predicts the majority class scores high accuracy while missing every minority instance. Metrics such as precision, recall, F1-score, and the area under the precision-recall curve (PR-AUC) are usually more informative.
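A quick way to see the problem with accuracy is a baseline that always predicts the majority class. The sketch below uses `DummyClassifier` on invented data with 95 negatives and 5 positives.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Toy data invented for illustration: 95 negatives and 5 positives
X = np.zeros((100, 1))
y = np.array([0] * 95 + [1] * 5)

# Baseline that always predicts the most frequent class
baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
y_pred = baseline.predict(X)

acc = accuracy_score(y, y_pred)
rec = recall_score(y, y_pred, zero_division=0)
print("Accuracy:", acc)
print("Recall (minority class):", rec)
```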

Precision-Recall Curve and PR-AUC:

In [None]:
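from sklearn.metrics import precision_recall_curve, auc
# Assumes a binary problem; pass scores or probabilities for the positive class,
# e.g. y_scores = clf.predict_proba(X_test)[:, 1] for a fitted classifier clf
precision, recall, _ = precision_recall_curve(y_test, y_scores)
pr_auc = auc(recall, precision)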
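The following self-contained sketch computes PR-AUC end to end for a binary problem. The synthetic dataset, its class weights, and the choice of logistic regression are all assumptions made for illustration; `average_precision_score` is shown alongside as a closely related summary of the same curve.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import auc, average_precision_score, precision_recall_curve
from sklearn.model_selection import train_test_split

# Synthetic binary data with roughly 10% positives (parameters chosen for illustration)
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_scores = clf.predict_proba(X_test)[:, 1]  # probability of the positive class

precision, recall, _ = precision_recall_curve(y_test, y_scores)
pr_auc = auc(recall, precision)                       # trapezoidal area under the curve
avg_prec = average_precision_score(y_test, y_scores)  # step-wise summary of the curve
print("PR-AUC:", round(pr_auc, 3), "Average precision:", round(avg_prec, 3))
```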


By addressing imbalanced datasets with resampling techniques, you can improve how well your classification models perform on the minority class. Using appropriate evaluation metrics will also help you assess and compare models on imbalanced datasets more reliably.

## 6 Practical Project

In this practical project, we will go through the process of choosing a real-world classification dataset, preprocessing and exploring the data, selecting, training, and evaluating models, tuning hyperparameters, and presenting the final results.



### 1 Choosing a classification dataset:
Find a suitable classification dataset for your project. Examples include the Iris dataset, the Breast Cancer Wisconsin dataset, or the Wine Quality dataset. You can also explore public datasets available on platforms like Kaggle or UCI Machine Learning Repository.

### 2 Data preprocessing and exploration:
Load the dataset, clean the data if necessary, and perform exploratory data analysis to understand the data's characteristics.

In this example, we'll use the red wine subset of the Wine Quality dataset from the UCI Machine Learning Repository, with the integer `quality` score as the class label. Here's the code to load the data and perform exploratory data analysis.

In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Load the dataset
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv"
data = pd.read_csv(url, sep=";")

# Basic statistics
print(data.describe())

# Check for missing values
print("\nMissing values:")
print(data.isnull().sum())

# Visualize feature distributions and relationships
sns.pairplot(data, hue='quality', corner=True)
plt.show()



### 3 Model selection, training, and evaluation:

Split the data into training and testing sets, train different classification models, and evaluate their performance using appropriate metrics.

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Load the dataset
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv"
data = pd.read_csv(url, sep=";")

# Preprocessing
X = data.drop('quality', axis=1)
y = data['quality']

# Splitting dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Scaling features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Training and evaluating models
logreg = LogisticRegression(max_iter=1000)
logreg.fit(X_train, y_train)
logreg_preds = logreg.predict(X_test)

rf = RandomForestClassifier()
rf.fit(X_train, y_train)
rf_preds = rf.predict(X_test)

svc = SVC()
svc.fit(X_train, y_train)
svc_preds = svc.predict(X_test)

# Metrics
def evaluate(y_true, y_pred):
    accuracy = accuracy_score(y_true, y_pred)
    precision = precision_score(y_true, y_pred, average='weighted', zero_division=0)
    recall = recall_score(y_true, y_pred, average='weighted', zero_division=0)
    f1 = f1_score(y_true, y_pred, average='weighted', zero_division=0)
    return accuracy, precision, recall, f1

logreg_metrics = evaluate(y_test, logreg_preds)
rf_metrics = evaluate(y_test, rf_preds)
svc_metrics = evaluate(y_test, svc_preds)

print("Logistic Regression Metrics - Accuracy: {}, Precision: {}, Recall: {}, F1-score: {}".format(*logreg_metrics))
print("Random Forest Metrics - Accuracy: {}, Precision: {}, Recall: {}, F1-score: {}".format(*rf_metrics))
print("SVM Metrics - Accuracy: {}, Precision: {}, Recall: {}, F1-score: {}".format(*svc_metrics))



This code loads the Wine Quality dataset, preprocesses it, splits it into training and testing sets, trains three classification models, and evaluates their performance using accuracy, precision, recall, and F1-score.

### 4 Hyperparameter tuning and model optimization:
    
Optimize the models' performance by tuning their hyperparameters using techniques like grid search or random search.

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Load the dataset
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv"
data = pd.read_csv(url, sep=";")

# Preprocessing
X = data.drop('quality', axis=1)
y = data['quality']

# Splitting dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Scaling features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Training and evaluating the baseline model
rf = RandomForestClassifier()
rf.fit(X_train, y_train)
rf_preds = rf.predict(X_test)

# Metrics
def evaluate(y_true, y_pred):
    accuracy = accuracy_score(y_true, y_pred)
    precision = precision_score(y_true, y_pred, average='weighted', zero_division=0)
    recall = recall_score(y_true, y_pred, average='weighted', zero_division=0)
    f1 = f1_score(y_true, y_pred, average='weighted', zero_division=0)
    return accuracy, precision, recall, f1

rf_metrics = evaluate(y_test, rf_preds)
print("Baseline Random Forest Metrics - Accuracy: {}, Precision: {}, Recall: {}, F1-score: {}".format(*rf_metrics))

# Hyperparameter tuning using Grid Search
param_grid = {
    'n_estimators': [10, 50, 100, 200],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}
grid_search = GridSearchCV(RandomForestClassifier(), param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train, y_train)
best_params = grid_search.best_params_

# Training and evaluating the optimized model
rf_optimized = RandomForestClassifier(**best_params)
rf_optimized.fit(X_train, y_train)
rf_optimized_preds = rf_optimized.predict(X_test)

rf_optimized_metrics = evaluate(y_test, rf_optimized_preds)
print("Optimized Random Forest Metrics - Accuracy: {}, Precision: {}, Recall: {}, F1-score: {}".format(*rf_optimized_metrics))


This code loads the Wine Quality dataset, preprocesses it, splits it into training and testing sets, trains a baseline Random Forest model, and evaluates its performance. Then, it performs hyperparameter tuning using Grid Search and retrains the optimized model, evaluating its performance to compare with the baseline.

### 5 Presenting the final results:
After optimizing the models, select the best model based on the chosen evaluation metrics, and present the final results, including the model's performance on the test dataset, feature importances or coefficients, and any insights derived from the analysis.

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Load the dataset
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv"
data = pd.read_csv(url, sep=";")

# Preprocessing
X = data.drop('quality', axis=1)
y = data['quality']

# Splitting dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Scaling features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Hyperparameter tuning using Grid Search
param_grid = {
    'n_estimators': [10, 50, 100, 200],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}
grid_search = GridSearchCV(RandomForestClassifier(), param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train, y_train)
best_params = grid_search.best_params_

# Training and evaluating the optimized model
rf_optimized = RandomForestClassifier(**best_params)
rf_optimized.fit(X_train, y_train)
rf_optimized_preds = rf_optimized.predict(X_test)

# Metrics
def evaluate(y_true, y_pred):
    accuracy = accuracy_score(y_true, y_pred)
    precision = precision_score(y_true, y_pred, average='weighted', zero_division=0)
    recall = recall_score(y_true, y_pred, average='weighted', zero_division=0)
    f1 = f1_score(y_true, y_pred, average='weighted', zero_division=0)
    return accuracy, precision, recall, f1

rf_optimized_metrics = evaluate(y_test, rf_optimized_preds)
print("Optimized Random Forest Metrics - Accuracy: {}, Precision: {}, Recall: {}, F1-score: {}".format(*rf_optimized_metrics))

# Identifying important features
important_features = pd.Series(rf_optimized.feature_importances_, index=X.columns)
important_features = important_features.sort_values(ascending=False)

print("\nImportant Features:")
print(important_features)

# Conclusion
print("\nFinal model: optimized Random Forest (see the metrics printed above).")
print("The top features contributing to wine quality prediction are:")
print(important_features.head(5))


This code loads the Wine Quality dataset, preprocesses it, splits it into training and testing sets, performs hyperparameter tuning using Grid Search, and trains the optimized Random Forest model. It then evaluates the model's performance on the test dataset, identifies the important features, and presents the final results.

In the conclusion, we report the best model based on the chosen evaluation metrics and list the top features contributing to wine quality prediction.
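Impurity-based `feature_importances_` are computed on the training data, so it can be useful to cross-check them with permutation importance on the test set. The sketch below is an illustration only: it uses scikit-learn's built-in `load_wine` dataset as a stand-in (it is not the Wine Quality data), and the model settings are arbitrary.

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Built-in wine dataset used as a stand-in; it is not the Wine Quality data
data = load_wine()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.3, random_state=42, stratify=data.target
)

model = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_train, y_train)

# Shuffle one feature at a time and measure the drop in test accuracy
result = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=42)
ranking = sorted(zip(data.feature_names, result.importances_mean),
                 key=lambda item: item[1], reverse=True)
for name, score in ranking[:5]:
    print(f"{name}: {score:.3f}")
```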

## 7 Assignment

## Assignment: Predicting Customer Churn

### Objective: 

The goal of this assignment is to build a classification model to predict whether a customer will churn (stop using a service) based on their features and interactions with the service.

### Dataset: 
The Telco Customer Churn dataset, available on Kaggle, contains information about a fictional telecommunication company's customers and whether they have churned. You can download the dataset from Kaggle: https://www.kaggle.com/datasets/blastchar/telco-customer-churn

### Tasks:

* Load and explore the dataset: Analyze the distribution of features, check for missing values, and visualize relationships between features and the target variable (churn).

    
    
* Preprocess the data: Handle missing values, convert categorical variables to numeric, and normalize/standardize the features if necessary.

    
    
* Split the dataset: Divide the dataset into training and testing sets.

    
    
* Train classification models: Train various classification models (e.g., logistic regression, KNN, SVM, decision tree, random forest, etc.) on the training dataset.

    
    
* Evaluate the models: Assess the performance of the models using appropriate metrics such as accuracy, precision, recall, F1-score, and ROC-AUC.

    
    
* Optimize the models: Perform hyperparameter tuning using techniques like grid search or random search to improve the performance of the models.

    
    
* Feature selection and dimensionality reduction: Apply feature selection techniques such as RFE, variance threshold, or dimensionality reduction methods like PCA and LDA to reduce the number of features and potentially improve model performance.

    
    
* Select the best model: Choose the best-performing model based on the evaluation metrics.

    
    
* Interpret the results: Discuss the performance of the chosen model, the importance of different features, and any insights gained from the analysis.

    
    
* Conclusion: Summarize the findings, mention any limitations of the project, and suggest possible improvements or future work.

## Solution for the Customer Churn assignment

This code provides a compact solution to the Customer Churn assignment. It loads the data, preprocesses it, trains different models, optimizes the hyperparameters, selects the most important features, and presents the results.

Remember that in a real-world scenario, it's crucial to explore the data and models in more detail and interpret the results accordingly. Additionally, it is recommended to try other advanced classification techniques or address class imbalance issues if applicable to your dataset.
