## Introduction to `Scikit-Learn`

#### Scikit-learn is a powerful Python library for machine learning. It provides simple and efficient tools for data mining and data analysis, and it supports algorithms for classification, regression, clustering, and dimensionality reduction.

### 1. Loading Data

#### Before applying any machine learning algorithms, you need data. Let's use dummy data to make this easier to follow.

##### `Scenario`: You are working on predicting whether a customer will buy a product based on features such as age and income.

In [1]:
%pip install scikit-learn

Note: you may need to restart the kernel to use updated packages.


In [17]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Creating a dummy dataset
data = {
    'Age': [22, 25, 47, 52, 46, 56, 56, 60, 62, 61],
    'Income': [15000, 18000, 32000, 40000, 52000, 55000, 60000, 62000, 64000, 65000],
    'Buy': [0, 0, 1, 1, 1, 1, 1, 1, 1, 1]  # 0 means No, 1 means Yes
}

df = pd.DataFrame(data)

# Features and target
X = df[['Age', 'Income']]  # Features
y = df['Buy']  # Target

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

print(X_test)
print(y_test)


# Explanation:

# train_test_split divides the data into training and testing sets.
# test_size=0.3 holds out 3 of the 10 rows; random_state=42 makes the split reproducible.
# The features (Age and Income) are the independent variables,
# while Buy is the dependent variable (whether a customer buys the product).


   Age  Income
8   62   64000
1   25   18000
5   56   55000
8    1
1    0
5    1
Name: Buy, dtype: int64
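
With only two non-buyers among ten rows, a purely random split can leave one side with a single class. The following is a minimal sketch, assuming the same dummy data, of passing `stratify=y` to `train_test_split` so that both splits keep roughly the original class ratio. The `_s` variable names are illustrative and are not used elsewhere in this notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Same dummy dataset as above
df = pd.DataFrame({
    'Age': [22, 25, 47, 52, 46, 56, 56, 60, 62, 61],
    'Income': [15000, 18000, 32000, 40000, 52000, 55000, 60000, 62000, 64000, 65000],
    'Buy': [0, 0, 1, 1, 1, 1, 1, 1, 1, 1],
})
X = df[['Age', 'Income']]
y = df['Buy']

# stratify=y asks for the class proportions of y to be preserved in both splits
X_train_s, X_test_s, y_train_s, y_test_s = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

print("Train class counts:\n", y_train_s.value_counts())
print("Test class counts:\n", y_test_s.value_counts())
```

On a dataset this small the effect is modest, but it guarantees that at least one non-buyer appears on each side of the split.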


### 2. Preprocessing Data
#### Scikit-learn provides tools for preprocessing data. Sometimes, we need to scale or normalize data to ensure that machine learning algorithms perform well.

##### `Scenario`: Age and income are on different scales (e.g., 20-70 for age, 10k-100k for income). Algorithms that rely on distances or gradient-based optimization might perform poorly with unscaled data, because the larger-valued feature dominates.

In [5]:
df.head(10)

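|   | Age | Income | Buy |
|---|-----|--------|-----|
| 0 | 22 | 15000 | 0 |
| 1 | 25 | 18000 | 0 |
| 2 | 47 | 32000 | 1 |
| 3 | 52 | 40000 | 1 |
| 4 | 46 | 52000 | 1 |
| 5 | 56 | 55000 | 1 |
| 6 | 56 | 60000 | 1 |
| 7 | 60 | 62000 | 1 |
| 8 | 62 | 64000 | 1 |
| 9 | 61 | 65000 | 1 |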


In [8]:
## StandardScaler

from sklearn.preprocessing import StandardScaler

# Standardizing data
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled)


## Explanation:

# StandardScaler rescales each feature to a mean of 0 and a standard deviation of 1.
# This helps algorithms that are sensitive to feature scale (like SVM or regularized Logistic Regression);
# it does not make the data normally distributed.

## MinMaxScaler

from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled)


# Description: Scales features to a given range, usually [0, 1].
# Use Case: Useful when the bounds of the data are known.

[[-2.20069019 -1.8495181 ]
 [ 0.88027607  0.9038369 ]
 [-0.1737387  -0.85362374]
 [ 0.96135413  1.07958296]
 [-0.25481676  0.31801669]
 [ 0.2316516  -0.38496757]
 [ 0.55596384  0.78667286]]
[[0.         0.        ]
 [0.97435897 0.94      ]
 [0.64102564 0.34      ]
 [1.         1.        ]
 [0.61538462 0.74      ]
 [0.76923077 0.5       ]
 [0.87179487 0.9       ]]
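
Both scalers apply simple per-column formulas, which can be checked by hand. The sketch below assumes the seven training rows are the ones not printed in the test set earlier, and compares a manual NumPy computation with the scikit-learn result. The variable names `standardized` and `min_max` are illustrative.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Age and Income of the seven training rows (assumed from the split above)
X_train = np.array([
    [22, 15000], [60, 62000], [47, 32000], [61, 65000],
    [46, 52000], [52, 40000], [56, 60000],
], dtype=float)

# StandardScaler: subtract the column mean, divide by the population std (ddof=0)
standardized = (X_train - X_train.mean(axis=0)) / X_train.std(axis=0)

# MinMaxScaler: map the column minimum to 0 and the column maximum to 1
col_min, col_max = X_train.min(axis=0), X_train.max(axis=0)
min_max = (X_train - col_min) / (col_max - col_min)

print(np.allclose(standardized, StandardScaler().fit_transform(X_train)))  # True
print(np.allclose(min_max, MinMaxScaler().fit_transform(X_train)))         # True
```

Note that the statistics (mean, std, min, max) come from the training rows only; the test rows are transformed with those same values, which is why `fit_transform` is called on the training set and `transform` on the test set.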


### 3. Classification using Logistic Regression

#### Logistic Regression is used when the target variable is binary (like "Buy" or "Not Buy").

##### `Scenario`: Predicting whether a customer will buy a product based on age and income

In [10]:
from sklearn.linear_model import LogisticRegression

# Create and train the model
log_reg = LogisticRegression()
log_reg.fit(X_train_scaled, y_train)

# Make predictions
y_pred = log_reg.predict(X_test_scaled)

print("Predicted values:", y_pred)


# Explanation:

# Logistic Regression is a classification algorithm 
# used when the dependent variable is binary.
# After training, you can use the predict method to classify unseen data.

## Another binary classification example: email data,
## where each message is either spam or not spam.

Predicted values: [1 1 1]
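
`predict` returns only the final class. When the confidence behind each prediction matters, `predict_proba` exposes the estimated class probabilities. The sketch below is a self-contained illustration that assumes the same training and test rows as the split above, scaled with `MinMaxScaler`; the printed probabilities depend on the scikit-learn version and are not reproduced here.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import MinMaxScaler

# Training and test rows (assumed from the split above)
X_train = np.array([
    [22, 15000], [60, 62000], [47, 32000], [61, 65000],
    [46, 52000], [52, 40000], [56, 60000],
], dtype=float)
y_train = np.array([0, 1, 1, 1, 1, 1, 1])
X_test = np.array([[62, 64000], [25, 18000], [56, 55000]], dtype=float)

scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

log_reg = LogisticRegression()
log_reg.fit(X_train_scaled, y_train)

# One row per test sample; columns follow the order of log_reg.classes_
proba = log_reg.predict_proba(X_test_scaled)
print("Classes:", log_reg.classes_)
print("P(Buy = 0), P(Buy = 1) per test row:\n", proba.round(3))
```

With a single non-buyer in the training set, the probabilities lean heavily toward class 1, which is consistent with the all-ones prediction shown above.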


### 4. Evaluating the Model
#### After training the model, it's important to evaluate its performance using metrics like accuracy, precision, recall, and F1 score.

##### `Scenario`: You want to know how well the model is predicting customer behavior.

#### **1. Confusion Matrix**
A confusion matrix is a summary of prediction results for a classification problem. It shows the number of correct and incorrect predictions categorized by actual classes and predicted classes.

For a binary problem, the confusion matrix has four cells:
- **True Positives (TP)**: Correctly predicted positive class (actual = 1, predicted = 1).
- **True Negatives (TN)**: Correctly predicted negative class (actual = 0, predicted = 0).
- **False Positives (FP)**: Incorrectly predicted positive class (actual = 0, predicted = 1).
- **False Negatives (FN)**: Incorrectly predicted negative class (actual = 1, predicted = 0).

|               | Predicted: No (0) | Predicted: Yes (1) |
|---------------|------------------|-------------------|
| **Actual: No (0)** | True Negative (TN) | False Positive (FP) |
| **Actual: Yes (1)** | False Negative (FN) | True Positive (TP) |

In [13]:
from sklearn.metrics import confusion_matrix

# Assuming `y_test` is the actual labels and `y_pred` is the predicted labels
conf_matrix = confusion_matrix(y_test, y_pred)

print("Confusion Matrix:\n", conf_matrix)


Confusion Matrix:
 [[0 1]
 [0 2]]


**Explanation**:
The confusion matrix gives us an understanding of the model’s classification performance on each category.

For example:
- If the model classifies too many "No" customers as "Yes," this shows up as a higher False Positive (FP) count.
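
The four counts can also be read out of the matrix directly. A minimal sketch, assuming the three test labels and predictions from this notebook (the names `y_true` and `y_hat` are illustrative):

```python
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1]  # actual test labels
y_hat = [1, 1, 1]   # model predictions

# With labels=[0, 1], ravel() flattens the 2x2 matrix row by row: TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true, y_hat, labels=[0, 1]).ravel()
print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")  # TN=0, FP=1, FN=0, TP=2
```

Passing `labels=[0, 1]` explicitly keeps the matrix 2x2 even if one class happens to be absent from a small test set.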

#### **2. Accuracy**
Accuracy is simply the ratio of correctly predicted instances to the total number of instances. However, accuracy can be misleading in imbalanced datasets where one class dominates.

**Formula**:
\[
\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}
\]

In [18]:
from sklearn.metrics import accuracy_score

## y_test --> [1 0 1]
## y_pred --> [1 1 1]

## (TP + TN) / (TP + TN + FP + FN) = (2 + 0) / (2 + 0 + 1 + 0)

accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

Accuracy: 0.6666666666666666



**Explanation**:
- Accuracy is a good metric when the dataset is balanced (similar number of positive and negative samples).
- For imbalanced datasets, other metrics like precision and recall should be prioritized.
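
To see why accuracy can mislead, consider a hypothetical dataset with nine negatives and one positive, and a "model" that always predicts the majority class. The labels below are made up purely for illustration.

```python
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical imbalanced labels: 9 negatives, 1 positive
y_true = [0] * 9 + [1]
# A trivial "model" that always predicts the majority class
y_majority = [0] * 10

acc = accuracy_score(y_true, y_majority)
rec = recall_score(y_true, y_majority)
print("Accuracy:", acc)  # 0.9
print("Recall:", rec)    # 0.0
```

The 90% accuracy hides the fact that the only positive case was missed, which the recall of 0 makes obvious.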

#### **3. Precision**
Precision tells us the proportion of positive predictions that were actually correct. It is useful when the cost of false positives is high (e.g., flagging a legitimate email as spam).

**Formula**:
\[
\text{Precision} = \frac{TP}{TP + FP}
\]


**Scenario**: You want to minimize false positives because they could result in predicting that a customer will buy when they actually won’t, leading to unnecessary marketing expenses.

In [19]:
from sklearn.metrics import precision_score

precision = precision_score(y_test, y_pred)
print("Precision:", precision)

Precision: 0.6666666666666666



**Explanation**:
- Precision focuses on how many of the predicted "Yes" values were actually correct.
- A high precision score means the model makes few false positives.


#### **4. Recall (Sensitivity or True Positive Rate)**
Recall tells us the proportion of actual positives that were correctly identified. It is useful when missing a positive case is costly (e.g., failing to detect a fraudulent transaction).

**Formula**:
\[
\text{Recall} = \frac{TP}{TP + FN}
\]

**Scenario**: You want to catch as many potential buyers as possible (minimizing false negatives), so recall becomes important.

In [20]:
from sklearn.metrics import recall_score

recall = recall_score(y_test, y_pred)
print("Recall:", recall)

Recall: 1.0


**Explanation**:
- Recall focuses on how many of the actual "Yes" customers the model correctly predicted.
- A high recall score means fewer false negatives, so most actual "Yes" cases were captured by the model.
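
Precision and recall usually pull in opposite directions: predicting "Yes" more readily catches more buyers but also produces more false alarms. The sketch below uses made-up probability scores (an assumption, not output from the model above) and varies the decision threshold to show the trade-off.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

y_true = np.array([0, 0, 1, 0, 1, 1])
# Hypothetical predicted probabilities of the positive class
scores = np.array([0.10, 0.40, 0.35, 0.80, 0.65, 0.90])

results = {}
for threshold in (0.3, 0.6, 0.85):
    # Predict "Yes" only when the score reaches the threshold
    y_hat = (scores >= threshold).astype(int)
    p = precision_score(y_true, y_hat)
    r = recall_score(y_true, y_hat)
    results[threshold] = (p, r)
    print(f"threshold={threshold:.2f}  precision={p:.2f}  recall={r:.2f}")
```

In this toy example, raising the threshold from 0.3 to 0.85 moves precision from 0.60 to 1.00 while recall falls from 1.00 to 0.33.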

#### **5. F1 Score**
The F1 score is the harmonic mean of precision and recall, combining both into a single metric. It is especially useful when you need a balance between precision and recall.

**Formula**:
\[
F1 = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}
\]

**Scenario**: In situations where both false positives and false negatives are important, you can use the F1 score to balance precision and recall.


In [21]:
from sklearn.metrics import f1_score

f1 = f1_score(y_test, y_pred)
print("F1 Score:", f1)

F1 Score: 0.8


**Explanation**:
- The F1 score provides a single metric that balances both precision and recall.
- A higher F1 score means the model keeps both false positives and false negatives low.
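
The harmonic mean can be checked by hand against `f1_score`. A short sketch, assuming the three test labels and predictions from this notebook (`y_true`, `y_hat`, and `f1_manual` are illustrative names):

```python
from sklearn.metrics import f1_score, precision_score, recall_score

y_true = [1, 0, 1]  # actual test labels
y_hat = [1, 1, 1]   # model predictions

precision = precision_score(y_true, y_hat)  # 2 / 3
recall = recall_score(y_true, y_hat)        # 1.0

# Harmonic mean of precision and recall
f1_manual = 2 * precision * recall / (precision + recall)
f1_sklearn = f1_score(y_true, y_hat)

print("F1 by hand:", f1_manual)
print("f1_score:  ", f1_sklearn)
```

Both values come out at about 0.8, slightly below the arithmetic mean of roughly 0.83, because the harmonic mean is pulled toward the lower of the two inputs.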

### 5. Cross-Validation
#### Cross-validation is a statistical method used to estimate the performance of a machine learning model. Instead of training the model on a single training set and evaluating it on a single test set, cross-validation splits the dataset into multiple "folds" and trains and tests the model multiple times.

#### The idea behind cross-validation is to provide a better understanding of how the model performs on different subsets of data, giving a more reliable performance estimate.

##### `Scenario`: You want to ensure that your model performs consistently across different subsets of data.

#### Why is Cross-Validation Important?

##### `Detects Overfitting`: Scoring the model on data it was not trained on reveals whether it has memorized the training set instead of generalizing to unseen data. Cross-validation measures overfitting; it does not prevent it by itself.
##### `Better Model Evaluation`: Instead of evaluating the model on just one test set, cross-validation evaluates the model on several subsets of data, providing a more reliable estimate of its performance.
##### `Handles Data Variability`: Averaging over several splits makes the estimate less dependent on one lucky or unlucky split, which matters most when data is scarce or imbalanced.

#### Types of Cross-Validation

#### **1. K-Fold Cross-Validation**
In K-Fold Cross-Validation, the dataset is split into K equal parts (called folds). The model is trained on K-1 folds and tested on the remaining fold. This process is repeated K times, each time using a different fold as the test set. The performance of the model is then averaged over the K iterations.

**Scenario**: You have a small dataset, and you want to ensure that the model gets trained and tested on every portion of the data.
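
The fold mechanics can be made visible by printing the indices that `KFold` produces. This is a minimal sketch on ten placeholder samples (`X_demo` is an illustrative array, not the customer data):

```python
import numpy as np
from sklearn.model_selection import KFold

X_demo = np.arange(10).reshape(-1, 1)  # ten placeholder samples
kf = KFold(n_splits=5)

tested = []
for fold, (train_idx, test_idx) in enumerate(kf.split(X_demo), start=1):
    print(f"Fold {fold}: train={train_idx}, test={test_idx}")
    tested.extend(test_idx.tolist())

# Every sample lands in a test fold exactly once
print(sorted(tested) == list(range(10)))  # True
```

Without `shuffle=True`, the folds are consecutive blocks of rows, so data sorted by class or by time may need shuffling or a different splitter.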

In [24]:
from sklearn.model_selection import cross_val_score

# Cross-validation
cv_scores = cross_val_score(log_reg, X_train_scaled, y_train, cv=5)
print("Cross-validation scores:", cv_scores)
print("Average CV score:", np.mean(cv_scores))


# Explanation:

# cross_val_score evaluates the model on several train/test splits of the data.
# With an integer cv and a classifier, scikit-learn builds stratified folds by default.
# Averaging the scores gives a more reliable estimate of model performance.


Cross-validation scores: [nan  1.  1.  1.  1.]
Average CV score: nan


1 fits failed out of a total of 5.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
1 fits failed with the following error:
Traceback (most recent call last):
  File "e:\Codes\.venv\Lib\site-packages\sklearn\model_selection\_validation.py", line 888, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "e:\Codes\.venv\Lib\site-packages\sklearn\base.py", line 1473, in wrapper
    return fit_method(estimator, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "e:\Codes\.venv\Lib\site-packages\sklearn\linear_model\_logistic.py", line 1301, in fit
    raise ValueError(
ValueError: This solver needs samples of at least 2 classes in the data, but the data contains only one class: np.int64(1)
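
The `nan` score has a simple cause: the training set contains a single non-buyer, so whichever fold holds that row for testing leaves only buyers to train on, and `LogisticRegression` refuses to fit on one class. The sketch below reproduces just the splitting step, assuming the seven training labels from the split above; `X_placeholder` is an illustrative stand-in because the feature values do not affect how the folds are drawn.

```python
import warnings
import numpy as np
from sklearn.model_selection import StratifiedKFold

y_train = np.array([0, 1, 1, 1, 1, 1, 1])    # training labels: a single non-buyer
X_placeholder = np.zeros((len(y_train), 1))  # features are irrelevant for the split itself

single_class_folds = 0
with warnings.catch_warnings():
    # Silence the warning about the least populated class having too few members
    warnings.simplefilter("ignore")
    splitter = StratifiedKFold(n_splits=5)
    for fold, (train_idx, test_idx) in enumerate(
        splitter.split(X_placeholder, y_train), start=1
    ):
        classes = np.unique(y_train[train_idx])
        single_class_folds += int(len(classes) == 1)
        print(f"Fold {fold}: classes in training part = {classes}")

print("Folds whose training part has one class:", single_class_folds)
```

Exactly one fold ends up with a single-class training part, which matches the "1 fits failed out of a total of 5" message above.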



#### **2. Stratified K-Fold Cross-Validation**
In Stratified K-Fold, the folds are made in such a way that each fold contains approximately the same proportion of samples from each class as the original dataset. This is especially useful when dealing with imbalanced datasets, where some classes have far fewer samples than others.

**Scenario**: You are working with an imbalanced dataset (e.g., predicting customer churn, where most customers don’t churn), and you want to ensure that each fold contains a similar proportion of churn and non-churn customers.

In [25]:
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Use StratifiedKFold for cross-validation
skf = StratifiedKFold(n_splits=5)

cv_scores = cross_val_score(log_reg, X_train_scaled, y_train, cv=skf)
print("Stratified Cross-validation scores:", cv_scores)
print("Average CV score:", np.mean(cv_scores))


Stratified Cross-validation scores: [nan  1.  1.  1.  1.]
Average CV score: nan


1 fits failed out of a total of 5.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
1 fits failed with the following error:
Traceback (most recent call last):
  File "e:\Codes\.venv\Lib\site-packages\sklearn\model_selection\_validation.py", line 888, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "e:\Codes\.venv\Lib\site-packages\sklearn\base.py", line 1473, in wrapper
    return fit_method(estimator, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "e:\Codes\.venv\Lib\site-packages\sklearn\linear_model\_logistic.py", line 1301, in fit
    raise ValueError(
ValueError: This solver needs samples of at least 2 classes in the data, but the data contains only one class: np.int64(1)
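
Stratification cannot help when a class has fewer members than there are folds. One possible way around the failure, sketched here under the assumption that cross-validating on all ten rows is acceptable, is to use two folds (one non-buyer each) and to wrap the scaler and the model in a pipeline so that scaling is refit inside every training fold. The names `X_all`, `y_all`, and `model` are illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# The full dummy dataset: Age, Income
X_all = np.array([
    [22, 15000], [25, 18000], [47, 32000], [52, 40000], [46, 52000],
    [56, 55000], [56, 60000], [60, 62000], [62, 64000], [61, 65000],
], dtype=float)
y_all = np.array([0, 0, 1, 1, 1, 1, 1, 1, 1, 1])

# The pipeline refits the scaler on each training fold, avoiding leakage
model = make_pipeline(MinMaxScaler(), LogisticRegression())

# Only two non-buyers exist, so two folds is the most that can hold one each
skf = StratifiedKFold(n_splits=2, shuffle=True, random_state=42)

scores = cross_val_score(model, X_all, y_all, cv=skf)
print("Scores:", scores)
print("Any NaN:", np.isnan(scores).any())
```

Scores from a dataset this small remain very noisy; the sketch only shows how to obtain folds on which the model can actually be fitted.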



### 6. Clustering with K-Means
#### K-Means is an unsupervised learning algorithm used to find groups in data.

##### `Scenario`: You want to group customers into clusters based on their age and income.

In [26]:
from sklearn.cluster import KMeans

# Clustering customers
kmeans = KMeans(n_clusters=2)
kmeans.fit(X_train_scaled)

print("Cluster Centers:", kmeans.cluster_centers_)
print("Labels:", kmeans.labels_)


# Explanation:

# K-Means groups data points into clusters based on similarity.
# n_clusters=2 means we are looking for 2 groups in the data.

Cluster Centers: [[0.81196581 0.73666667]
 [0.         0.        ]]
Labels: [1 0 0 0 0 0 0]
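
K-Means starts from random centroids, so cluster numbering can change between runs. The sketch below assumes the full dummy dataset and a hypothetical new customer (age 30, income 20000); it fixes `random_state` for reproducibility and assigns the new customer to a cluster with `predict`.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import MinMaxScaler

# The full dummy dataset: Age, Income
X_all = np.array([
    [22, 15000], [25, 18000], [47, 32000], [52, 40000], [46, 52000],
    [56, 55000], [56, 60000], [60, 62000], [62, 64000], [61, 65000],
], dtype=float)

scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X_all)

# n_init=10 runs ten random initializations and keeps the best one
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
kmeans.fit(X_scaled)

# A hypothetical new customer must be scaled with the same fitted scaler
new_customer = scaler.transform(np.array([[30, 20000]], dtype=float))
new_label = kmeans.predict(new_customer)

print("Labels:", kmeans.labels_)
print("Inertia (within-cluster sum of squares):", kmeans.inertia_)
print("Cluster for new customer:", new_label)
```

The label values themselves (0 or 1) are arbitrary identifiers; only the grouping they describe is meaningful.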
