# Introduction to scikit-learn:

`Scikit-Learn` (also known as sklearn) is an open-source Python library used for machine learning. It provides simple and efficient tools for data mining and analysis. It is built on top of NumPy, SciPy, and Matplotlib, making it a powerful library for implementing machine learning models.

<img src='images/sklearn1.png' width='500'>

### Installation Command:

In [None]:
! pip install scikit-learn

### Import and Verify Installation:

In [None]:
import sklearn
print(sklearn.__version__)

1.6.1


## Core Modules in Scikit-Learn

<img src='images/modules.png' width='700' height='600'>

### 1️⃣ Get a List of All Modules in `sklearn`:

In [3]:
import sklearn

# List all available modules inside sklearn
modules = [mod for mod in dir(sklearn) if not mod.startswith('_')]
print(modules)

['calibration', 'clone', 'cluster', 'compose', 'config_context', 'covariance', 'cross_decomposition', 'datasets', 'decomposition', 'discriminant_analysis', 'dummy', 'ensemble', 'exceptions', 'experimental', 'externals', 'feature_extraction', 'feature_selection', 'frozen', 'gaussian_process', 'get_config', 'impute', 'inspection', 'isotonic', 'kernel_approximation', 'kernel_ridge', 'linear_model', 'manifold', 'metrics', 'mixture', 'model_selection', 'multiclass', 'multioutput', 'naive_bayes', 'neighbors', 'neural_network', 'pipeline', 'preprocessing', 'random_projection', 'semi_supervised', 'set_config', 'show_versions', 'svm', 'tree']


Explanation:

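- dir(sklearn): Lists the public names exposed by the sklearn package. Most are submodules, but a few (for example clone, config_context, get_config, set_config, show_versions) are functions.
- not mod.startswith('_'): Filters out private/internal attributes.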
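
If only submodules are wanted, one option is to walk the package directory instead of calling `dir()`. The sketch below uses the standard-library `pkgutil` module; the variable name `submodules` is just illustrative, and the exact list depends on the installed scikit-learn version.

```python
import pkgutil

import sklearn

# Walk the sklearn package directory and keep the public (non-underscore) entries.
submodules = sorted(
    info.name
    for info in pkgutil.iter_modules(sklearn.__path__)
    if not info.name.startswith("_")
)
print(submodules)
```

Unlike `dir(sklearn)`, this list contains no top-level functions such as `clone`, because it is built from files and folders rather than from the package's exported names.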

### 2️⃣ Find All Classes Inside a Specific Module (Example: sklearn.linear_model):

In [4]:
import inspect
import sklearn.linear_model

# Get all classes inside sklearn.linear_model
classes = [name for name, obj in inspect.getmembers(sklearn.linear_model, inspect.isclass)]
print(classes)

['ARDRegression', 'BayesianRidge', 'ElasticNet', 'ElasticNetCV', 'GammaRegressor', 'HuberRegressor', 'Lars', 'LarsCV', 'Lasso', 'LassoCV', 'LassoLars', 'LassoLarsCV', 'LassoLarsIC', 'LinearRegression', 'LogisticRegression', 'LogisticRegressionCV', 'MultiTaskElasticNet', 'MultiTaskElasticNetCV', 'MultiTaskLasso', 'MultiTaskLassoCV', 'OrthogonalMatchingPursuit', 'OrthogonalMatchingPursuitCV', 'PassiveAggressiveClassifier', 'PassiveAggressiveRegressor', 'Perceptron', 'PoissonRegressor', 'QuantileRegressor', 'RANSACRegressor', 'Ridge', 'RidgeCV', 'RidgeClassifier', 'RidgeClassifierCV', 'SGDClassifier', 'SGDOneClassSVM', 'SGDRegressor', 'TheilSenRegressor', 'TweedieRegressor']


Explanation:

- inspect.getmembers(): Retrieves members of a module.
- inspect.isclass: Filters only the class names.

### 3️⃣ Find All Functions Inside a Specific Class (Example: LinearRegression in sklearn.linear_model)

In [5]:
import inspect
from sklearn.linear_model import LinearRegression

# Get all functions inside LinearRegression class
functions = [name for name, obj in inspect.getmembers(LinearRegression, inspect.isfunction)]
print(functions)


['__getstate__', '__init__', '__repr__', '__setstate__', '__sklearn_clone__', '__sklearn_tags__', '_check_feature_names', '_check_n_features', '_decision_function', '_get_doc_link', '_get_metadata_request', '_get_tags', '_more_tags', '_repr_html_inner', '_repr_mimebundle_', '_set_intercept', '_validate_data', '_validate_params', 'fit', 'get_metadata_routing', 'get_params', 'predict', 'score', 'set_fit_request', 'set_params', 'set_score_request']


Explanation:

- inspect.getmembers(): Retrieves members of the class.
- inspect.isfunction: Filters only function names inside the class.

## Step-by-Step Guide to Using Scikit-Learn

- Step 1: Importing required libraries (import sklearn)
- Step 2: Loading datasets (sklearn.datasets.load_*)
- Step 3: Splitting data (train_test_split())
- Step 4: Preprocessing data (sklearn.preprocessing)
- Step 5: Choosing a model (sklearn.linear_model, sklearn.tree, etc.)
- Step 6: Training the model (model.fit())
- Step 7: Making predictions (model.predict())
- Step 8: Evaluating the model (sklearn.metrics)
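
The eight steps above can be sketched end to end in a few lines. This is a minimal illustration on synthetic data from `make_regression` (an assumption made so the example runs without downloading anything), not the California Housing workflow used in the rest of this notebook.

```python
# Step 1: imports
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Step 2: load (here: generate) a dataset
X, y = make_regression(n_samples=500, n_features=8, noise=10.0, random_state=42)

# Step 3: split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Step 4: preprocess; the scaler is fitted on the training data only
scaler = MinMaxScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Steps 5 and 6: choose and train a model
model = LinearRegression().fit(X_train_scaled, y_train)

# Step 7: predict on unseen data
y_pred = model.predict(X_test_scaled)

# Step 8: evaluate
score = r2_score(y_test, y_pred)
print(f"R^2 on the test set: {score:.3f}")
```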

### Advanced Topics in Scikit-Learn
- Hyperparameter tuning: GridSearchCV, RandomizedSearchCV
- Model selection: cross-validation, model comparison
- Ensemble methods: Bagging, Boosting, Stacking, Cascading
- Feature selection methods: SelectKBest, RFE
- Handling imbalanced datasets: class_weight (built into many estimators), SMOTE (from the separate imbalanced-learn package)
- Custom transformers and pipelines
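
As a taste of two of these topics, here is a small sketch that combines a pipeline with hyperparameter tuning. The data is synthetic, and the choice of `Ridge`, the step names `scale` and `model`, and the `alpha` grid are illustrative assumptions, not recommendations.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=200, n_features=5, noise=5.0, random_state=0)

# Chain scaling and the model so both are refitted inside every CV fold.
pipe = Pipeline([("scale", StandardScaler()), ("model", Ridge())])

# "<step name>__<parameter>" addresses a parameter of a pipeline step.
param_grid = {"model__alpha": [0.01, 0.1, 1.0, 10.0]}

search = GridSearchCV(pipe, param_grid, cv=5, scoring="r2")
search.fit(X, y)

print(search.best_params_)
print(round(search.best_score_, 3))
```

Putting the scaler inside the pipeline means it is fitted only on the training part of each fold, which avoids leaking information from the validation part.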

## Step 1. Importing required libraries:

In [5]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import sklearn

## Step 2. Loading Dataset:

A dataset can be loaded in several ways:

- From a CSV file, using the `pd.read_csv()` function
- From the `sklearn` library (`sklearn.datasets`)
- From the `seaborn` library (`sns.load_dataset()`)
- Directly from an API (`requests` library)

In [6]:
# Load the dataset from sklearn library

# Get California Housing dataset
from sklearn.datasets import fetch_california_housing
housing = fetch_california_housing()
housing.keys()

dict_keys(['data', 'target', 'frame', 'target_names', 'feature_names', 'DESCR'])

In [7]:
type(housing)

sklearn.utils._bunch.Bunch

In [8]:
# get column names

housing['feature_names']

['MedInc',
 'HouseAge',
 'AveRooms',
 'AveBedrms',
 'Population',
 'AveOccup',
 'Latitude',
 'Longitude']

In [9]:
# get target column

housing['target']

array([4.526, 3.585, 3.521, ..., 0.923, 0.847, 0.894])

In [10]:
# Convert array to dataframe with feature names

housing_df = pd.DataFrame(housing["data"], columns=housing["feature_names"])
housing_df.sample(3)

|  | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude |
|---|---|---|---|---|---|---|---|---|
| 17674 | 4.3621 | 13.0 | 4.75283 | 1.096226 | 2094.0 | 3.950943 | 37.32 | -121.86 |
| 14465 | 8.1844 | 22.0 | 7.817073 | 1.046341 | 1083.0 | 2.641463 | 32.81 | -117.23 |
| 1190 | 2.425 | 17.0 | 5.485757 | 1.101949 | 1970.0 | 2.953523 | 39.4 | -121.46 |


In [11]:
# Add target column to dataframe

housing_df["MedHouseValue"] = housing["target"]
housing_df.sample(3)

|  | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude | MedHouseValue |
|---|---|---|---|---|---|---|---|---|---|
| 750 | 4.2733 | 22.0 | 5.224764 | 1.09825 | 1830.0 | 2.462988 | 37.67 | -122.06 | 1.807 |
| 10918 | 3.3125 | 38.0 | 4.531746 | 1.013889 | 2451.0 | 4.863095 | 33.73 | -117.86 | 1.591 |
| 6968 | 2.0147 | 41.0 | 3.605128 | 1.097436 | 1174.0 | 3.010256 | 33.98 | -118.05 | 1.375 |
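
Building the DataFrame by hand, as above, is not the only route: many loaders in `sklearn.datasets` accept `as_frame=True` and return a ready-made pandas DataFrame. The sketch below uses `load_diabetes`, which ships with scikit-learn, purely so that it runs offline; `fetch_california_housing` accepts the same argument.

```python
from sklearn.datasets import load_diabetes

# as_frame=True makes the Bunch carry pandas objects instead of NumPy arrays.
diabetes = load_diabetes(as_frame=True)

# .frame holds the features and the target together in one DataFrame.
diabetes_df = diabetes.frame
print(diabetes_df.shape)
print(diabetes_df.columns.tolist())
```

The target is stored in a column named `target`, so it can be separated later with `drop("target", axis=1)`.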


## Step 3. Train-test split:

The train_test_split() function in Scikit-Learn is used to split a dataset into training and testing sets.

Module: `sklearn.model_selection`  
Function: `train_test_split`

`Why Use train_test_split()?`

- Supports model validation: scoring on held-out data reveals overfitting (the split itself does not prevent it).
- Shows how well the model generalizes to unseen data.
- Provides a way to evaluate model performance before deployment.

In [None]:
# Dividing the dataset into independent and dependent features

X = housing_df.drop("MedHouseValue", axis=1)
y = housing_df["MedHouseValue"]

In [15]:
from sklearn.model_selection import train_test_split

# Split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train.shape, X_test.shape

((16512, 8), (4128, 8))

`1. test_size` → How Much Data Goes to Testing?
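- Sets how much data is held out for testing: a float between 0 and 1 is read as a fraction of the dataset, an int as an absolute number of samples.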

Common values:
- test_size=0.2 → 80% training, 20% testing
- test_size=0.3 → 70% training, 30% testing

`2. random_state`:

- If random_state is fixed, the same indices are selected for train/test every time.
- If random_state=None, indices will change on every execution.

`3. shuffle` → Should Data Be Shuffled?

- By default, data is shuffled before splitting.
- Set shuffle=False when the row order must be preserved (for example, time series); stratify must then be None.

`4. stratify` → Maintains Class Proportion

- Ensures the same distribution of classes in training and testing sets.
- Useful for imbalanced classification problems.
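
The effect of `stratify` and `shuffle` is easiest to see on a tiny example. The sketch below uses a made-up dataset with 90 samples of class 0 and 10 of class 1; the `*_toy` names are hypothetical and unrelated to the housing data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy imbalanced dataset: 90% class 0, 10% class 1.
X_toy = np.arange(100).reshape(-1, 1)
y_toy = np.array([0] * 90 + [1] * 10)

# stratify=y_toy keeps the 90/10 class ratio in both subsets.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)
print(y_tr.mean(), y_te.mean())  # share of class 1 in each subset

# shuffle=False keeps the original order: the last 20% of rows become the test set.
X_tr_ordered, X_te_ordered, _, _ = train_test_split(
    X_toy, y_toy, test_size=0.2, shuffle=False
)
print(X_te_ordered.ravel()[:3])
```

Without stratification, a small test set can easily end up with very few (or no) minority-class samples, which makes classification metrics unreliable.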

In [16]:
X_train.sample(3)

|  | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude |
|---|---|---|---|---|---|---|---|---|
| 14084 | 5.7768 | 49.0 | 6.275035 | 1.002821 | 1606.0 | 2.265162 | 32.77 | -117.1 |
| 9976 | 5.3698 | 20.0 | 5.340206 | 1.041237 | 589.0 | 3.036082 | 38.6 | -122.47 |
| 13907 | 2.4828 | 17.0 | 5.446618 | 1.056235 | 2677.0 | 2.181744 | 34.1 | -116.43 |


In [25]:
X_train.describe()

|  | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude |
|---|---|---|---|---|---|---|---|---|
| count | 16512.0 | 16512.0 | 16512.0 | 16512.0 | 16512.0 | 16512.0 | 16512.0 | 16512.0 |
| mean | 3.880754 | 28.608285 | 5.435235 | 1.096685 | 1426.453004 | 3.096961 | 35.643149 | -119.58229 |
| std | 1.904294 | 12.602499 | 2.387375 | 0.433215 | 1137.05638 | 11.578744 | 2.136665 | 2.005654 |
| min | 0.4999 | 1.0 | 0.888889 | 0.333333 | 3.0 | 0.692308 | 32.55 | -124.35 |
| 25% | 2.5667 | 18.0 | 4.452055 | 1.006508 | 789.0 | 2.428799 | 33.93 | -121.81 |
| 50% | 3.5458 | 29.0 | 5.235874 | 1.049286 | 1167.0 | 2.81724 | 34.26 | -118.51 |
| 75% | 4.773175 | 37.0 | 6.061037 | 1.100348 | 1726.0 | 3.28 | 37.72 | -118.01 |
| max | 15.0001 | 52.0 | 141.909091 | 25.636364 | 35682.0 | 1243.333333 | 41.95 | -114.31 |


## Step 4. Data preprocessing:

1. Complete Feature Engineering:
- Handling Missing values
- Encoding categorical variables
- Scaling/Normalizing data

2. Understanding data
- Exploratory Data Analysis (EDA)

3. Feature Selection

sklearn-doc: https://scikit-learn.org/stable/data_transforms.html

### List of Data Preprocessing Classes in sklearn.preprocessing and sklearn.impute
     
1️⃣ Scaling & Normalization     
- StandardScaler     
- MinMaxScaler     
- RobustScaler     
- Normalizer     
📌 Purpose: Scale or normalize numerical data to improve model performance.     
     
2️⃣ Encoding Categorical Data     
- LabelEncoder (intended for the target labels `y`, not for input features)
- OneHotEncoder     
- OrdinalEncoder     
📌 Purpose: Convert categorical variables into numerical format.     
     
3️⃣ Handling Missing Values     
- SimpleImputer (in `sklearn.impute`)
- KNNImputer (in `sklearn.impute`)
📌 Purpose: Fill missing values using different strategies (mean, median, nearest neighbors, etc.).
     
4️⃣ Feature Binarization     
- Binarizer     
📌 Purpose: Convert numerical data into binary format (0 or 1) based on a threshold.     
     
5️⃣ Polynomial Features     
- PolynomialFeatures
📌 Purpose: Generate polynomial and interaction features so that linear models can capture non-linear relationships.
     
6️⃣ Discretization     
- KBinsDiscretizer     
📌 Purpose: Convert continuous data into discrete bins (useful for categorization).    
    
7️⃣ Power Transformations    
- PowerTransformer    
- QuantileTransformer    
📌 Purpose: Apply transformations to make data more Gaussian-like.

In [None]:
# from sklearn.preprocessing import (
#     StandardScaler, MinMaxScaler, RobustScaler, Normalizer,
#     LabelEncoder, OneHotEncoder, OrdinalEncoder,
#     Binarizer, PolynomialFeatures,
#     KBinsDiscretizer, PowerTransformer, QuantileTransformer
# )

# from sklearn.impute import SimpleImputer, KNNImputer

# from sklearn.compose import ColumnTransformer

In [30]:
from sklearn.preprocessing import MinMaxScaler

In [31]:
X_train.columns.values

array(['MedInc', 'HouseAge', 'AveRooms', 'AveBedrms', 'Population',
       'AveOccup', 'Latitude', 'Longitude'], dtype=object)

Scikit-Learn provides various transformers that follow a common workflow:

1. Create an instance of the transformer
2. Fit the transformer to the data (fit())
3. Transform the data (transform())

In [None]:
# Create an instance of MinMaxScaler

n = MinMaxScaler()

`Created instance of the class`

- The MinMaxScaler() object is created, but no computations are performed yet.
- This object can now be used to fit and transform the data.

In [None]:
# Fit the scaler to the training data

n.fit(X_train)

`What Does fit() Do?`
- Computes the statistics the transformer needs (for MinMaxScaler, the per-feature minimum and maximum; for StandardScaler, the mean and standard deviation).
- Stores these statistics internally for later use.
- It does NOT modify the data yet.

In [None]:
# Transform the training and test data using the scaler fitted on the training set

new_X_train = n.transform(X_train)
new_X_test = n.transform(X_test)

`What Does transform() Do?`
- Uses the stored statistics (computed in fit()) to transform the data.
- Returns a transformed copy according to the transformation rule; the original data is left unchanged.
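
For MinMaxScaler with its default range of 0 to 1, the rule is `(x - min) / (max - min)` per feature, using the minimum and maximum learned in `fit()`. The sketch below checks this on a made-up array (the `toy_*` names are hypothetical) and shows that unseen data can land outside the 0 to 1 range.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

toy_train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 40.0]])
toy_test = np.array([[4.0, 25.0]])

scaler = MinMaxScaler()
scaler.fit(toy_train)

# Statistics stored by fit(): one minimum and one maximum per column.
print(scaler.data_min_)
print(scaler.data_max_)

# transform() applies (x - min) / (max - min) with those stored values.
scaled_train = scaler.transform(toy_train)
manual_train = (toy_train - scaler.data_min_) / (scaler.data_max_ - scaler.data_min_)
print(np.allclose(scaled_train, manual_train))

# The test row is scaled with the *training* min/max, so 4.0 maps above 1.
scaled_test = scaler.transform(toy_test)
print(scaled_test)
```

This is why the scaler is fitted on the training set only: the test set is treated as data the model has never seen, even if that means some scaled values fall outside the training range.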

In [35]:
new_X_train = pd.DataFrame(new_X_train, columns=X_train.columns)
new_X_test = pd.DataFrame(new_X_test, columns=X_test.columns)

In [36]:
new_X_train.sample(3)

|  | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude |
|---|---|---|---|---|---|---|---|---|
| 15787 | 0.187887 | 0.372549 | 0.028704 | 0.029471 | 0.033129 | 0.001327 | 0.648936 | 0.298805 |
| 5208 | 0.284486 | 0.27451 | 0.04525 | 0.034383 | 0.029541 | 0.002278 | 0.230851 | 0.618526 |
| 13806 | 0.444642 | 0.352941 | 0.046877 | 0.025622 | 0.084503 | 0.001919 | 0.497872 | 0.247012 |


In [24]:
new_X_train.describe()

|  | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude |
|---|---|---|---|---|---|---|---|---|
| count | 16512.0 | 16512.0 | 16512.0 | 16512.0 | 16512.0 | 16512.0 | 16512.0 | 16512.0 |
| mean | 0.233159 | 0.541339 | 0.032239 | 0.030168 | 0.039896 | 0.001935 | 0.329058 | 0.474871 |
| std | 0.131329 | 0.247108 | 0.016929 | 0.017121 | 0.031869 | 0.009318 | 0.227305 | 0.199766 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 0.142536 | 0.333333 | 0.025267 | 0.026604 | 0.02203 | 0.001397 | 0.146809 | 0.252988 |
| 50% | 0.210059 | 0.54902 | 0.030825 | 0.028295 | 0.032624 | 0.00171 | 0.181915 | 0.581673 |
| 75% | 0.294705 | 0.705882 | 0.036677 | 0.030313 | 0.048292 | 0.002082 | 0.55 | 0.631474 |
| max | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |


## Step 5. Choosing a model:

<img src='images/model.png' width='700'>

In [26]:
# importing the model class as required

from sklearn.linear_model import LinearRegression

In [27]:
# Create an instance of LinearRegression
lr = LinearRegression()

## Step 6. Train the model:

The `fit()` function is used to train a machine learning model on given data. It learns patterns from the dataset and adjusts internal parameters to optimize performance.

Parameters:

- X_train: Features (independent variables) in the training dataset.
- y_train: Target values (dependent variables) in the training dataset.
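
For a linear model, the "internal parameters" are one coefficient per feature plus an intercept, available after fitting as `coef_` and `intercept_`. The sketch below fits on synthetic, noise-free data generated from a known equation (an assumption made for illustration) so the learned values can be compared with the true ones.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data from a known rule: y = 3*x0 - 2*x1 + 5 (no noise).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 2))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + 5.0

demo_model = LinearRegression()
demo_model.fit(X_demo, y_demo)

# Attributes ending in "_" exist only after fit() has been called.
print(demo_model.coef_)       # close to [3, -2]
print(demo_model.intercept_)  # close to 5
```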

In [28]:
# Fit the model
lr.fit(new_X_train, y_train)

## Step 7. Make predictions:

Once the model is trained using fit(), the predict() function is used to make predictions on new, unseen data.

`Parameters`:
- X_test: The new input data for which predictions are required.

`Returns`:
- Predicted values (y_pred), based on the learned patterns from training.

`What Happens Internally?`
- Receives Input Data (X_test)
- Uses Trained Model Parameters (from fit())
- Computes Predictions (y_pred) using the trained model equation
- Returns the Predicted Values
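
For LinearRegression, the "trained model equation" is simply `X @ coef_ + intercept_`. The following sketch verifies this on a tiny hand-made dataset; the data and the `tiny_*` names are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Four points that follow y = 1 + 2*x0 + 3*x1 exactly.
tiny_X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
tiny_y = np.array([1.0, 3.0, 4.0, 6.0])

tiny_model = LinearRegression().fit(tiny_X, tiny_y)

new_points = np.array([[2.0, 2.0], [0.5, 0.0]])

# predict() and the explicit formula give the same numbers.
predicted = tiny_model.predict(new_points)
manual = new_points @ tiny_model.coef_ + tiny_model.intercept_
print(predicted)
print(np.allclose(predicted, manual))
```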

In [29]:
# Make predictions
y_pred = lr.predict(new_X_test)

## Step 8. Evaluating the Model:

- To measure the performance of the model.
- To compare different models and choose the best one.
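- To detect overfitting or underfitting (checking that the model generalizes well).

Note: Metric functions in `sklearn.metrics` take the true values first and the predictions second, for example `r2_score(y_test, y_pred)`, and return a score. Accuracy applies to classification only; a regression model such as the one trained above is scored with metrics like R² or MSE.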

### A. Classification Model Evaluation

Scikit-learn provides multiple functions and classes for classification evaluation:

#### sklearn.metrics module

Contains functions for various classification evaluation metrics like accuracy, precision, recall, F1-score, etc.

Key Classes/Functions:

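- `accuracy_score()`: Computes the fraction of correctly classified samples.
- `precision_score()`: Computes precision, the share of predicted positives that are truly positive.
- `recall_score()`: Computes recall, the share of actual positives that were found.
- `f1_score()`: Computes the F1-score, the harmonic mean of precision and recall.
- `roc_auc_score()`: Computes the area under the ROC curve from predicted scores or probabilities.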

#### Confusion Matrix

The confusion matrix is a table that describes the performance of a classification model on a set of test data for which the true values are known.

Function:

- `confusion_matrix(y_true, y_pred)`
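
A short worked example may help connect the confusion matrix to the metrics listed earlier. The labels below are made up for illustration; in scikit-learn's layout, rows are true classes and columns are predicted classes.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
)

# Hypothetical binary labels: 1 is the positive class.
y_true_cls = [0, 0, 1, 1, 1, 0]
y_pred_cls = [0, 1, 1, 1, 0, 0]

# Layout for binary problems: [[TN, FP], [FN, TP]]
cm = confusion_matrix(y_true_cls, y_pred_cls)
print(cm)

acc = accuracy_score(y_true_cls, y_pred_cls)    # (TN + TP) / total
prec = precision_score(y_true_cls, y_pred_cls)  # TP / (TP + FP)
rec = recall_score(y_true_cls, y_pred_cls)      # TP / (TP + FN)
f1 = f1_score(y_true_cls, y_pred_cls)
print(acc, prec, rec, f1)
```

Here TN = 2, FP = 1, FN = 1 and TP = 2, so accuracy, precision, recall and F1 all come out to 2/3.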

### B. Regression Model Evaluation

Scikit-learn also provides functions for regression evaluation:

#### sklearn.metrics module

Contains functions for various regression evaluation metrics like MAE, MSE, RMSE, and R².

Key Classes/Functions:

- `mean_absolute_error()`: Computes MAE for regression.
- `mean_squared_error()`: Computes MSE for regression.
- `root_mean_squared_error()`: Computes RMSE for regression (available from scikit-learn 1.4).
- `r2_score()`: Computes R² for regression.
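
The sketch below computes these metrics on four made-up values so the arithmetic can be followed by hand. RMSE is taken as the square root of MSE with NumPy, which works on any scikit-learn version.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Hypothetical true and predicted values; errors are -1, 0, +1, +1.
y_true_reg = [3.0, 5.0, 7.0, 9.0]
y_pred_reg = [2.0, 5.0, 8.0, 10.0]

mae = mean_absolute_error(y_true_reg, y_pred_reg)  # mean of |errors| = 3/4
mse = mean_squared_error(y_true_reg, y_pred_reg)   # mean of errors^2 = 3/4
rmse = np.sqrt(mse)                                # same units as the target
r2 = r2_score(y_true_reg, y_pred_reg)              # 1 - SS_res/SS_tot = 1 - 3/20

print(mae, mse, rmse, r2)
```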

#### Cross-Validation for Regression

Function:

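- `cross_val_score(model, X, y, cv=5, scoring='neg_mean_squared_error')` from `sklearn.model_selection`: returns one score per fold. The MSE is negated so that, as with every scorer, a higher value means a better model.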
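
A minimal sketch of this call on synthetic data (generated with `make_regression`, an assumption made so the example is self-contained) looks like this:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

X_cv, y_cv = make_regression(n_samples=200, n_features=4, noise=10.0, random_state=0)

# One negated-MSE score per fold; the model is refitted for each fold.
neg_mse = cross_val_score(
    LinearRegression(), X_cv, y_cv, cv=5, scoring="neg_mean_squared_error"
)

# Flip the sign to get ordinary (positive) MSE values back.
mse_per_fold = -neg_mse
print(mse_per_fold)
print(mse_per_fold.mean())
```

Averaging over folds gives a more stable estimate than a single train-test split, at the cost of fitting the model several times.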


In [37]:
from sklearn.metrics import r2_score

r2_score(y_test, y_pred)

0.5757877060324512