

# 🔎 Scikit-learn

---

## 1️⃣ What is Scikit-learn?

* **Definition**: Scikit-learn is an **open-source Python library for Machine Learning (ML)**.
* It provides **tools to implement ML algorithms** and also handle the entire pipeline:

  * Loading and preparing data
  * Training models
  * Testing predictions
  * Evaluating performance
  * Improving models
* Built on **NumPy** (numerical computing) and **SciPy** (scientific computing); it works alongside **Matplotlib** for visualization.

👉 **Analogy**: Think of scikit-learn as a **kitchen for data science** – it gives you the ingredients (datasets), utensils (preprocessing), recipes (models), and tasting methods (evaluation).

---

## 2️⃣ Key Features

1. **Wide Algorithm Coverage**

   * **Supervised learning** → regression & classification (Linear Regression, Logistic Regression, Decision Trees, Random Forests, SVM, KNN, Naïve Bayes).
   * **Unsupervised learning** → clustering (KMeans, DBSCAN) & dimensionality reduction (PCA).

2. **Preprocessing Tools**

   * Scaling → `StandardScaler`, `MinMaxScaler`
   * Encoding categorical data → `OneHotEncoder`, `OrdinalEncoder` (input features); `LabelEncoder` (target labels)
   * Handling missing values → `SimpleImputer`

3. **Model Selection**

   * Train/test split → `train_test_split`
   * Cross-validation → `cross_val_score`
   * Hyperparameter tuning → `GridSearchCV`, `RandomizedSearchCV`

4. **Evaluation Metrics**

   * Classification → Accuracy, Precision, Recall, F1, ROC-AUC
   * Regression → MSE, MAE, RMSE, R²

5. **Pipelines**

   * Combine multiple steps (scaling + model) into one workflow.
   * Avoids data leakage and keeps workflow clean.
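A minimal sketch of the pipeline idea, assuming the built-in iris dataset and a KNN classifier (both chosen here only for illustration). The scaler and the model are chained into one estimator, so during cross-validation the scaler is re-fitted on each training fold only:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Chain scaling and the classifier into a single estimator.
# Inside cross-validation, the scaler only ever sees the training fold,
# which is what prevents leakage from the held-out fold.
pipe = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=3))

# 5-fold cross-validation; one accuracy value per fold.
scores = cross_val_score(pipe, X, y, cv=5)
print("Accuracy per fold:", scores)
print("Mean CV accuracy:", scores.mean())
```

The pipeline behaves like any other model: it has `.fit()`, `.predict()`, and `.score()`, so it can be passed to `cross_val_score` or `GridSearchCV` unchanged.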

👉 **Why this matters**: Instead of learning a new syntax for each algorithm, scikit-learn makes everything consistent with `.fit()`, `.predict()`, `.score()`.

---

## 3️⃣ When to Use

* ✅ **Small to Medium datasets** → Works best with data that fits in memory (up to a few million rows).
* ✅ **Structured/tabular data** → CSVs, SQL tables (e.g., student scores, bank transactions).
* ✅ **Learning/teaching ML basics** → Easy for beginners.
* ✅ **Quick prototyping** → You can test many algorithms fast before moving to advanced frameworks.
* ❌ Not for **Big Data** → If the dataset does not fit in memory, use Spark MLlib or Dask.
* ❌ Not for **Deep Learning** → For images, speech, or large-scale NLP, use TensorFlow or PyTorch.

👉 **Rule of thumb**: Use sklearn when you have **structured data** and want to apply **classical ML**.

---

## 4️⃣ Structure of Scikit-learn

Scikit-learn is divided into **modules**:

1. **Datasets**

   * Built-in datasets: `load_iris()`, `load_digits()`, `fetch_california_housing()`.
   * Also supports loading your own datasets (CSV, Pandas).

2. **Preprocessing**

   * Scaling, normalization, encoding categorical features, handling missing values.

3. **Models (Estimators)**

   * Classification: Logistic Regression, SVM, Decision Trees, Random Forests, KNN.
   * Regression: Linear Regression, Ridge, Lasso.
   * Clustering: KMeans, DBSCAN.
   * Dimensionality Reduction: PCA.

4. **Model Selection**

   * Splitting data into train/test.
   * Cross-validation for reliable evaluation.
   * Hyperparameter tuning (GridSearchCV, RandomizedSearchCV).

5. **Evaluation Metrics**

   * Classification metrics: accuracy, precision, recall, F1, ROC-AUC.
   * Regression metrics: MSE, RMSE, R².

👉 **Consistency principle**: all models follow the same steps – `.fit()`, `.predict()`, `.score()`.
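The consistency principle can be sketched as a loop over different classifiers. This is an illustrative example on the iris dataset; the three models and their settings are arbitrary choices, not recommendations:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Three very different algorithms, one shared interface.
models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(random_state=42),
    "knn": KNeighborsClassifier(n_neighbors=3),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)                       # learn from training data
    results[name] = model.score(X_test, y_test)       # accuracy on the test set
    print(f"{name}: {results[name]:.3f}")
```

Swapping one algorithm for another only changes the line that creates the model; the training and scoring code stays the same.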

---

## 5️⃣ Example Categories

* **Classification (Discrete outputs)**
  Example: Predict whether an email is spam (Yes/No).
  Algorithms: Logistic Regression, SVM, Decision Tree, Random Forest.

* **Regression (Continuous outputs)**
  Example: Predict house prices (₹).
  Algorithms: Linear Regression, Ridge, Lasso, Random Forest Regressor.

* **Clustering (Grouping data without labels)**
  Example: Grouping customers by shopping behavior.
  Algorithms: KMeans, DBSCAN, Agglomerative clustering.

* **Dimensionality Reduction**
  Example: Reduce 100 features to 2 for visualization.
  Algorithm: PCA.
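The two unsupervised categories can be sketched on the iris data with the labels ignored. Iris has only 4 features rather than 100, so this is a small-scale illustration of the same idea:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

# Load features only; the labels are deliberately ignored (unsupervised).
X, _ = load_iris(return_X_y=True)

# Clustering: group the samples into 3 clusters.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
cluster_ids = kmeans.fit_predict(X)
print("Cluster of first 5 samples:", cluster_ids[:5])

# Dimensionality reduction: project 4 features down to 2 for plotting.
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)
print("Reduced shape:", X_2d.shape)
print("Variance kept:", pca.explained_variance_ratio_.sum())
```

Note that clustering uses `.fit_predict()` and PCA uses `.fit_transform()`: there is no `y` to pass because there are no labels.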

---

## 6️⃣ Why it’s Useful

* **Consistent API** → Learn once, apply everywhere.
* **Integration** → Works with NumPy, Pandas, Matplotlib.
* **Efficient** → Optimized implementations (many written in Cython) run quickly on a laptop for small-to-medium datasets.
* **Educational** → Excellent for teaching ML concepts.
* **Industry Prototyping** → Quick testing of models before using bigger frameworks.

👉 **Analogy**: Like a **toolbox** – hammer, screwdriver, wrench – everything you need for building ML projects.

---

## 7️⃣ Limitations

* 🚫 **Not for Big Data** → cannot scale to terabytes (use Spark/Dask).
* 🚫 **Not for Deep Learning** → offers only basic multi-layer perceptrons (`sklearn.neural_network`), with no GPU training or deep architectures (use TensorFlow/PyTorch).
* 🚫 **Structured Data Focused** → struggles with raw images, audio, or text (needs preprocessing first).
* 🚫 **Limited GPU support** → sklearn runs mainly on CPU.

---

## 8️⃣ Steps in an ML Project (using sklearn)

## **Step 1: Problem Definition**

* First, clearly **state the problem** you want to solve.
* Example questions:

  * *“Can I predict if a student passes/fails based on study hours?”* (classification)
  * *“Can I predict the price of a house?”* (regression)
  * *“Can I group customers into categories without labels?”* (clustering)
* 👉 **Tip for students**: Always write your problem in plain English before touching code.

---

## **Step 2: Collect Data**

* Data is the “fuel” of ML. Without data, there’s no learning.
* Options:

  * Use **built-in datasets** in sklearn (like `iris`, `digits`, `breast_cancer`).
  * Use **CSV/Excel files** from Kaggle, UCI ML Repository, or your own collection.
  * Use APIs/databases (SQL, MongoDB).
* 👉 Good data matters more than complex models.

---

## **Step 3: Explore and Understand the Data**

* Look at the dataset to **understand what’s inside**.
* Steps:

  * View rows, columns, data types.
  * Check missing values.
  * See distributions of features (e.g., histogram of study hours).
  * Visualize relationships (scatter plots, bar charts).
* 👉 Goal: *Get a sense of the story the data is telling you.*
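A first look at a dataset might be sketched as follows. This version uses only NumPy on the built-in iris data; with your own CSV files, Pandas and Matplotlib would be the usual tools for the same checks:

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data, iris.target

# Rows, columns, and data type.
print("Shape:", X.shape)
print("Feature names:", iris.feature_names)
print("dtype:", X.dtype)

# Missing values per column.
missing_per_column = np.isnan(X).sum(axis=0)
print("Missing values:", missing_per_column)

# Basic distribution summary of each feature.
print("Mean:", X.mean(axis=0).round(2))
print("Std: ", X.std(axis=0).round(2))
print("Min: ", X.min(axis=0))
print("Max: ", X.max(axis=0))

# How many samples belong to each class?
class_counts = np.bincount(y)
print("Samples per class:", dict(zip(iris.target_names, class_counts)))
```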

---

## **Step 4: Preprocess the Data**

* Raw data is messy. We prepare it so models can understand it.
* Common preprocessing tasks:

  * **Handle missing values** → fill with mean/median, or drop rows.
  * **Encode categorical variables** → e.g., convert “Male/Female” into 0/1.
  * **Scale features** → some algorithms (SVM, KNN) work best when values are on the same scale.
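  * **Split data** → separate into a `training set` (to learn) and a `test set` (to evaluate). Do this before fitting imputers, encoders, or scalers, so that nothing is learned from the test set.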
* 👉 Think of preprocessing as *washing vegetables before cooking*.
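The preprocessing tasks above can be sketched on a tiny, made-up dataset. The study hours, the `gender` column, and the labels are hypothetical values invented for this example. The data is split first, and every transformer is fitted on the training rows only:

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data: study hours (one value missing) and a category.
hours = np.array([[1.0], [2.0], [np.nan], [4.0], [5.0], [6.0]])
gender = np.array([["Male"], ["Female"], ["Female"], ["Male"], ["Female"], ["Male"]])
y = np.array([0, 0, 0, 1, 1, 1])

# Split first: 4 training rows, 2 test rows.
hours_train, hours_test, gender_train, gender_test, y_train, y_test = train_test_split(
    hours, gender, y, test_size=2, random_state=0
)

imputer = SimpleImputer(strategy="mean")          # fill missing hours with the mean
scaler = StandardScaler()                         # zero mean, unit variance
encoder = OneHotEncoder(handle_unknown="ignore")  # one 0/1 column per category

# fit_transform on training data, transform only on test data.
hours_train_p = scaler.fit_transform(imputer.fit_transform(hours_train))
hours_test_p = scaler.transform(imputer.transform(hours_test))
gender_train_p = encoder.fit_transform(gender_train).toarray()
gender_test_p = encoder.transform(gender_test).toarray()

# Combine the processed columns into the final feature matrices.
X_train = np.hstack([hours_train_p, gender_train_p])
X_test = np.hstack([hours_test_p, gender_test_p])
print(X_train.shape, X_test.shape)
```

In a real project the same steps are usually wrapped in a `Pipeline` (with `ColumnTransformer` for mixed column types) instead of being applied by hand.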

---

## **Step 5: Choose a Model (Algorithm)**

* Now pick a machine learning method that matches your problem:

  * **Classification** → Logistic Regression, Decision Tree, Random Forest, SVM.
  * **Regression** → Linear Regression, Ridge, Lasso, Random Forest Regressor.
  * **Clustering** → KMeans, DBSCAN.
  * **Dimensionality Reduction** → PCA.
* 👉 sklearn makes this easy because every model follows the same pattern:
  `.fit()` → `.predict()` → `.score()`
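One way to choose between candidate models is to compare them with cross-validation. The sketch below assumes a regression problem on the built-in diabetes dataset; the `alpha` values are arbitrary starting points, not tuned settings:

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.model_selection import cross_val_score

X, y = load_diabetes(return_X_y=True)

candidates = {
    "linear": LinearRegression(),
    "ridge": Ridge(alpha=1.0),
    "lasso": Lasso(alpha=0.1),
}

# Mean R² over 5 folds for each candidate.
cv_r2 = {}
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="r2")
    cv_r2[name] = scores.mean()
    print(f"{name}: mean R² = {cv_r2[name]:.3f}")

# Pick the candidate with the highest mean score.
best = max(cv_r2, key=cv_r2.get)
print("Best candidate:", best)
```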

---

## **Step 6: Train the Model**

* **Training** = feeding the algorithm with data so it “learns” patterns.
* Example: The model sees study hours and learns the trend with pass/fail.
* In sklearn:

  * `.fit(X_train, y_train)` → model learns from training data.
* 👉 Analogy: Like teaching a student with class notes.

---

## **Step 7: Make Predictions**

* After training, ask the model to predict for **new/unseen data**.
* In sklearn:

  * `.predict(X_test)` → model makes guesses.
* Example: “If a new student studies 4 hours, will they pass or fail?”
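Training and prediction for the study-hours example might look like the sketch below. The hours and pass/fail labels are hypothetical numbers made up for illustration, and Logistic Regression is just one reasonable choice of classifier:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical training data: hours studied and pass (1) / fail (0).
hours = np.array([[0.5], [1.0], [1.5], [2.0], [2.5], [3.0], [3.5], [4.5], [5.0], [5.5]])
passed = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])

# Step 6: train the model.
model = LogisticRegression()
model.fit(hours, passed)

# Step 7: predict for a new student who studies 4 hours.
new_student = np.array([[4.0]])
prediction = model.predict(new_student)[0]             # predicted class (0 or 1)
probability = model.predict_proba(new_student)[0, 1]   # estimated P(pass)

print("Predicted class:", prediction)
print("Estimated probability of passing:", round(probability, 2))
```

`.predict()` returns the class label, while `.predict_proba()` shows how confident the model is in that label.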

---

## **Step 8: Evaluate the Model**

* We must **check how good the model is**; without evaluation, we cannot tell whether it does better than guessing.
* Evaluation depends on the type of problem:

  * **Classification** → accuracy, precision, recall, F1-score, ROC-AUC.
  * **Regression** → Mean Squared Error (MSE), Root Mean Squared Error (RMSE), R² score.
* 👉 This step tells us how reliable the model is.
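The metrics can be sketched on small hand-written examples, which makes the numbers easy to verify by hand. The true and predicted values below are invented for illustration:

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score,
    f1_score,
    mean_absolute_error,
    mean_squared_error,
    precision_score,
    r2_score,
    recall_score,
)

# Classification: 3 true positives, 3 true negatives, 1 false positive, 1 false negative.
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

acc = accuracy_score(y_true, y_pred)     # 6 correct out of 8 -> 0.75
prec = precision_score(y_true, y_pred)   # 3 / (3 + 1) -> 0.75
rec = recall_score(y_true, y_pred)       # 3 / (3 + 1) -> 0.75
f1 = f1_score(y_true, y_pred)            # harmonic mean of 0.75 and 0.75 -> 0.75
print(acc, prec, rec, f1)

# Regression: errors are 1, 0, -1, 0.
y_true_r = [3.0, 5.0, 7.0, 9.0]
y_pred_r = [2.0, 5.0, 8.0, 9.0]

mse = mean_squared_error(y_true_r, y_pred_r)   # (1 + 0 + 1 + 0) / 4 -> 0.5
rmse = np.sqrt(mse)                            # square root of the MSE
mae = mean_absolute_error(y_true_r, y_pred_r)  # (1 + 0 + 1 + 0) / 4 -> 0.5
r2 = r2_score(y_true_r, y_pred_r)              # 1 - 2 / 20 -> 0.9
print(mse, rmse, mae, r2)
```

ROC-AUC is left out here because it needs predicted scores or probabilities rather than hard class labels.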

---

## **Step 9: Improve the Model**

* If performance is low, try:

  * Different preprocessing (better scaling, encoding, handling outliers).
  * A different algorithm.
  * Hyperparameter tuning (using `GridSearchCV` or `RandomizedSearchCV`).
  * More data or better quality data.
* 👉 Analogy: Like preparing for an exam – if your first score is low, improve your study method.
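Hyperparameter tuning with `GridSearchCV` can be sketched as follows, assuming a KNN classifier on the iris data. The parameter grid is an arbitrary example, not a recommended search space:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Every combination in the grid is evaluated with 5-fold cross-validation
# on the training data only; the test set stays untouched.
param_grid = {
    "n_neighbors": [1, 3, 5, 7, 9],
    "weights": ["uniform", "distance"],
}
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)

print("Best parameters:", search.best_params_)
print("Best CV accuracy:", search.best_score_)
print("Test accuracy:", search.score(X_test, y_test))
```

After fitting, `search` itself acts as the best model refitted on the whole training set, so `.predict()` and `.score()` can be called on it directly.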

---

## **Step 10: Deploy or Use the Model**

* Once satisfied, we **use the model in real life**:

  * Save it (`joblib`, `pickle`) and load it later.
  * Deploy on websites/apps (e.g., predicting customer churn).
* 👉 This is when ML moves from a “notebook experiment” to solving **real-world problems**.
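Saving and reloading a trained model with `joblib` might look like this sketch. The file name `knn_iris.joblib` is a hypothetical choice, and a temporary directory is used only so the example cleans up after itself:

```python
import os
import tempfile

import joblib
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
model = KNeighborsClassifier(n_neighbors=3).fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "knn_iris.joblib")  # hypothetical file name
    joblib.dump(model, model_path)       # save the fitted model to disk
    restored = joblib.load(model_path)   # load it back, e.g. in another program

# The restored model gives the same predictions as the original.
same_predictions = bool((restored.predict(X) == model.predict(X)).all())
print("Same predictions after reload:", same_predictions)
```

Only load model files from sources you trust, and load them with the same scikit-learn version that saved them, since these files are pickle-based.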

---

# 🔑 Summary of the Workflow

1. Define the problem
2. Collect data
3. Explore data
4. Preprocess data
5. Choose a model
6. Train the model
7. Make predictions
8. Evaluate model
9. Improve model
10. Deploy/use model

👉 Most sklearn projects (small or big) **follow these steps**, often looping back from evaluation to preprocessing or model choice.

---


1. **Import libraries**

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
```


```python
import sklearn
print(dir(sklearn))
```

Output:

```
['calibration', 'clone', 'cluster', 'compose', 'config_context', 'covariance', 'cross_decomposition', 'datasets', 'decomposition', 'discriminant_analysis', 'dummy', 'ensemble', 'exceptions', 'experimental', 'externals', 'feature_extraction', 'feature_selection', 'frozen', 'gaussian_process', 'get_config', 'impute', 'inspection', 'isotonic', 'kernel_approximation', 'kernel_ridge', 'linear_model', 'manifold', 'metrics', 'mixture', 'model_selection', 'multiclass', 'multioutput', 'naive_bayes', 'neighbors', 'neural_network', 'pipeline', 'preprocessing', 'random_projection', 'semi_supervised', 'set_config', 'show_versions', 'svm', 'tree']
```


2. **Load dataset**

```python
iris = load_iris()
X, y = iris.data, iris.target
```


3. **Split into train & test**

```python
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
```


4. **Choose model**

```python
model = KNeighborsClassifier(n_neighbors=3)
```


5. **Train model**

```python
model.fit(X_train, y_train)
```


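Output (parameters of the fitted `KNeighborsClassifier`):

| Parameter | Value |
|---|---|
| `n_neighbors` | `3` |
| `weights` | `'uniform'` |
| `algorithm` | `'auto'` |
| `leaf_size` | `30` |
| `p` | `2` |
| `metric` | `'minkowski'` |
| `metric_params` | `None` |
| `n_jobs` | `None` |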


6. **Make predictions**

```python
y_pred = model.predict(X_test)
```


7. **Evaluate performance**

```python
print("Accuracy:", accuracy_score(y_test, y_pred))
```

Output:

```
Accuracy: 1.0
```



---

✅ **Summary:**
Scikit-learn is the **go-to library for classical machine learning** in Python.
It covers everything from **data preprocessing → model building → evaluation → improvement**.
It’s best for **structured/tabular data** and **small-to-medium datasets**.

---