# Guide to Creating an ML Model in scikit-learn

This end-to-end scikit-learn guide runs from environment setup through deployment and monitoring, and includes an EDA step that visualizes the **correlation between every numeric feature and the target**.

***

## 1. Set Up Your Environment

- Create and activate a virtual environment (optional but recommended):

```bash
python -m venv venv
source venv/bin/activate        # macOS / Linux
# venv\Scripts\activate         # Windows: run this instead of the `source` line
```

- Install the core stack:

```bash
pip install scikit-learn pandas numpy matplotlib seaborn jupyterlab
```

- **scikit-learn** for the ML algorithms
- **pandas / numpy** for data handling
- **matplotlib / seaborn** for visualizations
- **jupyterlab** for an interactive notebook interface

***

## 2. Import the Libraries

In a new notebook or `.py` script:

```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.metrics import mean_squared_error, classification_report
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
```

*Pick additional models or metrics as required.*

***

## 3. Load the Dataset

```python
df = pd.read_csv("your_data.csv")      # or pd.read_excel / read_sql / fetch_openml
print(df.head())   # in a notebook, display(df.head()) renders a richer table
df.info()          # prints its summary directly and returns None
```

Checklist:

- Check for missing values (`df.isna().sum()`).
- Verify data types (`df.dtypes`).
- Ensure a **target** column exists.
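
The checklist can be scripted as a quick sanity check. The sketch below uses a small made-up DataFrame in place of `your_data.csv`; the column names (`age`, `city`, `target`) are illustrative assumptions, not part of any real dataset.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real dataset; all column names are hypothetical.
df = pd.DataFrame({
    "age": [25, 32, np.nan, 41],
    "city": ["Oslo", "Lima", "Oslo", None],
    "target": [0, 1, 0, 1],
})

# 1. Missing values per column (show only columns that have any)
missing = df.isna().sum()
print(missing[missing > 0])

# 2. Data types
print(df.dtypes)

# 3. Confirm the target column exists before going further
target_col = "target"
has_target = target_col in df.columns
print(f"Target column present: {has_target}")
```

If the target column is missing or misnamed, it is cheaper to find out here than after the train/test split fails.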

***

## 4. Exploratory Data Analysis (EDA) & Visualization

1. **Univariate plots**

```python
df['target'].value_counts().plot(kind="bar")         # classification
sns.histplot(df['feature'], kde=True)                # regression / classification
```

2. **Feature-to-target relationships**

```python
sns.boxplot(x='target', y='numeric_feature', data=df)
sns.scatterplot(x='feature1', y='target', data=df)
```

3. **Pairwise relationships & full correlation heatmap**

```python
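# Cap the sample at the number of rows so small datasets do not raise an error
sns.pairplot(df.sample(min(500, len(df)), random_state=42), hue='target')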
sns.heatmap(df.corr(numeric_only=True), annot=True, fmt=".2f", cmap="coolwarm")
```

4. **Correlation of each numeric feature with the target**

```python
# 'target' must be numeric here; encode class labels as numbers first if needed
numeric_df  = df.select_dtypes(include="number")
corr        = numeric_df.corr()
target_corr = corr['target'].drop('target').sort_values(ascending=False)

plt.figure(figsize=(6, 4))
sns.barplot(x=target_corr.values, y=target_corr.index, orient='h')
plt.title("Correlation with Target")
plt.xlabel("Pearson r")
plt.tight_layout()
plt.show()
```

*High |r| values often signal strong linear predictors; very low |r| values may contribute little, although Pearson's r does not capture nonlinear relationships. High off-diagonal values in the step 3 heatmap hint at multicollinearity.*

5. **Missing-value heatmap**

```python
sns.heatmap(df.isna(), cbar=False)
```


*Insights from EDA guide preprocessing choices and model selection.*
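
To make the multicollinearity hint concrete, one option is to list feature pairs whose absolute correlation exceeds a cutoff. This is a minimal sketch on synthetic data; the column names and the 0.9 threshold are arbitrary assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = rng.normal(size=200)
df = pd.DataFrame({
    "a": a,
    "b": a * 2 + rng.normal(scale=0.1, size=200),  # nearly a rescaled copy of "a"
    "c": rng.normal(size=200),                      # independent noise
})

corr = df.corr().abs()
cols = corr.columns
threshold = 0.9   # illustrative cutoff, not a universal rule

# Walk the upper triangle so each pair is reported once and the diagonal is skipped.
high_pairs = [
    (cols[i], cols[j], round(float(corr.iloc[i, j]), 3))
    for i in range(len(cols))
    for j in range(i + 1, len(cols))
    if corr.iloc[i, j] > threshold
]
print(high_pairs)
```

Pairs that show up here are candidates for dropping one feature or combining the two, depending on the model.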

***

## 5. Split Data into Train/Test Sets

```python
X = df.drop(columns="target")
y = df["target"]

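# Heuristic: a target with fewer than 20 distinct values is treated as
# classification and stratified; adjust the cutoff for your data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y if y.nunique() < 20 else None,
    random_state=42
)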
```

- Use `stratify` for classification to preserve class proportions.
- Keep the test set untouched until final evaluation.
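
The effect of `stratify` is easiest to see on an imbalanced target. The sketch below uses synthetic labels (90% class 0, 10% class 1, an assumption made purely for illustration) and compares the positive rate in each split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic imbalanced labels, for illustration only.
y = np.array([0] * 900 + [1] * 100)
X = np.arange(len(y)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# With stratification both splits keep roughly the original 10% positive rate.
train_pos_rate = y_tr.mean()
test_pos_rate = y_te.mean()
print(f"train: {train_pos_rate:.3f}, test: {test_pos_rate:.3f}")
```

Without `stratify`, a small test set can end up with noticeably more or fewer minority-class rows than the training set, which distorts the evaluation.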

***

## 6. Preprocessing Pipeline

1. **Identify column types**

```python
num_cols = X.select_dtypes(include="number").columns   # all numeric dtypes, not only int64/float64
cat_cols = X.select_dtypes(include=["object", "category"]).columns
```

2. **Create transformers**

```python
numeric_transformer = Pipeline([
    ("scaler", StandardScaler())
])

categorical_transformer = Pipeline([
    ("onehot", OneHotEncoder(handle_unknown="ignore"))
])
```

3. **Combine into `ColumnTransformer`**

```python
preprocessor = ColumnTransformer([
    ("num", numeric_transformer, num_cols),
    ("cat", categorical_transformer, cat_cols)
])
```


***

## 7. Build the Model Pipeline

```python
model = RandomForestClassifier(random_state=42)          # or RandomForestRegressor

pipe = Pipeline([
    ("prep", preprocessor),
    ("model", model)
])
```

Advantages:

- End-to-end reproducibility
- Prevents data leakage (scaler fits only on training folds)
- Seamless hyperparameter tuning with `GridSearchCV`
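
The leakage point can be illustrated with cross-validation: because the scaler lives inside the pipeline, each fold refits it on that fold's training portion only. The following is a sketch on synthetic data from `make_classification`, with a scaler-only preprocessing step standing in for the full `preprocessor`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data for illustration only.
X, y = make_classification(n_samples=300, n_features=8, random_state=42)

pipe = Pipeline([
    ("prep", StandardScaler()),
    ("model", RandomForestClassifier(n_estimators=50, random_state=42)),
])

# cross_val_score clones the whole pipeline per fold, so the scaler never
# sees the held-out part of that fold while fitting.
scores = cross_val_score(pipe, X, y, cv=5, scoring="accuracy")
print(f"CV accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```

Scaling the full dataset before cross-validating would instead let statistics from the validation rows leak into training.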

***

## 8. Hyperparameter Tuning (Optional but Recommended)

```python
param_grid = {
    "model__n_estimators": [100, 300],
    "model__max_depth": [None, 10, 20],
    "model__min_samples_split": [2, 5]
}

grid = GridSearchCV(
    pipe, param_grid,
    cv=5, scoring="accuracy", n_jobs=-1, verbose=2
)
grid.fit(X_train, y_train)

print("Best params:", grid.best_params_)
best_model = grid.best_estimator_
```

- For regression, replace `accuracy` with `neg_root_mean_squared_error` (or another regression metric); for imbalanced classification, consider a metric such as `f1` or `roc_auc`.
- `best_model` now contains preprocessing + tuned estimator.
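
For the regression case, a sketch of the same search might look like the following. The data is synthetic and the grid is kept deliberately small so it runs quickly; both are assumptions for illustration. Note that scikit-learn reports this metric negated, because `GridSearchCV` always maximizes the score.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data for illustration only.
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=42)

pipe = Pipeline([
    ("prep", StandardScaler()),
    ("model", RandomForestRegressor(random_state=42)),
])

param_grid = {
    "model__n_estimators": [20, 50],
    "model__max_depth": [None, 5],
}

grid = GridSearchCV(pipe, param_grid, cv=3, scoring="neg_root_mean_squared_error")
grid.fit(X, y)

# Flip the sign to read the score as an ordinary (positive) RMSE.
best_rmse = -grid.best_score_
print("Best params:", grid.best_params_)
print(f"Best CV RMSE: {best_rmse:.2f}")
```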

***

## 9. Evaluate on the Test Set

```python
y_pred = best_model.predict(X_test)

# Classification
print(classification_report(y_test, y_pred))

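# Regression (use this instead of the classification report)
# The `squared=False` argument was deprecated in scikit-learn 1.4 and later
# removed, so take the square root of the MSE directly.
# rmse = mean_squared_error(y_test, y_pred) ** 0.5
# print(f"Test RMSE: {rmse:.2f}")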
```

- Inspect confusion matrices or residual plots for deeper insight.
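
As a minimal sketch of that deeper inspection, the snippet below computes a confusion matrix and a set of residuals numerically. The label and prediction arrays are hypothetical stand-ins for `y_test` and `y_pred`.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Classification: hypothetical true labels and predictions.
y_true = np.array([0, 0, 1, 1, 1, 0])
y_hat = np.array([0, 1, 1, 1, 0, 0])

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_hat)
print(cm)

# Regression: residual = actual - predicted; a mean far from zero suggests bias.
y_true_reg = np.array([3.0, 5.0, 7.5])
y_hat_reg = np.array([2.5, 5.5, 7.0])
residuals = y_true_reg - y_hat_reg
print(residuals, residuals.mean())
```

For a visual version, `sklearn.metrics.ConfusionMatrixDisplay.from_predictions` or a scatter plot of residuals against predictions serves the same purpose.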

***

## 10. Persist the Model

```python
import joblib
joblib.dump(best_model, "model.joblib")          # Save
# later:
loaded_model = joblib.load("model.joblib")       # Load
```

This stores the preprocessing pipeline *and* the trained estimator in one file, so deployment needs no separate preprocessing code.
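
A quick way to confirm the saved file behaves like the original is a round-trip check. This sketch trains a small pipeline on synthetic data and writes to a temporary directory; both are assumptions made so the example is self-contained.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Small synthetic model, for illustration only.
X, y = make_classification(n_samples=100, n_features=4, random_state=42)
pipe = Pipeline([
    ("prep", StandardScaler()),
    ("model", RandomForestClassifier(n_estimators=20, random_state=42)),
]).fit(X, y)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "model.joblib")
    joblib.dump(pipe, path)      # save
    loaded = joblib.load(path)   # load

# The reloaded pipeline should reproduce the original predictions exactly.
same_predictions = np.array_equal(pipe.predict(X), loaded.predict(X))
print("Round trip preserved predictions:", same_predictions)
```

Because joblib files are pickle-based, load them only from trusted sources, and preferably with the same scikit-learn version that wrote them.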

***

## 11. Deploy or Integrate

- **Batch predictions:** load the joblib file in a scheduled script.
- **Real-time API:** wrap the model in a FastAPI or Flask endpoint.
- **Edge / mobile:** convert to ONNX using `skl2onnx` if needed.
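
For the batch option, a scheduled script could be as small as the sketch below. The function name `predict_batch`, the file names, and the toy model are all hypothetical; the demo section only exists to make the example runnable end to end.

```python
import os
import tempfile

import joblib
import pandas as pd
from sklearn.ensemble import RandomForestClassifier


def predict_batch(model_path, input_csv, output_csv):
    """Load a saved model, score every row of input_csv, and write the result."""
    model = joblib.load(model_path)
    batch = pd.read_csv(input_csv)
    scored = batch.copy()
    scored["prediction"] = model.predict(batch)
    scored.to_csv(output_csv, index=False)
    return scored


# Demo with a throwaway model and temporary files (illustration only).
with tempfile.TemporaryDirectory() as tmp:
    model_path = os.path.join(tmp, "model.joblib")
    input_csv = os.path.join(tmp, "new_rows.csv")
    output_csv = os.path.join(tmp, "scored.csv")

    features = pd.DataFrame({"x1": [0.0, 1.0, 2.0, 3.0], "x2": [1.0, 0.0, 1.0, 0.0]})
    toy_model = RandomForestClassifier(n_estimators=10, random_state=42)
    joblib.dump(toy_model.fit(features, [0, 0, 1, 1]), model_path)
    features.to_csv(input_csv, index=False)

    scored = predict_batch(model_path, input_csv, output_csv)
    output_written = os.path.exists(output_csv)

print(scored)
```

In a real job the input file must contain the same feature columns the pipeline was trained on, since the saved preprocessing selects columns by name.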

***

## 12. Maintain & Monitor

- Track data and concept drift (e.g., Evidently AI or custom monitoring).
- Periodically retrain with fresh data.
- Log model predictions & feedback for continuous improvement.
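
As one example of custom drift monitoring, a two-sample Kolmogorov-Smirnov test can compare a feature's training-time distribution with recent production values. This is a sketch under stated assumptions: the data is simulated, the helper `has_drifted` is hypothetical, and the 0.05 significance level is a conventional but arbitrary choice.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(42)
reference = rng.normal(loc=0.0, scale=1.0, size=1000)        # values seen at training time
current_same = rng.normal(loc=0.0, scale=1.0, size=1000)     # same distribution
current_shifted = rng.normal(loc=1.0, scale=1.0, size=1000)  # mean shifted by one std


def has_drifted(ref, cur, alpha=0.05):
    """Flag drift when the KS test rejects 'same distribution' at level alpha."""
    return bool(ks_2samp(ref, cur).pvalue < alpha)


shifted_flag = has_drifted(reference, current_shifted)
same_flag = has_drifted(reference, current_same)
print("shifted sample flagged:", shifted_flag)
print("unshifted sample flagged:", same_flag)
```

This only covers one numeric feature at a time; when testing many features, expect some false alarms at any fixed significance level and consider correcting for multiple comparisons.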
