

# **📘 Pipeline Architecture & Documentation**

## **1. Objective**

The pipeline is designed to **automate the entire flow**: from raw data ingestion → preprocessing → feature engineering → model training → evaluation → deployment for real-time predictions.

---

## **2. Pipeline Overview (End-to-End Flow)**

```
Raw Data → Preprocessing & Cleaning → EDA & Visualization →
Feature Engineering → Feature Selection → Model Training & Evaluation →
Model Saving → Deployment (Streamlit App)
```

---

## **3. Detailed Pipeline Stages**

### **3.1 Data Ingestion**

* **Input**: Cryptocurrency dataset (`price, volume, market cap, % changes, liquidity ratio`)
* **Actions**:
  ✅ Load using `pandas.read_csv()`
  ✅ Initial shape & datatype checks (`df.info()`, `df.describe()`)
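
The ingestion step above can be sketched as follows. This is a minimal illustration, not the project's actual loader: the inline CSV text, its values, and the column names (`coin`, `24h_volume`, `mkt_cap`, and so on) are made-up stand-ins for the real dataset file, and `load_dataset` is a hypothetical helper name.

```python
import io

import pandas as pd

# Made-up sample standing in for the real CSV file (column names are assumed).
CSV_TEXT = """coin,price,24h_volume,mkt_cap,1h,24h,7d,date
CoinA,100.0,5000.0,200000.0,0.01,0.02,0.03,2022-03-16
CoinB,20.5,800.0,15000.0,-0.01,0.00,0.04,2022-03-16
CoinC,1.0,90000.0,120000.0,0.00,-0.02,0.01,2022-03-17
"""


def load_dataset(source):
    """Read the CSV and print the initial shape and datatype checks."""
    df = pd.read_csv(source)
    print("shape:", df.shape)
    df.info()              # column dtypes and non-null counts
    print(df.describe())   # summary statistics for numeric columns
    return df


# In the real pipeline `source` would be the path to the dataset file.
df = load_dataset(io.StringIO(CSV_TEXT))
```

At this point `date` is still a plain string column; it is converted in the preprocessing stage.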

---

### **3.2 Data Preprocessing**

| Step                     | Description                                                  | Tools/Techniques |
| ------------------------ | ------------------------------------------------------------ | ---------------- |
| **Null Handling**        | Dropped rows with missing values (`df.dropna()`)             | pandas           |
| **Data Type Conversion** | Converted `date` to `datetime64[ns]`                         | pandas           |
| **Date Features**        | Extracted year, month, and day; dropped later (low variance) | pandas           |
| **Outlier Handling**     | Applied a **3×IQR** rule to limit extreme values             | IQR Method       |
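
A minimal sketch of these preprocessing steps is shown below. The document does not say whether outliers were removed or clipped, so this example assumes removal of rows outside the fences `Q1 - 3×IQR` and `Q3 + 3×IQR`; `preprocess` and the toy data are hypothetical.

```python
import numpy as np
import pandas as pd


def preprocess(df, numeric_cols, k=3.0):
    """Drop nulls, parse dates, and keep rows inside the k*IQR fences.

    Assumption: outliers are removed rather than clipped.
    """
    out = df.dropna().copy()
    out["date"] = pd.to_datetime(out["date"])

    # Build one mask over all columns so each fence uses the same rows.
    keep = pd.Series(True, index=out.index)
    for col in numeric_cols:
        q1 = out[col].quantile(0.25)
        q3 = out[col].quantile(0.75)
        iqr = q3 - q1
        keep &= out[col].between(q1 - k * iqr, q3 + k * iqr)
    return out[keep].reset_index(drop=True)


# Toy data: one missing value and one extreme value.
raw = pd.DataFrame({
    "date": ["2022-03-10", "2022-03-11", "2022-03-12", "2022-03-13",
             "2022-03-14", "2022-03-15", "2022-03-16"],
    "price": [1.0, 2.0, 3.0, 4.0, 5.0, 1000.0, np.nan],
})
clean = preprocess(raw, numeric_cols=["price"])
print(clean)
```

If clipping is preferred over removal, `out[col].clip(lower, upper)` with the same fences keeps the row count unchanged.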

---

### **3.3 EDA (Exploratory Data Analysis)**

* **Distribution Plots**: Checked skewness of all numerical features.
* **Correlation Matrix**: Identified the features most strongly correlated with the target (liquidity ratio).
* **Outlier Visuals**: Boxplots before & after cleaning.
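
The numeric side of these checks can be sketched without plotting. The example below uses synthetic data and assumes the target column is named `liquidity_ratio`; it computes per-column skewness and ranks features by absolute correlation with the target.

```python
import numpy as np
import pandas as pd

# Synthetic data for illustration only (not the project's dataset).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "price": rng.lognormal(mean=0.0, sigma=1.0, size=500),  # right-skewed
    "24h": rng.normal(size=500),                            # roughly symmetric
})
df["liquidity_ratio"] = 0.5 * df["24h"] + rng.normal(scale=0.1, size=500)

# Skewness per numeric column; large positive values suggest a log transform.
skewness = df.skew()

# Correlation of every feature with the target, strongest first.
target_corr = (
    df.corr()["liquidity_ratio"]
    .drop("liquidity_ratio")
    .sort_values(key=np.abs, ascending=False)
)

print(skewness)
print(target_corr)
```

The same `df.corr()` result can be passed to a heatmap, and `df.boxplot()` gives the before and after outlier views.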

---

### **3.4 Feature Engineering**

* **Log Transform**: Reduced skewness using `np.log1p()`.
* **Derived Features**:
  ✅ `volume_to_market_cap = 24h_volume_log / mkt_cap_log`
  ✅ `price_to_liquidity = price_log / liquidity_ratio_log`
* **Justification**: Captures **liquidity efficiency** and **market stability**.
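
The transforms above can be sketched as follows. The raw column names (`price`, `24h_volume`, `mkt_cap`, `liquidity_ratio`) are inferred from the `_log` names in this section and are an assumption, as is the helper name `engineer_features`.

```python
import numpy as np
import pandas as pd


def engineer_features(df):
    """Add log-transformed columns and the two derived ratios."""
    out = df.copy()
    # log1p(x) = log(1 + x), which stays finite when x is 0.
    for col in ["price", "24h_volume", "mkt_cap", "liquidity_ratio"]:
        out[f"{col}_log"] = np.log1p(out[col])

    # Ratios of the log-transformed columns, as defined in this section.
    # Note: a zero denominator (raw value of 0) produces inf or NaN.
    out["volume_to_market_cap"] = out["24h_volume_log"] / out["mkt_cap_log"]
    out["price_to_liquidity"] = out["price_log"] / out["liquidity_ratio_log"]
    return out


# Toy values for illustration only.
sample = pd.DataFrame({
    "price": [10.0, 100.0],
    "24h_volume": [1000.0, 5000.0],
    "mkt_cap": [10000.0, 20000.0],
    "liquidity_ratio": [0.10, 0.25],
})
feats = engineer_features(sample)
print(feats[["volume_to_market_cap", "price_to_liquidity"]])
```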

---

### **3.5 Feature Selection**

1. **VIF Check**: Removed multicollinear features (`24h_volume_log`, `mkt_cap_log`).
2. **Correlation Check**: Kept only features with a meaningful correlation with the target.
3. **Final Features**:
   ✅ `volume_to_market_cap`, `1h`, `24h`, `7d`
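
For reference, the VIF check can be sketched with NumPy alone. The VIF of feature *j* is `1 / (1 - R²_j)`, where `R²_j` comes from regressing feature *j* on the others; this equals the *j*-th diagonal entry of the inverse correlation matrix. The data here is synthetic and `vif_table` is a hypothetical helper, so the numbers say nothing about the real dataset.

```python
import numpy as np
import pandas as pd


def vif_table(X):
    """Return the variance inflation factor of each column of X."""
    corr = np.corrcoef(X.to_numpy(dtype=float), rowvar=False)
    vif = np.diag(np.linalg.inv(corr))
    return pd.Series(vif, index=X.columns, name="VIF")


# Synthetic example: two nearly collinear columns and one independent column.
rng = np.random.default_rng(1)
base = rng.normal(size=200)
X = pd.DataFrame({
    "24h_volume_log": base,
    "mkt_cap_log": base + rng.normal(scale=0.1, size=200),
    "1h": rng.normal(size=200),
})
vif = vif_table(X)
print(vif)
```

A common rule of thumb treats a VIF above 5 or 10 as a sign of problematic multicollinearity; the threshold used in this project is not stated.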

---

### **3.6 Train-Test Split & Scaling**

* **80:20 split** (`train_test_split`)
* **StandardScaler** applied to the selected features to standardize inputs (zero mean, unit variance).
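
A minimal sketch of this step on synthetic data is shown below. It assumes the scaler is fitted on the training split only and then reused on the test split, which avoids leaking test statistics into training; the `random_state` value is arbitrary.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the four selected features and the target.
rng = np.random.default_rng(42)
X = rng.normal(loc=5.0, scale=2.0, size=(100, 4))
y = rng.normal(size=100)

# 80:20 split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit on the training split only, then apply the same transform to the test split.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.mean(axis=0).round(6), X_train_scaled.std(axis=0).round(6))
```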

---

### **3.7 Model Training & Evaluation**

| Model                   | R² Score  | MAE       | RMSE      |
| ----------------------- | --------- | --------- | --------- |
| **Linear Regression**   | 0.509     | 0.029     | 0.040     |
| **KNN**                 | 0.824     | 0.014     | 0.024     |
| ✅ **Gradient Boosting** | **0.959** | **0.007** | **0.011** |

* **Chosen Model**: **Gradient Boosting Regressor** (highest R², lowest MAE and RMSE of the three models, minimal overfitting).
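
The comparison in the table can be reproduced in outline as follows. This sketch trains the same three model families on synthetic data with default hyperparameters, so its scores will not match the table; the `evaluate` helper and the data-generating formula are illustrative assumptions.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor


def evaluate(model, X_train, X_test, y_train, y_test):
    """Fit the model and return R², MAE, and RMSE on the test split."""
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    return {
        "r2": r2_score(y_test, pred),
        "mae": mean_absolute_error(y_test, pred),
        "rmse": float(np.sqrt(mean_squared_error(y_test, pred))),
    }


# Synthetic nonlinear regression problem (illustration only).
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] ** 2 + rng.normal(scale=0.1, size=300)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

models = {
    "Linear Regression": LinearRegression(),
    "KNN": KNeighborsRegressor(),
    "Gradient Boosting": GradientBoostingRegressor(random_state=0),
}
results = {
    name: evaluate(m, X_train, X_test, y_train, y_test)
    for name, m in models.items()
}
for name, scores in results.items():
    print(name, scores)
```

Comparing train and test scores for each model in the same way is one simple check for overfitting.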

---

### **3.8 Model Serialization**

* Saved using **joblib**:

```
gradient_boost_model.pkl  
scaler.pkl  
features.pkl
```

---

### **3.9 Deployment Pipeline (Streamlit App)**

1. **User Inputs**: `1h, 24h, 7d, Volume, Market Cap`
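2. **Feature Engineering**:
   `volume_to_market_cap = volume / market_cap`
   (Section 3.4 builds this ratio from log-transformed values; the app must compute it the same way the training data did.)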
3. **Scaling & Prediction**:

   * Apply saved `scaler.pkl`
   * Predict liquidity ratio (`gradient_boost_model.pkl`)
4. **Classification**:

   ```
   Low (<0.05)
   Medium (0.05–0.15)
   High (>0.15)
   ```
5. **Output on UI**: Real-time predicted ratio + classification.
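
The inference logic behind the app can be sketched independently of Streamlit. The function names below are hypothetical, the feature order follows the final feature list in Section 3.5, and the ratio uses the raw `volume / market_cap` form given in step 2. Values exactly on 0.05 or 0.15 are assumed to fall in the Medium band.

```python
def classify_liquidity(ratio):
    """Map a predicted liquidity ratio to Low, Medium, or High."""
    if ratio < 0.05:
        return "Low"
    if ratio <= 0.15:
        return "Medium"
    return "High"


def build_feature_row(change_1h, change_24h, change_7d, volume, market_cap):
    """Build one input row in the assumed training order."""
    if market_cap <= 0:
        raise ValueError("market_cap must be positive")
    return [volume / market_cap, change_1h, change_24h, change_7d]


def predict_and_classify(model, scaler, change_1h, change_24h, change_7d,
                         volume, market_cap):
    """Scale the inputs, predict the ratio, and attach its class label.

    `model` and `scaler` are the objects loaded from the saved .pkl files.
    """
    row = build_feature_row(change_1h, change_24h, change_7d, volume, market_cap)
    scaled = scaler.transform([row])
    ratio = float(model.predict(scaled)[0])
    return ratio, classify_liquidity(ratio)


print(classify_liquidity(0.03), classify_liquidity(0.10), classify_liquidity(0.20))
```

In the app, the five arguments would come from input widgets such as `st.number_input`, and the returned pair would be displayed on the page.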

---

## **4. Pipeline Flowchart (Architecture Diagram)**

```
        ┌────────────────┐
        │  Raw Dataset   │
        └──────┬─────────┘
               │
    ┌──────────▼──────────┐
    │ Data Preprocessing  │
    │ (Nulls, Outliers)   │
    └─────────┬───────────┘
              │
    ┌─────────▼───────────┐
    │  EDA & Feature Eng. │
    │ (Log, Derived Feats)│
    └─────────┬───────────┘
              │
    ┌─────────▼───────────┐
    │ Feature Selection   │
    │ (VIF, Correlation)  │
    └─────────┬───────────┘
              │
    ┌─────────▼───────────┐
    │ Train/Test Split    │
    │ + Scaling           │
    └─────────┬───────────┘
              │
    ┌─────────▼───────────┐
    │  Model Training     │
    │ (Gradient Boosting) │
    └─────────┬───────────┘
              │
    ┌─────────▼───────────┐
    │ Model Saving (pkl)  │
    └─────────┬───────────┘
              │
    ┌─────────▼───────────┐
    │ Streamlit App (UI)  │
    │ Predict & Classify  │
    └─────────────────────┘
```

---

## **5. Key Highlights**

✅ **End-to-end automated pipeline**
✅ **Outlier & skewness handled** → better model stability
✅ **Reproducible inference** (joblib saves the model, scaler, and feature list)
✅ **Interactive real-time prediction app**

---

