## 🧭 EDA Setup: Variable Groups and Exploration Plan

Before starting exploratory data analysis (EDA), we organize variables into logical groups.
This helps us focus visualizations and understand feature behavior before encoding.

### Variable Groups
| Group | Description | Examples |
|--------|--------------|-----------|
| **Numeric** | Continuous or discrete numerical values used for correlation and distribution analysis | `demo_age`, `svc_tenure_in_months`, `svc_total_charges`, `st_cltv` |
| **Categorical (Yes/No)** | Binary categorical features to be encoded later | `demo_married`, `svc_paperless_billing`, `svc_online_backup` |
| **Categorical (Nominal / Ordinal)** | Multi-category variables to inspect for balance and relationships | `svc_contract`, `svc_payment_method`, `svc_offer` |
| **Geographical / Identifiers** | Location or customer identifiers; used for group-level analysis only | `loc_city`, `loc_zip_code`, `zip_population` |
| **Churn / Leakage Related** | Outcome and diagnostic columns, not for modeling | `st_churn_label`, `st_customer_status`, `st_churn_reason`, `st_churn_category` |
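
The grouping in the table can be kept in code so that later plotting loops iterate over one group at a time. The following is a minimal sketch: the `VARIABLE_GROUPS` dict and the `present_columns` helper are hypothetical names introduced here, and only the example columns from the table are listed.

```python
import pandas as pd

# Hypothetical grouping dict; column names are the examples from the table above.
VARIABLE_GROUPS = {
    "numeric": ["demo_age", "svc_tenure_in_months", "svc_total_charges", "st_cltv"],
    "binary": ["demo_married", "svc_paperless_billing", "svc_online_backup"],
    "categorical": ["svc_contract", "svc_payment_method", "svc_offer"],
    "geo": ["loc_city", "loc_zip_code", "zip_population"],
    "churn_leakage": [
        "st_churn_label", "st_customer_status", "st_churn_reason", "st_churn_category",
    ],
}


def present_columns(df: pd.DataFrame, groups: dict) -> dict:
    """Return, per group, only the columns that actually exist in df."""
    return {name: [c for c in cols if c in df.columns] for name, cols in groups.items()}


# Tiny stand-in frame (illustrative values, not the real dataset).
demo = pd.DataFrame({"demo_age": [34, 61], "svc_contract": ["One year", "Two year"]})
groups_found = present_columns(demo, VARIABLE_GROUPS)
```

Filtering against `df.columns` avoids a `KeyError` if a column was renamed or dropped upstream.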

### EDA Objectives
1. Inspect distributions and balance of categorical variables (Yes/No, offers, contract types).  
2. Explore numeric variable distributions and detect outliers.  
3. Examine churn patterns across demographics, services, and contract types.  
4. Identify correlations and redundancy among numeric variables.  
5. Retain interpretability: all encoding occurs **after** EDA.
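
As an illustration of objectives 1 and 3, the sketch below computes category balance and a per-group churn rate without overwriting the readable labels. The `toy` frame is a stand-in with made-up values; in the notebook the same calls would presumably run on `df_master`.

```python
import pandas as pd

# Toy stand-in for the real data; values are illustrative only.
toy = pd.DataFrame({
    "svc_contract": ["Month-to-month", "Month-to-month", "One year", "Two year"],
    "st_churn_label": ["Yes", "No", "No", "No"],
})

# Objective 1: balance of a categorical variable, as proportions.
contract_balance = toy["svc_contract"].value_counts(normalize=True)

# Objective 3: churn rate per contract type. The comparison yields a boolean
# Series, so the "Yes"/"No" labels in the frame stay untouched.
churn_rate = toy["st_churn_label"].eq("Yes").groupby(toy["svc_contract"]).mean()
```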


## End of EDA: Summary of Next Steps Before Modeling

Following exploratory data analysis, the next phase focuses on preparing the dataset for modeling.  
The goal is to create a clean, consistent, and leak-free feature set with appropriate encodings and scaling.  

---

### **Binary (Yes/No) Variables**
**Columns:**  
`demo_under_30`, `demo_senior_citizen`, `demo_married`, `demo_dependents`,  
`svc_referred_a_friend`, `svc_multiple_lines`,  
`svc_online_security`, `svc_online_backup`, `svc_device_protection_plan`,  
`svc_premium_tech_support`, `svc_streaming_tv`, `svc_streaming_movies`,  
`svc_streaming_music`, `svc_unlimited_data`, `svc_paperless_billing`

**Actions:**  
- Convert “Yes” → 1, “No” → 0.  
- Keep readable labels until final preprocessing (to preserve clarity during EDA).  
- Verify there are no inconsistent spellings (“yes”, “no”, “No internet”).  
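
One possible way to combine the spelling check and the 1/0 conversion at the final preprocessing step is sketched below. `encode_yes_no` is a hypothetical helper, and it assumes the columns hold strings.

```python
import pandas as pd


def encode_yes_no(series: pd.Series) -> pd.Series:
    """Map Yes/No strings to 1/0 after trimming whitespace and lowercasing.

    Anything that is not a yes/no variant (e.g. "No internet") becomes <NA>,
    so unexpected labels surface instead of being silently coerced.
    """
    cleaned = series.str.strip().str.lower()
    return cleaned.map({"yes": 1, "no": 0}).astype("Int64")


# Illustrative values covering the inconsistent spellings mentioned above.
raw = pd.Series(["Yes", "no ", "No internet", "YES"])
encoded = encode_yes_no(raw)

# Labels that did not match either variant and need a manual decision.
unexpected = raw[encoded.isna()].unique().tolist()
```

The nullable `Int64` dtype keeps the column integer-typed even when some labels could not be mapped.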

---

### **Nominal and Ordinal Categorical Variables**
**Columns:**  
- **Nominal:** `svc_offer`, `svc_payment_method`, `svc_internet_type`  
- **Ordinal:** `svc_contract`

**Actions:**  
- `svc_contract`: encode ordinally (Month-to-month = 0, One year = 1, Two year = 2).  
- `svc_offer` and `svc_payment_method`: one-hot encode or frequency encode depending on model choice.  
- `svc_internet_type`: normalize “No internet service” → “None”. Drop `svc_internet_service` (redundant).  
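
The encodings listed above could look roughly like this. The payment-method and internet-type values in the `toy` frame are assumptions for illustration, as is the `CONTRACT_ORDER` name; the contract labels follow the mapping stated in the list.

```python
import pandas as pd

# Toy stand-in for the real data; category values are illustrative only.
toy = pd.DataFrame({
    "svc_contract": ["Month-to-month", "Two year", "One year"],
    "svc_payment_method": ["Credit Card", "Bank Withdrawal", "Credit Card"],
    "svc_internet_type": ["Fiber Optic", "No internet service", "DSL"],
})

# Ordinal encoding using the order given above.
CONTRACT_ORDER = {"Month-to-month": 0, "One year": 1, "Two year": 2}
toy["svc_contract"] = toy["svc_contract"].map(CONTRACT_ORDER)

# Normalize the label; the string "None" (not Python's None) keeps it a real category.
toy["svc_internet_type"] = toy["svc_internet_type"].replace({"No internet service": "None"})

# Option A: one-hot encoding.
encoded = pd.get_dummies(toy, columns=["svc_payment_method"], dtype=int)

# Option B: frequency encoding (share of rows carrying each category).
freq = toy["svc_payment_method"].map(toy["svc_payment_method"].value_counts(normalize=True))
```

One-hot encoding tends to suit linear models, while frequency encoding keeps a single column, which can be convenient for tree-based models.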

---

### **Geographical / Identifier Variables**
**Columns:**  
`loc_city`, `loc_zip_code`, `zip_population`  

**Actions:**  
- Keep `zip_population` as a numeric contextual variable.  
- Drop `loc_city` and `loc_zip_code` before modeling (too high cardinality).  
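
Before dropping, the cardinality claim can be confirmed with `nunique`. A brief sketch on a made-up frame (the city and ZIP values are placeholders):

```python
import pandas as pd

# Toy stand-in; values are placeholders, not real locations.
toy = pd.DataFrame({
    "loc_city": ["CityA", "CityB", "CityC"],
    "loc_zip_code": ["00001", "00002", "00003"],
    "zip_population": [1200, 5400, 830],
})

# Number of distinct values per geographic column.
cardinality = toy[["loc_city", "loc_zip_code"]].nunique()

# Keep zip_population, drop the high-cardinality columns.
features = toy.drop(columns=["loc_city", "loc_zip_code"])
```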

---

### **Numerical Variables**
**Columns:**  
Includes `demo_age`, `svc_tenure_in_months`, `svc_total_charges`, `svc_total_revenue`, `st_cltv`, etc.

**Actions:**  
- Check for outliers and skewness.  
- Apply log transformation to skewed financial variables (e.g., `st_cltv`, `svc_total_revenue`, `svc_total_charges`).  
- Scale all numeric variables using `StandardScaler` or `MinMaxScaler`.  
- Investigate correlations; drop highly correlated features (e.g., total vs. component charges).  
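
The numeric steps above might be sketched as follows. The `toy` values and the 0.95 correlation threshold are assumptions chosen for illustration. In a real pipeline the scaler would normally be fit on the training split only, so that test-set statistics do not leak into the features.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy stand-in for the real data; values are illustrative only.
toy = pd.DataFrame({
    "svc_tenure_in_months": [1, 12, 24, 60],
    "svc_total_charges": [20.0, 300.0, 1500.0, 6000.0],
    "svc_total_revenue": [25.0, 350.0, 1700.0, 6500.0],
})

# Skewness per column, to decide which variables to transform.
skewness = toy.skew()

# log1p = log(1 + x), which stays defined when a value is zero.
for col in ["svc_total_charges", "svc_total_revenue"]:
    toy[col] = np.log1p(toy[col])

# Upper triangle of the absolute correlation matrix, so each pair is checked once.
corr = toy.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [c for c in upper.columns if (upper[c] > 0.95).any()]  # assumed threshold

# Standardize to zero mean and unit variance.
scaled = pd.DataFrame(
    StandardScaler().fit_transform(toy), columns=toy.columns, index=toy.index
)
```

Here `to_drop` flags the later column of each highly correlated pair; which member of a pair to keep remains a modeling decision.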

---

### **Leakage and Target Variables**
**Columns:**  
- **Leakage:** `st_customer_status`, `st_churn_category`, `st_churn_reason`, `st_churn_score`  
- **Target:** `st_churn_label`, `st_churn_value`

**Actions:**  
- Retain for EDA and churn analysis, but **drop all leakage variables before modeling**.  
- Keep only one target variable:  
  ```python
  # Encode the target as 1/0, then drop the duplicate numeric target column.
  df_master["st_churn_label"] = df_master["st_churn_label"].map({"Yes": 1, "No": 0})
  df_master.drop(columns=["st_churn_value"], inplace=True)
  ```
