# MAR: Missing At Random

### Missingness depends on other observed variables, NOT on the missing value itself.
### It’s less restrictive than MCAR (Missing Completely At Random) but more manageable than MNAR (Missing Not At Random).

### Understand with examples:
##### Suppose income data is missing more often for younger people, students, or certain job roles (for example, younger respondents are less likely to report income). The missingness then depends on age/job (observed), not on the true income (unobserved).
##### MCAR: Missingness happens randomly during collection, transfer, or analysis — independent of the data itself.
##### MAR: Missingness is systematic: whether a value is reported depends on other observed variables (not on the missing value itself).
#### - In this type of missingness, the likelihood of a value being missing is influenced by other observed variables. Since the missing values are related to existing data, simple imputation methods (like mean, median, mode, and random sampling) are not ideal, as they ignore these **dependencies** and can distort the existing **relationships** in the dataset.
#### - So, always use the related observed variables when predicting a missing value.
##### For the Income example above, we can use Age as a predictor when imputing Income.
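A minimal sketch of the Income/Age example on synthetic data. The column names, the age cutoff of 30, and the missingness probabilities are assumptions chosen only for illustration. It shows that the missing rate differs by age group and that the mean of the observed incomes drifts away from the true mean.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 5_000

# Synthetic data (assumed for illustration): income rises with age
age = rng.integers(18, 65, size=n)
income = 20_000 + 1_000 * age + rng.normal(0, 5_000, size=n)

# MAR mechanism: the chance of Income being missing depends only on Age
p_missing = np.where(age < 30, 0.5, 0.1)
is_missing = rng.random(n) < p_missing

df = pd.DataFrame({"Age": age, "Income": income})
df.loc[is_missing, "Income"] = np.nan

# Missing rate per age group
df["AgeGroup"] = np.where(df["Age"] < 30, "under_30", "30_plus")
missing_rate = df["Income"].isna().groupby(df["AgeGroup"]).mean()

true_mean = income.mean()            # known only because the data is simulated
observed_mean = df["Income"].mean()  # mean of the reported incomes only

print(missing_rate)
print(f"true mean: {true_mean:.0f}, observed mean: {observed_mean:.0f}")
```

Because younger (lower-income) respondents are under-represented among the observed values, the observed mean sits above the true mean, and filling every gap with that single value would carry the same bias.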





## Imputations for MAR

### 1. Group-based imputation (Simple MAR baseline)
##### Ex: `df['Income'] = df.groupby('Age')['Income'].transform(lambda x: x.fillna(x.median()))`
##### Instead of imputing missing income with the overall median, you impute the median within each age group.
##### - Cons: Loss of variability, bias if a group is small, ignores other predictors, requires manual feature selection, not suitable for predictive modeling
##### - Applicable: Small data, clear groups, a clear categorical driver of missingness
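Grouping on raw `Age` makes every distinct age its own group, so groups can be tiny or entirely missing. A small sketch that bins ages first is shown here. The toy data, the bin edges, and the `AgeGroup` / `Income_imputed` column names are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Toy data (assumed): one missing Income in each age band
df = pd.DataFrame({
    "Age":    [22, 24, 27, 35, 38, 41, 55, 58, 61],
    "Income": [30_000, np.nan, 34_000, 52_000, np.nan, 56_000,
               70_000, 74_000, np.nan],
})

# Bin ages into a few groups; the edges here are arbitrary
df["AgeGroup"] = pd.cut(df["Age"], bins=[0, 30, 50, 120],
                        labels=["<=30", "31-50", ">50"])

# Median Income per group, aligned back to every row (NaNs are skipped)
group_median = df.groupby("AgeGroup", observed=True)["Income"].transform("median")

# Fill from the group median; fall back to the overall median
# if a whole group has no observed Income
df["Income_imputed"] = (
    df["Income"].fillna(group_median).fillna(df["Income"].median())
)

print(df)
```

Here the three gaps are filled with 32,000, 54,000 and 72,000, the medians of their own age bands.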

### 2. Regression Imputation
```python
from sklearn.linear_model import LinearRegression

# Predictors must not contain missing values themselves
predictors = ['Age', 'Experience']

# Fit only on rows where Income is observed
missing = df['Income'].isna()
train = df[~missing]

model = LinearRegression()
model.fit(train[predictors], train['Income'])

# Predict Income only for the rows where it is missing
if missing.any():
    df.loc[missing, 'Income'] = model.predict(df.loc[missing, predictors])
```

#### - Pros:
- Unlike mean/median imputation, this method preserves relationships between variables.
- Imputed values vary with the predictors instead of collapsing to a single value.
- More realistic if the predictors explain Income well.
- Can use several predictors at once, which suits datasets with many related variables.
#### - Cons:
- Assumes a linear relationship between predictors and Income.
- If the model is poorly fitted, imputations may be inaccurate.
- Sensitive to outliers in the training data.
- Imputed values lie exactly on the fitted line, so the spread of Income is still understated.
##### Group-median imputation is simple but flattens values within groups. Regression imputation is richer and uses multiple predictors, so imputed values differ from row to row, but its quality depends on the model.
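How much variability regression imputation retains can be checked on synthetic data. The sketch below uses assumed coefficients and noise levels. It compares the spread of the true (hidden) incomes with the spread of the predicted ones, and then adds noise drawn from the residual spread, a variant usually called stochastic regression imputation (not part of the snippet above).

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 2_000

# Synthetic data (assumed): Income is linear in Age/Experience plus noise
age = rng.uniform(20, 60, n)
experience = np.clip(age - 22 + rng.normal(0, 2, n), 0, None)
true_income = 1_000 * age + 500 * experience + rng.normal(0, 15_000, n)

df = pd.DataFrame({"Age": age, "Experience": experience, "Income": true_income})

# MAR: younger rows are more likely to have Income missing
missing = rng.random(n) < np.where(age < 30, 0.6, 0.1)
df.loc[missing, "Income"] = np.nan

predictors = ["Age", "Experience"]
observed = df.loc[~missing]
model = LinearRegression().fit(observed[predictors], observed["Income"])

# Deterministic imputation: the model's point predictions
predicted = model.predict(df.loc[missing, predictors])

# Stochastic variant: add noise with the same spread as the residuals
residual_std = (observed["Income"] - model.predict(observed[predictors])).std()
stochastic = predicted + rng.normal(0, residual_std, size=predicted.shape)

std_true = true_income[missing].std()   # known only because data is simulated
std_deterministic = predicted.std()
std_stochastic = stochastic.std()

print(f"true: {std_true:.0f}  deterministic: {std_deterministic:.0f}  "
      f"stochastic: {std_stochastic:.0f}")
```

The deterministic predictions are less spread out than the values they replace; adding residual noise brings the spread closer to the original at the cost of less precise individual values.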

### 3. KNN Imputation
#### Fill missing value using similar rows.
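```python
import pandas as pd
from sklearn.impute import KNNImputer

# numerical_features: list of numeric column names in df (defined elsewhere)
imputer = KNNImputer(n_neighbors=5)

# fit_transform returns a NumPy array, so wrap it back into a DataFrame
df_knn = pd.DataFrame(
    imputer.fit_transform(df[numerical_features]),
    columns=numerical_features,
    index=df.index,
)
```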
##### `KNNImputer(n_neighbors=5)` creates a K-Nearest Neighbors imputer.
##### For each missing value, it finds the 5 nearest rows (neighbors) that have that feature observed, measuring distance on the other numerical features.
##### The missing value is replaced with the average of those neighbors' values.
##### `fit_transform` returns a NumPy array with the missing values filled in. That is why we convert the result back into a DataFrame, so that column names and index align with the original dataset.
#### - Pros:
- Preserves relationships.
- Different missing values can be imputed differently depending on their neighbors.
- Unlike mean/median imputation, it maintains variability and local patterns.
#### - Cons:
- For large datasets, finding neighbors can be slow.
- If *n_neighbors* is too small, imputations are noisy; if it is too large, values converge toward the global mean.
- Only works with numerical features.
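The effect of *n_neighbors* can be seen on a tiny array. This is a sketch with made-up numbers; the helper name `impute_with_k` is hypothetical.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Four rows, two features; the second feature is missing in row 1
X = np.array([
    [1.0, 2.0],
    [2.0, np.nan],
    [3.0, 6.0],
    [10.0, 20.0],
])

def impute_with_k(k):
    """Return the value KNNImputer fills in for row 1, column 1."""
    return KNNImputer(n_neighbors=k).fit_transform(X)[1, 1]

# k=2: the two closest rows on the first feature are rows 0 and 2
value_k2 = impute_with_k(2)   # (2 + 6) / 2 = 4.0

# k=3: every row with an observed value is used, including the distant row 3
value_k3 = impute_with_k(3)   # (2 + 6 + 20) / 3

print(value_k2, value_k3)
```

With `k=3` the far-away row pulls the imputed value toward the overall column mean, which is the behavior described for large *n_neighbors*.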

### 4. Iterative Imputer - Multiple Imputation by Chained Equations (MICE)
#### Predict each missing feature using others
#### Iterate until convergence

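```python
import pandas as pd
# IterativeImputer is experimental; this import must come first to enable it
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.linear_model import LinearRegression

# numerical_features: list of numeric column names in df (defined elsewhere)
imputer = IterativeImputer(
    estimator=LinearRegression(),
    max_iter=10,
    random_state=42,
)

# fit_transform returns a NumPy array, so wrap it back into a DataFrame
df_mice = pd.DataFrame(
    imputer.fit_transform(df[numerical_features]),
    columns=numerical_features,
    index=df.index,
)
```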
##### `max_iter=10` cycles through all features with missing values up to 10 times, refining the imputations each round.
##### Setting random_state=42 fixes the random number generator's seed so the imputation is reproducible. It matters when the imputer has a random component (for example `sample_posterior=True`, `imputation_order='random'`, or a randomized estimator); without a fixed seed, such runs could produce slightly different imputed values on the same dataset.

#### - Pros:
- Multivariate approach: Unlike mean/median or group imputation, it uses all other features to predict missing values.
- Iterative refinement: Each round improves imputations by re‑using updated values.
- *Flexibility*: You can swap out LinearRegression() for other estimators (e.g., DecisionTreeRegressor, BayesianRidge).
- A well‑established statistical method.
#### - Cons:
- Iterative cycles can be slow for large datasets.
- Linear regression may not capture nonlinear relationships.
- Sensitive to collinearity/outliers: Regression can be distorted if predictors are highly correlated or extreme.
##### A more sophisticated alternative to mean/median or KNN imputation.
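As a sketch of the flexibility point, the estimator can be swapped for a tree when the relationship is not linear. The synthetic data, the quadratic Income curve, and `max_depth=4` are assumptions for illustration, not recommended settings.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
n = 300

# Synthetic data (assumed): Income grows non-linearly with Age
age = rng.uniform(20, 60, n)
experience = np.clip(age - 22 + rng.normal(0, 2, n), 0, None)
income = 20_000 + 40 * (age - 20) ** 2 + rng.normal(0, 3_000, n)
df = pd.DataFrame({"Age": age, "Experience": experience, "Income": income})

# MAR: younger rows are more likely to have Income missing
mask = rng.random(n) < np.where(age < 30, 0.4, 0.1)
df.loc[mask, "Income"] = np.nan

# Same imputer as before, with a tree instead of LinearRegression
imputer = IterativeImputer(
    estimator=DecisionTreeRegressor(max_depth=4, random_state=0),
    max_iter=10,
    random_state=0,
)
df_tree = pd.DataFrame(
    imputer.fit_transform(df),
    columns=df.columns,
    index=df.index,
)

print(df_tree.loc[mask].head())
```

Observed values pass through unchanged; only the rows flagged in `mask` receive model-based values.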

### 5. Multiple Imputation (Advanced, Statistical)

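##### Idea:

##### - Create multiple imputed datasets, each with different plausible values for the missing entries

##### - Run the same analysis on each dataset

##### - Combine the estimates (commonly pooled with Rubin's rules)

#### More common in biostatistics than in ML pipelines, because of its complexity and computational cost.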
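One way to sketch this with scikit-learn is to run `IterativeImputer` several times with `sample_posterior=True` and different seeds, then pool the results. The synthetic data, the choice of `m = 5`, and the use of the mean Income as the estimate of interest are assumptions for illustration; dedicated statistical packages offer fuller implementations.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.linear_model import BayesianRidge

rng = np.random.default_rng(0)
n = 1_000

# Synthetic data (assumed): Income depends on Age, missingness depends on Age
age = rng.uniform(20, 60, n)
income = 20_000 + 1_000 * age + rng.normal(0, 8_000, n)
df = pd.DataFrame({"Age": age, "Income": income})
mask = rng.random(n) < np.where(age < 30, 0.5, 0.1)
df.loc[mask, "Income"] = np.nan

m = 5  # number of imputed datasets
estimates, variances = [], []
for seed in range(m):
    # sample_posterior=True draws imputations instead of using point predictions
    imputer = IterativeImputer(
        estimator=BayesianRidge(),
        sample_posterior=True,
        max_iter=10,
        random_state=seed,
    )
    completed = pd.DataFrame(
        imputer.fit_transform(df), columns=df.columns, index=df.index
    )
    # Analysis step: estimate the mean Income and its sampling variance
    estimates.append(completed["Income"].mean())
    variances.append(completed["Income"].var(ddof=1) / n)

# Pooling with Rubin's rules
pooled_mean = np.mean(estimates)
within_var = np.mean(variances)            # average within-imputation variance
between_var = np.var(estimates, ddof=1)    # variance across imputations
total_var = within_var + (1 + 1 / m) * between_var
pooled_se = np.sqrt(total_var)

print(f"pooled mean: {pooled_mean:.0f}, pooled SE: {pooled_se:.0f}")
```

The between-imputation term is what a single imputed dataset cannot provide: it reflects the extra uncertainty that comes from not knowing the missing values.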


| Scenario                    | Best choice       |
| --------------------------- | ----------------- |
| Small data, clear groups    | Group median      |
| Linear relationships        | Regression        |
| Non-linear, local structure | KNN               |
| Complex dependencies        | Iterative Imputer |
