In [1]:
import numpy as np
import pandas as pd

# Data Mining Handbook

## Missing Data

### Removing Missing Data

 - Missing Completely at Random (MCAR): The probability that a value is missing is independent of both the observed and the unobserved data. Let $Y=(Y_{\text{obs}},Y_{\text{mis}})$ be the complete data matrix, and let $M$ be the missing-data indicator matrix, where $M_{ij}=1$ if $Y_{ij}$ is missing and $M_{ij}=0$ otherwise. Then $P(M\mid Y_{\text{obs}},Y_{\text{mis}})=P(M)$.
 - Sample size reduction $\implies$ lower statistical power, larger standard errors, and wider confidence intervals.
 - If missing values occur more in certain subgroups, dropping them alters group proportions.
 - If missingness is systematic, removing data can make estimators inconsistent or biased.
 - If different variables have missing values in different patterns, removing rows can distort correlations and dependencies.
 - In cases where missingness contains information (e.g., non-response in surveys indicating sensitivity), removing data discards valuable insights.
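
The contrast between MCAR and systematic missingness in the list above can be sketched with a small simulation. This is a minimal illustration, not a general result: the normal data, the 30% MCAR rate, and the logistic missingness model are all assumptions chosen for the example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible
n = 10_000
y = rng.normal(loc=0.0, scale=1.0, size=n)

# MCAR: every value has the same 30% chance of being missing,
# regardless of its own value.
mcar_mask = rng.random(n) < 0.3

# Systematic missingness: larger values are more likely to be missing.
# The logistic form is an illustrative assumption.
p_missing = 1.0 / (1.0 + np.exp(-2.0 * y))
systematic_mask = rng.random(n) < p_missing

# Listwise deletion: mark the masked values as NaN, then drop them.
y_mcar = pd.Series(np.where(mcar_mask, np.nan, y)).dropna()
y_sys = pd.Series(np.where(systematic_mask, np.nan, y)).dropna()

print(f"True mean:                      {y.mean():+.3f}")
print(f"Mean after MCAR deletion:       {y_mcar.mean():+.3f}")
print(f"Mean after systematic deletion: {y_sys.mean():+.3f}")
```

Under MCAR the mean of the remaining values stays close to the true mean and only the sample size shrinks. When missingness depends on the value itself, the remaining values are no longer representative and the estimate shifts downward.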



In [4]:
np.random.random(4)

array([0.06034738, 0.29081623, 0.2614664 , 0.22037083])

In [70]:
n = 100  # number of rows (samples)
p = 50   # number of columns (features)
nan_prop = np.random.random()  # per-cell probability of being missing

# Simulate MCAR missingness: each cell is set to NaN independently
# with probability nan_prop.
data = np.random.random(size=(n, p))
data[np.random.binomial(n=1, p=nan_prop, size=(n, p)) == 1] = np.nan
data = pd.DataFrame(data)

n_nan = data.isna().sum().sum()
data_not_all_nan = data.dropna(how='all')  # drop rows where every cell is NaN
data_no_nan = data.dropna(how='any')       # drop rows with at least one NaN

print(f"Number of NaN: {n_nan}")
print(f"Number of cells: {n * p}")
print(f"Probability of NaN: {nan_prop:.2f}")
print(f"Proportion of NaN: {n_nan / (n * p) * 100:.2f}%")
print(f"N after removing rows with all NaNs: {len(data_not_all_nan)}")
print(f"N after removing rows with any NaNs: {len(data_no_nan)}")


Number of NaN: 2939
Number of cells: 5000
Probability of NaN: 0.60
Proportion of NaN: 58.78%
N after removing rows with all NaNs: 100
N after removing rows with any NaNs: 0
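
How quickly row-wise deletion eats a dataset can be worked out directly. Assuming each cell is missing independently with probability $q$ (the same setup as the simulation above), a row with $p$ columns is complete with probability $(1-q)^p$. The helper name `expected_complete_rows` below is made up for this sketch.

```python
def expected_complete_rows(n_rows: int, n_cols: int, q: float) -> float:
    """Expected number of rows with no missing cells when each cell is
    independently missing with probability q."""
    return n_rows * (1.0 - q) ** n_cols

# Same shape as the simulated data: 100 rows, 50 columns.
for q in (0.01, 0.05, 0.60):
    rows = expected_complete_rows(100, 50, q)
    print(f"q = {q:.2f}: expected complete rows = {rows:.3g}")
```

Even a small per-cell missingness rate removes a large share of rows once there are many columns, which is why `dropna(how='any')` is rarely viable on wide tables.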




### Imputation

#### Constant
#### Mean
#### Median
#### Mode
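
The four single-value strategies listed in this section (constant, mean, median, mode) can be sketched together with pandas. The toy columns `age` and `city` and the constant `0.0` are illustrative assumptions.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age": [25.0, np.nan, 40.0, 31.0, np.nan],
    "city": ["Oslo", "Rome", None, "Rome", "Rome"],
})

# Constant: replace every missing value with a fixed placeholder.
constant_imp = df["age"].fillna(0.0)

# Mean and median: computed from the observed values only (NaN is skipped).
mean_imp = df["age"].fillna(df["age"].mean())
median_imp = df["age"].fillna(df["age"].median())

# Mode: the most frequent observed value; suitable for categorical columns.
# mode() can return several values on ties, so take the first one.
mode_imp = df["city"].fillna(df["city"].mode().iloc[0])

print(pd.DataFrame({
    "constant": constant_imp,
    "mean": mean_imp,
    "median": median_imp,
    "city_mode": mode_imp,
}))
```

All four shrink the variance of the imputed column, since every gap receives the same value. The median is less sensitive to outliers than the mean.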

### Forward/Backward Fill
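
Forward fill carries the last observed value forward and backward fill pulls the next observed value back. Both mainly make sense for ordered data such as time series. A minimal sketch with pandas follows; the daily index and the values are made up for illustration.

```python
import numpy as np
import pandas as pd

s = pd.Series(
    [1.0, np.nan, np.nan, 4.0, np.nan],
    index=pd.date_range("2024-01-01", periods=5, freq="D"),
)

forward = s.ffill()        # propagate the last observed value forward
backward = s.bfill()       # propagate the next observed value backward
capped = s.ffill(limit=1)  # fill at most one consecutive gap

print(pd.DataFrame({"raw": s, "ffill": forward, "bfill": backward, "ffill_limit1": capped}))
```

Backward fill leaves the final value missing because nothing is observed after it, and `limit` keeps a stale value from being carried across a long gap.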

### Interpolation
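
Interpolation estimates a missing value from its observed neighbors instead of copying one of them. The sketch below uses `Series.interpolate` from pandas and contrasts the default linear method, which treats points as equally spaced, with `method="index"`, which respects the spacing of the index. The example values are assumptions.

```python
import numpy as np
import pandas as pd

# Equally spaced observations: linear interpolation fills along a straight line.
s = pd.Series([1.0, np.nan, np.nan, 4.0, np.nan, 6.0])
linear = s.interpolate(method="linear")

# Unequally spaced observations: the index holds the measurement positions.
irregular = pd.Series([0.0, np.nan, 10.0], index=[0, 1, 10])
by_position = irregular.interpolate(method="linear")  # ignores the index spacing
by_index = irregular.interpolate(method="index")      # weights by index distance

print(linear.tolist())
print(by_position.tolist())
print(by_index.tolist())
```

When observations are unevenly spaced, the two methods can disagree substantially, so the choice should follow how the data were collected.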

### Multiple Imputation

### KNN Imputation
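
KNN imputation fills a missing cell with the average of that feature over the $k$ most similar rows. scikit-learn provides `KNNImputer` for this. The sketch below is a hand-rolled NumPy version for illustration only, not that implementation. The function name `knn_impute`, the distance (root mean squared difference over the features both rows have observed), and the toy matrix are all assumptions.

```python
import numpy as np

def knn_impute(X: np.ndarray, k: int = 2) -> np.ndarray:
    """Fill each NaN with the mean of that column over the k nearest rows."""
    X = np.asarray(X, dtype=float)
    filled = X.copy()
    missing = np.isnan(X)

    for i in np.flatnonzero(missing.any(axis=1)):
        # Compare row i with every row using only features observed in both.
        shared = ~missing[i] & ~missing
        diff = np.where(shared, X - X[i], 0.0)
        n_shared = shared.sum(axis=1)
        with np.errstate(divide="ignore", invalid="ignore"):
            dist = np.sqrt((diff ** 2).sum(axis=1) / n_shared)
        dist[n_shared == 0] = np.inf  # no common features: not comparable
        dist[i] = np.inf              # a row is never its own neighbor

        for j in np.flatnonzero(missing[i]):
            # Donors must be comparable to row i and have feature j observed.
            donors = np.flatnonzero(~missing[:, j] & np.isfinite(dist))
            if donors.size == 0:
                continue  # leave the cell missing if there is no donor
            nearest = donors[np.argsort(dist[donors])[:k]]
            filled[i, j] = X[nearest, j].mean()
    return filled

X = np.array([
    [1.0, 2.0, np.nan],
    [1.1, 2.1, 3.0],
    [0.9, 1.9, 5.0],
    [10.0, 10.0, 100.0],
])
print(knn_impute(X, k=2))
```

In the toy matrix, the first row is closest to the second and third rows, so its missing value becomes the average of their third-column values and the distant fourth row has no influence. Features on very different scales should be standardized first, otherwise the largest-scale feature dominates the distance.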

### Hot Deck Imputation

### Cold Deck Imputation

### Expectation-Maximization

### Matrix Factorization

### Data Augmentation

### Generative Models

### Bayesian Imputation

### Matrix Completion

### Random Forest Imputation

### Deep Learning Methods

### Latent Variable Methods

### Markov Chain Monte Carlo for Imputation

### Nonparametric Imputation


