# fancyimpute [(link)](https://github.com/iskandr/fancyimpute)
  * Uses machine learning algorithms to impute missing values.
  * fancyimpute provides several imputation methods:
    * `SimpleFill`: Replaces missing entries with the mean or median of each column.
    * `KNN`: Nearest-neighbor imputation that weights samples using the mean squared difference on features for which two rows both have observed data.
    * `IterativeImputer`: A strategy for imputing missing values by modeling each feature with missing values as a function of other features in a round-robin fashion.
    * `SoftImpute`: Matrix completion by iterative soft thresholding of SVD decompositions.
    * `IterativeSVD`: Matrix completion by iterative low-rank SVD decomposition.
    * `MatrixFactorization`: Direct factorization of the incomplete matrix into low-rank U and V, with an L1 sparsity penalty on the elements of U and an L2 penalty on the elements of V.
    * `NuclearNormMinimization`: Simple implementation of Exact Matrix Completion via Convex Optimization by Emmanuel Candes and Benjamin Recht using cvxpy.
    * `BiScaler`: Iterative estimation of row/column means and standard deviations to get a doubly normalized matrix.
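
To make the `SoftImpute` entry above more concrete, here is a minimal NumPy sketch of matrix completion by iterative soft thresholding of the SVD. It is an illustration under simplifying assumptions, not fancyimpute's implementation: the function name `soft_impute_sketch`, the fixed `shrinkage` value, and the fixed iteration count are hypothetical choices made for this example.

```python
import numpy as np

def soft_impute_sketch(X, shrinkage=0.1, n_iters=200):
    """Illustrative soft-thresholded SVD completion (hypothetical helper)."""
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)
    # Start with zeros in the missing cells
    filled = np.where(missing, 0.0, X)
    for _ in range(n_iters):
        U, s, Vt = np.linalg.svd(filled, full_matrices=False)
        # Soft thresholding: shrink every singular value toward zero
        s = np.maximum(s - shrinkage, 0.0)
        low_rank = (U * s) @ Vt
        # Keep observed entries, refresh only the missing ones
        filled = np.where(missing, low_rank, X)
    return filled

# A rank-1 matrix with one hidden entry (the true value is 4.0)
X_demo = np.outer([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0])
X_demo[1, 1] = np.nan
print(soft_impute_sketch(X_demo))
```

Because the singular values are shrunk on every pass, the recovered entry is slightly biased toward zero; a smaller `shrinkage` reduces that bias at the cost of slower convergence.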

In [1]:
! pip install fancyimpute

Collecting fancyimpute
  Downloading fancyimpute-0.7.0.tar.gz (25 kB)
  Preparing metadata (setup.py) ... done
Collecting knnimpute>=0.1.0 (from fancyimpute)
  Downloading knnimpute-0.1.0.tar.gz (8.3 kB)
  Preparing metadata (setup.py) ... done
Collecting nose (from fancyimpute)
  Downloading nose-1.3.7-py3-none-any.whl.metadata (1.7 kB)
Downloading nose-1.3.7-py3-none-any.whl (154 kB)
Building wheels for collected packages: fancyimpute, knnimpute
  Building wheel for fancyimpute (setup.py) ... done
  Created wheel for fancyimpute: filename=fancyimpute-0.7.0-py3-none-any.whl size=29880 sha256=300ea4c0304ae4fffe47e19b1829b9891a2d48cdbaea9d9e7555b78c13eb2604
  Stored in directory: /root/.cache/pip/wheels/7b/0c/d3/ee82d1fbdcc0858d96434af108608d01703505d453720c84ed
  Building wheel for knnimpute (setup.py) ... done
  C

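# Generating an Iris dataset with missing values
To see how well each imputation method performs, we turn part of the original dataset into missing values, then measure the error between the original dataset and the imputed dataset, using mean squared error (MSE).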
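
The evaluation procedure described above can be sketched as one reusable function. This is a minimal sketch, not part of fancyimpute: `score_imputer` and its parameters are hypothetical names, and it uses NumPy's `default_rng` for the random mask, which is an assumption of this example.

```python
import numpy as np

def score_imputer(X, impute_fn, missing_rate=0.1, seed=0):
    """Hide a random fraction of X, impute it, and return the MSE on the hidden cells.

    `impute_fn` is any callable that takes an array containing NaN and
    returns a filled array of the same shape (hypothetical interface).
    """
    rng = np.random.default_rng(seed)
    # True marks the cells that will be hidden from the imputer
    mask = rng.random(X.shape) < missing_rate
    X_incomplete = X.astype(float)  # astype returns a copy, so X is not modified
    X_incomplete[mask] = np.nan
    X_filled = impute_fn(X_incomplete)
    # Score only the hidden cells; this is NaN if the mask happens to be empty
    return ((X_filled[mask] - X[mask]) ** 2).mean()

# Usage: score a naive imputer that fills every missing cell with zero
X_demo = np.arange(1.0, 41.0).reshape(10, 4)
zero_fill = lambda A: np.where(np.isnan(A), 0.0, A)
print("zero-fill MSE:", score_imputer(X_demo, zero_fill, missing_rate=0.3))
```

Scoring only the masked cells matters: the observed cells are copied through unchanged by most imputers, so including them would make every method look better than it is.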

In [11]:
# Check whether the dataset contains missing values
def checking_missing_values(X):
    has_missing_values = np.isnan(X).any()

    # Print the result
    if has_missing_values:
        print("The dataset contains missing values")
    else:
        print("The dataset has no missing values")

In [12]:
import numpy as np
from sklearn.datasets import load_iris
from fancyimpute import KNN, NuclearNormMinimization, SoftImpute, BiScaler
from sklearn.model_selection import train_test_split

# Load the Iris dataset
data = load_iris()
X = data.data

checking_missing_values(X)

The dataset has no missing values


In [15]:
np.random.seed(0)
# Create a mask with the same shape as X, with a random fraction of entries set to True
missing_mask = np.random.rand(*X.shape) < 0.1  # about 10% of the entries will be marked as missing
print("missing mask: \n", missing_mask[:5])

# Set the entries where the mask is True to NaN to produce X_incomplete
X_incomplete = X.copy()
X_incomplete[missing_mask] = np.nan

print("First 5 rows of the original data: \n", X[:5])
print("First 5 rows after removing 10% of the data: \n", X_incomplete[:5])

checking_missing_values(X_incomplete)

missing mask: 
 [[False False False False]
 [False False False False]
 [False False False False]
 [False False  True  True]
 [ True False False False]]
First 5 rows of the original data: 
 [[5.1 3.5 1.4 0.2]
 [4.9 3.  1.4 0.2]
 [4.7 3.2 1.3 0.2]
 [4.6 3.1 1.5 0.2]
 [5.  3.6 1.4 0.2]]
First 5 rows after removing 10% of the data: 
 [[5.1 3.5 1.4 0.2]
 [4.9 3.  1.4 0.2]
 [4.7 3.2 1.3 0.2]
 [4.6 3.1 nan nan]
 [nan 3.6 1.4 0.2]]
The dataset contains missing values


# Imputing X_incomplete with fancyimpute

###  1. SimpleFill
  * Possible values for `fill_method`:
    - "zero": fill missing entries with zeros
    - "mean": fill with column means
    - "median": fill with column medians
    - "min": fill with the minimum value of each column
    - "random": fill with Gaussian noise drawn according to the mean/std of each column
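
To show what the deterministic fill methods listed above amount to, here is a small NumPy sketch. It is not fancyimpute's code: `simple_fill_sketch` is a hypothetical helper that covers only "zero", "mean", "median", and "min", and it assumes every column has at least one observed value.

```python
import numpy as np

def simple_fill_sketch(X, fill_method="mean"):
    """Fill NaN cells with a per-column statistic (hypothetical helper)."""
    X = np.asarray(X, dtype=float)
    if fill_method == "zero":
        fill_values = np.zeros(X.shape[1])
    elif fill_method == "mean":
        fill_values = np.nanmean(X, axis=0)    # column means, ignoring NaN
    elif fill_method == "median":
        fill_values = np.nanmedian(X, axis=0)  # column medians, ignoring NaN
    elif fill_method == "min":
        fill_values = np.nanmin(X, axis=0)     # column minima, ignoring NaN
    else:
        raise ValueError("unsupported fill_method: %r" % fill_method)
    # fill_values has one entry per column and broadcasts across rows
    return np.where(np.isnan(X), fill_values, X)

X_demo = np.array([[1.0, np.nan],
                   [3.0, 4.0],
                   [np.nan, 8.0]])
# The two missing cells become the column means 6.0 and 2.0
print(simple_fill_sketch(X_demo, "mean"))
```

Every missing cell in a column receives the same value, which is why this approach ignores relationships between features.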

In [16]:
from fancyimpute import SimpleFill

imputer = SimpleFill(fill_method='mean')
X_filled_mean = imputer.fit_transform(X_incomplete)
print("First 5 rows after imputation: \n", X_filled_mean[:5])

First 5 rows after imputation: 
 [[5.1        3.5        1.4        0.2       ]
 [4.9        3.         1.4        0.2       ]
 [4.7        3.2        1.3        0.2       ]
 [4.6        3.1        3.77969925 1.20708661]
 [5.79130435 3.6        1.4        0.2       ]]


In [17]:
mean_mse = ((X_filled_mean[missing_mask] - X[missing_mask]) ** 2).mean()
print("meanImpute MSE: %f" % mean_mse)

meanImpute MSE: 1.226109


### 2. KNN (K-Nearest Neighbor)
  * `k`: number of neighboring rows to use for imputation (default: 5)
  * `orientation`: which axis of the input matrix is treated as a sample (default: 'rows')
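
The idea behind row-wise nearest-neighbor imputation can be sketched in plain NumPy. This is a simplified illustration, not the library's algorithm: `knn_impute_sketch` is a hypothetical helper, the distance between two rows is the mean squared difference over the features both rows have observed, and the inverse-distance weighting (with a small epsilon) is an assumption of this example.

```python
import numpy as np

def knn_impute_sketch(X, k=3):
    """Fill each missing cell from the k nearest rows that observe that column."""
    X = np.asarray(X, dtype=float)
    observed = ~np.isnan(X)
    filled = X.copy()
    n_rows = X.shape[0]
    for i in range(n_rows):
        missing_cols = np.where(~observed[i])[0]
        if missing_cols.size == 0:
            continue
        # Distance from row i to every other row, over shared observed features
        dists = np.full(n_rows, np.inf)
        for r in range(n_rows):
            shared = observed[i] & observed[r]
            if r != i and shared.any():
                dists[r] = np.mean((X[i, shared] - X[r, shared]) ** 2)
        for j in missing_cols:
            # Only rows that observe column j and share a feature with row i
            candidates = np.where(observed[:, j] & np.isfinite(dists))[0]
            if candidates.size == 0:
                continue  # no usable neighbor: leave the cell as NaN
            nearest = candidates[np.argsort(dists[candidates])[:k]]
            weights = 1.0 / (dists[nearest] + 1e-6)  # closer rows count more
            filled[i, j] = np.sum(weights * X[nearest, j]) / np.sum(weights)
    return filled

X_demo = np.array([[1.0, 1.0, np.nan],
                   [1.0, 1.0, 5.0],
                   [1.0, 1.0, 5.0],
                   [10.0, 10.0, 50.0]])
print(knn_impute_sketch(X_demo, k=2))
```

With `k=2`, the missing cell in the first row is filled from the two identical rows below it, and the distant fourth row is ignored.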


In [18]:
from fancyimpute import KNN

imputer = KNN(k=3)
X_filled_knn = imputer.fit_transform(X_incomplete)
print("First 5 rows after imputation: \n", X_filled_knn[:5])

Imputing row 1/150 with 0 missing, elapsed time: 0.009
Imputing row 101/150 with 1 missing, elapsed time: 0.010
First 5 rows after imputation: 
 [[5.1   3.5   1.4   0.2  ]
 [4.9   3.    1.4   0.2  ]
 [4.7   3.2   1.3   0.2  ]
 [4.6   3.1   2.325 0.65 ]
 [5.025 3.6   1.4   0.2  ]]


In [19]:
knn_mse = ((X_filled_knn[missing_mask] - X[missing_mask]) ** 2).mean()
print("knnImpute MSE: %f" % knn_mse)

knnImpute MSE: 0.223060


### 3. MICE (Multiple Imputation by Chained Equation)
  * MICE performs multiple regressions over the sample data and averages their results.
  * `IterativeImputer`: A strategy for imputing missing values by modeling each feature with missing values as a function of the other features in an iterative, round-robin fashion.
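
The round-robin idea can be sketched with ordinary least squares in NumPy. This is a simplified stand-in, not `IterativeImputer` itself: `round_robin_impute_sketch` is a hypothetical helper, it swaps in a plain linear regression for each feature, produces a single imputation rather than several averaged ones, and assumes every column has at least one observed value.

```python
import numpy as np

def round_robin_impute_sketch(X, n_rounds=10):
    """Regress each incomplete column on the others, cycling for n_rounds."""
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)
    # Initial guess: column means in the missing cells
    filled = np.where(missing, np.nanmean(X, axis=0), X)
    n_cols = X.shape[1]
    for _ in range(n_rounds):
        for j in range(n_cols):
            rows_missing = missing[:, j]
            if not rows_missing.any():
                continue  # nothing to impute in this column
            # Design matrix: intercept plus all other (currently filled) columns
            others = np.delete(filled, j, axis=1)
            A = np.column_stack([np.ones(len(others)), others])
            # Fit on rows where column j is observed
            coef, *_ = np.linalg.lstsq(A[~rows_missing],
                                       filled[~rows_missing, j], rcond=None)
            # Predict column j for the rows where it is missing
            filled[rows_missing, j] = A[rows_missing] @ coef
    return filled

# Second column is exactly 2 * first + 1, with one value hidden
x1 = np.arange(10.0)
X_demo = np.column_stack([x1, 2 * x1 + 1])
X_demo[3, 1] = np.nan
print(round_robin_impute_sketch(X_demo)[3])
```

Each pass uses the latest estimates of the other columns as predictors, so the imputed values are refined from one round to the next.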

In [20]:
# Import IterativeImputer (the MICE-style imputer) from fancyimpute
from fancyimpute import IterativeImputer
imputer = IterativeImputer()
# Impute the missing values with the MICE-style imputer
X_filled_mice = imputer.fit_transform(X_incomplete)
print("First 5 rows after imputation: \n", X_filled_mice[:5])

First 5 rows after imputation: 
 [[5.1        3.5        1.4        0.2       ]
 [4.9        3.         1.4        0.2       ]
 [4.7        3.2        1.3        0.2       ]
 [4.6        3.1        1.60055164 0.34232662]
 [5.08324578 3.6        1.4        0.2       ]]




In [21]:
mice_mse = ((X_filled_mice[missing_mask] - X[missing_mask]) ** 2).mean()
print("miceImpute MSE: %f" % mice_mse)
