# Missingno Library

Real-world datasets very often contain missing values, which are represented as NaN (Not a Number). Many machine learning models require a complete dataset, so we use imputation techniques to replace the NaN values with plausible values. Before doing that, we need a good understanding of how the NaN values are distributed in the dataset.

The Missingno library offers a convenient way to visualize the distribution of NaN values. Missingno is a Python library that works directly with Pandas DataFrames.

## Matrix
Using this matrix you can very quickly find the pattern of missingness in the dataset. In our example, the columns AAWhiteSt-4 and SulphidityL-4 have a similar pattern of missing values while UCZAA shows a different pattern.

In [None]:
# Program to visualize missing values in dataset
  
# Importing the libraries
import pandas as pd
import missingno as msno
  
# Loading the dataset
df = pd.read_csv("kamyr-digester.csv")
  
# Visualize missing values as a matrix
msno.matrix(df)
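
The pattern that the matrix plot makes visible can also be inspected numerically with plain Pandas. The sketch below uses a small made-up DataFrame (the column names `a`, `b` and `c` are hypothetical stand-ins, not columns of the digester dataset) to compare the missingness pattern of two columns and to count the values present in each row.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: "a" and "b" are missing in the same rows, "c" is not
df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan],
    "b": [1.0, np.nan, 3.0, np.nan],
    "c": [np.nan, 2.0, 3.0, 4.0],
})

# Boolean mask: True where a value is missing
mask = df.isna()

# Do two columns share exactly the same missingness pattern?
same_pattern_ab = bool((mask["a"] == mask["b"]).all())
same_pattern_ac = bool((mask["a"] == mask["c"]).all())

# Number of values present in each row
present_per_row = df.notna().sum(axis=1)

print(mask)
print("a and b share a pattern:", same_pattern_ab)
print("a and c share a pattern:", same_pattern_ac)
print(present_per_row)
```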

## Bar Chart
This bar chart shows how many values are present in each column, so a shorter bar means more missing values. In our example, AAWhiteSt-4 and SulphidityL-4 contain the most missing values, followed by UCZAA.

In [None]:
# Program to visualize missing values in dataset
  
# Importing the libraries
import pandas as pd
import missingno as msno
  
# Loading the dataset
df = pd.read_csv("kamyr-digester.csv")
  
# Visualize the number of non-missing
# values per column as a bar chart
msno.bar(df)
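
The counts behind such a bar chart can be computed directly with Pandas. This is a minimal sketch on a made-up DataFrame (the column names `a`, `b` and `c` are hypothetical), showing the number and the share of missing values per column.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not the digester dataset
df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan],
    "b": [1.0, np.nan, 3.0, np.nan],
    "c": [np.nan, 2.0, 3.0, 4.0],
})

# Number of missing values per column, largest first
missing_counts = df.isna().sum().sort_values(ascending=False)

# Fraction of missing values per column
missing_share = df.isna().mean()

print(missing_counts)
print(missing_share)
```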

## Heatmap
The heatmap shows the correlation of missingness between every pair of columns. In our example, the correlation between AAWhiteSt-4 and SulphidityL-4 is 1, which means that whenever one of them is present the other is present too, and whenever one is missing the other is missing too.

- A value near -1 means that if one variable is present, the other variable is very likely to be missing.
- A value near 0 means that there is no dependence between the occurrence of missing values in the two variables.
- A value near 1 means that if one variable is present, the other variable is very likely to be present.

In [None]:
  
# Importing the libraries
import pandas as pd
import missingno as msno
  
# Loading the dataset
df = pd.read_csv("kamyr-digester.csv")
  
  
# Visualize the correlation of missingness
# between pairs of columns as a heatmap
msno.heatmap(df)
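
A correlation of missingness can be approximated by hand as the ordinary correlation between the boolean "is missing" masks of two columns. The sketch below does this on a made-up DataFrame (column names `a`, `b` and `c` are hypothetical). It is assumed here, not verified against the library source, that this is close to what the heatmap displays. Columns without any missing values have a constant mask and therefore give NaN correlations.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: "a" and "b" are always missing together
df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan],
    "b": [1.0, np.nan, 3.0, np.nan],
    "c": [np.nan, 2.0, 3.0, 4.0],
})

# Pearson correlation between the missingness masks of every pair of columns
nullity_corr = df.isna().corr()

print(nullity_corr)
```

In this toy data the entry for `a` and `b` is 1 because the two columns are missing in exactly the same rows, while the entry for `a` and `c` is negative because `c` is only missing in a row where `a` is present.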

# What is KNNImputer in scikit-learn?

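KNNImputer is part of the sklearn.impute module of scikit-learn.

It is used to fill in missing values in a dataset using the k-Nearest Neighbors method.

The k-Nearest Neighbors algorithm is normally used for classification and regression problems.

For each sample with a missing value, KNNImputer finds the n_neighbors nearest samples that have a value for that feature. Distances are measured only over the features that both samples have present (the default nan_euclidean metric). The missing value is then replaced by the mean of the neighbors' values for that feature, either uniform or distance-weighted depending on the weights parameter.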

The illustration below shows how KNNImputer works in scikit-learn:

The KNNImputer class is defined as follows:

class sklearn.impute.KNNImputer(*, missing_values=nan, n_neighbors=5, weights='uniform', metric='nan_euclidean', copy=True, add_indicator=False)

Its methods include:

- fit(X): Fit the imputer on X.
- fit_transform(X): Fit to data, then transform it.
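
Fitting and transforming can also be done as two separate steps, which is useful when new data should be imputed using neighbors from the data seen during fitting. The following is a minimal sketch with made-up arrays (`X_train` and `X_new` are hypothetical names), using `transform`, which KNNImputer provides alongside `fit`.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Hypothetical training data without missing values
X_train = np.array([[1.0, 2.0], [3.0, 6.0], [5.0, 10.0]])

# Hypothetical new sample with a missing second feature
X_new = np.array([[3.0, np.nan]])

imputer = KNNImputer(n_neighbors=1)
imputer.fit(X_train)                     # remember the training samples
X_new_filled = imputer.transform(X_new)  # impute using neighbors from X_train

print(X_new_filled)
```

The nearest training sample to `[3.0, nan]` is `[3.0, 6.0]`, because the distance is computed on the first feature only, so the gap is filled with 6.0.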

In [None]:
import numpy as np # Importing numpy to create an array
from sklearn.impute import KNNImputer 
# Creating array with missing values 
X = [[1, 2, np.nan], [3, 6, 12], [np.nan, 12, 24], [2, 4, 16]] 
print("Original array: ", X)
imputer = KNNImputer(n_neighbors=2) # Creating a KNNImputer
array = imputer.fit_transform(X) # Imputing data 
print("Updated array: ", array)
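
To see where an imputed value comes from, the first missing entry of the array above can be reproduced by hand. This is a sketch that assumes the default settings (uniform weights, nan_euclidean metric) and uses `nan_euclidean_distances` from `sklearn.metrics.pairwise`. The names `donors`, `nearest` and `manual_value` are introduced here only for illustration.

```python
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.metrics.pairwise import nan_euclidean_distances

X = np.array([[1, 2, np.nan], [3, 6, 12], [np.nan, 12, 24], [2, 4, 16]], dtype=float)

# Distances from row 0 to every row, ignoring coordinates that are missing
dist = nan_euclidean_distances(X[[0]], X)[0]

# Possible donors: other rows that have a value in column 2
donors = [i for i in range(len(X)) if i != 0 and not np.isnan(X[i, 2])]

# Keep the two closest donors and average their values in column 2
nearest = sorted(donors, key=lambda i: dist[i])[:2]
manual_value = X[nearest, 2].mean()

# Value computed by KNNImputer for the same cell
sklearn_value = KNNImputer(n_neighbors=2).fit_transform(X)[0, 2]

print("Nearest rows:", nearest)
print("Manual value:", manual_value)
print("KNNImputer value:", sklearn_value)
```

Rows 3 and 1 are the closest donors for row 0, so the missing entry becomes the mean of 16 and 12, which is 14.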

# Tuning the hyper-parameters of an estimator

- Hyper-parameters are parameters that are not directly learnt within estimators. In scikit-learn they are passed as arguments to the constructor of the estimator classes. Typical examples include C, kernel and gamma for Support Vector Classifier, alpha for Lasso, etc.

- It is possible and recommended to search the hyper-parameter space for the best cross-validation score.

For more information, see https://scikit-learn.org/stable/modules/grid_search.html
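
Such a search can be sketched with `GridSearchCV` from `sklearn.model_selection`. The example below tunes C, kernel and gamma for a Support Vector Classifier on the iris dataset that ships with scikit-learn. The dataset and the candidate values in `param_grid` are arbitrary choices made for illustration, not recommendations.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Small bundled dataset, used here only as an example
X, y = load_iris(return_X_y=True)

# Candidate hyper-parameter values (illustrative choices)
param_grid = {
    "C": [0.1, 1, 10],
    "kernel": ["linear", "rbf"],
    "gamma": ["scale", 0.1],
}

# Evaluate every combination with 5-fold cross-validation
search = GridSearchCV(SVC(), param_grid, cv=5)
search.fit(X, y)

print("Best parameters:", search.best_params_)
print("Best cross-validation score:", search.best_score_)
```

After fitting, `search.best_estimator_` holds a classifier refitted on the whole dataset with the best combination found.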

