![image.png](attachment:9d612a10-0092-42aa-a1db-bc9955aebe25.png)

![image.png](attachment:ab2e3ee7-6383-4659-b88c-cbac9487b67e.png)

![image.png](attachment:0266680e-d48d-45d2-b255-8575b2df522f.png)

### Handling missing values

Missing values are very common in real-world datasets. They are often encoded as blanks, NaNs, or other placeholders. Values can be missing for many reasons, often specific to the problem domain, such as corrupt measurements or data unavailability.
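
As a minimal sketch of how such placeholders can be normalised at load time, the snippet below reads a small in-memory CSV with pandas. The data, the column names, and the `-999` sentinel are made up for illustration; the point is only that `na_values` lets you declare extra placeholders on top of the ones pandas already recognises (blank fields, `NA`, `NaN`, and so on).

```python
import io

import pandas as pd

# Hypothetical raw data: a blank field, the string "NA", and the
# sentinel -999 are all meant to signal "missing".
raw = io.StringIO(
    "id,age,income\n"
    "1,34,52000\n"
    "2,,61000\n"
    "3,NA,-999\n"
    "4,45,-999\n"
)

# Blank fields and "NA" are treated as missing by default;
# na_values adds the dataset-specific sentinel.
people = pd.read_csv(raw, na_values=["-999"])

# Count the missing entries in each column.
missing_per_column = people.isna().sum()
print(missing_per_column)
```

Once every placeholder is mapped to `NaN`, the rest of the analysis only has to deal with one representation of "missing".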

Handling missing values is required to reduce bias and to produce reliable models. Most algorithms can't handle missing data, so you need to deal with it explicitly, if only to keep your code from crashing.
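
A crash is not the only failure mode. As a small illustration with made-up numbers, NumPy's ordinary mean silently propagates `NaN`, while the NaN-aware variant skips the missing entry:

```python
import numpy as np

# Made-up measurements with one missing entry.
values = np.array([2.0, 4.0, np.nan, 6.0])

plain_mean = np.mean(values)         # NaN propagates through ordinary arithmetic
nan_aware_mean = np.nanmean(values)  # ignores NaN: (2 + 4 + 6) / 3

print(plain_mean, nan_aware_mean)
```

Skipping missing entries is itself a modelling decision, and whether it is reasonable depends on why the values are missing.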

There are four qualitatively distinct types of missing data. Missing data is either structurally missing, missing completely at random (MCAR), missing at random (MAR), or nonignorable (also known as missing not at random, MNAR). Different types of missing data need to be treated differently in order for any analysis to be meaningful.

### Structurally missing data

Structurally missing data is data that is missing for a logical reason. In other words, it is missing because it should not exist. In the table below, the first and third observations have missing values for "Age of youngest child" because these people have no children. The "How many colas did you drink in the past 24 hours" column also has structurally missing values. In this case, we can logically deduce that the correct value is 0, so this value should be used in place of the missing values in our analysis.
 
![image.png](attachment:68100109-e134-4bc8-b482-ccd7ca25f4e0.png)
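
The two cases above can be sketched in pandas as follows. The frame and its column names (`has_children`, `drinks_cola`, and so on) are hypothetical stand-ins, not the values from the table: the cola count is filled with 0 where the value is structurally missing, while the age of the youngest child is left as `NaN` because no value should exist.

```python
import numpy as np
import pandas as pd

# Hypothetical survey data; column names are assumptions for illustration.
survey = pd.DataFrame({
    "has_children": [False, True, False, True],
    "age_youngest_child": [np.nan, 4.0, np.nan, 11.0],
    "drinks_cola": [True, False, True, False],
    "colas_past_24h": [2.0, np.nan, 1.0, np.nan],
})

# People who do not drink cola were never asked the follow-up question,
# so the logically correct count for them is 0.
structurally_zero = ~survey["drinks_cola"] & survey["colas_past_24h"].isna()
survey.loc[structurally_zero, "colas_past_24h"] = 0

# age_youngest_child is deliberately left as NaN for people without
# children: there is no sensible value to fill in.
print(survey)
```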


### Missing completely at random (MCAR)

Looking at the table below, we need to ask ourselves: what is the likely income of the fourth observation? The simplest approach is to note that 50% of the other people have high incomes and 50% have low incomes. We could assume, therefore, that there is a 50% chance she has a high income and a 50% chance she has a low income. This is known as assuming that the missing value is missing completely at random (MCAR). When we make this assumption, we are assuming that whether or not the person has missing data is completely unrelated to the other information in the data.
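
One way to act on the MCAR assumption, sketched here with a made-up income column, is to estimate the share of each category from the observed rows and draw the missing value from that distribution. This is only one simple option under MCAR (dropping the incomplete rows is another), and the data and seed below are illustrative.

```python
import numpy as np
import pandas as pd

# Made-up income column; the fourth observation is missing.
income = pd.Series(["High", "Low", "High", np.nan, "Low"])

# Share of each category among the observed values (NaN is excluded by default).
observed_share = income.value_counts(normalize=True)

# Under MCAR the missing value is assumed to follow the same distribution,
# so draw a replacement from it.
rng = np.random.default_rng(0)
draws = rng.choice(
    observed_share.index.to_numpy(),
    size=int(income.isna().sum()),
    p=observed_share.to_numpy(),
)

imputed = income.copy()
imputed.loc[imputed.isna()] = draws
print(observed_share)
print(imputed)
```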

### Missing at random (MAR)

In the case of missing completely at random, the assumption was that there was no pattern. An alternative assumption, known somewhat confusingly as missing at random (MAR), instead assumes that we can predict the missing value based on the other data. We use this assumption to return to the problem of working out the income of the fourth observation. A simple predictive model is that income can be predicted from gender and age. Looking at the table below, which is the same as the one above, we note that our missing value is for a female aged 30 or more, and the other females aged 30 or more have a High income. As a result, we can predict that the missing value should be High. Note that the idea of prediction does not mean we can perfectly predict a relationship. All that is required is a probabilistic relationship (i.e., that we have a better than random probability of predicting the true value of the missing data).

When data is missing at random, we need to use either an advanced imputation method, such as multiple imputation, or an analysis method specifically designed for MAR data.

![image.png](attachment:3ed54536-6fc6-4c65-9ab1-249dab5c8640.png)
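
The gender-and-age reasoning can be sketched as a group-wise imputation: fill each missing income with the most common income among people of the same gender and age group. The frame below is a hypothetical stand-in for the table, and this is a single deterministic imputation, not the multiple imputation mentioned above.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the table; row 3 is the missing income.
people = pd.DataFrame({
    "gender": ["Female", "Male", "Female", "Female", "Male", "Female"],
    "age_group": ["30+", "Under 30", "30+", "30+", "30+", "Under 30"],
    "income": ["High", "Low", "High", np.nan, "High", "Low"],
})


def group_mode(values):
    """Most frequent observed value in a group, or NaN if none is observed."""
    modes = values.mode()  # mode() ignores NaN
    return modes.iloc[0] if not modes.empty else np.nan


# Broadcast each group's most common income back onto its rows.
group_fill = people.groupby(["gender", "age_group"])["income"].transform(group_mode)

# Only the missing entries are replaced; observed values are kept.
people["income_imputed"] = people["income"].fillna(group_fill)
print(people)
```

Here the missing row belongs to the (Female, 30+) group, whose observed incomes are all High, so the sketch fills in High, matching the reasoning in the text.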

### Missing not at random (nonignorable)

It may be the case that we cannot confidently draw any conclusions about the likely value of missing data. For example, it is possible that people with very low incomes and very high incomes tend to refuse to answer. Or there could be some other reason we just do not know. This is known as missing not at random (MNAR) data, and also as nonignorable missing data.

It is common to include structurally missing data as a special case of data that is missing not at random. However, this misses an important distinction. Structurally missing data is easy to analyze, whereas other forms of missing not at random data are highly problematic.

When data is missing not at random, we cannot rely on any of the standard methods for dealing with missing data (e.g., imputation, or algorithms specifically designed for missing values). If data is missing not at random, standard calculations give biased results.

![image.png](attachment:b5e43072-031c-45cf-b4d4-65786ead1e36.png)
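
A small simulation makes the problem concrete. The incomes below are synthetic, and the missingness mechanism (everyone above the 80th percentile refuses to answer) is an exaggerated assumption chosen only to make the effect obvious. Both the mean of the observed values and the mean after mean imputation underestimate the true mean.

```python
import numpy as np

# Synthetic incomes; the distribution and its parameters are arbitrary.
rng = np.random.default_rng(42)
true_income = rng.lognormal(mean=10.5, sigma=0.6, size=10_000)

# Assumed MNAR mechanism: the probability of being missing depends on the
# unobserved value itself. Here the top 20% of earners always refuse.
threshold = np.quantile(true_income, 0.8)
observed = np.where(true_income > threshold, np.nan, true_income)

true_mean = true_income.mean()
observed_mean = np.nanmean(observed)

# Mean imputation just fills the gaps with the already biased observed mean.
mean_imputed = np.where(np.isnan(observed), observed_mean, observed)

print(f"true mean:         {true_mean:,.0f}")
print(f"observed mean:     {observed_mean:,.0f}")
print(f"mean-imputed mean: {mean_imputed.mean():,.0f}")
```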


# Missing Values

### Visualization
* matrix
* bar chart
* heatmap

![image.png](attachment:8e6ecff7-0714-4093-8ee4-21bd6b1dc864.png)
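
The three views listed above can be approximated with plain pandas and matplotlib. The sketch below is not the tool used to produce the figure; `plot_missingness` is a hypothetical helper and `demo` is made-up data. The matrix shows where values are missing, the bar chart counts missing values per column, and the heatmap shows how strongly the missingness of one column is correlated with that of another.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd


def plot_missingness(frame):
    """Draw matrix, bar chart, and heatmap views of missing values."""
    nullity = frame.isna()
    fig, (ax_matrix, ax_bar, ax_heat) = plt.subplots(1, 3, figsize=(15, 4))

    # Matrix: one cell per value, dark where the value is missing.
    ax_matrix.imshow(nullity.astype(int).to_numpy(), aspect="auto",
                     cmap="gray_r", interpolation="nearest")
    ax_matrix.set_xticks(range(nullity.shape[1]))
    ax_matrix.set_xticklabels(nullity.columns, rotation=45, ha="right")
    ax_matrix.set_title("Matrix: missing cells")

    # Bar chart: number of missing values in each column.
    nullity.sum().plot.bar(ax=ax_bar, title="Bar chart: missing count")

    # Heatmap: correlation between the missingness indicators of columns.
    # Columns that are never (or always) missing have no variance, so
    # their correlation is undefined and they are left out.
    varying = nullity.loc[:, nullity.nunique() > 1]
    corr = varying.astype(int).corr()
    image = ax_heat.imshow(corr.to_numpy(), vmin=-1, vmax=1, cmap="coolwarm")
    ax_heat.set_xticks(range(len(corr.columns)))
    ax_heat.set_xticklabels(corr.columns)
    ax_heat.set_yticks(range(len(corr.columns)))
    ax_heat.set_yticklabels(corr.columns)
    ax_heat.set_title("Heatmap: missingness correlation")
    fig.colorbar(image, ax=ax_heat)

    fig.tight_layout()
    return fig, corr


# Made-up data: "a" and "c" are missing together, "b" is missing exactly
# when "a" is present, and "d" is complete.
demo = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan, 5.0],
    "b": [np.nan, 2.0, np.nan, 4.0, np.nan],
    "c": [1.0, np.nan, 3.0, np.nan, 5.0],
    "d": [1, 2, 3, 4, 5],
})

fig, nullity_corr = plot_missingness(demo)
print(nullity_corr)
```

In a notebook with `%matplotlib inline`, the figure is rendered below the cell. The helper assumes at least one column has some, but not all, values missing; otherwise the heatmap has nothing to show.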

In [6]:
import pandas as pd
import numpy as np
import seaborn as sns
import pymongo
import json
import matplotlib.pyplot as plt
from pandas_profiling import ProfileReport
%matplotlib inline

In [4]:
df=pd.read_csv("D:\\lh_data\\ineuron\\HomeCredit_columns_description.csv",encoding= 'unicode_escape')
df

Unnamed: 0.1,Unnamed: 0,Table,Row,Description,Special
0,1,application_{train|test}.csv,SK_ID_CURR,ID of loan in our sample,
1,2,application_{train|test}.csv,TARGET,Target variable (1 - client with payment diffi...,
2,5,application_{train|test}.csv,NAME_CONTRACT_TYPE,Identification if loan is cash or revolving,
3,6,application_{train|test}.csv,CODE_GENDER,Gender of the client,
4,7,application_{train|test}.csv,FLAG_OWN_CAR,Flag if the client owns a car,
...,...,...,...,...,...
214,217,installments_payments.csv,NUM_INSTALMENT_NUMBER,On which installment we observe payment,
215,218,installments_payments.csv,DAYS_INSTALMENT,When the installment of previous credit was su...,time only relative to the application
216,219,installments_payments.csv,DAYS_ENTRY_PAYMENT,When was the installments of previous credit p...,time only relative to the application
217,220,installments_payments.csv,AMT_INSTALMENT,What was the prescribed installment amount of ...,
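
A natural first check on a frame like this is how much of each column is missing. The sketch below uses a small stand-in built from four of the rows visible in the output above, and assumes that the blank `Special` cells are `NaN`. On the real data, the same two expressions can be applied to `df` directly.

```python
import numpy as np
import pandas as pd

# Stand-in built from rows visible in the output above; blank "Special"
# cells are assumed to be NaN.
sample = pd.DataFrame({
    "Table": [
        "application_{train|test}.csv",
        "application_{train|test}.csv",
        "installments_payments.csv",
        "installments_payments.csv",
    ],
    "Row": ["SK_ID_CURR", "TARGET", "DAYS_INSTALMENT", "AMT_INSTALMENT"],
    "Special": [np.nan, np.nan, "time only relative to the application", np.nan],
})

# Count and percentage of missing values per column, worst column first.
summary = pd.DataFrame({
    "missing_count": sample.isna().sum(),
    "missing_pct": sample.isna().mean() * 100,
}).sort_values("missing_pct", ascending=False)

print(summary)
```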
