# Different types of missing data:

### 1. Missing Completely At Random (MCAR):

#### When data is missing completely at random, there is no systematic reason for the missingness. The probability that a value is missing is unrelated to both observed and unobserved variables, so the rows that remain are a random sample of the full data: estimates computed from them are unbiased, although less precise because fewer rows are used.

### 2. Missing At Random (MAR):

#### When data is missing at random, the missingness is related to observed variables, but not to unobserved ones. The probability that a value is missing depends on other observed data but, once those are taken into account, not on the missing value itself.



### 3. Missing Not At Random (MNAR):

#### When data is missing not at random, the missingness is related to unobserved information, possibly in addition to observed variables. The probability that a value is missing depends on the missing value itself, even after accounting for the observed data, which can bias estimates computed from the remaining data.

### 4. Intentional Missingness:

#### In some cases, data may be intentionally left out or not collected due to research design, ethical or practical reasons.
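
#### The first three mechanisms can be illustrated with a small simulation. This is a minimal sketch on made-up data (the `hours` and `score` columns and all probabilities are assumptions for illustration, not taken from the student file). It removes scores under each mechanism and compares the mean of the remaining scores with the true mean.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 20_000

# Hypothetical data: weekly study hours drive the test score
hours = rng.uniform(0, 10, n)
score = 50 + 3 * hours + rng.normal(0, 5, n)
df = pd.DataFrame({"hours": hours, "score": score})

# MCAR: every score has the same 30% chance of being missing
mcar = df["score"].mask(rng.random(n) < 0.3)

# MAR: scores are missing more often for students with few study hours
# (the probability depends only on the observed column "hours")
p_mar = np.where(df["hours"] < 5, 0.5, 0.1)
mar = df["score"].mask(rng.random(n) < p_mar)

# MNAR: low scores themselves are missing more often
p_mnar = np.where(df["score"] < 60, 0.5, 0.1)
mnar = df["score"].mask(rng.random(n) < p_mnar)

true_mean = df["score"].mean()
print(f"true mean: {true_mean:.2f}")
for name, s in [("MCAR", mcar), ("MAR", mar), ("MNAR", mnar)]:
    print(f"{name}: mean of observed scores = {s.mean():.2f}")
```

#### Under MCAR the observed mean stays close to the true mean. Under MAR and MNAR the plain observed mean drifts upwards; for MAR the bias can be corrected by conditioning on `hours`, whereas for MNAR the observed data alone are not enough.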

# DATA

In [89]:
import pandas as pd
import numpy as np

In [90]:
data=pd.read_csv("student.csv")
data.head(2)

Unnamed: 0.1,Unnamed: 0,Gender,EthnicGroup,ParentEduc,LunchType,TestPrep,ParentMaritalStatus,PracticeSport,IsFirstChild,NrSiblings,TransportMeans,WklyStudyHours,MathScore,ReadingScore,WritingScore
0,0.0,female,,bachelor's degree,standard,none,married,regularly,yes,3.0,school_bus,< 5,71.0,71.0,74.0
1,1.0,female,group C,some college,standard,,married,sometimes,yes,0.0,,05-Oct,69.0,90.0,88.0


In [91]:
data.isnull().sum()

Unnamed: 0               37
Gender                   23
EthnicGroup            1868
ParentEduc             1870
LunchType                21
TestPrep               1833
ParentMaritalStatus    1214
PracticeSport           655
IsFirstChild            907
NrSiblings             1575
TransportMeans         3142
WklyStudyHours          966
MathScore                26
ReadingScore             22
WritingScore             11
dtype: int64
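
#### Raw counts are easier to judge next to the share of rows they represent. The sketch below uses a small made-up frame (`toy` is a stand-in, not the student data) to show one way of tabulating both.

```python
import numpy as np
import pandas as pd

# Small made-up frame standing in for the student data
toy = pd.DataFrame({
    "Gender": ["female", None, "male", "male"],
    "MathScore": [71.0, 69.0, np.nan, np.nan],
    "TransportMeans": [None, None, None, "school_bus"],
})

# Count and percentage of missing values per column, worst column first
summary = pd.DataFrame({
    "n_missing": toy.isnull().sum(),
    "pct_missing": (toy.isnull().mean() * 100).round(1),
}).sort_values("pct_missing", ascending=False)
print(summary)
```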

## Ways of handling missing values:

### 1. Deleting rows

#### Deleting rows with missing values discards the information in their other columns, so it is usually not recommended, especially when the dataset is small or many rows are affected. It is often better to impute the missing values instead.
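
#### When rows do have to be dropped, `DataFrame.dropna` can be restricted so that fewer rows are lost. A minimal sketch on a made-up frame (the values are illustrative only):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "MathScore": [71.0, np.nan, 87.0, np.nan],
    "ReadingScore": [71.0, 90.0, np.nan, np.nan],
    "WritingScore": [74.0, 88.0, 91.0, np.nan],
})

all_complete = toy.dropna()                     # keep rows with no missing value at all
needs_math = toy.dropna(subset=["MathScore"])   # only require MathScore to be present
mostly_there = toy.dropna(thresh=2)             # keep rows with at least 2 non-missing values

print(len(toy), len(all_complete), len(needs_math), len(mostly_there))
```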

### 2. Filling with MEAN [mean = (sum of values) / (number of values)]

In [92]:
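mean = data["MathScore"].mean()
# Assign the result back instead of calling fillna(..., inplace=True) on a column
# selection, which newer pandas versions warn about and may not apply
data["MathScore"] = data["MathScore"].fillna(mean)
print(mean)
data.head(2)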

66.55835374816266


Unnamed: 0.1,Unnamed: 0,Gender,EthnicGroup,ParentEduc,LunchType,TestPrep,ParentMaritalStatus,PracticeSport,IsFirstChild,NrSiblings,TransportMeans,WklyStudyHours,MathScore,ReadingScore,WritingScore
0,0.0,female,,bachelor's degree,standard,none,married,regularly,yes,3.0,school_bus,< 5,71.0,71.0,74.0
1,1.0,female,group C,some college,standard,,married,sometimes,yes,0.0,,05-Oct,69.0,90.0,88.0
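
#### One side effect of mean imputation is worth seeing in numbers: the mean is preserved, but the spread shrinks because every filled value sits exactly at the centre. A minimal sketch on simulated scores (made-up data, not the student file):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

# Simulated scores with roughly 30% of the values removed at random
scores = pd.Series(rng.normal(66, 15, 1000))
with_gaps = scores.mask(rng.random(1000) < 0.3)

filled = with_gaps.fillna(with_gaps.mean())

print("mean before/after:", with_gaps.mean(), filled.mean())
print("std  before/after:", with_gaps.std(), filled.std())
```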


### 3. Filling with MEDIAN [middle value in a dataset]

In [93]:
# Take the median from the same column that is being filled
median = data["ReadingScore"].median()
data["ReadingScore"] = data["ReadingScore"].fillna(median)
data.head(2)


Unnamed: 0.1,Unnamed: 0,Gender,EthnicGroup,ParentEduc,LunchType,TestPrep,ParentMaritalStatus,PracticeSport,IsFirstChild,NrSiblings,TransportMeans,WklyStudyHours,MathScore,ReadingScore,WritingScore
0,0.0,female,,bachelor's degree,standard,none,married,regularly,yes,3.0,school_bus,< 5,71.0,71.0,74.0
1,1.0,female,group C,some college,standard,,married,sometimes,yes,0.0,,05-Oct,69.0,90.0,88.0


### 4. Filling with MODE [most frequent value]

In [94]:
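# mode() returns a Series (there can be ties), so take its first entry;
# passing the whole Series to fillna would only fill the row with index 0
mode = data["Gender"].mode()
data["Gender"] = data["Gender"].fillna(mode[0])
print(mode)
data.head(2)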

0    female
Name: Gender, dtype: object


Unnamed: 0.1,Unnamed: 0,Gender,EthnicGroup,ParentEduc,LunchType,TestPrep,ParentMaritalStatus,PracticeSport,IsFirstChild,NrSiblings,TransportMeans,WklyStudyHours,MathScore,ReadingScore,WritingScore
0,0.0,female,,bachelor's degree,standard,none,married,regularly,yes,3.0,school_bus,< 5,71.0,71.0,74.0
1,1.0,female,group C,some college,standard,,married,sometimes,yes,0.0,,05-Oct,69.0,90.0,88.0


### 5. Filling with forward fill (ffill) [filling with the most recently observed value in the same column]

In [95]:
# ffill() already returns the filled column, so assign it back directly
data["TestPrep"] = data["TestPrep"].ffill()
data.head(2)

Unnamed: 0.1,Unnamed: 0,Gender,EthnicGroup,ParentEduc,LunchType,TestPrep,ParentMaritalStatus,PracticeSport,IsFirstChild,NrSiblings,TransportMeans,WklyStudyHours,MathScore,ReadingScore,WritingScore
0,0.0,female,,bachelor's degree,standard,none,married,regularly,yes,3.0,school_bus,< 5,71.0,71.0,74.0
1,1.0,female,group C,some college,standard,none,married,sometimes,yes,0.0,,05-Oct,69.0,90.0,88.0


### 6. Filling with backward fill (bfill) [filling with the next observed value in the same column]

In [96]:
# bfill() already returns the filled column, so assign it back directly
data["LunchType"] = data["LunchType"].bfill()
data.head(2)

Unnamed: 0.1,Unnamed: 0,Gender,EthnicGroup,ParentEduc,LunchType,TestPrep,ParentMaritalStatus,PracticeSport,IsFirstChild,NrSiblings,TransportMeans,WklyStudyHours,MathScore,ReadingScore,WritingScore
0,0.0,female,,bachelor's degree,standard,none,married,regularly,yes,3.0,school_bus,< 5,71.0,71.0,74.0
1,1.0,female,group C,some college,standard,none,married,sometimes,yes,0.0,,05-Oct,69.0,90.0,88.0
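
#### Forward and backward fill depend on row order and cannot fill gaps at the edges. The sketch below uses a short made-up Series to show what is left unfilled and how `limit` caps the number of consecutive fills.

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan, "none", np.nan, np.nan, "completed", np.nan])

forward = s.ffill()            # the leading NaN has no earlier value, so it stays
backward = s.bfill()           # the trailing NaN has no later value, so it stays
both = s.ffill().bfill()       # forward fill, then backward fill what is left
limited = s.ffill(limit=1)     # fill at most one consecutive gap

print(pd.DataFrame({"original": s, "ffill": forward, "bfill": backward,
                    "ffill+bfill": both, "ffill(limit=1)": limited}))
```

#### Both methods only make sense when neighbouring rows are related (for example a time series). For rows in arbitrary order, the filled value is effectively a random neighbour's value.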


### 7. Random imputation [filling by randomly selecting values from the non-missing values in the same column]

In [97]:
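# Draw one random observed value per missing entry; a single np.random.choice(...)
# call without size would fill every gap with the same value
missing = data["NrSiblings"].isna()
observed = data["NrSiblings"].dropna().to_numpy()
data.loc[missing, "NrSiblings"] = np.random.choice(observed, size=missing.sum())
data["NrSiblings"].head(2)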

0    3.0
1    0.0
Name: NrSiblings, dtype: float64

### 8. Creating a new feature to capture missing values (NaNs) [creating a new binary feature]

In [98]:
data['TransportMeans_null'] = data['TransportMeans'].isnull().astype(int)
data.head(2)

Unnamed: 0.1,Unnamed: 0,Gender,EthnicGroup,ParentEduc,LunchType,TestPrep,ParentMaritalStatus,PracticeSport,IsFirstChild,NrSiblings,TransportMeans,WklyStudyHours,MathScore,ReadingScore,WritingScore,TransportMeans_null
0,0.0,female,,bachelor's degree,standard,none,married,regularly,yes,3.0,school_bus,< 5,71.0,71.0,74.0,0
1,1.0,female,group C,some college,standard,none,married,sometimes,yes,0.0,,05-Oct,69.0,90.0,88.0,1
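
#### The indicator is usually combined with an imputation of the original column, and the order matters: the flag has to be created before the gaps are filled. A minimal sketch on a made-up column (filling with the median is just one possible choice here):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"NrSiblings": [3.0, np.nan, 1.0, np.nan, 2.0]})

# 1) Record where the values were missing
toy["NrSiblings_null"] = toy["NrSiblings"].isnull().astype(int)

# 2) Only then fill the original column
toy["NrSiblings"] = toy["NrSiblings"].fillna(toy["NrSiblings"].median())

print(toy)
```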


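### 9. End of distribution imputation

#### End of distribution imputation replaces missing values with a value from the far tail of the variable's distribution, for example mean + 3 × standard deviation. The imputed values then stand apart from the observed ones, which lets a model pick up on the missingness when it is suspected to be informative (not at random). The cost is that the distribution is distorted: the mean shifts towards the tail, the variance grows, and the imputed values can look like outliers.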

In [99]:
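# ReadingScore was already filled in step 3, so this step uses WritingScore,
# which still has missing values
upper_limit = data['WritingScore'].mean() + 3 * data['WritingScore'].std()
data['WritingScore'] = data['WritingScore'].fillna(upper_limit)
data.head()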

Unnamed: 0.1,Unnamed: 0,Gender,EthnicGroup,ParentEduc,LunchType,TestPrep,ParentMaritalStatus,PracticeSport,IsFirstChild,NrSiblings,TransportMeans,WklyStudyHours,MathScore,ReadingScore,WritingScore,TransportMeans_null
0,0.0,female,,bachelor's degree,standard,none,married,regularly,yes,3.0,school_bus,< 5,71.0,71.0,74.0,0
1,1.0,female,group C,some college,standard,none,married,sometimes,yes,0.0,,05-Oct,69.0,90.0,88.0,1
2,2.0,female,group B,master's degree,standard,none,single,sometimes,yes,4.0,school_bus,< 5,87.0,93.0,91.0,0
3,3.0,male,group A,associate's degree,free/reduced,none,married,never,no,1.0,,05-Oct,45.0,56.0,42.0,1
4,4.0,male,group C,some college,standard,none,married,sometimes,yes,0.0,school_bus,05-Oct,76.0,78.0,75.0,0
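
#### The effect of this method on the summary statistics can be checked directly. A small sketch on simulated scores (made-up data, not the student file), comparing mean and standard deviation before and after filling:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)

# Simulated scores with roughly 20% of the values removed at random
scores = pd.Series(rng.normal(70, 10, 500)).mask(rng.random(500) < 0.2)

upper_limit = scores.mean() + 3 * scores.std()
filled = scores.fillna(upper_limit)

print("imputed value:", upper_limit)
print("mean before/after:", scores.mean(), filled.mean())
print("std  before/after:", scores.std(), filled.std())
```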


### 10. Arbitrary value imputation

#### In arbitrary value imputation, we replace missing values with a manually chosen value that does not occur in the original data. For example, we might replace null values with a string like 'unknown' or 'other'. This approach is easiest to justify when the missing values are considered to be missing completely at random, or when we have reason to believe that they are not informative, because the placeholder then carries no extra meaning.

In [100]:
data['EthnicGroup'] = data['EthnicGroup'].fillna("unknown")
data.head(2)

Unnamed: 0.1,Unnamed: 0,Gender,EthnicGroup,ParentEduc,LunchType,TestPrep,ParentMaritalStatus,PracticeSport,IsFirstChild,NrSiblings,TransportMeans,WklyStudyHours,MathScore,ReadingScore,WritingScore,TransportMeans_null
0,0.0,female,unknown,bachelor's degree,standard,none,married,regularly,yes,3.0,school_bus,< 5,71.0,71.0,74.0,0
1,1.0,female,group C,some college,standard,none,married,sometimes,yes,0.0,,05-Oct,69.0,90.0,88.0,1


### 11. Imputing null values with a new category

#### Imputing null values with a new category uses the same mechanics as arbitrary value imputation, but the new label (here 'missing') is meant to be kept as a category in its own right that explicitly identifies missing values. This approach can be useful when the values are suspected to be missing not at random, or when we want to distinguish between missing and non-missing values.

In [101]:
data["ParentEduc"]= data["ParentEduc"].fillna('missing')
data.head(2)

Unnamed: 0.1,Unnamed: 0,Gender,EthnicGroup,ParentEduc,LunchType,TestPrep,ParentMaritalStatus,PracticeSport,IsFirstChild,NrSiblings,TransportMeans,WklyStudyHours,MathScore,ReadingScore,WritingScore,TransportMeans_null
0,0.0,female,unknown,bachelor's degree,standard,none,married,regularly,yes,3.0,school_bus,< 5,71.0,71.0,74.0,0
1,1.0,female,group C,some college,standard,none,married,sometimes,yes,0.0,,05-Oct,69.0,90.0,88.0,1


### 12. Filling missing values with a model (supervised or unsupervised):

In [None]:
##RandomForestClassifier (supervised)

from sklearn.ensemble import RandomForestClassifier
import pandas as pd

# Load data with missing values ('data.csv' and 'target' are placeholder names)
data = pd.read_csv('data.csv')

# 'target' is the categorical column whose missing values we want to fill;
# every other column is used as a predictor
X = data.drop(columns=['target'])
y = data['target']

# Identify the rows where the column to impute is missing
mask = y.isna()

# Train on rows where the value is known, predict the rows where it is missing
X_train = X[~mask]
y_train = y[~mask]
X_missing = X[mask]

# Fit RFC on the known values
# (assumes the predictor columns are numeric and contain no missing values)
rfc = RandomForestClassifier(random_state=0)
rfc.fit(X_train, y_train)

# Predict the missing values and write them back into the original column
if mask.any():
    data.loc[mask, 'target'] = rfc.predict(X_missing)


In [None]:
##KMeans (unsupervised)

import pandas as pd
import numpy as np
from sklearn.cluster import KMeans

# create sample data with missing values
data = pd.DataFrame({'A': [1, 2, np.nan, 4, 5], 'B': [6, np.nan, 8, 9, 10]})

# KMeans cannot handle NaN, so fill provisionally with the column means
# (zeros would pull the incomplete rows towards the origin and distort the clusters)
data_filled = data.fillna(data.mean())

# fit KMeans clustering on the provisionally filled data
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(data_filled)

# cluster label for every row
labels = kmeans.labels_

# create a DataFrame with the cluster labels and the original data
data_clustered = pd.concat([data, pd.Series(labels, name='cluster', index=data.index)], axis=1)

# fill missing values based on the mean value of each cluster
data_imputed = data_clustered.groupby('cluster')[['A', 'B']].transform(lambda x: x.fillna(x.mean()))

# fall back to the overall column mean if a cluster has no observed value for a column
data_imputed = data_imputed.fillna(data.mean())

print('Original Data:\n', data)
print('Imputed Data:\n', data_imputed)
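
#### scikit-learn also ships ready-made imputers that follow the same idea of borrowing information from similar rows. The sketch below applies `KNNImputer` to the same small sample frame; `n_neighbors=2` is an arbitrary choice for this tiny example, not a recommendation.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# same sample data as in the KMeans example
data = pd.DataFrame({'A': [1, 2, np.nan, 4, 5], 'B': [6, np.nan, 8, 9, 10]})

# each missing value is replaced by the average of that column
# over the nearest rows that have the value observed
imputer = KNNImputer(n_neighbors=2)
data_knn = pd.DataFrame(imputer.fit_transform(data), columns=data.columns)

print('Original Data:\n', data)
print('KNN Imputed Data:\n', data_knn)
```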
