# Missing data

1.	Data can have missing values for many reasons, such as observations that were never recorded or data that was corrupted.
2.	Handling missing data matters because many machine learning algorithms do not support data with missing values.
3.	We should learn:
    - How to mark invalid or corrupt values as missing in a dataset.
    - How to remove rows with missing data from a dataset.
    - How to impute missing values with the mean, median, most frequent value, or zero.
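
The three steps in the list above can be sketched on a small table. This is a minimal illustration, not the exercise dataset: the `toy` frame, its `vote` and `age` columns, and the choice of `-1` as a corrupt age are all assumptions made up for the example.

```python
import pandas as pd

# Hypothetical toy data: "?" stands in for a missing vote, -1 for a corrupt age.
toy = pd.DataFrame({
    "vote": ["y", "?", "n", "y", "y"],
    "age": [34.0, 51.0, -1.0, 47.0, 28.0],
})

# 1. Mark invalid or corrupt values as missing (mask() sets NaN where the condition is True).
marked = toy.copy()
marked["vote"] = marked["vote"].mask(marked["vote"] == "?")
marked["age"] = marked["age"].mask(marked["age"] < 0)

# 2. Remove every row that contains at least one missing value.
dropped = marked.dropna()

# 3. Impute instead of dropping: most frequent value for the categorical
#    column, median for the numeric column.
imputed = marked.copy()
imputed["vote"] = imputed["vote"].fillna(imputed["vote"].mode()[0])
imputed["age"] = imputed["age"].fillna(imputed["age"].median())

print(dropped.shape)   # rows left after dropping
print(imputed)         # same number of rows as the original, no NaN
```

Dropping keeps only complete rows, while imputing keeps every row at the cost of inventing plausible values; which trade-off is acceptable depends on the dataset.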


# <span style = "color:red"> Exercise 1: Dropping missing data</span>

In [1]:
import pandas as pd
import numpy as np
import os
os.chdir("C:\\Users\\ramreddymyla\\Google Drive\\01 DS ML DL NLP and AI With Python Lab Copy\\02 Lab Data\\Python")
df = pd.read_csv("house-votes-84.csv")
df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 435 entries, 0 to 434
Data columns (total 17 columns):
Class Name                                435 non-null object
handicapped-infants                       435 non-null object
water-project-cost-sharing                435 non-null object
adoption-of-the-budget-resolution         435 non-null object
physician-fee-freeze                      435 non-null object
el-salvador-aid                           435 non-null object
religious-groups-in-schools               435 non-null object
anti-satellite-test-ban                   435 non-null object
aid-to-nicaraguan-contras                 435 non-null object
mx-missile                                435 non-null object
 immigration                              435 non-null object
synfuels-corporation-cutback              435 non-null object
education-spending                        435 non-null object
superfund-right-to-sue                    435 non-null object
crime                      

- Missing data is not always stored as NaN; different source systems use different placeholders.
- In this dataset the placeholder is a **question mark** (`?`), as shown below.

In [2]:
df["handicapped-infants"].unique()

array(['n', '?', 'y'], dtype=object)

In [3]:
(df["handicapped-infants"] == "?").sum() # count the question marks in this column

12
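
The same comparison can be applied to the whole DataFrame to count the placeholder in every column at once. The sketch below uses a small stand-in frame named `sample` (a hypothetical name) holding three columns from the first three rows of the data, so it runs without the CSV file.

```python
import pandas as pd

# Stand-in for three columns of the voting data (first three rows only).
sample = pd.DataFrame({
    "Class Name": ["republican", "republican", "democrat"],
    "handicapped-infants": ["n", "n", "?"],
    "physician-fee-freeze": ["y", "y", "?"],
})

# Comparing the whole frame to "?" yields a boolean frame;
# sum() then counts the True values column by column.
question_marks = (sample == "?").sum()
print(question_marks)
```

On the real data, `(df == "?").sum()` would give the per-column counts in one call.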

In [4]:
df[0:3]

Unnamed: 0,Class Name,handicapped-infants,water-project-cost-sharing,adoption-of-the-budget-resolution,physician-fee-freeze,el-salvador-aid,religious-groups-in-schools,anti-satellite-test-ban,aid-to-nicaraguan-contras,mx-missile,immigration,synfuels-corporation-cutback,education-spending,superfund-right-to-sue,crime,duty-free-exports,export-administration-act-south-africa
0,republican,n,y,n,y,y,y,n,n,n,y,?,y,y,y,n,y
1,republican,n,y,n,y,y,y,n,n,n,n,n,y,y,y,n,?
2,democrat,?,y,y,?,y,y,n,n,n,n,y,n,y,y,n,n


In [5]:
# Convert '?' to NaN
df[df == '?'] = np.nan


In [6]:
df.head(3)

Unnamed: 0,Class Name,handicapped-infants,water-project-cost-sharing,adoption-of-the-budget-resolution,physician-fee-freeze,el-salvador-aid,religious-groups-in-schools,anti-satellite-test-ban,aid-to-nicaraguan-contras,mx-missile,immigration,synfuels-corporation-cutback,education-spending,superfund-right-to-sue,crime,duty-free-exports,export-administration-act-south-africa
0,republican,n,y,n,y,y,y,n,n,n,y,,y,y,y,n,y
1,republican,n,y,n,y,y,y,n,n,n,n,n,y,y,y,n,
2,democrat,,y,y,,y,y,n,n,n,n,y,n,y,y,n,n
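
As an alternative to replacing `?` after loading, `pd.read_csv` can treat the placeholder as missing while parsing, through its `na_values` parameter. The sketch below is an assumption-laden illustration: `csv_text` is a made-up inline excerpt standing in for `house-votes-84.csv`, and `votes` is a hypothetical variable name.

```python
import io

import pandas as pd

# Hypothetical inline CSV standing in for house-votes-84.csv.
csv_text = """Class Name,handicapped-infants,physician-fee-freeze
republican,n,y
republican,n,y
democrat,?,?
"""

# na_values lists extra strings that read_csv should parse as NaN.
votes = pd.read_csv(io.StringIO(csv_text), na_values=["?"])

print(votes.isnull().sum())
```

With the real file this would be `pd.read_csv("house-votes-84.csv", na_values=["?"])`, which makes a separate conversion step unnecessary.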


In [7]:
# Print the number of NaNs
print(df.isnull().sum())
# Print shape of original DataFrame
print("Shape of Original DataFrame: {}".format(df.shape))

Class Name                                  0
handicapped-infants                        12
water-project-cost-sharing                 48
adoption-of-the-budget-resolution          11
physician-fee-freeze                       11
el-salvador-aid                            15
religious-groups-in-schools                11
anti-satellite-test-ban                    14
aid-to-nicaraguan-contras                  15
mx-missile                                 22
 immigration                                7
synfuels-corporation-cutback               21
education-spending                         31
superfund-right-to-sue                     25
crime                                      17
duty-free-exports                          28
export-administration-act-south-africa    104
dtype: int64
Shape of Original DataFrame: (435, 17)
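
Raw counts are easier to judge as percentages, and it helps to know in advance how many rows contain at least one missing value. The sketch below shows one way to compute both; the `toy` frame and its columns `a` and `b` are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; NaN marks a missing vote.
toy = pd.DataFrame({
    "a": ["y", np.nan, "n", "y"],
    "b": ["n", "n", np.nan, np.nan],
})

# The mean of a boolean mask is the fraction of True values,
# so this gives the percentage of missing entries per column.
pct_missing = toy.isnull().mean() * 100

# Rows with at least one missing value, i.e. the rows dropna() would remove.
rows_with_missing = int(toy.isnull().any(axis=1).sum())

print(pct_missing)
print(rows_with_missing)
```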


In [10]:
# Drop every row that contains at least one missing value
df = df.dropna()
# Print shape of new DataFrame
print("Shape of DataFrame After Dropping All Rows with Missing Values: {}".format(df.shape))

Shape of DataFrame After Dropping All Rows with Missing Values: (232, 17)


See also: [handling missing data in Python](https://machinelearningmastery.com/handle-missing-data-python/)

In [2]:
# Percentage of rows removed by dropna(): (original - remaining) / original * 100
(435-232)/435*100

46.666666666666664
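
Losing almost half of the rows is a high price, and `dropna` has options that are less aggressive than its default. The sketch below compares three variants on a hypothetical `toy` frame; the threshold of 2 and the choice of `crime` as the required column are assumptions for illustration only, not a recommendation for this dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; NaN marks a missing vote.
toy = pd.DataFrame({
    "Class Name": ["republican", "democrat", "democrat", "republican"],
    "crime": ["y", np.nan, "n", np.nan],
    "immigration": ["n", "n", np.nan, np.nan],
})

# Default: drop a row if any column is missing.
strict = toy.dropna()

# thresh=2 keeps rows that have at least 2 non-missing values.
lenient = toy.dropna(thresh=2)

# subset restricts the check to the listed columns.
by_column = toy.dropna(subset=["crime"])

# Percentage of rows removed by the default behaviour.
pct_dropped = (len(toy) - len(strict)) / len(toy) * 100

print(len(strict), len(lenient), len(by_column), pct_dropped)
```

Rows kept by `thresh` or `subset` can still contain NaN, so they usually need imputation before being passed to an algorithm that rejects missing values.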

In [11]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 232 entries, 5 to 431
Data columns (total 17 columns):
Class Name                                232 non-null object
handicapped-infants                       232 non-null object
water-project-cost-sharing                232 non-null object
adoption-of-the-budget-resolution         232 non-null object
physician-fee-freeze                      232 non-null object
el-salvador-aid                           232 non-null object
religious-groups-in-schools               232 non-null object
anti-satellite-test-ban                   232 non-null object
aid-to-nicaraguan-contras                 232 non-null object
mx-missile                                232 non-null object
 immigration                              232 non-null object
synfuels-corporation-cutback              232 non-null object
education-spending                        232 non-null object
superfund-right-to-sue                    232 non-null object
crime                      
