## **Mitigating Missing Data**

### 1- Find missing values in the dataset
The isnull() method detects missing values and returns a Boolean object of the same shape as the data. Missing entries (None, NaN, or fields left empty in the CSV file, which read_csv loads as NaN) are mapped to True, and all other values are mapped to False.


In [4]:
import pandas as pd
df = pd.read_csv('/content/Prepared_Mall_Customers.csv')
df.isnull()

Unnamed: 0,ID,Sex,Age,Salary,Spending_Score,NewColumn,Customer_Satisfaction
0,False,False,False,False,True,False,False
1,False,False,True,False,False,False,False
2,False,False,False,False,False,False,False
3,False,False,False,True,False,False,False
4,False,False,True,True,False,False,False
...,...,...,...,...,...,...,...
195,False,False,False,False,False,False,False
196,False,False,False,False,False,False,False
197,False,False,False,False,False,False,False
198,False,False,False,False,False,False,False
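
The full Boolean table is hard to scan on a dataset of this size. As a minimal sketch on hypothetical toy data (the toy frame and its values are assumptions, not the mall customers file), the mask returned by isnull() can be used to pull out only the rows that contain at least one missing value:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not the mall customers file.
toy = pd.DataFrame({
    "Age": [19.0, np.nan, 20.0],
    "Sex": ["M", None, ""],  # None is missing; "" is an ordinary string
})

mask = toy.isnull()                      # same-shaped frame of booleans
rows_with_gaps = toy[mask.any(axis=1)]   # rows with at least one missing value

print(mask)
print(rows_with_gaps)
```

Note that an empty string already held in memory is not treated as missing; only values such as None and NaN are.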


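### 2- Find the number or percentage of missing values
To count the missing values in each column, use data_frame.isnull().sum(). Dividing the counts by the number of rows and multiplying by 100 gives the percentage instead (isna() is an alias of isnull()). In the example below, Age, Salary and Spending_Score are missing 1.5%, 1.0% and 0.5% of their values respectively, and every other column is complete.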


In [6]:
df.isna().sum()/len(df)*100

Unnamed: 0,0
ID,0.0
Sex,0.0
Age,1.5
Salary,1.0
Spending_Score,0.5
NewColumn,0.0
Customer_Satisfaction,0.0
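
Counts and percentages are often easier to read side by side. The following is a small sketch on hypothetical toy data (the column values are made up for illustration); it relies on the fact that the mean of a Boolean column is the fraction of True values:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 4 rows, with gaps in both columns.
toy = pd.DataFrame({
    "Age": [19.0, np.nan, 20.0, 31.0],
    "Salary": [15.0, 16.0, np.nan, np.nan],
})

missing_count = toy.isna().sum()        # number of missing values per column
missing_pct = toy.isna().mean() * 100   # mean of booleans = fraction missing

summary = pd.DataFrame({"count": missing_count, "percent": missing_pct})
print(summary)
```

Using mean() gives the same result as sum() divided by len(), without repeating the row count.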


### 3- Deal with missing values
One way to deal with missing values is to delete the records that contain them, an approach known as complete case analysis. Keep an eye on the proportion of records removed, because excessive removal can bias your analysis. As a rule of thumb, complete case analysis is considered acceptable when 5% or less of the data points are removed. To remove instances with missing data, use the dropna() function.

In [7]:
df_Complete_Case = df.dropna()
df_Complete_Case

Unnamed: 0,ID,Sex,Age,Salary,Spending_Score,NewColumn,Customer_Satisfaction
2,3,1,20.0,16.0,6.0,1,Unsatisfied
5,6,1,22.0,17.0,76.0,1,Satisfied
7,8,1,23.0,18.0,94.0,1,Satisfied
8,9,0,64.0,19.0,3.0,1,Unsatisfied
9,10,1,30.0,19.0,72.0,1,Satisfied
...,...,...,...,...,...,...,...
195,196,1,35.0,120.0,79.0,1,Satisfied
196,197,1,45.0,126.0,28.0,1,Unsatisfied
197,198,0,32.0,126.0,74.0,1,Satisfied
198,199,0,32.0,137.0,18.0,1,Unsatisfied
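
Before relying on the reduced dataset, it may help to check how many rows were actually removed. This is a sketch on hypothetical toy data; the MAX_DROPPED threshold simply encodes the 5% rule of thumb mentioned above and is not a pandas setting:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 40 rows, one of them incomplete.
toy = pd.DataFrame({
    "Age": [float(a) for a in range(20, 60)],
    "Salary": [float(s) for s in range(15, 55)],
})
toy.loc[3, "Salary"] = np.nan

MAX_DROPPED = 0.05  # assumed threshold: the 5% rule of thumb

complete = toy.dropna()  # keep only fully observed rows
dropped_fraction = (len(toy) - len(complete)) / len(toy)

if dropped_fraction <= MAX_DROPPED:
    print(f"Dropped {dropped_fraction:.1%} of rows; complete case analysis looks acceptable.")
else:
    print(f"Dropped {dropped_fraction:.1%} of rows; consider imputation instead.")
```

dropna() also accepts a subset argument if only some columns should trigger the removal of a row.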


### 4- Simple imputation with the mean, median or mode
Simple imputation of missing values with the mean, median or mode can also be used, under the same 5% condition as complete case analysis. To impute a variable with its mean, first calculate the mean of that variable, then fill the missing entries with it. You can impute a single variable at a time or several at once.

Mean_VariableName = data['Variable Name'].mean()
data['Variable Name'] = data['Variable Name'].fillna(Mean_VariableName)

In [10]:
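# Compute the mean of each variable that has missing values
Mean_Salary = df['Salary'].mean()
Mean_Spending_Score = df['Spending_Score'].mean()
Mean_Age = df['Age'].mean()
# Assign the result back to the column; fillna(..., inplace=True) on a
# selected column may not update df in recent pandas versions
df['Salary'] = df['Salary'].fillna(Mean_Salary)
df['Spending_Score'] = df['Spending_Score'].fillna(Mean_Spending_Score)
df['Age'] = df['Age'].fillna(Mean_Age)
# Percentage of missing values per column after imputation
df.isna().sum()/len(df)*100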


Unnamed: 0,0
ID,0.0
Sex,0.0
Age,0.0
Salary,0.0
Spending_Score,0.0
NewColumn,0.0
Customer_Satisfaction,0.0
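
The text mentions imputing several variables at once and using the mode. The following is one possible sketch on hypothetical toy data (the frame and the numeric_cols list are assumptions for illustration): numeric columns are filled with their means in a single assignment, and a categorical column is filled with its most frequent value.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with gaps in numeric and categorical columns.
toy = pd.DataFrame({
    "Age": [20.0, np.nan, 40.0],
    "Salary": [10.0, 20.0, np.nan],
    "Sex": ["F", None, "F"],
})

numeric_cols = ["Age", "Salary"]

# Mean imputation for several columns in one assignment: fillna accepts
# a Series and matches its index against the column names.
toy[numeric_cols] = toy[numeric_cols].fillna(toy[numeric_cols].mean())

# Mode imputation for a categorical column; mode() returns a Series
# because ties are possible, so take the first entry.
toy["Sex"] = toy["Sex"].fillna(toy["Sex"].mode()[0])

print(toy)
```

Replacing mean() with median() gives median imputation, which is less sensitive to outliers.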


Now save your clean dataset ready for processing and remember to download it to your local
machine.

In [11]:
df.to_csv(r'/content/Clean_Mall_Customers.csv', index=False)
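
The index=False argument matters when the file is read again later. As a hedged illustration on hypothetical toy data written to in-memory buffers, saving with the default index adds an extra unnamed column on reload, which is probably (an assumption, the source does not say) how the Unnamed: 0 column in the outputs above came about:

```python
import io

import pandas as pd

# Hypothetical toy data written to in-memory buffers instead of files.
toy = pd.DataFrame({"ID": [1, 2], "Age": [19.0, 21.0]})

with_index = io.StringIO()
toy.to_csv(with_index)                  # default index=True writes row labels
with_index.seek(0)

without_index = io.StringIO()
toy.to_csv(without_index, index=False)  # row labels are not written
without_index.seek(0)

cols_with_index = pd.read_csv(with_index).columns.tolist()
cols_without_index = pd.read_csv(without_index).columns.tolist()

print(cols_with_index)
print(cols_without_index)
```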