***Data cleaning is a crucial step in the data preprocessing pipeline***, involving the identification and handling of missing values, outliers, duplicates, and categorical data. NumPy, Python's core library for numerical computing, provides efficient array operations for these tasks. In this article, we'll explore several data cleaning techniques using NumPy: masking missing values with boolean arrays, capping outliers, removing duplicates, converting categorical data to one-hot encoding, and normalizing data.



In [2]:
import numpy as np
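arr = np.array([1,2,np.nan,4,5,np.nan])

# Boolean mask: True where the value is NaN
print(np.isnan(arr))
# Use the mask to select only the missing entries
print(arr[np.isnan(arr)])
# Invert the mask with ~ to keep only the valid entries
print(arr[~np.isnan(arr)])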

[False False  True False False  True]
[nan nan]
[1. 2. 4. 5.]
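
Dropping missing entries changes the length of the array, which is not always acceptable. As a minimal sketch of one alternative, the same boolean mask can be used to fill the gaps instead. Filling with the mean of the valid values is an assumption made for illustration here, not a recommendation for every dataset.

```python
import numpy as np

arr = np.array([1, 2, np.nan, 4, 5, np.nan])

# Mean of the non-missing values only (np.mean would return nan here)
fill_value = np.nanmean(arr)

# Replace each NaN with the fill value; all other entries are kept as-is
filled = np.where(np.isnan(arr), fill_value, arr)
print(filled)
```

`np.where` returns a new array, so the original `arr` still contains its NaN entries.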


In [4]:
arr = np.array([1,2,3,4,5,6,7,8,9])


# Cap outliers: values above 7 become 7, values below 3 become 3

arr[arr > 7] = 7
arr[arr < 3] = 3
print(arr)

[3 3 3 4 5 6 7 7 7]
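
The two masked assignments above modify `arr` in place. A possible alternative, sketched below, is `np.clip`, which applies both bounds in one call and returns a new array. The percentile-based bounds in the second half are an illustrative assumption; the 10th and 90th percentiles are an arbitrary choice, not a rule.

```python
import numpy as np

arr = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9])

# np.clip returns a new array and leaves the input untouched
clipped = np.clip(arr, 3, 7)
print(clipped)

# Bounds can also be derived from the data itself
# (10th and 90th percentiles chosen only for illustration)
low, high = np.percentile(arr, [10, 90])
clipped_pct = np.clip(arr, low, high)
print(clipped_pct)
```

Because the percentile bounds are floats, `clipped_pct` is a float array even though `arr` holds integers.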


In [6]:
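arr = np.array([1,2,2,2,3,3,4,5,6,7,8,9])
print(arr)

# np.unique drops duplicates and returns the values sorted
unique_arr = np.unique(arr)
print(unique_arr)

# A Python set also drops duplicates, but it is unordered and is not a NumPy array
print(set(arr))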

[1 2 2 2 3 3 4 5 6 7 8 9]
[1 2 3 4 5 6 7 8 9]
{1, 2, 3, 4, 5, 6, 7, 8, 9}
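
`np.unique` always sorts its result, which loses the original order of the data. The sketch below shows one way to count how often each value occurs and to keep the values in first-seen order, using the `return_index` and `return_counts` options. The unsorted sample array is made up for illustration.

```python
import numpy as np

arr = np.array([3, 1, 2, 2, 3, 1, 5])

# values: sorted unique values
# first_idx: index of the first occurrence of each unique value
# counts: number of times each unique value appears
values, first_idx, counts = np.unique(arr, return_index=True, return_counts=True)
print(values)
print(counts)

# Sorting the first-occurrence indices restores the original order
ordered_unique = arr[np.sort(first_idx)]
print(ordered_unique)
```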


In [8]:
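categorical_arr = np.array(['A','B','C','D','E'])

# Integer code for each category: A -> 0, B -> 1, ..., E -> 4
numerical_arr = np.array([0,1,2,3,4])

print(categorical_arr)
# One-hot encoding: row i of the 5x5 identity matrix is the vector for code i
one_hot_arr = np.eye(5)[numerical_arr]
print(one_hot_arr)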

['A' 'B' 'C' 'D' 'E']
[[1. 0. 0. 0. 0.]
 [0. 1. 0. 0. 0.]
 [0. 0. 1. 0. 0.]
 [0. 0. 0. 1. 0.]
 [0. 0. 0. 0. 1.]]
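
In the cell above the integer codes are written out by hand. As a hedged sketch, they can instead be derived from the labels with `np.unique(..., return_inverse=True)`, which also handles repeated and unsorted labels. The sample `labels` array is an assumption made for illustration.

```python
import numpy as np

labels = np.array(['B', 'A', 'C', 'A', 'B'])

# categories: sorted unique labels
# codes: for each label, its position in categories
categories, codes = np.unique(labels, return_inverse=True)
print(categories)
print(codes)

# One row per label, one column per category
one_hot = np.eye(len(categories), dtype=int)[codes]
print(one_hot)
```

The column order follows `categories`, so keeping that array around makes it possible to map a column back to its label.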


In [10]:
arr = np.array([10000,200000,300000,4000000,5000000])


# Min-max normalization: rescale the values to the range [0, 1]

normalized_arr = (arr - arr.min()) / (arr.max() - arr.min())
print(normalized_arr)

[0.         0.03807615 0.05811623 0.7995992  1.        ]
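
The min-max formula divides by `arr.max() - arr.min()`, which is zero when every value in the array is the same. The sketch below wraps the formula in a hypothetical helper, `min_max_normalize`, that guards against this case. Returning all zeros for constant input is an assumption chosen for illustration; other conventions are possible.

```python
import numpy as np

def min_max_normalize(x):
    """Rescale x to [0, 1]. Hypothetical helper: returns zeros for constant input."""
    x = np.asarray(x, dtype=float)
    span = x.max() - x.min()
    if span == 0:
        # All values are equal, so there is no range to rescale
        return np.zeros_like(x)
    return (x - x.min()) / span

print(min_max_normalize([10000, 200000, 300000, 4000000, 5000000]))
print(min_max_normalize([7, 7, 7]))
```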
