## Data Preprocessing
Preprocessing is a critical step in machine learning. Raw data is rarely clean or on a consistent scale, and many algorithms train faster and perform better when features are rescaled or transformed first.



In [1]:
from pandas import read_csv

# Load CSV using Pandas
url = 'https://raw.githubusercontent.com/erojaso/MLMasteryEndToEnd/master/data/pima-indians-diabetes.data.csv'
column_names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']

data = read_csv(url, names=column_names)

# Convert to NumPy array and split input/output
array = data.values
Input = array[:, 0:8]   # Input features
Output = array[:, 8]    # Target column

## Min-Max Scaling (Normalization to [0,1])
Min-Max scaling rescales each feature to a given range, here [0, 1]. This helps algorithms that are sensitive to feature scale (e.g., k-NN, neural networks).

In [2]:
from sklearn.preprocessing import MinMaxScaler
from numpy import set_printoptions

scaler = MinMaxScaler(feature_range=(0, 1))
rescaledX = scaler.fit_transform(Input)

# Set printing precision
set_printoptions(precision=3)
print(rescaledX[0:5, :])  # Show first 5 rows

[[0.353 0.744 0.59  0.354 0.    0.501 0.234 0.483]
 [0.059 0.427 0.541 0.293 0.    0.396 0.117 0.167]
 [0.471 0.92  0.525 0.    0.    0.347 0.254 0.183]
 [0.059 0.447 0.541 0.232 0.111 0.419 0.038 0.   ]
 [0.    0.688 0.328 0.354 0.199 0.642 0.944 0.2  ]]
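
To make the rescaling concrete, the sketch below applies the min-max formula `(x - min) / (max - min)` column by column using NumPy only. The toy array `X` and the names `col_min`, `col_max`, and `X_scaled` are illustrative assumptions, not part of the dataset above.

```python
import numpy as np

# Toy data (hypothetical values, not taken from the diabetes dataset)
X = np.array([[1.0, 10.0],
              [2.0, 30.0],
              [4.0, 50.0]])

# Per-column minimum and maximum
col_min = X.min(axis=0)
col_max = X.max(axis=0)

# Min-max formula: each column now spans exactly [0, 1]
X_scaled = (X - col_min) / (col_max - col_min)

print(X_scaled)
```

Note that this bare formula would divide by zero for a constant column, so a real pipeline needs a guard for that case.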


## Standardization (Z-score Normalization)
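This transforms each feature to have mean 0 and standard deviation 1. It is useful for algorithms that are sensitive to feature scale or that work best with centered inputs (e.g., logistic regression, SVM). Standardization shifts and rescales the data but does not change the shape of its distribution, so it does not make non-Gaussian data Gaussian.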

In [3]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler().fit(Input)
standardizedX = scaler.transform(Input)

set_printoptions(precision=3)
print(standardizedX[0:5, :])

[[ 0.64   0.848  0.15   0.907 -0.693  0.204  0.468  1.426]
 [-0.845 -1.123 -0.161  0.531 -0.693 -0.684 -0.365 -0.191]
 [ 1.234  1.944 -0.264 -1.288 -0.693 -1.103  0.604 -0.106]
 [-0.845 -0.998 -0.161  0.155  0.123 -0.494 -0.921 -1.042]
 [-1.142  0.504 -1.505  0.907  0.766  1.41   5.485 -0.02 ]]
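
As a rough illustration of the z-score arithmetic, the sketch below subtracts each column's mean and divides by its population standard deviation (`ddof=0`, NumPy's default). The toy array and variable names are assumptions made for the example.

```python
import numpy as np

# Toy data (hypothetical values, not taken from the diabetes dataset)
X = np.array([[1.0, 10.0],
              [2.0, 30.0],
              [3.0, 50.0]])

# Per-column mean and population standard deviation (ddof=0)
mean = X.mean(axis=0)
std = X.std(axis=0)

# Z-score: center each column, then scale it to unit variance
X_std = (X - mean) / std

print(X_std)
print(X_std.mean(axis=0))  # approximately 0 for every column
print(X_std.std(axis=0))   # approximately 1 for every column
```

Both columns end up with identical standardized values here because they are linear rescalings of each other, which shows that the original units no longer matter after the transform.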


## Normalization (Vector Length = 1)
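This scales each sample (row) to unit norm; scikit-learn's `Normalizer` uses the L2 norm by default. It is helpful for sparse data and for methods that compare samples by direction rather than magnitude, as in text classification and clustering.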

In [4]:
from sklearn.preprocessing import Normalizer

scaler = Normalizer().fit(Input)
normalizedInput = scaler.transform(Input)

set_printoptions(precision=3)
print(normalizedInput[0:5, :])

[[0.034 0.828 0.403 0.196 0.    0.188 0.004 0.28 ]
 [0.008 0.716 0.556 0.244 0.    0.224 0.003 0.261]
 [0.04  0.924 0.323 0.    0.    0.118 0.003 0.162]
 [0.007 0.588 0.436 0.152 0.622 0.186 0.001 0.139]
 [0.    0.596 0.174 0.152 0.731 0.188 0.01  0.144]]
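
The row-wise scaling can be sketched by dividing each row by its own Euclidean (L2) length. The toy array and the names `row_norms` and `X_unit` are illustrative assumptions.

```python
import numpy as np

# Toy data (hypothetical values); rows 0 and 2 point in the same direction
X = np.array([[3.0, 4.0],
              [1.0, 0.0],
              [6.0, 8.0]])

# L2 norm of each row, kept as a column so it broadcasts across features
row_norms = np.linalg.norm(X, axis=1, keepdims=True)

# Every row now has length 1
X_unit = X / row_norms

print(X_unit)
```

Rows 0 and 2 become identical after the transform, which illustrates that this kind of normalization keeps direction and discards magnitude. An all-zero row would divide by zero in this sketch and needs separate handling.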


## Binarization
Binarization applies a threshold: values strictly greater than the threshold become 1, and all other values become 0. This is useful for turning numeric features into binary indicators, for example "present" versus "absent".

In [5]:
from sklearn.preprocessing import Binarizer

binarizer = Binarizer(threshold=0.0).fit(Input)
binaryX = binarizer.transform(Input)

set_printoptions(precision=3)
print(binaryX[0:5, :])

[[1. 1. 1. 1. 0. 1. 1. 1.]
 [1. 1. 1. 1. 0. 1. 1. 1.]
 [1. 1. 1. 0. 0. 1. 1. 1.]
 [1. 1. 1. 1. 1. 1. 1. 1.]
 [0. 1. 1. 1. 1. 1. 1. 1.]]
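
The thresholding rule amounts to a single comparison, as the NumPy sketch below suggests. The toy array, the `threshold` variable, and `X_bin` are assumptions for illustration only.

```python
import numpy as np

# Toy data (hypothetical values), including zeros and a negative entry
X = np.array([[0.0, 2.5, -1.0],
              [3.0, 0.0, 0.5]])

threshold = 0.0

# Strictly greater than the threshold -> 1.0, everything else -> 0.0
X_bin = (X > threshold).astype(float)

print(X_bin)
```

Values equal to the threshold map to 0, which is why the zero entries in the dataset output above stay 0.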
