# 04 - Real World NumPy Usage

Now we apply NumPy to realistic data scenarios:

- Data cleaning & preprocessing
- Feature scaling
- Missing value handling
- Combining arrays
- Random sampling (critical for ML training/validation splits)


1) Generating a synthetic dataset (as in a typical ML problem)

In [3]:
import numpy as np

# Random dataset: 100 rows, 3 features
np.random.seed(42)
data = np.random.randn(100, 3) * 10 + 50

data[:10]


array([[54.96714153, 48.61735699, 56.47688538],
       [65.23029856, 47.65846625, 47.65863043],
       [65.79212816, 57.67434729, 45.30525614],
       [55.42560044, 45.36582307, 45.34270246],
       [52.41962272, 30.86719755, 32.75082167],
       [44.37712471, 39.8716888 , 53.14247333],
       [40.91975924, 35.87696299, 64.65648769],
       [47.742237  , 50.67528205, 35.75251814],
       [44.55617275, 51.1092259 , 38.49006423],
       [53.75698018, 43.9936131 , 47.0830625 ]])

You now have:

- 100 samples (rows)
- 3 features (columns)

Each feature is drawn from a normal distribution with mean 50 and standard deviation 10, which mimics real numeric data.
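The same kind of dataset can also be generated with NumPy's newer Generator API instead of the global seed. This is a minimal sketch, not a replacement for the cell above: the name data_rng is only illustrative, and because the Generator uses a different random stream, the values will not match the output shown earlier.

```python
import numpy as np

# Assumption: a local Generator is used instead of the global np.random.seed state.
rng = np.random.default_rng(42)

# Same shape and distribution as the cell above: mean 50, standard deviation 10.
data_rng = rng.normal(loc=50, scale=10, size=(100, 3))

print(data_rng.shape)         # (100, 3)
print(data_rng.mean(axis=0))  # each value should land near 50
print(data_rng.std(axis=0))   # each value should land near 10
```

Keeping the generator in a variable means the randomness is controlled locally, which helps when several parts of a notebook each need reproducible draws.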

2) Feature Scaling (Standardization)

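Standardization is a routine preprocessing step for many ML models, especially those sensitive to feature scale (gradient-based and distance-based methods):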

In [4]:
mean = data.mean(axis=0)
std = data.std(axis=0)

data_scaled = (data - mean) / std

print("Means after scaling:", data_scaled.mean(axis=0))
print("Std dev after scaling:", data_scaled.std(axis=0))


Means after scaling: [5.22359933e-16 9.99200722e-16 1.95177208e-15]
Std dev after scaling: [1. 1. 1.]


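The means are not exactly 0: values around 1e-16 are floating-point round-off.
The expression (data - mean) / std works because NumPy broadcasts the shape-(3,) mean and std across all 100 rows.
Make sure this clicks: it is core, and it comes up often in ML interviews.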
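The scaling step above can be wrapped in two small helpers so that the statistics are computed once and reused on new rows. This is a sketch under assumptions: fit_standardizer and apply_standardizer are hypothetical names, the 80/20 slice is only for illustration, and replacing a zero std with 1 is one possible way to handle constant columns.

```python
import numpy as np

# Hypothetical helpers for illustration; they are not part of NumPy.
def fit_standardizer(train):
    """Return per-column mean and std computed from the given rows."""
    mu = train.mean(axis=0)
    sigma = train.std(axis=0)
    # A constant column has std 0; use 1 so the division stays finite.
    sigma = np.where(sigma == 0, 1.0, sigma)
    return mu, sigma

def apply_standardizer(x, mu, sigma):
    """Scale x with previously computed statistics (broadcast across rows)."""
    return (x - mu) / sigma

rng = np.random.default_rng(0)
demo = rng.normal(loc=50, scale=10, size=(100, 3))
demo_train, demo_test = demo[:80], demo[80:]  # illustrative slice only

mu, sigma = fit_standardizer(demo_train)
train_scaled = apply_standardizer(demo_train, mu, sigma)
test_scaled = apply_standardizer(demo_test, mu, sigma)

print(train_scaled.mean(axis=0))  # very close to 0
print(test_scaled.mean(axis=0))   # near 0, but not exactly
```

Reusing the training statistics on held-out rows keeps information from those rows out of the preprocessing step.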

3) Min-Max Normalization (common for neural network inputs)

In [5]:
min_val = data.min(axis=0)
max_val = data.max(axis=0)

data_norm = (data - min_val) / (max_val - min_val)

data_norm[:5]


array([[0.59784738, 0.55850332, 0.50481969],
       [0.84841079, 0.54124443, 0.36857733],
       [0.86212723, 0.72151827, 0.33221761],
       [0.60904014, 0.49997961, 0.33279616],
       [0.53565259, 0.23902174, 0.13825114]])

Difference:

- Standardization → mean 0, std 1 per feature; values are unbounded (a common default for many ML models)
- Min-Max → range [0, 1] per feature; sensitive to outliers (common for neural network inputs)
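The Min-Max formula above divides by (max - min), which is 0 for a constant column. A minimal sketch of a guarded version follows. The function name min_max_scale and the low/high parameters are assumptions made for this example, not an existing NumPy API.

```python
import numpy as np

# Hypothetical helper for illustration; it is not part of NumPy.
def min_max_scale(x, low=0.0, high=1.0):
    """Rescale each column of x linearly into [low, high]."""
    col_min = x.min(axis=0)
    col_range = x.max(axis=0) - col_min
    # A constant column has range 0; use 1 to avoid dividing by zero.
    col_range = np.where(col_range == 0, 1.0, col_range)
    unit = (x - col_min) / col_range
    return unit * (high - low) + low

sample = np.array([[1.0, 10.0, 5.0],
                   [2.0, 20.0, 5.0],
                   [3.0, 40.0, 5.0]])

print(min_max_scale(sample))                      # columns scaled to [0, 1]
print(min_max_scale(sample, low=-1.0, high=1.0))  # columns scaled to [-1, 1]
```

With this guard, the constant third column maps to low instead of producing nan.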

4) Handling Missing Data

Simulate missing values:

In [6]:
data_missing = data.copy()
data_missing[0:5, 1] = np.nan  # set the first 5 rows of feature 1 to NaN

print(data_missing[:10])


[[54.96714153         nan 56.47688538]
 [65.23029856         nan 47.65863043]
 [65.79212816         nan 45.30525614]
 [55.42560044         nan 45.34270246]
 [52.41962272         nan 32.75082167]
 [44.37712471 39.8716888  53.14247333]
 [40.91975924 35.87696299 64.65648769]
 [47.742237   50.67528205 35.75251814]
 [44.55617275 51.1092259  38.49006423]
 [53.75698018 43.9936131  47.0830625 ]]


Replace missing values with column mean:

In [7]:
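col_mean = np.nanmean(data_missing, axis=0)  # per-column mean, ignoring NaN
idxs = np.where(np.isnan(data_missing))  # (row indices, column indices) of the NaNs
data_missing[idxs] = np.take(col_mean, idxs[1])  # fill each NaN with its column's mean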


This gives the same result as pandas df.fillna(df.mean()) on numeric columns.
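The same mean imputation can be written without index arrays by letting np.where broadcast the column means. This is a sketch of an alternative, not the only approach: impute_column_mean is a hypothetical name, and unlike the cell above it returns a new array and leaves its input unchanged.

```python
import numpy as np

# Hypothetical helper for illustration; it is not part of NumPy.
def impute_column_mean(x):
    """Return a copy of x with each NaN replaced by its column's mean."""
    col_mean = np.nanmean(x, axis=0)
    # col_mean has shape (n_cols,) and broadcasts against every row of x.
    return np.where(np.isnan(x), col_mean, x)

sample_nan = np.array([[1.0, np.nan],
                       [3.0, 4.0],
                       [np.nan, 8.0]])

filled = impute_column_mean(sample_nan)
print(filled)
print(np.isnan(filled).any())  # False
```

If a column contains only NaN, np.nanmean returns NaN for it (with a warning), so such columns need a separate decision, for example dropping them.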

5) Train-Test Split (without sklearn)

In [8]:
np.random.seed(42)
indices = np.random.permutation(len(data))

train_size = int(0.8 * len(data))
train_idx = indices[:train_size]
test_idx = indices[train_size:]

train_data = data[train_idx]
test_data = data[test_idx]

train_data.shape, test_data.shape


((80, 3), (20, 3))

This is the core idea behind sklearn's train_test_split: shuffle the row indices once, then slice them into two non-overlapping groups.
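In practice the features usually come with labels, and both must be shuffled with the same indices. The sketch below shows one way to do that. The function split_train_test, its parameters, and the demo arrays are hypothetical, and it does not claim to reproduce sklearn's implementation.

```python
import numpy as np

# Hypothetical helper for illustration; it does not mirror sklearn's code.
def split_train_test(X, y, test_fraction=0.2, seed=42):
    """Shuffle row indices once, then index X and y with the same split."""
    if len(X) != len(y):
        raise ValueError("X and y must have the same number of rows")
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(X))
    n_test = int(round(test_fraction * len(X)))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]

X_demo = np.arange(300, dtype=float).reshape(100, 3)  # row i starts with 3 * i
y_demo = np.arange(100)                               # label i belongs to row i

X_train, X_test, y_train, y_test = split_train_test(X_demo, y_demo)
print(X_train.shape, X_test.shape)  # (80, 3) (20, 3)
# Rows and labels stay aligned after shuffling.
print(np.array_equal(X_train[:, 0], 3 * y_train))  # True
```

Using one permutation for both arrays is what keeps each row paired with its own label.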