# Pre-Processing Data Example

This notebook walks through common **data preprocessing** steps before building a model: loading data, **cleaning** (handling missing values), **encoding** categorical variables (label and one-hot encoding), train/test splitting, and **feature scaling** (standardization). The dataset has country, age, salary, and a binary target (e.g. purchased).

## Imports

We use **NumPy** for arrays, **Pandas** for loading and slicing the CSV, **Matplotlib** for plotting (available for later use), and **scikit-learn** for imputation, encoding, train/test split, and scaling.

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

## Load features (X)

Read the CSV with `pd.read_csv`, then take all columns **except the last** as the feature matrix `X` using `iloc[:, :-1]`. `.values` converts the DataFrame to a NumPy array. Because the columns mix strings and numbers, the array has `dtype=object`. The result holds country, age, and salary, with one `nan` in each numeric column.

In [2]:
dataset = pd.read_csv('Data.csv')
X = dataset.iloc[:, :-1].values
X

array([['France', 44.0, 72000.0],
       ['Spain', 27.0, 48000.0],
       ['Germany', 30.0, 54000.0],
       ['Spain', 38.0, 61000.0],
       ['Germany', 40.0, nan],
       ['France', 35.0, 58000.0],
       ['Spain', nan, 52000.0],
       ['France', 48.0, 79000.0],
       ['Germany', 50.0, 83000.0],
       ['France', 37.0, 67000.0]], dtype=object)
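
Before cleaning, it can help to count the missing entries rather than spot them by eye. The sketch below rebuilds the feature columns shown above in a DataFrame so it runs without `Data.csv`; the column names `Country`, `Age`, and `Salary` are assumed for illustration.

```python
import numpy as np
import pandas as pd

# Rebuild the feature columns shown above so the example runs without Data.csv.
# The column names are assumed for illustration.
features = pd.DataFrame({
    "Country": ["France", "Spain", "Germany", "Spain", "Germany",
                "France", "Spain", "France", "Germany", "France"],
    "Age": [44, 27, 30, 38, 40, 35, np.nan, 48, 50, 37],
    "Salary": [72000, 48000, 54000, 61000, np.nan,
               58000, 52000, 79000, 83000, 67000],
})

# Count missing entries per column.
missing_per_column = features.isna().sum()
print(missing_per_column)

# Rows that contain at least one missing value.
rows_with_missing = features[features.isna().any(axis=1)]
print(rows_with_missing)
```

On this data the check reports one missing age and one missing salary, in rows 6 and 4 respectively.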

## Load target (Y)

The **target** (dependent variable) is the last column, e.g. "Purchased" (Yes/No). We extract it with `iloc[:, 3]` and store it as `Y` for later encoding and train/test split.

In [3]:
Y = dataset.iloc[:, 3].values
Y

array(['No', 'Yes', 'No', 'No', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes'],
      dtype=object)
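
The slicing used for `X` and `Y` can be tried on a small stand-in table. This is a sketch, not the notebook's data: the three-row DataFrame and its column names are assumed. It also shows that `iloc[:, -1]` selects the same last column as `iloc[:, 3]` without hard-coding the position.

```python
import pandas as pd

# A three-row stand-in for Data.csv; the column names are assumed.
dataset = pd.DataFrame({
    "Country": ["France", "Spain", "Germany"],
    "Age": [44.0, 27.0, 30.0],
    "Salary": [72000.0, 48000.0, 54000.0],
    "Purchased": ["No", "Yes", "No"],
})

X = dataset.iloc[:, :-1].values           # every column except the last
Y_by_index = dataset.iloc[:, 3].values    # last column, hard-coded position
Y_by_offset = dataset.iloc[:, -1].values  # last column, whatever the width

print(X.shape)
print(Y_by_index, Y_by_offset)
```

The negative offset keeps working if feature columns are added or removed later, whereas the fixed index `3` would then point at a different column.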

## Cleaning: Handling missing data

**Technique: Imputation (mean strategy).** Missing values (`nan`) in numeric columns can break many models. Here we use scikit-learn's **SimpleImputer** with `strategy='mean'`: `fit` computes the mean of each column over its non-missing values, and `transform` replaces each missing value with that mean. We apply it only to the numeric columns (indices 1 and 2: age and salary), which the slice `X[:, 1:3]` selects. This is a simple, common **cleaning** step for numeric data.

In [4]:
# handling missing data
from sklearn.impute import SimpleImputer
# strategy='mean' replaces each missing value with the mean of its column,
# computed over the non-missing entries
imputer = SimpleImputer(missing_values = np.nan, strategy = 'mean')
imputer = imputer.fit(X[:, 1:3])
# replace the values in X
X[:, 1:3] = imputer.transform(X[:, 1:3])
X

array([['France', 44.0, 72000.0],
       ['Spain', 27.0, 48000.0],
       ['Germany', 30.0, 54000.0],
       ['Spain', 38.0, 61000.0],
       ['Germany', 40.0, 63777.77777777778],
       ['France', 35.0, 58000.0],
       ['Spain', 38.77777777777778, 52000.0],
       ['France', 48.0, 79000.0],
       ['Germany', 50.0, 83000.0],
       ['France', 37.0, 67000.0]], dtype=object)
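
The imputed values can be checked by hand. The sketch below uses plain NumPy rather than `SimpleImputer`, as an illustration of the same mean strategy: `np.nanmean` averages the non-missing entries, and `np.where` fills the gaps.

```python
import numpy as np

# Age and salary columns as loaded above, with the two missing entries.
age = np.array([44, 27, 30, 38, 40, 35, np.nan, 48, 50, 37], dtype=float)
salary = np.array([72000, 48000, 54000, 61000, np.nan,
                   58000, 52000, 79000, 83000, 67000], dtype=float)

# np.nanmean ignores nan, which mirrors the mean strategy.
age_mean = np.nanmean(age)
salary_mean = np.nanmean(salary)

# Fill the gaps with the column means.
age_filled = np.where(np.isnan(age), age_mean, age)
salary_filled = np.where(np.isnan(salary), salary_mean, salary)

print(age_mean, salary_mean)
```

The two means are 349/9 and 574000/9, which agree with the 38.78 and 63777.78 that appear in the imputed array.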

## Encoding: Categorical features (LabelEncoder)

**Technique: Label encoding.** The first column (country) is categorical (France, Spain, Germany). **LabelEncoder** converts each category to an integer (here France=0, Germany=1, Spain=2). This suits the **target** variable, but for **features** the integers imply an order (e.g. Spain > Germany > France) that nominal categories do not have. The next step uses **one-hot encoding** for the country feature to avoid that.

In [5]:
# categorical data
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
# LabelEncoder is meant for the dependent variable; it is applied to a
# feature here only to show the problem it causes
labelencoder_X = LabelEncoder()
X[:, 0] = labelencoder_X.fit_transform(X[:, 0])
print(X)
# the codes 0, 1, 2 suggest an order, but France is not "less than"
# Germany or Spain, so the next step uses a different encoding

[[0 44.0 72000.0]
 [2 27.0 48000.0]
 [1 30.0 54000.0]
 [2 38.0 61000.0]
 [1 40.0 63777.77777777778]
 [0 35.0 58000.0]
 [2 38.77777777777778 52000.0]
 [0 48.0 79000.0]
 [1 50.0 83000.0]
 [0 37.0 67000.0]]
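
To see which integer each country received, inspect the fitted encoder's `classes_` attribute. A minimal sketch on the country column alone; the `mapping` dictionary is a helper built here for display, not something the encoder provides.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

countries = np.array(["France", "Spain", "Germany", "Spain", "Germany",
                      "France", "Spain", "France", "Germany", "France"])

encoder = LabelEncoder()
codes = encoder.fit_transform(countries)

# classes_ holds the categories in sorted order; the position is the code.
mapping = {str(label): int(code) for code, label in enumerate(encoder.classes_)}
print(mapping)
print(codes)
```

The codes follow alphabetical order of the labels, which confirms that the numbering carries no meaning about the countries themselves.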


## Encoding: One-hot encoding (dummy variables)

**Technique: One-hot encoding.** To avoid implying an order between countries, we replace the single label-encoded column with **dummy variables**: one binary column per category (France, Germany, Spain). **OneHotEncoder** with `sparse_output=False` (the parameter name in scikit-learn 1.2 and later; earlier releases call it `sparse`) returns a dense array of 0/1 columns. We then replace the first column of `X` with these new columns using `np.column_stack` and keep the numeric columns (age, salary). This is standard for nominal categorical features in regression/classification.

In [6]:
# One hot encoding means creating dummy variables
onehotencoder = OneHotEncoder(categories='auto', sparse_output=False)
country_encoded = onehotencoder.fit_transform(X[:, 0:1])
X = np.column_stack((country_encoded, X[:, 1:]))
X

array([[1.0, 0.0, 0.0, 44.0, 72000.0],
       [0.0, 0.0, 1.0, 27.0, 48000.0],
       [0.0, 1.0, 0.0, 30.0, 54000.0],
       [0.0, 0.0, 1.0, 38.0, 61000.0],
       [0.0, 1.0, 0.0, 40.0, 63777.77777777778],
       [1.0, 0.0, 0.0, 35.0, 58000.0],
       [0.0, 0.0, 1.0, 38.77777777777778, 52000.0],
       [1.0, 0.0, 0.0, 48.0, 79000.0],
       [0.0, 1.0, 0.0, 50.0, 83000.0],
       [1.0, 0.0, 0.0, 37.0, 67000.0]], dtype=object)

## Encoding: Target variable (Y)

For a **binary target** (e.g. Yes/No), **LabelEncoder** is appropriate: it converts labels to 0 and 1. We fit and transform `Y` so that the model receives numeric targets (e.g. 0 = No, 1 = Yes).

In [7]:
label_encoded = LabelEncoder()
Y = label_encoded.fit_transform(Y)
Y

array([0, 1, 0, 0, 1, 1, 0, 1, 0, 1])
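
Keeping the fitted encoder around makes it possible to turn numeric predictions back into the original labels. A sketch, assuming a model that outputs the codes in `predicted_codes`; those three values are made up purely to show the decoding step.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

Y_raw = np.array(["No", "Yes", "No", "No", "Yes",
                  "Yes", "No", "Yes", "No", "Yes"])

target_encoder = LabelEncoder()
Y_encoded = target_encoder.fit_transform(Y_raw)

# Hypothetical model output, used only to show the decoding step.
predicted_codes = np.array([1, 0, 1])
predicted_labels = target_encoder.inverse_transform(predicted_codes)

print(list(target_encoder.classes_))
print(predicted_labels)
```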

## Train/test split

**Technique: Holdout split.** We split the data into training and test sets with **train_test_split** so we can evaluate the model on unseen data. Here `test_size=0.2` keeps 20% for testing and 80% for training; `random_state=0` makes the split reproducible.

In [9]:
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.2, random_state = 0)
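
The effect of `test_size` and `random_state` can be checked on placeholder data. In this sketch `X_demo` and `Y_demo` are stand-ins with the same number of rows as the notebook's data, not the real features.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Ten placeholder rows standing in for the preprocessed X and Y.
X_demo = np.arange(20).reshape(10, 2)
Y_demo = np.array([0, 1, 0, 0, 1, 1, 0, 1, 0, 1])

first = train_test_split(X_demo, Y_demo, test_size=0.2, random_state=0)
second = train_test_split(X_demo, Y_demo, test_size=0.2, random_state=0)

X_train_demo, X_test_demo, Y_train_demo, Y_test_demo = first
print(X_train_demo.shape, X_test_demo.shape)

# The same seed yields the same partition on every call.
same_split = all(np.array_equal(a, b) for a, b in zip(first, second))
print(same_split)
```

With ten rows, `test_size=0.2` leaves eight rows for training and two for testing.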

## Inspect train and test sets

Display `X_train` and `X_test` to confirm the split and that features (one-hot country, age, salary) look correct before scaling.

In [10]:
X_train

array([[0.0, 1.0, 0.0, 40.0, 63777.77777777778],
       [1.0, 0.0, 0.0, 37.0, 67000.0],
       [0.0, 0.0, 1.0, 27.0, 48000.0],
       [0.0, 0.0, 1.0, 38.77777777777778, 52000.0],
       [1.0, 0.0, 0.0, 48.0, 79000.0],
       [0.0, 0.0, 1.0, 38.0, 61000.0],
       [1.0, 0.0, 0.0, 44.0, 72000.0],
       [1.0, 0.0, 0.0, 35.0, 58000.0]], dtype=object)

In [11]:
X_test

array([[0.0, 1.0, 0.0, 30.0, 54000.0],
       [0.0, 1.0, 0.0, 50.0, 83000.0]], dtype=object)

**Standardization** (z-score) formula:

$x_{std} = \frac{x - \text{mean}(x)}{\text{std}(x)}$

Each feature is scaled to mean 0 and standard deviation 1.

**Normalization** (min-max scaling) formula:

$x_{norm} = \frac{x - \min(x)}{\max(x) - \min(x)}$

Values are scaled to the range [0, 1]. This notebook uses **StandardScaler** (standardization) below; use **MinMaxScaler** if you prefer this range.
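
Both formulas can be applied by hand with NumPy. The sketch below uses the age values from the training rows shown earlier; it illustrates the arithmetic and is not a replacement for the scikit-learn scalers.

```python
import numpy as np

# Ages from the training rows shown above.
age = np.array([40.0, 37.0, 27.0, 38.77777777777778,
                48.0, 38.0, 44.0, 35.0])

# Standardization: subtract the mean, divide by the standard deviation.
age_std = (age - age.mean()) / age.std()

# Normalization: shift by the minimum, divide by the range.
age_norm = (age - age.min()) / (age.max() - age.min())

print(age_std.round(3))
print(age_norm.round(3))
```

After standardization the values have mean 0 and standard deviation 1; after normalization the smallest age maps to 0 and the largest to 1.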

## Feature scaling: Standardization

**Technique: Standardization (z-score).** Many algorithms (e.g. SVM, gradient-based methods) work better when features are on a similar scale. **StandardScaler** transforms each feature to have mean 0 and standard deviation 1 using the formula above. We **fit** the scaler on the **training set only** (`fit_transform(X_train)`), then **transform** the test set with the same learned mean and std (`transform(X_test)`). Never fit the scaler on the test set: that would leak test-set statistics into preprocessing. Note that the scaler is applied to every column, so the one-hot country columns are rescaled as well.

In [12]:
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)
# a cell displays only its last expression, so only X_test is shown
X_test

array([[-1.        ,  2.64575131, -0.77459667, -1.45882927, -0.90166297],
       [-1.        ,  2.64575131, -0.77459667,  1.98496442,  2.13981082]])
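
The age column of the scaled test set can be reproduced by hand, which makes the "fit on train, transform test" rule concrete. This sketch uses plain NumPy with the training and test ages shown earlier; `StandardScaler` divides by the population standard deviation (`ddof=0`), which is also NumPy's default.

```python
import numpy as np

# Age column of the training and test rows shown earlier.
age_train = np.array([40.0, 37.0, 27.0, 38.77777777777778,
                      48.0, 38.0, 44.0, 35.0])
age_test = np.array([30.0, 50.0])

# Statistics come from the training rows only.
train_mean = age_train.mean()
train_std = age_train.std()  # population std (ddof=0), as StandardScaler uses

age_test_scaled = (age_test - train_mean) / train_std
print(age_test_scaled)
```

The result agrees with the fourth column of the scaled `X_test` above (about -1.459 and 1.985). The test values fall outside the range a scaler fitted on them would produce, because the mean and standard deviation describe the training rows.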