# Data Preprocessing - Boilerplate

### by ReDay Zarra

This is **simply a template to get you started** with data preprocessing. The libraries are all ready to go; you just have to supply the right parameters. I will leave hints on how to configure the code to your needs, so you can use this as a reference or as the start of a new project.

If you would like to **learn about data preprocessing**, please visit: [Data Preprocessing Theory](https://www.redaysblog.com/machine-learning/data-preprocessing)

If you want an **explanation for the code** in detail, please visit: [Data Preprocessing - Code](https://github.com/redayzarra/ml-data-preprocessing/blob/master/Data_Preprocessing.ipynb)

## Importing libraries

Import the libraries and modules needed to start preprocessing the data. **Pandas** is a library for data frame manipulation. **NumPy** is a package for numerical computing.

In [1]:
import numpy as np
import pandas as pd

## Importing data

To load the data correctly, make sure the **name of your data file matches** the one inside the .read_csv() function. I am also assuming that your **dependent variable**, the thing you want to predict, **is in the last column**. If that is not the case, change the "-1" to the index of your dependent variable and adjust the feature slice so that it covers the remaining columns.

In [None]:
# Load the dataset
dataset = pd.read_csv('YOUR_DATA.csv')

# X are features, y is the dependent variable.
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, -1].values
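To see what the split above produces, here is a minimal sketch that reads a tiny in-memory CSV instead of a file. The column names and values are made up for illustration, and the `demo` names are hypothetical stand-ins for your own data.

```python
import io
import pandas as pd

# Hypothetical data standing in for YOUR_DATA.csv
csv_text = """Country,Age,Salary,Purchased
France,44,72000,No
Spain,27,48000,Yes
Germany,30,54000,No
"""
demo = pd.read_csv(io.StringIO(csv_text))

# Every column except the last one becomes a feature
X_demo = demo.iloc[:, :-1].values
# The last column is the dependent variable
y_demo = demo.iloc[:, -1].values

print(X_demo.shape)  # (3, 3): three rows, three feature columns
print(y_demo)
```

If the dependent variable sat in another column, both slices would need to change so that the features still exclude it.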

## Addressing missing data

In [None]:
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values = np.nan, strategy = 'mean')

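Make sure to **change the "0:1" in the brackets to the range of columns** you want to fill missing values in. The imputer expects a 2-D array, so the columns are selected with a slice: "0:1" applies the mean imputer to the first column only, while "1:3" would cover the second and third columns. The mean strategy only works on numeric columns.

In [None]:
# Learn the column means, then replace the missing values with them
imputer.fit(X[:, 0:1])
X[:, 0:1] = imputer.transform(X[:, 0:1])
print(X)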
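As a small worked example of mean imputation, the sketch below uses a made-up numeric array with one missing value per column. The array and the `demo` names are assumptions for illustration only.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical numeric features with one missing value in each column
X_demo = np.array([[1.0, 10.0],
                   [np.nan, 20.0],
                   [3.0, np.nan]])

imputer_demo = SimpleImputer(missing_values=np.nan, strategy='mean')

# 0:2 selects columns 0 and 1 and keeps the 2-D shape the imputer expects
X_demo[:, 0:2] = imputer_demo.fit_transform(X_demo[:, 0:2])

print(X_demo)
# The missing entries become the column means: 2.0 and 15.0
```

Selecting a single column with a plain index such as `X_demo[:, 0]` would return a 1-D array, which scikit-learn rejects, so a slice is used even for one column.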

## Encoding categorical data

Use this step **if you have categorical data** or if your dependent variable consists of class labels (in words).

### Encoding the features

In [None]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

Make sure to **change the "0" in the brackets to the column** containing the categorical data you wish to encode.

In [None]:
transformer = ColumnTransformer(transformers = [('encoder', OneHotEncoder(), [0])], remainder = 'passthrough')

In [None]:
X = np.array(transformer.fit_transform(X))
print(X)
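To illustrate what one-hot encoding does to the feature matrix, here is a minimal sketch on a made-up array with one categorical column and one numeric column. The data and the `demo` names are hypothetical.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical features: a country name and an age
X_demo = np.array([['France', 44],
                   ['Spain', 27],
                   ['France', 30]], dtype=object)

transformer_demo = ColumnTransformer(
    transformers=[('encoder', OneHotEncoder(), [0])],
    remainder='passthrough')

X_encoded = np.array(transformer_demo.fit_transform(X_demo))

# Two one-hot columns (France, Spain) come first, followed by the age
print(X_encoded.shape)  # (3, 3)
print(X_encoded)
```

Note that the encoded columns are placed first and the untouched columns follow, so column indices used in later steps may shift after this transformation.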

### Encoding the dependent variable

You should **only use this if your dependent variable is categorical data!**

In [None]:
from sklearn.preprocessing import LabelEncoder

In [None]:
encoder = LabelEncoder()
y = encoder.fit_transform(y)
print(y)
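For a quick look at how labels are mapped to integers, the sketch below encodes a made-up array of "Yes"/"No" labels. The data and the `demo` names are assumptions for illustration.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Hypothetical class labels
y_demo = np.array(['No', 'Yes', 'No', 'Yes'])

encoder_demo = LabelEncoder()
y_encoded = encoder_demo.fit_transform(y_demo)

print(y_encoded)  # [0 1 0 1]

# classes_ stores the original labels; the position of each label is its code
print(encoder_demo.classes_)

# inverse_transform maps the integer codes back to the original labels
print(encoder_demo.inverse_transform(y_encoded))
```

Keeping the fitted encoder around is useful later, since `inverse_transform` turns model predictions back into readable labels.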

## Splitting the dataset

In [None]:
from sklearn.model_selection import train_test_split

**Adjust the testing set size to your liking** by changing the "0.2" to whatever ratio you would like!

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 1)
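To check how the ratio translates into row counts, here is a minimal sketch on a made-up dataset of 10 rows. The arrays and variable names are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical dataset: 10 rows, 2 features
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1)

# With 10 rows and test_size=0.2, 2 rows go to the test set and 8 to training
print(X_tr.shape, X_te.shape)  # (8, 2) (2, 2)
```

Fixing `random_state` makes the split reproducible, so the same rows land in each set on every run.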

## Feature scaling

In [None]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()

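Make sure to **change the "0:1" in the brackets to the range of columns** you want to scale. The scaler expects a 2-D array, so the columns are selected with a slice. It is fitted on the training set only and then reused on the test set, so that no information from the test set leaks into the scaling. Keep in mind that one-hot encoding changes the column positions, so check the indices against the printed X.

In [None]:
# Fit on the training data, then apply the same scaling to the test data
X_train[:, 0:1] = scaler.fit_transform(X_train[:, 0:1])
X_test[:, 0:1] = scaler.transform(X_test[:, 0:1])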
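As a small illustration of standardization, the sketch below fits a scaler on a made-up training array and applies the learned mean and standard deviation to a made-up test row. This follows the common practice of fitting on training data only; the arrays and `demo` names are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical numeric features
X_train_demo = np.array([[1.0, 100.0],
                         [2.0, 200.0],
                         [3.0, 300.0]])
X_test_demo = np.array([[2.0, 400.0]])

scaler_demo = StandardScaler()

# Learn mean and standard deviation from the training columns, then scale them
X_train_demo[:, 0:2] = scaler_demo.fit_transform(X_train_demo[:, 0:2])

# Reuse the training statistics on the test set (no refitting)
X_test_demo[:, 0:2] = scaler_demo.transform(X_test_demo[:, 0:2])

print(X_train_demo.mean(axis=0))  # approximately zero in each column
print(X_test_demo)
```

After scaling, each training column has a mean of about 0 and a standard deviation of about 1, while the test values are expressed relative to the training distribution.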