# Data Preprocessing

### by ReDay Zarra

Data preprocessing is about **preparing data for analysis** by cleaning, transforming, and organizing it. This step is crucial for any machine learning or data analysis project because it helps ensure that the data is accurate, consistent, and complete. Data preprocessing includes tasks such as handling missing values, removing outliers, normalizing data, and encoding categorical variables. The goal is to **make the data as clean and usable** as possible so that it can be used for further analysis and modeling.

The six major steps of data preprocessing are:

1. Importing the necessary libraries

2. Importing the dataset

3. Addressing missing data

4. Encoding categorical data

5. Splitting the dataset

6. Feature scaling

## Importing libraries

First, we import the libraries and modules we need to start preprocessing our data. **Pandas** is a library used for data frame manipulation. **NumPy** is a package used for numerical computing.

In [1]:
import numpy as np
import pandas as pd

## Importing data

Importing the dataset requires us to use the **Pandas library, which loads the dataset** into a new variable. We then create the matrix of features and the dependent variable vector. The features are the columns used to make the prediction, and the dependent variable is the value you want to predict.

In [3]:
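dataset = pd.read_csv('Data.csv')
# Matrix of features: every row, every column except the last
X = dataset.iloc[:, :-1].values
# Dependent variable vector: every row, last column only
y = dataset.iloc[:, -1].values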

print(X)
print(y)

[['France' 44.0 72000.0]
 ['Spain' 27.0 48000.0]
 ['Germany' 30.0 54000.0]
 ['Spain' 38.0 61000.0]
 ['Germany' 40.0 nan]
 ['France' 35.0 58000.0]
 ['Spain' nan 52000.0]
 ['France' 48.0 79000.0]
 ['Germany' 50.0 83000.0]
 ['France' 37.0 67000.0]]
['No' 'Yes' 'No' 'No' 'Yes' 'Yes' 'No' 'Yes' 'No' 'Yes']


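> Pandas reads the dataset and **creates a data frame**, which is **stored in the variable dataset**. The **.iloc indexer** selects rows and columns by integer position and **identifies the data we want to target**. It is used with square brackets rather than called like a function: the first slice selects the rows and the second selects the columns.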
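
To see positional slicing in isolation, here is a minimal sketch that uses a tiny in-memory CSV. The column names and values are made up for illustration and are not taken from Data.csv.

```python
import io
import pandas as pd

# Hypothetical three-row dataset; names and values are illustrative only
csv_text = (
    "Country,Age,Salary,Purchased\n"
    "France,44,72000,No\n"
    "Spain,27,48000,Yes\n"
    "Germany,30,54000,No\n"
)
demo = pd.read_csv(io.StringIO(csv_text))

# All rows, every column except the last -> matrix of features
X_demo = demo.iloc[:, :-1].values
# All rows, last column only -> dependent variable vector
y_demo = demo.iloc[:, -1].values

print(X_demo)
print(y_demo)
```

A negative index counts from the end, so this pattern keeps working when more feature columns are added, as long as the dependent variable stays in the last column.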

## Addressing missing data

Our dataset has missing values for some of the features, which can cause
errors in our machine learning models. There are two common options for
addressing this:

  1. **Removing** the rows that contain missing data
  
  2. **Replacing** the missing data with the average of the column
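
Both options can be sketched with Pandas alone. The small data frame below is hypothetical, and dropna and fillna are used here only to illustrate the idea, not as the approach this section takes.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value in each numeric column
df = pd.DataFrame({
    "Age": [44.0, 27.0, np.nan, 38.0],
    "Salary": [72000.0, np.nan, 54000.0, 61000.0],
})

# Option 1: drop every row that contains at least one missing value
dropped = df.dropna()

# Option 2: replace each missing value with the mean of its column
filled = df.fillna(df.mean())

print(dropped)
print(filled)
```

Dropping rows is the simplest choice but discards data, which matters for a small dataset. Filling with the mean keeps every row.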

**Scikit-Learn** is a machine learning library with many data
preprocessing tools, including ones that help us replace missing data.

In [5]:
from sklearn.impute import SimpleImputer
# Replace every np.nan entry with the mean of its column
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')

> We use the SimpleImputer class to handle the missing values. The **missing_values** parameter identifies which values count as missing (here np.nan), and the **strategy** parameter specifies that we want to replace them with the column mean.

In [6]:
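# Learn the column means from the two numeric columns (indices 1 and 2)
imputer.fit(X[:, 1:3])
# Fill the missing entries and write the result back into X
X[:, 1:3] = imputer.transform(X[:, 1:3])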

> The .fit() method computes the mean of each column in the data specified, here the numeric columns at indices 1 and 2. The .transform() method then replaces the missing values in those columns with the computed means, and we assign the result back to X.
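
Conceptually, fitting learns one mean per column and transforming fills the gaps with those means. The NumPy-only sketch below mimics that two-step behavior on a hypothetical array. It illustrates the idea and is not Scikit-Learn's actual implementation.

```python
import numpy as np

# Hypothetical numeric columns with one missing entry each
data = np.array([
    [44.0, 72000.0],
    [27.0, np.nan],
    [np.nan, 54000.0],
    [38.0, 61000.0],
])

# "fit": compute the mean of each column, ignoring missing entries
column_means = np.nanmean(data, axis=0)

# "transform": substitute the column mean wherever a value is missing
filled = np.where(np.isnan(data), column_means, data)

print(column_means)
print(filled)
```

Keeping the two steps separate means the learned means can later be reused on other data without being recomputed.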

## Encoding categorical data

## Splitting the dataset

## Feature Scaling