<a href="https://colab.research.google.com/github/vjhawar12/IrisClassifier/blob/main/data_preprocessing_tools.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Data Preprocessing Tools

## Importing the libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

## Importing the dataset

In [None]:
dataset = pd.read_csv("Data.csv")

X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, -1].values

print(X)

[['France' 44.0 72000.0]
 ['Spain' 27.0 48000.0]
 ['Germany' 30.0 54000.0]
 ['Spain' 38.0 61000.0]
 ['Germany' 40.0 nan]
 ['France' 35.0 58000.0]
 ['Spain' nan 52000.0]
 ['France' 48.0 79000.0]
 ['Germany' 50.0 83000.0]
 ['France' 37.0 67000.0]]


In [None]:
print(y)

['No' 'Yes' 'No' 'No' 'Yes' 'Yes' 'No' 'Yes' 'No' 'Yes']
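
The two `iloc` slices above can be tried on a tiny in-memory table. This is only a sketch: the three-row frame and its column names (`Country`, `Age`, `Salary`, `Purchased`) are assumptions standing in for `Data.csv`, whose header is not shown here.

```python
import numpy as np
import pandas as pd

# Hypothetical three-row stand-in for Data.csv; the column names are assumed.
demo = pd.DataFrame({
    "Country": ["France", "Spain", "Germany"],
    "Age": [44.0, 27.0, np.nan],
    "Salary": [72000.0, 48000.0, 54000.0],
    "Purchased": ["No", "Yes", "No"],
})

X_demo = demo.iloc[:, :-1].values  # all rows, every column except the last
y_demo = demo.iloc[:, -1].values   # all rows, only the last column

print(X_demo.shape)  # (3, 3)
print(y_demo.shape)  # (3,)
print(X_demo.dtype)  # object, because strings and floats are mixed
```

The `object` dtype is why the printed `X` shows quoted strings next to bare numbers.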


## Taking care of missing data

In [None]:
from sklearn.impute import SimpleImputer

# Create an imputer that replaces missing values (np.nan)
# with the mean of the column (strategy="mean")
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")

# Compute the means of columns 1 and 2 across all rows (the slice's upper bound, 3, is excluded)
imputer.fit(X[:, 1:3])

# Overwrite those two columns of X with the imputed values
X[:, 1:3] = imputer.transform(X[:, 1:3])

print(X)


[['France' 44.0 72000.0]
 ['Spain' 27.0 48000.0]
 ['Germany' 30.0 54000.0]
 ['Spain' 38.0 61000.0]
 ['Germany' 40.0 63777.77777777778]
 ['France' 35.0 58000.0]
 ['Spain' 38.77777777777778 52000.0]
 ['France' 48.0 79000.0]
 ['Germany' 50.0 83000.0]
 ['France' 37.0 67000.0]]
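
As a sanity check on the filled-in values, the fitted means can be compared with `np.nanmean`, which skips missing entries. This is a minimal sketch that copies the age and salary columns from the printout above; the names `numeric`, `imputer_demo`, and `filled` are introduced only for this illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Age and salary columns copied from the first printed X.
numeric = np.array([
    [44.0, 72000.0], [27.0, 48000.0], [30.0, 54000.0], [38.0, 61000.0],
    [40.0, np.nan], [35.0, 58000.0], [np.nan, 52000.0], [48.0, 79000.0],
    [50.0, 83000.0], [37.0, 67000.0],
])

imputer_demo = SimpleImputer(missing_values=np.nan, strategy="mean")
filled = imputer_demo.fit_transform(numeric)  # fit and transform in one call

# statistics_ holds the per-column fill values learned by fit.
print(imputer_demo.statistics_)
# np.nanmean ignores NaN entries, so it should agree with the imputer.
print(np.nanmean(numeric, axis=0))
```

Both lines should show the same two means, which are the values that replaced the `nan` entries in rows 5 and 7.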


## Encoding categorical data

### Encoding the Independent Variable

In [None]:
# To encode the country strings, import these two classes and one-hot encode the first column
# (each category becomes its own binary column)
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# transformers takes (name, transformer, columns) tuples: 'encoder' is just a label,
# OneHotEncoder() does the encoding, and [0] selects column 0
# remainder='passthrough' keeps the other columns unchanged and appends them after the encoded ones
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [0])], remainder='passthrough')

# Fit on X (all rows, columns 0-2) and store the encoded table as a NumPy array
X = np.array(ct.fit_transform(X))

print(X)

[[1.0 0.0 0.0 44.0 72000.0]
 [0.0 0.0 1.0 27.0 48000.0]
 [0.0 1.0 0.0 30.0 54000.0]
 [0.0 0.0 1.0 38.0 61000.0]
 [0.0 1.0 0.0 40.0 63777.77777777778]
 [1.0 0.0 0.0 35.0 58000.0]
 [0.0 0.0 1.0 38.77777777777778 52000.0]
 [1.0 0.0 0.0 48.0 79000.0]
 [0.0 1.0 0.0 50.0 83000.0]
 [1.0 0.0 0.0 37.0 67000.0]]
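
To see which of the three new columns belongs to which country, the fitted encoder can be inspected. The sketch below uses a three-row array in the same layout as `X`; the names `X_demo`, `ct_demo`, and `categories` are assumptions made for this example.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Small object array in the same layout as X: country, age, salary.
X_demo = np.array([
    ["France", 44.0, 72000.0],
    ["Spain", 27.0, 48000.0],
    ["Germany", 30.0, 54000.0],
], dtype=object)

ct_demo = ColumnTransformer(
    transformers=[("encoder", OneHotEncoder(), [0])],
    remainder="passthrough",
)
encoded = np.array(ct_demo.fit_transform(X_demo))

# Categories are sorted, so the one-hot columns are France, Germany, Spain.
categories = ct_demo.named_transformers_["encoder"].categories_[0]
print(list(categories))
print(encoded)
```

This matches the output above, where France rows start with `1.0 0.0 0.0` and Spain rows with `0.0 0.0 1.0`.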


In [None]:
print(y)

['No' 'Yes' 'No' 'No' 'Yes' 'Yes' 'No' 'Yes' 'No' 'Yes']


### Encoding the Dependent Variable

In [None]:
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
y = le.fit_transform(y)

print(y)

[0 1 0 0 1 1 0 1 0 1]
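
The mapping between the labels and their integer codes can be read back from the fitted encoder. This is a small sketch on a shortened label list; `labels` and `le_demo` are names used only here.

```python
from sklearn.preprocessing import LabelEncoder

labels = ["No", "Yes", "No", "No", "Yes"]

le_demo = LabelEncoder()
encoded = le_demo.fit_transform(labels)

# classes_ lists the original labels; the index of each one is its integer code.
print(list(le_demo.classes_))
print(list(encoded))
# inverse_transform maps the integer codes back to the original strings.
print(list(le_demo.inverse_transform(encoded)))
```

Here `'No'` becomes 0 and `'Yes'` becomes 1, consistent with the printed `y`.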


## Splitting the dataset into the Training set and Test set

In this step the data is split into separate sets. The training set is used to fit the model on existing observations. The test set is used to evaluate the model on observations it has not seen. Feature scaling must be done after the split so that the scaling statistics are computed from the training set alone; otherwise information from the test set would leak into training.

This step actually produces four sets, not two: XTrain (the features of the training set), XTest (the features of the test set), yTrain (the dependent variable of the training set), and yTest (the dependent variable of the test set).

The machine learning model that we are going to build will require all of this data.

In [None]:
from sklearn.model_selection import train_test_split

# Split the data into the four sets. test_size=0.2 puts 20% of the rows in the test set and 80% in the training set.
# random_state=1 fixes the shuffle, so the data is split the same way on every run
XTrain, XTest, yTrain, yTest = train_test_split(X, y, test_size=0.2, random_state=1)

In [None]:
print(XTrain)

[[0.0 0.0 1.0 38.77777777777778 52000.0]
 [0.0 1.0 0.0 40.0 63777.77777777778]
 [1.0 0.0 0.0 44.0 72000.0]
 [0.0 0.0 1.0 38.0 61000.0]
 [0.0 0.0 1.0 27.0 48000.0]
 [1.0 0.0 0.0 48.0 79000.0]
 [0.0 1.0 0.0 50.0 83000.0]
 [1.0 0.0 0.0 35.0 58000.0]]


In [None]:
print(XTest)

[[0.0 1.0 0.0 30.0 54000.0]
 [1.0 0.0 0.0 37.0 67000.0]]


In [None]:
print(yTrain)

[0 1 0 0 1 1 0 1]


In [None]:
print(yTest)

[0 1]
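
The effect of `test_size` and `random_state` can be checked on placeholder data. The sketch below assumes a 10-row, 5-column array standing in for the encoded `X`; `X_demo`, `y_demo`, `first`, and `second` are illustrative names.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with the same shape as the encoded X and y (10 rows, 5 columns).
X_demo = np.arange(50).reshape(10, 5)
y_demo = np.array([0, 1, 0, 0, 1, 1, 0, 1, 0, 1])

first = train_test_split(X_demo, y_demo, test_size=0.2, random_state=1)
second = train_test_split(X_demo, y_demo, test_size=0.2, random_state=1)

X_tr, X_te, y_tr, y_te = first
print(X_tr.shape, X_te.shape, y_tr.shape, y_te.shape)

# The same random_state gives the same partition on every call.
print(all(np.array_equal(a, b) for a, b in zip(first, second)))
```

With 10 rows and `test_size=0.2`, eight rows go to training and two to testing, which matches the printouts above.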


## Feature Scaling

Feature scaling puts all of our features on the same scale. Without it, features with large values (such as salary) can dominate features with small values (such as age), so some of our data is effectively ignored by models that are sensitive to magnitude.

Feature scaling isn't always necessary.

There are two common methods for feature scaling: standardization and normalization. After standardization each feature has mean 0 and standard deviation 1, so most values fall roughly between -3 and 3. After normalization (min-max scaling) each feature ranges from 0 to 1.

Normalization is often suggested when most of your features follow a fairly normal distribution and contain few outliers, since min-max scaling is sensitive to extreme values. Standardization works well in most cases. When in doubt, go for standardization.

Standardization is calculated like this: (x - mean(x)) / standardDeviation(x)

Normalization is calculated like this: (x - min(x)) / (max(x) - min(x))

The mean and standard deviation must be computed from the training set only. The same two values are then used to scale the test set, so the test set never influences the fitted scaler.
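
The formula can be worked through by hand with NumPy. This is a sketch under the assumption that only the age column is scaled; the values are copied from the XTrain and XTest printouts, and the variable names are introduced for this example.

```python
import numpy as np

# Age values taken from the XTrain and XTest printouts above.
age_train = np.array([38.77777777777778, 40.0, 44.0, 38.0, 27.0, 48.0, 50.0, 35.0])
age_test = np.array([30.0, 37.0])

# Statistics come from the training set only.
mean = age_train.mean()
std = age_train.std()  # population standard deviation (ddof=0)

age_train_scaled = (age_train - mean) / std
age_test_scaled = (age_test - mean) / std  # reuse the training statistics

# The scaled training column has mean 0 and standard deviation 1 (up to rounding).
print(age_train_scaled.mean(), age_train_scaled.std())
print(age_test_scaled)
```

The scaled test values do not have mean 0 themselves, which is expected: they are measured against the training distribution.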

In [None]:
from sklearn.preprocessing import StandardScaler # this class performs standardization

sc = StandardScaler()
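
One way the scaler might then be applied is sketched below. It assumes that only the numeric columns (index 3 onward) are scaled and that the one-hot columns are left alone; that choice, and the names `XTrain_demo`, `XTest_demo`, and `sc_demo`, are assumptions rather than something shown in this notebook.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Rows copied from the XTrain and XTest printouts above.
XTrain_demo = np.array([
    [0.0, 0.0, 1.0, 38.77777777777778, 52000.0],
    [0.0, 1.0, 0.0, 40.0, 63777.77777777778],
    [1.0, 0.0, 0.0, 44.0, 72000.0],
    [0.0, 0.0, 1.0, 38.0, 61000.0],
    [0.0, 0.0, 1.0, 27.0, 48000.0],
    [1.0, 0.0, 0.0, 48.0, 79000.0],
    [0.0, 1.0, 0.0, 50.0, 83000.0],
    [1.0, 0.0, 0.0, 35.0, 58000.0],
])
XTest_demo = np.array([
    [0.0, 1.0, 0.0, 30.0, 54000.0],
    [1.0, 0.0, 0.0, 37.0, 67000.0],
])

sc_demo = StandardScaler()
# Assumption: scale only age and salary (columns 3 and 4), not the one-hot columns.
XTrain_demo[:, 3:] = sc_demo.fit_transform(XTrain_demo[:, 3:])  # learn mean and std on train
XTest_demo[:, 3:] = sc_demo.transform(XTest_demo[:, 3:])        # reuse them on test

print(XTrain_demo)
print(XTest_demo)
```

Calling `fit_transform` on the training rows and only `transform` on the test rows keeps the test set out of the fitted statistics.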
