# Preprocessing

## Importing libraries

In [22]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

## Loading the dataset

In [23]:
dataset = pd.read_csv("Data.csv")
X = dataset.iloc[:, :-1].values      # features (independent variables): every column except the last
Y = dataset.iloc[:, -1].values       # target (dependent variable): the last column

print (X)
print (Y)

[['France' 44.0 72000.0]
 ['Spain' 27.0 48000.0]
 ['Germany' 30.0 54000.0]
 ['Spain' 38.0 61000.0]
 ['Germany' 40.0 nan]
 ['France' 35.0 58000.0]
 ['Spain' nan 52000.0]
 ['France' 48.0 79000.0]
 ['Germany' 50.0 83000.0]
 ['France' 37.0 67000.0]]
['No' 'Yes' 'No' 'No' 'Yes' 'Yes' 'No' 'Yes' 'No' 'Yes']


## Filling the missing data

### Using the mean

In [24]:
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputer.fit(X[:, 1:3])
X[:, 1:3] = imputer.transform(X[:, 1:3])
print (X)

[['France' 44.0 72000.0]
 ['Spain' 27.0 48000.0]
 ['Germany' 30.0 54000.0]
 ['Spain' 38.0 61000.0]
 ['Germany' 40.0 63777.77777777778]
 ['France' 35.0 58000.0]
 ['Spain' 38.77777777777778 52000.0]
 ['France' 48.0 79000.0]
 ['Germany' 50.0 83000.0]
 ['France' 37.0 67000.0]]
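
The two filled values in the output above can be checked by hand: `strategy='mean'` replaces each `nan` with the mean of the observed values in its column. A minimal sketch with NumPy, using the age and salary values printed earlier (the names `ages` and `salaries` are illustrative, not part of the notebook):

```python
import numpy as np

# Age and salary columns as first printed from Data.csv; np.nan marks the missing cells.
ages = np.array([44, 27, 30, 38, 40, 35, np.nan, 48, 50, 37], dtype=float)
salaries = np.array([72000, 48000, 54000, 61000, np.nan,
                     58000, 52000, 79000, 83000, 67000], dtype=float)

# np.nanmean ignores nan entries, which mirrors what the mean strategy does per column.
age_mean = np.nanmean(ages)          # 349 / 9
salary_mean = np.nanmean(salaries)   # 574000 / 9

print(age_mean, salary_mean)
```

Both results agree with the values that appear in rows 5 and 7 of the imputed array.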


### Using the median

In [10]:
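from sklearn.impute import SimpleImputer

# Run this on X that still contains nan values. If X was already imputed
# (for example by the mean cell above), nothing is left to fill and X is unchanged.
imputer = SimpleImputer(missing_values=np.nan, strategy='median')
imputer.fit(X[:, 1:3])
X[:, 1:3] = imputer.transform(X[:, 1:3])
print (X)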

[['France' 44.0 72000.0]
 ['Spain' 27.0 48000.0]
 ['Germany' 30.0 54000.0]
 ['Spain' 38.0 61000.0]
 ['Germany' 40.0 63777.77777777778]
 ['France' 35.0 58000.0]
 ['Spain' 38.77777777777778 52000.0]
 ['France' 48.0 79000.0]
 ['Germany' 50.0 83000.0]
 ['France' 37.0 67000.0]]
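
The output above is identical to the mean-imputed array, which suggests the cell ran after the `nan` values had already been filled. To see what the median strategy does on its own, here is a sketch that rebuilds the two numeric columns from the first printout (the array name `numeric` is illustrative) and imputes them from scratch:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Age and salary as first printed, before any imputation.
numeric = np.array([
    [44, 72000], [27, 48000], [30, 54000], [38, 61000], [40, np.nan],
    [35, 58000], [np.nan, 52000], [48, 79000], [50, 83000], [37, 67000],
], dtype=float)

median_imputer = SimpleImputer(missing_values=np.nan, strategy='median')
filled = median_imputer.fit_transform(numeric)

# statistics_ holds the per-column fill values learned by fit.
print(median_imputer.statistics_)
print(filled[4], filled[6])   # the two rows that had a missing value
```

On this data the median age is 38.0 and the median salary is 61000.0, so those are the values that would be filled in.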


### Most Frequent Imputation

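For categorical variables, you can fill missing values with the most frequent value in the column. The cell below applies the strategy to the numeric age and salary columns, where every observed value is unique; when several values tie for most frequent, `SimpleImputer` returns the smallest one, which is why 27.0 and 48000.0 are filled in.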

In [16]:
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values=np.nan, strategy='most_frequent') 
imputer.fit(X[:, 1:3])
X[:, 1:3] = imputer.transform(X[:, 1:3])
print (X)

[['France' 44.0 72000.0]
 ['Spain' 27.0 48000.0]
 ['Germany' 30.0 54000.0]
 ['Spain' 38.0 61000.0]
 ['Germany' 40.0 48000.0]
 ['France' 35.0 58000.0]
 ['Spain' 27.0 52000.0]
 ['France' 48.0 79000.0]
 ['Germany' 50.0 83000.0]
 ['France' 37.0 67000.0]]
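
The strategy is easier to see on a categorical column. A small sketch with pandas, assuming a hypothetical `country` series with one missing entry (this series is made up for illustration and is not taken from `Data.csv`):

```python
import numpy as np
import pandas as pd

# Hypothetical categorical column with one missing entry.
country = pd.Series(['France', 'Spain', np.nan, 'Spain', 'Germany'], dtype=object)

# mode() ignores missing values and returns the most frequent value(s), sorted.
most_common = country.mode()[0]
country_filled = country.fillna(most_common)

print(most_common)
print(country_filled.tolist())
```

Here `'Spain'` occurs twice, so it replaces the missing entry.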


### Constant imputation

You can fill missing values with a constant value of your choice. This is useful when missing values have a specific meaning in the data.

In [40]:
from sklearn.impute import SimpleImputer

# Intended for X that still contains nan values in the age and salary columns (1 and 2).
imputer = SimpleImputer(missing_values=np.nan, strategy='constant', fill_value=0)
imputer.fit(X[:, 1:3])
X[:, 1:3] = imputer.transform(X[:, 1:3])
print (X)

[[0.0 1.0 0.0 1.0 1.0 0.0 0.0 44.0 72000.0]
 [1.0 0.0 1.0 0.0 1.0 0.0 1.0 27.0 48000.0]
 [1.0 0.0 1.0 0.0 0.0 1.0 0.0 30.0 54000.0]
 [1.0 0.0 1.0 0.0 1.0 0.0 1.0 38.0 61000.0]
 [1.0 0.0 1.0 0.0 0.0 1.0 0.0 40.0 63777.77777777778]
 [0.0 1.0 0.0 1.0 1.0 0.0 0.0 35.0 58000.0]
 [1.0 0.0 1.0 0.0 1.0 0.0 1.0 38.77777777777778 52000.0]
 [0.0 1.0 0.0 1.0 1.0 0.0 0.0 48.0 79000.0]
 [1.0 0.0 1.0 0.0 0.0 1.0 0.0 50.0 83000.0]
 [0.0 1.0 0.0 1.0 1.0 0.0 0.0 37.0 67000.0]]
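
The array printed above has nine columns and no zeros in place of missing values, so it appears to come from a session in which `X` had already been imputed and encoded. A sketch of what constant imputation does to data that still has gaps, using the two numeric columns rebuilt from the first printout (the name `numeric` is illustrative):

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Age and salary as first printed, before any imputation.
numeric = np.array([
    [44, 72000], [27, 48000], [30, 54000], [38, 61000], [40, np.nan],
    [35, 58000], [np.nan, 52000], [48, 79000], [50, 83000], [37, 67000],
], dtype=float)

constant_imputer = SimpleImputer(missing_values=np.nan, strategy='constant', fill_value=0)
filled = constant_imputer.fit_transform(numeric)

print(filled[4])   # row whose salary was missing
print(filled[6])   # row whose age was missing
```

Every `nan` becomes 0, regardless of the other values in the column.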


If missing values are too prevalent or cannot be imputed accurately, you may choose to simply drop rows or columns with missing values.
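
A sketch of the dropping approach with pandas. The rows are the ones printed from `Data.csv`; the column names `Country`, `Age`, and `Salary` are assumed for illustration:

```python
import numpy as np
import pandas as pd

# Rows as printed from Data.csv; the column names are assumptions.
df = pd.DataFrame({
    'Country': ['France', 'Spain', 'Germany', 'Spain', 'Germany',
                'France', 'Spain', 'France', 'Germany', 'France'],
    'Age': [44, 27, 30, 38, 40, 35, np.nan, 48, 50, 37],
    'Salary': [72000, 48000, 54000, 61000, np.nan,
               58000, 52000, 79000, 83000, 67000],
})

print(df.isna().sum())          # number of missing values per column

rows_kept = df.dropna(axis=0)   # drop every row that has a missing value
cols_kept = df.dropna(axis=1)   # drop every column that has a missing value

print(rows_kept.shape, cols_kept.shape)
```

Dropping rows costs two of the ten observations here, while dropping columns would discard both numeric features, which shows why imputation is often the better option on small datasets.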

## Encoding values

### Encoding the independent variables

Encoding in the context of machine learning is the process of converting categorical variables into a numerical representation that can be used by machine learning algorithms for analysis and modeling.

#### One Hot encoding

One-hot encoding converts a categorical variable into binary vectors, where each category becomes its own binary feature. It suits nominal variables that have no inherent order, because it keeps the model from assuming an order among the categories. scikit-learn implements it as `OneHotEncoder`.

In [25]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# One-hot encode column 0 (country); remainder='passthrough' keeps age and salary unchanged
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [0])], remainder='passthrough')
X = np.array(ct.fit_transform(X))
print (X)

[[1.0 0.0 0.0 44.0 72000.0]
 [0.0 0.0 1.0 27.0 48000.0]
 [0.0 1.0 0.0 30.0 54000.0]
 [0.0 0.0 1.0 38.0 61000.0]
 [0.0 1.0 0.0 40.0 63777.77777777778]
 [1.0 0.0 0.0 35.0 58000.0]
 [0.0 0.0 1.0 38.77777777777778 52000.0]
 [1.0 0.0 0.0 48.0 79000.0]
 [0.0 1.0 0.0 50.0 83000.0]
 [1.0 0.0 0.0 37.0 67000.0]]
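
The first three columns of the output above follow the sorted category names (France, Germany, Spain). A NumPy-only sketch of the same idea, which is not scikit-learn's implementation, shows how each category becomes one binary column:

```python
import numpy as np

# Country column as printed from Data.csv.
countries = np.array(['France', 'Spain', 'Germany', 'Spain', 'Germany',
                      'France', 'Spain', 'France', 'Germany', 'France'])

# np.unique returns the sorted categories and, for each row, the index of its category.
categories, index = np.unique(countries, return_inverse=True)

# Row i of the identity matrix is the one-hot vector for category i.
one_hot = np.eye(len(categories))[index]

print(categories)
print(one_hot[:3])
```

Each row contains exactly one 1, in the column that belongs to its country.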


### Encoding the dependent variable

#### Label Encoding

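Label encoding converts a categorical variable into numerical format by assigning a unique integer to each category. scikit-learn's `LabelEncoder` is intended for the target variable and numbers the classes in sorted order, so the integers carry a meaningful ranking only when the sorted order happens to match the natural one. Here the target has just two classes, so `'No'` becomes 0 and `'Yes'` becomes 1.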

In [26]:
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
Y = le.fit_transform(Y)
print (Y)


[0 1 0 0 1 1 0 1 0 1]
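
`LabelEncoder` assigns integers by sorted class name, which is why `'No'` maps to 0 and `'Yes'` to 1 in the output above. A short sketch of how to inspect and reverse that mapping; the `size_order` example with an explicit order is hypothetical and not part of the dataset:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Target column as printed from Data.csv.
labels = np.array(['No', 'Yes', 'No', 'No', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes'])

le = LabelEncoder()
encoded = le.fit_transform(labels)

print(le.classes_)                         # position in this array is the integer code
print(le.inverse_transform(encoded[:3]))   # map codes back to the original labels

# For a feature with a real order, spell the order out instead of relying on sorting.
size_order = {'small': 0, 'medium': 1, 'large': 2}   # hypothetical ordinal feature
sizes = ['medium', 'small', 'large']
size_codes = [size_order[s] for s in sizes]
print(size_codes)
```

Alphabetical sorting would rank `'large'` before `'small'`, so an explicit mapping is the safer choice for ordered categories.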


In [27]:
xy_table = np.column_stack((X, Y))   # stack X and Y side by side to view them as one table
print (xy_table)

[[1.0 0.0 0.0 44.0 72000.0 0]
 [0.0 0.0 1.0 27.0 48000.0 1]
 [0.0 1.0 0.0 30.0 54000.0 0]
 [0.0 0.0 1.0 38.0 61000.0 0]
 [0.0 1.0 0.0 40.0 63777.77777777778 1]
 [1.0 0.0 0.0 35.0 58000.0 1]
 [0.0 0.0 1.0 38.77777777777778 52000.0 0]
 [1.0 0.0 0.0 48.0 79000.0 1]
 [0.0 1.0 0.0 50.0 83000.0 0]
 [1.0 0.0 0.0 37.0 67000.0 1]]


## Splitting the data for training and testing

In [28]:
from sklearn.model_selection import train_test_split

X_train, X_test, Y_train, Y_test = train_test_split(X,Y,test_size=0.2,random_state=1)

In [29]:
print (X_train)

[[0.0 0.0 1.0 38.77777777777778 52000.0]
 [0.0 1.0 0.0 40.0 63777.77777777778]
 [1.0 0.0 0.0 44.0 72000.0]
 [0.0 0.0 1.0 38.0 61000.0]
 [0.0 0.0 1.0 27.0 48000.0]
 [1.0 0.0 0.0 48.0 79000.0]
 [0.0 1.0 0.0 50.0 83000.0]
 [1.0 0.0 0.0 35.0 58000.0]]


In [30]:
print (X_test)

[[0.0 1.0 0.0 30.0 54000.0]
 [1.0 0.0 0.0 37.0 67000.0]]


In [31]:
print (Y_train)

[0 1 0 0 1 1 0 1]


In [32]:
print (Y_test)

[0 1]
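
With ten rows and `test_size=0.2`, the split above leaves eight rows for training and two for testing, and fixing `random_state` makes the shuffle repeatable. A sketch on placeholder data (the arrays `features` and `target` are illustrative) that checks both properties:

```python
import numpy as np
from sklearn.model_selection import train_test_split

features = np.arange(20).reshape(10, 2)   # 10 placeholder rows, 2 columns
target = np.arange(10) % 2                # placeholder binary labels

# Two splits with the same random_state should pick the same rows.
a_train, a_test, b_train, b_test = train_test_split(
    features, target, test_size=0.2, random_state=1)
c_train, c_test, d_train, d_test = train_test_split(
    features, target, test_size=0.2, random_state=1)

print(a_train.shape, a_test.shape)
print(np.array_equal(a_test, c_test))
```

Changing or omitting `random_state` would generally produce a different selection of test rows on each run.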


## Feature Scaling

Feature scaling is a preprocessing technique that transforms feature values to a common range or distribution, so that features with larger magnitudes do not dominate the others. The two common techniques are standardization (Z-score scaling) and normalization (Min-Max scaling):

$$x_{\text{stand}} = \frac{x - \operatorname{mean}(x)}{\operatorname{sd}(x)}$$

$$x_{\text{norm}} = \frac{x - \min(x)}{\max(x) - \min(x)}$$

Standardization gives each feature zero mean and unit standard deviation. Normalization maps each feature to the range [0, 1].

Normalization is a reasonable choice when a bounded range is needed and the feature has no large outliers, since a single extreme value compresses all the other values.

Standardization works well in most cases, so it is the one preferred here.
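
A worked sketch of both rescalings with NumPy, writing standardization as (x - mean) / sd and Min-Max normalization as (x - min) / (max - min). The `age` array holds the training-set ages printed in `X_train` before scaling; the variable names are illustrative:

```python
import numpy as np

# Training-set ages as printed in X_train before scaling.
age = np.array([38.77777777777778, 40.0, 44.0, 38.0, 27.0, 48.0, 50.0, 35.0])

# Standardization: subtract the mean, divide by the (population) standard deviation.
standardized = (age - age.mean()) / age.std()

# Min-Max normalization: subtract the minimum, divide by the range.
normalized = (age - age.min()) / (age.max() - age.min())

print(standardized.round(3))
print(normalized.round(3))
```

The standardized values have mean 0 and standard deviation 1, while the normalized values run from 0 (age 27) to 1 (age 50).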

In [14]:
## preferring Standardization

In [33]:
from sklearn.preprocessing import StandardScaler

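scc = StandardScaler()
# Scale only age and salary (columns 3 onward); the one-hot columns stay as 0/1
X_train[:, 3:] = scc.fit_transform(X_train[:, 3:])   # learn mean and sd from the training set
X_test[:, 3:] = scc.transform(X_test[:, 3:])         # reuse the training mean and sd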

In [34]:
print (X_train)

[[0.0 0.0 1.0 -0.19159184384578545 -1.0781259408412425]
 [0.0 1.0 0.0 -0.014117293757057777 -0.07013167641635372]
 [1.0 0.0 0.0 0.566708506533324 0.633562432710455]
 [0.0 0.0 1.0 -0.30453019390224867 -0.30786617274297867]
 [0.0 0.0 1.0 -1.9018011447007988 -1.420463615551582]
 [1.0 0.0 0.0 1.1475343068237058 1.232653363453549]
 [0.0 1.0 0.0 1.4379472069688968 1.5749910381638885]
 [1.0 0.0 0.0 -0.7401495441200351 -0.5646194287757332]]


In [35]:
print (X_test)

[[0.0 1.0 0.0 -1.4661817944830124 -0.9069571034860727]
 [1.0 0.0 0.0 -0.44973664397484414 0.2056403393225306]]


In the code above, `fit_transform` is applied to the training data (`X_train[:, 3:]`), but only `transform` is applied to the test data (`X_test[:, 3:]`). This is because:

1. **Fit**: the `fit` part of `fit_transform` computes the mean and standard deviation of each feature from the training data only.

2. **Transform**: `transform` then standardizes the data with those stored values. Calling it on the test data reuses the training mean and standard deviation, so the test data is scaled in exactly the same way as the training data.

Fitting the scaler on the training data alone prevents data leakage: no statistic of the test set influences the scaling, so the model is still evaluated on unseen data.
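
A sketch that makes the reuse of training statistics explicit. The arrays hold the age and salary columns of the split as printed before scaling; `manual` is an illustrative name for the same computation done by hand:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Age and salary columns of X_train and X_test as printed before scaling.
train = np.array([[38.77777777777778, 52000.0], [40.0, 63777.77777777778],
                  [44.0, 72000.0], [38.0, 61000.0], [27.0, 48000.0],
                  [48.0, 79000.0], [50.0, 83000.0], [35.0, 58000.0]])
test = np.array([[30.0, 54000.0], [37.0, 67000.0]])

scaler = StandardScaler()
scaler.fit(train)   # learns mean_ and scale_ from the training rows only

# transform applies the stored training statistics; the test set's own mean is never used.
scaled_test = scaler.transform(test)
manual = (test - scaler.mean_) / scaler.scale_

print(np.allclose(scaled_test, manual))
print(scaled_test)
```

The scaled test rows match the last two columns of the `X_test` printout above.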

In [39]:
## if we use normalization instead (kept commented out; standardization is used above)

# from sklearn.preprocessing import MinMaxScaler
#
# scaler = MinMaxScaler()
# X_train[:, 3:] = scaler.fit_transform(X_train[:, 3:])
# X_test[:, 3:] = scaler.transform(X_test[:, 3:])