# Demonstrate some common data preprocessing steps : Version 1

## Importing the libraries

In [490]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

## Importing the dataset (typically CSV format)

To make things simple, we usually structure the dataset so that the **target variable (dependent variable)** column is the last column in the table, and all the preceding columns are **feature variable (independent variable)** columns.

* **X typically denotes the feature variables**, which in this case will be all the columns in the table except the last one
* **y typically denotes the single target variable**, which in this case is the last column in the table

Note that X is a DataFrame and y is a Series, so we can use all the standard methods of these two core pandas objects.

In [491]:
df  = pd.read_csv('sample-data-proprocessing-v1.csv')
df.index = df.index + 1 # Start index from 1 instead of 0, just to make it easier to interpret data
X = df.iloc[:, :-1].copy()  # All columns except the last; copy() so that later edits to X do not act on a slice of df
y = df.iloc[:, -1]          # The last column

In [492]:
print("The feature variable columns in X are")
print (X)
print (type(X))

The feature variable columns in X are
    Country   Age   Salary    Cost  Days
1     Spain  21.0  11000.0     120     3
2    France  45.0  32000.0     330     7
3     Spain  43.0  60000.0     510    15
4    France  40.0  80000.0     910     8
5   Germany  74.0  59000.0     520     5
6   Germany   NaN  92000.0     800   500
7    France  51.0  43000.0     420     6
8    France  74.0      NaN     720     8
9    France  73.0  25000.0     930    15
10    Spain  65.0  85000.0     410    13
11    Spain  44.0  94000.0     620    12
12  Germany  25.0  22000.0    -200     9
13  Germany  75.0  52000.0     740     4
14    Spain  34.0  15000.0     870    19
15  Germany   NaN  54000.0     370     6
16   France  48.0  31000.0     610     7
17   France  58.0  80000.0     280    11
18  Germany  32.0  56000.0  200000     8
19    Spain  34.0  51000.0     330   900
20   France  55.0  59000.0     630     5
21    Spain  50.0  54000.0     340    10
22  Germany  62.0      NaN     680     7
23   France  44.0  

In [493]:
print ("The target variable column values are : ")
print(y)
print (type(y))

The target variable column values are : 
1     Yes
2      No
3     Yes
4      No
5     Yes
6     Yes
7      No
8      No
9     Yes
10     No
11    Yes
12     No
13     No
14    Yes
15     No
16     No
17    Yes
18    Yes
19     No
20     No
21     No
22    Yes
23    Yes
24     No
25    Yes
26     No
27     No
28    Yes
29    Yes
30     No
Name: Purchased, dtype: object
<class 'pandas.core.series.Series'>


## Checking for missing data and performing imputation if necessary

Handling missing values is one of the most important aspects of data cleansing in the ML life cycle.

We first check for missing data in the numeric columns of X (Age, Salary, Cost and Days).

If there are any, we can choose from many available methods to handle the missing values.

A simple approach is to impute the missing cells with the **mean, median or mode of all the other values in that column**.
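
The choice between mean, median and mode matters when a column contains extreme values. The following is a minimal sketch on a hypothetical toy column (the values are made up for illustration and are not taken from the CSV file) that compares the three fill values:

```python
import numpy as np
import pandas as pd

# Hypothetical toy column with one missing value and one extreme value
s = pd.Series([10.0, 20.0, 20.0, 1000.0, np.nan])

fill_values = {
    "mean": s.mean(),      # pulled upwards by the extreme value
    "median": s.median(),  # robust to the extreme value
    "mode": s.mode()[0],   # most frequent value
}

for name, value in fill_values.items():
    filled = s.fillna(value)
    print(f"{name:>6}: fill value = {value}, imputed column = {filled.tolist()}")
```

Here the mean is dominated by the single extreme value, while the median and mode stay close to the typical values, which is one reason to deal with outliers as well as missing values.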

In [494]:
# Check for missing data in the numeric columns ('Age', 'Salary', 'Cost' and 'Days')
missing_data_check = X[['Age', 'Salary', 'Cost', 'Days']].isnull().sum()
print("Number of missing values in the numeric columns : ")
print (missing_data_check)
print (type(missing_data_check))

Number of missing values in the numeric columns : 
Age       2
Salary    2
Cost      0
Days      0
dtype: int64
<class 'pandas.core.series.Series'>


In [495]:
# If there are missing values present in any of the numeric columns
if missing_data_check.any():

    for column in ['Age', 'Salary', 'Cost', 'Days']:
        if missing_data_check[column] > 0:

            # Impute missing data with the mean of the existing values in the respective column.
            # The result is assigned back to the column (instead of using inplace=True on X[column])
            # to avoid chained-assignment warnings in recent pandas versions.
            X[column] = X[column].fillna(X[column].mean())

            # You can also impute with the median of the existing values in the respective column
            #X[column] = X[column].fillna(X[column].median())

            # You can also impute with the mode of the existing values in the respective column
            #mode_value = X[column].mode()[0]
            #X[column] = X[column].fillna(mode_value)

            # Round the imputed values to the nearest integer
            # Not necessary, but helps in displaying the column values neatly
            X[column] = X[column].round()

# Display modified X dataframe after imputation
print("Feature variables X after imputation:\n", X)

Feature variables X after imputation:
     Country   Age   Salary    Cost  Days
1     Spain  21.0  11000.0     120     3
2    France  45.0  32000.0     330     7
3     Spain  43.0  60000.0     510    15
4    France  40.0  80000.0     910     8
5   Germany  74.0  59000.0     520     5
6   Germany  52.0  92000.0     800   500
7    France  51.0  43000.0     420     6
8    France  74.0  55107.0     720     8
9    France  73.0  25000.0     930    15
10    Spain  65.0  85000.0     410    13
11    Spain  44.0  94000.0     620    12
12  Germany  25.0  22000.0    -200     9
13  Germany  75.0  52000.0     740     4
14    Spain  34.0  15000.0     870    19
15  Germany  52.0  54000.0     370     6
16   France  48.0  31000.0     610     7
17   France  58.0  80000.0     280    11
18  Germany  32.0  56000.0  200000     8
19    Spain  34.0  51000.0     330   900
20   France  55.0  59000.0     630     5
21    Spain  50.0  54000.0     340    10
22  Germany  62.0  55107.0     680     7
23   France  44.0 

## Function to detect outliers and impute them with mean of non-outliers

There are many **custom algorithms** available for detecting outliers based on statistical techniques, each with its own strengths and weaknesses.

The range of values considered normal, outside of which a value is treated as an outlier, is **highly subjective**.

Some algorithms will miss extremely low or extremely high outliers, while others may mistake slightly high or low values within the normal range for outliers.

This particular algorithm, the 1.5 × IQR rule implemented below, will pick up extremely high value outliers but may miss low value outliers: in the output further down, the Cost value of -200 in row 12 is not flagged.

Replacing the outlier values with the mean of the non-outlier values, however, is straightforward.
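
To see why a low value can slip through, it helps to print the bounds themselves. The sketch below applies the same 1.5 × IQR rule to a hypothetical toy column (the values are chosen for illustration and are not the full Cost column from the CSV file):

```python
import pandas as pd

# Hypothetical toy column: one negative value and one extremely large value
cost = pd.Series([-200, 120, 330, 420, 510, 520, 720, 800, 910, 200000])

q1 = cost.quantile(0.25)
q3 = cost.quantile(0.75)
iqr = q3 - q1

# Values outside [lower_bound, upper_bound] are treated as outliers
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

flagged = cost[(cost < lower_bound) | (cost > upper_bound)]

print(f"Q1 = {q1}, Q3 = {q3}, IQR = {iqr}")
print(f"Normal range: [{lower_bound}, {upper_bound}]")
print("Flagged as outliers:", flagged.tolist())
```

In this toy column the lower bound falls below -200, so the negative value is not flagged even though a negative cost is implausible. A domain rule (for example, "Cost must not be negative") would be needed to catch it.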




In [496]:

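def impute_outliers(df, column):
    """Replace outliers in df[column] (1.5 * IQR rule) with the mean of the non-outlier values; modifies df in place."""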
    # Calculate the first and third quartile
    Q1 = df[column].quantile(0.25)
    Q3 = df[column].quantile(0.75)
    IQR = Q3 - Q1
    
    # Define bounds for outliers
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    
    # Check if there are any outliers
    outliers = ((df[column] < lower_bound) | (df[column] > upper_bound))
    outlier_count = outliers.sum()
    
    if outlier_count > 0:

        print (f"Number of outliers detecting for column '{column}' is {outlier_count}")
        
        # Calculate mean of non-outlier values
        non_outlier_mean = int(df[~outliers][column].mean())
        
        # Replace outliers with the mean of non-outlier values
        df.loc[outliers, column] = non_outlier_mean
        print(f"Outliers detected and imputed in column '{column}' with mean value {non_outlier_mean}")
    else:
        print(f"No outliers detected in column '{column}'")



In [497]:
# Check and impute outliers for columns 'Cost' and 'Days'
impute_outliers(X, 'Cost')
impute_outliers(X, 'Days')

Number of outliers detecting for column 'Cost' is 1
Outliers detected and imputed in column 'Cost' with mean value 523
Number of outliers detecting for column 'Days' is 2
Outliers detected and imputed in column 'Days' with mean value 9


In [498]:
print ("Feature variables after detection and imputation of outliers")
print (X)

Feature variables after detection and imputation of outliers
    Country   Age   Salary  Cost  Days
1     Spain  21.0  11000.0   120     3
2    France  45.0  32000.0   330     7
3     Spain  43.0  60000.0   510    15
4    France  40.0  80000.0   910     8
5   Germany  74.0  59000.0   520     5
6   Germany  52.0  92000.0   800     9
7    France  51.0  43000.0   420     6
8    France  74.0  55107.0   720     8
9    France  73.0  25000.0   930    15
10    Spain  65.0  85000.0   410    13
11    Spain  44.0  94000.0   620    12
12  Germany  25.0  22000.0  -200     9
13  Germany  75.0  52000.0   740     4
14    Spain  34.0  15000.0   870    19
15  Germany  52.0  54000.0   370     6
16   France  48.0  31000.0   610     7
17   France  58.0  80000.0   280    11
18  Germany  32.0  56000.0   523     8
19    Spain  34.0  51000.0   330     9
20   France  55.0  59000.0   630     5
21    Spain  50.0  54000.0   340    10
22  Germany  62.0  55107.0   680     7
23   France  44.0  45000.0   900     5
24 

## Perform one hot encoding on categorical variables in dataset

This converts categorical variables into numeric format, since certain ML algorithms such as linear regression require all values in the dataset used to train the ML model to be numeric.
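
As a minimal sketch of what one hot encoding produces, the following uses a hypothetical toy DataFrame (not the CSV file). It also shows why the cell below calls `.astype(int)`: depending on the pandas version, `pd.get_dummies` may return boolean columns rather than 0/1 integers.

```python
import pandas as pd

# Hypothetical toy frame, not the tutorial's CSV
toy = pd.DataFrame({
    "Country": ["Spain", "France", "Germany", "Spain"],
    "Age": [21, 45, 74, 43],
})

encoded = pd.get_dummies(toy, columns=["Country"])
print(encoded.dtypes)  # dummy columns may be bool, depending on the pandas version

encoded = encoded.astype(int)  # force 0/1 integers
print(encoded)
```

Each row has exactly one of the `Country_*` columns set to 1, and the original `Country` column is dropped.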



In [499]:
# Perform one hot encoding on the 'Country' column in X
X = pd.get_dummies(X, columns=['Country']).astype(int)

print ("Feature variables after one hot encoding on the Country column")
print(X)

Feature variables after one hot encoding on the Country column
    Age  Salary  Cost  Days  Country_France  Country_Germany  Country_Spain
1    21   11000   120     3               0                0              1
2    45   32000   330     7               1                0              0
3    43   60000   510    15               0                0              1
4    40   80000   910     8               1                0              0
5    74   59000   520     5               0                1              0
6    52   92000   800     9               0                1              0
7    51   43000   420     6               1                0              0
8    74   55107   720     8               1                0              0
9    73   25000   930    15               1                0              0
10   65   85000   410    13               0                0              1
11   44   94000   620    12               0                0              1
12   25   22000  -200    

## Perform label encoding on the target variable

The target variable must also be encoded in numeric format for many ML models


In [500]:
print ("Target variable column original values")
print (y)

Target variable column original values
1     Yes
2      No
3     Yes
4      No
5     Yes
6     Yes
7      No
8      No
9     Yes
10     No
11    Yes
12     No
13     No
14    Yes
15     No
16     No
17    Yes
18    Yes
19     No
20     No
21     No
22    Yes
23    Yes
24     No
25    Yes
26     No
27     No
28    Yes
29    Yes
30     No
Name: Purchased, dtype: object


In [501]:
from sklearn.preprocessing import LabelEncoder
label_encoder = LabelEncoder()
y = label_encoder.fit_transform(y)
print ("After encoding the Yes and No as 1 and 0")
print (y)

After encoding the Yes and No as 1 and 0
[1 0 1 0 1 1 0 0 1 0 1 0 0 1 0 0 1 1 0 0 0 1 1 0 1 0 0 1 1 0]
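
To check which label was mapped to which number, the fitted encoder can be inspected. A minimal sketch on a hypothetical toy list of labels (the variable names here are illustrative only):

```python
from sklearn.preprocessing import LabelEncoder

labels = ["Yes", "No", "Yes", "No", "Yes"]  # hypothetical toy target values

encoder = LabelEncoder()
encoded = encoder.fit_transform(labels)

# classes_ holds the original labels in sorted order; the position of each label is its code
mapping = {label: code for code, label in enumerate(encoder.classes_)}
print(mapping)
print(encoded)

# inverse_transform() converts the codes back to the original labels
print(encoder.inverse_transform(encoded))
```

Because `LabelEncoder` sorts the classes, "No" becomes 0 and "Yes" becomes 1, which matches the output above.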


## Splitting original dataset into the Training set and Test set

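This is a fundamental practice in ML and is important for:
* Evaluating model performance correctly, on data the model has not seen during training
* Detecting overfitting
* Preventing data leakage, provided later steps such as feature scaling are fitted on the training set only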



In [502]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3333, random_state = 1)

# We use a fixed value for the random_state parameter so that the split always uses the same rows for the training and test datasets
# This ensures reproducibility of this ML project, which is important for verifying performance evaluations
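
The effect of a fixed `random_state` can be checked directly. The sketch below uses hypothetical toy arrays with 30 rows (the same number of rows as the dataset here, but made-up values) and splits them twice with the same settings:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 30 rows, 2 feature columns
X_toy = np.arange(60).reshape(30, 2)
y_toy = np.arange(30) % 2

split_a = train_test_split(X_toy, y_toy, test_size=0.3333, random_state=1)
split_b = train_test_split(X_toy, y_toy, test_size=0.3333, random_state=1)

# Same random_state: the two splits select exactly the same rows
same_rows = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print("Identical splits:", same_rows)
print("Training rows:", len(split_a[0]), "Test rows:", len(split_a[1]))
```

Without a fixed `random_state`, each run may pick different rows, so evaluation results can vary from run to run.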


In [503]:
print (f"There are {len(X_train)} rows in the training dataset\n")
print ("The feature variable values are")
print (X_train)

print ("\nThe target variable values are")
print (y_train)

There are 20 rows in the training dataset

The feature variable values are
    Age  Salary  Cost  Days  Country_France  Country_Germany  Country_Spain
24   39   18000   480    14               1                0              0
5    74   59000   520     5               0                1              0
3    43   60000   510    15               0                0              1
26   51   95000   250    14               1                0              0
7    51   43000   420     6               1                0              0
19   34   51000   330     9               0                0              1
14   34   15000   870    19               0                0              1
8    74   55107   720     8               1                0              0
28   74   82000   320    14               1                0              0
2    45   32000   330     7               1                0              0
17   58   80000   280    11               1                0              0
1    21   110

In [504]:
print (f"There are {len(X_test)} rows in the test dataset\n")
print ("The feature variable values are")
print (X_test)

print ("\nThe target variable values are")
print (y_test)

There are 10 rows in the test dataset

The feature variable values are
    Age  Salary  Cost  Days  Country_France  Country_Germany  Country_Spain
18   32   56000   523     8               0                1              0
22   62   55107   680     7               0                1              0
11   44   94000   620    12               0                0              1
20   55   59000   630     5               1                0              0
15   52   54000   370     6               0                1              0
21   50   54000   340    10               0                0              1
27   46   80000   250    13               0                0              1
4    40   80000   910     8               1                0              0
25   38   33000   600    20               0                0              1
23   44   45000   900     5               1                0              0

The target variable values are
[1 1 1 0 0 0 0 0 1 1]


## Feature Scaling

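Feature scaling should be done after the train test split, not before it, to avoid data leakage: if the scaler is fitted on the full dataset, information from the test set influences the training data, which can make the evaluation look better than the model's actual performance in production.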

There are two commonly used types of feature scaling: **standardization** and **normalization**

Here we are using **standardization**, which scales the data such that the distribution of values in the feature column has a mean of 0 and a standard deviation of 1.

After standardization, most values will lie within the **range of approximately -3 to 3**, provided the feature is roughly normally distributed (±3 standard deviations cover about 99.7% of the data in a normal distribution).
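
Written out, the standardized value $z$ of an original value $x$ in a feature column is computed as follows, where $\mu$ and $\sigma$ are the mean and standard deviation of that column (the symbols are notation introduced here for illustration):

$$
z = \frac{x - \mu}{\sigma}
$$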

**fit_transform()** first performs a **fit()**, which computes the mean and standard deviation of each feature column in the training dataset and stores them in **sc**. It then performs a **transform()**, which uses the mean and standard deviation stored in **sc** to scale the training dataset.

Finally, we use the same **sc** object (rather than creating a new one) to scale the test dataset as well. This ensures that the scaling parameters (mean and standard deviation) applied to the test data are identical to those computed from the training data, which is important to prevent data leakage.

Also note that we perform feature scaling only on the columns that were originally numeric, and do not include columns with numbers that result from categorical encoding such as one hot encoding
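
The fit-on-train, transform-on-test pattern can be verified by hand. The following is a minimal sketch on a hypothetical toy feature column (made-up values, with illustrative variable names) that compares the scaler's output for the test data with a manual calculation using the training mean and standard deviation:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy feature column, split into training and test parts
train = np.array([[10.0], [20.0], [30.0], [40.0]])
test = np.array([[25.0], [50.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # fit() on the training data, then transform()
test_scaled = scaler.transform(test)        # reuses the training mean and standard deviation

# The fitted parameters come from the training data only
print("mean:", scaler.mean_, "std:", scaler.scale_)

# Same result computed by hand with the training statistics
manual_test = (test - train.mean(axis=0)) / train.std(axis=0)
print(np.allclose(test_scaled, manual_test))
```

Note that the scaled test data will generally not have a mean of exactly 0 or a standard deviation of exactly 1, since the parameters were computed from the training data.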


In [505]:
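from sklearn.preprocessing import StandardScaler
sc = StandardScaler()

# The originally numeric columns (the one hot encoded Country columns are left unscaled)
numeric_columns = ['Age', 'Salary', 'Cost', 'Days']

# Cast to float first, since the scaled values are not integers
X_train = X_train.astype({column: float for column in numeric_columns})
X_test = X_test.astype({column: float for column in numeric_columns})

# Fit and transform the training data
X_train[numeric_columns] = sc.fit_transform(X_train[numeric_columns])

# Transform the test data
X_test[numeric_columns] = sc.transform(X_test[numeric_columns])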



In [506]:
print ("Feature variables in training dataset after standardization")
print (X_train)

Feature variables in training dataset after standardization
         Age    Salary      Cost      Days  Country_France  Country_Germany  \
24 -0.864214 -1.229152 -0.050749  0.958664               1                0   
5   1.099909  0.246319  0.089249 -1.171700               0                1   
3  -0.639743  0.282306  0.054249  1.195371               0                0   
26 -0.190801  1.541854 -0.855739  0.958664               1                0   
7  -0.190801 -0.329474 -0.260747 -0.934993               1                0   
19 -1.144803 -0.041578 -0.575743 -0.224872               0                0   
14 -1.144803 -1.337113  1.314233  2.142199               0                0   
8   1.099909  0.106221  0.789240 -0.461579               1                0   
28  1.099909  1.074022 -0.610742  0.958664               1                0   
2  -0.527507 -0.725332 -0.575743 -0.698286               1                0   
17  0.202024  1.002048 -0.750740  0.248542               1             

In [507]:
print ("Feature variables in test dataset after standardization")
print (X_test)

Feature variables in test dataset after standardization
         Age    Salary      Cost      Days  Country_France  Country_Germany  \
18 -1.257039  0.138358  0.099749 -0.461579               0                1   
22  0.426495  0.106221  0.649242 -0.698286               0                1   
11 -0.583625  1.505867  0.439244  0.485250               0                0   
20  0.033671  0.246319  0.474244 -1.171700               1                0   
15 -0.134683  0.066384 -0.435744 -0.934993               0                1   
21 -0.246918  0.066384 -0.540743  0.011835               0                0   
27 -0.471390  1.002048 -0.855739  0.721957               0                0   
4  -0.808097  1.002048  1.454231 -0.461579               1                0   
25 -0.920332 -0.689345  0.369245  2.378906               0                0   
23 -0.583625 -0.257500  1.419232 -1.171700               1                0   

    Country_Spain  
18              0  
22              0  
11            