# DATA PREPROCESSING

## 1.1 Exploratory Data Analysis (EDA)
**Exploratory Data Analysis (EDA)** is the step in which a dataset is examined and summarized to understand its main characteristics, patterns, and relationships before any modeling is applied. Through EDA, analysts can identify outliers, detect missing values, explore distributions, and decide which modeling and analysis strategies are appropriate. Typical tasks include:

- Handling Duplicated Values
- Handling Noisy Values
- Handling Missing Values
- Balancing the Dataset (if the Task Is Classification)
- Summary Statistics
- Correlation Analysis
- Handling Inconsistent Data Entry
- Data Visualization
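
Several of the checks listed above can be sketched in a few lines of pandas. The snippet below is a minimal illustration on a made-up table; the `toy` DataFrame and its column names are assumptions for this example and are not the dataset used later in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, used only to illustrate the checks.
toy = pd.DataFrame({
    "city": ["Paris", "paris ", "Berlin", "Berlin", "Madrid"],
    "age": [31.0, 31.0, np.nan, 45.0, 45.0],
    "label": ["yes", "yes", "no", "no", "no"],
})

# Inconsistent data entry: normalize case and surrounding whitespace.
toy["city"] = toy["city"].str.strip().str.title()

# Duplicated rows (after normalization, rows 0 and 1 are identical).
n_duplicates = int(toy.duplicated().sum())
toy = toy.drop_duplicates().reset_index(drop=True)

# Missing values per column.
missing_per_column = toy.isnull().sum()

# Summary statistics for the numeric columns.
summary = toy.describe()

# Class balance for a classification target.
class_share = toy["label"].value_counts(normalize=True)

print(n_duplicates)
print(missing_per_column)
print(class_share)
```

Normalizing the text column before looking for duplicates matters here: without it, `"Paris"` and `"paris "` would be treated as different values and the duplicate row would go unnoticed.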

## 1.2 Feature Engineering
**Feature engineering** is the process of creating and selecting relevant features from raw data in order to improve a model's predictive power. Domain knowledge and creativity are applied to transform the original data into a representation better suited to machine learning algorithms. Common techniques include:

- **Feature extraction:** Deriving new features from existing ones by applying mathematical transformations, aggregations, or statistical measures. For example, extracting the month and year from a date variable or calculating ratios between numeric variables.

- **Feature encoding:** Converting categorical or textual variables into numerical representations that machine learning algorithms can process. This can be done through techniques like one-hot encoding, label encoding, or embedding.

- **Feature scaling:** Standardizing or normalizing numerical features so that they have similar scales and ranges. This helps algorithms that are sensitive to the magnitude of the variables, such as regularized linear regression, k-nearest neighbors, or K-means clustering.

- **Feature selection:** Identifying the most informative features that contribute significantly to the predictive power of the model while discarding irrelevant or redundant ones. This reduces computational complexity and improves generalization.

- **Feature combination:** Creating new features by combining existing ones, often through arithmetic operations or interactions between variables. For example, multiplying two variables to capture their interaction effect.

Effective feature engineering can greatly impact the performance of machine learning models, leading to improved accuracy, reduced overfitting, and better interpretability. It requires a deep understanding of the data, problem domain, and the algorithms being used. Through iterative experimentation and evaluation, feature engineering enables the extraction of the most relevant and informative features that empower models to make accurate predictions.
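
As a small illustration of feature extraction and feature combination, the sketch below derives date parts, an interaction term, and a ratio. The `orders` table and all of its column names are hypothetical and chosen only for this example.

```python
import pandas as pd

# Hypothetical order data used only for illustration.
orders = pd.DataFrame({
    "order_date": pd.to_datetime(["2023-01-15", "2023-06-30", "2024-02-01"]),
    "price": [20.0, 35.0, 50.0],
    "quantity": [2, 1, 4],
    "income": [40000.0, 70000.0, 50000.0],
})

# Feature extraction: pull the month and year out of a date column.
orders["order_month"] = orders["order_date"].dt.month
orders["order_year"] = orders["order_date"].dt.year

# Feature combination: product of two variables (an interaction term).
orders["total_spend"] = orders["price"] * orders["quantity"]

# Feature extraction: a ratio between two numeric variables.
orders["spend_to_income"] = orders["total_spend"] / orders["income"]

print(orders)
```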

### **Dataset**
The Product Purchase Dataset is a small sample dataset used to understand consumer behavior and build models to predict whether a user is likely to purchase a product based on demographic and financial information.
##### **Objective**
To predict the Purchased status (Yes / No) based on the customer’s country, age, and salary.

The dataset consists of the following four attributes:

- **Country:** Country of the customer (e.g., France, Spain, Germany)
- **Age:** Age of the customer (in years)
- **Salary:** Estimated annual salary of the customer
- **Purchased:**  Whether the customer purchased the product (Yes / No).

### Import Essential Libraries



In [None]:
pip install scikit-learn


In [None]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler, OrdinalEncoder, OneHotEncoder
from sklearn.impute import SimpleImputer

Load dataset

In [None]:
df = pd.read_csv('./Sample_Data.csv')


In [None]:
df

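|  | Country | Age | Salary | Purchased |
| --- | --- | --- | --- | --- |
| 0 | France | 44.0 | 72000.0 | No |
| 1 | Spain | 27.0 | 48000.0 | Yes |
| 2 | Germany | 30.0 | 54000.0 | No |
| 3 | Spain | 38.0 | 61000.0 | No |
| 4 | Germany | 40.0 | NaN | Yes |
| 5 | France | 35.0 | 58000.0 | Yes |
| 6 | Spain | NaN | 52000.0 | No |
| 7 | France | 48.0 | 79000.0 | Yes |
| 8 | Germany | 50.0 | 83000.0 | No |
| 9 | France | 37.0 | 67000.0 | Yes |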


Correlation analysis between the numerical columns      

In [None]:
df.corr(numeric_only=True)


|  | Age | Salary |
| --- | --- | --- |
| Age | 1.0 | 0.982495 |
| Salary | 0.982495 | 1.0 |


Check for missing values

In [None]:
df.isnull().sum()


Country      0
Age          1
Salary       1
Purchased    0
dtype: int64

Impute missing values with a constant value. Here every missing entry is replaced with 0.

In [None]:
Imputer = SimpleImputer(missing_values=np.nan,strategy='constant',fill_value=0)


Fill the missing values in the Age and Salary columns (positions 1 and 2) with the constant value defined above.

In [None]:
df.iloc[:,1:3]=Imputer.fit_transform(df.iloc[:,1:3])


In [None]:
df

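|  | Country | Age | Salary | Purchased |
| --- | --- | --- | --- | --- |
| 0 | France | 44.0 | 72000.0 | No |
| 1 | Spain | 27.0 | 48000.0 | Yes |
| 2 | Germany | 30.0 | 54000.0 | No |
| 3 | Spain | 38.0 | 61000.0 | No |
| 4 | Germany | 40.0 | 0.0 | Yes |
| 5 | France | 35.0 | 58000.0 | Yes |
| 6 | Spain | 0.0 | 52000.0 | No |
| 7 | France | 48.0 | 79000.0 | Yes |
| 8 | Germany | 50.0 | 83000.0 | No |
| 9 | France | 37.0 | 67000.0 | Yes |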


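Impute Missing Values with Mean. Note that the missing entries were already replaced with 0 in the previous step, so no NaN values remain and this imputer leaves the data unchanged unless the dataset is reloaded first.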

In [None]:
Imputer=SimpleImputer(missing_values=np.nan,strategy='mean')


Apply Mean Imputation

In [None]:
df.iloc[:,1:3]=Imputer.fit_transform(df.iloc[:,1:3])
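
Because `SimpleImputer` only fills entries that are still NaN, the effect of mean imputation is easiest to see on the Age and Salary values from the first table, before the zero fill. The sketch below rebuilds those two columns by hand; the names `raw`, `mean_imputer`, and `filled` are illustrative and not part of the notebook above.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Age and Salary as shown in the first table, before any imputation.
raw = pd.DataFrame({
    "Age": [44, 27, 30, 38, 40, 35, np.nan, 48, 50, 37],
    "Salary": [72000, 48000, 54000, 61000, np.nan,
               58000, 52000, 79000, 83000, 67000],
})

mean_imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
filled = pd.DataFrame(mean_imputer.fit_transform(raw), columns=raw.columns)

# The fitted column means are stored in statistics_.
column_means = dict(zip(raw.columns, mean_imputer.statistics_))

print(column_means)
print(filled.loc[[4, 6]])  # the two rows that had a missing value
```

Each missing entry is replaced by the mean of the observed values in its own column, so the imputed Age is the average of the nine known ages and the imputed Salary is the average of the nine known salaries.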




In [None]:
MinMax=MinMaxScaler()
df.iloc[:,1:3]=MinMax.fit_transform(df.iloc[:,1:3])

In [None]:
df

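|  | Country | Age | Salary | Purchased |
| --- | --- | --- | --- | --- |
| 0 | France | 0.88 | 0.86747 | No |
| 1 | Spain | 0.54 | 0.578313 | Yes |
| 2 | Germany | 0.6 | 0.650602 | No |
| 3 | Spain | 0.76 | 0.73494 | No |
| 4 | Germany | 0.8 | 0.0 | Yes |
| 5 | France | 0.7 | 0.698795 | Yes |
| 6 | Spain | 0.0 | 0.626506 | No |
| 7 | France | 0.96 | 0.951807 | Yes |
| 8 | Germany | 1.0 | 1.0 | No |
| 9 | France | 0.74 | 0.807229 | Yes |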
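
Min-Max scaling maps each column to the 0-1 range with `(x - min) / (max - min)`. As a quick check, the sketch below applies that formula by hand to the zero-imputed Age and Salary values and compares the result with `MinMaxScaler`; the array `values` is simply retyped from the table shown earlier.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Age and Salary after the constant (zero) imputation shown earlier.
values = np.array([
    [44.0, 72000.0], [27.0, 48000.0], [30.0, 54000.0], [38.0, 61000.0],
    [40.0, 0.0], [35.0, 58000.0], [0.0, 52000.0], [48.0, 79000.0],
    [50.0, 83000.0], [37.0, 67000.0],
])

# Min-Max scaling by hand: (x - min) / (max - min), column by column.
col_min = values.min(axis=0)
col_max = values.max(axis=0)
manual = (values - col_min) / (col_max - col_min)

# The same transformation through scikit-learn.
scaled = MinMaxScaler().fit_transform(values)

print(np.allclose(manual, scaled))
print(manual[0])
```

Because the imputed zeros are the column minimums, they anchor the bottom of the range and compress the genuine values into roughly the upper half of the 0-1 interval.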


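The Min-Max step above scaled the Age and Salary columns to the 0-1 range. The next cell standardizes the same columns with StandardScaler, giving each a mean of 0 and unit variance.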

In [None]:
St = StandardScaler()
df.iloc[:,1:3]=St.fit_transform(df.iloc[:,1:3])

In [None]:
df

|  | Country | Age | Salary | Purchased |
| --- | --- | --- | --- | --- |
| 0 | France | 0.673262 | 0.66197 | No |
| 1 | Spain | -0.58448 | -0.4262 | Yes |
| 2 | Germany | -0.362526 | -0.154157 | No |
| 3 | Spain | 0.229353 | 0.163225 | No |
| 4 | Germany | 0.377323 | -2.602539 | Yes |
| 5 | France | 0.007398 | 0.027204 | Yes |
| 6 | Spain | -2.58207 | -0.244838 | No |
| 7 | France | 0.969201 | 0.979353 | Yes |
| 8 | Germany | 1.117171 | 1.160714 | No |
| 9 | France | 0.155368 | 0.435268 | Yes |
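
Standardization computes `z = (x - mean) / std` for each column. The sketch below reproduces the Age column by hand from the Min-Max-scaled values and compares it with `StandardScaler`, which uses the population standard deviation (`ddof=0`). The array `age` is retyped from the earlier table.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# The Min-Max-scaled Age column from the earlier table.
age = np.array([[0.88], [0.54], [0.60], [0.76], [0.80],
                [0.70], [0.00], [0.96], [1.00], [0.74]])

# z = (x - mean) / std, with the population standard deviation (ddof=0).
manual = (age - age.mean(axis=0)) / age.std(axis=0, ddof=0)

# The same transformation through scikit-learn.
standardized = StandardScaler().fit_transform(age)

print(np.allclose(manual, standardized))
print(standardized.ravel().round(6))
```

Standardizing the raw ages directly would give the same result, because Min-Max scaling is a positive linear rescaling and standardization removes any such shift and scale.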


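Calculate the skewness of the numerical columns to check for asymmetry. A negative value indicates a longer left tail; here the left tail comes from the entries that were imputed with 0.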

In [None]:
df.skew(numeric_only=True)

Age      -1.753019
Salary   -1.760196
dtype: float64
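
To see how much of this skewness comes from the imputed zero, the sketch below compares the Age column with the zero fill against the same column with the missing value left as NaN. Skewness is unchanged by positive linear rescaling, so the raw ages can be used directly; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

# Age after zero imputation (row 6 was missing and became 0).
age_zero_filled = pd.Series([44, 27, 30, 38, 40, 35, 0, 48, 50, 37], dtype=float)

# The same column with the missing value left as NaN.
age_with_nan = age_zero_filled.replace(0, np.nan)

skew_zero_filled = age_zero_filled.skew()
skew_observed_only = age_with_nan.skew()  # NaN is skipped by default

print(skew_zero_filled)
print(skew_observed_only)
```

The zero-filled series should reproduce the Age skewness reported above, while the observed values alone are far closer to symmetric. This suggests the asymmetry is an artifact of the constant imputation rather than a property of the customers' ages.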

Calculate the kurtosis to measure the tailedness of the numerical columns.
Kurtosis is a statistical measure that describes the shape of a data distribution, particularly how heavy its tails are. Whereas skewness measures the asymmetry of a distribution, kurtosis reflects the presence and size of outliers. pandas reports excess kurtosis (Fisher's definition), so a normal distribution scores 0 and positive values indicate heavier tails.

In [None]:
df.kurtosis(numeric_only=True)


Age       4.014398
Salary    4.301502
dtype: float64
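
As a rough reference point for reading these numbers, the sketch below computes the kurtosis of two simulated samples, one normal and one heavy-tailed (Laplace). The samples, the sample size, and the seed are assumptions made for illustration only.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

normal_sample = pd.Series(rng.normal(size=200_000))
heavy_tailed_sample = pd.Series(rng.laplace(size=200_000))

# pandas reports excess kurtosis: about 0 for a normal distribution.
normal_kurtosis = normal_sample.kurtosis()

# A Laplace distribution has heavier tails (theoretical excess kurtosis of 3).
heavy_tailed_kurtosis = heavy_tailed_sample.kurtosis()

print(normal_kurtosis)
print(heavy_tailed_kurtosis)
```

Against that baseline, values around 4 for ten data points point to a single extreme entry per column, which is consistent with the zeros introduced by the constant imputation.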

Encode the categorical columns (Country and Purchased) as integers. Ordinal encoding converts categorical (non-numeric) data into numeric values by assigning a unique integer to each category. It is most suitable when the categories have an inherent order or ranking: for example, “low,” “medium,” and “high” can be represented as 0, 1, and 2, because there is an order from low to high. Country has no such order, so the integers assigned here (by default, in sorted order of the labels) are arbitrary codes rather than a ranking.

In [None]:
Ordinal = OrdinalEncoder()
df.iloc[:,[0,3]]=Ordinal.fit_transform(df.iloc[:,[0,3]])
df.index

RangeIndex(start=0, stop=10, step=1)
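
When the categories do have a ranking, the order can be passed to `OrdinalEncoder` through its `categories` parameter. The sketch below uses a hypothetical `level` column that mirrors the low/medium/high example; it is not part of the dataset in this notebook.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical ordered feature, mirroring the low/medium/high example.
levels = pd.DataFrame({"level": ["medium", "low", "high", "low"]})

# Without `categories`, OrdinalEncoder sorts the labels alphabetically
# (high=0, low=1, medium=2), which loses the intended ranking.
default_codes = OrdinalEncoder().fit_transform(levels).ravel()

# Passing the order explicitly gives low=0, medium=1, high=2.
ordered_encoder = OrdinalEncoder(categories=[["low", "medium", "high"]])
ordered_codes = ordered_encoder.fit_transform(levels).ravel()

print(default_codes)
print(ordered_codes)
```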

Apply one-hot encoding to convert each categorical column into one binary column per category.

In [None]:
ct = OneHotEncoder(sparse_output=False)
x=pd.DataFrame(ct.fit_transform(df.iloc[:,[0,3]]))
x.index=df.index
x

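|  | 0 | 1 | 2 | 3 | 4 |
| --- | --- | --- | --- | --- | --- |
| 0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 1 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 |
| 2 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 |
| 3 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 |
| 4 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 |
| 5 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 6 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 |
| 7 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 8 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 |
| 9 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 |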


Apply a ColumnTransformer that one-hot encodes the Country column (position 0) and passes the remaining columns through unchanged.



In [None]:
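from sklearn.compose import ColumnTransformer
# OneHotEncoder was already imported at the top of the notebook.
# One-hot encode column 0 (Country); pass the remaining columns through unchanged.
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [0])], remainder='passthrough')
# Note: this replaces the DataFrame with a NumPy array.
df = np.array(ct.fit_transform(df))
print(df)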

[[1.0 0.0 0.0 0.6732618473780672 0.6619698596295541 0.0]
 [0.0 0.0 1.0 -0.5844800653062344 -0.4261997726382066 1.0]
 [0.0 1.0 0.0 -0.36252561012665196 -0.1541573645712663 0.0]
 [0.0 0.0 1.0 0.22935293701890186 0.16322544484016374 0.0]
 [0.0 1.0 0.0 0.3773225738052904 -2.6025390371737283 1.0]
 [1.0 0.0 0.0 0.0073984818393194275 0.02720424080669361 1.0]
 [0.0 0.0 1.0 -2.5820701619224784 -0.24483816726024626 0.0]
 [1.0 0.0 0.0 0.9692011209508438 0.9793526690409846 1.0]
 [0.0 1.0 0.0 1.1171707577372323 1.160714274418945 0.0]
 [1.0 0.0 0.0 0.15536811862570757 0.435267852907104 1.0]]
