## Standardization + LDA Pipeline

In this approach, we use a `Pipeline` to first **standardize the dataset** using `StandardScaler` and then apply **Linear Discriminant Analysis (LDA)**.

### Why this approach?
- **StandardScaler** rescales each input feature to zero mean and unit variance, so no feature dominates merely because of its units.
- **LDA** is both a classifier and a supervised dimensionality reduction technique: it projects the data onto directions that maximize the separation between classes, yielding at most `n_classes - 1` components.
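
To make the two roles above concrete, here is a minimal sketch on synthetic data. The data from `make_classification`, the names `X_demo` and `y_demo`, and the artificial rescaling of one column are all assumptions for illustration; none of this is the diabetes dataset used below.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption): 200 samples, 8 features, 2 classes.
X_demo, y_demo = make_classification(
    n_samples=200, n_features=8, n_informative=4, n_redundant=0, random_state=0
)
# Put one feature on a very different scale to make the rescaling visible.
X_demo[:, 0] *= 1000.0

# StandardScaler: every column ends up with mean ~0 and standard deviation ~1.
X_scaled = StandardScaler().fit_transform(X_demo)
print("column means:", np.round(X_scaled.mean(axis=0), 6))
print("column stds: ", np.round(X_scaled.std(axis=0), 6))

# LDA as a dimensionality reducer: at most (n_classes - 1) components,
# so a two-class problem is projected onto a single axis.
lda = LinearDiscriminantAnalysis()
X_projected = lda.fit(X_scaled, y_demo).transform(X_scaled)
print("projected shape:", X_projected.shape)  # (200, 1)

# LDA as a classifier: the same fitted object also predicts class labels.
print("training accuracy:", lda.score(X_scaled, y_demo))
```

Because the diabetes problem is also binary, LDA used as a transformer there would likewise produce a single discriminant axis.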

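Wrapping both steps in a pipeline keeps the process clean and reproducible: during cross-validation the scaler's statistics are learned only from the training folds, which prevents information from the held-out fold from leaking into preprocessing.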
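
As a rough illustration of what cross-validating a pipeline involves, the sketch below re-fits a fresh copy of the whole pipeline on each training fold by hand and checks that the scores match `cross_val_score`. The synthetic data and the names `pipe`, `cv`, `auto_scores`, and `manual_scores` are assumptions for this example only.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption, not the diabetes dataset).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=4, n_redundant=0, random_state=1
)

pipe = Pipeline([
    ("standardize", StandardScaler()),
    ("lda", LinearDiscriminantAnalysis()),
])
cv = KFold(n_splits=5)

# Scores computed by scikit-learn.
auto_scores = cross_val_score(pipe, X_demo, y_demo, cv=cv)

# The same evaluation written out by hand: for every fold, an unfitted copy
# of the pipeline (scaler included) is fit on the training rows only and
# then scored on the held-out rows.
manual_scores = []
for train_idx, test_idx in cv.split(X_demo):
    fold_model = clone(pipe)
    fold_model.fit(X_demo[train_idx], y_demo[train_idx])
    manual_scores.append(fold_model.score(X_demo[test_idx], y_demo[test_idx]))

print("scores match:", np.allclose(auto_scores, manual_scores))
```

Scaling the full dataset once before splitting would instead let the held-out rows influence the scaler's mean and variance.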


In [1]:
# Import dependencies, load the dataset, and separate features from the target
from pandas import read_csv
from sklearn.model_selection import KFold, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

url = 'https://raw.githubusercontent.com/erojaso/MLMasteryEndToEnd/master/data/pima-indians-diabetes.data.csv'
column_names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = read_csv(url, names=column_names)
# Convert to a NumPy array: columns 0-7 are the input features, column 8 is the class label
array = data.values
X = array[:, 0:8]
Y = array[:, 8]

# Define the pipeline: standardize the features, then fit LDA
model = Pipeline([
    ('standardize', StandardScaler()),
    ('lda', LinearDiscriminantAnalysis()),
])

# Evaluate the pipeline using 10-fold cross-validation
kfold = KFold(n_splits=10)
results = cross_val_score(model, X, Y, cv=kfold)
print("LDA Pipeline Mean Accuracy:", results.mean())


LDA Pipeline Mean Accuracy: 0.773462064251538


## Feature Union: PCA + SelectKBest + Logistic Regression

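In this pipeline, we combine the outputs of **two feature-engineering techniques** using `FeatureUnion`:
- **PCA (Principal Component Analysis):** An unsupervised feature extraction technique that projects the data onto a lower-dimensional linear subspace capturing maximum variance (3 components here).
- **SelectKBest:** A supervised feature selector that keeps the top `k` original features ranked by a univariate statistical test (here `k=6`, using scikit-learn's default ANOVA F-test, `f_classif`).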

Then we pass the combined features into a **Logistic Regression** classifier.
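
The combination step can be sketched as follows: `FeatureUnion` fits each transformer on the same input and stacks the results side by side. The synthetic data and the names `X_demo`, `y_demo`, `union`, and `X_manual` are assumptions for illustration; the `n_components=3` and `k=6` settings mirror the ones used in this section.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.pipeline import FeatureUnion

# Synthetic stand-in data (an assumption): 200 samples, 8 features.
X_demo, y_demo = make_classification(
    n_samples=200, n_features=8, n_informative=4, n_redundant=0, random_state=0
)

union = FeatureUnion([
    ("pca", PCA(n_components=3)),                     # 3 derived columns
    ("select_best", SelectKBest(f_classif, k=6)),     # 6 original columns
])

# The target is passed through because SelectKBest is supervised.
X_union = union.fit_transform(X_demo, y_demo)
print(X_union.shape)  # (200, 9)

# The same result built by hand: run each transformer separately and
# concatenate the outputs column-wise, in the order they were listed.
X_manual = np.hstack([
    PCA(n_components=3).fit_transform(X_demo),
    SelectKBest(f_classif, k=6).fit_transform(X_demo, y_demo),
])
print("same as manual hstack:", np.allclose(X_union, X_manual))
```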

### Why this is useful:
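- Feature extraction and feature selection capture different aspects of the data: PCA summarizes variance across all features, while SelectKBest keeps the individual features most related to the target.
- **FeatureUnion** fits both transformers on the same input and concatenates their outputs column-wise, giving 3 + 6 = 9 columns here.
- The combined model may offer better generalization than using either PCA or SelectKBest alone, though this should be checked by cross-validation rather than assumed.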
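
One way to check the last point is to cross-validate each feature step on its own next to the union. The sketch below does this on synthetic data; the helper `make_model`, the `candidates` dictionary, and the data itself are hypothetical names and assumptions for illustration, and the scores it prints say nothing about the diabetes dataset.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import FeatureUnion, Pipeline

# Synthetic stand-in data (an assumption, not the diabetes dataset).
X_demo, y_demo = make_classification(
    n_samples=400, n_features=8, n_informative=4, n_redundant=0, random_state=0
)

def make_model(feature_step):
    """Hypothetical helper: put a feature step in front of logistic regression."""
    return Pipeline([
        ("features", feature_step),
        ("logistic", LogisticRegression(solver="liblinear")),
    ])

candidates = {
    "pca_only": PCA(n_components=3),
    "select_best_only": SelectKBest(k=6),
    "feature_union": FeatureUnion([
        ("pca", PCA(n_components=3)),
        ("select_best", SelectKBest(k=6)),
    ]),
}

# Same 10-fold setup for every candidate so the means are comparable.
cv = KFold(n_splits=10)
scores = {
    name: cross_val_score(make_model(step), X_demo, y_demo, cv=cv).mean()
    for name, step in candidates.items()
}
for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```

Whether the union comes out ahead depends on the data, so the comparison is worth repeating on the real dataset before drawing conclusions.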


In [2]:
# Reuses X, Y, Pipeline, KFold, and cross_val_score from the previous cell
from sklearn.pipeline import FeatureUnion
from sklearn.linear_model import LogisticRegression
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest

# Define FeatureUnion with PCA and SelectKBest
feature_union = FeatureUnion([
    ('pca', PCA(n_components=3)),
    ('select_best', SelectKBest(k=6)),
])

# Create pipeline with feature union and logistic regression
model = Pipeline([
    ('feature_union', feature_union),
    ('logistic', LogisticRegression(solver='liblinear')),
])

# Evaluate the pipeline
kfold = KFold(n_splits=10)
results = cross_val_score(model, X, Y, cv=kfold)
print("FeatureUnion Pipeline Mean Accuracy:", results.mean())

FeatureUnion Pipeline Mean Accuracy: 0.7760423786739576
