
# Tutorial 7a: Features

COMP309-2024-T2

Marcus Frean

*with thanks to Baligh Al-Helali (PhD from VUW, 2021)*

----

## transformers
The feature-processing steps in this tutorial are done using **transformers**
(not to be confused with the neural network architecture of the same name).

In scikit-learn, a transformer is an object whose main methods are:
- `transformer.fit()`: learn the transformation's parameters from data
- `transformer.transform()`: apply the learned transformation to data
- `transformer.fit_transform()`: do both in one call on the same data

Note that the analysis and fitting (training) are based only on the training dataset.
After that, the learned transformations are applied to the test data.
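
As a minimal sketch of the fit/transform split, the toy example below uses made-up numbers (not the tutorial's dataset) to show that `fit` learns statistics from the training rows and that `transform` reuses them on the test rows without refitting.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up data: one feature, three training rows and two test rows
train = np.array([[1.0], [2.0], [3.0]])
test = np.array([[2.0], [4.0]])

scaler = StandardScaler()
scaler.fit(train)                        # learns mean and standard deviation from train only
print(scaler.mean_, scaler.scale_)

train_scaled = scaler.transform(train)   # apply the learned statistics to train
test_scaled = scaler.transform(test)     # reuse the same statistics on test; no refitting
print(test_scaled.ravel())

# fit_transform gives the same result as fit followed by transform on the same data
same = np.allclose(train_scaled, StandardScaler().fit_transform(train))
print(same)
```

The test rows are scaled with the training mean and standard deviation, so they need not end up with mean 0 and variance 1 themselves.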


In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

## Load the dataset

In [None]:
df=sns.load_dataset('iris')
df.head()

In [None]:
df.describe()

In [None]:
# Separate the features from the target variable
X = df[df.columns[:-1]]   # every column except the last; read "-1" as "the last one"
y = df[df.columns[-1]]    # the last column (species)

In [None]:
# split the data
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)

In [None]:
#Check before standardization
X_train.head()

## **PCA**

### Standardize the Data

PCA is affected by scale: you should give each of the features in your data a similar scale (mean = 0 and variance = 1) before applying PCA.
We will use `StandardScaler` to standardize our dataset’s features.
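
To illustrate why scale matters, here is a rough sketch on made-up data (two independent random features, one with a much larger spread). The data and the seed are assumptions chosen for illustration only. Without scaling, the first principal component is dominated by the large-scale feature; after standardization the two features contribute roughly equally.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Made-up data: two independent features on very different scales
X = np.column_stack([rng.normal(0, 100, size=500),
                     rng.normal(0, 1, size=500)])

# Fraction of the total variance captured by each principal component
ratio_raw = PCA(n_components=2).fit(X).explained_variance_ratio_

X_ss = StandardScaler().fit_transform(X)
ratio_scaled = PCA(n_components=2).fit(X_ss).explained_variance_ratio_

print('unscaled:', ratio_raw)
print('scaled:  ', ratio_scaled)

# After standardization every feature has mean 0 and variance 1
print(X_ss.mean(axis=0).round(6), X_ss.var(axis=0).round(6))
```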

In [None]:
#Now let's apply the 1-to-1 "StandardScaler" transformer (one scaled feature per original feature)
#1) import the module
from sklearn.preprocessing import StandardScaler

#2) define the model
scaler=StandardScaler()

#3) fit the model
scaler.fit(X_train)

#4) transform the data
X_train_ss = scaler.transform(X_train)

# note 3 and 4 could be combined like this:
X_train_ss = scaler.fit_transform(X_train)

### Run PCA

In [None]:
#Now let's perform PCA
#Steps are similar to those for the scaler transformer
#1) import the module
from sklearn.decomposition import PCA

#2) define the model
pca = PCA(n_components=2)   # n_components means the pca transformation constructs this many features

#3) fit the model
pca.fit(X_train_ss)

#4) transform the data
pca_train = pca.transform(X_train_ss)

# Again, 3 and 4 could be combined
pca_train = pca.fit_transform(X_train_ss)

# print the output, which is a matrix of only two features
pca_train[:10,:]
# ALT: plt.scatter(pca_train[:,0],pca_train[:,1])

### Visualising the results

Plotting the transformed data directly is possible because it has only two dimensions.

In [None]:
#format and visualise the transformed training data
df_pca_train = pd.DataFrame(data = pca_train, columns = ['pc1', 'pc2'])
df_pca_train['species'] = y_train.values   # .values: y_train keeps its shuffled original index, which would misalign
sns.scatterplot(x='pc1', y='pc2', hue='species', data=df_pca_train);

### Transform the test data

Note: here we only apply the already-fitted transformers to the test data, so there is no fitting step.
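
One way to see that no fitting happens at this stage is to reproduce `pca.transform` by hand from quantities stored during `fit`. The sketch below assumes scikit-learn's bundled iris data (`load_iris`) as a stand-in for the seaborn copy, and the default `whiten=False` setting of `PCA`.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Assumption: scikit-learn's bundled iris data stands in for the seaborn copy
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)

# Both transformers are fitted on the training data only
scaler = StandardScaler().fit(X_train)
pca = PCA(n_components=2).fit(scaler.transform(X_train))

# Transform the test data with the fitted objects
X_test_ss = scaler.transform(X_test)
pca_test = pca.transform(X_test_ss)

# The same projection by hand: centre with the training mean,
# then project onto the components learned from the training data
manual = (X_test_ss - pca.mean_) @ pca.components_.T
print(np.allclose(pca_test, manual))
```

Everything on the right-hand side of `manual` was learned from the training set; the test data only passes through it.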


In [None]:
#1- First, apply the scaler that was fitted on the training data to scale the test data
X_test_ss = scaler.transform(X_test)

#2- Second, apply the pca transformation that was fitted on the training data to transform the scaled test data
pca_test = pca.transform(X_test_ss)

### Classification

Let's try using the original features only to do classification, and then see if things get better with the new features.

In [None]:
# Performing classification based on the original data
from sklearn.metrics import accuracy_score
from sklearn.ensemble import RandomForestClassifier

classifier = RandomForestClassifier(random_state=0)
#classifier=SVC()   # alternative; needs `from sklearn.svm import SVC`

classifier.fit(X_train, y_train)
score = accuracy_score(y_test, classifier.predict(X_test))   # signature is (y_true, y_pred)
print('Accuracy before transformation  = {:.2f}'.format(score))

In [None]:
# Performing classification using the pca-based transformed data
classifier.fit(pca_train, y_train)
score = accuracy_score(y_test, classifier.predict(pca_test))
print('Accuracy after PCA transformation  = {:.2f}'.format(score))

## ICA

The steps are very similar to those for the scaler and PCA transformers.

There are in fact several ICA algorithms. We will use scikit-learn's `FastICA`.
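
As a rough sketch of what `FastICA` returns, the example below fits it on standardized iris data and checks one property of the output: the estimated components are uncorrelated with each other, because the algorithm whitens the data before rotating it. It assumes scikit-learn's bundled `load_iris` as a stand-in for the seaborn copy, and fits on all rows purely for illustration; `random_state=0` is an arbitrary choice to make the run repeatable.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import FastICA
from sklearn.preprocessing import StandardScaler

# Assumption: scikit-learn's bundled iris data stands in for the seaborn copy
X, _ = load_iris(return_X_y=True)
X_ss = StandardScaler().fit_transform(X)

# FastICA starts from a random initialisation, so fix the seed for repeatability
ica = FastICA(n_components=2, random_state=0)
components = ica.fit_transform(X_ss)

print(components.shape)                       # one row per sample, one column per component
corr = np.corrcoef(components, rowvar=False)  # correlation between the two components
print(abs(corr[0, 1]))
```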

In [None]:
from sklearn.decomposition import FastICA
ica = FastICA(n_components=2)

ica_train = ica.fit_transform(X_train_ss)  # nb. we already did the scaling, above

In [None]:
# Visualisation
df_ica_train = pd.DataFrame(data = ica_train, columns = ['ic1', 'ic2'])
df_ica_train['species'] = y_train.values   # .values avoids misalignment with y_train's shuffled index
sns.scatterplot(x='ic1', y='ic2', hue='species', data=df_ica_train);

In [None]:
# Performing classification using the ica-based transformed data
classifier.fit(ica_train, y_train)
# Transform test data using ica (already fitted on the training data)
ica_test = ica.transform(X_test_ss)
score = accuracy_score(y_test, classifier.predict(ica_test))
print('Accuracy after ICA transformation  = {:.2f}'.format(score))

## GP transformers

In [None]:
# Might need to install the package for genetic programming (gp)
!pip install gplearn

In [None]:
#This package does not work when the target variable is a string, so an encoder is used to convert it to integers
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
label_encoded = le.fit_transform(y_train)
label_encoded
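
For reference, a small self-contained sketch of what `LabelEncoder` does, using a made-up list of labels rather than `y_train`: each distinct string is mapped to an integer according to the sorted order of the classes, and `inverse_transform` maps the integers back.

```python
from sklearn.preprocessing import LabelEncoder

# Made-up labels for illustration
labels = ['virginica', 'setosa', 'versicolor', 'setosa']

le = LabelEncoder()
codes = le.fit_transform(labels)

print(list(le.classes_))                  # classes are stored in sorted order
print(list(codes))                        # integer code = position in classes_
print(list(le.inverse_transform(codes)))  # recovers the original strings
```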

In [None]:
#Now let's apply genetic programming.
#Steps are similar to those for the scaler, pca, and ica transformers
from gplearn.genetic import SymbolicTransformer
gp = SymbolicTransformer(n_components=2)
gp.fit(X_train_ss, label_encoded)
gp_train = gp.transform(X_train_ss)

In [None]:
# Visualisation using the gp-based transformed data
df_gp_train = pd.DataFrame(data = gp_train, columns = ['gp1', 'gp2'])
df_gp_train['species'] = y_train.values   # .values avoids misalignment with y_train's shuffled index
sns.scatterplot(x='gp1', y='gp2', hue='species', data=df_gp_train);

In [None]:
# Transform the test data using gp,
# then perform classification using the gp-based transformed data
gp_test = gp.transform(X_test_ss)
df_gp_test = pd.DataFrame(data = gp_test, columns = ['gp1', 'gp2'])
df_gp_test['species'] = y_test.values   # .values avoids misalignment with y_test's shuffled index
classifier.fit(gp_train, y_train)
score = accuracy_score(y_test, classifier.predict(gp_test))
print('Accuracy after GP transformation  = {:.2f}'.format(score))

---
