# Everything but the Model
> An outline of data loading, feature extraction, feature selection, model selection, and model evaluation.

- toc: true 
- badges: true
- comments: true
- categories: [jupyter]


# Background

Most of the attention with regard to machine learning goes to the models, with good reason. We are at the point where we can reasonably model any $X\to y$ relationship we wish. Difficult classification and clustering problems are increasingly tractable with clever approaches in statistical and rules-based learning. Even if we have no clue how to begin developing such an approach (or such an approach would be intractable or biased), we can train deep neural networks, which are universal approximators, to represent in theory almost any $f(x) = y$ relationship by stacking layers of the form $Wx + b$ with properly trained weights $W$.

Still, humans are likely to be of use for the foreseeable future, and one of the more useful domains is the preparation of data for such modeling. This includes topics such as feature selection and feature extraction, and how we might select the best model based on a rigorous, generalizable evaluation metric.

In this post, I will outline some of these useful topics with the hope that you will be able to tackle most everything about machine learning but the machine itself.

## Example Dataset

I will use the COVID-19 Cell Atlas' nasal immunodeficiency swab dataset (https://www.sanger.ac.uk/group/vento-tormo-group/) for the examples to follow. This is a recent dataset of real patients. I will use the ```scanpy``` package to load the data and take a ```pandas``` dataframe from it to feed into ```scikit-learn```.

For the response variable, I will use ```Vasoactive agents required during hospitalization```, a proxy for severity of symptoms and infection.

In [46]:
import scanpy

dataset = scanpy.read_h5ad('../data/2021-01-09/Immunodeficiency_Nasal_swabs.h5ad')
y = dataset.obs['Vasoactive agents required during hospitalization']
X = dataset.obs.drop(columns=['Vasoactive agents required during hospitalization'])

dataset.obs.head()

Unnamed: 0,CellType,log2p1_RNAcount,nFeature_RNA,MT_fraction,Viral Molecules,Lab number,Donor Id,Age,Sex,Race,...,Pre-existing Hypertension,Pre-existing immunocompromised condition,Smoking,SARS-CoV-2 PCR,SARS-CoV-2 Ab,Symptomatic,Admitted to hospital,Highest level of respiratory support,Vasoactive agents required during hospitalization,28-day death
GW1_AAACGGGAGCTAGTCT-1,Secretary epithelium,14.909096,5687,0.042059,0,CV19-1-S3.2A,GWAS_1,18,F,White,...,No,Yes,Never or unknown,Positive,,Yes,Yes,Mechanical ventilation with intubation,Yes,No
GW1_AAAGTAGTCCTAGGGC-1,Secretary epithelium KRT5,13.611947,3967,0.097771,0,CV19-1-S3.2A,GWAS_1,18,F,White,...,No,Yes,Never or unknown,Positive,,Yes,Yes,Mechanical ventilation with intubation,Yes,No
GW1_AACACGTCAGCGTCCA-1,Ciliated epithelium,9.366322,513,0.036419,0,CV19-1-S3.2A,GWAS_1,18,F,White,...,No,Yes,Never or unknown,Positive,,Yes,Yes,Mechanical ventilation with intubation,Yes,No
GW1_AACCATGAGAATCTCC-1,Secretary epithelium,15.217731,6260,0.05372,0,CV19-1-S3.2A,GWAS_1,18,F,White,...,No,Yes,Never or unknown,Positive,,Yes,Yes,Mechanical ventilation with intubation,Yes,No
GW1_AACCATGCATCCTTGC-1,Ciliated epithelium,9.134426,439,0.016043,0,CV19-1-S3.2A,GWAS_1,18,F,White,...,No,Yes,Never or unknown,Positive,,Yes,Yes,Mechanical ventilation with intubation,Yes,No


In [50]:
X.columns

Index(['CellType', 'log2p1_RNAcount', 'nFeature_RNA', 'MT_fraction',
       'Viral Molecules', 'Lab number', 'Donor Id', 'Age', 'Sex', 'Race',
       'Ethnicity', 'BMI', 'Pre-existing heart disease',
       'Pre-existing lung disease', 'Pre-existing kidney disease',
       'Pre-existing diabetes', 'Pre-existing Hypertension',
       'Pre-existing immunocompromised condition', 'Smoking', 'SARS-CoV-2 PCR',
       'Symptomatic', 'Admitted to hospital',
       'Highest level of respiratory support', '28-day death'],
      dtype='object')

This is a good toy dataset. We have a mixture of categorical, continuous-valued, integer-valued, and string-valued features, as well as a clean binary ```Yes``` or ```No``` response variable.

#  Wrangling

A few short operations will make life easier later on.

## Dealing with NaN (Missing Values)

You can see in the ```SARS-CoV-2 Ab``` column that we have NaN values. Although classifier implementations may have built-in accommodations, it may be best to deal with these values in a way we can fully control.

In [36]:
print(X['SARS-CoV-2 Ab'])

GW1_AAACGGGAGCTAGTCT-1     NaN
GW1_AAAGTAGTCCTAGGGC-1     NaN
GW1_AACACGTCAGCGTCCA-1     NaN
GW1_AACCATGAGAATCTCC-1     NaN
GW1_AACCATGCATCCTTGC-1     NaN
                          ... 
GW13_TTTCCTCCAAGCCTAT-1    NaN
GW13_TTTCCTCCAAGTCTGT-1    NaN
GW13_TTTGTCAAGCCCAATT-1    NaN
GW13_TTTGTCAGTAGGACAC-1    NaN
GW13_TTTGTCATCGTGTAGT-1    NaN
Name: SARS-CoV-2 Ab, Length: 4936, dtype: category
Categories (3, object): ['Not done' < 'Negative' < 'Positive']


First, though I suspect this is just an indicator variable to show whether or not the patient received an antibody test, we can look at unique values to be sure.

In [39]:
X['SARS-CoV-2 Ab'].unique()

[NaN]
Categories (0, object): []

In this case, let's just drop the column. Since every value is missing, it is uninformative with regard to the response variable ```y = Vasoactive agents required during hospitalization```.

In [49]:
X = X.drop(columns=['SARS-CoV-2 Ab'])
print(X.shape,dataset.obs.shape, sep='\n')

(4936, 24)
(4936, 26)


As expected.
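The same check can be made for every column at once. The sketch below uses a small hypothetical frame (`toy`, with made-up columns `a`, `b`, `c`) rather than the real data; it reports the fraction of missing values per column and drops only the columns that are missing entirely.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: column "b" is entirely missing, "c" is partly missing.
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [np.nan, np.nan, np.nan, np.nan],
    "c": [0.5, np.nan, 1.5, 2.0],
})

# Fraction of missing values in each column.
nan_fraction = toy.isna().mean()
print(nan_fraction)

# Drop only the columns in which every value is missing.
toy_clean = toy.dropna(axis=1, how="all")
print(toy_clean.columns.tolist())
```

Columns that are only partly missing, like `c` here, are kept and would still need a decision, such as imputation or dropping the affected rows.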

## Dealing with Categorical Features

We will need to encode categorical features numerically: nominal features with a one-hot representation and ordinal features with an order-preserving encoding. We can easily exclude numerical-valued columns from this process. We should also binarize the $y$ labels.

In [93]:
# select_dtypes is the public pandas API for picking out numeric columns
numer_cols = list(X.select_dtypes(include='number').columns)
cat_cols = list(set(X.columns) - set(numer_cols))
print(f'numerical columns:\n{numer_cols}\n\ncategorical columns:\n {cat_cols}')

numerical columns:
['log2p1_RNAcount', 'nFeature_RNA', 'MT_fraction', 'Viral Molecules']

categorical columns:
 ['Race', 'BMI', 'Pre-existing immunocompromised condition', 'Pre-existing Hypertension', 'Donor Id', 'Pre-existing lung disease', 'Pre-existing diabetes', 'Age', '28-day death', 'Admitted to hospital', 'Lab number', 'Smoking', 'Ethnicity', 'Symptomatic', 'Pre-existing kidney disease', 'Sex', 'Highest level of respiratory support', 'SARS-CoV-2 PCR', 'CellType', 'Pre-existing heart disease']


In [101]:
X[cat_cols].head(3)

Unnamed: 0,Race,BMI,Pre-existing immunocompromised condition,Pre-existing Hypertension,Donor Id,Pre-existing lung disease,Pre-existing diabetes,Age,28-day death,Admitted to hospital,Lab number,Smoking,Ethnicity,Symptomatic,Pre-existing kidney disease,Sex,Highest level of respiratory support,SARS-CoV-2 PCR,CellType,Pre-existing heart disease
GW1_AAACGGGAGCTAGTCT-1,White,30.0-39.9 (obese),Yes,No,GWAS_1,No,No,18,No,Yes,CV19-1-S3.2A,Never or unknown,Not Hispanic or Latino,Yes,No,F,Mechanical ventilation with intubation,Positive,Secretary epithelium,No
GW1_AAAGTAGTCCTAGGGC-1,White,30.0-39.9 (obese),Yes,No,GWAS_1,No,No,18,No,Yes,CV19-1-S3.2A,Never or unknown,Not Hispanic or Latino,Yes,No,F,Mechanical ventilation with intubation,Positive,Secretary epithelium KRT5,No
GW1_AACACGTCAGCGTCCA-1,White,30.0-39.9 (obese),Yes,No,GWAS_1,No,No,18,No,Yes,CV19-1-S3.2A,Never or unknown,Not Hispanic or Latino,Yes,No,F,Mechanical ventilation with intubation,Positive,Ciliated epithelium,No


With the exception of BMI, this looks fine. Since we have a BMI range, we should use an ordinal encoder in this case. The rest are simply categorical.

In [104]:
ord_cols = ['BMI']
cat_cols = [i for i in cat_cols if i not in ord_cols]
X[cat_cols].head(3)

Unnamed: 0,Race,Pre-existing immunocompromised condition,Pre-existing Hypertension,Donor Id,Pre-existing lung disease,Pre-existing diabetes,Age,28-day death,Admitted to hospital,Lab number,Smoking,Ethnicity,Symptomatic,Pre-existing kidney disease,Sex,Highest level of respiratory support,SARS-CoV-2 PCR,CellType,Pre-existing heart disease
GW1_AAACGGGAGCTAGTCT-1,White,Yes,No,GWAS_1,No,No,18,No,Yes,CV19-1-S3.2A,Never or unknown,Not Hispanic or Latino,Yes,No,F,Mechanical ventilation with intubation,Positive,Secretary epithelium,No
GW1_AAAGTAGTCCTAGGGC-1,White,Yes,No,GWAS_1,No,No,18,No,Yes,CV19-1-S3.2A,Never or unknown,Not Hispanic or Latino,Yes,No,F,Mechanical ventilation with intubation,Positive,Secretary epithelium KRT5,No
GW1_AACACGTCAGCGTCCA-1,White,Yes,No,GWAS_1,No,No,18,No,Yes,CV19-1-S3.2A,Never or unknown,Not Hispanic or Latino,Yes,No,F,Mechanical ventilation with intubation,Positive,Ciliated epithelium,No
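The ordinal encoding of the BMI column that was set aside might look like the following sketch. Only the `30.0-39.9 (obese)` label appears in the output above; the other bin labels and the `toy_bmi` frame are hypothetical, so the real category list should be read from `X['BMI'].unique()` first.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical BMI bins, listed from lowest to highest. Only the last label
# is taken from the data shown above; the others are assumptions.
bmi_order = [
    "<18.5 (underweight)",
    "18.5-24.9 (normal)",
    "25.0-29.9 (overweight)",
    "30.0-39.9 (obese)",
]

toy_bmi = pd.DataFrame({
    "BMI": [
        "30.0-39.9 (obese)",
        "18.5-24.9 (normal)",
        "25.0-29.9 (overweight)",
        "30.0-39.9 (obese)",
    ]
})

# Passing the categories explicitly fixes which integer each bin receives,
# so that the integer order matches the BMI order.
ordinal_encoder = OrdinalEncoder(categories=[bmi_order])
bmi_codes = ordinal_encoder.fit_transform(toy_bmi[["BMI"]])
print(bmi_codes.ravel())
```

A label such as `Unknown` has no natural place in this order, so it would need its own decision, for example treating it as missing, before encoding.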


Now we can encode these properly using ```scikit-learn``` or a built-in ```pandas``` method; both are shown below. While we could use an ordinal encoder, whereby each class is mapped to an integer, ```scikit-learn``` estimators expect continuous input and would interpret those integers as ordered. For nominal features we should therefore use one-hot encoding, which imposes no such order. Missing values need care here: depending on the encoder and its settings, a NaN may be ignored or treated as a distinct category. It's a good thing we dropped that NaN column!

In [126]:
import pandas as pd
pd.get_dummies(X[cat_cols]).head()

Unnamed: 0,Race_White,Race_Black,Race_Asian,Race_Other,Pre-existing immunocompromised condition_No,Pre-existing immunocompromised condition_Yes,Pre-existing Hypertension_No,Pre-existing Hypertension_Yes,Donor Id_GWAS_1,Donor Id_GWAS_10,...,CellType_Secretary epithelium KRT5,CellType_Squamous epithelium 1,CellType_Squamous epithelium 2,CellType_Ciliated epithelium,CellType_Neutrophil,CellType_Erythrocytes,CellType_Low quality,CellType_filtered cells and doublets,Pre-existing heart disease_No,Pre-existing heart disease_Yes
GW1_AAACGGGAGCTAGTCT-1,1,0,0,0,0,1,1,0,1,0,...,0,0,0,0,0,0,0,0,1,0
GW1_AAAGTAGTCCTAGGGC-1,1,0,0,0,0,1,1,0,1,0,...,1,0,0,0,0,0,0,0,1,0
GW1_AACACGTCAGCGTCCA-1,1,0,0,0,0,1,1,0,1,0,...,0,0,0,1,0,0,0,0,1,0
GW1_AACCATGAGAATCTCC-1,1,0,0,0,0,1,1,0,1,0,...,0,0,0,0,0,0,0,0,1,0
GW1_AACCATGCATCCTTGC-1,1,0,0,0,0,1,1,0,1,0,...,0,0,0,1,0,0,0,0,1,0
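How ```pd.get_dummies``` handles a missing value depends on its `dummy_na` argument. The sketch below uses a hypothetical one-column frame (`toy`) to show both settings; it is an illustration, not output from the real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column with one missing value.
toy = pd.DataFrame({"Sex": ["F", "M", np.nan]})

# Default (dummy_na=False): the NaN row gets zeros in every dummy column.
default_dummies = pd.get_dummies(toy)

# dummy_na=True adds an explicit indicator column for missing values.
with_nan_column = pd.get_dummies(toy, dummy_na=True)

print(default_dummies)
print(with_nan_column)
```

With the default setting a missing value silently becomes an all-zero row, which is easy to overlook, so it is worth checking for NaN before encoding.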


In [129]:
from sklearn.preprocessing import OneHotEncoder

encoder = OneHotEncoder()

# fit_transform returns a SciPy sparse matrix; convert it to a dense array
# before wrapping it in a DataFrame
transformed_data = encoder.fit_transform(X[cat_cols]).toarray()

# name the new columns after their source feature and category
# (get_feature_names_out requires scikit-learn >= 1.0)
encoded_data = pd.DataFrame(
    transformed_data,
    index=X.index,
    columns=encoder.get_feature_names_out(cat_cols),
)

# now concatenate the original data and the encoded data
concatenated_data = pd.concat([X[cat_cols], encoded_data], axis=1)
concatenated_data


In [97]:
# the numerical columns need no encoding
X[numer_cols]

Unnamed: 0,log2p1_RNAcount,nFeature_RNA,MT_fraction,Viral Molecules
GW1_AAACGGGAGCTAGTCT-1,14.909096,5687,0.042059,0
GW1_AAAGTAGTCCTAGGGC-1,13.611947,3967,0.097771,0
GW1_AACACGTCAGCGTCCA-1,9.366322,513,0.036419,0
GW1_AACCATGAGAATCTCC-1,15.217731,6260,0.053720,0
GW1_AACCATGCATCCTTGC-1,9.134426,439,0.016043,0
...,...,...,...,...
GW13_TTTCCTCCAAGCCTAT-1,10.832890,870,0.017005,0
GW13_TTTCCTCCAAGTCTGT-1,11.065416,1001,0.015406,0
GW13_TTTGTCAAGCCCAATT-1,13.670767,3477,0.032975,0
GW13_TTTGTCAGTAGGACAC-1,9.147205,313,0.044170,0


In [95]:
y.head()

GW1_AAACGGGAGCTAGTCT-1    Yes
GW1_AAAGTAGTCCTAGGGC-1    Yes
GW1_AACACGTCAGCGTCCA-1    Yes
GW1_AACCATGAGAATCTCC-1    Yes
GW1_AACCATGCATCCTTGC-1    Yes
Name: Vasoactive agents required during hospitalization, dtype: category
Categories (2, object): ['No' < 'Yes']
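Binarizing the labels can be done with a simple comparison. The sketch below uses a hypothetical stand-in series (`toy_y`) with the same ordered `No` < `Yes` categories as `y`; mapping `Yes` to 1 is an assumed convention.

```python
import pandas as pd

# Toy stand-in for y, using the same ordered 'No' < 'Yes' categories.
toy_y = pd.Series(
    pd.Categorical(["Yes", "No", "Yes", "Yes"],
                   categories=["No", "Yes"], ordered=True),
    name="Vasoactive agents required during hospitalization",
)

# Map 'Yes' -> 1 and 'No' -> 0.
y_binary = (toy_y == "Yes").astype(int)
print(y_binary.tolist())
```

The same expression applied to the real `y` would be `(y == 'Yes').astype(int)`.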

In [80]:
# index with a list, not a set, and use the numer_cols defined earlier
X[list(set(X.columns) - set(numer_cols))]

Unnamed: 0,Race,BMI,Pre-existing immunocompromised condition,Pre-existing Hypertension,Donor Id,Pre-existing lung disease,Pre-existing diabetes,Age,28-day death,Admitted to hospital,Lab number,Smoking,Ethnicity,Symptomatic,Pre-existing kidney disease,Sex,Highest level of respiratory support,SARS-CoV-2 PCR,CellType,Pre-existing heart disease
GW1_AAACGGGAGCTAGTCT-1,White,30.0-39.9 (obese),Yes,No,GWAS_1,No,No,18,No,Yes,CV19-1-S3.2A,Never or unknown,Not Hispanic or Latino,Yes,No,F,Mechanical ventilation with intubation,Positive,Secretary epithelium,No
GW1_AAAGTAGTCCTAGGGC-1,White,30.0-39.9 (obese),Yes,No,GWAS_1,No,No,18,No,Yes,CV19-1-S3.2A,Never or unknown,Not Hispanic or Latino,Yes,No,F,Mechanical ventilation with intubation,Positive,Secretary epithelium KRT5,No
GW1_AACACGTCAGCGTCCA-1,White,30.0-39.9 (obese),Yes,No,GWAS_1,No,No,18,No,Yes,CV19-1-S3.2A,Never or unknown,Not Hispanic or Latino,Yes,No,F,Mechanical ventilation with intubation,Positive,Ciliated epithelium,No
GW1_AACCATGAGAATCTCC-1,White,30.0-39.9 (obese),Yes,No,GWAS_1,No,No,18,No,Yes,CV19-1-S3.2A,Never or unknown,Not Hispanic or Latino,Yes,No,F,Mechanical ventilation with intubation,Positive,Secretary epithelium,No
GW1_AACCATGCATCCTTGC-1,White,30.0-39.9 (obese),Yes,No,GWAS_1,No,No,18,No,Yes,CV19-1-S3.2A,Never or unknown,Not Hispanic or Latino,Yes,No,F,Mechanical ventilation with intubation,Positive,Ciliated epithelium,No
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
GW13_TTTCCTCCAAGCCTAT-1,White,Unknown,Yes,No,GWAS_11,No,No,50,No,Yes,CV19-11-S3.2A,Never or unknown,Hispanic or Latino,Yes,No,F,Mechanical ventilation with intubation,Positive,Tcells CD8,No
GW13_TTTCCTCCAAGTCTGT-1,White,Unknown,Yes,No,GWAS_11,No,No,50,No,Yes,CV19-11-S3.2A,Never or unknown,Hispanic or Latino,Yes,No,F,Mechanical ventilation with intubation,Positive,Tcells CD8,No
GW13_TTTGTCAAGCCCAATT-1,White,Unknown,Yes,No,GWAS_11,No,No,50,No,Yes,CV19-11-S3.2A,Never or unknown,Hispanic or Latino,Yes,No,F,Mechanical ventilation with intubation,Positive,Secretary epithelium,No
GW13_TTTGTCAGTAGGACAC-1,White,Unknown,Yes,No,GWAS_11,No,No,50,No,Yes,CV19-11-S3.2A,Never or unknown,Hispanic or Latino,Yes,No,F,Mechanical ventilation with intubation,Positive,Squamous epithelium 1,No


# Feature Rescaling

It may be prudent to rescale features, especially when they are not all on the same scale. Rescaling keeps features with large numerical ranges from biasing downstream tasks. For instance, take the following features from our $X$: ```log2p1_RNAcount``` and ```nFeature_RNA```.

In [57]:
X[['log2p1_RNAcount', 'nFeature_RNA']].head()

Unnamed: 0,log2p1_RNAcount,nFeature_RNA
GW1_AAACGGGAGCTAGTCT-1,14.909096,5687
GW1_AAAGTAGTCCTAGGGC-1,13.611947,3967
GW1_AACACGTCAGCGTCCA-1,9.366322,513
GW1_AACCATGAGAATCTCC-1,15.217731,6260
GW1_AACCATGCATCCTTGC-1,9.134426,439


In an extraction process (such as PCA, which makes use of covariance), or when using a classifier that relies on Euclidean distance, the feature with the largest numerical range will naturally carry the most weight.

In [61]:
x_1 = X['log2p1_RNAcount']
x_2 = X['nFeature_RNA']

range_1 = x_1.max()-x_1.min()
range_2 = x_2.max()-x_2.min()

print(range_1, range_2, sep='\n')

12.476429788752128
9651


```nFeature_RNA``` therefore has a much more significant bearing on an outcome contingent upon this range.
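To make this concrete, the sketch below computes the Euclidean distance between two cells from the table above (the first and third rows) and the share of the squared distance that each feature contributes. The variable names are illustrative only.

```python
import numpy as np

# Two cells from the table above: (log2p1_RNAcount, nFeature_RNA).
cell_a = np.array([14.909096, 5687.0])
cell_b = np.array([9.366322, 513.0])

diff = cell_a - cell_b
distance = np.sqrt(np.sum(diff ** 2))

# Share of the squared distance contributed by each feature.
contribution = diff ** 2 / np.sum(diff ** 2)

print(distance)
print(contribution)
```

Almost all of the distance comes from ```nFeature_RNA```, even though the two cells also differ noticeably in ```log2p1_RNAcount```.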

We can therefore rescale in one of a few ways:

- 1) with min/max rescaling
    - $x_i' = \frac{x_i - \min(x)}{\max(x) - \min(x)}$
    - simple; maps every feature onto $[0, 1]$ while preserving the relative spacing of its values. Useful for image pixel intensity, for instance.
- 2) with $z$-score normalization
    - $x_i' = \frac{x_i - \bar{x}}{\sigma}$, the number of standard deviations from the mean (the $z$-score)
    - standardizes features so that each has $\mu = 0,\sigma^2 = 1$. This is a better choice than min/max for things like PCA, since there we want to select components maximizing the variance of the feature matrix without getting caught up in the scale of that variance.

- 3) with median and interquartile range rescaling (robust rescaling).
    - Remove the median value and scale according to interquartile range.
    - better choice if there are significant outliers.
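The three options above can be written out directly with NumPy. This is a minimal sketch on a hypothetical feature vector `x` with one large outlier; in ```scikit-learn``` the corresponding transformers are `MinMaxScaler`, `StandardScaler`, and `RobustScaler`, which apply the same idea column by column.

```python
import numpy as np

# Hypothetical feature values with one large outlier.
x = np.array([1.0, 2.0, 3.0, 4.0, 100.0])

# 1) min/max rescaling onto [0, 1]
x_minmax = (x - x.min()) / (x.max() - x.min())

# 2) z-score normalization (using the population standard deviation)
x_zscore = (x - x.mean()) / x.std()

# 3) robust rescaling: subtract the median, divide by the interquartile range
q1, q3 = np.percentile(x, [25, 75])
x_robust = (x - np.median(x)) / (q3 - q1)

print(x_minmax, x_zscore, x_robust, sep="\n")
```

With min/max rescaling the outlier squeezes the other four values into a narrow band near 0, whereas the robust version keeps them evenly spread and leaves only the outlier far from the rest.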

Let's go with 3), somewhat arbitrarily, but also as a safeguard against outliers.

In [62]:
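from sklearn import preprocessing

robust_scaler = preprocessing.RobustScaler()

# RobustScaler only accepts numeric input; calling it on all of X fails with
# "ValueError: could not convert string to float: 'Secretary epithelium'",
# so we scale the numerical columns only
X_scaled = robust_scaler.fit_transform(X[numer_cols])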

# Feature Extraction

Feature extraction is the process of reducing dimensionality to find latent features in a given feature set. This can be done in a variety of ways for a variety of use cases. As for why, let's say we have a dataset with a massive number of features, such that training a network to make use of all the features somewhat equally. That is, let
