# Using regression to take advantage of ordinal data

The data relates a series of measurements or indicators to a prognosis of the health of the foetus, with two degrees of certainty.

* 'N' is normal, indicating that the foetus is healthy
* 'S' is interpreted as a 'maybe', so the foetus should be monitored
* 'P' is problematic (so I interpret this as a fully positive result) 

The outcome variable is thus an ordinal variable that expresses the same prediction with different degrees of certitude. 

I start by exploring the data and then try two different methods of taking the ordinal nature of the outcome into account.

Please comment on this approach, as I am not an expert in either data science or health.

In [None]:
%pylab inline
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns

## Data exploration

The column **fetal_health** is the outcome variable

In [None]:
df = pd.read_csv('../input/fetal-health-classification/fetal_health.csv')
df.drop_duplicates(inplace=True)

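# 'fetal_health' is the outcome (dependent) variable
target = 'fetal_health'
Xdf = df.drop([target],axis=1)
Xr = Xdf   # note: an alias, not a copy, so in-place changes to Xr also affect Xdf
yr = df[target]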


### Correlation

Several of the variables are highly correlated, and might interfere with the regression, or at least with its interpretation:

In [None]:
import re    

def initials(x,n=1,join_chars=''):
    splits = re.split('[_ ]',x)
    return join_chars.join([s[0:n] for s in splits])


corr = df.corr()

# rename the columns to have smaller labels on the horizontal axis
cr =corr.rename(columns=initials)

fig,ax=plt.subplots(1,figsize=(12,10))
cm=ax.matshow(cr,cmap='bwr',vmin=-1,vmax=1)
plt.xticks(range(cr.shape[1]), cr.columns, fontsize=12, rotation=45,ha='left')
plt.yticks(range(cr.shape[0]), cr.index, fontsize=12)
cb = plt.colorbar(cm)
cb.ax.tick_params(labelsize=14)


**Mutual information** indicates which variables are most strongly associated with the outcome variable ('fetal_health').

**Variance inflation factor** is better suited than the pairwise correlations above because it accounts for multicollinearity, i.e. a variable being predictable from a combination of several others.
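
To make the idea concrete, here is a minimal NumPy sketch of what a variance inflation factor measures: each column is regressed on all the others, and VIF = 1 / (1 - R²). The helper `vif` and the synthetic columns `a`, `b` and `c` are made up for illustration. Unlike the `statsmodels` call used in this notebook (which, as far as I know, does not add a constant), the sketch includes an intercept in each regression, so the numbers are not directly comparable.

```python
import numpy as np

def vif(X):
    """Variance inflation factor of each column of a 2-D array (illustrative helper)."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    out = np.empty(p)
    for j in range(p):
        target = X[:, j]
        # regress column j on an intercept plus all the other columns
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ beta
        r2 = 1.0 - resid.var() / target.var()
        out[j] = 1.0 / (1.0 - r2)
    return out

# synthetic data, for illustration only
rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)
c = a + b + 0.1 * rng.normal(size=500)   # nearly a linear combination of a and b

vif_independent = vif(np.column_stack([a, b]))
vif_collinear = vif(np.column_stack([a, b, c]))
print(vif_independent)   # close to 1 for unrelated columns
print(vif_collinear)     # large for all three columns
```

Note that `a` and `b` are almost uncorrelated with each other, so a pairwise correlation matrix would only flag `c`; the VIF flags all three columns once `c` is included.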

In [None]:
from sklearn.feature_selection import mutual_info_regression
mimx = pd.DataFrame({col: pd.Series(mutual_info_regression(df,df[col]),index=df.columns) for col in df.columns} )


The figure below is similar to the correlation matrix, but should also capture non-linear relations between variables.

In [None]:
fig,ax=plt.subplots(1,figsize=(12,10))
mm=ax.matshow(mimx,cmap='viridis',vmin=0,vmax=1)
plt.xticks(range(cr.shape[1]), cr.columns, fontsize=12, rotation=45,ha='left')
plt.yticks(range(cr.shape[0]), cr.index, fontsize=12)
cb = plt.colorbar(mm)
cb.ax.tick_params(labelsize=14)


In [None]:
from sklearn.feature_selection import mutual_info_classif
from statsmodels.stats.outliers_influence import variance_inflation_factor

mi = pd.Series(mutual_info_classif(Xr,yr),index=Xr.columns,name='mutual info')
vs = pd.Series({col:variance_inflation_factor(Xr.values,Xr.columns.to_list().index(col)) for col in Xr.columns},name='vif')

pd.DataFrame([mi,vs]).T

Based on the previous analysis, I'll drop the following columns, since they seem to be highly correlated with existing columns, and then re-check the VIF

In [None]:
to_drop = ['histogram_width','histogram_mode','histogram_median']
Xr.drop(to_drop,inplace=True,axis=1)

mi = pd.Series(mutual_info_classif(Xr,yr),index=Xr.columns,name='mutual info')
vs = pd.Series({col:variance_inflation_factor(Xr.values,Xr.columns.to_list().index(col)) for col in Xr.columns},name='vif')

pd.DataFrame([mi,vs]).T

# Logistic regression classifier -- baseline

First, let's naively try a logistic regression classifier using the one-vs-rest strategy.

This is not the approach to follow, since it doesn't allow the decision boundaries for the 'maybe' and the 'positive' cases to be adjusted, but it gives an idea of the precision that can be achieved.

In [None]:
from sklearn.preprocessing import StandardScaler, RobustScaler, PowerTransformer, QuantileTransformer
from sklearn.model_selection import train_test_split

from sklearn.metrics import confusion_matrix, classification_report
from sklearn.preprocessing import LabelEncoder

from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

Xtrain,Xtest,ytrain,ytest = train_test_split(Xr,yr,test_size = 0.30,stratify = yr,shuffle = True,random_state = 42)


model = Pipeline([('scaler',StandardScaler()),
                  ('classifier', OneVsRestClassifier(LogisticRegression()))])

mfit = model.fit(Xtrain,ytrain)
ypred = mfit.predict(Xtest)

print(confusion_matrix(ytest,ypred))
print()
print(classification_report(ytest,ypred))

# Ordinal regression

`mord` provides a few models of Ordinal Regression, see https://pythonhosted.org/mord/. 

There isn't much documentation on this package, but the ordinal regression appears to consist of a linear model whose output is split into classes by a set of estimated thresholds.
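
As a rough illustration of that idea, the sketch below implements the standard cumulative-logit (proportional odds) formulation, P(y <= j | x) = sigmoid(theta_j - x·w). I believe `mord.LogisticAT` belongs to this family, but the exact sign conventions are an assumption, and the function `ordinal_proba` and all the numbers are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def ordinal_proba(X, w, theta):
    """Class probabilities of a cumulative-logit model (illustrative helper).

    P(y <= j | x) = sigmoid(theta_j - x.w) for each threshold theta_j.
    """
    score = X @ w                                    # one value per sample
    cum = sigmoid(theta[None, :] - score[:, None])   # P(y <= j), shape (n, k-1)
    n = len(score)
    cum = np.column_stack([np.zeros(n), cum, np.ones(n)])
    return np.diff(cum, axis=1)                      # P(y == j), shape (n, k)

# hypothetical coefficients, thresholds and samples
w = np.array([1.0, -0.5])
theta = np.array([0.0, 2.0])     # two thresholds -> three ordered classes
X = np.array([[-2.0, 0.0],
              [ 1.0, 0.0],
              [ 4.0, 0.0]])

proba = ordinal_proba(X, w, theta)
print(proba.round(3))
print(proba.argmax(axis=1))      # index of the most likely class per sample
```

A single weight vector `w` is shared by all classes; only the thresholds differ, which is what distinguishes this from a one-vs-rest classifier.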

In [None]:
!pip install mord

In [None]:

from imblearn.over_sampling import SMOTE
import mord

Xrc = Xr.copy()


# scale data and encode categories
X = pd.DataFrame(StandardScaler().fit_transform(Xrc),columns=Xrc.columns)
le = LabelEncoder().fit(yr)
y = yr.astype('i') 

# split into test and train
Xtraini,Xtest,ytraini,ytest = train_test_split(X,y,test_size = 0.30,stratify = y,shuffle = True,random_state = 42)

#oversample training data due to the imbalance between categories
# fit_sample was renamed fit_resample in recent versions of imbalanced-learn
Xtrain, ytrain = SMOTE().fit_resample(Xtraini,ytraini)
#Xtrain, ytrain = Xtraini,ytraini

model = mord.LogisticAT(alpha=.1)
results = model.fit(Xtrain,ytrain)
ypred = (results.predict(Xtest))

print('Raw data:')
print(y.value_counts())
print('Oversampled data:')
print(ytrain.value_counts())


Like a regular Logistic Regression, the model contains coefficients that inform on the contribution of each feature to the prediction

Instead of one intercept, the model contains several intercepts that establish the limits between each level of the ordinal variable. 

In [None]:
from sklearn.metrics import roc_curve, precision_recall_curve, auc

print(confusion_matrix(ytest,ypred))
print()
print(classification_report(ytest,ypred))
print()
print(pd.Series(model.coef_,index=Xrc.columns))



The regression projects all the features onto a single dimension, and places class boundaries (the `theta_` attribute of the fitted model) along this dimension.

These boundaries can be adjusted to select a compromise between false positive rate and true positive rate.
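
A small sketch of what adjusting a boundary does, using made-up decision-function values rather than the notebook's data. The helper `predict_from_thresholds` and both threshold vectors are hypothetical; class indices 0, 1, 2 stand for 'N', 'S', 'P'.

```python
import numpy as np

def predict_from_thresholds(score, theta):
    """Ordinal class index = number of thresholds lying below the score."""
    return np.sum(score[:, None] > np.asarray(theta)[None, :], axis=1)

# hypothetical decision-function values and true class indices (0=N, 1=S, 2=P)
score = np.array([-1.5, -0.2, 0.4, 0.7, 0.9, 1.6, 2.5])
truth = np.array([0, 0, 1, 1, 2, 1, 2])

fitted = np.array([0.0, 1.0])     # thresholds as they might come out of a fit
cautious = np.array([0.0, 0.5])   # upper threshold lowered to catch more 'P'

results = {}
for name, theta in [("fitted", fitted), ("cautious", cautious)]:
    pred = predict_from_thresholds(score, theta)
    tp = int(np.sum((pred == 2) & (truth == 2)))   # 'P' correctly flagged
    fp = int(np.sum((pred == 2) & (truth != 2)))   # flagged 'P' but not 'P'
    fn = int(np.sum((pred != 2) & (truth == 2)))   # missed 'P'
    results[name] = (tp, fp, fn)
    print(name, pred, "TP", tp, "FP", fp, "FN", fn)
```

Lowering the upper threshold removes the missed 'P' case in this toy example, but one more non-'P' sample is flagged as 'P'.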

The boundaries that best match the proposed categorisation are plotted as red horizontal lines. The lowest line makes sure that no 'problematic' cases are missed (green blob), while accepting some false positives.

In [None]:
decf = np.dot(X,model.coef_)
figure()
sns.violinplot(y=decf,x=y)
for th in model.theta_:
    axhline(th,color='r',alpha=.5)
    
ylabel('decision function')

Similar information should be available from the ROC curve and the precision-recall curve, but I'm not sure how to interpret them.

I'd expect similar curves for both class limits (above the 'suspect' threshold and above the 'problematic' threshold), but they do not overlap.

I'd also expect the two `theta_` values to correspond to different points on this curve.

In [None]:
scores = model.predict_proba(Xtest)

figure()

all_fpr = []
all_tpr = []

for ii in [1,2]:
    cno = model.classes_[ii]
    fpr, tpr, thresholds = roc_curve(ytest>=cno, np.sum(scores[:,model.classes_>=cno],axis=1))

    plot(fpr,tpr)
    xlabel('False positive rate')
    ylabel('True positive rate')
    db = exp(model.theta_)/(1+exp(model.theta_))
    idx = (thresholds>db[0])&(thresholds<db[1])
    plot(fpr[idx],tpr[idx])

    print (auc(fpr,tpr))

# Ordinal SVC hack

Based on [this article](https://towardsdatascience.com/simple-trick-to-train-an-ordinal-regression-with-any-classifier-6911183d2a3c), any classifier can be turned into an ordinal one by fitting it to the N-1 binary problems "above class k / not above class k", one for each class k except the highest.
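
A minimal sketch of the trick with three ordered labels, independent of any particular classifier. The array `p_gt` stands in for the outputs P(y > k) of the N-1 binary classifiers and is entirely made up; `combine` is a hypothetical helper name.

```python
import numpy as np

classes = np.array([1, 2, 3])        # ordered labels
y = np.array([1, 1, 2, 3, 2, 1])

# one binary target per threshold: "is y above class k?"
binary_targets = {int(k): (y > k).astype(int) for k in classes[:-1]}
print(binary_targets)

def combine(p_gt):
    """Turn P(y > k), k = classes[:-1], into per-class probabilities.

    p_gt has shape (n_samples, n_classes - 1).
    """
    n = p_gt.shape[0]
    # pad with P(y > nothing) = 1 on the left and P(y > highest) = 0 on the right
    padded = np.column_stack([np.ones(n), p_gt, np.zeros(n)])
    return padded[:, :-1] - padded[:, 1:]

# hypothetical outputs of the two binary classifiers for three samples
p_gt = np.array([[0.1, 0.02],
                 [0.8, 0.30],
                 [0.9, 0.85]])

proba = combine(p_gt)
print(proba)
print(classes[proba.argmax(axis=1)])
```

Because the binary classifiers are trained independently, nothing guarantees that P(y > 1) >= P(y > 2) for every sample, so the middle "probability" can come out negative in practice.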

In [None]:
from copy import deepcopy as copy
from sklearn.svm import SVC
from sklearn.base import BaseEstimator

class OrdinalClassifier(BaseEstimator):

    def __init__(self, *args, **kwargs):
        self.clfs = {}
        self.args = args
        self.set_params(**kwargs)
        self.sample_clf = SVC(*self.args, **self.kwargs)
        
    def fit(self, X, y):
        self.unique_class = np.sort(np.unique(y))
        if self.unique_class.shape[0] > 2:
            for i in range(self.unique_class.shape[0]-1):
                # for each k - 1 ordinal value we fit a binary classification problem
                binary_y = (y > self.unique_class[i]).astype(np.uint8)
                clf = SVC(*self.args, **self.kwargs)
                clf.fit(X, binary_y)
                self.clfs[i] = clf
        return self

    def predict_proba(self, X):
        # Pr(y > V_i) from each of the k-1 binary classifiers, keyed by threshold index
        clfs_predict = {k:self.clfs[k].predict_proba(X) for k in self.clfs}
        predicted = []
        for i,y in enumerate(self.unique_class):
            if i == 0:
                # V1 = 1 - Pr(y > V1)
                predicted.append(1 - clfs_predict[i][:,1])
            elif i in clfs_predict:
                # Vi = Pr(y > Vi-1) - Pr(y > Vi)
                # test the index i, not the label y: the keys are 0..k-2,
                # whereas the labels here are 1, 2, 3
                predicted.append(clfs_predict[i-1][:,1] - clfs_predict[i][:,1])
            else:
                # Vk = Pr(y > Vk-1)
                predicted.append(clfs_predict[i-1][:,1])
        return np.vstack(predicted).T

    def predict(self, X):
        return self.unique_class[np.argmax(self.predict_proba(X), axis=1)]
    
    def get_params(self, *args, **kwargs):
        return self.sample_clf.get_params(*args, **kwargs)

    def set_params(self, **kwargs):
        self.kwargs = kwargs
        self.kwargs['probability']=True
        return self


In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import  make_scorer, mean_squared_error, mean_absolute_error, precision_score

def mean_err_round(y,ypred,**kwargs):
    return mean_absolute_error(y,np.round(ypred),**kwargs)


Xt = pd.DataFrame(StandardScaler().fit_transform(Xr),columns=Xr.columns,index=Xr.index)
yt = yr

X=Xt#.sample(500)
le = LabelEncoder().fit(yr)
yt = yr.astype('i') 
y=yt.loc[X.index]

Xtrain,Xtest,ytrain,ytest = train_test_split(X,y,test_size = 0.30,stratify = y,shuffle = True,random_state = 42)

#scorer = make_scorer(mean_absolute_error, greater_is_better=False)
scorer = make_scorer(precision_score,average='macro')

# Set the parameters by cross-validation
tuned_parameters = [{'kernel': ['rbf'], 'gamma': [ 0.5,0.1,0.01,0.001],
                     'C':[1,10,100,1000]},
                    {'kernel': ['linear'], 'C': [ 0.1,1,10,100]}
                   ]
clf = GridSearchCV(
    OrdinalClassifier(), tuned_parameters, cv=5, verbose=2,scoring=scorer,
)
clf.fit(Xtrain, ytrain)

ypred = (clf.predict(Xtest))


In [None]:
print(clf.best_params_)
print()

print(confusion_matrix(ytest,ypred))
print()
print(classification_report(ytest,ypred))



Accuracy has improved, but not beyond that of the multi-class logistic regression, and we have lost interpretability.

Accuracy is probably not the most important measure here, as class 'S' should be allowed a large error to account for possible problematic cases.

The most important measure is probably precision and recall between classes 'P' and 'N'. I am not sure how to account for this measure.
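
One possible way to quantify this, offered as a sketch rather than an established metric: count only the serious confusions between 'N' and 'P' and ignore everything involving 'S'. The function `p_vs_n_report`, the assumption that labels 1 and 3 stand for 'N' and 'P', and the toy labels are all hypothetical.

```python
import numpy as np

def p_vs_n_report(y_true, y_pred, n_label=1, p_label=3):
    """Precision and recall of 'P' counting only N/P confusions as errors.

    Samples whose true or predicted label is 'S' are ignored
    (an assumption, not a standard definition).
    """
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    p_as_p = np.sum((y_true == p_label) & (y_pred == p_label))  # correct 'P'
    p_as_n = np.sum((y_true == p_label) & (y_pred == n_label))  # 'P' missed as 'N'
    n_as_p = np.sum((y_true == n_label) & (y_pred == p_label))  # 'N' flagged as 'P'
    precision = p_as_p / (p_as_p + n_as_p)
    recall = p_as_p / (p_as_p + p_as_n)
    return precision, recall

# toy labels, for illustration only
y_true = [1, 1, 1, 1, 2, 2, 3, 3, 3, 3]
y_pred = [1, 1, 2, 3, 2, 3, 3, 3, 2, 1]

precision, recall = p_vs_n_report(y_true, y_pred)
print(precision, recall)
```

The function divides by zero if no sample is predicted or labelled 'P', so a real version would need to guard against that. It could also be wrapped with `make_scorer` for use in the grid search, though I have not tried this.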