# GEOG 5160 6160 Lab 09

In this lab, we'll go through an example of building a pipeline to feed data into a machine learning model. You will need the *credit_data.csv* dataset, which should be available from Canvas with this document. Download this to your `datafiles` folder (extract any zip files). Make a new folder for today's class called `lab09`.

## Pipelines

Machine learning pipelines are commonly used to produce reproducible and consistent results from a machine learning model. The general goal is to link together a series of functions that undertake all (or most) of the data pre-processing steps, and link these directly into one or more algorithms. Pipelines have several advantages:

- They will process all data in the same way, making it easy to work with multiple datasets (including datasets used for predictions)
- They provide a more reproducible approach to data processing than a series of ad hoc steps and code
- They can easily be used in training, evaluating and tuning models
- They provide a description of the steps involved for later reference

We'll work again with the `credit_data.csv` file: a dataset of credit rankings for over 4000 people (see appendix for a description of the fields). The goal will be to predict `Status`, a binary outcome with two levels: `good` and `bad`. 

Let's begin by importing the packages we'll need for the lab:

In [1]:
import pandas as pd
import numpy as np
import sklearn
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style="white")
sns.set(style="whitegrid", color_codes=True)
## Set random seed for reproducibility
np.random.seed(1234)

Next, read in the data:

In [2]:
credit = pd.read_csv("../datafiles/credit_data.csv")
print(credit.shape)

(4454, 14)


If you take a look at the first few rows of the data, you should see that there is a mixture of numerical and categorical variables, as well as a range of different scales for the numerical variables. 

In [3]:
credit.head()

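|   | Status | Seniority | Home  | Time | Age | Marital | Records | Job       | Expenses | Income | Assets | Debt | Amount | Price |
|---|--------|-----------|-------|------|-----|---------|---------|-----------|----------|--------|--------|------|--------|-------|
| 0 | good   | 9         | rent  | 60   | 30  | married | no      | freelance | 73       | 129.0  | 0.0    | 0.0  | 800    | 846   |
| 1 | good   | 17        | rent  | 60   | 58  | widow   | no      | fixed     | 48       | 131.0  | 0.0    | 0.0  | 1000   | 1658  |
| 2 | bad    | 10        | owner | 36   | 46  | married | yes     | freelance | 90       | 200.0  | 3000.0 | 0.0  | 2000   | 2985  |
| 3 | good   | 0         | rent  | 60   | 24  | single  | no      | fixed     | 63       | 182.0  | 2500.0 | 0.0  | 900    | 1325  |
| 4 | good   | 0         | rent  | 36   | 26  | single  | no      | fixed     | 46       | 107.0  | 0.0    | 0.0  | 310    | 910   |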


In addition, if you run the `isna()` method, you'll see that there are missing values in several of the features. 

In [4]:
credit.isna().any()

Status       False
Seniority    False
Home          True
Time         False
Age          False
Marital       True
Records      False
Job           True
Expenses     False
Income        True
Assets        True
Debt          True
Amount       False
Price        False
dtype: bool

There are several steps that we might want to do to pre-process these data before training any model:

- Impute any missing values
- Convert categorical/factor variables to numeric by one-hot encoding
- Scale the numerical variables so that they share a common range before training our model

While it is possible to do this in an ad-hoc way (as we did in previous examples), we will set up a processing *pipeline* that contains all of these steps. This takes more time to set up, but has a number of advantages: we can use the pipeline directly in cross-validation or tuning, and we can use it to process any new data that we might want to make predictions for, without having to remember the individual steps. scikit-learn has a whole series of submodules and functions to help with this process, and we'll look at some of these here. Full details, including a number of worked examples, can be found on the scikit-learn website here: https://scikit-learn.org/stable/data_transforms.html

Let's start by splitting the data into a training and testing set using a simple holdout (we'll use this to demonstrate some of the functions):

In [5]:
from sklearn.model_selection import train_test_split

X = credit.loc[:, credit.columns != 'Status']
y = credit['Status']
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    train_size = 0.8)

It is possible to set up the full pipeline in one step, but we'll work through this gradually so you can get a sense of what each operator is doing. As we have mixed data (numeric and categorical), we'll need separate operators for each type. You can get a list of the data types in a Pandas DataFrame by looking at the `dtypes` property:

In [6]:
X.dtypes

Seniority      int64
Home          object
Time           int64
Age            int64
Marital       object
Records       object
Job           object
Expenses       int64
Income       float64
Assets       float64
Debt         float64
Amount         int64
Price          int64
dtype: object

We'll start by setting up lists containing the names of the categorical and numerical features so that we can refer to these for each step. 

In [7]:
categorical_features = ['Home', 'Marital', 'Records', 'Job']
numerical_features = ['Seniority', 'Time', 'Age', 'Expenses', 'Income', 
                      'Assets', 'Debt', 'Amount', 'Price']

Next, we'll design two operators to process the categorical data. The first will carry out a mode-based imputation of any missing values (i.e. fill in with the most common value). This is a simple (single) imputation method, and we need the `SimpleImputer()` function from `sklearn.impute`. Note that we include an argument `strategy` to define what value we will use to fill in the missing values. 

In [8]:
from sklearn.impute import SimpleImputer
imp_cat = SimpleImputer(strategy="most_frequent")

Having set this up, we can run it on our data. This can be done using the `fit()` method, which 'fits' an imputation method to a given dataset. While this is not a machine learning method, we are still building a simple model - here just finding the most common value in each column. Note that we use the vector of column names to confine this operation to the categorical features.

In [9]:
imp_cat.fit(X[categorical_features])
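To make the idea of 'fitting' an imputer more concrete, here is a minimal sketch on a small made-up DataFrame (the values and the names `toy` and `toy_imputer` are assumptions for illustration, not part of the credit data). After `fit()`, the most common value found in each column is stored in the `statistics_` attribute, and `transform()` uses those stored values to fill any gaps.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy data (made up for illustration), with one missing value per column
toy = pd.DataFrame({"Home": ["rent", "owner", "rent", np.nan],
                    "Job": ["fixed", np.nan, "fixed", "freelance"]})

toy_imputer = SimpleImputer(strategy="most_frequent")
toy_imputer.fit(toy)

# The fitted 'model' is just the most common value in each column
print(toy_imputer.statistics_)

# transform() fills the gaps using the values stored during fit()
toy_filled = toy_imputer.transform(toy)
print(toy_filled)
```

Keeping `fit()` and `transform()` separate is what lets a pipeline learn the fill values from training data and then reuse them on new data.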

To actually see the results of this, we can use the `fit_transform()` method, which fits the imputer and returns the filled-in data as an array:

In [10]:
X_impute = imp_cat.fit_transform(X[categorical_features])
print(X_impute)

[['rent' 'married' 'no' 'freelance']
 ['rent' 'widow' 'no' 'fixed']
 ['owner' 'married' 'yes' 'freelance']
 ...
 ['owner' 'married' 'no' 'partime']
 ['rent' 'single' 'no' 'freelance']
 ['owner' 'married' 'no' 'freelance']]


To check that this has, in fact, filled the missing values, we can convert this array to a DataFrame and reuse the code to check for the presence of missing values:

In [11]:
X_impute_df = pd.DataFrame(X_impute, columns = categorical_features)
X_impute_df.isna().any()

Home       False
Marital    False
Records    False
Job        False
dtype: bool

Next, we'll one-hot encode the categorical variables. This converts each feature to a new set of numerical features, one per level in the original feature. So a column titled 'sex', containing the levels 'female' and 'male' would be converted to two new columns: 'sex_female' and 'sex_male'. Each new column has a simple binary representation of that level, so an observation coded as 'female' would have 'sex_female' = 1 and 'sex_male' = 0. For this we need the `OneHotEncoder` class from `sklearn.preprocessing`. Note that we include the argument `drop = 'first'`. This will remove the first new encoded feature for each variable to reduce redundancy and multicollinearity. 

In [12]:
from sklearn.preprocessing import OneHotEncoder
ohe = OneHotEncoder(drop = 'first')
ohe.fit(X[categorical_features])
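The 'sex' example described above can be sketched on a tiny made-up column (the DataFrame `toy` and the encoder names are assumptions for illustration). It compares the default encoding with the `drop = 'first'` version.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy column (made up for illustration)
toy = pd.DataFrame({"sex": ["female", "male", "male", "female"]})

# Default: one new column per level (female, male)
ohe_full = OneHotEncoder()
full = ohe_full.fit_transform(toy).toarray()

# drop='first': the first level (female) is dropped, leaving only 'male'
ohe_drop = OneHotEncoder(drop="first")
dropped = ohe_drop.fit_transform(toy).toarray()

print(ohe_full.categories_)  # levels found during fit, in sorted order
print(full)                  # two columns
print(dropped)               # one column: 1 = male, 0 = female
```

With only two levels, the single remaining column carries all of the information: a 0 can only mean 'female'.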

The output is a sparse matrix by default (i.e. it only stores the '1's), so to visualize what it is doing, we can convert it to a NumPy array:

In [13]:
X_cat = ohe.transform(X[categorical_features]).toarray()
X_cat

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 1., 0., ..., 0., 0., 0.],
       ...,
       [0., 1., 0., ..., 0., 1., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 1., 0., ..., 0., 0., 0.]])

Now we have the two operations for the categorical features, we can combine them into a pipeline. The function you need is, amazingly, called `Pipeline()`. This requires the argument `steps`, which is a list of all the individual operators. We'll include the two that we have already defined, but note that you could define these directly in the pipeline if needed. 

In [14]:
from sklearn.pipeline import Pipeline
categorical_transformer = Pipeline(steps=[
    ('imputer', imp_cat),
    ('encoder', ohe)])
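As a sketch of the alternative mentioned above, the same two operators could be created directly inside the `Pipeline()` call. The name `categorical_transformer_inline` is hypothetical, chosen so as not to overwrite the object used in the rest of the lab.

```python
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical name, to avoid overwriting the lab's categorical_transformer
categorical_transformer_inline = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("encoder", OneHotEncoder(drop="first"))])

# Each step can be looked up by the name given in its tuple
print(categorical_transformer_inline.named_steps["imputer"])
```

Either style should behave the same way; defining the operators first just makes it easier to test each one on its own.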

We can visualize the pipeline by importing `set_config`, and changing the display type. You can now just type the name of the pipeline to see the following diagram:

In [15]:
from sklearn import set_config
set_config(display='diagram')
categorical_transformer

Now let's do the same thing for the numerical variables. We'll start by imputing the median value. This again uses the `SimpleImputer` but with a different strategy:

In [16]:
from sklearn.impute import SimpleImputer
imp_num = SimpleImputer(missing_values=np.nan, strategy='median')
print(imp_num.fit_transform(X[numerical_features]))

[[   9.   60.   30. ...    0.  800.  846.]
 [  17.   60.   58. ...    0. 1000. 1658.]
 [  10.   36.   46. ...    0. 2000. 2985.]
 ...
 [   0.   24.   37. ...    0.  500.  963.]
 [   0.   48.   23. ...    0.  550.  550.]
 [   5.   60.   32. ... 1000. 1350. 1650.]]


Quick check to see that all the NaNs have been removed:

In [17]:
X_impute_df = pd.DataFrame(imp_num.fit_transform(X[numerical_features]), 
                           columns = numerical_features)
X_impute_df.isna().any()

Seniority    False
Time         False
Age          False
Expenses     False
Income       False
Assets       False
Debt         False
Amount       False
Price        False
dtype: bool

Next, we'll rescale the numerical values to a 0-1 range using `MinMaxScaler`. The default range is 0-1, but this can be adjusted if you want a different range. 

In [18]:
from sklearn.preprocessing import MinMaxScaler
scale = MinMaxScaler()
X_scale_df = pd.DataFrame(scale.fit_transform(X[numerical_features]), 
                           columns = numerical_features)
X_scale_df.describe()

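|       | Seniority | Time     | Age      | Expenses | Income   | Assets   | Debt     | Amount   | Price    |
|-------|-----------|----------|----------|----------|----------|----------|----------|----------|----------|
| count | 4454.0    | 4454.0   | 4454.0   | 4454.0   | 4073.0   | 4407.0   | 4436.0   | 4454.0   | 4454.0   |
| mean  | 0.166391  | 0.612708 | 0.381608 | 0.141886 | 0.14238  | 0.018013 | 0.011434 | 0.191616 | 0.123043 |
| std   | 0.170298  | 0.222052 | 0.219692 | 0.134591 | 0.084731 | 0.038581 | 0.041533 | 0.096846 | 0.056921 |
| min   | 0.0       | 0.0      | 0.0      | 0.0      | 0.0      | 0.0      | 0.0      | 0.0      | 0.0      |
| 25%   | 0.041667  | 0.454545 | 0.2      | 0.0      | 0.088143 | 0.0      | 0.0      | 0.122449 | 0.091731 |
| 50%   | 0.104167  | 0.636364 | 0.36     | 0.110345 | 0.124869 | 0.01     | 0.0      | 0.183673 | 0.117354 |
| 75%   | 0.25      | 0.818182 | 0.54     | 0.255172 | 0.172088 | 0.02     | 0.0      | 0.244898 | 0.14377  |
| max   | 1.0       | 1.0      | 1.0      | 1.0      | 1.0      | 1.0      | 1.0      | 1.0      | 1.0      |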
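As a quick check of what the scaler is doing, the sketch below reproduces the 0-1 scaling by hand on three made-up values, using (x - min) / (max - min), and then shows the `feature_range` argument for a different range. The array `ages` is an assumption for illustration only.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy values (made up for illustration), one column
ages = np.array([[20.0], [30.0], [60.0]])

# Min-max scaling by hand: (x - min) / (max - min)
by_hand = (ages - ages.min()) / (ages.max() - ages.min())

# The same thing with MinMaxScaler (default range 0-1)
scaled = MinMaxScaler().fit_transform(ages)

# A different target range, here -1 to 1
wide = MinMaxScaler(feature_range=(-1, 1)).fit_transform(ages)

print(by_hand.ravel())  # 0, 0.25, 1
print(scaled.ravel())   # same values
print(wide.ravel())     # -1, -0.5, 1
```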


And we can now arrange these two operations in a pipeline for the numerical features. 

In [19]:
numerical_transformer = Pipeline(steps=[
    ('imputer', imp_num),
    ('scaler', scale)])
numerical_transformer

Now we want to stick all of this together to process the entire dataset using a `ColumnTransformer`, which includes the two pipelines. Note that each one has an ID as the first element of its tuple (e.g. `'num'`) - this will be important in identifying hyperparameters for tuning later on. 

In [20]:
from sklearn.compose import ColumnTransformer
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_features),
        ('cat', categorical_transformer, categorical_features)
    ])
preprocessor
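To see what a combined transformer of this kind returns, here is a minimal sketch on a made-up two-column DataFrame (the data and the names `toy` and `toy_preprocessor` are assumptions for illustration). It has the same structure as the lab's preprocessor, with one numerical and one categorical column.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Toy data (made up for illustration), with a missing value in each column
toy = pd.DataFrame({"Age": [20.0, 30.0, np.nan, 60.0],
                    "Home": ["rent", "owner", np.nan, "rent"]})

toy_preprocessor = ColumnTransformer(transformers=[
    ("num", Pipeline(steps=[("imputer", SimpleImputer(strategy="median")),
                            ("scaler", MinMaxScaler())]), ["Age"]),
    ("cat", Pipeline(steps=[("imputer", SimpleImputer(strategy="most_frequent")),
                            ("encoder", OneHotEncoder(drop="first"))]), ["Home"])])

# Column 0: Age, imputed with the median (30) then scaled to 0-1
# Column 1: Home, imputed with 'rent' then encoded (1 = rent, 0 = owner)
toy_out = toy_preprocessor.fit_transform(toy)
print(toy_out)
```

The output columns follow the order of the `transformers` list, not the order of the columns in the original DataFrame.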

So far, we have a pipeline that transforms the dataset into a format ready for modeling. The next step is to link it to an algorithm, so that we can pass the transformed data directly to it. We'll set up a random forest classifier with 500 trees:

In [21]:
from sklearn.ensemble import RandomForestClassifier

rf_credit = Pipeline(steps=[('preprocessor', preprocessor),
                            ('classifier', RandomForestClassifier(n_estimators = 500))])
rf_credit

If we now fit this new object (`rf_credit`) to our holdout training set, this processes the data, fills in missing values, encodes and scales *and* trains the random forest:

In [22]:
rf_credit.fit(X_train, y_train)

Now we can use the same pipeline to predict for our test set. Note that we don't need to do anything to this, the pipeline will take care of all the processing steps, and then pass these new data to the trained random forest to make a prediction. We'll get the probabilistic predictions, and calculate the AUC:

In [23]:
from sklearn import metrics

y_test_pred = rf_credit.predict_proba(X_test)
metrics.roc_auc_score(y_test, y_test_pred[:,1])

0.8472329432270918
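When indexing the output of `predict_proba()`, it helps to know which column belongs to which class. The sketch below uses a small synthetic dataset (an assumption for illustration, not the credit data) to show that the columns follow the order of the fitted model's `classes_` attribute, which is sorted alphabetically for string labels.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic data (made up for illustration)
rng = np.random.default_rng(0)
toy_X = rng.normal(size=(40, 2))
toy_y = np.where(toy_X[:, 0] > 0, "good", "bad")

toy_rf = RandomForestClassifier(n_estimators=20, random_state=0)
toy_rf.fit(toy_X, toy_y)

# classes_ gives the column order used by predict_proba()
print(toy_rf.classes_)

# Look up the column for 'good' rather than hard-coding its position
good_col = list(toy_rf.classes_).index("good")
p_good = toy_rf.predict_proba(toy_X)[:, good_col]
print(p_good[:5])
```

For the credit data, with labels `bad` and `good`, this ordering would make column 1 the probability of `good`.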

We can use the same pipeline in a cross-validation to get an overall assessment of the model. As this can take some time, I've added the argument `n_jobs=4` to force it to run in parallel on 4 cores. Feel free to adjust this to match what is available on your machine. Setting `n_jobs = -1` will use all available cores. 

In [24]:
from sklearn.model_selection import cross_val_score, KFold

cv = KFold(n_splits=5)
scores = cross_val_score(rf_credit, X, y, cv=cv, 
                         scoring='roc_auc', n_jobs=4)
print("%.3f" % np.mean(scores))

0.829


Alternatively, we can use our combined pipeline and learner to tune hyperparameters of the learner. We'll try tuning the number of trees in the forest (the `n_estimators` argument), testing values between 100 and 2000. To see the full set of available hyperparameters, just use the `get_params()` method. Note that as we are using a pipeline, this returns *all* the parameters, including any for the individual operators in the pipeline:


In [25]:
rf_credit.get_params()

{'memory': None,
 'steps': [('preprocessor',
   ColumnTransformer(transformers=[('num',
                                    Pipeline(steps=[('imputer',
                                                     SimpleImputer(strategy='median')),
                                                    ('scaler', MinMaxScaler())]),
                                    ['Seniority', 'Time', 'Age', 'Expenses',
                                     'Income', 'Assets', 'Debt', 'Amount',
                                     'Price']),
                                   ('cat',
                                    Pipeline(steps=[('imputer',
                                                     SimpleImputer(strategy='most_frequent')),
                                                    ('encoder',
                                                     OneHotEncoder(drop='first'))]),
                                    ['Home', 'Marital', 'Records', 'Job'])])),
  ('classifier', RandomForestClassifier(n_estima

So there's a lot in there. The hierarchy in the pipeline is shown by a double underscore connecting the different operators. The one we want to tune is the number of trees in the forest (`classifier__n_estimators`). We need to set up a search space for this, then set up `GridSearchCV` to run the tuning. Note that we use the pipeline (`rf_credit`) directly in the grid search. This will then carry out all the necessary operations of imputation, encoding and scaling for each iteration of each parameter value.

In [26]:
from sklearn.model_selection import GridSearchCV
param_grid = {'classifier__n_estimators': 
              [100, 200, 300, 400, 500, 750, 1000, 1500, 2000]}


credit_rf_tuned = GridSearchCV(rf_credit, param_grid, 
                             scoring='roc_auc', cv=5, n_jobs=4)
credit_rf_tuned
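The double-underscore naming can be tried out on a small stand-in pipeline (the name `toy_pipe` and its two steps are assumptions for illustration). Filtering the output of `get_params()` by the step name is one way to list only the parameters of a single step, and the same names can be passed to `set_params()`.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Stand-in pipeline (hypothetical), with the same step name as the lab's classifier
toy_pipe = Pipeline(steps=[("scaler", MinMaxScaler()),
                           ("classifier", RandomForestClassifier(n_estimators=500))])

# Keep only the parameters that belong to the 'classifier' step
classifier_params = sorted(name for name in toy_pipe.get_params()
                           if name.startswith("classifier__"))
print(classifier_params)

# The same <step>__<parameter> names work with set_params()
toy_pipe.set_params(classifier__n_estimators=100)
print(toy_pipe.named_steps["classifier"].n_estimators)
```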

Now let's run this (it will take a short while to go through all the iterations and parameter values):

In [27]:
credit_rf_tuned.fit(X_train, y_train)

And we can check the results of the tuning:

In [28]:
print(f"Best params:")
print(credit_rf_tuned.best_params_)
print(credit_rf_tuned.best_score_)

Best params:
{'classifier__n_estimators': 2000}
0.8286531599813433
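Beyond the single best value, it can be useful to look at the score for every value that was tested. A fitted `GridSearchCV` object stores these in its `cv_results_` attribute, which converts easily to a DataFrame. The sketch below uses a small synthetic dataset and a deliberately tiny grid (both assumptions for illustration) so that it runs quickly.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic data and a tiny grid (made up for illustration)
toy_X, toy_y = make_classification(n_samples=100, n_features=5, random_state=0)
toy_search = GridSearchCV(RandomForestClassifier(random_state=0),
                          {"n_estimators": [5, 10, 20]},
                          scoring="roc_auc", cv=3)
toy_search.fit(toy_X, toy_y)

# One row per parameter value, with the mean and spread of the fold scores
results = pd.DataFrame(toy_search.cv_results_)
print(results[["param_n_estimators", "mean_test_score",
               "std_test_score", "rank_test_score"]])
```

If the mean scores differ by less than their standard deviations, the 'best' value may not be meaningfully better than the others.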


We can use the tuned learner to make predictions for a new data set. As the learner is built on the data transformation pipeline, we do not need to carry out any transformations prior to prediction. Instead we can simply pass the new data to the learner, and leave it to do all that for us. We'll create a single example and use the `predict()` method to get a prediction of credit risk (feel free to use different values to see the impact here):

In [29]:
new_credit = pd.DataFrame({'Seniority': [8], 'Home': ["rent"], 'Time': [36], 'Age': [26],
                           'Marital': ["single"], 'Records': ["no"], 'Job': ["fixed"],
                           'Expenses': [50], 'Income': [100], 'Assets': [0], 'Debt': [10],
                           'Amount': [100], 'Price': [125]})

And predict for this new case:

In [30]:
credit_rf_tuned.predict(new_credit)

array(['bad'], dtype=object)

This predicts a bad credit ranking for this example. As a reminder, if you use `predict_proba()` instead of `predict()` this will return the predicted probabilities of each class ('bad' vs 'good'). Note that the probabilities are quite close, suggesting our model is struggling to cleanly predict for this case.

In [31]:
credit_rf_tuned.predict_proba(new_credit)

array([[0.5515, 0.4485]])

### Feature selection

When working with datasets with large numbers of features we often want to reduce the number of features used to build the model. We can further modify the pipeline to allow for feature selection using scikit-learn. In this first example, we'll filter the features by mutual information; a form of entropy based correlation between each feature and the target variable. 

We'll make a new pipeline that consists of the original data processor (the categorical and numerical transformations) and append the feature selection operator. Note that there are two separate functions here:

- `mutual_info_classif`: this calculates the mutual information value for each feature
- `SelectKBest`: this selects the top $k$ features based on the values returned by `mutual_info_classif`. Here, we'll choose the top 5. 

In [32]:
from sklearn.feature_selection import SelectKBest, mutual_info_classif
rf_credit = Pipeline(steps=[('preprocessor', preprocessor),
                            ('mim', SelectKBest(mutual_info_classif, k = 5)
                            )])
rf_credit

We can again use the `fit_transform` function to show the results of this process. Note that we need to include both X and y here to calculate the mutual information values. This will return an array with just 5 features. Note that one impact of running this following the one-hot encoding is that the filter might select the encoding of *individual* levels of the original factor. 

In [33]:
print(rf_credit.fit_transform(X, y))

[[0.1875     0.12906611 0.14285714 0.         0.        ]
 [0.35416667 0.13116474 0.18367347 0.         0.        ]
 [0.20833333 0.20356768 0.3877551  1.         0.        ]
 ...
 [0.         0.08814271 0.08163265 0.         1.        ]
 [0.         0.1406086  0.09183673 0.         0.        ]
 [0.10416667 0.1406086  0.25510204 0.         0.        ]]
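The array alone does not show *which* features were kept. A fitted `SelectKBest` object has a `get_support()` method that returns a True/False mask over its input columns. The sketch below shows this on a small synthetic dataset with made-up column names (all assumptions for illustration).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif

# Synthetic data with hypothetical feature names (made up for illustration)
toy_X, toy_y = make_classification(n_samples=200, n_features=6,
                                   n_informative=2, n_redundant=0,
                                   random_state=0)
toy_names = np.array(["f0", "f1", "f2", "f3", "f4", "f5"])

toy_selector = SelectKBest(mutual_info_classif, k=2).fit(toy_X, toy_y)

# Boolean mask: True for each input column that was kept
mask = toy_selector.get_support()
print(toy_names[mask])

# The mutual information score calculated for every input column
print(toy_selector.scores_.round(3))
```

In a pipeline, the selection step can be reached by name (e.g. through `named_steps`), and the mask then refers to the columns produced by the preprocessing step, including the individual one-hot encoded levels.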


We don't, however, know if 5 is the best number of features to include in the model. This is where the link between the pipeline, learner and tuning becomes very useful as we can tune this parameter (`k`) just as we would tune any hyperparameter. To do this, first rebuild the pipeline to include the random forest classifier:

In [34]:
rf_credit = Pipeline(steps=[('preprocessor', preprocessor),
                            ('mim', SelectKBest(mutual_info_classif)),
                            ('classifier', RandomForestClassifier(n_estimators = 500))
                            ])
rf_credit

Now set up a new tuning grid that includes both the number of features (`mim__k`) and the number of trees in the forest (`classifier__n_estimators`). To keep this manageable, we'll just test 3 values for each to give 3 x 3 = 9 total combinations, which will result in 9 x 5 = 45 total models being built during the 5-fold cross-validation. As a reminder, type `rf_credit.get_params()` to see the full set of parameters and their names.


In [35]:
param_grid = {'classifier__n_estimators': [500, 1000, 1500],
             'mim__k': [5, 10, 15]}

credit_rf_tuned = GridSearchCV(rf_credit, param_grid, 
                               scoring='roc_auc', cv=5, n_jobs=4)
credit_rf_tuned

In [36]:
param_grid

{'classifier__n_estimators': [500, 1000, 1500], 'mim__k': [5, 10, 15]}
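The number of combinations in a grid can be checked before running anything, using scikit-learn's `ParameterGrid` helper. This is a small sketch using a copy of the grid above (the name `grid_check` is an assumption, to avoid overwriting `param_grid`).

```python
from sklearn.model_selection import ParameterGrid

# Copy of the tuning grid (hypothetical name)
grid_check = {'classifier__n_estimators': [500, 1000, 1500],
              'mim__k': [5, 10, 15]}

n_combinations = len(ParameterGrid(grid_check))
n_folds = 5

# Combinations, and total fits during cross-validation
print(n_combinations, n_combinations * n_folds)  # 9 45
```

With the default `refit=True`, `GridSearchCV` fits one further model on all of the data it was given, using the best combination.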

And run the tuning:

In [37]:
%%capture --no-stdout
credit_rf_tuned.fit(X, y)

In [38]:
print(f"Best params:")
print(credit_rf_tuned.best_params_)
print(credit_rf_tuned.best_score_)

Best params:
{'classifier__n_estimators': 1000, 'mim__k': 15}
0.8268731499003984


### PCA transformation

For a final example, we'll look at a different feature selection strategy. Rather than selecting out original features, we'll use a PCA transformation to create new features. These are based on the original features, but a) are uncorrelated and b) try to maximize the amount of information contained in each one. Ideally, we should be able to select a small number of these that still represent most of the signal in the features.

To do this, we'll recreate our pipeline with a new operator that will carry out the PCA transformation (`PCA`). We can also specify the number of new features that will be used in the model. Each new principal component feature will explain a certain proportion of the variance in the original dataset (with successive PC features explaining less and less variance). Here, the `n_components = 0.5` argument will select the set of new features that, in total, explain over 50% of the original variance. Note that if you set this to an integer value (1 or higher), it will use that many features. 


In [39]:
from sklearn.decomposition import PCA
rf_credit = Pipeline(steps=[('preprocessor', preprocessor),
                            ('pca', PCA(n_components = 0.5)
                            )])
rf_credit

If we run `fit_transform` on our data, you'll see that this returns just three new features. These are the 1st, 2nd and 3rd PCs, and they could then be used as input features to the random forest (or any other algorithm).

In [40]:
rf_credit.fit_transform(X)

array([[-0.08389405,  0.93255372,  0.68336541],
       [ 0.47187127,  0.65092987, -0.1051408 ],
       [-0.79090489, -0.20694752,  0.8723634 ],
       ...,
       [-0.4702979 , -0.30145424, -0.50129179],
       [ 0.99785998,  0.33168092,  0.86230522],
       [-0.6897555 , -0.32020941,  0.57742967]])
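To see how a fractional `n_components` translates into a number of components, the sketch below fits a full PCA to some synthetic data (an assumption for illustration), prints the cumulative proportion of variance explained, and then checks how many components `n_components = 0.5` keeps.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data: 5 columns with very different variances (made up for illustration)
rng = np.random.default_rng(0)
toy_X = rng.normal(size=(200, 5)) * np.array([5.0, 3.0, 1.0, 0.5, 0.1])

# Full PCA: cumulative proportion of variance explained by the first 1, 2, ... PCs
full_pca = PCA().fit(toy_X)
cumulative = np.cumsum(full_pca.explained_variance_ratio_)
print(cumulative.round(3))

# Fractional n_components: keep the fewest PCs whose cumulative share passes 0.5
half_pca = PCA(n_components=0.5).fit(toy_X)
print(half_pca.n_components_)
```

The number kept is stored in the fitted object's `n_components_` attribute, which is a convenient way to check what a fractional setting produced.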

Again, we don't know if this is the best number of new features to retain. We can test this by tuning this parameter (`n_components`). As before, we'll remake the full pipeline with the random forest, create a parameter grid and fit this.

In [41]:
rf_credit = Pipeline(steps=[('preprocessor', preprocessor),
                            ('pca', PCA()),
                            ('classifier', RandomForestClassifier(n_estimators = 500))
                            ])

In [42]:
param_grid = {'classifier__n_estimators': [500, 1000, 1500],
             'pca__n_components': [0.5, 0.75, 0.9]}
credit_rf_tuned = GridSearchCV(rf_credit, param_grid, 
                               scoring='roc_auc', cv=5, n_jobs=4)
credit_rf_tuned

In [43]:
%%capture --no-stdout
credit_rf_tuned.fit(X, y)

In [44]:
print(f"Best params:")
print(credit_rf_tuned.best_params_)
print(credit_rf_tuned.best_score_)

Best params:
{'classifier__n_estimators': 1000, 'pca__n_components': 0.9}
0.7424121862549802


### Evaluating tuned pipelines

While the previous code allows us to select the value of the hyperparameters, the performance score shown above is based only on the data used for tuning. Each score is calculated on a held-out part of those data, but because the same scores are used to choose the hyperparameters, it is not considered to be an independent test of predictive skill. To evaluate the model, we need to run a nested cross-validation where:

- The inner cross-validation is used to choose values of the hyperparameters by training on part of the data (training) and evaluating against another part (validation)
- The outer cross-validation is used to evaluate the tuned model against a third part of the data (testing)

As the testing data is not used to tune the model, it is still considered to be independent and a fair test of the model's predictive skill.

First we set up the inner and outer cross-validation. Here, we'll use the same 4-fold cross-validation for both. A 4-fold cross-validation results in a 75-25 split of the data for each fold. In practice, this means we take 75% of the data for tuning, and put the other 25% in the *test* set. Of the 75%, we split this again into 75% for training and 25% for validation. 

In [45]:
inner_cv = KFold(n_splits=4)
outer_cv = KFold(n_splits=4)
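The sizes of the three parts can be checked with a small sketch that applies the two 4-fold splits to 160 made-up row numbers (the row count and variable names are assumptions for illustration). Only the first fold of each split is shown.

```python
import numpy as np
from sklearn.model_selection import KFold

# 160 made-up row numbers standing in for a dataset
toy_rows = np.arange(160)
outer = KFold(n_splits=4)
inner = KFold(n_splits=4)

# First outer fold: 75% for tuning, 25% for testing
tune_ix, test_ix = next(outer.split(toy_rows))

# First inner fold: split the tuning part again (positions within tune_ix)
train_ix, valid_ix = next(inner.split(tune_ix))

print(len(train_ix), len(valid_ix), len(test_ix))  # 90 30 40
```

So in each combination of folds, 90/160 = 56.25% of the rows are used for training, 18.75% for validation and 25% for testing.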

Now we can set up the search strategy. Note this is almost exactly the same as the one we ran previously - the exception is that we set the `cv` argument to the inner strategy:

In [46]:
# Nested CV with parameter optimization
rf_tuned_evaluated = GridSearchCV(estimator = rf_credit, 
                                  param_grid = param_grid, 
                                  scoring = 'roc_auc',
                                  cv = inner_cv,
                                  n_jobs = 4)

Now, we run the full nested loop using `cross_val_score`, with the tuning method, the data and the outer cross-validation strategy:

In [47]:
nested_score = cross_val_score(rf_tuned_evaluated, X=X, y=y, cv=outer_cv)

And print the final set of results:

In [48]:
nested_score

array([0.72151992, 0.73905872, 0.71688192, 0.76616177])

In [49]:
nested_score.mean()

0.7359055821412115

[Optional] While the above code provides a straightforward way to evaluate the tuning process, it doesn't show what the tuned model looks like. We can get more information by building a loop that controls the outer cross-validation.

In [54]:
# Set blank list to save output
outer_results = list()
for train_ix, test_ix in outer_cv.split(X):
    # split data (KFold returns row positions, so select with iloc)
    X_train, X_test = X.iloc[train_ix, :], X.iloc[test_ix, :]
    y_train, y_test = y.iloc[train_ix], y.iloc[test_ix]

    # Define grid search
    search = GridSearchCV(rf_credit, param_grid = param_grid,
                          scoring = 'roc_auc', cv = inner_cv, 
                          refit = True, n_jobs = 4)
    # execute search
    result = search.fit(X_train, y_train)
    
    # get the best performing model fit on the whole training set
    best_model = result.best_estimator_
    # evaluate model on the hold out dataset
    y_hat = best_model.predict_proba(X_test)
    # evaluate the model
    auc = metrics.roc_auc_score(y_test, y_hat[:,1])
    # store the result
    outer_results.append(auc)
    # report progress
    print('>auc=%.3f, est=%.3f, cfg=%s' % (auc, result.best_score_, result.best_params_))

>auc=0.723, est=0.745, cfg={'classifier__n_estimators': 500, 'pca__n_components': 0.75}
>auc=0.740, est=0.738, cfg={'classifier__n_estimators': 1500, 'pca__n_components': 0.9}
>auc=0.718, est=0.745, cfg={'classifier__n_estimators': 1500, 'pca__n_components': 0.75}
>auc=0.766, est=0.736, cfg={'classifier__n_estimators': 1500, 'pca__n_components': 0.9}


In [53]:
# summarize the estimated performance of the model
import numpy as np
print('AUROC: %.3f (%.3f)' % (np.mean(outer_results), np.std(outer_results)))

AUROC: 0.738 (0.019)


## Exercise

For the exercise we will once again use the data from the *Sonar.csv* file to model types of object (rocks 'R' or mines 'M') using the values of a set of frequency bands. The goal of the exercise is to build the best predictive model for these data, and you are free to choose any of the algorithms/learners we have previously looked at. You should use the **scikit-learn** framework to set up, train and test your model. You will need to choose a cross-validation strategy and calculate the AUC to assess the model. 

As the data has a large number of features, you should build a pipeline to reduce the number of features using one of the two filter examples (mutual information or PCA) from the lab. Note that there are no categorical features so you can skip those steps. You should then tune both the filter and at least one hyperparameter of your model (if you are not sure about this, please ask!)

Your answer should consist of the following

- A description of your pipeline (this can include a figure showing the steps)
- The values you obtained for the number of features and the selected hyperparameter
- The cross-validated AUC

You should also provide your full Python code, either as a notebook or a screenshot 

## Appendix

### Sonar data set

The file *Sonar.csv* contains values of 208 sonar signals. The data have 60 features, each representing "the energy within a particular frequency band, integrated over a certain period of time", and an outcome variable `Class`, which is coded `M` or `R`. The goal of the experiment is to discriminate between the sonar signals bounced off a rock `R` or a metal object (a mine `M`). 

### Credit data set 

From https://github.com/gastonstat/CreditScoring

|    | Column name | Feature                | 
|----|-------------|------------------------|
| 1  | `Status`    | credit status          |
| 2  | `Seniority` | job seniority (years)  |
| 3  | `Home`      | type of home ownership |
| 4  | `Time`      | time of requested loan |
| 5  | `Age`       | client's age           |
| 6  | `Marital`   | marital status         |
| 7  | `Records`   | existence of records   |
| 8  | `Job`       | type of job            |
| 9  | `Expenses`  | amount of expenses     |
| 10 | `Income`    | amount of income       |
| 11 | `Assets`    | amount of assets       |
| 12 | `Debt`      | amount of debt         |
| 13 | `Amount`    | loan amount requested  |
| 14 | `Price`     | price of good          |

