In [1]:
import numpy as np
print("numpy version: {}".format(np.__version__))
import pandas as pd 
print("pandas version: {}".format(pd.__version__))
import matplotlib
import matplotlib.pyplot as plt
print("matplotlib version: {}".format(matplotlib.__version__))
import scipy as sp
print("scipy version: {}".format(sp.__version__))
import sklearn as sl
print("scikit-learn: {}".format(sl.__version__))
import seaborn as sns
print("seaborn: {}".format(sns.__version__))
import statsmodels as sm
print("statsmodels: {}".format(sm.__version__))

numpy version: 1.17.4
pandas version: 0.25.3
matplotlib version: 3.1.2
scipy version: 1.3.3
scikit-learn: 0.21.3
seaborn: 0.9.0
statsmodels: 0.10.2


## Interfacing Between pandas and Model Code

A common workflow for model development is to use pandas for **data loading** and **cleaning** before switching over to a modeling library to build the model itself.
In machine learning, an important part of the model development process is called **feature engineering**.
This can describe any data transformation or analytics that extracts information from a raw dataset that may be useful in a modeling context.

The point of contact between pandas and other analysis libraries is usually NumPy arrays.
To turn a DataFrame into a NumPy array, use the ```.values``` property:

In [2]:
data = pd.DataFrame({
    'x0': [1, 2, 3, 4, 5],
    'x1': [0.01, -0.01, 0.25, -4.1, 0.],
    'y': [-1.5, 0., 3.6, 1.3, -2.]})

In [3]:
data

Unnamed: 0,x0,x1,y
0,1,0.01,-1.5
1,2,-0.01,0.0
2,3,0.25,3.6
3,4,-4.1,1.3
4,5,0.0,-2.0


In [4]:
data.columns

Index(['x0', 'x1', 'y'], dtype='object')

In [5]:
data.values

array([[ 1.  ,  0.01, -1.5 ],
       [ 2.  , -0.01,  0.  ],
       [ 3.  ,  0.25,  3.6 ],
       [ 4.  , -4.1 ,  1.3 ],
       [ 5.  ,  0.  , -2.  ]])

In [6]:
df2 = pd.DataFrame(data.values, columns=['one', 'two', 'three'])

In [7]:
df2

Unnamed: 0,one,two,three
0,1.0,0.01,-1.5
1,2.0,-0.01,0.0
2,3.0,0.25,3.6
3,4.0,-4.1,1.3
4,5.0,0.0,-2.0


The ```.values``` attribute is intended to be used when your data is
homogeneous — for example, all numeric types. 
If you have heterogeneous data, the result will be an ndarray of Python objects:

In [8]:
df3 = data.copy()

In [9]:
df3['strings'] = ['a', 'b', 'c', 'd', 'e']

In [10]:
df3

Unnamed: 0,x0,x1,y,strings
0,1,0.01,-1.5,a
1,2,-0.01,0.0,b
2,3,0.25,3.6,c
3,4,-4.1,1.3,d
4,5,0.0,-2.0,e


In [11]:
df3.values

array([[1, 0.01, -1.5, 'a'],
       [2, -0.01, 0.0, 'b'],
       [3, 0.25, 3.6, 'c'],
       [4, -4.1, 1.3, 'd'],
       [5, 0.0, -2.0, 'e']], dtype=object)
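
Before handing an array to a modeling library, it can help to check which dtype the conversion will produce. The following is a minimal sketch: the frame names `numeric` and `mixed` are illustrative, and it uses the `to_numpy` method, which pandas 0.24 and later provide alongside `.values`.

```python
import numpy as np
import pandas as pd

numeric = pd.DataFrame({'x0': [1, 2, 3], 'x1': [0.01, -0.01, 0.25]})
mixed = numeric.copy()
mixed['strings'] = ['a', 'b', 'c']

# Integer and float columns are upcast to one common numeric dtype
numeric_arr = numeric.to_numpy()
# A string column forces the generic object dtype for the whole array
mixed_arr = mixed.to_numpy()
print(numeric_arr.dtype, mixed_arr.dtype)

# Selecting only the numeric columns recovers a homogeneous array
numeric_only = mixed.select_dtypes(include='number').to_numpy()
print(numeric_only.dtype)
```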

For some models, you may only wish to use a subset of the columns. I recommend using loc indexing with values:

In [12]:
model_cols = ['x0', 'x1']

In [13]:
data.loc[:, model_cols].values

array([[ 1.  ,  0.01],
       [ 2.  , -0.01],
       [ 3.  ,  0.25],
       [ 4.  , -4.1 ],
       [ 5.  ,  0.  ]])

In [14]:
data['category'] = pd.Categorical(['a', 'b', 'a', 'a', 'b'], categories=['a', 'b'])

In [15]:
data

Unnamed: 0,x0,x1,y,category
0,1,0.01,-1.5,a
1,2,-0.01,0.0,b
2,3,0.25,3.6,a
3,4,-4.1,1.3,a
4,5,0.0,-2.0,b


If we wanted to replace the 'category' column with dummy variables, we create
dummy variables, drop the 'category' column, and then join the result:

In [16]:
dummies = pd.get_dummies(data.category, prefix='category')

In [17]:
data_with_dummies = data.drop('category', axis=1).join(dummies)

In [18]:
data_with_dummies

Unnamed: 0,x0,x1,y,category_a,category_b
0,1,0.01,-1.5,1,0
1,2,-0.01,0.0,0,1
2,3,0.25,3.6,1,0
3,4,-4.1,1.3,1,0
4,5,0.0,-2.0,0,1


There are some nuances to fitting certain statistical models with dummy variables. It
may be simpler and less error-prone to use ```Patsy``` (the subject of the next section)
when you have more than simple numeric columns.
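
One such nuance, offered here as an illustration rather than something the text spells out, is that a full set of dummy columns is collinear with an intercept column. The sketch below assumes a small made-up `cat` Series and uses the `drop_first` option of `pd.get_dummies` to leave out one level:

```python
import numpy as np
import pandas as pd

cat = pd.Series(['a', 'b', 'a', 'a', 'b'], name='category')

# One indicator column per level
full = pd.get_dummies(cat, prefix='category').to_numpy(dtype=float)
# drop_first=True omits the first level, which becomes the reference level
reduced = pd.get_dummies(cat, prefix='category',
                         drop_first=True).to_numpy(dtype=float)

intercept = np.ones((len(cat), 1))
X_full = np.hstack([intercept, full])
X_reduced = np.hstack([intercept, reduced])

# The full indicator columns sum to the intercept column,
# so X_full has more columns than its rank
print(X_full.shape[1], np.linalg.matrix_rank(X_full))
print(X_reduced.shape[1], np.linalg.matrix_rank(X_reduced))
```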

## Creating Model Descriptions with Patsy

**Patsy** is a Python library for describing statistical models (especially linear models) with a small string-based "formula syntax", which is inspired by (but not exactly the same as) the formula syntax used by the R and S statistical programming languages.

Patsy is well supported for specifying linear models in statsmodels, so I will focus on
some of the main features to help you get up and running. Patsy's formulas are a special
string syntax that looks like:

    y ~ x0 + x1

The syntax ```a + b``` does not mean to add ```a``` to ```b```, but rather that these are terms in the
design matrix created for the model. The ```patsy.dmatrices``` function takes a formula
string along with a dataset (which can be a DataFrame or a dict of arrays) and produces design matrices for a linear model:

In [20]:
data = pd.DataFrame({
    'x0': [1, 2, 3, 4, 5],
    'x1': [0.01, -0.01, 0.25, -4.1, 0.],
    'y': [-1.5, 0., 3.6, 1.3, -2.]})

In [21]:
data

Unnamed: 0,x0,x1,y
0,1,0.01,-1.5
1,2,-0.01,0.0
2,3,0.25,3.6
3,4,-4.1,1.3
4,5,0.0,-2.0


In [22]:
import patsy

In [23]:
y, X = patsy.dmatrices('y ~ x0 + x1', data)

In [24]:
y

DesignMatrix with shape (5, 1)
     y
  -1.5
   0.0
   3.6
   1.3
  -2.0
  Terms:
    'y' (column 0)

In [25]:
X

DesignMatrix with shape (5, 3)
  Intercept  x0     x1
          1   1   0.01
          1   2  -0.01
          1   3   0.25
          1   4  -4.10
          1   5   0.00
  Terms:
    'Intercept' (column 0)
    'x0' (column 1)
    'x1' (column 2)

In [26]:
np.asarray(y)

array([[-1.5],
       [ 0. ],
       [ 3.6],
       [ 1.3],
       [-2. ]])

In [27]:
np.asarray(X)

array([[ 1.  ,  1.  ,  0.01],
       [ 1.  ,  2.  , -0.01],
       [ 1.  ,  3.  ,  0.25],
       [ 1.  ,  4.  , -4.1 ],
       [ 1.  ,  5.  ,  0.  ]])

You might wonder where the Intercept term came from. This is a convention for
linear models like ordinary least squares (OLS) regression. You can suppress the
intercept by adding the term + 0 to the model:

In [28]:
patsy.dmatrices('y ~ x0 + x1 + 0', data)[1]

DesignMatrix with shape (5, 2)
  x0     x1
   1   0.01
   2  -0.01
   3   0.25
   4  -4.10
   5   0.00
  Terms:
    'x0' (column 0)
    'x1' (column 1)

The Patsy objects can be passed directly into algorithms like ```numpy.linalg.lstsq```,
which performs an ordinary least squares regression:

In [29]:
coef, resid, _, _ = np.linalg.lstsq(X, y, rcond=None)


In [33]:
np.linalg.lstsq(X, y, rcond=None)


(array([[ 0.31290976],
        [-0.07910564],
        [-0.26546384]]),
 array([19.63791494]),
 3,
 array([8.03737688, 3.38335321, 0.90895207]))

In [30]:
coef

array([[ 0.31290976],
       [-0.07910564],
       [-0.26546384]])

In [31]:
coef = pd.Series(coef.squeeze(), index=X.design_info.column_names)

In [32]:
coef

Intercept    0.312910
x0          -0.079106
x1          -0.265464
dtype: float64
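
As a cross-check on what the least squares solver computes, the same coefficients can be obtained from the normal equations. This is a sketch that rebuilds the design matrix by hand with NumPy instead of Patsy; the variable names are illustrative.

```python
import numpy as np

x0 = np.array([1, 2, 3, 4, 5], dtype=float)
x1 = np.array([0.01, -0.01, 0.25, -4.1, 0.0])
y = np.array([-1.5, 0.0, 3.6, 1.3, -2.0])

# Design matrix with an explicit intercept column, laid out like Patsy's
X = np.column_stack([np.ones_like(x0), x0, x1])

# Solve the normal equations (X'X) b = X'y
beta = np.linalg.solve(X.T @ X, X.T @ y)

# Compare with the least squares solver
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta)
print(np.allclose(beta, beta_lstsq))
```

Solving the normal equations directly is fine for a small, well-conditioned problem like this one; for larger or nearly collinear designs, `lstsq` is generally the more numerically robust choice.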

### Data Transformations in Patsy Formulas

You can mix Python code into your Patsy formulas; when evaluating the formula the
library will try to find the functions you use in the enclosing scope:

In [34]:
y, X = patsy.dmatrices('y ~ x0 + np.log(np.abs(x1) + 1)', data)

In [35]:
X

DesignMatrix with shape (5, 3)
  Intercept  x0  np.log(np.abs(x1) + 1)
          1   1                 0.00995
          1   2                 0.00995
          1   3                 0.22314
          1   4                 1.62924
          1   5                 0.00000
  Terms:
    'Intercept' (column 0)
    'x0' (column 1)
    'np.log(np.abs(x1) + 1)' (column 2)

Some commonly used variable transformations include standardizing (to mean 0 and
variance 1) and centering (subtracting the mean). Patsy has built-in functions for this
purpose:

In [36]:
y, X = patsy.dmatrices('y ~ standardize(x0) + center(x1)', data)

In [37]:
X

DesignMatrix with shape (5, 3)
  Intercept  standardize(x0)  center(x1)
          1         -1.41421        0.78
          1         -0.70711        0.76
          1          0.00000        1.02
          1          0.70711       -3.33
          1          1.41421        0.77
  Terms:
    'Intercept' (column 0)
    'standardize(x0)' (column 1)
    'center(x1)' (column 2)

The ```patsy.build_design_matrices``` function can apply transformations to new *out-of-sample* data using the saved information from the original *in-sample* dataset:

In [38]:
new_data = pd.DataFrame({
    'x0': [6, 7, 8, 9],
    'x1': [3.1, -0.5, 0, 2.3],
    'y': [1, 2, 3, 4]
})

In [39]:
new_X = patsy.build_design_matrices([X.design_info], new_data)

In [40]:
new_X

[DesignMatrix with shape (4, 3)
   Intercept  standardize(x0)  center(x1)
           1          2.12132        3.87
           1          2.82843        0.27
           1          3.53553        0.77
           1          4.24264        3.07
   Terms:
     'Intercept' (column 0)
     'standardize(x0)' (column 1)
     'center(x1)' (column 2)]
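
The idea of reusing saved in-sample information can be sketched with plain NumPy. This assumes, consistent with the printed output above, that `standardize` divides by the population standard deviation (`ddof=0`) and that both transforms use statistics from the original data only; it is not Patsy's implementation.

```python
import numpy as np

x0_train = np.array([1, 2, 3, 4, 5], dtype=float)
x1_train = np.array([0.01, -0.01, 0.25, -4.1, 0.0])
x0_new = np.array([6, 7, 8, 9], dtype=float)
x1_new = np.array([3.1, -0.5, 0.0, 2.3])

# Statistics are learned once, from the in-sample data only
x0_mean, x0_std = x0_train.mean(), x0_train.std()  # std with ddof=0
x1_mean = x1_train.mean()

# The same saved statistics are then applied to the new data
standardized_new = (x0_new - x0_mean) / x0_std
centered_new = x1_new - x1_mean

print(standardized_new.round(5))
print(centered_new.round(2))
```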

Because the plus symbol (```+```) in the context of Patsy formulas does not mean addition,
when you want to add columns from a dataset by name, you must wrap them in the
special ```I``` function:


In [41]:
y, X = patsy.dmatrices('y ~ I(x0 + x1)', data)

In [42]:
X

DesignMatrix with shape (5, 2)
  Intercept  I(x0 + x1)
          1        1.01
          1        1.99
          1        3.25
          1       -0.10
          1        5.00
  Terms:
    'Intercept' (column 0)
    'I(x0 + x1)' (column 1)

### Categorical Data and Patsy

Non-numeric data can be transformed for a model design matrix in many different
ways. A complete treatment of this topic is outside the scope of this book and would
be best studied along with a course in statistics.

When you use non-numeric terms in a Patsy formula, they are converted to dummy
variables by default. If there is an intercept, one of the levels will be left out to avoid
collinearity:

In [43]:
data = pd.DataFrame({
    'key1': ['a', 'a', 'b', 'b', 'a', 'b', 'a', 'b'],
    'key2': [0, 1, 0, 1, 0, 1, 0, 0],
    'v1': [1, 2, 3, 4, 5, 6, 7, 8],
    'v2': [-1, 0, 2.5, -0.5, 4.0, -1.2, 0.2, -1.7]
})

In [44]:
y, X = patsy.dmatrices('v2 ~ key1', data)

In [45]:
X

DesignMatrix with shape (8, 2)
  Intercept  key1[T.b]
          1          0
          1          0
          1          1
          1          1
          1          0
          1          1
          1          0
          1          1
  Terms:
    'Intercept' (column 0)
    'key1' (column 1)

If you omit the intercept from the model, then columns for each category value will
be included in the model design matrix:

In [46]:
y, X = patsy.dmatrices('v2 ~ key1 + 0', data)

In [47]:
X

DesignMatrix with shape (8, 2)
  key1[a]  key1[b]
        1        0
        1        0
        0        1
        0        1
        1        0
        0        1
        1        0
        0        1
  Terms:
    'key1' (columns 0:2)

Numeric columns can be interpreted as categorical with the ```C``` function:

In [48]:
y, X = patsy.dmatrices('v2 ~ C(key2)', data)

In [49]:
X

DesignMatrix with shape (8, 2)
  Intercept  C(key2)[T.1]
          1             0
          1             1
          1             0
          1             1
          1             0
          1             1
          1             0
          1             0
  Terms:
    'Intercept' (column 0)
    'C(key2)' (column 1)

In [50]:
data['key2'] = data['key2'].map({0: 'zero', 1: 'one'})

In [51]:
data

Unnamed: 0,key1,key2,v1,v2
0,a,zero,1,-1.0
1,a,one,2,0.0
2,b,zero,3,2.5
3,b,one,4,-0.5
4,a,zero,5,4.0
5,b,one,6,-1.2
6,a,zero,7,0.2
7,b,zero,8,-1.7


In [52]:
y, X = patsy.dmatrices('v2 ~ key1 + key2', data)

In [53]:
X

DesignMatrix with shape (8, 3)
  Intercept  key1[T.b]  key2[T.zero]
          1          0             1
          1          0             0
          1          1             1
          1          1             0
          1          0             1
          1          1             0
          1          0             1
          1          1             1
  Terms:
    'Intercept' (column 0)
    'key1' (column 1)
    'key2' (column 2)

In [54]:
y, X = patsy.dmatrices('v2 ~ key1 + key2 + key1:key2', data)

In [55]:
X

DesignMatrix with shape (8, 4)
  Intercept  key1[T.b]  key2[T.zero]  key1[T.b]:key2[T.zero]
          1          0             1                       0
          1          0             0                       0
          1          1             1                       1
          1          1             0                       0
          1          0             1                       0
          1          1             0                       0
          1          0             1                       0
          1          1             1                       1
  Terms:
    'Intercept' (column 0)
    'key1' (column 1)
    'key2' (column 2)
    'key1:key2' (column 3)
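
The ```key1:key2``` term in the last formula is an interaction term. With the default dummy coding shown above, its column is the elementwise product of the two indicator columns. A small pandas sketch of that relationship (the variable names are illustrative):

```python
import pandas as pd

data = pd.DataFrame({
    'key1': ['a', 'a', 'b', 'b', 'a', 'b', 'a', 'b'],
    'key2': ['zero', 'one', 'zero', 'one', 'zero', 'one', 'zero', 'zero'],
})

# Indicator columns matching key1[T.b] and key2[T.zero]
key1_b = (data['key1'] == 'b').astype(int)
key2_zero = (data['key2'] == 'zero').astype(int)

# The interaction column is 1 only where both indicators are 1
interaction = key1_b * key2_zero
print(interaction.tolist())
```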

## Introduction to statsmodels

**statsmodels** is a Python library for fitting many kinds of statistical models, performing statistical tests, and data exploration and visualization. Statsmodels contains more “classical” frequentist statistical methods, while Bayesian methods and machine learning models are found in other libraries.

Some kinds of models found in statsmodels include:

- Linear models, generalized linear models, and robust linear models
- Linear mixed effects models
- Analysis of variance (ANOVA) methods
- Time series processes and state space models
- Generalized method of moments

### Estimating Linear Models

Linear models in statsmodels have two different main interfaces: **array-based** and
**formula-based**. These are accessed through these API module imports:

In [59]:
import statsmodels.api as sm
import statsmodels.formula.api as smf

In [60]:
def dnorm(mean, variance, size=1):
    if isinstance(size, int):
        size = size,
    return mean + np.sqrt(variance) * np.random.randn(*size)

Set the random seed so that the generated data are reproducible:

In [61]:
np.random.seed(12345)

In [62]:
N = 100

In [63]:
X = np.c_[dnorm(0, 0.4, size=N),
          dnorm(0, 0.6, size=N),
          dnorm(0, 0.2, size=N)]
eps = dnorm(0, 0.1, size=N)
beta = [0.1, 0.3, 0.5]

In [64]:
y = np.dot(X, beta) + eps

In [65]:
X[:5]

array([[-0.12946849, -1.21275292,  0.50422488],
       [ 0.30291036, -0.43574176, -0.25417986],
       [-0.32852189, -0.02530153,  0.13835097],
       [-0.35147471, -0.71960511, -0.25821463],
       [ 1.2432688 , -0.37379916, -0.52262905]])

In [66]:
y[:5]

array([ 0.42786349, -0.67348041, -0.09087764, -0.48949442, -0.12894109])

A linear model is generally fitted with an intercept term as we saw before with Patsy.
The ```sm.add_constant``` function can add an intercept column to an existing matrix:

In [67]:
X_model = sm.add_constant(X)

In [68]:
X_model[:5]

array([[ 1.        , -0.12946849, -1.21275292,  0.50422488],
       [ 1.        ,  0.30291036, -0.43574176, -0.25417986],
       [ 1.        , -0.32852189, -0.02530153,  0.13835097],
       [ 1.        , -0.35147471, -0.71960511, -0.25821463],
       [ 1.        ,  1.2432688 , -0.37379916, -0.52262905]])
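
For a plain NumPy array, the result is the same as prepending a column of ones by hand. A minimal sketch with made-up data (`X_demo` is illustrative, not the matrix above):

```python
import numpy as np

rng = np.random.RandomState(0)
X_demo = rng.randn(5, 3)

# Prepend a column of ones to serve as the intercept term
X_with_const = np.column_stack([np.ones(len(X_demo)), X_demo])
print(X_with_const.shape)
```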

The ```sm.OLS``` class can fit an ordinary least squares linear regression:

In [69]:
model = sm.OLS(y, X)

In [70]:
results = model.fit()

In [71]:
results.params

array([0.17826108, 0.22303962, 0.50095093])

The ```summary``` method on results prints detailed diagnostic output for the
model:

In [72]:
print(results.summary())

                                 OLS Regression Results                                
Dep. Variable:                      y   R-squared (uncentered):                   0.430
Model:                            OLS   Adj. R-squared (uncentered):              0.413
Method:                 Least Squares   F-statistic:                              24.42
Date:                Sun, 26 Jan 2020   Prob (F-statistic):                    7.44e-12
Time:                        12:51:12   Log-Likelihood:                         -34.305
No. Observations:                 100   AIC:                                      74.61
Df Residuals:                      97   BIC:                                      82.42
Df Model:                           3                                                  
Covariance Type:            nonrobust                                                  
                 coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------

The parameter names here have been given the generic names x1, x2, and so on.
Suppose instead that all of the model parameters are in a DataFrame:

In [73]:
data = pd.DataFrame(X, columns=['col0', 'col1', 'col2'])

In [74]:
data['y'] = y

In [75]:
data[:5]

Unnamed: 0,col0,col1,col2,y
0,-0.129468,-1.212753,0.504225,0.427863
1,0.30291,-0.435742,-0.25418,-0.67348
2,-0.328522,-0.025302,0.138351,-0.090878
3,-0.351475,-0.719605,-0.258215,-0.489494
4,1.243269,-0.373799,-0.522629,-0.128941


In [76]:
results = smf.ols('y ~ col0 + col1 + col2', data=data).fit()

In [77]:
results.params

Intercept    0.033559
col0         0.176149
col1         0.224826
col2         0.514808
dtype: float64

In [78]:
results.tvalues

Intercept    0.952188
col0         3.319754
col1         4.850730
col2         6.303971
dtype: float64

Observe how statsmodels has returned results as Series with the DataFrame column
names attached. We also **do not need** to use ```add_constant``` when using formulas and
pandas objects.

Given new out-of-sample data, you can compute predicted values from the estimated
model parameters:

In [80]:
results.predict(data[:5])

0   -0.002327
1   -0.141904
2    0.041226
3   -0.323070
4   -0.100535
dtype: float64
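
For a linear model, each prediction is the intercept plus the dot product of the row with the slope coefficients. The sketch below reproduces the first predicted value by hand, using the rounded parameters and the first data row as printed above, so it only agrees to about five decimal places:

```python
import numpy as np

# Intercept, col0, col1, col2 as printed by results.params (rounded)
params = np.array([0.033559, 0.176149, 0.224826, 0.514808])
# First row of col0, col1, col2 as printed by data[:5] (rounded)
row = np.array([-0.129468, -1.212753, 0.504225])

pred = params[0] + row @ params[1:]
print(pred)
```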

### Estimating Time Series Processes

Another class of models in statsmodels is for time series analysis. Among these are
autoregressive processes, Kalman filtering and other state space models, and
multivariate autoregressive models.

In [84]:
init_x = 4

In [85]:
import random 

In [86]:
values = [init_x, init_x]

In [87]:
N = 1000

In [88]:
b0 = 0.8
b1 = -0.4
noise = dnorm(0, 0.1, N)
for i in range(N):
    new_x = values[-1] * b0 + values[-2] * b1 + noise[i]
    values.append(new_x)

In [89]:
MAXLAGS = 5

In [90]:
model = sm.tsa.AR(values)

In [91]:
results = model.fit(MAXLAGS)

The estimated parameters in the results have the intercept first and the estimates for
the first two lags next:

In [92]:
results.params

array([-0.00616093,  0.78446347, -0.40847891, -0.01364148,  0.01496872,
        0.01429462])
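
To see why the first two lag estimates land near 0.8 and -0.4, an AR(2) model can also be fitted by regressing each value on its two predecessors. The sketch below is a conditional least squares fit written with NumPy under its own random seed; it is not how the statsmodels class is implemented, and its numbers will differ from the output above.

```python
import numpy as np

rng = np.random.RandomState(12345)
N = 1000
b0, b1 = 0.8, -0.4
noise = np.sqrt(0.1) * rng.randn(N)  # variance 0.1

# Simulate the same AR(2) recursion as above
values = [4.0, 4.0]
for i in range(N):
    values.append(values[-1] * b0 + values[-2] * b1 + noise[i])
series = np.asarray(values)

# Regress x[t] on an intercept, x[t-1], and x[t-2]
target = series[2:]
lagged = np.column_stack([np.ones(len(target)), series[1:-1], series[:-2]])
est, *_ = np.linalg.lstsq(lagged, target, rcond=None)
print(est)  # intercept, lag-1 coefficient, lag-2 coefficient
```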

## Introduction to scikit-learn

**scikit-learn** is one of the most widely used and trusted general-purpose Python
machine learning toolkits. It contains a broad selection of standard supervised and
unsupervised machine learning methods with tools for model selection and evaluation,
data transformation, data loading, and model persistence. These models can be
used for classification, clustering, prediction, and other common tasks.

In [96]:
train = pd.read_csv('datasets/titanic/train.csv')

In [97]:
test = pd.read_csv('datasets/titanic/test.csv')

In [98]:
train[:4]

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S


Libraries like statsmodels and scikit-learn generally cannot be fed missing data, so we
look at the columns to see if there are any that contain missing data:

In [99]:
train.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

In [100]:
test.isnull().sum()

PassengerId      0
Pclass           0
Name             0
Sex              0
Age             86
SibSp            0
Parch            0
Ticket           0
Fare             1
Cabin          327
Embarked         0
dtype: int64

A model is **fitted** on a training dataset and then **evaluated** on an out-of-sample testing dataset.

I would like to use Age as a predictor, but it has missing data. There are a number of
ways to do missing data imputation, but I will do a simple one and use the median of
the training dataset to fill the nulls in both tables:

In [102]:
impute_value = train['Age'].median()

In [103]:
train['Age'] = train['Age'].fillna(impute_value)

In [104]:
test['Age'] = test['Age'].fillna(impute_value)

Now we need to specify our models. I add a column ```IsFemale``` as an encoded version
of the 'Sex' column:

In [105]:
train['IsFemale'] = (train['Sex'] == 'female').astype(int)

In [106]:
test['IsFemale'] = (test['Sex'] == 'female').astype(int)

In [107]:
predictors = ['Pclass', 'IsFemale', 'Age']

In [108]:
X_train = train[predictors].values

In [109]:
X_test = test[predictors].values

In [110]:
y_train = train['Survived'].values

In [111]:
X_train[:5]

array([[ 3.,  0., 22.],
       [ 1.,  1., 38.],
       [ 3.,  1., 26.],
       [ 1.,  1., 35.],
       [ 3.,  0., 35.]])

In [112]:
y_train[:5]

array([0, 1, 1, 1, 0])

In [113]:
from sklearn.linear_model import LogisticRegression

In [114]:
model = LogisticRegression()

In [115]:
model.fit(X_train, y_train)



LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='warn', tol=0.0001, verbose=0,
                   warm_start=False)

In [116]:
y_predict = model.predict(X_test)

In [117]:
y_predict[:10]

array([0, 0, 0, 0, 1, 0, 1, 0, 1, 0])

If you had the true values for the test dataset, you could compute an accuracy percentage or some other error metric:
```python
(y_true == y_predict).mean()
```

In practice, there are often many additional layers of complexity in model training.
Many models have parameters that can be tuned, and there are techniques such as
cross-validation that can be used for parameter tuning to avoid overfitting to the
training data. This can often yield better predictive performance or robustness on
new data.

Cross-validation works by splitting the training data to simulate out-of-sample
prediction. Based on a model accuracy score like mean squared error, one can perform a
grid search on model parameters. Some models, like logistic regression, have estimator
classes with built-in cross-validation. For example, the ```LogisticRegressionCV```
class can be used with a parameter indicating how fine-grained a grid search to perform
on the model regularization parameter ```C```:

In [119]:
from sklearn.linear_model import LogisticRegressionCV

In [120]:
model_cv = LogisticRegressionCV(10)

In [121]:
model_cv.fit(X_train, y_train)



LogisticRegressionCV(Cs=10, class_weight=None, cv='warn', dual=False,
                     fit_intercept=True, intercept_scaling=1.0, l1_ratios=None,
                     max_iter=100, multi_class='warn', n_jobs=None,
                     penalty='l2', random_state=None, refit=True, scoring=None,
                     solver='lbfgs', tol=0.0001, verbose=0)

To do cross-validation by hand, you can use the ```cross_val_score``` helper function,
which handles the data splitting process. For example, to cross-validate our model
with four non-overlapping splits of the training data, we can do:

In [122]:
from sklearn.model_selection import cross_val_score

In [123]:
model = LogisticRegression(C=10)

In [124]:
scores = cross_val_score(model, X_train, y_train, cv=4)



In [125]:
scores

array([0.77232143, 0.80269058, 0.77027027, 0.78828829])

The default scoring metric is model-dependent, but it is possible to choose an explicit
scoring function. Cross-validated models take longer to train, but can often yield better model performance.
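
An explicit scoring function is chosen with the `scoring` argument of `cross_val_score`. The sketch below uses synthetic data (`X_demo` and `y_demo` are made up for illustration, not the Titanic arrays) and compares two of scikit-learn's built-in scorer names, `'accuracy'` and `'roc_auc'`:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.RandomState(0)
X_demo = rng.randn(200, 3)
# Labels depend on the first two features plus some noise
y_demo = (X_demo[:, 0] - X_demo[:, 1] + 0.5 * rng.randn(200) > 0).astype(int)

model = LogisticRegression(C=10)

# Same model and splits, two different scoring functions
acc = cross_val_score(model, X_demo, y_demo, cv=4, scoring='accuracy')
auc = cross_val_score(model, X_demo, y_demo, cv=4, scoring='roc_auc')
print(acc.mean(), auc.mean())
```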