# 12.1 Interfacing between pandas and Model Code

A common workflow for model development is to use pandas for data loading and cleaning before switching over to a modeling library to build the model itself.

Feature engineering
Any data transformation or analytics that extracts information from a raw dataset that may be useful in a modeling context.
Data aggregation and GroupBy tools are often used in a feature engineering context.
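
As a small illustration of GroupBy-based feature engineering, the sketch below derives a per-group mean and each row's deviation from it. The DataFrame and its column names (`customer`, `amount`) are made up for this example.

```python
import pandas as pd

# Hypothetical raw data: one row per purchase
purchases = pd.DataFrame({
    'customer': ['a', 'a', 'b', 'b', 'b'],
    'amount': [10.0, 30.0, 5.0, 15.0, 40.0],
})

# Group-level feature: mean purchase amount per customer, aligned to each row
purchases['customer_mean'] = purchases.groupby('customer')['amount'].transform('mean')

# Row-level feature: how far each purchase is from that customer's mean
purchases['amount_vs_mean'] = purchases['amount'] - purchases['customer_mean']

print(purchases)
```

Using `transform` rather than `agg` keeps the result the same length as the input, so the new feature can be assigned back as a column.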


In [1]:
import pandas as pd
import numpy as np
import patsy

import statsmodels.api as sm
import statsmodels.formula.api as smf


In [None]:
# Turn a DataFrame into a NumPy array
data = pd.DataFrame({
	'x0': [1, 2, 3, 4, 5],
	'x1': [0.01, -0.01, 0.25, -4.1, 0.],
	'y': [-1.5, 0., 3.6, 1.3, -2.]
})

data.to_numpy()

# Convert the NumPy array back to a DataFrame
df2 = pd.DataFrame(data.to_numpy(), columns=['one', 'two', 'three'])

df2

# Heterogeneous data will be converted to an ndarray of Python objects
df3 = data.copy()
df3['strings'] = ['a', 'b', 'c', 'd', 'e']

df3.to_numpy()

# To select a subset of columns, use loc
model_cols = ['x0', 'x1']

data.loc[:, model_cols].to_numpy()

# Create a categorical column
data['category'] = pd.Categorical(['a', 'b', 'a', 'a', 'b'], categories=['a', 'b'])

# Replace the 'category' column with dummy variables
dummies = pd.get_dummies(data.category, prefix='category')

data_with_dummies = data.drop('category', axis=1).join(dummies)

data_with_dummies

# 12.2 Creating Model Descriptions with Patsy

Patsy is a Python library for describing statistical models (especially linear models) with a string-based 'formula syntax'.

`y ~ x0 + x1`

The syntax `a + b` does not mean adding `a` to `b`; rather, these are terms in the design matrix created for the model.

The `patsy.dmatrices` function takes a formula string along with a dataset and produces design matrices for a linear model.

In [None]:
data = pd.DataFrame({
	'x0': [1, 2, 3, 4, 5],
	'x1': [0.01, -0.01, 0.25, -4.1, 0],
	'y': [-1.5, 0., 3.6, 1.3, -2.]
})

data
y, X = patsy.dmatrices('y ~ x0 + x1', data)

y



In [None]:
X

Patsy DesignMatrix instances are NumPy ndarrays with additional metadata.

The Patsy objects can be passed directly into algorithms like numpy.linalg.lstsq.

In [None]:
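# rcond=None uses NumPy's current default cutoff and avoids a FutureWarning on older versions
coef, resid, _, _ = np.linalg.lstsq(X, y, rcond=None)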

In [None]:
coef

In [None]:
# The model metadata is retained in the design_info attribute.
# Reattach the model column names to the fitted coefficients to obtain a Series.
coef = pd.Series(coef.squeeze(), index=X.design_info.column_names)
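
To make explicit what `numpy.linalg.lstsq` computes, the following sketch rebuilds the same design matrix by hand (an intercept column plus `x0` and `x1`) and checks the least-squares coefficients against the normal equations. It uses only NumPy, and the variable names are chosen for this example.

```python
import numpy as np

x0 = np.array([1, 2, 3, 4, 5], dtype=float)
x1 = np.array([0.01, -0.01, 0.25, -4.1, 0.0])
y = np.array([-1.5, 0.0, 3.6, 1.3, -2.0])

# Design matrix with columns in the order Intercept, x0, x1
X = np.column_stack([np.ones_like(x0), x0, x1])

# Least-squares solution; rcond=None selects NumPy's current default cutoff
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

# The same coefficients solve the normal equations (X'X) b = X'y
coef_normal = np.linalg.solve(X.T @ X, X.T @ y)

print(coef)
print(np.allclose(coef, coef_normal))
```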

## Data Transformations in Patsy Formulas

You can mix Python code into Patsy formulas; the library will try to find the functions in the enclosing scope.

In [None]:
y, X = patsy.dmatrices('y ~ x0 + np.log(np.abs(x1) + 1)', data)

In [None]:
X

Commonly used variable transformations include standardizing (to mean 0 and variance 1) and centering (subtracting the mean).

In [None]:
y, X = patsy.dmatrices('y ~ standardize(x0) + center(x1)', data)

X

As part of a modeling process, you may fit a model on one dataset, then evaluate the model on another.
When applying transformations like center and standardize, be careful when using the model to form predictions based on new data.

These are stateful transformations: you must use statistics like the mean or standard deviation of the original dataset when transforming a new dataset.
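
The point about stateful transformations can be shown without Patsy. In this sketch, the mean is computed once on the original data and reused for the new data; re-centering the new data with its own mean would produce values on a different scale than the one the model was fitted on. The variable names are illustrative.

```python
import pandas as pd

train_x1 = pd.Series([0.01, -0.01, 0.25, -4.1, 0.0])
new_x1 = pd.Series([3.1, -0.5, 0.0, 2.3])

# State learned from the original dataset
train_mean = train_x1.mean()

# Consistent with the fitted model: center the new data with the ORIGINAL mean
new_centered = new_x1 - train_mean

# Inconsistent for prediction: center the new data with its own mean
new_recentered = new_x1 - new_x1.mean()

print(new_centered.tolist())
print(new_recentered.tolist())
```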

In [None]:
new_data = pd.DataFrame({
	'x0': [6, 7, 8, 9],
	'x1': [3.1, -0.5, 0, 2.3],
	'y' : [1, 2, 3, 4]
})

new_X = patsy.build_design_matrices([X.design_info], new_data)

In [None]:
new_X

In [None]:
# Use the special I function to wrap the addition operation,
# since the + symbol in a formula doesn't mean addition
y, X = patsy.dmatrices('y ~ I(x0 + x1)', data)

X

In [None]:
data

## Categorical Data and Patsy
When you use nonnumeric terms in a Patsy formula, they are converted to dummy variables by default. 

In [None]:
data = pd.DataFrame({
	'key1': ['a', 'b', 'b', 'b', 'a', 'b', 'a', 'b'],
	'key2': [0, 1, 0, 1, 0, 1, 0, 0],
	'v1' : [1, 2, 3, 4, 5, 6, 7, 8],
	'v2': [-1., 0, 2.5, -0.5, 4.0, -1.2, 0.2, -1.7]
})

y,X = patsy.dmatrices('v2 ~ key1', data)

In [None]:
X
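
For comparison, a similar encoding can be produced directly in pandas. This sketch relies on Patsy's default coding keeping an intercept and treating the first category level as the baseline, which `pd.get_dummies(..., drop_first=True)` approximates; the column naming differs from Patsy's `key1[T.b]`.

```python
import pandas as pd

key1 = pd.Series(['a', 'b', 'b', 'b', 'a', 'b', 'a', 'b'], name='key1')

# One indicator column per level except the first ('a' becomes the baseline)
dummies = pd.get_dummies(key1, prefix='key1', drop_first=True).astype(int)

# Add the intercept column by hand to mirror the design matrix layout
design = dummies.copy()
design.insert(0, 'Intercept', 1)

print(design)
```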

In [None]:
# If you omit the intercept from the model, columns for each category value
# will be included in the model design matrix.
# The + 0 term omits the intercept
y, X = patsy.dmatrices('v2 ~ key1 + 0', data)
X

In [None]:
# Numeric columns can be interpreted as categorical with the C function
y, X = patsy.dmatrices('v2 ~ C(key2)', data)

X

In [None]:
# Map key2 to string labels so the model can use multiple categorical terms

data['key2'] = data['key2'].map({0: 'zero', 1:'one'})

data

In [None]:
y, X = patsy.dmatrices('v2 ~ key1 + key2', data)

In [None]:
X

In [None]:
y, X = patsy.dmatrices('v2 ~ key1 + key2 + key1:key2', data)

In [None]:
X
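
The interaction term `key1:key2` adds a column that is the element-wise product of the two indicator columns. A pandas-only sketch of that idea, using hand-built indicators rather than Patsy's own output (the names `is_b` and `is_zero` are made up for this example):

```python
import pandas as pd

df = pd.DataFrame({
    'key1': ['a', 'b', 'b', 'b', 'a', 'b', 'a', 'b'],
    'key2': ['zero', 'one', 'zero', 'one', 'zero', 'one', 'zero', 'zero'],
})

# Indicator columns for the non-baseline levels
is_b = (df['key1'] == 'b').astype(int)
is_zero = (df['key2'] == 'zero').astype(int)

# The interaction column is 1 only where both indicators are 1
interaction = is_b * is_zero

print(interaction.tolist())
```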

# 12.3 Introduction to statsmodels
statsmodels is a Python library for fitting many kinds of statistical models, performing statistical tests, and data exploration and visualization.

Some kinds of models found in statsmodels:
- Linear models, generalized linear models and robust linear models
- Linear mixed effects models
- Analysis of variance (ANOVA) methods
- Time series processes and state space models
- Generalized method of moments

The statsmodels modeling interfaces can be used with Patsy formulas and pandas DataFrame objects.

## Estimating Linear Models


In [2]:
# To make the example reproducible
rng = np.random.default_rng(seed=12345)

# helper function for generating normally distributed data with a particular mean and variance
def dnorm(mean, variance, size=1):
	if isinstance(size, int):
		size = size,
	return mean + np.sqrt(variance) * rng.standard_normal(*size)

N = 100

X = np.c_[
	dnorm(0, 0.4, size=N),
	dnorm(0, 0.6, size=N),
	dnorm(0, 0.2, size=N)
]

eps = dnorm(0, 0.1, size=N)

# 'true' model
beta = [0.1, 0.3, 0.5]

y = np.dot(X, beta) + eps

In [None]:
X[:5]

In [None]:
y[:5]

In [None]:
# A linear model is generally fitted with an intercept term
# use sm.add_constant function to add an intercept column

In [None]:
X_model = sm.add_constant(X)
X_model[:5]
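
By default, `sm.add_constant` prepends a column of ones. An equivalent NumPy-only sketch, shown here on a small made-up array:

```python
import numpy as np

X_small = np.array([[0.5, -1.0],
                    [1.5,  2.0],
                    [2.5,  0.0]])

# Prepend an intercept column of ones
X_with_const = np.column_stack([np.ones(len(X_small)), X_small])

print(X_with_const)
```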

In [None]:
# Fit an ordinary least squares linear regression
# (note: X is passed here, so no intercept is estimated; X_model includes the constant)
model = sm.OLS(y, X)
results = model.fit()

In [None]:
results.params

In [None]:
# Print a summary detailing the diagnostic output of the model
print(results.summary())

In [None]:
data = pd.DataFrame(X, columns=['col0', 'col1', 'col2'])

data['y'] = y

results = smf.ols('y ~ col0 + col1 + col2', data=data).fit()

In [None]:
results.params

In [None]:
results.tvalues

In [None]:
# Compute the predicted values given the estimated model parameters
results.predict(data[:5])
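
For a linear model, the predicted values are the design matrix multiplied by the fitted coefficients. The sketch below checks this on a small synthetic dataset; the data, seed, and coefficients used to generate it are assumptions made for the example.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
df = pd.DataFrame({'col0': rng.standard_normal(50),
                   'col1': rng.standard_normal(50)})
df['y'] = 1.0 + 2.0 * df['col0'] - 0.5 * df['col1'] + 0.1 * rng.standard_normal(50)

results = smf.ols('y ~ col0 + col1', data=df).fit()

# Manual prediction: intercept plus coefficient times column, summed
params = results.params
manual = params['Intercept'] + params['col0'] * df['col0'] + params['col1'] * df['col1']

# Compare against the library's predictions for the first five rows
print(np.allclose(results.predict(df[:5]), manual[:5]))
```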

## Estimating Time Series Processes

statsmodels includes tools for time series analysis, such as autoregressive processes, Kalman filtering, other state space models, and multivariate autoregressive models.



In [3]:
# Simulate some time series data with an autoregressive structure and noise
# This data has an AR(2) structure (two lags) with parameters 0.8 and -0.4. 

init_x = 4

values = [init_x, init_x]

N = 1000

b0 = 0.8
b1 = -0.4
noise = dnorm(0, 0.1, N)

for i in range(N):
	new_x = values[-1] * b0 + values[-2] * b1 + noise[i]
	values.append(new_x)

# Fit the model with some larger number of lags
from statsmodels.tsa.ar_model import AutoReg

MAXLAGS = 5
model = AutoReg(values, MAXLAGS)

results = model.fit()

In [5]:
results.params

array([ 0.02346612,  0.8096828 , -0.42865278, -0.03336517,  0.04267874,
       -0.05671529])
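
An autoregressive model can be viewed as a linear regression of each value on its own lags. The sketch below simulates the same AR(2) process and estimates the two lag coefficients with ordinary least squares in NumPy. This is a simplified illustration, not a description of how `AutoReg` is implemented; the seed and variable names are chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(12345)
b0, b1 = 0.8, -0.4
n = 1000
noise = np.sqrt(0.1) * rng.standard_normal(n)

# Simulate the AR(2) series starting from two initial values
values = [4.0, 4.0]
for i in range(n):
    values.append(values[-1] * b0 + values[-2] * b1 + noise[i])
series = np.array(values)

# Regress x[t] on an intercept, x[t-1], and x[t-2]
target = series[2:]
lags = np.column_stack([np.ones(len(target)), series[1:-1], series[:-2]])
est, *_ = np.linalg.lstsq(lags, target, rcond=None)

print(est)  # intercept, lag-1 coefficient, lag-2 coefficient
```

The estimated lag coefficients should land near the true values of 0.8 and -0.4, in line with the first two lag estimates in the `AutoReg` output above.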

# 12.4 Introduction to scikit-learn

scikit-learn is a general-purpose Python machine learning toolkit.



In [6]:
train = pd.read_csv('./datasets/titanic/train.csv')
test = pd.read_csv('./datasets/titanic/test.csv')

train.head(4)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S


In [10]:
# Libraries like statsmodels and scikit-learn generally cannot be fed missing data. 
# Look at the columns to see if there are any that contain missing data.
train.isna().sum()

test.isna().sum()


PassengerId      0
Pclass           0
Name             0
Sex              0
Age             86
SibSp            0
Parch            0
Ticket           0
Fare             1
Cabin          327
Embarked         0
dtype: int64

In [11]:
# A model is fitted on a training dataset and then evaluated on an out-of-sample testing dataset
# Use the median of the training data set to fill the nulls in both tables

impute_value = train['Age'].median()

train['Age'] = train['Age'].fillna(impute_value)

test['Age'] = test['Age'].fillna(impute_value)

In [29]:
# Add a column IsFemale to encode the 'Sex' column
train['IsFemale'] = (train['Sex'] == 'female').astype(int)
test['IsFemale'] = (test['Sex'] == 'female').astype(int)

predictors = ['Pclass', 'IsFemale', 'Age']

X_train = train[predictors].to_numpy()
X_test = test[predictors].to_numpy()

y_train = train['Survived'].to_numpy()

X_train[:5]


array([[ 3.,  0., 22.],
       [ 1.,  1., 38.],
       [ 3.,  1., 26.],
       [ 1.,  1., 35.],
       [ 3.,  0., 35.]])

In [23]:
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()

model.fit(X_train, y_train)


In [17]:
y_predict = model.predict(X_test)

In [18]:
y_predict[:10]

array([0, 0, 0, 0, 1, 0, 1, 0, 1, 0])

In [30]:
# If the true values for the test dataset (y_true) were available,
# the accuracy percentage could be computed as:
# (y_true == y_predict).mean()

# Cross-validation
# Some models have built-in cross-validation

from sklearn.linear_model import LogisticRegressionCV

model_cv = LogisticRegressionCV(Cs=10)
model_cv.fit(X_train, y_train)


In [31]:
y_predict = model_cv.predict(X_test)

In [34]:
y_predict[:10]

array([0, 0, 0, 0, 1, 0, 1, 0, 1, 0])

In [35]:
# To do cross-validation by hand, use the cross_val_score helper function

from sklearn.model_selection import cross_val_score

model = LogisticRegression(C=10)

scores = cross_val_score(model, X_train, y_train, cv=4)

scores

array([0.77578475, 0.79820628, 0.77578475, 0.78828829])
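
To show roughly what `cross_val_score` does for a classifier, the sketch below runs the folds by hand on synthetic data. It assumes that, for an integer `cv` and a classifier, the default splitter is an unshuffled `StratifiedKFold`; the dataset comes from `make_classification` and is not the Titanic data.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic binary classification data (an assumption for this example)
X, y = make_classification(n_samples=200, n_features=4, random_state=0)
model = LogisticRegression(C=10)

manual_scores = []
for train_idx, test_idx in StratifiedKFold(n_splits=4).split(X, y):
    fold_model = clone(model)                   # fresh, unfitted copy per fold
    fold_model.fit(X[train_idx], y[train_idx])  # fit on the training folds
    manual_scores.append(fold_model.score(X[test_idx], y[test_idx]))  # accuracy on the held-out fold

helper_scores = cross_val_score(model, X, y, cv=4)
print(np.array(manual_scores))
print(helper_scores)
```

Each score is the accuracy on one held-out fold, so the spread across folds gives a rough sense of how stable the model's performance is.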