# 12.1 Interfacing between pandas and Model Code

A common workflow for model development is to use pandas for data loading and cleaning before switching over to a modeling library to build the model itself.

Feature engineering
Any data transformation or analytics that extracts information from a raw dataset that may be useful in a modeling context.
Data aggregation and GroupBy tools are often used in a feature engineering context.
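
As a small illustration of GroupBy-based feature engineering, the sketch below adds a per-group mean as a new feature, plus each row's deviation from it. The column names `store` and `sales` and all values are invented for the example.

```python
import pandas as pd

# Hypothetical raw data: column names and values are made up for illustration
raw = pd.DataFrame({
    'store': ['a', 'a', 'b', 'b', 'b'],
    'sales': [10.0, 14.0, 3.0, 5.0, 7.0],
})

# Per-group mean, broadcast back to every row of its group
raw['store_mean_sales'] = raw.groupby('store')['sales'].transform('mean')

# Deviation of each row from its group mean
raw['sales_vs_store_mean'] = raw['sales'] - raw['store_mean_sales']
```

`transform` is used here rather than `agg` because it returns a result aligned with the original rows, so the new column can be assigned directly.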


In [None]:
import pandas as pd
import numpy as np
import patsy

In [None]:
# Turn a DataFrame into a NumPy array
data = pd.DataFrame({
	'x0': [1, 2, 3, 4, 5],
	'x1': [0.01, -0.01, 0.25, -4.1, 0.],
	'y': [-1.5, 0., 3.6, 1.3, -2.]
})

data.to_numpy()

# To convert a NumPy array back to a DataFrame, pass it along with column names
df2 = pd.DataFrame(data.to_numpy(), columns=['one', 'two', 'three'])

df2

# Heterogeneous data will be converted to an ndarray of Python objects
df3 = data.copy()
df3['strings'] = ['a', 'b', 'c', 'd', 'e']

df3.to_numpy()

# To select a subset of columns, use loc
model_cols = ['x0', 'x1']

data.loc[:, model_cols].to_numpy()

# Create a categorical column
data['category'] = pd.Categorical(['a', 'b', 'a', 'a', 'b'], categories=['a', 'b'])

# Replace the 'category' column with dummy variables
dummies = pd.get_dummies(data.category, prefix='category')

data_with_dummies = data.drop('category', axis=1).join(dummies)

data_with_dummies
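
One detail worth checking before handing dummy columns to a modeling library is their dtype. As far as I know, `pd.get_dummies` returns boolean columns by default in pandas 2.0 and later, so the sketch below passes `dtype=float` explicitly to get numeric indicators regardless of version. It rebuilds the same `data` as above so that it can run on its own.

```python
import pandas as pd

data = pd.DataFrame({
    'x0': [1, 2, 3, 4, 5],
    'x1': [0.01, -0.01, 0.25, -4.1, 0.],
    'y': [-1.5, 0., 3.6, 1.3, -2.]
})
data['category'] = pd.Categorical(['a', 'b', 'a', 'a', 'b'], categories=['a', 'b'])

# Request float indicators explicitly instead of relying on the default dtype
dummies = pd.get_dummies(data['category'], prefix='category', dtype=float)
data_with_dummies = data.drop('category', axis=1).join(dummies)

# Every column is numeric, so to_numpy() yields a float64 array, not object
arr = data_with_dummies.to_numpy()
```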

# 12.2 Creating Model Descriptions with Patsy

Patsy is a Python library for describing statistical models (especially linear models) with a string-based 'formula syntax'.

`y ~ x0 + x1`

In a formula, `a + b` does not mean adding `a` to `b`; it means that `a` and `b` are both terms in the design matrix created for the model.

The `patsy.dmatrices` function takes a formula string along with a dataset and produces design matrices for a linear model.
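
To make the idea of a design matrix concrete, here is a rough hand-built equivalent of the layout that the formula `y ~ x0 + x1` describes: a response column, plus a predictor matrix with an intercept column of ones followed by `x0` and `x1`. This is only a NumPy sketch of the layout, not how Patsy is implemented, and the names `y_manual` and `X_manual` are my own.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({
    'x0': [1, 2, 3, 4, 5],
    'x1': [0.01, -0.01, 0.25, -4.1, 0.],
    'y': [-1.5, 0., 3.6, 1.3, -2.]
})

# Response as a column vector with shape (5, 1)
y_manual = data[['y']].to_numpy()

# Predictors: intercept column of ones, then x0 and x1, shape (5, 3)
X_manual = np.column_stack([
    np.ones(len(data)),
    data['x0'].to_numpy(),
    data['x1'].to_numpy(),
])
```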

In [None]:
data = pd.DataFrame({
	'x0': [1, 2, 3, 4, 5],
	'x1': [0.01, -0.01, 0.25, -4.1, 0],
	'y': [-1.5, 0., 3.6, 1.3, -2.]
})

data
y, X = patsy.dmatrices('y ~ x0 + x1', data)

y



In [None]:
X

Patsy DesignMatrix instances are NumPy ndarrays with additional metadata.

The Patsy objects can be passed directly into algorithms like numpy.linalg.lstsq, which performs an ordinary least squares regression.

In [None]:
# rcond=None avoids the FutureWarning that older NumPy versions emit about the default
coef, resid, _, _ = np.linalg.lstsq(X, y, rcond=None)

In [None]:
coef

In [None]:
# The model metadata is retained in the design_info attribute.
# Reattach the model column names to the fitted coefficients to obtain a Series.
coef = pd.Series(coef.squeeze(), index=X.design_info.column_names)
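
Once the coefficients are labeled, fitted values are just the design matrix times the coefficient vector. The sketch below shows that step with a hand-built design matrix in plain NumPy so that it runs without Patsy. The names `X_manual`, `fitted`, and `residuals` are illustrative, and the index labels mirror what `X.design_info.column_names` gives for `y ~ x0 + x1`.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({
    'x0': [1, 2, 3, 4, 5],
    'x1': [0.01, -0.01, 0.25, -4.1, 0.],
    'y': [-1.5, 0., 3.6, 1.3, -2.]
})

# Hand-built design matrix: intercept, x0, x1
X_manual = np.column_stack([
    np.ones(len(data)),
    data['x0'].to_numpy(),
    data['x1'].to_numpy(),
])
y_manual = data['y'].to_numpy()

# Least-squares fit, then label the coefficients
coef, _, _, _ = np.linalg.lstsq(X_manual, y_manual, rcond=None)
coef = pd.Series(coef, index=['Intercept', 'x0', 'x1'])

# Fitted values and residuals from the labeled coefficients
fitted = X_manual @ coef.to_numpy()
residuals = y_manual - fitted
```

A quick sanity check on any least-squares fit is that the residuals are orthogonal to every column of the design matrix, that is, `X_manual.T @ residuals` is numerically zero.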

## Data Transformations in Patsy Formulas

You can mix Python code into Patsy formulas; when evaluating the formula, the library will try to find the functions you use in the enclosing scope.

In [None]:
y, X = patsy.dmatrices('y ~ x0 + np.log(np.abs(x1) + 1)', data)

In [None]:
X

Commonly used variable transformations include standardizing (to mean 0 and variance 1) and centering (subtracting the mean). Patsy has built-in functions for both.

In [None]:
y, X = patsy.dmatrices('y ~ standardize(x0) + center(x1)', data)

X
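
For reference, the two transformations can be reproduced by hand. The sketch below assumes that Patsy's `standardize` divides by the population standard deviation (`ddof=0`), which is consistent with the `build_design_matrices` output shown later in this section. The names `x0_std` and `x1_centered` are illustrative.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({
    'x0': [1, 2, 3, 4, 5],
    'x1': [0.01, -0.01, 0.25, -4.1, 0.],
    'y': [-1.5, 0., 3.6, 1.3, -2.]
})

x0 = data['x0'].to_numpy(dtype=float)
x1 = data['x1'].to_numpy(dtype=float)

# Standardize: subtract the mean, then divide by the standard deviation (ddof=0 assumed)
x0_std = (x0 - x0.mean()) / x0.std(ddof=0)

# Center: subtract the mean only
x1_centered = x1 - x1.mean()
```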

As part of a modeling process, you may fit a model on one dataset, then evaluate the model based on another.
When applying transformations like center and standardize, be careful when using the model to form predictions based on new data.

These are called stateful transformations, because you must use statistics like the mean or standard deviation of the original dataset when transforming a new dataset. The `patsy.build_design_matrices` function applies transformations to new data using the information saved from the original dataset.

In [7]:
new_data = pd.DataFrame({
	'x0': [6, 7, 8, 9],
	'x1': [3.1, -0.5, 0, 2.3],
	'y' : [1, 2, 3, 4]
})

new_X = patsy.build_design_matrices([X.design_info], new_data)

In [10]:
new_X

[DesignMatrix with shape (4, 3)
   Intercept  standardize(x0)  center(x1)
           1          2.12132        3.87
           1          2.82843        0.27
           1          3.53553        0.77
           1          4.24264        3.07
   Terms:
     'Intercept' (column 0)
     'standardize(x0)' (column 1)
     'center(x1)' (column 2)]
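
The numbers in `new_X` can be checked by hand: the new rows are transformed with the mean and standard deviation of the original `data`, not of `new_data`. A minimal NumPy sketch of that calculation follows. The variable names are illustrative, and `ddof=0` is assumed for the standard deviation.

```python
import numpy as np

# Statistics come from the original (training) data
train_x0 = np.array([1, 2, 3, 4, 5], dtype=float)
train_x1 = np.array([0.01, -0.01, 0.25, -4.1, 0.0])
x0_mean, x0_sd = train_x0.mean(), train_x0.std(ddof=0)
x1_mean = train_x1.mean()

# New data is transformed with those saved statistics
new_x0 = np.array([6, 7, 8, 9], dtype=float)
new_x1 = np.array([3.1, -0.5, 0.0, 2.3])
new_x0_std = (new_x0 - x0_mean) / x0_sd   # approx. 2.12132, 2.82843, 3.53553, 4.24264
new_x1_centered = new_x1 - x1_mean        # approx. 3.87, 0.27, 0.77, 3.07
```

Recomputing the mean and standard deviation from `new_data` instead would give different columns, and the fitted coefficients would no longer apply to them.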

In [12]:
# In a Patsy formula, + separates terms rather than adding values,
# so wrap the arithmetic in the special I() function to add columns by name
y, X = patsy.dmatrices('y ~ I(x0 + x1)', data)

X

DesignMatrix with shape (5, 2)
  Intercept  I(x0 + x1)
          1        1.01
          1        1.99
          1        3.25
          1       -0.10
          1        5.00
  Terms:
    'Intercept' (column 0)
    'I(x0 + x1)' (column 1)
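
In design-matrix terms, `I(x0 + x1)` produces a single column holding the elementwise sum, whereas `x0 + x1` without `I` produces two separate columns. A hand-built NumPy sketch of the two layouts (the names are illustrative, and this is not Patsy's own code):

```python
import numpy as np

x0 = np.array([1, 2, 3, 4, 5], dtype=float)
x1 = np.array([0.01, -0.01, 0.25, -4.1, 0.0])
ones = np.ones_like(x0)

# Layout for 'y ~ x0 + x1': intercept plus one column per term, shape (5, 3)
X_terms = np.column_stack([ones, x0, x1])

# Layout for 'y ~ I(x0 + x1)': intercept plus a single summed column, shape (5, 2)
X_sum = np.column_stack([ones, x0 + x1])
```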

In [14]:
data

   x0    x1    y
0   1  0.01 -1.5
1   2 -0.01  0.0
2   3  0.25  3.6
3   4 -4.10  1.3
4   5  0.00 -2.0
