## 12.1 Interfacing Between pandas and Model Code
A common workflow for model development:
- Use pandas for data loading and cleaning.
- Use a modeling library (`scikit-learn`, `statsmodels`) to build the model.
    - Feature engineering (data transformation) usually comes first. The data aggregation and GroupBy tools are often used in a feature engineering context.

The point of contact between pandas and other analysis libraries is usually NumPy arrays. To turn a DataFrame into a NumPy array, use the `to_numpy` method:
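As a minimal sketch, assuming a small all-numeric DataFrame named `data` (the values are illustrative only), the conversion might look like this:

```python
import numpy as np
import pandas as pd

# Illustrative values only; any all-numeric DataFrame behaves the same way
data = pd.DataFrame({
    'x0': [1, 2, 3, 4, 5],
    'x1': [0.01, -0.01, 0.25, -4.1, 0.0],
    'y': [-1.5, 0.0, 3.6, 1.3, -2.0]})

arr = data.to_numpy()
print(arr.shape)  # (5, 3)
print(arr.dtype)  # float64, the common type of the int and float columns
```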

To convert back to a DataFrame, as you may recall from earlier chapters, you can pass a two-dimensional ndarray with optional column names:

df2 = pd.DataFrame(data.to_numpy(), columns=['one', 'two', 'three'])

The `to_numpy` method is intended to be used when your data is homogeneous, for example all numeric types. If you have heterogeneous data, the result will be an ndarray of Python objects:

df3 = data.copy()  # deep copy by default; changes to df3 do not affect the original data
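To see the object-dtype behavior concretely, here is a small sketch. The column name `strings` and the values are assumptions made for illustration:

```python
import pandas as pd

# Illustrative all-numeric frame
data = pd.DataFrame({'x0': [1, 2, 3], 'x1': [0.01, -0.01, 0.25]})

df3 = data.copy()                 # independent copy of the data
df3['strings'] = ['a', 'b', 'c']  # a string column makes the frame heterogeneous

print(data.to_numpy().dtype)  # float64
print(df3.to_numpy().dtype)   # object
```

Because `df3` is a copy, the original `data` keeps its two numeric columns.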

For some models, you may wish to use only a subset of the columns. I recommend using `loc` indexing with `to_numpy`:
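A short sketch of that pattern follows. The list name `model_cols` and the choice of columns are hypothetical:

```python
import pandas as pd

# Illustrative data with two predictors and a response
data = pd.DataFrame({
    'x0': [1, 2, 3, 4, 5],
    'x1': [0.01, -0.01, 0.25, -4.1, 0.0],
    'y': [-1.5, 0.0, 3.6, 1.3, -2.0]})

model_cols = ['x0', 'x1']  # hypothetical subset of predictor columns

# Select all rows and only the chosen columns, then convert to an ndarray
X = data.loc[:, model_cols].to_numpy()
print(X.shape)  # (5, 2)
```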

Some libraries have native support for pandas and do some of this work for you automatically: converting to NumPy from DataFrame and attaching model parameter names to the columns of output tables or Series. In other cases, you will have to perform this "metadata management" manually.

Suppose the example dataset had a nonnumeric column:

data['category'] = pd.Categorical(['a', 'b', 'a', 'a', 'b'],
                                  categories=['a', 'b'])  # categories could be omitted here; it would be inferred

If we wanted to replace the `'category'` column with dummy variables, we would create the dummy variables, drop the `'category'` column, and then join the result:

dummies = pd.get_dummies(data.category, prefix='category',
                         dtype=float)

data_with_dummies = data.drop('category', axis=1).join(dummies)  # join is a left join on the index by default
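Put together as a self-contained sketch (the numeric columns are illustrative stand-ins for the example dataset), the three steps look like this:

```python
import pandas as pd

# Illustrative numeric columns plus one categorical column
data = pd.DataFrame({'x0': [1, 2, 3, 4, 5],
                     'y': [-1.5, 0.0, 3.6, 1.3, -2.0]})
data['category'] = pd.Categorical(['a', 'b', 'a', 'a', 'b'],
                                  categories=['a', 'b'])

# One indicator column per category level, stored as floats
dummies = pd.get_dummies(data['category'], prefix='category', dtype=float)

# Drop the original column and attach the indicators by index
data_with_dummies = data.drop('category', axis=1).join(dummies)
print(list(data_with_dummies.columns))
# ['x0', 'y', 'category_a', 'category_b']
```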

There are some nuances to fitting certain statistical models with dummy variables. It may be simpler and less error-prone to use `Patsy` when you have more than simple numeric columns.

## 12.2 Creating Model Descriptions with Patsy
`Patsy` is a Python library for describing statistical models (especially linear models) with a string-based "formula syntax," which is inspired by (but not exactly the same as) the formula syntax used by the `R` and `S` statistical programming languages. It is installed automatically when you install `statsmodels`:

`conda install statsmodels`

Patsy is well supported for specifying linear models in `statsmodels`. Patsy's formulas are a special string syntax that looks like:

`y ~ x0 + x1`

The syntax `a + b` does not mean to add `a` to `b`, but rather that these are terms in the design matrix created for the model. The `patsy.dmatrices` function takes a formula string along with a dataset (which can be a DataFrame or a dictionary of arrays) and produces design matrices for a linear model:

import patsy
y, X = patsy.dmatrices('y ~ x0 + x1', data)

These `Patsy` DesignMatrix instances are NumPy ndarrays with additional metadata:

np.asarray(y)
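The following sketch, which assumes the same kind of small illustrative `data` frame used earlier in these notes, shows what the design matrices contain:

```python
import numpy as np
import pandas as pd
import patsy

# Illustrative data; x0 and x1 are predictors, y is the response
data = pd.DataFrame({
    'x0': [1, 2, 3, 4, 5],
    'x1': [0.01, -0.01, 0.25, -4.1, 0.0],
    'y': [-1.5, 0.0, 3.6, 1.3, -2.0]})

y, X = patsy.dmatrices('y ~ x0 + x1', data)

# The metadata records the column names, including the added intercept
print(X.design_info.column_names)  # ['Intercept', 'x0', 'x1']

# Converting to a plain ndarray drops the metadata but keeps the values
X_arr = np.asarray(X)
print(X_arr.shape)  # (5, 3)
```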

You might wonder where the `Intercept` term came from. This is a convention for linear models like ordinary least squares (OLS) regression. You can suppress the intercept by adding the term `+ 0` to the model:

patsy.dmatrices('y ~ x0 + x1 + 0', data)[1]

The Patsy objects can be passed directly into algorithms like `numpy.linalg.lstsq`, which performs an ordinary least squares regression:

coef, resid, _, _ = np.linalg.lstsq(X, y, rcond=None)
# rcond: cutoff for small singular values (None selects the machine-precision default)
# resid: residual sum of squares (RSS)

The model metadata is retained in the `design_info` attribute, so you can reattach the model column names to the fitted coefficients to obtain a Series, for example:

X.design_info

X.design_info.column_names
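As a sketch of reattaching the names, assuming the illustrative `data` frame from before, the coefficients from `numpy.linalg.lstsq` can be wrapped in a Series indexed by the design matrix column names:

```python
import numpy as np
import pandas as pd
import patsy

# Illustrative data
data = pd.DataFrame({
    'x0': [1, 2, 3, 4, 5],
    'x1': [0.01, -0.01, 0.25, -4.1, 0.0],
    'y': [-1.5, 0.0, 3.6, 1.3, -2.0]})

y, X = patsy.dmatrices('y ~ x0 + x1', data)
X_arr, y_arr = np.asarray(X), np.asarray(y)

# Ordinary least squares on the plain arrays
coef, resid, _, _ = np.linalg.lstsq(X_arr, y_arr, rcond=None)

# coef has shape (3, 1); squeeze it and label it with the model column names
coef = pd.Series(coef.squeeze(), index=X.design_info.column_names)
print(coef)
```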

### Data Transformations in Patsy Formulas
You can mix Python code into your Patsy formulas; when evaluating the formula, the library will try to find the functions you use in the enclosing scope:

y, X = patsy.dmatrices('y ~ x0 + np.log(np.abs(x1) + 1)', data)

Some commonly used variable transformations include standardizing (to mean 0 and variance 1) and centering (subtracting the mean). Patsy has built-in functions for this purpose:

y, X = patsy.dmatrices('y ~ standardize(x0) + center(x1)', data)

As part of a modeling process, you may fit a model on one dataset, then evaluate the model based on another. This might be a hold-out portion or new data that is observed later. When applying transformations like `center` and `standardize`, you should be careful when using the model to form predictions based on new data. These are called **stateful transformations**, because you must use statistics like the mean or standard deviation of the original dataset when transforming a new dataset.

The `patsy.build_design_matrices` function can apply transformations to new out-of-sample data using the saved information from the original in-sample dataset:

new_X = patsy.build_design_matrices([X.design_info], new_data)
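The sketch below illustrates the stateful behavior. The `new_data` values are made up for illustration, and the check only covers `center`, which should subtract the mean of the original `x1` rather than the mean of the new one:

```python
import numpy as np
import pandas as pd
import patsy

# Illustrative in-sample data
data = pd.DataFrame({
    'x0': [1, 2, 3, 4, 5],
    'x1': [0.01, -0.01, 0.25, -4.1, 0.0],
    'y': [-1.5, 0.0, 3.6, 1.3, -2.0]})

y, X = patsy.dmatrices('y ~ standardize(x0) + center(x1)', data)

# Hypothetical out-of-sample data; only the predictor columns are needed
new_data = pd.DataFrame({
    'x0': [6, 7, 8, 9],
    'x1': [3.1, -0.5, 0.0, 2.3]})

# Returns a list with one design matrix per design_info passed in
new_X = patsy.build_design_matrices([X.design_info], new_data)
new_X_arr = np.asarray(new_X[0])

# The centered column uses the mean of the ORIGINAL x1
expected = new_data['x1'].to_numpy() - data['x1'].mean()
print(np.allclose(new_X_arr[:, 2], expected))  # True
```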

Because the plus symbol (+) in the context of Patsy formulas does not mean addition, when you want to add columns from a dataset by name, you must wrap them in the special `I` function:

y, X = patsy.dmatrices('y ~ I(x0 + x1)', data)

Patsy has several other built-in transforms in the `patsy.builtins` module. See the online documentation for more.

### Categorical Data and Patsy
Nonnumeric data can be transformed for a model design matrix in many different ways. When you use nonnumeric terms in a Patsy formula, they are converted to **dummy variables** by default. If there is an intercept, one of the levels will be left out to avoid collinearity:

y, X = patsy.dmatrices('v2 ~ key1', data)

If you omit the intercept from the model, then columns for each category value will be included in the model design matrix:

y, X = patsy.dmatrices('v2 ~ key1 + 0', data)

Numeric columns can be interpreted as categorical with the `C` function:

y, X = patsy.dmatrices('v2 ~ C(key2)', data)

When you're using multiple categorical terms in a model, things can be more complicated, as you can include interaction terms of the form `key1:key2`, which can be used, for example, in analysis of variance (ANOVA) models:

data['key2'] = data['key2'].map({0: 'zero', 1: 'one'})

y, X = patsy.dmatrices('v2 ~ key1 + key2 + key1:key2', data)
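The encodings described in this subsection can be inspected through the generated column names. This sketch assumes a small frame with columns `key1`, `key2`, and `v2`, with values chosen only for illustration:

```python
import pandas as pd
import patsy

# Illustrative data: key1 is a string category, key2 is numeric 0/1
data = pd.DataFrame({
    'key1': ['a', 'a', 'b', 'b', 'a', 'b', 'a', 'b'],
    'key2': [0, 1, 0, 1, 0, 1, 0, 0],
    'v2': [-1, 0, 2.5, -0.5, 4.0, -1.2, 0.2, -1.7]})

# With an intercept, the first level ('a') is dropped
_, X_int = patsy.dmatrices('v2 ~ key1', data)
print(X_int.design_info.column_names)    # ['Intercept', 'key1[T.b]']

# Without an intercept, every level gets its own column
_, X_noint = patsy.dmatrices('v2 ~ key1 + 0', data)
print(X_noint.design_info.column_names)  # ['key1[a]', 'key1[b]']

# C() forces a numeric column to be treated as categorical
_, X_c = patsy.dmatrices('v2 ~ C(key2)', data)
print(X_c.design_info.column_names)      # ['Intercept', 'C(key2)[T.1]']

# Main effects plus an interaction term yield four columns in total
data['key2'] = data['key2'].map({0: 'zero', 1: 'one'})
_, X_inter = patsy.dmatrices('v2 ~ key1 + key2 + key1:key2', data)
print(len(X_inter.design_info.column_names))  # 4
```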

## 12.3 Introduction to statsmodels
`statsmodels` is a Python library for fitting many kinds of statistical models, performing statistical tests, and data exploration and visualization. statsmodels contains more "classical" frequentist statistical methods, while Bayesian methods and machine learning models are found in other libraries.

Some kinds of models found in `statsmodels` include:

- Linear models, generalized linear models, and robust linear models
- Linear mixed effects models
- Analysis of variance (ANOVA) methods
- Time series processes and state space models
- Generalized method of moments

### Estimating Linear Models
There are several kinds of linear regression models in statsmodels, from the more basic (e.g., ordinary least squares) to more complex (e.g., iteratively reweighted least squares).

Linear models in statsmodels have two different main interfaces: array based and formula based. These are accessed through these API module imports:
```
import statsmodels.api as sm
import statsmodels.formula.api as smf
```

To show how to use these, we generate a linear model from some random data. Run the following code in a Jupyter cell:
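One way to set up such data is sketched here. The seed, sample size, variances, coefficients in `beta`, and the helper `dnorm` are all assumptions chosen for illustration:

```python
import numpy as np

rng = np.random.default_rng(seed=12345)  # assumed seed, for reproducibility

def dnorm(mean, variance, size=1):
    """Hypothetical helper: draw normal samples with a given mean and variance."""
    return mean + np.sqrt(variance) * rng.standard_normal(size)

N = 100
# Three predictor columns with different variances
X = np.c_[dnorm(0, 0.4, size=N),
          dnorm(0, 0.6, size=N),
          dnorm(0, 0.2, size=N)]
eps = dnorm(0, 0.1, size=N)  # noise term
beta = [0.1, 0.3, 0.5]       # assumed "true" coefficients

# Response is a linear combination of the predictors plus noise (no intercept)
y = np.dot(X, beta) + eps

print(X.shape, y.shape)  # (100, 3) (100,)
```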

A linear model is generally fitted with an intercept term, as we saw before with Patsy. The `sm.add_constant` function can add an intercept column to an existing matrix:

X_model = sm.add_constant(X)

model = sm.OLS(y, X)

results = model.fit()
results.params

print(results.summary())

The parameter names here have been given the generic names x1, x2, and so on. Suppose instead that all of the model parameters are in a DataFrame:

sm.OLS(y, X_model).fit().params

sm.OLS(y, X_model).fit().summary()

results.tvalues # t-values

Observe how statsmodels has returned results as Series with the DataFrame column names attached. We also do not need to use add_constant when using formulas and pandas objects.

Given new out-of-sample data, you can compute predicted values given the estimated model parameters:

results.predict(data[:5])
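A sketch of the formula-based interface with a DataFrame follows. The column names `col0`, `col1`, `col2` and the simulated data are assumptions for illustration:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)  # assumed seed
X = rng.standard_normal((100, 3))
y = X @ np.array([0.1, 0.3, 0.5]) + 0.3 * rng.standard_normal(100)

# Put predictors and response into one DataFrame with named columns
data = pd.DataFrame(X, columns=['col0', 'col1', 'col2'])
data['y'] = y

# The formula adds the intercept automatically, so add_constant is not needed
results = smf.ols('y ~ col0 + col1 + col2', data=data).fit()
print(results.params)   # Series indexed by Intercept, col0, col1, col2
print(results.tvalues)

# Predictions for (here, in-sample) rows passed as a DataFrame
preds = results.predict(data[:5])
print(preds)
```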

### Estimating Time Series Processes
Another class of models in statsmodels is for time series analysis. Among these are autoregressive processes, Kalman filtering and other state space models, and multivariate autoregressive models.

Let's simulate some time series data with an autoregressive structure and noise. 


This data has an `AR(2)` structure (two lags) with parameters `0.8` and `-0.4`. When you fit an AR model, you may not know the number of lagged terms to include, so you can fit the model with some larger number of lags:

from statsmodels.tsa.ar_model import AutoReg
MAXLAGS = 5
model = AutoReg(values, MAXLAGS)
results = model.fit()

The estimated parameters in the results have the intercept first, and the estimates for the first two lags next:

results.params

## 12.4 Introduction to scikit-learn
`scikit-learn` is one of the most widely used and trusted general-purpose Python machine learning toolkits. It contains a broad selection of standard **supervised** and **unsupervised** machine learning methods, with tools for model selection and evaluation, data transformation, data loading, and model persistence. These models can be used for classification, clustering, prediction, and other common tasks. You can install scikit-learn from conda like so:

`conda install scikit-learn`

As an example, we use a now-classic dataset from a Kaggle competition about passenger survival rates on the Titanic in 1912. We load the training and test datasets using pandas (as `train` and `test`) and then check each column for missing values:

train.isna().sum()


In statistics and machine learning examples like this one, a typical task is to predict whether a passenger would survive based on features in the data. A model is fitted on a training dataset and then evaluated on an out-of-sample testing dataset.

We use the median of the `Age` column in the training dataset to fill the nulls in both tables:

impute_value = train['Age'].median()
train['Age'] = train['Age'].fillna(impute_value)
test['Age'] = test['Age'].fillna(impute_value)

Now we need to specify our models. I add a column IsFemale as an encoded version of the 'Sex' column:

train['IsFemale'] = (train['Sex'] == 'female').astype(int)
test['IsFemale'] = (test['Sex'] == 'female').astype(int)

Then we decide on some model variables and create NumPy arrays:

predictors = ['Pclass', 'IsFemale', 'Age']

X_train = train[predictors].to_numpy()
X_test = test[predictors].to_numpy()
y_train = train['Survived'].to_numpy()

from sklearn.linear_model import LogisticRegression

model = LogisticRegression()

model.fit(X_train, y_train)

y_predict = model.predict(X_test)

If you had the true values for the test dataset, you could compute an accuracy percentage or some other error metric:

`(y_true == y_predict).mean()`
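Since the Titanic files are not bundled with these notes, the sketch below runs the same steps on randomly generated stand-in data. The data, the split sizes, and the way `Survived` is simulated are all assumptions, so the accuracy it prints says nothing about the real dataset:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 200

# Synthetic stand-in for the Titanic columns used above (NOT the real data)
df = pd.DataFrame({
    'Pclass': rng.integers(1, 4, size=n),
    'Sex': rng.choice(['female', 'male'], size=n),
    'Age': rng.normal(30, 12, size=n).clip(1, 80)})

# Simulated outcome that loosely depends on sex and class
logit = 1.5 * (df['Sex'] == 'female') - 0.6 * (df['Pclass'] - 2)
df['Survived'] = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)

# Blank out some ages so there is something to impute
df.loc[rng.choice(n, size=30, replace=False), 'Age'] = np.nan

train, test = df.iloc[:150].copy(), df.iloc[150:].copy()

# Impute with the TRAINING median in both tables, then encode Sex
impute_value = train['Age'].median()
for part in (train, test):
    part['Age'] = part['Age'].fillna(impute_value)
    part['IsFemale'] = (part['Sex'] == 'female').astype(int)

predictors = ['Pclass', 'IsFemale', 'Age']
X_train = train[predictors].to_numpy()
X_test = test[predictors].to_numpy()
y_train = train['Survived'].to_numpy()
y_true = test['Survived'].to_numpy()  # known here only because the data is simulated

model = LogisticRegression()
model.fit(X_train, y_train)
y_predict = model.predict(X_test)

accuracy = (y_true == y_predict).mean()
print(accuracy)
```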

In practice, there are often many additional layers of complexity in model training. Many models have parameters that can be tuned, and there are techniques such as cross-validation that can be used for parameter tuning to avoid overfitting to the training data. This can often yield better predictive performance or robustness on new data.

Cross-validation works by splitting the training data to simulate out-of-sample prediction. Based on a model accuracy score like mean squared error, you can perform a grid search on model parameters. Some models, like logistic regression, have estimator classes with built-in cross-validation. For example, the LogisticRegressionCV class can be used with a parameter indicating how fine-grained of a grid search to do on the model regularization parameter `C`:

from sklearn.linear_model import LogisticRegressionCV
# Cs=10 searches 10 values of C on a log scale between 1e-4 and 1e4;
# C is the inverse of the regularization strength
model_cv = LogisticRegressionCV(Cs=10)
model_cv.fit(X_train, y_train)

To do cross-validation by hand, you can use the `cross_val_score` helper function, which handles the data splitting process. For example, to cross-validate our model with four nonoverlapping splits of the training data, we can do:

from sklearn.model_selection import cross_val_score
model = LogisticRegression(C=10)
scores = cross_val_score(model, X_train, y_train, cv=4)
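Both approaches can be tried on a generated classification problem. The use of `make_classification` and its settings are assumptions for illustration and are not the Titanic data:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.model_selection import cross_val_score

# Synthetic binary classification data with three informative features
X, y = make_classification(n_samples=200, n_features=3, n_informative=3,
                           n_redundant=0, random_state=0)

# Built-in cross-validation over a grid of 10 candidate C values
model_cv = LogisticRegressionCV(Cs=10)
model_cv.fit(X, y)
print(model_cv.Cs_)  # the grid of C values that was searched

# Cross-validation by hand: one score per nonoverlapping split
model = LogisticRegression(C=10)
scores = cross_val_score(model, X, y, cv=4)
print(scores)
```

For a classifier, `cross_val_score` uses the estimator's default scoring, which is mean accuracy. A different metric can be selected with the `scoring` argument.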