# 30-features-models-reports: Create new features to be used in models and report results
The purpose of this notebook is threefold:
1. [Feature engineering](#Feature-investigation-and-creation) - generate new features
2. [Modeling](#Modeling) - create and evaluate models of the data  
    a. [Generic machine learning models](#Basic-machine-learning-models)  
    b. [H2O modeling + hyperparameter optimization](#Machine-learning-via-h2o)  
    c. [Deep learning](#Modeling-via-deep-learning)
3. [Report](#Reporting) - convey the modeling results

**A note on variable encodings**
- **scikit-learn**: If using scikit-learn, avoid explicitly creating a new dataset with your one-hot or binary-encoded features. It is better to include the encoding as a step of a larger pipeline that also contains the model and prediction strategy.
- **h2o**: If using h2o, it is not *strictly* necessary to define the encodings explicitly. H2O can do this [under the hood](http://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/algo-params/categorical_encoding.html) for most of its algorithms in an optimized way.
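
As a minimal sketch of the scikit-learn advice above, the encoding can live inside a `Pipeline` instead of a separately saved dataset. The toy data frame and its column names (`color`, `size`, `label`) are hypothetical placeholders, and `LogisticRegression` is only a stand-in for whichever model is chosen later.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data; column names are placeholders, not project columns
toy = pd.DataFrame({
    'color': ['red', 'blue', 'green', 'blue', 'red', 'green'],
    'size': [1.0, 2.5, 3.0, 2.0, 1.5, 3.5],
    'label': [0, 1, 1, 1, 0, 1],
})

# Encoding and scaling are pipeline steps, so they are fit only on the
# data passed to fit() and reapplied unchanged at prediction time
preprocess = ColumnTransformer([
    ('cat', OneHotEncoder(handle_unknown='ignore'), ['color']),
    ('num', StandardScaler(), ['size']),
])
clf = Pipeline([('preprocess', preprocess), ('model', LogisticRegression())])

clf.fit(toy[['color', 'size']], toy['label'])
preds = clf.predict(toy[['color', 'size']])
```

With `handle_unknown='ignore'`, a category that was not seen during fitting is encoded as all zeros rather than raising an error at prediction time.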

#### Common helpful packages

In [None]:
#Data analysis and processing
import pandas as pd
import numpy as np

#plotting
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
#Constants/globals
cleaned_data_filename = ''

# Feature investigation and creation

#### Helpful packages and preliminaries

In [None]:
#Assertions and testing
import great_expectations as ge

In [None]:
#Read in data and view
data = pd.read_csv(cleaned_data_filename)
data.head()

In [None]:
#Split data into train and test sets
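#Assumption: 70/30 split and seed 1234, matching the constants used later; adjust as needed
from sklearn.model_selection import train_test_split
train_data, test_data = train_test_split(data, train_size=0.7, random_state=1234)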

# Modeling

## Basic machine learning models

#### Helpful packages and preliminaries

In [None]:
#modeling preprocessing
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, roc_auc_score, classification_report

from sklearn.linear_model import LogisticRegressionCV
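
The imports above could be combined roughly as follows. This is only a sketch: the data is synthetic (generated with `make_classification`), a binary target is assumed, and the split ratio, seed, `cv=5`, and `max_iter=1000` are illustrative choices rather than settings taken from this project.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegressionCV
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the engineered features (assumption: binary target)
X, y = make_classification(n_samples=300, n_features=8, random_state=1234)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.7, random_state=1234, stratify=y)

# Logistic regression with built-in cross-validation over regularization strength
model = LogisticRegressionCV(cv=5, max_iter=1000)
model.fit(X_train, y_train)

# Hard predictions for the confusion matrix, probabilities for ROC AUC
y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test)[:, 1]

cm = confusion_matrix(y_test, y_pred)
auc = roc_auc_score(y_test, y_prob)
print(cm)
print(classification_report(y_test, y_pred))
print(f'ROC AUC: {auc:.3f}')
```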

## Machine learning via h2o

#### Helpful packages and preliminaries

In [None]:
#modeling preprocessing
import h2o
from h2o.automl import H2OAutoML

In [None]:
# Constants/globals
full_features_filename_train = ''
full_features_filename_test = ''

training_ratio = 0.7
max_models = None
repro_seed = 1234
max_runtime_mins = 5

### model using AutoML

In [None]:
#init h2o
h2o.init()

In [None]:
#Import data from file (alternative to converting pandas data frames)
data = h2o.import_file(full_features_filename_train)
test_data = h2o.import_file(full_features_filename_test)

In [None]:
#Import data from pandas data frames (alternative to importing from file)
data = h2o.H2OFrame(python_obj=train_data)
test_data = h2o.H2OFrame(python_obj=test_data)

In [None]:
#set columns as factors if necessary
data['col'] = data['col'].asfactor()

In [None]:
#split data into desired portions and identify variables of interest
data_train, data_val = data.split_frame(ratios=[training_ratio], seed=repro_seed)

predictors = ['col1', 'col2']
response = 'coly'

In [None]:
#identify model and train
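mls = H2OAutoML(max_models=max_models, max_runtime_secs=60 * max_runtime_mins, seed=repro_seed)
#Train on the training split; rank leaderboard models on the held-out validation split
mls.train(x=predictors, y=response, training_frame=data_train, leaderboard_frame=data_val)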

In [None]:
#View leaderboard
mls.leaderboard

### h2o performance evaluation

In [None]:
#See full performance of model on original training data (H2OFrame, not the pandas frame)
mls.leader.model_performance(data_train)

In [None]:
#Predict on the test set (AutoML uses the leader model automatically)
res = mls.predict(test_data)

In [None]:
#See full performance of model on test data
mls.leader.model_performance(test_data)

In [None]:
#Save model

## Modeling via deep learning

#### Helpful packages and preliminaries

In [None]:
#fast dev dl packages
import fastai as fa
from fastai.text import *
from fastai.callbacks import *

In [None]:
# Constants/globals
bs = 128
path = %pwd
model_save_dir = Path(path)/'models'
db_save_dir = Path(path)/'databunches'

In [None]:
#Hot fix for fast.ai v1.0.57, which does not create the databunch directory itself
import os
os.makedirs(db_save_dir, exist_ok=True)

In [None]:
#Load constants

In [None]:
#Read in as databunch

In [None]:
#Save databunch

In [None]:
# Model type and parameters

In [None]:
# Train model

In [None]:
# Test model

In [None]:
# Save model

### dl performance evaluation

# Reporting

#### Helpful packages and preliminaries

In [None]:
#globals/constants
skl_model_filename = ''
h2o_model_filename = ''
dl_model_filename = ''