# Capstone: Text Factorizing with NLP
## Thomas Ludlow

# 08 - Model Optimization

This notebook contains the optimization loops for the Corpus EDA, the Gensim LDA topic model, and the Keras FFRNN model. Parameters and hyperparameters to optimize include:

- **Corpus EDA**
  - Number of works
  - Class balance

- **Gensim LDA**
  - Number of topics
  - Update frequency
  - Chunksize
  - Number of passes
  - Alpha
  - _Metrics_
    - Log Perplexity (Minimize)
    - Coherence (Maximize)

- **Keras FFRNN**
  - Network topology
    - Number of hidden layers
    - Number of nodes per layer
  - Dropout percentage
  - Early stopping patience
  - Number of epochs
  - _Metrics_
    - Model Loss (Minimize)
    - Model Accuracy (Maximize)
    - Bumper Sticker Accuracy (Maximize)

**Libraries**

In [3]:
# Python Data Science
import re
import time
import numpy as np
import pandas as pd
from tqdm import tqdm

# Natural Language Processing
import spacy
import gensim
import pyLDAvis.gensim
from gensim.corpora import Dictionary
from gensim.models import TfidfModel, ldamodel, ldamulticore, CoherenceModel

# Modeling Prep
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
import joblib  # sklearn.externals.joblib was removed in scikit-learn 0.23

# Neural Net
import keras
import tensorflow as tf
from keras.models import Sequential
from keras.layers import Dense, Activation, Dropout
from keras.callbacks import EarlyStopping

# Plotting
import matplotlib.pyplot as plt
%matplotlib inline

# Override deprecation warnings
import warnings
warnings.filterwarnings("ignore",category=DeprecationWarning)

## Gensim LDA Optimizer

- Number of topics `num_topics=`
  - `[8,16,40,100,400]`
- Update frequency `update_every=`
  - `[1,2,4]`
- Chunksize `chunksize=`
  - half, quarter, eighth, and sixteenth of the corpus (`num_sentences`)
- Number of passes `passes=`
  - `[1,2,4]`
- Alpha `alpha=`
  - `'symmetric'`
  - `'asymmetric'`
- _Metrics_
  - Log Perplexity (Minimize)
  - Coherence (Maximize)
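
A note on reading the perplexity metric: Gensim's `LdaModel.log_perplexity` returns a per-word likelihood bound (a negative number) rather than perplexity itself, and perplexity is `2 ** (-bound)`. Minimizing perplexity therefore corresponds to a bound closer to zero. The conversion can be sketched as follows; the helper name `perplexity_from_bound` and the sample bound values are illustrative assumptions, not part of the original notebook.

```python
def perplexity_from_bound(bound):
    """Convert a per-word likelihood bound (as returned by Gensim's
    LdaModel.log_perplexity) into perplexity: 2 ** (-bound)."""
    return 2 ** (-bound)

# Hypothetical bounds for two candidate models (not real results)
bound_a = -8.0
bound_b = -7.5

perp_a = perplexity_from_bound(bound_a)
perp_b = perplexity_from_bound(bound_b)

# The model with the bound closer to zero has the lower perplexity
better = 'b' if perp_b < perp_a else 'a'
```

When comparing runs, ranking by the raw bound (higher is better) and ranking by perplexity (lower is better) give the same ordering.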

In [4]:
num_sentences = 16000

# Integer division keeps chunk sizes as whole document counts
half = num_sentences // 2
quarter = num_sentences // 4
eighth = num_sentences // 8
sixteenth = num_sentences // 16

lda_params = {
    'num_topics':[8,16,40,100,400],
    'update_every':[1,2,4],
    'chunksize':[half,quarter,eighth,sixteenth],
    'passes':[1,2,4],
    'alpha':['symmetric','asymmetric']
}
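
The grid above can be expanded into individual parameter combinations before looping over models. The following is a minimal sketch using only the standard library; the helper name `iter_param_grid` is an assumption for illustration, and the grid is repeated here so the block runs on its own.

```python
from itertools import product

num_sentences = 16000

# Same grid as above, repeated so this sketch is self-contained
lda_params = {
    'num_topics': [8, 16, 40, 100, 400],
    'update_every': [1, 2, 4],
    'chunksize': [num_sentences // d for d in (2, 4, 8, 16)],
    'passes': [1, 2, 4],
    'alpha': ['symmetric', 'asymmetric'],
}

def iter_param_grid(grid):
    """Yield one dict per combination of values in the grid (hypothetical helper)."""
    keys = list(grid)
    for values in product(*(grid[k] for k in keys)):
        yield dict(zip(keys, values))

combos = list(iter_param_grid(lda_params))
print(len(combos))  # number of candidate LDA configurations
```

Each resulting dict uses keyword names that `gensim.models.LdaModel` accepts, so it could be unpacked as `LdaModel(corpus=..., id2word=..., **combo)`. With this many combinations, and 400-topic models among them, a full sweep may be slow, so sampling a subset of `combos` is a reasonable alternative.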