# TensorFlow 2.0 Algos

The main goal here is to cover some core machine learning algos, applying each one to a problem and a dataset.

## What Algos?

The main ones right now for this notebook are:

* Linear Regression
* Classification
* Clustering
* Hidden Markov Models


### Google Colab Tip

If using Google Colab, run

`%tensorflow_version 2.x`

This line is only required inside a notebook. Restart the runtime if a different version is selected.


## Linear Regression

Linear regression is a very basic form of machine learning used to predict numeric values.  Thanks to linear algebra, the best-fit line is cheap for computers to compute.
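
To make the idea concrete, here is a minimal sketch of ordinary least squares for a single feature in plain Python. It is separate from the Titanic workflow below; the data points and the `fit_line` helper are made up for illustration.

```python
# Ordinary least squares for one feature, in plain Python.
# The data points are invented for illustration.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.0, 5.0, 7.0, 9.0, 11.0]   # exactly y = 2x + 1


def fit_line(xs, ys):
    """Return (slope, intercept) minimizing the squared error."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope = covariance(x, y) / variance(x)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x
    return slope, intercept


slope, intercept = fit_line(xs, ys)
print(slope, intercept)           # 2.0 1.0
print(slope * 6.0 + intercept)    # prediction for an unseen x = 6
```

With more than one feature the same problem is usually solved with matrix operations, which is where the linear algebra comes in.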

This section uses the Titanic dataset and follows the documentation at <https://www.tensorflow.org/tutorials/estimator/linear>

In [None]:
import os
import sys

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt 
from IPython.display import clear_output
from six.moves import urllib

## Loading the Dataset

The dataset will be used to predict passenger survival from features such as gender, age, and class.

In [None]:
import tensorflow.compat.v2.feature_column as fc

import tensorflow as tf
print(tf.__version__)  # make sure the version is 2.x

In [None]:
# load up the dataset
dftrain = pd.read_csv('https://storage.googleapis.com/tf-datasets/titanic/train.csv')
dfeval = pd.read_csv('https://storage.googleapis.com/tf-datasets/titanic/eval.csv')

In [None]:
y_train = dftrain.pop('survived')

In [None]:
y_eval = dfeval.pop('survived')

In [None]:
# explore with pandas
dftrain.describe()

In [None]:
# Shape of the datasets
dftrain.shape[0], dfeval.shape[0]

In [None]:
# histogram
dftrain.age.hist(bins=20)

In [None]:
dftrain.sex.value_counts().plot(kind="barh")

In [None]:
dftrain['class'].value_counts().plot(kind="barh")

In [None]:
pd.concat([dftrain, y_train], axis=1).groupby('sex').survived.mean().plot(kind='barh')

## Linear Regression and Feature Engineering

You need to set up numeric and categorical columns differently for machine learning.  Categorical values need to be converted into some type of integer encoding using `tf.feature_column.categorical_column_with_vocabulary_list()`.

Numeric columns follow the same idea, but with `tf.feature_column.numeric_column()`.

Read more here <https://www.tensorflow.org/api_docs/python/tf/feature_column>
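
As a rough picture of what a vocabulary-based integer encoding does, here is a plain-Python sketch. It is an analogy only, not TensorFlow's implementation; the `encode` helper, the sample values, and the choice of `-1` for unknown values are assumptions made for this example.

```python
# Plain-Python sketch of vocabulary-based integer encoding.
# The vocabulary and the sample values are illustrative.
vocabulary = ["male", "female"]

# Each category maps to its position in the vocabulary list.
lookup = {value: index for index, value in enumerate(vocabulary)}


def encode(values, lookup, default=-1):
    """Map each value to its vocabulary index; unknown values get `default`."""
    return [lookup.get(v, default) for v in values]


encoded = encode(["female", "male", "male", "unknown"], lookup)
print(encoded)  # [1, 0, 0, -1]
```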

In [None]:
# Gather a list of the categorical columns
CATEGORICAL_COLUMNS = ['sex', 'n_siblings_spouses', 'parch', 'class', 'deck', 'embark_town', 'alone']

# the same for numeric
NUMERIC_COLUMNS = ['age', 'fare']

feature_columns = []
for feature_name in CATEGORICAL_COLUMNS:
    vocabulary = dftrain[feature_name].unique() # get all the unique values in the column
    feature_columns.append(tf.feature_column.categorical_column_with_vocabulary_list(feature_name, vocabulary_list=vocabulary))

for feature_name in NUMERIC_COLUMNS:
    feature_columns.append(tf.feature_column.numeric_column(feature_name, dtype=tf.float32))

print(feature_columns)

### Creating the TF dataset

When using the TF model, the data we pass comes in as a `tf.data.Dataset` object.  Therefore, we have to convert the pandas df into that object.


In [None]:
def make_input_fn(data_df, label_df, num_epochs=10, shuffle=True, batch_size=32):
    def input_function():
        ds = tf.data.Dataset.from_tensor_slices((dict(data_df), label_df)) # create the tf object with the data and the labels
        if shuffle:
            ds = ds.shuffle(1000) # random order
        ds = ds.batch(batch_size).repeat(num_epochs) 
        return ds
    return input_function

train_input_fn = make_input_fn(dftrain, y_train)
eval_input_fn = make_input_fn(dfeval, y_eval, num_epochs=1, shuffle=False)
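
The shuffle, batch, and repeat steps in the input function can be pictured with a plain-Python generator. This is a loose analogy under simplifying assumptions (the whole dataset is reshuffled at the start of every epoch), not how `tf.data` works internally; the `batches` helper and its `seed` parameter are hypothetical.

```python
import random


def batches(rows, batch_size=32, num_epochs=10, shuffle=True, seed=0):
    """Yield lists of rows, one batch at a time, for num_epochs passes."""
    rng = random.Random(seed)
    rows = list(rows)
    for _ in range(num_epochs):
        epoch_rows = list(rows)
        if shuffle:
            rng.shuffle(epoch_rows)   # new random order for each pass
        for start in range(0, len(epoch_rows), batch_size):
            yield epoch_rows[start:start + batch_size]


# 70 rows in batches of 32, repeated for 2 epochs, no shuffling
sizes = [len(b) for b in batches(range(70), batch_size=32, num_epochs=2, shuffle=False)]
print(sizes)  # [32, 32, 6, 32, 32, 6]
```

The last batch of each pass is smaller because 70 is not a multiple of 32.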

### Creating the linear regression model

In [None]:
linear_est = tf.estimator.LinearClassifier(feature_columns=feature_columns)

In [None]:
linear_est.train(train_input_fn)
result = linear_est.evaluate(eval_input_fn)

clear_output()
print(result)

### Creating some predictions

Use `.predict()`.  This method returns a generator of dicts, one prediction for each of the entries in the dataset; the cell below wraps it in `list()`.

In [None]:
pred_dicts = list(linear_est.predict(eval_input_fn))
probs = pd.Series([pred['probabilities'][1] for pred in pred_dicts])

probs.plot(kind='hist', bins=20, title='predicted probabilities')
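
Each prediction dict carries a `probabilities` entry, which the cell above indexes at position 1. A minimal sketch of turning such probabilities into class labels and an accuracy figure follows. The dicts, the labels, and the `to_labels` helper are invented for illustration and are not real model output.

```python
# Hypothetical prediction dicts shaped like the ones used above;
# the numbers are invented, not real model output.
fake_preds = [
    {"probabilities": [0.9, 0.1]},
    {"probabilities": [0.3, 0.7]},
    {"probabilities": [0.6, 0.4]},
    {"probabilities": [0.2, 0.8]},
]
fake_labels = [0, 1, 1, 1]  # invented ground truth


def to_labels(preds, threshold=0.5):
    """Predict class 1 when its probability reaches the threshold."""
    return [1 if p["probabilities"][1] >= threshold else 0 for p in preds]


predicted = to_labels(fake_preds)
accuracy = sum(p == y for p, y in zip(predicted, fake_labels)) / len(fake_labels)
print(predicted)  # [0, 1, 0, 1]
print(accuracy)   # 0.75
```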

## Classification

Predict different labels of a dataset.  This uses the Iris dataset.

Using <https://www.tensorflow.org/tutorials/estimator/premade>


In [None]:
from __future__ import absolute_import, division, print_function, unicode_literals

### The Dataset

There are 3 different classes:
* Setosa
* Versicolor
* Virginica

Each sample has 4 feature columns: sepal length, sepal width, petal length, and petal width.

In [None]:
# Define some constants to help us later on: the column names and the species labels
CSV_COLUMN_NAMES = ['SepalLength', 'SepalWidth', 'PetalLength', 'PetalWidth', 'Species']
SPECIES = ['Setosa', 'Versicolor', 'Virginica']

In [None]:
train_path = tf.keras.utils.get_file(
    "iris_training.csv", "https://storage.googleapis.com/download.tensorflow.org/data/iris_training.csv")
test_path = tf.keras.utils.get_file(
    "iris_test.csv", "https://storage.googleapis.com/download.tensorflow.org/data/iris_test.csv")

train = pd.read_csv(train_path, names=CSV_COLUMN_NAMES, header=0) # use the column names from before
test = pd.read_csv(test_path, names=CSV_COLUMN_NAMES, header=0)

In [None]:
train.head()

In [None]:
# pop off the column and use as the label
train_y = train.pop("Species")
test_y = test.pop("Species")

### Input function

Just like with the linear model above, you have to create an input function.

In [None]:
def input_fn(features, labels, training=True, batch_size=256):
    dataset = tf.data.Dataset.from_tensor_slices((dict(features), labels))

    if training:
        dataset = dataset.shuffle(1000).repeat()
    return dataset.batch(batch_size)

In [None]:
# Doing some feature column magic

feature_columns = []
for key in train.keys():
    feature_columns.append(tf.feature_column.numeric_column(key=key))
print(feature_columns)

### Building the classifier model

There are A LOT of different classifier models.  Here are the two easiest ones:
* `DNNClassifier`
* `LinearClassifier`

In [None]:
# using the DNN with 2 hidden layers of 30 and 10 nodes respectively
# the layer sizes are picked arbitrarily

classifer = tf.estimator.DNNClassifier(
    feature_columns=feature_columns,
    hidden_units=[30,10],
    n_classes=3
)
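
To get a feel for the size of this network, the sketch below counts trainable parameters for a stack of fully connected layers: 4 input features, hidden layers of 30 and 10 nodes, and 3 output classes. It assumes plain dense layers with one bias per node; `dense_param_count` is a hypothetical helper, not a TensorFlow function.

```python
def dense_param_count(layer_sizes):
    """Weights plus biases for a stack of fully connected layers."""
    total = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        total += n_in * n_out + n_out   # weight matrix + bias vector
    return total


# 4 input features -> 30 -> 10 hidden nodes -> 3 output classes
print(dense_param_count([4, 30, 10, 3]))  # 493
```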

In [None]:
### Using Lambda as the input function

classifer.train(
    input_fn=lambda: input_fn(train, train_y, training=True),
    steps=5000
)

In [None]:
### The model is bad... but let's evaluate it!

eval_result = classifer.evaluate(
    input_fn=lambda: input_fn(test,test_y, training=False)
)
print('\nTest set accuracy : {accuracy:0.3f}\n'.format(**eval_result))

## Hidden Markov Model

"The Hidden Markov Model is a finite set of states, each of which is associated with a (generally multidimensional) probability distribution []. Transitions among the states are governed by a set of probabilities called transition probabilities." (http://jedlik.phy.bme.hu/~gerjanos/HMM/node4.html)

Using TF to work with probabilities to predict future events or states.  

We're gonna predict the weather!
<https://www.tensorflow.org/probability/api_docs/python/tfp/distributions/HiddenMarkovModel>

In [None]:
import tensorflow_probability as tfp 

In [None]:
tfd = tfp.distributions

# Creating a simple weather model

# Represents a cold day with 0 and a hot day with 1
# The first day of a sequence has a 0.8 chance of being cold.
# The model using categorical distribution:

initial_distribution = tfd.Categorical(probs=[0.8,0.2])

# A cold day has a 30% chance of being followed by a hot day
# and a hot day has a 20% chance of being followed by a cold day
# This is the simple model of that statement

transition_distribution = tfd.Categorical(probs=[[0.7, 0.3],
                                                [0.2, 0.8]])

# Additionally, on each day the temperature is normally distributed with
# mean 0 and std dev 5 on a cold day, and mean 15 and std dev 10 on a hot day
# Modeled like

observation_distribution = tfd.Normal(loc=[0.,15.], scale=[5.,10.])

# Combine these distributions into a single week-long model

model = tfd.HiddenMarkovModel(
    initial_distribution=initial_distribution,
    transition_distribution=transition_distribution,
    observation_distribution=observation_distribution,
    num_steps=7
)

model.mean()

# model.log_prob(tf.zeros(shape=[7]))
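
The expected temperature for each day can also be worked out by hand, which is a useful sanity check on the model above. The plain-Python sketch below pushes the initial state distribution through the transition matrix one day at a time and takes the weighted average of the per-state means. The `expected_temperatures` helper is hypothetical, and its result should agree with `model.mean()` only up to floating-point rounding.

```python
initial = [0.8, 0.2]                 # P(cold), P(hot) on the first day
transition = [[0.7, 0.3],            # from cold: stay cold, turn hot
              [0.2, 0.8]]            # from hot: turn cold, stay hot
state_means = [0.0, 15.0]            # mean temperature per state


def expected_temperatures(initial, transition, state_means, num_steps):
    """Expected observation for each day of the sequence."""
    probs = list(initial)
    means = []
    for _ in range(num_steps):
        means.append(sum(p * m for p, m in zip(probs, state_means)))
        # Advance the state distribution by one day (row vector times matrix)
        probs = [
            sum(probs[i] * transition[i][j] for i in range(len(probs)))
            for j in range(len(probs))
        ]
    return means


means = expected_temperatures(initial, transition, state_means, 7)
print([round(m, 5) for m in means])
# [3.0, 6.0, 7.5, 8.25, 8.625, 8.8125, 8.90625]
```

The expected temperature climbs over the week because the chain drifts from the mostly-cold starting distribution toward a mix with more hot days.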