<a href="https://colab.research.google.com/github/rjsaito/Data-Science-Essentials/blob/master/Machine_Learning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Linear Algebra

## Eigenvalues and Eigenvectors



$$
A v = \lambda v
$$


- eigenvalue
- eigenvector
- invertible matrix
- singularity
- singular values (square roots of the eigenvalues of $A^T A$)
- rank
- linear independence
- eigenvalue decomposition (matrix represented in terms of its eigenvalues and eigenvectors)
- singular value decomposition (complexity of $m^2 n + n^3$, or $O(mn^2)$, for an $m \times n$ matrix)
- matrix factorization ($(nm)^{O(2^r r^2)}$ time for an exact solution)
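
As a quick numerical check of the terms listed above, the sketch below uses NumPy on two small hand-picked matrices (`A` and `B` are illustrative assumptions, not data from this notebook). It verifies $A v = \lambda v$, rebuilds `A` from its eigenvalue decomposition, and compares the singular values of `B` with the square roots of the eigenvalues of $B^T B$.

```python
import numpy as np

# Small symmetric matrix chosen for illustration (an assumption, not notebook data)
A = np.array([[2.0, 1.0],
              [1.0, 2.0]])

# eigh is meant for symmetric matrices; eigenvalues are returned in ascending order
eigenvalues, eigenvectors = np.linalg.eigh(A)

# Check A v = lambda v for every eigenpair (eigenvectors are stored as columns)
for lam, v in zip(eigenvalues, eigenvectors.T):
    assert np.allclose(A @ v, lam * v)

# Eigenvalue decomposition of a symmetric matrix: A = V diag(lambda) V^T
A_rebuilt = eigenvectors @ np.diag(eigenvalues) @ eigenvectors.T

# Rectangular matrix, also chosen for illustration
B = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])
singular_values = np.linalg.svd(B, compute_uv=False)   # descending order
gram_eigenvalues = np.linalg.eigvalsh(B.T @ B)[::-1]   # flipped to descending order

print(eigenvalues)                                     # eigenvalues of A: 1 and 3
print(np.allclose(singular_values, np.sqrt(gram_eigenvalues)))
print(np.linalg.matrix_rank(B))                        # 2: the columns are linearly independent
```

Since neither eigenvalue of `A` is zero, `A` is invertible; a zero eigenvalue would make it singular.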



## Principal Components

## Factor Analysis

# Machine Learning

## Loss Functions

### Cross Entropy (Log Loss)

In [0]:
import numpy as np

# Full function: -(y*log(p) + (1-y)*log(1-p))
def CrossEntropy(p, y):
  # p: predicted probability of class 1, y: true label (0 or 1)
  if y == 1:
    return -np.log(p)
  else:
    return -np.log(1 - p)
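
The function above scores a single prediction. A vectorized variant that averages the loss over a batch might look like the following sketch. The name `binary_cross_entropy` and the clipping constant `eps` are assumptions introduced here to avoid `log(0)`; they are not part of the original notebook.

```python
import numpy as np

def binary_cross_entropy(p, y, eps=1e-12):
    """Mean log loss over predicted probabilities p and 0/1 labels y."""
    # Clip probabilities away from 0 and 1 so the logarithms stay finite
    p = np.clip(np.asarray(p, dtype=float), eps, 1.0 - eps)
    y = np.asarray(y, dtype=float)
    return float(np.mean(-(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))))

# Confident and correct predictions give a small loss ...
confident_right = binary_cross_entropy([0.9, 0.1], [1, 0])
# ... while confident and wrong predictions are penalized heavily
confident_wrong = binary_cross_entropy([0.1, 0.9], [1, 0])
print(confident_right, confident_wrong)
```

The first value equals $-\log(0.9)$ and the second $-\log(0.1)$, which shows how steeply the loss grows as a confident prediction moves toward the wrong label.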

### Hinge

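In [0]:
def Hinge(p, y):
  # p: raw model score, y: true label in {-1, +1}
  # np.maximum is element-wise; np.max(0, ...) would treat the second argument as an axis
  return np.maximum(0, 1 - p * y)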
  

### Mean Absolute Error (L1)

In [0]:
def MAE(yH, y):
  return np.sum(np.absolute(yH - y)) / y.size

### Mean Squared Error (L2)

In [0]:
def MSE(yH, y):
  return np.sum((yH - y)**2) / y.size
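
To see how the two regression losses above differ in practice, the sketch below compares them on made-up numbers (all arrays are illustrative assumptions). The lowercase `mae` and `mse` helpers repeat the definitions above using `np.mean` so that the block runs on its own.

```python
import numpy as np

def mae(y_hat, y):
    # Mean absolute error (L1)
    return float(np.mean(np.abs(y_hat - y)))

def mse(y_hat, y):
    # Mean squared error (L2)
    return float(np.mean((y_hat - y) ** 2))

y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_close = np.array([1.5, 2.5, 3.5, 4.5])     # every prediction off by 0.5
y_outlier = np.array([1.0, 2.0, 3.0, 8.0])   # a single prediction off by 4

print(mae(y_close, y_true), mse(y_close, y_true))       # 0.5 0.25
print(mae(y_outlier, y_true), mse(y_outlier, y_true))   # 1.0 4.0
```

Moving from the first case to the second doubles the MAE but multiplies the MSE by 16, which is why the squared loss is considered more sensitive to outliers.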

## scikit-learn

## Regression

In [0]:
# load libraries
from sklearn.linear_model import LinearRegression, Ridge, RidgeCV, Lasso, LassoCV
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.datasets import load_boston  # note: removed in scikit-learn 1.2, so this needs an older release

# load data
boston = load_boston()
features = boston.data
target = boston.target

# standardize
scaler = StandardScaler()
features_standardized = scaler.fit_transform(features)
features.shape

(506, 13)

### Feature Engineering

In [0]:
# polynomial terms
polynomial = PolynomialFeatures(degree = 3, include_bias = False)
features_polynomial = polynomial.fit_transform(features)
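
It helps to know how wide the expanded matrix becomes. A full polynomial expansion of total degree at most $d$ over $n$ features contains $\binom{n+d}{d}$ monomials, including the constant term. The helper below is a hypothetical utility written for this note (not part of scikit-learn) that counts them using only the standard library.

```python
from math import comb

def n_polynomial_features(n_features, degree, include_bias=False):
    """Count monomials of total degree <= degree in n_features variables."""
    total = comb(n_features + degree, degree)   # this count includes the constant term
    return total if include_bias else total - 1

# 13 input features expanded to degree 3 without the bias column
print(n_polynomial_features(13, 3))   # 559
```

With 13 features and degree 3 this gives 559 columns, more than the 506 rows in the data, so an unregularized fit on the expanded features may overfit.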

### Linear Regression

In [0]:
# create regression
regression = LinearRegression()

# fit model
model = regression.fit(features_polynomial, target)

# get model output
print(model.intercept_)
print(model.coef_)

### Ridge Regression

In [0]:
# create ridge regression, with cross validation
ridge = RidgeCV(cv = 5)

# fit model
ridgemodel = ridge.fit(features, target)

# get model output
print(ridgemodel.intercept_)
print(ridgemodel.coef_)

27.467884964141177
[-0.10143535  0.0495791  -0.0429624   1.95202082 -2.37161896  3.70227207
 -0.01070735 -1.24880821  0.2795956  -0.01399313 -0.79794498  0.01003684
 -0.55936642]




### Lasso Regression

In [0]:
# create lasso regression, with cross validation
lasso = LassoCV(cv = 5)

# fit model
lassomodel = lasso.fit(features, target)

# get model output
print(lassomodel.intercept_)
print(lassomodel.coef_)

36.33499969015174
[-0.07426626  0.04945448 -0.          0.         -0.          1.804385
  0.01133345 -0.81324404  0.27228399 -0.01542465 -0.74287183  0.00892587
 -0.70365352]


## Matrix Factorization

### LibMF

https://www.csie.ntu.edu.tw/~cjlin/papers/libmf/libmf_open_source.pdf

Non-convex optimization problem:

$$ \min_{P,Q} \sum_{(u,v) \in R} \left[ f(p_u, q_v; r_{u,v}) + \mu_p \|p_u\|_1 + \mu_q \|q_v\|_1 +
\frac{\lambda_p}{2} \|p_u\|_2^2 + \frac{\lambda_q}{2} \|q_v\|_2^2 \right] \quad (1)
$$

where 

$f(p_u, q_v; r_{u,v})$ is the loss function, $p_u$ and $q_v$ are the latent factor vectors of user $u$ and item $v$, $r_{u,v}$ is the observed interaction between them, and $\mu_p, \mu_q, \lambda_p, \lambda_q$ are regularization parameters.

### Training with Stochastic Gradient Descent

The algorithm for the Fast Parallelized Stochastic Gradient Descent is:

1. randomly shuffle R
2. grid R into a set B with at least (s + 1) × (s + 1) blocks
3. sort each block by user (or item) identities
4. construct a scheduler
5. launch s working threads
6. wait until the total number of updates reaches a user-defined value
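
Steps 1 to 3 can be sketched in NumPy as follows. This is a single-threaded illustration only: the function names, the `(user, item, rating)` row layout, and the toy data are assumptions, and the scheduler and worker threads of steps 4 to 6 are reduced to a helper that lists blocks sharing no row or column with blocks already being processed.

```python
import numpy as np

def grid_ratings(ratings, n_users, n_items, s, seed=0):
    """Shuffle (user, item, rating) rows and split them into (s+1) x (s+1) blocks."""
    rng = np.random.default_rng(seed)
    ratings = ratings[rng.permutation(len(ratings))]        # step 1: shuffle R
    g = s + 1                                               # blocks per side
    users = ratings[:, 0].astype(int)
    items = ratings[:, 1].astype(int)
    block_row = users * g // n_users                        # step 2: user id -> block row
    block_col = items * g // n_items                        #         item id -> block column
    blocks = {}
    for i in range(g):
        for j in range(g):
            block = ratings[(block_row == i) & (block_col == j)]
            order = np.argsort(block[:, 0], kind="stable")  # step 3: sort by user id
            blocks[(i, j)] = block[order]
    return blocks

def free_blocks(blocks, busy):
    """Blocks that share no row or column with the blocks in `busy` (toy scheduler rule)."""
    busy_rows = {i for i, _ in busy}
    busy_cols = {j for _, j in busy}
    return [(i, j) for (i, j) in blocks if i not in busy_rows and j not in busy_cols]

# Toy data: 6 users, 6 items, every pair rated (rating values are made up)
toy = np.array([(u, v, float(u + v)) for u in range(6) for v in range(6)])
blocks = grid_ratings(toy, n_users=6, n_items=6, s=2)
print(len(blocks))                          # 9 blocks for s = 2
print(free_blocks(blocks, busy=[(0, 0)]))   # blocks another thread could safely take
```

Two blocks in different rows and columns touch disjoint sets of $p_u$ and $q_v$, so they can be updated concurrently without conflicting writes. Using more blocks per side than threads leaves room for a free block whenever a thread finishes.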


The basic idea of stochastic gradient (SG) is that, instead of expensively calculating the gradient of (1) over every entry, it randomly selects a $(u,v)$ entry from the summation and calculates the corresponding gradient [Robbins and Monro 1951; Kiefer and Wolfowitz 1952]. Once $r_{u,v}$ is chosen, the objective function in (1) reduces to:

$$ f(p_u, q_v; r_{u,v}) + \mu_p \|p_u\|_1 + \mu_q \|q_v\|_1 +
\frac{\lambda_p}{2} p_u^T p_u + \frac{\lambda_q}{2} q_v^T q_v \quad (2)
$$

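We calculate the sub-gradient of (2) with respect to $p_u$ and $q_v$. Because the goal is to minimize (2), each variable moves against its sub-gradient:

$$
p_u \leftarrow p_u - \gamma \frac{\partial}{\partial p_u}(2),  \\
q_v \leftarrow q_v - \gamma \frac{\partial}{\partial q_v}(2)
$$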

where $\gamma$ is the learning rate.
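
A minimal sketch of one such update is shown below, assuming the squared loss $f = \frac{1}{2}(r_{u,v} - p_u^T q_v)^2$. The function name `sgd_step`, the default hyperparameter values, and the toy rating are assumptions made for illustration; this is not LIBMF's implementation.

```python
import numpy as np

def sgd_step(p_u, q_v, r_uv, gamma=0.05, lambda_p=0.05, lambda_q=0.05, mu_p=0.0, mu_q=0.0):
    """One stochastic (sub)gradient step on objective (2) with f = (r - p.q)^2 / 2."""
    error = r_uv - p_u @ q_v
    # Sub-gradients of (2); np.sign is a valid sub-gradient of the L1 norm
    grad_p = -error * q_v + mu_p * np.sign(p_u) + lambda_p * p_u
    grad_q = -error * p_u + mu_q * np.sign(q_v) + lambda_q * q_v
    # Move against the gradient, both updates computed from the old values
    return p_u - gamma * grad_p, q_v - gamma * grad_q

rng = np.random.default_rng(0)
p_u = rng.normal(scale=0.1, size=4)   # small random initial latent factors
q_v = rng.normal(scale=0.1, size=4)
r_uv = 3.0                            # a single made-up rating

errors = []
for _ in range(200):
    errors.append(abs(r_uv - p_u @ q_v))
    p_u, q_v = sgd_step(p_u, q_v, r_uv)

print(errors[0], errors[-1])          # the prediction error shrinks over the updates
```

The error does not go all the way to zero because the $\lambda$ terms pull the factors toward the origin. In real training the step is applied to a different randomly chosen $(u,v)$ entry each time rather than to one rating repeatedly.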

### Loss Functions

$f(\cdot)$ is a non-convex loss function of $p_u$ and $q_v$, and $\mu_p$, $\mu_q$, $\lambda_p$, and $\lambda_q$ are regularization coefficients. For real-valued matrix factorization (RVMF), the loss function can be a squared loss, an absolute loss, or generalized KL-divergence. If $R$ is a binary matrix, users may select among logistic loss, hinge loss, and squared hinge loss to perform binary matrix factorization (BMF). Note that, the non-negative constraints,

# Deep Learning

### Activation Functions

### Exploding Gradient

## TensorFlow and Keras

## Recurrent Neural Networks

### Architecture

#### Example

RNN for Short-Term Prediction

In [0]:
import numpy as np
from matplotlib import pyplot as plt
import tensorflow as tf
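# eager execution is the default in TensorFlow 2.x; it only needs enabling on 1.x
if tf.__version__.startswith("1."):
    tf.enable_eager_execution()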
print("Tensorflow version: " + tf.__version__)

In [0]:
#@title Display utilities [RUN ME]

from enum import IntEnum
import numpy as np

class Waveforms(IntEnum):
    SINE1 = 0
    SINE2 = 1
    SINE3 = 2
    SINE4 = 3

def create_time_series(waveform, datalen):
    # Generates a sequence of length datalen
    # There are four available waveforms in the Waveforms enum
    # good waveforms
    frequencies = [(0.2, 0.15), (0.35, 0.3), (0.6, 0.55), (0.4, 0.25)]
    freq1, freq2 = frequencies[waveform]
    noise = [np.random.random()*0.2 for i in range(datalen)]
    x1 = np.sin(np.arange(0,datalen) * freq1)  + noise
    x2 = np.sin(np.arange(0,datalen) * freq2)  + noise
    x = x1 + x2
    return x.astype(np.float32)

from matplotlib import transforms as plttrans

plt.rcParams['figure.figsize']=(16.8,6.0)
plt.rcParams['axes.grid']=True
plt.rcParams['axes.linewidth']=0
plt.rcParams['grid.color']='#DDDDDD'
plt.rcParams['axes.facecolor']='white'
plt.rcParams['xtick.major.size']=0
plt.rcParams['ytick.major.size']=0

def picture_this_1(data, datalen):
    plt.subplot(211)
    plt.plot(data[datalen-512:datalen+512])
    plt.axvspan(0, 512, color='black', alpha=0.06)
    plt.axvspan(512, 1024, color='grey', alpha=0.04)
    plt.subplot(212)
    plt.plot(data[3*datalen-512:3*datalen+512])
    plt.axvspan(0, 512, color='grey', alpha=0.04)
    plt.axvspan(512, 1024, color='black', alpha=0.06)
    plt.show()
    
def picture_this_2(data, batchsize, seqlen):
    samples = np.reshape(data, [-1, batchsize, seqlen])
    rndsample = samples[np.random.choice(samples.shape[0], 8, replace=False)]
    print("Tensor shape of a batch of training sequences: " + str(rndsample[0].shape))
    print("Random excerpt:")
    subplot = 241
    for i in range(8):
        plt.subplot(subplot)
        plt.plot(rndsample[i, 0]) # first sequence in random batch
        subplot += 1
    plt.show()
    
def picture_this_3(predictions, evaldata, evallabels, seqlen):
    subplot = 241
    colors = plt.rcParams['axes.prop_cycle'].by_key()['color']
    for i in range(8):
        plt.subplot(subplot)
        #k = int(np.random.rand() * evaldata.shape[0])
        l0, = plt.plot(evaldata[i, 1:], label="data")
        plt.plot([seqlen-2, seqlen-1], evallabels[i, -2:])
        l1, = plt.plot([seqlen-1], [predictions[i]], "o", color="red", label='Predicted')
        l2, = plt.plot([seqlen-1], [evallabels[i][-1]], "o", color=colors[1], label='Ground Truth')
        if i==0:
            plt.legend(handles=[l0, l1, l2])
        subplot += 1
    plt.show()
    
def picture_this_hist(rmse1, rmse2, rmse3, rmse):
  colors = ['#4285f4', '#34a853', '#fbbc05', '#ea4334']
  plt.figure(figsize=(5,4))
  plt.xticks(rotation=40)  # rotation must be a number (or 'vertical' / 'horizontal')
  plt.title('RMSE: your model vs. simplistic approaches')
  plt.bar(['RND', 'LAST', 'LAST2', 'Yours'], [rmse1, rmse2, rmse3, rmse], color=colors)
  plt.show()

def picture_this_hist_all(rmse1, rmse2, rmse3, rmse4, rmse5, rmse6, rmse7, rmse8):
  colors = ['#4285f4', '#34a853', '#fbbc05', '#ea4334', '#4285f4', '#34a853', '#fbbc05', '#ea4334']
  plt.figure(figsize=(7,4))
  plt.xticks(rotation=40)  # rotation must be a number (or 'vertical' / 'horizontal')
  plt.ylim(0, 0.35)
  plt.title('RMSE: all models')
  plt.bar(['RND', 'LAST', 'LAST2', 'LINEAR', 'DNN', 'CNN', 'RNN', 'RNN_N'],
          [rmse1, rmse2, rmse3, rmse4, rmse5, rmse6, rmse7, rmse8], color=colors)
  plt.show()

Generate Fake Data

In [0]:
DATA_SEQ_LEN = 1024*128
data = np.concatenate([create_time_series(waveform, DATA_SEQ_LEN) for waveform in Waveforms]) # 4 different wave forms
picture_this_1(data, DATA_SEQ_LEN)
DATA_LEN = DATA_SEQ_LEN * 4 # since we concatenated 4 sequences

Hyperparameters

In [0]:
RNN_CELLSIZE = 32   # size of the RNN cells
SEQLEN = 32         # unrolled sequence length
BATCHSIZE = 32      # mini-batch size
LAST_N = SEQLEN//2  # loss computed on the last N elements of the sequence in the advanced RNN model

Visualize Training Sequences

In [0]:
picture_this_2(data, BATCHSIZE, SEQLEN) # execute multiple times to see different sample sequences

Create Model

In [0]:
# this is how to create a Keras model from neural network layers
def compile_keras_sequential_model(list_of_layers, msg):
  
    # a tf.keras.Sequential model is a sequence of layers
    model = tf.keras.Sequential(list_of_layers)
    
    # keras does not have a pre-defined metric for Root Mean Square Error. Let's define one.
    def rmse(y_true, y_pred): # Root Mean Squared Error
      return tf.sqrt(tf.reduce_mean(tf.square(y_pred - y_true)))
    
    print('\nModel ', msg)
    
    # to finalize the model, specify the loss, the optimizer and metrics
    model.compile(
       loss = 'mean_squared_error',
       optimizer = 'rmsprop',
       metrics = [rmse])
    
    # this prints a description of the model
    model.summary()
    
    return model
  
#
# three very simplistic "models" that require no training. Can you beat them?
#

# SIMPLISTIC BENCHMARK MODEL 1
predict_same_as_last_value = lambda x: x[:,-1] # shape of x is [BATCHSIZE,SEQLEN]
# SIMPLISTIC BENCHMARK MODEL 2
predict_trend_from_last_two_values = lambda x: x[:,-1] + (x[:,-1] - x[:,-2])
# SIMPLISTIC BENCHMARK MODEL 3
predict_random_value = lambda x: tf.random.uniform(tf.shape(x)[0:1], -2.0, 2.0)

def model_layers_from_lambda(lambda_fn, input_shape, output_shape):
  return [tf.keras.layers.Lambda(lambda_fn, input_shape=input_shape),
          tf.keras.layers.Reshape(output_shape)]

model_layers_RAND  = model_layers_from_lambda(predict_random_value,               input_shape=[SEQLEN,], output_shape=[1,])
model_layers_LAST  = model_layers_from_lambda(predict_same_as_last_value,         input_shape=[SEQLEN,], output_shape=[1,])
model_layers_LAST2 = model_layers_from_lambda(predict_trend_from_last_two_values, input_shape=[SEQLEN,], output_shape=[1,])

# three neural network models for comparison, in increasing order of complexity

l = tf.keras.layers  # syntax shortcut

# BENCHMARK MODEL 4: linear model (RMSE: 0.215 after 10 epochs)
model_layers_LINEAR = [l.Dense(1, input_shape=[SEQLEN,])] # output shape [BATCHSIZE, 1]

# BENCHMARK MODEL 5: 2-layer dense model (RMSE: 0.197 after 10 epochs)
model_layers_DNN = [l.Dense(SEQLEN//2, activation='relu', input_shape=[SEQLEN,]), # input  shape [BATCHSIZE, SEQLEN]
                    l.Dense(1)] # output shape [BATCHSIZE, 1]

# BENCHMARK MODEL 6: convolutional (RMSE: 0.186 after 10 epochs)
model_layers_CNN = [
    l.Reshape([SEQLEN, 1], input_shape=[SEQLEN,]), # [BATCHSIZE, SEQLEN, 1] is necessary for conv model
    l.Conv1D(filters=8, kernel_size=4, activation='relu', padding="same"), # [BATCHSIZE, SEQLEN, 8]
    l.Conv1D(filters=16, kernel_size=3, activation='relu', padding="same"), # [BATCHSIZE, SEQLEN, 16]
    l.Conv1D(filters=8, kernel_size=1, activation='relu', padding="same"), # [BATCHSIZE, SEQLEN, 8]
    l.MaxPooling1D(pool_size=2, strides=2),  # [BATCHSIZE, SEQLEN//2, 8]
    l.Conv1D(filters=8, kernel_size=3, activation='relu', padding="same"),  # [BATCHSIZE, SEQLEN//2, 8]
    l.MaxPooling1D(pool_size=2, strides=2),  # [BATCHSIZE, SEQLEN//4, 8]
    # mis-using a conv layer as linear regression :-)
    l.Conv1D(filters=1, kernel_size=SEQLEN//4, activation=None, padding="valid"), # output shape [BATCHSIZE, 1, 1]
    l.Reshape([1,]) ] # output shape [BATCHSIZE, 1]

# RNN
model_layers_RNN = [
    # input shape needed on first layer only
    l.Reshape([SEQLEN, 1], input_shape=[SEQLEN,]),
    l.GRU(RNN_CELLSIZE), # shape [BATCHSIZE, RNN_CELLSIZE]
    l.Dense(1, ) # shape [BATCHSIZE, 1]
    
]
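
To make the two deterministic baselines above concrete, here is a NumPy-only sketch of what they compute. It rests on assumptions: the signal is a noise-free stand-in for the first waveform, and the target of each window is taken to be the value that follows it, since the training loop is not shown in this section.

```python
import numpy as np

seqlen = 32
t = np.arange(4096)
# Noise-free stand-in for the first waveform (frequencies 0.2 and 0.15)
series = np.sin(t * 0.2) + np.sin(t * 0.15)

# For each window ending at position i-1, the assumed target is series[i]
idx = np.arange(seqlen, len(series))
y_true = series[idx]
last_value = series[idx - 1]     # last element of the window
prev_value = series[idx - 2]     # second-to-last element of the window

def rmse(y, y_pred):
    # Root mean squared error, the metric used by the Keras models above
    return float(np.sqrt(np.mean((y_pred - y) ** 2)))

# Baseline LAST: repeat the last value
rmse_last = rmse(y_true, last_value)
# Baseline LAST2: extend the line through the last two values
rmse_last2 = rmse(y_true, last_value + (last_value - prev_value))

print(rmse_last, rmse_last2)
```

On this smooth signal the linear-trend baseline has the lower error. The notebook's data also contains random noise, which the trend extrapolation amplifies, so the ranking there may differ.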

## Convolutional Neural Networks

### Architecture