
### Deep Learning Classifiers
- deep refers to the number of layers in the neural network, not any deeper understanding
#### Deep Belief Networks
- a multilayer neural network that is trained in a greedy layer-wise fashion
- a stack of restricted Boltzmann machines (RBMs) in which each RBM layer communicates with the previous and next layer
- an RBM is a small two-layer neural network (one visible layer and one hidden layer) with no connections between nodes in the same layer
- uses a softmax output layer to classify the data
    - softmax is a generalization of the logistic function that squashes a K-dimensional vector of arbitrary real values to a K-dimensional vector of real values in the range (0, 1) that add up to 1
    - i.e. it converts a vector of real values to a probability distribution that sums to 1
    - <img src="images/softmax.png">
    - e.g. `Softmax([1.0, 2.0, 3.0, 4.0, 1.0, 2.0, 3.0])`
        - = `[0.024, 0.064, 0.175, 0.475, 0.024, 0.064, 0.175]`
    - the output puts most of the weight where the 4 was in the input
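
The softmax example above can be checked with a few lines of NumPy. This is a minimal sketch; the function name `softmax` and the max-subtraction trick for numerical stability are choices made here, not something prescribed by the notes.

```python
import numpy as np

def softmax(values):
    """Convert a vector of real values into probabilities that sum to 1."""
    values = np.asarray(values, dtype=float)
    # Subtracting the max leaves the result unchanged but avoids overflow in exp
    exps = np.exp(values - values.max())
    return exps / exps.sum()

probs = softmax([1.0, 2.0, 3.0, 4.0, 1.0, 2.0, 3.0])
print(np.round(probs, 3))   # largest probability sits at index 3, where the 4.0 was
print(probs.sum())
```
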

#### Convolutional Neural Networks (CNNs)
- a type of deep neural network that is often used for image recognition
- a regularized version of a fully connected neural network that uses convolutional layers
- convolution reduces the number of free parameters, reducing the chance of overfitting
    - i.e. pictures have a lot of pixels, so a fully connected network would need a huge number of free parameters
- inputs are called tensors
    - tensors are multidimensional arrays
    - a 2D tensor is a matrix
    - a 3D tensor is a cube (e.g. height x width x color channels for one image)
    - a 4D tensor is a stack of cubes (e.g. a batch of multi-channel images)
- hidden layers consist of
    - convolutional layers
    - pooling layers
    - ReLU layers
        - ReLU stands for rectified linear unit
- convolutional layer
    - a window slides over the input tensor and computes a weighted sum of each part of the input
    - the window is called a kernel
    - the kernel is a matrix of weights
    - the kernel is applied to each part of the input, and the results form a feature map
    - <img src="images/sliding_pane_CNN.png">
    - the feature map is usually smaller than the input (how much smaller depends on kernel size, stride, and padding)
- pooling layer
    - a window slides over the feature map and takes the average or max value in each window
    - e.g. max pooling
        - takes the max value in each window
- ReLU layer
    - applies the ReLU function to each element of the feature map
        - ReLU(x) = max(0, x)
            - i.e. if x is negative, it is replaced with 0, otherwise it is left alone
- Keras (https://keras.io/) probably has the best out-of-the-box CNN setup for Python
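
The three hidden-layer operations above (convolution, ReLU, pooling) can be sketched in plain NumPy. This is an illustration only: the 5x5 `image` and the 2x2 `kernel` are made-up values, stride 1 with no padding is assumed, and a real model would use a library such as Keras instead of explicit loops.

```python
import numpy as np

def convolve2d_valid(image, kernel):
    """Slide the kernel over the image (stride 1, no padding).

    Like most deep learning libraries, this does not flip the kernel,
    so strictly speaking it is a cross-correlation.
    """
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    feature_map = np.zeros((out_h, out_w))
    for r in range(out_h):
        for c in range(out_w):
            window = image[r:r + kh, c:c + kw]
            feature_map[r, c] = np.sum(window * kernel)  # weighted sum of the window
    return feature_map

def relu(x):
    """ReLU(x) = max(0, x), applied element-wise."""
    return np.maximum(0, x)

def max_pool(feature_map, size=2):
    """Non-overlapping max pooling; rows/columns that do not fill a window are dropped."""
    h = feature_map.shape[0] // size
    w = feature_map.shape[1] // size
    trimmed = feature_map[:h * size, :w * size]
    return trimmed.reshape(h, size, w, size).max(axis=(1, 3))

# Hypothetical input and kernel, chosen only to make the arithmetic easy to follow
image = np.array([[1, 2, 0, 1, 3],
                  [0, 1, 3, 2, 1],
                  [2, 0, 1, 0, 2],
                  [1, 3, 0, 1, 0],
                  [0, 1, 2, 3, 1]], dtype=float)
kernel = np.array([[1, 0],
                   [0, -1]], dtype=float)

feature_map = convolve2d_valid(image, kernel)  # 5x5 input, 2x2 kernel -> 4x4 feature map
activated = relu(feature_map)                  # negatives replaced with 0
pooled = max_pool(activated, size=2)           # 4x4 -> 2x2
print(pooled)
```
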

### IC 14Sep23
1. $1 + 2 + 3 = 6$
-  $6 \cdot 6 + 3 = 39$
2. index 0
3. 1
4. a. 3
    b. 2
5. ReLU(-3) = 0, ReLU(0) = 0, ReLU(3) = 3

### Recurrent Neural Networks (RNNs)
- HMM assumes that the probability of a state depends only on the previous state
- RNN uses information from more than one previous input and output
- RNN classifies a sequence of inputs in time or space
- process one element at a time while retaining a memory of what has come before
    - memory is called a cell state
    - allows the network to learn long-term dependencies
- how do you make an RNN remember things?
    - in a normal neural network, the output is a function of the current input only
        - $f(x) = wx + b$
    - in an RNN, the output is a function of the input and the previous state
        - without memory: $h_t = f_1(x_t)$
            - $h_t$ is the output at time t
            - $x_t$ is the input at time t
            - $f_1$ is the function that determines the output at time t
        - expand that to include the previous output
            - $h_t = f_1(x_t, h_{t-1})$
    - recurrence is increased with cell output $c_t$
        - $c_t = f_3(x_t, h_{t-1}, c_{t-1})$
        - $h_t = f_4(x_t, h_{t-1}, c_t)$
        - there are several methods for calculating $f_3$ and $f_4$
            - LSTM (long short-term memory)
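
The recurrence $h_t = f_1(x_t, h_{t-1})$ can be sketched as a loop that carries the previous output forward. The notes do not say what $f_1$ is; the `tanh` of a weighted sum used here is an assumption (a common choice), and the names `W`, `U`, `b`, and the layer sizes are made up for the example.

```python
import numpy as np

def rnn_step(x_t, h_prev, W, U, b):
    """One recurrent step, assuming f_1 is tanh(W x_t + U h_{t-1} + b)."""
    return np.tanh(W @ x_t + U @ h_prev + b)

rng = np.random.default_rng(0)
input_size, hidden_size = 3, 4          # hypothetical sizes
W = rng.normal(size=(hidden_size, input_size))
U = rng.normal(size=(hidden_size, hidden_size))
b = np.zeros(hidden_size)

sequence = rng.normal(size=(5, input_size))  # 5 time steps of input
h = np.zeros(hidden_size)                    # h_0: nothing remembered yet
for x_t in sequence:
    # Each output depends on the current input and on the previous output
    h = rnn_step(x_t, h, W, U, b)
print(h)
```
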

### LSTM
- <img src="images/LSTM.png" width="800">
- $f_t$ is the forget gate
    - $f_t = \sigma_g(W_f \cdot x_t + U_f \cdot h_{t-1} + b_f)$
    - $W_f$, $U_f$, and $b_f$ are the weight matrices and bias vector
    - $x_t$ is the input at time t
    - $h_{t-1}$ is the output at time t-1
    - $\sigma_g$ is the sigmoid function
        - $\sigma_g(x) = \frac{1}{1+e^{-x}}$
        - used extensively in neural networks
        - squashes the input to a value between 0 and 1
- $i_t$ is the input gate
    - $i_t = \sigma_g(W_i \cdot x_t + U_i \cdot h_{t-1} + b_i)$
        - $W_i$, $U_i$, and $b_i$ are the weight matrices and bias vector
        - $x_t$ is the input at time t
        - $h_{t-1}$ is the output at time t-1
        - $\sigma_g$ is the sigmoid function
- $c_t$ is the cell state at time t
    - $c_t = f_3(x_t, h_{t-1}, c_{t-1})$
        - $c_t = f_t \cdot c_{t-1} + i_t \cdot \sigma_c(W_c \cdot x_t + U_c \cdot h_{t-1} + b_c)$
        - $f_t$ is the forget gate
        - $i_t$ is the input gate
        - the products with $f_t$ and $i_t$ are element-wise
        - $W_c$, $U_c$, and $b_c$ are the weight matrices and bias vector
        - $x_t$ is the input at time t
        - $h_{t-1}$ is the output at time t-1
        - $\sigma_c$ is the tanh function
        - $\sigma_c(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}$
            - also squashes its input, but to a value between -1 and 1
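
A single LSTM step can be sketched from the gate and cell-state formulas in this section. Two things here are assumptions rather than something the notes state: the forget gate is taken to have the same form as the input gate (with $W_f$, $U_f$, $b_f$), and the output gate $o_t$ and output $h_t = o_t \cdot \tanh(c_t)$ follow the standard LSTM formulation. The `params` dictionary layout is hypothetical.

```python
import numpy as np

def sigmoid(x):
    """sigma_g: squashes values into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM step. `params` maps a gate name to its (W, U, b) triple."""
    W_f, U_f, b_f = params["f"]
    W_i, U_i, b_i = params["i"]
    W_c, U_c, b_c = params["c"]
    W_o, U_o, b_o = params["o"]

    f_t = sigmoid(W_f @ x_t + U_f @ h_prev + b_f)        # forget gate (assumed form)
    i_t = sigmoid(W_i @ x_t + U_i @ h_prev + b_i)        # input gate
    candidate = np.tanh(W_c @ x_t + U_c @ h_prev + b_c)  # sigma_c term
    c_t = f_t * c_prev + i_t * candidate                 # element-wise products
    # Assumption: standard output gate and output, not spelled out in the notes
    o_t = sigmoid(W_o @ x_t + U_o @ h_prev + b_o)
    h_t = o_t * np.tanh(c_t)
    return h_t, c_t

rng = np.random.default_rng(0)
n_in, n_hidden = 3, 2                                    # hypothetical sizes
params = {
    gate: (rng.normal(size=(n_hidden, n_in)),
           rng.normal(size=(n_hidden, n_hidden)),
           np.zeros(n_hidden))
    for gate in ("f", "i", "c", "o")
}

h, c = np.zeros(n_hidden), np.zeros(n_hidden)
for x_t in rng.normal(size=(4, n_in)):                   # 4 time steps
    h, c = lstm_step(x_t, h, c, params)
print(h, c)
```
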

### IC 21Sep23
1. $c_{t-1}$ is the previous cell output
    - $h_{t-1}$ is the previous output
    - $x_t$ is the current input at time t
2. $c_t$ is the current cell output
    - $h_t$ is the current output
3. $c_{t-1}$ is used to "remember" the previous cell output
    - $h_{t-1}$ is used to "remember" the previous output
4. $c_t$ is the current cell output 
    - $h_t$ is the current output
5. a)  the forget gate function uses $x_t$ and $h_{t-1}$
    - b) the input gate function uses $x_t$ and $h_{t-1}$
    - c) the output gate function uses $x_t$ and $h_{t-1}$
6. the cell state function uses $c_{t-1}$, $x_t$, and $h_{t-1}$
7. the output function uses $c_t$, $x_t$, and $h_{t-1}$
8. $[1, 4, 5] \cdot [2, 6, 9] = [2, 24, 45]$
9. $f_t = \frac{1}{1 + e^{-(W_f x_t + U_f h_{t-1} + b_f)}}$
10.  $c_t = \tanh(W_f x_t + U_f h_{t-1} + b_f)$




### IC 26Sep23
1. a GaussianNoise layer could be added to the model to add noise to the input and prevent overfitting
2. simple dropout is used because it prevents overfitting by removing a random subset of the inputs, rather than by adding noise, which could change the meaning of the input

### Dimensionality Reduction
- applies to all supervised ML models
- why?
    - reduces time complexity
    - reduces space complexity
    - saves cost of observing the features
    - reduces the number of features
- how?
    - eliminate features that are highly correlated (i.e. redundant)
    - eliminate features that are not correlated with the target (i.e. irrelevant)
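
The two elimination rules above can be sketched with correlation coefficients. This is one simple way to do it, not the only one; the function names, the thresholds (0.95 and 0.1), and the toy data are all made up for the example.

```python
import numpy as np

def drop_redundant_features(X, threshold=0.95):
    """Return the column indices to keep, dropping any column whose absolute
    correlation with an already-kept column exceeds `threshold`."""
    corr = np.abs(np.corrcoef(X, rowvar=False))
    keep = []
    for j in range(X.shape[1]):
        if all(corr[j, k] <= threshold for k in keep):
            keep.append(j)
    return keep

def relevant_features(X, y, threshold=0.1):
    """Return the column indices whose absolute correlation with the target
    is at least `threshold`."""
    return [j for j in range(X.shape[1])
            if abs(np.corrcoef(X[:, j], y)[0, 1]) >= threshold]

# Toy data: column 1 is a linear copy of column 0; column 2 is unrelated to y
a = np.arange(1, 7, dtype=float)
X = np.column_stack([a, 2 * a + 1, [1, 2, 3, 1, 2, 3]])
y = np.array([0, 0, 0, 1, 1, 1], dtype=float)

print(drop_redundant_features(X))   # column 1 is redundant with column 0
print(relevant_features(X, y))      # column 2 is irrelevant to y
```
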
### Feature Transformation - PCA
- PCA (principal component analysis) is a dimensionality reduction technique
- project $x$ onto a lower dimensional subspace $z$
    - $x$ is a vector of d input features
    - $z$ is also a vector of d features
    - minimize information loss by maximizing the variance of the projected data
    - projection of $x$ is $z = w^Tx$
        - $w$ maximizes the variance of $z$
    - $z$ is not a subset of $x$
        - it is the same number of features as $x$ though
    - a subset of $k$ features from $z$ is used as the input to the model
- <img src="images/PCA.png" width="800">

- e.g. mean centered matrix
    - mean of the column is subtracted from each element in the column
    - <img src="images/mean_centered_matrix.png" width="800">
    - next, calculate the covariance matrix
        - <img src="images/covariance_matrix.png" width="800">
    - next, calculate the eigenvectors and eigenvalues of the covariance matrix
        - <img src="images/eigenvs.png" width="800">
    - next, determine the transformed parameters
        - $x$ is the x_centered matrix of original features
        - $w$ is the matrix of eigenvectors
        - $z$ is the transformed matrix of features
            - $z = xw^T$
                    - note that $x$ and $w^T$ are reversed from the previous equation
    - finally, select the top $k$ features from $z$ to use as the input to the model
        - selected from the eigenvector matrix
        - the column in the eigenvector matrix with the highest eigenvalue is the most important feature
            - this is the first principal component
        - the column in the eigenvector matrix with the second highest eigenvalue is the second most important feature
            - this is the second principal component
        - and so on
        - the top $k$ features are the first $k$ principal components
        - if you want to reduce the number of features from $d$ to $k$, you select the first $k$ principal components
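
The steps above (mean-center, covariance matrix, eigenvectors, projection, keep the first $k$ components) can be sketched with NumPy. The function name `pca_from_scratch` and the random test data are made up. Note the convention: here the eigenvectors are the columns of `W`, so the projection is `X_centered @ W`; if they are stored as rows instead (as in scikit-learn's `components_`), the same projection is written with the transpose.

```python
import numpy as np

def pca_from_scratch(X, k):
    """Project X (rows = samples, columns = features) onto its first k principal components."""
    # 1. Mean-center each column
    X_centered = X - X.mean(axis=0)
    # 2. Covariance matrix of the features
    cov = np.cov(X_centered, rowvar=False)
    # 3. Eigen-decomposition (eigh is for symmetric matrices, returns ascending eigenvalues)
    eigenvalues, eigenvectors = np.linalg.eigh(cov)
    # 4. Sort from highest to lowest eigenvalue
    order = np.argsort(eigenvalues)[::-1]
    eigenvalues = eigenvalues[order]
    eigenvectors = eigenvectors[:, order]
    # 5. Keep the first k eigenvectors (columns) and project
    W = eigenvectors[:, :k]
    Z = X_centered @ W
    return Z, eigenvalues, W

# Hypothetical data: 50 samples, 3 correlated features
rng = np.random.default_rng(0)
mixing = np.array([[2.0, 0.5, 0.0],
                   [0.0, 1.0, 0.3],
                   [0.0, 0.0, 0.2]])
X = rng.normal(size=(50, 3)) @ mixing

Z, eigenvalues, W = pca_from_scratch(X, k=2)
print(Z.shape)
print(np.round(np.cov(Z, rowvar=False), 6))  # off-diagonal entries are ~0
```

The covariance matrix of `Z` comes out diagonal, with the two largest eigenvalues on the diagonal, which is what makes the transformed features uncorrelated.
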

### Choosing k
-  Proportion of Variance (PoV)
    - <img src="images/PoV.png"> 
    - $\lambda_i$ is the $i^{th}$ eigenvalue
    - typically stop at PoV > 0.9
- <img src="images/choosingk.png">
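
As a sketch of choosing $k$, assuming PoV for the first $k$ components is the sum of the $k$ largest eigenvalues divided by the sum of all eigenvalues: the function name `choose_k` and the example eigenvalues are made up, and the 0.9 threshold follows the rule of thumb above.

```python
import numpy as np

def choose_k(eigenvalues, threshold=0.9):
    """Return the smallest k whose cumulative proportion of variance exceeds `threshold`."""
    eigenvalues = np.sort(np.asarray(eigenvalues, dtype=float))[::-1]  # high to low
    pov = np.cumsum(eigenvalues) / eigenvalues.sum()
    k = int(np.argmax(pov > threshold)) + 1  # first index where PoV > threshold
    return k, pov

# Hypothetical eigenvalues
k, pov = choose_k([5.0, 3.0, 1.5, 0.5], threshold=0.9)
print(pov)  # cumulative PoV: 0.5, 0.8, 0.95, 1.0
print(k)    # 3
```
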
### What PCA Does
- centers the data around the origin
- <img src="images/PCAplot.png">
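- the covariance matrix of the transformed features $z$ is diagonal, with the eigenvalues on the diagonal
    - no covariance between the transformed features
    - variance concentrated in the first principal components
        - more variance means the feature spreads the data out more, which can make the classes easier to separate
            - should translate to better classification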

### Python Example

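In [1]:
# Import libraries
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
# Get data
# NOTE: the original cell called read_csv(url, names=names), but `url` and `names`
# were never defined (NameError). The 4-feature / 1-label slicing suggests the Iris
# data, so scikit-learn's bundled copy is used here as an assumed stand-in.
iris = load_iris()
x = iris.data    # 4 feature columns
y = iris.target  # class labels
# Create PCA instance
pca = PCA(n_components=4)
# Perform PCA
pca.fit(x)
# Get eigenvectors (one per row) and eigenvalues
eigenvectors = pca.components_
eigenvalues = pca.explained_variance_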

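In [None]:
# Transform data (centers x and projects it onto all 4 components)
principal_components = pca.transform(x)
# Calculate PoVs (cumulative proportion of variance)
sumvariance = np.cumsum(eigenvalues)
sumvariance /= sumvariance[-1]
# Make a list of (eigenvalue, eigenvector) tuples
eigen_pairs = list(zip(eigenvalues, eigenvectors))
# scikit-learn already returns the components sorted from high to low eigenvalue,
# so the explicit sort is not needed
# eigen_pairs.sort(key=lambda pair: pair[0], reverse=True)
# Transform data (x) to Z using only the first principal component
W = eigen_pairs[0][1].reshape(4, 1)
# Project the mean-centered original data, not the already-transformed data
Z = (x - pca.mean_).dot(W)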

### IC 28Sep23
- 1) [[-2.33, -2.33, -2.33], [-0.33, -0.33, -0.33], [2.66, 2.66, 2.66]]
- 2) a. 6.33
- 2) b. 4.33
- 3) 2, using the first two principal components yields PoV of 97.8% which is greater than 97%

### PCA Continued
- a key weakness is that it assumes a linear relationship between the features
    - i.e. straight line relationship
- correlation can range from -1 to +1 (perfect negative correlation to perfect positive correlation)
    - 0 represents no correlation at all
- typical ML applications do not have linear relationships
    - e.g. a car's price is not linearly related to its mileage
- one solution or mitigation strategy is to apply a feature selection technique to reduce the number of features d to k

### Feature Selection
- brute force
    - try all possible combinations of features
    - select the combination that yields the best results
    - this is not feasible for large d
    - $2^d - 1$ iterations are required to test all combinations of features
- sequential search strategies
    - greedy algorithms, i.e. they make the best local choice at each step and may end at a sub-optimal feature set
    - forward selection
        - search forward and add the best feature at each step
            1. start with no features
            2. find the feature that yields the best results
            3. add that feature to the model
            4. stop when adding more features does not improve the results
        - max iterations = d
        - max k-fold cross validation = $\frac{d(d+1)}{2}$
    - backward selection
        - search backward and remove the worst feature at each step
            1. start with all features
            2. find the feature that yields the worst results
            3. remove that feature from the model
            4. stop when removing more features does not improve the results
        - max iterations = d - 1
        - max k-fold cross validation = $\frac{d(d+1)}{2} - 1$
    - plus-L minus-R selection (LRS)
        - generalized form of forward and backward selection
        - if L > R, it is forward selection
        - if L < R, it is backward selection
        - LRS attempts to compensate for weaknesses in forward and backward selection
            - involves some backtracking
        - primary limitation is the lack of theory to predict the optimal values of L and R 
        - ```
            If L>R then
                F = ∅
                Repeat L times
                    Find the best feature and add it to F
                Repeat R times
                    Find the worst feature and delete it from F
            Else
                F = X
                Repeat R times
                    Find the worst feature and delete it from F
                Repeat L times
                    Find the best feature and add it to F
            Endif
            ```
        - <img src="images/lrs_example.png" width="800">
    - bidirectional search
        - parallel forward and backward search
        - to guarantee convergence, the forward and backward searches must be synchronized
            - features added in forward cannot be removed in backward
            - features removed in backward cannot be added in forward
        - <img src="images/bds_example.png" width="800">
    - sequential floating selection (SFS)
        - forward selection with backtracking
        - rather than fixing values of L and R, floating methods allow L and R to be determined from the data
        - two methods
            - sequential forward floating selection (SFFS)
                - starts with no features
                - after each forward step, it checks to see if removing any features would improve the results
            - sequential backward floating selection (SBFS)
                - starts with all features
                - after each backward step, it checks to see if adding any features would improve the results
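
Forward selection and the brute-force search can be compared with a small pure-Python sketch. The scoring function stands in for whatever evaluation is used in practice (e.g. k-fold cross-validation accuracy); the `gain` table and the per-feature penalty are invented so the example is deterministic.

```python
from itertools import combinations

def forward_selection(features, score):
    """Greedy sequential forward selection.

    `score(subset)` returns a number where higher is better.
    Returns (selected features, best score, number of score evaluations).
    """
    selected, remaining = [], list(features)
    best_score = float("-inf")
    evaluations = 0
    while remaining:
        # Try adding each remaining feature to the current subset
        scored = []
        for f in remaining:
            scored.append((score(selected + [f]), f))
            evaluations += 1
        candidate_score, candidate = max(scored, key=lambda pair: pair[0])
        if candidate_score <= best_score:
            break  # adding another feature no longer improves the result
        selected.append(candidate)
        remaining.remove(candidate)
        best_score = candidate_score
    return selected, best_score, evaluations

def brute_force(features, score):
    """Score every non-empty subset: 2^d - 1 evaluations."""
    subsets = [list(c) for r in range(1, len(features) + 1)
               for c in combinations(features, r)]
    best = max(subsets, key=score)
    return best, score(best), len(subsets)

# Hypothetical scoring: each feature adds a fixed gain, minus a small cost per feature
gain = {"a": 0.5, "b": 0.3, "c": 0.0, "d": -0.1}

def toy_score(subset):
    return sum(gain[f] for f in subset) - 0.02 * len(subset)

features = ["a", "b", "c", "d"]
sfs_result = forward_selection(features, toy_score)
bf_result = brute_force(features, toy_score)
print(sfs_result)  # stops after 4 + 3 + 2 = 9 evaluations
print(bf_result)   # needs 2^4 - 1 = 15 evaluations
```

On this toy problem both searches agree, but that is not guaranteed in general: the greedy search never revisits an earlier choice, which is the weakness the LRS and floating methods try to address.
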

### IC 03Oct23
- 1) iteration time = 0.1 seconds
        - $d$ = 8
        - $n = 2^d -1$
        - $2^8 - 1 = 255$  iterations
        - $255 \cdot 0.1 = 25.5$ seconds
- 2) X = {a, b, c, d}, F = {a, b} 
        - a. the first feature to be evaluated is c
        - b. the first feature to be evaluated is d
- 3) X = {a, b, c, d}, after step 2: F = {a, b}, L = 1, R = 3
        - a. the next features to be evaluated are c and d
        - b. since R > L, the next step is to remove the worst feature from F
- 4) Y = {a, b, f, g}, x = h, Acc({a, b, f, g}) = 0.89, and Acc({a, b, f, g, h}) = 0.91 in step 4
        - a. the next Y to be evaluated is {a, b, f, g, h}
        - b. the next step is to remove the worst feature from Y