The perceptron will not converge if the classes are not perfectly linearly separable.

**Logistic Regression:**
*  Is a model for classification.
*  Performs very well on linearly separable classes.
*  Is also a linear model for classification, like the Perceptron and the Adaline models.

**ODDS:**
The odds in favor of a particular event can be written as p/(1 - p), where p is the probability of the positive event.

The **logit function** is the logarithm of the odds:
$\text{logit}(p) = \log \frac{p}{1 - p}$
(Here, log refers to the natural logarithm.)
*   The logit function takes input values in the range 0 to 1 and transforms them to values over the entire real-number range.
*   This can be used to express a linear relationship between feature values and the log-odds.
*   $\text{logit}(p(y=1 \mid \mathbf{x})) = w_0 x_0 + w_1 x_1 + \dots + w_m x_m = \sum_{i=0}^{m} w_i x_i = \mathbf{w}^T \mathbf{x}$
*   Here, $p(y=1 \mid \mathbf{x})$ is the conditional probability that a particular example belongs to class 1 given its features, x.
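
As a small illustration of the points above, the sketch below computes the odds and the logit for a few probabilities. The function name `logit` and the sample probabilities are chosen here for illustration only.

```python
import math


def logit(p):
    """Return the log-odds log(p / (1 - p)) for a probability 0 < p < 1."""
    return math.log(p / (1.0 - p))


# Probabilities below 0.5 map to negative log-odds, 0.5 maps to 0,
# and probabilities above 0.5 map to positive log-odds.
for p in (0.1, 0.5, 0.9):
    odds = p / (1.0 - p)
    print(f"p = {p:.1f}  odds = {odds:.3f}  logit = {logit(p):+.3f}")
```

The logit is symmetric around p = 0.5, and it grows without bound as p approaches 0 or 1, which is why it can be matched to an unbounded linear combination of the features.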

**Predicting the probability an example belongs in a particular class:**
*   We will utilize the inverse form of the logit function to complete this task.
*   The inverse form of the logit function is called the **logistic sigmoid function (sigmoid function).**
*   Sigmoid function: $\phi(z) = \frac{1}{1 + e^{-z}}$
*   Here, z is the net input, the linear combination of the weights and the inputs: $z = \mathbf{w}^T \mathbf{x} = w_0 x_0 + w_1 x_1 + \dots + w_m x_m$
*   $w_0$ is the bias unit; its input $x_0$ is set equal to 1.
*   The outputs of the sigmoid function are constrained to the range between 0 and 1.
*   The sigmoid function is the activation function for logistic regression.
*   The output of the sigmoid function is then interpreted as the probability of a particular example belonging to class 1, given its features x, parameterized by the weights w.
*   If the predicted probability is >= 0.5, the example is assigned to class 1; otherwise it is assigned to class 0.
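
The prediction steps above can be sketched in plain Python. The weights and feature values below are made up for illustration, and `predict` is a hypothetical helper name; the first feature is fixed to 1 so that the first weight plays the role of the bias.

```python
import math


def sigmoid(z):
    """Logistic sigmoid: maps any real net input z to a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def predict(weights, features, threshold=0.5):
    """Return (probability of class 1, predicted class label)."""
    # Net input z = w^T x; features[0] is assumed to be 1 (bias input).
    z = sum(w * x for w, x in zip(weights, features))
    p = sigmoid(z)
    return p, int(p >= threshold)


weights = [-1.0, 2.0, 0.5]   # hypothetical weights, weights[0] is the bias
features = [1.0, 0.8, 0.4]   # x0 = 1, then two hypothetical feature values
p, label = predict(weights, features)
print(f"P(y=1|x) = {p:.4f}, predicted class = {label}")
```

Here the net input is about 0.8, so the predicted probability is about 0.69 and the example is assigned to class 1.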

**Logistic cost function--used to learn weights**
Figuring out how to fit the parameters of the model:
*   We need to maximize the likelihood L(w) = P(y|x;w), the probability of the observed labels given the features and the weights.
*   In practice it is easier to maximize the natural log of the likelihood, the log-likelihood: l(w) = log L(w)
*   Applying the log function reduces the potential for numerical underflow, which can occur if the likelihoods are very small.
*   Now use an optimization algorithm such as **gradient ascent to maximize the log-likelihood function.**
*   Alternatively, rewrite the log-likelihood as a cost function J(w) = -l(w) that can be minimized using gradient descent.
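
A minimal sketch of this fitting procedure, assuming the cost is the negative log-likelihood summed over the training examples and using plain batch gradient descent. The toy dataset, the learning rate `eta`, and the number of iterations are arbitrary choices made for this illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def net_input(w, x):
    return sum(wj * xj for wj, xj in zip(w, x))


def cost(w, X, y):
    """Negative log-likelihood J(w) = -sum[y*log(p) + (1 - y)*log(1 - p)]."""
    total = 0.0
    for xi, yi in zip(X, y):
        p = sigmoid(net_input(w, xi))
        total -= yi * math.log(p) + (1 - yi) * math.log(1.0 - p)
    return total


def gradient_step(w, X, y, eta):
    """One batch gradient-descent update; dJ/dw_j = sum((p - y) * x_j)."""
    grad = [0.0] * len(w)
    for xi, yi in zip(X, y):
        error = sigmoid(net_input(w, xi)) - yi
        for j, xj in enumerate(xi):
            grad[j] += error * xj
    return [wj - eta * gj for wj, gj in zip(w, grad)]


# Toy data: first column is the bias input x0 = 1. The classes overlap,
# so the weights stay finite.
X = [[1.0, 0.5], [1.0, 1.5], [1.0, 2.0], [1.0, 3.0], [1.0, 3.5], [1.0, 4.5]]
y = [0, 0, 1, 0, 1, 1]

w = [0.0, 0.0]
initial_cost = cost(w, X, y)
for _ in range(200):
    w = gradient_step(w, X, y, eta=0.05)
final_cost = cost(w, X, y)
print(f"cost before: {initial_cost:.4f}, cost after: {final_cost:.4f}")
```

Minimizing J(w) with gradient descent is the same as maximizing the log-likelihood with gradient ascent; only the sign of the update changes.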

**Loss functions for classification:**
Binary cross-entropy is the loss function used to train a binary classification model (with a single output unit).
**Raschka, page 540**

Loss functions available in Keras:
*   Binary cross-entropy--binary classification
*   Categorical cross-entropy--multiclass classification
*   Sparse categorical cross-entropy--multiclass classification

Computing the cross-entropy loss by providing the logits, and not the class-membership probabilities, is the preferred way because it is more numerically stable. Set from_logits=True to do this.
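
To see why working from logits can be more stable, the sketch below compares two ways of computing binary cross-entropy in plain Python. The logit form uses the rearrangement `max(z, 0) - z*y + log(1 + exp(-|z|))`, which is the formula given in the TensorFlow documentation for `tf.nn.sigmoid_cross_entropy_with_logits`; whether a particular Keras version routes through it internally is an assumption here. The function names are hypothetical.

```python
import math


def bce_from_probability(y_true, p):
    """Binary cross-entropy computed from a class-1 probability p."""
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1.0 - p))


def bce_from_logit(y_true, z):
    """Binary cross-entropy computed directly from the logit z.

    Algebraically equal to bce_from_probability(y_true, sigmoid(z)),
    but never takes the log of a value that has rounded to 0.
    """
    return max(z, 0.0) - z * y_true + math.log1p(math.exp(-abs(z)))


z = 0.8
p = 1.0 / (1.0 + math.exp(-z))
print(f"from probability: {bce_from_probability(1, p):.4f}")
print(f"from logit:       {bce_from_logit(1, z):.4f}")
```

For a moderate logit such as 0.8 both forms agree (about 0.3711). For an extreme case such as a logit of 40 with label 0, the probability rounds to exactly 1.0 in double precision, so `log(1 - p)` fails, while the logit form still returns a loss of about 40.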

In [None]:
import tensorflow_datasets as tfds
import tensorflow as tf

In [None]:
####Binary Crossentropy###

bce_probas = tf.keras.losses.BinaryCrossentropy(from_logits=False)
bce_logits = tf.keras.losses.BinaryCrossentropy(from_logits=True)

logits = tf.constant([0.8])
probas = tf.keras.activations.sigmoid(logits)

tf.print('BCE (with probabilities): {:.4f}'.format(bce_probas(y_true=[1], y_pred=probas)),
         'BCE (with Logits): {:.4f}'.format(bce_logits(y_true=[1], y_pred=logits)))

BCE (with probabilities): 0.3711 BCE (with Logits): 0.3711


In [None]:
####### Binary Crossentropy
bce_probas = tf.keras.losses.BinaryCrossentropy(from_logits=False)
bce_logits = tf.keras.losses.BinaryCrossentropy(from_logits=True)

logits = tf.constant([0.8])
probas = tf.keras.activations.sigmoid(logits)

# Binary cross-entropy computed from probabilities and from logits gives the same value.
tf.print(
    'BCE (w Probas): {:.4f}'.format(bce_probas(y_true=[1], y_pred=probas)),
    '(w Logits): {:.4f}'.format(bce_logits(y_true=[1], y_pred=logits)))

BCE (w Probas): 0.3711 (w Logits): 0.3711


In [None]:
tf.__version__

'2.3.0'

In [None]:
####### Categorical Crossentropy
cce_probas = tf.keras.losses.CategoricalCrossentropy(
    from_logits=False)
cce_logits = tf.keras.losses.CategoricalCrossentropy(
    from_logits=True)

logits = tf.constant([[1.5, 0.8, 2.1]])
probas = tf.keras.activations.softmax(logits)

# y_true is one-hot encoded and needs a batch dimension (double brackets)
# so that its shape matches y_pred.
tf.print(
    'CCE (w Probas): {:.4f}'.format(
    cce_probas(y_true=[[0, 0, 1]], y_pred=probas)),
    '(w Logits): {:.4f}'.format(
    cce_logits(y_true=[[0, 0, 1]], y_pred=logits)))

CCE (w Probas): 0.5996 (w Logits): 0.5996


In [None]:
####### Sparse Categorical Crossentropy
# Reuses logits and probas from the categorical cross-entropy cell; y_true is the class index.
sp_cce_probas = tf.keras.losses.SparseCategoricalCrossentropy(
    from_logits=False)
sp_cce_logits = tf.keras.losses.SparseCategoricalCrossentropy(
    from_logits=True)

tf.print(
    'Sparse CCE (w Probas): {:.4f}'.format(
    sp_cce_probas(y_true=[2], y_pred=probas)),
    '(w Logits): {:.4f}'.format(
    sp_cce_logits(y_true=[2], y_pred=logits)))

Sparse CCE (w Probas): 0.5996 (w Logits): 0.5996
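
The two multiclass results above agree because they measure the same quantity with different label encodings. As a hand check in plain Python (a sketch, not the Keras implementation; the function names are chosen here for illustration), the loss for the same logits is the negative log of the softmax probability assigned to the true class:

```python
import math


def softmax(logits):
    """Convert logits to probabilities; subtracting the max avoids overflow."""
    shift = max(logits)
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def categorical_cross_entropy(one_hot, probas):
    """Labels given as a one-hot vector, e.g. [0, 0, 1]."""
    return -sum(t * math.log(p) for t, p in zip(one_hot, probas))


def sparse_categorical_cross_entropy(class_index, probas):
    """Labels given as an integer class index, e.g. 2."""
    return -math.log(probas[class_index])


logits = [1.5, 0.8, 2.1]
probas = softmax(logits)
cce = categorical_cross_entropy([0, 0, 1], probas)
sparse_cce = sparse_categorical_cross_entropy(2, probas)
print(f"CCE: {cce:.4f}  Sparse CCE: {sparse_cce:.4f}")
```

Both values come out to about 0.5996, matching the Keras outputs. The sparse variant simply skips building the one-hot vector.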


**Gradient Tape:**

Optimizing neural networks requires computing the gradients of the cost with respect to the NN weights. This is required for optimization algorithms such as stochastic gradient descent (SGD).

GradientTape is used to compute gradients.

In order to compute the gradients of the tensors, we have to record the computations via:

**tf.GradientTape**
GradientTape allows for automatic differentiation.
**Raschka, page 483**


In [None]:
#Compute z = wx + b
#Define the loss as the squared loss between the target and the prediction:
#Loss = (y - z)^2

#Defining the model parameters w and b
w = tf.Variable(1.0)
b = tf.Variable(0.5)
print(w.trainable, b.trainable)

True True


In [None]:
#Defining the input x and y as tensors
x = tf.convert_to_tensor([1.4])
y = tf.convert_to_tensor([2.1])
with tf.GradientTape() as tape: #tape object where the operations will be recorded
  z = tf.add(tf.multiply(w, x), b)
  loss = tf.reduce_sum(tf.square(y - z))

dloss_dw = tape.gradient(loss, w)
tf.print('dL/dw:', dloss_dw)

dL/dw: -0.559999764


In [None]:
#verifying the computed gradient
tf.print(2*x*(w*x+b-y))

[-0.559999764]
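
A second, independent check of the same gradient is a finite-difference approximation. The sketch below uses plain Python floats instead of TensorFlow tensors, with the same values as above (w = 1.0, b = 0.5, x = 1.4, y = 2.1); the helper names and the step size `eps` are choices made for this illustration.

```python
def loss(w, b, x=1.4, y=2.1):
    """Squared loss (y - z)^2 with z = w*x + b."""
    z = w * x + b
    return (y - z) ** 2


def numerical_grad_w(w, b, eps=1e-6):
    """Central finite-difference estimate of dLoss/dw."""
    return (loss(w + eps, b) - loss(w - eps, b)) / (2.0 * eps)


w, b, x, y = 1.0, 0.5, 1.4, 2.1
analytic = 2 * x * (w * x + b - y)   # closed-form derivative
numeric = numerical_grad_w(w, b)
print(f"analytic: {analytic:.6f}  numeric: {numeric:.6f}")
```

Both estimates are about -0.56. The value printed by `tape.gradient` differs only in the last digits because TensorFlow uses 32-bit floats by default.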


The universal approximation theorem states that a feedforward NN with a single hidden layer and a relatively large number of hidden units can approximate arbitrary continuous functions relatively well. 
**Raschka, page 493**

MIT Deep Reinforcement Learning

In **supervised learning** there's a dataset of inputs x with labels y, and our goal is to learn a model that predicts the label y.

In **unsupervised learning** there's only data with no labels and our goal is to find underlying structure in the data. 

In **reinforcement learning** the data is given as state-action pairs. Our goal is to maximize the future rewards over many time steps.

**Agent:**
*   The central part of the reinforcement learning algorithm.
*   Is the neural network.
*   Example: a drone is the agent.

**Environment:**
*   The world in which the agent acts.

The agent sends commands to the environment in the form of actions. 
Action space A: the set of possible actions an agent can take in the environment.
In return, the environment sends observations back to the agent as a **state**. A state is a situation that the agent perceives.

The goal of reinforcement learning is for the agent to maximize its rewards in this environment. At every step, the agent also gets back a reward from the environment. The **reward** is feedback that measures the success or failure of the agent's action.

The total reward is the sum of all rewards collected from some time t onward: $R_t = r_t + r_{t+1} + r_{t+2} + \dots$

The **discounted reward** is obtained by multiplying each reward by a power of the discount factor $\gamma$ (gamma): $R_t = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \dots$ This discounts future rewards so they don't count as much as the current reward. The discount factor is typically between 0 and 1.
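
A short sketch of the difference between the total and the discounted reward, assuming the common convention that the reward i steps in the future is weighted by the discount factor raised to the power i. The reward sequence and the factor 0.9 are made up, and `discounted_return` is a hypothetical helper name.

```python
def discounted_return(rewards, gamma):
    """Sum of rewards where the reward i steps ahead is weighted by gamma**i."""
    return sum((gamma ** i) * r for i, r in enumerate(rewards))


rewards = [1.0, 1.0, 1.0, 1.0]                 # hypothetical rewards over four steps
total = sum(rewards)                           # undiscounted total reward
discounted = discounted_return(rewards, 0.9)   # 1 + 0.9 + 0.81 + 0.729
print(f"total: {total:.3f}  discounted: {discounted:.3f}")
```

With a discount factor of 1 the two quantities coincide; the closer the factor is to 0, the more the agent favors immediate rewards.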

**Q function:**
$Q(s_t, a_t) = \mathbb{E}[R_t \mid s_t, a_t]$
The Q function captures the expected total future reward an agent in state, s, can receive by executing a certain action, a.
High Q value = we're taking an action that's desirable at that state.
Low Q value = we're taking an action that's undesirable at that state.

Ultimately the agent needs a **policy pi(s)** to infer the best action to take at its state, s. 

Strategy: the policy should choose an action that maximizes future reward.

Evaluate the Q function over all possible actions and pick the action with the highest Q value; that choice is our policy: $\pi^*(s) = \arg\max_a Q(s, a)$
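
As a toy illustration of turning Q values into a policy, the sketch below picks the action with the largest Q value for one state. The action names and Q values are invented, and `greedy_policy` is a hypothetical helper name.

```python
def greedy_policy(q_values):
    """Return the action with the highest Q value for a single state.

    q_values maps each action to its estimated Q(s, a) for that state.
    """
    return max(q_values, key=q_values.get)


# Hypothetical Q values for one state of a three-action problem.
q_for_state = {"left": 0.2, "stay": 1.3, "right": 0.7}
action = greedy_policy(q_for_state)
print(action)
```

In this example the policy selects "stay", because it has the highest expected future reward in this state.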

-------Value Learning-------
Deep Q Networks (DQN):
Action + State --> Expected Return
*   This is time-intensive because the Q value of each action has to be calculated separately.

Alternative:
State --> Expected Return for each action
*   Only needs to be executed once, and we get the Q values for every single action.
*   We look at the Q values, pick the maximum, and take that action.

Q-Loss function:
(Target Q value - Predicted Q value)^2
*   The loss function is the mean squared error between the target Q value and the Q value predicted by the network.
*   Take the argmax of the Q values in a state to figure out the action.
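
A minimal sketch of the Q-loss on a tiny batch, in plain Python. It assumes the standard Q-learning target, reward plus the discounted maximum Q value of the next state, which the notes above do not spell out. The transitions, the discount factor, and the function names are all hypothetical.

```python
def q_target(reward, next_q_values, gamma, done=False):
    """Assumed target: reward + gamma * max Q(next state, a')."""
    if done:
        return reward  # no future reward after a terminal state
    return reward + gamma * max(next_q_values)


def q_loss(transitions, gamma):
    """Mean squared error between target and predicted Q values."""
    squared_errors = []
    for predicted_q, reward, next_q_values, done in transitions:
        target = q_target(reward, next_q_values, gamma, done)
        squared_errors.append((target - predicted_q) ** 2)
    return sum(squared_errors) / len(squared_errors)


# Each transition: (predicted Q(s, a), reward, Q values of next state, done flag)
transitions = [
    (1.0, 0.5, [0.2, 1.0], False),  # target = 0.5 + 0.9 * 1.0 = 1.4
    (0.0, 1.0, [0.0, 0.0], True),   # terminal step, target = 1.0
]
loss = q_loss(transitions, gamma=0.9)
print(f"Q-loss: {loss:.4f}")
```

In a DQN the predicted Q values come from the network and the loss is minimized with gradient descent; here they are fixed numbers so the arithmetic can be followed by hand.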

Caveats of Q-Learning:
*   Doesn't do well with complex action spaces. Q-learning is well suited to cases where the action space is discrete and small. It cannot handle continuous action spaces.
*   The policy is deterministic because we always pick the action that maximizes the Q value, so Q-learning cannot learn stochastic policies.

-------Policy Learning-------
Instead of calculating the Q function and then determining the policy from it, we take the state as input and have the network output the policy directly.
So, the network outputs a probability distribution over the space of all actions given that state. The probability distribution is the policy, and we choose an action by sampling from it.
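
The sampling step can be sketched as follows. The logits stand in for the output of a policy network for one state and are invented for this example; a softmax turns them into a probability distribution, and an action index is drawn from it. The seed is fixed only to make the run repeatable.

```python
import math
import random


def softmax(logits):
    """Turn raw scores into a probability distribution over actions."""
    shift = max(logits)
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_action(action_probs, rng):
    """Draw one action index according to the policy's probabilities."""
    return rng.choices(range(len(action_probs)), weights=action_probs, k=1)[0]


rng = random.Random(0)
logits = [2.0, 0.5, -1.0]     # hypothetical policy-network outputs
probs = softmax(logits)

# Sampling many times shows that likely actions are chosen often,
# while unlikely actions are still explored occasionally.
counts = [0, 0, 0]
for _ in range(10_000):
    counts[sample_action(probs, rng)] += 1
print([f"{p:.3f}" for p in probs], counts)
```

Unlike the argmax used in Q-learning, this policy is stochastic: the same state can lead to different actions on different visits.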

**Limitations of Deep Learning:**

Universal Approximation Theorem:
*   A feedforward network with a single hidden layer is sufficient to approximate, to an arbitrary precision, any continuous function.
*   Caveats:
    *   This theorem makes no guarantees about the number of hidden units or the size of the hidden layer that would be required to make the approximation. It also gives no guidance on how to find the weights and optimize the network for the task.

Neural Network Limitations:
*   Very data hungry
*   Computationally intensive (GPUs needed)
*   Easily fooled by adversarial examples
*   Can be subject to algorithmic bias
*   Difficult to encode structure and prior knowledge during learning
*   Poor at representing uncertainty (how do you know what the model knows?)
*   Uninterpretable black boxes, difficult to trust
*   Finicky to optimize
*   Often require expert knowledge to design and fine-tune architectures


