## Q1 LSTM vs GRU
https://medium.com/mindboard/lstm-vs-gru-experimental-comparison-955820c21e8b

## Q2 Why don't we initialize the weights of a neural network to zero?

https://www.quora.com/Why-dont-we-initialize-the-weights-of-a-neural-network-to-zero

## Q3 Why should the weights and biases be initialized around 0?

If all of the weights are the same, every neuron in a layer computes the same output and receives the same error, so the model will not learn anything useful: there is no source of asymmetry between the neurons. <br>

What we could do, instead, is to keep the weights very close to zero but make them different by initializing them to small, non-zero random numbers. <br>

A potential issue is that, with random initial weights, the variance of each neuron's output grows with the number of inputs. A common additional step is to normalize the neuron's output variance to 1 by dividing its weights by sqrt(d), where d is the number of inputs to the neuron. The resulting weights are normally distributed with mean 0 and standard deviation $$1/\sqrt{d}$$, so most of them fall within a few multiples of $$[-1/\sqrt{d}, 1/\sqrt{d}]$$.

        self.weights_input_to_hidden = np.random.normal(0.0, self.input_nodes**-0.5, 
                                       (self.input_nodes, self.hidden_nodes))

        self.weights_hidden_to_output = np.random.normal(0.0, self.hidden_nodes**-0.5, 
                                       (self.hidden_nodes, self.output_nodes))
                                       
   Please see Joe/deep-learning/weight-initialization/weight_initialization.ipynb for more detailed info on different types of weight initialization.
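
The effect of the `sqrt(d)` scaling can be checked numerically. The following is a minimal sketch, assuming unit-variance inputs and a single linear neuron; the helper name `preactivation_std` is made up for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def preactivation_std(d, scale, n_samples=2000):
    """Std of z = x . w for unit-variance inputs x and weights w ~ N(0, scale^2)."""
    x = rng.normal(0.0, 1.0, size=(n_samples, d))
    w = rng.normal(0.0, scale, size=d)
    return (x @ w).std()

for d in (10, 100, 1000):
    unscaled = preactivation_std(d, scale=1.0)       # grows roughly like sqrt(d)
    scaled = preactivation_std(d, scale=d ** -0.5)   # stays close to 1
    print(f"d={d:5d}  unscaled std: {unscaled:6.2f}  scaled std: {scaled:5.2f}")
```

With unscaled weights the spread of the neuron's input grows with the number of inputs, which can push saturating activations such as sigmoid into their flat regions; the scaled version keeps it near 1 regardless of `d`.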
   
   #### tf weight initialization
   
     softmax_w = tf.Variable(tf.truncated_normal((in_size, out_size), stddev=0.1))
     softmax_b = tf.Variable(tf.zeros(out_size))

## Q4 When should I use the Normal distribution or the Uniform distribution when using Xavier initialization?

[intro](https://prateekvjoshi.com/2016/03/29/understanding-xavier-initialization-in-deep-neural-networks/)


[original paper](http://proceedings.mlr.press/v9/glorot10a/glorot10a.pdf)

## Q5 Activation Functions: 

This covers sigmoid, tanh, softmax, ReLU, and Leaky ReLU.


#### Why do we use activation functions in neural networks?

An activation function determines the output of a neuron (and ultimately of the network), for example a yes-or-no decision. It maps the resulting values into a range such as 0 to 1 or -1 to 1 (depending on the function).
Activation functions can be divided into 2 basic types:

1. Linear Activation Function
2. Non-linear Activation Functions
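
As a rough illustration of the functions named in this section, here is a minimal NumPy sketch. The `alpha=0.01` slope for Leaky ReLU is a commonly used default chosen here as an assumption, not something taken from the linked articles.

```python
import numpy as np

def linear(x):
    return x                                   # identity: output range is unbounded

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))            # squashes into (0, 1)

def tanh(x):
    return np.tanh(x)                          # squashes into (-1, 1)

def relu(x):
    return np.maximum(0.0, x)                  # zero for negatives, identity otherwise

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)       # small slope instead of zero for negatives

def softmax(x):
    e = np.exp(x - np.max(x))                  # subtract the max for numerical stability
    return e / e.sum()                         # non-negative values that sum to 1

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
for fn in (linear, sigmoid, tanh, relu, leaky_relu, softmax):
    print(f"{fn.__name__:>10}: {np.round(fn(x), 3)}")
```

Sigmoid and tanh are bounded, ReLU and Leaky ReLU are not, and softmax turns a whole vector into a probability distribution rather than acting on each element independently.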

[cs231n](http://cs231n.github.io/neural-networks-1/)

[Detail Of Activation Function](https://towardsdatascience.com/activation-functions-neural-networks-1cbd9f8d91d6)

## Q6 Why do we use convolutions for images rather than just FC layers?

This answer has two parts. First, convolutions preserve, encode, and actually use the **spatial information** in the image. If we used only FC layers, we would have no relative spatial information. Second, Convolutional Neural Networks (CNNs) have partially built-in translation invariance, since each convolution kernel acts as its own filter/feature detector.

Let us first understand what convolution means.

### Convolution


Convolution is typically the first layer, used to extract features from an input image. It preserves the relationship between pixels by learning image features from small squares of input data. Mathematically, it is an operation that takes two inputs: an image matrix and a filter (or kernel).

<img src="MyPythonCode/resources/Convolution.png">

Convolving an image with different filters can perform operations such as edge detection, blurring, and sharpening.
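
To make the filter idea concrete, here is a small sketch of a "valid" 2D convolution written with plain loops. The 6x6 test image and the two kernels (a Prewitt-style vertical edge detector and a box blur) are illustrative choices, not taken from the figure above. As in most deep learning libraries, the kernel is not flipped, so strictly speaking this is cross-correlation.

```python
import numpy as np

def conv2d(image, kernel):
    """Slide the kernel over the image (no padding, stride 1) and sum the products."""
    kh, kw = kernel.shape
    oh = image.shape[0] - kh + 1
    ow = image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# Toy image: dark left half, bright right half.
image = np.zeros((6, 6))
image[:, 3:] = 1.0

vertical_edge = np.array([[-1.0, 0.0, 1.0]] * 3)   # responds to left-to-right brightness jumps
box_blur = np.full((3, 3), 1.0 / 9.0)              # averages each 3x3 neighbourhood

edges = conv2d(image, vertical_edge)
blurred = conv2d(image, box_blur)
print(edges[0])    # non-zero only where the window straddles the boundary
print(blurred[0])  # the sharp step becomes a gradual ramp
```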

**Non Linearity (ReLU)**

ReLU stands for Rectified Linear Unit, a non-linear operation. Its output is ƒ(x) = max(0, x).

One way ReLUs improve neural networks is by speeding up training. The gradient computation is very simple (either 0 or 1 depending on the sign of x). Also, the computational step of a ReLU is easy: any negative elements are set to 0.0 -- no exponentials, no multiplication or division operations.

Gradients of logistic (sigmoid) and hyperbolic tangent (tanh) units are smaller than the gradient of the positive portion of the ReLU. This means that the positive portion is updated more rapidly as training progresses. However, this comes at a cost. The 0 gradient on the left-hand side has its own problem, called "dead neurons," in which a gradient update sets the incoming values to a ReLU such that the output is always zero; modified ReLU units such as ELU (or Leaky ReLU, PReLU, etc.) can mitigate this.

$$ \frac{d}{dx}\text{ReLU}(x) = 1 \quad \forall x > 0 $$

By contrast, the gradient of a sigmoid unit is at most 0.25; on the other hand, tanh fares better for inputs in a region near 0, since

$$ 0.25 < \frac{d}{dx}\tanh(x) \le 1 \quad \forall x \in [-1.31, 1.31] $$
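
These bounds can be checked numerically. The sketch below evaluates the analytic derivatives on a grid; the grid range and resolution are arbitrary choices.

```python
import numpy as np

x = np.linspace(-5.0, 5.0, 10001)

sig = 1.0 / (1.0 + np.exp(-x))
d_sigmoid = sig * (1.0 - sig)          # derivative of the sigmoid
d_tanh = 1.0 - np.tanh(x) ** 2         # derivative of tanh
d_relu = (x > 0).astype(float)         # 1 for positive inputs, 0 otherwise

inside = np.abs(x) <= 1.31
print("max sigmoid gradient:", d_sigmoid.max())
print("min tanh gradient on [-1.31, 1.31]:", d_tanh[inside].min())
print("ReLU gradient values:", np.unique(d_relu))
```

The sigmoid gradient peaks at x = 0, while the tanh gradient stays above 0.25 only inside the stated interval and decays quickly outside it.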

Why ReLU is important: ReLU's purpose is to introduce non-linearity into our ConvNet. Convolution by itself is a linear operation, whereas most of the real-world data we want our ConvNet to learn is non-linear.

**What makes CNNs translation invariant?**
As explained above, each convolution kernel acts as its own filter/feature detector. So, say you're doing object detection: it doesn't matter where in the image the object is, since the convolution is applied in a sliding-window fashion across the entire image anyway.
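
A small experiment illustrates the sliding-window argument. This is a sketch with a hand-made 2x2 "object" and a kernel identical to it (both hypothetical): the response map moves with the object, and taking the global maximum of that map gives a value that does not depend on the object's position.

```python
import numpy as np

def conv2d(image, kernel):
    """Valid-mode sliding-window filter (no padding, stride 1)."""
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    return np.array([[np.sum(image[i:i + kh, j:j + kw] * kernel)
                      for j in range(ow)] for i in range(oh)])

pattern = np.array([[1.0, 0.0],
                    [0.0, 1.0]])      # the "object" to detect
kernel = pattern.copy()               # a filter tuned to that object

def place(top, left, size=8):
    """Return an empty image with the pattern placed at (top, left)."""
    img = np.zeros((size, size))
    img[top:top + 2, left:left + 2] = pattern
    return img

resp_a = conv2d(place(1, 1), kernel)
resp_b = conv2d(place(4, 5), kernel)

peak_a = np.unravel_index(resp_a.argmax(), resp_a.shape)
peak_b = np.unravel_index(resp_b.argmax(), resp_b.shape)
print(peak_a, peak_b)                 # the peak follows the object
print(resp_a.max(), resp_b.max())     # the strongest response is the same
```

The convolution on its own only moves the response along with the object; pooling over positions is what turns that into a position-independent output.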


## Q7 What is batch normalization

Batch normalization is a technique for improving the performance and stability of neural networks. The idea is to normalize the layer inputs such that they have a **mean of zero and variance of one**, much like how we standardize the inputs to networks. Batch normalization is also a key ingredient in getting DCGANs to train. Detailed code is in *Joe/deep-learning/batch-norm/Batch_Normalization_Lesson.ipynb*
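
The normalization step can be sketched in a few lines of NumPy. This is a training-mode forward pass only, under the assumption of a 2-D `(batch, features)` input; `gamma` and `beta` stand for the learnable scale and shift, and the running statistics used at inference time are left out.

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """Normalize each feature over the batch, then apply a learnable scale and shift."""
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)   # zero mean, (almost exactly) unit variance
    return gamma * x_hat + beta

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=3.0, size=(64, 4))   # badly scaled layer inputs

out = batch_norm_forward(x, gamma=np.ones(4), beta=np.zeros(4))
print(out.mean(axis=0).round(6))
print(out.var(axis=0).round(4))
```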

### Mainak's Question
#### HTR
##### 1. In Handwriting recognition system what is the usage of LSTM and CNN?
##### 2. What is the kernel Function, activation function and loss function used?
##### 3. What is the ground truth used?


##### What is soft clustering in LDA

#### What is Dirichlet distribution

#### How can NMF be used in LDA

### Q8. Top 3 use cases of NLP
https://pub.towardsai.net/top-3-nlp-use-cases-a-data-scientist-should-know-637eacc3d1d4

### Q9. Explain the General Ideas of Word Embeddings
https://medium.com/analytics-vidhya/word-embeddings-in-nlp-word2vec-glove-fasttext-24d4d4286a73
https://towardsdatascience.com/short-technical-information-about-word2vec-glove-and-fasttext-d38e4f529ca8

Word embeddings are word vector representations where words with similar meaning have similar representation. 

Word vectors are much better ways to represent words than one hot encoded vectors. Word vectors consume much less space than one hot encoded vectors and they also maintain semantic representation of word.
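
A toy comparison shows the difference. The three-dimensional "embeddings" below are hand-picked hypothetical values for illustration only; real embeddings are learned and usually have hundreds of dimensions.

```python
import numpy as np

vocab = ["king", "queen", "apple"]
one_hot = np.eye(len(vocab))          # one dimension per vocabulary word

# Hypothetical dense vectors, chosen by hand so that related words point the same way.
dense = {
    "king": np.array([0.90, 0.80, 0.10]),
    "queen": np.array([0.85, 0.90, 0.15]),
    "apple": np.array([0.10, 0.05, 0.95]),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print("one-hot king vs queen:", cosine(one_hot[0], one_hot[1]))
print("dense   king vs queen:", round(cosine(dense["king"], dense["queen"]), 3))
print("dense   king vs apple:", round(cosine(dense["king"], dense["apple"]), 3))
```

One-hot vectors are all mutually orthogonal, so they carry no notion of similarity, and their length grows with the vocabulary. Dense vectors keep a fixed, small dimension and can place related words close together.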

There are three “classical” flavors of word embeddings: **Word2Vec, GloVe, and FastText**.

**Word2Vec**
Word2Vec has two architectures: CBOW (Continuous Bag of Words) and Skip-gram.

**Continuous Bag-of-Words (CBOW)**
The model predicts the most likely word given its context. Words that are equally likely to appear in the same contexts are therefore considered similar and end up close together in the embedding space.

**Skip-gram**
This architecture is similar to that of CBOW, but instead the model works the other way around. The model predicts the context using the given word.
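
The difference between the two architectures is easiest to see in the training examples they are built from. The sketch below only generates those examples, assuming whitespace tokenization and a symmetric context window; the network that consumes them is omitted, and the function names are made up for this illustration.

```python
def context_windows(tokens, window=2):
    """Yield (center word, list of surrounding words) for every position."""
    for i, center in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        yield center, left + right

def cbow_examples(tokens, window=2):
    """CBOW: input is the whole context, target is the center word."""
    return [(context, center) for center, context in context_windows(tokens, window)]

def skipgram_examples(tokens, window=2):
    """Skip-gram: input is the center word, target is one context word at a time."""
    return [(center, ctx)
            for center, context in context_windows(tokens, window)
            for ctx in context]

tokens = "the quick brown fox jumps".split()
print(cbow_examples(tokens, window=1)[:2])
print(skipgram_examples(tokens, window=1)[:3])
```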


**GloVe**
Both Word2Vec architectures are predictive models. They ignore the fact that some context words occur more often than others, and they only consider the local context, so they fail to capture the global context.

The GloVe model is trained on an aggregated global word-to-word co-occurrence matrix built from a given collection of text documents. This co-occurrence matrix is factorized to form denser, more expressive vector representations.
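
The first step, building the global co-occurrence counts, can be sketched as follows. This covers only the counting, with a made-up two-sentence corpus and a plain symmetric window; GloVe itself additionally weights pairs by distance and then fits word vectors to the logarithm of these counts.

```python
from collections import Counter

def cooccurrence_counts(sentences, window=2):
    """Count how often each ordered (word, neighbour) pair occurs within the window."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            start = max(0, i - window)
            stop = min(len(tokens), i + window + 1)
            for j in range(start, stop):
                if j != i:
                    counts[(word, tokens[j])] += 1
    return counts

corpus = ["the king rules the land", "the queen rules the land"]
counts = cooccurrence_counts(corpus, window=1)

print(counts[("rules", "the")])    # aggregated over the whole corpus
print(counts[("king", "rules")])
print(counts[("queen", "rules")])
```

Because the counts are aggregated over every document, "king" and "queen" end up with similar rows in the matrix even though they never appear in the same sentence.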

<img src='Glove.png' height=300 width=500>

Here, we can see the pairs formed of man and woman, queen and king, uncle and aunt and other pairs. So, it’s able to differentiate the concept of sex or gender.

**FastText**
One of the main disadvantages of Word2Vec and GloVe embedding is that they are unable to encode unknown or out-of-vocabulary words.

It is an extension of Word2Vec and follows the same Skip-gram and CBOW models. But unlike Word2Vec, which feeds whole words into the neural network, FastText first breaks each word into several sub-words (character n-grams) and then feeds those into the neural network.

For example, if n is 3 and the word is ‘apple’, the tri-grams are [‘<ap’, ‘app’, ‘ppl’, ‘ple’, ‘le>’], and the word embedding is the sum of the vector representations of these tri-grams.

With this approach, unknown words can still be represented as vectors, because there is a high probability that their n-grams also appear in other words.
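
The sub-word decomposition can be sketched directly. In the code below, `ngram_vector` is a hypothetical stand-in for a learned n-gram embedding table (it just derives a fixed pseudo-random vector from the characters), and a single n-gram length is used, whereas the original FastText combines a range of lengths with the whole word.

```python
import numpy as np

def char_ngrams(word, n=3):
    """Character n-grams of the word wrapped in '<' and '>' boundary markers."""
    padded = f"<{word}>"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

def ngram_vector(ngram, dim=8):
    """Hypothetical stand-in for a learned embedding: a fixed vector per n-gram."""
    seed = sum(ord(c) * 31 ** k for k, c in enumerate(ngram)) % (2 ** 32)
    return np.random.default_rng(seed).normal(size=dim)

def word_vector(word, n=3, dim=8):
    """Word embedding as the sum of its n-gram vectors."""
    return sum(ngram_vector(g, dim) for g in char_ngrams(word, n))

print(char_ngrams("apple"))
shared = set(char_ngrams("apple")) & set(char_ngrams("applesauce"))
print(sorted(shared))                       # n-grams an unseen word shares with a known one
print(word_vector("applesauce").shape)
```

An out-of-vocabulary word such as "applesauce" still gets a vector, because several of its n-grams were already seen inside "apple".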

### Q10. Drawbacks of word embedding

https://medium.com/@kashyapkathrani/all-about-embeddings-829c8ff0bf5b

https://towardsdatascience.com/from-pre-trained-word-embeddings-to-pre-trained-language-models-focus-on-bert-343815627598

https://towardsdatascience.com/beyond-word-embeddings-part-2-word-vectors-nlp-modeling-from-bow-to-bert-4ebd4711d0ec

http://ai.stanford.edu/blog/contextual/

https://jalammar.github.io/illustrated-bert/

https://towardsdatascience.com/explainable-data-efficient-text-classification-888cc7a1af05

These word embedding techniques give decent results, but the approach is not accurate enough: they do not take into account the order in which words appear, which leads to a loss of syntactic and semantic understanding of the sentence.

For example, consider “You are going there to teach not play.” and “You are going there to play not teach.” If a sentence is represented by combining its word vectors without regard to order (for example, by averaging them), both sentences get the same representation in the vector space, but they don’t mean the same thing.
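
This loss of word order is easy to reproduce when a sentence is represented as the average of its word vectors. In the sketch below, `toy_vector` is a hypothetical stand-in for a pre-trained embedding lookup; any fixed word-to-vector mapping would behave the same way.

```python
import numpy as np

def toy_vector(word, dim=16):
    """Hypothetical embedding lookup: a fixed pseudo-random vector per word."""
    seed = sum(ord(c) * 31 ** k for k, c in enumerate(word)) % (2 ** 32)
    return np.random.default_rng(seed).normal(size=dim)

def average_embedding(sentence):
    """Sentence vector as the mean of its word vectors (word order is discarded)."""
    words = sentence.lower().replace(".", "").split()
    return np.mean([toy_vector(w) for w in words], axis=0)

a = average_embedding("You are going there to teach not play.")
b = average_embedding("You are going there to play not teach.")
print(np.allclose(a, b))
```

Both sentences contain exactly the same words, so their averaged vectors coincide even though the meanings are opposite.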

Also, word embedding models struggle on large amounts of text because the same word may have different meanings in different sentences, depending on the context.
For example, “I have scuba diving in my bucket list.” and “There is a bucket filled with drinking water.” In these two sentences, the word “bucket” has different meanings.

So, we require a kind of representation which can retain the contextual meaning of the word present in a sentence.

**Sentence Embedding**
Sentence embeddings are similar to word embeddings, but instead of words they encode a whole sentence into a vector representation. The resulting vector retains good properties by inheriting them from the underlying word embeddings.

Some of the state-of-the-art models for sentence embedding are **ELMo, InferSent and SBERT**


Differences between GPT, ELMo, and BERT (all pre-trained model architectures): BERT uses a bidirectional Transformer, GPT uses a left-to-right Transformer, and ELMo uses the concatenation of independently trained left-to-right and right-to-left LSTMs to generate features for downstream tasks.

<img src='word_embedding.png'>

### Q11. How can Convolutional Neural Networks be used for NLP

http://www.wildml.com/2015/11/understanding-convolutional-neural-networks-for-nlp/

https://www.kaggle.com/rizdelhi/end-to-end-natural-language-processing-3



### Q12.  Explain BERT

https://towardsdatascience.com/bert-explained-state-of-the-art-language-model-for-nlp-f8b21a9b6270


### Q12. What is an auto-encoder? Why do we "auto-encode"?

#### What is it?

<img src="MyPythonCode/resources/autoencoder-1.png">

With autoencoders, we pass input data through an encoder that makes a compressed representation of the input. Then, this representation is passed through a decoder to reconstruct the input data. Generally the encoder and decoder will be built with neural networks, then trained on example data.


Let's see a quick example:

The data is the famous MNIST dataset. Each MNIST image is a scan of a handwritten digit at 28x28 pixels, so our "inputs" have 28x28 = 784 dimensions. We train an autoencoder using just 25 hidden neurons:

On the left are some of our original input points; on the right is what the autoencoder can reconstruct from the 25 dimensions in the middle layer.
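
The 784 → 25 → 784 bottleneck can be sketched without downloading MNIST. The example below swaps in two simplifications, both assumptions of this sketch: synthetic data that lies near a 25-dimensional subspace, and a purely linear autoencoder whose optimal weights are obtained in closed form from an SVD (the PCA solution) instead of by gradient training. A real autoencoder would use non-linear layers trained on the actual images.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for MNIST: 500 "images" of 784 values near a 25-dim subspace.
n_samples, n_inputs, n_hidden = 500, 784, 25
latent = rng.normal(size=(n_samples, n_hidden))
mixing = rng.normal(size=(n_hidden, n_inputs))
x = latent @ mixing + 0.01 * rng.normal(size=(n_samples, n_inputs))

mean = x.mean(axis=0)
# For a linear encoder/decoder pair, the best squared-error solution
# projects onto the top principal directions of the data.
_, _, vt = np.linalg.svd(x - mean, full_matrices=False)
components = vt[:n_hidden]                    # shape (25, 784)

def encode(batch):
    return (batch - mean) @ components.T      # 784 -> 25

def decode(codes):
    return codes @ components + mean          # 25 -> 784

codes = encode(x)
reconstruction = decode(codes)
relative_error = np.linalg.norm(x - reconstruction) / np.linalg.norm(x - mean)
print(codes.shape, reconstruction.shape, relative_error)
```

The 25 numbers in `codes` are the compressed representation; how well the decoder recovers the input from them depends on how much of the data's structure fits into that many dimensions.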



#### Use of Auto-encoder
1. This shows that we can represent each MNIST digit as a vector in 25 dimensions, and this is where the utility of an autoencoder becomes clear: **it is a feature extraction algorithm that helps us find a representation for our data**. **In some cases the features generated by the autoencoder represent the data points better than the points themselves, that's the key!**

2. Autoencoders can be stacked and trained progressively: we train one autoencoder, then take the middle layer it produces and use it as input to another AE or to a classifier.

3. In practice, autoencoders aren't actually better at compression compared to typical methods like JPEGs and MP3s. But, they are being used for noise reduction.

4. Some input features may be redundant or correlated, which wastes processing time and encourages overfitting in our model (too many parameters). A compressed representation reduces this redundancy.

5. An autoencoder has a lot of freedom, which usually means it can overfit the data because it has too many ways to represent it. To constrain this we can use sparse autoencoders, where a sparsity penalty (a penalty on non-sparse hidden activations) is added to the cost function. In practice, autoencoders are often used in this sparse form. Autoencoders can be magical, but they come with hyper-parameters, and finding good values for those can be time-consuming.
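
One common form of the sparsity penalty mentioned in point 5 is a KL divergence between a small target activation `rho` and the average activation of each hidden unit. The sketch below assumes sigmoid hidden units (activations in (0, 1)); the values `rho=0.05` and `beta=3.0` are illustrative hyper-parameters, not recommendations from these notes.

```python
import numpy as np

def kl_sparsity_penalty(hidden, rho=0.05, eps=1e-8):
    """Sum over hidden units of KL(rho || mean activation of that unit)."""
    rho_hat = np.clip(hidden.mean(axis=0), eps, 1.0 - eps)
    return float(np.sum(rho * np.log(rho / rho_hat)
                        + (1.0 - rho) * np.log((1.0 - rho) / (1.0 - rho_hat))))

def sparse_autoencoder_loss(x, reconstruction, hidden, beta=3.0, rho=0.05):
    """Reconstruction error plus the weighted sparsity penalty."""
    mse = np.mean((x - reconstruction) ** 2)
    return mse + beta * kl_sparsity_penalty(hidden, rho)

sparse_hidden = np.full((10, 4), 0.05)   # units that are rarely active
dense_hidden = np.full((10, 4), 0.50)    # units that are active half the time

print(kl_sparsity_penalty(sparse_hidden))
print(kl_sparsity_penalty(dense_hidden))
```

The penalty is near zero when the units are active about as rarely as the target asks, and it grows as the average activation moves away from `rho`.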

#### Different types of auto-encoder
##### 1. Convolutional AutoEncoders (CAE)
This replaces “fully-connected layer” by “convolutional layer”.

<img src='MyPythonCode/resources/convolutional_autoencoder.png' width=500px>

Use of CAE :-
1. Ultra-basic image reconstruction

    - learns to remove noise from picture/ reconstruct missing parts.

    - the input is the noisy version; the output is the clean version.

    - the network fills the gaps in the image.

2. Ultra-basic image colorization

    - CAE maps circles and squares from an image to the same image but with red and blue respectively (coloring).

    - Purple is formed sometimes because of blend of colors, where network hesitates between circle or square.

3. Advanced applications

    - full image colorization

    - latent space clustering

    - generating higher resolution images

##### 2. Variational AutoEncoder(VAE)
- This incorporates Bayesian Inference.
- The compressed representation is a probability distribution.
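
To make "the compressed representation is a probability distribution" concrete: in the usual formulation the encoder outputs a mean and a log-variance per latent dimension, a code is sampled from that Gaussian, and a KL term pulls the distribution towards a standard normal. The sketch below shows just those two pieces, with hand-written encoder outputs as placeholders for a real network.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_latent(mu, log_var):
    """Reparameterization trick: z = mu + sigma * eps with eps ~ N(0, I)."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, sigma^2) || N(0, I) ), summed over latent dimensions, per example."""
    return 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var, axis=1)

# Placeholder encoder outputs for a batch of 2 examples with 2 latent dimensions.
mu = np.array([[0.0, 0.0],
               [1.0, 1.0]])
log_var = np.zeros((2, 2))            # log(sigma^2) = 0 means sigma = 1

z = sample_latent(mu, log_var)
print(z.shape)
print(kl_to_standard_normal(mu, log_var))
```

The first example already matches the standard normal prior, so its KL term vanishes; the second is shifted away from it and is penalized.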

##### 3. Sparse AutoEncoder
- This is used for feature extraction.
- This can have more hidden units than inputs.
- This allows a sparse representation of the input data.

##### 4. Stacked AutoEncoder
If more than one hidden layer is used, the result is a stacked autoencoder.

##### 5. Deep AutoEncoders
This consists of 2 symmetrical “Deep-belief networks”, each usually having 4 or 5 shallow layers.
The layers are Restricted Boltzmann Machines (RBMs).

Use of DAE :-
1. Image search

    - An image can be compressed into a vector of around 30 numbers (as in Google image search).

2. Data Compression

    - Deep Autoencoders are useful for “semantic hashing”.

3. Topic Modelling & Information Retrieval (IR)



#### What's going on with the decoder part of auto encoder

Okay, so the decoder has these "Upsample" layers that you might not have seen before. First off, I'll discuss a bit what these layers *aren't*. Usually, you'll see **transposed convolution** layers used to increase the width and height of the layers. They work almost exactly the same as convolutional layers, but in reverse. A stride in the input layer results in a larger stride in the transposed convolution layer. For example, if you have a 3x3 kernel, a 3x3 patch in the input layer will be reduced to one unit in a convolutional layer. Comparatively, one unit in the input layer will be expanded to a 3x3 patch in a transposed convolution layer. The TensorFlow API provides us with an easy way to create the layers, [`tf.nn.conv2d_transpose`](https://www.tensorflow.org/api_docs/python/tf/nn/conv2d_transpose).

However, transposed convolution layers can lead to artifacts in the final images, such as checkerboard patterns. This is due to overlap in the kernels, which can be avoided by setting the stride and kernel size equal. Alternatively, the layers can be resized with nearest-neighbor interpolation, for example with **tf.image.resize_nearest_neighbor**, followed by a regular convolutional layer.
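
Both points can be illustrated without TensorFlow. The NumPy sketch below counts, in one dimension and with no padding, how many kernel placements of a transposed convolution touch each output position, and implements nearest-neighbor upsampling by repeating rows and columns. The helper names are made up for this illustration.

```python
import numpy as np

def transposed_conv_output_size(size, kernel, stride):
    """Output length of an unpadded transposed convolution."""
    return (size - 1) * stride + kernel

def kernel_overlap_counts(size, kernel, stride):
    """How many kernel placements contribute to each output position (1-D)."""
    counts = np.zeros(transposed_conv_output_size(size, kernel, stride), dtype=int)
    for i in range(size):
        counts[i * stride:i * stride + kernel] += 1
    return counts

def upsample_nearest(feature_map, factor=2):
    """Nearest-neighbor upsampling: repeat every row and column `factor` times."""
    return np.repeat(np.repeat(feature_map, factor, axis=0), factor, axis=1)

print(kernel_overlap_counts(size=4, kernel=3, stride=2))   # uneven coverage
print(kernel_overlap_counts(size=4, kernel=2, stride=2))   # kernel == stride: even coverage
print(upsample_nearest(np.array([[1, 2], [3, 4]])))
```

When the kernel size is not a multiple of the stride, some output positions receive more contributions than their neighbours, which is where the checkerboard pattern comes from. Upsampling first and then applying an ordinary convolution treats every output position the same way.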