# Interview Questions For Preparation

## 1. RAG

### Q. What is RAG ? 
* Retrieval Augmented Generation
* Technique to generate responses on data the model may not have been trained on
* Leverages the power of in-context learning for LLMs
* Relevant context is typically retrieved from a vector database, and the model's response is generated from that retrieved context

### Q. What is the time complexity of RAG retrieval ?
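* O(n) for flat (exhaustive) index search
* Sub-linear for approximate indexes: roughly O(log n) for HNSW; for IVF the cost is proportional to the number of centroids plus the size of the probed clusters (about O(sqrt(n)) when the number of clusters is chosen near sqrt(n))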


### Q. How does IVF work in RAG ?
* Cluster the document embeddings, e.g. with k-means clustering
* Compute the similarity of the query to each cluster centroid
* Search only within the most similar cluster (or the few most similar clusters) for the closest documents
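
The three steps above can be sketched in NumPy. This is a minimal illustration, not how a production library implements IVF; the function names (`build_ivf`, `ivf_search`) and the parameters `n_clusters`, `n_probe` and `top_k` are assumptions made for this example.

```python
import numpy as np

def nearest_centroid(vecs, centroids):
    """Index of the closest centroid for every row of vecs."""
    dists = np.linalg.norm(vecs[:, None, :] - centroids[None, :, :], axis=2)
    return dists.argmin(axis=1)

def build_ivf(doc_vecs, n_clusters, n_iters=10, seed=0):
    """Step 1: cluster the embeddings with a few rounds of plain k-means."""
    rng = np.random.default_rng(seed)
    start = rng.choice(len(doc_vecs), n_clusters, replace=False)
    centroids = doc_vecs[start].copy()
    for _ in range(n_iters):
        assign = nearest_centroid(doc_vecs, centroids)
        for c in range(n_clusters):
            members = doc_vecs[assign == c]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                centroids[c] = members.mean(axis=0)
    assign = nearest_centroid(doc_vecs, centroids)
    # Inverted lists: cluster id -> indices of the documents in that cluster.
    inverted_lists = {c: np.where(assign == c)[0] for c in range(n_clusters)}
    return centroids, inverted_lists

def ivf_search(query, doc_vecs, centroids, inverted_lists, n_probe=1, top_k=3):
    # Step 2: rank the clusters by distance from the query to each centroid.
    centroid_dists = np.linalg.norm(centroids - query, axis=1)
    probed = np.argsort(centroid_dists)[:n_probe]
    # Step 3: exhaustive search, but only inside the probed clusters.
    candidates = np.concatenate([inverted_lists[c] for c in probed])
    cand_dists = np.linalg.norm(doc_vecs[candidates] - query, axis=1)
    return candidates[np.argsort(cand_dists)[:top_k]]
```

Raising `n_probe` trades speed for recall: probing every cluster gives the same result as a flat search.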

### Q. What is contextual precision which is often utilized in evaluating RAG systems/pipelines ?
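* Measures whether the retrieved chunks that are relevant to the query are ranked above the irrelevant ones
* Precision@k is computed at each rank k = 1 to n, where n is the number of retrieved chunks
* The precision values are then averaged; in one common formulation only the ranks that hold a relevant chunk contribute to the average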
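
As a worked example, the sketch below averages precision@k over the ranks that hold a relevant chunk. The function name and the 0/1 `relevance` list are assumptions for illustration; evaluation libraries may define the metric slightly differently.

```python
def contextual_precision(relevance):
    """relevance[k-1] is 1 if the chunk retrieved at rank k is relevant, else 0."""
    hits = 0
    total = 0.0
    for k, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            total += hits / k  # precision@k, counted only at relevant ranks
    return total / hits if hits else 0.0

# Relevant chunks ranked first score higher than the same chunks ranked last.
good_ranking = contextual_precision([1, 1, 0])
poor_ranking = contextual_precision([0, 1, 1])
```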

### Q. Which are the techniques that can reduce model hallucinations ? 
* Fine-tuning
* Retrieval Augmented Generation
* Few-shot and chain-of-thought prompting
* Lowering the temperature hyperparameter
* Training Data filtering and preprocessing 
* Careful system prompt design

### Q. Which are techniques to find document similarity ?
* Cosine Similarity 
* Jaccard distance
* Euclidean Distance
* Clustering 
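
The first three measures can be written in a few lines. This is a small sketch: the vector measures assume the documents are already embedded as numeric vectors, and the Jaccard measure assumes they are given as lists of tokens.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1 = same direction)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def euclidean_distance(a, b):
    """Straight-line distance between two vectors (0 = identical)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.linalg.norm(a - b))

def jaccard_distance(tokens_a, tokens_b):
    """1 - |intersection| / |union| of the two token sets."""
    set_a, set_b = set(tokens_a), set(tokens_b)
    union = set_a | set_b
    if not union:
        return 0.0  # two empty documents are treated as identical
    return 1.0 - len(set_a & set_b) / len(union)
```

Cosine similarity ignores vector length, so it is often preferred for embeddings, while Jaccard distance works directly on token sets without any embedding model.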

### Q. What are model hallucinations ?
* Factually incorrect or unsupported generations that the model presents as if they were correct
* LLMs are autoregressive models trained to predict a plausible next token, not to verify facts




### Q. What are position embeddings as used in modern LLMs ? 
* Token-wise embedding corresponding to the position of the token in the context
* It lets the transformer take the position of each token in the context into account
* Otherwise self-attention is order-agnostic: a pair of tokens would receive the same attention score irrespective of where they appear in the context
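
One concrete scheme is the fixed sinusoidal encoding from the original Transformer, sketched below. It is chosen here only for illustration: many modern LLMs use learned or rotary position embeddings instead. The function name is an assumption, and `d_model` is assumed to be even.

```python
import numpy as np

def sinusoidal_positions(seq_len, d_model):
    """Return a (seq_len, d_model) matrix with one encoding row per position."""
    positions = np.arange(seq_len)[:, None]        # shape (seq_len, 1)
    even_dims = np.arange(0, d_model, 2)[None, :]  # shape (1, d_model / 2)
    angles = positions / np.power(10000.0, even_dims / d_model)
    encoding = np.zeros((seq_len, d_model))
    encoding[:, 0::2] = np.sin(angles)  # even feature indices
    encoding[:, 1::2] = np.cos(angles)  # odd feature indices
    return encoding

# The encoding is added to the token embeddings before the first layer,
# so the same token at two positions produces two different inputs.
pe = sinusoidal_positions(seq_len=4, d_model=8)
```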

### Q. What is RAG re-ranking ?
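* A second-stage step that re-scores the retrieved chunks so that the most relevant ones are ranked first before generation
* The relevance of each retrieved chunk to the query is scored more accurately than with embedding similarity alone
* Adds computational overhead and latency, so it is usually applied only to the top retrieved candidates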
* Can be performed using Cross Encoder and ColBERT

### Q. Which is the more appropriate technique for a million-scale document vector database in Retrieval Augmented Generation ?
* Approximate nearest neighbor indexes such as IVF and HNSW, as their search cost grows sub-linearly with the number of documents

## 2. Neural Networks - Transformers, Self Attention, LLMs

### Q. Why are skip connections used in deep neural networks ?
* To avoid the vanishing gradient problem
* Enables training of deeper networks with more parameters

### Q. Why has attention mechanism been used in NLP tasks?
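* RNNs, which were used previously, struggled to retain information when the context was very long
* Attention can capture dependencies between any pair of tokens, regardless of their distance
* Attention processes all tokens at once, which enables parallel training of the models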

### Q. What are the drawbacks of attention mechanism used in Transformer architecture ?
* O(n^2) time and memory in the sequence length n for computing self-attention
* Vast amount of training data required for training
* Struggles to generalize to sequence lengths longer than those seen in training

### Q. In computing self attention in Transformers what are the 3 matrices that are used ? What is the scaling constant used in the Transformer model ?
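* Query, Key, Value - Q, K, V
* The scaling value is sqrt(d_k), where d_k is the dimension of the key (and query) vectors, i.e. the per-head dimension in multi-head attention; the dot products are divided by it before the softmax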
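
A minimal single-head sketch of scaled dot-product attention, softmax(Q K^T / sqrt(d_k)) V, without masking or batching. The function names are chosen for this example only.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Q: (n_queries, d_k), K: (n_keys, d_k), V: (n_keys, d_v)."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # scale before the softmax
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 4))
K = rng.normal(size=(5, 4))
V = rng.normal(size=(5, 2))
output, weights = scaled_dot_product_attention(Q, K, V)
```

Without the scaling, the dot products grow with d_k and push the softmax into regions with very small gradients.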

### Q. Which normalization technique is used in Transformers ?
* Layer normalization
* Normalizes each token's activations across the feature dimension, independently of the other examples in the batch
* Keeps activations well-scaled on the forward pass, which helps keep gradients stable on the backward pass
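
A small NumPy sketch of the forward computation, assuming the features are on the last axis. Here `gamma` and `beta` stand for the learnable scale and shift, and `eps` is a small constant for numerical stability.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize every row of x over its features, then scale and shift."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

x = np.array([[1.0, 2.0, 3.0, 4.0],
              [10.0, 20.0, 30.0, 40.0]])
normalized = layer_norm(x, gamma=np.ones(4), beta=np.zeros(4))
```

Both rows end up with approximately zero mean and unit variance, even though their original scales differ by a factor of ten.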

### Q. Why do modern LLMs not use Dropout in their architecture ?
* Pre-training and instruction tuning use very large amounts of data, so overfitting is less of a concern and the regularization from dropout adds little

### Q. What is KV caching ?
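* A method to reduce inference time during autoregressive generation
* The Key and Value vectors of already-processed tokens are cached
* Avoids recomputing the Key and Value projections of previous tokens at every decoding step; only the new token's Query, Key and Value are computed
* Increases memory requirements, since the cache grows linearly with sequence length and batch size
* Reduces inference time in some cases by 5x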
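
The idea can be sketched for a single attention head as below. This is a toy illustration: `decode_step`, the dictionary-based cache and the random weight matrices are assumptions for the example, not any library's API.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def decode_step(x_t, W_q, W_k, W_v, cache):
    """Attend from the newest token to every token seen so far."""
    q = x_t @ W_q
    # Only the new token's key and value are computed; older ones are reused.
    cache["keys"].append(x_t @ W_k)
    cache["values"].append(x_t @ W_v)
    K = np.stack(cache["keys"])
    V = np.stack(cache["values"])
    weights = softmax(K @ q / np.sqrt(q.shape[-1]))
    return weights @ V

rng = np.random.default_rng(0)
d = 4
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))
tokens = rng.normal(size=(5, d))  # stand-ins for token embeddings

cache = {"keys": [], "values": []}
outputs = [decode_step(x, W_q, W_k, W_v, cache) for x in tokens]
```

The cache holds one key and one value vector per processed token, which is why its memory footprint grows with the sequence length.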

### Q. What is sliding attention window mechanism seen in modern LLM architectures like Mistral ?
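* Each token attends only to a fixed-size window of the most recent tokens, which reduces the attention computation
* Stacking layers lets information flow beyond the window, similar to the concept of the receptive field in CNNs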
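
A sliding window can be expressed as a boolean attention mask. The sketch below assumes causal (left-to-right) attention; the function name and the `window` parameter are illustrative.

```python
import numpy as np

def sliding_window_mask(seq_len, window):
    """mask[i, j] is True when token i may attend to token j."""
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    causal = j <= i             # never attend to future tokens
    in_window = j > i - window  # only the last `window` tokens, including i
    return causal & in_window

mask = sliding_window_mask(seq_len=5, window=2)
```

Each row has at most `window` allowed positions, so the attention cost per token stays constant instead of growing with the sequence length.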

### Q. Why is RMS Normalization utilized in LLMs?
*

### Q. How does the temperature hyperparameter affect a model's outputs ?

### Q. Why are LLMs generally based on decoder-only architectures ? What is meant by the term auto-regressive models ?

### Q. What is the loss function used in LLMs ? 

### Q. What is meant by quantization in the context of modern Large Language Models ? How does it help with latency and memory requirements ?

### Q. What is the pre-training step in LLMs ? Is it supervised, unsupervised or self-supervised training ?

### Q. What is Mixture of Experts (MoE) in LLMs ? How does the gating function work in LLMs ?

### Q. What is the full form of GPT in modern LLMs ?

## 3. Neural Networks

### Q. What is the purpose of Normalization in Neural Networks ?
* To stabilize training
* Reducing internal covariate shift (changes in the distribution of layer inputs during training)


### Q. What are the initialization techniques used when training neural networks ?
* ReLU specific initialization (He / Kaiming)
* tanh, sigmoid specific initialization (Xavier / Glorot)
* Avoid 

### Q. What is the bias-variance trade off ?
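* The trade-off is managed to improve the generalization of models on real-world, unseen data
* Bias is the error from overly simple assumptions in the model; it shows up as poor performance even on training data
* Variance is the sensitivity of the model to the particular training set; it shows up as a gap between training and test performance
* Overfit models will have low bias and high variance
* Underfit models will have high bias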

### Q. What are some strategies to reduce high bias / underfit in Neural networks ?
* Longer training
* More parameters (a larger or deeper model)
* Change the optimization strategy - SGD, Adam, AdamW

### Q. What are some strategies to reduce high variance / overfit in Neural Networks ?
* L1, L2 regularization
* Dropout (randomly deactivates a fraction of units in each layer during training)
* Train on a larger volume of data so that the data is more diverse and reflects the variability seen in the real world
* Cross-validation or a hold-out validation set to detect overfitting
* Early stopping
* Weight decay - shrinks the weights toward zero at each update

### Q. Why are neural network weights not initialized to zero before training ?

### Q. Explain the concept of transfer learning.

### Q. What is the purpose of Bias parameter in Neural networks ?

## 4. Machine Learning

### Q. How to handle imbalanced classification problems ?
* Get more training data
* Correct any bias, errors in sampling or measurement
* Generate synthetic data
* Oversample minority class
* Undersample majority class

### Q. What are the assumptions for linear regression ?

### Q. What are the assumptions for logistic regression ?

### Q. What is the loss function used in logistic regression ?
* BCE (Binary Cross Entropy)
* Logistic regression performs binary classification
* It finds a linear decision boundary between the target classes
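
Binary cross entropy averages -[y log(p) + (1 - y) log(1 - p)] over the samples, where p is the predicted probability of the positive class. A minimal NumPy sketch follows; the `eps` clipping is an assumption added here to avoid log(0).

```python
import numpy as np

def sigmoid(z):
    """Map a linear score w.x + b to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-np.asarray(z, dtype=float)))

def binary_cross_entropy(y_true, y_prob, eps=1e-12):
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.clip(np.asarray(y_prob, dtype=float), eps, 1.0 - eps)
    losses = y_true * np.log(y_prob) + (1.0 - y_true) * np.log(1.0 - y_prob)
    return float(-np.mean(losses))

labels = [1, 0, 1]
confident_and_right = binary_cross_entropy(labels, [0.9, 0.1, 0.8])
confident_and_wrong = binary_cross_entropy(labels, [0.1, 0.9, 0.2])
```

The loss penalizes confident wrong predictions far more heavily than hesitant ones, which is what pushes the predicted probabilities toward the true labels.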

### Q. What is the difference between bagging and boosting ?
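* Bagging trains models (e.g. trees) independently on bootstrap samples of the data and combines their predictions; it uses the mode for classification and the mean for regression
* Boosting trains models sequentially, with each new model focusing on the errors made by the previous ones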

### Q. What are the metrics used to decide decision tree splitting ?
* Entropy
* Gini impurity
* Information Gain
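
The three metrics can be computed directly from the class labels in a node. A short sketch follows; the function names are illustrative, and information gain is computed here with entropy as the impurity measure.

```python
import numpy as np

def class_probabilities(labels):
    _, counts = np.unique(labels, return_counts=True)
    return counts / counts.sum()

def entropy(labels):
    """-sum(p * log2(p)): 0 for a pure node, 1 for a 50/50 binary node."""
    p = class_probabilities(labels)
    return float(-np.sum(p * np.log2(p)))

def gini_impurity(labels):
    """1 - sum(p^2): 0 for a pure node, 0.5 for a 50/50 binary node."""
    p = class_probabilities(labels)
    return float(1.0 - np.sum(p ** 2))

def information_gain(parent, left, right):
    """Entropy of the parent minus the size-weighted entropy of the children."""
    n = len(parent)
    children = len(left) / n * entropy(left) + len(right) / n * entropy(right)
    return entropy(parent) - children

parent = [0, 0, 1, 1]
perfect_split = information_gain(parent, [0, 0], [1, 1])
useless_split = information_gain(parent, [0, 1], [0, 1])
```

The tree picks the split with the highest information gain (or, equivalently for Gini, the largest drop in impurity).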



## Notes

<!-- Interview questions 
 
Which of the following is not a technique that can reduce model hallucinations ?

Which of the following are techniques to find document similarity ?
Principal Component Analysis
Cosine Similarity 
Jaccard distance
Euclidean Distance
Clustering 
What is contextual recall which is often utilized in evaluating RAG systems/pipelines ?
The correctness of the retrievals with respect to the query
The correctness of the retrievals with respect to the generation
The correctness of the retrievals with respect to the other retrievals
The overlap of the generation with the ground truth response
What are position embeddings as used in modern LLMs ? 
Word embeddings with random noise
Token wise embedding corresponding to position in the context
Indexing of the context
What is the vanishing gradient problem in Deep Learning ? 
The gradient descent optimization is stuck in local minima
Neural network gets a NAN value during backpropagation
In deep neural networks the gradient is unable to propagate backwards to the initial layers
What is the exploding gradient problem in Neural Networks ? 
Time taken explodes as Gradient descent takes too long due to indecipherable black box nature of neural networks   
Gradient becomes too large or NAN values causing large corrections to the parameters and hence unstable training 
In deep neural networks the gradient is unable to propagate backwards to the initial layers
Validation error shows a large divergence from Training Error
Activation functions show unexpected behaviour
What is the purpose of the activation function in neural networks ?
Reduce the number of necessary computations from one layer to another
Initializes the parameters of the model in the correct range so that model loss converges to global minimum
Makes the model able to learn the linear mapping of the values to the target values
Capture the non linear patterns of the data to train the model parameters



What are skip connections useful for in Neural Networks ? 
To make deeper neural networks
For reducing the computation complexity of neural network arithmetic
Keeping parameters within a certain range
Capturing non-linear patterns in the data 
Why are current generation SOTA LLM Models decoder only models ? 
Due to the task being of generating new tokens
Decoder models have fewer parameters
Decoder models have in built efficiencies due to cross-attention mechanism
Where is Sigmoid Activation Function used commonly ? 
Regression tasks
Reinforcement learning 
Classification tasks
Feature normalization of the input layer
Parameter initialization of the weights and biases

What is the difference between ReLU and Leaky ReLU ? 
Leaky RELU adds a small value for negative values of f(x) hence it is differentiable everywhere
Leaky RELU is faster than ReLU
Leaky RELU has small gradients at extreme values
Leaky RELU is asymptotic at large values of x 
What is the function of the optimiser in the Pytorch snippet below ? 
Prevents the vanishing gradient and exploding gradient problems
Zeros out the gradient at each backpropagation step
Finds the optimum parameters to reach global minimum of cost function in back propagation


Why is Early stopping used ? 
To save computational resources when GPU resource is constrained or to let GPU servers cool down
To prevent overfit 
For analyzing loss on various epochs
The self attention mechanism used in Transformers has a time complexity of 
log n
n
n^2
n^3
What is jailbreaking in the context of LLMs ? 
Releasing models weights illegally  
Ability of the model to express its repressed internal thoughts, consciousness
Making the model behave in a non aligned manner and ignore safety instructions
Unlocking generalisation capabilities of LLM beyond the initial training instructions or expected capacities of the model
From the given options what is the best step for reducing underfit of models ? 
Longer training time by adding critical idle time between epochs for GPU cooling
Reduce depth of neural network 
Reduce the number of parameters of the model
Increasing the number of epochs
What is the utility of BLEU, ROUGE, METEOR ?
Classification tasks
NLP tasks for neural network models
Object Detection metrics


What is the difference between Stemming and Lemmatization ?
Stemming makes the  subword tokens while lemmatization makes the character level tokens
Lemmatization preserves the meaning of the word after altering the token
Stemming is more computational and time intensive than Lemmatization
Lemmatization will produce more out of vocabulary tokens than stemming
Subword tokenization is preferred in modern LLMs for which of the following reasons 
Balances vocabulary size 
Makes the vocabulary size the smallest possible 
Understands all the mappings between tokens 
In a Precision-Recall Curve what is the observed impact on Precision, Recall values as confidence threshold is increased
No impact 
Precision will increase and Recall will decrease 
Precision will decrease and Recall will increase
Few shot prompting is accurately described by which of the following statements 
Multiple examples given to the guide the output
LLM correcting itself by fine-tuning itself on new data
LLM is allowed to make errors to come up with final correct solution after many errors
Context is reworded in different ways to guide the output 
What is the context window for an LLM ?
Knowledge on particular topics stored in the weights of the model
Number of tokens used in pre training 
Number of tokens on which Self Attention is applied in Sliding Window Attention
Number of tokens that an LLM can take as an input
What does the temperature hyperparameter that is seen in modern LLM hyperparameters affect ? 
The slowness and thoughtfulness of the model
The probability distribution for sampling of the models output tokens
The maximum number of tokens generated by the LLM
The inference speed of tokens
Why is Sigmoid Activation Function not used so much in modern LLMs and Deep Learning models ? 
Mathematical intuition is difficult to grasp
For extreme values gradients are small so back propagation does not occur ideally to optimize the parameters of the Neural network 
It can only support classification tasks 
Not differentiable for all values of the function 
What is Mean Average Precision (mAP) for object detection tasks ? 
Average of precision for different values 
How is mAP calculated for Object Detection tasks ? 
Arrange the following in order of complexity and level of detail in Computer Vision ?
Classification > Object Detection > Segmentation
Segmentation > Classification > Object Detection
What is Non Max Suppression used for in Object Detection ? 
Convolutions that reduce high pixel values from the feature map 
Removing highly overlapping bounding boxes and retaining only those predictions with the highest confidence 
Remove outliers of extremely high confidence from predictions 
<Some other>
How does Non Max Suppression work in Object Detection ? 
Highest value Bounding Boxes based on confidence -> Remove high overlap -> 
Keeps only bounding boxes of a threshold 
Which of the following is not a regularisation method used in Neural networks training ? 
Dropout 
Batch Normalisation 
Higher number of epochs
Early stopping 
Mini batch gradient descent 
 Which of the following best describes an overfit model ? 
Poor performance on validation set
Poor performance on training set
Poor performance on test set 
Which of the following methods cannot be used to reduce the memory requirement for LLMs ? 
Knowledge Distillation 
Pruning 
Quantization 
Low Rank Adaptor 
Full precision training 
Which of the following describes precision accurately ? 
When given a regression task how many of the values of a particular predicted class are indeed correctly predicted
When given a classification task how many of the values of a particular predicted class are indeed correctly predicted 
When given a classification task how high the confidence score is for the predicted values 
Which of the following statements describes frequency penalty most aptly ? 
It is a hyperparameter seen in modern LLMs that relates to tokens being generated out of turn
It is a hyperparameter seen in modern LLMs that relates to tokens being generated beyond the maximum number of tokens
It is a hyperparameter seen in modern LLMs that relates to tokens being repeated numerous times
How does GRPO differ from other RL algorithms like PPO and DPO ? 
Samples are generated by the LLM and require less human feedback 
What is an aligned model ? 
One that gives honest, helpful and harmless responses 
The model that is finetuned for a particular domain 
Model that is quantized and has reduced memory requirements  
What is perplexity when it comes to language models ?  
The more surprised a model is when generating a new sequence of tokens
The time complexity of Retrieval Augmented Generation
What is zero-shot when it comes to LLMs ?
Model hallucinations when it has not got diverse training data
Prompting technique
First stage of LLM training
Autoregressive model generation task
What are the data augmentation techniques suitable for NLP and language tasks ? 
Synthetic data generation by LLMs 
Smoothing and adding noise 
Shearing / Rotation
Gray scaling 
Which of the following given below is incorrect about the kernel trick in Support Vector Machines ?
To alter the distribution of the high dimensional data to make it linearly separable for the linear boundary to
The model
With regards to k-means clustering which of the following is False ?
It is not suitable for high dimensionality dataset
It always finds the most ideal clusters
It forms circular clusters
It is a supervised learning approach
Elbow method can be used to find the ideal number of clusters
What is correct regarding XG Boost ?
It is an ensemble learning method only for classification tasks
It is an ensemble learning method only for regression tasks
It is an unsupervised method which requires no target variables
XG Boost can be utilized with tree or linear models
Why are metrics apart from accuracy required in imbalanced Classification problem statements ?
Accuracy can be flawed when there is high class imbalance as naive predictions can predict majority class
Accuracy is sufficient as it holistic and covers all situations comprehensively
Accuracy becomes inconclusive in problems other than binary classification prediction tasks 
Accuracy can be flawed when there is high class imbalance as naive predictions can predict minority class
Which is not an assumption for Linear Regression ?
Residuals are normally distributed
Independent variables are normally distributed
Mean value of residuals is 0
There is no significant correlation between independent variables
There exists a linear relationship between the target variable and the independent variable
Which of the following is not a technique for outlier removal ?
Capping
Deletion of sample
Clustering
Feature Encoding
Which of the following is not a technique for improving performance on imbalanced classification tasks ?
Upsampling
Downsampling
SMOTE and synthetic data generation
Collecting more data 
Checking for measurement or sampling errors
Converting to parquet format 
What is concept drift in ML models ?
Change in the ML paradigms as technology evolves
A phenomenon seen in ML model deployed in product where the relation between input feature and the target value changes
Change in the  -->


