# General Questions


## How To Develop a Machine Learning Model From Scratch

1. Define the problem adequately (objective, desired outputs…).
2. Gather data.
3. Choose a measure of success.
4. Set an evaluation protocol, knowing the different protocols available.
5. Prepare the data (dealing with missing values, categorical values…).
6. Split the data correctly.
7. Differentiate between overfitting and underfitting: define what they are and know the best ways to avoid them.
8. Understand how a model learns.
9. Know what regularization is and when it is appropriate to use it.
10. Develop a benchmark model.
11. Choose an adequate model and tune it to get the best performance possible.

https://towardsdatascience.com/machine-learning-general-process-8f1b510bd8af

## What’s the difference between a generative and discriminative model? Can you give me an example of each?

### Generative: model distribution. Discriminative: learn boundary

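Generative models model the distribution of the individual classes, i.e. how the data of each class is generated. Generative algorithms make structural assumptions about the data; Naive Bayes, for example, assumes the features are independent given the class.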

Discriminative models learn the (hard or soft) boundary between classes. SVMs and decision trees are discriminative because they learn explicit boundaries between classes.

https://stats.stackexchange.com/questions/12421/generative-vs-discriminative

## Difference between linear regression and logistic regression?

### Linear: continuous. Logistic: categorical

Linear regression predicts a continuous dependent variable from a given set of independent variables. Cost function: least squares.

Logistic regression predicts a categorical dependent variable from a given set of independent variables. Mathematically, a binary logistic regression model predicts P(Y=1) as a function of X. The target can be binary or multi-class.

https://www.tutorialspoint.com/machine_learning_with_python/machine_learning_with_python_classification_algorithms_logistic_regression.htm

## How does LR learn?

### Linear: least square. Logistic: sigmoid

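Linear: fit a line through the data by minimizing the loss, where the loss is the least-squares error (the sum of squared errors).

Logistic: pass a linear combination of the inputs through the sigmoid function to get a probability, and fit the weights by minimizing the log loss (cross-entropy), typically with gradient descent.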
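The two fitting procedures can be sketched in a few lines of plain Python. This is a minimal illustration with a single feature, not a production implementation; the function names, learning rate and epoch count are assumptions made for the example.

```python
import math

def fit_linear(xs, ys):
    """Closed-form least squares for y = w*x + b with a single feature."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    w = cov / var
    b = mean_y - w * mean_x
    return w, b

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, lr=0.1, epochs=2000):
    """Gradient descent on the mean log loss for P(y=1|x) = sigmoid(w*x + b)."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Gradient of the log loss: (prediction - label) times the input.
        errors = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        grad_w = sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = sum(errors) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Data that lies exactly on y = 2x + 1.
w_lin, b_lin = fit_linear([0, 1, 2, 3], [1, 3, 5, 7])
# Negative inputs belong to class 0, positive inputs to class 1.
w_log, b_log = fit_logistic([-2, -1, 1, 2], [0, 0, 1, 1])
print(w_lin, b_lin)
print(sigmoid(w_log * 2 + b_log), sigmoid(w_log * -2 + b_log))
```

The linear fit has a closed-form solution, while the logistic fit is iterative because the log loss has no closed-form minimizer.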

https://towardsdatascience.com/introduction-to-linear-regression-and-polynomial-regression-f8adc96f31cb

https://towardsdatascience.com/introduction-to-logistic-regression-66248243c148

## Difference between kNN and k-means?


### kNN: supervised classification and regression. K-means: unsupervised clustering

They are completely different methods. The fact that they both have the letter K in their name is a coincidence.

kNN is a supervised classification algorithm that labels a new data point according to its k closest data points. To determine the class of a point, kNN combines the classes of the k nearest points, typically by majority vote. It is supervised because you are trying to classify a point based on the known classification of other points.

Pros: kNN has no training period, new data can be added seamlessly, and it is easy to implement. Cons: it performs poorly on large datasets or in high dimensions, needs feature scaling, and is sensitive to noise, missing values and outliers.

k-means is an unsupervised clustering algorithm that groups data into k clusters. It tries to partition a set of points into k sets (clusters) such that the points in each cluster tend to be near each other. It is unsupervised because the points have no external classification.

Both kNN and k-means are distance-based algorithms that rely on a metric.

https://pythonprogramminglanguage.com/how-is-the-k-nearest-neighbor-algorithm-different-from-k-means-clustering/
https://stats.stackexchange.com/questions/56500/what-are-the-main-differences-between-k-means-and-k-nearest-neighbours
http://theprofessionalspoint.blogspot.com/2019/02/advantages-and-disadvantages-of-knn.html

## How do kNN and k-means learn?

### kNN: lazy learner, no training. K-means: iterative centroid updates

k-NN does not have a loss function that can be minimized during training. In fact, this algorithm is not trained at all. The only "training" that happens for k-NN is memorising the data (creating a local copy), so that during prediction you can do a search and a majority vote. Technically, no function is fitted to the data, and so no optimization is done (it cannot be trained using gradient descent).
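The "memorise, then search and vote" behaviour can be sketched as follows. This is a minimal illustration assuming Euclidean distance and small in-memory data; `knn_predict` and the toy points are hypothetical names introduced for the example.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # "Training" is just keeping the data; all the work happens at query time.
    neighbours = sorted(
        zip(train_points, train_labels),
        key=lambda pair: math.dist(pair[0], query),
    )[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
labels = ["a", "a", "a", "b", "b", "b"]
print(knn_predict(points, labels, (0.5, 0.5)))
print(knn_predict(points, labels, (5.5, 5.5)))
```

Every prediction scans the whole training set, which is why kNN scales poorly to large datasets.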

https://stats.stackexchange.com/questions/420416/does-knn-have-a-loss-function

### K-means From Scratch:

https://towardsdatascience.com/a-complete-k-mean-clustering-algorithm-from-scratch-in-python-step-by-step-guide-1eb05cdcd461
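For a quick reference alongside the linked guide, here is a minimal sketch of the standard k-means loop (Lloyd's algorithm): assign each point to its nearest centroid, then move each centroid to the mean of its points. The random initialisation, fixed seed and function name are assumptions for illustration and are not taken from the linked article.

```python
import math
import random

def kmeans(points, k, iterations=100, seed=0):
    """Cluster tuples of coordinates into k groups; returns (centroids, assignments)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct data points
    assignments = None
    for _ in range(iterations):
        # Assignment step: index of the nearest centroid for every point.
        new_assignments = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # converged: no point changed cluster
        assignments = new_assignments
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(
                    sum(coord) / len(members) for coord in zip(*members)
                )
    return centroids, assignments

data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, assignments = kmeans(data, k=2)
print(centroids)
print(assignments)
```

Unlike kNN, k-means does optimize something: each iteration never increases the within-cluster sum of squared distances.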

# Bayesian modelling

### NBC: Eager learner, fast, but assumes features are independent

***What is Bayes’ Theorem? How is it useful in a machine learning context?*** 

Bayes' theorem, also known as Bayes' rule or Bayes' law, gives the probability of a hypothesis given prior knowledge and observed evidence. It is built on conditional probability.

It provides a way of thinking about the relationship between data and a model. A machine learning algorithm or model is a specific way of thinking about the structured relationships in the data. Bayesian statistics is a technique that assigns “degrees of belief” (Bayesian probabilities) to hypotheses.

P(A|B) = P(B|A) * P(A)/P(B)

Applications in machine learning include the Naive Bayes classifier, the Bayes optimal classifier, Bayesian optimization and Bayesian belief networks.

Statistical Inference
- Bayesian inference uses Bayesian probability to summarize evidence for the likelihood of a prediction.

Statistical Modeling
- Bayesian statistics helps some models by specifying prior distributions for any unknown parameters.

***What's the prior / likelihood / posterior?*** 

P(A|B) is the posterior probability: the probability of hypothesis A given the observed evidence B.

P(B|A) is the likelihood: the probability of the evidence B given that hypothesis A is true.

P(A) is the prior probability: the probability of the hypothesis before observing the evidence.

P(B) is the marginal probability: the probability of the evidence.

***Why is “Naive” Bayes naive?***

Naive Bayes is called naive because it assumes that the input variables are independent of each other given the class. This is a strong assumption and is unrealistic for real data.

***What are the pros and cons of Bayesian modelling?***

Pros: the Naive Bayes classifier is a fast and easy ML algorithm, works for binary classification and performs well on multi-class problems, is an eager learner, and is among the most popular choices for text classification. Cons: features are assumed to be independent, so it can't learn relationships between features.

***What are the applications of Bayesian modelling?***

Credit scoring, medical data classification, real-time prediction, spam filtering, sentiment analysis.


https://www.javatpoint.com/machine-learning-naive-bayes-classifier

https://machinelearningmastery.com/bayes-theorem-for-machine-learning/

https://deepai.org/machine-learning-glossary-and-terms/bayesian-statistics

    # calculate P(A|B) given P(A), P(B|A), P(B|not A)
    def bayes_theorem(p_a, p_b_given_a, p_b_given_not_a):
        # calculate P(not A)
        not_a = 1 - p_a
        # calculate P(B)
        p_b = p_b_given_a * p_a + p_b_given_not_a * not_a
        # calculate P(A|B)
        p_a_given_b = (p_b_given_a * p_a) / p_b
        return p_a_given_b

    # P(A)
    p_a = 0.0002
    # P(B|A)
    p_b_given_a = 0.85
    # P(B|not A)
    p_b_given_not_a = 0.05
    # calculate P(A|B)
    result = bayes_theorem(p_a, p_b_given_a, p_b_given_not_a)
    # summarize
    print('P(A|B) = %.3f%%' % (result * 100))

Output:

    P(A|B) = 0.339%


## Error Analysis

***How do you ensure you’re not overfitting with a model?*** (D0)

Keep the variance low: the error on held-out data should stay close to the training error.

***How do you ensure you’re not underfitting with a model?*** (D0)

Keep the bias low: the training error itself should be acceptably small.

***What’s the trade-off between bias and variance?*** (D0)

The trade-off is the tension between the error introduced by the bias and the error introduced by the variance.

Choose the complexity level that minimizes the overall error.

If our model is too simple and has very few parameters, it may have high bias and low variance. On the other hand, if our model has a large number of parameters, it is likely to have high variance and low bias. So we need to find the right balance, neither overfitting nor underfitting the data.
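For squared-error regression the trade-off can be written down explicitly. The following is the standard decomposition, under assumptions that are not stated in these notes: the data is generated as y = f(x) + ε with zero-mean noise of variance σ², and the model f̂ is fitted on a randomly drawn training set D.

```latex
\mathbb{E}_{D,\varepsilon}\!\left[\big(y - \hat{f}(x)\big)^2\right]
= \underbrace{\big(\mathbb{E}_D[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
+ \underbrace{\mathbb{E}_D\!\left[\big(\hat{f}(x) - \mathbb{E}_D[\hat{f}(x)]\big)^2\right]}_{\text{variance}}
+ \underbrace{\sigma^2}_{\text{irreducible noise}}
```

The variance term is the variance of the model's prediction at x across different training sets, not the variance of the data itself.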


***What’s the statistical meaning of bias and variance in the ML context. It's the variance of what?*** (D1)


***For regression, can you decompose the mean square error into bias and variance?*** (D2)

***What are Type 1 and Type 2 errors?***
- Type 1: False positive, rejection of a true null hypothesis
- Type 2: False negative, non-rejection of a false null hypothesis

***What is accuracy?***

(TP + TN)/(TP + TN + FP + FN) = (TP + TN)/Total

In measurement terms, accuracy is the closeness of a set of measurements to a specific (true) value.

***What is precision and recall?*** (D0)

 - Precision: TP/(TP+FP), how many of the positive predictions were correct.
 - True Positive Rate (Recall, Sensitivity): TP/(TP+FN), how well the positive class was predicted.
 - True Negative Rate (Specificity): TN/(TN+FP), how well the negative class was predicted.

Precision is the fraction of positive predictions that are actually positive. (In measurement terms, precision is the closeness of the measurements to each other.)

Recall is a metric that quantifies the number of correct positive predictions made out of all positive predictions that could have been made. Recall provides an indication of missed positive predictions.


***How do you combine them?*** (D0)

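Precision and recall are usually combined into the F-measure (F1), their harmonic mean: F1 = (2 * Precision * Recall) / (Precision + Recall).

The positive likelihood ratio is a different quantity: TPR/FPR = Sensitivity / (1 - Specificity). It combines sensitivity and specificity, not precision and recall.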

***What is a ROC curve?*** (D0)

An ROC curve (receiver operating characteristic curve) is a graph showing the performance of a classification model at all classification thresholds. This curve plots two parameters:

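 - True Positive Rate (Recall): TP/(TP+FN)
 - False Positive Rate: FP/(FP+TN)

***What is AUC?*** (D0)

The area under the ROC curve. It equals the probability that the model scores a randomly chosen positive example higher than a randomly chosen negative one.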
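The ranking interpretation of AUC gives a direct, if quadratic-time, way to compute it. A minimal sketch in plain Python; the function name `roc_auc` and the toy scores are assumptions for illustration, and ties are counted as half a win.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))

labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
# Three of the four positive/negative pairs are ordered correctly.
print(roc_auc(labels, scores))  # 0.75
```

An AUC of 0.5 corresponds to random ranking and 1.0 to a perfect ranking, independent of any single classification threshold.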


***If model A achieves 97% accuracy and model B achieves 98%? Do you switch model A for B?*** (D1)


https://www.quora.com/What-is-high-bias-and-high-variance-in-machine-learning-terminology-in-simplest-terms

https://www.wikiwand.com/en/Precision_and_recall#:~:text=Recall%20is%20the%20number%20of,documents%20retrieved%20by%20that%20search.

***When is accuracy not a good metric?***

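When the set of samples is imbalanced: a model that always predicts the majority class can score high accuracy while missing every minority example.

For imbalanced classification, the sensitivity might be more interesting than the specificity.

F-Measure = (2 * Precision * Recall) / (Precision + Recall)

The F-Measure is a popular metric for imbalanced classification.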
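A small worked example makes the gap between accuracy and the F-measure concrete. This sketch computes the metrics from confusion-matrix counts; the function name and the counts (10 positives among 1,000 samples) are made up for illustration.

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall, specificity and F1 from confusion counts."""
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "f1": f1,
    }

# Hypothetical imbalanced case: only 10 of 1,000 samples are positive.
m = classification_metrics(tp=2, fp=5, fn=8, tn=985)
for name, value in m.items():
    print(f"{name}: {value:.3f}")
```

Here accuracy is 98.7% even though the model finds only 2 of the 10 positives; recall (0.2) and F1 (about 0.235) expose the problem.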

https://machinelearningmastery.com/tour-of-evaluation-metrics-for-imbalanced-classification/

***How much training data is enough?***

It depends on how complex your problem is and on the choice of algorithm.

A rule of thumb is that you need roughly 10 times as many examples as there are degrees of freedom in your model. It is often a good one, but if your features do not provide a good separation of targets, this rule of thumb would be completely useless for your problem.

Simply having more data is not useful in itself.

Plot a learning curve of model performance against the number of examples you have. Maybe your choice of model is already saturated with the set size you have, or maybe you will learn that your curve is further away from stabilising than you initially thought.

https://medium.com/analytics-vidhya/so-you-think-you-dont-have-enough-data-to-do-machine-learning-3b5c6c512e27

## Regularization
- ***What is the purpose of regularization?*** (D0)
- ***What regularization techniques do you know?*** (D0)
- ***What are the pros and cons of each?*** (D0)
- ***Regularization in NN?*** (D1)
- ***Regularization in decision tree?*** (D1)
- ***What is the effect of L1 vs L2 regularization?*** (D2)
- ***What is the Bayesian interpretation of L1 and L2 regularization?*** (D3)

## Clustering
- ***What clustering algorithms do you know?*** (D0)
- ***What is the k-means algorithm?*** (D1)
- ***What is the GMM algorithm?*** (D2)
- ***How are k-means and GMM related?*** (D2)
- ***How do you evaluate the performance of a clustering algorithm?*** (D1)
- ***How do you pick the right number of clusters?*** (D2)

## Kernels
- ***What’s a kernel?*** (D1)
- ***Give me an example with formulas?*** (D2)
- ***Formula of the RBF/Gaussian kernel?*** (D2)
- ***Dimension associated with the RBF/Gaussian kernel?*** (D3)
- ***What’s the “kernel trick” and how is it useful?*** (D1)
- ***What is an SVM?*** (D0)
- ***What is a support vector?*** (D1)
- ***What is the margin?*** (D1)
- ***In SVM, what are the slack variables? How are they useful?*** (D2)
- ***When using slack variables, is the margin likely to increase or decrease?*** (D2)
- ***What optimization problem is SVM trying to solve (please write down the math)?*** (D3)
- ***What are the pros of the dual form vs the primal form?*** (D3)
- ***What optimization algorithm is used to solve SVM?*** (D4)
- ***Could the kernel trick be used in algorithms other than SVM?*** (D2)

## Neural networks
- ***What are neural networks?*** (D0)
- ***What is batch normalization?*** (D2)
- ***What is dropout?*** (D2)
- ***What is weight decay? How different is it from L2 regularization?*** (D2)
- ***Can you give me examples of different activation functions for neural networks, and explain their advantages and disadvantages?*** (D1)
- ***ReLU neurons can die; do you know any method to avoid that shortcoming?*** (D2)
- ***What are CNNs?*** (D0)
- ***What are RNNs?*** (D0)
- ***Can you name and describe different kinds of RNNs?*** (D1)
- ***What are auto-encoders? What kinds of auto-encoders do you know?*** (D2)
- ***What is adversarial training? What is it used for?*** (D2)
- ***What is a Siamese Network?*** (D2)
- ***What are word embeddings?*** (D1)
- ***How do you train them?*** (D2)
- ***How do you handle unknown words? Rare words?*** (D2)

## DNN Optimization

- ***How do you train DNNs? (ask about cost functions and back-propagation)*** (D1)
- ***How is stochastic gradient descent different from gradient descent?*** (D1)
- ***What is back-propagation?*** (D1)
- ***Do you know other optimizers? What are the pros and cons of each?*** (D1)
- ***What is momentum?*** (D2)
- ***What are second order optimization methods?*** (D3)
- ***How do you set the learning rate?*** (D1)

### Semi-supervised learning
- ***What is unsupervised learning? Can you give me an example?*** (D0)
- ***What is semi-supervised learning?*** (D1)
- ***What semi-supervised learning methods do you know?*** (D2)
- ***Pros and cons of each?*** (D3)

### Graphical latent models
***What is a latent variable model? How do you optimize it?*** (D1)


***What is a GMM (please write down the model)?*** (D1)


***What is an HMM (please write down the model)?*** (D2)


***Could you describe the EM algorithm?*** (D2)


***Can you write down the equations of the E-step and M-step?*** (D3)


***Can you prove the convergence?*** (D4)


***Could you describe the EM algorithm for GMM?*** (D2)


***Could you establish E-step and M-step equations for GMM?*** (D3)


***Could you describe the EM algorithm for HMM?*** (D3)


***How do you train an HMM?*** (D3)


***How would you handle missing data?*** (D3)


***What are Variational Bayesian methods?*** (D4)


***What are sampling methods (e.g Gibbs sampling)?*** (D4)

### Sequence modeling
- ***What is the Markov assumption?*** (D1)
- ***What is an n-gram model (please write down the model)?*** (D2)
- ***How do you train them?*** (D2)
- ***What if certain combinations have no examples in the data?*** (D2)
- ***What is an HMM (please write down the model)?*** (D2)
- ***How do you compute P(o1,o2|q1, q2) where o1 and o2 are observations and q1 and q2 are states?*** (D2)
- ***How do you compute P(o1,o2) where o1 and o2 are observations?*** (D3)
- ***Assuming you have o1, o2 (observations), what do you need to solve to find the most likely q1, q2 (hidden states)?*** (D3)

### Ensemble methods
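***What goal is achieved with ensemble methods?*** (D0)

Achieve higher accuracy than individual models through bagging, boosting or stacking.

***What is bagging? What is the goal achieved? Can you give me an example of an ML algorithm that uses bagging?*** (D1)

Bagging starts from base learners with low bias but high variance, trains them independently (in parallel) on bootstrap samples, and combines them to reduce variance. Example: random forest.

***How are random forests different from bagging?*** (D2)

A random forest considers only a random subset of the features at each split, whereas plain bagging uses all features.

***What is boosting? What is the goal achieved? Can you give me an example of an ML algorithm that uses boosting?*** (D1)

Boosting starts from base learners with high bias but low variance and trains them sequentially, each one correcting the errors of the previous ones, to reduce bias. Examples: AdaBoost, gradient boosting.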

https://towardsdatascience.com/ensemble-methods-bagging-boosting-and-stacking-c9214a10a205


### Decision trees
***Can you explain what a decision tree is?*** (D0)

A decision tree is simply a set of cascading questions. When you get a data point (i.e. set of features and values), you use each attribute (i.e. a value of a given feature of the data point) to answer a question. The answer to each question decides the next question.

***What is a decision node in a decision tree?*** (D1)

A tree has a root node, decision nodes and terminal (leaf) nodes. A decision node is where a choice needs to be made: it has one input and two or more outputs.

***How do you find those nodes?*** (D2)

***What is overfitting, and how do you prevent it in a decision tree?*** (D2)

- Pre-pruning: stop growing the tree earlier, before it perfectly classifies the training set.
- Post-pruning: allow the tree to perfectly classify the training set, and then prune it back.

***What is a random forest?*** (D2)

Random forest is a supervised learning algorithm. The "forest" it builds is an ensemble of decision trees, usually trained with the “bagging” method. The general idea of the bagging method is that a combination of learning models improves the overall result.

***What is XGBoost?***

### Ensemble of Weak Learners

XGBoost is a gradient boosting implementation that also borrows ideas from random forests, such as column subsampling. By combining the advantages of both, it often achieves a lower prediction error than plain boosting or random forest.

***Random Forest vs XGBoost***

***Which is more likely to overfit, RF or XGB?***


### Dimensionality reduction
- ***What is dimensionality reduction trying to solve?*** (D0)
- ***What dimensionality reduction methods do you know?*** (D0)
- ***Which one is a linear model?*** (D1)
- ***What is PCA, and how is it different from SVD?*** (D1)
- ***Can you prove that PCA finds the axis with the maximum variance?*** (D4)
- ***Have you heard of random projection? How and why does it work?*** (D3)
- ***Have you heard of non-negative matrix factorization? How do you solve this problem? What are the pros and cons vs SVD/PCA?*** (D3)
- ***What is LDA? How is it different from PCA?*** (D3)

### NLP Problem solving
***What are the steps for NLP?***
- Tokenization: split a sentence into tokens (words)
- Stemming: strip affixes to reduce a word to its stem, e.g. asked -> ask
- Lemmatization: map a word to its canonical (dictionary) form

Open-ended questions:

- ***How would you implement prediction of the next word when you write text on your cell phone? Describe all the steps involved. Hints: data mining, data cleaning, model design, learning algorithm, evaluation.***
- ***How would you personalize the model based on the text typed by the user?***
- ***Now that we have a working system, how would you implement an autocorrection/replace-as-you-type system?***
- ***How would you use the user feedback to improve the autocorrection system?***
- ***If you are given 100k tweets and the customer information, how do you cluster them to understand the contents of the tweets?***
- ***How would you design the wake word system? (Opening the mic when people say "Alexa")***
- ***Let's say you work for a company like Amazon, and you have trained a Chinese-English machine translation system that achieves 30 BLEU. Now your customer says "Thanks; now make the model ten times smaller." What approaches might you take? What concerns do you have?***
- ***Amazon has millions of product reviews. How would you approach the problem of identifying positive and negative reviews?***

# Stats and Math

## What is sampling?
“Data sampling is a statistical analysis technique used to select, manipulate and analyze a representative subset of data points to identify patterns and trends in the larger data set being examined.”


## What sampling methods are there?

### Probability Sampling

- Simple random sampling: Software is used to randomly select subjects from the whole population.
- Stratified sampling: Subsets of the data sets or population are created based on a common factor, and samples are randomly collected from each subgroup.
- Cluster sampling: The larger data set is divided into subsets (clusters) based on a defined factor, then a random sampling of clusters is analyzed.
- Multistage sampling: A more complicated form of cluster sampling, this method also involves dividing the larger population into a number of clusters. Second-stage clusters are then broken out based on a secondary factor, and those clusters are then sampled and analyzed. This staging could continue as multiple subsets are identified, clustered and analyzed.
- Systematic sampling: A sample is created by setting an interval at which to extract data from the larger population -- for example, selecting every 10th row in a spreadsheet of 200 items to create a sample size of 20 rows to analyze.
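Three of the probability sampling methods above can be sketched with the standard library. This is a minimal illustration; the function names, the fixed seeds and the toy population are assumptions made for the example.

```python
import random
from collections import defaultdict

def simple_random_sample(population, n, seed=0):
    """Pick n items uniformly at random, without replacement."""
    return random.Random(seed).sample(population, n)

def systematic_sample(population, step, start=0):
    """Take every `step`-th item, beginning at index `start`."""
    return population[start::step]

def stratified_sample(population, key, per_stratum, seed=0):
    """Group items by `key`, then sample randomly within each group."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for item in population:
        strata[key(item)].append(item)
    sample = []
    for members in strata.values():
        sample.extend(rng.sample(members, min(per_stratum, len(members))))
    return sample

rows = list(range(200))
print(len(simple_random_sample(rows, 20)))
print(len(systematic_sample(rows, step=10)))  # every 10th row of 200 -> 20 rows

# Hypothetical population with an imbalanced group label.
people = [(i, "a" if i < 150 else "b") for i in range(200)]
print(stratified_sample(people, key=lambda item: item[1], per_stratum=5))
```

Stratified sampling guarantees that the small group "b" is represented, which simple random sampling does not.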


### Nonprobability Sampling

- Convenience sampling: Data is collected from an easily accessible and available group.
- Consecutive sampling: Data is collected from every subject that meets the criteria until the predetermined sample size is met.
- Purposive or judgmental sampling: The researcher selects the data to sample based on predefined criteria.
- Quota sampling: The researcher ensures equal representation within the sample for all subgroups in the data set or population.

https://searchbusinessanalytics.techtarget.com/definition/data-sampling

# Challenging Questions

## What are the pros and cons of Bayesian modelling?

## What is the Bayesian interpretation of L1 and L2 regularization?

## What optimization problem is SVM trying to solve (please write down the math)?




# Misc

## Loss Function
At its core, a loss function is incredibly simple: it’s a method of evaluating how well your algorithm models your dataset.

https://stats.stackexchange.com/questions/420416/does-knn-have-a-loss-function

## Eager learning vs Lazy Learning

Eager learning is a learning method in which the system tries to construct a general, input-independent target function during training of the system, as opposed to lazy learning, where generalization beyond the training data is delayed until a query is made to the system. The main advantage gained in employing an eager learning method, such as an artificial neural network, is that the target function will be approximated globally during training, thus requiring much less space than using a lazy learning system. Eager learning systems also deal much better with noise in the training data. Eager learning is an example of offline learning, in which post-training queries to the system have no effect on the system itself, and thus the same query to the system will always produce the same result.

The main disadvantage with eager learning is that it is generally unable to provide good local approximations in the target function.

https://www.wikiwand.com/en/Eager_learning

## Maximum Entropy Model

https://medium.com/@phylypo/nlp-text-segmentation-using-maximum-entropy-markov-model-c6160b13b248

## Active Learning

## Reinforcement Learning



# Multimodal ML

## Definition

Humans experience the world through multiple modalities such as sight, hearing, smell, etc.

Machines are similar but slightly different. For machines, "modality" refers to a particular way or mechanism of encoding information. For example, PNG and JPEG are two different ways to encode an image. However, the exact definition of "multi-modality" is actively debated in research, and the definition is also divided between "human-centric" and "machine-centric" views.

Graphs and tables can be considered good examples of multi-modal data.

A good use case of multi-modality is multimodal arithmetic: take one image, apply arithmetic with text, and output another image.

This works well with very high-dimensional embeddings.

Multimodality (in machine learning) occurs when two or more heterogeneous inputs (recorded on different types of media) are processed by the same machine learning model, ***only if*** these inputs cannot be mapped unambiguously into one another by an algorithm.

https://www.youtube.com/watch?v=jReaoJWdO78



# Pipeline

## What is a pipeline?

A pipeline chains steps together sequentially, including data preprocessing and model building.

## Why pipeline?

1. You can cross-validate a whole process, not just a model.
2. You can run a grid search or random search over not only the model's hyperparameters but also the preprocessing steps, e.g. different strategies for imputing null values.
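Both points can be illustrated with scikit-learn. The sketch below assumes scikit-learn and NumPy are installed; the synthetic data, step names (`impute`, `scale`, `model`) and parameter values are made up for the example.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: the label depends on the first two features.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 3))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
X[rng.random(X.shape) < 0.1] = np.nan  # knock out roughly 10% of the values

# Preprocessing and model chained into one estimator.
pipe = Pipeline([
    ("impute", SimpleImputer()),
    ("scale", StandardScaler()),
    ("model", LogisticRegression()),
])

# Search over a preprocessing choice and a model hyperparameter together.
# Parameters are addressed as <step name>__<parameter name>.
param_grid = {
    "impute__strategy": ["mean", "median"],
    "model__C": [0.1, 1.0, 10.0],
}
search = GridSearchCV(pipe, param_grid, cv=3)
search.fit(X, y)
print(search.best_params_)
print(search.best_score_)
```

Because imputation and scaling are refitted inside each cross-validation fold, statistics from the validation fold never leak into preprocessing.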



# One hot encoder, column transformer, pipeline

## Why is this better than pandas get_dummies?
1. No need for a big dataframe; easier to manage.
2. No need to apply get_dummies again to new data, which is problematic when out-of-sample data have fewer categories than in-sample data.
3. You can grid search over both model and preprocessing parameters.
4. Preprocessing outside of sklearn can make cross-validation less reliable.

## Ref
https://www.youtube.com/watch?v=irHhDMbw3xo

# High cardinality issue with categorical variables

## Frequency Encoding

Replace each category with how often it occurs in the data. Heavily used in Kaggle competitions.

### Advantages
1. It is very simple to implement
2. Does not increase the feature dimensional space
### Disadvantages
1. If some of the labels have the same count, they will be replaced with the same value and the model will lose the ability to tell them apart.
2. Adds somewhat arbitrary numbers, and therefore weights, to the different labels, which may not be related to their predictive power.
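A minimal sketch of frequency encoding using only the standard library; `frequency_encode` and the city list are hypothetical names and data chosen to show the first disadvantage.

```python
from collections import Counter

def frequency_encode(values):
    """Replace each category with its relative frequency in `values`."""
    counts = Counter(values)
    total = len(values)
    mapping = {label: count / total for label, count in counts.items()}
    return [mapping[v] for v in values], mapping

cities = ["paris", "rome", "paris", "oslo", "rome", "paris", "lima"]
encoded, mapping = frequency_encode(cities)
print(mapping)
print(encoded)
```

The feature stays a single numeric column regardless of cardinality, but "oslo" and "lima" both occur once and therefore receive the same encoded value.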

## Ref:
1. https://github.com/krishnaik06/Complete-Feature-Engineering/blob/master/2.Count_frequency_encoding.ipynb
2. Follow this thread in Kaggle for more information: https://www.kaggle.com/general/16927