# General Machine Learning Questions

## Collected Interview Questions

### What is Gradient Descent?

- Gradient descent is an optimization algorithm for minimizing a loss function; as the name suggests, it uses the gradient of the loss function to update the model parameters
    - Mathematically speaking, the gradient is just the vector of partial **derivatives of the loss function** w.r.t. the model parameters
- Gradients/derivatives tell us in which direction (by sign) and by how much (by magnitude) we should adjust the model parameters
    - The gradient points in the direction of steepest ascent, so we step in the opposite direction, scaled by the learning rate hyperparameter: $w_{new} = w_{old} - \alpha \nabla f$, where $\alpha$ is the learning rate and $\nabla f$ is the gradient of the loss $f$ evaluated at $w_{old}$
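
The update rule above can be sketched in a few lines of NumPy. This is a minimal illustration, not a production optimizer: the function name `gradient_descent`, the toy quadratic loss, and the chosen `alpha` and `n_steps` values are all assumptions made for the example.

```python
import numpy as np

def gradient_descent(grad, w0, alpha=0.1, n_steps=100):
    """Repeatedly apply w_new = w_old - alpha * grad(w_old)."""
    w = np.asarray(w0, dtype=float)
    for _ in range(n_steps):
        w = w - alpha * grad(w)
    return w

# Toy loss (assumed for illustration): f(w) = (w_0 - 3)^2 + (w_1 + 1)^2,
# whose unique minimum is at w = (3, -1).
def toy_grad(w):
    return 2.0 * (w - np.array([3.0, -1.0]))

w_final = gradient_descent(toy_grad, w0=[0.0, 0.0], alpha=0.1, n_steps=200)
print(w_final)  # should be very close to the minimizer (3, -1)
```

On this convex toy problem, a learning rate that is too large (here, above 1.0) would make the iterates diverge, which is why $\alpha$ is treated as a hyperparameter to tune.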

#### Why do we need to scale before gradient descent?
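- When we apply gradient descent to multiple variables/weights whose features have very different ranges and distributions, the gradient components differ greatly in magnitude, so a single learning rate produces very different effective step sizes for each weight. Scaling the features first helps because:
    - It usually makes the training faster
    - It can make the optimization less likely to stall, although it does not guarantee escaping local optima
    - It gives a better-conditioned (more evenly shaped) error surface
    - Weight decay and Bayesian optimization can be applied more conveniently, since all weights are then on a comparable scale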
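
One common way to scale is standardization (zero mean, unit variance per feature). The sketch below is only one option among several (min-max scaling is another); the helper name `standardize` and the toy matrix are assumptions for illustration.

```python
import numpy as np

def standardize(X):
    """Scale each column to zero mean and unit variance.

    Returns the scaled data plus the statistics, so the same
    transformation can be reused on validation/test data.
    """
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    std[std == 0] = 1.0  # leave constant columns unchanged instead of dividing by zero
    return (X - mean) / std, mean, std

# Two features on very different scales (made-up numbers).
X = np.array([[1.0, 1000.0],
              [2.0, 3000.0],
              [3.0, 5000.0]])
X_scaled, mean, std = standardize(X)
print(X_scaled)
```

The statistics should be computed on the training data only and then applied to validation and test data, so that no information leaks from the held-out sets.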

### What is Stochastic Gradient Descent (SGD)?


### Is it possible to apply GD or SGD with non-convex loss function?

## Top ML Interview Questions

Adapted from a post from https://career.guru99.com/.

With the help of many other online resources, especially 

### What is Machine learning?

The way I like to see it, machine learning (ML) is the baby of computer science and statistics, as it draws on both fields of study. Statistical learning acts as its twin (some also state that statistical learning = machine learning + statistics, although that seems debatable to me), since both are built on statistical theory and use programming to learn the patterns implicit in the data.

### Mention the difference between Data Mining and Machine learning?

Machine learning relates to the process of studying, designing, and optimizing algorithms that give computers the capability to learn without being explicitly programmed, whereas data mining is designed to extract rules from large quantities of data and can be defined as the process of extracting knowledge or previously unknown, interesting patterns from unstructured data. Or to put it another way, data mining is a method of analysis that determines a particular outcome based on the whole of the gathered data.

### What is Over-fitting in Machine learning?

- The famous bias-variance trade-off: as the complexity of the model increases, bias tends to decrease while variance tends to increase;
- Over-fitting happens when the variance is much higher than the bias (i.e. the model has fit the variation in the training sample too closely), which means the model is tailored to the data it was trained on, and its generality and stability are sacrificed
- In terms of the training process, over-fitting shows up as a very low training error together with a big gap between the training error and the validation error.
- Over-fitting exists because the data contains noise in addition to the actual pattern we are trying to capture, and a sufficiently flexible model can fit that noise.
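
The gap between training and validation error can be observed on a small synthetic example. The sketch below fits polynomials of increasing degree to noisy samples; the sine-shaped target, the noise level, the sample sizes, and the degrees are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    """Noisy samples of an assumed 'true' pattern, sin(pi * x)."""
    x = rng.uniform(-1.0, 1.0, size=n)
    y = np.sin(np.pi * x) + rng.normal(scale=0.3, size=n)
    return x, y

x_train, y_train = make_data(15)   # small training set
x_val, y_val = make_data(200)      # larger held-out set

def mse(coeffs, x, y):
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

results = {}
for degree in (1, 3, 10):
    coeffs = np.polyfit(x_train, y_train, deg=degree)
    results[degree] = (mse(coeffs, x_train, y_train), mse(coeffs, x_val, y_val))
    print(f"degree={degree:2d}  train MSE={results[degree][0]:.3f}  "
          f"val MSE={results[degree][1]:.3f}")
```

The training error can only go down as the degree grows, because each higher-degree family contains the lower-degree ones. The validation error typically stops improving and starts rising once the model begins to fit the noise, which is the signature described above.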

### How to avoid overfitting?

- Use cross-validation, a technique that splits the training data into folds, where each fold takes a turn as the validation set/"fake" testing set while the model is trained on the remaining folds.
- Use a large amount of data and, more importantly, representative data with as little noise as possible.
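
The fold rotation in cross-validation can be sketched with plain NumPy index arrays. This is a minimal sketch under assumptions: the generator name `k_fold_indices` is hypothetical, and in practice a library splitter is usually used instead.

```python
import numpy as np

def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_idx, val_idx) pairs; each fold is the validation set once."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(n_samples)   # shuffle before splitting
    folds = np.array_split(indices, k)     # k roughly equal folds
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, val_idx

splits = list(k_fold_indices(n_samples=10, k=5))
for train_idx, val_idx in splits:
    # Fit on train_idx, evaluate on val_idx, then average the k validation scores.
    print(sorted(val_idx.tolist()))
```

Averaging the validation score over all folds gives a less noisy estimate of generalization error than a single train/validation split, which helps when comparing models of different complexity.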

### Different kinds of machine learning?

- There are perhaps 14 types of learning that you must be familiar with as a machine learning practitioner; they are:
- Learning Problems
    1. Supervised Learning
    2. Unsupervised Learning
    3. Reinforcement Learning
        - a class of problems where an agent operates in an environment and must learn to operate using feedback
- Hybrid Learning Problems
    4. Semi-Supervised Learning
    5. Self-Supervised Learning
    6. Multi-Instance Learning
- Statistical Inference
    7. Inductive Learning
    8. Deductive Inference
    9. Transductive Learning
- Learning Techniques
    10. Multi-Task Learning
    11. Active Learning
    12. Online Learning
    13. Transfer Learning
    14. Ensemble Learning

# Questions Prepared for Interviews

This is a set of questions that I prepared for interviews and for my own learning/review purposes.

## What is...?

Some "what is" questions -- mainly on the definition and quick explanations of different machine learning models/algorithms/techniques that people usually apply.

### What is linear regression?

- $y = ax + b$
- Regression I 561, Regression II 562
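
As a quick illustration of fitting $y = ax + b$, the slope and intercept can be estimated by ordinary least squares. The data below is made up and noise-free (generated from $a = 2$, $b = 1$), so this is only a sketch of the mechanics.

```python
import numpy as np

# Made-up, noise-free data generated from y = 2x + 1.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 2.0 * x + 1.0

# Design matrix with a column for the slope and a column of ones for the intercept.
A = np.column_stack([x, np.ones_like(x)])

# Least-squares solution of A @ [a, b] ≈ y.
(a, b), *_ = np.linalg.lstsq(A, y, rcond=None)
print(a, b)  # recovers the slope and intercept used to generate the data
```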

### What is logistic regression?

- $h(\pi_i) = \mbox{logit}(\pi_i) = \log \bigg( \frac{\pi_i}{1 - \pi_i}\bigg) = \beta_0 + \beta_1 X_{i, 1} + \beta_2 X_{i, 2} + \ldots + \beta_p X_{i, p}$
- Equivalently: $\pi_i = \frac{\exp\big[\mbox{logit}(\pi_i)\big]}{1 + \exp\big[\mbox{logit}(\pi_i)\big]}$ where $\mbox{logit}(\pi_i)= \log\left(\frac{\pi_i}{1 - \pi_i}\right)$
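
The two formulas above are inverses of each other, which a short numerical check makes concrete. The function names `logit` and `inv_logit` are chosen for this sketch; note that $\frac{e^z}{1 + e^z}$ is written in the algebraically equivalent form $\frac{1}{1 + e^{-z}}$.

```python
import numpy as np

def logit(p):
    """Log-odds: maps a probability in (0, 1) to the real line."""
    return np.log(p / (1.0 - p))

def inv_logit(z):
    """Inverse of logit: exp(z) / (1 + exp(z)), written as 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

p = np.array([0.1, 0.5, 0.9])
z = logit(p)
print(z)             # log-odds; 0.5 maps to 0
print(inv_logit(z))  # maps back to the original probabilities
```

In the model, $z$ is the linear predictor $\beta_0 + \beta_1 X_{i, 1} + \ldots + \beta_p X_{i, p}$, so `inv_logit` turns it into the predicted probability $\pi_i$.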

### What is LASSO regression?

- The Least Absolute Shrinkage and Selection Operator (LASSO) regression is a regression analysis method that performs both variable selection and regularization in order to enhance the prediction accuracy and interpretability of the resulting statistical model. It does so by adding an $\ell_1$ penalty on the coefficients, which can shrink some of them exactly to zero.
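
For reference, one common way to write the LASSO objective for a linear model is given below. The $\frac{1}{2n}$ scaling is a convention that varies between textbooks and libraries, and $\lambda \ge 0$ (the penalty strength) is a hyperparameter; both are stated here as assumptions rather than as the only formulation.

$$
\hat{\beta} = \arg\min_{\beta_0, \beta_1, \ldots, \beta_p} \; \frac{1}{2n} \sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j X_{i, j} \Big)^2 + \lambda \sum_{j=1}^{p} |\beta_j|
$$

The first term is the usual squared-error loss; the second is the absolute-value penalty, which is what drives some coefficients exactly to zero and thereby performs variable selection. The intercept $\beta_0$ is conventionally left unpenalized.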


### What is Support Vector Machine (SVM)?

- Supervised learning model
- Keys: support vectors, decision boundaries, kernels and basis functions
- In the SVM algorithm (applicable to both regression and classification), we represent each data point as a point in an n-dimensional space, where n is the number of features and the value of each feature is the value of a particular coordinate; the algorithm then looks for the hyperplane that separates the classes with the largest margin.
- Kernel trick: non-linear data is implicitly projected onto a higher-dimensional space, where it may become linearly separable by a hyperplane. Mathematically, this works because the dual form of the SVM problem (derived with Lagrange multipliers) depends on the data only through inner products, which a kernel function can compute without constructing the high-dimensional features explicitly.
- Pros:
    - SVMs (especially with radial basis functions as kernels) are powerful models, allowing for complex decision boundaries.
    - Work well on low dimensional as well as high dimensional data.
    - Work well on sparse data.
- Cons:
    - Computationally expensive and do not scale well to large datasets.
    - Require careful preprocessing of data and tuning of hyperparameters.
    - Hard to interpret; unlike regression models, we cannot easily interpret the result of SVM
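
The kernel trick mentioned above can be checked numerically on a small case. The sketch below uses the degree-2 polynomial kernel $k(x, z) = (x \cdot z)^2$ on 2-dimensional inputs and compares it with the inner product of an explicit 3-dimensional feature map; the names `poly2_kernel` and `phi` and the sample vectors are assumptions for illustration.

```python
import numpy as np

def poly2_kernel(x, z):
    """Degree-2 polynomial kernel, computed in the original 2-D space."""
    return float(np.dot(x, z) ** 2)

def phi(x):
    """Explicit feature map into 3-D whose inner product equals poly2_kernel."""
    x1, x2 = x
    return np.array([x1 ** 2, np.sqrt(2.0) * x1 * x2, x2 ** 2])

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

k_implicit = poly2_kernel(x, z)              # never builds the 3-D features
k_explicit = float(np.dot(phi(x), phi(z)))   # builds them, then takes the inner product
print(k_implicit, k_explicit)                # the two values agree
```

For kernels such as the radial basis function, the corresponding feature space is infinite-dimensional, so evaluating the kernel directly is the only practical route.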

### What is a neural network?
