# General ML Model

## What is linear in a generalized linear model?

A GLM typically contains 3 components:
1. A probability distribution from the exponential family
2. The linear predictor
3. The link function, which connects the mean of the distribution in 1. to the linear predictor

So the linear part of a GLM is the linear predictor: a linear combination of the inputs, weighted by the model parameters.
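As a rough illustration of the three components, the sketch below computes one linear predictor and passes it through two common inverse link functions. The function names, weights, and inputs are hypothetical and chosen only to keep the arithmetic easy to follow.

```python
import math

def linear_predictor(weights, features):
    """eta = w . x, the part of a GLM that is linear in the weights."""
    return sum(w * x for w, x in zip(weights, features))

def inverse_log_link(eta):
    """Inverse of the log link (used in Poisson regression): mean = exp(eta)."""
    return math.exp(eta)

def inverse_logit_link(eta):
    """Inverse of the logit link (used in logistic regression): mean = sigmoid(eta)."""
    return 1.0 / (1.0 + math.exp(-eta))

weights = [0.5, -0.25]   # hypothetical fitted weights
features = [2.0, 4.0]    # hypothetical input vector

eta = linear_predictor(weights, features)   # 0.5*2 - 0.25*4 = 0.0
poisson_mean = inverse_log_link(eta)        # exp(0) = 1.0
bernoulli_mean = inverse_logit_link(eta)    # sigmoid(0) = 0.5
```

The same `eta` feeds both models; only the link function and the assumed distribution of the outcome differ.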

## What does the curse of dimensionality mean?

The curse of dimensionality means that as the dimensionality increases, the volume of the space grows so fast that the available data become sparse. This sparsity is problematic for ML methods that rely on statistical significance. To obtain a reliable result, the amount of data needed grows exponentially with the dimensionality. A simple example follows.

Let's say you have a straight line 100 yards long and you dropped a penny somewhere on it. It wouldn't be too hard to find. You walk along the line and it takes two minutes.

Now let's say you have a square 100 yards on each side and you dropped a penny somewhere on it. It would be pretty hard, like searching across two football fields stuck together. It could take days.

Now a cube 100 yards across. That's like searching a 30-story building the size of a football stadium. Ugh.

Searching through the space gets much harder as you add dimensions. You might not realize this intuitively when it's just stated in mathematical formulas, since they all have the same "width". That's the curse of dimensionality. It gets to have a name because it is unintuitive, useful, and yet simple.
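The penny example can be put into numbers with a small sketch. It assumes the space is split into one-yard cells, and the helper names `cells_to_search` and `expected_points_per_cell` are made up for this illustration.

```python
def cells_to_search(side_length, dimensions):
    """Number of unit cells in a hypercube with the given side length."""
    return side_length ** dimensions

def expected_points_per_cell(num_points, bins_per_axis, dimensions):
    """Average number of data points per cell when each axis is split into bins."""
    return num_points / bins_per_axis ** dimensions

# Line, square, and cube that are 100 yards across.
for d in (1, 2, 3):
    print(d, cells_to_search(100, d))

# The same 1000 points spread over 10 bins per axis get sparse quickly.
for d in (1, 2, 3, 6):
    print(d, expected_points_per_cell(1000, 10, d))
```

With 1000 points and 10 bins per axis, the average drops from 100 points per cell in one dimension to one point per cell in three dimensions, which is the sparsity described above.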

## What is the difference between joint and conditional probability

+ Joint probability measures how likely it is that two (or more) things both occur: $P(A, B)$
+ Conditional probability measures how likely one thing is to happen given that the other has already happened: $P(A|B) = P(A, B) / P(B)$
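A small sketch with one fair six-sided die shows the two quantities side by side. The events (an even roll, a roll greater than three) are chosen only for illustration.

```python
from fractions import Fraction

outcomes = range(1, 7)  # faces of one fair die

def probability(event):
    """Probability of an event, given as a predicate over equally likely outcomes."""
    favorable = sum(1 for o in outcomes if event(o))
    return Fraction(favorable, len(outcomes))

def is_even(o):
    return o % 2 == 0

def greater_than_three(o):
    return o > 3

# Joint: the roll is even AND greater than three, i.e. {4, 6}.
joint = probability(lambda o: is_even(o) and greater_than_three(o))

# Conditional: the roll is even GIVEN that it is greater than three.
conditional = joint / probability(greater_than_three)

print(joint, conditional)
```

Here the joint probability is 1/3, while the conditional probability is 2/3 because knowing the roll is in {4, 5, 6} rules out half of the outcomes.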

## Implement logistic regression training for binary classification

+ We have m training data points $((x_1, y_1), (x_2, y_2), ..., (x_m, y_m))$.
+ Each $x_j$ is represented by an n-dimensional vector $x_j = (x_{j,1}, x_{j,2}, ..., x_{j,n})$
+ Model: $p(y=1|x) = y' = f(w \cdot x)$, where $f(z) = 1/(1 + \exp(-z))$ is the sigmoid function and $w$ is the weight vector; $p(y=0|x) = 1-y'$

Cross-entropy cost function between the predicted value and the ground truth for a single example: $H(y, y') = -y \log(y') - (1-y) \log(1-y')$

Gradient of this cost with respect to the $i^{th}$ weight $w_i$: $\partial H / \partial w_i = (y' - y) x_i$

So the loss function over the training data is: $L(w) = \frac{1}{m} \sum_{j=1}^{m} {[-y_j \log(y'_j) - (1-y_j) \log(1-y'_j)]}$

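Minimize $L(w)$ using gradient descent with learning rate $\alpha$, updating every weight $w_i$ simultaneously:

repeat until convergence {    
    $w_i := w_i - \alpha \frac{1}{m} \sum_{j=1}^{m} {(y'_j - y_j) x_{j,i}}$    
}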
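The procedure can be sketched in plain Python as follows. This is a minimal batch gradient descent under a few assumptions: the gradient is averaged over the m examples to match the $\frac{1}{m}$ in $L(w)$, a constant feature of 1.0 stands in for a bias term, and the toy data, learning rate, and iteration count are made up for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(w, x):
    """y' = sigmoid(w . x), the predicted probability that y = 1."""
    return sigmoid(sum(w_i * x_i for w_i, x_i in zip(w, x)))

def train_logistic_regression(X, y, alpha=0.1, num_iters=1000):
    """Batch gradient descent on the mean cross-entropy loss."""
    m, n = len(X), len(X[0])
    w = [0.0] * n
    for _ in range(num_iters):
        preds = [predict_proba(w, x_j) for x_j in X]
        # Gradient for weight i: (1/m) * sum_j (y'_j - y_j) * x_{j,i}
        grad = [
            sum((preds[j] - y[j]) * X[j][i] for j in range(m)) / m
            for i in range(n)
        ]
        # Update all weights simultaneously.
        w = [w_i - alpha * g_i for w_i, g_i in zip(w, grad)]
    return w

# Toy data: the first feature is a constant 1.0 acting as a bias term.
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 3.0], [1.0, 4.0]]
y = [0, 0, 1, 1]

w = train_logistic_regression(X, y, alpha=0.5, num_iters=2000)
labels = [1 if predict_proba(w, x_j) >= 0.5 else 0 for x_j in X]
print(w, labels)
```

A fixed iteration count is used here instead of a convergence check to keep the sketch short; a real implementation would usually stop when the loss or the weights change by less than some tolerance.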


## What is the difference between Naive Bayes and logistic regression

## What is the difference between linear regression and logistic regression

+ Use linear regression when you want to predict a continuous outcome.
+ Use logistic regression when the outcome is categorical.

For example, let X be the number of square feet in a house. If you want to predict the selling price of the house, you would use linear regression.
On the other hand, if you want to predict whether or not the house would sell for more than $500K, you would use logistic regression.
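The difference in output type can be seen in a short sketch. The weights and intercepts below are hypothetical values picked for illustration, not fitted to any real housing data.

```python
import math

def linear_regression_predict(w, b, sqft):
    """Continuous output: a predicted selling price."""
    return w * sqft + b

def logistic_regression_predict(w, b, sqft):
    """Output in (0, 1): predicted probability of selling above the threshold."""
    return 1.0 / (1.0 + math.exp(-(w * sqft + b)))

# Hypothetical parameters for a 2000 square foot house.
price = linear_regression_predict(w=200.0, b=50_000.0, sqft=2_000.0)
prob = logistic_regression_predict(w=0.004, b=-9.0, sqft=2_000.0)
sells_above_500k = prob >= 0.5

print(price, prob, sells_above_500k)
```

Both models start from the same linear expression `w * sqft + b`; logistic regression passes it through the sigmoid so the result can be read as a class probability and thresholded into a yes/no answer.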

## What is the difference between generative and discriminative models

## What is maximum likelihood, cost function, gradient descent

## What are some alternatives to gradient descent

## What is the EM algorithm? Give a couple of applications

## Explain what regularization is and why it is useful.

## What is a probabilistic graphical model?

## What is the difference between Markov networks and Bayesian networks?

## Explain decision tree & decision forest

## Explain kernel tricks

# Matrix factorization/Model selection/ 

## On what type of ensemble technique is a random forest based? What particular limitation does it try to address?

## Solve Ax=b

## Give an example of an application of non-negative matrix factorization

## What methods for dimensionality reduction do you know and how do they compare with each other?

## What are some good ways for performing feature selection that do not involve exhaustive search?

# Model Evaluation

## How would you evaluate the quality of the clusters that are generated by a run of K-means?

## Explain A/B testing and give a concrete example how to use it

## How would you validate a model you created to generate a predictive model of a quantitative outcome variable using multiple regression?

## Explain what precision and recall are. How do they relate to the ROC curve?

## How can you prove that one improvement you've brought to an algorithm is really an improvement over not doing anything?

## Is it better to have too many false positives, or too many false negatives? Explain.

## What is selection bias, why is it important and how can you avoid it?

# Data Science

## What is root cause analysis?

## Explain price optimization, price elasticity, inventory management, competitive intelligence?

## Explain what resampling methods are and why they are useful. Also explain their limitations.

# Deep Learning

## What are some of the main characteristics that distinguish deep learning from traditional machine learning?

## Differences between mean square error and cross entropy?

## Compare & contrast CNN & RNN

# NLP & Speech

## What's HMM? 

## What's CRF?

## What's sequence-to-sequence model?

## Compare & contrast sequence-to-sequence model with other NLP models?

# References

http://www.kdnuggets.com/2016/02/21-data-science-interview-questions-answers.html/3
    
https://www.quora.com/What-are-the-best-interview-questions-to-evaluate-a-machine-learning-researcher

http://stats.stackexchange.com/questions/65379/machine-learning-curse-of-dimensionality-explained