# Running list of data science interview questions/answers

## Statistics

#### How should outliers be detected and handled?

One common method is to draw a box-and-whisker plot and flag as outliers any data points below $Q_1 - 1.5 \cdot IQR$ or above $Q_3 + 1.5 \cdot IQR$, where $IQR = Q_3 - Q_1$ is the interquartile range.
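
The 1.5 IQR rule can be sketched in a few lines of Python. This is a minimal illustration, not a definitive implementation: the function name `iqr_outliers`, the toy data, and the choice of the "inclusive" quartile convention are all assumptions (different quartile conventions give slightly different fences).

```python
from statistics import quantiles

def iqr_outliers(data, k=1.5):
    """Return the points lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    # quantiles(..., n=4) returns the three quartile cut points Q1, Q2, Q3.
    # method="inclusive" is an assumed convention, not the only valid one.
    q1, _, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [x for x in data if x < lower or x > upper]

# Toy data (made up for illustration): one value far from the rest.
values = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
outliers = iqr_outliers(values)
print(outliers)
```

How to handle a flagged point (keep, cap, transform, or remove) depends on whether it is a data error or a genuine extreme value.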

#### Explain Frequentist vs Bayesian approaches to statistical analysis

The frequentist approach interprets probability as long-run frequency, and attempts to make statements about some true (but unknown) parameters of the distribution that generated the sample data. 
The Bayesian approach starts from an a priori belief about those parameters and updates that belief with data, yielding statements about the degree to which one believes something to be true. 
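
As a small illustration of the difference, consider estimating a success probability from 7 successes in 10 trials. The sketch below is an assumption-laden toy: the function names, the Wald (normal-approximation) interval on the frequentist side, and the uniform Beta(1, 1) prior on the Bayesian side are choices made here for illustration only.

```python
import math

def frequentist_estimate(successes, trials, z=1.96):
    """Point estimate and approximate 95% Wald interval for a proportion."""
    p_hat = successes / trials
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / trials)
    return p_hat, (p_hat - half_width, p_hat + half_width)

def bayesian_update(successes, trials, prior_a=1.0, prior_b=1.0):
    """Update a Beta(prior_a, prior_b) prior with binomial data.

    The Beta prior is conjugate to the binomial likelihood, so the
    posterior is Beta(prior_a + successes, prior_b + failures).
    """
    post_a = prior_a + successes
    post_b = prior_b + (trials - successes)
    posterior_mean = post_a / (post_a + post_b)
    return post_a, post_b, posterior_mean

p_hat, wald_interval = frequentist_estimate(7, 10)
post_a, post_b, post_mean = bayesian_update(7, 10)
print(p_hat, wald_interval)
print(post_a, post_b, post_mean)
```

The frequentist output is a statement about a fixed unknown parameter, while the Bayesian output is a full distribution over the parameter that changes with the prior.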

#### Explain the concept of a p-value in hypothesis testing

A p-value (probability value) is the probability of observing sample data as extreme as, or more extreme than, what was actually observed, assuming the null hypothesis is true. A decision rule is formed using the significance level, alpha (the type I error rate): if the p-value is less than or equal to alpha we reject the null hypothesis, and otherwise we fail to reject it. 
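
A worked example may help. Suppose we flip a coin 10 times, observe 9 heads, and test the null hypothesis that the coin is fair against the one-sided alternative that it favors heads. The scenario and the function name `binomial_p_value` are assumptions made for illustration.

```python
from math import comb

def binomial_p_value(successes, trials, p_null=0.5):
    """One-sided p-value: P(X >= successes) for X ~ Binomial(trials, p_null)."""
    return sum(
        comb(trials, k) * p_null ** k * (1 - p_null) ** (trials - k)
        for k in range(successes, trials + 1)
    )

alpha = 0.05                                    # significance level
p_value = binomial_p_value(successes=9, trials=10)
reject_null = p_value <= alpha                  # decision rule
print(p_value, reject_null)
```

Here "as or more extreme" means 9 or 10 heads, so the p-value is $\frac{10 + 1}{2^{10}} = \frac{11}{1024} \approx 0.011$, which is below 0.05.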

#### What are type I and type II error rates? What is power? Which is more important?

#### Explain bar chart vs histogram

#### Explain the key assumptions for linear regression

#### What is R squared?

#### Confidence vs Prediction interval*

Once a regression equation has been obtained, we may wish to predict a response $\hat{y}$ for new data $x_{new}$. The regression equation is

$$\large{\mathbf{y=X\beta+\epsilon}}$$

Here, $\mathbf{X}$ is the design matrix, $\mathbf{\beta}$ is the vector of parameters, and $\mathbf{\epsilon}$ is the error vector (with errors assumed to satisfy $\epsilon_{i}\overset{iid}{\sim}N(0,\sigma^{2})$). For $\mathbf{x_{new}}$ the predicted response will be

$$\hat{y}=x_{new}^{T}\hat{\beta}$$

where $\hat{\beta}$ is the vector of estimated coefficients, one for each explanatory variable. Of course we don't expect our prediction to be exactly equal to the true value: there is uncertainty in our coefficient estimates and also in the model itself, since even if we had $\hat{\beta}=\beta$ there would still be uncertainty from $\epsilon$. If we want a confidence interval for the true mean response at particular new data $x_{new}$, we use $\hat{y}$ as our point estimate and build the interval around it. That is, we seek an interval for the expected value of $y$ given the new data, $\mathbb{E}[{y}~|~x_{new}]$, so we add and subtract from our point estimate a margin of error that depends on the degree of confidence desired. Note that

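$$\mathbb{E}[\hat{y}~|~x_{new}]~=~\mathbb{E}[~x_{new}^{T}\hat{\beta}~]~=~x_{new}^{T}\beta$$

that is, $\hat{y}$ is unbiased for the mean response, since the least-squares estimate satisfies $\mathbb{E}[\hat{\beta}]=\beta$.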
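
To make the contrast concrete, here is a rough sketch for the special case of simple linear regression (one predictor plus an intercept). It uses the standard textbook half-widths $t \cdot s\sqrt{\frac{1}{n}+\frac{(x_{new}-\bar{x})^2}{S_{xx}}}$ for the confidence interval of the mean response and $t \cdot s\sqrt{1+\frac{1}{n}+\frac{(x_{new}-\bar{x})^2}{S_{xx}}}$ for the prediction interval of a single new observation. The function names and toy data are assumptions, and the critical value `t_crit` is passed in by hand (2.306 is the 97.5th percentile of a t distribution with 8 degrees of freedom).

```python
import math

def fit_simple_ols(xs, ys):
    """Least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
    intercept = y_bar - slope * x_bar
    residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    # Residual standard error, with n - 2 degrees of freedom.
    s = math.sqrt(sum(r ** 2 for r in residuals) / (n - 2))
    return intercept, slope, s, x_bar, sxx, n

def intervals(xs, ys, x_new, t_crit):
    """Return y_hat, the CI for the mean response, and the PI for a new y."""
    intercept, slope, s, x_bar, sxx, n = fit_simple_ols(xs, ys)
    y_hat = intercept + slope * x_new
    leverage = 1 / n + (x_new - x_bar) ** 2 / sxx
    ci_half = t_crit * s * math.sqrt(leverage)      # uncertainty in the fitted mean
    pi_half = t_crit * s * math.sqrt(1 + leverage)  # adds the noise of a new observation
    return (y_hat,
            (y_hat - ci_half, y_hat + ci_half),
            (y_hat - pi_half, y_hat + pi_half))

# Toy data, made up for illustration (n = 10, so df = 8).
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
ys = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1, 18.0, 19.9]
y_hat, ci, pi = intervals(xs, ys, x_new=5.5, t_crit=2.306)
print(y_hat, ci, pi)
```

The prediction interval is always wider than the confidence interval at the same $x_{new}$, because it accounts for the error term $\epsilon$ of the new observation in addition to the uncertainty in $\hat{\beta}$.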

#### Explain precision vs accuracy in binary classification

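Accuracy is the proportion of all predictions that are correct, $\frac{TP + TN}{Total}$, and precision is the proportion of positive predictions that are correct, $\frac{TP}{TP + FP}$, where TP is the number of true positives, TN true negatives, and FP false positives. Precision is also called Positive Predictive Value (PPV). In general statistical jargon, outside of classification, variability is the amount of imprecision and bias is the amount of inaccuracy.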
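
The two formulas can be computed directly from paired labels. The sketch below uses hypothetical helper names and made-up labels, chosen so that accuracy and precision differ.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, FP, TN, FN for a binary classification problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    return tp, fp, tn, fn

def accuracy(y_true, y_pred):
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    return (tp + tn) / (tp + fp + tn + fn)

def precision(y_true, y_pred):
    tp, fp, _, _ = confusion_counts(y_true, y_pred)
    # Precision is undefined when nothing is predicted positive.
    return tp / (tp + fp) if (tp + fp) > 0 else float("nan")

# Toy labels (made up): mostly negatives, which inflates accuracy.
y_true = [1, 0, 0, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 0]
acc = accuracy(y_true, y_pred)
prec = precision(y_true, y_pred)
print(acc, prec)
```

With these labels there are 1 TP, 1 FP, 5 TN, and 1 FN, so accuracy is $\frac{6}{8}$ while precision is only $\frac{1}{2}$.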

<div>
<img src="https://upload.wikimedia.org/wikipedia/commons/e/e7/Sensitivity_and_specificity.svg" width="500"/>
<img src="https://upload.wikimedia.org/wikipedia/commons/2/26/Precisionrecall.svg" width="500"/>
</div>

## Programming

#### Python

#### R

#### SQL

#### Spark/Hadoop/NoSQL

## Machine Learning

##### Supervised vs Unsupervised Learning

##### Explain the bias-variance tradeoff

##### Tell me about an original algorithm you’ve created.

##### Explain Random Forest in layman's terms:

Random Forest is a supervised machine learning algorithm that extends the CART (Classification and Regression Trees) paradigm. Decision trees are built in the conventional way, through recursive binary splitting: we take all of our data and split it into two groups based on some characteristic or feature (for example, male vs female). We then split each of those two child "nodes" on another feature, and so on until some pre-determined depth is reached. This constitutes one tree. At each node of the tree, a criterion determines the "best" split.

However, simply repeating this process would produce many very similar trees. To "decorrelate" the trees, each tree is grown on a bootstrap resample of the training data, and only a random subset of the available splitting features is considered at each node. This is the "random" part of random forest. The "forest" part comes from repeating this randomized tree-growing process many times and aggregating the results of the trees.

To determine the outcome for unseen test data, we simply "drop" the new data down each tree in the forest. For classification, each tree assigns the new data to some category, and we take the majority vote of all trees in the forest to classify the new data. For regression, we average the output of each tree. Random forest is an $\mathit{ensemble}$ method, further improved by forcing trees to be grown differently so that a small subset of strong predictors does not dominate every tree. 
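
The description above can be turned into a toy classifier in pure Python. This is a teaching sketch under several assumptions, not a production implementation: all function names are hypothetical, Gini impurity is assumed as the split criterion, candidate thresholds are midpoints between observed values, and each tree is grown on a bootstrap resample of the rows as in the standard formulation of random forests.

```python
import random
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def majority(labels):
    return Counter(labels).most_common(1)[0][0]

def best_split(X, y, feature_ids):
    """Best (feature, threshold) by weighted Gini, searching only feature_ids."""
    best, best_score = None, float("inf")
    for f in feature_ids:
        values = sorted({row[f] for row in X})
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2  # midpoint between consecutive observed values
            left = [lab for row, lab in zip(X, y) if row[f] <= t]
            right = [lab for row, lab in zip(X, y) if row[f] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if score < best_score:
                best, best_score = (f, t), score
    return best

def grow_tree(X, y, max_depth, n_sub, rng):
    """Recursive binary splitting with a random feature subset at each node."""
    if max_depth == 0 or len(set(y)) == 1:
        return majority(y)  # leaf: a class label (labels must not be tuples)
    feature_ids = rng.sample(range(len(X[0])), n_sub)
    split = best_split(X, y, feature_ids)
    if split is None:  # no usable split among the sampled features
        return majority(y)
    f, t = split
    left = [i for i, row in enumerate(X) if row[f] <= t]
    right = [i for i, row in enumerate(X) if row[f] > t]
    return (f, t,
            grow_tree([X[i] for i in left], [y[i] for i in left], max_depth - 1, n_sub, rng),
            grow_tree([X[i] for i in right], [y[i] for i in right], max_depth - 1, n_sub, rng))

def predict_tree(tree, row):
    """Drop a row down one tree until it reaches a leaf."""
    while isinstance(tree, tuple):
        f, t, left, right = tree
        tree = left if row[f] <= t else right
    return tree

def fit_forest(X, y, n_trees=25, max_depth=3, n_sub=1, seed=0):
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(X)) for _ in X]  # bootstrap resample of rows
        forest.append(grow_tree([X[i] for i in idx], [y[i] for i in idx],
                                max_depth, n_sub, rng))
    return forest

def predict_forest(forest, row):
    """Majority vote over all trees."""
    return majority([predict_tree(tree, row) for tree in forest])

# Toy data (made up): two well-separated clusters.
X = [[0, 0], [0, 1], [1, 0], [1, 1], [5, 5], [5, 6], [6, 5], [6, 6]]
y = ["a", "a", "a", "a", "b", "b", "b", "b"]
forest = fit_forest(X, y)
print(predict_forest(forest, [0.5, 0.5]), predict_forest(forest, [5.5, 5.5]))
```

For regression, the same structure applies with a variance-based split criterion and averaging in place of the majority vote.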

#### Explain Random Forest in technical terms/explain common split rules


#### Explain K-fold Cross Validation and why it is used

The K in K-fold CV refers to the number of groups, or "folds", into which the training data is split; for 5-fold CV we take the training data and divide it into 5 groups at random. We then train our algorithm on 4 of the 5 groups, and use the left-out 5th group for testing or "validation" of the model. In the case of 5-fold CV, this means we fit the model 5 times, with a different fold held out each time. We then have 5 validation set error rates, which we average to obtain an estimate of the test set error rate. We use this validation error rate for model selection, by repeating the process with different tuning parameters. We choose the parameters with the lowest average validation error (or perhaps a variation on that rule), and then use those parameters to train the model on the full training data. The benefit of doing this, rather than simply fitting on all the training data for each combination of hyperparameters and comparing training error, is that we reduce overfitting: model selection is based on error measured on held-out data that the model was not trained on. 
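
The procedure can be sketched without any libraries. In this illustration the model is a one-dimensional k-nearest-neighbours regressor and the tuning parameter is the number of neighbours; the model choice, the function names, and the synthetic data are all assumptions made for the example.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Shuffle range(n) and deal the indices into k folds of near-equal size."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def knn_predict(train_x, train_y, x, n_neighbors):
    """Average the targets of the n_neighbors training points closest to x."""
    order = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
    nearest = order[:n_neighbors]
    return sum(train_y[i] for i in nearest) / len(nearest)

def cv_error(xs, ys, n_neighbors, k=5, seed=0):
    """Mean squared validation error, averaged over the k held-out folds."""
    fold_errors = []
    for held_out in k_fold_indices(len(xs), k, seed):
        held = set(held_out)
        train_x = [xs[i] for i in range(len(xs)) if i not in held]
        train_y = [ys[i] for i in range(len(xs)) if i not in held]
        sq_errors = [
            (knn_predict(train_x, train_y, xs[i], n_neighbors) - ys[i]) ** 2
            for i in held_out
        ]
        fold_errors.append(sum(sq_errors) / len(sq_errors))
    return sum(fold_errors) / k

# Synthetic data (made up): a noisy quadratic.
rng = random.Random(1)
xs = [i / 10 for i in range(50)]
ys = [x ** 2 + rng.gauss(0, 0.5) for x in xs]

# Model selection: pick the tuning parameter with the lowest CV error.
candidates = [1, 3, 5, 15, 40]
scores = {m: cv_error(xs, ys, m) for m in candidates}
best = min(scores, key=scores.get)
print(scores, best)
```

After selecting `best`, the final model would be refit on all of the training data with that parameter, as described above.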

## Data Visualization

## General (Git/Linux/Data Science)

#### What are some ways to handle missing data?

Missing data may be handled in the following ways:
1. Impute missing values using some rule or function (median of values for that feature, for instance)
2. Ignore missing values (usually not possible)
3. Remove all observations which have any missing values/too many missing values
4. Attempt to acquire missing data

Validate the choice by trying several of these approaches and comparing the results. 
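
Two of the options above, median imputation and removing incomplete observations, can be sketched as follows. The records, the use of `None` to mark a missing value, and the function names are assumptions made for illustration.

```python
from statistics import median

# Toy records (made up); None marks a missing value.
rows = [
    {"age": 34, "income": 52000},
    {"age": None, "income": 48000},
    {"age": 29, "income": None},
    {"age": 41, "income": 61000},
]

def drop_incomplete(rows):
    """Option 3: remove every observation that has any missing value."""
    return [r for r in rows if all(v is not None for v in r.values())]

def impute_median(rows, column):
    """Option 1: replace missing values in one column with the observed median."""
    observed = [r[column] for r in rows if r[column] is not None]
    fill = median(observed)
    return [{**r, column: fill if r[column] is None else r[column]} for r in rows]

complete_rows = drop_incomplete(rows)
imputed_rows = impute_median(rows, "age")
print(len(complete_rows), imputed_rows[1]["age"])
```

Dropping rows discards half of this toy data set, while imputation keeps every row at the cost of inventing a plausible value, which is the kind of trade-off worth comparing across approaches.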

##### Do you contribute to any open-source projects?

##### Have you written a whitepaper?