## Machine Learning

## Rejection reason

### Question

Suppose we have a binary classification model that classifies whether or not an applicant should be qualified to get a loan. Because we are a financial company, we have to provide each rejected applicant with a reason why. Given we don’t have access to the feature weights, how would we give each rejected applicant a reason why they got rejected?

### Answer

Create a **partial dependence plot** for each feature and identify the value ranges that increase the predicted probability of default. For each rejected applicant, check which of their feature values fall in those ranges and explain that those inputs were possibly the reason for rejection.
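
The partial dependence idea can be sketched without access to feature weights, because it only needs the model's predictions. The block below is a minimal illustration, not a production method: `toy_default_probability`, the feature names `debt_to_income` and `years_employed`, and all numbers are hypothetical stand-ins for the real black-box model and data.

```python
from statistics import mean

def toy_default_probability(applicant):
    """Hypothetical stand-in for the black-box model: returns P(default)."""
    score = 0.1
    score += 0.5 * applicant["debt_to_income"]
    score -= 0.02 * applicant["years_employed"]
    return min(1.0, max(0.0, score))

def partial_dependence(predict, rows, feature, grid):
    """Average prediction when `feature` is forced to each grid value."""
    curve = {}
    for value in grid:
        # Overwrite one feature for every row, keep the other features as observed.
        modified = [{**row, feature: value} for row in rows]
        curve[value] = mean(predict(row) for row in modified)
    return curve

# Hypothetical applicants.
applicants = [
    {"debt_to_income": 0.2, "years_employed": 5},
    {"debt_to_income": 0.6, "years_employed": 1},
    {"debt_to_income": 0.4, "years_employed": 10},
]

grid = [0.1, 0.3, 0.5, 0.7]
pd_curve = partial_dependence(toy_default_probability, applicants, "debt_to_income", grid)
for value in grid:
    print(f"debt_to_income={value}: mean P(default)={pd_curve[value]:.3f}")
```

If the curve rises with a feature, an applicant whose value sits in the high region can be told that this feature possibly contributed to the rejection.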

## Keyword bidding

### Question

Let’s say you’re working on keyword bidding optimization. You’re given a dataset with two columns. One column contains the keywords that are being bid against, and the other column contains the price that’s being paid for those keywords. Given this dataset, how would you build a model to bid on a new unseen keyword?

### Answer

We need to build a supervised learning model that takes the keyword as input and outputs the bid price.

Represent each keyword with word embeddings, so that similar words have similar vector representations.

For a new keyword, compute the cosine similarity between its embedding and the embeddings of the known keywords, and recommend a price based on the prices of the most similar keywords.
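
A minimal sketch of the similarity-based pricing step follows. The three-dimensional vectors, the keywords, and the bid prices are made up for illustration; in practice the vectors would come from a pretrained embedding model. The helper `suggest_bid` and its similarity-weighted average over the `k` nearest keywords are assumptions, one of several reasonable ways to turn similar keywords into a price.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings and bid prices for known keywords.
embeddings = {
    "running shoes": [0.9, 0.1, 0.0],
    "sneakers": [0.8, 0.2, 0.1],
    "laptop": [0.0, 0.1, 0.9],
}
known_bids = {"running shoes": 1.20, "sneakers": 1.00, "laptop": 3.50}

def suggest_bid(new_vector, k=2):
    """Similarity-weighted average bid of the k most similar known keywords.

    Assumes the top-k similarities are positive.
    """
    scored = sorted(
        ((cosine_similarity(new_vector, vec), kw) for kw, vec in embeddings.items()),
        reverse=True,
    )[:k]
    total_weight = sum(sim for sim, _ in scored)
    return sum(sim * known_bids[kw] for sim, kw in scored) / total_weight

# Hypothetical embedding of an unseen keyword such as "trainers".
bid = suggest_bid([0.85, 0.15, 0.05])
print(f"suggested bid: {bid:.2f}")
```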

Todo: GloVe embedding, Word2Vec.

## 85% vs 82%

### Question

We have two models: one with 85% accuracy, one 82%. Which one do you pick?

### Answer

We need to know whether higher accuracy or higher interpretability matters more to the business, because the 85% accuracy model could be a black box while the 82% accuracy model could be an interpretable linear model such as logistic regression.

If this is a binary classification model and recall or precision matters to the business, we might pick a model regardless of its accuracy. The 82% accuracy model could have higher recall or precision than the 85% accuracy model.
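
To make the accuracy versus recall point concrete, here is a small sketch with two hypothetical confusion matrices on 100 examples, 20 of which are positive. The counts are invented so that the accuracies come out at 85% and 82%.

```python
def metrics(tp, fp, fn, tn):
    """Accuracy, precision and recall from confusion-matrix counts."""
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
    }

# Hypothetical counts: model A is more accurate but misses most positives.
model_a = metrics(tp=8, fp=3, fn=12, tn=77)
# Model B is less accurate overall but catches most positives.
model_b = metrics(tp=16, fp=14, fn=4, tn=66)

for name, m in (("A", model_a), ("B", model_b)):
    print(name, {k: round(v, 3) for k, v in m.items()})
```

In this made-up case model B has the lower accuracy but twice the recall, so it would be preferred if missing a positive is the costly error.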

We also need to consider scalability. The 85% accuracy model could take too long to train or use too much memory to be usable in a production environment.

## Multicollinearity in regression

### Question

How would you tackle multicollinearity in multiple linear regression?

### Answer

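Ignore multicollinearity if the model predicts well on both the training and test datasets and the correlations between variables are not very high.

If the multicollinearity is caused by higher-order terms (such as squared or interaction terms), standardize the variables. Centering them removes much of the correlation between a variable and its higher-order terms.

Reduce the number of variables: remove one variable from each highly correlated pair, or apply PCA to reduce the dimensionality.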
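
The effect of centering on higher-order terms can be checked numerically. The sketch below uses made-up values of a single variable `x` and compares the Pearson correlation between `x` and `x**2` before and after subtracting the mean. The data are deliberately symmetric around the mean, so the centered correlation vanishes here; real data would usually show a reduction rather than exactly zero.

```python
import math

def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

# Hypothetical predictor values.
x = [10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20]

# Correlation between the raw variable and its square.
raw_corr = pearson(x, [v ** 2 for v in x])

# Correlation after centering the variable first.
mean_x = sum(x) / len(x)
centered = [v - mean_x for v in x]
centered_corr = pearson(centered, [v ** 2 for v in centered])

print(f"raw: {raw_corr:.4f}, centered: {centered_corr:.4f}")
```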

### Resource

- [When Do You Need to Standardize the Variables in a Regression Model?](https://statisticsbyjim.com/regression/standardize-variables-regression/)

## Variate anomalies

### Question

If given a univariate dataset, how would you design a function to detect anomalies? What if the data is bivariate?

### Answer

For univariate data, a function can find the values at, for example, the 1st and 99th percentiles and flag (or eliminate) all values below or above those thresholds. For bivariate data, anomalies can be detected in each variable individually or in the combination of the two variables. Examples of machine learning algorithms for anomaly detection are **Isolation Forest**, **DBSCAN**, and **Bayesian Gaussian Mixture**.
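
A minimal sketch of the percentile-based function for the univariate case is shown below. The linear-interpolation percentile, the function names, and the sample data are assumptions made for illustration; the 1st and 99th percentile cutoffs are the example thresholds from the answer and should be tuned per dataset.

```python
def percentile(sorted_values, q):
    """Linearly interpolated percentile (q in [0, 100]) of pre-sorted data."""
    position = (len(sorted_values) - 1) * q / 100
    lower = int(position)
    upper = min(lower + 1, len(sorted_values) - 1)
    fraction = position - lower
    return sorted_values[lower] + fraction * (sorted_values[upper] - sorted_values[lower])

def detect_anomalies(values, low_q=1, high_q=99):
    """Return the values that fall outside the [low_q, high_q] percentile band."""
    ordered = sorted(values)
    low = percentile(ordered, low_q)
    high = percentile(ordered, high_q)
    return [v for v in values if v < low or v > high]

# Hypothetical data: 1..99 plus two extreme values.
data = list(range(1, 100)) + [1000, -500]
anomalies = detect_anomalies(data)
print(anomalies)
```

For the bivariate case, the same function could be applied to each variable separately, but that would miss points that are unusual only in combination, which is where the algorithms named above come in.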

## Explaining linear regression to different audiences

### Question

How would you explain the concept of linear regression to a child, a first-year college student and a mathematician?

### Answer

To a child: draw a lot of points, then draw one straight line through the middle of the group of points.

To a college student: linear regression is a statistical model for predicting numbers. It draws a straight line through a scatter plot of data points such that the line minimizes the distance between the actual values and the predicted values.

To a mathematician: linear regression is a method for modeling the relationship between a dependent variable $y$ and one or more independent variables $X$. It assumes that the relationship between $y$ and $X$ is linear, so $y$ is predicted by a linear combination of $X$. The coefficients of this linear combination are chosen to minimize the distance between the actual values and the predicted values.
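
The mathematician's version can be written compactly. This is a sketch under the assumption that "distance" means the sum of squared residuals (ordinary least squares) and that $X^\top X$ is invertible. The symbols $\beta$ and $\hat{\beta}$ for the coefficient vector and its estimate are introduced here and are not part of the original answer.

$$
\hat{\beta} = \arg\min_{\beta} \lVert y - X\beta \rVert_2^2 = (X^\top X)^{-1} X^\top y
$$

The fitted values are then $\hat{y} = X\hat{\beta}$.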

## Credit card fraud model

### Question

Say you work at a major credit card company and are given a dataset of 600,000 credit card transactions. Use this dataset to build a fraud detection model.

### Answer

Clarify how a transaction came to be labeled as fraud: was it reported by the user or decided by the bank?

Check how frequent fraud is in the dataset. Fraud is probably rare, so apply rebalancing methods such as up-sampling, down-sampling, or SMOTE.
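
As an illustration of the simplest rebalancing option, the sketch below performs random up-sampling of the minority class in plain Python. The field names `amount` and `is_fraud`, the 2% fraud rate, and the helper `upsample_minority` are hypothetical. It assumes both classes are present, and resampling should be applied to the training split only so that duplicated rows do not leak into evaluation data.

```python
import random

def upsample_minority(rows, label_key="is_fraud", seed=0):
    """Randomly duplicate minority-class rows until both classes are equal in size."""
    rng = random.Random(seed)
    fraud = [r for r in rows if r[label_key] == 1]
    legit = [r for r in rows if r[label_key] == 0]
    minority, majority = (fraud, legit) if len(fraud) < len(legit) else (legit, fraud)
    # Sample with replacement to fill the gap between the two class sizes.
    extra = rng.choices(minority, k=len(majority) - len(minority))
    return majority + minority + extra

# Hypothetical transactions: 10 fraudulent out of 500.
transactions = [
    {"amount": 10 * i, "is_fraud": 1 if i % 50 == 0 else 0}
    for i in range(1, 501)
]

balanced = upsample_minority(transactions)
fraud_count = sum(r["is_fraud"] for r in balanced)
print(len(balanced), fraud_count)
```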

Perform feature engineering, for example deriving the time of day of the transaction.

### Resource

- [Credit card fraud model](https://www.interviewquery.com/questions/credit-card-fraud-model)

## Pizza no show

### Question

You run a pizza shop, and you run into a lot of no-shows after customers place their order. What features would you include in a model to try to predict a no-show?

### Answer

Some of the following variables may be difficult to obtain. Which ones are feasible depends on discussions with the business and technical teams.

Business level: order type (call, online, in-person), employee who took the order, order cost, projected time to complete the order, time of day of the order, receipt type (delivery, pickup).

Customer level: customer's area, time taken to place the order (on the call or on the website), new or existing customer.

Environmental level: day of the week, temperature, weather.

## Search Algorithm Recall

### Question

Let's say you work as a data scientist at Amazon. You want to improve the search results for product search but cannot change the underlying logic in the search algorithm. What methods could you use to increase recall?

### Answer

Recall is

$$
\frac{TP}{TP + FN}
$$

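Lower the decision threshold so that more results are predicted as positive (relevant). This turns some false negatives into true positives, which increases recall, usually at the cost of lower precision.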
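
The effect of the threshold on recall can be seen in a small sketch. It assumes the search system exposes a relevance score per result, which the question does not state; the scores, labels, and threshold values are invented for illustration.

```python
def recall_at_threshold(scores, labels, threshold):
    """Recall when every item with score >= threshold is predicted positive."""
    predicted = [s >= threshold for s in scores]
    tp = sum(1 for p, y in zip(predicted, labels) if p and y == 1)
    fn = sum(1 for p, y in zip(predicted, labels) if not p and y == 1)
    return tp / (tp + fn)

# Hypothetical relevance scores and ground-truth labels (1 = relevant).
scores = [0.95, 0.80, 0.65, 0.40, 0.30, 0.10]
labels = [1, 1, 0, 1, 0, 1]

recall_strict = recall_at_threshold(scores, labels, 0.7)
recall_loose = recall_at_threshold(scores, labels, 0.35)
print(recall_strict, recall_loose)
```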

## Missing housing data

### Question

We want to build a model to predict housing prices in a city. We've scraped 100,000 listings over the past 3 years but found that around 20% of the listings are missing square footage data. How do we deal with the missing data to construct our model?

### Answer

Idea of a model **without square footage**. I assume square footage is an important feature for predicting housing prices, but we can explore whether a model works without it. For example, remove the rows with missing square footage and build two models on the remaining data: one with square footage plus the other features, and one with the other features only. If accuracy does not drop much without square footage, we could build the model without it. This is unlikely, though.

Idea of **deleting** the rows with missing square footage. Remove the rows with missing square footage and call the remainder the 80% data. Build two models with the same specification: one on the 80% data, and one on a further reduced sample that drops another 20% of the listings (call this the 60% data). If the 60% data model is only slightly less accurate than the 80% data model, we may be able to simply delete the rows with missing square footage.

Idea of keeping the data and **imputing** the missing values. Fill in the missing values with estimates. A simple estimate is the mean or median of the square footage distribution. Another is the **k-nearest neighbors algorithm**, which approximates a listing's square footage from similar listings based on its other features. We can also take means within subsets of the data, grouping by other features such as location or number of bedrooms.
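
The group-based imputation idea can be sketched as follows. The field names `bedrooms` and `sqft`, the sample listings, and the helper `impute_sqft` are hypothetical. The sketch uses the group median with a fallback to the overall median when a group has no known values; the mean would work the same way.

```python
from statistics import median

def impute_sqft(listings, group_key="bedrooms"):
    """Fill missing `sqft` with the group median, else the overall median."""
    known = [row["sqft"] for row in listings if row["sqft"] is not None]
    overall = median(known)

    # Collect known square footage per group.
    by_group = {}
    for row in listings:
        if row["sqft"] is not None:
            by_group.setdefault(row[group_key], []).append(row["sqft"])
    group_medians = {group: median(values) for group, values in by_group.items()}

    filled = []
    for row in listings:
        sqft = row["sqft"]
        if sqft is None:
            sqft = group_medians.get(row[group_key], overall)
        filled.append({**row, "sqft": sqft})
    return filled

# Hypothetical listings; None marks missing square footage.
listings = [
    {"bedrooms": 1, "sqft": 600},
    {"bedrooms": 1, "sqft": 700},
    {"bedrooms": 1, "sqft": None},
    {"bedrooms": 3, "sqft": 1500},
    {"bedrooms": 3, "sqft": None},
    {"bedrooms": 5, "sqft": None},
]

filled = impute_sqft(listings)
print([row["sqft"] for row in filled])
```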