## Machine Learning

## Rejection reason

### Question

Suppose we have a binary classification model that classifies whether or not an applicant should be qualified to get a loan. Because we are a financial company, we have to provide each rejected applicant with a reason why. Given we don’t have access to the feature weights, how would we give each rejected applicant a reason why they got rejected?

### Answer

Create a **partial dependence plot** for each feature to see how the predicted probability of default changes as that feature varies. For each rejected applicant, find the features whose values fall in a range where the plot shows a higher probability of default, and explain that those inputs were a likely reason for the rejection.
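A minimal sketch of how a partial dependence curve can be computed by hand when only the model's predictions are available. The scoring function `predict_default_proba` and the features `income` and `debt_ratio` are hypothetical stand-ins for the real black-box model and its inputs.

```python
import math

def predict_default_proba(row):
    """Hypothetical black-box model: returns a probability of default."""
    score = 2.0 - 0.00005 * row["income"] + 3.0 * row["debt_ratio"]
    return 1.0 / (1.0 + math.exp(-score))

def partial_dependence(predict, rows, feature, grid):
    """Average prediction when `feature` is forced to each grid value."""
    curve = []
    for value in grid:
        # Overwrite the feature for every row, keep the other features as observed.
        preds = [predict({**row, feature: value}) for row in rows]
        curve.append((value, sum(preds) / len(preds)))
    return curve

# Made-up applicants, for illustration only.
applicants = [
    {"income": 30_000, "debt_ratio": 0.6},
    {"income": 55_000, "debt_ratio": 0.3},
    {"income": 90_000, "debt_ratio": 0.1},
]

pd_curve = partial_dependence(
    predict_default_proba, applicants, "debt_ratio", [0.1, 0.3, 0.5, 0.7]
)
for value, avg_proba in pd_curve:
    print(f"debt_ratio={value:.1f} -> average P(default)={avg_proba:.3f}")
```

If the curve rises with `debt_ratio`, an applicant with a high `debt_ratio` can be told that this input likely contributed to the rejection.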

## Keyword bidding

### Question

Let’s say you’re working on keyword bidding optimization. You’re given a dataset with two columns. One column contains the keywords that are being bid against, and the other column contains the price that’s being paid for those keywords. Given this dataset, how would you build a model to bid on a new unseen keyword?

### Answer

We need a supervised learning model that takes the keyword as input and outputs the bid price.

We build word embeddings, in which similar words are represented by similar vectors.

For a new keyword, compute the cosine similarity between its embedding and the embeddings of the known keywords, and recommend a price based on the prices of the most similar keywords.
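The similarity lookup can be sketched as follows. The 3-dimensional vectors and the prices are made up, and averaging the prices of the top `k` neighbours weighted by similarity is one possible choice rather than a prescribed rule. Real embeddings would come from a trained model.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy embeddings and prices (hypothetical values for illustration).
embeddings = {
    "laptop": [0.9, 0.1, 0.0],
    "notebook": [0.8, 0.2, 0.1],
    "pizza": [0.0, 0.1, 0.9],
}
prices = {"laptop": 2.50, "notebook": 2.10, "pizza": 0.40}

def suggest_bid(new_vector, k=2):
    """Similarity-weighted average price of the k most similar known keywords."""
    scored = sorted(
        ((cosine_similarity(new_vector, vec), word) for word, vec in embeddings.items()),
        reverse=True,
    )[:k]
    total_weight = sum(sim for sim, _ in scored)
    return sum(sim * prices[word] for sim, word in scored) / total_weight

new_keyword_vector = [0.85, 0.15, 0.05]  # hypothetical embedding of an unseen keyword
bid = suggest_bid(new_keyword_vector)
print(round(bid, 2))
```

Here the two nearest neighbours are `laptop` and `notebook`, so the suggested bid lands between their prices.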

Todo: GloVe embedding, Word2Vec.

## 85% vs 82%

### Question

We have two models: one with 85% accuracy, one 82%. Which one do you pick?

### Answer

We need to know whether higher accuracy or higher interpretability matters more to the business, because the 85% model could be a black box while the 82% model could be an interpretable linear model such as logistic regression.

If this is a binary classification model and recall or precision is what matters to the business, we might pick a model regardless of its accuracy. The 82% model could have higher recall or precision than the 85% model.
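A small worked example of how the less accurate model can have the better recall. The confusion-matrix counts are hypothetical, chosen so that the accuracies come out at 85% and 82% on 1,000 test rows with 200 positives.

```python
def metrics(tp, fp, fn, tn):
    """Return (accuracy, recall, precision) from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    return accuracy, recall, precision

# Hypothetical counts; each model is evaluated on the same 1,000 rows.
model_a = metrics(tp=80, fp=30, fn=120, tn=770)   # accuracy 0.85
model_b = metrics(tp=160, fp=140, fn=40, tn=660)  # accuracy 0.82

for name, (acc, rec, prec) in {"A": model_a, "B": model_b}.items():
    print(f"model {name}: accuracy={acc:.2f} recall={rec:.2f} precision={prec:.2f}")
```

Model A misses 120 of the 200 positives (recall 0.40), while model B misses only 40 (recall 0.80), so B is the better pick when false negatives are the costly error.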

We also need to consider scalability. The 85% model could take too long to train or use too much memory to be usable in a production environment.

## Multicollinearity in regression

### Question

How would you tackle multicollinearity in multiple linear regression?

### Answer

Ignore multicollinearity if the predictions hold up on both the training and test datasets and the correlations between variables are not very high.

If the multicollinearity is caused by higher-order terms, standardize the variables.
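A quick way to see why this helps with higher-order terms: on a toy predictor, $x$ and $x^2$ are almost perfectly correlated, but after centering $x$ the correlation with its square disappears. The sketch below only centers (subtracts the mean), which is the part of standardization that matters here. The data are made up.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

x = [float(v) for v in range(1, 11)]  # toy predictor: 1, 2, ..., 10
raw_corr = pearson(x, [v ** 2 for v in x])

mean_x = sum(x) / len(x)
x_centered = [v - mean_x for v in x]
centered_corr = pearson(x_centered, [v ** 2 for v in x_centered])

print(f"corr(x, x^2) before centering: {raw_corr:.3f}")
print(f"corr(x, x^2) after centering:  {centered_corr:.3f}")
```

The correlation vanishes completely only because this toy predictor is symmetric around its mean. With real data centering typically reduces it rather than removing it.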

Reduce the number of variables: remove one variable from each highly correlated pair, or apply PCA to reduce the dimensionality.

### Resource

- [When Do You Need to Standardize the Variables in a Regression Model?](https://statisticsbyjim.com/regression/standardize-variables-regression/)

## Variate anomalies

### Question

If given a univariate dataset, how would you design a function to detect anomalies? What if the data is bivariate?

### Answer

For a univariate dataset, the function can compute the values at, for example, the 1st and 99th percentiles and flag (or remove) all values below or above those thresholds. For bivariate data, an anomaly can be defined on each variable individually or on the combination of the two variables. Examples of machine learning algorithms for anomaly detection are **Isolation Forest**, **DBSCAN**, and **Bayesian Gaussian Mixture**.
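A minimal sketch of both cases using only the standard library. The univariate function follows the percentile rule. For the bivariate case it uses the Mahalanobis distance from the mean, which is a swapped-in technique (not one of the algorithms named above) chosen because it catches points that are unusual only in combination. The threshold of 3.0 and the sample data are assumptions.

```python
import math

def percentile(sorted_values, q):
    """Linearly interpolated percentile of a sorted list, q in [0, 100]."""
    position = (len(sorted_values) - 1) * q / 100
    lower, upper = math.floor(position), math.ceil(position)
    fraction = position - lower
    return sorted_values[lower] + fraction * (sorted_values[upper] - sorted_values[lower])

def univariate_anomalies(values, low_q=1, high_q=99):
    """Values below the low_q-th or above the high_q-th percentile."""
    ordered = sorted(values)
    low, high = percentile(ordered, low_q), percentile(ordered, high_q)
    return [v for v in values if v < low or v > high]

def bivariate_anomalies(points, threshold=3.0):
    """Points whose Mahalanobis distance from the mean exceeds the threshold."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    det = sxx * syy - sxy ** 2  # determinant of the 2x2 covariance matrix
    flagged = []
    for x, y in points:
        dx, dy = x - mx, y - my
        d2 = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
        if math.sqrt(d2) > threshold:
            flagged.append((x, y))
    return flagged

values = list(range(1, 101)) + [1000]
print(univariate_anomalies(values))

# Points on the line y = x plus one point that breaks the pattern.
points = [(float(i), float(i)) for i in range(-10, 11)] + [(-5.0, 5.0)]
print(bivariate_anomalies(points))
```

Note two things. The percentile rule always flags the tails, so it also reports the smallest value `1` even though only `1000` is a real outlier. And the point `(-5.0, 5.0)` is within the normal range on each axis separately, so only the bivariate check catches it.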

## Explaining linear regression to different audiences

### Question

How would you explain the concept of linear regression to a child, a first-year college student and a mathematician?

### Answer

To a child: draw a lot of points, then draw one straight line through the middle of the group of points.

To a college student: linear regression is a statistical model for predicting numbers. It draws a straight line through a scatter plot of data points such that the line minimizes the distance between the actual values and the predicted values.

To a mathematician: linear regression models the relationship between a dependent variable $y$ and one or more independent variables $X$. It assumes that the relationship between $y$ and $X$ is linear, so $y$ is predicted by a linear combination of $X$. The coefficients of this linear combination are chosen to minimize the distance between the actual values and the predicted values.
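For the mathematician, the "distance" is usually made precise as the sum of squared residuals (ordinary least squares). This is an assumption here, since the text does not name the loss. With a coefficient vector $\beta$ (a symbol introduced for this illustration), the estimate is

$$
\hat{\beta} = \arg\min_{\beta} \lVert y - X\beta \rVert_2^2 = (X^\top X)^{-1} X^\top y
$$

where the closed form holds when $X^\top X$ is invertible.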

## Pizza no show

### Question

You run a pizza shop, and you run into a lot of no-shows after customers place their order. What features would you include in a model to try to predict a no-show?

### Answer

Some of the following variables may be difficult to obtain. Whether we can get them depends on discussions with the business and technical teams.

At the business level: order type (call, online, in-person), employee who took the order, order cost, projected time to complete the order, time of day of the order, receipt type (delivery, pickup).

At the customer level: customer's area, time taken to place the order (on the call or on the website), new or existing customer.

At the environmental level: day of the week, temperature, weather.

## Search Algorithm Recall

### Question

Let's say you work as a data scientist at Amazon. You want to improve the search results for product search but cannot change the underlying logic in the search algorithm. What methods could you use to increase recall?

### Answer

Recall is the fraction of relevant (positive) items that are actually returned:

$$
\frac{TP}{TP + FN}
$$

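Lower the decision threshold so that more items are predicted as positive. This turns some false negatives into true positives and raises recall, usually at the cost of lower precision.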
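The effect of the threshold on recall can be seen in a toy example. The scores and labels are made up and assume the search system exposes a relevance score per item.

```python
def recall_at_threshold(scores, labels, threshold):
    """Recall when items with score >= threshold are predicted positive."""
    tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= threshold)
    fn = sum(1 for s, y in zip(scores, labels) if y == 1 and s < threshold)
    return tp / (tp + fn)

# Hypothetical relevance scores and true relevance labels (1 = relevant).
scores = [0.95, 0.80, 0.65, 0.55, 0.40, 0.30, 0.20]
labels = [1, 1, 0, 1, 1, 0, 0]

recall_strict = recall_at_threshold(scores, labels, threshold=0.7)
recall_loose = recall_at_threshold(scores, labels, threshold=0.35)
print(recall_strict, recall_loose)
```

With the threshold at 0.7 only two of the four relevant items are returned. At 0.35 all four are returned, along with one irrelevant item.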

## Missing housing data

### Question

We want to build a model to predict housing prices in a city. We've scraped 100,000 listings over the past 3 years but found that around 20% of the listings are missing square footage data. How do we deal with the missing data to construct our model?

### Answer

Idea of a model **without square footage**. I assume square footage is an important feature for predicting housing prices, but we can check whether a model works without it. For example, remove the rows with missing square footage and build two models on the remaining data: one with square footage plus the other features, and one with the other features only. If accuracy does not drop much without square footage, we could use the model without it and keep all the listings. This is unlikely, though.

Idea of **deleting** the rows with missing square footage. Remove the rows with missing square footage; call the result the 80% data. Build two models with the same specification, both using square footage: one on the 80% data, and one on a further reduced sample that drops another 20% of the listings (call this the 60% data). If the model trained on the 60% data is only slightly less accurate than the model trained on the 80% data, the amount of data is not the bottleneck, and we may be able to simply delete the rows with missing square footage.

Idea of keeping the data and **imputing** the missing values. Fill in the missing values with estimates. A simple estimate is the mean or median of the square footage distribution. Another is the **k-nearest neighbors algorithm**, which approximates the square footage of a listing from similar listings based on their other features. We can also use the means of subsets of the data, grouping by other features such as location or number of bedrooms.
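A minimal sketch of group-based imputation with the standard library. The field names `bedrooms` and `sqft` are hypothetical, and the sketch uses the group median rather than the mean because it is less sensitive to a few very large homes.

```python
from collections import defaultdict
from statistics import median

def impute_sqft(listings, group_key="bedrooms"):
    """Fill missing sqft with the median of the same group.

    Falls back to the overall median when a group has no observed values.
    """
    observed = defaultdict(list)
    for row in listings:
        if row["sqft"] is not None:
            observed[row[group_key]].append(row["sqft"])
    overall = median(v for group in observed.values() for v in group)

    filled = []
    for row in listings:
        sqft = row["sqft"]
        if sqft is None:
            group_values = observed.get(row[group_key])
            sqft = median(group_values) if group_values else overall
        filled.append({**row, "sqft": sqft})
    return filled

# Made-up listings; None marks a missing square footage.
listings = [
    {"bedrooms": 1, "sqft": 600},
    {"bedrooms": 1, "sqft": 700},
    {"bedrooms": 1, "sqft": None},
    {"bedrooms": 3, "sqft": 1500},
    {"bedrooms": 3, "sqft": None},
    {"bedrooms": 5, "sqft": None},
]
imputed = impute_sqft(listings)
print([row["sqft"] for row in imputed])
```

In practice the group statistics should be computed on the training split only and then reused for the test split, so that no information leaks from the test data.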

## Variable error

### Question

You have a logistic model that is heavily weighted on one variable. The sample data from the variable is like 50.00, 100.00, 40.00, etc. There was a data quality issue with this variable, and an unknown number of values lost the decimal point. For example, 100.00 turned into 10000. Would the model still be valid? Why or why not? How would you fix the model?

### Answer

It's not valid, because only an unknown subset of the values lost the decimal point, and these corrupted values distort the estimated parameters of the logistic regression. If every value had lost its decimal point, the model would still be valid: the coefficient would change scale, but the model could still predict.

To fix the model, plot a histogram of the variable. We should see two groups of data, and the group with the large values should be the corrupted one, so divide those values by 100 to correct them. Then train the model again.
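The repair step can be sketched as below, assuming the corrupted values are off by a factor of 100 and that fewer than half of the values are affected, so that the median is still a clean reference. The cutoff of 20 times the median is an arbitrary assumption that would need to be chosen from the histogram.

```python
from statistics import median

def repair_missing_decimal(values, factor=100, cutoff_ratio=20):
    """Divide values far above the median by `factor`; leave the rest unchanged."""
    center = median(values)
    return [v / factor if v > cutoff_ratio * center else v for v in values]

# Made-up sample: 10000 and 6000 stand for 100.00 and 60.00 without the decimal point.
raw = [50.00, 100.00, 40.00, 10000, 75.00, 6000, 55.00]
repaired = repair_missing_decimal(raw)
print(repaired)
```

This only works when the clean and corrupted groups are well separated. It cannot tell a corrupted 2000 from a genuine 2000.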

However, when the variable has a wide range, such as 0 to 1,000,000, the histogram will not show two separate groups, so the error is hard to spot. In that case, we need to use the other variables as well and apply a clustering method such as **expectation maximization** to identify the corrupted rows.

## Skewed pricing

### Question

We are building a model to predict real estate home prices in a particular city. We analyze the distribution of home prices and see that the home values are skewed to the right. Do we need to do anything or take it into consideration? If so, what should we do? What if the target distribution is heavily skewed to the left instead?

### Answer

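For a right-skewed target, take the log to make the distribution closer to normal, because many models (linear regression in particular) behave better when the target, and hence the residuals, are roughly symmetric. At prediction time, take the exponential of the prediction to get back to a price.

For a left-skewed target, first reflect the values so that the distribution becomes right-skewed, for example by subtracting each value from the maximum plus one, then take the log and train the model. Multiplying by -1 alone is not enough, because the log of a negative number is undefined. At prediction time, take the exponential of the prediction and undo the reflection.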
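A minimal sketch of both transforms and their inverses. For the left-skewed case the sketch reflects the values around `max + 1` (an assumption about the exact reflection), since the log needs strictly positive inputs. The prices are made up.

```python
import math

def log_transform(prices):
    return [math.log(p) for p in prices]

def inverse_log(values):
    return [math.exp(v) for v in values]

def reflect_log_transform(prices):
    """For a left-skewed target: reflect around max + 1, then take the log."""
    pivot = max(prices) + 1
    return [math.log(pivot - p) for p in prices], pivot

def inverse_reflect_log(values, pivot):
    """Undo the log, then undo the reflection."""
    return [pivot - math.exp(v) for v in values]

right_skewed = [200_000, 250_000, 300_000, 2_000_000]
left_skewed = [100_000, 900_000, 950_000, 1_000_000]

roundtrip_right = inverse_log(log_transform(right_skewed))
transformed, pivot = reflect_log_transform(left_skewed)
roundtrip_left = inverse_reflect_log(transformed, pivot)
print(roundtrip_right, roundtrip_left)
```

The `pivot` must be stored with the model, because it is needed to map predictions back to prices.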

https://www.youtube.com/watch?v=jW0Z4qkR63E

## Bank fraud model

### Question

You work at a bank that wants to build a model to detect fraud on the platform. In addition, the bank wants to implement a text messaging service that texts customers when the model detects a fraudulent transaction, so that the customer can approve or deny the transaction with a text response. How would we build this model?

### Answer

In this problem, we are going to build a binary classification model on an imbalanced dataset.

Things we need to clarify before building a model:
- How accurate are the fraud labels? Are they confirmed fraud or only suspicious transactions?
- Do we care about model interpretability, so that we can explain why a transaction is predicted to be fraud?
- What is the cost of misclassification?
- Which do we care about more, recall or precision?
- Which models work well on an imbalanced dataset?

In this problem, low **recall** means many false negatives: we let many fraudulent transactions through unnoticed, which is economically costly. Low **precision** means many false positives: we send many texts to customers about fraud, but most of the notifications are wrong. Customers may be annoyed by the service, but we take no direct economic loss. So I assume we should prioritize recall over precision when training the binary classification model.

Computing recall on subsets of the data split by transaction amount would be useful. For example, suppose recall is low on fraud averaging 10 dollars but high on fraud averaging 1,000 dollars. That is still a good result, because the model catches the costly cases.
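Recall per amount bucket can be computed with a few lines. The transactions and the 100 dollar boundary between "small" and "large" are made up for illustration.

```python
def recall_by_bucket(transactions, boundary=100.0):
    """Recall on fraud cases, split into small and large transaction amounts."""
    counts = {"small": [0, 0], "large": [0, 0]}  # [caught fraud, total fraud]
    for amount, is_fraud, predicted_fraud in transactions:
        if not is_fraud:
            continue  # recall only looks at actual fraud cases
        bucket = "small" if amount < boundary else "large"
        counts[bucket][1] += 1
        if predicted_fraud:
            counts[bucket][0] += 1
    return {b: caught / total for b, (caught, total) in counts.items() if total}

# Each row is (amount, is_fraud, predicted_fraud).
transactions = [
    (10.0, True, False), (12.0, True, False), (8.0, True, True), (15.0, True, False),
    (1000.0, True, True), (1200.0, True, True), (900.0, True, True), (1100.0, True, False),
    (20.0, False, False), (950.0, False, True),
]
bucket_recall = recall_by_bucket(transactions)
print(bucket_recall)
```

Overall recall here is 4 out of 8, which hides the fact that the model catches 3 of the 4 large frauds.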

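For model choice, we can try boosting and SVM algorithms. Boosting puts more weight on the examples it misclassified in earlier rounds, and both can be trained with class weights, so they can learn more from errors on the minority class. We can also customize the loss function to assign a higher cost to larger fraud amounts, so that the model prioritizes catching large fraud.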
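One possible form of such a custom loss is a log loss in which each fraud row is weighted by its amount. This is a sketch under that assumption, not a standard recipe: the weighting scheme (amount for fraud rows, 1.0 for legitimate rows) and the sample values are illustrative.

```python
import math

def amount_weighted_log_loss(y_true, y_prob, amounts):
    """Binary cross-entropy where each fraud row is weighted by its amount."""
    total, weight_sum = 0.0, 0.0
    for y, p, amount in zip(y_true, y_prob, amounts):
        weight = amount if y == 1 else 1.0
        total += -weight * (y * math.log(p) + (1 - y) * math.log(1 - p))
        weight_sum += weight
    return total / weight_sum

y_true = [1, 1, 0]
amounts = [10.0, 1000.0, 50.0]

# Model that misses the 10 dollar fraud but catches the 1,000 dollar fraud.
loss_miss_small = amount_weighted_log_loss(y_true, [0.1, 0.9, 0.1], amounts)
# Model that catches the 10 dollar fraud but misses the 1,000 dollar fraud.
loss_miss_large = amount_weighted_log_loss(y_true, [0.9, 0.1, 0.1], amounts)
print(loss_miss_small, loss_miss_large)
```

Missing the large fraud is penalized far more heavily than missing the small one, which is the behavior the weighting is meant to produce.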

Because the fraud class is small, we can try **SMOTE** and **ADASYN (Adaptive Synthetic Sampling)** to generate synthetic samples for the fraud class.

## Ride requests model

### Question

You're tasked with building a model to predict whether a driver on Uber will accept a ride request or not.
1. What algorithm would you use to build this model?
2. What are the tradeoffs between different classifiers?
3. What features would you use?

### Answer

1. A binary classification algorithm. The degree of class imbalance probably depends on location, time, and so on.
2. A linear model such as logistic regression tends to be less accurate but more interpretable, while a more complex machine learning model tends to be more accurate but less interpretable.
3. Features would be
   - Expected profit
   - Expected total driving time (affects the driver's availability)
   - Expected total driving distance (affects the driver's fuel cost)
   - Rider's review
   - Destination

https://www.youtube.com/watch?v=Dwbgy7cUxk4

## Multicollinearity in regression

### Question

How would you tackle multicollinearity in multiple linear regression?

### Answer

Ridge to shrink the coefficients

Lasso to reduce the number of features

PCA to remove the correlation between features

Manually reduce the number of features

- Decide whether we care about multicollinearity. If we only care about accuracy on the training and test data, it is fine to ignore multicollinearity in the model.
- Standardize each independent variable and check again whether the correlation still exists. An example of standardization is centering, that is, subtracting the column mean from the values in a column.
- Shrink or reduce the independent variables with **Ridge** (shrinks coefficients), **Lasso** (can drop variables), or manual removal. Or apply dimensionality reduction with **PCA**.
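For reference, Ridge and Lasso differ only in the penalty added to the least-squares objective. The symbols $\beta$ (coefficient vector) and $\lambda \ge 0$ (penalty strength) are introduced here for illustration:

$$
\hat{\beta}_{\text{ridge}} = \arg\min_{\beta} \lVert y - X\beta \rVert_2^2 + \lambda \lVert \beta \rVert_2^2,
\qquad
\hat{\beta}_{\text{lasso}} = \arg\min_{\beta} \lVert y - X\beta \rVert_2^2 + \lambda \lVert \beta \rVert_1
$$

The squared $\ell_2$ penalty shrinks correlated coefficients toward each other without setting them to zero, while the $\ell_1$ penalty can set some coefficients exactly to zero, which is why Lasso reduces the number of features.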

## Keyword bidding

### Question

We are working on keyword bidding optimization with a dataset of two columns: one contains the keywords to bid on, and the other contains the price paid for each keyword. How would you build a model to bid on a new, unseen keyword?

### Answer

We will create word embeddings. Word embeddings are vectors in which words with similar meanings have similar vectors.

To predict, compute the **cosine similarity** between the embedding of the new keyword and the embeddings of the known keywords, and **recommend** a bid based on the actual prices of the known keywords with the highest cosine similarity.

We can also use the mean of a keyword's word embeddings as an additional feature, alongside the words themselves, in a model that predicts the price.

Training **neural networks** gives us the word embeddings.

**GloVe embeddings** can be used. They are learned from a co-occurrence matrix that counts how often words appear together in a context.

**Word2Vec** can be used. It learns embeddings by predicting the context of a word.

## Booking regression

### Question

Build a model to predict booking prices on Airbnb. Between linear regression and random forest regression, which model would perform better and why?

### Answer

Differences between linear regression and random forest
- Random forest regression can approximate complex nonlinear shapes, but linear regression performs better when the underlying relationship is linear and there are many continuous predictors.
- Random forest can handle many predictors, but with linear regression we need to reduce the number of predictors to avoid overfitting.
- Random forest can capture complex interactions between predictors, but with linear regression we need to create the interaction variables ourselves.
- For predictor interpretability, we can compute feature importance with random forest, but the coefficients of linear regression are more interpretable.

Possible features to predict booking prices on Airbnb
- Location features
- Calendar features (Seasonality)
- Number of bedrooms and bathrooms
- Room type like private room, shared, entire home
- External demand like conference, events

Linear regression can still work, for example by fitting one linear regression model per location and manually creating interaction features, such as a special-event feature combined with the size of the room.

But random forest probably performs better if many features are available and we need to predict prices around the world. The relationships would be nonlinear and too complex to hand-craft features for linear regression.

If we only have a zip code column for one city and other simple rental information, linear regression is a good choice, and it gives us interpretability through its coefficients.

```
Memo

Linear regression
- Pros
  - We can see how feature values affect the model output
  - Prediction is faster than with random forest
  - Performs well when the dataset is small
  - Faster training
- Cons
  - Often less accurate than random forest when relationships are nonlinear

Random forest
- Pros
  - Often more accurate
- Cons
  - Less interpretable than linear regression, but we can show feature importance
  - Slower training, but it can be sped up with parallel training
```

## Credit card fraud model

### Question

Say you work at a major credit card company and are given a dataset of 600,000 credit card transactions. Use this dataset to build a fraud detection model.

### Answer

Clarify how a transaction came to be labeled as fraud: was it decided by the user or by the bank?

Check how frequent fraud is in the dataset. Fraud is probably rare, so apply rebalancing methods such as up-sampling, down-sampling, or SMOTE.

Perform feature engineering, such as the time of day of the transaction.
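Random up-sampling, the simplest of these rebalancing methods, can be sketched with the standard library. The field name `is_fraud`, the seed, and the generated rows are assumptions for illustration. SMOTE would instead create new synthetic rows rather than duplicating existing ones.

```python
import random

def upsample_minority(rows, label_key="is_fraud", seed=0):
    """Randomly duplicate minority-class rows until both classes have equal size."""
    rng = random.Random(seed)
    fraud = [r for r in rows if r[label_key]]
    legit = [r for r in rows if not r[label_key]]
    minority, majority = (fraud, legit) if len(fraud) < len(legit) else (legit, fraud)
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return rows + extra

# Made-up data: 100 transactions, every tenth one is fraud (10 fraud, 90 legitimate).
rows = [{"amount": 25.0 * i, "is_fraud": i % 10 == 0} for i in range(1, 101)]
balanced = upsample_minority(rows)
print(len(balanced), sum(r["is_fraud"] for r in balanced))
```

Resampling should be applied to the training split only. The test split must keep the original class ratio so that the evaluation reflects real traffic.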

https://www.interviewquery.com/learning-paths/modeling-and-machine-learning/model-selection/credit-card-fraud-model