# Challenge: what model can answer this question?

You now have a fairly substantial starting toolbox of supervised learning methods that you can use to tackle a host of exciting problems. To make sure all of these ideas are organized in your mind, please go through the list of problems below. For each, identify which supervised learning method(s) would be best for addressing that particular problem. Explain your reasoning and discuss your answers with your mentor.

1. Predict the running times of prospective Olympic sprinters using data from the last 20 Olympics.
    - Regression
    - Possibilities: linear regression
    - Reasoning: the relationships between features are probably not complex, so linear regression fits well.
2. You have more features (columns) than rows in your dataset.
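    - Possibilities: naive bayes, lasso/ridge regression (regularization keeps the fit stable when columns outnumber rows); maybe KNN, though distances become less informative as the number of dimensions grows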
3. Identify the most important characteristic predicting likelihood of being jailed before age 20.
    - Classifier
    - Possibilities: logistic lasso regression, random forest (feature importance)
    - Probably not: logistic regression, naive bayes
4. Implement a filter to “highlight” emails that might be important to the recipient.
    - Classifier
    - Possibilities: Naive Bayes
    - Probably not: linear regression
5. You have 1000+ features.
    - Possibilities: random forest, lasso/ridge regression (w/ feature reduction), naive bayes, SVM
    - Probably not: KNN, OLS regression
6. Predict whether someone who adds items to their cart on a website will purchase the items.
    - Classifier
    - Possibilities: KNN, logistic regression, random forest, SVM
    - Probably not: linear regression
7. Your dataset dimensions are 982,400 x 500.
    - Possibilities: naive bayes, lasso/ridge regression (w/ feature reduction), random forest
    - Probably not: SVM, KNN, OLS regression
8. Identify faces in an image.
    - Classifier. Can supervised learning do this?
    - Possibilities: KNN, PCA and then KNN
9. Predict which of three flavors of ice cream will be most popular with boys vs girls.
    - Classifier
    - Possibilities: naive bayes, KNN
    - Probably not: linear regression
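
To make the naive Bayes suggestion for problem 4 concrete, here is a minimal sketch of a multinomial naive Bayes classifier with add-one (Laplace) smoothing, written in plain Python. The four training emails, the labels `important` / `other`, and the function names are all made up for illustration; a real filter would need far more data and would more likely rely on a library implementation.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(docs, labels):
    """Count documents per class and words per class (multinomial naive Bayes)."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in zip(docs, labels):
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab


def predict(model, tokens):
    """Return the class with the highest log posterior for a tokenized email."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    scores = {}
    for label, n_docs in class_counts.items():
        score = math.log(n_docs / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for tok in tokens:
            if tok not in vocab:
                continue  # skip words never seen during training
            # Add-one smoothing so an unseen (class, word) pair never gives log(0).
            score += math.log((word_counts[label][tok] + 1) / (total_words + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)


# Hypothetical toy data, invented for this sketch.
train_docs = [
    "meeting moved to friday please confirm".split(),
    "invoice due please review contract".split(),
    "huge sale discount buy now".split(),
    "win a free prize click now".split(),
]
train_labels = ["important", "important", "other", "other"]

model = train_naive_bayes(train_docs, train_labels)
print(predict(model, "please confirm the contract".split()))
print(predict(model, "free discount now".split()))
```

Each word contributes to the score independently of the others, which is the independence assumption listed as a con of naive bayes in the model notes.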

Model types to choose from:
- linear regression
    - pros: explainability, fast, relatively little data needed
    - cons: poor with complex relationships
- logistic regression
    - pros: can be used for classification/probabilities
- lasso regression (linear and logistic)
    - pros: good for eliminating weak predictors/finding strongest predictors (feature selection), combats overfitting (multicollinearity)
    - cons: can arbitrarily drop predictors when collinearity exists
- ridge regression (linear and logistic)
    - pros: good for large amounts of data, combats overfitting (multicollinearity), smaller variance
    - cons: not parsimonious (doesn't reduce variables)
- naive bayes
    - classification
    - pros: simple, fast, sentiment classification, little data needed, can be trained on large amounts of data
    - cons: assumption of independence
- svm
    - regression and classification
    - pros: little data needed, accuracy on small clean datasets
    - cons: high training time, less effective on noisier datasets with overlapping classes
- random forest/decision tree
    - regression and classification
    - pros: good with complex/non-linear relationships, strong performer
    - cons: can get large + slow
- nearest neighbors
    - regression and classification
    - pros: no assumptions needed, easy to understand, flexible to feature / distance choices
    - cons: regression can't predict values outside the range of the training targets, computationally expensive (slow prediction, high memory usage)
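
The nearest neighbors con about regression can be seen in a few lines. The sketch below is a hand-rolled one-feature KNN regressor, not a library API; the function name `knn_regress` and the toy data following y = 2x are assumptions made for illustration.

```python
def knn_regress(train_x, train_y, query, k=2):
    """Predict by averaging the targets of the k nearest training points (1-D feature)."""
    nearest = sorted(zip(train_x, train_y), key=lambda pair: abs(pair[0] - query))[:k]
    return sum(y for _, y in nearest) / len(nearest)


# Hypothetical toy data following y = 2x on x = 1..10.
train_x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
train_y = [2 * x for x in train_x]

inside = knn_regress(train_x, train_y, 5.5)   # query inside the training range
outside = knn_regress(train_x, train_y, 100)  # query far outside the training range

print(inside)   # 11.0, which matches the true value 2 * 5.5
print(outside)  # 19.0, although the true value would be 200
```

Because every prediction is an average of training targets, it can never exceed the largest target seen in training (20 here), no matter how far the query moves.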

Model attributes to consider: computation cost, prediction accuracy, explainability, identifying important features/number of features, regression/classification, assumptions/complex data, data required, sensitivity to outliers, tendency to overfit (bias vs variance).
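
One way to keep these attributes organized is to encode the notes above as a small lookup and filter on it. The sketch below is only a rough encoding of this page's own rules of thumb: the helper `shortlist`, its parameters, and the cutoffs implied by "many features" and "many rows" are hypothetical, and the result is a starting list to discuss, not a verdict.

```python
# Task types per model, taken from the "Model types to choose from" notes.
MODEL_TASKS = {
    "linear regression": {"regression"},
    "logistic regression": {"classification"},
    "lasso regression": {"regression", "classification"},
    "ridge regression": {"regression", "classification"},
    "naive bayes": {"classification"},
    "svm": {"regression", "classification"},
    "random forest": {"regression", "classification"},
    "nearest neighbors": {"regression", "classification"},
}

# "Probably not" lists from problems 5 and 7, and the picks from problem 3.
AVOID_WITH_MANY_FEATURES = {"nearest neighbors", "linear regression"}
AVOID_WITH_MANY_ROWS = {"svm", "nearest neighbors", "linear regression"}
RANKS_FEATURES = {"lasso regression", "random forest"}


def shortlist(task, many_features=False, many_rows=False, need_feature_ranking=False):
    """Return candidate model names that survive the rules of thumb above."""
    candidates = []
    for name, tasks in MODEL_TASKS.items():
        if task not in tasks:
            continue
        if many_features and name in AVOID_WITH_MANY_FEATURES:
            continue
        if many_rows and name in AVOID_WITH_MANY_ROWS:
            continue
        if need_feature_ranking and name not in RANKS_FEATURES:
            continue
        candidates.append(name)
    return candidates


# Problem 3: a classifier that should surface the most important characteristic.
print(shortlist("classification", need_feature_ranking=True))
# Problem 7 treated as a regression task on a very tall dataset.
print(shortlist("regression", many_rows=True))
```

Attributes such as explainability or sensitivity to outliers could be added as further filters in the same way.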