# Challenge

In this challenge, we will identify which supervised learning method(s) would be best suited to each of the problems below. First, let's compile a list of all the methods we have learned so far.

#### Naive Bayes 
Assumes that every pair of features is conditionally independent given the class. Despite this simplification, it works well in many real-world situations such as document classification and spam filtering.
 - __Advantages__ 
     -  requires a small amount of training data to estimate the necessary parameters
     - extremely fast compared to more sophisticated methods
 - __Disadvantages__
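     - known to be a bad estimator: its predicted probabilities are often poorly calibrated and should not be taken at face value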

#### Linear Regression 
Used for prediction problems with a continuous target. It estimates the target variable by fitting the best-fit line between the independent and dependent variables.
 - __Advantages__ 
     - the best-fit line is the one that minimises the total squared error over all the points
 - __Disadvantages__
     - linear regression is limited to linear relationships
     - only looks at the mean of the dependent variable
     - sensitive to outliers
     - observations must be independent of one another
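
The sensitivity to outliers listed above can be seen in a minimal sketch of one-feature least squares. The data points are made up for illustration, and `fit_line` is a hypothetical helper rather than a library function.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one feature: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope = covariance(x, y) / variance(x)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Toy data lying exactly on y = 2x + 1 (made up for illustration).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
clean_slope, clean_intercept = fit_line(xs, ys)

# Same data with one extreme point appended.
outlier_slope, outlier_intercept = fit_line(xs + [5.0], ys + [50.0])

print(clean_slope, clean_intercept)
print(outlier_slope, outlier_intercept)
```

A single extreme point pulls the fitted slope well away from 2, because squared error penalises large residuals heavily.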
     
#### KNN
A type of lazy learning as it does not attempt to construct a general internal model, but simply stores instances of the training data. Classification is computed from a simple majority vote of the k nearest neighbours of each point.
 - __Advantages__ 
     -  simple to implement, robust to noisy training data, and effective if training data is large
 - __Disadvantages__
     - need to determine the value of K, and the computation cost is high because it needs to compute the distance from each new instance to all the training samples
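
The majority vote described above can be sketched in a few lines of plain Python. The points, labels, and the `knn_predict` helper are assumptions made for illustration; in practice a library implementation such as scikit-learn's `KNeighborsClassifier` would normally be used.

```python
from collections import Counter
import math

def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Lazy learning: there is no fit step, every prediction scans the stored data.
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    distances.sort(key=lambda pair: pair[0])
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Two made-up clusters with hypothetical labels "a" and "b".
train_points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
train_labels = ["a", "a", "a", "b", "b", "b"]

near_a = knn_predict(train_points, train_labels, (1.1, 1.0), k=3)
near_b = knn_predict(train_points, train_labels, (5.1, 5.0), k=3)
print(near_a, near_b)
```

Every call computes a distance to each stored point, which is where the high prediction cost comes from.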
     
#### Decision Tree
Given a dataset of attributes together with their classes, a decision tree produces a sequence of rules that can be used to classify the data.
 - __Advantages__ 
     -  simple to understand and visualise
     - requires little data preparation
     - can handle both numerical and categorical data
 - __Disadvantages__
     - can create complex trees that do not generalise well
     - can be unstable because small variations in the data might result in a completely different tree being generated
     
#### Random Forest
A meta-estimator that fits a number of decision trees on various sub-samples of the dataset and uses averaging to improve predictive accuracy and control over-fitting. By default, each sub-sample is the same size as the original input sample, but the samples are drawn with replacement.
 - __Advantages__ 
     - reduces over-fitting and is more accurate than a single decision tree in most cases

 - __Disadvantages__
     - slow real-time prediction
     - difficult to implement
     - complex algorithm
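
Drawing sub-samples with replacement, as described for random forests above, can be sketched as follows. The sample size, the number of trees, and the `bootstrap_indices` helper are hypothetical choices for illustration only.

```python
import random

def bootstrap_indices(n_samples, rng):
    """Draw n_samples row indices with replacement (one bootstrap sample)."""
    return [rng.randrange(n_samples) for _ in range(n_samples)]

rng = random.Random(0)  # fixed seed so the sketch is repeatable
n_samples = 1000        # hypothetical dataset size
n_trees = 5             # hypothetical small forest

for tree in range(n_trees):
    indices = bootstrap_indices(n_samples, rng)
    unique_fraction = len(set(indices)) / n_samples
    # Each tree sees n_samples rows, but only about 63% of them are distinct.
    print(tree, len(indices), round(unique_fraction, 3))
```

Because each tree trains on a different mix of rows, the trees disagree in their errors, and averaging them reduces the variance of the final prediction.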
     
#### Support Vector Machine
Representation of the training data as points in space separated into categories by a clear gap that is as wide as possible. New examples are then mapped into that same space and predicted to belong to a category based on which side of the gap they fall.
 - __Advantages__ 
     - effective in high dimensional spaces 
     - uses a subset of training points in the decision function so it is also memory efficient

 - __Disadvantages__
     - does not directly provide probability estimates; these are calculated using an expensive five-fold cross-validation
     
#### Gradient Boosting
Builds an ensemble of shallow, weak trees in sequence, with each tree learning from and improving on the previous one. When combined, these many weak trees produce a powerful “committee” that is often hard to beat with other algorithms.
 - __Advantages__ 
     - high predictive accuracy
     - flexible: can optimize different loss functions and provides several hyperparameter tuning options
     - little data pre-processing required; often works well with both categorical and numerical values

 - __Disadvantages__
     - can overemphasize outliers and cause overfitting, so cross-validation is needed
     - time and memory intensive
     - high flexibility requires careful tuning
     - less interpretable
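
The idea of each tree improving on the previous ones can be sketched for squared loss, where every new weak learner is fitted to the current residuals. This is a simplified illustration using one-split "stumps" on a single made-up feature. The function names, the learning rate, and the number of rounds are assumptions, not a description of any particular library.

```python
def fit_stump(xs, residuals):
    """Find the single threshold split on x that minimises squared error."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = sum((r - left_mean) ** 2 for r in left)
        error += sum((r - right_mean) ** 2 for r in right)
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean

def boost(xs, ys, n_rounds=20, learning_rate=0.5):
    """Gradient boosting for squared loss: each stump fits the current residuals."""
    base = sum(ys) / len(ys)  # start from the mean prediction
    stumps = []
    predictions = [base] * len(xs)
    history = []  # training MSE after each round
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        predictions = [p + learning_rate * stump(x) for x, p in zip(xs, predictions)]
        history.append(sum((y - p) ** 2 for y, p in zip(ys, predictions)) / len(ys))

    def predict(x):
        return base + learning_rate * sum(s(x) for s in stumps)

    return predict, history

xs = [float(i) for i in range(10)]
ys = [x * x for x in xs]  # made-up non-linear target
predict, mse_history = boost(xs, ys)
print(mse_history[0], mse_history[-1])
```

The training error keeps falling as rounds are added, which is also why boosting can overfit if the number of rounds is not validated.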

#### 1. Predict the running times of prospective Olympic sprinters using data from the last 20 Olympics.
We would use a regression model to predict running times. Each sprinter would be an observation, the sprinter's attributes would be the features, and the Olympic running time would be the outcome variable.

Assuming that there are more than 50 observations and only a few important features have been selected, we can use either Lasso or ElasticNet regression. However, if we wanted to keep many features for each sprinter, we would need something more computationally intensive, such as a random forest.

#### 2. You have more features (columns) than rows in your dataset.
We could use linear regression for this problem. However, we would have to regularize the estimated coefficients and shrink them toward zero to avoid over-fitting the data. In ridge regression, a penalty term, weighted by a tuning parameter, is added to the cost function to penalize large coefficients.
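
As a rough sketch of why the penalty matters when there are more columns than rows, the closed-form ridge solution can be written with NumPy. The data is random and the sizes are made up. `ridge_fit` is a hypothetical helper, `alpha` stands for the tuning parameter, and the intercept is left out to keep the example short.

```python
import numpy as np

def ridge_fit(X, y, alpha):
    """Closed-form ridge solution: w = (X^T X + alpha * I)^-1 X^T y."""
    n_features = X.shape[1]
    gram = X.T @ X + alpha * np.eye(n_features)
    return np.linalg.solve(gram, X.T @ y)

rng = np.random.default_rng(0)
n_rows, n_features = 10, 50  # more columns than rows (made-up sizes)
X = rng.normal(size=(n_rows, n_features))
y = rng.normal(size=n_rows)

# Without the penalty, X^T X has rank at most n_rows, so it is singular
# and ordinary least squares has no unique solution.
rank = np.linalg.matrix_rank(X.T @ X)

# A larger alpha shrinks the coefficients further toward zero.
weak = ridge_fit(X, y, alpha=0.1)
strong = ridge_fit(X, y, alpha=100.0)
print(rank, np.linalg.norm(weak), np.linalg.norm(strong))
```

Adding `alpha` to the diagonal makes the system solvable, and the norm of the coefficient vector falls as `alpha` grows.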

If this were a classification problem, Naive Bayes would have an advantage with small training sets because it is a high-bias/low-variance classifier, and such classifiers tend not to overfit.

Using PCA to reduce dimensions and random noise could also be helpful.

#### 3. Identify the most important characteristic predicting likelihood of being jailed before age 20.

Since decision trees are easy to interpret, we could create one to see feature interactions. Another alternative would be to fit a random forest and return a ranked list of feature importances. We could also identify which coefficients are statistically significant for the outcome in an OLS summary table.

#### 4. Implement a filter to “highlight” emails that might be important to the recipient
Naive Bayes is great at classifying very large datasets. SVMs also have high accuracy and work well with text classification problems. They can deal with non-linear data and high-dimensional spaces.
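
A minimal sketch of how Naive Bayes could score emails is shown below, using word counts with add-one smoothing. The four-message corpus, the labels "important" and "other", and both helper functions are made up for illustration. A real filter would need far more data and would typically rely on a library implementation.

```python
from collections import Counter
import math

def train_naive_bayes(documents, labels):
    """Collect per-class word counts and class counts from tokenised documents."""
    word_counts = {label: Counter() for label in set(labels)}
    class_counts = Counter(labels)
    for words, label in zip(documents, labels):
        word_counts[label].update(words)
    vocabulary = {word for counts in word_counts.values() for word in counts}
    return word_counts, class_counts, vocabulary

def predict_naive_bayes(model, words):
    """Return the class with the highest log posterior under the naive assumption."""
    word_counts, class_counts, vocabulary = model
    total_docs = sum(class_counts.values())
    scores = {}
    for label in class_counts:
        # log prior + sum of log likelihoods (words treated as independent)
        score = math.log(class_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in words:
            if word not in vocabulary:
                continue  # ignore words never seen in training
            # Laplace (add-one) smoothing avoids zero probabilities
            count = word_counts[label][word] + 1
            score += math.log(count / (total_words + len(vocabulary)))
        scores[label] = score
    return max(scores, key=scores.get)

# Tiny made-up corpus; labels are hypothetical.
documents = [
    "meeting agenda for project review".split(),
    "please review the project budget".split(),
    "win a free prize now".split(),
    "free offer click now".split(),
]
labels = ["important", "important", "other", "other"]
model = train_naive_bayes(documents, labels)

first = predict_naive_bayes(model, "project budget meeting".split())
second = predict_naive_bayes(model, "free prize offer".split())
print(first, second)
```

Training is a single pass of counting, which is why the method scales easily to very large collections of emails.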

#### 5. You have 1000+ features.

Before selecting a model, running PCA during feature engineering would be a good idea to combine collinear features and reduce random noise in the data. 
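
To illustrate how PCA combines collinear features, here is a small NumPy sketch based on the singular value decomposition. The three features are synthetic and the `pca` helper is an assumption for illustration. Features are only centred here, not scaled.

```python
import numpy as np

def pca(X, n_components):
    """Project X onto its top principal components using the SVD."""
    centered = X - X.mean(axis=0)
    # Rows of vt are the principal directions, ordered by singular value.
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    explained_variance = singular_values ** 2 / (len(X) - 1)
    ratio = explained_variance / explained_variance.sum()
    return centered @ vt[:n_components].T, ratio[:n_components]

rng = np.random.default_rng(0)
base = rng.normal(size=(200, 1))
# Three made-up features: the first two are nearly collinear,
# the third is independent low-variance noise.
X = np.hstack([
    base,
    2.0 * base + 0.01 * rng.normal(size=(200, 1)),
    0.1 * rng.normal(size=(200, 1)),
])

projected, ratio = pca(X, n_components=1)
print(projected.shape, ratio)
```

One component captures almost all of the variance here, because the two collinear columns carry the same underlying signal.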

Lasso regression could be used to shrink the coefficients of unwanted features to zero. We could also use the p-values from a linear regression to keep only statistically significant variables.

Random forests and SVMs would also be good choices for this problem, since both cope well with many features and help limit over-fitting.

#### 6. Predict whether someone who adds items to their cart on a website will purchase the items.
We can treat this as a binary classification problem: will a customer purchase the items added to their cart or not? If the variables are independent of each other and the dataset is free of missing values, a logistic regression model would be a good start. With a large training set, we could use KNN.

#### 7. Your dataset dimensions are 982400 x 500
It would be best to run PCA to reduce dimensionality and employ other feature selection techniques to select the most important features. A random forest or SVM would suit the high-dimensional space of the dataset and the risk of over-fitting, although with close to a million rows a linear SVM would be much cheaper to train than a kernel SVM.

#### 8. Identify faces in an image.
Due to the very high dimensionality of this task, I would recommend using SVMs or PCA to process the image and identify geometric shapes that resemble a face.


#### 9. Predict which of three flavors of ice cream will be most popular with boys vs girls.
Since this is a classification task with only a few features, Naive Bayes, a linear classifier, or KNN would do the trick.
