## Prioritizing What to Work On

System Design Example:

Given a data set of emails, we could construct a vector for each email. Each entry in this vector represents a word. The vector normally contains 10,000 to 50,000 entries, chosen by taking the most frequently used words in our data set. If a word appears in the email, we set its entry to 1; otherwise we set it to 0. Once all our x vectors are ready, we train our algorithm and can then use it to classify whether an email is spam or not.
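
The construction above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the helper names `build_vocabulary` and `email_to_vector`, the tokenizing regular expression, and the three toy emails are all assumptions made for the example, and the vocabulary is capped at 5 words instead of 10,000 to 50,000.

```python
import re
from collections import Counter

TOKEN = re.compile(r"[a-z0-9']+")  # assumed tokenizer: lowercase words and digits

def build_vocabulary(emails, vocab_size):
    """Return the vocab_size most frequent words across all emails."""
    counts = Counter()
    for text in emails:
        counts.update(TOKEN.findall(text.lower()))
    return [word for word, _ in counts.most_common(vocab_size)]

def email_to_vector(text, vocabulary):
    """Binary vector: entry is 1 if the vocabulary word occurs in the email, else 0."""
    words = set(TOKEN.findall(text.lower()))
    return [1 if word in words else 0 for word in vocabulary]

# Toy data, invented for illustration.
emails = [
    "Buy cheap watches now",
    "Meeting moved to noon, see you now",
    "Cheap deals, buy now",
]
vocabulary = build_vocabulary(emails, vocab_size=5)
x = email_to_vector("buy now", vocabulary)
print(vocabulary)
print(x)
```

Each `x` produced this way has one entry per vocabulary word, so every email maps to a vector of the same length regardless of how long the email is.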


<br>
<img src= 'files/spam.png'>


So how could you spend your time to improve the accuracy of this classifier?

- Collect lots of data (for example through a "honeypot" project, although more data doesn't always help)
- Develop sophisticated features (using email header data in spam emails)
- Develop algorithms to process your input in different ways (recognizing misspellings in spam)

It is difficult to tell which of the options will be most helpful.



## Error Analysis

The recommended approach to solving machine learning problems is to:

- Start with a simple algorithm, implement it quickly, and test it early on your cross validation data.
- Plot learning curves to decide if more data, more features, etc. are likely to help.
- Manually examine the errors on examples in the cross validation set and try to spot a trend where most of the errors were made.

For example, assume that we have 500 emails and our algorithm misclassifies 100 of them.
We could manually analyze the 100 emails and categorize them based on what type of emails they are. 
We could then try to come up with new cues and features that would help us classify these 100 emails correctly. 
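
The manual categorization step can be supported by a simple tally. The sketch below assumes each misclassified email has already been hand-labeled with a category; the category names and counts are invented for illustration and do not come from a real data set.

```python
from collections import Counter

# Hypothetical hand-assigned categories for misclassified cross validation emails.
misclassified = [
    "pharma", "phishing", "phishing", "replica", "phishing",
    "other", "pharma", "phishing", "replica", "phishing",
]

def rank_error_categories(labels):
    """Return (category, count, share of all errors), most common first."""
    counts = Counter(labels)
    total = len(labels)
    return [(category, n, n / total) for category, n in counts.most_common()]

for category, n, share in rank_error_categories(misclassified):
    print(f"{category:10s} {n:3d}  {share:.0%}")
```

The category at the top of the ranking is the one where new features are likely to pay off the most.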

Hence, if most of our misclassified emails are those which try to steal passwords, then we could find some features that are particular to those emails and add them to our model. We could also see how classifying each word according to its root changes our error rate:


<br>
<img src= 'files/error_analysis.png'>

It is very important to get error results as a single, numerical value. Otherwise it is difficult to assess your algorithm's performance. 

For example, if we use stemming, which is the process of treating different forms of the same word (fail/failing/failed) as one word (fail), and get a 3% error rate instead of 5%, then we should definitely add it to our model.

However, if we try to distinguish between uppercase and lowercase letters and end up getting a 3.2% error rate instead of 3%, then we should avoid using this new feature. 

Hence, we should try new things, get a numerical value for our error rate, and based on our results decide whether we want to keep the new feature or not.
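
A minimal sketch of this decision rule, assuming binary labels and predictions made on the same cross validation set. The function names `error_rate` and `keep_feature` and the toy label lists are hypothetical.

```python
def error_rate(y_true, y_pred):
    """Fraction of examples whose predicted label differs from the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    wrong = sum(1 for t, p in zip(y_true, y_pred) if t != p)
    return wrong / len(y_true)

def keep_feature(baseline_error, candidate_error):
    """Keep the candidate change only if it lowers the cross validation error."""
    return candidate_error < baseline_error

# Toy labels, invented for illustration.
y_cv          = [1, 0, 0, 1, 0, 1, 0, 0, 1, 0]
pred_baseline = [1, 0, 1, 0, 0, 1, 0, 0, 0, 0]  # 3 mistakes
pred_stemming = [1, 0, 0, 0, 0, 1, 0, 0, 1, 0]  # 1 mistake

baseline = error_rate(y_cv, pred_baseline)
with_stemming = error_rate(y_cv, pred_stemming)
print(baseline, with_stemming, keep_feature(baseline, with_stemming))
```

With the error rates quoted in the text, `keep_feature(0.05, 0.03)` would keep stemming, while `keep_feature(0.03, 0.032)` would reject the uppercase/lowercase feature.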


## Precision and recall trade-off

<b>Accuracy = (true positives + true negatives) / (total examples)

Precision = true positives / # predicted positives

Recall = true positives / # actual positives</b>

Precision and recall are a good way to handle skewed classes in problems such as fraud detection, where only 1% of cases are actually positive. Here a model that predicts y = 0 all the time will be 99% accurate while never detecting a single fraud case. That is the main reason to use precision and recall.
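
The skewed-class situation can be reproduced with synthetic labels. The sketch below assumes binary 0/1 labels and, as a labeled convention, reports precision or recall as 0.0 when its denominator is zero. The helper names and the data are invented for the example.

```python
def confusion_counts(y_true, y_pred):
    """Count true positives, false positives, false negatives, true negatives."""
    tp = fp = fn = tn = 0
    for t, p in zip(y_true, y_pred):
        if t == 1 and p == 1:
            tp += 1
        elif t == 0 and p == 1:
            fp += 1
        elif t == 1 and p == 0:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn

def summarize(y_true, y_pred):
    """Return (accuracy, precision, recall) for binary 0/1 labels."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    accuracy = (tp + tn) / len(y_true)
    # Assumption: use 0.0 when a ratio is undefined (no predicted or no actual positives).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall

# Synthetic skewed data: 10 fraud cases out of 1000 (1% positives).
y_true = [1] * 10 + [0] * 990
always_negative = [0] * 1000

accuracy, precision, recall = summarize(y_true, always_negative)
print(accuracy, precision, recall)
```

The always-negative model scores 99% accuracy, yet its recall is zero because it finds none of the 10 fraud cases.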

Depending on our classification problem, we may want to set a different threshold to determine when our classifier should predict a positive or a negative case. The default is 0.5: if the predicted probability that an observation is positive is at least 0.5, we classify it as positive. But depending on the problem or on domain knowledge, we may need a different threshold.

For instance, which is the less problematic situation:

- wrongly predicting that a patient is positive and treating him even though he is healthy?
- wrongly predicting that a patient is negative and not treating him, letting his condition get worse and possibly leading to his death?

Here you may want to lower the threshold and classify more patients as positive, even if some of them are not, in order not to miss any actual positive case: by doing so you reduce precision and increase recall.
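
The effect of moving the threshold can be checked on a small example. This is a sketch with made-up probabilities standing in for a classifier's output; `predict` and `precision_recall` are hypothetical helpers, and undefined ratios are reported as 0.0 by assumption.

```python
def predict(probabilities, threshold):
    """Predict 1 when the estimated probability reaches the threshold."""
    return [1 if p >= threshold else 0 for p in probabilities]

def precision_recall(y_true, y_pred):
    """Return (precision, recall) for binary 0/1 labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    predicted_pos = sum(y_pred)
    actual_pos = sum(y_true)
    precision = tp / predicted_pos if predicted_pos else 0.0
    recall = tp / actual_pos if actual_pos else 0.0
    return precision, recall

# Invented labels and predicted probabilities, for illustration only.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
probs = [0.9, 0.7, 0.45, 0.35, 0.6, 0.4, 0.38, 0.32, 0.2, 0.1]

for threshold in (0.5, 0.3):
    p, r = precision_recall(y_true, predict(probs, threshold))
    print(f"threshold={threshold}: precision={p:.2f}, recall={r:.2f}")
```

With these toy numbers, lowering the threshold from 0.5 to 0.3 raises recall from 0.50 to 1.00 and lowers precision from 0.67 to 0.50.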

The F score is useful for aggregating precision and recall into a single number. This metric favors balanced cases where both precision and recall are reasonably high.

$F_{1}\text{ score} = \frac{2PR}{P+R}$
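
As a small sketch of how this score ranks models, the snippet below compares three hypothetical classifiers. The precision/recall pairs are illustrative values, not measurements, and returning 0.0 when both precision and recall are zero is a convention chosen for the example.

```python
def f1_score(precision, recall):
    """F1 = 2PR / (P + R); defined here as 0.0 when P + R is zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical (precision, recall) pairs for three candidate models.
candidates = {
    "A": (0.5, 0.4),
    "B": (0.7, 0.1),
    "C": (0.02, 1.0),  # roughly what predicting y = 1 almost always gives
}

for name, (p, r) in candidates.items():
    print(f"{name}: average={(p + r) / 2:.3f}  F1={f1_score(p, r):.3f}")

best = max(candidates, key=lambda name: f1_score(*candidates[name]))
print("best by F1:", best)
```

A plain average would rank model C first because of its recall of 1.0, while the F1 score penalizes its very low precision and prefers the more balanced model A.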


### Data for ML

#### "It's not who has the best algorithm that wins, it's who has the most data." 

Assume features X have sufficient information to predict y accurately (a human expert could predict y confidently).

Then using a complex learning algorithm with many parameters (e.g. many features, many hidden layers...) will give us low bias and a good fit to the training data.

If we also use a very large training set, our hypothesis is unlikely to overfit, which results in low variance.