**Confidence Intervals and Confidence Levels**

A confidence interval is a range of values that is likely to contain an unknown population parameter. If we draw random samples many times and build an interval from each one, a certain percentage of those confidence intervals will contain the true population parameter (for example, the population mean). This percentage is the confidence level.

If we want to have a 95% chance of capturing the true population parameter with a point estimate and a corresponding confidence interval, we would set the confidence level to 95%. *Higher confidence levels result in wider confidence intervals*.


We calculate a confidence interval by taking a point estimate and then adding and subtracting a *margin of error* to create a range. This margin of error is based on three things:

1. the desired confidence level
2. the spread of the data
3. the size of the sample

The margin of error for a known population standard deviation is:

Margin of Error = z ∗ σ / √n

where σ (sigma) is the population standard deviation, n is the sample size, and z is a number known as the z-critical value.

The z-critical value is the number of standard deviations you'd have to go from the mean of the standard normal distribution, in both directions, to capture the proportion of the data associated with the desired confidence level (about 1.96 for a 95% confidence level).
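As a minimal sketch of the margin-of-error formula, the snippet below computes the z-critical value with Python's standard-library `NormalDist` and builds intervals at three confidence levels. The helper names `z_critical` and `margin_of_error`, and the sample mean, sigma and n, are hypothetical values chosen only for illustration.

```python
from math import sqrt
from statistics import NormalDist


def z_critical(confidence_level):
    """Two-sided z-critical value for a confidence level such as 0.95."""
    alpha = 1 - confidence_level
    # Leave alpha / 2 in each tail of the standard normal distribution.
    return NormalDist().inv_cdf(1 - alpha / 2)


def margin_of_error(sigma, n, confidence_level):
    """z * sigma / sqrt(n), valid when the population sigma is known."""
    return z_critical(confidence_level) * sigma / sqrt(n)


# Hypothetical numbers, chosen only for illustration.
sample_mean, sigma, n = 115.0, 25.0, 100

for level in (0.90, 0.95, 0.99):
    moe = margin_of_error(sigma, n, level)
    print(f"{level:.0%}: z = {z_critical(level):.3f}, "
          f"interval = ({sample_mean - moe:.2f}, {sample_mean + moe:.2f})")
```

With the same data, the interval gets wider as the confidence level increases, because only the z-critical value changes.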

When we create a confidence interval, it's important to be able to **interpret** the meaning of the confidence level we used and the interval that was obtained.

A specific confidence interval gives a range of plausible values for the parameter of interest.
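One way to make the meaning of the confidence level concrete is a small simulation. The sketch below assumes a normally distributed population with a known standard deviation (the true mean, sigma, sample size and number of repetitions are all hypothetical), draws many independent samples, builds a 95% interval from each, and counts how often the interval contains the true mean.

```python
import numpy as np
from statistics import NormalDist

rng = np.random.default_rng(0)

# Hypothetical population and sampling setup, for illustration only.
true_mean, sigma, n = 50.0, 10.0, 30
confidence_level = 0.95
n_repeats = 10_000

z = NormalDist().inv_cdf(1 - (1 - confidence_level) / 2)
margin = z * sigma / np.sqrt(n)

# Each row is one independent sample of size n.
samples = rng.normal(true_mean, sigma, size=(n_repeats, n))
sample_means = samples.mean(axis=1)

# An interval (mean - margin, mean + margin) contains the true mean
# exactly when the sample mean lies within one margin of it.
covered = np.abs(sample_means - true_mean) <= margin
coverage = covered.mean()

print(f"Share of intervals containing the true mean: {coverage:.3f}")
```

The printed share should land close to 0.95. Any single interval either contains the true mean or it does not; the 95% describes the long-run behaviour of the procedure.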

**Interpreting a confidence level**

Example 1:
    
A political pollster plans to ask a random sample of 500 voters whether 
or not they support the incumbent candidate. The pollster will take the 
results of the sample and construct a 90% confidence interval for the true 
proportion of all voters who support the candidate.

Which of the following is a correct interpretation of the 90% confidence level?

1. If the pollster repeats this process and constructs 20 intervals from separate independent samples, we can expect about 18 of those intervals to contain the true proportion of voters who support the candidate


2. About 90% of people who support the candidate will respond to the poll


3. If the pollster repeats this process many times, then about 90% of the intervals produced will capture the true proportion of voters who support the candidate


Example 2:

A baseball coach was curious about the true mean speed of fastball pitches in his league. The coach recorded the speed in kilometers per hour of each fastball in a random sample of 100 pitches and constructed a 95% confidence interval for the mean speed. The resulting interval was (110, 120)

Which of the following is a correct interpretation of the interval (110, 120)?

1. If the coach took another sample of 100 pitches, there's a 95% chance the sample mean would be between 110 and 120 km/hr


2. About 95% of pitches in the sample were between 110 and 120 km/hr


3. We're 95% confident that the interval (110, 120) captured the true mean pitch speed


Suppose the coach decides they want to be more confident. The coach uses the same sample data as before but recalculates the confidence interval using a 99% confidence level. 

How will increasing the confidence level from 95% to 99% affect the confidence interval?

1. It is impossible to say without seeing the sample data


2. Increasing the confidence will increase the margin of error resulting in a wider interval


3. Increasing the confidence will decrease the margin of error resulting in a narrower interval

**Types of Error, Sensitivity and Specificity**

Type 1 errors are errors when the result is significant despite the fact that the null hypothesis is true. The likelihood of a Type 1 error is indicated by alpha. 

In quality control, a Type 1 error is called *producer risk* because an item is rejected despite the fact that it meets regulatory requirements.

In healthcare, a Type 1 error would be a diagnosis of cancer even though the subject is healthy. For this reason this type of error is referred to as a False Positive.


Type 2 errors are errors when the result is not significant despite the fact that the null hypothesis is false. The probability of this type of error is indicated by beta. The power of a statistical test is the probability of correctly rejecting a false null hypothesis, which equals 1 - beta.

In quality control, a Type 2 error is called *consumer risk* because the consumer obtains an item that does not meet the regulatory requirements.

In healthcare, a Type 2 error would be a healthy diagnosis even though the subject has cancer. For this reason this type of error is referred to as a False Negative.


**Sensitivity** - Proportion of positives that are correctly identified by a test or the 
probability of a positive test given the patient is ill

**Specificity** - Proportion of negatives that are correctly identified by a test or the
probability of a negative test given the patient is well

**Positive Predictive Value (PPV)** - Proportion of patients with positive test results who
are correctly diagnosed

**Negative Predictive Value (NPV)** - Proportion of patients with negative test results who 
are correctly diagnosed
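The four definitions above can be sketched directly from the counts of true and false positives and negatives. The function name `diagnostic_metrics` and the counts passed to it are hypothetical, used only to show the arithmetic.

```python
def diagnostic_metrics(tp, fp, fn, tn):
    """Compute the four test-quality measures from raw counts."""
    return {
        # Of the patients who are ill, how many test positive?
        "sensitivity": tp / (tp + fn),
        # Of the patients who are well, how many test negative?
        "specificity": tn / (tn + fp),
        # Of the positive test results, how many are correct?
        "ppv": tp / (tp + fp),
        # Of the negative test results, how many are correct?
        "npv": tn / (tn + fn),
    }


# Hypothetical screening results for 1,000 patients, 100 of whom are ill.
metrics = diagnostic_metrics(tp=90, fp=50, fn=10, tn=850)

for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Note that sensitivity and specificity condition on the patient's true state, while PPV and NPV condition on the test result, so the predictive values also depend on how common the condition is in the tested group.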

![Screenshot%202021-01-21%20at%2017.18.23.png](attachment:Screenshot%202021-01-21%20at%2017.18.23.png)

**Likelihood Ratios**

Likelihood ratios are used for assessing the value of performing a diagnostic test. They use the sensitivity and specificity of the test to determine whether a test result changes the probability that a condition exists.

Two versions of the likelihood ratio exist: the positive likelihood ratio and the negative likelihood ratio.


**False Positive Rate** (alpha) = type 1 error = 1 - specificity = FP / (FP + TN)

**False Negative Rate** (beta) = type 2 error = 1 - sensitivity = FN / (TP + FN)

**Positive Likelihood Ratio** = Sensitivity / (1 - Specificity)

**Negative Likelihood Ratio** = (1 - Sensitivity) / Specificity
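A minimal sketch of the two likelihood ratios follows. The negative likelihood ratio is computed as the false negative rate divided by the specificity, that is (1 - sensitivity) / specificity. The function name and the sensitivity and specificity values are hypothetical.

```python
def likelihood_ratios(sensitivity, specificity):
    """Return (positive LR, negative LR) for a diagnostic test."""
    false_positive_rate = 1 - specificity
    false_negative_rate = 1 - sensitivity
    positive_lr = sensitivity / false_positive_rate
    negative_lr = false_negative_rate / specificity
    return positive_lr, negative_lr


# Hypothetical test with 90% sensitivity and 95% specificity.
positive_lr, negative_lr = likelihood_ratios(sensitivity=0.90, specificity=0.95)

print(f"LR+ = {positive_lr:.2f}")
print(f"LR- = {negative_lr:.3f}")
```

A positive likelihood ratio well above 1 means a positive result makes the condition more likely; a negative likelihood ratio close to 0 means a negative result makes it much less likely.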

**Confusion Matrices**

Using our above understanding of types of errors, we can use the concept of confusion matrices to evaluate a machine learning model, particularly a classification model or classifier.

A confusion matrix tells us four important things. Let's assume a model was trained for a binary classification task, meaning that every item in the dataset has a ground-truth value of 1 or 0. To make it easier to understand, let's pretend this model is trying to predict whether or not someone has a condition such as a disease or, in the examples below, pregnancy. A confusion matrix gives you the following information:

True Positives (TP): You predicted positive and it is true, e.g. you predicted a woman is pregnant and she actually is

True Negatives (TN): You predicted negative and it is true, e.g. you predicted a man is not pregnant and he actually is not


False Positives (FP) or Type 1 Error: You predicted positive and it is false, e.g. you predicted a man is pregnant but he actually is not


False Negatives (FN) or Type 2 Error: You predicted negative and it is false, e.g. you predicted a woman is not pregnant but she actually is


![new_confusion_matrix_2.png](attachment:new_confusion_matrix_2.png)

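We can use sklearn to create confusion matrices to evaluate classifier performance. In sklearn's output, rows correspond to the true labels and columns to the predicted labels, so for 0/1 labels the layout is [[TN, FP], [FN, TP]].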

```python
from sklearn.metrics import confusion_matrix

example_labels = [0, 1, 1, 1, 0, 1, 0, 1, 0, 0, 1]
example_preds = [0, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1]

cf = confusion_matrix(example_labels, example_preds)
cf
```

```
array([[2, 3],
       [2, 4]])
```
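To connect the matrix back to the four outcome types, a short sketch: for binary 0/1 labels, flattening sklearn's matrix with `ravel()` yields the counts in the order TN, FP, FN, TP. The lowercase variable names are just illustrative.

```python
from sklearn.metrics import confusion_matrix

example_labels = [0, 1, 1, 1, 0, 1, 0, 1, 0, 0, 1]
example_preds = [0, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1]

# Row 0 holds the actual negatives, row 1 the actual positives,
# so flattening row by row gives TN, FP, FN, TP.
tn, fp, fn, tp = (int(x) for x in
                  confusion_matrix(example_labels, example_preds).ravel())

print(f"TP={tp}, TN={tn}, FP={fp} (Type 1), FN={fn} (Type 2)")
```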

**Evaluation Metrics**

In classification models, we use certain metrics to evaluate the performance of an algorithm. These metrics are based on our understanding of true positives, true negatives, false positives and false negatives. In binary classification, each prediction is either correct or incorrect, so we tend to break performance down into how many false positives versus false negatives a model produces.

**Precision and Recall**

Precision and Recall are two of the most basic evaluation metrics available to us. Precision measures how precise the predictions are, while Recall indicates what percentage of the classes we're interested in were actually captured by the model.

**Precision = Number of True Positives / Number of Predicted Positives**

To reuse a previous analogy of a model that predicts whether or not a person has a certain disease, precision allows us to answer the following question:

"Out of all the times the model said someone had a disease, how many times did the patient in question actually have the disease?"

Note that a high precision score can be a bit misleading. For instance, let's say we take a model and train it to make predictions on a sample of 10,000 patients. This model predicts that 6,000 patients have the disease when in reality only 5,500 have the disease. If every one of those 5,500 patients is among the 6,000 the model flagged, the model has a precision of 5,500 / 6,000 ≈ 91.7%. Now, let's assume we create a second model that only predicts that a person is sick when it's incredibly obvious. Out of 10,000 patients, this model only predicts that 5 people in the entire population are sick. However, each of those 5 times, it is correct. Model 2 would have a precision score of 100%, even though it missed 5,495 cases where the patient actually had the disease! In this way, more conservative models can have a high precision score, but this doesn't necessarily mean that they are the best performing model!
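The two-model comparison can be checked with a few lines of arithmetic. This sketch assumes that all 5,500 truly sick patients are among the 6,000 that the first model flags, which the text does not state explicitly; the variable names are illustrative.

```python
def precision(tp, fp):
    """Share of positive predictions that are correct."""
    return tp / (tp + fp)


actual_sick = 5500

# Model 1: flags 6,000 patients; assumed to include all 5,500 sick ones.
m1_tp, m1_fp = 5500, 500
m1_missed = actual_sick - m1_tp

# Model 2: flags only 5 patients, all of them truly sick.
m2_tp, m2_fp = 5, 0
m2_missed = actual_sick - m2_tp

print(f"Model 1: precision = {precision(m1_tp, m1_fp):.3f}, missed = {m1_missed}")
print(f"Model 2: precision = {precision(m2_tp, m2_fp):.3f}, missed = {m2_missed}")
```

The second model wins on precision while missing almost every sick patient, which is why precision is rarely reported on its own.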

**Recall = Number of True Positives / Number of Actual Total Positives**

Following the same disease analogy, recall allows us to ask:

"Out of all the patients we saw that actually had the disease, what percentage of them did our model correctly identify as having the disease?"

Note that recall can be a bit of a tricky statistic because improving our recall score doesn't necessarily always mean a better model overall. For example, our model could easily score 100% for recall by just classifying every single patient that walks through the door as having the disease in question. Sure, it would have many False Positives, but it would also correctly identify every single sick person as having the disease!
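As a quick illustration, sklearn's `precision_score` and `recall_score` can be applied to the same example labels and predictions used for the confusion matrix earlier (4 true positives, 3 false positives, 2 false negatives).

```python
from sklearn.metrics import precision_score, recall_score

example_labels = [0, 1, 1, 1, 0, 1, 0, 1, 0, 0, 1]
example_preds = [0, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1]

# Precision = TP / (TP + FP), recall = TP / (TP + FN).
precision = precision_score(example_labels, example_preds)
recall = recall_score(example_labels, example_preds)

print(f"precision = {precision:.3f}")
print(f"recall    = {recall:.3f}")
```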

Precision and Recall tend to have an inverse relationship. For a given model, changing the decision threshold to raise recall will usually lower precision, and vice versa.

A doctor that is overly obsessed with recall will have a very low threshold for declaring someone as sick because they are most worried about sick patients. Their precision will be quite low, because they classify almost everyone as sick, and don't care when they're wrong -- they only care about making sure that sick people are identified as sick.

A doctor that is overly obsessed with precision will have a very high threshold for declaring someone as sick, because they only declare someone as sick when they are completely sure that they will be correct if they declare a person as sick. Although their precision will be very high, their recall will be incredibly low, because a lot of people that are sick but don't meet the doctor's threshold will be incorrectly classified as healthy.
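The two doctors correspond to two different decision thresholds applied to the same predicted scores. The sketch below uses a small hand-made set of labels and scores (entirely hypothetical) and evaluates precision and recall at a low, a middle and a high threshold.

```python
import numpy as np

# Hypothetical true labels and model scores, for illustration only.
y_true = np.array([0, 0, 0, 0, 1, 0, 1, 1, 0, 1])
scores = np.array([0.05, 0.10, 0.20, 0.30, 0.35, 0.40, 0.60, 0.70, 0.80, 0.90])


def precision_recall_at(threshold):
    """Classify as sick when score >= threshold, then score the result."""
    y_pred = scores >= threshold
    tp = np.sum(y_pred & (y_true == 1))
    fp = np.sum(y_pred & (y_true == 0))
    fn = np.sum(~y_pred & (y_true == 1))
    return tp / (tp + fp), tp / (tp + fn)


results = {t: precision_recall_at(t) for t in (0.25, 0.50, 0.85)}

for threshold, (precision, recall) in results.items():
    print(f"threshold {threshold:.2f}: "
          f"precision = {precision:.2f}, recall = {recall:.2f}")
```

On this toy data the low threshold catches every sick patient at the cost of several false alarms, while the high threshold makes no false alarms but misses most sick patients.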


In addition to Precision and Recall, there are two other metrics used to describe performance of a model. 

**Accuracy = (Number of True Positives + True Negatives) / Total Observations**

Accuracy is useful because it allows us to measure the total number of predictions a model gets right, including both True Positives and True Negatives.

Sticking with our analogy, accuracy allows us to answer:

"Out of all the predictions our model made, what percentage were correct?"

Accuracy is the most common metric for classification. It provides a solid holistic view of the overall performance of our model, although it can be misleading when one class is much more common than the other.
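A short sketch, reusing the example labels and predictions from the confusion matrix section, shows that the accuracy formula and sklearn's `accuracy_score` agree.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

example_labels = [0, 1, 1, 1, 0, 1, 0, 1, 0, 0, 1]
example_preds = [0, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1]

tn, fp, fn, tp = confusion_matrix(example_labels, example_preds).ravel()

# Correct predictions (TP + TN) divided by all predictions.
accuracy_from_counts = (tp + tn) / (tp + tn + fp + fn)
accuracy = accuracy_score(example_labels, example_preds)

print(f"accuracy from counts = {accuracy_from_counts:.3f}")
print(f"accuracy_score       = {accuracy:.3f}")
```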

**F1 Score**

F1 score represents the Harmonic Mean of Precision and Recall. In short, this means that the F1 score cannot be high without both precision and recall also being high. When a model's F1 score is high, you know that your model is doing well all around.

The formula for F1 score is:

**F1 score = 2 × (Precision × Recall) / (Precision + Recall)**
 
To demonstrate the effectiveness of F1 score, let's plug in some numbers and compare F1 score with a regular arithmetic average of precision and recall.

Let's assume that the model has 98% recall and 6% precision.

Taking the arithmetic mean of the two, we get:  (0.98+0.06)/2=1.04/2=0.52 

However, using these numbers in the F1 score formula results in:

F1 score = 2 × (0.98 × 0.06) / (0.98 + 0.06) = 2 × 0.0588 / 1.04 = 0.1176 / 1.04 ≈ 0.1131

or 11.3%!

As you can see, F1 score penalizes a model heavily if it skews too hard towards either precision or recall. For this reason, *F1 score is one of the most widely used metrics for describing the performance of a model*.
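The comparison between the arithmetic mean and the harmonic mean can be reproduced in a few lines. The helper name `f1_from` is hypothetical; the precision and recall values are the ones from the worked example.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


precision, recall = 0.06, 0.98

arithmetic_mean = (precision + recall) / 2
f1 = f1_from(precision, recall)

print(f"arithmetic mean = {arithmetic_mean:.4f}")
print(f"F1 score        = {f1:.4f}")
```

The harmonic mean is pulled towards the smaller of the two values, so a single weak metric keeps the F1 score low.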


**Receiver Operating Characteristic (ROC) Curve**

Closely related to sensitivity and specificity is the Receiver Operating Characteristic (ROC) curve. This is a graph displaying the relationship between the **true positive rate** (on the vertical axis) and the **false positive rate** (on the horizontal axis).

The technique comes from the field of engineering, where it was developed to find the predictor which best discriminates between two given distributions: ROC curves were first used during WWII to analyze radar effectiveness. In the early days of radar, it was sometimes hard to tell a bird from a plane. The British pioneered using ROC curves to optimize the way that they relied on radar for discriminating between incoming German planes and birds.

Take the case that we have two different distributions, for example one from the radar signal of birds and one from the radar signal of German planes, and we have to determine a cut-off value for an indicator in order to assign a test result to distribution one (“bird”) or to distribution two (“German plane”). The only parameter that we can change is the cut-off value, and the question arises: is there an optimal choice for this cut-off value?

The answer is yes: a common choice is the point on the ROC curve with the largest distance to the diagonal.

![Screenshot%202021-01-21%20at%2017.17.15.png](attachment:Screenshot%202021-01-21%20at%2017.17.15.png)
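Picking the cut-off with the largest distance to the diagonal can be sketched with sklearn's `roc_curve`. The labels and scores below are hypothetical toy data, and this criterion (maximising TPR minus FPR) is only one of several ways to choose a threshold.

```python
import numpy as np
from sklearn.metrics import roc_curve

# Hypothetical true labels and model scores, for illustration only.
y_true = [0, 0, 0, 0, 1, 0, 1, 1, 0, 1]
scores = [0.05, 0.10, 0.20, 0.30, 0.35, 0.40, 0.60, 0.70, 0.80, 0.90]

fpr, tpr, thresholds = roc_curve(y_true, scores)

# Vertical distance above the diagonal TPR = FPR. The perpendicular
# distance is this value divided by sqrt(2), so the same point wins.
distance = tpr - fpr
best = np.argmax(distance)
best_threshold = thresholds[best]

print(f"best threshold = {best_threshold:.2f}")
print(f"TPR = {tpr[best]:.2f}, FPR = {fpr[best]:.2f}")
```

In practice the costs of false positives and false negatives are rarely equal, so a threshold chosen this way is a starting point rather than a final answer.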

The Receiver Operating Characteristic curve (ROC curve) plots the true positive rate against the false positive rate of our classifier.

The True Positive Rate is another name for recall (and for sensitivity). As a reminder, it's the ratio of the true positive predictions compared to all values that are actually positive. Mathematically, it is represented by:

TPR = TP / (TP + FN)
 
False positive rate is the ratio of the false positive predictions compared to all values that are actually negative. Mathematically, it's represented as:

FPR = FP / (FP + TN)
 
When training a classifier, the best performing models will have an ROC curve that hugs the upper left corner of the graph. A classifier with 50-50 accuracy is deemed 'worthless'; this is no better than random guessing, as in the case of a coin flip.

![Screenshot%202021-01-21%20at%2017.22.15.png](attachment:Screenshot%202021-01-21%20at%2017.22.15.png)

The ROC curve gives us a graph of the tradeoff between the false positive rate and the true positive rate. The AUC, or area under the curve, gives us a single metric for comparing classifiers: an AUC of 1 corresponds to a perfect classifier, and an AUC of 0.5 corresponds to one that does no better than random guessing.

Smaller values on the x-axis of the plot indicate lower false positives and higher true negatives.
Larger values on the y-axis of the plot indicate higher true positives and lower false negatives.

A skilful model will assign a higher probability to a randomly chosen real positive occurrence than a negative occurrence on average. This is what we mean when we say that the model has skill. Generally, skilful models are represented by curves that bow up to the top left of the plot.
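This ranking view of skill is what the AUC measures. The sketch below, on hypothetical toy data with no tied scores, compares every positive example with every negative example and checks that the share of pairs ranked correctly matches sklearn's `roc_auc_score`.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical true labels and model scores, for illustration only.
y_true = np.array([0, 0, 0, 0, 1, 0, 1, 1, 0, 1])
scores = np.array([0.05, 0.10, 0.20, 0.30, 0.35, 0.40, 0.60, 0.70, 0.80, 0.90])

pos_scores = scores[y_true == 1]
neg_scores = scores[y_true == 0]

# Share of (positive, negative) pairs in which the positive example
# receives the higher score. There are no ties in this toy data.
pairwise = (pos_scores[:, None] > neg_scores[None, :]).mean()

auc = roc_auc_score(y_true, scores)

print(f"pairwise ranking probability = {pairwise:.4f}")
print(f"roc_auc_score                = {auc:.4f}")
```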

An operator may plot the ROC curve for the final model and choose a threshold that gives a desirable balance between the false positives and false negatives.