# This notebook

This notebook walks through the projects presented in the course, using some of the code provided in the lessons together with some of my own. I have restructured the code so that it can be used in a notebook. You can find my code here.

Each section will correspond to a lesson in the course. 

Let's start!

# Lesson 1: Naive Bayes

TO DO : Description of the algorithm/objective of the lesson

- a very common algorithm for finding a decision boundary
- given a piece of text, it identifies which label (for example, which author) is more likely to be its origin. The algorithm uses the frequency of each word under each label to update the probabilities.
- it is called naive because it ignores word order
- Advantages:
    - easy to implement.
    - it works well with large feature spaces (20,000 or 30,000 English words, for example).
- Disadvantages:
    - it can break on phrases whose meaning depends on the words appearing together: e.g. searching Google for "Chicago Bulls", where treating "chicago" and "bulls" as independent words loses the meaning of the phrase.
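
The word-frequency update described in the list above can be sketched in a few lines. This is a minimal illustration, not the code used in the lessons: the `word_probs` table, the equal `priors`, and the `posterior` helper are all hypothetical, and the numbers are invented for the example rather than taken from the email data.

```python
# Hypothetical word frequencies per author (illustrative numbers, not from the Enron data).
word_probs = {
    "Chris": {"love": 0.1, "deal": 0.8, "life": 0.1},
    "Sara": {"love": 0.5, "deal": 0.2, "life": 0.3},
}
# Assumed equal prior probability for each author.
priors = {"Chris": 0.5, "Sara": 0.5}


def posterior(words):
    """Return P(author | words) under the naive independence assumption."""
    # Joint probability: prior times the product of the per-word likelihoods.
    joint = {}
    for author, prior in priors.items():
        p = prior
        for word in words:
            p *= word_probs[author][word]
        joint[author] = p
    # Normalize so that the posteriors sum to 1.
    total = sum(joint.values())
    return {author: p / total for author, p in joint.items()}


print(posterior(["life", "deal"]))
print(posterior(["deal", "life"]))  # same result up to rounding: word order is ignored
```

Because the per-word probabilities are simply multiplied, swapping the order of the words cannot change the result, which is exactly the "naive" part of the algorithm.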

## Example: Cancer test
- Prob(having cancer) = 0.01
- Test: 
    - 90% positive if you have cancer (sensitivity)
    - 90% negative if you do not have cancer (specificity)
    
- Question: Test is positive, what is the probability that the patient actually has cancer?
- Answer: We know the test is positive, so the patient is either one of the people who have cancer and test positive, or one of the people who do not have cancer and still test positive. The first group is 90% of the 1% of the population with cancer. The second group is 10% of the 99% of the population without cancer. The fraction of the total population with a positive test is therefore $0.9*0.01 + 0.1*0.99 = 0.108$. Out of this portion of the population we only want the people who actually have cancer, which is $0.9*0.01 = 0.009$. The probability of having cancer given a positive test is therefore $0.9*0.01 / 0.108 = 0.08333$.

## Bayes rule:
The basic idea behind Bayes' rule is that we start with a prior probability and update it using evidence.
$$\text{Prior probability} \: \& \: \text{Evidence} \rightarrow \text{Posterior  probability}$$

In the previous example:
- Prior probability : $\mathbb{P}(\text{Cancer}) = 0.01$
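- Sensitivity: $\mathbb{P}(\text{Positive} \: | \: \text{Cancer}) = 0.9$
- Specificity: $\mathbb{P}(\text{Negative} \: | \: \sim \text{Cancer}) = 0.9$, so $\mathbb{P}(\text{Positive} \: | \: \sim \text{Cancer}) = 0.1$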
- Joint probability: 
    - $\mathbb{P}(\text{Cancer} \: \& \: \text{Positive}) = \mathbb{P}(\text{Cancer})*\mathbb{P}(\text{Positive} \: | \: \text{Cancer}) = 0.009$
    - $\mathbb{P}(\sim \text{Cancer} \: \& \: \text{Positive}) = \mathbb{P}(\sim \text{Cancer})*\mathbb{P}(\text{Positive}\: | \: \sim \text{Cancer}) = 0.099$
- Normalizer: $\mathbb{P}(\text{Positive}) = \mathbb{P}(\sim \text{Cancer} \: \& \: \text{Positive}) + \mathbb{P}(\text{Cancer} \: \& \: \text{Positive}) = 0.108$
- Posterior: 
    - $\mathbb{P}(\text{Cancer} \:|\: \text{Positive}) = \frac{\mathbb{P}(\text{Cancer})*\mathbb{P}(\text{Positive} \: | \: \text{Cancer})}{\mathbb{P}(\text{Positive})} = \frac{0.009}{0.108} \approx 0.0833$
    - $\mathbb{P}(\sim \text{Cancer} \:|\: \text{Positive}) = \frac{\mathbb{P}(\sim \text{Cancer})*\mathbb{P}(\text{Positive} \: | \: \sim \text{Cancer})}{\mathbb{P}(\text{Positive})} = \frac{0.099}{0.108} \approx 0.9167$
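
The same calculation can be checked numerically. The sketch below only restates the cancer test numbers from this section; the variable names are my own choice for the illustration.

```python
# Numbers from the cancer test example in this section.
p_cancer = 0.01      # prior P(Cancer)
sensitivity = 0.9    # P(Positive | Cancer)
specificity = 0.9    # P(Negative | ~Cancer)

# Joint probabilities of each case together with a positive test.
joint_cancer = p_cancer * sensitivity
joint_no_cancer = (1 - p_cancer) * (1 - specificity)

# Normalizer: total probability of a positive test.
p_positive = joint_cancer + joint_no_cancer

# Posteriors: each joint probability divided by the normalizer.
posterior_cancer = joint_cancer / p_positive
posterior_no_cancer = joint_no_cancer / p_positive

print(f"P(Positive)           = {p_positive:.3f}")
print(f"P(Cancer | Positive)  = {posterior_cancer:.4f}")
print(f"P(~Cancer | Positive) = {posterior_no_cancer:.4f}")
```

Dividing by the normalizer is what makes the two posteriors sum to 1.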

## Text Learning
We will use an example of emails from two people: Chris and Sara. The data comes from the Enron emails database.  

In [20]:
path_tools = 'tools/'

import sys
from time import time
sys.path.append(path_tools)
from chris_sara_email_preprocess import preprocess

We will use the `preprocess` function to load the training and test features and labels. It relies on two files that have already been curated. We will dive into the details of the preprocessing in later lessons; for now we only use the data.

In [21]:
features_train, features_test, labels_train, labels_test = preprocess()

No. of Chris training emails :  7936
No. of Sara training emails :  7884


In [22]:
#import naive bayes from sklearn
from sklearn.naive_bayes import GaussianNB

In [23]:
#initialize and fit
naive_bayes = GaussianNB()
naive_bayes = naive_bayes.fit(features_train, labels_train)

Is the model any good? We will use the **accuracy** metric to evaluate the models. Accuracy is the fraction of test-set examples that are classified correctly.

In [24]:
from sklearn.metrics import accuracy_score

In [25]:
#predict and evaluate accuracy
predictions = naive_bayes.predict(features_test)
accuracy_score(labels_test,predictions)

0.9732650739476678
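
To make the metric concrete, here is a minimal sketch that computes accuracy by hand. The `y_true` and `y_pred` lists are made-up values for illustration only, not the email labels; `accuracy_score(y_true, y_pred)` computes the same fraction.

```python
# Hypothetical labels and predictions, only to illustrate the metric.
y_true = [0, 1, 1, 0, 1, 0, 0, 1]
y_pred = [0, 1, 0, 0, 1, 1, 0, 1]

# Count the positions where the prediction equals the true label.
correct = sum(t == p for t, p in zip(y_true, y_pred))

# Accuracy is the number of correct predictions divided by the total.
manual_accuracy = correct / len(y_true)

print(f"{correct} correct out of {len(y_true)}: accuracy = {manual_accuracy}")
```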

Let's also record the training and prediction times for the model. I wrote a helper to keep track of the models we will try in the next few lessons.

In [26]:
from classifiers_summary import Classifiers

In [27]:
summary = Classifiers('ML intro Udacity')

In [28]:
summary.add_classifier('Naive Bayes', GaussianNB(), {}, features_train, labels_train, features_test, labels_test)

****************************************************************************************************
Added Naive Bayes with params {'priors': None, 'var_smoothing': 1e-09}
Nb of training examples: 15820
Training time: 0.394 s | Predict time: 0.034 s
Accuracy: 0.9732650739476678
****************************************************************************************************
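
For readers without the `classifiers_summary` module, the timing and accuracy figures above could be measured roughly as follows. This is a sketch under assumptions, not the actual `Classifiers` implementation: `time_classifier` is a hypothetical helper, and the data is a synthetic stand-in generated with `make_classification` instead of the email features.

```python
from time import perf_counter

from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB


def time_classifier(clf, X_train, y_train, X_test, y_test):
    """Fit clf, predict on the test set, and return timings and accuracy."""
    # Time the training step.
    start = perf_counter()
    clf.fit(X_train, y_train)
    training_time = perf_counter() - start

    # Time the prediction step.
    start = perf_counter()
    predictions = clf.predict(X_test)
    predict_time = perf_counter() - start

    return {
        "training_time": round(training_time, 3),
        "predict_time": round(predict_time, 3),
        "accuracy": accuracy_score(y_test, predictions),
    }


# Synthetic stand-in data; the notebook itself uses the email features.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

result = time_classifier(GaussianNB(), X_train, y_train, X_test, y_test)
print(result)
```

Timings depend on the machine and on the run, so they are only comparable between models measured under the same conditions.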


# Lesson 2: SVM (Support Vector Machine)

TO DO description of the algorithm and the lesson

Goal: find a hyperplane that separates the classes (if possible).
What makes a good separating hyperplane? It maximizes the distance (the margin) between the hyperplane and the nearest points of each class.

For an SVM, classifying the points correctly takes priority over maximizing the margin. We can tune its parameters to tolerate some classification errors in exchange for a good margin between the classes.

An SVM can also use other (non-linear) kernels to obtain a decision boundary that captures non-linear relationships. Sometimes the resulting boundary is so complex that it does not generalize well!

Kernel trick: kernels are functions that map a low-dimensional feature space, where the classes are not linearly separable, to a higher-dimensional one where they are.
The SVM solution found there is then mapped back, which gives a non-linear separation in the original feature space.
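
The idea of mapping to a higher dimension can be illustrated with an explicit extra feature. This is a sketch on synthetic data from `make_circles`, and it adds the feature by hand; a real kernel SVM gets a similar effect implicitly, without ever building the new features.

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Two concentric circles: not linearly separable in the original (x, y) space.
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

# Linear SVM in the original 2-D space.
acc_2d = SVC(kernel="linear").fit(X, y).score(X, y)

# Add a third feature z = x^2 + y^2 (squared distance from the origin).
z = (X ** 2).sum(axis=1, keepdims=True)
X_3d = np.hstack([X, z])

# The same linear SVM, now trained in the 3-D space.
acc_3d = SVC(kernel="linear").fit(X_3d, y).score(X_3d, y)

print(f"training accuracy, original features: {acc_2d:.2f}")
print(f"training accuracy, with x^2 + y^2:    {acc_3d:.2f}")
```

A flat plane in the 3-D space corresponds to a circle in the original 2-D space, which is the non-linear boundary the two classes need.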

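Important SVM parameters:
- kernel: the type of kernel, e.g. `linear` or `rbf`
- C: controls the tradeoff between a smooth decision boundary and one that classifies all the training points correctly. A larger C favors classifying more training points correctly.
- gamma: the kernel coefficient for non-linear kernels such as `rbf`; it controls how far the influence of a single training example reaches.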
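
The effect of C can be seen on a small synthetic dataset. This is a minimal sketch, assuming noisy data from `make_moons` as a stand-in for a real training set; the C values are arbitrary choices for the illustration.

```python
from sklearn.datasets import make_moons
from sklearn.svm import SVC

# Synthetic noisy data (hypothetical stand-in for a real training set).
X, y = make_moons(n_samples=200, noise=0.3, random_state=0)

train_accuracy = {}
for C in [0.01, 1, 1000]:
    clf = SVC(kernel="rbf", C=C).fit(X, y)
    # Accuracy on the training points themselves, to show how closely
    # the decision boundary follows them.
    train_accuracy[C] = clf.score(X, y)
    print(f"C={C}: training accuracy {train_accuracy[C]:.3f}")
```

Note that this measures accuracy on the training points only. A very large C can fit the training data closely and still generalize poorly, so the choice should be checked on held-out data.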

Let's use an SVM classifier with a linear kernel.

In [10]:
from sklearn.svm import SVC

In [11]:
summary.add_classifier("SVM linear kernel", SVC(), {'kernel':'linear'}, features_train,labels_train, features_test,labels_test)

****************************************************************************************************
Added SVM linear kernel with params {'C': 1.0, 'break_ties': False, 'cache_size': 200, 'class_weight': None, 'coef0': 0.0, 'decision_function_shape': 'ovr', 'degree': 3, 'gamma': 'scale', 'kernel': 'linear', 'max_iter': -1, 'probability': False, 'random_state': None, 'shrinking': True, 'tol': 0.001, 'verbose': False}
Nb of training examples: 15820
Training time: 80.343 s | Predict time: 7.973 s
Accuracy: 0.9840728100113766
****************************************************************************************************


That took a long time! Let's see what happens if we reduce the training data to 1% of its original size.

In [12]:
features_train_1pc = features_train[:round(len(features_train)/100)]
labels_train_1pc = labels_train[:round(len(labels_train)/100)]

In [13]:
summary.add_classifier("SVM linear kernel 1pc", SVC(), {'kernel':'linear'}, features_train_1pc,labels_train_1pc, 
                              features_test,labels_test,)

****************************************************************************************************
Added SVM linear kernel 1pc with params {'C': 1.0, 'break_ties': False, 'cache_size': 200, 'class_weight': None, 'coef0': 0.0, 'decision_function_shape': 'ovr', 'degree': 3, 'gamma': 'scale', 'kernel': 'linear', 'max_iter': -1, 'probability': False, 'random_state': None, 'shrinking': True, 'tol': 0.001, 'verbose': False}
Nb of training examples: 158
Training time: 0.044 s | Predict time: 0.455 s
Accuracy: 0.8845278725824801
****************************************************************************************************


Let's try to change the kernel to an rbf kernel and keep the same small training set. What is the accuracy in this case?

In [14]:
summary.add_classifier("SVM rbf kernel 1pc", SVC(), {'kernel':'rbf'}, features_train_1pc,labels_train_1pc, 
                              features_test,labels_test, inplace=True)

****************************************************************************************************
Added SVM rbf kernel 1pc with params {'C': 1.0, 'break_ties': False, 'cache_size': 200, 'class_weight': None, 'coef0': 0.0, 'decision_function_shape': 'ovr', 'degree': 3, 'gamma': 'scale', 'kernel': 'rbf', 'max_iter': -1, 'probability': False, 'random_state': None, 'shrinking': True, 'tol': 0.001, 'verbose': False}
Nb of training examples: 158
Training time: 0.049 s | Predict time: 0.677 s
Accuracy: 0.8953356086461889
****************************************************************************************************


Now let's try different values for the parameter C

In [15]:
C = [1, 10, 100, 1000, 10000]
for c in C:
    clf = summary.add_classifier(f"SVM rbf kernel 1pc C={c}", SVC(), {'kernel':'rbf', 'C':c}, features_train_1pc,labels_train_1pc, 
                              features_test,labels_test)

****************************************************************************************************
Added SVM rbf kernel 1pc C=1 with params {'C': 1, 'break_ties': False, 'cache_size': 200, 'class_weight': None, 'coef0': 0.0, 'decision_function_shape': 'ovr', 'degree': 3, 'gamma': 'scale', 'kernel': 'rbf', 'max_iter': -1, 'probability': False, 'random_state': None, 'shrinking': True, 'tol': 0.001, 'verbose': False}
Nb of training examples: 158
Training time: 0.05 s | Predict time: 0.675 s
Accuracy: 0.8953356086461889
****************************************************************************************************
****************************************************************************************************
Added SVM rbf kernel 1pc C=10 with params {'C': 10, 'break_ties': False, 'cache_size': 200, 'class_weight': None, 'coef0': 0.0, 'decision_function_shape': 'ovr', 'degree': 3, 'gamma': 'scale', 'kernel': 'rbf', 'max_iter': -1, 'probability': False, 'random_state': None, 'shr

In [16]:
summary.summary_table()

| classifier | accuracy | nb_training_examples | nb_training_features | training_time (s) | predict_time (s) | params |
|---|---|---|---|---|---|---|
| Naive Bayes | 0.973265 | 15820 | 3785 | 0.361 | 0.037 | `{}` |
| SVM linear kernel | 0.984073 | 15820 | 3785 | 80.343 | 7.973 | `{'kernel': 'linear'}` |
| SVM linear kernel 1pc | 0.884528 | 158 | 3785 | 0.044 | 0.455 | `{'kernel': 'linear'}` |
| SVM rbf kernel 1pc | 0.895336 | 158 | 3785 | 0.049 | 0.677 | `{'kernel': 'rbf'}` |
| SVM rbf kernel 1pc C=1 | 0.895336 | 158 | 3785 | 0.05 | 0.675 | `{'kernel': 'rbf', 'C': 1}` |
| SVM rbf kernel 1pc C=10 | 0.899886 | 158 | 3785 | 0.046 | 0.683 | `{'kernel': 'rbf', 'C': 10}` |
| SVM rbf kernel 1pc C=100 | 0.899886 | 158 | 3785 | 0.046 | 0.676 | `{'kernel': 'rbf', 'C': 100}` |
| SVM rbf kernel 1pc C=1000 | 0.899886 | 158 | 3785 | 0.047 | 0.664 | `{'kernel': 'rbf', 'C': 1000}` |
| SVM rbf kernel 1pc C=10000 | 0.899886 | 158 | 3785 | 0.047 | 0.666 | `{'kernel': 'rbf', 'C': 10000}` |


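C=10 through C=10000 all reach the same accuracy (0.899886) on the 1% training set. We choose the model with C=10000 and now train it on the entire training dataset. In plain scikit-learn this would be `best_clf = SVC(kernel='rbf', C=10000)` followed by `best_clf.fit(features_train, labels_train)`.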

In [29]:
best_svm = summary.add_classifier("Best SVM", SVC(), {'kernel':'rbf', 'C':10000}, features_train, labels_train,
                              features_test, labels_test)

****************************************************************************************************
Added Best SVM with params {'C': 10000, 'break_ties': False, 'cache_size': 200, 'class_weight': None, 'coef0': 0.0, 'decision_function_shape': 'ovr', 'degree': 3, 'gamma': 'scale', 'kernel': 'rbf', 'max_iter': -1, 'probability': False, 'random_state': None, 'shrinking': True, 'tol': 0.001, 'verbose': False}
Nb of training examples: 15820
Training time: 84.826 s | Predict time: 11.085 s
Accuracy: 0.9960182025028441
****************************************************************************************************


# Lesson 3: Decision Trees

TO DO description of the algorithm and lesson 

In [None]:
from sklearn.tree import DecisionTreeClassifier

In [None]:
summary.add_classifier("DT min_samples_split=40", DecisionTreeClassifier(), {'min_samples_split':40}, features_train,labels_train, 
                              features_test,labels_test, inplace=True)