# Python Tutorial - Scikit Learn

## 1. Basic Terminology

Each row is an observation (also called a sample, example, instance, or record). <br>
Each column is a feature (also called a predictor, attribute, independent variable, input, regressor, or covariate). <br>
Each value to be predicted is the response (also called the target). <br>
Features and targets must:
    1. Be separate objects,
    2. Be numeric,
    3. Be NumPy arrays (or pandas DataFrames/Series),
    4. Have specific shapes:
         * X = FEATURES: (rows, columns)   2-DIM MATRIX
         * y = TARGET: (rows,)             1-DIM VECTOR
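
As a minimal sketch of the shape requirements above, the following builds a feature matrix and a target vector by hand. The numbers are made-up toy values, not taken from any real dataset.

```python
import numpy as np

# Hypothetical toy data: 4 observations (rows) and 3 numeric features (columns).
X = np.array([[5.1, 3.5, 1.4],
              [4.9, 3.0, 1.4],
              [6.2, 3.4, 5.4],
              [5.9, 3.0, 5.1]])

# One numeric target per observation, stored separately from X.
y = np.array([0, 0, 1, 1])

print(X.shape)  # (4, 3) -> 2-dimensional: (rows, columns)
print(y.shape)  # (4,)   -> 1-dimensional: (rows,)

# The number of rows of X must match the length of y.
assert X.shape[0] == y.shape[0]
```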

<br>
**USUAL STEPS FOR ML ALGORITHM**

* STEP 1: Import the class we plan to use
* STEP 2: Instantiate the classifier (here we can specify all the arguments of the classifier)
* STEP 3: Fit the model to the data (the method changes the classifier in place, so there is no need to assign the result to a variable)
* STEP 4: Predict new values (here we feed the classifier with new_data, i.e. a NumPy array with ALL the features)
* STEP 5: Model evaluation (we can check the accuracy of one model, or compare between models)
     - First way: Train the model on the entire dataset, and then test it on the same dataset to check its accuracy (NOT OPTIMAL)
     - Second way: Train-test split (we split the data into a training dataset and a test dataset)   (OPTIMAL)
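
The five steps can be sketched end to end as follows. This is only an illustration under assumptions: it uses the iris dataset bundled with scikit-learn and a K-Neighbors classifier with arbitrary settings, neither of which is prescribed by the steps themselves.

```python
# STEP 1: Import the class we plan to use (plus helpers for data, splitting and scoring)
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Assumed example data: the iris dataset shipped with scikit-learn
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1, stratify=y)

# STEP 2: Instantiate the classifier
clf = KNeighborsClassifier(n_neighbors=5)

# STEP 3: Fit the model to the training data (modifies clf in place)
clf.fit(X_train, y_train)

# STEP 4: Predict values for data the model has not seen
y_pred = clf.predict(X_test)

# STEP 5: Evaluate the model on the test set
test_accuracy = accuracy_score(y_test, y_pred)
print(test_accuracy)
```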


## 2. Supervised Learning

### - K-NEIGHBORS CLASSIFIER

Searches for the K nearest neighbors (by Euclidean distance, the default metric) of the point we want to predict, and returns the most frequent class among those neighbors.
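
To make the idea concrete, here is a from-scratch sketch of the voting rule in NumPy. The function name `knn_predict` and the toy points are hypothetical and only for illustration; scikit-learn's own implementation is more elaborate (it can use tree-based neighbor searches, for example).

```python
from collections import Counter

import numpy as np


def knn_predict(X_train, y_train, point, k=3):
    """Predict the class of `point` by majority vote among its k nearest neighbors."""
    # Euclidean distance from the point to every training observation
    distances = np.sqrt(((X_train - point) ** 2).sum(axis=1))
    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # Most frequent class among those neighbors
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]


# Hypothetical toy data: two well-separated groups of points
X_train = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.0],
                    [6.0, 6.0], [7.0, 7.0], [6.5, 5.5]])
y_train = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict(X_train, y_train, np.array([1.2, 1.1]), k=3))  # 0
print(knn_predict(X_train, y_train, np.array([6.4, 6.2]), k=3))  # 1
```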

In [None]:
# STEP 1: Import the class we plan to use 
from sklearn.neighbors import KNeighborsClassifier

# STEP 2: Instantiate the classifier.
clf = KNeighborsClassifier(n_neighbors=5,       # n_neighbors: how many neighboring points to take into account
                           weights='uniform')   # weights: the weight given to each neighbor of the point we deal with.
                                                # 'uniform' is the default: every neighbor gets the same weight.
                                                # Another option is 'distance': nearer neighbors get higher weights.

# (PRE)STEP 5: The second way, train-test split (we split the data into a training dataset and a test dataset)   (OPTIMAL)
from sklearn.model_selection import train_test_split   # the old sklearn.cross_validation module was removed in scikit-learn 0.20

# It splits the data into training and test datasets.
# IMPORTANT: We split first, and then run the rest of the steps (fit, predict, evaluate) on the split data.
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.5,        # Fraction of the data that goes to the test set (here half-half; the default is 0.25, i.e. 25/75)
                                                    random_state=1,       # The split is random; fixing the seed (here 1) makes it reproducible
                                                    stratify=y)           # Keeps the proportions of the target classes equal in the training and test sets

# STEP 3: Fit the train dataset to the model           
clf.fit(X_train,y_train)                    

# STEP 4: Predict New values (We can also put here completely new values)
y_pred = clf.predict(X_test)

# Gives, for every sample, the predicted probability of each target class.
# In a binary problem, predict() returns class 1 when its probability is higher than the threshold (50%)
y_prob = clf.predict_proba(X_test)

# OPTIONAL STEP: Serialize the classifier with pickle, so you don't need to train it over and over again.
# Every time you load the file, the classifier will be there, ready to use.
import pickle

# Open a file in which to save the classifier. wb = write bytes
with open("K-NeighborsClassifier.pickle", "wb") as tmp:
    pickle.dump(clf, tmp)                                    # We save the classifier clf to the file
                                                             # No tmp.close() needed: the with block closes the file

# Open the saved pickle file to load the classifier. rb = read bytes
with open("K-NeighborsClassifier.pickle", "rb") as tmp:
    clf = pickle.load(tmp)                                   # We load the classifier into clf in order to use it


# STEP 5: Model Evaluation
# It's good practice to always compare the accuracy of the model with the NULL ACCURACY, i.e. the accuracy
# that could be achieved by always predicting the most frequent class. This is the "base accuracy" (dumbest model).
# There is no dedicated metric function for it; compute it from the class frequencies of y_test (e.g. with pandas).

from sklearn import metrics               

# 1st Way: accuracy_score(): UPSIDES: Easy and fast. DOWNSIDES: Does not show the underlying distribution of the responses or the types of errors the classifier makes
# Model Evaluation Metric: Accuracy - The fraction of predictions that match the actual values. 1 = all correct, 0 = none correct.
metrics.accuracy_score(y_test, y_pred)

# 2nd Way: confusion_matrix(): UPSIDES: More complete picture, more classification metrics (OPTIMAL)
# Model Evaluation Metric: Confusion Matrix - An NxN matrix, where N is the number of target classes. Rows are the actual classes, columns the predicted classes
metrics.confusion_matrix(y_test, y_pred)
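
The null accuracy mentioned in the evaluation step can be computed from the class frequencies alone. A minimal sketch with NumPy, assuming a made-up `y_test` vector (the values below are hypothetical):

```python
import numpy as np

# Hypothetical test targets: 7 samples of class 0 and 3 samples of class 1
y_test = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 0])

# Count how often each class occurs
classes, counts = np.unique(y_test, return_counts=True)

# The "dumbest model" always predicts the most frequent class...
most_frequent_class = classes[counts.argmax()]

# ...so its accuracy is the share of that class in the data
null_accuracy = counts.max() / counts.sum()

print(most_frequent_class)  # 0
print(null_accuracy)        # 0.7
```

A model whose test accuracy is not clearly above this value has learned little beyond the class imbalance.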


A 2x2 confusion matrix, where 0 is the positive class and 1 the negative class, displays the following information:<br>

<img src="https://www.mathworks.com/matlabcentral/mlc-downloads/downloads/submissions/60900/versions/12/screenshot.png" width="400" height="400">


In [None]:
# Compute the confusion matrix once. In scikit-learn, rows are the actual classes and columns the predicted classes.
# Class 0 is treated as the positive class and class 1 as the negative class.
cm = metrics.confusion_matrix(y_test, y_pred)

# True Positive (actual 0, predicted 0)
TP = cm[0, 0]

# True Negative (actual 1, predicted 1)
TN = cm[1, 1]

# False Positive (actual 1, predicted 0)
FP = cm[1, 0]

# False Negative (actual 0, predicted 1)
FN = cm[0, 1]

# Classification Accuracy from confusion matrix. SAME AS: metrics.accuracy_score(y_test, y_pred)
accuracy_score = (TP + TN) / float(TP + TN + FP + FN)

# Classification Error (or Misclassification Rate) from confusion matrix. SAME AS: 1 - metrics.accuracy_score(y_test, y_pred)
error_score = (FP + FN) / float(TP + TN + FP + FN)

# Sensitivity or True Positive Rate or Recall (i.e. when the actual value is positive, how often is the prediction correct?)
# SAME AS: metrics.recall_score(y_test, y_pred, pos_label=0)   (pos_label=0 because class 0 is the positive class here)
sensitivity = TP / float(TP + FN)

# Specificity (i.e. when the actual value is negative, how often is the prediction correct?) NO DEDICATED METRIC FUNCTION FOR THAT
specificity = TN / float(TN + FP)

# False Positive Rate = 1 - Specificity (i.e. when the actual value is negative, how often is the prediction incorrect?) NO DEDICATED METRIC FUNCTION FOR THAT
false_positive_rate = FP / float(TN + FP)

# Precision (i.e. when a positive value is predicted, how often is the prediction correct?) SAME AS: metrics.precision_score(y_test, y_pred, pos_label=0)
precision = TP / float(TP + FP)
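
As a sanity check, the hand-computed rates can be compared with scikit-learn's metric functions on a small example. This is a sketch on made-up labels, assuming class 0 is the positive class as stated above; `y_true` and `y_hat` are hypothetical names.

```python
import numpy as np
from sklearn import metrics

# Hypothetical actual and predicted labels (0 = positive class, 1 = negative class)
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])
y_hat = np.array([0, 0, 0, 1, 1, 1, 1, 1, 0, 0])

# Rows = actual class, columns = predicted class
cm = metrics.confusion_matrix(y_true, y_hat, labels=[0, 1])
print(cm)  # [[3 1]
           #  [2 4]]

TP, FN = cm[0, 0], cm[0, 1]   # actual positives: predicted positive / predicted negative
FP, TN = cm[1, 0], cm[1, 1]   # actual negatives: predicted positive / predicted negative

sensitivity = TP / (TP + FN)  # 3 / 4
precision = TP / (TP + FP)    # 3 / 5
specificity = TN / (TN + FP)  # 4 / 6

# The same values from scikit-learn, telling it that class 0 is the positive one
print(np.isclose(sensitivity, metrics.recall_score(y_true, y_hat, pos_label=0)))
print(np.isclose(precision, metrics.precision_score(y_true, y_hat, pos_label=0)))
# Specificity is the recall of the negative class (here class 1)
print(np.isclose(specificity, metrics.recall_score(y_true, y_hat, pos_label=1)))
```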

### - K-FOLD VALIDATION

A better way to evaluate the model is to split the data into training and test datasets many times, instead of just once. <br>
We split the dataset into K equal partitions (FOLDS), and each time we use one of them for testing and the union of the rest for training. Usually we pick K=10. In the end, every fold has been used once as the test set and K-1 times as part of the training set, and the average accuracy score over the K runs is the score of the model.
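
The fold mechanics can be sketched in a few lines of NumPy. The helper `k_fold_indices` is a hypothetical name used only for this illustration; it does not shuffle the data, and in practice scikit-learn's own cross-validation utilities do this work.

```python
import numpy as np


def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs for k consecutive folds, without shuffling."""
    indices = np.arange(n_samples)
    # Split the indices into k partitions of (almost) equal size
    folds = np.array_split(indices, k)
    for i in range(k):
        test_idx = folds[i]
        # The union of all the other folds is the training set
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, test_idx


splits = list(k_fold_indices(n_samples=10, k=5))
for train_idx, test_idx in splits:
    print("train:", train_idx, "test:", test_idx)
```

Each observation ends up in the test set exactly once and in the training set K-1 times.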


In [None]:
# This function splits the data X and y into folds, and runs the classifier clf with each fold as the test set and the rest as the training set.
# It returns a NumPy array with the score of each run
from sklearn.model_selection import cross_val_score   # the old sklearn.cross_validation module was removed in scikit-learn 0.20

scores = cross_val_score(clf, X, y,
                         cv=10,                    # Number of folds the dataset is split into (10-Fold Validation)
                         scoring='accuracy',       # We use classification accuracy
                         n_jobs=-1)                # Run the computations in parallel on all CPU cores, for speed

# The average of the values
scores.mean()

Using this, we can find which K-Neighbors parameters are the best option for our model, by checking the scores for all the different K's. <br>
BE CAREFUL: After finding the best parameters, we have to train the model with these best parameters.


In [None]:
# 1st Way (Manually)
k_range = range(1, 31)
k_scores = {}
for k in k_range:
    clf = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(clf, X, y, cv=10, scoring='accuracy', n_jobs=-1)
    k_scores[k] = scores.mean()                    # Mean cross-validated accuracy for this K
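
Once the scores are collected, the best K still has to be picked and the model retrained with it. A minimal sketch of that last step, assuming the iris dataset and a smaller K range and fold count than above to keep it quick; `best_k` and `best_clf` are names introduced here for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Assumed example data: the iris dataset shipped with scikit-learn
X, y = load_iris(return_X_y=True)

# Mean cross-validated accuracy for each candidate K
k_scores = {}
for k in range(1, 16):
    clf = KNeighborsClassifier(n_neighbors=k)
    k_scores[k] = cross_val_score(clf, X, y, cv=5, scoring='accuracy').mean()

# The K with the highest mean score (the smallest such K if there is a tie)
best_k = max(k_scores, key=k_scores.get)

# Retrain the final model on all the data with the best K
best_clf = KNeighborsClassifier(n_neighbors=best_k).fit(X, y)
print(best_k, k_scores[best_k])
```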

In [None]:
# 2nd Way (using GridSearchCV, which can tune several parameters at once) (COMPUTATIONALLY EXPENSIVE)
from sklearn.model_selection import GridSearchCV   # the old sklearn.grid_search module was removed in scikit-learn 0.20

k_range = list(range(1, 31))
weight_options = ['uniform', 'distance']

# Create a dictionary with all K's of K-Neighbors and all weight options we want GridSearchCV to examine
param_grid = dict(n_neighbors=k_range, weights=weight_options)

# Feed the function with the clf and the dictionary instead of the data. cv specifies the number of folds
grid = GridSearchCV(clf, param_grid, cv=10, scoring='accuracy', n_jobs=-1)

# Fit the model with the data (it runs the 10-fold cross-validation for every combination of parameters)
grid.fit(X, y)

# A dictionary with ALL the results: the parameters, and the mean and standard deviation of the scores, for each combination
# (cv_results_ replaced the older grid_scores_ attribute)
grid.cv_results_

# The parameters of the first combination (K=1, weights='uniform')
grid.cv_results_['params'][0]

# The scores from all the folds (i.e. all 10 cv's) for the first combination
[grid.cv_results_['split%d_test_score' % i][0] for i in range(10)]

# The mean of the above scores for the first combination
grid.cv_results_['mean_test_score'][0]

# The best mean values among all parameters of K-Neighbors and weight_options
grid.best_score_

# The parameters of K-Neighbors and weight_option that created this score
grid.best_params_

# All the details of the model of the best estimator (here K-Neighbors and weight_option)
grid.best_estimator_

In [None]:
# 3rd Way (using RandomizedSearchCV, which can also tune several parameters at once) (COMPUTATIONALLY CHEAPER)
from sklearn.model_selection import RandomizedSearchCV   # the old sklearn.grid_search module was removed in scikit-learn 0.20

k_range = list(range(1, 31))
weight_options = ['uniform', 'distance']
param_grid = dict(n_neighbors=k_range, weights=weight_options)

# Exactly the same idea.
# The difference: instead of running ALL possible combinations, it runs only n_iter of them, chosen at random.
rand = RandomizedSearchCV(clf, param_grid, cv=10, scoring='accuracy', n_iter=10)

# Fit the search to the data; the attributes below exist only after fitting
rand.fit(X, y)

# The best mean values among all the parameters of K-Neighbors that RandomizedSearchCV chose randomly.
rand.best_score_

# The parameters of K-Neighbors and weight_option that created this score
rand.best_params_

# All the details of the model of the best estimator (here K-Neighbors and weight_option)
rand.best_estimator_
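
RandomizedSearchCV also accepts distributions instead of fixed lists, which it samples from on each iteration. A short sketch under assumptions: the iris dataset, `scipy.stats.randint` for the number of neighbors, and a fixed `random_state` so the sampled combinations are reproducible. None of these choices come from the notes above.

```python
from scipy.stats import randint
from sklearn.datasets import load_iris
from sklearn.model_selection import RandomizedSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Assumed example data: the iris dataset shipped with scikit-learn
X, y = load_iris(return_X_y=True)

# randint(1, 31) samples integers from 1 to 30; the list is sampled uniformly
param_dist = {'n_neighbors': randint(1, 31),
              'weights': ['uniform', 'distance']}

rand = RandomizedSearchCV(KNeighborsClassifier(), param_dist,
                          n_iter=10,         # Try only 10 random combinations
                          cv=5,
                          scoring='accuracy',
                          random_state=1)    # Makes the random sampling reproducible
rand.fit(X, y)

print(rand.best_params_)
print(rand.best_score_)
```

With the default `refit=True`, `rand.best_estimator_` is already retrained on the whole dataset with the best parameters found.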

### - LOGISTIC REGRESSION

In [None]:
from sklearn.linear_model import LogisticRegression 

...          
clf = LogisticRegression()
...

# The intercept term of the logistic regression. (The trailing _ marks attributes estimated during fit)
# log-odds = INTERCEPT + ...
clf.intercept_

# The coefficients of the logistic regression (one slope per feature if there is more than one)
# log-odds = intercept + COEF X1 + COEF X2 + ...
clf.coef_

# Pairs each feature name with its coefficient, so it is easy to know what is what
# (feature_cols is the list of feature names; coef_[0] is the row of coefficients of a binary model)
list(zip(feature_cols, clf.coef_[0]))
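
To show how `intercept_` and `coef_` produce the predicted probabilities in the binary case, the sketch below rebuilds them by hand with the logistic (sigmoid) function. The data comes from `make_classification` with arbitrary settings, an assumption made only for this illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Assumed synthetic binary dataset with 4 features
X, y = make_classification(n_samples=200, n_features=4, random_state=0)

clf = LogisticRegression().fit(X, y)

# Linear part of the model: intercept + COEF X1 + COEF X2 + ...
log_odds = X @ clf.coef_[0] + clf.intercept_[0]

# The sigmoid turns the log-odds into the probability of class 1
manual_prob = 1.0 / (1.0 + np.exp(-log_odds))
sklearn_prob = clf.predict_proba(X)[:, 1]
print(np.allclose(manual_prob, sklearn_prob))

# Log-odds above 0 correspond to a probability above 50%, i.e. a prediction of class 1
manual_pred = (log_odds > 0).astype(int)
print((manual_pred == clf.predict(X)).mean())
```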

### - SUPPORT VECTOR MACHINE

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1, stratify=y)

from sklearn import svm

clf = svm.SVC(C=1.0,                            # Penalty for violating the margin (the weight on the slack variables ξ). Small C means the errors matter less.
              kernel='rbf',                     # The kernel to be used: rbf = radial basis function, poly = polynomial, linear = linear, sigmoid = sigmoid
              degree=1,                         # The degree of the polynomial when kernel='poly'. Ignored by all other kernels
              probability=True,                 # Required for predict_proba below (makes training slower)
              decision_function_shape='ovr')    # How to deal with more than two classes. ovr = one versus rest, ovo = one versus one

clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)
y_prob = clf.predict_proba(X_test)

# Indices of support vectors
clf.support_

# The support vectors
clf.support_vectors_ 

# The number of support vectors for each class
clf.n_support_

# The b from the support vector equation wx + b = 0
clf.intercept_

# The w from the support vector equation wx + b = 0 (only available when kernel='linear')
clf.coef_
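
For a linear kernel, the attributes above are enough to rebuild the decision function wx + b by hand. A minimal sketch, assuming a synthetic two-feature binary dataset from `make_classification` (all settings are arbitrary and chosen only for illustration):

```python
import numpy as np
from sklearn import svm
from sklearn.datasets import make_classification

# Assumed synthetic binary dataset with 2 features
X, y = make_classification(n_samples=100, n_features=2, n_informative=2,
                           n_redundant=0, random_state=0)

# coef_ is only defined for the linear kernel
clf = svm.SVC(kernel='linear', C=1.0).fit(X, y)

w = clf.coef_[0]        # The w of wx + b = 0
b = clf.intercept_[0]   # The b of wx + b = 0

# Signed score of each point relative to the separating line
manual_decision = X @ w + b
print(np.allclose(manual_decision, clf.decision_function(X)))

# n_support_ holds one count per class; together they cover all support vectors
print(clf.n_support_, len(clf.support_), clf.support_vectors_.shape)
```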

### - LINEAR REGRESSION

In [None]:
from sklearn.linear_model import LinearRegression 

X_train, X_test, y_train, y_test = train_test_split(X,y, 
                                                    test_size=0.4, 
                                                    random_state=1)           

clf = LinearRegression(n_jobs=-1) 
clf.fit(X_train, y_train)
clf.predict(new_data)

# The intercept term of the linear regression (where the fitted line crosses the y-axis)
# y = INTERCEPT + ...
clf.intercept_

# The coefficients of the linear regression (one slope per feature if there is more than one)
# y = intercept + COEF X1 + COEF X2 + ...
clf.coef_

# Pairs each feature name with its slope, so it is easy to know what is what
# (feature_cols is the list of feature names)
list(zip(feature_cols, clf.coef_))

import numpy as np
from sklearn import metrics

# Model Evaluation Metric: Mean Absolute Error
# In regression analysis, a simple equality comparison does not work. We calculate the mean difference between prediction and real value instead.
# Here it is the sum of |y_real - y_predict| divided by the sample size
metrics.mean_absolute_error(y_test, clf.predict(X_test))

# Model Evaluation Metric: Mean Squared Error - This is simply the sum of the squared differences (y_real - y_predict)^2 divided by the sample size
metrics.mean_squared_error(y_test, clf.predict(X_test))

# Model Evaluation Metric: Root Mean Squared Error - We can also take the sqrt of the MSE (MOST POPULAR!) Same units as the target
np.sqrt(metrics.mean_squared_error(y_test, clf.predict(X_test)))
                                             
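# K-Fold Validation
from sklearn.model_selection import cross_val_score   # the old sklearn.cross_validation module was removed in scikit-learn 0.20

scores = cross_val_score(clf, X, y, cv=10,
                         scoring='neg_mean_squared_error')    # We use the (negated) mean squared error for linear regression

# The scorer returns the NEGATIVE of the loss (so that higher is always better); flip the sign to get the MSE
scores = -scores
scores.mean()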
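
The three regression metrics above are simple enough to verify by hand. A worked sketch in NumPy on four made-up values (both arrays are hypothetical):

```python
import numpy as np

# Hypothetical real and predicted target values
y_real = np.array([3.0, -0.5, 2.0, 7.0])
y_predict = np.array([2.5, 0.0, 2.0, 8.0])

errors = y_real - y_predict          # [ 0.5, -0.5,  0.0, -1.0]

# Mean Absolute Error: (0.5 + 0.5 + 0 + 1) / 4
mae = np.abs(errors).mean()          # 0.5

# Mean Squared Error: (0.25 + 0.25 + 0 + 1) / 4
mse = (errors ** 2).mean()           # 0.375

# Root Mean Squared Error: same units as the target
rmse = np.sqrt(mse)                  # about 0.612

print(mae, mse, rmse)
```

Squaring makes the MSE and RMSE penalize the single large error (1.0) more heavily than the MAE does.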