# Scikit-Learn Algorithms in Three Phases

Each of Scikit-Learn's supervised learners is represented as a Python class.
For example, there is a LogisticRegression class, a DecisionTreeClassifier
class, a RandomForestRegressor class, and so on. For simplicity, each of
these classes uses the same application programming interface ("API").
That is, to use these classes, you will invoke the same methods regardless
of whether you are building a logistic regression or a random forest.  
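
As a rough illustration of that shared API, the sketch below runs three different algorithms through the same two method calls. The tiny dataset and the specific option values (`random_state=0`, `n_estimators=10`) are made up for the example.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

# Tiny made-up dataset, for illustration only
X = [[1, 2], [3, 3], [4, 5], [6, 5]]
y = [0, 0, 1, 1]

# Three different algorithms, one identical sequence of method calls
models = [
    LogisticRegression(),
    DecisionTreeClassifier(random_state=0),
    RandomForestClassifier(n_estimators=10, random_state=0),
]

for model in models:
    model.fit(X, y)                 # same training call for every model
    predictions = model.predict(X)  # same prediction call for every model
    print(type(model).__name__, predictions)
```

Only the class name changes from one model to the next; the calls to `fit()` and `predict()` stay the same.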

Using these classes typically consists of **three phases**.  

1. Initialization phase
2. Training phase
3. Prediction phase


## INITIALIZATION PHASE
In the initialization phase, we create a new learner and specify any
options we want. Let's use the example of a logistic regression.
Remember, `LogisticRegression` is a general class (like a "blueprint"), not a
specific object, just as `list` is a class rather than any particular list.  

To create a specific list you must type:  
   `mylist = list((1, 2, 3))`  
Similarly, to create a specific logistic regressor type:  
   `lr = LogisticRegression()`  
   
When we're creating our regression object, we should specify any options we want.
For example, do we want the model to use regularization? If so, how much?

Let's create a `LogisticRegression` object and specify some options.

In [9]:
#Import it from the relevant module
from sklearn.linear_model import LogisticRegression

#Standard logistic regression
lr = LogisticRegression()


#Logistic regression with C = 0.5 (C is the inverse of the regularization
#strength, so smaller values mean a stronger penalty)
lr = LogisticRegression(C = 0.5)


#another option is "class_weight". If you have two classes A & B,
#and 90% of the observations are of class A, sometimes it can be
#helpful for your model to give more weight to class B to give your
#model more balance. To do this, set the parameter class_weight = "balanced"
lr = LogisticRegression(C = 0.5, class_weight = "balanced")

Logistic regression is relatively simple, so it doesn't have many options.
Other algorithms, like random forest, are more complicated and have far
more options to choose from. For instance, you can set the number of trees.
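
To give a feel for those extra options, here is a minimal sketch that creates a random forest classifier. The values chosen (200 trees, a maximum depth of 5) are arbitrary and only for illustration.

```python
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(
    n_estimators=200,   # number of trees in the forest
    max_depth=5,        # maximum depth of each tree
    random_state=0,     # fixes the randomness so results are repeatable
)

# get_params() returns a dictionary of every option and its current value
params = rf.get_params()
print(len(params), "options available")
print(params["n_estimators"], params["max_depth"])
```

Any option you don't set keeps its default value, so you only need to name the ones you care about.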


## TRAINING PHASE

Assume you have some predictor data (the features) stored in a variable X and
the target data stored in a variable called y. Imagine you split
the data into a training set and a testing set called
X_train, y_train and
X_test, y_test.
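
One common way to make such a split is scikit-learn's `train_test_split` helper. The sketch below uses made-up data and an arbitrary 25% test size; the variable names simply mirror the ones used in this section.

```python
from sklearn.model_selection import train_test_split

# Made-up data: 8 observations with 2 features each
X = [[1, 2], [3, 3], [4, 5], [6, 5], [2, 1], [5, 6], [1, 1], [6, 6]]
y = [0, 0, 1, 1, 0, 1, 0, 1]

# Hold out 25% of the rows for testing; random_state makes the split repeatable
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

print(len(X_train), len(X_test))  # 6 training rows, 2 testing rows
```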

To train any scikit-learn model, use the fit() method like so:

In [10]:
#Let's make some simple fake X and y data
X_train = [[1,2], [3,3], [4,5], [6,5]]
y_train = [0, 0, 1, 1]

#Train the model on the training data
lr.fit(X_train, y_train)

LogisticRegression(C=0.5, class_weight='balanced', dual=False,
          fit_intercept=True, intercept_scaling=1, max_iter=100,
          multi_class='ovr', n_jobs=1, penalty='l2', random_state=None,
          solver='liblinear', tol=0.0001, verbose=0, warm_start=False)

What does the "fit" method return? Nothing new that you need to capture.  
Instead, it changes the lr object in place. In this case, it stores the model's coefficients
and intercept in attributes:  
`lr.coef_`  
`lr.intercept_`  

These attributes will be used later on when we make predictions on our test data.  
Let's take a look at those attributes now.


In [4]:
print(lr.coef_)
print(lr.intercept_)

[[ 0.09796461  0.0486127 ]]
[-0.04638885]


By convention, scikit-learn *names any learned attributes with a trailing
underscore*.  
Any attributes that are set during the initialization phase
do not have a trailing underscore.  
That's why `lr.class_weight` does not have an underscore, but `lr.coef_` does.  
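
A small sketch of this convention, using a fresh model (named `model` here so it doesn't overwrite `lr`) and the same fake training data as above:

```python
from sklearn.linear_model import LogisticRegression

X_train = [[1, 2], [3, 3], [4, 5], [6, 5]]
y_train = [0, 0, 1, 1]

model = LogisticRegression(C=0.5)

# Set at initialization: available immediately, no trailing underscore
print(model.C)

# Learned attributes do not exist until fit() has been called
exists_before = hasattr(model, "coef_")
print(exists_before)  # False

model.fit(X_train, y_train)
exists_after = hasattr(model, "coef_")
print(exists_after)   # True

# One row of coefficients with one column per feature
print(model.coef_.shape)
```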
  
Note: technically speaking, "fit()" returns the lr object itself, which is why the
notebook displays the model right after the fit cell. So if you
wanted, you could type "lr = lr.fit(X_train, y_train)" but this is not necessary.  
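
Because "fit()" hands back the same object, the calls can also be chained. This is a minimal sketch using the fake training data from above; the names `model`, `returned` and `chained` are just for the example.

```python
from sklearn.linear_model import LogisticRegression

X_train = [[1, 2], [3, 3], [4, 5], [6, 5]]
y_train = [0, 0, 1, 1]

model = LogisticRegression(C=0.5)
returned = model.fit(X_train, y_train)

# fit() returns the very same object it was called on
print(returned is model)

# So initialization, training and prediction can be written as one chain
chained = LogisticRegression(C=0.5).fit(X_train, y_train).predict(X_train)
print(chained)
```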
  
That's basically everything when it comes to the fit method. There aren't a
lot of options to specify. Most of the options have been specified
during the initialization phase.



## PREDICTION PHASE

In the prediction phase, we run our model on the unseen predictor data
and compare its output to the unseen target data.

To make a prediction with any scikit-learn model, use the predict() method.
It will return a numpy array of the model's best estimates.  

Let's make some fake test data and make a prediction to see how we did.

In [11]:
#Fake Data
X_test = [[1,1], [3,2.5], [4,4.5], [5.8,4.5]]
y_test = [0, 0, 1, 1]


#Make a prediction
prediction = lr.predict(X_test)

#print the results so we can see
print(prediction)
          
#print the "true" values so we can compare
print(y_test)

[0 1 1 1]
[0, 0, 1, 1]


If X_test contains 5 observations, then "prediction" will be an array with
5 elements. For example it might be [0,0,1,1,0].  

Typically you would then compare this to y_test to see how well the model did.
For example, y_test might be [0,0,1,1,1]. In this case, the model got 4 out of 5
predictions correct.  
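
The comparison described above can be done by hand or with scikit-learn's `accuracy_score` function. This sketch uses the hypothetical five-element arrays from the example; the `example_` variable names are chosen so they don't overwrite the notebook's own variables.

```python
from sklearn.metrics import accuracy_score

# The hypothetical values from the example above
example_prediction = [0, 0, 1, 1, 0]
example_y_test = [0, 0, 1, 1, 1]

# Count the matching positions by hand
correct = sum(p == t for p, t in zip(example_prediction, example_y_test))
print(correct, "out of", len(example_y_test))  # 4 out of 5

# accuracy_score computes the same fraction: 4 / 5
accuracy = accuracy_score(example_y_test, example_prediction)
print(accuracy)  # 0.8
```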

Note: if this were a regression model (e.g. LinearRegression), the prediction would be an
array of floats, e.g. [0.1, 1.2, 1.5, 2.0]  
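
As a quick sketch of the regression case, the example below fits a `LinearRegression` on made-up data that follows y = 2x exactly, so the predictions are easy to check by eye.

```python
from sklearn.linear_model import LinearRegression

# Made-up data that follows y = 2x exactly
X = [[1], [2], [3], [4]]
y = [2.0, 4.0, 6.0, 8.0]

reg = LinearRegression()
reg.fit(X, y)

# Same predict() call as for a classifier, but the output is floats
pred = reg.predict([[5], [6]])
print(pred)        # approximately 10 and 12
print(pred.dtype)  # float64
```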

### Getting Probability Scores

Under the hood, classifiers almost always produce a number between 0 and 1
(in the case of binary classification anyway) that represents the probability
that an observation belongs to class 1. The classifier then labels any observation
with a value below 0.5 as "0" and any observation with a value above 0.5 as "1".  

Sometimes it's helpful to have this value rather than a simple 0 or 1 class.  
In this case use the `predict_proba()` method.

In [12]:
#Get probabilities of belonging to class 0 or 1
probs = lr.predict_proba(X_test)

print(probs)

[[ 0.50725906  0.49274094]
 [ 0.39186419  0.60813581]
 [ 0.3327938   0.6672062 ]
 [ 0.25092859  0.74907141]]


"probs" is a table containing the probability of each class for each observation.
Each class is represented by a column and each observation is represented by a row.
For example, row 0 might have the columns [0.25, 0.75]. This means row 0 has a 25%
chance of belonging to class 0 and a 75% chance of belonging to class 1.  
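
To connect the probabilities back to the predicted classes, the sketch below checks two things on the fake data from this section: that the column order follows the model's `classes_` attribute, and that thresholding the class-1 column at 0.5 gives the same labels as "predict()". The names `model` and `manual` are just for the example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

X_train = [[1, 2], [3, 3], [4, 5], [6, 5]]
y_train = [0, 0, 1, 1]
X_test = [[1, 1], [3, 2.5], [4, 4.5], [5.8, 4.5]]

model = LogisticRegression(C=0.5).fit(X_train, y_train)
probs = model.predict_proba(X_test)

# The columns of probs are in the same order as model.classes_
print(model.classes_)

# Thresholding the class-1 column at 0.5 reproduces predict()
manual = (probs[:, 1] > 0.5).astype(int)
print(np.array_equal(manual, model.predict(X_test)))
```

Working with the probabilities directly also lets you pick a cutoff other than 0.5 if your problem calls for one.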

If there were more than two classes (say, classes 0, 1, 2 and 3) then there would be
additional columns in probs. Each row will sum to 1. For example it might be:  
[[0.25, 0.25, 0.3, 0.2],  
[0.1, 0.85, 0.03, 0.02],  
....  
[0.2, 0.4, 0.3, 0.1]]  
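
Here is a minimal sketch of the multi-class case, using made-up data with four classes. The `max_iter=1000` setting is only there to give the solver plenty of room to converge on this toy data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Made-up data with four classes (0, 1, 2 and 3), two observations each
X = [[0, 0], [0, 1], [5, 5], [5, 6], [10, 0], [10, 1], [0, 10], [1, 10]]
y = [0, 0, 1, 1, 2, 2, 3, 3]

model = LogisticRegression(max_iter=1000).fit(X, y)
probs = model.predict_proba(X)

# One row per observation, one column per class
print(probs.shape)  # (8, 4)

# Every row sums to 1
print(np.allclose(probs.sum(axis=1), 1.0))  # True
```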


### TL;DR 

Here's a quick recap:

In [15]:
#Create a specific classifier, specify options
lr = LogisticRegression(C = 0.5)

#Train on the train data
lr.fit(X_train, y_train)

#Predict on the test data
prediction = lr.predict(X_test)

#To get predicted probabilities rather than predicted classes do this:
probs = lr.predict_proba(X_test)

print("Prediction: ", prediction)
print("\nProbabilities:\n", probs)

Prediction:  [0 1 1 1]

Probabilities:
[[ 0.50725906  0.49274094]
 [ 0.39186419  0.60813581]
 [ 0.3327938   0.6672062 ]
 [ 0.25092859  0.74907141]]
