## 5.2.1 Introduction to logistic regression in scikit-learn
### Training an algorithm and making predictions

When building machine learning models, you should already be familiar with the idea of training an algorithm on labeled data and then using the trained model to make predictions on new data.

Scikit-learn encapsulates these core functions in the .fit method, which trains a model, and the .predict method, which uses a trained model to make predictions. Because the syntax is consistent, you can call .fit and .predict on any scikit-learn supervised learning model, from linear regression to classification trees.
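As a minimal sketch of this consistency, the snippet below trains two different kinds of model through the same two calls. The synthetic data from `make_classification` and all variable names are assumptions made for illustration; they are not part of the case-study dataset.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# Synthetic data used only for illustration (not the credit card dataset)
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

predictions = {}
for model in (LogisticRegression(), DecisionTreeClassifier(random_state=0)):
    model.fit(X_demo[:150], y_demo[:150])        # train on the first 150 labeled rows
    y_hat = model.predict(X_demo[150:])          # predict the remaining 50 rows
    predictions[type(model).__name__] = y_hat

for name, y_hat in predictions.items():
    print(name, y_hat[:10])
```

The loop body is identical for both models; only the class that was instantiated differs.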

Let us see how to choose a model, train it, use it to make predictions, and evaluate it.

### Step 1: Choosing a model

The first step is to choose a model. We will start with the logistic regression model, and instantiate it from the class provided by the scikit-learn library. 

- In Python, classes are templates for creating objects. The objects provide methods, such as the .fit method that is used to train a model. When you instantiate a class from scikit-learn, you take the blueprint of the model that scikit-learn makes available to you and create a useful object out of it.

- You can train this object on your data and then save it to disk for later use. 
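One possible way to save a model object to disk is sketched below with `joblib`, which is installed alongside scikit-learn. The file name and temporary directory are hypothetical, chosen only for this sketch, and the model is left untrained to keep the example short; a trained model would be saved in the same way.

```python
import os
import tempfile

import joblib
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()          # an object created from the class blueprint

# Hypothetical file location, chosen only for this sketch
path = os.path.join(tempfile.mkdtemp(), "lr_model.joblib")
joblib.dump(model, path)              # save the object to disk
restored = joblib.load(path)          # load it back later

print(type(restored).__name__)
```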

The following code imports the class and creates a model object from it:

In [2]:
# The first step is to import the class:
from sklearn.linear_model import LogisticRegression

# The code to instantiate the class into an object is as follows:
lr_Object = LogisticRegression()

# The object is now a variable in the workspace. We can examine it using the following code:
print(lr_Object)
# This should give the following output: LogisticRegression()

### Step 2: Model default options and hyperparameters

Note that creating the model object lr_Object required essentially no knowledge of what logistic regression is or how it works.

In addition, although we did not select any specific options when we created the logistic regression model object, we are in fact using several default options that determine how the model is formulated and how it will be trained.

This points to a disadvantage of an easy-to-use package such as scikit-learn: it can hide the default choices from you. For that reason, any time you use a machine learning model that has been prepared for you, as scikit-learn models have been, your first job is to understand all the options that are available.

- A best practice in such cases is to explicitly provide every keyword parameter to the model when you create the object. Even if you are just selecting all the default options, this will help increase your awareness of the choices that are being made.
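One way to see which options a model exposes, and what their defaults are, is the `get_params` method that scikit-learn estimators provide. This is a small sketch; the exact names and values printed depend on the installed scikit-learn version.

```python
from sklearn.linear_model import LogisticRegression

# Ask a freshly created model for all of its options and their default values
default_params = LogisticRegression().get_params()

for name, value in sorted(default_params.items()):
    print(f"{name} = {value!r}")
```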

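We will explain these options and parameters later. For now, here is the code for instantiating a logistic regression model with all the default options spelled out. The exact parameter list depends on the installed scikit-learn version; for example, `multi_class` is deprecated in recent releases.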

In [3]:
new_lr_Object = LogisticRegression(penalty='l2', dual=False,
                                   tol=0.0001, C=1.0,
                                   fit_intercept=True,
                                   intercept_scaling=1,
                                   class_weight=None,
                                   random_state=None,
                                   solver='lbfgs',
                                   max_iter=100,
                                   multi_class='auto',
                                   verbose=0, warm_start=False,
                                   n_jobs=None, l1_ratio=None)

Although the newly created object new_lr_Object is configured identically to lr_Object, being explicit like this is especially helpful when you are starting out and learning about different kinds of models. Once you are more familiar with the model and its syntax, you may simply instantiate it with the default options and make changes later as necessary.
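The claim that both objects carry the same configuration can be checked by comparing their parameters. The sketch below spells out only a handful of defaults so that it stays independent of the scikit-learn version; the variable names are illustrative.

```python
from sklearn.linear_model import LogisticRegression

implicit = LogisticRegression()
# Only a few defaults are spelled out here, with the same values as listed above
explicit = LogisticRegression(C=1.0, tol=0.0001, fit_intercept=True, max_iter=100)

# Two separate objects, but the same set of options
same_configuration = implicit.get_params() == explicit.get_params()
print(same_configuration)
```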

Here, we show how to make changes to the default options of the model object. The following code sets three options and displays the current state of the model object:

In [5]:
new_lr_Object.C = 0.2
new_lr_Object.solver = 'liblinear'
new_lr_Object.max_iter = 500
print(new_lr_Object)

LogisticRegression(C=0.2, max_iter=500, solver='liblinear')


In the above code, we've taken what is called a hyperparameter of the model, C, and updated it from its default value of 1 to 0.2. We've also specified a solver and set max_iter to 500.

For now, it is enough to understand that hyperparameters are options that you supply to the model before fitting it to the data. These options specify the way in which the model will be trained.

Note that the output displays only the options whose values differ from the defaults.
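An alternative to assigning attributes one by one is the `set_params` method, sketched below. It has the side benefit of checking option names, so a misspelled name raises an error instead of silently creating a new attribute. The misspelled `max_itr` is deliberate and purely illustrative.

```python
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
model.set_params(C=0.2, solver='liblinear', max_iter=500)
print(model.C, model.solver, model.max_iter)

rejected = False
try:
    model.set_params(max_itr=500)     # deliberately misspelled option name
except ValueError as err:
    rejected = True
    print("Rejected:", err)
```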

### Step 3: Model fitting and training
Once the data is prepared and the model is specified, the next step is to fit the model.

To illustrate the core functionality of the model, we will fit this logistic regression model object to some data. Supervised learning algorithms rely on labeled data. This means we need both the features, customarily contained in a variable called `X`, and the corresponding responses of the target variable, in a variable called `y`. 

- `X` and `y` are the conventional names for the features and the response/target variable.

We will take the first 700 samples of the feature data, and of the response variable, from our dataset for illustration purposes:

In [23]:
import pandas as pd
df = pd.read_csv('../datasets/clean_creditcard.csv')

X = df[0:700].values

In [24]:
# The corresponding first 700 values of the response variable Class_Category can be obtained as follows:
y = df['Class_Category'][0:700].values

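Here, `df[0:700]` selects the first 700 rows of the DataFrame df across all of its columns, and `df['Class_Category'][0:700]` selects the first 700 elements of the response Series (that is, column). In both cases, the .values attribute returns a NumPy array. Be aware that, as written, X still contains the Class_Category column itself; in practice, the response column should be excluded from the features, for example with `df.drop(columns='Class_Category')`, so that the model is not given the answer as an input.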

- Note that while we've extracted the data into NumPy arrays to show how this can be done, it's also possible to use pandas Series as direct input to scikit-learn.
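The sketch below illustrates passing pandas objects directly to scikit-learn, with the response column kept out of the features. The small table and its `feature_a` and `feature_b` columns are made up for illustration and only stand in for the real dataset.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Small made-up table standing in for the real dataset
demo_df = pd.DataFrame({
    'feature_a': [0.1, 0.4, 0.35, 0.8, 0.9, 0.7],
    'feature_b': [1.0, 0.9, 0.8, 0.3, 0.2, 0.1],
    'Class_Category': [0, 0, 0, 1, 1, 1],
})

X_demo = demo_df.drop(columns='Class_Category')   # features only, still a DataFrame
y_demo = demo_df['Class_Category']                # response, a Series

model = LogisticRegression().fit(X_demo, y_demo)  # pandas objects are accepted directly
print(X_demo.shape, y_demo.shape)
```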

Let's now use this data to fit the logistic regression model object. This is accomplished with just one line:

In [25]:
new_lr_Object.fit(X, y)

Admittedly, we are ignoring for now all the important options and what they mean. In terms of coding and implementation, however, fitting a model is very easy.

The new_lr_Object model object is now a trained model. We say that this training has happened in place since no new object was created; the existing object, new_lr_Object, has been modified. 
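To make the idea of in-place training concrete, the following sketch checks that .fit returns the very same object and that the learned coefficients appear on it only after fitting. The synthetic data and variable names are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data used only for illustration
X_demo, y_demo = make_classification(n_samples=100, n_features=4, random_state=0)

model = LogisticRegression()
had_coef_before = hasattr(model, 'coef_')   # nothing has been learned yet
returned = model.fit(X_demo, y_demo)

print(returned is model)                    # fit returns the object it modified
print(had_coef_before, hasattr(model, 'coef_'))
print(model.coef_.shape)                    # one weight per feature for a binary problem
```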

### Step 4: Model employment
We can now use our trained model to make predictions from the features of new samples that the model has never "seen" before. Let's try the 15 rows that follow the first 700 rows of the features.

We can select and view these features using a new variable, `new_X`:

In [26]:
new_X = df[700:715].values

# Make predictions with the trained model; y_pred stores the predicted values.
y_pred = new_lr_Object.predict(new_X)
print(y_pred)

[1 1 1 1 1 1 1 1 0 0 1 0 0 0 1]


In [27]:
# We can also view the response values of the target variable corresponding to these predictions since this data is labeled:
y_test = df['Class_Category'][700:715].values
y_test

array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], dtype=int64)

Here, we've illustrated several things. After getting our unseen values, we've called the .predict method on the trained model. Notice that the only argument to this method is unseen values of features, which we've called `new_X`. 
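The sketch below repeats this pattern on synthetic data and also shows what happens if .predict is called too early: scikit-learn raises `NotFittedError` for a model that has not been trained. The data and variable names are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.exceptions import NotFittedError
from sklearn.linear_model import LogisticRegression

# Synthetic data used only for illustration
X_demo, y_demo = make_classification(n_samples=115, n_features=4, random_state=0)
X_train, y_train = X_demo[:100], y_demo[:100]
X_new = X_demo[100:115]                     # 15 rows not used for training

model = LogisticRegression()
try:
    model.predict(X_new)                    # the model has not been trained yet
except NotFittedError:
    print("predict requires a fitted model")

model.fit(X_train, y_train)
y_hat = model.predict(X_new)                # only the features are passed in
print(y_hat.shape)
```

The result contains one predicted label per row of the input.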

### Step 5: Model evaluation
It is always important in machine learning modeling to evaluate the trained model to see if further refinement and training are required.

From the previous output, we observe:

- Predicted values y_pred: [1 1 1 1 1 1 1 1 0 0 1 0 0 0 1]
- Actual values y_test: [1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]

The model predicted 10 of the 15 values correctly, so the proportion of correct predictions is `10/15 ≈ 0.667`. In other words, the model accuracy is about 66.7%, which is not especially bad. This accuracy may well change if you choose other rows of unseen data to predict.
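The arithmetic can be checked with the two arrays listed above. The sketch counts the matches by hand and then uses scikit-learn's `accuracy_score`; the arrays are typed in directly so that the snippet runs on its own.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Values copied from the outputs shown earlier
y_pred = np.array([1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 1])
y_test = np.array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1])

n_correct = int((y_pred == y_test).sum())   # number of matching positions
accuracy = accuracy_score(y_test, y_pred)   # fraction of correct predictions

print(n_correct, len(y_test))   # 10 15
print(round(accuracy, 3))       # 0.667
```

Because every actual value in this slice is 1, a rule that always predicts 1 would score 15/15 on these rows, which is one reason an accuracy measured on so few samples should be read with care.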

While this is just an example to get you familiar with how scikit-learn works, it's worth considering what a "good" prediction might look like for this problem. We will get into the details of assessing model predictive capabilities shortly. 