<font color=red>This is a draft version and the notebook is due to be changed and finalized soon.</font>

# Logistic Regression

Hello again! In this Jupyter notebook, you are going to implement a logistic regression classifier using scikit-learn and make predictions with it. We will also see a technique that is useful for visualizing the data.

## Before you start

- In order for the notebooks to function as intended, modify only between lines marked "### begin your code here (__ lines)." and "### end your code here.". 

- The line count is a suggestion of how many lines of code you need to accomplish what is asked.

- You should execute the cells (the boxes that a notebook is composed of) in order.

- You can execute a cell by pressing Shift and Enter (or Return) simultaneously.

- You should have completed the previous Jupyter notebooks before attempting this one as the concepts covered there are not repeated, for the sake of brevity.

## Loading the appropriate packages

Nothing new here. We will import the logistic regression class along with some helpers from scikit-learn.

In [None]:
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import LabelEncoder
import pandas as pd
import plotly.express as px
import plotly.graph_objs as go

Let's turn off the scientific notation for floating point numbers.

In [None]:
np.set_printoptions(suppress=True)

## Loading and examining the data

We will load our data from a CSV file and put it in a pandas object of the `DataFrame` class.

This dataset is the breast cancer Wisconsin (diagnostic) dataset, which contains 30 different features computed from images of a fine needle aspirate (FNA) of breast masses for 569 patients, with each example labeled as being a _benign_ or _malignant_ mass.

* This was taken and modified from the Machine Learning dataset repository of the School of Information and Computer Science of the University of California, Irvine (UCI):
 
> _Dua, D. and Graff, C. (2019). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science._

In [None]:
df_30 = pd.read_csv('data_logistic_regression.csv')

Let's take a look at the data:

In [None]:
df_30

For this example to be educational, we need to be able to visualize our data, so our data has to be 2-dimensional. However, our data here is 30-dimensional. Let us use a trick (one that is useful for many things, including visualization) to get 2-dimensional data out of this dataset.

Remember we talked about _unsupervised learning_ in course 1. We said that _representation learning_, the set of methods used to create representations of the data (which hopefully help us do machine learning more efficiently), is a subclass of unsupervised learning methods. Specifically, we said that _dimensionality reduction_ methods are a set of representation learning algorithms aimed at, as the name suggests, reducing the dimensionality of our data. We are going to use a very popular dimensionality reduction technique called _Principal Component Analysis_ (_PCA_) to reduce the dimensionality of our feature space down to 2, so that we can visualize our data in 2D plots, and in 3D plots with the label as the third axis.

Note that we can not only expand our feature space by adding features (for example, with nonlinear feature expansions), but also transform features to get new ones, and that is exactly what we are doing with PCA. We are taking all of the features and constructing two new features that _a._ are linear combinations of our original features; and _b._ are most informative in spreading out the data. In other words, among all the features we can construct by linearly combining our original features, PCA picks the two along which the data points are most spread out and varied.
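
To make the idea of "linear combinations that spread the data out the most" concrete, here is a minimal sketch that computes two principal components by hand with NumPy's singular value decomposition. The toy data and all variable names (`X_toy`, `components`, `X_2d`) are assumptions made up for this illustration; the notebook itself uses scikit-learn's `PCA` class.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: 200 points with 3 features, two of them strongly correlated.
latent = rng.normal(size=(200, 1))
X_toy = np.hstack([
    3.0 * latent + 0.1 * rng.normal(size=(200, 1)),
    -2.0 * latent + 0.1 * rng.normal(size=(200, 1)),
    0.5 * rng.normal(size=(200, 1)),
])

# PCA works on centered data.
X_centered = X_toy - X_toy.mean(axis=0)

# Rows of Vt are unit-length directions, sorted by decreasing variance.
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
components = Vt[:2]                 # shape (2, 3)

# Each new feature is a linear combination of the original features.
X_2d = X_centered @ components.T    # shape (200, 2)

new_variances = X_2d.var(axis=0)
original_variances = X_centered.var(axis=0)
print(new_variances, original_variances)
```

In this sketch, the first new feature has at least as much variance as any single original feature, which is what "most spread out" means here.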

To do that, we first need to extract our data from the data frame as NumPy arrays:

In [None]:
X_30 = df_30.drop('type', axis=1).to_numpy()
y_text = df_30['type'].to_numpy()

As a sanity check, let's check `X_30`:

In [None]:
X_30

...and the size:

In [None]:
X_30.shape

Let's do the same thing for `y_text`:

In [None]:
y_text

...and for shape of `y_text`:

In [None]:
y_text.shape

### Reducing dimensionality

In [None]:
pca = PCA(n_components=2)
pca.fit(X_30)
X = pca.transform(X_30)

Notice how `fit` finds the proper transformation from `X_30`, and how we ask for output with 2 features by setting `n_components=2`. Also notice that `fit` does not receive the labels: PCA is unsupervised learning, so it does not use them!
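
If you are curious how much of the spread in the data survives the reduction, a fitted `PCA` object exposes `explained_variance_ratio_`. The following is a small sketch on made-up data; `X_demo` and `pca_demo` are hypothetical names standing in for `X_30` and `pca`, and the numbers it prints say nothing about the breast cancer dataset.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Hypothetical stand-in for X_30: 100 examples with 5 features of different scales.
X_demo = rng.normal(size=(100, 5)) * np.array([5.0, 3.0, 1.0, 0.5, 0.1])

pca_demo = PCA(n_components=2)
# fit_transform(X) combines fit(X) and transform(X) in one call; no labels are passed.
X_demo_2d = pca_demo.fit_transform(X_demo)

# Fraction of the total variance kept by each of the two new features.
kept = pca_demo.explained_variance_ratio_
print(X_demo_2d.shape, kept, kept.sum())
```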

Let's check this new `X`:

In [None]:
X

...and its shape:

In [None]:
X.shape

Now we can generate a data frame from the 2-dimensional data `X` that we generated, together with our labels:

In [None]:
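# Build the frame from the numeric features first, so they keep their numeric dtype,
# then add the text labels as a separate column.
df = pd.DataFrame(data=X, columns=['Feature 1', 'Feature 2'])
df['Label'] = y_text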

Let's take a look at our new 2-dimensional data and its labels as a table:

In [None]:
df

Let's also do a scatter plot of our data:

In [None]:
fig = px.scatter(df, x='Feature 1', y='Feature 2', color='Label')
fig.show()

We can also create $\{-1, +1\}$ labels for our data from `y_text` and assign it to (vector) variable `y`. We use `LabelEncoder` from scikit-learn again to transform labels into -1s or +1s:

In [None]:
y = (2 * LabelEncoder().fit_transform(y_text)) - 1
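
It is worth knowing which class ends up as $+1$. `LabelEncoder` sorts the class names and assigns them the codes 0, 1, ..., so the expression above maps the first class in sorted order to $-1$ and the second to $+1$. Here is a minimal sketch; the label strings are hypothetical and may not match the actual values in `y_text`.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Hypothetical label strings; the actual values in `y_text` may differ.
labels_demo = np.array(['malignant', 'benign', 'benign', 'malignant'])

encoder = LabelEncoder()
codes = encoder.fit_transform(labels_demo)   # classes are sorted: benign=0, malignant=1
signed = (2 * codes) - 1                     # 0 -> -1, 1 -> +1

print(encoder.classes_)   # the class at index 0 maps to -1, the one at index 1 to +1
print(signed)
```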

As usual let's check our `y`:

In [None]:
y

...and its shape:

In [None]:
y.shape

Now, we can plot our data in 3D with a 3D scatter plot, using the label as the third axis (we are going to use surface plots afterwards and the new interface of plotly cannot do surface plots yet, so we are using the older style rather than plotly express):

In [None]:
points_colorscale = [
                     [0.0, 'rgb(239, 85, 59)'],
                     [1.0, 'rgb(99, 110, 250)'],
                    ]

layout = go.Layout(scene=dict(
                              xaxis=dict(title='Feature 1'),
                              yaxis=dict(title='Feature 2'),
                              zaxis=dict(title='Label')
                             ),
                  )

points = go.Scatter3d(x=df['Feature 1'], 
                      y=df['Feature 2'], 
                      z=y,
                      mode='markers',
                      text=df['Label'],
                      marker=dict(
                                  size=3,
                                  color=y,
                                  colorscale=points_colorscale
                            ),
                     )

fig2 = go.Figure(data=[points], layout=layout)
fig2.show()

## Splitting data

Now, let's split our data. Normally we would split it into training, validation and test sets, but we won't be doing model selection here, so we don't need validation data. Let's use 70% of the data for training and 30% for testing.

In [None]:
(X_train, X_test, y_train, y_test) = train_test_split(X, y, test_size=0.3, random_state=0)
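
As a quick illustration of what `test_size` and `random_state` do, here is a sketch on made-up data with 20 examples. The arrays `X_demo` and `y_demo` are hypothetical; only the `train_test_split` call mirrors the cell above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 20 examples, 2 features, labels in {-1, +1}.
X_demo = np.arange(40).reshape(20, 2)
y_demo = np.where(np.arange(20) % 2 == 0, -1, 1)

split_a = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)
split_b = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)
X_tr, X_te, y_tr, y_te = split_a

print(X_tr.shape, X_te.shape)             # 14 training and 6 test examples
print(np.array_equal(X_te, split_b[1]))   # the same seed gives the same split
```

Fixing `random_state` makes the shuffle reproducible, so you get the same split every time you rerun the notebook.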

## Building and visualizing a logistic regression model

Let's build our logistic regression model then by creating an object of the `LogisticRegression` class and assign the name `logreg` to the resulting object. 

You can see the documentation for `LogisticRegression` here:

https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html

Go ahead and do that now:

In [None]:
### begin your code here (1 line).

### end your code here.

Now, fit `logreg` to `X_train` and `y_train`:

In [None]:
### begin your code here (1 line).

### end your code here.

You will get a summary for the model, similar to the following (the exact output depends on your version of scikit-learn):

> LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
>           intercept_scaling=1, max_iter=100, multi_class='warn',
>           n_jobs=None, penalty='l2', random_state=None, solver='warn',
>           tol=0.0001, verbose=0, warm_start=False)

* You may also get a warning because you have not explicitly set a solver and that is going to change in newer versions of scikit-learn. Nothing you should be worried about here.

Let's visualize the surface generated by our logistic regression model. First, we need to generate a number of points required for creating a visualization of the decision surface:

In [None]:
detail_steps = 100

(x_vis_0_min, x_vis_1_min) = X_train.min(axis=0)
(x_vis_0_max, x_vis_1_max) = X_train.max(axis=0)

x_vis_0_range = np.linspace(x_vis_0_min, x_vis_0_max, detail_steps)
x_vis_1_range = np.linspace(x_vis_1_min, x_vis_1_max, detail_steps)

# Pair the range of the first feature with the range of the second feature.
(XX_vis_0, XX_vis_1) = np.meshgrid(x_vis_0_range, x_vis_1_range)

X_vis = np.c_[XX_vis_0.reshape(-1), XX_vis_1.reshape(-1)]
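
If `np.meshgrid` and `np.c_` are new to you, the following minimal sketch shows how they turn two 1-D ranges into a list of grid points, one row per point. The tiny ranges and the names `range_0`, `range_1` and `grid_points` are made up for this illustration.

```python
import numpy as np

# Hypothetical small ranges: 3 values for feature 0 and 2 values for feature 1.
range_0 = np.linspace(0.0, 1.0, 3)     # [0.0, 0.5, 1.0]
range_1 = np.linspace(10.0, 20.0, 2)   # [10.0, 20.0]

# meshgrid returns two 2-D arrays of shape (len(range_1), len(range_0)).
grid_0, grid_1 = np.meshgrid(range_0, range_1)

# Flatten both arrays and pair the coordinates up: one row per grid point.
grid_points = np.c_[grid_0.reshape(-1), grid_1.reshape(-1)]

print(grid_0.shape)   # (2, 3)
print(grid_points)
```

Reshaping a vector of predictions for `grid_points` back to `grid_0.shape` recovers the 2-D layout that a surface plot needs.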

We need to predict the probability associated with each point in this generated data in order to visualize it. You can get the probabilities of belonging to each class with the `predict_proba` method. Use `predict_proba` just like `predict`, to predict probabilities instead of actual classes. Go ahead and calculate the probabilities for the points in `X_vis` now, and assign the result to the variable `probs`:

In [None]:
### begin your code here (1 line).

### end your code here.

Let's check the shape of this variable `probs`:

In [None]:
probs.shape

As you can see, it has two columns, because it gives the probability of belonging to each of the two classes. However, we care only about the probability of belonging to the positive class, so we keep only the column with index `1`. Also, the probabilities will be in $[0, 1]$ while our labels are in $\{-1, +1\}$, so we will transform the probabilities to be in the range $[-1, +1]$:

In [None]:
yhat_vis = (2 * probs[:, 1]) - 1
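
For a binary logistic regression model, the positive-class probability is the logistic function applied to a linear score of the features. The sketch below reproduces that computation and the rescaling by hand. The weights `w` and intercept `b` are hypothetical values chosen for illustration, not the ones learned by `logreg`.

```python
import numpy as np

def sigmoid(z):
    """Logistic function, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical weights and intercept, not the ones learned by `logreg`.
w = np.array([0.8, -0.5])
b = 0.1

X_demo = np.array([[-4.0, 2.0], [0.0, 0.2], [3.0, -1.0]])
z = X_demo @ w + b                 # linear score for each point
p_positive = sigmoid(z)            # probability of the positive class
y_surface = (2 * p_positive) - 1   # rescaled into (-1, +1)

print(p_positive)
print(y_surface)
```

A score of exactly 0 gives a probability of 0.5, which the rescaling maps to 0, halfway between the two labels. The rescaled value $2\sigma(z) - 1$ is the same as $\tanh(z/2)$.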

Now, we can transform `yhat_vis` into the shape required for a surface plot and plot away:

In [None]:
YYhat_vis = yhat_vis.reshape(XX_vis_0.shape)

surface_colorscale = [
                      [0.0, 'rgb(235, 185, 177)'],
                      [1.0, 'rgb(199, 204, 249)'],
                     ]

surface = go.Surface(
                     x=XX_vis_0, 
                     y=XX_vis_1,
                     z=YYhat_vis,
                     colorscale=surface_colorscale,
                     showscale=False
                    )

fig3 = go.Figure(data=[points, surface], layout=layout)
fig3.show()

We can see that logistic regression has fit a surface to our data whose cross-section is the logistic (or sigmoid) function.

## Assessing the performance

Let's check our accuracies next. First, the training accuracy. For that let's get the predictions of training data. Predict `yhat_train` by `logreg` on `X_train`:

In [None]:
### begin your code here (1 line).

### end your code here.

Let's measure the accuracy:

In [None]:
# accuracy_score expects the true labels first and the predictions second.
accuracy_score(y_train, yhat_train)

We got ??.??%. Let's check accuracy on the test data. Predict `yhat_test`:

In [None]:
### begin your code here (1 line).

### end your code here.
# accuracy_score expects the true labels first and the predictions second.
accuracy_score(y_test, yhat_test)

??.??%. We have better performance on the test data than on the training data! But that is just down to chance in how the data happened to be split. It does not mean that we have generalized perfectly and have no overfitting: that is theoretically impossible!
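
One way to convince yourself that the gap between test and training accuracy depends on the split is to repeat the experiment with several seeds. The following is a rough sketch on made-up data: the two overlapping blobs, the names `X_demo`, `y_demo` and `gaps`, and the choice of 10 seeds are all assumptions for illustration, and the numbers it prints say nothing about the breast cancer dataset.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Hypothetical synthetic data: two overlapping 2-D blobs, labels in {-1, +1}.
X_neg = rng.normal(loc=-1.0, scale=1.5, size=(150, 2))
X_pos = rng.normal(loc=+1.0, scale=1.5, size=(150, 2))
X_demo = np.vstack([X_neg, X_pos])
y_demo = np.r_[-np.ones(150, dtype=int), np.ones(150, dtype=int)]

gaps = []
for seed in range(10):
    X_tr, X_te, y_tr, y_te = train_test_split(
        X_demo, y_demo, test_size=0.3, random_state=seed)
    model = LogisticRegression().fit(X_tr, y_tr)
    train_acc = accuracy_score(y_tr, model.predict(X_tr))
    test_acc = accuracy_score(y_te, model.predict(X_te))
    gaps.append(test_acc - train_acc)   # positive when test beats training

print(np.round(gaps, 3))
```

Printing the gap for several seeds shows how much it moves with the split alone, which is why a single comparison should not be over-interpreted.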

That's it for now.