## Logistic Regression Tutorial: UCI Wisconsin Breast Cancer Data

This tutorial supplements Assignment 3 and covers the following techniques:
* Two Ways to Load Data
* Dealing with Missing or Unknown Data
* Indexing techniques to select desired attributes
* Setting up a logistic regression model with scikit-learn
* Visualizing the weight coefficients

### Importing Libraries

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_breast_cancer

### Loading Data

One way to load data is to read a .csv file, such as one downloaded from the UC Irvine Machine Learning Repository (https://archive.ics.uci.edu/ml/index.php), with pandas. This is the more general approach, and it works for data that is not available in the scikit-learn library.

In [None]:
df = pd.read_csv('data_bc.csv')

Alternatively, if the dataset is bundled with scikit-learn (see what is available: https://scikit-learn.org/stable/datasets/index.html), you can load it as follows.

In [None]:
data_bc = load_breast_cancer()
X = data_bc.data    # feature matrix
y = data_bc.target  # class labels
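
If you would rather inspect the scikit-learn copy as a DataFrame, the sketch below may help. It assumes scikit-learn 0.23 or later, where `load_breast_cancer` accepts `as_frame=True`; the names `bc`, `df_sk`, `X_sk`, and `y_sk` are illustrative and not part of the original notebook. Note that this copy uses different column names (for example `mean radius`) and encodes the target as 0 = malignant, 1 = benign, so it is not a drop-in replacement for `data_bc.csv`.

```python
from sklearn.datasets import load_breast_cancer

# as_frame=True returns pandas objects instead of NumPy arrays
# (assumption: scikit-learn >= 0.23)
bc = load_breast_cancer(as_frame=True)

df_sk = bc.frame    # 30 feature columns plus a 'target' column
X_sk = bc.data      # DataFrame with the 30 features
y_sk = bc.target    # Series of 0/1 class labels

print(df_sk.shape)
print(list(bc.target_names))
```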

We will proceed with the data loaded from the .csv file available on NYUClasses (data_bc.csv). To get an idea of what the data looks like, you can display the first five rows using the head() function from pandas.

In [None]:
df.head()

### Unknown or missing data

Some datasets contain unknown values, missing values, or both. Many others, such as the Iris dataset, contain neither. One way to deal with the problem is to drop the affected column.

In [None]:
# Drop the 'Unnamed: 32' column (axis=1 means drop a column, not a row)
df = df.drop('Unnamed: 32', axis=1)

In [None]:
df.head()

Sometimes it does not make sense to drop a whole attribute. For other ways to handle missing data, see the pandas guide on working with missing data (https://pandas.pydata.org/pandas-docs/stable/user_guide/missing_data.html).
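
As a rough illustration of the alternatives to dropping a column, here is a minimal sketch on a made-up DataFrame. The frame `toy` and its column names are hypothetical and are not taken from `data_bc.csv`. It counts missing values, drops rows with gaps, and fills gaps with the column median.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; names and values are for illustration only
toy = pd.DataFrame({
    "radius": [14.2, np.nan, 12.1, 20.3],
    "texture": [19.1, 21.4, np.nan, 17.8],
    "empty": [np.nan] * 4,
})

# Count the missing values in each column
n_missing = toy.isna().sum()

# Drop columns in which every value is missing
toy_cols = toy.dropna(axis=1, how="all")

# Option A: drop every row that still has a missing value
toy_rows = toy_cols.dropna(axis=0)

# Option B: fill each remaining gap with that column's median
toy_filled = toy_cols.fillna(toy_cols.median())

print(n_missing)
print(toy_filled)
```

Which option is appropriate depends on how much data is missing and why; dropping rows discards information, while filling can bias the attribute's distribution.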

### Selecting Attributes

In [None]:
# Selecting from a range

# Here, we select the columns at positions 2, 3, and 4 (radius mean, texture mean, and perimeter mean).
# Recall that in Python, indexing starts at 0 and the end of a slice is excluded.

x_labels1 = df.columns[2:5]
X = df[x_labels1].values
X

In [None]:
# Selecting based on desired label names
xlbl = ['perimeter_mean', 'area_mean', 'compactness_mean']
X1 = df[xlbl].values
X1

For more information on indexing and selecting data, you may refer to this link (https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html)
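
The two selection styles above correspond to the pandas `.iloc` (position-based) and `.loc` (label-based) indexers. The sketch below contrasts them on a small made-up frame; the values are hypothetical and only the column names echo the tutorial's data.

```python
import pandas as pd

# Hypothetical toy frame; values are for illustration only
toy = pd.DataFrame({
    "id": [1, 2, 3],
    "diagnosis": ["M", "B", "B"],
    "radius_mean": [17.9, 13.5, 12.4],
    "texture_mean": [10.4, 14.4, 15.7],
})

# Position-based: columns 2 and 3 (the end of the slice is excluded)
by_position = toy.iloc[:, 2:4]

# Label-based: both endpoints of a label slice are included
by_label = toy.loc[:, "radius_mean":"texture_mean"]

# Boolean mask: keep only the rows whose diagnosis is 'B'
benign_rows = toy.loc[toy["diagnosis"] == "B"]

print(by_position.equals(by_label))
print(benign_rows)
```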

### Logistic Regression Model

In [None]:
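# Encode the diagnosis labels as integers:
# vals holds the sorted unique labels, y holds each row's index into vals
vals, y = np.unique(df['diagnosis'].values, return_inverse=True)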

In [None]:
vals

In [None]:
y
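
To see what `np.unique` with `return_inverse=True` produces, here is a small worked example on made-up labels (the arrays `labels`, `vals_toy`, and `y_toy` are illustrative). The unique values come back sorted, so 'B' maps to 0 and 'M' maps to 1.

```python
import numpy as np

# Toy labels for illustration
labels = np.array(["M", "B", "B", "M", "B"])

# vals_toy: sorted unique labels; y_toy: index of each label in vals_toy
vals_toy, y_toy = np.unique(labels, return_inverse=True)

print(vals_toy)  # ['B' 'M']
print(y_toy)     # [1 0 0 1 0]

# Indexing vals_toy with y_toy recovers the original labels
recovered = vals_toy[y_toy]
```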

In [None]:
from sklearn.preprocessing import scale

In [None]:
# Scale the data
Xs = scale(X)
Xs
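
For reference, `scale` standardizes each column to zero mean and unit variance. The sketch below checks this against a manual computation on a small made-up matrix (`X_toy` is illustrative).

```python
import numpy as np
from sklearn.preprocessing import scale

# Toy matrix whose two columns are on very different scales
X_toy = np.array([[1.0, 200.0],
                  [2.0, 400.0],
                  [3.0, 900.0]])

# Manual standardization: subtract the column mean, divide by the
# column standard deviation (population form, ddof=0)
X_manual = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0)

X_sklearn = scale(X_toy)

same = np.allclose(X_manual, X_sklearn)
print(same)
print(X_sklearn.mean(axis=0), X_sklearn.std(axis=0))
```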

In [None]:
# Create a Logistic Regression instance
logreg = LogisticRegression()

# Fit the data
logreg.fit(Xs, y)

In [None]:
# Get predictions
y_hat = logreg.predict(Xs)

# Calculate accuracy on training data
accuracy = np.mean(y_hat == y)
accuracy
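
Accuracy on the training data tends to be optimistic. One common way to get a fairer estimate is to hold out part of the data for testing. The following is a minimal sketch, not the notebook's own method: to stay self-contained it uses scikit-learn's bundled copy of the dataset with all 30 attributes rather than `data_bc.csv`, and the split size, `random_state`, and `max_iter` values are arbitrary choices.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Assumption: the bundled dataset stands in for data_bc.csv
X_all, y_all = load_breast_cancer(return_X_y=True)

# Hold out 30% of the samples, keeping the class proportions
X_tr, X_te, y_tr, y_te = train_test_split(
    X_all, y_all, test_size=0.3, random_state=0, stratify=y_all
)

# Fit the scaler on the training split only, then apply it to both
scaler = StandardScaler().fit(X_tr)
clf = LogisticRegression(max_iter=1000)
clf.fit(scaler.transform(X_tr), y_tr)

train_acc = clf.score(scaler.transform(X_tr), y_tr)
test_acc = clf.score(scaler.transform(X_te), y_te)
print(train_acc, test_acc)
```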

### Weight Visualization

It is difficult to tell by inspection which attributes contribute most to the prediction, especially since this dataset has about 30 attributes. Finding them by trial and error would be inefficient. Fortunately, we can visualize the weights.

In [None]:
# Build a matrix from all attribute columns (column 2 onward)
x_labels_w = df.columns[2:]
Xw = df[x_labels_w].values
print(Xw)
print("The shape of Xw is " + str(Xw.shape))

In [None]:
# For plotting in the Jupyter Notebook environment as an inline output
%matplotlib inline

In [None]:
# By default, LogisticRegression() uses an L2 penalty with C=1.
# C is the inverse of the regularization strength, so a very large C
# approximates a fit with no regularization. We use it here so that the
# effect of regularization can be shown later.

logreg_w = LogisticRegression(C=1e10)
logreg_w.fit(Xw, y)
W = logreg_w.coef_
W = W.flatten()
plt.stem(W)
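
To give a feel for how `C` controls regularization, the sketch below fits the model for several values of `C` and compares the size of the weight vector. It rests on some assumptions: it uses scikit-learn's bundled copy of the dataset instead of `data_bc.csv`, standardizes the attributes first, and raises `max_iter` so the solver has room to converge. The specific values of `C` are arbitrary.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import scale

# Assumption: the bundled dataset stands in for data_bc.csv
X_all, y_all = load_breast_cancer(return_X_y=True)
Xs_all = scale(X_all)

# Smaller C means stronger L2 regularization, which shrinks the weights
norms = {}
for C in (0.01, 1.0, 100.0):
    model = LogisticRegression(C=C, max_iter=10000)
    model.fit(Xs_all, y_all)
    norms[C] = np.linalg.norm(model.coef_)

for C, norm in norms.items():
    print(f"C={C:g}: ||w|| = {norm:.3f}")
```

The norm of the weight vector should grow as `C` increases, since the penalty on large weights becomes weaker.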

From this stem plot of the coefficient weights, we can see that some attributes have much larger weights in absolute value than others and therefore contribute more heavily to the prediction. Keep in mind that Xw was not standardized, so each coefficient's magnitude also depends on the units of its attribute. As an example, let us find the three most heavily weighted attributes.

In [None]:
# Indices of the three largest weights by absolute value, largest first
top3 = np.argsort(np.abs(W))[-3:][::-1]

heavy = [x_labels_w[i] for i in top3]
heavy
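
The ranking step can be checked by hand on a small made-up weight vector. In this sketch `W_toy` and `names_toy` are hypothetical; it also keeps the signed weights next to the names, since the sign indicates which class a larger attribute value pushes the prediction toward.

```python
import numpy as np

# Hypothetical weights and attribute names
W_toy = np.array([0.2, -3.0, 1.5, -0.1, 2.0])
names_toy = np.array(["a", "b", "c", "d", "e"])

# Sort by absolute value, largest first
order = np.argsort(np.abs(W_toy))[::-1]

top3_names = names_toy[order[:3]].tolist()
top3_weights = W_toy[order[:3]].tolist()

print(top3_names)    # ['b', 'e', 'c']
print(top3_weights)  # [-3.0, 2.0, 1.5]
```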