# Introduction to Regression: Logistic Regression
In this introduction, we will develop logistic regression from basic principles. Other tutorials forgo the theory and focus on the existing Python libraries that are commonly used to build regression models.


In [None]:
# All good Python projects begin by specifying which modules to load

import pandas as pd  # Pandas is a package which creates data frames
import numpy as np # Numpy is the package which creates/manages/operates on numerical data
import matplotlib.pyplot as plt # Matplotlib is the plotting library

## Background

As we covered in the last session, regression is a method for fitting the parameters of a presumed model to our data. In logistic regression, the dependent output is the probability that an observation, described by its independent inputs, belongs to a given class.

For example, say the input variables are the height and weight of a person. The output could be the probability that the person is male. This can be applied to binary classes, such as yes/no or success/failure. It can also be used for multiple classes, such as identifying the species of an animal.

https://en.wikipedia.org/wiki/Logistic_regression

## The Data

Every project begins with the data.  We will be using data that _Tjen-Sien Lim_ (limt@stat.wisc.edu) supplied. The dataset contains cases from a study that was conducted between 1958 and 1970 at the University of Chicago's Billings Hospital on the survival of patients who had undergone surgery for breast cancer.

http://archive.ics.uci.edu/ml/machine-learning-databases/haberman/haberman.data

//=-=-=-=-=-=-=-=-=-=-=-=-=-=

Dataset:  haberman.data
Lim, Tjen-Sien (1999). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.

Variables/Columns

1. Age of patient at time of operation (numerical) <br>
2. Patient's year of operation (year - 1900, numerical)<br> 
3. Number of positive axillary nodes detected (numerical) <br>
4. Survival status (class attribute) <br>
-- 1 = the patient survived 5 years or longer <br>
-- 2 = the patient died within 5 years <br>

//=-=-=-=-=-=-=-=-=-=-=-=-=-=



In [None]:
# Pull the data directly from the UCI Machine Learning Repository
haber = 'http://archive.ics.uci.edu/ml/machine-learning-databases/haberman/haberman.data'

# Data does not have a header row so we have to label the data
names = ['age', 'year', 'nodes', 'survival']
data = pd.read_csv(haber, header=None, names=names)

# We will change the survival labels
# 1-> 1 : No change
# 2-> 0 : Death within 5 years is 0
data['survival']=-1*(data['survival']-2)

# head() gives a snapshot of the data. JupyterHub is great at rendering tables.
data.head()

In [None]:
# describe() provides more summary information from the data (also in a nice rendered table)
data.describe()

In [None]:
# plt.scatter can show us the data
plt.scatter(data['age'], data['survival'],color='r')
plt.title('Age and Survival')
plt.xlabel('Age')
plt.ylabel('Survival [0=No; 1=Yes]')
plt.show()

In [None]:
# plt.scatter can show us the data
plt.scatter(data['nodes'], data['survival'],color='r')
plt.title('Number of Nodes and Survival')
plt.xlabel('Nodes')
plt.ylabel('Survival [0=No; 1=Yes]')
plt.show()

In [None]:
# plt.scatter can show us the data
plt.scatter(data['year'], data['survival'],color='r')
plt.title('Year of Surgery and Survival')
plt.xlabel('Year of Surgery')
plt.ylabel('Survival [0=No; 1=Yes]')
plt.show()

# Model building

A logistic regression model fits a linear model to the log-odds of the outcome. If $p$ is the probability of survival, then the log-odds of survival is:

$$
L = \log \frac{p}{1-p}
$$
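
As a quick numerical check of this definition, the sketch below computes the log-odds for a few probabilities and then inverts them. The helper names `log_odds` and `inverse_log_odds` are illustrative and are not part of the original notebook.

```python
import numpy as np

def log_odds(p):
    """Log-odds L = log(p / (1 - p)) for probabilities strictly between 0 and 1."""
    p = np.asarray(p, dtype=float)
    return np.log(p / (1 - p))

def inverse_log_odds(L):
    """Recover p from L by solving L = log(p / (1 - p)) for p."""
    L = np.asarray(L, dtype=float)
    return 1 / (1 + np.exp(-L))

p = np.array([0.1, 0.5, 0.8])
L = log_odds(p)
print(L)                    # p = 0.5 maps to 0; probabilities below 0.5 give negative values
print(inverse_log_odds(L))  # recovers the original probabilities
```

Note that the log-odds can take any real value, while $p$ stays between 0 and 1. This is what makes it reasonable to fit the log-odds with an unbounded linear expression.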

To fit this to three input variables, we model the log-odds as a linear combination of the inputs:

$$
L = \log \frac{p}{1-p} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3
$$

So, now we have the inputs, $x$, and we need to find the coefficients, $\beta$, that best fit our data.
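
To make the three-variable equation concrete, the following sketch evaluates its right-hand side for two made-up patients. The values in `beta` and the rows of `X` are arbitrary placeholders chosen for illustration; they are not fitted coefficients or rows from the dataset.

```python
import numpy as np

# Hypothetical coefficients [beta_0, beta_1, beta_2, beta_3]; NOT fitted to the data.
beta = np.array([1.0, -0.02, 0.01, -0.1])

# Two made-up patients with columns (age, year, nodes), playing the roles of x_1, x_2, x_3.
X = np.array([
    [30.0, 64.0, 1.0],
    [60.0, 60.0, 10.0],
])

# L = beta_0 + beta_1*x_1 + beta_2*x_2 + beta_3*x_3, computed for every row at once.
L = beta[0] + X @ beta[1:]

# Invert the log-odds to get a probability of survival for each row.
p = 1 / (1 + np.exp(-L))

print(L)
print(p)
```

A positive log-odds corresponds to a probability above 0.5 and a negative log-odds to a probability below 0.5.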

## One Variable

Let's begin by modeling only the relationship between age and survival. First, we will define the logistic (sigmoid) function, which is the inverse of the logit (log-odds) function:

$$
\sigma(t) = \frac{e^t}{e^t+1} = \frac{1}{1+e^{-t}}
$$

Where $t$ is our linear equation:

$$
t = \beta_0 + \beta_1 x
$$
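
A minimal sketch of these two equations as Python functions follows. The names `linear_term` and `sigmoid` and the sample coefficients are assumptions made for illustration; the notebook cells compute the same quantity inline.

```python
import numpy as np

def linear_term(x, b0, b1):
    """t = b0 + b1 * x"""
    return b0 + b1 * np.asarray(x, dtype=float)

def sigmoid(t):
    """sigma(t) = 1 / (1 + exp(-t))"""
    return 1 / (1 + np.exp(-np.asarray(t, dtype=float)))

# Placeholder coefficients, chosen only to show the shape of the curve.
b0, b1 = 3.0, -0.05
ages = np.array([30, 60, 90])
probs = sigmoid(linear_term(ages, b0, b1))
print(probs)  # decreases with age because b1 is negative
```

The sign of `b1` sets the direction of the curve, and `b0` shifts the point where the predicted probability crosses 0.5.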


In [None]:
model = dict()
model['b0'] = ________
model['b1'] = ________

model


In [None]:
modelData = pd.DataFrame({'x': data['age'],'y': data['survival']})
modelData['y_'] = 1/(1+np.exp(-1*(model['b1'] * modelData['x']  +  model['b0'])))

modelData

In [None]:
# Let's see how we did
plt.scatter(modelData['x'],modelData['y'],color='r')
plt.scatter(modelData['x'],modelData['y_'],color='g')
plt.plot(modelData['x'],modelData['y_'],color='b')
plt.show()

# Model Evaluation

We need a metric to determine how good the model is.  Thoughts?

In [None]:
modelData['delta'] = modelData['y'] - modelData['y_']
modelData

In [None]:
modelData['squared'] = modelData['delta']*modelData['delta']
modelData

In [None]:
# Mean squared error: the sum of the squared errors divided by the number of rows
mse = modelData['squared'].mean()
mse

<font color='red'>

# OK, now go back and try different values for b0 and b1.
# Can you do better?

</font>
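
One way to organize this search is a coarse grid over both coefficients, scoring each pair with the same mean squared error computed above. The sketch below is only an illustration under stated assumptions: it uses a small synthetic array in place of `modelData`, and the function names and grid ranges are arbitrary choices.

```python
import numpy as np

def mean_squared_error(x, y, b0, b1):
    """Mean of (y - sigmoid(b0 + b1*x))**2."""
    y_hat = 1 / (1 + np.exp(-(b0 + b1 * x)))
    return np.mean((y - y_hat) ** 2)

def grid_search(x, y, b0_values, b1_values):
    """Try every (b0, b1) pair and return the best pair with its error."""
    best = (None, None, np.inf)
    for b0 in b0_values:
        for b1 in b1_values:
            err = mean_squared_error(x, y, b0, b1)
            if err < best[2]:
                best = (b0, b1, err)
    return best

# Synthetic stand-in data (NOT the Haberman data): older patients are labeled 0 more often.
x = np.array([30.0, 35.0, 40.0, 50.0, 60.0, 65.0, 70.0, 75.0])
y = np.array([1, 1, 1, 1, 0, 1, 0, 0])

# Arbitrary grid ranges; widen or refine them as needed.
b0_values = np.linspace(-10, 10, 41)
b1_values = np.linspace(-0.5, 0.5, 41)

best_b0, best_b1, best_err = grid_search(x, y, b0_values, b1_values)
baseline = mean_squared_error(x, y, 0.0, 0.0)  # b0 = b1 = 0 predicts 0.5 for everyone
print(best_b0, best_b1, best_err, baseline)
```

To try this on the notebook's data, pass `modelData['x'].to_numpy()` and `modelData['y'].to_numpy()` in place of the synthetic arrays. A grid search is a brute-force stand-in for the manual trial and error suggested here, not the method a regression library would normally use.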