# Introduction to Regression: Line Fitting

In this introduction, we will develop linear regression from basic principles. Other tutorials forgo the theory and focus on the existing Python libraries that are commonly used for building regression models.


In [None]:
# All good Python projects begin by specifying which modules to load

import pandas as pd  # Pandas is a package which creates data frames
import numpy as np # Numpy is the package which creates/manages/operates on numerical data
import matplotlib.pyplot as plt # Matplotlib is the plotting library

## The Data

Every project begins with the data. We will be using example data that _Fedor Karmanov_ has created based on some published literature. Visit Fedor's GitHub repo to learn more (https://github.com/fed-ka/springboard).

//=-=-=-=-=-=-=-=-=-=-=-=-=-=

Dataset:  lsd.dat
Source: Wagner, Aghajanian, and Bing (1968). Correlation of Performance
Test Scores with Tissue Concentration of Lysergic Acid Diethylamide in
Human Subjects. Clinical Pharmacology and Therapeutics, Vol.9 pp635-638.

Description: Group of volunteers was given LSD, their mean scores on
math exam and tissue concentrations of LSD were obtained at n=7 time points.

Variables/Columns

Tissue Concentration    1-4 <br>
Math Score             8-12

//=-=-=-=-=-=-=-=-=-=-=-=-=-=



In [None]:
# Pull the data directly from github
lsd = 'https://raw.githubusercontent.com/fed-ka/springboard/master/5.%20Linear%20Regressions%20in%20Python/lsd.csv'
data = pd.read_csv(lsd)

# head() gives a snapshot of the data.  Jupyterhub is great at rendering tables.
data.head()

In [None]:
# describe() provides more summary information from the data (also in a nice rendered table)
data.describe()

In [None]:
# plt.scatter can show us the data
plt.scatter(data['Tissue Concentration'], data['Test Score'],color='r')
plt.title('Drugs: What\'s the deal?')
plt.xlabel('Tissue Concentration')
plt.ylabel('Test Score')
plt.show()

# Model Building

When building models, there are any number of ways to store/represent the data.  

You can always use the native data table and refer directly to the columns, but I'm too lazy to keep typing 'Tissue Concentration' and 'Test Score', so I'm going to create two new data structures.

model: Dict <br>
model['m']: Slope of the line <br>
model['b']: Y Intercept

modelData: Dataframe <br>
modelData['x']: Independent variable <br>
modelData['y']: Dependent variable <br>
modelData['y\_']: Estimated dependent variable based on model
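
Written as an equation, the estimate stored in `y_` is the straight line below, evaluated at each observation $i$ (the hat notation $\hat{y}$ is shorthand introduced here for the `y_` column; it is not a name used elsewhere in the notebook):

$$
\hat{y}_i = m\,x_i + b
$$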



In [None]:
model = dict()
model['m'] = ________
model['b'] = ________

model


In [None]:
modelData = pd.DataFrame({'x': data['Tissue Concentration'],'y': data['Test Score']})
modelData['y_'] = model['m'] * modelData['x']  +  model['b']

modelData

In [None]:
# Let's see how we did
plt.scatter(modelData['x'],modelData['y'],color='r')
plt.scatter(modelData['x'],modelData['y_'],color='g')
plt.plot(modelData['x'],modelData['y_'],color='b')
plt.show()

# Model Evaluation

We need a metric to determine how good the model is.  Thoughts?

In [None]:
modelData['delta'] = modelData['y'] - modelData['y_']
modelData

In [None]:
modelData['squared'] = modelData['delta']*modelData['delta']
modelData

In [None]:
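# Dividing the sum of squared errors (SSE) by the number of points
# gives the mean squared error (MSE), so name the result accordingly
mse = modelData['squared'].mean()
mse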
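
The three steps above (residual, square, average) can be folded into one helper so that a guess for the slope and intercept is scored in a single call. This is only a minimal sketch: the function name `mean_squared_error` and the toy points are assumptions made for illustration and are not part of the original notebook or the LSD data.

```python
import numpy as np

def mean_squared_error(m, b, x, y):
    """Average squared residual of the line y_ = m*x + b over the points (x, y)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    residuals = y - (m * x + b)          # same as the 'delta' column
    return float(np.mean(residuals ** 2))  # same as averaging the 'squared' column

# Toy points (made up, NOT the LSD data) that lie exactly on y = 2x + 1
x_toy = [0.0, 1.0, 2.0, 3.0]
y_toy = [1.0, 3.0, 5.0, 7.0]

perfect = mean_squared_error(2.0, 1.0, x_toy, y_toy)  # line through every point
worse = mean_squared_error(1.0, 1.0, x_toy, y_toy)    # slope too shallow
print(perfect, worse)  # prints: 0.0 3.5
```

With the notebook's own data, the equivalent call would be `mean_squared_error(model['m'], model['b'], modelData['x'], modelData['y'])`; a lower value means the line sits closer to the points.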

<font color='red'>
# OK, now go back and try different values for m and b.
# Can you do better?
</font>