# Linear Regression Analysis in Python

This Jupyter notebook uses Python to perform a linear regression on a set of *x*-*y* points and plot the result.

## I. Instructions

Change the code cells as needed and run them by pressing \<shift\>-\<enter\> while the cursor is in the cell or the cell is highlighted.

## II. Code

Run the cell below to load the Python libraries we will need to do the calculations.

In [None]:
import numpy as np # Numerical calculations
import matplotlib.pyplot as plt  # To graph results
import pandas as pd  # To read data from a file
import statsmodels.api as sm # To do the linear regression on directly entered data
import statsmodels.formula.api as smf # To do the linear regression on data from a file.

Choose one of the two options below to read your data into this notebook. Run the cells for only one option: both options store the data in `mydata`, so running a second one overwrites the data from the first.

Option 1 is easier because you can just copy and paste your data into this notebook. Option 2 is better if you have a large data set.

### A. (Optional) Enter data directly

Enter your *x* and *y* values inside the square brackets below, and run the cell to store your data in a data frame named `mydata`. The two arrays must have the same number of values.

In [None]:
x = np.array([1, 2, 3, 4, 5, 6])
y = np.array([1, 4.02, 5.9, 8.11, 10.02, 11.99])
mydata = pd.DataFrame({'xpts': x, 'ypts': y})

Run the cell below to display the data frame `mydata` so you can be sure it is correct.

In [None]:
mydata

If everything looks right, skip the Option 2 cells and go to the "Do linear regression on data" section.

### B. (Optional) Read data from CSV file

Use this option if you have a large data set stored in a file. Follow these steps to get the file onto the server running this notebook so you can import it.
1. Save your x and y data to a CSV file. Excel on Windows and macOS can export to CSV. The CSV file should have only two columns: the x values and the y values.
2. If you are running this notebook in JupyterLab, use the file list on the left of the screen: press the upload button (the icon with an up arrow beneath the menus) and upload your CSV file.
3. If you are running this notebook in Jupyter Notebook, there is no file list pane on the left. Go to the browser tab or window with your file list and use the `Upload` button to upload your CSV file.

Two import cells follow. In the one that matches your file, change the file name to the name of your CSV file, then run the cell to import the file into a data frame.

#### 1. Data file without column headers (names)

The following command imports your CSV file correctly only if the file does not have column headers in it. If your file has column headers, skip this command and run the next one.

In [None]:
mydata = pd.read_csv('datafile.csv', names=['xpts', 'ypts'])
mydata

#### 2. Data file that has column headers (names)

Only run the cell below if your CSV file has column headers (names) in it. We are going to replace the names in your file with 'xpts' and 'ypts' because the rest of the commands in this notebook work using those names. 

(Feel free to use your column header names and change 'xpts' and 'ypts' to your names in the rest of this notebook.)

In [None]:
mydata = pd.read_csv('datafile.csv', names=['xpts', 'ypts'], header=0)
mydata

If everything looks right, go to the "Do linear regression on data" section below.
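
To see why the two `read_csv` calls above differ, the sketch below reads two small in-memory CSV strings instead of real files. The strings, the header names `time` and `distance`, and the `demo_*` variable names are made up for illustration. It also shows what happens if the headerless call is used on a file that has a header row.

```python
import io
import pandas as pd

# Hypothetical file contents, held in strings so no upload is needed.
csv_no_header = "1,1.0\n2,4.02\n3,5.9\n"
csv_with_header = "time,distance\n1,1.0\n2,4.02\n3,5.9\n"

# Case 1: no header row, so every line is data.
demo_a = pd.read_csv(io.StringIO(csv_no_header), names=['xpts', 'ypts'])

# Case 2: header=0 marks the first line as a header, and names= replaces it.
demo_b = pd.read_csv(io.StringIO(csv_with_header), names=['xpts', 'ypts'], header=0)

# Mistake: using the Case 1 call on a file with headers.
# The header line is read as a data row, so the columns hold text.
demo_wrong = pd.read_csv(io.StringIO(csv_with_header), names=['xpts', 'ypts'])

print(demo_a.equals(demo_b))  # Both correct calls give the same data frame.
print(demo_wrong)
```

If your own `mydata` shows the column names as its first row, you most likely ran the first cell on a file that has headers.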

### C. Do linear regression on data

Run the cell below to do a linear regression on the data you entered above. All of the information about the fit parameters is in the summary that these commands generate. We will get specific details below.

In [None]:
linefit = smf.ols('ypts ~ xpts', data=mydata).fit()
linefit.summary()

Run the cell below to see the fit values of the intercept and slope.

In [None]:
linefit.params

Run the cell below to define the slope and intercept as `m` and `b`. (Recall $y=mx+b$.) We will need these for our regression plot.

In [None]:
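# Use .iloc for positional access; plain [0] on a labeled Series is deprecated in recent pandas.
b = linefit.params.iloc[0]  # Intercept
m = linefit.params.iloc[1]  # Slope (coefficient of xpts)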
print('Slope and intercept defined.')
print('m: ', m)
print('b: ', b)
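
As a cross-check on `m` and `b`, the slope and intercept of a one-variable least-squares fit can also be computed from the closed-form formulas $m = \sum (x-\bar{x})(y-\bar{y}) / \sum (x-\bar{x})^2$ and $b = \bar{y} - m\bar{x}$. The sketch below uses only NumPy and the sample data from Option 1. It is an independent calculation, not a description of what statsmodels runs internally, and the `_demo` and `_check` names are chosen here so the notebook's own variables are not overwritten.

```python
import numpy as np

# Sample data from Option 1, under demo names so the notebook's variables are untouched.
x_demo = np.array([1, 2, 3, 4, 5, 6])
y_demo = np.array([1, 4.02, 5.9, 8.11, 10.02, 11.99])

# Closed-form least-squares estimates for y = m*x + b.
x_mean, y_mean = x_demo.mean(), y_demo.mean()
m_check = np.sum((x_demo - x_mean) * (y_demo - y_mean)) / np.sum((x_demo - x_mean) ** 2)
b_check = y_mean - m_check * x_mean

print('m (closed form): ', m_check)
print('b (closed form): ', b_check)
```

For the sample data, these values should agree with `linefit.params` to within rounding.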

Run the cell below to see the standard errors of the intercept and slope from the fit.

In [None]:
linefit.bse

Run the cell below to show the 95% confidence intervals (based on the *t*-distribution) for the fitted values of the intercept and slope. The first column in the table (labeled `0`) holds the lower (2.5%) bounds, and the second column (labeled `1`) holds the upper (97.5%) bounds.

In [None]:
linefit.conf_int()
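
Each confidence interval is the fitted value plus or minus a *t* critical value times its standard error. The sketch below reproduces that calculation by hand for the Option 1 sample data, using the textbook formulas for simple linear regression. It assumes SciPy is available (it is installed as a dependency of statsmodels), and all `_demo` names are illustrative.

```python
import numpy as np
from scipy import stats  # Assumed available; statsmodels depends on SciPy.

# Sample data from Option 1, under demo names.
x_demo = np.array([1, 2, 3, 4, 5, 6])
y_demo = np.array([1, 4.02, 5.9, 8.11, 10.02, 11.99])
n = len(x_demo)

# Least-squares slope and intercept.
sxx = np.sum((x_demo - x_demo.mean()) ** 2)
m_demo = np.sum((x_demo - x_demo.mean()) * (y_demo - y_demo.mean())) / sxx
b_demo = y_demo.mean() - m_demo * x_demo.mean()

# Residual standard error with n - 2 degrees of freedom.
resid = y_demo - (m_demo * x_demo + b_demo)
s_demo = np.sqrt(np.sum(resid ** 2) / (n - 2))

# Standard errors of the slope and intercept.
se_m = s_demo / np.sqrt(sxx)
se_b = s_demo * np.sqrt(1 / n + x_demo.mean() ** 2 / sxx)

# Two-sided 95% critical value of the t-distribution.
t_crit = stats.t.ppf(0.975, df=n - 2)

ci_m = (m_demo - t_crit * se_m, m_demo + t_crit * se_m)
ci_b = (b_demo - t_crit * se_b, b_demo + t_crit * se_b)
print('95% CI for slope:     ', ci_m)
print('95% CI for intercept: ', ci_b)
```

To get a different confidence level from statsmodels directly, pass `alpha` to `conf_int`; for example, `linefit.conf_int(alpha=0.01)` gives 99% intervals.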

Run the cell below to define the residual standard error of the fit as `stderror`. We will use this value for the error bars of our plot.

In [None]:
stderror = np.sqrt(linefit.mse_resid)
stderror
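
The residual standard error is the square root of the sum of squared residuals divided by the residual degrees of freedom, which is $n-2$ for a straight-line fit (two fitted parameters). The sketch below computes it directly with NumPy for the Option 1 sample data. It uses `np.polyfit` for the fit so the block stands alone, and the `_demo` and `_check` names are illustrative.

```python
import numpy as np

# Sample data from Option 1, under demo names.
x_demo = np.array([1, 2, 3, 4, 5, 6])
y_demo = np.array([1, 4.02, 5.9, 8.11, 10.02, 11.99])

# Straight-line fit; polyfit returns the slope first, then the intercept.
m_demo, b_demo = np.polyfit(x_demo, y_demo, 1)

# Residuals: observed y minus fitted y.
resid = y_demo - (m_demo * x_demo + b_demo)

# Degrees of freedom: number of points minus the two fitted parameters.
dof = len(x_demo) - 2
stderror_check = np.sqrt(np.sum(resid ** 2) / dof)
print('Residual standard error: ', stderror_check)
```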

Run the cell below to show the R-squared value and the p-value for the fit.

In [None]:
print('R-squared: ', linefit.rsquared)
print('P-value (of F-statistic): ', linefit.f_pvalue)
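
R-squared is the fraction of the variance in *y* that the fit line explains: $R^2 = 1 - SS_\text{res}/SS_\text{tot}$. The sketch below computes it by hand for the Option 1 sample data, using NumPy only. The `_demo` and `_check` names are illustrative.

```python
import numpy as np

# Sample data from Option 1, under demo names.
x_demo = np.array([1, 2, 3, 4, 5, 6])
y_demo = np.array([1, 4.02, 5.9, 8.11, 10.02, 11.99])

# Straight-line fit; polyfit returns the slope first, then the intercept.
m_demo, b_demo = np.polyfit(x_demo, y_demo, 1)

# Sum of squared residuals and total sum of squares about the mean of y.
ss_res = np.sum((y_demo - (m_demo * x_demo + b_demo)) ** 2)
ss_tot = np.sum((y_demo - y_demo.mean()) ** 2)

r2_check = 1 - ss_res / ss_tot
print('R-squared (by hand): ', r2_check)
```

For a fit with a single predictor, this value also equals the square of the correlation coefficient between *x* and *y*.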

The cell below makes a very basic plot of the data and the fit line. We will make a nicer plot next.

In [None]:
fig, ax = plt.subplots()
mydata.plot('xpts', 'ypts', kind='scatter', ax=ax)
ax.plot(mydata['xpts'], linefit.predict(exog=mydata['xpts']), '-k');

The cell below produces a plot of the data points with error bars and the fit line. Change the axis limits and the axis labels in the cell to customize it for your data.

In [None]:
fig, ax = plt.subplots()
ax.errorbar(mydata.xpts, mydata.ypts, yerr=stderror, fmt='.k', capsize=5)

# Set axis limits below.
ax.set_xlim([0, 7])
ax.set_ylim([0, 15])

# Set axis labels below.
ax.set_xlabel('xpts')
ax.set_ylabel('ypts')

# Plot the fit line for all x values.
ax.plot(ax.get_xlim(), linefit.predict(exog=dict(xpts=ax.get_xlim())), '-k');

Run the cell below to save the plot on the server running this notebook. Change the file name to whatever you want. Change the extension to change the file type. Allowed file types include jpg, png, svg, pdf, and eps.

If you are using a mobile device, you can save an image of the plot by pressing and holding on the plot above and choosing to save it (to your photos or wherever you can access it).

In [None]:
fig.savefig('plot.pdf')