# Simple linear regression

## Import the relevant libraries

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
# We can override the default matplotlib styles with those of Seaborn
#import seaborn as sns
#sns.set()

## Load the data

In [None]:
# Load the data from a .csv file in the 'data' subfolder
data = pd.read_csv('data/simple_linear_regression.csv')

In [None]:
# Let's check what's inside this data frame
data

In [None]:
# This method gives us very nice descriptive statistics. We don't need this as of now, but will later on!
data.describe()

# Create your first regression

## Define the dependent and the independent variables

In [None]:
# Following the regression equation, our dependent variable (y) is the GPA
y = data['GPA']
# Similarly, our independent variable (x) is the SAT score
x1 = data['SAT']

## Explore the data

In [None]:
# Plot a scatter plot (first we put the horizontal axis, then the vertical axis)
plt.scatter(x1,y)
# Name the axes
plt.xlabel('SAT', fontsize = 20)
plt.ylabel('GPA', fontsize = 20)
# Show the plot
plt.show()

## Regression itself

In [None]:
# Add a constant (which we know as the intercept). Essentially, we are adding a new column (equal in length to x), which consists only of 1s
x = sm.add_constant(x1)
# Fit the model using the OLS (ordinary least squares) method, with a dependent variable y and an independent variable x
results = sm.OLS(y,x).fit()
# Print a nice summary of the regression. That's one of the strong points of statsmodels -> the summaries
results.summary()
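To make the "column of 1s" idea concrete, here is a minimal sketch that builds the same kind of design matrix by hand with NumPy and solves the least-squares problem directly. The toy `sat` and `gpa` arrays are hypothetical values made up for illustration, not the contents of the CSV, and this is only meant to show what the constant column does, not how statsmodels is implemented internally.

```python
import numpy as np

# Hypothetical toy data (NOT the notebook's CSV): SAT-like scores and GPA-like outcomes
sat = np.array([1500.0, 1600.0, 1700.0, 1800.0, 1900.0, 2000.0])
gpa = np.array([2.9, 3.0, 3.2, 3.3, 3.4, 3.6])

# Design matrix: a column of 1s (for the intercept) next to the predictor
X = np.column_stack([np.ones_like(sat), sat])

# Least-squares solution of X @ b ~ gpa; b[0] is the intercept, b[1] the slope
b, *_ = np.linalg.lstsq(X, gpa, rcond=None)
intercept, slope = b

print(f"intercept = {intercept:.4f}, slope = {slope:.6f}")
```

Without the column of 1s the fitted line would be forced through the origin, which is why the constant is added before fitting.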

### Interpreting the summary

Let's look at the first table and its most important values:

* `Dep. Variable` - the variable we are trying to predict
* `Model` - OLS (Ordinary Least Squares), the most common method for estimating the linear regression equation
* `R-squared` - measures the goodness of fit and ranges from 0 to 1, 0 meaning _the regression explains none of the variability_ and 1 meaning _the regression explains all the variability_. A value of 0.406 means that SAT explains about 41% of the variability in GPA, so SAT alone might be insufficient to predict the GPA; we might need other variables such as gender, household income or location to predict it better.

> __Note__: There's no rule of thumb for R-squared: in exact sciences like chemistry values between 0.7 and 0.9 are considered good, while in social sciences such as sociology a value of 0.2 might be considered a great outcome.

> __Note2__: If you're using more than one feature, it's advisable to rely on the `Adj. R-squared` instead of `R-squared`, because it penalizes the model for every additional variable.
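As a rough illustration of what these two numbers measure, the sketch below computes `R-squared` and `Adj. R-squared` by hand from the residuals of a fitted line. The toy data is hypothetical and `np.polyfit` is used here only as a convenient stand-in for the OLS fit.

```python
import numpy as np

# Hypothetical toy data (NOT the notebook's CSV)
sat = np.array([1500.0, 1600.0, 1700.0, 1800.0, 1900.0, 2000.0])
gpa = np.array([2.9, 3.0, 3.2, 3.3, 3.4, 3.6])

# Fit a straight line; polyfit returns the highest-degree coefficient first
slope, intercept = np.polyfit(sat, gpa, 1)
predicted = intercept + slope * sat

# R-squared = 1 - (unexplained variability / total variability)
ss_res = np.sum((gpa - predicted) ** 2)
ss_tot = np.sum((gpa - gpa.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot

# Adjusted R-squared penalizes the number of predictors p (here p = 1)
n, p = len(gpa), 1
adj_r_squared = 1 - (1 - r_squared) * (n - 1) / (n - p - 1)

print(f"R-squared = {r_squared:.3f}, Adj. R-squared = {adj_r_squared:.3f}")
```

With a single predictor the two values stay close; the gap widens as more variables are added without a matching gain in explained variability.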

Now let's focus on the middle table, the coefficients table:

* `const` - the value in the _coef_ column will be our $B_{0}$
* `SAT` - the value in the _coef_ column will be our $B_{1}$

So the formula will be: $y = 0.275 + 0.0017x_{1}$ or in other words: $GPA = 0.275 + 0.0017 \times SAT$
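To see the formula in action, here is a small sketch that plugs a SAT score into it. The function name `predict_gpa` and the score of 1700 are made up for illustration; the coefficients are the rounded values from the summary, so the result is approximate. By hand, $0.275 + 0.0017 \times 1700 = 3.165$.

```python
def predict_gpa(sat_score, intercept=0.275, slope=0.0017):
    """Predicted GPA from the rounded regression coefficients (hypothetical helper)."""
    return intercept + slope * sat_score

# Hypothetical student with a SAT score of 1700
print(f"Predicted GPA: {predict_gpa(1700):.3f}")
```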

* `std err` - the standard error of each coefficient estimate. The lower the value, the more precise the estimate.
* `P>|t|` - the p-value for the null hypothesis that the coefficient equals zero. In the SAT row, a value lower than 0.05 means that __the SAT is a significant variable when predicting GPA__ (at the 5% significance level).
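The sketch below shows where the `std err` and `t` columns come from in a simple regression, using the textbook formulas with NumPy. The toy data is hypothetical, and the p-value itself is left out because it requires the t-distribution (for example from SciPy).

```python
import numpy as np

# Hypothetical toy data (NOT the notebook's CSV)
sat = np.array([1500.0, 1600.0, 1700.0, 1800.0, 1900.0, 2000.0])
gpa = np.array([2.9, 3.0, 3.2, 3.3, 3.4, 3.6])
n = len(sat)

slope, intercept = np.polyfit(sat, gpa, 1)
residuals = gpa - (intercept + slope * sat)

# Residual variance with n - 2 degrees of freedom (two estimated coefficients)
sigma2 = np.sum(residuals ** 2) / (n - 2)
sxx = np.sum((sat - sat.mean()) ** 2)

# Standard errors of the slope and the intercept
se_slope = np.sqrt(sigma2 / sxx)
se_intercept = np.sqrt(sigma2 * (1 / n + sat.mean() ** 2 / sxx))

# t-statistic: how many standard errors the slope is away from zero
t_slope = slope / se_slope

print(f"se(slope) = {se_slope:.6f}, t = {t_slope:.2f}")
```

A large absolute t-statistic corresponds to a small value in the `P>|t|` column.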

In [None]:
# Create a scatter plot
plt.scatter(x1,y)
# Define the regression equation using the rounded coefficients from the summary, so we can plot it later
yhat = 0.0017*x1 + 0.275
# Plot the regression line against the independent variable (SAT)
fig = plt.plot(x1,yhat, lw=4, c='orange', label ='regression line')
# Label the axes
plt.xlabel('SAT', fontsize = 20)
plt.ylabel('GPA', fontsize = 20)
plt.show()