### Linear Regression 

>The r-value (the Pearson correlation coefficient) indicates how strongly two variables are linearly correlated. It ranges from -1 (perfect negative correlation) through 0 (no correlation) to 1 (perfect positive correlation).
>
The more correlated two variables are, the easier it becomes to use one to predict the other. For instance, if I know that the price of my steak is highly positively correlated with its size (in ounces), I can create a formula that predicts how much I will pay for a steak of a given size.
>
The way we do this is with **linear regression**. Linear regression gives us a formula: if we plug the value of one variable into it, we get a predicted value for the other variable.
>
The formula takes the form
**y = mx + b**
where m is the slope of the line and b is its y-intercept.
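
>As a minimal sketch of the steak example, the snippet below computes the r-value with NumPy's `corrcoef`. The sizes and prices are made-up numbers used only for illustration, and the variable names are assumptions, not part of the wine data.

```python
import numpy as np

# Hypothetical steak data (invented for illustration only)
ounces = np.array([6.0, 8.0, 10.0, 12.0, 16.0])   # steak size
price = np.array([14.0, 18.5, 22.0, 27.0, 35.5])  # price paid

# corrcoef returns a 2x2 correlation matrix; the off-diagonal entry is r
r_value = np.corrcoef(ounces, price)[0, 1]

print("r-value for size and price : ", r_value)
```

>An r-value this close to 1 suggests that size is a useful predictor of price for this toy data.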

>We have to calculate values for m and b before we can use our formula.

>We'll calculate the slope first. The formula is $m = \mathrm{cov}(x, y) / \sigma_x^2$, which is just the covariance of x and y divided by the variance of x.
>
>We can use NumPy's `cov` function to calculate the covariance, and the `.var()` method on a pandas Series to calculate the variance.
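
>The slope formula can be sketched on a small made-up dataset before applying it to the wine data. The values below are hypothetical, and the comparison against `np.polyfit` is only a sanity check, not part of the original method.

```python
import numpy as np
import pandas as pd

# Hypothetical data (invented for illustration only)
x = pd.Series([6.0, 8.0, 10.0, 12.0, 16.0])
y = pd.Series([14.0, 18.5, 22.0, 27.0, 35.5])

# np.cov returns a 2x2 covariance matrix; entry [0, 1] is cov(x, y).
# np.cov and Series.var() both default to the sample (n - 1) normalization,
# so the two quantities are consistent with each other.
slope = np.cov(x, y)[0, 1] / x.var()

# Sanity check: a degree-1 least-squares fit should give the same slope
slope_check = np.polyfit(x, y, 1)[0]

print("Slope from cov / var : ", slope)
print("Slope from polyfit   : ", slope_check)
```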

> # When using linear regression, first make sure that the two variables, x (used to predict) and y (to be predicted), are correlated

In [1]:
import pandas as pd

wine_quality = pd.read_csv("wine_quality_white.csv")
wine_quality.head()

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
0,7.0,0.27,0.36,20.7,0.045,45.0,170.0,1.001,3.0,0.45,8.8,6
1,6.3,0.3,0.34,1.6,0.049,14.0,132.0,0.994,3.3,0.49,9.5,6
2,8.1,0.28,0.4,6.9,0.05,30.0,97.0,0.9951,3.26,0.44,10.1,6
3,7.2,0.23,0.32,8.5,0.058,47.0,186.0,0.9956,3.19,0.4,9.9,6
4,7.2,0.23,0.32,8.5,0.058,47.0,186.0,0.9956,3.19,0.4,9.9,6


In [3]:
from numpy import cov

# Predict quality (y) from density (x)

slope_density = cov(wine_quality["density"], wine_quality["quality"])[0, 1] / wine_quality["density"].var()

print("The slope for quality and density : ", slope_density)

The slope for quality and density :  -90.9423999421


In [6]:
from numpy import cov

# This function will take in two columns of data, and return the slope of the linear regression line.
def calc_slope(x, y):
    return cov(x, y)[0, 1] / x.var()

# Calculate the y-intercept: b = mean(y) - m * mean(x)
# Note the argument order: x (density) comes first, y (quality) second

intercept_density = wine_quality["quality"].mean() - (calc_slope(wine_quality["density"], wine_quality["quality"]) * wine_quality["density"].mean())

print("The intercept for the density and quality equation : ", intercept_density)


In [7]:
## Predicting

from numpy import cov

def calc_slope(x, y):
    return cov(x, y)[0, 1] / x.var()

# Calculate the intercept given the x column, y column, and the slope
def calc_intercept(x, y, slope):
    return y.mean() - (slope * x.mean())

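# Fit once, so the slope and intercept are not recomputed for every value
density_slope = calc_slope(wine_quality["density"], wine_quality["quality"])
density_intercept = calc_intercept(wine_quality["density"], wine_quality["quality"], density_slope)

# Predict quality for a single density value using y = mx + b
def predict(value):
    return (density_slope * value) + density_intercept

predicted_quality = wine_quality["density"].apply(predict)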
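
>Because the intercept is computed as b = mean(y) - m * mean(x), the fitted line always passes through the point (mean of x, mean of y). The sketch below checks this on made-up data. It also predicts for the whole array in one vectorized expression, which is an alternative to `apply` and not the approach used in the cell above.

```python
import numpy as np

# Hypothetical data (invented for illustration only)
x = np.array([6.0, 8.0, 10.0, 12.0, 16.0])
y = np.array([14.0, 18.5, 22.0, 27.0, 35.5])

# Slope and intercept, using the same formulas as calc_slope and calc_intercept
slope = np.cov(x, y)[0, 1] / np.var(x, ddof=1)
intercept = y.mean() - slope * x.mean()

# NumPy arithmetic is element-wise, so this predicts every value at once
predicted = slope * x + intercept

# The line evaluated at mean(x) should equal mean(y)
line_at_mean_x = slope * x.mean() + intercept
print("Line at mean(x) : ", line_at_mean_x)
print("mean(y)         : ", y.mean())
```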

In [8]:
a =[1,2,3]
b = [3,4,5]
print(a+b)

[1, 2, 3, 3, 4, 5]


### Using linregress and finding residuals (errors in prediction)

>The linregress function makes it simple to do linear regression.

>We can plot out our line and our actual values, and see how far apart they are on the y-axis.
>
We can also compute the distance between each prediction and the actual value. These distances are called residuals.
If we add up the squared residuals, we get the residual sum of squares (RSS), which is a good error estimate for our line.
>
We have to square the residuals before adding them because, just like differences from the mean, the raw residuals of a least-squares line add up to 0.
>
In math terms, the sum of squared residuals is $\mathrm{RSS} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$, where $\hat{y}_i$ is the predicted y value at position i.
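
>A small sketch of why the residuals are squared, using made-up data and `np.polyfit` as a stand-in for the fitting step (both are assumptions for illustration): the raw residuals of a least-squares line cancel out, while the squared residuals do not.

```python
import numpy as np

# Hypothetical data (invented for illustration only)
x = np.array([6.0, 8.0, 10.0, 12.0, 16.0])
y = np.array([14.0, 18.5, 22.0, 27.0, 35.5])

# Least-squares line; polyfit returns the coefficients as [slope, intercept]
slope, intercept = np.polyfit(x, y, 1)
predicted = slope * x + intercept

# Residual: actual value minus predicted value
residuals = y - predicted

# Raw residuals sum to (numerically) zero, so they say nothing about fit quality
print("Sum of residuals         : ", residuals.sum())

# Squaring removes the sign, so the errors accumulate instead of cancelling
rss = np.sum(residuals ** 2)
print("Sum of squared residuals : ", rss)
```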

#### Standard Error

>From the sum of squared residuals, we can find the standard error. The standard error is similar to the standard deviation, but it tries to make an estimate for the whole population of y-values -- even the ones we haven't seen yet that we may want to predict in the future.

>The standard error lets us quickly determine how good or bad a linear model is at prediction.

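>The equation for standard error is $\sqrt{\mathrm{RSS} / (n - 2)}$, where RSS is the sum of squared residuals and n is the number of y-points.

>In other words, you take the sum of squared residuals, divide it by the number of y-points minus two, and then take the square root.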
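
>The calculation can be sketched on made-up data as follows. The snippet also illustrates how this standard error differs from the `stderr` value that `linregress` reports, which is the standard error of the slope: the slope's standard error equals the regression's standard error divided by the square root of the sum of squared deviations of x. The data and variable names are assumptions for illustration.

```python
import numpy as np
from scipy.stats import linregress

# Hypothetical data (invented for illustration only)
x = np.array([6.0, 8.0, 10.0, 12.0, 16.0])
y = np.array([14.0, 18.5, 22.0, 27.0, 35.5])

result = linregress(x, y)
predicted = result.slope * x + result.intercept

# Standard error of the regression: sqrt(RSS / (n - 2))
rss = np.sum((y - predicted) ** 2)
n = len(y)
std_error = np.sqrt(rss / (n - 2))

# Standard error of the slope, derived from the regression's standard error
slope_std_error = std_error / np.sqrt(np.sum((x - x.mean()) ** 2))

print("Standard error of the regression : ", std_error)
print("Standard error of the slope      : ", slope_std_error)
print("linregress stderr                : ", result.stderr)
```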



In [10]:
from scipy.stats import linregress
import numpy as np

# Run the linear regression
# Note: stderr_slope is the standard error of the estimated slope only,
# not the standard error of the regression, so we calculate that ourselves
slope, intercept, r_value, p_value, stderr_slope = linregress(wine_quality["density"], wine_quality["quality"])

predicted_y = np.asarray(slope * wine_quality["density"] + intercept)
squared_residuals = (wine_quality["quality"] - predicted_y) ** 2
rss = sum(squared_residuals)

# Standard error: sqrt(RSS / (n - 2))
stderr = (rss / (len(wine_quality["quality"]) - 2)) ** 0.5

def within_percentage(y, predicted_y, stderr, error_count):
    within = stderr * error_count

    differences = abs(predicted_y - y)
    lower_differences = [d for d in differences if d <= within]
    within_count = len(lower_differences)
    return within_count / len(y)

within_one = within_percentage(wine_quality["quality"], predicted_y, stderr, 1)
within_two = within_percentage(wine_quality["quality"], predicted_y, stderr, 2)
within_three = within_percentage(wine_quality["quality"], predicted_y, stderr, 3)

print("Predicted value within 1 stderr",within_one )
print("Predicted value within 2 stderr",within_two )
print("Predicted value within 3 stderr",within_three )

Predicted value within 1 stderr 0.6845651286239282
Predicted value within 2 stderr 0.9356880359330338
Predicted value within 3 stderr 0.9936708860759493
