# Linear Regression

# Find n, Σy, Σx, Σxy, Σx^2, Σy^2

+ n: the number of data points
+ Σy: the sum of the y values
+ Σx: the sum of the x values
+ Σxy: the sum of the products of the x and y values
+ Σx^2: the sum of the squares of the x values
+ Σy^2: the sum of the squares of the y values

In [10]:
# Check the Python version
! python --version

Python 3.10.4


In [11]:
# Build a pandas DataFrame that contains the data points

import pandas as pd

# List for x
x = [2, 4, 6, 8, 10, 12]

# List for y
y = [1.8, 1.5, 1.4, 1.1, 1.1, 0.9]

# List for xy
# create xy by multiplying paired x and y values in a list comprehension
# (loop variables are named xi, yi so they do not shadow the lists x and y)
xy = [xi * yi for xi, yi in zip(x, y)]

# List for x2
# create x2 by squaring each x value in a list comprehension
x2 = [xi * xi for xi in x]

# List for y2
# create y2 by squaring each y value in a list comprehension
y2 = [yi * yi for yi in y]


# Combine the five lists into a list of row tuples using zip()
list_of_tuples = list(zip(x, y, xy, x2, y2))

# Convert the list of tuples into a pandas DataFrame
df = pd.DataFrame(list_of_tuples,
                  columns=['x', 'y', 'xy', 'x²', 'y²'])

# sample size
n = len(x)

# sum_x (Σx)
sum_x = sum(x)

# sum_y (Σy) 
sum_y = sum(y)

# sum_xy (Σxy) 
sum_xy = sum(xy)

# sum_x2 (Σx²) 
sum_x2 = sum(x2)

# sum_y2 (Σy²)
sum_y2 = sum(y2)

print(f'Summary:\n n = {n}\n Σx = {sum_x},\n Σy = {sum_y},\n Σxy = {sum_xy},\n Σx² = {sum_x2},\n Σy² = {sum_y2}\n')

# print dataframe
df

Summary:
 n = 6
 Σx = 42,
 Σy = 7.799999999999999,
 Σxy = 48.599999999999994,
 Σx² = 364,
 Σy² = 10.680000000000001



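|       | x  | y   | xy   | x²  | y²   |
|:-----:|:--:|:---:|:----:|:---:|:----:|
| **0** | 2  | 1.8 | 3.6  | 4   | 3.24 |
| **1** | 4  | 1.5 | 6.0  | 16  | 2.25 |
| **2** | 6  | 1.4 | 8.4  | 36  | 1.96 |
| **3** | 8  | 1.1 | 8.8  | 64  | 1.21 |
| **4** | 10 | 1.1 | 11.0 | 100 | 1.21 |
| **5** | 12 | 0.9 | 10.8 | 144 | 0.81 |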


# Find Sxy, Sxx, Syy

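+ Sxy: the sum of cross-products of the deviations, Σ(x - x̄)(y - ȳ) = Σxy - (Σx)(Σy)/n; dividing it by n - 1 gives the sample covariance of x and y
+ Sxx: the sum of squared deviations of x, Σ(x - x̄)² = Σx² - (Σx)²/n; dividing it by n - 1 gives the sample variance of x
+ Syy: the sum of squared deviations of y, Σ(y - ȳ)² = Σy² - (Σy)²/n; dividing it by n - 1 gives the sample variance of y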
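
As a quick check on these definitions, the sketch below computes each quantity twice: once from the deviations about the mean and once with the shortcut formula based on the sums. It reuses the six data points from this notebook; the variable names ending in `_dev` and `_short` are illustrative only.

```python
# Data points from this notebook
x = [2, 4, 6, 8, 10, 12]
y = [1.8, 1.5, 1.4, 1.1, 1.1, 0.9]
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n

# Deviation (definitional) form
Sxy_dev = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
Sxx_dev = sum((xi - x_bar) ** 2 for xi in x)
Syy_dev = sum((yi - y_bar) ** 2 for yi in y)

# Shortcut (computational) form based on the sums
Sxy_short = sum(xi * yi for xi, yi in zip(x, y)) - sum(x) * sum(y) / n
Sxx_short = sum(xi * xi for xi in x) - sum(x) ** 2 / n
Syy_short = sum(yi * yi for yi in y) - sum(y) ** 2 / n

print(f'Sxy: {Sxy_dev} vs {Sxy_short}')
print(f'Sxx: {Sxx_dev} vs {Sxx_short}')
print(f'Syy: {Syy_dev} vs {Syy_short}')
```

Both forms agree up to floating-point rounding. The shortcut form subtracts two numbers of similar size, which is one likely source of the trailing digits seen in values such as Syy.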


In [12]:
# Find Sxy, Sxx, and Syy
Sxy = sum_xy - (sum_x * sum_y) / n
Sxx = sum_x2 - (sum_x * sum_x) / n
Syy = sum_y2 - (sum_y * sum_y) / n
print(f'Sxy = {Sxy}, Sxx = {Sxx}, and Syy = {Syy}\n')


Sxy = -6.0, Sxx = 70.0, and Syy = 0.5400000000000045



# General regression equation
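Simple linear regression (one independent variable): $y = a + bx$

Multiple linear regression (k independent variables): $y = b_0 + b_1 x_1 + b_2 x_2 + b_3 x_3 + \dots + b_k x_k$

The population model includes an error term, while the fitted line does not:

$$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$$

$$\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i$$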

In [13]:
# Find b1 and b0
b1 = Sxy / Sxx
b0 = (sum_y / n) - (b1 * (sum_x / n))
print(f'b1 = {b1}, b0 = {b0}\nThe regression equation is y = {b0} + ({b1}x)')

b1 = -0.08571428571428572, b0 = 1.9
The regression equation is y = 1.9 + (-0.08571428571428572x)
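
One way to cross-check the hand-computed coefficients is `scipy.stats.linregress`, which fits the same least-squares line. This is only a verification sketch using the notebook's data, not part of the manual derivation.

```python
from scipy import stats

# Data points from this notebook
x = [2, 4, 6, 8, 10, 12]
y = [1.8, 1.5, 1.4, 1.1, 1.1, 0.9]

# Fit y = intercept + slope * x by ordinary least squares
result = stats.linregress(x, y)

print(f'b1 (slope)     = {result.slope}')
print(f'b0 (intercept) = {result.intercept}')
print(f'r              = {result.rvalue}')
```

The slope and intercept should match b1 and b0 above up to rounding, and `rvalue` carries the sign of the slope.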


# Find SSTO, SSR, SSE, R2, and R

+ SSTO: Total Sum of Squares
+ SSR: Regression Sum of Squares
+ SSE: Error Sum of Squares
+ R2: Coefficient of Determination
+ R: Correlation Coefficient
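
The sums of squares can also be computed directly from the fitted values and residuals, which makes the identity SSTO = SSR + SSE visible. The sketch below assumes the notebook's data and recomputes b0 and b1 so that it runs on its own; `y_hat` is an illustrative name for the fitted values.

```python
# Data points from this notebook
x = [2, 4, 6, 8, 10, 12]
y = [1.8, 1.5, 1.4, 1.1, 1.1, 0.9]
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n

# Least-squares slope and intercept
Sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
Sxx = sum((xi - x_bar) ** 2 for xi in x)
b1 = Sxy / Sxx
b0 = y_bar - b1 * x_bar

# Fitted values on the regression line
y_hat = [b0 + b1 * xi for xi in x]

# Sums of squares from their definitions
SSTO = sum((yi - y_bar) ** 2 for yi in y)               # total variation
SSR = sum((yh - y_bar) ** 2 for yh in y_hat)            # explained by the line
SSE = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))   # left in the residuals

R2 = SSR / SSTO

print(f'SSTO = {SSTO}, SSR + SSE = {SSR + SSE}')
print(f'R² = {R2}')
```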


In [20]:
from math import copysign, sqrt

# Determine SSTO, SSR, SSE, R², and R
# Total sum of squares:
# SSTO = SSR + SSE
SSTO = Syy

# Regression sum of squares (equivalent forms):
# SSR = b1 * Sxy
# SSR = (b1^2) * Sxx
SSR = (Sxy*Sxy) / Sxx

# Error sum of squares:
SSE = Syy - SSR
# SSE = ((Sxx * Syy) - (Sxy*Sxy)) / Sxx

# The coefficient of determination:
R2 = SSR / SSTO
# R2 = 1 - (SSE / SSTO)

print(f'{R2*100}% of the variation in y is explained by the variation in x\n')

# The correlation coefficient:
# use the real-valued math.sqrt (cmath.sqrt returns a complex number)
# and give R the sign of the slope b1
R = copysign(sqrt(R2), b1)

print(f' SSTO = {SSTO},\n SSR = {SSR},\n SSE = {SSE},\n R² = {R2},\n R = {R}\n')

95.23809523809444% of the variation in y is explained by the variation in x

 SSTO = 0.5400000000000045,
 SSR = 0.5142857142857142,
 SSE = 0.02571428571429024,
 R² = 0.9523809523809444,
 R = -0.9759000729485291



# Analysis of Variance (ANOVA) for Regression

| **Source**     | **DF** | **Adj SS** | **Adj MS**        | **F-Value** | **Note**                               |
|:--------------:|:------:|:----------:|:-----------------:|:-----------:|:--------------------------------------:|
| **Regression** | k      | SSR        | MSR = SSR/k       | F = MSR/MSE | k: number of independent variables (x) |
| **Error**      | n-k-1  | SSE        | MSE = SSE/(n-k-1) |             | n: sample size                         |
| **Total**      | n-1    | SSTO       |                   |             |                                        |
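
The table can be filled in for this notebook's data as a short sketch. It assumes k = 1 (one independent variable) and uses the rounded values Sxy = -6.0, Sxx = 70.0 and Syy = 0.54 from the earlier cells; the p-value from `scipy.stats.f.sf` is an addition for illustration and is not computed elsewhere in the notebook.

```python
from scipy import stats

# Values taken from the earlier cells (Syy rounded to 0.54)
n, k = 6, 1
Sxy, Sxx, Syy = -6.0, 70.0, 0.54

SSTO = Syy
SSR = Sxy ** 2 / Sxx
SSE = SSTO - SSR

# Mean squares: each sum of squares divided by its degrees of freedom
MSR = SSR / k
MSE = SSE / (n - k - 1)

# F statistic and its upper-tail p-value with (k, n-k-1) degrees of freedom
F = MSR / MSE
p_value = stats.f.sf(F, k, n - k - 1)

print(f'Regression: DF = {k}, SS = {SSR}, MS = {MSR}, F = {F}')
print(f'Error:      DF = {n - k - 1}, SS = {SSE}, MS = {MSE}')
print(f'Total:      DF = {n - 1}, SS = {SSTO}')
print(f'p-value = {p_value}')
```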

# Assumptions:
+ The errors (the differences between the actual and predicted values) are normally distributed
+ The independent variables (x) are not random

**Note: the t-table is used here instead of the z-table because the population standard deviation is unknown; this holds even when the sample is large.**
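
To illustrate the difference between the two tables, the sketch below compares two-tailed critical values from the t-distribution with the standard normal one. The significance level α = 0.1 and the degrees of freedom shown are example choices; `t_crits` is an illustrative name.

```python
from scipy import stats

alpha = 0.1  # example significance level

# Two-tailed critical value from the standard normal (z) distribution
z_crit = stats.norm.ppf(1 - alpha / 2)

# Two-tailed critical values from the t-distribution for several
# example degrees of freedom
t_crits = {df: stats.t.ppf(1 - alpha / 2, df) for df in (4, 30, 1000)}

print(f'z critical value: {z_crit:.4f}')
for df, t_crit in t_crits.items():
    print(f't critical value (df = {df}): {t_crit:.4f}')
```

The t critical value is larger than the z value for small degrees of freedom and approaches it as the degrees of freedom grow.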

# Hypothesis test:
+ H₀: 𝛽1 = 0 (no linear relationship between x and y)
+ H₁: 𝛽1 ≠ 0 -> rejecting H₀ gives evidence of a linear relationship between x and y

# Critical value (t-table)
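t(α/2; n-2)

`scipy.stats.t.ppf(q, df)`

+ q: the cumulative (lower-tail) probability at which to evaluate the quantile, for example 1 - α/2 for a two-tailed test at significance level α.
+ df: the degrees of freedom.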

In [21]:
# Find t critical

# Import Library
import scipy.stats

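# critical_level is the significance level (α).
# For simple linear regression, degrees_freedom = n - 2, where n = dfT + 1.
# The value 11 used here corresponds to n = 13; for the six data points
# above it would be 4.
critical_level = 0.1
degrees_freedom = 11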

# Right-tailed test to find the T critical value
right_tailed = scipy.stats.t.ppf(q=1-critical_level,df=degrees_freedom)

# Left-tailed test to find the T critical value
left_tailed = scipy.stats.t.ppf(q=critical_level,df=degrees_freedom)

# 2 tailed to find the T critical value
two_tailed = scipy.stats.t.ppf(q=1-critical_level/2,df=degrees_freedom)

print(f'Right-tailed test: {right_tailed}\nLeft-tailed test: {left_tailed}\n2 tailed test: {two_tailed}')

Right-tailed test: 1.3634303180205214
Left-tailed test: -1.3634303180205214
2 tailed test: 1.7958848187036691


# Test statistic
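
A minimal sketch of the test statistic for the slope, assuming the usual simple-regression formula t = b1 / sqrt(MSE / Sxx) with n - 2 degrees of freedom. It uses the rounded values Sxy = -6.0, Sxx = 70.0 and Syy = 0.54 from the earlier cells and the same significance level as `critical_level` above (0.1); the names `se_b1`, `t_stat` and `reject_h0` are illustrative.

```python
from math import sqrt

from scipy import stats

# Values taken from the earlier cells (Syy rounded to 0.54)
n = 6
Sxy, Sxx, Syy = -6.0, 70.0, 0.54

b1 = Sxy / Sxx

# Error variance estimate with n - 2 degrees of freedom
SSE = Syy - Sxy ** 2 / Sxx
MSE = SSE / (n - 2)

# Standard error of the slope and the resulting t statistic
se_b1 = sqrt(MSE / Sxx)
t_stat = b1 / se_b1

# Two-tailed critical value at significance level alpha
alpha = 0.1
t_crit = stats.t.ppf(1 - alpha / 2, n - 2)

# Reject H0 when |t| exceeds the critical value
reject_h0 = abs(t_stat) > t_crit

print(f't = {t_stat}, critical value = ±{t_crit}')
print(f'Reject H0: {reject_h0}')
```

For simple linear regression the square of this t statistic equals the F-value from the ANOVA table, which gives a convenient consistency check.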

In [46]:
import os

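# Convert every Jupyter notebook in the current directory to a .py script
for fname in os.listdir():
    if fname.endswith('.ipynb'):
        # quote the file name so that names containing spaces still work
        os.system(f'jupyter nbconvert "{fname}" --to python')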

[NbConvertApp] Converting notebook EDA.ipynb to python
[NbConvertApp] Writing 33153 bytes to EDA.py
[NbConvertApp] Converting notebook preprocessing.ipynb to python
[NbConvertApp] Writing 10095 bytes to preprocessing.py
[NbConvertApp] Converting notebook quiz_3.ipynb to python
[NbConvertApp] Writing 17083 bytes to quiz_3.py
[NbConvertApp] Converting notebook normality_test.ipynb to python
[NbConvertApp] Writing 9797 bytes to normality_test.py
[NbConvertApp] Converting notebook quiz_2.ipynb to python
[NbConvertApp] ERROR | Notebook JSON is invalid: data.cells[{data__cells_x}] must be valid exactly by one definition (0 matches found)

Failed validating <unset> in notebook['data']['cells']:

On instance:
<unset>
[NbConvertApp] Writing 10573 bytes to quiz_2.py
[NbConvertApp] Converting notebook regression.ipynb to python
[NbConvertApp] Writing 5571 bytes to regression.py
[NbConvertApp] Converting notebook statistical_methods.ipynb to python
[NbConvertApp] Writing 3015 bytes to statistical_

# References
+ https://www.statisticshowto.com/probability-and-statistics/regression-analysis/anova-for-regression/
+ https://www.geeksforgeeks.org/how-to-find-the-t-critical-value-in-python/