# Disclaimer

***This is the first part of a series of notebooks that highlights my work on building a regression model and testing it on the data. Many of the concepts I use in this notebook are already discussed and explained in my previous notebooks, so readers who want to learn more about them can visit those notebooks through the links given below. That said, the code in this notebook is largely self-explanatory and should not require much background reading.***

# Previous notebooks on the same dataset:

1. [Elemental approach to finding correlation](https://www.kaggle.com/ritikpnayak/elemental-approach-to-finding-correlation)

2. [Computing the magnitude of skewness in Maths score](https://www.kaggle.com/ritikpnayak/computing-the-magnitude-of-skewness-in-maths-score)

# Previous notebooks on the same subject:

1. [Introduction to Hypothesis Testing and Estimation](https://www.kaggle.com/ritikpnayak/introduction-to-hypothesis-testing-and-estimation)

# Here we go!

# ***Throughout the series of notebooks, I'll perform regression analysis to predict the reading scores using the writing scores***

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
import matplotlib.pyplot as plt
from matplotlib import style
style.use('fivethirtyeight')

In [None]:
df = pd.read_csv('/kaggle/input/students-performance-in-exams/StudentsPerformance.csv')

In [None]:
df.head()

# 1. Is correlation "by chance"?

1. In my previous notebook, I found that there is a very strong correlation between the reading scores and the writing scores.
2. The correlation is linear and its magnitude (Pearson's coefficient) is greater than 0.9.
3. In this section, we'll analyse whether the correlation holds for the population (the whole of the data) or has merely occurred by chance.
4. I'll perform Hypothesis Testing for that purpose.

Please refer to my notebook [Elemental approach to finding correlation](https://www.kaggle.com/ritikpnayak/elemental-approach-to-finding-correlation) to see the computation of the correlation.

Also refer to my notebook [Introduction to Hypothesis Testing and Estimation](https://www.kaggle.com/ritikpnayak/introduction-to-hypothesis-testing-and-estimation) to learn more about Hypothesis Testing.

***What is the Null Hypothesis?***

1. ***Our Null Hypothesis:*** A correlation of magnitude as strong as 0.9 has occurred ***by chance***.
2. If the p-value under the null hypothesis turns out to be very small, the null hypothesis is rejected.
3. I hope the null hypothesis is rejected, because I really want a correlation of such strong magnitude to be statistically significant.

In [None]:
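class HypothesisTest(object):
    """Base class for simulation-based hypothesis tests."""
    
    def __init__(self, data):
        self.data = data
        self.MakeModel()
        self.actual = self.TestStatistic(data)
        
    def PValue(self, iters = 1000):
        # Fraction of simulated test statistics at least as extreme as the observed one
        self.test_stats = [self.TestStatistic(self.RunModel()) for _ in range(iters)]
        count = sum(1 for x in self.test_stats if x >= self.actual)
        return count / iters
    
    def TestStatistic(self, data):
        # Subclasses must return the test statistic for the given data
        raise NotImplementedError()
        
    def MakeModel(self):
        pass
    
    def RunModel(self):
        # Subclasses must return one dataset simulated under the null hypothesis
        raise NotImplementedError()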

In [None]:
class CorrPermute(HypothesisTest):
    
    def TestStatistic(self, data):
        xs, ys = data
        test_stat = abs(correlation(xs, ys))
        return test_stat
    
    def RunModel(self):
        xs, ys = self.data
        xs = np.random.permutation(xs)
        return xs, ys
    
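def correlation(x, y):
    # Sample standard deviations (ddof=1) to match the (n - 1) used in covariance()
    std_x = np.std(x, ddof=1)
    std_y = np.std(y, ddof=1)
    if std_x > 0 and std_y > 0:
        return covariance(x, y) / std_x / std_y
    else:
        return 0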
    
def de_mean(x):
    x_bar = np.mean(x)
    return [x_i - x_bar for x_i in x]

def covariance(x, y):
    n = len(x)
    return np.dot(de_mean(x), de_mean(y)) / (n - 1)

In [None]:
rs = df['reading score']
ws = df['writing score']

ct = CorrPermute((rs, ws))
pvalue = ct.PValue()

In [None]:
print('The P-Value is: ', pvalue)

***What is the conclusion?***

1. The estimated P value for the null hypothesis is 0: none of the 1000 permutations produced a correlation as strong as the observed one, so the true p-value is smaller than about 1/1000.
2. This means that we reject the null hypothesis: a correlation as strong as 0.9 between the 2 variables (reading scores and writing scores) is very unlikely to have occurred by chance.
3. In other words, we can say that the correlation is statistically significant.
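
A simulated p-value of exactly 0 is an artefact of the finite number of permutations. As a minimal sketch (not the method used above), one common adjustment counts the observed statistic as one of the permutations, so the estimate can never fall below 1 / (iters + 1). The function name `permutation_pvalue` and the synthetic data are assumptions made for illustration; they are not the exam scores.

```python
import numpy as np

def permutation_pvalue(xs, ys, iters=1000, seed=0):
    """Permutation p-value for |Pearson r| with the +1 correction (illustrative helper)."""
    rng = np.random.default_rng(seed)
    xs = np.asarray(xs, dtype=float)
    ys = np.asarray(ys, dtype=float)
    actual = abs(np.corrcoef(xs, ys)[0, 1])

    count = 0
    for _ in range(iters):
        # Shuffling xs breaks any real link between xs and ys
        shuffled = rng.permutation(xs)
        if abs(np.corrcoef(shuffled, ys)[0, 1]) >= actual:
            count += 1

    # Count the observed data as one extra permutation, so the result is never 0
    return (count + 1) / (iters + 1)

# Synthetic, strongly correlated data (hypothetical, for illustration only)
rng = np.random.default_rng(42)
x = rng.normal(70, 15, size=200)
y = x + rng.normal(0, 4, size=200)

p = permutation_pvalue(x, y)
print('Corrected permutation p-value:', p)
```

With this correction, the smallest value that can be reported for 1000 permutations is 1/1001, which reads more honestly than 0.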

# 2. Using Linear Least Squares 

***What are linear least squares?***

1. The correlation measures the strength and sign of a relationship.
2. It doesn't measure the slope.
3. There are several ways to measure the slope.
4. The most common method is a "linear least squares fit".
5. It chooses the line that minimizes the "mean squared error (MSE)".
6. We'll use the general equation of a line: ***y = mx + c***, where m is the slope and c is the intercept.

To know more about it, please refer to this link: [Linear least squares](https://en.wikipedia.org/wiki/Linear_least_squares)
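
As a quick sanity check on the idea, the closed-form slope (covariance divided by variance) and intercept can be compared with NumPy's own least squares routine, `np.polyfit`. This is only a sketch on synthetic data; the variable names and the numbers 5.0, 0.9 and 3 are assumptions chosen for illustration, not values from the exam dataset.

```python
import numpy as np

# Synthetic data that roughly follows y = 5 + 0.9 * x (hypothetical values)
rng = np.random.default_rng(0)
x = rng.uniform(0, 100, size=500)
y = 5.0 + 0.9 * x + rng.normal(0, 3, size=500)

# Closed form: slope = cov(x, y) / var(x), using the same (n - 1) denominator in both
slope = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
inter = y.mean() - slope * x.mean()

# np.polyfit returns coefficients from the highest degree down: [slope, intercept]
slope_np, inter_np = np.polyfit(x, y, deg=1)

print('closed form :', inter, slope)
print('np.polyfit  :', inter_np, slope_np)
```

The two pairs should agree up to floating-point error, provided the covariance and the variance use the same denominator.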

***We'll get the intercept and slope of the line (using, of course, linear least squares) that we'll use to predict the reading scores***

In [None]:
def LeastSquares(xs, ys):
    # Sample variance (ddof=1) to match the (n - 1) used in covariance()
    meanx, varx = np.mean(xs), np.var(xs, ddof=1)
    meany = np.mean(ys)
    
    slope = covariance(xs, ys) / varx
    inter = meany - slope * meanx
    
    return inter, slope

def FitLine(xs, inter, slope):
    fit_xs = np.sort(xs)
    fit_ys = inter + slope * fit_xs
    
    return fit_xs, fit_ys

In [None]:
inter, slope = LeastSquares(ws, rs)
fit_xs, fit_ys = FitLine(ws, inter, slope)

print('intercept is: {} and slope is: {}'.format(inter, slope))

# 3. Residuals

***What are residuals?***

1. A residual is the deviation of an actual value of y from the corresponding fitted value of y.
2. In our example, the actual values of y are the reading scores (rs), whereas the fitted values of y lie on the fitted line (inter + slope * ws).
3. The actual equation of the line that we use in regression is: ***ys = intercept + slope * xs + residuals***

To know more about it, please refer to this link: [Errors and residuals](https://en.wikipedia.org/wiki/Errors_and_residuals)
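
Two properties make residuals a useful check on a least squares fit: they average to zero, and they are uncorrelated with x. The sketch below illustrates both on synthetic data; it uses `np.polyfit` for the fit, and every name and number in it is an assumption for illustration rather than part of this notebook's data.

```python
import numpy as np

# Synthetic data (hypothetical values, not the exam scores)
rng = np.random.default_rng(1)
x = rng.uniform(0, 100, size=300)
y = 10.0 + 0.8 * x + rng.normal(0, 5, size=300)

# Least squares fit; np.polyfit returns [slope, intercept] for deg=1
slope, inter = np.polyfit(x, y, deg=1)

# Residual = actual value minus fitted value
residuals = y - (inter + slope * x)

mean_res = residuals.mean()
corr_res_x = np.corrcoef(residuals, x)[0, 1]

print('mean of residuals       :', mean_res)
print('corr(residuals, x)      :', corr_res_x)
```

Both printed values should be zero up to floating-point error. If they are clearly not, the line is probably not the least squares line.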

In [None]:
def Residuals(xs, ys, inter, slope):
    xs = np.asarray(xs)
    ys = np.asarray(ys)
    res = ys - (inter + slope * xs)
    return res

In [None]:
res = Residuals(ws, rs, inter, slope)

df['residuals'] = res

# 3.1. Is it better to predict the reading scores with or without the writing scores?

In [None]:
print('RMSE if we predict the reading scores using the writing scores: ', np.std(res))
print('RMSE if we predict the reading scores without using the writing scores: ', np.std(rs))

***How to interpret the above values?***

1. The "Root Mean Square Error (RMSE)" is more than 4 ***if we use the writing scores*** to predict the values of the reading scores.
2. The "Root Mean Square Error (RMSE)" is more than 14 ***if we do not use the writing scores***, that is, if we always predict the mean reading score.

***Therefore, as the RMSE is much lower in the former situation, it is better to use the writing scores to predict the reading scores.***
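
The comparison above uses a standard deviation as the RMSE. That works because least squares residuals have mean zero, and because the best guess without any predictor is the mean. The sketch below spells this out with an explicit `rmse` helper on synthetic data; the helper and the data are assumptions for illustration only.

```python
import numpy as np

def rmse(actual, predicted):
    """Root mean squared error between actual and predicted values (illustrative helper)."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return np.sqrt(np.mean((actual - predicted) ** 2))

# Synthetic data (hypothetical values, not the exam scores)
rng = np.random.default_rng(2)
x = rng.uniform(0, 100, size=400)
y = 8.0 + 0.9 * x + rng.normal(0, 4, size=400)

slope, inter = np.polyfit(x, y, deg=1)

# Model: predict y from x. Baseline: always predict the mean of y.
rmse_model = rmse(y, inter + slope * x)
rmse_baseline = rmse(y, np.full_like(y, y.mean()))

print('RMSE with the predictor   :', rmse_model)
print('RMSE without the predictor:', rmse_baseline)
```

Here `rmse_model` matches `np.std` of the residuals and `rmse_baseline` matches `np.std(y)`, which is why the shortcut with `np.std` is valid for a least squares fit.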

# 4. Coefficient of Determination

***What is it?***

1. It is denoted by R^2 or "R squared".
2. It is a metric used to determine how good our model is: it measures the fraction of the variance in y that the model explains.
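
For a least squares line with a single predictor, R^2 equals the square of Pearson's correlation coefficient, which gives an easy cross-check. The following sketch shows this on synthetic data; the names and numbers are assumptions for illustration, not results from the exam dataset.

```python
import numpy as np

# Synthetic data (hypothetical values, not the exam scores)
rng = np.random.default_rng(3)
x = rng.uniform(0, 100, size=400)
y = 12.0 + 0.85 * x + rng.normal(0, 6, size=400)

# Least squares fit and its residuals
slope, inter = np.polyfit(x, y, deg=1)
residuals = y - (inter + slope * x)

# R^2 as one minus the ratio of residual variance to total variance
r_squared = 1 - np.var(residuals) / np.var(y)

# Pearson's correlation coefficient between x and y
r = np.corrcoef(x, y)[0, 1]

print('R^2          :', r_squared)
print('Pearson r ** 2:', r ** 2)
```

The two numbers agree only when the line is the true least squares fit, so a mismatch is a hint that the slope or intercept was computed inconsistently.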

In [None]:
r_squared = 1 - (np.var(res) / np.var(rs))

print('The Coefficient of Determination or r^2 is: ', r_squared)

***The value of R^2 is more than 0.9, which means that the line explains more than 90% of the variance in the reading scores. The fit is strong (though not perfect), so we can use this line to predict the reading scores.***

# 5. Plotting the line

In [None]:
plt.figure(figsize = (15, 8))

plt.xlabel('Writing scores')
plt.ylabel('Reading scores')

plt.plot(fit_xs, fit_ys, color = 'black')
plt.scatter(ws, rs, color = 'green')

***As expected, our line fits the data well.***