# Disclaimer

***This is the second part of a series of notebooks that perform a regression analysis on this dataset. The first notebook of the series is [Regression model-1: The best fit line with R2>0.9](https://www.kaggle.com/ritikpnayak/regression-model-1-the-best-fit-line-with-r2-0-9).***

# Previous notebooks on the same dataset:

1. [Elemental approach to finding correlation](https://www.kaggle.com/ritikpnayak/elemental-approach-to-finding-correlation)

2. [Computing the magnitude of skewness in Maths score](https://www.kaggle.com/ritikpnayak/computing-the-magnitude-of-skewness-in-maths-score)

# Previous notebooks on the same subject:

1. [Introduction to Hypothesis Testing and Estimation](https://www.kaggle.com/ritikpnayak/introduction-to-hypothesis-testing-and-estimation)

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
df = pd.read_csv('/kaggle/input/students-performance-in-exams/StudentsPerformance.csv')

# 1. What am I going to do in this notebook?

1. In the first part, I found the fit line for our data.
2. The R-squared for our line was as high as 0.9.
3. In this notebook, I analyze the errors associated with the intercept and slope of that line.
4. Specifically, I'll look at the standard error and the confidence interval of both the intercept and the slope.

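***What is standard error?***

1. The standard error (SE) is a measure of how far we expect an estimate to be off, on average.
2. Here I compute it as the Root Mean Square Error (RMSE) of the resampled estimates around the value found in the previous notebook.
3. It is NOT the standard deviation of the data: it describes the spread of an estimate (the intercept or the slope) across repeated samples, not the spread of the individual scores.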
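
To make the distinction concrete, here is a minimal sketch on synthetic scores (an assumption for illustration, not the students dataset). It estimates the mean on bootstrap resamples and compares the spread of that estimate with the spread of the scores themselves. The helper name `rmse` and the seed are hypothetical choices.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, chosen only for reproducibility

# Synthetic scores (made up for illustration, not the real dataset)
scores = rng.normal(loc=70, scale=15, size=1000)

def rmse(estimates, actual):
    """Root mean square distance of the estimates from a reference value."""
    errors = np.asarray(estimates) - actual
    return np.sqrt(np.mean(errors ** 2))

# Re-estimate the mean on 200 bootstrap resamples of the scores
boot_means = [rng.choice(scores, size=len(scores), replace=True).mean()
              for _ in range(200)]

std_dev = scores.std()                     # spread of the individual scores
std_err = rmse(boot_means, scores.mean())  # spread of the estimated mean

print('Standard deviation of scores:', std_dev)
print('Standard error of the mean:  ', std_err)
```

The standard error should come out far smaller than the standard deviation, because averaging over 1000 scores removes most of the score-to-score variation.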

***What is a confidence interval?***

1. A confidence interval (CI) is a range that includes a given fraction of the sampling distribution.
2. For instance, the 90 percent confidence interval is the range between the 5th and 95th percentiles of the sampling distribution, so we expect about 90 out of every 100 estimates to fall inside it.
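
The percentile idea can be sketched as follows. The estimates here are made-up draws standing in for a sampling distribution, and `percentile_ci` is a hypothetical helper, not something defined elsewhere in this notebook.

```python
import numpy as np

def percentile_ci(estimates, level=90):
    """Return the central `level` percent range of the estimates."""
    tail = (100 - level) / 2
    low = np.percentile(estimates, tail)
    high = np.percentile(estimates, 100 - tail)
    return low, high

rng = np.random.default_rng(0)  # fixed seed, chosen only for reproducibility

# Made-up estimates standing in for a sampling distribution of a slope
estimates = rng.normal(loc=0.9, scale=0.01, size=1001)

low, high = percentile_ci(estimates, level=90)

# Fraction of the estimates that land inside the interval
inside = np.mean((estimates >= low) & (estimates <= high))

print('90% CI:', low, '-', high)
print('Fraction of estimates inside the CI:', inside)
```

By construction, roughly 90% of the estimates fall between `low` and `high`.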

In [None]:
rs = df['reading score']
ws = df['writing score']

# 2. Sampling Distributions

To assess the SE, I'll answer the question:

***"If we run this experiment again, how much variability do we expect in the estimates?"***

In [None]:
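def SamplingDistributions(df, iters=101):
    """Resample the rows `iters` times and refit the line on each resample."""
    t = []

    for _ in range(iters):
        sample = ResampleRows(df)
        rscore = sample['reading score']
        wscore = sample['writing score']
        # x = writing score, y = reading score
        estimates = LeastSquares(wscore, rscore)
        t.append(estimates)

    # Unzip the (intercept, slope) pairs into two separate sequences
    inters, slopes = zip(*t)
    return inters, slopes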

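def LeastSquares(xs, ys):
    """Return the intercept and slope of the least-squares line y = inter + slope * x."""
    # ddof=1 divides by n - 1, matching the normalization used in covariance()
    meanx, varx = np.mean(xs), np.var(xs, ddof=1)
    meany = np.mean(ys)

    slope = covariance(xs, ys) / varx
    inter = meany - slope * meanx

    return inter, slope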

def de_mean(x):
    x_bar = np.mean(x)
    return [x_i - x_bar for x_i in x]

def covariance(x, y):
    n = len(x)
    return np.dot(de_mean(x), de_mean(y)) / (n - 1)

def ResampleRows(df):
    return SampleRows(df, len(df), replace = True)

def SampleRows(df, nrows, replace=False):
    indices = np.random.choice(df.index, nrows, replace=replace)
    sample = df.loc[indices]
    return sample

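***In short: `SamplingDistributions` resamples the rows of the dataset with replacement, fits a least-squares line to each resample, and collects the resulting intercepts and slopes.***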
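
One way to gain confidence in a hand-written least-squares routine is to compare it with `np.polyfit` on data where the answer is known. The sketch below does that on synthetic data (an assumption for illustration). The function `least_squares` is a hypothetical stand-in written with NumPy's own covariance and variance, both normalized by n - 1.

```python
import numpy as np

rng = np.random.default_rng(1)  # fixed seed, chosen only for reproducibility

# Synthetic data generated from a known line: y = 5 + 0.9 * x + noise
xs = rng.uniform(0, 100, size=500)
ys = 5.0 + 0.9 * xs + rng.normal(0, 4, size=500)

def least_squares(xs, ys):
    """Closed-form intercept and slope of the least-squares line."""
    # np.cov(...)[0, 1] and np.var(..., ddof=1) both divide by n - 1
    slope = np.cov(xs, ys)[0, 1] / np.var(xs, ddof=1)
    inter = np.mean(ys) - slope * np.mean(xs)
    return inter, slope

inter, slope = least_squares(xs, ys)

# np.polyfit returns coefficients from the highest power down: [slope, intercept]
ref_slope, ref_inter = np.polyfit(xs, ys, deg=1)

print('Closed form:', inter, slope)
print('np.polyfit: ', ref_inter, ref_slope)
```

The two pairs should agree to within floating-point error, and both should land near the intercept 5 and slope 0.9 used to generate the data.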

In [None]:
def RMSE(estimates, actual):
    e2 = [(estimate - actual) ** 2 for estimate in estimates]
    mse = np.mean(e2)
    return mse ** (1/2)

In [None]:
# Run the resampling once so that each intercept is paired with its own slope
inters, slopes = SamplingDistributions(df)

print('Mean of intercepts is: ', np.mean(inters))
print('90% confidence interval of intercepts is: ', np.percentile(inters, 5), '-', np.percentile(inters, 95))
print('Standard error of intercepts is: ', RMSE(inters, 6.688023759635257))
print('\n')
print('Mean of slopes is: ', np.mean(slopes))
print('90% confidence interval of slopes is: ', np.percentile(slopes, 5), '-', np.percentile(slopes, 95))
print('Standard error of slopes is: ', RMSE(slopes, 0.9181087994881232))

In the line

**print('Standard error of intercepts is: ', RMSE(inters, 6.688023759635257))**

the number 6.6880... is the intercept of the line that I drew in my previous notebook (the fit line for our data).

Similarly, in the line

**print('Standard error of slopes is: ', RMSE(slopes, 0.9181087994881232))**

the number 0.9181... is the slope that I found in the previous notebook.
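
Because the RMSE is measured around the previous notebook's value rather than around the mean of the resampled estimates, it can be split into two parts. Writing $\hat\theta_1, \dots, \hat\theta_m$ for the resampled estimates, $\bar\theta$ for their mean, and $\theta_0$ for the reference value (these symbols are introduced here only for illustration):

$$
\mathrm{RMSE}^2 = \frac{1}{m}\sum_{i=1}^{m}\left(\hat\theta_i - \theta_0\right)^2 = \frac{1}{m}\sum_{i=1}^{m}\left(\hat\theta_i - \bar\theta\right)^2 + \left(\bar\theta - \theta_0\right)^2
$$

The first term is the variance of the estimates and the second is the squared gap between their mean and the reference value. When the mean of the estimates is close to the reference value, the RMSE is essentially the standard deviation of the sampling distribution.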

# 3. Conclusion

1. The SE for the intercept and the slope is 0.5 and 0.0075 respectively.
2. This means that we expect the intercept and the slope to be off by about 0.5 and 0.0075 respectively, on average.
3. The 90% confidence intervals are narrow: 5.98 to 7.61 for the intercept and 0.90 to 0.92 for the slope.
4. In other words, about 90% of the 101 resampled estimates fall within these ranges. Both intervals sit close to the mean of the estimates, which indicates that the estimates vary little from sample to sample and that the fitted intercept and slope are reliable.
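
As a rough cross-check, the SE can be turned into an approximate 90% interval of the form estimate ± 1.645 × SE. This is a sketch that assumes the sampling distribution is roughly normal, which the notebook does not verify. It uses the rounded values quoted above, and `normal_ci` is a hypothetical helper.

```python
Z_90 = 1.645  # two-sided 90% quantile of the standard normal distribution

def normal_ci(estimate, std_err, z=Z_90):
    """Approximate CI under a normality assumption: estimate +/- z * SE."""
    return estimate - z * std_err, estimate + z * std_err

# Rounded values from this notebook: fitted intercept and slope, and their SEs
inter_ci = normal_ci(6.688, 0.5)
slope_ci = normal_ci(0.9181, 0.0075)

print('Approximate 90% CI of the intercept:', inter_ci)
print('Approximate 90% CI of the slope:    ', slope_ci)
```

These intervals come out at roughly 5.87 to 7.51 for the intercept and 0.906 to 0.930 for the slope, in the same neighbourhood as the percentile intervals reported above. Exact agreement is not expected, since the bootstrap results change from run to run.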

# ***I'll use the same intercept and slope to predict the results***