In [1]:
import pandas as pd
import statsmodels.api as sm
from IPython.display import Image  # needed for the Image(...) call in a later cell

## Simple Linear Regression
### A comparison between two variables

<strong>Examples</strong>:
<ul>
    <li>price vs. sales</li>
    <li>temperature vs. humidity</li>
    <li>height vs. weight</li>
    <li>hours studying vs. test grades</li>
</ul>

## Scatterplots
### Most commonly used for comparing two quantitative variables

- <strong>Response Variable (y)</strong> - The variable we're interested in predicting
- <strong>Explanatory Variable (x)</strong> - The variable used to predict the response

#### The correlation in a scatterplot is read from the strength and direction of the trend in the observed points

## Correlation coefficient
<p>A measure of the strength and direction of a linear relationship</p>
<p><strong>Correlation strength boundaries</strong> (absolute value of the coefficient; the sign gives the direction)</p>
<ul>
    <li><strong>Strong</strong>: 0.7 - 1.0</li>
    <li><strong>Moderate</strong>: 0.3 - 0.7</li>
    <li><strong>Weak</strong>: 0.0 - 0.3</li>
</ul>
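As a minimal sketch of how the coefficient and the boundaries above fit together, the following computes Pearson's correlation by hand and labels it. The helper names `pearson_r` and `describe_correlation` are hypothetical and the data are made up for illustration. Because the listed ranges share their endpoints, this sketch assigns 0.7 to "strong" and 0.3 to "moderate", which is an assumption.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def describe_correlation(r):
    """Label r by strength (from |r|) and direction (from its sign)."""
    if r == 0:
        return "none"
    size = abs(r)
    if size >= 0.7:
        strength = "strong"
    elif size >= 0.3:
        strength = "moderate"
    else:
        strength = "weak"
    direction = "positive" if r > 0 else "negative"
    return f"{strength} {direction}"

# Made-up illustrative data: hours studying vs. test grades
hours = [1, 2, 3, 4, 5]
grades = [52, 60, 61, 70, 78]
r = pearson_r(hours, grades)
print(round(r, 3), describe_correlation(r))
```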

In [2]:
Image('https://i.pinimg.com/originals/85/e6/a9/85e6a9e41b520d6984457e0748b5ef2b.jpg')

<IPython.core.display.Image object>

### Lines

<p><strong>Intercept (<em>b</em><sub>0</sub>)</strong> - Expected value of response variable (y) when explanatory variable (x) is 0</p>
<p><strong>Slope (<em>b</em><sub>1</sub>)</strong> - Expected change in the response variable (y) for each one-unit increase in the explanatory variable (x)</p>
<p><strong>Best fit - <em>y</em> = <em>b</em><sub>0</sub> + <em>b</em><sub>1</sub><em>x</em></strong></p>
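With a single explanatory variable, the least-squares intercept and slope have a closed form: the slope is the sum of cross-deviations divided by the sum of squared deviations of x, and the intercept makes the line pass through the point of means. The sketch below uses a hypothetical helper `fit_line` on made-up data that lie exactly on a line, so the recovered coefficients are easy to check by eye.

```python
def fit_line(xs, ys):
    """Return (b0, b1) for the least-squares line y = b0 + b1 * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b1 = sxy / sxx               # slope
    b0 = mean_y - b1 * mean_x    # intercept: line passes through (mean_x, mean_y)
    return b0, b1

# Made-up data generated from y = 1 + 2x
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]
b0, b1 = fit_line(xs, ys)
print(b0, b1)
```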



### Fitting a Regression Line

In [12]:
df = pd.read_csv('./price_by_area.csv')
df.head()

Unnamed: 0.1,Unnamed: 0,price,area
0,0,598291,1188
1,1,1744259,3512
2,2,571669,1134
3,3,493675,1940
4,4,1101539,2208


In [14]:
df['intercept'] = 1

In [15]:
df.head()

Unnamed: 0.1,Unnamed: 0,price,area,intercept
0,0,598291,1188,1
1,1,1744259,3512,1
2,2,571669,1134,1
3,3,493675,1940,1
4,4,1101539,2208,1


In [17]:
df[['intercept', 'area']]

Unnamed: 0,intercept,area
0,1,1188
1,1,3512
2,1,1134
3,1,1940
4,1,2208
...,...,...
6023,1,757
6024,1,3540
6025,1,1518
6026,1,2270


In [18]:
# Ordinary Least Squares (OLS)
lm = sm.OLS(df['price'], df[['intercept', 'area']])  # sm.OLS(response y, [intercept column, explanatory x])
results = lm.fit()  # estimate the coefficients from the data
results.summary()  # display the regression summary

0,1,2,3
Dep. Variable:,price,R-squared:,0.678
Model:,OLS,Adj. R-squared:,0.678
Method:,Least Squares,F-statistic:,12690.0
Date:,"Tue, 28 Jul 2020",Prob (F-statistic):,0.0
Time:,13:57:26,Log-Likelihood:,-84517.0
No. Observations:,6028,AIC:,169000.0
Df Residuals:,6026,BIC:,169100.0
Df Model:,1,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
intercept,9587.8878,7637.479,1.255,0.209,-5384.303,2.46e+04
area,348.4664,3.093,112.662,0.000,342.403,354.530

0,1,2,3
Omnibus:,368.609,Durbin-Watson:,2.007
Prob(Omnibus):,0.0,Jarque-Bera (JB):,349.279
Skew:,0.534,Prob(JB):,1.43e-76
Kurtosis:,2.499,Cond. No.,4930.0


<p><strong>Intercept (<em>b</em><sub>0</sub>)</strong>: 9587.8878</p>
<p><strong>Slope (<em>b</em><sub>1</sub>)</strong>: 348.4664</p>
<p><strong>Predicted home price</strong>: <em>y</em> = 9588 + 348<em>x</em></p>
<p><strong>R-squared</strong>: 0.678</p>

<p>If a home has an area of 0, the model predicts a price of about 9588.</p>
<p>For every 1-unit increase in area, the predicted price increases by about 348.</p>
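The two interpretations can be checked by plugging areas into the fitted line. This is a small sketch using the coefficients from the summary table; `predict_price` is a hypothetical helper, and 1188 is the area of the first row shown in `df.head()`.

```python
B0 = 9587.8878  # intercept from the summary table
B1 = 348.4664   # slope for area from the summary table

def predict_price(area):
    """Predicted price from the fitted line y = b0 + b1 * x."""
    return B0 + B1 * area

for area in (0, 1188, 1189):
    print(area, round(predict_price(area), 2))

# Difference between two areas one unit apart equals the slope
print(round(predict_price(1189) - predict_price(1188), 4))
```

The first row of the data has an observed price of 598291 at this area, so the prediction for an individual home can differ considerably from the actual price.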

### Hypothesis Testing
<p><strong>H<sub>0</sub></strong>: <em>b</em><sub>1</sub> = 0</p>
<p><strong>H<sub>1</sub></strong>: <em>b</em><sub>1</sub> $\neq$ 0</p>

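<p>The <strong>p-value</strong> for a coefficient indicates whether the corresponding variable is useful for predicting the response: a small p-value (conventionally below 0.05) is evidence against H<sub>0</sub>.</p>
<p>The p-value for area is reported as 0.000, so area is statistically significant for predicting price.</p>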
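The t statistic in the summary table is the coefficient divided by its standard error, and the p-value is the two-sided tail probability of that statistic. The sketch below reproduces both from the table values. It uses a standard normal approximation to the t distribution, which is an assumption that is reasonable here only because there are 6026 residual degrees of freedom. The helper names are hypothetical.

```python
import math

def t_statistic(coef, std_err):
    """Test statistic for H0: coefficient = 0."""
    return coef / std_err

def two_sided_p_normal(t):
    """Two-sided p-value under a standard normal approximation."""
    return math.erfc(abs(t) / math.sqrt(2))

# (coef, std err) pairs copied from the summary table
rows = {"intercept": (9587.8878, 7637.479), "area": (348.4664, 3.093)}

for name, (coef, se) in rows.items():
    t = t_statistic(coef, se)
    p = two_sided_p_normal(t)
    print(f"{name}: t = {t:.3f}, p = {p:.3f}, significant at 0.05: {p < 0.05}")
```

By the same rule, the intercept's p-value of 0.209 in the table means the intercept is not significantly different from 0 at the 0.05 level.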

### How well does the line fit the data?

<strong>R-squared</strong>: proportion of the variance in the response (y) explained by the model
<ul>
    <li>The closer the value is to 1, the better the model fits</li>
    <li>The closer the value is to 0, the worse the model fits</li>
    <li>In simple linear regression, the R-squared value is the square of the correlation coefficient</li>
    <li>Can be read as "67.8% of the variance in price is explained by the area of the house". The remaining 32.2% of the variance in price is not explained by area.</li>
</ul>
</ul>
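Since R-squared is the square of the correlation coefficient in simple linear regression, the correlation between area and price can be recovered from the summary table. In this short sketch, the sign of the correlation is taken from the sign of the slope.

```python
import math

r_squared = 0.678  # R-squared from the summary table
slope = 348.4664   # slope for area; its sign gives the sign of r

unexplained = 1 - r_squared                         # share of variance not explained by area
r = math.copysign(math.sqrt(r_squared), slope)      # correlation coefficient

print(f"explained: {r_squared:.1%}, unexplained: {unexplained:.1%}, r = {r:.3f}")
```

A correlation of this size falls in the "strong" range of the boundaries listed earlier.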