In [56]:
import pandas as pd
import statsmodels.api as sm
import numpy as np

## Simple Linear Regression
### A comparison between two variables

<strong>Examples</strong>:
<ul>
    <li>price vs. sales</li>
    <li>temperature vs. humidity</li>
    <li>height vs. weight</li>
    <li>hours studying vs. test grades</li>
</ul>

## Scatterplots
### Most commonly used for comparing two quantitative variables

- <strong>Response Variable (y)</strong> - The variable we're interested in predicting
- <strong>Explanatory Variable (x)</strong> - The variable used to predict the response

#### Scatterplots are read by the strength and direction of the relationship among the observed points

## Correlation coefficient
<p>Measures the strength and direction of a linear relationship; it ranges from -1 to 1</p>
<p><strong>Correlation strength boundaries (by absolute value)</strong></p>
<ul>
    <li><strong>Strong</strong>: 0.7 - 1.0</li>
    <li><strong>Moderate</strong>: 0.3 - 0.7</li>
    <li><strong>Weak</strong>: 0.0 - 0.3</li>
</ul>
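
A minimal sketch of how the boundaries above could be applied, assuming the Pearson correlation coefficient and treating the cutoffs as applying to its absolute value. The `hours` and `grades` arrays are made-up toy data, and `strength` is a hypothetical helper, not part of any library.

```python
import numpy as np

# Hypothetical toy data (not taken from the notebook's datasets)
hours = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
grades = np.array([52.0, 60.0, 61.0, 75.0, 80.0])

# Pearson correlation coefficient: off-diagonal entry of the 2x2 correlation matrix
r = np.corrcoef(hours, grades)[0, 1]


def strength(r):
    """Label a correlation using the boundaries listed above (absolute value)."""
    magnitude = abs(r)
    if magnitude >= 0.7:
        return "strong"
    if magnitude >= 0.3:
        return "moderate"
    return "weak"


direction = "positive" if r > 0 else "negative"
print(f"r = {r:.3f}: {strength(r)} {direction} relationship")
```

The sign of `r` gives the direction and its magnitude gives the strength, so a value of -0.8 would be read as strong and negative.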

### Lines

<p><strong>Intercept (<em>b</em><sub>0</sub>)</strong> - Expected value of response variable (y) when explanatory variable (x) is 0</p>
<p><strong>Slope (<em>b</em><sub>1</sub>)</strong> - Expected change in the response variable (y) for each 1-unit increase in the explanatory variable (x)</p>
<p><strong>Best fit - <em>y</em> = <em>b</em><sub>0</sub> + <em>b</em><sub>1</sub><em>x</em></strong></p>
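
As a rough illustration of where the intercept and slope come from, the sketch below uses the standard least-squares formulas for a single explanatory variable. The `x` and `y` arrays are made-up toy data that lie exactly on the line y = 1 + 2x, so the result is easy to check by eye.

```python
import numpy as np

# Hypothetical toy data lying exactly on y = 1 + 2x
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.0, 5.0, 7.0, 9.0, 11.0])

# Slope: how x and y vary together, divided by how x varies on its own
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)

# Intercept: forces the fitted line through the point (mean of x, mean of y)
b0 = y.mean() - b1 * x.mean()

print(f"y = {b0:.2f} + {b1:.2f}x")
```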



### Fitting a Regression Line

In [30]:
df = pd.read_csv('./price_by_area.csv')
df.head()

Unnamed: 0.1,Unnamed: 0,price,area
0,0,598291,1188
1,1,1744259,3512
2,2,571669,1134
3,3,493675,1940
4,4,1101539,2208


In [32]:
df['intercept'] = 1

In [33]:
df.head()

Unnamed: 0.1,Unnamed: 0,price,area,intercept
0,0,598291,1188,1
1,1,1744259,3512,1
2,2,571669,1134,1
3,3,493675,1940,1
4,4,1101539,2208,1
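
The constant `intercept` column is what allows the fit to estimate <em>b</em><sub>0</sub>; without it, the line is forced through the origin. The sketch below shows the effect with NumPy's `np.linalg.lstsq` on made-up toy data (y = 2 + 3x), as an illustration only; it is not how the notebook fits its models.

```python
import numpy as np

# Hypothetical toy data lying exactly on y = 2 + 3x
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([2.0, 5.0, 8.0, 11.0])

# Design matrix with a column of ones, mirroring df[['intercept', 'area']]
X_with = np.column_stack([np.ones_like(x), x])
coef_with, *_ = np.linalg.lstsq(X_with, y, rcond=None)

# Design matrix without the ones column: the fitted line must pass through (0, 0)
X_without = x.reshape(-1, 1)
coef_without, *_ = np.linalg.lstsq(X_without, y, rcond=None)

print("with intercept column:   ", coef_with)     # [b0, b1]
print("without intercept column:", coef_without)  # [b1] only
```

With the ones column the fit recovers both coefficients; without it, the slope is distorted to compensate for the missing intercept. statsmodels also provides `sm.add_constant` for adding this column.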


In [34]:
df[['intercept', 'area']]

Unnamed: 0,intercept,area
0,1,1188
1,1,3512
2,1,1134
3,1,1940
4,1,2208
...,...,...
6023,1,757
6024,1,3540
6025,1,1518
6026,1,2270


In [35]:
# Ordinary Least Squares
lm = sm.OLS(df['price'], df[['intercept', 'area']]) # sm.OLS(y-value, [intercept and x-value])
results = lm.fit() # fits your data to the model
results.summary() # View data findings

0,1,2,3
Dep. Variable:,price,R-squared:,0.678
Model:,OLS,Adj. R-squared:,0.678
Method:,Least Squares,F-statistic:,12690.0
Date:,"Tue, 06 Oct 2020",Prob (F-statistic):,0.0
Time:,15:23:36,Log-Likelihood:,-84517.0
No. Observations:,6028,AIC:,169000.0
Df Residuals:,6026,BIC:,169100.0
Df Model:,1,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
intercept,9587.8878,7637.479,1.255,0.209,-5384.303,2.46e+04
area,348.4664,3.093,112.662,0.000,342.403,354.530

0,1,2,3
Omnibus:,368.609,Durbin-Watson:,2.007
Prob(Omnibus):,0.0,Jarque-Bera (JB):,349.279
Skew:,0.534,Prob(JB):,1.43e-76
Kurtosis:,2.499,Cond. No.,4930.0


<p><strong>Intercept (<em>b</em><sub>0</sub>)</strong>: 9587.8878</p>
<p><strong>Slope (<em>b</em><sub>1</sub>)</strong>: 348.4664</p>
<p><strong>Predicted home price</strong>: <em>y</em> = 9588 + 348<em>x</em></p>
<p><strong>R-squared</strong>: 0.678</p>

<p>If a home has an area of 0, the model predicts a price of about 9588.</p>
<p>For every 1-unit increase in area, the predicted price increases by about 348.</p>
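
To make the interpretation concrete, here is a small sketch that plugs the fitted coefficients from the summary into the best-fit equation. The function name `predict_price` and the example area of 2000 are illustrative assumptions, not part of the notebook.

```python
# Coefficients copied from the OLS summary above
B0 = 9587.8878  # intercept
B1 = 348.4664   # slope for area


def predict_price(area):
    """Predicted price from the fitted line: y = b0 + b1 * x."""
    return B0 + B1 * area


# Hypothetical home with an area of 2000
print(round(predict_price(2000), 2))

# One extra unit of area changes the prediction by the slope
print(round(predict_price(2001) - predict_price(2000), 4))
```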

### Hypothesis Testing
<p><strong>H<sub>0</sub></strong>: <em>b</em><sub>1</sub> = 0</p>
<p><strong>H<sub>1</sub></strong>: <em>b</em><sub>1</sub> $\neq$ 0</p>

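<p>The <strong>p-value</strong> for a coefficient indicates whether that variable is useful for predicting the response: a small p-value (commonly below 0.05) is evidence against H<sub>0</sub>.</p>
<p>The p-value for area is reported as 0.000, so area is statistically significant for predicting price.</p>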
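
The test statistic and confidence interval in the summary can be approximately reproduced by hand from the coefficient and its standard error. The sketch below assumes the normal-approximation critical value of 1.96 for a 95% interval, which is reasonable here because the model has 6026 residual degrees of freedom; small differences from the summary come from rounding in the reported standard error.

```python
# Values copied from the 'area' row of the OLS summary
coef = 348.4664
std_err = 3.093

# t statistic for H0: b1 = 0
t_stat = coef / std_err

# Approximate 95% confidence interval (1.96 is an assumed normal critical value)
critical = 1.96
ci_lower = coef - critical * std_err
ci_upper = coef + critical * std_err

print(f"t = {t_stat:.3f}")
print(f"95% CI = ({ci_lower:.3f}, {ci_upper:.3f})")

# A |t| far above the critical value means H0 is rejected
print("reject H0:", abs(t_stat) > critical)
```

The interval does not contain 0, which is another way of seeing that the slope is statistically significant.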

### How well does the line fit the data?

<strong>R-squared</strong>: proportion of the variance in the response (y) explained by the model
<ul>
    <li>The closer the value is to 1, the better the model fits</li>
    <li>The closer the value is to 0, the worse the model fits</li>
    <li>In simple linear regression, the R-squared value is the square of the correlation coefficient</li>
    <li>Can be read as "67.8% of the variance in price is explained by the area of the house". The remaining 32.2% of the variance in price is due to other characteristics of the house, not including the area.</li>
</ul>
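
A short sketch of how R-squared can be computed from a fitted line, and a check that it matches the squared correlation coefficient when there is one explanatory variable. The data are made-up toy values, and `np.polyfit` is used here only as a convenient way to get a least-squares line.

```python
import numpy as np

# Hypothetical toy data (not from the notebook's datasets)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 11.7])

# Least-squares line; polyfit returns the highest-degree coefficient first
b1, b0 = np.polyfit(x, y, 1)
y_hat = b0 + b1 * x

# R-squared = 1 - (unexplained variation / total variation)
ss_res = np.sum((y - y_hat) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot

# Squared correlation coefficient, for comparison
r = np.corrcoef(x, y)[0, 1]

print(f"R-squared = {r_squared:.4f}, r**2 = {r ** 2:.4f}")
```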

## Multiple Linear Regression
### Predict a response variable with multiple inputs, using quantitative and categorical data

##### Example:
housing price vs. number of rooms, area, bathrooms, type of house, etc.

In [36]:
df = pd.read_csv('house_prices.csv')
df.head()

Unnamed: 0,house_id,neighborhood,area,bedrooms,bathrooms,style,price
0,1112,B,1188,3,2,ranch,598291
1,491,B,3512,5,3,victorian,1744259
2,5952,B,1134,3,2,ranch,571669
3,3525,A,1940,4,2,ranch,493675
4,5108,B,2208,6,4,victorian,1101539


In [37]:
df['intercept'] = 1

lm = sm.OLS(df.price, df[['intercept', 'area', 'bedrooms', 'bathrooms']])
result = lm.fit()
result.summary()

0,1,2,3
Dep. Variable:,price,R-squared:,0.678
Model:,OLS,Adj. R-squared:,0.678
Method:,Least Squares,F-statistic:,4230.0
Date:,"Tue, 06 Oct 2020",Prob (F-statistic):,0.0
Time:,15:23:36,Log-Likelihood:,-84517.0
No. Observations:,6028,AIC:,169000.0
Df Residuals:,6024,BIC:,169100.0
Df Model:,3,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
intercept,1.007e+04,1.04e+04,0.972,0.331,-1.02e+04,3.04e+04
area,345.9110,7.227,47.863,0.000,331.743,360.079
bedrooms,-2925.8063,1.03e+04,-0.285,0.775,-2.3e+04,1.72e+04
bathrooms,7345.3917,1.43e+04,0.515,0.607,-2.06e+04,3.53e+04

0,1,2,3
Omnibus:,367.658,Durbin-Watson:,2.007
Prob(Omnibus):,0.0,Jarque-Bera (JB):,350.116
Skew:,0.536,Prob(JB):,9.4e-77
Kurtosis:,2.503,Cond. No.,11600.0


### Dummy Variables
Take neighborhoods, for example. There are three neighborhoods, and a categorical value like this cannot be used in the calculations directly. We need to break it up into three 0/1 columns, one per neighborhood, and then leave one of them out of the model to act as the baseline.
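
A minimal sketch of the encoding on a tiny made-up `style` column, assuming a pandas version where `pd.get_dummies` accepts the `dtype` argument. Here `victorian` is dropped so that it becomes the baseline category.

```python
import pandas as pd

# Hypothetical toy column (the real data comes from house_prices.csv)
styles = pd.Series(["ranch", "victorian", "lodge", "ranch"], name="style")

# One 0/1 column per category
dummies = pd.get_dummies(styles, dtype=int)

# Drop one category to serve as the baseline, then add the constant column
X = dummies.drop(columns="victorian")
X.insert(0, "intercept", 1)

print(X)
```

Keeping all three dummy columns alongside the intercept would make the columns linearly dependent, since the three dummies always sum to 1, which is why one category is left out.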

In [38]:
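# One 0/1 column per category; get_dummies returns the columns in sorted order
df[['a', 'b', 'c']] = pd.get_dummies(df['neighborhood'])
df[['lodge', 'ranch', 'victorian']] = pd.get_dummies(df['style'])
df['intercept'] = 1
# 'victorian' is left out of the model, so it becomes the baseline category
lm = sm.OLS(df.price, df[['intercept', 'lodge', 'ranch']])
result = lm.fit()
result.summary()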

0,1,2,3
Dep. Variable:,price,R-squared:,0.339
Model:,OLS,Adj. R-squared:,0.339
Method:,Least Squares,F-statistic:,1548.0
Date:,"Tue, 06 Oct 2020",Prob (F-statistic):,0.0
Time:,15:23:36,Log-Likelihood:,-86683.0
No. Observations:,6028,AIC:,173400.0
Df Residuals:,6025,BIC:,173400.0
Df Model:,2,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
intercept,1.046e+06,7775.607,134.534,0.000,1.03e+06,1.06e+06
lodge,-7.411e+05,1.44e+04,-51.396,0.000,-7.69e+05,-7.13e+05
ranch,-4.71e+05,1.27e+04,-37.115,0.000,-4.96e+05,-4.46e+05

0,1,2,3
Omnibus:,1340.12,Durbin-Watson:,2.004
Prob(Omnibus):,0.0,Jarque-Bera (JB):,3232.81
Skew:,1.23,Prob(JB):,0.0
Kurtosis:,5.611,Cond. No.,3.28


In [None]:
# If our home is a Victorian (the baseline, captured by the intercept), its predicted price is about 1,046,000.
# A lodge is predicted to be about 741,000 less than a Victorian.
# A ranch is predicted to be about 471,000 less than a Victorian.
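
The same reading can be written out as arithmetic: each style's predicted price is the intercept plus that style's coefficient, with the baseline contributing 0. The dictionary and function names below are illustrative assumptions; the numbers are the rounded coefficients from the summary.

```python
# Rounded coefficients copied from the OLS summary above
INTERCEPT = 1.046e6
STYLE_COEF = {
    "victorian": 0.0,    # baseline: no dummy column in the model
    "lodge": -7.411e5,
    "ranch": -4.71e5,
}


def predict_price_by_style(style):
    """Predicted price = intercept + coefficient of the style's dummy column."""
    return INTERCEPT + STYLE_COEF[style]


for style in STYLE_COEF:
    print(f"{style}: {predict_price_by_style(style):,.0f}")
```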

### Logistic Regression

In [41]:
df = pd.read_csv('fraud_dataset.csv')
df.head()

Unnamed: 0,transaction_id,duration,day,fraud
0,28891,21.3026,weekend,False
1,61629,22.932765,weekend,False
2,53707,32.694992,weekday,False
3,47812,32.784252,weekend,False
4,43455,17.756828,weekend,False


In [42]:
df.day.value_counts()

weekend    5757
weekday    3036
Name: day, dtype: int64

In [43]:
df.fraud.value_counts()

False    8686
True      107
Name: fraud, dtype: int64

In [44]:
df[['no_fraud', 'fraud']] = pd.get_dummies(df.fraud)

In [51]:
df[['weekday', 'weekend']] = pd.get_dummies(df.day)

In [54]:
df['intercept'] = 1
lm = sm.Logit(df.fraud, df[['intercept', 'duration', 'weekday']])
result = lm.fit()
result.summary()

Optimization terminated successfully.
         Current function value: inf
         Iterations 16


  return 1/(1+np.exp(-X))
  return np.sum(np.log(self.cdf(q*np.dot(X,params))))
  warn('Inverting hessian failed, no bse or cov_params '
  warn('Inverting hessian failed, no bse or cov_params '


0,1,2,3
Dep. Variable:,fraud,No. Observations:,8793.0
Model:,Logit,Df Residuals:,8790.0
Method:,MLE,Df Model:,2.0
Date:,"Tue, 06 Oct 2020",Pseudo R-squ.:,inf
Time:,15:31:47,Log-Likelihood:,-inf
converged:,True,LL-Null:,0.0
Covariance Type:,nonrobust,LLR p-value:,1.0

0,1,2,3,4,5,6
,coef,std err,z,P>|z|,[0.025,0.975]
intercept,9.8709,1.944,5.078,0.000,6.061,13.681
duration,-1.4637,0.290,-5.039,0.000,-2.033,-0.894
weekday,2.5465,0.904,2.816,0.005,0.774,4.319


In [60]:
print(f"On weekdays, the odds of fraud are {round(np.exp(2.5465), 2)} times the odds on weekends, holding duration constant.")
print(f"For each 1-unit increase in duration, the odds of fraud are multiplied by {round(np.exp(-1.4637), 2)}, holding day constant.")

On weekdays, the odds of fraud are 12.76 times the odds on weekends, holding duration constant.
For each 1-unit increase in duration, the odds of fraud are multiplied by 0.23, holding day constant.
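
Logistic regression coefficients are on the log-odds scale, so exponentiating them gives multiplicative changes in the odds rather than percentages. The sketch below works through that conversion with the coefficients from the summary, and also turns a linear prediction into a probability with the logistic function. The helper name `fraud_probability` and the example transaction (duration of 10 on a weekday) are illustrative assumptions.

```python
import math

# Coefficients copied from the Logit summary above
B_INTERCEPT = 9.8709
B_DURATION = -1.4637
B_WEEKDAY = 2.5465

# Odds ratios: multiplicative change in the odds of fraud
weekday_odds_ratio = math.exp(B_WEEKDAY)
duration_odds_ratio = math.exp(B_DURATION)

# For a ratio below 1, the reciprocal is often easier to read:
# "each extra unit of duration makes the odds 1/ratio times smaller"
duration_reciprocal = 1 / duration_odds_ratio

print(round(weekday_odds_ratio, 2), round(duration_odds_ratio, 2), round(duration_reciprocal, 2))


def fraud_probability(duration, weekday):
    """Predicted probability: logistic function applied to the log-odds."""
    log_odds = B_INTERCEPT + B_DURATION * duration + B_WEEKDAY * weekday
    return 1 / (1 + math.exp(-log_odds))


# Hypothetical transaction: duration of 10, on a weekday
print(round(fraud_probability(10, 1), 3))
```

Because the odds ratio multiplies the odds and not the probability, the change in probability depends on where the transaction starts out.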
