### Meeting Assumptions for Linear Regression

In [2]:
#import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as stats

#### Assumptions in linear regression

##### 1: linear relationship
    - Features in a regression need to have a linear relationship with the outcome.
    - If a relationship is non-linear, it can sometimes be made linear by applying a non-linear transformation (for example a log or square root) to the feature.
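
As a rough illustration of the point above, the sketch below builds a synthetic feature whose relationship with the outcome is logarithmic and compares the Pearson correlation before and after a log transform. The data, the variable names, and the choice of a log transform are assumptions made for this example, not part of the housing dataset.

```python
import numpy as np

# Synthetic data (assumed for illustration): outcome depends on log(x), not x
rng = np.random.default_rng(0)
x = rng.uniform(1, 1000, size=500)
y = 3.0 * np.log(x) + rng.normal(0, 0.2, size=500)

# Pearson correlation measures the strength of the *linear* relationship
r_raw = np.corrcoef(x, y)[0, 1]
r_log = np.corrcoef(np.log(x), y)[0, 1]

print(f"correlation with raw feature: {r_raw:.3f}")
print(f"correlation with log feature: {r_log:.3f}")
```

A higher correlation after the transform suggests the transformed feature is a better fit for a linear model; a scatter plot of each version against the outcome is the usual visual check.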
##### 2: multivariate normality
    - The error from the model (the residuals, calculated by subtracting the model-predicted values from the real outcome values) should be normally distributed.
    - Outliers or skewness in the error can often be traced back to outliers or skewness in the data.
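
One possible way to inspect the residuals is sketched below: fit a simple line, subtract the predictions from the real values, and look at the skewness and a Shapiro-Wilk test. The synthetic data and the single-feature fit with `np.polyfit` are assumptions for the example; the same residual calculation applies to any fitted regression model.

```python
import numpy as np
import scipy.stats as stats

# Synthetic data (assumed for illustration): linear signal plus normal noise
rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=400)
y = 2.0 * x + 1.0 + rng.normal(0, 1.0, size=400)

# Fit a straight line by least squares and compute the model error
slope, intercept = np.polyfit(x, y, deg=1)
predicted = slope * x + intercept
residuals = y - predicted  # real outcome values minus predicted values

# Skewness near 0 and a large Shapiro-Wilk p-value are consistent with normality
resid_skew = stats.skew(residuals)
shapiro_stat, shapiro_p = stats.shapiro(residuals)

print(f"residual skewness: {resid_skew:.3f}")
print(f"Shapiro-Wilk p-value: {shapiro_p:.3f}")
```

A histogram or Q-Q plot of `residuals` (for instance with `stats.probplot`) gives a complementary visual check.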
##### 3: homoscedasticity 
    - The distribution of your error terms (its "scedasticity") should be consistent for all predicted values, or homoscedastic.
    - If the error terms aren't consistently distributed, and there is more variance in the error for large outcome values than for small ones, then the confidence interval for large predicted values will be too small because it is based on the average error variance. This leads to overconfidence in the accuracy of the model's predictions.
    - Some fixes for heteroscedasticity include transforming the dependent variable and adding features that target the poorly-estimated areas. For example, if a model tracks data over time and model error variance jumps in the September to November period, a binary feature indicating season may be enough to resolve the problem.
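
A minimal way to look for heteroscedasticity, sketched under assumed synthetic data, is to compare the residual variance for small and large predicted values. Splitting at the median prediction is a simplification chosen for this example; a scatter plot of residuals against predictions is the more common diagnostic.

```python
import numpy as np

# Synthetic data (assumed for illustration): noise grows with x
rng = np.random.default_rng(2)
x = rng.uniform(1, 10, size=1000)
y = 2.0 * x + rng.normal(0, 0.5 * x)

# Fit a line and compute residuals
slope, intercept = np.polyfit(x, y, deg=1)
predicted = slope * x + intercept
residuals = y - predicted

# Compare error variance in the lower and upper half of the predictions
cutoff = np.median(predicted)
var_low = residuals[predicted <= cutoff].var()
var_high = residuals[predicted > cutoff].var()

print(f"residual variance, low predictions:  {var_low:.2f}")
print(f"residual variance, high predictions: {var_high:.2f}")
```

Roughly equal variances would be consistent with homoscedasticity; a large gap, as in this constructed example, indicates the error variance depends on the predicted value.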
##### 4: low multicollinearity
    - Correlations among features should be low or nonexistent.
    - Multicollinearity can be fixed by PCA or by discarding some of the correlated features.
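
The "discard correlated features" option can be sketched as follows. The feature names (`feat_a`, `feat_b`, `feat_c`) and the 0.9 threshold are hypothetical choices for this example; the right threshold depends on the dataset.

```python
import numpy as np
import pandas as pd

# Synthetic features (assumed for illustration): feat_b is nearly a copy of feat_a
rng = np.random.default_rng(0)
a = rng.normal(size=300)
features = pd.DataFrame({
    "feat_a": a,
    "feat_b": a * 2.0 + rng.normal(scale=0.05, size=300),
    "feat_c": rng.normal(size=300),
})

# Absolute pairwise correlations, keeping only the upper triangle
# so each pair of features is considered once
corr = features.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Drop the later feature of any pair whose correlation exceeds the threshold
threshold = 0.9  # assumed cutoff
to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
reduced = features.drop(columns=to_drop)

print("dropped:", to_drop)
print("kept:", list(reduced.columns))
```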

#### Potential Corrections for Heteroscedasticity:

##### 1. Data transformations

1. Take the square root of the feature
new_feature = np.sqrt(df_house_sales_numerical['sqft_living'])
2. Take the log of the feature (values must be positive)
new_feature = np.log(df_house_sales_numerical["sqft_living"].astype(float))
3. Take the inverse of the feature (values must be non-zero)
new_feature = 1/df_house_sales_numerical["sqft_lot"].astype(float)
4. Take the exponential of the feature (np.exp overflows to inf for inputs above roughly 709, so rescale large values such as square footage first)
new_feature = np.exp(df_house_sales_numerical["sqft_living"].astype(float))
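
To get a feel for how these transformations change a distribution, the sketch below applies the square root and log to a synthetic right-skewed feature and compares skewness. The lognormal data stands in for a column like `sqft_living` and is an assumption for the example, not the real data.

```python
import numpy as np
import scipy.stats as stats

# Synthetic right-skewed feature (assumed stand-in for a square-footage column)
rng = np.random.default_rng(3)
feature = rng.lognormal(mean=7.5, sigma=0.5, size=2000)

# Skewness of the raw and transformed versions; values near 0 are more symmetric
skew_raw = stats.skew(feature)
skew_sqrt = stats.skew(np.sqrt(feature))
skew_log = stats.skew(np.log(feature))

print(f"raw:  {skew_raw:.2f}")
print(f"sqrt: {skew_sqrt:.2f}")
print(f"log:  {skew_log:.2f}")
```

For right-skewed data like this, the log is the stronger correction and the square root a milder one; which works best on a given column is an empirical question.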

##### 2. New features to target poorly-estimated areas

new_feature = np.where(df_house_sales_numerical_minus_outliers['sqft_lot_sqrt'] < 50, 0, 1)
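
The same `np.where` pattern covers the seasonal case described earlier, where error variance jumps between September and November. The `sales` frame and its `month` column below are hypothetical names used only for this sketch.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one row per sale with the month it occurred in
sales = pd.DataFrame({"month": [1, 4, 9, 10, 11, 12]})

# Binary indicator: 1 for September through November, 0 otherwise
sales["is_fall"] = np.where(sales["month"].isin([9, 10, 11]), 1, 0)

print(sales)
```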

#### Potential Corrections for Multivariate Non-Normality:

1. Remove outliers (values more than 1.5x IQR below Q1 or above Q3)

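#sort values (optional: quantile() does not need sorted data, and only the last sort determines the final row order)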
df_house_sales_numerical.sort_values('sqft_living', inplace=True)
df_house_sales_numerical.sort_values('sqft_lot', inplace=True)
df_house_sales_numerical.sort_values('sqft_living15', inplace=True)

#identify IQR for sqft_living
Q1_living = df_house_sales_numerical['sqft_living'].quantile(0.25)
Q3_living = df_house_sales_numerical['sqft_living'].quantile(0.75)
IQR_living = Q3_living - Q1_living

#identify IQR for sqft_living15
Q1_living15 = df_house_sales_numerical['sqft_living15'].quantile(0.25)
Q3_living15 = df_house_sales_numerical['sqft_living15'].quantile(0.75)
IQR_living15 = Q3_living15 - Q1_living15

#identify IQR for sqft lot
Q1_lot = df_house_sales_numerical['sqft_lot'].quantile(0.25)
Q3_lot = df_house_sales_numerical['sqft_lot'].quantile(0.75)
IQR_lot = Q3_lot - Q1_lot

#keep only records that fall between Q1 - 1.5 * IQR and Q3 + 1.5 * IQR for sqft_living, sqft_living15, and sqft_lot
df_house_sales_numerical_minus_outliers = df_house_sales_numerical.query('(@Q1_living - 1.5 * @IQR_living) <= sqft_living <= (@Q3_living + 1.5 * @IQR_living) & \
(@Q1_living15 - 1.5 * @IQR_living15) <= sqft_living15 <= (@Q3_living15 + 1.5 * @IQR_living15) & \
(@Q1_lot - 1.5 * @IQR_lot) <= sqft_lot <= (@Q3_lot + 1.5 * @IQR_lot)')
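
The per-column quartile calculations above repeat the same steps, so they could be wrapped in a small helper. The function `drop_iqr_outliers` and the toy frame below are hypothetical, written for this sketch; the helper computes every column's fences on the original frame, as the query above does.

```python
import pandas as pd

def drop_iqr_outliers(df, columns, k=1.5):
    """Keep rows where every listed column lies within [Q1 - k*IQR, Q3 + k*IQR]."""
    mask = pd.Series(True, index=df.index)
    for col in columns:
        q1 = df[col].quantile(0.25)
        q3 = df[col].quantile(0.75)
        iqr = q3 - q1
        # between() is inclusive on both ends by default
        mask &= df[col].between(q1 - k * iqr, q3 + k * iqr)
    return df[mask]

# Toy data (assumed for illustration): the value 100 in column "a" is an outlier
toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5, 6, 7, 8, 9, 100],
    "b": [10, 11, 12, 13, 14, 15, 16, 17, 18, 19],
})
cleaned = drop_iqr_outliers(toy, ["a", "b"])
print(len(toy), "rows before,", len(cleaned), "rows after")
```

With the notebook's data, a call such as `drop_iqr_outliers(df_house_sales_numerical, ['sqft_living', 'sqft_living15', 'sqft_lot'])` should select the same rows as the query.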