## New Notebook For Fitting a Model

Some cleaning and exploration remains before fitting a model, but the previous EDA notebook provides a good starting point with a cleaned dataset.

### Overall Goals
- **Find the regression model with the highest R-squared that can predict prices for the real estate agency.**
- Decide what to do with null values.
- Decide if any outliers need to be dropped.
- Decide what to do with skewed target variable data.
- Use one-hot encoding for categorical variables.
- Decide which variables to drop given multicollinearity.
- Decide which variables to use given stepwise selection methods & recursive feature elimination.
- Consider log transformations on data that is not normally distributed.
- Run other checks of the linear regression assumptions; drop variables that don't meet them.
- Consider scaling or normalizing.
- Train, validate, and test the model.
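
Several of the goals above (one-hot encoding, log-transforming a skewed target, and splitting into training and test sets) can be sketched together. This is only a minimal illustration under stated assumptions: the `toy` DataFrame and its values are made up and stand in for `df_clean`, and the 80/20 split and `random_state` are arbitrary choices, not decisions made in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Hypothetical toy data standing in for df_clean (all values are synthetic).
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "sqft_living": rng.integers(500, 4000, size=n),
    "condition": rng.choice(["Fair", "Average", "Good"], size=n),
})
# Right-skewed target built from sqft_living plus noise.
toy["price"] = np.exp(11 + 0.0004 * toy["sqft_living"] + rng.normal(0, 0.2, size=n))

# One-hot encode the categorical column; drop_first avoids a redundant dummy column.
X = pd.get_dummies(toy[["sqft_living", "condition"]],
                   columns=["condition"], drop_first=True, dtype=float)

# Log-transform the skewed target before fitting.
y = np.log(toy["price"])

# Hold out 20% of the rows to check how the model generalizes.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LinearRegression().fit(X_train, y_train)
r2_train = model.score(X_train, y_train)
r2_test = model.score(X_test, y_test)
print(f"train R^2: {r2_train:.3f}, test R^2: {r2_test:.3f}")
```

Because the target is log-transformed here, the R-squared values describe the fit to log-price; predictions would need `np.exp` to get back to dollars.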

In [21]:
#Loading the needed packages, libraries, functions and variables from the EDA notebook.
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
import warnings
warnings.filterwarnings("ignore")

In [5]:
#Original DataFrame
%store -r df_original

In [7]:
#Cleaned DataFrame
%store -r df_clean

In [8]:
#Highly correlated independent variables
%store -r df_high_corr_pairs

## Business & Data Understanding
#### Revisiting our end goals with some EDA knowledge
- We want to create a tool for a real estate agency to estimate sales or purchase prices given housing info.
- This can be done with a regression model.

In [10]:
df_clean.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21597 entries, 0 to 21596
Data columns (total 22 columns):
 #   Column         Non-Null Count  Dtype         
---  ------         --------------  -----         
 0   id             21597 non-null  int64         
 1   date           21597 non-null  datetime64[ns]
 2   price          21597 non-null  float64       
 3   bedrooms       21597 non-null  int64         
 4   bathrooms      21597 non-null  float64       
 5   sqft_living    21597 non-null  int64         
 6   sqft_lot       21597 non-null  int64         
 7   floors         21597 non-null  float64       
 8   waterfront     19221 non-null  object        
 9   view           21534 non-null  object        
 10  condition      21597 non-null  object        
 11  grade          21597 non-null  object        
 12  sqft_above     21597 non-null  int64         
 13  sqft_basement  8317 non-null   float64       
 14  yr_built       21597 non-null  int64         
 15  yr_renovated   744 

In [19]:
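# Absolute correlation of each column with price, sorted from weakest to strongest
df_clean.corr().abs()['price'].sort_values()

# Columns to examine individually against price
high_corr_cols = ['sqft_living', 'sqft_above', 'sqft_living15', 'bathrooms', 'sqft_basement', 'bedrooms']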
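
One way such a list could be derived programmatically is sketched below. It restricts the correlation to numeric columns with `select_dtypes` so that object columns do not interfere. The `toy` data is synthetic, and the `threshold` of 0.5 is a hypothetical cutoff, not a value used in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for df_clean (all values are synthetic).
rng = np.random.default_rng(1)
n = 100
toy = pd.DataFrame({"sqft_living": rng.normal(2000, 500, size=n)})
toy["price"] = 200 * toy["sqft_living"] + rng.normal(0, 50_000, size=n)
toy["sqft_lot"] = rng.normal(8000, 2000, size=n)  # unrelated to price by construction
toy["grade"] = "7 Average"                        # object column, excluded below

# Keep numeric columns only, then rank features by absolute correlation with price.
numeric = toy.select_dtypes(include="number")
corr_with_price = (
    numeric.corr()["price"].abs().drop("price").sort_values(ascending=False)
)
print(corr_with_price)

# Hypothetical cutoff for calling a feature "highly correlated" with price.
threshold = 0.5
high_corr_cols = corr_with_price[corr_with_price > threshold].index.tolist()
print(high_corr_cols)
```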

In [27]:
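y = df_clean['price']
# Note: X is the whole DataFrame here (target, non-numeric columns, and NaNs included),
# which leads to the ValueError below.
X = df_clean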
    
reg = LinearRegression().fit(X, y)

plt.scatter(X, y, color='green')
plt.plot(X, reg.predict(X))
plt.xlabel('sqft_living')
plt.ylabel('Price');

ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
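
The error above most likely comes from passing the whole DataFrame as `X`: it still contains columns with missing values (and the target itself). A minimal sketch of one way around it is to select a single numeric feature and drop rows where it is missing. The `toy` frame and its values are made up, and dropping rows is only for illustration; how to treat nulls is still an open goal for this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical stand-in for df_clean (all values are made up).
toy = pd.DataFrame({
    "price": [200000.0, 450000.0, 150000.0, 380000.0, 300000.0],
    "sqft_living": [1180, 2570, 770, 1960, 1680],
    "sqft_basement": [0.0, 400.0, np.nan, 910.0, np.nan],
    "condition": ["Average", "Average", "Average", "Very Good", "Average"],
})

feature = "sqft_basement"

# Keep only the feature and the target, and drop rows where either is missing.
subset = toy[[feature, "price"]].dropna()

X = subset[[feature]]  # double brackets keep X two-dimensional
y = subset["price"]

reg = LinearRegression().fit(X, y)
print(len(subset), reg.coef_, reg.intercept_)
```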

In [22]:
for x in high_corr_cols:
    y = df_clean['price']
    # Note: df_clean[x] is a 1D Series; scikit-learn expects a 2D X, hence the ValueError below.
    X = df_clean[x]
    
    reg = LinearRegression().fit(X, y)

    plt.scatter(X, y, color='green')
    plt.plot(X, reg.predict(X))
    plt.xlabel(x)
    plt.ylabel('Price');

ValueError: Expected 2D array, got 1D array instead:
array=[1180 2570  770 ... 1020 1600 1020].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
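
As the message suggests, scikit-learn expects `X` to be two-dimensional even for a single feature. A minimal sketch of two ways to get that shape follows: selecting with double brackets, or reshaping the underlying array. The `toy` data and the `cols` list are made up for illustration and stand in for `df_clean` and `high_corr_cols`.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical stand-in for df_clean (all values are made up).
toy = pd.DataFrame({
    "price": [200000.0, 450000.0, 150000.0, 380000.0, 300000.0],
    "sqft_living": [1180, 2570, 770, 1960, 1680],
    "bedrooms": [3, 3, 2, 4, 3],
})
cols = ["sqft_living", "bedrooms"]  # stand-in for high_corr_cols

print(toy["sqft_living"].shape)    # single brackets: 1D Series
print(toy[["sqft_living"]].shape)  # double brackets: 2D DataFrame with one column

# Equivalent alternative: reshape the 1D array into a single column.
X_alt = toy["sqft_living"].to_numpy().reshape(-1, 1)

y = toy["price"]
models = {}
for col in cols:
    X = toy[[col]]  # 2D input, one feature per fit
    models[col] = LinearRegression().fit(X, y)
    print(col, models[col].coef_[0], models[col].score(X, y))
```

With `X` in this shape, the plotting calls from the cell above should work unchanged; calling `plt.figure()` at the start of each iteration would give each feature its own plot instead of overlaying them all on one axis. Columns with missing values, such as `sqft_basement`, would still need handling before `fit`.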