### Multiple Regression

Example 1: Fitting a multiple regression model to understand how different variables affect house prices in the US.

Features:
- **price** - The last price the house was sold for
- **num_bed** - The number of bedrooms
- **num_bath** - The number of bathrooms (fractions mean the house has a toilet-only or shower/bathtub-only bathroom)
- **size_house** - The size of the house (includes the basement)
- **size_lot** - The size of the lot
- **num_floors** - The number of floors
- **is_waterfront** - Whether or not the house is a waterfront house (0 means it is not a waterfront house whereas 1 means that it is a waterfront house)
- **condition** - How worn out the house is. Ranges from 1 (needs repairs all over the place) to 5 (the house is very well maintained)
- **size_basement** - The size of the basement
- **year_built** - The year the house was built
- **renovation_date** - The year the house was renovated for the last time. 0 means the house has never been renovated
- **zip** - The zip code
- **latitude** - Latitude
- **longitude** - Longitude
- **avg_size_neighbor_houses** - The average house size of the neighbors
- **avg_size_neighbor_lot** - The average lot size of the neighbors
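
Note that `renovation_date` uses 0 as a sentinel for "never renovated", so treating it as a plain number mixes a flag with a year. One possible way to separate the two is sketched below on a tiny made-up frame. The column names `was_renovated` and `years_since_update` are hypothetical and not part of the dataset, and the reference year 2015 is an assumption taken from the latest year that appears in the data.

```python
import pandas as pd

# Tiny made-up sample; only the column names mirror the dataset described above.
sample = pd.DataFrame({
    "year_built": [1950, 1990, 2005],
    "renovation_date": [0, 2010, 0],
})

# Hypothetical helper column: 1 if the house was ever renovated, 0 otherwise.
renovated = sample["renovation_date"] > 0
sample["was_renovated"] = renovated.astype(int)

# Year of the last major update: renovation year if any, else construction year.
last_update = sample["renovation_date"].where(renovated, sample["year_built"])

# Assumed reference year (2015); adjust to the actual sale year if available.
sample["years_since_update"] = 2015 - last_update

print(sample)
```

Whether these derived columns improve a model is an empirical question; the sketch only shows how the sentinel could be handled.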

In [4]:
import pandas as pd
import numpy as np
import statsmodels.formula.api as smf
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
from sklearn import linear_model
from sklearn.metrics import mean_squared_error, r2_score

import warnings
warnings.filterwarnings("ignore")

%matplotlib inline
%config InlineBackend.figure_formats=['svg']

In [5]:
df = pd.read_csv(r'./data/house_sales.csv')

In [6]:
df.describe()

| | price | num_bed | num_bath | size_house | size_lot | num_floors | is_waterfront | condition | size_basement | year_built | renovation_date | zip | latitude | longitude | avg_size_neighbor_houses | avg_size_neighbor_lot |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 18448.0 | 18448.0 | 18448.0 | 18448.0 | 18448.0 | 18448.0 | 18448.0 | 18448.0 | 18448.0 | 18448.0 | 18448.0 | 18448.0 | 18448.0 | 18448.0 | 18448.0 | 18448.0 |
| mean | 542362.4 | 3.372615 | 2.118888 | 2083.940915 | 15036.02 | 1.494606 | 0.007643 | 3.411698 | 293.571498 | 1971.001138 | 85.145002 | 98077.921455 | 47.56003 | -122.214419 | 1988.306483 | 12571.596216 |
| std | 372013.5 | 0.933892 | 0.772384 | 921.416218 | 41814.55 | 0.540806 | 0.087092 | 0.652593 | 443.607503 | 29.361619 | 403.371263 | 53.49744 | 0.138557 | 0.13991 | 686.173124 | 26329.260211 |
| min | 78000.0 | 0.0 | 0.0 | 290.0 | 520.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1900.0 | 0.0 | 98001.0 | 47.155933 | -122.518648 | 399.0 | 651.0 |
| 25% | 321837.5 | 3.0 | 1.75 | 1430.0 | 5050.0 | 1.0 | 0.0 | 3.0 | 0.0 | 1952.0 | 0.0 | 98033.0 | 47.471527 | -122.328084 | 1490.0 | 5100.0 |
| 50% | 450000.0 | 3.0 | 2.25 | 1920.0 | 7600.5 | 1.5 | 0.0 | 3.0 | 0.0 | 1975.0 | 0.0 | 98065.0 | 47.571599 | -122.230688 | 1840.0 | 7611.0 |
| 75% | 648000.0 | 4.0 | 2.5 | 2560.0 | 10625.25 | 2.0 | 0.0 | 4.0 | 570.0 | 1997.0 | 0.0 | 98118.0 | 47.677918 | -122.125733 | 2370.0 | 10050.0 |
| max | 7700000.0 | 33.0 | 8.0 | 13540.0 | 1651359.0 | 3.5 | 1.0 | 5.0 | 4820.0 | 2015.0 | 2015.0 | 98199.0 | 47.777624 | -121.315254 | 6110.0 | 858132.0 |
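
The summary shows a minimum of 0 and a maximum of 33 for `num_bed`, which suggests a few rows may deserve a closer look before modeling. A minimal sketch of such a check, run here on a small made-up frame rather than the real file; the cut-off of 10 bedrooms is an arbitrary assumption chosen only for illustration:

```python
import pandas as pd

# Made-up rows that imitate the extremes seen in the summary table.
sample = pd.DataFrame({
    "num_bed": [3, 0, 33, 4],
    "num_bath": [2.0, 0.0, 1.75, 2.5],
    "size_house": [1800, 290, 1620, 2560],
})

# Assumed rule of thumb: zero bedrooms or more than 10 is worth inspecting.
suspicious = sample[(sample["num_bed"] == 0) | (sample["num_bed"] > 10)]

print(suspicious)
```

Rows flagged this way are candidates for inspection, not automatic removal; whether they are data-entry errors has to be judged case by case.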


In [None]:
# Correlation matrix of price and the candidate predictors
# (the original list named 'size_house' twice; the duplicate is removed here)
corr_cols = ['price', 'num_bath', 'size_house', 'size_lot', 'num_floors', 'is_waterfront',
             'year_built', 'latitude', 'longitude', 'avg_size_neighbor_houses', 'avg_size_neighbor_lot']
corrmat = df[corr_cols].corr()
# Keep the 10 variables most correlated with price (price itself included)
cols = corrmat.nlargest(10, 'price')['price'].index
cm = np.corrcoef(df[cols].values.T)
sns.set(font_scale=1.15)
f, ax = plt.subplots(figsize=(12, 9))
hm = sns.heatmap(cm, cbar=True, annot=True, square=True, fmt='.2f', annot_kws={'size': 10}, yticklabels=cols.values, xticklabels=cols.values)
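
A heatmap can be complemented by a plain ranked list of correlations with `price`, which is easier to scan when there are many columns. The sketch below uses synthetic data, since the CSV is not bundled here; the frame `toy` and every coefficient used to generate it are invented for illustration and say nothing about the real dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500

# Synthetic predictors: neighbor size is built to track house size,
# lot size is independent of everything else.
size_house = rng.normal(2000, 900, n)
toy = pd.DataFrame({
    "size_house": size_house,
    "avg_size_neighbor_houses": 0.7 * size_house + rng.normal(0, 400, n),
    "size_lot": rng.normal(15000, 4000, n),
})

# Synthetic target driven by house size plus noise.
toy["price"] = 250 * toy["size_house"] + rng.normal(0, 100_000, n)

# Correlation of each predictor with price, strongest first.
ranked = toy.corr()["price"].drop("price").sort_values(ascending=False)
print(ranked.round(2))
```

When two predictors rank high and are also strongly correlated with each other, as `size_house` and `avg_size_neighbor_houses` are in this toy frame, their individual regression coefficients can become unstable.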

In [None]:
from IPython.display import Image
Image(filename='img/houses_tableau.jpg')  # forward slash keeps the path portable across operating systems

In [None]:
function1 = '''
price ~ size_house
 + num_bath
 + size_lot
 + num_floors
 + is_waterfront
 + year_built
 + latitude
 + longitude
 + avg_size_neighbor_houses
 + avg_size_neighbor_lot
'''

model1 = smf.ols(function1, df).fit()
print(model1.summary2())
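
Besides the printed summary, the fitted results object exposes the individual quantities as attributes, which is convenient for comparing models programmatically. A minimal sketch on synthetic data, assuming a toy data-generating process (intercept 50,000, 200 per unit of `size_house`, 300,000 for waterfront) that was made up for this example; `toy` and `toy_model` are hypothetical names:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(42)
n = 1000

# Synthetic predictors named after two columns of the dataset.
toy = pd.DataFrame({
    "size_house": rng.uniform(500, 4000, n),
    "is_waterfront": rng.integers(0, 2, n),
})

# Made-up linear relationship plus Gaussian noise.
toy["price"] = (
    50_000
    + 200 * toy["size_house"]
    + 300_000 * toy["is_waterfront"]
    + rng.normal(0, 20_000, n)
)

toy_model = smf.ols("price ~ size_house + is_waterfront", data=toy).fit()

# Estimated coefficients, indexed by term name (including 'Intercept').
print(toy_model.params)
# Adjusted R-squared, which penalizes additional predictors.
print(toy_model.rsquared_adj)
# p-values of the t-tests on each coefficient.
print(toy_model.pvalues)
```

In a multiple regression each coefficient is read as the expected change in `price` for a one-unit change in that variable with the other predictors held fixed, so the recovered values should land close to the 200 and 300,000 used to generate the data.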