<img style="float: right;" src="fig/unifor.jpg" width="250px">

# Introduction to Machine Learning
Prof. Erneson A. Oliveira<br>
MBA in Data Science<br>
Universidade de Fortaleza

# 1. How can housing prices be predicted?

<img src="fig/taahm.jpg" width="800px">

## 1.1 Dataset

In [4]:
import pandas as pd

housing=pd.read_csv('datasets/housing.csv',sep=';',encoding='utf-8') # Open CSV file

In [5]:
housing.head() # preview

| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | ocean_proximity |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -122.23 | 37.88 | 41.0 | 880.0 | 129.0 | 322.0 | 126.0 | 8.3252 | 452600.0 | NEAR BAY |
| 1 | -122.22 | 37.86 | 21.0 | 7099.0 | 1106.0 | 2401.0 | 1138.0 | 8.3014 | 358500.0 | NEAR BAY |
| 2 | -122.24 | 37.85 | 52.0 | 1467.0 | 190.0 | 496.0 | 177.0 | 7.2574 | 352100.0 | NEAR BAY |
| 3 | -122.25 | 37.85 | 52.0 | 1274.0 | 235.0 | 558.0 | 219.0 | 5.6431 | 341300.0 | NEAR BAY |
| 4 | -122.25 | 37.85 | 52.0 | 1627.0 | 280.0 | 565.0 | 259.0 | 3.8462 | 342200.0 | NEAR BAY |


In [6]:
housing.info() # Some general information

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20433 entries, 0 to 20432
Data columns (total 10 columns):
longitude             20433 non-null float64
latitude              20433 non-null float64
housing_median_age    20433 non-null float64
total_rooms           20433 non-null float64
total_bedrooms        20433 non-null float64
population            20433 non-null float64
households            20433 non-null float64
median_income         20433 non-null float64
median_house_value    20433 non-null float64
ocean_proximity       20433 non-null object
dtypes: float64(9), object(1)
memory usage: 1.6+ MB


## 1.2 Multiple Linear Regression

In a linear model, the target value is given by a linear combination of the features. Mathematically,

\begin{equation}\nonumber
\hat{y}=\boldsymbol{\theta}^T \cdot \mathbf{x} = \theta_0+\theta_1 x_1+\dots+\theta_n x_n
\end{equation}

\begin{equation}\nonumber
    \boldsymbol{\theta}=
    \begin{bmatrix}
        \theta_0\\
        \theta_1\\
        \vdots\\
        \theta_n
    \end{bmatrix}
    \quad \text{and} \quad
    \mathbf{x}=
    \begin{bmatrix}
        1\\
        x_1\\
        \vdots\\
        x_n
    \end{bmatrix}
\end{equation}


where $\hat{y}$ is the predicted target value, $n$ is the number of features, $\{\theta_i\}$ are the model parameters ($\theta_0$ is `intercept_` and $\theta_1,\dots,\theta_n$ are `coef_` in scikit-learn), and $\{x_i\}$ are the features. The value of $\boldsymbol{\theta}$ that minimizes the sum of squared residuals (the least-squares cost) is given by the normal equation:

\begin{equation}\nonumber
\hat{\boldsymbol{\theta}}=(\mathbf{X}^T \cdot \mathbf{X})^{-1} \cdot \mathbf{X}^T \cdot \mathbf{y}
\end{equation}

where

\begin{equation}\nonumber
    \mathbf{y}=
    \begin{bmatrix}
        y^{(1)}\\
        y^{(2)}\\
        \vdots\\
        y^{(m)}
    \end{bmatrix}
    \quad \text{and} \quad
    \mathbf{X}=
    \begin{bmatrix}
        1      & x^{(1)}_1 & x^{(1)}_2 & \dots  & x^{(1)}_n\\
        1      & x^{(2)}_1 & x^{(2)}_2 & \dots  & x^{(2)}_n\\
        \vdots & \vdots    & \vdots    & \ddots & \vdots\\
        1      & x^{(m)}_1 & x^{(m)}_2 & \dots  & x^{(m)}_n
    \end{bmatrix}.
\end{equation}

Here:

- $m$ is the number of instances;<br>
- $n$ is the number of features of each instance;<br>
- $\mathbf{y}$ is the vector of labels of all instances;<br>
- $\mathbf{X}$ is the matrix containing all the features of all instances, one row per instance, with a leading column of ones so that $\theta_0$ acts as the intercept, consistent with the definition of $\mathbf{x}$.
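
As a minimal sketch of the normal equation, the NumPy snippet below builds a small synthetic dataset, prepends a column of ones so that $\theta_0$ is estimated together with the other parameters, and evaluates $(\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y}$ directly. The data, the random seed, and the names `theta_true`, `X_b`, and `theta_hat` are illustrative assumptions and are not part of the housing dataset.

```python
import numpy as np

rng = np.random.default_rng(42)  # fixed seed, chosen arbitrarily for reproducibility

m, n = 100, 3                     # m instances, n features (illustrative sizes)
X = rng.normal(size=(m, n))       # synthetic feature matrix, shape (m, n)
theta_true = np.array([4.0, 3.0, -2.0, 0.5])  # [theta_0, theta_1, theta_2, theta_3]

# Prepend a column of ones so that theta_0 plays the role of the intercept
X_b = np.c_[np.ones((m, 1)), X]   # shape (m, n + 1)
y = X_b @ theta_true              # noise-free targets generated by the linear model

# Normal equation: theta_hat = (X^T X)^{-1} X^T y
theta_hat = np.linalg.inv(X_b.T @ X_b) @ X_b.T @ y

print(theta_hat)
```

Because the synthetic targets carry no noise, `theta_hat` should match `theta_true` up to floating-point error. Explicitly inverting $\mathbf{X}^T \mathbf{X}$ can be numerically fragile when features are strongly correlated; `np.linalg.lstsq(X_b, y, rcond=None)` solves the same least-squares problem without forming the inverse.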

In [19]:
from sklearn.linear_model import LinearRegression
#from sklearn.linear_model import SGDRegressor

X=housing.drop(['median_house_value','ocean_proximity'],axis=1) # Matrix of data
y=housing['median_house_value'].copy() # Target value

model=LinearRegression() # Selecting a linear model
#model=SGDRegressor(max_iter=50, penalty=None, eta0=0.1) # recent scikit-learn versions use max_iter instead of n_iter

lin_reg=model.fit(X,y) # Estimating the model parameters

# print(type(lin_reg))
# print(dir(lin_reg))

#print(lin_reg.get_params())

#print(model.score(X,y)) # R^2

print(lin_reg.coef_) # theta_1, theta_2, ...
print(lin_reg.intercept_) # theta_0

[ -4.27301205e+04  -4.25097369e+04   1.15790031e+03  -8.24972507e+00
   1.13820707e+02  -3.83855780e+01   4.77013513e+01   4.02975217e+04]
-3585395.74789
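
To illustrate how the fitted attributes relate to $\boldsymbol{\theta}$, the sketch below fits `LinearRegression` on a small synthetic dataset and checks that `intercept_` and `coef_` agree with the least-squares solution obtained from a matrix with a leading column of ones. The data and the names `X_demo`, `y_demo`, and `demo_model` are hypothetical and are used only so that the snippet runs without the housing CSV.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)   # arbitrary seed for reproducibility
m, n = 200, 2                    # illustrative number of instances and features

# Synthetic data: y = 1.5 + 2.0 * x_1 - 3.0 * x_2 + small Gaussian noise
X_demo = rng.normal(size=(m, n))
y_demo = 1.5 + X_demo @ np.array([2.0, -3.0]) + rng.normal(scale=0.1, size=m)

# scikit-learn estimate: theta_0 in intercept_, theta_1..theta_n in coef_
demo_model = LinearRegression().fit(X_demo, y_demo)

# Least-squares solution on the matrix augmented with a column of ones
X_b = np.c_[np.ones((m, 1)), X_demo]
theta_hat, *_ = np.linalg.lstsq(X_b, y_demo, rcond=None)

# Compare the two parameterizations
same_intercept = np.allclose(theta_hat[0], demo_model.intercept_)
same_coef = np.allclose(theta_hat[1:], demo_model.coef_)
print(same_intercept, same_coef)
```

Both comparisons are expected to hold because the two approaches minimize the same sum of squared residuals; only the packaging of the parameters differs.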


In [20]:
#longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income
Xnew=[[-122.221,37.864,22,7097,1104,2402,1136,8.314]]

ynew=model.predict(pd.DataFrame(Xnew,columns=X.columns)) # Make a prediction (column names match the training data)

print(Xnew)
print(ynew)

[[-122.221, 37.864, 22, 7097, 1104, 2402, 1136, 8.314]]
[ 417137.36528409]
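
As a rough check, the prediction can be reproduced by hand from $\hat{y}=\theta_0+\theta_1 x_1+\dots+\theta_n x_n$, using the parameters printed earlier. The sketch below copies those printed values (which are rounded in the printout), so the result should agree with the model's prediction only up to that rounding. The names `coef`, `intercept`, `x_new`, and `y_manual` are chosen here for illustration.

```python
# Parameters copied from the printed output of the fitted model (rounded values)
coef = [-4.27301205e+04, -4.25097369e+04, 1.15790031e+03, -8.24972507e+00,
        1.13820707e+02, -3.83855780e+01, 4.77013513e+01, 4.02975217e+04]
intercept = -3585395.74789

# Same instance as Xnew, in the same column order as the training matrix:
# longitude, latitude, housing_median_age, total_rooms, total_bedrooms,
# population, households, median_income
x_new = [-122.221, 37.864, 22, 7097, 1104, 2402, 1136, 8.314]

# y_hat = theta_0 + sum_i theta_i * x_i
y_manual = intercept + sum(theta_i * x_i for theta_i, x_i in zip(coef, x_new))

print(y_manual)
```

Note that the order of the values in `x_new` must match the order of the columns used to fit the model; otherwise each coefficient would multiply the wrong feature.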
