# Real Estate estimator

In the following challenge, we will try to figure out if a linear relationship exists between the **price** of a flat and a few usual factors like surface area, bedrooms, etc.

⚠️ Pandas is forbidden in this challenge. The [NumPy documentation](https://docs.scipy.org/doc/numpy/reference/) will be your friend throughout this exercise. You can also find help in this [NumPy cheatsheet](https://s3.amazonaws.com/assets.datacamp.com/blog_assets/Numpy_Python_Cheat_Sheet.pdf).

In [8]:
# Load the NumPy library
import numpy as np
import matplotlib.pyplot as plt

We have been able to collect data for the 4 flats below. Their `surface` (square feet), `bedrooms` and `floors` numbers are the 3 **features** of our problem, and the `price` (in thousands of $) is our **target**:

|flats |surface|bedrooms|floors|price|
|------|-------|--------|------|-----|
|flat1 |620|1|1|244|
|flat2 |3280|4|2|671|
|flat3 |1900|2|2|504|
|flat4 |1320|3|3|510|

A first approach is to try to find a **linear** relationship between the `price` and the 3 features, by solving this system of equations. Here we use the Greek letter theta, $\theta$, to represent the coefficients of our **features**:

$$\begin{cases}
    244 = \theta_0 + 620\theta_1 + 1\theta_2 + 1\theta_3 \\
    671 = \theta_0 + 3280\theta_1 + 4\theta_2 + 2\theta_3 \\
    504 = \theta_0 + 1900\theta_1 + 2\theta_2 + 2\theta_3 \\
    510 = \theta_0 + 1320\theta_1 + 3\theta_2 + 3\theta_3
\end{cases}$$

This can be translated into a matrix equation:

$$Y = X\theta$$

$$\begin{bmatrix}
    244 \\
    671 \\
    504 \\
    510
\end{bmatrix} = \begin{bmatrix}
    1 & 620 & 1 & 1 \\
    1 & 3280 & 4 & 2 \\
    1 & 1900 & 2 & 2 \\
    1 & 1320 & 3 & 3
\end{bmatrix} \begin{bmatrix}
    \theta_0 \\
    \theta_1 \\
    \theta_2 \\
    \theta_3
\end{bmatrix}$$

where $Y$ is the vector of `Price`, $X$ is the matrix of features and $\theta$ (theta) is the vector of coefficients to be found.

If $\theta$ is found, the price of any new flat could be estimated using $$Y_{flat5} = X_{flat5}\theta$$




## 1. Define the matrix `X` of features:

❓ Create a (4,3) `numpy.ndarray` storing the values of the 3 features (surface, bedrooms, floors) for the 4 observations. Double-check its `shape`, `size` and `ndim`.

In [9]:
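# One row per flat; columns are surface, bedrooms, floors (in that order)
X = np.array([[620, 1, 1], [3280, 4, 2], [1900, 2, 2], [1320, 3, 3]], dtype=int)
X

# Make sure `surface` is the second, `bedrooms` the third and
# `floors` the fourth column of the final (4,4) feature matrix.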

array([[ 620,    1,    1],
       [3280,    4,    2],
       [1900,    2,    2],
       [1320,    3,    3]])

In [10]:
print(X.shape)
print(X.size)
print(X.ndim)

(4, 3)
12
2


❓ Add a "constant" vector of 1's to create the (4,4) matrix `X` representing the linear system of equations.

🤔 As you probably noticed, the linear system of equations includes a $\theta_0$ coefficient which appears in the 4 equations. We need an additional feature to represent the y-intercept of the linear regression line (we talk about an [affine relation](https://math.stackexchange.com/questions/275310/what-is-the-difference-between-linear-and-affine-function) rather than a strict linear relation between the `price` and the features - more on that next week).
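
To make the role of the constant column concrete, here is a minimal sketch. The coefficient values in `theta` are hypothetical and chosen only for illustration (they are not the solution of the system). It shows that prepending a 1 to the features turns the affine expression $\theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_3$ into a single dot product.

```python
import numpy as np

# Hypothetical coefficients [theta_0, theta_1, theta_2, theta_3], for illustration only
theta = np.array([10.0, 0.1, 5.0, 20.0])

# Features of flat1: surface, bedrooms, floors
features = np.array([620, 1, 1])

# Affine form: intercept plus weighted sum of the features
affine = theta[0] + features @ theta[1:]

# Same value as a plain dot product once a constant 1 is prepended
augmented = np.concatenate(([1], features))
linear = augmented @ theta

print(affine, linear)
```

Both expressions agree, which is why a column of ones lets the matrix product $X\theta$ carry the intercept $\theta_0$.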

In [11]:
# Define x0 as a (4,1) vector filled with 1 with the fastest NumPy method
x0=np.ones((4,1),dtype=int)
x0

array([[1],
       [1],
       [1],
       [1]])

In [12]:
# Use `numpy.hstack` to create the (4,4) matrix X by concatenating x0 to your previous (4,3) matrix
X=np.hstack((x0,X))

In [13]:
X

array([[   1,  620,    1,    1],
       [   1, 3280,    4,    2],
       [   1, 1900,    2,    2],
       [   1, 1320,    3,    3]])

In [14]:
X.shape

(4, 4)

### Check your code!

In [15]:
from nbresult import ChallengeResult

result = ChallengeResult('features',
    features=X
)
result.write()
print(result.check())

platform darwin -- Python 3.8.12, pytest-6.2.5, py-1.10.0, pluggy-1.0.0 -- /Users/shu/.pyenv/versions/lewagon/bin/python3
cachedir: .pytest_cache
rootdir: /Users/shu/Desktop/Lewagon/code/shiro101010101/data-challenges/03-Maths/01-Algebra-Calculus/01-real-estate-estimator
plugins: dash-2.0.0, anyio-3.3.2
collecting ... collected 3 items

tests/test_features.py::TestFeatures::test_X_shape PASSED                [ 33%]
tests/test_features.py::TestFeatures::test_features_order PASSED         [ 66%]
tests/test_features.py::TestFeatures::test_x0_is_ones PASSED             [100%]



💯 You can commit your code:

git add tests/features.pickle

git commit -m 'Completed features step'

git push origin master


## 2. Define the vector `Y` of `Prices`

$Y = \begin{bmatrix}
    244 \\
    671 \\
    504 \\
    510
\end{bmatrix}$

In order to match our matrix representation $Y = X\theta$, what should the shape of $Y$ be? Define $Y$ below.

<details>
    <summary>Hint</summary>

Y should be a (4,1) array, equivalent to a flat "vector", represented vertically
</details>

In [16]:
# Define Y here
Y=np.array([244,671,504,510]).reshape(4,1)
Y

array([[244],
       [671],
       [504],
       [510]])

### Check your code!

In [17]:
from nbresult import ChallengeResult

result = ChallengeResult('target',
    target=Y
)
result.write()
print(result.check())

platform darwin -- Python 3.8.12, pytest-6.2.5, py-1.10.0, pluggy-1.0.0 -- /Users/shu/.pyenv/versions/lewagon/bin/python3
cachedir: .pytest_cache
rootdir: /Users/shu/Desktop/Lewagon/code/shiro101010101/data-challenges/03-Maths/01-Algebra-Calculus/01-real-estate-estimator
plugins: dash-2.0.0, anyio-3.3.2
collecting ... collected 1 item

tests/test_target.py::TestTarget::test_shape PASSED                      [100%]



💯 You can commit your code:

git add tests/target.pickle

git commit -m 'Completed target step'

git push origin master


## 3. Find the solution of the system

Now is the time to find the vector of coefficients $\theta$ !

The solution of the equation is:
 
$$\begin{aligned}
& X\theta = Y \\
\Leftrightarrow\ & X^{-1}X\theta = X^{-1}Y \\
\Leftrightarrow\ & \theta = X^{-1}Y
\end{aligned}$$

where $X^{-1}$ is the inverse of $X$, which exists only when $X$ is square and non-singular.

In [18]:
# Compute the inverse of the matrix X with the right NumPy method
X_inv=np.linalg.inv(X)
X_inv

array([[ 1.64516129e+00, -7.51278738e-18, -2.90322581e-01,
        -3.54838710e-01],
       [-5.37634409e-04, -1.66950831e-19,  1.07526882e-03,
        -5.37634409e-04],
       [ 3.70967742e-01,  5.00000000e-01, -1.24193548e+00,
         3.70967742e-01],
       [-6.82795699e-01, -5.00000000e-01,  8.65591398e-01,
         3.17204301e-01]])

You can check if the inversion worked by testing:

$$X^{-1}X = I_4$$
where $I_4$ is the 4 by 4 identity matrix.

In [19]:
# Define I4 using the right NumPy method
I4=np.identity(4,dtype=int)
I4

array([[1, 0, 0, 0],
       [0, 1, 0, 0],
       [0, 0, 1, 0],
       [0, 0, 0, 1]])

Now compute $X^{-1}X$:

In [20]:
# Your code
X_dot=np.dot(X_inv,X)
X_dot

array([[ 1.00000000e+00, -1.13686838e-13, -2.22044605e-16,
        -2.22044605e-16],
       [-3.25260652e-19,  1.00000000e+00, -6.50521303e-19,
        -2.16840434e-19],
       [ 3.33066907e-16,  6.25277607e-13,  1.00000000e+00,
         2.22044605e-16],
       [-2.77555756e-16, -5.11590770e-13, -6.66133815e-16,
         1.00000000e+00]])

In [21]:
X_dot==I4

array([[False, False, False, False],
       [False, False, False, False],
       [False, False, False, False],
       [False, False, False, False]])

In [22]:
# You can do it directly using the numpy.linalg.solve method:
# A = np.array([[1, -1], [1, -2]])
# B = np.array([[22], [10]])
# X = np.linalg.solve(A, B)
# X

Does it look like $I_4$?

⛔️ If it doesn't, you probably used the `*` operator to multiply $X^{-1}$ and $X$. That operator works element by element, whereas here we want the matrix product. You should find the right NumPy method to do so.

✅ If it does, you might have noticed that you do not get exactly zeros and ones in the resulting product. To be sure, you can try the [`numpy.allclose()`](https://numpy.org/doc/stable/reference/generated/numpy.allclose.html?highlight=allclose#numpy.allclose) method to check your result:

In [23]:
# Your code
np.allclose(X_dot,I4)

True
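
The difference between the element-wise `==` test and `numpy.allclose()` comes from floating-point rounding. A minimal sketch, using a small made-up array rather than the notebook's matrices:

```python
import numpy as np

# 0.1 + 0.2 is not exactly 0.3 in binary floating point
computed = np.array([0.1 + 0.2, 1.0])
expected = np.array([0.3, 1.0])

# Exact comparison fails on the first element
print(computed == expected)

# allclose accepts differences within a tolerance
# (defaults: rtol=1e-05, atol=1e-08)
print(np.allclose(computed, expected))
```

The same reasoning applies to $X^{-1}X$: its entries differ from those of $I_4$ only by rounding errors far smaller than the default tolerances.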

You are finally able to find $\theta = X^{-1}Y$:

In [24]:
# Compute theta
theta=np.dot(X_inv,Y)
theta

array([[ 74.12903226],
       [  0.13655914],
       [-10.72580645],
       [ 95.93010753]])
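
As a cross-check, the sketch below rebuilds the same `X` and `Y` and solves the system with `numpy.linalg.solve`, which avoids forming the inverse explicitly. It then verifies that both routes agree and that $X\theta$ reproduces $Y$. The variable names `theta_solve` and `theta_inv` are only illustrative.

```python
import numpy as np

# Same system as above: a column of ones followed by the 3 features
X = np.array([[1, 620, 1, 1],
              [1, 3280, 4, 2],
              [1, 1900, 2, 2],
              [1, 1320, 3, 3]], dtype=float)
Y = np.array([[244], [671], [504], [510]], dtype=float)

# Route 1: solve the linear system directly
theta_solve = np.linalg.solve(X, Y)

# Route 2: multiply the inverse by Y, as in the derivation
theta_inv = np.linalg.inv(X) @ Y

print(np.allclose(theta_solve, theta_inv))  # both routes should agree
print(np.allclose(X @ theta_solve, Y))      # the solution should reproduce Y
```

In practice `solve` is usually preferred over an explicit inverse, since it tends to be faster and numerically more stable.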

### Check your code!

In [25]:
from nbresult import ChallengeResult

result = ChallengeResult('solution',
    theta=theta
)
result.write()
print(result.check())

platform darwin -- Python 3.8.12, pytest-6.2.5, py-1.10.0, pluggy-1.0.0 -- /Users/shu/.pyenv/versions/lewagon/bin/python3
cachedir: .pytest_cache
rootdir: /Users/shu/Desktop/Lewagon/code/shiro101010101/data-challenges/03-Maths/01-Algebra-Calculus/01-real-estate-estimator
plugins: dash-2.0.0, anyio-3.3.2
collecting ... collected 2 items

tests/test_solution.py::TestSolution::test_theta0_is_positive PASSED     [ 50%]
tests/test_solution.py::TestSolution::test_theta_shape PASSED            [100%]



💯 You can commit your code:

git add tests/solution.pickle

git commit -m 'Completed solution step'

git push origin master


## 4. Estimation of a new flat price

You finally solved the system finding $\theta$ and are now able to estimate the `Price` (in thousands of $) of a 5th flat given these characteristics:

- `Surface`: 3000 $ft^2$
- `Bedrooms`: 5 
- `Floors`: 1

with the following formula:

$$Y_{flat5} = X_{flat5}\theta$$

In [26]:
# Define X5
X5=[1,3000,5,1]

# Compute Y5
Y5=np.dot(X5,theta)
Y5
# You should find a Price of 526,000 $

array([526.10752688])
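
The same formula extends to several flats at once by stacking one row per flat. In this sketch the coefficients are copied (rounded) from the output of the previous step, and the second flat is a hypothetical one added only for illustration.

```python
import numpy as np

# Coefficients copied (rounded) from the theta computed earlier
theta = np.array([[74.12903226], [0.13655914], [-10.72580645], [95.93010753]])

# Each row is [1, surface, bedrooms, floors]; the leading 1 multiplies theta_0
new_flats = np.array([
    [1, 3000, 5, 1],   # flat5 from the text
    [1, 1000, 2, 1],   # hypothetical extra flat, for illustration only
])

# One matrix product gives one estimated price per row (in thousands of $)
prices = new_flats @ theta
print(prices.ravel())
```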

## 5. Reality check

In reality, a flat's price is never entirely determined by its surface, bedroom and floor numbers.

Let's imagine that we measure the real price $Y_{flat5}$ at \$700,000 instead of the \$526,000 we predicted. Could we take this new information into account to improve our model?

Update the linear system of equations $X\theta = Y$ to reflect this newly measured datapoint.

In [27]:
# Create the new feature matrix X of shape (5,4): the 4 original flats plus flat5
X6=np.array([[1,620,1,1],[1,3280,4,2],[1,1900,2,2],[1,1320,3,3],[1,3000,5,1]])
X6

array([[   1,  620,    1,    1],
       [   1, 3280,    4,    2],
       [   1, 1900,    2,    2],
       [   1, 1320,    3,    3],
       [   1, 3000,    5,    1]])

In [28]:
# Create a new Y of shape (5,1): the 4 original prices plus the measured price of flat5
Y6=np.array([[244],[671],[504],[510],[700]])
Y6

array([[244],
       [671],
       [504],
       [510],
       [700]])

Let's try to predict the price of a 6th flat from our updated model.  
To do so, try to solve $\theta$ from $X\theta = Y$ using [`numpy.linalg.solve`](https://docs.scipy.org/doc/numpy/reference/generated/numpy.linalg.solve.html).

What is going on? What can you conclude?

In [29]:
# Your code
# np.linalg.solve(X6,Y6)

In [30]:
# A = np.array([[1, -1], [1, -2]])
# B = np.array([[22], [10]])
# X = np.linalg.solve(A, B)
# X

<details>
    <summary>👉 Explanations</summary>

$X$ is no longer a square matrix, so it cannot be inverted: $X^{-1}$ does not exist, and $\theta$ cannot be computed as $X^{-1}Y$.
    
Our initial approach, which consists of finding a closed mathematical formula to compute an exact flat price as linear combination of only 3 features **does not hold** true for our 5 observed flats. 

Instead, we will learn in the coming weeks to find ways to **approximate** a flat price based on these features.

For instance, instead of solving $Y = X\theta$ exactly, we could find the $\hat{\theta}$ that minimizes the size of the error $e = X\hat{\theta} - Y$ (for example its squared norm $\|e\|^2$). This approach is called a **linear regression model**.

This new estimator can then be used to give an **approximate** estimation of the price on any new flats with $Y_{flat_6} = X_{flat_6} \hat{\theta}$ 

</details>
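
To illustrate the explanation above, here is a minimal sketch. It assumes the fifth row is the flat from section 4 (3000 ft², 5 bedrooms, 1 floor) with the measured price of 700. It first shows that `numpy.linalg.solve` rejects the non-square system, then uses `numpy.linalg.lstsq` as one possible way to obtain a least-squares $\hat{\theta}$. This is not necessarily the method the course will use later.

```python
import numpy as np

# 5 observations, 4 unknowns: the system is overdetermined
X = np.array([[1, 620, 1, 1],
              [1, 3280, 4, 2],
              [1, 1900, 2, 2],
              [1, 1320, 3, 3],
              [1, 3000, 5, 1]], dtype=float)
Y = np.array([[244], [671], [504], [510], [700]], dtype=float)

# solve() only accepts square matrices, so this raises LinAlgError
try:
    np.linalg.solve(X, Y)
    solve_failed = False
except np.linalg.LinAlgError as err:
    solve_failed = True
    print("solve failed:", err)

# Least squares: find theta_hat minimizing ||X @ theta_hat - Y||^2
theta_hat, residuals, rank, _ = np.linalg.lstsq(X, Y, rcond=None)

# The fit is approximate: the errors are no longer all zero
errors = X @ theta_hat - Y
print(theta_hat.ravel())
print(np.linalg.norm(errors))
```

With more equations than unknowns, the error generally cannot be driven to zero, so $\hat{\theta}$ is a compromise across all 5 flats rather than an exact solution.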

🏁 Congratulations! Don't forget to commit and push your notebook before moving on to the next challenge! 