# 🏠 Real Estate estimator

🎯 In this challenge, we will try to figure out whether there is a ***LINEAR RELATIONSHIP*** between:
- the **price** of a flat (our **target** for each flat)
- and some usual **features** such as surface area, number of bedrooms, etc.

❗️ Pandas is forbidden in this challenge ❗️

😁 Welcome to the [NumPy documentation](https://docs.scipy.org/doc/numpy/reference/) which will be your friend throughout this exercise.

👉 You can also find help on this [NumPy cheatsheet](https://s3.amazonaws.com/assets.datacamp.com/blog_assets/Numpy_Python_Cheat_Sheet.pdf)

In [1]:
import numpy as np


🌆 Suppose that we were able to collect data for 4 flats down below: 
- their **features**:
    - `surface` (square feet)
    - `bedrooms`
    - `floors` 
- their **target**:
    - `price` (in thousands of USD)

|flats |surface (square feet)|bedrooms|floors|price (k USD)|
|------|-------------|--------|------|------------|
|flat1 |620|1|1|244|
|flat2 |3280|4|2|671|
|flat3 |1900|2|2|504|
|flat4 |1320|3|3|510|

👉 A first approach to **predict the price of an apartment** is to try to **find a linear relationship between the target and the features** (*i.e. between the price and the (surface, bedrooms, floors)*), by solving the following **system of $n = 4$ linear equations with $p = 4$ unknown variables**:



$$\begin{cases}
    244 = \theta_0 + 620\theta_1 + 1\theta_2 + 1\theta_3 \\
    671 = \theta_0 + 3280\theta_1 + 4\theta_2 + 2\theta_3 \\
    504 = \theta_0 + 1900\theta_1 + 2\theta_2 + 2\theta_3 \\
    510 = \theta_0 + 1320\theta_1 + 3\theta_2 + 3\theta_3 \\
\end{cases}$$

which can be written as a matrix equation:

$$\boldsymbol y = \boldsymbol {X \cdot \theta}$$

$$\begin{bmatrix}
    244 \\
    671 \\
    504 \\
    510
\end{bmatrix}_{4 \times 1} = \begin{bmatrix}
    1 & 620 & 1 & 1 \\
    1 & 3280 & 4 & 2 \\
    1 & 1900 & 2 & 2 \\
    1 & 1320 & 3 & 3
\end{bmatrix}_{4 \times 4} \begin{bmatrix}
    \theta_0 \\
    \theta_1 \\
    \theta_2 \\
    \theta_3
\end{bmatrix}_{4 \times 1}$$

where:
* $\boldsymbol y$ is the **`target`**, the vector of `Price`
* $\boldsymbol X$ represents the **`matrix of features`**
* $\boldsymbol {\theta} = \begin{bmatrix}
    \theta_0 \\
    \theta_1 \\
    \theta_2 \\
    \theta_3
\end{bmatrix}$ (*theta*) is the vector of **coefficients/variables/unknowns** to be found

----

🤓 Here, we are using the Greek letter `theta` $\boldsymbol \theta = \begin{bmatrix}
    \theta_0 \\
    \theta_1 \\
    \theta_2 \\
    \theta_3 \\
\end{bmatrix}$, to represent the coefficients of our **features**:

- A flat with no surface, no bedroom and no floor would cost $\theta_0$ thousand USD (this is the intercept)
- An increase of one square foot - *holding the number of bedrooms and the floor number constant* -  would increase the price by $\theta_1$ thousand USD
- An additional bedroom - *holding the surface and the floor number constant* -   would increase the price by $\theta_2$ thousand USD
- An increase of one floor number - *holding the surface and the number of bedrooms constant* - would increase the price by $\theta_3$ thousand USD

----

😉 If we manage to solve this system of linear equations (i.e. if we find $\theta_0$, $\theta_1$, $\theta_2$, $\theta_3$), the price of any new flat could be estimated using the following formula: $$y_{newflat} = \boldsymbol x_{newflat} \cdot \boldsymbol \theta$$

## (1) Define the matrix $\boldsymbol X$ of `features`:

❓ Create a $(4,3)$ `numpy.ndarray` storing the values of the 3 features (surface, bedrooms, floors) for the 4 observations. 

In [2]:
# Create the 4x3 matrix of features
X_features = np.array([
    [620, 1, 1],
    [3280, 4, 2],
    [1900, 2, 2],
    [1320, 3, 3]
])

print(X_features)


[[ 620    1    1]
 [3280    4    2]
 [1900    2    2]
 [1320    3    3]]


❓ Double-check the ***shape***, the ***size*** and the ***dim*** of this ***np.array***

In [3]:
print("Shape:", X_features.shape)      
print("Size:", X_features.size)        
print("Dimensions:", X_features.ndim)

Shape: (4, 3)
Size: 12
Dimensions: 2


❓ Add a "constant" vector of ones $\boldsymbol x_0 = \begin{bmatrix}
    1 \\
    1 \\
    1 \\
    1 \\
\end{bmatrix}$ to create the $(4,4)$ matrix $\boldsymbol X$ representing the linear system of equations

<details>
    <summary><i>Explanations</i></summary>

🤔 As you've probably noticed, the linear system of equations includes a $\theta_0$ coefficient which appears in the 4 equations. 

❗️ We need an additional feature to represent the y-intercept of the linear regression line 

_Note_: strictly speaking, the relation between the `price` and the features is an [affine relation](https://math.stackexchange.com/questions/275310/what-is-the-difference-between-linear-and-affine-function) rather than a strictly linear one, because of the constant term $\theta_0$ (_Cf. Decision Science Module_)
    
    
</details>

In [4]:
# Define x0 as a (4,1) vector filled with 1 with the fastest NumPy method

x0 = np.ones((4, 1), dtype=int)
print(x0)


[[1]
 [1]
 [1]
 [1]]


In [5]:
# Use `numpy.hstack` to create the (4,4) matrix X by concatenating x0 to your previous (4,3) matrix
X = np.hstack((x0, X_features))

print(X)

[[   1  620    1    1]
 [   1 3280    4    2]
 [   1 1900    2    2]
 [   1 1320    3    3]]


**🧪 Check your code!**

In [6]:
from nbresult import ChallengeResult

result = ChallengeResult('features',
    features=X
)
result.write()
print(result.check())


platform linux -- Python 3.12.9, pytest-8.3.4, pluggy-1.5.0 -- /home/saranjthilak92/.pyenv/versions/3.12.9/envs/lewagon/bin/python
cachedir: .pytest_cache
rootdir: /home/saranjthilak92/code/saranjthilak/03-Maths/01-Algebra-Calculus/data-real-estate-estimator/tests
plugins: typeguard-4.4.2, anyio-4.8.0
collecting ... collected 3 items

test_features.py::TestFeatures::test_X_shape PASSED                      [ 33%]
test_features.py::TestFeatures::test_features_order PASSED               [ 66%]
test_features.py::TestFeatures::test_x0_is_ones PASSED                   [100%]



💯 You can commit your code:

git add tests/features.pickle

git commit -m 'Completed features step'

git push origin master



## (2) Define the vector $\boldsymbol y$ of `Prices`

$\boldsymbol y  = \begin{bmatrix}
    244 \\
    671 \\
    504 \\
    510
\end{bmatrix}$

❓ In order to match our matrix representation $\boldsymbol y  = \boldsymbol {X\cdot \theta}$, what should the shape of $\boldsymbol y$ be? Define $\boldsymbol y$ down below.

<details>
    <summary><i>Hint</i></summary>

$\boldsymbol y$ should be a $(4,1)$ array, i.e. a column vector (the flat list of prices written vertically)
</details>

In [7]:
y = np.array([[244], [671], [504], [510]])


print("y:", y)
print("Shape of y:", y.shape)

y: [[244]
 [671]
 [504]
 [510]]
Shape of y: (4, 1)


**🧪 Check your code!**

In [8]:
from nbresult import ChallengeResult

result = ChallengeResult('target',
    target=y
)
result.write()
print(result.check())


platform linux -- Python 3.12.9, pytest-8.3.4, pluggy-1.5.0 -- /home/saranjthilak92/.pyenv/versions/3.12.9/envs/lewagon/bin/python
cachedir: .pytest_cache
rootdir: /home/saranjthilak92/code/saranjthilak/03-Maths/01-Algebra-Calculus/data-real-estate-estimator/tests
plugins: typeguard-4.4.2, anyio-4.8.0
collecting ... collected 1 item

test_target.py::TestTarget::test_shape PASSED                            [100%]



💯 You can commit your code:

git add tests/target.pickle

git commit -m 'Completed target step'

git push origin master



## (3) Find the solution of the system

⏰ Now, it's time to find the vector of coefficients $\boldsymbol \theta = \begin{bmatrix}
    \theta_0 \\
    \theta_1 \\
    \theta_2 \\
    \theta_3
\end{bmatrix}$!

👍 Provided that $\boldsymbol X$ is invertible, the solution of the equation is:
 
$$ \large \boldsymbol X \cdot \boldsymbol \theta = \boldsymbol y 
\large \iff \boldsymbol X^{-1} \cdot \boldsymbol X \cdot \boldsymbol \theta = \boldsymbol X^{-1} \cdot \boldsymbol y 
\large \iff \boldsymbol \theta = \boldsymbol X^{-1} \cdot \boldsymbol y
$$

where $\large \boldsymbol X^{-1}$ is the inverse of $\large \boldsymbol X$.
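
As an aside to the derivation above, the same $\boldsymbol \theta$ can also be obtained without forming the inverse explicitly. The sketch below is not part of the challenge: it rebuilds `X` and `y` from the table at the top of the notebook and compares the inverse-based formula with `np.linalg.solve`. The names `theta_inv` and `theta_solve` are illustrative only.

```python
import numpy as np

# Feature matrix (with the leading column of ones) and prices from the table above
X = np.array([
    [1, 620, 1, 1],
    [1, 3280, 4, 2],
    [1, 1900, 2, 2],
    [1, 1320, 3, 3],
])
y = np.array([[244], [671], [504], [510]])

# Route 1: the formula theta = X^{-1} . y
theta_inv = np.linalg.inv(X) @ y

# Route 2: let NumPy solve X . theta = y directly
theta_solve = np.linalg.solve(X, y)

# Both routes should agree up to floating-point rounding
same = np.allclose(theta_inv, theta_solve)
print(same)
```

Solving the system directly is generally the numerically safer habit, but for a small, well-behaved matrix like this one both routes should give the same coefficients up to rounding.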

In [9]:
# Compute the inverse of the matrix X with the right NumPy method
X_inv = np.linalg.inv(X)
print(X_inv)

[[ 1.64516129e+00  4.42419702e-17 -2.90322581e-01 -3.54838710e-01]
 [-5.37634409e-04 -2.50426246e-19  1.07526882e-03 -5.37634409e-04]
 [ 3.70967742e-01  5.00000000e-01 -1.24193548e+00  3.70967742e-01]
 [-6.82795699e-01 -5.00000000e-01  8.65591398e-01  3.17204301e-01]]


👉 You can check that the inversion worked by testing the following equality:

$$\boldsymbol X^{-1} \cdot\boldsymbol X = \boldsymbol I_4$$
where $\boldsymbol I_4$ is the $ 4 \times 4 $ identity matrix $ \begin{bmatrix}
    1 & 0 & 0 & 0 \\
    0 & 1 & 0 & 0 \\
    0 & 0 & 1 & 0 \\
    0 & 0 & 0 & 1
\end{bmatrix}$

In [10]:
# Define I4 using the right NumPy method
I4 = np.eye(4)
print(I4)


[[1. 0. 0. 0.]
 [0. 1. 0. 0.]
 [0. 0. 1. 0.]
 [0. 0. 0. 1.]]


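👉 Now compute the product. The cell below evaluates $\boldsymbol X \boldsymbol X^{-1}$; for an invertible matrix, $\boldsymbol X^{-1} \boldsymbol X$ and $\boldsymbol X \boldsymbol X^{-1}$ both equal $\boldsymbol I_4$: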

In [11]:
result = np.dot(X, X_inv)
print(result)

[[ 1.00000000e+00  0.00000000e+00 -1.11022302e-16 -5.55111512e-17]
 [-2.22044605e-16  1.00000000e+00 -2.22044605e-16  1.11022302e-16]
 [-4.44089210e-16 -2.22044605e-16  1.00000000e+00 -1.11022302e-16]
 [ 2.22044605e-16  1.11022302e-16 -3.33066907e-16  1.00000000e+00]]


❓ Does it look like $\boldsymbol I_4$? ❓

⛔️ If it doesn't, you probably used the `*` operator to perform the multiplication between $\boldsymbol X^{-1}$ and $\boldsymbol X$. The `*` operator multiplies element by element, whereas here we want the matrix product. You should find the right NumPy method to do so.

✅ If it does, you might have noticed that you do not get exactly zeros and ones in the resulting product. To be sure, you can try the [`numpy.allclose()`](https://numpy.org/doc/stable/reference/generated/numpy.allclose.html?highlight=allclose#numpy.allclose) method to check your result:

In [12]:
is_identity = np.allclose(result, I4)
print("\nIs the result close to the identity matrix?", is_identity)


Is the result close to the identity matrix? True
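
To make the difference between the two kinds of product concrete, here is a minimal sketch (again rebuilding `X` from the table above, with illustrative variable names) that compares the element-wise `*` operator with the matrix product `@`, which is equivalent to `np.dot` for 2-D arrays.

```python
import numpy as np

X = np.array([
    [1, 620, 1, 1],
    [1, 3280, 4, 2],
    [1, 1900, 2, 2],
    [1, 1320, 3, 3],
])
X_inv = np.linalg.inv(X)
I4 = np.eye(4)

# Element-wise product: multiplies entries position by position
elementwise = X * X_inv

# Matrix product: rows of X combined with columns of X_inv
matrix_product = X @ X_inv

elementwise_is_identity = np.allclose(elementwise, I4)
matrix_product_is_identity = np.allclose(matrix_product, I4)

print("Element-wise product close to I4?", elementwise_is_identity)
print("Matrix product close to I4?", matrix_product_is_identity)
```

Only the matrix product is expected to match the identity; the tiny non-zero off-diagonal values it contains are floating-point rounding, which is why `np.allclose` is used instead of an exact comparison.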


🎉 You are finally able to compute `theta` using the following formula: $ \large \boldsymbol \theta = \boldsymbol X^{-1}\cdot \boldsymbol y $:

In [13]:
theta = np.dot(X_inv, y)

**🧪 Check your code!**

In [14]:
from nbresult import ChallengeResult

result = ChallengeResult('solution',
    theta=theta
)
result.write()
print(result.check())


platform linux -- Python 3.12.9, pytest-8.3.4, pluggy-1.5.0 -- /home/saranjthilak92/.pyenv/versions/3.12.9/envs/lewagon/bin/python
cachedir: .pytest_cache
rootdir: /home/saranjthilak92/code/saranjthilak/03-Maths/01-Algebra-Calculus/data-real-estate-estimator/tests
plugins: typeguard-4.4.2, anyio-4.8.0
collecting ... collected 2 items

test_solution.py::TestSolution::test_theta0_is_correct PASSED            [ 50%]
test_solution.py::TestSolution::test_theta_shape PASSED                  [100%]



💯 You can commit your code:

git add tests/solution.pickle

git commit -m 'Completed solution step'

git push origin master



## (4) Estimation of a new flat price

You have finally solved the system by finding $\boldsymbol \theta$ and are now able to estimate the `Price` (in thousands of USD) of a 5th flat with these characteristics:

- `Surface`: 3000 $ft^2$
- `Bedrooms`: 5 
- `Floors`: 1

with the following formula:

$$y_{flat5} = \boldsymbol x_{flat5} \cdot \boldsymbol \theta$$

In [19]:
# Define x5
X5 = np.array([1,3000,5,1])

# Compute y5
y5 = np.dot(X5,theta)
print(y5)
# You should find a price of about 526 (in thousands of USD), i.e. roughly 526,000 $


[526.10752688]
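
A quick sanity check can help before trusting this prediction. The sketch below, which assumes the same `X` and `y` as defined earlier and uses illustrative names such as `fitted` and `x5`, verifies that $\boldsymbol \theta$ reproduces the four known prices exactly (up to rounding) and then recomputes the price of the 5th flat.

```python
import numpy as np

X = np.array([
    [1, 620, 1, 1],
    [1, 3280, 4, 2],
    [1, 1900, 2, 2],
    [1, 1320, 3, 3],
])
y = np.array([[244], [671], [504], [510]])
theta = np.linalg.inv(X) @ y

# With 4 equations and 4 unknowns, theta should reproduce the 4 known prices
fitted = X @ theta
reproduces_prices = np.allclose(fitted, y)
print("Known prices reproduced?", reproduces_prices)

# New flat: leading 1 for the intercept, then surface, bedrooms, floors
x5 = np.array([1, 3000, 5, 1])
y5 = x5 @ theta  # shape (1,), in thousands of USD
print("Predicted price (k USD):", y5)
```

Because the four flats are fitted perfectly, this check says nothing about how well the model generalizes to a flat it has not seen.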


## (5) Reality-check

❗️ In reality, a flat price is never entirely determined by its surface, number of bedrooms and floor number.

Let's imagine that we measure the real price $y_{flat5}$ at 700,000 USD instead of the predicted 526,000 USD (that is, 700 instead of 526 in the units of $\boldsymbol y$).

🤨 Could we take this new information into account to improve our model?

❓ Update the linear system of equations $ \large \boldsymbol X \cdot \boldsymbol \theta = \boldsymbol y$ to incorporate the information carried by this new flat.

In [16]:
# Create the new matrix of features X of shape (5,4)
# Print the shape to double-check the shape is indeed (5,4)


In [17]:
# Create new y of shape (5,1)
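
One possible way to fill in the two cells above is sketched here, under the assumption that the 5th flat is simply appended as a new row. The names `X_new`, `y_new`, `x5` and `y5_observed` are hypothetical and not imposed by the challenge.

```python
import numpy as np

# Original 4x4 system (column of ones, surface, bedrooms, floors)
X = np.array([
    [1, 620, 1, 1],
    [1, 3280, 4, 2],
    [1, 1900, 2, 2],
    [1, 1320, 3, 3],
])
y = np.array([[244], [671], [504], [510]])

# 5th flat as a (1, 4) row and its observed price (700 k USD) as a (1, 1) array
x5 = np.array([[1, 3000, 5, 1]])
y5_observed = np.array([[700]])

# Stack the new observation below the existing ones
X_new = np.vstack((X, x5))
y_new = np.vstack((y, y5_observed))

print(X_new.shape)  # (5, 4)
print(y_new.shape)  # (5, 1)
```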


👉 Let's try to predict the price of a 6th flat from our updated model.  
To do so, try to solve for $\boldsymbol \theta$ in $\boldsymbol X \cdot \boldsymbol \theta = \boldsymbol y$ using [`numpy.linalg.solve`](https://docs.scipy.org/doc/numpy/reference/generated/numpy.linalg.solve.html).

😲 What is going on? Try to understand why you get an error. Then, open the explanation below the cell.

In [18]:
# YOUR CODE HERE
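
A minimal sketch of what this cell might look like follows, assuming the $(5,4)$ matrix and $(5,1)$ vector are rebuilt under the hypothetical names `X_new` and `y_new`. The call is wrapped in `try`/`except` so that the error can be inspected instead of stopping the notebook.

```python
import numpy as np

# Updated system with the 5th flat appended (5 equations, 4 unknowns)
X_new = np.array([
    [1, 620, 1, 1],
    [1, 3280, 4, 2],
    [1, 1900, 2, 2],
    [1, 1320, 3, 3],
    [1, 3000, 5, 1],
])
y_new = np.array([[244], [671], [504], [510], [700]])

error_message = None
try:
    # np.linalg.solve expects a square coefficient matrix
    theta_new = np.linalg.solve(X_new, y_new)
except np.linalg.LinAlgError as err:
    error_message = str(err)
    print("LinAlgError:", error_message)
```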

<details>
    <summary><i>Explanations</i></summary>

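$ \large \boldsymbol X$ is no longer a square matrix: it now has 5 rows and 4 columns

$ \large  \rightarrow$  therefore it is not invertible: $ \large  \boldsymbol X^{-1}$ does not exist
 
$ \large  \rightarrow$ $ \large \boldsymbol \theta$ cannot be computed as $ \large \boldsymbol X^{-1} \cdot \boldsymbol y$, and with 5 equations for only 4 unknowns the system $ \large  \boldsymbol y = \boldsymbol X \cdot \boldsymbol \theta$ generally has no exact solution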
    
😕 Our initial approach, which consists in finding a closed mathematical formula to compute a predicted price of a flat as a linear combination of only 3 features **does not hold true** for our 5 observed flats. 

😉 ***Trust the process !*** 

$ \large  \rightarrow$ Instead, we will learn in the coming weeks methods to **approximate** a flat price based on these features.

For instance, instead of solving $\large  \boldsymbol y = \boldsymbol X \cdot \boldsymbol \theta$ exactly, we could look for the $ \large  \hat{\boldsymbol \theta}$ that makes the error $ \large \boldsymbol e = \boldsymbol X \cdot \hat{\boldsymbol \theta} - \boldsymbol y $ as small as possible, for example by minimizing its squared norm $ \large \lVert \boldsymbol e \rVert^2$. This approach is called a **Linear Regression model**.

This new estimator can then be used to give an **approximate** estimate of the price of any new flat with $ \large  \hat y_{flat_6} = \boldsymbol x_{flat_6} \cdot \hat{\boldsymbol \theta}$ 

</details>
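
As a preview of the approach described in the explanation, here is a hedged sketch using `np.linalg.lstsq`, NumPy's least-squares solver. It is one way to find an approximate $\hat{\boldsymbol \theta}$ for the 5-flat system, not necessarily the method used later in the course, and the names `X_new`, `y_new` and `theta_hat` are illustrative.

```python
import numpy as np

# Overdetermined system: 5 flats, 4 unknown coefficients
X_new = np.array([
    [1, 620, 1, 1],
    [1, 3280, 4, 2],
    [1, 1900, 2, 2],
    [1, 1320, 3, 3],
    [1, 3000, 5, 1],
])
y_new = np.array([[244], [671], [504], [510], [700]])

# Least-squares fit: minimizes the squared norm of X_new @ theta_hat - y_new
theta_hat, residuals, rank, singular_values = np.linalg.lstsq(X_new, y_new, rcond=None)

# Size of the remaining error: the 5 prices are no longer matched exactly
error = X_new @ theta_hat - y_new
error_norm = np.linalg.norm(error)

print("theta_hat shape:", theta_hat.shape)
print("Rank of X_new:", rank)
print("Norm of the error:", error_norm)
```

The error norm is not zero here, which is the point: the model now approximates all five prices instead of matching four of them perfectly.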

🏁 Congratulations! 

💾 Don't forget to commit and push your notebook before moving on to the next challenge! 