# 🏠 Real Estate estimator

🎯 In the following challenge, we will try to figure out whether there exists a ***LINEAR RELATIONSHIP*** between:
- the **price** of a flat (our **target** for each flat)
- and some usual **features** such as surface area, bedrooms, etc.

❗️ Pandas is forbidden in this challenge ❗️

😁 Welcome to the [NumPy documentation](https://docs.scipy.org/doc/numpy/reference/) which will be your friend throughout this exercise.

👉 You can also find help on this [NumPy cheatsheet](https://s3.amazonaws.com/assets.datacamp.com/blog_assets/Numpy_Python_Cheat_Sheet.pdf)

In [1]:
# Load the NumPy library
import numpy as np

🌆 Suppose that we were able to collect data for the 4 flats below:
- their **features**:
    - `surface` (square feet)
    - `bedrooms`
    - `floors` 
- their **target**:
    - `price` (in thousands of USD)

|flats |surface (square feet)|bedrooms|floors|price (k USD)|
|------|-------------|--------|------|------------|
|flat1 |620|1|1|244|
|flat2 |3280|4|2|671|
|flat3 |1900|2|2|504|
|flat4 |1320|3|3|510|

👉 A first approach to **predict the price of an apartment** is to try to **find a linear relationship between the target and the features** (*i.e. between the price and the (surface, bedrooms, floors)*), by solving the following **system of $n = 4$ linear equations with $p = 4$ unknown variables**:



$$\begin{cases}
    244 = \theta_0 + 620\theta_1 + 1\theta_2 + 1\theta_3 \\
    671 = \theta_0 + 3280\theta_1 + 4\theta_2 + 2\theta_3 \\
    504 = \theta_0 + 1900\theta_1 + 2\theta_2 + 2\theta_3 \\
    510 = \theta_0 + 1320\theta_1 + 3\theta_2 + 3\theta_3 \\
\end{cases}$$

which can be translated into a matrix equation:

$$Y = X\theta$$

$$\begin{bmatrix}
    244 \\
    671 \\
    504 \\
    510
\end{bmatrix}_{4 \times 1} = \begin{bmatrix}
    1 & 620 & 1 & 1 \\
    1 & 3280 & 4 & 2 \\
    1 & 1900 & 2 & 2 \\
    1 & 1320 & 3 & 3
\end{bmatrix}_{4 \times 4} \begin{bmatrix}
    \theta_0 \\
    \theta_1 \\
    \theta_2 \\
    \theta_3
\end{bmatrix}_{4 \times 1}$$

where :
* $Y$ is the **`target`**, the vector of `Price`
* $X$ represents the **`matrix of features`**
* $\theta = \begin{bmatrix}
    \theta_0 \\
    \theta_1 \\
    \theta_2 \\
    \theta_3
\end{bmatrix}$ (*theta*) is the vector of **coefficients/variables/unknowns** to be found

----

🤓 Here, we are using the Greek letter `theta` $\theta = \begin{bmatrix}
    \theta_0 \\
    \theta_1 \\
    \theta_2 \\
    \theta_3 \\
\end{bmatrix}$, to represent the coefficients of our **features**:

- A flat with no surface, no bedroom and no floor would cost $\theta_0$
- An increase of one square foot - *holding the number of bedrooms and the floor number constant* - would increase the price by $\theta_1$ thousand USD
- An additional bedroom - *holding the surface and the floor number constant* -   would increase the price by $\theta_2$ thousand USD
- An increase of one floor number - *holding the surface and the number of bedrooms constant* - would increase the price by $\theta_3$ thousand USD
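
To see why $\theta_2$ reads as the price of one extra bedroom, compare two hypothetical flats with the same surface $s$ and the same floor value $f$, but with $b + 1$ and $b$ bedrooms (the symbols $s$, $b$ and $f$ are introduced here only for illustration):

$$\big(\theta_0 + \theta_1 s + \theta_2 (b + 1) + \theta_3 f\big) - \big(\theta_0 + \theta_1 s + \theta_2 b + \theta_3 f\big) = \theta_2$$

The same subtraction isolates $\theta_1$ or $\theta_3$ when only the surface or only the floor value changes by one unit.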

----

😉 If we manage to solve this system of linear equations (i.e. if we find $\theta_0$, $\theta_1$, $\theta_2$, $\theta_3$), the price of any new flat could be estimated using the following formula: $$Y_{newflat} = X_{newflat} \theta$$

## (1) Define the matrix `X` of `features`:

❓ Create a $(4,3)$ `numpy.ndarray` storing the values of the 3 features (surface, bedrooms, floors) for the 4 observations. 

In [2]:
features = np.array([[620, 1, 1],
                    [3280, 4, 2],
                    [1900, 2, 2],
                    [1320, 3, 3]])            

❓ Double-check the ***shape***, the ***size*** and the ***dim*** of this ***np.array***

In [3]:
print("Array features:")
print("Shape: ", features.shape)
print("Size: ", features.size)
print("Dimensions: ", features.ndim)


Array features:
Shape:  (4, 3)
Size:  12
Dimensions:  2


❓Add a "constant" vector of ones $ = \begin{bmatrix}
    1 \\
    1 \\
    1 \\
    1 \\
\end{bmatrix}$ to create the $(4,4)$ matrix $X$ representing the linear system of equations

<details>
    <summary><i>Explanations</i></summary>

🤔 As you've probably noticed, the linear system of equations includes a $\theta_0$ coefficient which appears in the 4 equations. 

❗️ We need an additional feature to represent the y-intercept of the linear regression line 

_Note_ : we talk about an [affine relation](https://math.stackexchange.com/questions/275310/what-is-the-difference-between-linear-and-affine-function) rather than a strict linear relation between the `price` and the features 
    
    
</details>

In [4]:
x0 = np.ones((4, 1))

In [5]:
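# Use `numpy.hstack` to create the (4,4) matrix X by concatenating x0 to your previous (4,3) matrix
# Note: x0 is appended as the LAST column here, so the unknowns come out in the
# order (theta_1, theta_2, theta_3, theta_0) rather than (theta_0, ..., theta_3)

X = np.hstack([features, x0])
print(X)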

[[6.20e+02 1.00e+00 1.00e+00 1.00e+00]
 [3.28e+03 4.00e+00 2.00e+00 1.00e+00]
 [1.90e+03 2.00e+00 2.00e+00 1.00e+00]
 [1.32e+03 3.00e+00 3.00e+00 1.00e+00]]
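
The written system places the column of ones first, while the matrix printed above places it last. Both describe the same equations; only the order of the unknowns changes. A minimal sketch of the two layouts (the names `X_ones_last` and `X_ones_first` are made up for this illustration):

```python
import numpy as np

features = np.array([[620, 1, 1],
                     [3280, 4, 2],
                     [1900, 2, 2],
                     [1320, 3, 3]])
ones = np.ones((features.shape[0], 1))

# Ones last: unknowns ordered (theta_1, theta_2, theta_3, theta_0)
X_ones_last = np.hstack([features, ones])
# Ones first, as in the written equations: (theta_0, theta_1, theta_2, theta_3)
X_ones_first = np.hstack([ones, features])

# Reordering the columns of one layout gives the other
print(np.array_equal(X_ones_first, X_ones_last[:, [3, 0, 1, 2]]))
```

Whichever layout you pick, keep it consistent when building the feature row of a new flat.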


## (2) Define the vector `Y` of `Prices`

$Y = \begin{bmatrix}
    244 \\
    671 \\
    504 \\
    510
\end{bmatrix}$

❓ In order to match our matrix representation $Y = X\theta$, what should the shape of $Y$ be? Define $Y$ below. ❓

<details>
    <summary><i>Hint</i></summary>

Y should be a $(4,1)$ array, i.e. a column vector (the same four values as a flat 1-D array, but laid out vertically)
</details>
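
As a small illustration of the difference between a flat array and a column vector, here is one possible way to go from shape `(4,)` to shape `(4, 1)`. The variable names are illustrative only:

```python
import numpy as np

prices_flat = np.array([244, 671, 504, 510])  # shape (4,): a 1-D array
Y_col = prices_flat.reshape(-1, 1)            # shape (4, 1): a column vector

print(prices_flat.shape, Y_col.shape)
```

Passing `-1` lets NumPy infer the number of rows from the number of elements.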

In [6]:
# Create Y with shape (4, 1)
print("To match the matrix equation Y = X @ theta with X of shape (4, 4), "
      + "Y's shape must be (4, 1).")

Y = np.array([[244], [671], [504], [510]])


To match the matrix equation Y = X @ theta with X of shape (4, 4), Y's shape must be (4, 1).


## (3) Find the solution of the system

⏰Now, it's time to find the vector of coefficients $\theta = \begin{bmatrix}
    \theta_0 \\
    \theta_1 \\
    \theta_2 \\
    \theta_3
\end{bmatrix}$ !

👍 The solution of the equation is:
 
$$ \large X\theta = Y 
\large \Leftrightarrow X^{-1}X\theta = X^{-1}Y 
\large \Leftrightarrow \theta = X^{-1}Y
$$

where $\large X^{-1}$ is the inverse of $\large X$.

In [7]:
# Compute the inverse of the matrix X with the right NumPy method

X_I = np.linalg.inv(X)

👉 You can check that the inversion worked by testing the following equality:

$$X^{-1}X = I_4$$
where $I_4$ is the $ 4 \times 4 $ identity matrix $ \begin{bmatrix}
    1 & 0 & 0 & 0 \\
    0 & 1 & 0 & 0 \\
    0 & 0 & 1 & 0 \\
    0 & 0 & 0 & 1
\end{bmatrix}$

In [8]:
# Define I4 using the right NumPy method

I4 = np.identity(4)

👉 Now compute $X^{-1}X$:

In [9]:
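#YOUR CODE
# Note: np.matmul(X, X_I) computes X @ X^{-1}; for an invertible square matrix,
# both X @ X^{-1} and X^{-1} @ X equal the identity
X_invX = np.matmul(X, X_I)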
print(X_invX)


[[1.00000000e+00 4.56623986e-17 5.55111512e-17 0.00000000e+00]
 [0.00000000e+00 1.00000000e+00 7.21644966e-16 3.33066907e-16]
 [0.00000000e+00 4.56623986e-17 1.00000000e+00 1.11022302e-16]
 [2.22044605e-16 2.67707004e-16 1.66533454e-16 1.00000000e+00]]


❓ Does it look like $I_4$? ❓

⛔️ If it doesn't, you probably used the `*` operator to perform the multiplication between $X^{-1}$ and $X$. The `*` operator multiplies element by element, whereas here we want the matrix product. You should find the right NumPy function to do so.

✅ If it does, you might have noticed that you do not get exactly zeros and ones in the resulting product, because of floating-point rounding. To be sure, you can try the [`numpy.allclose()`](https://numpy.org/doc/stable/reference/generated/numpy.allclose.html?highlight=allclose#numpy.allclose) function to check your result:

In [10]:
np.allclose(X_invX, I4)

True
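
To make the difference between `*` and the matrix product concrete, here is a small sketch on a toy $2 \times 2$ matrix. The matrix `A` is made up for this example and is not part of the challenge data:

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0]])
A_inv = np.linalg.inv(A)

elementwise = A_inv * A   # multiplies entries pairwise, NOT a matrix product
matrix_prod = A_inv @ A   # matrix product, same as np.matmul(A_inv, A)

print(np.allclose(elementwise, np.identity(2)))  # False
print(np.allclose(matrix_prod, np.identity(2)))  # True
```

The `@` operator and `np.matmul` are interchangeable for 2-D arrays like these.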

🎉 You are finally able to compute `theta` using the following formula: $ \large \theta = X^{-1}Y$:

In [11]:
#YOUR CODE
theta = np.matmul(X_I, Y)
print("𝜃 coefficients: ")
print(theta)

𝜃 coefficients: 
[[  0.13655914]
 [-10.72580645]
 [ 95.93010753]
 [ 74.12903226]]
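
One way to sanity-check these coefficients is to plug them back into the system and confirm that $X\theta$ reproduces $Y$. The sketch below rebuilds `X` and `Y` so that it runs on its own. It also compares the explicit inverse with `numpy.linalg.solve`, which solves the system without forming $X^{-1}$; the names `theta_inv` and `theta_solve` are illustrative:

```python
import numpy as np

# Same layout as above: (surface, bedrooms, floors, constant)
X = np.array([[620, 1, 1, 1],
              [3280, 4, 2, 1],
              [1900, 2, 2, 1],
              [1320, 3, 3, 1]], dtype=float)
Y = np.array([[244], [671], [504], [510]], dtype=float)

theta_inv = np.linalg.inv(X) @ Y     # explicit inverse, as in the cell above
theta_solve = np.linalg.solve(X, Y)  # solves X @ theta = Y directly

print(np.allclose(theta_inv, theta_solve))  # both routes agree
print(np.allclose(X @ theta_inv, Y))        # theta reproduces the 4 prices
```

For larger systems, solving directly is generally preferred over computing the inverse, both for speed and for numerical accuracy.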


## (4) Estimation of a new flat price

You have finally solved the system by finding $\theta$ and are now able to estimate the `Price` (in thousands of USD) of a 5th flat given these characteristics:

- `Surface`: 3000 $ft^2$
- `Bedrooms`: 5 
- `Floors`: 1

with the following formula:

$$ \large Y_{flat5} = X_{flat5}\theta$$

In [12]:
X5 = np.array([3000, 5, 1, 1])
#YOUR CODE 

Y5 = np.matmul(X5, theta)
# You should find a price of about 526 thousand USD
print(f"The predicted price for a 3000 sq. ft flat on one floor with 5 bedrooms is {Y5[0]:.1f} thousand USD.")


The predicted price for a 3000 sq. ft flat on one floor with 5 bedrooms is 526.1 thousand USD.


## (5) Reality-check

❗️ In reality, a flat price is never entirely determined by its surface, number of bedrooms and floor number.

Let's imagine that we measure the real price $Y_{flat5}$ at 700,000 USD instead of the predicted 526,000 USD.

🤨 Could we take this new information into account to improve our model?

❓ Update the linear system of equations $ \large X\theta = Y$ to incorporate the information carried by this new flat.

In [13]:
# Create the new matrix of features X of shape (5,4)
# Print the shape to double-check the shape is indeed (5,4)

# Updating the features matrix with the new values
# We also add the extra 1 at the end
new_values = np.array([[3000, 5, 1, 1]])
X2 = np.vstack([X, new_values])

In [14]:
# Create new Y of shape (5,1)
# Updating the prices vector with the new value
Y2 = np.append(Y, [[700]], axis=0)

👉 Let's try to predict the price of a 6th flat from our updated model.  
To do so, try to solve for $\theta$ in $X\theta = Y$ using [`numpy.linalg.solve`](https://docs.scipy.org/doc/numpy/reference/generated/numpy.linalg.solve.html).

😲 What is going on? What can you conclude?

In [15]:
print(X2)
print(Y2)
theta_2 = np.linalg.solve(X2, Y2)
print(theta_2)

[[6.20e+02 1.00e+00 1.00e+00 1.00e+00]
 [3.28e+03 4.00e+00 2.00e+00 1.00e+00]
 [1.90e+03 2.00e+00 2.00e+00 1.00e+00]
 [1.32e+03 3.00e+00 3.00e+00 1.00e+00]
 [3.00e+03 5.00e+00 1.00e+00 1.00e+00]]
[[244]
 [671]
 [504]
 [510]
 [700]]


LinAlgError: Last 2 dimensions of the array must be square

So why is this approach failing? NumPy tells us that the last 2 dimensions of the array must be square: `numpy.linalg.solve` only accepts a square coefficient matrix, and `X2` has shape (5, 4). With 5 equations and only 4 unknowns, the system is overdetermined, and a non-square matrix has no inverse, so `theta_2` cannot be computed this way.

<details>
    <summary><i>Explanations</i></summary>

$ \large X$ is not a square matrix 

$ \large  \rightarrow$  therefore it is not invertible: $ \large  X^{-1}$ does not exist
 
$ \large  \rightarrow$ $ \large \theta$ cannot be computed from $ \large  Y = X\theta$ 
    
😕 Our initial approach, which consists of finding a closed-form formula that gives the price of a flat as an exact linear combination of only 3 features, **does not hold true** for our 5 observed flats.

😉 ***Trust the process !*** 

$ \large  \rightarrow$ Instead, we will learn in the coming weeks methods to **approximate** a flat price based on these features.

For instance, instead of solving $\large  Y = X\theta$ exactly, we could find the $ \large  \hat{\theta}$ that makes the error $ \large  e = X\hat{\theta} - Y $ as small as possible, typically by minimizing its squared norm $ \large \|e\|^2$. This approach is called a **Linear Regression model**.

This new estimator can then be used to give an **approximate** estimate of the price of any new flat with $ \large  Y_{flat_6} = X_{flat_6} \hat{\theta}$

</details>
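
The least-squares idea mentioned in the explanations can be sketched with `numpy.linalg.lstsq`, which accepts non-square systems. This is only a preview under the assumption that minimizing the sum of squared errors is the criterion we want, not the method the coming weeks will necessarily use. The name `theta_hat` is illustrative:

```python
import numpy as np

# The 5 flats, with columns (surface, bedrooms, floors, constant)
X2 = np.array([[620, 1, 1, 1],
               [3280, 4, 2, 1],
               [1900, 2, 2, 1],
               [1320, 3, 3, 1],
               [3000, 5, 1, 1]], dtype=float)
Y2 = np.array([[244], [671], [504], [510], [700]], dtype=float)

# Least-squares estimate: theta_hat minimizes ||X2 @ theta - Y2||^2
theta_hat, residuals, rank, singular_values = np.linalg.lstsq(X2, Y2, rcond=None)

errors = X2 @ theta_hat - Y2  # no longer all zero: the fit is approximate
print(theta_hat.shape)
print(rank)
print(np.sum(errors ** 2))    # sum of squared errors reached by theta_hat
```

Unlike the exact solution found with 4 flats, `theta_hat` does not reproduce every observed price; it trades a small error on each flat for the best overall fit.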

🏁 Congratulations! 

💾 Don't forget to save your notebook before moving on to the next challenge! 