# Setting up the Environment

## Check Python version

In [1]:
!python3 --version

Python 3.10.11


## Install Python modules

In [2]:
# Install NumPy, pandas, scikit-learn and seaborn (latest releases compatible with the Python 3.10 interpreter shown above)
!pip install numpy pandas scikit-learn seaborn

Could not fetch URL https://pypi.org/simple/pip/: There was a problem confirming the ssl certificate: HTTPSConnectionPool(host='pypi.org', port=443): Max retries exceeded with url: /simple/pip/ (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1007)'))) - skipping
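The message above appears to come from pip's check for a newer version of itself (note the trailing "skipping"), so on its own it does not say whether the four packages were actually installed. A minimal sketch for confirming this from Python, using only the standard library, follows. The helper name `installed_versions` is hypothetical and not part of the original notebook.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


# Use the distribution names passed to pip (scikit-learn, not sklearn).
report = installed_versions(["numpy", "pandas", "scikit-learn", "seaborn"])
for name, ver in report.items():
    print(f"{name}: {ver if ver is not None else 'NOT INSTALLED'}")
```

If any package is reported as not installed, the SSL problem most likely blocked the download as well, and the install step needs to be retried once the certificate issue is resolved.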


## Import modules 

In [3]:
import numpy as np 
import pandas as pd

import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline 

***

## Question 1
What's the version of pandas that you installed?

In [5]:
pd.__version__

'2.0.1'

## Getting the data 

In [13]:
!wget https://raw.githubusercontent.com/alexeygrigorev/datasets/master/housing.csv -O ../data/01_data_homework_cohort3.csv --no-check-certificate

--2023-09-18 20:10:23--  https://raw.githubusercontent.com/alexeygrigorev/datasets/master/housing.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
  Unable to locally verify the issuer's authority.
HTTP request sent, awaiting response... 200 OK
Length: 1423529 (1.4M) [text/plain]
Saving to: ‘../data/01_data_homework_cohort3.csv’


2023-09-18 20:10:24 (3.80 MB/s) - ‘../data/01_data_homework_cohort3.csv’ saved [1423529/1423529]



***

## Question 2

How many records are in the dataset?

In [26]:
df = pd.read_csv('../data/01_data_homework_cohort3.csv')
df.shape

(20640, 10)

***

## Question 3

Which columns in the dataset have missing values?

In [27]:
df.isnull().sum()

longitude               0
latitude                0
housing_median_age      0
total_rooms             0
total_bedrooms        207
population              0
households              0
median_income           0
median_house_value      0
ocean_proximity         0
dtype: int64
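With only ten columns the answer is easy to read off the full listing, but on wider datasets it can help to filter the counts down to the columns that actually contain missing values. The sketch below shows one way to do that on a small made-up frame. The name `toy` and its values are illustrative assumptions, not rows from the housing data.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the housing data (values are made up).
toy = pd.DataFrame({
    "total_rooms": [880.0, 7099.0, 1467.0],
    "total_bedrooms": [129.0, np.nan, 190.0],
    "population": [322.0, 2401.0, 496.0],
})

# Count missing values per column, as in the cell above.
missing_counts = toy.isnull().sum()

# Keep only the columns with at least one missing value.
cols_with_missing = missing_counts[missing_counts > 0]
print(cols_with_missing)
```

Applied to the real `df`, the same filter should leave only `total_bedrooms`, matching the listing above.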

***

## Question 4

Number of unique values in the 'ocean_proximity' column

In [28]:
df['ocean_proximity'].nunique()

5

In [29]:
df['ocean_proximity'].value_counts()

ocean_proximity
<1H OCEAN     9136
INLAND        6551
NEAR OCEAN    2658
NEAR BAY      2290
ISLAND           5
Name: count, dtype: int64

***

## Question 5

Average value of the 'median_house_value' for the houses near the bay

In [30]:
int(df[df['ocean_proximity'] == 'NEAR BAY']['median_house_value'].mean())

259212
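The same average can be computed for every category at once with `groupby`, which makes it easy to compare the bay area against the other locations. The following is a minimal sketch on made-up data. The frame `toy` and its values are assumptions for illustration only.

```python
import pandas as pd

# Toy frame standing in for the housing data (values are made up).
toy = pd.DataFrame({
    "ocean_proximity": ["NEAR BAY", "INLAND", "NEAR BAY", "ISLAND"],
    "median_house_value": [300000.0, 100000.0, 200000.0, 450000.0],
})

# Single category: select rows and column in one .loc step, then average.
near_bay = toy["ocean_proximity"] == "NEAR BAY"
near_bay_mean = toy.loc[near_bay, "median_house_value"].mean()

# All categories at once: one mean per ocean_proximity value.
mean_by_category = toy.groupby("ocean_proximity")["median_house_value"].mean()
print(mean_by_category)
```

Both approaches give the same number for a given category; `groupby` is simply more convenient when several categories are of interest.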

***

## Question 6

- Calculate the average of total_bedrooms column in the dataset.
- Use the fillna method to fill the missing values in total_bedrooms with the mean value from the previous step.
- Now, calculate the average of total_bedrooms again.
- Has it changed?

Has the mean value changed after filling missing values?

In [31]:
# Before fillna
df['total_bedrooms'].mean()

537.8705525375618

In [32]:
# After fillna
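# Assign the result back instead of using inplace=True on a column selection,
# which newer pandas versions warn about and may not apply to df.
df['total_bedrooms'] = df['total_bedrooms'].fillna(df['total_bedrooms'].mean())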
df['total_bedrooms'].mean()

537.8705525375617
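The two results differ only in the last printed digit, which is consistent with floating-point rounding rather than a real change. Mathematically, if n known values have mean m and k missing values are replaced by m, the new mean is (n·m + k·m) / (n + k) = m. The sketch below checks this on a small made-up series with a tolerance-based comparison. The series `s` is an illustrative assumption.

```python
import math

import numpy as np
import pandas as pd

# Toy series with one missing value (values are made up).
s = pd.Series([1.0, 2.0, np.nan, 4.0])

# mean() skips NaN by default, so this is (1 + 2 + 4) / 3.
mean_before = s.mean()

# Fill the gap with that mean and average again over all four values.
filled = s.fillna(mean_before)
mean_after = filled.mean()

# Compare with a tolerance instead of ==, since the last bits may differ.
unchanged = math.isclose(mean_before, mean_after, rel_tol=1e-12)
print(mean_before, mean_after, unchanged)
```

So the answer to the question is no: filling with the mean leaves the mean unchanged, up to rounding.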

***

## Question 7

- Select all the options located on islands.
- Select only columns housing_median_age, total_rooms, total_bedrooms.
- Get the underlying NumPy array. Let's call it X.
- Compute matrix-matrix multiplication between the transpose of X and X. To get the transpose, use X.T. Let's call the result XTX.
- Compute the inverse of XTX.
- Create an array y with values [950, 1300, 800, 1000, 1300].
- Multiply the inverse of XTX with the transpose of X, and then multiply the result by y. Call the result w.
- What's the value of the last element of w?

Value of the last element of w

In [50]:
df = df[df['ocean_proximity'] == 'ISLAND']
df.head(5)

| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | ocean_proximity |
|---|---|---|---|---|---|---|---|---|---|---|
| 8314 | -118.32 | 33.35 | 27.0 | 1675.0 | 521.0 | 744.0 | 331.0 | 2.1579 | 450000.0 | ISLAND |
| 8315 | -118.33 | 33.34 | 52.0 | 2359.0 | 591.0 | 1100.0 | 431.0 | 2.8333 | 414700.0 | ISLAND |
| 8316 | -118.32 | 33.33 | 52.0 | 2127.0 | 512.0 | 733.0 | 288.0 | 3.3906 | 300000.0 | ISLAND |
| 8317 | -118.32 | 33.34 | 52.0 | 996.0 | 264.0 | 341.0 | 160.0 | 2.7361 | 450000.0 | ISLAND |
| 8318 | -118.48 | 33.43 | 29.0 | 716.0 | 214.0 | 422.0 | 173.0 | 2.6042 | 287500.0 | ISLAND |


In [51]:
df[['housing_median_age', 'total_rooms', 'total_bedrooms']].head(5)

| | housing_median_age | total_rooms | total_bedrooms |
|---|---|---|---|
| 8314 | 27.0 | 1675.0 | 521.0 |
| 8315 | 52.0 | 2359.0 | 591.0 |
| 8316 | 52.0 | 2127.0 | 512.0 |
| 8317 | 52.0 | 996.0 | 264.0 |
| 8318 | 29.0 | 716.0 | 214.0 |


In [52]:
X = df[['housing_median_age', 'total_rooms', 'total_bedrooms']].to_numpy()
X

array([[  27., 1675.,  521.],
       [  52., 2359.,  591.],
       [  52., 2127.,  512.],
       [  52.,  996.,  264.],
       [  29.,  716.,  214.]])

In [53]:
XTX = np.matmul(X.T, X)
XTX

array([[9.6820000e+03, 3.5105300e+05, 9.1357000e+04],
       [3.5105300e+05, 1.4399307e+07, 3.7720360e+06],
       [9.1357000e+04, 3.7720360e+06, 9.9835800e+05]])

In [54]:
from numpy.linalg import inv 

iXTX = inv(XTX)
iXTX

array([[ 9.19403586e-04, -3.66412216e-05,  5.43072261e-05],
       [-3.66412216e-05,  8.23303633e-06, -2.77534485e-05],
       [ 5.43072261e-05, -2.77534485e-05,  1.00891325e-04]])

In [55]:
y = [950, 1300, 800, 1000, 1300]
y

[950, 1300, 800, 1000, 1300]

In [56]:
w = np.matmul(np.matmul(iXTX, X.T), y)
w

array([23.12330961, -1.48124183,  5.69922946])

In [57]:
w[-1]

5.699229455065586
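The steps in this question compute w = (XᵀX)⁻¹Xᵀy, the normal-equation solution of a least-squares fit. As a cross-check, the sketch below reproduces that computation from the `X` and `y` values printed above and compares it with two alternatives that avoid forming the inverse explicitly: `np.linalg.solve` and `np.linalg.lstsq`. The alternatives are not required by the exercise; they are shown only as a check on the result.

```python
import numpy as np

# X and y copied from the cells above (island rows, three selected columns).
X = np.array([
    [27.0, 1675.0, 521.0],
    [52.0, 2359.0, 591.0],
    [52.0, 2127.0, 512.0],
    [52.0, 996.0, 264.0],
    [29.0, 716.0, 214.0],
])
y = np.array([950, 1300, 800, 1000, 1300], dtype=float)

# As in the exercise: explicit inverse of X^T X, then multiply by X^T and y.
w_inv = np.linalg.inv(X.T @ X) @ X.T @ y

# Alternative 1: solve the linear system (X^T X) w = X^T y directly.
w_solve = np.linalg.solve(X.T @ X, X.T @ y)

# Alternative 2: let NumPy's least-squares routine work on X itself.
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(w_inv)
print(w_solve)
print(w_lstsq)
```

All three should agree to within floating-point tolerance, with the last element close to the 5.6992 shown above. Solving the system is generally preferred over inverting in numerical code, although for a 3×3 matrix like this one the difference is negligible.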