# Setting up the Environment

## Check Python version

In [1]:
!python3 --version

Python 3.10.11


## Install Python modules

In [2]:
## Install numpy, pandas, scikit-learn and seaborn
!pip install numpy pandas scikit-learn seaborn

Could not fetch URL https://pypi.org/simple/pip/: There was a problem confirming the ssl certificate: HTTPSConnectionPool(host='pypi.org', port=443): Max retries exceeded with url: /simple/pip/ (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1007)'))) - skipping
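The message above seems to come from pip failing to reach PyPI for its own index lookup ("skipping"). On its own it does not show whether the four packages are present. A minimal sketch of one way to check from Python, using only the standard library; the helper name `installed_versions` is hypothetical and not part of the original notebook:

```python
from importlib import metadata


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # The distribution is not installed in this environment.
            versions[name] = None
    return versions


print(installed_versions(["numpy", "pandas", "scikit-learn", "seaborn"]))
```

Any entry reported as `None` would need to be installed before the imports in the next cell can succeed.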


## Import modules 

In [3]:
import numpy as np 
import pandas as pd

import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline 

***

## Question 1
What's the version of pandas that you installed?

In [5]:
pd.__version__

'2.0.1'

## Getting the data 

In [13]:
!wget https://raw.githubusercontent.com/alexeygrigorev/datasets/master/housing.csv -O ../data/01_data_homework_cohort3.csv --no-check-certificate

--2023-09-18 20:10:23--  https://raw.githubusercontent.com/alexeygrigorev/datasets/master/housing.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
  Unable to locally verify the issuer's authority.
HTTP request sent, awaiting response... 200 OK
Length: 1423529 (1.4M) [text/plain]
Saving to: ‘../data/01_data_homework_cohort3.csv’


2023-09-18 20:10:24 (3.80 MB/s) - ‘../data/01_data_homework_cohort3.csv’ saved [1423529/1423529]



***

## Question 2

How many records are in the dataset?

In [26]:
df = pd.read_csv('../data/01_data_homework_cohort3.csv')
df.shape

(20640, 10)

***

## Question 3

Which columns in the dataset have missing values?

In [27]:
df.isnull().sum()

longitude               0
latitude                0
housing_median_age      0
total_rooms             0
total_bedrooms        207
population              0
households              0
median_income           0
median_house_value      0
ocean_proximity         0
dtype: int64
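To list only the columns that contain missing values, the counts can be filtered to those above zero. The sketch below uses a small made-up frame (`toy`) as a stand-in for the housing data, so the values are illustrative only:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the housing data (values are made up).
toy = pd.DataFrame({
    "total_rooms": [880.0, 7099.0, 1467.0],
    "total_bedrooms": [129.0, np.nan, 190.0],
})

missing = toy.isnull().sum()                      # per-column count of NaN entries
cols_with_missing = missing[missing > 0].index.tolist()
print(cols_with_missing)  # ['total_bedrooms']
```

Applied to `df`, the same filter should leave only `total_bedrooms`, matching the counts shown above.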

***

## Question 4

Number of unique values in the 'ocean_proximity' column

In [28]:
df['ocean_proximity'].nunique()

5

In [29]:
df['ocean_proximity'].value_counts()

ocean_proximity
<1H OCEAN     9136
INLAND        6551
NEAR OCEAN    2658
NEAR BAY      2290
ISLAND           5
Name: count, dtype: int64

***

## Question 5

Average value of the 'median_house_value' for the houses near the bay

In [30]:
int(df[df['ocean_proximity'] == 'NEAR BAY']['median_house_value'].mean())

259212

***

## Question 6

- Calculate the average of total_bedrooms column in the dataset.
- Use the fillna method to fill the missing values in total_bedrooms with the mean value from the previous step.
- Now, calculate the average of total_bedrooms again.
- Has it changed?

Has the mean value changed after filling missing values?

In [31]:
# Before fillna
df['total_bedrooms'].mean()

537.8705525375618

In [32]:
# After fillna
df['total_bedrooms'] = df['total_bedrooms'].fillna(df['total_bedrooms'].mean())
df['total_bedrooms'].mean()

537.8705525375617
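The two results differ only in the last digit, which is consistent with floating-point rounding rather than a real change. If n observed values have mean μ and m missing entries are filled with μ, the new mean is (nμ + mμ) / (n + m) = μ. A small sketch on made-up values, not the housing data:

```python
import numpy as np
import pandas as pd

s = pd.Series([129.0, np.nan, 190.0, 235.0, np.nan])  # made-up values

mean_before = s.mean()          # NaN entries are skipped by default
filled = s.fillna(mean_before)  # returns a new Series; s itself is unchanged
mean_after = filled.mean()

# Equal up to floating-point rounding
print(np.isclose(mean_before, mean_after))  # True
```

So the answer to the question is that the mean has not changed in any meaningful sense.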

***

## Question 7

- Select all the options located on islands.
- Select only columns housing_median_age, total_rooms, total_bedrooms.
- Get the underlying NumPy array. Let's call it X.
- Compute matrix-matrix multiplication between the transpose of X and X. To get the transpose, use X.T. Let's call the result XTX.
- Compute the inverse of XTX.
- Create an array y with values [950, 1300, 800, 1000, 1300].
- Multiply the inverse of XTX with the transpose of X, and then multiply the result by y. Call the result w.
- What's the value of the last element of w?

Value of the last element of w

In [37]:
df.head(5)

|   | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | ocean_proximity |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -122.23 | 37.88 | 41.0 | 880.0 | 129.0 | 322.0 | 126.0 | 8.3252 | 452600.0 | NEAR BAY |
| 1 | -122.22 | 37.86 | 21.0 | 7099.0 | 1106.0 | 2401.0 | 1138.0 | 8.3014 | 358500.0 | NEAR BAY |
| 2 | -122.24 | 37.85 | 52.0 | 1467.0 | 190.0 | 496.0 | 177.0 | 7.2574 | 352100.0 | NEAR BAY |
| 3 | -122.25 | 37.85 | 52.0 | 1274.0 | 235.0 | 558.0 | 219.0 | 5.6431 | 341300.0 | NEAR BAY |
| 4 | -122.25 | 37.85 | 52.0 | 1627.0 | 280.0 | 565.0 | 259.0 | 3.8462 | 342200.0 | NEAR BAY |


In [36]:
df[['housing_median_age', 'total_rooms', 'total_bedrooms']].head(5)

|   | housing_median_age | total_rooms | total_bedrooms |
|---|---|---|---|
| 0 | 41.0 | 880.0 | 129.0 |
| 1 | 21.0 | 7099.0 | 1106.0 |
| 2 | 52.0 | 1467.0 | 190.0 |
| 3 | 52.0 | 1274.0 | 235.0 |
| 4 | 52.0 | 1627.0 | 280.0 |


In [38]:
# Note: no ISLAND filter is applied here, so X holds all 20640 rows
X = df[['housing_median_age', 'total_rooms', 'total_bedrooms']].to_numpy()
X

array([[  41.,  880.,  129.],
       [  21., 7099., 1106.],
       [  52., 1467.,  190.],
       ...,
       [  17., 2254.,  485.],
       [  18., 1860.,  409.],
       [  16., 2785.,  616.]])

In [39]:
XTX = np.matmul(X.T, X)
XTX

array([[2.01984850e+07, 1.35332892e+09, 2.83204712e+08],
       [1.35332892e+09, 2.41621366e+11, 4.67660348e+10],
       [2.83204712e+08, 4.67660348e+10, 9.59926544e+09]])

In [40]:
from numpy.linalg import inv 

iXTX = inv(XTX)
iXTX

array([[ 8.47988169e-08,  1.62386103e-10, -3.29291640e-09],
       [ 1.62386103e-10,  7.28537219e-11, -3.59722129e-10],
       [-3.29291640e-09, -3.59722129e-10,  1.95383149e-09]])

In [47]:
y = [950, 1300, 800, 1000, 1300]
y

[950, 1300, 800, 1000]

In [48]:
w = np.matmul(np.matmul(iXTX, X.T), y)
w

ValueError: matmul: Input operand 1 has a mismatch in its core dimension 0, with gufunc signature (n?,k),(k,m?)->(n?,m?) (size 4 is different from 20640)
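The error reports a length mismatch: `X` was built from all 20640 rows, whereas `y` is meant to pair with the five island rows. The echoed `y` in the previous cell has only four elements, so that cell was evidently run with a shorter list than the one shown. A minimal sketch of the intended computation, assuming the filter is `ocean_proximity == 'ISLAND'`. It runs on a made-up stand-in frame (`toy`), so the resulting numbers are not the homework answer:

```python
import numpy as np
import pandas as pd

# Toy stand-in: five ISLAND rows plus one other row (all values are made up).
toy = pd.DataFrame({
    "housing_median_age": [20.0, 30.0, 40.0, 50.0, 25.0, 41.0],
    "total_rooms": [1000.0, 1500.0, 1200.0, 2000.0, 800.0, 880.0],
    "total_bedrooms": [200.0, 350.0, 250.0, 380.0, 210.0, 129.0],
    "ocean_proximity": ["ISLAND"] * 5 + ["NEAR BAY"],
})

cols = ["housing_median_age", "total_rooms", "total_bedrooms"]

# Filter to the island rows first, then take the three columns.
X = toy.loc[toy["ocean_proximity"] == "ISLAND", cols].to_numpy()
y = np.array([950, 1300, 800, 1000, 1300])

# The row count of X must match the length of y before multiplying.
assert X.shape[0] == y.shape[0]

XTX = X.T @ X
w = np.linalg.inv(XTX) @ X.T @ y

print(X.shape, w.shape)  # (5, 3) (3,)
```

With the real `df` in place of `toy`, `w[-1]` would be the value the question asks for.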