## Homework

### Set up the environment

You need to install Python, NumPy, Pandas, Matplotlib and Seaborn. To do so, follow the instructions in
[06-environment.md](../01-intro/06-environment.md).

In [1]:
# Import the libraries
import numpy as np
import pandas as pd
import seaborn as sb

### Q1. Pandas version

What's the version of Pandas that you installed?

You can get the version information using the `__version__` field:

```python
pd.__version__
```

In [2]:
pd.__version__

'2.2.3'

**Ans:** Pandas v2.2.3

### Getting the data 

For this homework, we'll use the Laptops Price dataset. Download it from 
[here](https://raw.githubusercontent.com/alexeygrigorev/datasets/master/laptops.csv).

You can do it with wget:

```bash
wget https://raw.githubusercontent.com/alexeygrigorev/datasets/master/laptops.csv
```

Or just open it with your browser and click "Save as...".

Now read it with Pandas.

In [3]:
# Read from CSV
df = pd.read_csv('https://raw.githubusercontent.com/alexeygrigorev/datasets/master/laptops.csv')
df

| | Laptop | Status | Brand | Model | CPU | RAM | Storage | Storage type | GPU | Screen | Touch | Final Price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ASUS ExpertBook B1 B1502CBA-EJ0436X Intel Core... | New | Asus | ExpertBook | Intel Core i5 | 8 | 512 | SSD | NaN | 15.6 | No | 1009.00 |
| 1 | Alurin Go Start Intel Celeron N4020/8GB/256GB ... | New | Alurin | Go | Intel Celeron | 8 | 256 | SSD | NaN | 15.6 | No | 299.00 |
| 2 | ASUS ExpertBook B1 B1502CBA-EJ0424X Intel Core... | New | Asus | ExpertBook | Intel Core i3 | 8 | 256 | SSD | NaN | 15.6 | No | 789.00 |
| 3 | MSI Katana GF66 12UC-082XES Intel Core i7-1270... | New | MSI | Katana | Intel Core i7 | 16 | 1000 | SSD | RTX 3050 | 15.6 | No | 1199.00 |
| 4 | HP 15S-FQ5085NS Intel Core i5-1235U/16GB/512GB... | New | HP | 15S | Intel Core i5 | 16 | 512 | SSD | NaN | 15.6 | No | 669.01 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2155 | Razer Blade 17 FHD 360Hz Intel Core i7-11800H/... | Refurbished | Razer | Blade | Intel Core i7 | 16 | 1000 | SSD | RTX 3060 | 17.3 | No | 2699.99 |
| 2156 | Razer Blade 17 FHD 360Hz Intel Core i7-11800H/... | Refurbished | Razer | Blade | Intel Core i7 | 16 | 1000 | SSD | RTX 3070 | 17.3 | No | 2899.99 |
| 2157 | Razer Blade 17 FHD 360Hz Intel Core i7-11800H/... | Refurbished | Razer | Blade | Intel Core i7 | 32 | 1000 | SSD | RTX 3080 | 17.3 | No | 3399.99 |
| 2158 | Razer Book 13 Intel Evo Core i7-1165G7/16GB/1T... | Refurbished | Razer | Book | Intel Evo Core i7 | 16 | 1000 | SSD | NaN | 13.4 | Yes | 1899.99 |
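
If you downloaded the file with wget, you can read the local copy instead of fetching the URL every time. A minimal sketch, assuming the file was saved as `laptops.csv` in the working directory; the helper name `load_laptops` and its parameters are hypothetical, not part of the homework:

```python
from pathlib import Path

import pandas as pd

# URL given in the homework text
DATA_URL = 'https://raw.githubusercontent.com/alexeygrigorev/datasets/master/laptops.csv'


def load_laptops(local_path='laptops.csv', url=DATA_URL):
    """Read the local copy if it exists, otherwise fall back to the URL.

    `load_laptops` and `local_path` are illustrative names only.
    """
    source = local_path if Path(local_path).exists() else url
    return pd.read_csv(source)
```

Calling `load_laptops()` with no arguments should give the same dataframe as the cell above, provided the local file is an unmodified copy.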


### Q2. Records count

How many records are in the dataset?

- 12
- 1000
- 2160
- 12160

In [4]:
# Get number of rows in the dataframe
df.shape[0]

2160

**Ans:** 2160 records
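
`df.shape` is a `(rows, columns)` tuple, so `df.shape[0]` is the row count, and `len(df)` gives the same number. A small sketch on a made-up three-row frame (the values are illustrative, not taken from the dataset):

```python
import pandas as pd

# Toy frame standing in for the laptops data (values are made up)
toy = pd.DataFrame({
    'Brand': ['Asus', 'HP', 'Dell'],
    'RAM': [8, 16, 32],
})

n_rows, n_cols = toy.shape   # shape is a (rows, columns) tuple
print(n_rows, n_cols)        # 3 2
print(len(toy) == n_rows)    # True
```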

### Q3. Laptop brands

How many laptop brands are represented in the dataset?

- 12
- 27
- 28
- 2160

In [5]:
# Calculate number of distinct laptop brands in the dataset
df['Brand'].nunique()

27

**Ans:** 27 laptop brands
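
`nunique()` counts distinct values. By default it skips `NaN`, and it treats strings that differ only in case as different values. A sketch on a made-up column (not the real `Brand` data):

```python
import numpy as np
import pandas as pd

# Made-up brand column with a repeat, a case variant, and a missing value
brands = pd.Series(['Asus', 'HP', 'Asus', 'asus', np.nan])

print(brands.nunique())               # 3: 'Asus', 'HP', 'asus' (NaN is skipped)
print(brands.nunique(dropna=False))   # 4: NaN counted as its own value
print(brands.str.lower().nunique())   # 2: case folded before counting
```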

### Q4. Missing values

How many columns in the dataset have missing values?

- 0
- 1
- 2
- 3

In [6]:
# Summarize the number of missing values for each column in the dataframe
df.isnull().sum()

Laptop             0
Status             0
Brand              0
Model              0
CPU                0
RAM                0
Storage            0
Storage type      42
GPU             1371
Screen             4
Touch              0
Final Price        0
dtype: int64

**Ans:** 3 columns have missing values: `Storage type`, `GPU`, and `Screen`.
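
The number of affected columns can also be computed directly rather than read off the summary. A sketch on a made-up frame; the same two expressions should work on `df`:

```python
import numpy as np
import pandas as pd

# Made-up frame: two of the three columns contain missing values
toy = pd.DataFrame({
    'RAM': [8, 16, 32],
    'GPU': [np.nan, 'RTX 3050', np.nan],
    'Screen': [15.6, np.nan, 14.0],
})

missing_per_column = toy.isnull().sum()   # NaN count per column
cols_with_missing = missing_per_column[missing_per_column > 0].index.tolist()

print(cols_with_missing)          # ['GPU', 'Screen']
print(toy.isnull().any().sum())   # 2
```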

### Q5. Maximum final price

What's the maximum final price of Dell notebooks in the dataset?

- 869
- 3691
- 3849
- 3936

In [7]:
# Display the maximum final price for each laptop brand
df.groupby('Brand')['Final Price'].max()

Brand
Acer                3691.00
Alurin               869.00
Apple               3849.00
Asus                5758.14
Deep Gaming         1639.01
Dell                3936.00
Denver               329.95
Dynabook Toshiba    1805.01
Gigabyte            3799.00
HP                  5368.77
Innjoo               431.38
Jetwing              469.27
LG                  2399.00
Lenovo              5018.14
MSI                 7150.47
Medion              3799.00
Microsoft           3747.91
Millenium           2312.71
PcCom               1949.90
Primux               599.41
Prixton              329.95
Razer               4999.01
Realme               999.00
Samsung             3699.01
Thomson              436.56
Toshiba              799.00
Vant                1217.01
Name: Final Price, dtype: float64

In [8]:
# Find maximum final price for Dell laptops
df.loc[df['Brand'] == 'Dell', 'Final Price'].max()

np.float64(3936.0)

**Ans:** The maximum final price of Dell notebooks in the dataset is 3936.
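
Filtering with a boolean mask and then aggregating gives the same number as the Dell row of the `groupby` result. A sketch on made-up prices (not from the dataset), using `.loc` to pick the rows and the column in one step:

```python
import pandas as pd

# Made-up prices for illustration only
toy = pd.DataFrame({
    'Brand': ['Dell', 'HP', 'Dell', 'Asus'],
    'Final Price': [999.0, 1200.0, 1500.0, 800.0],
})

is_dell = toy['Brand'] == 'Dell'   # boolean mask, one entry per row
max_by_mask = toy.loc[is_dell, 'Final Price'].max()
max_by_group = toy.groupby('Brand')['Final Price'].max()['Dell']

print(max_by_mask)                  # 1500.0
print(max_by_mask == max_by_group)  # True
```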

### Q6. Median value of Screen

1. Find the median value of the `Screen` column in the dataset.
2. Next, calculate the most frequent value of the same `Screen` column.
3. Use the `fillna` method to fill the missing values in the `Screen` column with the most frequent value from the previous step.
4. Now, calculate the median value of `Screen` once again.

Has it changed?

> Hint: refer to existing `mode` and `median` functions to complete the task.

- Yes
- No

In [9]:
# Provide statistics of the Screen column
df['Screen'].describe()

count    2156.000000
mean       15.168112
std         1.203329
min        10.100000
25%        14.000000
50%        15.600000
75%        15.600000
max        18.000000
Name: Screen, dtype: float64

In [10]:
# Calculate median value of Screen column
df['Screen'].median()

np.float64(15.6)

In [11]:
# Calculate most frequent value of Screen column
df['Screen'].mode()[0]

np.float64(15.6)

In [12]:
# Provide a summary on the count of each Screen size
df['Screen'].value_counts()

Screen
15.60    1009
14.00     392
16.00     174
17.30     161
13.30     131
16.10      48
17.00      33
13.00      27
15.00      21
13.40      19
13.50      19
11.60      16
14.20      14
12.30      13
14.10      11
13.60      11
16.20      10
15.30       8
10.50       7
12.40       6
14.40       6
15.40       5
12.00       4
18.00       3
14.50       3
13.90       2
12.50       1
10.95       1
10.10       1
Name: count, dtype: int64

In [13]:
# Display records with screen size of 15.6
df[df['Screen'] == 15.6]

| | Laptop | Status | Brand | Model | CPU | RAM | Storage | Storage type | GPU | Screen | Touch | Final Price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ASUS ExpertBook B1 B1502CBA-EJ0436X Intel Core... | New | Asus | ExpertBook | Intel Core i5 | 8 | 512 | SSD | NaN | 15.6 | No | 1009.00 |
| 1 | Alurin Go Start Intel Celeron N4020/8GB/256GB ... | New | Alurin | Go | Intel Celeron | 8 | 256 | SSD | NaN | 15.6 | No | 299.00 |
| 2 | ASUS ExpertBook B1 B1502CBA-EJ0424X Intel Core... | New | Asus | ExpertBook | Intel Core i3 | 8 | 256 | SSD | NaN | 15.6 | No | 789.00 |
| 3 | MSI Katana GF66 12UC-082XES Intel Core i7-1270... | New | MSI | Katana | Intel Core i7 | 16 | 1000 | SSD | RTX 3050 | 15.6 | No | 1199.00 |
| 4 | HP 15S-FQ5085NS Intel Core i5-1235U/16GB/512GB... | New | HP | 15S | Intel Core i5 | 16 | 512 | SSD | NaN | 15.6 | No | 669.01 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2149 | Razer Blade 15 Advanced Model QHD Intel Core i... | Refurbished | Razer | Blade | Intel Core i7 | 16 | 1000 | SSD | RTX 3070 | 15.6 | No | 2899.99 |
| 2150 | Razer Blade 15 Advanced Model QHD Intel Core i... | Refurbished | Razer | Blade | Intel Core i7 | 32 | 1000 | SSD | RTX 3080 | 15.6 | No | 3299.99 |
| 2151 | Razer Blade 15 Advanced Model QHD Intel Core i... | Refurbished | Razer | Blade | Intel Core i7 | 32 | 1000 | SSD | RTX 3080 | 15.6 | No | 3399.99 |
| 2152 | Razer Blade 15 Base Model FHD Intel Core i7-10... | Refurbished | Razer | Blade | Intel Core i7 | 16 | 512 | SSD | RTX 3060 | 15.6 | No | 1232.74 |


In [14]:
# Get number of records of laptops with a screen size of 15.6
df[df['Screen'] == 15.6].shape[0]

1009

In [15]:
# Display records with missing values in Screen column
df[df['Screen'].isnull()]

| | Laptop | Status | Brand | Model | CPU | RAM | Storage | Storage type | GPU | Screen | Touch | Final Price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 624 | Acer Extensa 15 EX215-54 Intel Core i5-1135G7/... | New | Acer | Extensa | Intel Core i5 | 8 | 256 | SSD | NaN | NaN | No | 524.99 |
| 1430 | HP ENVY x360 2-in-1 Laptop 15-ew0008np Intel C... | New | HP | Envy | Intel Core i7 | 16 | 512 | SSD | RTX 2050 | NaN | Yes | 1863.52 |
| 1503 | Lenovo IdeaPad Gaming 3 15ACH6 AMD Ryzen 5 560... | New | Lenovo | IdeaPad | AMD Ryzen 5 | 16 | 512 | SSD | RTX 3060 | NaN | No | 1505.0 |
| 1548 | Lenovo ThinkPad P15 Gen 2 Intel Core i7-11850H... | New | Lenovo | ThinkPad | Intel Core i7 | 16 | 512 | SSD | RTX A2000 | NaN | No | 2569.0 |


In [16]:
# Tally records with missing values in Screen column
df['Screen'].isnull().sum()

np.int64(4)

In [17]:
# Fill missing values in Screen column with the most frequent screen size
df['Screen'] = df['Screen'].fillna(df['Screen'].mode()[0])

In [18]:
# Tally records of missing values in Screen column after filling
df['Screen'].isnull().sum()

np.int64(0)

In [19]:
# Retrieve earlier records with missing values in Screen column to verify it has been filled
df['Screen'].loc[[624, 1430, 1503, 1548]]

624     15.6
1430    15.6
1503    15.6
1548    15.6
Name: Screen, dtype: float64

In [20]:
# Provide an updated summary on the count of each Screen size
df['Screen'].value_counts()

Screen
15.60    1013
14.00     392
16.00     174
17.30     161
13.30     131
16.10      48
17.00      33
13.00      27
15.00      21
13.40      19
13.50      19
11.60      16
14.20      14
12.30      13
14.10      11
13.60      11
16.20      10
15.30       8
10.50       7
12.40       6
14.40       6
15.40       5
12.00       4
18.00       3
14.50       3
13.90       2
12.50       1
10.95       1
10.10       1
Name: count, dtype: int64

In [21]:
# Calculate the new median value of Screen column
df['Screen'].median()

np.float64(15.6)

**Ans:** No. The median of `Screen` is still 15.6 after `fillna`. The four missing values were filled with the mode, which is also 15.6, so the middle of the sorted values does not move.
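
A sketch on made-up screen sizes: filling with the mode leaves the median unchanged when the two coincide, and can shift it when they differ. The series below are illustrative only:

```python
import numpy as np
import pandas as pd

# Case 1: mode equals median, so filling with the mode keeps the median
screen = pd.Series([14.0, 15.6, 15.6, 15.6, 17.3, np.nan, np.nan])
mode = screen.mode()[0]
filled = screen.fillna(mode)
print(mode, screen.median(), filled.median())   # 15.6 15.6 15.6

# Case 2: mode differs from median, so the median can move
other = pd.Series([13.0, 13.0, 14.0, 15.6, 17.3, np.nan, np.nan])
other_filled = other.fillna(other.mode()[0])
print(other.median(), other_filled.median())    # 14.0 13.0
```

So "No" is a property of this dataset, not of `fillna` in general.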

### Q7. Sum of weights

1. Select all the "Innjoo" laptops from the dataset.
2. Select only columns `RAM`, `Storage`, `Screen`.
3. Get the underlying NumPy array. Let's call it `X`.
4. Compute matrix-matrix multiplication between the transpose of `X` and `X`. To get the transpose, use `X.T`. Let's call the result `XTX`.
5. Compute the inverse of `XTX`.
6. Create an array `y` with values `[1100, 1300, 800, 900, 1000, 1100]`.
7. Multiply the inverse of `XTX` with the transpose of `X`, and then multiply the result by `y`. Call the result `w`.
8. What's the sum of all the elements of the result?

> **Note**: You just implemented linear regression. We'll talk about it in the next lesson.

- 0.43
- 45.29
- 45.58
- 91.30

In [22]:
# Display all records of laptops with Innjoo brand
df[df['Brand'] == 'Innjoo']

| | Laptop | Status | Brand | Model | CPU | RAM | Storage | Storage type | GPU | Screen | Touch | Final Price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1478 | InnJoo Voom Excellence Intel Celeron N4020/8GB... | New | Innjoo | Voom | Intel Celeron | 8 | 256 | SSD | NaN | 15.6 | No | 311.37 |
| 1479 | InnJoo Voom Excellence Pro Intel Celeron N4020... | New | Innjoo | Voom | Intel Celeron | 8 | 512 | SSD | NaN | 15.6 | No | 392.55 |
| 1480 | Innjoo Voom Intel Celeron N3350/4GB/64GB eMMC/... | New | Innjoo | Voom | Intel Celeron | 4 | 64 | eMMC | NaN | 14.1 | No | 251.4 |
| 1481 | Innjoo Voom Laptop Max Intel Celeron N3350/6GB... | New | Innjoo | Voom | Intel Celeron | 6 | 64 | eMMC | NaN | 14.1 | No | 383.61 |
| 1482 | Innjoo Voom Laptop Pro Intel Celeron N3350/6GB... | New | Innjoo | Voom | Intel Celeron | 6 | 128 | SSD | NaN | 14.1 | No | 317.02 |
| 1483 | Innjoo Voom Pro Intel Celeron N3350/6GB/128GB ... | New | Innjoo | Voom | Intel Celeron | 6 | 128 | eMMC | NaN | 14.1 | No | 431.38 |


In [23]:
# Select only RAM, Storage, and Screen columns to display
df[df['Brand'] == 'Innjoo'][['RAM', 'Storage', 'Screen']]

| | RAM | Storage | Screen |
|---|---|---|---|
| 1478 | 8 | 256 | 15.6 |
| 1479 | 8 | 512 | 15.6 |
| 1480 | 4 | 64 | 14.1 |
| 1481 | 6 | 64 | 14.1 |
| 1482 | 6 | 128 | 14.1 |
| 1483 | 6 | 128 | 14.1 |


In [24]:
# Get the underlying NumPy array and call it X
X = df[df['Brand'] == 'Innjoo'][['RAM', 'Storage', 'Screen']].values
X

array([[  8. , 256. ,  15.6],
       [  8. , 512. ,  15.6],
       [  4. ,  64. ,  14.1],
       [  6. ,  64. ,  14.1],
       [  6. , 128. ,  14.1],
       [  6. , 128. ,  14.1]])

In [25]:
# Compute the transpose of matrix X
X.T

array([[  8. ,   8. ,   4. ,   6. ,   6. ,   6. ],
       [256. , 512. ,  64. ,  64. , 128. , 128. ],
       [ 15.6,  15.6,  14.1,  14.1,  14.1,  14.1]])

In [26]:
# Compute matrix-matrix multiplication between X.T and X, call the result XTX
XTX = X.T.dot(X)
XTX

array([[2.52000e+02, 8.32000e+03, 5.59800e+02],
       [8.32000e+03, 3.68640e+05, 1.73952e+04],
       [5.59800e+02, 1.73952e+04, 1.28196e+03]])

In [27]:
# Compute the inverse of XTX and call the result XTX_inverse
XTX_inverse = np.linalg.inv(XTX)
XTX_inverse

array([[ 2.78025381e-01, -1.51791334e-03, -1.00809855e-01],
       [-1.51791334e-03,  1.58286725e-05,  4.48052175e-04],
       [-1.00809855e-01,  4.48052175e-04,  3.87214888e-02]])

In [28]:
# Create an array y with the given values
y = np.array([1100, 1300, 800, 900, 1000, 1100])
y

array([1100, 1300,  800,  900, 1000, 1100])

In [29]:
# Multiply XTX_inverse by X.T (intermediate result, multiplied by y in the next cell)
XTX_inverse.dot(X.T)

array([[ 0.26298349, -0.12560233, -0.40646389,  0.14958687,  0.05244042,
         0.05244042],
       [-0.00110155,  0.00295059,  0.00125892, -0.00177691, -0.00076387,
        -0.00076387],
       [-0.08772226,  0.0269791 ,  0.17140891, -0.0302108 , -0.00153546,
        -0.00153546]])

In [30]:
# Multiply XTX_inverse with X.T, and then multiply the result by y, call the result w
w = XTX_inverse.dot(X.T).dot(y)
w

array([45.58076606,  0.42783519, 45.29127938])

In [31]:
# Compute the sum of all elements of w, rounded to two decimal places
w.sum().round(2)

np.float64(91.3)

**Ans:** The sum of all the elements of the result `w` is 91.30.
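
Steps 4 to 7 compute the normal-equation solution `w = (X^T X)^-1 X^T y`. The sketch below repeats that calculation with the `X` and `y` shown above and cross-checks it against `np.linalg.solve` and `np.linalg.lstsq`, which solve the same least-squares problem without forming an explicit inverse. The cross-check is an addition for illustration, not part of the homework steps:

```python
import numpy as np

# X and y copied from the cells above
X = np.array([
    [8.0, 256.0, 15.6],
    [8.0, 512.0, 15.6],
    [4.0, 64.0, 14.1],
    [6.0, 64.0, 14.1],
    [6.0, 128.0, 14.1],
    [6.0, 128.0, 14.1],
])
y = np.array([1100, 1300, 800, 900, 1000, 1100])

# Homework formula: w = (X^T X)^-1 X^T y, written with the @ operator
w = np.linalg.inv(X.T @ X) @ X.T @ y

# Same system solved without an explicit inverse
w_solve = np.linalg.solve(X.T @ X, X.T @ y)
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(np.allclose(w, w_solve), np.allclose(w, w_lstsq))
print(round(float(w.sum()), 2))
```

All three routes should agree up to floating-point error on a small, well-conditioned problem like this one.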

## Submit the results

* Submit your results here: https://courses.datatalks.club/ml-zoomcamp-2024/homework/hw01
* If your answer doesn't match the options exactly, select the closest one
