In [1]:
import numpy as np
import pandas as pd

print("Numpy version: " + str(np.__version__))
print("Pandas version: " + str(pd.__version__))

Numpy version: 2.3.3
Pandas version: 2.3.2


In [2]:
# read in data and print first 5 rows
df = pd.read_csv('https://raw.githubusercontent.com/alexeygrigorev/datasets/master/car_fuel_efficiency.csv')

df.head()

|   | engine_displacement | num_cylinders | horsepower | vehicle_weight | acceleration | model_year | origin | fuel_type | drivetrain | num_doors | fuel_efficiency_mpg |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 170 | 3.0 | 159.0 | 3413.433759 | 17.7 | 2003 | Europe | Gasoline | All-wheel drive | 0.0 | 13.231729 |
| 1 | 130 | 5.0 | 97.0 | 3149.664934 | 17.8 | 2007 | USA | Gasoline | Front-wheel drive | 0.0 | 13.688217 |
| 2 | 170 | NaN | 78.0 | 3079.038997 | 15.1 | 2018 | Europe | Gasoline | Front-wheel drive | 0.0 | 14.246341 |
| 3 | 220 | 4.0 | NaN | 2542.392402 | 20.2 | 2009 | USA | Diesel | All-wheel drive | 2.0 | 16.912736 |
| 4 | 210 | 1.0 | 140.0 | 3460.87099 | 14.4 | 2009 | Europe | Gasoline | All-wheel drive | 2.0 | 12.488369 |



### Q1. Pandas version

What's the version of Pandas that you installed?

In [3]:
### Q1: Pandas version
pd.__version__

'2.3.2'

### Q2. Records count

How many records are in the dataset?

In [4]:
### Q2: Records count
df.shape[0]

9704

### Q3. Fuel types

How many fuel types are present in the dataset?

In [5]:
### Q3: Fuel types
df['fuel_type'].nunique()

2

### Q4. Missing values

How many columns in the dataset have missing values?

In [6]:
### Q4: Missing values
print("Column missing value counts:\n")
print(df.isnull().sum())

num_missing_cols = df.isnull().any().sum()
print(f"\nThere are {num_missing_cols} columns with missing values.")

Column missing value counts:

engine_displacement      0
num_cylinders          482
horsepower             708
vehicle_weight           0
acceleration           930
model_year               0
origin                   0
fuel_type                0
drivetrain               0
num_doors              502
fuel_efficiency_mpg      0
dtype: int64

There are 4 columns with missing values.
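
To see *which* columns contain missing values rather than only how many, the boolean result of `isnull().any()` can be used as a mask on itself. The sketch below runs on a small toy frame with made-up values (an assumption, so it works without downloading the dataset); the same two lines should apply unchanged to `df`.

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical values, standing in for the real dataset.
toy = pd.DataFrame({
    "horsepower": [159.0, np.nan, 78.0],
    "num_doors": [0.0, 2.0, np.nan],
    "model_year": [2003, 2007, 2018],
})

# Boolean Series: True for every column that holds at least one NaN.
has_missing = toy.isnull().any()

# Keep only the column labels where the mask is True.
missing_cols = has_missing[has_missing].index.tolist()
print(missing_cols)  # ['horsepower', 'num_doors']
```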


### Q5. Max fuel efficiency

What's the maximum fuel efficiency of cars from Asia?

In [7]:
### Q5: Max fuel efficiency
# df['origin'].unique()
max_fuel_efficiency = df[df['origin'] == 'Asia']['fuel_efficiency_mpg'].max()
print(round(max_fuel_efficiency, 2))

23.76


### Q6. Median value of horsepower

1. Find the median value of the `horsepower` column in the dataset.
2. Next, calculate the most frequent value of the same `horsepower` column.
3. Use the `fillna` method to fill the missing values in the `horsepower` column with the most frequent value from the previous step.
4. Now, calculate the median value of `horsepower` once again.

Has it changed?

In [8]:
### Q6: Median value of horsepower
# 1: median
horsepower_median = df['horsepower'].median()
print(f"Median: {horsepower_median}")

# 2: mode
horsepower_mode = df['horsepower'].mode()[0]
print(f"Mode: {horsepower_mode}")

# 3: replace missing with mode
df['horsepower (updated)'] = df['horsepower'].fillna(horsepower_mode)

# 4: new horsepower median
new_horsepower_median = df['horsepower (updated)'].median()
print(f"Updated median: {new_horsepower_median}")

Median: 149.0
Mode: 152.0
Updated median: 152.0
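
So yes, the median changed, from 149.0 to 152.0. A plausible reading is that the 708 missing entries were all filled with 152.0, which lies above the old median, so the middle of the sorted column moved up to the mode. The toy series below uses made-up values (an assumption, not the real data) to show the same effect on a small scale.

```python
import numpy as np
import pandas as pd

# Hypothetical horsepower values with two missing entries.
s = pd.Series([100.0, 120.0, 150.0, 150.0, np.nan, np.nan])

# median() skips NaN, so it is computed over the four observed values.
median_before = s.median()   # 135.0

# mode() returns a Series of the most frequent values; take the first.
mode_value = s.mode()[0]     # 150.0

# Filling the gaps with the mode adds mass at 150, pulling the median there.
median_after = s.fillna(mode_value).median()   # 150.0

print(median_before, mode_value, median_after)
```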


### Q7. Sum of weights

1. Select all the cars from Asia
2. Select only columns `vehicle_weight` and `model_year`
3. Select the first 7 values
4. Get the underlying NumPy array. Let's call it `X`.
5. Compute matrix-matrix multiplication between the transpose of `X` and `X`. To get the transpose, use `X.T`. Let's call the result `XTX`.
6. Invert `XTX`.
7. Create an array `y` with values `[1100, 1300, 800, 900, 1000, 1100, 1200]`.
8. Multiply the inverse of `XTX` with the transpose of `X`, and then multiply the result by `y`. Call the result `w`.
9. What's the sum of all the elements of the result?

In [9]:
### Q7: Sum of weights
# 1: select all cars from Asia
asia_cars_df = df[df['origin'] == 'Asia']

# 2: select only the vehicle_weight and model_year columns
asia_cars_df = asia_cars_df[['vehicle_weight', 'model_year']]

# 3: select the first 7 values
asia_cars_df = asia_cars_df.head(7)

# 4: get underlying numpy array
X = asia_cars_df.to_numpy()

# 5: compute XTX
XTX = X.T @ X

# 6: compute the inverse of XTX
XTX_inv = np.linalg.inv(XTX)

# 7: create the y array
y = np.array([1100, 1300, 800, 900, 1000, 1100, 1200])

# 8: compute w
w = XTX_inv @ X.T @ y

# 9: sum all elements of w
sum_w = w.sum()
print("Sum of w:", f"{sum_w:.3f}")

Sum of w: 0.519
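
The exercise asks for an explicit inverse, but the same weights can be obtained without forming one. The sketch below compares the explicit-inverse route with `np.linalg.solve` (applied to the normal equations) and `np.linalg.lstsq` (applied to `X` directly). The `X` values are copied from the design matrix printed later in this notebook, which is rounded to three decimals, so the last digits may differ slightly from the full-precision result.

```python
import numpy as np

# Design matrix as printed in the notebook (rounded to 3 decimals).
X = np.array([
    [2714.219, 2016],
    [2783.869, 2010],
    [3582.687, 2007],
    [2231.808, 2011],
    [2659.431, 2016],
    [2844.228, 2014],
    [3761.994, 2019],
], dtype=float)
y = np.array([1100, 1300, 800, 900, 1000, 1100, 1200], dtype=float)

# Route used in the exercise: explicit inverse of X^T X.
w_inv = np.linalg.inv(X.T @ X) @ X.T @ y

# Solve (X^T X) w = X^T y as a linear system, with no explicit inverse.
w_solve = np.linalg.solve(X.T @ X, X.T @ y)

# Least-squares solver working on X directly.
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(np.allclose(w_inv, w_solve), np.allclose(w_inv, w_lstsq))
print("Sum of w:", f"{w_solve.sum():.3f}")
```

Solving the linear system is generally considered the more numerically robust habit, although for a well-conditioned 2 × 2 matrix like this one the three routes should agree closely.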


In [10]:
# vector-vector multiplication function
def vector_vector_multiplication(u, v):
    assert u.shape[0] == v.shape[0]

    n = u.shape[0]

    result = 0.0

    for i in range(n):
        result = result + u[i] * v[i]

    return result

# matrix-vector multiplication function
def matrix_vector_multiplication(U, v):
    assert U.shape[1] == v.shape[0]

    num_rows = U.shape[0]

    result = np.zeros(num_rows)

    for i in range(num_rows):
        result[i] = vector_vector_multiplication(U[i], v)

    return result

# matrix-matrix multiplication function
def matrix_matrix_multiplication(U, V):
    assert U.shape[1] == V.shape[0]

    num_rows = U.shape[0]
    num_cols = V.shape[1]

    result = np.zeros((num_rows, num_cols))

    for i in range(num_cols):
        vi = V[:, i]
        Uvi = matrix_vector_multiplication(U, vi)
        result[:, i] = Uvi
    
    return result
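
A quick sanity check for hand-written routines like these is to compare them against NumPy's `@` operator on random inputs. The sketch below restates the three functions in compact form so that it runs on its own; the loops are the same in spirit as the ones above, but the bodies are a shortened rewrite, and the test matrices are arbitrary.

```python
import numpy as np

def vector_vector_multiplication(u, v):
    # Dot product as an explicit sum over matching elements.
    return sum(u[i] * v[i] for i in range(u.shape[0]))

def matrix_vector_multiplication(U, v):
    # One dot product per row of U.
    return np.array([vector_vector_multiplication(row, v) for row in U])

def matrix_matrix_multiplication(U, V):
    # One matrix-vector product per column of V, stacked as columns.
    cols = [matrix_vector_multiplication(U, V[:, j]) for j in range(V.shape[1])]
    return np.column_stack(cols)

# Arbitrary random inputs with compatible shapes.
rng = np.random.default_rng(0)
A = rng.normal(size=(3, 4))
B = rng.normal(size=(4, 2))
v = rng.normal(size=4)

vv_ok = np.isclose(vector_vector_multiplication(v, v), v @ v)
mv_ok = np.allclose(matrix_vector_multiplication(A, v), A @ v)
mm_ok = np.allclose(matrix_matrix_multiplication(A, B), A @ B)
print(vv_ok, mv_ok, mm_ok)
```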

In [11]:
### Sum of weights with manual functions

# 5: compute XTX using functions above
XTX = matrix_matrix_multiplication(X.T, X)

# 6: compute the inverse of XTX
XTX_inv = np.linalg.inv(XTX)

# 7: create the y array
y = np.array([1100, 1300, 800, 900, 1000, 1100, 1200])

# 8: compute w using functions above
w = matrix_vector_multiplication(matrix_matrix_multiplication(XTX_inv, X.T), y)

# 9: sum all elements of w
sum_w = w.sum()
print("Sum of w:", f"{sum_w:.3f}")

Sum of w: 0.519


### Why is this "linear regression"?

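For each column of `X`, we want a **weight** (machine learning term) or **parameter** (statistical term) such that the weighted sum of the columns matches `y` as closely as possible. The weight vector `w` computed above is written $\boldsymbol{\beta}$ below, and the corresponding system of linear equations is: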

$$
\mathbf{X} \, \boldsymbol{\beta} = \mathbf{y}
$$

#### Matrices with actual data:

The **design matrix** $ \mathbf{X} $ (7 × 2) is:

$$
\mathbf{X} =
\begin{bmatrix}
2714.219 & 2016 \\
2783.869 & 2010 \\
3582.687 & 2007 \\
2231.808 & 2011 \\
2659.431 & 2016 \\
2844.228 & 2014 \\
3761.994 & 2019
\end{bmatrix}
$$

The **coefficient vector** $ \boldsymbol{\beta} $ (2 × 1) is:

$$
\boldsymbol{\beta} =
\begin{bmatrix}
\beta_1 \\
\beta_2
\end{bmatrix}
$$

The **target vector** $ \mathbf{y} $ (7 × 1) is:

$$
\mathbf{y} =
\begin{bmatrix}
1100 \\
1300 \\
800 \\
900 \\
1000 \\
1100 \\
1200
\end{bmatrix}
$$

---

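This system has 7 equations but only 2 unknowns, so it is **overdetermined** ($n > p$) and typically has no exact solution.  
The standard choice is the **ordinary least squares (OLS) solution**, which minimizes the sum of squared residuals:

$$
\hat{\boldsymbol{\beta}} = \arg \min_{\boldsymbol{\beta}} \| \mathbf{y} - \mathbf{X}\boldsymbol{\beta} \|^2
$$

This solution satisfies the **normal equation**:

$$
(\mathbf{X}^\top \mathbf{X}) \, \hat{\boldsymbol{\beta}} = \mathbf{X}^\top \mathbf{y}
$$

where $\mathbf{X}^\top \mathbf{X}$ is the **normal matrix** (also called the Gram matrix), $\mathbf{X}^\top \mathbf{y}$ is the **moment matrix** of the target by the features (here a 2 × 1 vector), and $\hat{\boldsymbol{\beta}}$ is the **least-squares coefficient vector**. When $\mathbf{X}^\top \mathbf{X}$ is invertible, which requires the columns of $\mathbf{X}$ to be linearly independent, solving for $\hat{\boldsymbol{\beta}}$ gives the closed form computed as `w` in Q7:

$$
\hat{\boldsymbol{\beta}} = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y}
$$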
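
For completeness, here is a short sketch of where the normal equation comes from. Expanding the squared norm gives

$$
\| \mathbf{y} - \mathbf{X}\boldsymbol{\beta} \|^2
= \mathbf{y}^\top \mathbf{y}
- 2 \, \boldsymbol{\beta}^\top \mathbf{X}^\top \mathbf{y}
+ \boldsymbol{\beta}^\top \mathbf{X}^\top \mathbf{X} \, \boldsymbol{\beta}
$$

and setting its gradient with respect to $\boldsymbol{\beta}$ to zero gives

$$
-2 \, \mathbf{X}^\top \mathbf{y} + 2 \, \mathbf{X}^\top \mathbf{X} \, \hat{\boldsymbol{\beta}} = \mathbf{0}
\quad \Longrightarrow \quad
\mathbf{X}^\top \mathbf{X} \, \hat{\boldsymbol{\beta}} = \mathbf{X}^\top \mathbf{y}
$$

Because the objective is a convex quadratic in $\boldsymbol{\beta}$, this stationary point is a minimum.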

---

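We can confirm this with scikit-learn's `LinearRegression`. Setting `fit_intercept=False` matters here: the system above has no intercept column, so the model must not add one if the weights are to match.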

In [12]:
from sklearn.linear_model import LinearRegression

ols = LinearRegression(fit_intercept=False)
ols.fit(X, y)
w_sklearn = ols.coef_

print("OLS (sklearn) weights:", w_sklearn)
print("Weights from normal equations:", w)

OLS (sklearn) weights: [0.01386421 0.5049067 ]
Weights from normal equations: [0.01386421 0.5049067 ]
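
Another way to check the result, without scikit-learn, is to use the normal equation itself: at the least-squares solution the residual vector should be orthogonal to every column of `X`, that is, `X.T @ (y - X @ beta)` should be zero up to rounding. The sketch below assumes the rounded design matrix printed earlier and uses `np.linalg.lstsq` for the fit.

```python
import numpy as np

# Design matrix as printed in the notebook (rounded to 3 decimals).
X = np.array([
    [2714.219, 2016],
    [2783.869, 2010],
    [3582.687, 2007],
    [2231.808, 2011],
    [2659.431, 2016],
    [2844.228, 2014],
    [3761.994, 2019],
], dtype=float)
y = np.array([1100, 1300, 800, 900, 1000, 1100, 1200], dtype=float)

# Least-squares fit without an intercept.
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

# Residuals of the fit; they are not zero because the system is overdetermined.
residuals = y - X @ beta_hat

# Normal equation rearranged: X^T (y - X beta_hat) = 0.
orthogonality = X.T @ residuals

# Compare against the size of X^T y so the check is scale-aware.
scale = np.abs(X.T @ y).max()
print("max |X^T r| relative to |X^T y|:", np.abs(orthogonality).max() / scale)
```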
