# Homework

- Set up the environment
- You need to install Python, NumPy, Pandas, Matplotlib and Seaborn. For that, you can use the instructions from 06-environment.md.

### Q1. Pandas version

In [1]:
import pandas as pd

pd.__version__

'2.3.2'

### Q2. Records count
- How many records are in the dataset?

In [2]:
df = pd.read_csv('../data/car_fuel_efficiency.csv')
df.info()
print(f"\nTotal records: {df.shape[0]}")

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9704 entries, 0 to 9703
Data columns (total 11 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   engine_displacement  9704 non-null   int64  
 1   num_cylinders        9222 non-null   float64
 2   horsepower           8996 non-null   float64
 3   vehicle_weight       9704 non-null   float64
 4   acceleration         8774 non-null   float64
 5   model_year           9704 non-null   int64  
 6   origin               9704 non-null   object 
 7   fuel_type            9704 non-null   object 
 8   drivetrain           9704 non-null   object 
 9   num_doors            9202 non-null   float64
 10  fuel_efficiency_mpg  9704 non-null   float64
dtypes: float64(6), int64(2), object(3)
memory usage: 834.1+ KB

Total records: 9704


### Q3. Fuel types
- How many fuel types are present in the dataset?

In [4]:
fuel_types = df['fuel_type'].nunique()
print(f"Number of unique fuel types: {fuel_types}")

Number of unique fuel types: 2
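
To see which fuel types make up that count, and how many rows each one has, `unique()` and `value_counts()` are useful companions to `nunique()`. The following is a minimal sketch on a small made-up frame; the `toy` data and its values are illustrative assumptions, not rows from the homework dataset.

```python
import pandas as pd

# Hypothetical toy data; not taken from car_fuel_efficiency.csv
toy = pd.DataFrame({'fuel_type': ['Gasoline', 'Diesel', 'Gasoline', 'Gasoline']})

n_types = toy['fuel_type'].nunique()            # number of distinct values
type_names = sorted(toy['fuel_type'].unique())  # the distinct values themselves
type_counts = toy['fuel_type'].value_counts()   # number of rows per distinct value

print(n_types)
print(type_names)
print(type_counts)
```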


### Q4. Missing values
- How many columns in the dataset have missing values?

In [5]:
missing_values_count = df.isnull().sum()
columns_with_missing_values = missing_values_count[missing_values_count > 0]
print(f"Columns with missing values: {len(columns_with_missing_values)}")

Columns with missing values: 4
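
The count alone does not say which columns are affected. A small sketch of how the same boolean mask can also return the column names, using a made-up frame (`toy` and its values are assumptions for illustration only):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with gaps in two of its three columns
toy = pd.DataFrame({
    'horsepower': [150.0, np.nan, 170.0],
    'num_doors': [4.0, 2.0, np.nan],
    'model_year': [2010, 2015, 2020],
})

# Missing values per column, then keep only the columns that have any
missing_per_column = toy.isnull().sum()
columns_with_gaps = missing_per_column[missing_per_column > 0].index.tolist()

# Equivalent shortcut for the count: any() flags each column, sum() counts the flags
n_columns_with_gaps = int(toy.isnull().any().sum())

print(columns_with_gaps)
print(n_columns_with_gaps)
```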


### Q5. Max fuel efficiency
- What's the maximum fuel efficiency of cars from Asia?

In [6]:
# Keep only the rows for cars from Asia, then take the maximum fuel efficiency
max_fuel_efficiency_asia = df.loc[df['origin'] == 'Asia', 'fuel_efficiency_mpg'].max()
print(f"Maximum fuel efficiency of cars from Asia: {max_fuel_efficiency_asia:.2f} mpg")

Maximum fuel efficiency of cars from Asia: 23.76 mpg
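
If the maximum is needed for every origin rather than just one, `groupby` can compute them all in a single pass. This is a sketch on made-up numbers; the `toy` frame and its values are assumptions, not data from the homework file.

```python
import pandas as pd

# Hypothetical toy data; values are invented for illustration
toy = pd.DataFrame({
    'origin': ['Asia', 'Europe', 'Asia', 'USA'],
    'fuel_efficiency_mpg': [15.0, 18.5, 21.25, 12.0],
})

# One maximum per origin, indexed by the origin label
max_by_origin = toy.groupby('origin')['fuel_efficiency_mpg'].max()

print(max_by_origin)
print(max_by_origin['Asia'])
```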


### Q6. Median value of horsepower
- Find the median value of the horsepower column in the dataset.
- Next, calculate the most frequent value of the same horsepower column.
- Use the fillna method to fill the missing values in the horsepower column with the most frequent value from the previous step.
- Now, calculate the median value of horsepower once again.
- Has it changed?

In [7]:
median_horsepower_before = df['horsepower'].median()
most_frequent_horsepower = df['horsepower'].mode()[0]
print(f"Median horsepower before filling missing values: {median_horsepower_before}")
print(f"Most frequent horsepower: {most_frequent_horsepower}")

df['horsepower'] = df['horsepower'].fillna(most_frequent_horsepower)
median_horsepower_after = df['horsepower'].median()
print(f"Median horsepower after filling missing values: {median_horsepower_after}")

if median_horsepower_before == median_horsepower_after:
    print("The median has NOT changed after filling missing values.")
else:
    print("The median HAS changed after filling missing values.")

Median horsepower before filling missing values: 149.0
Most frequent horsepower: 152.0
Median horsepower after filling missing values: 152.0
The median HAS changed after filling missing values.
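
The shift happens because filling with the mode adds many copies of a single value, which pulls the middle of the sorted column toward it. A minimal sketch on made-up numbers (the series `hp` is a hypothetical stand-in for the horsepower column, not the real data):

```python
import numpy as np
import pandas as pd

# Hypothetical horsepower values with three gaps
hp = pd.Series([100.0, 110.0, 120.0, 150.0, 150.0, np.nan, np.nan, np.nan])

median_before = hp.median()   # NaN values are skipped by default
most_frequent = hp.mode()[0]  # mode() also ignores NaN by default

# fillna returns a new Series, so hp itself is left untouched
hp_filled = hp.fillna(most_frequent)
median_after = hp_filled.median()

print(median_before, most_frequent, median_after)
```

In this toy case the three filled values all land above the old median, so the median moves from 120.0 to 150.0.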


### Q7. Sum of weights
- Select all the cars from Asia
- Select only columns vehicle_weight and model_year
- Select the first 7 values
- Get the underlying NumPy array. Let's call it X.
- Compute matrix-matrix multiplication between the transpose of X and X. To get the transpose, use X.T. Let's call the result XTX.
- Invert XTX.
- Create an array y with values [1100, 1300, 800, 900, 1000, 1100, 1200].
- Multiply the inverse of XTX with the transpose of X, and then multiply the result by y. Call the result w.
- What's the sum of all the elements of the result?
- Note: You just implemented linear regression. We'll talk about it in the next lesson.
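
In matrix notation, the steps above amount to the normal-equation solution of a least-squares problem. This is a standard result, stated here for reference, and it assumes that $X^\top X$ is invertible:

$$
w = (X^\top X)^{-1} X^\top y
$$

Here $X$ has 7 rows and 2 columns, so $X^\top X$ is a $2 \times 2$ matrix and $w$ has two elements, one weight per selected column.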

In [10]:
# Import numpy for matrix operations
import numpy as np

# Select all the cars from Asia
cars_from_asia = df[df['origin'] == 'Asia']

# Select only columns vehicle_weight and model_year
selected_columns = cars_from_asia[['vehicle_weight', 'model_year']]

# Select the first 7 values
first_7_values = selected_columns.head(7)
print("First 7 values:")
print(first_7_values)

# Get the underlying NumPy array. Let's call it X
X = first_7_values.to_numpy()
print(f"\nX shape: {X.shape}")

# Compute matrix-matrix multiplication between the transpose of X and X. To get the transpose, use X.T. Let's call the result XTX
XTX = X.T @ X
print("\nXTX:")
print(XTX)

# Invert XTX. Let's call the result XTX_inv
XTX_inv = np.linalg.inv(XTX)
print("\nXTX_inv:")
print(XTX_inv)

# Create an array y with values [1100, 1300, 800, 900, 1000, 1100, 1200]
y = np.array([1100, 1300, 800, 900, 1000, 1100, 1200])
print(f"\ny: {y}")

# Multiply the inverse of XTX with the transpose of X, and then multiply the result by y. Call the result w
w = XTX_inv @ X.T @ y
print(f"\nw: {w}")

# What is the sum of all the elements of the result?
result_sum = w.sum()
print(f"\nSum of all elements in w: {result_sum}")
result_sum

First 7 values:
    vehicle_weight  model_year
8      2714.219310        2016
12     2783.868974        2010
14     3582.687368        2007
20     2231.808142        2011
21     2659.431451        2016
34     2844.227534        2014
38     3761.994038        2019

X shape: (7, 2)

XTX:
[[62248334.33150762 41431216.5073268 ]
 [41431216.5073268  28373339.        ]]

XTX_inv:
[[ 5.71497081e-07 -8.34509443e-07]
 [-8.34509443e-07  1.25380877e-06]]

y: [1100 1300  800  900 1000 1100 1200]

w: [0.01386421 0.5049067 ]

Sum of all elements in w: 0.5187709081074023


np.float64(0.5187709081074023)
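
As a cross-check, the same weights can be obtained without forming the inverse explicitly. The sketch below retypes the 7 rows printed above, so `vehicle_weight` is rounded to the 6 decimals shown and the result may differ from the notebook's in the last digits. Using `np.linalg.solve` and `np.linalg.lstsq` here is an alternative to the explicit inverse, not something the homework asks for.

```python
import numpy as np

# The 7 rows printed above: vehicle_weight (as displayed) and model_year
X = np.array([
    [2714.219310, 2016],
    [2783.868974, 2010],
    [3582.687368, 2007],
    [2231.808142, 2011],
    [2659.431451, 2016],
    [2844.227534, 2014],
    [3761.994038, 2019],
])
y = np.array([1100, 1300, 800, 900, 1000, 1100, 1200])

# Solve (X^T X) w = X^T y as a linear system instead of inverting X^T X
w_solve = np.linalg.solve(X.T @ X, X.T @ y)

# Least-squares solver that works on X directly
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(w_solve, w_solve.sum())
print(w_lstsq, w_lstsq.sum())
```

Both routes should agree with the `w` computed above to within floating-point and display rounding.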