# Lab 4: Basic regression - Predict fuel efficiency



## Imports

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns # we use this library to load the dataset
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

## Load data

In [13]:
# Load the 'mpg' dataset using seaborn library into a Pandas DataFrame
df = sns.load_dataset('mpg')

The MPG dataset can be viewed online at  
https://github.com/mwaskom/seaborn-data/blob/master/mpg.csv

## Data Exploration - Pandas Review

### Show the first 5 rows of the dataset

In [15]:
df.head()

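|   | mpg | cylinders | displacement | horsepower | weight | acceleration | model_year | origin | name |
|---|-----|-----------|--------------|------------|--------|--------------|------------|--------|------|
| 0 | 18.0 | 8 | 307.0 | 130.0 | 3504 | 12.0 | 70 | usa | chevrolet chevelle malibu |
| 1 | 15.0 | 8 | 350.0 | 165.0 | 3693 | 11.5 | 70 | usa | buick skylark 320 |
| 2 | 18.0 | 8 | 318.0 | 150.0 | 3436 | 11.0 | 70 | usa | plymouth satellite |
| 3 | 16.0 | 8 | 304.0 | 150.0 | 3433 | 12.0 | 70 | usa | amc rebel sst |
| 4 | 17.0 | 8 | 302.0 | 140.0 | 3449 | 10.5 | 70 | usa | ford torino |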


### Show the size of the dataframe

In [17]:
df.shape

(398, 9)

### Find the column names and their types (numerical or categorical)

In [19]:
df.dtypes

mpg             float64
cylinders         int64
displacement    float64
horsepower      float64
weight            int64
acceleration    float64
model_year        int64
origin           object
name             object


### Find the number of missing values in each column

In [23]:
df.isnull().sum()

mpg             0
cylinders       0
displacement    0
horsepower      0
weight          0
acceleration    0
model_year      0
origin          0
name            0


### Handle the missing values in the dataframe

Since the number of missing values is low, we could simply drop the rows containing them. However, as practice and review, let's replace the missing values in the numerical columns (if any) with the mean of the respective column, and the missing values in the categorical columns (if any) with the mode (the most frequent value) of the respective column. A median is not defined for unordered categories, so the mode is used instead.

In [24]:
# Fill numerical columns with the column mean
df.fillna(df.mean(numeric_only=True), inplace=True)

# Fill categorical columns with the column mode (most frequent value)
df.fillna(df.select_dtypes(include=['object']).mode().iloc[0], inplace=True)
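
The same imputation strategy can be tried on a tiny frame where the result is easy to check by hand. This is a minimal sketch: the `toy` DataFrame and its values are made up for illustration, and `select_dtypes(exclude="number")` is used here simply as one way to pick the non-numerical columns.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: one gap in a numerical column, one in a categorical column
toy = pd.DataFrame({
    "horsepower": [130.0, np.nan, 150.0, 140.0],
    "origin": ["usa", "japan", None, "usa"],
})

# Numerical columns: replace NaN with the column mean, (130 + 150 + 140) / 3 = 140
toy = toy.fillna(toy.mean(numeric_only=True))

# Categorical columns: replace missing entries with the most frequent value (mode)
modes = toy.select_dtypes(exclude="number").mode().iloc[0]
toy = toy.fillna(modes)

print(toy)
```

After both steps the frame has no missing values: the numerical gap holds the column mean and the categorical gap holds `"usa"`, the most frequent category.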


### Compute the average and the median weight

In [25]:
print("\nAverage weight:", df['weight'].mean())
print("Median weight:", df['weight'].median())


Average weight: 2970.424623115578
Median weight: 2803.5


### Find the number of cars that weigh more than 2000 lbs

In [26]:
count = df[df['weight'] > 2000].shape[0]
print("\nNumber of cars weighing more than 2000 lbs:", count)


Number of cars weighing more than 2000 lbs: 354


### Find how many cars there are for each number of cylinders

In [27]:
print("\nNumber of cars per cylinders:")
print(df['cylinders'].value_counts())


Number of cars per cylinders:
cylinders
4    204
8    103
6     84
3      4
5      3
Name: count, dtype: int64


### Find the car models that have 3 or 5 cylinders

In [28]:
models = df[df['cylinders'].isin([3, 5])]['name']
print("\nCar models with 3 or 5 cylinders:")
print(models)


Car models with 3 or 5 cylinders:
71         mazda rx2 coupe
111              maxda rx3
243             mazda rx-4
274              audi 5000
297     mercedes benz 300d
327    audi 5000s (diesel)
334          mazda rx-7 gs
Name: name, dtype: object


### Show the `value_counts()` of `origin` column or show the unique values of this column.

In [29]:
print("\nOrigin value counts:")
print(df['origin'].value_counts())


Origin value counts:
origin
usa       249
japan      79
europe     70
Name: count, dtype: int64


## Data Preprocessing

### Use one hot encoding to change the categorical values of `origin` column to numerical values.

- use `pd.get_dummies()` method to do the encoding

In [30]:
df = pd.get_dummies(df, columns=['origin'])
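
To see what one-hot encoding produces, here is a small sketch on a made-up three-row frame (the `toy` data is hypothetical). It passes `dtype=float`, which the lab's call does not; recent pandas versions otherwise emit boolean indicator columns.

```python
import pandas as pd

# Hypothetical three-row frame standing in for the real dataset
toy = pd.DataFrame({
    "mpg": [18.0, 24.0, 26.0],
    "origin": ["usa", "japan", "europe"],
})

# Each category becomes its own indicator column: origin_europe, origin_japan, origin_usa
encoded = pd.get_dummies(toy, columns=["origin"], dtype=float)

print(encoded.columns.tolist())
print(encoded)
```

Every row has exactly one indicator set to 1, so the original category can always be recovered from the encoded columns.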

### Remove the `name` column from the dataframe so that all remaining columns are numerical.

In [31]:
df = df.drop('name', axis=1)

### Does the input need reshaping?

In [32]:
print("\nInput shape check:")
print(f"Current shape: {df.shape} (samples, features) - No reshaping needed")


Input shape check:
Current shape: (398, 10) (samples, features) - No reshaping needed


### Split the data into training and test sets and form `train_features`, `train_labels`, `test_features`, `test_labels`

In [35]:
from sklearn.model_selection import train_test_split

X = df.drop('mpg', axis=1)
y = df['mpg']

train_features, test_features, train_labels, test_labels = train_test_split(
    X, y, test_size=0.2, random_state=42
)
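
Conceptually, a random split just shuffles the row indices and holds out a fraction of them. The sketch below illustrates that idea in NumPy; `simple_train_test_split` is a hypothetical helper, not scikit-learn's implementation, and it will not select the same rows as `train_test_split(..., random_state=42)`.

```python
import numpy as np

def simple_train_test_split(X, y, test_size=0.2, seed=42):
    """Shuffle the row indices, then hold out a `test_size` fraction as the test set."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(X))
    n_test = int(np.ceil(test_size * len(X)))  # round the test size up
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]

# Placeholder data with the lab's dimensions: 398 rows, 9 feature columns
X = np.arange(398 * 9, dtype="float32").reshape(398, 9)
y = np.arange(398, dtype="float32")

X_train, X_test, y_train, y_test = simple_train_test_split(X, y)
print(X_train.shape, X_test.shape)  # (318, 9) (80, 9)
```

With 398 rows and `test_size=0.2`, 80 rows go to the test set and 318 remain for training.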


### For simplicity in the following steps, convert the dataset from a pandas DataFrame to a numpy array.

In [36]:
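# Cast the features to float32: the boolean one-hot columns would otherwise
# produce an object-dtype array that TensorFlow cannot convert to a tensor
train_features = np.array(train_features, dtype='float32')
train_labels = np.array(train_labels)
test_features = np.array(test_features, dtype='float32')
test_labels = np.array(test_labels)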

## Normalization layer

To keep neural network training stable, we typically normalize the input features so that they share a similar scale. This also helps gradient descent converge faster.

There is no single way to normalize the data. You can also use `scikit-learn` or `pandas` to do it. However, in this lab we will use the normalization layer provided by TensorFlow, which fits naturally with the other parts of the model.

The `tf.keras.layers.Normalization` layer is a clean and simple way to add feature normalization to your model.

The first step is to create the layer:

In [41]:
normalizer = tf.keras.layers.Normalization(axis=-1)

Then, fit the state of the preprocessing layer to the data by calling `Normalization.adapt`.

This calculates the mean and variance of each feature and stores them in the layer.

In [42]:
normalizer.adapt(train_features)
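
The arithmetic behind this step can be sketched in plain NumPy. This is only an illustration of per-feature standardization on made-up numbers, not the layer's actual implementation (for instance, it ignores any safeguard against zero variance).

```python
import numpy as np

# Hypothetical feature matrix: 4 samples, 2 features on very different scales
features = np.array([[8.0, 3504.0],
                     [4.0, 2130.0],
                     [6.0, 2900.0],
                     [4.0, 2200.0]], dtype="float32")

# "adapt": learn one mean and one variance per feature (column)
mean = features.mean(axis=0)
variance = features.var(axis=0)

# "call": shift and scale each feature independently
normalized = (features - mean) / np.sqrt(variance)

print(normalized.mean(axis=0))  # approximately 0 for every feature
print(normalized.std(axis=0))   # approximately 1 for every feature
```

After normalization both columns are on a comparable scale, even though the raw values differ by several orders of magnitude.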

When the layer is called, it returns the input data, with each feature independently normalized.

In [43]:
first = train_features[0]
print('First example:', first)
print()
print('Normalized:', normalizer(first).numpy())

First example: [8 304.0 150.0 3433 12.0 70 False False True]



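ValueError: Failed to convert a NumPy array to a Tensor (Unsupported object type int).

**Note:** This error comes from running the cell on an `object`-dtype feature array. Calling `np.array()` on a DataFrame that mixes the boolean one-hot `origin_*` columns with numeric columns yields `object` dtype, which TensorFlow cannot convert. Converting the features with an explicit float dtype (for example `np.array(train_features, dtype='float32')`) avoids it.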

## **Approach #1:** Regression using `Linear Regression`

**You are welcome to use scikit-learn to perform linear regression on this dataset.**

However, here we aim to implement it using TensorFlow.

- As we saw in Lab Week 2, `logistic regression` is essentially a single neuron with a `sigmoid` activation function.

- Similarly, `linear regression` can be viewed as a single neuron with a `linear` activation function.
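
The two bullet points above can be made concrete with a small NumPy sketch. The weights, bias, and inputs are arbitrary illustration values, and the function names are hypothetical.

```python
import numpy as np

def linear_neuron(x, w, b):
    """Weighted sum plus bias; the 'linear' activation leaves it unchanged."""
    return x @ w + b

def sigmoid_neuron(x, w, b):
    """Same weighted sum, squashed into (0, 1) by the sigmoid."""
    return 1.0 / (1.0 + np.exp(-(x @ w + b)))

x = np.array([[1.0, 2.0],
              [0.5, -1.0]])   # two samples, two features
w = np.array([0.5, -0.25])    # one weight per feature
b = 0.1                       # bias

linear_out = linear_neuron(x, w, b)
sigmoid_out = sigmoid_neuron(x, w, b)
print(linear_out)   # unbounded outputs, suitable for regression
print(sigmoid_out)  # outputs in (0, 1), suitable for binary classification
```

The only difference between the two neurons is the activation applied to the same weighted sum, which is why a `Dense(1, activation='linear')` layer behaves like linear regression.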

### **Step 1:** Linear regression model architecture

In [44]:
linear_model = tf.keras.Sequential([
    normalizer,
    layers.Dense(1, activation='linear')
])

**Note:** You can define your model all at once, as in the cell above, or you can build it incrementally (suitable for your assignment).

In [7]:
# Defining the model incrementally (suitable for your assignment)
linear_model = tf.keras.Sequential()
linear_model.add(normalizer)
linear_model.add(layers.Dense(1, activation='linear'))


### **Step 2:** Configure the model with Keras `Model.compile()`

The most important arguments to `compile` are the `loss` and the `optimizer`, since these define what will be optimized (`"mean_absolute_error"`) and how (`tf.keras.optimizers.Adam(learning_rate=0.1)`).

**arguments:**
- optimizer=tf.keras.optimizers.Adam(learning_rate=0.1),
- loss='mean_absolute_error'
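
For reference, the mean absolute error is simply the average of the absolute differences between labels and predictions. A minimal NumPy sketch follows; the predictions are made-up values, and `mean_absolute_error` is a hypothetical helper rather than the Keras loss itself.

```python
import numpy as np

def mean_absolute_error(y_true, y_pred):
    """Average of |label - prediction| over all samples."""
    return np.mean(np.abs(y_true - y_pred))

y_true = np.array([18.0, 15.0, 18.0, 16.0])  # mpg labels
y_pred = np.array([20.0, 14.0, 17.0, 19.0])  # hypothetical predictions

mae = mean_absolute_error(y_true, y_pred)
print(mae)  # (2 + 1 + 1 + 3) / 4 = 1.75
```

Because the loss is expressed in the same unit as the label, an MAE of 1.75 reads directly as "off by 1.75 MPG on average".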

In [6]:
linear_model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.1),
    loss='mean_absolute_error'
)


### **Step 3:** Train the model using the `Model.fit()` for `100` epochs, and store the output in a variable named history.

In [5]:
history = linear_model.fit(train_features, train_labels, epochs=100)


In [4]:
history.history


In [3]:
def plot_loss(history):
  plt.plot(history.history['loss'], label='loss')
  plt.xlabel('Epoch')
  plt.ylabel('Error [MPG]')
  plt.legend()
  plt.grid(True)

plot_loss(history)


### Get the model summary

In [2]:
linear_model.summary()


### **Step 4:** Evaluate the linear model on the test set using Keras `Model.evaluate()` and see the `mean_absolute_error` and save the result for future comparison.

In [1]:
print("\nLinear Model Evaluation:")
linear_mae = linear_model.evaluate(test_features, test_labels, verbose=0)
print(f"Test MAE (Linear): {linear_mae:.2f} MPG")



## **Approach #2:** Regression using a `Deep Neural Network (DNN)`

### Solve the same problem using a deep neural network with the following sample architecture:
- 1st hidden layer no. of units =  64
- 2nd hidden layer no. of units = 64
- Choose appropriate `activation` functions for hidden and output layers

In [None]:
dnn_model = tf.keras.Sequential([
    normalizer,
    layers.Dense(64, activation='relu'),
    layers.Dense(64, activation='relu'),
    layers.Dense(1, activation='linear')
])

dnn_model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
    loss='mean_absolute_error'
)

history_dnn = dnn_model.fit(
    train_features, train_labels,
    epochs=100,
    verbose=0
)

### Print the model summary (after training). How many parameters are there in the model?

In [None]:
print("\nDNN Model Evaluation:")
dnn_mae = dnn_model.evaluate(test_features, test_labels, verbose=0)
print(f"Test MAE (DNN): {dnn_mae:.2f} MPG")
print("\nDNN Model Summary:")
dnn_model.summary()

### You can see that even this small model has more than 4000 trainable parameters. The more parameters a model has, the longer and more expensive its training. Search the web to find out how many trainable parameters the `ChatGPT` model has. What about the `DeepSeek` model? (Optional)
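
The trainable parameter count of the small DNN can be checked by hand: each `Dense` layer has one weight per input-unit pair plus one bias per unit. The sketch below assumes 9 input features (the 10 columns left after preprocessing minus the `mpg` label) and counts only the `Dense` layers, on the assumption that the normalization statistics are not trained.

```python
def dense_params(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units

n_features = 9             # assumption: 10 preprocessed columns minus the 'mpg' label
layer_sizes = [64, 64, 1]  # two hidden layers and the output layer

counts = []
n_in = n_features
for n_units in layer_sizes:
    counts.append(dense_params(n_in, n_units))
    n_in = n_units  # this layer's output feeds the next layer

total = sum(counts)
print(counts)  # [640, 4160, 65]
print(total)   # 4865
```

Most of the parameters sit in the second hidden layer, where 64 inputs connect to 64 units.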

## Compare the evaluation result of the two approaches, i.e., linear regression and deep neural network.

In [8]:
print("\nModel Comparison:")
print(f"Linear Regression Test MAE: {linear_mae:.2f} MPG")
print(f"DNN Regression Test MAE: {dnn_mae:.2f} MPG")




## Use the following large model and evaluate it on the test set.

In [None]:
model_dnn_large = tf.keras.Sequential([
    normalizer,
    layers.Dense(64, activation='relu'),
    layers.Dense(64, activation='relu'),
    layers.Dense(64, activation='relu'),
    layers.Dense(64, activation='relu'),
    layers.Dense(1, activation='linear')
])


### Explain your observation. Why do you think the large model is not performing well?

- Hint: when the number of trainable parameters is very large (even larger than the number of data points), the model may overfit the training data. One way to mitigate this problem is to use more data.
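
To put the hint in numbers, the sketch below compares the trainable parameter counts of both DNNs with the size of the training set. It assumes 9 input features and 318 training rows (80% of the 398 samples, with the test set rounded up to 80); `count_dense_params` is a hypothetical helper.

```python
def count_dense_params(n_features, layer_sizes):
    """Sum of weights and biases over a stack of fully connected layers."""
    sizes = [n_features] + list(layer_sizes)
    return sum(n_in * n_out + n_out for n_in, n_out in zip(sizes[:-1], sizes[1:]))

n_train = 318  # assumption: 398 samples minus an 80-row test set

small = count_dense_params(9, [64, 64, 1])
large = count_dense_params(9, [64, 64, 64, 64, 1])

print(f"small model: {small} parameters, {small / n_train:.1f} per training sample")
print(f"large model: {large} parameters, {large / n_train:.1f} per training sample")
```

Both models already have far more parameters than training samples, and the large model roughly triples that ratio, which makes memorizing the training data easier.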