# Lab 4: Basic regression - Predict fuel efficiency



## Imports

In [82]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns # we use this library to load the dataset
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

## Load data

In [83]:
# Load the 'mpg' dataset using seaborn library into a Pandas DataFrame
df = sns.load_dataset('mpg')

The MPG dataset can be viewed online at  
https://github.com/mwaskom/seaborn-data/blob/master/mpg.csv

## Data Exploration - Pandas Review

### Show the first 5 rows of the dataset

In [84]:
df.head(5)

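|   | mpg | cylinders | displacement | horsepower | weight | acceleration | model_year | origin | name |
|---|-----|-----------|--------------|------------|--------|--------------|------------|--------|------|
| 0 | 18.0 | 8 | 307.0 | 130.0 | 3504 | 12.0 | 70 | usa | chevrolet chevelle malibu |
| 1 | 15.0 | 8 | 350.0 | 165.0 | 3693 | 11.5 | 70 | usa | buick skylark 320 |
| 2 | 18.0 | 8 | 318.0 | 150.0 | 3436 | 11.0 | 70 | usa | plymouth satellite |
| 3 | 16.0 | 8 | 304.0 | 150.0 | 3433 | 12.0 | 70 | usa | amc rebel sst |
| 4 | 17.0 | 8 | 302.0 | 140.0 | 3449 | 10.5 | 70 | usa | ford torino |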


### Show the size of the dataframe

In [85]:
df.shape # 398 records and 9 columns

(398, 9)

### Find the column names and their types (numerical or categorical)

In [86]:
#df.describe()
print(df.columns)

for col in df.columns:
  print(col, df[col].dtype)


print(df.info())  # Data types and non-null counts; info() prints its report and returns None
print(df.dtypes)  # A succinct way to get the data type of the values in each column
print(df.describe())

Index(['mpg', 'cylinders', 'displacement', 'horsepower', 'weight',
       'acceleration', 'model_year', 'origin', 'name'],
      dtype='object')
mpg float64
cylinders int64
displacement float64
horsepower float64
weight int64
acceleration float64
model_year int64
origin object
name object
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 398 entries, 0 to 397
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   mpg           398 non-null    float64
 1   cylinders     398 non-null    int64  
 2   displacement  398 non-null    float64
 3   horsepower    392 non-null    float64
 4   weight        398 non-null    int64  
 5   acceleration  398 non-null    float64
 6   model_year    398 non-null    int64  
 7   origin        398 non-null    object 
 8   name          398 non-null    object 
dtypes: float64(4), int64(3), object(2)
memory usage: 28.1+ KB
None
mpg             float64
cylinders         int64
displacement    f

### Find the number of missing values in each column

In [87]:
print(df.isna().sum())
# horsepower has 392 non-null values out of 398, so 6 are missing
# isnull() is an alias of isna(), so this prints the same counts
print(df.isnull().sum())

mpg             0
cylinders       0
displacement    0
horsepower      6
weight          0
acceleration    0
model_year      0
origin          0
name            0
dtype: int64
mpg             0
cylinders       0
displacement    0
horsepower      6
weight          0
acceleration    0
model_year      0
origin          0
name            0
dtype: int64


### Handle the missing values in the dataframe

Since the number of missing values is low, we could simply drop the rows containing them. However, for practice and review, let's replace the missing values in the numerical columns (if any) with the mean of the respective column, and the missing values in the categorical columns (if any) with the mode (the most frequent value) of the respective column, since a median is not defined for unordered categories.

In [88]:
# Assign the result back to the column. Calling fillna(..., inplace=True) on a
# selected column acts on a temporary copy and triggers a FutureWarning in recent pandas.
df['horsepower'] = df['horsepower'].fillna(df['horsepower'].mean())
print(df.isna().sum())

mpg             0
cylinders       0
displacement    0
horsepower      0
weight          0
acceleration    0
model_year      0
origin          0
name            0
dtype: int64
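The imputation rule described in this section can be sketched for both column types at once. This is a minimal sketch on a hypothetical toy frame (the values are made up), and it assumes that categorical gaps are filled with the most frequent value, because a categorical column has no mean.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the values are made up for illustration.
toy = pd.DataFrame({
    "horsepower": [100.0, np.nan, 140.0],
    "origin": ["usa", None, "usa"],
})

for col in toy.columns:
    if pd.api.types.is_numeric_dtype(toy[col]):
        # Numerical column: fill gaps with the column mean.
        toy[col] = toy[col].fillna(toy[col].mean())
    else:
        # Categorical column: fill gaps with the most frequent value.
        toy[col] = toy[col].fillna(toy[col].mode().iloc[0])

print(toy)
```

Assigning the filled column back, rather than using `inplace=True` on a selected column, keeps the update on the original dataframe.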


### Compute the average and the median weight

In [89]:
#your code here
print(df.weight.mean())
print(df.weight.median())

2970.424623115578
2803.5


### Find the number of cars that weigh more than 2000 kg

In [90]:
#your code here
# Vehicle weight is given in pounds (numerical); 1 kg is approximately 2.2 pounds
num_heavy_cars = (df.weight > 2000 * 2.2).sum()
print(num_heavy_cars)

26


### Find how many cars there are for each number of cylinders

In [91]:
#your code here
df.cylinders.value_counts()  # Counts how many rows have each unique value

| cylinders | count |
|-----------|-------|
| 4 | 204 |
| 8 | 103 |
| 6 | 84 |
| 3 | 4 |
| 5 | 3 |


### Find the car models that have 3 or 5 cylinders

In [92]:
#your code here
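# The two rarest cylinder counts (3 and 5) are the last two entries of value_counts()
print(df.cylinders.value_counts()[3:5])
# The first three names among the 3-cylinder cars
print(df.name.groupby(df.cylinders).get_group(3)[0:3])
# Number of cars with 3 and 5 cylinders (positions 0 and 2 of the sorted groups)
print(df.groupby("cylinders").count().name[0:3:2])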

cylinders
3    4
5    3
Name: count, dtype: int64
71     mazda rx2 coupe
111          maxda rx3
243         mazda rx-4
Name: name, dtype: object
cylinders
3    4
5    3
Name: name, dtype: int64
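A more direct way to list the models for both cylinder counts at once is a boolean filter with `Series.isin`. The sketch below uses a hypothetical four-row miniature of the dataset (the rows are illustrative only), not the lab's `df`.

```python
import pandas as pd

# Hypothetical miniature of the dataset; the rows are illustrative only.
toy = pd.DataFrame({
    "name": ["mazda rx2 coupe", "ford torino", "audi 5000", "honda civic"],
    "cylinders": [3, 8, 5, 4],
})

# Keep only the rows whose cylinder count is 3 or 5.
rare = toy[toy["cylinders"].isin([3, 5])]
print(rare[["name", "cylinders"]])
```

The same expression applied to the full dataframe would return every matching row, rather than a positional slice of one group.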


### Show the `value_counts()` of `origin` column or show the unique values of this column.

In [93]:
#your code here
df.origin.value_counts()

| origin | count |
|--------|-------|
| usa | 249 |
| japan | 79 |
| europe | 70 |


## Data Preprocessing

### Use one hot encoding to change the categorical values of `origin` column to numerical values.

- Use the `pd.get_dummies()` function to do the encoding

In [94]:
#your code here
df = pd.get_dummies(df, columns=['origin'])
print(df.head())

    mpg  cylinders  displacement  horsepower  weight  acceleration  \
0  18.0          8         307.0       130.0    3504          12.0   
1  15.0          8         350.0       165.0    3693          11.5   
2  18.0          8         318.0       150.0    3436          11.0   
3  16.0          8         304.0       150.0    3433          12.0   
4  17.0          8         302.0       140.0    3449          10.5   

   model_year                       name  origin_europe  origin_japan  \
0          70  chevrolet chevelle malibu          False         False   
1          70          buick skylark 320          False         False   
2          70         plymouth satellite          False         False   
3          70              amc rebel sst          False         False   
4          70                ford torino          False         False   

   origin_usa  
0        True  
1        True  
2        True  
3        True  
4        True  


### Remove the `name` column from the dataframe to obtain an all-numerical dataframe.

In [95]:
#your code here
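One possible way to remove a text column is `DataFrame.drop`. The sketch below runs on a hypothetical two-row frame, so it illustrates the call rather than the lab's own solution.

```python
import pandas as pd

# Hypothetical two-row frame that mimics part of the dataset.
toy = pd.DataFrame({
    "mpg": [18.0, 15.0],
    "weight": [3504, 3693],
    "name": ["chevrolet chevelle malibu", "buick skylark 320"],
})

# drop() returns a new frame; the original frame is left untouched.
numeric_only = toy.drop(columns=["name"])
print(numeric_only.dtypes)
```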

### Does the input need reshaping?

In [96]:
#your code here

### Split the data into training and test sets and form `train_features`, `train_labels`, `test_features`, `test_labels`

In [97]:
from sklearn.model_selection import train_test_split
#your code here
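A minimal sketch of the split, assuming `mpg` is the label and every other column is a feature. The toy frame, `test_size=0.2`, and `random_state=42` are assumptions chosen for illustration; the lab does not prescribe them.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the preprocessed dataframe.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "mpg": rng.uniform(10, 40, size=10),
    "weight": rng.uniform(1500, 5000, size=10),
    "horsepower": rng.uniform(50, 200, size=10),
})

features = toy.drop(columns=["mpg"])  # model inputs
labels = toy["mpg"]                   # value to predict

# test_size and random_state are assumed values, not prescribed by the lab.
train_features, test_features, train_labels, test_labels = train_test_split(
    features, labels, test_size=0.2, random_state=42
)
print(train_features.shape, test_features.shape)
```

Fixing `random_state` makes the split reproducible, which helps when comparing models later in the lab.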

### For simplicity in the following steps, convert the dataset from a pandas DataFrame to a numpy array.

In [98]:
train_features = np.array(train_features)
train_labels = np.array(train_labels)
test_features = np.array(test_features)
test_labels = np.array(test_labels)

NameError: name 'train_features' is not defined

## Normalization layer

To ensure stable training of neural networks, we typically normalize the data. This process also enhances the convergence of the gradient descent algorithm.

There is no single way to normalize the data. You can also use `scikit-learn` or `pandas` to do it. However, in this lab, we will use the normalization layer provided by TensorFlow, which fits naturally with the other parts of the model.

The `tf.keras.layers.Normalization` is a clean and simple way to add feature normalization into your model.

The first step is to create the layer:

In [None]:
normalizer = tf.keras.layers.Normalization(axis=-1)

Then, fit the state of the preprocessing layer to the data by calling `Normalization.adapt`.

It calculates the mean and variance of each feature and stores them in the layer.

In [None]:
normalizer.adapt(train_features)

When the layer is called, it returns the input data, with each feature independently normalized.

In [None]:
first = train_features[0]
print('First example:', first)
print()
print('Normalized:', normalizer(first).numpy())
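To make the effect of the layer concrete, here is a NumPy sketch of the same idea, assuming the layer standardizes each feature as (x - mean) / sqrt(variance). The feature matrix is hypothetical, and the sketch ignores the small epsilon that guards against division by zero.

```python
import numpy as np

# Hypothetical feature matrix: 4 examples, 2 features on very different scales.
x = np.array([[8.0, 3504.0],
              [4.0, 2130.0],
              [6.0, 2900.0],
              [4.0, 2000.0]])

mean = x.mean(axis=0)   # per-feature mean, analogous to what adapt() stores
var = x.var(axis=0)     # per-feature (population) variance
normalized = (x - mean) / np.sqrt(var)

print(normalized.mean(axis=0))  # approximately 0 for every feature
print(normalized.std(axis=0))   # approximately 1 for every feature
```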

## **Approach #1:** Regression using `Linear Regression`

**You are welcome to use scikit-learn to perform linear regression on this dataset.**

However, here we aim to implement it using TensorFlow.

- As we saw in Lab Week 2, `logistic regression` is essentially a single neuron with a `sigmoid` activation function.

- Similarly, `linear regression` can be viewed as a single neuron with a `linear` activation function.
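The single-neuron view in the two bullets above can be sketched in NumPy. The weights, bias, and inputs below are hypothetical values chosen only for illustration; the only difference between the two models is the activation applied to the weighted sum.

```python
import numpy as np

def neuron(x, w, b, activation):
    """One neuron: weighted sum of the inputs plus a bias, then an activation."""
    return activation(x @ w + b)

def linear(z):
    return z                          # identity: the output can be any real number

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))   # squashes the output into (0, 1)

# Hypothetical weights and inputs, chosen only for illustration.
x = np.array([[1.0, 2.0], [3.0, 4.0]])
w = np.array([0.5, -1.0])
b = 2.0

print(neuron(x, w, b, linear))    # linear regression predictions
print(neuron(x, w, b, sigmoid))   # logistic regression probabilities
```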

### **Step 1:** Linear regression model architecture

In [None]:
linear_model = tf.keras.Sequential([
    normalizer,
    layers.Dense(1, activation='linear')
])

**Note:** You can define your model all at once, as in the cell above, or you can build the model incrementally (suitable for your assignment).

In [None]:
# Defining the model incrementally (suitable for your assignment)
linear_model = tf.keras.Sequential()
linear_model.add(normalizer)
linear_model.add(layers.Dense(1, activation='linear'))

### **Step 2:** Configure the model with Keras `Model.compile()`

The most important arguments to `compile` are the `loss` and the `optimizer`, since these define what will be optimized (`"mean_absolute_error"`) and how (using `tf.keras.optimizers.Adam(learning_rate=0.1)`).

**arguments:**
- optimizer=tf.keras.optimizers.Adam(learning_rate=0.1),
- loss='mean_absolute_error'
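To clarify what the `mean_absolute_error` loss measures, here is a small NumPy sketch. The target and predicted MPG values are hypothetical, and the helper function is written only for illustration; it is not the Keras implementation.

```python
import numpy as np

def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between targets and predictions."""
    return np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred)))

# Hypothetical MPG targets and predictions.
y_true = [18.0, 15.0, 30.0]
y_pred = [20.0, 15.0, 26.0]

mae = mean_absolute_error(y_true, y_pred)
print(mae)  # (2 + 0 + 4) / 3 = 2.0
```

Because the loss is expressed in the same unit as the label, a value of 2.0 here means the predictions are off by 2 MPG on average.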

In [None]:
#your code here

### **Step 3:** Train the model using the `Model.fit()` for `100` epochs, and store the output in a variable named history.

In [None]:
history = linear_model.fit(train_features, train_labels, epochs=100)

In [None]:
history.history

In [None]:
def plot_loss(history):
  plt.plot(history.history['loss'], label='loss')
  plt.xlabel('Epoch')
  plt.ylabel('Error [MPG]')
  plt.legend()
  plt.grid(True)

plot_loss(history)

### Get the model summary

In [None]:
linear_model.summary()

### **Step 4:** Evaluate the linear model on the test set using Keras `Model.evaluate()`, inspect the `mean_absolute_error`, and save the result for later comparison.

In [None]:
#your code here

## **Approach #2:** Regression using a `Deep Neural Network (DNN)`

### Solve the same problem using a deep neural network with the following sample architecture:
- 1st hidden layer no. of units = 64
- 2nd hidden layer no. of units = 64
- Choose appropriate `activation` functions for hidden and output layers

In [None]:
#your code here

### Print the model summary (after training). How many parameters are there in the model?

### You can see that even this small model has more than 4000 trainable parameters. The larger the number of parameters, the longer the training time and the higher the cost. Search the net to find out how many trainable parameters the `ChatGPT` model has. What about the `DeepSeek` model? (Optional)
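The parameter count can be checked by hand: a dense layer has one weight per input-unit pair plus one bias per unit. The sketch below assumes 9 input features remain after removing `mpg` (the label) and `name`; that count depends on how the preprocessing exercises were completed, so treat it as an assumption.

```python
def dense_params(n_inputs, n_units):
    """Trainable parameters of a fully connected layer: weights plus biases."""
    return n_inputs * n_units + n_units

# Assumption: 9 input features remain after dropping `mpg` (label) and `name`.
n_features = 9
layer_units = [64, 64, 1]  # two hidden layers and one output unit

total = 0
n_inputs = n_features
for units in layer_units:
    total += dense_params(n_inputs, units)
    n_inputs = units  # the output of this layer feeds the next one

print(total)  # 640 + 4160 + 65 = 4865
```

The result can be compared against the trainable parameter count reported by `Model.summary()`.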

## Compare the evaluation result of the two approaches, i.e., linear regression and deep neural network.

In [None]:
#your code here

## Use the following large model and evaluate it on the test set.

In [None]:
model_dnn_large = tf.keras.Sequential([
    normalizer,
    layers.Dense(64, activation='relu'),
    layers.Dense(64, activation='relu'),
    layers.Dense(64, activation='relu'),
    layers.Dense(64, activation='relu'),
    layers.Dense(1, activation='linear')
])


### Explain your observation. Why do you think the large model is not performing well?

- Hint: when the number of trainable parameters is very large (even larger than the number of data points), the model may overfit the training data. One way to mitigate this problem is to use more data.