# Lab 4: Basic regression - Predict fuel efficiency



## Imports

In [7]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns # we use this library to load the dataset
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

## Load data

In [8]:
# Load the 'mpg' dataset using seaborn library into a Pandas DataFrame
df = sns.load_dataset('mpg') # mpg -> miles per gallon

MPG dataset can be viewed online at  
https://github.com/mwaskom/seaborn-data/blob/master/mpg.csv

## Data Exploration - Pandas Review

### Show the first 5 rows of the dataset

In [9]:
#your code here
df.head()

|   | mpg | cylinders | displacement | horsepower | weight | acceleration | model_year | origin | name |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 18.0 | 8 | 307.0 | 130.0 | 3504 | 12.0 | 70 | usa | chevrolet chevelle malibu |
| 1 | 15.0 | 8 | 350.0 | 165.0 | 3693 | 11.5 | 70 | usa | buick skylark 320 |
| 2 | 18.0 | 8 | 318.0 | 150.0 | 3436 | 11.0 | 70 | usa | plymouth satellite |
| 3 | 16.0 | 8 | 304.0 | 150.0 | 3433 | 12.0 | 70 | usa | amc rebel sst |
| 4 | 17.0 | 8 | 302.0 | 140.0 | 3449 | 10.5 | 70 | usa | ford torino |


### Show the size of the dataframe

In [10]:
#your code here
df.size # Note: size counts cells, i.e. 398 rows x 9 columns = 3582; df.shape gives (rows, columns)


3582
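As a small illustration of the difference between the size-related attributes, here is a minimal sketch on a made-up frame (`toy` and its values are assumptions, not the lab data):

```python
import pandas as pd

# Toy frame standing in for the real dataset (values are made up)
toy = pd.DataFrame({
    "mpg": [18.0, 15.0, 16.0],
    "cylinders": [8, 8, 6],
})

n_rows, n_cols = toy.shape   # (number of rows, number of columns)
n_cells = toy.size           # total number of cells = rows * columns
n_records = len(toy)         # number of rows only

print(toy.shape, n_cells, n_records)
```

`shape` is usually the most informative of the three because it keeps rows and columns separate.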

### Find the column names and their types (numerical or categorical)

In [11]:
#your code here
# print the column names
print(df.columns)

for singleColumn in df.columns:
  print(singleColumn)
  print(df[singleColumn].info())

print(df.info()) # gets the datatypes and the null/non-null counts
print(df.describe())
print(df.dtypes)

Index(['mpg', 'cylinders', 'displacement', 'horsepower', 'weight',
       'acceleration', 'model_year', 'origin', 'name'],
      dtype='object')
mpg
<class 'pandas.core.series.Series'>
RangeIndex: 398 entries, 0 to 397
Series name: mpg
Non-Null Count  Dtype  
--------------  -----  
398 non-null    float64
dtypes: float64(1)
memory usage: 3.2 KB
None
cylinders
<class 'pandas.core.series.Series'>
RangeIndex: 398 entries, 0 to 397
Series name: cylinders
Non-Null Count  Dtype
--------------  -----
398 non-null    int64
dtypes: int64(1)
memory usage: 3.2 KB
None
displacement
<class 'pandas.core.series.Series'>
RangeIndex: 398 entries, 0 to 397
Series name: displacement
Non-Null Count  Dtype  
--------------  -----  
398 non-null    float64
dtypes: float64(1)
memory usage: 3.2 KB
None
horsepower
<class 'pandas.core.series.Series'>
RangeIndex: 398 entries, 0 to 397
Series name: horsepower
Non-Null Count  Dtype  
--------------  -----  
392 non-null    float64
dtypes: float64(1)
memory usage:
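A more compact way to separate numerical from categorical columns is `select_dtypes`. The sketch below uses a made-up frame (`toy` is an assumption); the same two calls should work on `df`:

```python
import numpy as np
import pandas as pd

# Toy frame with two numerical columns and one text column (values are made up)
toy = pd.DataFrame({
    "mpg": [18.0, 15.0],
    "cylinders": [8, 8],
    "origin": ["usa", "japan"],
})

# Columns whose dtype is numeric (int or float)
numerical_cols = toy.select_dtypes(include=np.number).columns.tolist()
# Everything else is treated as categorical here
categorical_cols = toy.select_dtypes(exclude=np.number).columns.tolist()

print("numerical:", numerical_cols)
print("categorical:", categorical_cols)
```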

### Find the number of missing values in each column

In [None]:
#your code here
# horsepower 392 non-null float64

print(df.isna().sum())  # count all of the NA values
print(df.isnull().sum()) # isnull() is an alias of isna(), so the output is identical

### Handle the missing values in the dataframe

Since the number of missing values is low, we could simply drop the rows containing them. However, as practice and review, let's replace the missing values in the numerical columns (if any) with the mean of the respective column, and the missing values in the categorical columns (if any) with the mode (most frequent value) of the respective column, since a median is not defined for unordered categories.

In [None]:
#your solution here

#df.select_dtypes(include = np.number) => filters the dataframe to include only columns with numerical data types

# .columns => extract the names of these selected columns
numberOnlyColumns = df.select_dtypes(include=np.number).columns # only numerical values column

df[numberOnlyColumns] = df[numberOnlyColumns].fillna(df[numberOnlyColumns].mean())
# fillna -> is a pandas method used to fill missing values(NaN) in a Dataframe or series
# filling with (df[numberOnlyColumns].mean()) -> calculates the mean of each numerical column, ignoring missing values

print(df.isnull().sum())
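The cell above only touches numerical columns. For categorical columns a mean is undefined, so a common choice is the most frequent value. A minimal sketch of both steps on a made-up frame (`toy` and its values are assumptions):

```python
import numpy as np
import pandas as pd

# Toy frame with one missing value per column (values are made up)
toy = pd.DataFrame({
    "horsepower": [130.0, np.nan, 150.0],
    "origin": ["usa", None, "usa"],
})

num_cols = toy.select_dtypes(include=np.number).columns
cat_cols = toy.select_dtypes(exclude=np.number).columns

# Numerical columns: fill with the column mean (NaN is ignored by mean())
toy[num_cols] = toy[num_cols].fillna(toy[num_cols].mean())

# Categorical columns: fill with the most frequent value (first mode)
for col in cat_cols:
    toy[col] = toy[col].fillna(toy[col].mode().iloc[0])

print(toy)
```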


### Compute the average and the median weight

In [None]:
#your code here
print(df.weight.mean())
print(df.weight.median())

### Find the number of cars that weigh more than 2000 kg

In [None]:
#your code here
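# weight appears to be recorded in pounds; 1 kg is about 2.2 lb
(df.weight > 2000 * 2.2).sum() # counts the rows where the condition is True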

### Find how many cars there are for each number of cylinders

In [None]:
#your code here
df.cylinders.value_counts()


### Find the car models that have 3 or 5 cylinders

In [None]:
#your code here
# Grouping (groupby)
print(df.name.groupby(df.cylinders).get_group(3)) # group the name column by the cylinders column / .get_group(3) -> returns the rows where cylinders == 3
print(df.name.groupby(df.cylinders).get_group(5))
# print(df.groupby("cylinders").count().name[0:3:2])
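An alternative to `groupby` for this kind of question is a boolean filter with `isin`, which returns both groups in one call. A minimal sketch on made-up data (`toy` and the car names are placeholders):

```python
import pandas as pd

# Toy frame; the names are placeholders, not rows from the dataset
toy = pd.DataFrame({
    "name": ["car a", "car b", "car c", "car d"],
    "cylinders": [3, 4, 5, 8],
})

# Keep only the rows whose cylinder count is 3 or 5
selected = toy.loc[toy["cylinders"].isin([3, 5]), ["name", "cylinders"]]
print(selected)
```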



### Show the `value_counts()` of `origin` column or show the unique values of this column.

In [None]:
#your code here
print(df.origin.value_counts())

## Data Preprocessing

### Use one hot encoding to change the categorical values of `origin` column to numerical values.

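One-hot encoding is a technique that converts categorical data into numbers so that a machine learning model can work with it.

Why?
-> Mapping the categories directly to integers causes a problem: the model may read an order or a distance into the codes.
-> The conversion must ensure that differences in numeric magnitude carry no meaning.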

- use `pd.get_dummies()` method to do the encoding

In [None]:
#your code here
# dtype=float keeps the new columns numeric (recent pandas versions default to bool)
df = pd.get_dummies(df, columns=['origin'], dtype=float)
df.head()
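To make the motivation concrete, the following sketch contrasts plain integer codes with one-hot columns on a made-up `origin` column (`toy` is an assumption):

```python
import pandas as pd

toy = pd.DataFrame({"origin": ["usa", "japan", "europe", "usa"]})

# Integer codes impose an artificial order (europe=0 < japan=1 < usa=2)
codes = toy["origin"].astype("category").cat.codes

# One-hot encoding: one 0/1 column per category, with no implied order
one_hot = pd.get_dummies(toy, columns=["origin"], dtype=int)

print(codes.tolist())
print(one_hot)
```

Each row of `one_hot` contains exactly one 1, so no category is numerically "larger" than another.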

### Remove the name column from the dataframe so that all remaining columns are numerical.

In [None]:
#your code here
df_preprocessed = df.drop('name', axis=1) # axis=1 specifies column removal
df_preprocessed.head()

### Does the input need reshaping?

In [None]:
#your code here
print(df_preprocessed.shape)


### Split the data into training and test sets and form `train_features`, `train_labels`, `test_features`, `test_labels`

In [None]:
from sklearn.model_selection import train_test_split
#your code here
# Passing a single DataFrame returns two DataFrames: the training rows and the test rows
train_df, test_df = train_test_split(df_preprocessed, test_size=0.2, random_state=66)
print(train_df.shape)
print(test_df.shape)

train_labels = train_df.iloc[:, 0] # The first column is mpg, which we have chosen as the label/target/answer
train_features = train_df.iloc[:, 1:] # All remaining columns are the input features

test_labels = test_df.iloc[:, 0] # The first column is mpg, which we have chosen as the label/target/answer
test_features = test_df.iloc[:, 1:] # All remaining columns are the input features


print(train_features.shape)
print(train_labels.shape)
print(test_features.shape)
print(test_labels.shape)
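The same split can also be sketched with pandas alone, which makes the mechanics visible. This is an alternative illustration on made-up data (`toy`, the 80/20 fraction and the seed are assumptions), not a replacement for `train_test_split`:

```python
import numpy as np
import pandas as pd

# Toy frame with a label column (mpg) and one feature (values are random)
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "mpg": rng.uniform(10, 40, size=10),
    "weight": rng.uniform(1500, 5000, size=10),
})

# Randomly pick 80% of the rows for training; the rest become the test set
train_df = toy.sample(frac=0.8, random_state=0)
test_df = toy.drop(train_df.index)

# pop() removes the label column and returns it, leaving only the features
train_labels = train_df.pop("mpg")
test_labels = test_df.pop("mpg")
train_features, test_features = train_df, test_df

print(train_features.shape, train_labels.shape)
print(test_features.shape, test_labels.shape)
```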

### For simplicity in the following steps, convert the dataset from a pandas DataFrame to a numpy array.

In [None]:
train_features = np.array(train_features)
train_labels = np.array(train_labels)
test_features = np.array(test_features)
test_labels = np.array(test_labels)

## Normalization layer

To ensure stable training of neural networks, we typically normalize the data. This process also enhances the convergence of the gradient descent algorithm.

There is no single way to normalize the data. You can also use `scikit-learn` or `pandas` to do it. However, in this lab we will use the normalization layer provided by TensorFlow, which fits naturally with the other parts of the model.

The `tf.keras.layers.Normalization` is a clean and simple way to add feature normalization into your model.

The first step is to create the layer:

In [None]:
normalizer = tf.keras.layers.Normalization(axis=-1)
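# axis=-1 -> normalize along the last dimension (feature-wise)
# i.e., each column (feature) is rescaled to mean 0 and variance 1

# What is the Normalization layer?
# -> tf.keras.layers.Normalization is a Keras layer that normalizes its input data
# It transforms the data to mean 0 and variance 1, which helps training run faster and more stably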

Then, fit the state of the preprocessing layer to the data by calling `Normalization.adapt`.

It calculates the mean and variance of each feature and stores them in the layer.

In [None]:
normalizer.adapt(train_features)

When the layer is called, it returns the input data, with each feature independently normalized.

In [None]:
first = train_features[0]
print('First example:', first)
print()
print('Normalized:', normalizer(first).numpy())
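The arithmetic behind feature-wise normalization can be sketched in NumPy. This illustrates the idea (subtract the per-feature mean, divide by the per-feature standard deviation) on made-up numbers; it is not the layer's actual implementation, which may differ in details such as numerical safeguards:

```python
import numpy as np

# Made-up feature matrix: 4 samples, 2 features on very different scales
features = np.array([
    [8.0, 3504.0],
    [4.0, 2130.0],
    [6.0, 2900.0],
    [4.0, 1985.0],
])

# Statistics are computed per column (feature), i.e. along axis 0
mean = features.mean(axis=0)
var = features.var(axis=0)

# Each feature ends up with mean 0 and variance 1
normalized = (features - mean) / np.sqrt(var)

print(normalized.mean(axis=0))
print(normalized.var(axis=0))
```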

## **Approach #1:** Regression using `Linear Regression`

**You are welcome to use scikit-learn to perform linear regression on this dataset.**

However, here we aim to implement it using TensorFlow.

- As we saw in Lab Week 2, `logistic regression` is essentially a single neuron with a `sigmoid` activation function.

- Similarly, `linear regression` can be viewed as a single neuron with a `linear` activation function.
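The two bullets above can be sketched in NumPy: the same weighted sum feeds either a linear (identity) activation or a sigmoid. The inputs, weights and bias below are arbitrary illustrative values:

```python
import numpy as np

def linear_neuron(x, weights, bias):
    """Single neuron with a linear activation: a weighted sum plus a bias."""
    return x @ weights + bias

def sigmoid(z):
    """Logistic function, squashing any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([[1.0, 2.0], [3.0, 4.0]])  # two samples, two features
w = np.array([0.5, -1.0])               # one weight per feature
b = 2.0

linear_out = linear_neuron(x, w, b)     # linear regression output
logistic_out = sigmoid(linear_out)      # logistic regression output

print(linear_out)
print(logistic_out)
```

The linear output is unbounded, which suits a continuous target such as MPG, while the sigmoid output is confined to (0, 1) and suits a probability.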

### **Step 1:** Linear regression model architecture

In [None]:
linear_model = tf.keras.Sequential([
    normalizer,
    layers.Dense(1, activation='linear')
])

**Note:** You can define your model all at once like the cell above, or you can build the model incrementally (suitable for your assignment).

In [None]:
# Defining the model incrementally (suitable for your assignment)
linear_model = tf.keras.Sequential()
linear_model.add(normalizer)
linear_model.add(layers.Dense(1, activation='linear'))

### **Step 2:** Configure the model with Keras `Model.compile()`

The most important arguments to compile are the `loss` and the `optimizer`, since these define what will be optimized (`"mean_absolute_error"`) and how (using the `tf.keras.optimizers.Adam(learning_rate=0.1)`).

**arguments:**
- optimizer=tf.keras.optimizers.Adam(learning_rate=0.1),
- loss='mean_absolute_error'
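For reference, mean absolute error is the average of the absolute differences between targets and predictions. A minimal NumPy sketch with made-up numbers (the helper name `mean_absolute_error` is chosen here for illustration):

```python
import numpy as np

def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between targets and predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_true - y_pred)))

# Absolute errors are 2, 0 and 3, so the mean is 5/3
mae = mean_absolute_error([18.0, 15.0, 30.0], [20.0, 15.0, 27.0])
print(mae)
```

Because the loss is in the same unit as the target, an MAE of 2 here would mean the predictions are off by 2 MPG on average.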

In [None]:
#your code here
linear_model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.1),
                     loss='mean_absolute_error')

### **Step 3:** Train the model using the `Model.fit()` for `100` epochs, and store the output in a variable named history.

In [None]:
history = linear_model.fit(train_features, train_labels, epochs=100)

In [None]:
history.history

In [None]:
def plot_loss(history):
  plt.plot(history.history['loss'], label='loss')
  plt.xlabel('Epoch')
  plt.ylabel('Error [MPG]')
  plt.legend()
  plt.grid(True)

plot_loss(history)

### Get the model summary

In [None]:
linear_model.summary()

### **Step 4:** Evaluate the linear model on the test set using Keras `Model.evaluate()`, inspect the `mean_absolute_error`, and save the result for future comparison.

In [None]:
#your code here

## **Approach #2:** Regression using a `Deep Neural Network (DNN)`

### Solve the same problem using a deep neural network with the following sample architecture:
- 1st hidden layer no. of units =  64
- 2nd hidden layer no. of units = 64
- Choose appropriate `activation` functions for hidden and output layers

In [None]:
#your code here
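One possible shape for this exercise is sketched below on synthetic data, so that it runs on its own. Everything specific here is an assumption for illustration: the name `dnn_sketch`, the random data, `relu` for the hidden layers, the learning rate of 0.001, and 5 epochs. In the lab you would put `normalizer` first and use the real training and test arrays.

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers

# Synthetic stand-in for the 9 input features and the mpg label
rng = np.random.default_rng(0)
x = rng.normal(size=(64, 9)).astype("float32")
y = (x @ rng.normal(size=9) + 25.0).astype("float32")

# Two hidden layers of 64 units; a linear output for a regression target
dnn_sketch = tf.keras.Sequential([
    tf.keras.Input(shape=(9,)),
    layers.Dense(64, activation="relu"),
    layers.Dense(64, activation="relu"),
    layers.Dense(1, activation="linear"),
])

dnn_sketch.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),  # assumed value
    loss="mean_absolute_error",
)
dnn_sketch.fit(x, y, epochs=5, verbose=0)

# Keep results in a dict so several models can be compared later.
# Note: this evaluates on the training data only because the sketch has no test split.
test_results = {}
test_results["dnn_sketch"] = dnn_sketch.evaluate(x, y, verbose=0)
print(test_results)
```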

### Print the model summary (after training). How many parameters are there in the model?

### You can see that even this small model has more than 4000 trainable parameters. The more parameters a model has, the longer and more costly training becomes. Search the web to find out how many trainable parameters the `ChatGPT` model has. What about the `DeepSeek` model? (Optional)
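The parameter count can be checked by hand: a dense layer has one weight per input-unit pair plus one bias per unit. The sketch below assumes 9 input features (6 numeric columns plus 3 one-hot `origin` columns) and counts only the dense layers, not the statistics stored by the normalization layer:

```python
def dense_params(n_inputs, n_units):
    """Weights plus biases of one fully connected layer."""
    return n_inputs * n_units + n_units

n_features = 9             # assumption: 6 numeric + 3 one-hot origin columns
layer_sizes = [64, 64, 1]  # two hidden layers and the output layer

total = 0
n_inputs = n_features
for n_units in layer_sizes:
    total += dense_params(n_inputs, n_units)
    n_inputs = n_units     # this layer's output feeds the next layer

print(total)  # 640 + 4160 + 65 = 4865
```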

## Compare the evaluation result of the two approaches, i.e., linear regression and deep neural network.

In [None]:
#your code here

## Use the following large model and evaluate it on the test set.

In [None]:
model_dnn_large = tf.keras.Sequential([
    normalizer,
    layers.Dense(64, activation='relu'),
    layers.Dense(64, activation='relu'),
    layers.Dense(64, activation='relu'),
    layers.Dense(64, activation='relu'),
    layers.Dense(1, activation='linear')
])


### Explain your observation. Why do you think the large model is not performing well?

- hint: when the number of trainable parameters is very large (even larger than the number of data points), the model may overfit the training data. One way to solve this problem is to use more data.
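To put the hint in numbers, the sketch below compares the dense-layer parameter count of the large model with the approximate number of training rows. It assumes 9 input features and an 80/20 split of the 398 rows, with the test share rounded up:

```python
import math

def mlp_params(n_features, layer_sizes):
    """Total weights and biases of a stack of fully connected layers."""
    total, n_inputs = 0, n_features
    for n_units in layer_sizes:
        total += n_inputs * n_units + n_units
        n_inputs = n_units
    return total

n_rows = 398
n_train = n_rows - math.ceil(0.2 * n_rows)   # 398 - 80 = 318 training rows

large_params = mlp_params(9, [64, 64, 64, 64, 1])
ratio = large_params / n_train

print(large_params, n_train, round(ratio, 1))
```

Under these assumptions the model has roughly 40 times more parameters than training examples, which leaves plenty of capacity to memorize the training set.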