# TP4

In vehicular ad hoc networks (VANETs), it is crucial to detect misbehavior, and cyberattacks in particular, to guarantee the safety of drivers.

Denial-of-service (DoS) attacks flood the network with malicious traffic, often from a large number of attacking machines, in an attempt to make the information that vehicles need to communicate unavailable. This work focuses on detecting these attacks using Federated Learning.

Federated Learning will be implemented using the TensorFlow and TensorFlow Federated libraries. The pandas library will be used to clean and preprocess the dataset.

## Dataset

The dataset used is a public simulated dataset called Vehicular Reference Misbehavior (VeReMi). It contains several attack types, such as Data Replay or Eventual Stop, but this work focuses on DoS attacks.

In the *_data* folder, you will find the *dos_dataset* folder and a CSV file, *sender_labels.csv*, that maps each sender to its label. The *dos_dataset* folder contains the JSON files with all the message data.

### Dataset study and objectives

In the JSON files, messages can be of three types:
- Type = 2: GPS, sent messages
- Type = 3: BSM (Basic Safety Message), received messages
- Type = 4: Ground truth, received messages

We will focus on the type 3 (received BSM) messages.
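
As a minimal sketch of this filtering step, assuming each line of a trace file is one JSON object with a `type` field. The sample records and the helper name `keep_received_bsm` are invented for illustration and are not taken from VeReMi:

```python
import json

# Hypothetical sample lines in the style of a trace file (values are made up).
sample_lines = [
    '{"type": 2, "rcvTime": 10.0, "pos": [1.0, 2.0, 0.0]}',
    '{"type": 3, "rcvTime": 10.5, "sender": 9, "messageID": 42}',
    '{"type": 3, "rcvTime": 11.0, "sender": 7, "messageID": 43}',
]

def keep_received_bsm(lines):
    """Parse JSON lines and keep only type 3 (received BSM) records."""
    records = (json.loads(line) for line in lines if line.strip())
    return [rec for rec in records if rec.get("type") == 3]

bsm_records = keep_received_bsm(sample_lines)
print(len(bsm_records))
```

Filtering before building a dataframe avoids creating columns that only exist for the other message types.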

The type 3 messages have the following fields:
- *rcvTime*: float
- *sendTime*: float
- *sender*: int, corresponds to the *sender* column in *sender_labels.csv*. We will join the files on this column to associate messages with their labels.
- *senderPseudo*: int
- *messageID*: int
- *pos*, *pos_noise*: lists of floats
- *spd*, *spd_noise*: lists of floats
- *acl*, *acl_noise*: lists of floats
- *hed*, *hed_noise*: lists of floats

The receiver id appears in each file name (it is the first number in the name). In the dataframe that we will build later, a column with this id will be added to make the Federated Learning approach possible, as each receiver will train its own model.
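
The extraction of the receiver id can be sketched as follows, assuming file names are dash-separated with the receiver id as the second field. The example file name and the helper `receiver_id_from_path` are hypothetical:

```python
import os

def receiver_id_from_path(file_path):
    """Return the receiver id, i.e. the first number in the file name."""
    file_name = os.path.basename(file_path)
    # 'traceJSON-<receiver_id>-<...>.json' -> second dash-separated field
    return int(file_name.split("-")[1])

# Hypothetical file name; only the dash-separated layout matters here.
example_path = "./_data/dos_dataset/traceJSON-15-13-A0-50404-14.json"
print(receiver_id_from_path(example_path))
```

Converting the id to `int` here is a design choice of this sketch: it makes later grouping sort numerically rather than alphabetically.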

**The main goal will be to predict the attack label (0 for a normal vehicle, 1 for an attacking vehicle).**

### Choosing the variables

### Dataset cleaning

Our first goal is to clean the dataset by parsing the JSON files and building a single pandas dataframe that contains all of the data.

The lines will be filtered so that only type 3 messages remain and a *receiver_id* column corresponding to the first number in the file names will be added.

In [114]:
import pandas as pd
import json
import os
import glob

# Path to the dataset folder
dos_path = './_data/dos_dataset'

# Finding the JSON files
file_paths = glob.glob(os.path.join(dos_path, 'traceJSON-*-*.json'))

# Initialising a list to store the dataframes
dfs = []

# Iterating over the JSON files
for file_path in file_paths:
    # Extracting the receiver id from the file name
    file_name = os.path.basename(file_path)
    receiver_id = file_name.split('-')[1]

    # Reading the JSON file
    with open(file_path, 'r') as f:
        data = f.readlines()

    # Stripping each JSON line and adding the receiver id
    json_data = [json.loads(line.strip()) for line in data]
    for line in json_data:
        line['receiver_id'] = receiver_id

    # Creating a pandas dataframe
    df = pd.DataFrame(json_data)

    # Filtering the lines where type = 3
    df_filtered = df[df['type'] == 3]

    # Adding the filtered dataframe to the dataframe list
    dfs.append(df_filtered)

# Concatenating the per-file dataframes into a single dataframe
df_final = pd.concat(dfs)

### Further processing

The next step is to join the sender-label mapping file with our dataframe on the *sender* column. We also have to split the columns that contain lists of floats so that every column contains only floats. For example, the *pos* column will be separated into 3 columns: *pos1*, *pos2* and *pos3*.
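
As a small illustration of both steps on toy data (all values are made up), the sketch below joins a label table on *sender* and expands one list column. It uses a vectorized `tolist()` expansion, which is an alternative to a row-wise `apply` and is assumed here to be acceptable because every list has the same length:

```python
import pandas as pd

# Toy data: two messages with a list-valued 'pos' column and a label table.
messages = pd.DataFrame({
    "sender": [9, 7],
    "pos": [[1.0, 2.0, 0.0], [3.0, 4.0, 0.0]],
})
labels = pd.DataFrame({"sender": [7, 9], "label": [1, 0]})

# Left join keeps every message, even if its sender has no label.
merged = messages.merge(labels, how="left", on="sender")

# Expand the list column into pos1, pos2, pos3 in one vectorized step.
pos_cols = pd.DataFrame(
    merged["pos"].tolist(),
    index=merged.index,
    columns=[f"pos{i + 1}" for i in range(3)],
)
merged = pd.concat([merged.drop(columns=["pos"]), pos_cols], axis=1)
print(merged.columns.tolist())
```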

In [115]:
sender_labels_path = './_data/sender_labels.csv'

df_labels = pd.read_csv(sender_labels_path, delimiter=';')

# Columns to split
columns_to_split = ['pos', 'pos_noise', 'spd', 'spd_noise', 'acl', 'acl_noise', 'hed', 'hed_noise']


df_final = pd.merge(df_final, df_labels, how='left', on='sender')

# Function to split lists into sub-columns
def split_list_to_columns(row, col_name):
    return pd.Series(row[col_name])

# Applying the function to every column that contains list data
for col in columns_to_split:
    new_cols = df_final.apply(lambda x: split_list_to_columns(x, col), axis=1)
    new_cols.columns = [f"{col}{i+1}" for i in range(len(df_final[col][0]))]
    df_final[new_cols.columns] = new_cols

df_final = df_final.drop(columns=columns_to_split)


# Printing the columns of the final dataframe

print(df_final.columns)


Index(['type', 'rcvTime', 'receiver_id', 'sendTime', 'sender', 'senderPseudo',
       'messageID', 'label', 'pos1', 'pos2', 'pos3', 'pos_noise1',
       'pos_noise2', 'pos_noise3', 'spd1', 'spd2', 'spd3', 'spd_noise1',
       'spd_noise2', 'spd_noise3', 'acl1', 'acl2', 'acl3', 'acl_noise1',
       'acl_noise2', 'acl_noise3', 'hed1', 'hed2', 'hed3', 'hed_noise1',
       'hed_noise2', 'hed_noise3'],
      dtype='object')


Before building the federated model, we have to transform our dataframe into a ClientData object (*tff.simulation.datasets.ClientData* class). The *receiver_id* column will be used to group the rows by client and to create the client ids. At the end of this process, each client will have a dataset of tensors holding its features and labels.
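
The grouping idea can be illustrated without TFF. The sketch below is an assumption-laden simplification: the toy dataframe, the column *spd1* as the only feature, and the helper `split_by_client` are all hypothetical. It maps each receiver to a pair of NumPy arrays, which could then be passed to `tf.data.Dataset.from_tensor_slices`:

```python
import numpy as np
import pandas as pd

# Toy dataframe standing in for the preprocessed messages (values made up).
toy = pd.DataFrame({
    "receiver_id": [15, 15, 21],
    "spd1": [0.5, 0.7, 1.2],
    "label": [0, 1, 0],
})

def split_by_client(df, client_col="receiver_id", label_col="label"):
    """Map each client id (as a string) to its (features, labels) arrays."""
    per_client = {}
    for client_id, client_df in df.groupby(client_col):
        # As in the notebook, every column except the label is kept as a feature.
        features = client_df.drop(columns=[label_col]).to_numpy(dtype=np.float32)
        labels = client_df[label_col].to_numpy(dtype=np.float32)
        per_client[str(client_id)] = (features, labels)
    return per_client

clients = split_by_client(toy)
print({cid: feats.shape for cid, (feats, _) in clients.items()})
```

Keying the dictionary by the real receiver id, rather than by position in a list, keeps the link between a client and the vehicle it represents.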

In [119]:
import tensorflow_federated as tff
import pandas as pd
import tensorflow as tf

# Drop rows with NaN values
df = df_final.dropna()

# Function to create client datasets
def create_client_data():
    client_data = []
    # Assuming 'receiver_id' is the column to group clients
    for _, client_df in df.groupby('receiver_id'):
        client_data.append(
            tf.data.Dataset.from_tensor_slices(
                ({'dense_28_input': client_df.drop(columns=['label']).values.astype('float32')},
                 client_df['label'].values.astype('float32'))
            )
        )
    return client_data

# Build the list of per-client datasets
client_data = create_client_data()

# Create ClientData object
client_data_obj = tff.simulation.datasets.ClientData.from_clients_and_tf_fn(
    client_ids=[str(i) for i in range(len(client_data))],  # Assuming client_ids are numeric
    serializable_dataset_fn=lambda client_id: client_data[int(client_id)]
)

# Test by printing the client data for the first client
client_dataset = client_data_obj.create_tf_dataset_for_client(client_data_obj.client_ids[0])
for data in client_dataset.take(1):
    print(data)


({'dense_28_input': <tf.Tensor: shape=(31,), dtype=float32, numpy=
array([ 3.00000000e+00,  5.04046016e+04,  1.50000000e+01,  5.04046016e+04,
        9.00000000e+00,  1.09500000e+03,  8.48800000e+03,  1.31761047e+03,
        1.01347186e+03,  0.00000000e+00,  3.60492206e+00,  3.79965401e+00,
        0.00000000e+00, -9.05958951e-01, -1.42745638e+00,  0.00000000e+00,
       -1.89850503e-03, -2.99134199e-03,  0.00000000e+00, -9.18970764e-01,
       -1.44807351e+00,  0.00000000e+00,  3.99408024e-03,  5.08691743e-03,
        0.00000000e+00, -5.35855293e-01, -8.44309807e-01,  0.00000000e+00,
        2.02444115e+01,  2.13281651e+01,  0.00000000e+00], dtype=float32)>}, <tf.Tensor: shape=(), dtype=float32, numpy=0.0>)


## Federated Learning Model

We will try two machine learning methods: a Convolutional Neural Network (CNN) and Logistic Regression. Weight updates can be challenging with CNNs because of their large number of parameters, whereas Logistic Regression has fewer parameters and is therefore easier to update.
Earlier, during preprocessing, we partitioned the data by receiver id. To aggregate the models, we will use federated averaging with the TensorFlow Federated library.
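
To make the aggregation step concrete, here is a simplified NumPy sketch of the weighted averaging idea behind federated averaging: each client's weights count in proportion to its number of examples. This is an illustration only, not TFF's implementation, and the function `federated_average` and the toy weights are hypothetical:

```python
import numpy as np

def federated_average(client_weights, client_sizes):
    """Average per-client weight lists, weighting each client by its example count."""
    total = float(sum(client_sizes))
    n_layers = len(client_weights[0])
    return [
        sum(w[layer] * (n / total) for w, n in zip(client_weights, client_sizes))
        for layer in range(n_layers)
    ]

# Two hypothetical clients, each holding one weight matrix and one bias vector.
client_a = [np.ones((2, 2)), np.zeros(2)]
client_b = [np.full((2, 2), 4.0), np.full(2, 3.0)]
global_weights = federated_average([client_a, client_b], client_sizes=[10, 30])
print(global_weights[0])
```

With 10 and 30 examples, the second client contributes three quarters of the average, so every entry of the matrix becomes 0.25 * 1 + 0.75 * 4 = 3.25.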

### Logistic Regression

In [120]:
import collections
import tensorflow as tf
import tensorflow_federated as tff

# Load simulation data.
source = client_data_obj
def make_client_data(n):
  return source.create_tf_dataset_for_client(source.client_ids[n]).repeat(10).batch(20)

# Pick a subset of client devices to participate in training.
train_data = [make_client_data(n) for n in range(5)]

# Wrap a Keras logistic regression model (a single softmax layer) for use with TFF.
keras_model = tf.keras.models.Sequential([
  tf.keras.layers.Dense(
    2, tf.nn.softmax, input_shape=(31,), kernel_initializer='zeros')
])

tff_model = tff.learning.models.functional_model_from_keras(
      keras_model,
      loss_fn=tf.keras.losses.SparseCategoricalCrossentropy(),
      input_spec=train_data[0].element_spec,
      metrics_constructor=collections.OrderedDict(
        accuracy=tf.keras.metrics.SparseCategoricalAccuracy))

# Simulate a few rounds of training with the selected client devices.
trainer = tff.learning.algorithms.build_weighted_fed_avg(
  tff_model,
  client_optimizer_fn=tff.learning.optimizers.build_sgdm(learning_rate=0.1))
state = trainer.initialize()
for _ in range(5):
  result = trainer.next(state, train_data)
  state = result.state
  metrics = result.metrics
  print(metrics['client_work']['train']['accuracy'])

0.46079183
0.3706258
0.3706258
0.3706258
0.3706258


### CNN

In [112]:
import collections
import tensorflow as tf
import tensorflow_federated as tff

# Load simulation data.
source = client_data_obj
def make_client_data(n):
  return source.create_tf_dataset_for_client(source.client_ids[n]).repeat(10).batch(20)

# Pick a subset of client devices to participate in training.
train_data = [make_client_data(n) for n in range(5)]

# Wrap a Keras CNN model for use with TFF.
keras_model = tf.keras.models.Sequential([
    tf.keras.layers.Conv1D(16, 3, activation='relu', input_shape=(31,1)),
    tf.keras.layers.MaxPooling1D(2),
    tf.keras.layers.Conv1D(8, 3, activation='relu'),
    tf.keras.layers.MaxPooling1D(2),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(units=2, activation='softmax')
])
tff_model = tff.learning.models.functional_model_from_keras(
      keras_model,
      loss_fn=tf.keras.losses.SparseCategoricalCrossentropy(),
      input_spec=train_data[0].element_spec,
      metrics_constructor=collections.OrderedDict(
        accuracy=tf.keras.metrics.SparseCategoricalAccuracy))

# Simulate a few rounds of training with the selected client devices.
trainer = tff.learning.algorithms.build_weighted_fed_avg(
  tff_model,
  client_optimizer_fn=tff.learning.optimizers.build_sgdm(learning_rate=0.1))
state = trainer.initialize()
for _ in range(5):
  result = trainer.next(state, train_data)
  state = result.state
  metrics = result.metrics
  print(metrics['client_work']['train']['accuracy'])

0.5634738
0.5634738
0.5634738
0.5634738
0.5634738


### Conclusions

The CNN reaches a higher final accuracy (0.56) than Logistic Regression (0.37). Even though weight updates will be more costly at a larger scale, using a CNN instead of Logistic Regression would bring a significant gain in accuracy.

However, the accuracy of the CNN is still low, and in both runs the accuracy stops changing after at most one round, which suggests that the models do not improve with further rounds. The patterns in the data may be too complex for these models, but the models must also stay simple enough to keep weight updates manageable. Further work is needed to find the right trade-off between accuracy and model complexity.
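
To put numbers on the complexity side of this trade-off, the following sketch counts the trainable parameters of the two architectures used above. It assumes the Keras defaults (stride 1, 'valid' padding, bias terms in every layer); the helper functions are written for this illustration:

```python
def dense_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out

def conv1d_params(kernel, in_ch, out_ch):
    """Weights plus biases of a 1D convolution."""
    return kernel * in_ch * out_ch + out_ch

def conv_len(length, kernel):
    """Output length of a 'valid' convolution with stride 1."""
    return length - kernel + 1

# Logistic regression: one Dense(2) layer on 31 features.
logreg_total = dense_params(31, 2)

# CNN: Conv1D(16, 3) -> MaxPooling1D(2) -> Conv1D(8, 3) -> MaxPooling1D(2) -> Dense(2).
length = conv_len(31, 3) // 2          # 29 -> 14 after pooling
length = conv_len(length, 3) // 2      # 12 -> 6 after pooling
cnn_total = (
    conv1d_params(3, 1, 16)
    + conv1d_params(3, 16, 8)
    + dense_params(length * 8, 2)
)
print(logreg_total, cnn_total)
```

Under these assumptions, Logistic Regression has 64 parameters and the CNN has 554, so each client would exchange roughly nine times more values per round with the CNN.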