<a href="https://colab.research.google.com/github/tristan-allard/l1-ise-public/blob/main/notebooks/notebook2-attack.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# L1 INFO - Introduction to (Data) Security - Project 2023/2024

__Instructors:__ Tristan Allard, Mathieu Gestin, Mathieu Goessens

This project stems from a workshop held in October 2021 as part of the Rudi project (https://blog.rudi.bzh/) by Tristan Allard and Javier Rojas Balderrama.

_Univ Rennes, CNRS, IRISA, INRIA_
  
This work is licensed under a [Creative Commons Zero v1.0 Universal License](https://creativecommons.org/publicdomain/zero/1.0/)

## Acknowledgments

We warmly thank François Bodin and Luc Lesoil for their support on the data and the definition of the use-case.

# Notebook __TWO__: The case for privacy -- Part 1

## Step 0 (STARTER)

Yes, raw data is not immune to re-identification! 

You are now going to perform a re-identification attack on a small set of targets. To this end, we will give you some auxiliary information (also called background knowledge) and programming tools to help you query the dataset.
1. You can display the bus validations dataset [here](#displayvalid). Feel free to play with the filter menu, although the number of rows shown is limited.
2. You can attack the dataset in [Step 1](#attack) (do not be afraid to try!).
3. To better understand your attacks and/or design other attacks, you can display informative measures about the _identifying power_ of the attributes of the dataset ([Step 2](#explain)).

## Settings and data


 ### Download dataset


In [None]:
!wget -nv -nc https://zenodo.org/record/5509268/files/buses.parquet

### Import required modules

In [None]:
import copy
import importlib.util  # also binds `importlib`; `find_spec` lives in the `util` submodule
import os
from errno import ENOENT
from pathlib import Path
from typing import Optional, Sequence, Tuple, Union

import folium
import numpy as np
import pandas as pd
import plotly.express as px
import plotly.io as pio
import pyarrow.parquet as pq
from folium.plugins import HeatMapWithTime
from IPython import display, get_ipython
from pandas import NA, DataFrame, Series, Timestamp
from plotly.graph_objs import Figure, Scatter

### Setup notebook constants and running environment

In [None]:
# project base directory
BASE_DIRECTORY = Path(".")

# detect running environment
COLAB_ON = "google.colab" in str(get_ipython())

In [None]:
# Set Plotly renderer
if COLAB_ON:
    pio.renderers.default = "colab"

### Load and display raw dataset

In [None]:
# load dataset from file system
def load_data(
    path: Path,
) -> DataFrame:
    if not path.exists():
        raise FileNotFoundError(ENOENT, os.strerror(ENOENT), path)

    table = pq.read_table(path)
    return table.to_pandas()


# show a dataframe as a table
def display_dataframe(
    dataframe: DataFrame,
) -> None:
    if COLAB_ON:
        spec = importlib.util.find_spec("google.colab")
        if spec:
            data_table = importlib.import_module("google.colab.data_table")
            enable_dataframe_formatter = getattr(
                data_table,
                "enable_dataframe_formatter",
            )

            enable_dataframe_formatter()

    display.display(dataframe[:20000] if COLAB_ON else dataframe)

#### Show raw dataset

<a id="displayvalid"></a>

In [None]:
path = BASE_DIRECTORY.joinpath("buses.parquet")
buses_dataset = load_data(path)
display_dataframe(buses_dataset)

####################
# BEGIN : Observe

In [None]:
# show dataset on a map
def plot_heatmap(
    dataframe: DataFrame,
    group_column: str = "departure_time",
    # Rennes GPS coordinates
    location: Tuple[float, float] = (48.1147, -1.6794),
) -> None:
    _dataframe = dataframe.copy(deep=True)
    timestamps = []
    coordinates = []
    for timestamp, coordinate in _dataframe.groupby(group_column):
        timestamps.append(str(timestamp))
        coordinates.append(
            coordinate[
                [
                    "stop_lat",
                    "stop_lon",
                ]
            ].values.tolist()
        )

    base_map = folium.Map(
        location=location,
        zoom_start=11,
        tiles="https://{s}.basemaps.cartocdn.com/light_all/{z}/{x}/{y}{r}.png",
        # tiles="https://{s}.basemaps.cartocdn.com/dark_nolabels/{z}/{x}/{y}{r}.png",
        attr="CartoDB",
    )

    heat_map = HeatMapWithTime(
        data=coordinates,
        index=timestamps,
        auto_play=True,
        min_speed=1,
        radius=4,
        max_opacity=0.5,
    )

    heat_map.add_to(base_map)
    display.display(base_map)

In [None]:
# Showing the heat map of validations only works on a local server
if not COLAB_ON:
    plot_heatmap(buses_dataset)

In [None]:
# END : Observe
####################

## Step 1: Attack raw buses validations
<a id="attack"></a>

Re-identification attacks are conceptually simple. They consist of selecting the subset of individuals whose records match the auxiliary information that the attacker has about them. If a single individual matches the adversarial knowledge, the attack clearly succeeds (assuming that the adversarial knowledge is reliable). Otherwise, success is less clear-cut. But when more than one individual matches the adversarial knowledge, is it really a failure?
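
The selection described above can be sketched on a toy table. This is only an illustration under stated assumptions: the data, the ids and the `candidates` helper are hypothetical and are not part of the notebook's tooling.

```python
import pandas as pd

# Hypothetical toy validations; ids, stops and directions are made up.
validations = pd.DataFrame(
    {
        "id": [1, 1, 2, 3, 3, 4],
        "stop_name": ["A", "B", "A", "C", "A", "B"],
        "direction_id": [0, 1, 1, 0, 0, 0],
    }
)


def candidates(dataframe, **known):
    """Return the sorted ids having at least one record that matches
    every known attribute value (the attacker's auxiliary information)."""
    mask = pd.Series(True, index=dataframe.index)
    for column, value in known.items():
        mask &= dataframe[column] == value
    return sorted(dataframe.loc[mask, "id"].unique().tolist())


# Weak knowledge: "the target boards at stop A"
weak = candidates(validations, stop_name="A")

# Stronger knowledge: "the target boards at stop A, in direction 1"
strong = candidates(validations, stop_name="A", direction_id=1)

print(weak)    # several candidates remain
print(strong)  # a single candidate remains
```

With the weaker knowledge, three of the four toy users still match, so the target is not singled out, yet the attacker has already ruled out one user. Adding one more attribute leaves a single candidate.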

Let's have a look at an [example](#attackexample).

In [None]:
# drop geospatial attributes from dataset
def tidy_dataframe(
    dataframe: DataFrame,
) -> DataFrame:
    dataframe_ = dataframe.copy()
    return dataframe_[
        [
            "departure_time",
            "id",
            "stop_name",
            "route_short_name",
            "stop_id",
            "direction_id",
        ]
    ]


# query the dataset by attribute and value
def query(
    dataframe: DataFrame,
    name: str,
    value: Union[str, int, float, Sequence[str]],
) -> DataFrame:
    return (
        dataframe.query(f"{name} == {value}")
        if isinstance(value, (int, float))
        else dataframe.query(f'''{name} == "{value}"''')
        if isinstance(value, str)
        else dataframe.query(f"{name} in {value}")
    )


# filter dataset between two timestamps
def between(
    dataframe: DataFrame,
    start: Union[str, Timestamp],
    end: Union[str, Timestamp],
    complement: bool = False,
) -> DataFrame:
    start_ = Timestamp(start) if not isinstance(start, Timestamp) else start
    end_ = Timestamp(end) if not isinstance(end, Timestamp) else end
    return (
        (dataframe.set_index("departure_time").loc[start_:end_].reset_index())
        if not complement
        else (
            dataframe.loc[
                (dataframe["departure_time"] < start_)
                | (dataframe["departure_time"] > end_)
            ]
        )
    )


# intersect two datasets with a common attribute ('on')
def intersect(
    right: DataFrame,
    left: DataFrame,
    on: Optional[Sequence[str]] = None,
    how: str = "inner",
) -> Optional[DataFrame]:
    on_ = on if on else right.columns.values.tolist()
    return pd.merge(
        right,
        left,
        how=how,
        on=on_,
    )  # if set(rvalues) == set(lvalues) else None


# get distinct rows from a dataset grouping by a 'subset'
def distinct(
    dataframe: DataFrame,
    subset: Union[str, Sequence[str]],
) -> DataFrame:
    return dataframe.drop_duplicates(subset=subset)


# count rows by name and value
def count_by(
    dataframe: DataFrame,
    name: str,
    value: Union[str, int, float],
    *,
    frequency: str = "15min",
) -> DataFrame:
    dataframe_ = (
        dataframe[dataframe[name] == value]
        .set_index("departure_time")
        .groupby(
            [
                pd.Grouper(level="departure_time", freq=frequency),
            ]
        )
        .count()
    )

    # #domain = pd.date_range(start=dataframe_.index.min(), end=dataframe_.index.max(), freq="15T")
    # #dataframe_ = dataframe_.reindex(domain, method=None, fill_value=NA)
    # #dataframe_.replace(0, np.NAN, inplace=True)
    # #display_dataframe(dataframe_)
    return dataframe_[dataframe_.columns[0]].to_frame(name="count")


# show a timeseries graph of a selected attribute
def plot_dataset(
    dataframe: DataFrame,
    column: str,
) -> None:
    figure = Figure()
    scatter = Scatter(
        x=dataframe.index,
        y=dataframe[column],
        mode="lines",
        name="values",
        connectgaps=False,
    )

    figure.add_trace(scatter)
    figure.update_layout(
        showlegend=False,
        title_text=column,
        template="simple_white",
    )

    figure.update_xaxes(showgrid=True)
    figure.show()

### Example of a re-identification attack
<a id='attackexample'></a>

Somebody said:

> "*I often take the bus in the morning to go to Beaulieu from the 'Anne de Bretagne' bus stop in Cesson* "

Is this information enough to discover the mobility patterns of that person?

Here is a short summary of the helper functions implemented to perform the attack. Refer to the example below for their usage (or, if you feel comfortable, use the **pandas** API directly):

- `query`:  perform a query on the dataset by attribute name and value.
- `between`:  filter dataset between two timestamps.
- `intersect`: "intersect" two datasets with a common attribute (the 'on' attribute). Please note that we call it "intersection" but it is actually a "join" operation (in database language).
- `distinct`: get distinct rows from a dataset grouping by a 'subset'.
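
For readers who prefer plain pandas, the four helpers can be approximated as follows. This is a minimal sketch on a hypothetical toy table (made-up ids, stop names and dates), not the notebook's own implementation.

```python
import pandas as pd

# Hypothetical toy data with some of the columns kept by `tidy_dataframe`.
toy = pd.DataFrame(
    {
        "departure_time": pd.to_datetime(
            [
                "2021-09-01 07:00",
                "2021-09-01 18:00",
                "2021-09-02 07:05",
                "2021-09-02 07:10",
            ]
        ),
        "id": [10, 10, 11, 12],
        "stop_name": ["Stop A", "Stop B", "Stop A", "Stop C"],
        "direction_id": [0, 1, 1, 0],
    }
)

# `query`: keep the rows where an attribute equals a value
at_stop = toy[toy["stop_name"] == "Stop A"]

# `between`: keep the rows whose timestamp lies in [start, end]
start = pd.Timestamp("2021-09-01 00:00:00")
end = pd.Timestamp("2021-09-01 23:59:59")
first_day = toy[toy["departure_time"].between(start, end)]

# `intersect`: an inner join on 'id'; every matching pair of rows is kept,
# so a user can appear several times in the result
joined = pd.merge(at_stop, first_day, on=["id"], how="inner")

# `distinct`: keep one row per user
users = joined.drop_duplicates(subset=["id"])

print(len(joined), users["id"].tolist())
```

Because the join pairs every row of the first result with every row of the second that shares the same `id`, the joined table usually contains repeated users. That is why a `distinct` step is needed before counting how many individuals match.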

Have a look at the code below that implements this attack. You can also [go straight to your targets](#attacktargets).

In [None]:
####################
# BEGIN : Observe

# remove geo-spatial information from the dataset
dataset = tidy_dataframe(buses_dataset)

# show the dataset
print("Initial dataset")
display_dataframe(dataset)

# query: "I take the bus from the bus stop 'Anne de Bretagne'"
q_1 = query(dataset, "stop_name", "Anne de Bretagne")

# query: "I take the bus going to Beaulieu (city center)"
q_2 = query(dataset, "direction_id", 0)

# intersect results of 'q_1' and 'q_2'
q_3 = intersect(q_1, q_2, on=["id"])

# show the result of the intersection stored in 'q_3'
print("Result of the intersection of queries 1 & 2")
display_dataframe(q_3)

# check how many different users are in query 'q_3'
q_4 = distinct(q_3, ["id"])

# show results of query 'q_4'
# => since there is only one row, we found the user!
print("Result of checking different `id` in previous result")
display_dataframe(q_4)

# query: all travels of the user ('id') of query 'q_4'
q_5 = query(dataset, "id", 175)

# show results of query 'q_5'
print("Complete dataset of the user with `id` 175")
display_dataframe(q_5)

# get the trip count of the user ('id') found in query 'q_4' on a timeline
q_6 = count_by(dataset, "id", 175)

# plot results of query 'q_6'
plot_dataset(q_6, "count")

# for the curious:
# a compact plain-pandas variant of the attack follows; note that it requires
# both conditions on the same validation, whereas 'intersect' joins on 'id'
# (results are not printed on screen)
target = dataset.query(
    "stop_name == 'Anne de Bretagne' & direction_id == 0"
).drop_duplicates(
    subset=[
        "id",
        "stop_name",
    ],
)

# END : Observe
####################

### Food for thought
<a id='attacktargets'></a>

Below is the auxiliary information that you have on different targets. Can you re-identify them, and find all their bus validations, based on the available dataset?

```
####################
# BEGIN : Answer
```

> - Target 1: *I remember that on __Sept the 6th 2021__ I had to take the bus very early, __before 6:30AM__. I live close to the __'Cesson Collège'__ bus stop, yes.*
> - Target 2: *I usually take the bus from __'Saint-Sulpice'__, the __'Zone d'Activités'__. But during the __Autumn 2021 holidays__ I stayed at my parents' home, close to __the lakes__ there in __Cesson__. I took the bus __'217'__ a couple of times at that time. I think it is not the __'217'__ line anymore today but the __'216'__ line. I love taking the bus* <3 .
> - Target 3: *I live __Blvd Villebois Mareuil__ below the river. I __do not like the big avenue__ going from West to East with crowded buses so I usually walk a bit in order to take the bus at a __quieter place__. By the way, I often take the bus from the __RU in Beaulieu__.*

```
# END : Answer
####################
```

__Warning:__ There might be discrepancies with the current public transportation map. You can use the [global map](https://www.star.fr/se-deplacer/plans-du-reseau) for spotting the right bus stops for Target 2 and Target 3 though. 

In [None]:
####################
# BEGIN : Code

In [None]:
# Target 1
dataset = tidy_dataframe(buses_dataset)

# TODO YOUR code here!

In [None]:
# Target 2
dataset = tidy_dataframe(buses_dataset)

# TODO YOUR code here!

# NOTE: To use 'between' set the start and end dates as strings:
#       result = between(dataset, "2021-08-01", "2021-08-31")

In [None]:
# Target 3
dataset = tidy_dataframe(buses_dataset)

# TODO YOUR code here!

# NOTE: To test several values of an attribute at once, provide a list to query:
#       values = ["Tournebride", "Le Mail", "Maison d'Accueil"]
#       result = query(dataset, "stop_name", values)

In [None]:
# END : Code
####################

Why was this auxiliary information sufficient to enable your attacks? Displaying the anonymity sets, as done in [Step 2](#explain), can provide some explanations.
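
As a rough preview of the idea, and assuming that an anonymity set means the set of users who share the same values on the attributes known to the attacker, its size can be computed with a simple group-by. The toy data, the route names and the `anonymity_set_sizes` helper below are hypothetical; Step 2 may compute its measures differently.

```python
import pandas as pd

# Hypothetical toy validations; ids, stops and routes are made up.
toy = pd.DataFrame(
    {
        "id": [1, 1, 2, 3, 3, 4],
        "stop_name": ["A", "B", "A", "A", "C", "B"],
        "route_short_name": ["R1", "R1", "R2", "R1", "R2", "R1"],
    }
)


def anonymity_set_sizes(dataframe, attributes):
    """Count the distinct users sharing each combination of attribute values."""
    return dataframe.groupby(list(attributes))["id"].nunique()


# Knowing only the stop: some stops are shared by several users
by_stop = anonymity_set_sizes(toy, ["stop_name"])

# Knowing the stop and the route: the sets shrink
by_stop_and_route = anonymity_set_sizes(toy, ["stop_name", "route_short_name"])

print(by_stop)
print(by_stop_and_route)
```

A combination whose anonymity set has size 1 singles out one user. In this toy table, stop "C" alone does so, and combining the stop with the route produces more such combinations.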