recommended changes

sandrasiby committed Aug 25, 2022
1 parent d0e9ec8 commit c951a98

Showing 4 changed files with 46 additions and 95 deletions.
README.md — 37 changes: 32 additions & 5 deletions
@@ -31,13 +31,19 @@ pip install -r requirements.txt

#### Preparing Crawl Data

-To generate the crawl data needed for the pipeline, you need to run a crawl using the installed OpenWPM tool.
+To generate the crawl data needed for the pipeline, you need to run a crawl using the installed OpenWPM tool: first update the script `demo.py` to read in the list of sites you want to visit, then run `demo.py`.

+After you run the demo, a `datadir` folder will be created in your `demo` directory. Inside the folder, you will find two database files to be used in our pipeline: `crawl-db.sqlite` and `content.ldb`.
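As a minimal sketch of the kind of change needed in `demo.py`, the site list could be read from a text file with one URL per line — the `sites.txt` filename and format here are our own illustration, not part of OpenWPM:

```python
# Hypothetical tweak to demo.py: build the list of sites to visit
# from a plain-text file instead of the hard-coded list.
from pathlib import Path

sites = []
for raw_line in Path("sites.txt").read_text().splitlines():
    url = raw_line.strip()
    if url and not url.startswith("#"):  # skip blank lines and comments
        sites.append(url)
```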

-### Pipeline
+### Pipelines

-with WebGraph, we mainly present two tasks that you can run:
+The codebase consists of two pipelines: WebGraph and Robustness. We describe each of them below.

+#### WebGraph Pipeline

+This pipeline runs WebGraph, a graph-based Ad and Tracking Services (ATS) detection system. WebGraph takes in crawl data, builds graph representations of sites, extracts features and labels from these representations, and trains a machine learning model.

+With the WebGraph code, we present two tasks that you can run:

1. Graph Preprocessing and Feature building
2. Classification (training and testing)
@@ -47,7 +53,7 @@ with WebGraph, we mainly present two tasks that you can run:
In this task, WebGraph constructs the dataset for classification by:

- taking your *sqlite* and *leveldb* database files to construct a graph representation of each crawl, as explained in the [paper](https://www.usenix.org/system/files/sec22summer_siby.pdf), and exporting it in tabular format to a `graph.csv` file and a `features.csv` file
-- applying the rules from public *filterlists* to label the nodes in each graph and export it in a tabular format to a `labeled.csv` file
+- applying the rules from public *filterlists* to label the nodes in each graph and exporting them in tabular format to a `labelled.csv` file

To run this task, use the following script:

@@ -64,6 +70,7 @@ python <project-directory>/code/run.py --input-db <location-to-datadir>/datadir/
> - `--out`: the path to the directory of the output `.csv` files.
> - `--mode`: the system to run (webgraph or adgraph).
+Note: With the `--mode` argument, you can also run AdGraph (we evaluate AdGraph in Section 3 of the paper).
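As an end-to-end illustration, a full invocation might look like the command below — a sketch in which the paths are placeholders and the `--ldb` flag name for the leveldb content database is our assumption (see `run.py --help` for the actual interface); `--input-db`, `--out`, and `--mode` are as documented above:

```bash
python code/run.py \
    --input-db datadir/crawl-db.sqlite \
    --ldb datadir/content.ldb \
    --out output/ \
    --mode webgraph
```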

#### 2. Classification

@@ -83,8 +90,14 @@ python <project-directory>/code/classification/classify.py --features features.c
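A classification run over the outputs of the previous task might then look as follows — a sketch in which every flag except `--features` (visible in the hunk header above) is our assumption about the `classify.py` interface (consult `classify.py --help`):

```bash
python code/classification/classify.py \
    --features features.csv \
    --labels labelled.csv \
    --out results/
```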
<hr/>

#### Robustness Pipeline

+This pipeline runs the robustness experiments performed in the paper. There are two types of robustness experiments: content and structure mutations. All the code and READMEs associated with these experiments are in the `robustness` folder.

+### Data Schema

+The output of the WebGraph pipeline consists of three files: `graph.csv`, `features.csv`, and `labelled.csv`.
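For a quick look at these outputs, the files can be loaded and joined with pandas — a minimal sketch, assuming (per the schema below) that `visit_id` and `name` appear in both `features.csv` and `labelled.csv`:

```python
import pandas as pd

graph = pd.read_csv("graph.csv")
features = pd.read_csv("features.csv")
labels = pd.read_csv("labelled.csv")

# Attach labels to feature rows; (visit_id, name) identifies a node,
# per the column descriptions below.
dataset = features.merge(
    labels[["visit_id", "name", "label"]], on=["visit_id", "name"]
)
print(dataset["label"].value_counts())
```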

#### Graph

These are the columns present in the graph output under `graph.csv`:
@@ -119,12 +132,22 @@ The features in `features.csv` used are described in [features.yaml](https://git

#### Labels

-Nodes labeled by either True or False if they are blocked by filter lists or not.
+Nodes are labelled True or False depending on whether or not they are blocked by filter lists. These are the columns present in the `labelled.csv` file.

+| Column | Description |
+| -------------------- | ------------------------------------------------------------ |
+| *visit_id* | The visit id of the crawl |
+| *top_level_url* | The top level URL (page being visited) |
+| *name* | The name of the node |
+| *label* | The label of the node |


<hr/>

+### Code Organization

+The WebGraph pipeline is in the `code` folder. The Robustness pipeline is in the `robustness` folder.

### Paper

**WebGraph: Capturing Advertising and Tracking Information Flows for Robust Blocking**
@@ -154,4 +177,8 @@ If you use the code/data in your research, please cite our work as follows:

In case of questions, please get in touch with [Sandra Siby](https://sandrasiby.github.io/).

+### Acknowledgements

+Thanks to Laurent Girod and Saiid El Hajj Chehade for helping test and improve the code.


code/features.yaml — 95 changes: 12 additions & 83 deletions
@@ -1,12 +1,16 @@
-# Things to check in this file
+# Config file to set features

-# List under selected_features: This is used in feature extraction.
-# AdGraph uses - node, ne, connectivity, url, script_content.
-# WebGraph uses AdGraph + data_flow, indirect_edge, indirect_all_edge, cookie
+# Select the feature set you want by uncommenting the relevant set under features_to_extract

-# List under feature_set. Select only one option
-# adgraph and trackergraph options will only use non-content
-# adgraph_all and trackergraph_all will also use content
+# graph_columns provides the name of the columns in the output graph CSV file
+# Update this list if you add a new graph attribute

+# feature_columns_adgraph and feature_columns provide the name of the columns
+# in the output feature file for AdGraph and WebGraph respectively
+# Note: If running AdGraph, change feature_columns_adgraph to feature_columns
+# (the old feature_columns can be changed to feature_columns_webgraph)
+
+# label_columns provides the name of the columns in the output label file

features_to_extract:
#- content
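To illustrate the uncommenting step described in the comments above, a WebGraph run might enable the groups below — a sketch whose group names are taken from the removed comment lines (AdGraph: `node`, `ne`, `connectivity`, `url`, `script_content`; WebGraph adds `data_flow`, `indirect_edge`, `indirect_all_edge`, `cookie`); check the pipeline code for the exact names it accepts:

```yaml
features_to_extract:
  - node
  - ne
  - connectivity
  - url
  - script_content
  - data_flow
  - indirect_edge
  - indirect_all_edge
  - cookie
```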
@@ -39,81 +43,6 @@ graph_columns:
- post_body
- post_body_raw

-feature_columns:
-- visit_id
-- name
-- content_policy_type
-- url_length
-- is_subdomain
-- is_valid_qs
-- is_third_party
-- base_domain_in_query
-- semicolon_in_query
-- screen_size_present
-- ad_size_present
-- ad_size_in_qs_present
-- keyword_raw_present
-- keyword_char_present
-- num_nodes
-- num_edges
-- nodes_div_by_edges
-- edges_div_by_nodes
-- in_degree
-- out_degree
-- in_out_degree
-- ancestors
-- descendants
-- closeness_centrality
-- average_degree_connectivity
-- eccentricity
-- is_parent_script
-- is_ancestor_script
-- ascendant_has_ad_keyword
-- is_eval_or_function
-- descendant_of_eval_or_function
-- ascendant_script_has_eval_or_function
-- ascendant_script_has_fp_keyword
-- ascendant_script_length
-- num_get_storage
-- num_set_storage
-- num_get_cookie
-- num_set_cookie
-- num_script_predecessors
-- num_script_successors
-- num_requests_sent
-- num_requests_received
-- num_redirects_sent
-- num_redirects_rec
-- max_depth_redirect
-- indirect_in_degree
-- indirect_out_degree
-- indirect_ancestors
-- indirect_descendants
-- indirect_closeness_centrality
-- indirect_average_degree_connectivity
-- indirect_eccentricity
-- indirect_mean_in_weights
-- indirect_min_in_weights
-- indirect_max_in_weights
-- indirect_mean_out_weights
-- indirect_min_out_weights
-- indirect_max_out_weights
-- num_set_get_src
-- num_set_mod_src
-- num_set_url_src
-- num_get_url_src
-- num_set_get_dst
-- num_set_mod_dst
-- num_set_url_dst
-- num_get_url_dst
-- indirect_all_in_degree
-- indirect_all_out_degree
-- indirect_all_ancestors
-- indirect_all_descendants
-- indirect_all_closeness_centrality
-- indirect_all_average_degree_connectivity
-- indirect_all_eccentricity

feature_columns_adgraph:
- visit_id
- name
@@ -146,7 +75,7 @@ feature_columns_adgraph:
- ascendant_script_has_fp_keyword
- ascendant_script_length

-feature_columns_webgraph:
+feature_columns:
- visit_id
- name
- num_nodes
robustness/structure_mutation/README.md — 4 changes: 2 additions & 2 deletions
@@ -25,13 +25,13 @@ The `config.yaml` file contains the following parameters that have to be set:
>
> feature_config: Path to the feature configuration file. Sample file [here](https://github.com/spring-epfl/WebGraph/blob/main/code/features.yaml). We need this for feature extraction.
>
-> vid_file: Path to a JSON file containing a list of visit IDs for which we want to perform the mutation.
+> vid_file: Path to a JSON file containing a list of visit IDs for which we want to perform the mutation. An example is in `sample/chosen_ids.json`.
>
> filterlists: Path to an output folder to which filter lists will be downloaded for labelling.
>
> parent_limit: Number of nodes to use as starting points for mutation. Note: Increasing the parent_limit will increase experiment run time.
>
-> model: Path to a trained model file generated the WebGraph classification process (these are files labelled `model_0.joblib`, `model_1.joblib`, etc.).
+> model: Path to a trained model file generated by the WebGraph classification process (these are files labelled `model_0.joblib`, `model_1.joblib`, etc.). An example is in `sample/model_1.joblib`.
>
> result_dir: Path to the output folder for the results.
>
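Putting these parameters together, an illustrative `config.yaml` might look like the sketch below — it covers only the parameters shown in this excerpt, with placeholder paths and a placeholder `parent_limit` value:

```yaml
# Illustrative values only; adjust paths to your setup.
feature_config: code/features.yaml
vid_file: sample/chosen_ids.json
filterlists: filterlists/
parent_limit: 5
model: sample/model_1.joblib
result_dir: results/
```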
robustness/structure_mutation/sample_vid_file.json — 5 changes: 0 additions & 5 deletions

This file was deleted.
