recommended changes

sandrasiby committed Aug 25, 2022
1 parent d0e9ec8 commit c951a98

Showing 4 changed files with 46 additions and 95 deletions.
README.md — 37 changes: 32 additions & 5 deletions
@@ -31,13 +31,19 @@ pip install -r requirements.txt

#### Preparing Crawl Data

-To generate the crawl data needed for the pipeline, you need to run a crawl using the installed OpenWPM tool.
+To generate the crawl data needed for the pipeline, you need to run a crawl using the installed OpenWPM tool: first update the script `demo.py` to read in the list of sites you want to visit, then run `demo.py`.

+After you run the demo, a `datadir` folder will be created in your `demo` directory. Inside the folder, you will find two database files to be used in our pipeline: `crawl-db.sqlite` and `content.ldb`.
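As a minimal sketch of the kind of change needed in `demo.py`, the site list could be read from a text file with one URL per line — the `sites.txt` filename and format here are our own illustration, not part of OpenWPM:

```python
# Hypothetical tweak to demo.py: build the list of sites to visit
# from a plain-text file instead of the hard-coded list.
from pathlib import Path

sites = []
for raw_line in Path("sites.txt").read_text().splitlines():
    url = raw_line.strip()
    if url and not url.startswith("#"):  # skip blank lines and comments
        sites.append(url)
```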

-### Pipeline
+### Pipelines

-with WebGraph, we mainly present two tasks that you can run:
+The codebase consists of two pipelines: WebGraph and Robustness. We describe each of them below.

+#### WebGraph Pipeline

+This pipeline runs WebGraph, a graph-based Ad and Tracking Services (ATS) detection system. WebGraph takes in crawl data, builds graph representations of sites, extracts features and labels from these representations, and trains a machine learning model.

+With the WebGraph code, we present two tasks that you can run:

1. Graph Preprocessing and Feature building
2. Classification (training and testing)
@@ -47,7 +53,7 @@ with WebGraph, we mainly present two tasks that you can run:
In this task, WebGraph constructs the dataset for classification by:

- taking your *sqlite* and *leveldb* database files to construct a graph representation of each crawl, as explained in the [paper](https://www.usenix.org/system/files/sec22summer_siby.pdf), and exporting it in tabular format to a `graph.csv` file and a `features.csv` file
-- applying the rules from public *filterlists* to label the nodes in each graph and export it in a tabular format to a `labeled.csv` file
+- applying the rules from public *filterlists* to label the nodes in each graph and exporting them in tabular format to a `labelled.csv` file

To run this task, use the following script:

@@ -64,6 +70,7 @@ python <project-directory>/code/run.py --input-db <location-to-datadir>/datadir/
> - `--out`: the path to the directory of the output `.csv` files.
> - `--mode`: the system to run (webgraph or adgraph).
+Note: With the `--mode` argument, you can also run AdGraph (we evaluate AdGraph in Section 3 of the paper).
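As an end-to-end illustration, a full invocation might look like the command below — a sketch in which the paths are placeholders and the `--ldb` flag name for the leveldb content database is our assumption (see `run.py --help` for the actual interface); `--input-db`, `--out`, and `--mode` are as documented above:

```bash
python code/run.py \
    --input-db datadir/crawl-db.sqlite \
    --ldb datadir/content.ldb \
    --out output/ \
    --mode webgraph
```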

#### 2. Classification

@@ -83,8 +90,14 @@ python <project-directory>/code/classification/classify.py --features features.c
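A classification run over the outputs of the previous task might then look as follows — a sketch in which every flag except `--features` (visible in the hunk header above) is our assumption about the `classify.py` interface (consult `classify.py --help`):

```bash
python code/classification/classify.py \
    --features features.csv \
    --labels labelled.csv \
    --out results/
```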
<hr/>

#### Robustness Pipeline

+This pipeline runs the robustness experiments performed in the paper. There are two types of robustness experiments: content and structure mutations. All the code and READMEs associated with these experiments are in the `robustness` folder.

+### Data Schema

+The output of the WebGraph pipeline consists of three files: `graph.csv`, `features.csv`, and `labelled.csv`.
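For a quick look at these outputs, the files can be loaded and joined with pandas — a minimal sketch, assuming (per the schema below) that `visit_id` and `name` appear in both `features.csv` and `labelled.csv`:

```python
import pandas as pd

graph = pd.read_csv("graph.csv")
features = pd.read_csv("features.csv")
labels = pd.read_csv("labelled.csv")

# Attach labels to feature rows; (visit_id, name) identifies a node,
# per the column descriptions below.
dataset = features.merge(
    labels[["visit_id", "name", "label"]], on=["visit_id", "name"]
)
print(dataset["label"].value_counts())
```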

#### Graph

These are the columns present in the graph output under `graph.csv`:
@@ -119,12 +132,22 @@ The features in `features.csv` used are described in [features.yaml](https://git

#### Labels

-Nodes labeled by either True or False if they are blocked by filter lists or not.
+Nodes are labelled True or False depending on whether or not they are blocked by filter lists. These are the columns present in the `labelled.csv` file.

+| Column | Description |
+| -------------------- | ------------------------------------------------------------ |
+| *visit_id* | The visit id of the crawl |
+| *top_level_url* | The top level URL (page being visited) |
+| *name* | The name of the node |
+| *label* | The label of the node |


<hr/>

+### Code Organization

+The WebGraph pipeline is in the `code` folder. The Robustness pipeline is in the `robustness` folder.

### Paper

**WebGraph: Capturing Advertising and Tracking Information Flows for Robust Blocking**
@@ -154,4 +177,8 @@ If you use the code/data in your research, please cite our work as follows:

In case of questions, please get in touch with [Sandra Siby](https://sandrasiby.github.io/).

+### Acknowledgements

+Thanks to Laurent Girod and Saiid El Hajj Chehade for helping test and improve the code.


code/features.yaml — 95 changes: 12 additions & 83 deletions
@@ -1,12 +1,16 @@
-# Things to check in this file
+# Config file to set features

-# List under selected_features: This is used in feature extraction.
-# AdGraph uses - node, ne, connectivity, url, script_content.
-# WebGraph uses AdGraph + data_flow, indirect_edge, indirect_all_edge, cookie
+# Select the feature set you want by uncommenting the relevant set under features_to_extract

-# List under feature_set. Select only one option
-# adgraph and trackergraph options will only use non-content
-# adgraph_all and trackergraph_all will also use content
+# graph_columns provides the name of the columns in the output graph CSV file
+# Update this list if you add a new graph attribute

+# feature_columns_adgraph and feature_columns provide the name of the columns
+# in the output feature file for AdGraph and WebGraph respectively
+# Note: If running AdGraph, change feature_columns_adgraph to feature_columns
+# (the old feature_columns can be changed to feature_columns_webgraph)
+
+# label_columns provides the name of the columns in the output label file

features_to_extract:
#- content
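To illustrate the uncommenting step described in the comments above, a WebGraph run might enable the groups below — a sketch whose group names are taken from the removed comment lines (AdGraph: `node`, `ne`, `connectivity`, `url`, `script_content`; WebGraph adds `data_flow`, `indirect_edge`, `indirect_all_edge`, `cookie`); check the pipeline code for the exact names it accepts:

```yaml
features_to_extract:
  - node
  - ne
  - connectivity
  - url
  - script_content
  - data_flow
  - indirect_edge
  - indirect_all_edge
  - cookie
```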
@@ -39,81 +43,6 @@ graph_columns:
- post_body
- post_body_raw

-feature_columns:
-- visit_id
-- name
-- content_policy_type
-- url_length
-- is_subdomain
-- is_valid_qs
-- is_third_party
-- base_domain_in_query
-- semicolon_in_query
-- screen_size_present
-- ad_size_present
-- ad_size_in_qs_present
-- keyword_raw_present
-- keyword_char_present
-- num_nodes
-- num_edges
-- nodes_div_by_edges
-- edges_div_by_nodes
-- in_degree
-- out_degree
-- in_out_degree
-- ancestors
-- descendants
-- closeness_centrality
-- average_degree_connectivity
-- eccentricity
-- is_parent_script
-- is_ancestor_script
-- ascendant_has_ad_keyword
-- is_eval_or_function
-- descendant_of_eval_or_function
-- ascendant_script_has_eval_or_function
-- ascendant_script_has_fp_keyword
-- ascendant_script_length
-- num_get_storage
-- num_set_storage
-- num_get_cookie
-- num_set_cookie
-- num_script_predecessors
-- num_script_successors
-- num_requests_sent
-- num_requests_received
-- num_redirects_sent
-- num_redirects_rec
-- max_depth_redirect
-- indirect_in_degree
-- indirect_out_degree
-- indirect_ancestors
-- indirect_descendants
-- indirect_closeness_centrality
-- indirect_average_degree_connectivity
-- indirect_eccentricity
-- indirect_mean_in_weights
-- indirect_min_in_weights
-- indirect_max_in_weights
-- indirect_mean_out_weights
-- indirect_min_out_weights
-- indirect_max_out_weights
-- num_set_get_src
-- num_set_mod_src
-- num_set_url_src
-- num_get_url_src
-- num_set_get_dst
-- num_set_mod_dst
-- num_set_url_dst
-- num_get_url_dst
-- indirect_all_in_degree
-- indirect_all_out_degree
-- indirect_all_ancestors
-- indirect_all_descendants
-- indirect_all_closeness_centrality
-- indirect_all_average_degree_connectivity
-- indirect_all_eccentricity

feature_columns_adgraph:
- visit_id
- name
@@ -146,7 +75,7 @@ feature_columns_adgraph:
- ascendant_script_has_fp_keyword
- ascendant_script_length

-feature_columns_webgraph:
+feature_columns:
- visit_id
- name
- num_nodes
robustness/structure_mutation/README.md — 4 changes: 2 additions & 2 deletions
@@ -25,13 +25,13 @@ The `config.yaml` file contains the following parameters that have to be set:
>
> feature_config: Path to the feature configuration file. Sample file [here](https://github.com/spring-epfl/WebGraph/blob/main/code/features.yaml). We need this for feature extraction.
>
-> vid_file: Path to a JSON file containing a list of visit IDs for which we want to perform the mutation.
+> vid_file: Path to a JSON file containing a list of visit IDs for which we want to perform the mutation. An example is in `sample/chosen_ids.json`.
>
> filterlists: Path to an output folder to which filter lists will be downloaded for labelling.
>
> parent_limit: Number of nodes to use as starting points for mutation. Note: Increasing the parent_limit will increase experiment run time.
>
-> model: Path to a trained model file generated the WebGraph classification process (these are files labelled `model_0.joblib`, `model_1.joblib`, etc.).
+> model: Path to a trained model file generated by the WebGraph classification process (these are files labelled `model_0.joblib`, `model_1.joblib`, etc.). An example is in `sample/model_1.joblib`.
>
> result_dir: Path to the output folder for the results.
>
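Putting these parameters together, an illustrative `config.yaml` might look like the sketch below — it covers only the parameters shown in this excerpt, with placeholder paths and a placeholder `parent_limit` value:

```yaml
# Illustrative values only; adjust paths to your setup.
feature_config: code/features.yaml
vid_file: sample/chosen_ids.json
filterlists: filterlists/
parent_limit: 5
model: sample/model_1.joblib
result_dir: results/
```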
robustness/structure_mutation/sample_vid_file.json — 5 changes: 0 additions & 5 deletions

This file was deleted.
