# Task 1 Perspectives on Python-oriented SBOM Generation Tools
Task 1 analysed current SBOM generation practices in the Python ecosystem and identified a fundamental limitation: existing tools rely primarily on metadata files or semi-dynamic installation processes, which leads to incomplete dependency discovery and inconsistent results. Through a critical synthesis of recent academic work, that first part motivated a new SBOM generation approach that shifts the focus from metadata-centric analysis to source-code-level dependency discovery.

This part is dedicated to the implementation of our novel approach. By using Python’s Abstract Syntax Tree (AST) to extract imports directly from source code and construct a dependency graph, we complement existing methods and seek to achieve 100% completeness.
The first section covers the selection of appropriate datasets and explains how and why a "merged" dataset was created.

## Nota bene


## Datasets and setup 
In the subsequent task, we examined and selected two datasets that allow us to draw a comparison between existing tools and our new approach. Both datasets contain the source code of the different packages to analyse. Due to their sheer size (multiple GBs), the datasets are not included in this file directly and can be consulted on the [project's GitHub repository](https://github.com/rtafurthgarcia/COM713).
Our datasets (`\dataset1` and `\dataset2`) share the same structure:
- `\packages` contains the packages to analyse and to generate SBOMs from,
- `\sbom` contains the SBOMs generated by each tool for each package, and serves as the source for comparison.
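
As a rough illustration of this shared layout, the sketch below lists what sits under `packages` and `sbom` for a given dataset root. The helper name `list_dataset` is ours and purely illustrative; how the entries inside `sbom` are organised per tool is not assumed here, so the function only returns the entry names it finds.

```python
from pathlib import Path


def list_dataset(root: str) -> dict[str, list[str]]:
    """Return the sorted entry names found under `packages` and `sbom`."""
    base = Path(root)
    layout = {}
    for sub in ("packages", "sbom"):
        folder = base / sub
        # A missing folder yields an empty list instead of raising.
        if folder.is_dir():
            layout[sub] = sorted(entry.name for entry in folder.iterdir())
        else:
            layout[sub] = []
    return layout
```

Calling `list_dataset("dataset1")` and `list_dataset("dataset2")` gives a quick check that both datasets expose the same two folders before any analysis is run.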

### Dataset n°1
Dataset n°1 (ds1) is a copy of the dataset from Cofano et al. Dependencies are read from both `requirements.txt` and `pyproject.toml`, and will be deduced from the contents of `\packages`. This dataset contains no ground truth, so only `\sbom` can be used to draw a comparison between our new approach and the other tools.
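
To make the metadata side of this comparison concrete, here is a minimal sketch of how declared dependency names could be pulled out of the text of a `requirements.txt` file. It is not the method used by Cofano et al. or by any of the compared tools: the function name `declared_names` is hypothetical, and the parsing is deliberately simplified (option lines such as `-r` or `-e` are skipped, and version specifiers, extras and markers are discarded).

```python
import re

# Matches the distribution name at the start of a requirement line.
_NAME = re.compile(r"^[A-Za-z0-9][A-Za-z0-9._-]*")


def declared_names(requirements_text: str) -> set[str]:
    """Return the normalised package names declared in a requirements file."""
    names = set()
    for raw in requirements_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        if not line or line.startswith("-"):  # skip blanks and pip options
            continue
        match = _NAME.match(line)
        if match:
            # Normalise as PEP 503 does: runs of -, _ and . become a single -
            names.add(re.sub(r"[-_.]+", "-", match.group(0)).lower())
    return names
```

The `dependencies` list of a PEP 621 `[project]` table in `pyproject.toml` uses the same requirement-string syntax, so its entries could be joined with newlines and passed through the same function once the file has been loaded (for instance with `tomllib`, available from Python 3.11).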

### Dataset n°2
Dataset n°2 (ds2) is a copy of the dataset from Jia et al. Its `\deptree_gt` folder contains the ground truth as one JSON file per package.
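
With a ground truth available, the dependencies found by a tool can be scored against it. The sketch below is one possible way to do so and is an assumption on our part, not a procedure taken from Jia et al.: since the schema of the JSON files in `\deptree_gt` is not described here, the hypothetical helper `completeness` takes two sets of already-extracted dependency names and reports precision and recall, recall being the share of ground-truth dependencies that were found.

```python
def completeness(found: set[str], ground_truth: set[str]) -> dict[str, float]:
    """Score a set of discovered dependencies against the ground truth."""
    true_positives = len(found & ground_truth)
    # Guard against empty sets to avoid a division by zero.
    precision = true_positives / len(found) if found else 0.0
    recall = true_positives / len(ground_truth) if ground_truth else 0.0
    return {"precision": precision, "recall": recall}
```

A recall of 1.0 would correspond to the 100% completeness target, while precision shows how many of the reported dependencies are actually part of the ground truth.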

### Merge
These two datasets have been parsed and merged externally, because the process required a CycloneDX library that could not be attached to this project. If you are curious about how the merge works, the `merge.py` file shows how it was done.

## Basic AST parsing
Here, we parse the Python source files of each package into ASTs using the standard `ast` module.

In [None]:
import ast
import os

# Root folder of the packages in dataset n°2
ds2 = os.path.join(".", "dataset2", "packages")


def parse_source(source_path: str):
    """Yield one AST per Python source file found at `source_path`."""
    def _parse(file_path: str):
        # Read-only binary mode: ast.parse accepts bytes and handles the encoding itself
        with open(file=file_path, mode="rb") as source_file:
            return ast.parse(source=source_file.read(), filename=file_path)

    if os.path.isfile(source_path):
        yield _parse(source_path)
    elif os.path.isdir(source_path):
        # We do not descend into subdirectories and only analyse .py files
        for file_name in os.listdir(source_path):
            file_path = os.path.join(source_path, file_name)
            if os.path.isfile(file_path) and file_name.endswith(".py"):
                yield _parse(file_path)

def analyse_package(source_path: str):
    for tree in parse_source(source_path):
        print(tree)

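# NOTE: analysis_ds2 is not defined in this cell. It is presumably the merged
# dataset loaded elsewhere, with a "source" path for each entry of "packages".
for package in analysis_ds2["packages"]:
    analyse_package(source_path=package["source"])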
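
Printing the trees only confirms that parsing works. As a minimal sketch of the next step, the function below walks one tree, such as those yielded by `parse_source`, and collects the top-level module names it imports. The name `top_level_imports` is illustrative and not part of the project's code; relative imports are skipped on the assumption that they point inside the package itself rather than at an external dependency.

```python
import ast


def top_level_imports(tree: ast.AST) -> set[str]:
    """Collect the top-level module names imported anywhere in `tree`."""
    modules = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            for alias in node.names:
                modules.add(alias.name.split(".")[0])
        elif isinstance(node, ast.ImportFrom):
            # level > 0 marks a relative import (e.g. `from . import x`)
            if node.level == 0 and node.module:
                modules.add(node.module.split(".")[0])
    return modules


example = ast.parse(
    "import os.path\n"
    "from requests.adapters import HTTPAdapter\n"
    "from . import utils\n"
)
print(sorted(top_level_imports(example)))  # ['os', 'requests']
```

Note that an imported module name is not always the name of the distribution that provides it (for example, `yaml` is provided by `PyYAML`), and standard-library modules such as `os` still have to be filtered out before the result can be treated as a list of dependencies.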