# Task 1 Perspectives on Python-oriented SBOM Generation Tools
Task 1 analysed current SBOM generation practices in the Python ecosystem and identified a fundamental limitation: existing tools rely primarily on metadata files or semi-dynamic installation processes, leading to incomplete dependency discovery and inconsistent results. Through critical synthesis of recent academic work, the first part motivates a new SBOM generation approach that shifts the focus from metadata-centric analysis to source-code–level dependency discovery.
This part is now dedicated to the implementation of our novel approach. Using Python’s Abstract Syntax Tree (AST) to extract imports directly from source code and construct a dependency graph, we complement existing methods and seek to achieve 100% completeness. 
The "Datasets and setup" section presents the project's chosen datasets and explains why and how they were merged. The "Data structures" section discusses the choice of a data structure suited to graph problems.
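
As an illustration of the AST-based import extraction described above, here is a minimal sketch. The class `ImportCollector` and the helper `collect_imports` are hypothetical names chosen for this example, not the project's final implementation; only `ast.parse` and `ast.NodeVisitor` come from the standard library.

```python
import ast


class ImportCollector(ast.NodeVisitor):
    """Collects the top-level module names found in import statements."""

    def __init__(self) -> None:
        self.modules: set[str] = set()

    def visit_Import(self, node: ast.Import) -> None:
        # `import a.b, c` records "a" and "c"
        for alias in node.names:
            self.modules.add(alias.name.split(".")[0])

    def visit_ImportFrom(self, node: ast.ImportFrom) -> None:
        # `from a.b import c` records "a"; relative imports (level > 0)
        # point inside the package itself and are skipped here
        if node.level == 0 and node.module:
            self.modules.add(node.module.split(".")[0])


def collect_imports(source_code: str) -> set[str]:
    """Parse the source and return the set of imported top-level modules."""
    collector = ImportCollector()
    collector.visit(ast.parse(source_code))
    return collector.modules


sample = """
import os.path
import json, requests
from collections import OrderedDict
from . import sibling

def lazy():
    import yaml
"""
print(sorted(collect_imports(sample)))
```

Imports nested inside functions are found as well, because `NodeVisitor` descends into every node it has no dedicated handler for. This sketch does not separate standard-library modules from third-party packages, nor does it map import names to distribution names; both would be further steps.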

## Nota bene
In the first task, under "Comparison and metrics", we suggested that correctness would also be assessed. This was a mistake; the focus of this project remains exclusively on completeness, that is, capturing all dependencies.

## 3 Data structures 
One could naively conceive the data structure that defines how packages relate to one another as a tree, where each package may have child packages. In fact, the literature often refers to it as a dependency tree. However, this term misrepresents the true nature of software dependencies: packages rarely have a single parent and may even form (problematic) cycles, per Tellnes' own introduction of the problem [1]. Thus, in professional software development, dependency graphs are the norm. For this reason, the core data structure featured in this document is shaped by graph theory. Goodrich et al. [2] provide us with four possible data structures:
1. edge list
2. adjacency list
3. adjacency map
4. adjacency matrix 

Task n°1 concluded that the adjacency list had to be used because `ast.NodeVisitor` depended on it. It turned out, however, that `ast.NodeVisitor` does not expect any specific data structure, which left us free to choose any of the four. One could pick the edge list on a whim: it is easy to implement and has the best running times for the three operations we use to construct the graph and compare it against others (`vertices()` in O(n), `insert_vertex()` and `insert_edge()` in O(1)), as described in [2, p. 627]. Taken alone, that reasoning would be a mistake: some projects contain cycles, so we must be able to avoid duplicates, which means checking whether an entry already exists before adding it. Replacing the lists with sets would not help in the worst case, since both the lookup and the insertion needed for that check are then O(n) too [3]. An adjacency map would therefore have been the best choice, since its getter `get_edge()` is O(1). However, our dataset also needs to be exported and merged, and serialisation is a major challenge for behaviour-heavy objects like graphs. Our attempts at this went poorly, as visible in our ChatGPT transcript [4]. Hence we reverted to the first option, the edge list, despite its "poor" worst-case performance. We follow Goodrich et al.'s edge list implementation [2].
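
To make the trade-off concrete, the following minimal sketch contrasts the duplicate check on an edge list with the same check on an adjacency map. It uses plain tuples and dictionaries rather than Goodrich et al.'s classes, and the function names `edge_list_insert` and `adjacency_map_insert` are hypothetical.

```python
# Edge list: one flat list of (origin, destination) pairs.
edge_list: list[tuple[str, str]] = []


def edge_list_insert(origin: str, destination: str) -> bool:
    """Append the edge unless it is already present.

    The `in` test scans the whole list, so the check is linear in the
    number of edges.
    """
    edge = (origin, destination)
    if edge in edge_list:
        return False
    edge_list.append(edge)
    return True


# Adjacency map: each vertex maps to a dict of its outgoing neighbours.
adjacency_map: dict[str, dict[str, None]] = {}


def adjacency_map_insert(origin: str, destination: str) -> bool:
    """Add the edge unless it is already present.

    Both dictionary lookups are hash-based, so the check does not
    depend on the number of edges in the average case.
    """
    neighbours = adjacency_map.setdefault(origin, {})
    adjacency_map.setdefault(destination, {})
    if destination in neighbours:
        return False
    neighbours[destination] = None
    return True


# A two-package cycle (a -> b -> a), with the first edge submitted twice.
for origin, destination in [("a", "b"), ("b", "a"), ("a", "b")]:
    edge_list_insert(origin, destination)
    adjacency_map_insert(origin, destination)

print(edge_list)      # [('a', 'b'), ('b', 'a')]
print(adjacency_map)  # {'a': {'b': None}, 'b': {'a': None}}
```

The edge list stays a flat sequence of pairs, which is what makes it straightforward to export, while the nested dictionaries of the adjacency map need a dedicated conversion step.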

In [None]:
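from dataclasses import dataclass, asdict, field

@dataclass(eq=True, frozen=True)
class Package:
    """A vertex of the graph: one package, identified by its name."""
    name: str

@dataclass(eq=True, frozen=True)
class ImportStatement:
    """A directed edge: `imports` depends on `imported`."""
    imports: Package
    imported: Package

@dataclass(eq=True, frozen=True)
class DependencyGraph:
    """Edge-list graph: one list of vertices and one list of edges."""
    # frozen=True only prevents rebinding these attributes; the lists stay mutable
    packages: list[Package] = field(default_factory=list)
    import_statements: list[ImportStatement] = field(default_factory=list)

    def insert_package(self, package_name: str) -> Package:
        new_package = Package(package_name)
        # linear scan so that a package is never stored twice
        if new_package not in self.packages:
            self.packages.append(new_package)

        return new_package

    def insert_importstatement(self, imports: Package, imported: Package) -> ImportStatement:
        new_importstatement = ImportStatement(imports, imported)
        # linear scan so that an edge is never stored twice (e.g. in cycles)
        if new_importstatement not in self.import_statements:
            self.import_statements.append(new_importstatement)

        return new_importstatement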
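
Since serialisation motivated the choice of the edge list, here is a minimal sketch of a JSON round trip built on `dataclasses.asdict`. The classes are repeated in reduced form so that the block runs on its own; the helpers `graph_to_json` and `graph_from_json` are hypothetical and not part of the project's code.

```python
import json
from dataclasses import dataclass, asdict, field


@dataclass(eq=True, frozen=True)
class Package:
    name: str


@dataclass(eq=True, frozen=True)
class ImportStatement:
    imports: Package
    imported: Package


@dataclass(eq=True, frozen=True)
class DependencyGraph:
    packages: list[Package] = field(default_factory=list)
    import_statements: list[ImportStatement] = field(default_factory=list)


def graph_to_json(graph: DependencyGraph) -> str:
    # asdict() recurses into nested dataclasses and lists,
    # producing plain dicts that json can dump directly
    return json.dumps(asdict(graph), indent=2)


def graph_from_json(payload: str) -> DependencyGraph:
    # rebuild the dataclass instances from the plain dicts
    raw = json.loads(payload)
    return DependencyGraph(
        packages=[Package(**p) for p in raw["packages"]],
        import_statements=[
            ImportStatement(Package(**s["imports"]), Package(**s["imported"]))
            for s in raw["import_statements"]
        ],
    )


app, lib = Package("app"), Package("lib")
graph = DependencyGraph(
    packages=[app, lib],
    import_statements=[ImportStatement(app, lib)],
)
restored = graph_from_json(graph_to_json(graph))
print(restored == graph)  # True
```

Because the edge list holds only flat lists of small value objects, the round trip needs no custom encoder.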



## 2.0 Datasets and setup 
Following the first task, we examined and selected two datasets that allow us to compare existing tools with our new approach. Both datasets contain the source code of different packages to analyse. Due to their sheer size (multiple GB), they are not included in this file directly and can be consulted on the [project's GitHub repository](https://github.com/rtafurthgarcia/COM713).
Our datasets (`\dataset1` and `\dataset2`) share the same structure:
- `\packages` contains the packages to analyse and to generate SBOMs from,
- `\sbom` contains the SBOMs generated by each tool for each package, which serve as the basis for comparison.
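
The shared layout can be inspected with a few lines of code. This is a minimal sketch: the helper `list_dataset` is a hypothetical name, and it assumes the dataset folder sits next to the notebook.

```python
from pathlib import Path


def list_dataset(dataset_root: Path) -> dict[str, list[str]]:
    """Return the entry names found under `packages` and `sbom`."""
    layout: dict[str, list[str]] = {}
    for sub in ("packages", "sbom"):
        folder = dataset_root / sub
        # a missing folder yields an empty list instead of an error
        layout[sub] = sorted(p.name for p in folder.iterdir()) if folder.is_dir() else []
    return layout


# Assumed location of the first dataset, relative to the working directory.
print(list_dataset(Path("dataset1")))
```

Comparing the two lists is a quick way to spot packages for which no SBOM has been generated yet.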

### 2.1 Dataset n°1
Dataset n°1 (ds1) is a copy of Cofano et al.'s dataset. Dependencies are read from both `requirements.txt` and `pyproject.toml`, and will be deduced from `\packages`. This dataset contains no ground truth, so only `\sbom` can be used to compare our new approach with the other tools.

### 2.2 Dataset n°2
Dataset n°2 (ds2) is a copy of Jia et al.'s dataset. `\deptree_gt` contains the ground truth as JSON files for each package.

### 2.3 Merge
These two datasets were parsed and merged externally, because the process required a CycloneDX library that could not be attached to this project. The `merge.py` file shows how the merge was done.

In [None]:
import os
import ast

ds2 = os.path.join(".", "dataset2", "packages")


def parse_source(source_path: str):
    """Yield one AST per Python source file found at `source_path`."""
    def _parse(file_path: str):
        # read-only binary mode lets ast.parse detect the source encoding itself
        with open(file=file_path, mode="rb") as source_file:
            return ast.parse(source=source_file.read())

    if os.path.isfile(source_path):
        yield _parse(source_path)
    elif os.path.isdir(source_path):
        # we don't bother with subdirectories and thus only analyse top-level .py files
        source_files = [
            f for f in os.listdir(source_path)
            if os.path.isfile(os.path.join(source_path, f)) and f.endswith(".py")
        ]
        for source_file in source_files:
            yield _parse(os.path.join(source_path, source_file))

def analyse_package(source_path: str):
    for tree in parse_source(source_path):
        print(tree)

# analysis_ds2 is not defined in this cell; it is assumed to hold the merged
# dataset, with a "packages" list whose entries carry a "source" path
for package in analysis_ds2["packages"]:
    analyse_package(source_path=package["source"])

## References
[1] J. Tellnes, "Dependencies: No Software is an Island", Master's thesis, The University of Bergen, 2013. Available: https://bora.uib.no/bora-xmlui/handle/1956/7540
[2] M. T. Goodrich, R. Tamassia, and M. H. Goldwasser, Data Structures and Algorithms in Python, 1st ed. Hoboken, NJ: Wiley, 2013.
[3] "TimeComplexity - Python Wiki". Accessed: 4 January 2026. [Online]. Available: https://wiki.python.org/moin/TimeComplexity
[4] R. E. L. Tafurth Garcia, "ChatGPT - COM713", Transcript. [Online]. Available: https://chatgpt.com/share/695ad5cb-35f8-8008-a3a0-d8b0302b4eb2
