# Downloading workflows and triple creation 

I download the available workflows from the [website](https://quangis.github.io/wfgen/) and save them in a single directory. From these workflows I then create two files of triples: one containing single-input triples and the other containing multiple-input triples.

---

## 1. Downloading workflows

In [1]:

# libraries for scraping the workflows
import requests
from bs4 import BeautifulSoup as bs 
import os
import re


#libraries for triple generation
from rdflib import Graph, RDF, RDFS, Namespace  
import codecs
import itertools
import pandas as pd
#import os


#library for saving file 
import pickle 

In [2]:
# directory where the workflows will be saved (separate folder)
dir_workflows = "C:/Users/wlibe/OneDrive/Pulpit/Thesis/final_code/workflows" 

# the website we want to scrape
url = "https://quangis.github.io/wfgen/"

# getting the response from the server
response = requests.get(url)

# creating a BeautifulSoup object
soup = bs(response.content, 'html.parser')
 
# Finding all hyperlinks, which is indicated by "a" in HTML
links = soup.find_all('a')

# the files are not numbered in order, so each saved file keeps the number
# from its file name on the website
name = re.compile(r'solution(\d+)\.ttl')
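
As a small illustration of the naming pattern above, the sketch below applies the same regular expression to a few made-up `href` values. The sample strings and the helper name `solution_number` are assumptions for illustration only; they are not taken from the website.

```python
import re

# same pattern as in the cell above
name = re.compile(r'solution(\d+)\.ttl')

# hypothetical href values, for illustration only
sample_hrefs = ["solution100.ttl", "solution10.ttl", "index.html", "other.ttl"]


def solution_number(href):
    """Return the number in 'solution<number>.ttl', or None if the name does not match."""
    match = name.search(href)
    return int(match.group(1)) if match else None


numbers = {href: solution_number(href) for href in sample_hrefs}
print(numbers)
```

A link that ends in `.ttl` but does not match the pattern gives `None`, so such links have to be skipped before a file name is built from the number.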

In [3]:
# loop to download the workflows

# for every link, check whether it points to a .ttl file and, if so, download it
for link in links:

    # href is None when the <a> tag has no href attribute
    href = link.get('href')

    # skip links without an href and links that do not end with ".ttl"
    if href and href.endswith(".ttl"):
        pattern = name.search(href)

        # skip .ttl files whose name does not match "solution<number>.ttl";
        # without this check solution_number would be undefined or left over
        # from the previous file
        if not pattern:
            continue

        # extracting the numeric part of the file name, e.g. the 100 from "solution100.ttl"
        solution_number = int(pattern.group(1))
        print("Downloading file: ", solution_number)

        # get response object for link
        f_response = requests.get(url + href)

        # check if the status is "ok" = 200
        if f_response.status_code == 200:

            # write content to a ttl file in the chosen directory
            file_path = os.path.join(dir_workflows, f"solution{solution_number}.ttl")
            with open(file_path, 'wb') as file:
                file.write(f_response.content)
            print("solution ", solution_number, " downloaded")
        else:
            print("Failed to download", href)

# message shown once all links have been processed
print("All ttl files downloaded")

Downloading file:  100
solution  100  downloaded
Downloading file:  101
solution  101  downloaded
Downloading file:  102
solution  102  downloaded
Downloading file:  103
solution  103  downloaded
Downloading file:  104
solution  104  downloaded
Downloading file:  105
solution  105  downloaded
Downloading file:  106
solution  106  downloaded
Downloading file:  107
solution  107  downloaded
Downloading file:  108
solution  108  downloaded
Downloading file:  109
solution  109  downloaded
Downloading file:  10
solution  10  downloaded
Downloading file:  110
solution  110  downloaded
Downloading file:  111
solution  111  downloaded
Downloading file:  112
solution  112  downloaded
Downloading file:  113
solution  113  downloaded
Downloading file:  114
solution  114  downloaded
Downloading file:  115
solution  115  downloaded
Downloading file:  116
solution  116  downloaded
Downloading file:  117
solution  117  downloaded
Downloading file:  118
solution  118  downloaded
Downloading file:  119

---

## 2. Triple generation 

The workflows we are working with can have multiple inputs, for example:

```
<https://example.com/#solution1> a ns1:Workflow ;
    ns1:edge [ ns1:applicationOf <https://quangis.github.io/tool/abstract#IntersectDissolveField2Object> ;
            ns1:input1 _:N8b188f868bb14f86b6950485f8506bd2 ;
            ns1:input2 _:Ncb1ae296722548d0b16ea9ad84370570 ;
            ns1:output _:N381929c2bc3145ba933d79736ba97aff ], 
```

Here the same output has two inputs, `ns1:input1` and `ns1:input2` (written as *input_1* and *input_2* below). In every triple, the *input* is the **head**, the *tool* (for example 'IntersectDissolveField2Object') is the **relation**, and the *output* is the **tail**.

That is why this section is divided into two subsections. In the first one (2.1), we create triples with only one input each. Based on the example above, the triples would look like this:

>*(input_1, IntersectDissolveField2Object, output)*   
>*(input_2, IntersectDissolveField2Object, output)*

The second subsection (2.2) creates triples with multiple inputs, which look like this:

>*([input_1, input_2], IntersectDissolveField2Object, output)*
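
The two triple shapes can be sketched on plain Python values, without rdflib. The function name `edge_to_triples` and the placeholder strings are hypothetical and only illustrate the idea; this is not the extraction code used on the actual graph.

```python
from itertools import combinations


def edge_to_triples(inputs, tool, output):
    """Build single-input and multiple-input triples for one workflow edge."""
    # one triple per input: (input, tool, output)
    single = [(inp, tool, output) for inp in inputs]
    # one triple per pair of inputs: ([input_a, input_b], tool, output)
    multiple = [(list(pair), tool, output) for pair in combinations(inputs, 2)]
    return single, multiple


single, multiple = edge_to_triples(
    ["input_1", "input_2"], "IntersectDissolveField2Object", "output"
)
print(single)
print(multiple)
```

With this pairing, an edge that has only one input yields one single-input triple and no multiple-input triple, because there are no combinations of size two.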


### 2.1 Single input triples 

In [4]:
# Create an RDF graph
graph = Graph()

# Define the prefixes that are used in the workflows
prefixes = {
    "ns1": Namespace("http://geographicknowledge.de/vocab/Workflow.rdf#"),
    "rdfs": Namespace("http://www.w3.org/2000/01/rdf-schema#"),
    "cc": Namespace("http://geographicknowledge.de/vocab/CoreConceptData.rdf#"),
}

# Bind the prefixes in the graph
for prefix, namespace in prefixes.items():
    graph.bind(prefix, namespace)
    
# changing working directory to the one where the workflows are saved
directory_files = "C:/Users/wlibe/OneDrive/Pulpit/Thesis/final_code/workflows"
os.chdir(directory_files)

# parse every Turtle file in that directory into the same graph
for file_name in os.listdir(directory_files):
    if file_name.endswith(".ttl"):
        graph.parse(file_name, format="turtle")

In [5]:
# empty list to store the triples
triples_single = []

# Iterate over the workflows in the graph
for subject in graph.subjects(RDF.type, prefixes["ns1"].Workflow):
    
    # Extract the inputs, tool, and outputs of each edge
    for edge_obj in graph.objects(subject, prefixes["ns1"].edge):
        tool_name = None
        inputs = []
        outputs = []
        for rel_predicate, rel_obj in graph.predicate_objects(edge_obj):
            if rel_predicate == prefixes["ns1"].applicationOf:
                # Extract the tool name
                tool_name = str(rel_obj).split("#")[-1]
            elif rel_predicate == prefixes["ns1"].input1 or rel_predicate == prefixes["ns1"].input2:
                inputs.append(rel_obj)
            elif rel_predicate == prefixes["ns1"].output:
                outputs.append(rel_obj)

        # Create separate triples for each input-output relationship
        for input_obj in inputs:
            for output_obj in outputs:
                # Get the rdfs:label for input_obj and output_obj
                input_label = graph.value(input_obj, RDFS.label)
                output_label = graph.value(output_obj, RDFS.label)
                if input_label and output_label:
                    triples_single.append((input_label, tool_name, output_label))
                else:
                    triples_single.append((input_obj, tool_name, output_obj))


In [6]:
# printing the triples 
for triple in triples_single:
    print(triple)

(rdflib.term.Literal('ObjectQ, VectorTessellationA, NominalA'), 'IntersectDissolveField2Object', rdflib.term.Literal('ObjectQ, VectorRegionA, ERA'))
(rdflib.term.Literal('FieldQ, VectorTessellationA, PlainNominalA'), 'IntersectDissolveField2Object', rdflib.term.Literal('ObjectQ, VectorRegionA, ERA'))
(rdflib.term.Literal('ObjectQ, VectorTessellationA, NominalA'), 'SpatialJoinSumTessRatio', rdflib.term.Literal('ObjectQ, VectorTessellationA, ERA'))
(rdflib.term.Literal('ObjectQ, VectorRegionA, ERA'), 'SpatialJoinSumTessRatio', rdflib.term.Literal('ObjectQ, VectorTessellationA, ERA'))
(rdflib.term.Literal('ObjectQ, VectorTessellationA, PlainIntervalA'), 'SpatialJoinSumTessRatio', rdflib.term.Literal('ObjectQ, VectorTessellationA, ERA'))
(rdflib.term.Literal('ObjectQ, VectorTessellationA, ERA'), 'SpatialJoinSumTessRatio', rdflib.term.Literal('ObjectQ, VectorTessellationA, ERA'))
(rdflib.term.Literal('ObjectQ, PointA, PlainNominalA'), 'addObjectCapacity', rdflib.term.Literal('ObjectQ, Point

In [7]:

# changing working directory to the folder where I want the files to be saved
directory_triples = "C:/Users/wlibe/OneDrive/Pulpit/Thesis/final_code/triples"
os.chdir(directory_triples)

# saving the single-input triples into a pickle file
with open('triples_single_input.pkl', 'wb') as file:
    pickle.dump(triples_single, file)

In [8]:
# opening the pickle file
with open('triples_single_input.pkl', 'rb') as file:
    triples_single = pickle.load(file)
    
# saving the triple list into a data frame 
df_single = pd.DataFrame(triples_single, columns = ["head", "relation", "tail"])

# saving the data frame of triples into a pickle file
with open('triples_single_df.pkl', 'wb') as file:
    pickle.dump(df_single, file)

### 2.2 Multiple input triples 

In [9]:
# Initialize an empty list to store triples
triples_multiple = []

# Iterate over the workflows in the graph
for subject in graph.subjects(RDF.type, prefixes["ns1"].Workflow):
    
    # Extract the inputs, tool, and outputs of each edge
    for edge_obj in graph.objects(subject, prefixes["ns1"].edge):
        tool_name = None
        inputs = []
        outputs = []
        for rel_predicate, rel_obj in graph.predicate_objects(edge_obj):
            if rel_predicate == prefixes["ns1"].applicationOf:
                # Extract the tool name
                tool_name = str(rel_obj).split("#")[-1]
            elif rel_predicate == prefixes["ns1"].input1 or rel_predicate == prefixes["ns1"].input2:
                inputs.append(rel_obj)
            elif rel_predicate == prefixes["ns1"].output:
                outputs.append(rel_obj)

        # create triples with multiple inputs 
        for input_set in itertools.combinations(inputs, 2):
            input_list = []  # list to store multiple inputs
            for input_obj in input_set:
                input_label = graph.value(input_obj, RDFS.label)
                if input_label:
                    input_list.append(input_label)
                else:
                    input_list.append(input_obj)
            for output_obj in outputs:
                output_label = graph.value(output_obj, RDFS.label)
                if output_label:
                    triples_multiple.append((input_list, tool_name, output_label))
                else:
                    triples_multiple.append((input_list, tool_name, output_obj))


In [10]:
# printing the multiple-input triples
for triple in triples_multiple:
    print(triple)

([rdflib.term.Literal('ObjectQ, VectorTessellationA, NominalA'), rdflib.term.Literal('FieldQ, VectorTessellationA, PlainNominalA')], 'IntersectDissolveField2Object', rdflib.term.Literal('ObjectQ, VectorRegionA, ERA'))
([rdflib.term.Literal('ObjectQ, VectorTessellationA, NominalA'), rdflib.term.Literal('ObjectQ, VectorRegionA, ERA')], 'SpatialJoinSumTessRatio', rdflib.term.Literal('ObjectQ, VectorTessellationA, ERA'))
([rdflib.term.Literal('ObjectQ, VectorTessellationA, PlainIntervalA'), rdflib.term.Literal('ObjectQ, VectorTessellationA, ERA')], 'SpatialJoinSumTessRatio', rdflib.term.Literal('ObjectQ, VectorTessellationA, ERA'))
([rdflib.term.Literal('ObjectQ, PlainVectorRegionA, PlainOrdinalA'), rdflib.term.Literal('FieldQ, RasterA, PlainRatioA')], 'ZonalStatisticsSumField', rdflib.term.Literal('ObjectQ, PlainVectorRegionA, ERA'))
([rdflib.term.Literal('ObjectQ, PlainVectorRegionA, PlainIntervalA'), rdflib.term.Literal('FieldQ, RasterA, PlainRatioA')], 'ZonalStatisticsSumField', rdflib

In [11]:
# saving the multiple-input triples into a pickle file
with open('triples_multiple_input.pkl', 'wb') as file:
    pickle.dump(triples_multiple, file)

I create a data frame in the same way as for the single-input triples. In this case, however, I must first replace the list of inputs in each triple with a single value, to avoid having a list inside a list.

In [12]:
# opening the pickle file
with open('triples_multiple_input.pkl', 'rb') as file:
    triples_multiple = pickle.load(file)

In [13]:
# initializing the list 
triples_multiple_list = []

# loop through the triples
for triple in triples_multiple:
    inner_list = triple[0]  # get the list inside the triple (the inputs)
    value = ' & '.join(str(term) for term in inner_list)  # join the inputs with " & " to mark multiple inputs
    triples_multiple_list.append((value,) + triple[1:])
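
As a quick check of the flattening step, the sketch below applies the same join to one hypothetical triple that uses plain strings instead of rdflib Literals, and then splits the head back into its inputs. The sample values are for illustration only.

```python
# hypothetical multiple-input triple with plain strings instead of rdflib Literals
triple = (
    ["ObjectQ, VectorTessellationA, NominalA", "FieldQ, VectorTessellationA, PlainNominalA"],
    "IntersectDissolveField2Object",
    "ObjectQ, VectorRegionA, ERA",
)

# join the inputs into one string so the triple holds three flat values
head = " & ".join(str(term) for term in triple[0])
flat_triple = (head,) + triple[1:]
print(flat_triple)

# the inputs can be recovered by splitting on the same separator,
# assuming that no label itself contains " & "
recovered = flat_triple[0].split(" & ")
print(recovered)
```

Because the labels themselves contain commas, a separator such as " & " keeps the individual inputs distinguishable inside the joined head.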

In [14]:
# converting the list into a data frame with head, relation and tail columns
df_multiple = pd.DataFrame(triples_multiple_list, columns = ["head", "relation", "tail"])

# saving the data frame of triples into a pickle file
with open('triples_multiple_df.pkl', 'wb') as file:
    pickle.dump(df_multiple, file)

---

References: 
- GeeksforGeeks. (2023). Downloading PDFs with Python using Requests and BeautifulSoup. GeeksforGeeks. https://www.geeksforgeeks.org/downloading-pdfs-with-python-using-requests-and-beautifulsoup/