# RDF Basics

The Resource Description Framework (RDF) is a standard model for data interchange that allows structured and semi-structured data to be shared across different applications. RDF expresses relationships between entities as triples, which together form a graph that links resources, identified by unique URIs, via edges that describe their relationships. In this post, we'll use the Python library `rdflib` to build a product data graph and extract information about the individual products.

## About the Data
The data used in this post can be found [here](http://data.dws.informatik.uni-mannheim.de/structureddata/2014-12/quads/ClassSpecificQuads/schemaorgProduct.nq.sample.txt). It comes from the [Web Data Commons project](http://www.webdatacommons.org/structureddata/2014-12/stats/schema_org_subsets.html), which extracts structured data describing products, people, organizations, places, and events from billions of web pages and makes the extracted data available for download. In particular, we will be exploring the subset of the corpus that concerns products and product descriptions.

In [1]:
import os
import requests

base_folder = "data"
products_url = "http://data.dws.informatik.uni-mannheim.de/structureddata/2014-12/quads/ClassSpecificQuads/schemaorgProduct.nq.sample.txt"
product_path = "products.nq"


def download_data(dir_path, data_url, data_path):
    """
    Convenience function that uses the requests library to download the
    dataset at data_url and save it as data_path inside the dir_path folder.
    """
    # Create the target folder (and any missing parents) if needed
    os.makedirs(dir_path, exist_ok=True)

    response = requests.get(data_url)
    response.raise_for_status()  # fail loudly rather than save an error page
    with open(os.path.join(dir_path, data_path), "wb") as f:
        f.write(response.content)

    return data_path

download_data(base_folder, products_url, product_path)

'products.nq'

In [None]:
# Print the first 10 rows
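
A minimal sketch for previewing those rows. The `head` helper is hypothetical (it is not part of the original notebook), and the path assumes `download_data` saved the sample to `data/products.nq`:

```python
import os
from itertools import islice


def head(path, n=10):
    """Return the first n lines of a text file (hypothetical helper)."""
    # N-Quads files are UTF-8; replace undecodable bytes rather than fail
    with open(path, encoding="utf-8", errors="replace") as f:
        return [line.rstrip("\n") for line in islice(f, n)]


# Assumes download_data() saved the sample to data/products.nq
sample_path = os.path.join("data", "products.nq")
if os.path.exists(sample_path):
    for row in head(sample_path):
        print(row)
```

Reading lazily with `islice` means only the first few lines are pulled from disk, which matters little for this sample but helps with the full corpus.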

### What is RDF?

The product data we'll be exploring is represented in RDF (Resource Description Framework) format. RDF is a directed, labeled graph data format for representing information on the Web. Resources are represented as nodes in the graph and identified by unique URIs. Edges represent named links between two resources.

![rdf data model](https://raw.githubusercontent.com/rebeccabilbro/rebeccabilbro.github.io/master/images/2018-07-02-rdf-graph.png)

RDF is an abstract model with several serialization formats that have different encodings. The variant we'll be dealing with is called N-Quads.

### What is N-Quads?

N-Quads is a line-based, plain text format for encoding an RDF dataset. In the data we'll be looking at, the fourth element of each line contains the URL of the webpage from which the data was extracted.

In Python, we can use the `rdflib` library (which you can install via `pip`) to build a graph structure from the product data we've downloaded:

In [2]:
from rdflib import ConjunctiveGraph

def make_graph_from_nquads(input_data):
    g = ConjunctiveGraph()
    # Open in binary mode and close the file once parsing is done
    with open(input_data, "rb") as data:
        g.parse(data, format="nquads")

    return g

g = make_graph_from_nquads(os.path.join(base_folder, product_path))

Now we can explore the data using classes and methods implemented in `rdflib`:

 - URIRef
 - BNode
 - g.subjects()
 - g.triples()


In [3]:
# First, count the triples in the graph (len() counts statements, not nodes)
len(g)

4983

In [4]:
from rdflib import URIRef

# get all the unique products
product_list = list(set(g.subjects(object=URIRef("http://schema.org/Product"))))
len(product_list)

216

In [5]:
product_list[0] # `BNode` is a blank node

rdflib.term.BNode('N34274ef79f704df1b2327c5630ca18b1')

In [None]:
# print out the first 10 products

Now we'll convert our graph into a dictionary representation, so that each product is represented as a key and the values correspond to the product details contained in the graph. This will allow us to index the documents in a document store like MongoDB or Elasticsearch to enable search in the context of an application.

In [6]:
uid_data = {"name"    : None,
            "image"   : None,
            "url"     : None,
            "desc"    : None,
            "sku"     : None,
            "review"  : None,
            "manu"    : None,
            "reviews" : None,
            "prod_id" : None,
            "mod_date": None,
            "rel_date": None,
            "brand"   : None,
            "model"   : None,
            "offers"  : None,
            "thumb"   : None,
            "logo"    : None,
            "rating"  : None
    }

product_dict = dict()

for product in product_list:
    # Copy the template so each product gets its own dict, not a shared one
    product_dict[product] = dict(uid_data)

In [7]:
PRODUCT_FIELDS = {"name"    : "http://schema.org/Product/name",
                  "image"   : "http://schema.org/Product/image",
                  "url"     : "http://schema.org/Product/url",
                  "desc"    : "http://schema.org/Product/description",
                  "sku"     : "http://schema.org/Product/sku",
                  "review"  : "http://schema.org/Product/review",
                  "manu"    : "http://schema.org/Product/manufacturer",
                  "reviews" : "http://schema.org/Product/reviews",
                  "prod_id" : "http://schema.org/Product/productID",
                  "mod_date": "http://schema.org/Product/dateModified",
                  "rel_date": "http://schema.org/Product/releaseDate",
                  "brand"   : "http://schema.org/Product/brand",
                  "model"   : "http://schema.org/Product/model",
                  "offers"  : "http://schema.org/Product/offers",
                  "thumb"   : "http://schema.org/Product/thumbnailUrl",
                  "logo"    : "http://schema.org/Product/logo",
                  "rating"  : "http://schema.org/Product/aggregateRating",
    }

In [8]:
for key, val_dict in product_dict.items():  # for each unique product
    for product_id, field, details in g.triples((key, None, None)):  # for each related triple in the graph
        for field_type, schema in PRODUCT_FIELDS.items():  # for each possible product field
            if str(field) == schema:
                # val_dict is the dict stored in product_dict, so this updates it in place
                val_dict[field_type] = str(details)
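
As a rough sketch of the indexing step, the dictionary can be flattened into JSON-friendly documents. The `to_documents` helper and the sample data are hypothetical, and the `_id` key is an assumption borrowed from MongoDB's naming convention; no document store is actually contacted here.

```python
import json


def to_documents(products):
    """Turn {node: fields} into a list of JSON-friendly dicts (hypothetical helper)."""
    docs = []
    for node, fields in products.items():
        # rdflib BNode and URIRef are str subclasses, so str() works on them too
        doc = {"_id": str(node)}
        # Drop fields that were never filled in
        doc.update({k: v for k, v in fields.items() if v is not None})
        docs.append(doc)
    return docs


# Made-up stand-in for product_dict
sample = {"N34274ef": {"name": "Example Widget", "sku": None, "brand": "Acme"}}
docs = to_documents(sample)
print(json.dumps(docs, indent=2))
```

Converting the graph nodes to plain strings up front keeps the documents serializable, since `json` cannot encode arbitrary dictionary keys.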

## What is SPARQL?

SPARQL is the query language for RDF; its syntax and semantics are defined in a W3C specification. SPARQL can be used to express queries across diverse data sources, whether the data is stored natively as RDF or viewed as RDF via middleware. SPARQL contains capabilities for querying required and optional graph patterns along with their conjunctions and disjunctions. SPARQL also supports extensible value testing and constraining queries by source RDF graph. The results of SPARQL queries can be result sets or RDF graphs.