# Lab 3: Populating Schemas and Basic `SPARQL`

**Learning Outcomes:**

*   Manually populate a graph with individuals.
*   Iteratively populate your RDF graph with instances from a `CSV` file.
*   Understand the fundamental structure of a `SPARQL` query.
*   Write basic `SELECT` queries to retrieve data from your graph.
*   Use the `FILTER` clause to find specific information based on textual or numerical conditions.

## **Part 1:** Populating Graph with Instances
Before learning to write `SPARQL` queries, we need data. We'll start with a simple schema for a movie dataset, then load the data from a `CSV` file and add it to our graph.


---


### **Step 1:** Install Libraries and Import Useful Functions

In [7]:
!pip3 install rdflib pandas -U -q

import pandas as pd # for loading CSV
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF, RDFS, XSD, SDO, OWL

Recall the functions and namespaces from Lab 2. In addition, we need the `pandas` library to load the external `CSV` file so that we can iterate over its rows.


---


### **Step 2:** Before Adding Instances
We will be using a publicly available [movie dataset](https://www.kaggle.com/datasets/harshitshankhdhar/imdb-dataset-of-top-1000-movies-and-tv-shows). The following columns will be used:

*   **Series_Title:** The movie's title.
*   **Released_Year:** Year the movie was first released.
*   **Runtime:** Duration of the movie.
*   **Genre:** The movie's categories, separated by commas.
*   **IMDB_Rating:** The average user score.
*   **Director:** Name of the director.
*   **Star1, Star2, Star3, Star4:** Four main actors, in separate columns.
*   **No_of_Votes:** Total number of votes cast.
*   **Gross:** Generated revenue in dollars (e.g., "936,662,225").

It is generally good practice to have an ontology associated with your data. We already learned how to create schema-level graphs using `rdflib` in Lab 2. Here we import a schema, pre-made for this dataset, to use as our ontology.

In [8]:
g = Graph()
g.parse("1st_part/movie_schema.ttl", format="ttl") # import graph

print(g.serialize(format="ttl"))

@prefix movies: <http://example.org/movies/> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix sdo: <https://schema.org/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

movies:Actor a rdfs:Class ;
    rdfs:subClassOf movies:Person .

movies:Director a rdfs:Class ;
    rdfs:subClassOf movies:Person .

movies:Movie a rdfs:Class ;
    rdfs:subClassOf sdo:Movie .

movies:Person a rdfs:Class ;
    rdfs:subClassOf sdo:Person .

movies:hasActor a owl:ObjectProperty ;
    rdfs:domain movies:Movie ;
    rdfs:range movies:Actor ;
    rdfs:subPropertyOf sdo:actor .

movies:hasDirector a owl:ObjectProperty ;
    rdfs:domain movies:Movie ;
    rdfs:range movies:Director ;
    rdfs:subPropertyOf sdo:director .

movies:hasGenre a owl:DatatypeProperty ;
    rdfs:domain movies:Movie ;
    rdfs:range xsd:string ;
    rdfs:subPropertyOf sdo:genre .

movies:hasGrossRevenue a owl:DatatypeProperty ;
    rdfs:domain movies:Movie ;
    rdf

The ontology is visualized below:\
![Schema Visual](./schema_visual.png)

We should also define some variables (from the schema) in our coding environment. This will make our work easier and more understandable.

In [9]:
# Custom namespace
MOVIES = Namespace("http://example.org/movies/")

# Bind namespaces
g.bind("movies", MOVIES)
g.bind("sdo", SDO)
g.bind("rdfs", RDFS)
g.bind("owl", OWL)

# Classes
Movie = MOVIES.Movie
Director = MOVIES.Director
Actor = MOVIES.Actor
Genre = MOVIES.Genre

# Properties
hasTitle = MOVIES.hasTitle
hasDirector = MOVIES.hasDirector
hasActor = MOVIES.hasActor
hasGenre = MOVIES.hasGenre
releasedInYear = MOVIES.releasedInYear
hasImdbRating = MOVIES.hasImdbRating
hasRuntime = MOVIES.hasRuntime
hasGrossRevenue = MOVIES.hasGrossRevenue
hasName = MOVIES.hasName



---


### **Exercise 1:** Manually Adding Instances
At this point, we can start populating our graph with individuals. To understand the basics, we will add a single instance by hand.

1.   Think of a movie you watched recently. Search the web for the information corresponding to the columns we will use, e.g. title, director, revenue.
2.   Fill in the blank variables below according to our schema. Be consistent with how the data looks under each column, and follow the comments for correct formatting.
3.   Print the graph and locate the data you've manually added.

<ins>**Note:**</ins>  See that we need to form a main URI for our movie. It is a good practice to form it as an extension of our custom namespace, e.g. `http://example.org/movies/YourMovieTitle`.

In [10]:
# --- Fill in the details of your chosen movie ---
title = "Inception"
director = "Christopher Nolan"
actor = "Leonardo DiCaprio"
genre = "Science Fiction"
year = 2010
rating = 8.8  # e.g., 8.5 (decimal)
runtime = 148  # e.g., 150 (with no "min" after)
revenue = 829895144  # e.g., 100000000 (no commas)
# ------------------------------------------------

# Create unique URIs as extensions of our custom namespace
movie_uri = MOVIES["Inception"]
director_uri = MOVIES["ChristopherNolan"]
actor_uri = MOVIES["LeonardoDiCaprio"]


# Add movie triples
g.add((movie_uri, RDF.type, Movie))
g.add((movie_uri, hasTitle, Literal(title, datatype=XSD.string))) # add title as a xsd:string Literal
g.add((movie_uri, hasGenre, Literal(genre)))
g.add((movie_uri, releasedInYear, Literal(str(year), datatype=XSD.gYear))) # add release year as a xsd:gYear Literal
g.add((movie_uri, hasImdbRating, Literal(rating, datatype=XSD.decimal)))
g.add((movie_uri, hasRuntime, Literal(runtime, datatype=XSD.integer)))
g.add((movie_uri, hasGrossRevenue, Literal(revenue, datatype=XSD.integer))) # find the suitable predicate

# Add people triples
g.add((director_uri, RDF.type, Director)) # define director's Class
g.add((director_uri, hasName, Literal(director)))
g.add((actor_uri, RDF.type, Actor))
g.add((actor_uri, hasName, Literal(actor)))

# Link the movie to its people
g.add((movie_uri, hasDirector, director_uri)) # fill the proper URIs
g.add((movie_uri, hasActor, actor_uri))

# Print your graph
print(g.serialize(format="turtle"))


@prefix movies: <http://example.org/movies/> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix sdo: <https://schema.org/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

movies:Actor a rdfs:Class ;
    rdfs:subClassOf movies:Person .

movies:Director a rdfs:Class ;
    rdfs:subClassOf movies:Person .

movies:Movie a rdfs:Class ;
    rdfs:subClassOf sdo:Movie .

movies:Person a rdfs:Class ;
    rdfs:subClassOf sdo:Person .

movies:hasActor a owl:ObjectProperty ;
    rdfs:domain movies:Movie ;
    rdfs:range movies:Actor ;
    rdfs:subPropertyOf sdo:actor .

movies:hasDirector a owl:ObjectProperty ;
    rdfs:domain movies:Movie ;
    rdfs:range movies:Director ;
    rdfs:subPropertyOf sdo:director .

movies:hasGenre a owl:DatatypeProperty ;
    rdfs:domain movies:Movie ;
    rdfs:range xsd:string ;
    rdfs:subPropertyOf sdo:genre .

movies:hasGrossRevenue a owl:DatatypeProperty ;
    rdfs:domain movies:Movie ;
    rdf



---


### **Step 3:** Iteratively Adding Instances
As adding each row manually would be very tiresome, we'll use `pandas` to read our `CSV` and loop through it to populate the graph programmatically. Download the `imdb_top_1000.csv` file from the [Kaggle dataset page](https://www.kaggle.com/datasets/harshitshankhdhar/imdb-dataset-of-top-1000-movies-and-tv-shows). Place it in the same directory this code runs in.

Let's start by reading the data:

In [11]:
# Load the CSV file into a pandas DataFrame, remove rows with empty cells
raw_df = pd.read_csv("imdb_top_1000.csv")
df = raw_df.dropna()
print(f"Removed {len(raw_df) - len(df)} rows with empty cells.")

# Select the columns of interest
selected_columns = [
    "Series_Title",
    "Released_Year",
    "Runtime",
    "Genre",
    "IMDB_Rating",
    "Director",
    "Star1",
    "Star2",
    "Star3",
    "Star4",
    "No_of_Votes",
    "Gross",
]
df = df[selected_columns]

# Print the head of the selected DataFrame
display(df.head())

Removed 286 rows with empty cells.


| | Series_Title | Released_Year | Runtime | Genre | IMDB_Rating | Director | Star1 | Star2 | Star3 | Star4 | No_of_Votes | Gross |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | The Shawshank Redemption | 1994 | 142 min | Drama | 9.3 | Frank Darabont | Tim Robbins | Morgan Freeman | Bob Gunton | William Sadler | 2343110 | 28341469 |
| 1 | The Godfather | 1972 | 175 min | Crime, Drama | 9.2 | Francis Ford Coppola | Marlon Brando | Al Pacino | James Caan | Diane Keaton | 1620367 | 134966411 |
| 2 | The Dark Knight | 2008 | 152 min | Action, Crime, Drama | 9.0 | Christopher Nolan | Christian Bale | Heath Ledger | Aaron Eckhart | Michael Caine | 2303232 | 534858444 |
| 3 | The Godfather: Part II | 1974 | 202 min | Crime, Drama | 9.0 | Francis Ford Coppola | Al Pacino | Robert De Niro | Robert Duvall | Diane Keaton | 1129952 | 57300000 |
| 4 | 12 Angry Men | 1957 | 96 min | Crime, Drama | 9.0 | Sidney Lumet | Henry Fonda | Lee J. Cobb | Martin Balsam | John Fiedler | 689845 | 4360000 |


According to our schema, we need to pre-process a couple of columns:

*   The `Runtime` column is read as a string because of the " min" suffix after each number. Since our schema expects an integer duration, we need to strip that suffix.

*   The `Gross` column is supposed to contain integer values, but the thousands separators (commas) cause the values to be read as strings. We need to remove the commas before adding the values to our graph.

In [12]:
# Remove "min" from Runtime
df["Runtime"] = df["Runtime"].str.replace(" min", "").astype(int)

# Remove the commas from Gross
df["Gross"] = df["Gross"].str.replace(",", "").astype(int)

# Verify the change
display(df.head())

| | Series_Title | Released_Year | Runtime | Genre | IMDB_Rating | Director | Star1 | Star2 | Star3 | Star4 | No_of_Votes | Gross |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | The Shawshank Redemption | 1994 | 142 | Drama | 9.3 | Frank Darabont | Tim Robbins | Morgan Freeman | Bob Gunton | William Sadler | 2343110 | 28341469 |
| 1 | The Godfather | 1972 | 175 | Crime, Drama | 9.2 | Francis Ford Coppola | Marlon Brando | Al Pacino | James Caan | Diane Keaton | 1620367 | 134966411 |
| 2 | The Dark Knight | 2008 | 152 | Action, Crime, Drama | 9.0 | Christopher Nolan | Christian Bale | Heath Ledger | Aaron Eckhart | Michael Caine | 2303232 | 534858444 |
| 3 | The Godfather: Part II | 1974 | 202 | Crime, Drama | 9.0 | Francis Ford Coppola | Al Pacino | Robert De Niro | Robert Duvall | Diane Keaton | 1129952 | 57300000 |
| 4 | 12 Angry Men | 1957 | 96 | Crime, Drama | 9.0 | Sidney Lumet | Henry Fonda | Lee J. Cobb | Martin Balsam | John Fiedler | 689845 | 4360000 |


A final step is to define a function that automatically creates URIs for us. Remember that we need these to represent each of our movies, directors, and actors. The function takes a string as input, removes all spaces and punctuation marks, and appends the result to our custom namespace.

In [13]:
import re

def create_uri(input_string):
    """
    Creates a URI in the MOVIES namespace from an input string
    """
    # Drop every character that is not a letter, digit, or underscore
    cleaned_string = re.sub(r'[^\w]', '', input_string)
    return MOVIES[cleaned_string]

test_uri_1 = create_uri("The Godfather: Part II")
test_uri_2 = create_uri("Francis Ford Coppola")
print(test_uri_1, test_uri_2)

http://example.org/movies/TheGodfatherPartII http://example.org/movies/FrancisFordCoppola
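
Because `create_uri` discards spaces and punctuation, two different strings can collapse to the same URI, in which case their triples end up on a single node. The sketch below is one possible way to check for this before populating the graph. Here `local_name` mirrors the cleaning step above, `find_collisions` is a hypothetical helper (not part of `rdflib`), and the sample titles are made up for illustration.

```python
import re
from collections import defaultdict

def local_name(input_string):
    """Mirror the cleaning step of create_uri: keep only word characters."""
    return re.sub(r"[^\w]", "", input_string)

def find_collisions(strings):
    """Group distinct input strings that map to the same cleaned name."""
    groups = defaultdict(set)
    for s in strings:
        groups[local_name(s)].add(s)
    # Keep only cleaned names produced by more than one distinct input
    return {name: sorted(originals)
            for name, originals in groups.items()
            if len(originals) > 1}

# Made-up titles: the first two differ only by punctuation
titles = ["The Kid", "The Kid!", "Up", "The Godfather: Part II"]
collisions = find_collisions(titles)
print(collisions)  # {'TheKid': ['The Kid', 'The Kid!']}
```

On the lab data, the same check could be run on `df["Series_Title"]`, or on the director and actor columns, to see whether any names are merged.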


Now let's iterate through the data and add the movie triples to our graph.

In [14]:
for index, row in df.iterrows():

  # Get a URI for this movie
  movie_uri = create_uri(row["Series_Title"])

  # Add movie triples
  g.add((movie_uri, RDF.type, Movie))
  g.add((movie_uri, hasTitle, Literal(row["Series_Title"])))
  g.add((movie_uri, releasedInYear, Literal(row["Released_Year"], datatype=XSD.gYear)))
  g.add((movie_uri, hasRuntime, Literal(row["Runtime"], datatype=XSD.integer)))
  g.add((movie_uri, hasImdbRating, Literal(row["IMDB_Rating"], datatype=XSD.decimal)))
  g.add((movie_uri, hasGrossRevenue, Literal(row["Gross"], datatype=XSD.integer)))

print(g.serialize(format="ttl"))

Failed to convert Literal lexical form to value. Datatype=http://www.w3.org/2001/XMLSchema#gYear, Converter=<function parse_xsd_gyear at 0x000002910D802DC0>
Traceback (most recent call last):
  File "c:\Users\Timur\AppData\Local\Programs\Python\Python39\lib\site-packages\rdflib\term.py", line 2163, in _castLexicalToPython
    return conv_func(lexical)  # type: ignore[arg-type]
  File "c:\Users\Timur\AppData\Local\Programs\Python\Python39\lib\site-packages\rdflib\xsd_datetime.py", line 618, in parse_xsd_gyear
    raise ValueError("gYear string must be at least 4 numerals in length")
ValueError: gYear string must be at least 4 numerals in length


@prefix movies: <http://example.org/movies/> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix sdo: <https://schema.org/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

movies:Actor a rdfs:Class ;
    rdfs:subClassOf movies:Person .

movies:Director a rdfs:Class ;
    rdfs:subClassOf movies:Person .

movies:Movie a rdfs:Class ;
    rdfs:subClassOf sdo:Movie .

movies:Person a rdfs:Class ;
    rdfs:subClassOf sdo:Person .

movies:12AngryMen a movies:Movie ;
    movies:hasGrossRevenue 4360000 ;
    movies:hasImdbRating 9.0 ;
    movies:hasRuntime 96 ;
    movies:hasTitle "12 Angry Men" ;
    movies:releasedInYear "1957"^^xsd:gYear .

movies:12YearsaSlave a movies:Movie ;
    movies:hasGrossRevenue 56671993 ;
    movies:hasImdbRating 8.1 ;
    movies:hasRuntime 134 ;
    movies:hasTitle "12 Years a Slave" ;
    movies:releasedInYear "2013"^^xsd:gYear .

movies:1917 a movies:Movie ;
    movies:hasGrossRevenue 159227644 
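
The warning printed before the serialized graph indicates that at least one `Released_Year` value is not a valid `xsd:gYear` lexical form. The loop still completes, as the output shows, but such a value cannot be interpreted as a year. One possible guard, sketched below, is to check each value before adding the triple. The helper `is_four_digit_year` is hypothetical, the sample values are invented, and accepting only exactly four digits is an assumption that suits this dataset rather than the full `xsd:gYear` definition.

```python
import re

def is_four_digit_year(value):
    """Return True if value looks like a four-digit year such as '1994'."""
    return re.fullmatch(r"\d{4}", str(value).strip()) is not None

# Invented sample values, including some that are not years
sample_years = ["1994", "2008", "PG", "", "195"]

valid = [y for y in sample_years if is_four_digit_year(y)]
skipped = [y for y in sample_years if not is_four_digit_year(y)]

print(valid)    # ['1994', '2008']
print(skipped)  # ['PG', '', '195']
```

Inside the population loop, the `releasedInYear` triple could then be added only when the check passes, leaving the other triples of that movie untouched.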



---


### **Exercise 2:** Adding More Data
Just like we did for the movies above, add the data for the directors.

1.   Start looping through the data. Get the director name in each iteration.
2.   Use the function to create a unique director URI.
3.   Add the director triples from Exercise 1.
4.   Add the object property triple that links the director to their movie.

In [15]:
for index, row in df.iterrows():

  director_name = row["Director"]

  # Get URIs for this director and the movie they directed
  director_uri = create_uri(director_name)
  movie_uri = create_uri(row["Series_Title"])

  # Add director triples
  g.add((director_uri, RDF.type, Director)) # define director's Class
  g.add((director_uri, hasName, Literal(director_name)))

  # Link the movie to its director
  g.add((movie_uri, hasDirector, director_uri))


In [16]:
print(g.serialize(format="ttl"))

@prefix movies: <http://example.org/movies/> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix sdo: <https://schema.org/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

movies:Actor a rdfs:Class ;
    rdfs:subClassOf movies:Person .

movies:Director a rdfs:Class ;
    rdfs:subClassOf movies:Person .

movies:Movie a rdfs:Class ;
    rdfs:subClassOf sdo:Movie .

movies:Person a rdfs:Class ;
    rdfs:subClassOf sdo:Person .

movies:12AngryMen a movies:Movie ;
    movies:hasDirector movies:SidneyLumet ;
    movies:hasGrossRevenue 4360000 ;
    movies:hasImdbRating 9.0 ;
    movies:hasRuntime 96 ;
    movies:hasTitle "12 Angry Men" ;
    movies:releasedInYear "1957"^^xsd:gYear .

movies:12YearsaSlave a movies:Movie ;
    movies:hasDirector movies:SteveMcQueen ;
    movies:hasGrossRevenue 56671993 ;
    movies:hasImdbRating 8.1 ;
    movies:hasRuntime 134 ;
    movies:hasTitle "12 Years a Slave" ;
    movies:releasedInYea



---


Sometimes, multiple columns represent the same kind of entity. In our case, the `Star1`, `Star2`, `Star3`, and `Star4` columns can all be added as `Actor` entities. Instead of repeating the same code for each column, we can use a nested loop to add them all at once.

In [17]:
for index, row in df.iterrows():

  # Get the URI of this movie once per row
  movie_uri = create_uri(row["Series_Title"])

  # Nested loop that goes over column names
  for star_col in ["Star1", "Star2", "Star3", "Star4"]:
    actor_name = row[star_col]

    # Get URI for this actor
    actor_uri = create_uri(actor_name)

    # Add actor triples and link the movie to the actor
    g.add((actor_uri, RDF.type, Actor))
    g.add((actor_uri, hasName, Literal(actor_name)))
    g.add((movie_uri, hasActor, actor_uri))

print(g.serialize(format="ttl"))

@prefix movies: <http://example.org/movies/> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix sdo: <https://schema.org/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

movies:Actor a rdfs:Class ;
    rdfs:subClassOf movies:Person .

movies:Director a rdfs:Class ;
    rdfs:subClassOf movies:Person .

movies:Movie a rdfs:Class ;
    rdfs:subClassOf sdo:Movie .

movies:Person a rdfs:Class ;
    rdfs:subClassOf sdo:Person .

movies:12AngryMen a movies:Movie ;
    movies:hasActor movies:HenryFonda,
        movies:JohnFiedler,
        movies:LeeJCobb,
        movies:MartinBalsam ;
    movies:hasDirector movies:SidneyLumet ;
    movies:hasGrossRevenue 4360000 ;
    movies:hasImdbRating 9.0 ;
    movies:hasRuntime 96 ;
    movies:hasTitle "12 Angry Men" ;
    movies:releasedInYear "1957"^^xsd:gYear .

movies:12YearsaSlave a movies:Movie ;
    movies:hasActor movies:BradPitt,
        movies:ChiwetelEjiofor,
        movies:M

On the other hand, a single column may contain multiple values that should be added to the graph separately. The `Genre` column, for example, can hold several genres separated by commas. We can split the string and add each item, again with a nested loop.

In [18]:
for index, row in df.iterrows():

  # Get the URI of this movie once per row
  movie_uri = create_uri(row["Series_Title"])

  # Convert to a list
  genres = row["Genre"].split(",")

  # Nested loop going through all genres of this movie
  for genre in genres:
    g.add((movie_uri, hasGenre, Literal(genre.strip())))

print(g.serialize(format="ttl"))

@prefix movies: <http://example.org/movies/> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix sdo: <https://schema.org/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

movies:Actor a rdfs:Class ;
    rdfs:subClassOf movies:Person .

movies:Director a rdfs:Class ;
    rdfs:subClassOf movies:Person .

movies:Movie a rdfs:Class ;
    rdfs:subClassOf sdo:Movie .

movies:Person a rdfs:Class ;
    rdfs:subClassOf sdo:Person .

movies:12AngryMen a movies:Movie ;
    movies:hasActor movies:HenryFonda,
        movies:JohnFiedler,
        movies:LeeJCobb,
        movies:MartinBalsam ;
    movies:hasDirector movies:SidneyLumet ;
    movies:hasGenre "Crime",
        "Drama" ;
    movies:hasGrossRevenue 4360000 ;
    movies:hasImdbRating 9.0 ;
    movies:hasRuntime 96 ;
    movies:hasTitle "12 Angry Men" ;
    movies:releasedInYear "1957"^^xsd:gYear .

movies:12YearsaSlave a movies:Movie ;
    movies:hasActor movies:BradPitt,
 

In [19]:
g.serialize(destination="movie_populated.ttl", format="ttl")

<Graph identifier=N9633fc6beed04792a9ae98d2d3c5cac9 (<class 'rdflib.graph.Graph'>)>

With these examples, we have populated our graph with information about 714 movies. We can fully leverage this large graph with `SPARQL` queries in the following section.

<ins>**Note:**</ins> In Protégé, you can check the specific instances you've added. Definitely take a look if you are curious.


---


## **Part 2:** Basic `SPARQL`
`SPARQL` (SPARQL Protocol and RDF Query Language) is the standard query language for `RDF` graphs. Think of it as `SQL` for graph data. Instead of selecting from tables and rows, `SPARQL` allows you to describe the shape or pattern of the data you're looking for within the graph. You can read the official W3C specification for `SPARQL 1.1` [here](https://www.w3.org/TR/sparql11-query/).


---


### **Anatomy of a `SPARQL` Query**
A basic query has a few key components.

```sparql
PREFIX movies: <http://example.org/movies/>

SELECT ?title ?year
WHERE {
  ?movie movies:hasTitle ?title ;
         movies:releasedInYear ?year .
}
```
*   **`PREFIX`**: This is a shortcut. It lets us write `movies:hasTitle` instead of `<http://example.org/movies/hasTitle>`.

*   **`SELECT`**: This specifies which variables you want in your results. A variable in `SPARQL` always starts with a **?**.

*   **`WHERE`**: This is the structural part of the query. It contains the triple patterns that `SPARQL` will try to match against the graph.

`SPARQL` may not feel intuitive at first. A good way to get used to it is to study the structure of your target graph(s) beforehand, which makes it much easier to write patterns that reach the values you are looking for.


---


### **Query 1:** `SELECT` for Basic Viewing
Let's start with the simplest query to learn `rdflib` syntax for querying.

In [None]:
# Select the first five movies' titles from our dataset
q1 = """
PREFIX movies: <http://example.org/movies/>

SELECT ?title
WHERE {

  ?movie movies:hasTitle ?title .

} LIMIT 5

"""

for row in g.query(q1):
  print(row.title)

Inception
The Shawshank Redemption
The Godfather
The Dark Knight
The Godfather: Part II


Write your query as a multi-line string, then pass it to the `Graph.query()` method to get the results as an iterable. Note that we used `LIMIT` to return only five results, which is useful when querying large graphs.


---


### **Query 2:** `FILTER` for Finding Specific Information
There are two ways to filter data with `SPARQL`. When looking for exact values, we can state them explicitly in the object position of a triple pattern. For example, let's list the first 10 genre values of movies released in the year 2000.

In [21]:
# Query for the first 10 genre values of movies released in the year 2000
q2 = """
PREFIX movies: <http://example.org/movies/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

SELECT ?genre
WHERE {
  ?movie movies:hasGenre ?genre ;
         movies:releasedInYear "2000"^^xsd:gYear . # the value and data type
} LIMIT 10
"""

print("Genres of movies released in 2000 (first 10):")
for row in g.query(q2):
    print(row.genre)

Genres of movies released in 2000 (first 10):
Action
Adventure
Drama
Mystery
Thriller
Comedy
Crime
Drama
Drama
Thriller


For conditional filtering, we use the `FILTER` keyword. It lets us keep only results whose values are greater or less than a certain amount. Let's query for the titles and ratings of the first 5 movies with gross revenue higher than 500 million dollars.

In [22]:
# Query for the first 5 movies' titles and ratings with gross revenue higher than 500 million dollars
q3 = """
PREFIX movies: <http://example.org/movies/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

SELECT ?title ?rating
WHERE {

  ?movie movies:hasTitle ?title ;
         movies:hasImdbRating ?rating ;
         movies:hasGrossRevenue ?revenue .

  FILTER( ?revenue > 500000000 ) # conditional filtering

} LIMIT 5
"""

# Execute the query and print the results
print("Movies with gross revenue > $500,000,000 (first 5):")
for row in g.query(q3):
    print(f"Title: {row.title}, Rating: {row.rating}")

Movies with gross revenue > $500,000,000 (first 5):
Title: Inception, Rating: 8.8
Title: The Dark Knight, Rating: 9.0
Title: Avengers: Endgame, Rating: 8.4
Title: Avengers: Infinity War, Rating: 8.4
Title: The Avengers, Rating: 8.0




---


### **Exercise 3:** Combining Multiple `FILTER` Methods
We can also combine both techniques to refine our search. Moreover, multiple conditions can be combined inside a single `FILTER` with the `&&` (logical AND) operator. Fill in the `WHERE` clause of the queries below to practice targeted querying.

#### **Exercise 3.1:** Find 5 lengthy films (and their release years) that are both critically and commercially successful.

Criteria:
*   Runtime greater than 180 minutes.
*   IMDb Rating higher than 8.5.
*   Gross Revenue over $100,000,000.

In [23]:
# Fill in the WHERE clause and FILTER conditions for this query
q_3_1 = """
PREFIX movies: <http://example.org/movies/>

SELECT ?title ?year
WHERE {
    # Add your triple patterns here
    ?movie movies:hasTitle ?title;
          movies:hasImdbRating ?rating ;
          movies:hasRuntime ?rt;
          movies:releasedInYear ?year;
          movies:hasGrossRevenue ?revenue .

    # Add your filter conditions here
    FILTER(?rating>8.5 && ?rt>180 && ?revenue>100000000)
} LIMIT 5
"""

print("--- Blockbusters ---")
for row in g.query(q_3_1):
  print(f"{row.title} (Year: {row.year})")

--- Blockbusters ---
The Lord of the Rings: The Return of the King (Year: 2003)
The Green Mile (Year: 1999)


#### **Exercise 3.2:** Find up to 5 films in which both Leonardo DiCaprio and Johnny Depp acted and which grossed less than 10,000,000 dollars.

Criteria:
*   Must have Leonardo DiCaprio as an actor.
*   Must have Johnny Depp as an actor.
*   Must have gross revenue less than 10,000,000.

In [24]:
# Fill in the WHERE clause and FILTER conditions for this query
q_3_2 = """
PREFIX movies: <http://example.org/movies/>

SELECT ?title
WHERE {
    # Add your triple patterns here, do not forget to match exact values
    ?movie movies:hasTitle ?title ;
           movies:hasGrossRevenue ?revenue ;
           movies:hasActor ?actor1, ?actor2 .

    # Names were added as plain literals, so match them without a datatype
    ?actor1 movies:hasName "Leonardo DiCaprio" .
    ?actor2 movies:hasName "Johnny Depp" .

    # Add your filter conditions here
    FILTER(?revenue < 10000000)
} LIMIT 5
"""

print("--- Definitely Made a Loss ---")
results = list(g.query(q_3_2))
for row in results:
  print(f"{row.title}")

--- Definitely Made a Loss ---
