**Setup**

In [4]:
# Library
import pandas as pd
from sqlalchemy import create_engine

In [5]:
# Define the database connection parameters
db_params = {
    'host': 'localhost',
    'database': 'dvdrental',
    'user': 'postgres',
    'password': 'admin',
    'port': '5432'  # PostgreSQL default port
}

# Connect to the 'dvdrental' database
engine = create_engine(f'postgresql://{db_params["user"]}:{db_params["password"]}@{db_params["host"]}:{db_params["port"]}/{db_params["database"]}')
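
The connection string interpolates the password as-is, which breaks if it contains characters such as `@` or `/`. As a minimal sketch, the URL can be assembled with the credentials percent-encoded and the port spelled out. The helper name `build_postgres_url` is hypothetical and not part of SQLAlchemy.

```python
from urllib.parse import quote_plus


def build_postgres_url(params):
    """Build a postgresql:// URL from a dict shaped like db_params (hypothetical helper)."""
    # Percent-encode credentials so characters like '@' or '/' cannot corrupt the URL
    user = quote_plus(params["user"])
    password = quote_plus(params["password"])
    return (
        f"postgresql://{user}:{password}"
        f"@{params['host']}:{params['port']}/{params['database']}"
    )


# Same values as db_params in the setup cell
example_params = {
    "host": "localhost",
    "database": "dvdrental",
    "user": "postgres",
    "password": "admin",
    "port": "5432",
}
url = build_postgres_url(example_params)
```

The resulting string can be passed to `create_engine()` in place of the f-string.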

**What is a tsvector?**

You saw how to convert strings to `tsvector` and `tsquery` in the video. This exercise looks more closely at what `to_tsvector()` actually returns: a sorted list of normalized words (lexemes), each paired with its position in the original text. You will convert the `description` column of the `film` table to a `tsvector` and inspect the results. Understanding how full-text search works is a first step toward more advanced machine learning and data science concepts such as natural language processing.

**Instructions**

- Select the film description and convert it to a `tsvector` data type.

In [11]:
query = """
-- Select the film description as a tsvector
SELECT to_tsvector(description)
FROM film;
"""
result = pd.read_sql_query(query, engine)
result

Unnamed: 0,to_tsvector
0,'fate':2 'husband':9 'monkey':14 'moos':6 'mus...
1,'australia':16 'cat':6 'drama':3 'epic':2 'exp...
2,'ancient':16 'confront':12 'epic':2 'girl':9 '...
3,'boat':18 'conquer':12 'fate':2 'feminist':9 '...
4,'battl':13 'canadian':18 'drama':3 'epic':2 'f...
...,...
995,'administr':10 'boat':6 'boy':15 'databas':9 '...
996,'boat':20 'cat':9 'challeng':12 'drama':3 'mus...
997,'boy':14 'canadian':17 'compos':6 'face':12 'f...
998,'ancient':17 'boat':9 'china':18 'discov':12 '...
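
Each row of the output pairs a lexeme with its word position in the description. The following toy sketch imitates that shape in plain Python, assuming a tiny hand-picked stop-word list and no stemming. PostgreSQL's real parser also stems words (for example `battle` becomes `battl`) and handles hyphenated words differently, so this is only an illustration. The function names and the sample sentence are made up.

```python
import re
from collections import defaultdict

# Assumption: a very small subset of English stop words, for illustration only
STOP_WORDS = {"a", "an", "and", "of", "the", "in"}


def toy_tsvector(text):
    """Map each non-stop word to its 1-based positions, sorted by word."""
    positions = defaultdict(list)
    words = re.findall(r"[a-z0-9]+", text.lower())
    for pos, word in enumerate(words, start=1):
        # Stop words are dropped, but they still consume a position
        if word not in STOP_WORDS:
            positions[word].append(pos)
    return dict(sorted(positions.items()))


def format_tsvector(vec):
    """Render the mapping in the 'word':pos style shown in the output above."""
    return " ".join(
        f"'{word}':{','.join(str(p) for p in pos)}" for word, pos in vec.items()
    )


print(format_tsvector(toy_tsvector("A Epic Drama of a Cat")))
# prints: 'cat':6 'drama':3 'epic':2
```

Note how the positions skip over the dropped stop words, which is why the real output starts at position 2 for many films.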


**Basic full-text search**

Searching text will become something you do repeatedly when building applications or exploring data sets for data science. Full-text search is helpful when performing exploratory data analysis for a natural language processing model or building a search feature into your application.

In this exercise, you will practice searching a text column by matching it against a string. For a whole word like the one used here, the search returns results comparable to a query that uses the `LIKE` operator with the `%` wildcard at the beginning and end of the string. The difference is that full-text search matches normalized words rather than raw substrings, can perform much better when backed by a full-text index, and gives you a foundation for more advanced full-text search queries. Let's dive in.

**Instructions**

- Select the `title` and `description` columns from the `film` table.
- Perform a full-text search on the `title` column for the word `elf`.

In [13]:
query = """
-- Select the title and description
SELECT title, description
FROM film
-- Convert the title to a tsvector and match it against the tsquery 
WHERE to_tsvector(title) @@ to_tsquery('elf');
"""
result = pd.read_sql_query(query, engine)
result

Unnamed: 0,title,description
0,Elf Murder,A Action-Packed Story of a Frisbee And a Woman...
1,Encino Elf,A Astounding Drama of a Feminist And a Teacher...
2,Ghostbusters Elf,A Thoughtful Epistle of a Dog And a Feminist w...
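
One caveat when comparing full-text search with wildcard matching: a `tsquery` matches whole words, while a `%elf%` pattern matches any substring. The sketch below contrasts the two in plain Python, ignoring stemming. The first three titles come from the output above; `Shelf Life` is a hypothetical title added to show the difference, and both function names are illustrative.

```python
import re

titles = [
    "Elf Murder",
    "Encino Elf",
    "Ghostbusters Elf",
    "Shelf Life",  # hypothetical title, not taken from the film table
]


def word_match(title, term):
    """Rough stand-in for to_tsvector(title) @@ to_tsquery(term), without stemming."""
    return term.lower() in re.findall(r"[a-z0-9]+", title.lower())


def substring_match(title, term):
    """Rough stand-in for a case-insensitive '%term%' wildcard match."""
    return term.lower() in title.lower()


whole_word_hits = [t for t in titles if word_match(t, "elf")]
substring_hits = [t for t in titles if substring_match(t, "elf")]
```

Here `whole_word_hits` excludes `Shelf Life`, while `substring_hits` includes it.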


**User-defined data types**

`ENUM` or enumerated data types are great options to use in your database when you have a column where you want to store a fixed list of values that rarely change. Examples of when it would be appropriate to use an `ENUM` include days of the week and states or provinces in a country.

Another example is the directions on a compass (i.e., north, south, east, and west). In this exercise, you are going to create a new `ENUM` data type called `compass_position`.

**Instructions**

- Create a new enumerated data type called `compass_position`.
- Use the four positions of a compass as the values.
- Verify that the new data type has been created by looking in the `pg_type` system table.

In [15]:
query = """
-- Create an enumerated data type, compass_position
CREATE TYPE compass_position AS ENUM (
  -- Use the four cardinal directions
  'North',
  'South',
  'East',
  'West'
);
-- Confirm the new data type is in the pg_type system table
SELECT *
FROM pg_type
WHERE typname='compass_position';
"""
result = pd.read_sql_query(query, engine)
result

Unnamed: 0,oid,typname,typnamespace,typowner,typlen,typbyval,typtype,typcategory,typispreferred,typisdefined,...,typalign,typstorage,typnotnull,typbasetype,typtypmod,typndims,typcollation,typdefaultbin,typdefault,typacl
0,16980,compass_position,2200,10,4,True,e,E,False,True,...,i,p,False,0,-1,0,0,,,
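
`CREATE TYPE` has no `IF NOT EXISTS` clause, so re-running the cell above fails once `compass_position` exists. One possible workaround, sketched here under the assumption that checking `pg_type` by name is good enough (it ignores schemas), is to wrap the statement in a `DO` block. The helper name `create_enum_if_missing_sql` is hypothetical.

```python
import re


def create_enum_if_missing_sql(type_name, labels):
    """Return a DO block that creates the ENUM only when pg_type has no such name."""
    # Accept only simple lowercase identifiers so the name can be embedded safely
    if not re.fullmatch(r"[a-z_][a-z0-9_]*", type_name):
        raise ValueError(f"unsupported type name: {type_name!r}")
    # Double any single quotes inside labels, as SQL string literals require
    quoted = ", ".join("'" + label.replace("'", "''") + "'" for label in labels)
    return (
        "DO $$\n"
        "BEGIN\n"
        f"  IF NOT EXISTS (SELECT 1 FROM pg_type WHERE typname = '{type_name}') THEN\n"
        f"    CREATE TYPE {type_name} AS ENUM ({quoted});\n"
        "  END IF;\n"
        "END\n"
        "$$;"
    )


sql = create_enum_if_missing_sql(
    "compass_position", ["North", "South", "East", "West"]
)
```

Because the generated statement returns no rows, it would be run with a connection's `execute()` rather than `pd.read_sql_query`.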


**Getting info about user-defined data types**

The Sakila database has a user-defined `enum` data type called `mpaa_rating`. The `rating` column in the `film` table is an `mpaa_rating` type and contains the familiar rating for that film like PG or R. This is a great example of when an enumerated data type comes in handy. Film ratings have a limited number of standard values that rarely change.

When you want to learn about a column or data type in your database, the best place to start is the `INFORMATION_SCHEMA`. There you can find information about the `rating` column that tells you what type of data to expect. For `enum` data types, you can also find the specific values that are valid for a particular `enum` by looking in the `pg_enum` system table. Let's dive into the exercises and learn more.

**Instructions**

- Select the `column_name`, `data_type`, `udt_name`.
- Filter for the `rating` column in the `film` table.

In [16]:
query = """
-- Select the column name, data type and udt name columns
SELECT column_name, data_type, udt_name
FROM INFORMATION_SCHEMA.COLUMNS 
-- Filter by the rating column in the film table
WHERE table_name ='film' AND column_name='rating';
"""
result = pd.read_sql_query(query, engine)
result

Unnamed: 0,column_name,data_type,udt_name
0,rating,USER-DEFINED,mpaa_rating


- Select all columns from the `pg_type` table where the type name is equal to `mpaa_rating`.

In [17]:
query = """
SELECT *
FROM pg_type 
WHERE typname='mpaa_rating'
"""
result = pd.read_sql_query(query, engine)
result

Unnamed: 0,oid,typname,typnamespace,typowner,typlen,typbyval,typtype,typcategory,typispreferred,typisdefined,...,typalign,typstorage,typnotnull,typbasetype,typtypmod,typndims,typcollation,typdefaultbin,typdefault,typacl
0,16486,mpaa_rating,2200,10,4,True,e,E,False,True,...,i,p,False,0,-1,0,0,,,
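
The text above mentions `pg_enum`, but none of the exercise steps query it. As a sketch, the valid labels of `mpaa_rating` could be listed by joining `pg_enum` to `pg_type` on the type's `oid`. The names `ENUM_LABELS_QUERY` and `fetch_enum_labels` are illustrative, and the function assumes an `engine` like the one created in the setup cell.

```python
# pg_enum stores one row per label; enumtypid points at the owning row in pg_type
ENUM_LABELS_QUERY = """
SELECT t.typname, e.enumlabel, e.enumsortorder
FROM pg_enum AS e
    INNER JOIN pg_type AS t ON t.oid = e.enumtypid
WHERE t.typname = 'mpaa_rating'
ORDER BY e.enumsortorder;
"""


def fetch_enum_labels(engine):
    """Run the query against a live database and return the labels as a DataFrame."""
    import pandas as pd  # imported here so the query string can be inspected without pandas

    return pd.read_sql_query(ENUM_LABELS_QUERY, engine)
```

Calling `fetch_enum_labels(engine)` in the notebook would return one row per rating label, in declaration order.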


**User-defined functions in Sakila**

If you were running a real-life DVD Rental store, there are many questions that you may need to answer repeatedly like whether a film is in stock at a particular store or the outstanding balance for a particular customer. These types of scenarios are where user-defined functions will come in very handy. The Sakila database has several user-defined functions pre-defined. These functions are available out-of-the-box and can be used in your queries like many of the built-in functions we've learned about in this course.

In this exercise, you will build a query step-by-step that can be used to produce a report to determine which film title is currently held by which customer using the `inventory_held_by_customer()` function.

**Instructions**

- Select the `title` and `inventory_id` columns from the `film` and `inventory` tables in the database.

In [18]:
query = """
-- Select the film title and inventory ids
SELECT
    f.title,
    i.inventory_id
FROM film AS f
    -- Join the film table to the inventory table
    INNER JOIN inventory AS i ON f.film_id = i.film_id;
"""
result = pd.read_sql_query(query, engine)
result

Unnamed: 0,title,inventory_id
0,Academy Dinosaur,1
1,Academy Dinosaur,2
2,Academy Dinosaur,3
3,Academy Dinosaur,4
4,Academy Dinosaur,5
...,...,...
4576,Zorro Ark,4577
4577,Zorro Ark,4578
4578,Zorro Ark,4579
4579,Zorro Ark,4580


- Use the `inventory_held_by_customer()` function to determine whether each `inventory_id` is currently held by a customer, and alias the column as `held_by_cust`.

In [19]:
query = """
-- Select the film title and inventory ids
SELECT
    f.title,
    i.inventory_id,
    -- Determine whether the inventory is held by a customer
    inventory_held_by_customer(i.inventory_id) AS held_by_cust
FROM film AS f
    INNER JOIN inventory AS i ON f.film_id = i.film_id;
"""
result = pd.read_sql_query(query, engine)
result

Unnamed: 0,title,inventory_id,held_by_cust
0,Academy Dinosaur,1,
1,Academy Dinosaur,2,
2,Academy Dinosaur,3,
3,Academy Dinosaur,4,
4,Academy Dinosaur,5,
...,...,...,...
4576,Zorro Ark,4577,
4577,Zorro Ark,4578,
4578,Zorro Ark,4579,
4579,Zorro Ark,4580,


- Now filter your query to only return records where the `inventory_held_by_customer()` function returns a non-null value.

In [20]:
query = """
-- Select the film title and inventory ids
SELECT
    f.title,
    i.inventory_id,
    -- Determine whether the inventory is held by a customer
    inventory_held_by_customer(i.inventory_id) AS held_by_cust
FROM film AS f
    INNER JOIN inventory AS i ON f.film_id = i.film_id
WHERE
    -- Only include results where the held_by_cust is not null
    inventory_held_by_customer(i.inventory_id) IS NOT NULL;
"""
result = pd.read_sql_query(query, engine)
result

Unnamed: 0,title,inventory_id,held_by_cust
0,Academy Dinosaur,6,554
1,Ace Goldfinger,9,366
2,Affair Prejudice,21,111
3,African Egg,25,590
4,Ali Forever,70,108
...,...,...,...
178,Wild Apollo,4460,274
179,Window Side,4472,374
180,Women Dorado,4496,216
181,World Leathernecks,4537,532
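
To make the behavior of `inventory_held_by_customer()` concrete, here is a plain-Python sketch. It assumes the function works as in the standard Sakila schema: it returns the `customer_id` of a rental with no return date for the given inventory item, or `NULL` otherwise. The rental rows are illustrative: the pairs 6/554 and 9/366 appear in the output above, while the returned rental for inventory 1 is hypothetical.

```python
from datetime import date

# Illustrative rental rows: (inventory_id, customer_id, return_date)
rentals = [
    (1, 130, date(2005, 5, 26)),  # hypothetical rental that was already returned
    (6, 554, None),               # still out, matches the output above
    (9, 366, None),               # still out, matches the output above
]


def toy_inventory_held_by_customer(inventory_id, rows):
    """Return the customer holding the item, or None when it is not rented out."""
    for inv_id, customer_id, return_date in rows:
        # An open rental is one whose return_date has not been filled in yet
        if inv_id == inventory_id and return_date is None:
            return customer_id
    return None
```

With this logic, filtering on a non-null result keeps only items that are currently rented out, which is why the final query returns far fewer rows than the join alone.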


**Enabling extensions**

Before you can use the capabilities of an extension, it must be enabled. As you have previously learned, most PostgreSQL distributions come bundled with many useful extensions that extend the native features of your database. You will be working with `fuzzystrmatch` and `pg_trgm` in upcoming exercises, but first you need to make sure they are enabled in your database. In this exercise, you will enable the `pg_trgm` extension and confirm that the `fuzzystrmatch` extension, which was enabled in the video, is still enabled by querying the `pg_extension` system table.

**Instructions**

- Enable the `pg_trgm` extension

In [22]:
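from sqlalchemy import text
query = """
-- Enable the pg_trgm extension
CREATE EXTENSION IF NOT EXISTS pg_trgm;
"""
# DDL returns no rows, so execute it directly instead of using pd.read_sql_query
with engine.begin() as conn:
    conn.execute(text(query))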

In [23]:
query = """
-- Select all rows from the pg_extension system table
SELECT * 
FROM pg_extension;
"""
result = pd.read_sql_query(query, engine)
result

Unnamed: 0,oid,extname,extowner,extnamespace,extrelocatable,extversion,extconfig,extcondition
0,13535,plpgsql,10,11,False,1.0,,
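
The recorded output above lists only `plpgsql`, which suggests that neither `fuzzystrmatch` nor `pg_trgm` was enabled when the cell ran. A small sketch for spotting such gaps compares the `extname` values with the extensions the exercises need and generates the statements still to be run. The helper name `missing_extension_statements` is hypothetical.

```python
def missing_extension_statements(installed, required):
    """Return CREATE EXTENSION statements for required extensions not yet installed."""
    installed_set = set(installed)
    return [
        f"CREATE EXTENSION IF NOT EXISTS {name};"
        for name in required
        if name not in installed_set
    ]


# 'plpgsql' is the only extname in the output above
statements = missing_extension_statements(
    ["plpgsql"], ["fuzzystrmatch", "pg_trgm"]
)
```

In the notebook, the `installed` list could come from `result["extname"]` after querying `pg_extension`.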


**Putting it all together**

In this exercise, we are going to use many of the techniques and concepts we learned throughout the course to generate a data set that we could use to predict whether the words and phrases used to describe a film have an impact on the number of rentals.

First, you need to create a `tsvector` from the `description` column in the `film` table. You will match against a `tsquery` to determine if the phrase "Astounding Drama" leads to more rentals per month. Next, create a new column using the `similarity` function, which is provided by the `pg_trgm` extension, to rank the film descriptions based on this phrase.

**Instructions**

- Select the title and description for all DVDs from the `film` table.
- Perform a full-text search by converting the description to a `tsvector` and match it to the phrase `'Astounding & Drama'` using a `tsquery` in the `WHERE` clause.

In [26]:
query = """
-- Select the title and description columns
SELECT  
  title, 
  description 
FROM 
  film 
WHERE 
  -- Match "Astounding Drama" in the description
  to_tsvector(description) @@ 
  to_tsquery('Astounding & Drama');
"""
result = pd.read_sql_query(query, engine)
result

Unnamed: 0,title,description
0,Bikini Borrowers,A Astounding Drama of a Astronaut And a Cat wh...
1,Campus Remember,A Astounding Drama of a Crocodile And a Mad Co...
2,Cowboy Doom,A Astounding Drama of a Boy And a Lumberjack w...
3,Encino Elf,A Astounding Drama of a Feminist And a Teacher...
4,Glass Dying,A Astounding Drama of a Frisbee And a Astronau...
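
Note that `'Astounding & Drama'` asks for both words to appear anywhere in the description, not necessarily next to each other. PostgreSQL also offers the `<->` operator in a `tsquery` for adjacent words. The sketch below contrasts the two ideas in plain Python, ignoring stemming and stop words. The second description is hypothetical, and the function names are illustrative.

```python
import re


def word_positions(text):
    """Map each lowercased word to the list of its 1-based positions."""
    positions = {}
    for pos, word in enumerate(re.findall(r"[a-z0-9]+", text.lower()), start=1):
        positions.setdefault(word, []).append(pos)
    return positions


def matches_and(text, first, second):
    """Rough stand-in for 'first & second': both words occur somewhere."""
    positions = word_positions(text)
    return first in positions and second in positions


def matches_adjacent(text, first, second):
    """Rough stand-in for 'first <-> second': second directly follows first."""
    positions = word_positions(text)
    following = set(positions.get(second, []))
    return any(pos + 1 in following for pos in positions.get(first, []))


real_style = "A Astounding Drama of a Boy And a Lumberjack"
hypothetical = "A Astounding Tale of a Drama Student"  # made-up description
```

Both checks accept `real_style`, but only `matches_and` accepts `hypothetical`, since its two words are not adjacent.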


- Add a new column that calculates the similarity of the description with the phrase 'Astounding Drama'.
- Sort the results by the new similarity column in descending order.

In [None]:
query = """
SELECT 
  title, 
  description, 
  -- Calculate the similarity
  similarity(description, 'Astounding Drama') 
FROM 
  film 
WHERE 
  to_tsvector(description) @@ 
  to_tsquery('Astounding & Drama') 
ORDER BY 
  similarity(description, 'Astounding Drama') DESC;
"""
result = pd.read_sql_query(query, engine)
result
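
`similarity()` comes from `pg_trgm`, so the query above only runs once that extension is enabled. To give a feel for what it computes, here is a pure-Python approximation: each word is lowercased and padded with two leading spaces and one trailing space, then split into three-character sequences, and the score is the share of trigrams the two strings have in common. This is a sketch for ASCII text, not the extension's actual implementation, and PostgreSQL returns a single-precision value, so results may differ in the last digits.

```python
import re


def trigrams(text):
    """Return the set of trigrams, padding each word like pg_trgm does."""
    grams = set()
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        padded = f"  {word} "  # two spaces before, one after
        grams.update(padded[i:i + 3] for i in range(len(padded) - 2))
    return grams


def toy_similarity(a, b):
    """Shared trigrams divided by all distinct trigrams of both strings."""
    grams_a, grams_b = trigrams(a), trigrams(b)
    if not grams_a or not grams_b:
        return 0.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)


score = toy_similarity("word", "two words")  # 4 shared trigrams out of 11
```

Because long descriptions contain many trigrams that are not in "Astounding Drama", their scores stay well below 1 even when the phrase appears verbatim.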