<div>
    <img style="float:right;" src="images/smi-logo.png"/>
    <div style="float:left;color:#58288C;"><h1>Relational Databases and Data Warehousing</h1></div>
</div>

---
# Notebook 4: Full text index
In this notebook you learn to add a full-text index to the existing SQLite database. We'll also take a brief look under the hood to understand how it differs from a regular table.

Requirements:
- You have completed the SQL tutorial
- You have completed Notebook #2
---

Let's load our existing database.

In [93]:
%load_ext sql
%sql sqlite:///my-database.db

The sql extension is already loaded. To reload it, use:
  %reload_ext sql


---
## <span style="color:#FF5D02;">Assignment: Create an FTS5 full-text index table for reviews and run some queries</span>
Create a full-text index and import the reviews from the table ``mock_reviews``. If unsure how to proceed, refer to the tutorial at https://www.sqlitetutorial.net/sqlite-full-text-search/.

In [181]:
%%sql

drop table if exists fts_reviews;
create virtual table fts_reviews using fts5(id, product_id, title,content);
insert into fts_reviews (id, product_id, title, content) select id,product_id,title,content from mock_reviews;

 * sqlite:///my-database.db
Done.
Done.
107 rows affected.


[]

Now search for some terms like beautiful, problem, etc.

In [None]:
%sql SELECT * ...
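
If the query syntax is unclear, here is a minimal sketch of how `MATCH` queries behave, written with Python's built-in `sqlite3` module. It runs against a throwaway in-memory table, not against `my-database.db`: the table name `demo_reviews`, its three rows, and the helper `search` are made up for illustration. It also assumes that your Python's SQLite library was compiled with FTS5, which is usually the case.

```python
import sqlite3

# Toy in-memory data; these rows are invented and are NOT the mock_reviews content.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE demo_reviews USING fts5(title, content)")
conn.executemany(
    "INSERT INTO demo_reviews (title, content) VALUES (?, ?)",
    [
        ("Great necklace", "The silver is beautiful and elegant."),
        ("Zipper broke", "I had a problem with the zipper after a week."),
        ("Nice dress", "Beautiful colour, no problem at all."),
    ],
)


def search(query):
    """Return the rowids of all rows matching an FTS5 query, best match first."""
    rows = conn.execute(
        "SELECT rowid FROM demo_reviews WHERE demo_reviews MATCH ? ORDER BY rank",
        (query,),
    )
    return [rowid for (rowid,) in rows]


print(search("beautiful"))              # single term, searched in all columns
print(search("beautiful AND problem"))  # both terms must occur in the same row
print(search("content:problem"))        # restrict the search to one column
print(search("title:zipper"))
```

The same `MATCH` expressions should work in a `%sql` cell against `fts_reviews`. Matching is case-insensitive with the default tokenizer, so 'Beautiful' and 'beautiful' are the same token.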

## Let's look under the hood
We can access the index directly by creating another virtual table of type ``fts5vocab``. It contains all tokens together with a pointer to where each one appears in the original data, like the index at the back of a big book.

In [182]:
%%sql 

DROP TABLE IF EXISTS fts_tokens;
create virtual table fts_tokens using fts5vocab(fts_reviews,instance);

 * sqlite:///my-database.db
Done.
Done.


[]

Let's analyze this! How would the full-text index find the word 'beautiful' in the column 'content'? It looks the term up in its index. We can do this manually:

In [183]:
%sql select * from fts_tokens WHERE term='beautiful' and col='content' order by doc;

 * sqlite:///my-database.db
Done.


term,doc,col,offset
beautiful,26,content,14
beautiful,39,content,20
beautiful,43,content,3
beautiful,53,content,6
beautiful,63,content,19
beautiful,68,content,7
beautiful,96,content,12


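OK, the doc column gives us the rowid of the document in ``fts_reviews``, which plays the same role as the primary key 'id' did in the regular table. The offset column is the position of the token within that column, counted from 0. Let's check for the first three IDs whether 'beautiful' is actually included...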

In [184]:
%sql select rowid, content from fts_reviews where rowid in (26,39,43);

 * sqlite:///my-database.db
Done.


rowid,content
26,I got this necklace for my sister and it was perfect. The silver is beautiful and the design is very elegant. She loves it!
39,This Kids Dress is adorable! My daughter loves it and she looks so cute in it. The lace detail is beautiful and the color is perfect. She wore this dress to her friend's birthday party last week and she got lots of compliments!
43,This Necklace is beautiful - my girlfriend loves it! It's made of high quality materials which makes it look expensive even though it wasn't very expensive at all.
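
The two manual steps above (look the term up in the index, then fetch the documents by rowid) can be sketched end to end on a tiny example. This is only an illustration under assumptions: the tables `demo_docs` and `demo_tokens` and their rows are invented, and the code assumes an FTS5-enabled SQLite build behind Python's `sqlite3` module.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Invented toy documents, small enough to count the token positions by hand.
conn.execute("CREATE VIRTUAL TABLE demo_docs USING fts5(content)")
conn.executemany(
    "INSERT INTO demo_docs (content) VALUES (?)",
    [
        ("The silver is beautiful",),
        ("A plain grey scarf",),
        ("A beautiful, beautiful dress",),
    ],
)

# One row per token occurrence, as in the fts_tokens table of this notebook.
conn.execute("CREATE VIRTUAL TABLE demo_tokens USING fts5vocab(demo_docs, instance)")

# Step 1: look the term up in the index.
hits = conn.execute(
    'SELECT doc, col, "offset" FROM demo_tokens WHERE term = ? ORDER BY doc, "offset"',
    ("beautiful",),
).fetchall()

# Step 2: fetch the matching documents by rowid.
doc_ids = sorted({doc for doc, _col, _offset in hits})
marks = ", ".join("?" for _ in doc_ids)
texts = conn.execute(
    f"SELECT rowid, content FROM demo_docs WHERE rowid IN ({marks}) ORDER BY rowid",
    doc_ids,
).fetchall()

print(hits)
print(texts)
```

Each hit is a `(doc, col, offset)` triple. A document that contains the term twice appears twice in `hits`, once per position, which is why the rowids are de-duplicated before step 2.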


---
## <span style="color:#FF5D02;">Assignment: Tag cloud</span>
Besides making keyword search far more flexible and faster, the index lets us quickly generate statistics about tokens. In the simplest form, that is a tag cloud.

The problem is that FTS5 does not filter out stop words, so trivial words like 'I', 'the', 'and' etc. are also in the index. As a workaround, we ignore all words of 5 or fewer characters for our tag cloud.

In [185]:
%%sql 

SELECT term, count(doc) as occurrences 
FROM fts_tokens 
WHERE length(term)>5 
GROUP BY term 
ORDER BY occurrences DESC
LIMIT 10;

 * sqlite:///my-database.db
Done.


term,occurrences
quality,23
really,22
perfect,21
comfortable,20
recommend,16
beautiful,14
wearing,13
stylish,13
bought,13
highly,11
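
The length filter is a blunt tool: it also drops short but meaningful words. One possible alternative, sketched below, is to exclude an explicit stop-word list instead. Everything in the sketch is an assumption made for illustration: the list `STOP_WORDS` is deliberately tiny and hypothetical, and the tables `demo_reviews` and `demo_tokens` live in a throwaway in-memory database (FTS5 support in Python's `sqlite3` is assumed).

```python
import sqlite3

# Hypothetical, deliberately tiny stop-word list; a real one would be much longer.
STOP_WORDS = ["i", "the", "and", "is", "it", "a", "this", "for", "my"]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE demo_reviews USING fts5(content)")
conn.executemany(
    "INSERT INTO demo_reviews (content) VALUES (?)",
    [
        ("I love this dress and the fit is great",),
        ("The dress is great for a party",),
        ("Great fit and my sister loves the dress",),
    ],
)
conn.execute("CREATE VIRTUAL TABLE demo_tokens USING fts5vocab(demo_reviews, instance)")

# Count every token occurrence except the stop words.
placeholders = ", ".join("?" for _ in STOP_WORDS)
tag_cloud = conn.execute(
    f"""
    SELECT term, count(*) AS occurrences
    FROM demo_tokens
    WHERE term NOT IN ({placeholders})
    GROUP BY term
    ORDER BY occurrences DESC, term
    """,
    STOP_WORDS,
).fetchall()

for term, occurrences in tag_cloud:
    print(term, occurrences)
```

Here the three-letter word 'fit' stays in the tag cloud, while a length filter would have removed it. The price is that the stop-word list has to be maintained by hand.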


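The same statistics can be restricted to a single product category by joining the tokens back to the reviews:

In [198]:
%%sql

SELECT term, count(term) AS freq
FROM fts_tokens t 
INNER JOIN fts_reviews r ON r.rowid = t.doc
WHERE length(term)>5 AND
-- Caution: this compares the review id (r.id) with dim_products.id.
-- If dim_products.id holds product ids, r.product_id is probably the intended column.
r.id IN (SELECT id FROM dim_products WHERE category='Accessories')
GROUP BY term
ORDER BY freq DESC;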


 * sqlite:///my-database.db
Done.


term,freq
recommend,3
highly,3
comfortable,3
stylish,2
running,2
compliments,2
wedding,1
trainers,1
support,1
really,1
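
For comparison, here is a minimal sketch of a per-category tag cloud that filters on the product id of each review. It rests on a schema assumption that is not confirmed in this notebook, namely that the `id` column of the products table is the product id referenced by `product_id` in the reviews. The tables `demo_products`, `demo_reviews` and `demo_tokens` and all their rows are invented, and FTS5 support in Python's `sqlite3` is assumed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Toy stand-in for dim_products (assumption: id is the product id).
conn.execute("CREATE TABLE demo_products (id INTEGER PRIMARY KEY, category TEXT)")
conn.executemany(
    "INSERT INTO demo_products (id, category) VALUES (?, ?)",
    [(1, "Accessories"), (2, "Shoes")],
)

# UNINDEXED keeps product_id in the table but out of the full-text index.
conn.execute("CREATE VIRTUAL TABLE demo_reviews USING fts5(product_id UNINDEXED, content)")
conn.executemany(
    "INSERT INTO demo_reviews (product_id, content) VALUES (?, ?)",
    [
        (1, "Beautiful necklace, very elegant"),
        (2, "Comfortable trainers for running"),
        (1, "Elegant bracelet"),
    ],
)
conn.execute("CREATE VIRTUAL TABLE demo_tokens USING fts5vocab(demo_reviews, instance)")

# Join each token back to its review, then keep only reviews of 'Accessories' products.
freq = conn.execute(
    """
    SELECT t.term, count(*) AS freq
    FROM demo_tokens t
    JOIN demo_reviews r ON r.rowid = t.doc
    WHERE r.product_id IN (SELECT id FROM demo_products WHERE category = 'Accessories')
    GROUP BY t.term
    ORDER BY freq DESC, t.term
    """
).fetchall()

for term, count in freq:
    print(term, count)
```

With this filter, tokens from the shoe review such as 'trainers' and 'running' do not show up in the 'Accessories' tag cloud.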
