# __Search Text by Text__

- Tutorial Difficulty : ★★☆☆☆
- 7 min read 
- Languages : [SQL](https://en.wikipedia.org/wiki/SQL) (100%)
- File location : tutorial_en/thanosql_search/search_text_by_text.ipynb   
- References : [(Kaggle) IMDB Movie Reviews](https://www.kaggle.com/code/lakshmi25npathi/sentiment-analysis-of-imdb-movie-reviews/data), [Word Embeddings: Encoding Lexical Semantics](https://pytorch.org/tutorials/beginner/nlp/word_embeddings_tutorial.html)

## Tutorial Introduction

<div class="admonition note">
    <h4 class="admonition-title">Understanding Text Vectorization</h4>
    <p>Computers cannot directly interpret human (natural) language, so text must first be converted into numerical data that a computer can process. In natural language processing, an embedding is the vector that results from this conversion.</p>
</div>

Techniques for converting natural language into embeddings are largely divided into statistical techniques and artificial neural network-based techniques. ThanoSQL provides a method to train a text vectorization model using self-supervised learning.
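
As a rough illustration of the statistical family of techniques, the sketch below builds bag-of-words count vectors in plain Python. It is not what ThanoSQL does internally; the function name `bag_of_words` and the sample sentences are made up for this example.

```python
from collections import Counter


def bag_of_words(texts):
    """Map each text to a vector of word counts over a shared vocabulary."""
    # Naive tokenization: lowercase and split on whitespace.
    tokenized = [text.lower().split() for text in texts]
    # The vocabulary fixes the meaning of each vector position.
    vocabulary = sorted({word for tokens in tokenized for word in tokens})
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append([counts[word] for word in vocabulary])
    return vocabulary, vectors


# Hypothetical sample texts, not taken from the dataset.
vocabulary, vectors = bag_of_words(["great movie", "dull movie", "great great cast"])
print(vocabulary)  # ['cast', 'dull', 'great', 'movie']
print(vectors)     # [[0, 0, 1, 1], [0, 1, 0, 1], [1, 0, 2, 0]]
```

Count vectors like these ignore word order and meaning, which is one reason neural embeddings are usually preferred for similarity search.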

<div class="admonition note">
    <h4 class="admonition-title">In this tutorial</h4>
    <p>👉 This tutorial uses <mark style="background-color:#FFD79C">movie review data</mark>. The data consists of movie review texts and label values. However, because we are demonstrating self-supervised learning, the label values are not used. After training a model on 4,000 movie reviews, we will be able to search text by text and extract keywords, each with an importance score, from a given movie review. </p>
</div>

## __0. Prepare Dataset__

As described in the [ThanoSQL Workspace](https://docs.thanosql.ai/en/getting_started/how_to_use_ThanoSQL/#5-thanosql-workspace) guide, you must create an API token and run the query below before you can execute ThanoSQL queries.

In [None]:
%load_ext thanosql
%thanosql API_TOKEN=<Issued_API_TOKEN>

### __Prepare Dataset__

In [3]:
%%thanosql
GET THANOSQL DATASET movie_review_data
OPTIONS (overwrite=True)

Success


<div class="admonition note">
    <h4 class="admonition-title">Query Details</h4>
    <ul>
        <li>"<strong>GET THANOSQL DATASET</strong>" downloads the specified dataset to the workspace. </li>
        <li>"<strong>OPTIONS</strong>" specifies the option values to be used for the <strong>GET THANOSQL DATASET</strong> clause.
        <ul>
            <li>"overwrite" : determines whether to overwrite a dataset if it already exists. If set to True, the old dataset is replaced with the new dataset (True|False, DEFAULT : False) </li>
        </ul>
        </li>
    </ul>
</div>

In [4]:
%%thanosql
COPY movie_review_train
OPTIONS (overwrite=True) 
FROM "thanosql-dataset/movie_review_data/movie_review_train.csv"

Success


In [5]:
%%thanosql
COPY movie_review_test 
OPTIONS (overwrite=True) 
FROM "thanosql-dataset/movie_review_data/movie_review_test.csv"

Success


<div class="admonition note">
    <h4 class="admonition-title">Query Details</h4>
    <ul>
        <li>"<strong>COPY</strong>" specifies the name of the dataset to be saved as a database table. </li>
        <li>"<strong>OPTIONS</strong>" specifies the option values to be used for the <strong>COPY</strong> clause.
        <ul>
           <li>"overwrite" : determines whether to overwrite a table if it already exists. If set to True, the old table is replaced with the new table (True|False, DEFAULT : False) </li>
        </ul>
        </li>
    </ul>
</div>

## __1. Check Dataset__

To create a movie review text search model, we use the <mark style="background-color:#FFEC92 ">movie_review_train</mark> table from the ThanoSQL database. To check the table's contents, run the following query.

In [6]:
%%thanosql
SELECT *
FROM movie_review_train
LIMIT 5 

Unnamed: 0,review,sentiment
0,This is the kind of movie that BEGS to be show...,negative
1,Bulletproof is quite clearly a disposable film...,negative
2,A beautiful shopgirl in London is swept off he...,positive
3,"VERY dull, obvious, tedious Exorcist rip-off f...",negative
4,Do we really need any more narcissistic garbag...,negative


<div class="admonition note">
   <h4 class="admonition-title">Understanding the Data</h4>
   <ul>
      <li><mark style="background-color:#D7D0FF ">review</mark>: movie review in text format</li>
      <li><mark style="background-color:#D7D0FF ">sentiment</mark> : target value indicating whether the review has a positive or negative sentiment</li>
   </ul>
</div>

## __2. Create a Text Vectorization Model__

To create a text search model with the name <mark style="background-color:#E9D7FD ">movie_text_search_model</mark> using the <mark style="background-color:#FFEC92 ">movie_review_train</mark> dataset, run the following query.  
(Estimated time required for query execution: 2 min)

In [9]:
%%thanosql
BUILD MODEL movie_text_search_model
USING SBERTEn
OPTIONS (
    text_col="review",
    overwrite=True
)
AS
SELECT *
FROM movie_review_train

Building model...
Success


<div class="admonition note">
    <h4 class="admonition-title">Query Details</h4>
    <ul>
        <li>"<strong>BUILD MODEL</strong>" creates and trains a model named <mark style="background-color:#E9D7FD">movie_text_search_model</mark>.</li>
        <li>"<strong>USING</strong>" specifies <mark style="background-color:#E9D7FD">SBERTEn</mark> as the base model.</li>
        <li>"<strong>OPTIONS</strong>" specifies the option values used to create a model.
        <ul>
            <li>"text_col" : a column containing movie review data in the data table.</li>
            <li>"epochs" : number of times to train with the training dataset (int, DEFAULT : 1)</li>
            <li>"batch_size" : the number of samples processed in a single training step (int, DEFAULT : 16)</li> 
            <li>"learning_rate" : the learning rate of the model (float, DEFAULT : 3e-5)</li> 
            <li>"train" : determines whether to train the model. If set to False, the pretrained model is used as is, without further training. (True|False, DEFAULT : True)</li> 
            <li>"overwrite": determines whether to overwrite a model if it already exists. If set to True, the old model is replaced with the new model (True|False, DEFAULT : False)</li>
        </ul>
        </li>
    </ul>
</div>
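
To get a feel for how "epochs" and "batch_size" interact, the following sketch counts training steps. It assumes one step per batch and a full pass over the table per epoch; ThanoSQL's internal training schedule is not described in this tutorial, and `training_steps` is a hypothetical helper.

```python
import math


def training_steps(num_rows, batch_size=16, epochs=1):
    """Steps = batches per epoch (last partial batch included) times epochs."""
    return math.ceil(num_rows / batch_size) * epochs


# 4,000 reviews with the default options listed above.
steps = training_steps(4000)
print(steps)  # 250
```

Doubling "epochs" doubles the step count, while doubling "batch_size" roughly halves it.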

To vectorize the `movie_review_test` texts, run the following "__CONVERT USING__" query. The vectorized results are stored in a column named <mark style="background-color:#D7D0FF">convert_result</mark> in the `movie_review_test` table.

In [10]:
%%thanosql
CONVERT USING movie_text_search_model
OPTIONS (
    text_col="review",
    table_name="movie_review_test",
    batch_size=32
    )
AS 
SELECT *
FROM movie_review_test

Unnamed: 0,review,sentiment,convert_result
0,"I read the book before seeing the movie, and t...",positive,"[0.026386067, 0.033429813, 0.02681509, -0.0183..."
1,"""9/11,"" hosted by Robert DeNiro, presents foot...",positive,"[0.026997166, -0.014019986, 0.047948554, -0.03..."
2,"Yesterday I attended the world premiere of ""De...",positive,"[-0.010400589, -0.032047845, -0.00088875234, 0..."
3,Moonwalker is a Fantasy Music film staring Mic...,positive,"[0.017395066, 0.033596385, 0.048339244, 0.0123..."
4,"Welcome to Oakland, where the dead come out to...",positive,"[-0.023515128, -0.085522555, -0.007945944, 0.0..."
...,...,...,...
995,Ocean's 12 starts off on annoying and gets wor...,negative,"[0.016584294, 0.02242299, 0.005079144, -0.0378..."
996,I remember catching this movie on one of the S...,negative,"[-0.0050650453, -0.0098094875, 0.00079921127, ..."
997,CyberTracker is set in Los Angeles sometime in...,negative,"[-0.01535927, 0.0027531378, 0.02185786, 0.0369..."
998,"There is so much that is wrong with this film,...",negative,"[0.0035664095, 0.05291827, -0.0014212326, -0.0..."


<div class="admonition note">
    <h4 class="admonition-title">Query Details</h4>
    <ul>
        <li>"<strong>CONVERT USING</strong>" uses <code>movie_text_search_model</code> as the algorithm for text vectorization.</li>
        <li>"<strong>OPTIONS</strong>" specifies the options to be used for text vectorization.
        <ul>
            <li>"text_col" : the column containing the movie review text in the data table.</li>
            <li>"table_name" : the name of the table in the ThanoSQL database where the results are stored.</li>
            <li>"batch_size" : the number of rows processed together in a single batch.</li> 
        </ul>
        </li>
    </ul>
</div>

## __3. Search for Similar Texts__

This step uses the <mark style="background-color:#E9D7FD">movie_text_search_model</mark> text vectorization model and test table to search for similar texts.

In [12]:
%%thanosql
SELECT review, sentiment, search_result as score
FROM (
    SEARCH TEXT text="This movie was my favorite movie of all time"
    USING movie_text_search_model
    OPTIONS(emb_col = "convert_result")
    AS 
    SELECT * 
    FROM movie_review_test
    )
ORDER BY score DESC 
LIMIT 10

Searching...


Unnamed: 0,review,sentiment,score
0,The Muppet movie is an instant classic. I reme...,positive,0.528523
1,"The critics were like ""a movie that will break...",negative,0.502174
2,What can I say?? This movie has it all...Roman...,positive,0.497521
3,I have loved this movie since I saw it in the ...,positive,0.492566
4,"First time I saw this great movie and Alyssa, ...",positive,0.480914
5,I saw this movie for the first time in 1988 wh...,positive,0.476847
6,This movie was the second movie I saw on the c...,positive,0.474225
7,Why oh why don't blockbuster movies simply sti...,negative,0.473454
8,This film lingered and lingered at a small mov...,positive,0.465636
9,Amongst the standard one liner type action fil...,positive,0.465367


In [13]:
%%thanosql
SELECT review, sentiment, search_result as score
FROM (
    SEARCH TEXT text="The movie was unsatisfactory"
    USING movie_text_search_model
    OPTIONS(emb_col = "convert_result")
    AS 
    SELECT * 
    FROM movie_review_test
    )
ORDER BY score DESC 
LIMIT 10

Searching...


Unnamed: 0,review,sentiment,score
0,To quote Clark Griswold (in the original Chris...,negative,0.582234
1,"Well, I remember when the studio sacked Schrad...",negative,0.580146
2,There was absolutely nothing in this film that...,negative,0.571198
3,Badly made. Dreadful acting and an ending that...,negative,0.562158
4,"While the dog was cute, the film was not. It w...",negative,0.561449
5,Just plain terrible. Nick and Michael are WAY ...,negative,0.554555
6,"A gave it a ""2"" instead of a ""1"" (awful) becau...",negative,0.553373
7,How many times do we have to see bad horror mo...,negative,0.544694
8,This movie was pointless. I can't even call it...,negative,0.542747
9,"irritating, illogical flow of events. pretty m...",negative,0.542619


<div class="admonition note">
    <h4 class="admonition-title">Query Details</h4>
    <ul>
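        <li>"<strong>SEARCH TEXT</strong>" specifies text as the data type to search for (image|audio|video|text|keyword). The query string is passed as <code>text="..."</code>.</li>
        <li>"<strong>USING</strong>" defines the model used for the text vectorization.</li>
        <li>"<strong>OPTIONS</strong>" specifies the options to be used for the search. In this example, "emb_col" names the column that holds the vectorized results (<code>convert_result</code>).</li>
        <li>"<strong>AS</strong>" defines the embedding table to be used for the searches. In this example, the <code>movie_review_test</code> table is used.</li>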
    </ul>
</div>
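
The `score` column behaves like a similarity between the query embedding and each stored embedding. A minimal sketch of such a ranking, assuming cosine similarity (the tutorial does not state which metric ThanoSQL uses) and using tiny made-up 3-dimensional vectors in place of real embeddings:

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def search(query_vector, rows, top_k=10):
    """Score every (text, vector) row and return the top_k by score."""
    scored = [(cosine_similarity(query_vector, vec), text) for text, vec in rows]
    scored.sort(key=lambda pair: pair[0], reverse=True)  # ORDER BY score DESC
    return scored[:top_k]                                # LIMIT top_k


# Hypothetical rows; real embeddings have hundreds of dimensions.
rows = [
    ("loved it", [0.9, 0.1, 0.0]),
    ("terrible", [-0.8, 0.2, 0.1]),
    ("it was fine", [0.3, 0.3, 0.3]),
]
results = search([1.0, 0.0, 0.0], rows, top_k=2)
print([text for _, text in results])  # ['loved it', 'it was fine']
```

The sort and slice play the role of `ORDER BY score DESC` and `LIMIT` in the queries above.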

## __4. Extract Keywords from Texts__

This step uses the <mark style="background-color:#E9D7FD">movie_text_search_model</mark> text vectorization model and test table to extract keywords from the texts.

In [14]:
%%thanosql
SEARCH KEYWORD
USING movie_text_search_model
OPTIONS (
    text_col="review",
    ngram_range=[1, 3],
    use_stopwords=True
    )
AS 
SELECT * 
FROM movie_review_test
LIMIT 10 OFFSET 40

Searching...


Unnamed: 0,review,sentiment,convert_result,keyword
0,"Like an earlier commentor, I saw it in 1980 an...",positive,"[0.006548314, -0.04284134, -0.0072711473, -0.0...","{'keyword': ['writers depict oppenheimer', 'st..."
1,"I'm a fan of Matthew Modine, but this film--wh...",negative,"[0.012814169, 0.03298126, -0.0034962571, -0.05...","{'keyword': ['watched it', 'acting talent', 'f..."
2,Paul Lukas played a Russian intellectual makin...,positive,"[0.017949676, -0.051755715, 0.0008833323, -0.0...","{'keyword': ['stanislavsky was also', 'counter..."
3,"to be honest, i didn't watch all of the origin...",negative,"[0.029674824, 0.021441624, -0.012533456, 0.014...","{'keyword': ['other vampire movie', 'horrible ..."
4,Police Squad! (1982) was a funny show that end...,positive,"[-0.0007850809, 0.0059329295, 0.010534317, 0.0...","{'keyword': ['american television shows', 'dre..."
5,I am still shuddering at the thought of EVER s...,negative,"[-0.012367624, 0.03253895, 0.0026787654, -0.01...","{'keyword': ['seeing this movie', 'than action..."
6,Gregory Peck gives a brilliant performance in ...,positive,"[0.039867412, 0.063295096, -0.024085267, -0.07...","{'keyword': ['this film', 'the old testament',..."
7,I first flicked onto the LoG accidentally one ...,positive,"[0.011700148, 0.046097543, 0.01669461, -0.0067...","{'keyword': ['watch it', 'the fast show', 'hum..."
8,I didn't know what to make of this film. I gue...,positive,"[-0.009551439, 0.026482461, 0.019772608, -0.03...","{'keyword': ['view the film', 'drug influenced..."
9,A family looking for some old roadside attract...,negative,"[0.028609712, -0.013428714, -0.008489482, -0.0...","{'keyword': ['horror film', 'the acting was', ..."


In [15]:
%%thanosql
SELECT review, sentiment, keyword -> 'keyword' AS keywords, keyword -> 'score' AS score
FROM (
    SEARCH KEYWORD 
    USING movie_text_search_model
    OPTIONS (
        text_col="review",
        use_stopwords=True
        )
    AS 
    SELECT * 
    FROM movie_review_test
    LIMIT 10
)

Searching...


Unnamed: 0,review,sentiment,keywords,score
0,"I read the book before seeing the movie, and t...",positive,"[film very haunting, best adaptations out, boo...","[0.5451, 0.388, 0.2604, 0.2171, 0.2131]"
1,"""9/11,"" hosted by Robert DeNiro, presents foot...",positive,"[9 11 hosted, the twin towers, television, rob...","[0.513, 0.4793, 0.3936, 0.3466, 0.2555]"
2,"Yesterday I attended the world premiere of ""De...",positive,"[the rape scene, she invites him, confusion an...","[0.4882, 0.4505, 0.3918, 0.3344, 0.2718]"
3,Moonwalker is a Fantasy Music film staring Mic...,positive,"[michael jackson film, the montage is, liked t...","[0.5297, 0.4231, 0.4067, 0.2848, 0.1965]"
4,"Welcome to Oakland, where the dead come out to...",positive,"[the ghetto setting, jermaine takes revenge, o...","[0.4428, 0.3672, 0.3418, 0.3217, 0.2916]"
5,Tipping the Velvet (2002) (TV) was directed by...,positive,"[protagonist nan astley, directed by geoffrey,...","[0.4801, 0.4062, 0.3669, 0.2537, 0.1555]"
6,The Stock Market Crash of 1929 and the Depress...,positive,"[with james cagney, dorothy lamour and, dances...","[0.4991, 0.4284, 0.3645, 0.3277, 0.106]"
7,I want to clarify a few things. I am not famil...,negative,"[art cinema, pseudo shocking scenes, is some r...","[0.5314, 0.4048, 0.3326, 0.3269, 0.2759]"
8,This is a nice movie with good performances by...,positive,"[in spanish cinema, is very good, better movie...","[0.5494, 0.4534, 0.3496, 0.3429, 0.1791]"
9,"Once a month, I invite a few friends over for ...",negative,"[retarded movie night, low budget horror, but ...","[0.6151, 0.4736, 0.334, 0.3087, 0.2609]"


In [16]:
%%thanosql 
SELECT * FROM (
    SELECT review, sentiment, json_array_elements(keyword -> 'keyword') AS keywords, (json_array_elements(keyword -> 'score'))::text::float AS score
        FROM (
            SEARCH KEYWORD 
            USING movie_text_search_model
            OPTIONS (
                text_col="review",
                use_stopwords=True
                )
            AS 
            SELECT * 
            FROM movie_review_test
            LIMIT 10
        )
    ) 
WHERE score > 0.5

Searching...


Unnamed: 0,review,sentiment,keywords,score
0,"I read the book before seeing the movie, and t...",positive,film very haunting,0.5451
1,"""9/11,"" hosted by Robert DeNiro, presents foot...",positive,9 11 hosted,0.513
2,Moonwalker is a Fantasy Music film staring Mic...,positive,michael jackson film,0.5297
3,This is a nice movie with good performances by...,positive,in spanish cinema,0.5494
4,I want to clarify a few things. I am not famil...,negative,art cinema,0.5314
5,"Once a month, I invite a few friends over for ...",negative,retarded movie night,0.6151
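
The query in cell 16 expands the parallel `keyword` and `score` arrays into one row per keyword and then filters on the score. The same reshaping can be sketched in Python. The row contents below are placeholders rather than real output, and `explode_keywords` is a hypothetical helper.

```python
def explode_keywords(rows, min_score):
    """Emit one record per (keyword, score) pair whose score exceeds min_score."""
    flat = []
    for row in rows:
        result = row["keyword"]
        # The two arrays are parallel, so pair them up element by element.
        for phrase, score in zip(result["keyword"], result["score"]):
            if score > min_score:
                flat.append({"review": row["review"], "keywords": phrase, "score": score})
    return flat


# Placeholder rows shaped like the SEARCH KEYWORD output.
rows = [
    {"review": "review A", "keyword": {"keyword": ["phrase one", "phrase two"], "score": [0.61, 0.42]}},
    {"review": "review B", "keyword": {"keyword": ["phrase three"], "score": [0.35]}},
]
flat = explode_keywords(rows, min_score=0.5)
print(flat)  # [{'review': 'review A', 'keywords': 'phrase one', 'score': 0.61}]
```

Reviews with no keyword above the cutoff disappear from the result, which is why the filtered table has fewer rows than the ten reviews that went in.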


<div class="admonition note">
    <h4 class="admonition-title">Query Details</h4>
    <ul>
        <li>"<strong>SEARCH KEYWORD</strong>" extracts keywords from the texts. The "SEARCH" statement accepts image|audio|video|text|keyword as the data type to search for.</li>
        <li>"<strong>USING</strong>" defines the model used for the text vectorization.</li>
        <li>"<strong>OPTIONS</strong>" specifies the options to be used for keyword extraction.
            <ul>
                <li>"lang" : the language of the text (en|ko)</li>
                <li>"text_col" : the column containing the movie review text in the data table</li>
                <li>"ngram_range" : minimum and maximum number of words for each keyword. ex) [1, 3]. In most situations, keywords are extracted according to the maximum number of words. (list[int, int], DEFAULT : [1, 2])</li>
                <li>"top_n" : number of keywords to be extracted, in order of highest similarity (int, DEFAULT : 5)</li>
                <li>"diversity" : variety of keywords to be extracted. The higher the value, the more diverse the keywords will be. 0 &lt;= diversity &lt;= 1 (float, DEFAULT : 0.5)</li>
                <li>"use_stopwords" : whether to exclude words that do not have a significant meaning (True|False, DEFAULT : True)</li>
                <li>"threshold" : minimum similarity score of the keywords to be extracted. (float, DEFAULT : 0.0)</li>
            </ul>
        </li>
        <li>"<strong>AS</strong>" defines the embedding table to be used for searches. In this example, the <code>movie_review_test</code> table is used.</li>
    </ul>
</div>
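
The "ngram_range", "top_n", and "diversity" options can be pictured with a small sketch. It generates candidate phrases as word n-grams and then selects among them with maximal marginal relevance (MMR), a common way to trade relevance against redundancy. Whether ThanoSQL implements the options this way is an assumption. The function names and the 2-dimensional vectors are made up for illustration.

```python
import math


def ngram_candidates(text, ngram_range=(1, 2)):
    """List the distinct word n-grams of text for each n in ngram_range."""
    words = text.lower().split()
    low, high = ngram_range
    candidates = []
    for n in range(low, high + 1):
        for i in range(len(words) - n + 1):
            phrase = " ".join(words[i:i + n])
            if phrase not in candidates:
                candidates.append(phrase)
    return candidates


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms if norms else 0.0


def mmr(doc_vec, candidates, top_n=5, diversity=0.5):
    """Pick top_n phrases from (phrase, vector) pairs, penalizing near-duplicates."""
    remaining = list(candidates)
    selected = []
    while remaining and len(selected) < top_n:
        def mmr_score(item):
            relevance = cosine(item[1], doc_vec)
            # Similarity to the closest phrase that was already picked.
            redundancy = max((cosine(item[1], s[1]) for s in selected), default=0.0)
            return (1 - diversity) * relevance - diversity * redundancy
        best = max(remaining, key=mmr_score)
        remaining.remove(best)
        selected.append(best)
    return [phrase for phrase, _ in selected]


print(ngram_candidates("the acting was great", (1, 2)))

# Hypothetical phrase vectors: the first two are near-duplicates.
doc_vec = [1.0, 0.0]
candidates = [
    ("great film", [1.0, 0.1]),
    ("great movie", [1.0, 0.12]),
    ("weak plot", [0.6, 0.8]),
]
print(mmr(doc_vec, candidates, top_n=2, diversity=0.0))  # ['great film', 'great movie']
print(mmr(doc_vec, candidates, top_n=2, diversity=0.7))  # ['great film', 'weak plot']
```

With `diversity=0.0` the two near-identical phrases are both returned. Raising it swaps the second one for a less relevant but different phrase.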

## __5. Combine the Two Methods__

In [17]:
%%thanosql
SEARCH KEYWORD
USING movie_text_search_model
OPTIONS (
    text_col="review",
    ngram_range=[1, 3],
    use_stopwords=True
    )
AS (
    SELECT review, sentiment, search_result as score
    FROM (
        SEARCH TEXT text="The greatest movie of all time"
        USING movie_text_search_model
        OPTIONS(emb_col = "convert_result")
        AS 
        SELECT * 
        FROM movie_review_test
        )
    ORDER BY score DESC 
    LIMIT 10
)

Searching...
Searching...


Unnamed: 0,review,sentiment,score,keyword
0,"This movie is full of references. Like ""Mad M...",positive,0.517195,"{'keyword': ['movie', 'is a masterpiece', 'of ..."
1,This is without a doubt the most poorly though...,negative,0.508921,"{'keyword': ['movie in history', 'most awful',..."
2,There are not many movies around that have giv...,positive,0.488863,"{'keyword': ['most wonderful fantasy', 'as unf..."
3,This movie was the second movie I saw on the c...,positive,0.479862,"{'keyword': ['movie great monsters', 'astonish..."
4,"THE PROTECTOR. You hear the name. You think, ""...",negative,0.47312,"{'keyword': ['better cop film', 'hong kong mov..."
5,This film lingered and lingered at a small mov...,positive,0.471293,"{'keyword': ['warming film', 'it a comedy', 'c..."
6,Blade was a thrilling horror masterpiece and i...,positive,0.470355,"{'keyword': ['movie is great', 'wesley snipes ..."
7,This is without a doubt the greatest film ever...,positive,0.462852,"{'keyword': ['film', 'improvised recursive fil..."
8,"The critics were like ""a movie that will break...",negative,0.462665,"{'keyword': ['movie', 'had great expectations'..."
9,So fortunate were we to see this fantastic fil...,positive,0.461568,"{'keyword': ['fantastic film at', 'personal ra..."


In [18]:
%%thanosql
SELECT review, sentiment, keyword -> 'keyword' AS keywords, keyword -> 'score' AS score
FROM ( 
    SEARCH KEYWORD
    USING movie_text_search_model
    OPTIONS (
        text_col="review",
        ngram_range=[1, 3],
        use_stopwords=True
        )
    AS (
        SELECT review, sentiment, search_result as score
        FROM (
            SEARCH TEXT text="Such a romatic movie"
            USING movie_text_search_model
            OPTIONS(emb_col = "convert_result")
            AS 
            SELECT * 
            FROM movie_review_test
            )
        ORDER BY score DESC 
        LIMIT 10
    )
)

Searching...
Searching...


Unnamed: 0,review,sentiment,keywords,score
0,"""Crush"" examines female friendship, for the mo...",positive,"[a film for, a sudden passion, crush examines ...","[0.4908, 0.4565, 0.443, 0.4287, 0.4088]"
1,This is the kind of film for a snowy Sunday af...,positive,"[hours wonderful performances, film for a, war...","[0.6024, 0.4628, 0.4105, 0.3297, 0.3254]"
2,"""Casomai"" is a masterful tale depicting the st...",positive,"[italian movies of, masterful tale depicting, ...","[0.5791, 0.5258, 0.3959, 0.2491, 0.2339]"
3,"This movie is full of references. Like ""Mad M...",positive,"[movie, is a masterpiece, of references like, ...","[0.5193, 0.5011, 0.3729, 0.3434, 0.2747]"
4,There are not many movies around that have giv...,positive,"[most wonderful fantasy, as unforgettable char...","[0.4917, 0.4692, 0.4622, 0.3335, 0.2645]"
5,"Though I saw this movie dubbed in French, so I...",positive,"[movie whether alone, positive portrayal, sexu...","[0.4176, 0.4105, 0.3859, 0.3702, 0.2636]"
6,Burlinson and Thornton give an outstanding per...,positive,"[the horse scenes, love this movie, are absolu...","[0.5654, 0.5158, 0.3877, 0.3316, 0.1955]"
7,"A gentle story, hinting at fury, with a redemp...",positive,"[well executed cinematographers, a film, enhan...","[0.6517, 0.4865, 0.4304, 0.2402, 0.201]"
8,A labor of love. Each frame is picture perfect...,positive,"[recommend this film, vijay raaz camille, the ...","[0.5656, 0.3266, 0.3135, 0.3115, 0.3031]"
9,i would have to say that this is the first qua...,positive,"[quality romantic comedy, movie was well, the ...","[0.5481, 0.5253, 0.3588, 0.2273, 0.223]"


In [21]:
%%thanosql
SEARCH KEYWORD
USING movie_text_search_model
OPTIONS (
    text_col="review",
    ngram_range=[1, 3],
    use_stopwords=True
    )
AS (
    SELECT review, sentiment, search_result as score
    FROM (
        SEARCH TEXT text="The best action movie"
        USING movie_text_search_model
        OPTIONS(emb_col = "convert_result")
        AS 
        SELECT * 
        FROM movie_review_test
        WHERE review LIKE '%%gun%%'
        )
    ORDER BY score DESC 
    LIMIT 10
)

Searching...
Searching...


Unnamed: 0,review,sentiment,score,keyword
0,As a veteran screen writing instructor at Rich...,positive,0.384394,"{'keyword': ['movies before hollywood', 'this ..."
1,This is quite an unusual and unique little wes...,positive,0.379475,"{'keyword': ['western genre film', 'the movie ..."
2,Countless Historical & cultural mistakes 0/10 ...,negative,0.368026,"{'keyword': ['movie 3 jewish', 'hitler was kil..."
3,I was very surprised how bad this movie was. N...,negative,0.347552,"{'keyword': ['kung fu movies', 'movie is worse..."
4,"Oh, dear! This has to be one of the worst film...",negative,0.335154,"{'keyword': ['worst films', 'utterly dreadful ..."
5,"""In the world of old-school kung fu movies, wh...",positive,0.332992,"{'keyword': ['kung fu movies', 'artists whose ..."
6,Nobody could like this movie for its merit but...,negative,0.310565,"{'keyword': ['little stealth fighter', 'drop t..."
7,If you thought Day after tomorrow was implausi...,negative,0.310345,"{'keyword': ['most disaster films', 'implausib..."
8,"Okay, sure, this movie is a bit on the hokey s...",positive,0.297743,"{'keyword': ['film based on', 'the punisher an..."
9,"This was probably the worst movie ever, seriou...",negative,0.294311,"{'keyword': ['worst movie ever', 'cheesy porn ..."


In [26]:
%%thanosql
SELECT * 
FROM (
    SELECT review, sentiment, json_array_elements(keyword -> 'keyword') AS keywords, (json_array_elements(keyword -> 'score'))::text::float AS score
    FROM (
        SEARCH KEYWORD
        USING movie_text_search_model
        OPTIONS (
            text_col="review",
            ngram_range=[1, 3],
            use_stopwords=True
            )
        AS (
            SELECT review, sentiment, search_result as score
            FROM (
                SEARCH TEXT text="The best action movie"
                USING movie_text_search_model
                OPTIONS(emb_col = "convert_result")
                AS 
                SELECT * 
                FROM movie_review_test
                WHERE review LIKE '%%gun%%'
                )
            ORDER BY score DESC 
            LIMIT 10
        )
    )
)
WHERE score > 0.3

Searching...
Searching...


Unnamed: 0,review,sentiment,keywords,score
0,As a veteran screen writing instructor at Rich...,positive,movies before hollywood,0.5201
1,As a veteran screen writing instructor at Rich...,positive,this enchanting film,0.4555
2,This is quite an unusual and unique little wes...,positive,western genre film,0.6156
3,This is quite an unusual and unique little wes...,positive,the movie seemed,0.4696
4,This is quite an unusual and unique little wes...,positive,other surprising actors,0.3681
5,Countless Historical & cultural mistakes 0/10 ...,negative,movie 3 jewish,0.4822
6,Countless Historical & cultural mistakes 0/10 ...,negative,hitler was killed,0.3543
7,"Okay, sure, this movie is a bit on the hokey s...",positive,film based on,0.4548
8,"Okay, sure, this movie is a bit on the hokey s...",positive,the punisher anyone,0.4191
9,"Okay, sure, this movie is a bit on the hokey s...",positive,dolph lundgren as,0.3837


<div class="admonition note">
    <h4 class="admonition-title">Query Details</h4>
    <ul>
        <li>"<strong>SEARCH TEXT</strong>" specifies text as the data type to search for (image|audio|video|text|keyword). Here it runs as the inner query and returns the reviews most similar to the query text.</li>
        <li>"<strong>SEARCH KEYWORD</strong>" extracts keywords from the texts. Here it runs as the outer query on the reviews returned by the inner search.</li>
        <li>"<strong>USING</strong>" defines the model used for text vectorization.</li>
        <li>"<strong>AS</strong>" defines the embedding table to be used for searches. In this example, the <code>movie_review_test</code> table is used.</li>
    </ul>
</div>

## __6. In Conclusion__

In this tutorial, we vectorized the texts in the `movie review data` and used the resulting embeddings for similar text search and keyword extraction. As this is a beginner-level tutorial, we focused on the process rather than accuracy. The model's accuracy can be improved by adjusting various options, such as increasing the number of epochs or the dataset size. Create your own model and provide competitive services by combining various unstructured data (image, audio, video, etc.) and structured data with ThanoSQL.
<br>
For the next step, explore the various "OPTIONS" and training methods of text vectorization models. If you want to learn more about building your own text model, proceed with the following tutorials.

* [How to Upload to ThanoSQL DB](https://docs.thanosql.ai/en/getting_started/data_upload/)
* [Creating an Intermediate Similar Text Search Model]

<div class="admonition tip">
    <h4 class="admonition-title">Inquiries about deploying a model for your own service</h4>
    <p>If you have any difficulties creating your own model using ThanoSQL or applying it to your services, please feel free to contact us below😊</p>
    <p>For inquiries regarding building a text similarity search model: contact@smartmind.team</p>
</div>