# __Create a text classification model__

## Preface

- Tutorial Difficulty: ★☆☆☆☆
- 10 min read
- Languages : [SQL](https://en.wikipedia.org/wiki/SQL) (100%)
- File location : tutorial_en/thanosql_ml/classification/text_classification.ipynb
- References : [(Kaggle) IMDB Movie Reviews](https://www.kaggle.com/code/lakshmi25npathi/sentiment-analysis-of-imdb-movie-reviews/data), [ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators](https://arxiv.org/abs/2003.10555)

## Tutorial Introduction

<div class="admonition note">
    <h4 class="admonition-title">Understanding Classification Operations</h4>
    <p>Classification is a type of <a href="https://en.wikipedia.org/wiki/Machine_learning">Machine Learning</a> task that predicts the category (class) to which a target belongs. It includes both binary classification, which chooses between two classes (for example, male or female), and multi-class classification, which chooses among several classes (for example, animal species such as dogs, cats, and rabbits). <br></p>
</div>

[Natural Language Processing (NLP)](https://en.wikipedia.org/wiki/Natural_language_processing) is a branch of artificial intelligence that uses machine learning to process and interpret text-based data.

<div class="admonition tip">
    <h4 class="admonition-title">What is <a href="https://en.wikipedia.org/wiki/Natural_language_processing">Natural Language Processing (NLP)</a>?</h4>
    <p>NLP can be divided into Natural Language Understanding (NLU) and Natural Language Generation (NLG), depending on the purpose of the task. NLU refers to converting natural language that people understand into values that a computer can process. NLG, conversely, refers to converting computer-readable values into natural language that people can understand.</p>
</div>

Recent advances in pre-training techniques such as [BERT](<https://en.wikipedia.org/wiki/BERT_(language_model)>) and [GPT-3](https://en.wikipedia.org/wiki/GPT-3) make it possible to build a general-purpose language understanding model first and then fine-tune it for specific NLP tasks, such as sentiment analysis or question answering.

<div class="admonition summary">
    <p>This means that you can make efficient use of large datasets while minimizing the amount of <a href="https://en.wikipedia.org/wiki/Labeled_data">data labeling</a> required.</p>
</div>

ThanoSQL provides a variety of pre-trained AI models and functions so that users can easily create their own text classification models with only a small amount of labeled data. With an appropriately trained text classification model, users can extract insights from text data whose features are difficult to quantify, and use them in various services.

__The following are example use cases of the ThanoSQL text classification model.__

- With the ThanoSQL text classification model, you can easily classify consultations or inquiries received through a chatbot, analyze the sentiment of bulletin board posts, or perform other categorization tasks. The classification results can then be used to connect each customer with the appropriate contact person.

- The ThanoSQL text classification model allows news or post-sharing services to categorize published content into groups. Applying the model to the comments on posted content also enables sentiment analysis, which helps efficiently manage problems that suddenly become issues or that are caused by profanity or slander.

<div class="admonition note">
    <h4 class="admonition-title">In this tutorial</h4>
    <p>👉 Create a model to classify the sentiment of movie reviews using the <mark style="background-color:#FFD79C">IMDB Movie Reviews</mark> dataset from <a href="https://www.kaggle.com/">Kaggle</a>. This dataset consists of 50,000 movie review texts, each with a positive or negative sentiment target. Labels are based on the movie rating: a rating of 4 or lower (out of 10) is labeled negative and a rating of 7 or higher is labeled positive. No single movie has more than 30 reviews.</p>
</div>
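The labeling rule described above can be sketched in a few lines of Python. This is only an illustration, not part of ThanoSQL: the function name `rating_to_sentiment` is hypothetical, and the thresholds (4 or lower is negative, 7 or higher is positive, on a 10-point scale) are those of the original IMDB review dataset. Ratings of 5 and 6 are treated as neutral and receive no label.

```python
from typing import Optional


def rating_to_sentiment(rating: int) -> Optional[str]:
    """Map a 1-10 movie rating to a sentiment label (hypothetical helper).

    Returns "negative" for ratings of 4 or lower, "positive" for ratings
    of 7 or higher, and None for the neutral ratings 5 and 6.
    """
    if not 1 <= rating <= 10:
        raise ValueError(f"rating must be between 1 and 10, got {rating}")
    if rating <= 4:
        return "negative"
    if rating >= 7:
        return "positive"
    return None  # neutral reviews are not part of the labeled dataset


# Show the label assigned to every possible rating.
labels = {rating: rating_to_sentiment(rating) for rating in range(1, 11)}
print(labels)
```

Because neutral ratings are excluded, the resulting task is a clean binary classification problem.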

<div class="admonition warning">
    <h4 class="admonition-title">Tutorial Precautions</h4>
    <ul>
        <li>A text classification model predicts exactly one target (category) per text.</li>
        <li>The data must contain one column holding the text and one column holding the target value of that text.</li>
        <li>The base model of this text classification model (<code>ELECTRA</code>) uses a GPU. Depending on the model size and the batch size, you may run out of GPU memory. In that case, try using a smaller model or reducing the batch size.</li>
    </ul>
</div>

## __0. Prepare Dataset and Model__

To use ThanoSQL query syntax, you must issue an API token and run the cell below, as described in [ThanoSQL Workspace](https://docs.thanosql.ai/en/getting_started/how_to_use_ThanoSQL/#5-thanosql-workspace).

In [None]:
%load_ext thanosql
%thanosql API_TOKEN=<Issued_API_TOKEN>

### __Prepare Dataset__

In [None]:
%%thanosql
GET THANOSQL DATASET movie_review_data
OPTIONS (overwrite=True)

<div class="admonition note">
    <h4 class="admonition-title">Query Details</h4>
    <ul>
        <li>Use the "<strong>GET THANOSQL DATASET</strong>" query syntax to save the desired dataset to the workspace.</li>
        <li>Use "<strong>OPTIONS</strong>" to specify the options for <strong>GET THANOSQL DATASET</strong>.
        <ul>
            <li>"overwrite": Sets whether to overwrite a dataset with the same name if one already exists. If True, the old dataset is replaced with the new dataset (True|False, DEFAULT: False)</li>
        </ul>
        </li>
    </ul>
</div>

In [None]:
%%thanosql
COPY movie_review_train
OPTIONS (overwrite=True) 
FROM "thanosql-dataset/movie_review_data/movie_review_train.csv"

In [None]:
%%thanosql
COPY movie_review_test 
OPTIONS (overwrite=True) 
FROM "thanosql-dataset/movie_review_data/movie_review_test.csv"

<div class="admonition note">
    <h4 class="admonition-title">Query Details</h4>
    <ul>
        <li>Use the "<strong>COPY</strong>" query syntax to specify the name under which the dataset is saved in the DB.</li>
        <li>Use "<strong>OPTIONS</strong>" to specify the options for <strong>COPY</strong>.
        <ul>
            <li>"overwrite": Sets whether to overwrite a dataset with the same name if one already exists in the DB. If True, the old dataset is replaced with the new dataset (True|False, DEFAULT: False)</li>
        </ul>
        </li>
    </ul>
</div>

### __Prepare the Model__

In [None]:
%%thanosql
GET THANOSQL MODEL tutorial_text_classification
OPTIONS (overwrite=True)
AS tutorial_text_classification

<div class="admonition note">
    <h4 class="admonition-title">Query Details </h4>
    <ul>
        <li>Use the "<strong>GET THANOSQL MODEL</strong>" query syntax to store the desired model in the workspace and DB.</li>
        <li>Use "<strong>OPTIONS</strong>" to specify the options for <strong>GET THANOSQL MODEL</strong>.
        <ul>
            <li>"overwrite": Sets whether to overwrite a model with the same name if one already exists. If True, the existing model is replaced with the new model (True|False, DEFAULT: False)</li>
        </ul>
        </li>
        <li>Use "<strong>AS</strong>" to name the model. If the AS clause is omitted, the model keeps the name of the <code>THANOSQL MODEL</code>.</li>
    </ul>
</div>

## __1. Check Dataset__

To create a movie review sentiment classification model, we use the <mark style="background-color:#FFEC92 ">movie_review_train</mark> table stored in ThanoSQL DB. Execute the query statement below to check the table contents.

In [None]:
%%thanosql
SELECT *
FROM movie_review_train
LIMIT 5

<div class="admonition note">
   <h4 class="admonition-title">Understanding Data</h4>
   <ul>
      <li><mark style="background-color:#D7D0FF ">review</mark>: movie review text</li>
      <li><mark style="background-color:#D7D0FF ">sentiment</mark> : Target value indicating whether the review is positive or negative</li>
   </ul>
</div>

## __2. Predict Movie Review Sentiment Classification Results Using Pretrained Model__

First, let's predict results directly with the pretrained model. Running the following query predicts the movie review classification results using the <mark style="background-color:#E9D7FD ">tutorial_text_classification</mark> model.

In [None]:
%%thanosql
PREDICT USING tutorial_text_classification
OPTIONS (
    text_col='review'
    )
AS
SELECT *
FROM movie_review_test

## __3. Create a Text Classification Model__

Create a text classification model using the <mark style="background-color:#FFEC92 ">movie_review_train</mark> dataset from the previous step. Run the query below to create a model named <mark style="background-color:#E9D7FD ">my_movie_review_classifier</mark>.  
(Estimated time required for query execution: 3 min)

In [None]:
%%thanosql
BUILD MODEL my_movie_review_classifier
USING ElectraEn
OPTIONS (
    text_col='review',
    label_col='sentiment',
    epochs=1,
    batch_size=4,
    overwrite=True
    )
AS
SELECT *
FROM movie_review_train

<div class="admonition note">
    <h4 class="admonition-title">Query Details</h4>
    <ul>
        <li>Use the "<strong>BUILD MODEL</strong>" query syntax to create and train a model named <mark style="background-color:#E9D7FD">my_movie_review_classifier</mark>.</li>
        <li>Use "<strong>USING</strong>" to specify <code>ElectraEn</code> as the base model.</li>
        <li>Use "<strong>OPTIONS</strong>" to specify the options used to create the model. "text_col" is the name of the column containing the text used for training, and "label_col" is the name of the column containing the target. "epochs" is the number of times the training dataset is repeated during training. "batch_size" is the number of samples read in a single training step.</li>
    </ul>
</div>

<div class="admonition tip">
    <p>Here, we set "epochs" to 1 so that training finishes quickly. In general, a larger number of epochs takes more computation time, but predictive performance tends to improve as training progresses.</p>
</div>
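To get a feel for how "epochs" and "batch_size" affect training time, the following minimal sketch estimates the number of training steps (batches processed). The function `training_steps` is a hypothetical helper, not a ThanoSQL API, and it assumes that the last partial batch of each epoch is still processed; how ThanoSQL handles this internally is not covered in this tutorial. The row count of 1,000 is an arbitrary example, not the actual size of the training table.

```python
import math


def training_steps(num_rows: int, batch_size: int, epochs: int) -> int:
    """Estimate the total number of training steps (hypothetical helper).

    Assumes every epoch reads all rows once and that a final, smaller
    batch is still processed rather than dropped.
    """
    if num_rows < 0 or batch_size <= 0 or epochs <= 0:
        raise ValueError("num_rows must be >= 0; batch_size and epochs must be > 0")
    steps_per_epoch = math.ceil(num_rows / batch_size)
    return steps_per_epoch * epochs


# Arbitrary example: 1,000 rows with the options used in the query above.
print(training_steps(num_rows=1000, batch_size=4, epochs=1))  # 250

# Doubling the batch size halves the number of steps, at the cost of more GPU memory.
print(training_steps(num_rows=1000, batch_size=8, epochs=1))  # 125
```

Under these assumptions, the number of steps grows linearly with "epochs" and shrinks as "batch_size" grows, which is why a small batch size is the usual first remedy when GPU memory runs low.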

<div class="admonition note">
    <p>When <strong>overwrite is True</strong>, the user can create a model with the same name as a previously created model.<br>
    On the other hand, when <strong>overwrite is False</strong>, the user cannot create a model with the same name as a previously created model.</p>
</div>

## __4. Predict Movie Review Sentiment Classification Results Using Generated Model__

Use the text classification model created in the previous step to predict the target values for the reviews in the <mark style="background-color:#FFEC92 ">movie_review_test</mark> table, which was not used for training.

In [None]:
%%thanosql
PREDICT USING my_movie_review_classifier
OPTIONS (
    text_col='review'
    )
AS
SELECT *
FROM movie_review_test

<div class="admonition note">
    <h4 class="admonition-title">Query Details</h4>
    <p>Use the "<strong>PREDICT USING</strong>" query syntax to make predictions with the <mark style="background-color:#E9D7FD ">my_movie_review_classifier</mark> model created in the previous step.
    Use "<strong>OPTIONS</strong>" to specify the options for prediction. "text_col" is the name of the column containing the text to use for prediction (here, <mark style="background-color:#D7D0FF">review</mark>).
    The prediction result is stored and returned in the <mark style="background-color:#D7D0FF">predicted</mark> column.</p>
</div>
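Once predictions are available, a simple way to judge the model is to compare the <mark style="background-color:#D7D0FF">predicted</mark> column with the true labels. The sketch below computes accuracy in plain Python. It assumes that the query result still contains the original `sentiment` column alongside `predicted` and that the rows have been loaded into Python as a list of dictionaries; both are assumptions, and the four sample rows are made up for illustration rather than taken from actual model output.

```python
from typing import Dict, List


def accuracy(rows: List[Dict[str, str]],
             label_key: str = "sentiment",
             pred_key: str = "predicted") -> float:
    """Return the fraction of rows whose prediction matches the true label."""
    if not rows:
        raise ValueError("rows must not be empty")
    correct = sum(1 for row in rows if row[label_key] == row[pred_key])
    return correct / len(rows)


# Made-up rows for illustration only (not real model output).
sample_rows = [
    {"sentiment": "positive", "predicted": "positive"},
    {"sentiment": "negative", "predicted": "positive"},
    {"sentiment": "negative", "predicted": "negative"},
    {"sentiment": "positive", "predicted": "positive"},
]

print(accuracy(sample_rows))  # 0.75 (3 of 4 predictions match)
```

Accuracy is a reasonable first metric here because the dataset is described as a binary positive/negative task; for imbalanced data, precision and recall per class would be more informative.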

## __5. In Conclusion__

In this tutorial, we created a text classification model using the <mark style="background-color:#FFD79C">IMDB Movie Reviews</mark> dataset. As this is a beginner-level tutorial, we focused on how the queries work rather than on the process of improving accuracy. A text classification model can be made more accurate by fine-tuning it for each platform or service. You can train the base model on your own data, or convert your data into numerical representations using a [Self-supervised Learning](https://en.wikipedia.org/wiki/Self-supervised_learning) model. Afterwards, deployment using automated machine learning (Auto-ML) techniques is also possible. Create your own model and provide competitive services by combining various unstructured data (image, audio, video, etc.) with numeric data.

The next step, the [Creating an Intermediate Text Classification Model] tutorial, takes a more in-depth look at text classification models. If you want to learn more about how to build your own text classification model for your service, proceed with the following tutorials.

- [How to Upload to ThanoSQL DB](https://docs.thanosql.ai/en/how-to_guides/ThanoSQL_connecting/data_upload/)
- [Creating an Intermediate Text Classification Model]
- [Create My model using text conversion and Auto-ML]
- [Deploy My text classification model](https://docs.thanosql.ai/en/how-to_guides/ThanoSQL_connecting/thanosql_api/rest_api_thanosql_query/)

<div class="admonition tip">
    <h4 class="admonition-title">Inquiries about deploying a model for your own service</h4>
    <p>If you have any difficulties in creating your own model using ThanoSQL or applying it to the service, please feel free to contact us below😊</p>
    <p>For inquiries about building a text classification model: contact@smartmind.team</p>
</div>