# __Create a speech recognition model to transcribe an audio file__

## Preface

- Tutorial Difficulty : ★☆☆☆☆
- 10 min read
- Languages : [SQL](https://en.wikipedia.org/wiki/SQL) (100%)
- File location : tutorial_en/thanosql_ml/audio_recognition/speech_recognition.ipynb
- References : [LibriSpeech DataSet](http://www.openslr.org/12), [wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations](https://arxiv.org/abs/2006.11477)

## Tutorial Introduction

<div class="admonition note">
    <h4 class="admonition-title">Understanding Speech Recognition</h4>
    <p>Speech recognition technology, also called computer speech recognition or speech-to-text, allows programs to convert human speech into text. Recently, it has been used in a wide range of fields, from automobiles and medicine to everyday devices such as artificial intelligence speakers and smartphones. Recent <a href="https://en.wikipedia.org/wiki/Machine_learning">machine learning</a>-based speech recognition uses algorithms that understand and process speech by integrating the grammar, syntax, structure, and composition of audio and speech signals.</p>
</div>

<div class="admonition warning">
    <p>Speech recognition should not be confused with voice recognition, which focuses only on identifying an individual speaker's voice.</p>
</div>

Today, speech recognition technology is applied across many industries. Advances in the technology have extended it to automatic interpretation, in settings ranging from casual travel to high-level business meetings. It has also spread into related fields such as speech synthesis, which can act as a virtual guide, mimic the voice of a specific celebrity, or convert predetermined text into speech.

__The following are use case examples of ThanoSQL's speech recognition model.__

- Speech recognition technology converts phone consultation recordings into text, which enables customer sentiment analysis and consultation trend analysis. With it, counselors can improve customer service by quickly receiving information that answers customer inquiries or by referencing similar past cases. After a consultation, sentiment analysis also provides an indirect measure of customer satisfaction, so satisfaction trends can be tracked.

- With speech recognition technology, you can take notes faster than by typing on a keyboard and can instantly search for specific keywords, even in long audio files.

<div class="admonition note">
    <h4 class="admonition-title">In this tutorial</h4>
    <p>👉 LibriSpeech [Panayotov et al. 2015] is derived from the <a href="https://librivox.org/">LibriVox project</a>, a volunteer-driven audiobook project, and is one of the most widely used large-scale English speech datasets in speech recognition research. It was created by processing approximately 1,000 hours of recorded audiobook data sampled at 16 kHz. The target table for this tutorial consists of pre-uploaded audio file paths and their scripts. This tutorial aims to convert audio files to text.</p>
</div>

<div class="admonition warning">
    <h4 class="admonition-title">Tutorial Notes</h4>
    <ul>
        <li>ThanoSQL currently only supports the following audio file formats: '.wav', '.flac'.</li>
        <li>The table must contain both a column with the audio file path and a column with the target text (script) for that audio.</li>
        <li>The base model of the speech recognition model (<code>Wav2Vec2En</code>) uses a GPU. Depending on the size of the model and the batch size, you may run out of GPU memory. In this case, try using a smaller model or reducing the batch size.</li>
    </ul>
</div>
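
The first two notes can be checked before a table is uploaded. The following is a minimal Python sketch and not part of ThanoSQL: the helper `find_invalid_rows` and the sample rows are hypothetical, the column names `audio_path` and `text` follow the tables used later in this tutorial, and treating extensions case-insensitively is an assumption.

```python
from pathlib import PurePosixPath

# Formats listed in the tutorial notes above.
SUPPORTED_EXTENSIONS = {".wav", ".flac"}


def find_invalid_rows(rows, audio_col="audio_path", text_col="text"):
    """Return (row_index, reason) pairs for rows that break the notes above."""
    problems = []
    for index, row in enumerate(rows):
        path = row.get(audio_col)
        text = row.get(text_col)
        if not path:
            problems.append((index, f"missing {audio_col}"))
        # Assumption: extensions are compared case-insensitively.
        elif PurePosixPath(path).suffix.lower() not in SUPPORTED_EXTENSIONS:
            problems.append((index, f"unsupported audio format: {path}"))
        if text is None or not str(text).strip():
            problems.append((index, f"missing {text_col}"))
    return problems


# Hypothetical rows for illustration; real rows come from your own table.
sample_rows = [
    {"audio_path": "data/a.flac", "text": "HELLO WORLD"},
    {"audio_path": "data/b.mp3", "text": "UNSUPPORTED FORMAT"},
    {"audio_path": "data/c.WAV", "text": ""},
]

for index, reason in find_invalid_rows(sample_rows):
    print(index, reason)
```

Rows that the helper reports can be fixed or dropped before the table is loaded into the DB.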

## __0. Prepare Dataset and Model__

To run ThanoSQL queries, you must create an API token and run the code below, as mentioned in the [ThanoSQL Workspace](https://docs.thanosql.ai/en/getting_started/how_to_use_ThanoSQL/#5-thanosql-workspace).

In [None]:
%load_ext thanosql
%thanosql API_TOKEN=<Issued_API_TOKEN>

### __Prepare Dataset__

In [None]:
%%thanosql
GET THANOSQL DATASET librispeech_data
OPTIONS (overwrite=True)

<div class="admonition note">
    <h4 class="admonition-title">Query Details</h4>
    <ul>
        <li>Use the "<strong>GET THANOSQL DATASET</strong>" statement to save the desired dataset to your workspace environment. </li>
        <li>Use the "<strong>OPTIONS</strong>" clause to specify the options for the <strong>GET THANOSQL DATASET</strong> statement.
        <ul>
            <li>"overwrite" : Overwrite if a dataset with the same name exists. If set as True, the existing dataset is replaced with the new dataset (True|False, DEFAULT : False) </li>
        </ul>
        </li>
    </ul>
</div>

In [None]:
%%thanosql
COPY librispeech_train 
OPTIONS (overwrite=True)
FROM "thanosql-dataset/librispeech_data/librispeech_train.csv"

In [None]:
%%thanosql
COPY librispeech_test 
OPTIONS (overwrite=True)
FROM "thanosql-dataset/librispeech_data/librispeech_test.csv"

<div class="admonition note">
    <h4 class="admonition-title">Query Details</h4>
    <ul>
        <li>Use the "<strong>COPY</strong>" clause to specify the dataset to be copied into the DB. </li>
        <li>"<strong>OPTIONS</strong>" specifies the options to use for <strong>COPY</strong> clause.
        <ul>
            <li>"overwrite" : Overwrite if a dataset with the same name exists in the DB. If True, the existing dataset is overwritten with the new dataset. (True|False, DEFAULT : False) </li>
        </ul>
        </li>
    </ul>
</div>

### __Prepare the Model__

In [None]:
%%thanosql
GET THANOSQL MODEL tutorial_audio_recognition
OPTIONS (overwrite=True)
AS tutorial_audio_recognition

<div class="admonition note">
    <h4 class="admonition-title">Query Details </h4>
    <ul>
        <li>Use the "<strong>GET THANOSQL MODEL</strong>" statement to save the desired model to your workspace environment. </li>
        <li>Use the "<strong>OPTIONS</strong>" clause to specify the options for the <strong>GET THANOSQL MODEL</strong> statement.
        <ul>
            <li>"overwrite" : Overwrite if a model with the same name exists. If True, the existing model is overwritten with the new model. (True|False, DEFAULT : False) </li>
        </ul>
        </li>
        <li>Use the "<strong>AS</strong>" clause to name the model. If you are not using the AS syntax, the default name of the <code>THANOSQL MODEL</code> is used.</li>
    </ul>
</div>

## __1. Check Dataset__

For this tutorial, we use the <mark style="background-color:#FFEC92 ">librispeech_train</mark> table stored in ThanoSQL DB. Execute the query statement below to check the contents of the table.

In [None]:
%%thanosql
SELECT *
FROM librispeech_train
LIMIT 5

<div class="admonition note">
    <h4 class="admonition-title">Understanding Data</h4>
    <ul>
        <li><mark style="background-color:#D7D0FF ">audio_path</mark>: The audio file's path</li>
        <li><mark style="background-color:#D7D0FF ">text</mark>: Target value of the corresponding audio (target, script)</li>
    </ul>
</div>

In [None]:
%%thanosql
PRINT AUDIO 
AS
SELECT audio_path
FROM librispeech_train
LIMIT 3

## __2. Predict Speech Recognition Results Using Pretrained Models__

Run the query below to predict transcriptions with the pre-trained speech recognition model, <mark style="background-color:#E9D7FD ">tutorial_audio_recognition</mark>.

In [None]:
%%thanosql
PREDICT USING tutorial_audio_recognition
OPTIONS (
    audio_col='audio_path',
    batch_size=8
    )
AS 
SELECT * 
FROM librispeech_train

## __3. Creating a Speech Recognition Model__

Create a speech recognition model using the <mark style="background-color:#FFEC92 ">librispeech_train</mark> dataset from the previous step. Execute the query below to create a model named <mark style="background-color:#E9D7FD ">my_speech_recognition_model</mark>.  
(Estimated time required for query execution: 1 min)

In [None]:
%%thanosql
BUILD MODEL my_speech_recognition_model
USING Wav2Vec2En
OPTIONS (
    audio_col='audio_path',
    text_col='text',
    epochs=1,
    batch_size=4,
    overwrite=True
    )
AS
SELECT *
FROM librispeech_train

<div class="admonition note">
    <h4 class="admonition-title">Query Details</h4>
    <ul>
        <li>Create and train the model <mark style="background-color:#E9D7FD ">my_speech_recognition_model</mark> using the "<strong>BUILD MODEL</strong>" statement.</li>
        <li>Specify <code>Wav2Vec2En</code> as the base model with the "<strong>USING</strong>" clause.</li>
        <li>The "<strong>OPTIONS</strong>" clause specifies the options to use for model creation.
        <ul>
            <li>"audio_col" : The name of the column containing the audio path to be used for training.</li>
            <li>"text_col" : The name of the column containing the audio script information.</li>
            <li>"epochs" : Number of complete passes over the training dataset.</li>
            <li>"batch_size" : Number of samples read in a single training step. </li>
            <li>"overwrite" : Overwrite if a model with the same name exists. If True, the existing model is overwritten with the new model. (True|False, DEFAULT : False)</li>
        </ul>
        </li>
    </ul>
</div>

<div class="admonition tip">
    <p>In this example, we set "epochs" to 1 to train the model quickly. In general, a larger number of "epochs" tends to improve prediction accuracy at the cost of longer training time.</p>
</div>
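
To get a feel for how "epochs" and "batch_size" affect training cost, the number of batches processed in a run can be estimated with simple arithmetic. This is a rough Python sketch, not a ThanoSQL API: the helper `training_batches` and the row count of 1,000 are hypothetical, and it assumes the last partial batch of each epoch is still processed.

```python
import math


def training_batches(num_samples, batch_size, epochs):
    """Estimate how many batches are processed over a whole training run."""
    if num_samples < 0 or batch_size <= 0 or epochs <= 0:
        raise ValueError("num_samples must be >= 0; batch_size and epochs must be > 0")
    # Assumption: a final, smaller batch is kept rather than dropped.
    batches_per_epoch = math.ceil(num_samples / batch_size)
    return batches_per_epoch * epochs


# 1,000 is a made-up row count, not the size of librispeech_train.
print(training_batches(1000, batch_size=4, epochs=1))  # 250
print(training_batches(1000, batch_size=4, epochs=3))  # 750
```

Under these assumptions, tripling "epochs" triples the number of batches, while a smaller "batch_size" lowers GPU memory use per step but increases the number of steps.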

## __4. Predict Speech Recognition Results Using the Model You Created__

Use the speech recognition model created in the previous step to predict the target value (script) for the audio in the <mark style="background-color:#FFEC92 ">librispeech_test</mark> table, which was not used for training. After you execute the query below, the prediction results are returned in the <mark style="background-color:#D7D0FF">predicted</mark> column.

In [None]:
%%thanosql
PREDICT USING my_speech_recognition_model
OPTIONS (
    audio_col='audio_path'
    )
AS
SELECT *
FROM librispeech_test

<div class="admonition note">
    <h4 class="admonition-title">Query Details</h4>
    <ul>
        <li>Use the <mark style="background-color:#E9D7FD ">my_speech_recognition_model</mark> model created in the previous step for prediction using the "<strong>PREDICT USING</strong>" statement.</li>
        <li>The "<strong>OPTIONS</strong>" clause specifies the options to use for prediction.
        <ul>
            <li>"audio_col" : The name of the column containing the audio path to use for prediction.</li>
        </ul>
        </li>
    </ul>
</div>
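
One common way to judge a transcription is the word error rate (WER): the word-level edit distance between the reference script and the prediction, divided by the number of reference words. The sketch below is a plain Python illustration and is not provided by ThanoSQL. The function `word_error_rate` is a hypothetical helper, the lowercasing step is an assumption, and the example strings are made up; in practice you would pass values from the `text` and `predicted` columns.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    # Assumption: case differences should not count as errors.
    ref_words = reference.lower().split()
    hyp_words = hypothesis.lower().split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")

    # previous[j] holds the distance between the reference prefix processed
    # so far and the first j hypothesis words.
    previous = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref_words)


# Made-up strings for illustration only.
reference = "THE CAT SAT ON THE MAT"
predicted = "THE CAT SIT ON MAT"
print(round(word_error_rate(reference, predicted), 3))  # 0.333
```

A WER of 0 means the prediction matches the script word for word. Values can exceed 1 when the prediction contains many extra words.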

## __5. In Conclusion__

In this tutorial, we created a speech recognition model using the <mark style="background-color:#FFD79C">LibriSpeech</mark> dataset. As this is a beginner-level tutorial, we focused on the process rather than accuracy. Speech recognition models can improve their accuracy through fine-tuning suited to the user's needs. Try training the base model on your own data to improve its performance. By combining various unstructured data (image, video, text, etc.) with numeric data, you can create your own models and build competitive services.

The next tutorial, [Creating an Intermediate Speech Recognition Model], takes a deeper dive into the speech recognition model. If you want to learn more about building your own speech recognition model for your service, try the following tutorials.

- [How to Upload to ThanoSQL DB](https://docs.thanosql.ai/en/how-to_guides/ThanoSQL_connecting/data_upload/)
- [Creating an Intermediate speech recognition model]
- [Deploying My Speech Recognition Models](https://docs.thanosql.ai/en/how-to_guides/ThanoSQL_connecting/thanosql_api/rest_api_thanosql_query/)

<div class="admonition tip">
    <h4 class="admonition-title">Inquiries about deploying a model for your own service</h4>
    <p>If you have any difficulties in creating your own model using ThanoSQL or applying it to your service, please feel free to contact us below😊</p>
    <p>For inquiries regarding building a speech recognition model: contact@smartmind.team</p>
</div>