# __Create a Speech Recognition Model__

- Tutorial Difficulty: ★☆☆☆☆
- 10 min read
- Languages: [SQL](https://en.wikipedia.org/wiki/SQL) (100%)
- File location: tutorial_en/thanosql_ml/audio_recognition/speech_recognition.ipynb
- References: [LibriSpeech DataSet](http://www.openslr.org/12), [wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations](https://arxiv.org/abs/2006.11477)

## Tutorial Introduction

<div class="admonition note">
    <h4 class="admonition-title">Understanding Speech Recognition</h4>
    <p>Speech recognition, also called computer speech recognition or speech-to-text, allows programs to convert human speech into text. It is now used in a wide range of settings, including automobiles, medicine, and everyday devices such as smart speakers and smartphones. Modern speech recognition systems based on <a href="https://en.wikipedia.org/wiki/Machine_learning">machine learning</a> use algorithms that model the grammar, syntax, structure, and composition of audio and speech signals.</p>
</div>

<div class="admonition warning">
    <p>Speech recognition should not be confused with voice recognition, which focuses only on identifying an individual speaker's voice.</p>
</div>

Today, speech recognition is applied across many industries. Advances in the technology have extended it to automatic interpretation, in settings ranging from casual travel to high-level business meetings. It has also branched into related fields such as speech synthesis, which can act as a virtual guide, mimic the voice of a specific celebrity, or convert a given passage of text into speech.

__The following are examples and applications of the ThanoSQL speech recognition model.__

- Speech recognition converts recorded phone consultations into text, which enables customer sentiment analysis and consultation trend analysis. Customer service representatives can improve their service by quickly receiving information relevant to a customer's inquiry. After a consultation, sentiment analysis also provides an indirect measure of customer satisfaction, so satisfaction trends can be analyzed.
- With speech recognition, you can take notes faster than typing on a keyboard and instantly search for specific keywords, even in long audio files.

<div class="admonition note">
    <h4 class="admonition-title">In This Tutorial</h4>
    <p>👉 LibriSpeech [Panayotov et al. 2015] is derived from the <a href="https://librivox.org/">LibriVox project</a>, a volunteer-driven audiobook project, and is one of the most widely used large-scale English speech datasets in speech recognition research. It was created by processing approximately 1,000 hours of recorded audiobook data sampled at 16 kHz. The target table for this tutorial consists of pre-uploaded audio file paths and their scripts. The goal of this tutorial is to convert audio files to text.</p>
</div>

<div class="admonition warning">
    <h4 class="admonition-title">Tutorial Notes</h4>
    <ul>
        <li>ThanoSQL currently only supports the following audio file formats: '.wav', '.flac'.</li>
        <li>Both a column indicating the audio file path and a column indicating the text corresponding to the target value must exist in the table.</li>
        <li>The base model of the speech recognition model (<strong>Wav2Vec2En</strong>) uses a GPU. Depending on the model size and the batch size, you may run out of GPU memory. In that case, try using a smaller model or reducing the batch size.</li>
    </ul>
</div>
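
The format restriction in the notes above can be checked before a table is loaded. The following is a minimal Python sketch (Python being the notebook's host language) and is not part of ThanoSQL. The helper name `unsupported_audio_paths` is hypothetical, two of the example paths are made up, and the sketch assumes that extension matching is case-insensitive, which the tutorial does not state.

```python
from pathlib import PurePosixPath

# Formats listed in the tutorial notes above.
SUPPORTED_EXTENSIONS = {".wav", ".flac"}


def unsupported_audio_paths(paths):
    """Return the paths whose extension is not in SUPPORTED_EXTENSIONS.

    Hypothetical helper: it compares the lower-cased file suffix only and
    does not open the files or inspect their contents.
    """
    return [
        p for p in paths
        if PurePosixPath(p).suffix.lower() not in SUPPORTED_EXTENSIONS
    ]


paths = [
    "thanosql-dataset/librispeech_data/000.wav",
    "my_data/interview.flac",  # made-up path for illustration
    "my_data/podcast.mp3",     # made-up path for illustration
]
print(unsupported_audio_paths(paths))
```

Any path the helper returns would need to be converted to a supported format before it is used in an audio path column.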

## __0. Prepare Dataset and Model__

As mentioned in the [ThanoSQL Workspace](https://docs.thanosql.ai/en/getting_started/paas/workspace/lab/), you must issue an API token and run the cell below before you can execute ThanoSQL queries.

In [None]:
%load_ext thanosql
%thanosql API_TOKEN=<Issued_API_TOKEN>

### __Prepare Dataset__

In [2]:
%%thanosql
GET THANOSQL DATASET librispeech_data
OPTIONS (overwrite=True)

Success


<div class="admonition note">
    <h4 class="admonition-title">Query Details</h4>
    <ul>
        <li>"<strong>GET THANOSQL DATASET</strong>" downloads the specified dataset to the workspace.</li>
        <li>"<strong>OPTIONS</strong>" specifies the option values to be used for the <strong>GET THANOSQL DATASET</strong> clause.
        <ul>
            <li>"overwrite": determines whether to overwrite a dataset if it already exists. If set as True, the old dataset is replaced with the new dataset (bool, optional, True|False, default: False)</li>
        </ul>
        </li>
    </ul>
</div>

In [3]:
%%thanosql
COPY librispeech_train 
OPTIONS (if_exists='replace')
FROM 'thanosql-dataset/librispeech_data/librispeech_train.csv'

Success


In [4]:
%%thanosql
COPY librispeech_test 
OPTIONS (if_exists='replace')
FROM 'thanosql-dataset/librispeech_data/librispeech_test.csv'

Success


<div class="admonition note">
    <h4 class="admonition-title">Query Details</h4>
    <ul>
        <li>"<strong>COPY</strong>" specifies the name of the database table in which the dataset is saved.</li>
        <li>"<strong>OPTIONS</strong>" specifies the option values to be used for the <strong>COPY</strong> clause.
        <ul>
           <li>"if_exists": determines what happens when the table already exists: raise an error, append to the existing table, or replace the existing table (str, optional, 'fail'|'replace'|'append', default: 'fail')</li>
        </ul>
        </li>
    </ul>
</div>

### __Prepare the Model__

In [5]:
%%thanosql
GET THANOSQL MODEL wav2vec2
OPTIONS (
    model_name='tutorial_audio_recognition',
    overwrite=True
    )

Success


<div class="admonition note">
    <h4 class="admonition-title">Query Details</h4>
    <ul>
        <li>"<strong>GET THANOSQL MODEL</strong>" downloads the specified model to the workspace.</li>
        <li>"<strong>OPTIONS</strong>" specifies the option values to be used for the <strong>GET THANOSQL MODEL</strong> clause.
        <ul>
            <li>"model_name": the model name to store a given model in the ThanoSQL workspace (str, optional)</li>
            <li>"overwrite": determines whether to overwrite a model if it already exists. If set as True, the old model is replaced with the new model (bool, optional, True|False, default: False)</li>
        </ul>
        </li>
    </ul>
</div>

## __1. Check Dataset__

To create a speech recognition model, we use the __librispeech_train__ table located in the ThanoSQL workspace database. Run the query below to check the contents of the table.

In [6]:
%%thanosql
SELECT *
FROM librispeech_train
LIMIT 5

Unnamed: 0,audio_path,text
0,thanosql-dataset/librispeech_data/000.wav,i noticed how white and well shaped his own ha...
1,thanosql-dataset/librispeech_data/001.wav,the only conflicts that occurred on irish soil...
2,thanosql-dataset/librispeech_data/002.wav,inquired shaggy in the metal forest
3,thanosql-dataset/librispeech_data/003.wav,my grandmother always spoke in a very loud ton...
4,thanosql-dataset/librispeech_data/004.wav,the poets of succeeding ages have dwelt much i...


<div class="admonition note">
    <h4 class="admonition-title">Understanding the Data Table</h4>
    <p>The <strong>librispeech_train</strong> table contains the following columns.</p>
    <ul>
        <li>audio_path: the audio file's path</li>
        <li>text: target value of the corresponding audio (target, script)</li>
    </ul>
</div>


In [7]:
%%thanosql
PRINT AUDIO 
AS
SELECT audio_path
FROM librispeech_train
LIMIT 3

/home/jovyan/thanosql-dataset/librispeech_data/000.wav


/home/jovyan/thanosql-dataset/librispeech_data/001.wav


/home/jovyan/thanosql-dataset/librispeech_data/002.wav


## __2. Predict Using Pre-built Model__

To predict the results using the pre-built __tutorial_audio_recognition__ model, run the query below.

In [8]:
%%thanosql
PREDICT USING tutorial_audio_recognition
OPTIONS (
    audio_col='audio_path',
    batch_size=8
    )
AS 
SELECT * 
FROM librispeech_train

Unnamed: 0,audio_path,text,predict_result
0,thanosql-dataset/librispeech_data/000.wav,i noticed how white and well shaped his own ha...,I NOTICED HOW WHITE AND WELL SHAPED HIS OWN HA...
1,thanosql-dataset/librispeech_data/001.wav,the only conflicts that occurred on irish soil...,THE ONLY CONFLICTS THAT OCCURRED ON IRISH SOIL...
2,thanosql-dataset/librispeech_data/002.wav,inquired shaggy in the metal forest,INQUIRED SHAGGY IN THE MEDAL FOREST
3,thanosql-dataset/librispeech_data/003.wav,my grandmother always spoke in a very loud ton...,MY GRANDMOTHER ALWAYS SPOKE IN A VERY LOUD TON...
4,thanosql-dataset/librispeech_data/004.wav,the poets of succeeding ages have dwelt much i...,THE POETS OF SUCCEEDING AGES HAVE DWELT MUCH I...
...,...,...,...
75,thanosql-dataset/librispeech_data/075.wav,we can't do anything without evidence complain,WE CAN'T DO ANYTHING WITHOUT EVIDENCE COMPLAIN
76,thanosql-dataset/librispeech_data/076.wav,when i came up he touched my shoulder and look...,WHEN I CAME UP HE TOUCHED MY SHOULDER AND LOOK...
77,thanosql-dataset/librispeech_data/077.wav,it relieved him for a while,IT RELIEVED HIM FOR A WHILE
78,thanosql-dataset/librispeech_data/078.wav,this world's thick vapours whelm your eyes unw...,THIS WORLD'S THICK VAPOURS WHELM YOUR EYES UNW...
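
In the output above, the `predict_result` column is upper-case while `text` is lower-case, so a direct string comparison would flag every row as different. The following is a minimal Python sketch of a case-insensitive, word-by-word comparison, using two rows that appear untruncated in the output. The helper `mismatched_words` is hypothetical and is not a ThanoSQL feature; it only handles pairs with the same number of words.

```python
def mismatched_words(reference, prediction):
    """Return (reference_word, predicted_word) pairs that differ.

    Hypothetical helper: lower-cases both strings, splits on whitespace and
    compares position by position. Returns None when the word counts
    differ, because a positional comparison is not meaningful in that case.
    """
    ref_words = reference.lower().split()
    pred_words = prediction.lower().split()
    if len(ref_words) != len(pred_words):
        return None
    return [(r, p) for r, p in zip(ref_words, pred_words) if r != p]


# Rows 2 and 77 from the output above (both shown in full there).
rows = [
    ("inquired shaggy in the metal forest",
     "INQUIRED SHAGGY IN THE MEDAL FOREST"),
    ("it relieved him for a while",
     "IT RELIEVED HIM FOR A WHILE"),
]

for reference, prediction in rows:
    print(mismatched_words(reference, prediction))
```

For these two rows, the only difference the sketch reports is "metal" recognized as "medal".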


## __3. Build a Speech Recognition Model__

To create a speech recognition model with the name __my_speech_recognition_model__ using the __librispeech_train__ dataset from the previous step, run the following query.  
(Estimated duration of query execution: 1 min)

In [9]:
%%thanosql
BUILD MODEL my_speech_recognition_model
USING Wav2Vec2En
OPTIONS (
    audio_col='audio_path',
    text_col='text',
    max_epochs=1,
    batch_size=4,
    overwrite=True
    )
AS
SELECT *
FROM librispeech_train

Success


<div class="admonition note">
    <h4 class="admonition-title">Query Details</h4>
    <ul>
        <li>"<strong>BUILD MODEL</strong>" creates and trains a model named <strong>my_speech_recognition_model</strong>.</li>
        <li>"<strong>USING</strong>" specifies <strong>Wav2Vec2En</strong> as the base model.</li>
         <li>"<strong>OPTIONS</strong>" specifies the option values used to create the model. 
        <ul>
            <li>"audio_col": the name of the column containing the audio path to be used for training (str, default: 'audio_path')</li>
            <li>"text_col": the name of the column containing the audio script information (str, default: 'text')</li>
            <li>"max_epochs": the number of passes over the training dataset (int, optional, default: 5)</li>
            <li>"batch_size": the number of samples processed in a single training step (int, optional, default: 16)</li>
            <li>"overwrite": determines whether to overwrite a model if it already exists. If set as True, the old model is replaced with the new model (bool, optional, True|False, default: False) </li>
        </ul>
        </li>
    </ul>
</div>

<div class="admonition tip">
    <p>In this example, we set “max_epochs” to 1 so that the model trains quickly. In general, a larger “max_epochs” value tends to improve prediction accuracy at the cost of longer training time.</p>
</div>
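
To make the cost side of this trade-off concrete, the number of training steps can be estimated from the table size, `batch_size`, and `max_epochs`. The sketch below is plain arithmetic in Python and not a ThanoSQL API. The row count of 80 is an assumed value for illustration, and the sketch assumes a final partial batch is still processed, which the tutorial does not specify.

```python
import math


def training_steps(num_rows, batch_size, max_epochs):
    """Estimate the number of batches processed during training.

    Assumes every epoch visits each row once and that a final, smaller
    batch is still processed (hence the ceiling).
    """
    steps_per_epoch = math.ceil(num_rows / batch_size)
    return steps_per_epoch * max_epochs


num_rows = 80  # assumed row count, for illustration only

# Settings used in the BUILD MODEL query above.
print(training_steps(num_rows, batch_size=4, max_epochs=1))
# Documented defaults: batch_size=16, max_epochs=5.
print(training_steps(num_rows, batch_size=16, max_epochs=5))
```

A larger batch size lowers the number of steps per epoch but needs more GPU memory per step, which is why the tutorial notes suggest reducing it when memory runs out.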

## __4. Predict__

To run predictions on the __librispeech_test__ table with the speech recognition model created in the previous step, run the following query.

In [10]:
%%thanosql
PREDICT USING my_speech_recognition_model
OPTIONS (
    audio_col='audio_path',
    result_col='predict_result',
    table_name='librispeech_test'
    )
AS
SELECT *
FROM librispeech_test

Unnamed: 0,audio_path,text,predict_result
0,thanosql-dataset/librispeech_data/080.wav,dead said doctor macklewain,DEAD SAID DOCTOR MACKELWAYNE
1,thanosql-dataset/librispeech_data/081.wav,one day when i rode over to the shimerdas i fo...,ONE DAY WHEN I RODE OVER TO THE SHIMERIDAS I F...
2,thanosql-dataset/librispeech_data/082.wav,well i don't think you should turn a guy's t v...,WELL I DON'T THINK YOU SHOULD TURN A GUISE TIV...
3,thanosql-dataset/librispeech_data/083.wav,and what allurements or what vantages upon the...,AND WHAT ALLUREMENTS OR WHAT VANTAGES UPON THE...
4,thanosql-dataset/librispeech_data/084.wav,yes how many,YES HOW MANY
5,thanosql-dataset/librispeech_data/085.wav,then i look perhaps like what i am,THEN I LOOK PERHAPS LIKE WHAT I AM
6,thanosql-dataset/librispeech_data/086.wav,i'm mister christopher from london,I'M MISTER CHRISTOPHER FROM LONDON
7,thanosql-dataset/librispeech_data/087.wav,nature a difference of fifty years had set a p...,NATURE A DIFFERENCE OF FIFTY YEARS HAD SET A P...
8,thanosql-dataset/librispeech_data/088.wav,he is just married you know is he said burgess,HE IS JUST MARRIED YOU KNOWIS HE SAID BURGIS
9,thanosql-dataset/librispeech_data/089.wav,she pointed into the gold cottonwood tree behi...,SHE POINTED IN TO THE GOLD COTTONWOOD TREE BEH...


<div class="admonition note">
    <h4 class="admonition-title">Query Details</h4>
    <ul>
        <li>"<strong>PREDICT USING</strong>" predicts the outcome using the <strong>my_speech_recognition_model</strong> model.</li>
        <li>"<strong>OPTIONS</strong>" specifies the option values to be used for prediction.
        <ul>
            <li>"audio_col": the name of the column containing the audio path to be used for prediction (str, default: 'audio_path')</li>
            <li>"result_col": the column that contains the predicted results (str, optional, default: 'predict_result')</li>
            <li>"table_name": the table name to be stored in the ThanoSQL workspace database. If a previously used table is specified, the existing table will be replaced by the new table with a 'predict_result' column. If not specified, the result dataframe will not be saved as a data table (str, optional)</li>
        </ul>
        </li>
    </ul>
</div>
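
The tutorial does not report an accuracy figure for these predictions. One common metric for speech recognition is the word error rate (WER): the word-level edit distance between the reference script and the prediction, divided by the number of reference words. The Python sketch below computes it for the rows of the prediction output that are shown in full. The function names are hypothetical, lower-casing is an assumed normalization step, and none of this is a ThanoSQL feature.

```python
def word_edit_distance(ref_words, hyp_words):
    """Levenshtein distance over words (substitutions, insertions, deletions)."""
    previous = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            cost = 0 if ref_word == hyp_word else 1
            current.append(min(
                previous[j] + 1,         # deletion of a reference word
                current[j - 1] + 1,      # insertion of a predicted word
                previous[j - 1] + cost,  # match or substitution
            ))
        previous = current
    return previous[-1]


def word_error_rate(pairs):
    """Total word errors divided by total reference words, after lower-casing."""
    errors = 0
    total = 0
    for reference, prediction in pairs:
        ref_words = reference.lower().split()
        errors += word_edit_distance(ref_words, prediction.lower().split())
        total += len(ref_words)
    return errors / total


# Rows 0, 4, 5, 6 and 8 from the prediction output (the ones shown in full).
pairs = [
    ("dead said doctor macklewain", "DEAD SAID DOCTOR MACKELWAYNE"),
    ("yes how many", "YES HOW MANY"),
    ("then i look perhaps like what i am", "THEN I LOOK PERHAPS LIKE WHAT I AM"),
    ("i'm mister christopher from london", "I'M MISTER CHRISTOPHER FROM LONDON"),
    ("he is just married you know is he said burgess",
     "HE IS JUST MARRIED YOU KNOWIS HE SAID BURGIS"),
]

print(f"WER on these five rows: {word_error_rate(pairs):.3f}")
```

Because only five short rows are used, the resulting figure is illustrative and is not an evaluation of the model.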

## __5. In Conclusion__

In this tutorial, we created a speech recognition model using the LibriSpeech dataset. As this is a beginner-level tutorial, we focused on the process rather than accuracy. The accuracy of a speech recognition model can be improved by fine-tuning it to suit the user's needs. Try training the base model on your own data to improve its performance. Create your own model and provide competitive services by combining various unstructured data (image, audio, video, etc.) and structured data with ThanoSQL.

* [How to Upload My Data to the ThanoSQL Workspace](https://docs.thanosql.ai/en/getting_started/data_upload/)
* [How to Create a Table Using My Data](https://docs.thanosql.ai/en/how-to_guides/ThanoSQL_query/COPY_SYNTAX/)
* [How to Upload My Model to the ThanoSQL Workspace](https://docs.thanosql.ai/en/how-to_guides/ThanoSQL_query/UPLOAD_MODEL_SYNTAX/)

<div class="admonition tip">
    <h4 class="admonition-title">Inquiries About Deploying a Model for Your Own Service</h4>
    <p>If you have any difficulties creating your own model using ThanoSQL or applying it to your service, please feel free to contact us below😊</p>
    <p>For inquiries regarding building a speech recognition model: <a href="mailto:contact@smartmind.team">contact@smartmind.team</a></p>
</div>