<h1 align="center">Natural Language Processing - Transformers</h1>
<h2 align="center">Convert Speech to Text. Machine Translation to English. Text Analytics.</h2>
<h3 align="center">Rositsa Chankova</h3>

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [2]:
from pydub import AudioSegment
# Load the MP3 recording from Drive and convert it to WAV
mp3_audio = AudioSegment.from_file(r'/content/drive/MyDrive/dnc-2004-speech.mp3', format='mp3')
mp3_audio.export('dnc-2004-speech_converted.wav', format='wav')

<_io.BufferedRandom name='dnc-2004-speech_converted.wav'>

In [3]:
!ls

1_audi_file.wav  3_audi_file.wav  5_audi_file.wav		 drive
2_audi_file.wav  4_audi_file.wav  dnc-2004-speech_converted.wav  sample_data


In [4]:
!pwd

/content


In [5]:
!pip install transformers

In [6]:
from pydub import AudioSegment

# Load the converted WAV file; len() of an AudioSegment is its duration in milliseconds
wav_audio = AudioSegment.from_file(r'dnc-2004-speech_converted.wav', format='wav')
print(len(wav_audio) / (1000 * 60))  # duration in minutes

# Split the roughly 13-minute recording into 3-minute chunks: four full chunks
# plus a shorter fifth one (slicing is done in milliseconds)
chunk_seconds = 180
counter_audio = chunk_seconds
split_audio = [wav_audio[:chunk_seconds * 1000]]
for i in range(4):
    split_audio.append(wav_audio[counter_audio * 1000:(counter_audio + chunk_seconds) * 1000])
    counter_audio += chunk_seconds

# Save each chunk as <n>_audi_file.wav, numbering from 1
for count, audio_object in enumerate(split_audio, start=1):
    with open(f"{count}_audi_file.wav", 'wb') as out_f:
        audio_object.export(out_f, format='wav')

12.962
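The slicing above works in milliseconds, so a 12.962-minute recording cut into 180-second pieces yields five segments, the last one shorter than the rest. A minimal sketch of that boundary arithmetic in plain Python, independent of pydub; `chunk_bounds` is a hypothetical helper name and not part of the notebook:

```python
def chunk_bounds(total_ms, chunk_ms):
    """Return (start, end) millisecond pairs that cover [0, total_ms)."""
    if chunk_ms <= 0:
        raise ValueError("chunk_ms must be positive")
    return [
        (start, min(start + chunk_ms, total_ms))
        for start in range(0, total_ms, chunk_ms)
    ]


# 12.962 minutes (the duration printed above) expressed in milliseconds
total_ms = round(12.962 * 60 * 1000)
bounds = chunk_bounds(total_ms, 180 * 1000)

print(len(bounds))   # 5
print(bounds[-1])    # (720000, 777720)
```

Each pair could be applied as `audio[start:end]` on an `AudioSegment`, which avoids hard-coding the number of loop iterations.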


In [7]:
!ls

1_audi_file.wav  3_audi_file.wav  5_audi_file.wav		 drive
2_audi_file.wav  4_audi_file.wav  dnc-2004-speech_converted.wav  sample_data


### librosa: a Python package for music and audio analysis.
### https://librosa.org/doc/latest/index.html

In [None]:
import librosa
import torch
from transformers import Wav2Vec2ForCTC, Wav2Vec2Tokenizer

### The base model is pretrained and fine-tuned on 960 hours of LibriSpeech, using speech audio sampled at 16 kHz. When using the model, the speech input must also be sampled at 16 kHz.
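Because the model expects 16 kHz input, it can help to check a file's sample rate before transcription. The sketch below uses only the standard-library `wave` module; `wav_sample_rate`, `needs_resampling`, and the demo file are hypothetical names introduced for illustration, and the demo writes a short silent clip rather than touching the notebook's audio:

```python
import os
import struct
import tempfile
import wave

TARGET_RATE = 16_000  # sample rate required by the model, per the note above


def wav_sample_rate(path):
    """Read the sample rate (frames per second) from a WAV header."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate()


def needs_resampling(path, target_rate=TARGET_RATE):
    """True when the file's sample rate differs from the target rate."""
    return wav_sample_rate(path) != target_rate


# Demo only: write 10 ms of 16-bit mono silence at 44.1 kHz
demo_path = os.path.join(tempfile.mkdtemp(), "demo_44k.wav")
with wave.open(demo_path, "wb") as wav_file:
    wav_file.setnchannels(1)
    wav_file.setsampwidth(2)
    wav_file.setframerate(44_100)
    wav_file.writeframes(struct.pack("<h", 0) * 441)

print(wav_sample_rate(demo_path))   # 44100
print(needs_resampling(demo_path))  # True
```

In this notebook, `librosa.load(..., sr=16000)` resamples while loading, so a check like this serves only as a sanity test.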

# Load model and tokenizer

In [8]:
tokenizer = Wav2Vec2Tokenizer.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")

collection_of_text = []
# Only the first four chunks are transcribed; 5_audi_file.wav (the final minute or so) is not processed
for i in range(4):
    # Load the chunk and resample it to the 16 kHz rate the model expects
    speech, rate = librosa.load(f'{i+1}_audi_file.wav', sr=16000)

    input_values = tokenizer(speech, return_tensors='pt').input_values
    # Compute logits (non-normalized predictions) without tracking gradients
    with torch.no_grad():
        logits = model(input_values).logits

    # Take the most likely token id for each frame
    predicted_ids = torch.argmax(logits, dim=-1)
    # Pass the predicted ids to the tokenizer's decoder to get the transcription
    transcription = tokenizer.batch_decode(predicted_ids)[0]
    # Equivalent for a single sequence: tokenizer.decode(predicted_ids[0])
    print(transcription)
    collection_of_text.append(transcription)

print(collection_of_text)

# Concatenate the per-chunk transcriptions into one string
final_complete_speech = "".join(collection_of_text)

print(final_complete_speech)


The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'Wav2Vec2CTCTokenizer'. 
The class this function is called from is 'Wav2Vec2Tokenizer'.
Some weights of Wav2Vec2ForCTC were not initialized from the model checkpoint at facebook/wav2vec2-base-960h and are newly initialized: ['wav2vec2.masked_spec_embed']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


ON BEHALF OF THE GREAT STATE OF ILLINOIS  TRO TO THE MATON LAND OF LINKON LET ME EXPRESS MY DEEPEST GRATITUDE FOR THE PRIVILEGE OF ADDRESSING THIS CONVENTION TO NIGHT IS A PARTICULAR HONOR FOR ME BECAUSE LET'S FACEIT MY PRESENCE ON THIS STAGE IS PRETTY UNLIKLEMY FATHER WAS A FOREIGN STUDE BORN AND RAISED IN A SMALL BILLAGE IN CANION HE GREW UP HURTING GOS WENT TO SCHOOL IN A TINROOF SHACK HIS FATHER MY GRANDFATHER WAS A COOK A DOMESTIC SERVANT TO THE BRITISEH BUT MY GRANDFATHER HAD LARGER DREAMS FOR HIS SON THROUGH HARD WORK AND PERSERBERANCE MY FATHER GOT A SCHOLARSHIP TO STUDY IN A MAGICAL PLACE AMERICA THAT SHOWN IS A BEAKIN A FREEDOM AND OPPORTUNITY TO SO MANY WHO HAD COME TE CO O TUDYIN HERE MY FATHER ME MY MO SHE WAS BORN IN A TOWN ON THE OTHER SIDE OF THE WORLD IN CANSAS HER FATHER WORKD ON OIL RIGS AND FARMS THROUGH MOST OF THE DEPRESSION THE DAY AFTER PEARL HARBOUR MY GRANDFATHER SIGNED UP FOR DUTEE JOING PATEN'S ARMY MARCHED ACROSS EUROPE BAC HONE MY GRANDMOTHER RAISED A BABY
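The `argmax` and `batch_decode` steps above amount to greedy CTC decoding: pick the most likely token for each audio frame, collapse consecutive repeats, then drop the blank token. The toy sketch below illustrates that idea only; it is not the library's implementation. The vocabulary, the blank id of 0, and the `|` word delimiter are assumptions made for the example.

```python
def ctc_greedy_decode(ids, id_to_token, blank_id=0, word_delimiter="|"):
    """Collapse repeated ids, remove blanks, and join the tokens into text."""
    collapsed = []
    previous = None
    for token_id in ids:
        # Keep an id only when it differs from the previous frame's id
        if token_id != previous:
            collapsed.append(token_id)
        previous = token_id

    # Blanks separate genuine repeats (the two L's below), then are discarded
    chars = [id_to_token[i] for i in collapsed if i != blank_id]
    return "".join(chars).replace(word_delimiter, " ").strip()


# Hypothetical vocabulary and per-frame predictions
toy_vocab = {0: "<pad>", 1: "|", 2: "H", 3: "E", 4: "L", 5: "O"}
frame_ids = [2, 2, 0, 3, 4, 4, 0, 4, 5, 5, 1, 1]

print(ctc_greedy_decode(frame_ids, toy_vocab))  # HELLO
```

Greedy decoding uses no language model, which is one reason the transcript above contains misspellings such as LINKON and CANSAS.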

In [9]:
# pip install -U pip setuptools wheel
# pip install -U spacy
# pip install scispacy
# pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.4.0/en_core_sci_scibert-0.4.0.tar.gz

# Load English tokenizer, tagger, parser and NER

In [1]:
!pip install -U spacy
!python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")

Collecting spacy
  Downloading spacy-3.3.0-cp39-cp39-macosx_10_9_x86_64.whl (6.5 MB)
Collecting pathy>=0.3.5
  Using cached pathy-0.6.1-py3-none-any.whl (42 kB)
Collecting typer<0.5.0,>=0.3.0
  Downloading typer-0.4.1-py3-none-any.whl (27 kB)
Collecting srsly<3.0.0,>=2.4.3
  Downloading srsly-2.4.3-cp39-cp39-macosx_10_9_x86_64.whl (457 kB)
Collecting pydantic!=1.8,!=1.8.1,<1.9.0,>=1.7.4
  Downloading pydantic-1.8.2-cp39-cp39-macosx_10_9_x86_64.whl (2.7 MB)
Collecting wasabi<1.1.0,>=0.9.1
  Downloading wasabi-0.9.1-py3-none-any.whl (26 kB)
Collecting cymem<2.1.0,>=2.0.2
  Downloading cymem-2.0.6-cp39-cp39-macosx_10_9_x86_64.whl (32 kB)
Collecting murmurhash<1.1.0,>=0.28.0
  Downloading murmurhash-1.0.7-cp39-cp39-macosx_10_9_x86_64.whl (18 kB)
Collecting langcodes<4.0.0,>=

Installing collected packages: en-core-web-sm
Successfully installed en-core-web-sm-3.3.0
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


In [2]:
doc = nlp(final_complete_speech.lower())
print(doc.ents)


# Analyze syntax

In [None]:
print("Noun phrases:", [chunk.text for chunk in doc.noun_chunks])
print("Verbs:", [token.lemma_ for token in doc if token.pos_ == "VERB"])


# Find named entities, phrases and concepts

In [12]:
for entity in doc.ents:
    print(entity.text, entity.label_)

Noun phrases: ['behalf', 'the great state', 'illinois  tro', 'the maton land', 'linkon', 'me', 'my deepest gratitude', 'the privilege', 'this convention', 'night', 'a particular honor', 'me', "'s", 'my presence', 'this stage', 'pretty unliklemy father', 'a foreign stude', 'a small billage', 'canion', 'he', 'gos', 'school', 'a tinroof shack', 'his father', 'my grandfather', 'a cook', 'a domestic servant', 'the britiseh', 'my grandfather', 'larger dreams', 'his son', 'hard work', 'my father', 'a scholarship', 'a magical place', 'america', 'that', 'a beakin', 'a freedom', 'opportunity', 'who', 'o tudyin', 'my father', 'me', 'my mo', 'she', 'a town', 'the other side', 'the world', 'cansas', 'her father', 'workd', 'oil rigs', 'farms', 'the depression', 'pearl', 'my grandfather', "dutee joing paten's army", 'europe', 'my grandmother', 'a baby', 'a bomer assembly line', 'the war', 'they', 'the ge', 'i', 'a house', 'achay', 'h wine', 'search', 'opportunite', 'they', 'big dreams', 'their daught