EXPERIMENT EXPLANATION

CONTEXT: We use a third-party service for voice-to-text (V2T) transcription. The service produces a set of texts, which we then use as context for our Question Answering solution. We therefore need to make sure that the V2T service delivers text of satisfactory quality.

GOAL: a benchmark enabling the estimation of text quality.

HOW TO: Let us assume that satisfactory text consists solely of English words, along with some names, toponyms, etc. We can then use the following equation:
$$
s = \frac{t_{b}}{t_{g}} \times 100\%
$$
where $t_{g}$ is the total number of words in the text dictionary (the set of unique tokens) and $t_{b}$ is the number of bad tokens in this dictionary.
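
The equation can be sketched as a small helper. This is only an illustration: the function name `bad_token_share` and the counts passed to it are hypothetical and are not taken from the experiment.

```python
def bad_token_share(total_tokens: int, bad_tokens: int) -> float:
    """Return the share of bad tokens as a percentage: s = t_b / t_g * 100."""
    if total_tokens <= 0:
        raise ValueError("total_tokens must be positive")
    if not 0 <= bad_tokens <= total_tokens:
        raise ValueError("bad_tokens must be between 0 and total_tokens")
    return 100 * bad_tokens / total_tokens


# Hypothetical counts, chosen only to illustrate the arithmetic
print(bad_token_share(total_tokens=200, bad_tokens=25))  # 12.5
```

A lower value means fewer suspicious tokens, so under the assumption above it suggests better transcription quality.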


EXTRACT TEXT FROM THE FILES

In [10]:
import configparser
import os

config = configparser.ConfigParser()
config.read('config.ini')
folder_path = config['Paths']['folder_path']

# Function to merge the content of all .txt files in a folder into a single string
def merge_txt_files(folder_path):
    merged_text = ""
    # Sort the file names so the merge order is reproducible
    for file_name in sorted(os.listdir(folder_path)):
        if not file_name.endswith('.txt'):
            continue  # Skip anything that is not a .txt file
        # os.listdir returns bare file names, so join them with the folder path
        file_path = os.path.join(folder_path, file_name)
        with open(file_path, 'r') as file:
            text = file.read()
        # Append the content, separating documents with a newline
        merged_text += text + "\n"
    return merged_text

# Call the function to merge the .txt files
merged_text = merge_txt_files(folder_path)

# Display the merged text
print(merged_text)

To access your star reports, you will click on the reports tile on your Renaissance home page and then star assessment. If you have a report that you use often, you can click the thumb tack inside that report tile to move the report to the top of your screen. If you are viewing a report and would like more information about that report. Clicking on the question mark at the top of your screen will bring up an article from Renaissance that explains that report in detail.
The star diagnostic report presents diagnostic and skill information for an individual student. You will find this under the state performance and mastery section of your reports page. When choosing your reports parameters, there are two things you want to keep in mind always select the benchmark type of state when it is available and always uncheck the show grade equivalent box as that can be confusing for students and parents. This report has a color coded bar that will show you where your student is performing. The re
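
The extraction cell expects a config.ini file with a [Paths] section and a folder_path key. The sketch below shows that layout by parsing it from a string rather than from disk. The folder name `transcripts` is a placeholder, not the path used in the experiment.

```python
import configparser

# Hypothetical contents of config.ini; the folder name is an assumption
SAMPLE_CONFIG = """
[Paths]
folder_path = transcripts
"""

config = configparser.ConfigParser()
config.read_string(SAMPLE_CONFIG)  # Parse the text instead of reading a file
folder_path = config["Paths"]["folder_path"]
print(folder_path)  # transcripts
```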

TOKENISE

In [11]:
import nltk

# word_tokenize needs the 'punkt' tokenizer models
nltk.download('punkt')

raw_tokens = nltk.word_tokenize(merged_text)

print(raw_tokens)



DELETE NON-WORDS (DATES, NUMBERS, PUNCTUATION, ...)

In [12]:
import re

# Keep only tokens made up entirely of letters (no digits, underscores or punctuation)
word_pattern = re.compile(r'^[^\W\d_]+$')
words_tokens = [token for token in raw_tokens if word_pattern.match(token)]

print(words_tokens)
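
To see what the pattern keeps and drops, here is a minimal sketch on a handful of made-up tokens (they are not taken from the transcripts). Note that hyphenated words and contraction fragments are dropped too, since they contain non-letter characters.

```python
import re

# Same pattern as in the cell above: one or more letters, nothing else
word_pattern = re.compile(r'^[^\W\d_]+$')

# Hypothetical tokens, chosen to cover the main cases
sample_tokens = ['Reports', '2024', 'co-op', "n't", '...', 'naïve', 'star_report', 'v2']
kept = [token for token in sample_tokens if word_pattern.match(token)]
print(kept)  # ['Reports', 'naïve']
```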



NORMALISE

In [19]:
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet

# Download necessary NLTK resources
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')

# Function to convert words to their base form (lemma)
def convert_to_base_form(words):
    lemmatizer = WordNetLemmatizer()
    # Tag the tokens with their part of speech
    tagged_words = nltk.pos_tag(words)
    # Lemmatize each word based on its part of speech
    base_form_words = []
    for word, pos_tag in tagged_words:
        wordnet_pos = get_wordnet_pos(pos_tag)
        word = word.lower()
        if wordnet_pos:
            base_form_words.append(lemmatizer.lemmatize(word, pos=wordnet_pos))
        else:
            base_form_words.append(word)  # If part of speech is not recognized, keep the word as is
    return base_form_words

# Function to map NLTK's POS tags to WordNet POS tags
def get_wordnet_pos(treebank_tag):
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('N'):
        return wordnet.NOUN
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    else:
        return None  # Other parts of speech

normalized_tokens = convert_to_base_form(words_tokens)
print(normalized_tokens)

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Lenovo\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\Lenovo\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Lenovo\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
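
The tag mapping matters because the WordNet lemmatizer treats a word as a noun unless it is given a part of speech. The sketch below reproduces the same prefix-based mapping without NLTK, so it can be checked in isolation. The name `treebank_to_wordnet` is hypothetical, and the single letters are the values that wordnet.ADJ, wordnet.VERB, wordnet.NOUN and wordnet.ADV stand for.

```python
from typing import Optional

# WordNet POS letters for adjective, verb, noun and adverb
TREEBANK_PREFIX_TO_WORDNET = {'J': 'a', 'V': 'v', 'N': 'n', 'R': 'r'}


def treebank_to_wordnet(treebank_tag: str) -> Optional[str]:
    """Map a Penn Treebank tag to a WordNet POS letter, or None if unmapped."""
    return TREEBANK_PREFIX_TO_WORDNET.get(treebank_tag[:1])


# A few common Treebank tags; DT (determiner) has no WordNet counterpart
for tag in ['VBG', 'NNS', 'JJR', 'RB', 'DT']:
    print(tag, '->', treebank_to_wordnet(tag))
```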




DELETE DUPLICATES

In [14]:
unique_tokens_set = set(normalized_tokens)
unique_tokens = list(unique_tokens_set)

print(unique_tokens)
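
A set does not keep the original order, so the printed list can differ between runs. If a stable order is useful when comparing outputs, one possible alternative is dict.fromkeys, sketched here on made-up tokens. The score itself does not depend on the order.

```python
# Hypothetical tokens with duplicates
tokens = ['report', 'star', 'report', 'student', 'star']

# dict keys are unique and keep insertion order, so first occurrences win
unique_in_order = list(dict.fromkeys(tokens))
print(unique_in_order)  # ['report', 'star', 'student']
```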



LIST OF INCORRECT WORDS

In [15]:
from nltk.corpus import words
nltk.download('words')


english_vocab = set(words.words())
strange_tokens = [word for word in unique_tokens if word not in english_vocab]

print(strange_tokens)

[nltk_data] Downloading package words to
[nltk_data]     C:\Users\Lenovo\AppData\Roaming\nltk_data...
[nltk_data]   Package words is already up-to-date!


['amanda', 'applies', 'blogs', 'martian', 'confusing', 'sorting', 'checkpoint', 'stanford', 'intro', 'andreas', 'que', 'kiddos', 'zpd', 'lizz', 'amazon', 'british', 'distracting', 'clickable', 'caitlyn', 'takeaway', 'anytime', 'windows', 'nepad', 'erica', 'cbm', 'prerecords', 'deborah', 'icann', 'db', 'hyperlink', 'premade', 'washington', 'spreadsheet', 'pre', 'gamification', 'implodes', 'co', 'videos', 'download', 'skills', 'clicking', 'websites', 'chatbox', 'thompson', 'jimenez', 'md', 'responses', 'outcomes', 'blog', 'kaylyn', 'bethany', 'diff', 'qual', 'mic', 'overtook', 'dennis', 'hype', 'powerpoints', 'chris', 'earned', 'checking', 'pairs', 'uploader', 'intertex', 'prete', 'christa', 'friends', 'melissa', 'plc', 'contextualizing', 'async', 'vr', 'columns', 'adding', 'updated', 'url', 'kimberly', 'tuesday', 'tanya', 'mikos', 'contextualize', 'nepo', 'caitlin', 'updating', 'odin', 'ron', 'jessica', 'cetera', 'rsars', 'lisa', 'worksheet', 'steelers', 'oops', 'krista', 'vocab', 'ic',
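
Tokens such as 'washington' and 'british' appear in the list even though they are ordinary English words. One possible cause, stated here as an assumption to verify rather than a confirmed fact, is that the word list stores such entries capitalised while our tokens are lowercased. The sketch below shows the effect on a toy vocabulary that stands in for words.words().

```python
# Hypothetical miniature word list standing in for words.words()
toy_vocab = ['Washington', 'British', 'report', 'student']
tokens = ['washington', 'british', 'report', 'zpd']

exact_vocab = set(toy_vocab)
lowercase_vocab = {word.lower() for word in toy_vocab}

# Case-sensitive lookup flags lowercased proper nouns as unknown
flagged_exact = [t for t in tokens if t not in exact_vocab]
# Lowercasing the vocabulary first leaves only the truly unknown token
flagged_lowercase = [t for t in tokens if t not in lowercase_vocab]

print(flagged_exact)      # ['washington', 'british', 'zpd']
print(flagged_lowercase)  # ['zpd']
```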

DELETE NAMES, TOPONYMS, ETC.

In [16]:
import nltk
from nltk.corpus import names

nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('names')

# Function to delete proper nouns (names, toponyms, etc.) from a list of tokens
def delete_common_abbreviations_and_toponyms(tokens):
    # Load NLTK's names corpus, lowercased to match the normalised tokens
    names_corpus = set(name.lower() for name in names.words())
    # Tag tokens with parts of speech
    tagged_tokens = nltk.pos_tag(tokens)
    # Drop tokens tagged as proper nouns (NNP, NNPS) or found in the names corpus
    filtered_tokens = [token for token, pos in tagged_tokens if pos not in ('NNP', 'NNPS') and token.lower() not in names_corpus]
    return filtered_tokens

not_valuable_tokens = delete_common_abbreviations_and_toponyms(strange_tokens)
print(not_valuable_tokens)

['applies', 'blogs', 'martian', 'confusing', 'sorting', 'checkpoint', 'intro', 'que', 'kiddos', 'zpd', 'lizz', 'amazon', 'british', 'distracting', 'clickable', 'caitlyn', 'takeaway', 'anytime', 'windows', 'nepad', 'cbm', 'prerecords', 'icann', 'db', 'hyperlink', 'premade', 'spreadsheet', 'pre', 'gamification', 'implodes', 'co', 'videos', 'download', 'skills', 'clicking', 'websites', 'chatbox', 'thompson', 'jimenez', 'md', 'responses', 'outcomes', 'blog', 'diff', 'qual', 'overtook', 'hype', 'powerpoints', 'earned', 'checking', 'pairs', 'uploader', 'intertex', 'prete', 'friends', 'plc', 'contextualizing', 'async', 'vr', 'columns', 'adding', 'updated', 'url', 'mikos', 'contextualize', 'nepo', 'updating', 'cetera', 'rsars', 'worksheet', 'steelers', 'oops', 'vocab', 'ic', 'ac', 'website', 'www', 'recommended', 'editorialism', 'personalized', 'gp', 'multi', 'nearer', 'asynchronously', 'workarounds', 'portfoli', 'npa', 'greece', 'firmo', 'digi', 'devani', 'spotify', 'devanny', 'glitch', 'repo

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Lenovo\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\Lenovo\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package names to
[nltk_data]     C:\Users\Lenovo\AppData\Roaming\nltk_data...
[nltk_data]   Package names is already up-to-date!


CALCULATE THE SCORE

In [17]:
count_unique_tokens = len(unique_tokens)
count_not_valuable_tokens = len(not_valuable_tokens)
# Share of bad tokens as a fraction; multiply by 100 to get the percentage from the equation
score = round(count_not_valuable_tokens / count_unique_tokens, 2)

print(score)

0.11
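
Because duplicates were removed earlier, this score counts every distinct token once, however often it occurs in the transcripts. As a side note, the sketch below contrasts that type-level share with a token-level share in which every occurrence counts. All tokens and the flagged set are made up for illustration.

```python
from collections import Counter

# Hypothetical normalised tokens (before deduplication) and flagged set
tokens = ['report', 'zpd', 'report', 'student', 'zpd', 'zpd', 'nepad', 'star']
flagged = {'zpd', 'nepad'}

counts = Counter(tokens)

# Type-level share: each distinct token counts once
type_share = len(flagged & set(counts)) / len(counts)

# Token-level share: each occurrence counts
token_share = sum(counts[t] for t in flagged) / len(tokens)

print(round(type_share, 2), round(token_share, 2))  # 0.4 0.5
```

The two numbers diverge when a few bad tokens are repeated often, which may matter when comparing transcripts of different lengths.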


In [18]:
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.stem import PorterStemmer

# Download necessary resources
nltk.download('punkt')
nltk.download('wordnet')

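# Simpler alternative to the POS-aware normalisation: lowercase each word and
# lemmatise it without a POS tag (the lemmatizer then treats it as a noun)
def normalize_words(words):
    normalized_words = []
    lemmatizer = WordNetLemmatizer()
    stemmer = PorterStemmer()

    for word in words:
        # Convert to lowercase
        word = word.lower()
        lemma = lemmatizer.lemmatize(word)
        # Stemming is left disabled; only lemmatisation is applied
        #stem = stemmer.stem(word)
        normalized_words.append(lemma)

    return normalized_words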

normalized_tokens = normalize_words(words_tokens)
print(normalized_tokens)

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Lenovo\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Lenovo\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


