# Test NLP parsing on Philo's De Opificio Mundi

## Table of contents <a class="anchor" id="TOC"></a>
* <a href="#bullet1">1 - Introduction</a>
* <a href="#bullet2">2 - Install required libraries</a>
* <a href="#bullet3">3 - Load Text-Fabric</a>
* <a href="#bullet4">4 - Make a file for each individual sentence</a>



# 1 - Introduction <a class="anchor" id="bullet1"></a>
##### [Back to TOC](#TOC)

some more remarks

# 2 - Install required libraries<a class="anchor" id="bullet2"></a>
##### [Back to TOC](#TOC)

This notebook depends on a few libraries. All of them are installed in this section.

Install Text-Fabric, pandas, and ipywidgets:

In [None]:
!pip3 install text-fabric

In [None]:
!pip install pandas

In [None]:
!pip3 install ipywidgets

Install spaCy:

In [None]:
!pip install -v spacy

In [None]:
!pip install spacy-transformers

In [44]:
# Print the installed spaCy version
import spacy
print(spacy.__version__)

3.7.2


Install the odyCy pipeline for processing Ancient Greek text, using the `grc_odycy_joint_trf` model.

In [None]:
# To install the transformer-based pipeline
!pip install -v https://huggingface.co/chcaa/grc_odycy_joint_trf/resolve/main/grc_odycy_joint_trf-any-py3-none-any.whl
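Before continuing, it can be useful to confirm that the packages installed above are actually visible to the running kernel. The following is a minimal sketch using only the standard library; the helper name `installed_version` and the `required` list are illustrative and not part of the original notebook.

```python
from importlib import metadata


def installed_version(package_name):
    """Return the installed version string of a distribution, or None if absent."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


# Distribution names as used in the pip commands above
required = ["text-fabric", "pandas", "ipywidgets", "spacy", "spacy-transformers"]

for name in required:
    version = installed_version(name)
    status = version if version is not None else "NOT INSTALLED"
    print(f"{name}: {status}")
```

A package reported as `NOT INSTALLED` usually means that `pip` installed into a different Python environment than the one the notebook kernel uses.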

# 3 - Load Text-Fabric<a class="anchor" id="bullet3"></a>
##### [Back to TOC](#TOC)

In [25]:
# Loading the Text-Fabric code
# Note: it is assumed Text-Fabric is installed in your environment
from tf.fabric import Fabric
from tf.app import use

Load the Philo Text-Fabric dataset:

In [28]:
A = use("data:./tf/initial_version", hoist=globals())

**Locating corpus resources ...**

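| Name | # of nodes | # slots / node | % coverage |
|---|---|---|---|
| _book | 1 | 13347.0 | 100 |
| edition-grc | 1 | 13347.0 | 100 |
| head | 1 | 13347.0 | 100 |
| altpage1 | 18 | 733.44 | 99 |
| altpage2 | 22 | 585.27 | 96 |
| pb | 60 | 222.45 | 100 |
| altref | 61 | 218.8 | 100 |
| note | 61 | 218.8 | 100 |
| lg | 1 | 129.0 | 1 |
| chapter | 172 | 77.6 | 100 |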


# 4 - Make a file for each individual sentence<a class="anchor" id="bullet4"></a>
##### [Back to TOC](#TOC) 

Ideally, a single NLP doc object would be created containing the full document. However, this crashed the computer. The workaround is to split the document into sentences and feed them to the NLP pipeline one sentence at a time.

In [32]:
import os
import time
# Record the start time for measuring total execution time
overallTime = time.time()
# Flag for enabling or disabling verbose logging
verbose=False
# Directory to store the sentences
sentenceDirectory="sentences" 

# Create the output directory if it does not exist yet
# (a relative path, resolved against the current working directory)
os.makedirs(sentenceDirectory, exist_ok=True)

# Maximum number of results to process
maxResults=100000
# Counter for the number of processed results
foundResults=0

# Iterate over all sentence nodes (the loop stops early once maxResults is reached)
for sentenceNode in F.otype.s('_sentence'):
    wordList=[]
    # Retrieve the sentence number from the data
    sentenceNumber = F._sentence.v(sentenceNode)
    # Iterate over slots linked to the sentence node and construct the sentence
    for wordNode in E.oslots.s(sentenceNode):
        # Construct each word and add it to the word list
        wordList.append(F.pre.v(wordNode).rstrip()+F.word.v(wordNode)+F.post.v(wordNode).rstrip())

    # Join all elements of wordList into a single string to form the sentence
    sentenceText = ' '.join(wordList)
    
    # Store the resulting sentence in a file
    fileName = os.path.join(sentenceDirectory,f"sentence_{sentenceNumber}.txt")
    try:
        # Write the sentence to the file
        with open(fileName, "w", encoding="utf-8") as file:
            file.write(sentenceText)
            # Log if verbose mode is on
            if verbose:
                print(f"Sentence {sentenceNumber} written.")
            else:
                print(".", end="")
    except Exception as e:
        # Print exception details and break from the loop in case of an error
        print(f"Exception: {e}")
        break  # Stops execution on encountering an exception
        
    # Increment the result counter and check if the maximum number has been reached
    foundResults += 1
    if foundResults == maxResults:
        break

# Calculate and display the overall execution time
totalTime = time.time() - overallTime
print(f"\nTotal Time: {totalTime} seconds")

.........................................................................................................................................................................................................................................................................................................
Total Time: 1.0388965606689453 seconds
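The way each sentence is reassembled from the `pre`, `word`, and `post` features can be illustrated without Text-Fabric. The sketch below mirrors the concatenation used in the loop above; the function name `build_sentence` and the sample values are hypothetical and not taken from the dataset.

```python
def build_sentence(tokens):
    """Join (pre, word, post) triples into one sentence string."""
    pieces = []
    for pre, word, post in tokens:
        # Trailing whitespace is stripped because the words are
        # re-joined with single spaces afterwards
        pieces.append(pre.rstrip() + word + post.rstrip())
    return " ".join(pieces)


# Hypothetical feature values, for illustration only
sample = [
    ("", "ἐν", " "),
    ("", "ἀρχῇ", ", "),
    ("(", "λόγος", ") "),
]
print(build_sentence(sample))
```

Punctuation stored in `pre` and `post` stays attached to its word, so the tokenizer of the NLP pipeline sees the sentence with its original punctuation.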


# Load the model

In [1]:
import spacy

nlp = spacy.load("grc_odycy_joint_trf")

# Or if you want to load the small model:
# nlp = spacy.load("grc_odycy_joint_sm")

spaCy pipelines are callable, so a text is processed by passing it directly to the pipeline object, for example `doc = nlp(text)`.

# Process each sentence in the NLP pipeline

In [34]:
import os
import time
import re
import pandas as pd
overallTime = time.time()
verbose=False
sentenceDirectory="sentences" 
nlpDirectory="nlp"
# Build absolute paths for the input and output directories
sentencePathFull = os.path.join(os.getcwd(),sentenceDirectory)
nlpPathFull = os.path.join(os.getcwd(),nlpDirectory)
# Create the output directory if it does not exist yet
os.makedirs(nlpPathFull, exist_ok=True)

def list_filenames(directory):
    """
    Lists all filenames in the given directory.
    Inputparameter - directory: The path to the directory.
    Returns - A list of filenames found in the directory.
    """
    # Check if the directory exists
    if not os.path.isdir(directory):
        raise ValueError("The provided directory does not exist or is not a directory.")

    # List all files in the directory
    filenames = [file for file in os.listdir(directory) if os.path.isfile(os.path.join(directory, file))]
    
    return filenames

def parseToDict(inputString):
    """
    Converts a string with format 'key1=value1|key2=value2|...' to a dictionary.
    Inputparameter - inputString: The input string to be parsed.
    Returns: A dictionary representation of the input string.
    """
    # Return an empty dictionary if the string is empty
    if len(inputString) == 0:
        return {}
    # Split the string by '|' to separate the key-value pairs
    pairs = inputString.split('|')
    # Split each pair at the first '=' and store it in a dictionary
    dictResult = {}
    for pair in pairs:
        if '=' in pair:
            key, value = pair.split('=', 1)
            dictResult[key] = value
        else:
            if verbose: print(f"Skipping malformed morph entry: {pair!r}")
    return dictResult


for fileName in list_filenames(sentencePathFull):
    # Build the full path of the sentence file
    fullFileName=os.path.join(sentencePathFull,fileName)
    # Files should be opened as Unicode; read the whole file in case the sentence spans several lines
    with open(fullFileName, 'r', encoding='utf-8') as file:
        content = file.read().strip()
        
    # Now feed this sentence to the NLP parser
    doc = nlp(content)
    
    # Now extract the information from the tokens
    tokens = []

    for token in doc:
        tokenInfo={
            "orth": token.orth_,
            'text': token.text,
            "lemma": token.lemma_,
            "norm": token.norm_,
            "suffix": token.suffix_,
            "prefix": token.prefix_,
            "prob": token.prob,
            'pos': token.pos_,
            "tag": token.tag_,
            "morph": token.morph,
            'dep': token.dep_,
            "sentiment": token.sentiment,
            'head': token.head,
            'conjuncts': token.conjuncts,
            'is_stop': token.is_stop,
            'shape': token.shape_,
            'is_alpha': token.is_alpha,
            'is_punct': token.is_punct,
            'like_num': token.like_num,
            'is_space': token.is_space
        }
        # The following expands the morph details into separate columns (one per feature)
        morphString=str(token.morph)  # Convert MorphAnalysis object to string
        tokenInfo.update(parseToDict(morphString))
        tokens.append(tokenInfo)
    
    # Create a DataFrame
    df = pd.DataFrame(tokens)
        
    # Export to CSV
    # Using regex to change the file extension
    nlpFileName = re.sub(r'(.+)\.txt$', r'\1.csv', fileName)
    fullFileName=os.path.join(nlpPathFull,nlpFileName)
    df.to_csv(fullFileName, index=False)
    if verbose: 
        print(f"File {nlpFileName} written.")
    else:
        print (".",end="")

# Calculate and display the overall execution time
totalTime = time.time() - overallTime
print(f"\nTotal Time: {totalTime} seconds")

.........................................................................................................................................................................................................................................................................................................
Total Time: 41.97211837768555 seconds
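To make the morph expansion more concrete, the sketch below shows how feature strings in `Key=Value|Key=Value` form turn into per-feature columns. It is a standalone illustration using only the standard library; the function name `morph_to_dict` and the example strings are assumptions and were not produced by the model.

```python
def morph_to_dict(morph_string):
    """Parse 'Key1=Val1|Key2=Val2' into a dictionary (split at the first '=')."""
    result = {}
    for pair in morph_string.split("|"):
        if "=" in pair:
            key, value = pair.split("=", 1)
            result[key] = value
    return result


# Hypothetical morph strings in Universal Dependencies feature notation
examples = [
    "Case=Nom|Gender=Masc|Number=Sing",
    "Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin|Voice=Act",
    "",  # tokens such as punctuation may have no features
]

rows = [morph_to_dict(m) for m in examples]
# The union of all keys corresponds to the extra columns in the table
columns = sorted({key for row in rows for key in row})
print(columns)
for row in rows:
    # Features a token does not have are shown as empty cells
    print([row.get(col, "") for col in columns])
```

Because each sentence is exported separately, the set of feature columns can differ from one CSV file to the next, which matters if the files are later combined.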
