This assignment concerns using ```spaCy``` to extract linguistic information from a corpus of texts.

The corpus is an interesting one: *The Uppsala Student English Corpus (USE)*. All of the data is included in the folder called ```in```, but you can access more documentation via [this link](https://ota.bodleian.ox.ac.uk/repository/xmlui/handle/20.500.12024/2457).

In [3]:
# loading in packages
import os
import matplotlib.pyplot as plt
import pandas as pd
import spacy

# loading spacy
# download spacy module: python -m spacy download en_core_web_md
nlp = spacy.load("en_core_web_md")

1) Loop over each text file in the folder called ```in```


In [4]:
# Defining path to folder
folder = os.path.join("..", "in", "USEcorpus")

# Empty list for our texts
texts = []

# Walk through the folder and its subdirectories
for root, dirs, files in os.walk(folder):
    for filename in files:
        # Construct the full path to the file
        file_path = os.path.join(root, filename)
        # Add the file path to the list of texts
        texts.append(file_path)

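# Processing each text file
docs = []
for text_path in texts:
    # the corpus files are read as latin-1
    with open(text_path, 'r', encoding='latin-1') as file:
        content = file.read()

    # create a spaCy doc for each text and keep it, so earlier docs are not overwritten
    doc = nlp(content)
    docs.append(doc)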
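
Since the tables in step 3 are built per sub-folder, it can help to record which sub-folder each file came from while walking the corpus. The following is a minimal sketch rather than part of the original notebook: the helper name `group_by_subfolder` is hypothetical, and it assumes every text file sits directly inside a sub-folder such as `a1`.

```python
import os
from collections import defaultdict


def group_by_subfolder(folder):
    """Map each sub-folder name (e.g. 'a1') to a sorted list of its file paths."""
    groups = defaultdict(list)
    for root, dirs, files in os.walk(folder):
        for filename in files:
            # the immediate parent directory of the file is used as the group key
            subfolder = os.path.basename(root)
            groups[subfolder].append(os.path.join(root, filename))
    # sort both the group names and the paths so the order is reproducible
    return {name: sorted(paths) for name, paths in sorted(groups.items())}
```

Calling `group_by_subfolder(folder)` with the `folder` path defined above would then give one list of file paths per sub-folder, which can be processed group by group.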


2) Extract the following information
    - Relative frequency of Nouns, Verbs, Adjectives, and Adverbs per 10,000 words
    - Total number of *unique* PER, LOC, ORG
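
Both measures come down to simple arithmetic, which the sketch below illustrates on made-up toy data with no spaCy involved. The function names `relative_frequency` and `count_unique` are hypothetical, and the labels `GPE` and `PERSON` are used on the assumption that the English spaCy model tags places and people that way.

```python
def relative_frequency(count, total, per=10_000):
    """Scale a raw count to occurrences per `per` tokens."""
    if total == 0:
        return 0.0
    return count / total * per


def count_unique(entities, labels):
    """Count distinct entity texts whose label is in `labels`."""
    return len({text for text, label in entities if label in labels})


# Toy data, made up for illustration (not taken from the corpus)
pos_tags = ["NOUN", "VERB", "NOUN", "PUNCT"]
entities = [("Uppsala", "GPE"), ("Uppsala", "GPE"), ("Anna", "PERSON")]

# 2 nouns out of 4 tokens -> 5000.0 per 10,000
print(relative_frequency(pos_tags.count("NOUN"), len(pos_tags)))
# "Uppsala" appears twice but is counted once -> 1
print(count_unique(entities, {"GPE", "LOC"}))
```

Collecting the entity texts in a set is what makes the count *unique*: repeated mentions of the same name collapse into a single element.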


In [None]:
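# Creating a function for counting a part-of-speech tag and calculating its relative frequency
def count_freq(doc, pos_tag):
    count = sum(1 for token in doc if token.pos_ == pos_tag)
    # len(doc) counts every token, including punctuation; guard against empty docs
    relative_freq = (count / len(doc)) * 10000 if len(doc) else 0.0
    print(f'The relative frequency per 10,000 words for {pos_tag.lower()}s is {round(relative_freq, 2)}')
    # return the value as well, so it can be reused when building tables
    return round(relative_freq, 2)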

count_freq(doc, "NOUN")
count_freq(doc, "VERB")
count_freq(doc, "ADJ")
count_freq(doc, "ADV")

In [14]:
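def count_entities(doc):
    # collect distinct entity texts per label, so repeated mentions are counted once
    unique_entities = {"PER": set(), "LOC": set(), "ORG": set()}
    for ent in doc.ents:
        if ent.label_ in ["PER", "PERSON"]:
            unique_entities["PER"].add(ent.text)
        elif ent.label_ in ["LOC", "GPE"]:
            unique_entities["LOC"].add(ent.text)
        elif ent.label_ == "ORG":
            unique_entities["ORG"].add(ent.text)
    # return the number of unique entities for each label
    return {label: len(names) for label, names in unique_entities.items()}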

3) For each sub-folder (a1, a2, a3, ...) save a table which shows the following information:

|Filename|RelFreq NOUN|RelFreq VERB|RelFreq ADJ|RelFreq ADV|Unique PER|Unique LOC|Unique ORG|
|---|---|---|---|---|---|---|---|
|file1.txt|---|---|---|---|---|---|---|
|file2.txt|---|---|---|---|---|---|---|
|etc|---|---|---|---|---|---|---|

In [None]:
# Function to generate table for a list of files in a folder
def generate_table(folder_path):
    columns = ['Filename', 'RelFreq NOUN', 'RelFreq VERB', 'RelFreq ADJ', 'RelFreq ADV', 'Unique PER', 'Unique LOC', 'Unique ORG']
    data = []

    # relative frequency of a POS tag per 10,000 tokens, returned rather than printed
    def rel_freq(doc, pos_tag):
        count = sum(1 for token in doc if token.pos_ == pos_tag)
        return round(count / len(doc) * 10000, 2) if len(doc) else 0.0

    # list the files in this sub-folder in a stable order
    files = sorted(os.listdir(folder_path))
    for file in files:
        with open(os.path.join(folder_path, file), 'r', encoding='latin-1') as f:
            text = f.read()
        doc = nlp(text)
        row = [file]
        row.extend([rel_freq(doc, 'NOUN'), rel_freq(doc, 'VERB'), rel_freq(doc, 'ADJ'), rel_freq(doc, 'ADV')])
        entities_count = count_entities(doc)
        row.extend([entities_count['PER'], entities_count['LOC'], entities_count['ORG']])
        data.append(row)
    df = pd.DataFrame(data, columns=columns)
    return df

# Generate tables for each sub-folder
for i in range(1, 4):  # assuming you have subfolders named a1, a2, a3, etc.
    # sub-folders live inside the corpus folder defined earlier
    folder_path = os.path.join(folder, f'a{i}')
    print(f"Generating table for folder {folder_path}:")
    table = generate_table(folder_path)
    print(table)
    print("\n")
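
The task asks for each table to be *saved*, whereas the loop above only prints it. A minimal, dependency-free sketch of writing one CSV file per sub-folder could look like the following. The helper name `save_table` and any output folder name passed to it (for example `out`) are assumptions, not something the assignment specifies.

```python
import csv
import os


def save_table(rows, columns, out_dir, name):
    """Write `rows` (lists of values) to <out_dir>/<name>.csv with a header line."""
    # create the output folder if it does not exist yet
    os.makedirs(out_dir, exist_ok=True)
    out_path = os.path.join(out_dir, f"{name}.csv")
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(columns)
        writer.writerows(rows)
    return out_path
```

With a pandas DataFrame such as the one returned by `generate_table`, calling `table.to_csv(out_path, index=False)` should produce the same kind of file in a single line.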
