## Importing required libraries

In [None]:
import pandas as pd
import spacy
nlp = spacy.load("en_core_web_sm")

## Flattening Femke's original dataset

In [None]:
# Load the spreadsheet without a header row, skipping its first three rows
df = pd.read_excel("/Users/urtejakubauskaite/Desktop/Metaphors/Final project/Femke's Dataset/Final table OSF source domains.xlsx",
                   skiprows = 3, header = None)

# Collapse all cells into one list (row by row), dropping empty and whitespace-only cells
all_texts = df.values.flatten()
all_texts = [str(text).strip() for text in all_texts if pd.notna(text) and str(text).strip() != '']

# Save the texts as a single-column table
df_output = pd.DataFrame(all_texts, columns = ["Text"])
df_output.to_excel("Flattened_Femke's_dataset.xlsx", index = False)
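
To make the flattening step easier to follow, here is a minimal sketch on a toy table. The `toy` DataFrame and its contents are invented for illustration and are not taken from Femke's dataset. It shows that `values.flatten()` reads the cells row by row and that the filter drops both missing cells and whitespace-only strings.

```python
import pandas as pd

# Toy stand-in for the spreadsheet (hypothetical data): two columns,
# with one missing cell and one whitespace-only cell.
toy = pd.DataFrame([["first text ", None],
                    ["   ", "second text"]])

# values.flatten() walks the table row by row (row-major order).
cells = toy.values.flatten()

# Same filter as in the cell above: drop missing and whitespace-only cells,
# and strip surrounding whitespace from the rest.
texts = [str(c).strip() for c in cells if pd.notna(c) and str(c).strip() != ""]

print(texts)
```

Because the cells are read row by row, texts from different columns of the same row end up next to each other in the flattened list.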

## Tokenizing selected rows

After generating *Flattened_Femke's_dataset*, I manually deleted rows that contained no context (for example, the bare phrase *buffer pool*) and rows whose metaphors consisted of three or more words. I saved the result as *Flattened_Femke's_dataset_max_two_with_context* and then created a document with tokenized sentences in the same format as the other documents used for testing the model.

In [None]:
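# Read the manually filtered texts, one text per row
df = pd.read_excel("Flattened_Femke's_dataset_max_two_with_context.xlsx",
                   header = None)

token_data = []

# One output row per token; sent_id is the row index of the text
for sent_id, row in df.iterrows():
    sentence = str(row[0])
    doc = nlp(sentence)

    for token_id, token in enumerate(doc):
        token_data.append({"sent_id": sent_id,
                           "token_id": token_id,  # restarts at 0 for every text
                           "token_text": token.text,
                           "pos": token.pos_,
                           "FINAL": ""})  # left empty for manual metaphor annotation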

token_df = pd.DataFrame(token_data)

token_df.to_excel("tokenized_femke_texts.xlsx", index = False)

## Correcting the file to its final version

After manually marking metaphors in *tokenized_femke_texts* based on Femke's bold-text annotations, I noticed a mistake in the file: the token count restarted from zero in every sentence. I replaced it with a single running count across the whole file and filled the empty cells in the *FINAL* column with "0".

In [None]:
df = pd.read_excel("tokenized_femke_texts.xlsx")

# Fixing the token count
df["token_id"] = range(len(df))

# Replacing empty values with "0"
df["FINAL"] = df["FINAL"].fillna("0").replace("", "0")

df.to_excel("tokenized_femke_texts_final.xlsx", index = False)
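
The effect of the two corrections can be checked on a small example. The sketch below uses a hypothetical five-token table (the tokens and the "1" label are made up for illustration, and the actual annotation values in the file may differ) and applies the same two operations as the cell above.

```python
import pandas as pd

# Hypothetical token table: token_id restarts at 0 in the second sentence,
# and FINAL contains a mix of missing values, an empty string, and one mark.
toy = pd.DataFrame({
    "sent_id": [0, 0, 1, 1, 1],
    "token_id": [0, 1, 0, 1, 2],
    "token_text": ["buffer", "pool", "memory", "leak", "."],
    "FINAL": [None, "1", "", None, None],
})

# Replace the per-sentence count with one running count over all rows.
toy["token_id"] = range(len(toy))

# Fill missing cells and empty strings with "0"; existing marks are kept.
toy["FINAL"] = toy["FINAL"].fillna("0").replace("", "0")

print(toy)
```

Note that the running count depends on row order, so it only matches the original token sequence if rows were not reordered or deleted during the manual annotation.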