# Task: NLP - English scheduling phrases to Machine Readable format

Convert a natural-language scheduling phrase such as

`every tues 3-4pm MLP Tutorial`

into an iCalendar (`.ics`) string:

```icalendar
BEGIN:VCALENDAR
VERSION:2.0
PRODID:-//mlp.ed.ac.uk//group1//EN
CALSCALE:GREGORIAN
METHOD:PUBLISH
BEGIN:VEVENT
SUMMARY:MLP Tutorial
UID:c7614cff-3549-4a00-9152-d25cc1fe077d
SEQUENCE:0
RRULE:FREQ=WEEKLY;BYDAY=TU;COUNT=10
DTSTART:20250214T150000
DTEND:20250214T160000
DTSTAMP:20150421T150000
CATEGORIES:University
END:VEVENT
END:VCALENDAR
```
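The target format above can be produced from a handful of structured fields once they have been extracted from the phrase. The following is a minimal sketch, not the project's implementation: `build_ics` and its parameters are hypothetical names, it covers only weekly recurrences, and it assumes the fields (summary, start, end, weekday code, count) are already known. It uses CRLF line endings and a UTC `DTSTAMP`, as the iCalendar specification requires.

```python
import uuid
from datetime import datetime, timezone


def build_ics(summary, start, end, byday, count, uid=None, stamp=None):
    """Render one weekly recurring event as an iCalendar string (hypothetical helper)."""
    fmt = "%Y%m%dT%H%M%S"
    uid = uid or str(uuid.uuid4())
    # DTSTAMP is expressed in UTC; an aware datetime is converted, default is "now"
    stamp = (stamp or datetime.now(timezone.utc)).astimezone(timezone.utc)
    lines = [
        "BEGIN:VCALENDAR",
        "VERSION:2.0",
        "PRODID:-//mlp.ed.ac.uk//group1//EN",
        "CALSCALE:GREGORIAN",
        "METHOD:PUBLISH",
        "BEGIN:VEVENT",
        f"SUMMARY:{summary}",
        f"UID:{uid}",
        "SEQUENCE:0",
        f"RRULE:FREQ=WEEKLY;BYDAY={byday};COUNT={count}",
        f"DTSTART:{start.strftime(fmt)}",
        f"DTEND:{end.strftime(fmt)}",
        f"DTSTAMP:{stamp.strftime(fmt)}Z",
        "END:VEVENT",
        "END:VCALENDAR",
    ]
    # iCalendar content lines are terminated by CRLF
    return "\r\n".join(lines) + "\r\n"


# Illustrative values only: a Tuesday 3-4pm slot repeated ten times
print(build_ics(
    "MLP Tutorial",
    datetime(2025, 2, 18, 15, 0),
    datetime(2025, 2, 18, 16, 0),
    byday="TU",
    count=10,
))
```

Keeping the rendering step separate from the language model means the model only has to predict the structured fields, which are easier to evaluate than a free-form string.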

## Goals
- Deal with as many edge cases as possible
- Deal with abbreviations
- Deal with relative dates
- Handle complex expressions as accurately as possible, e.g. `2pm hourly til 5 every weekday for 5m except 3pm`
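
As a point of comparison for the abbreviation goal, a rule-based baseline can already resolve many weekday abbreviations by unique-prefix matching. This is a sketch under assumptions: `weekday_code` and the `_WEEKDAYS` table are hypothetical names, and only English weekday names are covered.

```python
# Hypothetical lookup table: full weekday name -> RRULE BYDAY code
_WEEKDAYS = {
    "monday": "MO", "tuesday": "TU", "wednesday": "WE", "thursday": "TH",
    "friday": "FR", "saturday": "SA", "sunday": "SU",
}


def weekday_code(token):
    """Map a weekday name or abbreviation to its BYDAY code, or None if unclear."""
    t = token.lower().rstrip(".")
    # Try the token as written, then without a trailing plural/abbreviation "s"
    candidates = (t, t[:-1] if t.endswith("s") else t)
    for cand in candidates:
        if len(cand) < 2:
            continue  # a single letter such as "t" or "s" is ambiguous
        matches = [code for name, code in _WEEKDAYS.items() if name.startswith(cand)]
        if len(matches) == 1:
            return matches[0]
    return None


for word in ["tues", "Thurs.", "weds", "tuesdays", "t", "MLP"]:
    print(word, "->", weekday_code(word))
```

Inputs the rules cannot resolve (returned as `None`) are the cases where a learned model is expected to add value.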



## Strategy

1. Use BERT as the base model.
2. Use the TempEval-3.0 dataset to train the model to recognise temporal expressions and extract temporal relations.
3. Use additional temporal reasoning datasets to improve the model's temporal reasoning performance.
4. Use Evol-Instruct to augment and customise the datasets to increase variety, relevance and complexity of expressions.
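
Fine-tuning a BERT-style model to recognise temporal expressions is commonly framed as token classification, which needs one label per token. The sketch below assumes a BIO labelling scheme and character-offset annotations; both are assumptions for illustration, as are the function name `bio_tags` and the `SET` / `TIME` labels (borrowed from the TIMEX3 type names).

```python
import re


def bio_tags(text, spans):
    """Whitespace-tokenise `text` and assign B-/I-/O tags.

    spans: list of (start, end, label) character offsets, end exclusive.
    """
    tokens, tags = [], []
    for m in re.finditer(r"\S+", text):
        tag = "O"
        for start, end, label in spans:
            # A token belongs to a span if it lies entirely inside it
            if m.start() >= start and m.end() <= end:
                tag = ("B-" if m.start() == start else "I-") + label
                break
        tokens.append(m.group())
        tags.append(tag)
    return tokens, tags


text = "every tues 3-4pm MLP Tutorial"
# Hypothetical annotation: "every tues" is a recurrence, "3-4pm" a time range
spans = [(0, 10, "SET"), (11, 16, "TIME")]
for token, tag in zip(*bio_tags(text, spans)):
    print(f"{token}\t{tag}")
```

With subword tokenisers the same idea applies, but labels then have to be aligned to subword pieces rather than whitespace tokens.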

![Evol-Instruct workflow](evol.png)



## Base models


### BERT

### [BERTweet](https://huggingface.co/docs/transformers/v4.17.0/en/model_doc/bertweet)
- Pre-trained on 850M English tweets
- Better at handling informal language, emojis, hashtags, mobile input
- May be more suitable for processing colloquial text and abbreviations

### [DateBERT](https://huggingface.co/docs/transformers/v4.17.0/en/model_doc/datebert)
- Tags dates and times in text





## Embeddings

### Contextual Embeddings
- Reference date
- Region / timezone
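
Both context items matter because relative expressions only resolve to concrete dates once a reference instant and a timezone are fixed. The sketch below illustrates this with two hypothetical helpers, `tomorrow` and `next_weekday`; it uses fixed UTC offsets from the standard library to stay self-contained, whereas named regions would normally go through `zoneinfo`.

```python
from datetime import datetime, timedelta, timezone


def tomorrow(instant, tz):
    """Resolve 'tomorrow' to a calendar date in the user's timezone."""
    local = instant.astimezone(tz)
    return (local + timedelta(days=1)).date()


def next_weekday(instant, tz, weekday):
    """Date of the next `weekday` (Monday=0) strictly after the local reference day."""
    local = instant.astimezone(tz)
    days_ahead = (weekday - local.weekday() - 1) % 7 + 1
    return (local + timedelta(days=days_ahead)).date()


# One reference instant, two users in different timezones (illustrative values)
reference = datetime(2025, 2, 14, 23, 30, tzinfo=timezone.utc)
utc_plus_9 = timezone(timedelta(hours=9))

print(tomorrow(reference, timezone.utc))  # local day is Friday 14 Feb
print(tomorrow(reference, utc_plus_9))    # local day is already Saturday 15 Feb
print(next_weekday(reference, timezone.utc, weekday=1))  # next Tuesday
```

The same phrase therefore maps to different dates for different users, which is why the reference date and timezone need to reach the model or the post-processing step as explicit inputs.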






```python
import torch
torch.cuda.is_available()
```

```text
False
```

## Tokenisation

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace

# Byte-pair encoding model, with input split on whitespace and punctuation first
tokeniser = Tokenizer(BPE())
tokeniser.pre_tokenizer = Whitespace()

# Learn a small vocabulary from the scheduling phrases
trainer = BpeTrainer(vocab_size=120, special_tokens=["<pad>", "</s>"])
tokeniser.train(files=["phrases.txt"], trainer=trainer)
```




```python
with open("subset.txt", "r") as f:
    # Strip before de-duplicating so lines that differ only in trailing whitespace collapse
    lines = list({line.strip() for line in f})

tokenised_lines = [tokeniser.encode(line) for line in lines]

# len() of an Encoding is its token count, so this picks the longest tokenised line
longest_line = max(tokenised_lines, key=len)

print(longest_line.tokens, len(longest_line.tokens))
```

```text
['12', '/', '0', '6', 'm', 'a', 'ch', 'in', 'e', 'l', 'e', 'ar', 'ning', 'tu', 't', 'or', 'i', 'al', 'in', 'A', 'T', '4', '.', '12'] 24
```


```python
# get_vocab() maps each token to its integer ID, not to a corpus frequency.
# The BPE trainer assigns IDs in the order tokens are added (special tokens,
# initial alphabet, then merges), so the highest IDs are the latest merges.
vocab = tokeniser.get_vocab()
sorted_vocab = sorted(vocab.items(), key=lambda x: x[1], reverse=True)

# Size of one quartile of the vocabulary
num_tokens = len(sorted_vocab)
quartile_size = num_tokens // 4

# Quartile of tokens with the highest IDs
top_quartile = sorted_vocab[:quartile_size]

print(f"Top {quartile_size} tokens by ID:")
for token, token_id in top_quartile:
    print(f"{token}: {token_id}")
```

```text
Top 40 tokens by ID:
end: 159
ss: 158
og: 157
ner: 156
mar: 155
last: 154
it: 153
din: 152
age: 151
6pm: 150
meeting: 149
shop: 148
meet: 147
int: 146
team: 145
sho: 144
ning: 143
mo: 142
eam: 141
da: 140
5hrs: 139
5pm: 138
3hrs: 137
session: 136
view: 135
rep: 134
up: 133
sess: 132
lan: 131
12pm: 130
11am: 129
ation: 128
30pm: 127
remind: 126
ge: 125
ast: 124
45: 123
00: 122
10am: 121
art: 120
```
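
The numbers returned by `get_vocab()` are token IDs, so actual token frequencies have to be counted over an encoded corpus. A minimal sketch follows; `token_frequencies` is a hypothetical helper and the token lists are made-up stand-ins for `enc.tokens` from the encodings above.

```python
from collections import Counter


def token_frequencies(token_lists):
    """Count how often each token occurs across a collection of token lists."""
    counts = Counter()
    for tokens in token_lists:
        counts.update(tokens)
    return counts


# Made-up token lists for illustration; with the notebook's objects this would be
# token_frequencies(enc.tokens for enc in tokenised_lines)
sample = [
    ["team", "meeting", "5pm"],
    ["team", "meet", "5pm"],
    ["team", "din", "ner"],
]
freqs = token_frequencies(sample)
for token, count in freqs.most_common(2):
    print(f"{token}: {count}")
```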
