
# The Challenge
Have a look at [Le Programme](https://leprogramme.ch/), an events agenda for Geneva. We want to query it using natural language. The idea is to construct the search URL and send you there, but only if the URL we constructed is valid. We can cross-check a search query against the events it returns by also filtering the full list: list the events on the main page that match the question, then find the search clause whose results are most similar.

I don't know whether this works yet, and will be exploring a lot.

The first step is to determine the current week of the year. We use this to get the events we will train on.

In [1]:
from datetime import date

# Read the ISO year and week from a single call so that both refer to the same day.
# The ISO year can differ from the calendar year in the days around New Year.
current_year, current_week, _ = date.today().isocalendar()
current_week_url = f'https://leprogramme.ch/agenda-culturel-de-la-semaine/Vaud/{current_year}/{current_week}'
current_week_url

'https://leprogramme.ch/agenda-culturel-de-la-semaine/Vaud/2024/20'
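
To make the week handling easier to check, here is a minimal sketch that builds the same URL for an arbitrary date. The helper name `week_url` and the constant `BASE_URL` are hypothetical; the URL pattern is the one used above. It also shows why the ISO year is used rather than the calendar year: around New Year the two can differ.

```python
from datetime import date

BASE_URL = 'https://leprogramme.ch/agenda-culturel-de-la-semaine/Vaud'

def week_url(day):
    """Build the weekly agenda URL for the ISO week containing `day` (hypothetical helper)."""
    # isocalendar() returns (ISO year, ISO week, ISO weekday)
    iso_year, iso_week, _ = day.isocalendar()
    return f'{BASE_URL}/{iso_year}/{iso_week}'

# 13 May 2024 falls in ISO week 20 of 2024
print(week_url(date(2024, 5, 13)))
# 30 December 2024 is a Monday that already belongs to ISO week 1 of 2025
print(week_url(date(2024, 12, 30)))
```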

In [2]:
import urllib.request
programme_response = urllib.request.urlopen(current_week_url)
programme_content = programme_response.read()
print(len(programme_content))

212976


We will need Beautiful Soup to analyze the page content.

In [3]:
!pip install beautifulsoup4



Let us get the events:

In [4]:
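from bs4 import BeautifulSoup
# Name the parser explicitly so the result does not depend on which parsers are installed
programme_soup = BeautifulSoup(programme_content, 'html.parser')
# Each event is rendered as an <a class="card-spectacle"> element
spectacle_cards = programme_soup.find_all('a', class_='card-spectacle')
len(spectacle_cards)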

99

Next, we inspect the first card to see how to extract the relevant information. The card itself only carries the start time, so we have to look at an enclosing element to find the day it belongs to.

In [5]:
from datetime import datetime

def get_datetime_from_card(card):
  # The day is not inside the card; it sits in a 'spectacle-date' block two levels up
  date_string = card.parent.parent.find('div', class_='spectacle-date').find('div', class_='day').get_text()

  french_months = {
    'Janvier': '01',
    'Février': '02',
    'Mars': '03',
    'Avril': '04',
    'Mai': '05',
    'Juin': '06',
    'Juillet': '07',
    'Août': '08',
    'Septembre': '09',
    'Octobre': '10',
    'Novembre': '11',
    'Décembre': '12'
  }

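  # The day text has three whitespace-separated parts: day, French month name, year
  day, month_name, year = date_string.split()
  month = french_months[month_name]

  # The card carries the start time; strip stray whitespace before parsing
  hour = card.find('div', class_ = 'card-date').get_text().strip()

  date_formatted = f"{day} {month} {year} {hour}"
  date_object = datetime.strptime(date_formatted, "%d %m %Y %H:%M")
  return date_object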






We can test it on a specific card:

In [6]:
get_datetime_from_card(spectacle_cards[0])

datetime.datetime(2024, 5, 13, 20, 30)
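
The parsing logic can also be tried without the page. The sketch below is a standalone version of the same idea. `parse_french_datetime`, `FRENCH_MONTHS`, and the argument names are hypothetical, and the lowercase lookup is an added assumption to tolerate different capitalisation of month names.

```python
from datetime import datetime

# French month names, lowercased for a tolerant lookup (assumption: the site may vary case)
FRENCH_MONTHS = {
    'janvier': 1, 'février': 2, 'mars': 3, 'avril': 4,
    'mai': 5, 'juin': 6, 'juillet': 7, 'août': 8,
    'septembre': 9, 'octobre': 10, 'novembre': 11, 'décembre': 12,
}

def parse_french_datetime(day_text, time_text):
    """Combine a '13 Mai 2024' style day and a '20:30' style time (hypothetical helper)."""
    day, month_name, year = day_text.split()
    hour, minute = time_text.strip().split(':')
    month = FRENCH_MONTHS[month_name.lower()]
    return datetime(int(year), month, int(day), int(hour), int(minute))

print(parse_french_datetime('13 Mai 2024', '20:30'))  # prints 2024-05-13 20:30:00
```

Building the `datetime` directly from integers avoids the intermediate formatted string, but either approach should give the same result for well-formed input.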

Now we also want to extract the rest, which can be found directly in the card.

In [7]:
def get_title_from_card(card):
  card_title_element = card.find('h5', class_ = 'card-title')
  return card_title_element.get_text()

In [8]:
get_title_from_card(spectacle_cards[0])

'Double-Double'

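The description will contain HTML, but we are only interested in the text for now, and we keep at most the first 200 characters. (The sample output below is shorter, presumably because it comes from an earlier run with a smaller limit.)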

In [59]:
def get_description_from_card(card):
  card_description_element = card.find('p', class_ = 'card-description')
  # Slicing is safe on short strings, so no explicit length check is needed
  return card_description_element.get_text()[:200]


In [31]:
get_description_from_card(spectacle_cards[1])

"A partir d'un fait divers dramatique naît du théâtre dans du théâtre, une tragi-comédie qui suppose "

It's fun that there is basically no semantic information in the markup, so we have to get it out of the current layout.

In [32]:
def get_location_from_card(card):
  location_element = card.find('p', class_ = 'card-text')
  return location_element.get_text().strip()


In [33]:
get_location_from_card(spectacle_cards[1])

'Théâtre Saint-Gervais, Genève'

In [34]:
def get_theme_from_card(card):
  tags_element = card.find('ul', class_ = 'card-tags')
  list_item_element = tags_element.find('li')
  class_ = list_item_element['class']
  return class_[0].replace('theme-', '')

In [14]:
get_theme_from_card(spectacle_cards[0])

'music'

In [35]:
def get_genres_from_card(card):
  tags_element = card.find('ul', class_ = 'card-tags')
  list_item_element = tags_element.find('li')
  return list_item_element.get_text().strip().split(' | ')

In [16]:
get_genres_from_card(spectacle_cards[4])

['Opéra / Opérette', 'humour - stand-up']

To make it easy to check what is happening, we create a dataclass holding that information.

In [36]:
from dataclasses import dataclass
from datetime import datetime
from typing import List

@dataclass
class Spectacle:
  date: datetime
  title: str
  description: str
  location: str
  theme: str
  genres: List[str]


In [37]:
def get_spectacle_from_card(card):
  return Spectacle(get_datetime_from_card(card), get_title_from_card(card), get_description_from_card(card),
                   get_location_from_card(card), get_theme_from_card(card),
                   get_genres_from_card(card))

In [63]:
spectacles = [get_spectacle_from_card(card) for card in spectacle_cards]

In [60]:
import json
import dataclasses
from datetime import datetime

class EnhancedJSONEncoder(json.JSONEncoder):
  def default(self, o):
    # Dataclasses become plain dicts, datetimes become ISO 8601 strings
    if dataclasses.is_dataclass(o):
      return dataclasses.asdict(o)
    elif isinstance(o, datetime):
      return o.isoformat()
    return super().default(o)

json_spectacles = json.dumps(spectacles, cls=EnhancedJSONEncoder)
json_spectacles

'[{"date": "2024-05-13T20:30:00", "title": "Double-Double", "description": "", "location": "AMR / Sud des Alpes, Gen\\u00e8ve", "theme": "music", "genres": ["Jazz / Blues"]}, {"date": "2024-05-14T19:00:00", "title": "L\\u2019Union indestructible des r\\u00e9publiques libres | Attilio Sandro Palese", "description": "A partir d\'un fait divers dramatique na\\u00eet du th\\u00e9\\u00e2tre dans du th\\u00e9\\u00e2tre, une tragi-com\\u00e9die qui suppose ", "location": "Th\\u00e9\\u00e2tre Saint-Gervais, Gen\\u00e8ve", "theme": "theater", "genres": ["Th\\u00e9\\u00e2tre"]}, {"date": "2024-05-14T19:30:00", "title": "Amour \\u00e0 mort | Leonardo Garc\\u00eda Alarc\\u00f3n - Jean-Yves Ruf", "description": "Les r\\u00e9cits se m\\u00ealent et l\\u2019orchestre participe \\u00e0 l\\u2019intrigue. Les \\u0153uvres instrumentales, les nouvelles", "location": "La Cit\\u00e9 Bleue, Gen\\u00e8ve", "theme": "other", "genres": ["Op\\u00e9ra / Op\\u00e9rette", "Classique / Baroque"]}, {"date": "2024-05
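
Decoding is not covered above, but a round trip is a useful check that nothing is lost in serialization. A minimal sketch, assuming the same field names as `Spectacle`: the stand-in class `SpectacleRecord` and the helper `spectacle_from_dict` are hypothetical and only exist to keep the example self-contained. The sample values are those of the first event in the output above.

```python
import dataclasses
import json
from dataclasses import dataclass
from datetime import datetime
from typing import List

@dataclass
class SpectacleRecord:
    """Stand-in with the same fields as Spectacle (hypothetical)."""
    date: datetime
    title: str
    description: str
    location: str
    theme: str
    genres: List[str]

def spectacle_from_dict(data):
    """Rebuild a record from a decoded JSON object (hypothetical helper)."""
    fields = dict(data)
    # The date was written with isoformat(), so fromisoformat() reverses it
    fields['date'] = datetime.fromisoformat(fields['date'])
    return SpectacleRecord(**fields)

original = SpectacleRecord(datetime(2024, 5, 13, 20, 30), 'Double-Double', '',
                           'AMR / Sud des Alpes, Genève', 'music', ['Jazz / Blues'])
# asdict() leaves the datetime in place; `default` converts it during encoding
encoded = json.dumps(dataclasses.asdict(original), default=lambda o: o.isoformat())
restored = spectacle_from_dict(json.loads(encoded))
print(restored == original)
```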

We may need it in a more textual form, so let's try again:

In [66]:
def get_textual_form(spectacle):
  ret = ''
  ret += 'Event title: ' + spectacle.title
  ret += '\nHappening at: ' + spectacle.date.isoformat()
  ret += '\nLocation: ' + spectacle.location
  ret += '\nTheme: ' + spectacle.theme
  ret += '\nGenres: ' + ", ".join(spectacle.genres)
  ret += '\n\n'
  return ret

In [67]:
all_spectacles_text = '\n'.join([get_textual_form(spectacle) for spectacle in spectacles])
# Only the last expression of a cell is displayed, so show the length of the text
len(all_spectacles_text)

17386

# Feeding it to Llama 3
We now have the data in a structured format. The idea is to query it with questions like "how many shows are there on Monday" or "where can I go listen to jazz". My first goal is to get the answer directly from the model. If that works, the next goal is to define a simple query language and let the model generate expressions in it. That second part is how I actually see this being used, and it is also where the challenge lies.
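
To make the second goal more concrete, here is one possible shape for such a query language: the model would be asked to answer with a small JSON object, which is validated before it is used. This is only a sketch under assumptions. The `Query` fields, the `parse_query` helper, and the JSON format are all hypothetical, and `KNOWN_THEMES` only lists the theme values that appear in the outputs above.

```python
import json
from dataclasses import dataclass
from typing import Optional

# Themes seen in the scraped data so far; the site may use more (assumption)
KNOWN_THEMES = {'music', 'theater', 'other'}
ALLOWED_KEYS = {'location', 'theme', 'weekday'}

@dataclass
class Query:
    location: Optional[str] = None  # substring of the venue or town
    theme: Optional[str] = None     # one of KNOWN_THEMES
    weekday: Optional[int] = None   # 0 = Monday ... 6 = Sunday

def parse_query(model_output):
    """Return a Query if the model's JSON is valid, else None (hypothetical helper)."""
    try:
        data = json.loads(model_output)
    except json.JSONDecodeError:
        return None
    # Reject anything that is not an object with known keys only
    if not isinstance(data, dict) or not set(data) <= ALLOWED_KEYS:
        return None
    location = data.get('location')
    theme = data.get('theme')
    weekday = data.get('weekday')
    if location is not None and not isinstance(location, str):
        return None
    if theme is not None and (not isinstance(theme, str) or theme not in KNOWN_THEMES):
        return None
    if weekday is not None and (type(weekday) is not int or not 0 <= weekday <= 6):
        return None
    return Query(location=location, theme=theme, weekday=weekday)

print(parse_query('{"location": "Onex"}'))
print(parse_query('{"theme": "jazz"}'))  # rejected: "jazz" is not a known theme
```

Returning `None` for anything unexpected matches the goal of only acting on a query when it is valid.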

First things first: install the dependencies, namely transformers, torch, and accelerate.

In [20]:
!pip install transformers torch accelerate



In [68]:
import transformers
import torch

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"

pipeline = transformers.pipeline(
    "text-generation",
    model=model_id,
    model_kwargs={"torch_dtype": torch.bfloat16},
    device_map="auto",
)

messages = [
    # Note the trailing space, so the two sentences of the system prompt do not run together
    {"role": "system", "content": "You are an assistant helping me to decide which event to go see. " +
     "The allowed events are listed below:\n\n"  + all_spectacles_text},
    {"role": "user", "content": "What events can I go see in Onex?"},
]

prompt = pipeline.tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True
)

terminators = [
    pipeline.tokenizer.eos_token_id,
    pipeline.tokenizer.convert_tokens_to_ids("<|eot_id|>")
]

outputs = pipeline(
    prompt,
    max_new_tokens=256,
    eos_token_id=terminators,
    do_sample=True,
    temperature=0.6,
    top_p=0.9,
)
print(outputs[0]["generated_text"][len(prompt):])





According to the list of events, there is one event happening in Onex:

* Anne Roumanoff | L'expérience de la vie (May 14th, 20:00) at Salle communale d'Onex
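
Since a language model can miss or invent events, the answer is worth cross-checking against a plain filter over the same data, as suggested in the introduction. A minimal sketch, using two events from the outputs above as stand-in data: `events_in` is a hypothetical helper, and the `(title, location)` tuple layout is an assumption made to keep the example self-contained.

```python
# Stand-in data: (title, location) pairs taken from the outputs shown above
events = [
    ('Double-Double', 'AMR / Sud des Alpes, Genève'),
    ("Anne Roumanoff | L'expérience de la vie", "Salle communale d'Onex"),
]

def events_in(place, events):
    """Titles of events whose location mentions `place`, ignoring case (hypothetical helper)."""
    needle = place.casefold()
    return [title for title, location in events if needle in location.casefold()]

expected = events_in('Onex', events)
model_answer = "* Anne Roumanoff | L'expérience de la vie (May 14th, 20:00) at Salle communale d'Onex"

# Every expected title should appear in the answer; anything missing is a red flag
missing = [title for title in expected if title not in model_answer]
print(expected, missing)
```

In the notebook, the stand-in list could be replaced by `[(s.title, s.location) for s in spectacles]`.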
