<a href="https://colab.research.google.com/github/sebug/hype-combiner/blob/main/LinkedInLearning/leprogramme.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# The Challenge
Have a look at [Le Programme](https://leprogramme.ch/), an events agenda for Geneva. We want to query it using natural language. The idea is to construct a search URL and send you there, but only if the URL we constructed is valid. We can also cross-check a search query against the events it returns: filter the full list of events on the main page for those that match, then find the search clause whose results are most similar to that filtered list.

I don't know whether this works yet, and will be exploring a lot.
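
One way to make "most similar results" concrete is to compare sets of event titles. The sketch below uses Jaccard similarity, which is an assumption on my part and not something the notebook has settled on. The query strings and most of the titles are hypothetical placeholders.

```python
def jaccard(a, b):
    """Jaccard similarity of two collections, treated as sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # two empty result sets are treated as identical
    return len(a & b) / len(a | b)


def best_candidate(expected_titles, candidates):
    """Return the (query, score) pair whose results best match expected_titles.

    `candidates` maps a hypothetical search clause to the titles it returned.
    """
    scored = {query: jaccard(expected_titles, titles)
              for query, titles in candidates.items()}
    best = max(scored, key=scored.get)
    return best, scored[best]


# Placeholder data: only 'Double-Double' is a title seen in this notebook.
expected = {"Double-Double", "Concert A", "Concert B"}
candidates = {
    "theme=music": {"Double-Double", "Concert A", "Concert B", "Concert C"},
    "theme=theatre": {"Play X"},
}
best_query, best_score = best_candidate(expected, candidates)
print(best_query, best_score)
```

A score of 1.0 means the search clause returns exactly the filtered events. Anything lower signals that the clause is too broad or too narrow.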

The first step is to determine the current ISO year and week. The weekly agenda URL is built from these two values, and it gives us the events we will train on.

In [1]:
import datetime
current_week = datetime.date.today().isocalendar().week
current_year = datetime.date.today().isocalendar().year
current_week_url = 'https://leprogramme.ch/agenda-culturel-de-la-semaine/Vaud/' + str(current_year) + '/' + str(current_week)
current_week_url

'https://leprogramme.ch/agenda-culturel-de-la-semaine/Vaud/2024/20'
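
Taking the year from `isocalendar()` rather than from the calendar date matters around New Year, when the ISO year and the calendar year can differ. The sketch below wraps the URL construction in a function so it can be checked against fixed dates. The names `week_url` and `BASE_URL` are mine, and whether the site expects a zero-padded week number is not something I have verified.

```python
import datetime

BASE_URL = "https://leprogramme.ch/agenda-culturel-de-la-semaine"


def week_url(day, region="Vaud"):
    """Build the weekly agenda URL for the ISO week containing `day`."""
    # Tuple unpacking works on every Python 3 version.
    iso_year, iso_week, _ = day.isocalendar()
    return f"{BASE_URL}/{region}/{iso_year}/{iso_week}"


print(week_url(datetime.date(2024, 5, 13)))
# Monday 30 December 2024 already belongs to ISO week 1 of 2025.
print(week_url(datetime.date(2024, 12, 30)))
```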

In [2]:
import urllib.request
programme_response = urllib.request.urlopen(current_week_url)
programme_content = programme_response.read()
print(programme_content)

b'\n<!DOCTYPE html>\n<html lang="en"><head>\n\n    <meta http-equiv="Content-Security-Policy" content="upgrade-insecure-requests">\n    <meta name="google-site-verification" content="VsCKkW7qEQybw6oaG_dJ869KxnLQpO5xvaw8SrdCE2g" />\n    <meta charset="utf-8"/>    <meta name="author" content="Lunatyk"/><meta name="viewport" content="width=device-width, initial-scale=1"/>    <meta name="description" content="Agenda concerts, th\xc3\xa9\xc3\xa2tre, danse dans le canton de Gen\xc3\xa8ve, semaine 32 - 2021"/>    <title>Agenda concerts, th\xc3\xa9\xc3\xa2tre, danse dans le canton de Gen\xc3\xa8ve, semaine 32 - 2021 </title>\n    <link href="/favicon.ico" type="image/x-icon" rel="icon"/><link href="/favicon.ico" type="image/x-icon" rel="shortcut icon"/>\n    \n\n\n    <link rel="apple-touch-icon" sizes="57x57" href="https://leprogramme.ch/img/favicon/apple-icon-57x57.png">\n    <link rel="apple-touch-icon" sizes="60x60" href="https://leprogramme.ch/img/favicon/apple-icon-60x60.png">\n    <link
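
The output above is a `bytes` object, which is why accented characters show up as escapes such as `\xc3\xa9`. A minimal sketch of decoding it to text follows. The helper names `decode_page` and `fetch_page` are hypothetical, and the UTF-8 fallback assumes the `<meta charset="utf-8"/>` declaration seen in the page is accurate.

```python
import urllib.request


def decode_page(raw, charset=None):
    """Decode response bytes, falling back to UTF-8 when no charset is given."""
    return raw.decode(charset or "utf-8", errors="replace")


def fetch_page(url, timeout=10):
    """Download `url` and return its body as text."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        # The charset from the Content-Type header, or None if absent.
        charset = response.headers.get_content_charset()
        return decode_page(response.read(), charset)


# A fragment of the raw output above, decoded without touching the network.
sample = b"Agenda concerts, th\xc3\xa9\xc3\xa2tre, danse dans le canton de Gen\xc3\xa8ve"
print(decode_page(sample))
```

Beautiful Soup accepts bytes as well, so this step is optional for parsing. It mainly helps when printing or searching the page by hand.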

We will need Beautiful Soup to analyze the page content.

In [3]:
!pip install beautifulsoup4



Now let us extract the event cards.

In [4]:
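from bs4 import BeautifulSoup
# Name the parser explicitly; otherwise bs4 warns and picks whichever parser is installed.
programme_soup = BeautifulSoup(programme_content, 'html.parser')
# Each event is rendered as an <a class="card-spectacle"> element.
spectacle_cards = programme_soup.find_all('a', class_='card-spectacle')
len(spectacle_cards)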

99

Next, we inspect the first card to see how to extract the relevant information. The date is not inside the card itself, so we have to look at the card's ancestors to find it.

In [5]:
# Import the module, not the class, so the name `datetime` from the first cell keeps working.
import datetime

def get_datetime_from_card(card):
  # The date heading lives two levels above the card, in a 'spectacle-date' block.
  date_string = card.parent.parent.find('div', class_='spectacle-date').find('div', class_='day').get_text()

  french_months = {
    'Janvier': '01',
    'Février': '02',
    'Mars': '03',
    'Avril': '04',
    'Mai': '05',
    'Juin': '06',
    'Juillet': '07',
    'Août': '08',
    'Septembre': '09',
    'Octobre': '10',
    'Novembre': '11',
    'Décembre': '12'
  }

  day, month_name, year = date_string.split()
  month = french_months[month_name]

  # The card itself only carries the start time, e.g. '20:30'.
  time_string = card.find('div', class_='card-date').get_text().strip()

  date_formatted = f"{day} {month} {year} {time_string}"
  return datetime.datetime.strptime(date_formatted, "%d %m %Y %H:%M")

We can test it on a specific card:

In [6]:
get_datetime_from_card(spectacle_cards[0])

datetime.datetime(2024, 5, 13, 20, 30)
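
The string handling can also be separated from the HTML traversal, which makes it testable without a page. The sketch below assumes the date heading reads like `13 Mai 2024` (three whitespace-separated tokens, as the unpacking above implies) and the time like `20:30`. The function name is hypothetical, and the lowercase lookup is an extra guard against capitalization changes.

```python
import datetime as dt

FRENCH_MONTHS = {
    "janvier": 1, "février": 2, "mars": 3, "avril": 4,
    "mai": 5, "juin": 6, "juillet": 7, "août": 8,
    "septembre": 9, "octobre": 10, "novembre": 11, "décembre": 12,
}


def parse_french_datetime(date_string, time_string):
    """Parse e.g. ('13 Mai 2024', '20:30') into a datetime."""
    day, month_name, year = date_string.split()
    # Lowercase the month so 'Mai', 'mai' and 'MAI' all resolve.
    month = FRENCH_MONTHS[month_name.lower()]
    hour, minute = time_string.strip().split(":")
    return dt.datetime(int(year), month, int(day), int(hour), int(minute))


print(parse_french_datetime("13 Mai 2024", " 20:30 "))
```

An unknown month name raises `KeyError`, which is probably preferable to guessing a date silently.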

Now we also want to extract the remaining fields, which can be found directly in the card.

In [10]:
def get_title_from_card(card):
  card_title_element = card.find('h5', class_ = 'card-title')
  return card_title_element.get_text()

In [11]:
get_title_from_card(spectacle_cards[0])

'Double-Double'

The description will contain HTML but we are only interested in the text for now.

In [12]:
def get_description_from_card(card):
  card_description_element = card.find('p', class_ = 'card-description')
  return card_description_element.get_text()


In [14]:
get_description_from_card(spectacle_cards[1])

'A partir d\'un fait divers dramatique naît du théâtre dans du théâtre, une tragi-comédie qui suppose que la réalité dépasse notre entendement, que notre entendement ne repose pas forcément sur la raison. Sinon comment "expliquer" notre monde encore si violent et si chaotique?Attilio Sandro Palese, texte et mise en scène  '

Amusingly, the markup carries almost no semantic information, so we have to recover it from the current layout.

In [17]:
def get_location_from_card(card):
  location_element = card.find('p', class_ = 'card-text')
  return location_element.get_text().strip()


In [19]:
get_location_from_card(spectacle_cards[1])

'Théâtre Saint-Gervais, Genève'

In [24]:
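def get_theme_from_card(card):
  # The theme is encoded as a CSS class such as 'theme-music' on the first tag item.
  tags_element = card.find('ul', class_ = 'card-tags')
  list_item_element = tags_element.find('li')
  css_classes = list_item_element['class']
  # Assumes the theme class comes first in the class list.
  return css_classes[0].replace('theme-', '')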

In [25]:
get_theme_from_card(spectacle_cards[0])

'music'
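
The function above relies on the theme class being the first entry in the class list. A slightly more defensive sketch, operating on a plain list of class names, looks for the `theme-` prefix instead. The helper name and the `theme-theatre` value are hypothetical; only `theme-music` is implied by the output above.

```python
def theme_from_classes(classes, prefix="theme-"):
    """Return the theme encoded in a list of CSS classes, or None if absent."""
    for css_class in classes:
        if css_class.startswith(prefix):
            # Strip only the leading prefix, not later occurrences.
            return css_class[len(prefix):]
    return None


print(theme_from_classes(["theme-music"]))
print(theme_from_classes(["tag", "theme-theatre"]))
```

Returning `None` for a card without a theme class lets the caller decide whether to skip the event or flag it.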

In [31]:
def get_genres_from_card(card):
  # The genres are the text of the first tag item, separated by ' | '.
  tags_element = card.find('ul', class_ = 'card-tags')
  list_item_element = tags_element.find('li')
  return list_item_element.get_text().strip().split(' | ')

In [32]:
get_genres_from_card(spectacle_cards[4])

['Opéra / Opérette', 'humour - stand-up']
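
Splitting on the exact string `' | '` depends on the spacing around the separator. As a hedged alternative, the sketch below splits on the bare `|` and trims each part, so it tolerates spacing changes. The helper name `split_genres` is hypothetical, and the input string is reconstructed from the output above.

```python
def split_genres(text, separator="|"):
    """Split a tag string into genre names, ignoring stray whitespace."""
    parts = (part.strip() for part in text.split(separator))
    # Drop empty parts, e.g. from an empty string or a trailing separator.
    return [part for part in parts if part]


print(split_genres("Opéra / Opérette | humour - stand-up"))
print(split_genres("  Opéra / Opérette|humour - stand-up  "))
```

Note that `/` and `-` are left alone, since they appear inside single genre names such as `Opéra / Opérette`.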