<a href="https://colab.research.google.com/github/vprobon/BERC/blob/main/BERC_BMSM_wordcloud.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# BERC_BMSM_wordcloud
## Purpose: Generate a word cloud from the text extracted from a list of URLs.
## Author: Vasilis J Promponas
## Contact: promponas.vasileios@ucy.ac.cy
## Date: 28/11/2025



## Scrape Text from URLs

### Subtask:
Implement code to fetch content from each URL and extract the main text using a library like BeautifulSoup.


**Reasoning**:
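The notebook relies on `requests` for fetching URL content, `BeautifulSoup` from `bs4` for parsing HTML, `nltk` for tokenization, stopword filtering, and lemmatization, and `zipfile` for packaging the generated images. This first cell imports `nltk` and downloads the resources it needs: the `punkt` and `punkt_tab` tokenizer models, the `wordnet` lexical database, and the `stopwords` corpus.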



In [40]:
import nltk
from nltk import tokenize
from nltk import stem
from nltk.corpus import stopwords
nltk.download("punkt")
nltk.download("wordnet")
nltk.download('punkt_tab')
nltk.download("stopwords")

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [76]:
urls_fpas = ['https://berc.ucy.ac.cy/index.php/agapios-agapiou',
        'https://berc.ucy.ac.cy/index.php/georgios-archontis',
        'https://berc.ucy.ac.cy/index.php/chris-christodoulou',
        'https://berc.ucy.ac.cy/index.php/savvas-n-georgiades',
        'https://berc.ucy.ac.cy/index.php/georgios-georgiou',
        #'https://berc.ucy.ac.cy/index.php/antonis-kakas',
        'https://berc.ucy.ac.cy/index.php/elpida-keravnou',
        #'https://berc.ucy.ac.cy/index.php/constantinos-pattichis',
        'https://berc.ucy.ac.cy/index.php/vasilis-promponas',
        #'https://berc.ucy.ac.cy/index.php/christos-schizas',
        ]
urls_feng = [
        #'https://berc.ucy.ac.cy/index.php/chrysafis-andreou',
        #'https://berc.ucy.ac.cy/index.php/eftychios-christoforou',
        #'https://berc.ucy.ac.cy/index.php/julius-georgiou',
        #'https://berc.ucy.ac.cy/index.php/theodora-krasia',
        'https://berc.ucy.ac.cy/index.php/loucas-louca',
        'https://berc.ucy.ac.cy/index.php/fotios-mpekris',
        #'https://berc.ucy.ac.cy/index.php/costas-pitris',
        'https://berc.ucy.ac.cy/index.php/triantafyllos-stylianopoulos',
        'https://berc.ucy.ac.cy/index.php/vasileios-vavourakis',
        ]
urls_med = [
        #'https://berc.ucy.ac.cy/index.php/artemios-artemiadis',
        #'https://berc.ucy.ac.cy/index.php/panagiotis-bargiotas',
        #'https://berc.ucy.ac.cy/index.php/anastasia-constantinidou',
        #'https://berc.ucy.ac.cy/index.php/nikolas-dietis',
        #'https://berc.ucy.ac.cy/index.php/georgios-hadjigeorgiou',
        #'https://berc.ucy.ac.cy/index.php/nicos-mitsides',
        #'https://berc.ucy.ac.cy/index.php/ilias-nikas',
        'https://berc.ucy.ac.cy/index.php/georgios-nikolopoulos',
        'https://berc.ucy.ac.cy/index.php/panagiotis-zis',
        ]

urls_sse = [
        #'https://berc.ucy.ac.cy/index.php/andria-shimi',
        ]


faculty_urls = {
    'fpas': urls_fpas,
    'feng': urls_feng,
    'med': urls_med,
    #'sse': urls_sse,
    #'all': urls_fpas + urls_feng + urls_med,
}

**Reasoning**:
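This cell imports `requests` for making HTTP requests and `BeautifulSoup` from `bs4` for parsing HTML, then processes each faculty in turn. For every URL it fetches the page, parses the HTML, and concatenates the text of all paragraph (`<p>`) tags. The combined text is tokenized, filtered against the NLTK English stopwords and a custom stopword list, stripped of words shorter than three characters, and lemmatized. Finally, a word cloud is generated from the lemmatized text and saved as `BERC_<faculty>_wordcloud.jpg`.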
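
The paragraph-extraction step can be tried offline before any page is fetched. The sketch below is a dependency-free stand-in that uses the standard library's `html.parser` instead of BeautifulSoup; the class `ParagraphText`, the helper `extract_paragraph_text`, and the sample HTML are hypothetical names and data introduced only for illustration. It shows that only text inside `<p>` elements is kept, while headings and other containers are ignored.

```python
from html.parser import HTMLParser


class ParagraphText(HTMLParser):
    """Collect the text found inside <p> elements (illustrative helper)."""

    def __init__(self):
        super().__init__()
        self.in_paragraph = False  # True while the parser is inside a <p> element
        self.chunks = []           # one string per paragraph

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.in_paragraph = True
            self.chunks.append("")

    def handle_endtag(self, tag):
        if tag == "p":
            self.in_paragraph = False

    def handle_data(self, data):
        # Text of inline children such as <b> still belongs to the open paragraph
        if self.in_paragraph:
            self.chunks[-1] += data


def extract_paragraph_text(html):
    """Return the text of all paragraphs, joined by single spaces."""
    parser = ParagraphText()
    parser.feed(html)
    parser.close()
    return " ".join(chunk.strip() for chunk in parser.chunks if chunk.strip())


# Made-up page content, not taken from any of the scraped URLs
sample_html = """
<html><body>
  <h1>Staff profile</h1>
  <p>First paragraph with <b>bold</b> text.</p>
  <div>Navigation that should be ignored</div>
  <p>Second paragraph.</p>
</body></html>
"""
page_text = extract_paragraph_text(sample_html)
print(page_text)
```

Restricting extraction to `<p>` tags is a simple heuristic: it skips menus and headings, but it also misses any profile text that a page places in lists or plain `<div>` elements.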



In [85]:
import requests
from bs4 import BeautifulSoup
from wordcloud import WordCloud

# Build the stopword sets and the lemmatizer once, outside the loops
stopwords_english = set(stopwords.words("english"))
stopwords_custom = {'University', 'Cyprus', 'Professor', 'Research', 'research',
                    'et', 'al', 'two', 'Promponas', 'PhD', 'BRL', 'award', 'Louca', 'grant', 'non',
                    'Athens', 'Greece', 'School', 'aim'}
lemmatizer = stem.WordNetLemmatizer()

for faculty, urls in faculty_urls.items():
    # Fetch each page and keep only the text found in <p> tags
    scraped_texts = []
    for url in urls:
        response = requests.get(url, timeout=30)
        response.raise_for_status()  # stop on HTTP errors instead of scraping an error page
        soup = BeautifulSoup(response.content, 'html.parser')
        text_content = ' '.join(p.get_text() for p in soup.find_all('p'))
        scraped_texts.append(text_content.strip())

    if not scraped_texts:
        print(faculty + ": no URLs to scrape, skipping.")
        continue

    print(faculty + ": Text successfully scraped from all URLs.")
    # Display the first 200 characters of the first scraped text to verify
    print(f"First scraped text (excerpt): {scraped_texts[0][:200]}...")

    # Tokenize, then drop stopwords, custom terms, and words shorter than 3 characters
    text = ' '.join(scraped_texts)
    words = tokenize.word_tokenize(text)
    filtered_words = [w for w in words
                      if w not in stopwords_english
                      and w not in stopwords_custom
                      and len(w) > 2]

    # Lemmatize in lowercase; lemmas shorter than 3 characters are printed and skipped
    lemmas = []
    for word in filtered_words:
        lemma = lemmatizer.lemmatize(word.lower())
        if len(lemma) < 3:
            print(lemma, end=' ')
            continue
        lemmas.append(lemma)
    lem_text1 = ' '.join(lemmas)

    # Render the word cloud and save it as BERC_<faculty>_wordcloud.jpg
    outname = 'BERC' + '_' + faculty + '_wordcloud.jpg'
    wc1 = WordCloud(background_color="white", width=600, height=400, min_font_size=15)
    wc1.generate(lem_text1)
    wc1.to_file(outname)

fpas: Text successfully scraped from all URLs.
First scraped text (excerpt): Associate Professor Agapios Agapiou received his Diploma and PhD in Chemical Engineering from the National Technical University of Athens (NTUA, Greece) in 2001 and 2006, respectively. Since 2001, he ...
us ac feng: Text successfully scraped from all URLs.
First scraped text (excerpt): ROBOTIC REHABILITATION  Loucas S. Louca received his Diploma in Mechanical Engineering from the National Technical University of Athens, Greece, in 1992.  He then moved to the University of Michigan w...
ac mc mc med: Text successfully scraped from all URLs.
First scraped text (excerpt): LABORATORY OF  MEDICAL STATISTICS,  EPIDEMIOLOGY &  PUBLIC HEALTH  Dr Nikolopoulos is Associate Professor of Epidemiology and Public Health at the Medical School of the University of Cyprus. He is a g...
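
When deciding which terms belong in a custom stopword list, it can help to look at raw term counts before rendering an image, since the most frequent terms are the ones that dominate a word cloud. The following is a minimal sketch under stated assumptions: it uses a regular expression and two tiny hypothetical stopword sets (`SAMPLE_STOPWORDS`, `SAMPLE_CUSTOM`) in place of the NLTK tokenizer and stopword corpus, and the helper `top_terms` and the sample sentence are invented for illustration.

```python
import re
from collections import Counter

# Hypothetical stand-ins for the NLTK English stopwords and the notebook's custom list
SAMPLE_STOPWORDS = {"the", "and", "of", "in", "is", "at"}
SAMPLE_CUSTOM = {"university", "professor", "research"}


def top_terms(text, n=2, min_length=3):
    """Return the n most frequent terms after stopword and length filtering."""
    # Lowercase first so that the filtering is case-insensitive
    tokens = re.findall(r"[a-z]+", text.lower())
    kept = [t for t in tokens
            if len(t) >= min_length
            and t not in SAMPLE_STOPWORDS
            and t not in SAMPLE_CUSTOM]
    return Counter(kept).most_common(n)


# Made-up text, not taken from the scraped pages
sample_text = (
    "The professor studies protein sequence analysis at the university. "
    "Protein structure and protein sequence research is the focus of the lab."
)
terms = top_terms(sample_text)
print(terms)  # [('protein', 3), ('sequence', 2)]
```

Terms that rank high but carry little meaning for the audience (institution names, titles, surnames) are natural candidates for the custom list.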


In [84]:
# Package the generated images in a single zip archive for downloading
from zipfile import ZipFile
with ZipFile('figures.zip', 'w') as myzip:
    # One image per faculty listed in faculty_urls
    for faculty in faculty_urls:
        myzip.write('BERC_' + faculty + '_wordcloud.jpg')
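
If a faculty was skipped or an image failed to render, `ZipFile.write` raises an error for the missing file. One possible safeguard, sketched below, is to archive only the files that exist and report the rest. The helper `zip_existing` is a hypothetical name, and the demonstration uses placeholder files in a temporary directory rather than real word clouds. Note that `ZipFile` stores files uncompressed by default; passing `compression=ZIP_DEFLATED` enables compression, although already-compressed JPEG images gain little from it.

```python
import tempfile
from pathlib import Path
from zipfile import ZIP_DEFLATED, ZipFile


def zip_existing(paths, archive_path):
    """Archive the files that exist; return the names of those that were missing."""
    missing = []
    with ZipFile(archive_path, "w", compression=ZIP_DEFLATED) as archive:
        for path in map(Path, paths):
            if path.is_file():
                # arcname stores only the file name, not its directory
                archive.write(path, arcname=path.name)
            else:
                missing.append(path.name)
    return missing


# Demonstration with a placeholder file in a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    tmp_dir = Path(tmp)
    (tmp_dir / "BERC_fpas_wordcloud.jpg").write_bytes(b"placeholder")
    wanted = [tmp_dir / "BERC_fpas_wordcloud.jpg", tmp_dir / "BERC_sse_wordcloud.jpg"]
    archive_path = tmp_dir / "figures.zip"
    missing = zip_existing(wanted, archive_path)
    with ZipFile(archive_path) as archive:
        archived = archive.namelist()

print("archived:", archived)
print("missing:", missing)
```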