## Extracting information from the EUvsDisinfo database (https://euvsdisinfo.eu/disinformation-cases/)

Using data analysis and media monitoring services in multiple languages, EUvsDisinfo identifies, compiles, and exposes disinformation cases originating in pro-Kremlin outlets. These cases (and their disproofs) are collected in the EUvsDisinfo database – the only searchable, open-source repository of its kind. The database is updated every week.

In [55]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import random
import re
import time  # needed for time.sleep() between article requests

In [None]:
results = []

title_set = set()  # only unique titles
page = 10  # offset into the result list; advanced by 10 (one page) per iteration
add_page = True

while add_page:
    url = f'https://euvsdisinfo.eu/disinformation-cases/?text=&date=&orderby=date&offset={page}&per_page=10'
    user_agents_list = [
        'Mozilla/5.0 (iPad; CPU OS 12_2 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Mobile/15E148',
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.83 Safari/537.36',
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.51 Safari/537.36'
    ]

    response = requests.get(url, headers={'User-Agent': random.choice(user_agents_list)})

    if response.status_code != 200:
        print(f"Error: Failed to retrieve page for url {url}. Status code: {response.status_code}")
        # Stop here: `continue` would re-request the same URL forever if the error persists
        break

    html = response.text
    soup = BeautifulSoup(html, 'html.parser')
    rows = soup.find_all('tr', class_='disinfo-db-post')

    for row in rows:
        date = row.find('td', class_='disinfo-db-date').get_text(strip=True)
        title = row.find('td', class_='cell-title').a.get_text(strip=True)

        if title not in title_set:
            article_url = 'https://euvsdisinfo.eu' + row.find('td', class_='cell-title').a['href']
            outlet = row.find('div', class_='disinfo-outlets-list').get_text(strip=True)
            country = row.find('td', class_='disinfo-db-cell cell-country').get_text(strip=True)

            article_response = requests.get(article_url, headers={'User-Agent': random.choice(user_agents_list)})

            if article_response.status_code == 200:
                article_html = article_response.text
                article_soup = BeautifulSoup(article_html, 'html.parser')
                article_summary = '\n'.join([p.text.strip() for p in article_soup.select('div.b-report__summary-text')])
                article_disproof = '\n'.join([p.text.strip() for p in article_soup.select('div.b-report__disproof-text')])
                article_language = re.sub(r'^Article language\(s\)\n\s*', '','\n'.join([p.text.strip() for p in article_soup.select('div.b-catalog__col2 > div > div > ul > li:nth-child(3)')])).strip()
                article_keywords = re.sub(r'^Keywords:\n', '', '\n'.join([p.text.strip() for p in article_soup.select('div.b-catalog__col2 > div > div > ul > li:nth-child(5)')])).strip()

            else:
                print(f"Error: Failed to retrieve page for url {article_url}. Status code: {article_response.status_code}")
                continue

            results.append({
                'Date': date,
                'Title': title,
                'URL': article_url,
                'Outlets': outlet,
                'Countries': country,
                'Language': article_language,
                'Summary': article_summary,
                'Disproof': article_disproof,
                'Keywords': article_keywords
            })

            title_set.add(title)
            time.sleep(1)

    if soup.select_one('div.reports_loadmore'):
        page += 10
    else:
        add_page = False

df = pd.DataFrame(results)
df.to_csv('disinformation_database.csv')
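
A non-200 response during a long scrape is often transient, so retrying after a pause may recover the page. Below is a minimal sketch of retrying with exponential backoff. The helper name `fetch_with_retry` and its parameters are hypothetical and not part of the original notebook. It assumes `fetch` is any callable that returns a `(status_code, text)` pair, for example a small wrapper around `requests.get`.

```python
import time
from typing import Callable, Optional, Tuple


def fetch_with_retry(
    fetch: Callable[[str], Tuple[int, str]],
    url: str,
    max_attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[str]:
    """Return the page body, or None if every attempt fails.

    `fetch` is assumed to return (status_code, text) for a URL.
    """
    for attempt in range(max_attempts):
        status, body = fetch(url)
        if status == 200:
            return body
        if attempt < max_attempts - 1:
            # Wait base_delay, then 2x, 4x, ... before the next attempt.
            sleep(base_delay * 2 ** attempt)
    return None
```

The `sleep` argument is injectable only so the helper can be exercised without real waiting. In the scraping loop the default `time.sleep` would be used.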

Running this code took 5 hours.
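
Because a run of this length can be interrupted, one option is to write each record to disk as soon as it is scraped and to reload the titles already stored before resuming. The sketch below uses only the standard `csv` module. The helper names `load_seen_titles` and `append_record` are hypothetical, and the column list is assumed to match the keys used in `results.append(...)`.

```python
import csv
import os

# Column names, assumed to match the keys of each scraped record.
FIELDS = ['Date', 'Title', 'URL', 'Outlets', 'Countries',
          'Language', 'Summary', 'Disproof', 'Keywords']


def load_seen_titles(path):
    """Return the titles already stored in the CSV, or an empty set if it does not exist."""
    if not os.path.exists(path):
        return set()
    with open(path, newline='', encoding='utf-8') as f:
        return {row['Title'] for row in csv.DictReader(f)}


def append_record(path, record):
    """Append one record, writing the header first if the file is new or empty."""
    need_header = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, 'a', newline='', encoding='utf-8') as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if need_header:
            writer.writeheader()
        writer.writerow(record)
```

With these helpers, `title_set` could be initialised from `load_seen_titles(...)` and each record appended in place of `results.append(...)`, so an interrupted run keeps what it has already collected.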

In [63]:
print(df.shape)
df.head(1)

(9965, 9)


|   | Date | Title | URL | Outlets | Countries | Language | Summary | Disproof | Keywords |
|---|------|-------|-----|---------|-----------|----------|---------|----------|----------|
| 0 | 10.05.2023 | USSR Victory Banner was raised over the Bundes... | https://euvsdisinfo.eu/report/ussr-victory-ban... | ntv.ru,eadaily.com,Moskovskij Komsomolets,tsar... | Russia, ... | Russian | The Russians took Berlin without firing a shot... | The video is a manipulation. The Red Banner of... | World War 2 |


The dataset has 9965 rows, each describing one disinformation case, and 9 columns:
* Date - the date the fake news was released
* Title - the title of the fake news
* URL - web link to the EUvsDisinfo report on the case
* Outlets - news agencies from which the disinformation originated
* Countries - countries mentioned in the fake news
* Language - the language of the fake news
* Summary - summary of the fake news
* Disproof - points made by the "EU vs Disinfo" organization to disprove the fake news
* Keywords - keywords from the fake news
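
Several of these columns are stored as plain strings and may need parsing before analysis. The sketch below makes two assumptions based on the sample row: `Date` is in day-first `DD.MM.YYYY` form (as in `10.05.2023`), and `Outlets` and `Countries` are comma-separated lists. The helper names are hypothetical.

```python
from datetime import date, datetime


def parse_case_date(text):
    """Parse a 'DD.MM.YYYY' string (assumed day-first) into a datetime.date."""
    return datetime.strptime(text.strip(), '%d.%m.%Y').date()


def split_list_field(text):
    """Split a comma-separated field into a list of trimmed, non-empty items."""
    return [item.strip() for item in text.split(',') if item.strip()]


print(parse_case_date('10.05.2023'))             # 2023-05-10
print(split_list_field('ntv.ru,eadaily.com'))    # ['ntv.ru', 'eadaily.com']
```

An outlet or country name that itself contains a comma would be split incorrectly, so the results are worth spot-checking against the source pages.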