# UE22AM342BA2 – Deep Learning on Graph Course Project
## Project Title: BiasNet: A Contrastive GNN-Based Framework for Political Stance Detection in News


## 1. Import Libraries
#### Import necessary libraries for XML parsing, data manipulation, web requests, HTML parsing, regular expressions, and deep learning (PyTorch and Transformers).

In [4]:
import xml.etree.ElementTree as ET
import pandas as pd
import requests
from bs4 import BeautifulSoup
import re
import torch
from transformers import AutoTokenizer, AutoModel

## 2. Load and Parse XML Data
#### Load the ground truth data from the specified XML file. Parse the XML structure to extract relevant attributes for each article.

In [None]:
xml_file = "dataset/ground-truth-trial.xml" 
tree = ET.parse(xml_file)
root = tree.getroot()

In [None]:
data = []

In [None]:
for article in root.findall("article"):
    data.append({
        "id": article.get("id"),
        "hyperpartisan": article.get("hyperpartisan"),
        "bias": article.get("bias"),
        "url": article.get("url"),
        "labeled_by": article.get("labeled-by")
    })
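To make the attribute extraction concrete, here is a minimal sketch that parses an inline XML snippet shaped like the ground-truth file. The snippet is hypothetical: the attribute names come from the loop above, while the second article and the `labeled-by` value are invented for illustration. Note that `get` returns `None` for a missing attribute and that `hyperpartisan` arrives as the string `"true"` or `"false"`, not as a boolean.

```python
import xml.etree.ElementTree as ET

# Hypothetical two-article sample; the real file is much larger.
sample_xml = """
<articles>
  <article id="0000012" hyperpartisan="true" bias="right"
           url="https://dailywire.com/node/9485" labeled-by="publisher"/>
  <article id="0000099" hyperpartisan="false"
           url="http://example.org/story"/>
</articles>
"""

root = ET.fromstring(sample_xml)
records = [
    {
        "id": article.get("id"),
        "hyperpartisan": article.get("hyperpartisan"),  # a string, not a bool
        "bias": article.get("bias"),                    # None when absent
        "url": article.get("url"),
        "labeled_by": article.get("labeled-by"),
    }
    for article in root.findall("article")
]

for record in records:
    print(record)
```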

## 3. Initial Data Exploration and Cleaning
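#### Build a DataFrame from the parsed records, preview it without the 'labeled_by' column, and inspect the column names and the distribution of the 'bias' labels.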


In [None]:
df = pd.DataFrame(data)

In [None]:
# Preview without 'labeled_by'; the result is not assigned, so df keeps the column
df.drop('labeled_by', axis=1)

Unnamed: 0,id,hyperpartisan,bias,url
0,0000012,true,right,https://dailywire.com/node/9485
1,0000053,true,left,https://counterpunch.org/2010/12/08/calling-fr...
2,0000079,false,least,http://texastribune.org/2015/08/24/now-texas-e...
3,0000086,true,left,https://counterpunch.org/2017/10/02/how-long-a...
4,0000113,true,left,http://fair.org/home/arms-deal-stories-omit-wa...
...,...,...,...,...
199976,1494835,true,left,https://leftvoice.org/Refugees-The-End-of-the-...
199977,1494849,false,least,https://recode.net/2015/2/26/11559442/time-for...
199978,1494861,false,least,https://consortiumnews.com/2015/03/31/phasing-...
199979,1494886,false,least,https://reuters.com/article/us-tennis-ausopen-...


In [8]:
df.columns

Index(['id', 'hyperpartisan', 'bias', 'url', 'labeled_by'], dtype='object')

In [None]:
df['bias'].value_counts()

## 4. Extract Source Domain from URL
#### Define a function that extracts the host name (source) from a URL with a regular expression, and apply it to create a new 'source' column in the DataFrame. Then balance the classes by sampling at most 1463 articles per 'bias' value and save the result.

In [9]:
def extract_source(url):
    match = re.search(r'https?://([^/]+)', url)
    return match.group(1) if match else "Unknown"

In [10]:
df['source'] = df['url'].apply(extract_source)
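The regular expression above keeps everything between `://` and the next `/`, so a `www.` prefix or a port number would become part of the source name. As a possible alternative, the sketch below uses the standard library's `urllib.parse`. The function name `extract_source_urlparse` and the decision to strip `www.` are assumptions for illustration, not part of the notebook.

```python
from urllib.parse import urlparse


def extract_source_urlparse(url):
    """Hypothetical variant of extract_source built on urllib.parse."""
    host = urlparse(url).hostname  # lower-cased host, without port or credentials
    if not host:
        return "Unknown"
    # Treat "www.example.com" and "example.com" as the same source.
    return host[4:] if host.startswith("www.") else host


print(extract_source_urlparse("https://www.Example.com:8080/a/b"))     # example.com
print(extract_source_urlparse("http://texastribune.org/2015/08/24/"))  # texastribune.org
print(extract_source_urlparse("not a url"))                            # Unknown
```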

In [11]:
min_samples = 1463
df_balanced = df.groupby('bias', group_keys=False).apply(lambda x: x.sample(min(len(x), min_samples))).reset_index(drop=True)
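The sampling above draws different rows on every run because no seed is given. The sketch below shows one way to cap each class at a fixed size reproducibly, using toy data. The names `toy`, `cap`, and `seed` are illustrative, and building the result with `pd.concat` over the groups is a swapped-in alternative to `groupby(...).apply(...)`, not the notebook's own call.

```python
import pandas as pd

# Toy data standing in for the real DataFrame: 5 "left", 2 "right", 3 "least".
toy = pd.DataFrame({
    "id": range(10),
    "bias": ["left"] * 5 + ["right"] * 2 + ["least"] * 3,
})

cap = 3    # plays the role of min_samples
seed = 42  # assumed value; any fixed integer makes the draw repeatable

# Sample at most `cap` rows from each class, then stack the pieces.
parts = [
    group.sample(n=min(len(group), cap), random_state=seed)
    for _, group in toy.groupby("bias")
]
balanced = pd.concat(parts).reset_index(drop=True)

print(balanced["bias"].value_counts().to_dict())
```

Classes smaller than the cap are kept whole, so the result is balanced only up to the size of the smallest class.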


In [None]:
df_balanced.to_csv("updated_data.csv", index=False)

In [13]:
df_balanced.head()

Unnamed: 0,id,hyperpartisan,bias,url,labeled_by,source
0,501941,False,least,https://recode.net/2015/1/29/11558290/hbo-tech...,publisher,recode.net
1,648797,False,least,http://themoderatevoice.com/canada-lost-lots-o...,publisher,themoderatevoice.com
2,1335009,False,least,https://factcheck.org/2012/10/romney-all-wet-o...,publisher,factcheck.org
3,1458433,False,least,http://themoderatevoice.com/fiorina-is-too-mad...,publisher,themoderatevoice.com
4,502753,False,least,https://consortiumnews.com/2015/12/07/real-que...,publisher,consortiumnews.com


In [14]:
total_urls = len(df_balanced)
total_urls

7315

## 5. Web Scraping: Extract Article Text
#### Define headers to mimic a web browser and a function to scrape text content (specifically paragraphs) from the article URLs. Apply this function to the balanced DataFrame.

In [16]:
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
}

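def extract_paragraphs(url, index, total):
    """Fetch a URL and return the text of its <p> tags joined by newlines.

    Returns "No text extracted" when the page has no paragraph text and
    "Error: <reason>" when the request fails.
    """
    print(f"Processing {index + 1}/{total} - {url}")
    try:
        response = requests.get(url, headers=headers, timeout=10)
        response.raise_for_status()  # treat HTTP 4xx/5xx responses as failures
        soup = BeautifulSoup(response.text, 'html.parser')
        paragraphs = soup.find_all('p')
        text = '\n'.join(p.get_text(strip=True) for p in paragraphs)
        print(f"[{index + 1}/{total}] Successfully extracted text from: {url}")
        return text if text else "No text extracted"
    except requests.exceptions.RequestException as e:
        # Covers connection errors, timeouts, bad status codes and redirect loops
        print(f"[{index + 1}/{total}] Error fetching {url}: {e}")
        return f"Error: {e}"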
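The call `get_text(strip=True)` strips every text node inside a paragraph and joins the pieces with no separator, which is a likely cause of fused strings such as "ByFactCheck.org" in the scraped text. The sketch below reproduces the effect on a hypothetical HTML fragment and shows a variant that joins nodes with a space and skips empty paragraphs. The variant is an assumption about what might be preferable, not what the notebook does.

```python
from bs4 import BeautifulSoup

# Hypothetical page fragment with inline markup inside a paragraph.
html = (
    "<html><body>"
    "<p> First paragraph. </p>"
    "<div><p>By<a href='#'>FactCheck.org</a></p></div>"
    "<p></p>"
    "</body></html>"
)

soup = BeautifulSoup(html, "html.parser")
paragraphs = soup.find_all("p")

# Same call as extract_paragraphs: text nodes are stripped, then joined with "".
fused = [p.get_text(strip=True) for p in paragraphs]

# Variant (an assumption): join nodes with a space and drop empty paragraphs.
spaced = [p.get_text(" ", strip=True) for p in paragraphs]
spaced = [text for text in spaced if text]

print(fused)   # ['First paragraph.', 'ByFactCheck.org', '']
print(spaced)  # ['First paragraph.', 'By FactCheck.org']
```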


In [17]:
df_balanced['extracted_text'] = [extract_paragraphs(url, i, total_urls) for i, url in enumerate(df_balanced['url'])]

Processing 1/7315 - https://recode.net/2015/1/29/11558290/hbo-tech-executives-leave-ahead-of-internet-launch-as-networks
[1/7315] Error fetching https://recode.net/2015/1/29/11558290/hbo-tech-executives-leave-ahead-of-internet-launch-as-networks: Exceeded 30 redirects.
Processing 2/7315 - http://themoderatevoice.com/canada-lost-lots-of-jobs-too/
[2/7315] Successfully extracted text from: http://themoderatevoice.com/canada-lost-lots-of-jobs-too/
Processing 3/7315 - https://factcheck.org/2012/10/romney-all-wet-on-ships/
[3/7315] Successfully extracted text from: https://factcheck.org/2012/10/romney-all-wet-on-ships/
Processing 4/7315 - http://themoderatevoice.com/fiorina-is-too-mad-to-see-straight/
[4/7315] Successfully extracted text from: http://themoderatevoice.com/fiorina-is-too-mad-to-see-straight/
Processing 5/7315 - https://consortiumnews.com/2015/12/07/real-question-for-president-obama/
[5/7315] Successfully extracted text from: https://consortiumnews.com/2015/12/07/real-question

In [None]:
df_balanced.drop("labeled_by",axis=1, inplace=True)

In [18]:
df_balanced.to_csv("updated_data.csv", index=False)

print("Processing complete. Saved to updated_data.csv")

Processing complete. Saved to updated_data.csv


## 6. Post-Scraping Data Cleaning
#### - Remove rows where text extraction failed (identified by the "Error:" prefix).
#### - Filter the DataFrame to include only specific bias classes of interest ('right', 'left', 'least').

In [6]:
df = pd.read_csv("updated_data.csv")

In [7]:
df.head()

Unnamed: 0,id,hyperpartisan,bias,url,source,extracted_text
0,744239,False,least,http://themoderatevoice.com/paul-ryan-what-the...,themoderatevoice.com,The Moderate Voice\nAn Internet hub with domes...
1,386959,False,least,https://factcheck.org/2011/02/factcheck-mailba...,factcheck.org,"ByFactCheck.org\nPosted onFebruary 8, 2011\nTh..."
2,1310030,False,least,http://belfercenter.org/publication/preview-wa...,belfercenter.org,Research and insight to improve policy and gov...
3,1448582,False,least,http://texastribune.org/2015/10/16/judge-denie...,texastribune.org,A federal district judge Friday declined to or...
4,176480,False,least,http://themoderatevoice.com/downing-of-flight-...,themoderatevoice.com,The Moderate Voice\nAn Internet hub with domes...


In [9]:
df = df[~df['extracted_text'].str.startswith("Error", na=False)]

In [12]:
classes = ['right', 'left', 'least']
df = df[df['bias'].isin(classes)]

In [13]:
len(df)

3832
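The two filters in this section can be checked on a toy frame. In the sketch below all rows are invented, and the label `right-center` merely stands in for any class outside the three of interest. It also shows two side effects of the prefix test: rows with missing text and rows holding the "No text extracted" placeholder both survive the filter.

```python
import pandas as pd

# Toy rows; the texts are invented for illustration.
toy = pd.DataFrame({
    "bias": ["left", "right-center", "least", "right", "left"],
    "extracted_text": [
        "Some article text",
        "More article text",
        "Error: 404 Client Error",
        None,                      # missing text, e.g. an empty CSV cell
        "No text extracted",
    ],
})

# na=False makes missing values count as "does not start with Error",
# so the mask stays boolean and those rows are kept.
failed = toy["extracted_text"].str.startswith("Error", na=False)
kept = toy[~failed]
kept = kept[kept["bias"].isin(["right", "left", "least"])]

print(kept)
```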

## 7. Load Pre-trained Model and Tokenizer (LaBSE)
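#### Load the tokenizer and model for the Language-agnostic BERT Sentence Embedding (LaBSE) model. This model is designed to create embeddings where sentences with similar meanings are close in the embedding space, even across languages. The checkpoint loaded here, `cointegrated/LaBSE-en-ru`, is a reduced variant of LaBSE that keeps only the English and Russian vocabulary.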


In [19]:
tokenizer = AutoTokenizer.from_pretrained("cointegrated/LaBSE-en-ru")
model = AutoModel.from_pretrained("cointegrated/LaBSE-en-ru")
sentences = df['extracted_text']

In [20]:
sentences_list = []
# Iterate over the Series of texts, not over the (still empty) target list
for i in sentences:
    sentences_list.append(i)
print(len(sentences_list))

3832


## 8. Generate Text Embeddings
#### Iterate through the extracted text of each article, tokenize it (truncated to the first 64 tokens), feed it into the LaBSE model, and L2-normalize the pooled output to obtain the article embedding.


In [22]:
embeddings_list = []
total_sentences = len(df)

for idx, sentence in enumerate(df['extracted_text'], start=1):
    # Tokenization
    encoded_input = tokenizer(sentence, padding=True, truncation=True, max_length=64, return_tensors='pt')

    # Generate embedding
    with torch.no_grad():
        model_output = model(**encoded_input)

    # Normalize embedding
    embedding = torch.nn.functional.normalize(model_output.pooler_output).squeeze(0).tolist()
    embeddings_list.append(embedding)

    # Print progress counter
    print(f"Processed {idx}/{total_sentences}")

Processed 1/3832
Processed 2/3832
Processed 3/3832
Processed 4/3832
Processed 5/3832
...
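With its default arguments, `torch.nn.functional.normalize` divides each row by its Euclidean (L2) norm, using a small `eps` to avoid division by zero. The plain-Python sketch below is a stand-in for that operation on short toy vectors, not the notebook's code. It also illustrates why the step is useful: for unit-length vectors, the dot product equals the cosine similarity.

```python
import math


def l2_normalize(vector, eps=1e-12):
    """Scale a vector to unit Euclidean length (guarding against zero norm)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / max(norm, eps) for x in vector]


a = l2_normalize([3.0, 4.0])
b = l2_normalize([4.0, 3.0])

# For unit vectors the dot product is the cosine similarity.
cosine = sum(x * y for x, y in zip(a, b))

print(a)
print(round(cosine, 6))
```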

In [23]:
df['Embedding'] = embeddings_list

In [24]:
df

Unnamed: 0.1,Unnamed: 0,id,hyperpartisan,bias,url,source,extracted_text,Embedding
0,1,648797,False,least,http://themoderatevoice.com/canada-lost-lots-o...,themoderatevoice.com,The Moderate Voice\nAn Internet hub with domes...,"[-0.05107906088232994, -0.020032858476042747, ..."
1,2,1335009,False,least,https://factcheck.org/2012/10/romney-all-wet-o...,factcheck.org,"ByRobert Farley\nPosted onOctober 26, 2012\nTh...","[-0.07073375582695007, 0.027049612253904343, 0..."
2,3,1458433,False,least,http://themoderatevoice.com/fiorina-is-too-mad...,themoderatevoice.com,The Moderate Voice\nAn Internet hub with domes...,"[-0.05738906189799309, 0.019467007368803024, -..."
3,4,502753,False,least,https://consortiumnews.com/2015/12/07/real-que...,consortiumnews.com,Part of Official Washington’s problem is that ...,"[-0.036798395216464996, 0.04595208540558815, -..."
4,5,724419,False,least,https://factcheck.org/2011/04/factcheck-mailba...,factcheck.org,"ByFactCheck.org\nPosted onApril 5, 2011| Updat...","[-0.017716171219944954, -0.03069198504090309, ..."
...,...,...,...,...,...,...,...,...
3827,5847,1043342,True,right,http://foxbusiness.com/markets/2017/10/23/tyle...,foxbusiness.com,Quotes displayed in real-time or delayed by at...,"[-0.023267608135938644, -0.026129573583602905,..."
3828,5848,839315,True,right,http://foxbusiness.com/markets/2017/01/18/bett...,foxbusiness.com,Quotes displayed in real-time or delayed by at...,"[-0.023267608135938644, -0.026129573583602905,..."
3829,5849,166391,True,right,http://foxbusiness.com/markets/2017/09/07/stic...,foxbusiness.com,Quotes displayed in real-time or delayed by at...,"[-0.023267608135938644, -0.026129573583602905,..."
3830,5850,1056440,True,right,https://dailywire.com/news/13352/breaking-nint...,dailywire.com,Redirecting you to//www.dailywire.com/news/133...,"[-0.044188059866428375, -0.010183711536228657,..."


In [28]:
df.columns

Index(['Unnamed: 0', 'id', 'hyperpartisan', 'bias', 'url', 'source',
       'extracted_text', 'Embedding'],
      dtype='object')

In [47]:
df.to_csv("dataset/df_with_embeddings.csv", index=False)
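One caveat when saving a list-valued column to CSV: each list is written as its string representation, so reading the file back yields strings rather than lists of floats. The sketch below shows the round trip on a toy frame held in memory and one way to recover the lists with `ast.literal_eval`. The frame and its values are invented for illustration.

```python
import ast
import io

import pandas as pd

# Toy frame with a list-valued column, standing in for df['Embedding'].
toy = pd.DataFrame({"id": [1, 2], "Embedding": [[0.1, -0.2], [0.3, 0.4]]})

# Write to an in-memory buffer instead of a file, then read it back.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)

restored = pd.read_csv(buffer)
print(type(restored.loc[0, "Embedding"]))  # a string such as "[0.1, -0.2]"

# Parse each string back into a list of floats.
restored["Embedding"] = restored["Embedding"].apply(ast.literal_eval)
print(restored.loc[0, "Embedding"])
```

A binary format that preserves nested types would avoid the parsing step altogether, at the cost of a less portable file.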