# **Data Collection**
### Below is the first step of our project: the code for our data collection. We decided to scrape Wikipedia and Feedspot.

##### We'll start by listing all our imports

In [None]:
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin
from urllib.parse import urlparse
import time
import random
import csv
import pandas as pd
import re
import nltk
from nltk.corpus import stopwords
import string
# nltk.download('punkt_tab')  # uncomment if not already downloaded
# nltk.download('stopwords')  # uncomment if not already downloaded

#### Below is a list of useful functions that will be used during the data collection.

##### The functions below are used to collect links

In [None]:
def fetch_verify_url(url):
    try:
        # Browser-like User-Agent header, since some websites block the default one sent by requests
        headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'}
        response = requests.get(url, headers=headers, timeout=10)  # The timeout prevents a stalled server from hanging the crawl
        if response.status_code != 200:  # If the status code is not OK (200), print an error message and return None
            print(f"Failed to fetch the url: {url} with status code {response.status_code}")
            return None
        return response
    except requests.RequestException:  # Any network error (DNS failure, timeout, refused connection, ...)
        return None

This function is used to safely retrieve the HTML content of a web page. It sends an HTTP request to a given URL while mimicking a real web browser by specifying a User-Agent header, which helps avoid blocking by some websites. The function then checks whether the server responds with a successful status code (200). If the request fails, the page is inaccessible, or any network error occurs, the function returns None. Otherwise, it returns the raw HTTP response containing the page’s HTML content.

In [None]:
def to_soup(url):
    response = fetch_verify_url(url)
    if response:  # If the response is not None, return the BeautifulSoup object
        return BeautifulSoup(response.text, 'html.parser') 
    else:
        return None

This function converts the HTML content of a web page into a structured object that can be easily analyzed. It first calls the fetch_verify_url function to retrieve the web page’s HTML content. If the request is successful, the HTML is parsed using BeautifulSoup with the built-in HTML parser, producing a navigable representation of the page. If the page cannot be retrieved, the function returns None.

In [None]:
def filter_links(links, required_keywords=None, domain=None, already_seen=None, exclude_keywords=None):
    if required_keywords is None:  # If no list of required keywords is given, use an empty one
        required_keywords = []
    if exclude_keywords is None:  # If no list of exclusion keywords is given, use an empty one
        exclude_keywords = []
    if already_seen is None:  # If no set of already seen links is given, use an empty one
        already_seen = set()

    filtered = []
    for l in links:
        l_lower = l.lower()  # Lowercase the URL so that keyword matching is case-insensitive
        if required_keywords and not any(keyword.lower() in l_lower for keyword in required_keywords):  # Skip the URL if it contains none of the required keywords
            continue
        if any(keyword.lower() in l_lower for keyword in exclude_keywords):  # Skip the URL if it contains any exclusion keyword
            continue
        if domain and urlparse(l).netloc != domain:  # Skip the URL if it is outside the given domain
            continue
        if l in already_seen:  # Skip the URL if it has already been seen
            continue
        filtered.append(l)  # The URL passed all the filters, so it is added to the list of links
    return filtered

This function filters a list of URLs in order to keep only the most relevant web pages. It removes links that do not contain any of the required keywords, as well as links that contain any of the exclusion keywords; both checks are case-insensitive substring matches on the URL. The function can also restrict the results to a specific domain, allowing only internal links to be kept. Additionally, it eliminates URLs that have already been encountered, preventing duplicates during the crawling process.

By applying these filters, the function helps reduce noise and improves the quality and relevance of the collected dataset.
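The domain restriction compares the `netloc` part of each URL with the given domain and requires an exact match. The short sketch below, which uses made-up `example.com` URLs, illustrates what that comparison keeps and drops. Note that subdomains such as `www.` count as a different domain under this rule.

```python
from urllib.parse import urlparse

# Hypothetical URLs, for illustration only
urls = [
    "https://blog.example.com/recipes/soup",
    "https://www.example.com/about",
    "https://example.com/travel",
    "https://other.org/recipes",
]
domain = "example.com"

# netloc is the host part of the URL (plus the port, if any)
netlocs = [urlparse(u).netloc for u in urls]

# Same test as in filter_links: keep a URL only if its netloc equals the domain exactly
same_domain = [u for u in urls if urlparse(u).netloc == domain]

print(netlocs)
print(same_domain)
```

In the crawling step, the domain is taken from the blog's own URL, so links to the same host are kept, while links to a sibling subdomain of the same site would be discarded.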

##### The functions below are used to collect the corpus of web pages

In [None]:
def get_html_corpus(links):
    corpus = []

    for link in links:
        response = fetch_verify_url(link)
        if response:
            corpus.append({'url': link, 'html': response.text})  # Store the page as a dictionary with two keys: its url and its html source
        time.sleep(random.uniform(1,4))

    return corpus

This function collects the raw HTML content of a list of web pages.
It iterates over each URL in the input list, sends an HTTP request using the fetch_verify_url function, and checks whether the request was successful.
If a valid response is received, the function stores the page URL and its HTML source code as a dictionary and appends it to the corpus list.
A random delay between 1 and 4 seconds is added after each request to avoid overloading the server and to reduce the risk of being blocked.
Finally, the function returns the complete corpus containing the HTML content of all successfully fetched pages.

In [None]:
def save_to_csv(data, filename):
    if not data:
        print("Error: there is no data to save")
        return

    fieldnames = list(data[0].keys())  # Column names are taken from the keys of the first dictionary

    with open(filename, 'w', newline='', encoding='utf-8') as f:  # Open (or create) the CSV file in UTF-8
        writer = csv.DictWriter(f, fieldnames=fieldnames, quoting=csv.QUOTE_ALL)  # Writer that maps dictionary keys to CSV columns
        writer.writeheader()  # Write the column headers
        writer.writerows(data)  # Write the rows

    print(f"CSV saved: {filename}")

This function saves a list of dictionaries into a CSV file.
It first checks whether the input data is empty; if no data is provided, the function prints an error message and stops execution.
The column names of the CSV file are automatically extracted from the keys of the first dictionary in the list.
The function then opens (or creates) the CSV file in UTF-8 encoding and initializes a DictWriter to correctly map dictionary keys to CSV columns.
The header row is written first, followed by all rows of data.
Finally, a confirmation message is printed to indicate that the CSV file has been successfully saved.
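Raw HTML typically contains commas, double quotes, and line breaks, which is why careful quoting matters here. The following minimal sketch uses the same `csv.DictWriter` settings as `save_to_csv`, but writes to an in-memory buffer instead of a file. The sample rows are invented for the illustration.

```python
import csv
import io

# Invented sample rows: the first html value contains a comma, double quotes and a line break
rows = [
    {"url": "https://example.com/a", "html": '<p class="x">Hello, world</p>\n<p>Second line</p>'},
    {"url": "https://example.com/b", "html": "<p>Plain</p>"},
]

# In-memory stand-in for the output file (newline="" disables newline translation)
buffer = io.StringIO(newline="")
writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()), quoting=csv.QUOTE_ALL)
writer.writeheader()
writer.writerows(rows)

# Read the CSV back and check that nothing was lost or split across columns
buffer.seek(0)
restored = list(csv.DictReader(buffer))
print(restored[0]["html"] == rows[0]["html"])
```

With `QUOTE_ALL`, every field is wrapped in double quotes and embedded quotes are doubled, so the HTML survives a round trip unchanged.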

In [None]:
def clean_html(html):
    soup = BeautifulSoup(html, 'html.parser')

    for tag in soup(['script', 'style', 'noscript']):  # Remove tags that carry no visible text
        tag.decompose()

    text = soup.get_text(separator=' ', strip=True)  # Collect all visible text
    text = re.sub(r'\s+', ' ', text)  # Collapse runs of whitespace into a single space

    return text


This function cleans raw HTML content and extracts only the meaningful textual information.
First, the HTML string is parsed into a BeautifulSoup object, which allows structured navigation of the document.
All non-textual and irrelevant elements such as script, style, and noscript tags are then removed to avoid including code or hidden content in the final text.
The function extracts all visible text from the cleaned HTML, using spaces as separators and trimming unnecessary leading and trailing whitespace.
Finally, multiple consecutive spaces are reduced to a single space to produce a clean and readable text output.

In [None]:
def clean_csv_file(input_csv, output_csv):
    df = pd.read_csv(input_csv)  # Load the CSV file containing the raw HTML

    if 'html' not in df.columns:  # Verify that the html column exists
        raise ValueError(f"The html column is missing in: {input_csv}")

    df['cleaned_text'] = df['html'].apply(clean_html)  # Clean each document of the html column
    df = df[['url', 'cleaned_text']]  # Keep the url and cleaned_text columns (not the raw html)

    df.to_csv(output_csv, index=False, encoding='utf-8')  # Write the result to a new CSV file

This function cleans an entire CSV file containing raw HTML content.
First, the CSV file is loaded into a pandas DataFrame, allowing efficient column-wise processing.
The function checks whether the column named html exists; if it is missing, an error is raised to prevent silent failures and ensure data consistency.
Each HTML document in the html column is then cleaned using the clean_html function, and the resulting plain text is stored in a new column called cleaned_text.
Only the relevant columns (url and cleaned_text) are retained, discarding the original raw HTML to reduce file size and complexity.
Finally, the cleaned data is saved into a new CSV file, producing a structured and reusable text corpus for downstream processing.

In [None]:
def normalize_html(text):
    text = text.lower()  # convert all letters to lowercase
    text = re.sub(r'\[\d+\]', ' ', text)  # remove reference numbers like [1], [2], etc.
    text = re.sub(r'[^a-z0-9\s]', ' ', text)  # keep only English letters, numbers, and spaces
    text = re.sub(r'\s+', ' ', text)  # replace multiple spaces with a single space
    return text.strip()  # remove leading and trailing spaces

This function performs text normalization on cleaned HTML content in order to prepare it for linguistic analysis.
First, all characters are converted to lowercase to ensure case-insensitive processing and avoid treating the same word as different tokens.
Reference markers such as [1], [2], commonly found in Wikipedia-style pages, are removed using a regular expression.
All characters that are not lowercase English letters, digits, or spaces are then replaced with spaces, effectively removing punctuation and special symbols.
Multiple consecutive spaces are collapsed into a single space to produce a cleaner and more consistent text format.
Finally, leading and trailing spaces are removed before returning the normalized text, ensuring uniform formatting across documents.
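To make the effect of each step concrete, here is a step-by-step trace of the same four transformations on an invented sentence. The intermediate variable names are only there for the illustration.

```python
import re

sample = "The Café [12] opened in 2019: it's GREAT!"  # invented example sentence

lowered = sample.lower()                            # lowercase everything
no_refs = re.sub(r'\[\d+\]', ' ', lowered)          # drop reference markers such as [12]
ascii_only = re.sub(r'[^a-z0-9\s]', ' ', no_refs)   # keep only a-z, digits and whitespace
normalized = re.sub(r'\s+', ' ', ascii_only).strip()  # collapse spaces and trim

print(normalized)  # the caf opened in 2019 it s great
```

Two side effects are visible in the result: accented letters are removed along with punctuation ("café" becomes "caf"), and apostrophes split contractions and possessives into separate pieces ("it's" becomes "it s"). This is acceptable for an English-only corpus, but worth keeping in mind when reading the normalized text.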

In [None]:
def normalize_csv_file(input_csv, output_csv):
    df = pd.read_csv(input_csv)  # Load the CSV file containing the cleaned text

    if 'cleaned_text' not in df.columns:  # Verify that the cleaned_text column exists
        raise ValueError(f"The cleaned_text column is missing in: {input_csv}")

    # Empty cells are read back as NaN by pandas, so they are replaced with '' before normalization
    df['normalized_text'] = df['cleaned_text'].fillna('').apply(normalize_html)  # Normalize the cleaned text
    df = df[['url', 'normalized_text']]  # Keep the url and normalized_text columns

    df.to_csv(output_csv, index=False, encoding='utf-8')  # Write the result to a new CSV file

This function applies text normalization to an entire CSV file containing cleaned textual data.
First, the input CSV file is loaded into a pandas DataFrame. The function then checks whether the column cleaned_text exists to ensure that the expected input data is available. If the column is missing, a descriptive error is raised to prevent silent failures.
The normalization process is applied to each row of the cleaned_text column using the normalize_html function, producing a new column called normalized_text.
Only the URL and the normalized text columns are retained, as the raw and intermediate data are no longer needed at this stage.
Finally, the resulting DataFrame is saved to a new CSV file, which serves as the normalized version of the corpus and can be reused in subsequent processing steps.

In [None]:
stop_words = set(stopwords.words('english')) | {"'s"}  # English stopwords plus the possessive 's (a set gives fast lookups)
stem = nltk.stem.SnowballStemmer("english")

def tokenize_html(text):
    text = text.lower()
    tokens = nltk.word_tokenize(text)
    tokens = [token for token in tokens if token not in string.punctuation]  # remove punctuation
    tokens = [token for token in tokens if token not in stop_words]  # remove stopwords
    tokens = [stem.stem(token) for token in tokens]  # apply stemming
    return tokens

This part of the code prepares and tokenizes normalized text for linguistic analysis.
First, a list of English stopwords is created using NLTK’s predefined stopword list, with the additional removal of the possessive form ’s, which often appears in English texts but carries little semantic value.
A Snowball stemmer configured for English is then initialized to reduce words to their root form.

The tokenize_html function takes a text string as input and converts it to lowercase to ensure consistency. The text is then tokenized into individual words using NLTK’s tokenizer.
Punctuation tokens are removed, followed by the removal of stopwords to keep only meaningful terms.
Finally, stemming is applied to each remaining token, reducing inflected or derived words to a common base form.
The function returns a list of processed tokens, which can be directly used for further tasks such as frequency analysis, topic modeling, or vectorization.

In [None]:
def tokenize_csv_file(input_csv, output_csv):
    df = pd.read_csv(input_csv)  # Load the CSV file containing the normalized text

    if 'normalized_text' not in df.columns:  # Verify that the normalized_text column exists
        raise ValueError(f"The normalized_text column is missing in: {input_csv}")

    # Empty cells are read back as NaN by pandas, so they are replaced with '' before tokenization
    df['tokenized_text'] = df['normalized_text'].fillna('').apply(tokenize_html)  # Tokenize the normalized text
    df = df[['url', 'tokenized_text']]  # Keep the url and tokenized_text columns

    df.to_csv(output_csv, index=False, encoding='utf-8')  # Write the result to a new CSV file

This function applies tokenization to a CSV file containing normalized text.
It begins by loading the input CSV file into a pandas DataFrame. Before processing, the function checks whether the column normalized_text exists; if not, it raises an error to prevent incorrect execution of the pipeline.

The function then applies the tokenize_html function to each row of the normalized_text column. This step transforms each text into a list of cleaned and stemmed tokens.
Only the URL and the resulting tokenized text are kept in the final DataFrame to reduce unnecessary data.

Finally, the processed data is saved into a new CSV file, producing a structured and reusable representation of the tokenized corpus that can be used for further text analysis tasks.
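One practical point when reusing the tokenized CSV: a CSV cell can only hold text, so each token list is expected to be stored as its Python string representation and to come back as a plain string when the file is reloaded. The sketch below shows, under that assumption and with invented tokens, how such a string can be turned back into a list using `ast.literal_eval` from the standard library.

```python
import ast

tokens = ['lifestyl', 'blog', 'recip']  # invented example of stemmed tokens

# What a CSV cell is assumed to contain after saving: the string form of the list
stored = str(tokens)

# Safely parse the string back into a Python list (literal_eval only accepts literals)
restored = ast.literal_eval(stored)

print(type(stored).__name__, type(restored).__name__)
print(restored == tokens)
```

With pandas, the same conversion could be applied to the whole column after loading, for example with `df['tokenized_text'].apply(ast.literal_eval)`.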

##### Following that, we can start the data collection with the scraping of Feedspot.

In [None]:
url_blogs = "https://bloggers.feedspot.com/lifestyle_blogs/"

soup = to_soup(url_blogs)

if not soup:
    # Stop here: the rest of the cell needs the parsed page
    raise RuntimeError(f"Could not fetch main blogs page: {url_blogs}")

blogs = soup.find_all(lambda tag: tag.name in ['a', 'span'] and tag.get('class') and 'wb-ba' in tag.get('class') and any('ext' in c for c in tag.get('class')))

links_blogs = []
for item in blogs:
    href = item.get('href') if item.name == 'a' else item.text.strip()
    if href and "http" in href and "bloggers.feedspot.com" not in href:  # Exclusion of internal links to only extract links directing to blogs
        links_blogs.append(href)

print(len(links_blogs), "blogs have been found")
print("Blogs found:", links_blogs[:1])

Status: 200
100 blogs have been found
Blogs found: ['https://www.mindbodygreen.com/', 'https://www.thepioneerwoman.com/', 'https://goop.com/', 'https://www.artofmanliness.com/', 'https://www.themarthablog.com/', 'https://sincerelyjules.com/', 'https://www.pbfingers.com/', 'https://camillestyles.com/', 'https://cupofjo.com/', 'https://www.theskinnyconfidential.com/', 'https://www.apetogentleman.com/', 'https://www.primermagazine.com/', 'https://livinginyellow.com/', 'https://www.ahealthysliceoflife.com/', 'https://onbetterliving.com/', 'https://thestripe.com/', 'https://helloadamsfamily.com/', 'https://julieblanner.com/', 'https://heleneinbetween.com/', 'https://inspirationsandcelebrations.net/', 'https://www.elizabethrider.com/', 'https://www.katiedidwhat.com/', 'https://www.idyllicpursuit.com/', 'https://witwhimsy.com/', 'https://happilyevaafter.com/', 'https://theblueridgegal.com/', 'https://lmgfl.com/', 'https://simplytaralynn.com/', 'https://socialifestylemag.com/', 'https://www.l

This code retrieves a curated list of lifestyle blogs from the Feedspot website. It first loads and parses the HTML content of the Feedspot page containing the blog rankings. If the page cannot be accessed, the program stops to prevent further errors. The code then identifies the HTML elements corresponding to blog links by targeting specific tags and CSS classes used by Feedspot.

For each identified element, the script extracts the blog URL and filters out internal Feedspot links, keeping only external blog websites. The resulting list contains the URLs of lifestyle blogs, which serve as the starting points (seed URLs) for the subsequent crawling process. Finally, the script reports the number of blogs successfully collected and displays a sample of the extracted URLs.

In [None]:
def get_links_from_blog(url):
    soup = to_soup(url)
    if not soup:
        return None
    
    links = []
    for link in soup.find_all("a"):
        href = link.get("href")
        if not href:
            continue
        # handle relative links (/about → https://blog.com/about)
        full_url = urljoin(url, href)
        if full_url.startswith("http") and full_url not in links:
            links.append(full_url)

    return links


This function extracts all hyperlinks from a given web page. It first retrieves and parses the HTML content of the page using the to_soup function. If the page cannot be accessed, the function returns None. The function then scans all anchor (<a>) elements and extracts their href attributes. Relative URLs are converted into absolute URLs using the base page URL to ensure consistency.

Only valid HTTP links are kept, and duplicate URLs are removed. The function returns a list of unique hyperlinks found on the page, which are later used in the crawling process.
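The conversion of relative links relies on `urljoin` from the standard library. The small sketch below shows how different kinds of `href` values are resolved against a made-up page URL, and which of them pass the `startswith("http")` check.

```python
from urllib.parse import urljoin

base = "https://blog.example.com/posts/index.html"  # made-up page URL

hrefs = [
    "/about",                  # root-relative link
    "recipes/soup",            # link relative to the current folder
    "https://other.org/page",  # already absolute
    "#top",                    # in-page anchor
    "mailto:hi@example.com",   # non-HTTP scheme
]

resolved = [urljoin(base, h) for h in hrefs]
kept = [u for u in resolved if u.startswith("http")]

for h, u in zip(hrefs, resolved):
    print(f"{h!r:28} -> {u}")
```

The `mailto:` link is left untouched by `urljoin` and is then dropped by the HTTP check. The in-page anchor, however, resolves to the page's own URL followed by `#top`, so it is kept as a separate link unless it is filtered out later.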

In [None]:

exclude_keywords = [
    "privacy",       
    "contact",      
    "terms",       
    "login",      
    "signup",      
    "register",    
    "tag",       
    "category",   
    "archive",       
    "feed",     
    "comments",    
    "search",        
    "newsletter",    
    "cart",          
    "checkout",     
    "admin",        
    "wp-",  
    "cgi-bin",
    "privacy-policy",
    "cookie",
    "sitemap",
    "login.php", 
    "register.php", 
    "unsubscribe", 
    "terms-of-service",
    "press",    
    "ad",   
    "ads",  
    "advertisement", 
    "accessibility",  
    "sponsor",
    "disclaimer",
]

A list of exclusion keywords is defined to remove non-relevant web pages from the crawling process. These keywords correspond to administrative, legal, technical, or commercial pages such as privacy policies, contact forms, login pages, archives, tags, advertisements, and sponsored content. By filtering out URLs containing these terms, the crawler focuses on pages that are more likely to contain meaningful blog content, reducing noise in the collected dataset.
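Because the keywords are matched as substrings of the URL, very short entries such as "ad" or "tag" can also exclude legitimate content pages. The sketch below illustrates this on invented URLs and compares it with a stricter variant that matches whole path segments. Both helper functions (`matches_substring`, `matches_token`) are hypothetical and are not part of the pipeline.

```python
import re
from urllib.parse import urlparse

keywords = ["ad", "ads", "tag", "press"]  # a few of the short exclusion keywords

# Invented URLs, for illustration only
urls = [
    "https://blog.example.com/salad-recipes",   # contains "ad" inside "salad"
    "https://blog.example.com/vintage-finds",   # contains "tag" inside "vintage"
    "https://blog.example.com/ads/banner",
    "https://blog.example.com/tag/summer",
    "https://blog.example.com/travel/paris",
]

def matches_substring(url, keywords):
    """Substring matching: True if any keyword appears anywhere in the URL."""
    url = url.lower()
    return any(k in url for k in keywords)

def matches_token(url, keywords):
    """Stricter variant: True only if a keyword equals a whole word of the URL path."""
    tokens = set(re.split(r'[^a-z0-9]+', urlparse(url).path.lower()))
    return bool(tokens & set(keywords))

substring_hits = [matches_substring(u, keywords) for u in urls]
token_hits = [matches_token(u, keywords) for u in urls]

print(substring_hits)  # [True, True, True, True, False]
print(token_hits)      # [False, False, True, True, False]
```

The substring approach is simpler and also catches prefixes such as "wp-", at the cost of discarding some content pages. Whole-word matching keeps those pages but would need separate handling for prefix-style keywords.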

In [None]:
links_in_blogs_1 = {}

total_new_urls_1 = 0

for url in links_blogs[:2]:  # Scrape the first x blogs (2 here) and print, for each one, the number of links found on its homepage and a sample of them
    print("\nLinks from the blog (first round):", url)

    links = get_links_from_blog(url)
    if links is None:
        print("→ 0 links found (scraping failed or blocked)")
        links_in_blogs_1[url] = []
        continue

    domain = urlparse(url).netloc
    filtered_links = filter_links(links, exclude_keywords=exclude_keywords, domain=domain, already_seen=set())

    links_in_blogs_1[url] = filtered_links

    total_new_urls_1 += len(filtered_links)

    print("→", len(filtered_links), "links found:", filtered_links[:1])

    time.sleep(1)

print("Total new URLs found in this first round iteration:", total_new_urls_1)

In the first crawling round, the script visits the homepages of a subset of selected blogs and extracts all hyperlinks present on each page. The collected links are then filtered using predefined exclusion keywords to remove non-content pages. Additionally, the filtering process is restricted to the same domain as the source blog in order to retain only internal links. The resulting URLs are stored in a dictionary and counted to measure the number of relevant pages identified during this initial crawling stage.