# 📚 Data Collection: Training Data Wiki Pages
## Source: https://www.kaggle.com/code/judehunt23/data-collection-training-data-wiki-pages
In this notebook, we'll create a dataset of the text from all of the Wikipedia pages used to create the training dataset for the LLM Science Exam competition. We'll then take that a step further and download the text of every page in the `Concepts in physics` category on Wikipedia (check out [this post](https://www.kaggle.com/competitions/kaggle-llm-science-exam/discussion/425246) and [this notebook](https://www.kaggle.com/code/judehunt23/llm-science-exam-wikipedia-graph-analysis/notebook) to see why we're focusing on this category).

Once we have this data, we'll be able to use it to generate additional questions for training our LLM, while ensuring that the articles behind those questions are still relevant to the competition. @leonidkulyk has an [excellent notebook](https://www.kaggle.com/code/leonidkulyk/eda-data-gathering-llm-se-wiki-stem-1k-ds) showing how we can do this.

Don't forget to upvote this notebook if you find it helpful or interesting!

#### Other resources you might find useful:
- [📚 Wikipedia pages used to generate the training data!](https://www.kaggle.com/competitions/kaggle-llm-science-exam/discussion/425242#2349217)
- [📚 More efficient data collection: How to choose which categories to focus on?](https://www.kaggle.com/competitions/kaggle-llm-science-exam/discussion/425246)
- [📚 LLM Science Exam: Wikipedia Graph Analysis](https://www.kaggle.com/code/judehunt23/llm-science-exam-wikipedia-graph-analysis/notebook)

In [1]:
!pip install Wikipedia-API -q

In [2]:
from collections import Counter, defaultdict

import pandas as pd
import requests
import wikipediaapi
from bs4 import BeautifulSoup
from tqdm.auto import tqdm

In [3]:
wiki_wiki = wikipediaapi.Wikipedia('MyProjectName (merlin@example.com)', 'en', timeout=1000)

In [4]:
train_pages = [
    'Supersymmetric quantum mechanics',
    'Relative density',
    'Memristor',
    'Quantization (physics)',
    'Symmetry in biology',
    'Mass versus weight',
    'Navier–Stokes equations',
    'Thermal equilibrium',
    'Electrical resistivity and conductivity',
    'Superconductivity',
    'Black hole',
    'Born reciprocity',
    "Commentary on Anatomy in Avicenna's Canon",
    'Supernova',
    'Angular momentum',
    'Condensation cloud',
    'Minkowski space',
    'Vacuum',
    'Standard Model',
    'Nebula',
    'Antiferromagnetism',
    'Light-year',
    'Propagation constant',
    'Phase transition',
    'Redshift',
    'The Ambidextrous Universe',
    'Interstellar medium',
    'Improper rotation',
    'Plant',
    'Clockwise',
    'Morphology (biology)',
    'Magnetic susceptibility',
    'Nuclear fusion',
    'Theorem of three moments',
    'Lorentz covariance',
    'Causality (physics)',
    'Total internal reflection',
    'Surgical pathology',
    'Environmental Science Center',
    'Electrochemical gradient',
    'Planetary system',
    'Cavitation',
    'Parity (physics)',
    'Dimension',
    'Heat treating',
    'Speed of light',
    'Mass-to-charge ratio',
    'Landau–Lifshitz–Gilbert equation',
    'Point groups in three dimensions',
    'Mammary gland',
    'Convection (heat transfer)',
    'Modified Newtonian dynamics',
    "Earnshaw's theorem",
    'Coherent turbulent structure',
    'Phageome',
    'Infectious tolerance',
    'Ferromagnetism',
    'Coffee ring effect',
    'Magnetic resonance imaging',
    'Ring-imaging Cherenkov detector',
    'Tidal force',
    'Kutta-Joukowski theorem',
    'Radiosity (radiometry)',
    'Quartz crystal microbalance',
    'Crystallinity',
    'Magnitude (astronomy)',
    "Newton's law of universal gravitation",
    'Uniform tilings in hyperbolic plane',
    'Refractive index',
    'Theorem',
    'Leidenfrost effect',
    'API gravity',
    'Supersymmetry',
    'Dark Matter',
    'Molecular symmetry',
    'Spin (physics)',
    'Astrochemistry',
    'List of equations in classical mechanics',
    'Diffraction',
    'C1 chemistry',
    'Reciprocal length',
    'Amplitude',
    'Work function',
    'Coherence (physics)',
    'Ultraviolet catastrophe',
    'Symmetry of diatomic molecules',
    'Bollard pull',
    'Linear time-invariant system',
    'Triskelion',
    'Cold dark matter',
    'Frame-dragging',
    "Fermat's principle",
    'Enthalpy',
    'Main sequence',
    'QCD matter',
    'Molecular cloud',
    'Free neutron decay',
    'Second law of thermodynamics',
    'Droste effect',
    'History of geology',
    'Gravitational wave',
    'Regular polytope',
    'Spatial dispersion',
    'Probability amplitude',
    'Stochastic differential equation',
    'Gravity Probe B',
    'Electronic entropy',
    'Renormalization',
    'Unified field theory',
    "Elitzur's theorem",
    "Hesse's principle of transfer",
    'Ecological pyramid',
    'Virtual particle',
    'Ramsauer–Townsend effect',
    'Butterfly effect',
    'Zero-point energy',
    'Baryogenesis',
    'Pulsar',
    'Decay technique',
    'Electric flux',
    'Water hammer',
    'Dynamic scaling',
    'Luminance',
    'Crossover experiment (chemistry)',
    'Spontaneous symmetry breaking',
    'Self-organization in cybernetics',
    'Stellar classification',
    'Probability density function',
    'Pulsar-based navigation',
    'Supermassive black hole',
    'Explicit symmetry breaking',
    'Surface power density',
    'Organography',
    'Copernican principle',
    'Geometric quantization',
    'Erlangen program',
    'Magnetic monopole',
    'Inflation (cosmology)',
    'Heart',
    'Observable universe',
    'Wigner quasiprobability distribution',
    'Shower-curtain effect',
    'Scale (ratio)',
    'Hydrodynamic stability',
    'Paramagnetism',
    'Emissivity',
    'Critical Raw Materials Act',
    'James Webb Space Telescope',
    'Signal-to-noise ratio',
    'Photophoresis',
    'Time standard',
    'Time',
    'Galaxy',
    'Rayleigh scattering'
]

print(f"{len(train_pages)} unique pages from train dataset")

154 unique pages from train dataset


In [5]:
def get_wiki_sections_text(page):
    ignore_sections = ["References", "See also", "External links", "Further reading", "Sources"]
    wiki_page = wiki_wiki.page(page)
    
    # Keep the top-level sections that have text and are not in the ignore list
    kept_sections = [x for x in wiki_page.sections if x.title not in ignore_sections and x.text != ""]
    page_sections = [x.text for x in kept_sections]
    section_titles = [x.title for x in kept_sections]
    
    # Add the page summary (the lead text) as its own entry
    page_sections.append(wiki_page.summary)
    section_titles.append("Summary")

    return page_sections, section_titles
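
One thing to watch: `wiki_page.sections` appears to return only the top-level sections, so text that lives in nested subsections may be skipped by the function above. Below is a minimal sketch of a recursive walk that would pick those up. It uses a hypothetical `Section` stand-in (with `title`, `text` and `sections` attributes, assumed here to mirror the objects Wikipedia-API returns) so that it runs without network access; it is not part of the original notebook.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Section:
    """Hypothetical stand-in for a wiki section: title, own text, child sections."""
    title: str
    text: str = ""
    sections: List["Section"] = field(default_factory=list)


IGNORE_SECTIONS = {"References", "See also", "External links", "Further reading", "Sources"}


def flatten_sections(sections: List[Section], prefix: str = "") -> List[Tuple[str, str]]:
    """Depth-first walk returning (title path, text) for every non-empty section."""
    rows = []
    for section in sections:
        if section.title in IGNORE_SECTIONS:
            continue  # skip an ignored section together with its subtree
        path = f"{prefix} > {section.title}" if prefix else section.title
        if section.text:
            rows.append((path, section.text))
        # Recurse so that nested subsections are collected too
        rows.extend(flatten_sections(section.sections, path))
    return rows


# Made-up demo tree, for illustration only
demo = [
    Section("History", "Early work.", [Section("Modern era", "Recent work.")]),
    Section("Applications", "", [Section("Imaging", "MRI uses this.")]),
    Section("References", "[1] ..."),
]
flat = flatten_sections(demo)
for path, text in flat:
    print(path, "|", text)
```

Whether the extra subsection text is worth collecting depends on how the downstream question generation uses section boundaries.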

In [6]:
def get_pages_df(pages):
    page_section_texts = []
    for page in tqdm(pages):
        sections, titles = get_wiki_sections_text(page)
        for section, title in zip(sections, titles):
            page_section_texts.append({
                'page': page,
                'section_title': title,
                'text': section
            })
    print(len(page_section_texts))
    return pd.DataFrame(page_section_texts)

In [7]:
train_pages_df = get_pages_df(train_pages)
#train_pages_df.to_csv("train_pages.csv", index=False)
print(train_pages_df.shape)
train_pages_df.head()

  0%|          | 0/154 [00:00<?, ?it/s]

891
(891, 3)


| | page | section_title | text |
|---|---|---|---|
| 0 | Supersymmetric quantum mechanics | Introduction | Understanding the consequences of supersymmetr... |
| 1 | Supersymmetric quantum mechanics | Example: the harmonic oscillator | The Schrödinger equation for the harmonic osci... |
| 2 | Supersymmetric quantum mechanics | SUSY QM superalgebra | In fundamental quantum mechanics, we learn tha... |
| 3 | Supersymmetric quantum mechanics | Example | Let's look at the example of a one-dimensional... |
| 4 | Supersymmetric quantum mechanics | Shape invariance | Suppose \n \n \n \n W\n \... |


In [9]:
import re
import nltk
from nltk.corpus import stopwords

# Download the stopwords from nltk if not already done
nltk.download('stopwords')
# Load English stopwords
stop_words = set(stopwords.words('english'))


# Text cleaning functions
# Remove special characters like '\n', excessive punctuation, etc.
def remove_special_chars(text):
    return re.sub(r'[^a-zA-Z0-9.,!?\'" ]+', ' ', text)

# Normalize whitespace (remove extra spaces, newlines, etc.)
def normalize_whitespace(text):
    return re.sub(r'\s+', ' ', text).strip()

# Convert to lowercase
def to_lowercase(text):
    return text.lower()

# Remove stopwords
def remove_stopwords(text):
    words = text.split()  # Split text into words
    return ' '.join([word for word in words if word not in stop_words])

# Full cleaning function
def clean_text(text):
    text = remove_special_chars(text)
    text = normalize_whitespace(text)
    text = to_lowercase(text)
    text = remove_stopwords(text)  # Remove stopwords
    return text


[nltk_data] Downloading package stopwords to /home/vino/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
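
To see what `clean_text` does to a typical sentence, here is a small self-contained sketch. It repeats the same regular expressions as the cell above, but swaps NLTK's stopword list for a tiny hand-picked `DEMO_STOP_WORDS` set (an assumption, made only so the example runs without a download). The sample sentence is made up.

```python
import re

# Assumption: a tiny stand-in for NLTK's English stopword list
DEMO_STOP_WORDS = {"the", "of", "is", "a", "in", "and"}


def clean_text_demo(text: str) -> str:
    """Apply the same four cleaning steps as clean_text, with the demo stopwords."""
    text = re.sub(r'[^a-zA-Z0-9.,!?\'" ]+', ' ', text)  # drop newlines, dashes, brackets, symbols
    text = re.sub(r'\s+', ' ', text).strip()            # collapse runs of whitespace
    text = text.lower()
    return ' '.join(w for w in text.split() if w not in DEMO_STOP_WORDS)


sample = "The Navier–Stokes equations\ndescribe the motion of viscous fluids (1822)."
cleaned = clean_text_demo(sample)
print(cleaned)
```

Two side effects are worth knowing about: any non-ASCII character (such as the en dash in "Navier–Stokes" or accented letters) is replaced by a space, and punctuation stays attached to words, so a token like `the,` would not match the stopword list.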


In [11]:
# Apply cleaning to the 'text' column of your DataFrame
train_pages_df['cleaned_text'] = train_pages_df['text'].apply(clean_text)

In [12]:
# Scaling up to most significant category: Concepts in physics

In [13]:
def split_category_members(members):
    category_list, page_list = [], []

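    for member_name, member_page in members:
        # Match the namespace prefix including the colon, so that ordinary
        # articles whose titles merely begin with "Category" or "File" are kept as pages
        if member_name.startswith('Category:'):
            category_list.append((member_name, member_page))
        elif not member_name.startswith("File:"):
            page_list.append((member_name, member_page))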
    
    return category_list, page_list

def get_all_pages_category_deep(
    init_category: str,
    depth: int = 2,
):
    category_page = wiki_wiki.page(init_category)
    chosen_list = list(category_page.categorymembers.items())
    category_list, page_list = split_category_members(chosen_list)
    
    category_pages = {
        init_category: page_list
    }
    
    # Iterate through each set of sub-categories
    for i in range(depth):
        # Get all the pages from the next category
        new_category_list = []
        for category, _ in category_list:
            category_page = wiki_wiki.page(category)
            sub_category_list, page_list = split_category_members(list(category_page.categorymembers.items()))
            category_pages[category] = page_list
            new_category_list.extend(sub_category_list)
        category_list = new_category_list
            
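    all_pages = [j for x in category_pages.values() for j in x]
    # Note: entries are (title, page object) tuples, so set() only removes identical
    # tuples. The same title reached through two categories can survive here;
    # titles are deduplicated later, once only the page names are kept.
    unique_pages = list(set(all_pages))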
        
    return unique_pages

In [14]:
concepts_in_physics_pages = get_all_pages_category_deep(
    init_category = 'Category:Concepts in physics',
    depth=1
)

# Number of train pages captured by the category crawl
category_page_names = {page_name for page_name, _ in concepts_in_physics_pages}
captured = len([x for x in train_pages if x in category_page_names])
missing = len([x for x in train_pages if x not in category_page_names])
print(f"{len(concepts_in_physics_pages)} total pages. {captured} captured, {missing} missing from the train set")

917 total pages. 101 captured, 53 missing from the train set
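
The traversal above is a level-by-level (breadth-first) walk over the category tree. The sketch below reproduces the idea on a toy in-memory graph and deduplicates by page title, which matters because the same page can be listed under several subcategories. `TOY_MEMBERS` and `collect_page_titles` are made up for illustration and do not call Wikipedia.

```python
from typing import Dict, List

# Hypothetical category graph: each category maps to the names of its members
TOY_MEMBERS: Dict[str, List[str]] = {
    "Category:Concepts in physics": ["Time", "Vacuum", "Category:Symmetry", "File:Logo.png"],
    "Category:Symmetry": ["Parity (physics)", "Time", "Category:Chirality"],
    "Category:Chirality": ["Clockwise"],
}


def collect_page_titles(root: str, depth: int) -> List[str]:
    """Collect page titles from `root` plus `depth` levels of subcategories."""
    titles = set()
    seen_categories = {root}
    frontier = [root]
    for _ in range(depth + 1):  # level 0 is the root category itself
        next_frontier = []
        for category in frontier:
            for member in TOY_MEMBERS.get(category, []):
                if member.startswith("Category:"):
                    if member not in seen_categories:  # avoid revisiting a category
                        seen_categories.add(member)
                        next_frontier.append(member)
                elif not member.startswith("File:"):
                    titles.add(member)  # the set deduplicates by title
        frontier = next_frontier
    return sorted(titles)


pages_depth_1 = collect_page_titles("Category:Concepts in physics", depth=1)
pages_depth_2 = collect_page_titles("Category:Concepts in physics", depth=2)
print(pages_depth_1)
print(pages_depth_2)
```

In the toy graph, "Time" appears under two categories but is counted once, and increasing `depth` from 1 to 2 adds the page that sits one level further down.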


In [15]:
# Just get the page names
concepts_in_physics_pages = [x[0] for x in concepts_in_physics_pages]

# Add in the training pages and create another dataset!
concepts_in_physics_pages.extend(train_pages)
concepts_in_physics_pages = list(set(concepts_in_physics_pages))
physics_pages_df = get_pages_df(concepts_in_physics_pages)
print(physics_pages_df.shape)
physics_pages_df.head()

  0%|          | 0/937 [00:00<?, ?it/s]

4313
(4313, 3)


| | page | section_title | text |
|---|---|---|---|
| 0 | One-dimensional symmetry group | Point group | For a pattern without translational symmetry t... |
| 1 | One-dimensional symmetry group | Discrete symmetry groups | These affine symmetries can be considered limi... |
| 2 | One-dimensional symmetry group | 1D-symmetry of a function vs. 2D-symmetry of i... | Symmetries of a function (in the sense of this... |
| 3 | One-dimensional symmetry group | Group action | Group actions of the symmetry group that can b... |
| 4 | One-dimensional symmetry group | Orbits and stabilizers | Consider a group G acting on a set X. The orbi... |


In [16]:
# Apply cleaning to the 'text' column of your DataFrame
physics_pages_df['cleaned_text'] = physics_pages_df['text'].apply(clean_text)

In [17]:
train_pages_df.shape

(891, 4)

In [18]:
physics_pages_df.shape

(4313, 4)

In [19]:
# Function to chunk text
def chunk_text(text, max_words=512):
    words = text.split()  # Split the text into words
    chunks = [' '.join(words[i:i + max_words]) for i in range(0, len(words), max_words)]  # Create chunks
    return chunks

# Expanding the DataFrame by splitting long texts
def chunk_cleaned_text(df):    
    new_rows = []
    for index, row in df.iterrows():
        chunks = chunk_text(row['cleaned_text'], max_words=450)
        for i, chunk in enumerate(chunks):
            new_row = row.copy()
            new_row['cleaned_text'] = chunk
            new_row['chunk_id'] = i + 1  # Optional: add chunk id if you want to track chunks
            new_rows.append(new_row)
    
    # Create new DataFrame with the chunked texts
    chunked_df = pd.DataFrame(new_rows)
    return chunked_df
    
train_pages_df = chunk_cleaned_text(train_pages_df)
physics_pages_df = chunk_cleaned_text(physics_pages_df)
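
A quick way to sanity-check the chunking is to run the same slicing logic on a synthetic string and look at the chunk sizes. A minimal sketch follows; `chunk_words` is just a renamed copy of the slicing in `chunk_text`, and the 1,000-word input is made up.

```python
from typing import List


def chunk_words(text: str, max_words: int = 450) -> List[str]:
    """Split text into consecutive chunks of at most `max_words` words."""
    words = text.split()
    return [' '.join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


synthetic = ' '.join(f"w{i}" for i in range(1000))  # 1,000 dummy words
chunks = chunk_words(synthetic)
sizes = [len(chunk.split()) for chunk in chunks]
print(sizes)

empty_chunks = chunk_words("")  # an empty text produces no chunks at all
print(empty_chunks)
```

Because an empty string yields no chunks, rows whose `cleaned_text` is empty drop out of the chunked DataFrame. Also keep in mind that the limit counts whitespace-separated words, which is only a rough proxy for the number of tokens a model's tokenizer will produce.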

In [20]:
train_pages_df.to_csv('train_pages_df.csv', index=False)

In [21]:
physics_pages_df.to_csv('physics_pages_df.csv', index=False)