# Generate documents with overlapping content

### Get Wikipedia page entries about different topics

We use Beautiful Soup and requests to fetch Wikipedia content on the following topics: Renaissance, COVID, Postmodernism, and Linkin Park. The idea is to build a list that holds the text of every paragraph on each of these topics. Each entry also gets an ID made of the topic name and the number of the paragraph that was parsed.

In [2]:
import requests
from bs4 import BeautifulSoup

topics = ["Renaissance", "COVID", "Postmodernism", "Linkin Park"]

data = []
for topic in topics:
    url = f"https://en.wikipedia.org/wiki/{topic}"
    response = requests.get(url, timeout=30)
    response.raise_for_status()  # stop early on an HTTP error
    soup = BeautifulSoup(response.text, 'html.parser')

    # Record every <p> element with its topic and 1-based paragraph number
    for cnt, para in enumerate(soup.find_all('p'), start=1):
        data.append({'id': f"{topic}_{cnt}", 'topic': topic,
            'para_no': cnt, 'text': para.text})

# Index paragraphs by ID (one entry per paragraph), keeping only text longer than 50 characters
topic_dict = {}
for item in data:
    if item['id'] not in topic_dict:
        topic_dict[item['id']] = []
    if len(item['text']) > 50:
        topic_dict[item['id']].append(item['text'])
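
To see what this cell produces without touching the network, the same ID scheme and length filter can be run on hand-made paragraphs. This is a minimal sketch: `build_records`, `index_by_id`, `min_chars`, and `sample_pages` are hypothetical names introduced for illustration, and the sample text is a stand-in, not real Wikipedia content.

```python
def build_records(pages):
    """Turn {topic: [paragraph, ...]} into records shaped like `data`."""
    records = []
    for topic, paragraphs in pages.items():
        for para_no, text in enumerate(paragraphs, start=1):
            records.append({'id': f"{topic}_{para_no}", 'topic': topic,
                            'para_no': para_no, 'text': text})
    return records


def index_by_id(records, min_chars=50):
    """Map each paragraph ID to a list holding its text, if long enough."""
    index = {}
    for item in records:
        index.setdefault(item['id'], [])
        if len(item['text']) > min_chars:
            index[item['id']].append(item['text'])
    return index


# Hand-made stand-in for scraped paragraphs (assumption, not real page text)
sample_pages = {
    "Renaissance": ["\n", "A" * 60],
    "Linkin Park": ["B" * 80],
}
sample_index = index_by_id(build_records(sample_pages))
print(list(sample_index))
```

Note that a paragraph shorter than the threshold still gets a key, but with an empty list. That is why the mixing step later checks for empty entries.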

### Random mixing of the topics in different documents

Now that the dictionary maps each paragraph ID to its text, we randomly mix the content across documents: each of the 50 documents draws 10 paragraph IDs at random, so the same paragraph can appear in several documents. Once the documents are ready, we save each one to its own text file.

In [8]:
import random
documents = []
for _ in range(50):
    document = []
    # Keys are paragraph IDs (topic name + paragraph number), not topic names
    para_ids = list(topic_dict.keys())
    # Draw 10 distinct paragraph IDs for this document
    selected_ids = random.sample(para_ids, 10)

    for para_id in selected_ids:
        if topic_dict[para_id]:  # Skip IDs whose paragraph was too short to keep
            # Each entry is a one-element list holding the paragraph text
            document.append(topic_dict[para_id])
    documents.append(document)
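
The sampling above gives a different result on every run, and `random.sample` raises `ValueError` when fewer than 10 paragraph IDs are available. A seeded variant might look like the sketch below. It is not the notebook's method: `mix_documents` and its parameters are hypothetical, and it filters out empty entries before sampling, so every document gets the same number of paragraphs, whereas the cell above samples first and skips empty entries afterwards.

```python
import random


def mix_documents(index, n_docs=50, per_doc=10, seed=0):
    """Build n_docs documents, each from per_doc randomly chosen paragraph IDs."""
    rng = random.Random(seed)  # private generator, so a given seed repeats
    # Only IDs that kept a paragraph after the length filter are candidates
    ids = [key for key, paras in index.items() if paras]
    k = min(per_doc, len(ids))  # avoid ValueError when few IDs exist
    return [[index[key] for key in rng.sample(ids, k)] for _ in range(n_docs)]


# Toy index (assumption): 12 usable paragraphs and one that was filtered out
toy_index = {f"Topic_{i}": [f"paragraph {i}"] for i in range(1, 13)}
toy_index["Topic_13"] = []

docs_a = mix_documents(toy_index, n_docs=3, seed=42)
docs_b = mix_documents(toy_index, n_docs=3, seed=42)
print(docs_a == docs_b)
```

Using a dedicated `random.Random` instance keeps the seed from affecting other code that relies on the module-level generator.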


In [9]:

documents[23]

[['International research on vaccines and medicines in COVID‑19 is underway by government organisations, academic groups, and industry researchers.[476][477] The CDC has classified it to require a BSL3 grade laboratory.[478] There has been a great deal of COVID‑19 research, involving accelerated research processes and publishing shortcuts to meet the global demand.[479]\n'],
 ['Postmodernism is an intellectual stance or mode of discourse[1][2] characterized by skepticism towards elements of the Enlightenment worldview. It questions the "grand narratives" of modernity, rejects the certainty of knowledge and stable meaning, and acknowledges the influence of ideology in maintaining political power.[3][4] The idea of objective claims is dismissed as naïve realism,[5] emphasizing the conditional nature of knowledge.[4] Postmodernism embraces self-referentiality, epistemological relativism, moral relativism, pluralism, irony, irreverence, and eclecticism.[4] It opposes the "universal validit

In [10]:
import os
import re
import textwrap

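def save_documents(documents):
    # Create the output directory if it doesn't exist
    os.makedirs('data_docs', exist_ok=True)

    # Iterate over the documents
    for i, document in enumerate(documents, start=1):
        # Each document is a list of one-element lists. Join the paragraphs
        # into one string instead of calling str() on the list, which leaves
        # quotes, commas and literal "\n" sequences in the output
        text = ' '.join(para for paras in document for para in paras)
        # Replace runs of non-ASCII characters with a space
        text = re.sub(r'[^\x00-\x7F]+', ' ', text)
        # Drop citation markers such as [476]
        text = re.sub(r'\[\d+\]', '', text)
        # Drop newlines and any remaining square brackets
        text = text.replace('\n', '').replace('[', '').replace(']', '')
        text = textwrap.fill(text, width=70)
        # Save the document in a text file
        with open(f'data_docs/document_{i}.txt', 'w') as f:
            f.write(text)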

In [11]:
save_documents(documents)
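
To check what the cleaning steps in `save_documents` do to a paragraph, they can be applied to a single string. The sketch below assumes a hypothetical helper `clean_text` and a hand-made sample sentence; it mirrors the substitutions above rather than reading the saved files.

```python
import re
import textwrap


def clean_text(text, width=70):
    """Apply the same cleaning steps as save_documents to one string."""
    text = re.sub(r'[^\x00-\x7F]+', ' ', text)   # non-ASCII runs -> space
    text = re.sub(r'\[\d+\]', '', text)          # citation markers like [476]
    text = text.replace('\n', '').replace('[', '').replace(']', '')
    return textwrap.fill(text, width=width)


# Hand-made sample (assumption); \u2011 is the non-breaking hyphen that
# Wikipedia uses in "COVID-19"
sample = 'COVID\u201119 research is underway.[476][477] Labs need BSL3.[478]\n'
cleaned = clean_text(sample)
print(cleaned)  # COVID 19 research is underway. Labs need BSL3.
```

One side effect is visible here: because the non-breaking hyphen is not ASCII, "COVID‑19" is written to disk as "COVID 19".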