## Web scraping

In [6]:
# !pip install selenium
# !pip install webdriver_manager

## The following code runs a Chrome browser controlled from Python

### We will

### 1. Open the Wikipedia page for a person
### 2. Gather all the links on the page
### 3. Maintain a set of visited links and visit only pages that are not yet in it
### 4. Extract the text from each page
### 5. Filter and clean the text

## Download chromedriver [here](https://googlechromelabs.github.io/chrome-for-testing)

Keep the chromedriver binary in the same directory as the Jupyter notebook, or set its full path in the next cell.
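
If the binary sits next to the notebook, its path can be resolved at run time instead of being hard-coded. The sketch below is a minimal illustration; `find_chromedriver` and its parameters are hypothetical names introduced here, and the default file name `chromedriver` is an assumption (on Windows it would typically be `chromedriver.exe`).

```python
from pathlib import Path

def find_chromedriver(directory=None, name="chromedriver"):
    """Return the path of the chromedriver binary inside `directory` as a string.

    Hypothetical helper: `directory` defaults to the current working directory,
    which is usually the notebook's directory when running in Jupyter.
    """
    base = Path(directory) if directory is not None else Path.cwd()
    candidate = base / name
    if not candidate.is_file():
        raise FileNotFoundError(f"No chromedriver found at {candidate}")
    return str(candidate)
```

The returned string could then be passed to `Service(...)` in place of an absolute path.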


In [None]:
import time
import itertools
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By

# Set the paths to chromedriver, the Chrome binary, and (optionally) a user profile. Replace these with your own paths.
chrome_driver_path = "/Users/hardik/hardik-projects/secretllm/05-data-collection/chromedriver-mac-arm64/chromedriver"
chromium_path = "/Applications/Google Chrome.app/Contents/MacOS/Google Chrome"
user_profile_path = "/Users/hardik/Library/Application Support/Google/Chrome/Default"

chrome_options = Options()
chrome_options.binary_location = chromium_path
chrome_options.add_argument(f"user-data-dir={user_profile_path}")

# Set up the Chrome driver, reusing the path defined above
service = Service(chrome_driver_path)
driver = webdriver.Chrome(service=service, options=chrome_options)

In [71]:
# Open a website
driver.get("https://wikipedia.org/wiki/Dennis_Ritchie")


In [72]:
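# Note: despite the variable names, this selects the main article content
# container (id 'mw-content-text'), not an <h2> heading
h2_element = driver.find_element(By.ID, 'mw-content-text')
h2_text = h2_element.text
print(h2_text)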

Dennis Ritchie
Dennis Ritchie at the Japan Prize Foundation in May 2011
Born Dennis MacAlistair Ritchie
September 9, 1941
Bronxville, New York, U.S.
Died c. October 12, 2011 (aged 70)
Berkeley Heights, New Jersey, U.S.
Alma mater Harvard University (BS)
Known for ALTRAN
B
BCPL
C
Multics
Unix
Awards IEEE Emanuel R. Piore Award (1982)[1]
Turing Award (1983)
National Medal of Technology (1998)
IEEE Richard W. Hamming Medal (1990)
Computer Pioneer Award (1994)
Computer History Museum Fellow (1997)[2]
Harold Pender Award (2003)
Japan Prize (2011)
Scientific career
Fields Computer science
Institutions Lucent Technologies
Bell Labs
Doctoral advisor Patrick C. Fischer
Website bell-labs.com/usr/dmr/www
Dennis MacAlistair Ritchie (September 9, 1941 – c. October 12, 2011) was an American computer scientist.[3] He created the C programming language and, with long-time colleague Ken Thompson, the Unix operating system and B language.[3] Ritchie and Thompson were awarded the Turing Award from the As

## We find all URLs on the page

In [73]:
anchor_elements = driver.find_elements(By.TAG_NAME, "a")
# Extract the href attribute from each anchor element (None for anchors without an href)
urls = [anchor.get_attribute("href") for anchor in anchor_elements]
url_set = set(urls)

In [74]:
from urllib.parse import urlparse, urlunparse

def remove_url_fragments(url_set):
  """Return the URLs with any '#fragment' removed, skipping missing or empty entries."""
  cleaned_urls = set()
  for url in url_set:
    # Anchors without an href yield None; skip them explicitly
    if not url:
      continue
    parsed_url = urlparse(url)
    # Adding to a set already takes care of duplicates
    cleaned_urls.add(urlunparse(parsed_url._replace(fragment='')))
  return cleaned_urls

cleaned_url_set = remove_url_fragments(url_set)
print(cleaned_url_set)

{'https://en.wikipedia.org/wiki/C_mathematical_functions', 'https://en.wikipedia.org/wiki/Switching_circuit_theory', 'https://developer.wikimedia.org/', 'https://en.wikipedia.org/wiki/Workstation', 'https://en.wikipedia.org/wiki/CLion', 'http://www.ieee.org/documents/piore_rl.pdf', 'https://en.wikipedia.org/wiki/Michael_Schroeder', 'https://sk.wikipedia.org/wiki/Dennis_Ritchie', 'https://ko.wikipedia.org/wiki/%EB%8D%B0%EB%8B%88%EC%8A%A4_%EB%A6%AC%EC%B9%98', 'https://en.wikipedia.org/wiki/Berkeley_Software_Distribution', 'https://en.wikipedia.org/wiki/Peter_Elias', 'https://en.wikipedia.org/wiki/Massachusetts_Institute_of_Technology', 'https://en.wikipedia.org/wiki/S2CID_(identifier)', 'https://en.wikipedia.org/wiki/Tom_Glinos', 'http://www.extremetech.com/computing/102835-dennis-ritchie-creator-of-c-bids-goodbye-world', 'https://en.wikipedia.org/wiki/Multics', 'https://en.wikipedia.org/wiki/Shlomo_Shamai', 'https://en.wikipedia.org/wiki/Jean_Fr%C3%A9chet', 'https://en.wikipedia.org/wik
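
The set printed above still mixes English Wikipedia articles with external sites, PDFs, and other language editions. If only English articles are wanted, one option is to filter by host and path before crawling. This is a minimal sketch: `is_english_wikipedia_article` is a hypothetical helper, and rejecting every title that contains a colon is a simplifying assumption that removes namespaced pages such as `Special:` or `Help:` but also the few real articles with a colon in the title.

```python
from urllib.parse import urlparse

def is_english_wikipedia_article(url):
    """Return True for URLs of the form https://en.wikipedia.org/wiki/<Title>."""
    if not url:
        return False
    parsed = urlparse(url)
    if parsed.netloc != "en.wikipedia.org":
        return False
    if not parsed.path.startswith("/wiki/"):
        return False
    title = parsed.path[len("/wiki/"):]
    # Assumption: titles with a colon are namespaced pages, not articles
    return bool(title) and ":" not in title

# A few URLs taken from the output above
sample_urls = {
    "https://en.wikipedia.org/wiki/Multics",
    "https://ko.wikipedia.org/wiki/%EB%8D%B0%EB%8B%88%EC%8A%A4_%EB%A6%AC%EC%B9%98",
    "http://www.ieee.org/documents/piore_rl.pdf",
    "https://developer.wikimedia.org/",
}
article_urls = {u for u in sample_urls if is_english_wikipedia_article(u)}
```

In the notebook, the same comprehension could be applied to `cleaned_url_set`.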

### We visit only pages that we have not visited before and collect the text of each page

In [75]:
visited_urls = set()
page_text_dict = dict()
# Loop through the URLs
for url in itertools.islice(cleaned_url_set, 10): # Limit the number of URLs visited for demonstration purposes
    if url is not None and url not in visited_urls:
        # Visit the URL
        driver.get(url)
        # Extract the page text
        page_text = driver.find_element(By.TAG_NAME, "body").text
        # Store the page text in the dictionary
        page_text_dict[url] = page_text
        visited_urls.add(url)
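
The loop above stops at the first page that fails to load, and because sets are unordered, which 10 URLs get visited can differ between runs. One possible way to make the visited-set logic easier to test is to separate it from the browser. In this sketch, `crawl` and `fetch_text` are hypothetical names: `fetch_text` stands for any callable that takes a URL and returns the page text.

```python
def crawl(urls, fetch_text, limit=10):
    """Visit up to `limit` distinct URLs and return a dict {url: text}.

    URLs whose fetch raises an exception are reported and skipped,
    but they still count towards the limit.
    """
    visited = set()
    pages = {}
    for url in urls:
        if len(visited) >= limit:
            break
        if not url or url in visited:
            continue
        visited.add(url)
        try:
            pages[url] = fetch_text(url)
        except Exception as error:
            print(f"Skipping {url}: {error}")
    return pages

# With the Selenium driver from this notebook, fetch_text could look like:
# def fetch_with_driver(url):
#     driver.get(url)
#     return driver.find_element(By.TAG_NAME, "body").text
```

Calling `crawl(cleaned_url_set, fetch_with_driver)` would then play the role of the loop above.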
        

In [76]:
print(page_text_dict['https://en.wikipedia.org/wiki/Berkeley_Software_Distribution'])

Jump to content
Main menu
Search
Donate
Create account
Log in
Personal tools
Contents hide
(Top)
History
Toggle History subsection
Relationship to Research Unix
Relationship to System V
Technology
Toggle Technology subsection
Berkeley sockets
Binary compatibility
Standards
BSD descendants
See also
References
Bibliography
External links
Berkeley Software Distribution
68 languages
Article
Talk
Read
Edit
View history
Tools
Appearance hide
From Wikipedia, the free encyclopedia
"BSD" redirects here. For the family of free software licenses, see BSD licenses. For other uses, see BSD (disambiguation).
BSD
Developer Computer Systems Research Group
Written in C
OS family Unix
Working state Discontinued
Source model Originally source-available, later open-source
Initial release March 9, 1978; 46 years ago
Final release 4.4-Lite2 / June 1995; 29 years ago
Available in English
Platforms PDP-11, VAX, Intel 80386
Kernel type Monolithic
Userland BSD
Influenced NetBSD, FreeBSD, OpenBSD, DragonFly BSD,

In [83]:
# Clean the extracted text
def clean_text(text):
    # Remove bracketed markers first; the special-character filter below would
    # otherwise strip the brackets and these patterns would never match
    text = text.replace("[edit]", "")
    text = text.replace("(Top)", "")
    # Collapse runs of whitespace (this also trims leading and trailing whitespace)
    text = " ".join(text.split())
    # Remove special characters, keeping only letters, digits and spaces
    text = ''.join(e for e in text if e.isalnum() or e.isspace())
    # Remove Wikipedia navigation phrases. Caution: these are plain substring
    # replacements, so "Top" and "History" are also removed inside article text
    for phrase in ("Jump to content", "Top", "Toggle", "subsection", "Main menu",
                   "Search", "Donate", "Create account", "Log in",
                   "Personal tools", "Contents hide", "History"):
        text = text.replace(phrase, "")

    return text

for url, text in page_text_dict.items():
    page_text_dict[url] = clean_text(text)
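
Because `clean_text` removes phrases wherever they occur, words such as "Topology" or a section titled "History" in the article body are damaged as well. A possible alternative is to drop navigation text line by line before collapsing whitespace, since the raw page text puts each menu entry on its own line. This is a sketch under assumptions: `clean_page_text` and `BOILERPLATE_LINES` are hypothetical names, and the list of lines is taken only from the sample output shown earlier.

```python
import re

# Navigation lines seen in the sample output above; extend as needed
BOILERPLATE_LINES = {
    "Jump to content", "Main menu", "Search", "Donate", "Create account",
    "Log in", "Personal tools", "Contents hide", "(Top)", "Appearance hide",
}

def clean_page_text(text):
    """Drop navigation lines, '[edit]' markers and citation markers, then collapse whitespace."""
    kept = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line in BOILERPLATE_LINES:
            continue
        # Lines such as "Toggle History subsection" are sidebar controls
        if line.startswith("Toggle ") and line.endswith(" subsection"):
            continue
        kept.append(line)
    joined = " ".join(kept)
    joined = joined.replace("[edit]", "")
    joined = re.sub(r"\[\d+\]", "", joined)  # citation markers like [1]
    return " ".join(joined.split())
```

Unlike `clean_text`, this version keeps punctuation, which may or may not be desirable depending on how the text is used later.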

In [84]:
print(page_text_dict['https://en.wikipedia.org/wiki/Berkeley_Software_Distribution'])

 Relationship to Research Unix Relationship to System V Technology Technology  Berkeley sockets Binary compatibility Standards BSD descendants See also References Bibliography External links Berkeley Software Distribution 68 languages Article Talk Read Edit View history Tools Appearance hide From Wikipedia the free encyclopedia BSD redirects here For the family of free software licenses see BSD licenses For other uses see BSD disambiguation BSD Developer Computer Systems Research Group Written in C OS family Unix Working state Discontinued Source model Originally sourceavailable later opensource Initial release March 9 1978 46 years ago Final release 44Lite2 June 1995 29 years ago Available in English Platforms PDP11 VAX Intel 80386 Kernel type Monolithic Userland BSD Influenced NetBSD FreeBSD OpenBSD DragonFly BSD NeXTSTEP Darwin Influenced by Unix Default user interface Unix shell License BSD The Berkeley Software Distribution or Berkeley Standard Distribution1 BSD is a discontinued 

In [42]:
h2_text

"Dennis Ritchie bei der Verleihung des Japan-Preises 2011\nKen Thompson (links) und Dennis Ritchie (rechts)\nDennis MacAlistair Ritchie (* 9. September 1941 in Bronxville, New York; † vor dem 12. Oktober 2011 in Berkeley Heights, New Jersey) war ein US-amerikanischer Informatiker. Er entwickelte zusammen mit Ken Thompson und anderen die erste Version des Unix-Betriebssystems[1][2] und schrieb die ersten sechs Ausgaben des Unix Programmer’s Manual,[3] ehe das Handbuch in „Unix Time-Sharing System“ umbenannt wurde.[4]\nZusammen mit Thompson und Brian W. Kernighan entwickelte er die Programmiersprache C.[5] Kernighan und Ritchie schrieben gemeinsam das Buch The C Programming Language, das oft mit dem Kürzel K&R zitiert wird. Bekannt und berühmt sind auch seine Initialen dmr, die Ritchie als Username für den System-Login verwendete. Für seine Arbeiten, insbesondere die Entwicklung des bis heute maßgeblichen Unix-Betriebssystems und der Sprache C, erhielt er einige der höchsten Auszeichnung

### Now we store the extracted text

In [85]:
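# Write with an explicit encoding so non-ASCII characters are stored consistently across platforms
with open('extracted.txt', 'w', encoding='utf-8') as f:
  for url, text in page_text_dict.items():
    f.write(f"URL: {url}\n")
    f.write(f"Text: {text}\n\n")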
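
The plain-text layout above is easy to read but awkward to parse back, especially if a page text itself contains blank lines. If the data will be loaded again later, one option is JSON Lines, with one JSON object per line. The helper names `save_pages_jsonl` and `load_pages_jsonl`, and the record keys `url` and `text`, are hypothetical choices for this sketch.

```python
import json

def save_pages_jsonl(pages, path):
    """Write a {url: text} dict as one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for url, text in pages.items():
            record = {"url": url, "text": text}
            # json.dumps escapes newlines, so each record stays on one line
            f.write(json.dumps(record, ensure_ascii=False) + "\n")

def load_pages_jsonl(path):
    """Read a file written by save_pages_jsonl back into a {url: text} dict."""
    pages = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip():
                record = json.loads(line)
                pages[record["url"]] = record["text"]
    return pages
```

For example, `save_pages_jsonl(page_text_dict, 'extracted.jsonl')` would store the same data as the cell above in a form that `load_pages_jsonl` can restore.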

In [86]:
driver.quit()