# **Portfolio Project:** Web Scraping Using BeautifulSoup
# **Rob Boswell**

---
---

### This portfolio project shows how to create a web crawler using the BeautifulSoup Python library. I demonstrate how to combine BeautifulSoup with a TOR-based library to enable anonymous web scraping.

<br>

### In this project, I will be scraping a user-defined list of URLs for the embedded URL links within them, the html text on those URLs, and any tables which might also exist on the scraped URLs.

<br>

### Specifically, I will be creating a Pandas DataFrame with 4 columns. The first will contain the original URLs from the user-defined list from which the embedded links within them are scraped. The second will contain the actual URLs that are embedded and scraped. The third will contain the scraped text from the embedded URLs. Lastly, the fourth will contain the html code for any tables which exist on the embedded URLs.

<br>

### I will then take the resulting Pandas DataFrame and use display options and pandas styling to visualize the first three columns. Afterward, I will show how the scraped tables can be viewed using IPython.display.

<br>

### There are many design options for web scrapers depending on specific goals; for instance, this code could be adapted to scrape PDF files, metadata, or images instead of HTML text.
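
### As a rough illustration of that kind of adaptation, the sketch below collects `<meta>` tags and image URLs from a page instead of paragraph text. The HTML string, the base URL, and the helper name `extract_meta_and_images` are made up for this example and are not part of the project code.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

# Hypothetical page content, used only for this illustration
sample_html = """
<html><head>
  <meta name="description" content="A page about web scraping">
  <meta name="keywords" content="scraping, python">
</head><body>
  <img src="/static/logo.png" alt="Logo">
  <img src="https://example.com/images/chart.jpg" alt="Chart">
</body></html>
"""

def extract_meta_and_images(html_text, base_url):
    """Return (metadata dict, list of absolute image URLs) for one HTML document."""
    soup = BeautifulSoup(html_text, 'html.parser')
    # Keep only <meta> tags that carry both a name and a content attribute
    metadata = {
        tag['name']: tag['content']
        for tag in soup.find_all('meta', attrs={'name': True, 'content': True})
    }
    # Resolve relative image paths against the page URL
    image_urls = [urljoin(base_url, img['src']) for img in soup.find_all('img', src=True)]
    return metadata, image_urls

metadata, image_urls = extract_meta_and_images(sample_html, 'https://example.com/page')
print(metadata)
print(image_urls)
```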

---

## **BeautifulSoup Web Scraper Implementation:**

---

### *Data Cleaning:*

### - As discussed in greater detail in the comments in the code below, I use regular expressions (regex) to clean the scraped text. I keep alphanumeric characters and common forms of punctuation, but I remove special characters and newline characters. Removing newline characters (which appear as \n) means that paragraphs of text from a single HTML document will be concatenated together in the output display. I do not use regex to clean scraped tables, as doing so would remove the HTML code necessary to visualize them in table format.

<br>

### *Final Format:*

### - I keep all originating URLs, scraped embedded URLs, scraped text, and HTML tables in the same output Pandas DataFrame. However, it would be easy for a user to create separate Pandas DataFrames for each unique originating URL in the user-defined URL list if they chose to do so, by subsetting the output DataFrame.
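
### One possible way to do that subsetting is sketched below, using `groupby` on the `original_url` column. The small DataFrame `final_data_demo` and its example.com-style URLs are placeholders that only mimic the column layout of the real output.

```python
import pandas as pd

# Hypothetical stand-in for the scraper's output DataFrame (same column names)
final_data_demo = pd.DataFrame({
    'original_url': ['https://a.example', 'https://a.example', 'https://b.example'],
    'scraped_url': ['https://a.example/1', 'https://a.example/2', 'https://b.example/1'],
    'scraped_text': ['text one', 'text two', 'text three'],
    'scraped_tables': ['', '<table></table>', ''],
})

# One DataFrame per originating URL, keyed by that URL
frames_by_origin = {
    origin: group.reset_index(drop=True)
    for origin, group in final_data_demo.groupby('original_url')
}

for origin, frame in frames_by_origin.items():
    print(origin, len(frame))
```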

In [None]:
!pip install torpy -U
!pip install pandas
!pip install bs4
!pip install html5lib

In [1]:
import requests
from bs4 import BeautifulSoup
import re
import pandas as pd
import time 
from urllib.parse import urljoin
from io import StringIO 

In [1]:
# Optional function to use to check your IP address before using TOR to compare with the TOR IP address to ensure TOR is working
def check_ip():
    try:
        response = requests.get('https://ifconfig.me', timeout=10)
        print("Current IP:", response.text.strip())
    except Exception as e:
        print(f"Failed to check IP address: {e}")

In [2]:
def init():
    # Optional: Check IP before using Tor
    # print("IP before Tor:")
    # check_ip()

    from torpy import TorClient
    hostname = 'ifconfig.me'  # It's also possible to use an onion hostname here
    with TorClient() as tor:
        # Choose random guard node and create 3-hops circuit
        with tor.create_circuit(3) as circuit:
            # Create tor stream to host
            with circuit.create_stream((hostname, 80)) as stream:
                # Now we can communicate with host
                stream.send(b'GET / HTTP/1.0\r\nHost: %s\r\n\r\n' % hostname.encode())
                recv = stream.recv(1024)
                print("IP Used by Tor:", recv.decode().strip())
    
    return 0
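
### Because `init()` prints the raw bytes returned by the stream, its output contains the HTTP status line and headers as well as the IP address. A minimal sketch of separating the body from the headers is shown below. The helper name `extract_body` and the sample response (including its placeholder address) are assumptions for illustration, and a single `recv(1024)` call is assumed to have returned the whole response.

```python
def extract_body(raw_response: bytes) -> str:
    """Return the body of a raw HTTP response, dropping the status line and headers."""
    # Headers and body are separated by the first blank line (CRLF CRLF)
    _headers, _sep, body = raw_response.partition(b'\r\n\r\n')
    return body.decode(errors='replace').strip()

# Illustrative response; the address is a placeholder, not a real Tor exit node
sample = (
    b'HTTP/1.0 200 OK\r\n'
    b'content-type: text/plain\r\n'
    b'Content-Length: 11\r\n'
    b'\r\n'
    b'203.0.113.7'
)
print("IP Used by Tor:", extract_body(sample))
```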

## Below is the main regex term I will use to clean the web scraped text, and an explanation of each component:

#### **Note:** The | (before each subsequent term) in regex functions serves as a logical OR operator. It allows you to match any one of multiple patterns separated by the |. It means "if the previous pattern does not match, see if the next pattern does."

### tmp_text = re.sub(
###                       ```r"[^A-Za-z0-9,;:.()?\s…'’%$\-\"“”—]+"```
###                        ```r"|(?<!\d)\%"```
###                        ```r"|\$(?!\d)"```
###                        ```r"|(?<![A-Za-z\s])-|-(?![A-Za-z\s])"```
###                        ```r"|(?<!\w)['’\"“”](?!\w)"```,
###                        ' ',
###                        tmp_text
###                    )

---
---

### **Breakdown:**
### **```[^A-Za-z0-9,;:.()?\s…'’%$\-\"“”—]+```**

#### • ```[^ ]```: This denotes a negated character class, which matches any character not listed inside the brackets.
#### • ```A-Za-z0-9```: Matches any uppercase letter (A-Z), lowercase letter (a-z), or digit (0-9).
#### • ```,;:.()?```: Matches the punctuation characters comma, semicolon, colon, period, parentheses, and question mark.
#### • ```\s```: Matches any whitespace character (space, tab, newline, etc.).
#### • ```…```: Matches the ellipsis character.
#### • ```'’```: Matches both straight and curly apostrophes/single quotes.
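#### • ```%$\-```: Matches the percent sign %, dollar sign $, and hyphen - (the hyphen is escaped as `\-` so that it is treated as a literal character rather than as a range).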
#### • ```\"“”```: Matches straight double quotes ", curly left double quote “, and curly right double quote ”.
#### • ```—```: Matches em dashes.
#### • ```+```: Matches one or more of the preceding elements (inside the square brackets). This part effectively removes any sequence of characters that do not match the specified characters.
<u>Purpose:</u> This ensures that all characters listed inside the square brackets will be retained in the scraped text.

---

### **```(?<!\d)\%```**

#### • ```\%```: Matches the percent sign character.
#### • ```(?<!\d)```: A negative lookbehind assertion that checks if the character <u>immediately preceding</u> the `%` is not a digit. If this condition is true, the `%` sign is matched and can be removed.
<u>Purpose:</u> This ensures that only `%` signs directly associated with numeric percentages (like "50%") are retained.

---

### **```\$(?!\d)```**

#### • ```\$```: Matches the dollar sign character.
#### • ```(?!\d)```: A negative lookahead assertion that checks if the character <u>immediately following</u> the `$` is not a digit. If this condition is true, the `$` sign is matched and can be removed.
<u>Purpose:</u> This ensures that only `$` signs directly associated with numeric amounts (like "$100") are retained.

---

### **```(?<![A-Za-z\s])-|-(?![A-Za-z\s])```** 
#### Note: The | denotes these are two combined terms, discussed below

#### • ```(?<![A-Za-z\s])-```: An assertion that matches a ***hyphen - if it <u>is not preceded by</u> a letter or a space.***

#### • ```-(?![A-Za-z\s])```: An assertion that matches a ***hyphen - if it <u>is not followed by</u> a letter or a space.***
<u>Purpose:</u> This removes any hyphen that does not have a letter or a space on both sides, such as the hyphen in "2020-2021". Hyphens in hyphenated words, and hyphens with a space on one side, are kept, as in "Short- and long-term goals are important."

---

### **```(?<!\w)['’\"“”](?!\w)```**

#### • ```['’\"“”]```: Matches either a straight `'` or curly `’` single quote/apostrophe, or a straight or curly double quotation.
#### • ```(?<!\w)```: A negative lookbehind assertion that checks that the ***quote or apostrophe <u>is not preceded by</u> a word character*** (letters, digits, or underscore).
#### • ```(?!\w)```: A negative lookahead assertion that checks that the ***quote or apostrophe <u>is not followed by</u> a word character.***
<u>Purpose:</u> Both assertions must hold for a match, so a quote or apostrophe is removed only when it has no word character on either side. Quotes and apostrophes that touch a word, as in contractions or at the start of a quotation, are kept.
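
### The behavior described in the breakdown above can be checked on a few short strings. The sketch below reuses the same pattern, followed by the whitespace-collapsing step used in the scraper. The helper name `clean_text` and the sample strings are made up for this illustration.

```python
import re

# Same pattern as in the scraper, compiled once for reuse
CLEAN_PATTERN = re.compile(
    r"[^A-Za-z0-9,;:.()?\s…'’%$\-\"“”—]+"  # Characters outside the allowed set
    r"|(?<!\d)\%"                          # Percent sign not preceded by a digit
    r"|\$(?!\d)"                           # Dollar sign not followed by a digit
    r"|(?<![A-Za-z\s])-|-(?![A-Za-z\s])"   # Hyphen without a letter/space on both sides
    r"|(?<!\w)['’\"“”](?!\w)"              # Quote with no word character on either side
)

def clean_text(text):
    """Replace unwanted characters with spaces, then collapse repeated whitespace."""
    text = CLEAN_PATTERN.sub(' ', text)
    return re.sub(r'\s+', ' ', text).strip()

samples = [
    "Sales rose 50% to $100",
    "cost in $ terms",
    "long-term goals",
    "2020-2021",
    "email@example.com",
    'He said "stop." Then left.',
]
for s in samples:
    print(repr(s), '->', repr(clean_text(s)))
```

### Note that in the last sample the opening quote survives because it touches a word, while the closing quote (which sits between a period and a space) is removed.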

---
---

In [33]:
# The following code uses the threading library, which can help prevent the scraper from hanging indefinitely 
# on a single request. 

import threading

# fetch_url performs an HTTP GET request to fetch a URL using a requests.Session object. 
# The result of the request (either a response or an exception) is stored in a result list at the specified index.

# Thread Role: This function encapsulates the logic that will be run concurrently in a thread, isolating the network 
# operation to prevent blocking the main execution flow.

def fetch_url(session, url, result, index, timeout_duration):
    try:
        response = session.get(url, timeout=timeout_duration)
        result[index] = response
    except Exception as e:
        result[index] = e

from bs4 import BeautifulSoup
import re
        
def my_scraper(tmp_url_in, scrape_timeout=40):
    tmp_text = ''
    max_retries = 3
    attempt = 0
    scraped_tables_html = ''
    response = None  # Initialize response to None

    while attempt <= max_retries:
        session = requests.Session()
        session.max_redirects = 3

        # Use a list to store the response or exception
        result = [None]

        # Start a thread to fetch the URL
        # This line creates a new Thread object. The target argument specifies the function to run in the new 
        # thread (fetch_url), and args provides the arguments to pass to this function when it starts. 
        
        # 0: Specifies the index in the result list where the response will be stored, allowing the main thread 
        # to access the result after the thread completes.
        
        thread = threading.Thread(target=fetch_url, args=(session, tmp_url_in, result, 0, scrape_timeout), daemon=True)
        thread.start() # Starts the thread, allowing the fetch_url function to run concurrently with the main program.
        thread.join(timeout=scrape_timeout) # Waits for the thread to complete its task, up to a specified timeout

        if thread.is_alive(): #  Checks if thread is still running after the timeout specified in the join() call has elapsed.
            attempt += 1
            print(f"Scraping timed out for {tmp_url_in} (attempt {attempt} of {max_retries + 1}).")
            # Python threads cannot be terminated from outside. The thread is a daemon, so it is left to finish
            # on its own rather than blocking here with a second join(), which would defeat the timeout.
            time.sleep(5)
            continue  # Retry in case of a timeout

        # Get the response from the result
        response = result[0]

        if isinstance(response, requests.Response):
            if response.status_code == 200:
                if 'text/html' in response.headers.get('Content-Type', ''):
                    soup = BeautifulSoup(response.text, 'html.parser')
                    
                    # Revised line to strip HTML tags
                    tmp_text = ' '.join(p.get_text() for p in soup.find_all('p')).replace('\n', ' ')
                    
                    # Decode the text to process escape sequences. If you do not use this line, backslashes may only be hidden and not removed.
                    # To ensure that the backslashes are actually removed, first explicitly decode the string characters so backslashes will be 
                    # visible.
                    tmp_text = tmp_text.encode('unicode_escape').decode('unicode_escape')
                    
                    tmp_text = re.sub(
                        r"[^A-Za-z0-9,;:.()?\s…'’%$\-\"“”—]+"  # Match unwanted characters
                        r"|(?<!\d)\%"  # Remove standalone percent signs not preceded by a digit
                        r"|\$(?!\d)"  # Remove standalone dollar signs not followed by a digit
                        r"|(?<![A-Za-z\s])-|-(?![A-Za-z\s])"  # Handle hyphens in certain contexts
                        r"|(?<!\w)['’\"“”](?!\w)",  # Handle quotes in certain contexts
                        ' ',
                        tmp_text
                    )
                                           
                    # Replace any sequence of multiple whitespace characters (spaces, tabs, newlines, etc.) with a single 
                    # space and then trim leading and trailing whitespace from the string.
                    tmp_text = re.sub(r'\s+', ' ', tmp_text).strip()
                    
                    if soup.find('table'):
                        tables_html = [str(table) for table in soup.find_all('table')]
                        scraped_tables_html = ''.join(tables_html)
                        result_message = "Scraped text and tables"
                    else:
                        result_message = "Scraped text only"
                    
                    print(f"{result_message}: {tmp_url_in}")
                    break  # Successfully scraped, exit the loop
                else:
                    print(f"Non-HTML content type for {tmp_url_in}: {response.headers.get('Content-Type')}")
                    break
            else:
                print(f"Failed to scrape {tmp_url_in}: Status code {response.status_code}")
                break

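        else:
            # fetch_url stores whatever exception it caught (not only requests exceptions), so every
            # non-response result is handled here; otherwise the while loop would never exit.
            print(f"Failed to scrape {tmp_url_in}: {response}")
            break  # Unsuccessful connection, exit the loop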

    if response is None:
        # If no response was ever received, indicate this explicitly
        print(f"Failed to scrape {tmp_url_in}: No response received.")

    return tmp_text, scraped_tables_html
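
### The thread-with-timeout pattern used in `my_scraper` can be tried without any network access. In the sketch below, `run_with_timeout` and `slow_task` are hypothetical names, and `time.sleep` stands in for a slow HTTP request. The thread is created as a daemon because Python offers no way to terminate a running thread; a daemon thread is simply abandoned and does not keep the program alive.

```python
import threading
import time

def run_with_timeout(func, args=(), timeout=1.0):
    """Run func(*args) in a daemon thread; return (finished, result_or_exception)."""
    result = [None]

    def worker():
        try:
            result[0] = func(*args)
        except Exception as e:  # Store the exception instead of losing it in the thread
            result[0] = e

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    thread.join(timeout=timeout)  # Wait at most `timeout` seconds
    if thread.is_alive():
        # The worker is still running; it cannot be killed, so it is left behind
        return False, None
    return True, result[0]

def slow_task(seconds):
    """Stand-in for a slow request."""
    time.sleep(seconds)
    return f"done after {seconds}s"

print(run_with_timeout(slow_task, args=(0.1,), timeout=1.0))
print(run_with_timeout(slow_task, args=(2.0,), timeout=0.2))
```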



In [89]:
def fetch_urls(user_urls):
    from bs4 import BeautifulSoup
    import re

    all_urls = []

    for url in user_urls:
        try:
            response = requests.get(url, timeout=30)
            soup = BeautifulSoup(response.text, "html.parser")
            for link in soup.find_all('a', href=True):
                full_url = urljoin(url, link['href'])
                all_urls.append((url, full_url))
        except Exception as e:
            print(f"Failed to fetch URLs from {url}: {e}")
            continue

    all_urls = list(set(all_urls)) # Remove any potential duplicate URLs
    return all_urls
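
### `fetch_urls` relies on `urljoin` to turn relative links into absolute ones. The sketch below shows what that does for a few typical `href` values, and how `urldefrag` (also from `urllib.parse`) could be used to treat links that differ only in their `#fragment` as the same page. The list of `href` values is illustrative, and the fragment handling is an optional extra rather than something the function above does.

```python
from urllib.parse import urljoin, urldefrag

base = 'https://en.wikipedia.org/wiki/Web_scraping'
hrefs = ['/wiki/Reuters', '#See_also', '#References', 'https://en.wikipedia.org/wiki/Data_structures']

# urljoin resolves each href against the page it was found on
absolute = [urljoin(base, h) for h in hrefs]

# urldefrag drops the "#fragment" part, so in-page anchors collapse to one URL
unique_pages = sorted({urldefrag(u).url for u in absolute})

for u in unique_pages:
    print(u)
```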



In [90]:
def write_crawl_results(user_urls):
    import re
    import pandas as pd
    tmp_pd = pd.DataFrame(columns=['original_url', 'scraped_url', 'scraped_text', 'scraped_tables'])

    init()

    all_urls = fetch_urls(user_urls)

    for original_url, scraped_url in all_urls:
        if not scraped_url.startswith(('http://', 'https://')):
            scraped_url = urljoin(original_url, scraped_url)
        scraped_text, scraped_tables = my_scraper(scraped_url)
        if scraped_text or scraped_tables:
            try:
                tmp_pd = pd.concat([tmp_pd, pd.DataFrame({
                    'original_url': [original_url],
                    'scraped_url': [scraped_url],
                    'scraped_text': [scraped_text],
                    'scraped_tables': [scraped_tables]
                })], ignore_index=True)
    
            except Exception as e:
                print(f"Failed to append data for {scraped_url}: {e}")
                pass

    return tmp_pd
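
### Calling `pd.concat` once per scraped page copies the growing DataFrame each time. A common alternative, sketched below under the assumption that the same four columns are wanted, is to collect plain dictionaries in a list and build the DataFrame once at the end. The helper name `build_results_frame` and the sample rows are hypothetical.

```python
import pandas as pd

def build_results_frame(rows):
    """Build the four-column results DataFrame from a list of row dictionaries."""
    columns = ['original_url', 'scraped_url', 'scraped_text', 'scraped_tables']
    return pd.DataFrame(rows, columns=columns)

# Hypothetical rows standing in for what the scraping loop would collect
rows = []
for original_url, scraped_url, text, tables in [
    ('https://a.example', 'https://a.example/1', 'some text', ''),
    ('https://a.example', 'https://a.example/2', '', '<table></table>'),
]:
    rows.append({
        'original_url': original_url,
        'scraped_url': scraped_url,
        'scraped_text': text,
        'scraped_tables': tables,
    })

demo_frame = build_results_frame(rows)
print(demo_frame.shape)
```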



In [96]:
user_provided_urls = [
    'https://en.wikipedia.org/wiki/Web_scraping',
    'https://en.wikipedia.org/wiki/Data_scraping',
    
    # Add more URLs as needed
]

final_data = write_crawl_results(user_provided_urls)

IP Used by Tor: HTTP/1.0 200 OK
date: Sat, 17 Aug 2024 15:28:16 GMT
content-type: text/plain
Content-Length: 14
access-control-allow-origin: *
via: 1.1 google

107.189.10.175
Scraped text and tables: https://en.wikipedia.org/wiki/Facebook,_Inc._v._Power_Ventures,_Inc.
Scraped text and tables: https://en.wikipedia.org/wiki/Reuters
Scraped text only: https://en.wikipedia.org/wiki/Wrapper_(data_mining)
Scraped text and tables: https://en.wikipedia.org/wiki/Web_scraping#See_also
Scraped text and tables: https://en.wikipedia.org/wiki/Web_scraping#cite_ref-21
Scraped text and tables: https://en.wikipedia.org/wiki/Web_scraping
Scraped text and tables: https://en.wikipedia.org/wiki/Web_scraping#References
Scraped text only: https://en.wikipedia.org/w/index.php?title=Data_scraping&action=edit
Scraped text and tables: https://en.wikipedia.org/wiki/Data_scraping#cite_ref-3
Scraped text and tables: https://en.wikipedia.org/wiki/Data_structures
Scraped text and tables: https://en.wikipedia.org/wiki

In [97]:
final_data['scraped_text'] = final_data['scraped_text'].str.strip()
final_data = final_data[final_data['scraped_text'] != ''].copy()  # .copy() avoids a SettingWithCopyWarning in later assignments

In [98]:
# Reset the indexes first to ensure the indexes correctly reflect the row positions.
final_data.reset_index(drop=True, inplace=True)

In [2]:
# Save the object containing the scraped data so that you do not need to re-scrape the same data in the future.
import pickle

# pickle.dump(final_data, open('final_data.pkl', 'wb'))
# final_data = pickle.load(open('final_data.pkl', 'rb'))

---
---

### **Visualizing the Scraped Text - An Initial Approach:**

---

### First, I will show you code you technically could use to display the entire Pandas DataFrame. However, the font-styling and text structure in the output would not look very pretty. 

### Additionally, I do not recommend using this code in a browser-based Python IDE like Jupyter Notebook. If you have scraped a lot of web pages, as I have, the output can exceed the notebook's data rate limit and produce the error shown below. Use this code only if you are using a non-browser-based IDE (e.g., Visual Studio Code or Spyder). Further below, I will show you better options for displaying your scraped data.

In [12]:
import pandas as pd

pd.set_option('display.max_rows', None) # Print all rows. If this line is not used, only the first and last 5 rows will be shown 
pd.set_option('display.max_colwidth', None) # Prevent column width truncation

print(final_data[["original_url", "scraped_url", "scraped_text"]]) # Running the code in print() will hide any escape characters

IOPub data rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_data_rate_limit`.

Current values:
NotebookApp.iopub_data_rate_limit=1000000.0 (bytes/sec)
NotebookApp.rate_limit_window=3.0 (secs)



---

### **A Better Approach to Visualizing Scraped Text Data:**

---

### The following display code allows you to scroll down through all of the scraped text from the embedded URLs.

### It also provides the extra text cleaning steps of 1) removing escape characters that could otherwise appear in the display, and 2) preventing the misinterpretation of characters like `$` signs that could otherwise alter the font styling in the display.

### You can either read the scraped text directly from the display that the code below creates by scrolling down, or you can find a specific scraped web page of interest in the display, and then extract that specific page using code I will show you further below.
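
### The escaping idea behind the display code can be seen in isolation with the standard `html` module. The sketch below escapes a string, wraps it in `<pre>` tags, and then reverses both steps. The sample string `raw_text` is made up for this illustration.

```python
import html

raw_text = "Revenue was $5 < $10 & rising, see <b>note</b>"

# html.escape turns &, <, > (and quotes) into entities so the browser shows them literally
escaped = html.escape(raw_text)
wrapped = f"<pre>{escaped}</pre>"
print(wrapped)

# Reversing the steps: strip the wrapper, then turn the entities back into characters
restored = html.unescape(wrapped[len('<pre>'):-len('</pre>')])
print(restored == raw_text)
```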


In [7]:
from IPython.display import display, HTML
import html
import pandas as pd

# Print all rows. If this line is not used, only the first and last 5 rows will be shown 
pd.set_option('display.max_rows', None)
# Set the maximum column width to None to prevent truncation
pd.set_option('display.max_colwidth', None)

from html import unescape

# Explanation of the wrap_in_pre() function:

# 1) I use the unescape() function from the html module in wrap_in_pre() to convert any HTML entities already present 
# in the text (e.g., &amp; or &#x27;) back to the characters they represent. This prevents html.escape() below from 
# escaping them a second time.

# 2) The <pre> tags, together with html.escape(), ensure all text within them is displayed literally by the browser. 
# This includes special characters like dollar signs, preventing them from being interpreted in ways that could alter 
# their appearance or introduce unwanted formatting.

# Function to wrap text in <pre> tags and escape it
def wrap_in_pre(text):
    # Unescape any previously escaped HTML entities
    text = unescape(text)
    
    # Check if the text is already wrapped in <pre> tags
    if text.startswith('<pre>') and text.endswith('</pre>'):
        return text  # If already wrapped, return as is
    return f"<pre>{html.escape(text)}</pre>" # Wrap in <pre> tags and escape

# Apply the function to your DataFrame column
final_data['scraped_text'] = final_data['scraped_text'].apply(wrap_in_pre)

# Render the DataFrame with the new wrapped text
styled_df = final_data[['original_url', 'scraped_url', 'scraped_text']].style.set_properties(**{
    'text-align': 'left',
    'vertical-align': 'top',
    'white-space': 'normal',
    'font-family': 'Georgia, serif',
    'font-size': '16px',
    'line-height': '1.6',
    'font-weight': 'normal',
    'font-style': 'normal',
    'word-break': 'normal'
}).set_table_styles([
    {'selector': 'th, td',
    'props': [('text-align', 'left'), ('vertical-align', 'top'), ('white-space', 'normal'), ('word-wrap', 'break-word')]},
    {'selector': 'th:nth-child(1), td:nth-child(1)', # Apply styles to the first column
     'props': [('max-width', '120px')]},
    {'selector': 'th:nth-child(2), td:nth-child(2)', # Apply styles to the second column
     'props': [('max-width', '140px')]},
    {'selector': 'th:nth-child(3), td:nth-child(3)', # Apply styles to the third column
     'props': [('max-width', '200px')]}
])


# Now, strip the inserted <pre> and </pre> tags and unescape HTML entities in case you later need to work with the data
# in its original form. This undoes what wrap_in_pre() did above. This is important because wrap_in_pre() introduced special
# formatting for display purposes.

def strip_pre_tags_and_unescape(text):
    
    # Remove the <pre> and </pre> tags if they exist
    if text.startswith('<pre>') and text.endswith('</pre>'):
        text = text[5:-6]  # Remove the <pre> and </pre> tags
    
    # Unescape any HTML entities in the text
    text = unescape(text)
    
    return text

# Apply the function to your DataFrame column
final_data['scraped_text'] = final_data['scraped_text'].apply(strip_pre_tags_and_unescape)

# Wrap the styled table in a scrollable <div> with its own CSS
html_rendered_dataframe = f"""
<div style='max-height: 400px; overflow-x: auto; overflow-y: auto; border: 1px solid #ccc; padding: 10px; font-family: Georgia, serif; font-size: 16px;'>
    <style>
        table {{
            display: block;
            overflow-x: auto;
            max-width: 100%;
            white-space: nowrap;
        }}
        th, td {{
            font-style: normal !important;
            font-weight: normal !important;
            font-family: Georgia, serif !important;
            font-size: 16px !important;
        }}
    </style>
    {styled_df.to_html(escape=False)}
</div>
"""

# Display the scrollable styled DataFrame
display(HTML(html_rendered_dataframe))


Unnamed: 0,original_url,scraped_url,scraped_text
0,https://en.wikipedia.org/wiki/Web_scraping,"https://en.wikipedia.org/wiki/Facebook,_Inc._v._Power_Ventures,_Inc.","Facebook, Inc. v. Power Ventures, Inc. is a lawsuit brought by Facebook in the United States District Court for the Northern District of California alleging that Power Ventures Inc., a third-party platform, collected user information from Facebook and displayed it on their own website. Facebook claimed violations of the CAN-SPAM Act, the Computer Fraud and Abuse Act (""CFAA""), and the California Comprehensive Computer Data Access and Fraud Act. 1 According to Facebook, Power Ventures Inc. made copies of Facebook's website during the process of extracting user information. Facebook argued that this process causes both direct and indirect copyright infringement. In addition, Facebook alleged this process constitutes a violation of the Digital Millennium Copyright Act (""DMCA""). Finally, Facebook also asserted claims of both state and federal trademark infringement, as well as a claim under California's Unfair Competition Law (""UCL""). Power Ventures previously operated the domain power.com and used it to create a website that enabled its users to aggregate data about themselves that is otherwise spread across various social networking sites and messaging services, including LinkedIn, Twitter, Myspace, and AOL or Yahoo instant messaging. This aggregation method is embodied in its motto: ""all your friends in just one place"". 2 Power Ventures wanted to provide a single site for its customers to see all of their friends, to view their status updates or profile pages, and to send messages to multiple friends on multiple sites. 3 The litigation focuses on Power Ventures alleged ""scraping"" of content for and from users on Facebook into Power Ventures interface. Facebook sued claiming violations of copyright, DMCA, CAN-SPAM, and CFAA. 
4 5 Power Ventures and Facebook tried unsuccessfully to work out a deal that allowed Power Ventures to access Facebook's site, through Facebook Connect. In late December 2008, Power Ventures informed Facebook that it would continue to operate without using Facebook Connect. Power Ventures allegedly continued to ""scrape"" Facebook's website, despite technological security measures to block such access. 4 Facebook sued Power Ventures Inc. in the Northern District of California. The court's ruling addressed a motion to dismiss the copyright, DMCA, trademark, and UCL claims. When a court considers a motion to dismiss, it must take the allegations in the Plaintiff's complaint as true and construe the Complaint in a manner that is favorable to the Plaintiff. Thus, for a motion to dismiss to succeed, the complaint must lack either a cognizable legal theory or sufficient facts to support the legal theory. 6 To state a claim for copyright infringement, a plaintiff need only allege The First Amended Complaint (""FAC"") alleged that Power Ventures accessed Facebook's website and made unauthorized ""cache"" copies of it or created derivative works derived from the Facebook website. However, Power Ventures contended that Facebook's copyright allegations are deficient because it is unclear which portions of Facebook's website are alleged to have been copied. Facebook argued that it need not define the exact contours of the protected material because copyright claims do not require particularized allegations. Since Facebook owns the copyright to any page within its system (including the material located on those pages besides user content, such as graphics, video and sound files), Power Ventures only has to access and copy one page to commit copyright infringement. 4 Facebook conceded that it did not have any proprietary rights in its users' information. 
Power Ventures users, who own the rights to the information sought, have expressly given them permission to gather this information. 4 5 Judge Fogel reasoned that MAI Systems Corp. v. Peak Computer, Inc. and Ticketmaster LLC v. RMG Techs. Inc. indicated that the scraping of a webpage inherently involves the copying of that webpage into a computer's memory in order to extract the underlying information contained therein. Even though this ""copying"" is ephemeral and momentary, that it is enough to constitute a ""copy"" under 106 of the Copyright Act and therefore infringement. 7 Since Facebook's Terms of Service prohibit scraping (and thus, Facebook has not given any license to third parties or users to do so), the copying happens without permission. 5 In the MAI case, the Court granted summary judgment in favor of MAI on its claims of copyright infringement and issued a permanent injunction against Peak. The alleged copyright violations included: In this particular case, the Court held that Ticketmaster LLC (""Ticketmaster"") was likely to prevail on claims of direct and contributory copyright infringement as a result of defendant RMG Technologies Inc. (""RMG"") distribution of a software application that permitted its clients to circumvent Ticketmaster.com's CAPTCHA access controls, and use Ticketmaster's copyrighted website in a manner that violated the site's Terms of Use. The Court held that RMG was likely to be found guilty of direct copyright infringement because when RMG viewed the site to create and test its product, it made unauthorized copies of Ticketmaster's site in its computer's RAM. 9 In the instant case, the Court followed Ticketmaster to determine that Power Ventures' 'scraping' made an actionable ""cache"" copy of a Facebook profile page each time it accessed a user's profile page. 4 The elements necessary to state a claim under the DMCA are Power Ventures argued that Facebook's DMCA claim was insufficient using the same arguments listed above. 
They also argued that the unauthorized use requirement was not met because the users are controlling the access (via Power Ventures site) to their own content on the Facebook website. However, the Terms of Use negate this argument because users are barred from using automated programs to access the Facebook website. While users may have the copyright rights in their own content, Facebook placed conditions on that access. After Power Ventures informed Facebook that it intended to continue their service without using Facebook Connect, Facebook implemented specific technical measures to block Power Ventures' access. Power Ventures then attempted to circumvent those technological measures. As all of the elements of a DMCA claim had been correctly pleaded and supported in the FAC, the motion to dismiss the DMCA claim was denied. 4 The Lanham Act imposes liability upon any person who Facebook stated that they were the registered owner of the FACEBOOK mark since 2004. Furthermore, they alleged that Power Ventures used the mark in connection with Power Ventures business. Facebook never authorized or consented to Power Ventures' use of the mark. Facebook also stated that Power Ventures' unauthorized use of the mark was likely to ""confuse recipients and lead to the false impression that Facebook is affiliated with, endorses, or sponsors"" Power Ventures. Power Ventures countered that Facebook was required to provide concise information in the Complaint with respect to the trademark infringement allegations, including information about each instance of such use. However, since particularized pleading is not required for trademark infringement claims, Facebook's allegations were sufficient to survive Power Ventures' motion to dismiss the trademark infringement claim. 
4 To state a claim of trademark infringement under California common law, a plaintiff need only allege For the same reasons listed above, the Court also denied Power Ventures' motion to dismiss the state trademark claim. California's UCL jurisprudence had previously found Lanham Act claims to be substantially congruent to UCL claims. However, it was unclear as to whether Facebook was relying on it trade dress claims or if it also intended to incorporate other portions of the FAC, such as those dealing with the CAN-SPAM and CFAA claims. In order to promote an efficient docket, the Court granted Power Ventures' motion for a more definite statement. 4 On February 18, 2011 11 the judge granted the parties' stipulation to dismiss Facebook's DMCA claim, copyright and trademark infringement claims, and claims for violations of California Business and Professions Code Section 17200. Only three claims remained for the final order - the violation of the CAN-SPAM Act, violation of the CFAA and California Penal Code. The district court then granted summary judgment to Facebook on all three of the remaining Facebook claims. The district court awarded statutory damages of $3,031,350, compensatory damages, and permanent injunctive relief, and it held that Vachani 1064 1064 was personally liable for Power's actions. A magistrate judge ordered Power to pay $39,796.73 in costs and fees for a renewed Federal Civil Procedure Rule 30(b)(6) deposition. Power filed a motion for reconsideration, which the district court denied. Defendants timely appeal both the judgment and the discovery sanctions. Argued and Submitted December 9, 2015 San Francisco, California. Filed July 12, 2016 and Amended December 9, 2016. The appeals court affirmed the district court's holding that Vachani was personally liable for Power's actions. 12 Vachani was the central figure in Power's promotional scheme. First, Vachani admitted that, during the promotion, he controlled and directed Power's actions. 
Second, Vachani admitted that the promotion was his idea. It is undisputed, therefore, that Vachani was the guiding spirit and central figure in Power's challenged actions. Accordingly, we affirm the district court's holding on Vachani's personal liability for Power's actions. The court also affirmed discovery sanctions imposed against Power for non-compliance during a Rule 30(b)(6) deposition. Defendants failed to object to discovery sanctions in the district court. Failure to object forfeits Defendants' right to raise the issue on appeal. On April 24, 2017, Defendant Steven Vachani (""Vachani"") filed a motion to stay all proceedings in the case pending resolution of his petition for certiorari in the United States Supreme Court. However, the Ninth Circuit has held that ""once a federal circuit court issues a decision, the district courts within that circuit are bound to follow it and have no authority to await a ruling by the Supreme Court before applying the circuit court's decision as binding authority. 13 On May 2, 2017, the United States District Court, N.D. California, San Jose Division issued its final judgement ruled that, having considered the briefing of the parties, the record in the case, and the relevant law, the Court found that Facebook was only entitled to the reduced sum of $79,640.50 in compensatory damages and a permanent injunction. The Court also ordered Defendants to pay the $39,796.73 discovery sanction. 14"
1,https://en.wikipedia.org/wiki/Web_scraping,https://en.wikipedia.org/wiki/Reuters,"Reuters ( r t rz ROY-terz) is a news agency owned by Thomson Reuters. 1 2 It employs around 2,500 journalists and 600 photojournalists in about 200 locations worldwide writing in 16 languages. 3 Reuters is one of the largest and most trusted news agencies in the world. 4 5 6 The agency was established in London in 1851 by the German-born Paul Reuter. It was acquired by the Thomson Corporation of Canada in 2008 and now makes up the news media division of Thomson Reuters. 5 Paul Reuter worked at a book-publishing firm in Berlin and was involved in distributing radical pamphlets at the beginning of the Revolutions of 1848. These publications brought much attention to Reuter, who in 1850 developed a prototype news service in Aachen using homing pigeons and electric telegraphy from 1851 on, in order to transmit messages between Brussels and Aachen, 7 in what today is Aachen's Reuters House. Reuter moved to London in 1851 and established a news wire agency at the London Royal Exchange. Headquartered in London, Reuter's company initially covered commercial news, serving banks, brokerage houses, and business firms. 8 The first newspaper client to subscribe was the London Morning Advertiser in 1858, and more began to subscribe soon after. 8 9 According to the Encyclop dia Britannica: ""the value of Reuters to newspapers lay not only in the financial news it provided but in its ability to be the first to report on stories of international importance. 8 It was the first to report Abraham Lincoln's assassination in Europe, for instance, in 1865. 8 10 In 1865, Reuter incorporated his private business, under the name Reuter's Telegram Company Limited; Reuter was appointed managing director of the company. 
11 In 1870 the press agencies French Havas (founded in 1835), British Reuter's (founded in 1851) and German Wolff (founded in 1849) signed an agreement (known as the Ring Combination) that set 'reserved territories' for the three agencies. Each agency made its own separate contracts with national agencies or other subscribers within its territory. In practice, Reuters, who came up with the idea, tended to dominate the Ring Combination. Its influence was greatest because its reserved territories were larger or of greater news importance than most others. It also had more staff and stringers throughout the world and thus contributed more original news to the pool. British control of cable lines made London itself an unrivalled centre for world news, further enhanced by Britain's wide-ranging commercial, financial and imperial activities. 12 In 1872, Reuter's expanded into the Far East, followed by South America in 1874. Both expansions were made possible by advances in overland telegraphs and undersea cables. 10 In 1878, Reuter retired as managing director, and was succeeded by his eldest son, Herbert de Reuter. 11 In 1883, Reuter's began transmitting messages electrically to London newspapers. 10 Reuter's son Herbert de Reuter continued as general manager until his death by suicide in 1915. The company returned to private ownership in 1916, when all shares were purchased by Roderick Jones and Mark Napier; they renamed the company ""Reuters Limited"", dropping the apostrophe. 11 In 1919, a number of Reuters reports falsely described the anti-colonial March 1st Movement protests in Korea as violent Bolshevik uprisings. South Korean researchers found that a number of these reports were cited in a number of international newspapers and possibly negatively influenced international opinion on Korea. 13 In 1923, Reuters began using radio to transmit news internationally, a pioneering act. 
10 In 1925, the Press Association (PA) of Great Britain acquired a majority interest in Reuters, and full ownership some years later. 8 During the world wars, The Guardian reported that Reuters: ""came under pressure from the British government to serve national interests. In 1941, Reuters deflected the pressure by restructuring itself as a private company. 10 In 1941, the PA sold half of Reuters to the Newspaper Proprietors' Association, and co-ownership was expanded in 1947 to associations that represented daily newspapers in New Zealand and Australia. 8 The new owners formed the Reuters Trust. The Reuters Trust Principles were put in place to maintain the company's independence. 14 At that point, Reuters had become ""one of the world's major news agencies, supplying both text and images to newspapers, other news agencies, and radio and television broadcasters. 8 Also at that point, it directly or through national news agencies provided service ""to most countries, reaching virtually all the world's leading newspapers and many thousands of smaller ones"", according to Britannica. 8 In 1961, Reuters scooped news of the erection of the Berlin Wall. 15 Reuters was one of the first news agencies to transmit financial data over oceans via computers in the 1960s. 8 In 1973, Reuters ""began making computer-terminal displays of foreign-exchange rates available to clients. 8 In 1981, Reuters began supporting electronic transactions on its computer network and afterwards developed a number of electronic brokerage and trading services. 8 Reuters was floated as a public company in 1984, 15 when Reuters Trust was listed on the stock exchanges 10 such as the London Stock Exchange (LSE) and NASDAQ. 8 Reuters later published the first story of the Berlin Wall being breached in 1989. 15 Reuters was the dominant news service on the Internet in the 1990s. It earned this position by developing a partnership with ClariNet and Pointcast, two early Internet-based news providers. 
16 Reuters' share price grew during the dotcom boom, then fell after the banking troubles in 2001. 10 In 2002, Britannica wrote that most news throughout the world came from three major agencies: the Associated Press, Reuters, and Agence France-Presse. 4 Until 2008, the Reuters news agency formed part of an independent company, Reuters Group plc. Reuters was acquired by Thomson Corporation in Canada in 2008, forming Thomson Reuters. 8 In 2009, Thomson Reuters withdrew from the LSE and the NASDAQ, instead listing its shares on the Toronto Stock Exchange (TSX) and the New York Stock Exchange (NYSE). 8 The last surviving member of the Reuters family founders, Marguerite, Baroness de Reuter, died at age 96 on 25 January 2009. 17 The parent company Thomson Reuters is headquartered in Toronto, and provides financial information to clients while also maintaining its traditional news-agency business. 8 In 2012, Thomson Reuters appointed Jim Smith as CEO. 14 In July 2016, Thomson Reuters agreed to sell its intellectual property and science operation for $3.55 billion to private equity firms. 18 In October 2016, Thomson Reuters announced expansions and relocations to Toronto. 18 As part of cuts and restructuring, in November 2016, Thomson Reuters Corp. eliminated 2,000 jobs worldwide out of its estimated 50,000 employees. 18 On 15 March 2020, Steve Hasker was appointed president and CEO. 19 In April 2021, Reuters announced that its website would go behind a paywall, following rivals who have done the same. 20 21 In March 2024, Gannett, the largest newspaper publisher in the United States, signed an agreement with Reuters to use the wire service's global content after cancelling its contract with the Associated Press. 
22 In 2024, Reuters staff won the Pulitzer Prize for National Reporting for their work on Elon Musk and misconduct at his businesses, including SpaceX, Tesla, and Neuralink, as well as the Pulitzer Prize for Breaking News Photography for coverage of the Israel Hamas war. 23 Reuters employs some 2,500 journalists and 600 photojournalists 24 in about 200 locations worldwide. 25 26 5 Reuters journalists use the Standards and Values as a guide for fair presentation and disclosure of relevant interests, to ""maintain the values of integrity and freedom upon which their reputation for reliability, accuracy, speed and exclusivity relies"". 27 28 In May 2000, Kurt Schork, an American reporter, was killed in an ambush while on assignment in Sierra Leone. In April and August 2003, news cameramen Taras Protsyuk and Mazen Dana were killed in separate incidents by U.S. troops in Iraq. In July 2007, Namir Noor-Eldeen and Saeed Chmagh were killed when they were struck by fire from a U.S. military Apache helicopter in Baghdad. 29 30 During 2004, cameramen Adlan Khasanov was killed by Chechen separatists, and Dhia Najim was killed in Iraq. In April 2008, cameraman Fadel Shana was killed in the Gaza Strip after being hit by an Israeli tank. 31 32 While covering China's Cultural Revolution in Peking in the late 1960s for Reuters, journalist Anthony Grey was detained by the Chinese government in response to the jailing of several Chinese journalists by the colonial British government of Hong Kong. 33 He was released after being imprisoned for 27 months from 1967 to 1969 and was awarded an OBE by the British Government. After his release, he went on to become a best-selling historical novelist. 
34 In May 2016, the Ukrainian website Myrotvorets published the names and personal data of 4,508 journalists, including Reuters reporters, and other media staff from all over the world, who were accredited by the self-proclaimed authorities in the separatist-controlled regions of eastern Ukraine. 35 In 2018, two Reuters journalists were convicted in Myanmar of obtaining state secrets while investigating a massacre in a Rohingya village. 36 The arrest and convictions were widely condemned as an attack on press freedom. The journalists, Wa Lone and Kyaw Soe Oo, received several awards, including the Foreign Press Association Media Award and the Pulitzer Prize for International Reporting, and were named as part of the Time Person of the Year for 2018 along with other persecuted journalists. 37 38 39 After 511 days in prison, Wa Lone and Kyaw Soe Oo were freed on 7 March 2019 after receiving a presidential pardon. 40 In February 2023, a team of Reuters journalists won the Selden Ring Award for their investigation that exposed human-rights abuses by the Nigerian military. 41 In 1977, Rolling Stone and The New York Times said that according to information from CIA officials, Reuters cooperated with the CIA. 43 44 45 In response to that, Reuters' then-managing director, Gerald Long, had asked for evidence of the charges, but none was provided, according to Reuters' then-managing editor for North America, 45 Desmond Maberly. 46 47 Reuters has a policy of taking a ""value-neutral approach"" which extends to not using the word terrorist in its stories. The practice attracted criticism following the September 11 attacks. 48 Reuters' editorial policy states: ""Reuters may refer without attribution to terrorism and counterterrorism in general, but do not refer to specific events as terrorism. Nor does Reuters use the word terrorist without attribution to qualify specific individuals, groups or events. 
49 By contrast, the Associated Press does use the term terrorist in reference to non-governmental organizations who carry out attacks on civilian populations. 48 In 2004, Reuters asked CanWest Global Communications, a Canadian newspaper chain, to remove Reuters' bylines, as the chain had edited Reuters articles to insert the word terrorist. A spokesman for Reuters stated: ""My goal is to protect my reporters and protect our editorial integrity. 50 In July 2013, David Fogarty, former Reuters climate change correspondent in Asia, resigned after a career of almost 20 years with the company and wrote that ""progressively, getting any climate change-themed story published got harder"" following comments from then-deputy editor-in-chief Paul Ingrassia that he was a ""climate change sceptic"". In his comments, Fogarty stated: 51 52 53 By mid-October, I was informed that climate change just wasn't a big story for the present, but that it would be if there was a significant shift in global policy, such as the US introducing an emissions cap-and-trade system. Very soon after that conversation I was told my climate change role was abolished. Ingrassia, formerly Reuters' managing editor, previously worked for The Wall Street Journal and Dow Jones for 31 years. 54 55 Reuters responded to Fogarty's piece by stating: ""Reuters has a number of staff dedicated to covering this story, including a team of specialist reporters at Point Carbon and a columnist. There has been no change in our editorial policy. 56 Subsequently, climate blogger Joe Romm cited a Reuters article on climate as employing ""false balance"", and quoted Stefan Rahmstorf, co-chair of Earth System Analysis at the Potsdam Institute that s imply, a lot of unrelated climate sceptics nonsense has been added to this Reuters piece. In the words of the late Steve Schneider, this is like adding some nonsense from the Flat Earth Society to a report about the latest generation of telecommunication satellites. It is absurd. 
Romm opined: ""We can't know for certain who insisted on cramming this absurd and non-germane 'climate sceptics nonsense' into the piece, but we have a strong clue. If it had been part of the reporter's original reporting, you would have expected direct quotes from actual sceptics, because that is journalism 101. The fact that the blather was all inserted without attribution suggests it was added at the insistence of an editor. 57 According to Ynetnews, Reuters was accused of bias against Israel in its coverage of the 2006 Israel Lebanon conflict after the wire service used two doctored photos by a Lebanese freelance photographer, Adnan Hajj. 58 In August 2006, Reuters announced it had severed all ties with Hajj and said his photographs would be removed from its database. 59 60 In 2010, Reuters was criticised again by Haaretz for ""anti-Israeli"" bias when it cropped the edges of photos, removing commandos' knives held by activists and a naval commando's blood from photographs taken aboard the Mavi Marmara during the Gaza flotilla raid, a raid that left nine Turkish activists dead. It has been alleged that in two separate photographs, knives held by the activists were cropped out of the versions of the pictures published by Reuters. 61 Reuters said it is standard operating procedure to crop photos at the margins, and replaced the cropped images with the original ones after it was brought to the agency's attention. 61 On 9 June 2020, three Reuters journalists (Jack Stubbs, Raphael Satter and Christopher Bing) incorrectly used the image of an Indian herbal medicine entrepreneur in an exclusive story titled ""Obscure Indian cyber firm spied on politicians, investors worldwide"". 62 Indian local media picked up the report, and the man whose image was wrongly used was invited and interrogated for nine hours by Indian police. 
Reuters admitted to the error, but Raphael Satter claimed that they had mistaken the man for the suspected hacker Sumit Gupta because both men share same business address. A check by local media, however, showed that both men were in different buildings and not as claimed by Raphael Satter. 63 64 As the report of the inaccurate reporting trickled out to the public, Reuters' senior director of communication Heather Carpenter contacted media outlets asking them to take down their posts. 64 In March 2015, the Brazilian affiliate of Reuters released an excerpt from an interview with Brazilian ex-president Fernando Henrique Cardoso about Operation Car Wash (Portuguese: Opera o Lava Jato). In 2014, several politicians from Brazil were found to be involved in corruption, by accepting bribes from different corporations in exchange for Government contracts. After the scandal, the excerpt from Brazil's president Fernando Henrique's interview was released. One paragraph by a former Petrobras manager mentioned a comment, in which he suggested corruption in the company may date back to Cardoso's presidency. Attached, was a comment between parenthesis: ""Podemos tirar se achar melhor"" (""we can take it out if you think better""), 65 which was removed from the current version of the text. 66 This had the effect of confusing readers, and suggests that the former president was involved in corruption and the comment was attributed to him. Reuters later confirmed the error, and explained that the comment, originating from one of the local editors, was actually intended for the journalist who wrote the original text in English, and that it should not have been published. 67 In November 2019 the UK Foreign Office released archive documents confirming that it had provided funding to Reuters during the 1960s and 1970s so that Reuters could expand its coverage in the Middle East. 
An agreement was made between the Information Research Department (IRD) and Reuters for the UK Treasury to provide 350,000 over four years to fund Reuters' expansion. The UK government had already been funding the Latin American department of Reuters through a shell company; however, this method was discounted for the Middle East operation due to the accounting of the shell company looking suspicious, with the IRD stating that the company ""already looks queer to anyone who might wish to investigate why such an inactive and unprofitable company continues to run. 68 Instead, the BBC was used to fund the project by paying for enhanced subscriptions to the news organisation, for which the Treasury would reimburse the BBC at a later date. The IRD acknowledged that this agreement would not give them editorial control over Reuters, although the IRD believed it would give them political influence over Reuters' work, stating ""this influence would flow, at the top level, from Reuters' willingness to consult and to listen to views expressed on the results of its work. 68 69 On 1 June 2020, Reuters announced that Russian news agency TASS had joined its ""Reuters Connect"" programme, comprising a then-total of 18 partner agencies. Reuters president Michael Friedenberg said he was ""delighted that TASS and Reuters are building upon our valued partnership"". 70 Two years later, TASS's membership in Reuters Connect came under scrutiny in the wake of the 2022 Russian invasion of Ukraine; Politico reported that Reuters staff members were ""frustrated and embarrassed"" that their agency had not suspended its partnership with TASS. 71 On 23 March 2022, Reuters removed TASS from its ""content marketplace"". Matthew Keen, interim CEO of Reuters said ""we believe making TASS content available on Reuters Connect is not aligned with the Thomson Reuters Trust Principles"". 
72"
2,https://en.wikipedia.org/wiki/Web_scraping,https://en.wikipedia.org/wiki/Wrapper_(data_mining),"Wrapper in data mining is a procedure that extracts regular subcontent of an unstructured or loosely-structured information source and translates it into a relational form, so it can be processed as structured data. 1 Wrapper induction is the problem of devising extraction procedures on an automatic basis, with minimal reliance on hand-crafted rules. Many web pages are automatically generated from structured data telephone directories, product catalogs, etc. wrapped in a loosely structured presentation language (usually some variant of HTML), formatted for human browsing and navigation. Structured data are typically descriptions of objects retrieved from underlying databases and displayed in web pages following fixed templates at a low level, injected into pages where the high-level structure can vary from week to week, per the rapidly evolving fashion of the site's presentation skin. The precise dividing line between the fluid high-level skin and the less fluid structured data templates is rarely documented for public consumption, outside of the content management team at the web property. Software systems using such resources must translate HTML content into a relational form. Wrappers are commonly used as such translators. Formally, a wrapper is a function from a page to the set of tuples it contains. There are two main approaches to wrapper generation: wrapper induction and automated data extraction. Wrapper induction uses supervised learning to learn data extraction rules from manually labeled training examples. The disadvantages of wrapper induction are Due to the manual labeling effort, it is hard to extract data from a large number of sites as each site has its own templates and requires separate manual labeling for wrapper learning. Wrapper maintenance is also a major issue because whenever a site changes the wrappers built for the site become obsolete. 
Due to these shortcomings, researchers have studied automated wrapper generation using unsupervised pattern mining. Automated extraction is possible because most Web data objects follow fixed templates. Discovering such templates or patterns enables the system to perform extraction automatically. 2 Wrapper generation on the Web is an important problem with a wide range of applications. Extraction of such data enables one to integrate data information from multiple Web sites to provide value-added services, e.g., comparative shopping, object search, and information integration."
3,https://en.wikipedia.org/wiki/Web_scraping,https://en.wikipedia.org/wiki/Web_scraping#See_also,"Web scraping, web harvesting, or web data extraction is data scraping used for extracting data from websites. 1 Web scraping software may directly access the World Wide Web using the Hypertext Transfer Protocol or a web browser. While web scraping can be done manually by a software user, the term typically refers to automated processes implemented using a bot or web crawler. It is a form of copying in which specific data is gathered and copied from the web, typically into a central local database or spreadsheet, for later retrieval or analysis. Scraping a web page involves fetching it and extracting from it. Fetching is the downloading of a page (which a browser does when a user views a page). Therefore, web crawling is a main component of web scraping, to fetch pages for later processing. Once fetched, extraction can take place. The content of a page may be parsed, searched and reformatted, and its data copied into a spreadsheet or loaded into a database. Web scrapers typically take something out of a page, to make use of it for another purpose somewhere else. An example would be finding and copying names and telephone numbers, companies and their URLs, or e-mail addresses to a list (contact scraping). As well as contact scraping, web scraping is used as a component of applications used for web indexing, web mining and data mining, online price change monitoring and price comparison, product review scraping (to watch the competition), gathering real estate listings, weather data monitoring, website change detection, research, tracking online presence and reputation, web mashup, and web data integration. Web pages are built using text-based mark-up languages (HTML and XHTML), and frequently contain a wealth of useful data in text form. However, most web pages are designed for human end-users and not for ease of automated use. 
As a result, specialized tools and software have been developed to facilitate the scraping of web pages. Web scraping applications include market research, price comparison, content monitoring, and more. Businesses rely on web scraping services to efficiently gather and utilize this data. Newer forms of web scraping involve monitoring data feeds from web servers. For example, JSON is commonly used as a transport mechanism between the client and the web server. There are methods that some websites use to prevent web scraping, such as detecting and disallowing bots from crawling (viewing) their pages. In response, there are web scraping systems that rely on using techniques in DOM parsing, computer vision and natural language processing to simulate human browsing to enable gathering web page content for offline parsing Web scraping is the process of automatically mining data or collecting information from the World Wide Web. It is a field with active developments sharing a common goal with the semantic web vision, an ambitious initiative that still requires breakthroughs in text processing, semantic understanding, artificial intelligence and human-computer interactions. The simplest form of web scraping is manually copying and pasting data from a web page into a text file or spreadsheet. Sometimes even the best web-scraping technology cannot replace a human's manual examination and copy-and-paste, and sometimes this may be the only workable solution when the websites for scraping explicitly set up barriers to prevent machine automation. A simple yet powerful approach to extract information from web pages can be based on the UNIX grep command or regular expression-matching facilities of programming languages (for instance Perl or Python). Static and dynamic web pages can be retrieved by posting HTTP requests to the remote web server using socket programming. 
Many websites have large collections of pages generated dynamically from an underlying structured source like a database. Data of the same category are typically encoded into similar pages by a common script or template. In data mining, a program that detects such templates in a particular information source, extracts its content, and translates it into a relational form, is called a wrapper. Wrapper generation algorithms assume that input pages of a wrapper induction system conform to a common template and that they can be easily identified in terms of a URL common scheme. 3 Moreover, some semi-structured data query languages, such as XQuery and the HTQL, can be used to parse HTML pages and to retrieve and transform page content. By embedding a full-fledged web browser, such as the Internet Explorer or the Mozilla browser control, programs can retrieve the dynamic content generated by client-side scripts. These browser controls also parse web pages into a DOM tree, based on which programs can retrieve parts of the pages. Languages such as Xpath can be used to parse the resulting DOM tree. There are several companies that have developed vertical specific harvesting platforms. These platforms create and monitor a multitude of ""bots"" for specific verticals with no ""man in the loop"" (no direct human involvement), and no work related to a specific target site. The preparation involves establishing the knowledge base for the entire vertical and then the platform creates the bots automatically. The platform's robustness is measured by the quality of the information it retrieves (usually number of fields) and its scalability (how quick it can scale up to hundreds or thousands of sites). This scalability is mostly used to target the Long Tail of sites that common aggregators find complicated or too labor-intensive to harvest content from. 
The pages being scraped may embrace metadata or semantic markups and annotations, which can be used to locate specific data snippets. If the annotations are embedded in the pages, as Microformat does, this technique can be viewed as a special case of DOM parsing. In another case, the annotations, organized into a semantic layer, 4 are stored and managed separately from the web pages, so the scrapers can retrieve data schema and instructions from this layer before scraping the pages. There are efforts using machine learning and computer vision that attempt to identify and extract information from web pages by interpreting pages visually as a human being might. 5 Uses advanced AI to interpret and process web page content contextually, extracting relevant information, transforming data, and customizing outputs based on the content's structure and meaning. This method enables more intelligent and flexible data extraction, accommodating complex and dynamic web content. The world of web scraping offers a variety of software tools designed to simplify and customize the process of data extraction from websites. These tools vary in their approach and capabilities, making web scraping accessible to both novice users and advanced programmers. Some advanced web scraping software can automatically recognize the data structure of a web page, eliminating the need for manual coding. Others provide a recording interface that allows users to record their interactions with a website, thus creating a scraping script without writing a single line of code. Many tools also include scripting functions for more customized extraction and transformation of content, along with database interfaces to store the scraped data locally. Web scraping tools are versatile in their functionality. Some can directly extract data from APIs, while others are capable of handling websites with AJAX-based dynamic content loading or login requirements. 
Point-and-click software, for instance, empowers users without advanced coding skills to benefit from web scraping. This democratizes access to data, making it easier for a broader audience to leverage the power of web scraping. Popular Web Scraping Tools BeautifulSoup: A Python library that provides simple methods for extracting data from HTML and XML files. Scrapy: An open-source and collaborative web crawling framework for Python that allows you to extract the data, process it, and store it. Octoparse: A no-code web scraping tool that offers a user-friendly interface for extracting data from websites without needing programming skills. ParseHub: Another no-code web scraper that can handle dynamic content and works with AJAX-loaded sites. Apify: A platform that offers a wide range of scraping tools and the ability to create custom scrapers. InstantAPI.ai: An AI-powered tool that transforms any web page into personalized APIs instantly, offering advanced data extraction and customization. Some platforms provide not only tools for web scraping but also opportunities for developers to share and potentially monetize their scraping solutions. By leveraging these tools and platforms, users can unlock the full potential of web scraping, turning raw data into valuable insights and opportunities. 6 The legality of web scraping varies across the world. In general, web scraping may be against the terms of service of some websites, but the enforceability of these terms is unclear. 7 In the United States, website owners can use three major legal claims to prevent undesired web scraping: (1) copyright infringement (compilation), (2) violation of the Computer Fraud and Abuse Act (""CFAA""), and (3) trespass to chattel. 8 However, the effectiveness of these claims relies upon meeting various criteria, and the case law is still evolving. 
For example, with regard to copyright, while outright duplication of original expression will in many cases be illegal, in the United States the courts ruled in Feist Publications v. Rural Telephone Service that duplication of facts is allowable. U.S. courts have acknowledged that users of ""scrapers"" or ""robots"" may be held liable for committing trespass to chattels, 9 10 which involves a computer system itself being considered personal property upon which the user of a scraper is trespassing. The best known of these cases, eBay v. Bidder's Edge, resulted in an injunction ordering Bidder's Edge to stop accessing, collecting, and indexing auctions from the eBay web site. This case involved automatic placing of bids, known as auction sniping. However, in order to succeed on a claim of trespass to chattels, the plaintiff must demonstrate that the defendant intentionally and without authorization interfered with the plaintiff's possessory interest in the computer system and that the defendant's unauthorized use caused damage to the plaintiff. Not all cases of web spidering brought before the courts have been considered trespass to chattels. 11 One of the first major tests of screen scraping involved American Airlines (AA), and a firm called FareChase. 12 AA successfully obtained an injunction from a Texas trial court, stopping FareChase from selling software that enables users to compare online fares if the software also searches AA's website. The airline argued that FareChase's websearch software trespassed on AA's servers when it collected the publicly available data. FareChase filed an appeal in March 2003. By June, FareChase and AA agreed to settle and the appeal was dropped. 13 Southwest Airlines has also challenged screen-scraping practices, and has involved both FareChase and another firm, Outtask, in a legal claim. 
Southwest Airlines charged that the screen-scraping is Illegal since it is an example of ""Computer Fraud and Abuse"" and has led to ""Damage and Loss"" and ""Unauthorized Access"" of Southwest's site. It also constitutes ""Interference with Business Relations"", ""Trespass"", and ""Harmful Access by Computer"". They also claimed that screen-scraping constitutes what is legally known as ""Misappropriation and Unjust Enrichment"", as well as being a breach of the web site's user agreement. Outtask denied all these claims, claiming that the prevailing law, in this case, should be US Copyright law and that under copyright, the pieces of information being scraped would not be subject to copyright protection. Although the cases were never resolved in the Supreme Court of the United States, FareChase was eventually shuttered by parent company Yahoo , and Outtask was purchased by travel expense company Concur. 14 In 2012, a startup called 3Taps scraped classified housing ads from Craigslist. Craigslist sent 3Taps a cease-and-desist letter and blocked their IP addresses and later sued, in Craigslist v. 3Taps. The court held that the cease-and-desist letter and IP blocking was sufficient for Craigslist to properly claim that 3Taps had violated the Computer Fraud and Abuse Act (CFAA). Although these are early scraping decisions, and the theories of liability are not uniform, it is difficult to ignore a pattern emerging that the courts are prepared to protect proprietary content on commercial sites from uses which are undesirable to the owners of such sites. However, the degree of protection for such content is not settled and will depend on the type of access made by the scraper, the amount of information accessed and copied, the degree to which the access adversely affects the site owner's system and the types and manner of prohibitions on such conduct. 
15 While the law in this area becomes more settled, entities contemplating using scraping programs to access a public web site should also consider whether such action is authorized by reviewing the terms of use and other terms or notices posted on or made available through the site. In a 2010 ruling in the Cvent, Inc. v. Eventbrite, Inc. In the United States district court for the eastern district of Virginia, the court ruled that the terms of use should be brought to the users' attention In order for a browse wrap contract or license to be enforced. 16 In a 2014 case, filed in the United States District Court for the Eastern District of Pennsylvania, 17 e-commerce site QVC objected to the Pinterest-like shopping aggregator Resultly's 'scraping of QVC's site for real-time pricing data. QVC alleges that Resultly ""excessively crawled"" QVC's retail site (allegedly sending 200 300 search requests to QVC's website per minute, sometimes to up to 36,000 requests per minute) which caused QVC's site to crash for two days, resulting in lost sales for QVC. 18 QVC's complaint alleges that the defendant disguised its web crawler to mask its source IP address and thus prevented QVC from quickly repairing the problem. This is a particularly interesting scraping case because QVC is seeking damages for the unavailability of their website, which QVC claims was caused by Resultly. In the plaintiff's web site during the period of this trial, the terms of use link are displayed among all the links of the site, at the bottom of the page as most sites on the internet. This ruling contradicts the Irish ruling described below. The court also rejected the plaintiff's argument that the browse-wrap restrictions were enforceable in view of Virginia's adoption of the Uniform Computer Information Transactions Act (UCITA)—a uniform law that many believed was in favor on common browse-wrap contracting practices. 19 In Facebook, Inc. v. 
Power Ventures, Inc., a district court ruled in 2012 that Power Ventures could not scrape Facebook pages on behalf of a Facebook user. The case is on appeal, and the Electronic Frontier Foundation filed a brief in 2015 asking that it be overturned. 20 21 In Associated Press v. Meltwater U.S. Holdings, Inc., a court in the US held Meltwater liable for scraping and republishing news information from the Associated Press, but a court in the United Kingdom held in favor of Meltwater. The Ninth Circuit ruled in 2019 that web scraping did not violate the CFAA in hiQ Labs v. LinkedIn. The case was appealed to the United States Supreme Court, which returned the case to the Ninth Circuit to reconsider the case in light of the 2021 Supreme Court decision in Van Buren v. United States which narrowed the applicability of the CFAA. 22 On this review, the Ninth Circuit upheld their prior decision. 23 Internet Archive collects and distributes a significant number of publicly available web pages without being considered to be in violation of copyright laws. citation needed In February 2006, the Danish Maritime and Commercial Court (Copenhagen) ruled that systematic crawling, indexing, and deep linking by portal site ofir.dk of real estate site Home.dk does not conflict with Danish law or the database directive of the European Union. 24 In a February 2010 case complicated by matters of jurisdiction, Ireland's High Court delivered a verdict that illustrates the inchoate state of developing case law. In the case of Ryanair Ltd v Billigfluege.de GmbH, Ireland's High Court ruled Ryanair's ""click-wrap"" agreement to be legally binding. 
In contrast to the findings of the United States District Court Eastern District of Virginia and those of the Danish Maritime and Commercial Court, Justice Michael Hanna ruled that the hyperlink to Ryanair's terms and conditions was plainly visible, and that placing the onus on the user to agree to terms and conditions in order to gain access to online services is sufficient to comprise a contractual relationship. 25 The decision is under appeal in Ireland's Supreme Court. 26 On April 30, 2020, the French Data Protection Authority (CNIL) released new guidelines on web scraping. 27 The CNIL guidelines made it clear that publicly available data is still personal data and cannot be repurposed without the knowledge of the person to whom that data belongs. 28 In Australia, the Spam Act 2003 outlaws some forms of web harvesting, although this only applies to email addresses. 29 30 Leaving a few cases dealing with IPR infringement, Indian courts have not expressly ruled on the legality of web scraping. However, since all common forms of electronic contracts are enforceable in India, violating the terms of use prohibiting data scraping will be a violation of the contract law. It will also violate the Information Technology Act, 2000, which penalizes unauthorized access to a computer resource or extracting data from a computer resource. The administrator of a website can use various measures to stop or slow a bot. Some techniques include:"
4,https://en.wikipedia.org/wiki/Web_scraping,https://en.wikipedia.org/wiki/Web_scraping#cite_ref-21,"Web scraping, web harvesting, or web data extraction is data scraping used for extracting data from websites. 1 Web scraping software may directly access the World Wide Web using the Hypertext Transfer Protocol or a web browser. While web scraping can be done manually by a software user, the term typically refers to automated processes implemented using a bot or web crawler. It is a form of copying in which specific data is gathered and copied from the web, typically into a central local database or spreadsheet, for later retrieval or analysis. Scraping a web page involves fetching it and extracting from it. Fetching is the downloading of a page (which a browser does when a user views a page). Therefore, web crawling is a main component of web scraping, to fetch pages for later processing. Once fetched, extraction can take place. The content of a page may be parsed, searched and reformatted, and its data copied into a spreadsheet or loaded into a database. Web scrapers typically take something out of a page, to make use of it for another purpose somewhere else. An example would be finding and copying names and telephone numbers, companies and their URLs, or e-mail addresses to a list (contact scraping). As well as contact scraping, web scraping is used as a component of applications used for web indexing, web mining and data mining, online price change monitoring and price comparison, product review scraping (to watch the competition), gathering real estate listings, weather data monitoring, website change detection, research, tracking online presence and reputation, web mashup, and web data integration. Web pages are built using text-based mark-up languages (HTML and XHTML), and frequently contain a wealth of useful data in text form. However, most web pages are designed for human end-users and not for ease of automated use. 
As a result, specialized tools and software have been developed to facilitate the scraping of web pages. Web scraping applications include market research, price comparison, content monitoring, and more. Businesses rely on web scraping services to efficiently gather and utilize this data. Newer forms of web scraping involve monitoring data feeds from web servers. For example, JSON is commonly used as a transport mechanism between the client and the web server. There are methods that some websites use to prevent web scraping, such as detecting and disallowing bots from crawling (viewing) their pages. In response, there are web scraping systems that rely on using techniques in DOM parsing, computer vision and natural language processing to simulate human browsing to enable gathering web page content for offline parsing Web scraping is the process of automatically mining data or collecting information from the World Wide Web. It is a field with active developments sharing a common goal with the semantic web vision, an ambitious initiative that still requires breakthroughs in text processing, semantic understanding, artificial intelligence and human-computer interactions. The simplest form of web scraping is manually copying and pasting data from a web page into a text file or spreadsheet. Sometimes even the best web-scraping technology cannot replace a human's manual examination and copy-and-paste, and sometimes this may be the only workable solution when the websites for scraping explicitly set up barriers to prevent machine automation. A simple yet powerful approach to extract information from web pages can be based on the UNIX grep command or regular expression-matching facilities of programming languages (for instance Perl or Python). Static and dynamic web pages can be retrieved by posting HTTP requests to the remote web server using socket programming. 
Many websites have large collections of pages generated dynamically from an underlying structured source like a database. Data of the same category are typically encoded into similar pages by a common script or template. In data mining, a program that detects such templates in a particular information source, extracts its content, and translates it into a relational form, is called a wrapper. Wrapper generation algorithms assume that input pages of a wrapper induction system conform to a common template and that they can be easily identified in terms of a URL common scheme. 3 Moreover, some semi-structured data query languages, such as XQuery and the HTQL, can be used to parse HTML pages and to retrieve and transform page content. By embedding a full-fledged web browser, such as the Internet Explorer or the Mozilla browser control, programs can retrieve the dynamic content generated by client-side scripts. These browser controls also parse web pages into a DOM tree, based on which programs can retrieve parts of the pages. Languages such as Xpath can be used to parse the resulting DOM tree. There are several companies that have developed vertical specific harvesting platforms. These platforms create and monitor a multitude of ""bots"" for specific verticals with no ""man in the loop"" (no direct human involvement), and no work related to a specific target site. The preparation involves establishing the knowledge base for the entire vertical and then the platform creates the bots automatically. The platform's robustness is measured by the quality of the information it retrieves (usually number of fields) and its scalability (how quick it can scale up to hundreds or thousands of sites). This scalability is mostly used to target the Long Tail of sites that common aggregators find complicated or too labor-intensive to harvest content from. 
The pages being scraped may embrace metadata or semantic markups and annotations, which can be used to locate specific data snippets. If the annotations are embedded in the pages, as Microformat does, this technique can be viewed as a special case of DOM parsing. In another case, the annotations, organized into a semantic layer, 4 are stored and managed separately from the web pages, so the scrapers can retrieve data schema and instructions from this layer before scraping the pages. There are efforts using machine learning and computer vision that attempt to identify and extract information from web pages by interpreting pages visually as a human being might. 5 Uses advanced AI to interpret and process web page content contextually, extracting relevant information, transforming data, and customizing outputs based on the content's structure and meaning. This method enables more intelligent and flexible data extraction, accommodating complex and dynamic web content. The world of web scraping offers a variety of software tools designed to simplify and customize the process of data extraction from websites. These tools vary in their approach and capabilities, making web scraping accessible to both novice users and advanced programmers. Some advanced web scraping software can automatically recognize the data structure of a web page, eliminating the need for manual coding. Others provide a recording interface that allows users to record their interactions with a website, thus creating a scraping script without writing a single line of code. Many tools also include scripting functions for more customized extraction and transformation of content, along with database interfaces to store the scraped data locally. Web scraping tools are versatile in their functionality. Some can directly extract data from APIs, while others are capable of handling websites with AJAX-based dynamic content loading or login requirements. 
Point-and-click software, for instance, empowers users without advanced coding skills to benefit from web scraping. This democratizes access to data, making it easier for a broader audience to leverage the power of web scraping. Popular Web Scraping Tools BeautifulSoup: A Python library that provides simple methods for extracting data from HTML and XML files. Scrapy: An open-source and collaborative web crawling framework for Python that allows you to extract the data, process it, and store it. Octoparse: A no-code web scraping tool that offers a user-friendly interface for extracting data from websites without needing programming skills. ParseHub: Another no-code web scraper that can handle dynamic content and works with AJAX-loaded sites. Apify: A platform that offers a wide range of scraping tools and the ability to create custom scrapers. InstantAPI.ai: An AI-powered tool that transforms any web page into personalized APIs instantly, offering advanced data extraction and customization. Some platforms provide not only tools for web scraping but also opportunities for developers to share and potentially monetize their scraping solutions. By leveraging these tools and platforms, users can unlock the full potential of web scraping, turning raw data into valuable insights and opportunities. 6 The legality of web scraping varies across the world. In general, web scraping may be against the terms of service of some websites, but the enforceability of these terms is unclear. 7 In the United States, website owners can use three major legal claims to prevent undesired web scraping: (1) copyright infringement (compilation), (2) violation of the Computer Fraud and Abuse Act (""CFAA""), and (3) trespass to chattel. 8 However, the effectiveness of these claims relies upon meeting various criteria, and the case law is still evolving. 
For example, with regard to copyright, while outright duplication of original expression will in many cases be illegal, in the United States the courts ruled in Feist Publications v. Rural Telephone Service that duplication of facts is allowable. U.S. courts have acknowledged that users of ""scrapers"" or ""robots"" may be held liable for committing trespass to chattels, 9 10 which involves a computer system itself being considered personal property upon which the user of a scraper is trespassing. The best known of these cases, eBay v. Bidder's Edge, resulted in an injunction ordering Bidder's Edge to stop accessing, collecting, and indexing auctions from the eBay web site. This case involved automatic placing of bids, known as auction sniping. However, in order to succeed on a claim of trespass to chattels, the plaintiff must demonstrate that the defendant intentionally and without authorization interfered with the plaintiff's possessory interest in the computer system and that the defendant's unauthorized use caused damage to the plaintiff. Not all cases of web spidering brought before the courts have been considered trespass to chattels. 11 One of the first major tests of screen scraping involved American Airlines (AA), and a firm called FareChase. 12 AA successfully obtained an injunction from a Texas trial court, stopping FareChase from selling software that enables users to compare online fares if the software also searches AA's website. The airline argued that FareChase's websearch software trespassed on AA's servers when it collected the publicly available data. FareChase filed an appeal in March 2003. By June, FareChase and AA agreed to settle and the appeal was dropped. 13 Southwest Airlines has also challenged screen-scraping practices, and has involved both FareChase and another firm, Outtask, in a legal claim. 
Southwest Airlines charged that the screen-scraping is Illegal since it is an example of ""Computer Fraud and Abuse"" and has led to ""Damage and Loss"" and ""Unauthorized Access"" of Southwest's site. It also constitutes ""Interference with Business Relations"", ""Trespass"", and ""Harmful Access by Computer"". They also claimed that screen-scraping constitutes what is legally known as ""Misappropriation and Unjust Enrichment"", as well as being a breach of the web site's user agreement. Outtask denied all these claims, claiming that the prevailing law, in this case, should be US Copyright law and that under copyright, the pieces of information being scraped would not be subject to copyright protection. Although the cases were never resolved in the Supreme Court of the United States, FareChase was eventually shuttered by parent company Yahoo , and Outtask was purchased by travel expense company Concur. 14 In 2012, a startup called 3Taps scraped classified housing ads from Craigslist. Craigslist sent 3Taps a cease-and-desist letter and blocked their IP addresses and later sued, in Craigslist v. 3Taps. The court held that the cease-and-desist letter and IP blocking was sufficient for Craigslist to properly claim that 3Taps had violated the Computer Fraud and Abuse Act (CFAA). Although these are early scraping decisions, and the theories of liability are not uniform, it is difficult to ignore a pattern emerging that the courts are prepared to protect proprietary content on commercial sites from uses which are undesirable to the owners of such sites. However, the degree of protection for such content is not settled and will depend on the type of access made by the scraper, the amount of information accessed and copied, the degree to which the access adversely affects the site owner's system and the types and manner of prohibitions on such conduct. 
15 While the law in this area becomes more settled, entities contemplating using scraping programs to access a public web site should also consider whether such action is authorized by reviewing the terms of use and other terms or notices posted on or made available through the site. In a 2010 ruling in the Cvent, Inc. v. Eventbrite, Inc. In the United States district court for the eastern district of Virginia, the court ruled that the terms of use should be brought to the users' attention In order for a browse wrap contract or license to be enforced. 16 In a 2014 case, filed in the United States District Court for the Eastern District of Pennsylvania, 17 e-commerce site QVC objected to the Pinterest-like shopping aggregator Resultly's 'scraping of QVC's site for real-time pricing data. QVC alleges that Resultly ""excessively crawled"" QVC's retail site (allegedly sending 200 300 search requests to QVC's website per minute, sometimes to up to 36,000 requests per minute) which caused QVC's site to crash for two days, resulting in lost sales for QVC. 18 QVC's complaint alleges that the defendant disguised its web crawler to mask its source IP address and thus prevented QVC from quickly repairing the problem. This is a particularly interesting scraping case because QVC is seeking damages for the unavailability of their website, which QVC claims was caused by Resultly. In the plaintiff's web site during the period of this trial, the terms of use link are displayed among all the links of the site, at the bottom of the page as most sites on the internet. This ruling contradicts the Irish ruling described below. The court also rejected the plaintiff's argument that the browse-wrap restrictions were enforceable in view of Virginia's adoption of the Uniform Computer Information Transactions Act (UCITA)—a uniform law that many believed was in favor on common browse-wrap contracting practices. 19 In Facebook, Inc. v. 
Power Ventures, Inc., a district court ruled in 2012 that Power Ventures could not scrape Facebook pages on behalf of a Facebook user. The case is on appeal, and the Electronic Frontier Foundation filed a brief in 2015 asking that it be overturned. 20 21 In Associated Press v. Meltwater U.S. Holdings, Inc., a court in the US held Meltwater liable for scraping and republishing news information from the Associated Press, but a court in the United Kingdom held in favor of Meltwater. The Ninth Circuit ruled in 2019 that web scraping did not violate the CFAA in hiQ Labs v. LinkedIn. The case was appealed to the United States Supreme Court, which returned the case to the Ninth Circuit to reconsider the case in light of the 2021 Supreme Court decision in Van Buren v. United States which narrowed the applicability of the CFAA. 22 On this review, the Ninth Circuit upheld their prior decision. 23 Internet Archive collects and distributes a significant number of publicly available web pages without being considered to be in violation of copyright laws. citation needed In February 2006, the Danish Maritime and Commercial Court (Copenhagen) ruled that systematic crawling, indexing, and deep linking by portal site ofir.dk of real estate site Home.dk does not conflict with Danish law or the database directive of the European Union. 24 In a February 2010 case complicated by matters of jurisdiction, Ireland's High Court delivered a verdict that illustrates the inchoate state of developing case law. In the case of Ryanair Ltd v Billigfluege.de GmbH, Ireland's High Court ruled Ryanair's ""click-wrap"" agreement to be legally binding. 
In contrast to the findings of the United States District Court Eastern District of Virginia and those of the Danish Maritime and Commercial Court, Justice Michael Hanna ruled that the hyperlink to Ryanair's terms and conditions was plainly visible, and that placing the onus on the user to agree to terms and conditions in order to gain access to online services is sufficient to comprise a contractual relationship. 25 The decision is under appeal in Ireland's Supreme Court. 26 On April 30, 2020, the French Data Protection Authority (CNIL) released new guidelines on web scraping. 27 The CNIL guidelines made it clear that publicly available data is still personal data and cannot be repurposed without the knowledge of the person to whom that data belongs. 28 In Australia, the Spam Act 2003 outlaws some forms of web harvesting, although this only applies to email addresses. 29 30 Leaving a few cases dealing with IPR infringement, Indian courts have not expressly ruled on the legality of web scraping. However, since all common forms of electronic contracts are enforceable in India, violating the terms of use prohibiting data scraping will be a violation of the contract law. It will also violate the Information Technology Act, 2000, which penalizes unauthorized access to a computer resource or extracting data from a computer resource. The administrator of a website can use various measures to stop or slow a bot. Some techniques include:"
5,https://en.wikipedia.org/wiki/Web_scraping,https://en.wikipedia.org/wiki/Web_scraping,"Web scraping, web harvesting, or web data extraction is data scraping used for extracting data from websites. 1 Web scraping software may directly access the World Wide Web using the Hypertext Transfer Protocol or a web browser. While web scraping can be done manually by a software user, the term typically refers to automated processes implemented using a bot or web crawler. It is a form of copying in which specific data is gathered and copied from the web, typically into a central local database or spreadsheet, for later retrieval or analysis. Scraping a web page involves fetching it and extracting from it. Fetching is the downloading of a page (which a browser does when a user views a page). Therefore, web crawling is a main component of web scraping, to fetch pages for later processing. Once fetched, extraction can take place. The content of a page may be parsed, searched and reformatted, and its data copied into a spreadsheet or loaded into a database. Web scrapers typically take something out of a page, to make use of it for another purpose somewhere else. An example would be finding and copying names and telephone numbers, companies and their URLs, or e-mail addresses to a list (contact scraping). As well as contact scraping, web scraping is used as a component of applications used for web indexing, web mining and data mining, online price change monitoring and price comparison, product review scraping (to watch the competition), gathering real estate listings, weather data monitoring, website change detection, research, tracking online presence and reputation, web mashup, and web data integration. Web pages are built using text-based mark-up languages (HTML and XHTML), and frequently contain a wealth of useful data in text form. However, most web pages are designed for human end-users and not for ease of automated use. 
As a result, specialized tools and software have been developed to facilitate the scraping of web pages. Web scraping applications include market research, price comparison, content monitoring, and more. Businesses rely on web scraping services to efficiently gather and utilize this data. Newer forms of web scraping involve monitoring data feeds from web servers. For example, JSON is commonly used as a transport mechanism between the client and the web server. There are methods that some websites use to prevent web scraping, such as detecting and disallowing bots from crawling (viewing) their pages. In response, there are web scraping systems that rely on using techniques in DOM parsing, computer vision and natural language processing to simulate human browsing to enable gathering web page content for offline parsing Web scraping is the process of automatically mining data or collecting information from the World Wide Web. It is a field with active developments sharing a common goal with the semantic web vision, an ambitious initiative that still requires breakthroughs in text processing, semantic understanding, artificial intelligence and human-computer interactions. The simplest form of web scraping is manually copying and pasting data from a web page into a text file or spreadsheet. Sometimes even the best web-scraping technology cannot replace a human's manual examination and copy-and-paste, and sometimes this may be the only workable solution when the websites for scraping explicitly set up barriers to prevent machine automation. A simple yet powerful approach to extract information from web pages can be based on the UNIX grep command or regular expression-matching facilities of programming languages (for instance Perl or Python). Static and dynamic web pages can be retrieved by posting HTTP requests to the remote web server using socket programming. 
Many websites have large collections of pages generated dynamically from an underlying structured source like a database. Data of the same category are typically encoded into similar pages by a common script or template. In data mining, a program that detects such templates in a particular information source, extracts its content, and translates it into a relational form, is called a wrapper. Wrapper generation algorithms assume that input pages of a wrapper induction system conform to a common template and that they can be easily identified in terms of a URL common scheme. 3 Moreover, some semi-structured data query languages, such as XQuery and the HTQL, can be used to parse HTML pages and to retrieve and transform page content. By embedding a full-fledged web browser, such as the Internet Explorer or the Mozilla browser control, programs can retrieve the dynamic content generated by client-side scripts. These browser controls also parse web pages into a DOM tree, based on which programs can retrieve parts of the pages. Languages such as Xpath can be used to parse the resulting DOM tree. There are several companies that have developed vertical specific harvesting platforms. These platforms create and monitor a multitude of ""bots"" for specific verticals with no ""man in the loop"" (no direct human involvement), and no work related to a specific target site. The preparation involves establishing the knowledge base for the entire vertical and then the platform creates the bots automatically. The platform's robustness is measured by the quality of the information it retrieves (usually number of fields) and its scalability (how quick it can scale up to hundreds or thousands of sites). This scalability is mostly used to target the Long Tail of sites that common aggregators find complicated or too labor-intensive to harvest content from. 
The pages being scraped may embrace metadata or semantic markups and annotations, which can be used to locate specific data snippets. If the annotations are embedded in the pages, as Microformat does, this technique can be viewed as a special case of DOM parsing. In another case, the annotations, organized into a semantic layer, 4 are stored and managed separately from the web pages, so the scrapers can retrieve data schema and instructions from this layer before scraping the pages. There are efforts using machine learning and computer vision that attempt to identify and extract information from web pages by interpreting pages visually as a human being might. 5 Uses advanced AI to interpret and process web page content contextually, extracting relevant information, transforming data, and customizing outputs based on the content's structure and meaning. This method enables more intelligent and flexible data extraction, accommodating complex and dynamic web content. The world of web scraping offers a variety of software tools designed to simplify and customize the process of data extraction from websites. These tools vary in their approach and capabilities, making web scraping accessible to both novice users and advanced programmers. Some advanced web scraping software can automatically recognize the data structure of a web page, eliminating the need for manual coding. Others provide a recording interface that allows users to record their interactions with a website, thus creating a scraping script without writing a single line of code. Many tools also include scripting functions for more customized extraction and transformation of content, along with database interfaces to store the scraped data locally. Web scraping tools are versatile in their functionality. Some can directly extract data from APIs, while others are capable of handling websites with AJAX-based dynamic content loading or login requirements. 
Point-and-click software, for instance, empowers users without advanced coding skills to benefit from web scraping. This democratizes access to data, making it easier for a broader audience to leverage the power of web scraping. Popular Web Scraping Tools BeautifulSoup: A Python library that provides simple methods for extracting data from HTML and XML files. Scrapy: An open-source and collaborative web crawling framework for Python that allows you to extract the data, process it, and store it. Octoparse: A no-code web scraping tool that offers a user-friendly interface for extracting data from websites without needing programming skills. ParseHub: Another no-code web scraper that can handle dynamic content and works with AJAX-loaded sites. Apify: A platform that offers a wide range of scraping tools and the ability to create custom scrapers. InstantAPI.ai: An AI-powered tool that transforms any web page into personalized APIs instantly, offering advanced data extraction and customization. Some platforms provide not only tools for web scraping but also opportunities for developers to share and potentially monetize their scraping solutions. By leveraging these tools and platforms, users can unlock the full potential of web scraping, turning raw data into valuable insights and opportunities. 6 The legality of web scraping varies across the world. In general, web scraping may be against the terms of service of some websites, but the enforceability of these terms is unclear. 7 In the United States, website owners can use three major legal claims to prevent undesired web scraping: (1) copyright infringement (compilation), (2) violation of the Computer Fraud and Abuse Act (""CFAA""), and (3) trespass to chattel. 8 However, the effectiveness of these claims relies upon meeting various criteria, and the case law is still evolving. 
For example, with regard to copyright, while outright duplication of original expression will in many cases be illegal, in the United States the courts ruled in Feist Publications v. Rural Telephone Service that duplication of facts is allowable. U.S. courts have acknowledged that users of ""scrapers"" or ""robots"" may be held liable for committing trespass to chattels, 9 10 which involves a computer system itself being considered personal property upon which the user of a scraper is trespassing. The best known of these cases, eBay v. Bidder's Edge, resulted in an injunction ordering Bidder's Edge to stop accessing, collecting, and indexing auctions from the eBay web site. This case involved automatic placing of bids, known as auction sniping. However, in order to succeed on a claim of trespass to chattels, the plaintiff must demonstrate that the defendant intentionally and without authorization interfered with the plaintiff's possessory interest in the computer system and that the defendant's unauthorized use caused damage to the plaintiff. Not all cases of web spidering brought before the courts have been considered trespass to chattels. 11 One of the first major tests of screen scraping involved American Airlines (AA), and a firm called FareChase. 12 AA successfully obtained an injunction from a Texas trial court, stopping FareChase from selling software that enables users to compare online fares if the software also searches AA's website. The airline argued that FareChase's websearch software trespassed on AA's servers when it collected the publicly available data. FareChase filed an appeal in March 2003. By June, FareChase and AA agreed to settle and the appeal was dropped. 13 Southwest Airlines has also challenged screen-scraping practices, and has involved both FareChase and another firm, Outtask, in a legal claim. 
Southwest Airlines charged that the screen-scraping is Illegal since it is an example of ""Computer Fraud and Abuse"" and has led to ""Damage and Loss"" and ""Unauthorized Access"" of Southwest's site. It also constitutes ""Interference with Business Relations"", ""Trespass"", and ""Harmful Access by Computer"". They also claimed that screen-scraping constitutes what is legally known as ""Misappropriation and Unjust Enrichment"", as well as being a breach of the web site's user agreement. Outtask denied all these claims, claiming that the prevailing law, in this case, should be US Copyright law and that under copyright, the pieces of information being scraped would not be subject to copyright protection. Although the cases were never resolved in the Supreme Court of the United States, FareChase was eventually shuttered by parent company Yahoo , and Outtask was purchased by travel expense company Concur. 14 In 2012, a startup called 3Taps scraped classified housing ads from Craigslist. Craigslist sent 3Taps a cease-and-desist letter and blocked their IP addresses and later sued, in Craigslist v. 3Taps. The court held that the cease-and-desist letter and IP blocking was sufficient for Craigslist to properly claim that 3Taps had violated the Computer Fraud and Abuse Act (CFAA). Although these are early scraping decisions, and the theories of liability are not uniform, it is difficult to ignore a pattern emerging that the courts are prepared to protect proprietary content on commercial sites from uses which are undesirable to the owners of such sites. However, the degree of protection for such content is not settled and will depend on the type of access made by the scraper, the amount of information accessed and copied, the degree to which the access adversely affects the site owner's system and the types and manner of prohibitions on such conduct. 
15 While the law in this area becomes more settled, entities contemplating using scraping programs to access a public web site should also consider whether such action is authorized by reviewing the terms of use and other terms or notices posted on or made available through the site. In a 2010 ruling in the Cvent, Inc. v. Eventbrite, Inc. In the United States district court for the eastern district of Virginia, the court ruled that the terms of use should be brought to the users' attention In order for a browse wrap contract or license to be enforced. 16 In a 2014 case, filed in the United States District Court for the Eastern District of Pennsylvania, 17 e-commerce site QVC objected to the Pinterest-like shopping aggregator Resultly's 'scraping of QVC's site for real-time pricing data. QVC alleges that Resultly ""excessively crawled"" QVC's retail site (allegedly sending 200 300 search requests to QVC's website per minute, sometimes to up to 36,000 requests per minute) which caused QVC's site to crash for two days, resulting in lost sales for QVC. 18 QVC's complaint alleges that the defendant disguised its web crawler to mask its source IP address and thus prevented QVC from quickly repairing the problem. This is a particularly interesting scraping case because QVC is seeking damages for the unavailability of their website, which QVC claims was caused by Resultly. In the plaintiff's web site during the period of this trial, the terms of use link are displayed among all the links of the site, at the bottom of the page as most sites on the internet. This ruling contradicts the Irish ruling described below. The court also rejected the plaintiff's argument that the browse-wrap restrictions were enforceable in view of Virginia's adoption of the Uniform Computer Information Transactions Act (UCITA)—a uniform law that many believed was in favor on common browse-wrap contracting practices. 19 In Facebook, Inc. v. 
Power Ventures, Inc., a district court ruled in 2012 that Power Ventures could not scrape Facebook pages on behalf of a Facebook user. The case is on appeal, and the Electronic Frontier Foundation filed a brief in 2015 asking that it be overturned. 20 21 In Associated Press v. Meltwater U.S. Holdings, Inc., a court in the US held Meltwater liable for scraping and republishing news information from the Associated Press, but a court in the United Kingdom held in favor of Meltwater. The Ninth Circuit ruled in 2019 that web scraping did not violate the CFAA in hiQ Labs v. LinkedIn. The case was appealed to the United States Supreme Court, which returned the case to the Ninth Circuit to reconsider the case in light of the 2021 Supreme Court decision in Van Buren v. United States which narrowed the applicability of the CFAA. 22 On this review, the Ninth Circuit upheld their prior decision. 23 Internet Archive collects and distributes a significant number of publicly available web pages without being considered to be in violation of copyright laws. citation needed In February 2006, the Danish Maritime and Commercial Court (Copenhagen) ruled that systematic crawling, indexing, and deep linking by portal site ofir.dk of real estate site Home.dk does not conflict with Danish law or the database directive of the European Union. 24 In a February 2010 case complicated by matters of jurisdiction, Ireland's High Court delivered a verdict that illustrates the inchoate state of developing case law. In the case of Ryanair Ltd v Billigfluege.de GmbH, Ireland's High Court ruled Ryanair's ""click-wrap"" agreement to be legally binding. 
In contrast to the findings of the United States District Court Eastern District of Virginia and those of the Danish Maritime and Commercial Court, Justice Michael Hanna ruled that the hyperlink to Ryanair's terms and conditions was plainly visible, and that placing the onus on the user to agree to terms and conditions in order to gain access to online services is sufficient to comprise a contractual relationship. 25 The decision is under appeal in Ireland's Supreme Court. 26 On April 30, 2020, the French Data Protection Authority (CNIL) released new guidelines on web scraping. 27 The CNIL guidelines made it clear that publicly available data is still personal data and cannot be repurposed without the knowledge of the person to whom that data belongs. 28 In Australia, the Spam Act 2003 outlaws some forms of web harvesting, although this only applies to email addresses. 29 30 Leaving a few cases dealing with IPR infringement, Indian courts have not expressly ruled on the legality of web scraping. However, since all common forms of electronic contracts are enforceable in India, violating the terms of use prohibiting data scraping will be a violation of the contract law. It will also violate the Information Technology Act, 2000, which penalizes unauthorized access to a computer resource or extracting data from a computer resource. The administrator of a website can use various measures to stop or slow a bot. Some techniques include:"
6,https://en.wikipedia.org/wiki/Web_scraping,https://en.wikipedia.org/wiki/Web_scraping#References,"Web scraping, web harvesting, or web data extraction is data scraping used for extracting data from websites. 1 Web scraping software may directly access the World Wide Web using the Hypertext Transfer Protocol or a web browser. While web scraping can be done manually by a software user, the term typically refers to automated processes implemented using a bot or web crawler. It is a form of copying in which specific data is gathered and copied from the web, typically into a central local database or spreadsheet, for later retrieval or analysis. Scraping a web page involves fetching it and extracting from it. Fetching is the downloading of a page (which a browser does when a user views a page). Therefore, web crawling is a main component of web scraping, to fetch pages for later processing. Once fetched, extraction can take place. The content of a page may be parsed, searched and reformatted, and its data copied into a spreadsheet or loaded into a database. Web scrapers typically take something out of a page, to make use of it for another purpose somewhere else. An example would be finding and copying names and telephone numbers, companies and their URLs, or e-mail addresses to a list (contact scraping). As well as contact scraping, web scraping is used as a component of applications used for web indexing, web mining and data mining, online price change monitoring and price comparison, product review scraping (to watch the competition), gathering real estate listings, weather data monitoring, website change detection, research, tracking online presence and reputation, web mashup, and web data integration. Web pages are built using text-based mark-up languages (HTML and XHTML), and frequently contain a wealth of useful data in text form. However, most web pages are designed for human end-users and not for ease of automated use. 
As a result, specialized tools and software have been developed to facilitate the scraping of web pages. Web scraping applications include market research, price comparison, content monitoring, and more. Businesses rely on web scraping services to efficiently gather and utilize this data. Newer forms of web scraping involve monitoring data feeds from web servers. For example, JSON is commonly used as a transport mechanism between the client and the web server. There are methods that some websites use to prevent web scraping, such as detecting and disallowing bots from crawling (viewing) their pages. In response, there are web scraping systems that rely on using techniques in DOM parsing, computer vision and natural language processing to simulate human browsing to enable gathering web page content for offline parsing Web scraping is the process of automatically mining data or collecting information from the World Wide Web. It is a field with active developments sharing a common goal with the semantic web vision, an ambitious initiative that still requires breakthroughs in text processing, semantic understanding, artificial intelligence and human-computer interactions. The simplest form of web scraping is manually copying and pasting data from a web page into a text file or spreadsheet. Sometimes even the best web-scraping technology cannot replace a human's manual examination and copy-and-paste, and sometimes this may be the only workable solution when the websites for scraping explicitly set up barriers to prevent machine automation. A simple yet powerful approach to extract information from web pages can be based on the UNIX grep command or regular expression-matching facilities of programming languages (for instance Perl or Python). Static and dynamic web pages can be retrieved by posting HTTP requests to the remote web server using socket programming. 
Many websites have large collections of pages generated dynamically from an underlying structured source like a database. Data of the same category are typically encoded into similar pages by a common script or template. In data mining, a program that detects such templates in a particular information source, extracts its content, and translates it into a relational form, is called a wrapper. Wrapper generation algorithms assume that input pages of a wrapper induction system conform to a common template and that they can be easily identified in terms of a URL common scheme. 3 Moreover, some semi-structured data query languages, such as XQuery and the HTQL, can be used to parse HTML pages and to retrieve and transform page content. By embedding a full-fledged web browser, such as the Internet Explorer or the Mozilla browser control, programs can retrieve the dynamic content generated by client-side scripts. These browser controls also parse web pages into a DOM tree, based on which programs can retrieve parts of the pages. Languages such as Xpath can be used to parse the resulting DOM tree. There are several companies that have developed vertical specific harvesting platforms. These platforms create and monitor a multitude of ""bots"" for specific verticals with no ""man in the loop"" (no direct human involvement), and no work related to a specific target site. The preparation involves establishing the knowledge base for the entire vertical and then the platform creates the bots automatically. The platform's robustness is measured by the quality of the information it retrieves (usually number of fields) and its scalability (how quick it can scale up to hundreds or thousands of sites). This scalability is mostly used to target the Long Tail of sites that common aggregators find complicated or too labor-intensive to harvest content from. 
The pages being scraped may embrace metadata or semantic markups and annotations, which can be used to locate specific data snippets. If the annotations are embedded in the pages, as Microformat does, this technique can be viewed as a special case of DOM parsing. In another case, the annotations, organized into a semantic layer, 4 are stored and managed separately from the web pages, so the scrapers can retrieve data schema and instructions from this layer before scraping the pages. There are efforts using machine learning and computer vision that attempt to identify and extract information from web pages by interpreting pages visually as a human being might. 5 Uses advanced AI to interpret and process web page content contextually, extracting relevant information, transforming data, and customizing outputs based on the content's structure and meaning. This method enables more intelligent and flexible data extraction, accommodating complex and dynamic web content. The world of web scraping offers a variety of software tools designed to simplify and customize the process of data extraction from websites. These tools vary in their approach and capabilities, making web scraping accessible to both novice users and advanced programmers. Some advanced web scraping software can automatically recognize the data structure of a web page, eliminating the need for manual coding. Others provide a recording interface that allows users to record their interactions with a website, thus creating a scraping script without writing a single line of code. Many tools also include scripting functions for more customized extraction and transformation of content, along with database interfaces to store the scraped data locally. Web scraping tools are versatile in their functionality. Some can directly extract data from APIs, while others are capable of handling websites with AJAX-based dynamic content loading or login requirements. 
Point-and-click software, for instance, empowers users without advanced coding skills to benefit from web scraping. This democratizes access to data, making it easier for a broader audience to leverage the power of web scraping. Popular Web Scraping Tools BeautifulSoup: A Python library that provides simple methods for extracting data from HTML and XML files. Scrapy: An open-source and collaborative web crawling framework for Python that allows you to extract the data, process it, and store it. Octoparse: A no-code web scraping tool that offers a user-friendly interface for extracting data from websites without needing programming skills. ParseHub: Another no-code web scraper that can handle dynamic content and works with AJAX-loaded sites. Apify: A platform that offers a wide range of scraping tools and the ability to create custom scrapers. InstantAPI.ai: An AI-powered tool that transforms any web page into personalized APIs instantly, offering advanced data extraction and customization. Some platforms provide not only tools for web scraping but also opportunities for developers to share and potentially monetize their scraping solutions. By leveraging these tools and platforms, users can unlock the full potential of web scraping, turning raw data into valuable insights and opportunities. 6 The legality of web scraping varies across the world. In general, web scraping may be against the terms of service of some websites, but the enforceability of these terms is unclear. 7 In the United States, website owners can use three major legal claims to prevent undesired web scraping: (1) copyright infringement (compilation), (2) violation of the Computer Fraud and Abuse Act (""CFAA""), and (3) trespass to chattel. 8 However, the effectiveness of these claims relies upon meeting various criteria, and the case law is still evolving. 
For example, with regard to copyright, while outright duplication of original expression will in many cases be illegal, in the United States the courts ruled in Feist Publications v. Rural Telephone Service that duplication of facts is allowable. U.S. courts have acknowledged that users of ""scrapers"" or ""robots"" may be held liable for committing trespass to chattels, 9 10 which involves a computer system itself being considered personal property upon which the user of a scraper is trespassing. The best known of these cases, eBay v. Bidder's Edge, resulted in an injunction ordering Bidder's Edge to stop accessing, collecting, and indexing auctions from the eBay web site. This case involved automatic placing of bids, known as auction sniping. However, in order to succeed on a claim of trespass to chattels, the plaintiff must demonstrate that the defendant intentionally and without authorization interfered with the plaintiff's possessory interest in the computer system and that the defendant's unauthorized use caused damage to the plaintiff. Not all cases of web spidering brought before the courts have been considered trespass to chattels. 11 One of the first major tests of screen scraping involved American Airlines (AA), and a firm called FareChase. 12 AA successfully obtained an injunction from a Texas trial court, stopping FareChase from selling software that enables users to compare online fares if the software also searches AA's website. The airline argued that FareChase's websearch software trespassed on AA's servers when it collected the publicly available data. FareChase filed an appeal in March 2003. By June, FareChase and AA agreed to settle and the appeal was dropped. 13 Southwest Airlines has also challenged screen-scraping practices, and has involved both FareChase and another firm, Outtask, in a legal claim. 
Southwest Airlines charged that the screen-scraping is Illegal since it is an example of ""Computer Fraud and Abuse"" and has led to ""Damage and Loss"" and ""Unauthorized Access"" of Southwest's site. It also constitutes ""Interference with Business Relations"", ""Trespass"", and ""Harmful Access by Computer"". They also claimed that screen-scraping constitutes what is legally known as ""Misappropriation and Unjust Enrichment"", as well as being a breach of the web site's user agreement. Outtask denied all these claims, claiming that the prevailing law, in this case, should be US Copyright law and that under copyright, the pieces of information being scraped would not be subject to copyright protection. Although the cases were never resolved in the Supreme Court of the United States, FareChase was eventually shuttered by parent company Yahoo , and Outtask was purchased by travel expense company Concur. 14 In 2012, a startup called 3Taps scraped classified housing ads from Craigslist. Craigslist sent 3Taps a cease-and-desist letter and blocked their IP addresses and later sued, in Craigslist v. 3Taps. The court held that the cease-and-desist letter and IP blocking was sufficient for Craigslist to properly claim that 3Taps had violated the Computer Fraud and Abuse Act (CFAA). Although these are early scraping decisions, and the theories of liability are not uniform, it is difficult to ignore a pattern emerging that the courts are prepared to protect proprietary content on commercial sites from uses which are undesirable to the owners of such sites. However, the degree of protection for such content is not settled and will depend on the type of access made by the scraper, the amount of information accessed and copied, the degree to which the access adversely affects the site owner's system and the types and manner of prohibitions on such conduct. 
15 While the law in this area becomes more settled, entities contemplating using scraping programs to access a public web site should also consider whether such action is authorized by reviewing the terms of use and other terms or notices posted on or made available through the site. In a 2010 ruling in the Cvent, Inc. v. Eventbrite, Inc. In the United States district court for the eastern district of Virginia, the court ruled that the terms of use should be brought to the users' attention In order for a browse wrap contract or license to be enforced. 16 In a 2014 case, filed in the United States District Court for the Eastern District of Pennsylvania, 17 e-commerce site QVC objected to the Pinterest-like shopping aggregator Resultly's 'scraping of QVC's site for real-time pricing data. QVC alleges that Resultly ""excessively crawled"" QVC's retail site (allegedly sending 200 300 search requests to QVC's website per minute, sometimes to up to 36,000 requests per minute) which caused QVC's site to crash for two days, resulting in lost sales for QVC. 18 QVC's complaint alleges that the defendant disguised its web crawler to mask its source IP address and thus prevented QVC from quickly repairing the problem. This is a particularly interesting scraping case because QVC is seeking damages for the unavailability of their website, which QVC claims was caused by Resultly. In the plaintiff's web site during the period of this trial, the terms of use link are displayed among all the links of the site, at the bottom of the page as most sites on the internet. This ruling contradicts the Irish ruling described below. The court also rejected the plaintiff's argument that the browse-wrap restrictions were enforceable in view of Virginia's adoption of the Uniform Computer Information Transactions Act (UCITA)—a uniform law that many believed was in favor on common browse-wrap contracting practices. 19 In Facebook, Inc. v. 
Power Ventures, Inc., a district court ruled in 2012 that Power Ventures could not scrape Facebook pages on behalf of a Facebook user. The case is on appeal, and the Electronic Frontier Foundation filed a brief in 2015 asking that it be overturned. 20 21 In Associated Press v. Meltwater U.S. Holdings, Inc., a court in the US held Meltwater liable for scraping and republishing news information from the Associated Press, but a court in the United Kingdom held in favor of Meltwater. The Ninth Circuit ruled in 2019 that web scraping did not violate the CFAA in hiQ Labs v. LinkedIn. The case was appealed to the United States Supreme Court, which returned the case to the Ninth Circuit to reconsider the case in light of the 2021 Supreme Court decision in Van Buren v. United States which narrowed the applicability of the CFAA. 22 On this review, the Ninth Circuit upheld their prior decision. 23 Internet Archive collects and distributes a significant number of publicly available web pages without being considered to be in violation of copyright laws. citation needed In February 2006, the Danish Maritime and Commercial Court (Copenhagen) ruled that systematic crawling, indexing, and deep linking by portal site ofir.dk of real estate site Home.dk does not conflict with Danish law or the database directive of the European Union. 24 In a February 2010 case complicated by matters of jurisdiction, Ireland's High Court delivered a verdict that illustrates the inchoate state of developing case law. In the case of Ryanair Ltd v Billigfluege.de GmbH, Ireland's High Court ruled Ryanair's ""click-wrap"" agreement to be legally binding. 
In contrast to the findings of the United States District Court Eastern District of Virginia and those of the Danish Maritime and Commercial Court, Justice Michael Hanna ruled that the hyperlink to Ryanair's terms and conditions was plainly visible, and that placing the onus on the user to agree to terms and conditions in order to gain access to online services is sufficient to comprise a contractual relationship. 25 The decision is under appeal in Ireland's Supreme Court. 26 On April 30, 2020, the French Data Protection Authority (CNIL) released new guidelines on web scraping. 27 The CNIL guidelines made it clear that publicly available data is still personal data and cannot be repurposed without the knowledge of the person to whom that data belongs. 28 In Australia, the Spam Act 2003 outlaws some forms of web harvesting, although this only applies to email addresses. 29 30 Leaving a few cases dealing with IPR infringement, Indian courts have not expressly ruled on the legality of web scraping. However, since all common forms of electronic contracts are enforceable in India, violating the terms of use prohibiting data scraping will be a violation of the contract law. It will also violate the Information Technology Act, 2000, which penalizes unauthorized access to a computer resource or extracting data from a computer resource. The administrator of a website can use various measures to stop or slow a bot. Some techniques include:"
7,https://en.wikipedia.org/wiki/Data_scraping,https://en.wikipedia.org/w/index.php?title=Data_scraping&action=edit,"You do not have permission to edit this page, for the following reasons: The IP address or range 180.190.0.0 16 has been blocked by Stwalkerster for the following reason(s): Editing from this range has been disabled (blocked) in response to abuse. A range may be shared by many users and innocent users may be affected; if you believe that you are not the person this block is intended for, please follow the instructions below: If you have an account: Please log in to edit. In rare cases, in response to serious abuse, logged-in editing may also be disabled. If you still cannot edit, place unblock on your talk page and make reference to this message. You may wish to ping the blocking administrator or email them via the ""email this user"" function. If you do not have an account: Registered users are still able to edit. If you cannot create an account from this or another network, you may request that volunteers create your username for you. Please follow the instructions at Wikipedia:Request an account to request an account under your preferred username. It may take some time to process your request. Administrators: Please consult the blocking administrator before altering or lifting this block, and consider consulting with a CheckUser before granting an IP block exemption to an editor using this range. Note that large or hard (logged-in editing blocked) rangeblocks are usually only made in response to serious abuse, and the blocking admin may have information about this block which is essential to reviewing any unblock request. This block will expire on 09:06, 22 May 2032. Your current IP address is 180.190.75.212. Even when blocked, you will usually still be able to edit your user talk page, as well as email administrators and other editors. For information on how to proceed, please read the FAQ for blocked users and the guideline on block appeals. 
The guide to appealing blocks may also be helpful. Other useful links: Blocking policy Help:I have been blocked This block affects editing on all Wikimedia wikis. The IP address or range 180.190.0.0 16 has been globally blocked by for the following reason(s): Long-term abuse: If you are affected by this block, please message us: request This block will expire on 17:48, 21 December 2024. Your current IP address is 180.190.75.212. Even while globally blocked, you will usually still be able to edit pages on Meta-Wiki. If you believe you were blocked by mistake, you can find additional information and instructions in the No open proxies global policy. Otherwise, to discuss the block please post a request for review on Meta-Wiki. You could also send an email to the stewards VRT queue at stewards wikimedia.org including all above details. Other useful links: Global blocks Help:I have been blocked Pages transcluded onto the current version of this page (help): Return to Data scraping."
8,https://en.wikipedia.org/wiki/Data_scraping,https://en.wikipedia.org/wiki/Data_scraping#cite_ref-3,"Data scraping is a technique where a computer program extracts data from human-readable output coming from another program. Normally, data transfer between programs is accomplished using data structures suited for automated processing by computers, not people. Such interchange formats and protocols are typically rigidly structured, well-documented, easily parsed, and minimize ambiguity. Very often, these transmissions are not human-readable at all. Thus, the key element that distinguishes data scraping from regular parsing is that the output being scraped is intended for display to an end-user, rather than as an input to another program. It is therefore usually neither documented nor structured for convenient parsing. Data scraping often involves ignoring binary data (usually images or multimedia data), display formatting, redundant labels, superfluous commentary, and other information which is either irrelevant or hinders automated processing. Data scraping is most often done either to interface to a legacy system, which has no other mechanism which is compatible with current hardware, or to interface to a third-party system which does not provide a more convenient API. In the second case, the operator of the third-party system will often see screen scraping as unwanted, due to reasons such as increased system load, the loss of advertisement revenue, or the loss of control of the information content. Data scraping is generally considered an ad hoc, inelegant technique, often used only as a ""last resort"" when no other mechanism for data interchange is available. Aside from the higher programming and processing overhead, output displays intended for human consumption often change structure frequently. Humans can cope with this easily, but a computer program will fail. 
Depending on the quality and the extent of error handling logic present in the computer, this failure can result in error messages, corrupted output or even program crashes. However, setting up a data scraping pipeline nowadays is straightforward, requiring minimal programming effort to meet practical needs (especially in biomedical data integration). 1 Although the use of physical ""dumb terminal"" IBM 3270s is slowly diminishing, as more and more mainframe applications acquire Web interfaces, some Web applications merely continue to use the technique of screen scraping to capture old screens and transfer the data to modern front-ends. 2 Screen scraping is normally associated with the programmatic collection of visual data from a source, instead of parsing data as in web scraping. Originally, screen scraping referred to the practice of reading text data from a computer display terminal's screen. This was generally done by reading the terminal's memory through its auxiliary port, or by connecting the terminal output port of one computer system to an input port on another. The term screen scraping is also commonly used to refer to the bidirectional exchange of data. This could be the simple cases where the controlling program navigates through the user interface, or more complex scenarios where the controlling program is entering data into an interface meant to be used by a human. As a concrete example of a classic screen scraper, consider a hypothetical legacy system dating from the 1960s—the dawn of computerized data processing. Computer to user interfaces from that era were often simply text-based dumb terminals which were not much more than virtual teleprinters (such systems are still in use today update , for various reasons). The desire to interface such a system to more modern systems is common. 
A robust solution will often require things no longer available, such as source code, system documentation, APIs, or programmers with experience in a 50 year-old computer system. In such cases, the only feasible solution may be to write a screen scraper that ""pretends"" to be a user at a terminal. The screen scraper might connect to the legacy system via Telnet, emulate the keystrokes needed to navigate the old user interface, process the resulting display output, extract the desired data, and pass it on to the modern system. A sophisticated and resilient implementation of this kind, built on a platform providing the governance and control required by a major enterprise—e.g. change control, security, user management, data protection, operational audit, load balancing, and queue management, etc.—could be said to be an example of robotic process automation software, called RPA or RPAAI for self-guided RPA 2.0 based on artificial intelligence. In the 1980s, financial data providers such as Reuters, Telerate, and Quotron displayed data in 24 80 format intended for a human reader. Users of this data, particularly investment banks, wrote applications to capture and convert this character data as numeric data for inclusion into calculations for trading decisions without re-keying the data. The common term for this practice, especially in the United Kingdom, was page shredding, since the results could be imagined to have passed through a paper shredder. Internally Reuters used the term 'logicized' for this conversion process, running a sophisticated computer system on VAX VMS called the Logicizer. 3 More modern screen scraping techniques include capturing the bitmap data from the screen and running it through an OCR engine, or for some specialised automated testing systems, matching the screen's bitmap data against expected results. 
4 This can be combined in the case of GUI applications, with querying the graphical controls by programmatically obtaining references to their underlying programming objects. A sequence of screens is automatically captured and converted into a database. Another modern adaptation to these techniques is to use, instead of a sequence of screens as input, a set of images or PDF files, so there are some overlaps with generic ""document scraping"" and report mining techniques. There are many tools that can be used for screen scraping. 5 Web pages are built using text-based mark-up languages (HTML and XHTML), and frequently contain a wealth of useful data in text form. However, most web pages are designed for human end-users and not for ease of automated use. Because of this, tool kits that scrape web content were created. A web scraper is an API or tool to extract data from a website. 6 Companies like Amazon AWS and Google provide web scraping tools, services, and public data available free of cost to end-users. Newer forms of web scraping involve listening to data feeds from web servers. For example, JSON is commonly used as a transport storage mechanism between the client and the webserver. A web scraper uses a website's URL to extract data, and stores this data for subsequent analysis. This method of web scraping enables the extraction of data in an efficient and accurate manner. 7 Recently, companies have developed web scraping systems that rely on using techniques in DOM parsing, computer vision and natural language processing to simulate the human processing that occurs when viewing a webpage to automatically extract useful information. 8 9 Large websites usually use defensive algorithms to protect their data from web scrapers and to limit the number of requests an IP or IP network may send. This has caused an ongoing battle between website developers and scraping developers. 10 Report mining is the extraction of data from human-readable computer reports. 
Conventional data extraction requires a connection to a working source system, suitable connectivity standards or an API, and usually complex querying. By using the source system's standard reporting options, and directing the output to a spool file instead of to a printer, static reports can be generated suitable for offline analysis via report mining. 11 This approach can avoid intensive CPU usage during business hours, can minimise end-user licence costs for ERP customers, and can offer very rapid prototyping and development of custom reports. Whereas data scraping and web scraping involve interacting with dynamic output, report mining involves extracting data from files in a human-readable format, such as HTML, PDF, or text. These can be easily generated from almost any system by intercepting the data feed to a printer. This approach can provide a quick and simple route to obtaining data without the need to program an API to the source system."
9,https://en.wikipedia.org/wiki/Data_scraping,https://en.wikipedia.org/wiki/Data_structures,"In computer science, a data structure is a data organization, and storage format that is usually chosen for efficient access to data. 1 2 3 More precisely, a data structure is a collection of data values, the relationships among them, and the functions or operations that can be applied to the data, 4 i.e., it is an algebraic structure about data. Data structures serve as the basis for abstract data types (ADT). The ADT defines the logical form of the data type. The data structure implements the physical form of the data type. 5 Different types of data structures are suited to different kinds of applications, and some are highly specialized to specific tasks. For example, relational databases commonly use B-tree indexes for data retrieval, 6 while compiler implementations usually use hash tables to look up identifiers. 7 Data structures provide a means to manage large amounts of data efficiently for uses such as large databases and internet indexing services. Usually, efficient data structures are key to designing efficient algorithms. Some formal design methods and programming languages emphasize data structures, rather than algorithms, as the key organizing factor in software design. Data structures can be used to organize the storage and retrieval of information stored in both main memory and secondary memory. 8 Data structures can be implemented using a variety of programming languages and techniques, but they all share the common goal of efficiently organizing and storing data. 9 Data structures are generally based on the ability of a computer to fetch and store data at any place in its memory, specified by a pointer—a bit string, representing a memory address, that can be itself stored in memory and manipulated by the program. 
Thus, the array and record data structures are based on computing the addresses of data items with arithmetic operations, while the linked data structures are based on storing addresses of data items within the structure itself. This approach to data structuring has profound implications for the efficiency and scalability of algorithms. For instance, the contiguous memory allocation in arrays facilitates rapid access and modification operations, leading to optimized performance in sequential data processing scenarios. 10 The implementation of a data structure usually requires writing a set of procedures that create and manipulate instances of that structure. The efficiency of a data structure cannot be analyzed separately from those operations. This observation motivates the theoretical concept of an abstract data type, a data structure that is defined indirectly by the operations that may be performed on it, and the mathematical properties of those operations (including their space and time cost). 11 There are numerous types of data structures, generally built upon simpler primitive data types. Well known examples are: 12 A trie, or prefix tree, is a special type of tree used to efficiently retrieve strings. In a trie, each node represents a character of a string, and the edges between nodes represent the characters that connect them. This structure is especially useful for tasks like autocomplete, spell-checking, and creating dictionaries. Tries allow for quick searches and operations based on string prefixes. Most assembly languages and some low-level languages, such as BCPL (Basic Combined Programming Language), lack built-in support for data structures. On the other hand, many high-level programming languages and some higher-level assembly languages, such as MASM, have special syntax or other built-in support for certain data structures, such as records and arrays. 
For example, the C (a direct descendant of BCPL) and Pascal languages support structs and records, respectively, in addition to vectors (one-dimensional arrays) and multi-dimensional arrays. 14 15 Most programming languages feature some sort of library mechanism that allows data structure implementations to be reused by different programs. Modern languages usually come with standard libraries that implement the most common data structures. Examples are the C Standard Template Library, the Java Collections Framework, and the Microsoft .NET Framework. Modern languages also generally support modular programming, the separation between the interface of a library module and its implementation. Some provide opaque data types that allow clients to hide implementation details. Object-oriented programming languages, such as C , Java, and Smalltalk, typically use classes for this purpose. Many known data structures have concurrent versions which allow multiple computing threads to access a single concrete instance of a data structure simultaneously. 16"


---
---

### **Extract text from a specific scraped web page:**

### Option 1) Use `print()` with `final_data['scraped_text']` and `.iloc[]`. Find the index number of the row in the display above that contains your web page of interest, and pass that number to `.iloc[]` (this assumes the DataFrame keeps its default 0-based index, so the displayed index matches the row position). For example, to view the extracted text from Wikipedia's article on "Data loading" (index 191), located at https://en.wikipedia.org/wiki/Data_loading, use the following code:

In [16]:
print(final_data['scraped_text'].iloc[191])

Data loading, or simply loading, is a part of data processing where data is moved between two systems so that it ends up in a staging area on the target system. With the traditional extract, transform and load (ETL) method, the load job is the last step, and the data that is loaded has already been transformed. With the alternative method extract, load and transform (ELT), the loading job is the middle step, and the transformed data is loaded in its original format for data transformation in the target system. Traditionally, loading jobs on large systems have taken a long time, and have typically been run at night outside a company's opening hours. Two main goals of data loading are to obtain fresher data in the systems after loading, and that the loading is fast so that the data can be updated frequently. For full data refresh, faster loading can be achieved by turning off referential integrity, secondary indexes and logging, but this is usually not allowed with incremental update or 
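
### If scanning the display for an index number is inconvenient, the row can also be located by its URL. The following is a minimal sketch and not part of the original notebook: the small `final_data` stand-in and the helper name `find_positions` are hypothetical, and only the column names are taken from the DataFrame described above.

```python
import pandas as pd

# Hypothetical stand-in for final_data; only the column names mirror the notebook.
final_data = pd.DataFrame({
    "original_url": ["https://en.wikipedia.org/wiki/Data_scraping"] * 2,
    "scraped_url": [
        "https://en.wikipedia.org/wiki/Data_structures",
        "https://en.wikipedia.org/wiki/Data_loading",
    ],
    "scraped_text": [
        "placeholder text for the data structures page",
        "placeholder text for the data loading page",
    ],
    "scraped_tables": ["", ""],
})


def find_positions(df, url):
    """Return the integer positions (usable with .iloc) of rows whose scraped_url equals url."""
    mask = (df["scraped_url"] == url).to_numpy()
    return [position for position, hit in enumerate(mask) if hit]


positions = find_positions(final_data, "https://en.wikipedia.org/wiki/Data_loading")

# Print the text of the first match, if the URL was scraped at all.
if positions:
    print(final_data["scraped_text"].iloc[positions[0]])
else:
    print("URL not found in scraped_url.")
```

### Because the helper returns positions rather than index labels, the result can be passed to `.iloc[]` even if the DataFrame has been filtered or re-indexed.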

### Option 2) To view the extracted text from your web page of interest (e.g., `.iloc[191]`) with nicer font styling, use the following code, which applies the same `<pre>` tag and `escape()` steps used earlier:

In [14]:
display(HTML(f"<pre style='font-family: Georgia, serif; font-size: 18px;'>{html.escape(final_data['scraped_text'].iloc[191])}</pre>"))
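
### To make the role of `html.escape()` in the cell above more concrete, here is a small sketch using only the standard library. The helper name `to_pre_block` and the sample string are hypothetical. The `white-space: pre-wrap` style is my own addition, not part of the original cell: since newline characters were removed during cleaning, it lets long text wrap inside the `<pre>` element.

```python
import html


def to_pre_block(text, font_family="Georgia, serif", font_size_px=18):
    """Wrap plain text in a styled <pre> element, escaping HTML special characters first."""
    # Escaping turns <, > and & into entities so the browser shows them literally
    # instead of interpreting them as markup.
    safe_text = html.escape(text)
    return (
        f"<pre style='font-family: {font_family}; font-size: {font_size_px}px; "
        f"white-space: pre-wrap;'>{safe_text}</pre>"
    )


sample = "Tags such as <b> & <table> appear literally."
markup = to_pre_block(sample)
print(markup)

# In a notebook, the result could be rendered with:
# display(HTML(to_pre_block(final_data['scraped_text'].iloc[191])))
```

### Without the escaping step, any `<` or `&` characters left in the scraped text could be interpreted as HTML and distort the rendered output.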

---
---

### **View Extracted Tables:**

---

### Finally, running the following disabled code will display, as HTML, all tables scraped from the first 5 web pages (i.e., rows) in the above Pandas DataFrame. I have wrapped the code in a triple-quoted string so that it does not execute, which minimizes this HTML document's file size. To run it, remove the opening and closing triple quotes.

In [None]:
"""
from IPython.display import display, HTML

# Use to display tables in html along with their URLs from the first 5 scraped URLs by using .head(). 
for index, row in final_data.head().iterrows():
    # Display the original URL and the scraped URL from which the tables were extracted
    display(HTML(f"<b>Original URL:</b> {row['original_url']}<br><b>Scraped URL:</b> {row['scraped_url']}"))
    
    # Check if there is any HTML table to display
    if row['scraped_tables']:
        display(HTML(row['scraped_tables']))
    else:
        display(HTML("<p>No tables found.</p>"))
    
    # Display a thick black line as a separator after each URL's data before moving on to the next
    if index != final_data.head().index[-1]:  # Ensure the line is not added after the last displayed row
        display(HTML('<hr style="border: 2px solid black;">'))
        
"""        

### *Note:* If you remove `.head()` from `final_data.head().iterrows()` in the above code, it will show the tables from <u>all scraped URLs</u> (i.e., nearly 700 URLs) saved in the `final_data` object. However, if you have scraped many URLs, as is the case in this portfolio project, doing so will likely overload your memory.

### In this case, you should instead find the row you are interested in within the scraped text output from the Pandas DataFrame above (e.g., the Wikipedia page on QVC, on the row whose index number, 58, is displayed on the left-hand side). Then use `final_data.iloc[58]` in the following code:

In [3]:
from IPython.display import display, HTML

# Directly access the row in which the index number is 58 (iloc[58]), which shows all extracted tables from 
# the Wikipedia web page on QVC.

row = final_data.iloc[58]

# First display the original URL and the scraped URL from which the tables were extracted
display(HTML(f"<b>Original URL:</b> {row['original_url']}<br><b>Scraped URL:</b> {row['scraped_url']}"))

# Check if there are any HTML tables to display. If so, display them.
if row['scraped_tables']:
    display(HTML(row['scraped_tables']))
else:
    display(HTML("<p>No tables found.</p>"))

0,1
,
Country,United States
Broadcast area,Worldwide
Headquarters,"1200 Wilson Drive, West Chester, Pennsylvania 19380"
Programming,Programming
Language(s),English
Picture format,2160p UHDTV 1080i HDTV (downscaled to letterboxed 480i for the SDTV feed)
Ownership,Ownership
Owner,Qurate Retail Group
Sister channels,(see below)

0,1
,
Ownership,Ownership
Sister channels,(see above)
History,History
Launched,"August 22, 2013; 10 years ago"
Former names,QVC Plus (2013–2017)
Links,Links
Website,qvc.com

0,1
,
History,History
Launched,"April 1, 2019; 5 years ago"
Links,Links
Website,qvc.com

History,History.1
Launched,"April 23, 2019"
Former names,Beauty iQ (2019-2021)
Links,Links
Website,qvc.com

vteQurate Retail Group,vteQurate Retail Group.1
QVC Group,HSN QVC Zulily
Liberty Ventures Group,Evite FTD (37%) ILG (13%) LendingTree (26%)

vtePhiladelphia-area corporations (including the Delaware Valley),vtePhiladelphia-area corporations (including the Delaware Valley).1
List of companies based in the Philadelphia area,List of companies based in the Philadelphia area
Philadelphia-based Fortune 500 corporations (rank in the 2017 list),Comcast (31) Aramark (192) Crown Holdings (333)
Delaware Valley-based Fortune 500 corporations (rank in the 2017 list),AmerisourceBergen (11) DuPont (113) Lincoln Financial (207) Universal Health Services (276) Campbell Soup (339) UGI (457) Burlington Stores (463)
Other notable Philadelphia-based businesses,Amoroso's Chemtura Day & Zimmermann FMC Corporation Independence Blue Cross Pennsylvania Real Estate Investment Trust Pep Boys Philadelphia Media Network Radian Group Urban Outfitters
Notable Philadelphia-based professional partnerships,"Ballard Spahr Blank Rome Cozen O'Connor Dechert Drinker Biddle & Reath Duane Morris Morgan, Lewis & Bockius Pepper Hamilton Saul Ewing White and Williams"
Other notable Delaware Valley-based businesses,Actua Corporation Airgas AlliedBarton Ametek Aqua America Asplundh Bentley Systems Brandywine Realty Trust Boscov's Carpenter Technology Cephalon Chemours Christiana Care Health System Crozer Keystone Health System David's Bridal DuckDuckGo EPAM Systems EnerSys Liberty Property Trust Penn Entertainment Penn Mutual Rita's Italian Ice SEI Investments SLM Susquehanna International Group Vanguard Toll Brothers Triumph Group Unisys ViroPharma Vishay Intertechnology VWR Wawa Wilmington Trust W. L. Gore & Associates WSFS Bank
Notable Delaware Valley-based US headquarters of foreign businesses,Aberdeen Asset Management AgustaWestland AstraZeneca Chubb Delaware Investments GSK Keystone Foods SAP Siemens Healthineers Shire Pharmaceuticals Subaru Teva Pharmaceuticals TD Bank
Notable Delaware Valley-based division headquarters of US corporations,Acme (Cerberus Capital Management) Centocor (Johnson & Johnson) Colonial Penn (Conseco) Delmarva Power (Exelon) GSI Commerce (eBay) Hercules (Ashland) MAB Paints (Sherwin-Williams) McNeil Laboratories (Johnson & Johnson) Neoware (Hewlett-Packard) PECO (Exelon) QVC (Liberty Media) Rohm & Haas (Dow Chemical) SunGard (FIS) Tasty Baking (Flowers Foods)
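
### As a middle ground between displaying every table at once and displaying a single row, the rows can be processed in small batches, checking the approximate size of the stored table HTML before rendering each batch. This is only a sketch under stated assumptions: the stand-in `final_data`, the helper names `table_batches` and `tables_size_kb`, and the batch size are all hypothetical, and the size is measured in characters rather than exact bytes.

```python
import pandas as pd

# Hypothetical stand-in for final_data; only the column names mirror the notebook.
final_data = pd.DataFrame({
    "original_url": ["https://example.com/start"] * 5,
    "scraped_url": [f"https://example.com/page{i}" for i in range(5)],
    "scraped_text": ["placeholder text"] * 5,
    "scraped_tables": [
        "<table><tr><td>a</td></tr></table>",
        "",
        "<table><tr><td>b</td></tr></table>",
        None,
        "<table><tr><td>c</td></tr></table>",
    ],
})


def table_batches(df, batch_size=5):
    """Yield consecutive slices of df containing at most batch_size rows each."""
    for start in range(0, len(df), batch_size):
        yield df.iloc[start:start + batch_size]


def tables_size_kb(df):
    """Approximate size of the stored table HTML in kilobytes (characters / 1024)."""
    return df["scraped_tables"].fillna("").str.len().sum() / 1024


for batch in table_batches(final_data, batch_size=2):
    first, last = batch.index[0], batch.index[-1]
    print(f"rows {first}-{last}: about {tables_size_kb(batch):.2f} KB of table HTML")
    # In a notebook, each row of the batch could then be rendered with
    # display(HTML(row['scraped_tables'])), as in the cells above.
```

### Rendering one batch per cell keeps each output small, and the size estimate gives an early warning before a particularly table-heavy batch is displayed.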
