# **Portfolio Project:** Web Scraping Using Scrapy
# **Rob Boswell**

---

### This portfolio project shows how to create a web crawler using the Scrapy Python library. I demonstrate how to combine Scrapy with TOR, using the stem and PySocks libraries, to route scraping requests through the TOR network for anonymity.

<br>

### I will show two related ways of scraping web text data. The second is more flexible and better organized than the first:

<br>

### In my first code implementation, the crawler starts with a single predefined URL and follows all links found on the page, collecting paragraph text from each linked page it visits.

<br>

### The scraped text data is stored in a dictionary, with URLs as keys and lists of paragraph texts as values. This data is then processed to remove empty strings and standardized into a pandas DataFrame for easier analysis and presentation. This implementation demonstrates a simple way to build a web scraper for exploring web page contents anonymously.

<br>

### The second code implementation showcases how to scrape text data from HTML documents while cleaning the data by removing empty strings. Users can specify multiple URLs to scrape, and the code collects and processes all links found within each specified URL, scraping paragraphs from each linked page.

<br>

### Each specified URL is assigned a unique ID, which is used in a pandas DataFrame to associate the collected links and scraped text with their originating URL, making the data easier to retrieve. There are many design options for web scrapers depending on specific goals; for instance, this code could be adapted to scrape PDF files, metadata, or images instead of HTML text. For more information, see, e.g., ["How do I scrape PDFs with Scrapy?"](https://webscraping.ai/faq/scrapy/how-do-i-scrape-pdfs-with-scrapy).


---

In [None]:
!pip install scrapy

Collecting scrapy
  Downloading Scrapy-2.11.2-py2.py3-none-any.whl.metadata (5.3 kB)
Collecting Twisted>=18.9.0 (from scrapy)
  Downloading twisted-24.3.0-py3-none-any.whl.metadata (9.5 kB)
Collecting cssselect>=0.9.1 (from scrapy)
  Downloading cssselect-1.2.0-py2.py3-none-any.whl.metadata (2.2 kB)
Collecting itemloaders>=1.0.1 (from scrapy)
  Downloading itemloaders-1.3.1-py3-none-any.whl.metadata (3.9 kB)
Collecting parsel>=1.5.0 (from scrapy)
  Downloading parsel-1.9.1-py2.py3-none-any.whl.metadata (11 kB)
Collecting queuelib>=1.4.2 (from scrapy)
  Downloading queuelib-1.7.0-py2.py3-none-any.whl.metadata (5.7 kB)
Collecting service-identity>=18.1.0 (from scrapy)
  Downloading service_identity-24.1.0-py3-none-any.whl.metadata (4.8 kB)
Collecting w3lib>=1.17.0 (from scrapy)
  Downloading w3lib-2.2.1-py3-none-any.whl.metadata (2.1 kB)
Collecting zope.interface>=5.1.0 (from scrapy)
  Downloading zope.interface-7.0.1-cp310-cp310-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_

### **First Web Scraper Implementation:**

In [None]:
# This code is designed to minimize logging output. Because so many URLs are scraped, you will still occasionally see error messages.
# Step 1: Install necessary packages and start TOR
!apt-get install -y tor
!pip install stem
!pip install pysocks

import time
from stem import Signal
from stem.control import Controller
import subprocess
import socks
import socket

# Step 2: Create the torrc file
torrc_content = """
ControlPort 9051
CookieAuthentication 0
SocksPort 9050
Log notice file /var/log/tor/notices.log
"""
with open('torrc', 'w') as f:
    f.write(torrc_content)

# Step 3: Start TOR using subprocess and wait for it to be ready
tor_process = subprocess.Popen(['tor', '-f', 'torrc'], stdout=subprocess.PIPE, stderr=subprocess.PIPE)

# Give TOR some time to start
time.sleep(20)

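# Step 4: Renew TOR identity
# Keep a reference to the unpatched socket class. It is captured here, before
# PySocks replaces socket.socket further down (this assumes a fresh kernel).
_plain_socket = socket.socket

def renew_connection():
    # stem must reach the control port directly. If its connection to
    # 127.0.0.1:9051 is itself sent through the SOCKS proxy, the renewal fails;
    # this is the likely cause of the "General SOCKS server failure" message.
    patched_socket = socket.socket
    socket.socket = _plain_socket
    try:
        with Controller.from_port(port=9051) as controller:
            controller.authenticate()
            controller.signal(Signal.NEWNYM)
    finally:
        # Restore whatever socket class was active (the PySocks one during a crawl)
        socket.socket = patched_socket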

# Step 5: Completely Disable Logging
import logging
from scrapy.utils.log import configure_logging

# Disable logging for all levels
logging.disable(logging.CRITICAL)

# Step 6: Scrapy Spider
import scrapy
from scrapy.crawler import CrawlerProcess

# Set up SOCKS5 proxy for Scrapy using PySocks
socks.set_default_proxy(socks.SOCKS5, "127.0.0.1", 9050)
socket.socket = socks.socksocket

# Initialize the dictionary outside of the Spider class
data_link_dict = dict()

# Create the Spider class
class DCChapterSpider(scrapy.Spider):
    name = "dc_chapter_spider"

    def start_requests(self):
        try:
            renew_connection()  # Renew TOR identity before starting requests
        except Exception as e:
            print(f"Failed to renew TOR connection: {e}")  # Use print instead of log for errors
        yield scrapy.Request(url='https://en.wikipedia.org/wiki/Web_scraping',
                             callback=self.parse1)

    def parse1(self, response):
        links = response.xpath('//a/@href').extract()
        # Process links without logging
        for link in links:
            absolute_url = response.urljoin(link)
            yield response.follow(url=absolute_url, callback=self.parse2)

    def parse2(self, response):
        # Extract the text nodes that sit directly inside each <p>; text nested in child tags (links, bold) is not captured
        par_text = response.xpath('//p/text()').extract()
        par_text_strip = [t.strip() for t in par_text]
        # Use response.url as the key for the dictionary
        data_link_dict[response.url] = par_text_strip
        # No logging needed

# Run the spider; its traffic goes through the SOCKS proxy because socket.socket was patched above
process = CrawlerProcess()
process.crawl(DCChapterSpider)
process.start()

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
The following additional packages will be installed:
  logrotate tor-geoipdb torsocks
Suggested packages:
  bsd-mailx | mailx mixmaster torbrowser-launcher socat apparmor-utils nyx obfs4proxy
The following NEW packages will be installed:
  logrotate tor tor-geoipdb torsocks
0 upgraded, 4 newly installed, 0 to remove and 45 not upgraded.
Need to get 2,884 kB of archives.
After this operation, 15.5 MB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu jammy-updates/main amd64 logrotate amd64 3.19.0-1ubuntu1.1 [54.3 kB]
Get:2 http://archive.ubuntu.com/ubuntu jammy/universe amd64 tor amd64 0.4.6.10-1 [1,665 kB]
Get:3 http://archive.ubuntu.com/ubuntu jammy/universe amd64 torsocks amd64 2.3.0-3 [62.5 kB]
Get:4 http://archive.ubuntu.com/ubuntu jammy/universe amd64 tor-geoipdb all 0.4.6.10-1 [1,103 kB]
Fetched 2,884 kB in 1s (2,691 kB/s)
Selecting previously unselected pa



See the documentation of the 'REQUEST_FINGERPRINTER_IMPLEMENTATION' setting for information on how to handle this deprecation.
  return cls(crawler)


Failed to renew TOR connection: Socket error: 0x01: General SOCKS server failure


Unhandled Error
Traceback (most recent call last):
  File "<ipython-input-2-1d4360bb8138>", line 84, in <cell line: 84>
    process.start()
  File "/usr/local/lib/python3.10/dist-packages/scrapy/crawler.py", line 429, in start
    reactor.run(installSignalHandlers=install_signal_handlers)  # blocking call
  File "/usr/local/lib/python3.10/dist-packages/twisted/internet/base.py", line 695, in run
    self.mainLoop()
  File "/usr/local/lib/python3.10/dist-packages/twisted/internet/base.py", line 705, in mainLoop
    self.runUntilCurrent()
--- <exception caught here> ---
  File "/usr/local/lib/python3.10/dist-packages/twisted/internet/base.py", line 1090, in runUntilCurrent
    call.func(*call.args, **call.kw)
  File "/usr/local/lib/python3.10/dist-packages/twisted/internet/tcp.py", line 448, in resolveAddress
    self._setRealAddress(self.addr)
  File "/usr/local/lib/python3.10/dist-packages/twisted/internet/tcp.py", line 469, in _setRealAddress
    self.doConnect()
  File "/usr/local/li
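
### Both implementations wait for TOR with a fixed `time.sleep(20)`. As a minimal sketch of an alternative, the helper below polls a TCP port until it accepts a connection or a deadline passes. The name `wait_for_port` and its default values are hypothetical and not part of the original code. An open port only shows that TOR is listening; it may still be building circuits, so treat this as a lower bound on readiness.

```python
import socket
import time


def wait_for_port(host, port, timeout=60.0, interval=0.5):
    """Return True once a TCP connection to (host, port) succeeds, False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # A successful connect means something is listening on the port
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Not listening yet (or refused); wait briefly and try again
            time.sleep(interval)
    return False
```

### It would be called as `wait_for_port("127.0.0.1", 9050)` in place of the sleep, and before `socket.socket` is replaced by the PySocks class, since the check itself needs a direct connection.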

In [None]:
def print_dict_head(d, n=5):
    """Print the first n items of a dictionary."""
    # Use enumerate to limit the number of items printed
    for i, (key, value) in enumerate(d.items()):
        if i >= n:
            break
        print(f"{key}: {value}")

In [None]:
# Printing just the first few entries of the dictionary shows that many empty strings were scraped. These should be removed.
print_dict_head(data_link_dict)

https://en.wikipedia.org/wiki/Special:MyTalk: ['People on Wikipedia can use this', 'to post a public message about edits made from the IP address you are currently using.', 'Many IP addresses change periodically, and are often shared by several people. You may', 'or', 'to avoid future confusion with other logged out users. Creating an account also hides your IP address.']
https://en.wikipedia.org/wiki/Help:Introduction: ['', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '']
https://en.wikipedia.org/wiki/Special:RecentChanges: ['This is a list of recent changes to Wikipedia.']
https://en.wikipedia.org/wiki/Wikipedia:File_upload_wizard: ['Thank you for offering to contribute an image or other media file for use on Wikipedia. This wizard will guide you through a questionnaire prompting you for the appropriate copyright and sourcing information for each file. Please ensure you understand', 'and the', 'before proceeding.', '', 'Uploads to', '', '', '', 
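
### Fragments such as 'or' and 'and the' in the output above appear because `//p/text()` returns only the text nodes that are direct children of each `<p>`, so text inside links or bold tags is left out. The sketch below illustrates the effect on a small hand-written snippet, using the standard library's `xml.etree.ElementTree` as a stand-in for Scrapy's selectors; the HTML and the helper names are hypothetical. In Scrapy itself, something like `response.xpath('//p').xpath('string(.)').getall()` should give one full string per paragraph, though I have not run that against the pages scraped here.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed snippet resembling a Wikipedia paragraph with links
html = (
    "<div><p>You may <a href='#'>log in</a> or "
    "<a href='#'>create an account</a> to edit.</p></div>"
)
root = ET.fromstring(html)


def direct_text_nodes(p):
    """Text nodes that are direct children of <p>, similar to //p/text()."""
    nodes = [p.text] + [child.tail for child in p]
    return [t.strip() for t in nodes if t is not None]


def full_text(p):
    """All text inside <p>, including text nested in child tags."""
    return " ".join("".join(p.itertext()).split())


fragments = [t for p in root.iter("p") for t in direct_text_nodes(p)]
paragraphs = [full_text(p) for p in root.iter("p")]

print(fragments)   # ['You may', 'or', 'to edit.']
print(paragraphs)  # ['You may log in or create an account to edit.']
```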

In [None]:
import pandas as pd

# Clean the dictionary by removing empty strings
cleaned_dict = {k: [item for item in v if item] for k, v in data_link_dict.items()}

# Find the maximum number of paragraphs for any URL to standardize the DataFrame
max_length = max(len(v) for v in cleaned_dict.values())

# Create a standardized dictionary with lists of equal length
standardized_dict = {k: v + [''] * (max_length - len(v)) for k, v in cleaned_dict.items()}

# Convert to DataFrame
df = pd.DataFrame.from_dict(standardized_dict, orient='index').transpose()

# Display the DataFrame
df.head()

Unnamed: 0,https://en.wikipedia.org/wiki/Special:MyTalk,https://en.wikipedia.org/wiki/Help:Introduction,https://en.wikipedia.org/wiki/Special:RecentChanges,https://en.wikipedia.org/wiki/Wikipedia:File_upload_wizard,https://en.wikipedia.org/wiki/Special:Search,https://en.wikipedia.org/w/index.php?title=Special:CreateAccount&returnto=Web+scraping,https://en.wikipedia.org/wiki/Help:Contents,https://en.wikipedia.org/wiki/Wikipedia:Contact_us,https://en.wikipedia.org/wiki/Wikipedia:About,https://en.wikipedia.org/wiki/Wikipedia:Community_portal,...,https://en.wikipedia.org/wiki/Main_Page,https://ar.wikipedia.org/wiki/%D8%AA%D8%AC%D8%B1%D9%8A%D9%81_%D9%88%D9%8A%D8%A8,https://www.eff.org/cases/facebook-v-power-ventures,https://cs.wikipedia.org/wiki/Web_scraping,https://web.archive.org/web/20071012005033/http://www.bvhd.dk/uploads/tx_mocarticles/S_-_og_Handelsrettens_afg_relse_i_Ofir-sagen.pdf,https://www.semanticscholar.org/paper/Joint-optimization-of-wrapper-generation-and-Zheng-Song/61db194fc4693b002d507c6f027beeefef6ae3e7?p2df,https://www.techdirt.com/2009/06/10/can-scraping-non-infringing-content-become-copyright-infringement-because-of-how-scrapers-work/,https://web.archive.org/web/20191203113701/https://www.lloyds.com/~/media/5880dae185914b2487bed7bd63b96286.ashx,https://consent.yahoo.com/v2/collectConsent?sessionId=1_cc-session_1babb80b-28df-41ba-9864-b630fdb67946,https://web.archive.org/web/20120624103316/http://www.lkshields.ie/htmdocs/publications/newsletters/update26/update26_03.htm
0,People on Wikipedia can use this,,This is a list of recent changes to Wikipedia.,Thank you for offering to contribute an image ...,,edits,This page provides,"How to report a problem with an article, or fi...",is a,This page provides a listing of current collab...,...,is a,(,EFF has urged a San Francisco federal court an...,",",,,"Earlier this year, we couldn’t figure out how ...",History is littered with hundreds of conflicts...,"We, TechCrunch, are part of the",Website owners often have to contend with the ...
1,to post a public message about edits made from...,,,and the,,articles,.,"Problems with articles about you, your company...","that anyone can edit, and",? See the,...,in,:,Power Ventures was a company that allowed user...,nebo,,,. Power.com tried to aggregate various social ...,The main site for Archive Team is at,family of brands.,"(26 February 2010). However, \n i..."
2,"Many IP addresses change periodically, and are...",,,before proceeding.,,recent contributors,You can also search Wikipedia's help pages usi...,"How to copy Wikipedia's information, donate yo...",.,page or,...,". Taking the name from a local landmark, forme...",)‏ هي تقنية استخراج البيانات من مواقع,(CFAA) and the California state CFAA equivalen...,označují způsob získávání,,,points us to,and contains up to the date information on var...,If you do not want us and our partners to use ...,The Ryanair case concerned a claim by Ryanair ...
3,or,,,Uploads to,,,or the,"Find out about the process, how to donate, and...",is to benefit readers by presenting informatio...,for everything you need to know to get started...,...,", and the building's two other floors were use...",عن طريق برامج مخصصة مثل برامج محاكة تصفح الأشخ...,", the federal law that prohibits sending comme...",z,,,", and separately",This collection contains the output of many Ar...,'.,Mr Justice Hanna's decision relates only to a ...
4,to avoid future confusion with other logged ou...,,,Uploads locally to the English Wikipedia; must...,,,.,If you're a member of the press looking to con...,. Hosted by the,"of interest, see the",...,". By 2013, persistent high demand for Blackroc...",متكامل، مثل,"In February 2012, the district court found Pow...",. Spočívá v extrahování dat umístěných na webo...,,,of the ruling. Neuberger states the following:,", providing a path back to lost websites and w...","If you would like to customise your choices, c...","In any dispute, there is an initial issue that..."


In [None]:
# df.shape shows that 259 URLs were scraped (one column each) and that the page with the most text yielded 1,044 paragraph text fragments (the number of rows).
df.shape

(1044, 259)
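
### Padding every column to 1,044 rows leaves most of this wide table filled with empty strings. One alternative, sketched below under the assumption that a "long" layout suits the analysis, is to keep one row per URL and fragment. The function name `to_long_rows` and the sample data are hypothetical; the resulting list of dictionaries could be passed to `pd.DataFrame(rows)`.

```python
# Hypothetical sample shaped like data_link_dict: URL -> list of stripped strings
sample = {
    "https://example.org/a": ["first", "", "second"],
    "https://example.org/b": ["", ""],
}


def to_long_rows(link_dict):
    """Flatten {url: [text, ...]} into one record per non-empty text fragment."""
    rows = []
    for url, texts in link_dict.items():
        kept = [t for t in texts if t]  # drop empty strings, as in the cleaning step
        for position, text in enumerate(kept):
            rows.append({"url": url, "position": position, "text": text})
    return rows


rows = to_long_rows(sample)
for row in rows:
    print(row)
```

### Pages with no text simply contribute no rows, so no padding is needed.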

#### If you prefer to see logging output, below is a commented-out version of the same code. It logs more errors, which sites were scraped, how many URL links were found on a given page, how many paragraphs were scraped from a given site, and the exact contents of those paragraphs:



In [None]:
"""

# Step 1: Install necessary packages and start TOR
!apt-get install -y tor
!pip install stem
!pip install pysocks

import time
from stem import Signal
from stem.control import Controller
import subprocess
import os
import socks
import socket

# Step 2: Create the torrc file
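# Single-quoted delimiters are used here so this string does not end the
# surrounding triple-double-quoted block that comments out the code
torrc_content = '''
ControlPort 9051
CookieAuthentication 0
SocksPort 9050
Log notice file /var/log/tor/notices.log
'''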
with open('torrc', 'w') as f:
    f.write(torrc_content)

# Step 3: Start TOR using subprocess and wait for it to be ready
tor_process = subprocess.Popen(['tor', '-f', 'torrc'])

# Give TOR some time to start
time.sleep(20)

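# Step 4: Renew TOR identity
# Unpatched socket class, captured before PySocks replaces socket.socket below
_plain_socket = socket.socket

def renew_connection():
    # stem must connect to the control port directly, not through the SOCKS proxy
    patched_socket = socket.socket
    socket.socket = _plain_socket
    try:
        with Controller.from_port(port=9051) as controller:
            controller.authenticate()
            controller.signal(Signal.NEWNYM)
    finally:
        socket.socket = patched_socket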

# Step 5: Scrapy Spider
import scrapy
from scrapy.crawler import CrawlerProcess

# Set up SOCKS5 proxy for Scrapy using PySocks
socks.set_default_proxy(socks.SOCKS5, "127.0.0.1", 9050)
socket.socket = socks.socksocket

# Initialize the dictionary outside of the Spider class
data_link_dict = dict()

# Create the Spider class
class DCChapterSpider(scrapy.Spider):
    name = "dc_chapter_spider"

    def start_requests(self):
        try:
            renew_connection()  # Renew TOR identity before starting requests
        except Exception as e:
            self.log(f"Failed to renew TOR connection: {e}")
        yield scrapy.Request(url='https://en.wikipedia.org/wiki/Web_scraping',
                             callback=self.parse1)

    def parse1(self, response):
        links = response.xpath('//a/@href').extract()
        self.log(f"Found {len(links)} links on the page")
        for link in links:
            absolute_url = response.urljoin(link)
            yield response.follow(url=absolute_url, callback=self.parse2)

    def parse2(self, response):
        # Correct XPath selector for extracting paragraph text
        par_text = response.xpath('//p/text()').extract()
        self.log(f"Processing URL: {response.url}")
        self.log(f"Found {len(par_text)} paragraphs on the page")
        par_text_strip = [t.strip() for t in par_text]
        # Use response.url as the key for the dictionary
        data_link_dict[response.url] = par_text_strip
        self.log(f"Stored data for URL: {response.url}")
        self.log(f"Current data_link_dict: {data_link_dict}")

# Configure Scrapy to use the SOCKS proxy
process = CrawlerProcess()
process.crawl(DCChapterSpider)
process.start()

"""

### **Second Web Scraper Implementation:**

In [None]:
# Step 1: Install necessary packages and start TOR
!apt-get install -y tor > /dev/null 2>&1
!pip install stem > /dev/null 2>&1
!pip install pysocks > /dev/null 2>&1
!pip install pandas > /dev/null 2>&1
!pip install scrapy > /dev/null 2>&1

import time
from stem import Signal
from stem.control import Controller
import subprocess
import socks
import socket
import pandas as pd

# Step 2: Create the torrc file
torrc_content = """
ControlPort 9051
CookieAuthentication 0
SocksPort 9050
Log notice file /var/log/tor/notices.log
"""
with open('torrc', 'w') as f:
    f.write(torrc_content)

# Step 3: Start TOR using subprocess and wait for it to be ready
tor_process = subprocess.Popen(['tor', '-f', 'torrc'], stdout=subprocess.PIPE, stderr=subprocess.PIPE)

# Give TOR some time to start
time.sleep(20)

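# Step 4: Renew TOR identity
# Keep a reference to the unpatched socket class. It is captured here, before
# PySocks replaces socket.socket further down (this assumes a fresh kernel).
_plain_socket = socket.socket

def renew_connection():
    # stem must reach the control port directly. If its connection to
    # 127.0.0.1:9051 is itself sent through the SOCKS proxy, the renewal fails;
    # this is the likely cause of the "General SOCKS server failure" message.
    patched_socket = socket.socket
    socket.socket = _plain_socket
    try:
        with Controller.from_port(port=9051) as controller:
            controller.authenticate()
            controller.signal(Signal.NEWNYM)
    except Exception as e:
        print(f"Failed to renew TOR connection: {e}")
    finally:
        # Restore whatever socket class was active (the PySocks one during a crawl)
        socket.socket = patched_socket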

# Step 5: Create a list of user-specified URLs to scrape
urls_to_scrape = [
    'https://en.wikipedia.org/wiki/Web_scraping',
    'https://en.wikipedia.org/wiki/Data_scraping',
    # Add more URLs as needed
]

# Assign unique IDs to URLs
url_id_map = {url: f"url_{i+1}" for i, url in enumerate(urls_to_scrape)}

# Step 6: Suppress Scrapy's Detailed Logging
import logging
from scrapy.utils.log import configure_logging

# Disable logging for all levels
logging.disable(logging.CRITICAL)

# Disable all logging
configure_logging(install_root_handler=False)
logging.getLogger('scrapy').propagate = False

# Step 7: Scrapy Spider
import scrapy
from scrapy.crawler import CrawlerProcess

# Set up SOCKS5 proxy for Scrapy using PySocks
socks.set_default_proxy(socks.SOCKS5, "127.0.0.1", 9050)
socket.socket = socks.socksocket

# Initialize the dictionary outside of the Spider class
data_dicts = {}

# Create the Spider class
class DCChapterSpider(scrapy.Spider):
    name = "dc_chapter_spider"

    def start_requests(self):
        renew_connection()  # Renew TOR identity before starting requests
        for url, unique_id in url_id_map.items():
            yield scrapy.Request(url=url, callback=self.parse1, meta={'unique_id': unique_id, 'origin_url': url})


    def parse1(self, response):
        unique_id = response.meta['unique_id']
        origin_url = response.meta['origin_url']
        links = response.xpath('//a/@href').extract()
        for link in links:
            absolute_url = response.urljoin(link)
            # Skip links that resolve to the originating page itself; follow all others
            if absolute_url != origin_url:
                yield scrapy.Request(url=absolute_url, callback=self.parse2, meta={'unique_id': unique_id, 'scraped_url': absolute_url, 'origin_url': origin_url})


    def parse2(self, response):
        unique_id = response.meta['unique_id']
        scraped_url = response.meta['scraped_url']
        origin_url = response.meta['origin_url']
        # Extract the text nodes that sit directly inside each <p>; text nested in child tags (links, bold) is not captured
        par_text = response.xpath('//p/text()').extract()
        par_text_strip = [t.strip() for t in par_text if t.strip()]  # Remove empty strings
        # Store the data in the dictionary with unique_id as the key
        if unique_id not in data_dicts:
            data_dicts[unique_id] = []
        data_dicts[unique_id].append({
            'origin_url': origin_url,
            'scraped_url': scraped_url,
            'text': ' '.join(par_text_strip)  # Concatenate all paragraph text into a single string
        })

# Run the spider; its traffic goes through the SOCKS proxy because socket.socket was patched above
process = CrawlerProcess()
process.crawl(DCChapterSpider)
process.start()

# Step 8: Process the collected data and convert each dictionary to a DataFrame
dataframes = {}

for unique_id, entries in data_dicts.items():
    # Create a DataFrame from the list of entries
    df = pd.DataFrame(entries)
    # Store the DataFrame with unique ID in the name
    dataframes[f"df_{unique_id}"] = df

# Step 9: Display the DataFrames
for df_name, df in dataframes.items():
    print(f"DataFrame: {df_name}")
    print(df.head())




See the documentation of the 'REQUEST_FINGERPRINTER_IMPLEMENTATION' setting for information on how to handle this deprecation.
  return cls(crawler)


Failed to renew TOR connection: Socket error: 0x01: General SOCKS server failure


Unhandled Error
Traceback (most recent call last):
  File "<ipython-input-2-380f46810f26>", line 113, in <cell line: 113>
    process.start()
  File "/usr/local/lib/python3.10/dist-packages/scrapy/crawler.py", line 429, in start
    reactor.run(installSignalHandlers=install_signal_handlers)  # blocking call
  File "/usr/local/lib/python3.10/dist-packages/twisted/internet/base.py", line 695, in run
    self.mainLoop()
  File "/usr/local/lib/python3.10/dist-packages/twisted/internet/base.py", line 705, in mainLoop
    self.runUntilCurrent()
--- <exception caught here> ---
  File "/usr/local/lib/python3.10/dist-packages/twisted/internet/base.py", line 1090, in runUntilCurrent
    call.func(*call.args, **call.kw)
  File "/usr/local/lib/python3.10/dist-packages/twisted/internet/tcp.py", line 448, in resolveAddress
    self._setRealAddress(self.addr)
  File "/usr/local/lib/python3.10/dist-packages/twisted/internet/tcp.py", line 469, in _setRealAddress
    self.doConnect()
  File "/usr/local/

DataFrame: df_url_1
                                   origin_url  \
0  https://en.wikipedia.org/wiki/Web_scraping   
1  https://en.wikipedia.org/wiki/Web_scraping   
2  https://en.wikipedia.org/wiki/Web_scraping   
3  https://en.wikipedia.org/wiki/Web_scraping   
4  https://en.wikipedia.org/wiki/Web_scraping   

                                         scraped_url  \
0       https://en.wikipedia.org/wiki/Special:MyTalk   
1  https://en.wikipedia.org/w/index.php?title=Spe...   
2  https://en.wikipedia.org/wiki/Wikipedia:File_u...   
3       https://en.wikipedia.org/wiki/Special:Search   
4    https://en.wikipedia.org/wiki/Help:Introduction   

                                                text  
0  This user is currently blocked.\nThe latest bl...  
1                 edits articles recent contributors  
2  Thank you for offering to contribute an image ...  
3                                                     
4                                                     
DataFrame: df_url_
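
### The spider above requests every `href` it finds except an exact match for the origin URL, so in-page anchors, `mailto:` links, and repeated links are all turned into requests. A minimal sketch of stricter filtering with the standard library's `urllib.parse` follows. The function name `candidate_links` and the sample links are hypothetical. Scrapy applies its own duplicate filtering to requests, so this mainly avoids creating requests that cannot yield paragraph text.

```python
from urllib.parse import urldefrag, urljoin, urlparse


def candidate_links(page_url, hrefs):
    """Yield absolute http(s) URLs from hrefs, without fragments, duplicates, or self-links."""
    seen = set()
    page_without_fragment = urldefrag(page_url).url
    for href in hrefs:
        # Resolve relative links and drop any #fragment
        absolute = urldefrag(urljoin(page_url, href)).url
        if urlparse(absolute).scheme not in ("http", "https"):
            continue  # e.g. mailto: or javascript: links
        if absolute == page_without_fragment or absolute in seen:
            continue  # self-link or already yielded
        seen.add(absolute)
        yield absolute


links = list(candidate_links(
    "https://en.wikipedia.org/wiki/Web_scraping",
    [
        "#History",
        "/wiki/Data_scraping",
        "/wiki/Data_scraping#Screen_scraping",
        "mailto:someone@example.org",
        "https://www.eff.org/cases/facebook-v-power-ventures",
    ],
))
print(links)
```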

In [None]:
df_url_1 = dataframes['df_url_1']
df_url_2 = dataframes['df_url_2']

In [None]:
df_url_1.head()

Unnamed: 0,origin_url,scraped_url,text
0,https://en.wikipedia.org/wiki/Web_scraping,https://en.wikipedia.org/wiki/Special:MyTalk,This user is currently blocked.\nThe latest bl...
1,https://en.wikipedia.org/wiki/Web_scraping,https://en.wikipedia.org/w/index.php?title=Spe...,edits articles recent contributors
2,https://en.wikipedia.org/wiki/Web_scraping,https://en.wikipedia.org/wiki/Wikipedia:File_u...,Thank you for offering to contribute an image ...
3,https://en.wikipedia.org/wiki/Web_scraping,https://en.wikipedia.org/wiki/Special:Search,
4,https://en.wikipedia.org/wiki/Web_scraping,https://en.wikipedia.org/wiki/Help:Introduction,


In [None]:
for key, df in dataframes.items():
    # Strip leading and trailing whitespace from the 'text' column
    df['text'] = df['text'].str.strip()

    # Remove rows where 'text' is empty
    df = df[df['text'] != '']

    # Save the cleaned DataFrame back to the dictionary
    dataframes[key] = df

# Example to display the head of a cleaned DataFrame
print(dataframes['df_url_1'].head())

                                   origin_url  \
0  https://en.wikipedia.org/wiki/Web_scraping   
1  https://en.wikipedia.org/wiki/Web_scraping   
2  https://en.wikipedia.org/wiki/Web_scraping   
5  https://en.wikipedia.org/wiki/Web_scraping   
6  https://en.wikipedia.org/wiki/Web_scraping   

                                         scraped_url  \
0       https://en.wikipedia.org/wiki/Special:MyTalk   
1  https://en.wikipedia.org/w/index.php?title=Spe...   
2  https://en.wikipedia.org/wiki/Wikipedia:File_u...   
5  https://en.wikipedia.org/wiki/Special:MyContri...   
6  https://en.wikipedia.org/wiki/Wikipedia:Contac...   

                                                text  
0  This user is currently blocked.\nThe latest bl...  
1                 edits articles recent contributors  
2  Thank you for offering to contribute an image ...  
5  This IP address is currently blocked.\nThe lat...  
6  How to report a problem with an article, or fi...  


In [None]:
df_url_1 = dataframes['df_url_1']
df_url_2 = dataframes['df_url_2']

In [None]:
df_url_1.head()

Unnamed: 0,origin_url,scraped_url,text
0,https://en.wikipedia.org/wiki/Web_scraping,https://en.wikipedia.org/wiki/Special:MyTalk,This user is currently blocked.\nThe latest bl...
1,https://en.wikipedia.org/wiki/Web_scraping,https://en.wikipedia.org/w/index.php?title=Spe...,edits articles recent contributors
2,https://en.wikipedia.org/wiki/Web_scraping,https://en.wikipedia.org/wiki/Wikipedia:File_u...,Thank you for offering to contribute an image ...
5,https://en.wikipedia.org/wiki/Web_scraping,https://en.wikipedia.org/wiki/Special:MyContri...,This IP address is currently blocked.\nThe lat...
6,https://en.wikipedia.org/wiki/Web_scraping,https://en.wikipedia.org/wiki/Wikipedia:Contac...,"How to report a problem with an article, or fi..."


In [None]:
df_url_1['text'] = df_url_1['text'].str.strip()
df_url_1 = df_url_1[df_url_1['text'] != '']

In [None]:
df_url_2['text'] = df_url_2['text'].str.strip()
df_url_2 = df_url_2[df_url_2['text'] != '']

In [None]:
print(df_url_1.shape)
print(df_url_2.shape)

(223, 3)
(274, 3)
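
### Both implementations silence output with `logging.disable(logging.CRITICAL)`, which suppresses every logger in the process, including any messages of your own. A narrower option, sketched below with the standard `logging` module, is to raise the level only on the noisy library's logger. The logger name `my_notebook` is hypothetical. Scrapy also has a `LOG_LEVEL` setting that could be passed as `CrawlerProcess(settings={"LOG_LEVEL": "ERROR"})`; I have not verified how it interacts with the logging calls used in this notebook.

```python
import logging

logging.basicConfig(level=logging.INFO, format="%(name)s %(levelname)s %(message)s")

noisy = logging.getLogger("scrapy")      # Scrapy's top-level logger name
own = logging.getLogger("my_notebook")   # hypothetical logger for your own messages
own.setLevel(logging.INFO)

# Quiet only the library: anything below ERROR from "scrapy" and its children is dropped
noisy.setLevel(logging.ERROR)

noisy.info("crawled a page")     # suppressed
noisy.error("download failed")   # still emitted
own.info("stored 3 records")     # still emitted
```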


In [None]:
# Again, below is a version of the above code that will implement logging
"""
# Step 1: Install necessary packages and start TOR
!apt-get install -y tor > /dev/null 2>&1
!pip install stem > /dev/null 2>&1
!pip install pysocks > /dev/null 2>&1
!pip install pandas > /dev/null 2>&1
!pip install scrapy > /dev/null 2>&1

import time
from stem import Signal
from stem.control import Controller
import subprocess
import socks
import socket
import pandas as pd

# Step 2: Create the torrc file
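# Single-quoted delimiters are used here so this string does not end the
# surrounding triple-double-quoted block that comments out the code
torrc_content = '''
ControlPort 9051
CookieAuthentication 0
SocksPort 9050
Log notice file /var/log/tor/notices.log
'''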
with open('torrc', 'w') as f:
    f.write(torrc_content)

# Step 3: Start TOR using subprocess and wait for it to be ready
tor_process = subprocess.Popen(['tor', '-f', 'torrc'], stdout=subprocess.PIPE, stderr=subprocess.PIPE)

# Give TOR some time to start
time.sleep(20)

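# Step 4: Renew TOR identity
# Unpatched socket class, captured before PySocks replaces socket.socket below
_plain_socket = socket.socket

def renew_connection():
    # stem must connect to the control port directly, not through the SOCKS proxy
    patched_socket = socket.socket
    socket.socket = _plain_socket
    try:
        with Controller.from_port(port=9051) as controller:
            controller.authenticate()
            controller.signal(Signal.NEWNYM)
    except Exception as e:
        print(f"Failed to renew TOR connection: {e}")
    finally:
        socket.socket = patched_socket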

# Step 5: Create a list of user predetermined URLs to scrape
urls_to_scrape = [
    'https://en.wikipedia.org/wiki/Web_scraping',
    'https://en.wikipedia.org/wiki/Data_scraping',
    # Add more URLs as needed
]

# Assign unique IDs to URLs
url_id_map = {url: f"url_{i+1}" for i, url in enumerate(urls_to_scrape)}

# Step 6: Scrapy Spider
import scrapy
from scrapy.crawler import CrawlerProcess

# Set up SOCKS5 proxy for Scrapy using PySocks
socks.set_default_proxy(socks.SOCKS5, "127.0.0.1", 9050)
socket.socket = socks.socksocket

# Initialize the dictionary outside of the Spider class
data_dicts = {}

# Create the Spider class
class DCChapterSpider(scrapy.Spider):
    name = "dc_chapter_spider"

    def start_requests(self):
        renew_connection()  # Renew TOR identity before starting requests
        for url, unique_id in url_id_map.items():
            yield scrapy.Request(url=url, callback=self.parse1, meta={'unique_id': unique_id, 'origin_url': url})


    def parse1(self, response):
        unique_id = response.meta['unique_id']
        origin_url = response.meta['origin_url']
        links = response.xpath('//a/@href').extract()
        self.log(f"Found {len(links)} links on the page")
        for link in links:
            absolute_url = response.urljoin(link)
            # Skip links that resolve to the originating page itself; follow all others
            if absolute_url != origin_url:
                yield scrapy.Request(url=absolute_url, callback=self.parse2, meta={'unique_id': unique_id, 'scraped_url': absolute_url, 'origin_url': origin_url})


    def parse2(self, response):
        unique_id = response.meta['unique_id']
        scraped_url = response.meta['scraped_url']
        origin_url = response.meta['origin_url']
        # Correct XPath selector for extracting paragraph text
        par_text = response.xpath('//p/text()').extract()
        self.log(f"Processing URL: {response.url}")
        self.log(f"Found {len(par_text)} paragraphs on the page")
        par_text_strip = [t.strip() for t in par_text if t.strip()]  # Remove empty strings
        # Store the data in the dictionary with unique_id as the key
        if unique_id not in data_dicts:
            data_dicts[unique_id] = []
        data_dicts[unique_id].append({
            'origin_url': origin_url,
            'scraped_url': scraped_url,
            'text': ' '.join(par_text_strip)  # Concatenate all paragraph text into a single string
        })
        self.log(f"Stored data for URL: {response.url}")

# Configure Scrapy to use the SOCKS proxy
process = CrawlerProcess()
process.crawl(DCChapterSpider)
process.start()

# Step 7: Process the collected data and convert each dictionary to a DataFrame
dataframes = {}

for unique_id, entries in data_dicts.items():
    # Create a DataFrame from the list of entries
    df = pd.DataFrame(entries)
    # Store the DataFrame with unique ID in the name
    dataframes[f"df_{unique_id}"] = df

# Step 8: Display the DataFrames
for df_name, df in dataframes.items():
    print(f"DataFrame: {df_name}")
    print(df.head())
"""

## **Comparison of Both Implementations:**

### The two web scraping implementations are similar in that both use the TOR network to anonymize requests, use the Scrapy framework for scraping, and load the collected data into pandas DataFrames. However, they differ in structure and functionality. Here is a breakdown of the similarities and differences:

<br>

## ***Similarities:***

<br>

### - **TOR Integration:** Both implementations set up and start a TOR process, use PySocks to route requests through TOR, and have a function to renew the TOR identity.

### - **Scrapy Usage:** Both use the Scrapy framework to define a spider for scraping data from web pages.


### - **Data Processing:** Both remove empty strings from the scraped text and structure the results as pandas DataFrames.

<br>

## ***Differences:***

<br>

### - **URL Handling:**

### **First Code:** It starts scraping from a single hardcoded URL (https://en.wikipedia.org/wiki/Web_scraping) and follows links from that page. The data collected from each followed link is stored in a single dictionary, data_link_dict.

### **Second Code:** It uses a predefined list of URLs (urls_to_scrape) to start scraping. Each URL is assigned a unique ID (url_id_map), and the data scraped from that URL's links is stored as a list of records under the unique ID, resulting in a more organized structure for handling multiple starting URLs.

<br>

### - **Data Storage:**

### **First Code:** Uses a single dictionary (data_link_dict) to store all scraped data, with URLs as keys and lists of paragraph texts as values.

### **Second Code:** Uses a dictionary (data_dicts) in which each unique URL ID maps to a list of records. Each record is a small dictionary holding the origin URL, the scraped URL, and that page's paragraph text joined into a single string. This allows for clearer data organization when dealing with multiple initial URLs.
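
### The two storage shapes can be sketched side by side as follows. The URLs and text are hypothetical sample values, and `collections.defaultdict` is used here only as a slightly shorter stand-in for the `if unique_id not in data_dicts` check in the spider; the helper `store` is not part of the original code.

```python
from collections import defaultdict

# First implementation: one flat dict, scraped URL -> list of text fragments
data_link_dict = {
    "https://example.org/a": ["first fragment", "second fragment"],
}

# Second implementation: unique ID -> list of records, one record per scraped page
data_dicts = defaultdict(list)


def store(unique_id, origin_url, scraped_url, fragments):
    """Append one record for a scraped page under its origin URL's unique ID."""
    data_dicts[unique_id].append({
        "origin_url": origin_url,
        "scraped_url": scraped_url,
        "text": " ".join(f for f in fragments if f),  # join non-empty fragments
    })


store("url_1", "https://example.org/start", "https://example.org/a",
      ["first fragment", "", "second fragment"])
store("url_1", "https://example.org/start", "https://example.org/b", [])
store("url_2", "https://example.org/other", "https://example.org/c", ["only fragment"])

for unique_id, records in data_dicts.items():
    print(unique_id, len(records))
```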

<br>

### - **DataFrame Creation:**

### **First Code:** Converts the single dictionary (data_link_dict) into one DataFrame after cleaning and standardizing the data.

### **Second Code:** Creates a separate DataFrame for each unique URL ID. Each DataFrame corresponds to one of the initial URLs and is stored in a dictionary (dataframes), allowing for the handling and analysis of data on a per-URL basis.

<br>

### - **Scrapy Spider Logic:**

### **First Code:** The spider starts by requesting a single hardcoded URL and then follows links from that page.

### **Second Code:** The spider starts by iterating over a list of predefined URLs, making it more flexible for scraping multiple specific pages without modifying the code for each new URL.

<br>

### - **Data Display:**

### **First Code:** Displays a single DataFrame containing all scraped data.

### **Second Code:** Iterates over each DataFrame in the dataframes dictionary, displaying them separately with unique identifiers, facilitating the distinction of data from different starting URLs.

<br>

### In summary, the second code is more flexible and better organized, especially when dealing with multiple starting URLs. It structures the data for easier per-URL analysis, whereas the first code is simpler and suits crawls that start from a single page and follow its links.