## Updated Scrape ports

The updated port scraper includes several improvements over the first version:

1. It reads the URLs from a config file instead of hardcoding them in the script.
2. It uses a loop to iterate through the URLs, making it easier to add or remove URLs in the future.
3. It walks the HTML in document order and tracks each subheading it encounters. When it reaches a table, it converts the table to a DataFrame and adds the current subheadings as columns. This gives a more structured output that mirrors how the data is presented on the website.
4. It adds up to three levels of subheadings to the dataframe. This allows for a more detailed representation of the data.
5. Iterating through the HTML this way leaves fewer unwanted artifacts in the data.
6. Together, these changes make the code less fragile, so it can be used to update the data automatically on a regular basis.

In [41]:
import pandas as pd
import requests
from bs4 import BeautifulSoup as bs
import json
import io
import numpy as np
import os 
from sqlalchemy import create_engine
engine = create_engine('sqlite:///allports_updated.db')

with open('configuration.json') as f:
    config = json.load(f)

The cell above imports the required libraries, creates the SQLite engine, and reads the configuration data from `configuration.json`.
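
The scraping loop further down reads `entry['product']` and `entry['url']` from each item in `config`, so the configuration file is presumably a JSON array of objects with those two keys. A minimal sketch of that shape follows. The URLs are placeholders, and `validate_config` is a hypothetical helper that is not part of the notebook:

```python
import json

# Hypothetical contents of configuration.json: a JSON array with one object
# per product page. The URLs are placeholders, not the real documentation pages.
sample_config_text = """
[
  {"product": "VBR", "url": "https://example.com/vbr/used_ports.html"},
  {"product": "VB365", "url": "https://example.com/vb365/used_ports.html"}
]
"""

def validate_config(entries):
    """Return the entries unchanged, or raise if one lacks a required key."""
    for position, entry in enumerate(entries):
        missing = {"product", "url"} - entry.keys()
        if missing:
            raise ValueError(f"config entry {position} is missing {sorted(missing)}")
    return entries

config_sample = validate_config(json.loads(sample_config_text))
```

Checking the keys up front means a malformed entry fails with a clear message before any page is fetched, rather than as a `KeyError` inside the loop.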

Below is the code that does the scraping. 

1. It iterates through the URLs in the config file and gets the HTML.
2. It uses BeautifulSoup to parse the HTML and find all the subheadings and tables.
3. It tracks three heading levels, identified by the `span` classes `Subheading`, `Subheading_L2`, and `Subheading_L3`.
4. When it finds a table, it creates a DataFrame and adds the subheadings to the DataFrame.
5. It appends the DataFrame to a list of DataFrames for that URL.
6. When all the tables for a URL have been processed, it concatenates the DataFrames and saves them to a central list.
7. Finally, it concatenates all the DataFrames from all URLs for a final output.
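
The heading-tracking rule in steps 3 and 4 can be illustrated in isolation. The sketch below is a simplification: it represents page elements as `(kind, text)` pairs instead of parsed HTML, and `attach_headings` and the sample outline are made up for illustration. What it shows is the reset rule: a new heading clears every deeper heading level.

```python
def attach_headings(elements):
    """Walk (kind, text) pairs in document order and tag each table with
    the headings in effect when it is reached."""
    levels = ["", "", ""]  # Subheading, Subheading_L2, Subheading_L3
    depth = {"Subheading": 0, "Subheading_L2": 1, "Subheading_L3": 2}
    tagged = []
    for kind, text in elements:
        if kind in depth:
            level = depth[kind]
            levels[level] = text
            # A new heading invalidates every deeper heading.
            for deeper in range(level + 1, 3):
                levels[deeper] = ""
        elif kind == "table":
            tagged.append((text, *levels))
    return tagged

# Hypothetical page outline; the table names are placeholders.
outline = [
    ("Subheading", "Storage Systems"),
    ("Subheading_L2", "NEC Storage M Series"),
    ("table", "table A"),
    ("Subheading", "Backup Server"),
    ("table", "table B"),
]
tagged_tables = attach_headings(outline)
```

Here `table B` does not inherit `NEC Storage M Series`, because the second top-level heading reset the lower levels before that table was reached.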

In [42]:
# Initialize a list to store all intermediate DataFrames
all_dataframes = []

# Iterate over each item in the config array
for entry in config:
    product = entry['product']
    url = entry['url']
    
    try:
        # Fetch the HTML content
        html = requests.get(url, timeout=30)
        html.raise_for_status()  # fail on HTTP errors instead of parsing an error page
        soup = bs(html.content, "html.parser")

        # Initialize variables to track headings
        current_subheading = ""
        current_subheading_l2 = ""
        current_subheading_l3 = ""

        # List to store DataFrames for this product
        product_dataframes = []

        # Find all elements in the body
        all_elements = soup.body.find_all(['span', 'table'])

        # Iterate through the elements
        for element in all_elements:
            if element.name == 'span' and 'class' in element.attrs:
                element_classes = element.get('class', [])
                if 'Subheading' in element_classes:
                    current_subheading = element.get_text(strip=True)
                    current_subheading_l2 = ""
                    current_subheading_l3 = ""
                elif 'Subheading_L2' in element_classes:
                    current_subheading_l2 = element.get_text(strip=True)
                    current_subheading_l3 = ""
                elif 'Subheading_L3' in element_classes:
                    current_subheading_l3 = element.get_text(strip=True)
            elif element.name == 'table':
                # Convert the table to a DataFrame
                df = pd.read_html(io.StringIO(str(element)))[0]

                # Add the headings as columns
                df['Subheading'] = current_subheading
                df['Subheading_L2'] = current_subheading_l2
                df['Subheading_L3'] = current_subheading_l3

                # Add the product column
                df['Product'] = product

                # Append the DataFrame to the product-specific list
                product_dataframes.append(df)

        # Combine all DataFrames for this product
        if product_dataframes:
            combined_product_df = pd.concat(product_dataframes, ignore_index=True)
            all_dataframes.append(combined_product_df)

    except Exception as e:
        print(f"Error processing {product} ({url}): {e}")

# Combine all intermediate DataFrames into a single DataFrame
final_combined_df = pd.concat(all_dataframes, ignore_index=True)


The following code is designed to clean up the data and remove any unneeded information.

In [43]:
final_combined_df["Description"] = np.where(
    final_combined_df["Notes"].notna(), final_combined_df["Notes"], final_combined_df["Description"]
)

In [44]:
final_combined_df['Description'] = final_combined_df['Description'].fillna('')

In [45]:
final_combined_df['Subheading'] = np.where(final_combined_df['Subheading'] == '', final_combined_df['From'], final_combined_df['Subheading'])

In [46]:
final_combined_df['Port'] = np.where(
    (final_combined_df['Port'].isna()) & (~final_combined_df['Port/Endpoint'].isna()),
    final_combined_df['Port/Endpoint'],
    final_combined_df['Port']
)

In [47]:
columns_to_drop = [0, 'Port/Endpoint', 'Notes']

In [48]:
final_combined_df.drop(columns=columns_to_drop, inplace=True, errors='ignore')

In [49]:
final_combined_df = final_combined_df.dropna(subset=['Port'], how='all')

In [50]:
rows_to_drop = ["Other Communications",  "Communication with Backup Server", "Communication with Backup Infrastructure Components", "Depends on device configuration", "Communication with Virtualization Servers"]

In [51]:
final_combined_df['Port'] = final_combined_df['Port'].astype(str)

In [52]:
if os.path.isfile("final_combined_df.parquet"):
    final_combined_df = pd.read_parquet('final_combined_df.parquet', engine='pyarrow')
else:
    final_combined_df.to_parquet('final_combined_df.parquet', engine='pyarrow')

In [53]:
final_combined_df.head()

| | Subheading | Subheading_L2 | Subheading_L3 | Product | From | To | Protocol | Port | Description |
|---|---|---|---|---|---|---|---|---|---|
| 2 | Veeam Backup for Microsoft 365 server | | | VB365 | Veeam Backup for Microsoft 365 server | Microsoft Exchange Online | TCP | 443 | Required to connect to Microsoft Exchange Onli... |
| 3 | Veeam Backup for Microsoft 365 server | | | VB365 | Veeam Backup for Microsoft 365 server | Microsoft SharePoint Online | TCP | 443 | Required to connect to Microsoft SharePoint On... |
| 4 | Veeam Backup for Microsoft 365 server | | | VB365 | Veeam Backup for Microsoft 365 server | On-premises Microsoft SharePoint server | HTTP (HTTPS) | 5985 (5986 — used by default) | Required to connect to on-premises Microsoft S... |
| 5 | Veeam Backup for Microsoft 365 server | | | VB365 | Veeam Backup for Microsoft 365 server | On-premises Microsoft Exchange server | TCP | 80 or 443 | Required to connect to on-premises Microsoft E... |
| 6 | Veeam Backup for Microsoft 365 server | | | VB365 | Veeam Backup for Microsoft 365 server | Backup proxy server | TCP | 9193 (used by default) | Required to manage inbound/outbound traffic wh... |


In [54]:
# final_combined_df[final_combined_df['Subheading_L2'] == 'IBM FlashSystem (formerly Spectrum Virtualize) Storage']

final_combined_df['Subheading_L2'] = np.where((final_combined_df['Subheading_L2'] == 'IBM FlashSystem (formerly Spectrum Virtualize) Storage') & (final_combined_df['Subheading_L3'] != "" ), final_combined_df['Subheading_L3'], final_combined_df['Subheading_L2'])

In [55]:
final_combined_df[(final_combined_df['Subheading_L2'] == final_combined_df['Subheading_L3']) & (final_combined_df['Subheading_L2'] != "")]

| | Subheading | Subheading_L2 | Subheading_L3 | Product | From | To | Protocol | Port | Description |
|---|---|---|---|---|---|---|---|---|---|
| 220 | Storage Systems | INFINIDAT InfiniBox | INFINIDAT InfiniBox | VBR | Backup server | INFINIDAT InfiniBox storage system | TCP | 443 | Default command port used for communication wi... |
| 221 | Storage Systems | INFINIDAT InfiniBox | INFINIDAT InfiniBox | VBR | Backup proxy | INFINIDAT InfiniBox storage system | TCP | 3260 | Default iSCSI target port. |
| 222 | Storage Systems | NEC Storage M Series | NEC Storage M Series | VBR | Backup server | NEC Storage M Series storage system | TCP | 22 | Default command port used for communication wi... |
| 223 | Storage Systems | NEC Storage M Series | NEC Storage M Series | VBR | Backup proxy | NEC Storage M Series storage system | TCP | 3260 | Default iSCSI target port. |
| 224 | Storage Systems | NEC Storage V Series | NEC Storage V Series | VBR | Backup server | NEC Storage V Series storage system | TCP | 443 | Default command port used for communication wi... |
| 225 | Storage Systems | NEC Storage V Series | NEC Storage V Series | VBR | Backup proxy | NEC Storage V Series storage system | TCP | 3260 | Default iSCSI target port. |
| 226 | Storage Systems | NetApp SolidFire/HCI | NetApp SolidFire/HCI | VBR | Backup server | NetApp SolidFire/HCI storage system | TCP | 443 | Default command port used for communication wi... |
| 227 | Storage Systems | NetApp SolidFire/HCI | NetApp SolidFire/HCI | VBR | Backup proxy | NetApp SolidFire/HCI storage system | TCP | 3260 | Default iSCSI target port. |
| 228 | Storage Systems | Pure Storage FlashArray | Pure Storage FlashArray | VBR | Backup server | Pure Storage FlashArray system | TCP | 443 | Default command port used for communication wi... |
| 229 | Storage Systems | Pure Storage FlashArray | Pure Storage FlashArray | VBR | Backup proxy | Pure Storage FlashArray system | TCP | 3260 | Default iSCSI target port. |


In [56]:
final_combined_df['Subheading_L3'] = np.where(
    (final_combined_df['Subheading_L2'] == final_combined_df['Subheading_L3']) & 
    (final_combined_df['Subheading_L2'] != "") & 
    (final_combined_df['Subheading_L3'] != ""), 
    "", 
    final_combined_df['Subheading_L3']
)

In [57]:
final_combined_df = final_combined_df[~final_combined_df['Port'].isin(rows_to_drop)]

In [58]:
final_combined_df.loc[:, 'Port'] = final_combined_df['Port'].str.split('(').str[0].str.strip()
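
The cell above keeps only the text before the first opening parenthesis in each `Port` value. The same rule can be shown on plain strings. This is a sketch, and `strip_port_annotation` is a hypothetical name that does not appear in the notebook:

```python
def strip_port_annotation(port_text):
    """Keep the text before the first '(' and trim surrounding whitespace."""
    return port_text.split("(")[0].strip()

# Sample values taken from the Port column shown earlier.
samples = ["9193 (used by default)", "80 or 443", "443"]
cleaned = [strip_port_annotation(value) for value in samples]
# cleaned == ["9193", "80 or 443", "443"]
```

Values without parentheses pass through unchanged. One consequence worth noting is that an alternative port listed inside the parentheses is discarded along with the annotation, which is why the `5985` row loses its second port in the output below.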

In [59]:
final_combined_df.rename(columns={"Subheading": "subheading", 
                                  "Subheading_L2": "subheadingL2", 
                                  "Subheading_L3": "subheadingL3", 
                                  "Port": "port", 
                                  "Protocol": "protocol",
                                  "Description": "description", 
                                  "Product": "product", 
                                  "From": "sourceService", 
                                  "To": "targetService"}, inplace=True)

In [60]:
final_combined_df.head()

| | subheading | subheadingL2 | subheadingL3 | product | sourceService | targetService | protocol | port | description |
|---|---|---|---|---|---|---|---|---|---|
| 2 | Veeam Backup for Microsoft 365 server | | | VB365 | Veeam Backup for Microsoft 365 server | Microsoft Exchange Online | TCP | 443 | Required to connect to Microsoft Exchange Onli... |
| 3 | Veeam Backup for Microsoft 365 server | | | VB365 | Veeam Backup for Microsoft 365 server | Microsoft SharePoint Online | TCP | 443 | Required to connect to Microsoft SharePoint On... |
| 4 | Veeam Backup for Microsoft 365 server | | | VB365 | Veeam Backup for Microsoft 365 server | On-premises Microsoft SharePoint server | HTTP (HTTPS) | 5985 | Required to connect to on-premises Microsoft S... |
| 5 | Veeam Backup for Microsoft 365 server | | | VB365 | Veeam Backup for Microsoft 365 server | On-premises Microsoft Exchange server | TCP | 80 or 443 | Required to connect to on-premises Microsoft E... |
| 6 | Veeam Backup for Microsoft 365 server | | | VB365 | Veeam Backup for Microsoft 365 server | Backup proxy server | TCP | 9193 | Required to manage inbound/outbound traffic wh... |


In [None]:
# final_combined_df.to_excel('final_combined_df.xlsx', index=False)

This section creates the SQLite database and the `all_ports` table, then inserts the data into the table.

This has been updated since the first version: the table now includes the `subheading`, `subheadingL2`, and `subheadingL3` columns, which gives it a more structured layout.

In [61]:
# Write DataFrame to SQL table (replace if exists)
final_combined_df.to_sql(
    'all_ports',
    con=engine,
    if_exists='replace',
    index=False
)

1208
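
Once the table is written, it can be queried with the standard-library `sqlite3` module. The sketch below uses an in-memory database as a stand-in for `allports_updated.db` so that it runs on its own. The column names follow the rename step above, and the `TEXT` types and the two sample rows (adapted from the outputs shown earlier) are assumptions rather than the schema `to_sql` actually produces:

```python
import sqlite3

# In-memory stand-in for allports_updated.db so the sketch runs on its own.
con = sqlite3.connect(":memory:")
con.execute(
    """
    CREATE TABLE all_ports (
        subheading TEXT, subheadingL2 TEXT, subheadingL3 TEXT,
        product TEXT, sourceService TEXT, targetService TEXT,
        protocol TEXT, port TEXT, description TEXT
    )
    """
)
# Sample rows; the first description is left empty because the displayed
# output truncates it.
rows = [
    ("Veeam Backup for Microsoft 365 server", "", "", "VB365",
     "Veeam Backup for Microsoft 365 server", "Microsoft Exchange Online",
     "TCP", "443", ""),
    ("Storage Systems", "INFINIDAT InfiniBox", "", "VBR",
     "Backup proxy", "INFINIDAT InfiniBox storage system",
     "TCP", "3260", "Default iSCSI target port."),
]
con.executemany("INSERT INTO all_ports VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)", rows)

# Parameterised lookup: every connection a given product makes on a given port.
matches = con.execute(
    "SELECT sourceService, targetService FROM all_ports "
    "WHERE product = ? AND port = ? ORDER BY targetService",
    ("VBR", "3260"),
).fetchall()
con.close()
```

Because `port` is stored as text (the notebook casts the column to `str`), the lookup value is passed as the string `"3260"` rather than an integer.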

In [62]:
# # Ensure the SQLite connection is established
# import sqlite3

# # Connect to the SQLite database
# con = sqlite3.connect("allports_updated.db")
# cur = con.cursor()

In [63]:
# cur.execute("SELECT name FROM sqlite_master WHERE type='table' AND name='all_ports'")
# if cur.fetchone():
#     print("Table 'all_ports' already exists. Dropping it.")
#     cur.execute("DROP TABLE all_ports")

In [64]:
# # Create the table if it doesn't already exist
# cur.execute(
#     """
#     CREATE TABLE IF NOT EXISTS all_ports (
#         product TEXT,
#         subheading TEXT,
#         subheading_l2 TEXT,
#         subheading_l3 TEXT,
#         from_port TEXT,
#         to_port TEXT,
#         protocol TEXT,
#         port TEXT,
#         description TEXT
#     )
#     """
# )

In [65]:
# # Insert all rows from final_combined_df into the database
# columns = ['Product', 'Subheading', 'Subheading_L2', 'Subheading_L3', 'From', 'To', 'Protocol', 'Port', 'Description']
# data = final_combined_df[columns].values.tolist()

# cur.executemany(
#     """
#     INSERT INTO all_ports (product, subheading, subheading_l2, subheading_l3, from_port, to_port, protocol, port, description)
#     VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)
#     """,
#     data
# )

In [66]:
# # Commit the transaction and close the connection
# con.commit()
# con.close()