The provided Python code initializes a Selenium WebDriver for Google Chrome. Selenium WebDriver is a tool for automating web browser interaction, and it's often used for testing web applications or for web scraping.

The `init_webdriver` function creates an instance of `webdriver.ChromeOptions`, which holds the options for the WebDriver, and adds three options using the `add_argument` method:

- The `'--headless'` option runs the browser without a visible window, which is useful if you don't need to see the browser UI.
- The `'--no-sandbox'` option disables Chrome's sandbox, which is often necessary when Chrome runs as root, for example inside a container.
- The `'--disable-dev-shm-usage'` option makes Chrome write its shared-memory files to `/tmp` instead of `/dev/shm`, which avoids crashes in environments such as Docker where `/dev/shm` is small.

Finally, the function creates an instance of `webdriver.Chrome` from the driver binary installed by `ChromeDriverManager` and the options above, and returns it. This instance can be used to interact with the web browser.

In [1]:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
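from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
from bs4 import BeautifulSoup
import time

def init_webdriver():
    """
    Initializes and returns a headless Chrome WebDriver.
    """
    # Setting up Chrome WebDriver
    options = webdriver.ChromeOptions()
    options.add_argument('--headless')  # Run in background (no GUI)
    options.add_argument('--no-sandbox')  # Needed when Chrome runs as root (e.g. in containers)
    options.add_argument('--disable-dev-shm-usage')  # Avoid crashes when /dev/shm is small
    # Selenium 4 expects the driver path wrapped in a Service object
    service = Service(ChromeDriverManager().install())
    driver = webdriver.Chrome(service=service, options=options)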
    return driver
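
Later cells call `driver.quit()` by hand, so an exception raised in between would leave a Chrome process running. One possible way to guard against that is a small context manager. This is only a sketch: `managed_driver` and `FakeDriver` are hypothetical names that are not part of the notebook, and `FakeDriver` stands in for the real WebDriver so the example runs without a browser.

```python
from contextlib import contextmanager


@contextmanager
def managed_driver(factory):
    """Create a driver with `factory` and always call quit() on exit."""
    driver = factory()
    try:
        yield driver
    finally:
        # Runs on normal exit and when the body raises
        driver.quit()


class FakeDriver:
    """Hypothetical stand-in for a Selenium WebDriver."""

    def __init__(self):
        self.closed = False

    def quit(self):
        self.closed = True


try:
    with managed_driver(FakeDriver) as d:
        raise RuntimeError("simulated scraping failure")
except RuntimeError:
    pass

print(d.closed)
```

With the real function, the call would presumably look like `with managed_driver(init_webdriver) as driver: ...`.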




The provided Python code defines a class `PageNavigator` for navigating and interacting with a webpage using Selenium WebDriver. The class has several methods for different types of interactions.

- `__init__(self, driver, link)`: This is the constructor method that initializes the `PageNavigator` instance with a `driver` (Selenium WebDriver instance) and a `link` (URL of the webpage to navigate).

- `navigate(self)`: This method navigates the WebDriver to the specified link.

- `click_button_by_id(self, button_id)`: This method clicks a button on the webpage identified by its ID.

- `click_button_by_class_and_limit(self, button_class, limit_value)`: This method clicks an `<a>` element identified by its CSS class and by the value of its `limit` attribute.

- `click_next_button(self)`: This method clicks the "Next page" link using a JavaScript click, waits two seconds for the page to update, and returns `True`. If the link is not found or an error occurs, it prints an error message and returns `False`.

- `fetch_links(self)`: This method fetches links from the fifth column of a table on the webpage. It continues fetching links across pages until it reaches 1000 links or there are no more pages.

After the class definition, the code initializes a WebDriver and a `PageNavigator` instance, and uses the navigator to interact with a webpage. It navigates to the webpage, clicks a button by its ID, clicks another button by its class and limit value, fetches links from the table, and finally quits the driver.

In [2]:
class PageNavigator:
    def __init__(self, driver, link):
        self.driver = driver
        self.link = link

    def navigate(self):
        """Navigates to the specified link."""
        self.driver.get(self.link)

    def click_button_by_id(self, button_id):
        """Clicks a button identified by its ID."""
        wait = WebDriverWait(self.driver, 5)
        button = wait.until(EC.element_to_be_clickable((By.ID, button_id)))
        button.click()

    def click_button_by_class_and_limit(self, button_class, limit_value):
        """Clicks a button identified by its class and limit value."""
        css_selector = f"a.{button_class}[limit='{limit_value}']"
        wait = WebDriverWait(self.driver, 5)
        button = wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, css_selector)))
        button.click()

    def click_next_button(self):
        """Clicks the "Next page" link; returns False if it is missing or the click fails."""
        wait = WebDriverWait(self.driver, 10)
        try:
            next_button = wait.until(EC.element_to_be_clickable((By.XPATH, '//a[@aria-label="Next page"]')))
            # Click via JavaScript, which also works when the link is covered by another element
            self.driver.execute_script("arguments[0].click();", next_button)
            time.sleep(2)  # Adjust based on observed page load times
            print("Navigating to next page...")
            return True
        except Exception as e:
            print("No more pages to navigate or encountered an error:", e)
            return False


    def fetch_links(self):
        """Fetches links from the fifth column of the table across pages until reaching 1000 links or no more pages."""
        links = []
        base_url = "https://global-standard.org"
        max_links = 1000

        while len(links) < max_links:
            # Wait for the table to be loaded
            WebDriverWait(self.driver, 10).until(EC.presence_of_element_located((By.CSS_SELECTOR, "table.ui-table.search-result-list")))
            soup = BeautifulSoup(self.driver.page_source, 'html.parser')
            table = soup.find('table', class_='ui-table search-result-list')
            rows = table.find('tbody').find_all('tr')

            for row in rows:
                if len(links) >= max_links:
                    break
                cells = row.find_all('td')
                if len(cells) >= 5:  # Ensure there are enough columns
                    link_tag = cells[4].find('a')  # Fifth column for the link
                    if link_tag and link_tag.get('href'):
                        full_link = base_url + link_tag.get('href')
                        links.append(full_link)

            # Stop when there is no next page; the while condition enforces the link limit
            if not self.click_next_button():
                break

        return links


# Initialize the WebDriver and PageNavigator
driver = init_webdriver()
link = "https://global-standard.org/find-suppliers-shops-and-inputs/certified-suppliers/database/search"
navigator = PageNavigator(driver, link)

# Use the navigator to interact with the page
navigator.navigate()
navigator.click_button_by_id("xFormForm-0-submit")
navigator.click_button_by_class_and_limit("xforms-limiter-limit", "50")
links = navigator.fetch_links()
driver.quit()

Navigating to next page...
Navigating to next page...
Navigating to next page...
Navigating to next page...
Navigating to next page...
Navigating to next page...
Navigating to next page...
Navigating to next page...
Navigating to next page...
Navigating to next page...
Navigating to next page...
Navigating to next page...
Navigating to next page...
Navigating to next page...
Navigating to next page...
Navigating to next page...
Navigating to next page...
Navigating to next page...
Navigating to next page...
Navigating to next page...
Navigating to next page...


In [3]:
print(links)


['https://global-standard.org/find-suppliers-shops-and-inputs/certified-suppliers/database/search_result/28841', 'https://global-standard.org/find-suppliers-shops-and-inputs/certified-suppliers/database/search_result/30502', 'https://global-standard.org/find-suppliers-shops-and-inputs/certified-suppliers/database/search_result/42592', 'https://global-standard.org/find-suppliers-shops-and-inputs/certified-suppliers/database/search_result/32019', 'https://global-standard.org/find-suppliers-shops-and-inputs/certified-suppliers/database/search_result/2274', 'https://global-standard.org/find-suppliers-shops-and-inputs/certified-suppliers/database/search_result/38747', 'https://global-standard.org/find-suppliers-shops-and-inputs/certified-suppliers/database/search_result/25973', 'https://global-standard.org/find-suppliers-shops-and-inputs/certified-suppliers/database/search_result/43427', 'https://global-standard.org/find-suppliers-shops-and-inputs/certified-suppliers/database/search_result/
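
The stopping rule used by `fetch_links` (collect until a cap is reached or the pages run out) can be tried without a browser. The sketch below is an illustration, not the notebook's implementation: `collect_links` is a hypothetical helper, the `pages` list stands in for the `href` values parsed from each results page, and `urllib.parse.urljoin` replaces the string concatenation so that already-absolute links are left unchanged.

```python
from urllib.parse import urljoin

BASE_URL = "https://global-standard.org"


def collect_links(pages, max_links=1000, base_url=BASE_URL):
    """Collect absolute links page by page until max_links or pages run out."""
    links = []
    for hrefs in pages:  # each item: the hrefs found on one results page
        for href in hrefs:
            if len(links) >= max_links:
                return links
            links.append(urljoin(base_url, href))
    return links


# Hypothetical page contents standing in for the parsed tables
pages = [["/a/1", "/a/2"], ["/a/3", "https://example.org/x"]]
capped = collect_links(pages, max_links=3)
everything = collect_links(pages)
print(capped)
print(everything)
```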

The provided Python code is used to create a DataFrame from a list of links and then save this DataFrame to a CSV file.


In [None]:
import pandas as pd

# Create a DataFrame from the links list
df_links = pd.DataFrame({'Links': links})

# Save the DataFrame to a CSV file
df_links.to_csv('links.csv', index=False)
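
If a results page happens to be read twice (for instance because it had not refreshed within the fixed two-second pause), the same link could end up in the list more than once. Whether that occurs here is not known; the sketch below only shows one way to drop repeats while keeping the original order before saving. `unique_in_order` is a hypothetical helper and the sample URLs are placeholders.

```python
def unique_in_order(items):
    """Remove repeated items while preserving first-seen order."""
    # dict keys are unique and keep insertion order
    return list(dict.fromkeys(items))


sample = [
    "https://example.org/result/1",
    "https://example.org/result/2",
    "https://example.org/result/1",
]
deduped = unique_in_order(sample)
print(deduped)
```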


The provided Python code defines a class `CompanyInfoExtractor` for extracting company details from a webpage using Selenium WebDriver and BeautifulSoup. The class has several methods for different types of interactions.

- `__init__(self, driver)`: This is the constructor method that initializes the `CompanyInfoExtractor` instance with a `driver` (Selenium WebDriver instance).

- `scrape_company_details(self, company_url)`: This method navigates the WebDriver to the specified company URL, waits for the page to load, parses the loaded page with BeautifulSoup, and extracts various details from the page. These details are returned as a dictionary.

- `_extract_text(self, soup, selector)`: This helper method extracts and returns the text from the first element that matches the CSS selector.

- `_extract_href(self, soup, selector, base_url='')`: This helper method returns the `href` attribute of the first element that matches the CSS selector, prefixed with `base_url`. It returns an empty string if no element matches, if the element has no `href`, or if the `href` points to a PNG image.

- `_extract_combined_text(self, soup, selector)`: This helper method extracts and returns the combined text from all elements that match the CSS selector.

- `_extract_full_address(self, soup)`: This helper method extracts and returns the full address from the page.

After the class definition, the code initializes a WebDriver and a `CompanyInfoExtractor` instance, and uses the extractor to scrape company details from two webpages. The scraped details are printed to the console, and finally the driver is quit.

In [37]:
class CompanyInfoExtractor:
    def __init__(self, driver):
        self.driver = driver


    def scrape_company_details(self, company_url):
        # Navigate to the company details page
        self.driver.get(company_url)
        # Wait up to 3 seconds for the element with id "done" to be present
        WebDriverWait(self.driver, 3).until(EC.presence_of_element_located((By.ID, "done")))
        # Now use BeautifulSoup to parse the loaded page
        soup = BeautifulSoup(self.driver.page_source, 'html.parser')

        # Extract details from the company page
        details = {
            'company': self._extract_text(soup, '#done h1'),
            'brand_name': self._extract_text(soup, 'td#xFormTd-16 span.xforms-text'),
            'country': self._extract_text(soup, 'td#xFormTd-0 span.xforms-text'),
            'product_category': self._extract_text(soup, 'td#xFormTd-3 span.xforms-text'),
            'contact_name': self._extract_combined_text(soup, 'td#xFormTd-4 span.xforms-text'),
            'email_address': self._extract_href(soup, 'td#xFormTd-7 a'),
            'address': self._extract_full_address(soup),
            'license_number': self._extract_text(soup, 'td#xFormTd-15 span.xforms-text'),
            'pdf': self._extract_href(soup, 'td#xFormTd-19 a#xFormA-2', base_url='https://global-standard.org'),
            'certification_body': self._extract_text(soup, 'td#xFormTd-18 span.xforms-text'),
            'expiry_date': self._extract_text(soup, 'td#xFormTd-20 span.xforms-text'),
            'product_details': self._extract_text(soup, 'td#xFormTd-17 span.xforms-text')
        }

        return details

    def _extract_text(self, soup, selector):
        element = soup.select_one(selector)
        return element.text.strip() if element else ''

    def _extract_href(self, soup, selector, base_url=''):
        element = soup.select_one(selector)
        if element and 'href' in element.attrs:
            href = element['href']
            # Check if the href is not a PNG image
            if not href.endswith('.png'):
                return base_url + href
        return ''

    def _extract_combined_text(self, soup, selector):
        elements = soup.select(selector)
        return ' '.join([elem.text.strip() for elem in elements if elem])

    def _extract_full_address(self, soup):
        address_parts = [
            self._extract_text(soup, 'td#xFormTd-8 span.xforms-text'),  # address line 1
            self._extract_text(soup, 'td#xFormTd-9 span.xforms-text'),  # address line 2
            self._extract_text(soup, 'td#xFormTd-10 span.xforms-text'),  # street
            self._extract_combined_text(soup, 'td#xFormTd-11 span.xforms-text')  # postcode, city
        ]
        return ', '.join(filter(None, address_parts))

# Usage example

driver = init_webdriver()  # Make sure this function is defined and initializes WebDriver in headless mode
extractor = CompanyInfoExtractor(driver)
company_details = extractor.scrape_company_details("https://global-standard.org/find-suppliers-shops-and-inputs/certified-suppliers/database/search_result/44175")
print(company_details)
company_details = extractor.scrape_company_details("https://global-standard.org/find-suppliers-shops-and-inputs/certified-suppliers/database/search_result/43335")
print(company_details)
driver.quit()
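
The link handling in `_extract_href` can be exercised on plain strings. The sketch below is a hedged variant, not the notebook's method: `resolve_href` and `strip_mailto` are hypothetical helpers, and the paths are made up. It uses `urljoin` instead of concatenation, and it shows how the `mailto:` prefix that appears in the `email_address` column of the scraped data could be removed.

```python
from urllib.parse import urljoin


def resolve_href(href, base_url=""):
    """Return an absolute URL for href, or '' for missing values and PNG images."""
    if not href or href.lower().endswith(".png"):
        return ""
    return urljoin(base_url, href) if base_url else href


def strip_mailto(href):
    """Turn 'mailto:user@example.com' into 'user@example.com'."""
    prefix = "mailto:"
    return href[len(prefix):] if href.lower().startswith(prefix) else href


pdf_url = resolve_href("/files/cert.pdf", "https://global-standard.org")
icon_url = resolve_href("/images/icon.png", "https://global-standard.org")
email = strip_mailto(resolve_href("mailto:user@example.com"))
print(pdf_url, repr(icon_url), email)
```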

The provided Python code defines a helper function, `save_to_csv`, that appends one record to a CSV file.

- First, the `csv` module is imported, which provides functions to read and write data in CSV format.

- The function `save_to_csv(data)` is defined. It takes one argument, `data`: a dictionary whose keys are column names and whose values are the cell values for a single row.

- Inside the function, a filename `data.csv` is defined for the CSV file.

- The keys from the `data` dictionary are retrieved using `data.keys()`, and these keys will be used as the column names in the CSV file.

- The CSV file is opened in append mode (`'a'`) with `open(file_name, 'a', newline='', encoding='utf-8')`. If the file doesn't exist, it will be created.

- A `csv.DictWriter` is created with the opened file and the headers. `DictWriter` is a class which writes dictionaries to the CSV file.

- The file's current position is checked with `csvfile.tell()`. If it's `0`, that means the file is empty, so the headers are written to the file with `writer.writeheader()`.

- Finally, the data is written to the CSV file with `writer.writerow(data)`. Each key-value pair in the `data` dictionary corresponds to a column and a cell in the CSV file.

In [38]:
import csv

def save_to_csv(data):
    # Define the CSV file name
    file_name = 'data.csv'

    # Get the keys (column names) from the data
    headers = data.keys()

    # Open the CSV file in append mode, create a new file if it doesn't exist
    with open(file_name, 'a', newline='', encoding='utf-8') as csvfile:
        # Create a CSV writer with specified fieldnames
        writer = csv.DictWriter(csvfile, fieldnames=headers)

        # Check if the file is empty; if yes, write the headers
        if csvfile.tell() == 0:
            writer.writeheader()

        # Write the data to the CSV file
        writer.writerow(data)

{'company': 'A.V. CREATIONS', 'brand_name': '', 'country': 'India (IN), Gujarat', 'product_category': 'Fabrics', 'contact_name': '', 'email_address': '', 'address': 'Shop No- 2004, Vikas Logistic Park, KUMBHARIYA, 395010, Surat,', 'license_number': '231279', 'pdf': '', 'certification_body': 'Intertek Testing Services', 'expiry_date': '2024-10-19', 'product_details': ''}
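
The header-once behaviour of `save_to_csv` can be checked in isolation. The sketch below assumes a hypothetical variant, `append_row`, that takes the file path as a parameter so it can write to a temporary directory; the two rows are made-up sample data.

```python
import csv
import os
import tempfile


def append_row(path, row):
    """Append one dict as a CSV row, writing the header only for an empty file."""
    with open(path, "a", newline="", encoding="utf-8") as csvfile:
        writer = csv.DictWriter(csvfile, fieldnames=list(row.keys()))
        if csvfile.tell() == 0:  # position 0 means nothing has been written yet
            writer.writeheader()
        writer.writerow(row)


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.csv")
    append_row(path, {"company": "Example Textiles", "country": "Portugal (PT)"})
    append_row(path, {"company": "Sample Fabrics", "country": "India (IN)"})
    with open(path, newline="", encoding="utf-8") as f:
        rows = list(csv.reader(f))

print(rows)
```

Because the file is opened in append mode, re-running the scraping loop adds rows to any existing `data.csv` rather than replacing it, so the file should be deleted first when a fresh export is wanted.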


The provided Python code is used to initialize a WebDriver, create a `CompanyInfoExtractor` instance, and then use this instance to scrape company details from a list of links and save these details to a CSV file.

- First, a WebDriver is initialized with `init_webdriver()`, the function defined at the top of this notebook.

- Then, a `CompanyInfoExtractor` instance is created with the WebDriver. Its `scrape_company_details(link)` method takes a URL and returns a dictionary of company details.

- After that, a loop is started over the `links` list. For each link in the list, company details are scraped with `extractor.scrape_company_details(link)` and saved to a CSV file with `save_to_csv(company_details)`.

- Finally, after all links have been processed, the WebDriver is quit with `driver.quit()`.

In [48]:
driver = init_webdriver()
extractor = CompanyInfoExtractor(driver)
for link in links:
    company_details = extractor.scrape_company_details(link)
    save_to_csv(company_details)
driver.quit()
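
As written, the loop stops at the first page that raises an exception (for example when the wait for the `done` element times out), and the driver is then never quit. A possible, more forgiving shape for the loop is sketched below. `scrape_all` is a hypothetical helper, and `fake_scrape` and `saved.append` stand in for `extractor.scrape_company_details` and `save_to_csv` so the example runs without a browser.

```python
def scrape_all(links, scrape, save):
    """Scrape each link, saving results and collecting failures instead of stopping."""
    failed = []
    for link in links:
        try:
            save(scrape(link))
        except Exception as exc:  # e.g. a wait timing out on a slow page
            failed.append((link, repr(exc)))
    return failed


# Stand-ins for extractor.scrape_company_details and save_to_csv
saved = []


def fake_scrape(link):
    if link.endswith("/bad"):
        raise TimeoutError("page did not load")
    return {"url": link}


sample_links = [
    "https://example.org/1",
    "https://example.org/bad",
    "https://example.org/2",
]
failed = scrape_all(sample_links, fake_scrape, saved.append)
print(len(saved), [link for link, _ in failed])
```

In the notebook, the call would presumably sit inside a `try` block whose `finally` clause calls `driver.quit()`, and the returned list of failures could be retried afterwards.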

A first look at the data shows that there are no duplicated columns.

In [51]:
import pandas as pd

# Read the CSV file into a DataFrame
df = pd.read_csv('data.csv')

# Find duplicate columns
duplicate_columns = df.columns[df.columns.duplicated()]
print(duplicate_columns)



Index([], dtype='object')
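
One caveat about this check: `pd.read_csv` renames repeated column names while loading (for example `company` and `company.1`), so `df.columns.duplicated()` comes back empty even if the file's header row does contain repeats. A way to inspect the raw header directly is sketched below with the standard `csv` module; `duplicated_headers` is a hypothetical helper and the CSV text is made-up sample data.

```python
import csv
import io
from collections import Counter


def duplicated_headers(csv_text):
    """Return the column names that occur more than once in the header row."""
    header = next(csv.reader(io.StringIO(csv_text)))
    return [name for name, count in Counter(header).items() if count > 1]


with_repeat = "company,country,company\nA,B,C\n"
without_repeat = "company,country,address\nA,B,C\n"
print(duplicated_headers(with_repeat))
print(duplicated_headers(without_repeat))
```

For the real file, the text could be read with `open('data.csv', newline='', encoding='utf-8').read()`.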


In [53]:
df.head()


Unnamed: 0,company,brand_name,country,product_category,contact_name,email_address,address,license_number,pdf,certification_body,expiry_date,product_details
0,AA CANVAS COMPANY,,"India (IN), Delhi","Fabrics, Product category (other)",ASHOK KUMAR,mailto:ashok@aacanvas.co,"177-A, Azad Market, 110006, DELHI",217155,https://global-standard.org/find-suppliers-sho...,Ecocert Greenlife,12/31/2024,BagsWoven Fabrics (PD0059)handbagspouches (PD0...
1,"A. Ferreira - Sociedade Têxteis, Lda",,Portugal (PT),Fabrics,,,"Travessa Santa Cruz, Nº 75, 4755-246, Góios - ...",CB-GOTS-CUC-03- 1015956,,Control Union Certifications,11/19/2024,Undyed fabrics (PC0027) Knitted fabrics (PD005...
2,A. Ferreira & Filhos S.A.,,Portugal (PT),"Babywear, Garments, Home Textiles",,,"Rua Amaro de Sousa, 480 - Ap. 66, 4816-901, Vi...",CB-GOTS-CUC-03- 1015660,,Control Union Certifications,6/9/2024,Babies' apparel (PC0003) Babies' clothing (PD0...
3,A. G. DRESSES LIMITED.,,"Bangladesh (BD), Dhaka","Babywear, Ladieswear, Children's wear, Men's wear",Hasan,,"Plot # 09, Block # C, Tongi I/A, Himardighi, 1...",USB 004153,,USB Certification,8/1/2024,Pantsblouses (PD0005) - 100% organic cotton (R...
4,"A. J. Gonçalves, S.A.",,Portugal (PT),Accessories,,,"Avenida Covedelo, Nº 28; Apartado 2485, 4705-4...",1048944,,Control Union Certifications,11/18/2024,Women's apparel (PC0002) Sockshosiery (PD0009)...


In [55]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1001 entries, 0 to 1000
Data columns (total 12 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   company             1001 non-null   object
 1   brand_name          6 non-null      object
 2   country             1001 non-null   object
 3   product_category    996 non-null    object
 4   contact_name        392 non-null    object
 5   email_address       436 non-null    object
 6   address             972 non-null    object
 7   license_number      998 non-null    object
 8   pdf                 205 non-null    object
 9   certification_body  998 non-null    object
 10  expiry_date         998 non-null    object
 11  product_details     879 non-null    object
dtypes: object(12)
memory usage: 94.0+ KB


In [56]:
df.isnull().sum()


company                 0
brand_name            995
country                 0
product_category        5
contact_name          609
email_address         565
address                29
license_number          3
pdf                   796
certification_body      3
expiry_date             3
product_details       122
dtype: int64

In [64]:


# Flag every row whose company name occurs more than once (keep=False marks all occurrences)
duplicate_companies = df.duplicated(subset=['company'], keep=False)

# Count the number of duplicate entries
num_duplicate_companies = duplicate_companies.sum()

# Extract the rows with duplicate companies to review
duplicate_companies_df = df[duplicate_companies].sort_values(by='company')

duplicate_companies_df.to_csv('duplicate_companies.csv', index=False)
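
To make the effect of `keep=False` concrete, the same selection can be written in plain Python. This is an illustrative equivalent rather than a replacement for the pandas code: `rows_with_duplicate_company` is a hypothetical helper, and the sample rows are invented (one name is repeated on purpose).

```python
from collections import Counter


def rows_with_duplicate_company(rows):
    """Return every row whose company name occurs more than once, sorted by company."""
    counts = Counter(row["company"] for row in rows)
    dupes = [row for row in rows if counts[row["company"]] > 1]
    return sorted(dupes, key=lambda row: row["company"])


# Invented sample rows; "Sample Fabrics" appears twice
sample_rows = [
    {"company": "Sample Fabrics", "license_number": "0002"},
    {"company": "Example Textiles", "license_number": "0001"},
    {"company": "Sample Fabrics", "license_number": "0003"},
]
dupes = rows_with_duplicate_company(sample_rows)
print(len(dupes), [row["license_number"] for row in dupes])
```

Both occurrences of the repeated name are returned, which is what makes the exported `duplicate_companies.csv` convenient for side-by-side review.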
