<div align="center">
    <h1>Web Scraping - Assignment 3</h1>
</div>

### **Author:** Péter Bence Török

###  **CEU ID:** 2404748

The aim of this project is to scrape the [ENF Solar website](https://www.enfsolar.com/pv/panel), a comprehensive online repository of solar panel products, to gather detailed technical data about various panels. The retrieved data will be analyzed to evaluate and compare the technical specifications of solar panels, providing insights into their performance and key features.

## 1. Setting up environment and creating user-defined functions


In this section, I set up the environment and define the helper functions used in the scraping process. The functions wrap the repetitive tasks, such as collecting product links and extracting specification data, so that the scraping loops in the next section stay short.

In [38]:
# Importing required Python modules
# The wildcard import supplies read_cloud, which the functions below use to fetch pages
from scrapethat import *
from bs4 import BeautifulSoup as soup
import requests
from selenium import webdriver
import pandas as pd
import time
from IPython.display import clear_output

In [40]:
# Define a function that returns the product links listed on page number 'n'
def get_urls(n):
    # Read and parse the listing page for the given page number 'n'
    page = read_cloud(f"https://www.enfsolar.com/pv/panel/{n}")
    
    # Select all elements with the class 'enf-product-name' and build
    # full URLs from their relative 'href' attributes
    urls = ['https://www.enfsolar.com' + x['href'] for x in page.select('.enf-product-name')]
    
    # Return the list of URLs (empty if the page had no product links)
    return urls
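
The link construction above concatenates the site root with each `href`. As a minimal sketch of an alternative, the standard library's `urljoin` does the same job and also leaves already-absolute links untouched. The helper name `to_absolute` and the `href` values below are hypothetical placeholders, not real ENF Solar paths.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.enfsolar.com"

def to_absolute(hrefs, base=BASE_URL):
    """Turn relative hrefs into absolute URLs; absolute hrefs pass through unchanged."""
    return [urljoin(base, href) for href in hrefs]

# Hypothetical href values, for illustration only
example_hrefs = [
    "/pv/example-panel-1",
    "https://www.enfsolar.com/pv/example-panel-2",
]
absolute_urls = to_absolute(example_hrefs)
print(absolute_urls)
```

Whether this is worth the extra import depends on how consistent the site's `href` values are; plain concatenation is fine as long as every link is relative.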

In [5]:
# Creating a function that returns technical data for each product and handles missing variables
def get_product_details(url):
    df = []
    html = read_cloud(url)
    
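    # Select all specification cells; cells that belong to a product carry its 'data-id'
    table = html.select('.enf-api-spec-cell')
    # Collect the distinct product IDs, preserving their order on the page
    unique_data_ids = list(dict.fromkeys(td.get('data-id') for td in table if td.get('data-id')))
    
    for data_id in unique_data_ids:
        try:
            # Extract the cell values of the current product, matching the
            # 'data-id' attribute exactly (a substring test would also match
            # longer IDs that contain this one)
            one_product = [
                str(x).split('>')[1].split('<')[0].strip() 
                for x in table if x.get('data-id') == data_id
            ]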
            
            data = {}
            
            # Product Name
            data['Product_name'] = one_product[0] if len(one_product) > 0 else None
            
            # Title section, shared by Product Family and Manufacturer.
            # Defined up front so a missing section cannot cause a NameError below.
            title_sections = html.select('.mk-title')
            title_section = title_sections[0] if title_sections else None

            # Product Family
            h1 = title_section.find('h1') if title_section is not None else None
            data['Product_family'] = h1['title'] if h1 is not None and 'title' in h1.attrs else None

            # Manufacturer
            try:
                data['Manufacturer'] = title_section.find('span', {'id': 'mkjs-product-profile'}).get_text(strip=True)
            except AttributeError:
                # Raised when the title section or the span is missing
                data['Manufacturer'] = None

            # Price
            try:
                price_section = html.select('.yellow')
                data['Price'] = next(
                    (span.find('b').text.strip() for span in price_section if span.find('b')), 
                    None
                )
            except (StopIteration, AttributeError):
                data['Price'] = None

            # Technology and Region
            try:
                tech_table = html.select('.enf-pd-profile-mini-table')[0]
                rows = tech_table.find_all('tr')
                data['Technology'] = rows[0].find('td').text.strip() if len(rows) > 0 else None
                data['Region'] = rows[2].find('td').text.strip() if len(rows) > 2 else None
            except (IndexError, AttributeError):
                data['Technology'] = None
                data['Region'] = None

            # Maximum Power, Voltage, Panel Efficiency
            data['Maximum_Power(Pmax)'] = one_product[1] if len(one_product) > 1 else None
            data['Voltage_at_Maximum_Power(Vmpp)'] = one_product[2] if len(one_product) > 2 else None
            data['Panel Efficiency'] = one_product[6] if len(one_product) > 6 else None
            
            # Similar Products Count and Link
            data['Similar_products'] = len(unique_data_ids)
            data['Link'] = url
            
            # Append the collected product data
            df.append(data)
        
        except Exception as e:
            print(f"Error processing data_id {data_id}: {e}")
            continue
    
    return df
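
Grouping the specification cells by product relies on comparing `data-id` values, and every field lookup falls back to `None` when a product has fewer cells than expected. The following is a minimal sketch of both ideas on plain Python data, with no scraping involved. The `(data-id, text)` pairs and the helper names `group_by_id` and `value_at` are hypothetical and only stand in for the parsed table cells.

```python
from collections import defaultdict

# Hypothetical (data-id, cell text) pairs standing in for parsed table cells
cells = [
    ("12", "Model A"), ("12", "400 Wp"),
    ("123", "Model B"), ("123", "450 Wp"),
]

def group_by_id(cells):
    """Group cell texts by exact data-id, preserving the order of the cells."""
    grouped = defaultdict(list)
    for data_id, text in cells:
        grouped[data_id].append(text)
    return dict(grouped)

def value_at(values, index):
    """Return values[index], or None when the product has fewer cells."""
    return values[index] if index < len(values) else None

products = group_by_id(cells)

# A substring test on the ID would wrongly pull in the cells of product "123"
substring_matches = [text for data_id, text in cells if "12" in data_id]

print(products["12"])                 # cells of product "12" only
print(len(substring_matches))         # number of cells the substring test matches
print(value_at(products["123"], 6))   # missing field
```

The exact comparison keeps each product's cells separate even when one ID is contained in another, and `value_at` replaces the repeated `x[i] if len(x) > i else None` pattern.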

## 2. Scraping the website
In this section, I carry out the scraping in two phases. The first phase collects product links from a specified range of listing pages; the second extracts detailed product information from each of those links. Progress is displayed dynamically in both phases. The website employs scraping protection and occasionally blocks requests made through the `read_cloud` function, in which case the helper functions return an empty list. To handle this, the code re-queues every page or link that returned nothing and retries it until all data has been retrieved.

In [None]:
# Define a list of pages to scrape, ranging from 1 to 50
pages = list(range(1, 51))

# Initialize empty lists for storing links and final data
links = []
df_f = []

# Variable to track progress for the first phase
done = 0
n = len(pages)  # Total number of pages to scrape

# Loop through each page to gather links
for x in pages:
    # Call the get_urls function to fetch links from the current page
    link = get_urls(x)
    
    # If no links are found (e.g. the request was blocked), re-add the page to
    # the end of the list so that this same loop retries it later.
    # Note: a page that never returns links keeps the loop running.
    if not link:
        pages.append(x)
    else:
        # Extend the list of links with the retrieved links
        links.extend(link)
        done += 1  # Increment the progress counter
        
        # Clear the output and display the progress as a whole percentage
        clear_output(wait=True)
        print(f"First phase: {done/n:.0%} is ready")
    
    # Pause for 3 seconds after every request, successful or not,
    # to avoid overwhelming the server
    time.sleep(3)

# Update the total number of links collected
n = len(links)

# Print completion message for the first phase
print(f'First phase is done. Result: {n} links')

# Reset the progress tracker for the second phase
done = 0

# Loop through each link to scrape product details
for url in links:
    # Call the get_product_details function to extract data from the current URL
    data = get_product_details(url)
    
    # If no data is found (e.g. the request was blocked), re-add the URL to
    # the end of the list so that this same loop retries it later.
    # Note: a URL that never returns data keeps the loop running.
    if not data:
        links.append(url)
    else:
        # Extend the final data list with the retrieved product details
        df_f.extend(data)
        done += 1  # Increment the progress counter
        
        # Clear the output and display the progress as a whole percentage
        clear_output(wait=True)
        print(f"Second phase: {done/n:.0%} is ready")

# Convert the collected list of dictionaries into a DataFrame
df = pd.DataFrame(df_f)
df
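
The loops above retry a page or link for as long as it returns nothing, so a page that never loads would keep them running indefinitely. As a minimal sketch of a bounded variant, the function below gives each item a limited number of attempts and reports the ones that still failed. The names `scrape_with_retries`, `fake_fetch` and `max_attempts` are hypothetical; `fake_fetch` only simulates a blocked request and would be replaced by `get_urls` or `get_product_details`.

```python
import time
from collections import deque

def scrape_with_retries(items, fetch, max_attempts=3, delay=0.0):
    """Call fetch(item) for every item, retrying empty results.

    Returns (results, failed): the combined rows of all successful calls,
    and the items that returned nothing after max_attempts tries.
    """
    queue = deque((item, 1) for item in items)
    results, failed = [], []
    while queue:
        item, attempt = queue.popleft()
        rows = fetch(item)
        if rows:
            results.extend(rows)
        elif attempt < max_attempts:
            # Empty result: put the item back at the end of the queue
            queue.append((item, attempt + 1))
        else:
            failed.append(item)
        if delay:
            time.sleep(delay)  # Be polite to the server between requests
    return results, failed

# Simulated fetcher: page 2 is blocked once, page 3 is always blocked
calls = {}
def fake_fetch(page):
    calls[page] = calls.get(page, 0) + 1
    if page == 3 or (page == 2 and calls[page] == 1):
        return []
    return [f"link-{page}"]

results, failed = scrape_with_retries([1, 2, 3], fake_fetch)
print(results, failed)
```

Items left in `failed` can be inspected or retried later, which separates pages that were temporarily blocked from pages that contain no data at all.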

In [29]:
# Saving dataframe to local .csv file
df.to_csv('solar_panels.csv', index=False)