# Scraping Reviews from Trustpilot for Electricity Providers in Ireland

## Description

This Python script scrapes customer reviews from Trustpilot for specific electricity providers in Ireland, such as Electric Ireland and Bord Gais Energy. For each review it collects the username, the user's total review count, location, review date, review text, and rating. The process is as follows:

1. **Importing Necessary Libraries:**
   - `requests` for making HTTP requests to the Trustpilot website.
   - `BeautifulSoup` from `bs4` for parsing HTML content.
   - `pandas` for creating and manipulating data in a DataFrame.
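   - `files` from `google.colab` for downloading the resulting Excel file (available only in Google Colab; the code cell shown here does not import it).
   - `datetime` for formatting the date.
   - `sleep` from `time` for pausing between requests.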

2. **Helper Functions:**
   - `soup2list(src, list_, attr=None)`: Extracts text or attribute values from HTML elements and appends them to a list.
   - `format_date(date_str)`: Converts the `datetime` attribute from the `time` elements to a more readable date format.

3. **Defining Lists to Store Scraped Data:**
   - Six lists (`users`, `userReviewNum`, `ratings`, `locations`, `dates`, `reviews`) are initialized inside `scrape_reviews`, one per review field.

4. **Specifying the Range of Pages to Scrape:**
   - The `from_page` and `to_page` parameters of `scrape_reviews` set the inclusive range of pages to scrape (pages 1 to 6 by default).

5. **Scraping Reviews:**
   - A loop iterates over the specified range of pages.
   - For each page, an HTTP GET request is made to the Trustpilot review page for the specified company.
   - The HTML content of the page is parsed using `BeautifulSoup`.
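   - Username, total reviews, location, rating, and review content are extracted with the helper function `soup2list` and appended to the respective lists; dates are read from the `time` elements and passed through `format_date`. The CSS class names used as selectors look auto-generated, so they may stop matching if Trustpilot changes its markup.
   - A one-second pause (`sleep(1)`) follows each request to reduce the chance of being throttled by the server.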

6. **Ensuring Data Consistency:**
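   - All lists are padded with empty strings to the length of the longest one, so that the DataFrame can be created without a length-mismatch error. Padding is appended at the end only, so if a field is missing for some review, later values in that column no longer line up with the correct review.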

7. **Creating a DataFrame:**
   - `scrape_reviews` returns a pandas DataFrame with one column per field (Username, Total reviews, Location, Date, Content, Rating). The calling loop adds a `Company` column and concatenates the results for all companies into `all_reviews`.

8. **Saving and Downloading the Data:**
   - The DataFrame is saved as an Excel file (`.xlsx`).
   - The Excel file is then downloaded to the local system using `files.download`.
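
Step 6 can be illustrated in isolation. The sketch below uses a hypothetical helper, `pad_columns` (not part of the script, which pads each list inline), on made-up toy data. It pads plain Python lists to a common length and also shows the limitation of this approach: padding is only appended at the end, so a value missing from the middle of one list shifts every later value in that column up by one row.

```python
def pad_columns(columns, fill=''):
    """Pad every list in `columns` (a dict of name -> list) to the longest length.

    Hypothetical helper for illustration only.
    """
    max_len = max((len(values) for values in columns.values()), default=0)
    return {
        name: values + [fill] * (max_len - len(values))
        for name, values in columns.items()
    }


# Toy data: the second reviewer has no location, so `locations` is one short.
scraped = {
    'users': ['Ann', 'Bob', 'Cara'],
    'locations': ['IE', 'GB'],  # in reality Ann -> IE and Cara -> GB
}
padded = pad_columns(scraped)
print(padded['locations'])  # ['IE', 'GB', '']: 'GB' now sits on Bob's row
```

Extracting all fields from one review container at a time, rather than from the whole page, would avoid this misalignment, at the cost of depending on the container's markup.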

This script automates the process of collecting and saving customer reviews from Trustpilot, making it easier to analyze feedback for different electricity providers in Ireland.
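
The save-and-download step (step 8) does not appear in the code cell that follows. A minimal sketch of what it might look like is given here. The function name `save_reviews` and the file name `trustpilot_reviews.xlsx` are assumptions, not taken from the original script. `DataFrame.to_excel` needs an Excel writer such as `openpyxl` to be installed, and `files.download` exists only inside Google Colab, so that import is guarded.

```python
def save_reviews(df, path='trustpilot_reviews.xlsx'):
    """Write `df` to an Excel file and, when running in Colab, download it.

    `save_reviews` and the default file name are hypothetical.
    """
    # index=False keeps the row index out of the spreadsheet.
    df.to_excel(path, index=False)
    try:
        from google.colab import files  # only importable inside Colab
    except ImportError:
        return path  # outside Colab the file simply stays on disk
    files.download(path)
    return path
```

Once the scraping loop has run, this would be called as `save_reviews(all_reviews)`.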


In [4]:
from time import sleep
import requests
from bs4 import BeautifulSoup
import pandas as pd
from datetime import datetime

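def soup2list(src, list_, attr=None):
    """Append each element's `attr` value (or its text, if `attr` is None) to `list_`."""
    for val in src:
        list_.append(val[attr] if attr else val.get_text())

def format_date(date_str):
    """Convert an ISO timestamp such as '2024-06-24T14:05:09.000Z' to 'June 24, 2024'."""
    date_obj = datetime.strptime(date_str, "%Y-%m-%dT%H:%M:%S.%fZ")
    return date_obj.strftime("%B %d, %Y")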

def scrape_reviews(company, from_page=1, to_page=6):
    users = []
    userReviewNum = []
    ratings = []
    locations = []
    dates = []
    reviews = []

    for i in range(from_page, to_page+1):
        # The timeout makes a stalled connection raise an error instead of hanging forever.
        result = requests.get(f"https://www.trustpilot.com/review/{company}?page={i}", timeout=30)
        soup = BeautifulSoup(result.content, 'html.parser')

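        # NOTE: these class names look auto-generated and may change when Trustpilot updates its markup.
        soup2list(soup.find_all('span', {'class': 'typography_heading-xxs__QKBS8 typography_appearance-default__AAY17'}), users)
        soup2list(soup.find_all('span', {'class': 'typography_body-m__xgxZ_ typography_appearance-subtle__8_H2l'}), userReviewNum)
        # FIXME: this selector is identical to the one above, so `locations` receives the
        # review-count text again (see the Location column in the output). It needs its own selector.
        soup2list(soup.find_all('span', {'class': 'typography_body-m__xgxZ_ typography_appearance-subtle__8_H2l'}), locations)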

        date_elements = soup.find_all('time', {'datetime': True})
        for date_elem in date_elements:
            dates.append(format_date(date_elem['datetime']))

        soup2list(soup.find_all('div', {'class': 'styles_reviewHeader__iU9Px'}), ratings, attr='data-service-review-rating')
        soup2list(soup.find_all('p', {'class': 'typography_body-l__KUYFJ typography_appearance-default__AAY17 typography_color-black__5LYEn'}), reviews)

        sleep(1)  # To avoid throttling

    # Pad every list to the same length so the DataFrame can be built.
    # Padding is appended at the end, so it does not realign rows when a field is missing mid-list.
    columns = (users, userReviewNum, locations, dates, ratings, reviews)
    max_len = max(len(col) for col in columns)
    for col in columns:
        col.extend([''] * (max_len - len(col)))

    review_data = pd.DataFrame(
        {
            'Username': users,
            'Total reviews': userReviewNum,
            'Location': locations,
            'Date': dates,
            'Content': reviews,
            'Rating': ratings
        }
    )

    return review_data

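# Trustpilot company identifiers: the part of the review URL after /review/.
# Replace these with the identifiers of the providers you want to scrape.
companies = ['www.mexipass.com', 'tripinsure101.com']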
all_reviews = pd.DataFrame()

for company in companies:
    reviews = scrape_reviews(company)
    reviews['Company'] = company
    all_reviews = pd.concat([all_reviews, reviews], ignore_index=True)
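
`format_date` assumes every `datetime` attribute includes fractional seconds, as in `2024-06-24T14:05:09.000Z`; a value without them would raise `ValueError` and stop the scrape. The sketch below shows the conversion on made-up timestamps using a more tolerant variant. The name `format_date_tolerant` and the sample values are illustrative assumptions, and the source does not say whether Trustpilot ever omits the fractional part.

```python
from datetime import datetime

# Formats tried in order: with and without fractional seconds.
_FORMATS = ("%Y-%m-%dT%H:%M:%S.%fZ", "%Y-%m-%dT%H:%M:%SZ")


def format_date_tolerant(date_str):
    """Return e.g. 'June 24, 2024', or '' if the string matches no known format."""
    for fmt in _FORMATS:
        try:
            return datetime.strptime(date_str, fmt).strftime("%B %d, %Y")
        except ValueError:
            continue
    return ''


print(format_date_tolerant("2024-06-24T14:05:09.000Z"))  # June 24, 2024
print(format_date_tolerant("2024-06-06T08:00:00Z"))      # June 06, 2024
print(repr(format_date_tolerant("not a date")))          # ''
```

Returning an empty string for unparseable values matches the empty-string padding the script already uses for missing fields.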


## Output

In [3]:
len(all_reviews)

138

In [2]:
all_reviews.head(50)

| Unnamed: 0 | Username | Total reviews | Location | Date | Content | Rating |
|---|---|---|---|---|---|---|
| 0 | Blanca Haro | 3.1K reviews | 3.1K reviews | June 24, 2024 | had a couple of questions regarding coverage. ... | 5 |
| 1 | Amalia Cardenas | 1.2K reviews | 1.2K reviews | June 26, 2024 | i have been appointed with Mexipass for many y... | 5 |
| 2 | Martin Johnson | 25K reviews | 25K reviews | June 10, 2024 | Other than the systems to update, change or re... | 5 |
| 3 | Thomas Gibson | 2 reviews | 2 reviews | June 24, 2024 | MexiPass is so easy to work with and team is g... | 5 |
| 4 | Sheba Insurance | 1 review | 1 review | June 16, 2024 | The website platform is very easy to use and u... | 5 |
| 5 | Carol A. Burns | 2 reviews | 2 reviews | June 06, 2024 | Your online quoting platform is very easy and ... | 5 |
| 6 | see above | 1 review | 1 review | May 31, 2024 | Your website is user friendly and it helps! B... | 5 |
| 7 | Michelle A | 1 review | 1 review | June 22, 2024 | Great customer service! As insurance agents, ... | 5 |
| 8 | Jose F. Arizola | 2 reviews | 2 reviews | June 20, 2024 | Wow more easy to process applications and very... | 5 |
| 9 | Guillermo Jimenez | 1 review | 1 review | June 15, 2024 | Great Value!Easy to work with, good pricing.Co... | 5 |
