## Logic and Steps in the Code

1. WebDriver initialization and configuration.
At the beginning of the code, the browser is configured using Selenium WebDriver.
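The options --disable-infobars and --disable-notifications suppress info bars and notification prompts that may interfere with the scraping process, and --start-maximized opens the browser window maximized. Note that --disable-popup-blocking turns off Chrome's built-in pop-up blocker rather than blocking pop-ups.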
The headless mode (commented out) allows the browser to run without a graphical interface, which is useful for background or server-side tasks.

2. Loading the web page and creating a BeautifulSoup object.
The Litecoin comments page on Investing.com is opened in the browser, and its HTML source is passed to BeautifulSoup for further analysis.

3. Retrieving the maximum page number.
The total number of pages (page_max) is determined using the navigation buttons. The last button on the page contains the maximum page number, which is converted to an integer.

4. Parsing comments on each page.
The main loop starts here. On each iteration it:
- Collects the comments from the current page.
- Extracts three fields from each comment: the comment text (by locating the corresponding div), the comment date (by extracting the date text), and an additional feature, the number of likes.
- Validates the timestamp. The script checks how much time has passed since the comment was published. If a comment is older than one year (365 days), parsing is terminated early (the current page is set to the last page and the loop is exited).
- Navigates to the next page. If parsing is not finished and the comments for the last year have not yet been fully collected, the script clicks the "Next" button via Selenium after processing the current page. After the click, the browser is given time to load the new page.

5. Termination of the driver and data storage.
After completing the parsing process, the script stops the WebDriver and saves the collected comments into a DataFrame. The data is exported to an Excel file.
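
The timestamp check from step 4 can be sketched in isolation. The helper name `is_older_than` and the injected `now` argument are assumptions made for illustration; the notebook performs the same comparison inline with `datetime.now()`. As in the notebook, relative dates containing "hours" or "minutes" are treated as recent.

```python
from datetime import datetime, timedelta

# Date format used by the absolute timestamps, e.g. 'Nov 07, 2024, 11:53'
DATE_FORMAT = '%b %d, %Y, %H:%M'


def is_older_than(date_text, now, max_age=timedelta(days=365)):
    """Return True if an absolute comment date is older than max_age.

    Hypothetical helper: relative dates such as '5 hours ago' are
    treated as recent and never trigger early termination.
    """
    if 'hours' in date_text or 'minutes' in date_text:
        return False
    comment_date = datetime.strptime(date_text, DATE_FORMAT)
    return now - comment_date > max_age


# A fixed reference time keeps the example reproducible
reference = datetime(2024, 11, 10, 12, 0)
print(is_older_than('Nov 07, 2024, 11:53', reference))  # False
print(is_older_than('Nov 01, 2023, 10:00', reference))  # True
print(is_older_than('5 hours ago', reference))          # False
```

Passing `now` as an argument instead of calling `datetime.now()` inside the function makes the check easy to verify against known dates.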

In [1]:
# Install the required libraries if they are not already available
# (pandas is imported below; openpyxl is needed by DataFrame.to_excel for .xlsx files)

%pip install beautifulsoup4 selenium pandas openpyxl

Defaulting to user installation because normal site-packages is not writeable
Note: you may need to restart the kernel to use updated packages.


In [2]:
# Import the libraries used in the rest of the notebook

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
import time
from datetime import datetime, timedelta
import pandas as pd

In [3]:
# Configure the browser and turn off info bars and pop-up notifications

options = webdriver.ChromeOptions()

options.add_argument("--disable-infobars")
options.add_argument("--start-maximized")
options.add_argument("--disable-popup-blocking")
options.add_argument("--enable-automation")
options.add_argument("--disable-notifications")

# Run in headless browser mode
#options.add_argument('--headless')

# Start Chrome with the specified settings
driver = webdriver.Chrome(options=options)

In [4]:
# Open the page for further parsing and load its HTML code
driver.get('https://investing.com/crypto/litecoin/chat/')
html = driver.page_source

In [5]:
# Create a BeautifulSoup object from the page source
soup = BeautifulSoup(html, 'html.parser')

In [7]:
# List for storing extracted comments
comment_list = []

# Set the initial page number
page_current = 1

# Find all navigation button elements
pagination_buttons = soup.find_all('button', class_='flex items-center rounded border font-semibold leading-5 border-[#F7F7F8] bg-[#F7F7F8] text-[#1256A0] p-[11px]')

# Get the text of the last navigation button (which contains the maximum page number)
last_page_text = pagination_buttons[-1].get_text()

# Convert the text of the last button to an integer and store it as the maximum page number
page_max = int(last_page_text)

# Main loop for parsing data across pages
# (<= so that the last page is parsed too; the check after the inner loop exits once it is done)
while page_current <= page_max:
    
    # Collect comments from the current page
    data = soup.find_all('div', class_='px-1 pb-5 pt-4 transition-colors duration-300')
    
    for comment in data:
        # Extract text and attributes of the current comment
        text = comment.find('div', class_='break-words leading-5').get_text()
        date_text = comment.find('span', class_='text-[#5B616E]').get_text().replace(' г.', '')
        likes = int(comment.find_all('button', class_='group flex')[0].get_text())
        
        # Check the comment date to stop parsing when outdated messages are reached
        if 'hours' not in date_text and 'minutes' not in date_text:
            comment_date = datetime.strptime(date_text, '%b %d, %Y, %H:%M')
            if datetime.now() - comment_date > timedelta(days=365):
                page_current = page_max  # Terminate the loop early
                break 
        
        # If the comment meets the criteria, add it to the list
        comment_list.append([text, date_text, likes])
    
    # Stop the process if the required range has been reached
    if page_current >= page_max:
        break
    
    # Navigate to the next page and update the data
    page_current += 1
    driver.find_element(By.XPATH, "//button[.//span[text()='Next']]").click()
    time.sleep(3)
    html = driver.page_source
    soup = BeautifulSoup(html, 'html.parser')
    
# Shut down the WebDriver    
driver.quit()

# Number of comments extracted by the parser
print('Number of extracted comments:', len(comment_list))

# Extracted comment text and dates
print('Comment text and dates:', comment_list)

Number of extracted comments: 342
Comment text and dates: [['hodl no 2 after btc', 'Nov 07, 2024, 11:53', 0], ['Undervalued', 'Nov 04, 2024, 22:13', 0], ['Sell this PO💩!', 'Nov 03, 2024, 03:39', 0], ['Sell this POS!', 'Nov 03, 2024, 03:37', 0], ['80 &amp; above once etf for ltc approved in usa', 'Oct 25, 2024, 11:01', 0], ['Only $80, I want $90 or above …. ZZzzz ', 'Oct 26, 2024, 21:19', 0], ['When ltc etf get approved ? ', 'Oct 28, 2024, 08:06', 0], ['ETF for ltc is coming. hang in there', 'Oct 22, 2024, 11:28', 1], ['LTC $69.69 at 03:00 PM ET', 'Sep 29, 2024, 02:00', 1], ['CENTURY WEB RECOVERYCENTURY WEB RECOVERY SERVICE IS THE BEST\n\nTracking stolen crypto — How CENTURY WEB RECOVERY helps Scam victims recover their lost funds. CENTURY WEB RECOVERY is a legitimate Crypto recovery company Who are considered to be one of the most reliable and experienced crypto recovery Experts that provides bitcoin recovery serv
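
The parsing cell converts the like counter with `int(...)` directly, which raises `ValueError` if the button text is ever empty or non-numeric. A minimal defensive sketch follows. The helper name `parse_likes` is hypothetical, and treating a missing counter as zero likes is an assumption; the source does not confirm whether Investing.com ever renders an empty counter.

```python
def parse_likes(raw_text, default=0):
    """Convert a like-counter string to an int.

    Hypothetical helper: falls back to `default` when the text is
    empty or not a valid integer (assumed to mean "no likes").
    """
    try:
        return int(raw_text.strip())
    except ValueError:
        return default


print(parse_likes('3'))       # 3
print(parse_likes('  12\n'))  # 12
print(parse_likes(''))        # 0
```

In the loop, this would wrap the text returned by the like button's `get_text()` call, so that a single malformed comment does not abort the whole run.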

In [8]:
# Create a DataFrame from the collected data
ltc = pd.DataFrame(comment_list, columns = ['comment', 'datetime', 'number_of_likes'])
ltc.head(400)

|     | comment | datetime | number_of_likes |
|-----|---------|----------|-----------------|
| 0   | hodl no 2 after btc | Nov 07, 2024, 11:53 | 0 |
| 1   | Undervalued | Nov 04, 2024, 22:13 | 0 |
| 2   | Sell this PO💩! | Nov 03, 2024, 03:39 | 0 |
| 3   | Sell this POS! | Nov 03, 2024, 03:37 | 0 |
| 4   | 80 &amp; above once etf for ltc approved in usa | Oct 25, 2024, 11:01 | 0 |
| ... | ... | ... | ... |
| 337 | LTC $73.50 @ 10:00 PM ET | Nov 12, 2023, 11:04 | 2 |
| 338 | What is the target until end od sunday? | Nov 12, 2023, 11:21 | 0 |
| 339 | 🔥 Litecoin Is going up like hell But I still t... | Nov 11, 2023, 00:47 | 3 |
| 340 | How come one of the top 15 coins is performing... | Nov 11, 2023, 00:36 | 0 |


In [9]:
# Export the DataFrame to an Excel file (writing .xlsx requires the openpyxl package)

ltc.to_excel('ltc_investing.xlsx', index=False)
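
If an Excel writer is not available, the same rows could be stored as CSV using only the standard library. This is a sketch of an alternative, not the notebook's method. It assumes rows shaped like the `[text, date_text, likes]` entries of `comment_list`; the function name `rows_to_csv` is hypothetical, and the two sample rows are taken from the output shown earlier.

```python
import csv
import io

# Same column names as the DataFrame in the notebook
COLUMNS = ['comment', 'datetime', 'number_of_likes']


def rows_to_csv(rows, columns=COLUMNS):
    """Serialize comment rows to CSV text (hypothetical helper)."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator='\n')
    writer.writerow(columns)
    writer.writerows(rows)
    return buffer.getvalue()


sample_rows = [
    ['hodl no 2 after btc', 'Nov 07, 2024, 11:53', 0],
    ['Undervalued', 'Nov 04, 2024, 22:13', 0],
]
csv_text = rows_to_csv(sample_rows)
print(csv_text)
```

The `csv` module quotes the date field automatically because it contains commas, so the file can be read back without the columns shifting.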

### Link to the repository with the Excel file
https://github.com/ramin29/Cryptoasset-Market-Data-Analysis-and-Modeling/blob/main/gamzaev_ltc_investing.xlsx