# Return of the Scrape

I'll try to maintain a LotR theme.  Apologies for any tenuous links.

As mentioned in the previous GitHub entry, a previous Upwork client asked me to carry out more work for them.  As a refresher, the previous job involved scraping data from hyperlinks in an Excel spreadsheet and filling a table with the results.  After I finished, I was exceptionally pleased, and the following conversation happened with my client:

Me: I've done everything, could you please check it's okay? Thank you!
Client: Looks great, Mike.  I'd very much like the same but for all signals please.

The spreadsheet was full.  I was confused.

Me: As far as I can see all the columns are filled.  When you say all signals, could you give me more detail please?
Client: Ah, so you see this link here (link to website) and then see all the signals on the left.  I'd like a spreadsheet with all of these please.

Yeah, it was a whole website.  But, never fear.  Cups of tea and Google solve everything.

## Initial thoughts

So, we already have a few very significant challenges ahead of us from the get go:

### The Site
* The previous pages the client linked us were static and straightforward to parse.  The client has given us a whole website...which has dynamic webpages in it.

* It being dynamic means BeautifulSoup on its own isn't enough for the search page, because the table is filled in after the page loads.  We'll have to use Selenium to drive a real browser.

### The Data

* Data consists of a dynamically loaded table which means no clear labels for any of the text.

* The table is paged with left and right arrows and can show 10/20/30/40/100 records per page, defaulting to 20.  I couldn't find a way of selecting 100 records per page using Selenium.
   
* This is also a lot more data.

### The Approach

* Get the first page of data.

* Parse all hyperlinks.

* Loop through hyperlinks. 

* Parse data from hyperlinks using previous code.

* Write everything to an Excel file across 3 separate worksheets:
    * Table 1: the main table from the search page.
    * Ratios: added to a separate worksheet to make formatting a lot easier later.
    * Table 2: the parsed data from the hyperlinks.

## Gathering the Fellowship (Import tools)

In [1]:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup
import pandas as pd
from requests import get
import lxml
import time
import csv
import os
import datetime

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Scraping Elements for Table 1 

These are all of the elements found on the search page of the site:

* Rank
* Name
* Gain
* Pips
* DD
* Trades
* Type
* Price
* Age
* Added

In [2]:
def table_1(output=False):
    """
    1. Counts the rows to be scraped.
    
    2. Scrapes desired elements: Rank, Name, Gain, Pips, DD, Trades, Type, Price, Age, Added
    """
    # find_elements_by_* was removed in Selenium 4.3; By is imported in the first cell
    elements = driver.find_elements(By.TAG_NAME, 'td')
    number_of_rows = len(elements[::12])

    # Scraping for table 1 elements:
    i=0
    for num in range(number_of_rows):
        rank = elements[i].text
        name = elements[i+1].text
        gain = elements[i+2].text
        pips = elements[i+3].text
        dd = elements[i+4].text
        trades = elements[i+5].text
        type_ = elements[i+6].text
        price = elements[i+9].text
        age = elements[i+10].text
        added = elements[i+11].text
        i += 12
        
        if output:
            # Optional output
            print(f'Table 1 data for {name}:')
            print(f'Rank: {rank}')
            print(f'Name: {name}')
            print(f'Gain: {gain}')
            print(f'Pips: {pips}')
            print(f'DD: {dd}')
            print(f'Trades: {trades}')
            print(f'Type: {type_}')
            print(f'Price: {price} ')
            print(f'Age: {age}')
            print(f'Added: {added}\n')
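
The fixed-stride indexing above can also be written by cutting the flat list of cell texts into row-sized chunks. This is only a sketch, assuming the cell texts have already been pulled out of the `td` elements into a plain list. `chunk_rows` and the three-column layout are hypothetical, and the sample values are taken from the notebook's own output.

```python
def chunk_rows(cells, width):
    """Split a flat list of cell values into rows of `width` cells each."""
    if width <= 0:
        raise ValueError("width must be positive")
    # Drop a trailing partial row instead of hitting an IndexError later
    usable = len(cells) - len(cells) % width
    return [cells[i:i + width] for i in range(0, usable, width)]


# Hypothetical three-column layout (Rank, Name, Gain), for illustration only
cells = ['1', 'cacus risk parity portfolio', '5,998.4%',
         '2', 'Profit24', '646.26%']
rows = chunk_rows(cells, 3)
for rank, name, gain in rows:
    print(f'Rank: {rank}, Name: {name}, Gain: {gain}')
```

Unpacking each chunk by name avoids tracking a running index like `i` by hand.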

# Scraping Elements for (the) Table (2 (Towers))

Now that the search page data has been collected, we need to collect all of the data behind the hyperlinks on the first page.  These are all the desired elements from the hyperlinks:

* Profits
* Balance
* Equity
* Deposits
* Withdrawals
* Trades
* Won (%)
* Average Trade Time
    * Also converted into hours
* Profit Factor
* Daily 
* Monthly
* Trades per Month
* Expectancy
* Rank
* Platform
* Ratio
* Platform 2
* Info

In [3]:
def table_2(output=False): 
    """
    1. Scrapes hyperlinks from the search page (table 1).
    
    2. Cycles through those hyperlinks to create table 2.
    """
    # Looping through hyperlinks from table 1:
    table = driver.find_elements(By.XPATH, "//a[@class='pointer']")
    for link in table:
        hyperlink = link.get_attribute("href")
        if 'analysis' in hyperlink:
           
            # Creating table 2:
            
            url = get(hyperlink)
            soup = BeautifulSoup(url.text, 'html.parser')
            table = soup.findAll("li", {"class":"list-group-item"})

            # Profits
            pr = soup.findAll("div", {"class": "number"})
            pro = pr[2].text.strip()
            profit = pro.encode('ascii', errors='ignore')

            # Balance
            ba = soup.find("li", class_="list-group-item")
            bal = ba.text.strip().split()
            balance = bal[1].encode('ascii', errors='ignore')

            # Equity
            eq = table[1].text.split()
            equity = eq[2].encode('ascii', errors='ignore')

            # Deposits
            de = table[3].text.split()
            deposits = de[1].encode('ascii', errors='ignore')

            # Withdrawals
            wi=table[4].text.strip().split()
            withdrawals=wi[1].encode('ascii', errors='ignore')

            # Trades
            tr = table[5].text.split()
            trades = tr[1]

            # Won
            wo = table[7].text.strip().split()
            won = wo[1].encode('ascii', errors='ignore')

            # Average trade time
            avg_trade_time = table[8].text.strip().split()
            att = ' '.join(avg_trade_time[3:5])

            if len(att) > 4:
                h = att.split('h')
                hours = int(h[0])
                m = h[1].split('m')
                mins = int(m[0])
                total_hours = (hours+mins/60)
            if len(att) <= 4:
                if 'm' in att:
                    m = att.split('m')
                    mins = int(m[0])
                    total_hours = mins/60
                if 'd' in att:
                    d = att.split('d')
                    days= int(d[0])
                    total_hours = days*24

            # Profit Factor
            pf = table[10].text.strip().split()
            profit_factor = pf[2].encode('ascii', errors='ignore')

            # Daily
            da = table[11].text.strip().split()
            daily = da[1].encode('ascii', errors='ignore')

            # Monthly
            mo = table[12].text.strip().split()
            monthly = mo[1].encode('ascii', errors='ignore')

            # Trades per month
            tpm_ = table[13].text.strip().split()
            tpm = tpm_[3].encode('ascii', errors='ignore')

            # Expectancy
            ex = table[14].text.strip().split()
            expectancy = ex[4].encode('ascii', errors='ignore')

            t2 = soup.find("div", {"class":"caption-helper font-blue-sharp bold master-description-container"})
            t2_parsed = t2.text.split(', ')

            # Rank
            if '#' in t2_parsed[0]:
                r = t2_parsed[0].split('#')
                rank = r[1]
            if '#' not in t2_parsed[0]:
                rank = '-'

            # Platform
            platform = t2_parsed[3]

            # Ratio
            ratio = t2_parsed[4]

            # Platform 2
            platform2 = t2_parsed[5]

            # Info
            info_table = soup.find("p")
            i_ = info_table.text.strip()
            info = i_.encode('ascii', errors='ignore')

            # Optional output:
            if output:
                print(f'Hyperlink data from: {hyperlink}')
                print(f'Rank: {rank}')
                print(f'Platform: {platform}')
                print(f'Ratio: {ratio}')
                print(f'Platform 2: {platform2}')
                print(f"Profit: {profit}")
                print(f"Balance: {balance}")
                print(f"Equity: {equity}")
                print(f"Deposits: {deposits}")
                print(f"Withdrawals: {withdrawals}")
                print(f"Trades: {trades}")
                print(f"Won %: {won}")
                print(f"Average trade time: {att}")
                print(f"Average trade time (hours): {total_hours}")
                print(f"Profit factor: {profit_factor}\n")
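
The average trade time branches above only cover the shapes they test for, so a string like `5h` would leave `total_hours` unset. Here is a sketch of a regex-based alternative, assuming the only unit suffixes are `d`, `h` and `m`, as the code above implies. `trade_time_to_hours` is a hypothetical helper, not part of the original notebook.

```python
import re

# Minutes represented by each unit suffix handled in the notebook's code
_UNIT_MINUTES = {'d': 24 * 60, 'h': 60, 'm': 1}


def trade_time_to_hours(text):
    """Convert strings such as '4d', '3h 20m' or '45m' into hours.

    Returns None when no '<number><unit>' token is found.
    """
    tokens = re.findall(r'(\d+)\s*([dhm])\b', text)
    if not tokens:
        return None
    total_minutes = sum(int(value) * _UNIT_MINUTES[unit] for value, unit in tokens)
    return total_minutes / 60


print(trade_time_to_hours('4d'))      # 96.0
print(trade_time_to_hours('45m'))     # 0.75
print(trade_time_to_hours('3h 20m'))
```

Returning `None` for unrecognised text makes the failure visible in the CSV, rather than carrying over a value from the previous signal.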

## `process_data()` function

We need a function which can save all of this data to a CSV which can then be imported and cleaned using pandas.

In [4]:
def process_data(target_filename, column_headers=None, data_fields=None, clean=None):
    """
    Appends rows to a dated CSV file, writing the header once when the file is created.
    
    `column_headers` inputted as a list
    `data_fields` inputted as a list of lists
    
    clean:
    `clean = True` adds the file to `data_to_clean` so it can be passed to the clean_data function, which produces columns with digits and punctuation only.
    `clean = False` does not add the file.
    """
    global data_to_clean
    
    # Formatting file name
    date = datetime.datetime.now().strftime('%d-%m-%Y')
    file_name = date+'-'+target_filename+'.csv'
    
    # Creating data file
    raw_data_file_exists = os.path.isfile(file_name)
    # newline='' stops the csv module writing blank lines between rows on Windows
    with open(file_name, 'a', newline='') as file:
        writer = csv.writer(file)
        if not raw_data_file_exists:
            # Add header once
            writer.writerow(column_headers or [])
        for d in data_fields or []:
            writer.writerow(d)
    
    # Creating list of files for cleaning (each file is listed only once)
    if clean:
        if 'data_to_clean' not in globals():
            data_to_clean = []
        if file_name not in data_to_clean:
            data_to_clean.append(file_name)
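
The header-once logic is the fiddly part of this function, so here it is on its own. This is a sketch using only the standard library. `append_rows` and the file name `signals.csv` are hypothetical, and the sample rows reuse values from the notebook's output.

```python
import csv
import os
import tempfile


def append_rows(path, header, rows):
    """Append rows to a CSV file, writing the header only when the file is new or empty."""
    needs_header = not os.path.isfile(path) or os.path.getsize(path) == 0
    with open(path, 'a', newline='') as f:
        writer = csv.writer(f)
        if needs_header:
            writer.writerow(header)
        writer.writerows(rows)


# Two separate calls against the same file in a throwaway directory
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'signals.csv')
    append_rows(path, ['Rank', 'Name'], [['1', 'cacus risk parity portfolio']])
    append_rows(path, ['Rank', 'Name'], [['2', 'Profit24']])
    with open(path, newline='') as f:
        written = list(csv.reader(f))

print(written)
```

The check has to happen before the file is opened in append mode, because opening it creates the file.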

## `clean_data()` function

Now that the data has been collected, we need a function which can clean it, as the client has requested numbers only for certain columns.

In [5]:
def clean_data(target_filename):
    df = pd.read_csv(target_filename)
    
    # Drop unnamed columns
    for c in df.columns:
        if 'Unnamed' in c:
            df = df.drop(c, axis=1)
    
    # Strip away all characters which aren't numbers or punctuation for certain columns
    num_only_columns = [
    'Gain',
    'DD',
    'Price',
    'Profit', 
    'Balance', 
    'Equity', 
    'Deposits', 
    'Withdrawals', 
    'Won', 
    'Profit Factor', 
    'Daily', 
    'Monthly', 
    'Trades per month', 
    'Expectancy'
    ]
    
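    for c in num_only_columns:
        # regex=True is needed in pandas 2.0+, where str.replace defaults to literal matching
        df[c] = df[c].str.replace(r'[a-zA-Z%$\'\+]+', '', regex=True)
    
    # Rename certain columns (without columns=, rename() targets the row index instead)
    df = df.rename(columns={'Gain': 'Gain (%)',
                            'DD': 'DD (%)',
                            'Price': 'Price ($)',
                            'Trades.1': 'Trades',
                            'Won': 'Won %',
                            'Daily': 'Daily (%)',
                            'Monthly': 'Monthly (%)'})
    
    # Save formatted data without the index, which would otherwise come back as an 'Unnamed' column
    df.to_csv('cleaned-'+target_filename, index=False)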
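
To see what the pattern actually removes, here is the same character class applied with the standard `re` module instead of pandas. This is a sketch: `strip_to_number_text` is a hypothetical helper, and the sample strings are copied from the notebook's output, including the `b'...'` wrappers left by `.encode()`.

```python
import re

# Same character class as the pandas call: letters, %, $, quotes and plus signs
UNWANTED = re.compile(r"[a-zA-Z%$'+]+")


def strip_to_number_text(value):
    """Remove everything the pattern matches, leaving digits and punctuation."""
    return UNWANTED.sub('', value)


samples = ["b'+$16,019.73'", "5,998.4%", "b'72%'", "$75"]
for s in samples:
    print(f'{s!r} -> {strip_to_number_text(s)!r}')
```

The thousands separator survives, so a value like `16,019.73` would still need its comma removed before it could be passed to `float()`.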

## `scrape_data()` function

A combination of `table_1()` and `table_2()` to make life a lot easier when it comes to creating a pipeline.

In [8]:
def scrape_data(output=None, write=None):
    """
    Combination of table 1 and table 2.  Does the following:
    
    1. Parses the hyperlink from the dynamic main page to give a static page.
    
    2. Parses static page for relevant information.
    
    3. Parses main page for relevant information.
    
    4. Produces optional output for whatever reason.  I thought it looked nice.
    output:
    `output=True` prints all of the scraped details.  
    `output=False` turns off verbose mode.
    
    write:
    `write=True` calls the `process_data()` function and saves the data to a CSV in the current working directory.
    `write=False` does not save the file.
    """
    i = 0
    # find_elements_by_* was removed in Selenium 4.3; By is imported in the first cell
    table = driver.find_elements(By.XPATH, "//a[@class='pointer']")
    for link in table:
        elements = driver.find_elements(By.TAG_NAME, 'td')
        
        # Counts the number of columns and number of rows
        columns = driver.find_elements(By.TAG_NAME, 'tr')
        number_of_columns = (len(columns[0].text.split()))
        number_of_rows = len(elements[::number_of_columns])
        
        # Filter out language hyperlinks
        hyperlink = link.get_attribute("href")
        if 'analysis' in hyperlink:
           
            # Scraping hyperlink data:
            
            url = get(hyperlink)
            soup = BeautifulSoup(url.text, 'html.parser')
            table = soup.findAll("li", {"class":"list-group-item"})

            # Profits
            pr = soup.findAll("div", {"class": "number"})
            pro = pr[2].text.strip()
            profit = pro.encode('ascii', errors='ignore')

            # Balance
            ba = soup.find("li", class_="list-group-item")
            bal = ba.text.strip().split()
            balance = bal[1].encode('ascii', errors='ignore')

            # Equity
            eq = table[1].text.split()
            equity = eq[2].encode('ascii', errors='ignore')

            # Deposits
            de = table[3].text.split()
            deposits = de[1].encode('ascii', errors='ignore')

            # Withdrawals
            wi=table[4].text.strip().split()
            withdrawals=wi[1].encode('ascii', errors='ignore')

            # Trades
            tr = table[5].text.split()
            trades = tr[1]

            # Won
            wo = table[7].text.strip().split()
            won = wo[1].encode('ascii', errors='ignore')

            # Average trade time
            avg_trade_time = table[8].text.strip().split()
            att = ' '.join(avg_trade_time[3:5])
            
            if len(att) == 1:
                att_hours = att
            if len(att) > 4:
                h = att.split('h')
                hours = int(h[0])
                m = h[1].split('m')
                mins = int(m[0])
                att_hours = (hours+mins/60)
            if len(att) <= 4:
                if 'm' in att:
                    m = att.split('m')
                    mins = int(m[0])
                    att_hours = mins/60
                if 'd' in att:
                    d = att.split('d')
                    days= int(d[0])
                    att_hours = days*24
                
            # Profit Factor
            pf = table[10].text.strip().split()
            profit_factor = pf[2].encode('ascii', errors='ignore')

            # Daily
            da = table[11].text.strip().split()
            daily = da[1].encode('ascii', errors='ignore')

            # Monthly
            mo = table[12].text.strip().split()
            monthly = mo[1].encode('ascii', errors='ignore')

            # Trades per month
            tpm_ = table[13].text.strip().split()
            tpm = tpm_[3].encode('ascii', errors='ignore')

            # Expectancy
            ex = table[14].text.strip().split()
            expectancy = ex[4].encode('ascii', errors='ignore')

            t2 = soup.find("div", {"class":"caption-helper font-blue-sharp bold master-description-container"})
            t2_parsed = t2.text.split(', ')

            # Rank
            if '#' in t2_parsed[0]:
                r = t2_parsed[0].split('#')
                rank = r[1]
            if '#' not in t2_parsed[0]:
                rank = '-'

            # Platform
            platform = t2_parsed[3]

            # Ratio
            ratio = t2_parsed[4].split(':')
            ratio1 = ratio[0]
            ratio2 = ratio[1]

            # Platform 2
            platform2 = t2_parsed[5]

            # Info
            info_table = soup.find("p")
            i_ = info_table.text.strip()
            info = i_.encode('ascii', errors='ignore')
            
            # Scraping search page data:
            rank = elements[i].text
            name = elements[i+1].text
            gain = elements[i+2].text
            pips = elements[i+3].text
            dd = elements[i+4].text
            trades = elements[i+5].text
            price = elements[i+8].text
            age = elements[i+9].text
            added = elements[i+10].text
            i += number_of_columns
    
            # Optional output
            if output:
                print(f'Data for {name}.\nHyperlink: {hyperlink}\n')
                #table 1/main page
                print(f'Rank: {rank}')
                print(f'Name: {name}') 
                print(f'Gain: {gain}') 
                print(f'Pips: {pips}') 
                print(f'DD: {dd}')
                print(f'Trades: {trades}')
                print(f'Price: {price} ')
                print(f'Age: {age}')
                print(f'Added: {added}')

                #hyperlink data
                print(f'Rank: {rank}')
                print(f'Platform: {platform}')
                print(f'Ratio: {ratio1}:{ratio2}')
                print(f'Platform 2: {platform2}')
                print(f"Profit: {profit}")
                print(f"Balance: {balance}")
                print(f"Equity: {equity}")
                print(f"Deposits: {deposits}")
                print(f"Withdrawals: {withdrawals}")
                print(f"Trades: {trades}")
                print(f"Won %: {won}")
                print(f"Average trade time: {att}")
                print(f"Average trade time (hours): {att_hours}")
                print(f"Profit factor: {profit_factor}")
                print(f'Daily: {daily}')
                print(f'Monthly: {monthly}')
                print(f'Trades per month: {tpm}')
                print(f'Expectancy: {expectancy}')
                print(f'Info:{info}\n')
            
            # Write to CSV file
            if write:
                all_signals_file_columns = [
                    'Rank', 
                    'Name', 
                    'Hyperlink', 
                    'Gain', 
                    'Pips', 
                    'DD', 
                    'Trades', 
                    'Price', 
                    'Age', 
                    'Added', 
                    'Platform', 
                    'Ratio1', 
                    'Ratio2', 
                    'Platform 2', 
                    'Profit', 
                    'Balance', 
                    'Equity', 
                    'Deposits', 
                    'Withdrawals', 
                    'Trades', 
                    'Won', 
                    'Average Trade Time', 
                    'Average Trade time (hours)', 
                    'Profit Factor',
                    'Daily', 
                    'Monthly', 
                    'Trades per month', 
                    'Expectancy']
                
                all_signals_file_data = [
                    [rank, 
                    name, 
                    hyperlink, 
                    gain, 
                    pips, 
                    dd, 
                    trades, 
                    price, 
                    age, 
                    added, 
                    platform, 
                    ratio1, 
                    ratio2, 
                    platform2, 
                    profit, 
                    balance, 
                    equity, 
                    deposits, 
                    withdrawals, 
                    trades, 
                    won, 
                    att, 
                    att_hours, 
                    profit_factor, 
                    daily, 
                    monthly, 
                    tpm, 
                    expectancy]
                    ]
                signals_info_file_data = [[name, info]]
                signals_info_file_columns = ['Name', 'Info']
                process_data('all-signals-no-info', all_signals_file_columns, all_signals_file_data, clean=True)
                process_data('all-signals-info-only', signals_info_file_columns, signals_info_file_data, clean=False)
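
One fragile spot in the function above is `ratio[1]`, which raises an `IndexError` if the caption ever lacks a colon. Here is a sketch of a more forgiving split. `split_ratio` is a hypothetical helper, and returning `'-'` for a missing value is an assumption borrowed from how the missing rank is handled.

```python
def split_ratio(text):
    """Split a ratio string such as '1:100' into its two sides.

    Returns ('-', '-') when the text has no colon.
    """
    left, sep, right = text.strip().partition(':')
    if not sep:
        return '-', '-'
    return left.strip(), right.strip()


print(split_ratio('1:100 '))        # ('1', '100')
print(split_ratio('MetaTrader 4'))  # ('-', '-')
```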

# One Cell to Rule Them All

Putting `scrape_data()`, `process_data()` and `clean_data()` into a single cell effectively makes the pipeline.

In [10]:
%%time

# Initialise Webdriver
url = 'https://www.signalstart.com/search-signals'
driver = webdriver.Chrome()
driver.get(url)
driver.maximize_window()

# Define number of pages
page_count = driver.find_element(By.CLASS_NAME, 'pagination-panel-total')
pages = int(page_count.text)
print(f'Scraping {url} ...')

# Scrape every page, clicking "next" between pages but not after the last one
next_button_xpath = "//a[@class='btn btn-sm default next ']"
for i in range(1, pages + 1):
    print(f'Scraping page {i} of {pages} ...\n')
    scrape_data(output=True, write=True)
    print(f'Page {i} scraped.\n')
    if i < pages:
        print(f'Loading page {i + 1}...')
        button = WebDriverWait(driver, 10).until(
            EC.element_to_be_clickable((By.XPATH, next_button_xpath)))
        button.click()
        # Fixed pause so the table can reload before the next scrape
        time.sleep(2)
        print(f'Page {i + 1} loaded.\n')

driver.quit()
print('All pages scraped!')

# Clean every file that process_data() flagged for cleaning
if data_to_clean:
    print('\nCleaning data...\n')
    for file in data_to_clean:
        clean_data(file)
        print(f'{file} cleaned!')

Scraping https://www.signalstart.com/search-signals ...
Scraping page 1 of 48 ...

Data for cacus risk parity portfolio.
Hyperlink: https://www.signalstart.com/analysis/cacus-risk-parity-portfolio/185843

Rank: 1
Name: cacus risk parity portfolio
Gain: 5,998.4%
Pips: 58,831
DD: 100%
Trades: 219
Price: $75 
Age: 3y 3m
Added: Dec 28, 2020
Rank: 1
Platform: FXCM 
Ratio: 1:100 
Platform 2: MetaTrader 4 
Profit: b'+$16,019.73'
Balance: b'$81,002.56'
Equity: b'$79,223.28'
Deposits: b'$15,283.07'
Withdrawals: b'$417.8'
Trades: 219
Won %: b'72%'
Average trade time: 4d
Average trade time (hours): 96
Profit factor: b'3.18'
Daily: b'0.34%'
Monthly: b'10.79%'
Trades per month: b'5'
Expectancy: b'$73.15'
Info:b'RECOMMENDED BROKER FOR 1:1 similarity is FXCM (multiplier set to 1).The Minimum recommended starting deposit is 4500 USD.This is a portfolio based strategy, primarily trading equities, currencies, and metals. Mostly NASDAQ 100, USDMXN, and XAUUSD to lower the beta and increase overall stabil

# Changes to Original Approach

* Instead of scraping two tables separately, these were combined into a single function which made it easier to save the data to a CSV/excel spreadsheet.

* I originally had a convoluted method of writing to text, importing into Excel as a CSV file, cleaning in Excel, and then exporting back into pandas.  This was condensed into writing straight to CSV and formatting from there.

* This was then converted into a full pipeline where data can be scraped, cleaned, and exported in a single cell.

# Challenges and How They Were Addressed

* So, dynamic pages are a real pain.  As I mentioned before, I couldn't find a way of selecting the number of records through Selenium, so instead of cycling through 8 pages of 100, I cycled through 39 pages of 20.  The approach was to have the script look for the elements in the same window and, once the next page button was clicked, repeat the script.  This worked well, if a little slowly.

* Clicking the next button requires an explicit wait.  I couldn't get an implicit/explicit wait to work, so I used `time.sleep()` instead aka cheating.

* The rest of this was a lot of list indexing and slicing the correct elements in the correct manner.

* Writing a header once to the CSV, embarrassingly, took longer than it should have.

* Regex cures all random guff that shouldn't be there.

# How I would improve this code

* Add an explicit wait instead of `time.sleep()`.  Struggling to get this going at the mo.

* Come up with some better puns.  Maybe cut out puns altogether.
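
On the explicit wait: the idea is to poll a condition until it holds or a timeout expires, which is what Selenium's `WebDriverWait(...).until(...)` does against the driver. Below is a stand-in sketch in plain Python, not Selenium's implementation. `wait_until` and the fake `page_ready` condition are hypothetical.

```python
import time


def wait_until(condition, timeout=10.0, poll=0.5):
    """Call `condition()` repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds have passed.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f'condition not met within {timeout} seconds')
        time.sleep(poll)


# Fake condition standing in for "the next page has loaded": true on the third check
checks = {'count': 0}


def page_ready():
    checks['count'] += 1
    return checks['count'] >= 3


ready = wait_until(page_ready, timeout=2.0, poll=0.01)
print(ready, checks['count'])
```

Unlike a fixed `time.sleep()`, this returns as soon as the condition holds and fails loudly when it never does. In the notebook, the condition would be something like the `EC.element_to_be_clickable` check already used before clicking the next button.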

In [122]:

# For testing the length of the first page. number_of_rows
url = 'https://www.signalstart.com/search-signals'
driver = webdriver.Chrome()
driver.get(url)
driver.maximize_window()

elements = driver.find_elements(By.TAG_NAME, 'td')

columns = driver.find_elements(By.TAG_NAME, 'tr')
number_of_columns = (len(columns[0].text.split()))

number_of_rows = len(elements[::number_of_columns])
# Scraping for table 1 elements:
i=0
for num in range(number_of_rows):
    rank = elements[i].text
    name = elements[i+1].text
    gain = elements[i+2].text
    pips = elements[i+3].text
    dd = elements[i+4].text
    trades = elements[i+5].text
    price = elements[i+8].text
    age = elements[i+9].text
    added = elements[i+10].text
    i += number_of_columns
    
    # if output == True:
        # Optional output
    print(f'Table 1 data for {name}:')
    print(f'Rank: {rank}')
    print(f'Name: {name}') 
    print(f'Gain: {gain}') 
    print(f'Pips: {pips}') 
    print(f'DD: {dd}')
    print(f'Trades: {trades}')
    print(f'Price: {price} ')
    print(f'Age: {age}')
    print(f'Added: {added}\n')
driver.quit()

      

Table 1 data for cacus risk parity portfolio:
Rank: 1
Name: cacus risk parity portfolio
Gain: 5,998.4%
Pips: 58,831
DD: 100%
Trades: 219
Price: $75 
Age: 3y 3m
Added: Dec 28, 2020

Table 1 data for Profit24:
Rank: 2
Name: Profit24
Gain: 646.26%
Pips: 3,228.8
DD: 20.66%
Trades: 75
Price: $65 
Age: 2y 5m
Added: Dec 1, 2020

Table 1 data for Misstigo:
Rank: 3
Name: Misstigo
Gain: 2,794.76%
Pips: 71,656.6
DD: 35.45%
Trades: 1056
Price: $30 
Age: 10m 16d
Added: Nov 16, 2020

Table 1 data for Sicuro Moneta:
Rank: 4
Name: Sicuro Moneta
Gain: 612.4%
Pips: 9,882.4
DD: 10.11%
Trades: 506
Price: $69 
Age: 7m 17d
Added: Jan 14, 2021

Table 1 data for ZhangLi PAMM:
Rank: 5
Name: ZhangLi PAMM
Gain: 429.03%
Pips: 4,926.1
DD: 35.07%
Trades: 276
Price: $68 
Age: 5m 15d
Added: Nov 3, 2020

Table 1 data for Share Money:
Rank: 6
Name: Share Money
Gain: 204.65%
Pips: -1,472.5
DD: 12.45%
Trades: 2195
Price: $75 
Age: 3m 15d
Added: Dec 24, 2020

Table 1 data for InwC:
Rank: 7
Name: InwC
Gain: 224.59%
Pips: 1