# Scraping Stock Market Data
- ref: https://www.scraperapi.com/blog/how-to-scrape-stock-market-data-with-python/
- ScraperAPI is a service that allows you to scrape data from websites even if you are facing restrictions or challenges accessing them directly. It acts as a proxy between your application and the target website, providing a simplified interface for web scraping. By using ScraperAPI, you can bypass IP blocks, CAPTCHAs, and other obstacles that might prevent you from scraping data directly.
- The service works by routing your HTTP requests through their infrastructure, handling any necessary authentication or anti-bot measures on your behalf. It also manages rotating IP addresses, so you can avoid being detected as a scraper and maintain a higher level of anonymity.
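The proxying described above comes down to one GET request against the ScraperAPI endpoint, with the target page passed as a query parameter. Here is a minimal sketch of how that request URL is assembled. The helper name `build_proxy_url` and the placeholder key are assumptions for illustration; the endpoint and the `api_key`/`url` parameters are the ones used later in this notebook.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Endpoint used throughout this notebook.
SCRAPERAPI_ENDPOINT = 'http://api.scraperapi.com/'


def build_proxy_url(api_key, target_url):
    """Return the full ScraperAPI request URL for a target page (hypothetical helper)."""
    # urlencode percent-encodes the target URL so it survives as a single query value.
    query = urlencode({'api_key': api_key, 'url': target_url})
    return f'{SCRAPERAPI_ENDPOINT}?{query}'


proxied = build_proxy_url('YOUR_API_KEY', 'https://www.investing.com/equities/nike')
print(proxied)

# Decoding the query string recovers the original target URL unchanged.
decoded = parse_qs(urlsplit(proxied).query)
print(decoded['url'][0])
```

Passing the same dictionary to `requests.get(..., params=...)` builds an equivalent URL, so the helper is mainly useful for logging or inspecting what is actually sent.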

In [1]:
import requests
from bs4 import BeautifulSoup
import csv
from urllib.parse import urlencode
import pandas as pd
import numpy as np
import re

- You need an API key for ScraperAPI; you can get one at https://www.scraperapi.com/.

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [3]:
!ls drive/MyDrive/'Colab Notebooks'/scraper*

'drive/MyDrive/Colab Notebooks/scraper_api_key.txt'


In [4]:
api_key_file = '/content/drive/My Drive/Colab Notebooks/scraper_api_key.txt'

with open(api_key_file, 'r') as f:
    scraper_api_key = f.read().strip()
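
Reading the key from a file keeps it out of the notebook itself. A slightly more defensive variant is sketched below: it checks an environment variable first and rejects an empty key. The function name `load_api_key` and the variable name `SCRAPERAPI_KEY` are assumptions for this example, not anything ScraperAPI defines. The demo uses a temporary file rather than the Drive path.

```python
import os
import tempfile
from pathlib import Path


def load_api_key(path, env_var='SCRAPERAPI_KEY'):
    """Return the API key from an environment variable, else from a text file.

    Both the function and the default variable name are hypothetical.
    """
    key = os.environ.get(env_var, '').strip()
    if key:
        return key
    key = Path(path).read_text(encoding='utf-8').strip()
    if not key:
        raise ValueError(f'No API key found in {path}')
    return key


# Demo with a throwaway file; in the notebook, pass api_key_file instead.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_file = Path(tmp_dir) / 'scraper_api_key.txt'
    demo_file.write_text('demo-key\n', encoding='utf-8')
    demo_key = load_api_key(demo_file, env_var='SCRAPERAPI_KEY_DEMO_UNSET')

print(demo_key)
```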

In [5]:
url = 'https://www.investing.com/equities/nike'
requests.get(url)    # not accessible

<Response [403]>

- A direct request is rejected with a 403 (Forbidden) response, but the same page can be reached through ScraperAPI.

## Use ScraperAPI to get access
- Get detailed information from each company's page.

In [6]:
# url = 'https://www.investing.com/equities/nike'   # alternative target
url = 'https://www.investing.com/equities/coca-cola-co'
params = {'api_key': scraper_api_key, 'url': url}
page = requests.get('http://api.scraperapi.com/', params=urlencode(params))
page.status_code

200

In [7]:
soup = BeautifulSoup(page.text, 'html.parser')

In [8]:
soup.select('.text-xl')  # too many

[<h1 class="text-xl text-left font-bold leading-7 md:text-3xl md:leading-8 mb-2.5 md:mb-2 text-[#232526] rtl:soft-ltr">Coca-Cola Co (KO)</h1>,
 <h2 class="text-xl sm:text-3xl leading-7 sm:leading-8 font-bold mb-6"><a class="flex items-center hover:underline" data-test="link-news" href="/equities/coca-cola-co-news"><span class="mr-3">Coca-Cola Co News</span><svg class="ltr:-scale-x-100 text-[#6A707C]" fill="currentColor" height="15" viewbox="0 0 9 15" width="15" xmlns="http://www.w3.org/2000/svg"><path clip-rule="evenodd" d="M2.828 7.667l5.627 5.626-1.415 1.414L0 7.667 7.04.627 8.455 2.04 2.828 7.667z" fill-rule="evenodd"></path></svg></a></h2>,
 <h2 class="text-xl sm:text-3xl leading-7 sm:leading-8 font-bold mb-6 mt-12"><a class="flex items-center hover:underline" data-test="link-analysis" href="/equities/coca-cola-co-opinion"><span class="mr-3">Coca-Cola Co Analysis</span><svg class="ltr:-scale-x-100 text-[#6A707C]" fill="currentColor" height="15" viewbox="0 0 9 15" width="15" xmlns="

In [9]:
soup.find_all("h1", class_="text-xl")

[<h1 class="text-xl text-left font-bold leading-7 md:text-3xl md:leading-8 mb-2.5 md:mb-2 text-[#232526] rtl:soft-ltr">Coca-Cola Co (KO)</h1>]

In [10]:
company = soup.find_all("h1", class_="text-xl")[0].text; company

'Coca-Cola Co (KO)'
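
Indexing `find_all(...)[0]` raises an `IndexError` when the page layout changes and nothing matches. A small sketch of a more forgiving lookup follows. The helper `first_text` is hypothetical, and the sample HTML is a shortened copy of the `<h1>` shown above.

```python
from bs4 import BeautifulSoup

# Shortened copy of the <h1> element from the page output above.
SAMPLE_HTML = '<h1 class="text-xl text-left font-bold">Coca-Cola Co (KO)</h1>'


def first_text(soup, tag, css_class):
    """Return the stripped text of the first matching element, or None if absent."""
    element = soup.find(tag, class_=css_class)
    return element.get_text(strip=True) if element is not None else None


soup_demo = BeautifulSoup(SAMPLE_HTML, 'html.parser')
company_demo = first_text(soup_demo, 'h1', 'text-xl')
missing_demo = first_text(soup_demo, 'div', 'text-5xl')  # not in the sample

print(company_demo)
print(missing_demo)
```

Returning `None` lets the calling code decide whether to skip the page or record a gap, instead of stopping the whole loop.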

In [11]:
price = soup.find_all("div", class_="text-5xl")[0].text; price

'61.64'

In [12]:
change0 = soup.find_all("div", class_="text-base")[0].text; change0

'+1.07'

In [13]:
change1 = soup.find_all("div", class_="text-base")[1].text; change1

'(+1.77%)'

In [14]:
change = change0 + change1; change

'+1.07(+1.77%)'
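
The combined string is convenient for display but awkward for arithmetic. As a sketch, the two parts can be recovered as numbers with a regular expression. The function name `parse_change` is an assumption, and the pattern only covers the `+1.07(+1.77%)` shape seen in this notebook.

```python
import re

# Matches strings such as '+1.07(+1.77%)' or '0.00(0.00%)'.
CHANGE_RE = re.compile(
    r'^\s*([+-]?\d[\d,]*\.?\d*)\s*\(\s*([+-]?\d[\d,]*\.?\d*)\s*%\s*\)\s*$'
)


def parse_change(text):
    """Split a change string into (absolute_change, percent_change) floats."""
    match = CHANGE_RE.match(text)
    if match is None:
        raise ValueError(f'Unrecognised change string: {text!r}')
    absolute, percent = (float(group.replace(',', '')) for group in match.groups())
    return absolute, percent


gain = parse_change('+1.07(+1.77%)')
loss = parse_change('-4.41(-1.23%)')
flat = parse_change('0.00(0.00%)')
print(gain, loss, flat)
```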

In [15]:
urls = [
'https://www.investing.com/equities/nike',
'https://www.investing.com/equities/coca-cola-co',
'https://www.investing.com/equities/microsoft-corp',
'https://www.investing.com/equities/intel-corp',
'https://www.investing.com/equities/apple-computer-inc',
]

df_stock = pd.DataFrame(columns=["company", "price", "change"])
for url in urls:
    params = {'api_key': scraper_api_key, 'url': url}
    page = requests.get('http://api.scraperapi.com/', params=urlencode(params))

    if page.status_code != 200:
        continue

    soup = BeautifulSoup(page.text, 'html.parser')
    company = soup.find_all("h1", class_="text-xl")[0].text
    price = soup.find_all("div", class_="text-5xl")[0].text
    change0 = soup.find_all("div", class_="text-base")[0].text
    change1 = soup.find_all("div", class_="text-base")[1].text
    change = change0 + change1
    df_stock.loc[len(df_stock)] = [company, price, change]
    # print(company, '\t', price, '\t', change)

print(df_stock)

                        company   price         change
0                Nike Inc (NKE)  109.88  +0.16(+0.15%)
1             Coca-Cola Co (KO)   61.64  +1.07(+1.77%)
2  Microsoft Corporation (MSFT)  355.08  -4.41(-1.23%)
3      Intel Corporation (INTC)   34.46  -0.04(-0.12%)
4              Apple Inc (AAPL)  195.10  +1.37(+0.71%)
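
All three columns in this frame hold strings, so sorting or averaging by price would not behave numerically. The sketch below shows one way to convert the price and split the ticker out of the company name. It rebuilds two rows from the output above by hand, and the new column names `name` and `ticker` are assumptions.

```python
import pandas as pd

# Two rows copied from the printed output above.
df_demo = pd.DataFrame({
    'company': ['Nike Inc (NKE)', 'Coca-Cola Co (KO)'],
    'price': ['109.88', '61.64'],
    'change': ['+0.16(+0.15%)', '+1.07(+1.77%)'],
})

# Remove thousands separators (if any) before converting to floats.
df_demo['price'] = pd.to_numeric(df_demo['price'].str.replace(',', '', regex=False))

# Named groups become column names in the extracted frame.
parts = df_demo['company'].str.extract(r'^(?P<name>.*?)\s*\((?P<ticker>[^)]+)\)$')
df_demo = df_demo.join(parts)

print(df_demo[['name', 'ticker', 'price']])
```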


## From the table
- Now let's extract the same information directly from the summary table on the equities page.

In [16]:
url = 'https://www.investing.com/equities/'
params = {'api_key': scraper_api_key, 'url': url}
page = requests.get('http://api.scraperapi.com/', params=urlencode(params))

In [17]:
soup = BeautifulSoup(page.text, 'html.parser')

In [18]:
table = soup.find(id='cross_rate_markets_stocks_1')
table_list = table.tbody.find_all('tr')
len(table_list)

30

In [19]:
table_list[0]

<tr id="pair_238"><td class="flag"><span class="ceFlags USA" title="United States"> </span></td><td class="bold left noWrap elp plusIconTd"><a href="/equities/boeing-co" title="Boeing Co">Boeing</a><span class="alertBellGrayPlus js-plus-icon genToolTip oneliner" data-id="238" data-name="Boeing Co" data-tooltip="Create Alert" data-volume="5,051,928"></span></td><td class="pid-238-last">208.60</td><td class="pid-238-high">211.87</td><td class="pid-238-low">208.24</td><td class="bold redFont pid-238-pc">-2.97</td><td class="bold redFont pid-238-pcp">-1.40%</td><td class="pid-238-turnover">4.74M</td><td class="pid-238-time" data-value="1689796799">19/07</td><td class="icon"><span class="redClockIcon isOpenExch-1"> </span></td></tr>

In [20]:
table_list[0].find('td', class_="bold")

<td class="bold left noWrap elp plusIconTd"><a href="/equities/boeing-co" title="Boeing Co">Boeing</a><span class="alertBellGrayPlus js-plus-icon genToolTip oneliner" data-id="238" data-name="Boeing Co" data-tooltip="Create Alert" data-volume="5,051,928"></span></td>

In [21]:
company = table_list[0].find('td', class_="bold").text ; company

'Boeing'

In [22]:
table_list[0].find_all('td')

[<td class="flag"><span class="ceFlags USA" title="United States"> </span></td>,
 <td class="bold left noWrap elp plusIconTd"><a href="/equities/boeing-co" title="Boeing Co">Boeing</a><span class="alertBellGrayPlus js-plus-icon genToolTip oneliner" data-id="238" data-name="Boeing Co" data-tooltip="Create Alert" data-volume="5,051,928"></span></td>,
 <td class="pid-238-last">208.60</td>,
 <td class="pid-238-high">211.87</td>,
 <td class="pid-238-low">208.24</td>,
 <td class="bold redFont pid-238-pc">-2.97</td>,
 <td class="bold redFont pid-238-pcp">-1.40%</td>,
 <td class="pid-238-turnover">4.74M</td>,
 <td class="pid-238-time" data-value="1689796799">19/07</td>,
 <td class="icon"><span class="redClockIcon isOpenExch-1"> </span></td>]

In [23]:
table_list[0].find_all('td', class_=re.compile("last"))

[<td class="pid-238-last">208.60</td>]

In [24]:
last = table_list[0].find_all('td', class_=re.compile("last"))[0].text; last

'208.60'

In [25]:
change = table_list[0].find_all('td', class_=re.compile("pc"))[0].text; change

'-2.97'
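
The pattern `re.compile("pc")` finds the right cell here only because the `-pc` column comes before the `-pcp` column and `[0]` picks the first hit. To my understanding, BeautifulSoup tests a compiled pattern against each class name with a search, so an unanchored pattern matches both. The plain-`re` sketch below illustrates the difference using the class names from the row above. The anchored patterns are a suggestion, not something the original notebook uses.

```python
import re

# Class names taken from the <td> cells of the Boeing row shown above.
class_names = ['bold', 'redFont', 'pid-238-last', 'pid-238-pc', 'pid-238-pcp']

loose = re.compile('pc')        # matches anywhere in the class name
strict = re.compile(r'-pc$')    # matches only names ending in '-pc'
strict_pct = re.compile(r'-pcp$')

loose_hits = [name for name in class_names if loose.search(name)]
strict_hits = [name for name in class_names if strict.search(name)]
pct_hits = [name for name in class_names if strict_pct.search(name)]

print(loose_hits)
print(strict_hits)
print(pct_hits)
```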

In [26]:
df_stock = pd.DataFrame(columns=["company", "last", "change"])

for one in table_list:
    company = one.find('td', class_="bold").text
    last = one.find_all('td', class_=re.compile("last"))[0].text
    change = one.find_all('td', class_=re.compile("pc"))[0].text

    df_stock.loc[len(df_stock)] = [company, last, change]

print(df_stock)

             company    last change
0             Boeing  208.60  -2.97
1            Chevron  154.69  +0.94
2        Caterpillar  262.75  +0.24
3              Intel   34.46  -0.04
4          Microsoft  355.08  -4.41
5        Walt Disney   87.04  +1.09
6                Dow   52.82  -0.16
7              Cisco   52.43  +1.19
8      Goldman Sachs  340.55  +3.28
9           JPMorgan  154.25  +0.59
10         Coca-Cola   61.64  +1.07
11        McDonald’s  294.13  +0.31
12          Merck&Co  105.95   0.00
13                3M  103.48  +0.47
14             Apple  195.10  +1.37
15             Amgen  232.05  -0.52
16           Walmart  154.62  +0.05
17        Home Depot  319.48  +2.72
18               IBM  135.48  +0.12
19           Verizon   33.97  +1.70
20         Travelers  170.56  -0.46
21               J&J  158.74  -0.32
22  American Express  177.12  -0.92
23         Honeywell  205.17  -0.08
24    Salesforce Inc  234.37  +6.74
25            Visa A  241.42  +0.65
26   Walgreens Boots   29.93
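
The loop above can be factored into a per-row function that also tolerates missing cells. This is a sketch under assumptions: `parse_row` is a hypothetical helper, and the sample HTML is a trimmed copy of the Boeing row shown earlier, not a live response.

```python
import re
from bs4 import BeautifulSoup

# Trimmed copy of the first table row printed earlier in this section.
ROW_HTML = (
    '<table><tbody><tr id="pair_238">'
    '<td class="flag"><span class="ceFlags USA" title="United States"> </span></td>'
    '<td class="bold left noWrap elp plusIconTd">'
    '<a href="/equities/boeing-co" title="Boeing Co">Boeing</a></td>'
    '<td class="pid-238-last">208.60</td>'
    '<td class="bold redFont pid-238-pc">-2.97</td>'
    '<td class="bold redFont pid-238-pcp">-1.40%</td>'
    '</tr></tbody></table>'
)


def parse_row(row):
    """Turn one <tr> into a dict; missing cells become None instead of raising."""
    def cell_text(pattern):
        cell = row.find('td', class_=re.compile(pattern))
        return cell.get_text(strip=True) if cell is not None else None

    link = row.find('a')
    return {
        'company': link.get_text(strip=True) if link is not None else None,
        'href': link['href'] if link is not None else None,
        'last': cell_text(r'-last$'),
        'change': cell_text(r'-pc$'),
        'change_pct': cell_text(r'-pcp$'),
    }


demo_row = BeautifulSoup(ROW_HTML, 'html.parser').find('tr')
record = parse_row(demo_row)
print(record)
```

A list of such dicts can be passed straight to `pd.DataFrame(...)`, which avoids growing the frame one row at a time with `.loc`.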

## Extracting the web address of each company using the "href" attribute

In [27]:
url = 'https://www.investing.com/equities/'
params = {'api_key': scraper_api_key, 'url': url}
page = requests.get('http://api.scraperapi.com/', params=urlencode(params))
soup = BeautifulSoup(page.text, 'html.parser')

table = soup.find(id='cross_rate_markets_stocks_1')
table_list = table.tbody.find_all('tr')
len(table_list)

30

In [28]:
table_list[0]

<tr id="pair_238"><td class="flag"><span class="ceFlags USA" title="United States"> </span></td><td class="bold left noWrap elp plusIconTd"><a href="/equities/boeing-co" title="Boeing Co">Boeing</a><span class="alertBellGrayPlus js-plus-icon genToolTip oneliner" data-id="238" data-name="Boeing Co" data-tooltip="Create Alert" data-volume="5,051,928"></span></td><td class="pid-238-last">208.60</td><td class="pid-238-high">211.87</td><td class="pid-238-low">208.24</td><td class="bold redFont pid-238-pc">-2.97</td><td class="bold redFont pid-238-pcp">-1.40%</td><td class="pid-238-turnover">4.74M</td><td class="pid-238-time" data-value="1689796799">19/07</td><td class="icon"><span class="redClockIcon isOpenExch-1"> </span></td></tr>

In [29]:
table_list[0].a

<a href="/equities/boeing-co" title="Boeing Co">Boeing</a>

In [30]:
table_list[0].a['href']

'/equities/boeing-co'
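
The `href` is relative to the site root, so it must be combined with the base address before it can be requested. String concatenation works for paths that start with `/`. The standard-library `urljoin` is a slightly safer alternative because it also copes with links that are already absolute. The sample list below is illustrative.

```python
from urllib.parse import urljoin

BASE_URL = 'https://www.investing.com'

# One relative link (as in the table) and one already-absolute link.
hrefs = ['/equities/boeing-co', 'https://www.investing.com/equities/nike']

absolute_urls = [urljoin(BASE_URL, href) for href in hrefs]
for absolute_url in absolute_urls:
    print(absolute_url)
```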

In [31]:
df_stock = pd.DataFrame(columns=["company", "price", "change"])

for one in table_list:
    url = 'https://www.investing.com' + one.a['href']
    # print(url)

    params = {'api_key': scraper_api_key, 'url': url}
    page = requests.get('http://api.scraperapi.com/', params=urlencode(params))

    if page.status_code != 200:
        continue

    soup = BeautifulSoup(page.text, 'html.parser')
    company = soup.find_all("h1", class_="text-xl")[0].text
    price = soup.find_all("div", class_="text-5xl")[0].text
    change0 = soup.find_all("div", class_="text-base")[0].text
    change1 = soup.find_all("div", class_="text-base")[1].text
    change = change0 + change1
    df_stock.loc[len(df_stock)] = [company, price, change]
    # print(company, '\t', price, '\t', change)

print(df_stock)

                                  company   price         change
0                          Boeing Co (BA)  208.60  -2.97(-1.40%)
1                      Chevron Corp (CVX)  154.69  +0.94(+0.61%)
2                   Caterpillar Inc (CAT)  262.75  +0.24(+0.09%)
3                Intel Corporation (INTC)   34.46  -0.04(-0.12%)
4            Microsoft Corporation (MSFT)  355.08  -4.41(-1.23%)
5               Walt Disney Company (DIS)   87.04  +1.09(+1.27%)
6                           Dow Inc (DOW)   52.82  -0.16(-0.30%)
7                Cisco Systems Inc (CSCO)   52.43  +1.19(+2.32%)
8            Goldman Sachs Group Inc (GS)  340.55  +3.28(+0.97%)
9               JPMorgan Chase & Co (JPM)  154.25  +0.59(+0.38%)
10                      Coca-Cola Co (KO)   61.64  +1.07(+1.77%)
11           McDonald’s Corporation (MCD)  294.13  +0.31(+0.11%)
12              Merck & Company Inc (MRK)  105.95    0.00(0.00%)
13                       3M Company (MMM)  103.48  +0.47(+0.46%)
14                       
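
The loop above silently skips any page that does not return status 200, so a dropped company is easy to miss. One possible refinement, sketched here with a stand-in fetcher so it runs without network access, retries each URL and reports the ones that still fail. The names `fetch_all` and `fake_fetch` are hypothetical; in the notebook, the `fetch` argument could wrap the `requests.get('http://api.scraperapi.com/', ...)` call and return `(page.status_code, page.text)`.

```python
def fetch_all(urls, fetch, max_attempts=2):
    """Fetch every URL with a few attempts; return (pages, failed_urls)."""
    pages = {}
    failed = []
    for url in urls:
        for _ in range(max_attempts):
            status_code, text = fetch(url)
            if status_code == 200:
                pages[url] = text
                break
        else:
            # The inner loop finished without a successful response.
            failed.append(url)
    return pages, failed


# Stand-in fetcher: '/flaky' fails once then succeeds, '/blocked' always fails.
call_counts = {}


def fake_fetch(url):
    call_counts[url] = call_counts.get(url, 0) + 1
    if url.endswith('/flaky') and call_counts[url] == 1:
        return 500, ''
    if url.endswith('/blocked'):
        return 403, ''
    return 200, f'<html>{url}</html>'


demo_urls = [
    'https://example.com/ok',
    'https://example.com/flaky',
    'https://example.com/blocked',
]
pages, failed = fetch_all(demo_urls, fake_fetch)
print(sorted(pages))
print(failed)
```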

# A Good Reference for Scraping in Python
- Scraping job information from indeed.com
- https://github.com/marianna13/Indeed-Scraper/tree/main