# Text Scraping and Analysis

In this notebook I test-drive the [Selenium](https://www.selenium.dev/) Python library. Selenium is a browser-automation tool (a WebDriver) that lets you write code to navigate web pages and interact with their contents programmatically.

In this project, I use Selenium to scrape the S&P 100 Wikipedia page for its member companies and the links to their respective articles. I then scrape those articles, reduce the dimensionality of the text, and convert it to numerical data. Finally, I use that data to determine how similar S&P 100 companies are to one another based on their article text.

From a financial perspective, this technique is (in principle) quite useful. Some companies may not appear related at first glance but are in fact so entwined that their share prices should move in relative lockstep. Any divergence between them could signal a potential trading opportunity.

## Scraping Wiki Tables

In [108]:
from selenium import webdriver 
from selenium.webdriver.common.by import By 
from selenium.webdriver.support.ui import WebDriverWait 
from selenium.webdriver.support import expected_conditions as EC 
from selenium.common.exceptions import TimeoutException


In [175]:
# Set up the driver and point it to the S&P100 wiki page. Extract the html data.
browser=webdriver.Firefox()
browser.get("https://en.wikipedia.org/wiki/S%26P_100")
html = browser.page_source


In [176]:
# The S&P100 table has a unique ID which we can index directly. 
# This may not always be the case but for now let's work with this.
constituent_table = browser.find_element(By.ID,"constituents")
# Use the <tr> tag to extract the table row info.
table_conts = constituent_table.find_elements(By.TAG_NAME, 'tr')


In [180]:
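# Collect the article link and display name from the second cell of each row.
links = []
names = []
for tr in table_conts:
    td = tr.find_elements(By.TAG_NAME, 'td')
    # The header row has no <td> cells, so it is skipped here.
    if len(td) > 0:
        try:
            links.append(td[1].find_element(By.TAG_NAME, 'a').get_attribute("href"))
            names.append(td[1].text)
        except Exception:
            # find_element raises if the cell contains no <a> tag.
            print("No link for {}, skipping.".format(td[1].text))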


No link for Alphabet (Class A), skipping.


In [190]:
def get_companies_wiki_texts(links, names):
    """Visit each link and return {company name: text of all <p> elements}."""
    comp_dict = {}
    for name, link in zip(names, links):
        print(link + '\n')
        browser.get(link)
        # Join the text of every paragraph on the page into one string.
        paragraphs = browser.find_elements(By.TAG_NAME, 'p')
        comp_dict[name] = " ".join(paragraph.text for paragraph in paragraphs)
    return comp_dict
# Try it on the first two companies only.
comp_dict = get_companies_wiki_texts(links[:2], names[:2])

https://en.wikipedia.org/wiki/Apple_Inc.

https://en.wikipedia.org/wiki/AbbVie



In [191]:
print(comp_dict)

{'Apple': ' Apple Inc. is an American multinational technology company headquartered in Cupertino, California. Apple is the world\'s largest technology company by revenue, with US$394.3 billion in 2022 revenue.[6] As of March 2023, Apple is the world\'s biggest company by market capitalization.[7] As of June 2022, Apple is the fourth-largest personal computer vendor by unit sales and the second-largest mobile phone manufacturer in the world. It is considered one of the Big Five American information technology companies, alongside Alphabet (parent company of Google), Amazon, Meta Platforms, and Microsoft. Apple was founded as Apple Computer Company on April 1, 1976, by Steve Wozniak, Steve Jobs (1955–2011) and Ronald Wayne to develop and sell Wozniak\'s Apple I personal computer. It was incorporated by Jobs and Wozniak as Apple Computer, Inc. in 1977. The company\'s second computer, the Apple II, became a best seller and one of the first mass-produced microcomputers. Apple went public i

## Text Parsing

At this point I can navigate through multiple Wikipedia pages and scrape the contents for any number of companies into a dictionary. The text is still a bit messy, though, and there are two things I want to do:
1. Remove all non-text elements that would create spurious correlations between pages. This is mainly the bracketed citation numbers.
2. Reduce the dimensionality of the text by collapsing words into their simplest forms and dropping "stop" words (like 'the', 'and', 'it'). Luckily, there are Python libraries that can help with this.

`re`, the regular expressions library, can find and strip out non-useful text like the [##] elements.

In [192]:
import re
test_string = """This is our test string to find out how regular expressions work. [1]
We will be seeing what kinds of basic searches and functionality we can do. [2]
Case sensitivity may matter in ouR analysis. [33]"""

In [193]:
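# \d also matches 0, so markers like [10] are removed too; the raw string keeps the backslashes literal.
print(re.sub(r"\[\d+\]", "", test_string))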

This is our test string to find out how regular expressions work. 
We will be seeing what kinds of basic searches and functionality we can do. 
Case sensitivity may matter in ouR analysis. 
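
To make this reusable, the substitution can be wrapped in a small helper. This is only a sketch: the name `strip_citations` is hypothetical, and it assumes the markers are purely numeric (so something like `[citation needed]` would be left alone). It also collapses all runs of whitespace, including newlines, into single spaces.

```python
import re

# Matches bracketed numeric citation markers such as [1], [33] or [105].
CITATION_PATTERN = re.compile(r"\[\d+\]")


def strip_citations(text):
    """Remove numeric citation markers and collapse leftover whitespace."""
    without_markers = CITATION_PATTERN.sub("", text)
    # Collapse any run of whitespace (spaces, tabs, newlines) into one space.
    return re.sub(r"\s+", " ", without_markers).strip()


sample = "Apple is the largest company by revenue.[6] It was founded in 1976.[10]"
print(strip_citations(sample))
```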


Great! I will fold this into the overall workflow shortly. But first I want one final piece that greatly reduces the dimensionality of the text data. The process is called "stemming" (see #2 above).

## Text Stemming and Vectorization

To parse and stem the text data, I use the Natural Language Toolkit (`nltk`) library. This is Natural Language Processing!!

Then I use scikit-learn's `CountVectorizer` to convert the stemmed text data into numerical data by counting the occurrences of word roots.

In [194]:
import nltk
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

In [195]:

#nltk is the library I will use for text analysis
stemmer = nltk.stem.SnowballStemmer('english')

In [196]:
#The stemmer breaks down words into roots
stemmer.stem("run running")

'run run'
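
One caveat: `stem` is designed for a single word. A whole sentence is treated as one long token, so only its ending is stemmed (that happens to look right for "run running"). A minimal sketch of per-word stemming follows. The helper name `stem_text` and the simple regex tokenizer are my own assumptions; `nltk` also ships proper tokenizers that could be swapped in.

```python
import re

import nltk


def stem_text(text, stemmer=None):
    """Lowercase the text, split it into alphanumeric tokens, and stem each one."""
    stemmer = stemmer or nltk.stem.SnowballStemmer("english")
    # Very simple tokenizer (an assumption): runs of letters and digits.
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return " ".join(stemmer.stem(token) for token in tokens)


print(stem_text("run running runs"))
```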

In [197]:
#To get counts, I can use the count vectorizer

vectorizer = CountVectorizer(stop_words='english')

test_text = "Text that I want to analyze with text analysis."
test_text = stemmer.stem(test_text)
print(test_text)
print()
#The text is not readable from the initial fit_transform because it is in sparse matrix form
counts = vectorizer.fit_transform([test_text])
print(counts)

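# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it.
print(vectorizer.get_feature_names_out().tolist())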
print(counts.toarray())


text that i want to analyze with text analysis.

  (0, 2)	2
  (0, 3)	1
  (0, 1)	1
  (0, 0)	1
['analysis', 'analyze', 'text', 'want']
[[1 1 2 1]]


In [209]:
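def process_company_text(comp_dict):
    """Build a words x companies table of raw word counts."""
    stemmer = nltk.stem.SnowballStemmer('english')
    vectorizer = CountVectorizer(stop_words='english')
    df = pd.DataFrame()
    for key in comp_dict.keys():
        # NOTE: stem() treats its argument as a single word, so on a whole
        # article it mostly just lowercases the text.
        wiki_txt = stemmer.stem(comp_dict[key])
        # Refit on each article, then align the columns on the word index.
        counts = vectorizer.fit_transform([wiki_txt])
        df = pd.concat([df, pd.DataFrame(counts.toarray(),columns=vectorizer.get_feature_names_out()).transpose()],axis=1)
    df.columns = comp_dict.keys()
    # Words missing from an article get a count of 0.
    return df.fillna(0)
df=process_company_text(comp_dict)
print(df)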

            Apple  AbbVie
000          28.0     1.0
028           1.0     0.0
05            1.0     0.0
10           11.0     4.0
100          20.0     0.0
...           ...     ...
vie           0.0     1.0
vraylar       0.0     1.0
wholesale     0.0     1.0
wrongdoing    0.0     1.0
wyden         0.0     3.0

[3373 rows x 2 columns]


In [208]:
print(df.sum(axis=1).sort_values(ascending=False).head(100))

apple        447.0
company      133.0
jobs          73.0
billion       49.0
announced     46.0
             ...  
reported      12.0
global        12.0
cook          12.0
desktop       12.0
service       11.0
Length: 100, dtype: float64


In [205]:
#Convert to frequency
word_frequency = df.copy()
word_frequency = word_frequency / word_frequency.sum()
print(word_frequency)
word_frequency.columns

               Apple    AbbVie
000         0.003410  0.001018
028         0.000122       NaN
05          0.000122       NaN
10          0.001340  0.004073
100         0.002436       NaN
...              ...       ...
vie              NaN  0.001018
vraylar          NaN  0.001018
wholesale        NaN  0.001018
wrongdoing       NaN  0.001018
wyden            NaN  0.003055

[3373 rows x 2 columns]


Index(['Apple', 'AbbVie'], dtype='object')
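
One common way to turn word counts like these into a similarity score is cosine similarity, which compares the direction of two count vectors and ignores their length (so raw counts and frequencies give the same answer). The pure-Python sketch below is only an illustration on toy data, not the analysis this notebook ends up using; scikit-learn also provides `sklearn.metrics.pairwise.cosine_similarity` for whole matrices.

```python
import math
from collections import Counter


def cosine_similarity(counts_a, counts_b):
    """Cosine similarity between two {word: count} mappings (0.0 if either is empty)."""
    shared = set(counts_a) & set(counts_b)
    dot = sum(counts_a[word] * counts_b[word] for word in shared)
    norm_a = math.sqrt(sum(value * value for value in counts_a.values()))
    norm_b = math.sqrt(sum(value * value for value in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy word counts standing in for two companies' articles.
apple = Counter("apple sells phones apple sells laptops".split())
abbvie = Counter("abbvie sells medicines".split())
print(round(cosine_similarity(apple, abbvie), 3))
```

A score near 1 means the two articles use words in similar proportions, while 0 means they share no words at all.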

## Bring it all back together, now

Wonderful stuff! I'm learning a lot about how to manipulate pandas data frames. Now I want to condense all of this into 1 or 2 functions that I can easily point at other web pages in the future. Oh, and I'm not quite done with the analysis of these companies' wikis. But I gotta get organized before I lose track of what's going on.

In [211]:
from selenium import webdriver 
from selenium.webdriver.common.by import By 
from selenium.webdriver.support.ui import WebDriverWait 
from selenium.webdriver.support import expected_conditions as EC 
from selenium.common.exceptions import TimeoutException
import numpy as np
import re
import nltk
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

In [223]:
def gather_company_text(source_link):
    browser=webdriver.Firefox()
    browser.get(source_link)
    html = browser.page_source
    
    constituent_table = browser.find_element(By.ID,"constituents")
    # Use the <tr> tag to extract the table row info.
    table_conts = constituent_table.find_elements(By.TAG_NAME, 'tr')
    
    links = np.zeros(len(table_conts),dtype=object)
    names = np.zeros(len(table_conts),dtype=object)
    for i,tr in enumerate(table_conts):
        td = tr.find_elements(By.TAG_NAME, 'td')
        if len(td) > 0:
            try:
                links[i] = td[1].find_element(By.TAG_NAME, 'a').get_attribute("href")
                names[i] = td[1].text
            except:
                print("No link for {}, skipping.".format(td[1].text)) 
    links=links[:10]
    names=names[:10]
    stemmer = nltk.stem.SnowballStemmer('english')
    vectorizer = CountVectorizer(stop_words='english')
    df = pd.DataFrame()
    for i,link in enumerate(links):
        print(link)
        try:
            browser.get(link)
        except:
            print("Invalid Link, skipping...")
            #!!!!!!!!!!!!!!!!!!!!!np.delete(names,i)
            continue
        html = browser.page_source
        paragraphs = browser.find_elements(By.TAG_NAME, 'p')
        paragraphs = [paragraph.text for paragraph in paragraphs]
        paragraphs = " ".join(paragraphs)
        #comp_dict[names[i]]=paragraphs
        wiki_txt = stemmer.stem(paragraphs)
        counts = vectorizer.fit_transform([wiki_txt])
        # get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out().
        df = pd.concat([df, pd.DataFrame(counts.toarray(),columns=vectorizer.get_feature_names_out()).transpose()],axis=1)
    print(names)
    df.columns = names
    browser.close()
    #for key in comp_dict.keys():
    #    wiki_txt = stemmer.stem(comp_dict[key])
    #    counts = vectorizer.fit_transform([wiki_txt])
    #    df = pd.concat([df, pd.DataFrame(counts.toarray(),columns=vectorizer.get_feature_names()).transpose()],axis=1)
    #df.columns = comp_dict.keys()
    #browser.close()
    
    return df.fillna(0)

df = gather_company_text("https://en.wikipedia.org/wiki/S%26P_100")
print(df.sum(axis=1).sort_values(ascending=False).head(100))

No link for Alphabet (Class A), skipping.
0
Invalid Link, skipping...
https://en.wikipedia.org/wiki/Apple_Inc.
https://en.wikipedia.org/wiki/AbbVie
https://en.wikipedia.org/wiki/Abbott_Laboratories
https://en.wikipedia.org/wiki/Accenture
https://en.wikipedia.org/wiki/Adobe_Inc.
https://en.wikipedia.org/wiki/American_International_Group
https://en.wikipedia.org/wiki/AMD
https://en.wikipedia.org/wiki/Amgen
https://en.wikipedia.org/wiki/American_Tower
[0 'Apple' 'AbbVie' 'Abbott' 'Accenture' 'Adobe'
 'American International Group' 'AMD' 'Amgen' 'American Tower']


ValueError: Length mismatch: Expected axis has 9 elements, new values have 10 elements
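
The mismatch arises because `names` still holds a placeholder `0` for the header row, while the skipped link contributed no column to `df`: 10 names for 9 columns. One way to avoid this, sketched below, is to register a name only once its page has actually been read. The function `build_count_columns` and its `fetch_text` argument are hypothetical; `fetch_text` stands in for the `browser.get` plus paragraph-joining step so the bookkeeping can be tried without a browser.

```python
from collections import Counter


def build_count_columns(pairs, fetch_text):
    """Return {name: word counts}, keeping only names whose page could be read.

    pairs:      iterable of (name, link) tuples
    fetch_text: callable taking a link and returning the page text
                (hypothetical stand-in for the Selenium scraping step)
    """
    columns = {}
    for name, link in pairs:
        try:
            text = fetch_text(link)
        except Exception as exc:
            print(f"Could not load {link!r} ({exc}), skipping.")
            continue
        # The name is stored together with its data, so the two cannot drift apart.
        columns[name] = Counter(text.lower().split())
    return columns


def fake_fetch(link):
    """Toy fetcher: rejects anything that is not a string URL."""
    if not isinstance(link, str):
        raise ValueError("not a URL")
    return "some page text text"


# The (0, 0) pair mimics the header-row placeholder from the failing run.
pairs = [(0, 0), ("Apple", "https://en.wikipedia.org/wiki/Apple_Inc.")]
columns = build_count_columns(pairs, fake_fetch)
print(list(columns))
```

Passing the resulting mapping to `pd.DataFrame(columns).fillna(0)` should then give one column per successfully scraped company, with no separate `df.columns = names` assignment left to go wrong.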