<a ><img src="https://upload.wikimedia.org/wikipedia/commons/thumb/b/b9/Logo_INSA_Lyon_%282014%29.svg/langfr-2560px-Logo_INSA_Lyon_%282014%29.svg.png"  width="200" align="left"> </a>
<div style="text-align: right"> <h3><span style="color:gray"> Projet de recherche </span> </h3> </div>

<br>
<br>
<br>


<h1><center>Data Acquisition</center></h1>
<h2><center> <span style="font-weight:normal"><font color='#e42618'> Parsing Morningstar (UK) to obtain Analyst Reports containing natural language and predictions</font>  </span></center></h2>


<h3><center><font color='gray'>JONAS GOTTAL</font></center></h3>





<h4>Project scope</h4>

- Obtaining financial research containing both a report in natural language and a quantifiable prediction on the underlying asset
- Building predictive models that, e.g., accurately detect semantic causality
- Evaluating whether causal formulations correlate with higher analyst accuracy
<br>

---
---

<h1>Table of contents</h1>

<div class="alert alert-block alert-info" style="text-decoration:none; margin-top: 30px; background-color:#F2F2F2; border-color:#720006">
    <span style="color:#720006">
    <ol>
        <li><a href="#1"> <span style="color:#720006;text-decoration:underline;text-decoration-color:#F2F2F2" 
       >Parsing of non-protected ID Table</span> </a></li>
       <li><a href="#2"> <span style="color:#720006;text-decoration:underline;text-decoration-color:#F2F2F2" 
       >Parsing of dedicated analyst reports</span> </a></li> 
       <ol>
       <li><a href="#3"> <span style="color:#720006;text-decoration:underline;text-decoration-color:#F2F2F2" 
       >Target Data Format</span> </a></li>
       <li><a href="#4"> <span style="color:#720006;text-decoration:underline;text-decoration-color:#F2F2F2" 
       >Connection Setup</span> </a></li>
       <li><a href="#5"> <span style="color:#720006;text-decoration:underline;text-decoration-color:#F2F2F2" 
       >Parsing</span> </a></li>
       </ol>
       <li><a href="#6"> <span style="color:#720006;text-decoration:underline;text-decoration-color:#F2F2F2" 
       >Deprecated Code</span> </a></li>
    </ol>
    </span>
</div>

#### Requirements
- ```Python 3.9.18``` (conda env)
- ```pip freeze > requirements.txt```
- ```conda env export > environment.yml```

# Parsing of non-protected ID Table <a id="1"></a>

In [1]:
import requests 
from bs4 import BeautifulSoup
import time
import re
import json
import pandas as pd

# initialise variables
URL_ID, Name, Sector, Analyst, Title, Rating, Date = "", "", "", "", "", "", ""

# Define the regex pattern for the URL_ID
pattern = r'id=([A-Za-z0-9]+)'

with requests.Session() as session:
    # define page size
    r = session.get("https://www.morningstar.co.uk/uk/research/equities/page/1?PageSize=2000")

    soup = BeautifulSoup(r.content, 'html.parser')  

    data = []
    #whole table
    for tr in soup.find('table').find_all('tr'):
        i=0
        #each row
        for td in tr.find_all('td'):
            # select each element in the row with a different pattern
            if i==0:
                URL_ID = re.search(pattern, td.find("a").get("href")).group(1)
                Name = td.text.strip('\n')
                
            if i == 1:
                Sector= td.text
            if i == 2:
                Analyst = td.text
            if i == 3:
                Title = td.text
            if i == 4:
                Rating = td.findNext("div").get("class")[1][-1]
            if i == 6:
                Date= td.text
            i+=1

        data.append([URL_ID, Name, Sector, Analyst, Title, Rating, Date, str(time.strftime("%d/%m/%Y"))])
    data.pop(0) #remove header
    df = pd.DataFrame(data, columns=['URL_ID','Name','Sector','Analyst', 'Title','StarRating','Date','ParseDate'])

#save to csv
df.to_csv('morningstar.csv', index=True)
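
The `URL_ID` extraction in the cell above relies on the regular expression `id=([A-Za-z0-9]+)`. A minimal sketch of that step in isolation, assuming an href that follows the report URL format used in the parsing section of this notebook. The helper name `extract_url_id` is hypothetical, and returning `None` instead of raising is a design choice of this sketch, not of the cell above:

```python
import re

# Same pattern as in the cell above
pattern = r'id=([A-Za-z0-9]+)'

def extract_url_id(href):
    """Return the security ID from an href, or None if it contains none."""
    match = re.search(pattern, href)
    return match.group(1) if match else None

# Example href, assumed to follow the report URL format of this notebook
href = "https://tools.morningstar.co.uk/ukp/stockreport/default.aspx?Site=uk&id=0P00007XVV&tab=15"
print(extract_url_id(href))         # 0P00007XVV
print(extract_url_id("/uk/news/"))  # None
```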


In [2]:
# consistency check
len(df) == len(df["URL_ID"].unique())

True
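
If the check above returned `False`, the duplicated IDs would have to be located. A small sketch with made-up IDs, assuming the IDs are available as a plain list (for example via `df["URL_ID"].tolist()`); the helper `find_duplicates` is hypothetical:

```python
from collections import Counter

# Made-up IDs for illustration; in the notebook they would come from df["URL_ID"].tolist()
ids = ["0P00007XVV", "0P0000AAAA", "0P00007XVV", "0P0000BBBB"]

def find_duplicates(values):
    """Return the values that occur more than once, in first-seen order."""
    counts = Counter(values)
    return [value for value in counts if counts[value] > 1]

duplicates = find_duplicates(ids)
print(duplicates)  # ['0P00007XVV']
```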

# Parsing of dedicated analyst reports <a id="2"></a>

## Target Data Format <a id="3"></a>

Our goal: a JSON file in which each key is a report ID and each value holds the complete parsed content of that report's page, including its index.


```
{
    "0P00007XVV": {
        "Index": 0,
        "ParseDate": "01/11/2023",
        "ID": "0P00007XVV",
        "Title": "Ebos' Scale Has It Leading the Pack but Shares Screen as Expensive",
        "CompanyName": "Ebos Group Ltd",
        "TickerSymbol": "EBO",
        "Rating": 2,
        "ReportDate": "22/11/2023",
        "AuthorName": "Shane Ponraj",
        "Price": {
            "Value": 36.15,
            "Currency": "NZD",
            "Date": "22/11/2023"
        },
        "FairPrice": 30.5,
        "Uncertainty": "Medium",
        "CostAllocation": "Exemplary",
        "EconomicMoat": "Narrow",
        "FinancialStrength": "",
        "AnalystNote": {
            "Date": "22/11/2023",
            "Text": [
                "paragraph1",
                "paragraph2",
                "paragraph3"
            ]
        },
        "Bulls": [
            "Bullet1",
            "Bullet2",
            "Bullet3"
        ],
        "Bears": [
            "Bullet1",
            "Bullet2",
            "Bullet3"
        ],
        "ResearchThesis": {
            "Date": "22/11/2023",
            "Text": [
                "paragraph1",
                "paragraph2",
                "paragraph3"
            ]
        },
        "MoatAnalysis": "Text",
        "RiskAnalysis": "Text",
        "CapitalAllocation": "Text",
        "Overview": {
            "Profile": "Text.",
            "FinancialStrength": "Text."
        }
    },
    "NEXTID": {
        "Index": 1,
        ...
        
    },
    ...
}
```
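
A short sketch of how a file in this format could be consumed, assuming a shortened record with placeholder text. It only uses keys that appear in the target format above and checks that each top-level key matches the `ID` stored inside its entry:

```python
import json

# A shortened record in the target format above (text values are placeholders)
raw = '''
{
    "0P00007XVV": {
        "Index": 0,
        "ID": "0P00007XVV",
        "Rating": 2,
        "FairPrice": 30.5,
        "AnalystNote": {"Date": "22/11/2023", "Text": ["paragraph1", "paragraph2"]},
        "Bulls": ["Bullet1", "Bullet2"]
    }
}
'''

reports = json.loads(raw)

# Look up one report by its ID and read nested fields
report = reports["0P00007XVV"]
note_text = " ".join(report["AnalystNote"]["Text"])
print(report["FairPrice"], note_text)

# Check that the key of each entry matches the ID stored inside it
mismatched = [key for key, value in reports.items() if key != value["ID"]]
print(mismatched)
```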

## Connection set-up by manually created cookies <a id="4"></a>
1. Log in with your credentials on [Morningstar](https://www.morningstar.co.uk/)
1. Navigate to the [list of equity reports](https://www.morningstar.co.uk/uk/research/equities)
1. Open the Network tab in the browser's developer tools
1. Click on any item in the list of reports
1. Find the GET request to that page, right-click it and choose ```Copy as cURL```
1. Paste it into [curlconverter](https://curlconverter.com/python/) to obtain the cookies and headers for the Python ```requests``` package
1. Paste the cookies and headers below
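
As an alternative to the converter website, the raw `Cookie` header from the copied cURL command can be turned into a dictionary by hand. A minimal sketch, assuming the header is a `;`-separated list of `name=value` pairs; the function name `parse_cookie_header` and the sample values are hypothetical:

```python
def parse_cookie_header(header):
    """Split a raw Cookie header into a {name: value} dictionary."""
    cookies = {}
    for pair in header.split(";"):
        pair = pair.strip()
        if not pair:
            continue
        # Split on the first '=' only, because values may themselves contain '='
        name, _, value = pair.partition("=")
        cookies[name] = value
    return cookies

# Placeholder header for illustration, not a real session
raw_header = "PSI=S; RT_uk_LANG=en-GB; token=abc123=="
parsed = parse_cookie_header(raw_header)
print(parsed)
# {'PSI': 'S', 'RT_uk_LANG': 'en-GB', 'token': 'abc123=='}
```

The resulting dictionary has the same shape as the `cookies` dictionary below and can be passed to `requests.get(..., cookies=...)`.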

In [3]:

cookies = {
    'PSI': 'S',
    'RT_uk_BS': '+I7qdl0ruLBRlX6zrB8CNg==',
    'RT_uk_CD': 'efNq51LXeHZjftQvvSJlDw==',
    'RT_uk_GI': 'mGr7p+708bJo/rsmM3mC2saJLu1zqcp3rB+tAgdYcgcaeDBhQnzEInIY9N50/4tl',
    'RT_uk_MD': 'PO5Mf2z103h9gY8gAJ+EDQ==',
    'RT_uk_MS': '8X3mhE0/kf6o/dIJeMo7TA==',
    'RT_uk_PS': '6G2wEmwHr8HiGZATAp96Mw==',
    'ad-profile': '%7b%22AudienceType%22%3a21%2c%22UserType%22%3a2%2c%22PortofolioCreated%22%3a0%2c%22IsForObsr%22%3afalse%2c%22NeedRefresh%22%3atrue%2c%22NeedPopupAudienceBackfill%22%3afalse%2c%22EnableInvestmentInUK%22%3a-1%7d',
    'RT_uk_LANG': 'en-GB',
}

headers = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Sec-Fetch-Site': 'same-site',
    # 'Cookie': 'PSI=S; RT_uk_BS=+I7qdl0ruLBRlX6zrB8CNg==; RT_uk_CD=efNq51LXeHZjftQvvSJlDw==; RT_uk_GI=mGr7p+708bJo/rsmM3mC2saJLu1zqcp3rB+tAgdYcgcaeDBhQnzEInIY9N50/4tl; RT_uk_MD=PO5Mf2z103h9gY8gAJ+EDQ==; RT_uk_MS=8X3mhE0/kf6o/dIJeMo7TA==; RT_uk_PS=6G2wEmwHr8HiGZATAp96Mw==; ad-profile=%7b%22AudienceType%22%3a21%2c%22UserType%22%3a2%2c%22PortofolioCreated%22%3a0%2c%22IsForObsr%22%3afalse%2c%22NeedRefresh%22%3atrue%2c%22NeedPopupAudienceBackfill%22%3afalse%2c%22EnableInvestmentInUK%22%3a-1%7d; RT_uk_LANG=en-GB',
    'Sec-Fetch-Dest': 'document',
    # 'Accept-Encoding': 'gzip, deflate, br',
    'Sec-Fetch-Mode': 'navigate',
    'Host': 'tools.morningstar.co.uk',
    'Accept-Language': 'en-gb',
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/16.6 Safari/605.1.15',
    'Referer': 'https://www.morningstar.co.uk/',
    'Connection': 'keep-alive',
}

## Parsing <a id="5"></a>

### Partial URLs to combine to final URL

In [4]:


url1 =  'https://tools.morningstar.co.uk/ukp/stockreport/default.aspx?Site=uk&id='

url2 = '&tab=15&isreport=true&LanguageId=en-GB&SecurityToken='

url3 ='%5D3%5D0%5DE0WWE$$ALL'
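
The three parts are combined with the report ID inserted twice, once as the `id` parameter and once at the start of the security token. A small sketch of that concatenation; the helper name `build_report_url` is hypothetical, and the ID is the example from the target data format:

```python
from urllib.parse import unquote

url1 = 'https://tools.morningstar.co.uk/ukp/stockreport/default.aspx?Site=uk&id='
url2 = '&tab=15&isreport=true&LanguageId=en-GB&SecurityToken='
url3 = '%5D3%5D0%5DE0WWE$$ALL'

def build_report_url(security_id):
    """Combine the URL parts; the ID is used as `id` and as token prefix."""
    return url1 + security_id + url2 + security_id + url3

url = build_report_url("0P00007XVV")
print(url)

# url3 is percent-encoded; decoded it reads ']3]0]E0WWE$$ALL'
print(unquote(url3))
```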


### Parsing each variable with Beautiful Soup
Each variable is read in a for loop over ```soup.find_all``` so that a missing element does not raise an error (some try/except blocks are still necessary).

In [5]:
# initialise variables
variable_list = ["ParseDate", "Title", "CompanyName", "TickerSymbol", "Rating", "ReportDate", "AuthorName", "Price", "Currency", "PriceDate", "FairPrice", "Uncertainty",  "EconomicMoat", "CostAllocation", "FinancialStrength", "AnalystNoteDate", "AnalystNoteList", "BullsList", "BearsList", "ResearchThesisDate", "ResearchThesisList", "MoatAnalysis", "RiskAnalysis", "CapitalAllocation", "Profile", "FinancialStrengthText"]

for item in variable_list:
    globals()[item] = ""

In [6]:
def get_variables(soup):
    # Initialise every field as a local variable. Writing to globals() does not help
    # here: names assigned inside a function are local, so a field whose element is
    # missing on the page would otherwise raise UnboundLocalError at the return.
    Title = CompanyName = TickerSymbol = Rating = ReportDate = AuthorName = ""
    Price = Currency = PriceDate = FairPrice = Uncertainty = EconomicMoat = ""
    CostAllocation = FinancialStrength = AnalystNoteDate = ResearchThesisDate = ""
    MoatAnalysis = RiskAnalysis = CapitalAllocation = Profile = FinancialStrengthText = ""

    #python today's date in format  DD/MM/YYYY
    ParseDate = time.strftime("%d/%m/%Y")
    
    # Get the title
    for wrapper in soup.find_all('div', {"id":"AnalystResearch"}): 
        Title = wrapper.findChild().text.encode('ascii', 'ignore').decode('ascii')

    # Get the company name
    for wrapper in soup.find_all('span', {"class":"securityName"}): 
        CompanyName = wrapper.text

    # Get the ticker symbol
    for wrapper in soup.find_all('span', {"class":"securitySymbol"}): 
        TickerSymbol = wrapper.text

    # Get the rating
    for wrapper in soup.find_all('span', {"class":"securityRating"}):
        Rating = wrapper.findChild("img")["alt"][-1]

    # Get the report date
    try:
        for wrapper in soup.find('h3', {"id":"lnkAnalystNote"}).find_next("span"):  
            ReportDate = wrapper.text
    except:
        ReportDate = ""
        

    # Get the author name
    for wrapper in soup.find_all('div', {"class":"author clearfix"}): 
        for item in list(wrapper.children):
                
                # Get the text and split it by <br/>
                text=item.get_text(separator='<br/>').split('<br/>')
                # The first element is the name, prefixed with "by "
                AuthorName = text[0].replace('by ', '')

    # Get the price and currency
    for wrapper in soup.find_all('datapoint', {"id":"Price"}): 
        s = wrapper.text
        # Split the string by space
        values = s.split()
        # Assign the values to variables
        Price = values[0].replace(',', '')
        Currency = values[1]

    # Get the price date
    for wrapper in soup.find_all('div', {"id":"Price"}): 
        s = wrapper.findChild().text
        # Split the string by space
        values = s.split()
        # Assign the values to variables
        PriceDate = values[1]

    # Get the fair price
    for wrapper in soup.find_all('datapoint', {"id":"FairValueEstimate"}):
        s = wrapper.text
        # Split the string by space
        values = s.split()
        # Assign the values to variables
        FairPrice = values[0].replace(',', '')

    # Get the uncertainty
    for wrapper in soup.find_all('datapoint', {"id":"Uncertainity"}): 
        Uncertainty = wrapper.text

    # Get the economic moat
    for wrapper in soup.find('datapoint', {"id":"EconomicMoat"}):
        EconomicMoat = wrapper.text  

    # Get the cost allocation
    for wrapper in soup.find('datapoint', {"id":"Stewardship"}):
        CostAllocation = wrapper.text  

    # Get the financial strength
    try:

        pattern = r'Grade_(\w)'
        for wrapper in soup.find('datapoint', {"id":"FinancialHealthGrade"}):   
            FinancialStrength = re.search(pattern, wrapper.get('src')).group(1)
    
    except:
        FinancialStrength = ""


    # Get the analyst note date
    try:

        for wrapper in soup.find_all('h3', {"id":"lnkAnalystNote"}): 
            AnalystNoteDate = wrapper.findChild().text
    except:
        AnalystNoteDate = ""
    
    # Get the analyst note as list
    AnalystNoteList = []
    try:

        for i in soup.find_all('h3', {"id":"lnkAnalystNote"}):
            for sib in i.next_siblings:
                if sib.name == 'p':
                    AnalystNoteList.append(sib.text.encode('ascii', 'ignore').decode('ascii'))
                    #print(sib.text)
                elif sib.name == 'h2':
                    #print ("*****")
                    break
        AnalystNoteList.pop() # remove "go to top"
    except: 
        AnalystNoteList = []     
    
    # Get the bulls say as list
    BullsList = []
    for wrapper in soup.find('div', {"id":"BullsView"}).find_all("li"):
        BullsList.append(wrapper.text.encode('ascii', 'ignore').decode('ascii'))

    # Get the bears say as list
    BearsList = []
    for wrapper in soup.find('div', {"id":"BearsView"}).find_all("li"):
        BearsList.append(wrapper.text.encode('ascii', 'ignore').decode('ascii'))

    # Get the research thesis date
    for wrapper in soup.find_all('div', {"id":"lnkThesis"}): 
        ResearchThesisDate = wrapper.findChild().findChild().text

    # Get the research thesis as list
    ResearchThesisList = []
    for i in soup.find_all('div', {"id":"lnkThesis"}):
        for child in i.findChildren():
            if child.name == 'p':
                ResearchThesisList.append(child.text.encode('ascii', 'ignore').decode('ascii'))
    # remove "go to top" (guarded so an empty list does not raise IndexError)
    if ResearchThesisList:
        ResearchThesisList.pop()

    # Get the moat analysis
    for i in soup.find_all('div', {"id":"lnkMoatAnalysis"}):
        MoatAnalysis = i.findChildren()[1].text.encode('ascii', 'ignore').decode('ascii')

    # Get the risk analysis
    for i in soup.find_all('div', {"id":"lnkRisk"}):
        RiskAnalysis = i.findChildren()[1].text.encode('ascii', 'ignore').decode('ascii')

    # Get the capital allocation
    for i in soup.find_all('div', {"id":"lnkManagement"}):
        CapitalAllocation = i.findChildren()[1].text.encode('ascii', 'ignore').decode('ascii')

    # Get the profile
    for i in soup.find_all('p', {"id":"lnkProfile"}):
        Profile = re.sub(r'^Profile: ', '', i.text.encode('ascii', 'ignore').decode('ascii'))

    # Get the financial strength text
    for i in soup.find_all('p', {"id":"lnkFinancialHealth"}):
        FinancialStrengthText = re.sub(r'^Financial Strength: ', '', i.text.encode('ascii', 'ignore').decode('ascii'))

    # return the variables
    return ParseDate, Title, CompanyName, TickerSymbol, int(Rating), ReportDate, AuthorName, float(Price), Currency, PriceDate, float(FairPrice), Uncertainty, EconomicMoat, CostAllocation, FinancialStrength, AnalystNoteDate, AnalystNoteList, BullsList, BearsList, ResearchThesisDate, ResearchThesisList, MoatAnalysis, RiskAnalysis, CapitalAllocation, Profile, FinancialStrengthText
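
The price and fair-value fields are extracted by splitting a text such as `36.15 NZD` and converting the first token to a number. A hedged sketch of that conversion in isolation: `parse_price` is a hypothetical helper, and returning `None` for missing values is an assumption of this sketch rather than the behaviour of `get_variables`, which raises on empty strings:

```python
def parse_price(text):
    """Parse a string like '1,234.50 GBX' into (value, currency).

    Returns (None, "") when the text is empty or not numeric.
    """
    parts = text.split()
    if not parts:
        return None, ""
    try:
        # Remove thousands separators before converting, as get_variables does
        value = float(parts[0].replace(",", ""))
    except ValueError:
        return None, ""
    currency = parts[1] if len(parts) > 1 else ""
    return value, currency

print(parse_price("36.15 NZD"))     # (36.15, 'NZD')
print(parse_price("1,234.50 GBX"))  # (1234.5, 'GBX')
print(parse_price(""))              # (None, '')
```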


#### Helper to join the lists used in the JSON into flat strings for the CSV export

In [7]:
def concatenate_strings(string_list):
    # Use the join method to concatenate the strings in the list
    result_string = ' '.join(string_list)
    return result_string

### Final loop over all IDs in ```df["URL_ID"]``` to parse each report page and fill both containers (a list for the pandas DataFrame and a dict for the JSON file)
Note: the run time is significantly shorter and the connection generally more stable at night.

In [8]:
# initialise dictionary and list
JSONdict = {}
data = []

# loop through the ID list
for i in range(len(df["URL_ID"])):
    #print the progress with string and percentage
    print("Progress: " + str(i) + "/" + str(len(df["URL_ID"])) + " (" + str(round(i/len(df["URL_ID"])*100, 2)) + "%)")
    
    #just because sometimes there are outages
    for attempt in range(10):
        try:
            # define the url
            url = url1 + df["URL_ID"][i] + url2 + df["URL_ID"][i] + url3  

            # get the response
            response = requests.get(url,
            
                cookies=cookies,
                headers=headers,
            )

            # parse the response
            soup = BeautifulSoup(response.content, 'html.parser')  

            # get the variables
            ParseDate, Title, CompanyName, TickerSymbol, Rating, ReportDate, AuthorName, Price, Currency, PriceDate, FairPrice, Uncertainty, EconomicMoat, CostAllocation, FinancialStrength, AnalystNoteDate, AnalystNoteList, BullsList, BearsList, ResearchThesisDate, ResearchThesisList, MoatAnalysis, RiskAnalysis, CapitalAllocation, Profile, FinancialStrengthText = get_variables(soup)

            # create a dictionary
            dict_item = {
                    "Index": i,
                    "ParseDate": ParseDate,
                    "ID": df["URL_ID"][i],
                    "Title": Title,
                    "CompanyName": CompanyName,
                    "TickerSymbol": TickerSymbol,
                    "Rating": Rating,
                    "ReportDate": ReportDate,
                    "AuthorName": AuthorName,
                    "Price": {
                        "Value": Price,
                        "Currency": Currency,
                        "Date": PriceDate
                    },
                    "FairPrice": FairPrice,
                    "Uncertainty": Uncertainty,
                    "EconomicMoat": EconomicMoat,
                    "CostAllocation": CostAllocation,
                    "FinancialStrength": FinancialStrength,
                    "AnalystNote": {
                        "Date": AnalystNoteDate,
                        "Text": AnalystNoteList
                    },
                    "Bulls": BullsList,
                    "Bears": BearsList,
                    "ResearchThesis": {
                        "Date": ResearchThesisDate,
                        "Text": ResearchThesisList
                    },
                    "MoatAnalysis": MoatAnalysis,
                    "RiskAnalysis": RiskAnalysis,
                    "CapitalAllocation": CapitalAllocation,
                    "Overview": {
                        "Profile": Profile,
                        "FinancialStrength": FinancialStrengthText
                    }
                }
            
            # add the dictionary to JSONdict under its ID
            JSONdict[str(df["URL_ID"][i])] = dict_item # new key-value pair
            
            # add the flat row to the list used for the DataFrame
            data.append([ParseDate, Title, CompanyName, TickerSymbol, Rating, ReportDate, AuthorName, Price, Currency, PriceDate, FairPrice, Uncertainty, EconomicMoat, CostAllocation, FinancialStrength, AnalystNoteDate, concatenate_strings(AnalystNoteList), concatenate_strings(BullsList), concatenate_strings(BearsList), ResearchThesisDate, concatenate_strings(ResearchThesisList), MoatAnalysis, RiskAnalysis, CapitalAllocation, Profile, FinancialStrengthText])

        except Exception:
            # wait two seconds and then retry (Exception, so Ctrl+C still interrupts)
            time.sleep(2)
            continue
        break


Progress: 0/1627 (0.0%)
Progress: 1/1627 (0.06%)
Progress: 2/1627 (0.12%)
Progress: 3/1627 (0.18%)
Progress: 4/1627 (0.25%)
Progress: 5/1627 (0.31%)
Progress: 6/1627 (0.37%)
Progress: 7/1627 (0.43%)
Progress: 8/1627 (0.49%)
Progress: 9/1627 (0.55%)
Progress: 10/1627 (0.61%)
Progress: 11/1627 (0.68%)
Progress: 12/1627 (0.74%)
Progress: 13/1627 (0.8%)
Progress: 14/1627 (0.86%)
Progress: 15/1627 (0.92%)
Progress: 16/1627 (0.98%)
Progress: 17/1627 (1.04%)
Progress: 18/1627 (1.11%)
Progress: 19/1627 (1.17%)
Progress: 20/1627 (1.23%)
Progress: 21/1627 (1.29%)
Progress: 22/1627 (1.35%)
Progress: 23/1627 (1.41%)
Progress: 24/1627 (1.48%)
Progress: 25/1627 (1.54%)
Progress: 26/1627 (1.6%)
Progress: 27/1627 (1.66%)
Progress: 28/1627 (1.72%)
Progress: 29/1627 (1.78%)
Progress: 30/1627 (1.84%)
Progress: 31/1627 (1.91%)
Progress: 32/1627 (1.97%)
Progress: 33/1627 (2.03%)
Progress: 34/1627 (2.09%)
Progress: 35/1627 (2.15%)
Progress: 36/1627 (2.21%)
Progress: 37/1627 (2.27%)
Progress: 38/1627 (2.34%)
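
The retry logic of the loop above can be separated from the parsing. A minimal sketch under stated assumptions: `retry` is a hypothetical helper, and unlike the loop above, which silently moves on to the next ID after ten failed attempts, it re-raises the last error so that the caller can record which ID failed:

```python
import time

def retry(func, attempts=10, delay=2.0):
    """Call func() until it succeeds or the attempts are used up.

    Returns the result of func(), or re-raises the last exception.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for _ in range(attempts):
        try:
            return func()
        except Exception as error:  # Exception, so Ctrl+C still interrupts
            last_error = error
            time.sleep(delay)
    raise last_error

# Demonstration with a function that fails twice before succeeding
calls = {"count": 0}

def flaky():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("simulated outage")
    return "ok"

result = retry(flaky, attempts=5, delay=0)
print(result, calls["count"])  # ok 3
```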

#### Saving Data

In [9]:
# create a dataframe
df = pd.DataFrame(data, columns=["ParseDate", "Title", "CompanyName", "TickerSymbol", "Rating", "ReportDate", "AuthorName", "Price", "Currency", "PriceDate", "FairPrice", "Uncertainty",  "EconomicMoat", "CostAllocation", "FinancialStrength", "AnalystNoteDate", "AnalystNoteList", "BullsList", "BearsList", "ResearchThesisDate", "ResearchThesisList", "MoatAnalysis", "RiskAnalysis", "CapitalAllocation", "Profile", "FinancialStrengthText"])
#save df as csv
df.to_csv('data.csv', index=True)

# save the dictionary as json
with open('data.json', 'w') as outfile:
    json.dump(JSONdict, outfile)

How to load in different formats:

In [None]:
import pandas as pd
#load df from json file 
df = pd.read_json('data.json', orient='index')
# load the csv file
df = pd.read_csv('data.csv', index_col=0)
# save as excel file
df.to_excel('data.xlsx', index=True)
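
The JSON file keeps nested fields such as `AnalystNote`, while the CSV is flat. A sketch of flattening one nested record into dotted column names with the standard library only; `flatten` is a hypothetical helper and the record is a shortened placeholder. pandas offers `pandas.json_normalize` for a similar purpose:

```python
def flatten(record, parent_key="", sep="."):
    """Flatten nested dictionaries into a single level with dotted keys."""
    flat = {}
    for key, value in record.items():
        full_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            flat.update(flatten(value, full_key, sep))
        elif isinstance(value, list):
            # Join paragraph lists into one string, as done for the CSV export
            flat[full_key] = " ".join(value)
        else:
            flat[full_key] = value
    return flat

record = {
    "ID": "0P00007XVV",
    "Rating": 2,
    "AnalystNote": {"Date": "22/11/2023", "Text": ["paragraph1", "paragraph2"]},
}
flat_record = flatten(record)
print(flat_record)
```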

## Discontinued code (failure) <a id="6"></a>
It would be more elegant to log in with a dedicated payload, but the website redirects to an error page, so this approach does not work with ```requests``` (a Selenium workaround may exist).

In [None]:
from lxml import etree, html

URL = "https://www.morningstar.co.uk/uk/membership/Auth0CallbackManager.ashx" 

baseURI= "https://www.morningstar.co.uk/uk/research/equities"

payload = { 
	"email": "", 
	"Login": "" 
}


with requests.Session() as session:
    session.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.141 Safari/537.36'

    session.get(URL) 
    time.sleep(1)
    r = session.post(URL, data=payload) 
    time.sleep(2)
    url1 = "https://tools.morningstar.co.uk/ukp/stockreport/default.aspx?Site=uk&id="
    url2 = "&tab=15&isreport=true&LanguageId=en-GB&SecurityToken="
    url3 = "]3]0]E0WWE$$ALL"
    for i in range(1):#len(IDList):
        url = url1 + IDList[i] + url2 + IDList[i] + url3  
        r = session.get(url)
        soup = BeautifulSoup(r.content, 'html.parser')  
        #dom = etree.HTML(str(soup)) 
        with open("report.html", "w", encoding='utf-8') as file:
            file.write(str(soup))