<b>Scraping and Analyzing Fed Beige Books</b><br>
Project by Alexander Trentin<br>

<b>General overview</b>:
The Fed Beige Book is a collection of economic observations and assessments that the Federal Reserve publishes eight times a year. For each of the twelve districts of the Federal Reserve System, and for the nation as a whole, it gives an anecdotal overview of the current economic climate. Each district reports on the overall economy, on prices and wages, and on individual sectors.<p>
The documents can be found here:
https://www.federalreserve.gov/monetarypolicy/beige-book-default.htm

<b>Central purpose of this project:</b><br>
Make a massive corpus of free text with valuable economic observations easily accessible.<p>
<b>Idea:</b><br>
1. Extract from each assessment an "important phrase" that summarizes the longer text.
2. Provide a quantitative score that shows whether the economic assessment is positive, negative or neutral.

Part 1: Scraping

In [1]:
from bs4 import BeautifulSoup
import requests
import pandas as pd
import re
import os

In [3]:
# The publication dates of the reports to collect are listed by hand.
# They are needed to build the URL of each document.
# Overview: https://www.federalreserve.gov/monetarypolicy/beige-book-archive.htm
# The Fed Beige Books are HTML documents, one per Federal Reserve district and date.

# List of dates (year-month) on which reports were published.

# During collection it turned out that the "2015-07" report
# was not available in the archive as expected,
# so that date was removed from the list.

report_dates = [
"2008-01",
"2008-03",
"2008-04",
"2008-06",
"2008-07",
"2008-09",
"2008-10",
"2008-12",
"2009-01",
"2009-03",
"2009-04",
"2009-06",
"2009-07",
"2009-09",
"2009-10",
"2009-12",
"2010-01",
"2010-03",
"2010-04",
"2010-06",
"2010-07",
"2010-09",
"2010-10",
"2010-12",
"2011-01",
"2011-03",
"2011-04",
"2011-06",
"2011-07",
"2011-09",
"2011-10",
"2011-11",
"2012-01",
"2012-02",
"2012-04",
"2012-06",
"2012-07",
"2012-08",
"2012-10",
"2012-11",
"2013-01",
"2013-03",
"2013-04",
"2013-06",
"2013-07",
"2013-09",
"2013-10",
"2013-12",
"2014-01",
"2014-03",
"2014-04",
"2014-06",
"2014-07",
"2014-09",
"2014-10",
"2014-12",
"2015-01",
"2015-03",
"2015-04",
"2015-06",
"2015-09",
"2015-10",
"2015-12",
"2016-01",
"2016-03",
"2016-04",
"2016-06",
"2016-07",
"2016-09",
"2016-10",
"2016-11",
"2017-01",
"2017-03",
"2017-04",
"2017-05",
"2017-07"
]

# Dates that still have to be downloaded; the download loop iterates over this list.
report_update_dates = [
    "2017-07"
]


In [4]:
# District codes as used in the URLs, mapped to the district names used in the file names
report_districts = {
    "su":"National Summary",
    "at":"Atlanta",
    "bo":"Boston",
    "ch":"Chicago",
    "cl":"Cleveland",
    "da":"Dallas",
    "kc":"Kansas City",
    "mi":"Minneapolis",
    "ny":"New York",
    "ph":"Philadelphia",
    "ri":"Richmond",
    "sf":"San Francisco",
    "sl":"St Louis"
}

In [4]:
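# Collect the HTML documents and save them locally (989 files).
# Only the dates in report_update_dates are fetched here;
# loop over report_dates instead to download the full archive.

# make sure the target folder exists before writing into it
os.makedirs(os.getcwd()+"/HTML", exist_ok=True)

for this_date in report_update_dates:

    for this_district in report_districts.keys():
        this_URL = "https://www.minneapolisfed.org/news-and-events/beige-book-archive/"\
            +this_date+"-"+this_district

        district_description = report_districts[this_district]

        response = requests.get(this_URL)
        # raise an exception on HTTP errors instead of saving an error page
        response.raise_for_status()

        path = os.getcwd()+"/HTML/"+this_date+"_"+district_description+".html"

        with open(path, 'w') as f:
            f.write(response.text)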

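The download loop above builds each URL from a report date and a two-letter district code, and each file name from the date and the district name. A minimal sketch of that naming scheme follows, pulled into a generator so the list of targets can be inspected before anything is downloaded. The helper `iter_targets` and the sample inputs are hypothetical names for illustration and are not part of the original notebook.

```python
import os

BASE_URL = "https://www.minneapolisfed.org/news-and-events/beige-book-archive/"

def iter_targets(dates, districts, html_dir="HTML"):
    """Yield (url, path) pairs, one per date and district (hypothetical helper)."""
    for date in dates:
        for code, name in districts.items():
            url = BASE_URL + date + "-" + code
            path = os.path.join(html_dir, date + "_" + name + ".html")
            yield url, path

# Small sample inputs; the notebook uses the full report_dates and report_districts.
sample_dates = ["2017-05", "2017-07"]
sample_districts = {"su": "National Summary", "at": "Atlanta", "bo": "Boston"}

targets = list(iter_targets(sample_dates, sample_districts))
print(len(targets))    # one target per (date, district) pair
print(targets[0][0])   # URL of the first target
```

Listing the targets first makes it easy to compare the expected number of files with what is actually on disk.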
In [26]:
# Open each HTML file as a Beautiful Soup document.
# All Beautiful Soup documents are stored in one dictionary, doc_dict.
# Its keys are the dates (formatted as YYYYMM); each value is a sub-dictionary
# whose keys are the district names and whose values are the Beautiful Soup documents.

doc_dict = {}

for this_date in report_dates:
    
    this_date_formatted = this_date[:4]+this_date[5:]
    
    doc_dict[this_date_formatted] = {}
    
    for this_district in report_districts.keys():
        
        district_description = report_districts[this_district]
        
        path = os.getcwd()+"/HTML/"+this_date+"_"+district_description+".html"
        
        # the with-block closes the file again after parsing
        with open(path) as this_html:
            this_doc = BeautifulSoup(this_html, 'html.parser')
            
        # only the section "article-content" is relevant
        
        this_doc = this_doc.find("section", {"class" : "article-content"})
        doc_dict[this_date_formatted][district_description] = this_doc

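The date keys drop the hyphen so that they can later be compared as integers, as the scraping code does with checks such as `int(this_date) > 200906`. A small sketch of that convention, where `date_key` is a hypothetical helper name:

```python
def date_key(report_date):
    """Turn 'YYYY-MM' into the 'YYYYMM' key used in doc_dict (hypothetical helper)."""
    return report_date[:4] + report_date[5:]

key = date_key("2009-07")
print(key)

# The key is the year followed by a zero-padded month,
# so integer order matches chronological order.
print(int(key) > 200906)
print(int(date_key("2016-11")) < 201701)
```

This only works because the months are zero-padded; a key like "20097" would break the ordering.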
In [27]:
#Dictionary storing the texts for each of the sectors
#Like the dictionary with Beautiful Soup docs, this one is nested and stores the scraped descriptions
#The dictionary texts_dict has as keys the dates and as values sub-dictionaries
#Each sub-dictionary has as keys the district names and as values a further dictionary
#that maps each sector title to its description as a string

texts_dict = {}

for this_date in doc_dict.keys():
    
    texts_dict[this_date] = {}
    
    for this_district in doc_dict[this_date].keys():
        this_doc = doc_dict[this_date][this_district]
        
        texts_dict[this_date][this_district] = {}
        
        observ = texts_dict[this_date][this_district]
        
        # National Summaries have a specific format
        # there are two different formats:
        # one where the first section is titled as "Overall Economic Activity"
        # and the older format where there is no title
        
        #starting from July 2009, there was a new paragraph
        if int(this_date) > 200906 and this_district == "National Summary":
            this_html = this_doc.find(text=re.compile('.repar.')).find_next()
                
        else:
            this_html = this_doc.find_all("strong")[0].find_next("p")
            
        # new format with a new title starting in 2017
        if int(this_date) < 201701:
            this_sector= 'Overall Economic Activity'
            this_string = this_html.text.strip()
                        
        else:
            while this_html.find("strong") is None and this_html.find("em") is None:
                this_html = this_html.find_next()

            
            this_html = this_html.find("strong")
            
            this_sector = this_html.text.strip()
            
            
            while this_sector == "":
                this_html = this_html.find_next()
                if this_html.find("strong"):
                    this_html = this_html.find("strong")
                    this_sector = this_html.text.strip()
                    
            if this_html.next_sibling:
                if this_html.next_sibling.name == "br":
                    this_string = this_html.next_sibling.next_sibling
                else:
                    this_string = this_html.next_sibling
            
        observ[this_sector] = this_string
             
        this_html = this_html.find_next()
    
            
        while this_html and this_html.find("strong") is None and this_html.find("em") is None:
            this_string = this_html.text.strip()
            observ[this_sector] =\
                observ[this_sector]+"\n"+this_string
            this_html = this_html.find_next()
        
        # check the other sectors
        
                        
        # find the header (sector) by formatting "strong"
            
        while this_html and this_html.find("div") is None and this_html.find("em") is None:
        
            this_html = this_html.find("strong")
                
            this_sector = this_html.text.strip()
            
            counter = 0
            while this_sector == "" or this_sector == ".":
                counter += 1
                this_html = this_html.find_next()
                if this_html.find("strong") and this_html.find("strong").name !="br":
                    this_html = this_html.find("strong")
                    this_sector = this_html.text.strip()
                if counter>2:
                    break
                    
            if this_html.find_next().name == "strong":
                this_html = this_html.find_next()
                this_sector = this_sector + " " + this_html.text.strip()
                
            if this_sector == "":
                print(this_date,this_district)
                    
            #Since beginning of 2017, highlights for each district are shown in the
            #summary. They should not be included.
            
            check_year = int(this_date) > 201612 and this_district == 'National Summary'
            check_highlights = False
            
            #the non-strong part should be part of the string
            
            if this_html.next_sibling:
                if this_html.next_sibling.name == "br":
                            if this_html.next_sibling.next_sibling.name == "strong":
                                this_sector = this_sector + " " + this_html.text.strip()
                                this_html = this_html.next_sibling.next_sibling

            observ[this_sector] = ""
            this_string = ""
            
            while this_html.next_sibling:
                if this_html.next_sibling.name == "br":
                    if this_html.next_sibling.next_sibling.name != "strong":
                        this_html = this_html.next_sibling.next_sibling
                    else:
                        if this_html.next_sibling.next_sibling:
                            this_html = this_html.next_sibling.next_sibling
                else:
                    this_html = this_html.next_sibling
                if isinstance(this_html,str):
                    this_string = this_html.strip()
                    observ[this_sector] =\
                        observ[this_sector]+"\n"+this_string

            this_html = this_html.find_next()
        
            while this_html.find("div") is None and this_html.find("em") is None:
                if this_html.find("strong"):
                    if this_html.find("strong").text.strip():
                        break
                        
                this_string = this_html.text.strip()
                
                #check if the description arrived at "Highlights by Federal Reserve District"
                if check_year and re.search('Highlights by.',this_string):
                    check_highlights = True
                    break
                    
                observ[this_sector] =\
                    observ[this_sector]+"\n"+this_string
                this_html = this_html.find_next() 
                
            if check_highlights:
                break

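The core idea of the scraping cell above is that bold (`strong`) text marks a sector heading and the text that follows belongs to that sector until the next heading. The sketch below illustrates only that idea on a made-up HTML snippet. It is a simplified stand-in built on the standard library's `html.parser`, not the Beautiful Soup traversal used in the notebook, and it ignores the special cases handled there (empty headings, `em` tags, the "Highlights by Federal Reserve District" part). `SectorSplitter` and `split_by_sector` are hypothetical names.

```python
from html.parser import HTMLParser

class SectorSplitter(HTMLParser):
    """Collect the text that follows each <strong> heading (simplified sketch)."""

    def __init__(self, default_sector="Overall Economic Activity"):
        super().__init__()
        self.sections = {}
        self.current = default_sector  # used for text before the first heading
        self.in_strong = False
        self.heading_parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "strong":
            self.in_strong = True
            self.heading_parts = []

    def handle_endtag(self, tag):
        if tag == "strong":
            self.in_strong = False
            heading = " ".join("".join(self.heading_parts).split())
            if heading:  # empty headings do not start a new sector
                self.current = heading
                self.sections.setdefault(heading, "")

    def handle_data(self, data):
        if self.in_strong:
            self.heading_parts.append(data)
            return
        text = " ".join(data.split())
        if text:
            previous = self.sections.get(self.current, "")
            self.sections[self.current] = previous + "\n" + text if previous else text

def split_by_sector(html):
    parser = SectorSplitter()
    parser.feed(html)
    parser.close()
    return parser.sections

# Made-up snippet that mimics the layout of a district report.
sample_html = (
    "<p>Activity expanded.</p>"
    "<p><strong>Manufacturing</strong><br>Orders rose.</p>"
    "<p>Output was flat.</p>"
    "<p><strong>Real Estate</strong> Sales fell.</p>"
)
sections = split_by_sector(sample_html)
for sector, text in sections.items():
    print(repr(sector), repr(text))
```

Text that appears before the first heading is filed under "Overall Economic Activity", mirroring the default sector name the notebook uses for reports before 2017.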
In [28]:
#cleaning up keys
#there are some special characters in some sector names

sectors = []

for date in texts_dict.keys():
    for district in texts_dict[date].keys():
        observations = texts_dict[date][district]
        # iterate over a copy of the keys, because keys are renamed inside the loop
        for sector in list(observations.keys()):
            # collapse double spaces, drop stray '\xc2' characters and periods
            sectorNew = sector.replace('  ',' ').replace('\xc2','').replace('.','')
            observations[sectorNew] = observations.pop(sector)
            sectors.append(sectorNew)

#get a list of all used sector titles
sectors = set(sectors)

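One side effect of renaming keys is that two different sector titles can collapse into the same cleaned title, in which case one description silently overwrites the other. A minimal sketch of a check for such collisions follows, assuming the same three replacements as the clean-up cell above. `clean_sector_name` and `find_collisions` are hypothetical helpers and the sample titles are made up.

```python
from collections import defaultdict

def clean_sector_name(sector):
    """Apply the same three replacements as the clean-up cell."""
    return sector.replace('  ', ' ').replace('\xc2', '').replace('.', '')

def find_collisions(sector_names):
    """Return the cleaned names that more than one original name maps to."""
    groups = defaultdict(list)
    for name in sector_names:
        groups[clean_sector_name(name)].append(name)
    return {cleaned: names for cleaned, names in groups.items() if len(names) > 1}

# Made-up sector titles for illustration.
sample = ["Real Estate.", "Real  Estate", "Manufacturing"]
collisions = find_collisions(sample)
print(collisions)
```

Running such a check per date and district before renaming would show whether any report actually contains colliding titles.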
In [29]:
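import csv

# Save the sector titles, one per line, so they can be mapped to categories.
# newline='' lets the csv module control the line endings itself.
with open('sectors.csv', 'w', newline='') as myfile:
    wr = csv.writer(myfile, delimiter='\n', quoting=csv.QUOTE_ALL)
    wr.writerow(sectors)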

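The cell above writes all sector titles as a single CSV row and uses the newline character as delimiter to get one title per line. An alternative sketch that writes one row per title, in sorted order so the file is reproducible between runs, is shown below. `write_sectors` is a hypothetical helper, and an in-memory buffer stands in for the real file.

```python
import csv
import io

def write_sectors(sectors, handle):
    """Write one quoted sector title per row, in sorted order (hypothetical helper)."""
    writer = csv.writer(handle, quoting=csv.QUOTE_ALL, lineterminator='\n')
    for sector in sorted(sectors):
        writer.writerow([sector])

# The buffer stands in for open('sectors.csv', 'w', newline='').
buffer = io.StringIO()
write_sectors({"Manufacturing", "Banking and Finance"}, buffer)
print(buffer.getvalue())
```

Sorting matters here because `sectors` is a set, whose iteration order is not guaranteed to be the same from run to run.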
# Mapping the sectors to 7 categories
This step is necessary because the Fed districts break their reports down by different sectors. The seven categories are:<p>

Agriculture, Energy, Natural Resources<br>
Finance<br>
Real Estate, Construction<br>
Manufacturing, Services<br>
Employment, Prices, Wages<br>
Retail, Consumer Spending<br>
Economy

In [30]:
# Use a hand-made CSV file that maps sectors to categories;
# it is joined with the scraped texts in pandas
df_categories = pd.read_csv('mapping_sectors.csv', sep=';')

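The contents of `mapping_sectors.csv` are not shown in the notebook. The join further down uses `Sector` as the key and relies on a `Category` column, so the file presumably looks roughly like the excerpt below. This is a sketch under that assumption: the three rows reuse sector and category pairs that appear in the notebook's output, but the exact layout of the real file is a guess. It is parsed here with the standard `csv` module instead of pandas.

```python
import csv
import io

# Assumed excerpt of the semicolon-separated mapping file.
# The column names Sector and Category are the ones the notebook relies on.
sample_mapping = """Sector;Category
Banking and Finance;Finance
Real Estate;Real Estate, Construction
Consumer Spending and Tourism;Retail, Consumer Spending
"""

reader = csv.DictReader(io.StringIO(sample_mapping), delimiter=';')
sector_to_category = {row["Sector"]: row["Category"] for row in reader}
print(sector_to_category["Real Estate"])
```

The semicolon separator is what allows category names such as "Real Estate, Construction" to contain commas without quoting.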
In [31]:
# The nested dictionary is loaded into pandas to do the mapping.

# The approach for turning a nested dictionary into a DataFrame is adapted from Stack Overflow:
# https://stackoverflow.com/questions/33611782/pandas-dataframe-from-nested-dictionary
df = pd.DataFrame([(k1, k2, k3, v3) for k1, v1 in texts_dict.items()
                   for k2, v2 in v1.items()
                   for k3, v3 in v2.items()],
                  columns=['Date', 'District', 'Sector', 'Description'])

df.sort_values(['Date','District','Sector']).head(5)

df[df.Date=='201510']

Unnamed: 0,Date,District,Sector,Description
5813,201510,National Summary,Consumer Spending and Tourism,\nConsumer spending grew at a moderate pace ov...
5814,201510,National Summary,"Employment, Wages and Prices",\nLabor markets generally tightened since the ...
5815,201510,National Summary,Overall Economic Activity,Reports from the twelve Federal Reserve Distri...
5816,201510,National Summary,Nonfinancial Services,\nNonfinancial services activity generally str...
5817,201510,National Summary,Manufacturing,"\nOn the whole, manufacturing conditions were ..."
5818,201510,National Summary,Real Estate and Construction,\nResidential real estate activity has general...
5819,201510,National Summary,Agriculture and Natural Resources,\nAgriculture conditions were mixed. Growing c...
5820,201510,National Summary,Banking and Financial Services,\nReports from the banking sector were general...
5821,201510,Atlanta,Real Estate and Construction,\nDistrict contacts indicated that residential...
5822,201510,Atlanta,Manufacturing and Transportation,\nDistrict manufacturers indicated that busine...

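The list comprehension that feeds the DataFrame walks the three dictionary levels (date, district, sector) and emits one tuple per description. The same flattening on a made-up miniature of `texts_dict`, without pandas, shows the shape of the rows:

```python
# Made-up miniature of texts_dict: date -> district -> sector -> description.
sample_texts = {
    "201510": {
        "Atlanta": {"Manufacturing": "text a", "Real Estate": "text b"},
        "Boston": {"Manufacturing": "text c"},
    },
}

rows = [
    (date, district, sector, description)
    for date, districts in sample_texts.items()
    for district, sectors in districts.items()
    for sector, description in sectors.items()
]
for row in rows:
    print(row)
```

Each tuple corresponds to one row of the DataFrame, with the columns Date, District, Sector and Description.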

In [32]:
df_joined = df.join(df_categories.set_index('Sector'), on='Sector')

In [33]:
# Sectors without an entry in the mapping file are assigned to "Manufacturing, Services".
df_joined.loc[df_joined.Category.isnull(),'Category'] =\
    "Manufacturing, Services"

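Because every unmapped sector ends up in "Manufacturing, Services", it can be worth listing which sectors took that fallback before trusting the category counts. The sketch below is a plain-Python analogue of the join plus fallback, not the pandas code itself. `categorize` is a hypothetical helper, and the mapping and sector titles are sample values.

```python
DEFAULT_CATEGORY = "Manufacturing, Services"

def categorize(sectors, mapping, default=DEFAULT_CATEGORY):
    """Return (sector -> category, sorted list of sectors that used the default)."""
    assigned = {}
    unmapped = []
    for sector in sectors:
        if sector in mapping:
            assigned[sector] = mapping[sector]
        else:
            assigned[sector] = default
            unmapped.append(sector)
    return assigned, sorted(unmapped)

# Sample mapping and sector titles for illustration.
mapping = {"Banking and Finance": "Finance"}
assigned, unmapped = categorize(["Banking and Finance", "Nonfinancial Services"], mapping)
print(assigned)
print(unmapped)
```

A long `unmapped` list would suggest that the mapping file needs more entries instead of relying on the default.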
In [34]:
df_joined.loc[df_joined['District'] == 'St Louis', 'District'] = 'St. Louis'
df_joined.sort_values(['Date','District','Category','Sector'])

Unnamed: 0,Date,District,Sector,Description,Category
12,200801,Atlanta,Agriculture and Natural Resources,\nRains during December brought short-term rel...,"Agriculture, Energy, Natural Resources"
13,200801,Atlanta,Overall Economic Activity,Reports from contacts for late November throug...,Economy
11,200801,Atlanta,Employment and Prices,\nThe demand for workers in some sectors conti...,"Employment, Prices, Wages"
10,200801,Atlanta,Banking and Finance,\nFinancial industry contacts reported reduced...,Finance
9,200801,Atlanta,Manufacturing and Transportation,\nManufacturing activity varied by industry. ...,"Manufacturing, Services"
8,200801,Atlanta,Real Estate,\nHomebuilders and Realtors reported that new ...,"Real Estate, Construction"
14,200801,Atlanta,Consumer Spending and Tourism,\nMost retail contacts reported that holiday s...,"Retail, Consumer Spending"
17,200801,Boston,Overall Economic Activity,"Closing out 2007 with mixed results, business ...",Economy
18,200801,Boston,Manufacturing,\nManufacturers and related services provider...,"Manufacturing, Services"
19,200801,Boston,Commercial Real Estate,\nThe common theme of respondents this time is...,"Real Estate, Construction"


In [35]:
df_joined.to_csv("fulltext.csv")