# Automatically downloading real estate data

The Mortgage Bankers Association (MBA) issues a press release every week with the results of its Weekly Mortgage Applications Survey. The survey covers over 75% of all applications for retail residential mortgages in the U.S. and reports, among other things, interest rates and changes in the number of applications. These press releases are all available on [one page](https://www.mba.org/news-research-and-resources/research-and-economics/single-family-research/weekly-applications-survey/research-and-economics-all-news-about-mbas-weekly-applications-survey) and go back to 2016, although the survey itself has been conducted since 1990. Find more information about the survey [here](https://www.mba.org/Documents/mba.org/files/Research/HistoricalWAS/WASFAQ.pdf).

The purpose of this script is to automatically download all of the press releases and extract information about interest rates.

## Part 1: Get a list of URLs

I need to find the URL of each press release on the webpage and put them all into a list.

In [2]:
from requests import get
from bs4 import BeautifulSoup

In [5]:
url = 'https://www.mba.org/news-research-and-resources/research-and-economics/single-family-research/weekly-applications-survey/research-and-economics-all-news-about-mbas-weekly-applications-survey'

headers = {'name': "Sharon Lurye", 'email': "sharonrlurye@gmail.com"}

response = get(url, headers=headers)

In [13]:
soup = BeautifulSoup(response.text, 'html.parser')
soup

<!DOCTYPE html>

<html class="no-js" lang="en">
<head>
<!-- Google Tag Manager -->
<script>(function (w, d, s, l, i) { w[l] = w[l] || []; w[l].push({ 'gtm.start': new Date().getTime(), event: 'gtm.js' }); var f = d.getElementsByTagName(s)[0], j = d.createElement(s), dl = l != 'dataLayer' ? '&l=' + l : ''; j.async = true; j.src = 'https://www.googletagmanager.com/gtm.js?id=' + i + dl; f.parentNode.insertBefore(j, f); })(window, document, 'script', 'dataLayer', 'GTM-K956ZZP');</script>
<!-- End Google Tag Manager -->
<!-- second GTM - see issue 1360671 - kih 2020-08-27 -->
<!-- Google Tag Manager -->
<script>
    (function (w, d, s, l, i) {
      w[l] = w[l] || []; w[l].push({
        'gtm.start':
          new Date().getTime(), event: 'gtm.js'
      }); var f = d.getElementsByTagName(s)[0],
        j = d.createElement(s), dl = l != 'dataLayer' ? '&l=' + l : ''; j.async = true; j.src =
          'https://www.googletagmanager.com/gtm.js?id=' + i + dl; f.parentNode.insertBefore(j, f);
    

We want to find this part:

```
<ul class="item-list search-results-list">
<h1 class="schema-pressrelease prop-resourcename item-title search-results-item-title"> <a href="http://www.mba.org/2021-press-releases/june/mortgage-applications-increase-in-latest-mba-weekly-survey-x281353">                 Mortgage Applications Increase in Latest MBA Weekly Survey                 </a> </h1>
<p> <span class="schema-pressrelease prop-articledate">                 June 23, 2021                 </span></p>
<br/>
```

In [20]:
press_releases = soup.find_all("ul", class_="item-list search-results-list")[0]
press_releases

<ul class="item-list search-results-list">
<h1 class="schema-pressrelease prop-resourcename item-title search-results-item-title"> <a href="http://www.mba.org/2021-press-releases/june/mortgage-applications-increase-in-latest-mba-weekly-survey-x281353">                 Mortgage Applications Increase in Latest MBA Weekly Survey                 </a> </h1>
<p> <span class="schema-pressrelease prop-articledate">                 June 23, 2021                 </span></p>
<br/>
<h1 class="schema-pressrelease prop-resourcename item-title search-results-item-title"> <a href="http://www.mba.org/2021-press-releases/june/mortgage-applications-increase-in-latest-mba-weekly-survey">                 Mortgage Applications Increase in Latest MBA Weekly Survey                 </a> </h1>
<p> <span class="schema-pressrelease prop-articledate">                 June 16, 2021                 </span></p>
<br/>
<h1 class="schema-pressrelease prop-resourcename item-title search-results-item-title"> <a href="http

In [26]:
press_releases.find_all("a")[0].get("href")

#https://stackoverflow.com/questions/14470504/python-how-to-extract-url-from-html-page-using-beautifulsoup

'http://www.mba.org/2021-press-releases/june/mortgage-applications-increase-in-latest-mba-weekly-survey-x281353'

In [29]:
links = []

for press_release in press_releases.find_all("a"):
    links.append(press_release.get("href"))
    
links

['http://www.mba.org/2021-press-releases/june/mortgage-applications-increase-in-latest-mba-weekly-survey-x281353',
 'http://www.mba.org/2021-press-releases/june/mortgage-applications-increase-in-latest-mba-weekly-survey',
 'http://www.mba.org/2021-press-releases/june/mortgage-applications-decrease-in-latest-mba-weekly-survey-x280704',
 'http://www.mba.org/2021-press-releases/june/mortgage-applications-decrease-in-latest-mba-weekly-survey',
 'http://www.mba.org/2021-press-releases/may/mortgage-applications-decrease-in-latest-mba-weekly-survey-x280264',
 'http://www.mba.org/2021-press-releases/may/mortgage-applications-increase-in-latest-mba-weekly-survey-x279918',
 'http://www.mba.org/2021-press-releases/may/mortgage-applications-increase-in-latest-mba-weekly-survey',
 'http://www.mba.org/2021-press-releases/may/mortgage-applications-decrease-in-latest-mba-weekly-survey',
 'http://www.mba.org/x279239.xml',
 'http://www.mba.org/2021-press-releases/april/mortgage-applications-increase-in-

In [30]:
len(links) #There are 247 press releases going back to January 2016

247
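
Not every entry in `links` has the usual press-release shape; `http://www.mba.org/x279239.xml` in the output above is one example. Here is a minimal sketch of a sanity check, assuming press-release URLs normally look like `/<year>-press-releases/<month>/<slug>`. That shape is inferred from the links printed above, not from any documented rule, and the helper name `looks_like_press_release` is made up for illustration.

```python
import re
from urllib.parse import urlparse

# Assumed URL shape, inferred from the links printed above:
#   /2021-press-releases/june/some-slug
PATH_PATTERN = re.compile(r"^/(\d{4})-press-releases/([a-z]+)/[^/]+$")


def looks_like_press_release(url):
    """Return True if the URL path matches the assumed press-release shape."""
    return PATH_PATTERN.match(urlparse(url).path) is not None


sample_links = [
    "http://www.mba.org/2021-press-releases/june/mortgage-applications-increase-in-latest-mba-weekly-survey-x281353",
    "http://www.mba.org/x279239.xml",
]

# Links that break the pattern are worth opening by hand before scraping them.
odd_links = [link for link in sample_links if not looks_like_press_release(link)]
print(odd_links)
```

Running the same check over the full `links` list would show how many entries need a closer look before Part 2.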

## Part 2: Find the date of each survey and the interest rates

Let's start with the first press release only.

In [31]:
links[0]

'http://www.mba.org/2021-press-releases/june/mortgage-applications-increase-in-latest-mba-weekly-survey-x281353'

In [34]:
url = links[0]

headers = {'name': "Sharon Lurye", 'email': "sharonrlurye@gmail.com"}

response = get(url, headers=headers)

soup = BeautifulSoup(response.text, "html.parser")

In [39]:
soup

<!DOCTYPE html>

<html class="no-js" lang="en">
<head>
<!-- Google Tag Manager -->
<script>(function (w, d, s, l, i) { w[l] = w[l] || []; w[l].push({ 'gtm.start': new Date().getTime(), event: 'gtm.js' }); var f = d.getElementsByTagName(s)[0], j = d.createElement(s), dl = l != 'dataLayer' ? '&l=' + l : ''; j.async = true; j.src = 'https://www.googletagmanager.com/gtm.js?id=' + i + dl; f.parentNode.insertBefore(j, f); })(window, document, 'script', 'dataLayer', 'GTM-K956ZZP');</script>
<!-- End Google Tag Manager -->
<!-- second GTM - see issue 1360671 - kih 2020-08-27 -->
<!-- Google Tag Manager -->
<script>
    (function (w, d, s, l, i) {
      w[l] = w[l] || []; w[l].push({
        'gtm.start':
          new Date().getTime(), event: 'gtm.js'
      }); var f = d.getElementsByTagName(s)[0],
        j = d.createElement(s), dl = l != 'dataLayer' ? '&l=' + l : ''; j.async = true; j.src =
          'https://www.googletagmanager.com/gtm.js?id=' + i + dl; f.parentNode.insertBefore(j, f);
    

In [52]:
#Find the title of the press release
soup.find('h1', attrs={'class':'title-primary'}).text

'Mortgage Applications Increase in Latest MBA Weekly Survey '

In [65]:
#Find the first three <p> tags: the first one is empty, so this shows the first two paragraphs of the press release
soup.find_all('p')[0:3]

[<p></p>,
 <p><strong></strong><strong>WASHINGTON, D.C. (June 23, 2021)</strong> - Mortgage applications increased 2.1 percent from one week earlier, according to data from the Mortgage Bankers Association's (MBA) Weekly Mortgage Applications Survey for the week ending June 18, 2021.</p>,
 <p>The Market Composite Index, a measure of mortgage loan application volume, increased 2.1 percent on a seasonally adjusted basis from one week earlier. On an unadjusted basis, the Index increased 1 percent compared with the previous week. The Refinance Index increased 3 percent from the previous week and was 9 percent lower than the same week one year ago. The seasonally adjusted Purchase Index increased 1 percent from one week earlier. The unadjusted Purchase Index decreased 1 percent compared with the previous week and was 14 percent lower than the same week one year ago.</p>]

In [66]:
#Find text of the first paragraph
soup.find_all('p')[1].text

"WASHINGTON, D.C. (June 23, 2021) -\xa0Mortgage applications increased 2.1 percent from one week earlier, according to data from the Mortgage Bankers Association's (MBA) Weekly Mortgage Applications Survey for the week ending June 18, 2021."

In [92]:
#Find the date the survey week ended

import re

In [153]:
#https://www.programiz.com/python-programming/regex#python-regex
#https://stackoverflow.com/questions/35413746/regex-to-match-date-like-month-name-day-comma-and-year/35413952
#https://stackoverflow.com/questions/59219456/extract-month-names-and-date-numbers-from-a-raw-string-using-regex-edit-new-te

string = soup.find_all('p')[1].text
#Raw string so that \s and \d reach the regex engine unchanged
pattern = r'\s(?:January|February|March|April|May|June|July|August|September|October|November|December)\s\d\d,\s20\d\d'

#I want it to return June 18 (the date the survey week ended) but NOT June 23 (the date the press release came out). The release date comes right after an opening parenthesis, so I search for a date with whitespace in front. Then I use strip() to remove that whitespace.

result = re.findall(pattern, string)
print(string)
print(result[0].strip())

WASHINGTON, D.C. (June 23, 2021) - Mortgage applications increased 2.1 percent from one week earlier, according to data from the Mortgage Bankers Association's (MBA) Weekly Mortgage Applications Survey for the week ending June 18, 2021.
June 18, 2021
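
An alternative to relying on the leading whitespace is to anchor on the phrase "week ending" and convert the match to a real date, which makes later sorting easier. This is only a sketch: it assumes every release uses the phrase "week ending" (true of the sample above, unverified for the rest), it assumes an English locale for `%B`, and the helper name `week_ending_date` is made up.

```python
import re
from datetime import datetime

MONTHS = ("January|February|March|April|May|June|July|"
          "August|September|October|November|December")
# Anchor on the phrase "week ending" instead of relying on leading whitespace;
# \d{1,2} also accepts one-digit days such as "June 4".
WEEK_ENDING = re.compile(r"week ending\s+((?:" + MONTHS + r")\s+\d{1,2},\s+\d{4})")


def week_ending_date(text):
    """Return the survey's week-ending date, or None if the phrase is missing."""
    match = WEEK_ENDING.search(text)
    if match is None:
        return None
    # Collapse any run of whitespace (including non-breaking spaces) to one space.
    cleaned = " ".join(match.group(1).split())
    return datetime.strptime(cleaned, "%B %d, %Y").date()


first_paragraph = (
    "WASHINGTON, D.C. (June 23, 2021) -\xa0Mortgage applications increased 2.1 "
    "percent from one week earlier, according to data from the Mortgage Bankers "
    "Association's (MBA) Weekly Mortgage Applications Survey for the week ending "
    "June 18, 2021."
)
print(week_ending_date(first_paragraph))
```

Returning `None` instead of raising keeps a loop over many releases from stopping at the first page with an unexpected layout.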


In [88]:
#Print all paragraphs that mention interest rates
#https://stackabuse.com/python-check-if-string-contains-substring 

string = "interest rate"

#Loop through every paragraph and print the ones that contain the search string
for paragraph in soup.find_all('p'):
    if string in paragraph.text:
        print(paragraph.text, "\n")

The average contract interest rate for 30-year fixed-rate mortgages with conforming loan balances ($548,250 or less) increased to 3.18 percent from 3.11 percent, with points increasing to 0.48 from 0.36 (including the origination fee) for 80 percent loan-to-value ratio (LTV) loans. The effective rate increased from last week. 

The average contract interest rate for 30-year fixed-rate mortgages with jumbo loan balances (greater than $548,250) increased to 3.26 percent from 3.20 percent, with points decreasing to 0.44 from 0.46 (including the origination fee) for 80 percent LTV loans. The effective rate increased from last week. 

The average contract interest rate for 30-year fixed-rate mortgages backed by the FHA increased to 3.21 percent from 3.14 percent, with points increasing to 0.34 from 0.33 (including the origination fee) for 80 percent LTV loans. The effective rate increased from last week. 

The average contract interest rate for 15-year fixed-rate mortgages increased to 2.58

In [167]:
#Print contract interest rate for 30-year fixed-rate mortgages under $548,250

string = "The average contract interest rate for 30-year fixed-rate mortgages with conforming loan balances"
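#Raw string with an escaped dot, so the pattern only matches a literal decimal point
pattern = r"(\d\.\d\d) percent"

#Find the paragraph that contains the search string and print the first rate in it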
for paragraph in soup.find_all('p'):
    if string in paragraph.text:
        print(paragraph.text)
        result = re.findall(pattern, paragraph.text)
        print(result[0])

The average contract interest rate for 30-year fixed-rate mortgages with conforming loan balances ($548,250 or less) increased to 3.18 percent from 3.11 percent, with points increasing to 0.48 from 0.36 (including the origination fee) for 80 percent loan-to-value ratio (LTV) loans. The effective rate increased from last week.
3.18


In [168]:
#Print contract interest rate for 5/1 ARMs

string = "The average contract interest rate for 5/1 ARMs"
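#Raw string with an escaped dot, so the pattern only matches a literal decimal point
pattern = r"(\d\.\d\d) percent"

#Find the paragraph that contains the search string and print the first rate in it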
for paragraph in soup.find_all('p'):
    if string in paragraph.text:
        print(paragraph.text)
        result = re.findall(pattern, paragraph.text)
        print(result[0])

The average contract interest rate for 5/1 ARMs remained unchanged at 2.69 percent, with points decreasing to 0.26 from 0.38 (including the origination fee) for 80 percent LTV loans. The effective rate decreased from last week.  
2.69
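
The rate sentences above come in two shapes: "increased to 3.18 percent from 3.11 percent" and "remained unchanged at 2.69 percent". The sketch below reads the current rate and, when it is stated, the previous one as floats. It assumes the first two "N.NN percent" figures in the sentence are the current and previous rates, which holds for these two examples but has not been checked against older releases. The helper name `contract_rates` is made up.

```python
import re

# Escaping the dot makes the pattern match a literal decimal point only.
RATE = re.compile(r"(\d+\.\d+) percent")


def contract_rates(sentence):
    """Return (current, previous) rates as floats; previous is None if not stated."""
    found = [float(value) for value in RATE.findall(sentence)]
    if not found:
        return None, None
    current = found[0]
    previous = found[1] if len(found) > 1 else None
    return current, previous


conforming = ("The average contract interest rate for 30-year fixed-rate mortgages "
              "with conforming loan balances ($548,250 or less) increased to 3.18 "
              "percent from 3.11 percent, with points increasing to 0.48 from 0.36 "
              "(including the origination fee) for 80 percent loan-to-value ratio "
              "(LTV) loans.")
arm = ("The average contract interest rate for 5/1 ARMs remained unchanged at 2.69 "
       "percent, with points decreasing to 0.26 from 0.38 (including the origination "
       "fee) for 80 percent LTV loans.")

print(contract_rates(conforming))
print(contract_rates(arm))
```

Because the pattern requires a decimal point, "80 percent" is skipped. A rate printed as a whole number would be skipped too, so that case would need separate handling.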


## Putting it all together

Now I will systematically go through each press release and grab the data I need.

In [232]:
#Updated regex pattern since it did not find one-digit dates
#https://stackoverflow.com/questions/5155422/regex-for-1-or-2-digits-optional-non-alphanumeric-2-known-alphas/5155506
#Also updated result object to avoid an index error
#https://stackoverflow.com/questions/22251945/why-the-list-index-out-of-range-error

survey_dates = []

for link in links[0:3]:
    url = link
    headers = {'name': "Sharon Lurye", 'email': "sharonrlurye@gmail.com"}
    response = get(url, headers=headers)
    soup = BeautifulSoup(response.text, "html.parser")
    string = soup.find_all('p')[1].text
    pattern = r'\s(?:January|February|March|April|May|June|July|August|September|October|November|December)\s[0-9]{1,2},\s20\d\d'
    result = re.findall(pattern, string)
    if len(result) > 0:
    #   print(string)
        print(result[0].strip())
        survey_dates.append(result[0].strip())
    else:
        print("NA")
        survey_dates.append("NA")

June 18, 2021
June 11, 2021
June 4, 2021


In [240]:
fixed_rates = []

for link in links[0:3]:
    url = link
    headers = {'name': "Sharon Lurye", 'email': "sharonrlurye@gmail.com"}
    response = get(url, headers=headers)
    soup = BeautifulSoup(response.text, "html.parser")
    string = "The average contract interest rate for 30-year fixed-rate mortgages with conforming loan balances"
    pattern = r"(\d\.\d\d) percent"
    for paragraph in soup.find_all('p'):
        if string in paragraph.text:
            #print(paragraph.text)
            result = re.findall(pattern, paragraph.text)
            print(result[0])
            fixed_rates.append(result[0])
        else:
            print("NA")
            fixed_rates.append("NA")

NA
NA
NA
NA
NA
NA
3.18
NA
NA
NA
NA
NA
NA
NA
NA
NA
NA
NA
NA
NA
NA
NA
NA
NA
NA
NA
NA
NA
NA
NA
NA
NA
NA
NA
NA
NA
NA
NA
NA
NA
3.11
NA
NA
NA
NA
NA
NA
NA
NA
NA
NA
NA
NA
NA
NA
NA
NA
NA
NA
NA
NA
NA
NA
NA
NA
NA
NA
NA
NA
NA
3.15
NA
NA
NA
NA
NA
NA
NA
NA
NA
NA
NA
NA
NA
NA
NA
NA
NA
NA
NA
NA
NA
NA
NA
NA
NA
NA


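## Problem: how do I get it to say NA if it doesn't find that string?

In the `fixed_rates` loop above, the `else` belongs to the `if` inside the paragraph loop, so "NA" is printed and appended once for every paragraph that lacks the sentence, not once for every press release that lacks it. That is why the output has dozens of NA lines around each rate.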
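
One possible answer to the question above is Python's `for ... else`: the `else` block of a `for` loop runs only when the loop finishes without hitting `break`. The sketch below runs on plain lists of paragraph text rather than live pages. The `releases` data is made up to stand in for `[p.text for p in soup.find_all('p')]`, and this is one way to do it, not the only one.

```python
import re

PREFIX = ("The average contract interest rate for 30-year fixed-rate mortgages "
          "with conforming loan balances")
RATE = re.compile(r"(\d\.\d\d) percent")

# Made-up stand-ins for [p.text for p in soup.find_all('p')] on two releases:
# the first contains the sentence, the second does not.
releases = [
    ["Mortgage applications increased 2.1 percent from one week earlier.",
     PREFIX + " ($548,250 or less) increased to 3.18 percent from 3.11 percent."],
    ["Mortgage applications decreased 3.1 percent from one week earlier."],
]

fixed_rates = []

for paragraphs in releases:
    for text in paragraphs:
        if PREFIX in text:
            result = RATE.findall(text)
            if len(result) > 0:
                fixed_rates.append(result[0])
                break  # stop at the first match so each release adds one value
    else:
        # A for loop's else runs only when the loop ended without break,
        # i.e. no paragraph in this release yielded a rate.
        fixed_rates.append("NA")

print(fixed_rates)
```

With this structure `fixed_rates` gets exactly one entry per release, so it stays the same length as `links` and can be zipped into a DataFrame.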

In [238]:
arm_rates = []

for link in links[0:3]:
    url = link
    headers = {'name': "Sharon Lurye", 'email': "sharonrlurye@gmail.com"}
    response = get(url, headers=headers)
    soup = BeautifulSoup(response.text, "html.parser")
    string = "The average contract interest rate for 5/1 ARMs"
    pattern = r"(\d\.\d\d) percent"
    for paragraph in soup.find_all('p'):
        if string in paragraph.text:
            #print(paragraph.text)
            result = re.findall(pattern, paragraph.text)
            print(result[0])
            arm_rates.append(result[0])

2.69
2.69
2.52


In [234]:
print(links[0:3])
print(survey_dates)
print(fixed_rates)
print(arm_rates)

['http://www.mba.org/2021-press-releases/june/mortgage-applications-increase-in-latest-mba-weekly-survey-x281353', 'http://www.mba.org/2021-press-releases/june/mortgage-applications-increase-in-latest-mba-weekly-survey', 'http://www.mba.org/2021-press-releases/june/mortgage-applications-decrease-in-latest-mba-weekly-survey-x280704']
['June 18, 2021', 'June 11, 2021', 'June 4, 2021']
['3.18', '3.11', '3.15']
['2.69', '2.69', '2.52']


In [220]:
import pandas as pd

In [235]:
#https://www.geeksforgeeks.org/create-a-pandas-dataframe-from-lists/
    
df = pd.DataFrame(list(zip(links[0:3], survey_dates, fixed_rates, arm_rates)), 
                  columns = ['link', 'week_ending_date', 'fixed_rate', 'arm_rate'])

df

| | link | week_ending_date | fixed_rate | arm_rate |
|---|---|---|---|---|
| 0 | http://www.mba.org/2021-press-releases/june/mo... | June 18, 2021 | 3.18 | 2.69 |
| 1 | http://www.mba.org/2021-press-releases/june/mo... | June 11, 2021 | 3.11 | 2.69 |
| 2 | http://www.mba.org/2021-press-releases/june/mo... | June 4, 2021 | 3.15 | 2.52 |
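
The three loops above each download the same pages again. A sketch of an alternative is to parse every field from a single download per release and collect one dictionary per row. The helper names `rate_after` and `parse_release` are made up, and the sample paragraphs are abbreviated from the first press release; in the notebook the input would be `[p.text for p in soup.find_all('p')]`.

```python
import re

DATE = re.compile(
    r"\s((?:January|February|March|April|May|June|July|August|September|"
    r"October|November|December)\s[0-9]{1,2},\s20\d\d)"
)
RATE = re.compile(r"(\d\.\d\d) percent")
FIXED = ("The average contract interest rate for 30-year fixed-rate mortgages "
         "with conforming loan balances")
ARM = "The average contract interest rate for 5/1 ARMs"


def rate_after(prefix, paragraphs):
    """First rate in the first paragraph containing prefix, or "NA"."""
    for text in paragraphs:
        match = RATE.search(text) if prefix in text else None
        if match:
            return match.group(1)
    return "NA"


def parse_release(link, paragraphs):
    """Build one table row from the paragraph texts of a single press release."""
    date = DATE.search(paragraphs[1]) if len(paragraphs) > 1 else None
    return {
        "link": link,
        "week_ending_date": date.group(1) if date else "NA",
        "fixed_rate": rate_after(FIXED, paragraphs),
        "arm_rate": rate_after(ARM, paragraphs),
    }


# Abbreviated sample text standing in for one downloaded press release.
paragraphs = [
    "",
    "WASHINGTON, D.C. (June 23, 2021) - Survey for the week ending June 18, 2021.",
    FIXED + " ($548,250 or less) increased to 3.18 percent from 3.11 percent.",
    ARM + " remained unchanged at 2.69 percent.",
]
link = "http://www.mba.org/2021-press-releases/june/mortgage-applications-increase-in-latest-mba-weekly-survey-x281353"
rows = [parse_release(link, paragraphs)]
print(rows)  # rows could then be passed to pd.DataFrame(rows)
```

A list of dictionaries keeps each release's values together, so a missing field in one release cannot shift the other columns out of line.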


In [227]:
survey_dates = []

for link in links:
    url = link
    headers = {'name': "Sharon Lurye", 'email': "sharonrlurye@gmail.com"}
    response = get(url, headers=headers)
    soup = BeautifulSoup(response.text, "html.parser")
    string = soup.find_all('p')[1].text
    pattern = r'\s(?:January|February|March|April|May|June|July|August|September|October|November|December)\s[0-9]{1,2},\s20\d\d'
    result = re.findall(pattern, string)
    if len(result) > 0:
    #   print(string)
        print(result[0].strip())
        survey_dates.append(result[0].strip())
    else:
        print("NA")
        survey_dates.append("NA")

#Build the DataFrame once, after the loop has finished, instead of rebuilding it on every pass
df = pd.DataFrame(list(zip(links, survey_dates)),
                  columns = ['link', 'week_ending_date'])

June 18, 2021
June 11, 2021
June 4, 2021
May 28, 2021
May 21, 2021
May 14, 2021
May 7, 2021
April 30, 2021
April 23, 2021
April 16, 2021
April 9, 2021
April 2, 2021
March 26, 2021
March 19, 2021
March 12, 2021
March 5, 2021
February 26, 2021
February 19, 2021
February 5, 2021
January 29, 2021
January 22, 2021
January 15, 2021
January 8, 2021
January 1, 2021
December 18, 2020
December 11, 2020
December 4, 2020
November 27, 2020
November 20, 2020
November 13, 2020
November 6, 2020
October 30, 2020
October 23, 2020
October 16, 2020
October 9, 2020
October 2, 2020
September 25, 2020
September 18, 2020
September 11, 2020
September 4, 2020
August 28, 2020
August 21, 2020
August 14, 2020
August 7, 2020
July 31, 2020
July 24, 2020
July 17, 2020
July 10, 2020
July 3, 2020
June 26, 2020
June 19, 2020
June 12, 2020
June 5, 2020
May 29, 2020
May 22, 2020
May 15, 2020
May 8, 2020
May 1, 2020
April 24, 2020
April 17, 2020
April 10, 2020
April 3, 2020
March 27, 2020
March 20, 2020
March 13, 2020
Marc

In [228]:
df

| | link | week_ending_date |
|---|---|---|
| 0 | http://www.mba.org/2021-press-releases/june/mo... | June 18, 2021 |
| 1 | http://www.mba.org/2021-press-releases/june/mo... | June 11, 2021 |
| 2 | http://www.mba.org/2021-press-releases/june/mo... | June 4, 2021 |
| 3 | http://www.mba.org/2021-press-releases/june/mo... | May 28, 2021 |
| 4 | http://www.mba.org/2021-press-releases/may/mor... | May 21, 2021 |
| ... | ... | ... |
| 242 | http://www.mba.org/2016-press-releases/january... | January 22, 2016 |
| 243 | http://www.mba.org/2016-press-releases/january... | January 15, 2016 |
| 244 | http://www.mba.org/2016-press-releases/january... | January 8, 2016 |
| 245 | http://www.mba.org/2016-press-releases/january... | |
