Tom Halloin <br> Springboard Data Science Career Track <br>

<h1 align="center">Capstone Project 2: Analysis of Berkshire Hathaway Shareholder Letters Using Natural Language Processing (NLP) Techniques</h1>

<h3 align='center'> Part 2: Cleaning Letters for Human Consumption</h3> <br>

<h4>We start with a sample of the data to illustrate the cleaning process, then apply the same process to the rest of the documents.</h4>

In [1]:
import html
import os # file movements later on
import re # regular expressions
from bs4 import BeautifulSoup

In [2]:
with open('../raw_letters/1977_letter.txt', 'r', encoding='utf-8', errors='replace') as infile:
    letter = infile.read()

sampletext = letter[0:1200]
print(sampletext)

<!-- Global site tag (gtag.js) - Google Analytics -->
<script async src="https://www.googletagmanager.com/gtag/js?id=UA-136883390-1"></script>
<script>
  window.dataLayer = window.dataLayer || [];
  function gtag(){dataLayer.push(arguments);}
  gtag('js', new Date());

  gtag('config', 'UA-136883390-1');
</script>
<HTML>
<HEAD>
  <TITLE>Chairman's Letter - 1977</TITLE>
</HEAD>
<BODY>
<P ALIGN=CENTER>
<B>BERKSHIRE HATHAWAY INC.</B>
</P>
<PRE>

<I>To the Stockholders of Berkshire Hathaway Inc.:</I>

     Operating earnings in 1977 of $21,904,000, or $22.54 per 
share, were moderately better than anticipated a year ago.  Of 
these earnings, $1.43 per share resulted from substantial 
realized capital gains by Blue Chip Stamps which, to the extent 
of our proportional interest in that company, are included in our 
operating earnings figure.  Capital gains or losses realized 
directly by Berkshire Hathaway Inc. or its insurance subsidiaries 
are not included in our calculation of operating e

<h4>Some of the earlier letters contain HTML character entities (escaped characters such as ampersands). The code below decodes them so that the text is consistent across all of the documents.</h4>

In [3]:
def html_decode(words):
    """Converts HTML character entities to their plain characters."""
    return html.unescape(words)

sampletext = html_decode(sampletext)
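
As a quick illustration of what the decoding step is meant to do, the sketch below runs `html.unescape` on a made-up string. The sample text is an assumption for demonstration and is not taken from the letters.

```python
import html

# Hypothetical snippet containing HTML entities, for illustration only.
raw = "Profit &amp; loss &lt;1977&gt; &quot;quoted&quot;"

# html.unescape converts named and numeric entities to plain characters.
decoded = html.unescape(raw)
print(decoded)
```

Text without entities passes through unchanged, so the step should be safe to run on every letter.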

<h4>The following removes the HTML tags left over from the scraping process. Text that sat between tags, such as the contents of the script block, remains.</h4>

In [4]:
def cleanhtml(words):
    """Removes HTML tags"""
    pattern = re.compile('<.*?>')
    cleantext = re.sub(pattern, '', words)
    return cleantext

sampletext = cleanhtml(sampletext)
print(sampletext)




  window.dataLayer = window.dataLayer || [];
  function gtag(){dataLayer.push(arguments);}
  gtag('js', new Date());

  gtag('config', 'UA-136883390-1');



  Chairman's Letter - 1977



BERKSHIRE HATHAWAY INC.



To the Stockholders of Berkshire Hathaway Inc.:

     Operating earnings in 1977 of $21,904,000, or $22.54 per 
share, were moderately better than anticipated a year ago.  Of 
these earnings, $1.43 per share resulted from substantial 
realized capital gains by Blue Chip Stamps which, to the extent 
of our proportional interest in that company, are included in our 
operating earnings figure.  Capital gains or losses realized 
directly by Berkshire Hathaway Inc. or its insurance subsidiaries 
are not included in our calculation of operating earnings.  While 
too much attention should not be paid to the figure for any 
single year, over the longer term the record regarding aggregate 
capital gains or losses obviously is of significance.
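
The output above still contains the body of the analytics script, because the tag pattern removes only the tags themselves. One possible way to drop such blocks, sketched here on a made-up miniature page (the `page` string is an assumption), is to remove whole script elements before stripping the remaining tags.

```python
import re

# Hypothetical miniature of a scraped page, for illustration only.
page = (
    "<script>\n"
    "  gtag('config', 'UA-0');\n"
    "</script>\n"
    "<B>BERKSHIRE HATHAWAY INC.</B>\n"
)

# Drop whole <script>...</script> blocks first; DOTALL lets '.' span newlines.
no_scripts = re.sub(r"<script\b.*?</script>", "", page,
                    flags=re.DOTALL | re.IGNORECASE)
# Then strip the remaining tags with the same pattern cleanhtml uses.
no_tags = re.sub(r"<.*?>", "", no_scripts)
print(no_tags.strip())
```

In this notebook the extra step is optional, since the later trimming step discards everything before the salutation anyway.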

 


<h4>The following removes newline characters, along with any literal "\n" and "\t" escape sequences left in the text.</h4>

In [5]:
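def replace_newlines(words):
    """Removes newlines and literal backslash-n / backslash-t sequences."""
    words = words.replace('\n', '')    # actual newline characters
    words = words.replace('\\n', ' ')  # literal backslash-n sequences
    words = words.replace('\\t', '')   # literal backslash-t sequences
    
    return words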

sampletext = replace_newlines(sampletext)
print(sampletext)

  window.dataLayer = window.dataLayer || [];  function gtag(){dataLayer.push(arguments);}  gtag('js', new Date());  gtag('config', 'UA-136883390-1');  Chairman's Letter - 1977BERKSHIRE HATHAWAY INC.To the Stockholders of Berkshire Hathaway Inc.:     Operating earnings in 1977 of $21,904,000, or $22.54 per share, were moderately better than anticipated a year ago.  Of these earnings, $1.43 per share resulted from substantial realized capital gains by Blue Chip Stamps which, to the extent of our proportional interest in that company, are included in our operating earnings figure.  Capital gains or losses realized directly by Berkshire Hathaway Inc. or its insurance subsidiaries are not included in our calculation of operating earnings.  While too much attention should not be paid to the figure for any single year, over the longer term the record regarding aggregate capital gains or losses obviously is of significance. 
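
In the output above, lines that did not end in a space are fused together (for example "1977BERKSHIRE"), because each newline is replaced with an empty string. A minimal sketch of an alternative, assuming a made-up snippet, is to collapse every run of whitespace into a single space instead.

```python
import re

# Hypothetical snippet mimicking the layout of the stripped page.
snippet = "Chairman's Letter - 1977\nBERKSHIRE HATHAWAY INC.\n\n     Operating earnings"

glued = snippet.replace("\n", "")       # newlines deleted outright
spaced = re.sub(r"\s+", " ", snippet)   # any whitespace run becomes one space

print(glued)
print(spaced)
```

The second form also normalizes the indentation and double spaces found in the older letters, which may or may not be desirable depending on how the text is used later.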


<h4>Removing all text before the salutation ('To the Stockholders of Berkshire Hathaway Inc.:' or 'To the Shareholders of Berkshire Hathaway Inc.:') and everything from Buffett's sign-off onward.</h4>

In [6]:
def remove_beginning_end(words):
    """Removes content before the salutation and from the sign-off onward."""
    
    beginning_string = 'To the Stockholders of Berkshire Hathaway Inc.:'
    beginning_string_shareholders = 'To the Shareholders of Berkshire Hathaway Inc.:'
    end_string = 'Warren E. Buffett, Chairman'
    beginning_index = words.find(beginning_string)

    # Some letters are addressed to "Shareholders" instead of "Stockholders".
    if beginning_index == -1:
        beginning_index = words.find(beginning_string_shareholders)

    # str.find returns -1 when a marker is missing; fall back to the text boundaries.
    if beginning_index == -1:
        beginning_index = 0
    end_index = words.find(end_string)
    if end_index == -1:
        end_index = len(words)
    return words[beginning_index:end_index]

sampletext = remove_beginning_end(sampletext)
print(sampletext)

To the Stockholders of Berkshire Hathaway Inc.:     Operating earnings in 1977 of $21,904,000, or $22.54 per share, were moderately better than anticipated a year ago.  Of these earnings, $1.43 per share resulted from substantial realized capital gains by Blue Chip Stamps which, to the extent of our proportional interest in that company, are included in our operating earnings figure.  Capital gains or losses realized directly by Berkshire Hathaway Inc. or its insurance subsidiaries are not included in our calculation of operating earnings.  While too much attention should not be paid to the figure for any single year, over the longer term the record regarding aggregate capital gains or losses obviously is of significance.
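
The sample is cut at 1,200 characters, so it never reaches the sign-off marker. The short sketch below, using a made-up string, shows what `str.find` returns in that case and why an unguarded slice quietly drops the last character.

```python
# Hypothetical text without the sign-off marker, for illustration only.
text = "To the Stockholders: Operating earnings were better."

end_index = text.find("Warren E. Buffett, Chairman")  # marker is absent
print(end_index)           # -1
print(text[0:end_index])   # slicing to -1 drops the final character

# A guard keeps the text intact when the marker is missing.
safe_end = end_index if end_index != -1 else len(text)
print(text[0:safe_end])
```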


<h4>Some of the tables contain long runs of dots, dashes, and equals signs. This function removes them.</h4>


In [11]:
def remove_dots_dashes_equals(words):
    """Removes runs of dots, dashes, and equals signs, plus other extraction debris. This cleans up the tables for machine consumption."""
    
    pattern = re.compile(r'[-=.]{2,}') # Two or more consecutive dots, dashes, or equals signs.
    pattern2 = re.compile(r'(\. ){2,}') # Repeated "dot space" sequences.
    pattern3 = re.compile(r"(\d+ +', '\d+': b')") # Dictionary page numbers
    words = re.sub(pattern, ' ', words)
    words = re.sub(pattern2, ' ', words)
    words = re.sub(pattern3, '', words)
    words = re.sub(r'\*', '', words) # * character
    words = re.sub(r"\\'", "'", words) # backslash before apostrophes (the apostrophe is kept)
    words = re.sub('amp;', '', words) # leftover 'amp;' from the ampersand entity
    words = re.sub(r'(\\x(.){2})', '', words) # literal \xNN escape sequences for non-ASCII bytes
    return words
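
To make the first pattern concrete, here is a small sketch applied to a made-up table row (the row is an assumption, not text from a letter). It shows that runs of filler characters are removed while a single decimal point survives.

```python
import re

# Hypothetical table row, for illustration only.
row = "Underwriting ........ $1,000.50  ======== $2,000"

# Runs of two or more dots, dashes, or equals signs become a single space.
no_runs = re.sub(r"[-=.]{2,}", " ", row)
# Collapse the leftover spacing so the columns read cleanly.
tidy = " ".join(no_runs.split())
print(tidy)
```

The final whitespace collapse is an addition for readability here and is not part of the function above.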

<h4>Combining the cleaning steps so far into a single function.</h4>


In [8]:

def clean_words(words):
    '''Combines the cleaning functions above into one.'''
    # words = html_decode(words)
    words = replace_newlines(words)
    words = cleanhtml(words)
    words = remove_beginning_end(words)
    words = remove_dots_dashes_equals(words)
    return words

In [12]:
# For our HTML data

with open('../raw_letters/1977_letter.txt', 'r', encoding='utf-8', errors='replace') as infile:
    text = infile.readlines()
    words = "".join(text)

In [13]:
# For our PDF data

with open('../raw_letters/2018_letter.txt', 'r', encoding='utf-8', errors='replace') as infile:
    text = infile.readlines()
    words = "".join(text)

<h4>Make a directory for the "cleaned" letters, then clean and save every letter.</h4>

In [16]:
# Create the same directory the loop writes to.
if not os.path.exists('../clean_letters'):
    os.makedirs('../clean_letters')
    
for year in range(1977, 2020):
    with open(f'../raw_letters/{year}_letter.txt', 'r', encoding='utf-8', errors='replace') as infile:
        text = infile.readlines()
        words = "".join(text)
    with open(f'../clean_letters/{year}_letter.txt', 'w', encoding='utf-8', errors='replace') as outfile:
        outfile.write(clean_words(words))
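
The batch step can also be sketched with `pathlib`, which makes it easy to skip years whose raw file is missing. The helper name `clean_all` and the stand-in cleaning function are assumptions for illustration. In the notebook, `clean_words` would be passed in instead.

```python
import tempfile
from pathlib import Path

def clean_all(raw_dir, clean_dir, clean_fn, years):
    """Clean each available '{year}_letter.txt' and return the years written."""
    raw_dir, clean_dir = Path(raw_dir), Path(clean_dir)
    clean_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for year in years:
        source = raw_dir / f"{year}_letter.txt"
        if not source.exists():  # skip years with no raw file
            continue
        text = source.read_text(encoding="utf-8", errors="replace")
        target = clean_dir / f"{year}_letter.txt"
        target.write_text(clean_fn(text), encoding="utf-8")
        written.append(year)
    return written

# Demonstration in a throwaway directory with a stand-in cleaning function.
with tempfile.TemporaryDirectory() as tmp:
    raw = Path(tmp) / "raw_letters"
    raw.mkdir()
    (raw / "1977_letter.txt").write_text("<B>Hello</B>", encoding="utf-8")
    out = Path(tmp) / "clean_letters"
    done = clean_all(raw, out, str.lower, range(1977, 1980))
    result = (out / "1977_letter.txt").read_text(encoding="utf-8")
    print(done, result)
```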