## LXML HTML parsing in Python

This notebook was posted by Simon Lindgren // [@simonlindgren](http://www.twitter.com/simonlindgren) // [simonlindgren.com](http://simonlindgren.com)

It is about how to use [`lxml`](http://lxml.de/) in Python to grab the content we want from a set of locally stored html files.

In [None]:
# Import required libraries
import glob, re
import lxml.html

### Read the html into an ElementTree (a hierarchy of all elements)

In [None]:
# Read just one of the files (if we have several) to test and set xpaths
with open("cy_sml/cysml1.html", "r") as f1:
    tree = lxml.html.parse(f1)
print(tree) # We have an ElementTree object

### Define which specific elements to get from the tree
Manually inspect the elements in your html (e.g. with Developer Tools in the Chrome browser) to find their xpaths.

When entering them below, **add '/text()' to the end of the xpath** to get its text content.
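
As a minimal sketch of what the '/text()' suffix changes, the cell below parses a small made-up html string (not one of the files used in this notebook) and runs the same xpath with and without it. The snippet and the variable names are assumptions, only for illustration.

```python
import lxml.html

# A made-up document, only for illustration
snippet = '<html><body><div id="the-loop"><div><h2> A topic title </h2></div></div></body></html>'
root = lxml.html.document_fromstring(snippet)

# Without /text() we get a list of element objects
elements = root.xpath('//*[@id="the-loop"]/div/h2')
print(elements)

# With /text() we get a list of strings
texts = root.xpath('//*[@id="the-loop"]/div/h2/text()')
print(texts)

# Join and strip to end up with one clean string
title = ''.join(texts).strip()
print(title)
```

Either form works with the steps below, but the string list is what `''.join(...)` expects.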

###### First item

In [None]:
first_item_to_get = tree.xpath('//*[@id="the-loop"]/div/h2/text()') # Get the text at this xpath
print(first_item_to_get) # We have the text ...
print(type(first_item_to_get))  # ... in list format
str1 = ''.join(first_item_to_get) # Make it a string
topic_title = str1.strip() # Name the text variable, and remove leading and trailing whitespace
print(topic_title)

###### Second item

In [None]:
second_item_to_get = tree.xpath('//*[@id="the-loop"]/div/div[1]/div[2]/div[2]/text()')
str2 = ''.join(second_item_to_get)
thread_start = str2.strip()
print(thread_start)

###### Third item

In [None]:
third_item_to_get = tree.xpath('//*[@id="cyfo-topic-86-reply-709"]/div[2]/div[2]/text()')
str3 = ''.join(third_item_to_get)
thread_post = str3.strip() # Remove leading and trailing whitespace
print(thread_post)

##### Getting several similar items
Looking at the second and third items above, we realise that several of the bits we want to get have similar, but not identical, xpaths, such as:

- //\*[@id="the-loop"]/div/div[1]/**div[2]/div[2]/text()**
- //\*[@id="cyfo-topic-86-reply-709"]/**div[2]/div[2]/text()**

And also, in this case:

- //\*[@id="cyfo-topic-86-reply-711"]/**div[2]/div[2]/text()**[2]
- //\*[@id="cyfo-topic-86-reply-711"]/**div[2]/div[2]/text()**[3]

We can then try to use the part that these xpaths have in common to get all of these items in one go, in this case '**//div[2]/div[2]/text()**'

Like this:

In [None]:
more_items_to_get = tree.xpath('//div[2]/div[2]/text()')
type(more_items_to_get)
for t in more_items_to_get:
    t = re.sub("\n"," ", t)
    t = t.strip()
    
    # If it proves hard to avoid getting unwanted neighbouring (empty or other) fields,
    # we can inspect with the two lines below ...
    #print("============== NEW ITEM START")
    #print(t)
    
    # ... and then keep just the ones that match criteria that we define:
    if len(t) > 25:
        #print(t)
        thread_posts = t
        print(thread_posts+"\n")

One could continue this process and define more items to extract, depending on what is in the source html. In this example, we are happy with getting our `topic_title` and our `thread_posts`.
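
To see why the shortened xpath also picks up unwanted neighbours, here is a small sketch on a made-up html string whose nesting only loosely imitates a forum thread. The markup, the helper name `extract_posts` and the `min_len` parameter are assumptions for illustration, not part of the original data.

```python
import re
import lxml.html

# Made-up markup: one opening post and one reply, each nested as div[2]/div[2]
toy_html = """
<html><body><div id="the-loop"><div>
  <h2>A topic</h2>
  <div>
    <div>meta</div>
    <div>
      <div>author A</div>
      <div>This is the opening post of the thread.</div>
    </div>
  </div>
  <div id="reply-1">
    <div>meta</div>
    <div>
      <div>author B</div>
      <div>This is a reply that is long enough to keep.</div>
    </div>
  </div>
</div></div></body></html>
"""

def extract_posts(html_text, min_len=25):
    """Return (raw matches, cleaned matches longer than min_len characters)."""
    root = lxml.html.document_fromstring(html_text)
    raw = root.xpath('//div[2]/div[2]/text()')
    cleaned = [re.sub("\n", " ", t).strip() for t in raw]
    return raw, [t for t in cleaned if len(t) > min_len]

raw, posts = extract_posts(toy_html)
print(len(raw), "raw matches, including whitespace-only ones")
for p in posts:
    print(p)
```

The raw list is longer than the filtered one because the same xpath also matches whitespace between the nested divs of the reply, which is why the length check is needed.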

### Extract the same data from many html files

Now, let's use the strategy above on all html files in our data directory.

In [None]:
# Set up an output csv file with column headers
with open('outfile.csv','a') as f: # it is important to open the file in 'a', for append, mode
    f.write("title;post\n")

In [None]:
# Read all files from data dir
fs = glob.glob("cy_sml/*.html")

In [None]:
## Iterate over the file list
for f in fs:
    with open(f, 'r') as infile:
        tree = lxml.html.parse(infile)
        
        # Get our first item
        topic_title = tree.xpath('//*[@id="the-loop"]/div/h2/text()')
        topic_title = ''.join(topic_title)
        #print(topic_title)
        
        # Get our second item
        thread_posts = tree.xpath('//div[2]/div[2]/text()')
        for t in thread_posts:
            t = re.sub("\n"," ", t)
            t = t.strip()
            if len(t) > 25:
                thread_post = t
                #print(thread_post+"\n")
                
                # Concatenate into a row to write to the output csv file
                csv_line = topic_title + ";" + thread_post
                #print(csv_line)
                
                # Append the line to the output csv
                with open('outfile.csv','a') as outfile: # it is important to open the file in 'a', for append, mode
                    outfile.write(csv_line + "\n")
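
One caveat with building each row by hand is that a title or post containing a `;` would split into extra columns. A possible alternative, sketched here with Python's standard `csv` module and made-up rows, lets the writer quote such fields. The helper name `write_rows` and the sample data are assumptions for illustration.

```python
import csv
import io

def write_rows(handle, rows):
    """Write a header and (title, post) rows using ';' as the delimiter."""
    writer = csv.writer(handle, delimiter=";", lineterminator="\n")
    writer.writerow(["title", "post"])
    writer.writerows(rows)

# Made-up row whose post contains the delimiter
rows = [("A topic", "A post; with a semicolon in it")]

# An in-memory buffer stands in for the output file
buffer = io.StringIO()
write_rows(buffer, rows)
print(buffer.getvalue())

# Reading it back gives the original fields
buffer.seek(0)
print(list(csv.reader(buffer, delimiter=";")))
```

With a real file, one could open it once with `open('outfile.csv', 'w', newline='')` and pass that handle instead of the in-memory buffer.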
