# Introduction to Structured and Semi-Structured Text Files

**Week01, Section 01**

ISM6564 Fall 2023

&copy; 2023 Dr. Tim Smith

---

We have thus far seen text files as unstructured 'blobs' of text that we can analyze.

However, many text files are structured or semi-structured. This means that they have a specific format that we can use to extract information from them.

Some examples of structured text files include:
- CSV files
- JSON files
- XML files

Some examples of semi-structured text files include:
- HTML files
- PDF files

The objectives of this section are to:
- Understand the structure of CSV, JSON, and XML files
- Learn how to read CSV, JSON, and XML files into Python
- Learn how to extract information from CSV, JSON, and XML files
- Load PDF and HTML files into Python (we will cover information extraction from HTML in a later section)

In [1]:
import json
import random

## Structured files


### JSON

JSON (JavaScript Object Notation) is a very popular format for storing structured data. It is text-based, so it is human-readable, and it is easy to parse, so it is also machine-readable. It is flexible enough to represent a wide variety of data structures (nested objects, lists, strings, numbers, booleans, and null), and it is supported by a wide variety of programming languages and tools.

In [2]:
# create a sample dictionary of data to use for this example

data = {'name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'],
        'year': [2012, 2012, 2013, 2014, 2014],
        'reports': [4, 24, 31, 2, 3]}

In [3]:
# save this data structure as a JSON file

import json
import os

os.makedirs('data', exist_ok=True)  # make sure the output folder exists

with open('data/data.json', 'w') as outfile:
    json.dump(data, outfile)

We can open the JSON file we just created in VS Code. By default, json.dump writes the whole structure on one long line. To have VS Code format it, right-click inside the open file and select "Format Document", or use the keyboard shortcut Shift+Alt+F on Windows or Shift+Option+F on macOS.
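
If the file is meant to be read by people, one option is to indent it at write time instead of reformatting it in the editor. The following is a minimal sketch using the indent parameter of json.dumps (json.dump accepts the same parameter); the variable names compact and pretty are only illustrative.

```python
import json

# Same sample data as in the cell above
data = {'name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'],
        'year': [2012, 2012, 2013, 2014, 2014],
        'reports': [4, 24, 31, 2, 3]}

# Default serialization: everything on a single line
compact = json.dumps(data)

# indent=4 puts each element on its own line, indented by nesting level
pretty = json.dumps(data, indent=4)

print(pretty)
```

Both strings decode back to the same dictionary; only the whitespace differs.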

In [4]:
# read the json file

with open('data/data.json') as json_file:
    data = json.load(json_file)

In [5]:
# explore the data variable created from the file

# print the data variable
print(data)

# print the type of the data variable
print(type(data))

# print the keys of the data variable
print(data.keys())
print(list(data.keys()))

# print the values of the data variable
print(data.values())
print(list(data.values()))

{'name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'], 'year': [2012, 2012, 2013, 2014, 2014], 'reports': [4, 24, 31, 2, 3]}
<class 'dict'>
dict_keys(['name', 'year', 'reports'])
['name', 'year', 'reports']
dict_values([['Jason', 'Molly', 'Tina', 'Jake', 'Amy'], [2012, 2012, 2013, 2014, 2014], [4, 24, 31, 2, 3]])
[['Jason', 'Molly', 'Tina', 'Jake', 'Amy'], [2012, 2012, 2013, 2014, 2014], [4, 24, 31, 2, 3]]
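
Once the JSON is loaded it is an ordinary Python dictionary, so extracting information is plain Python. As a small sketch, the code below turns the column-oriented dictionary into one record per person and then filters it. The names records and total_2014 are illustrative, and the dictionary is repeated here so the example runs on its own.

```python
# Sample data with the same shape as the dictionary loaded from data.json
data = {'name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'],
        'year': [2012, 2012, 2013, 2014, 2014],
        'reports': [4, 24, 31, 2, 3]}

# zip(*data.values()) walks the three lists in parallel, giving one tuple
# per person; pairing each tuple with the keys gives one dict per person
records = [dict(zip(data.keys(), row)) for row in zip(*data.values())]

# Keep only the records from 2014 and add up their reports
reports_2014 = [r['reports'] for r in records if r['year'] == 2014]
total_2014 = sum(reports_2014)

print(records[0])
print(total_2014)
```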


### XML

XML is a markup language that is used to store structured data. It is a bit like HTML, but it is not used to display data in a browser. Instead, it is used to store data in a structured way. XML is used in many different applications, and it is also used to store linguistic data.

To demonstrate the use of XML in Python, we will create a sample dictionary, then convert it to XML and save it to a data file.

In [6]:
# note: the following lines work around an incompatibility between recent
# versions of Python (3.10+) and the dicttoxml library
import collections
import collections.abc
collections.Iterable = collections.abc.Iterable

import dicttoxml
from xml.dom.minidom import parseString

xml = dicttoxml.dicttoxml(data)  # we will reuse the data structure we created in the previous example

xml = parseString(xml).toprettyxml()  # parse the XML bytes and re-serialize them as an indented string

with open('data/data.xml', 'w') as outfile:
    outfile.write(xml)

Now, we will review the process of reading the XML file into Python.

In [7]:
# read the xml file
import xmltodict
import json

with open('data/data.xml') as f:
    doc = xmltodict.parse(f.read())
    
type(doc)

dict
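
To pull individual values out of XML, the standard library's xml.etree.ElementTree is another option. The sketch below parses an inline XML string rather than the saved file so that it runs on its own. The string is an assumption about the general shape of the file written above (a root element with one child per dictionary key and one item element per list entry), so check it against your own data.xml.

```python
import xml.etree.ElementTree as ET

# Assumed shape of the XML, shortened to two entries per list
xml_text = """<?xml version="1.0" ?>
<root>
    <name type="list">
        <item type="str">Jason</item>
        <item type="str">Molly</item>
    </name>
    <year type="list">
        <item type="int">2012</item>
        <item type="int">2012</item>
    </year>
</root>"""

root = ET.fromstring(xml_text)

# Iterating over an element yields its direct children
names = [item.text for item in root.find('name')]

# Element text is always a string, so numbers must be converted explicitly
years = [int(item.text) for item in root.findall('./year/item')]

print(names)
print(years)
```

Unlike JSON, XML has no built-in number type, which is why the year values need the int() conversion.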

### CSV

CSV stands for comma-separated values. It is a very common format for storing data. Each line in a CSV file represents a row, and each column of data is separated by a comma. The first line of a CSV file is often a header row that contains the names of the columns.
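
To make the row and header structure concrete, here is a minimal sketch that reads a small CSV with the standard library's csv module. The CSV text is an inline sample, not a file from this course, and io.StringIO is used only so the example needs no file on disk.

```python
import csv
import io

# A tiny CSV: the first line is the header, every other line is a row
csv_text = "name,year,reports\nJason,2012,4\nMolly,2012,24\n"

# DictReader uses the header row as the keys for every following row
reader = csv.DictReader(io.StringIO(csv_text))
rows = list(reader)

print(reader.fieldnames)
print(rows[0])
```

Note that every value comes back as a string; CSV itself carries no type information, so numeric columns must be converted after reading.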

In [8]:
# save the data dictionary to a CSV file

# Though there are many ways we can save a dictionary to a csv file, let's use pandas

import pandas as pd

# we will reuse the data structure we created in the first example;
# each dictionary key becomes a column, so the keys end up in the CSV header row
df = pd.DataFrame.from_dict(data)
df.to_csv('./data/data.csv', header=True, index=False)
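
As one of the other ways to save a dictionary to CSV, the sketch below uses only the standard library's csv module. It writes to a temporary folder (an assumption made so the example does not touch the course's data folder) and reads the file back to show what ends up on disk.

```python
import csv
import os
import tempfile

data = {'name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'],
        'year': [2012, 2012, 2013, 2014, 2014],
        'reports': [4, 24, 31, 2, 3]}

# Temporary location, used here only to keep the example self-contained
path = os.path.join(tempfile.mkdtemp(), 'data.csv')

# newline='' lets the csv module control line endings itself
with open(path, 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(data.keys())            # header row: the dictionary keys
    writer.writerows(zip(*data.values()))   # one row per person

# Read the raw text back to inspect the result
with open(path, newline='') as f:
    lines = f.read().splitlines()

print(lines[0])
print(lines[1])
```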

## Semi-structured files

### HTML

In [9]:
# request a web page from www.tampa.gov

import requests

url = 'https://www.tampa.gov/'

r = requests.get(url)

# print the first 500 characters of the HTML
print(r.text[:500])



<!DOCTYPE html>
<html lang="en" dir="ltr" prefix="content: http://purl.org/rss/1.0/modules/content/  dc: http://purl.org/dc/terms/  foaf: http://xmlns.com/foaf/0.1/  og: http://ogp.me/ns#  rdfs: http://www.w3.org/2000/01/rdf-schema#  schema: http://schema.org/  sioc: http://rdfs.org/sioc/ns#  sioct: http://rdfs.org/sioc/types#  skos: http://www.w3.org/2004/02/skos/core#  xsd: http://www.w3.org/2001/XMLSchema# ">
  <head>
    <meta charset="utf-8" />
<script>window.dataLayer = window.dataLayer ||


In [10]:
from bs4 import BeautifulSoup as bs

soup = bs(r.text, 'html.parser')  # parse the HTML; naming the parser avoids a warning
prettyHTML = soup.prettify()      # re-indent the HTML so it is easier to read
print(prettyHTML[5000:10000])

Hb},ssl:void 0,obfuscate:void 0,jserrors:{enabled:!0,harvestTimeSeconds:10},metrics:{enabled:!0},page_action:{enabled:!0,harvestTimeSeconds:30},page_view_event:{enabled:!0},page_view_timing:{enabled:!0,harvestTimeSeconds:30,long_task:!1},session_trace:{enabled:!0,harvestTimeSeconds:10},harvest:{tooManyRequestsDelay:60},session_replay:{enabled:!1,harvestTimeSeconds:60,sampleRate:.1,errorSampleRate:.1,maskTextSelector:"*",maskAllInputs:!0,get blockClass(){return"nr-block"},get ignoreClass(){return"nr-ignore"},get maskTextClass(){return"nr-mask"},get blockSelector(){return e.blockSelector},set blockSelector(t){e.blockSelector+=",".concat(t)},get maskInputOptions(){return e.maskInputOptions},set maskInputOptions(t){e.maskInputOptions={...t,password:!0}}},spa:{enabled:!0,harvestTimeSeconds:10}}},l={};function f(e){if(!e)throw new Error("All configuration objects require an agent identifier!");if(!l[e])throw new Error("Configuration for ".concat(e," was never set"));return l[e]}function g(e,

In [11]:
nav_links = soup.find_all("a", {"class": "nav-link"})  # find all <a> tags with class nav-link

nav_links

[<a class="nav-link" href="/accessibility/website"><i class="fas fa-universal-access"></i> <span aria-label="Title" class="nav-link-text">Access<span class="optional">ibility</span></span></a>,
 <a class="nav-link" href="/news"><i class="far fa-newspaper"></i> <span aria-label="Title" class="nav-link-text">News<span class="optional">room</span></span></a>,
 <a aria-expanded="false" aria-haspopup="true" class="dropdown-toggle nav-link" data-toggle="dropdown" href="#" role="button">Guides</a>,
 <a aria-expanded="false" aria-haspopup="true" class="dropdown-toggle nav-link" data-toggle="dropdown" href="#" role="button">Businesses</a>,
 <a aria-expanded="false" aria-haspopup="true" class="dropdown-toggle nav-link" data-toggle="dropdown" href="#" role="button">Recreation</a>,
 <a aria-expanded="false" aria-haspopup="true" class="dropdown-toggle nav-link" data-toggle="dropdown" href="#" role="button">Residents</a>,
 <a aria-expanded="false" aria-haspopup="true" class="dropdown-toggle nav-link

NOTE: Though it's good to familiarize yourself with how you can go about extracting information from a website, we will not be covering this in detail in this course. If you are interested in learning more, there are plenty of online resources on extracting information from websites. Another popular Python library for this is Scrapy. You can find more information about it here: https://scrapy.org/
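
For a sense of what such extraction involves without any third-party library, here is a rough sketch using the standard library's html.parser. It collects the href of every link whose class list includes nav-link. The class name NavLinkParser and the HTML snippet are made up for illustration; they are not taken from the page requested above.

```python
from html.parser import HTMLParser


class NavLinkParser(HTMLParser):
    """Collect the href of each <a> tag that has the class nav-link."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # The class attribute may hold several space-separated names
        classes = (attrs.get('class') or '').split()
        if tag == 'a' and 'nav-link' in classes:
            self.links.append(attrs.get('href'))


# Made-up HTML for illustration
html = '''<ul>
<li><a class="nav-link" href="/news">News</a></li>
<li><a class="dropdown-toggle nav-link" href="#">Guides</a></li>
<li><a class="footer-link" href="/contact">Contact</a></li>
</ul>'''

parser = NavLinkParser()
parser.feed(html)
print(parser.links)
```

Libraries such as BeautifulSoup wrap this kind of tag-by-tag processing in a more convenient search interface.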

### PDF

PDF files are a bit more complicated to work with. There are many Python libraries for working with PDFs. In this section, we will use two of them: fpdf and PyPDF2.

fpdf is a library that allows us to create PDF files from scratch. PyPDF2 is a library that allows us to read PDF files.

#### Writing to PDF

In [12]:
import fpdf  # pip3 install fpdf # there are many pdf libraries in python. This is an old one, but good for writing.

pdf = fpdf.FPDF(format='letter')  # page format
pdf.add_page()                    # create new page
pdf.set_font("Arial", size=12)    # font and text size
# ln=1 moves the cursor to the beginning of the next line after each cell
pdf.cell(200, 10, txt="Hello World", ln=1, align="L")
pdf.cell(200, 10, txt="Welcome", ln=1, align="L")
pdf.cell(200, 10, txt="to ISM6564", ln=1, align="L")
pdf.output("./data/test.pdf")

''

#### Reading from PDF

In [13]:
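import PyPDF2

# open the file in binary mode; the with block closes it when we are done
with open('./data/test.pdf', mode='rb') as sample_pdf:
    pdfdoc = PyPDF2.PdfReader(sample_pdf)

    print(pdfdoc.metadata)                 # document metadata
    print("----")
    print(len(pdfdoc.pages))               # number of pages
    print("----")
    print(pdfdoc.pages[0].extract_text())  # text of the first page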

{'/Producer': 'PyFPDF 1.7.2 http://pyfpdf.googlecode.com/', '/CreationDate': 'D:20230826130925'}
----
1
----
Hello World
Welcome
to ISM6564
