# Data extraction: Texas Executed Offenders

Data sources are diverse. Many official sources publish raw data online that we can access directly with Python libraries such as [pandas](https://pandas.pydata.org/). This section walks through several ways of retrieving __HTML__ data from [Texas Executed Offenders](http://wgetsnaps.github.io/tdcj-state-tx-us--death_row/death_row/dr_executed_offenders.html). For a comprehensive guide to loading various types of data into pandas dataframes, see [IO tools](https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html).

In [1]:
# perform HTTP requests against a URL
import requests

# python library for pulling data out of HTML and XML files
# https://www.crummy.com/software/BeautifulSoup/bs4/doc/
from bs4 import BeautifulSoup

## Texas Executed Offenders

Provides data on executed offenders from the Texas Department of Criminal Justice.

Source: [TEO](http://wgetsnaps.github.io/tdcj-state-tx-us--death_row/death_row/dr_executed_offenders.html).


### The data file

- Direct link to the source data file: [http://wgetsnaps.github.io/tdcj-state-tx-us--death_row/death_row/dr_executed_offenders.html](http://wgetsnaps.github.io/tdcj-state-tx-us--death_row/death_row/dr_executed_offenders.html)

### Parsing and wrangling TEO data file
First, let's retrieve the HTML: 

In [9]:
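# load the html (the timeout keeps the request from hanging indefinitely)
source_data_url_teo = "http://wgetsnaps.github.io/tdcj-state-tx-us--death_row/death_row/dr_executed_offenders.html"
teo_file = requests.get(source_data_url_teo, timeout=30)

# stop here with an HTTPError if the server returned an error status
teo_file.raise_for_status()

# get the content as a decoded string
teo_html = teo_file.text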

HTML strings can be displayed inline using `IPython.display`:

In [3]:
from IPython.display import display_html
display_html(teo_html, raw=True)

Execution,Link,Link.1,Last Name,First Name,TDCJ Number,Age,Date,Race,County
531,Offender Information,Last Statement,Holiday,Raphael,999419,36,11/18/2015,Black,Madison
530,Offender Information,Last Statement,Escamilla,Licho,999432,33,10/14/2015,Hispanic,Dallas
529,Offender Information,Last Statement,Garcia,Juan,999360,35,10/6/2015,Hispanic,Harris
528,Offender Information,Last Statement,Lopez,Daniel,999555,27,08/12/2015,Hispanic,Nueces
527,Offender Information,Last Statement,Russeau,Gregory,999430,46,06/18/2015,Black,Smith
526,Offender Information,Last Statement,Bower,Lester,764,67,06/03/2015,White,Grayson
525,Offender Information,Last Statement,Charles,Derrick,999451,32,05/12/2015,Black,Harris
524,Offender Information,Last Statement,Garza,Manuel,999434,34,04/15/2015,Hispanic,Bexar
523,Offender Information,Last Statement,Sprouse,Kent,999471,42,04/09/2015,White,Ellis
522,Offender Information,Last Statement,Vasquez,Manuel,999336,46,03/12/2015,Hispanic,Bexar


Parse the HTML with *BeautifulSoup*, using Python's built-in [`html.parser`](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) backend, and select the table rows:

In [4]:
# parse the html
executed_doc = BeautifulSoup(teo_html, 'html.parser')

# skip first row of headers
executed_rows = executed_doc.select('table.os tr')[1:] 
executed_rows

[<tr>
 <td>531</td>
 <td><a href="dr_info/holidayraphael.html" title="Offender Information for Raphael Holiday">Offender Information</a></td>
 <td><a href="dr_info/holidayraphaellast.html" title="Last Statement of Raphael Holiday">Last Statement</a></td>
 <td>Holiday</td>
 <td>Raphael</td>
 <td>999419</td>
 <td>36</td>
 <td>11/18/2015</td>
 <td>Black</td>
 <td>Madison</td>
 </tr>, <tr>
 <td>530</td>
 <td><a href="dr_info/escamillalicho.html" title="Offender Information for Licho Escamilla">Offender Information</a></td>
 <td><a href="dr_info/escamillalicholast.html" title="Last Statement of Licho Escamilla">Last Statement</a></td>
 <td>Escamilla</td>
 <td>Licho</td>
 <td>999432</td>
 <td>33</td>
 <td>10/14/2015</td>
 <td>Hispanic</td>
 <td>Dallas</td>
 </tr>, <tr>
 <td>529</td>
 <td><a href="dr_info/garciajuan.html" title="Offender Information for Juan Garcia">Offender Information</a></td>
 <td><a href="dr_info/garciajuanlast.html" title="Last Statement of Juan Garcia">Last Statement</a
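
The selected rows are BeautifulSoup `Tag` objects, so each cell can be read individually. The following is a minimal sketch of turning rows into plain dictionaries. It runs on a small inline sample (two rows copied from the output above) rather than on the live page; the header row of the sample, the helper name `row_to_record`, and the dictionary keys are assumptions made for illustration.

```python
from bs4 import BeautifulSoup

# Inline sample: two data rows copied from the output above.
# The header row is an assumed stand-in for the real page's header.
sample_html = """
<table class="os">
<tr><th>Execution</th><th>Link</th><th>Link</th><th>Last Name</th><th>First Name</th>
<th>TDCJ Number</th><th>Age</th><th>Date</th><th>Race</th><th>County</th></tr>
<tr>
<td>531</td>
<td><a href="dr_info/holidayraphael.html">Offender Information</a></td>
<td><a href="dr_info/holidayraphaellast.html">Last Statement</a></td>
<td>Holiday</td><td>Raphael</td><td>999419</td><td>36</td>
<td>11/18/2015</td><td>Black</td><td>Madison</td>
</tr>
<tr>
<td>530</td>
<td><a href="dr_info/escamillalicho.html">Offender Information</a></td>
<td><a href="dr_info/escamillalicholast.html">Last Statement</a></td>
<td>Escamilla</td><td>Licho</td><td>999432</td><td>33</td>
<td>10/14/2015</td><td>Hispanic</td><td>Dallas</td>
</tr>
</table>
"""


def row_to_record(row):
    """Convert one <tr> Tag into a dict (hypothetical helper)."""
    cells = row.find_all("td")
    return {
        "execution": int(cells[0].get_text(strip=True)),
        # keep the link targets, which plain text extraction would lose
        "info_url": cells[1].find("a")["href"],
        "statement_url": cells[2].find("a")["href"],
        "last_name": cells[3].get_text(strip=True),
        "first_name": cells[4].get_text(strip=True),
        "date": cells[7].get_text(strip=True),
        "county": cells[9].get_text(strip=True),
    }


sample_doc = BeautifulSoup(sample_html, "html.parser")
sample_rows = sample_doc.select("table.os tr")[1:]  # skip the header row
records = [row_to_record(row) for row in sample_rows]
print(records[0])
```

Unlike a text-only extraction, this keeps the `href` of each link, which is what one would need to follow the "Offender Information" and "Last Statement" pages.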

BeautifulSoup allows you to navigate the HTML structure, for example:

In [5]:
# website title
executed_doc.title.string

'Death Row Information'

In [6]:
# text of the first <p> element
executed_doc.p.string

'Texas Department of Criminal Justice'

In [7]:
# find all <a> (link) elements
executed_doc.find_all('a')

[<a href="dr_executed_offenders.html#main_content">Skip to Main Content</a>,
 <a href="http://www.tdcj.state.tx.us/info_employees.html">Employee Resources</a>,
 <a href="http://itd.tdcj.texas.gov/TDCJ_Intranet/">TDCJ Intranet</a>,
 <a href="http://www.tdcj.state.tx.us/directory/index.html">Contact Us</a>,
 <a href="http://www.tdcj.state.tx.us/espanol/index.html" lang="es-MX" xml:lang="es-MX">Información en Español</a>,
 <a accesskey="0" href="http://www.tdcj.state.tx.us/index.html" id="TDCJ_home">Home</a>,
 <a accesskey="1" href="http://www.tdcj.state.tx.us/tab1_public.html" id="TDCJ_pr">Public Resources</a>,
 <a accesskey="2" href="http://www.tdcj.state.tx.us/tab2_emp.html" id="TDCJ_em">Employment</a>,
 <a accesskey="3" href="http://www.tdcj.state.tx.us/tab3_about.html" id="TDCJ_about">About TDCJ</a>,
 <a accesskey="4" href="http://www.tdcj.state.tx.us/tab4_online.html" id="TDCJ_os">Online Services</a>,
 <a accesskey="5" href="http://www.tdcj.state.tx.us/search.html" id="TDCJ_s">Searc
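
Many of the `href` values on this page are relative (for example `dr_info/holidayraphael.html`). A minimal sketch of resolving them against the page URL with the standard library's `urljoin`, and of narrowing the query with a CSS attribute selector, could look like this. The inline `sample_links_html` is a hand-picked subset of links shown in this notebook, not the full page.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

base_url = "http://wgetsnaps.github.io/tdcj-state-tx-us--death_row/death_row/dr_executed_offenders.html"

# Hand-picked sample of links that appear in the outputs of this notebook
sample_links_html = """
<a href="dr_executed_offenders.html#main_content">Skip to Main Content</a>
<a href="http://www.tdcj.state.tx.us/directory/index.html">Contact Us</a>
<a href="dr_info/holidayraphael.html">Offender Information</a>
"""

links_doc = BeautifulSoup(sample_links_html, "html.parser")

# (link text, absolute URL) for every <a> that has an href attribute
links = [
    (a.get_text(strip=True), urljoin(base_url, a["href"]))
    for a in links_doc.find_all("a", href=True)
]

# only the links whose href starts with "dr_info/"
offender_links = links_doc.select('a[href^="dr_info/"]')

for text, url in links:
    print(text, "->", url)
```

Absolute URLs pass through `urljoin` unchanged, so the same expression can be applied to every link without checking its form first.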

Another option is to directly load the HTML text into a pandas dataframe using [`pd.read_html`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_html.html):

In [8]:
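# import pandas to use pandas DataFrame
from io import StringIO
import pandas as pd
pd.set_option("display.max_rows", 8)

# create the dataframe
# read_html returns a list of dataframes, one per HTML table;
# recent pandas versions expect a file-like object rather than a raw HTML string
teo_df = pd.read_html(StringIO(teo_html))[0]
teo_df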

Unnamed: 0,Execution,Link,Link.1,Last Name,First Name,TDCJ Number,Age,Date,Race,County
0,531,Offender Information,Last Statement,Holiday,Raphael,999419,36,11/18/2015,Black,Madison
1,530,Offender Information,Last Statement,Escamilla,Licho,999432,33,10/14/2015,Hispanic,Dallas
2,529,Offender Information,Last Statement,Garcia,Juan,999360,35,10/6/2015,Hispanic,Harris
3,528,Offender Information,Last Statement,Lopez,Daniel,999555,27,08/12/2015,Hispanic,Nueces
...,...,...,...,...,...,...,...,...,...,...
527,4,Offender Information,Last Statement,Barefoot,Thomas,621,39,10/30/1984,White,Bell
528,3,Offender Information,Last Statement,O'Bryan,Ronald,529,39,03/31/1984,White,Harris
529,2,Offender Information,Last Statement,Autry,James,670,29,03/14/1984,White,Jefferson
530,1,Offender Information,Last Statement,"Brooks, Jr.",Charlie,592,40,12/07/1982,Black,Tarrant
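
The `Link` columns carry only the link text, and `Date` is read as plain strings. A minimal sketch of a follow-up cleaning step is shown here on a small hand-built dataframe (three rows copied from the output above, with a subset of the columns) so that it runs without network access. The names `sample_df`, `clean_df`, and `per_year` are illustrative, and applying the same steps to the full dataframe is assumed to work only if every date follows the month/day/year pattern seen here.

```python
import pandas as pd

# Three rows copied from the output above (subset of columns)
sample_df = pd.DataFrame(
    {
        "Execution": [531, 529, 1],
        "Link": ["Offender Information"] * 3,
        "Link.1": ["Last Statement"] * 3,
        "Last Name": ["Holiday", "Garcia", "Brooks, Jr."],
        "First Name": ["Raphael", "Juan", "Charlie"],
        "Age": [36, 35, 40],
        "Date": ["11/18/2015", "10/6/2015", "12/07/1982"],
        "County": ["Madison", "Harris", "Tarrant"],
    }
)

clean_df = (
    sample_df
    # the link text is identical in every row, so it carries no information
    .drop(columns=["Link", "Link.1"])
    # parse month/day/year strings into proper timestamps
    .assign(Date=lambda d: pd.to_datetime(d["Date"], format="%m/%d/%Y"))
)

# number of rows per calendar year, in chronological order
per_year = clean_df["Date"].dt.year.value_counts().sort_index()
print(clean_df.dtypes)
print(per_year)
```

With `Date` stored as a datetime column, filtering by period or grouping by year becomes a one-line operation.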
