# How to Read an XML File into a Pandas DataFrame Using Python
This notebook walks through the steps for reading specific fields from an XML file into a pandas DataFrame, using Python and the `parse` function from xmltodict.

In [2]:
# Install xmltodict
!pip install xmltodict

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting xmltodict
  Downloading xmltodict-0.13.0-py2.py3-none-any.whl (10.0 kB)
Installing collected packages: xmltodict
Successfully installed xmltodict-0.13.0


In [3]:
# Import relevant libraries
from collections import defaultdict
import datetime
import io
import pandas as pd
import requests
import xmltodict
import zipfile

In [4]:
# Download the zip archive at xml_url and parse the studies XML file inside it into a list of records
def read_s3_xml(xml_url):
  studies_file = 'xml_files/studies.xml'
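  response = requests.get(xml_url)
  response.raise_for_status()  # stop early on an HTTP error instead of handing an error page to zipfile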
  with zipfile.ZipFile(io.BytesIO(response.content)) as zf:
    with zf.open(studies_file) as f:
      return xmltodict.parse(f.read())['bibdataset']['item']
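
The zip-handling step in `read_s3_xml` can be tried without a network connection. The sketch below builds a small archive in memory and reads one member back the same way. The archive content and the helper name `read_zip_member` are made up for illustration and are not part of the original notebook.

```python
import io
import zipfile


def read_zip_member(zip_bytes, member_name):
    """Return the raw bytes of one file stored inside a zip archive held in memory."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        with zf.open(member_name) as f:
            return f.read()


# Build a tiny archive in memory (illustrative content, not the real dataset).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("xml_files/studies.xml", "<bibdataset><item>demo</item></bibdataset>")

xml_bytes = read_zip_member(buffer.getvalue(), "xml_files/studies.xml")
print(xml_bytes.decode("utf-8"))
```

Wrapping the downloaded bytes in `io.BytesIO` is what lets `zipfile.ZipFile` treat them as a file, so nothing has to be written to disk.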

In [22]:
# Convert a record's publication date (year, month, day) into a datetime object
def parse_pub_date(head_data):
  pub_date = head_data['source']['publicationdate']
  pub_date = datetime.datetime(int(pub_date['year']), int(pub_date['month']), int(pub_date['day']))
  return pub_date
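
`parse_pub_date` assumes every record carries `year`, `month`, and `day`. If some records omit the month or day (an assumption about the data, not something the output in this notebook confirms), a more tolerant variant could fall back to 1 for the missing parts. The name `parse_pub_date_safe` and the sample records are hypothetical.

```python
import datetime


def parse_pub_date_safe(head_data):
    """Like parse_pub_date, but defaults a missing month or day to 1."""
    pub_date = head_data["source"]["publicationdate"]
    year = int(pub_date["year"])
    month = int(pub_date.get("month", 1))
    day = int(pub_date.get("day", 1))
    return datetime.datetime(year, month, day)


# Hand-written records shaped like the parsed XML; values are strings,
# because xmltodict returns element text as strings.
full = {"source": {"publicationdate": {"year": "2008", "month": "03", "day": "09"}}}
year_only = {"source": {"publicationdate": {"year": "2008"}}}

print(parse_pub_date_safe(full).date())
print(parse_pub_date_safe(year_only).date())
```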

# Collect the fields of interest from every record into a defaultdict that maps column names to lists of values
def parse_xml_data(xml_file):
  data_pd = defaultdict(list)
  for data in read_s3_xml(xml_file):
    head_data = data['bibrecord']['head']
    data_pd['publication_date'].append(parse_pub_date(head_data).date())
    data_pd['title'].append(head_data['citation-title']['titletext']['#text'])
    data_pd['doi'].append(data['bibrecord']['item-info']['itemidlist'].get('ce:doi', ''))
    data_pd['abstract'].append(head_data['abstracts']['abstract']['ce:para'])
  return data_pd
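
The direct indexing in `parse_xml_data` raises `KeyError` if a record lacks a field such as the abstract. Whether that happens in this dataset is not shown here. If it does, a small helper can walk the nested dictionaries and return a default instead. The helper name `get_nested` and the sample records below are hypothetical.

```python
from collections import defaultdict


def get_nested(record, keys, default=""):
    """Follow a sequence of keys through nested dicts; return default if any step is missing."""
    current = record
    for key in keys:
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current


# Two hand-written records; the second one has no abstract.
records = [
    {"head": {"citation-title": {"titletext": {"#text": "First title"}},
              "abstracts": {"abstract": {"ce:para": "First abstract"}}}},
    {"head": {"citation-title": {"titletext": {"#text": "Second title"}}}},
]

columns = defaultdict(list)
for record in records:
    columns["title"].append(get_nested(record, ["head", "citation-title", "titletext", "#text"]))
    columns["abstract"].append(get_nested(record, ["head", "abstracts", "abstract", "ce:para"]))

print(dict(columns))
```

Appending a default for missing fields keeps every column list the same length, which `pd.DataFrame` requires when it is built from a dict of lists.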


In [23]:
# Raise the limit on displayed columns so that no column is hidden when the DataFrame is shown
pd.set_option('display.max_columns', 100000)
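
Note that `display.max_columns` limits how many columns are shown, not how much text appears in each cell. To see a long field such as the abstract in full, pandas also provides the `display.max_colwidth` option. The sketch below applies it temporarily with `pd.option_context` on a made-up one-row frame; the variable names are illustrative.

```python
import pandas as pd

# Made-up data: one short column and one long text column.
long_text = " ".join(["word"] * 40)
demo = pd.DataFrame({"title": ["Demo"], "abstract": [long_text]})

# With the default settings, long cell text is cut short in the printed frame.
default_repr = repr(demo)

# Inside the context manager the column width limit is lifted; it is restored on exit.
with pd.option_context("display.max_colwidth", None):
    full_repr = repr(demo)

print(default_repr)
print(full_repr)
```

Using `pd.option_context` rather than `pd.set_option` keeps the change local to one cell, which avoids very wide output elsewhere in the notebook.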

In [24]:
# Use the parse_xml_data function to build a pandas DataFrame and show the first row
xml_url = 'https://codility-frontend-prod.s3.amazonaws.com/media/task_static/structuring_data/static/xml_files.zip'
data_pd = parse_xml_data(xml_url)
df = pd.DataFrame(data_pd)
df.head(1)


Unnamed: 0,publication_date,title,doi,abstract
0,2008-03-09,Mechanographic characteristics of adolescents and young adults with congenital heart disease,10.1007/s00431-007-0495-y,"The present study comprised 29 adolescents and young adults (15 females, 14 males; aged 14.1-23.9 years) with congenital heart disease (CHD) and focused on the interaction between the biomechanical system and CHD. Individuals were characterized by auxological (height, weight), dynamometric (MIGF, maximal isometric grip force) and mechanograpic parameters (Vmax, maximal velocity; PJF, peak jump force; PJP, peak jump power; time of five stand-ups in chair-rising test). PJF, PJP and MIGF were transformed into height-related SD-scores. MIGF-SDS and PJP-SDS were lower in the CHD patients than in reference individuals. PJP-SDS was lower than PJF-SDS. PJP-SDS was correlated to Vmax (r=0.62) and to the time of five-stand-ups in chair-rising (r=-0.62). Transcutaneous oxygen saturation and NYHA classes were correlated to Vmax (r=0.42 and r=-0.57, respectively) and to chair-rising performance (r=-0.60 and r=0.50, respectively). To conclude, individuals with CHD are characterized by an impaired inter- and intramuscular coordination, which is characterized by a greater decrease in muscular power than muscle force."
