# How to Read an XML File into a Pandas DataFrame Using Python
This notebook walks through the steps for reading specific fields from an XML file into a pandas DataFrame, using Python and the parse function from xmltodict.

In [2]:
# Install xmltodict
!pip install xmltodict

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting xmltodict
  Downloading xmltodict-0.13.0-py2.py3-none-any.whl (10.0 kB)
Installing collected packages: xmltodict
Successfully installed xmltodict-0.13.0


In [3]:
# Import relevant libraries
from collections import defaultdict
import datetime
import io
import pandas as pd
import requests
import xmltodict
import zipfile

In [4]:
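# Download a zip archive from a URL and parse the XML file stored inside it
def read_s3_xml(xml_url):
  studies_file = 'xml_files/studies.xml'  # path of the XML file inside the archive
  response = requests.get(xml_url)
  response.raise_for_status()  # stop early if the download failed
  # Open the archive in memory, without writing it to disk
  with zipfile.ZipFile(io.BytesIO(response.content)) as zf:
    with zf.open(studies_file) as f:
      # Return the 'item' entries (the bibliographic records) under 'bibdataset'
      return xmltodict.parse(f.read())['bibdataset']['item']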
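
The in-memory zip handling used by read_s3_xml can be tried offline. The following is a minimal sketch, assuming a small hand-made archive in place of the real download: the XML content and the variable names here are hypothetical, and only the member path 'xml_files/studies.xml' comes from the notebook.

```python
import io
import zipfile

# Hypothetical stand-in for the downloaded archive: build a small zip in memory.
member_name = 'xml_files/studies.xml'
xml_bytes = b'<bibdataset><item>first</item><item>second</item></bibdataset>'

buffer = io.BytesIO()
with zipfile.ZipFile(buffer, 'w') as zf:
    zf.writestr(member_name, xml_bytes)
archive_bytes = buffer.getvalue()  # plays the role of response.content

# Same pattern as read_s3_xml: wrap the bytes, open the archive, read one member.
with zipfile.ZipFile(io.BytesIO(archive_bytes)) as zf:
    names = zf.namelist()
    with zf.open(member_name) as f:
        extracted = f.read()

print(names)
print(extracted == xml_bytes)
```

Wrapping the bytes in io.BytesIO gives zipfile a file-like object, so nothing has to be written to disk before the XML is read.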

In [22]:
# Return the publication date of a record as a datetime object
def parse_pub_date(head_data):
  pub_date = head_data['source']['publicationdate']
  pub_date = datetime.datetime(int(pub_date['year']), int(pub_date['month']), int(pub_date['day']))
  return pub_date

# Collect the selected fields of every record into a defaultdict of lists (one list per DataFrame column)
def parse_xml_data(xml_file):
  data_pd = defaultdict(list)
  for data in read_s3_xml(xml_file):
    head_data = data['bibrecord']['head']
    data_pd['publication_date'].append(parse_pub_date(head_data).date())
    data_pd['title'].append(head_data['citation-title']['titletext']['#text'])
    data_pd['doi'].append(data['bibrecord']['item-info']['itemidlist'].get('ce:doi', ''))
    data_pd['abstract'].append(head_data['abstracts']['abstract']['ce:para'])
  return data_pd
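
The parsing functions above index straight into each record, so they assume every record has the same shape. xmltodict returns a plain string for a text-only element, a dict with a '#text' key when the element also carries attributes, a list when the element is repeated, and None when it is empty. The sketch below shows one possible way to tolerate those variations. The helper names text_of and parse_pub_date_lenient are hypothetical, and defaulting a missing month or day to 1 is an assumption for illustration, not something the notebook does.

```python
import datetime


def text_of(node):
    """Flatten a value shaped like xmltodict output into a plain string."""
    if node is None:
        return ''
    if isinstance(node, str):
        return node
    if isinstance(node, dict):
        # Elements with attributes keep their text under '#text'
        return text_of(node.get('#text'))
    if isinstance(node, list):
        # Repeated elements (for example several paragraphs) arrive as a list
        return '\n'.join(text_of(part) for part in node)
    return str(node)


def parse_pub_date_lenient(pub_date):
    """Build a date, defaulting a missing month or day to 1 (an assumption)."""
    return datetime.date(
        int(pub_date['year']),
        int(pub_date.get('month', 1)),
        int(pub_date.get('day', 1)),
    )


# Hand-written sample values that mimic the shapes xmltodict can produce
title = text_of({'@xml:lang': 'eng', '#text': 'A title'})
abstract = text_of(['First paragraph.', {'#text': 'Second paragraph.'}])
partial_date = parse_pub_date_lenient({'year': '2008', 'month': '03'})
print(title, abstract, partial_date, sep='\n')
```

Whether such defensive handling is needed depends on the data; for the file used in this notebook the direct lookups run as shown.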


In [23]:
# Show every column when a DataFrame is displayed by raising the column-count limit
pd.set_option('display.max_columns', 100000)
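
Note that display.max_columns only controls how many columns are shown. The width of the text inside each cell is governed by a separate option, display.max_colwidth, which matters for long fields such as abstracts. The sketch below uses a made-up DataFrame, and assumes pandas 1.0 or later, where None is accepted as "no limit".

```python
import pandas as pd

# Hypothetical one-cell DataFrame with a long text value
long_text = 'x' * 200
demo = pd.DataFrame({'abstract': [long_text]})

# With a 50-character limit, long cell text is cut off with '...'
with pd.option_context('display.max_colwidth', 50):
    short_repr = repr(demo)

# With no limit, the full cell text is shown
with pd.option_context('display.max_colwidth', None):
    full_repr = repr(demo)

print(long_text in short_repr)
print(long_text in full_repr)
```

pd.option_context restores the previous setting when the block ends; pd.set_option('display.max_colwidth', None) would apply the change for the whole session instead.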

In [25]:
# Use the parse_xml_data function to build a pandas DataFrame and preview the first 10 rows
xml_url = 'https://codility-frontend-prod.s3.amazonaws.com/media/task_static/structuring_data/static/xml_files.zip'
data_pd = parse_xml_data(xml_url)
df = pd.DataFrame(data_pd)
df.head(10)


,publication_date,title,doi,abstract
0,2008-03-09,Mechanographic characteristics of adolescents and young adults with congenital heart disease,10.1007/s00431-007-0495-y,"The present study comprised 29 adolescents and young adults (15 females, 14 males; aged 14.1-23.9 years) with congenital heart disease (CHD) and focused on the interaction between the biomechanical system and CHD. Individuals were characterized by auxological (height, weight), dynamometric (MIGF, maximal isometric grip force) and mechanograpic parameters (Vmax, maximal velocity; PJF, peak jump force; PJP, peak jump power; time of five stand-ups in chair-rising test). PJF, PJP and MIGF were transformed into height-related SD-scores. MIGF-SDS and PJP-SDS were lower in the CHD patients than in reference individuals. PJP-SDS was lower than PJF-SDS. PJP-SDS was correlated to Vmax (r=0.62) and to the time of five-stand-ups in chair-rising (r=-0.62). Transcutaneous oxygen saturation and NYHA classes were correlated to Vmax (r=0.42 and r=-0.57, respectively) and to chair-rising performance (r=-0.60 and r=0.50, respectively). To conclude, individuals with CHD are characterized by an impaired inter- and intramuscular coordination, which is characterized by a greater decrease in muscular power than muscle force."
1,2008-02-05,Treatment options in imatinib-resistant chronic myelogenous leukemia,10.1345/aph.1K303,"Objective: To discuss new therapeutic options available in the treatment of chronic myelogenous leukemia (CML) in patients who failed or were intolerant to imatinib therapy.\nSearch terms included imatinib, dasatinib, nilotinib, and chronic myelogenous leukemia.\nStudy selection and data extraction: Meeting abstracts and studies that reported preclinical and Phase 1, 2, and 3 trials published in English are included.\nData synthesis: Imatinib is the standard of care for CML; however, some patients develop resistance or are intolerant to the drug. Phase 1 and 2 clinical data for the more potent tyrosine kinase inhibitors, dasatinib and nilotinib, are promising. Hematologic and cytogenetic responses are reported with both. There does not appear to be cross-resistance between the drugs, although neither is effective against all mutations of the hallmark molecular marker, the Philadelphia chromosome. Novel agents are also being examined for the treatment of patients with CML, including aurora kinase and farnesyl transferase inhibitors, as well as combination therapies.\nConclusions: Dasatinib and nilotinib are second-line options for patients who have CML and are resistant or intolerant to imatinib. Toxicity profiles between agents may differ. Clinical trials with these drugs and others are ongoing."
2,2008-01-12,Women deliver: A global conference,10.1586/17474108.3.1.33,"A large international conference was held in London in October 2007 to celebrate 20 years of the Safe Motherhood Initiative. Launched in Nairobi, in 1987 by the WHO, the United Nations Population Fund and the World Bank, and joined by the United Nations Development Programme, the United Nations Children's Fund, the International Planned Parenthood Federation and the Population Council shortly thereafter, the initiative aimed at addressing the then neglected high maternal death rates in poor countries. Now, 20 years later, we can see that, while some progress has been made, the regions with the poorest maternal health have made the least progress: sub-Saharan Africa and South Asia. The objectives of the London conference were to take stock of the situation, to highlight the global consensus on effective strategies to reduce maternal mortality and, above all, to galvanize political will and commitment to address this scandalous situation. Although technical issues were discussed in a large number of small sessions, in panel after panel participants hammered on the need to create a global movement around maternal and newborn survival. Whether that objective was achieved will be the focus of this paper. © 2008 Future Drugs Ltd."
3,2008-03-20,Formation of circular patterns of calcium oxalate crystals at defective sites of Langmuir-Blodgett films,10.1016/j.colsurfa.2007.10.012,"The circular patterns of calcium oxalate (CaOxa) crystals were first induced at defective sites of LB film of DPPC injured by potassium oxalate. As the crystallization time extend, the pattern of the crystals changed from a ring-shape to a solid circle. In comparison, the LB films without oxalate pretreatment only induced randomly growth of crystals. It was attributed to the destruction of molecular rearrangement at the boundaries of liquid condensed (LC) and liquid expanded (LE) phases of the LB film caused by oxalate. This model system might be used to mimic formation of CaOxa stones at surface of damaged renal epithelial membranes."
4,2008-04-07,"Trends in adult post-kidney transplant immunosuppressive use in Australia, 1991-2005",10.1111/j.1440-1797.2007.00859.x,"Aim: Kidney transplant outcomes have improved over the past 15 years, partly due to improvements in immunosuppression. We used data from the Australia and New Zealand Dialysis and Transplant (ANZDATA) Registry to examine trends in immunosuppressive use post transplant. Methods: All adult (recipient age 16+ years) kidney-only transplants performed in Australia from April 1991 to December 2005 were followed to graft loss or December 2005. Immunosuppressive use at induction, 1, 3 and 5 years post transplant were analysed by transplant cohort. Results: Calcineurin-inhibitors (CNI) were used in most recipients for induction and maintenance immunosuppression, with increasing tacrolimus use. Induction cyclosporin dose increased since 2001 (from 5.8 to 7.9 mg/kg per day), but maintenance cyclosporin and tacrolimus dose decreased (from 3.8 to 3.0 mg/kg per day cyclosporin at 1 year post transplant). CNI-free induction increased since 2002 (from 1.4% to 8.4%), while CNI-free maintenance increased throughout the study period. Mycophenolates were the predominant antimetabolite used. Steroid-free maintenance decreased (from 22.7% to 8.7% at 1 year post transplant), as did median prednisolone doses (from 0.12 to 0.09 mg/kg per day at 1 year post transplant). Sirolimus or everolimus are increasingly used for CNI-sparing rather than as antimetabolites substitutes. OKT3 or antithymocyte globulin induction decreased, while anti-CD25 antibody usage increased from 9.5% to 57.1% since 2000. Conclusion: There is a trend to more potent induction immunosuppression with tacrolimus, mycophenolates and anti-CD-25 antibodies, but with CNI avoidance or minimization during maintenance phase. While steroid avoidance/cessation decreased, maintenance steroid dose has also decreased. Anti-CD25 antibodies are now used in >50% of recipients. © 2007 The Authors."
5,2008-02-13,Anion-Induced Adsorption of Ferrocenated Nanoparticles,10.1021/ja074161f,"Au nanoparticles fully coated with ω-ferrocenyl hexanethiolate ligands, with average composition Au225(ω-ferrocenyl hexanethiolate)43, exhibit a unique combination of adsorption properties on Pt electrodes. The adsorbed layer is so robust that electrodes bearing submonolayer, monolayer, and multilayer quantities of these nanoparticles can be transferred to fresh electrolyte solutions and there exhibit stable ferrocene voltammetry over long periods of time. The kinetics of forming the robustly adsorbed layer are slow; monolayer and submonolayer deposition can be described by a rate law that is first order in nanoparticle concentration and in available electrode surface. The adsorption mechanism is proposed to involve entropically enhanced (multiple) ion-pair bridges between oxidized (ferrocenium) sites and certain specifically adsorbed electrolyte anions on the electrode. Adsorption is promoted by scanning to positive potentials (through the ferrocene wave) and by high concentrations of Bu4N+X- electrolyte (X- = ClO4-, PF6-) in the CH2Cl2 solvent; there is no adsorption if X- = p-toluenesulfonate or if the electrode is coated with an alkanethiolate monolayer. The electrode double layer capacity is not appreciably diminished by the adsorbed ferrocenated nanoparticles, which are gradually desorbed by scanning to potentials more negative than the electrode's potential of zero charge. At very slow scan rates, voltammetric current peaks are symmetrical and nearly reversible, but exhibit Efwhm considerably narrower (typically 35 mV) than ideally expected (90.6 mV, at 298 K) for a one-electron transfer or for reactions of multiple, independent redox centers with identical formal potentials. 
The peak narrowing is qualitatively explicable by a surface-activity effect invoking large, attractive lateral interactions between nanoparticles and, or alternatively, by a model in which ferrocene sites react serially at formal potentials that become successively altered as ion-pair bridges are formed. At faster scan rates, both ΔEpeak and Efwhm increase in a manner consistent with a combination of uncompensated ohmic resistance of the electrolyte solution and of the adsorbed film, as distinct from behavior produced by slow electron transfer."
6,2008-02-23,"Headache in a Nonclinical Population in Dares Salaam, Tanzania. A Community-Based Study",10.1111/j.1526-4610.1995.hed3505273.x,"Headache is a common symptom that constitutes a major health problem to all countries in the world with a variable prevalence from about 20.2% in the African population to about 80% in populations of the civilized world. Community-based studies in African populations are still scanty, and the impact on health facility utilization and sickness absence from work is unknown.\nAfter a simple random selection, 1540 urban workers and students of higher education completed a standardized self-administered questionnaire on headache. A total of 815 (52%), (620 (51%) men, 195 (60%) women) admitted to having suffered a headache requiring medication or medical consultation in the last year. Of these, 366 (23.7%) had recurrent headache not attributable to systemic disease. Of the total with recurrent headache, there was a significant preponderance of women over men with sex prevalence of 28.9% and 22.4%, respectively (X 2 P = 0.0001). Combined vascular-muscular-type of headache exceeded all types of headache, accounting for 35.8% of cases, followed by migraine accounting for 30.8% of cases. Organic disease was rare, accounting for 8.5% of cases, and psychogenic causes of headache were even rarer at less than 1.2% of cases.Within 2 months of onset of recurrent headaches, over 32% of sufferers had utilized the health facility at their place of work or study. A significant number of cases (175) had an average of 11.3 lost work days per year in comparison to a control group of 154 persons with an average of 5.7 lost work days per year for reasons other than headache (X 2 P = 0.0005).\nIn summary, headache is probably rare in the African population as previously reported. However, the clinical manifestation of headache is similar to those observed in the civilized world. 
Whenever services are available, patients with headache will seek medical consultation. A significant number of days are lost from work due to severe headache in an urban population in Tanzania. This study underscores the need for early correct diagnosis and treatment of headache to reduce the number of work absences due to headache."
