# XML and Structured Files

When you want to store and work with data, plain text is not very helpful. For a start, plain text usually does not include formatting (bold, italic, etc.), and it carries no information about the role of a particular part of the text (for instance, in a judgment, the difference between the arguments of the parties, the reasoning, and the <i>dispositif</i>).

The solution here is to store your text in a file that follows a structure, according to a particular language. XML, for "Extensible Markup Language", is one such markup language. HTML is another.

Likewise, a `.docx` file, when you go into the details, is actually text with a layer of structure that gives Microsoft Word information about the formatting of that text (technically, it is a zipped bundle of .xml files). Here is an example of the difference between the two: both images show the same part of an MS Word document, except that the second shows the internal .xml structure (after a bunch of manipulations on my part to make it somewhat readable).

![](../Data/Images/img_1.png)
![](../Data/Images/img.png)

So back to .xml. In your Files, I placed a number of decisions by the Conseil d'Etat that were recently released as part of their <a href="https://opendata.conseil-etat.fr/">Open Data program</a>. They are .xml files. They are not great, but they'll do.

Let's have a look at one of these files, as it appears if you open it with a browser. You can see that the main text is divided into what we call elements. Each element includes an opening `tag` (or "balise", in French), which must be matched by a closing tag of the same name. Tags and sections cannot overlap: when you open a tag in a context, you need to close it in that context. (You can also have self-standing, one-tag elements, of the form `<tag/>`, though they are rarer.)
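To see these rules in action, here is a minimal sketch. It assumes Python's built-in `xml.etree.ElementTree` rather than lxml (to keep the example dependency-free), and the tag names and the helper `is_well_formed` are made up for illustration. A parser accepts properly nested elements, including a one-tag element, and rejects overlapping ones:

```python
import xml.etree.ElementTree as ET

# Hypothetical snippets: the first is properly nested, the second closes <p> before <b>
well_formed = "<decision><p>Some <b>bold</b> text</p><br/></decision>"
overlapping = "<decision><p>Some <b>bold</p></b></decision>"


def is_well_formed(xml_string):
    """Return True if the string parses as XML, False otherwise."""
    try:
        ET.fromstring(xml_string)
        return True
    except ET.ParseError:  # lxml raises its own etree.XMLSyntaxError instead
        return False


print(is_well_formed(well_formed))   # True
print(is_well_formed(overlapping))   # False
```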

![](../Data/Images/img_3.png)


The documents from the Conseil d'Etat don't have many of those, but normally you can specify further `attributes` for each element: these are data points that will not be seen by an ordinary reader (unless you look at the code directly), but hold further information (such as formatting, or a URL for a link) for the software, or data scientist, probing this data. A good example is the `<a></a>` element, which represents a link, and normally has an `href` attribute, which is the URL:

<code>
&lt;a href="My URL Here"&gt;My link here&lt;/a&gt;
</code>
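As a small illustration of how attributes are read in code, the sketch below parses a single link element. It assumes the built-in `xml.etree.ElementTree` module (lxml exposes the same `tag`, `text`, `attrib` and `get` for this purpose), and the variable name `link` is only for the example:

```python
import xml.etree.ElementTree as ET

link = ET.fromstring('<a href="https://opendata.conseil-etat.fr/">Open Data program</a>')

print(link.tag)          # the element's name
print(link.text)         # the text a reader would see
print(link.attrib)       # all attributes, as a dictionary
print(link.get("href"))  # a single attribute; get() returns None if it is absent
```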

You can also see, hopefully, that the information is enclosed in a hierarchical format, like a tree: you start with the <i>root</i>, and then you get branches that can have branches of their own, etc. Here everything is enclosed in a `Document` element, which is the root (the `<?xml ...?>` line at the very top of a file is a declaration rather than an element). `Document` has only four direct children, which themselves have further children.
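As a rough illustration of that tree shape, the sketch below builds a made-up miniature document (the real files have many more elements, and the nesting shown here is an assumption) and lists the root's direct children, then everything below the root. It uses the built-in `xml.etree.ElementTree`:

```python
import xml.etree.ElementTree as ET

# A made-up miniature document, for illustration only
mini_doc = ET.fromstring(
    "<Document>"
    "<Dossier><Numero_Dossier>461328</Numero_Dossier></Dossier>"
    "<Decision><Texte_Integral><p>Vu la procédure suivante :</p></Texte_Integral></Decision>"
    "</Document>"
)

# Iterating over an element yields its direct children only
children = [child.tag for child in mini_doc]

# iter() walks the whole tree, starting with the element itself, so we skip it
below_root = [node.tag for node in mini_doc.iter() if node is not mini_doc]

print(children)    # ['Dossier', 'Decision']
print(below_root)  # ['Dossier', 'Numero_Dossier', 'Decision', 'Texte_Integral', 'p']
```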

![](../Data/Images/img_2.png)

"Children" is the usual term for the elements directly below another; "descendants" also covers grandchildren and anything further down. Logically, you also have "parents" (or "ancestors") and "siblings".

The interest of storing data in a structured format is not only that you can include more than data (such as metadata), but also that, once you know the structure, you can extract data efficiently from all files that follow that format. The Conseil d'Etat decided a few years ago to release all their judgments according to that format, and code that worked to extract data from judgments back then also works for new judgments, as long as they follow the structure.

In other words, just like using a loop over the content of a list allows you to be agnostic about the data in that list, having a structure allows you to be agnostic about the data that was filled in that structure.

For instance, let's say we want to collect all dates from these decisions from the Conseil d'Etat. Instead of searching each text for a date, the .xml format is helpful: we can see that the date is enclosed in an element called `Date_Lecture`. We can just iterate over all files, and collect the dates.

The first thing to understand is that when you parse an .xml document, you need to start from the root. From there, you typically iterate over its descendants, sometimes by specifying a condition: for instance, we can look for all `<p>` elements, which represent the paragraphs. You also have various levels of iteration: over siblings, children, or ancestors. Another alternative is to go through all descendants and check if they are of the required type.
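The two approaches just described (filtering by tag, or visiting every descendant and checking its type) can be sketched side by side. This is a minimal example on a made-up fragment, using the built-in `xml.etree.ElementTree`; the `Titre` tag and the variable names are assumptions for illustration:

```python
import xml.etree.ElementTree as ET

sample_root = ET.fromstring(
    "<Document><Decision>"
    "<p>First paragraph</p><Titre>Not a paragraph</Titre><p>Second paragraph</p>"
    "</Decision></Document>"
)

# Option 1: let iter() filter the descendants by tag
filtered = [node.text for node in sample_root.iter("p")]

# Option 2: visit every descendant and test the tag yourself
checked = [node.text for node in sample_root.iter() if node.tag == "p"]

print(filtered)             # ['First paragraph', 'Second paragraph']
print(filtered == checked)  # True
```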

In [7]:
import pandas as pd
from lxml import etree  # etree, from the lxml package, is one of the main XML parsing modules in Python.
# You need to install it first: pip install lxml
import os
from datetime import datetime
from collections import defaultdict, Counter

#os.chdir("../Data/CE")  # We go to the main folder that stores all files
files = os.listdir(".") 
print(len(files))  # There are many files !

file = files[0] # Let's work on the first file to get an example

xml_file = etree.parse(file)  # We first open the .xml file with the "parse" method
root = xml_file.getroot()  # We then look for the "root" of the XML tree, and pass it to a variable root

print(root.attrib)  # You can check the attributes of every element this way
print("Text of the element: " + root.text)  # Likewise, the "text" attribute gives you the text inside an element;
# root has no text of its own (only whitespace): as you can see, everything is in the child elements instead

835
{'{http://www.w3.org/2001/XMLSchema-instance}noNamespaceSchemaLocation': 'validation-document.xsd'}
Text of the element: 
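One caveat with `.text`: for an element that contains nothing at all, it is `None` rather than an empty string, so concatenating it to a string raises a `TypeError`. A minimal sketch of a defensive pattern, on a made-up empty element and using the built-in `xml.etree.ElementTree`:

```python
import xml.etree.ElementTree as ET

empty_element = ET.fromstring("<Avocat_Requerant/>")
print(empty_element.text)  # None

# Fall back to an empty string before concatenating
safe_text = empty_element.text or ""
print("Text of the element: " + safe_text)
```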



Now, starting from the root, we can go through all its children and grandchildren. There are several ways to do this.

In [9]:
for child in root:  # The parent element also works as a list of its children element, so you can easily iterate over it immediately like this
    print(child.tag)

for paragraph in root.iter("p"):  # Though a better way to do it is with iter(); 
    # this command takes arguments that allow you to filter the descendants
    print(paragraph.text) # This will return the text of the decision, paragraph by paragraph

Donnees_Techniques
Dossier
Audience
Decision
Vu la procédure suivante :
Mme B D O'Sullivan a demandé au tribunal administratif de Lyon de condamner la chambre de commerce et d'industrie (CCI) Lyon Métropole Saint-Etienne Roanne à lui verser une somme de 90 658,24 euros en réparation des préjudices qu'elle soutient avoir subis en raison d'une insuffisance de cotisation imputable à l'établissement public, pris en sa qualité d'employeur, au régime de retraite des personnels des chambres consulaires.
Par un jugement n° 1707967 du 6 novembre 2019, le tribunal administratif de Lyon a rejeté sa demande.
Par un arrêt n° 20LY00150 du 9 décembre 2021, la cour administrative d'appel de Lyon a rejeté l'appel formé par Mme D O'Sullivan contre ce jugement.
Par un pourvoi sommaire et un mémoire complémentaire, enregistrés les 9 février et 9 mai 2022 au secrétariat du contentieux du Conseil d'Etat, Mme D O'Sullivan demande au Conseil d'Etat :
1°) d'annuler cet arrêt ;
2°) réglant l'affaire au fond, de

In [10]:
for el in root.iter(["Numero_Dossier", "Date_Lecture"]):  # The filter can also be a list of relevant element names
    print(el.text)

461328
2022-12-05


Note also that you can navigate between the elements, to jump from an element to its parent or siblings. This is very helpful if you know the tag of one element but aren't sure of what follows it, or if you want to work on several elements in a row.

In [27]:
for el in root:
    pass  # An empty loop to make sure "el" is the last child of root
print("The last child of root is: ", el.tag)

prev_el = el.getprevious() # This method gets you the previous sibling
print(prev_el.tag)
next_el = prev_el.getnext()
print(next_el.tag)
subel = root[1]  # Indexing works like on a list; getchildren() is deprecated in lxml
print("The second child from the root is: ", subel)
print("Its parent is", subel.getparent())

The last child of root is:  Decision
Audience
Decision
The second child from the root is:  <Element Dossier at 0x7fcc8cdc4900>
Its parent is <Element Document at 0x7fcc8c3cce00>


Now, coming back to our example, we want to get the date for every decision. Note that if we want to do it for one file, we just need to find the relevant element (tag = "Date_Lecture"), and extract the data from that element.

In [29]:
for el in root.iter("Date_Lecture"):  # the Date_Lecture element contains the judgment's date;
    # Easiest way in XML is to filter all descendants to get only the one we are interested in
    date = el.text
print(date)

2022-12-05
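When only one element with a given tag is expected, a loop is not strictly needed. As a hedged alternative, the sketch below uses `findtext` with a path expression, on a made-up fragment parsed with the built-in `xml.etree.ElementTree` (lxml elements offer a `findtext` method as well). Note that it returns the first match, whereas the loop above keeps the last one:

```python
import xml.etree.ElementTree as ET

sample = ET.fromstring(
    "<Document><Dossier>"
    "<Numero_Dossier>461328</Numero_Dossier><Date_Lecture>2022-12-05</Date_Lecture>"
    "</Dossier></Document>"
)

# ".//" means "anywhere below this element"
lecture_date = sample.findtext(".//Date_Lecture")

# If the tag is absent, findtext returns the default instead of raising an error
missing = sample.findtext(".//Date_Audience", default="")

print(lecture_date)   # 2022-12-05
print(repr(missing))  # ''
```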


Therefore, to obtain it from all judgments, we just need to loop over all files.

In [30]:
for file in files[:10]:  # Remember we defined files as os.listdir(".") above; here we loop only over the first 10
    xml_file = etree.parse(file)  # We open each .xml file with the "parse" method
    root = xml_file.getroot()  # And we get the root
    for el in root.iter("Date_Lecture"):  # the Date_Lecture element contains the judgment's date;
        # Easiest way in XML is to filter all descendants to get only the one we are interested in
        date = el.text
    print(date)

2022-12-05
2022-12-16
2022-12-20
2022-12-12
2022-12-27
2022-12-09
2022-12-16
2022-12-21
2022-12-09
2022-12-28
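The dates collected this way are still plain strings. The first cell imported `datetime` and `Counter`, so here is one possible sketch of what they could be used for: converting the strings to dates and counting decisions per day. The list below simply copies the ten dates printed above by hand; the variable names are assumptions for illustration.

```python
from collections import Counter
from datetime import datetime

# The ten dates printed above, copied by hand
date_strings = [
    "2022-12-05", "2022-12-16", "2022-12-20", "2022-12-12", "2022-12-27",
    "2022-12-09", "2022-12-16", "2022-12-21", "2022-12-09", "2022-12-28",
]

# The files use the ISO format year-month-day
parsed = [datetime.strptime(d, "%Y-%m-%d").date() for d in date_strings]

earliest, latest = min(parsed), max(parsed)
print(earliest, latest)  # 2022-12-05 2022-12-28

# Days on which more than one decision was read
per_day = Counter(parsed)
repeated = sorted(day.isoformat() for day, count in per_day.items() if count > 1)
print(repeated)  # ['2022-12-09', '2022-12-16']
```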


Now, if we wanted to recreate a full database of all relevant data points in each judgment, we can just use the list-of-lists method.
This method leverages the fact that a dataframe can be built from a list of sublists of equal length, with each sublist being a row (see <a href="https://www.geeksforgeeks.org/creating-pandas-dataframe-using-list-of-lists/">here</a> for more details).

In [32]:
details = ["Numero_Dossier", "Date_Lecture", "Date_Audience", "Avocat_Requerant", "Type_Decision", "Type_Recours",
"Formation_Jugement"]  # All the relevant data points/elements in our judgments
lists_details = []  # Easiest way to create a dataframe is first to have a list of lists,
# and then pass it to pd.DataFrame(lists, columns=details)

for file in files:
    newlist = []  # We create a new, empty sublist, every time we switch to a new file; 
    # that sublist will be filled with relevant data and added to main list; each sublist will have the same length
    XML = etree.parse(file)
    root = XML.getroot()

    for detail in details:  # For each file, we iterate over each type of detail, using a loop
        result = ""
        for el in root.iter(detail):  # and we use this detail to filter from all descendants in root
            result = el.text
        newlist.append(result)  # we then pass the result to the sublist created above

    lists_details.append(newlist)  # Before the loop concludes with one file and passes on to the next, 
    # we append the (filled) newlist to main list

df = pd.DataFrame(lists_details, columns = details)  # Out of the loop, we create a dataframe based on that list of lists
df.head(10)
# df.to_clipboard(index=False) # Finally, we copy the DataFrame so as to paste it (CTRL+V) in Excel

| | Numero_Dossier | Date_Lecture | Date_Audience | Avocat_Requerant | Type_Decision | Type_Recours | Formation_Jugement |
|---|---|---|---|---|---|---|---|
| 0 | 461328 | 2022-12-05 | 2022-10-26 | ROCHETEAU, UZAN-SARANO & GOULET | Décision | Plein contentieux | 7ème chambre jugeant seule |
| 1 | 463896 | 2022-12-16 | 2022-11-28 | | Décision | Plein contentieux | 9ème chambre jugeant seule |
| 2 | 469368 | 2022-12-20 | 2022-12-14 | CRUSOE | Décision | Excès de pouvoir | Juge des référés, formation collégiale |
| 3 | 465668 | 2022-12-12 | 2022-11-17 | | Décision | Rectif. d'erreur matérielle | 9ème chambre jugeant seule |
| 4 | 467773 | 2022-12-27 | | | Ordonnance | Excès de pouvoir | 5ème chambre |
| 5 | 464514 | 2022-12-09 | 2022-11-24 | SCP PIWNICA, MOLINIE | Décision | Plein contentieux | 2ème chambre jugeant seule |
| 6 | 465895 | 2022-12-16 | 2022-12-08 | | Décision | Plein contentieux | 8ème chambre jugeant seule |
| 7 | 458650 | 2022-12-21 | 2022-12-02 | | Décision | Plein contentieux | 9ème et 10ème chambres réunies |
| 8 | 461508 | 2022-12-09 | 2022-11-24 | SCP GASCHIGNARD, LOISEAU, MASSIGNON | Décision | Excès de pouvoir | 2ème chambre jugeant seule |
| 9 | 444845 | 2022-12-28 | 2022-12-07 | DELALANDE;SCP PIWNICA, MOLINIE | Décision | Excès de pouvoir | 6ème et 5ème chambres réunies |
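The inner loop above, which turns one file into one row, could also be wrapped in a small helper so that it can be tried on a single document before running it over hundreds of files. The sketch below is only one way to do it: `extract_row` is a hypothetical name, the fragment is made up (its values are borrowed from row 4 of the table, and the nesting is an assumption), and it relies on the built-in `xml.etree.ElementTree`. Unlike the loop above, `findtext` keeps the first match rather than the last.

```python
import xml.etree.ElementTree as ET


def extract_row(root_element, tags):
    """Return one value per tag, using "" when the tag is absent or empty."""
    row = []
    for tag in tags:
        value = root_element.findtext(".//" + tag, default="")
        row.append(value or "")  # guard against a None text
    return row


# A made-up fragment with no hearing date and an empty lawyer element
sample_doc = ET.fromstring(
    "<Document><Dossier>"
    "<Numero_Dossier>467773</Numero_Dossier>"
    "<Date_Lecture>2022-12-27</Date_Lecture><Avocat_Requerant/>"
    "</Dossier></Document>"
)
wanted = ["Numero_Dossier", "Date_Lecture", "Date_Audience", "Avocat_Requerant"]

sample_row = extract_row(sample_doc, wanted)
print(sample_row)  # ['467773', '2022-12-27', '', '']
```

Because every row has exactly one entry per tag, the rows always have the same length, which is what the list-of-lists method requires.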
