## Basic Python XML parsing using ElementTree

I'm using the latest dataset (as xml) available at https://data.gov.in/catalog/all-india-consumer-price-index-ruralurban-0

**About the dataset:**

This dataset covers the Consumer Price Index (CPI) for both the rural and urban sectors. dataset.xml gives the rural, urban, and rural+urban CPI for each month from June 2013 to May 2023, except that the May 2023 entry has no rural+urban CPI data in the original file.

**Task performed:**

Parse the XML file in Python and export separate CSV files for the original dataset, the urban-only data, the rural-only data, and the rural+urban data.

We'll be working with the Python modules ElementTree and csv. Let's import them first.

In [2]:
import xml.etree.ElementTree as ET
import csv

We can use the parse() function from the ElementTree module to parse the data file into a variable.

In [4]:
tree=ET.parse("datafile.xml")
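
As a minimal sketch of what parse() hands back, the snippet below parses a tiny made-up XML document from an in-memory file object instead of datafile.xml. The sample tags (`root`, `ROW1`, `Sector`) are assumptions for illustration, not the confirmed layout of the dataset.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical sample standing in for datafile.xml (the real tags may differ).
sample = b"<root><ROW1><Sector>Rural</Sector></ROW1></root>"

# parse() accepts a filename or a file object and returns an ElementTree.
sample_tree = ET.parse(io.BytesIO(sample))

print(type(sample_tree).__name__)   # ElementTree
print(sample_tree.getroot().tag)    # root
```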

We can use the getroot() method to load the root of the XML into a variable. The root is the outermost (main) tag enclosing the contents of an XML file. After getting the root as shown below, you can see which tag was saved to the variable by printing it with print(root.tag). Printing root.text prints the text contained within the tag, or "None" if the tag contains no text. While root is the root XML tag, root[0] refers to its 1st child tag, root[1] to its 2nd child tag, root[0][0] to the first child of root[0] (a grandchild of root), and so on.

In [6]:
root=tree.getroot()

None
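
To make the indexing concrete, here is a small sketch on a made-up document. The tag names and values are hypothetical; only the access pattern is meant to carry over to the real file.

```python
import xml.etree.ElementTree as ET

# Hypothetical two-row sample; tag names and values are invented for illustration.
sample_root = ET.fromstring(
    "<root>"
    "<ROW1><Sector>Rural</Sector><Year>2013</Year></ROW1>"
    "<ROW2><Sector>Urban</Sector><Year>2013</Year></ROW2>"
    "</root>"
)

print(sample_root.tag)         # root
print(sample_root.text)        # None, because the root holds only child tags
print(len(sample_root))        # 2 child tags
print(sample_root[0].tag)      # ROW1, the 1st child
print(sample_root[1][0].text)  # Urban, the 1st child of the 2nd child
```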


We will now open the csv files that the csv module will write to.

In [None]:
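# newline='' lets the csv module control line endings (avoids blank rows on Windows)
cpi_com = open('cpi_combined.csv', 'w', newline='')
cpi_rur = open('cpi_rural.csv', 'w', newline='')
cpi_urb = open('cpi_urban.csv', 'w', newline='')
cpi_og  = open('cpi_original.csv', 'w', newline='')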
csvbth=csv.writer(cpi_com)
csvrur=csv.writer(cpi_rur)
csvurb=csv.writer(cpi_urb)
csog=csv.writer(cpi_og)
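
An alternative worth considering is a with block, which closes the file automatically even if an error occurs midway. The sketch below writes a throwaway file in a temporary directory; the file name and the row values are made up.

```python
import csv
import os
import tempfile

# Write to a temporary directory so the sketch leaves no files behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo.csv")  # hypothetical file name

    # newline='' is what the csv docs recommend when opening files for csv.writer.
    with open(demo_path, "w", newline="") as demo_file:
        demo_writer = csv.writer(demo_file)
        demo_writer.writerow(["Sector", "Year"])   # header row
        demo_writer.writerow(["Rural", "2013"])    # made-up data row

    # The file is already closed here; read it back to check the contents.
    with open(demo_path, newline="") as demo_file:
        rows_written = list(csv.reader(demo_file))

print(rows_written)  # [['Sector', 'Year'], ['Rural', '2013']]
```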

Let's store the number of child tags of the root in r_tot, which we'll need later. On its first pass, the while loop below writes the tag names (the header) to every csv file. The other loops work the same way, using if to separate the rows by sector and writerow() to write each row to the matching csv file. We could also use findall() to get the text from the root[x][y] tags, but a loop works better here because each ROWx tag holds 30 entries and we need to fetch all of them.

In [None]:
r_tot=len(root)

rnum=0
while rnum<r_tot:  # valid child indexes are 0 .. r_tot-1
    rur_data=[]
    com_data=[]
    urb_data=[]
    og_data=[]
    for entry in root.findall('ROW'+str(rnum+1)):
        rsz=len(root[rnum])
        if rnum==0:
            i=0
            while i+1<=len(root[0]):
                a=root[rnum][i].tag
                urb_data.append(a)
                com_data.append(a)
                rur_data.append(a)
                og_data.append(a)
                i+=1
                            
            csvbth.writerow(com_data)
            csvurb.writerow(urb_data)
            csvrur.writerow(rur_data)
            csog.writerow(og_data)
                
        f=root[rnum][0].text
        
        rur_data=[]
        com_data=[]
        urb_data=[]
        og_data=[]
        
        if f=='Urban': 
            i=0
            while i+1<=len(root[rnum]):
                a=root[rnum][i].text
                urb_data.append(a)
                og_data.append(a)
                i+=1
            csvurb.writerow(urb_data)
        
        if f=='Rural':
            i=0
            while i+1<=len(root[rnum]):
                a=root[rnum][i].text
                rur_data.append(a)
                og_data.append(a)
                i+=1
            csvrur.writerow(rur_data)
        
        if f=='Rural+Urban':
            i=0
            while i+1<=len(root[rnum]):
                a=root[rnum][i].text
                com_data.append(a)
                og_data.append(a)
                i+=1
            csvbth.writerow(com_data)
        
        
        csog.writerow(og_data)                 
    rnum+=1 
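
The same split can be sketched more compactly by iterating over the child tags directly and looking up the writer by sector name. This is an alternative under stated assumptions, not the notebook's method: the sample XML, the in-memory buffers, and the names `sample_xml`, `sector_buffers`, and `sector_writers` are all made up for illustration.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical miniature of the dataset: the first child of each row is the sector.
sample_xml = (
    "<root>"
    "<ROW1><Sector>Rural</Sector><Year>2013</Year></ROW1>"
    "<ROW2><Sector>Urban</Sector><Year>2013</Year></ROW2>"
    "<ROW3><Sector>Rural+Urban</Sector><Year>2013</Year></ROW3>"
    "</root>"
)
rows = list(ET.fromstring(sample_xml))

# In-memory buffers stand in for the four csv files.
sector_buffers = {s: io.StringIO() for s in ("Rural", "Urban", "Rural+Urban")}
original_buffer = io.StringIO()
sector_writers = {s: csv.writer(b) for s, b in sector_buffers.items()}
original_writer = csv.writer(original_buffer)

# Header: the tag names of the first row, written to every output.
header = [child.tag for child in rows[0]]
original_writer.writerow(header)
for writer in sector_writers.values():
    writer.writerow(header)

for row in rows:
    values = [child.text for child in row]
    original_writer.writerow(values)          # every row goes to the full copy
    sector = row[0].text
    if sector in sector_writers:              # route the row to its sector output
        sector_writers[sector].writerow(values)

print(sector_buffers["Urban"].getvalue().splitlines())  # ['Sector,Year', 'Urban,2013']
```

Keeping the sector-to-writer mapping in a dictionary means a new sector only needs one more key, rather than another copy of the loop.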

Now close the csv files using the close() method and run the program.

In [None]:
cpi_com.close()
cpi_rur.close()
cpi_urb.close()
cpi_og.close()