<a href="https://colab.research.google.com/github/singhmansi25/Mansi-Singh_-Data-Engineer/blob/main/Mansi_Singh_DataEngineer.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Brief:

1. The requirement needs to be developed in Python 3
2. Code should follow pep8 standards and should include pydoc, logging & unit tests
3. Please provide github link for review.

## Requirement:

1. Download the xml from this link
2. From the xml, please parse through to the first download link whose file_type is DLTINS and download the zip
3. Extract the xml from the zip.
4. Convert the contents of the xml into a CSV with the following header:
 - FinInstrmGnlAttrbts.Id
 - FinInstrmGnlAttrbts.FullNm
 - FinInstrmGnlAttrbts.ClssfctnTp
 - FinInstrmGnlAttrbts.CmmdtyDerivInd
 - FinInstrmGnlAttrbts.NtnlCcy
 - Issr
5. Store the csv from step 4) in an AWS S3 bucket
6. The above function should be run as an AWS Lambda (Optional)

## Assessment criteria:

1. Percentage of requirements satisfied
2. How clean the code is - in particular simplicity, adhering to python code style conventions and error handling.
3. Follows PEP 8 guidelines
4. We expect pydoc for each class and function with optional type hints(nice to have)
5. Follows standard logging (no print statements). Logs are essential part of troubleshooting application.
6. Unit tests with good code coverage

## Importing Libraries

In [1]:
!pip install boto3

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [2]:
import xml.etree.ElementTree as ET              ## module to parse and manipulate XML data in Python
import unittest                                 ## module for writing and running tests in Python
import logging                                  ## module to create log messages to help with debugging and troubleshooting
import pydoc                                    ## module generates documentation from Python code
import urllib.request                           ## module for opening URLs
from urllib.request import urlopen as uReq      ## function from urllib.request that opens URLs
from bs4 import BeautifulSoup as soup           ## module to parse HTML and XML documents
import zipfile                                  ## module to work with zip files
import csv                                      ## modeule for reading and writing CSV files
import pandas as pd                             ## module for storing and manipulating datasets
import boto3                                    ## module for using AWS S3 
import os                                       ## os module
import io                                       ## io module
import botocore.session                         ## provides a session object to interact with AWS services 
from boto3.session import Session               ## boto3

# Define Class XML-CSV-Converter

In [3]:
class Xml_Csv_Converter(unittest.TestCase):
  '''A module to extract zip file from url. Download XML file from ZIP file. Extract features from XML and convert it to CSV. Upload CSV to AWS S3 bucket'''

  def __init__(self, url):
    '''Constructor to initialize url and logging'''
    self.url = url
    # Create and configure logger
    logging.basicConfig(filename="log_file.log", format='%(asctime)s %(message)s', filemode='w')

    # Creating a logging object
    self.logger = logging.getLogger()
    
    # Setting the threshold of logger to DEBUG
    self.logger.setLevel(logging.DEBUG)


  def download_zip(self): 
    ''' Step-1:  Download the zip file from url link. '''
    ## Read web page from url
    web_page= uReq(self.url)
    page = web_page.read()

    ## XML parser to parse file using beautifulsoup
    self.page_soup= soup(page, 'xml')
    self.logger.debug("\nExtracted page from given url")


  def get_file(self, n):
    ''' Step-2:  From the xml, please parse through to the first download link whose file_type is DLTINS and download the zip. '''
    ## find all the files with type as 'DLTINS'
    filetyp_list = self.page_soup.findAll('str',{'name':'file_type'})

    ## get the first file from all the files
    print("First_file:", filetyp_list[n])

    ## find all the download links
    link_list = self.page_soup.findAll('str',{'name':'download_link'})

    ## get the first link from all the links
    print("First_link:", link_list[n])

    ## get file type from file
    filetyp = filetyp_list[n].get_text()
    print("\n file:", filetyp)

    ## get link from the first link
    self.link = link_list[n].get_text()
    print("\n link:", self.link)
    ## Write Logs
    self.logger.debug("\nDownloaded file")

  def extractXML(self):
    ''' Step-3: Extract the xml from the zip. '''
    ## get file name from link
    data_file = self.link.split('/')[-1]

    ## download the zip file from the URL
    zip_ = urllib.request.urlretrieve(self.link, data_file)

    ## extract zip file from zip_
    zip_file = zipfile.ZipFile(data_file, 'r')

    ## create file xml from zip file 
    file = data_file.strip('.zip')+'.xml'

    ## read xml file from zip file
    xml_file = zip_file.open(file)
    xml_data = xml_file.read()

    ## store xml data in xml_data file
    with open('xml_data.xml', 'wb') as f:
      f.write(xml_data)

    self.xml_path = '/content/xml_data.xml'
    ## Write Logs
    self.logger.debug("\nRead XML file")
  
  def parseXML(self):
    ''' Step-4:  Convert the contents of the xml into a CSV with the following header:
                 FinInstrmGnlAttrbts.Id
                 FinInstrmGnlAttrbts.FullNm
                 FinInstrmGnlAttrbts.ClssfctnTp
                 FinInstrmGnlAttrbts.CmmdtyDerivInd
                 FinInstrmGnlAttrbts.NtnlCcy
                 Issr '''
    ## Parse the XML file
    xml_doc = ET.parse(self.xml_path)
    root = xml_doc.getroot()

    # Find all namespace prefixes used in the XML document
    ns_prefixes = []
    for elem in xml_doc.iter():
        if elem.tag.startswith('{'):
            ns_prefix = elem.tag.split('}')[0][1:]
            if ns_prefix not in ns_prefixes:
                ns_prefixes.append(ns_prefix)

    ## Define namespaces used in the XML file
    namespaces = {
        'head': ns_prefixes[0],
        'app':  ns_prefixes[1],
        'auth': ns_prefixes[2]
    }

    ## Extract the required data from the XML file
    data = []

    ## Find all the data for corresponding columns
    for fininstrm in root.findall('.//auth:FinInstrm', namespaces):
        '''FinInstrmGnlAttrbts.Id'''
        id = fininstrm.find('.//auth:Id', namespaces).text

        '''FinInstrmGnlAttrbts.FullNm'''
        fullnm = fininstrm.find('.//auth:FullNm', namespaces).text
        
        '''FinInstrmGnlAttrbts.ClssfctnTp'''
        clssfctnTp = fininstrm.find('.//auth:ClssfctnTp', namespaces).text
        
        '''FinInstrmGnlAttrbts.CmmdtyDerivInd'''
        cmmdtyDerivInd = fininstrm.find('.//auth:CmmdtyDerivInd', namespaces).text
        
        '''FinInstrmGnlAttrbts.NtnlCcy'''
        ntnlCcy = fininstrm.find('.//auth:NtnlCcy', namespaces).text
        
        '''FinInstrmGnlAttrbts.Issr'''
        issr = fininstrm.find('.//auth:Issr', namespaces).text
        
        '''create list of rows for the columns'''
        row = [id, fullnm, clssfctnTp, ntnlCcy, cmmdtyDerivInd, issr]
        data.append(row)

    ## Write the extracted data to a CSV file
    with open('xml_dataset.csv', 'w', newline='') as file:
        writer = csv.writer(file)
        writer.writerow(['Id', 'FullNm', 'ClssfctnTp', 'CmmdtyDerivInd', 'NtnlCcy', 'Issr'])
        writer.writerows(data)
    ## Write Logs
    self.logger.debug("\nWrite data from XML to CSV file")

  def readCSV(self):
    ''' Read csv dataset using pandas and inspect first 5 rows. '''
    ## Read csv using pd
    self.xml_dt = pd.read_csv('/content/xml_dataset.csv')
    print(self.xml_dt.head())
    ## Write Logs
    self.logger.debug("\nRead CSV file")

  def aws_bucket(self, bucket_name, key):
    ''' Step-5:  Store the csv from Step-4 in an AWS S3 bucket '''
    # set the environment variables for AWS access keys
    os.environ['AWS_ACCESS_KEY_ID'] = 'AKIAS4QDNRDD6D4FPTMX'
    os.environ['AWS_SECRET_ACCESS_KEY'] = '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'
    
    # create boto session and AWS access keys
    session = botocore.session.get_session()
    AWS_ACCESS_KEY_ID = session.get_credentials().access_key
    AWS_SECRET_ACCESS_KEY = session.get_credentials().secret_key

    # create an S3 client with AWS access keys
    s3 = boto3.client("s3", aws_access_key_id=AWS_ACCESS_KEY_ID, aws_secret_access_key=AWS_SECRET_ACCESS_KEY)
    
    # read the CSV file from disk and create a buffer
    with open('xml_dataset.csv', newline='') as csvfile:
      csv_buffer = io.StringIO(csvfile.read())
      # upload the buffer to S3
      s3.put_object(Body=csv_buffer.getvalue(), Bucket=bucket_name, Key=key)
    
    # create boto session and S3 client with AWS access keys to upload CSV file
    session = Session(aws_access_key_id=AWS_ACCESS_KEY_ID,aws_secret_access_key=AWS_SECRET_ACCESS_KEY)
    s3 = session.resource('s3')
    my_bucket = s3.Bucket('mansi-pythonengineer-project')

    for s3_file in my_bucket.objects.all():
      print('File uploaded to S3 in AWS: ',s3_file.key)

    ## Write logs
    self.logger.debug("\nUploaded CSV to AWS S3 bucket")


# Call class from main to execute tasks 

In [4]:
if __name__ == '__main__':
    unittest.main(exit=False)
    # given url
    url='https://registers.esma.europa.eu/solr/esma_registers_firds_files/select?q=*&fq=publication_date:%5B2021-01-17T00:00:00Z+TO+2021-01-19T23:59:59Z%5D&wt=xml&indent=true&start=0&rows=100'
    # create class object
    obj = Xml_Csv_Converter(url)
    # call object methods - download_zip()
    obj.download_zip()
    # enter value of n for zip file, here we will use n=0 for first file
    n=int(input("enter file number of zip: 0/1/2/3: "))
    obj.get_file(n)
    obj.extractXML()
    obj.parseXML()
    obj.readCSV()
    # set the name of the S3 bucket and key
    bucket_name = 'mansi-pythonengineer-project'
    key = 'xml-file-dataset.csv'
    # call function aws_bucket
    obj.aws_bucket(bucket_name, key)

E
ERROR: /root/ (unittest.loader._FailedTest)
----------------------------------------------------------------------
AttributeError: module '__main__' has no attribute '/root/'

----------------------------------------------------------------------
Ran 1 test in 0.003s

FAILED (errors=1)
DEBUG:root:
Extracted page from given url


enter file number of zip: 0/1/2/3: 0


DEBUG:root:
Downloaded file


First_file: <str name="file_type">DLTINS</str>
First_link: <str name="download_link">http://firds.esma.europa.eu/firds/DLTINS_20210117_01of01.zip</str>

 file: DLTINS

 link: http://firds.esma.europa.eu/firds/DLTINS_20210117_01of01.zip


DEBUG:root:
Read XML file
DEBUG:root:
Write data from XML to CSV file
DEBUG:root:
Read CSV file
DEBUG:botocore.hooks:Changing event name from creating-client-class.iot-data to creating-client-class.iot-data-plane
DEBUG:botocore.hooks:Changing event name from before-call.apigateway to before-call.api-gateway
DEBUG:botocore.hooks:Changing event name from request-created.machinelearning.Predict to request-created.machine-learning.Predict
DEBUG:botocore.hooks:Changing event name from before-parameter-build.autoscaling.CreateLaunchConfiguration to before-parameter-build.auto-scaling.CreateLaunchConfiguration
DEBUG:botocore.hooks:Changing event name from before-parameter-build.route53 to before-parameter-build.route-53
DEBUG:botocore.hooks:Changing event name from request-created.cloudsearchdomain.Search to request-created.cloudsearch-domain.Search
DEBUG:botocore.hooks:Changing event name from docs.*.autoscaling.CreateLaunchConfiguration.complete-section to docs.*.auto-scaling.CreateLaunchCo

             Id                                             FullNm ClssfctnTp  \
0  DE000A1R07V3    Kreditanst.f.Wiederaufbau     Anl.v.2014 (2021)     DBFTFB   
1  DE000A1R07V3                                 KFW 1 5/8 01/15/21     DBFTFB   
2  DE000A1R07V3        Kreditanst.f.Wiederaufbau Anl.v.2014 (2021)     DBFTFB   
3  DE000A1R07V3        Kreditanst.f.Wiederaufbau Anl.v.2014 (2021)     DBFTFB   
4  DE000A1X3J56  IKB Deutsche Industriebank AG Stufenz.MTN-IHS ...     DTVUFB   

  CmmdtyDerivInd  NtnlCcy                  Issr  
0            EUR    False  549300GDPG70E3MBBU98  
1            EUR    False  549300GDPG70E3MBBU98  
2            EUR    False  549300GDPG70E3MBBU98  
3            EUR    False  549300GDPG70E3MBBU98  
4            EUR    False  PWEFG14QWWESISQ84C69  


DEBUG:botocore.hooks:Event creating-client-class.s3: calling handler <function add_generate_presigned_url at 0x7f900850d940>
DEBUG:botocore.endpoint:Setting s3 timeout as (60, 60)
DEBUG:botocore.loaders:Loading JSON file: /usr/local/lib/python3.9/dist-packages/botocore/data/_retry.json
DEBUG:botocore.client:Registering retry handlers for service: s3
DEBUG:botocore.utils:Registering S3 region redirector handler
DEBUG:botocore.hooks:Event before-endpoint-resolution.s3: calling handler <function customize_endpoint_resolver_builtins at 0x7f90084b0940>
DEBUG:botocore.hooks:Event before-endpoint-resolution.s3: calling handler <bound method S3RegionRedirectorv2.redirect_from_cache of <botocore.utils.S3RegionRedirectorv2 object at 0x7f900764dac0>>
DEBUG:botocore.regions:Calling endpoint provider with parameters: {'Bucket': 'mansi-pythonengineer-project', 'Region': 'us-east-1', 'UseFIPS': False, 'UseDualStack': False, 'ForcePathStyle': False, 'Accelerate': False, 'UseGlobalEndpoint': True, 'Dis

File uploaded to S3 in AWS:  xml-file-dataset.csv


Name: Mansi Singh

Lovely Professional University