Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added a function profile_gen to generate profile formatted output from any given xml. Fixes #5. #12

Closed

Conversation

tinkerbotfoo
Copy link

@tinkerbotfoo tinkerbotfoo commented Oct 28, 2016

This is fix for the issue #5

Note: This is a initial version. I also wrote a test function. The code works and the test passes.
Following the agile pattern, make it work and make it better. I have made it to work and pass tests.
I am sure there is lot of room for improvements or even a total redesign. Open to suggestions and advice.

λ python -m xmldataset
Parse using profile output :

{'title_and_author': [{'title': "XML Developer's Guide", 'author': 'Gambardella, Matthew'}, {'title': 'Midnight Rain', 'author': 'Ralls, Kim'}, {'title': 'Maeve Ascendant', 'author': 'Corets, Eva'}, {'title': "Oberon's Legacy", 'author': 'Corets, Eva'}, {'title': 'The Sundered Grail', 'author': 'Corets, Eva'}, {'title': 'Lover Birds', 'author': 'Randall, Cynthia'}, {'title': 'Splish Splash', 'author': 'Thurman, Paula'}, {'title': 'Creepy Crawlies', 'author': 'Knorr, Stefan'}, {'title': 'Paradox Lost', 'author': 'Kress, Peter'}, {'title': 'Microsoft .NET: The Programming Bible', 'author': "O'Brien, Tim"}, {'title': 'MSXML3: A Comprehensive Guide', 'author': "O'Brien, Tim"}, {'title': 'Visual Studio 7: A Comprehensive Guide', 'author': 'Galos, Mike'}]}

Profile Gen output :

catalog
        lowest
                specificbefore
                        specificvalue
                book
                        optionalexternal
                                externaldata
                        author
                        title
                        genre
                        price
                        publish_date
                        description
                specificafter
                        specificvalue

@tinkerbotfoo
Copy link
Author

To Do:

  1. Current implementation is limited to the xml nodes. It doesn't expand the attributes of the nodes into the profile. It can be done in another update.
  2. While using this my team also felt that we can extend the profile gen function and data profiling stats at each level. This will be useful to know which nodes are used quite often.

@tinkerbotfoo
Copy link
Author

Looks like python 3 build is breaking at this line (list assignment/ in place modification).
path_list[p_idx]= str(re.sub('\[.+]','',item))

Its works fine in python 2.7. I tried to resolve the issue. I haven't worked much with python 3. Appreciate if you can help ....

@tinkerbotfoo
Copy link
Author

Fixed the python 3 issue. Plz check and lemme know. I am working on the other 2 ToDo items

@tinkerbotfoo
Copy link
Author

Implemented the first todo: feature to add include_attributes flag to include the attributes in the profile output as child of the element. Also added a new test for this feature.

"number", "id" would not show up if the include_attributes flag is set to false., since they are attributes of xml elements.

plz review/test and lemme know your thoughts

catalog
    lowest
        number
        specificbefore
            specificvalue
        book
            id
            optionalexternalstart
                externaldata
            author
            title
            genre
            price
            publish_date
            description
            optionalexternalend
                externaldata
        book2
            id
            author
            title
            genre
            price
            publish_date
            description
        specificafter
            specificvalue

@spurin
Copy link
Owner

spurin commented Oct 30, 2016

I think this is great, it would be good if we can include an option on the profile_gen function for also creating the dataset stubs, i.e. title = dataset:default

Can you see if you can refactor to just use the xml.etree.ElementTree without having a separate requirement for lxml. I originally chose this module as I would like the module to work without the installation of other modules (especially where there may be compilation requirements to satisfy the dependency). When creating the Perl equivalent, this ended up causing me headaches in terms of usage.

I also already have the logic in the code for handling elements/attributes so that can be reused where applicable.

Really appreciate your efforts, thanks again. Will aim for this and your other changes for the next release.

@tinkerbotfoo
Copy link
Author

Thanks very much for your feedback and being accomodating .. I see 3 asks in this ..

  1. refactor to use just xml.etree without lxml - i concur with you. I used lxml because it was convenient for the initial version. I will work on refactor to just use xml.etree.
  2. Sry i am missing something to understand the request - Can you plz elaborate dataset stubs, i.e. title = dataset:default ..
  3. I was originally planning to refactor the class _XMLDataset ., i needed some more time to do debug sessions and fully understand your logic and integrate with it seamlessly ...

@tinkerbotfoo
Copy link
Author

Also i would welcome any insights/suggestions/your thoughts in to class _XMLDataset refactoring .., would be useful to me and appreciated ... Also plz feel free to lemme know if anything is needed to make it easy for you to integrate this functionality. You have done great job of structuring and clean code.. i have tried to be consistent with your approach where ever i was able to.. but there is much to be desired ....

@spurin spurin closed this May 30, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants