Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Wikiprojects parser sanity tests #1

Merged
merged 5 commits into from
Sep 10, 2017
Merged

Conversation

codez266
Copy link
Contributor

@codez266 codez266 commented Sep 6, 2017

  • Add tests for methods used by WikiProjects parser
  • Add testfiles/ with sample wikitext containing WikiProjects for comparison in tests
  • Modify parser from iterative to recursive
  • Minor improvements to regexes

* Add tests for methods used by WikiProjects parser
* Add testfiles/ with sample wikitext containing WikiProjects for parsing
* Modify parser from iterative to recursive
* Minor improvements to regex
Remove session as parameter to functions

excluded_keys = ['topics', 'root_url', 'index', 'name', 'url']

def test_WikiProjectsFromTable():
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed.

@@ -61,17 +61,13 @@ def main(argv=None):
r'\[\[Wikipedia:WikiProject Council/Directory/([A-Za-z_, ]+)\|([A-Za-z_, ]+)\]\]='
WPListingRegex =\
r'See the full listing \[\[Wikipedia:WikiProject Council/Directory/([A-Za-z_,/ ]+)'
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"Wikipedia:WikiProject Council/Directory" should be a parameter or a static variable

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed. Only left to figure out how to parameterize the complex regex - wp_section_regex

wp[sec['line']]['url'])


subPageSections = self.getSections(wp[sec['line']]['url'])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Variable names should be snake case (lower_case_with_underscores) https://www.python.org/dev/peps/pep-0008/#descriptive-naming-styles

prevTopic = None
self.logger.info("Index:{}, Level:{}".format(index, level))
idx = index
while idx < len(sections):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would for section in sections be better here?

You can also do something like:

for i, section in enumerate(sections):
    ...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would've have done that but the recursion inherently depends on fast-forwarding the loop variable over parsed sections while exiting to avoid repeated calls to parse sections.

wpTopicsCompare(actualTopics, topics)

def test_getWikiProjectsFromSectionIntroText():
parser = WikiProjectsParser(WPDPage)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This parser is actually making an HTTP connection. That means the test could fail if you don't have a connection to the internet. It's better to download the page content and include the data in the test so that an internet connection is not necessary.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only the test - test_get_sub_categories makes api calls while all others use offline data.
Even the above test has been made to cache data offline in case tests are repeated. However, I'll still think about how can we mock api calls, as the method tested by this test makes repeated api calls for sections rather than fetching a page in a single call.

@halfak
Copy link
Member

halfak commented Sep 6, 2017

Generally, I'm concerned that the WikiProjectsParser constructs a session in it's init and that the host/URL for the WikiProject page is hard coded in that method. It doesn't seem like the class is performing a very useful function here. It would be better to have a pattern like:

def extract_directory(session, directory_page="Wikipedia:WikiProject..."):
    ...

def parse_section(section_text):
    ...

def parse_section_intro(intro_text):
    ...

def parse_wikiproject_table(table_text):
    ...

session = mwapi.Session(...)

print(extract_directory(session))

Then you'd write tests for parse_section() but not extract_directory which would only be responsible for making calls to the api iteratively.

In the end, this works. I'd like to, at minimum, get it pep8 compliant before we merge. We can worry about a restructuring later.

@codez266
Copy link
Contributor Author

codez266 commented Sep 7, 2017

I agree with your point that we should have a session variable in the top level function extract_directory and make api calls only there. But because of certain nesting in the structure of WikiProjects directory, api calls only in the extract_directory was becoming impossible. Will discuss this in detail and think about how to restructure.

The above approach would work if I had fetched the entire page once and called helper methods subsequently to parse sections. However that seemed difficult because:

  • Fetching sections one by one using api and section number is better and more robust that using regexes to isolate sections in the full wikitext of page.
  • Essentially I use the output of ?action=parse&prop=sections to get sections of the page and use this info to extract each section which guarantees that we don't miss any section.

@adamwight
Copy link
Contributor

This looks good! I ran flake8 and it found a few more pep8 glitches, report is attached.

pep8.txt

@codez266
Copy link
Contributor Author

codez266 commented Sep 7, 2017

Thanks Adam! for your suggestions regarding flake8.

@halfak halfak merged commit 7ad36c3 into master Sep 10, 2017
@codez266 codez266 deleted the wikiprojects-parser-tests branch September 13, 2017 02:29
Ladsgroup pushed a commit that referenced this pull request Aug 20, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants