In [1]:
from pathlib import Path
import json
import requests
from wikitalk_parser import WikiParserThreads

HERE = Path('.')

## Monty Python [talk page](https://en.wikipedia.org/wiki/Talk:Monty_Python) 

Download the Cirrus doc (`cirrusdoc`) containing the Wiki markup of the talk page.

In [2]:
resp = requests.get('https://en.wikipedia.org/w/api.php', params=dict(
    action='query',
    format='json',
    prop='cirrusdoc',
    titles='Talk:Monty_Python'
))
data = resp.json()
source = data['query']['pages']['18949']['cirrusdoc'][0]['source']['source_text']
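
The lookup above hard-codes the page ID `'18949'`, which is specific to this talk page. As a minimal sketch, assuming the response keeps the same nesting as shown above, the markup could be pulled out without knowing the ID in advance. The helper name `extract_source_text` is hypothetical and not part of `wikitalk_parser`.

```python
def extract_source_text(data):
    """Return the Wiki markup from a `prop=cirrusdoc` API response.

    Assumes the response was requested for a single title, so
    `data['query']['pages']` holds exactly one entry keyed by page ID.
    """
    pages = data['query']['pages']
    # Take the only page instead of hard-coding its ID.
    page = next(iter(pages.values()))
    return page['cirrusdoc'][0]['source']['source_text']


# Hypothetical response with the same shape as the one used above.
example = {
    'query': {
        'pages': {
            '123': {'cirrusdoc': [{'source': {'source_text': '== Topic ==\nHello'}}]}
        }
    }
}
markup = extract_source_text(example)
```

With the real response, `source = extract_source_text(data)` would replace the hard-coded lookup.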

Using the parser is straightforward. First, initialize an instance with
a string containing the Wiki markup of a page. Then call the `.parse()` method,
which returns a generator of successive posts. The hierarchy of threads, and of
posts within threads, is preserved through a set of additional index fields.

Each post is a simple `dict` object with the following fields:

1. `topic`: thread title/topic
2. `thread_idx`: thread index according to the actual order on page
    and counting from `1`.
3. `post_idx`: index of a post within a thread according to the order of appearance.
4. `parent_idx`: index of the parent post.
5. `user_name`: user name of the author.
6. `timestamp`: post creation timestamp as `datetime` object.
7. `depth`: depth within the discussion tree.
8. `content`: raw content of the post as extracted from Wiki markup.
9. `content_sanitized`: content cleaned from most of Wiki and HTML markup.
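
The index fields listed above are enough to rebuild the reply tree of a thread. The following is a minimal sketch, assuming that `parent_idx` holds the `post_idx` of the parent post within the same thread and is `None` for top-level posts. The sample posts and the helper names `build_children` and `render_thread` are hypothetical. Real parser output may need its index values normalized to a common type first.

```python
from collections import defaultdict


def build_children(posts):
    """Map (thread_idx, parent_idx) to the list of direct replies."""
    children = defaultdict(list)
    for post in posts:
        children[(post['thread_idx'], post['parent_idx'])].append(post)
    return children


def render_thread(children, thread_idx, parent_idx=None, indent=0):
    """Return indented user names, one line per post, in page order."""
    lines = []
    for post in children.get((thread_idx, parent_idx), []):
        lines.append('  ' * indent + post['user_name'])
        # Recurse into the replies to this post.
        lines.extend(
            render_thread(children, thread_idx, post['post_idx'], indent + 1)
        )
    return lines


# Hypothetical posts carrying only the fields needed here.
posts = [
    {'thread_idx': 1, 'post_idx': 0, 'parent_idx': None, 'user_name': 'Alice'},
    {'thread_idx': 1, 'post_idx': 1, 'parent_idx': 0, 'user_name': 'Bob'},
    {'thread_idx': 1, 'post_idx': 2, 'parent_idx': 1, 'user_name': 'Carol'},
    {'thread_idx': 1, 'post_idx': 3, 'parent_idx': 0, 'user_name': 'Dave'},
]
tree_lines = render_thread(build_children(posts), thread_idx=1)
print('\n'.join(tree_lines))
```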

Because the data is returned as a flat sequence of homogeneous `dict`
objects, the parser output fits easily into convenient data structures
such as [Pandas](https://pandas.pydata.org/) data frames.
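
As a rough illustration of the point about data frames, the sketch below loads a few posts into Pandas. The sample records are hypothetical stand-ins that use a subset of the fields listed above. With real data, `pd.DataFrame(threads.parse())` should behave the same way, assuming every post carries the same keys.

```python
import datetime

import pandas as pd

# Hypothetical posts shaped like the parser output (subset of fields).
posts = [
    {'topic': 'First topic', 'thread_idx': 1, 'post_idx': 0, 'parent_idx': None,
     'user_name': 'Alice', 'timestamp': datetime.datetime(2019, 8, 21, 17, 3), 'depth': 0},
    {'topic': 'First topic', 'thread_idx': 1, 'post_idx': 1, 'parent_idx': 0,
     'user_name': 'Bob', 'timestamp': datetime.datetime(2019, 8, 22, 9, 30), 'depth': 1},
    {'topic': 'Second topic', 'thread_idx': 2, 'post_idx': 0, 'parent_idx': None,
     'user_name': 'Carol', 'timestamp': datetime.datetime(2019, 9, 1, 12, 0), 'depth': 0},
]

# Each dict becomes one row; the keys become the columns.
df = pd.DataFrame(posts)

# Example query: number of posts in each thread.
posts_per_thread = df.groupby('thread_idx')['post_idx'].count()
print(posts_per_thread)
```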

In [3]:
threads = WikiParserThreads(source)
list(threads.parse())

[{'topic': '"First writer to play with the conventions of television"',
  'thread_idx': 1,
  'post_idx': '0',
  'parent_idx': None,
  'user_name': 'Justintime55',
  'timestamp': datetime.datetime(2019, 8, 21, 17, 3),
  'depth': 0,
  'content': '<s>You know, I get a bit tired of Brits who think their shit doesn\'t stink :-) ...</s>\n\n[[Ernie Kovacs]] started his television career in 1950, and by his death in 1962 was considered a genius of visual comedy (e.g. \'\'[[Silent Show]]\'\', 1957). Compare this with [[Spike Milligan]], who "first attempt[ed] to translate Goons humour to TV" in 1956, and didn\'t start the \'\'[[Q... (TV series)]]\'\' (which Palin says he and Jones adored) \'\'until 1969\'\', therefore could not possibly have been the "first to play with the conventions of television". [[User:JustinTime55|JustinTime55]] ([[User talk:JustinTime55|talk]]) 17:03, 21 August 2019 (UTC)',
  'content_sanitized': 'You know, I get a bit tired of Brits who think their shit doesn\'t stink 