# Store Reddit Posts in CSV Format

Because we need to merge multiple sources, such as Reddit and migraine.com, we decided to use CSV files as our intermediate format. CSV files are easy to concatenate and can then be loaded into memory with Pandas.
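As a rough illustration of that later merge step, the sketch below reads several CSV files that share the same columns and stacks them into one DataFrame. The helper name `load_sources` is hypothetical and not part of this notebook, and reading every column as a string is an assumption made here so that ids and empty fields are not converted to numbers or `NaN`.

```python
import pandas as pd


def load_sources(paths):
    """Read CSV files with identical columns and stack them into one DataFrame.

    Hypothetical helper: all columns are read as strings and empty cells are
    kept as empty strings instead of NaN.
    """
    frames = [pd.read_csv(path, dtype=str, keep_default_na=False) for path in paths]
    # ignore_index=True renumbers the rows 0..n-1 across all sources.
    return pd.concat(frames, ignore_index=True)
```

It could be called with the file written by this notebook plus any other file in the same layout, for example `load_sources(['data/reddis_migraine_posts.csv', other_path])`, where `other_path` stands for a file that is not defined here.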

The purpose of this notebook is to retrieve only the relevant fields of Reddit posts from MongoDB and store them in a CSV file.

The desired format of this file will be:

| Type | Parent | Author | Text | Title | Tags | Webpage |
| ---- | ------ | ------ | ---- | ----- | ---- | ------- |
| P/C  |   id   | userid | text | text  | text |   url   |

- Type - `P` for a post or `C` for a comment
- Parent - the post's own id for a post; for a comment, the value of `parent_id` in the comment structure without the `t3_` prefix
- Author - user id of the author
- Text - the value of `selftext` for a post and the value of `body` for a comment
- Title - title of the post (a comment carries the title of its post)
- Tags - used only by migraine.com, so this field is empty for Reddit
- Webpage - used only by migraine.com, so this field is empty for Reddit
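The field mapping above can be sketched on a pair of made-up documents. The sample values and the helper `strip_link_prefix` are hypothetical and only illustrate the layout; the real documents come from MongoDB and may contain more fields.

```python
FIELD_NAMES = ['Type', 'Parent', 'Author', 'Text', 'Title', 'Tags', 'Webpage']
LINK_PREFIX = 't3_'


def strip_link_prefix(parent_id):
    """Return parent_id without a leading 't3_'; other ids are left unchanged."""
    if parent_id.startswith(LINK_PREFIX):
        return parent_id[len(LINK_PREFIX):]
    return parent_id


# Made-up documents containing only the fields described in the list above.
sample_post = {'id': 'abc123', 'author': 'user_a',
               'title': 'Aura without pain?', 'selftext': 'Does anyone else get this?'}
sample_comment = {'parent_id': 't3_abc123', 'author': 'user_b',
                  'body': 'Yes, quite often.'}

post_row = {
    'Type': 'P',
    'Parent': sample_post['id'],
    'Author': sample_post['author'],
    'Text': sample_post.get('selftext', ''),
    'Title': sample_post['title'],
    'Tags': None,      # only used by migraine.com
    'Webpage': None,   # only used by migraine.com
}

comment_row = {
    'Type': 'C',
    'Parent': strip_link_prefix(sample_comment['parent_id']),
    'Author': sample_comment['author'],
    'Text': sample_comment['body'],
    'Title': sample_post['title'],  # a comment carries the title of its post
    'Tags': None,
    'Webpage': None,
}
```

With this mapping a post and its direct comments share the same `Parent` value, which makes it possible to group a thread by that column later.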

In [6]:
import csv
from tqdm import tqdm
from pymongo import MongoClient

In [7]:
database_name = 'reddit-migraine'

client = MongoClient('mongodb://localhost:37017')
db = client[database_name]

In [8]:
migraine_file_name = 'reddis_migraine_posts.csv'

In [12]:
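def process_comments(comments, title, parent_id):
    """Build one CSV row per comment, linked to its post through parent_id."""
    return [{
        'Type': 'C',
        'Parent': parent_id,
        'Author': comment['author'],
        'Text': comment['body'],
        'Title': title,
        # Tags and Webpage are only used by migraine.com
        'Tags': None,
        'Webpage': None
    } for comment in comments]


def process_post(post):
    """Build the CSV rows for a post: the post itself, then its comments."""
    entries = [{
        'Type': 'P',
        'Parent': post['id'],
        'Author': post['author'],
        # Posts without a text body get an empty Text field
        'Text': post.get('selftext', ''),
        'Title': post['title'],
        'Tags': None,
        'Webpage': None
    }]
    entries.extend(process_comments(
        post['comments'],
        post['title'],
        post['id']))
    return entries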

In [14]:
field_names = ['Type', 'Parent', 'Author', 'Text', 'Title', 'Tags', 'Webpage']
posts_count = db['posts'].count_documents({})
with tqdm(total=posts_count, desc="Progress") as pbar:
    # newline='' is what the csv module expects for files it writes to;
    # an explicit encoding keeps the output independent of the platform default
    with open(f'data/{migraine_file_name}', 'w', newline='', encoding='utf-8') as posts_file:
        csv_writer = csv.DictWriter(posts_file, fieldnames=field_names)
        # Write the header once, before any rows
        csv_writer.writeheader()
        for post in db['posts'].find():
            # One row for the post followed by one row per comment
            csv_writer.writerows(process_post(post))
            pbar.update(1)


Progress: 100%|██████████| 42878/42878 [00:11<00:00, 3659.82it/s] 
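The rows written above contain `None` for `Tags` and `Webpage`. The small, self-contained sketch below shows how `csv.DictWriter` handles such values, and also rows that leave those keys out entirely, by writing to an in-memory buffer instead of a file. The sample rows are made up for illustration.

```python
import csv
import io

field_names = ['Type', 'Parent', 'Author', 'Text', 'Title', 'Tags', 'Webpage']

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=field_names)
writer.writeheader()

# A post row with explicit None values: None is written as an empty field.
writer.writerow({'Type': 'P', 'Parent': 'abc123', 'Author': 'user_a',
                 'Text': 'Body, with a comma', 'Title': 'A title',
                 'Tags': None, 'Webpage': None})

# A comment row without Tags/Webpage keys: missing keys are filled with
# DictWriter's restval, which defaults to an empty string.
writer.writerow({'Type': 'C', 'Parent': 'abc123', 'Author': 'user_b',
                 'Text': 'A reply', 'Title': 'A title'})

lines = buffer.getvalue().splitlines()
for line in lines:
    print(line)
```

Both rows end with two empty fields, and the text containing a comma is quoted automatically, so free-form post bodies do not break the column layout.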
