---
title: "Best practices for data collection"
author: "Leon Yin"
date-modified: "02-14-2023"
href: data-collection-best-practices
---

Here are some helpful tips for building datasets.

## Don't repeat work

Before you collect data, check if you've already collected it. 

Create a programmatic naming structure for a "target", such as a filename or a unique ID in a database, and check whether it exists.

If it already exists, move on.

Below is a dummy example of a scraper for video metadata that checks if a file with the same `video_id` has already been saved.

In [1]:
import os
import time

def collect_video_metadata(video_id):
    """
    This is an example of a data collection function
    that checks if a video_id has already been collected.
    """
    # consistently structure the target filename (fn_out)
    fn_out = f"video_metadata_{video_id}.csv"
    
    # check if the file exists, if it does: move on
    if os.path.exists(fn_out):
        print("already collected")
        return
        
    # collect the data (not actually implemented)
    print("time to do some work!")
    
    # save the file. Instead of real data, we'll save text that says, "Collected".
    with open(fn_out, 'w') as f:
        f.write("Collected")
    return

Let's try to collect some video metadata for a `video_id` of our choosing.

In [2]:
video_id = "schfiftyfive"

In [3]:
#| echo: false
def delete_file(video_id):
    fn_out = f"video_metadata_{video_id}.csv"
    if os.path.exists(fn_out):
        os.remove(fn_out)
delete_file(video_id = video_id)

In [4]:
collect_video_metadata(video_id = video_id)

time to do some work!


Let's try to run the same exact function with the same input:

In [5]:
collect_video_metadata(video_id = video_id)

already collected


The second time you call it, the function ends early.

When collecting a large dataset, this check is essential for making the best use of time: a job that stops partway can be restarted without redoing finished work.
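The same check works when the target is a unique ID in a database rather than a filename. Here is a minimal sketch using Python's built-in `sqlite3` module. The table name `video_metadata`, its columns, and the function name `collect_video_metadata_db` are hypothetical, chosen only for illustration, and the collection step is again a placeholder.

```python
import sqlite3

def collect_video_metadata_db(conn, video_id):
    """
    Hypothetical variant of the function above that uses a
    database row, rather than a file, as the target.
    Returns True if work was done, False if it was skipped.
    """
    # make sure the (assumed) table exists, keyed on video_id
    conn.execute(
        "CREATE TABLE IF NOT EXISTS video_metadata "
        "(video_id TEXT PRIMARY KEY, payload TEXT)"
    )

    # check if the ID exists, if it does: move on
    row = conn.execute(
        "SELECT 1 FROM video_metadata WHERE video_id = ?", (video_id,)
    ).fetchone()
    if row is not None:
        print("already collected")
        return False

    # collect the data (not actually implemented)
    print("time to do some work!")

    # save a placeholder row instead of real data
    conn.execute(
        "INSERT INTO video_metadata (video_id, payload) VALUES (?, ?)",
        (video_id, "Collected"),
    )
    conn.commit()
    return True

# an in-memory database keeps this demo self-contained
conn = sqlite3.connect(":memory:")
collect_video_metadata_db(conn, "schfiftyfive")  # does the work
collect_video_metadata_db(conn, "schfiftyfive")  # skips it
```

Making `video_id` the primary key also means the database itself would reject a duplicate row if the check were ever skipped.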

## Save receipts

Save the output of every step, especially the earliest ones: the JSON response from a server or the HTML of a web page.

You can always rewrite the parsers that turn that "raw" data into something neat and actionable.

Websites and API responses _can_ change, so web parsers can break easily. It is safer to just save the data straight from the source, and process it later.
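As a rough sketch of what saving a receipt could look like, the function below writes a response body to disk exactly as received, next to a small file recording where and when it came from. The function name `save_receipt`, the `receipts` folder, and the hashed-URL naming scheme are all assumptions made for this example, not a prescribed layout.

```python
import hashlib
import json
import os
from datetime import datetime, timezone

def save_receipt(url, raw_bytes, dir_out="receipts"):
    """
    Save a response body exactly as received, plus a small
    sidecar file recording where and when it came from.
    Returns the path of the raw file.
    """
    os.makedirs(dir_out, exist_ok=True)

    # name the file after a hash of the URL so the target is predictable
    name = hashlib.sha256(url.encode("utf-8")).hexdigest()[:16]
    fn_raw = os.path.join(dir_out, f"{name}.raw")
    fn_meta = os.path.join(dir_out, f"{name}.meta.json")

    # write the bytes untouched: no decoding, no parsing
    with open(fn_raw, "wb") as f:
        f.write(raw_bytes)

    # record provenance next to the raw file
    meta = {
        "url": url,
        "collected_at": datetime.now(timezone.utc).isoformat(),
        "n_bytes": len(raw_bytes),
    }
    with open(fn_meta, "w") as f:
        json.dump(meta, f)
    return fn_raw
```

Parsing then becomes a separate step that reads from the saved files, so a broken parser can be fixed and rerun without contacting the server again.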

If you're collecting a web page through browser automation, save a screenshot. It's helpful to have reference material of what the web page looked like when you captured it.

This is something we did at The Markup when we collected Facebook data from a national panel over several months, and again when we collected Google search results.

These receipts don't just play a role in the underlying analysis; they can also serve as powerful exhibits in your investigation.

<figure>
<video autoplay="1" playsinline src="assets/ida.webm"></video>
<figcaption align = "left" style="font-size:80%;"> A screenshot of a Google Search for "Ida Tarbell" that was saved, and stained using a [web parsing](https://themarkup.org/google-the-giant/2020/07/28/how-we-analyzed-google-search-results-web-assay-parsing-tool#google-search-flow) tool created for an [investigation](https://themarkup.org/google-the-giant/2020/07/28/google-search-results-prioritize-google-products-over-competitors) that found Google gives its own properties 41% of the first page of Search results. </figcaption>
</figure>

## Bigger isn't always better

Be smart about how you use data rather than depending on big numbers. Data isn't valuable in itself.

It's better to start off smaller, with a trial analysis (we often call it a quick-sniff in the newsroom) to make sure you have a testable hypothesis.

I always use this step at my newsroom to plan longer data investigations and to see what kind of story we could write if we spent more time on data collection and on honing the methodology.
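One way to start smaller, assuming you already have a list of candidate IDs, is to collect a reproducible random sample first. This is only a sketch: the function name `sample_for_quick_sniff` and the default sample size are made up for illustration, and the right sample depends on the question you're asking.

```python
import random

def sample_for_quick_sniff(ids, k=100, seed=0):
    """
    Return a reproducible random subset of `ids` to collect first.
    """
    ids = sorted(set(ids))      # de-duplicate; sort so the same seed gives the same sample
    k = min(k, len(ids))        # don't ask for more than exists
    rng = random.Random(seed)   # local generator, leaves global random state alone
    return rng.sample(ids, k)
```

Fixing the seed means a colleague can draw the same sample and check your trial analysis.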

## Conclusion

These tips are not definitive. If you want to share tips, please make a suggestion via email or GitHub. 