**Be advised: This code overwrites files in your working directory. It also uses a considerable amount of disk space. Please run this code cautiously.**

# Data Acquisition

Reddit captured a complete **placement history** during the 2022 r/place event and offers at least two ways to access it. We work from the [single gzip version](https://placedata.reddit.com/data/canvas-history/2022_place_canvas_history.csv.gzip): a 12.4GB CSV file containing ~160 million rows, where each row describes a single placement on the canvas in four columns:

1. A **timestamp** for when the placement happened. Trailing zeroes are dropped from the milliseconds, so the fractional part can have fewer than three digits. Examples:

   ``2022-04-01 13:03:53.256 UTC``
   <br/>
   ``2022-04-01 13:03:55.57 UTC``

2. An anonymized unique identifier of the **redditor** who made the placement. Examples:

   ``W9aBAJubV6cUkBMMXKam6dArdW+f29WURNI8E6gg924YRnYfQP7eQC5eZOc3QAbIiKBZk8XGKGSIZ4OWHftq+w==``
   <br/>
   ``+LLcdTrBoCOWfrVxzvPQ8TErpFxEk5Uma6qo4f2wtBbaB0loKGfXkph/5FShq/jIMqx5llgxRCwA5DgpmWfrOg==``

3. An HTML-style hexadecimal representation of the **color** of the placement. Examples:

    ``#00A368``
    <br/>
    ``#3690EA``

4. The **coordinates** of the placement. The placement is almost always a **pixel placement** with (_x_, _y_) coordinates, but sometimes it is a **rectangle placement** with (_x1_, _y1_, _x2_, _y2_). Event moderators sometimes placed rectangles to censor inappropriate content. Examples:

   ``394,227``
   <br/>
   ``23,0,511,819``
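
The four columns above can be read into a small record type. The following is a minimal sketch, not the code used for this analysis: the names `Placement`, `parse_timestamp`, and `parse_row` are hypothetical, the identifier `abc==` is a shortened placeholder, and the sketch assumes the coordinate field is quoted in the CSV because it contains commas.

```python
import csv
from datetime import datetime, timedelta, timezone
from typing import NamedTuple, Tuple


class Placement(NamedTuple):
    """One row of the placement history (hypothetical record type)."""
    timestamp: datetime
    redditor: str
    color: str
    coordinates: Tuple[int, ...]  # (x, y) or (x1, y1, x2, y2)


def parse_timestamp(text: str) -> datetime:
    """Parse a timestamp such as '2022-04-01 13:03:55.57 UTC'."""
    stamp, zone = text.rsplit(" ", 1)
    if zone != "UTC":
        raise ValueError(f"unexpected time zone: {zone!r}")
    whole, _, fraction = stamp.partition(".")
    # Trailing zeroes are dropped in the file, so '57' means 570 ms
    millis = int(fraction.ljust(3, "0")) if fraction else 0
    moment = datetime.strptime(whole, "%Y-%m-%d %H:%M:%S")
    return moment.replace(tzinfo=timezone.utc) + timedelta(milliseconds=millis)


def parse_row(line: str) -> Placement:
    """Split one CSV line into its four columns and convert each one."""
    # csv.reader handles the quoting around the comma-separated coordinates
    timestamp, redditor, color, coordinates = next(csv.reader([line]))
    values = tuple(int(value) for value in coordinates.split(","))
    if len(values) not in (2, 4):
        raise ValueError(f"unexpected coordinate count: {coordinates!r}")
    return Placement(parse_timestamp(timestamp), redditor, color, values)


pixel = parse_row('2022-04-01 13:03:55.57 UTC,abc==,#00A368,"394,227"')
rectangle = parse_row('2022-04-01 13:03:53.256 UTC,abc==,#3690EA,"23,0,511,819"')
print(pixel.timestamp.isoformat(), pixel.coordinates)
print(rectangle.timestamp.isoformat(), rectangle.coordinates)
```

A two-value tuple is a pixel placement and a four-value tuple is a rectangle placement, so downstream code can branch on `len(placement.coordinates)`.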

See [r/place Datasets](https://www.reddit.com/r/place/comments/txvk2d/rplace_datasets_april_fools_2022/) for more details about the event and the data set.

We run these commands on [Ubuntu](https://ubuntu.com). Some behavior differs between operating systems, especially for sorting, but we make our intentions explicit in each step so that the commands should port fairly easily to other systems.

In [None]:
# Download compressed placement history, overwriting if necessary
!wget -O placement_history.csv.gz "https://placedata.reddit.com/data/canvas-history/2022_place_canvas_history.csv.gzip"

In [2]:
# Ensure uncompressed placement history file does not exist
!rm -f placement_history.csv

In [3]:
# Uncompress placement history
!gunzip placement_history.csv.gz

# Header Row Removal

The CSV file begins with a header row. We could simply skip the header row when processing the file, but we also want to sort the file, and sorting changes the location of the header row within the file. So instead we remove the header row beforehand.
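
To see why sorting moves the header, consider this small sketch. It assumes the header text is `timestamp,user_id,pixel_color,coordinate` (an assumption, not taken from this notebook) and uses shortened placeholder identifiers. Python compares these ASCII strings character by character, which matches a byte-wise sort.

```python
# Assumed header text; the data rows use placeholder identifiers
header = "timestamp,user_id,pixel_color,coordinate"
rows = [
    '2022-04-01 13:03:55.57 UTC,abc==,#00A368,"394,227"',
    '2022-04-01 13:03:53.256 UTC,def==,#3690EA,"23,0,511,819"',
]

# The header starts the file, as in the downloaded CSV
lines = [header] + rows
ordered = sorted(lines)

# The letter 't' sorts after every digit, so the header no longer comes first
print("header position before sort:", lines.index(header))
print("header position after sort:", ordered.index(header))
```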

In [4]:
# Remove header
!sed -i "1d" placement_history.csv

# Sorting by Timestamp

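The rows initially come unsorted, but some analyses require them sorted by timestamp. A plain `sort` with no key or field options suffices, provided it runs under the `C` locale (`LC_ALL=C`) so that lines are compared byte by byte:

1. The timestamp is the first column, so comparing whole lines compares timestamps first.
2. In the `C` locale a space sorts before any digit, so the variable-width milliseconds pose no problem: `.57 UTC` sorts before `.571 UTC`.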
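
The two points above can be checked on a handful of timestamps. This is a minimal sketch with a hypothetical helper `to_datetime`; the sample without any fractional part is an assumed edge case rather than one shown in this notebook. It compares byte-wise order, which is what `sort` uses under `LC_ALL=C`, with true chronological order.

```python
from datetime import datetime, timedelta, timezone


def to_datetime(text: str) -> datetime:
    """Parse a timestamp, reading a shortened fraction such as '57' as 570 ms."""
    stamp = text.rsplit(" ", 1)[0]  # drop the trailing ' UTC'
    whole, _, fraction = stamp.partition(".")
    millis = int(fraction.ljust(3, "0")) if fraction else 0
    moment = datetime.strptime(whole, "%Y-%m-%d %H:%M:%S")
    return moment.replace(tzinfo=timezone.utc) + timedelta(milliseconds=millis)


samples = [
    "2022-04-01 13:03:55.571 UTC",
    "2022-04-01 13:03:55.57 UTC",
    "2022-04-01 13:03:55 UTC",  # assumed edge case: no fraction at all
    "2022-04-01 13:03:55.5 UTC",
    "2022-04-01 13:03:53.256 UTC",
]

# Byte-wise comparison, as performed by sort in the C locale
bytewise = sorted(samples, key=lambda text: text.encode("ascii"))
# Comparison by the instant each string denotes
chronological = sorted(samples, key=to_datetime)

print(bytewise == chronological)
for text in bytewise:
    print(text)
```

The agreement depends on the space (byte 0x20) sorting before `.` and before every digit. Other locales may use different collation rules, which is why the locale is pinned explicitly.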

In [5]:
# Produce file with rows sorted by timestamp
!LC_ALL=C sort placement_history.csv > timestamp_sorted_history.csv

In [6]:
# Compress timestamp sorted file
!gzip -f timestamp_sorted_history.csv
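
As an optional sanity check, the compressed file can be streamed without decompressing it to disk. The sketch below is an illustration rather than part of the original workflow, and the function name `first_out_of_order` is hypothetical. It compares raw bytes, matching the byte-wise order produced by the sort step.

```python
import gzip


def first_out_of_order(path):
    """Return the 1-based number of the first line that sorts before its
    predecessor, or None if the whole file is in byte-wise order."""
    previous = None
    # Binary mode keeps the comparison byte-wise, like sort under LC_ALL=C
    with gzip.open(path, "rb") as handle:
        for number, line in enumerate(handle, start=1):
            if previous is not None and line < previous:
                return number
            previous = line
    return None
```

Calling `first_out_of_order("timestamp_sorted_history.csv.gz")` after the compression step should return `None` if the sort worked as intended. Expect the pass over ~160 million rows to take a while.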

# Cleanup

To reduce disk usage, we delete the original unsorted placement history.

In [7]:
# Delete unsorted placement history
!rm -f placement_history.csv