## Some experiments with how well theaters are doing over time

We'll count how many theatres in each country had any showings for each date sampled.

This demo works with the entire showings dataset. Each operation takes about 20 seconds with 48 CPUs.

In [None]:
%%time

import os
import sys

sys.path.append('..')

from movies_dask_bag.movie_reader import TheatersReader, MoviesReader, ShowingsReader

work_dir = os.environ.get('SLURM_TMPDIR', '.')
data_dir = '{}/json'.format(work_dir)
file_pattern = '{}/*/*'.format(data_dir)

showings_reader = ShowingsReader(file_pattern)
theaters_reader = TheatersReader(file_pattern)

Counting the showing records gives an indication of how long operations will take.

In [None]:
%%time

showings_reader.count

We'll hash the data by theatre, country, and date stamp and count the frequencies for each combination.

In [None]:
%%time

# Hashing scheme for the bins: "theater_id||country||date_stamp"

def theater_hash(movie):
    return "{}||{}||{}".format(movie['theater_id'], movie['country'], movie['date_stamp'])

frequencies = showings_reader.bag.map(theater_hash).frequencies().compute()
frequencies[:10]

In [None]:
len(frequencies)
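
To make the shape of the result concrete, here is a minimal sketch that runs the same hashing on a handful of made-up records. The sample data and the name `sample_frequencies` are illustrative assumptions, and `collections.Counter` stands in for the Dask bag's `frequencies()` step; both produce `(key, count)` pairs.

```python
from collections import Counter

# Hypothetical sample records; the real ones come from showings_reader.bag.
sample_showings = [
    {"theater_id": "t1", "country": "CA", "date_stamp": "20191213"},
    {"theater_id": "t1", "country": "CA", "date_stamp": "20191213"},
    {"theater_id": "t2", "country": "CA", "date_stamp": "20191213"},
    {"theater_id": "t3", "country": "US", "date_stamp": "20200320"},
]

def theater_hash(movie):
    return "{}||{}||{}".format(movie["theater_id"], movie["country"], movie["date_stamp"])

# Counter stands in for bag.map(theater_hash).frequencies() in this sketch.
sample_frequencies = sorted(Counter(map(theater_hash, sample_showings)).items())

for key, showings in sample_frequencies:
    print(key, showings)
```

Each distinct key appears exactly once in the result, so the number of keys that share a country and date stamp is the number of theatres with at least one showing on that date.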

Now we can build a nested structure (country, then date stamp) that counts how many theatres had showings on each date sampled. We'll use it to create a dataframe.

In [None]:
country_date_stamps_counts = {}
countries = set()
date_stamps = set()

for frequency in frequencies:
    (theater_id, country, date_stamp) = frequency[0].split("||")
    countries.add(country)
    date_stamps.add(date_stamp)

    country_counts = country_date_stamps_counts.get(country, {})
    count = country_counts.get(date_stamp, 0) + 1
    country_counts[date_stamp] = count
    country_date_stamps_counts[country] = country_counts

countries = sorted(countries)
date_stamps = sorted(date_stamps)

In [None]:
country_date_stamps_counts
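
As a possible alternative to the `.get()` bookkeeping, the same aggregation could be sketched with `defaultdict` and `Counter` from the standard library. The sample pairs and the names `sample_frequencies` and `theatres_per_country_date` below are hypothetical and only illustrate the idea.

```python
from collections import Counter, defaultdict

# Hypothetical (key, count) pairs in the same shape as `frequencies`.
sample_frequencies = [
    ("t1||CA||20191213", 2),
    ("t2||CA||20191213", 1),
    ("t1||CA||20200320", 4),
    ("t3||US||20191213", 7),
]

# Each distinct key is one theatre with at least one showing on that date,
# so we add 1 per key and ignore the showing count itself.
theatres_per_country_date = defaultdict(Counter)
for key, _showings in sample_frequencies:
    _theater_id, country, date_stamp = key.split("||")
    theatres_per_country_date[country][date_stamp] += 1

# Convert to plain dicts for display.
result = {country: dict(counts) for country, counts in theatres_per_country_date.items()}
print(result)
```

Note that the showing count in each pair is deliberately unused: the question is how many theatres were open, not how many showings they ran.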

Create the dataframe ...

In [None]:
import pandas as pd
import numpy as np

columns = date_stamps
df = pd.DataFrame(columns=columns, dtype=np.int64)

for country in country_date_stamps_counts:
    country_counts = country_date_stamps_counts[country]
    for date_stamp in country_counts:
        count = country_counts[date_stamp]
        df.loc[country, date_stamp] = count
df.fillna(0, inplace=True)
df
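
For comparison, a nested dictionary of this shape can probably also be turned into a dataframe in one step with `pd.DataFrame.from_dict(..., orient='index')`. The sketch below uses a small made-up dictionary (`sample_counts` is a hypothetical name) and is not a drop-in replacement tested against the full dataset.

```python
import pandas as pd

# Hypothetical nested counts: country -> date stamp -> number of theatres.
sample_counts = {
    "CA": {"20191213": 2, "20200320": 1},
    "US": {"20191213": 1},
}

# Collect every date stamp so the columns come out in chronological order.
sample_date_stamps = sorted({d for counts in sample_counts.values() for d in counts})

sample_df = (
    pd.DataFrame.from_dict(sample_counts, orient="index")  # rows are countries
    .reindex(columns=sample_date_stamps)                   # fixed column order
    .fillna(0)                                             # missing dates mean no theatres seen
    .astype("int64")
    .sort_index()
)
print(sample_df)
```

Casting back to `int64` after `fillna` keeps the counts as integers, which filling in missing values would otherwise turn into floats.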

Plot how many theatres had showings in each country for each week.

In [None]:
import datetime
import plotly.graph_objects as go

# Convert columns from strings to proper datetimes
plot_columns = [datetime.datetime.strptime(c, '%Y%m%d') for c in columns]

# Default double-click speed is a bit fast ...
config = {'doubleClickDelay': 1000}

fig = go.Figure()
for country in countries:
    fig.add_scatter(x=plot_columns,
                    y=df.loc[country],
                    mode = 'lines',
                    name=country)

fig.show(config=config)

It might be interesting to normalize the data to compare it with a date stamp pre-pandemic (2019-12-13 chosen because it has data for all of the countries).

In [None]:
# Divide every column by the pre-pandemic baseline column, row by row
df_ratio = df.div(df['20191213'], axis=0)

fig = go.Figure()
for country in countries:
    fig.add_scatter(x=plot_columns,
                    y=df_ratio.loc[country],
                    mode = 'lines',
                    name=country)

fig.show(config=config)
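
The normalization can be checked by hand on a tiny example. The sketch below uses made-up counts (`sample_df` and its values are assumptions, not real data) and divides each column by the baseline column, aligned on the country index.

```python
import pandas as pd

# Hypothetical counts: rows are countries, columns are date stamps.
sample_df = pd.DataFrame(
    {"20191213": [100, 40], "20200320": [50, 10], "20200417": [0, 4]},
    index=["CA", "US"],
)

# Divide every column by the baseline column, aligning on the row index.
sample_ratio = sample_df.div(sample_df["20191213"], axis=0)
print(sample_ratio)
```

A value of 0.5 means half as many theatres had showings as on the baseline date. A country with a zero baseline would produce `inf` or `NaN` here, which is why the baseline date needs data for every country.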

Shut things down ...

In [None]:
showings_reader.shutdown()
theaters_reader.shutdown()