# Showings demo

This demo needs quite a few CPUs to be responsive.

Operations typically take:

* 6 CPUs: 4 minutes
* 48 CPUs: 20 seconds

In the demo, we'll plot the evolution of showings over time for each country.

In [None]:
%%time

import os
import sys

sys.path.append('..')

from movies_dask_bag.movie_reader import TheatersReader, MoviesReader, ShowingsReader

work_dir = os.environ.get('SLURM_TMPDIR', '.')
data_dir = '{}/json'.format(work_dir)
file_pattern = '{}/*/*'.format(data_dir)

showings_reader = ShowingsReader(file_pattern)

showings_reader.take(1)
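
The pattern `{data_dir}/*/*` matches entries exactly two levels below the `json` directory. Below is a minimal sketch of that expansion using the standard `glob` module on a throwaway directory. The sub-directory and file names are hypothetical and do not reflect the demo's real layout, and how `ShowingsReader` itself expands the pattern is not shown in this notebook.

```python
import glob
import os
import tempfile

# Hypothetical layout: <work_dir>/json/<subdir>/<file>; all names are made up.
with tempfile.TemporaryDirectory() as sample_work_dir:
    sample_data_dir = '{}/json'.format(sample_work_dir)
    for subdir, name in [('a', 'x.json'), ('a', 'y.json'), ('b', 'z.json')]:
        os.makedirs(os.path.join(sample_data_dir, subdir), exist_ok=True)
        with open(os.path.join(sample_data_dir, subdir, name), 'w') as f:
            f.write('{}')

    # A file directly under json/ is one level too shallow for '*/*'
    with open(os.path.join(sample_data_dir, 'top_level.json'), 'w') as f:
        f.write('{}')

    sample_pattern = '{}/*/*'.format(sample_data_dir)
    matched = sorted(os.path.basename(p) for p in glob.glob(sample_pattern))

print(matched)
```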

In [None]:
showings_reader.client

In [None]:
showings_reader.bag

In [None]:
%%time

showings_reader.count

We can partition the data into bins for each country and date stamp, and compute the frequencies of the showings.

In [None]:
%%time

bins = showings_reader.bag.map(lambda x: x['country'] + '--' + x['date_stamp']).frequencies(sort=False).compute()

print("{} bins".format(len(bins)))
print("First 5 bins:\n{}".format(bins[:5]))
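
To make the binning step concrete, here is a small in-memory sketch that mimics the key construction and counting with `collections.Counter` instead of a Dask bag. The sample records are invented; only the `country` and `date_stamp` field names come from the cell above.

```python
from collections import Counter

# Invented sample records; the real ones come from showings_reader.bag
sample_records = [
    {'country': 'CA', 'date_stamp': '20191213'},
    {'country': 'CA', 'date_stamp': '20191213'},
    {'country': 'CA', 'date_stamp': '20200320'},
    {'country': 'FR', 'date_stamp': '20191213'},
]

# Same composite key as in the notebook: '<country>--<date_stamp>'
sample_keys = [r['country'] + '--' + r['date_stamp'] for r in sample_records]

# A list of (key, count) pairs, the shape the notebook indexes with row[0] and row[1]
sample_bins = sorted(Counter(sample_keys).items())
print(sample_bins)
```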

It might also be interesting to count all of the showings for each date stamp ...

In [None]:
date_counts = showings_reader.bag.map(lambda x: x['date_stamp']).frequencies(sort=False).compute()
date_counts[:20]

The date stamps will be the columns for the data frame ...

In [None]:
columns = sorted([date_count[0] for date_count in date_counts])
columns[:5]

The countries will represent the rows of the data frame ...

In [None]:
country_counts = showings_reader.bag.map(lambda x: x['country']).frequencies(sort=False).compute()
countries = sorted([country_count[0] for country_count in country_counts])

print("{} countries total".format(len(countries)))
print("First 10 countries:")
countries[:10]

Assemble the data frame ...

In [None]:
import pandas as pd
import numpy as np

# df = pd.DataFrame(columns=columns, index=['country'], dtype=np.int64)
df = pd.DataFrame(columns=columns, dtype=np.int64)
df.index.name = 'country'

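# Each bin is a ('<country>--<date_stamp>', count) pair.
# Assigning to a row label that does not exist yet adds that row to the frame.
for row in bins:
    (country, date_stamp) = row[0].split('--')
    df.loc[country, date_stamp] = row[1]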
df

Replace missing data with zeros ...

In [None]:
df.fillna(0, inplace=True)
df
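
The assembly and zero-filling steps amount to pivoting `(key, count)` pairs into a country-by-date table. Below is a dependency-free sketch of that logic on invented data, using nested dictionaries rather than a pandas data frame. All names in it are illustrative.

```python
# Invented (key, count) pairs in the '<country>--<date_stamp>' format
sample_bins = [('CA--20191213', 2), ('CA--20200320', 1), ('FR--20191213', 1)]

# Split each composite key back into its two parts
parsed_bins = []
for key, count in sample_bins:
    country, date_stamp = key.split('--')
    parsed_bins.append((country, date_stamp, count))

sample_countries = sorted({c for c, _, _ in parsed_bins})
sample_dates = sorted({d for _, d, _ in parsed_bins})

# Start every cell at zero, so missing combinations need no later fill
sample_table = {c: {d: 0 for d in sample_dates} for c in sample_countries}
for country, date_stamp, count in parsed_bins:
    sample_table[country][date_stamp] = count

print(sample_table)
```

Here `FR` has no entry for `20200320` in the input, so that cell stays at zero.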

**Note**: at this point it makes sense to dump the data to a CSV file and download it to a PC ...

In [None]:
df.to_csv("showings_explorations_out.csv")

Make the plot ...

In [None]:
import datetime
import plotly.graph_objects as go

# Convert columns from strings to proper datetimes
columns = [datetime.datetime.strptime(c, '%Y%m%d') for c in columns]

# Default double-click speed is a bit fast ...
config = {'doubleClickDelay': 1000}

fig = go.Figure()
for country in countries:
    fig.add_scatter(x=columns,
                    y=df.loc[country],
                    mode='lines',
                    name=country)

fig.show(config=config)
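
The `'%Y%m%d'` format used in the plotting cell parses compact date stamps such as `20191213`. A quick standard-library check of that conversion on a single stamp:

```python
import datetime

sample_stamp = '20191213'

# %Y = four-digit year, %m = two-digit month, %d = two-digit day
parsed_stamp = datetime.datetime.strptime(sample_stamp, '%Y%m%d')
print(parsed_stamp.date().isoformat())
```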

It might be interesting to normalize the data against a pre-pandemic date stamp (2019-12-13, chosen because it has data for all of the countries).

In [None]:
# Divide every column by the reference column, row by row (aligned on country)
df_ratio = df.div(df['20191213'], axis=0)

fig = go.Figure()
for country in countries:
    fig.add_scatter(x=columns,
                    y=df_ratio.loc[country],
                    mode='lines',
                    name=country)

fig.show(config=config)
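
Normalizing against a reference date divides every value in a country's row by that country's count on the reference date. Below is a dependency-free sketch on invented numbers. Returning `None` when the reference count is zero is a choice made for this sketch only, not something the notebook does.

```python
sample_reference = '20191213'

# Invented counts per country and date stamp
sample_counts = {
    'CA': {'20191213': 200, '20200320': 50},
    'FR': {'20191213': 100, '20200320': 0},
}

sample_ratio = {}
for country, row in sample_counts.items():
    base = row[sample_reference]
    # Guard against a zero reference count (sketch-only convention: None)
    sample_ratio[country] = {
        date_stamp: (value / base if base else None)
        for date_stamp, value in row.items()
    }

print(sample_ratio)
```

Every country reads 1.0 on the reference date, so the curves become directly comparable regardless of market size.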

Shut down the computational network and clean up ...

In [None]:
showings_reader.shutdown()