## Assignment: Box Office Winner

In this assignment, your task is to reverse engineer a provided visualization from raw data. Specifically, we will visualize the daily box office winners in 2023. The raw data comes from [BoxOfficeMojo](https://www.boxofficemojo.com/daily/2023/?view=year). The target visualization is the following.

![Box Office Winner 2023](https://github.com/qnzhou/practical_data_visualization_in_python/assets/3606672/f404debd-b1bf-4a98-933e-d3b27e3b3921)

The X-axis spans the calendar year, from January 1, 2023 to December 31, 2023. The Y-axis has one row for each release that was the daily top release at some point in the year. Rounded bars mark the date ranges during which a release held the top spot. Each top release has its own color, and its title is printed immediately to the left of its bar. Rows are ordered chronologically by the date each release first reached the top position.

In [1]:
import altair as alt
import pandas as pd

url = "https://github.com/qnzhou/practical_data_visualization_in_python/files/14239903/box_office_2023.csv"
df = pd.read_csv(url)

In [2]:
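# Your code here...
# Note: `df.describe` without parentheses is a bound method, so the notebook
# echoes its repr (which includes the full frame) rather than summary statistics.
# Calling `df.describe()` would return the statistics instead.
df.describe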

<bound method NDFrame.describe of             Date         Holiday Day of Week  Top 10 Gross  \
0    Dec 31 2023  New Year's Eve      Sunday      23078184   
1    Dec 30 2023             NaN    Saturday      40050370   
2    Dec 29 2023             NaN      Friday      37348409   
3    Dec 28 2023             NaN    Thursday      33261609   
4    Dec 27 2023             NaN   Wednesday      33892628   
..           ...             ...         ...           ...   
360   Jan 5 2023             NaN    Thursday      10864987   
361   Jan 4 2023             NaN   Wednesday      12131291   
362   Jan 3 2023             NaN     Tuesday      16965068   
363   Jan 2 2023             NaN      Monday      32548656   
364   Jan 1 2023  New Year's Day      Sunday      36210982   

     Number of Releases               Top Release     Gross  
0                    40                     Wonka   5208897  
1                    41                     Wonka   8637841  
2                    41            

In [3]:
df.head(10)

Unnamed: 0,Date,Holiday,Day of Week,Top 10 Gross,Number of Releases,Top Release,Gross
0,Dec 31 2023,New Year's Eve,Sunday,23078184,40,Wonka,5208897
1,Dec 30 2023,,Saturday,40050370,41,Wonka,8637841
2,Dec 29 2023,,Friday,37348409,41,Wonka,8630268
3,Dec 28 2023,,Thursday,33261609,43,Wonka,7988504
4,Dec 27 2023,,Wednesday,33892628,42,Wonka,8135639
5,Dec 26 2023,,Tuesday,41788862,42,Wonka,8970413
6,Dec 25 2023,Christmas Day,Monday,58545776,41,The Color Purple,18151050
7,Dec 24 2023,,Sunday,17021324,37,Aquaman and the Lost Kingdom,5001421
8,Dec 23 2023,,Saturday,29021089,38,Aquaman and the Lost Kingdom,9003036
9,Dec 22 2023,,Friday,39891139,38,Aquaman and the Lost Kingdom,13681754


In [4]:
df.dtypes

Date                  object
Holiday               object
Day of Week           object
Top 10 Gross           int64
Number of Releases     int64
Top Release           object
Gross                  int64
dtype: object

In [5]:
df['Date'] = pd.to_datetime(df['Date'], errors='coerce')
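
Because `errors='coerce'` silently turns unparseable strings into `NaT`, it can be worth checking that nothing was lost in the conversion. The sketch below uses a small made-up series rather than the assignment data, and it assumes that the `Dec 31 2023` style shown in the output above corresponds to the format string `%b %d %Y`.

```python
import pandas as pd

# Toy sample mimicking the strings in the Date column, plus one
# deliberately bad value (hypothetical, not taken from the dataset).
raw = pd.Series(["Dec 31 2023", "Jan 5 2023", "not a date"])

# An explicit format (an assumption about the column) avoids format guessing.
parsed = pd.to_datetime(raw, format="%b %d %Y", errors="coerce")

# Count the values that failed to parse and were coerced to NaT.
n_failed = int(parsed.isna().sum())
print(parsed)
print("unparseable rows:", n_failed)
```

On the real frame, the same check would be `df['Date'].isna().sum()` right after the conversion.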

In [6]:
df.columns

Index(['Date', 'Holiday', 'Day of Week', 'Top 10 Gross', 'Number of Releases',
       'Top Release', 'Gross'],
      dtype='object')



In [7]:
df.head(15)

Unnamed: 0,Date,Holiday,Day of Week,Top 10 Gross,Number of Releases,Top Release,Gross
0,2023-12-31,New Year's Eve,Sunday,23078184,40,Wonka,5208897
1,2023-12-30,,Saturday,40050370,41,Wonka,8637841
2,2023-12-29,,Friday,37348409,41,Wonka,8630268
3,2023-12-28,,Thursday,33261609,43,Wonka,7988504
4,2023-12-27,,Wednesday,33892628,42,Wonka,8135639
5,2023-12-26,,Tuesday,41788862,42,Wonka,8970413
6,2023-12-25,Christmas Day,Monday,58545776,41,The Color Purple,18151050
7,2023-12-24,,Sunday,17021324,37,Aquaman and the Lost Kingdom,5001421
8,2023-12-23,,Saturday,29021089,38,Aquaman and the Lost Kingdom,9003036
9,2023-12-22,,Friday,39891139,38,Aquaman and the Lost Kingdom,13681754


In [8]:
# Helper columns that mark where "Top Release" changes between adjacent rows
df['prev_top_release'] = df['Top Release'].shift(1)
df['consecutive'] = df['Top Release'] == df['prev_top_release']

# Group identifier for each run of consecutive days; it increments whenever the title changes
df['group'] = (~df['consecutive']).cumsum()

# Find the start and end date of each streak
grouped_df = df.groupby(['group', 'Top Release']).agg(
    start_date=('Date', 'min'),
    end_date=('Date', 'max')
).reset_index()
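
The shift-compare-cumsum idiom in this cell is easier to verify on a tiny example. The following is a minimal sketch on hypothetical toy data (titles `A` and `B` are placeholders), ordered newest first like the CSV. It shows that a title returning to the top after a gap gets its own group instead of being merged with its earlier streak.

```python
import pandas as pd

# Hypothetical toy data: one winner per day, newest first like the CSV.
toy = pd.DataFrame({
    "Date": pd.to_datetime(["2023-01-05", "2023-01-04", "2023-01-03",
                            "2023-01-02", "2023-01-01"]),
    "Top Release": ["A", "A", "B", "A", "A"],
})

# A new run starts wherever the title differs from the previous row.
is_new_run = toy["Top Release"] != toy["Top Release"].shift(1)
toy["group"] = is_new_run.cumsum()

# One row per run, with its first and last day.
runs = toy.groupby(["group", "Top Release"], as_index=False).agg(
    start_date=("Date", "min"),
    end_date=("Date", "max"),
)
print(runs)
```

Grouping by `Top Release` alone would collapse the two `A` streaks into one long bar covering the day that `B` won, which is why the run identifier is part of the group key.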


In [9]:
# df[df["consecutive"] == False].tail(15)

In [10]:
# Important: a one-day streak has start_date == end_date, which produces a zero-width bar
# that does not show on the plot, so extend its end_date by one day.
grouped_df.loc[grouped_df['start_date'] == grouped_df['end_date'], 'end_date'] += pd.Timedelta(days=1)
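
One possible variant, which is not what this notebook does, is to treat every streak as a half-open interval and add one day to all end dates, so that each bar also covers its final day. The sketch below illustrates the idea on hypothetical toy data; the column names `end_exclusive` and `days_on_top` are made up for the example.

```python
import pandas as pd

# Hypothetical streaks: "A" led for three days, "B" for a single day.
runs = pd.DataFrame({
    "Top Release": ["A", "B"],
    "start_date": pd.to_datetime(["2023-01-01", "2023-01-04"]),
    "end_date": pd.to_datetime(["2023-01-03", "2023-01-04"]),
})

# Variant: model each streak as the half-open interval [start, end + 1 day).
runs["end_exclusive"] = runs["end_date"] + pd.Timedelta(days=1)

# With an exclusive end, the interval length equals the number of days on top.
runs["days_on_top"] = (runs["end_exclusive"] - runs["start_date"]).dt.days
print(runs)
```

Whether this matches the target image more closely than adjusting only the one-day streaks would need to be checked against the reference chart.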


In [11]:




# First-occurrence frame, used to print each title only once in the plot
first_occurrence = grouped_df.groupby('Top Release').agg(
    first_start=('start_date', 'min')
).reset_index()

# Titles ordered by the date each one first reached the top
sorted_movies = first_occurrence.sort_values('first_start')['Top Release'].tolist()
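
As a quick sanity check of the ordering logic, the same two steps can be run on a hypothetical set of streaks in which one title (`A`) appears twice. This is only a sketch with placeholder titles and dates.

```python
import pandas as pd

# Hypothetical streaks, newest first; "A" reached the top on two separate occasions.
runs = pd.DataFrame({
    "Top Release": ["A", "B", "C", "A"],
    "start_date": pd.to_datetime(["2023-01-10", "2023-01-07",
                                  "2023-01-05", "2023-01-01"]),
})

# Earliest start per title, so each title appears exactly once.
first = runs.groupby("Top Release", as_index=False).agg(
    first_start=("start_date", "min")
)

# Chronological order of first appearance, usable as an explicit sort list.
order = first.sort_values("first_start")["Top Release"].tolist()
print(order)
```

Passing such a list to `sort=` in an Altair encoding fixes the row order explicitly instead of relying on the default alphabetical ordering of nominal fields.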


In [12]:
df.tail(20)

Unnamed: 0,Date,Holiday,Day of Week,Top 10 Gross,Number of Releases,Top Release,Gross,prev_top_release,consecutive,group
345,2023-01-20,,Friday,19736410,37,Avatar: The Way of Water,4671161,Avatar: The Way of Water,True,81
346,2023-01-19,,Thursday,6634282,31,Avatar: The Way of Water,1959746,Avatar: The Way of Water,True,81
347,2023-01-18,,Wednesday,6203124,31,Avatar: The Way of Water,1865293,Avatar: The Way of Water,True,81
348,2023-01-17,,Tuesday,9757114,31,Avatar: The Way of Water,2769211,Avatar: The Way of Water,True,81
349,2023-01-16,MLK Day,Monday,21069765,32,Avatar: The Way of Water,7056071,Avatar: The Way of Water,True,81
350,2023-01-15,,Sunday,32575322,32,Avatar: The Way of Water,11727060,Avatar: The Way of Water,True,81
351,2023-01-14,,Saturday,40457653,32,Avatar: The Way of Water,14054064,Avatar: The Way of Water,True,81
352,2023-01-13,,Friday,25676736,33,Avatar: The Way of Water,7043560,Avatar: The Way of Water,True,81
353,2023-01-12,,Thursday,7202303,30,Avatar: The Way of Water,2974300,Avatar: The Way of Water,True,81
354,2023-01-11,,Wednesday,7964902,30,Avatar: The Way of Water,3173055,Avatar: The Way of Water,True,81


In [13]:
sorted_movies

['Avatar: The Way of Water',
 'M3GAN',
 'Pathaan',
 'The Chosen Season 3 Finale',
 'Knock at the Cabin',
 '80 for Brady',
 "Magic Mike's Last Dance",
 'Ant-Man and the Wasp: Quantumania',
 'Cocaine Bear',
 'Creed III',
 'Scream VI',
 'Shazam! Fury of the Gods',
 'John Wick: Chapter 4',
 'Dungeons & Dragons: Honor Among Thieves',
 'The Super Mario Bros. Movie',
 'Guardians of the Galaxy Vol. 3',
 'Fast X',
 'The Little Mermaid',
 'Spider-Man: Across the Spider-Verse',
 'Transformers: Rise of the Beasts',
 'The Flash',
 'Elemental',
 'No Hard Feelings',
 'Indiana Jones and the Dial of Destiny',
 'Sound of Freedom',
 'Insidious: The Red Door',
 'Mission: Impossible - Dead Reckoning Part One',
 'Barbie',
 'Blue Beetle',
 'Gran Turismo',
 'The Equalizer 3',
 'The Nun II',
 'A Haunting in Venice',
 'Expend4bles',
 'The Blind',
 'Saw X',
 'PAW Patrol: The Mighty Movie',
 'The Exorcist: Believer',
 'Taylor Swift: The Eras Tour',
 'Killers of the Flower Moon',
 "Five Nights at Freddy's",
 'The 

In [14]:
grouped_df.head(40)

Unnamed: 0,group,Top Release,start_date,end_date
0,1,Wonka,2023-12-26,2023-12-31
1,2,The Color Purple,2023-12-25,2023-12-26
2,3,Aquaman and the Lost Kingdom,2023-12-22,2023-12-24
3,4,Wonka,2023-12-15,2023-12-21
4,5,The Boy and the Heron,2023-12-08,2023-12-14
5,6,Godzilla Minus One,2023-12-06,2023-12-07
6,7,The Hunger Games: The Ballad of Songbirds & Sn...,2023-12-05,2023-12-06
7,8,Godzilla Minus One,2023-12-04,2023-12-05
8,9,Renaissance: A Film by Beyoncé,2023-12-03,2023-12-04
9,10,The Hunger Games: The Ballad of Songbirds & Sn...,2023-12-02,2023-12-03


# Create Gantt Chart

In [15]:

# Create the Gantt chart with one bar per streak
chart = alt.Chart(grouped_df).mark_bar(cornerRadius=1).encode(
    x=alt.X('yearmonthdate(start_date):T', title='Date', axis=alt.Axis(orient='top', format='%B')),  # Label ticks with month names only (format '%B')
    x2='yearmonthdate(end_date):T',
    y=alt.Y('Top Release:N', sort=sorted_movies, axis=None),  # Order rows by first time at the top
    color=alt.Color('Top Release:N', legend=None)  # Hide the legend
).properties(
    # title='Box Office Winners: Daily Top Release in 2023',
    width=1000,
    height=696
)

chart

In [16]:
# # previous

# chart = alt.Chart(df).mark_bar(cornerRadius=103).encode(
#     x=alt.X('yearmonthdate(Date):T', title='Date', axis=alt.Axis(orient='top', format='%B')),
#     y=alt.Y('Top Release:N',
#             sort=alt.EncodingSortField(field="Date", order='ascending'),
#             axis=None),  # Remove axis to match your goal
#     # color='Top Release:N'
#     color=alt.Color('Top Release:N', legend=None)
# ).properties(
#     title='Box Office Winners: Daily Top Release in 2023',
#     width=1000,
#     height=500
# )

# chart

In [17]:
# text = alt.Chart(df_with_first).mark_text(
#     align='right',
#     baseline='middle',
#     dx=-5  # Position the text 5 units to the left
# )

## Plot with movie names at the first occurrence only

In [18]:

text = alt.Chart(first_occurrence).mark_text(
    align='right',
    baseline='middle',
    dx=-5  # Position the text 5 units to the left of the start date
).encode(
    x=alt.X('yearmonthdate(first_start):T', title=None),  # Text at the first start date only
    y=alt.Y('Top Release:N', sort=sorted_movies, axis=None),  # Sort based on the first occurrence
    text='Top Release:N',
    color=alt.Color('Top Release:N', legend=None)  # Remove legend
)

text

# Plotting Titles

In [19]:
title_text1 = alt.Chart(pd.DataFrame({'x': [0.5], 'y': [0.5]})).mark_text(
    text='Box Office Winners',
    align='center',
    baseline='middle',
    fontSize=30,
    fontWeight='bold',
    color='black'
).encode(
    x=alt.value(600),  # Adjust x position inside the chart
    y=alt.value(100)   # Adjust y position inside the chart
)
title_text1

In [20]:
title_text2 = alt.Chart(pd.DataFrame({'x': [0.5], 'y': [0.5]})).mark_text(
    text='Daily Top Release in 2023',
    align='center',
    baseline='middle',
    fontSize=15,
    color='grey'
).encode(
    x=alt.value(600),  # Adjust x position inside the chart
    y=alt.value(120)   # Adjust y position inside the chart
)
title_text2

In [21]:
final_chart = chart + text + title_text1 + title_text2

final_chart

In [22]:
print("GG")

GG
