# First feedback from Abby

What I can look at is how genre preferences for movies changed before and after the pandemic.

I can look at x years before COVID and x years after COVID, once the theatres were open again. So drop all the movies that were released during COVID.

Then do the analysis.
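
A minimal sketch of that split is below. The year windows (2017-2019 as pre-COVID, 2022-2023 as post-COVID, with 2020-2021 dropped) are placeholders for the undecided "x years", and the `period` column and `label_period` helper are hypothetical names used only for illustration.

```python
import pandas as pd

# Hypothetical windows; adjust once "x years" is decided.
PRE_YEARS = range(2017, 2020)    # 2017, 2018, 2019
POST_YEARS = range(2022, 2024)   # 2022, 2023

def label_period(year):
    """Return 'pre', 'post', or None for years outside both windows."""
    if year in PRE_YEARS:
        return "pre"
    if year in POST_YEARS:
        return "post"
    return None

# Toy data standing in for the real movie table.
movies = pd.DataFrame({
    "primaryTitle": ["A", "B", "C", "D"],
    "startYear": [2018, 2020, 2021, 2023],
})

movies["period"] = movies["startYear"].map(label_period)
# Rows released during the excluded window have no label and are dropped.
movies = movies.dropna(subset=["period"]).reset_index(drop=True)
```

Keeping the windows as named constants makes it easy to rerun the analysis with a different x and check how sensitive the results are to that choice.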

# Project Structure for Analyzing Movie Genre Performance Before and After COVID-19:

### Objective:

Assess the financial performance of various movie genres before and after the onset of the COVID-19 pandemic.

### Data Collection:

Utilize the IMDb dataset to gather information on movies released in the years before and after the pandemic began (e.g., 2018-2019 for pre-COVID and 2020-2021 for post-COVID).
Focus on key data points like genre, box office earnings, budget, and release dates.

### Analysis:

Comparative Revenue Analysis: Compare the total and average box office revenues of different genres in both periods.
Genre Popularity Shifts: Identify any shifts in genre popularity. For example, did certain genres like horror or comedy become more popular post-COVID?
Budget vs. Revenue: Examine if there's a change in the budget-to-revenue ratio across genres before and after COVID-19.
Release Strategy Changes: Assess how release strategies (theatrical, streaming, hybrid) varied across genres and periods.
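
The revenue comparison and the popularity shift could be computed along the following lines. This is a sketch on made-up numbers: the `period` and `gross` columns are assumed to exist already, and counting a multi-genre film once per genre is one possible convention, not the only one.

```python
import pandas as pd

# Toy data; the column layout mimics the merged tables, the values are invented.
df = pd.DataFrame({
    "primaryTitle": ["A", "B", "C", "D"],
    "genres": ["Horror,Thriller", "Comedy", "Horror", "Comedy,Romance"],
    "period": ["pre", "pre", "post", "post"],
    "gross": [100.0, 300.0, 200.0, 100.0],
})

# One row per (movie, genre): a multi-genre film counts toward each of its genres.
long = df.assign(genre=df["genres"].str.split(",")).explode("genre")

# Total and average gross per genre and period.
summary = (
    long.groupby(["genre", "period"])["gross"]
        .agg(total="sum", average="mean")
        .reset_index()
)

# Each genre's share of its period's total gross, then the pre-to-post change.
totals = summary.pivot(index="genre", columns="period", values="total").fillna(0.0)
shares = totals / totals.sum()
shares["shift"] = shares["post"] - shares["pre"]
```

Comparing shares rather than raw totals separates a shift in genre preferences from the overall drop in box office revenue. Because multi-genre films are counted more than once, the per-period totals here exceed the true box office sum.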

### Economic Theories and Models:

Apply microeconomic theories to explain changes in consumer preferences and market dynamics.
Consider demand and supply shifts due to external factors like lockdowns, economic uncertainty, and changes in consumer behavior.

### Results and Interpretation:

Present your findings in a clear, structured format.
Interpret the results in the context of economic theories and the unique circumstances of the pandemic.

### Conclusions:

Summarize key insights regarding the impact of COVID-19 on movie genre performance.
Discuss potential long-term implications for the film industry.

### Limitations and Further Research:

Acknowledge any limitations in your data or analysis.
Suggest areas for further research, such as a more detailed analysis of consumer preferences or the impact on independent films.

This project can offer a comprehensive view of how extraordinary events like a pandemic can influence entertainment industry trends, providing a practical application of economic concepts to real-world scenarios. Remember to keep your analysis aligned with the principles of microeconomics.

Columns you should consider including in your analysis:

- Movie Title: The name of the movie. This is essential for identification and reference.

- Release Year: The year the movie was released. This will help you categorize movies into pre-COVID and post-COVID groups.

- Genre: The genre(s) of the movie. This is crucial for your primary analysis of comparing different genres.

- Box Office Revenue: The total earnings from ticket sales. This data is key to assessing financial performance.

- Budget: The estimated cost of producing the movie. This allows for analysis of profitability and budget-to-revenue ratios.

- Production Company: The company or companies that produced the movie. This can provide insights into market share and industry dynamics.

- Director and Key Cast Members: This information can be used to assess star power and its potential impact on a movie's success.

- IMDb Rating: Viewer ratings from IMDb. This can give a sense of critical reception and popularity.

- Number of Theaters Released In (if available): This data can help understand the scale of the release and its potential market reach.

- Streaming Platform Availability (post-COVID): For post-COVID movies, information on whether the movie was released on streaming platforms, and if so, which ones.

- Country/Countries of Origin: This helps in understanding geographical trends and the impact of regional COVID-19 measures.

- Runtime: The length of the movie, which can sometimes correlate with genre and audience preference.

- MPAA Rating/Censorship Rating: This information can provide insights into the target audience.

- Awards and Nominations: To gauge critical acclaim and its potential impact on financial success.
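
As a starting point, some of these fields can be matched to columns that appear in the outputs further down this notebook. The mapping and the `select_analysis_columns` helper below are assumptions for illustration. Fields such as budget, awards, and streaming availability do not appear in the outputs shown here and would likely need another source.

```python
import pandas as pd

# Assumed mapping from columns seen in this notebook to shorter analysis names.
COLUMN_MAP = {
    "primaryTitle": "title",
    "startYear": "release_year",
    "genres": "genres",
    "averageRating": "imdb_rating",
    "runtimeMinutes": "runtime_min",
}

def select_analysis_columns(df, column_map=COLUMN_MAP):
    """Keep only the mapped columns that exist in df and rename them."""
    present = {old: new for old, new in column_map.items() if old in df.columns}
    return df[list(present)].rename(columns=present)

# Toy frame with one extra column (tconst) that should be dropped.
sample = pd.DataFrame({
    "tconst": ["tt0000001"],
    "primaryTitle": ["Carmencita"],
    "startYear": [1894],
    "genres": ["Documentary,Short"],
    "averageRating": [5.7],
})
tidy = select_analysis_columns(sample)
```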

In [8]:
import glob
import math
import os
from datetime import datetime

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from pandas.tseries.offsets import MonthEnd, DateOffset

os.chdir('/Users/shreyashgupta/Library/CloudStorage/OneDrive-UniversityofArkansas/MSEA/ECON 5813 - Economic Analytics I/Project Proposal/Data')

In [9]:
# Load the datasets
title_basics = pd.read_csv('IMDb Basics.tsv', sep='\t', low_memory=False)
name_basics = pd.read_csv('IMDb Name Basics.tsv', sep='\t', low_memory=False)
title_ratings = pd.read_csv('IMDb Ratings.tsv', sep='\t', low_memory=False)
title_principals = pd.read_csv('IMDb Title Principals.tsv', sep='\t', low_memory=False)

# Merging the datasets
merged_data1 = pd.merge(title_basics, title_ratings, on='tconst')
merged_data2 = pd.merge(merged_data1, title_principals, on='tconst')
merged_df = pd.merge(merged_data2, name_basics, on='nconst')
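
The IMDb files mark missing values with `\N`, and the merges above join every title type with every credited person. The sketch below shows one way to handle both: parse `\N` as missing via `na_values` and filter to feature films before merging. It runs on a tiny in-memory table rather than the real files, and filtering on `titleType == "movie"` is an assumption about which titles the project needs.

```python
import io
import pandas as pd

# Tiny in-memory stand-in for an IMDb TSV file.
raw = (
    "tconst\ttitleType\tprimaryTitle\tstartYear\truntimeMinutes\n"
    "tt1\tmovie\tFilm One\t2019\t120\n"
    "tt2\tshort\tShort One\t2019\t\\N\n"
    "tt3\tmovie\tFilm Two\t\\N\t95\n"
)

# na_values turns the backslash-N placeholder into NaN so numeric columns parse as numbers.
basics = pd.read_csv(io.StringIO(raw), sep="\t", na_values="\\N")

# Keep feature films with a known year before any merge, to shrink the join.
films = basics[(basics["titleType"] == "movie") & basics["startYear"].notna()]
```

Filtering first keeps the later merges much smaller, since shorts and TV episodes never enter the join.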

In [10]:
merged_df.head()

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres,averageRating,...,ordering,nconst,category,job,characters,primaryName,birthYear,deathYear,primaryProfession,knownForTitles
0,tt0000001,short,Carmencita,Carmencita,0,1894,\N,1,"Documentary,Short",5.7,...,1,nm1588970,self,\N,"[""Self""]",Carmencita,1868,1910,soundtrack,"tt0000001,tt0057728"
1,tt0000001,short,Carmencita,Carmencita,0,1894,\N,1,"Documentary,Short",5.7,...,2,nm0005690,director,\N,\N,William K.L. Dickson,1860,1935,"cinematographer,director,producer","tt1428455,tt0219560,tt0308254,tt1496763"
2,tt0000005,short,Blacksmith Scene,Blacksmith Scene,0,1893,\N,1,"Comedy,Short",6.2,...,3,nm0005690,director,\N,\N,William K.L. Dickson,1860,1935,"cinematographer,director,producer","tt1428455,tt0219560,tt0308254,tt1496763"
3,tt0000006,short,Chinese Opium Den,Chinese Opium Den,0,1894,\N,1,Short,5.0,...,1,nm0005690,director,\N,\N,William K.L. Dickson,1860,1935,"cinematographer,director,producer","tt1428455,tt0219560,tt0308254,tt1496763"
4,tt0000007,short,Corbett and Courtney Before the Kinetograph,Corbett and Courtney Before the Kinetograph,0,1894,\N,1,"Short,Sport",5.4,...,3,nm0005690,director,\N,\N,William K.L. Dickson,1860,1935,"cinematographer,director,producer","tt1428455,tt0219560,tt0308254,tt1496763"


In [11]:
# Define the path to the files
path_to_files = '/Users/shreyashgupta/Library/CloudStorage/OneDrive-UniversityofArkansas/MSEA/ECON 5813 - Economic Analytics I/Project Proposal/Data/Gross Data/'

# Initialize an empty DataFrame for the Revenues_dataset
Revenues_dataset = pd.DataFrame()

# Define a helper function to extract the year from the filename
def get_year_from_filename(file_name):
    # Remove the extension explicitly: str.strip('.csv') strips any of the
    # characters '.', 'c', 's', 'v' from both ends, not the '.csv' suffix
    stem = os.path.splitext(file_name)[0]
    for month in ["January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "December"]:
        if month in stem:
            return stem.split(month)[1].strip()
    return None

# Loop through each CSV file in the directory
for file_name in os.listdir(path_to_files):
    if file_name.endswith('.csv'):
        # Construct the full file path
        file_path = os.path.join(path_to_files, file_name)
        # Read the CSV file into a temporary DataFrame
        temp_df = pd.read_csv(file_path)
        # Extract the year from the file name
        year = get_year_from_filename(file_name)
        if year:
            # Assume that 'Release Date' column contains the day and abbreviated month (e.g., '17-Feb')
            # If day is missing, we'll use '01' as a placeholder
            day = temp_df['Release Date'].str.extract(r'(\d+)-[A-Za-z]+', expand=False).fillna('01')
            month = temp_df['Release Date'].str.extract(r'([A-Za-z]+)', expand=False)
            temp_df['Release Date'] = day + '-' + month + '-' + year
            # Convert 'Release Date' to datetime format
            temp_df['Release Date'] = pd.to_datetime(temp_df['Release Date'], format='%d-%b-%Y', errors='coerce')
            # Drop rows where 'Release Date' could not be parsed
            temp_df.dropna(subset=['Release Date'], inplace=True)
            # Format 'Release Date' as 'MM/YYYY'
            temp_df['Release Date'] = temp_df['Release Date'].dt.strftime('%m/%Y')
            # Append to the main DataFrame
            Revenues_dataset = pd.concat([Revenues_dataset, temp_df], ignore_index=True)

# Remove rows with NaN 'Release Date' before dropping duplicates
Revenues_dataset.dropna(subset=['Release Date'], inplace=True)

# Drop duplicates based on the 'Release' column
Revenues_dataset = Revenues_dataset.drop_duplicates(subset='Release')

# Drop the unnecessary 'Rank' and 'Gross' columns
Revenues_dataset.drop(['Rank', 'Gross'], axis=1, inplace=True)

# Drop rows with null values in 'Total Gross'
Revenues_dataset = Revenues_dataset.dropna(subset=['Total Gross'])
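
One caveat with taking the year from the file name: a film released in December that is still listed in the following January's file would be stamped with the later year. The helper below is a hypothetical sketch of one way to correct for that. It assumes file months are full English names and release dates look like '20-Dec' or 'Dec', as the comments above suggest.

```python
MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]

def infer_release_year(release_date, file_month, file_year):
    """Guess the release year for a 'DD-Mon' (or 'Mon') string from a monthly file.

    Assumption: a release month later than the file's month means the film
    opened in the previous calendar year and is still in theaters.
    """
    release_month = MONTHS.index(release_date.split("-")[-1][:3]) + 1
    current_month = MONTHS.index(file_month[:3]) + 1
    if release_month > current_month:
        return file_year - 1
    return file_year
```

For example, `infer_release_year("20-Dec", "January", 2020)` gives 2019. The rule cannot tell apart a film that has been in theaters for more than a year, so any such cases would need a manual check.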

In [12]:
Revenues_dataset.head()

Unnamed: 0,Release,Theaters,Total Gross,Release Date,Distributor
0,It Chapter Two,4570,"$211,593,228",09/2019,Warner Bros.
1,Hustlers,3525,"$104,963,598",09/2019,STX Entertainment
2,Downton Abbey,3548,"$96,853,865",09/2019,Focus Features
3,Ad Astra,3460,"$50,188,370",09/2019,Twentieth Century Fox
4,Rambo: Last Blood,3618,"$44,819,352",09/2019,Lions Gate Films
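
`Total Gross` is stored as text such as "$211,593,228", so it has to be converted before any sums or averages. A minimal sketch, using two rows from the output above and a hypothetical `total_gross_usd` column name:

```python
import pandas as pd

revenues = pd.DataFrame({
    "Release": ["It Chapter Two", "Hustlers"],
    "Total Gross": ["$211,593,228", "$104,963,598"],
})

# Strip the dollar sign and thousands separators, then convert to integers.
revenues["total_gross_usd"] = (
    revenues["Total Gross"]
        .str.replace(r"[$,]", "", regex=True)
        .astype("int64")
)
```

This works here because rows with a missing `Total Gross` were already dropped. If other non-numeric values can occur, `pd.to_numeric(..., errors="coerce")` is a more forgiving alternative.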
