# Microsoft Movie Analysis

_Author: Valentina Valdez_

![picture of motion picture camera](Images/pexels-donald-tong-66134.jpg)

## 1.0 Business Understanding

Microsoft's decision to venture into the movie industry marks an exciting strategic shift. By telling compelling stories through film, Microsoft can connect with audiences on an emotional level and establish a stronger presence in popular culture, leading to increased brand awareness. 

Other tech companies, such as Apple and Amazon, have successfully transitioned into the entertainment industry. By leveraging its extensive technological expertise, vast resources, and global reach, Microsoft can produce high-quality movies that will enhance its brand image and increase its cultural influence.

This research seeks to use the available data to gain valuable insights into trends and consumption patterns, enabling the company to create tailored content that resonates with viewers and maximizes box office success. Within this notebook, we will explore what types of films Microsoft should make to maximize not just its return on investment but also positive brand exposure by producing critically acclaimed films. 

Given the above needs, this analysis will strive to answer the following questions:
- What kind of films have a high ROI? Which genres, directors, actors, and writers have produced films with high ROIs?
- Which directors, actors, and writers have experience creating high-prestige films?
- Are there films that have both promising ROI and prestige? If so, who may be able to produce this winning combination?

Let's dive in!

#### Add years of analysis - New Hollywood, etc

## 2.0 Data Understanding

This analysis uses a variety of trusted data sources. The datasets will be used to narrow down how Microsoft should invest in its filmmaking efforts. The sources are as follows:

- **IMDB:** Launched in 1990 - and owned by Amazon since 1998 - IMDB is one of the most popular and recognizable databases. This database houses a large amount of information such as directors, writers, genres, and release date.  
- **The Numbers:** This database was started in 1997, and is now the largest freely available database of movie business information. The available data contains information about movie titles, production budgets, and gross revenue data. 
- **The Academy Awards**: This data was created by scraping the <a href="https://awardsdatabase.oscars.org/">academy database</a> for a Kaggle competition. The Academy Awards are considered the most prestigious filmmaking awards in America, and this data will help identify individuals capable of making prestige films. This dataset contains information on Academy Award nominees and winners between the years 1927 and 2023. Access the data <a href="https://www.kaggle.com/datasets/unanimad/the-oscar-award">here</a>.


This analysis is limited by the information in these datasets and may not encompass the full scope of the filmmaking industry. However, the data is current enough that this analysis can still provide valuable insight and guide Microsoft on its next steps.


The first step in this analysis is to understand the data and how we can transform it to glean insights. First, I import the libraries needed to read the data and perform the analysis. I will then review the data sources one by one to determine what needs to be done before analysis can begin.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import sqlite3

%matplotlib inline

### 2.1 IMDB

Let's examine IMDB's database first, which is the most exhaustive of the three. I will need information on movie titles, their genres, and the directors, actors, and writers for each film.

In [2]:
#Connect to database

conn = sqlite3.connect("Data/im.db")

#Review tables
imdb_tables = pd.read_sql("""SELECT name FROM sqlite_master;""", conn)
imdb_tables

Unnamed: 0,name
0,movie_basics
1,directors
2,known_for
3,movie_akas
4,movie_ratings
5,persons
6,principals
7,writers


In [34]:
#Query relevant tables and preview data

# The query reads the writers table, so the result is named accordingly
writers_df = pd.read_sql(
"""
SELECT *
FROM writers
;""", conn)

writers_df.tail()

# Make two different tables? One for people in movies to connect to ROI, one for movie genres? Or can I make one big table?

Unnamed: 0,movie_id,person_id
255868,tt8999892,nm10122246
255869,tt8999974,nm10122357
255870,tt9001390,nm6711477
255871,tt9004986,nm4993825
255872,tt9010172,nm8352242
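
One way to answer the "one big table" question above is a long-format table with one row per movie, person, and role. The sketch below uses a toy in-memory database rather than `Data/im.db`. Only `movie_id` and `person_id` are confirmed by the preview above; the columns `primary_title`, `genres`, and `primary_name`, and the names `conn_demo` and `crew_df`, are assumptions made for illustration.

```python
import sqlite3
import pandas as pd

# Toy in-memory database standing in for Data/im.db.
# Columns other than movie_id and person_id are assumed, not confirmed.
conn_demo = sqlite3.connect(":memory:")
conn_demo.executescript("""
CREATE TABLE movie_basics (movie_id TEXT, primary_title TEXT, genres TEXT);
CREATE TABLE persons (person_id TEXT, primary_name TEXT);
CREATE TABLE writers (movie_id TEXT, person_id TEXT);
CREATE TABLE directors (movie_id TEXT, person_id TEXT);
INSERT INTO movie_basics VALUES ('tt001', 'Example Film', 'Drama,Comedy');
INSERT INTO persons VALUES ('nm001', 'Writer One');
INSERT INTO persons VALUES ('nm002', 'Director One');
INSERT INTO writers VALUES ('tt001', 'nm001');
INSERT INTO directors VALUES ('tt001', 'nm002');
""")

# One row per (movie, person, role); UNION also removes exact duplicate rows.
crew_query = """
SELECT mb.movie_id, mb.primary_title, mb.genres, p.primary_name, 'writer' AS role
FROM writers AS w
JOIN movie_basics AS mb ON mb.movie_id = w.movie_id
JOIN persons AS p ON p.person_id = w.person_id
UNION
SELECT mb.movie_id, mb.primary_title, mb.genres, p.primary_name, 'director' AS role
FROM directors AS d
JOIN movie_basics AS mb ON mb.movie_id = d.movie_id
JOIN persons AS p ON p.person_id = d.person_id
ORDER BY role;
"""
crew_df = pd.read_sql(crew_query, conn_demo)
conn_demo.close()
```

A long table like this can later be joined to ROI figures on the movie title, while a genre-level analysis could use `movie_basics` alone.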


### 2.2 The Numbers

For this dataset, the goal is to calculate the Return on Investment per film. Eventually, I can tie this information to the IMDB table and identify which genres and film staff (such as director, writers, and actors) have produced high ROIs. 

In [4]:
#Import data
numbers_df = pd.read_csv('Data/tn.movie_budgets.csv.gz')

#Preview table
numbers_df.head()

Unnamed: 0,id,release_date,movie,production_budget,domestic_gross,worldwide_gross
0,1,"Dec 18, 2009",Avatar,"$425,000,000","$760,507,625","$2,776,345,279"
1,2,"May 20, 2011",Pirates of the Caribbean: On Stranger Tides,"$410,600,000","$241,063,875","$1,045,663,875"
2,3,"Jun 7, 2019",Dark Phoenix,"$350,000,000","$42,762,350","$149,762,350"
3,4,"May 1, 2015",Avengers: Age of Ultron,"$330,600,000","$459,005,868","$1,403,013,963"
4,5,"Dec 15, 2017",Star Wars Ep. VIII: The Last Jedi,"$317,000,000","$620,181,382","$1,316,721,747"


In [5]:
#Review data structure
numbers_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5782 entries, 0 to 5781
Data columns (total 6 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   id                 5782 non-null   int64 
 1   release_date       5782 non-null   object
 2   movie              5782 non-null   object
 3   production_budget  5782 non-null   object
 4   domestic_gross     5782 non-null   object
 5   worldwide_gross    5782 non-null   object
dtypes: int64(1), object(5)
memory usage: 271.2+ KB


The initial assessment is that this data will need to be converted in the following ways:

- The budget and gross data columns need to be converted to integers 
- The release_date column needs to be converted to a datetime object
- There are no null values, but I will need to further examine to make sure that the values are valid 
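
The conversions listed above can be sketched on two rows copied from the preview. This is a minimal illustration on a throwaway frame named `sample` (a hypothetical name), not the cleaning applied to `numbers_df`. It assumes English month abbreviations in `release_date`.

```python
import pandas as pd

# Two rows copied from the preview above, kept as raw strings.
sample = pd.DataFrame({
    "release_date": ["Dec 18, 2009", "Jun 7, 2019"],
    "production_budget": ["$425,000,000", "$350,000,000"],
    "worldwide_gross": ["$2,776,345,279", "$149,762,350"],
})

money_cols = ["production_budget", "worldwide_gross"]
for col in money_cols:
    # Strip "$" and "," as literal characters (regex=False), then cast to integer.
    sample[col] = (
        sample[col]
        .str.replace("$", "", regex=False)
        .str.replace(",", "", regex=False)
        .astype("int64")
    )

# Parse dates such as "Dec 18, 2009" into datetime64 values.
sample["release_date"] = pd.to_datetime(sample["release_date"], format="%b %d, %Y")

# A simple validity check: count zero gross values, which may signal unreported figures.
n_zero_gross = int((sample["worldwide_gross"] == 0).sum())
```

Passing `regex=False` avoids any ambiguity about `$` being read as a regular-expression anchor.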

### 2.3 The Academy Awards

As mentioned earlier, Microsoft should pursue both high ROIs and prestige, and this dataset will be critical in providing information on prestige capabilities. With the help of the IMDB database, I will be able to tie genres, films, and film staff to prestige. Another interesting exploration will be finding out which writers, directors, actors, and/or genres are more likely to produce award-winning films.

In [6]:
#Import data
academy_df = pd.read_csv('Data/academy_data.csv')

#Preview data
academy_df.head()

Unnamed: 0,Year,Ceremony,Award,Winner,Name,Film
0,1927/1928,1,Actor,,Richard Barthelmess,The Noose
1,1927/1928,1,Actor,1.0,Emil Jannings,The Last Command
2,1927/1928,1,Actress,,Louise Dresser,A Ship Comes In
3,1927/1928,1,Actress,1.0,Janet Gaynor,7th Heaven
4,1927/1928,1,Actress,,Gloria Swanson,Sadie Thompson


It looks like this dataset is ordered by year. Let's compare these rows to the most recent entries.

In [7]:
academy_df.tail()

Unnamed: 0,Year,Ceremony,Award,Winner,Name,Film
9959,2015,88,Writing (Original Screenplay),1.0,Spotlight,Written by Josh Singer & Tom McCarthy
9960,2015,88,Writing (Original Screenplay),,Straight Outta Compton,Screenplay by Jonathan Herman and Andrea Berlo...
9961,2015,88,Jean Hersholt Humanitarian Award,1.0,Debbie Reynolds,
9962,2015,88,Honorary Award,1.0,Spike Lee,
9963,2015,88,Honorary Award,1.0,Gena Rowlands,


In [8]:
#Check for column types and null values
academy_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9964 entries, 0 to 9963
Data columns (total 6 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Year      9964 non-null   object 
 1   Ceremony  9964 non-null   int64  
 2   Award     9964 non-null   object 
 3   Winner    2321 non-null   float64
 4   Name      9964 non-null   object 
 5   Film      9631 non-null   object 
dtypes: float64(1), int64(1), object(4)
memory usage: 467.2+ KB


The initial assessment of this data is:

- Year: some entries span two years (for example 1927/1928) because of the Academy's eligibility window, so I will need to pick a single year. I will change Year to be the first year.
- The Ceremony column is not needed and can be dropped. I am interested in whether they received an award, not in whether the award ceremony occurred or the person was present.
- The Winner column is not strictly necessary, but I will keep it in case we want to see who has won the most awards. Ultimately, being nominated is what I am looking for.
- Winner: change NaN to some other value that means no. Missing values seem to just be the people who did not win.
- Name will be a challenge because, depending on the Award, it can be either the name of a person or the name of the film. I will need to transform this data and fill in some values from the IMDB table.
- Film: I will change missing values to some other value. Some awards are not based on a movie but may be lifetime or achievement awards.
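
A minimal sketch of these steps, using three rows taken from the previews above. The frame names `sample` and `cleaned` and the placeholder label `"No film"` are hypothetical choices for illustration, and the handling of the Name column is left out because it depends on the IMDB data.

```python
import numpy as np
import pandas as pd

# Three rows taken from the head() and tail() previews above.
sample = pd.DataFrame({
    "Year": ["1927/1928", "1927/1928", "2015"],
    "Ceremony": [1, 1, 88],
    "Award": ["Actor", "Actor", "Honorary Award"],
    "Winner": [np.nan, 1.0, 1.0],
    "Name": ["Richard Barthelmess", "Emil Jannings", "Spike Lee"],
    "Film": ["The Noose", "The Last Command", np.nan],
})

# Drop the ceremony number, which is not needed for this analysis.
cleaned = sample.drop(columns="Ceremony")

# Keep the first year of split entries such as "1927/1928".
cleaned["Year"] = cleaned["Year"].str.split("/").str[0].astype(int)

# A missing Winner value is treated as "nominated but did not win".
cleaned["Winner"] = cleaned["Winner"].fillna(0).astype(bool)

# "No film" is a hypothetical placeholder for awards not tied to a movie.
cleaned["Film"] = cleaned["Film"].fillna("No film")
```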

## 3.0 Data Preparation

Now that we have a good understanding of our data, let's clean the datasets, add some features, and combine tables.

### 3.1 IMDB

### 3.2 The Numbers

This dataset is key to finding out ROI numbers.

In [9]:
#Check for duplicates
duplicates = numbers_df[numbers_df.duplicated()]
print(len(duplicates))

0


In [10]:
#Check for extraneous values
for col in numbers_df.columns:
    print(col, '\n', numbers_df[col].value_counts(normalize=True).head(), '\n\n')

id 
 4     0.010031
53    0.010031
61    0.010031
65    0.010031
69    0.010031
Name: id, dtype: float64 


release_date 
 Dec 31, 2014    0.004151
Dec 31, 2015    0.003978
Dec 31, 2010    0.002594
Dec 31, 2008    0.002421
Dec 31, 2009    0.002248
Name: release_date, dtype: float64 


movie 
 Halloween    0.000519
Home         0.000519
King Kong    0.000519
Heist        0.000346
Venom        0.000346
Name: movie, dtype: float64 


production_budget 
 $20,000,000    0.039952
$10,000,000    0.036666
$30,000,000    0.030612
$15,000,000    0.029920
$25,000,000    0.029575
Name: production_budget, dtype: float64 


domestic_gross 
 $0             0.094777
$8,000,000     0.001557
$2,000,000     0.001211
$7,000,000     0.001211
$10,000,000    0.001038
Name: domestic_gross, dtype: float64 


worldwide_gross 
 $0             0.063473
$8,000,000     0.001557
$7,000,000     0.001038
$2,000,000     0.001038
$10,000,000    0.000692
Name: worldwide_gross, dtype: float64 




Unfortunately, this dataset contains some movies that do not have reported worldwide gross numbers. These will be dropped, but first, I will convert the columns to their appropriate data types.

In [25]:
#Converting release_date column to datetime object
numbers_df['release_date'] = pd.to_datetime(numbers_df['release_date'])

#Filter results to movies released in 1969 or later - start of the New Hollywood Era
numbers_df = numbers_df.loc[numbers_df['release_date'] >= '1969']

#Create release_year column with datatype integer

#Review results
numbers_df.head()

Unnamed: 0,id,release_date,movie,production_budget,domestic_gross,worldwide_gross,ROI
0,1,2009-12-18,Avatar,425000000,760507625,2776345279,553.26
1,2,2011-05-20,Pirates of the Caribbean: On Stranger Tides,410600000,241063875,1045663875,154.67
2,3,2019-06-07,Dark Phoenix,350000000,42762350,149762350,-57.21
3,4,2015-05-01,Avengers: Age of Ultron,330600000,459005868,1403013963,324.38
4,5,2017-12-15,Star Wars Ep. VIII: The Last Jedi,317000000,620181382,1316721747,315.37
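
The cell above leaves the `release_year` step as a comment. One possible way to add it is sketched below on a small throwaway frame. The first date comes from the preview; the other two are made-up values chosen to show which side of the cutoff they fall on.

```python
import pandas as pd

# First date is from the preview; the other two are hypothetical test values.
sample = pd.DataFrame(
    {"release_date": pd.to_datetime(["2009-12-18", "1968-06-12", "1969-01-01"])}
)

# Same comparison as the cell above: keeps rows dated 1969-01-01 or later.
sample = sample.loc[sample["release_date"] >= "1969"].copy()

# Extract the calendar year as an integer column.
sample["release_year"] = sample["release_date"].dt.year.astype(int)
```

Taking a `.copy()` after filtering avoids pandas warnings when new columns are assigned to the filtered frame.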


In [12]:
#Convert 'production_budget', 'domestic_gross', 'worldwide_gross' to integers

#Remove extra symbols from string
cols = ['production_budget', 'domestic_gross', 'worldwide_gross']

for col in cols:
    # regex=False treats '$' and ',' as literal characters
    numbers_df[col] = numbers_df[col].str.replace('$', '', regex=False).str.replace(',', '', regex=False)
numbers_df.head()

Unnamed: 0,id,release_date,movie,production_budget,domestic_gross,worldwide_gross
0,1,2009-12-18,Avatar,425000000,760507625,2776345279
1,2,2011-05-20,Pirates of the Caribbean: On Stranger Tides,410600000,241063875,1045663875
2,3,2019-06-07,Dark Phoenix,350000000,42762350,149762350
3,4,2015-05-01,Avengers: Age of Ultron,330600000,459005868,1403013963
4,5,2017-12-15,Star Wars Ep. VIII: The Last Jedi,317000000,620181382,1316721747


In [13]:
#Convert columns to integers

cols = ['production_budget', 'domestic_gross', 'worldwide_gross']
# Convert column by column (the default axis) rather than row by row
numbers_df[cols] = numbers_df[cols].apply(pd.to_numeric)
numbers_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 5632 entries, 0 to 5781
Data columns (total 6 columns):
 #   Column             Non-Null Count  Dtype         
---  ------             --------------  -----         
 0   id                 5632 non-null   int64         
 1   release_date       5632 non-null   datetime64[ns]
 2   movie              5632 non-null   object        
 3   production_budget  5632 non-null   int64         
 4   domestic_gross     5632 non-null   int64         
 5   worldwide_gross    5632 non-null   int64         
dtypes: datetime64[ns](1), int64(4), object(1)
memory usage: 308.0+ KB


In [14]:
#Drop rows where worldwide_gross is 0. 
numbers_df = numbers_df.loc[numbers_df['worldwide_gross'] > 0]
numbers_df

Unnamed: 0,id,release_date,movie,production_budget,domestic_gross,worldwide_gross
0,1,2009-12-18,Avatar,425000000,760507625,2776345279
1,2,2011-05-20,Pirates of the Caribbean: On Stranger Tides,410600000,241063875,1045663875
2,3,2019-06-07,Dark Phoenix,350000000,42762350,149762350
3,4,2015-05-01,Avengers: Age of Ultron,330600000,459005868,1403013963
4,5,2017-12-15,Star Wars Ep. VIII: The Last Jedi,317000000,620181382,1316721747
...,...,...,...,...,...,...
5775,76,2006-05-26,Cavite,7000,70071,71644
5776,77,2004-12-31,The Mongol King,7000,900,900
5778,79,1999-04-02,Following,6000,48482,240495
5779,80,2005-07-13,Return to the Land of Wonders,5000,1338,1338


Now that the data is clean, we can create the 'ROI' (Return on Investment) column with the following formula:

$$
\text{ROI} = \frac{\text{Worldwide Gross} - \text{Production Budget}}{\text{Production Budget}} \times 100
$$
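
As a quick check of the formula, taking net gross to mean worldwide gross minus production budget (which is how the code cell below computes it), the Avatar and Dark Phoenix rows from the preview work out as:

$$
\text{ROI}_{\text{Avatar}} = \frac{2{,}776{,}345{,}279 - 425{,}000{,}000}{425{,}000{,}000} \times 100 \approx 553.26
$$

$$
\text{ROI}_{\text{Dark Phoenix}} = \frac{149{,}762{,}350 - 350{,}000{,}000}{350{,}000{,}000} \times 100 \approx -57.21
$$

A negative ROI means the film grossed less worldwide than its production budget.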

In [15]:
#Adding ROI Column
numbers_df['ROI'] = (numbers_df['worldwide_gross'] - numbers_df['production_budget']) \
                    / numbers_df['production_budget'] * 100

numbers_df['ROI'] = numbers_df['ROI'].round(decimals=2)

numbers_df.head()

Unnamed: 0,id,release_date,movie,production_budget,domestic_gross,worldwide_gross,ROI
0,1,2009-12-18,Avatar,425000000,760507625,2776345279,553.26
1,2,2011-05-20,Pirates of the Caribbean: On Stranger Tides,410600000,241063875,1045663875,154.67
2,3,2019-06-07,Dark Phoenix,350000000,42762350,149762350,-57.21
3,4,2015-05-01,Avengers: Age of Ultron,330600000,459005868,1403013963,324.38
4,5,2017-12-15,Star Wars Ep. VIII: The Last Jedi,317000000,620181382,1316721747,315.37


Pending
- Define good vs. bad ROI: is an ROI above 300% good? Should the marketing budget be accounted for?
- Print the number of records at each stage to see how many data points are lost
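
Both pending items can be prototyped on a small frame before touching `numbers_df`. In this sketch, `log_stage`, `counts`, `demo`, and `ROI_THRESHOLD` are hypothetical names. The 300% cutoff is only the tentative value from the note above, not an established benchmark, and marketing costs are not modeled. The Avatar and Dark Phoenix figures come from the preview; the other two rows are made up.

```python
import pandas as pd

def log_stage(df, label, counts):
    """Record how many rows remain after a cleaning stage; return df unchanged."""
    counts[label] = len(df)
    return df

counts = {}

# Rows 1 and 3 use figures from the preview; rows 2 and 4 are made-up examples.
demo = pd.DataFrame({
    "release_year": [2009, 1965, 2019, 2004],
    "production_budget": [425_000_000, 1_000_000, 350_000_000, 7_000],
    "worldwide_gross": [2_776_345_279, 5_000_000, 149_762_350, 0],
})

demo = log_stage(demo, "raw", counts)
demo = log_stage(demo[demo["release_year"] >= 1969], "released 1969 or later", counts)
demo = log_stage(demo[demo["worldwide_gross"] > 0].copy(), "nonzero worldwide gross", counts)

# ROI as a percentage of the production budget.
demo["ROI"] = (
    (demo["worldwide_gross"] - demo["production_budget"])
    / demo["production_budget"] * 100
)

# Tentative cutoff taken from the pending note; adjust once a benchmark is chosen.
ROI_THRESHOLD = 300
demo["high_roi"] = demo["ROI"] > ROI_THRESHOLD

for label, n in counts.items():
    print(f"{label}: {n} rows")
```

Keeping the counts in a dictionary makes it easy to report how many records each filter removes.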

### 3.3 The Academy Awards

In [None]:
#Reviewing the data again
academy_df.head()

In [None]:
#Checking for duplicates
duplicates = academy_df[academy_df.duplicated()]
print(len(duplicates))

In [None]:
#Checking for extraneous values. 
for col in academy_df.columns:
    print(col, '\n', academy_df[col].value_counts(normalize=True).head(), '\n\n')

Oscars Data Part II

In [20]:
oscars_df = pd.read_csv('Data/the_oscar_award.csv')
oscars_df.head()

Unnamed: 0,year_film,year_ceremony,ceremony,category,name,film,winner
0,1927,1928,1,ACTOR,Richard Barthelmess,The Noose,False
1,1927,1928,1,ACTOR,Emil Jannings,The Last Command,True
2,1927,1928,1,ACTRESS,Louise Dresser,A Ship Comes In,False
3,1927,1928,1,ACTRESS,Janet Gaynor,7th Heaven,True
4,1927,1928,1,ACTRESS,Gloria Swanson,Sadie Thompson,False


In [21]:
oscars_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10765 entries, 0 to 10764
Data columns (total 7 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   year_film      10765 non-null  int64 
 1   year_ceremony  10765 non-null  int64 
 2   ceremony       10765 non-null  int64 
 3   category       10765 non-null  object
 4   name           10761 non-null  object
 5   film           10450 non-null  object
 6   winner         10765 non-null  bool  
dtypes: bool(1), int64(3), object(3)
memory usage: 515.2+ KB


In [23]:
oscars_df = oscars_df.loc[oscars_df['year_film'] > 1969]
oscars_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 5992 entries, 4773 to 10764
Data columns (total 7 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   year_film      5992 non-null   int64 
 1   year_ceremony  5992 non-null   int64 
 2   ceremony       5992 non-null   int64 
 3   category       5992 non-null   object
 4   name           5988 non-null   object
 5   film           5846 non-null   object
 6   winner         5992 non-null   bool  
dtypes: bool(1), int64(3), object(3)
memory usage: 333.5+ KB


### 3.4 Merging and Combining Datasets

## 4.0 Exploratory Data Analysis

### 4.1 Which movie genres have the highest ROI?

## Recommendations

After this preliminary review, we recommend that Microsoft invest in the following strategies:



However, much research is still to be done. The biggest recommendation is to continue this research and explore the following: 