# Project #2: Web Scraping Data Analysis & Visualization
## By: Shreya Kamath

### In this project, I will be analyzing Wikipedia data surrounding a central theme: movies. I plan on analyzing data regarding awards, notable actors and actresses, and film earnings.
### Some of the questions I am trying to answer include:
### 1) Which decade had the most top-earning movies?
### 2) Do the highest-grossing movies have a better likelihood of being nominated for the Academy Award for Best Picture?
### 3) Do the highest-paid actors and actresses get nominated for more Academy Awards for Best Actor/Actress?
### 4) Do more expensive films make more money at the box office?

# Part 1: Package Imports

### In this section, I will be importing all the necessary packages needed for this project

In [None]:
# Web Scraping
import requests
from bs4 import BeautifulSoup

In [None]:
# Data Frames
import pandas as pd

In [None]:
# Data Visualization
import seaborn as sns
import matplotlib.pyplot as plt

# Part 2: Web Scraping

### Note: In this section, I followed this video tutorial: https://www.youtube.com/watch?v=8dTpNajxaH0 to help me scrape my Wikipedia pages. Although I ended up having to make some of my own tweaks during the scraping process, this tutorial provided a great starting point.

## Webpage 1: Highest Grossing Films

### Step 1: Setting up a web scraping pipeline to collect all the data from the Wikitable

In [None]:
# Getting all the text from the webpage and placing it in a soup object
url1 = 'https://en.wikipedia.org/wiki/List_of_highest-grossing_films_in_the_United_States_and_Canada'
page1 = requests.get(url1)
soup1 = BeautifulSoup(page1.text, 'html')

In [None]:
soup1

In [None]:
# Finding the exact table on the webpage I want to get data from
table1 = soup1.find_all('table', class_ = 'wikitable sortable plainrowheaders')[1]

In [None]:
table1

In [None]:
# Finding all the titles for the table in the <th> tags
titles1 = table1.find_all('th')
table1_titles = [title.text.strip() for title in titles1]

In [None]:
table1_titles

### Step 2: Creating a Pandas dataframe and placing the scraped data into it

In [None]:
# Creating a data frame with the table titles as column names
highestGrossing = pd.DataFrame(columns = table1_titles)

In [None]:
highestGrossing

In [None]:
# Finding the data to fill the columns within the <tr> tags
tbl1_col_data = table1.find_all('tr')

In [None]:
for row in tbl1_col_data[1:]: # Start at the second row to skip the header row
    row_data = row.find_all('td') # Find all the <td> cells within the row
    individual_row_data = [data.text.strip() for data in row_data] # Extract each cell's text and strip surrounding whitespace
    length = len(highestGrossing) # Number of rows currently in the dataframe, used as the next row index
    highestGrossing.loc[length] = individual_row_data # Append the list as a new row (its length must match the number of columns)

In [None]:
#List of highest grossing films adjusted for inflation
highestGrossing

## Webpage 2: Most Expensive Films

### Step 1: Setting up a web scraping pipeline to collect all the data from the Wikitable

In [None]:
# Getting all the text from the webpage and placing it in a soup object
url2 = 'https://en.wikipedia.org/wiki/List_of_most_expensive_films'
page2 = requests.get(url2)
soup2 = BeautifulSoup(page2.text, 'html')

In [None]:
soup2

In [None]:
# Finding the exact table on the webpage I want to get data from
table2 = soup2.find_all('table', class_ = 'wikitable sortable plainrowheaders')[0]

In [None]:
table2

In [None]:
# Finding all the titles for the table in the <th> tags
titles2 = table2.find_all('th')
table2_titles = [title.text.strip() for title in titles2]
table2_titles

In [None]:
# Slicing the list to have two separate lists: one for actual table headers, and one for movie titles
table_headers = table2_titles[:5]
table_headers
movie_titles = table2_titles[5:]
movie_titles

#### Note: This table had placed the movie titles within header tags for some reason, so I had to split the list like this in order to still be able to keep the movie titles to use

### Step 2: Creating a Pandas dataframe and placing the scraped data into it

In [None]:
# Creating a data frame with the table titles as column names
mostExpensive = pd.DataFrame(columns = table_headers)

In [None]:
# Dropping columns
mostExpensive.drop('Rank', axis=1, inplace=True)
mostExpensive

#### Note: I did not need this column, and the movie titles being in header tags made it awkward to fill, so I removed it beforehand to make it easier to put the data into the dataframe

In [None]:
# Finding the data to fill the columns within the <tr> tags
tbl2_col_data = table2.find_all('tr')

In [None]:
index = 0 # Index into the movie titles list, so each title can be added to its row
for row in tbl2_col_data[1:]: # Start at the second row to skip the header row
    row_data = row.find_all('td') # Find all the <td> cells within the row
    individual_row_data = [data.text.strip() for data in row_data] # Extract each cell's text and strip surrounding whitespace
    individual_row_data.insert(1, movie_titles[index]) # Insert the movie title, which the table stores in a <th> tag, into the row data
    individual_row_data = individual_row_data[1:] # Slice off the first element to remove the rank data, as that column was dropped
    if len(individual_row_data) != 4: # Handle rows that are missing a cell due to the odd structure of the Wikitable
        individual_row_data.insert(1, 'XXXX') # Put a placeholder in the year column, because it will eventually be removed
    index += 1
    length = len(mostExpensive) # Number of rows currently in the dataframe, used as the next row index
    mostExpensive.loc[length] = individual_row_data # Append the list as a new row (its length must match the number of columns)

In [None]:
mostExpensive
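
#### Note: The padding step in the loop above can be tried in isolation on a few hand-written rows. This is only a sketch: the `pad_row` helper and the sample rows are made up for illustration and are not taken from the Wikitable. Unlike the loop above, it only pads rows that are exactly one value short, so any other malformed row stays visible instead of being silently changed.

```python
# Hypothetical helper (not part of the notebook): pad a row that is one value short
def pad_row(row, n_cols, position=1, placeholder='XXXX'):
    """Return a copy of row, with a placeholder inserted at position if one value is missing."""
    padded = list(row)
    if len(padded) == n_cols - 1:
        padded.insert(position, placeholder)
    return padded

# Made-up rows in the order Title, Year, Cost, Refs; the second row is missing its year
sample_rows = [
    ['Film A', '2015', 'cost A', '[1]'],
    ['Film B', 'cost B', '[2]'],
]

padded_rows = [pad_row(row, n_cols=4) for row in sample_rows]
for row in padded_rows:
    print(row)
```

After padding, every row has four values, so each one can be assigned to a dataframe with four columns.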

### Step 3: Performing additional cleaning operations on the dataframe to make it useful for answering my question

In [None]:
# Drop columns that are unnecessary for my data analysis
mostExpensive.drop('Year', axis=1, inplace=True)
mostExpensive.drop('Refs and notes', axis=1, inplace=True)
# List of the most expensive films to make
mostExpensive

#### Note: I dropped these columns because they weren't necessary for my data analysis

## Webpage 3: Nominees for Academy Award for Best Picture

### Step 1: Setting up a web scraping pipeline to collect all the data from the Wikitable

In [None]:
# Getting all the text from the webpage and placing it in a soup object
url3 = 'https://en.wikipedia.org/wiki/Academy_Award_for_Best_Picture'
page3 = requests.get(url3)
soup3 = BeautifulSoup(page3.text, 'html')

In [None]:
soup3

In [None]:
# Finding the exact table on the webpage I want to get data from
table3 = soup3.find_all('table', class_ = 'wikitable sortable sticky-header')[1]

In [None]:
# Finding all the titles for the table in the <th> tags
titles3 = table3.find_all('th')

In [None]:
table3_titles = [title.text.strip() for title in titles3]
table3_titles = table3_titles[:3]

In [None]:
table3_titles

### Step 2: Creating a Pandas dataframe and placing the scraped data into it

In [None]:
# Creating a data frame with the table titles as column names
bestPicture = pd.DataFrame(columns = table3_titles)

In [None]:
bestPicture

In [None]:
# Dropping unnecessary columns
bestPicture.drop('Year of Film Release', axis=1, inplace=True)
bestPicture

#### Note: The year column was unnecessary for my analysis, so I got rid of it

In [None]:
# The nominees are spread across 11 Wikitables with the same class, so loop over them instead of repeating the cell
noms_tables = soup3.find_all('table', class_ = 'wikitable sortable sticky-header')
for i in range(11): # Tables at indexes 0 through 10
    noms_col_data = noms_tables[i].find_all('tr') # Find all the rows in this table within the <tr> tags
    for row in noms_col_data:
        row_data = row.find_all('td') # Find all the <td> cells within the row
        individual_row_data = [data.text.strip() for data in row_data] # Extract each cell's text and strip surrounding whitespace
        if not individual_row_data: # An empty list means the row has no <td> cells, which indicates a header row
            continue # Don't add it to the df, just move on to the next row in the Wikitable
        bestPicture.loc[len(bestPicture)] = individual_row_data # Append the list as a new row (its length must match the number of columns)

In [None]:
# Table with all the nominees for the Academy Award for Best Picture
bestPicture

## Webpage 4: Highest Paid Actors and Actresses

### Step 1: Setting up a web scraping pipeline to collect all the data from the Wikitable

In [None]:
# Getting all the text from the webpage and placing it in a soup object
url4 = 'https://en.wikipedia.org/wiki/List_of_highest-paid_film_actors#'
page4 = requests.get(url4)
soup4 = BeautifulSoup(page4.text, 'html')

In [None]:
# Finding the exact table on the webpage I want to get data from
table4 = soup4.find_all('table', class_ = 'wikitable sortable plainrowheaders')[1]

In [None]:
# Finding all the titles for the table in the <th> tags
titles4 = table4.find_all('th')
table4_titles = [title.text.strip() for title in titles4]
table4_titles = table4_titles[:6]

In [None]:
table4_titles

### Step 2: Creating a Pandas dataframe and placing the scraped data into it

In [None]:
# Creating a data frame with the table titles as column names
highestPaid = pd.DataFrame(columns = table4_titles)
highestPaid

In [None]:
# Dropping unnecessary columns
highestPaid.drop('Year', axis=1, inplace=True)
highestPaid

#### Note: I dropped the year column because it was unnecessary for my analysis

In [None]:
# Finding the data to fill the columns within the <tr> tags
tbl4_col_data = table4.find_all('tr')

In [None]:
for row in tbl4_col_data[1:]: # Start at the second row to skip the header row
    row_data = row.find_all('td') # Find all the <td> cells within the row
    individual_row_data = [data.text.strip() for data in row_data] # Extract each cell's text and strip surrounding whitespace
    length = len(highestPaid) # Number of rows currently in the dataframe, used as the next row index
    highestPaid.loc[length] = individual_row_data # Append the list as a new row (its length must match the number of columns)

In [None]:
highestPaid

### Step 3: Performing additional cleaning operations on the dataframe to make it useful for answering my question

In [None]:
# Drop unnecessary columns
highestPaid.drop('Earnings', axis=1, inplace=True)
highestPaid.drop('Ref.', axis=1, inplace=True)
highestPaid

#### Note: I dropped these columns because they were unnecessary for my data analysis

In [None]:
# Splitting my big dataframe at the second column to create a separate frame for actor data
highestPaidActors = highestPaid.iloc[:, :1]
highestPaidActors

#### Note: I learned how to split my dataframe into two from a suggestion on Stack Overflow: https://stackoverflow.com/questions/41624241/pandas-split-dataframe-into-two-dataframes-at-a-specific-column

In [None]:
# Splitting my big dataframe at the second column to create a separate frame for actress data
highestPaidActresses = highestPaid.iloc[:, 1:]
highestPaidActresses

In [None]:
# Dropping repeated instances of a name so that each name occurs only once
highestPaidActors = highestPaidActors.drop_duplicates(subset=['Actor'], keep='first')
highestPaidActors

#### Note: I used this GeeksForGeeks tutorial to help me with the drop duplicates method: https://www.geeksforgeeks.org/pandas/python-pandas-dataframe-drop_duplicates/

#### Note: The Wikitable this originally came from had listed the highest paid actor/actress for each year. Because an actor could be the highest paid actor 2+ years in a row, I removed duplicates so that only one instance of the actor remained
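
#### Note: As a rough picture of what `keep='first'` does, the same idea can be written in plain Python. This is only an illustration, and the names below are placeholders rather than values from the Wikitable.

```python
# Placeholder names standing in for one column of the yearly table
yearly_names = ['Actor A', 'Actor B', 'Actor A', 'Actor C', 'Actor B']

# dict.fromkeys keeps the first occurrence of each key and preserves insertion order
unique_names = list(dict.fromkeys(yearly_names))
print(unique_names)
```

One difference worth keeping in mind: `drop_duplicates` also keeps the original index labels of the rows that survive, which a plain list does not have.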

In [None]:
# Dropping repeated instances of a name so that each name occurs only once
highestPaidActresses = highestPaidActresses.drop_duplicates(subset=['Actress'], keep='first')
highestPaidActresses

## Webpage 5: Nominees for Academy Award for Best Actress

### Step 1: Setting up a web scraping pipeline to collect all the data from the Wikitable

In [None]:
# Getting all the text from the webpage and placing it in a soup object
url5 = 'https://en.wikipedia.org/wiki/Academy_Award_for_Best_Actress#'
page5 = requests.get(url5)
soup5 = BeautifulSoup(page5.text, 'html')

In [None]:
soup5

In [None]:
# Finding the exact table on the webpage I want to get data from
table5 = soup5.find_all('table', class_ = 'wikitable sortable')[0]

In [None]:
# Finding all the titles for the table in the <th> tags
titles5 = table5.find_all('th')
table5_titles = [title.text.strip() for title in titles5]
table5_titles = table5_titles[:5]
table5_titles

### Step 2: Creating a Pandas dataframe and placing the scraped data into it

In [None]:
# Creating a data frame with the table titles as column names
bestActresses = pd.DataFrame(columns = table5_titles)

In [None]:
bestActresses

In [None]:
# Dropping unnecessary columns
bestActresses.drop('Year', axis=1, inplace=True)
bestActresses.drop('Role(s)', axis=1, inplace=True)
bestActresses.drop('Film', axis=1, inplace=True)
bestActresses.drop('Ref.', axis=1, inplace=True)
bestActresses

#### Note: I just wanted this to be a list of actresses who were nominees/winners, so I got rid of all the columns except the actress names

In [None]:
actress_tbl7 = soup5.find_all('table', class_ = 'wikitable sortable')[6] # Finding the exact table on the webpage I want to get data from
actress_col_data7 = actress_tbl7.find_all('tr') # Find all the rows in the table within the <tr> tags
for row in actress_col_data7[36:]: # Choosing which row in the table I want to start from (see note)
    row_data = row.find_all('td') # Find all the <td> cells within the row
    individual_row_data = [data.text.strip() for data in row_data] # Extract each cell's text and strip surrounding whitespace
    individual_row_data = individual_row_data[:1] # Slice the data to keep only the name of the actress, because that's all that's needed

    length = len(bestActresses) # Number of rows currently in the dataframe, used as the next row index
    if not individual_row_data: # An empty list means the row has no <td> cells, which indicates a header row
        continue # Don't add it to the df, just move on to the next row in the Wikitable
    else:
        bestActresses.loc[length] = individual_row_data # Append the name as a new row

#### Note: I start from Row 36 in this table because that's where the year 1987 is on the Wikitable. The list this would be compared to only contains data from 1987 onwards, so I wanted to only have actress nominations from 1987 onwards in my df
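
#### Note: Hard-coding row 36 only works while the Wikitable keeps its current layout. One possible alternative, sketched here on made-up rows, is to search for the first row whose first cell starts with the year. The `find_start_row` helper and the sample rows are hypothetical, and the sketch assumes each row is a list of cell texts in which the year, when present, comes first.

```python
# Hypothetical helper (not part of the notebook): locate the row where a given year begins
def find_start_row(rows, year):
    """Return the index of the first row whose first cell starts with year, or None."""
    for i, row in enumerate(rows):
        if row and row[0].startswith(str(year)):
            return i
    return None

# Made-up rows: an empty header row, then year rows followed by nominee-only rows
sample_rows = [
    [],
    ['1986 (59th)', 'Nominee A'],
    ['Nominee B'],
    ['1987 (60th)', 'Nominee C'],
    ['Nominee D'],
]

start = find_start_row(sample_rows, 1987)
print(start, sample_rows[start:])
```

Returning `None` when the year is not found makes it easy to notice that the page layout has changed, rather than silently slicing from the wrong row.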

In [None]:
# Repeat the same steps for the remaining tables (indexes 7 through 10), this time using every row
actress_tables = soup5.find_all('table', class_ = 'wikitable sortable')
for i in range(7, 11):
    actress_col_data = actress_tables[i].find_all('tr') # Find all the rows in this table within the <tr> tags
    for row in actress_col_data:
        row_data = row.find_all('td') # Find all the <td> cells within the row
        individual_row_data = [data.text.strip() for data in row_data] # Extract each cell's text and strip surrounding whitespace
        individual_row_data = individual_row_data[:1] # Keep only the name of the actress
        if not individual_row_data: # An empty list means the row has no <td> cells, which indicates a header row
            continue
        bestActresses.loc[len(bestActresses)] = individual_row_data # Append the name as a new row

In [None]:
bestActresses

## Webpage 6: Nominees for Academy Award for Best Actor

### Step 1: Setting up a web scraping pipeline to collect all the data from the Wikitable

In [None]:
# Getting all the text from the webpage and placing it in a soup object
url6 = 'https://en.wikipedia.org/wiki/Academy_Award_for_Best_Actor#'
page6 = requests.get(url6)
soup6 = BeautifulSoup(page6.text, 'html')

In [None]:
soup6

In [None]:
# Finding the exact table on the webpage I want to get data from
table6 = soup6.find_all('table', class_ = 'wikitable sortable')[0]

In [None]:
table6

In [None]:
# Finding all the titles for the table in the <th> tags
titles6 = table6.find_all('th')
table6_titles = [title.text.strip() for title in titles6]
table6_titles = table6_titles[:5]
table6_titles

### Step 2: Creating a Pandas dataframe to place all the scraped data into

In [None]:
# Creating a data frame with the table titles as column names
bestActors = pd.DataFrame(columns = table6_titles)
bestActors

In [None]:
# Dropping unnecessary columns
bestActors.drop('Year', axis=1, inplace=True)
bestActors.drop('Role(s)', axis=1, inplace=True)
bestActors.drop('Film', axis=1, inplace=True)
bestActors.drop('Ref.', axis=1, inplace=True)
bestActors

#### Note: I only really wanted the list of actor names to work with, so I dropped all the other columns

In [None]:
actor_tbl7 = soup6.find_all('table', class_ = 'wikitable sortable')[6] # Finding the exact table on the webpage I want to get data from
actor_col_data7 = actor_tbl7.find_all('tr') # Find all the rows in the table within the <tr> tags
for row in actor_col_data7[36:]: # Choosing which row in the table I want to start from (1987 onwards, as with the actresses)
    row_data = row.find_all('td') # Find all the <td> cells within the row
    individual_row_data = [data.text.strip() for data in row_data] # Extract each cell's text and strip surrounding whitespace
    individual_row_data = individual_row_data[:1] # Slice the data to keep only the name of the actor, because that's all that's needed

    length = len(bestActors) # Number of rows currently in the dataframe, used as the next row index
    if not individual_row_data: # An empty list means the row has no <td> cells, which indicates a header row
        continue # Don't add it to the df, just move on to the next row in the Wikitable
    else:
        bestActors.loc[length] = individual_row_data # Append the name as a new row

In [None]:
# Repeat the same steps for the remaining tables (indexes 7 through 10), this time using every row
actor_tables = soup6.find_all('table', class_ = 'wikitable sortable')
for i in range(7, 11):
    actor_col_data = actor_tables[i].find_all('tr') # Find all the rows in this table within the <tr> tags
    for row in actor_col_data:
        row_data = row.find_all('td') # Find all the <td> cells within the row
        individual_row_data = [data.text.strip() for data in row_data] # Extract each cell's text and strip surrounding whitespace
        individual_row_data = individual_row_data[:1] # Keep only the name of the actor
        if not individual_row_data: # An empty list means the row has no <td> cells, which indicates a header row
            continue
        bestActors.loc[len(bestActors)] = individual_row_data # Append the name as a new row

In [None]:
bestActors

# Part 3: Plotting
## In this section, I used Matplotlib and Seaborn to make visualizations that serve as answers to my questions

## Question 1: Which decade had the most top-earning movies (adjusted for inflation)?

### Step 1: Manipulating the data within the dataframe to reach an answer to my question

In [None]:
# Sort the dataframe by year in ascending order to determine the range of decades I'll need to include in my dictionary
sorted_df = highestGrossing.sort_values(by='Year')

#### Note: I used this GeeksForGeeks tutorial to help me sort the data by the year to see the range of decades I'd need: https://www.geeksforgeeks.org/pandas/how-to-sort-pandas-dataframe/

In [None]:
# Print the sorted dataframe to determine what the earliest and latest decade are
sorted_df

In [None]:
# Create a dictionary mapping a decade to the number of top grossing movies in that decade
decadesCt = {1930:0, 1940:0, 1950:0, 1960:0, 1970:0, 1980:0, 1990:0, 2000:0, 2010:0, 2020:0}

In [None]:
# Create a list of the dictionary's keys (the decade start years) to use as the decade boundaries
dictKeys = list(decadesCt.keys())

In [None]:
# For each movie, determine what decade that movie's release date falls in, then increment the value corresponding to the key that decade is in the dict
for yr in highestGrossing['Year']:
    if (int(yr) >= dictKeys[0]) and (int(yr) < dictKeys[1]):
        decadesCt[1930] += 1
    elif (int(yr) >= dictKeys[1]) and (int(yr) < dictKeys[2]):
        decadesCt[1940] += 1
    elif (int(yr) >= dictKeys[2]) and (int(yr) < dictKeys[3]):
        decadesCt[1950] += 1
    elif (int(yr) >= dictKeys[3]) and (int(yr) < dictKeys[4]):
        decadesCt[1960] += 1
    elif (int(yr) >= dictKeys[4]) and (int(yr) < dictKeys[5]):
        decadesCt[1970] += 1
    elif (int(yr) >= dictKeys[5]) and (int(yr) < dictKeys[6]):
        decadesCt[1980] += 1
    elif (int(yr) >= dictKeys[6]) and (int(yr) < dictKeys[7]):
        decadesCt[1990] += 1
    elif (int(yr) >= dictKeys[7]) and (int(yr) < dictKeys[8]):
        decadesCt[2000] += 1
    elif (int(yr) >= dictKeys[8]) and (int(yr) < dictKeys[9]):
        decadesCt[2010] += 1
    elif (int(yr) >= dictKeys[9]):
        decadesCt[2020] += 1

In [None]:
# View the number of top-grossing movies in each decade
decadesCt
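
#### Note: The same tally can be written more compactly. The cell below is only a sketch: the `years` list is a hypothetical stand-in for `highestGrossing['Year']`, and it uses floor division to snap each year to its decade instead of a chain of `elif` checks. Unlike the dictionary above, decades with no movies are simply left out rather than listed with a count of 0.

```python
from collections import Counter

# Hypothetical release years standing in for highestGrossing['Year']
years = ['1939', '1977', '1975', '1997', '2009', '2015', '1973']

# Floor-divide by 10 and multiply back to snap each year to the start of its decade
decade_counts = Counter((int(yr) // 10) * 10 for yr in years)

# Sort by decade so the result is ready for plotting
decade_counts = dict(sorted(decade_counts.items()))
print(decade_counts)
```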

### Step 2: Creating a plot based on my numerical conclusions to visualize the data

In [None]:
# Create a bar chart that illustrates the data in the dictionary
# Set the style before plotting so that it applies to this chart
sns.set_style('whitegrid')
sns.barplot(x=list(decadesCt.keys()), y=list(decadesCt.values()), palette='Set2')
plt.title('The 1970s Had the Most Top-Grossing Movies of All Time')
plt.xlabel('Decade')
plt.ylabel('Number of Movies')
plt.ylim(0, 20)
plt.show()

#### Note: I used the Seaborn docs to help me make my bar chart: https://seaborn.pydata.org/generated/seaborn.barplot.html

### Analysis: As the graph's title states, the 1970s produced more top-grossing movies than any other decade. I had expected recent decades (2000s onward) to dominate, since movie tickets tend to be more expensive nowadays, but because the grosses in this table are adjusted for inflation, higher ticket prices do not give newer films an advantage

## Question 2: Do the highest-grossing movies have a better likelihood of being nominated for the Academy Award for Best Picture?

### Step 1: Manipulating the data within the dataframe to come to a conclusion to my answer

In [None]:
# Creating a dictionary to compare the number of top-grossing films nominated for best picture vs those not nominated
nomineeCts = {'Nominated':0, 'Not Nominated':0}

In [None]:
# Creating lists of the nominees and highest grossing films
nomineeNames = bestPicture['Film'].tolist()
highestGrossNames = highestGrossing['Title'].tolist()

In [None]:
# Using a set intersection to find the highest grossing films that are also in the list of nominees
matches = list(set(nomineeNames).intersection(set(highestGrossNames)))
# Setting the dictionary values equal to the number of top-grossing films that are in both lists and the number that are not, respectively
nomineeCts['Nominated'] = len(matches)
# The remainder is taken from the top-grossing list (not the nominee list), since the chart describes top-grossing films
nomineeCts['Not Nominated'] = (len(set(highestGrossNames)) - len(matches))

In [None]:
# Displaying the number of films nominated vs. not nominated
nomineeCts
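
#### Note: The same comparison can also be done directly on the dataframe columns. The cell below is a minimal sketch with made-up placeholder titles (the two small dataframes are hypothetical stand-ins for `highestGrossing` and `bestPicture`). It uses `Series.isin` to flag each top-grossing title that also appears in the nominee list, so both counts come from the top-grossing list.

```python
import pandas as pd

# Hypothetical stand-ins for the scraped dataframes
highest_grossing = pd.DataFrame({'Title': ['Film A', 'Film B', 'Film C', 'Film D']})
best_picture = pd.DataFrame({'Film': ['Film A', 'Film B', 'Film C', 'Film E']})

# True for each top-grossing title that also appears among the nominees
is_nominated = highest_grossing['Title'].isin(best_picture['Film'])

nominee_counts = {
    'Nominated': int(is_nominated.sum()),
    'Not Nominated': int((~is_nominated).sum()),
}
print(nominee_counts)
```

#### This only finds exact matches, so any footnote markers or spelling differences in the scraped titles would still need to be cleaned up first.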

### Step 2: Creating a plot based on my numerical conclusions to visualize the data

In [None]:
# Creating a pie chart to illustrate the percent of high grossing films that get nominated for Best Picture
labels = list(nomineeCts.keys())
sizes = list(nomineeCts.values())
sns.set_style("whitegrid")
plt.pie(sizes, labels=labels, autopct='%1.1f%%', startangle=140)
plt.title('Top Grossing Movies Do Not Have a Better Likelihood of Being Nominated for Best Picture')
plt.axis('equal') 
plt.show()

#### I used this tutorial to help me make my pie chart: https://pieriantraining.com/seaborn-pie-chart-a-tutorial-for-data-visualization/

### Analysis: I had assumed that more of the highest-grossing films would be Best Picture nominees, as receiving the Best Picture award usually gives films some recognition, which encourages many people to go watch them. However, it also makes sense that many of the highest-grossing films aren't Best Picture nominees, because the Academy Awards usually go to more artistic films, while the highest grossers tend to be more commercial films

## Question 3a: Do the highest-paid actors get nominated for more Academy Awards for Best Actor?

### Step 1: Manipulating the data within the dataframe to come to a conclusion to my answer

In [None]:
# Putting all the data within the actors column into a list to perform comparisons on
actors = list(highestPaidActors['Actor'])

In [None]:
# Creating a dictionary mapping each high-paid actor to the number of nominations they receive
actor_correlation = dict.fromkeys(actors, 0)

In [None]:
# Comparing each actor name to the list of nominees, and incrementing the value each time an actor's name is found
for key in actor_correlation.keys():
    for val in list(bestActors['Actor']):
        if (key == val) or (key+" ‡" == val): # See note below
            actor_correlation[key] += 1

#### Note: The Wikipedia list used special characters to denote winners of the award. Because winners are also nominees, I made sure to include the second statement to ensure that all instances of an actor's name get recorded 
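
#### Note: Another way to handle the winner marker is to strip it from the nominee names before counting. The cell below is only a sketch using placeholder names (the `nominees` series and `highest_paid` list are hypothetical stand-ins for `bestActors['Actor']` and `actors`). It assumes that '‡' is the only marker that needs to be removed.

```python
import pandas as pd

# Hypothetical nominee column; '‡' marks a winner, as in the scraped table
nominees = pd.Series(['Actor A ‡', 'Actor A', 'Actor B ‡', 'Actor A ‡'])
highest_paid = ['Actor A', 'Actor B', 'Actor C']

# Drop the winner marker and surrounding whitespace, then count each name
clean = nominees.str.replace('‡', '', regex=False).str.strip()
counts = clean.value_counts()

# Look up each highest-paid actor, defaulting to 0 when they never appear
actor_nominations = {name: int(counts.get(name, 0)) for name in highest_paid}
print(actor_nominations)
```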

In [None]:
# Displaying the dictionary with each actor and their number of nominations
actor_correlation

### Step 2: Creating a plot based on my numerical conclusions to visualize the data

In [None]:
# Using a histogram to illustrate how many Best Actor nominations the top-paid actors have received
actor_nomination_counts = list(actor_correlation.values())
sns.histplot(actor_nomination_counts, bins=range(0, max(actor_nomination_counts) + 2), kde=False)
plt.xlabel('Number of Nominations')
plt.ylabel('Number of Actors')
plt.title('The Highest Paid Actors Do Not Get Nominated for More Academy Awards')

#### I was actually unsure of how I could represent the results in the dictionary visually, so I asked ChatGPT and it recommended I use a histogram
#### I used the Seaborn docs to help me with the creation of the histogram: https://seaborn.pydata.org/generated/seaborn.histplot.html

## Question 3b: Do the highest-paid actresses get nominated for more Academy Awards for Best Actress?

### Step 1: Manipulating the data within the dataframe to come to a conclusion to my answer

In [None]:
# Putting all the data within the actresses column into a list to perform comparisons on
actresses = list(highestPaidActresses['Actress'])

In [None]:
# Creating a dictionary mapping each high-paid actress to the number of nominations they receive
actress_correlation = dict.fromkeys(actresses, 0)

In [None]:
# Comparing each actress name to the list of nominees, and incrementing the value each time an actress's name is found
for key in actress_correlation.keys():
    for val in list(bestActresses['Actress']):
        if (key == val) or (key+" ‡" == val):
            actress_correlation[key] += 1

In [None]:
# Displaying the dictionary with each actress and their number of nominations
actress_correlation

In [None]:
# Removing unnecessary values from the dictionary (the None default avoids a KeyError if this cell is run twice)
actress_correlation.pop('—', None)
actress_correlation

#### Note: The original table on Wikipedia actually included the dash, so it got scraped in. I dropped it from my data because it was irrelevant. 

### Step 2: Creating a plot based on my numerical conclusions to visualize the data

In [None]:
# Using a histogram to illustrate how many Best Actress nominations the top-paid actresses have received
actress_nomination_counts = list(actress_correlation.values())

sns.histplot(actress_nomination_counts, bins=range(0, max(actress_nomination_counts) + 2), kde=False)

plt.xlabel('Number of Nominations')
plt.ylabel('Number of Actresses')
plt.title('The Highest Paid Actresses Do Not Get Nominated for More Academy Awards')

### Analysis: I was surprised that higher-paid actors and actresses do not get nominated for more awards, because I assumed that people were willing to pay them more because they delivered better performances. Then I reviewed the Wikitable again and realized that it contained data on the highest-paid actors across all media types (film, theater, TV, etc.), while the Academy Awards are only awarded to film actors, which may have skewed the data a bit

## Question 4: Do more expensive films make more money at the box office?

### Step 1: Manipulating the data within the dataframe to come to a conclusion to my answer

In [None]:
# Merge the two data frames on a common column to have Cost and Earning data in the same df
merged = pd.merge(mostExpensive, highestGrossing, on='Title', how='inner')

#### Note: I used this GeeksForGeeks tutorial to help me merge my two dataframes into one: https://www.geeksforgeeks.org/pandas/how-to-combine-two-dataframe-in-python-pandas/

In [None]:
# Display the merged data frame to ensure everything was successful
merged

In [None]:
# Remove $, commas, asterisks, etc.
# regex=False treats '$' and '*' as literal characters (both are special characters in regular expressions)
merged['Cost (est.)(millions)'] = merged['Cost (est.)(millions)'].str.replace('$', '', regex=False)
merged['Cost (est.)(millions)'] = merged['Cost (est.)(millions)'].str.replace('*', '', regex=False)
merged['Cost (est.)(millions)'] = merged['Cost (est.)(millions)'].astype(int)
merged['Adjusted gross'] = merged['Adjusted gross'].str.replace('$', '', regex=False)
merged['Adjusted gross'] = merged['Adjusted gross'].str.replace('*', '', regex=False)
merged['Adjusted gross'] = merged['Adjusted gross'].str.replace(',', '', regex=False)
merged['Adjusted gross'] = merged['Adjusted gross'].astype(int)

#### Note: I used this StackOverflow thread to help me strip the columns of all their non-numerical data + turn them into numbers: https://stackoverflow.com/questions/38516481/trying-to-remove-commas-and-dollars-signs-with-pandas-in-python
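
#### Note: The separate replacements can also be collapsed into one regular expression. The cell below is a sketch on made-up values (the small `df` is a hypothetical stand-in for `merged`). It assumes every value is a whole number, since removing all non-digit characters would also remove a decimal point.

```python
import pandas as pd

# Hypothetical values formatted like the scraped columns
df = pd.DataFrame({'Adjusted gross': ['$1,234,567', '$987,654*', '$2,000,000']})

# Remove every character that is not a digit, then convert the strings to integers
df['Adjusted gross'] = pd.to_numeric(
    df['Adjusted gross'].str.replace(r'[^\d]', '', regex=True)
)
print(df['Adjusted gross'].tolist())
```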

In [None]:
# Display the stripped dataframe to ensure that only the numerical values remain
merged

### Step 2: Creating a plot based on my numerical conclusions to visualize the data

In [None]:
# Create a scatterplot w/ regression line to illustrate the relationship between cost and earnings
sns.regplot(data=merged, x="Cost (est.)(millions)", y="Adjusted gross", scatter_kws={'s':25}, color='green', marker='s')

plt.title('More Expensive Films Do Not Typically Make More Money')
plt.xlabel('Cost (in millions)')
plt.ylabel('Earnings (in billions)')
plt.show()

#### Note: I used the Seaborn docs to help me make my scatterplot w/ regression line: https://seaborn.pydata.org/generated/seaborn.regplot.html

### Analysis: There is little to no correlation between how much money is spent on a film and the amount it earns at the box office. I was surprised to learn this, as I assumed that films with large budgets would have put that money towards making the film as good as possible, so it could earn all the money back. It is also interesting to note that films with almost the same budget can make drastically different amounts in revenue, such as the two data points in the 400-450 million budget range.
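
#### Note: A correlation coefficient is one way to put a number on a relationship like this. The cell below is only a sketch on made-up values (the `toy` dataframe and its column names are hypothetical, not the scraped data). It uses `Series.corr`, which computes Pearson's r by default. Applying the same call to the two cleaned columns of `merged` would give the value for the real data.

```python
import pandas as pd

# Hypothetical cost/earnings pairs standing in for the merged dataframe
toy = pd.DataFrame({
    'cost': [100, 200, 300, 400],
    'gross': [500, 300, 600, 400],
})

# Pearson's r ranges from -1 to 1; values near 0 suggest little linear relationship
r = toy['cost'].corr(toy['gross'])
print(round(r, 2))
```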