# Web Scraping

## Project Overview

This project analyzes box office and streaming data to provide strategic direction for a newly formed movie division. Our analysis explores financial data, film ratings, and viewership data to help guide the division in deciding what types of movies to create.

## Notebook Overview

Because the provided datasets only contain data through 2018, I wanted more recent data to check for newer trends. The COVID-19 pandemic makes box office data from the last ~16 months a poor guide to audience demand, so I chose instead to scrape streaming data, specifically Netflix Top Ten data.
Streaming data can be difficult to gather because streaming companies limit what information they release to the public. Netflix has one of the largest market shares of any streaming service ([about 20% in 2021](https://www.thewrap.com/netflix-streaming-us-market-share-chart/)), and because it publishes a public "top ten" list every day, I decided to use that data for this analysis (scraped from [The Numbers](https://www.the-numbers.com)).

## Web Scraping

In [5]:
import pandas as pd
import requests
from bs4 import BeautifulSoup

The first step is to generate the list of URLs to scrape. The Numbers ends each daily chart URL with the date in YYYY/MM/DD format, so a short function can generate every URL I need.

In [6]:
#generate one base url per month, January through May 2021
urls = ['https://www.the-numbers.com/home-market/netflix-daily-chart/2021/0{}'.format(i) for i in range(1, 6)]

#define a function that will generate a url for each day of the month
def add_days(url):
    days = []
    #add days 01-09
    for i in range(1, 10):
        days.append(url + '/0' + str(i))
    #add days 10-28
    for i in range(10, 29):
        days.append(url + '/' + str(i))
    #for the month of April, add days 29 and 30
    if url[-1] == '4':
        for i in range(29, 31):
            days.append(url + '/' + str(i))
    #for all months other than April and February, add days 29-31
    elif url[-1] != '2':
        for i in range(29, 32):
            days.append(url + '/' + str(i))
    return days

#run function to generate a url for each day of 2021 from 01/01 to 05/31
all_urls = []

for url in urls:
    all_urls.extend(add_days(url))
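
The month-length rules above are hard-coded for January through May 2021. As a minimal sketch of an alternative, the same list could be built from calendar dates so that month lengths and leap years are handled automatically. The names `BASE_URL`, `daily_urls`, and `all_urls_alt` are hypothetical and not part of the original notebook.

```python
from datetime import date, timedelta

# Base path copied from the URLs used in this notebook
BASE_URL = 'https://www.the-numbers.com/home-market/netflix-daily-chart'

def daily_urls(start, end):
    """Return one chart URL per calendar day from start to end, inclusive."""
    n_days = (end - start).days + 1
    days = (start + timedelta(days=i) for i in range(n_days))
    # strftime zero-pads the month and day, giving YYYY/MM/DD
    return ['{}/{}'.format(BASE_URL, d.strftime('%Y/%m/%d')) for d in days]

# hypothetical name, chosen so it does not overwrite all_urls
all_urls_alt = daily_urls(date(2021, 1, 1), date(2021, 5, 31))
```

This should produce the same 151 URLs as `add_days`, and extending the date range only requires changing the two dates.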

Now that we have our list of URLs, we can write a function that scrapes a single URL, then loop over the list and combine all the data into one dataframe.

In [7]:
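def get_data(url):
    #download the page; the timeout stops a stalled request from hanging the loop
    request = requests.get(url, timeout=30)
    #raise an error for a failed download (e.g. 404) instead of parsing an error page
    request.raise_for_status()
    #read_html returns every table on the page; the chart is the second one
    df_list = pd.read_html(request.content)
    return df_list[1]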

#test the function on a single url to make sure the output is what we expect
get_data(all_urls[100])

| | Rank | YD | LW | Title | Type | NetflixExcl. | NetflixReleaseDate | Days InTop 10 | Viewer-shipScore | WatchNow |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | (1) | (-) | Thunder Force | Movie | Yes | Apr 9, 2021 | 2 | 20 | Watch Now |
| 1 | 2 | (3) | (-) | This is a Robbery: The Worl… | TV Show | Yes | Apr 7, 2021 | 3 | 25 | Watch Now |
| 2 | 3 | (2) | (1) | Who Killed Sara? | TV Show | Yes | Mar 24, 2021 | 16 | 150 | Watch Now |
| 3 | 4 | (7) | (-) | The Little Rascals | Movie | | Apr 2, 2021 | 2 | 11 | Watch Now |
| 4 | 5 | (5) | (7) | The Serpent | TV Show | Yes | Apr 2, 2021 | 8 | 50 | Watch Now |
| 5 | 6 | (4) | (-) | What Lies Below | Movie | | Apr 4, 2021 | 7 | 59 | Watch Now |
| 6 | 7 | (6) | (9) | Cocomelon | TV Show | Yes | Jun 1, 2020 | 170 | 533 | Watch Now |
| 7 | 8 | (8) | (2) | Concrete Cowboy | Movie | Yes | Apr 2, 2021 | 9 | 59 | Watch Now |
| 8 | 9 | (9) | (-) | Legally Blonde | Movie | Yes | Apr 1, 2021 | 5 | 10 | Watch Now |
| 9 | 10 | (-) | (-) | Sniper: Ghost Shooter | Movie | | Apr 1, 2021 | 6 | 14 | Watch Now |


In [8]:
#loop through our list of urls; this will create a list of dataframes
top_ten_list = []

for i in all_urls:
    top_ten_list.append(get_data(i))
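
A loop like this sends about 150 requests back to back and stops at the first error. As a minimal sketch, assuming a short pause between requests is acceptable, the loop could be wrapped in a helper that waits between calls and records failures instead of stopping. The name `scrape_all`, its `fetch` and `delay` parameters, and the one-second default are hypothetical choices, not part of the original notebook.

```python
import time

def scrape_all(urls, fetch, delay=1.0):
    """Call fetch(url) for each url, pausing between requests.

    Returns (results, failed): the fetched objects in order, and the urls
    whose fetch raised an exception.
    """
    results, failed = [], []
    for n, url in enumerate(urls):
        if n > 0:
            time.sleep(delay)  # wait between requests, not before the first
        try:
            results.append(fetch(url))
        except Exception as err:
            # record the failure and keep going instead of losing the whole run
            failed.append(url)
            print('skipped {}: {}'.format(url, err))
    return results, failed
```

In this notebook it would be called as `top_ten_list, failed = scrape_all(all_urls, get_data)`, and any URLs left in `failed` could be retried afterwards.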

In [9]:
#stack the list of dataframes on top of each other into one large dataframe
df = pd.concat(top_ten_list, ignore_index=True)

df.head()

| | Rank | YD | LW | Title | Type | NetflixExcl. | NetflixReleaseDate | Days InTop 10 | Viewer-shipScore | WatchNow |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | (1) | (-) | Bridgerton | TV Show | Yes | Dec 25, 2020 | 7 | 67 | Watch Now |
| 1 | 2 | (3) | (-) | Death to 2020 | Concert/Perf… | Yes | Dec 27, 2020 | 5 | 37 | Watch Now |
| 2 | 3 | (2) | (-) | We Can Be Heroes | Movie | Yes | Dec 25, 2020 | 7 | 58 | Watch Now |
| 3 | 4 | (-) | (-) | Chilling Adventures of Sabrina | TV Show | Yes | Oct 26, 2018 | 1 | 7 | Watch Now |
| 4 | 5 | (4) | (1) | The Midnight Sky | Movie | Yes | Dec 23, 2020 | 9 | 77 | Watch Now |
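
The columns shown above do not include the date of the chart each row came from, so the combined dataframe cannot by itself tell one day's top ten from another's. Because every URL ends in YYYY/MM/DD, one possible approach is to recover the date from the URL. This is a minimal sketch, and `date_from_url` is a hypothetical helper that is not part of the original notebook.

```python
from datetime import date

def date_from_url(url):
    """Parse the trailing YYYY/MM/DD of a chart URL into a date."""
    # the last three path segments are the year, month and day
    year, month, day = url.rstrip('/').split('/')[-3:]
    return date(int(year), int(month), int(day))
```

Each day's table could then be tagged before concatenation, for example with `get_data(url).assign(Date=date_from_url(url))`, where the column name `Date` is also an assumption.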


In [13]:
#Export data to a CSV file for use in other notebooks
df.to_csv('Netflix Top 10.csv', index=False)

### Next Steps

- Collect additional information about these titles (see [API Calls](API_Calls.ipynb))
- Clean data (see [Exploratory Data Analysis](EDA.ipynb))