# Reddit & Quibi: Web API and NLP
## Part 1-B: Scraping Data from Quibi

In the previous notebook, I gathered data from Reddit. Now I'm going to scrape the Quibi website so I can compare its show information with the Reddit data. I didn't run into any limits or restrictions while scraping, and I was able to reach every page via the site's sitemap.
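
One way to check for scraping restrictions is to look at a site's robots.txt. The sketch below uses the standard library's `urllib.robotparser` on a made-up robots.txt body. The rules, the `is_allowed` helper, and the example URLs are assumptions for illustration, not Quibi's actual file or pages.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /account/
"""

def is_allowed(robots_text, url, user_agent="*"):
    """Return True if the given robots.txt rules permit fetching the URL."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser.can_fetch(user_agent, url)

# Hypothetical URLs: one outside the disallowed path, one inside it.
print(is_allowed(SAMPLE_ROBOTS, "https://quibi.com/shows/example-show-1"))
print(is_allowed(SAMPLE_ROBOTS, "https://quibi.com/account/settings"))
```

In a real run, the robots.txt text would come from a request to the site rather than from a hard-coded string.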

In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np
import time


First, I'm scraping the sitemap since it holds all the URLs I'll want to access later.

In [2]:
url = 'https://quibi.com/sitemap.xml'

res = requests.get(url)


In [3]:
res.status_code

200

In [4]:
soup = BeautifulSoup(res.content, 'lxml')

All of the links are in 'loc' tags, so the list 'pages' will hold one 'loc' tag per link.

In [5]:
pages = soup.find_all('loc')
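
For reference, here is a minimal sketch of what the 'loc' structure looks like and how the same links could be pulled out with the standard library's `xml.etree.ElementTree` instead of BeautifulSoup. This is a swapped-in alternative, not the method used in this notebook. The sample XML and the `extract_locs` helper are hypothetical. The namespace is the one defined by the sitemap protocol.

```python
import xml.etree.ElementTree as ET

# Hypothetical, shortened sitemap for illustration only.
SAMPLE_SITEMAP = """\
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://quibi.com/shows/example-show-101</loc></url>
  <url><loc>https://quibi.com/about</loc></url>
</urlset>
"""

# Sitemap elements live in this namespace, so lookups must name it.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def extract_locs(xml_text):
    """Return the text of every <loc> element in a sitemap document."""
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.findall(".//sm:loc", SITEMAP_NS)]

print(extract_locs(SAMPLE_SITEMAP))
```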

However, not all of these pages are for shows. Only those with 'shows/' in the URL are for their current programming. The loop below splits each URL on 'shows/'. If the split succeeds, the part of the link after 'shows/' is appended to 'shows'. If it fails, the link is skipped.

In [6]:
shows = []
for page in pages:
    try:
        show_url = page.text.split('shows/')[1]
        shows.append(show_url)
    except IndexError:
        #No 'shows/' in this URL, so it isn't a show page
        pass
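
The same filter can also be written without a try/except. The sketch below uses `str.partition`, which returns an empty separator slot when 'shows/' is absent. The `show_slugs` helper and the sample URLs are hypothetical. Unlike the loop above, it also skips a URL that ends exactly at 'shows/' with nothing after it.

```python
def show_slugs(urls, marker="shows/"):
    """Keep only URLs containing the marker and return the text after it."""
    slugs = []
    for url in urls:
        # partition() returns (before, marker, after); the middle item is
        # an empty string when the marker is not present in the URL.
        _, found, slug = url.partition(marker)
        if found and slug:
            slugs.append(slug)
    return slugs

# Hypothetical URLs for illustration only.
sample_urls = [
    "https://quibi.com/shows/example-show-101",
    "https://quibi.com/about",
    "https://quibi.com/shows/another-show-202",
]
print(show_slugs(sample_urls))
```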

There are 53 shows on the Quibi website.

In [7]:
len(shows)

53

Now that we have the list of show URLs, the loop below visits each one and grabs the pertinent information. Some fields appear on every page (such as title), but others only appear where applicable (like rating). An inner for loop checks each piece of the heading against the expected structure before storing it. The final product is a list of individual show dictionaries that is then combined into one DataFrame.

In [8]:
#Empty list to store the show info
all_show_list = []

show_base = 'https://quibi.com/shows/'

#For each show in the URL list
for show in shows:
    #Get the info
    req = requests.get(show_base + show)
    
    #If the request worked
    if req.status_code < 400:
        #Turn it into a BS object
        bs = BeautifulSoup(req.content, 'lxml')
        
        #The title of the page is an image, but the alt text is the actual title
        title = bs.find('img').attrs['alt']
        
        #The heading contains multiple pieces of information to split up, but not every page has the same info
        heading = bs.find('h3').text.split(' • ')
        
        #Setting default values if one isn't included on the page
        rating = None
        season = None
        genre = None
        rel = 'Released'
        
        #For loop checks if the info is there, then changes from the default
        for h in heading:
            if 'TV-' in h:
                rating = h
            elif 'season' in h:
                season = h
            elif '2020' in h:
                rel = h
            else:
                genre = h
        
        #Grabbing the description
        desc = bs.find('p', {'class' : 'show-long-description__1htfq'}).text
        
        #Grabbing the select cast/crew list
        cast_list = [cast.text for cast in bs.find_all('p', {'class' : 'credit-item-name__1eL-i'})]
        
        show_dict = {
            'title': title,
            'genre': genre,
            'description': desc,
            'cast_crew': cast_list,
            'rating': rating,
            'season': season,
            'release': rel,
            'url': show_base+show
        }
        all_show_list.append(show_dict)

show_df = pd.DataFrame(all_show_list)     
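
To make the heading logic easier to check in isolation, here is a minimal sketch that pulls the same rules into a standalone function. The `parse_heading` name and the sample heading strings are assumptions. The strings are assembled from values in the output below, and the order of the pieces on the real pages may differ.

```python
def parse_heading(heading_text, separator=" • "):
    """Split a show heading and sort each piece into the field it matches."""
    # Same defaults as the scraping loop: fields stay None unless found.
    info = {"rating": None, "season": None, "genre": None, "release": "Released"}
    for part in heading_text.split(separator):
        if "TV-" in part:
            info["rating"] = part
        elif "season" in part:
            info["season"] = part
        elif "2020" in part:
            info["release"] = part
        else:
            # Anything unrecognized is treated as the genre.
            info["genre"] = part
    return info

# Hypothetical heading strings for illustration only.
print(parse_heading("TV-MA • 1 season • Deadpan, Buddy, Comedy"))
print(parse_heading("1 season • Spring 2020 • Reality, Cute Animals"))
```

Because the genre is the fallback branch, any piece that doesn't match a known pattern ends up in 'genre', and a heading with two such pieces keeps only the last one.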

In [9]:
show_df.head()

| | title | genre | description | cast_crew | rating | season | release | url |
|---|---|---|---|---|---|---|---|---|
| 0 | 50 States Of Fright | Horror Anthology Thriller | A horror anthology featuring the scariest stor... | [Rachel Brosnahan, Travis Fimmel, John Marshal... | TV-MA | 1 season | Released | https://quibi.com/shows/50-states-of-fright-417 |
| 1 | Agua Donkeys | Deadpan, Buddy, Comedy | What's standing between MP, Jer, and the one t... | [MP Cunningham, Jer Jackson, Baby Darrington, ... | TV-MA | 1 season | Released | https://quibi.com/shows/agua-donkeys-479 |
| 2 | All The Feels by The Dodo | Reality, Cute Animals | Animals make everything better. From dogs who ... | [] | | 1 season | Spring 2020 | https://quibi.com/shows/all-the-feels-by-the-d... |
| 3 | Answered by Vox | News, Coronavirus | Answered is Vox’s daily explainer series helpi... | [Cleo Abram] | | 1 season | Released | https://quibi.com/shows/answered-by-vox-631 |
| 4 | Around The World by BBC NEWS | News, International | Do you ever wonder how something happening tho... | [Ben Bland] | | 1 season | Spring 2020 | https://quibi.com/shows/around-the-world-by-bb... |


In [10]:
show_df.tail(1)

| | title | genre | description | cast_crew | rating | season | release | url |
|---|---|---|---|---|---|---|---|---|
| 52 | You Ain't Got These | Documentary, Culture | This is not a show about sneakers. It’s a show... | [Lena Waithe, Carmelo Anthony, Hasan Minhaj, Q... | TV-MA | 1 season | Released | https://quibi.com/shows/you-aint-got-these-500 |


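The DataFrame has 53 rows, matching the 53 show URLs, so data was gathered from every URL. The second check below totals the occurrences of all titles, so it counts non-null titles rather than distinct ones. It also returns 53, which means no show is missing a title.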

In [11]:
len(show_df)

53

In [12]:
show_df['title'].value_counts().sum()

53
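
To confirm that titles are also distinct, a check on the list of show dictionaries could look like the sketch below. It uses `collections.Counter` from the standard library rather than pandas. The `duplicate_titles` helper and the sample records are hypothetical.

```python
from collections import Counter

def duplicate_titles(show_dicts):
    """Return a sorted list of titles that appear more than once."""
    counts = Counter(show["title"] for show in show_dicts)
    return sorted(title for title, n in counts.items() if n > 1)

# Hypothetical records for illustration only.
sample_shows = [
    {"title": "Show A"},
    {"title": "Show B"},
    {"title": "Show A"},
]
print(duplicate_titles(sample_shows))
```

An empty list from this function would mean every title is unique.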

Saving the raw data to CSV:

In [13]:
show_df.to_csv('../datasets/quibi_raw.csv', index=False)
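
One thing to keep in mind when reloading this file: a CSV cell can only hold text, so the lists in 'cast_crew' are written as their string form and come back as strings. The sketch below shows the round trip with the standard library's `csv` module and restores the lists with `ast.literal_eval`. The in-memory buffer and the 'Person A'/'Person B' names are assumptions for illustration.

```python
import ast
import csv
import io

# Sample rows; the second cast list uses hypothetical names.
rows = [
    {"title": "Answered by Vox", "cast_crew": ["Cleo Abram"]},
    {"title": "Example Show", "cast_crew": ["Person A", "Person B"]},
    {"title": "All The Feels by The Dodo", "cast_crew": []},
]

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["title", "cast_crew"])
writer.writeheader()
writer.writerows(rows)  # each list is written as its str() form

# Read the rows back; every value is now a plain string.
buffer.seek(0)
restored = []
for row in csv.DictReader(buffer):
    # literal_eval safely turns "['Cleo Abram']" back into a list.
    row["cast_crew"] = ast.literal_eval(row["cast_crew"])
    restored.append(row)

print(restored)
```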