# Web Scraping the Scripts from Scraps from the Loft

In [2]:
## imports
import requests
from bs4 import BeautifulSoup

import numpy as np
import pandas as pd

## Example - 5 scripts

### One script

#### Scrape one script title and url

In [3]:
## Set up soup
url = 'http://scrapsfromtheloft.com/stand-up-comedy-scripts/'

response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')

In [4]:
## Find correct elements
titles_html = soup.find_all(['h1','h3'], class_="elementor-post__title")
len(titles_html)

387

In [5]:
titles_html[:10]

[<h1 class="elementor-post__title">
 <a href="https://scrapsfromtheloft.com/comedy/brian-regan-on-the-rocks-transcript/">
 				Brian Regan: On The Rocks (2021) – Transcript			</a>
 </h1>,
 <h3 class="elementor-post__title">
 <a href="https://scrapsfromtheloft.com/comedy/tom-segura-disgraceful-2018-full-transcript/">
 				Tom Segura: Disgraceful (2018) – Transcript			</a>
 </h3>,
 <h3 class="elementor-post__title">
 <a href="https://scrapsfromtheloft.com/comedy/chris-rock-bigger-blacker-1999-full-transcript/">
 				Chris Rock: Bigger &amp; Blacker (1999) – Transcript			</a>
 </h3>,
 <h1 class="elementor-post__title">
 <a href="https://scrapsfromtheloft.com/comedy/brian-regan-on-the-rocks-transcript/">
 				Brian Regan: On The Rocks (2021) – Transcript			</a>
 </h1>,
 <h3 class="elementor-post__title">
 <a href="https://scrapsfromtheloft.com/comedy/tom-segura-disgraceful-2018-full-transcript/">
 				Tom Segura: Disgraceful (2018) – Transcript			</a>
 </h3>,
 <h3 class="elementor-post__ti

In [6]:
script_url = titles_html[2].find('a').get('href')

In [7]:
script_url

'https://scrapsfromtheloft.com/comedy/chris-rock-bigger-blacker-1999-full-transcript/'

In [8]:
title = titles_html[0].find('a').string.strip()  # strip() drops the surrounding newlines and tabs

In [9]:
title

'Brian Regan: On The Rocks (2021) – Transcript'
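
The anchor text arrives padded with newlines and tabs, as the raw HTML above shows. As a small illustration (the `raw` strings below are made-up stand-ins for `.string`, not scraped values), a bare `.strip()` removes any mix of leading and trailing whitespace, while chained single-character strips depend on the order in which the padding appears:

```python
# Stand-in for the padded anchor text; the exact padding is an assumption.
raw = '\n\t\t\t\tTom Segura: Disgraceful (2018) – Transcript\t\t\t'

chained = raw.strip('\n').strip('\t')
plain = raw.strip()
print(chained == plain)

# With the padding in a different order, the chained version leaves whitespace behind.
raw_reordered = '\t\n\tTom Segura: Disgraceful (2018) – Transcript\n\t'
chained_reordered = raw_reordered.strip('\n').strip('\t')
plain_reordered = raw_reordered.strip()
print(repr(chained_reordered))
print(repr(plain_reordered))
```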

#### Scrape description and script from script url

In [10]:
## Set up soup
script_response = requests.get(script_url)
script_soup = BeautifulSoup(script_response.content, 'html.parser')

In [11]:
description = script_soup.find(name='div', attrs={'data-id': "53e7c39"})\
                .find('div', class_="elementor-widget-container").string.strip('\n').strip('\t')

In [12]:
description

'Chris Rock brings his critically acclaimed brand of social commentary-themed humor to this 1999 standup comedy presentation from HBO.'

In [13]:
content = script_soup.find_all('p', attrs={'style': "text-align: justify;"})

In [14]:
five = ''.join(list(content[5].strings))

In [15]:
four = ''.join(list(content[4].strings))

In [16]:
four

'Now, if the kid can’t read ’cause there ain’t no lights in the house… that’s Daddy’s fault. You got this shit down? See, nobody gives a fuck about Daddy. There’s some real daddies out there. I’m not talking about the guy that fucked you and left. Fuck him, okay? I’m talking about the real daddies. There’s still some motherfuckers out there that handle their business. Motherfuckers wanna act like brothers…. There’s some brothers that handle their business. ‘Cause people don’t give a fuck…. Nobody gives a fuck about Daddy. Everybody takes Daddy for granted. Just listen to the radio. Everything’s ”Mama. Dear Mama. Always loved my Mama.” What’s the Daddy song? Papa was a Rollin’ Stone. Nobody gives a fuck. Nobody appreciates Daddy. Now, Mama’s got the roughest job. I ain’t gonna front. But at least people appreciate Mama. Every time Mama do something right, Mama gets a compliment… ’cause women need to hear compliments all the time. Women need food, water, and compliments. That’s right. An

In [17]:
' '.join([four, five])

'Now, if the kid can’t read ’cause there ain’t no lights in the house… that’s Daddy’s fault. You got this shit down? See, nobody gives a fuck about Daddy. There’s some real daddies out there. I’m not talking about the guy that fucked you and left. Fuck him, okay? I’m talking about the real daddies. There’s still some motherfuckers out there that handle their business. Motherfuckers wanna act like brothers…. There’s some brothers that handle their business. ‘Cause people don’t give a fuck…. Nobody gives a fuck about Daddy. Everybody takes Daddy for granted. Just listen to the radio. Everything’s ”Mama. Dear Mama. Always loved my Mama.” What’s the Daddy song? Papa was a Rollin’ Stone. Nobody gives a fuck. Nobody appreciates Daddy. Now, Mama’s got the roughest job. I ain’t gonna front. But at least people appreciate Mama. Every time Mama do something right, Mama gets a compliment… ’cause women need to hear compliments all the time. Women need food, water, and compliments. That’s right. An

In [18]:
## Join the text of every paragraph, with a blank line between paragraphs
_lst = []
for paragraph in content:
    text = ''.join(paragraph.strings)
    _lst.append(text)
full_script = '\n\n'.join(_lst)

In [19]:
print(full_script)

Ladies and gentlemen… live from the world-famous Apollo Theater… in Harlem, New York. Are you ready? Please welcome Mr. Chris Rock!

What’s up… New York? There’s Brooklyn in the house. Well, I’m from Brooklyn. Shit, look at this. White people are up top tonight. You know, I was just in my hotel, a little while ago, on my way here… and I got in the elevator, right? I’m getting in the elevator… and these two high-school white boys try to get on with me… and I just dove off. I said, ”Y’all ain’t killing me!” I am scared of young white boys. If you white and under 21 I am running for the hills. What the hell is wrong with these white kids shooting up the school? They don’t even wait till 3 o’clock either. Killing people in the morning. That ain’t right. The Trenchcoat Mafia. ”No one will play with us. ”We have no friends. We’re the Trenchcoat Mafia.” Hey, I saw the yearbook pictures. It was six of them. I didn’t have six friends in high school. I don’t got six friends now. Shit, that’s thr

### Scraping Functions

In [37]:
### Scrape parent page

def scrape_loft():

    '''
    Scrapes every title and url from the list of stand-up comedy scripts on Scraps From the Loft.
    returns: a dictionary with two parallel lists, 'url' and 'title'
    '''
    url = 'http://scrapsfromtheloft.com/stand-up-comedy-scripts/'

    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')
    
    titles_html = soup.find_all(['h1', 'h3'], class_="elementor-post__title")
    
    title_list = []
    url_list = []
    for title in titles_html:
        link = title.find('a')  # each heading wraps a single link to the script page
        url_list.append(link.get('href'))
        title_list.append(link.string.strip())
    
    return {'url': url_list, 'title': title_list}

In [41]:
### Scrape each individual script page

def scrape_script(script_sp):
    '''
    Scrapes the full script from an individual stand-up comedy script page on Scraps From the Loft.
    script_sp: BeautifulSoup object for the script page
    returns: the full script as a single string, with paragraphs separated by blank lines
    '''
    #description = script_sp.find(name='div', attrs={'data-id': "53e7c39"}).find('div', class_="elementor-widget-container").string.strip('\n').strip('\t')
    
    content = script_sp.find_all('p', attrs={'style': "text-align: justify;"})
    
    _lst = []
    for paragraph in content:
        text = ''.join(list(paragraph.strings))
        _lst.append(text)
    full_script = '\n\n'.join(_lst)
    
    return full_script
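
`scrape_script` matches paragraphs whose `style` attribute is exactly `text-align: justify;`, so a paragraph with extra or differently spaced declarations is skipped. A minimal sketch of a looser match, assuming a regular expression is acceptable here; the HTML snippet is invented for illustration and is not taken from the site:

```python
import re
from bs4 import BeautifulSoup

# Invented sample markup: three justified paragraphs written three ways, plus a centered one.
html = '''
<p style="text-align: justify;">First paragraph.</p>
<p style="text-align:justify">Second <em>paragraph</em>.</p>
<p style="text-align: justify; color: #333;">Third paragraph.</p>
<p style="text-align: center;">Not part of the script.</p>
'''
sample_soup = BeautifulSoup(html, 'html.parser')

# Exact string match, as in scrape_script
exact = sample_soup.find_all('p', attrs={'style': "text-align: justify;"})

# Regular-expression match: the pattern is searched for inside the attribute value
justified = re.compile(r'text-align:\s*justify')
loose = sample_soup.find_all('p', attrs={'style': justified})

sample_script = '\n\n'.join(p.get_text() for p in loose)
print(len(exact), len(loose))
print(sample_script)
```

Whether the looser pattern picks up unwanted paragraphs on real pages would need checking against a few scripts.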

In [44]:
### Loop through the script urls
def df_dict(title_url_dict):
    '''
    Requests every script page and scrapes its full transcript.
    returns: a dictionary with three parallel lists, 'url', 'title' and 'full_transcript'
    '''
    urls = []
    titles = []
    full_transcripts = []

    for url, title in zip(title_url_dict['url'], title_url_dict['title']):
        script_resp = requests.get(url)
        # Skip pages that fail to load so the three lists stay the same length
        if not script_resp.ok:
            continue
        script_sp = BeautifulSoup(script_resp.content, 'html.parser')
        urls.append(url)
        titles.append(title)
        full_transcripts.append(scrape_script(script_sp))

    return {'url': urls, 'title': titles, 'full_transcript': full_transcripts}
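
The loop above sends one request per script with no pause between requests, and an exception raised by any single request stops the whole run. A minimal sketch of a more defensive loop, assuming a one-second pause is acceptable; `collect_transcripts`, `fetch`, and `parse` are hypothetical names, with `fetch` standing in for a wrapper around `requests.get` and `parse` for `scrape_script` applied to the parsed page:

```python
import time

def collect_transcripts(pairs, fetch, parse, delay=1.0, sleep=time.sleep):
    """Fetch and parse each (url, title) pair, pausing between requests.

    fetch(url) returns the page HTML or raises on failure (hypothetical helper).
    parse(html) returns the transcript text (hypothetical helper).
    """
    rows = []
    for i, (url, title) in enumerate(pairs):
        if i:
            sleep(delay)  # wait between consecutive requests
        try:
            html = fetch(url)
        except Exception as exc:  # broad on purpose: this is only a sketch
            print(f'skipping {url}: {exc}')
            continue
        rows.append({'url': url, 'title': title, 'full_transcript': parse(html)})
    return rows

# Stub helpers so the loop can be exercised without touching the network.
def stub_fetch(url):
    if 'missing' in url:
        raise RuntimeError('page not found')
    return 'text of ' + url

demo_rows = collect_transcripts(
    [('page-one', 'One'), ('missing-page', 'Two'), ('page-three', 'Three')],
    fetch=stub_fetch,
    parse=str.upper,
    delay=0,
)
print([row['title'] for row in demo_rows])
```

The returned list of row dictionaries can be passed to `pd.DataFrame` directly.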

In [23]:
title_dict = scrape_loft()

In [30]:
title_dict

{'url': ['https://scrapsfromtheloft.com/comedy/brian-regan-on-the-rocks-transcript/',
  'https://scrapsfromtheloft.com/comedy/tom-segura-disgraceful-2018-full-transcript/',
  'https://scrapsfromtheloft.com/comedy/chris-rock-bigger-blacker-1999-full-transcript/',
  'https://scrapsfromtheloft.com/comedy/brian-regan-on-the-rocks-transcript/',
  'https://scrapsfromtheloft.com/comedy/tom-segura-disgraceful-2018-full-transcript/',
  'https://scrapsfromtheloft.com/comedy/chris-rock-bigger-blacker-1999-full-transcript/',
  'https://scrapsfromtheloft.com/comedy/jim-gaffigan-comedy-monster-transcript/',
  'https://scrapsfromtheloft.com/comedy/louis-c-k-sorry-transcript/',
  'https://scrapsfromtheloft.com/comedy/drew-michael-drew-michael-2018-transcript/',
  'https://scrapsfromtheloft.com/comedy/drew-michael-red-blue-green-transcript/',
  'https://scrapsfromtheloft.com/comedy/mo-amer-mohammed-in-texas-transcript/',
  'https://scrapsfromtheloft.com/comedy/dave-chappelle-the-closer-transcript/',
  

In [45]:
df_d = df_dict(title_dict)

In [57]:
df = pd.DataFrame(df_d)

In [58]:
## Drop the first three rows, which are repeated in rows 3-5
df = df.iloc[3:].reset_index(drop=True)
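
Slicing from row 3 works because the first three entries are repeated immediately afterwards (see the `title_dict` output above). A sketch of an alternative that does not depend on their position, assuming duplicates should be identified by URL; `dedupe_by_url` is a hypothetical helper, and the sample dictionary reuses URLs from the `title_dict` output with shortened titles:

```python
def dedupe_by_url(title_url_dict):
    """Keep the first occurrence of each url, preserving the original order."""
    seen = {}
    for url, title in zip(title_url_dict['url'], title_url_dict['title']):
        seen.setdefault(url, title)  # later repeats of a url are ignored
    return {'url': list(seen), 'title': list(seen.values())}

sample_dict = {
    'url': [
        'https://scrapsfromtheloft.com/comedy/brian-regan-on-the-rocks-transcript/',
        'https://scrapsfromtheloft.com/comedy/tom-segura-disgraceful-2018-full-transcript/',
        'https://scrapsfromtheloft.com/comedy/brian-regan-on-the-rocks-transcript/',
        'https://scrapsfromtheloft.com/comedy/jim-gaffigan-comedy-monster-transcript/',
    ],
    'title': ['Brian Regan', 'Tom Segura', 'Brian Regan', 'Jim Gaffigan'],
}

deduped = dedupe_by_url(sample_dict)
print(deduped['title'])
```

Applied to `title_dict` before scraping, this would also avoid requesting the repeated pages. On the DataFrame itself, `df.drop_duplicates(subset='url')` has a similar effect.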

In [59]:
df

,url,title,full_transcript
0,https://scrapsfromtheloft.com/comedy/brian-reg...,Brian Regan: On The Rocks (2021) – Transcript,Filmed in 2020 at the Tuacahn outdoor amphithe...
1,https://scrapsfromtheloft.com/comedy/tom-segur...,Tom Segura: Disgraceful (2018) – Transcript,[announcer] Ladies and gentlemen… [audience wh...
2,https://scrapsfromtheloft.com/comedy/chris-roc...,Chris Rock: Bigger & Blacker (1999) – Transcript,Ladies and gentlemen… live from the world-famo...
3,https://scrapsfromtheloft.com/comedy/jim-gaffi...,Jim Gaffigan: Comedy Monster (2021) | Transcript,"Thank you! Thank you! Oh, my gosh. Thank you s..."
4,https://scrapsfromtheloft.com/comedy/louis-c-k...,Louis C. K.: Sorry (2021) | Transcript,♪♪ [“Like a Rolling Stone” by Bob Dylan playin...
...,...,...,...
379,https://scrapsfromtheloft.com/comedy/jim-jeffe...,JIM JEFFERIES ON GUN CONTROL [FULL TRANSCRIPT],by Jim Jefferies\n\nI’m gonna talk about somet...
380,https://scrapsfromtheloft.com/comedy/reggie-wa...,Reggie Watts: Spatial (2016) – Full Transcript,"Hello, I’m Thomas. I’m so glad to meet you Mum..."
381,https://scrapsfromtheloft.com/comedy/george-ca...,GEORGE CARLIN: COMPLAINTS AND GRIEVANCES (2001...,Complaints and Grievances is a HBO stand-up sp...
382,https://scrapsfromtheloft.com/comedy/george-ca...,GEORGE CARLIN: LIFE IS WORTH LOSING (2006) – T...,"Recorded on November 5, 2005, Beacon Theater, ..."


In [61]:
df.to_json('scraps_from_the_loft.json')
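
To confirm the saved file can be loaded again, a round trip through `pd.read_json` is one option. A minimal sketch on a two-row stand-in frame; the transcripts are placeholders and the temporary file name is made up:

```python
import os
import tempfile

import pandas as pd

# Two-row stand-in for the scraped DataFrame; transcript text is a placeholder.
sample_df = pd.DataFrame({
    'url': [
        'https://scrapsfromtheloft.com/comedy/chris-rock-bigger-blacker-1999-full-transcript/',
        'https://scrapsfromtheloft.com/comedy/louis-c-k-sorry-transcript/',
    ],
    'title': [
        'Chris Rock: Bigger & Blacker (1999) – Transcript',
        'Louis C. K.: Sorry (2021) | Transcript',
    ],
    'full_transcript': ['placeholder text one', 'placeholder text two'],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'sample_scripts.json')
    sample_df.to_json(path)          # same call as above, default orientation
    restored = pd.read_json(path)    # load the file back into a DataFrame

print(restored['title'].tolist())
```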