# Capstone 3: Web Scraping #

To build the dataset for the project, we will scrape two websites: www.eslfast.com and http://iteslj.org/. From the first website we will scrape two large sets of ESL dialogues, one at the beginner level and one at the intermediate level. From the second website we will scrape a list of discussion questions to use as "starter questions".

The following code is designed to extract the desired information from each website's source code. The specific operations involved in web scraping vary considerably and must be tailored to the particular page being scraped. This process involves a lot of trial and error. In addition, any change to a site's HTML can break the code, so there is no guarantee that the code below will continue to work.

In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
from time import sleep
from random import randint
import re

### Read Webpage into Python ###

We start by getting a sample of HTML to work with from www.eslfast.com. The URL below is the first dialogue in a section called Easy Dialogues. We use the requests library to fetch the web page by passing the URL to the `get` function, and store the result in a response object called `r`. The response object has a `text` attribute which can be used to access the HTML.

In [2]:
r = requests.get('https://www.eslfast.com/easydialogs/ec/dailylife001.htm')

In [3]:
# print the first 500 characters of the HTML
print(r.text[0:500])

<html>

<head> <!-- Global site tag (gtag.js) - Google Analytics --> <script async src="https://www.googletagmanager.com/gtag/js?id=UA-18587813-3"></script> <script> window.dataLayer = window.dataLayer || []; function gtag(){dataLayer.push(arguments);} gtag('js', new Date()); gtag('config', 'UA-18587813-3'); </script>  <style> audio::-internal-media-controls-download-button { display:none; } audio::-webkit-media-controls-enclosure { overflow:hidden; } audio::-webkit-media-controls-panel { width:


### Parse the HTML ###

Next, we parse the HTML using the Beautiful Soup 4 library. This code parses the HTML into a `BeautifulSoup` object called `soup`, which we can then search and navigate. (`html.parser` is the parser built into Python's standard library, but others, such as `lxml`, can be used.)

In [4]:
soup = BeautifulSoup(r.text, 'html.parser')

In [5]:
print(soup)

<html>
<head> <!-- Global site tag (gtag.js) - Google Analytics --> <script async="" src="https://www.googletagmanager.com/gtag/js?id=UA-18587813-3"></script> <script> window.dataLayer = window.dataLayer || []; function gtag(){dataLayer.push(arguments);} gtag('js', new Date()); gtag('config', 'UA-18587813-3'); </script> <style> audio::-internal-media-controls-download-button { display:none; } audio::-webkit-media-controls-enclosure { overflow:hidden; } audio::-webkit-media-controls-panel { width: calc(100% + 30px); } </style>
<meta content="en-us" http-equiv="Content-Language"/>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<script async="" src="https://pagead2.googlesyndication.com/pagead/js/adsbygoogle.js"></script> <script> (adsbygoogle = window.adsbygoogle || []).push({ google_ad_client: "ca-pub-4240992229388079", enable_page_level_ads: true }); </script>
<meta content="Learning English, ESL, EFL, English as a Second Language, easy dialogues, easy conversatio

### Collect the Results ###

HTML contains text as well as tags used to mark up the text. The tags are found within angle brackets, and many can be seen in the sample of HTML above. For example, there is an opening tag `<title>`, followed by some text, followed by a closing tag `</title>`. Tags also have attributes; for example, the tag `<font face="arial" size="6">` specifies the font and size of the text. Finally, tags can be nested within one another. Web scraping involves taking advantage of all of these facts to locate and extract information on a website.
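To make tags, attributes, and nesting concrete, here is a minimal sketch that walks a small HTML snippet. It uses the standard library's `html.parser` module rather than Beautiful Soup, and both the `TagPrinter` class and the sample snippet are made up for illustration.

```python
from html.parser import HTMLParser


class TagPrinter(HTMLParser):
    """Record each opening tag and text node, indented by nesting depth."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.events = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs
        self.events.append("  " * self.depth + f"<{tag}> attrs={dict(attrs)}")
        self.depth += 1

    def handle_endtag(self, tag):
        self.depth -= 1

    def handle_data(self, data):
        if data.strip():
            self.events.append("  " * self.depth + f"text: {data.strip()!r}")


# Hypothetical snippet modelled on the kind of markup found on a dialogue page.
sample_html = '<p class="MsoNormal"><b>A:</b> Where do you live?</p>'

printer = TagPrinter()
printer.feed(sample_html)
printer.close()
print("\n".join(printer.events))
```

The indentation in the printed lines mirrors the nesting: the `b` tag sits inside the `p` tag, and the `class` attribute is attached to `p`.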

From this point on, the code is specific to this project. Inspecting the HTML of www.eslfast.com reveals that the dialogue we seek is contained in a `p` tag with the class `MsoNormal`. The following code uses `find`, which returns the first matching tag as a single `Tag` object (`find_all` would instead return a "ResultSet", which behaves like a Python list). After that, we split the dialogue into individual lines and write some code to get the title of the dialogue.

In [6]:
result = soup.find('p', attrs={'class':'MsoNormal'})

In [7]:
print(result)

<p class="MsoNormal" style="text-indent: .0in; line-height: 200%">
<b>A:</b> Where do you live?<br/>
<b>B:</b> I live in Pasadena.<br/>
<b>A:</b> Where is Pasadena?<br/>
<b>B:</b> It's in California.<br/>
<b>A:</b> Is it in northern California?<br/>
<b>B:</b> No. It's in southern California.<br/>
<b>A:</b> Is Pasadena a big city?<br/>
<b>B:</b> It's pretty big.<br/>
<b>A:</b> How big is "pretty big"?<br/>
<b>B:</b> It has about 140,000 people.<br/>
<b>A:</b> How big is Los Angeles?<br/>
<b>B:</b> It has about 3 million people.

</p>


In [8]:
dialogue = re.split('\nA: |\nB: ', result.text)
dialogue = [x.strip() for x in dialogue][1:]
print(dialogue)

['Where do you live?', 'I live in Pasadena.', 'Where is Pasadena?', "It's in California.", 'Is it in northern California?', "No. It's in southern California.", 'Is Pasadena a big city?', "It's pretty big.", 'How big is "pretty big"?', 'It has about 140,000 people.', 'How big is Los Angeles?', 'It has about 3 million people.']
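The split step can be tried without fetching anything. The sketch below applies the same pattern to a short string that stands in for `result.text` (the string is an assumption modelled on the tag printed above) and shows why the first piece is discarded.

```python
import re

# Hypothetical stand-in for result.text: it starts with a newline and each
# turn begins with a speaker label.
sample_text = "\nA: Where do you live?\nB: I live in Pasadena.\nA: Where is Pasadena?\n\n"

pieces = re.split(r"\nA: |\nB: ", sample_text)

# Whatever precedes the first speaker label becomes the first piece;
# here it is an empty string, which is why the code slices with [1:].
lines = [piece.strip() for piece in pieces][1:]

print(repr(pieces[0]))
print(lines)
```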


In [9]:
title = soup.find('title')
title = re.findall(r'\d. (.*)', title.text)[0]

In [10]:
print(title)

I Live in Pasadena
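Indexing `[0]` on the result of `re.findall` raises an `IndexError` when a title has no number in it. As a sketch of a more forgiving variant, the hypothetical helper `extract_title` below uses `re.search` with a slightly stricter pattern (one or more digits followed by a literal period) and falls back to the raw title. This is an alternative to the pattern used above, not the same one.

```python
import re


def extract_title(raw_title):
    """Return the text after an 'N. ' marker, or the stripped input if none is found."""
    match = re.search(r"\d+\. (.*)", raw_title)
    return match.group(1) if match else raw_title.strip()


# Sample inputs for illustration; only the middle one appears on a page shown in this notebook.
print(extract_title("1. I Live in Pasadena"))
print(extract_title("Conversation: 1. Greetings"))
print(extract_title("  A title without a number  "))
```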


### Gather the Data ###

Next we'll take all the code from above and package it into a function. We can test out the function on a different URL. After that, we need to apply the function to all of the pages with dialogues.  To accomplish this, we create a list of tuples containing the names of different sections and the number of dialogues in each section.  We will then create a loop within a loop that iterates over this list and returns the lines of each dialogue, along with its section and title. Finally, all of this information can be converted into a `pandas` DataFrame.

In [11]:
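def scrape_page(url):
    # fetch the page and parse its HTML
    r = requests.get(url)
    soup = BeautifulSoup(r.text, 'html.parser')
    # the dialogue is in the first <p class="MsoNormal"> tag
    result = soup.find('p', attrs={'class':'MsoNormal'})
    # split on the speaker labels and drop the piece before the first label
    dialogue = re.split('\nA: |\nB: ', result.text)
    dialogue = [x.strip() for x in dialogue][1:]
    # keep only the part of the page title that follows the number
    title = soup.find('title')
    title = re.findall(r'\d. (.*)', title.text)[0]
    return title, dialogue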

In [12]:
title, dialogue = scrape_page('https://www.eslfast.com/easydialogs/ec/dailylife002.htm')
print('title: ', title)
print('dialogue lines: ', dialogue)

title:  I Have a Honda
dialogue lines:  ['Do you have a car?', 'Yes, I do.', 'What kind of car do you have?', 'I have a Honda.', 'Is it new?', 'It was new in 2003.', "So, it's pretty old now.", 'Yes, it is. But it still looks good.', 'Do you take good care of it?', 'Oh, yes. I wash it once a week.', 'Do you change the oil?', 'My mechanic changes the oil twice a year.']


In [13]:
sections = [('dailylife0', 77), ('schoollife', 15), ('transportation', 21), ('entertainment', 20), ('dating', 13), 
            ('restaurant', 10), ('sports', 12), ('safety', 12), ('travel', 14), ('jobs', 16), ('food', 11), 
            ('shop', 10), ('housing', 10), ('election', 12), ('health', 20)]
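Before running the full scrape, it can help to check the URLs that the section names and counts produce. The sketch below builds them with the same `zfill(2)` logic used in the loop; the helper name `build_urls` and the shortened section list are assumptions for illustration.

```python
def build_urls(sections, template="https://www.eslfast.com/easydialogs/ec/{}.htm"):
    """Return one URL per page, numbering pages 01..length within each section."""
    urls = []
    for name, length in sections:
        for n in range(1, length + 1):
            urls.append(template.format(name + str(n).zfill(2)))
    return urls


# Shortened counts so the output stays small.
sample_sections = [("dailylife0", 2), ("sports", 1)]
for url in build_urls(sample_sections):
    print(url)
```

Because `zfill(2)` pads to two digits, the trailing `0` in `'dailylife0'` is what produces three-digit page numbers such as `dailylife001`, matching the URL scraped earlier.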

In [14]:
topics = []
titles = []
dialogue_lines = []

for section in sections:
    name, length = section
    for n in range(1, length+1):
        num = str(n).zfill(2)
        url = 'https://www.eslfast.com/easydialogs/ec/{}.htm'.format(name+num)
        title, dialogue = scrape_page(url)
        dialogue_lines.extend(dialogue)
        for i in range(len(dialogue)):
            titles.append(title)
            topics.append(name)
        sleep(randint(2,10))
    print("completed ", name)
    
easy_dialogues = pd.DataFrame({'topic': topics, 'title': titles, 'dialogue_line': dialogue_lines})

completed  dailylife0
completed  schoollife
completed  transportation
completed  entertainment
completed  dating
completed  restaurant
completed  sports
completed  safety
completed  travel
completed  jobs
completed  food
completed  shop
completed  housing
completed  election
completed  health
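A loop like the one above stops at the first page that raises an exception, for example when a page lacks the expected tag. One possible way to make a long run more robust is sketched below: failures are recorded and the loop moves on. The names `scrape_all` and `fake_scrape` are hypothetical, and the stand-in scraper exists only so the sketch runs without network access.

```python
def scrape_all(urls, scrape, pause=None):
    """Apply `scrape` to each URL, collecting (title, line) rows and recording failures."""
    rows, failures = [], []
    for url in urls:
        try:
            title, dialogue = scrape(url)
        except Exception as exc:  # e.g. a missing tag or a network error
            failures.append((url, repr(exc)))
        else:
            rows.extend((title, line) for line in dialogue)
        if pause is not None:
            pause()  # keep a delay between requests
    return rows, failures


# Stand-in scraper for illustration only; a real run would pass scrape_page.
def fake_scrape(url):
    if url.endswith("bad.htm"):
        raise AttributeError("'NoneType' object has no attribute 'text'")
    return "Sample Title", ["Hello.", "Hi there."]


rows, failures = scrape_all(["a.htm", "bad.htm"], fake_scrape)
print(rows)
print(failures)
```

In a real run, passing `pause=lambda: sleep(randint(2, 10))` would keep the same random delay between requests, and the `failures` list shows which pages need a second look.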


In [15]:
easy_dialogues.shape

(3270, 3)

In [16]:
easy_dialogues.head(10)

| | topic | title | dialogue_line |
|---|---|---|---|
| 0 | dailylife0 | I Live in Pasadena | Where do you live? |
| 1 | dailylife0 | I Live in Pasadena | I live in Pasadena. |
| 2 | dailylife0 | I Live in Pasadena | Where is Pasadena? |
| 3 | dailylife0 | I Live in Pasadena | It's in California. |
| 4 | dailylife0 | I Live in Pasadena | Is it in northern California? |
| 5 | dailylife0 | I Live in Pasadena | No. It's in southern California. |
| 6 | dailylife0 | I Live in Pasadena | Is Pasadena a big city? |
| 7 | dailylife0 | I Live in Pasadena | It's pretty big. |
| 8 | dailylife0 | I Live in Pasadena | How big is "pretty big"? |
| 9 | dailylife0 | I Live in Pasadena | It has about 140,000 people. |


### Intermediate Dialogues ###

Our first DataFrame looks good. Next we can repeat the process for the intermediate dialogues. The steps are essentially the same, except for differences in the tags used to locate the dialogues. The intermediate dialogues come three to a page, and the HTML contains the names of each section as well as a subsection, so the scraping process will be a little trickier this time.

In [17]:
r = requests.get('https://www.eslfast.com/robot/topics/smalltalk/smalltalk01.htm')

In [18]:
# print the first 500 characters of the HTML
print(r.text[0:500])

<html>

<head> <!-- Global site tag (gtag.js) - Google Analytics --> <script async src="https://www.googletagmanager.com/gtag/js?id=UA-18587813-3"></script> <script> window.dataLayer = window.dataLayer || []; function gtag(){dataLayer.push(arguments);} gtag('js', new Date()); gtag('config', 'UA-18587813-3'); </script>  

<style> audio::-internal-media-controls-download-button { display:none; } audio::-webkit-media-controls-enclosure { overflow:hidden; } audio::-webkit-media-controls-panel { widt


In [19]:
soup = BeautifulSoup(r.text, 'html.parser')

In [20]:
print(soup)

<html>
<head> <!-- Global site tag (gtag.js) - Google Analytics --> <script async="" src="https://www.googletagmanager.com/gtag/js?id=UA-18587813-3"></script> <script> window.dataLayer = window.dataLayer || []; function gtag(){dataLayer.push(arguments);} gtag('js', new Date()); gtag('config', 'UA-18587813-3'); </script>
<style> audio::-internal-media-controls-download-button { display:none; } audio::-webkit-media-controls-enclosure { overflow:hidden; } audio::-webkit-media-controls-panel { width: calc(100% + 30px); } </style>
<meta content="en-us" http-equiv="Content-Language"/>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<script async="" src="https://pagead2.googlesyndication.com/pagead/js/adsbygoogle.js"></script> <script> (adsbygoogle = window.adsbygoogle || []).push({ google_ad_client: "ca-pub-4240992229388079", enable_page_level_ads: true }); </script>
<title>Conversation: 1. Greetings</title>
</head>
<body><div style="width:970px; margin:0 auto;">
<center

In [21]:
result = soup.find('html')
lines = re.split('\n1.|\n2.|\n3.|\nA: |\nB: |\n', result.text)
lines = [x.strip() for x in lines]
lines = [x for x in lines if x != '' and x != 'Repeat']
title = lines[1]
dialogue_lines = lines[2:-2]
print(title)
print(dialogue_lines)

Greetings
['Hi, how are you doing?', "I'm fine. How about yourself?", "I'm pretty good. Thanks for asking.", 'No problem. So how have you been?', "I've been great. What about you?", "I've been good. I'm in school right now.", 'What school do you go to?', 'I go to PCC.', 'Do you like it there?', "It's okay. It's a really big campus.", 'Good luck with school.', 'Thank you very much.', "How's it going?", "I'm doing well. How about you?", 'Never better, thanks.', 'So how have you been lately?', "I've actually been pretty good. You?", "I'm actually in school right now.", 'Which school do you attend?', "I'm attending PCC right now.", 'Are you enjoying it there?', "It's not bad. There are a lot of people there.", 'Good luck with that.', 'Thanks.', 'How are you doing today?', "I'm doing great. What about you?", "I'm absolutely lovely, thank you.", "Everything's been good with you?", "I haven't been better. How about yourself?", 'I started school recently.', 'Where are you going to school?', "I
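One detail of the pattern above is worth knowing: an unescaped `.` in a regular expression matches any character, so `'\n1.'` also matches a line that merely starts with `1` followed by something else. The sketch below compares the pattern as written with an escaped variant on a made-up string; whether such lines occur on the real pages is not something this notebook checks.

```python
import re

# Hypothetical page text: a numbered heading, two turns, and a line starting with "10".
sample_text = "\n1. Greetings\nA: Hi, how are you doing?\nB: I'm fine.\n10 people were there.\n"


def clean(parts):
    """Strip whitespace and drop empty pieces."""
    return [p.strip() for p in parts if p.strip()]


loose = clean(re.split("\n1.|\nA: |\nB: |\n", sample_text))
strict = clean(re.split(r"\n1\.|\nA: |\nB: |\n", sample_text))

print(loose)   # the "10" at the start of the last line is consumed by the split
print(strict)  # the last line is kept intact
```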

In [22]:
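def scrape_page2(url):
    # fetch the page and parse its HTML
    r = requests.get(url)
    soup = BeautifulSoup(r.text, 'html.parser')
    # work with the text of the whole document
    result = soup.find('html')
    # split on the dialogue numbers, the speaker labels, and plain newlines
    lines = re.split('\n1.|\n2.|\n3.|\nA: |\nB: |\n', result.text)
    lines = [x.strip() for x in lines]
    # drop empty strings and the 'Repeat' labels
    lines = [x for x in lines if x != '' and x != 'Repeat']
    # the second entry is taken as the title; the first entry and the last two are discarded
    title = lines[1]
    dialogue_lines = lines[2:-2]
    return title, dialogue_lines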

In [23]:
title, dialogue = scrape_page2('https://www.eslfast.com/robot/topics/smalltalk/smalltalk02.htm')
print('title: ', title)
print('dialogue lines: ', dialogue)

title:  Weather (1)
dialogue lines:  ["It's an ugly day today.", 'I know. I think it may rain.', "It's the middle of summer, it shouldn't rain today.", 'That would be weird.', "Yeah, especially since it's ninety degrees outside.", 'I know, it would be horrible if it rained and it was hot outside.', 'Yes, it would be.', "I really wish it wasn't so hot every day.", "Me too. I can't wait until winter.", 'I like winter too, but sometimes it gets too cold.', "I'd rather be cold than hot.", 'Me too.', "It doesn't look very nice outside today.", "You're right. I think it's going to rain later.", "In the middle of the summer, it shouldn't be raining.", "That wouldn't seem right.", "Considering that it's over ninety degrees outside, that would be weird.", "Exactly, it wouldn't be nice if it started raining. It's too hot.", "I know, you're absolutely right.", 'I wish it would cool off one day.', "That's how I feel, I want winter to come soon.", 'I enjoy the winter, but it gets really cold someti

In [24]:
sections = [('smalltalk', 'smalltalk', 24), ('college', 'collegelife', 24), ('library', 'library', 14), 
            ('transfer', 'transfer', 14), ('social', 'social', 14), ('dating', 'dating', 8), 
            ('apartment', '1apartment', 10), ('apartment', '2apartment', 45), ('transport', 'bus', 15), 
            ('dailylife', 'dailylife', 26), ('shop', 'shop', 21), ('bank', 'bank', 17), ('food', 'food', 21), 
            ('restaurant', 'restaurant', 16), ('transport', 'buycar', 13), ('transport', 'drive', 20), 
            ('health', 'health', 26), ('employment', 'employment', 24), ('travel', 'travel', 28), 
            ('hotel', 'hotel', 21), ('buyhouse', 'buyhouse', 12), ('salehouse', 'salehouse', 17), 
            ('community', 'community', 13), ('crime', 'crime', 14), ('vote', 'vote', 17)
           ]

In [26]:
topics = []
titles = []
dialogue_lines = []

for section in sections:
    name1, name2, length = section
    for n in range(1, length+1):
        num = str(n).zfill(2)
        url = 'https://www.eslfast.com/robot/topics/{}/{}.htm'.format(name1, name2+num)
        title, dialogue = scrape_page2(url)
        dialogue_lines.extend(dialogue)
        for i in range(len(dialogue)):
            titles.append(title)
            topics.append(name2)
        sleep(randint(2,10))
    print("completed ", name2)
    
int_dialogues = pd.DataFrame({'topic': topics, 'title': titles, 'dialogue_line': dialogue_lines})

completed  smalltalk
completed  collegelife
completed  library
completed  transfer
completed  social
completed  dating
completed  1apartment
completed  2apartment
completed  bus
completed  dailylife
completed  shop
completed  bank
completed  food
completed  restaurant
completed  buycar
completed  drive
completed  health
completed  employment
completed  travel
completed  hotel
completed  buyhouse
completed  salehouse
completed  community
completed  crime
completed  vote
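Since every page is followed by a random pause, a rough estimate of the waiting time is useful before starting a run. The sketch below assumes the page counts listed in `sections` and uses the fact that `randint(2, 10)` averages 6 seconds; the helper name `expected_sleep_seconds` is made up, and download time is not included.

```python
# Page counts copied from the third element of each tuple in `sections`.
page_counts = [24, 24, 14, 14, 14, 8, 10, 45, 15, 26, 21, 17, 21,
               16, 13, 20, 26, 24, 28, 21, 12, 17, 13, 14, 17]


def expected_sleep_seconds(counts, low=2, high=10):
    """Expected total pause when each page waits randint(low, high) seconds."""
    return sum(counts) * (low + high) / 2


total_pages = sum(page_counts)
minutes = expected_sleep_seconds(page_counts) / 60
print(total_pages, "pages, about", round(minutes, 1), "minutes of pauses")
```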


In [27]:
int_dialogues.shape

(16000, 3)

In [28]:
int_dialogues.head(10)

| | topic | title | dialogue_line |
|---|---|---|---|
| 0 | smalltalk | Greetings | Hi, how are you doing? |
| 1 | smalltalk | Greetings | I'm fine. How about yourself? |
| 2 | smalltalk | Greetings | I'm pretty good. Thanks for asking. |
| 3 | smalltalk | Greetings | No problem. So how have you been? |
| 4 | smalltalk | Greetings | I've been great. What about you? |
| 5 | smalltalk | Greetings | I've been good. I'm in school right now. |
| 6 | smalltalk | Greetings | What school do you go to? |
| 7 | smalltalk | Greetings | I go to PCC. |
| 8 | smalltalk | Greetings | Do you like it there? |
| 9 | smalltalk | Greetings | It's okay. It's a really big campus. |


### The Starter Questions ###

The second DataFrame also looks good. Now let's move on to scraping the next site to get the starter questions. We will repeat much of the process again, but in this case we only need to scrape one page, so it isn't necessary to create a function to call inside a for loop. We will convert the list of questions into a `pandas` Series, since there aren't different sections to worry about.

In [29]:
r = requests.get('http://iteslj.org/questions/getting.html')

In [30]:
# print the first 500 characters of the HTML
print(r.text[0:500])

<html><head><title>ESL Conversation Questions - Getting to Know Each Other (I-TESL-J)</title><meta name="description" content="A list of questions you can use to generate conversations in the ESL/EFL classroom."><meta name="keywords" content="ESL EFL ESOL TESL TEFL TESOL ELT English as a Second Language English as a Foreign Language"><meta http-equiv="Content-Type" content="text/html;charset=iso-8859-1"><script src="http://iteslj.org/ue.js"></script></head><body bgcolor="#ffffff" text="#000000" 


In [31]:
soup = BeautifulSoup(r.text, 'html.parser')

In [32]:
print(soup)

<html><head><title>ESL Conversation Questions - Getting to Know Each Other (I-TESL-J)</title><meta content="A list of questions you can use to generate conversations in the ESL/EFL classroom." name="description"/><meta content="ESL EFL ESOL TESL TEFL TESOL ELT English as a Second Language English as a Foreign Language" name="keywords"/><meta content="text/html;charset=utf-8" http-equiv="Content-Type"/><script src="http://iteslj.org/ue.js"></script></head><body bgcolor="#ffffff" link="#0000ff" text="#000000" vlink="#600042"><script type="text/javascript">menu();</script><div class="main wide"><center><h1>Conversation Questions<br/>
Getting to Know Each Other
</h1>
A Part of <a href="http://iteslj.org/questions/">Conversation Questions for the ESL Classroom</a>.
</center>
<ul>
<li>Do you have any pets?
<li>What was the last book you read?
<li>Do you like to cook?
<li>What's your favorite food?
<li>Are you good at cooking/swimming/etc?
<li>Are you married or single?
<li>Do you h

In [33]:
results = soup.find('ul')

In [34]:
print(results)

<ul>
<li>Do you have any pets?
<li>What was the last book you read?
<li>Do you like to cook?
<li>What's your favorite food?
<li>Are you good at cooking/swimming/etc?
<li>Are you married or single?
<li>Do you have brothers and sisters?<ul><li>Are they older or younger than you?</li></ul>
<li>Do you like baseball?
<li>Do you live alone?
<li>Do you live in a house or an apartment?
<li>Have you ever lived in another country?
<li>Have you ever met a famous person?
<li>How do you spend your free time?
<li>How long have you been studying English?
<li>How old are you?
<li>How tall are you?
<li>Tell me about a favorite event of your adulthood.
<li>Tell me about a favorite event of your childhood.
<li>What are your hobbies?
<li>What two things could you not do when you were...?
<li>What countries have you visited?
<li>What country are you from?
<li>What do you do on Sundays?
<li>What do you do? What's your job?
<li>What do you like to do in your free time?
<li>What hobbie

In [35]:
questions = re.split('<li>|<ul>|\n|\r\n', results.text)
questions = [x for x in questions if (len(x)>0 and len(x)<60)]

print(questions)

['Do you have any pets?', 'What was the last book you read?', 'Do you like to cook?', "What's your favorite food?", 'Are you good at cooking/swimming/etc?', 'Are you married or single?', 'Do you like baseball?', 'Do you live alone?', 'Do you live in a house or an apartment?', 'Have you ever lived in another country?', 'Have you ever met a famous person?', 'How do you spend your free time?', 'How long have you been studying English?', 'How old are you?', 'How tall are you?', 'Tell me about a favorite event of your adulthood.', 'Tell me about a favorite event of your childhood.', 'What are your hobbies?', 'What two things could you not do when you were...?', 'What countries have you visited?', 'What country are you from?', 'What do you do on Sundays?', "What do you do? What's your job?", 'What do you like to do in your free time?', 'What hobbies do you have?', 'What is your motto?', 'What kind of food do you like?', 'What kind of people do you like?', 'What kind of people do you not like
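The length filter has a side effect that is easy to miss: the question about brothers and sisters does not appear in the list above. A likely explanation is that its nested follow-up question lands on the same line once the tags are stripped, which pushes the combined line past the 60-character limit. The sketch below reproduces this on a short string that stands in for `results.text` (an assumption based on the HTML printed earlier).

```python
import re

# Hypothetical stand-in for results.text: tags are already stripped, so the
# nested question sits on the same line as its parent question.
sample_text = (
    "\nAre you married or single?"
    "\nDo you have brothers and sisters?Are they older or younger than you?"
    "\nDo you like baseball?\n"
)

pieces = re.split(r"\r\n|\n", sample_text)
kept = [p for p in pieces if 0 < len(p) < 60]
dropped = [p for p in pieces if len(p) >= 60]

print(kept)
print(dropped)
```

If those two questions matter for the project, they would need to be recovered separately, for example by adding them to the list by hand.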

In [36]:
question_series = pd.Series(questions)
question_series.head(10)

0                      Do you have any pets?
1           What was the last book you read?
2                       Do you like to cook?
3                 What's your favorite food?
4      Are you good at cooking/swimming/etc?
5                 Are you married or single?
6                      Do you like baseball?
7                         Do you live alone?
8    Do you live in a house or an apartment?
9    Have you ever lived in another country?
dtype: object

### Save Data to CSV Files ###

The final step is to save the two dialogue DataFrames and the question Series as CSV files. This will allow them to be easily imported during the next stage of the project.

In [37]:
easy_dialogues.to_csv('easy_dialogues.csv', index=False)
int_dialogues.to_csv('int_dialogues.csv', index=False)
question_series.to_csv('esl_questions.csv', index=False)
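Some dialogue lines contain commas or quotation marks, so the CSV format has to quote them for the data to survive a round trip. The sketch below illustrates this with the standard library's `csv` module and an in-memory buffer instead of `pandas` and files on disk; it is only meant to show the quoting, not to replace the `to_csv` calls above.

```python
import csv
import io

rows = [
    ("dailylife0", "I Live in Pasadena", 'How big is "pretty big"?'),
    ("dailylife0", "I Live in Pasadena", "It has about 140,000 people."),
]

# Write a header and the rows to an in-memory buffer.
buffer = io.StringIO(newline="")
writer = csv.writer(buffer)
writer.writerow(["topic", "title", "dialogue_line"])
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)

# Read the same text back and skip the header row.
buffer.seek(0)
reloaded = list(csv.reader(buffer))[1:]
print(reloaded == [list(row) for row in rows])
```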