# USNews Top Stories WebScraping

This small project uses the U.S. News website (usnews.com) as an example of web scraping.

I inspected the site's HTML and identified the **second** story under Top Stories. I then scraped the title and the first three sentences of the body text of that story, splitting the sentences with "tokenize".

At the end, I explain two choices I made along the way:  
Why I used tokenize to split sentences rather than splitting on '.'.  
Why I selected the individual paragraphs rather than just taking the text of main_body.

In [1]:
# The requests module allows you to send HTTP requests using Python.
import requests
#import urllib.request
import re
from bs4 import BeautifulSoup as BS

## Go to the USNews website.

Set the URL of the website and fetch the page with the requests library.

A response of 200 means the request succeeded.

**NOTE:**
We need to set a User-Agent and pass it as a request header in order to connect.

In [2]:
url = 'https://www.usnews.com'
agent = {"User-Agent":'Mozilla/5.0'}
response = requests.get(url, headers = agent)
response

<Response [200]>
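The status check can also be done in code rather than by reading the printed response. Below is a minimal sketch, assuming a hypothetical helper named `fetch` and an arbitrary 10-second timeout (neither is part of the original notebook). It also builds the request without sending it, so we can confirm the User-Agent header is attached.

```python
import requests

URL = "https://www.usnews.com"
AGENT = {"User-Agent": "Mozilla/5.0"}  # same header value as in the cell above


def fetch(url, headers=AGENT, timeout=10):
    """Hypothetical helper: return the page HTML, or raise on a 4xx/5xx status."""
    response = requests.get(url, headers=headers, timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError for error statuses
    return response.text


# Build the request without sending it, to inspect what would go over the wire.
prepared = requests.Request("GET", URL, headers=AGENT).prepare()
print(prepared.headers["User-Agent"])
```

Calling `fetch(URL)` would perform the actual download; it is left out here so the sketch runs without network access.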

## Parse the html with BeautifulSoup
This gives us a nicer, nested BeautifulSoup data structure to work with.

In [3]:
soup = BS(response.text, "html.parser")
# or, passing the raw bytes instead of the decoded text:
# soup = BS(response.content, "html.parser")

## Identify the second story of "Top Stories". Fetch the link and get redirected to the page of that story.

By inspecting the HTML, we find that the story titles are in h3 tags.
We use the method [.findAll] (the older name for [.find_all]) to locate all of the h3 tags. This returns every element that is an h3 tag.

In [4]:
# get all the <h3> nodes
soup.findAll('h3')

[<h3 class="Heading-bocdeh-1 iqkCSQ Heading__HeadingStyled-bocdeh-0-h3 jqIpxp" size="6" spacing="4">Panic and Play During the Pandemic </h3>,
 <h3 class="Heading-bocdeh-1 iqkCSQ Heading__HeadingStyled-bocdeh-0-h3 cYppwF" size="4" spacing="3">Panic and Play During the Pandemic </h3>,
 <h3 class="story-headline ContentBox__StoryHeading-s48yiwo-3 ktdZVE Heading-bocdeh-1 iqkCSQ Heading__HeadingStyled-bocdeh-0-h3 TtOgA" size="2" spacing="3">The Latest on the Coronavirus</h3>,
 <h3 class="story-headline ContentBox__StoryHeading-s48yiwo-3 ktdZVE Heading-bocdeh-1 iqkCSQ Heading__HeadingStyled-bocdeh-0-h3 TtOgA" size="2" spacing="3">Signs It's Time to Find a New Job</h3>,
 <h3 class="story-headline ContentBox__StoryHeading-s48yiwo-3 ktdZVE Heading-bocdeh-1 iqkCSQ Heading__HeadingStyled-bocdeh-0-h3 TtOgA" size="2" spacing="3"><a href="https://www.usnews.com/news/world-report/articles/2020-03-19/italy-coronavirus-death-toll-surpasses-china">Italy Virus Death Toll Surpasses China</a></h3>,
 <h3 cl

But what we want is the second story, which is on line 6 of the output above. 

**Note:** Python indexing starts at 0, so line 6 is index [5].

We then use [.find] to get the a tag nested inside the h3.  
Then we return the 'href' attribute of that a tag.  

    NOTE: (use type() to check)  
    return a tag: .find()  
    return a resultset: .findAll()  
    return an attribute: node[]  

In [5]:
second_story = soup.findAll('h3')[5]
link = second_story.find('a')['href']
link

'https://www.usnews.com/news/national-news/articles/2020-03-19/us-quarantines-troops-in-afghanistan-amid-coronavirus-fallout'
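The difference between the three return types in the note above can be checked on a small, made-up page. This is only a sketch: the markup and the example.com links are hypothetical stand-ins, not the real front page. It also shows one way to avoid counting headings by hand, by keeping only the h3 tags that actually wrap a link.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup standing in for the real front page.
html = """
<h3>Feature Without a Link</h3>
<h3><a href="https://example.com/story-1">First Story</a></h3>
<h3><a href="https://example.com/story-2">Second Story</a></h3>
"""
demo_soup = BeautifulSoup(html, "html.parser")

headings = demo_soup.find_all("h3")   # list-like ResultSet of Tag objects
print(type(headings).__name__)        # ResultSet
print(type(headings[1]).__name__)     # Tag

# find() returns None when an h3 has no <a>, so filter those headings out first.
story_links = [h3.find("a")["href"] for h3 in headings if h3.find("a")]
second_link = story_links[1]
print(second_link)
```

Whether filtering by link is the right rule for the live site is an assumption; the fixed index [5] used above reflects the page as it looked when the notebook was run.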

**Another way**

In [25]:
h2 = soup.find('h2', text = "Top Stories")
links = h2.find_next_sibling().select('a') # select() returns a list of matching tags

url2ndTopStory = links[3].get('href')
print('url2ndTopStory: \n', url2ndTopStory)
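The same idea can be tried on a toy document to see what each step returns. This is a sketch under assumptions: the markup is invented, and it uses `string=`, the current keyword for what the cell above passes as `text=`. The guards handle the case where the heading or its sibling is missing.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: a section heading followed by a sibling container of links.
html = """
<h2>Top Stories</h2>
<div>
  <a href="https://example.com/a">Story A</a>
  <a href="https://example.com/b">Story B</a>
</div>
"""
demo = BeautifulSoup(html, "html.parser")

heading = demo.find("h2", string="Top Stories")
# find_next_sibling("div") skips the whitespace between the tags.
container = heading.find_next_sibling("div") if heading else None
hrefs = [a.get("href") for a in container.select("a")] if container else []
print(hrefs)
```

On the real page the position of the wanted link inside the list (index 3 above) depends on how many anchors each story contains, so it should be checked against the live HTML.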

Go to the second story page.

In [6]:
response2 = requests.get(link, headers = agent)
response2

<Response [200]>

## Get the headline of the story.

In [7]:
soup2 = BS(response2.text, "html.parser")

In [8]:
header = soup2.find('h1')
print(header.text)

U.S. Quarantines Troops in Afghanistan Amid Coronavirus Fallout


## Print the first three sentences of the main body.

We can find the specific element easily by its id and class.
Here, the id helps us navigate to the main body text, while the class helps us identify the paragraphs in the main body text rather than the external links or images.

In [9]:
main_body = soup2.find(id='ad-in-text-target')

Because we only want the first three sentences, we take only the first three paragraphs. Each paragraph holds at least one sentence, so these are certain to include the first three sentences.
    
    NOTE:  
    [0:3] excludes the end index, so it gives items 0, 1, 2.

In [19]:
# text = main_body.findAll('div', class_='Raw-s14xcvr1-0 AXWJq')[0:3]
text = main_body.findAll('div', class_='Raw-s14xcvr1-0 jkSsZN')[0:3]
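The class suffix changed between runs (AXWJq in the commented line, jkSsZN in the live one), which suggests the full class string is not stable. One possible workaround, sketched here on invented markup, is to match only the `Raw-` prefix with a regular expression. That the prefix stays stable on the real site is an assumption.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup imitating the article body: paragraphs plus an ad block.
html = """
<div id="ad-in-text-target">
  <div class="Raw-s14xcvr1-0 jkSsZN">Paragraph one.</div>
  <div class="ad-slot">Advertisement</div>
  <div class="Raw-s14xcvr1-0 jkSsZN">Paragraph two.</div>
  <div class="Raw-s14xcvr1-0 jkSsZN">Paragraph three.</div>
  <div class="Raw-s14xcvr1-0 jkSsZN">Paragraph four.</div>
</div>
"""
demo = BeautifulSoup(html, "html.parser")
body = demo.find(id="ad-in-text-target")

# A regex given as class_ is tested against each class name of a tag,
# so the changing suffix no longer matters.
paragraphs = body.find_all("div", class_=re.compile(r"^Raw-"))[0:3]
texts = [p.get_text(strip=True) for p in paragraphs]
print(texts)
```

The slice keeps items 0, 1 and 2, and the ad block is skipped because none of its class names start with `Raw-`.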

Because .findAll returns a ResultSet, we cannot read a .text attribute from it directly. Instead, we use a for loop to get the text of each node, and we use .append to collect the results of the loop.

In [20]:
p3=[]
for p in text:
   p3.append((p.text))

To split the text into sentences, we can use either a regular expression or the Natural Language Toolkit (NLTK).
Here we use NLTK's tokenize package.

The tokenize.sent_tokenize() function needs a string, but the p3 that we've got is a list.

If we just use the str() function to turn it into a string, the final output will be quite messy, with list syntax such as \[ left in.  

If we use p3_str = p3[0] + p3[1] + p3[2], there is no space between the concatenated strings, so the tokenize function fails to detect the sentence boundaries.  

So we use the ' '.join function to join the list of strings.

In [21]:
p3

['Troops currently based in Afghanistan will have to remain beyond their expected departure dates and 1,500 new arrivals will stay in quarantine, the U.S. headquarters there announced Thursday as the global public health crisis surrounding the coronavirus outbreak complicates a planned withdrawal of U.S. forces.',
 '"To preserve our currently-healthy force, Resolute Support is making the necessary adjustments to temporarily pause personnel movement into theater," Army Gen. Scott Miller, commander of U.S. operations in Afghanistan, said in a statement early Thursday, using the official name for the American mission there. Only essential personnel can now access U.S. bases in Afghanistan, and Americans are increasing the number of teleconferences with their Afghan counterparts instead of in-person meetings. ',
 '"In some cases, these measures will necessitate some servicemembers remaining beyond their scheduled departure dates to continue the mission," Miller said. ']

In [22]:
p3_str = ' '.join([p3[0],p3[1],p3[2]])
p3_str

'Troops currently based in Afghanistan will have to remain beyond their expected departure dates and 1,500 new arrivals will stay in quarantine, the U.S. headquarters there announced Thursday as the global public health crisis surrounding the coronavirus outbreak complicates a planned withdrawal of U.S. forces. "To preserve our currently-healthy force, Resolute Support is making the necessary adjustments to temporarily pause personnel movement into theater," Army Gen. Scott Miller, commander of U.S. operations in Afghanistan, said in a statement early Thursday, using the official name for the American mission there. Only essential personnel can now access U.S. bases in Afghanistan, and Americans are increasing the number of teleconferences with their Afghan counterparts instead of in-person meetings.  "In some cases, these measures will necessitate some servicemembers remaining beyond their scheduled departure dates to continue the mission," Miller said. '
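The three options discussed above can be compared side by side on a short, made-up list (the sentences are placeholders, not scraped text). The sketch also strips each item before joining, which avoids the double space visible in the output above, and joins the whole list instead of naming each index.

```python
# Placeholder paragraphs; the second one has a trailing space like the scraped text.
paragraphs = ["It rained on Monday.", "It was sunny on Tuesday. ", "Nobody minded."]

as_repr = str(paragraphs)                             # keeps brackets, quotes and commas
glued = paragraphs[0] + paragraphs[1] + paragraphs[2]  # no space after the first period
joined = " ".join(p.strip() for p in paragraphs)       # one clean space between items

print(as_repr)
print(glued)
print(joined)
```

With the notebook's own list, the equivalent call would be `' '.join(p.strip() for p in p3)`, which keeps working if the number of paragraphs changes.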

In [23]:
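from nltk import tokenize
# One-time download of the sentence tokenizer data (newer NLTK releases may ask for 'punkt_tab'):
# import nltk; nltk.download('punkt')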

In [24]:
sentences = tokenize.sent_tokenize(p3_str)
three_sent =sentences[0:3]
three_sent

['Troops currently based in Afghanistan will have to remain beyond their expected departure dates and 1,500 new arrivals will stay in quarantine, the U.S. headquarters there announced Thursday as the global public health crisis surrounding the coronavirus outbreak complicates a planned withdrawal of U.S. forces.',
 '"To preserve our currently-healthy force, Resolute Support is making the necessary adjustments to temporarily pause personnel movement into theater," Army Gen. Scott Miller, commander of U.S. operations in Afghanistan, said in a statement early Thursday, using the official name for the American mission there.',
 'Only essential personnel can now access U.S. bases in Afghanistan, and Americans are increasing the number of teleconferences with their Afghan counterparts instead of in-person meetings.']

### Notes about why tokenize rather than split('.')

In [29]:
# split('.') cannot tell a sentence-ending period from one inside an abbreviation such as "U.S.".
sentences = soup2.find(id = 'ad-in-text-target').get_text().split('.')
print('First three sentences:')
print(". ".join([sentences[i].strip() for i in range(3)]))

First three sentences:
Troops currently based in Afghanistan will have to remain beyond their expected departure dates and 1,500 new arrivals will stay in quarantine, the U. S. headquarters there announced Thursday as the global public health crisis surrounding the coronavirus outbreak complicates a planned withdrawal of U
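The same failure can be reproduced on a short, invented sentence pair using only the standard library. The sketch also tries a simple regular expression (my own illustrative pattern, not something from NLTK) that splits after sentence punctuation only when a capital letter or quote follows. It handles "U.S." here but still breaks on a title like "Gen.", which is the kind of case the trained tokenizer above kept intact.

```python
import re

# Invented example text containing two abbreviations.
text = "Officials in the U.S. announced a pause. Gen. Miller said it is temporary."

naive = text.split(".")
print(naive)

# Illustrative pattern: split on whitespace that follows . ! or ?
# and precedes an uppercase letter or an opening double quote.
by_regex = re.split(r'(?<=[.!?])\s+(?=[A-Z"])', text)
print(by_regex)
```

The plain split yields six fragments for two sentences, while the regex yields three, with "Gen." wrongly cut off as its own piece.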


### Notes about why to identify the paragraphs rather than just use the text of main_body

In [31]:
# main_body also contains ads and other inserts, and their text ends up in the sentences.
main_body_text = soup2.find(id = 'ad-in-text-target').get_text()
sentences = tokenize.sent_tokenize(main_body_text)
three_sent = sentences[0:3]
three_sent

['Troops currently based in Afghanistan will have to remain beyond their expected departure dates and 1,500 new arrivals will stay in quarantine, the U.S. headquarters there announced Thursday as the global public health crisis surrounding the coronavirus outbreak complicates a planned withdrawal of U.S.',
 'forces.',
 '[\xa0SEE: The Week in Cartoons for March 16-20\xa0]"To preserve our currently-healthy force, Resolute Support is making the necessary adjustments to temporarily pause personnel movement into theater," Army Gen. Scott Miller, commander of U.S. operations in Afghanistan, said in a statement early Thursday, using the official name for the American mission there.']
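The effect is easy to see on a tiny, made-up container (the ids and class names here are hypothetical). Taking the text of the whole container glues the promo text onto the neighbouring paragraphs with no separator, while selecting the paragraph nodes first leaves it out.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: two paragraphs with a promo link between them.
html = (
    '<div id="body">'
    '<div class="para">Forces will stay.</div>'
    '<div class="promo">[ SEE: Cartoons ]</div>'
    '<div class="para">"We will pause," he said.</div>'
    '</div>'
)
demo = BeautifulSoup(html, "html.parser")
container = demo.find(id="body")

whole = container.get_text()  # includes the promo text, with no spaces added
paragraphs_only = " ".join(
    d.get_text() for d in container.find_all("div", class_="para")
)

print(whole)
print(paragraphs_only)
```

This mirrors the output above, where the "SEE:" insert ended up fused to the start of the third sentence.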