**Read the webpage**

In [1]:
import requests

# create the response object
r = requests.get('https://www.nytimes.com/interactive/2017/06/23/opinion/trumps-lies.html')
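
A request can fail or hang, so it may be worth checking the response before parsing it. The sketch below wraps the call in a hypothetical helper, `fetch_html` (not part of the original notebook), that sets a timeout and raises on HTTP error status codes. The 10-second timeout is an arbitrary assumption.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the page body as text, raising if the request fails.

    fetch_html is a hypothetical helper for illustration; the timeout
    value is an assumption, not something the original notebook sets.
    """
    response = requests.get(url, timeout=timeout)
    # raise requests.HTTPError for 4xx/5xx responses instead of parsing an error page
    response.raise_for_status()
    return response.text


# usage (performs a network request, so it is left commented out):
# html = fetch_html('https://www.nytimes.com/interactive/2017/06/23/opinion/trumps-lies.html')
# print(html[0:500])
```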

In [2]:
# the response object's text attribute holds the page's HTML; print the first 500 characters
print(r.text[0:500])

<!DOCTYPE html>
<!--[if (gt IE 9)|!(IE)]> <!--><html lang="en" class="no-js page-interactive section-opinion page-theme-standard tone-opinion page-interactive-default limit-small layout-xlarge app-interactive" itemid="https://www.nytimes.com/interactive/2017/06/23/opinion/trumps-lies.html" itemtype="http://schema.org/NewsArticle" itemscope xmlns:og="http://opengraphprotocol.org/schema/"><!--<![endif]-->
<!--[if IE 9]> <html lang="en" class="no-js ie9 lt-ie10 page-interactive section-opinion page


**Parse the HTML (Understand the pattern in HTML)**

In [3]:
# to work with the structure of the HTML we need a parser
# Beautiful Soup is one of the most popular, so let's import it
from bs4 import BeautifulSoup
soup = BeautifulSoup(r.text, 'html.parser')

You might have noticed that each record has the following format:

`<span class="short-desc"><strong> DATE </strong> LIE <span class="short-truth"><a href="URL"> EXPLANATION </a></span></span>`


In [4]:
# now that we know the pattern, ask the soup object for every matching element
results = soup.find_all('span', attrs={'class':'short-desc'})
# this returns every 'span' tag whose class attribute includes 'short-desc'
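
To see what this search matches, here is a small sketch on a made-up HTML snippet (the record text and URLs are placeholders, not data from the page). It also shows the `class_` keyword, which Beautiful Soup accepts as an equivalent way to filter by CSS class.

```python
from bs4 import BeautifulSoup

# made-up snippet that follows the record pattern; the content is placeholder text
sample_html = (
    '<span class="short-desc"><strong>Jan. 1 </strong>"Claim one." '
    '<span class="short-truth"><a href="https://example.com/1">(Rebuttal one.)</a></span></span>'
    '<span class="short-desc"><strong>Jan. 2 </strong>"Claim two." '
    '<span class="short-truth"><a href="https://example.com/2">(Rebuttal two.)</a></span></span>'
)
sample_soup = BeautifulSoup(sample_html, 'html.parser')

# filter with an attrs dictionary, as in the notebook cell
by_attrs = sample_soup.find_all('span', attrs={'class': 'short-desc'})
# equivalent filter using the class_ keyword argument
by_keyword = sample_soup.find_all('span', class_='short-desc')

# the nested short-truth spans are not matched, so only the two records come back
print(len(by_attrs), len(by_keyword))
```

Either spelling should work on the real page; the `attrs` dictionary is the more general form because it can filter on any attribute.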

In [5]:
# find_all returns a ResultSet, which behaves like a Python list, so we can check its length
len(results)

180

In [6]:
# let's look at the first three results to verify that they contain what we need
results[0:3]

[<span class="short-desc"><strong>Jan. 21 </strong>“I wasn't a fan of Iraq. I didn't want to go into Iraq.” <span class="short-truth"><a href="https://www.buzzfeed.com/andrewkaczynski/in-2002-donald-trump-said-he-supported-invading-iraq-on-the" target="_blank">(He was for an invasion before he was against it.)</a></span></span>,
 <span class="short-desc"><strong>Jan. 21 </strong>“A reporter for Time magazine — and I have been on their cover 14 or 15 times. I think we have the all-time record in the history of Time magazine.” <span class="short-truth"><a href="http://nation.time.com/2013/11/06/10-things-you-didnt-know-about-time/" target="_blank">(Trump was on the cover 11 times and Nixon appeared 55 times.)</a></span></span>,
 <span class="short-desc"><strong>Jan. 23 </strong>“Between 3 million and 5 million illegal votes caused me to lose the popular vote.” <span class="short-truth"><a href="https://www.nytimes.com/2017/01/23/us/politics/donald-trump-congress-democrats.html" target="_

In [7]:
# let's look at the last record to confirm that the results run through to the end of the list
results[-1]

<span class="short-desc"><strong>Nov. 11 </strong>“I'd rather have him  – you know, work with him on the Ukraine than standing and arguing about whether or not  – because that whole thing was set up by the Democrats.” <span class="short-truth"><a href="https://www.nytimes.com/interactive/2017/12/10/us/politics/trump-and-russia.html" target="_blank">(There is no evidence that Democrats "set up" Russian interference in the election.)</a></span></span>

**Extract the required parts**

In [8]:
# it is easier to work out the extraction on the first item; later we will loop over all of them
first_result = results[0]
first_result

<span class="short-desc"><strong>Jan. 21 </strong>“I wasn't a fan of Iraq. I didn't want to go into Iraq.” <span class="short-truth"><a href="https://www.buzzfeed.com/andrewkaczynski/in-2002-donald-trump-said-he-supported-invading-iraq-on-the" target="_blank">(He was for an invasion before he was against it.)</a></span></span>

**Extract date**

In [9]:
first_result.find('strong')
# this still returns a Beautiful Soup Tag object, not a string

<strong>Jan. 21 </strong>

In [10]:
# we need its text, which the text attribute returns as a Python string
first_result.find('strong').text

'Jan. 21\xa0'

In [11]:
# the trailing '\xa0' is a non-breaking space (&nbsp; in HTML); let's slice it off
first_result.find('strong').text[0:-1]

'Jan. 21'
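
Slicing with `[0:-1]` works here because there is exactly one unwanted character at the end. As a hedged alternative, `str.strip()` treats the non-breaking space as whitespace, so it does not depend on counting characters. The sketch below uses only the string shown in the output above.

```python
# the same string that first_result.find('strong').text returned above
raw_date = 'Jan. 21\xa0'

# slicing assumes there is exactly one unwanted trailing character
sliced = raw_date[0:-1]

# str.strip() removes any run of leading or trailing whitespace,
# and the non-breaking space counts as whitespace for a Python str
stripped = raw_date.strip()

print(sliced == stripped)
```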

In [12]:
# let's append the year so that the date is complete
first_result.find('strong').text[0:-1]+', 2017'

'Jan. 21, 2017'

**Extract lie**

In [14]:
# the quote is not wrapped in a tag of its own, so find() cannot reach it; use the contents attribute instead
first_result.contents
# contents is a Python list of the tag's direct children

[<strong>Jan. 21 </strong>,
 "“I wasn't a fan of Iraq. I didn't want to go into Iraq.” ",
 <span class="short-truth"><a href="https://www.buzzfeed.com/andrewkaczynski/in-2002-donald-trump-said-he-supported-invading-iraq-on-the" target="_blank">(He was for an invasion before he was against it.)</a></span>]
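
The list mixes two kinds of children: elements (`Tag`) and bare text (`NavigableString`). The sketch below, on a made-up record with placeholder text, prints the type of each child and picks out the text nodes by type rather than by position.

```python
from bs4 import BeautifulSoup, NavigableString

# made-up record following the same pattern; the text is a placeholder
sample_html = (
    '<span class="short-desc"><strong>Jan. 1 </strong>"Placeholder claim." '
    '<span class="short-truth"><a href="https://example.com/1">(Placeholder rebuttal.)</a></span></span>'
)
sample_tag = BeautifulSoup(sample_html, 'html.parser').find('span', attrs={'class': 'short-desc'})

# each child is either a Tag (an element) or a NavigableString (bare text)
child_types = [type(child).__name__ for child in sample_tag.contents]
print(child_types)

# keep only the bare-text children, which is where the quote lives
text_children = [str(child) for child in sample_tag.contents if isinstance(child, NavigableString)]
print(text_children)
```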

In [15]:
# so let's extract the second element
first_result.contents[1]

"“I wasn't a fan of Iraq. I didn't want to go into Iraq.” "

In [16]:
# slice off the opening curly quote at the start, and the closing curly quote plus the trailing space at the end
first_result.contents[1][1:-2]

"I wasn't a fan of Iraq. I didn't want to go into Iraq."

**Extract Explanation**

In [17]:
# either approach would work here, but let's use find() on the tag; the slice drops the surrounding parentheses
first_result.find('a').text[1:-1]

'He was for an invasion before he was against it.'
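
The fixed slice positions (`[1:-2]` for the quote, `[1:-1]` for the explanation) rely on every record having the same leading and trailing characters. A minimal alternative sketch, using the two raw strings seen above, strips by character instead. One caveat: `strip('()')` would also remove a parenthesis that legitimately ends the text.

```python
# the raw strings seen above: the quote has a trailing space, the explanation has parentheses
raw_lie = "“I wasn't a fan of Iraq. I didn't want to go into Iraq.” "
raw_explanation = '(He was for an invasion before he was against it.)'

# strip whitespace first, then the curly quotation marks at either end
lie = raw_lie.strip().strip('“”')
# strip whitespace, then the surrounding parentheses
explanation = raw_explanation.strip().strip('()')

print(lie)
print(explanation)
```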

**Extract URL**

In [18]:
# a tag's attributes can be read like dictionary keys, so ['href'] returns the link's URL
first_result.find('a')['href']

'https://www.buzzfeed.com/andrewkaczynski/in-2002-donald-trump-said-he-supported-invading-iraq-on-the'
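
Dictionary-style access raises `KeyError` when the attribute is missing. If some links might lack an `href`, the tag's `.get()` method returns `None` instead. The sketch below uses two made-up links to show the difference.

```python
from bs4 import BeautifulSoup

# made-up links for illustration: one with an href attribute and one without
sample_soup = BeautifulSoup(
    '<a href="https://example.com/source">(with link)</a><a>(without link)</a>',
    'html.parser',
)
with_href, without_href = sample_soup.find_all('a')

# dictionary-style access works when the attribute exists
print(with_href['href'])

# .get() returns None for a missing attribute instead of raising KeyError
print(without_href.get('href'))

try:
    without_href['href']
except KeyError:
    print('no href attribute on this tag')
```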

**Build the dataset**

In [19]:
# now that we know how to extract each field, build the dataset by looping over all results
records = []
for result in results:
    date = result.find('strong').text[0:-1]+', 2017'
    lie = result.contents[1][1:-2]
    explanation = result.find('a').text[1:-1]
    url = result.find('a')['href']
    records.append((date,lie,explanation,url))
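
The loop assumes every record contains a `strong` tag, a text node, and a link. That holds for this page, but a more defensive variant might skip records that do not fit. The sketch below is one way to do that; `extract_record` is a hypothetical helper, and the sample HTML is made up (the second record deliberately has no link).

```python
from bs4 import BeautifulSoup

# made-up records: the first is complete, the second has no link
SAMPLE_HTML = (
    '<span class="short-desc"><strong>Jan. 21\xa0</strong>“Example claim.” '
    '<span class="short-truth"><a href="https://example.com/a">(Example rebuttal.)</a></span></span>'
    '<span class="short-desc"><strong>Jan. 23\xa0</strong>“No link here.” </span>'
)


def extract_record(tag, year='2017'):
    """Return (date, lie, explanation, url), or None if a part is missing.

    Hypothetical helper for illustration; the year default mirrors the
    ', 2017' suffix that the notebook appends.
    """
    strong = tag.find('strong')
    link = tag.find('a')
    if strong is None or link is None or len(tag.contents) < 2:
        return None
    date = strong.text.strip() + ', ' + year
    lie = str(tag.contents[1]).strip().strip('“”')
    explanation = link.text.strip().strip('()')
    url = link.get('href')
    return (date, lie, explanation, url)


sample_soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')
sample_tags = sample_soup.find_all('span', attrs={'class': 'short-desc'})

# keep only the records that could be fully extracted
sample_records = [rec for rec in map(extract_record, sample_tags) if rec is not None]
print(len(sample_tags), len(sample_records))
```

Comparing the two lengths gives a quick check on how many records, if any, were skipped.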

In [20]:
# records is a list of 4-item tuples, and it still has 180 entries, the same as results
len(records)

180

In [21]:
# let's look at the first few records
records[0:3]

[('Jan. 21, 2017',
  "I wasn't a fan of Iraq. I didn't want to go into Iraq.",
  'He was for an invasion before he was against it.',
  'https://www.buzzfeed.com/andrewkaczynski/in-2002-donald-trump-said-he-supported-invading-iraq-on-the'),
 ('Jan. 21, 2017',
  'A reporter for Time magazine — and I have been on their cover 14 or 15 times. I think we have the all-time record in the history of Time magazine.',
  'Trump was on the cover 11 times and Nixon appeared 55 times.',
  'http://nation.time.com/2013/11/06/10-things-you-didnt-know-about-time/'),
 ('Jan. 23, 2017',
  'Between 3 million and 5 million illegal votes caused me to lose the popular vote.',
  "There's no evidence of illegal voting.",
  'https://www.nytimes.com/2017/01/23/us/politics/donald-trump-congress-democrats.html')]

**Make the dataset tabular, export as a CSV file**

In [22]:
import pandas as pd
df = pd.DataFrame(records,columns=['date','lie','explanation','url'])

In [23]:
df.head()

| | date | lie | explanation | url |
|---|---|---|---|---|
| 0 | Jan. 21, 2017 | I wasn't a fan of Iraq. I didn't want to go in... | He was for an invasion before he was against it. | https://www.buzzfeed.com/andrewkaczynski/in-20... |
| 1 | Jan. 21, 2017 | A reporter for Time magazine — and I have been... | Trump was on the cover 11 times and Nixon appe... | http://nation.time.com/2013/11/06/10-things-yo... |
| 2 | Jan. 23, 2017 | Between 3 million and 5 million illegal votes ... | There's no evidence of illegal voting. | https://www.nytimes.com/2017/01/23/us/politics... |
| 3 | Jan. 25, 2017 | Now, the audience was the biggest ever. But th... | Official aerial photos show Obama's 2009 inaug... | https://www.nytimes.com/2017/01/21/us/politics... |
| 4 | Jan. 25, 2017 | Take a look at the Pew reports (which show vot... | The report never mentioned voter fraud. | https://www.nytimes.com/2017/01/24/us/politics... |


In [24]:
# convert the date column to pandas datetime so that the dates are stored consistently
df['date'] = pd.to_datetime(df['date'])
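
Without a format, `pd.to_datetime` infers how to parse each string. As a hedged alternative, an explicit `format` combined with `errors='coerce'` turns anything that does not match into `NaT`, which makes stray values easy to find. The sample strings below are made up, and the format `'%b. %d, %Y'` assumes abbreviated month names followed by a period, as in the records shown above.

```python
import pandas as pd

# made-up sample: two strings in the notebook's date style and one bad value
sample_dates = pd.Series(['Jan. 21, 2017', 'Jan. 23, 2017', 'not a date'])

# parse with an explicit format; values that do not match become NaT
parsed = pd.to_datetime(sample_dates, format='%b. %d, %Y', errors='coerce')

# list the original strings that failed to parse
unparsed = sample_dates[parsed.isna()]
print(parsed)
print(unparsed.tolist())
```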

In [25]:
df.head()

| | date | lie | explanation | url |
|---|---|---|---|---|
| 0 | 2017-01-21 | I wasn't a fan of Iraq. I didn't want to go in... | He was for an invasion before he was against it. | https://www.buzzfeed.com/andrewkaczynski/in-20... |
| 1 | 2017-01-21 | A reporter for Time magazine — and I have been... | Trump was on the cover 11 times and Nixon appe... | http://nation.time.com/2013/11/06/10-things-yo... |
| 2 | 2017-01-23 | Between 3 million and 5 million illegal votes ... | There's no evidence of illegal voting. | https://www.nytimes.com/2017/01/23/us/politics... |
| 3 | 2017-01-25 | Now, the audience was the biggest ever. But th... | Official aerial photos show Obama's 2009 inaug... | https://www.nytimes.com/2017/01/21/us/politics... |
| 4 | 2017-01-25 | Take a look at the Pew reports (which show vot... | The report never mentioned voter fraud. | https://www.nytimes.com/2017/01/24/us/politics... |


In [26]:
# now write the CSV file
# index=False leaves the DataFrame index out of the file
df.to_csv('trump_lies.csv', index=False, encoding='utf-8')
# this creates trump_lies.csv in the working directory

In [27]:
# we can read the file back if we want, parsing the date column as datetime
df = pd.read_csv('trump_lies.csv', parse_dates=['date'], encoding='utf-8')
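
To illustrate why `index=False` matters when the file is read back, the sketch below round-trips a tiny made-up DataFrame through in-memory buffers (`io.StringIO`) instead of files, so it does not touch `trump_lies.csv`. When the index is written, `read_csv` brings it back as an extra column named `Unnamed: 0`.

```python
import io

import pandas as pd

# tiny made-up frame standing in for the real dataset
sample_df = pd.DataFrame({
    'date': pd.to_datetime(['2017-01-21', '2017-01-23']),
    'lie': ['Example claim A', 'Example claim B'],
})

# write WITH the index, then read back: the index returns as an unnamed column
with_index = io.StringIO()
sample_df.to_csv(with_index)
with_index.seek(0)
cols_with_index = pd.read_csv(with_index).columns.tolist()

# write WITHOUT the index, then read back and parse the dates again
no_index = io.StringIO()
sample_df.to_csv(no_index, index=False)
no_index.seek(0)
restored = pd.read_csv(no_index, parse_dates=['date'])

print(cols_with_index)
print(restored.columns.tolist())
```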

In [28]:
df.head()

| | date | lie | explanation | url |
|---|---|---|---|---|
| 0 | 2017-01-21 | I wasn't a fan of Iraq. I didn't want to go in... | He was for an invasion before he was against it. | https://www.buzzfeed.com/andrewkaczynski/in-20... |
| 1 | 2017-01-21 | A reporter for Time magazine — and I have been... | Trump was on the cover 11 times and Nixon appe... | http://nation.time.com/2013/11/06/10-things-yo... |
| 2 | 2017-01-23 | Between 3 million and 5 million illegal votes ... | There's no evidence of illegal voting. | https://www.nytimes.com/2017/01/23/us/politics... |
| 3 | 2017-01-25 | Now, the audience was the biggest ever. But th... | Official aerial photos show Obama's 2009 inaug... | https://www.nytimes.com/2017/01/21/us/politics... |
| 4 | 2017-01-25 | Take a look at the Pew reports (which show vot... | The report never mentioned voter fraud. | https://www.nytimes.com/2017/01/24/us/politics... |
