- In this chapter, we’ll look at different ways of getting data into Python and into the right formats.

# 1. stdin and stdout

- if we run our Python scripts from the command line, we can pipe the output of one script into another

- the scripts read and write through `sys.stdin` and `sys.stdout`

## (a) piping 
  1. first have an input file (say, `input.txt`) in which we want to find the lines that contain a given word (e.g. "this")
  2. write a python script (say `egrep.py`) for finding these lines
  3. run it from the command line (e.g. bash)  
     How?  
     use `cat input.txt | python egrep.py "this"`
     - here the contents of `input.txt` arrive in `egrep.py` as <u>input (sys.stdin)</u>, and "this" is available as <u>sys.argv[1]</u>

## (b) sys module

  > it's like a window between your operating system and the Python interpreter.

Important key components of sys toolbox:
    
###  1. sys.argv
     * when we run a python script from the command line, `sys.argv` holds the list of arguments passed to it
     * sys.argv[0] : file name
     * sys.argv[1]: first argument after file name given in command line

input.txt file  

>This is a sample input file.  
>It contains some lines with text.  
>#This is a comment line.  
>This line contains the word "pattern".  
>Another line with more text.  
>#Another comment line.  
>This is the last line.  

In [1]:
# egrep.py
import sys, re

regex = sys.argv[1]
# sys.argv is list of command line arguments
# sys.argv[0] is name of program file itself
# sys.argv[1] will be regex specified at command line itself

# for every line passed into this script
# write line in stdout which has regex
for line in sys.stdin:
    if re.search(regex, line):
        sys.stdout.write(line)

# sys.stdout is an object in python representing standard output stream
# method sys.stdout.write() will write the output in console

###  2. How are lines in sys.stdin found?

- 'Enter' key typically inserts a newline character (\n) into the input stream.
- When you use sys.stdin to read input from a file or from piped input, it treats each newline character (\n) as a delimiter between lines.
- This means that when you use sys.stdin to read from a file, it automatically recognizes newline characters and splits the input into separate lines accordingly.
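
To make the line splitting concrete, here is a minimal sketch that uses `io.StringIO` as a stand-in for `sys.stdin` (an assumption made for illustration, so the snippet runs without a pipe). Each line it yields still carries its trailing newline.

```python
import io

# io.StringIO stands in for sys.stdin here (illustrative assumption)
fake_stdin = io.StringIO("first line\nsecond line\nthird line\n")

# Iterating over a text stream yields one line at a time, newline included
lines = [line for line in fake_stdin]
print(lines)

# strip() removes the trailing newline when it is not wanted
stripped = [line.strip() for line in lines]
print(stripped)
```

This is why `egrep.py` can write matching lines back with `sys.stdout.write(line)` without adding a newline of its own.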


### 3. Count the number of lines in input.txt

- We write a program `line_count.py`

In [None]:
# line_count.py

import sys

count = 0
for line in sys.stdin:
    count += 1

print(count)                    # by default print goes to sys.stdout, therefore it writes in console
                                # 7

`cat input.txt | python egrep.py "This" | python line_count.py`
 #4

### 4. Count number of lines in 'input.txt' which have any number/digit 

- use pipe character `|` to use output of left command as input of right command
  
- here regex is '[0-9]'

- terminal command line: `cat input.txt | python egrep.py "[0-9]" | python line_count.py` #2

### 5. Count the words of 'input.txt' and write the most common ones to stdout

In [None]:
# most_common_words.py

from collections import Counter
import sys

try:
    num = int(sys.argv[1])

except (IndexError, ValueError):   # argument missing, or not an integer
    print("usage: python most_common_words.py num")
    sys.exit(1)               # nonzero exit code indicates error

c = Counter(word.lower() 
            for line in sys.stdin 
            for word in line.strip().replace('.', ' ').split()
            if word)         
                             # strips whitespaces/newlines/tabs characters from start-end of line first
                             # replace full stop with whitespace
                             # splits words at space 

for word, count in c.most_common(num):      # c is dictionary of words and its counts
    sys.stdout.write(str(count))
    sys.stdout.write("\t")
    sys.stdout.write(word)
    sys.stdout.write("\n")
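
The counting step can also be tried without a pipe. The sketch below wraps the same generator expression in a hypothetical helper, `count_words`, and feeds it an in-memory stream; the helper name and the sample text are assumptions for illustration only.

```python
import io
from collections import Counter

def count_words(stream) -> Counter:
    """Count lowercase words in a text stream, treating '.' as a separator."""
    return Counter(word.lower()
                   for line in stream
                   for word in line.strip().replace('.', ' ').split())

# In-memory stand-in for sys.stdin (illustrative sample text)
sample = io.StringIO("This is a line.\nThis is another line.\n")
counts = count_words(sample)

# Same tab-separated layout as most_common_words.py
for word, count in counts.most_common(2):
    print(f"{count}\t{word}")
```

Keeping the logic in a function that accepts any iterable of lines makes it easy to test separately from the command-line handling.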

# 2. Reading Files

- we can directly read from files/write to files from py code 

## (a) Basics of text files 

### 1. read, write, append

In [29]:
# 'r' is read-only -- it's the default mode if none is given

file_for_reading = open('input.txt', 'r')
file_for_reading2 = open('input.txt')

# 'w' is write -- will destroy the file if it already exists!
file_for_writing = open('writing_file.txt', 'w')

# 'a' is append -- for adding to the end of the file
file_for_appending = open('appending_file.txt', 'a')

# Don't forget to close every file when done
file_for_reading.close()
file_for_reading2.close()
file_for_writing.close()
file_for_appending.close()


### 2. read a file using a with block

- It is easy to forget to close files.
- So we should always open them in a `with` block, at the end of which they are closed automatically.

In [None]:
with open(filename) as f:   
    data = function_that_gets_data_from(f)

# At this point f has already been closed, so don't try to use it
process(data)
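
A small runnable sketch of the same idea, assuming a throwaway file in a temporary directory (the file name `example.txt` is hypothetical). It checks `f.closed` after each `with` block to confirm the file was closed for us.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "example.txt")   # hypothetical file name

    with open(path, 'w') as f:
        f.write("line one\nline two\n")
    closed_after_write = f.closed      # the with block has closed the file

    with open(path) as f:
        data = f.read()
    closed_after_read = f.closed       # closed again after reading

print(closed_after_write, closed_after_read)
print(data)
```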

### 3. To read a whole file, iterate over its lines
- e.g. we want to count lines starting with # using a with block

In [49]:
import re

start_with_hash = 0
with open('input.txt') as f:
    for line in f:                  # look at each line of f
        if re.match('^ #', line):   # matches ' #' (space-hash) at the start (^) of the line; use '^#' if there is no leading space
            start_with_hash += 1

print(start_with_hash)

1


### 4. Example: Histogram of email domain

- when we press enter, a newline character '\n' is added at the end of the line

- so every line we read from a file ends with '\n', and we usually need to remove it

- so, we use line.strip(), which removes whitespace (including the newline) before and after the text

- so, for one line: lowercase all characters -> strip the leading and trailing whitespace -> split the line at @ -> return the last element of the resulting <u>list</u>

In [64]:
from collections import Counter

def get_domain(email_addr: str) -> str:
    return email_addr.lower().strip().split('@')[-1]

with open('email_addresses.txt') as f:
    c = Counter(get_domain(line)
               for line in f
               if '@' in line)
    print(c)
    

Counter({'example.com': 3, 'gmail.com': 3, 'yahoo.com': 2, 'hotmail.com': 1, 'example.org': 1, 'mail.datasciencester.com': 1})


## (b) Delimited files

- often used for storing structured data in a human-readable format.

- used for exporting and importing data between different software applications and systems because they are simple and widely supported.

- in practice, each line of such a file holds several fields of data

- the fields are separated by a delimiter such as a comma, tab, semicolon, or pipe

- never parse these files yourself; you will get the edge cases wrong (e.g. a delimiter inside a quoted field)

- instead, use python libraries designed to read these files e.g. csv module, pandas etc
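
To illustrate the kind of edge case meant here, the sketch below compares naive `str.split(',')` with `csv.reader` on a single line whose middle field contains a comma inside quotes. The sample line is made up for illustration.

```python
import csv
import io

line = 'test2,"success, kind of",Tuesday\n'

# Naive splitting cuts the quoted field in two and keeps the quote characters
naive_fields = line.strip().split(',')

# csv.reader understands the quoting rules and returns three clean fields
csv_fields = next(csv.reader(io.StringIO(line)))

print(naive_fields)
print(csv_fields)
```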

### csv.reader & csv.DictReader

In [7]:
# Read tab_delimited_stock_prices.txt
# csv.reader() returns an iterable object that you can loop through to process each row of the file.


import csv

# Print each cell of rows with name
def process(date: str, symbol: str, closing_price: float) -> None:
    print(f"date: {date}, symbol: {symbol}, closing price: {closing_price}")

#read file using csv module
with open("tab_delimited_stock_prices.txt") as f:
    tab_reader = csv.reader(f, delimiter = '\t')     # iterator object, can be iterated over rows
    for row in tab_reader:
        # print(row)
        date = row[0]
        symbol = row[1]
        closing_price = float(row[2])    # csv gives strings; process() expects a float
        process(date, symbol, closing_price)
    


date: 6/20/2014, symbol: AAPL, closing price: 90.91
date: 6/20/2014, symbol: MSFT, closing price: 41.68
date: 6/20/2014, symbol: FB, closing price: 64.5
date: 6/19/2014, symbol: AAPL, closing price: 91.86
date: 6/19/2014, symbol: MSFT, closing price: 41.51
date: 6/19/2014, symbol: FB, closing price: 64.34


In [8]:
# To read file with header

# 1. use next(reader) to skip the header row
# Print each cell of rows with name

with open("tab_delimited_stock_prices_with_header.txt") as f:
    tab_reader = csv.reader(f, delimiter = '\t')     # iterator object, can be iterated over rows
    next(tab_reader)                                 # skip header
    for row in tab_reader:
        #print(row)
        date = row[0]
        symbol = row[1]
        closing_price = float(row[2])                # csv gives strings; process() expects a float
        process(date, symbol, closing_price)
    


date: 6/20/2014, symbol: AAPL, closing price: 90.91
date: 6/20/2014, symbol: MSFT, closing price: 41.68
date: 6/20/2014, symbol: FB, closing price: 64.5
date: 6/19/2014, symbol: AAPL, closing price: 91.86
date: 6/19/2014, symbol: MSFT, closing price: 41.51
date: 6/19/2014, symbol: FB, closing price: 64.34


In [9]:
# 2. using csv.DictReader

with open("tab_delimited_stock_prices_with_header.txt") as f:
    tab_reader = csv.DictReader(f, delimiter ='\t')           # Assumes the first line holds the header names
                                                              # Data rows start from the next line
                                                              # Each row becomes a dict keyed by the header names


    # use header names to access the elements
    for dict_row in tab_reader:
        date = dict_row["date"]
        symbol = dict_row["symbol"]
        closing_price = float(dict_row["prices"])
        process(date, symbol, closing_price)
    

date: 6/20/2014, symbol: AAPL, closing price: 90.91
date: 6/20/2014, symbol: MSFT, closing price: 41.68
date: 6/20/2014, symbol: FB, closing price: 64.5
date: 6/19/2014, symbol: AAPL, closing price: 91.86
date: 6/19/2014, symbol: MSFT, closing price: 41.51
date: 6/19/2014, symbol: FB, closing price: 64.34


### fieldnames i.e. header name

In [121]:
# How to read header from a file using DictReader

with open("tab_delimited_stock_prices_with_header.txt") as f:
    tab_reader = csv.DictReader(f, delimiter='\t')  # Iterator object
    header = tab_reader.fieldnames                  # Read header names 
    print(header)

['date', 'symbol', 'prices']


In [133]:
# If file doesn't have header
# Still can read using DictReader
# Pass the keys as the fieldnames parameter

with open("colon_delimited_stock_prices.txt") as f:
    # file with no header: pass fieldnames, so the first line is kept as data
    colon_reader = csv.DictReader(f, delimiter=":",
                                  fieldnames=["date", "symbol", "closing prices"])

    # let's check the header now
    header = colon_reader.fieldnames
    print(header)

        

['date', 'symbol', 'closing prices']
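
When `fieldnames` is passed to the constructor, `DictReader` treats the first line as data rather than as a header. Here is a minimal sketch on in-memory data; the two sample rows and the key name `closing_price` are assumptions for illustration.

```python
import csv
import io

# Hypothetical in-memory stand-in for a header-less, colon-delimited file
raw = io.StringIO("6/20/2014:AAPL:90.91\n6/20/2014:MSFT:41.68\n")

reader = csv.DictReader(raw, delimiter=':',
                        fieldnames=["date", "symbol", "closing_price"])
rows = list(reader)

print(len(rows))                      # both lines are kept as data rows
print(rows[0]["symbol"], float(rows[0]["closing_price"]))
```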


### csv.writer

In [143]:
# Write todays_prices dict in a file named "comma_delimited_stock_prices.txt"

todays_prices = {'AAPL': 90.91, 'MSFT': 41.68, 'FB': 64.5 }

with open("comma_delimited_stock_prices.txt", 'w', newline='') as f:   # newline='' is recommended by the csv docs
    csv_writer = csv.writer(f, delimiter = ',')
    for stock, price in todays_prices.items():
        csv_writer.writerow([stock,price])


In [146]:
# Let's try writing a "bad_csv.txt" file without using csv.writer

results = [["test1", "success", "Monday"],["test2", "success, kind of", "Tuesday"],["test3", "failure, kind of", "Wednesday"],["test4", "failure, utter", "Thursday"]]

# Bad code
with open('bad_csv.txt', 'w') as f:
    for row in results:
        f.write(','.join(map(str,row)))
        f.write('\n')

# It creates bad_csv.txt, whose rows can't be parsed reliably: the commas inside fields are not quoted
# test1,success,Monday
# test2,success, kind of,Tuesday
# test3,failure, kind of,Wednesday
# test4,failure, utter,Thursday

In [150]:
# Now try it using csv.writer
with open("good_csv.txt", 'w', newline='') as f:
    csv_writer = csv.writer(f, delimiter = ",")
    for row in results:
        csv_writer.writerow(row)

# It creates good_csv.txt file with:

# test1,success,Monday
# test2,"success, kind of",Tuesday
# test3,"failure, kind of",Wednesday
# test4,"failure, utter",Thursday
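
As a quick check that the quoting survives a round trip, the sketch below writes two of the rows with `csv.writer` into an in-memory buffer (an assumption, standing in for `good_csv.txt`) and reads them back with `csv.reader`.

```python
import csv
import io

results = [["test1", "success", "Monday"],
           ["test2", "success, kind of", "Tuesday"]]

buffer = io.StringIO()             # in-memory stand-in for a file
writer = csv.writer(buffer)
writer.writerows(results)          # fields containing commas get quoted

buffer.seek(0)                     # rewind before reading
round_tripped = list(csv.reader(buffer))

print(round_tripped == results)
```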

# 3. Scraping the web
- fetching information from web pages

## (a) html parsing

- web pages are written in html

- html files are marked by 'Elements' and their 'Attributes'

- Example:

```html
<html>
  <head>
    <title>A web page</title>
  </head>
  <body>
    <p id="author">Joel Grus</p>
    <p id="subject">Data Science</p>
  </body>
</html>
```

- anything inside ```<>``` is a tag; most elements have an opening and a closing tag
  
- an opening tag, its content, and the closing tag, like ```<p id="author"> ... </p>```, form an element; here `id` is an attribute

- in web scraping we extract the data we need from complex html files, with queries like

- “find the ```<p>``` element whose ```id``` is <u>subject</u> and return the text it contains.”

### BeautifulSoup Library

- it is used to create a tree out of various elements of html/XML files i.e. **parsing html/xml**

- these trees provide interface for easy access of data

- find(), find_all(), select(), etc. are methods to navigate through the trees

- without Beautiful Soup, we need to manually access raw HTML content using string manipulation or regular expressions, which can be error-prone and cumbersome, especially if the HTML structure of the webpage changes

- with Beautiful Soup, we can use its methods (e.g. find_all()) to access the content easily based on HTML tags, abstracting away the parsing details and making the code more readable and maintainable.
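
To give a feel for the manual route, here is a sketch that answers the earlier query (the text of the `<p>` whose `id` is `subject`) using only the standard library's `html.parser.HTMLParser`. The class name `ParagraphById` is a hypothetical helper written for this illustration, not part of any library.

```python
from html.parser import HTMLParser

class ParagraphById(HTMLParser):
    """Collect the text of the <p> element with a given id (hypothetical helper)."""

    def __init__(self, target_id: str) -> None:
        super().__init__()
        self.target_id = target_id
        self.inside_target = False
        self.text = ""

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs
        if tag == "p" and dict(attrs).get("id") == self.target_id:
            self.inside_target = True

    def handle_endtag(self, tag):
        if tag == "p":
            self.inside_target = False

    def handle_data(self, data):
        if self.inside_target:
            self.text += data

html = """<html><body>
<p id="author">Joel Grus</p>
<p id="subject">Data Science</p>
</body></html>"""

parser = ParagraphById("subject")
parser.feed(html)
subject_text = parser.text
print(subject_text)
```

With Beautiful Soup the same lookup shrinks to a single expression such as `soup.find('p', id='subject').text`, which is the readability gain the bullets describe.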

### Requests library

- used to make HTTP requests

- It simplifies the process of sending HTTP requests and handling responses, making it easy to interact with web services, APIs, and websites

### html5lib 

- it's an alternative to python's built-in parser

- with more complex or less structured HTML documents, Python's built-in HTML parser (html.parser) might struggle to handle improperly nested tags, missing closing tags, or other irregularities.

- so we install the html5lib parser, together with Beautiful Soup and Requests, from the command line:

- `python -m pip install beautifulsoup4 requests html5lib`


In [2]:
# See how the parser parses the html text

from bs4 import BeautifulSoup

# Example HTML content
html_content = """
<!DOCTYPE html>
<html>
  <head>
    <title>Test Page</title>
  </head>
  <body>
    <h1>Welcome to the Test Page</h1>
    <p>This is a <strong>simple</strong> HTML example.</p>
    <a href="http://example.com">Example Link</a>
  </body>
</html>
"""

# Parse the HTML content using BeautifulSoup with the html5lib parser
soup = BeautifulSoup(html_content, 'html5lib')

# Print the prettified HTML
print(soup.prettify())

<!DOCTYPE html>
<html>
 <head>
  <title>
   Test Page
  </title>
 </head>
 <body>
  <h1>
   Welcome to the Test Page
  </h1>
  <p>
   This is a
   <strong>
    simple
   </strong>
   HTML example.
  </p>
  <a href="http://example.com">
   Example Link
  </a>
 </body>
</html>



## (b) Steps of parsing

1. fetch the html of the url using `Requests`
2. Parse using `BeautifulSoup`
   
### 1. read html 

In [12]:
from bs4 import BeautifulSoup
import requests

url = "https://raw.githubusercontent.com/joelgrus/data/master/getting-data.html"

# Request url
html = requests.get(url).text                                     

# print(html)

# Parse html
soup = BeautifulSoup(html, 'html5lib')
print(soup)

<!DOCTYPE html>
<html lang="en-US"><head>
    <title>Getting Data</title>
    <meta charset="utf-8"/>
</head>
<body>
    <h1>Getting Data</h1>
    <div class="explanation">
        This is an explanation.
    </div>
    <div class="comment">
        This is a comment.
    </div>
    <div class="content">
        <p id="p1">This is the first paragraph.</p>
        <p class="important">This is the second paragraph.</p>
    </div>
    <div class="signature">
        <span id="name">Joel</span>
        <span id="twitter">@joelgrus</span>
        <span id="email">joelgrus-at-gmail</span>
    </div>


</body></html>


### 2. find p-tag and its contents

In [13]:
# Find p-tag and its contents

first_paragraph = soup.find('p')
all_paragraphs = soup.find_all('p')
print(first_paragraph, "\n", all_paragraphs)

# Alternative way
first_paragraph1 = soup.p
all_paragraphs1 = soup('p')
print(first_paragraph1, "\n", all_paragraphs1)

<p id="p1">This is the first paragraph.</p> 
 [<p id="p1">This is the first paragraph.</p>, <p class="important">This is the second paragraph.</p>]
<p id="p1">This is the first paragraph.</p> 
 [<p id="p1">This is the first paragraph.</p>, <p class="important">This is the second paragraph.</p>]


### 3. get text content under p-tag

In [26]:
# Get text content under p-tag

first_paragraph_text = soup.find('p').text
print(first_paragraph_text)

# Or
first_paragraph_text1 = soup.p.text
print(first_paragraph_text1)

This is the first paragraph.
This is the first paragraph.


### 4. extract a tag’s attributes by treating it like a dict

In [15]:
# Extract a tag’s attributes by treating it like a dict

first_paragraph_id = soup.p['id']
first_paragraph_class = soup.p['class']  # raises KeyError because the first paragraph has no 'class' attribute

KeyError: 'class'

In [35]:
# Therefore use get() 
# It returns None if the key is absent
first_paragraph_class1 = soup.p.get('class')
print(first_paragraph_class1)

None


In [60]:
# Find paragraphs with 'id' attribute

paragraph_with_id = [p for p in soup('p') if p.get('id')]
print(paragraph_with_id)

[<p id="p1">This is the first paragraph.</p>]


### 5. Different ways to find tags with a specific class

In [91]:
# Different ways to find tags with a specific class

important_paragraph = [p for p in soup('p') if 'important' in p.get('class', [])]  # get() returns [] if there is no class
important_paragraph1 = soup('p', 'important')                # already a list, no extra brackets needed
important_paragraph2 = soup('p', {'class': 'important'})

print(important_paragraph, "\n", important_paragraph1, "\n", important_paragraph2)

# Paragraphs whose id is 'p1' (use print(id_p1) to see them)
id_p1 = [p for p in soup('p', id = 'p1')]


[<p class="important">This is the second paragraph.</p>] 
 [<p class="important">This is the second paragraph.</p>] 
 [<p class="important">This is the second paragraph.</p>]


In [100]:
# Combine these methods to implement more elaborate logic
# For example, to find every <span> element that is contained inside a <div> element


from bs4 import BeautifulSoup

#print(html)
span_inside_divs = [span for div in soup('div')
                    for span in div('span')]
print(span_inside_divs)

[<span id="name">Joel</span>, <span id="twitter">@joelgrus</span>, <span id="email">joelgrus-at-gmail</span>]


- the important data won’t typically be labeled as class = "important"

- we will need to carefully inspect the source HTML, reason through our selection logic, and worry about edge cases to make sure our data is correct.

## REVISION

In [2]:
import requests
from bs4 import BeautifulSoup

url = "https://raw.githubusercontent.com/joelgrus/data/master/getting-data.html"
html = requests.get(url).text
soup = BeautifulSoup(html, 'html5lib')    # the parser name belongs here, not in requests.get
print(soup)

<!DOCTYPE html>
<html lang="en-US"><head>
    <title>Getting Data</title>
    <meta charset="utf-8"/>
</head>
<body>
    <h1>Getting Data</h1>
    <div class="explanation">
        This is an explanation.
    </div>
    <div class="comment">
        This is a comment.
    </div>
    <div class="content">
        <p id="p1">This is the first paragraph.</p>
        <p class="important">This is the second paragraph.</p>
    </div>
    <div class="signature">
        <span id="name">Joel</span>
        <span id="twitter">@joelgrus</span>
        <span id="email">joelgrus-at-gmail</span>
    </div>


</body></html>


In [157]:
soup.div

<div class="explanation">
        This is an explanation.
    </div>

In [17]:
soup.div.text

'\n        This is an explanation.\n    '

In [159]:
soup.div['class']

['explanation']

In [160]:
soup.div.get('class')

['explanation']

In [212]:
divs = soup('div')
#print(divs)


div_with_signature = [div for div in soup('div', {'class':'signature'})]
print(div_with_signature)

[<div class="signature">
        <span id="name">Joel</span>
        <span id="twitter">@joelgrus</span>
        <span id="email">joelgrus-at-gmail</span>
    </div>]


REMEMBER

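- a tag such as `soup.div` can be used in two ways: 1. like a dict for its attributes (```soup.div['class']``` gives ```['explanation']```), 2. like `soup` itself for the tags nested inside it (for the signature div, ```div('span')``` gives ```<span id="name">Joel</span>```, ```<span id="twitter">@joelgrus</span>``` and ```<span id="email">joelgrus-at-gmail</span>```)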

In [233]:
#print(soup)
x = [div for div in soup('div') if div('p',{'id':'p1'})]
print(x)
#div_with_p = [p for div in soup('div')
              #soup('p', {'class':'important'})]
#print(div_with_p)

[<div class="content">
        <p id="p1">This is the first paragraph.</p>
        <p class="important">This is the second paragraph.</p>
    </div>]


## (c) Example: Keeping tabs on Congress

- we want to find out how many members of Congress mention data science in their press releases

- links to all of the representatives’ websites are listed at https://www.house.gov/representatives

- Example: if we right-click on the page and choose 'view page source', the entry for each representative looks like:

`<td headers="view-value-4-table-column" class="views-field views-field-value-4 views-field-value-5"><a href="https://carl.house.gov">Carl, Jerry</a>        </td>`

- the url is in the `href` attribute of the `<a>` tag: 
```<td> <a href="https://carl.house.gov">Carl, Jerry</a> </td>```

- **steps to get all the urls**
  1. read html of https://www.house.gov/representatives
  2. parse it
  3. for every `a` tag, read its `href` attribute
  4. keep only the ones we want: those that start with either http:// or https://, have some kind of name, and end with either .house.gov or .house.gov/
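
Step 4 can be tried on its own before touching the network. The sketch below applies the filter to a short, made-up list of `href` values (the list is an assumption for illustration; real pages contain many more).

```python
import re

# Hypothetical href values of the kind found on the page
hrefs = ["https://carl.house.gov",
         "http://bice.house.gov/",
         "https://www.house.gov/representatives",
         "https://carl.house.gov/contact",
         "/leadership"]

# http:// or https://, then some name, then .house.gov with an optional trailing slash
regex = r"^https?://.*\.house\.gov/?$"

good_urls = [href for href in hrefs if re.match(regex, href)]
print(good_urls)
```

The dots in `\.house\.gov` are escaped so that they match a literal `.` rather than any character.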

In [6]:
import requests
from bs4 import BeautifulSoup
import re

#1. read html as text
url = "https://www.house.gov/representatives"
html = requests.get(url).text

#2. parse using soup
soup = BeautifulSoup(html, 'html5lib')

#3(A). all a tags
a_tag = soup('a')

#3(B). all href values if a has attribute 'href'
all_href = [a.get('href') for a in soup('a')
           if a.get('href')]


#4. Extract website addresses only from all_href using regex
regex = r"^https?://.*\.house\.gov/?$"   #starting with http:// or https:// and ending with .house.gov or .house.gov/
all_urls = [url for url in all_href
                 if re.match(regex, url)]

#5. many might be repeating, so create a unique set from list of all_urls
unique_urls = set(all_urls)


- now we have the urls of all the representatives listed on the website
- next we find how many of them mention 'data' in their press releases

  **steps to find mentions of 'data' in the press releases of an individual representative**
1. go to each url and get its html as text
2. parse the html with BeautifulSoup
3. for each a-tag, check whether its text contains "press release"; if yes, extract its href, which is the link to the press releases
4. on the press releases page, check whether any p-tag (i.e. paragraph) mentions 'data'

In [7]:
#first let's do it for one url, say 'https://bice.house.gov'
#go to the url and get its html as text


url = "https://bice.house.gov"
html = requests.get(url).text
soup = BeautifulSoup(html, 'html5lib')
link = [a['href'] for a in soup('a') if 'press release' in a.text.lower()]
link = set(link)

In [71]:
#let's do the scraping for all the unique_urls
from typing import Dict, List
from tqdm import tqdm

press_releases: Dict[str, List[str]] = {}
for url in tqdm(unique_urls):
    html = requests.get(url).text
    soup = BeautifulSoup(html, 'html5lib')
    link = [a['href'] for a in soup('a') if 'press release' in a.text.lower()]
    link = list(set(link))
    press_releases[url] = link

 13%|█████▍                                    | 56/436 [01:45<11:54,  1.88s/it]


KeyboardInterrupt: 

In [72]:
print(press_releases)

{'https://titus.house.gov/': ['/news/documentquery.aspx?DocumentTypeID=27'], 'https://huizenga.house.gov/': ['/News/DocumentQuery.aspx?DocumentTypeID=2041'], 'https://ezell.house.gov': ['/news/documentquery.aspx?DocumentTypeID=27'], 'https://dondavis.house.gov': ['/media/press-releases', '/issues/grants'], 'https://bean.house.gov': ['/media/press-releases'], 'https://harris.house.gov/': ['/media/press-releases'], 'https://lofgren.house.gov/': ['/media/press-releases'], 'https://porter.house.gov/': [], 'https://marymiller.house.gov': ['/media/press-releases', '/media/subscribe-press-releases'], 'https://molinaro.house.gov': ['/news/documentquery.aspx?DocumentTypeID=27'], 'https://posey.house.gov/': ['/News/DocumentQuery.aspx?DocumentTypeID=1487'], 'https://bergman.house.gov': ['/news/documentquery.aspx?DocumentTypeID=27'], 'https://gosar.house.gov/': ['/news/email'], 'https://plaskett.house.gov/': ['/news/documentquery.aspx?documenttypeid=27'], 'https://larsen.house.gov': ['/news/docume

In [88]:
#We’ll write a slightly more general function that checks whether a page of press releases mentions any given 'keyword'
#If you visit the site and view the source, it seems like there’s a snippet from 
#each press release inside a <p> tag, so we’ll use that as our first attempt:

def paragraph_mentions(text: str, keyword: str) -> bool:
    """Returns True if a <p> inside the text mentions {keyword}"""
    soup = BeautifulSoup(text, 'html5lib')

    paragraphs = [p.get_text() for p in soup('p')]

    return any(keyword.lower() in paragraph.lower() for paragraph in paragraphs)

#Let’s write a quick test for it:
text = """<body><h1>Facebook</h1><p>Twitter</p>"""
assert paragraph_mentions(text, 'twitter') == True
assert paragraph_mentions(text, 'facebook') == False

In [107]:
#print each house_url whose press releases page mentions 'data' inside a <p>

for house_url, pr_links in tqdm(press_releases.items()):
    for pr_link in pr_links:
        url = f"{house_url}/{pr_link}"
        text = requests.get(url).text
        if paragraph_mentions(text, 'data'):
            print(f"{house_url}")
            break # done with this house_url

 30%|█████████████                              | 17/56 [00:08<00:13,  2.90it/s]

https://castor.house.gov/


 61%|██████████████████████████                 | 34/56 [00:21<00:13,  1.61it/s]

https://cartwright.house.gov


 66%|████████████████████████████▍              | 37/56 [00:21<00:06,  3.01it/s]

https://norton.house.gov/


 70%|█████████████████████████████▉             | 39/56 [00:22<00:05,  3.13it/s]

https://mooney.house.gov/


 77%|█████████████████████████████████          | 43/56 [00:25<00:08,  1.47it/s]

https://delauro.house.gov/


 80%|██████████████████████████████████▌        | 45/56 [00:26<00:06,  1.59it/s]

https://hoyer.house.gov/


 82%|███████████████████████████████████▎       | 46/56 [00:27<00:06,  1.48it/s]

https://buck.house.gov/


100%|███████████████████████████████████████████| 56/56 [00:35<00:00,  1.59it/s]


If you look at the various “press releases” pages, most of them are paginated with only 5 or 10 press releases per page. This means that we only retrieved the few most recent press releases for each congressperson. A more thorough solution would have iterated over the pages and retrieved the full text of each press release.
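
One more detail of the loop above: it builds each address with `f"{house_url}/{pr_link}"`, which produces repeated slashes when the base already ends with `/` or the link starts with `/`. A possible alternative, sketched here under the assumption that the links are ordinary relative paths, is `urllib.parse.urljoin` from the standard library. The sample values are taken from the `press_releases` output shown earlier.

```python
from urllib.parse import urljoin

base_with_slash = "https://titus.house.gov/"
base_without_slash = "https://bean.house.gov"

# urljoin resolves the link against the base, with or without a trailing slash
joined_1 = urljoin(base_with_slash, "/news/documentquery.aspx?DocumentTypeID=27")
joined_2 = urljoin(base_without_slash, "/media/press-releases")

# The f-string approach, for comparison, repeats the slashes
naive = f"{base_with_slash}/{'/media/press-releases'}"

print(joined_1)
print(joined_2)
print(naive)
```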


# 4. Using APIs

## 1. Application programming interface (API)

- An API (Application Programming Interface) is a set of rules and protocols that allows different software applications to communicate with each other.

-   It defines the methods and data formats that applications can use to request and exchange information.

- Imagine you are at a restaurant. The restaurant's menu is like an API. It provides you with a list of dishes (functions or services) that you can order (use). Each dish has a name (function name) and a description (function documentation) that tells you what it does and how it's prepared.
- Now, you (the client) want to order a dish (use a service) from the menu (API). You don't need to know how the dish is prepared or what ingredients are used; you just need to know its name and what it does. You tell the waiter (the API) the name of the dish you want, and the waiter takes your order to the kitchen (the server) where the chef (the software) prepares the dish and brings it back to you.

- Similarly, in software development, an API provides a set of functions, methods, or endpoints that developers can use to interact with a system or service. Developers don't need to know the internal details of how the system works; they only need to know how to use the API to perform specific tasks or access certain functionalities.

### API for web scraping
- Many websites and web services provide application programming interfaces (APIs), which allow you to explicitly request data in a structured format.

- This saves you the trouble of having to scrape them!

## 2. JSON and XML
- Because HTTP is a protocol for transferring text, the data you request through a web API needs to be serialized into a string format.
- Often this serialization uses <u>JavaScript Object Notation (JSON)</u>.
- JavaScript objects look quite similar to Python dicts, which makes their string representations easy to interpret

Example of JSON:   

{ "title" : "Data Science Book", "author" : "Joel Grus", "publicationYear" : 2019, "topics" : [ "data", "science", "data science"] }

### Parse a JSON string
- use python's json module

- the `loads` function deserializes a JSON string into a python object

In [111]:
import json

serialized = """{ "title" : "Data Science Book",
"author" : "Joel Grus",
"publicationYear" : 2019,
"topics" : [ "data", "science", "data science"] }"""

#parse json to create python dictionary
deserialized = json.loads(serialized)    

#check the type of objects
print(type(serialized))            
print(type(deserialized))

<class 'str'>
<class 'dict'>
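
The reverse direction uses `json.dumps`. A minimal round-trip sketch, reusing the book example from above:

```python
import json

book = {"title": "Data Science Book",
        "author": "Joel Grus",
        "publicationYear": 2019,
        "topics": ["data", "science", "data science"]}

serialized = json.dumps(book)          # dict -> JSON string
restored = json.loads(serialized)      # JSON string -> dict

print(type(serialized).__name__, type(restored).__name__)
print(restored["topics"][2])
```

For file objects rather than strings, the json module offers `json.load` and `json.dump`, which take the open file as an argument.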


In [114]:
# Sometimes an API provider hates you and provides only responses in XML
# it can be parsed like html using BeautifulSoup
'''<Book>
    <Title>Data Science Book</Title>
    <Author>Joel Grus</Author>
    <PublicationYear>2014</PublicationYear>
    <Topics>
        <Topic>data</Topic>
        <Topic>science</Topic>
        <Topic>data science</Topic>
    </Topics>
</Book>'''

'<Book>\n    <Title>Data Science Book</Title>\n    <Author>Joel Grus</Author>\n    <PublicationYear>2014</PublicationYear>\n    <Topics>\n        <Topic>data</Topic>\n        <Topic>science</Topic>\n        <Topic>data science</Topic>\n    </Topics>\n</Book>'
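
Besides BeautifulSoup, well-formed XML like this can also be read with the standard library. The sketch below uses `xml.etree.ElementTree`, which is a swapped-in alternative for illustration and not the approach these notes otherwise follow.

```python
import xml.etree.ElementTree as ET

xml_text = """<Book>
    <Title>Data Science Book</Title>
    <Author>Joel Grus</Author>
    <PublicationYear>2014</PublicationYear>
    <Topics>
        <Topic>data</Topic>
        <Topic>science</Topic>
        <Topic>data science</Topic>
    </Topics>
</Book>"""

root = ET.fromstring(xml_text)                          # root is the <Book> element
title = root.findtext("Title")                          # text of the direct child <Title>
year = int(root.findtext("PublicationYear"))
topics = [topic.text for topic in root.iter("Topic")]   # every <Topic> at any depth

print(title, year, topics)
```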

- Most APIs these days require that you first authenticate yourself before you can use them.

- While we don’t begrudge them this policy, it creates a lot of extra boilerplate that muddies up our exposition.

- Accordingly, we’ll start by taking a look at GitHub’s API (https://api.github.com/users/joelgrus/repos), with which we can do some simple things unauthenticated:

In [147]:
import requests, json
github_user = "joelgrus"
endpoint = f"https://api.github.com/users/{github_user}/repos"
response_text = requests.get(endpoint).text             # JSON string (fetched once)
repos = json.loads(response_text)                       # converts str to a list of dicts
print(type(repos))                                      # list of python dicts
print(type(response_text))                              # string

print(repos)
# the printed values look the same as the raw text, but the type has changed

<class 'list'>
<class 'str'>
[{'id': 112873601, 'node_id': 'MDEwOlJlcG9zaXRvcnkxMTI4NzM2MDE=', 'name': 'advent2017', 'full_name': 'joelgrus/advent2017', 'private': False, 'owner': {'login': 'joelgrus', 'id': 1308313, 'node_id': 'MDQ6VXNlcjEzMDgzMTM=', 'avatar_url': 'https://avatars.githubusercontent.com/u/1308313?v=4', 'gravatar_id': '', 'url': 'https://api.github.com/users/joelgrus', 'html_url': 'https://github.com/joelgrus', 'followers_url': 'https://api.github.com/users/joelgrus/followers', 'following_url': 'https://api.github.com/users/joelgrus/following{/other_user}', 'gists_url': 'https://api.github.com/users/joelgrus/gists{/gist_id}', 'starred_url': 'https://api.github.com/users/joelgrus/starred{/owner}{/repo}', 'subscriptions_url': 'https://api.github.com/users/joelgrus/subscriptions', 'organizations_url': 'https://api.github.com/users/joelgrus/orgs', 'repos_url': 'https://api.github.com/users/joelgrus/repos', 'events_url': 'https://api.github.com/users/joelgrus/events{/privacy

- We can use this to figure out which months and days of the week I’m most likely to create a repository.

- The only issue is that the dates in the response are strings: `'created_at': '2017-12-02T20:13:49Z'`

### dateutil module
- Python doesn’t come with a great date parser, so we’ll need to install one:

`python -m pip install python-dateutil`
- from which you’ll probably only ever need the dateutil.parser.parse function:

In [148]:
from collections import Counter
from dateutil.parser import parse

dates = [parse(repo['created_at']) for repo in repos]
print(dates)

month_counts = Counter(date.month for date in dates)
print(month_counts)

weekday_count = Counter(date.weekday() for date in dates)
print(weekday_count)

[datetime.datetime(2017, 12, 2, 20, 13, 49, tzinfo=tzutc()), datetime.datetime(2018, 11, 30, 22, 41, 16, tzinfo=tzutc()), datetime.datetime(2019, 12, 1, 2, 57, 18, tzinfo=tzutc()), datetime.datetime(2020, 11, 21, 16, 21, 49, tzinfo=tzutc()), datetime.datetime(2021, 11, 24, 13, 53, 23, tzinfo=tzutc()), datetime.datetime(2022, 11, 22, 2, 25, 22, tzinfo=tzutc()), datetime.datetime(2023, 12, 2, 3, 15, 48, tzinfo=tzutc()), datetime.datetime(2018, 2, 23, 15, 51, 4, tzinfo=tzutc()), datetime.datetime(2017, 12, 19, 0, 12, 40, tzinfo=tzutc()), datetime.datetime(2018, 1, 31, 23, 51, 16, tzinfo=tzutc()), datetime.datetime(2018, 12, 19, 19, 44, 45, tzinfo=tzutc()), datetime.datetime(2018, 9, 5, 2, 43, 52, tzinfo=tzutc()), datetime.datetime(2019, 2, 1, 20, 25, 46, tzinfo=tzutc()), datetime.datetime(2013, 7, 5, 2, 2, 28, tzinfo=tzutc()), datetime.datetime(2023, 3, 19, 20, 15, 39, tzinfo=tzutc()), datetime.datetime(2017, 5, 10, 17, 22, 45, tzinfo=tzutc()), datetime.datetime(2013, 11, 15, 5, 33, 22, t
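
- If installing `python-dateutil` is not an option, the fixed format of these timestamps can also be handled by the standard library. The sketch below swaps in `datetime.strptime` for `dateutil`'s `parse` and assumes every timestamp ends in a literal `Z` (UTC), as in the response above. The three sample strings correspond to the first three dates in the output above; `parse_github_timestamp` is a hypothetical helper name.

```python
from collections import Counter
from datetime import datetime, timezone
import calendar

# First three creation timestamps from the output above, as raw strings.
created_at = [
    "2017-12-02T20:13:49Z",
    "2018-11-30T22:41:16Z",
    "2019-12-01T02:57:18Z",
]

def parse_github_timestamp(text: str) -> datetime:
    """Parse 'YYYY-MM-DDTHH:MM:SSZ' and mark the result as UTC."""
    naive = datetime.strptime(text, "%Y-%m-%dT%H:%M:%SZ")
    return naive.replace(tzinfo=timezone.utc)

sample_dates = [parse_github_timestamp(text) for text in created_at]

# weekday() returns 0 for Monday through 6 for Sunday;
# calendar.day_name turns that index into a readable name.
month_names = Counter(calendar.month_name[d.month] for d in sample_dates)
weekday_names = Counter(calendar.day_name[d.weekday()] for d in sample_dates)

print(month_names)
print(weekday_names)
```

- Unlike `parse`, this approach raises `ValueError` for any string that does not match the format exactly, which can be useful when the format is known in advance.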

In [150]:
# Take the 5 most recently pushed repositories (we extract their languages next)

last_5_repositories = sorted(repos, key=lambda r: r["pushed_at"], reverse=True)[:5]
print(last_5_repositories)


[{'id': 726318877, 'node_id': 'R_kgDOK0q_HQ', 'name': 'advent2023', 'full_name': 'joelgrus/advent2023', 'private': False, 'owner': {'login': 'joelgrus', 'id': 1308313, 'node_id': 'MDQ6VXNlcjEzMDgzMTM=', 'avatar_url': 'https://avatars.githubusercontent.com/u/1308313?v=4', 'gravatar_id': '', 'url': 'https://api.github.com/users/joelgrus', 'html_url': 'https://github.com/joelgrus', 'followers_url': 'https://api.github.com/users/joelgrus/followers', 'following_url': 'https://api.github.com/users/joelgrus/following{/other_user}', 'gists_url': 'https://api.github.com/users/joelgrus/gists{/gist_id}', 'starred_url': 'https://api.github.com/users/joelgrus/starred{/owner}{/repo}', 'subscriptions_url': 'https://api.github.com/users/joelgrus/subscriptions', 'organizations_url': 'https://api.github.com/users/joelgrus/orgs', 'repos_url': 'https://api.github.com/users/joelgrus/repos', 'events_url': 'https://api.github.com/users/joelgrus/events{/privacy}', 'received_events_url': 'https://api.github.co

In [154]:
#last 5 languages

last_5_languages = [repo["language"] for repo in last_5_repositories]
print(last_5_languages)

['Python', 'Python', 'Svelte', 'Python', 'Python']
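
- Two details of the cells above are worth a small sketch: sorting on `pushed_at` works on the raw strings because ISO 8601 timestamps in the same time zone sort chronologically as text, and a `Counter` can summarize the languages. The repository dicts below are made-up stand-ins (hypothetical names and dates), not real API data.

```python
from collections import Counter

# Hypothetical, trimmed-down repository records (only the keys used here).
sample_repos = [
    {"name": "repo-a", "pushed_at": "2021-03-01T10:00:00Z", "language": "Python"},
    {"name": "repo-b", "pushed_at": "2023-12-02T03:15:48Z", "language": "Svelte"},
    {"name": "repo-c", "pushed_at": "2019-07-15T08:30:00Z", "language": None},
    {"name": "repo-d", "pushed_at": "2022-11-22T02:25:22Z", "language": "Python"},
]

# Most recently pushed first; plain string comparison is enough here.
latest = sorted(sample_repos, key=lambda r: r["pushed_at"], reverse=True)[:3]
print([r["name"] for r in latest])       # ['repo-b', 'repo-d', 'repo-a']

# Count languages, skipping repositories with no language recorded
# (assumption: such entries appear as None after JSON parsing).
language_counts = Counter(r["language"] for r in latest if r["language"])
print(language_counts.most_common(1))    # [('Python', 2)]
```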


- Typically we won’t be working with APIs at this low “make the requests and parse the responses ourselves” level.

- One of the benefits of using Python is that someone has already built a library for pretty much any API you’re interested in accessing.

- When they’re done well, these libraries can save you a lot of the trouble of figuring out the hairier details of API access. (When they’re not done well, or when it turns out they’re based on defunct versions of the corresponding APIs, they can cause you enormous headaches.)

- Nonetheless, you’ll occasionally have to roll your own API access library (or, more likely, debug why someone else’s isn’t working), so it’s good to know some of the details.

## Finding APIs

- If you need data from a specific site, look for a “developers” or “API” section of the site for details, and try searching the web for “python <sitename> api” to find a library.

- There are libraries for the Yelp API, for the Instagram API, for the Spotify API, and so on.

- If you’re looking for a list of APIs that have Python wrappers, there’s a nice one from Real Python on https://github.com/realpython/list-of-python-api-wrappers

# 5. Example: Using the Twitter API

- First, you need your API key and API secret key (sometimes known as the consumer key and consumer secret, respectively)

- In bash, store them as environment variables:  
`export TWITTER_API_KEY='your_api_key'`  
`export TWITTER_API_SECRET='your_api_secret'`

- Use the `os` module to read these keys from Python: `os` is a module in Python's standard library that provides a way to interact with the operating system, including its environment variables.

**Using Twython**
- The trickiest part of using the Twitter API is authenticating yourself.

- Indeed, this is the trickiest part of using a lot of APIs.

- API providers want to make sure that you’re authorized to access their data and that you don’t exceed their usage limits. They also want to know who’s accessing their data.

- Authentication is kind of a pain. There is a simple way, OAuth 2, that suffices when you just want to do simple searches (for example, for keywords or hashtags).

- And there is a complex way, OAuth 1, that’s required when you want to perform actions (e.g., tweeting) or (in particular for us) connect to the Twitter stream.

- So we’re stuck with the more complicated way, which we’ll try to automate as much as we can.

- First, read the API key and API secret from the environment, where the `export` commands shown earlier stored them:


In [14]:
import os

CONSUMER_KEY = os.environ.get("TWITTER_API_KEY")
CONSUMER_SECRET = os.environ.get("TWITTER_API_SECRET")

print(CONSUMER_KEY, CONSUMER_SECRET)

None None
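
- The `None None` output above means the two environment variables were not set in the process running the notebook: `os.environ.get` returns `None` for a missing name instead of raising. A minimal sketch of failing fast instead is below; `require_env` and the `DEMO_...` variable names are hypothetical, not part of `os` or of the Twitter setup.

```python
import os

def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise if it is unset or empty."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"environment variable {name} is not set")
    return value

# Demonstration with throwaway variables (hypothetical names) so the
# sketch runs without real credentials.
os.environ["DEMO_API_KEY"] = "not-a-real-key"
os.environ.pop("DEMO_MISSING_KEY", None)   # make sure this one is absent

print(require_env("DEMO_API_KEY"))         # not-a-real-key

try:
    require_env("DEMO_MISSING_KEY")
except RuntimeError as error:
    print(error)   # environment variable DEMO_MISSING_KEY is not set
```

- Variables exported in one terminal are only visible to programs started from that same shell session, so a notebook server launched elsewhere will not see them.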


In [4]:
#Now we can instantiate the client:

import webbrowser
from twython import Twython

# Create a temporary Twython client using your key and secret key
temp_client = Twython(CONSUMER_KEY, CONSUMER_SECRET)

# Get temporary OAuth 1 tokens and the authentication URL
temp_creds = temp_client.get_authentication_tokens()
url = temp_creds['auth_url']
print(temp_creds)

{'oauth_token': 'wu7c_QAAAAABsikqAAABjeSot9E', 'oauth_token_secret': 'am0p1mMVHjPTH7SkhvQKdAtYa81KitGf', 'oauth_callback_confirmed': 'true', 'auth_url': 'https://api.twitter.com/oauth/authenticate?oauth_token=wu7c_QAAAAABsikqAAABjeSot9E'}


In [5]:
# Now visit that URL to authorize the application and get a PIN
print(f"go visit {url} and get the PIN code and paste it below")
webbrowser.open(url)
PIN_CODE = input("please enter the PIN code: ")

go visit https://api.twitter.com/oauth/authenticate?oauth_token=wu7c_QAAAAABsikqAAABjeSot9E and get the PIN code and paste it below


please enter the PIN code:  8544483


In [6]:
# Now we use this PIN code to get the actual tokens
auth_client = Twython(CONSUMER_KEY, CONSUMER_SECRET,
                      temp_creds['oauth_token'],
                      temp_creds['oauth_token_secret'])

final_step = auth_client.get_authorized_tokens(PIN_CODE)
ACCESS_TOKEN = final_step['oauth_token']
ACCESS_TOKEN_SECRET = final_step['oauth_token_secret']

# And get a new Twython instance using them.
twitter = Twython(CONSUMER_KEY,
                  CONSUMER_SECRET,
                  ACCESS_TOKEN,
                  ACCESS_TOKEN_SECRET)

In [57]:
# Once we have an authenticated Twython instance, we can start performing searches:
# Search for tweets containing the phrase "data science"
# for status in twitter.search(q='"data science"')["statuses"]:
#     user = status["user"]["screen_name"]
#     text = status["text"]
#     print(f"{user}: {text}\n")

In [8]:
# Post a tweet
tweet = "Hello, Twitter! This is a test tweet."
twitter.update_status(status=tweet)

print("Tweet posted successfully!")

TwythonError: Twitter API returned a 403 (Forbidden), You currently have access to a subset of Twitter API v2 endpoints and limited v1.1 endpoints (e.g. media post, oauth) only. If you need access to this endpoint, you may need a different access level. You can learn more here: https://developer.twitter.com/en/portal/product

In [13]:
from requests_oauthlib import OAuth1Session
import os
import json

# In your terminal, set your environment variables first by running:
# export CONSUMER_KEY='<your_consumer_key>'
# export CONSUMER_SECRET='<your_consumer_secret>'

# Read the credentials from the environment instead of hard-coding them
consumer_key = os.environ.get("CONSUMER_KEY")
consumer_secret = os.environ.get("CONSUMER_SECRET")

# User fields are adjustable, options include:
# created_at, description, entities, id, location, name,
# pinned_tweet_id, profile_image_url, protected,
# public_metrics, url, username, verified, and withheld
fields = "created_at,description"
params = {"user.fields": fields}

# Get request token
request_token_url = "https://api.twitter.com/oauth/request_token"
oauth = OAuth1Session(consumer_key, client_secret=consumer_secret)

try:
    fetch_response = oauth.fetch_request_token(request_token_url)
except ValueError:
    print(
        "There may have been an issue with the consumer_key or consumer_secret you entered."
    )
    raise  # stop here: fetch_response is undefined if the request failed

resource_owner_key = fetch_response.get("oauth_token")
resource_owner_secret = fetch_response.get("oauth_token_secret")
print("Got OAuth token: %s" % resource_owner_key)

# Get authorization
base_authorization_url = "https://api.twitter.com/oauth/authorize"
authorization_url = oauth.authorization_url(base_authorization_url)
print("Please go here and authorize: %s" % authorization_url)
verifier = input("Paste the PIN here: ")

# Get the access token
access_token_url = "https://api.twitter.com/oauth/access_token"
oauth = OAuth1Session(
    consumer_key,
    client_secret=consumer_secret,
    resource_owner_key=resource_owner_key,
    resource_owner_secret=resource_owner_secret,
    verifier=verifier,
)
oauth_tokens = oauth.fetch_access_token(access_token_url)

access_token = oauth_tokens["oauth_token"]
access_token_secret = oauth_tokens["oauth_token_secret"]

# Make the request
oauth = OAuth1Session(
    consumer_key,
    client_secret=consumer_secret,
    resource_owner_key=access_token,
    resource_owner_secret=access_token_secret,
)

response = oauth.get("https://api.twitter.com/2/users/me", params=params)

if response.status_code != 200:
    raise Exception(
        "Request returned an error: {} {}".format(response.status_code, response.text)
    )

print("Response code: {}".format(response.status_code))

json_response = response.json()

print(json.dumps(json_response, indent=4, sort_keys=True))

Got OAuth token: -8UgyQAAAAABsikqAAABjeS3RVo
Please go here and authorize: https://api.twitter.com/oauth/authorize?oauth_token=-8UgyQAAAAABsikqAAABjeS3RVo


Paste the PIN here:  0007949


Response code: 200
{
    "data": {
        "created_at": "2024-02-26T07:35:10.000Z",
        "description": "",
        "id": "1762018388891807744",
        "name": "tandaifuku",
        "username": "tandaifuku"
    }
}
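
- The response is a dict with the user record nested under `"data"`. A short sketch of pulling fields out of it follows, using the JSON printed above as a literal string so that no network access or credentials are needed. The variable names are illustrative.

```python
import json
from datetime import datetime

# The response body printed above, reproduced as a string.
response_text = """
{
    "data": {
        "created_at": "2024-02-26T07:35:10.000Z",
        "description": "",
        "id": "1762018388891807744",
        "name": "tandaifuku",
        "username": "tandaifuku"
    }
}
"""

user = json.loads(response_text)["data"]

# The id arrives as a string, not an int; convert it only if needed.
user_id = int(user["id"])

# This timestamp carries fractional seconds, so the format string needs %f.
joined = datetime.strptime(user["created_at"], "%Y-%m-%dT%H:%M:%S.%fZ")

print(user["username"], user_id, joined.year)
```

- Only `created_at` and `description` were requested through `user.fields`; `id`, `name`, and `username` appear in the printed response as well.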
