# Text Acquisition & Ingestion Assignment

In [3]:
import json
import requests
import feedparser
from bs4 import BeautifulSoup

### Iterate through the list of article URLs below, scraping the text from each one and saving it to a text file. 

In [4]:
articles = ['http://lite.cnn.io/en/article/h_eac18760a7a7f9a1bf33616f1c4a336d',
            'http://lite.cnn.io/en/article/h_de3f82f17d289680dd2b47c6413ebe7c',
            'http://lite.cnn.io/en/article/h_72f4dc9d6f35458a89af014b62e625ad',
            'http://lite.cnn.io/en/article/h_aa21fe6bf176071cb49e09d422c3adf0',
            'http://lite.cnn.io/en/article/h_8ad34a532921c9076cdc9d7390d2f1bc',
            'http://lite.cnn.io/en/article/h_84422c79110d9989177cfaf1c5f45fe7',
            'http://lite.cnn.io/en/article/h_d010d9580abac3a44c6181ec6fb63d58',
            'http://lite.cnn.io/en/article/h_fb11f4e9d7c5323e75b337d9e9e5e368',
            'http://lite.cnn.io/en/article/h_7b27f0b131067f8ece6238ac559670ab',
            'http://lite.cnn.io/en/article/h_8cae7f735fa9573d470f802063ceffe2',
            'http://lite.cnn.io/en/article/h_72c3668280e82576fcc2602b0fa70c14',
            'http://lite.cnn.io/en/article/h_d20658fb0e20212051cda0e0a7248c8a',
            'http://lite.cnn.io/en/article/h_56611c43d7928120d2ae21666ccc7417',
            'http://lite.cnn.io/en/article/h_bda0394e3c5ee7054ee65c022bca7695']

In [97]:
PATH = '/content/drive/MyDrive/News_Articles/'
for i, article in enumerate(articles):
    response = requests.get(article)
    soup = BeautifulSoup(response.text).find('div', {'class': 'afe4286c'}).text
    with open(PATH + f'article_{i}.txt', 'wb') as f:
      f.write(soup.encode())

### Ingest the text files generated via web scraping into a corpus and print the corpus statistics.

In [98]:
from nltk.corpus.reader.plaintext import PlaintextCorpusReader
import nltk
nltk.download('punkt')

DOC_PATTERN = r'.*\.txt'
corpus = PlaintextCorpusReader(PATH, DOC_PATTERN)
corpus.fileids()

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


['article_0.txt',
 'article_1.txt',
 'article_10.txt',
 'article_11.txt',
 'article_12.txt',
 'article_13.txt',
 'article_2.txt',
 'article_3.txt',
 'article_4.txt',
 'article_5.txt',
 'article_6.txt',
 'article_7.txt',
 'article_8.txt',
 'article_9.txt']

In [99]:
def corpus_stats(corpus):
    print(
        f"Corpus Statistics\n\n"
        f"Number of documents: {len(corpus.fileids())}\n\n"
        f"Number of paragraphs: {len(corpus.paras())}\n\n"
        f"Number of sentences: {len(corpus.sents())}\n\n"
        f"Number of words: {len(corpus.words())}\n\n"
        f"Vocabulary: {len(set(w.lower() for w in corpus.words()))}\n\n"
        f"Avg chars per word: {round(len(corpus.raw())/len(corpus.words()))}\n\n"
        f"Avg words per sentence: {round(len(corpus.words())/len(corpus.sents()))}\n\n"
    )

In [100]:
corpus_stats(corpus)

Corpus Statistics

Number of documents: 14

Number of paragraphs: 14

Number of sentences: 427

Number of words: 13824

Vocabulary: 2955

Avg chars per word: 5

Avg words per sentence: 32




### Parse the O'Reilly Radar RSS feed below, extract the text from each post, and save it to a text file.

The content of each post contains HTML tags. Strip those out using the same approach you used for web scraping so that only text is saved to the files.

In [102]:
feed = 'http://feeds.feedburner.com/oreilly/radar/atom'

In [103]:
parsed = feedparser.parse(feed)

In [108]:
#parsed
display(parsed.entries[0].summary)
display(parsed.entries[1].summary)

'2020 has been a year of great challenges for so many, but it’s not all negative. Around the world, organizations and their workforces have risen to the occasion, recognizing the importance of expanding their knowledge, taking on new tasks, and bettering themselves both personally and professionally. With the uptick in virtual conferencing, remote work, and, [&#8230;]'

'It has long seemed to me that functional programming is, essentially, programming viewed as mathematics. Many ideas in functional programming came from Alonzo Church&#8217;s Lambda Calculus, which significantly predates anything that looks remotely like a modern computer. Though the actual history of computing runs differently: in the early days of computing, Von Neumann’s ideas were [&#8230;]'

### Ingest the text files generated via RSS parsing into a corpus and print the corpus statistics.

In [109]:
path = '/content/drive/MyDrive/RSS_Articles/'

for i, entry in enumerate(parsed.entries):
    text = entry.summary
    with open(path + f'article_{i}.txt', 'wb') as f:
        f.write(text.encode())

In [110]:
doc_pattern = r'.*\.txt'
corpus = PlaintextCorpusReader(path, doc_pattern)

corpus_stats(corpus)

Corpus Statistics

Number of documents: 60

Number of paragraphs: 60

Number of sentences: 197

Number of words: 4515

Vocabulary: 1467

Avg chars per word: 5

Avg words per sentence: 23




### Make an API call to the Hacker News API to retrieve their Ask, Show, and Job category items. 

- URL: https://hacker-news.firebaseio.com/v0/askstories.json

In [119]:
item_ids = []
for categories in ['ask', 'show', 'job']:
    url = f'https://hacker-news.firebaseio.com/v0/{categories}stories.json'
    response = requests.get(url)
    print(response)
    print(f'Added {len(response.json())} from {categories}stories')
    item_ids.extend(response.json())

<Response [200]>
Added 89 from askstories
<Response [200]>
Added 39 from showstories
<Response [200]>
Added 60 from jobstories


In [122]:
print(item_ids)

[25559571, 25562022, 25560185, 25561398, 25553818, 25558927, 25559274, 25559174, 25562723, 25560180, 25553772, 25540583, 25538586, 25537230, 25556199, 25555746, 25556563, 25559143, 25546445, 25557720, 25541269, 25541616, 25557852, 25548802, 25558741, 25553613, 25547050, 25541964, 25554464, 25542676, 25552885, 25557927, 25545469, 25557631, 25544753, 25551133, 25538405, 25546557, 25545136, 25542679, 25550111, 25540343, 25551290, 25541939, 25542290, 25538258, 25538128, 25541828, 25540059, 25544961, 25550627, 25543287, 25543087, 25542812, 25555295, 25539594, 25539230, 25537569, 25542189, 25549841, 25548696, 25545525, 25549864, 25536672, 25539190, 25537279, 25550782, 25542445, 25563260, 25563215, 25563092, 25563077, 25562565, 25562378, 25562255, 25561875, 25561836, 25561826, 25561526, 25561152, 25560857, 25560695, 25560220, 25560127, 25559864, 25559508, 25559220, 25559169, 25559163, 25558485, 25558891, 25560570, 25562198, 25561064, 25555633, 25560368, 25553458, 25559551, 25549342, 25550280,

In [120]:
len(item_ids)

188

### Once you have retrieved the item IDs from the URL above, retrieve each item by adding the item ID to the URL below, extract the item's text property, and save the text from each item to disk as its own document.

- URL: https://hacker-news.firebaseio.com/v0/item/ITEM_ID_HERE.json

The content of some items may contain HTML tags. Strip those out using the same approach you used for web scraping so that only text is saved to the files.

In [125]:
path = '/content/drive/MyDrive/API_Articles/'

for i in item_ids:
    url = f'https://hacker-news.firebaseio.com/v0/item/{i}.json'
    response = requests.get(url)
    if 'text' in response.json().keys():
        soup = BeautifulSoup(response.json()['text'])
        text = soup.text
        with open(path + f'article_{i}.txt', 'wb') as f:
            f.write(text.encode())

### Ingest the text files generated via API into a corpus and print the corpus statistics.

In [126]:
doc_pattern = r'.*\.txt'
corpus = PlaintextCorpusReader(path, doc_pattern)

corpus_stats(corpus)

Corpus Statistics

Number of documents: 78

Number of paragraphs: 78

Number of sentences: 275

Number of words: 7958

Vocabulary: 2072

Avg chars per word: 5

Avg words per sentence: 29


