# Text Acquisition & Ingestion Assignment

In [1]:
#!pip install feedparser

In [2]:
import json
import requests
import feedparser
from bs4 import BeautifulSoup
from nltk.corpus.reader.plaintext import PlaintextCorpusReader

import nltk
# nltk.download('punkt')

In [3]:
def corpus_stats(corpus):
    print("Corpus Statistics")
    print("Number of documents: " + str(len(corpus.fileids())))
    print("Number of paragraphs: " + str(len(corpus.paras())))
    print("Number of sentences: " + str(len(corpus.sents())))
    print("Number of words: " + str(len(corpus.words())))
    print("Vocabulary: " + str(len(set(w.lower() for w in corpus.words()))))
    print("Avg chars per word: " + str(round(len(corpus.raw())/len(corpus.words()),1)))
    print("Avg words per sentence: " + str(round(len(corpus.words())/len(corpus.sents()),1)))

### Iterate through the list of article URLs below, scraping the text from each one and saving it to a text file. 

In [4]:
articles = ['http://lite.cnn.io/en/article/h_eac18760a7a7f9a1bf33616f1c4a336d',
            'http://lite.cnn.io/en/article/h_de3f82f17d289680dd2b47c6413ebe7c',
            'http://lite.cnn.io/en/article/h_72f4dc9d6f35458a89af014b62e625ad',
            'http://lite.cnn.io/en/article/h_aa21fe6bf176071cb49e09d422c3adf0',
            'http://lite.cnn.io/en/article/h_8ad34a532921c9076cdc9d7390d2f1bc',
            'http://lite.cnn.io/en/article/h_84422c79110d9989177cfaf1c5f45fe7',
            'http://lite.cnn.io/en/article/h_d010d9580abac3a44c6181ec6fb63d58',
            'http://lite.cnn.io/en/article/h_fb11f4e9d7c5323e75b337d9e9e5e368',
            'http://lite.cnn.io/en/article/h_7b27f0b131067f8ece6238ac559670ab',
            'http://lite.cnn.io/en/article/h_8cae7f735fa9573d470f802063ceffe2',
            'http://lite.cnn.io/en/article/h_72c3668280e82576fcc2602b0fa70c14',
            'http://lite.cnn.io/en/article/h_d20658fb0e20212051cda0e0a7248c8a',
            'http://lite.cnn.io/en/article/h_56611c43d7928120d2ae21666ccc7417',
            'http://lite.cnn.io/en/article/h_bda0394e3c5ee7054ee65c022bca7695']

### Ingest the text files generated via web scraping into a corpus and print the corpus statistics.

In [5]:
for i, art in enumerate(articles):
    afile = open("output/articles_text" + str(i) + ".p", "wb" )
    response = requests.get(art)
    content = response.text
    
    soup = BeautifulSoup(content, 'lxml')
    title = soup.find('div', class_='afe4286c').h2.text
    afile.write(title.encode("UTF-8"))
    
    text = soup.find('div', class_='afe4286c').find_all('p')
    text_list = [tag.get_text() for tag in text]
    
    for line in text_list:
         afile.write(line.encode("UTF-8"))
    
    afile.close() 

In [6]:
PATH = 'output/'
DOC_PATTERN = r'articles_text.*\.p'
corpus = PlaintextCorpusReader(PATH, DOC_PATTERN)
corpus_stats(corpus)

Corpus Statistics
Number of documents: 14
Number of paragraphs: 14
Number of sentences: 427
Number of words: 13668
Vocabulary: 2927
Avg chars per word: 5.0
Avg words per sentence: 32.0


### Parse the O'Reilly Radar RSS feed below, extract the text from each post, and save it to a text file.

The content of each post contains HTML tags. Strip those out using the same approach you used for web scraping so that only text is saved to the files.

In [7]:
feed = 'http://feeds.feedburner.com/oreilly/radar/atom'

In [8]:
parsed = feedparser.parse(feed)
posts = parsed.entries
posts[0]['link']

'http://feedproxy.google.com/~r/oreilly/radar/atom/~3/7S_cHstbl8Y/'

In [9]:
for i, post in enumerate(posts):   
    response = requests.get(posts[0]['link'])
    content = response.text
    soup = BeautifulSoup(content, 'lxml')

    text = soup.find('div', class_='main-post-radar-content').find_all(['p', 'li', 'h3', 'a'])
    text_list = [tag.get_text() for tag in text]

    afile = open("output/articles_rss" + str(i) + ".p", "wb" )

    for line in text_list:
        afile.write(line.encode("UTF-8"))
    
    afile.close() 

### Ingest the text files generated via RSS parsing into a corpus and print the corpus statistics.

In [10]:
PATH = 'output/'
DOC_PATTERN = r'articles_rss.*\.p'
corpus = PlaintextCorpusReader(PATH, DOC_PATTERN)
corpus_stats(corpus)

Corpus Statistics
Number of documents: 60
Number of paragraphs: 120
Number of sentences: 2160
Number of words: 75960
Vocabulary: 583
Avg chars per word: 5.5
Avg words per sentence: 35.2


### Make an API call to the Hacker News API to retrieve their Ask, Show, and Job category items. 

- URL: https://hacker-news.firebaseio.com/v0/askstories.json

In [11]:
url = 'https://hacker-news.firebaseio.com/v0/askstories.json'
response = requests.get(url)
items = json.loads(response.content)

### Once you have retrieved the item IDs from the URL above, retrieve each item by adding the item ID to the URL below, extract the item's text property, and save the text from each item to disk as its own document.

- URL: https://hacker-news.firebaseio.com/v0/item/ITEM_ID_HERE.json

The content of some items may contain HTML tags. Strip those out using the same approach you used for web scraping so that only text is saved to the files.

In [26]:
for item in items:
    url = 'https://hacker-news.firebaseio.com/v0/item/' + str(item) + '.json'
    response = requests.get(url)
    content = response.text
    
    obj = eval(content)
    if ('text' in obj.keys()):
        text = obj['text']
        print('--------------------------------------')
        print(text)

#     afile = open("output/articles_rss" + str(i) + ".p", "wb" )

#     for line in text_list:
#         afile.write(line.encode("UTF-8"))
    
#     afile.close() 

--------------------------------------
Please state the job location and include the keywords
REMOTE, INTERNS and&#x2F;or VISA when the corresponding sort of candidate is welcome.
When remote work is <i>not</i> an option, include ONSITE.<p>Please only post if you personally are part of the hiring company—no
recruiting firms or job boards. Only one post per company. If it isn&#x27;t a household name,
please explain what your company does.<p>Commenters: please don&#x27;t reply to job posts to complain about
something. It&#x27;s off topic here.<p>Readers: please only email if you are personally interested in the job.<p>Searchers: Try <a href="https:&#x2F;&#x2F;findwork.dev&#x2F;?source=hn" rel="nofollow">https:&#x2F;&#x2F;findwork.dev&#x2F;?source=hn</a>, <a href="https:&#x2F;&#x2F;kennytilton.github.io&#x2F;whoishiring&#x2F;" rel="nofollow">https:&#x2F;&#x2F;kennytilton.github.io&#x2F;whoishiring&#x2F;</a>,
<a href="https:&#x2F;&#x2F;hnhired.com&#x2F;" rel="nofollow">https:&#x2F;&#x2F;hn

--------------------------------------
See https:&#x2F;&#x2F;github.com&#x2F;.<p>Nothing on status.github.com yet
--------------------------------------
I haven&#x27;t been able to find anything about it except for news of the vote back in February.
--------------------------------------
Cert expired 5 mins ago.<p><a href="https:&#x2F;&#x2F;github.com&#x2F;" rel="nofollow">https:&#x2F;&#x2F;github.com&#x2F;</a>
--------------------------------------
(Sorry for bad English. Not my first language.)<p>I had just began work at my first job after finishing university studies. About 1 year back, my first task was to hide negative results about an Israeli start-up that had bad news articles about their business stealing code from competitors.<p>It was never wright with me but it was my first job and I followed instructions of the lead who I was working on this client task with.<p>After a weeks of work, we showed the results to client and the news stories were not appearing in the first pages 

--------------------------------------
So I&#x27;ve been thinking about this for a while and diversifying services as much as possible (mail, drive, docs etc) but Photos is one service that I can&#x27;t replace.<p>Of course I take backups of the (super messy) Takeout, and use the API to make (incomplete, lower quality) backups of photos.<p>None are ideal though. So, here&#x27;s my question, what&#x27;s your contingency plan if you lose access to your Google account?
--------------------------------------
Our 2 year old startup had a 409a valuation to, amongst other things, set the strike price of our options. The board held back significant amounts of information, including the total value and fair market value. The strike price was set at 1.25 per share, after they had suggested numerous people would get a price of 1 per share. Come to find out, the fair market value is .90 per share, which puts the value of the company at less than what they originally told us and to me shows some sh

--------------------------------------
Genuinely curious and admit I don’t understand this technology fully but if the main use case is verifying something Why can’t people vote using their phones and results be instantaneous.<p>Asking as the us election is approaching and we know it’s going to be a shitshow and this is if we are lucky. Else it will be something .. well let’s not go there.


### Ingest the text files generated via API into a corpus and print the corpus statistics.