# Q & A
*Try to answer the following questions without looking up the answers:*

What properties of language make it difficult to analyze?
Grammar, syntax, variances (different languages), typos, evolution (it evolves over time), slang, and semantics.

What is the difference between a lexicon and a grammar?
Lexicon is the entirety of the language grammar is how you use it.

What is the difference between syntax and morphology?

Morphology is the study of words and other meaningful units of language. Syntax is the study of sentences and phrases, and the rules of grammar that sentences obey.

What is the difference between a word stem and a lemma?
The stem of a word is not it's lemma. It's just what the word would look like minus the suffixes and the prefixes. The lemma is, like the root of the word.

Name 4 examples of parts of speech.
Noun, Adverb, adjective, conjugation.

What are some examples of named entities?

People, Places, Pets (sometimes?), etc. 

What are the steps in the NLP workflow?
Acquire, Build/Store corpus, preprocess, analyize, vector, model, deploy. 

Provide 3 examples of real-world NLP applications.

Chatbots, speech to text, digital assistants.

# Text Acquisition & Ingestion Assignment

In [4]:
import json
import requests
import feedparser
from bs4 import BeautifulSoup

### Iterate through the list of article URLs below, scraping the text from each one and saving it to a text file. 

In [5]:
articles = ['http://lite.cnn.io/en/article/h_eac18760a7a7f9a1bf33616f1c4a336d',
            'http://lite.cnn.io/en/article/h_de3f82f17d289680dd2b47c6413ebe7c',
            'http://lite.cnn.io/en/article/h_72f4dc9d6f35458a89af014b62e625ad',
            'http://lite.cnn.io/en/article/h_aa21fe6bf176071cb49e09d422c3adf0',
            'http://lite.cnn.io/en/article/h_8ad34a532921c9076cdc9d7390d2f1bc',
            'http://lite.cnn.io/en/article/h_84422c79110d9989177cfaf1c5f45fe7',
            'http://lite.cnn.io/en/article/h_d010d9580abac3a44c6181ec6fb63d58',
            'http://lite.cnn.io/en/article/h_fb11f4e9d7c5323e75b337d9e9e5e368',
            'http://lite.cnn.io/en/article/h_7b27f0b131067f8ece6238ac559670ab',
            'http://lite.cnn.io/en/article/h_8cae7f735fa9573d470f802063ceffe2',
            'http://lite.cnn.io/en/article/h_72c3668280e82576fcc2602b0fa70c14',
            'http://lite.cnn.io/en/article/h_d20658fb0e20212051cda0e0a7248c8a',
            'http://lite.cnn.io/en/article/h_56611c43d7928120d2ae21666ccc7417',
            'http://lite.cnn.io/en/article/h_bda0394e3c5ee7054ee65c022bca7695']

In [48]:
PATH = 'drive/MyDrive/content/articles/'
i = 0
for article in articles: 
  response = requests.get(article)
  content = response.text
 
  #class name = afe4286c
  soup = BeautifulSoup(content, 'lxml')

  text_list = [t.get_text(separator="\n") for t in soup.find_all(name='div', attrs={'class': 'afe4286c'})]
  
  with open(PATH + f'article_{i}.txt', 'wb') as f:
    f.write(str(text_list).encode())
  i += 1


### Ingest the text files generated via web scraping into a corpus and print the corpus statistics.

In [51]:
from nltk.corpus.reader.plaintext import PlaintextCorpusReader

DOC_PATTERN = r'.*\.txt'
corpus = PlaintextCorpusReader(PATH, DOC_PATTERN)


In [60]:
def corpus_stats(corpus):
  print("corpus statistics:")
  print(f'Number of Documents: {str(len(corpus.fileids()))}')
  print(f'Number of paragraphs: {str(len(corpus.paras()))}')
  print(f'Number of sentences: {str(len(corpus.sents()))}')
  print(f'Number of words: {str(len(corpus.words()))}')
  print(f'Vocabulary: {str(len(set(w.lower() for w in corpus.words())))}')
  print(f'avg chars per word: {str(round(len(corpus.raw())/len(corpus.words()), 1))}')
  print(f'avg words per sentence: {str(round(len(corpus.words())/len(corpus.sents()), 1))}')

In [61]:
import nltk
corpus_stats(corpus)

corpus statistics:
Number of Documents: 14
Number of paragraphs: 14
Number of sentences: 395
Number of words: 14102
Vocabulary: 3058
avg chars per word: 5.0
avg words per sentence: 35.7


In [58]:
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

### Parse the O'Reilly Radar RSS feed below, extract the text from each post, and save it to a text file.

The content of each post contains HTML tags. Strip those out using the same approach you used for web scraping so that only text is saved to the files.

In [62]:
feed = 'http://feeds.feedburner.com/oreilly/radar/atom'

In [63]:
parsed = feedparser.parse(feed)

In [64]:
posts = parsed.entries

In [65]:
posts[0]["summary"]

'2020 has been a year of great challenges for so many, but it’s not all negative. Around the world, organizations and their workforces have risen to the occasion, recognizing the importance of expanding their knowledge, taking on new tasks, and bettering themselves both personally and professionally. With the uptick in virtual conferencing, remote work, and, [&#8230;]'

### Ingest the text files generated via RSS parsing into a corpus and print the corpus statistics.

In [66]:
PATH = 'drive/MyDrive/content/rss/'

for i, post in enumerate(posts):
  text = post.summary

  with open(PATH + f'post_{i}.txt', 'wb') as f:
    f.write(text.encode())

In [67]:
DOC_PATTERN = r'.*\.txt'
corpus = PlaintextCorpusReader(PATH, DOC_PATTERN)

corpus_stats(corpus)

corpus statistics:
Number of Documents: 60
Number of paragraphs: 60
Number of sentences: 197
Number of words: 4515
Vocabulary: 1467
avg chars per word: 4.9
avg words per sentence: 22.9


### Make an API call to the Hacker News API to retrieve their Ask, Show, and Job category items. 

- URL: https://hacker-news.firebaseio.com/v0/askstories.json

In [73]:
url = 'https://hacker-news.firebaseio.com/v0/askstories.json'

response = requests.get(url).json()

print(response)

[25562022, 25561398, 25560185, 25553818, 25558927, 25559755, 25559274, 25554464, 25560180, 25559174, 25553772, 25540583, 25538586, 25556199, 25537230, 25559143, 25555746, 25556563, 25557720, 25546445, 25525457, 25541269, 25541616, 25558741, 25557852, 25525426, 25548802, 25553613, 25530700, 25533487, 25547050, 25541964, 25557927, 25552885, 25542676, 25557631, 25545469, 25551133, 25544753, 25538405, 25531729, 25550111, 25546557, 25545136, 25542679, 25551290, 25540343, 25541939, 25559571, 25542290, 25533051, 25538258, 25538128, 25541828, 25528481, 25535752, 25540059, 25544961, 25550627, 25528596, 25543287, 25555295, 25543087, 25526708, 25542812, 25530559, 25535332, 25528837, 25533472, 25525590, 25539594, 25539230, 25537569, 25535792, 25533505, 25534981, 25526280, 25533682, 25542189, 25549841, 25548696, 25527401, 25526579, 25545525, 25549864, 25525446, 25536672, 25539190, 25533770, 25527006, 25526511, 25537279, 25534307, 25550782, 25542445, 25534168, 25553809, 25553555]


In [84]:
print(len(response))

98


### Once you have retrieved the item IDs from the URL above, retrieve each item by adding the item ID to the URL below, extract the item's text property, and save the text from each item to disk as its own document.

- URL: https://hacker-news.firebaseio.com/v0/item/ITEM_ID_HERE.json

The content of some items may contain HTML tags. Strip those out using the same approach you used for web scraping so that only text is saved to the files.

In [87]:
for id in response:
  url = f'https://hacker-news.firebaseio.com/v0/item/{id}.json'
  article = requests.get(url).json()
  if "text" in article:
    soup = BeautifulSoup(article['text'], 'lxml')
    PATH = 'drive/MyDrive/content/api/'
    
    with open(PATH + f'{id}.txt', 'wb') as f:
      f.write(text.encode())   

### Ingest the text files generated via API into a corpus and print the corpus statistics.

In [88]:
DOC_PATTERN = r'.*\.txt'
corpus = PlaintextCorpusReader(PATH, DOC_PATTERN)

corpus_stats(corpus)

corpus statistics:
Number of Documents: 77
Number of paragraphs: 77
Number of sentences: 231
Number of words: 5852
Vocabulary: 61
avg chars per word: 5.3
avg words per sentence: 25.3
