# Natural Language Processing
* What is it?
  * A field concerned with the interactions between computers and natural (human) languages

* Document - basic unit of observation, e.g. entire books, chapters in a book, articles, reviews, or tweets
* Corpus - collection of related documents
* Token - basic unit of text analysis, such as a word, number, or punctuation mark
* Stop words - frequently used words with relatively unimportant meanings. (the, and, I, we, my, a)
* Named entity - word or group of words that refers to a specific object belonging to a category (names, organizations, locations, time expressions, quantities, monetary values, percentages)
* Normalization
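  * Stemming - reduces a word to its stem, the form to which affixes (prefixes or suffixes) can be attached; the stem need not be a real word
    * word: puppy
    * stem: puppi
  * Lemmatization - reduces a word to its lemma, the root or dictionary form of the word
    * word: puppies
    * lemma: puppy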

* Part of Speech - grammatical category of a word (noun, verb, adjective, adverb)
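
To make the vocabulary above concrete, here is a minimal sketch of tokenization and stop-word removal using only the standard library. The regular expression, the tiny `STOP_WORDS` set, and the two-document `toy_corpus` are illustrative assumptions; libraries such as NLTK ship far more complete tokenizers and stop-word lists.

```python
import re

# Hypothetical, deliberately tiny stop-word list (taken from the examples above).
STOP_WORDS = {"the", "and", "i", "we", "my", "a"}

def tokenize(document):
    """Lowercase a document and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", document.lower())

def remove_stop_words(tokens):
    """Drop tokens that appear in the stop-word list."""
    return [token for token in tokens if token not in STOP_WORDS]

# A corpus is a collection of related documents.
toy_corpus = [
    "I walked my puppy and the puppy slept.",
    "We adopted a puppy.",
]

for document in toy_corpus:
    tokens = tokenize(document)
    print(remove_stop_words(tokens))
```

Stemming and lemmatization would then be applied to the remaining tokens, so that "puppy" and "puppies" end up counted as the same term.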

* Applications:
  * Chatbots
  * Topic Modeling
  * Sentiment Analysis
  * Machine Translation
  * Spam Filtering
  * Transcription (Speech to Text)
  * Text to Speech
  * Digital Assistants (Siri, Alexa, Cortana, etc.)



In [None]:
import requests

url = 'https://quotes.toscrape.com/'
response = requests.get(url)
response

<Response [200]>

In [None]:
content = response.text

In [None]:
type(content)

str

In [None]:
from bs4 import BeautifulSoup

soup = BeautifulSoup(content, 'html.parser') # name the parser explicitly to avoid a warning and parser-dependent results

In [None]:
type(soup)

bs4.BeautifulSoup

In [None]:
print(soup.prettify())

In [None]:
# find() returns the first element that matches the tag name and attributes
soup.find('span', {'class':'text'}).get_text() # equivalent to .text

'“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”'

In [None]:
quotes = soup.find_all('span', {'class':'text'})

In [None]:
my_quotes = [quote.text for quote in quotes]
my_quotes

['“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”',
 '“It is our choices, Harry, that show what we truly are, far more than our abilities.”',
 '“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”',
 '“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”',
 "“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”",
 '“Try not to become a man of success. Rather become a man of value.”',
 '“It is better to be hated for what you are than to be loved for what you are not.”',
 "“I have not failed. I've just found 10,000 ways that won't work.”",
 "“A woman is like a tea bag; you never know how strong it is until it's in hot water.”",
 '“A day without sunshine is like, you know, night.”']
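
Each scraped quote still carries the site's curly quotation marks. If plain text is needed downstream, they can be stripped. This sketch uses two strings copied from the output above rather than re-running the scrape, and `clean_quote` is a hypothetical helper name.

```python
def clean_quote(text):
    """Remove surrounding whitespace and curly quotation marks."""
    return text.strip().strip("“”")

sample_quotes = [
    "“Try not to become a man of success. Rather become a man of value.”",
    "“A day without sunshine is like, you know, night.”",
]

cleaned = [clean_quote(quote) for quote in sample_quotes]
print(cleaned)
```

In the notebook, the same helper could be applied inside the list comprehension that builds `my_quotes`.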

In [None]:
# other Python libraries for scraping, useful when a page needs a real browser: Selenium, Splinter

In [None]:
# define the url
url = 'https://books.toscrape.com/'
# get html data from url
response = requests.get(url)
# the page is UTF-8; set the encoding explicitly so '£' is not mis-decoded as 'Â£'
response.encoding = 'utf-8'
# decode the response body to text
content = response.text

In [None]:
soup = BeautifulSoup(content, 'html.parser') # or 'lxml', if it is installed

In [None]:
print(soup.prettify())

In [None]:
article = soup.find('article', {'class': 'product_pod'})

In [None]:
article.find('p', {'class': 'price_color'})

<p class="price_color">£51.77</p>
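
The price is returned as text with a currency symbol attached, so it needs converting before any arithmetic. A minimal sketch, assuming each price string contains a single decimal number; `parse_price` is a hypothetical helper, and the regular expression ignores whatever symbol precedes the digits.

```python
import re

def parse_price(price_text):
    """Extract the first decimal number from a price string as a float."""
    match = re.search(r"\d+(?:\.\d+)?", price_text)
    if match is None:
        raise ValueError(f"no number found in {price_text!r}")
    return float(match.group())

print(parse_price("£51.77"))
```

In the notebook this could be called as `parse_price(article.find('p', {'class': 'price_color'}).text)`.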

In [None]:
article.find('h3').text

'A Light in the ...'

In [None]:
article.h3.text

'A Light in the ...'
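
The heading text is shortened by the site itself. On pages laid out like this one, the full title is often stored in the `title` attribute of the link inside the `h3`. The sketch below parses a hand-written HTML fragment that approximates that structure (an assumption, not the page's exact markup, with a made-up book title) so it runs without a network request.

```python
from bs4 import BeautifulSoup

# Hand-written fragment approximating one product card (illustrative only).
html = """
<article class="product_pod">
  <h3><a href="#" title="An Example Book With a Long Title">An Example Book ...</a></h3>
</article>
"""

sample_soup = BeautifulSoup(html, "html.parser")
sample_article = sample_soup.find("article", {"class": "product_pod"})

link = sample_article.h3.a
print(link.text)      # shortened text shown on the page
print(link["title"])  # full title read from the attribute
```

If the real page follows the same layout, `article.h3.a['title']` would return the untruncated title.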

In [None]:
!pip install feedparser


In [None]:
import feedparser

feed = 'http://feeds.feedburner.com/oreilly/radar/atom'
parsed = feedparser.parse(feed)

In [None]:
len(parsed) # number of top-level keys in the parsed result, not the number of posts

12

In [None]:
len(parsed.entries)

60

In [None]:
posts = parsed.entries

In [None]:
posts[0]['summary']

'“If they can get you asking the wrong questions, they don&#8217;t have to worry about answers.” Thomas Pynchon, Gravity’s Rainbow The deplatforming of Donald Trump and his alt-right coterie has led to many discussions of free speech.&#160; Some of the discussions make good points, most don’t, but it seems to me that all of them [&#8230;]'
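
The summary still contains HTML character references such as `&#8217;`. The standard library's `html.unescape` converts them to the characters they stand for. The sample string below is shortened from the output above, and replacing the non-breaking space with a regular space is an optional extra step.

```python
import html

raw_summary = "they don&#8217;t have to worry about answers.&#160; Some of the discussions [&#8230;]"

# Convert character references, then normalize the non-breaking space.
clean_summary = html.unescape(raw_summary).replace("\xa0", " ")
print(clean_summary)
```

Applying this to `post['summary']` before saving the posts would keep the entities out of the corpus files.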

In [None]:
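import os

PATH = '/content/rss_posts/'
os.makedirs(PATH, exist_ok=True) # create the output directory if it is missing

# write each post summary to its own UTF-8 text file
for i, post in enumerate(posts):
  text = post['summary'] # post.summary

  with open(PATH + f'post_{i}.txt', 'w', encoding='utf-8') as f:
    f.write(text)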

In [None]:
from nltk.corpus.reader.plaintext import PlaintextCorpusReader

DOC_PATTERN = r'.*\.txt'
corpus = PlaintextCorpusReader('/content/rss_posts', DOC_PATTERN)

In [None]:
len(corpus.fileids())

60

In [None]:
docs = [corpus.raw(doc) for doc in corpus.fileids()]
docs

['“If they can get you asking the wrong questions, they don&#8217;t have to worry about answers.” Thomas Pynchon, Gravity’s Rainbow The deplatforming of Donald Trump and his alt-right coterie has led to many discussions of free speech.&#160; Some of the discussions make good points, most don’t, but it seems to me that all of them [&#8230;]',
 'A lot happened in the last month, and not just in Washington. Important developments appeared all through the technology world. Perhaps the most spectacular was the use of Natural Language Processing techniques to analyze viral DNA. It’s actually sort of obvious once you think about it. If DNA is a language, then it should have [&#8230;]',
 'NAND Game &#8212; You start with a single component, the nand gate. Using this as the fundamental building block, you will build all other components necessary. (See also NAND to Tetris) Facebook&#8217;s Game AI &#8212; today we are unveiling Recursive Belief-based Learning (ReBeL), a general RL+Search algori

In [None]:
# JSON API: .json() parses the response body into a Python dict
url = 'https://dog.ceo/api/breed/pomeranian/images/random'
requests.get(url).json()

{'message': 'https://images.dog.ceo/breeds/pomeranian/n02112018_5697.jpg',
 'status': 'success'}

In [None]:
url = 'https://api.kanye.rest/'

requests.get(url).json()

{'quote': "I'm nice at ping pong"}

In [None]:
for i in range(10):
  response = requests.get(url).json()

  print(response['quote'])
  print('---'*10)

Decentralize
------------------------------
Empathy is the glue
------------------------------
We must and will cure homelessness and hunger. We have the capability as a species
------------------------------
Have you ever thought you were in love with someone but then realized you were just staring in a mirror for 20 minutes?
------------------------------
We're going to move the entire music industry into the 21st Century
------------------------------
We have to evolve
------------------------------
We will heal. We will cure.
------------------------------
If I don't scream, if I don't say something then no one's going to say anything.
------------------------------
Buy property
------------------------------
The world is our family
------------------------------


In [None]:
API_KEY = 'YOUR_API_KEY' # placeholder: supply your own NewsAPI key and keep it out of shared notebooks
url = f'https://newsapi.org/v2/everything?q=tesla&from=2021-02-11&language=en&sortBy=publishedAt&apiKey={API_KEY}'

data = requests.get(url).json()

In [None]:
len(data['articles'])

20

In [None]:
data['articles'][0]['description']

"Elon Musk, the world's richest man, says that he manages to get by on six hours of sleep while juggling his portfolio of businesses like Tesla (bottom left) and SpaceX (top left)."

In [None]:
for article in data['articles']:
  print(article['description'])
  print('-----------')

Elon Musk, the world's richest man, says that he manages to get by on six hours of sleep while juggling his portfolio of businesses like Tesla (bottom left) and SpaceX (top left).
-----------
Sen. Cynthia Lummis (R., Wyo.) says she doesn't want the federal government to "mess up" cryptocurrency regulation.
-----------
Despite choppy price action, news over the past few days has been extraordinarily bullish for the leading cryptocurrency.
-----------
DUBLIN, Feb. 12, 2021 /PRNewswire/ -- The "Electric Vehicle Charging Station Market by Level of Charging (Level 1, Level 2 & Level 3), by Charging Infrastructure (Normal Charge, Type-2, CCS, CHAdeMO and Tesla Supercharger), DC Fast Charging (Fast & Ultra-fast…
-----------
DUBLIN, Feb. 12, 2021 /PRNewswire/ -- The "Electric Vehicle Charging Station Market by Level of Charging (Level 1, Level 2 & Level 3), by Charging Infrastructure (Normal Charge, Type-2, CCS, CHAdeMO and Tesla Supercharger), DC Fast Charging (Fast & Ultra-fast…
-----------
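
The output above contains a repeated description. When the descriptions are going to feed a text-analysis step, it may help to drop repeats and missing values first. A minimal sketch, assuming the response has the `articles` / `description` shape used above; `unique_descriptions` and the sample payload are hypothetical.

```python
def unique_descriptions(payload):
    """Return article descriptions in order, skipping empty values and repeats."""
    seen = set()
    result = []
    for article in payload.get("articles", []):
        description = article.get("description")
        if description and description not in seen:
            seen.add(description)
            result.append(description)
    return result

# Hypothetical payload mimicking the structure of the response above.
sample_payload = {
    "articles": [
        {"description": "First story."},
        {"description": "Second story."},
        {"description": "Second story."},
        {"description": None},
    ]
}

print(unique_descriptions(sample_payload))
```

In the notebook, `unique_descriptions(data)` would return the documents for a small news corpus.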
