# Natural Language Processing

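- __Requests__ to fetch the HTML files
- __BeautifulSoup__ to pull the data from HTML files
- __lxml__ an optional parser that BeautifulSoup can use to translate the HTML to Python (the cells below use Python's built-in html.parser instead)
- __Pandas__ to manipulate our data, printing it and saving it into a file
- __nltk__ (Natural Language Toolkit) to tokenize, stem, and lemmatize the text and to remove stopwords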

### Steps
- Import the get() function from the requests module, BeautifulSoup from bs4, and pandas.
- Assign the address of the web page to a variable named url.
- Request the content of the web page from the server by using get(), and store the server's response in the variable response.
- Print the response text to ensure you have an html page.
- Take a look at the actual web page contents and inspect the source to understand the structure a bit.
- Use BeautifulSoup to parse the HTML into a variable ('soup').
- Identify the key tags you need to extract the data you are looking for.
- Create a dataframe of the data desired.
- Run some summary stats and inspect the data to ensure you have what you wanted.
- Edit the data structure as needed, especially so that one column has all the text you want included in this analysis.
- Create a corpus of the column with the text you want to analyze.
- Store that corpus for use in a future notebook.
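
The steps above can be sketched end to end on a small, made-up page. This is only an illustration: the HTML string stands in for response.text so the example runs without a network call, and the article/h2/p structure is an assumption, not the real layout of the Codeup blog.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Hypothetical stand-in for response.text; the real page's tags will differ.
sample_html = """
<html><body>
  <article><h2><a href="/post-one/">Post One</a></h2><p>First summary.</p></article>
  <article><h2><a href="/post-two/">Post Two</a></h2><p>Second summary.</p></article>
</body></html>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# Collect one row per article: title, link target, and body text
rows = []
for article in sample_soup.find_all("article"):
    link = article.find("a")
    rows.append({
        "title": link.get_text(strip=True),
        "url": link.get("href"),
        "text": article.find("p").get_text(strip=True),
    })

df = pd.DataFrame(rows)
print(df.shape)

# The corpus is the single column holding the text to analyze
corpus = df["text"].tolist()
print(corpus)
```

Keeping the text in one column makes it easy to hand the corpus to the cleaning steps later on.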

In [142]:
from requests import get
from bs4 import BeautifulSoup
import os
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import requests

import re
import json
import unicodedata
import nltk
from nltk.corpus import stopwords
from nltk.tokenize.toktok import ToktokTokenizer

from time import strftime

import warnings
warnings.filterwarnings('ignore')

import acquire

In [29]:
url = 'https://codeup.com/blog/'
headers = {'User-Agent': 'Codeup Data Science'} # Some websites don't accept the python-requests default user-agent
response = get(url, headers=headers)
# response.ok is True for any status code below 400
if response.ok:
    print(f'Url Pinged okay:  {response}')
else:
    print(f'error occurred: {response}')

Url Pinged okay:  <Response [200]>
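
A note on checking the response: comparing the response object itself to a status code does not test anything, because a Response is never equal to an integer. The sketch below shows the difference using hand-built Response objects, so it runs without a network call. The helper name describe is hypothetical.

```python
import requests

def describe(response):
    """Return a short status message for a requests.Response."""
    if response.ok:  # True for status codes below 400
        return f"OK: {response.status_code}"
    return f"Error: {response.status_code}"

# Build Response objects by hand instead of calling a server
good = requests.Response()
good.status_code = 200
bad = requests.Response()
bad.status_code = 404

print(describe(good))
print(describe(bad))

# A Response object is never equal to its integer status code
always_true = bad != bad.status_code
print(always_true)
```

For a real request, response.raise_for_status() is another option: it raises an exception on a 4xx or 5xx status instead of returning a flag.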


In [30]:
print(response.text[:200])

<!DOCTYPE html>
<html lang="en-US">
<head>
	<meta charset="UTF-8" />
<meta http-equiv="X-UA-Compatible" content="IE=edge">
	<link rel="pingback" href="https://codeup.com/xmlrpc.php" />

	<script type=


In [35]:
# Make a soup variable holding the response content
soup = BeautifulSoup(response.content, 'html.parser')
print(soup.prettify())

<!DOCTYPE html>
<html lang="en-US">
 <head>
  <meta charset="utf-8"/>
  <meta content="IE=edge" http-equiv="X-UA-Compatible"/>
  <link href="https://codeup.com/xmlrpc.php" rel="pingback"/>
  <script type="text/javascript">
   document.documentElement.className = 'js';
  </script>
  <link crossorigin="" href="https://fonts.gstatic.com" rel="preconnect"/>
  <script id="diviarea-loader">
   window.DiviPopupData=window.DiviAreaConfig={"zIndex":1000000,"animateSpeed":400,"triggerClassPrefix":"show-popup-","idAttrib":"data-popup","modalIndicatorClass":"is-modal","blockingIndicatorClass":"is-blocking","defaultShowCloseButton":true,"withCloseClass":"with-close","noCloseClass":"no-close","triggerCloseClass":"close","singletonClass":"single","darkModeClass":"dark","noShadowClass":"no-shadow","altCloseClass":"close-alt","popupSelector":".et_pb_section.popup","initializeOnEvent":"et_pb_after_init_modules","popupWrapperClass":"area-outer-wrap","fullHeightClass":"full-height","openPopupClass":"da-ov

In [50]:
# Get title
soup.title

<title>Blog - Codeup</title>

In [199]:
soup.select('title')[0]

<title>Blog - Codeup</title>

In [41]:
soup.title.string

'Blog - Codeup'

In [60]:
soup.a

<a href="https://codeup.com/">Home</a>

In [66]:
soup.find_all('a')[2]

<a href="https://codeup.com/program/full-stack-web-development/">Full Stack Web Development</a>

In [78]:
soup.find_all('div')

[<div class="divi-mobile-menu">
 <div class="menu-wrap menuclosed" id="dm_nav">
 <div class="menu-wrap__inner">
 <div class="scroll_section">
 <nav class="menu-top"></nav>
 <nav class="menu-side">
 <span class="menu-name-behind"></span>
 <ul class="nav" id="dm-menu"><li class="menu-item menu-item-type-post_type menu-item-object-page menu-item-home menu-item-16491"><a href="https://codeup.com/">Home</a></li>
 <li class="menu-item menu-item-type-custom menu-item-object-custom menu-item-18125"><a href="https://codeup.com/program/cloud-adminsitration/">Cloud Administration</a></li>
 <li class="menu-item menu-item-type-post_type menu-item-object-course menu-item-16497"><a href="https://codeup.com/program/full-stack-web-development/">Full Stack Web Development</a></li>
 <li class="menu-item menu-item-type-post_type menu-item-object-course menu-item-16496"><a href="https://codeup.com/program/data-science/">Data Science</a></li>
 <li class="menu-item menu-item-type-post_type menu-item-object-p

### Get all link targets (href) with a for loop

In [102]:
for link in soup.find_all('a'):
    # .get() returns None when an <a> tag has no href attribute
    href = link.get('href')
    print(href)

https://codeup.com/
https://codeup.com/program/cloud-adminsitration/
https://codeup.com/program/full-stack-web-development/
https://codeup.com/program/data-science/
https://codeup.com/financial-aid/
https://codeup.com/events/
https://codeup.com/veterans/
https://codeup.com/hire-tech-talent/
https://alumni.codeup.com/
https://codeup.com/resources/
/my-story/
https://codeup.com/blog/
https://codeup.com/frequently-asked-questions/
https://codeup.com/podcast/
https://codeup.com/apply-now/
https://codeup.com/
/about-codeup/
/category/behind-the-billboards/
/careers/
/index.php/
https://codeup.com/programs/
https://codeup.com/program/cloud-adminsitration/
https://codeup.com/program/full-stack-web-development/
https://codeup.com/program/data-science/
/financial-aid/
#
https://codeup.com/san-antonio-events/
https://codeup.com/dallas-events/
https://codeup.com/veterans/
None
https://codeup.com/san-antonio/
https://codeup.com/dallas/
https://codeup.com/houston/
https://codeup.com/hire-tech-talen
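
The output above mixes absolute URLs, relative paths such as /my-story/, a bare # placeholder, and None. A minimal sketch of one way to tidy such a list with the standard library's urljoin is shown here. The hrefs list is a hand-picked sample of the kinds of values printed above, not the full output.

```python
from urllib.parse import urljoin

base_url = "https://codeup.com/blog/"

# A few href values of the kinds seen in the output above
hrefs = ["https://codeup.com/", "/my-story/", "#", None, "/careers/", "https://codeup.com/"]

absolute_links = []
for href in hrefs:
    # Skip missing hrefs and in-page placeholders
    if not href or href.startswith("#"):
        continue
    full = urljoin(base_url, href)  # resolves relative paths against the page URL
    if full not in absolute_links:  # drop duplicates, keep first-seen order
        absolute_links.append(full)

print(absolute_links)
```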

### get_text()

In [120]:
print(soup.get_text(strip = True, separator = '  '))

Blog - Codeup  Home  Cloud Administration  Full Stack Web Development  Data Science  Financial Aid  Events  Military  Hire Tech Talent  Alumni  Resources  Student Reviews  Blog  Common Questions  Hire Tech Podcast  Apply Now  About  |  Behind The Billboards  |  Careers  Programs  Cloud Administration  Web Development  Data Science  Financing  Events  San Antonio  Dallas  Military  Campuses  San Antonio  Dallas  Houston  Hire  Employers  Alumni Portal  Resources  Student Reviews  Blog  Common Questions  Hire Tech Podcast  Apply Now  Codeup News & Articles  Search for:  From Bootcamp to Bootcamp | A Military Appreciation Panel  In honor of Military Appreciation Month, join us for a discussion with Codeup Alumni who are also Military Veterans! We will chat about their experiences attending a coding bootcamp, and how their military training set them up for success here at Codeup. Grab your...  Read More  Our Acquisition of the Rackspace Cloud Academy: One Year Later  Just about a year ago 

### Tree Navigation

In [134]:
soup.find_all('a')

[<a href="https://codeup.com/">Home</a>,
 <a href="https://codeup.com/program/cloud-adminsitration/">Cloud Administration</a>,
 <a href="https://codeup.com/program/full-stack-web-development/">Full Stack Web Development</a>,
 <a href="https://codeup.com/program/data-science/">Data Science</a>,
 <a href="https://codeup.com/financial-aid/">Financial Aid</a>,
 <a href="https://codeup.com/events/">Events</a>,
 <a href="https://codeup.com/veterans/">Military</a>,
 <a href="https://codeup.com/hire-tech-talent/">Hire Tech Talent</a>,
 <a href="https://alumni.codeup.com/">Alumni</a>,
 <a href="https://codeup.com/resources/">Resources</a>,
 <a href="/my-story/">Student Reviews</a>,
 <a aria-current="page" href="https://codeup.com/blog/">Blog</a>,
 <a href="https://codeup.com/frequently-asked-questions/">Common Questions</a>,
 <a href="https://codeup.com/podcast/">Hire Tech Podcast</a>,
 <a href="https://codeup.com/apply-now/">Apply Now</a>,
 <a href="https://codeup.com/">
 <img alt="Codeup" cla

### Beautiful Soup Methods and Properties

- soup.title.string gets the page's title (the text shown in the browser tab for a page, taken from the \<title\> element).
- soup.prettify() returns the HTML as an indented string, which is useful to print when you want to inspect the markup.
- soup.find_all("a") finds all the anchor tags, or all elements matching whatever argument is specified.
- soup.find("h1") finds the first matching element.
- soup.get_text() gets the text from within a matching piece of soup/HTML.
- soup.select() takes a CSS selector as a string and returns a list of all matching elements, which makes it a convenient way to target nested elements.
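
The methods listed above can be tried on a small inline document. This is a sketch with made-up HTML. The class name post and all the text are assumptions for illustration only.

```python
from bs4 import BeautifulSoup

# Made-up HTML used only to exercise the methods
demo_html = """
<html><head><title>Demo Page</title></head>
<body>
  <h1>Main heading</h1>
  <div class="post"><h2>First post</h2><a href="/first/">Read More</a></div>
  <div class="post"><h2>Second post</h2><a href="/second/">Read More</a></div>
</body></html>
"""
demo_soup = BeautifulSoup(demo_html, "html.parser")

title = demo_soup.title.string                               # text of <title>
first_h1 = demo_soup.find("h1").get_text()                   # first matching element
links = [a.get("href") for a in demo_soup.find_all("a")]     # every <a> tag
# CSS selector: h2 elements nested inside <div class="post">
post_titles = [h2.get_text() for h2 in demo_soup.select("div.post h2")]

print(title, first_h1, links, post_titles, sep="\n")
```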

In [1]:
soup.select('h2')


In [10]:
soup.select('h2')[2].text


'Get Program Details & Pricing'

In [45]:
soup.select('body')

[<body class="blog custom-background et-tb-has-template et-tb-has-header et-tb-has-body et-tb-has-footer et_pb_button_helper_class et_cover_background et_pb_gutter osx et_pb_gutters3 et_pb_pagebuilder_layout et_smooth_scroll et_divi_theme et-db">
 <svg focusable="false" height="0" role="none" style="visibility: hidden; position: absolute; left: -9999px; overflow: hidden;" viewbox="0 0 0 0" width="0" xmlns="http://www.w3.org/2000/svg"><defs><filter id="wp-duotone-dark-grayscale"><fecolormatrix color-interpolation-filters="sRGB" type="matrix" values=" .299 .587 .114 0 0 .299 .587 .114 0 0 .299 .587 .114 0 0 .299 .587 .114 0 0 "></fecolormatrix><fecomponenttransfer color-interpolation-filters="sRGB"><fefuncr tablevalues="0 0.498039215686" type="table"></fefuncr><fefuncg tablevalues="0 0.498039215686" type="table"></fefuncg><fefuncb tablevalues="0 0.498039215686" type="table"></fefuncb><fefunca tablevalues="1 1" type="table"></fefunca></fecomponenttransfer><fecomposite in2="SourceGraphic" 

In [181]:
import nltk
nltk.download_shell()

NLTK Downloader
---------------------------------------------------------------------------
    d) Download   l) List    u) Update   c) Config   h) Help   q) Quit
---------------------------------------------------------------------------
Downloader> d

Download which package (l=list; x=cancel)?
  Identifier> 

---------------------------------------------------------------------------
    d) Download   l) List    u) Update   c) Config   h) Help   q) Quit
---------------------------------------------------------------------------
Downloader> h

Commands:
  d) Download a package or collection     u) Update out of date packages
  l) List packages & collections          h) Help
  c) View & Modify Configuration          q) Quit

---------------------------------------------------------------------------
    d) Download   l) List    u) Update   c) Config   h) Help   q) Quit
---------------------------------------------------------------------------
Downloader> l

Packages:
  [ ] abc........

Downloader> q


# Data Preparation

#### Plan for parsing the text data:

- Convert text to all lower case for consistency.
- Remove accented and other non-ASCII characters, since we are working with English-language text.
- Remove special characters.
- Stem or lemmatize the words.
- Remove stopwords (words that are not important).
- Store the clean text and the original text for use in future notebooks.
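
As a rough sketch of how this plan fits together, the example below chains the first three steps in one function and then stores the original and cleaned text side by side. It leaves out stemming, lemmatizing, and stopword removal. The function name basic_clean, the sample sentence, and the file name are all hypothetical.

```python
import json
import re
import tempfile
import unicodedata
from pathlib import Path

def basic_clean(text):
    """Lowercase, strip non-ASCII characters, then drop special characters."""
    text = text.lower()
    text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("utf-8")
    return re.sub(r"[^a-z0-9'\s]", "", text)

original = "Codeup News & Articles: Caf\u00e9 r\u00e9sum\u00e9 tips!"
record = {"original": original, "clean": basic_clean(original)}

# Write to a temporary location; a real notebook would choose its own path
out_path = Path(tempfile.gettempdir()) / "clean_text_example.json"
out_path.write_text(json.dumps(record, indent=2))

# Read it back, as a later notebook might
loaded = json.loads(out_path.read_text())
print(loaded["clean"])
```

Removing a character such as & leaves the spaces on either side of it, so the cleaned string can contain double spaces. Splitting on whitespace later takes care of that.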

In [161]:
# We don't need to install nltk, it should come with anaconda, but nltk
# does need to download some data. From a terminal:
# python -c "import nltk; nltk.download('stopwords')"


data = soup.get_text(strip = True, separator = ' ')
data

'Blog - Codeup Home Cloud Administration Full Stack Web Development Data Science Financial Aid Events Military Hire Tech Talent Alumni Resources Student Reviews Blog Common Questions Hire Tech Podcast Apply Now About | Behind The Billboards | Careers Programs Cloud Administration Web Development Data Science Financing Events San Antonio Dallas Military Campuses San Antonio Dallas Houston Hire Employers Alumni Portal Resources Student Reviews Blog Common Questions Hire Tech Podcast Apply Now Codeup News & Articles Search for: From Bootcamp to Bootcamp | A Military Appreciation Panel In honor of Military Appreciation Month, join us for a discussion with Codeup Alumni who are also Military Veterans! We will chat about their experiences attending a coding bootcamp, and how their military training set them up for success here at Codeup. Grab your... Read More Our Acquisition of the Rackspace Cloud Academy: One Year Later Just about a year ago on April 16th, 2021 we announced our acquisition

In [162]:
# Make all case lower

data = data.lower()

### Removing accented and non-ASCII characters: we'll go about this in three steps

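- __unicodedata.normalize__ removes any inconsistencies in unicode character encoding. With the 'NFKD' form it also splits an accented character into a base letter plus a separate accent mark.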
- __.encode__ to convert the resulting string to the ASCII character set. We'll ignore any errors in conversion, meaning we'll drop anything that isn't an ASCII character.
- __.decode__ to turn the resulting bytes object back into a string.
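
To see what each of the three steps does on its own, here is a small sketch with a made-up sample string. The escape sequences are an accented e, a curly apostrophe, and an i with a diaeresis.

```python
import unicodedata

sample = "caf\u00e9, it\u2019s na\u00efve"

# Step 1: NFKD splits each accented letter into a base letter + combining mark
decomposed = unicodedata.normalize("NFKD", sample)
print(len(sample), len(decomposed))

# Step 2: encoding to ASCII with 'ignore' silently drops whatever is not ASCII
as_bytes = decomposed.encode("ascii", "ignore")
print(as_bytes)

# Step 3: decode the bytes back into a str
result = as_bytes.decode("utf-8")
print(result)
```

Note that the curly apostrophe has no ASCII equivalent under NFKD, so it is dropped entirely rather than turned into a straight quote.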

In [165]:
data = unicodedata.normalize('NFKD', data)\
    .encode('ascii', 'ignore')\
    .decode('utf-8', 'ignore')

print(data)

blog - codeup home cloud administration full stack web development data science financial aid events military hire tech talent alumni resources student reviews blog common questions hire tech podcast apply now about | behind the billboards | careers programs cloud administration web development data science financing events san antonio dallas military campuses san antonio dallas houston hire employers alumni portal resources student reviews blog common questions hire tech podcast apply now codeup news & articles search for: from bootcamp to bootcamp | a military appreciation panel in honor of military appreciation month, join us for a discussion with codeup alumni who are also military veterans! we will chat about their experiences attending a coding bootcamp, and how their military training set them up for success here at codeup. grab your... read more our acquisition of the rackspace cloud academy: one year later just about a year ago on april 16th, 2021 we announced our acquisition 

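## Remove anything not a-z or 0-9 with regex

[^a-z0-9] --> a ^ inside the brackets negates the set: any single character that is not a-z or 0-9

^[a-z0-9] --> a ^ outside the brackets anchors the match: a character in a-z or 0-9 at the beginning of the string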
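
The difference between the two patterns is easy to check on a short, made-up string:

```python
import re

sample = "hello, world 42!"

# ^ inside the brackets: remove every character that is NOT a-z or 0-9
only_alnum = re.sub(r"[^a-z0-9]", "", sample)
print(only_alnum)

# ^ outside the brackets: match a-z or 0-9 only at the start of the string
starts_with = re.findall(r"^[a-z0-9]", sample)
no_match = re.findall(r"^[a-z0-9]", "!hello")
print(starts_with, no_match)

# Adding ' and \s to the negated set keeps apostrophes and whitespace
kept = re.sub(r"[^a-z0-9'\s]", "", "it's 5 o'clock, ok?")
print(kept)
```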

In [168]:
# remove anything that is not a through z, a number, a single quote, or whitespace
data = re.sub(r"[^a-z0-9'\s]", '', data)
print(data)


blog  codeup home cloud administration full stack web development data science financial aid events military hire tech talent alumni resources student reviews blog common questions hire tech podcast apply now about  behind the billboards  careers programs cloud administration web development data science financing events san antonio dallas military campuses san antonio dallas houston hire employers alumni portal resources student reviews blog common questions hire tech podcast apply now codeup news  articles search for from bootcamp to bootcamp  a military appreciation panel in honor of military appreciation month join us for a discussion with codeup alumni who are also military veterans we will chat about their experiences attending a coding bootcamp and how their military training set them up for success here at codeup grab your read more our acquisition of the rackspace cloud academy one year later just about a year ago on april 16th 2021 we announced our acquisition of the rackspac

## Tokenization

##### Tokenization is the process of breaking something down into discrete units. In the context of NLP, this means breaking text down into discrete words, punctuation, etc.
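
As a small illustration, the sketch below runs the same tokenizer on a short, made-up sentence that still has its punctuation. By default tokenize returns a list of tokens, while return_str=True returns the tokens joined into one space-separated string.

```python
from nltk.tokenize.toktok import ToktokTokenizer

toktok = ToktokTokenizer()
sentence = "Grab your coffee, then read more!"

# Default: a list of tokens
tokens = toktok.tokenize(sentence)
print(tokens)

# return_str=True: the same tokens joined by single spaces
as_string = toktok.tokenize(sentence, return_str=True)
print(as_string)
```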

In [171]:
tokenizer = nltk.tokenize.ToktokTokenizer()

print(tokenizer.tokenize(data, return_str = True))

blog codeup home cloud administration full stack web development data science financial aid events military hire tech talent alumni resources student reviews blog common questions hire tech podcast apply now about behind the billboards careers programs cloud administration web development data science financing events san antonio dallas military campuses san antonio dallas houston hire employers alumni portal resources student reviews blog common questions hire tech podcast apply now codeup news articles search for from bootcamp to bootcamp a military appreciation panel in honor of military appreciation month join us for a discussion with codeup alumni who are also military veterans we will chat about their experiences attending a coding bootcamp and how their military training set them up for success here at codeup grab your read more our acquisition of the rackspace cloud academy one year later just about a year ago on april 16th 2021 we announced our acquisition of the rackspace clo

## Stemming and Lemmatization


- __Stemming__ - reducing a word to its base form (the stem), usually by stripping suffixes. For example, "calls", "called", and "calling" all share the base stem "call".


- __Lemmatization__ - very similar to stemming, except that the base form is a root word (the lemma) rather than a root stem. The lemma is always a lexicographically correct word (present in the dictionary), whereas a root stem may not be.
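
A short sketch of the contrast, using four words that appear in the scraped text. The lemmatizer part assumes the WordNet data has already been downloaded with nltk.download('wordnet'). If it has not, the example prints a reminder instead of failing.

```python
from nltk.stem import PorterStemmer, WordNetLemmatizer

sample_words = ["academy", "military", "science", "resources"]

# Stems are produced by suffix rules, so they need not be dictionary words
stemmer = PorterStemmer()
sample_stems = [stemmer.stem(w) for w in sample_words]
print(sample_stems)

# Lemmas come from a dictionary lookup, which requires the WordNet corpus
try:
    lemmatizer = WordNetLemmatizer()
    sample_lemmas = [lemmatizer.lemmatize(w) for w in sample_words]
    print(sample_lemmas)
except LookupError:
    print("WordNet data not found; run nltk.download('wordnet') first")
```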

In [178]:
# Create the nltk stemmer object, then use it

ps = nltk.porter.PorterStemmer()

ps.stem('call'), ps.stem('called'), ps.stem('calling')

('call', 'call', 'call')

#### Stemming

In [180]:
# Apply this stemming transformation to every word in the cleaned text.

stems = [ps.stem(word) for word in data.split()]
data_stemmed = ' '.join(stems)
print(data_stemmed)

blog codeup home cloud administr full stack web develop data scienc financi aid event militari hire tech talent alumni resourc student review blog common question hire tech podcast appli now about behind the billboard career program cloud administr web develop data scienc financ event san antonio dalla militari campus san antonio dalla houston hire employ alumni portal resourc student review blog common question hire tech podcast appli now codeup news articl search for from bootcamp to bootcamp a militari appreci panel in honor of militari appreci month join us for a discuss with codeup alumni who are also militari veteran we will chat about their experi attend a code bootcamp and how their militari train set them up for success here at codeup grab your read more our acquisit of the rackspac cloud academi one year later just about a year ago on april 16th 2021 we announc our acquisit of the rackspac cloud academi for a short time after the acquisit it wa rebrand as the codeup cloud aca

In [184]:
pd.Series(stems).value_counts().head()

the      69
cooki    41
to       39
for      30
of       28
dtype: int64

#### Lemmatization

In [193]:
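import nltk
# This download fetches the stopword list used in the next section.
# WordNetLemmatizer itself needs the 'wordnet' corpus: nltk.download('wordnet')
nltk.download('stopwords')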

wnl = nltk.stem.WordNetLemmatizer()

lemmas = [wnl.lemmatize(word) for word in data.split()]
data_lemmatized = ' '.join(lemmas)

print(data_lemmatized)

blog codeup home cloud administration full stack web development data science financial aid event military hire tech talent alumnus resource student review blog common question hire tech podcast apply now about behind the billboard career program cloud administration web development data science financing event san antonio dallas military campus san antonio dallas houston hire employer alumnus portal resource student review blog common question hire tech podcast apply now codeup news article search for from bootcamp to bootcamp a military appreciation panel in honor of military appreciation month join u for a discussion with codeup alumnus who are also military veteran we will chat about their experience attending a coding bootcamp and how their military training set them up for success here at codeup grab your read more our acquisition of the rackspace cloud academy one year later just about a year ago on april 16th 2021 we announced our acquisition of the rackspace cloud academy for 

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/stephenkipkurui/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [191]:
# Now that we have a list of the lemmas, we can take a look at the most frequent words.

pd.Series(lemmas).value_counts()[:10]

the       69
to        39
a         32
for       30
of        28
in        27
cooky     24
read      22
codeup    21
are       20
dtype: int64

### Remove Stopwords

Words that have __little or no significance__, especially when constructing meaningful features from text, are known as __stop words (or stopwords)__. These are usually the words with the highest counts when you compute a simple term or word frequency over a corpus. Typically, they are articles, conjunctions, prepositions and so on. Some examples of stopwords: a, an, the, and the like.
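
To make the idea concrete, here is a minimal sketch using only the standard library. The tiny stopword set and the sentence are made up for illustration. The real list comes from nltk's stopwords corpus.

```python
from collections import Counter

# Hypothetical, hand-picked stopword set (not nltk's list)
tiny_stopwords = {"a", "an", "the", "and", "of", "to"}

sentence = "the cat and the dog went to the park"
sample_words = sentence.split()

# Before filtering, a stopword tops the frequency count
most_common_before = Counter(sample_words).most_common(1)
print(most_common_before)

# After filtering, only the content words remain
content_words = [w for w in sample_words if w not in tiny_stopwords]
print(content_words)
```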

In [194]:
stopword_list = stopwords.words('english')

# Take 'no' and 'not' off the stopword list so they stay in the text (they carry negation)
stopword_list.remove('no')
stopword_list.remove('not')

stopword_list[:10]


['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're"]

In [198]:
words = data.split()
filtered_words = [w for w in words if w not in stopword_list]

print('Removed {} stopwords'.format(len(words) - len(filtered_words)))
print('------------------------')

data_without_stopwords = ' '.join(filtered_words)

print(data_without_stopwords)


Removed 545 stopwords
------------------------
blog codeup home cloud administration full stack web development data science financial aid events military hire tech talent alumni resources student reviews blog common questions hire tech podcast apply behind billboards careers programs cloud administration web development data science financing events san antonio dallas military campuses san antonio dallas houston hire employers alumni portal resources student reviews blog common questions hire tech podcast apply codeup news articles search bootcamp bootcamp military appreciation panel honor military appreciation month join us discussion codeup alumni also military veterans chat experiences attending coding bootcamp military training set success codeup grab read acquisition rackspace cloud academy one year later year ago april 16th 2021 announced acquisition rackspace cloud academy short time acquisition rebranded codeup cloud academy fulltime part codeup brand read blog read 5 books ev