# Natural Language Processing 

## Prepare Module

#### Key Words:
- __Requests__ to fetch the HTML pages
- __BeautifulSoup__ to pull the data out of the HTML
- __lxml__ to parse the HTML into a structure Python can navigate
- __Pandas__ to manipulate our data, print it, and save it to a file
- __nltk__ (Natural Language Toolkit) to tokenize text and supply stopwords

In [1]:
# Import functions
import requests
from bs4 import BeautifulSoup

from acquire import (
    get_people_web_scrap_data, get_blog_article_urls,
    parse_blog_article, get_blog_articles, parse_news_card,
    parse_news_category, get_news_articles,
)

from prepare import (
    basic_clean, tokenize, stem, lemmatize,
    remove_stopwords, prepare_dataframe,
)

import json
import unicodedata
import re
import os
# nltk: natural language toolkit -> tokenization, stopwords (more on this soon)
import nltk
from nltk.tokenize.toktok import ToktokTokenizer
from nltk.corpus import stopwords

import warnings
warnings.filterwarnings('ignore')

### News DataFrame

In [2]:
# Acquire news df

news_df = get_news_articles()
news_df.head(3)

Unnamed: 0,category,title,content,author,published
0,business,Rupee hits all-time low of 77.42 against US dollar,"The Indian rupee fell to an all-time low of 77.42 against the US dollar on Monday, Reuters reported. Asian markets were lower on Monday as US stock futures fell on fears of more policy tightening from the Federal Reserve and strict lockdown in Shanghai impacting global growth, according to Reuters.",Apaar Sharma,2022-05-09T05:05:31.000Z
1,business,"Bitcoin falls to the lowest level since January, trades below $34,000","Bitcoin fell on Monday to as low as $33,266 in morning trade, nearing January's low of $32,951 as slumping equity markets continued to hurt cryptocurrencies. It then steadied to trade above $33,600. According to BBC, the world's largest cryptocurrency has fallen by 50% since its peak in November 2021.",Pragya Swastik,2022-05-09T09:20:34.000Z
2,business,Made best possible decision: IndiGo on barring differently-abled child from flight,"IndiGo's CEO Ronojoy Dutta said the airline made ""the best possible decision"" by barring a differently-abled teenager and his family from boarding a Ranchi-Hyderabad flight. ""At boarding area, the teenager was visibly in panic...the airport staff, in line with safety guidelines, were forced to make a difficult decision,"" Dutta said. IndiGo offered to purchase an electric wheelchair for the child.",Pragya Swastik,2022-05-09T09:50:34.000Z


In [3]:
news_df.content

0     The Indian rupee fell to an all-time low of 77...
1     Bitcoin fell on Monday to as low as $33,266 in...
2     IndiGo's CEO Ronojoy Dutta said the airline ma...
3     The Indian rupee weakened further on Monday to...
4     LIC's IPO, India's biggest IPO which opened on...
                            ...                        
94    Actor Kunal Kemmu has reacted to Saif Ali Khan...
95    During a speech at an event, actor Mahesh Babu...
96    Television actress Mahi Vij in an interview ha...
97    Actress-choreographer Mohena Kumari Singh took...
98    Yash starrer-'KGF: Chapter 2' had its screenin...
Name: content, Length: 99, dtype: object

#### Rename content to original

In [13]:
news_df.rename(columns={'content': 'original'}, inplace=True)

In [5]:
news_df.head(2)

Unnamed: 0,category,title,original,author,published
0,business,Rupee hits all-time low of 77.42 against US dollar,"The Indian rupee fell to an all-time low of 77.42 against the US dollar on Monday, Reuters reported. Asian markets were lower on Monday as US stock futures fell on fears of more policy tightening from the Federal Reserve and strict lockdown in Shanghai impacting global growth, according to Reuters.",Apaar Sharma,2022-05-09T05:05:31.000Z
1,business,"Bitcoin falls to the lowest level since January, trades below $34,000","Bitcoin fell on Monday to as low as $33,266 in morning trade, nearing January's low of $32,951 as slumping equity markets continued to hurt cryptocurrencies. It then steadied to trade above $33,600. According to BBC, the world's largest cryptocurrency has fallen by 50% since its peak in November 2021.",Pragya Swastik,2022-05-09T09:20:34.000Z


In [6]:
prepare_dataframe(news_df, 'original', extra_words=['ha'], exclude_words=['no']).head(3)

Unnamed: 0,title,original,clean,stemmed,lemmatized
0,Rupee hits all-time low of 77.42 against US dollar,"The Indian rupee fell to an all-time low of 77.42 against the US dollar on Monday, Reuters reported. Asian markets were lower on Monday as US stock futures fell on fears of more policy tightening from the Federal Reserve and strict lockdown in Shanghai impacting global growth, according to Reuters.",indian rupee fell alltime low 7742 us dollar monday reuters reported asian markets lower monday us stock futures fell fears policy tightening federal reserve strict lockdown shanghai impacting global growth according reuters,indian rupe fell alltim low 7742 us dollar monday reuter report asian market lower monday us stock futur fell fear polici tighten feder reserv strict lockdown shanghai impact global growth accord reuter,indian rupee fell alltime low 7742 u dollar monday reuters reported asian market lower monday u stock future fell fear policy tightening federal reserve strict lockdown shanghai impacting global growth according reuters
1,"Bitcoin falls to the lowest level since January, trades below $34,000","Bitcoin fell on Monday to as low as $33,266 in morning trade, nearing January's low of $32,951 as slumping equity markets continued to hurt cryptocurrencies. It then steadied to trade above $33,600. According to BBC, the world's largest cryptocurrency has fallen by 50% since its peak in November 2021.",bitcoin fell monday low 33266 morning trade nearing january ' low 32951 slumping equity markets continued hurt cryptocurrencies steadied trade 33600 according bbc world ' largest cryptocurrency fallen 50 since peak november 2021,bitcoin fell monday low 33266 morn trade near januari ' low 32951 slump equiti market continu hurt cryptocurr steadi trade 33600 accord bbc world ' largest cryptocurr fallen 50 sinc peak novemb 2021,bitcoin fell monday low 33266 morning trade nearing january ' low 32951 slumping equity market continued hurt cryptocurrencies steadied trade 33600 according bbc world ' largest cryptocurrency fallen 50 since peak november 2021
2,Made best possible decision: IndiGo on barring differently-abled child from flight,"IndiGo's CEO Ronojoy Dutta said the airline made ""the best possible decision"" by barring a differently-abled teenager and his family from boarding a Ranchi-Hyderabad flight. ""At boarding area, the teenager was visibly in panic...the airport staff, in line with safety guidelines, were forced to make a difficult decision,"" Dutta said. IndiGo offered to purchase an electric wheelchair for the child.",indigo ' ceo ronojoy dutta said airline made best possible decision barring differentlyabled teenager family boarding ranchihyderabad flight boarding area teenager visibly panicthe airport staff line safety guidelines forced make difficult decision dutta said indigo offered purchase electric wheelchair child,indigo ' ceo ronojoy dutta said airlin made best possibl decis bar differently teenag famili board ranchihyderabad flight board area teenag visibl panicth airport staff line safeti guidelin forc make difficult decis dutta said indigo offer purchas electr wheelchair child,indigo ' ceo ronojoy dutta said airline made best possible decision barring differentlyabled teenager family boarding ranchihyderabad flight boarding area teenager visibly panicthe airport staff line safety guideline forced make difficult decision dutta said indigo offered purchase electric wheelchair child
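
The `prepare` module itself is not shown here, so the following is only a minimal sketch of what `basic_clean` and `remove_stopwords` might look like, inferred from the output above (for example, `all-time` becoming `alltime` and `77.42` becoming `7742`). The function bodies, the regular expression, and the tiny `STOPWORDS` set are assumptions for illustration; the real module presumably draws its stopwords from `nltk.corpus.stopwords`.

```python
import re
import unicodedata

# Hypothetical stand-in for nltk's English stopword list, kept tiny so the
# sketch runs without downloading any corpus.
STOPWORDS = {"the", "to", "an", "of", "on", "as", "has", "no", "a", "in"}


def basic_clean(text):
    """Lowercase, strip non-ASCII characters, and drop most punctuation."""
    text = text.lower()
    # Decompose accented characters, then discard anything outside ASCII.
    text = (
        unicodedata.normalize("NFKD", text)
        .encode("ascii", "ignore")
        .decode("utf-8")
    )
    # Keep only letters, digits, apostrophes, and whitespace.
    return re.sub(r"[^a-z0-9'\s]", "", text)


def remove_stopwords(text, extra_words=(), exclude_words=()):
    """Remove stopwords; extra_words are also removed, exclude_words are kept."""
    stopword_set = (STOPWORDS | set(extra_words)) - set(exclude_words)
    kept = [word for word in text.split() if word not in stopword_set]
    return " ".join(kept)


cleaned = basic_clean("The Indian rupee fell to an all-time low of 77.42")
print(cleaned)
print(remove_stopwords(cleaned))
print(remove_stopwords("there ha been no change",
                       extra_words=["ha"], exclude_words=["no"]))
```

Under this reading, a plausible reason for passing `extra_words=['ha']` is to drop the token `ha` that lemmatization can produce from `has` (much as `us` becomes `u` in the lemmatized column above), while `exclude_words=['no']` keeps the negation `no` in the text.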


### CodeUp Blog DataFrame

#### Get Blog Data

In [12]:
# Acquire blog df

codeup_df = get_blog_articles()
codeup_df.head()

Unnamed: 0,title,published,content
0,,,
1,,,
2,,,
3,,,
4,,,


In [16]:
codeup_df.rename(columns={'content': 'original'}, inplace=True)
codeup_df

Unnamed: 0,title,published,original
0,,,
1,,,
2,,,
3,,,
4,,,
...,...,...,...
10,,,
11,,,
12,,,
13,,,


#### Prepare CodeUp Data

In [17]:
prepare_dataframe(codeup_df, 'original', extra_words=['ha'], exclude_words=['no']).head()

TypeError: expected string or bytes-like object
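
The `TypeError` above is what `re` functions raise when they receive a non-string, which is consistent with the empty `original` values shown in the blog DataFrame (pandas represents missing values as `NaN`, a float). The following is a minimal sketch of that failure and one possible guard; `strip_punctuation` and `clean_or_empty` are hypothetical helpers and not part of the `prepare` module.

```python
import re

# Stand-ins for a column with missing values; float("nan") mimics what pandas
# stores for an empty cell.
values = ["Codeup blog post!", float("nan"), None]


def strip_punctuation(text):
    """Remove everything except lowercase letters, digits, and whitespace."""
    return re.sub(r"[^a-z0-9\s]", "", text.lower())


def clean_or_empty(value):
    """Hypothetical guard: clean strings, map anything else to ''."""
    if not isinstance(value, str):
        return ""
    return strip_punctuation(value)


# Passing a non-string straight to re.sub reproduces the error class.
try:
    re.sub(r"[^a-z0-9\s]", "", float("nan"))
except TypeError as error:
    print(f"TypeError: {error}")

print([clean_or_empty(value) for value in values])
```

With a DataFrame, the analogous step would be to drop or fill the missing rows before preparing, for example with `codeup_df.dropna(subset=['original'])`. Because every row shown above is empty, though, the more fundamental fix is probably in the acquisition step.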

### Questions:

- If your corpus is 493 KB, would you prefer to use stemmed or lemmatized text?
- If your corpus is 25 MB, would you prefer to use stemmed or lemmatized text?
- If your corpus is 200 TB of text and you're charged by the megabyte for your hosted computational resources, would you prefer to use stemmed or lemmatized text?