# NLP Data Preparation

## Exercises

The end result of this exercise should be a file named prepare.py that defines the requested functions.

In this exercise we will define some functions to prepare textual data. These functions should apply equally well to both the Codeup blog articles and the news articles that were previously acquired.

### Imports

In [1]:
import unicodedata
import re
import json

import nltk
from nltk.tokenize.toktok import ToktokTokenizer
from nltk.corpus import stopwords

import pandas as pd

import acquire as a

### Acquire

In [3]:
news_df = a.get_all_news_articles(a.categories)
news_df.head()



| Unnamed: 0 | title | content | category |
|---|---|---|---|
| 0 | Air India pilots demand vaccination on priorit... | Indian Commercial Pilots Association (ICPA) on... | business |
| 1 | India underestimated the coronavirus: Raghuram... | Speaking about India's second COVID-19 wave, f... | business |
| 2 | South Korea's richest woman gets fortune worth... | South Korea’s richest woman Hong Ra-hee added ... | business |
| 3 | World's biggest jeweller says it will no longe... | Pandora, the world's biggest jeweller, has sai... | business |
| 4 | India announced triumph over COVID-19 early: U... | Confederation of Indian Industry (CII) Preside... | business |


### Prep

### 1. Define a function named basic_clean. It should take in a string and apply some basic text cleaning to it:
- Lowercase everything
- Normalize unicode characters
- Remove anything that is not a letter, number, whitespace or a single quote.

In [13]:
#function to clean
def basic_clean(text):
    """
    This function takes in one argument (text) and will apply
    some basic text cleaning to it:
    1. lowercase everything
    2. normalize unicode characters
    3. remove anything that is not a letter, number, whitespace,
    or a single quote
    """
    lowercase = text.lower()
    # decompose accented characters, then drop anything that is not ASCII
    normalize = unicodedata.normalize('NFKD', lowercase)\
        .encode('ascii', 'ignore')\
        .decode('utf-8', 'ignore')
    # delete every character that is not a-z, 0-9, a single quote, or whitespace
    clean_text = re.sub(r"[^a-z0-9'\s]", '', normalize)
    return clean_text

In [17]:
#make sure function works
text = news_df.content[1]
clean = basic_clean(text)
clean

"speaking about india's second covid19 wave former rbi governor raghuram rajan said i think what went wrong was simply thatwe underestimated the virus and its ability to adapt after the first wave there was a sense that we had endured the worstand we had come through and it was time to open up and that complacency hurt us he added"

In [4]:
#grab some text to practice with
text = news_df.content[2]
text

"South Korea’s richest woman Hong Ra-hee added another $7 billion (nearly ₹51,700 crore) worth of assets to her wealth. She received the amount in stocks after the transfer of her late husband and Samsung Group’s ex-Chairman Lee Kun-hee's assets. The 75-year-old inherited 83 million shares in Samsung Electronics, making her the largest individual shareholder in Samsung with a 2.3% stake."

In [5]:
#lowercase the text
lowercase = text.lower()
lowercase

"south korea’s richest woman hong ra-hee added another $7 billion (nearly ₹51,700 crore) worth of assets to her wealth. she received the amount in stocks after the transfer of her late husband and samsung group’s ex-chairman lee kun-hee's assets. the 75-year-old inherited 83 million shares in samsung electronics, making her the largest individual shareholder in samsung with a 2.3% stake."

In [10]:
#normalize: Removing accented characters or non-ASCII characters
normalize = unicodedata.normalize('NFKD', lowercase)\
.encode('ascii', 'ignore')\
.decode('utf-8', 'ignore')

In [11]:
normalize

"south koreas richest woman hong ra-hee added another $7 billion (nearly 51,700 crore) worth of assets to her wealth. she received the amount in stocks after the transfer of her late husband and samsung groups ex-chairman lee kun-hee's assets. the 75-year-old inherited 83 million shares in samsung electronics, making her the largest individual shareholder in samsung with a 2.3% stake."

In [12]:
#remove special characters
remove_special = re.sub(r"[^a-z0-9'\s]", '', normalize)
remove_special

"south koreas richest woman hong rahee added another 7 billion nearly 51700 crore worth of assets to her wealth she received the amount in stocks after the transfer of her late husband and samsung groups exchairman lee kunhee's assets the 75yearold inherited 83 million shares in samsung electronics making her the largest individual shareholder in samsung with a 23 stake"

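The walk-through above shows two side effects of this cleaning: the typographic apostrophe in "korea’s" is dropped by the ASCII round trip (giving "koreas"), and deleting hyphens fuses words ("ra-hee" becomes "rahee", "75-year-old" becomes "75yearold"). The following is a minimal sketch of one possible variant that keeps word boundaries. The name `basic_clean_keep_boundaries` is hypothetical and not part of the exercise, and replacing characters with a space goes beyond what the exercise asks for.

```python
import re
import unicodedata


def basic_clean_keep_boundaries(text):
    """Variant of basic_clean (hypothetical) that preserves word boundaries."""
    text = text.lower()
    # Map typographic single quotes to the ASCII apostrophe so the
    # ASCII round trip below does not silently drop them.
    text = text.replace("\u2019", "'").replace("\u2018", "'")
    text = (
        unicodedata.normalize("NFKD", text)
        .encode("ascii", "ignore")
        .decode("utf-8", "ignore")
    )
    # Replace disallowed characters with a space instead of deleting them,
    # so hyphenated words are split rather than fused.
    text = re.sub(r"[^a-z0-9'\s]", " ", text)
    # Collapse the runs of whitespace that the substitution can leave behind.
    return re.sub(r"\s+", " ", text).strip()
```

The trade-off is that numbers are split as well: "2.3%" would come out as "2 3" rather than "23", so whether this variant is preferable depends on how the text is used downstream.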
### 2. Define a function named tokenize. It should take in a string and tokenize all the words in the string.

In [21]:
#function to tokenize
def tokenize(text):
    """
    This function will take in one argument (text), clean it with
    basic_clean, and return a list of the word tokens in the text.
    """
    # ToktokTokenizer is already imported at the top of the notebook
    tokenizer = ToktokTokenizer()
    tokens = tokenizer.tokenize(basic_clean(text))

    return tokens
    

In [15]:
#create tokenizer
tokenizer = nltk.tokenize.ToktokTokenizer()


In [18]:
# Use the tokenizer
tokenizer.tokenize(clean)

['speaking',
 'about',
 'india',
 "'",
 's',
 'second',
 'covid19',
 'wave',
 'former',
 'rbi',
 'governor',
 'raghuram',
 'rajan',
 'said',
 'i',
 'think',
 'what',
 'went',
 'wrong',
 'was',
 'simply',
 'thatwe',
 'underestimated',
 'the',
 'virus',
 'and',
 'its',
 'ability',
 'to',
 'adapt',
 'after',
 'the',
 'first',
 'wave',
 'there',
 'was',
 'a',
 'sense',
 'that',
 'we',
 'had',
 'endured',
 'the',
 'worstand',
 'we',
 'had',
 'come',
 'through',
 'and',
 'it',
 'was',
 'time',
 'to',
 'open',
 'up',
 'and',
 'that',
 'complacency',
 'hurt',
 'us',
 'he',
 'added']

In [20]:
#tokens = tokenizer.tokenize(basic_clean(text))
#tokens

In [22]:
#make sure function works
text = news_df.content[0]
tokenize(text)

['indian',
 'commercial',
 'pilots',
 'association',
 'icpa',
 'on',
 'tuesday',
 'said',
 'if',
 'air',
 'india',
 'fails',
 'to',
 'set',
 'up',
 'vaccination',
 'camps',
 'on',
 'a',
 'panindia',
 'basis',
 'forflying',
 'crew',
 'above',
 'the',
 'age',
 'of',
 '18',
 'years',
 'on',
 'priority',
 'we',
 "'",
 'll',
 'stop',
 'work',
 'in',
 'a',
 'letter',
 'to',
 'the',
 'airline',
 "'",
 's',
 'management',
 'icpa',
 'added',
 'with',
 'no',
 'healthcare',
 'supportno',
 'insurancewe',
 "'",
 're',
 'in',
 'no',
 'position',
 'to',
 'continue',
 'risking',
 'lives',
 'of',
 'our',
 'pilots',
 'without',
 'vaccination']

### 3. Define a function named stem. It should accept some text and return the text after applying stemming to all the words.

In [None]:
#function to stem
def stem(text):
    pass  # TODO: not yet implemented
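The `stem` function above has no implementation yet. One possible way to fill it in is sketched below, assuming NLTK's `PorterStemmer` is an acceptable stemmer (the exercise does not name one). The optional `stemmer` parameter is an addition made here so the function can be tried with a stand-in object; it is not required by the exercise.

```python
def stem(text, stemmer=None):
    """Return `text` with every whitespace-separated word replaced by its stem."""
    if stemmer is None:
        # Assumption: the Porter stemmer is used by default. It is imported
        # lazily so a stand-in stemmer can be supplied without NLTK installed.
        from nltk.stem import PorterStemmer
        stemmer = PorterStemmer()
    # Stem each word, then join the stems back into a single string.
    return " ".join(stemmer.stem(word) for word in text.split())
```

Because the function splits on whitespace, it expects text that has already been cleaned, for example the output of `basic_clean`.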

### 4. Define a function named lemmatize. It should accept some text and return the text after applying lemmatization to each word.

In [None]:
#function to lemmatize
def lemmatize(text):
    pass  # TODO: not yet implemented
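The `lemmatize` function above is likewise still empty. A minimal sketch follows, assuming NLTK's `WordNetLemmatizer` is used (the exercise does not prescribe a lemmatizer) and that the WordNet data has been fetched once with `nltk.download('wordnet')`. The optional `lemmatizer` parameter is a hypothetical addition for swapping in another object with a `lemmatize(word)` method.

```python
def lemmatize(text, lemmatizer=None):
    """Return `text` with every whitespace-separated word replaced by its lemma."""
    if lemmatizer is None:
        # Assumption: WordNet-based lemmatization by default. Imported lazily;
        # the WordNet corpus must already be downloaded for this to work.
        from nltk.stem import WordNetLemmatizer
        lemmatizer = WordNetLemmatizer()
    # Lemmatize each word, then join the lemmas back into a single string.
    return " ".join(lemmatizer.lemmatize(word) for word in text.split())
```

Called without a part-of-speech tag, `WordNetLemmatizer.lemmatize` treats each word as a noun, so verb forms may pass through unchanged.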

### 5. Define a function named remove_stopwords. It should accept some text and return the text after removing all the stopwords.
- This function should define two optional parameters, extra_words and exclude_words. These parameters should define any additional stop words to include, and any words that we don't want to remove.

In [None]:
#function to remove stop words
def remove_stopwords(text, extra_words=None, exclude_words=None):
    pass  # TODO: not yet implemented
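The exercise asks for two optional parameters, `extra_words` and `exclude_words`. A minimal sketch of how they could be handled is shown below, assuming NLTK's English stopword list as the starting point (available after `nltk.download('stopwords')`). The `stopword_list` parameter is a hypothetical addition that lets a caller supply a different base list.

```python
def remove_stopwords(text, extra_words=None, exclude_words=None, stopword_list=None):
    """Return `text` with stopwords removed.

    extra_words: additional words to treat as stopwords.
    exclude_words: words that should be kept even if they are stopwords.
    """
    if stopword_list is None:
        # Assumption: the NLTK English list is the default. Imported lazily;
        # the stopwords corpus must already be downloaded.
        from nltk.corpus import stopwords
        stopword_list = stopwords.words("english")
    # Build the final stopword set: base list, plus extras, minus exclusions.
    stop_set = set(stopword_list)
    stop_set |= set(extra_words or [])
    stop_set -= set(exclude_words or [])
    # Keep only the words that are not in the stopword set.
    return " ".join(word for word in text.split() if word not in stop_set)
```

Using `None` as the default for the two list parameters avoids sharing a mutable default list between calls.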

### 6. Use your data from the acquire step to produce a dataframe of the news articles. Name the dataframe news_df.

### 7. Make another dataframe for the Codeup blog posts. Name the dataframe codeup_df.

### 8. For each dataframe, produce the following columns:
- title to hold the title
- original to hold the original article/post content
- clean to hold the normalized and tokenized original with the stopwords removed.
- stemmed to hold the stemmed version of the cleaned data.
- lemmatized to hold the lemmatized version of the cleaned data.
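
The column layout above can be sketched as a small helper. The name `prep_article_data` and its parameters are hypothetical. The preparation functions are passed in as arguments so the sketch does not depend on any particular implementation of them, and the source column is assumed to be named `content`, as in the `news_df` output shown earlier.

```python
import pandas as pd


def prep_article_data(df, clean_fn, stem_fn, lemmatize_fn, content_col="content"):
    """Return a new dataframe with title, original, clean, stemmed, lemmatized."""
    # Keep only the title and the text, renaming the text column to 'original'.
    # rename() returns a new dataframe, so the input is left unmodified.
    out = df[["title", content_col]].rename(columns={content_col: "original"})
    # 'clean' is derived from the original text.
    out["clean"] = out["original"].apply(clean_fn)
    # Both derived columns are computed from the cleaned text.
    out["stemmed"] = out["clean"].apply(stem_fn)
    out["lemmatized"] = out["clean"].apply(lemmatize_fn)
    return out
```

With the functions from this notebook, `clean_fn` might be something like `lambda s: remove_stopwords(' '.join(tokenize(s)))`, since `tokenize` as defined above already calls `basic_clean` and returns a list of tokens.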

### 9. If your corpus is 493KB, would you prefer to use stemmed or lemmatized text?
- If your corpus is 25MB, would you prefer to use stemmed or lemmatized text?
- If your corpus is 200TB of text and you're charged by the megabyte for your hosted computational resources, would you prefer to use stemmed or lemmatized text?