# NLP Data Preparation

## Exercises

The end result of this exercise should be a file named prepare.py that defines the requested functions.

In this exercise we will be defining some functions to prepare textual data. These functions should apply equally well to both the codeup blog articles and the news articles that were previously acquired.

### Imports

In [1]:
import unicodedata
import re
import json

import nltk
from nltk.tokenize.toktok import ToktokTokenizer
from nltk.corpus import stopwords

import pandas as pd

import acquire as a

### Acquire

In [2]:
news_df = a.get_all_news_articles(a.categories)
news_df.head()



  soup = BeautifulSoup(response.text)


Unnamed: 0,title,content,category
0,India underestimated the coronavirus: Raghuram...,"Speaking about India's second COVID-19 wave, f...",business
1,Air India pilots demand vaccination on priorit...,Indian Commercial Pilots Association (ICPA) on...,business
2,World's biggest jeweller says it will no longe...,"Pandora, the world's biggest jeweller, has sai...",business
3,South Korea's richest woman gets fortune worth...,South Korea’s richest woman Hong Ra-hee added ...,business
4,"Will supply 11 cr doses to states, pvt hospita...",Serum Institute of India (SII) CEO Adar Poonaw...,business


### Prep

### 1. Define a function named basic_clean. It should take in a string and apply some basic text cleaning to it:
- Lowercase everything
- Normalize unicode characters
- Replace anything that is not a letter, number, whitespace or a single quote.

In [3]:
#function to clean
def basic_clean(string):
    """
    This function takes in one argument (string) and will apply
    some basic text cleaning to it:
    1. lowercase everything
    2. normalize unicode characters
    3. replace anything that is not a letter, number, whitespace,
    or a single quote
    """
    lowercase = string.lower()
    normalize = unicodedata.normalize('NFKD', lowercase)\
    .encode('ascii', 'ignore')\
    .decode('utf-8', 'ignore')
    remove_special = re.sub(r"[^a-z0-9'\s]", '', normalize)
    clean_string = remove_special
    return clean_string

In [4]:
#make sure function works
string = news_df.content[1]
clean = basic_clean(string)
clean

"indian commercial pilots association icpa on tuesday said if air india fails to set up vaccination camps on a panindia basis forflying crew above the age of 18 years on priority we'll stop work in a letter to the airline's management icpa added with no healthcare supportno insurancewe're in no position to continue risking lives of our pilots without vaccination"

In [5]:
#grab some text to practice with
string = news_df.content[2]
string

'Pandora, the world\'s biggest jeweller, has said that it\'ll stop using mined diamonds and focus on laboratory-made ones, which are affordable and sustainable. "Diamonds are not only forever, but for everyone," Pandora CEO Alexander Lacik said. This comes amid growing demand for alternatives to mined diamonds amid concerns over unethical practices in mining industry, including human rights abuses.'

In [6]:
#lowercase the text
lowercase = string.lower()
lowercase

'pandora, the world\'s biggest jeweller, has said that it\'ll stop using mined diamonds and focus on laboratory-made ones, which are affordable and sustainable. "diamonds are not only forever, but for everyone," pandora ceo alexander lacik said. this comes amid growing demand for alternatives to mined diamonds amid concerns over unethical practices in mining industry, including human rights abuses.'

In [7]:
#normalize: Removing accented characters or non-ASCII characters
normalize = unicodedata.normalize('NFKD', lowercase)\
.encode('ascii', 'ignore')\
.decode('utf-8', 'ignore')

In [8]:
normalize

'pandora, the world\'s biggest jeweller, has said that it\'ll stop using mined diamonds and focus on laboratory-made ones, which are affordable and sustainable. "diamonds are not only forever, but for everyone," pandora ceo alexander lacik said. this comes amid growing demand for alternatives to mined diamonds amid concerns over unethical practices in mining industry, including human rights abuses.'

In [9]:
#remove special characters
remove_special = re.sub(r"[^a-z0-9'\s]", '', normalize)
remove_special

"pandora the world's biggest jeweller has said that it'll stop using mined diamonds and focus on laboratorymade ones which are affordable and sustainable diamonds are not only forever but for everyone pandora ceo alexander lacik said this comes amid growing demand for alternatives to mined diamonds amid concerns over unethical practices in mining industry including human rights abuses"

### 2. Define a function named tokenize. It should take in a string and tokenize all the words in the string.

In [10]:
#function to tokenize
def tokenize(string):
    """
    This function will take in one argument(string) and will
    tokenize all words in the string.
    """
    #create tokenizer
    tokenizer = nltk.tokenize.ToktokTokenizer()
    #use tokenizer
    tokens = tokenizer.tokenize(basic_clean(string), return_str = True)
    
    return tokens
    

In [45]:
#make sure function works
string = news_df.content[0]
tokenize(string)

"indian commercial pilots association icpa on tuesday said if air india fails to set up vaccination camps on a panindia basis forflying crew above the age of 18 years on priority we ' ll stop work in a letter to the airline ' s management icpa added with no healthcare supportno insurancewe ' re in no position to continue risking lives of our pilots without vaccination"

In [44]:
#create tokenizer
tokenizer = nltk.tokenize.ToktokTokenizer()


In [12]:
# Use the tokenizer
tokenizer.tokenize(clean)

['indian',
 'commercial',
 'pilots',
 'association',
 'icpa',
 'on',
 'tuesday',
 'said',
 'if',
 'air',
 'india',
 'fails',
 'to',
 'set',
 'up',
 'vaccination',
 'camps',
 'on',
 'a',
 'panindia',
 'basis',
 'forflying',
 'crew',
 'above',
 'the',
 'age',
 'of',
 '18',
 'years',
 'on',
 'priority',
 'we',
 "'",
 'll',
 'stop',
 'work',
 'in',
 'a',
 'letter',
 'to',
 'the',
 'airline',
 "'",
 's',
 'management',
 'icpa',
 'added',
 'with',
 'no',
 'healthcare',
 'supportno',
 'insurancewe',
 "'",
 're',
 'in',
 'no',
 'position',
 'to',
 'continue',
 'risking',
 'lives',
 'of',
 'our',
 'pilots',
 'without',
 'vaccination']

In [13]:
#tokens = tokenizer.tokenize(basic_clean(text))
#tokens

In [15]:
string

'Speaking about India\'s second COVID-19 wave, former RBI Governor Raghuram Rajan said, "I think what went wrong was simply [that]...we underestimated the virus and its ability to adapt." After the first wave, "there was a sense that we had endured the worst...and we had come through, and it was time to open up, and that complacency hurt us", he added.'

### 3. Define a function named stem. It should accept some text and return the text after applying stemming to all the words.

In [16]:
#function to stem
def stem(string):
    # Create porter stemmer.
    ps = nltk.porter.PorterStemmer()
    # Apply the stemmer to each word in our string.
    stems = [ps.stem(word) for word in string.split()]
    
    return stems

In [17]:
#test function
stem(clean)

['indian',
 'commerci',
 'pilot',
 'associ',
 'icpa',
 'on',
 'tuesday',
 'said',
 'if',
 'air',
 'india',
 'fail',
 'to',
 'set',
 'up',
 'vaccin',
 'camp',
 'on',
 'a',
 'panindia',
 'basi',
 'forfli',
 'crew',
 'abov',
 'the',
 'age',
 'of',
 '18',
 'year',
 'on',
 'prioriti',
 "we'll",
 'stop',
 'work',
 'in',
 'a',
 'letter',
 'to',
 'the',
 "airline'",
 'manag',
 'icpa',
 'ad',
 'with',
 'no',
 'healthcar',
 'supportno',
 "insurancewe'r",
 'in',
 'no',
 'posit',
 'to',
 'continu',
 'risk',
 'live',
 'of',
 'our',
 'pilot',
 'without',
 'vaccin']

In [18]:
# Create porter stemmer.
ps = nltk.porter.PorterStemmer()

In [19]:
# Apply the stemmer to each word in our string.
#stem it word by word
stems = ps.stem(clean)
stems

"indian commercial pilots association icpa on tuesday said if air india fails to set up vaccination camps on a panindia basis forflying crew above the age of 18 years on priority we'll stop work in a letter to the airline's management icpa added with no healthcare supportno insurancewe're in no position to continue risking lives of our pilots without vaccin"

In [20]:
# Apply the stemmer to each word in our string.
stems = [ps.stem(word) for word in clean.split()]
stems[:10]

['indian',
 'commerci',
 'pilot',
 'associ',
 'icpa',
 'on',
 'tuesday',
 'said',
 'if',
 'air']

In [21]:
# Join our lists of words into a string again
article_stemmed = ' '.join(stems)
article_stemmed

"indian commerci pilot associ icpa on tuesday said if air india fail to set up vaccin camp on a panindia basi forfli crew abov the age of 18 year on prioriti we'll stop work in a letter to the airline' manag icpa ad with no healthcar supportno insurancewe'r in no posit to continu risk live of our pilot without vaccin"

### 4. Define a function named lemmatize. It should accept some text and return the text after applying lemmatization to each word.

In [22]:
#function to lemmatize
def lemmatize(string):
    # Download the first time.
    nltk.download('wordnet')
    # Create the Lemmatizer.
    wnl = nltk.stem.WordNetLemmatizer()
    # Use the lemmatizer on each word in the list of words we created by using split.
    lemmas = [wnl.lemmatize(word) for word in string.split()]
    
    return lemmas

In [23]:
#check funciton
lemmatize(clean)

[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/lorisegovia/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


['indian',
 'commercial',
 'pilot',
 'association',
 'icpa',
 'on',
 'tuesday',
 'said',
 'if',
 'air',
 'india',
 'fails',
 'to',
 'set',
 'up',
 'vaccination',
 'camp',
 'on',
 'a',
 'panindia',
 'basis',
 'forflying',
 'crew',
 'above',
 'the',
 'age',
 'of',
 '18',
 'year',
 'on',
 'priority',
 "we'll",
 'stop',
 'work',
 'in',
 'a',
 'letter',
 'to',
 'the',
 "airline's",
 'management',
 'icpa',
 'added',
 'with',
 'no',
 'healthcare',
 'supportno',
 "insurancewe're",
 'in',
 'no',
 'position',
 'to',
 'continue',
 'risking',
 'life',
 'of',
 'our',
 'pilot',
 'without',
 'vaccination']

In [24]:
# Download the first time.
nltk.download('wordnet') #words that exist in the dictionary

[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/lorisegovia/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [25]:
# Create the Lemmatizer.

wnl = nltk.stem.WordNetLemmatizer()

In [26]:
# Use the lemmatizer on each word in the list of words we created by using split.

lemmas = [wnl.lemmatize(word) for word in string.split()]

### 5. Define a function named remove_stopwords. It should accept some text and return the text after removing all the stopwords.
- This function should define two optional parameters, extra_words and exclude_words. These parameters should define any additional stop words to include, and any words that we don't want to remove.

In [27]:
#function to remove stop words
def remove_stopwords(string, extra_words=[], exclude_words=[]):
    """
    This function will take in three arguments string, extra_words,
    and exclude words.  
    """
    # Create stopword_list.
    stopword_list = stopwords.words('english')
    
    # Remove 'exclude_words' from stopword_list to keep these in my text.
    stopword_list = set(stopword_list) - set(exclude_words)
    
    # Add in 'extra_words' to stopword_list.
    stopword_list = stopword_list.union(set(extra_words))
    
    # Split words in string.
    words = string.split()
    
    # Create a list of words from my string with stopwords removed and assign to variable.
    filtered_words = [word for word in words if word not in stopword_list]
    
    # Join words in the list back into strings and assign to a variable.
    string_without_stopwords = ' '.join(filtered_words)
    
    return string_without_stopwords

### 6. Use your data from the acquire to produce a dataframe of the news articles. Name the dataframe news_df.

In [28]:
news_df = a.get_all_news_articles(a.categories)
news_df.head()



  soup = BeautifulSoup(response.text)


Unnamed: 0,title,content,category
0,Air India pilots demand vaccination on priorit...,Indian Commercial Pilots Association (ICPA) on...,business
1,India underestimated the coronavirus: Raghuram...,"Speaking about India's second COVID-19 wave, f...",business
2,World's biggest jeweller says it will no longe...,"Pandora, the world's biggest jeweller, has sai...",business
3,South Korea's richest woman gets fortune worth...,South Korea’s richest woman Hong Ra-hee added ...,business
4,"Will supply 11 cr doses to states, pvt hospita...",Serum Institute of India (SII) CEO Adar Poonaw...,business


### 7. Make another dataframe for the Codeup blog posts. Name the dataframe codeup_df.

In [29]:
codeup_df = a.acquire_codeup_blog()
codeup_df



  soup = BeautifulSoup(response.text)


Unnamed: 0,title,published_date,blog_image,content
0,Codeup’s Data Science Career Accelerator is Here!,"September 30, 2018",https://codeup.com/wp-content/uploads/2018/10/...,The rumors are true! The time has arrived. Cod...
1,Data Science Myths,"October 31, 2018",https://codeup.com/wp-content/uploads/2018/10/...,By Dimitri Antoniou and Maggie Giust\nData Sci...
2,Data Science VS Data Analytics: What’s The Dif...,"October 17, 2018",https://codeup.com/wp-content/uploads/2018/10/...,"By Dimitri Antoniou\nA week ago, Codeup launch..."
3,10 Tips to Crush It at the SA Tech Job Fair,"August 14, 2018",,SA Tech Job Fair\nThe third bi-annual San Anto...
4,Competitor Bootcamps Are Closing. Is the Model...,"August 14, 2018",,Competitor Bootcamps Are Closing. Is the Model...


### 8. For each dataframe, produce the following columns:
- title to hold the title
- original to hold the original article/post content
- clean to hold the normalized and tokenized original with the stopwords removed.
- stemmed to hold the stemmed version of the cleaned data.
- lemmatized to hold the lemmatized version of the cleaned data.

In [33]:
codeup_df['clean'] = codeup_df.content.apply(basic_clean)

In [34]:
codeup_df.head()

Unnamed: 0,title,published_date,blog_image,content,clean
0,Codeup’s Data Science Career Accelerator is Here!,"September 30, 2018",https://codeup.com/wp-content/uploads/2018/10/...,The rumors are true! The time has arrived. Cod...,the rumors are true the time has arrived codeu...
1,Data Science Myths,"October 31, 2018",https://codeup.com/wp-content/uploads/2018/10/...,By Dimitri Antoniou and Maggie Giust\nData Sci...,by dimitri antoniou and maggie giust\ndata sci...
2,Data Science VS Data Analytics: What’s The Dif...,"October 17, 2018",https://codeup.com/wp-content/uploads/2018/10/...,"By Dimitri Antoniou\nA week ago, Codeup launch...",by dimitri antoniou\na week ago codeup launche...
3,10 Tips to Crush It at the SA Tech Job Fair,"August 14, 2018",,SA Tech Job Fair\nThe third bi-annual San Anto...,sa tech job fair\nthe third biannual san anton...
4,Competitor Bootcamps Are Closing. Is the Model...,"August 14, 2018",,Competitor Bootcamps Are Closing. Is the Model...,competitor bootcamps are closing is the model ...


In [41]:
codeup_df['stemmed'] = codeup_df.content.apply(tokenize).apply(stem)

In [42]:
codeup_df['lemmatized'] = codeup_df.content.apply(tokenize).apply(lemmatize)

[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/lorisegovia/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/lorisegovia/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/lorisegovia/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/lorisegovia/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/lorisegovia/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [43]:
codeup_df.head()

Unnamed: 0,title,published_date,blog_image,content,clean,stemmed,lemmatized
0,Codeup’s Data Science Career Accelerator is Here!,"September 30, 2018",https://codeup.com/wp-content/uploads/2018/10/...,The rumors are true! The time has arrived. Cod...,the rumors are true the time has arrived codeu...,"[the, rumor, are, true, the, time, ha, arriv, ...","[the, rumor, are, true, the, time, ha, arrived..."
1,Data Science Myths,"October 31, 2018",https://codeup.com/wp-content/uploads/2018/10/...,By Dimitri Antoniou and Maggie Giust\nData Sci...,by dimitri antoniou and maggie giust\ndata sci...,"[by, dimitri, antoni, and, maggi, giust, data,...","[by, dimitri, antoniou, and, maggie, giust, da..."
2,Data Science VS Data Analytics: What’s The Dif...,"October 17, 2018",https://codeup.com/wp-content/uploads/2018/10/...,"By Dimitri Antoniou\nA week ago, Codeup launch...",by dimitri antoniou\na week ago codeup launche...,"[by, dimitri, antoni, a, week, ago, codeup, la...","[by, dimitri, antoniou, a, week, ago, codeup, ..."
3,10 Tips to Crush It at the SA Tech Job Fair,"August 14, 2018",,SA Tech Job Fair\nThe third bi-annual San Anto...,sa tech job fair\nthe third biannual san anton...,"[sa, tech, job, fair, the, third, biannual, sa...","[sa, tech, job, fair, the, third, biannual, sa..."
4,Competitor Bootcamps Are Closing. Is the Model...,"August 14, 2018",,Competitor Bootcamps Are Closing. Is the Model...,competitor bootcamps are closing is the model ...,"[competitor, bootcamp, are, close, is, the, mo...","[competitor, bootcamps, are, closing, is, the,..."


In [31]:
#ravi's function
def prep_article_data(df, column, extra_words=[], exclude_words=[]):
    '''
    This function take in a df and the string name for a text column with 
    option to pass lists for extra_words and exclude_words and
    returns a df with the text article title, original text, stemmed text,
    lemmatized text, cleaned, tokenized, & lemmatized text with stopwords removed.
    '''
    df['clean'] = df[column].apply(basic_clean)\
                            .apply(tokenize)\
                            .apply(remove_stopwords, 
                                   extra_words=extra_words, 
                                   exclude_words=exclude_words)
    
    df['stemmed'] = df[column].apply(basic_clean)\
                            .apply(tokenize)\
                            .apply(stem)\
                            .apply(remove_stopwords, 
                                   extra_words=extra_words, 
                                   exclude_words=exclude_words)
    
    df['lemmatized'] = df[column].apply(basic_clean)\
                            .apply(tokenize)\
                            .apply(lemmatize)\
                            .apply(remove_stopwords, 
                                   extra_words=extra_words, 
                                   exclude_words=exclude_words)
    
    return df[['title', column,'clean', 'stemmed', 'lemmatized']]

### 9. Answer the following:
- If your corpus is 493KB, would you prefer to use stemmed or lemmatized text?
- If your corpus is 25MB, would you prefer to use stemmed or lemmatized text?
- If your corpus is 200TB of text and you're charged by the megabyte for your hosted computational resources, would you prefer to use stemmed or lemmatized text?

- You would lemmatize because the dataset is small.
- You would used the stemmed bc it is much bigger. 
- Stemmed to save time.

In [32]:
#matthews all in one
#def clean_stem_stop(string):
   # return remove_stopwords(stem(tokenize(basic_clean(string))))