## Exercises
The end result of this exercise should be a file named `prepare.py` that defines the requested functions.

In this exercise we will be defining some functions to prepare textual data. These functions should apply equally well to both the codeup blog articles and the news articles that were previously acquired.

1. Define a function named basic_clean. It should take in a string and apply some basic text cleaning to it:

- Lowercase everything
- Normalize unicode characters
- Replace anything that is not a letter, number, whitespace or a single quote.

2. Define a function named `tokenize`. It should take in a string and tokenize all the words in the string.

3. Define a function named `stem`. It should accept some text and return the text after applying stemming to all the words.

4. Define a function named `lemmatize`. It should accept some text and return the text after applying lemmatization to each word.

5. Define a function named `remove_stopwords`. It should accept some text and return the text after removing all the stopwords.

    This function should define two optional parameters, extra_words and exclude_words. These parameters should define any additional stop words to include, and any words that we don't want to remove.
    

6. Use your data from the acquire to produce a dataframe of the news articles. Name the dataframe `news_df`.

7. Make another dataframe for the Codeup blog posts. Name the dataframe `codeup_df`.

8. For each dataframe, produce the following columns:
- `title` to hold the title
- `original` to hold the original article/post content
- `clean` to hold the normalized and tokenized original with the stopwords removed.
- `stemmed` to hold the stemmed version of the cleaned data.
- `lemmatized` to hold the lemmatized version of the cleaned data.

Ask yourself:

If your corpus is 493KB, would you prefer to use stemmed or lemmatized text?
If your corpus is 25MB, would you prefer to use stemmed or lemmatized text?
If your corpus is 200TB of text and you're charged by the megabyte for your hosted computational resources, would you prefer to use stemmed or lemmatized text?

In [1]:
import unicodedata
import re
import json

import nltk
from nltk.tokenize.toktok import ToktokTokenizer
from nltk.corpus import stopwords

import pandas as pd

from time import strftime

### 1.

In [2]:
def basic_clean(s):
    # lowercase
    s = s.lower()
    # normalize
    s = unicodedata.normalize('NFKD', s)\
    .encode('ascii', 'ignore')\
    .decode('utf-8', 'ignore')
    # remove special characters
    s = re.sub(r"[^a-z0-9'\s]", '', s)
    return s


### 2.

In [3]:
def tokenize(s):
    tokenizer = nltk.tokenize.ToktokTokenizer()
    return tokenizer.tokenize(s, return_str=True)

### 3.

In [14]:
def stem(s):
    ps = nltk.porter.PorterStemmer()
    stems = [ps.stem(word) for word in s.split()]
    stemmed_s = ' '.join(stems)
    return stemmed_s

### 4.

In [17]:
def lemmatize(s):
    wnl = nltk.stem.WordNetLemmatizer()
    lemmas = [wnl.lemmatize(word) for word in s.split()]
    lemmatized_s = ' '.join(lemmas)

    return lemmatized_s

### 5.

In [6]:
def remove_stopwords(s, extra_words = [], exclude_words = []):
    '''
    Takes a string and removes stopwords.
    Optional arguments: 
    extra_words adds words to stopword list
    exclude_words words to keep
    '''
    stopword_list = stopwords.words('english')
    if len(extra_words) > 0:
        stopword_list.append(word for word in extra_words)
    if len(exclude_words) > 0:
        stopword_list.remove(word for word in exclude_words)
    
    words = s.split()
    filtered_words = [w for w in words if w not in stopword_list]
    s_without_stopwords = ' '.join(filtered_words)
    return s_without_stopwords



### 6.

In [7]:
import acquire

In [8]:
news_df = acquire.acquire_news()
news_df.head()

Unnamed: 0,title,author,content,category
0,RBI cancels licence of Maha-based Independence...,Shalini Ojha,RBI has cancelled licence of Maharashtra-based...,business
1,Boost to EVs a big step: Windmill Capital,Roshan Gupta,"Increased use of EVs in public transport, spec...",business
2,Facebook parent Meta's $230-billion wipeout bi...,Pragya Swastik,Facebook's parent Meta's shares plunged 27% an...,business
3,Facebook's daily active users fall for first t...,Pragya Swastik,Facebook has seen its daily active users (DAUs...,business
4,"Tesla co-worker used N-word, threw a hot tool ...",Kiran Khatri,A former Tesla worker has filed a lawsuit agai...,business


### 7.

In [9]:
codeup_df = acquire.acquire_blogs()
codeup_df.head()

Unnamed: 0,title,content
0,Codeup Dallas Open House - Codeup,\nCome join us for the re-opening of our Dalla...
1,Codeup Helps 40 Grads Land Tech Jobs in Just 1...,\n\n\n\n\n\nOur Placement Team is simply defin...
2,"IT Certifications 101: Why They Matter, and Wh...","\n\n\n\n\n\nAWS, Google, Azure, Red Hat, CompT..."
3,A rise in cyber attacks means opportunities fo...,"\nIn the last few months, the US has experienc..."
4,Use your GI Bill® benefits to Land a Job in Te...,\n\n\n\n\n\nAs the end of military service get...


### 8. For each dataframe, produce the following columns:
- `title` to hold the title
- `original` to hold the original article/post content
- `clean` to hold the normalized and tokenized original with the stopwords removed.
- `stemmed` to hold the stemmed version of the cleaned data.
- `lemmatized` to hold the lemmatized version of the cleaned data.

In [10]:
# original
news_df['original']=news_df.content

In [26]:
# cleaned
clean_articles = []
for article in news_df.content:
    clean_articles.append(remove_stopwords(tokenize(basic_clean(article))))

news_df['clean'] = clean_articles


In [15]:
# stemmed
stemmed_articles = []
for article in news_df.clean:
    stemmed_articles.append(stem(article))

news_df['stemmed'] = stemmed_articles

In [18]:
# lemmatized
lemmatized_articles = []
for article in news_df.clean:
    lemmatized_articles.append(lemmatize(article))

news_df['lemmatized'] = stemmed_articles

Codeup

In [20]:
# original
codeup_df['original']=codeup_df.content

In [25]:
# cleaned
clean_articles = []
for article in codeup_df.content:
    clean_articles.append(remove_stopwords(tokenize(basic_clean(article))))

codeup_df['clean'] = clean_articles


In [22]:
# stemmed
stemmed_articles = []
for article in codeup_df.clean:
    stemmed_articles.append(stem(article))

codeup_df['stemmed'] = stemmed_articles

In [23]:
# lemmatized
lemmatized_articles = []
for article in codeup_df.clean:
    lemmatized_articles.append(lemmatize(article))

codeup_df['lemmatized'] = stemmed_articles

In [24]:
codeup_df

Unnamed: 0,title,content,original,clean,stemmed,lemmatized
0,Codeup Dallas Open House - Codeup,\nCome join us for the re-opening of our Dalla...,\nCome join us for the re-opening of our Dalla...,come join us reopening dallas campus drinks sn...,come join us reopen dalla campu drink snack co...,come join us reopen dalla campu drink snack co...
1,Codeup Helps 40 Grads Land Tech Jobs in Just 1...,\n\n\n\n\n\nOur Placement Team is simply defin...,\n\n\n\n\n\nOur Placement Team is simply defin...,placement team simply defined group manages re...,placement team simpli defin group manag relati...,placement team simpli defin group manag relati...
2,"IT Certifications 101: Why They Matter, and Wh...","\n\n\n\n\n\nAWS, Google, Azure, Red Hat, CompT...","\n\n\n\n\n\nAWS, Google, Azure, Red Hat, CompT...",aws google azure red hat comptiathese big name...,aw googl azur red hat comptiathes big name pro...,aw googl azur red hat comptiathes big name pro...
3,A rise in cyber attacks means opportunities fo...,"\nIn the last few months, the US has experienc...","\nIn the last few months, the US has experienc...",last months us experienced dozens major cybera...,last month us experienc dozen major cyberattac...,last month us experienc dozen major cyberattac...
4,Use your GI Bill® benefits to Land a Job in Te...,\n\n\n\n\n\nAs the end of military service get...,\n\n\n\n\n\nAs the end of military service get...,end military service gets closer many transiti...,end militari servic get closer mani transit se...,end militari servic get closer mani transit se...
5,Which program is right for me: Cyber Security ...,\n\n\n\n\n\nWhat IT Career should I choose?\nI...,\n\n\n\n\n\nWhat IT Career should I choose?\nI...,career choose youre thinking career lot direct...,career choos your think career lot direct coul...,career choos your think career lot direct coul...
6,What the Heck is System Engineering? - Codeup,\n\n\n\n\n\nCodeup offers a 13-week training p...,\n\n\n\n\n\nCodeup offers a 13-week training p...,codeup offers 13week training program systems ...,codeup offer 13week train program system engin...,codeup offer 13week train program system engin...
7,From Speech Pathology to Business Intelligence...,\n\n\n\n\n\nBy: Alicia Gonzalez\nBefore Codeup...,\n\n\n\n\n\nBy: Alicia Gonzalez\nBefore Codeup...,alicia gonzalez codeup home health speechlangu...,alicia gonzalez codeup home health speechlangu...,alicia gonzalez codeup home health speechlangu...
8,Boris - Behind the Billboards - Codeup,\n\n\n,\n\n\n,,,
9,Is Codeup the Best Bootcamp in San Antonio...o...,\n\n\n\n\n\nLooking for the best data science ...,\n\n\n\n\n\nLooking for the best data science ...,looking best data science bootcamp world best ...,look best data scienc bootcamp world best code...,look best data scienc bootcamp world best code...
