# Data Preparation Exercises
***

### In this exercise we will define some functions to prepare textual data. These functions should apply equally well to both the Codeup blog articles and the news articles that were acquired previously.

#### 1. Define a function named basic_clean. It should take in a string and apply some basic text cleaning to it:

- Lowercase everything
- Normalize unicode characters
- Remove anything that is not a letter, number, whitespace, or a single quote

In [31]:
import json
import re
import unicodedata
from time import strftime

import nltk
import pandas as pd
from nltk.corpus import stopwords
from nltk.tokenize.toktok import ToktokTokenizer

import acquire

# WordNet data is required by the lemmatizer used below
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /Users/vfirey/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [2]:
original = acquire.get_blog_articles()
print(original)

                                               title  \
0  Codeup’s Data Science Career Accelerator is Here!   
1                                 Data Science Myths   
2  Data Science VS Data Analytics: What’s The Dif...   
3        10 Tips to Crush It at the SA Tech Job Fair   
4  Competitor Bootcamps Are Closing. Is the Model...   

                                             content  
0  The rumors are true! The time has arrived. Cod...  
1  By Dimitri Antoniou and Maggie Giust\nData Sci...  
2  By Dimitri Antoniou\nA week ago, Codeup launch...  
3  SA Tech Job Fair\nThe third bi-annual San Anto...  
4  Competitor Bootcamps Are Closing. Is the Model...  


In [3]:
def basic_clean(string):
    '''
    Lowercase a string, normalize its unicode characters to ASCII, and
    remove everything except letters, numbers, single quotes, and whitespace.
    '''
    # Lowercase all text in the string
    prepped_string = string.lower()

    # Normalize unicode characters: decompose accented letters, then drop
    # anything that has no ASCII equivalent
    prepped_string = unicodedata.normalize('NFKD', prepped_string)\
        .encode('ascii', 'ignore')\
        .decode('utf-8', 'ignore')

    # Remove anything that is not a through z, a number, a single quote, or whitespace
    prepped_string = re.sub(r"[^a-z0-9'\s]", '', prepped_string)

    return prepped_string

In [12]:
basic_clean(str(original.content))

'0    the rumors are true the time has arrived cod\n1    by dimitri antoniou and maggie giustndata sci\n2    by dimitri antoniouna week ago codeup launch\n3    sa tech job fairnthe third biannual san anto\n4    competitor bootcamps are closing is the model\nname content dtype object'
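
Note that `str(original.content)` converts the whole Series to its printed representation, so the call above cleans the truncated preview (index numbers and the trailing `Name: content, dtype: object` included) rather than the full articles. To see what each step does to an actual piece of text, here is a minimal sketch that applies the same three steps to one short string. The sample string is made up for illustration, and `clean_steps` is a hypothetical helper, not part of the exercise.

```python
import re
import unicodedata

def clean_steps(text):
    """Return the intermediate result of each cleaning step (hypothetical helper)."""
    lowered = text.lower()
    # NFKD splits accented letters into a base letter plus a combining mark;
    # encoding to ASCII with errors='ignore' then drops the marks, along with
    # characters such as the curly apostrophe that have no ASCII form.
    ascii_only = (unicodedata.normalize('NFKD', lowered)
                  .encode('ascii', 'ignore')
                  .decode('utf-8'))
    # Keep only a-z, digits, straight single quotes, and whitespace.
    stripped = re.sub(r"[^a-z0-9'\s]", '', ascii_only)
    return lowered, ascii_only, stripped

sample = "Café’s Déjà Vu, isn't it?"  # made-up example text
for step in clean_steps(sample):
    print(repr(step))
```

In this sketch the curly apostrophe in `Café’s` is dropped during ASCII encoding, while the straight quote in `isn't` survives the final regex. The same thing should happen to titles such as `What’s` in the blog data.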

#### 2. Define a function named tokenize. It should take in a string and tokenize all the words in the string.

In [15]:
def tokenize(string):
    '''
    Take in a string and tokenize all the words in it.
    Returns the tokens joined back into a single space-separated string.
    '''
    # ToktokTokenizer is imported at the top of the notebook
    tokenizer = ToktokTokenizer()

    return tokenizer.tokenize(string, return_str=True)

In [17]:
tokenize(str(original.content))

'0 The rumors are true ! The time has arrived. Cod ... \n1 By Dimitri Antoniou and Maggie Giust\\nData Sci ... \n2 By Dimitri Antoniou\\nA week ago , Codeup launch ... \n3 SA Tech Job Fair\\nThe third bi-annual San Anto ... \n4 Competitor Bootcamps Are Closing. Is the Model ... \nName : content , dtype : object'

#### 3. Define a function named stem. It should accept some text and return the text after applying stemming to all the words.

In [16]:
def stem(text):
    '''
    Accept some text and return the text after applying stemming to all the words.
    '''
    # Create the nltk stemmer object, then use it
    ps = nltk.stem.PorterStemmer()

    stems = [ps.stem(word) for word in text.split()]
    text_stemmed = ' '.join(stems)

    return text_stemmed

In [18]:
stem(str(original.content))

'0 the rumor are true! the time ha arrived. cod... 1 by dimitri antoni and maggi giust\\ndata sci... 2 by dimitri antoniou\\na week ago, codeup launch... 3 sa tech job fair\\nth third bi-annu san anto... 4 competitor bootcamp are closing. is the model... name: content, dtype: object'

#### 4. Define a function named lemmatize. It should accept some text and return the text after applying lemmatization to each word.

In [20]:
def lemmatize(text):
    
    #This function accepts some text and returns the text after applying lemmatization to each word.
    
    # Create the nltk lemmatizer object, then use it
    wnl = nltk.stem.WordNetLemmatizer()

    lemmas = [wnl.lemmatize(word) for word in text.split()]
    text_lemmatized = ' '.join(lemmas)

    return text_lemmatized
    

In [23]:
lemmatize(str(original.content))

'0 The rumor are true! The time ha arrived. Cod... 1 By Dimitri Antoniou and Maggie Giust\\nData Sci... 2 By Dimitri Antoniou\\nA week ago, Codeup launch... 3 SA Tech Job Fair\\nThe third bi-annual San Anto... 4 Competitor Bootcamps Are Closing. Is the Model... Name: content, dtype: object'

#### 5. Define a function named remove_stopwords. It should accept some text and return the text after removing all the stopwords.

This function should define two optional parameters, extra_words and exclude_words. These parameters should define any additional stop words to include, and any words that we don't want to remove.

In [25]:
def remove_stopwords(string, extra_words=None, exclude_words=None):
    '''
    This function takes in a string and optional extra_words and exclude_words
    lists (both default to no words) and returns the string with stopwords removed.
    '''
    # Create stopword_list.
    stopword_list = stopwords.words('english')

    # Remove 'exclude_words' from stopword_list to keep these in my text.
    stopword_list = set(stopword_list) - set(exclude_words or [])

    # Add in 'extra_words' to stopword_list.
    stopword_list = stopword_list.union(set(extra_words or []))

    # Split words in string.
    words = string.split()
    
    # Create a list of words from my string with stopwords removed and assign to variable.
    filtered_words = [word for word in words if word not in stopword_list]
    
    # Join words in the list back into strings and assign to a variable.
    string_without_stopwords = ' '.join(filtered_words)
    
    return string_without_stopwords

In [None]:
remove_stopwords(str(original.content))
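
The set arithmetic inside `remove_stopwords` can be checked without NLTK. The sketch below uses a tiny, made-up stopword list (`BASE_STOPWORDS` is an assumption standing in for `stopwords.words('english')`) to show how `exclude_words` keeps a word in the text and `extra_words` removes an additional one. `remove_stopwords_demo` is a hypothetical name used only here.

```python
# Hypothetical stand-in for stopwords.words('english'); the real list is much longer.
BASE_STOPWORDS = ['the', 'is', 'no', 'a', 'has']

def remove_stopwords_demo(string, extra_words=(), exclude_words=()):
    """Same logic as remove_stopwords, but with a hard-coded base list."""
    # Drop the words we want to keep, then add the extra words to remove.
    stopword_set = (set(BASE_STOPWORDS) - set(exclude_words)) | set(extra_words)
    return ' '.join(word for word in string.split() if word not in stopword_set)

text = 'the model is in no danger codeup has a plan'
print(remove_stopwords_demo(text))
# model in danger codeup plan
print(remove_stopwords_demo(text, extra_words=['codeup'], exclude_words=['no']))
# model in no danger plan
```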

#### 6. Use your data from the acquire module to produce a dataframe of the news articles. Name the dataframe news_df.

In [27]:
# use all the functions to see if they work on the content column
original['content'].apply(basic_clean)\
.apply(tokenize)\
.apply(lemmatize)\
.apply(remove_stopwords)

0    rumor true time ha arrived codeup ha officiall...
1    dimitri antoniou maggie giust data science big...
2    dimitri antoniou week ago codeup launched imme...
3    sa tech job fair third biannual san antonio te...
4    competitor bootcamps closing model danger prog...
Name: content, dtype: object

#### 7. Make another dataframe for the Codeup blog posts. Name the dataframe codeup_df.

In [32]:
today = strftime('%Y-%m-%d')
codeup_df = acquire.get_blog_articles()

In [33]:
codeup_df.head()

| | title | content |
|---|---|---|
| 0 | Codeup’s Data Science Career Accelerator is Here! | The rumors are true! The time has arrived. Cod... |
| 1 | Data Science Myths | By Dimitri Antoniou and Maggie Giust\nData Sci... |
| 2 | Data Science VS Data Analytics: What’s The Dif... | By Dimitri Antoniou\nA week ago, Codeup launch... |
| 3 | 10 Tips to Crush It at the SA Tech Job Fair | SA Tech Job Fair\nThe third bi-annual San Anto... |
| 4 | Competitor Bootcamps Are Closing. Is the Model... | Competitor Bootcamps Are Closing. Is the Model... |


#### 8. For each dataframe, produce the following columns:

- title to hold the title
- original to hold the original article/post content
- clean to hold the normalized and tokenized original with the stopwords removed.
- stemmed to hold the stemmed version of the cleaned data.
- lemmatized to hold the lemmatized version of the cleaned data.

In [35]:
def prep_article_data(df, column, extra_words=[], exclude_words=[]):
    '''
    This function takes in a df and the string name of a text column, with the
    option to pass lists for extra_words and exclude_words, and returns a df
    with the article title, the original text, and the clean, stemmed, and
    lemmatized versions of the text, each with stopwords removed.
    '''
    df['clean'] = df[column].apply(basic_clean)\
                            .apply(tokenize)\
                            .apply(remove_stopwords, 
                                   extra_words=extra_words, 
                                   exclude_words=exclude_words)
    
    df['stemmed'] = df[column].apply(basic_clean)\
                            .apply(tokenize)\
                            .apply(stem)\
                            .apply(remove_stopwords, 
                                   extra_words=extra_words, 
                                   exclude_words=exclude_words)
    
    df['lemmatized'] = df[column].apply(basic_clean)\
                            .apply(tokenize)\
                            .apply(lemmatize)\
                            .apply(remove_stopwords, 
                                   extra_words=extra_words, 
                                   exclude_words=exclude_words)
    
    return df[['title', column,'clean', 'stemmed', 'lemmatized']]

In [36]:
prep_article_data(original, 'content', extra_words = ['ha'], exclude_words = ['no']).head()

| | title | content | clean | stemmed | lemmatized |
|---|---|---|---|---|---|
| 0 | Codeup’s Data Science Career Accelerator is Here! | The rumors are true! The time has arrived. Cod... | rumors true time arrived codeup officially ope... | rumor true time arriv codeup offici open appli... | rumor true time arrived codeup officially open... |
| 1 | Data Science Myths | By Dimitri Antoniou and Maggie Giust\nData Sci... | dimitri antoniou maggie giust data science big... | dimitri antoni maggi giust data scienc big dat... | dimitri antoniou maggie giust data science big... |
| 2 | Data Science VS Data Analytics: What’s The Dif... | By Dimitri Antoniou\nA week ago, Codeup launch... | dimitri antoniou week ago codeup launched imme... | dimitri antoni week ago codeup launch immers d... | dimitri antoniou week ago codeup launched imme... |
| 3 | 10 Tips to Crush It at the SA Tech Job Fair | SA Tech Job Fair\nThe third bi-annual San Anto... | sa tech job fair third biannual san antonio te... | sa tech job fair third biannual san antonio te... | sa tech job fair third biannual san antonio te... |
| 4 | Competitor Bootcamps Are Closing. Is the Model... | Competitor Bootcamps Are Closing. Is the Model... | competitor bootcamps closing model danger prog... | competitor bootcamp close model danger program... | competitor bootcamps closing model danger prog... |


#### 9. Ask yourself:

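- If your corpus is 493KB, would you prefer to use stemmed or lemmatized text?
    - Lemmatized. The corpus is small, so the extra time lemmatization takes is negligible.
- If your corpus is 25MB, would you prefer to use stemmed or lemmatized text?
    - Lemmatized. This is still small enough that the more readable lemmatized output is worth the extra processing.
- If your corpus is 200TB of text and you're charged by the megabyte for your hosted computational resources, would you prefer to use stemmed or lemmatized text?
    - Stemmed. Stemming is cheaper to compute, which matters when cost scales with the amount of text processed.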
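
One way to ground these answers is to time both approaches on a small sample and extrapolate to the full corpus. The sketch below is a rough estimate under the assumption that processing time grows linearly with corpus size. `extrapolate` and `estimate_seconds` are hypothetical helpers, and `str.upper` is only a built-in stand-in for the `stem` and `lemmatize` functions defined earlier.

```python
from time import perf_counter

def extrapolate(elapsed_seconds, sample_bytes, corpus_bytes):
    """Scale a measured time linearly from the sample size to the corpus size."""
    return elapsed_seconds * corpus_bytes / sample_bytes

def estimate_seconds(func, sample_text, corpus_bytes):
    """Time func on sample_text and estimate the time for corpus_bytes of text."""
    start = perf_counter()
    func(sample_text)
    elapsed = perf_counter() - start
    sample_bytes = len(sample_text.encode('utf-8'))
    return extrapolate(elapsed, sample_bytes, corpus_bytes)

# Example with a built-in stand-in; swap in stem or lemmatize to compare them.
sample = 'the rumors are true the time has arrived ' * 1000
print(estimate_seconds(str.upper, sample, corpus_bytes=25 * 1024**2))
```

Comparing the two estimates at each corpus size gives a sense of when the difference between stemming and lemmatizing starts to matter.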