# Exercises

# The end result of this exercise should be a file named prepare.py that defines the requested functions.

# In this exercise we will define some functions to prepare textual data. These functions should apply equally well to both the Codeup blog articles and the news articles that were previously acquired.

In [2]:
# imports
import unicodedata
import re
import json

import nltk
from nltk.tokenize.toktok import ToktokTokenizer
from nltk.corpus import stopwords

import pandas as pd
import acquire

import warnings
warnings.filterwarnings('ignore')


# 1. Define a function named basic_clean. It should take in a string and apply some basic text cleaning to it:

- Lowercase everything
- Normalize unicode characters
- Remove anything that is not a letter, number, whitespace, or a single quote.

In [3]:
def basic_clean(article):
    '''Take in a string, convert it to lowercase, normalize unicode characters, and
        remove anything that is not a letter, number, whitespace, or a single quote.
    '''
        
    # lowercase everything
    article = article.lower()
    
    # normalize unicode characters
    article = unicodedata.normalize('NFKD', article)\
            .encode('ascii', 'ignore')\
            .decode('utf-8', 'ignore')
    # remove anything that is not a letter, number, whitespace, or a single quote
    article = re.sub(r"[^a-z0-9'\s]", '', article)
    
    return article

In [4]:
article = """We are learning data science. We are investing time and money to learn new skills. 
We know that it will pay off. Manipulate, Manipulation, Data Science and Data Scientist"""

In [5]:
# calling function
basic_clean(article)

'we are learning data science we are investing time and money to learn new skills \nwe know that it will pay off manipulate manipulation data science and data scientist'
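To see what the normalization step does to non-ASCII text, here is a small sketch that repeats the same steps as `basic_clean` on two made-up sample strings. The helper names `to_ascii` and `clean` are hypothetical; only the `unicodedata.normalize` chain and the regex are taken from the function above.

```python
import re
import unicodedata

def to_ascii(text):
    # Hypothetical helper: same normalize/encode/decode chain as basic_clean
    return (unicodedata.normalize('NFKD', text)
            .encode('ascii', 'ignore')
            .decode('utf-8', 'ignore'))

def clean(text):
    # Lowercase, fold to ASCII, then drop every character outside the allowed set
    return re.sub(r"[^a-z0-9'\s]", '', to_ascii(text.lower()))

# Made-up sample strings for illustration
accented = clean("Café déjà vu: it's 100% naïve!")
curly = clean("Codeup’s re-opening")

print(accented)
print(curly)
```

Accented letters are decomposed into a base letter plus a combining mark, and the mark is dropped by the ASCII encoding. A curly apostrophe (`’`) has no ASCII equivalent, so it disappears entirely, while a straight `'` survives the regex. Hyphens are removed as well, which joins a word like re-opening into reopening.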

# 2. Define a function named tokenize. It should take in a string and tokenize all the words in the string.

In [6]:
def tokenize(article):
    """ Take in an article as a string and return a tokenized string. """
    # Create a tokenizer object
    tokenizer = nltk.tokenize.ToktokTokenizer()
    
    # Tokenize the article; return_str=True returns a string instead of a list of tokens
    article = tokenizer.tokenize(article, return_str=True)
    
    return article
    

In [7]:
# Calling function tokenize
tokenize(article)

'We are learning data science. We are investing time and money to learn new skills. \nWe know that it will pay off. Manipulate , Manipulation , Data Science and Data Scientist'

# 3. Define a function named stem. It should accept some text and return the text after applying stemming to all the words.

In [8]:
def stem(article):
    
    """ Take in an article as a string and return it with each word stemmed. """
    
    # Create the nltk stemmer object, then use it
    ps = nltk.porter.PorterStemmer()
    
    # Splitting the words in article parameter
    article_splitted = article.split()
    
    # Use stemmer to stem each word in the list of words in article_splitted
    stems = [ps.stem(word) for word in article_splitted]

    # Join the stemmed words back into a single string
    article = ' '.join(stems)
    
    return article
    

In [9]:
# calling function stem
stem(article)

'we are learn data science. we are invest time and money to learn new skills. we know that it will pay off. manipulate, manipulation, data scienc and data scientist'

# 4. Define a function named lemmatize. It should accept some text and return the text after applying lemmatization to each word.

In [10]:
def lemmatize(article):
    """ Take in an article as a string and return a string
        with each word lemmatized.
    """
    # Create an object for lemmatizer
    wnl = nltk.stem.WordNetLemmatizer()
    
    # Splitting the words in article parameter
    article_splitted = article.split()
    
    # Use lemmatizer object to lemmatize each word in the list of words in article_splitted
    lemmas = [wnl.lemmatize(word) for word in article_splitted]

    # Join the lemmatized words back into a single string
    article = ' '.join(lemmas)
    
    return article

In [11]:
lemmatize(article)

'We are learning data science. We are investing time and money to learn new skills. We know that it will pay off. Manipulate, Manipulation, Data Science and Data Scientist'
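The lemmatized output above is identical to the input. A likely reason is that lemmatization is a lookup: `WordNetLemmatizer.lemmatize` treats each word as a noun unless a `pos` argument is given, and tokens that are capitalized or still have punctuation attached (such as `skills.`) will not match dictionary entries. The sketch below contrasts the two ideas with toy stand-ins. `toy_stem`, `toy_lemmatize`, the suffix list, and the tiny dictionary are all hypothetical and are not NLTK's Porter or WordNet algorithms.

```python
# Toy stemmer: strip the first matching suffix, whether or not a real word remains
TOY_SUFFIXES = ('ing', 's', 'e')

def toy_stem(word):
    for suffix in TOY_SUFFIXES:
        # Only strip when at least three characters would be left
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

# Toy lemmatizer: exact-match dictionary lookup, unknown tokens pass through unchanged
TOY_LEMMAS = {'skills': 'skill', 'scientists': 'scientist'}

def toy_lemmatize(word):
    return TOY_LEMMAS.get(word, word)

words = ['science', 'skills', 'skills.', 'Skills']
stems = [toy_stem(w) for w in words]
lemmas = [toy_lemmatize(w) for w in words]

print(stems)
print(lemmas)
```

The toy stemmer produces `scienc`, which is not a word, while the lookup leaves anything it does not recognize untouched. Running `basic_clean` before `lemmatize`, as `prep_article_data` does later in this notebook, removes the case and punctuation that block the lookup.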

# 5. Define a function named remove_stopwords. It should accept some text and return the text after removing all the stopwords.

# This function should define two optional parameters, extra_words and exclude_words. These parameters should define any additional stop words to include, and any words that we don't want to remove.

In [12]:
#def remove_stopwords(article):
    # stopword in english
#    stopword_list = stopwords.words('english')
    
#    words = article.split()
#    filtered_words = [w for w in words if w not in stopword_list]

#    article_without_stopwords = ' '.join(filtered_words)

#    return article_without_stopwords

In [13]:
# remove_stopwords(article)

In [14]:
def remove_stopwords(article, extra_words=None, exclude_words=None):
    """ Take in an article as a string, plus optional lists of extra_words (additional
    stopwords to remove) and exclude_words (stopwords to keep), and return a string."""
    # Create the English stopword list
    stopword_list = stopwords.words('english')
    
    # Remove exclude_words from stopword_list so they are kept in the text
    stopword_list = set(stopword_list) - set(exclude_words or [])
    
    # Add extra_words to stopword_list so they are removed from the text
    stopword_list = stopword_list.union(set(extra_words or []))
    
    # Split the article into words
    words = article.split()
    
    # Creating a list of words from my article with stopwords removed and assign to variable
    filtered_words = [w for w in words if w not in stopword_list]
    
    # Joining words in the filtered_words back into string and assign to variable
    article_without_stopwords = ' '.join(filtered_words)

    return article_without_stopwords

In [15]:
remove_stopwords(article, extra_words = ['money'], exclude_words = ['the'])

'We learning data science. We investing time learn new skills. We know pay off. Manipulate, Manipulation, Data Science Data Scientist'
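Note that `We` survives in the output above: the NLTK English stopword list is lowercase, and this text was not passed through `basic_clean` first. The following sketch isolates the set arithmetic with a tiny hand-written list standing in for `stopwords.words('english')`. The list contents, the sample sentence, and the helper name `build_stopword_set` are assumptions for illustration.

```python
def build_stopword_set(base_words, extra_words=None, exclude_words=None):
    # Hypothetical helper: (base - exclude_words) | extra_words, as in remove_stopwords
    return (set(base_words) - set(exclude_words or [])) | set(extra_words or [])

# Tiny stand-in for the NLTK English stopword list (an assumption, not the real list)
base = ['we', 'are', 'to', 'and', 'the', 'no']

stopword_set = build_stopword_set(base, extra_words=['money'], exclude_words=['no'])

text = "We are investing time and money to learn no matter the cost"

# Membership is case-sensitive, so the capitalized 'We' is not filtered out
kept_case_sensitive = [w for w in text.split() if w not in stopword_set]

# Lowercasing first lets every stopword match
kept_lowercased = [w for w in text.lower().split() if w not in stopword_set]

print(kept_case_sensitive)
print(kept_lowercased)
```

Here `no` is kept because it was excluded from the stopword set, and `money` is removed because it was added to it.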

# 6. Use your data from the acquire module to produce a dataframe of the news articles. Name the dataframe news_df.

In [16]:
# Use get_inshorts_articles() from acquire.py file
news_df = acquire.get_inshorts_articles()

In [17]:
news_df.head()

Unnamed: 0,title,author,content,date,category
0,I used to catch Patna-Banaras train to listen ...,Pragya Swastik,Vedanta's billionaire Chairman Anil Agarwal in...,"06 Feb 2022,Sunday",business
1,What can you say when you no longer have your ...,Kiran Khatri,After Lata Mangeshkar passed away at 92 on Sun...,"06 Feb 2022,Sunday",business
2,"COVID, you did your worst & stole our voice: M...",Sakshita Khosla,Businessman Anand Mahindra on Sunday shared a ...,"06 Feb 2022,Sunday",business
3,There's no greater tribute to our unity than L...,Hiral Goyal,Adani Group's Chairman Gautam Adani took to Tw...,"06 Feb 2022,Sunday",business
4,Lata Mangeshkar will eternally cast her shadow...,Kiran Khatri,Biocon's Executive Chairperson Kiran Mazumdar-...,"06 Feb 2022,Sunday",business


# 7. Make another dataframe for the Codeup blog posts. Name the dataframe codeup_df.

In [19]:
codeup_df = acquire.get_blog_articles()

In [20]:
codeup_df.head()

Unnamed: 0,title,content,date published
0,Codeup Dallas Open House,Come join us for the re-opening of our Dallas ...,"Nov 30, 2021"
1,Codeup’s Placement Team Continues Setting Records,Our Placement Team is simply defined as a grou...,"Nov 19, 2021"
2,"IT Certifications 101: Why They Matter, and Wh...","AWS, Google, Azure, Red Hat, CompTIA…these are...","Nov 18, 2021"
3,A rise in cyber attacks means opportunities fo...,"In the last few months, the US has experienced...","Nov 17, 2021"
4,Use your GI Bill® benefits to Land a Job in Tech,"As the end of military service gets closer, ma...","Nov 4, 2021"


# 8. For each dataframe, produce the following columns:

- title to hold the title
- original to hold the original article/post content
- clean to hold the normalized and tokenized original with the stopwords removed.
- stemmed to hold the stemmed version of the cleaned data.
- lemmatized to hold the lemmatized version of the cleaned data.

In [21]:
# title already holds the title in both dataframes
# rename the content column to 'original' in news_df
news_df = news_df.rename(columns = {'content' : 'original'})

# rename the content column to 'original' in codeup_df
codeup_df = codeup_df.rename(columns = {'content' : 'original'})

In [22]:
def prep_article_data(df, column, extra_words=[], exclude_words=[]):
    '''
    This function takes in a df and the string name of a text column, with the
    option to pass lists for extra_words and exclude_words, and returns a df with
    the article title, the original text, and the clean, stemmed, and lemmatized
    text (each normalized and tokenized, with stopwords removed).
    Note: the clean, stemmed, and lemmatized columns are also added to the df passed in.
    '''
    df['clean'] = df[column].apply(basic_clean)\
                            .apply(tokenize)\
                            .apply(remove_stopwords, 
                                   extra_words=extra_words, 
                                   exclude_words=exclude_words)
    
    df['stemmed'] = df[column].apply(basic_clean)\
                            .apply(tokenize)\
                            .apply(stem)\
                            .apply(remove_stopwords, 
                                   extra_words=extra_words, 
                                   exclude_words=exclude_words)
    
    df['lemmatized'] = df[column].apply(basic_clean)\
                            .apply(tokenize)\
                            .apply(lemmatize)\
                            .apply(remove_stopwords, 
                                   extra_words=extra_words, 
                                   exclude_words=exclude_words)
    
    return df[['title', column,'clean', 'stemmed', 'lemmatized']]

In [23]:
# Use the function defined above on news_df's 'original' column
prep_article_data(news_df, 'original', extra_words=['ha'], exclude_words=['no']).head()

Unnamed: 0,title,original,clean,stemmed,lemmatized
0,I used to catch Patna-Banaras train to listen ...,Vedanta's billionaire Chairman Anil Agarwal in...,vedanta ' billionaire chairman anil agarwal tw...,vedanta ' billionair chairman anil agarw tweet...,vedanta ' billionaire chairman anil agarwal tw...
1,What can you say when you no longer have your ...,After Lata Mangeshkar passed away at 92 on Sun...,lata mangeshkar passed away 92 sunday business...,lata mangeshkar pass away 92 sunday businessma...,lata mangeshkar passed away 92 sunday business...
2,"COVID, you did your worst & stole our voice: M...",Businessman Anand Mahindra on Sunday shared a ...,businessman anand mahindra sunday shared pictu...,businessman anand mahindra sunday share pictur...,businessman anand mahindra sunday shared pictu...
3,There's no greater tribute to our unity than L...,Adani Group's Chairman Gautam Adani took to Tw...,adani group ' chairman gautam adani took twitt...,adani group ' chairman gautam adani took twitt...,adani group ' chairman gautam adani took twitt...
4,Lata Mangeshkar will eternally cast her shadow...,Biocon's Executive Chairperson Kiran Mazumdar-...,biocon ' executive chairperson kiran mazumdars...,biocon ' execut chairperson kiran mazumdarshaw...,biocon ' executive chairperson kiran mazumdars...


In [24]:
# Use the function defined above on codeup_df's 'original' column
prep_article_data(codeup_df, 'original', extra_words=['ha'], exclude_words=['no']).head()

Unnamed: 0,title,original,clean,stemmed,lemmatized
0,Codeup Dallas Open House,Come join us for the re-opening of our Dallas ...,come join us reopening dallas campus drinks sn...,come join us reopen dalla campu drink snack co...,come join u reopening dallas campus drink snac...
1,Codeup’s Placement Team Continues Setting Records,Our Placement Team is simply defined as a grou...,placement team simply defined group manages re...,placement team simpli defin group manag relati...,placement team simply defined group manages re...
2,"IT Certifications 101: Why They Matter, and Wh...","AWS, Google, Azure, Red Hat, CompTIA…these are...",aws google azure red hat comptiathese big name...,aw googl azur red hat comptiathes big name onl...,aws google azure red hat comptiathese big name...
3,A rise in cyber attacks means opportunities fo...,"In the last few months, the US has experienced...",last months us experienced dozens major cybera...,last month us experienc dozen major cyberattac...,last month u experienced dozen major cyberatta...
4,Use your GI Bill® benefits to Land a Job in Tech,"As the end of military service gets closer, ma...",end military service gets closer many transiti...,end militari servic get closer mani transit se...,end military service get closer many transitio...


# 9. Ask yourself:

- If your corpus is 493KB, would you prefer to use stemmed or lemmatized text?
- If your corpus is 25MB, would you prefer to use stemmed or lemmatized text?
- If your corpus is 200TB of text and you're charged by the megabyte for your hosted computational resources, would you prefer to use stemmed or lemmatized text?

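### For 200TB I would use stemmed text, but for 493KB and 25MB I would go with lemmatized text. Stemming is a fast rule-based operation, while lemmatization relies on dictionary lookups and is slower, so its extra cost only becomes a concern at very large scale.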
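To get a feel for the scale gap between the three corpora in the question, the arithmetic can be worked out directly. This sketch assumes binary units (1 KB = 1024 bytes); no processing speeds or prices are assumed.

```python
KB = 1024
MB = 1024 ** 2
TB = 1024 ** 4

# Corpus sizes from the question, in bytes
corpus_sizes = {
    'small': 493 * KB,
    'medium': 25 * MB,
    'large': 200 * TB,
}

# How many times larger each corpus is than the smallest one
scale_vs_small = {name: size / corpus_sizes['small'] for name, size in corpus_sizes.items()}

# The 200TB corpus expressed in the billing unit (megabytes)
large_in_mb = corpus_sizes['large'] // MB

print(scale_vs_small)
print(large_in_mb)
```

Any extra cost per megabyte from the slower technique is multiplied by roughly 210 million for the largest corpus, which is why the cheaper stemming step becomes attractive there.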