# Exercises

The end result of this exercise should be a file named prepare.py that defines the requested functions.

In this exercise we will define some functions to prepare textual data. These functions should apply equally well to both the Codeup blog articles and the news articles that were previously acquired.

In [4]:
import unicodedata
import re
import json
import nltk
from nltk.tokenize.toktok import ToktokTokenizer
from nltk.corpus import stopwords

import pandas as pd
import acquire

In [8]:
# get data for testing
blog = acquire.get_blog_articles()
news = acquire.get_inshorts_articles()

In [12]:
original_blog = blog.content[0]
original_blog

'Any podcast enthusiasts out there? We are pleased to announce the release of Codeup’s first podcast, Hire Tech! Hire Tech is a conversation with new developers and the people that hire them, hosted by our CEO and Co-Founder, Jason Straughan.\xa0\nIn a world where “entry-level” positions often require 3 years of experience, Jason sets out to discover what it’s like to break into and work in the modern tech industry. To hear various perspectives, he interviews both Codeup alumni and tech leaders who have hired from Codeup. These stories show you the impact Codeup has on the tech world by empowering our community with real life change.\nIn the first episode, Jason interviews Codeup alum Ryan Smith, who has been working as a Software Developer and Software Engineer for nearly two years. From a missionary in Colombia to a Dog Handler in Afghanistan, Ryan’s life experiences are one of the many ways he and other Codeup grads stand out to employers. That, paired with his interview preparation

In [13]:
original_news = news.content[0]
original_news

'Navi Mutual Fund is offering investors the option to start Systematic Investment Plans (SIPs) with ₹500 on its app. “With the recent spike in demand for mutual funds, Navi promises to offer a hassle-free digital experience to users and it is emerging as a preferred platform for direct investments”, the company said. Users can start investing via the Navi app.'

## 1. Define a function named basic_clean. It should take in a string and apply some basic text cleaning to it:

- Lowercase everything
- Normalize unicode characters
- Remove anything that is not a letter, number, whitespace, or a single quote.

In [None]:
def basic_clean(string):
    """
    This function will perform basic cleaning of a string. It will reduce all characters 
    to lower case, normalize unicode characters, and remove anything that is not a 
    letter, number, whitespace, or a single quote.
    """
    
    #Lower case everything
    string = string.lower()
    
    #Normalize unicode characters, 
    #encode into ascii byte strings and ignore unknown chars,
    #decode back into a UTF-8 string that we can work with
    string = unicodedata.normalize('NFKD', string).encode('ascii', 'ignore').decode('UTF-8', 'ignore')
    
    #Use regex to replace anything that is not a letter, number, whitespace, or a single quote
    string = re.sub(r"[^a-z0-9\s']", '', string)
    
    return string
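
To see what the normalization line does on its own, here is a minimal sketch that isolates it. The helper name `ascii_fold` is hypothetical and is not one of the functions requested for prepare.py.

```python
import unicodedata

def ascii_fold(text):
    # NFKD splits accented letters into a base letter plus combining marks
    # and maps compatibility characters (such as a non-breaking space) to
    # their plain equivalents.
    decomposed = unicodedata.normalize('NFKD', text)
    # Encoding to ASCII with errors='ignore' silently drops whatever is left
    # that has no ASCII form (combining marks, curly quotes, currency signs).
    return decomposed.encode('ascii', 'ignore').decode('ascii')

print(ascii_fold('café'))          # accent is dropped, base letter kept
print(ascii_fold('codeup’s'))      # curly apostrophe has no ASCII form
print(repr(ascii_fold('a\xa0b')))  # non-breaking space becomes a space
```

This is why a curly apostrophe or the ₹ sign in the source text leaves no trace in the cleaned output, while the `\xa0` after "Jason Straughan." turns into an ordinary space.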

In [17]:
# for testing basic cleaning of the blog articles
cleaned_blog = basic_clean(original_blog)
cleaned_blog

'any podcast enthusiasts out there we are pleased to announce the release of codeups first podcast hire tech hire tech is a conversation with new developers and the people that hire them hosted by our ceo and cofounder jason straughan \nin a world where entrylevel positions often require 3 years of experience jason sets out to discover what its like to break into and work in the modern tech industry to hear various perspectives he interviews both codeup alumni and tech leaders who have hired from codeup these stories show you the impact codeup has on the tech world by empowering our community with real life change\nin the first episode jason interviews codeup alum ryan smith who has been working as a software developer and software engineer for nearly two years from a missionary in colombia to a dog handler in afghanistan ryans life experiences are one of the many ways he and other codeup grads stand out to employers that paired with his interview preparation trick listen in to learn w

In [18]:
# for testing basic cleaning of the news articles
cleaned_news = basic_clean(original_news)
cleaned_news

'navi mutual fund is offering investors the option to start systematic investment plans sips with 500 on its app with the recent spike in demand for mutual funds navi promises to offer a hasslefree digital experience to users and it is emerging as a preferred platform for direct investments the company said users can start investing via the navi app'
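
Note that hyphenated words are fused in the cleaned text ("hassle-free" becomes "hasslefree", "entry-level" becomes "entrylevel") because the regex deletes the hyphen instead of replacing it. If separate tokens are preferred, one option is to turn hyphens into spaces first. This is a sketch of an alternative, not something the exercise asks for, and both function names below are hypothetical.

```python
import re

def strip_symbols(text):
    # Same pattern as basic_clean: delete anything that is not
    # a lowercase letter, digit, whitespace, or single quote.
    return re.sub(r"[^a-z0-9\s']", '', text)

def strip_symbols_split_hyphens(text):
    # Hypothetical variant: turn hyphens into spaces before deleting the rest.
    return strip_symbols(text.replace('-', ' '))

sample = 'a hassle-free, entry-level role'
print(strip_symbols(sample))
print(strip_symbols_split_hyphens(sample))
```

Which behavior is better depends on whether "entrylevel" or the pair "entry" and "level" is more useful downstream.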

## 2. Define a function named tokenize. It should take in a string and tokenize all the words in the string.

In [19]:
def tokenize(string):
    """
    This function will tokenize all the words in the given string and return the 
    tokenized string.
    """
    
    #Create the tokenizer
    tokenizer = ToktokTokenizer()
    
    #Use the tokenizer
    string = tokenizer.tokenize(string, return_str = True)
    
    return string

In [22]:
# for testing untokenized blog articles
cleaned_blog

'any podcast enthusiasts out there we are pleased to announce the release of codeups first podcast hire tech hire tech is a conversation with new developers and the people that hire them hosted by our ceo and cofounder jason straughan \nin a world where entrylevel positions often require 3 years of experience jason sets out to discover what its like to break into and work in the modern tech industry to hear various perspectives he interviews both codeup alumni and tech leaders who have hired from codeup these stories show you the impact codeup has on the tech world by empowering our community with real life change\nin the first episode jason interviews codeup alum ryan smith who has been working as a software developer and software engineer for nearly two years from a missionary in colombia to a dog handler in afghanistan ryans life experiences are one of the many ways he and other codeup grads stand out to employers that paired with his interview preparation trick listen in to learn w

In [23]:
# Tokenized
tokenized_blog = tokenize(cleaned_blog)
tokenized_blog


'any podcast enthusiasts out there we are pleased to announce the release of codeups first podcast hire tech hire tech is a conversation with new developers and the people that hire them hosted by our ceo and cofounder jason straughan \nin a world where entrylevel positions often require 3 years of experience jason sets out to discover what its like to break into and work in the modern tech industry to hear various perspectives he interviews both codeup alumni and tech leaders who have hired from codeup these stories show you the impact codeup has on the tech world by empowering our community with real life change\nin the first episode jason interviews codeup alum ryan smith who has been working as a software developer and software engineer for nearly two years from a missionary in colombia to a dog handler in afghanistan ryans life experiences are one of the many ways he and other codeup grads stand out to employers that paired with his interview preparation trick listen in to learn w

## 3. Define a function named stem. It should accept some text and return the text after applying stemming to all the words.

In [24]:
def stem(string):
    '''
    This function takes in a string and
    returns a string with words stemmed.
    '''
    # Create porter stemmer.
    ps = nltk.stem.PorterStemmer()
    
    # Use the stemmer to stem each word in the list of words we created by using split.
    stems = [ps.stem(word) for word in string.split()]
    
    # Join our lists of words into a string again and assign to a variable.
    string = ' '.join(stems)
    
    return string

In [25]:
# for testing unstemmed
tokenized_blog

'any podcast enthusiasts out there we are pleased to announce the release of codeups first podcast hire tech hire tech is a conversation with new developers and the people that hire them hosted by our ceo and cofounder jason straughan \nin a world where entrylevel positions often require 3 years of experience jason sets out to discover what its like to break into and work in the modern tech industry to hear various perspectives he interviews both codeup alumni and tech leaders who have hired from codeup these stories show you the impact codeup has on the tech world by empowering our community with real life change\nin the first episode jason interviews codeup alum ryan smith who has been working as a software developer and software engineer for nearly two years from a missionary in colombia to a dog handler in afghanistan ryans life experiences are one of the many ways he and other codeup grads stand out to employers that paired with his interview preparation trick listen in to learn w

In [26]:
# stemmed
stemmed = stem(tokenized_blog)
stemmed

'ani podcast enthusiast out there we are pleas to announc the releas of codeup first podcast hire tech hire tech is a convers with new develop and the peopl that hire them host by our ceo and cofound jason straughan in a world where entrylevel posit often requir 3 year of experi jason set out to discov what it like to break into and work in the modern tech industri to hear variou perspect he interview both codeup alumni and tech leader who have hire from codeup these stori show you the impact codeup ha on the tech world by empow our commun with real life chang in the first episod jason interview codeup alum ryan smith who ha been work as a softwar develop and softwar engin for nearli two year from a missionari in colombia to a dog handler in afghanistan ryan life experi are one of the mani way he and other codeup grad stand out to employ that pair with hi interview prepar trick listen in to learn what it is thi episod will reveal how he went into hi job search with total confid what he

## 4. Define a function named lemmatize. It should accept some text and return the text after applying lemmatization to each word.

In [31]:
# Download the wordnet
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/randyfrench/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [32]:
def lemmatize(string):
    '''
    This function takes in a string and
    returns a string with words lemmatized.
    '''
    # Create the lemmatizer.
    wnl = nltk.stem.WordNetLemmatizer()
    
    # Use the lemmatizer on each word in the list of words we created by using split.
    lemmas = [wnl.lemmatize(word) for word in string.split()]
    
    # Join our list of words into a string again and assign to a variable.
    string = ' '.join(lemmas)
    
    return string

In [33]:
# use the function defined above
lemmatize(tokenized_blog)

'any podcast enthusiast out there we are pleased to announce the release of codeups first podcast hire tech hire tech is a conversation with new developer and the people that hire them hosted by our ceo and cofounder jason straughan in a world where entrylevel position often require 3 year of experience jason set out to discover what it like to break into and work in the modern tech industry to hear various perspective he interview both codeup alumnus and tech leader who have hired from codeup these story show you the impact codeup ha on the tech world by empowering our community with real life change in the first episode jason interview codeup alum ryan smith who ha been working a a software developer and software engineer for nearly two year from a missionary in colombia to a dog handler in afghanistan ryans life experience are one of the many way he and other codeup grad stand out to employer that paired with his interview preparation trick listen in to learn what it is this episode

## 5. Define a function named remove_stopwords. It should accept some text and return the text after removing all the stopwords.

This function should define two optional parameters, extra_words and exclude_words. These parameters should define any additional stop words to include, and any words that we don't want to remove.

In [34]:
# download the stopword corpus
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/randyfrench/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [35]:
def remove_stopwords(string, extra_words = [], exclude_words = []):
    '''
    This function takes in a string plus optional extra_words and exclude_words lists
    (both empty by default) and returns the string with stopwords removed.
    '''
    # Create stopword_list.
    stopword_list = stopwords.words('english')
    
    # Remove 'exclude_words' from stopword_list to keep these in my text.
    stopword_list = set(stopword_list) - set(exclude_words)
    
    # Add in 'extra_words' to stopword_list.
    stopword_list = stopword_list.union(set(extra_words))

    # Split words in string.
    words = string.split()
    
    # Create a list of words from my string with stopwords removed and assign to variable.
    filtered_words = [word for word in words if word not in stopword_list]
    
    # Join words in the list back into strings and assign to a variable.
    string_without_stopwords = ' '.join(filtered_words)
    
    return string_without_stopwords

In [36]:
# use the function defined above
remove_stopwords(tokenized_blog)

'podcast enthusiasts pleased announce release codeups first podcast hire tech hire tech conversation new developers people hire hosted ceo cofounder jason straughan world entrylevel positions often require 3 years experience jason sets discover like break work modern tech industry hear various perspectives interviews codeup alumni tech leaders hired codeup stories show impact codeup tech world empowering community real life change first episode jason interviews codeup alum ryan smith working software developer software engineer nearly two years missionary colombia dog handler afghanistan ryans life experiences one many ways codeup grads stand employers paired interview preparation trick listen learn episode reveal went job search total confidence would tell hiring managers regarding juniorlevel talent shouldnt get caught applicant knowing primary language give listen spotify also available apple podcasts anchor google podcasts perfect morning commute'

In [37]:
# create a list of extra words and words to exclude
extra_words = ['trick', 'life', 'google']
exclude_words = ['how']

In [39]:
# a string without the stop words
extra_clean = remove_stopwords(tokenized_blog, extra_words, exclude_words)
extra_clean

'podcast enthusiasts pleased announce release codeups first podcast hire tech hire tech conversation new developers people hire hosted ceo cofounder jason straughan world entrylevel positions often require 3 years experience jason sets discover like break work modern tech industry hear various perspectives interviews codeup alumni tech leaders hired codeup stories show impact codeup tech world empowering community real change first episode jason interviews codeup alum ryan smith working software developer software engineer nearly two years missionary colombia dog handler afghanistan ryans experiences one many ways codeup grads stand employers paired interview preparation listen learn episode reveal how went job search total confidence would tell hiring managers regarding juniorlevel talent shouldnt get caught applicant knowing primary language give listen spotify also available apple podcasts anchor podcasts perfect morning commute'
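
The set arithmetic inside remove_stopwords can be checked without downloading the NLTK corpus. The sketch below uses a tiny hand-written stopword set (an assumption for illustration; the real function uses `stopwords.words('english')`) and the hypothetical name `filter_words`. It also uses `None` defaults, a common alternative to empty-list defaults.

```python
# Tiny stand-in for stopwords.words('english'); illustration only.
TOY_STOPWORDS = {'the', 'a', 'how', 'he', 'to'}

def filter_words(text, extra_words=None, exclude_words=None):
    # Start from the base set, drop the words we want to keep in the text,
    # then add the extra words we also want removed.
    stopword_set = (TOY_STOPWORDS - set(exclude_words or [])) | set(extra_words or [])
    return ' '.join(word for word in text.split() if word not in stopword_set)

text = 'listen to learn how he found the trick'
print(filter_words(text))
print(filter_words(text, extra_words=['trick'], exclude_words=['how']))
```

Because the union is applied last, a word listed in both parameters ends up removed.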

## 6. Use your data from the acquire module to produce a dataframe of the news articles. Name the dataframe news_df.

In [41]:
news_df = acquire.get_inshorts_articles()
news_df

| | title | published | author | content | category |
|---|---|---|---|---|---|
| 0 | Zydus Cadila to reduce price of its COVID-19 v... | 2021-10-31T15:17:07.000Z | Pragya Swastik | Gujarat-based pharma company Zydus Cadila has ... | business |
| 1 | Bill Gates celebrates his 66th birthday with J... | 2021-10-31T09:16:36.000Z | Pragya Swastik | Microsoft Co-founder Bill Gates, who turned 66... | business |
| 2 | Hebrew speakers mock Facebook's new name 'Meta... | 2021-10-31T09:17:55.000Z | Pragya Swastik | Social media users in Israel are mocking Faceb... | business |
| 3 | Mumbai Police shares meme on Facebook name cha... | 2021-10-30T15:57:03.000Z | Arshiya Chopra | After Facebook changed its company name to 'Me... | business |
| 4 | If UN can tell how $6 bn will solve world hung... | 2021-10-31T15:22:38.000Z | Pragya Swastik | Tesla CEO Elon Musk said if the UN's World Foo... | business |
| ... | ... | ... | ... | ... | ... |
| 95 | Cringed on set during projects I did only for ... | 2021-10-31T16:20:26.000Z | Kriti Kambiri | Actor Sharib Hashmi has revealed that in the p... | entertainment |
| 96 | Glad we've never been typical father/son duo: ... | 2021-10-31T10:48:12.000Z | Mahima Kharbanda | Actor Anupam Kher wished his son Sikandar Kher... | entertainment |
| 97 | What the f*ck is that: Salman Khan to Tejasswi... | 2021-10-31T09:59:34.000Z | Kriti Kambiri | A Bigg Boss 15 promo shows host Salman Khan sa... | entertainment |
| 98 | Urmila Matondkar tests positive for coronavirus | 2021-10-31T08:57:07.000Z | Mahima Kharbanda | Actress-turned-politician Urmila Matondkar ann... | entertainment |


## 7. Make another dataframe for the Codeup blog posts. Name the dataframe codeup_df.

In [42]:
codeup_df = acquire.get_blog_articles()
codeup_df

| | title | published | content |
|---|---|---|---|
| 0 | Boris – Behind the Billboards | Oct 3, 2021 | |
| 1 | Is Codeup the Best Bootcamp in San Antonio…or ... | Sep 16, 2021 | Looking for the best data science bootcamp in ... |
| 2 | Codeup Launches First Podcast: Hire Tech | Aug 25, 2021 | Any podcast enthusiasts out there? We are plea... |
| 3 | Why Should I Become a System Administrator? | Aug 23, 2021 | With so many tech careers in demand, why choos... |
| 4 | Announcing our Candidacy for Accreditation! | Jun 30, 2021 | Did you know that even though we’re an indepen... |
| 5 | Codeup Takes Over More of the Historic Vogue B... | Jun 21, 2021 | Codeup is moving into another floor of our His... |
| 6 | Inclusion at Codeup During Pride Month (and Al... | Jun 4, 2021 | Happy Pride Month! Pride Month is a dedicated ... |
| 7 | Why You Need the Best Coding Bootcamp Instructors | May 21, 2021 | One of the many reasons students love Codeup i... |
| 8 | Meet the new Codeup COO, Stephen Noteboom! | May 3, 2021 | A big welcome to Stephen Noteboom, who will be... |
| 9 | How I Went From Codeup to Business Owner | Apr 30, 2021 | Out of college, I was a bit of a mess. That’s ... |


## 8. For each dataframe, produce the following columns:

- title to hold the title
- original to hold the original article/post content
- clean to hold the normalized and tokenized original with the stopwords removed.
- stemmed to hold the stemmed version of the cleaned data.
- lemmatized to hold the lemmatized version of the cleaned data.

In [43]:
def prep_article_data(df, column, extra_words=[], exclude_words=[]):
    '''
    This function takes in a df and the string name of a text column, with the
    option to pass lists for extra_words and exclude_words. It adds clean, stemmed,
    and lemmatized columns (each normalized, tokenized, and with stopwords removed)
    and returns a df with the title, the original text column, and those three columns.
    '''
    df['clean'] = df[column].apply(basic_clean)\
                            .apply(tokenize)\
                            .apply(remove_stopwords, 
                                   extra_words=extra_words, 
                                   exclude_words=exclude_words)
    
    df['stemmed'] = df[column].apply(basic_clean)\
                            .apply(tokenize)\
                            .apply(stem)\
                            .apply(remove_stopwords, 
                                   extra_words=extra_words, 
                                   exclude_words=exclude_words)
    
    df['lemmatized'] = df[column].apply(basic_clean)\
                            .apply(tokenize)\
                            .apply(lemmatize)\
                            .apply(remove_stopwords, 
                                   extra_words=extra_words, 
                                   exclude_words=exclude_words)
    
    return df[['title', column,'clean', 'stemmed', 'lemmatized']]

In [44]:
# use the function defined above for news_df's content column.

prep_article_data(news_df, 'content', extra_words = ['ha'], exclude_words = ['no']).head()

| | title | content | clean | stemmed | lemmatized |
|---|---|---|---|---|---|
| 0 | Zydus Cadila to reduce price of its COVID-19 v... | Gujarat-based pharma company Zydus Cadila has ... | gujaratbased pharma company zydus cadila agree... | gujaratbas pharma compani zydu cadila agre bri... | gujaratbased pharma company zydus cadila agree... |
| 1 | Bill Gates celebrates his 66th birthday with J... | Microsoft Co-founder Bill Gates, who turned 66... | microsoft cofounder bill gates turned 66 octob... | microsoft cofound bill gate turn 66 octob 28 r... | microsoft cofounder bill gate turned 66 octobe... |
| 2 | Hebrew speakers mock Facebook's new name 'Meta... | Social media users in Israel are mocking Faceb... | social media users israel mocking facebook ' n... | social media user israel mock facebook ' new c... | social medium user israel mocking facebook ' n... |
| 3 | Mumbai Police shares meme on Facebook name cha... | After Facebook changed its company name to 'Me... | facebook changed company name ' meta ' mumbai ... | facebook chang compani name ' meta ' mumbai po... | facebook changed company name ' meta ' mumbai ... |
| 4 | If UN can tell how $6 bn will solve world hung... | Tesla CEO Elon Musk said if the UN's World Foo... | tesla ceo elon musk said un ' world food progr... | tesla ceo elon musk said un ' world food progr... | tesla ceo elon musk said un ' world food progr... |


In [45]:
# use the function defined above for codeup_df's content column.

prep_article_data(codeup_df, 'content', extra_words = ['ha'], exclude_words = ['no']).head()

| | title | content | clean | stemmed | lemmatized |
|---|---|---|---|---|---|
| 0 | Boris – Behind the Billboards | | | | |
| 1 | Is Codeup the Best Bootcamp in San Antonio…or ... | Looking for the best data science bootcamp in ... | looking best data science bootcamp world best ... | look best data scienc bootcamp world best code... | looking best data science bootcamp world best ... |
| 2 | Codeup Launches First Podcast: Hire Tech | Any podcast enthusiasts out there? We are plea... | podcast enthusiasts pleased announce release c... | ani podcast enthusiast pleas announc releas co... | podcast enthusiast pleased announce release co... |
| 3 | Why Should I Become a System Administrator? | With so many tech careers in demand, why choos... | many tech careers demand choose system adminis... | mani tech career demand whi choos system admin... | many tech career demand choose system administ... |
| 4 | Announcing our Candidacy for Accreditation! | Did you know that even though we’re an indepen... | know even though independent school multiple r... | know even though independ school multipl regul... | know even though independent school multiple r... |
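
In prep_article_data the stemmed and lemmatized columns are built by stemming or lemmatizing first and removing stopwords afterwards. That ordering is why "ani" and "whi" survive in the stemmed column: the altered form no longer matches the stopword. It is also why 'ha' is passed as an extra word. The sketch below reproduces the effect with stand-ins: `toy_stem` is a hypothetical lookup that hard-codes a few stems visible in the earlier output, and the stopword set is a small hand-written sample rather than the NLTK list.

```python
# Stand-ins for illustration only: a few stems copied from the output above
# and a small sample of English stopwords.
TOY_STEMS = {'any': 'ani', 'why': 'whi', 'has': 'ha', 'enthusiasts': 'enthusiast'}
TOY_STOPWORDS = {'any', 'why', 'has', 'out', 'there'}

def toy_stem(text):
    return ' '.join(TOY_STEMS.get(word, word) for word in text.split())

def toy_remove_stopwords(text, extra_words=()):
    stopword_set = TOY_STOPWORDS | set(extra_words)
    return ' '.join(word for word in text.split() if word not in stopword_set)

text = 'any podcast enthusiasts out there'

# Stem first, then filter: 'ani' is not a stopword, so it slips through.
print(toy_remove_stopwords(toy_stem(text)))
# Filter first, then stem: 'any' is removed before it can be altered.
print(toy_stem(toy_remove_stopwords(text)))
# Keep the original order but list the altered form as an extra word.
print(toy_remove_stopwords(toy_stem(text), extra_words=['ani']))
```

Either reordering the steps or extending extra_words closes the gap; the second approach is the one already used for 'ha'.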


## 9. Ask yourself:

- If your corpus is 493KB, would you prefer to use stemmed or lemmatized text?
   - Since this corpus is small, I would prefer the lemmatized text: the extra processing time is negligible and the output keeps real words.
- If your corpus is 25MB, would you prefer to use stemmed or lemmatized text?
   - This corpus is larger but still manageable, so I would still prefer the lemmatized text.
- If your corpus is 200TB of text and you're charged by the megabyte for your hosted computational resources, would you prefer to use stemmed or lemmatized text?
   - This corpus is huge and compute is billed, so I would use the stemmed text: stemming is rule-based and cheaper to run than lemmatization.
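
The trade-off behind these answers is that a stemmer applies suffix rules with no vocabulary, while a lemmatizer needs a dictionary lookup for each word. The toy functions below only mimic that difference. They are not the Porter algorithm or the WordNet lemmatizer, and the rule list and lookup table are invented for illustration.

```python
# Invented suffix rules, far simpler than Porter's.
TOY_SUFFIX_RULES = [('ies', 'i'), ('es', ''), ('s', ''), ('ed', '')]
# Invented lookup table standing in for a real lexical database.
TOY_LEMMAS = {'stories': 'story', 'alumni': 'alumnus', 'years': 'year'}

def toy_stem(word):
    # Apply the first matching suffix rule; no vocabulary is consulted.
    for suffix, replacement in TOY_SUFFIX_RULES:
        if word.endswith(suffix):
            return word[:-len(suffix)] + replacement
    return word

def toy_lemmatize(word):
    # Return the dictionary form when the word is known, else leave it alone.
    return TOY_LEMMAS.get(word, word)

for word in ['stories', 'alumni', 'years']:
    print(word, toy_stem(word), toy_lemmatize(word))
```

The stemmed forms can be non-words (as with "stori" in the stemmed output earlier), whereas the lemmas stay readable. On a small corpus that readability is usually worth the extra lookups; at very large scale the cheaper rule-based pass may be the better fit.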