In [1]:
import unicodedata
import re
import json

import nltk
from nltk.tokenize.toktok import ToktokTokenizer
from nltk.corpus import stopwords

import pandas as pd
from time import strftime

import acquire

import warnings
warnings.filterwarnings('ignore')

The end result of this exercise should be a file named prepare.py that defines the requested functions.

In this exercise we will be defining some functions to prepare textual data. These functions should apply equally well to both 
- the codeup blog articles 
- and the news articles that were previously acquired.

In [2]:
codeup_df = acquire.get_blog_articles(acquire.urls)
codeup_df.head()

Unnamed: 0,title,published,contents
0,Codeup Start Dates for March 2022,"Jan 26, 2022",\nAs we approach the end of January we wanted ...
1,5 Books Every Woman In Tech Should Read,"Mar 8, 2022",\nOn this International Women’s Day 2022 we wa...
2,Dallas Campus Re-opens With New Grant Partner,"Dec 30, 2021",\n\n\n\n\n\nWe are happy to announce that our ...
3,Codeup’s Placement Team Continues Setting Records,"Nov 19, 2021",\n\n\n\n\n\nOur Placement Team is simply defin...
4,"IT Certifications 101: Why They Matter, and Wh...","Nov 18, 2021","\n\n\n\n\n\nAWS, Google, Azure, Red Hat, CompT..."


# PREPARE

### 1. Clean 
Define a function named basic_clean. It should take in a string and apply some basic text cleaning to it:

- Lowercase everything
- Normalize unicode characters
- Replace anything that is not a letter, number, whitespace or a single quote.

In [3]:
def basic_clean(string):
    '''
    This function takes in a string, applies basic text cleaning to it,
    then returns normalized text, making all text lowercase, normalizing unicode characters,
    and replacing anything that is not a letter, number, whitespace, or a single quote.
    '''
    # removes accented characters; removes inconsistencies in unicode, converts resulting string to ASCII character, while ignoring warnings, and decodes to turn resulting bytes back into string. 
    string = unicodedata.normalize('NFKD', string)\
             .encode('ascii', 'ignore')\
             .decode('utf-8', 'ignore')
    # # removes special characters, substituting anything that is NOT a letter, number, apostrophe, or whitespace, then makes text lowercase
    string = re.sub(r"[^a-z0-9'\s]", '', string).lower()
    return string
    

In [4]:
test = codeup_df.contents[0]
test

'\nAs we approach the end of January we wanted to look forward to our next start dates for all of our current programs.\nFull Stack Web Development – 3/7/22\nFull Stack Web Development is the first program we built and also our most popular. You’ve asked and we listened! Our next Web Development cohort will start on 3/7/2022 and is ENTIRELY VIRTUAL! THESE SEATS WILL GO FAST!\nAs one of the most in-demand jobs in the country, software and web development is the tech career with the newest jobs. In the U.S., there’s:\n\n1.5 million developer jobs*\n250,000 of them remain open\na high growth rate of 13%*\n\n\xa0\nData Science – 3/22/22\nOur first new Data Science class of 2022 starts Monday 3/22/2022 at our downtown campus at the Vogue building.\nWhy consider pivoting careers to Data Science?\n\n#1 job in America from 2016-2020 (Glassdoor*)\n650% increase in data science positions since 2012\nNearly 12 million new jobs between 2019 and 2029\n31% ten-year growth rate\n\nThe supply of data 

In [5]:
test_clean = basic_clean(test)
test_clean

'\ns we approach the end of anuary we wanted to look forward to our next start dates for all of our current programs\null tack eb evelopment  3722\null tack eb evelopment is the first program we built and also our most popular ouve asked and we listened ur next eb evelopment cohort will start on 372022 and is       \ns one of the most indemand jobs in the country software and web development is the tech career with the newest jobs n the  theres\n\n15 million developer jobs\n250000 of them remain open\na high growth rate of 13\n\n \nata cience  32222\nur first new ata cience class of 2022 starts onday 3222022 at our downtown campus at the ogue building\nhy consider pivoting careers to ata cience\n\n1 job in merica from 20162020 lassdoor\n650 increase in data science positions since 2012\nearly 12 million new jobs between 2019 and 2029\n31 tenyear growth rate\n\nhe supply of data scientists remains painfully low compared to the outrageous demand  can help close the gap while launching a 

### 2. Tokenize
Define a function named tokenize. It should take in a string and tokenize all the words in the string.

In [6]:
def tokenize(string):
    '''
    This function takes in a string and
    tokenizes the string; breaking them down into discrete units.
    '''
    # Create tokenizer.
    tokenizer = nltk.tokenize.ToktokTokenizer()
    
    # Use tokenizer
    string = tokenizer.tokenize(string, return_str = True)
    
    return string

In [7]:
test_tokenize = tokenize(test_clean)
test_tokenize

's we approach the end of anuary we wanted to look forward to our next start dates for all of our current programs\null tack eb evelopment 3722\null tack eb evelopment is the first program we built and also our most popular ouve asked and we listened ur next eb evelopment cohort will start on 372022 and is \ns one of the most indemand jobs in the country software and web development is the tech career with the newest jobs n the theres\n\n15 million developer jobs\n250000 of them remain open\na high growth rate of 13\n\n \nata cience 32222\nur first new ata cience class of 2022 starts onday 3222022 at our downtown campus at the ogue building\nhy consider pivoting careers to ata cience\n\n1 job in merica from 20162020 lassdoor\n650 increase in data science positions since 2012\nearly 12 million new jobs between 2019 and 2029\n31 tenyear growth rate\n\nhe supply of data scientists remains painfully low compared to the outrageous demand can help close the gap while launching a fulfilling s

### 3. Stem
Define a function named stem. It should accept some text and return the text after applying stemming to all the words.

In [8]:
def stem(text):
    '''This function accepts text and returns the stemmed text.
    '''
    # create the stemmer
    ps = nltk.porter.PorterStemmer()
    
    # apply the stemming transformation to all the words in the text using split
    stems = [ps.stem(word) for word in text.split()]
    
    # join the list of words into a string again assigned to the variable article_stemmed
    text_stemmed = ' '.join(stems)
    
    return text

In [9]:
test_stem = stem(test_tokenize)
test_stem

's we approach the end of anuary we wanted to look forward to our next start dates for all of our current programs\null tack eb evelopment 3722\null tack eb evelopment is the first program we built and also our most popular ouve asked and we listened ur next eb evelopment cohort will start on 372022 and is \ns one of the most indemand jobs in the country software and web development is the tech career with the newest jobs n the theres\n\n15 million developer jobs\n250000 of them remain open\na high growth rate of 13\n\n \nata cience 32222\nur first new ata cience class of 2022 starts onday 3222022 at our downtown campus at the ogue building\nhy consider pivoting careers to ata cience\n\n1 job in merica from 20162020 lassdoor\n650 increase in data science positions since 2012\nearly 12 million new jobs between 2019 and 2029\n31 tenyear growth rate\n\nhe supply of data scientists remains painfully low compared to the outrageous demand can help close the gap while launching a fulfilling s

In [10]:
pd.Series(test_stem.split()).value_counts().head(20)

the           15
of            11
to            11
our            8
and            8
in             7
a              6
can            5
career         5
your           5
jobs           4
we             4
cience         3
evelopment     3
is             3
tech           3
ata            3
ur             3
eb             3
one            3
dtype: int64

### 4. Lemmatize
Define a function named lemmatize. It should accept some text and return the text after applying lemmatization to each word.

In [11]:
def lemmatize(text):
    '''This function takes in a string and returns 
    a string with the words lemmatized.
    '''
    
    # create the lemmatizer
    wnl = nltk.stem.WordNetLemmatizer()

    lemmas = [wnl.lemmatize(word) for word in text.split()]
    
    text_lemmatized = ' '.join(lemmas)
    
    
    return text

In [12]:
test_lemmatize = lemmatize(test_stem)
test_lemmatize

's we approach the end of anuary we wanted to look forward to our next start dates for all of our current programs\null tack eb evelopment 3722\null tack eb evelopment is the first program we built and also our most popular ouve asked and we listened ur next eb evelopment cohort will start on 372022 and is \ns one of the most indemand jobs in the country software and web development is the tech career with the newest jobs n the theres\n\n15 million developer jobs\n250000 of them remain open\na high growth rate of 13\n\n \nata cience 32222\nur first new ata cience class of 2022 starts onday 3222022 at our downtown campus at the ogue building\nhy consider pivoting careers to ata cience\n\n1 job in merica from 20162020 lassdoor\n650 increase in data science positions since 2012\nearly 12 million new jobs between 2019 and 2029\n31 tenyear growth rate\n\nhe supply of data scientists remains painfully low compared to the outrageous demand can help close the gap while launching a fulfilling s

In [13]:
pd.Series(test_lemmatize.split()).value_counts().head(10)

the       15
of        11
to        11
our        8
and        8
in         7
a          6
can        5
career     5
your       5
dtype: int64

### 5. Stopwords
Define a function named remove_stopwords. It should accept some text and return the text after removing all the stopwords.

This function should define two optional parameters, extra_words and exclude_words. These parameters should define any additional stop words to include, and any words that we don't want to remove.

In [14]:
def remove_stopwords(text, extra_words = [], exclude_words = []):
    
    '''
    This function takes in text, optional extra_words and exclude_words
    with default empty lists and returns the text.
    '''
    
    # assign stopwords from nltk into stopword_list
    stopword_list = stopwords.words('english')
    # remove excluded words using set
    stopword_list = set(stopword_list) - set(exclude_words)
    # add extra words to the set using union
    stopword_list = stopword_list.union(set(extra_words))

    # split the text by spaces
    words = text.split()
    # assign filtered words as any word in the text that is not in the stopwords list
    filtered_words = [word for word in words if word not in stopword_list]

    # join the filtered list back with the spaces
    text_without_stopwords = ' '.join(filtered_words)

    print('Removed {} stopwords'.format(len(words) - len(filtered_words)))
    print('---')

    print(text_without_stopwords)
    
    
    return text_without_stopwords

In [15]:
extra_words = []
exclude_words = ['to']

In [16]:
'''
stopword_list = stopwords.words('english')
stopword_list
'''

"\nstopword_list = stopwords.words('english')\nstopword_list\n"

In [17]:
# remove_stopwords('I love the langague python')

In [18]:
test_remove_stopwords = remove_stopwords(test_lemmatize, exclude_words= exclude_words)#.split()).value_counts().head()
test_remove_stopwords 

Removed 106 stopwords
---
approach end anuary wanted to look forward to next start dates current programs ull tack eb evelopment 3722 ull tack eb evelopment first program built also popular ouve asked listened ur next eb evelopment cohort start 372022 one indemand jobs country software web development tech career newest jobs n theres 15 million developer jobs 250000 remain open high growth rate 13 ata cience 32222 ur first new ata cience class 2022 starts onday 3222022 downtown campus ogue building hy consider pivoting careers to ata cience 1 job merica 20162020 lassdoor 650 increase data science positions since 2012 early 12 million new jobs 2019 2029 31 tenyear growth rate supply data scientists remains painfully low compared to outrageous demand help close gap launching fulfilling secure highpaying career one best country mployers scrambling to find talent due to lack qualified applicants help fill gap futureproofing skillset ave flexibility security salary youve always wanted caree

'approach end anuary wanted to look forward to next start dates current programs ull tack eb evelopment 3722 ull tack eb evelopment first program built also popular ouve asked listened ur next eb evelopment cohort start 372022 one indemand jobs country software web development tech career newest jobs n theres 15 million developer jobs 250000 remain open high growth rate 13 ata cience 32222 ur first new ata cience class 2022 starts onday 3222022 downtown campus ogue building hy consider pivoting careers to ata cience 1 job merica 20162020 lassdoor 650 increase data science positions since 2012 early 12 million new jobs 2019 2029 31 tenyear growth rate supply data scientists remains painfully low compared to outrageous demand help close gap launching fulfilling secure highpaying career one best country mployers scrambling to find talent due to lack qualified applicants help fill gap futureproofing skillset ave flexibility security salary youve always wanted career ready to launch career 

In [19]:
pd.Series(test_remove_stopwords.split()).value_counts().head(10)

to            11
career         5
jobs           4
help           3
evelopment     3
eb             3
ur             3
ata            3
one            3
cience         3
dtype: int64

In [20]:
pd.Series(remove_stopwords(test_lemmatize, exclude_words= exclude_words).split()).value_counts().head()

Removed 106 stopwords
---
approach end anuary wanted to look forward to next start dates current programs ull tack eb evelopment 3722 ull tack eb evelopment first program built also popular ouve asked listened ur next eb evelopment cohort start 372022 one indemand jobs country software web development tech career newest jobs n theres 15 million developer jobs 250000 remain open high growth rate 13 ata cience 32222 ur first new ata cience class 2022 starts onday 3222022 downtown campus ogue building hy consider pivoting careers to ata cience 1 job merica 20162020 lassdoor 650 increase data science positions since 2012 early 12 million new jobs 2019 2029 31 tenyear growth rate supply data scientists remains painfully low compared to outrageous demand help close gap launching fulfilling secure highpaying career one best country mployers scrambling to find talent due to lack qualified applicants help fill gap futureproofing skillset ave flexibility security salary youve always wanted caree

to            11
career         5
jobs           4
help           3
evelopment     3
dtype: int64

### 6. dataframe news_df
Use your data from the acquire to produce a dataframe of the news articles. Name the dataframe news_df.

In [21]:
news_df = acquire.get_news_articles()
news_df.head()

Getting articles for business
Getting articles for sports
Getting articles for entertainment
Getting articles for technology


Unnamed: 0,category,title,content,author,published
0,business,Apple delays plan requiring employees to come ...,Apple has delayed its plan that required its e...,Pragya Swastik,2022-05-18T10:17:10.000Z
1,business,"Price of domestic LPG cylinder crosses ₹1,000-...",The price of a 14.2-kg domestic LPG cylinder w...,Apaar Sharma,2022-05-19T04:23:45.000Z
2,business,Wheat shouldn't go the way of COVID-19 vaccine...,"Calling out the West, India said that wheat sh...",Apaar Sharma,2022-05-19T03:56:52.000Z
3,business,Rupee closes at a new all-time low of 77.58 ag...,The Indian rupee closed at a new all-time low ...,Anmol Sharma,2022-05-18T11:11:34.000Z
4,business,Investors lose ₹7 lakh crore as Sensex crashes...,The wealth of investors tumbled by ₹7 lakh cro...,Pragya Swastik,2022-05-19T12:20:09.000Z


### 7. codeup_df
Make another dataframe for the Codeup blog posts. Name the dataframe codeup_df.

In [22]:
codeup_df = acquire.get_blog_articles(acquire.urls)
codeup_df.head()

Unnamed: 0,title,published,contents
0,Codeup Start Dates for March 2022,"Jan 26, 2022",\nAs we approach the end of January we wanted ...
1,5 Books Every Woman In Tech Should Read,"Mar 8, 2022",\nOn this International Women’s Day 2022 we wa...
2,Dallas Campus Re-opens With New Grant Partner,"Dec 30, 2021",\n\n\n\n\n\nWe are happy to announce that our ...
3,Codeup’s Placement Team Continues Setting Records,"Nov 19, 2021",\n\n\n\n\n\nOur Placement Team is simply defin...
4,"IT Certifications 101: Why They Matter, and Wh...","Nov 18, 2021","\n\n\n\n\n\nAWS, Google, Azure, Red Hat, CompT..."


In [23]:
codeup_df = codeup_df.rename(columns={'contents':'original'})
codeup_df.head()

Unnamed: 0,title,published,original
0,Codeup Start Dates for March 2022,"Jan 26, 2022",\nAs we approach the end of January we wanted ...
1,5 Books Every Woman In Tech Should Read,"Mar 8, 2022",\nOn this International Women’s Day 2022 we wa...
2,Dallas Campus Re-opens With New Grant Partner,"Dec 30, 2021",\n\n\n\n\n\nWe are happy to announce that our ...
3,Codeup’s Placement Team Continues Setting Records,"Nov 19, 2021",\n\n\n\n\n\nOur Placement Team is simply defin...
4,"IT Certifications 101: Why They Matter, and Wh...","Nov 18, 2021","\n\n\n\n\n\nAWS, Google, Azure, Red Hat, CompT..."


8.For each dataframe, produce the following columns:

- title to hold the title
- original to hold the original article/post content
- clean to hold the normalized and tokenized original with the stopwords removed.
- stemmed to hold the stemmed version of the cleaned data.
- lemmatized to hold the lemmatized version of the cleaned data.

In [24]:
def prep_article_data(df, column, extra_words =[], exclude_words=[]):
    '''
    This function takes in a dataframe, and the string name for a column with an option to
    pass extra_word and exclude_word lists. It returns a dataframe with the text title,
    original text, cleaned text with stop words removed which has had tokenization applied to it, 
    stemmed, text, and lemmatized text.
    '''
    
    df['clean'] = df[column].apply(basic_clean)\
                                   .apply(tokenize)\
                                   .apply (remove_stopwords, 
                                           extra_words = extra_words, 
                                           exclude_words= exclude_words)
    df['stemmed'] = df['clean'].apply(stem)
    df['lemmatized'] = df['clean'].apply(lemmatize)

    return df[['title', column, 'clean', 'stemmed', 'lemmatized']]

In [25]:
prep_article_data(codeup_df, 'original', extra_words = ['ha'], exclude_words = ['no']).head()

Removed 117 stopwords
---
approach end anuary wanted look forward next start dates current programs ull tack eb evelopment 3722 ull tack eb evelopment first program built also popular ouve asked listened ur next eb evelopment cohort start 372022 one indemand jobs country software web development tech career newest jobs n theres 15 million developer jobs 250000 remain open high growth rate 13 ata cience 32222 ur first new ata cience class 2022 starts onday 3222022 downtown campus ogue building hy consider pivoting careers ata cience 1 job merica 20162020 lassdoor 650 increase data science positions since 2012 early 12 million new jobs 2019 2029 31 tenyear growth rate supply data scientists remains painfully low compared outrageous demand help close gap launching fulfilling secure highpaying career one best country mployers scrambling find talent due lack qualified applicants help fill gap futureproofing skillset ave flexibility security salary youve always wanted career ready launch car

Unnamed: 0,title,original,clean,stemmed,lemmatized
0,Codeup Start Dates for March 2022,\nAs we approach the end of January we wanted ...,approach end anuary wanted look forward next s...,approach end anuary wanted look forward next s...,approach end anuary wanted look forward next s...
1,5 Books Every Woman In Tech Should Read,\nOn this International Women’s Day 2022 we wa...,n nternational omens ay 2022 wanted tell stori...,n nternational omens ay 2022 wanted tell stori...,n nternational omens ay 2022 wanted tell stori...
2,Dallas Campus Re-opens With New Grant Partner,\n\n\n\n\n\nWe are happy to announce that our ...,e happy announce allas campus reopened etter y...,e happy announce allas campus reopened etter y...,e happy announce allas campus reopened etter y...
3,Codeup’s Placement Team Continues Setting Records,\n\n\n\n\n\nOur Placement Team is simply defin...,ur lacement eam simply defined group manages r...,ur lacement eam simply defined group manages r...,ur lacement eam simply defined group manages r...
4,"IT Certifications 101: Why They Matter, and Wh...","\n\n\n\n\n\nAWS, Google, Azure, Red Hat, CompT...",oogle zure ed ompthese big names nd products a...,oogle zure ed ompthese big names nd products a...,oogle zure ed ompthese big names nd products a...


In [28]:
news_df = news_df.rename(columns={'content':'original'})
news_df.head()

Unnamed: 0,category,title,original,author,published
0,business,Apple delays plan requiring employees to come ...,Apple has delayed its plan that required its e...,Pragya Swastik,2022-05-18T10:17:10.000Z
1,business,"Price of domestic LPG cylinder crosses ₹1,000-...",The price of a 14.2-kg domestic LPG cylinder w...,Apaar Sharma,2022-05-19T04:23:45.000Z
2,business,Wheat shouldn't go the way of COVID-19 vaccine...,"Calling out the West, India said that wheat sh...",Apaar Sharma,2022-05-19T03:56:52.000Z
3,business,Rupee closes at a new all-time low of 77.58 ag...,The Indian rupee closed at a new all-time low ...,Anmol Sharma,2022-05-18T11:11:34.000Z
4,business,Investors lose ₹7 lakh crore as Sensex crashes...,The wealth of investors tumbled by ₹7 lakh cro...,Pragya Swastik,2022-05-19T12:20:09.000Z


In [29]:
prep_article_data(news_df, 'original', extra_words = ['ha'], exclude_words = ['no']).head()

Removed 25 stopwords
---
pple delayed plan required employees come office three days week according memo seen loomberg pple said requirement delayed time without providing new date technology company still requires employees come office two days week
Removed 26 stopwords
---
price 142kg domestic cylinder hiked 350 hursday second move month ith cost cylinder crosses 1000mark according ith price rise domestic cylinder cost 1003 elhi umbai 1029 olkata 10185 hennai today
Removed 28 stopwords
---
alling est ndia said wheat go way 19 vaccines voiced concern hoarding discrimination amid unjustified increase food prices necessary us adequately appreciate importance equity affordability accessibility comes food grains ndia said
Removed 21 stopwords
---
ndian rupee closed new alltime low 7758 dollar ednesday ndian rupee closed 7747 dollar uesday eanwhile stock market slipped red halting twoday upward trend ensex settled 54209 points ifty stood 16240 points
Removed 23 stopwords
---
wealth investo

Unnamed: 0,title,original,clean,stemmed,lemmatized
0,Apple delays plan requiring employees to come ...,Apple has delayed its plan that required its e...,pple delayed plan required employees come offi...,pple delayed plan required employees come offi...,pple delayed plan required employees come offi...
1,"Price of domestic LPG cylinder crosses ₹1,000-...",The price of a 14.2-kg domestic LPG cylinder w...,price 142kg domestic cylinder hiked 350 hursda...,price 142kg domestic cylinder hiked 350 hursda...,price 142kg domestic cylinder hiked 350 hursda...
2,Wheat shouldn't go the way of COVID-19 vaccine...,"Calling out the West, India said that wheat sh...",alling est ndia said wheat go way 19 vaccines ...,alling est ndia said wheat go way 19 vaccines ...,alling est ndia said wheat go way 19 vaccines ...
3,Rupee closes at a new all-time low of 77.58 ag...,The Indian rupee closed at a new all-time low ...,ndian rupee closed new alltime low 7758 dollar...,ndian rupee closed new alltime low 7758 dollar...,ndian rupee closed new alltime low 7758 dollar...
4,Investors lose ₹7 lakh crore as Sensex crashes...,The wealth of investors tumbled by ₹7 lakh cro...,wealth investors tumbled 7 lakh crore hursday ...,wealth investors tumbled 7 lakh crore hursday ...,wealth investors tumbled 7 lakh crore hursday ...


9. Ask Yourself

- If your corpus is 493KB, would you prefer to use stemmed or lemmatized text?
     - Lemmatized text because it is a smaller dataset, and lemmatizing will result in more accurate identification of the 'meaning' of the word, identifying the lexicographically correct root word. 
- If your corpus is 25MB, would you prefer to use stemmed or lemmatized text?
    - It depends on the amount of time I have. Stemmed text could be better if short on time because it is a larger dataset, and lemmatizing, although it will result in more accurate identification of the 'meaning' of the word, it is considerably slower for larger datasets. However, if not short on time, I would want the most accurate results through lemmatized text. 
- If your corpus is 200TB of text and you're charged by the megabyte for your hosted computational resources, would you prefer to use stemmed or lemmatized text?
    - Stemmed text because it is a larger dataset, and lemmatizing, although it will result in more accurate identification of the 'meaning' of the word, it is considerably slower for larger datasets. 
