# Data Prep Exercises

In [1]:
import pandas as pd
import numpy as np
import unicodedata
import re

from bs4 import BeautifulSoup
import requests
import os
import json

import nltk
from nltk.corpus import stopwords

import wrangle as wr


In [2]:
#nltk.download('all')

In this exercise we will define some functions to prepare textual data. These functions should apply equally well to both the Codeup blog articles and the news articles that were previously acquired.

Define a function named basic_clean. It should take in a string and apply some basic text cleaning to it:

- Lowercase everything
- Normalize unicode characters
- Remove anything that is not a letter, number, whitespace, or a single quote.

In [127]:
#news_df['cleaned'] = (news_df['original'].str.lower()
#                .str.decode('utf-8')
#                .map(lambda x: unicodedata.normalize('NFKD', news_df['original']))
#                .str.encode('ascii', 'ignore'))

In [128]:
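def basic_clean(string):
    
    # Lowercase everything
    string = string.lower()
    # Normalize unicode, then drop any characters that have no ASCII equivalent
    string = unicodedata.normalize('NFKD', string).encode('ascii', 'ignore').decode('utf-8')
    # Remove anything that is not a letter, number, whitespace or single quote
    string = re.sub(r'[^a-z0-9\s\']', '', string)
    
    return string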

In [4]:
string = "Smitty and Ian are 'Giant' nerds, and so is Andrew."

In [5]:
string = basic_clean(string)
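
To see what each step of `basic_clean` contributes, here is a small stand-alone sketch that applies the same three steps one at a time. The sample string is made up for illustration, and `show_cleaning_steps` is a hypothetical helper, not part of the exercise.

```python
import re
import unicodedata


def show_cleaning_steps(text):
    """Return the intermediate results of the three cleaning steps."""
    lowered = text.lower()
    # NFKD splits accented characters into a base letter plus combining marks;
    # encoding to ASCII with errors='ignore' then drops the marks
    folded = (unicodedata.normalize('NFKD', lowered)
              .encode('ascii', 'ignore')
              .decode('utf-8'))
    # Keep only lowercase letters, digits, whitespace and single quotes
    stripped = re.sub(r"[^a-z0-9\s']", '', folded)
    return lowered, folded, stripped


lowered, folded, stripped = show_cleaning_steps("Café-goers don't pay $5!")
print(lowered)
print(folded)
print(stripped)
```

Characters are removed rather than replaced with a space, so hyphenated words are merged. Characters with no ASCII decomposition, such as a curly apostrophe, are dropped entirely, which is why "you’re" in the blog text later appears as `youre`.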

Define a function named tokenize. It should take in a string and tokenize all the words in the string.

In [6]:
def tokenize(string):
    
    # Use a separate name for the tokenizer so it does not shadow this function
    tokenizer = nltk.tokenize.ToktokTokenizer()
    tokens = tokenizer.tokenize(string)
    
    return tokens

In [7]:
string = tokenize(string)

In [8]:
string

['smitty',
 'and',
 'ian',
 'are',
 "'",
 'giant',
 "'",
 'nerds',
 'and',
 'so',
 'is',
 'andrew']

Define a function named stem. It should accept some text and return the text after applying stemming to all the words.



In [9]:
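def stem(string):
    
    # `string` is expected to be a list of tokens, as returned by tokenize
    ps = nltk.porter.PorterStemmer()
    string = [ps.stem(word) for word in string]
    # Join the stemmed tokens back into a single string
    string = ' '.join(string)
    
    return string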

In [10]:
stem_string = stem(string)

In [11]:
stem_string

"smitti and ian are ' giant ' nerd and so is andrew"

Define a function named lemmatize. It should accept some text and return the text after applying lemmatization to each word.



In [12]:
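def lemmatize(string):
    
    # `string` is expected to be a list of tokens, as returned by tokenize
    wnl = nltk.stem.WordNetLemmatizer()
    string = [wnl.lemmatize(word) for word in string]
    # Join the lemmatized tokens back into a single string
    string = ' '.join(string)
    
    return string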

In [13]:
lem_string = lemmatize(string)

In [14]:
lem_string

"smitty and ian are ' giant ' nerd and so is andrew"

Define a function named remove_stopwords. It should accept some text and return the text after removing all the stopwords.

This function should define two optional parameters, extra_words and exclude_words. These parameters should define any additional stop words to include, and any words that we don't want to remove.

In [15]:
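def remove_stopwords(string, extra_words=None, exclude_words=None):
    
    # Start from NLTK's English stopwords, add extra_words, then drop exclude_words
    stopwords_english = set(stopwords.words('english'))
    stopwords_english = (stopwords_english | set(extra_words or [])) - set(exclude_words or [])
    # `string` is expected to be a list of tokens, as returned by tokenize
    string_minus_stopwords = [word for word in string if word not in stopwords_english]
    
    return string_minus_stopwords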

In [16]:
remove_stopwords(string)

['smitty', 'ian', "'", 'giant', "'", 'nerds', 'andrew']
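
The two optional parameters described above can be handled with set arithmetic. The following is a minimal sketch under stated assumptions: `filter_stopwords` and the tiny `BASE_STOPWORDS` set are hypothetical stand-ins so the example runs without the NLTK corpus. In the notebook, the base list would come from `stopwords.words('english')`.

```python
# Tiny stand-in for stopwords.words('english') (assumption, for illustration only)
BASE_STOPWORDS = {'and', 'are', 'so', 'is'}


def filter_stopwords(tokens, extra_words=None, exclude_words=None):
    """Drop stopwords from a list of tokens.

    extra_words: additional words to treat as stopwords.
    exclude_words: words to keep even though they are in the base list.
    """
    stop_set = (BASE_STOPWORDS | set(extra_words or [])) - set(exclude_words or [])
    return [token for token in tokens if token not in stop_set]


tokens = ['smitty', 'and', 'ian', 'are', 'giant', 'nerds']
default_result = filter_stopwords(tokens)
custom_result = filter_stopwords(tokens, extra_words=['giant'], exclude_words=['and'])
print(default_result)
print(custom_result)
```

Defaulting both parameters to `None` avoids sharing a mutable default list between calls, and building a set keeps each membership check cheap on long documents.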

Use your data from the acquire exercise to produce a dataframe of the news articles. Name the dataframe news_df.

In [17]:
topic_list = ['technology', 'entertainment', 'world', 'business']

news = wr.get_news_articles(topic_list)

news

[{'category': 'technology',
  'title': "'Best wishes to my classmate,' writes Gates in book gifted to Mahindra, latter shares pic",
  'content': 'Businessman Anand Mahindra on Tuesday met Microsoft Co-founder Bill Gates, who is currently on a visit to India. Mahindra said that Gates gifted an autographed copy of his book to him with a message saying, "Best wishes to my classmate!" Mahindra had earlier said that Gates started Harvard College the same year as him in 1973 but later dropped out.'},
 {'category': 'technology',
  'title': 'Identify 10 problems of the society that AI can solve: PM to tech stakeholders',
  'content': "PM Narendra Modi on Tuesday asked tech stakeholders whether they could identify 10 problems of the society that can be solved using artificial intelligence (AI). Addressing the post-budget webinar on 'Unleashing the Potential — Ease of Living using Technology', he said that India is creating a modern digital infrastructure. Technologies like 5G and AI are leading

In [18]:
news_df = pd.DataFrame(news)

news_df.head()

Unnamed: 0,category,title,content
0,technology,"'Best wishes to my classmate,' writes Gates in...",Businessman Anand Mahindra on Tuesday met Micr...
1,technology,Identify 10 problems of the society that AI ca...,PM Narendra Modi on Tuesday asked tech stakeho...
2,technology,Fake LinkedIn profile of founder created using...,Walnut CEO Roshan Patel revealed that he creat...
3,technology,Soon our farmers will produce green fuel & gre...,Union Minister for Road Transport and Highways...
4,technology,Technology will help India become developed na...,Technology will help India achieve the target ...


Make another dataframe for the Codeup blog posts. Name the dataframe codeup_df.

In [107]:
headers = {'User-Agent': 'Codeup Data Science'}

url = 'https://codeup.com/blog/'

response = requests.get(url, headers=headers)

soup = BeautifulSoup(response.content, 'html.parser')

more_links = soup.find_all('a', class_='more-link')

links_list = [link['href'] for link in more_links]

links_list

['https://codeup.com/codeup-news/panelist-spotlight-4/',
 'https://codeup.com/events/black-excellence-in-tech-panelist-spotlight-stephanie-jones/',
 'https://codeup.com/events/black-excellence-in-tech-panelist-spotlight-james-cooper/',
 'https://codeup.com/events/black-excellence-in-tech-panelist-spotlight/',
 'https://codeup.com/tips-for-prospective-students/coding-bootcamp-or-self-learning/',
 'https://codeup.com/codeup-news/codeup-best-bootcamps/']

In [108]:
article_info = wr.get_blog_articles(links_list)

article_info

[{'title': 'Black Excellence in Tech: Panelist Spotlight – Wilmarie De La Cruz Mejia',
  'link': 'https://codeup.com/codeup-news/panelist-spotlight-4/',
  'date_published': 'Feb 16, 2023',
  'content': '\nBlack excellence in tech: Panelist Spotlight – Wilmarie De La Cruz Mejia\n\nCodeup is hosting a Black Excellence in Tech Panel in honor of Black History Month on February 22, 2023! To further celebrate, we’d like to spotlight each of our panelists leading up to the discussion to learn a bit about their respective experiences as black leaders in the tech industry!\xa0\xa0\nMeet Wilmarie!\nWilmarie De\xa0La Cruz Mejia is a current Codeup student on the path to becoming a Full-Stack Web Developer at our Dallas, TX campus.\xa0\nWilmarie is a veteran expanding her knowledge of programming languages and technologies on her journey with Codeup.\xa0\nWe asked Wilmarie to share more about her experience at Codeup. She shares, “I was able to meet other people who were passionate about coding an

In [154]:
article_df = pd.DataFrame(article_info)

article_df

Unnamed: 0,title,link,date_published,content
0,Black Excellence in Tech: Panelist Spotlight –...,https://codeup.com/codeup-news/panelist-spotli...,"Feb 16, 2023",\nBlack excellence in tech: Panelist Spotlight...
1,Black excellence in tech: Panelist Spotlight –...,https://codeup.com/events/black-excellence-in-...,"Feb 13, 2023",\nBlack excellence in tech: Panelist Spotlight...
2,Black excellence in tech: Panelist Spotlight –...,https://codeup.com/events/black-excellence-in-...,"Feb 10, 2023",\nBlack excellence in tech: Panelist Spotlight...
3,Black excellence in tech: Panelist Spotlight –...,https://codeup.com/events/black-excellence-in-...,"Feb 6, 2023",\nBlack excellence in tech: Panelist Spotlight...
4,Coding Bootcamp or Self-Learning? Which is Bes...,https://codeup.com/tips-for-prospective-studen...,"Jan 20, 2023",\nIf you’re interested in embarking on a caree...
5,Codeup Among Top 58 Best Coding Bootcamps of 2023,https://codeup.com/codeup-news/codeup-best-boo...,"Jan 12, 2023",\nCodeup is pleased to announce we have been r...


For each dataframe, produce the following columns:

- title to hold the title
- original to hold the original article/post content
- clean to hold the normalized and tokenized original with the stopwords removed.
- stemmed to hold the stemmed version of the cleaned data.
- lemmatized to hold the lemmatized version of the cleaned data.
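
One way to satisfy the column list above without chaining stemming and lemmatization is to derive both from the same cleaned tokens. The sketch below works on a list of dicts with `title` and `content` keys, the shape the acquire step returns. `prepare_records`, `basic_clean_sketch`, and the stand-in stemmer, lemmatizer and stopword set are hypothetical names chosen so the example runs without NLTK data.

```python
import re
import unicodedata


def basic_clean_sketch(text):
    # Same three steps as the exercise: lowercase, fold unicode to ASCII, drop other characters
    text = unicodedata.normalize('NFKD', text.lower())
    text = text.encode('ascii', 'ignore').decode('utf-8')
    return re.sub(r"[^a-z0-9\s']", '', text)


def prepare_records(records, stem_fn, lemmatize_fn, stopword_set):
    """Return new dicts holding title, original, clean, stemmed and lemmatized."""
    prepared = []
    for record in records:
        tokens = basic_clean_sketch(record['content']).split()
        kept = [t for t in tokens if t not in stopword_set]
        prepared.append({
            'title': record['title'],
            'original': record['content'],
            'clean': ' '.join(kept),
            # Both derived columns start from the same cleaned tokens
            'stemmed': ' '.join(stem_fn(t) for t in kept),
            'lemmatized': ' '.join(lemmatize_fn(t) for t in kept),
        })
    return prepared


# Stand-in callables so the sketch runs without NLTK data
prepared = prepare_records(
    [{'title': 'Demo', 'content': 'The cats are running!'}],
    stem_fn=lambda w: w[:-1] if w.endswith('s') else w,
    lemmatize_fn=lambda w: {'cats': 'cat', 'running': 'run'}.get(w, w),
    stopword_set={'the', 'are'},
)
print(prepared[0])
```

In the notebook, `stem_fn` and `lemmatize_fn` could be the `stem` method of a Porter stemmer and the `lemmatize` method of a WordNet lemmatizer, and `pd.DataFrame(prepared)` would turn the result into a dataframe.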

In [26]:
news_df = news_df.rename(columns={'content':'original'})

In [175]:
news_df

Unnamed: 0,category,title,original
0,technology,"'Best wishes to my classmate,' writes Gates in...",Businessman Anand Mahindra on Tuesday met Micr...
1,technology,Identify 10 problems of the society that AI ca...,PM Narendra Modi on Tuesday asked tech stakeho...
2,technology,Fake LinkedIn profile of founder created using...,Walnut CEO Roshan Patel revealed that he creat...
3,technology,Soon our farmers will produce green fuel & gre...,Union Minister for Road Transport and Highways...
4,technology,Technology will help India become developed na...,Technology will help India achieve the target ...
...,...,...,...
95,business,India kept its interest at the top in importin...,Finance Minister Nirmala Sitharaman said that ...
96,business,"HDFC, Punjab National Bank hike lending rates ...",Housing finance major HDFC and state-run Punja...
97,business,RIL sets up subsidiary to develop commercial p...,Reliance Industries Limited (RIL) has set up a...
98,business,"Moody's slashes Pak's rating to 'Caa3', change...",Moody's Investors Service said it has downgrad...


In [180]:
for index, row in news_df.iterrows():
    
    # The normalization form is spelled 'NFKD'; a misspelled form raises ValueError
    news_df.at[index, 'cleaned'] = unicodedata.normalize('NFKD', row['original'])
    
print(news_df.head())

In [30]:
original_content = news_df['original']

In [97]:
def clean_series(df, target):
    
    # Apply basic_clean to every string in the target column
    cleaned = [basic_clean(string) for string in df[target]]
    
    return cleaned

In [155]:
article_df = article_df.rename(columns={'content': 'original'})
                                        
article_df

Unnamed: 0,title,link,date_published,original
0,Black Excellence in Tech: Panelist Spotlight –...,https://codeup.com/codeup-news/panelist-spotli...,"Feb 16, 2023",\nBlack excellence in tech: Panelist Spotlight...
1,Black excellence in tech: Panelist Spotlight –...,https://codeup.com/events/black-excellence-in-...,"Feb 13, 2023",\nBlack excellence in tech: Panelist Spotlight...
2,Black excellence in tech: Panelist Spotlight –...,https://codeup.com/events/black-excellence-in-...,"Feb 10, 2023",\nBlack excellence in tech: Panelist Spotlight...
3,Black excellence in tech: Panelist Spotlight –...,https://codeup.com/events/black-excellence-in-...,"Feb 6, 2023",\nBlack excellence in tech: Panelist Spotlight...
4,Coding Bootcamp or Self-Learning? Which is Bes...,https://codeup.com/tips-for-prospective-studen...,"Jan 20, 2023",\nIf you’re interested in embarking on a caree...
5,Codeup Among Top 58 Best Coding Bootcamps of 2023,https://codeup.com/codeup-news/codeup-best-boo...,"Jan 12, 2023",\nCodeup is pleased to announce we have been r...


In [168]:
article_ogs = article_df['original'].copy()

In [169]:
# Apply basic_clean to every article instead of indexing each row by hand
article_ogs = article_ogs.apply(basic_clean)

article_ogs

0    \nblack excellence in tech panelist spotlight ...
1    \nblack excellence in tech panelist spotlight ...
2    \nblack excellence in tech panelist spotlight ...
3    \nblack excellence in tech panelist spotlight ...
4    \nif youre interested in embarking on a career...
5    \ncodeup is pleased to announce we have been r...
Name: original, dtype: object

In [170]:
# Tokenize every cleaned article
article_ogs = article_ogs.apply(tokenize)

article_ogs

0    [black, excellence, in, tech, panelist, spotli...
1    [black, excellence, in, tech, panelist, spotli...
2    [black, excellence, in, tech, panelist, spotli...
3    [black, excellence, in, tech, panelist, spotli...
4    [if, youre, interested, in, embarking, on, a, ...
5    [codeup, is, pleased, to, announce, we, have, ...
Name: original, dtype: object

In [171]:
# Remove stopwords from every token list
article_ogs = article_ogs.apply(remove_stopwords)

article_ogs

0    [black, excellence, tech, panelist, spotlight,...
1    [black, excellence, tech, panelist, spotlight,...
2    [black, excellence, tech, panelist, spotlight,...
3    [black, excellence, tech, panelist, spotlight,...
4    [youre, interested, embarking, career, tech, l...
5    [codeup, pleased, announce, ranked, among, 58,...
Name: original, dtype: object

In [163]:
article_df['cleaned'] = article_ogs

article_df

Unnamed: 0,title,link,date_published,original,cleaned
0,Black Excellence in Tech: Panelist Spotlight –...,https://codeup.com/codeup-news/panelist-spotli...,"Feb 16, 2023",\nBlack excellence in tech: Panelist Spotlight...,"[black, excellence, tech, panelist, spotlight,..."
1,Black excellence in tech: Panelist Spotlight –...,https://codeup.com/events/black-excellence-in-...,"Feb 13, 2023",\nBlack excellence in tech: Panelist Spotlight...,"[black, excellence, tech, panelist, spotlight,..."
2,Black excellence in tech: Panelist Spotlight –...,https://codeup.com/events/black-excellence-in-...,"Feb 10, 2023",\nBlack excellence in tech: Panelist Spotlight...,"[black, excellence, tech, panelist, spotlight,..."
3,Black excellence in tech: Panelist Spotlight –...,https://codeup.com/events/black-excellence-in-...,"Feb 6, 2023",\nBlack excellence in tech: Panelist Spotlight...,"[black, excellence, tech, panelist, spotlight,..."
4,Coding Bootcamp or Self-Learning? Which is Bes...,https://codeup.com/tips-for-prospective-studen...,"Jan 20, 2023",\nIf you’re interested in embarking on a caree...,"[youre, interested, embarking, career, tech, l..."
5,Codeup Among Top 58 Best Coding Bootcamps of 2023,https://codeup.com/codeup-news/codeup-best-boo...,"Jan 12, 2023",\nCodeup is pleased to announce we have been r...,"[codeup, pleased, announce, ranked, among, 58,..."


In [174]:
# Stem into a separate Series so article_ogs keeps the unstemmed tokens for lemmatizing
article_stemmed = article_ogs.apply(stem)

In [166]:
article_df['stemmed'] = article_stemmed

article_df

Unnamed: 0,title,link,date_published,original,cleaned,stemmed
0,Black Excellence in Tech: Panelist Spotlight –...,https://codeup.com/codeup-news/panelist-spotli...,"Feb 16, 2023",\nBlack excellence in tech: Panelist Spotlight...,"[black, excellence, tech, panelist, spotlight,...",black excel tech panelist spotlight wilmari de...
1,Black excellence in tech: Panelist Spotlight –...,https://codeup.com/events/black-excellence-in-...,"Feb 13, 2023",\nBlack excellence in tech: Panelist Spotlight...,"[black, excellence, tech, panelist, spotlight,...",black excel tech panelist spotlight stephani j...
2,Black excellence in tech: Panelist Spotlight –...,https://codeup.com/events/black-excellence-in-...,"Feb 10, 2023",\nBlack excellence in tech: Panelist Spotlight...,"[black, excellence, tech, panelist, spotlight,...",black excel tech panelist spotlight jame coope...
3,Black excellence in tech: Panelist Spotlight –...,https://codeup.com/events/black-excellence-in-...,"Feb 6, 2023",\nBlack excellence in tech: Panelist Spotlight...,"[black, excellence, tech, panelist, spotlight,...",black excel tech panelist spotlight jeanic fre...
4,Coding Bootcamp or Self-Learning? Which is Bes...,https://codeup.com/tips-for-prospective-studen...,"Jan 20, 2023",\nIf you’re interested in embarking on a caree...,"[youre, interested, embarking, career, tech, l...",your interest embark career tech like taken lo...
5,Codeup Among Top 58 Best Coding Bootcamps of 2023,https://codeup.com/codeup-news/codeup-best-boo...,"Jan 12, 2023",\nCodeup is pleased to announce we have been r...,"[codeup, pleased, announce, ranked, among, 58,...",codeup pleas announc rank among 58 best code b...


In [172]:
# Lemmatize every token list
article_ogs = article_ogs.apply(lemmatize)

article_ogs

0    black excellence tech panelist spotlight wilma...
1    black excellence tech panelist spotlight steph...
2    black excellence tech panelist spotlight james...
3    black excellence tech panelist spotlight jeani...
4    youre interested embarking career tech likely ...
5    codeup pleased announce ranked among 58 best c...
Name: original, dtype: object

In [173]:
article_df['lemmatized'] = article_ogs

article_df

Unnamed: 0,title,link,date_published,original,cleaned,stemmed,lemmatized
0,Black Excellence in Tech: Panelist Spotlight –...,https://codeup.com/codeup-news/panelist-spotli...,"Feb 16, 2023",\nBlack excellence in tech: Panelist Spotlight...,"[black, excellence, tech, panelist, spotlight,...",black excel tech panelist spotlight wilmari de...,black excellence tech panelist spotlight wilma...
1,Black excellence in tech: Panelist Spotlight –...,https://codeup.com/events/black-excellence-in-...,"Feb 13, 2023",\nBlack excellence in tech: Panelist Spotlight...,"[black, excellence, tech, panelist, spotlight,...",black excel tech panelist spotlight stephani j...,black excellence tech panelist spotlight steph...
2,Black excellence in tech: Panelist Spotlight –...,https://codeup.com/events/black-excellence-in-...,"Feb 10, 2023",\nBlack excellence in tech: Panelist Spotlight...,"[black, excellence, tech, panelist, spotlight,...",black excel tech panelist spotlight jame coope...,black excellence tech panelist spotlight james...
3,Black excellence in tech: Panelist Spotlight –...,https://codeup.com/events/black-excellence-in-...,"Feb 6, 2023",\nBlack excellence in tech: Panelist Spotlight...,"[black, excellence, tech, panelist, spotlight,...",black excel tech panelist spotlight jeanic fre...,black excellence tech panelist spotlight jeani...
4,Coding Bootcamp or Self-Learning? Which is Bes...,https://codeup.com/tips-for-prospective-studen...,"Jan 20, 2023",\nIf you’re interested in embarking on a caree...,"[youre, interested, embarking, career, tech, l...",your interest embark career tech like taken lo...,youre interested embarking career tech likely ...
5,Codeup Among Top 58 Best Coding Bootcamps of 2023,https://codeup.com/codeup-news/codeup-best-boo...,"Jan 12, 2023",\nCodeup is pleased to announce we have been r...,"[codeup, pleased, announce, ranked, among, 58,...",codeup pleas announc rank among 58 best code b...,codeup pleased announce ranked among 58 best c...


Ask yourself:

- If your corpus is 493KB, would you prefer to use stemmed or lemmatized text?
- If your corpus is 25MB, would you prefer to use stemmed or lemmatized text?
- If your corpus is 200TB of text and you're charged by the megabyte for your hosted computational resources, would you prefer to use stemmed or lemmatized text?

    - For a corpus of 493 KB, I would most likely try both methods and see which works better with my models.

    - For a corpus of 25 MB, I would likely do the same as above.

    - For a corpus of 200 TB, I would definitely stem, and only lemmatize if it was determined to be absolutely necessary.
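
To put the three corpus sizes on one scale, the arithmetic below converts each to megabytes, the billing unit in the last question. It assumes binary units (1 MB = 1024 KB, 1 TB = 1024 × 1024 MB), and the helper name `to_megabytes` is made up for this sketch.

```python
# Conversion factors to megabytes, assuming binary units
MB_PER_UNIT = {'KB': 1 / 1024, 'MB': 1, 'TB': 1024 * 1024}


def to_megabytes(size, unit):
    """Convert a size in KB, MB or TB to megabytes."""
    return size * MB_PER_UNIT[unit]


for size, unit in [(493, 'KB'), (25, 'MB'), (200, 'TB')]:
    print(f'{size} {unit} = {to_megabytes(size, unit):,.2f} MB')
```

Under these assumptions the largest corpus is roughly 435 million times the size of the smallest, so any per-document cost difference between stemming and lemmatizing scales with it.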