# NLP Exercises - Data Preparation

In [1]:
import unicodedata
import re
import json

import nltk
from nltk.tokenize.toktok import ToktokTokenizer
from nltk.corpus import stopwords

import pandas as pd

import acquire as a

In [2]:
blogs = a.get_blog_articles_data()

In [3]:
blogs.head()

Unnamed: 0,title,content
0,Women in tech: Panelist Spotlight – Magdalena ...,\nCodeup is hosting a Women in Tech Panel in h...
1,Women in tech: Panelist Spotlight – Rachel Rob...,\nCodeup is hosting a Women in Tech Panel in h...
2,Women in Tech: Panelist Spotlight – Sarah Mellor,\nCodeup is hosting a Women in Tech Panel in ...
3,Women in Tech: Panelist Spotlight – Madeleine ...,\nCodeup is hosting a Women in Tech Panel in h...
4,Black Excellence in Tech: Panelist Spotlight –...,\n\nCodeup is hosting a Black Excellence in Te...


## 1. Define a function named `basic_clean`.

It should take in a string and apply some basic text cleaning to it:
* Lowercase everything
* Normalize unicode characters
* Remove anything that is not a letter, number, whitespace, or a single quote.

In [4]:
def basic_clean(string):
    # lowercase all letters
    string = string.lower()
    # normalize unicode characters: decompose them (NFKD), then drop
    # anything that has no ASCII equivalent
    string = unicodedata.normalize('NFKD', string)\
        .encode('ascii', 'ignore')\
        .decode('utf-8', 'ignore')
    # remove everything that isn't a lowercase letter, a number,
    # whitespace, or a single quote
    string = re.sub(r"[^a-z0-9'\s]", '', string)
    return string

In [5]:
basic_clean(blogs.content[0])

'\ncodeup is hosting a women in tech panel in honor of womens history month on march 29th 2023 to further celebrate wed like to spotlight each of our panelists leading up to the discussion to learn a bit about their respective experiences as women in the tech industry\n\nmeet magdalena\nmagdalena rahn is a current codeup student in a data science cohort in san antonio texas she has a professional background in crosscultural communications international business development the wine industry and journalism after serving in the us navy she decided to complement her professional skill set by attending the data science program at codeup she is set to graduate in march 2023 magdalena is fluent in french bulgarian chinesemandarin spanish and italian\nwe asked magdalena how codeup impacted her career and she replied codeup has provided a solid foundation in analytical processes programming and data science methods and its been an encouragement to have such supportive instructors and wonderful
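To see what each cleaning step contributes, here is a minimal sketch that returns the intermediate results for a short string. The helper name `basic_clean_steps` and the sample strings are illustrative only; they are not part of the exercise. It also shows why "Women’s" appears as "womens" in the output above: the curly apostrophe has no ASCII equivalent, so the normalization step drops it.

```python
import re
import unicodedata


def basic_clean_steps(text):
    """Return the result of each cleaning step as a tuple."""
    lowered = text.lower()
    # NFKD splits accented letters into a base letter plus a combining mark;
    # encoding to ASCII with errors="ignore" then drops every non-ASCII character.
    ascii_only = (
        unicodedata.normalize("NFKD", lowered)
        .encode("ascii", "ignore")
        .decode("ascii")
    )
    # Remove anything that is not a-z, 0-9, a straight single quote, or whitespace.
    stripped = re.sub(r"[^a-z0-9'\s]", "", ascii_only)
    return lowered, ascii_only, stripped


for step in basic_clean_steps("Crème Brûlée isn't 100% naïve!"):
    print(repr(step))

# A curly apostrophe (U+2019) is dropped entirely, unlike a straight one.
print(basic_clean_steps("Women\u2019s History Month")[-1])  # womens history month
```

The straight single quote in "isn't" survives because the regular expression allows it, while the accents, the percent sign, and the exclamation mark are removed.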

## 2. Define a function named tokenize. It should take in a string and tokenize all the words in the string.

In [6]:
def tokenize(string):
    # create the tokenizer object
    tokenizer = ToktokTokenizer()
    # return the tokenized text as one string, with tokens separated by spaces
    return tokenizer.tokenize(string, return_str=True)

In [7]:
tokenize(blogs.content[0])

'Codeup is hosting a Women in Tech Panel in honor of Women ’ s History Month on March 29th , 2023 ! To further celebrate , we ’ d like to spotlight each of our panelists leading up to the discussion to learn a bit about their respective experiences as women in the tech industry ! \n\nMeet Magdalena ! \nMagdalena Rahn is a current Codeup student in a Data Science cohort in San Antonio , Texas. She has a professional background in cross-cultural communications , international business development , the wine industry and journalism. After serving in the US Navy , she decided to complement her professional skill set by attending the Data Science program at Codeup ; she is set to graduate in March 2023. Magdalena is fluent in French , Bulgarian , Chinese-Mandarin , Spanish and Italian.\nWe asked Magdalena how Codeup impacted her career , and she replied “Codeup has provided a solid foundation in analytical processes , programming and data science methods , and it ’ s been an encouragement t
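The effect of `return_str=True` is easier to see on a short input. The sketch below, which assumes only that NLTK is installed, tokenizes a made-up sentence both ways: as a list of tokens (the default) and as a single space-separated string (what `tokenize` above returns).

```python
from nltk.tokenize.toktok import ToktokTokenizer

tokenizer = ToktokTokenizer()
sample = "Hello, world!"  # illustrative input, not from the blog data

# Default: a list of tokens, with punctuation split off from the words.
tokens = tokenizer.tokenize(sample)
# return_str=True: the same tokens joined into one string.
as_string = tokenizer.tokenize(sample, return_str=True)

print(tokens)     # ['Hello', ',', 'world', '!']
print(as_string)  # Hello , world !
```

Returning a string keeps the function composable with the other steps in this notebook, which all split on whitespace.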

## 3. Define a function named stem. It should accept some text and return the text after applying stemming to all the words.

In [8]:
def stem(text):
    # create the stemmer object
    ps = nltk.stem.PorterStemmer()
    # split the text on whitespace and stem each word
    stems = [ps.stem(word) for word in text.split()]
    # join the stems back into a single space-separated string
    article_stemmed = ' '.join(stems)
    return article_stemmed

In [9]:
stem(blogs.content[0])

'codeup is host a women in tech panel in honor of women’ histori month on march 29th, 2023! to further celebrate, we’d like to spotlight each of our panelist lead up to the discuss to learn a bit about their respect experi as women in the tech industry! meet magdalena! magdalena rahn is a current codeup student in a data scienc cohort in san antonio, texas. she ha a profession background in cross-cultur communications, intern busi development, the wine industri and journalism. after serv in the us navy, she decid to complement her profession skill set by attend the data scienc program at codeup; she is set to graduat in march 2023. magdalena is fluent in french, bulgarian, chinese-mandarin, spanish and italian. we ask magdalena how codeup impact her career, and she repli “codeup ha provid a solid foundat in analyt processes, program and data scienc methods, and it’ been an encourag to have such support instructor and wonder classmates.” don’t forget to tune in on march 29th to sit in o
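A few individual words make the stemmer's behavior in the output above easier to follow. This is a small sketch using words taken from that output; it assumes only that NLTK is installed (the Porter stemmer needs no corpus download).

```python
from nltk.stem import PorterStemmer

ps = PorterStemmer()

# Words taken from the article; the last two differ only by a trailing comma.
for word in ["hosting", "history", "science", "has", "communications", "communications,"]:
    print(f"{word!r:20} -> {ps.stem(word)!r}")
```

Stems such as "histori" and "scienc" are not dictionary words, and "has" is reduced to "ha". A word with punctuation still attached ("communications,") is left unchanged, which is why stemming works better on text that has already been cleaned and tokenized.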

## 4. Define a function named lemmatize. It should accept some text and return the text after applying lemmatization to each word.

In [10]:
def lemmatize(text):
    # create the lemmatization object
    wnl = nltk.stem.WordNetLemmatizer()
    # splits words in text to a list, then lemmatizes each word in the list
    lemmas = [wnl.lemmatize(word) for word in text.split()]
    # joins each word as a single string with a space between the words
    text = ' '.join(lemmas)
    return text

In [11]:
lemmatize(blogs.content[0])

'Codeup is hosting a Women in Tech Panel in honor of Women’s History Month on March 29th, 2023! To further celebrate, we’d like to spotlight each of our panelist leading up to the discussion to learn a bit about their respective experience a woman in the tech industry! Meet Magdalena! Magdalena Rahn is a current Codeup student in a Data Science cohort in San Antonio, Texas. She ha a professional background in cross-cultural communications, international business development, the wine industry and journalism. After serving in the US Navy, she decided to complement her professional skill set by attending the Data Science program at Codeup; she is set to graduate in March 2023. Magdalena is fluent in French, Bulgarian, Chinese-Mandarin, Spanish and Italian. We asked Magdalena how Codeup impacted her career, and she replied “Codeup ha provided a solid foundation in analytical processes, programming and data science methods, and it’s been an encouragement to have such supportive instructor 

## 5. Define a function named remove_stopwords. It should accept some text and return the text after removing all the stopwords.
This function should define two optional parameters, extra_words and exclude_words. These parameters should define any additional stop words to include, and any words that we don't want to remove.

In [12]:
stopword_list = stopwords.words('english')
stopword_list[:10]

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're"]

In [13]:
def remove_stopwords(text, extra_words=None, exclude_words=None):
    # start from NLTK's English stopword list
    stopword_list = stopwords.words('english')
    
    # remove 'exclude_words' from stopword_list to keep these in the text.
    stopword_list = set(stopword_list) - set(exclude_words or [])

    # add 'extra_words' to stopword_list.
    stopword_list = stopword_list.union(set(extra_words or []))
    
    # split the text into words on whitespace.
    words = text.split()
    
    # keep only the words that are not stopwords (the comparison is case-sensitive).
    filtered_words = [word for word in words if word not in stopword_list]
    
    # join the remaining words back into a single string.
    string_without_stopwords = ' '.join(filtered_words)
    
    return string_without_stopwords

In [14]:
remove_stopwords(blogs.content[0])

'Codeup hosting Women Tech Panel honor Women’s History Month March 29th, 2023! To celebrate, we’d like spotlight panelists leading discussion learn bit respective experiences women tech industry! Meet Magdalena! Magdalena Rahn current Codeup student Data Science cohort San Antonio, Texas. She professional background cross-cultural communications, international business development, wine industry journalism. After serving US Navy, decided complement professional skill set attending Data Science program Codeup; set graduate March 2023. Magdalena fluent French, Bulgarian, Chinese-Mandarin, Spanish Italian. We asked Magdalena Codeup impacted career, replied “Codeup provided solid foundation analytical processes, programming data science methods, it’s encouragement supportive instructors wonderful classmates.” Don’t forget tune March 29th sit insightful conversation Magdalena.'
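The output above still contains "To", "She", "We", and "After" because the stopword list is lowercase and the comparison is case-sensitive. The sketch below illustrates that, along with the two optional parameters. To stay runnable without the NLTK corpus it uses a tiny hypothetical stopword set, `BASE_STOPWORDS`, as a stand-in for `stopwords.words('english')`; the helper name `filter_words` is also made up for this example.

```python
# Hypothetical stand-in for stopwords.words('english'), kept tiny on purpose.
BASE_STOPWORDS = {"a", "in", "is", "of", "she", "the", "to", "we"}


def filter_words(text, extra_words=(), exclude_words=()):
    # Same set arithmetic as remove_stopwords: (base - exclude) | extra.
    stop = (BASE_STOPWORDS - set(exclude_words)) | set(extra_words)
    return " ".join(word for word in text.split() if word not in stop)


sentence = "She is a student in the data science cohort"

# Capitalized "She" is not in the lowercase set, so it survives.
print(filter_words(sentence))          # She student data science cohort
# Lowercasing first removes it.
print(filter_words(sentence.lower()))  # student data science cohort
# Treat "cohort" as a stopword, but keep "she".
print(filter_words(sentence.lower(), extra_words=["cohort"], exclude_words=["she"]))
# she student data science
```

This is one reason the later pipeline applies `basic_clean` before removing stopwords.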

## 6. Use your data from the acquire module to produce a dataframe of the news articles. Name the dataframe news_df.

In [15]:
news_df = a.get_news_articles_data(refresh=False)

In [16]:
news_df.head()

Unnamed: 0,title,content,category
0,RR's Yashasvi Jaiswal smashes fastest fifty in...,Rajasthan Royals (RR) opener Yashasvi Jaiswal ...,national
1,Rajasthan Royals record biggest win of IPL 202...,Rajasthan Royals (RR) on Thursday recorded the...,national
2,RR break record for scoring most runs in 1st o...,RR on Thursday broke the record for scoring mo...,national
3,Laxman Sivaramakrishnan mocks Kamal Haasan ove...,Ex-India spinner Laxman Sivaramakrishnan took ...,national
4,Which Indians have smashed fifty off 15 or les...,RR's Yashasvi Jaiswal today slammed the fastes...,national


## 7. Make another dataframe for the Codeup blog posts. Name the dataframe codeup_df.

In [18]:
codeup_df = a.get_blog_articles_data()

In [19]:
codeup_df

Unnamed: 0,title,content
0,Women in tech: Panelist Spotlight – Magdalena ...,\nCodeup is hosting a Women in Tech Panel in h...
1,Women in tech: Panelist Spotlight – Rachel Rob...,\nCodeup is hosting a Women in Tech Panel in h...
2,Women in Tech: Panelist Spotlight – Sarah Mellor,\nCodeup is hosting a Women in Tech Panel in ...
3,Women in Tech: Panelist Spotlight – Madeleine ...,\nCodeup is hosting a Women in Tech Panel in h...
4,Black Excellence in Tech: Panelist Spotlight –...,\n\nCodeup is hosting a Black Excellence in Te...
5,Black excellence in tech: Panelist Spotlight –...,\nCodeup is hosting our second Black Excellenc...


## 8. For each dataframe, produce the following columns:

* `title` to hold the title
* `original` to hold the original article/post content
* `clean` to hold the normalized and tokenized original with the stopwords removed.
* `stemmed` to hold the stemmed version of the cleaned data.
* `lemmatized` to hold the lemmatized version of the cleaned data.

In [20]:

news_df.content.apply(basic_clean).apply(tokenize).apply(remove_stopwords).apply(lemmatize).head()

0    rajasthan royal rr opener yashasvi jaiswal thu...
1    rajasthan royal rr thursday recorded biggest w...
2    rr thursday broke record scoring run first ipl...
3    exindia spinner laxman sivaramakrishnan took d...
4    rr ' yashasvi jaiswal today slammed fastest fi...
Name: content, dtype: object

In [21]:
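def prep_article_data(df, column, extra_words=None, exclude_words=None):
    # rename returns a new dataframe, so the caller's dataframe is left unchanged
    df = df.rename(columns={column:'original'})
    # clean: normalize, tokenize, then remove stopwords
    df['clean'] = df.original.apply(basic_clean)\
        .apply(tokenize)\
        .apply(remove_stopwords,
               extra_words=extra_words or [], 
               exclude_words=exclude_words or [])
    # stemmed and lemmatized are both derived from the cleaned text
    df['stemmed'] = df.clean.apply(stem)
    df['lemmatized'] = df.clean.apply(lemmatize)
    return df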

In [22]:
prep_article_data(news_df, 'content').head()

Unnamed: 0,title,original,category,clean,stemmed,lemmatized
0,RR's Yashasvi Jaiswal smashes fastest fifty in...,Rajasthan Royals (RR) opener Yashasvi Jaiswal ...,national,rajasthan royals rr opener yashasvi jaiswal th...,rajasthan royal rr open yashasvi jaiswal thurs...,rajasthan royal rr opener yashasvi jaiswal thu...
1,Rajasthan Royals record biggest win of IPL 202...,Rajasthan Royals (RR) on Thursday recorded the...,national,rajasthan royals rr thursday recorded biggest ...,rajasthan royal rr thursday record biggest win...,rajasthan royal rr thursday recorded biggest w...
2,RR break record for scoring most runs in 1st o...,RR on Thursday broke the record for scoring mo...,national,rr thursday broke record scoring runs first ip...,rr thursday broke record score run first ipl i...,rr thursday broke record scoring run first ipl...
3,Laxman Sivaramakrishnan mocks Kamal Haasan ove...,Ex-India spinner Laxman Sivaramakrishnan took ...,national,exindia spinner laxman sivaramakrishnan took d...,exindia spinner laxman sivaramakrishnan took d...,exindia spinner laxman sivaramakrishnan took d...
4,Which Indians have smashed fifty off 15 or les...,RR's Yashasvi Jaiswal today slammed the fastes...,national,rr ' yashasvi jaiswal today slammed fastest fi...,rr ' yashasvi jaiswal today slam fastest fifti...,rr ' yashasvi jaiswal today slammed fastest fi...


In [23]:
prep_article_data(codeup_df, 'content').head()

Unnamed: 0,title,original,clean,stemmed,lemmatized
0,Women in tech: Panelist Spotlight – Magdalena ...,\nCodeup is hosting a Women in Tech Panel in h...,codeup hosting women tech panel honor womens h...,codeup host women tech panel honor women histo...,codeup hosting woman tech panel honor woman hi...
1,Women in tech: Panelist Spotlight – Rachel Rob...,\nCodeup is hosting a Women in Tech Panel in h...,codeup hosting women tech panel honor womens h...,codeup host women tech panel honor women histo...,codeup hosting woman tech panel honor woman hi...
2,Women in Tech: Panelist Spotlight – Sarah Mellor,\nCodeup is hosting a Women in Tech Panel in ...,codeup hosting women tech panel honor womens h...,codeup host women tech panel honor women histo...,codeup hosting woman tech panel honor woman hi...
3,Women in Tech: Panelist Spotlight – Madeleine ...,\nCodeup is hosting a Women in Tech Panel in h...,codeup hosting women tech panel honor womens h...,codeup host women tech panel honor women histo...,codeup hosting woman tech panel honor woman hi...
4,Black Excellence in Tech: Panelist Spotlight –...,\n\nCodeup is hosting a Black Excellence in Te...,codeup hosting black excellence tech panel hon...,codeup host black excel tech panel honor black...,codeup hosting black excellence tech panel hon...
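Note that the calls above only display the prepared dataframes; `news_df` and `codeup_df` themselves are unchanged, because `df.rename` returns a new dataframe. The sketch below demonstrates this with a simplified stand-in, `add_clean_column` (hypothetical, with lowercasing in place of the full cleaning pipeline), and assumes only that pandas is installed.

```python
import pandas as pd


def add_clean_column(df, column):
    # Stand-in for prep_article_data: rename, then derive a new column.
    df = df.rename(columns={column: "original"})
    df["clean"] = df["original"].str.lower()
    return df


raw = pd.DataFrame({"title": ["A"], "content": ["Hello World"]})
prepped = add_clean_column(raw, "content")

print(list(raw.columns))      # ['title', 'content']
print(list(prepped.columns))  # ['title', 'original', 'clean']
```

To keep the new columns, assign the result, for example `news_df = prep_article_data(news_df, 'content')`.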


## 9. Ask yourself:

* If your corpus is 493KB, would you prefer to use stemmed or lemmatized text?
* If your corpus is 25MB, would you prefer to use stemmed or lemmatized text?
* If your corpus is 200TB of text and you're charged by the megabyte for your hosted computational resources, would you prefer to use stemmed or lemmatized text?

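If 493KB...Lemmatized. The corpus is small enough that the extra processing time is negligible, and lemmas are real dictionary words.

If 25MB...Lemmatized. Lemmatization is slower than stemming, but still manageable at this size.

If 200TB...Stemmed. Stemming is cheaper to compute, and when billed by the megabyte at this scale, the cost difference outweighs the loss in accuracy.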
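One way to check these answers is to time a normalizer on a sample and extrapolate to each corpus size. The sketch below is a rough estimate, not a benchmark: `estimate_seconds` is a hypothetical helper, `bytes_per_word=6` is an assumed average rather than a measured value, and `str.lower` is only a placeholder for the function being timed (pass `ps.stem` or `wnl.lemmatize` to compare the two).

```python
import time


def estimate_seconds(normalize, sample_words, corpus_bytes, bytes_per_word=6):
    """Extrapolate the time to normalize a corpus from a timed sample.

    bytes_per_word is an assumed average (a word plus a separator).
    """
    start = time.perf_counter()
    for word in sample_words:
        normalize(word)
    per_word = (time.perf_counter() - start) / len(sample_words)
    return per_word * (corpus_bytes / bytes_per_word)


# A repeated handful of words from the article, used as the timing sample.
sample = ["hosting", "women", "panelists", "experiences"] * 2500

for label, size in [("493KB", 493e3), ("25MB", 25e6), ("200TB", 200e12)]:
    # str.lower is a placeholder normalizer; swap in a stemmer or lemmatizer.
    print(label, f"{estimate_seconds(str.lower, sample, size):.3g} s")
```

Because the estimate scales linearly with corpus size, a small per-word difference between stemming and lemmatization only becomes significant for the largest corpus.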