# Parsing Text (aka Prepping Text Data)

## Exercises
The end result of this exercise should be a file named prepare.py that defines the requested functions.

In this exercise we will define some functions to prepare textual data. These functions should apply equally well to both the Codeup blog articles and the news articles that were previously acquired.

## Imports

In [1]:
#standard imports
import pandas as pd
import numpy as np

# my imports
import acquire as a

#import unicode normalization (used in basic_clean)
import unicodedata

#import regular expression operations
import re

#import natural language toolkit
import nltk

#import stopwords list
from nltk.corpus import stopwords

In [2]:
news_articles = a.get_news_articles()

In [3]:
blog_articles = a.get_blog_articles()

## 1. Define a function named basic_clean. It should take in a string and apply some basic text cleaning to it:

    • Lowercase everything
    • Normalize unicode characters
    • Remove anything that is not a letter, number, whitespace, or a single quote.

In [4]:
# blog_articles[0]['content']

In [5]:
def basic_clean(article):
    """
    Lower Case:
    - setting all letters to a lowercase
    
    Encoding:
    - `unicodedata.normalize` removes any inconsistencies in unicode character encoding
    - `.encode` to convert the resulting string to the ASCII character set
    - `.decode` to turn the resulting bytes object back into a string
    
    Special characters:
    - remove anything that isn't a-z, a number, a single quote, or a whitespace
    """
    # lowercase text
    article = article.lower()
    
    # normalize unicode, drop anything that is not ASCII,
    # then turn the resulting bytes back into a string
    article = unicodedata.normalize('NFKD', article).encode('ascii','ignore').decode('utf-8')
    
    # use re.sub to remove special characters
    bc_article = re.sub(r'[^a-z0-9\'\s]', '', article)
    
    return bc_article

In [6]:
bc_article = basic_clean(blog_articles[0]['content'])
bc_article

'may is traditionally known as asian american and pacific islander aapi heritage month this month we celebrate the history and contributions made possible by our aapi friends family and community we also examine our level of support and seek opportunities to better understand the aapi community  in an effort to address real concerns and experiences we sat down with arbeena thapa one of codeups financial aid and enrollment managers arbeena identifies as nepali american and desi arbeenas parents immigrated to texas in 1988 for better employment and educational opportunities arbeenas older sister was five when they made the move to the us arbeena was born later becoming the first in her family to be a us citizen at codeup we take our efforts at inclusivity very seriously after speaking with arbeena we were taught that the term aapi excludes desiamerican individuals hence we will now use the term asian pacific islander desi american apida here is how the rest of our conversation with arbee
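To see the three steps in isolation, here is a minimal sketch on a short made-up string. The sample text and variable names are illustrative assumptions, not part of the exercise data; the calls mirror the ones in `basic_clean` and use only the standard library.

```python
import re
import unicodedata

# hypothetical sample text: mixed case, an accent, a curly apostrophe, punctuation
sample = "Codeup’s Café: It's 100% FUN!"

# step 1: lowercase
lowered = sample.lower()

# step 2: NFKD splits accented letters into a base letter plus a combining mark;
# encoding to ASCII with 'ignore' then drops the marks and every other non-ASCII character
ascii_only = unicodedata.normalize('NFKD', lowered).encode('ascii', 'ignore').decode('utf-8')

# step 3: keep only a-z, digits, single quotes and whitespace
cleaned = re.sub(r"[^a-z0-9'\s]", '', ascii_only)

print(ascii_only)  # codeups cafe: it's 100% fun!
print(cleaned)     # codeups cafe it's 100 fun
```

The curly apostrophe (’) is not an ASCII character, so it is dropped in step 2. That is why `Codeup’s` appears as `codeups` in the output above, while a straight single quote (') would survive.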

## 2. Define a function named tokenize. It should take in a string and tokenize all the words in the string.

In [7]:
def tokenize(article):
    """
    Tokenization is the process of breaking something down
    into smaller, discrete units. These units are called tokens.
    
    It's common to tokenize strings to break up words and any leftover
    punctuation into discrete units.
    """  

    #create the tokenizer
    tokenizer = nltk.tokenize.ToktokTokenizer()
    
    #use the tokenizer; return_str=True returns one space-separated string
    tok_art = tokenizer.tokenize(article, return_str=True)
  
    return tok_art

In [8]:
tok_art = tokenize(blog_articles[0]['content'])
tok_art

'May is traditionally known as Asian American and Pacific Islander ( AAPI ) Heritage Month. This month we celebrate the history and contributions made possible by our AAPI friends , family , and community. We also examine our level of support and seek opportunities to better understand the AAPI community. In an effort to address real concerns and experiences , we sat down with Arbeena Thapa , one of Codeup ’ s Financial Aid and Enrollment Managers. Arbeena identifies as Nepali American and Desi. Arbeena ’ s parents immigrated to Texas in 1988 for better employment and educational opportunities. Arbeena ’ s older sister was five when they made the move to the US. Arbeena was born later , becoming the first in her family to be a US citizen. At Codeup we take our efforts at inclusivity very seriously. After speaking with Arbeena , we were taught that the term AAPI excludes Desi-American individuals. Hence , we will now use the term Asian Pacific Islander Desi American ( APIDA ) . Here is 
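As a rough illustration of what tokenizing does, the sketch below splits a string into word and punctuation tokens with a plain regular expression. This is a simplified stand-in, not the `ToktokTokenizer` algorithm, and `simple_tokenize` is a hypothetical helper name.

```python
import re

def simple_tokenize(text):
    """Split text into runs of word characters and single punctuation marks."""
    # \w+ matches a run of letters, digits or underscores;
    # [^\w\s] matches one character that is neither a word character nor whitespace
    tokens = re.findall(r"\w+|[^\w\s]", text)
    # join with single spaces so the result is a string, like return_str=True
    return ' '.join(tokens)

print(simple_tokenize("friends, family, and community."))
# friends , family , and community .
```

Unlike this sketch, the Toktok output above leaves periods in the middle of the text attached to the preceding word (`Month.`, `community.`), which is one reason to run `basic_clean` before tokenizing.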

## 3. Define a function named stem. It should accept some text and return the text after applying stemming to all the words.

In [9]:
def stem(article):
    """
    Stemming:
    - **truncates** words to their "stem"
    - algorithmic rules (non-linguistic)
    - example: "calls", "called", "calling" --> "call"
    - fast and efficient
    """   
    #create porter stemmer
    ps = nltk.porter.PorterStemmer()
    
    #split the article into words and apply the stemmer to each word
    stems = [ps.stem(word) for word in article.split()]
    
    #join words back together
    article_stemmed = ' '.join(stems)
    
    return article_stemmed

In [10]:
article_stemmed = stem(blog_articles[0]['content'])
article_stemmed

'may is tradit known as asian american and pacif island (aapi) heritag month. thi month we celebr the histori and contribut made possibl by our aapi friends, family, and community. we also examin our level of support and seek opportun to better understand the aapi community. in an effort to address real concern and experiences, we sat down with arbeena thapa, one of codeup’ financi aid and enrol managers. arbeena identifi as nepali american and desi. arbeena’ parent immigr to texa in 1988 for better employ and educ opportunities. arbeena’ older sister wa five when they made the move to the us. arbeena wa born later, becom the first in her famili to be a us citizen. at codeup we take our effort at inclus veri seriously. after speak with arbeena, we were taught that the term aapi exclud desi-american individuals. hence, we will now use the term asian pacif island desi american (apida). here is how the rest of our convers with arbeena went! how do you celebr or connect with your heritag a
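In the output above, words that still carry punctuation (`friends,`, `community.`) come through unstemmed, because the stemmer treats the punctuation as part of the word. The short sketch below compares the two cases. It assumes `nltk` is installed (the Porter stemmer needs no corpus download), and the sample words are illustrative.

```python
from nltk.stem import PorterStemmer

ps = PorterStemmer()

# the docstring example: three inflections share one stem
stems = [ps.stem(word) for word in ['calls', 'called', 'calling']]
print(stems)

# trailing punctuation blocks the suffix rules, so the word is left as-is
raw = ps.stem('friends,')
# the same word without punctuation is reduced to its stem
cleaned = ps.stem('friends')
print(raw, cleaned)
```

This suggests calling `stem` on text that has already been through `basic_clean` and `tokenize`, rather than on the raw article.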

## 4. Define a function named lemmatize. It should accept some text and return the text after applying lemmatization to each word.

In [11]:
def lemmatize(article):
    """
    Lemmatize:
        - **changes** words to their "root"
        - maps inflected forms back to a base dictionary word
        - example: "mouse", "mice" --> "mouse"
        - slower than stemming
    """ 
    #create the lemmatizer   
    wnl = nltk.stem.WordNetLemmatizer()
    
    #use lemmatize - apply the lemmatizer to each word in our string
    #(with no part-of-speech argument, every word is treated as a noun)
    lemma = [wnl.lemmatize(word) for word in article.split()]
    
    #join words back together
    article_lemma = ' '.join(lemma)
    
    return article_lemma

In [12]:
article_lemma = lemmatize(blog_articles[0]['content'])
article_lemma

'May is traditionally known a Asian American and Pacific Islander (AAPI) Heritage Month. This month we celebrate the history and contribution made possible by our AAPI friends, family, and community. We also examine our level of support and seek opportunity to better understand the AAPI community. In an effort to address real concern and experiences, we sat down with Arbeena Thapa, one of Codeup’s Financial Aid and Enrollment Managers. Arbeena identifies a Nepali American and Desi. Arbeena’s parent immigrated to Texas in 1988 for better employment and educational opportunities. Arbeena’s older sister wa five when they made the move to the US. Arbeena wa born later, becoming the first in her family to be a US citizen. At Codeup we take our effort at inclusivity very seriously. After speaking with Arbeena, we were taught that the term AAPI excludes Desi-American individuals. Hence, we will now use the term Asian Pacific Islander Desi American (APIDA). Here is how the rest of our conversa

## 5. Define a function named remove_stopwords. It should accept some text and return the text after removing all the stopwords.

In [13]:
def remove_stopwords(article_lemma):
    """
    Words which have little or no significance, especially when constructing
    meaningful features from text, are known as stopwords.
    - examples: a, an, the, and similar words

    We will use a standard English language stopwords list from nltk
    """
    #save stopwords
    stopwords_ls = stopwords.words('english')
    
    #split words in lemmatized article
    words = article_lemma.split()
    
    #remove stopwords from list of words
    filtered = [word for word in words if word not in stopwords_ls]
    
    #join words back together
    rem_stopwords = ' '.join(filtered)
    
    return rem_stopwords

In [14]:
rem_stopwords = remove_stopwords(article_lemma)
rem_stopwords

'May traditionally known Asian American Pacific Islander (AAPI) Heritage Month. This month celebrate history contribution made possible AAPI friends, family, community. We also examine level support seek opportunity better understand AAPI community. In effort address real concern experiences, sat Arbeena Thapa, one Codeup’s Financial Aid Enrollment Managers. Arbeena identifies Nepali American Desi. Arbeena’s parent immigrated Texas 1988 better employment educational opportunities. Arbeena’s older sister wa five made move US. Arbeena wa born later, becoming first family US citizen. At Codeup take effort inclusivity seriously. After speaking Arbeena, taught term AAPI excludes Desi-American individuals. Hence, use term Asian Pacific Islander Desi American (APIDA). Here rest conversation Arbeena went! How celebrate connect heritage cultural traditions? “I celebrate Nepal’s version Christmas Dashain. This nine-day celebration also known Dussehra. I grew Hindu I identify Hindu, large part he

## This function should define two optional parameters, extra_words and exclude_words. These parameters should define any additional stop words to include, and any words that we don't want to remove.

In [15]:
#set a list to add some stopwords IF THEY ARE NEEDED!
extra_words = ['all', 'about','after']

In [16]:
#set a list to remove some stopwords IF THEY ARE NEEDED!
exclude_words = ['aaa']

In [17]:
def remove_stopwords(article_lemma, extra_words=None, exclude_words=None):
    """
    Words which have little or no significance, especially when constructing
    meaningful features from text, are known as stopwords.
    - examples: a, an, the, and similar words

    We will use a standard English language stopwords list from nltk

    extra_words: optional list of additional words to treat as stopwords
    exclude_words: optional list of stopwords that should NOT be removed
    """
    #save stopwords
    stopwords_ls = set(stopwords.words('english'))
    
    # drop the words we want to keep from the stopword list
    stopwords_ls = stopwords_ls - set(exclude_words or [])

    # add the extra words to the stopword list
    stopwords_ls = stopwords_ls.union(extra_words or [])
    
    #split words in lemmatized article
    words = article_lemma.split()
    
    #remove stopwords from list of words
    filtered = [word for word in words if word not in stopwords_ls]
    
    #join words back together
    rem_stopwords = ' '.join(filtered)
    
    return rem_stopwords

In [18]:
article_lemma

'May is traditionally known a Asian American and Pacific Islander (AAPI) Heritage Month. This month we celebrate the history and contribution made possible by our AAPI friends, family, and community. We also examine our level of support and seek opportunity to better understand the AAPI community. In an effort to address real concern and experiences, we sat down with Arbeena Thapa, one of Codeup’s Financial Aid and Enrollment Managers. Arbeena identifies a Nepali American and Desi. Arbeena’s parent immigrated to Texas in 1988 for better employment and educational opportunities. Arbeena’s older sister wa five when they made the move to the US. Arbeena wa born later, becoming the first in her family to be a US citizen. At Codeup we take our effort at inclusivity very seriously. After speaking with Arbeena, we were taught that the term AAPI excludes Desi-American individuals. Hence, we will now use the term Asian Pacific Islander Desi American (APIDA). Here is how the rest of our conversa

In [19]:
rem_stopwords = remove_stopwords(article_lemma, extra_words, exclude_words)
rem_stopwords

'May traditionally known Asian American Pacific Islander (AAPI) Heritage Month. This month celebrate history contribution made possible AAPI friends, family, community. We also examine level support seek opportunity better understand AAPI community. In effort address real concern experiences, sat Arbeena Thapa, one Codeup’s Financial Aid Enrollment Managers. Arbeena identifies Nepali American Desi. Arbeena’s parent immigrated Texas 1988 better employment educational opportunities. Arbeena’s older sister wa five made move US. Arbeena wa born later, becoming first family US citizen. At Codeup take effort inclusivity seriously. After speaking Arbeena, taught term AAPI excludes Desi-American individuals. Hence, use term Asian Pacific Islander Desi American (APIDA). Here rest conversation Arbeena went! How celebrate connect heritage cultural traditions? “I celebrate Nepal’s version Christmas Dashain. This nine-day celebration also known Dussehra. I grew Hindu I identify Hindu, large part he
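In the output above, capitalized words such as `This`, `We` and `In` survive because the comparison is case-sensitive and the article was not lowercased first. The sketch below shows the same set logic on a tiny hand-written stopword set, with `extra_words` and `exclude_words` as optional arguments. The set is an assumption for illustration (the real function uses the NLTK list), and the name `filter_stopwords` is hypothetical.

```python
def filter_stopwords(text, stopword_set, extra_words=None, exclude_words=None):
    """Remove stopwords from text, honoring optional additions and exclusions."""
    # first drop the words we want to keep, then add the extra stopwords
    effective = (set(stopword_set) - set(exclude_words or [])) | set(extra_words or [])
    # keep every word that is not in the effective stopword set
    return ' '.join(word for word in text.split() if word not in effective)

# tiny illustrative stopword set, NOT the NLTK list
demo_stopwords = {'this', 'we', 'the', 'and', 'our'}
text = 'This month we celebrate the history and contributions'

# case-sensitive match: 'This' is kept because only 'this' is in the set
as_is = filter_stopwords(text, demo_stopwords)
# lowercasing first lets every stopword match
lowered = filter_stopwords(text.lower(), demo_stopwords)
# treat 'month' as a stopword and keep 'we'
custom = filter_stopwords(text.lower(), demo_stopwords,
                          extra_words=['month'], exclude_words=['we'])

print(as_is)    # This month celebrate history contributions
print(lowered)  # month celebrate history contributions
print(custom)   # we celebrate history contributions
```

Running `basic_clean` before `remove_stopwords` would give the lowercased behavior shown in the second call.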

## 6. Use your data from acquire to produce a dataframe of the news articles. Name the dataframe news_df.

In [20]:
news_articles = a.get_news_articles()
news_df = pd.DataFrame(news_articles)
news_df

Unnamed: 0,category,title,content
0,business,"Sensex, Nifty end at fresh closing highs","Benchmark indices Sensex and Nifty ended at record closing highs on Wednesday. Sensex ended 195 points higher at 63,523 while the Nifty ended at 18,856.85, up 40 points. The gains were led by stocks like HDFC, Reliance Industries and TCS. During the intraday trade, Sensex rose to its fresh record high level of 63,588."
1,business,Amazon tricked millions of customers into enrolling in Prime: US FTC,"US Federal Trade Commission (FTC) has sued Amazon, accusing it of tricking millions of consumers into signing up for its Prime subscription without their consent. ""Amazon used manipulative, coercive or deceptive user-interface designs known as 'dark patterns' to trick consumers into enrolling in automatically-renewing Prime subscriptions,"" US FTC said. Prime members in the US pay $139 per year."
2,business,TIME releases list of the world's 100 most influential companies,"TIME magazine has released its annual list of the world's 100 most influential companies, which features OpenAI, SpaceX, Chess.com, Google DeepMind and Kim Kardashian's SKIMS among others. The National Payments Corporation of India (NPCI) and e-commerce platform Meesho also featured on the list. ""NPCI launched UPI...which accounted for 52% of India's digital transactions in FY22,"" TIME said."
3,business,Which are the world's top 10 airlines according to passengers?,"Singapore Airlines is the world's best airline, according to Skytrax World Airline Awards 2023, an annual poll of flyers released at the Paris Air Show. It is followed by Qatar Airways, All Nippon Airways, Emirates, Japan Airlines, Turkish Airlines, Air France, Cathay Pacific, EVA Air, and Korean Air. Vistara, ranked 16th, is the only Indian airline in the top 20."
4,business,"Grab lays off over 1,000 employees","Singapore-based ride-hailing and food delivery app Grab has laid off over 1,000 employees. This is Grab's largest round of layoffs since 2020, when it cut 360 jobs in response to COVID-19 pandemic challenges. ""I want to be clear that we're not doing this as a shortcut to profitability,"" Group CEO and Co-Founder Anthony Tan said in an e-mail to employees."
...,...,...,...
95,entertainment,He was wonderful: Harrison Ford on 'Indiana..' co-star Amrish Puri,"Actor Harrison Ford remembered late actor Amrish Puri, who had worked in 1984's 'Indiana Jones and the Temple of Doom'. ""He was a wonderful person...very charming man. (He was) nothing like the character that he played...Very sophisticated. I really admired him...enjoyed working with him,"" Ford said about Puri in an interview. The 1984 film featured Puri as the antagonist."
96,entertainment,Admire her audacity: Mahesh Bhatt on Pooja entering 'Bigg Boss...',"Filmmaker Mahesh Bhatt reacted to his daughter Pooja Bhatt's participation in 'Bigg Boss OTT 2'. ""Life's greatest adventures begin when we step into the realm of the unknown with courage and curiosity. She has done just that. I admire her audacity,"" Bhatt told ETimes. Pooja, on the show, had opened up about battling alcohol addiction at the age of 44."
97,entertainment,"Kartik made me wait during 'Bhool...', I used to scold him: Kiara","Actress Kiara Advani, while talking about her 'Satyaprem Ki Katha' co-actor Kartik Aaryan, said the actor used to make her wait during the shoot of their 2022 film 'Bhool Bhulaiyaa 2'. ""I used to scold Kartik, would tell him not to come late this time over and make me wait,"" Kiara said. ""We've grown professionally and personally,"" she added."
98,entertainment,It was hard to find 300 transgender people for 'Haddi': Producer,"Producer Sanjay Saha said that finding 300 transgender people for Nawazuddin Siddiqui's 'Haddi' was ""very adventurous and hard"". He added that a transwoman named Renuka helped the filmmakers to learn about transgender community. ""She had brought some of her friends from the community to Nawaz so that he could get into the character and deeply understand their life,"" he added."


## 7. Make another dataframe for the Codeup blog posts. Name the dataframe codeup_df.

In [21]:
blog_articles = a.get_blog_articles()
codeup_df = pd.DataFrame(blog_articles)
codeup_df

Unnamed: 0,title,content
0,Spotlight on APIDA Voices: Celebrating Heritage and Inspiring Change ft. Arbeena Thapa,"May is traditionally known as Asian American and Pacific Islander (AAPI) Heritage Month. This month we celebrate the history and contributions made possible by our AAPI friends, family, and community. We also examine our level of support and seek opportunities to better understand the AAPI community. In an effort to address real concerns and experiences, we sat down with Arbeena Thapa, one of Codeup’s Financial Aid and Enrollment Managers. Arbeena identifies as Nepali American and Desi. Arbeena’s parents immigrated to Texas in 1988 for better employment and educational opportunities. Arbeena’s older sister was five when they made the move to the US. Arbeena was born later, becoming the first in her family to be a US citizen. At Codeup we take our efforts at inclusivity very seriously. After speaking with Arbeena, we were taught that the term AAPI excludes Desi-American individuals. Hence, we will now use the term Asian Pacific Islander Desi American (APIDA). Here is how the rest of our conversation with Arbeena went! How do you celebrate or connect with your heritage and cultural traditions? “I celebrate Nepal’s version of Christmas or Dashain. This is a nine-day celebration also known as Dussehra. I grew up as Hindu and I identify as Hindu, this is a very large part of my heritage. “ “Other ways I connect with my culture include sharing food! Momos are South Asian Dumplings and they’re my favorite to make and share.” “On my Asian American side, I am an advocate of immigrant justice and erasure within APIDA social or political movements. I participate in events to embrace my identity such as immigrant justice advocacy because I come from a mixed-status family. I’ve always been in a community with undocumented Asian immigrants. .” What are some of the challenges you have faced as an APIDA individual, personally or professionally? 
“I often struggle with being gendered as compliant or a pushover. Professionally, I am often stereotyped as meek, so I’ve been overlooked for leadership roles. We are seen as perpetually foreign; people tend to other us in that way, yet put us on a pedestal for what a model minority looks like. This has made me hesitant to share my heritage in the past because these assumptions get mapped onto me. ” Can you describe some common barriers of entry that APIDA individuals, specifically women may face when trying to enter or advance in the workplace? “Being overlooked for leadership. In the past, I have not been viewed as a leader. People sometimes have preconceived stereotypes of Asian women not being able to be bold, or being vocal can be mistaken for being too emotional. “ How do you believe microaggressions impact APIDA individuals in the workplace? Can you provide examples of such microaggressions? “Erasure is big. To me, only saying ‘Merry Christmas’ isn’t inclusive to other religions. People are often resistant to saying ‘Happy Holidays,’ but saying Merry Christmas excludes, and does not appreciate my heritage. “ “Often microaggressions are not micro at all. They typically are not aggressive racialized violence, but the term ‘micro’ minimizes impact.” “Some that I’ve heard are ‘What kind of Asian are you?’ or ‘Where are you from?’ This automatically makes me the ‘other’ and not seen as American. Even within the APIDA community, South Asians are overlooked as “Asian”.” How important is representation, specifically APIDA representation, in organizational leadership positions? “I want to say that it is important to have someone who looks like you in leadership roles, and it is, but those leaders may not share the same beliefs as you. Certain privileges such as wealth, resources, or lack of interaction with lower-socioeconomic-status Asian Americans may cause a difference in community politics. 
I do not think the bamboo ceiling is acceptable, but the company you work for plays a big part in your politics and belief alignment.” How do you feel about code-switching, and have you ever felt it necessary to code-switch? “I like sharing South Asian terms or connecting with others that have similar heritage and culture. A workplace that is welcoming to going into this sort of breakout is refreshing and makes space for us. However, having to code-switch could also mean a workplace that is not conducive and welcoming of other cultures. “ Finally, in your opinion, what long-term strategies can create lasting change in the workplace and ensure support, equality, and inclusion for APIDA individuals? “Prior to a career in financial aid, I did a lot of research related to the post-9/11 immigration of the South Asian diaspora. This background made me heavily rely on grassroots organizing. Hire the people that want to innovate, hire the changemakers, hire the button-pushers. Reduce reliance on whiteness as change. This will become natural for the organization and become organizational change. Change comes from us on the ground.” A huge thank you to Arbeena Thapa for sharing her experiences, and being vulnerable with us. Your words were inspiring and the opportunity to understand your perspective more has been valuable. We hope we can become better support for the APIDA community as we learn and grow on our journey of cultivating inclusive growth."
1,Women in tech: Panelist Spotlight – Magdalena Rahn,"Women in tech: Panelist Spotlight – Magdalena Rahn Codeup is hosting a Women in Tech Panel in honor of Women’s History Month on March 29th, 2023! To further celebrate, we’d like to spotlight each of our panelists leading up to the discussion to learn a bit about their respective experiences as women in the tech industry! Meet Magdalena! Magdalena Rahn is a current Codeup student in a Data Science cohort in San Antonio, Texas. She has a professional background in cross-cultural communications, international business development, the wine industry and journalism. After serving in the US Navy, she decided to complement her professional skill set by attending the Data Science program at Codeup; she is set to graduate in March 2023. Magdalena is fluent in French, Bulgarian, Chinese-Mandarin, Spanish and Italian. We asked Magdalena how Codeup impacted her career, and she replied “Codeup has provided a solid foundation in analytical processes, programming and data science methods, and it’s been an encouragement to have such supportive instructors and wonderful classmates.” Don’t forget to tune in on March 29th to sit in on an insightful conversation with Magdalena."
2,Women in tech: Panelist Spotlight – Rachel Robbins-Mayhill,"Women in tech: Panelist Spotlight – Rachel Robbins-Mayhill Codeup is hosting a Women in Tech Panel in honor of Women’s History Month on March 29th, 2023! To further celebrate, we’d like to spotlight each of our panelists leading up to the discussion to learn a bit about their respective experiences as women in the tech industry! Meet Rachel! Rachel Robbins-Mayhill is a Decision Science Analyst I in San Antonio, Texas. Rachel has had a varied career that includes counseling, teaching, training, community development, and military operations. Her focus has always been on assessing needs, identifying solutions, and educating individuals and groups on aligning needs and solutions in different contexts. Rachel’s passion for data science stems from her belief that data is a powerful tool for communicating patterns that can lead to hope and growth in the future. In June 2022, Rachel graduated from Codeup’s Innis cohort, where she honed her skills in data science. Shortly after, she started working as a Data Science Technical Writer with Apex Systems as a Contractor for USAA in July 2022. Her unconventional role allowed her to understand where her skills could be best utilized to support USAA in a non-contract role. Rachel recently joined USAA’s Data Science Delivery team as a Decision Science Analyst I in February 2023. The team is focused on delivering machine learning models for fraud prevention, and Rachel’s particular role centers around providing strategic process solutions for the team in collaboration with Operational and Model Risk components. In addition to her career, Rachel is currently pursuing a master’s degree in Applied Data Science from Syracuse University, further expanding her knowledge and skills in the field. Rachel is passionate about collaborating with individuals who share her belief in the potential of others and strive to achieve growth through logical, informed action. 
She welcomes LinkedIn connections and is excited about supporting the network of CodeUp alumni! We asked Rachel how Codeup impacted her career, and she replied “Codeup delivered a comprehensive education in all facets of the data science pipeline, laying a strong foundation for me to build upon. Through repeated hands-on practice, I developed a reliable process that was immediately applicable in my job. Collaborative group projects were instrumental in helping me hone my skills in project management, allowing me to navigate complex data science projects with comfortability. Thanks to this invaluable experience, I was able to make significant strides in my career within just six months of graduating from Codeup.” Don’t forget to tune in on March 29th to sit in on an insightful conversation."
3,Women in Tech: Panelist Spotlight – Sarah Mellor,"Women in tech: Panelist Spotlight – Sarah Mellor Codeup is hosting a Women in Tech Panel in honor of Women’s History Month on March 29th, 2023! To further celebrate, we’d like to spotlight each of our panelists leading up to the discussion to learn a bit about their respective experiences as women in the tech industry! Meet Sarah! Sarah Mellor currently works as the Director of People Operations. She joined Codeup four and a half years ago as an Admissions Manager. She went on to build out and lead the Marketing and Admissions team, while picking up People Ops tasks and projects here and there until moving over to lead the People Ops team two years ago. Prior to Codeup, she worked at education-focused non-profits in Washington, DC and Boulder, Colorado. She graduated from Wake Forest University. We asked Sarah how Codeup has impacted her career, and her response was “I have absolutely loved having the privilege to grow alongside Codeup. In my time here across multiple different roles and departments, I’ve seen a lot of change. The consistent things have always been the high quality of passionate and hardworking people I get to work with; the impactful mission we get to work on; and the inspiring students who trust us with their career change.” Don’t forget to tune in on March 29th to sit in on an insightful conversation."
4,Women in Tech: Panelist Spotlight – Madeleine Capper,"Women in tech: Panelist Spotlight – Madeleine Capper Codeup is hosting a Women in Tech Panel in honor of Women’s History Month on March 29th, 2023! To further celebrate, we’d like to spotlight each of our panelists leading up to the discussion to learn a bit about their respective experiences as women in the tech industry! Meet Madeleine! Madeleine Capper is a Data Scientist in San Antonio, Texas. A long-standing San Antonio resident, she studied mathematics at the University of Texas San Antonio and has worked as a Data Scientist for Booz Allen Hamilton. Madeleine currently teaches Data Science at Codeup, where she works daily with burgeoning data professionals to help them actualize their career aspirations through technical education. Madeleine attended Codeup as a student in early 2019 as a pupil in the very first Codeup Data Science cohort. The program proved immediately effective and she was the first student to obtain a data career out of the program. After working at Booz Allen Hamilton, Madeleine’s passion for education in conjunction with her appreciation for Codeup’s capacity for transformative life change brought her back to the institution in an instructional capacity, where she has been teaching for two years. Don’t forget to tune in on March 29th to sit in on an insightful conversation."
5,Black Excellence in Tech: Panelist Spotlight – Wilmarie De La Cruz Mejia,"Black excellence in tech: Panelist Spotlight – Wilmarie De La Cruz Mejia Codeup is hosting a Black Excellence in Tech Panel in honor of Black History Month on February 22, 2023! To further celebrate, we’d like to spotlight each of our panelists leading up to the discussion to learn a bit about their respective experiences as black leaders in the tech industry! Meet Wilmarie! Wilmarie De La Cruz Mejia is a current Codeup student on the path to becoming a Full-Stack Web Developer at our Dallas, TX campus. Wilmarie is a veteran expanding her knowledge of programming languages and technologies on her journey with Codeup. We asked Wilmarie to share more about her experience at Codeup. She shares, “I was able to meet other people who were passionate about coding and be in a positive learning environment.” We hope you can join us on February 22nd to sit in on an insightful conversation with Wilmarie and all of our panelists!"


## 8. For each dataframe, produce the following columns:

• **title** to hold the title
• **original** to hold the original article/post content
• **clean** to hold the normalized and tokenized original with the stopwords removed
• **stemmed** to hold the stemmed version of the cleaned data
• **lemmatized** to hold the lemmatized version of the cleaned data
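One way to build these columns is with `Series.apply`, passing each text through the functions defined earlier. The sketch below is a self-contained illustration: the helper name `add_text_columns` and the stand-in callables are assumptions so that the block runs without NLTK. In `prepare.py` they would be replaced by a composition of the real `basic_clean`, `tokenize` and `remove_stopwords`, plus `stem` and `lemmatize`.

```python
import pandas as pd

def add_text_columns(df, clean_fn, stem_fn, lemmatize_fn):
    """Return a new dataframe with title, original, clean, stemmed and lemmatized columns."""
    # rename returns a new dataframe, so the caller's df is left untouched
    out = df.rename(columns={'content': 'original'})
    # clean_fn is expected to normalize, tokenize and remove stopwords
    out['clean'] = out['original'].apply(clean_fn)
    # both derived columns start from the cleaned text
    out['stemmed'] = out['clean'].apply(stem_fn)
    out['lemmatized'] = out['clean'].apply(lemmatize_fn)
    return out[['title', 'original', 'clean', 'stemmed', 'lemmatized']]

# stand-in callables for illustration only (not real stemming or lemmatizing)
def toy_stem(text):
    return ' '.join(word.rstrip('s') for word in text.split())

demo = pd.DataFrame({'title': ['Demo'], 'content': ['Cats RUN']})
result = add_text_columns(demo, str.lower, toy_stem, lambda text: text)
print(result)
```

Taking the functions as arguments keeps the helper usable for both `news_df` and `codeup_df`, since both have `title` and `content` columns.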

In [22]:
# df.rename columns from content to original
news_df = news_df.rename(columns={'content':'original'})
news_df

Unnamed: 0,category,title,original
0,business,"Sensex, Nifty end at fresh closing highs","Benchmark indices Sensex and Nifty ended at record closing highs on Wednesday. Sensex ended 195 points higher at 63,523 while the Nifty ended at 18,856.85, up 40 points. The gains were led by stocks like HDFC, Reliance Industries and TCS. During the intraday trade, Sensex rose to its fresh record high level of 63,588."
1,business,Amazon tricked millions of customers into enrolling in Prime: US FTC,"US Federal Trade Commission (FTC) has sued Amazon, accusing it of tricking millions of consumers into signing up for its Prime subscription without their consent. ""Amazon used manipulative, coercive or deceptive user-interface designs known as 'dark patterns' to trick consumers into enrolling in automatically-renewing Prime subscriptions,"" US FTC said. Prime members in the US pay $139 per year."
2,business,TIME releases list of the world's 100 most influential companies,"TIME magazine has released its annual list of the world's 100 most influential companies, which features OpenAI, SpaceX, Chess.com, Google DeepMind and Kim Kardashian's SKIMS among others. The National Payments Corporation of India (NPCI) and e-commerce platform Meesho also featured on the list. ""NPCI launched UPI...which accounted for 52% of India's digital transactions in FY22,"" TIME said."
3,business,Which are the world's top 10 airlines according to passengers?,"Singapore Airlines is the world's best airline, according to Skytrax World Airline Awards 2023, an annual poll of flyers released at the Paris Air Show. It is followed by Qatar Airways, All Nippon Airways, Emirates, Japan Airlines, Turkish Airlines, Air France, Cathay Pacific, EVA Air, and Korean Air. Vistara, ranked 16th, is the only Indian airline in the top 20."
4,business,"Grab lays off over 1,000 employees","Singapore-based ride-hailing and food delivery app Grab has laid off over 1,000 employees. This is Grab's largest round of layoffs since 2020, when it cut 360 jobs in response to COVID-19 pandemic challenges. ""I want to be clear that we're not doing this as a shortcut to profitability,"" Group CEO and Co-Founder Anthony Tan said in an e-mail to employees."
...,...,...,...
95,entertainment,He was wonderful: Harrison Ford on 'Indiana..' co-star Amrish Puri,"Actor Harrison Ford remembered late actor Amrish Puri, who had worked in 1984's 'Indiana Jones and the Temple of Doom'. ""He was a wonderful person...very charming man. (He was) nothing like the character that he played...Very sophisticated. I really admired him...enjoyed working with him,"" Ford said about Puri in an interview. The 1984 film featured Puri as the antagonist."
96,entertainment,Admire her audacity: Mahesh Bhatt on Pooja entering 'Bigg Boss...',"Filmmaker Mahesh Bhatt reacted to his daughter Pooja Bhatt's participation in 'Bigg Boss OTT 2'. ""Life's greatest adventures begin when we step into the realm of the unknown with courage and curiosity. She has done just that. I admire her audacity,"" Bhatt told ETimes. Pooja, on the show, had opened up about battling alcohol addiction at the age of 44."
97,entertainment,"Kartik made me wait during 'Bhool...', I used to scold him: Kiara","Actress Kiara Advani, while talking about her 'Satyaprem Ki Katha' co-actor Kartik Aaryan, said the actor used to make her wait during the shoot of their 2022 film 'Bhool Bhulaiyaa 2'. ""I used to scold Kartik, would tell him not to come late this time over and make me wait,"" Kiara said. ""We've grown professionally and personally,"" she added."
98,entertainment,It was hard to find 300 transgender people for 'Haddi': Producer,"Producer Sanjay Saha said that finding 300 transgender people for Nawazuddin Siddiqui's 'Haddi' was ""very adventurous and hard"". He added that a transwoman named Renuka helped the filmmakers to learn about transgender community. ""She had brought some of her friends from the community to Nawaz so that he could get into the character and deeply understand their life,"" he added."


In [36]:
# chain all three functions on a single article
# note: stopwords are removed first here, before the text is lowercased and cleaned
basic_clean(tokenize(remove_stopwords(news_df.original[0], extra_words, exclude_words)))

'benchmark indices sensex nifty ended record closing highs wednesday sensex ended 195 points higher 63523 nifty ended 1885685  40 points the gains led stocks like hdfc  reliance industries tcs during intraday trade  sensex rose fresh record high level 63588 '
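
The output above still contains "the" and "during". A likely reason is the order of the calls: stopwords are removed while the text is still capitalized, so "The" and "During" do not match the lowercase stopword list. The sketch below reproduces that effect with the standard library only. `STOPWORDS` and `drop_stopwords` are hypothetical stand-ins for NLTK's list and the notebook's `remove_stopwords`, and `basic_clean` is a compact copy of the function defined earlier.

```python
import re
import unicodedata

# Tiny illustrative stopword set (assumption; the notebook uses NLTK's list).
STOPWORDS = {"the", "and", "at", "on", "were", "by", "during"}

def basic_clean(text):
    # Lowercase, strip non-ASCII, then drop everything except a-z, 0-9, ' and whitespace.
    text = text.lower()
    text = unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('utf-8')
    return re.sub(r"[^a-z0-9'\s]", '', text)

def drop_stopwords(text):
    # Case-sensitive comparison against the lowercase stopword set.
    return ' '.join(word for word in text.split() if word not in STOPWORDS)

raw = "The gains were led by stocks like HDFC. During the intraday trade, Sensex rose."

stop_first = basic_clean(drop_stopwords(raw))   # stopwords removed before cleaning
clean_first = drop_stopwords(basic_clean(raw))  # cleaning before stopword removal

print(stop_first)   # the gains led stocks like hdfc during intraday trade sensex rose
print(clean_first)  # gains led stocks like hdfc intraday trade sensex rose
```

Cleaning first, as the `.apply` chain in the next cell does, avoids the leftover stopwords.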

In [42]:
news_df['clean'] = news_df.original.apply(basic_clean).apply(tokenize).apply(remove_stopwords, extra_words=extra_words, exclude_words=exclude_words)
news_df.head()

Unnamed: 0,category,title,original,clean,stemmed,lemmatized
0,business,"Sensex, Nifty end at fresh closing highs","Benchmark indices Sensex and Nifty ended at record closing highs on Wednesday. Sensex ended 195 points higher at 63,523 while the Nifty ended at 18,856.85, up 40 points. The gains were led by stocks like HDFC, Reliance Industries and TCS. During the intraday trade, Sensex rose to its fresh record high level of 63,588.",benchmark indices sensex nifty ended record closing highs wednesday sensex ended 195 points higher 63523 nifty ended 1885685 40 points gains led stocks like hdfc reliance industries tcs intraday trade sensex rose fresh record high level 63588,benchmark indic sensex and nifti end at record close high on wednesday sensex end 195 point higher at 63523 while the nifti end at 1885685 up 40 point the gain were led by stock like hdfc relianc industri and tc dure the intraday trade sensex rose to it fresh record high level of 63588,benchmark index sensex and nifty ended at record closing high on wednesday sensex ended 195 point higher at 63523 while the nifty ended at 1885685 up 40 point the gain were led by stock like hdfc reliance industry and tc during the intraday trade sensex rose to it fresh record high level of 63588
1,business,Amazon tricked millions of customers into enrolling in Prime: US FTC,"US Federal Trade Commission (FTC) has sued Amazon, accusing it of tricking millions of consumers into signing up for its Prime subscription without their consent. ""Amazon used manipulative, coercive or deceptive user-interface designs known as 'dark patterns' to trick consumers into enrolling in automatically-renewing Prime subscriptions,"" US FTC said. Prime members in the US pay $139 per year.",us federal trade commission ftc sued amazon accusing tricking millions consumers signing prime subscription without consent amazon used manipulative coercive deceptive userinterface designs known ' dark patterns ' trick consumers enrolling automaticallyrenewing prime subscriptions us ftc said prime members us pay 139 per year,us feder trade commiss ftc ha su amazon accus it of trick million of consum into sign up for it prime subscript without their consent amazon use manipul coerciv or decept userinterfac design known as ' dark pattern ' to trick consum into enrol in automaticallyrenew prime subscript us ftc said prime member in the us pay 139 per year,u federal trade commission ftc ha sued amazon accusing it of tricking million of consumer into signing up for it prime subscription without their consent amazon used manipulative coercive or deceptive userinterface design known a ' dark pattern ' to trick consumer into enrolling in automaticallyrenewing prime subscription u ftc said prime member in the u pay 139 per year
2,business,TIME releases list of the world's 100 most influential companies,"TIME magazine has released its annual list of the world's 100 most influential companies, which features OpenAI, SpaceX, Chess.com, Google DeepMind and Kim Kardashian's SKIMS among others. The National Payments Corporation of India (NPCI) and e-commerce platform Meesho also featured on the list. ""NPCI launched UPI...which accounted for 52% of India's digital transactions in FY22,"" TIME said.",time magazine released annual list world ' 100 influential companies features openai spacex chesscom google deepmind kim kardashian ' skims among others national payments corporation india npci ecommerce platform meesho also featured list npci launched upiwhich accounted 52 india ' digital transactions fy22 time said,time magazin ha releas it annual list of the world ' s 100 most influenti compani which featur openai spacex chesscom googl deepmind and kim kardashian ' s skim among other the nation payment corpor of india npci and ecommerc platform meesho also featur on the list npci launch upiwhich account for 52 of india ' s digit transact in fy22 time said,time magazine ha released it annual list of the world ' s 100 most influential company which feature openai spacex chesscom google deepmind and kim kardashian ' s skim among others the national payment corporation of india npci and ecommerce platform meesho also featured on the list npci launched upiwhich accounted for 52 of india ' s digital transaction in fy22 time said
3,business,Which are the world's top 10 airlines according to passengers?,"Singapore Airlines is the world's best airline, according to Skytrax World Airline Awards 2023, an annual poll of flyers released at the Paris Air Show. It is followed by Qatar Airways, All Nippon Airways, Emirates, Japan Airlines, Turkish Airlines, Air France, Cathay Pacific, EVA Air, and Korean Air. Vistara, ranked 16th, is the only Indian airline in the top 20.",singapore airlines world ' best airline according skytrax world airline awards 2023 annual poll flyers released paris air show followed qatar airways nippon airways emirates japan airlines turkish airlines air france cathay pacific eva air korean air vistara ranked 16th indian airline top 20,singapor airlin is the world ' s best airlin accord to skytrax world airlin award 2023 an annual poll of flyer releas at the pari air show it is follow by qatar airway all nippon airway emir japan airlin turkish airlin air franc cathay pacif eva air and korean air vistara rank 16th is the onli indian airlin in the top 20,singapore airline is the world ' s best airline according to skytrax world airline award 2023 an annual poll of flyer released at the paris air show it is followed by qatar airway all nippon airway emirate japan airline turkish airline air france cathay pacific eva air and korean air vistara ranked 16th is the only indian airline in the top 20
4,business,"Grab lays off over 1,000 employees","Singapore-based ride-hailing and food delivery app Grab has laid off over 1,000 employees. This is Grab's largest round of layoffs since 2020, when it cut 360 jobs in response to COVID-19 pandemic challenges. ""I want to be clear that we're not doing this as a shortcut to profitability,"" Group CEO and Co-Founder Anthony Tan said in an e-mail to employees.",singaporebased ridehailing food delivery app grab laid 1000 employees grab ' largest round layoffs since 2020 cut 360 jobs response covid19 pandemic challenges want clear ' shortcut profitability group ceo cofounder anthony tan said email employees,singaporebas ridehail and food deliveri app grab ha laid off over 1000 employe thi is grab ' s largest round of layoff sinc 2020 when it cut 360 job in respons to covid19 pandem challeng i want to be clear that we ' re not do thi as a shortcut to profit group ceo and cofound anthoni tan said in an email to employe,singaporebased ridehailing and food delivery app grab ha laid off over 1000 employee this is grab ' s largest round of layoff since 2020 when it cut 360 job in response to covid19 pandemic challenge i want to be clear that we ' re not doing this a a shortcut to profitability group ceo and cofounder anthony tan said in an email to employee


In [43]:
stem(news_df.clean[0])

'benchmark indic sensex nifti end record close high wednesday sensex end 195 point higher 63523 nifti end 1885685 40 point gain led stock like hdfc relianc industri tc intraday trade sensex rose fresh record high level 63588'

In [44]:
news_df['stemmed'] = news_df.clean.apply(stem)
news_df

Unnamed: 0,category,title,original,clean,stemmed,lemmatized
0,business,"Sensex, Nifty end at fresh closing highs","Benchmark indices Sensex and Nifty ended at record closing highs on Wednesday. Sensex ended 195 points higher at 63,523 while the Nifty ended at 18,856.85, up 40 points. The gains were led by stocks like HDFC, Reliance Industries and TCS. During the intraday trade, Sensex rose to its fresh record high level of 63,588.",benchmark indices sensex nifty ended record closing highs wednesday sensex ended 195 points higher 63523 nifty ended 1885685 40 points gains led stocks like hdfc reliance industries tcs intraday trade sensex rose fresh record high level 63588,benchmark indic sensex nifti end record close high wednesday sensex end 195 point higher 63523 nifti end 1885685 40 point gain led stock like hdfc relianc industri tc intraday trade sensex rose fresh record high level 63588,benchmark index sensex and nifty ended at record closing high on wednesday sensex ended 195 point higher at 63523 while the nifty ended at 1885685 up 40 point the gain were led by stock like hdfc reliance industry and tc during the intraday trade sensex rose to it fresh record high level of 63588
1,business,Amazon tricked millions of customers into enrolling in Prime: US FTC,"US Federal Trade Commission (FTC) has sued Amazon, accusing it of tricking millions of consumers into signing up for its Prime subscription without their consent. ""Amazon used manipulative, coercive or deceptive user-interface designs known as 'dark patterns' to trick consumers into enrolling in automatically-renewing Prime subscriptions,"" US FTC said. Prime members in the US pay $139 per year.",us federal trade commission ftc sued amazon accusing tricking millions consumers signing prime subscription without consent amazon used manipulative coercive deceptive userinterface designs known ' dark patterns ' trick consumers enrolling automaticallyrenewing prime subscriptions us ftc said prime members us pay 139 per year,us feder trade commiss ftc su amazon accus trick million consum sign prime subscript without consent amazon use manipul coerciv decept userinterfac design known ' dark pattern ' trick consum enrol automaticallyrenew prime subscript us ftc said prime member us pay 139 per year,u federal trade commission ftc ha sued amazon accusing it of tricking million of consumer into signing up for it prime subscription without their consent amazon used manipulative coercive or deceptive userinterface design known a ' dark pattern ' to trick consumer into enrolling in automaticallyrenewing prime subscription u ftc said prime member in the u pay 139 per year
2,business,TIME releases list of the world's 100 most influential companies,"TIME magazine has released its annual list of the world's 100 most influential companies, which features OpenAI, SpaceX, Chess.com, Google DeepMind and Kim Kardashian's SKIMS among others. The National Payments Corporation of India (NPCI) and e-commerce platform Meesho also featured on the list. ""NPCI launched UPI...which accounted for 52% of India's digital transactions in FY22,"" TIME said.",time magazine released annual list world ' 100 influential companies features openai spacex chesscom google deepmind kim kardashian ' skims among others national payments corporation india npci ecommerce platform meesho also featured list npci launched upiwhich accounted 52 india ' digital transactions fy22 time said,time magazin releas annual list world ' 100 influenti compani featur openai spacex chesscom googl deepmind kim kardashian ' skim among other nation payment corpor india npci ecommerc platform meesho also featur list npci launch upiwhich account 52 india ' digit transact fy22 time said,time magazine ha released it annual list of the world ' s 100 most influential company which feature openai spacex chesscom google deepmind and kim kardashian ' s skim among others the national payment corporation of india npci and ecommerce platform meesho also featured on the list npci launched upiwhich accounted for 52 of india ' s digital transaction in fy22 time said
3,business,Which are the world's top 10 airlines according to passengers?,"Singapore Airlines is the world's best airline, according to Skytrax World Airline Awards 2023, an annual poll of flyers released at the Paris Air Show. It is followed by Qatar Airways, All Nippon Airways, Emirates, Japan Airlines, Turkish Airlines, Air France, Cathay Pacific, EVA Air, and Korean Air. Vistara, ranked 16th, is the only Indian airline in the top 20.",singapore airlines world ' best airline according skytrax world airline awards 2023 annual poll flyers released paris air show followed qatar airways nippon airways emirates japan airlines turkish airlines air france cathay pacific eva air korean air vistara ranked 16th indian airline top 20,singapor airlin world ' best airlin accord skytrax world airlin award 2023 annual poll flyer releas pari air show follow qatar airway nippon airway emir japan airlin turkish airlin air franc cathay pacif eva air korean air vistara rank 16th indian airlin top 20,singapore airline is the world ' s best airline according to skytrax world airline award 2023 an annual poll of flyer released at the paris air show it is followed by qatar airway all nippon airway emirate japan airline turkish airline air france cathay pacific eva air and korean air vistara ranked 16th is the only indian airline in the top 20
4,business,"Grab lays off over 1,000 employees","Singapore-based ride-hailing and food delivery app Grab has laid off over 1,000 employees. This is Grab's largest round of layoffs since 2020, when it cut 360 jobs in response to COVID-19 pandemic challenges. ""I want to be clear that we're not doing this as a shortcut to profitability,"" Group CEO and Co-Founder Anthony Tan said in an e-mail to employees.",singaporebased ridehailing food delivery app grab laid 1000 employees grab ' largest round layoffs since 2020 cut 360 jobs response covid19 pandemic challenges want clear ' shortcut profitability group ceo cofounder anthony tan said email employees,singaporebas ridehail food deliveri app grab laid 1000 employe grab ' largest round layoff sinc 2020 cut 360 job respons covid19 pandem challeng want clear ' shortcut profit group ceo cofound anthoni tan said email employe,singaporebased ridehailing and food delivery app grab ha laid off over 1000 employee this is grab ' s largest round of layoff since 2020 when it cut 360 job in response to covid19 pandemic challenge i want to be clear that we ' re not doing this a a shortcut to profitability group ceo and cofounder anthony tan said in an email to employee
...,...,...,...,...,...,...
95,entertainment,He was wonderful: Harrison Ford on 'Indiana..' co-star Amrish Puri,"Actor Harrison Ford remembered late actor Amrish Puri, who had worked in 1984's 'Indiana Jones and the Temple of Doom'. ""He was a wonderful person...very charming man. (He was) nothing like the character that he played...Very sophisticated. I really admired him...enjoyed working with him,"" Ford said about Puri in an interview. The 1984 film featured Puri as the antagonist.",actor harrison ford remembered late actor amrish puri worked 1984 ' ' indiana jones temple doom ' wonderful personvery charming man nothing like character playedvery sophisticated really admired himenjoyed working ford said puri interview 1984 film featured puri antagonist,actor harrison ford rememb late actor amrish puri work 1984 ' ' indiana jone templ doom ' wonder personveri charm man noth like charact playedveri sophist realli admir himenjoy work ford said puri interview 1984 film featur puri antagonist,actor harrison ford remembered late actor amrish puri who had worked in 1984 ' s ' indiana jones and the temple of doom ' he wa a wonderful personvery charming man he wa nothing like the character that he playedvery sophisticated i really admired himenjoyed working with him ford said about puri in an interview the 1984 film featured puri a the antagonist
96,entertainment,Admire her audacity: Mahesh Bhatt on Pooja entering 'Bigg Boss...',"Filmmaker Mahesh Bhatt reacted to his daughter Pooja Bhatt's participation in 'Bigg Boss OTT 2'. ""Life's greatest adventures begin when we step into the realm of the unknown with courage and curiosity. She has done just that. I admire her audacity,"" Bhatt told ETimes. Pooja, on the show, had opened up about battling alcohol addiction at the age of 44.",filmmaker mahesh bhatt reacted daughter pooja bhatt ' participation ' bigg boss ott 2 ' life ' greatest adventures begin step realm unknown courage curiosity done admire audacity bhatt told etimes pooja show opened battling alcohol addiction age 44,filmmak mahesh bhatt react daughter pooja bhatt ' particip ' bigg boss ott 2 ' life ' greatest adventur begin step realm unknown courag curios done admir audac bhatt told etim pooja show open battl alcohol addict age 44,filmmaker mahesh bhatt reacted to his daughter pooja bhatt ' s participation in ' bigg bos ott 2 ' life ' s greatest adventure begin when we step into the realm of the unknown with courage and curiosity she ha done just that i admire her audacity bhatt told etimes pooja on the show had opened up about battling alcohol addiction at the age of 44
97,entertainment,"Kartik made me wait during 'Bhool...', I used to scold him: Kiara","Actress Kiara Advani, while talking about her 'Satyaprem Ki Katha' co-actor Kartik Aaryan, said the actor used to make her wait during the shoot of their 2022 film 'Bhool Bhulaiyaa 2'. ""I used to scold Kartik, would tell him not to come late this time over and make me wait,"" Kiara said. ""We've grown professionally and personally,"" she added.",actress kiara advani talking ' satyaprem ki katha ' coactor kartik aaryan said actor used make wait shoot 2022 film ' bhool bhulaiyaa 2 ' used scold kartik would tell come late time make wait kiara said ' grown professionally personally added,actress kiara advani talk ' satyaprem ki katha ' coactor kartik aaryan said actor use make wait shoot 2022 film ' bhool bhulaiyaa 2 ' use scold kartik would tell come late time make wait kiara said ' grown profession person ad,actress kiara advani while talking about her ' satyaprem ki katha ' coactor kartik aaryan said the actor used to make her wait during the shoot of their 2022 film ' bhool bhulaiyaa 2 ' i used to scold kartik would tell him not to come late this time over and make me wait kiara said we ' ve grown professionally and personally she added
98,entertainment,It was hard to find 300 transgender people for 'Haddi': Producer,"Producer Sanjay Saha said that finding 300 transgender people for Nawazuddin Siddiqui's 'Haddi' was ""very adventurous and hard"". He added that a transwoman named Renuka helped the filmmakers to learn about transgender community. ""She had brought some of her friends from the community to Nawaz so that he could get into the character and deeply understand their life,"" he added.",producer sanjay saha said finding 300 transgender people nawazuddin siddiqui ' ' haddi ' adventurous hard added transwoman named renuka helped filmmakers learn transgender community brought friends community nawaz could get character deeply understand life added,produc sanjay saha said find 300 transgend peopl nawazuddin siddiqui ' ' haddi ' adventur hard ad transwoman name renuka help filmmak learn transgend commun brought friend commun nawaz could get charact deepli understand life ad,producer sanjay saha said that finding 300 transgender people for nawazuddin siddiqui ' s ' haddi ' wa very adventurous and hard he added that a transwoman named renuka helped the filmmaker to learn about transgender community she had brought some of her friend from the community to nawaz so that he could get into the character and deeply understand their life he added


In [45]:
lemmatize(news_df.clean[0])

'benchmark index sensex nifty ended record closing high wednesday sensex ended 195 point higher 63523 nifty ended 1885685 40 point gain led stock like hdfc reliance industry tc intraday trade sensex rose fresh record high level 63588'
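
To see how stemming and lemmatizing differ on this article, the two outputs can be compared token by token against the clean text. This is a small inspection sketch, not part of the requested prepare.py. `changed_tokens` is a hypothetical helper, and it assumes both steps keep the number of tokens unchanged, which holds for the strings shown here.

```python
def changed_tokens(before, after):
    """Pair up tokens by position and keep the pairs that differ."""
    return [(b, a) for b, a in zip(before.split(), after.split()) if b != a]

# First eight tokens of the notebook's outputs for news_df.clean[0].
clean = 'benchmark indices sensex nifty ended record closing highs'
stemmed = 'benchmark indic sensex nifti end record close high'
lemmatized = 'benchmark index sensex nifty ended record closing high'

stem_changes = changed_tokens(clean, stemmed)
lemma_changes = changed_tokens(clean, lemmatized)

print(stem_changes)
print(lemma_changes)
```

On this excerpt the stemmer alters five tokens and produces non-words such as "indic" and "nifti", while the lemmatizer alters only the two plurals and returns dictionary forms ("index", "high").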

In [46]:
news_df['lemmatized'] = news_df.clean.apply(lemmatize)
news_df

Unnamed: 0,category,title,original,clean,stemmed,lemmatized
0,business,"Sensex, Nifty end at fresh closing highs","Benchmark indices Sensex and Nifty ended at record closing highs on Wednesday. Sensex ended 195 points higher at 63,523 while the Nifty ended at 18,856.85, up 40 points. The gains were led by stocks like HDFC, Reliance Industries and TCS. During the intraday trade, Sensex rose to its fresh record high level of 63,588.",benchmark indices sensex nifty ended record closing highs wednesday sensex ended 195 points higher 63523 nifty ended 1885685 40 points gains led stocks like hdfc reliance industries tcs intraday trade sensex rose fresh record high level 63588,benchmark indic sensex nifti end record close high wednesday sensex end 195 point higher 63523 nifti end 1885685 40 point gain led stock like hdfc relianc industri tc intraday trade sensex rose fresh record high level 63588,benchmark index sensex nifty ended record closing high wednesday sensex ended 195 point higher 63523 nifty ended 1885685 40 point gain led stock like hdfc reliance industry tc intraday trade sensex rose fresh record high level 63588
1,business,Amazon tricked millions of customers into enrolling in Prime: US FTC,"US Federal Trade Commission (FTC) has sued Amazon, accusing it of tricking millions of consumers into signing up for its Prime subscription without their consent. ""Amazon used manipulative, coercive or deceptive user-interface designs known as 'dark patterns' to trick consumers into enrolling in automatically-renewing Prime subscriptions,"" US FTC said. Prime members in the US pay $139 per year.",us federal trade commission ftc sued amazon accusing tricking millions consumers signing prime subscription without consent amazon used manipulative coercive deceptive userinterface designs known ' dark patterns ' trick consumers enrolling automaticallyrenewing prime subscriptions us ftc said prime members us pay 139 per year,us feder trade commiss ftc su amazon accus trick million consum sign prime subscript without consent amazon use manipul coerciv decept userinterfac design known ' dark pattern ' trick consum enrol automaticallyrenew prime subscript us ftc said prime member us pay 139 per year,u federal trade commission ftc sued amazon accusing tricking million consumer signing prime subscription without consent amazon used manipulative coercive deceptive userinterface design known ' dark pattern ' trick consumer enrolling automaticallyrenewing prime subscription u ftc said prime member u pay 139 per year
2,business,TIME releases list of the world's 100 most influential companies,"TIME magazine has released its annual list of the world's 100 most influential companies, which features OpenAI, SpaceX, Chess.com, Google DeepMind and Kim Kardashian's SKIMS among others. The National Payments Corporation of India (NPCI) and e-commerce platform Meesho also featured on the list. ""NPCI launched UPI...which accounted for 52% of India's digital transactions in FY22,"" TIME said.",time magazine released annual list world ' 100 influential companies features openai spacex chesscom google deepmind kim kardashian ' skims among others national payments corporation india npci ecommerce platform meesho also featured list npci launched upiwhich accounted 52 india ' digital transactions fy22 time said,time magazin releas annual list world ' 100 influenti compani featur openai spacex chesscom googl deepmind kim kardashian ' skim among other nation payment corpor india npci ecommerc platform meesho also featur list npci launch upiwhich account 52 india ' digit transact fy22 time said,time magazine released annual list world ' 100 influential company feature openai spacex chesscom google deepmind kim kardashian ' skim among others national payment corporation india npci ecommerce platform meesho also featured list npci launched upiwhich accounted 52 india ' digital transaction fy22 time said
3,business,Which are the world's top 10 airlines according to passengers?,"Singapore Airlines is the world's best airline, according to Skytrax World Airline Awards 2023, an annual poll of flyers released at the Paris Air Show. It is followed by Qatar Airways, All Nippon Airways, Emirates, Japan Airlines, Turkish Airlines, Air France, Cathay Pacific, EVA Air, and Korean Air. Vistara, ranked 16th, is the only Indian airline in the top 20.",singapore airlines world ' best airline according skytrax world airline awards 2023 annual poll flyers released paris air show followed qatar airways nippon airways emirates japan airlines turkish airlines air france cathay pacific eva air korean air vistara ranked 16th indian airline top 20,singapor airlin world ' best airlin accord skytrax world airlin award 2023 annual poll flyer releas pari air show follow qatar airway nippon airway emir japan airlin turkish airlin air franc cathay pacif eva air korean air vistara rank 16th indian airlin top 20,singapore airline world ' best airline according skytrax world airline award 2023 annual poll flyer released paris air show followed qatar airway nippon airway emirate japan airline turkish airline air france cathay pacific eva air korean air vistara ranked 16th indian airline top 20
4,business,"Grab lays off over 1,000 employees","Singapore-based ride-hailing and food delivery app Grab has laid off over 1,000 employees. This is Grab's largest round of layoffs since 2020, when it cut 360 jobs in response to COVID-19 pandemic challenges. ""I want to be clear that we're not doing this as a shortcut to profitability,"" Group CEO and Co-Founder Anthony Tan said in an e-mail to employees.",singaporebased ridehailing food delivery app grab laid 1000 employees grab ' largest round layoffs since 2020 cut 360 jobs response covid19 pandemic challenges want clear ' shortcut profitability group ceo cofounder anthony tan said email employees,singaporebas ridehail food deliveri app grab laid 1000 employe grab ' largest round layoff sinc 2020 cut 360 job respons covid19 pandem challeng want clear ' shortcut profit group ceo cofound anthoni tan said email employe,singaporebased ridehailing food delivery app grab laid 1000 employee grab ' largest round layoff since 2020 cut 360 job response covid19 pandemic challenge want clear ' shortcut profitability group ceo cofounder anthony tan said email employee
...,...,...,...,...,...,...
95,entertainment,He was wonderful: Harrison Ford on 'Indiana..' co-star Amrish Puri,"Actor Harrison Ford remembered late actor Amrish Puri, who had worked in 1984's 'Indiana Jones and the Temple of Doom'. ""He was a wonderful person...very charming man. (He was) nothing like the character that he played...Very sophisticated. I really admired him...enjoyed working with him,"" Ford said about Puri in an interview. The 1984 film featured Puri as the antagonist.",actor harrison ford remembered late actor amrish puri worked 1984 ' ' indiana jones temple doom ' wonderful personvery charming man nothing like character playedvery sophisticated really admired himenjoyed working ford said puri interview 1984 film featured puri antagonist,actor harrison ford rememb late actor amrish puri work 1984 ' ' indiana jone templ doom ' wonder personveri charm man noth like charact playedveri sophist realli admir himenjoy work ford said puri interview 1984 film featur puri antagonist,actor harrison ford remembered late actor amrish puri worked 1984 ' ' indiana jones temple doom ' wonderful personvery charming man nothing like character playedvery sophisticated really admired himenjoyed working ford said puri interview 1984 film featured puri antagonist
96,entertainment,Admire her audacity: Mahesh Bhatt on Pooja entering 'Bigg Boss...',"Filmmaker Mahesh Bhatt reacted to his daughter Pooja Bhatt's participation in 'Bigg Boss OTT 2'. ""Life's greatest adventures begin when we step into the realm of the unknown with courage and curiosity. She has done just that. I admire her audacity,"" Bhatt told ETimes. Pooja, on the show, had opened up about battling alcohol addiction at the age of 44.",filmmaker mahesh bhatt reacted daughter pooja bhatt ' participation ' bigg boss ott 2 ' life ' greatest adventures begin step realm unknown courage curiosity done admire audacity bhatt told etimes pooja show opened battling alcohol addiction age 44,filmmak mahesh bhatt react daughter pooja bhatt ' particip ' bigg boss ott 2 ' life ' greatest adventur begin step realm unknown courag curios done admir audac bhatt told etim pooja show open battl alcohol addict age 44,filmmaker mahesh bhatt reacted daughter pooja bhatt ' participation ' bigg bos ott 2 ' life ' greatest adventure begin step realm unknown courage curiosity done admire audacity bhatt told etimes pooja show opened battling alcohol addiction age 44
97,entertainment,"Kartik made me wait during 'Bhool...', I used to scold him: Kiara","Actress Kiara Advani, while talking about her 'Satyaprem Ki Katha' co-actor Kartik Aaryan, said the actor used to make her wait during the shoot of their 2022 film 'Bhool Bhulaiyaa 2'. ""I used to scold Kartik, would tell him not to come late this time over and make me wait,"" Kiara said. ""We've grown professionally and personally,"" she added.",actress kiara advani talking ' satyaprem ki katha ' coactor kartik aaryan said actor used make wait shoot 2022 film ' bhool bhulaiyaa 2 ' used scold kartik would tell come late time make wait kiara said ' grown professionally personally added,actress kiara advani talk ' satyaprem ki katha ' coactor kartik aaryan said actor use make wait shoot 2022 film ' bhool bhulaiyaa 2 ' use scold kartik would tell come late time make wait kiara said ' grown profession person ad,actress kiara advani talking ' satyaprem ki katha ' coactor kartik aaryan said actor used make wait shoot 2022 film ' bhool bhulaiyaa 2 ' used scold kartik would tell come late time make wait kiara said ' grown professionally personally added
98,entertainment,It was hard to find 300 transgender people for 'Haddi': Producer,"Producer Sanjay Saha said that finding 300 transgender people for Nawazuddin Siddiqui's 'Haddi' was ""very adventurous and hard"". He added that a transwoman named Renuka helped the filmmakers to learn about transgender community. ""She had brought some of her friends from the community to Nawaz so that he could get into the character and deeply understand their life,"" he added.",producer sanjay saha said finding 300 transgender people nawazuddin siddiqui ' ' haddi ' adventurous hard added transwoman named renuka helped filmmakers learn transgender community brought friends community nawaz could get character deeply understand life added,produc sanjay saha said find 300 transgend peopl nawazuddin siddiqui ' ' haddi ' adventur hard ad transwoman name renuka help filmmak learn transgend commun brought friend commun nawaz could get charact deepli understand life ad,producer sanjay saha said finding 300 transgender people nawazuddin siddiqui ' ' haddi ' adventurous hard added transwoman named renuka helped filmmaker learn transgender community brought friend community nawaz could get character deeply understand life added


## 9. Ask yourself:
 
    • If your corpus is 493KB, would you prefer to use stemmed or lemmatized text?
    • If your corpus is 25MB, would you prefer to use stemmed or lemmatized text?
    • If your corpus is 200TB of text and you're charged by the megabyte for your hosted computational resources, would you prefer to use stemmed or lemmatized text?

    • If your corpus is 493KB, would you prefer to use stemmed or lemmatized text?
- I would prefer lemmatized text. The corpus is small, so the slower lemmatizer is affordable and the output keeps real words.

    • If your corpus is 25MB, would you prefer to use stemmed or lemmatized text?
- I would still prefer lemmatized text. At this size the extra processing time is manageable.

    • If your corpus is 200TB of text and you're charged by the megabyte for your hosted computational resources, would you prefer to use stemmed or lemmatized text?
- In this case I would use stemmed text. Stemming is faster, and at this scale the improvement from lemmatizing is not worth the extra compute cost.
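
One way to back these answers with numbers is to time each transform on a small sample and extrapolate to the corpus size. The sketch below is only an illustration: `seconds_per_byte`, `projected_hours`, `CORPUS_SIZES`, and `placeholder_transform` are hypothetical names, the sizes assume binary units (1KB = 1024 bytes), and the placeholder stands in for a real stemming or lemmatizing function that you would pass in yourself. It also assumes runtime grows linearly with corpus size, which is a rough approximation.

```python
import time


def seconds_per_byte(transform, sample_text, repeats=3):
    """Average wall-clock seconds `transform` spends per byte of sample text."""
    n_bytes = len(sample_text.encode("utf-8"))
    start = time.perf_counter()
    for _ in range(repeats):
        transform(sample_text)
    elapsed = time.perf_counter() - start
    return elapsed / (repeats * n_bytes)


def projected_hours(rate, corpus_bytes):
    """Extrapolate a per-byte rate to a whole corpus, assuming linear scaling."""
    return rate * corpus_bytes / 3600


# corpus sizes from the questions above (assumption: binary units)
CORPUS_SIZES = {
    "493KB": 493 * 1024,
    "25MB": 25 * 1024 ** 2,
    "200TB": 200 * 1024 ** 4,
}


def placeholder_transform(text):
    # hypothetical stand-in: swap in your own stemming or lemmatizing function
    return " ".join(word[:5] for word in text.split())


sample = "actor harrison ford remembered late actor amrish puri " * 200
rate = seconds_per_byte(placeholder_transform, sample)

for label, size in CORPUS_SIZES.items():
    print(f"{label}: about {projected_hours(rate, size):.6f} hours")
```

Running the same measurement once with a stemming function and once with a lemmatizing function gives two rates to compare for each corpus size.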

In [None]:
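def basic_clean(article):
    """
    Take in a string and return it fully prepared:
    lowercased, normalized to ASCII, stripped of special characters,
    tokenized, lemmatized, and with stopwords removed.

    Note: despite its name, this version chains every preparation step
    into one function, not just the basic cleaning.
    """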
    # lowercase text
    article = article.lower()
    
    # remove any accented characters and non-ASCII characters
    # normalizing
    # getting rid of anything not in ascii
    # turning back to a string
    article = unicodedata.normalize('NFKD', article).encode('ascii','ignore').decode('utf-8')
    
    # remove special characters
    #use re.sub to remove special characters
    article = re.sub(r'[^a-z0-9\'\s]', '', article)
    
    # tokenization is the process of breaking something down into smaller, discrete units.
    # these units are called tokens.
    #create the tokenizer
    tokenize = nltk.tokenize.ToktokTokenizer()
    article = tokenize.tokenize(article, return_str=True)
    
    # Lemmatize
    # - reduces each word to its dictionary base form (lemma)
    # - handles irregular forms, example: "mice" --> "mouse"
    # - slower than stemming
    #create the lemmatizer
    wnl = nltk.stem.WordNetLemmatizer()
    
    #use lemmatize - apply the lemmatizer to each word in our string
    # (with no pos argument, WordNetLemmatizer treats every word as a noun)
    lemma = [wnl.lemmatize(word) for word in article.split()]
    
    #join words back together
    article_lemma = ' '.join(lemma)
    
    #save stopwords
    stopwords_ls = stopwords.words('english')
    
    # sort words inside stopwords
    stopwords_ls.sort()
    
    # #set a list of words to keep in the text IF THEY ARE NEEDED!
    # extra = ['all', 'about','after']
    # # remove those words from the stopword list (the result must be assigned back)
    # stopwords_ls = list(set(stopwords_ls) - set(extra))

    #add to stopword list
    stopwords_ls.append("'")
    
    # #remove from stopword list
    # stopwords_ls.remove('o')
    
    #split words in lemmatized article
    words = article_lemma.split()
        
    #remove stopwords from list of words
    filtered = [word for word in words if word not in stopwords_ls]
    
    #join words back together
    parsed_article = ' '.join(filtered)
    
    return parsed_article
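
The stopword step in the function above tests each word against a list, which scans the whole list for every word. On larger corpora a set is usually faster for membership tests. The sketch below is a minimal illustration, not a replacement for the function above: `build_stopword_set` and `remove_stopwords` are hypothetical helper names, and `demo_stopwords` is a tiny made-up list standing in for `stopwords.words('english')`.

```python
def build_stopword_set(base_words, extra_words=(), exclude_words=()):
    """Combine a base stopword list with words to add and words to keep in the text."""
    return (set(base_words) | set(extra_words)) - set(exclude_words)


def remove_stopwords(text, stopword_set):
    """Drop every whitespace-separated token that appears in stopword_set."""
    return ' '.join(word for word in text.split() if word not in stopword_set)


# tiny hypothetical list; in the notebook this would be stopwords.words('english')
demo_stopwords = build_stopword_set(
    ["the", "a", "was", "he", "all"],
    extra_words=["'"],        # also strip stray single quotes
    exclude_words=["all"],    # keep 'all' in the text
)

cleaned = remove_stopwords("he was a wonderful person ' after all the end", demo_stopwords)
print(cleaned)
```

Building the set once and reusing it across articles avoids rebuilding the stopword list on every call.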