# NLP Data Acquisition Exercises
---

#### By the end of this exercise, you should have a file named `acquire.py` that contains the specified functions. If you wish, you may break your work into separate files for each website (e.g., `acquire_codeup_blog.py` and `acquire_news_articles.py`), but the final functions should be available from `acquire.py` (that is, `acquire.py` should import `get_blog_articles` from the `acquire_codeup_blog` module).

### 1. Codeup Blog Articles

#### Visit Codeup's blog and record the URLs for at least 5 distinct blog posts. For each post, you should scrape at least the post's title and content.

- https://codeup.com/codeup-news/inclusion-at-codeup-during-pride-month-and-always/
- https://codeup.com/codeup-news/codeup-candidate-for-accreditation/
- https://codeup.com/codeup-news/codeup-takes-over-more-of-the-historic-vogue-building/
- https://codeup.com/codeup-news/is-codeup-the-best-bootcamp-in-san-antonio-or-the-world/
- https://codeup.com/codeup-news/codeup-launches-first-podcast-hire-tech/

#### Encapsulate your work in a function named `get_blog_articles` that will return a list of dictionaries, with each dictionary representing one article. The shape of each dictionary should look like this:

>
>{
>
>        'title': 'the title of the article',
>
>        'content': 'the full text content of the article'
>
>}
>

#### Plus any additional properties you think might be helpful.

#### Bonus: Scrape the text of all the articles linked on Codeup's blog page.

In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

In [2]:
response = requests.get('https://codeup.com/codeup-news/inclusion-at-codeup-during-pride-month-and-always/', headers={'user-agent': 'Codeup DS Hopper'})
soup = BeautifulSoup(response.text, 'html.parser')
print(soup.prettify())

<!DOCTYPE html>
<html lang="en-US">
 <head>
  <meta charset="utf-8"/>
  <meta content="IE=edge" http-equiv="X-UA-Compatible"/>
  <link href="https://codeup.com/xmlrpc.php" rel="pingback"/>
  <script type="text/javascript">
   document.documentElement.className = 'js';
  </script>
  <link crossorigin="" href="https://fonts.gstatic.com" rel="preconnect"/>
  <script id="diviarea-loader">
   window.DiviPopupData=window.DiviAreaConfig={"zIndex":1000000,"animateSpeed":400,"triggerClassPrefix":"show-popup-","idAttrib":"data-popup","modalIndicatorClass":"is-modal","blockingIndicatorClass":"is-blocking","defaultShowCloseButton":true,"withCloseClass":"with-close","noCloseClass":"no-close","triggerCloseClass":"close","singletonClass":"single","darkModeClass":"dark","noShadowClass":"no-shadow","altCloseClass":"close-alt","popupSelector":".et_pb_section.popup","initializeOnEvent":"et_pb_after_init_modules","popupWrapperClass":"area-outer-wrap","fullHeightClass":"full-height","openPopupClass":"da-ov

In [3]:
# isolate title
soup.select('title')[0].text

'Inclusion at Codeup During Pride Month (and Always) - Codeup'
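
The scraped `<title>` text carries a site suffix (` - Codeup`), as the output above shows. If only the headline is wanted, a small helper can drop it. This is a minimal sketch: `strip_site_suffix` is a hypothetical name, and the suffix is assumed from the output above rather than from any guarantee about the site's markup.

```python
def strip_site_suffix(title, suffix=' - Codeup'):
    """Return the title without the trailing site suffix, if present."""
    title = title.strip()
    # only trim when the (assumed) suffix is actually at the end
    if suffix and title.endswith(suffix):
        return title[:-len(suffix)].rstrip()
    return title


print(strip_site_suffix('Inclusion at Codeup During Pride Month (and Always) - Codeup'))
# prints: Inclusion at Codeup During Pride Month (and Always)
```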

In [4]:
# isolate content
' '.join(p.text for p in soup('p')[1:])

'Happy Pride Month! Pride Month is a dedicated time to celebrate and support the LGBTQIA+ community. At Codeup, one of our core values is Cultivating Inclusive Growth, something that takes on many shapes, sizes, forms, and colors. From representation in tech to empowering and supporting all, let’s reflect on how we live out this core value for our LGBTQIA+ community, not just during Pride Month, but always. We’re firm believers that the people making tech should look like the people using it, which is everyone. We’re proud to offer Pride Scholarships year round, which aim to increase, support, and promote representation of the LGBTQIA+ community in tech. However, representation is only one part of cultivating inclusive growth. We want to help create a thriving tech community where everyone feels represented, but also safe and empowered. In a 2019 survey conducted by Blind, 83% of LGBQ technologists and 78% of trans or gender non-conforming technologists reported that they felt safe in 

In [5]:
len(soup('p'))

13

In [6]:
# define function
def get_codeup_articles(blog):
    '''
    This function takes in a Codeup blog url and returns a dictionary of its title 
    and contents.
    '''
    response = requests.get(blog, headers={'user-agent': 'Codeup DS Hopper'})
    soup = BeautifulSoup(response.text, 'html.parser')
    title = soup.select('title')[0].text
    # join the text of every <p> tag into one string
    content = ' '.join(p.text for p in soup('p'))
    return {
        'title': title,
        'content': content
    }
# test function
get_codeup_articles('https://codeup.com/codeup-news/inclusion-at-codeup-during-pride-month-and-always/')

{'title': 'Inclusion at Codeup During Pride Month (and Always) - Codeup',
 'content': 'Jun 4, 2021 | Codeup News Happy Pride Month! Pride Month is a dedicated time to celebrate and support the LGBTQIA+ community. At Codeup, one of our core values is Cultivating Inclusive Growth, something that takes on many shapes, sizes, forms, and colors. From representation in tech to empowering and supporting all, let’s reflect on how we live out this core value for our LGBTQIA+ community, not just during Pride Month, but always. We’re firm believers that the people making tech should look like the people using it, which is everyone. We’re proud to offer Pride Scholarships year round, which aim to increase, support, and promote representation of the LGBTQIA+ community in tech. However, representation is only one part of cultivating inclusive growth. We want to help create a thriving tech community where everyone feels represented, but also safe and empowered. In a 2019 survey conducted by Blind, 83

In [7]:
# list urls
blogs = ['https://codeup.com/codeup-news/inclusion-at-codeup-during-pride-month-and-always/',
         'https://codeup.com/codeup-news/codeup-candidate-for-accreditation/',
         'https://codeup.com/codeup-news/codeup-takes-over-more-of-the-historic-vogue-building/',
         'https://codeup.com/codeup-news/is-codeup-the-best-bootcamp-in-san-antonio-or-the-world/',
         'https://codeup.com/codeup-news/codeup-launches-first-podcast-hire-tech/']
# loop through all urls
pd.DataFrame([get_codeup_articles(blog) for blog in blogs])

| | title | content |
|---|---|---|
| 0 | Inclusion at Codeup During Pride Month (and Al... | Jun 4, 2021 \| Codeup News Happy Pride Month! P... |
| 1 | Announcing our Candidacy for Accreditation! - ... | Jun 30, 2021 \| Codeup News Did you know that e... |
| 2 | Codeup Takes Over More of the Historic Vogue B... | Jun 21, 2021 \| Codeup News, Featured Codeup is... |
| 3 | Is Codeup the Best Bootcamp in San Antonio...o... | Sep 16, 2021 \| Codeup News, Featured Looking f... |
| 4 | Codeup Launches First Podcast: Hire Tech - Codeup | Aug 25, 2021 \| Codeup News, Featured Any podca... |


In [8]:
# view html for main blog page
soup = BeautifulSoup(requests.get('https://codeup.com/blog/', headers={'user-agent': 'Codeup DS Hopper'}).text, 'html.parser')
print(soup.prettify())

<!DOCTYPE html>
<html lang="en-US">
 <head>
  <meta charset="utf-8"/>
  <meta content="IE=edge" http-equiv="X-UA-Compatible"/>
  <link href="https://codeup.com/xmlrpc.php" rel="pingback"/>
  <script type="text/javascript">
   document.documentElement.className = 'js';
  </script>
  <link crossorigin="" href="https://fonts.gstatic.com" rel="preconnect"/>
  <script id="diviarea-loader">
   window.DiviPopupData=window.DiviAreaConfig={"zIndex":1000000,"animateSpeed":400,"triggerClassPrefix":"show-popup-","idAttrib":"data-popup","modalIndicatorClass":"is-modal","blockingIndicatorClass":"is-blocking","defaultShowCloseButton":true,"withCloseClass":"with-close","noCloseClass":"no-close","triggerCloseClass":"close","singletonClass":"single","darkModeClass":"dark","noShadowClass":"no-shadow","altCloseClass":"close-alt","popupSelector":".et_pb_section.popup","initializeOnEvent":"et_pb_after_init_modules","popupWrapperClass":"area-outer-wrap","fullHeightClass":"full-height","openPopupClass":"da-ov

In [9]:
# list all urls for articles on Codeup's blog page
[link['href'] for link in soup.select('.more-link')]

['https://codeup.com/dallas-newsletter/codeup-dallas-open-house/',
 'https://codeup.com/codeup-news/codeups-placement-team-continues-setting-records/',
 'https://codeup.com/it-training/it-certifications-101/',
 'https://codeup.com/cybersecurity/a-rise-in-cyber-attacks-means-opportunities-for-veterans-in-san-antonio/',
 'https://codeup.com/codeup-news/use-your-gi-bill-benefits-to-land-a-job-in-tech/',
 'https://codeup.com/tips-for-prospective-students/which-program-is-right-for-me-cyber-security-or-systems-engineering/',
 'https://codeup.com/it-training/what-the-heck-is-system-engineering/',
 'https://codeup.com/alumni-stories/from-speech-pathology-to-business-intelligence/',
 'https://codeup.com/behind-the-billboards/boris-behind-the-billboards/',
 'https://codeup.com/codeup-news/is-codeup-the-best-bootcamp-in-san-antonio-or-the-world/',
 'https://codeup.com/codeup-news/codeup-launches-first-podcast-hire-tech/',
 'https://codeup.com/tips-for-prospective-students/why-should-i-become-a-s

In [10]:
# define function
def get_blog_articles():
    '''
    This function returns a dataframe of the titles and contents of all blog posts
    linked on Codeup's blog page.
    '''
    soup = BeautifulSoup(requests.get('https://codeup.com/blog/', headers={'user-agent': 'Codeup DS Hopper'}).text, 'html.parser')
    # collect the href of every "read more" link on the blog page
    blog_urls = [link['href'] for link in soup.select('.more-link')]
    return pd.DataFrame([get_codeup_articles(blog) for blog in blog_urls])
# test function
get_blog_articles()

| | title | content |
|---|---|---|
| 0 | Codeup Dallas Open House - Codeup | Nov 30, 2021 \| Dallas Newsletter, Events Come ... |
| 1 | Codeup Helps 40 Grads Land Tech Jobs in Just 1... | Nov 19, 2021 \| Codeup News, Employers Our Plac... |
| 2 | IT Certifications 101: Why They Matter, and Wh... | Nov 18, 2021 \| IT Training, Tips for Prospecti... |
| 3 | A rise in cyber attacks means opportunities fo... | Nov 17, 2021 \| Cybersecurity In the last few m... |
| 4 | Use your GI Bill® benefits to Land a Job in Te... | Nov 4, 2021 \| Codeup News, Tips for Prospectiv... |
| 5 | Which program is right for me: Cyber Security ... | Oct 28, 2021 \| IT Training, Tips for Prospecti... |
| 6 | What the Heck is System Engineering? - Codeup | Oct 21, 2021 \| IT Training, Tips for Prospecti... |
| 7 | From Speech Pathology to Business Intelligence... | Oct 18, 2021 \| Alumni Stories Before Codeup, I... |
| 8 | Boris - Behind the Billboards - Codeup | Oct 3, 2021 \| Behind the Billboards \n Subm... |
| 9 | Is Codeup the Best Bootcamp in San Antonio...o... | Sep 16, 2021 \| Codeup News, Featured Looking f... |
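
The exercise asks for a list of dictionaries, while the function above returns a DataFrame. One way to keep the requested shape is to collect the per-article dictionaries directly. The sketch below is only an illustration: `collect_articles` and `fake_scraper` are hypothetical names, and `fake_scraper` stands in for `get_codeup_articles` so the example runs without network access.

```python
def collect_articles(urls, scrape_one):
    """Call scrape_one on each url and return the results as a list of dicts."""
    articles = []
    for url in urls:
        article = scrape_one(url)
        # check the shape the exercise asks for before keeping the record
        missing = {'title', 'content'} - set(article)
        if missing:
            raise ValueError(f'{url} is missing keys: {sorted(missing)}')
        articles.append(article)
    return articles


def fake_scraper(url):
    """Hypothetical stand-in for get_codeup_articles; no request is made."""
    return {'title': f'Title for {url}', 'content': 'placeholder text', 'url': url}


articles = collect_articles(['https://example.com/a', 'https://example.com/b'], fake_scraper)
print(len(articles), sorted(articles[0]))
```

If a table is still wanted for inspection, `pd.DataFrame(articles)` builds one from the same list.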


### 2. News Articles

#### We will now be scraping text data from inshorts, a website that provides a brief overview of many different topics.

#### Write a function that scrapes the news articles for the following topics:

- **Business**
- **Sports**
- **Technology**
- **Entertainment**

#### The end product of this should be a function named `get_news_articles` that returns a list of dictionaries, where each dictionary has this shape:

>
>{
>
>        'title': 'The article title',
>
>        'content': 'The article content',
>
>        'category': 'business' # for example
>
>}
>

#### Hints:

- **Start by inspecting the website in your browser. Figure out which elements will be useful.**
- **Start by creating a function that handles a single article and produces a dictionary like the one above.**
- **Next, create a function that will find all the articles on a single page and call the function you created in the last step for every article on the page.**
- **Now create a function that will use the previous two functions to scrape the articles from all the pages that you need, and do any additional processing that needs to be done.**
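
The hints describe three layers: one function per article, one per page, and one that loops over every category. The sketch below shows only that layering. The names (`parse_card`, `parse_page`, `get_all_articles`, `fetch_cards`) are hypothetical, and the cards are plain dictionaries standing in for parsed HTML elements so the example runs offline.

```python
def parse_card(card, category):
    """Layer 1: turn one article card into a dictionary of the requested shape."""
    return {
        'title': card['headline'].strip(),
        'content': card['body'].strip(),
        'category': category,
    }


def parse_page(cards, category):
    """Layer 2: apply parse_card to every card found on one page."""
    return [parse_card(card, category) for card in cards]


def get_all_articles(categories, fetch_cards):
    """Layer 3: fetch each category page and flatten the results into one list."""
    articles = []
    for category in categories:
        articles.extend(parse_page(fetch_cards(category), category))
    return articles


def fetch_cards(category):
    """Hypothetical stand-in for requesting a page and selecting its cards."""
    return [{'headline': f' {category} headline ', 'body': f'{category} body\n'}]


news = get_all_articles(['business', 'sports'], fetch_cards)
print(news[0])
```

Passing the fetch step in as an argument keeps the parsing logic testable without hitting the website on every run.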

In [11]:
# get data from url
response = requests.get('https://inshorts.com/en/news/facebook-parent-metas-$230billion-wipeout-biggest-in-us-market-history-1643910633154')
# create soup object
soup = BeautifulSoup(response.text, 'html.parser')
# view html
print(soup.prettify())

<!DOCTYPE html>
<html lang="en">
 <head>
  <meta charset="utf-8"/>
  <style>
   /* The Modal (background) */
    .modal_contact {
        display: none; /* Hidden by default */
        position: fixed; /* Stay in place */
        z-index: 8; /* Sit on top */
        left: 0;
        top: 0;
        width: 100%; /* Full width */
        height: 100%;
        overflow: auto; /* Enable scroll if needed */
        background-color: rgb(0,0,0); /* Fallback color */
        background-color: rgba(0,0,0,0.4); /* Black w/ opacity */
    }

    /* Modal Content/Box */
    .modal-content {
        background-color: #fefefe;
        margin: 15% auto;
        padding: 20px !important;
        padding-top: 0 !important;
        /* border: 1px solid #888; */
        text-align: center;
        position: relative;
        border-radius: 6px;
    }

    /* The Close Button */
    .close {
      left: 90%;
      color: #aaa;
      float: right;
      font-size: 28px;
      font-weight: bold;
    /* pos

In [12]:
# isolate title
soup('title')[0].text.split('|')[0].strip()

"Facebook parent Meta's $230-billion wipeout biggest in US market history"

In [13]:
# isolate content
soup.select('.news-card-content')[0].text.strip().replace('\n', ' ')

"Facebook's parent Meta's shares plunged 27% and Thursday's collapse wiped out $230 billion of the company's market value. It's the biggest collapse in market value for any US company but there's no certainty that the losses will hold, given the volatility, Bloomberg said. This comes after Facebook's daily active users fell for the first time in its 18-year history.  short by Pragya Swastik /        11:20 pm on 03 Feb"

In [14]:
# isolate category
soup('title')[0].text.split('|')[1].strip().split(' ')[0]

'Business'
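
The content string scraped from `.news-card-content` above ends with a byline (`short by <author> / <time>`) that is not part of the article text. If that matters downstream, it can be split off. This is a minimal sketch, assuming the `short by` marker and the `/` separator always appear as they do in that one output; `split_byline` is a hypothetical helper.

```python
def split_byline(text, marker='short by'):
    """Split scraped card text into (body, author, timestamp)."""
    # split on the last occurrence of the marker
    body, found, byline = text.rpartition(marker)
    if not found:
        # marker absent: treat the whole string as the body
        return text.strip(), None, None
    author, _, timestamp = byline.partition('/')
    # collapse the runs of spaces left over from the page layout
    return body.strip(), author.strip(), ' '.join(timestamp.split())


sample = ("This comes after Facebook's daily active users fell for the first time "
          "in its 18-year history.  short by Pragya Swastik /        11:20 pm on 03 Feb")
body, author, timestamp = split_byline(sample)
print(author, '|', timestamp)
# prints: Pragya Swastik | 11:20 pm on 03 Feb
```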

In [15]:
# get data from url
response = requests.get('https://inshorts.com/en/read', headers={'user-agent': 'Codeup DS'})
# create soup object
soup = BeautifulSoup(response.text, 'html.parser')
# view html
print(soup.prettify())

<!DOCTYPE html>
<html lang="en">
 <head>
  <meta charset="utf-8"/>
  <style>
   /* The Modal (background) */
    .modal_contact {
        display: none; /* Hidden by default */
        position: fixed; /* Stay in place */
        z-index: 8; /* Sit on top */
        left: 0;
        top: 0;
        width: 100%; /* Full width */
        height: 100%;
        overflow: auto; /* Enable scroll if needed */
        background-color: rgb(0,0,0); /* Fallback color */
        background-color: rgba(0,0,0,0.4); /* Black w/ opacity */
    }

    /* Modal Content/Box */
    .modal-content {
        background-color: #fefefe;
        margin: 15% auto;
        padding: 20px !important;
        padding-top: 0 !important;
        /* border: 1px solid #888; */
        text-align: center;
        position: relative;
        border-radius: 6px;
    }

    /* The Close Button */
    .close {
      left: 90%;
      color: #aaa;
      float: right;
      font-size: 28px;
      font-weight: bold;
    /* pos

In [16]:
cards = soup.select('.news-card')
len(cards)

25

In [17]:
# isolate author
card = cards[0]
author = card.find('span', class_ = 'author').text
author

'Sakshita Khosla'

In [18]:
# isolate headline
headline = card.find('span', itemprop = 'headline').text
headline

"Letting colleges decide if hijab is allowed is illegal: Students to K'taka HC"

In [19]:
# isolate content
content = card.find('div', itemprop = 'articleBody').text
content

'During the Karnataka High Court hearing on hijab ban in educational institutions, senior advocate Dev Datt Kamat appearing for the students said, "The delegation to the College Committees to decide whether the hijab is allowed or not is completely illegal." He also argued that Muslim women are allowed to wear headscarves in public. The hearing was adjourned till tomorrow.'

In [21]:
def get_news_article(article):
    '''
    This function takes in a url of a news article from inshorts.com and returns a
    dictionary of the article's title, content, and category.
    '''
    response = requests.get(article, headers={'user-agent': 'Codeup DS'})
    soup = BeautifulSoup(response.text, 'html.parser')
    # split the page <title> on '|' to get the headline and the category
    page_title = soup('title')[0].text
    title = page_title.split('|')[0].strip()
    # read the content from this article's own soup, not from a leftover global `card`
    content = soup.select('.news-card-content')[0].text.strip().replace('\n', ' ')
    category = page_title.split('|')[1].strip().split(' ')[0]
    return {
        'title': title,
        'content': content,
        'category': category
    }
# test function
get_news_article('https://inshorts.com/en/news/facebook-parent-metas-$230billion-wipeout-biggest-in-us-market-history-1643910633154')

{'title': "Facebook parent Meta's $230-billion wipeout biggest in US market history",
 'content': "Facebook's parent Meta's shares plunged 27% and Thursday's collapse wiped out $230 billion of the company's market value. It's the biggest collapse in market value for any US company but there's no certainty that the losses will hold, given the volatility, Bloomberg said. This comes after Facebook's daily active users fell for the first time in its 18-year history.  short by Pragya Swastik /        11:20 pm on 03 Feb",
 'category': 'Business'}

In [24]:
# define function
def get_news_card(card):
    '''
    This function returns a dictionary of relevant information from a news card
    found at inshorts.com.
    '''
    title = card.find('span', itemprop = 'headline').text
    author = card.find('span', class_ = 'author').text
    content = card.find('div', itemprop = 'articleBody').text
    # `clas` (not `class_`) is left as written: it matched an attribute on the page when this ran, as the date in the output below shows
    date = card.find('span', clas ='date').text
    return {
        'title': title,
        'date': date,
        'content': content,
        'author': author
    }
# test function
url = 'https://www.inshorts.com/en/read/business'
response = requests.get(url, headers={'user-agent': 'Codeup DS'})
soup = BeautifulSoup(response.text, 'html.parser')
cards = soup.select('.news-card')
card = cards[0]
get_news_card(card)

{'title': "LIC files draft papers with SEBI to seek approval for India's biggest IPO",
 'date': '13 Feb 2022,Sunday',
 'content': "State-run Life Insurance Corporation of India (LIC) on Sunday filed the Draft Red Herring Prospectus (DRHP) with capital markets regulator SEBI for an initial public offering (IPO). The IPO is expected to be the country's largest public issue. The LIC will offload over 31.62 crore shares of face value ₹10 each, according to the draft prospectus.\n",
 'author': 'Pragya Swastik'}

In [27]:
# define function to get articles from a page
def get_inshorts_page(url):
    '''
    This function returns a dataframe where each row is a news article from 
    a page on inshorts.com.
    '''
    category = url.split('/')[-1]
    response = requests.get(url, headers={'user-agent': 'Codeup DS'})
    soup = BeautifulSoup(response.text, 'html.parser')
    cards = soup.select('.news-card')
    articles = pd.DataFrame([get_news_card(card) for card in cards])
    articles['category'] = category
    return articles
# test function
get_inshorts_page('https://inshorts.com/en/read/business')

| | title | date | content | author | category |
|---|---|---|---|---|---|
| 0 | LIC files draft papers with SEBI to seek appro... | 13 Feb 2022,Sunday | State-run Life Insurance Corporation of India ... | Pragya Swastik | business |
| 1 | 13-yr-old girl gets ₹50 lakh funding on Shark ... | 13 Feb 2022,Sunday | A Class 8 girl became the youngest contestant ... | Ridham Gambhir | business |
| 2 | Retail inflation rises to 7-month-high of 6.01... | 14 Feb 2022,Monday | The retail inflation accelerated to a seven-mo... | Pragya Swastik | business |
| 3 | What is the ₹22,842 cr ABG Shipyard case, bigg... | 13 Feb 2022,Sunday | The CBI in its biggest bank fraud case booked ... | Pragya Swastik | business |
| 4 | Who is Ilker Ayci, Air India's new MD and CEO? | 14 Feb 2022,Monday | Tata Sons appointed 51-year-old Ilker Ayci, th... | Pragya Swastik | business |
| 5 | Sensex crashes over 1,700 points to close at 5... | 14 Feb 2022,Monday | The Sensex fell 1,747 points on Monday to end ... | Pragya Swastik | business |
| 6 | Sensex crashes over 1,400 points, Nifty slips ... | 14 Feb 2022,Monday | Indian equity benchmark Sensex fell by more th... | Anmol Sharma | business |
| 7 | All men die, not all men truly live: Rajiv Baj... | 13 Feb 2022,Sunday | Bajaj Auto MD Rajiv Bajaj, the son of industri... | Pragya Swastik | business |
| 8 | Big B recalls Air India ad from his college da... | 14 Feb 2022,Monday | Actor Amitabh Bachchan recalled an old Air Ind... | Pragya Swastik | business |
| 9 | Tatas appoint Ilker Ayci as CEO and MD of Air ... | 14 Feb 2022,Monday | Former Turkish Airlines Chairman Ilker Ayci ha... | Sakshita Khosla | business |


In [29]:
# define function to get all articles
def get_inshorts_articles():
    '''
    This function returns a dataframe of news articles from the business, sports, 
    technology, and entertainment sections of inshorts.com.
    '''
    url = 'https://inshorts.com/en/read/'
    categories = ['business', 'sports', 'technology', 'entertainment']
    # scrape each category page, then stack the results into one dataframe
    pages = [get_inshorts_page(url + cat) for cat in categories]
    return pd.concat(pages, ignore_index=True)
# test function
get_inshorts_articles()

| | title | date | content | author | category |
|---|---|---|---|---|---|
| 0 | LIC files draft papers with SEBI to seek appro... | 13 Feb 2022,Sunday | State-run Life Insurance Corporation of India ... | Pragya Swastik | business |
| 1 | 13-yr-old girl gets ₹50 lakh funding on Shark ... | 13 Feb 2022,Sunday | A Class 8 girl became the youngest contestant ... | Ridham Gambhir | business |
| 2 | Retail inflation rises to 7-month-high of 6.01... | 14 Feb 2022,Monday | The retail inflation accelerated to a seven-mo... | Pragya Swastik | business |
| 3 | What is the ₹22,842 cr ABG Shipyard case, bigg... | 13 Feb 2022,Sunday | The CBI in its biggest bank fraud case booked ... | Pragya Swastik | business |
| 4 | Who is Ilker Ayci, Air India's new MD and CEO? | 14 Feb 2022,Monday | Tata Sons appointed 51-year-old Ilker Ayci, th... | Pragya Swastik | business |
| ... | ... | ... | ... | ... | ... |
| 94 | I hate sweets, was suddenly asking for doughnu... | 14 Feb 2022,Monday | While talking about her pregnancy, singer Riha... | Ria Kapoor | entertainment |
| 95 | Faced speculation due to social media toxicity... | 14 Feb 2022,Monday | Actor Arjun Kapoor said that he admires girlfr... | Ria Kapoor | entertainment |
| 96 | Sanjay Dutt treats me like a baby, asks me to ... | 14 Feb 2022,Monday | Alia Bhatt, during an interview with Siddharth... | Udit Gupta | entertainment |
| 97 | Malayalam film 'The Great Indian Kitchen' to b... | 14 Feb 2022,Monday | Malayalam film 'The Great Indian Kitchen', sta... | Udit Gupta | entertainment |
