# NLP Acquire Exercises

In [1]:
from requests import get
from bs4 import BeautifulSoup
import os

By the end of this exercise, you should have a file named acquire.py that contains the specified functions. If you wish, you may break your work into separate files for each website (e.g. acquire_codeup_blog.py and acquire_news_articles.py), but the final functions must still be available from acquire.py (that is, acquire.py should import get_blog_articles from the acquire_codeup_blog module).



## 1. Codeup Blog Articles

Visit Codeup's Blog and record the URLs for at least 5 distinct blog posts. For each post, scrape at least the post's title and content.

Encapsulate your work in a function named get_blog_articles that will return a list of dictionaries, with each dictionary representing one article. The shape of each dictionary should look like this:


{
    'title': 'the title of the article',
    'content': 'the full text content of the article'
}
Plus any additional properties you think might be helpful.
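
As a minimal sketch of that shape, the snippet below parses a single article from an inline HTML string rather than a live page. The sample markup and the helper name parse_article are assumptions made for illustration. The article, h1, and entry-content selectors mirror the ones used in the cells that follow and may not match the live site.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for a downloaded blog post.
SAMPLE_HTML = """
<article>
  <h1> Example Post </h1>
  <div class="entry-content"><p>First paragraph.</p><p>Second paragraph.</p></div>
</article>
"""

def parse_article(html):
    """Return one article dictionary parsed from an HTML string."""
    soup = BeautifulSoup(html, "html.parser")
    post = soup.find("article")
    return {
        # strip=True trims the whitespace around each text fragment
        "title": post.find("h1").get_text(strip=True),
        # Join the paragraph fragments with a single space
        "content": post.find("div", class_="entry-content").get_text(" ", strip=True),
    }

article = parse_article(SAMPLE_HTML)
print(article)
```

Keeping the parsing step separate from the HTTP request makes it possible to check the dictionary shape without touching the network.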

In [36]:
def get_article_text():
    # Fetch a single blog post and parse its HTML
    url = 'https://codeup.com/featured/apida-heritage-month/'
    headers = {'User-Agent': 'Codeup Data Science'}
    response = get(url, headers=headers)
    # Name the parser explicitly so bs4 does not have to guess one
    soup = BeautifulSoup(response.text, 'html.parser')
    articles = []
    posts = soup.find_all("article")
    for post in posts:
        title = post.find('h1').text.strip()
        content = post.find("div", class_="entry-content").text.strip()

        article = {
            "title": title,
            "content": content
        }

        # Append inside the loop so every post is kept, not only the last one
        articles.append(article)

    return articles





In [37]:
get_article_text()

[{'title': 'Spotlight on APIDA Voices: Celebrating Heritage and Inspiring Change ft. Arbeena Thapa',
  'content': 'May is traditionally known as Asian American and Pacific Islander (AAPI) Heritage Month. This month we celebrate the history and contributions made possible by our AAPI friends, family, and community. We also examine our level of support and seek opportunities to better understand the AAPI community.\n\nIn an effort to address real concerns and experiences, we sat down with Arbeena Thapa, one of Codeup’s Financial Aid and Enrollment Managers.\nArbeena identifies as Nepali American and Desi. Arbeena’s parents immigrated to Texas in 1988 for better employment and educational opportunities. Arbeena’s older sister was five when they made the move to the US. Arbeena was born later, becoming the first in her family to be a US citizen.\nAt Codeup we take our efforts at inclusivity very seriously. After speaking with Arbeena, we were taught that the term AAPI excludes Desi-America

In [None]:
urls = [
    'https://codeup.com/featured/apida-heritage-month/',
    'https://codeup.com/featured/women-in-tech-panelist-spotlight/',
    'https://codeup.com/featured/women-in-tech-rachel-robbins-mayhill/',
    'https://codeup.com/codeup-news/women-in-tech-panelist-spotlight-sarah-mellor/',
    'https://codeup.com/events/women-in-tech-madeleine/',
]

In [42]:
# Final version: scrape every URL in the list

def get_articles_texts(urls):
    articles = []
    # The same headers are reused for every request
    headers = {'User-Agent': 'Codeup Data Science'}

    for url in urls:
        response = get(url, headers=headers)
        soup = BeautifulSoup(response.text, 'html.parser')
        posts = soup.find_all('article')

        for post in posts:
            title = post.find('h1').text.strip()
            content = post.find('div', class_='entry-content').text.strip()

            article = {
                'title': title,
                'content': content
            }

            articles.append(article)

    return articles

# The exercise asks for a function named get_blog_articles, so expose that name too
get_blog_articles = get_articles_texts


In [41]:
get_articles_texts(urls)

[{'title': 'Spotlight on APIDA Voices: Celebrating Heritage and Inspiring Change ft. Arbeena Thapa',
  'content': 'May is traditionally known as Asian American and Pacific Islander (AAPI) Heritage Month. This month we celebrate the history and contributions made possible by our AAPI friends, family, and community. We also examine our level of support and seek opportunities to better understand the AAPI community.\n\nIn an effort to address real concerns and experiences, we sat down with Arbeena Thapa, one of Codeup’s Financial Aid and Enrollment Managers.\nArbeena identifies as Nepali American and Desi. Arbeena’s parents immigrated to Texas in 1988 for better employment and educational opportunities. Arbeena’s older sister was five when they made the move to the US. Arbeena was born later, becoming the first in her family to be a US citizen.\nAt Codeup we take our efforts at inclusivity very seriously. After speaking with Arbeena, we were taught that the term AAPI excludes Desi-America
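
Re-running the scraper on every call is slow and puts repeated load on the site, so one optional extension is to cache the results on disk. The sketch below is an assumption rather than part of the exercise: load_or_scrape, fake_scrape, and the blog_articles.json file name are hypothetical, and the stand-in scraper would be replaced by the real scraping function.

```python
import json
import os
import tempfile

def load_or_scrape(path, scrape):
    """Return cached articles from `path`, or call `scrape()` and cache the result."""
    if os.path.exists(path):
        with open(path, encoding="utf-8") as f:
            return json.load(f)
    articles = scrape()
    with open(path, "w", encoding="utf-8") as f:
        json.dump(articles, f, ensure_ascii=False, indent=2)
    return articles

# Stand-in scraper that records how many times it is called.
calls = []

def fake_scrape():
    calls.append(1)
    return [{"title": "t", "content": "c"}]

# A temporary directory keeps the demonstration from writing into the project.
cache_path = os.path.join(tempfile.mkdtemp(), "blog_articles.json")
first = load_or_scrape(cache_path, fake_scrape)   # scrapes and writes the file
second = load_or_scrape(cache_path, fake_scrape)  # reads the file instead
print(first == second, len(calls))
```

Deleting the cache file forces a fresh scrape on the next call.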

## 2. News Articles

We will now scrape text data from inshorts, a website that provides brief summaries of news articles across many different topics.

Write a function that scrapes the news articles for the following topics:

- Business
- Sports
- Technology
- Entertainment

The end product of this should be a function named get_news_articles that returns a list of dictionaries, where each dictionary has this shape:


{
    'title': 'The article title',
    'content': 'The article content',
    'category': 'business' # for example
}

Hints:

1. Start by inspecting the website in your browser. Figure out which elements will be useful.
2. Create a function that handles a single article and produces a dictionary like the one above.
3. Next, create a function that finds all the articles on a single page and calls the function from the previous step for every article on the page.
4. Finally, create a function that uses the previous two functions to scrape the articles from all the pages you need, and does any additional processing that is required.
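
The layering described in the hints can be sketched as three small functions. This is one possible arrangement, not the required solution: parse_card, parse_page, scrape_topics, and the fetch parameter are hypothetical names, and FAKE_PAGE is made-up markup so the example runs offline. The news-card class and itemprop attributes are taken from the cell below and may not match the current site.

```python
from bs4 import BeautifulSoup

def parse_card(card, category):
    """Step 1: turn one news-card element into an article dictionary."""
    return {
        "title": card.find("span", itemprop="headline").get_text(strip=True),
        "content": card.find("div", itemprop="articleBody").get_text(strip=True),
        "category": category,
    }

def parse_page(html, category):
    """Step 2: find every card on a page and parse each one."""
    soup = BeautifulSoup(html, "html.parser")
    return [parse_card(card, category) for card in soup.find_all(class_="news-card")]

def scrape_topics(topics, fetch):
    """Step 3: fetch each topic page and collect all parsed articles.

    `fetch` is any callable that maps a topic name to HTML text, which
    lets the network request be swapped out for testing.
    """
    articles = []
    for topic in topics:
        articles.extend(parse_page(fetch(topic), topic))
    return articles

# Offline demonstration with hypothetical markup in place of a real request.
FAKE_PAGE = """
<div class="news-card">
  <span itemprop="headline">Sample headline</span>
  <div itemprop="articleBody">Sample body text.</div>
</div>
"""
demo = scrape_topics(["business", "sports"], fetch=lambda topic: FAKE_PAGE)
print(demo)
```

For real use, fetch could wrap requests.get, for example a function that requests the topic's URL and returns response.text.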

In [45]:


def get_news_articles():
    base_url = "https://inshorts.com/en/read/"
    # Each topic name doubles as the URL path segment and the category label
    topics = ['business', 'sports', 'technology', 'entertainment']
    articles = []

    for category in topics:
        url = base_url + category
        response = get(url)
        soup = BeautifulSoup(response.content, 'html.parser')
        # Every article on a topic page sits in an element with class "news-card"
        news_cards = soup.find_all(class_='news-card')

        for card in news_cards:
            title = card.find('span', itemprop='headline').text.strip()
            content = card.find('div', itemprop='articleBody').text.strip()

            article = {
                'title': title,
                'content': content,
                'category': category
            }

            articles.append(article)

    return articles


In [46]:
get_news_articles()

[{'title': 'Apple could force 111-year-old Swiss firm to change its apple logo',
  'content': 'Fruit Union Suisse, a 111-year-old Swiss company, is worried it might have to change its logo because Apple is trying to gain intellectual property rights over depictions of apples. "It\'s not like they\'re trying to protect their bitten apple...Their objective...is really to own rights to an actual apple, which...should be free for everyone to use," its director Jimmy Mariéthoz said.',
  'category': 'business'},
 {'title': "Nissan's ex-CEO Carlos Ghosn sues automaker for $1 bn over ouster",
  'content': "Nissan's former CEO Carlos Ghosn has filed a $1-billion lawsuit against the Japanese automaker and connected individuals for ousting him in 2018 and arranging his arrest over alleged financial misconduct. Ghosn filed the lawsuit in Lebanon, where he has lived since escaping from Japan in 2019 to flee trial. In 2020, Nissan sued Ghosn for $90 million in monetary damages.",
  'category': 'busi
