# NLP Exercises - Data Acquisition

In [2]:
import pandas as pd
from requests import get
from bs4 import BeautifulSoup
import os

## 1. Codeup Blog Articles

Visit Codeup's blog and record the URLs of at least 5 distinct blog posts. For each post, scrape at least the post's title and content.
Encapsulate your work in a function named get_blog_articles that returns a list of dictionaries, with each dictionary representing one article. Each dictionary should have this shape:

    {
        'title': 'the title of the article',
        'content': 'the full text content of the article'
    }


### Article 1

In [6]:
# create response object
url = 'http://codeup.com/featured/women-in-tech-panelist-spotlight/'
headers = {'User-Agent': 'Codeup Data Science'}
response = get(url, headers=headers)
response

<Response [200]>

In [13]:
print(response.text[:500])

<!DOCTYPE html>
<html lang="en-US">
<head>
	<meta charset="UTF-8" />
<meta http-equiv="X-UA-Compatible" content="IE=edge">
	<link rel="pingback" href="https://codeup.com/xmlrpc.php" />

	<script type="text/javascript">
		document.documentElement.className = 'js';
	</script>
	
	<link rel="preconnect" href="https://fonts.gstatic.com" crossorigin /><script id="diviarea-loader">window.DiviPopupData=window.DiviAreaConfig={"zIndex":1000000,"animateSpeed":400,"triggerClassPrefix":"show-popup-","idAttri


In [14]:
# create soup object

soup = BeautifulSoup(response.content, 'html.parser')

In [32]:
# obtain article text

article = soup.find('div', class_='entry-content')

In [25]:
article.text

'\nWomen in tech: Panelist Spotlight – Magdalena Rahn\nMar 28, 2023 | Events, Featured\n'

In [37]:
# obtain article text elements under div

p_elements = article.find_all('p')

In [38]:
p_elements[3].text

'Magdalena Rahn is a current Codeup student in a Data Science cohort in San Antonio, Texas. She has a professional background in cross-cultural communications, international business development, the wine industry and journalism. After serving in the US Navy, she decided to complement her professional skill set by attending the Data Science program at Codeup; she is set to graduate in March 2023. Magdalena is fluent in French, Bulgarian, Chinese-Mandarin, Spanish and Italian.'
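
The individual `<p>` elements found above can be combined into one clean string. The following is a minimal sketch, assuming the article body is made up of `<p>` tags inside the `entry-content` div. The sample HTML and the name `join_paragraphs` are made up for illustration and are not part of the original notebook.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for a downloaded page; the real Codeup markup may differ.
sample_html = """
<div class="entry-content">
  <p>First paragraph.</p>
  <p>   </p>
  <p>Second   paragraph,
     split across lines.</p>
</div>
"""


def join_paragraphs(container):
    """Join the text of every non-empty <p> under `container` with newlines."""
    paragraphs = []
    for p in container.find_all('p'):
        # split() followed by join() collapses runs of whitespace, including line breaks
        text = ' '.join(p.get_text().split())
        if text:
            paragraphs.append(text)
    return '\n'.join(paragraphs)


sample_soup = BeautifulSoup(sample_html, 'html.parser')
sample_article = sample_soup.find('div', class_='entry-content')
print(join_paragraphs(sample_article))
```

Joining paragraph by paragraph skips empty tags and avoids the stray leading and trailing newlines that `.text` on the whole div can return.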

In [39]:
# obtain article title

title = soup.find('h1').text

In [40]:
title

'Women in tech: Panelist Spotlight – Magdalena Rahn'

In [41]:
# create a function that takes in a url and requests/parses html
def codeup_soup(url):
    '''
    This function takes in a url, then requests and
    parses html
    '''
    headers = {'User-Agent': 'Codeup Data Science'}
    response = get(url, headers=headers)
    soup = BeautifulSoup(response.content, 'html.parser')
    return soup
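
The helper above parses whatever comes back, including error pages, and waits indefinitely if the server never answers. A possible variant is sketched below, assuming a 10 second timeout is acceptable. The name `fetch_soup`, the `http_get` parameter, and the fake response are hypothetical and exist only so the example runs without a network connection.

```python
import requests
from bs4 import BeautifulSoup


def fetch_soup(url, http_get=requests.get, timeout=10):
    """Request `url` and return parsed HTML, raising on HTTP error statuses.

    `http_get` can be swapped out so the function can be tried offline.
    """
    headers = {'User-Agent': 'Codeup Data Science'}
    response = http_get(url, headers=headers, timeout=timeout)
    # Raises requests.HTTPError for 4xx/5xx responses instead of parsing an error page
    response.raise_for_status()
    return BeautifulSoup(response.content, 'html.parser')


class FakeResponse:
    """Hypothetical stand-in for requests.Response, used only for this demo."""
    content = b'<html><h1>Hello</h1></html>'

    def raise_for_status(self):
        pass


def fake_get(url, headers=None, timeout=None):
    return FakeResponse()


demo_soup = fetch_soup('https://example.com/', http_get=fake_get)
print(demo_soup.find('h1').text)
```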

In [54]:
def get_codeup_articles(urls, cached=False):
    '''
    This function takes in a list of Codeup blog urls and returns a df
    with the title and content of each article.
    With the default cached == False, it scrapes each url, builds a list of
    dictionaries, converts the list to a df, and writes the df to a json file.
    If cached == True, the function returns a df read from that json file.
    '''
    if cached:
        df = pd.read_json('codeup_blogs.json')
        
    # cached == False completes a fresh scrape for df     
    else:

        # Create an empty list to hold dictionaries
        articles = []

        # Loop through each url in our list of urls
        for url in urls:

            # Make request and soup object using helper
            soup = codeup_soup(url)

            # Save the title of each blog in variable title
            title = soup.find('h1').text

            # Save the text of each blog in variable content
            content = soup.find('div', class_="entry-content").text

            # Create a dictionary holding the title and content for each blog
            article = {'title': title, 'content': content}

            # Add each dictionary to the articles list of dictionaries
            articles.append(article)
            
        # convert our list of dictionaries to a df
        df = pd.DataFrame(articles)

        # Write df to a json file for faster access
        df.to_json('codeup_blogs.json')
    
    return df
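
With `cached=True`, the function above fails if the json file has not been written yet. One alternative is to key the decision on whether the file exists, which is also a possible use for the `os` import at the top of the notebook. The sketch below uses only the standard library. `load_or_scrape` and `fake_scrape` are hypothetical names, and plain `json` is swapped in for the pandas calls to keep the example small.

```python
import json
import os
import tempfile


def load_or_scrape(path, scrape):
    """Return cached records from `path` if it exists; otherwise call `scrape()` and cache the result."""
    if os.path.exists(path):
        with open(path, encoding='utf-8') as f:
            return json.load(f)
    records = scrape()
    with open(path, 'w', encoding='utf-8') as f:
        json.dump(records, f)
    return records


# Hypothetical scraper that records how often it is called
calls = []


def fake_scrape():
    calls.append(1)
    return [{'title': 'a title', 'content': 'some text'}]


cache_path = os.path.join(tempfile.mkdtemp(), 'demo_cache.json')
first = load_or_scrape(cache_path, fake_scrape)   # file missing: scrapes and writes
second = load_or_scrape(cache_path, fake_scrape)  # file present: reads from disk
print(first == second, len(calls))
```

The second call never reaches `fake_scrape`, so repeated runs do not hit the website again.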

In [55]:
# list of codeup urls

urls = ['https://codeup.com/featured/women-in-tech-panelist-spotlight/', 
        'https://codeup.com/featured/women-in-tech-rachel-robbins-mayhill/',
        'https://codeup.com/codeup-news/women-in-tech-panelist-spotlight-sarah-mellor/',
        'https://codeup.com/events/women-in-tech-madeleine/',
        'https://codeup.com/codeup-news/panelist-spotlight-4/']

blogs = get_codeup_articles(urls=urls, cached=False)

In [56]:
blogs

| | title | content |
|---|---|---|
| 0 | Women in tech: Panelist Spotlight – Magdalena ... | \nWomen in tech: Panelist Spotlight – Magdalen... |
| 1 | Women in tech: Panelist Spotlight – Rachel Rob... | \nWomen in tech: Panelist Spotlight – Rachel R... |
| 2 | Women in Tech: Panelist Spotlight – Sarah Mellor | \nWomen in tech: Panelist Spotlight – Sarah Me... |
| 3 | Women in Tech: Panelist Spotlight – Madeleine ... | \nWomen in tech: Panelist Spotlight – Madelein... |
| 4 | Black Excellence in Tech: Panelist Spotlight –... | \nBlack excellence in tech: Panelist Spotlight... |


## 2. News Articles

Next, scrape text data from inshorts, a website that provides brief news summaries on many different topics.
Write a function that scrapes the news articles for the following topics:
* Business
* Sports
* Technology
* Entertainment

Hints:
* Start by inspecting the website in your browser. Figure out which elements will be useful.
* Then create a function that handles a single article and produces a dictionary like the one above.
* Next, create a function that finds all the articles on a single page and calls the single-article function for every article on the page.
* Finally, create a function that uses the previous two functions to scrape the articles from all the pages you need and does any additional processing.
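
The three-function split described in the hints could look roughly like the sketch below. It runs on a small made-up HTML string rather than the live site, so the markup is an assumption that only imitates the selectors used in this notebook. The function names and the `text_or_none` helper are hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; the live inshorts page may be structured differently.
sample_page = """
<div class="news-card">
  <span itemprop="headline">Headline one</span>
  <span class="author">Author A</span>
  <div itemprop="articleBody">Body one.</div>
</div>
<div class="news-card">
  <span itemprop="headline">Headline two</span>
  <div itemprop="articleBody">Body two.</div>
</div>
"""


def text_or_none(element):
    """Return the stripped text of a tag, or None when find() found nothing."""
    return element.get_text(strip=True) if element is not None else None


def parse_card(card, topic):
    """Step 1: turn one news card into a dictionary."""
    return {
        'topic': topic,
        'title': text_or_none(card.find('span', itemprop='headline')),
        'author': text_or_none(card.find('span', class_='author')),
        'content': text_or_none(card.find('div', itemprop='articleBody')),
    }


def parse_page(html, topic):
    """Step 2: parse every news card on one page."""
    page_soup = BeautifulSoup(html, 'html.parser')
    cards = page_soup.find_all('div', class_='news-card')
    return [parse_card(card, topic) for card in cards]


def parse_topics(pages):
    """Step 3: combine all pages, given a {topic: html} mapping."""
    articles = []
    for topic, html in pages.items():
        articles.extend(parse_page(html, topic))
    return articles


sample_articles = parse_topics({'business': sample_page})
print(sample_articles[1])
```

The second sample card has no author element, so its `author` value is `None` instead of the `AttributeError` that calling `.text` on a missing element would raise.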


In [57]:
def get_news_articles(cached=False):
    '''
    With the default cached == False, this function does a fresh scrape of the inshorts pages
    for the topics business, sports, technology, and entertainment, and writes the resulting
    df to a json file.
    cached == True returns a df read in from that json file.
    '''
    # option to read in a json file instead of scrape for df
    if cached:
        df = pd.read_json('news_articles.json')
        
    # cached == False completes a fresh scrape for df    
    else:
    
        # Set base_url that will be used in get request
        base_url = 'https://inshorts.com/en/read/'
        
        # List of topics to scrape
        topics = ['business', 'sports', 'technology', 'entertainment']
        
        # Create an empty list, articles, to hold our dictionaries
        articles = []

        for topic in topics:
            
            # Create url with topic endpoint
            topic_url = base_url + topic
            
            # Make request and soup object using helper
            soup = codeup_soup(topic_url)

            # Scrape a ResultSet of all the news cards on the page
            cards = soup.find_all('div', class_='news-card')

            # Loop through each news card on the page and get what we want
            for card in cards:
                title = card.find('span', itemprop='headline').text
                author = card.find('span', class_='author').text
                content = card.find('div', itemprop='articleBody').text

                # Create a dictionary, article, for each news card
                article = {'topic': topic,
                           'title': title,
                           'author': author,
                           'content': content}

                # Add the dictionary, article, to our list of dictionaries, articles.
                articles.append(article)
            
        # Create a DataFrame from list of dictionaries
        df = pd.DataFrame(articles)
        
        # Write df to json file for future use
        df.to_json('news_articles.json')
    
    return df

In [58]:
get_news_articles()

| | topic | title | author | content |
|---|---|---|---|---|
| 0 | business | Video shows moment cake is thrown at Porsche C... | Pragya Swastik | A video captured the moment a cake was thrown ... |
| 1 | business | Pakistan rupee falls to all-time low of 300 a ... | Pragya Swastik | Pakistan's rupee slumped to a new record low o... |
| 2 | business | SpiceJet denies insolvency reports, says tryin... | Ashley Paul | SpiceJet said it's taking steps to revive its ... |
| 3 | business | Bank of England raises lending rates to 4.5% i... | Srishty Choudhury | Bank of England (BoE) raised its benchmark len... |
| 4 | business | Man Group appoints its first female CEO in 240... | Pragya Swastik | London-based investment advice company Man Gro... |
| ... | ... | ... | ... | ... |
| 95 | entertainment | Even after 8 years, children are watching 'F.I... | Bhawana Chaudhary | Actress Kavita Kaushik, known for her TV show ... |
| 96 | entertainment | MP has withdrawn tax exemption to 'The Kerala.... | Medhaa Gupta | Congress Rajya Sabha MP Vivek Tankha has claim... |
| 97 | entertainment | Want to ask Amitabh Bachchan 'Where do you wan... | Bhawana Chaudhary | Actor Piyush Mishra said that he wants to ask ... |
| 98 | entertainment | I never made wrong decisions: Kangana on Rasca... | Swati Dubey | Kangana Ranaut reacted to an old video shared ... |
