### Exercises
By the end of these exercises, you should have a file named acquire.py that contains the specified functions. If you wish, you may split your work into a separate file for each website (e.g. acquire_codeup_blog.py and acquire_news_articles.py), but the final functions must be available from acquire.py (that is, acquire.py should import get_blog_articles from the acquire_codeup_blog module).

In [1]:
from requests import get
import requests
from bs4 import BeautifulSoup
import os
import json
from pprint import pprint
import re

import itertools as it
from typing import List, Dict
import pandas as pd

1. Codeup Blog Articles
Scrape the article text from the following pages:

https://codeup.com/data-science/recession-proof-career/

https://codeup.com/featured/series-part-3-web-development/

https://codeup.com/data-science-vs-data-analytics-whats-the-difference/

https://codeup.com/featured/what-jobs-can-you-get-after-a-coding-bootcamp-part-2-cloud-administration/

https://codeup.com/data-science/jobs-after-a-coding-bootcamp-part-1-data-science/

Encapsulate your work in a function named get_blog_articles that will return a list of dictionaries, with each dictionary representing one article. The shape of each dictionary should look like this:

{
    'title': 'the title of the article',
    'content': 'the full text content of the article'
}
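
A quick way to confirm that a scraper returns the shape described above is a small check like the one below. This is only a sketch: `has_article_shape` and `REQUIRED_KEYS` are hypothetical names introduced for illustration and are not part of the exercise.

```python
from typing import Dict, List

# Keys every article dictionary is expected to carry (taken from the shape above).
REQUIRED_KEYS = {'title', 'content'}


def has_article_shape(articles: List[Dict[str, str]]) -> bool:
    """Return True if every item is a dict with non-empty string 'title' and 'content'."""
    return all(
        isinstance(article, dict)
        and REQUIRED_KEYS <= article.keys()
        and all(
            isinstance(article[key], str) and article[key].strip()
            for key in REQUIRED_KEYS
        )
        for article in articles
    )
```

Running a check like this on the return value of get_blog_articles catches a missing key or an empty scrape before the data is cached.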

In [6]:
def get_article(url):
    headers = {'User-Agent': 'Codeup Data Science'}
    response = get(url, headers=headers)
    soup = BeautifulSoup(response.content, 'html.parser')
    title = soup.select('.entry-title')[0].text
    content = soup.select('.entry-content')[0].text
    output = {
        'title':title.strip(),
        'content':content.strip()
    }
    return output

In [7]:
def get_links(url):
    links = []
    headers = {'User-Agent': 'Codeup Data Science'}
    response = get(url, headers=headers)
    soup = BeautifulSoup(response.content, 'html.parser')
    for link in soup.select('h2 a[href]'):
        links.append(link['href'])
    return links
        
        
url_home = 'https://codeup.com/blog/'
get_links(url_home)

['https://codeup.com/data-science/recession-proof-career/',
 'https://codeup.com/codeup-news/codeup-x-comic-con/',
 'https://codeup.com/featured/series-part-3-web-development/',
 'https://codeup.com/codeup-news/codeup-dallas-campus/',
 'https://codeup.com/codeup-news/codeup-tv-commercial/',
 'https://codeup.com/featured/what-jobs-can-you-get-after-a-coding-bootcamp-part-2-cloud-administration/']
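
The links returned above happen to be absolute and unique, but an `href` attribute can also be relative or repeated. The sketch below shows one way to normalize a list of hrefs with the standard library; `normalize_links` is a hypothetical helper, not something the notebook defines.

```python
from typing import Iterable, List
from urllib.parse import urljoin


def normalize_links(base_url: str, hrefs: Iterable[str]) -> List[str]:
    """Resolve hrefs against base_url and drop duplicates, keeping first-seen order."""
    absolute = (urljoin(base_url, href) for href in hrefs)
    # dict.fromkeys keeps insertion order, so the first occurrence of each URL wins.
    return list(dict.fromkeys(absolute))
```

It could be applied to the list built inside get_links before returning it, for example `normalize_links(url, links)`.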

In [8]:
url_a = 'https://codeup.com/data-science/recession-proof-career/'
get_article(url_a)

{'title': 'Is a Career in Tech Recession-Proof?',
 'content': 'Given the current economic climate, many economists are considering the U.S. to be entering a recession. This can cause confusion, fear, and uncertainty, especially as it pertains to job security.\nTo ease some of those feelings, below you’ll find some careers in tech that tend to hold up better than others amid a recession. In the event of a recession, companies will likely shift to digital strategies, making these careers in tech valuable and highly coveted.\n\xa0\n\n\nProgrammer/Developer\nNo matter the programming language you’ve mastered, having the knowledge alone makes you extremely valuable. The coding skills you possess as a programmer or developer are in-demand for companies looking to build or enhance their websites, and enhance their consumer experience. According to the U.S. Bureau of Labor Statistics, jobs in software development are expected to grow 22% by 2030. This is much faster than the average career.\n\

In [9]:
def get_blog_articles(url_list):
    output = []
    for url in url_list:
        article_dict = get_article(url)
        output.append(article_dict)
    return output

In [10]:
url_1 = 'https://codeup.com/data-science/recession-proof-career/'
url_2 = 'https://codeup.com/codeup-news/codeup-x-comic-con/'
url_3 = 'https://codeup.com/featured/series-part-3-web-development/'
url_4 = 'https://codeup.com/codeup-news/codeup-dallas-campus/'
url_5 = 'https://codeup.com/codeup-news/codeup-tv-commercial/'
select_articles = [url_1, url_2, url_3, url_4, url_5]
#get_blog_articles(select_articles)
get_blog_articles(get_links(url_home))

[{'title': 'Is a Career in Tech Recession-Proof?',
  'content': 'Given the current economic climate, many economists are considering the U.S. to be entering a recession. This can cause confusion, fear, and uncertainty, especially as it pertains to job security.\nTo ease some of those feelings, below you’ll find some careers in tech that tend to hold up better than others amid a recession. In the event of a recession, companies will likely shift to digital strategies, making these careers in tech valuable and highly coveted.\n\xa0\n\n\nProgrammer/Developer\nNo matter the programming language you’ve mastered, having the knowledge alone makes you extremely valuable. The coding skills you possess as a programmer or developer are in-demand for companies looking to build or enhance their websites, and enhance their consumer experience. According to the U.S. Bureau of Labor Statistics, jobs in software development are expected to grow 22% by 2030. This is much faster than the average career.\

2. News Articles
We will now be scraping text data from inshorts, a website that provides a brief overview of many different topics.

Write a function that scrapes the news articles for the following topics:

- Business
- Sports
- Technology
- Entertainment

The end product of this should be a function named get_news_articles that returns a list of dictionaries, where each dictionary has this shape:

{
    'title': 'The article title',
    'content': 'The article content',
    'category': 'business' # for example
}

Hints:

- Start by inspecting the website in your browser. Figure out which elements will be useful.
- Start by creating a function that handles a single article and produces a dictionary like the one above.
- Next create a function that will find all the articles on a single page and call the function you created in the last step for every article on the page.
- Now create a function that will use the previous two functions to scrape the articles from all the pages that you need, and do any additional processing that needs to be done.
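
The three layers described in the hints can be sketched without any network access. Everything here is an illustration under stated assumptions: `parse_article`, `parse_page`, `scrape_topics`, and `fake_fetch` are hypothetical names, and a plain dictionary stands in for a parsed news card.

```python
from typing import Callable, Dict, Iterable, List


# Step 1: one article -> one dictionary.
# `raw` stands in for whatever a single parsed news card yields.
def parse_article(raw: Dict[str, str], category: str) -> Dict[str, str]:
    return {
        'title': raw['title'].strip(),
        'content': raw['content'].strip(),
        'category': category,
    }


# Step 2: one page -> a list of dictionaries, reusing step 1.
def parse_page(raw_articles: Iterable[Dict[str, str]], category: str) -> List[Dict[str, str]]:
    return [parse_article(raw, category) for raw in raw_articles]


# Step 3: all topics -> one flat list, reusing step 2.
# `fetch_page` is a hypothetical callable that downloads a topic page
# and splits it into raw articles.
def scrape_topics(
    topics: Iterable[str],
    fetch_page: Callable[[str], Iterable[Dict[str, str]]],
) -> List[Dict[str, str]]:
    articles: List[Dict[str, str]] = []
    for topic in topics:
        articles.extend(parse_page(fetch_page(topic), topic))
    return articles


# Offline stand-in for the network call, so the structure can be tried out.
def fake_fetch(topic: str) -> List[Dict[str, str]]:
    return [{'title': f'  {topic} headline\n', 'content': ' body text '}]


result = scrape_topics(['business', 'sports'], fake_fetch)
```

Passing the fetch step in as an argument keeps the parsing logic testable without hitting the website.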

In [11]:
def get_articles(url, category):
    headers = {'User-Agent': 'Codeup Data Science'}
    outputs = []
    response = get(url, headers=headers)
    soup = BeautifulSoup(response.content, 'html.parser')
    news_cards = soup.select('.news-card')
    for card in news_cards:
        title = card.select('.news-card-title')[0].text
        # Raw string keeps '\s' a regex token; the second match is taken as the headline
        title = re.findall(r'(\s.*)', title)[1]
        content = card.select('.news-card-content')[0].text
        output = {
            'title':title.strip(),
            'content':content.strip(),
            'category':category
        }
        outputs.append(output)
    return outputs

In [12]:
def get_news_articles():
    outputs = []
    base_url = 'https://inshorts.com/en/read'
    end_points = ['business', 'sports', 'technology', 'entertainment'] 
    for endp in end_points:
        outputs += get_articles(f"{base_url}/{endp}", endp)
    return outputs

In [13]:
get_news_articles()

[{'title': 'ED arrests former NSE CEO Ravi Narain in money laundering case',
  'content': "The Enforcement Directorate has arrested Ravi Narain, the former Managing Director and CEO of the National Stock Exchange (NSE), in connection with a money laundering case related to the alleged illegal phone-tapping of the exchange's employees. The ED had earlier arrested another former NSE MD and CEO Chitra Ramkrishna and former Mumbai Commissioner of Police Sanjay Pandey in the case.\n\nshort by Apaar Sharma / \n      08:53 am on 07 Sep",
  'category': 'business'},
 {'title': "Musk's lawyer seeks to delay Twitter trial to investigate whistleblower's claims",
  'content': 'Tesla CEO Elon Musk\'s lawyer urged that the trial over the $44 billion Twitter deal should be delayed by several weeks to allow Musk to investigate a whistleblower\'s claims. "Doesn\'t justice demand a few weeks to look into this?" said Musk’s lawyer. Twitter\'s former head of security, Peiter Zatko, accused Twitter of false
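
The `content` values in this output end with a byline such as "short by Apaar Sharma / 08:53 am on 07 Sep". One possible way to separate it from the article text is sketched below. It assumes every byline has the form "short by <author> / <timestamp>" and that the phrase "short by" does not also appear in the article body; `split_byline` and `BYLINE` are hypothetical names.

```python
import re
from typing import Dict

# Assumption: the byline always looks like "short by <author> / <timestamp>"
# and sits at the very end of the scraped text.
BYLINE = re.compile(
    r'\s*short by\s+(?P<author>.+?)\s*/\s*(?P<published>.+?)\s*$',
    re.DOTALL,
)


def split_byline(content: str) -> Dict[str, str]:
    """Split scraped card text into body, author, and published timestamp."""
    match = BYLINE.search(content)
    if match is None:
        # No byline found: return the text unchanged apart from trimming.
        return {'content': content.strip(), 'author': '', 'published': ''}
    return {
        'content': content[:match.start()].strip(),
        'author': match.group('author'),
        'published': match.group('published'),
    }
```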

3. Bonus: cache the data

Write your code so that the acquired data is saved locally in some form. The functions that retrieve the data should prefer reading the local copy over making all the requests every time they are called. Include a boolean flag in the functions that allows the data to be acquired "fresh" from the actual sources (rewriting your local cache).

In [14]:
def cache_data(dictionary, filename='cache_file.json'):
    json_obj = json.dumps(dictionary, indent = 4)
    try:
        with open(filename, 'w') as f:
            f.write(json_obj)
        return True
    except Exception as e:
        print(e)
        return False
    
def read_url_or_file_inshort(filename='cache_file.json', query_url=False):
    if os.path.isfile(filename) and not query_url:
        print('Found File')
        try:
            with open(filename, 'r') as f:
                json_obj = json.load(f)
            return json_obj
        except Exception as e:
            print(e)
            return None
    else:
        print('Querying url')
        try:
            dictionary = get_news_articles()
            cache_data(dictionary, filename=filename)
            return dictionary
        except Exception as e:
            print(e)
            return None
            
def read_url_or_file_codeup(filename='cache_file.json', query_url=False):
    if os.path.isfile(filename) and not query_url:
        print('Found File')
        try:
            with open(filename, 'r') as f:
                json_obj = json.load(f)
            return json_obj
        except Exception as e:
            print(e)
            return None
    else:
        print('Querying url')
        url= 'https://codeup.com/blog/'
        select_articles = get_links(url)
        try:
            dictionary = get_blog_articles(select_articles)
            cache_data(dictionary, filename=filename)
            return dictionary
        except Exception as e:
            print(e)
            return None
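
The two readers above differ only in how they fetch the data, so the shared cache logic can be factored into one function that receives the fetch step as an argument. This is a sketch of that idea, not a replacement for the cell above: `load_or_fetch` and its `fresh` parameter are hypothetical names, and unlike the original it lets exceptions propagate instead of printing them.

```python
import json
import os
from typing import Callable, Dict, List


def load_or_fetch(
    fetch: Callable[[], List[Dict[str, str]]],
    filename: str,
    fresh: bool = False,
) -> List[Dict[str, str]]:
    """Return cached JSON from `filename`, or call `fetch` and rewrite the cache."""
    if os.path.isfile(filename) and not fresh:
        with open(filename, 'r', encoding='utf-8') as f:
            return json.load(f)
    # Cache miss, or a fresh copy was requested: fetch and overwrite the file.
    data = fetch()
    with open(filename, 'w', encoding='utf-8') as f:
        json.dump(data, f, indent=4)
    return data
```

With the functions defined earlier, a call might look like `load_or_fetch(get_news_articles, 'inshort_articles.json')`, and `fresh=True` would force a new scrape.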
            

In [15]:
read_url_or_file_inshort(filename='inshort_articles.json', query_url=True)

Querying url


[{'title': 'ED arrests former NSE CEO Ravi Narain in money laundering case',
  'content': "The Enforcement Directorate has arrested Ravi Narain, the former Managing Director and CEO of the National Stock Exchange (NSE), in connection with a money laundering case related to the alleged illegal phone-tapping of the exchange's employees. The ED had earlier arrested another former NSE MD and CEO Chitra Ramkrishna and former Mumbai Commissioner of Police Sanjay Pandey in the case.\n\nshort by Apaar Sharma / \n      08:53 am on 07 Sep",
  'category': 'business'},
 {'title': "Musk's lawyer seeks to delay Twitter trial to investigate whistleblower's claims",
  'content': 'Tesla CEO Elon Musk\'s lawyer urged that the trial over the $44 billion Twitter deal should be delayed by several weeks to allow Musk to investigate a whistleblower\'s claims. "Doesn\'t justice demand a few weeks to look into this?" said Musk’s lawyer. Twitter\'s former head of security, Peiter Zatko, accused Twitter of false

In [16]:
read_url_or_file_inshort(filename='inshort_articles.json')

Found File


[{'title': 'ED arrests former NSE CEO Ravi Narain in money laundering case',
  'content': "The Enforcement Directorate has arrested Ravi Narain, the former Managing Director and CEO of the National Stock Exchange (NSE), in connection with a money laundering case related to the alleged illegal phone-tapping of the exchange's employees. The ED had earlier arrested another former NSE MD and CEO Chitra Ramkrishna and former Mumbai Commissioner of Police Sanjay Pandey in the case.\n\nshort by Apaar Sharma / \n      08:53 am on 07 Sep",
  'category': 'business'},
 {'title': "Musk's lawyer seeks to delay Twitter trial to investigate whistleblower's claims",
  'content': 'Tesla CEO Elon Musk\'s lawyer urged that the trial over the $44 billion Twitter deal should be delayed by several weeks to allow Musk to investigate a whistleblower\'s claims. "Doesn\'t justice demand a few weeks to look into this?" said Musk’s lawyer. Twitter\'s former head of security, Peiter Zatko, accused Twitter of false

In [17]:
read_url_or_file_codeup(filename='codeup_articles.json',query_url=True)

Querying url


[{'title': 'Is a Career in Tech Recession-Proof?',
  'content': 'Given the current economic climate, many economists are considering the U.S. to be entering a recession. This can cause confusion, fear, and uncertainty, especially as it pertains to job security.\nTo ease some of those feelings, below you’ll find some careers in tech that tend to hold up better than others amid a recession. In the event of a recession, companies will likely shift to digital strategies, making these careers in tech valuable and highly coveted.\n\xa0\n\n\nProgrammer/Developer\nNo matter the programming language you’ve mastered, having the knowledge alone makes you extremely valuable. The coding skills you possess as a programmer or developer are in-demand for companies looking to build or enhance their websites, and enhance their consumer experience. According to the U.S. Bureau of Labor Statistics, jobs in software development are expected to grow 22% by 2030. This is much faster than the average career.\

In [18]:
read_url_or_file_codeup(filename='codeup_articles.json')

Found File


[{'title': 'Is a Career in Tech Recession-Proof?',
  'content': 'Given the current economic climate, many economists are considering the U.S. to be entering a recession. This can cause confusion, fear, and uncertainty, especially as it pertains to job security.\nTo ease some of those feelings, below you’ll find some careers in tech that tend to hold up better than others amid a recession. In the event of a recession, companies will likely shift to digital strategies, making these careers in tech valuable and highly coveted.\n\xa0\n\n\nProgrammer/Developer\nNo matter the programming language you’ve mastered, having the knowledge alone makes you extremely valuable. The coding skills you possess as a programmer or developer are in-demand for companies looking to build or enhance their websites, and enhance their consumer experience. According to the U.S. Bureau of Labor Statistics, jobs in software development are expected to grow 22% by 2030. This is much faster than the average career.\

In [2]:
url = 'https://codeup.com/featured/series-part-3-web-development/'
headers = {'User-Agent': 'Codeup Data Science'} # codeup.com doesn't like our default user-agent
response = get(url, headers=headers)

html = response.text
soup = BeautifulSoup(response.content, 'html.parser')

In [3]:
print(response.text[:400])

<!DOCTYPE html>
<html lang="en-US">
<head>
	<meta charset="UTF-8" />
<meta http-equiv="X-UA-Compatible" content="IE=edge">
	<link rel="pingback" href="https://codeup.com/xmlrpc.php" />

	<script type="text/javascript">
		document.documentElement.className = 'js';
	</script>
	
	<link rel="preconnect" href="https://fonts.gstatic.com" crossorigin /><script id="diviarea-loader">window.DiviPopupData=wi


#### Get the article

In [4]:
soup = BeautifulSoup(response.content, 'html.parser')

In [5]:
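# find() returns None when no element has this class, which causes the AttributeError below;
# the class name looks truncated (possibly 'et_pb_text_inner', an unverified guess)
article = soup.find('div', class_='et_pb_text_inne')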
article.text

AttributeError: 'NoneType' object has no attribute 'text'

In [None]:
# This reads one article only...

def get_article_text(url):
    # if we already have the data, read it locally
    if os.path.isfile('article.txt'):
        with open('article.txt') as f:
            return f.read()

    # otherwise go fetch the data
#     url = 'https://codeup.com/codeups-data-science-career-accelerator-is-here/'
    headers = {'User-Agent': 'Codeup Ada Data Science'}
    response = get(url, headers=headers)
    soup = BeautifulSoup(response.text, 'html.parser')  # name the parser explicitly
    article = soup.find('div', class_='mk-single-content')

    # save it for next time
#     with open('article.txt', 'w') as f:
#         f.write(article.text)

    return article.text.strip()

In [None]:
get_article_text('https://codeup.com/featured/series-part-3-web-development/')

In [None]:
%pip install beautifulsoup4

In [None]:
def get_blog_posts():
    filename = './codeup_blog_posts.csv'

    # check for presence of the file or make a new request
    if os.path.exists(filename):
        return pd.read_csv(filename)
    else:
        return make_new_request()

In [None]:
def make_dictionary_from_article(url): 
    # Set a user-agent header to increase the likelihood that the request gets the response you want
    headers = {'user-agent': 'Codeup Data Science Student'}

    # This is the actual HTTP GET request that python will send across the internet
    response = get(url, headers=headers)

    # response.text is a single string of all the html from that page
    soup = BeautifulSoup(response.text, 'html.parser')

    title = soup.title.get_text()
    body = soup.select("div.mk-single-content.clearfix")[0].get_text()

    output = {}
    output["title"] = title
    output["body"] = body
    return output

In [None]:
def make_new_request():
    urls = [
        "https://codeup.com/data-science/recession-proof-career/",
        "https://codeup.com/featured/series-part-3-web-development/",
        "https://codeup.com/data-science-vs-data-analytics-whats-the-difference/",
        "https://codeup.com/featured/what-jobs-can-you-get-after-a-coding-bootcamp-part-2-cloud-administration/",
        "https://codeup.com/data-science/jobs-after-a-coding-bootcamp-part-1-data-science/",
    ]

    output = []
    
    for url in urls:
        article_dictionary = make_dictionary_from_article(url)
        output.append(article_dictionary)

    df = pd.DataFrame(output)
    df.to_csv('./codeup_blog_posts.csv', index=False)  # omit the index so read_csv does not add an extra column

    return df

In [None]:
make_new_request()

In [None]:
def get_news_articles():
    filename = 'inshorts_news_articles.csv'

    # check for presence of the file or make a new request
    if os.path.exists(filename):
        return pd.read_csv(filename)
    else:
        return make_new_article_request()  # the news scraper, not the blog scraper

In [None]:
def get_articles_from_topic(url):
    headers = {'user-agent': 'Codeup Data Science Student'}
    response = get(url, headers=headers)
    soup = BeautifulSoup(response.content, 'html.parser')

    output = []

    articles = soup.select(".news-card")

    for article in articles: 
        title = article.select("[itemprop='headline']")[0].get_text()
        content = article.select("[itemprop='articleBody']")[0].get_text()
        author = article.select(".author")[0].get_text()
        published_date = article.select(".time")[0]["content"]
        category = response.url.split("/")[-1]

        article_data = {
            'title': title,
            'content': content,
            'category': category,
            'author': author,
            'published_date': published_date,
        }
        output.append(article_data)


    return output
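
Taking the category from `response.url.split("/")[-1]` yields an empty string if the URL ends with a slash, and it keeps any query string. A slightly more defensive variant using the standard library is sketched below; `category_from_url` is a hypothetical helper name.

```python
from urllib.parse import urlparse


def category_from_url(url: str) -> str:
    """Return the last non-empty path segment of `url`, ignoring any query string."""
    path = urlparse(url).path
    segments = [segment for segment in path.split('/') if segment]
    return segments[-1] if segments else ''
```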

In [None]:
def make_new_article_request():
    urls = [
        "https://inshorts.com/en/read/business",
        "https://inshorts.com/en/read/sports",
        "https://inshorts.com/en/read/technology",
        "https://inshorts.com/en/read/entertainment"
    ]

    output = []
    
    for url in urls:
        # We use .extend in order to make a flat output list.
        output.extend(get_articles_from_topic(url))

    df = pd.DataFrame(output)
    df.to_csv('inshorts_news_articles.csv', index=False)  # omit the index so read_csv does not add an extra column

    return df

In [None]:
make_new_article_request()

In [None]:
import nltk; nltk.download('stopwords')