In [33]:
# Initial imports
import pandas as pd
from bs4 import BeautifulSoup
from requests import get
from os import path

## Visit Codeup's Blog and record the urls for at least 5 distinct blog posts. For each post, you should scrape at least the post's title and content.
## Encapsulate your work in a function named get_blog_articles that will return a list of dictionaries, with each dictionary representing one article. The shape of each dictionary should look like this:

{
    'title': 'the title of the article',
    'content': 'the full text content of the article'
}
Plus any additional properties you think might be helpful.
Bonus: Scrape the text of all the articles linked on Codeup's blog page.
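
For the bonus, the article URLs first have to be collected from the blog's index page before they can be handed to a scraper. The following is a minimal sketch of that step, not a solution tested against the live site. It uses only the standard library's `html.parser` (BeautifulSoup's `find_all('a')` would serve equally well). `LinkCollector`, `collect_article_urls`, and the sample markup are hypothetical names made up for illustration. It also assumes that article links sit exactly two path segments deep (`/category/slug/`), which is the pattern of the five URLs used in this notebook.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag it sees (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            href = dict(attrs).get('href')
            if href:
                self.hrefs.append(href)


def collect_article_urls(index_html, base_url):
    """Return unique same-site links that look like /category/slug/ (an assumption)."""
    collector = LinkCollector()
    collector.feed(index_html)

    base_host = urlparse(base_url).netloc
    article_urls = []
    for href in collector.hrefs:
        absolute = urljoin(base_url, href)  # resolve relative links
        parts = urlparse(absolute)
        segments = [s for s in parts.path.split('/') if s]
        # Keep same-host links with exactly two path segments, skipping duplicates
        if parts.netloc == base_host and len(segments) == 2 and absolute not in article_urls:
            article_urls.append(absolute)
    return article_urls


# Hand-written sample markup, not the real blog page
sample_index = (
    '<a href="/data-science/post-one/">One</a>'
    '<a href="https://codeup.edu/data-science/post-one/">One again</a>'
    '<a href="/blog/">Blog home</a>'
    '<a href="https://example.com/other/site/">Elsewhere</a>'
)
sample_urls = collect_article_urls(sample_index, 'https://codeup.edu/blog/')
```

The resulting list could then be passed to `get_blog_articles` once the index page's HTML has been downloaded.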

In [142]:
def get_blog_articles(urls):
    """
    Acquires the title and content of each Codeup blog post in urls. Returns a list of dictionaries, one per article.
    """
    # Initialize the results list
    results = []
    
    # Iterate through the urls that were taken as an argument
    for url in urls:
        # Set the headers to give us access
        headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'}
        # Request the page's HTML (get is imported directly from requests above)
        response = get(url, headers=headers)
        # Parse the HTML with BeautifulSoup, naming the parser explicitly
        bs = BeautifulSoup(response.text, 'html.parser')
        # Create a dictionary of the title and content
        content = dict(title = bs.h1.text, content = '\n'.join([p.text for p in bs.find_all('p')]))
        # Append the results to the list of dictionaries
        results.append(content)
        
    return results

In [143]:
urls = ['https://codeup.edu/alumni-stories/how-i-paid-43-for-my-codeup-tuition/',
        'https://codeup.edu/data-science/where-do-data-scientists-come-from/',
        'https://codeup.edu/data-science/codeups-data-science-career-accelerator-is-here/',
        'https://codeup.edu/featured/women-in-tech-panelist-spotlight/',
        'https://codeup.edu/codeup-news/women-in-tech-panelist-spotlight-sarah-mellor/'
       ]

In [144]:
# Test the function
articles = get_blog_articles(urls)
articles

[{'title': 'How I Paid $43 For My Codeup Tuition',
 {'title': 'Where Do Data Scientists Come From?',
  'content': 'Oct 24, 2018 | Data Science\nBy Dimitri Antinou\nOver the last few blog posts, we’ve answered a lot of questions around Data Science: What is it? What’s the difference from data analytics? Which type of program is right for me? If you’re interested in becoming a data scientist, you might be wondering how other people got into the field. Given how new the profession is, most of today’s practitioners probably didn’t study data science formally as undergraduate or graduate students. So today we’re asking: where do data scientists come from?\nLet’s start broadly by defining the possible pathways into this career. If you’re a Data Scientist, you probably followed one or more of these paths:\nEach of these pathways has unique advantages and disadvantages across variables like cost, formal credential, length, and pace. A free online program is free and accessible, but takes a lot

In [145]:
# Make it a dataframe, just for fun
pd.DataFrame(articles)

Unnamed: 0,title,content
0,How I Paid $43 For My Codeup Tuition,"Nov 27, 2019 | Alumni Stories\nBootcamps or ca..."
1,Where Do Data Scientists Come From?,"Oct 24, 2018 | Data Science\nBy Dimitri Antino..."
2,Codeup’s Data Science Career Accelerator is Here!,"Sep 30, 2018 | Data Science\nThe rumors are tr..."
3,Women in tech: Panelist Spotlight – Magdalena ...,"Mar 28, 2023 | Events, Featured\nCodeup is hos..."
4,Women in Tech: Panelist Spotlight – Sarah Mellor,"Mar 13, 2023 | Codeup News, Featured\nCodeup i..."


# Write a function that scrapes the news articles for the following topics:
Business
Sports
Technology
Entertainment
The end product of this should be a function named get_news_articles that returns a list of dictionaries, where each dictionary has this shape:

{
    'title': 'The article title',
    'content': 'The article content',
    'category': 'business' # for example
}
Hints:
Start by inspecting the website in your browser. Figure out which elements will be useful.
Start by creating a function that handles a single article and produces a dictionary like the one above.
Next create a function that will find all the articles on a single page and call the function you created in the last step for every article on the page.
Now create a function that will use the previous two functions to scrape the articles from all the pages that you need, and do any additional processing that needs to be done.
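
The three hints above can be sketched as three small functions. This is only an illustration of the decomposition, under stated assumptions: the names `parse_article`, `parse_page`, and `scrape_pages` are hypothetical, the `itemprop` attributes are the ones the solution cell in this notebook searches for, and the downloading step is left out so the example works on hand-written HTML strings rather than the live site.

```python
from bs4 import BeautifulSoup


def parse_article(headline_tag, body_tag, category):
    """Step 1: turn one headline/body pair into a dictionary."""
    return {
        'title': headline_tag.get_text(strip=True),
        'content': body_tag.get_text(strip=True),
        'category': category,
    }


def parse_page(html, category):
    """Step 2: find every article on one page and parse each of them."""
    soup = BeautifulSoup(html, 'html.parser')
    headlines = soup.find_all('span', itemprop='headline')
    bodies = soup.find_all('div', itemprop='articleBody')
    # zip stops at the shorter list, so an unmatched headline is skipped
    return [parse_article(h, b, category) for h, b in zip(headlines, bodies)]


def scrape_pages(pages):
    """Step 3: combine the articles from all pages.

    pages maps a category name to that page's already-downloaded HTML.
    """
    results = []
    for category, html in pages.items():
        results.extend(parse_page(html, category))
    return results


# Hand-written sample markup standing in for downloaded pages
sample_pages = {
    'business': '<span itemprop="headline">Markets rise</span>'
                '<div itemprop="articleBody">Stocks went up.</div>',
    'sports': '<span itemprop="headline">Team wins</span>'
              '<div itemprop="articleBody">A close game.</div>',
}
sample_articles = scrape_pages(sample_pages)
```

Keeping the parsing separate from the network request makes each step easy to check on a small HTML string before pointing it at real pages.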

In [147]:
def get_news_articles(urls):
    """
    Scrapes news articles from the provided URLs and returns their title, content, and category as a list of dictionaries
    """
    results = []
    
    # Iterate through the list of urls that were taken as an argument
    for url in urls:
        # Request the page's HTML
        response = get(url)
        # Parse the HTML with BeautifulSoup
        soup = BeautifulSoup(response.text, 'html.parser')
        # Make a list of all the headlines on the page
        headlines = soup.find_all('span', itemprop="headline")
        # Make a list of all the article content on the page
        articles = soup.find_all('div', itemprop="articleBody")
        # Grab the category from the end of the url
        category = url.split('/')[-1]
        
        # Pair each headline with its article body and build a dictionary of the title, content, and category
        for headline, body in zip(headlines, articles):
            article = {
                'title': headline.text,
                'article': body.text,
                'category': category
            }
            # Append the dictionary to the results list
            results.append(article)
    
#     # Cast the results to a dataframe
#     results_df = pd.DataFrame(results)
    
    return results
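
The function above takes the category from the last piece of the URL. A slightly more defensive sketch of that one step is shown below. `category_from_url` is a hypothetical helper name, and the URLs with a trailing slash or query string are made-up inputs used only to show the edge cases.

```python
from urllib.parse import urlparse


def category_from_url(url):
    """Return the last non-empty path segment of url, or '' if there is none."""
    # urlparse separates the path from any query string or fragment
    segments = [s for s in urlparse(url).path.split('/') if s]
    return segments[-1] if segments else ''


plain = category_from_url('https://inshorts.com/en/read/business')
trailing = category_from_url('https://inshorts.com/en/read/sports/?page=2')
```

With a plain `url.split('/')[-1]`, a URL that ends in a slash yields an empty string, which is the case this version guards against.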

In [148]:
# List of URLs to scrape
urls = [
    'https://inshorts.com/en/read/business',
    'https://inshorts.com/en/read/sports',
    'https://inshorts.com/en/read/technology',
    'https://inshorts.com/en/read/entertainment'
]

In [149]:
# Call the function with the list of URLs
news = get_news_articles(urls)
news

[{'title': "Some of Jhunjhunwala's best picks were in 2002 crash: Utpal Sheth",
  'article': 'Rare Enterprises CEO Utpal Sheth said some of the best picks of late investor Rakesh Jhunjhunwala were during the downturn of 2002-2003. The environment at that time was of despair, but valuations were strongly in favour of investors, Sheth stated. He added that Jhunjhunwala recognised that same sentiment of despair in the economy during COVID-19.',
  'category': 'business'},
 {'title': 'Pakistan suspends Russian oil imports over quality issues: Reports',
  'article': 'Pakistan has reportedly suspended imports of Russian crude oil over quality concerns. As per multiple reports, Pakistan refineries have refused to process Russian oil as it was producing less petrol with 20% more furnace oil than Arabian crude oil. No Russian oil ship had come to Pakistan after the last one arrived on June 26, the reports said.',
  'category': 'business'},
 {'title': "SEBI bars ZEE's Subhash Chandra, Goenka from

In [150]:
# Make it a dataframe just for fun
pd.DataFrame(news)

Unnamed: 0,title,article,category
0,Some of Jhunjhunwala's best picks were in 2002...,Rare Enterprises CEO Utpal Sheth said some of ...,business
1,Pakistan suspends Russian oil imports over qua...,Pakistan has reportedly suspended imports of R...,business
2,"SEBI bars ZEE's Subhash Chandra, Goenka from b...",SEBI barred Zee Entertainment's Subhash Chandr...,business
3,"In 1989, Jhunjhunwala said India's time has co...",Veteran investor Ramesh Damani said late billi...,business
4,RBI to launch public tech platform for frictio...,RBI on Monday announced it will launch a pilot...,business
5,"Vodafone Idea suffers ₹7,840 crore loss in Q1",Debt-ridden telecom operator Vodafone Idea (Vi...,business
6,Foxconn's $2-billion investment in India only ...,"Foxconn Chairman Young Liu, while talking abou...",business
7,India's merchandise trade deficit narrows to $...,India's merchandise trade deficit narrowed by ...,business
8,Sunil Munjal to quit Hero MotoCorp's top manag...,Hero MotoCorp on Monday disclosed details of t...,business
9,ITC investors to get 1 ITC Hotels share for 10...,ITC on Monday said its board has approved a sc...,business
