In [1]:
import pandas as pd
import numpy as np
import re
from requests import get
from bs4 import BeautifulSoup as bs

### 1. Codeup Blog Articles

Visit Codeup's Blog and record the urls for at least 5 distinct blog posts. For each post, you should scrape at least the post's title and content.

Encapsulate your work in a function named get_blog_articles that will return a list of dictionaries, with each dictionary representing one article. The shape of each dictionary should look like this:

    {
        'title': 'the title of the article',
        'content': 'the full text content of the article'
    }


#### From class review

In [4]:
# establish a base url for our requests:
url = 'https://codeup.com/blog/'

In [5]:
# we need to specify some user agent for the codeup site
# non-specified user agents are rejected
header = {'User-Agent': 'hamsandwich'}

In [6]:
# establish our basic soup with the base url
soup = bs(get(url, headers=header).content)

In [7]:
#soup.prettify

In [8]:
soup.select("a.more-link")

[<a class="more-link" href="https://codeup.com/tips-for-prospective-students/coding-bootcamp-vs-college/">read more</a>,
 <a class="more-link" href="https://codeup.com/codeup-news/dei-report/">read more</a>,
 <a class="more-link" href="https://codeup.com/codeup-news/diversity-and-inclusion-award/">read more</a>,
 <a class="more-link" href="https://codeup.com/featured/financing-career-transition/">read more</a>,
 <a class="more-link" href="https://codeup.com/tips-for-prospective-students/tips-for-women/">read more</a>,
 <a class="more-link" href="https://codeup.com/cloud-administration/cloud-computing-and-aws/">read more</a>]

In [9]:
soup.select("a.more-link")[0]['href']

'https://codeup.com/tips-for-prospective-students/coding-bootcamp-vs-college/'
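
Indexing a tag with `['href']` raises a `KeyError` if an anchor has no `href` attribute. The sketch below is one tolerant alternative, not the class solution. It assumes a hand-written HTML snippet in place of the real Codeup page, and `extract_links` is a hypothetical helper name.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for the blog index page (an assumption, not the real markup).
SAMPLE_HTML = """
<div>
  <a class="more-link" href="https://example.com/post-1/">read more</a>
  <a class="more-link">read more</a>
  <a class="other" href="https://example.com/about/">about</a>
  <a class="more-link" href="https://example.com/post-2/">read more</a>
</div>
"""

def extract_links(html, selector="a.more-link"):
    """Return the href of every element matching selector, skipping tags without one."""
    soup = BeautifulSoup(html, "html.parser")
    # Tag.get returns None instead of raising when the attribute is missing
    return [tag.get("href") for tag in soup.select(selector) if tag.get("href")]

links = extract_links(SAMPLE_HTML)
print(links)
```

The anchor without an `href` and the anchor with a different class are both left out of the result.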

In [10]:
blog_posts = [link["href"] for link in soup.select("a.more-link")]

In [11]:
blog_posts

['https://codeup.com/tips-for-prospective-students/coding-bootcamp-vs-college/',
 'https://codeup.com/codeup-news/dei-report/',
 'https://codeup.com/codeup-news/diversity-and-inclusion-award/',
 'https://codeup.com/featured/financing-career-transition/',
 'https://codeup.com/tips-for-prospective-students/tips-for-women/',
 'https://codeup.com/cloud-administration/cloud-computing-and-aws/']

In [12]:
#wrap the steps above in a function
def get_blog_urls(base_url, header = {'User-Agent': 'hamsandwich'}):
    # request base_url (the parameter), not the global url
    soup = bs(get(base_url, headers = header).content)
    return [link["href"] for link in soup.select("a.more-link")]

In [13]:
get_blog_urls(url)

['https://codeup.com/tips-for-prospective-students/coding-bootcamp-vs-college/',
 'https://codeup.com/codeup-news/dei-report/',
 'https://codeup.com/codeup-news/diversity-and-inclusion-award/',
 'https://codeup.com/featured/financing-career-transition/',
 'https://codeup.com/tips-for-prospective-students/tips-for-women/',
 'https://codeup.com/cloud-administration/cloud-computing-and-aws/']

In [14]:
#inside the article
article_soup = bs(get(
    'https://codeup.com/codeup-news/dei-report/',
    headers=header
).content)

In [15]:
# when only one matching element is expected, use select_one
article_soup.select_one('h1.entry-title').text

'Diversity Equity and Inclusion Report'

In [16]:
# let's get the article content now:
article_soup.select_one('div.entry-content').text.strip()

'Codeup is excited to launch our first Diversity Equity, and Inclusion (DEI) report! In over eight years as an organization, we’ve implemented policies and grown our DEI efforts. We are extremely proud of the progress we’ve made as a staff and Codeup community, and we recognize there is more to learn. This report captures some of the ways that we’ve lived our value of Cultivating Inclusive Growth, and how we will continue doing so as we look to the future.\nWe wanted to shine a light on the demographics of our students and staff, and in particular how that compares to the tech industry as a whole. How we collect, organize, and share employee demographic data is informed by standards set by the Equal Employment Opportunity Commission (EEOC).\nWe are proud to celebrate how we’ve grown and are motivated and committed to do more and be better. To view the report visit the link here, or download it below.'
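
`select_one` returns `None` when nothing matches, so chaining `.text` onto it fails with an `AttributeError` on a page laid out differently. The sketch below is one way to guard against that. It assumes hand-written HTML snippets rather than real article pages, and `parse_article` is a hypothetical helper name. It also uses `get_text(" ", strip=True)`, which joins text fragments with a space instead of keeping the newlines that `.text.strip()` preserves.

```python
from bs4 import BeautifulSoup

def parse_article(html):
    """Return {'title', 'content'} for one article page; missing parts become None."""
    soup = BeautifulSoup(html, "html.parser")
    title_tag = soup.select_one("h1.entry-title")
    content_tag = soup.select_one("div.entry-content")
    return {
        "title": title_tag.get_text(strip=True) if title_tag else None,
        "content": content_tag.get_text(" ", strip=True) if content_tag else None,
    }

# Hand-written snippets standing in for real article pages (assumptions).
complete = (
    '<h1 class="entry-title">A Title</h1>'
    '<div class="entry-content"><p>First.</p><p>Second.</p></div>'
)
no_body = '<h1 class="entry-title">Only a Title</h1>'

print(parse_article(complete))
print(parse_article(no_body))
```

Returning `None` for a missing part keeps the loop over all posts running, and the gaps can be filtered out afterwards.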

In [17]:
def get_blog_content(base_url):
    # collect the article links from the blog index page
    blog_links = get_blog_urls(base_url)
    all_blogs = []
    for blog in blog_links:
        # request and parse each article page (uses the header defined above)
        blog_soup = bs(get(blog, headers=header).content)
        blog_content = {
            'title': blog_soup.select_one('h1.entry-title').text,
            'content': blog_soup.select_one('div.entry-content').text.strip(),
        }
        all_blogs.append(blog_content)
    return all_blogs

In [18]:
get_blog_content(url)

[{'title': 'Coding Bootcamp or Computer Science Degree?',
  'content': 'For many people, deciding between a coding bootcamp and a computer science degree can be tough. We would like to lend a hand in comparing programs, and shed some light on key similarities and differences between Codeup and a degree.\nKey Differences\n1. Time Commitment\nCodeup’s programs range from 15 to 20 weeks. They are full-time courses requiring students to be in class from 9 am to 5 pm Monday through Friday.\nOn average, a full-time enrollee obtaining a Bachelor’s degree in Computer Science can expect to commit four years, or 208 weeks.\n2. Job Placement\nWhen a student decides to attend Codeup, built into tuition is placement assistance. We have a team dedicated to placing students with jobs in-field upon completion of a Codeup program.*\nUnfortunately, most colleges and universities do not offer placement assistance. While some schools may facilitate job fairs or share new job listings with students, most e

In [19]:
my_blogs = pd.DataFrame(get_blog_content(url))
my_blogs

Unnamed: 0,title,content
0,Coding Bootcamp or Computer Science Degree?,"For many people, deciding between a coding boo..."
1,Diversity Equity and Inclusion Report,Codeup is excited to launch our first Diversit...
2,Codeup Honored as SABJ Diversity and Inclusion...,Codeup has been named the 2022 Diversity and I...
3,How Can I Finance My Career Transition?,Deciding to transition into a tech career is a...
4,Tips for Women Beginning a Career in Tech,"Codeup strongly values diversity, and inclusio..."
5,What is Cloud Computing and AWS?,With many companies switching to cloud service...


### 2. News Articles

We will now be scraping text data from inshorts, a website that provides a brief overview of many different topics.

Write a function that scrapes the news articles for the following topics:

- Business
- Sports
- Technology
- Entertainment

The end product of this should be a function named get_news_articles that returns a list of dictionaries, where each dictionary has this shape:

    {
    'title': 'The article title',
    'content': 'The article content',
    'category': 'business' # for example
    }


In [2]:
#get url
url = "https://inshorts.com/en/read"

In [3]:
#parse the response into soup
soup = bs(get(url).content)

In [22]:
#soup

<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8"/>
<style>
    /* The Modal (background) */
    .modal_contact {
        display: none; /* Hidden by default */
        position: fixed; /* Stay in place */
        z-index: 8; /* Sit on top */
        left: 0;
        top: 0;
        width: 100%; /* Full width */
        height: 100%;
        overflow: auto; /* Enable scroll if needed */
        background-color: rgb(0,0,0); /* Fallback color */
        background-color: rgba(0,0,0,0.4); /* Black w/ opacity */
    }

    /* Modal Content/Box */
    .modal-content {
        background-color: #fefefe;
        margin: 15% auto;
        padding: 20px !important;
        padding-top: 0 !important;
        /* border: 1px solid #888; */
        text-align: center;
        position: relative;
        border-radius: 6px;
    }

    /* The Close Button */
    .close {
      left: 90%;
      color: #aaa;
      float: right;
      font-size: 28px;
      font-weight: bold;
    /* positio

In [33]:
# the <li> elements list every news category
soup.find_all("li")

[<li class="active-category selected">All News</li>,
 <li class="active-category">India</li>,
 <li class="active-category">Business</li>,
 <li class="active-category">Sports</li>,
 <li class="active-category">World</li>,
 <li class="active-category">Politics</li>,
 <li class="active-category">Technology</li>,
 <li class="active-category">Startup</li>,
 <li class="active-category">Entertainment</li>,
 <li class="active-category">Miscellaneous</li>,
 <li class="active-category">Hatke</li>,
 <li class="active-category">Science</li>,
 <li class="active-category">Automobile</li>]

In [31]:
soup.find_all("li")[1].text.lower()

'india'

In [34]:
url

'https://inshorts.com/en/read'

In [36]:
#concat url
url + '/' + soup.find_all("li")[1].text.lower()


'https://inshorts.com/en/read/india'
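
Plain concatenation works here because the base URL has no trailing slash. The sketch below normalizes the join so either form of the base URL gives the same result. `category_url` is a hypothetical helper name. The last line shows why `urllib.parse.urljoin` is not a drop-in replacement for this base URL.

```python
from urllib.parse import urljoin

def category_url(base_url, category):
    """Join a base URL and a category name with exactly one slash between them."""
    return base_url.rstrip("/") + "/" + category.strip().lower()

print(category_url("https://inshorts.com/en/read", "India"))
print(category_url("https://inshorts.com/en/read/", "science"))

# urljoin replaces the last path segment when the base lacks a trailing slash,
# so "read" is dropped from the result here
print(urljoin("https://inshorts.com/en/read", "india"))
```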

In [5]:
#get categories
def get_cats(url):
    '''
    Returns the lowercase category names scraped from the base url,
    skipping the first entry ('All News')
    '''
    soup = bs(get(url).content)
    
    return [cat.text.lower() for cat in soup.find_all("li")[1:]]

In [6]:
get_cats(url)

['india',
 'business',
 'sports',
 'world',
 'politics',
 'technology',
 'startup',
 'entertainment',
 'miscellaneous',
 'hatke',
 'science',
 'automobile']
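
The exercise asks for only four topics, while the scraped list holds every category on the site. The sketch below shows one way to narrow the list. `WANTED_TOPICS` and `filter_categories` are hypothetical names, and the `scraped` list is copied from the output above.

```python
# The four topics named in the exercise
WANTED_TOPICS = ("business", "sports", "technology", "entertainment")

def filter_categories(scraped, wanted=WANTED_TOPICS):
    """Keep the scraped category names that appear in wanted, preserving scraped order."""
    wanted_set = {name.lower() for name in wanted}
    return [cat for cat in scraped if cat.lower() in wanted_set]

# Category list copied from the output above
scraped = ['india', 'business', 'sports', 'world', 'politics', 'technology',
           'startup', 'entertainment', 'miscellaneous', 'hatke', 'science',
           'automobile']

selected = filter_categories(scraped)
print(selected)
```

Filtering the scraped names, rather than hard-coding URLs, means a topic that disappears from the site is skipped instead of producing a failed request.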

In [37]:
#concat the base url with a category; the result is a functioning url for that category
cat_url = url + '/' + 'science'
cat_url

'https://inshorts.com/en/read/science'

In [9]:
cat_soup = bs(get(cat_url).content)
cat_soup

In [12]:
#find titles
cat_soup.find_all("span", itemprop = "headline")[0].text

'New species of beetle named after Novak Djokovic'

In [13]:
#loop to find all
cat_title = [title.text for title in cat_soup.find_all("span", itemprop = "headline")]

In [14]:
cat_title

['New species of beetle named after Novak Djokovic',
 'Microplastics found in human breast milk for the first time',
 'Orcas caught chasing, killing great white shark on video for the first time',
 'Toxic air pollution particles found in lungs, brains of unborn babies for the first time',
 "Scientists map 'graveyard of stars' in our galaxy for the 1st time, pics released",
 "NASA releases new pic of Jupiter's ice-covered moon Europa captured by Juno",
 'Asteroid hit by NASA leaves 10,000 km trail of debris; pic surfaces',
 "Material coming out of black hole is like nothing we've ever seen: Scientists",
 "Pics show aerial view of US' Florida before and after Hurricane Ian's approach",
 "Asteroid's path altered after NASA deliberately crashes spacecraft into it",
 'You have 50 mins until your life changes: Nobel Committee to winner at 1:53 am',
 'SpaceX & NASA launch crew of 4, including a Russian cosmonaut, to ISS',
 'SpaceX launches 52 Starlink satellites hours after launching Crew-5 m

In [16]:
#cat bodies
cat_soup.find_all("div", itemprop = "articleBody")[0].text

'Serbian scientists named a new species of beetle after ex-world number one men\'s tennis player Novak Djokovic. The insect, which belongs to Duvalius genus of ground beetles present in Europe and was discovered several years ago in underground pit in Serbia, has been named \'Duvalius djokovici\'. "We feel urged to pay Djokovic back in...way we can," a researcher said.'

In [17]:
cat_bodies = [body.text for body in cat_soup.find_all("div", itemprop = "articleBody")]

In [20]:
cat_bodies[:3]

['Serbian scientists named a new species of beetle after ex-world number one men\'s tennis player Novak Djokovic. The insect, which belongs to Duvalius genus of ground beetles present in Europe and was discovered several years ago in underground pit in Serbia, has been named \'Duvalius djokovici\'. "We feel urged to pay Djokovic back in...way we can," a researcher said.',
 'Microplastics have been found in human breast milk for the first time, according to a research published in the peer-reviewed journal Polymers. An Italian team of researchers took breast milk samples from 34 healthy mothers a week after giving birth and microplastics were detected in 75% of them. The researchers found microplastics composed of polyethylene, PVC and polypropylene.',
 'Scientists have confirmed that orcas hunt great white sharks, after drone and helicopter footage showed first evidence of a pod of orcas killing a great white near Mossel Bay in South Africa. The footage shows five killer whales chasing

In [22]:
#check length
len(cat_bodies) == len(cat_title)

True
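
The length check matters because pairing two lists with `zip` stops at the shorter one, so a page with a missing body would silently drop or misalign articles. The sketch below builds the check into the pairing step. `pair_articles` is a hypothetical helper name and the sample strings are made up.

```python
def pair_articles(titles, bodies, category):
    """Build one dict per article, refusing to pair lists of different lengths."""
    if len(titles) != len(bodies):
        raise ValueError(
            f"{category}: {len(titles)} titles but {len(bodies)} bodies"
        )
    return [
        {"title": title, "category": category, "body": body}
        for title, body in zip(titles, bodies)
    ]

# Made-up sample data
articles = pair_articles(["Headline A", "Headline B"], ["Body A", "Body B"], "science")
print(articles)

try:
    pair_articles(["Headline A"], [], "science")
except ValueError as err:
    print("mismatch detected:", err)
```

On Python 3.10 or later, `zip(titles, bodies, strict=True)` raises a `ValueError` on a length mismatch in the same way.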

In [36]:
#make function to grab all this
def get_all_shorts(base_url):
    '''
    Takes in the base url, builds a url for each category, scrapes the titles and bodies,
    and returns a list of dictionaries, one per article, holding the title, category, and body text
    '''

    #get categories from the earlier function
    cats = get_cats(base_url)
    all_articles = []
    for cat in cats:
        #create url for each category
        cat_url = base_url + "/" + cat
        #request the page once and reuse the response for the status print and the soup
        response = get(cat_url)
        print(response)
        #grab content
        cat_soup = bs(response.content)
        #grab title
        cat_titles = [title.text for title in cat_soup.find_all('span', itemprop='headline')]
        #grab body
        cat_bodies = [body.text for body in cat_soup.find_all('div', itemprop='articleBody')]
        #create one dictionary per article
        cat_articles = [{"title":title, "category": cat, "body":body} for \
                       title, body in zip(cat_titles, cat_bodies)]
        print('cat articles length: ',len(cat_articles))

        #add this category's dictionaries to the running list
        all_articles.extend(cat_articles)
        print('length of all_articles: ', len(all_articles))

    return all_articles




In [34]:
url

'https://inshorts.com/en/read'

In [35]:
all_articles = get_all_shorts(url)


<Response [200]>
cat articles length:  12
length of all_articles:  12
<Response [200]>
cat articles length:  25
length of all_articles:  37
<Response [200]>
cat articles length:  25
length of all_articles:  62
<Response [200]>
cat articles length:  25
length of all_articles:  87
<Response [200]>
cat articles length:  25
length of all_articles:  112
<Response [200]>
cat articles length:  25
length of all_articles:  137
<Response [200]>
cat articles length:  25
length of all_articles:  162
<Response [200]>
cat articles length:  25
length of all_articles:  187
<Response [200]>
cat articles length:  24
length of all_articles:  211
<Response [200]>
cat articles length:  25
length of all_articles:  236
<Response [200]>
cat articles length:  25
length of all_articles:  261
<Response [200]>
cat articles length:  24
length of all_articles:  285
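
Each call makes one request per category, so re-running the notebook repeats every request. The sketch below caches the scraped list in a JSON file. `load_or_scrape` and `fake_scrape` are hypothetical names, and `fake_scrape` stands in for the real scraper so the example runs offline.

```python
import json
import os
import tempfile

def load_or_scrape(cache_path, scrape_fn):
    """Return cached articles if cache_path exists; otherwise call scrape_fn and cache the result."""
    if os.path.exists(cache_path):
        with open(cache_path, encoding="utf-8") as f:
            return json.load(f)
    articles = scrape_fn()
    with open(cache_path, "w", encoding="utf-8") as f:
        json.dump(articles, f, ensure_ascii=False, indent=2)
    return articles

# Stand-in scraper (an assumption); a real call might pass
# lambda: get_all_shorts(url) instead.
calls = []

def fake_scrape():
    calls.append(1)
    return [{"title": "T", "category": "science", "body": "B"}]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "articles.json")
    first = load_or_scrape(path, fake_scrape)   # scrapes and writes the file
    second = load_or_scrape(path, fake_scrape)  # reads the file, no second scrape

print(first == second, len(calls))
```

Deleting the cache file forces a fresh scrape, which is needed whenever the site's headlines change.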


In [31]:
#create dataframe
all_articles = pd.DataFrame(all_articles)

In [32]:
all_articles

Unnamed: 0,title,category,body
0,"Afghanistan wins SAFF title, spoils India's ha...",india,Afghanistan won their maiden-SAFF Football Cha...
1,Kashmir's famous Dal Lake freezes,india,After the recent snowfall in upper reaches of ...
2,"Indian Navy gets VLF, easy communication with ...",india,The Indian navy has a new communication system...
3,India's first Billiards Premier League,india,The Billiards and Snooker Association of Mahar...
4,Oldest woman in India passes away,india,"Kunjannam, a 112-yr-old woman from Parannur (K..."
...,...,...,...
280,Fix for wheel issue that caused electric car r...,automobile,Toyota Motor said it has found a fix for the d...
281,Withdraw rule that makes 6 airbags mandatory i...,automobile,International Road Federation (IRF) has urged ...
282,Amazon-backed Rivian's shares fall 9% after it...,automobile,Amazon-backed EV-maker Rivian on Friday recall...
283,Vintage cars on display to promote wildlife pr...,automobile,"To create awareness about wildlife week, the K..."
