# Data Acquisition

## Imports

In [1]:
from requests import get
from bs4 import BeautifulSoup
import os
import pandas as pd

## Codeup Blog Articles

Visit Codeup's Blog and record the URLs for at least 5 distinct blog posts. For each post, you should scrape at least the post's title and content.

Encapsulate your work in a function named get_blog_articles that will return a list of dictionaries, with each dictionary representing one article. The shape of each dictionary should look like this:

```python
{
    'title': 'the title of the article',
    'content': 'the full text content of the article'
}
```

Include any additional properties you think might be helpful.



In [2]:
url = 'https://codeup.com/blog/'
headers = {'User-Agent': 'Codeup Data Science'}
response = get(url, headers=headers)

In [3]:
soup = BeautifulSoup(response.content, 'html.parser')

### Note: Using '.more-link' misses the first three articles

In [4]:
soup.select('.more-link')

[<a class="more-link" href="https://codeup.com/workshops/in-person-workshop-learn-to-code-python-on-7-19/">read more</a>,
 <a class="more-link" href="https://codeup.com/workshops/dallas/free-javascript-workshop-at-codeup-dallas-on-6-28/">read more</a>,
 <a class="more-link" href="https://codeup.com/tips-for-prospective-students/is-our-cloud-administration-program-right-for-you/">read more</a>,
 <a class="more-link" href="https://codeup.com/workshops/pride-in-tech-panel/">read more</a>,
 <a class="more-link" href="https://codeup.com/codeup-news/inclusion-at-codeup-during-pride-month-and-always/">read more</a>,
 <a class="more-link" href="https://codeup.com/tips-for-prospective-students/mental-health-first-aid-training/">read more</a>,
 <a class="more-link" href="https://codeup.com/workshops/codeup-dallas-how-to-succeed-at-a-coding-bootcamp-on-june-9th/">read more</a>,
 <a class="more-link" href="https://codeup.com/featured/5-reasons-to-attend-our-new-cloud-administration-program/">read 

### Selecting 'h2 a[href]' captures all the articles because H2 headings are used only for article titles

In [5]:
soup.select('h2 a[href]')

[<a href="https://codeup.com/featured/what-jobs-can-you-get-after-a-coding-bootcamp-part-2-cloud-administration/">What Jobs Can You Get After a Coding Bootcamp? Part 2: Cloud Administration</a>,
 <a href="https://codeup.com/data-science/jobs-after-a-coding-bootcamp-part-1-data-science/">What Jobs Can You Get After a Coding Bootcamp? Part 1: Data Science</a>,
 <a href="https://codeup.com/tips-for-prospective-students/is-our-cloud-administration-program-right-for-you/">Is Our Cloud Administration Program Right for You?</a>,
 <a href="https://codeup.com/featured/5-reasons-to-attend-our-new-cloud-administration-program/">5 Reasons To Attend Our New Cloud Administration Program</a>,
 <a href="https://codeup.com/featured/what-jobs-can-you-get-after-a-coding-bootcamp-part-2-cloud-administration/">What Jobs Can You Get After a Coding Bootcamp? Part 2: Cloud Administration</a>,
 <a href="https://codeup.com/data-science/jobs-after-a-coding-bootcamp-part-1-data-science/">What Jobs Can You Get Aft

In [6]:
soup.select('h2 a[href]')[0]

<a href="https://codeup.com/featured/what-jobs-can-you-get-after-a-coding-bootcamp-part-2-cloud-administration/">What Jobs Can You Get After a Coding Bootcamp? Part 2: Cloud Administration</a>

In [7]:
soup.select('h2 a[href]')[0]['href']

'https://codeup.com/featured/what-jobs-can-you-get-after-a-coding-bootcamp-part-2-cloud-administration/'
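
The difference between the two selectors can be reproduced on a small, made-up HTML snippet. This is only a sketch: the markup, the `example.com` URLs, and the name `demo_soup` are invented for illustration, and it assumes the missed articles simply have no "read more" link, which the real page may or may not match exactly.

```python
from bs4 import BeautifulSoup

# Made-up markup: one post without a "read more" link and one post with it.
html = """
<article><h2><a href="https://example.com/featured-post/">Featured post</a></h2></article>
<article>
  <h2><a href="https://example.com/regular-post/">Regular post</a></h2>
  <a class="more-link" href="https://example.com/regular-post/">read more</a>
</article>
"""

# A separate name keeps the notebook's own `soup` untouched.
demo_soup = BeautifulSoup(html, 'html.parser')

# Class selector: matches only anchors that carry class="more-link".
more_links = [a['href'] for a in demo_soup.select('.more-link')]

# Descendant selector: matches every anchor with an href inside an <h2>.
heading_links = [a['href'] for a in demo_soup.select('h2 a[href]')]

print(more_links)
print(heading_links)
```

Here the class selector finds one link while the heading selector finds both posts.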

### List comprehension review

In [8]:
[n for n in range(1, 11)]

[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
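
As a further refresher, the comprehension above is shorthand for a loop that appends to a list, and it can also transform and filter in one expression. The names `numbers` and `even_squares` below are illustrative only.

```python
# The comprehension [n for n in range(1, 11)] is equivalent to this loop.
numbers = []
for n in range(1, 11):
    numbers.append(n)

# A comprehension can also transform each item and filter with a condition:
# here, the squares of the even numbers from 1 through 10.
even_squares = [n ** 2 for n in range(1, 11) if n % 2 == 0]

print(numbers)
print(even_squares)
```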

### Using a list comprehension to extract all the links

In [9]:
links = [link['href'] for link in soup.select('h2 a[href]')]
links

['https://codeup.com/featured/what-jobs-can-you-get-after-a-coding-bootcamp-part-2-cloud-administration/',
 'https://codeup.com/data-science/jobs-after-a-coding-bootcamp-part-1-data-science/',
 'https://codeup.com/tips-for-prospective-students/is-our-cloud-administration-program-right-for-you/',
 'https://codeup.com/featured/5-reasons-to-attend-our-new-cloud-administration-program/',
 'https://codeup.com/featured/what-jobs-can-you-get-after-a-coding-bootcamp-part-2-cloud-administration/',
 'https://codeup.com/data-science/jobs-after-a-coding-bootcamp-part-1-data-science/',
 'https://codeup.com/workshops/san-antonio/in-person-workshop-learn-to-code-javascript-on-7-26/',
 'https://codeup.com/workshops/in-person-workshop-learn-to-code-python-on-7-19/',
 'https://codeup.com/workshops/dallas/free-javascript-workshop-at-codeup-dallas-on-6-28/',
 'https://codeup.com/tips-for-prospective-students/is-our-cloud-administration-program-right-for-you/',
 'https://codeup.com/workshops/pride-in-tech-
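
The list above contains some URLs more than once, so every repeated post would be requested and stored twice. One possible way to drop repeats while keeping the original order is `dict.fromkeys`. The sketch below uses placeholder `example.com` URLs rather than the real list, and `example_links` and `unique_links` are illustrative names.

```python
# Placeholder URLs standing in for the scraped links; two of them repeat.
example_links = [
    'https://example.com/a/',
    'https://example.com/b/',
    'https://example.com/a/',
    'https://example.com/c/',
    'https://example.com/b/',
]

# dict keys are unique and keep insertion order (Python 3.7+),
# so this removes duplicates without reordering the list.
unique_links = list(dict.fromkeys(example_links))

print(unique_links)
```

Applied to the notebook, the same call would be `links = list(dict.fromkeys(links))`.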

### Get title and content from article

In [10]:
url = links[0]
response = get(url, headers=headers)
soup = BeautifulSoup(response.text, 'html.parser')

In [11]:
soup.find('h1', class_='entry-title').text

'What Jobs Can You Get After a Coding Bootcamp? Part 2: Cloud Administration'

In [12]:
soup.find('div', class_='entry-content').text.strip()

'Have you been considering a career in Cloud Administration, but have no idea what your job title or potential salary could be? Continue reading below to find out!\nIn this mini-series, we will take each of our programs here at Codeup: Data Science, Web Development, and Cloud Administration, and outline respectively potential job titles, as well as entry-level salaries.*\xa0Let’s discuss Cloud Administration.\nProgram Overview\nAt Codeup, we offer a 15-week Cloud Administration program, which was derived from our previous two programs: Systems Engineering and Cyber Cloud. We combined the best of both and blended hands-on practical knowledge with skilled instructors to create the Cloud Administration program.\nUpon completing this program, you’ll have the opportunity to take on two exams for certifications: Amazon Web Services (AWS) Cloud Practitioner and AWS Solutions Architect Associate.\xa0\nPotential Jobs\nAccording to A Cloud Guru, with an AWS Certification you’ll be equipped with 

### Put it together

In [13]:
url = 'https://codeup.com/blog/'
headers = {'User-Agent': 'Codeup Data Science'}
response = get(url, headers=headers)

soup = BeautifulSoup(response.content, 'html.parser')

links = [link['href'] for link in soup.select('h2 a[href]')]

articles = []

for url in links:
    
    url_response = get(url, headers=headers)
    soup = BeautifulSoup(url_response.text, 'html.parser')
    
    title = soup.find('h1', class_='entry-title').text
    content = soup.find('div', class_='entry-content').text.strip()
    
    article_dict = {
        'title': title,
        'content': content
    }
    
    articles.append(article_dict)

In [14]:
articles[0:5]

[{'title': 'What Jobs Can You Get After a Coding Bootcamp? Part 2: Cloud Administration',
  'content': 'Have you been considering a career in Cloud Administration, but have no idea what your job title or potential salary could be? Continue reading below to find out!\nIn this mini-series, we will take each of our programs here at Codeup: Data Science, Web Development, and Cloud Administration, and outline respectively potential job titles, as well as entry-level salaries.*\xa0Let’s discuss Cloud Administration.\nProgram Overview\nAt Codeup, we offer a 15-week Cloud Administration program, which was derived from our previous two programs: Systems Engineering and Cyber Cloud. We combined the best of both and blended hands-on practical knowledge with skilled instructors to create the Cloud Administration program.\nUpon completing this program, you’ll have the opportunity to take on two exams for certifications: Amazon Web Services (AWS) Cloud Practitioner and AWS Solutions Architect Associ
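
The exercise asks for this work to be wrapped in a function named `get_blog_articles`. A minimal sketch follows, reusing the selectors from the cells above. The exercise does not specify a signature, so taking a list of `urls` is an assumption, as are the helper `parse_blog_article`, the `HEADERS` constant, and the choice to skip pages where either element is missing.

```python
from bs4 import BeautifulSoup
from requests import get

HEADERS = {'User-Agent': 'Codeup Data Science'}


def parse_blog_article(html):
    """Return a title/content dict for one article page, or None if either part is missing."""
    soup = BeautifulSoup(html, 'html.parser')
    title = soup.find('h1', class_='entry-title')
    content = soup.find('div', class_='entry-content')
    if title is None or content is None:
        return None
    return {'title': title.text.strip(), 'content': content.text.strip()}


def get_blog_articles(urls):
    """Fetch each URL and return a list of article dictionaries."""
    articles = []
    for url in urls:
        response = get(url, headers=HEADERS)
        article = parse_blog_article(response.text)
        # Skip pages that do not have the expected title and content elements.
        if article is not None:
            articles.append(article)
    return articles
```

With the `links` list built earlier, a call such as `get_blog_articles(links[:5])` would cover the five posts the exercise requires. Keeping the parsing separate from the request makes it possible to check the parser against a saved HTML string without touching the network.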

### Put the articles in a DataFrame

In [15]:
blog_article_df = pd.DataFrame(articles)
blog_article_df

| | title | content |
|---|---|---|
| 0 | What Jobs Can You Get After a Coding Bootcamp?... | Have you been considering a career in Cloud Ad... |
| 1 | What Jobs Can You Get After a Coding Bootcamp?... | If you are interested in embarking on a career... |
| 2 | Is Our Cloud Administration Program Right for ... | Changing careers can be scary. The first thing... |
| 3 | 5 Reasons To Attend Our New Cloud Administrati... | Come Work In The Cloud\nWhen your Monday rolls... |
| 4 | What Jobs Can You Get After a Coding Bootcamp?... | Have you been considering a career in Cloud Ad... |
| 5 | What Jobs Can You Get After a Coding Bootcamp?... | If you are interested in embarking on a career... |
| 6 | In-Person Workshop: Learn to Code – JavaScript... | Join us for our live in-person JavaScript cras... |
| 7 | In-Person Workshop: Learn to Code – Python on ... | According to LinkedIn, the “#1 Most Promising ... |
| 8 | Free JavaScript Workshop at Codeup Dallas on 6/28 | Event Info: \nLocation – Codeup Dallas\nTime –... |
| 9 | Is Our Cloud Administration Program Right for ... | Changing careers can be scary. The first thing... |


In [16]:
blog_article_df.to_csv('blog_articles.csv', index=False)
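
Re-running the scrape on every notebook restart repeats all of the requests. One common pattern is to cache the result on disk and only scrape when no cached file exists. The sketch below is one possible approach, not part of the original exercise: the helper name `load_or_scrape`, its parameters, and the choice of JSON instead of CSV are all assumptions.

```python
import json
import os


def load_or_scrape(path, scrape):
    """Return cached articles from `path` if it exists; otherwise call
    `scrape()`, save its result as JSON, and return it."""
    if os.path.exists(path):
        with open(path, encoding='utf-8') as f:
            return json.load(f)

    articles = scrape()
    with open(path, 'w', encoding='utf-8') as f:
        # ensure_ascii=False keeps curly quotes and dashes readable in the file.
        json.dump(articles, f, ensure_ascii=False, indent=2)
    return articles
```

Here `scrape` is any zero-argument callable that returns a list of dictionaries, so the same helper could wrap either the blog scraper or the news scraper.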

## News Articles

We will now be scraping text data from inshorts, a website that provides a brief overview of many different topics.

Write a function that scrapes the news articles for the following topics:

* Business
* Sports
* Technology
* Entertainment

The end product of this should be a function named get_news_articles that returns a list of dictionaries, where each dictionary has this shape:

```python
{
    'title': 'The article title',
    'content': 'The article content',
    'category': 'business' # for example
}
```

In [17]:
url = 'https://inshorts.com/en/read'
response = get(url)
soup = BeautifulSoup(response.content, 'html.parser')

### Get title

In [18]:
soup.find_all('span', itemprop='headline')[0].text

"India lose an ODI under Rohit's captaincy for 1st time in 2022, England level series"

### Get content

In [19]:
soup.find_all('div', itemprop='articleBody')[0].text

"England defeated India by 100 runs in the second ODI at Lord's to level the three-match series 1-1. This is the first time in 2022 that India have lost an ODI match under the captaincy of Rohit Sharma. Bowling first, India bowled England out for 246 runs in 49 overs. India were bowled out for 146 in 38.5 overs."

### Get categories

In [20]:
categories = [li.text.lower() for li in soup.select('li')][1:]
categories[0] = 'national'
categories

['national',
 'business',
 'sports',
 'world',
 'politics',
 'technology',
 'startup',
 'entertainment',
 'miscellaneous',
 'hatke',
 'science',
 'automobile']
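
The exercise only requires four of these categories. A small sketch of building the page URL for just those topics is shown below, following the same `base + '/' + category` pattern this notebook uses when it loops over the categories. The names `BASE_URL`, `requested_topics`, and `topic_urls` are illustrative.

```python
BASE_URL = 'https://inshorts.com/en/read'

# The four topics named in the exercise.
requested_topics = ['business', 'sports', 'technology', 'entertainment']

# Map each topic to the URL of its category page.
topic_urls = {topic: f'{BASE_URL}/{topic}' for topic in requested_topics}

for topic, topic_url in topic_urls.items():
    print(topic, topic_url)
```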

### Put it together

In [21]:
categories = [li.text.lower() for li in soup.select('li')][1:]
categories[0] = 'national'

inshorts = []

for category in categories:
    
    url = 'https://inshorts.com/en/read' + '/' + category
    response = get(url)
    soup = BeautifulSoup(response.content, 'html.parser')
    
    titles = [span.text for span in soup.find_all('span', itemprop='headline')]
    contents = [div.text for div in soup.find_all('div', itemprop='articleBody')]
    
    # Pair each headline with its body; zip stops at the shorter of the two lists
    for title, content in zip(titles, contents):
        
        article = {
            'title': title,
            'content': content,
            'category': category,
        }
        
        inshorts.append(article)

In [22]:
inshorts[0:5]

[{'title': "BJP running 'Operation Kamal' in Prez polls, bribing people: Sinha",
  'content': 'Ahead of Presidential elections, opposition parties’ Presidential candidate Yashwant Sinha has accused the BJP of carrying out \'Operation Kamal\' by engaging in horse-trading. Sinha, who has sought a probe into the matter, alleged that the party is offering a "huge sum of money to non-BJP MLAs" to back NDA\'s candidate in polls. Presidential elections will be held on July 18.',
  'category': 'national'},
 {'title': "Guru Granth Sahib insulted at Punjab CM Mann's marriage ceremony: SGPC",
  'content': 'The Shiromani Gurdwara Parbandhak Committee (SGPC) has claimed that the vehicle carrying holy book Guru Granth Sahib was stopped for inspection at Punjab CM Bhagwant Mann\'s wedding ceremony. The committee called the act an "insult to the honour of the Guru". "It was even more hurtful that this incident took place at [CM\'s] residence," SGPC chief Harjinder Singh Dhami said.',
  'category': 'na
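
The exercise asks for a function named `get_news_articles`. A minimal sketch is given below, using the same `itemprop` lookups as the cells above. The default of the four required categories, the helper `parse_news_page`, and the `BASE_URL` constant are assumptions made for illustration.

```python
from bs4 import BeautifulSoup
from requests import get

BASE_URL = 'https://inshorts.com/en/read'


def parse_news_page(html, category):
    """Return one dictionary per article found in a category page's HTML."""
    soup = BeautifulSoup(html, 'html.parser')
    titles = [span.text for span in soup.find_all('span', itemprop='headline')]
    contents = [div.text for div in soup.find_all('div', itemprop='articleBody')]
    # zip pairs the nth headline with the nth body and stops at the shorter list.
    return [
        {'title': title, 'content': content, 'category': category}
        for title, content in zip(titles, contents)
    ]


def get_news_articles(categories=('business', 'sports', 'technology', 'entertainment')):
    """Fetch each category page and return all of its articles in one list."""
    articles = []
    for category in categories:
        response = get(f'{BASE_URL}/{category}')
        articles.extend(parse_news_page(response.content, category))
    return articles
```

Passing the full `categories` list built earlier would reproduce the all-category scrape shown in this notebook.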

In [23]:
inshorts_article_df = pd.DataFrame(inshorts)
inshorts_article_df

| | title | content | category |
|---|---|---|---|
| 0 | BJP running 'Operation Kamal' in Prez polls, b... | Ahead of Presidential elections, opposition pa... | national |
| 1 | Guru Granth Sahib insulted at Punjab CM Mann's... | The Shiromani Gurdwara Parbandhak Committee (S... | national |
| 2 | IT Department raids Cong MLA Sanjay Sharma's b... | The Income Tax Department is conducting search... | national |
| 3 | Odisha BJP MLAs protest against Cong leader's ... | Odisha BJP MLAs held a protest inside the stat... | national |
| 4 | Murmu will get more votes than expected in Pre... | Goa CM Pramod Sawant on Thursday said NDA's Pr... | national |
| ... | ... | ... | ... |
| 292 | Mahindra may consider investing in an EV batte... | Mahindra and Mahindra may consider investing i... | automobile |
| 293 | Volkswagen, partners to invest $20 billion to ... | German automobile manufacturer Volkswagen said... | automobile |
| 294 | Fully agree that aim should be to reduce fire ... | Ola CEO Bhavish Aggarwal in response to Autoca... | automobile |
| 295 | Hero MotoCorp gets permit to use 'Hero' tradem... | A Delhi High Court-appointed arbitration panel... | automobile |


In [24]:
inshorts_article_df.to_csv('news_articles.csv', index=False)