# Data Collection - Pokemon and News Sites

Go, Thea Ellen T. | Section S12

## News Site (The Verge)
Collect the news articles published on March 11-12, 2021 from a news site of your choice (The Verge is used here). For each article, record the following information:
1. date
2. title
3. full article
4. author

In [1]:
import json
import requests
import re
from config import mykey
from newsapi import NewsApiClient
from bs4 import BeautifulSoup

In [2]:
newsapi = NewsApiClient(api_key=mykey)

In [3]:
articles = newsapi.get_everything(sources='the-verge', from_param='2021-03-11', to='2021-03-12')['articles']

### Test on a single article

Inspect one article to see the structure of the data returned by the API.

In [4]:
articles[0]

{'source': {'id': 'the-verge', 'name': 'The Verge'},
 'author': 'Megan Farokhmanesh',
 'title': 'Why game developers can’t get a handle on doors',
 'description': 'Over the past week, dozens of developers across multiple disciplines and teams shared their frustrations on Twitter. Death Trash creator Stephan Hövelbrinks explained that doors “have all sorts of possible bugs.” The Last of Us Part II co-game director Kurt M…',
 'url': 'https://www.theverge.com/22328169/game-development-doors-design-difficult',
 'urlToImage': 'https://cdn.vox-cdn.com/thumbor/4uB363X75nkChNEPVRI5R6BXWKg=/0x38:1920x1043/fit-in/1200x630/cdn.vox-cdn.com/uploads/chorus_asset/file/22366570/Nightmare8.jpg',
 'publishedAt': '2021-03-12T22:17:29Z',
 'content': 'The best kind of door in a video game is the one no one remembers. Sure, everyone can appreciate a big, beautiful door with great animations, says Owlchemy Labs developer Pete Galbraith. But in a vid… [+5812 chars]'}

Identify the fields to be used

In [5]:
print(articles[1]['title'])
print(articles[1]['author'])
print(articles[1]['publishedAt'])
print(articles[1]['url'])
print(articles[1]['content'])

Some Bethesda games now on Xbox Game Pass are getting frame rate boosts on Xbox Series X/S
Nick Statt
2021-03-12T21:52:25Z
https://www.theverge.com/2021/3/12/22328089/microsoft-xbox-series-x-bethesda-frame-rate-boost-game-pass
Dishonored: Definitive Edition, Fallout games, and Prey boosted to 60 fps
Illustration by Alex Castro / The Verge
Microsofts Xbox Game Pass platform just got a huge boost with the addition of 20 ne… [+1489 chars]
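
The `publishedAt` value printed above is a UTC timestamp string. If only the calendar date is needed for the `date` field, it could be parsed as in the minimal sketch below. The helper name `published_date` is hypothetical and not part of the notebook, and the sketch assumes every timestamp ends with a literal `Z`.

```python
from datetime import datetime, timezone

def published_date(published_at):
    """Parse a timestamp like '2021-03-12T21:52:25Z' into an aware UTC datetime."""
    # Hypothetical helper; assumes the string always ends with a literal 'Z'
    parsed = datetime.strptime(published_at, '%Y-%m-%dT%H:%M:%SZ')
    return parsed.replace(tzinfo=timezone.utc)

# Sample value taken from the output above
stamp = published_date('2021-03-12T21:52:25Z')
print(stamp.date().isoformat())
```

Keeping the full timestamp in the saved record and deriving the date later is also a reasonable choice, since the time of day cannot be recovered once it is dropped.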


Fetch the article page and parse its HTML

In [6]:
soup = BeautifulSoup(requests.get(articles[0]['url']).content, 'html.parser')

Get the text from the article body, then collapse repeated blank lines and repeated spaces

In [7]:
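text = soup.find_all('div', {'class': 'c-entry-content'})
# Trim leading/trailing newlines, then collapse runs of blank lines into one
a = re.sub(r'\n\s*\n', '\n\n', text[0].getText().strip('\n'))
# Collapse runs of spaces into a single space
a = re.sub(r' +', ' ', a)
print(a)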

The best kind of door in a video game is the one no one remembers. Sure, everyone can appreciate a big, beautiful door with great animations, says Owlchemy Labs developer Pete Galbraith. But in a video game, doors are often synonymous with a massive design headache. Forgettable means a developer has done their job well. “If it fits into the environment, makes sense for its context, and works exactly how the player expects, then in that instant it was simply a door as real as any other in the player’s real life,” says Galbraith. “I can’t imagine higher praise for a door in a game.”
Over the past week, dozens of developers across multiple disciplines and teams shared their frustrations on Twitter. Death Trash creator Stephan Hövelbrinks explained that doors “have all sorts of possible bugs.” The Last of Us Part II co-game director Kurt Margenau called it “the thing that took the longest to get right.” How doors work is different during “combat tension,” when players are mid-encounter, vs
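
The cleanup step can be tried without any network access. The sketch below wraps the same two substitutions in a function and runs them on a made-up string; the name `clean_text` and the sample text are assumptions for illustration, not part of the notebook.

```python
import re

def clean_text(raw):
    """Trim outer newlines, collapse blank-line runs, and collapse space runs."""
    trimmed = raw.strip('\n')
    # Newlines separated only by whitespace become a single blank line
    collapsed = re.sub(r'\n\s*\n', '\n\n', trimmed)
    # Any run of spaces becomes a single space
    return re.sub(r' +', ' ', collapsed)

# Made-up sample with extra newlines and doubled spaces
sample = '\n\nFirst  paragraph.\n\n\n\nSecond   paragraph.\n'
print(repr(clean_text(sample)))
```

Putting the cleanup in one function means the single-article test and the full loop are guaranteed to apply the same rules.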

### Iterate through the data

In [8]:
listofarticles = []

# Fetch each article page and pair the API metadata with the cleaned full text
for info in articles:
    soup = BeautifulSoup(requests.get(info['url']).content, 'html.parser')
    text = soup.find_all('div', {'class': 'c-entry-content'})
    # Trim outer newlines, then collapse blank-line runs and space runs
    temp = text[0].getText().strip('\n')
    temp = re.sub(r'\n\s*\n', '\n\n', temp)
    temp = re.sub(r' +', ' ', temp)
    data = {
        "title": info['title'],
        "author": info['author'],
        "date": info['publishedAt'],
        "url": info['url'],
        "preview": info['content'],
        "fulltext": temp
    }
    listofarticles.append(data)

listofarticles

[{'title': 'Why game developers can’t get a handle on doors',
  'author': 'Megan Farokhmanesh',
  'date': '2021-03-12T22:17:29Z',
  'url': 'https://www.theverge.com/22328169/game-development-doors-design-difficult',
  'preview': 'The best kind of door in a video game is the one no one remembers. Sure, everyone can appreciate a big, beautiful door with great animations, says Owlchemy Labs developer Pete Galbraith. But in a vid… [+5812 chars]',
  'fulltext': 'The best kind of door in a video game is the one no one remembers. Sure, everyone can appreciate a big, beautiful door with great animations, says Owlchemy Labs developer Pete Galbraith. But in a video game, doors are often synonymous with a massive design headache. Forgettable means a developer has done their job well. “If it fits into the environment, makes sense for its context, and works exactly how the player expects, then in that instant it was simply a door as real as any other in the player’s real life,” says Galbraith. “I c

### Save to JSON file

In [9]:
with open('news.json', 'w') as json_file:
    json.dump(listofarticles, json_file)
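
To check that the saved file is readable, it can be loaded back and compared with the original list. The sketch below does this with a trimmed-down record and a temporary file (the file name `news_check.json` is an assumption), so it does not touch `news.json`. Passing `ensure_ascii=False` together with `encoding='utf-8'` is an optional variation that keeps characters such as `’` readable in the file instead of writing them as `\u` escapes.

```python
import json
import os
import tempfile

# Trimmed-down record for illustration; the notebook saves listofarticles instead
records = [{'title': 'Why game developers can’t get a handle on doors',
            'author': 'Megan Farokhmanesh'}]

path = os.path.join(tempfile.mkdtemp(), 'news_check.json')

# ensure_ascii=False writes non-ASCII characters as-is, so set the encoding explicitly
with open(path, 'w', encoding='utf-8') as json_file:
    json.dump(records, json_file, ensure_ascii=False, indent=2)

# Load the file back and confirm nothing changed in the round trip
with open(path, encoding='utf-8') as json_file:
    loaded = json.load(json_file)

print(loaded == records)
```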