# Web Scraping for Space Debris Information 

### Search Engines 

We are going to scrape several search engines for the latest space debris information, including:

- Google 
- Bing
- Yahoo!

For each search engine, we first need to work out how to build a query for the desired keyword, time frame, source type, and so on.
 _____________________________________________________________________________________________________________________________________________________
#### [Google](https://google.com/)
A Google search URL starts with the path "search?". After it we append the URL parameters we need.

URL parameters 

1. q : parameter where you enter the keyword to search
2. tbm : choose to search for videos, news, etc.
    - nws : news 
    - vid : video 
    - isch : image 
3. as_qdr : select a specific time frame such as 
    - d : past 24 hours 
    - w : past week
    - m : past month
    - y : past year 
4. start : offset of the first result, which selects the page (multiples of 10: 0, 10, 20, 30, ...)
5. sourceid : this will be set to "chrome"
6. ie : this selects the encoding which will be set to "UTF-8"

For example, the URL to request becomes 

https://google.com/search?q=space+debris&tbm=nws&as_qdr=w&start=0&sourceid=chrome&ie=UTF-8
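
The parameter list above can be turned into a URL with the standard library's `urllib.parse.urlencode`. The following is a minimal sketch; the function name `build_google_url` and its default values are assumptions chosen for illustration, and Google may change or ignore any of these parameters.

```python
from urllib.parse import urlencode


def build_google_url(keyword, tbm="nws", as_qdr="w", start=0):
    """Assemble a Google search URL from the parameters listed above.

    build_google_url is a hypothetical helper, not part of any library.
    """
    params = {
        "q": keyword,
        "tbm": tbm,
        "as_qdr": as_qdr,
        "start": start,
        "sourceid": "chrome",
        "ie": "UTF-8",
    }
    # Drop parameters left empty so they do not show up as "key=" in the URL.
    params = {k: v for k, v in params.items() if v not in ("", None)}
    # urlencode escapes each value and turns spaces into "+".
    return "https://google.com/search?" + urlencode(params)


print(build_google_url("space debris"))
```

With the defaults, this prints the same URL as the example above. Passing an empty string for `tbm` or `as_qdr` omits that filter.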

 _____________________________________________________________________________________________________________________________________________________
#### [Bing](https://bing.com)

URL parameters
1. q : parameter where you enter the keyword to search
2. news : a path segment placed right after ".com/" that restricts the results to news sources 
3. qft : the time frame 
    - interval%3d%224%22 : past hour 
    - interval%3d%227%22 : past 24 hours 
    - interval%3d%228%22 : past 7 days 
    - interval%3d%229%22 : past 30 days 

So an example would be 

https://bing.com/news/search?q=space+debris&qft=interval%3d%228%22 
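
The `qft` values listed above are the percent-encoded form of `interval="N"`. A small sketch of building the Bing news URL follows; the helper name `build_bing_news_url` and the labels in `BING_INTERVALS` are assumptions made for this example, while the numeric codes come from the list above. `urlencode` emits uppercase hex digits (`%3D` rather than `%3d`), which is equivalent under URL percent-encoding rules.

```python
from urllib.parse import urlencode

# Interval codes from the list above; the keys are hypothetical labels.
BING_INTERVALS = {"hour": 4, "day": 7, "week": 8, "month": 9}


def build_bing_news_url(keyword, period="week"):
    """Build a Bing news search URL restricted to the given time frame."""
    # The raw filter value is interval="N"; urlencode percent-encodes it.
    qft = 'interval="{}"'.format(BING_INTERVALS[period])
    return "https://bing.com/news/search?" + urlencode({"q": keyword, "qft": qft})


print(build_bing_news_url("space debris", "week"))
```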

 _____________________________________________________________________________________________________________________________________________________
#### [Yahoo!](https://www.yahoo.com/)

Yahoo! does not return many news sources, but the following are example URLs.

All results 

https://search.yahoo.com/search?p=space+debris 

News sources 

https://news.search.yahoo.com/search?p=space+debris

 _____________________________________________________________________________________________________________________________________________________
### space.com

Additionally, we are going to scrape one dedicated space news source, [space.com](https://space.com).

URL parameters
1. /news : to fetch news articles 
2. /# : where # is the page number of the article listing we want 

An example URL would be 

https://space.com/news/3
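
Since the page number is just the last path segment, the listing pages can be enumerated with a short loop. This is a sketch under the assumption that pages are numbered consecutively from 1; the function name `space_news_urls` is hypothetical.

```python
def space_news_urls(first_page, last_page):
    """Return the space.com news listing URLs for an inclusive page range."""
    base = "https://space.com/news"
    return [f"{base}/{page}" for page in range(first_page, last_page + 1)]


for page_url in space_news_urls(1, 3):
    print(page_url)
```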

## Web Scrape Examples 
### Google

In [35]:
# For Google Search engine 

# Import necessary modules
import requests 
from bs4 import BeautifulSoup as bs

In [117]:
# Scraping Google for space debris information 

params = ['q', 'tbm', 'as_qdr', 'start', 'sourceid', 'ie']
param_dict = {}
for p in params:
    dialog = "Enter the searching parameter for " + p + " -> "
    param_dict[p] = input(dialog)
    

Enter the searching parameter for q ->  space debris
Enter the searching parameter for tbm ->  nws
Enter the searching parameter for as_qdr ->  d
Enter the searching parameter for start ->  0
Enter the searching parameter for sourceid ->  chrome
Enter the searching parameter for ie ->  utf-8


In [118]:
param_dict

{'q': 'space debris',
 'tbm': 'nws',
 'as_qdr': 'd',
 'start': '0',
 'sourceid': 'chrome',
 'ie': 'utf-8'}

In [119]:
# Replace space with + sign 
param_dict['q'] = param_dict['q'].replace(' ', '+')
param_dict

{'q': 'space+debris',
 'tbm': 'nws',
 'as_qdr': 'd',
 'start': '0',
 'sourceid': 'chrome',
 'ie': 'utf-8'}

In [120]:
# Create url link with the dictionary information 
# Empty parameters are skipped; joining with '&' avoids missing or trailing separators
query = '&'.join(k + '=' + v for k, v in param_dict.items() if v)
url = 'https://google.com/search?' + query

In [127]:
url

'https://google.com/search?q=space+debris&tbm=nws&as_qdr=d&start=0&sourceid=chrome&ie=utf-8'

In [135]:
# urllib
# import urllib.request 
# HEADERS={"User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.75 Safari/537.36",
#          "X-Requested-With": "XMLHttpRequest"}
# req = urllib.request.Request(url, headers=HEADERS)
# page = urllib.request.urlopen(req)

# Requests
r = requests.get(url)
r.encoding = 'utf-8'
soup1 = bs(r.text, 'html.parser')
r.close()
# soup2 = soup1.prettify()

# soup = bs(page, "html.parser")  # only valid with the urllib variant above, where `page` is defined

In [136]:
# Select the result containers by CSS class (attrs must be a dict, not a set)
divs = soup1.find_all('div', {"class": "kCrYT"})
links = []
for div in divs:
    atag = div.a
    if atag:
        links.append(atag['href'])

In [137]:
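for idx, link in enumerate(links):
    # Remove the redirect prefix; str.lstrip strips a set of characters, not a prefix
    prefix = "/url?q="
    if link.startswith(prefix):
        link = link[len(prefix):]
    # Keep only the target url and drop the tracking parameters after "&sa="
    links[idx] = link.split("&sa=", 1)[0]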
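
As an alternative to string manipulation, the redirect links can be parsed with `urllib.parse`, which reads the `q` query parameter directly. The sketch below is one possible approach, not the notebook's method: `clean_google_links` is a hypothetical helper, and the sample hrefs (including the `sa` and `ved` values) are made up for illustration. It also removes repeated links while preserving order. Note that `parse_qs` decodes percent-escapes, so `%3F` comes back as `?`.

```python
from urllib.parse import parse_qs, urlsplit


def clean_google_links(hrefs):
    """Extract target URLs from Google redirect hrefs, dropping duplicates."""
    cleaned = []
    seen = set()
    for href in hrefs:
        parts = urlsplit(href)
        if parts.path != "/url":
            continue  # not a redirect link, e.g. an internal /search link
        targets = parse_qs(parts.query).get("q")
        if not targets:
            continue  # redirect link without a q parameter
        target = targets[0]
        if target not in seen:
            seen.add(target)
            cleaned.append(target)
    return cleaned


# Made-up sample hrefs in the shape Google returns for result links.
sample = [
    "/url?q=https://www.ign.com/articles/giant-claw-heading-into-orbit-space-junk-clean-up&sa=U&ved=abc",
    "/url?q=https://www.ign.com/articles/giant-claw-heading-into-orbit-space-junk-clean-up&sa=U&ved=def",
    "/search?q=space+debris",
]
print(clean_google_links(sample))
```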

In [138]:
links

['http://spaceref.com/news/viewsr.html%3Fpid%3D54396',
 'http://spaceref.com/news/viewsr.html%3Fpid%3D54397',
 'https://www.ign.com/articles/giant-claw-heading-into-orbit-space-junk-clean-up',
 'https://www.ign.com/articles/giant-claw-heading-into-orbit-space-junk-clean-up',
 'https://eurasiantimes.com/japan-to-reduce-space-junk-with-the-launch-of-worlds-first-wooden-satellite-by-2023/',
 'https://eurasiantimes.com/japan-to-reduce-space-junk-with-the-launch-of-worlds-first-wooden-satellite-by-2023/',
 'https://techxplore.com/news/2020-12-fukushima-nuclear-debris-virus.html',
 'https://techxplore.com/news/2020-12-fukushima-nuclear-debris-virus.html',
 'https://www.ign.com/articles/spooky-circles-in-space-are-puzzling-astronomers',
 'https://www.ign.com/articles/spooky-circles-in-space-are-puzzling-astronomers',
 'https://www.theday.com/real-estate/20201225/keep-your-floors-clean-this-winter',
 'https://www.spacedaily.com/reports/Space_Electric_Thruster_System_SETS_to_Demonstrate_In_Spac