# Web Scraping for Space Debris Information 

### Search Engines 

We are going to scrape several search engines for the latest space debris information, including:

- Google 
- Bing
- Yahoo!

For each search engine, we first need to work out how its URL encodes the search keyword, time frame, source type, and so on.
 _____________________________________________________________________________________________________________________________________________________
#### [Google](https://google.com/)
The path of a Google search URL is "search?", followed by the URL parameters we need.

URL parameters 

1. q : parameter where you enter the keyword to search
2. tbm : choose to search for videos, news, etc.
    - nws : news 
    - vid : video 
    - isch : image 
3. as_qdr : select a specific time frame such as 
    - d : past 24 hours 
    - w : past week
    - m : past month
    - y : past year 
4. start : the offset of the first result, which selects the page (should be a multiple of 10: 0, 10, 20, 30, ...)
5. sourceid : this will be set to "chrome"
6. ie : this selects the encoding which will be set to "UTF-8"

For example, the URL to request becomes 

https://google.com/search?q=space+debris&tbm=nws&as_qdr=w&start=0&sourceid=chrome&ie=UTF-8
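
The parameter list above can be turned into a URL programmatically. The following is a minimal sketch using the standard library's `urllib.parse.urlencode`. The function name `build_google_url` and its default values are assumptions chosen for illustration, not part of the notebook's code below.

```python
from urllib.parse import urlencode

def build_google_url(keyword, tbm="nws", as_qdr="w", start=0):
    """Assemble a Google search URL from the parameters listed above."""
    params = {
        "q": keyword,          # urlencode turns spaces into '+'
        "tbm": tbm,            # nws, vid or isch
        "as_qdr": as_qdr,      # d, w, m or y
        "start": start,        # result offset: 0, 10, 20, ...
        "sourceid": "chrome",
        "ie": "UTF-8",
    }
    # Drop parameters left empty so they do not appear in the query string
    params = {k: v for k, v in params.items() if v not in ("", None)}
    return "https://google.com/search?" + urlencode(params)

google_url = build_google_url("space debris")
print(google_url)
```

With the defaults shown, this prints the same URL as the example above. Passing `tbm=""` omits that parameter instead of leaving an empty `tbm=` in the URL.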

 _____________________________________________________________________________________________________________________________________________________
#### [Bing](https://bing.com)

URL parameters
1. q : parameter where you enter the keyword to search
2. news : not a query parameter but a path segment; insert "news/" after ".com/" to return only news sources 
3. qft : the time frame 
    - interval%3d%224%22 : past hour 
    - interval%3d%227%22 : past 24 hours 
    - interval%3d%228%22 : past 7 days 
    - interval%3d%229%22 : past 30 days 

So an example would be 

https://bing.com/news/search?q=space+debris&qft=interval%3d%228%22 
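
As a rough sketch, the same URL can be assembled in code. The helper name `build_bing_url` and the keys of `BING_INTERVALS` are hypothetical; only the interval codes come from the list above. Note that `urlencode` emits uppercase hex digits (`%3D` rather than `%3d`), which is equivalent because percent-encoding is case-insensitive.

```python
from urllib.parse import urlencode

# Interval codes taken from the list above; the keys are illustrative names
BING_INTERVALS = {"hour": "4", "day": "7", "week": "8", "month": "9"}

def build_bing_url(keyword, period=None, news=True):
    """Assemble a Bing search URL, optionally limited to news and a time frame."""
    base = "https://bing.com/news/search?" if news else "https://bing.com/search?"
    params = {"q": keyword}
    if period is not None:
        # The raw value is interval="8"; urlencode percent-encodes '=' and '"'
        params["qft"] = 'interval="{}"'.format(BING_INTERVALS[period])
    return base + urlencode(params)

bing_url = build_bing_url("space debris", period="week")
print(bing_url)
```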

 _____________________________________________________________________________________________________________________________________________________
#### [Yahoo!](https://www.yahoo.com/)

Yahoo! does not return many news sources, but the following are example URLs.

All results 

https://search.yahoo.com/search?p=space+debris 

News sources 

https://news.search.yahoo.com/search?p=space+debris

 _____________________________________________________________________________________________________________________________________________________
### space.com

Additionally, we are going to scrape a dedicated space news site, [space.com](https://space.com).

URL path segments
1. /news : lists the news articles 
2. /# : where # is the page number of the article listing we want 

An example URL would be 

https://space.com/news/3
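
Because the page number is just the last path segment, a range of listing pages can be generated with a short loop. This is a small sketch; the function name `space_news_pages` and its arguments are assumptions for illustration.

```python
def space_news_pages(first=1, last=3):
    """Return the space.com news listing URLs for pages first..last (inclusive)."""
    return ["https://space.com/news/{}".format(n) for n in range(first, last + 1)]

pages = space_news_pages(1, 3)
for page_url in pages:
    print(page_url)
```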

## Web Scrape Examples 
### Google

In [None]:
# For Google Search engine 

# Import necessary modules
import requests 
from bs4 import BeautifulSoup as bs

In [None]:
# Scraping Google for space debris information 
# defining the parameters for the web-scraping 
params = ['q', 'tbm', 'as_qdr', 'start', 'sourceid', 'ie']
param_dict = {}
for p in params:
    dialog = "Enter the searching parameter for " + p + " -> "
    param_dict[p] = input(dialog)

In [None]:
# print out the dictionary storing the parameter information 
param_dict

In [None]:
# Replace space with + sign 
param_dict['q'] = param_dict['q'].replace(' ', '+')

# print out to check the results 
param_dict

In [None]:
# Create url link with the dictionary information 

# Create a base url to search 
url = 'https://google.com/search?'
# Keep only the parameters that were given a value and join them with '&',
# so that empty parameters never leave a stray or missing separator 
url += '&'.join(k + '=' + v for k, v in param_dict.items() if v)

# Check the url by printing it out 
url

In [None]:
# METHOD1
# urllib
# import urllib.request 
# HEADERS={"User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.75 Safari/537.36",
#          "X-Requested-With": "XMLHttpRequest"}
# req = urllib.request.Request(url, headers=HEADERS)
# page = urllib.request.urlopen(req)
#soup = bs(page,"html.parser")

# METHOD 2
# Requests
r = requests.get(url)
r.encoding = 'utf-8'
soup1 = bs(r.text, 'html.parser')
r.close()
# soup2 = soup1.prettify()

In [None]:
# Extract the links from the HTML retrieved by 
# beautiful soup
divs = soup1.find_all('div', {"class": "kCrYT"})
links = []
for div in divs:
    atag = div.a
    if atag:
        links.append(atag['href'])

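# Clean the urls: Google wraps each result as "/url?q=<target>&sa=...",
# so read the target from the "q" query parameter 
# (str.lstrip removes a set of characters, not a prefix, so it is not safe here)
from urllib.parse import urlparse, parse_qs
for idx, link in enumerate(links):
    target = parse_qs(urlparse(link).query).get('q')
    if target:
        links[idx] = target[0]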

# Print out the links 
links

In [None]:
# Use requests to open each link and scrape each of them 
link1 = links[0]
r = requests.get(link1)
r.encoding = 'utf-8'
soup = bs(r.text, 'html.parser')
r.close()
print(soup.prettify())

### Bing

In [38]:
# Scraping Bing for space debris information 
# defining the parameters for the web-scraping 
params = ['q', 'news', 'qft']
param_dict = {}
param_dict['q'] = input("What do you want to search in Bing? ")
param_dict['news'] = input("Search news sources only? Y[yes]/N[no] ")
print("What time frame would you like to search for? ")
print("(1) past hour")
print('(2) past 24 hours')
print('(3) past 7 days')
print('(4) past 30 days')
param_dict['qft'] = input("Choose a number -> ")    

# print out the dictionary storing the parameter information 
param_dict

What do you want to search in Bing?  space debris
Search news sources only? Y[yes]/N[no]  Y
What time frame would you like to search for? 
(1) past hour
(2) past 24 hours
(3) past 7 days
(4) past 30 days
Choose a number ->  4

{'q': 'space debris', 'news': 'Y', 'qft': '4'}

In [39]:
# Alter the base url depending on the parameters 
# (accept "Y", "y" or "yes", ignoring surrounding whitespace)
if param_dict['news'].strip().upper().startswith("Y"):
    url = 'https://bing.com/news/search?'
else:
    url = 'https://bing.com/search?'
    
# Remove the 'news' key 
del param_dict['news']

In [40]:
# Change the interval for the search depending on the parameters
intervals = {
    '1': 'interval%3d%224%22',
    '2': 'interval%3d%227%22',
    '3': 'interval%3d%228%22',
    '4': 'interval%3d%229%22'
}
param_dict['qft'] = intervals[param_dict['qft']]

In [41]:
# Replace space with + sign 
param_dict['q'] = param_dict['q'].replace(' ', '+')
# print out to check the results 
param_dict

{'q': 'space+debris', 'qft': 'interval%3d%229%22'}

In [42]:
# Create url link with the dictionary information 

# Keep only the parameters that were given a value and join them with '&',
# so that empty parameters never leave a stray or missing separator 
url += '&'.join(k + '=' + v for k, v in param_dict.items() if v)

# Check the url by printing it out 
url

'https://bing.com/news/search?q=space+debris&qft=interval%3d%229%22'

In [43]:
# METHOD1
# urllib
# import urllib.request 
# HEADERS={"User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.75 Safari/537.36",
#          "X-Requested-With": "XMLHttpRequest"}
# req = urllib.request.Request(url, headers=HEADERS)
# page = urllib.request.urlopen(req)
#soup = bs(page,"html.parser")

# METHOD 2
# Requests
r = requests.get(url)
r.encoding = 'utf-8'
soup1 = bs(r.text, 'html.parser')
r.close()

In [44]:
# Extract the links from the HTML retrieved by 
# beautiful soup
divs = soup1.find_all('div', {"class": "t_t"})
links = []
for div in divs:
    atag = div.a
    if atag:
        links.append(atag['href'])

In [45]:
links

['https://tabi-labo.com/298313/wt-europe-clear-space',
 'https://prtimes.jp/main/html/rd/p/000000008.000067481.html',
 'https://www.iza.ne.jp/kiji/pressrelease/news/201222/prl20122213410247-n1.html',
 'https://news.yahoo.co.jp/articles/4afb53ecca93714023a811807ff0006ab41ac6e8',
 'https://news.toremaga.com/release/others/1682045.html',
 'http://fabcross.jp/interview/20201217_astroscale.html',
 'https://www.alterna.co.jp/34219/4/',
 'https://www.minato-yamaguchi.co.jp/yama/e-yama/articles/19464',
 'https://realsound.jp/movie/2020/12/post-679625.html',
 'https://jp.techcrunch.com/2020/12/19/2020-12-18-orbital-refueling-and-manufacturing-go-from-theory-to-reality-in-2021/',
 'http://www.jwing.net/news/33641',
 'https://article.auone.jp/detail/1/2/2/101_2_r_20201217_1608162601724821']

In [None]:
# Use requests to open each link and scrape each of them 
link1 = links[0]
r = requests.get(link1)
r.encoding = 'utf-8'
soup = bs(r.text, 'html.parser')
r.close()
print(soup.prettify())