# Seminar - JSON, XML, Requests, Web-Scraping 
by Vítek Macháček

## Task 1: JSON API and Golemio

* For security reasons, the access credentials are not distributed via GitHub.
* The notebook expects a file `secret.py` in this directory, but `.gitignore` prevents it from being tracked and distributed.
* The file `secret-example.py` shows the expected format of `secret.py`.
* To make Task 1 work, create your own `secret.py` and put in it the `GOLEMIO_API_KEY`, which can be generated [here](https://api.golemio.cz/api-keys/auth/sign-in) (after registration).
* This is normally done using environment variables, but for simplicity we skip this.
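
For reference, a minimal sketch of the environment-variable approach, using only the standard library. The helper name `load_api_key` is made up for this illustration; the notebook itself keeps using `secret.py`.

```python
import os


def load_api_key(variable_name="GOLEMIO_API_KEY"):
    """Return the API key stored in an environment variable, or raise a helpful error."""
    api_key = os.environ.get(variable_name)
    if not api_key:
        raise RuntimeError(f"Set the {variable_name} environment variable first.")
    return api_key
```

The variable has to be set before Jupyter starts (for example with `export GOLEMIO_API_KEY=...` in the shell), so the key never appears in the notebook or in the repository.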

In [None]:
import requests
import json
import pandas as pd
from bs4 import BeautifulSoup
import time
from secret import GOLEMIO_API_KEY

### 1a. Request cyclocounter detections from Golemio

* Golemio API documentation: https://golemioapi.docs.apiary.io/#introduction/description
* Get an API key here: https://api.golemio.cz/api-keys/ and store it in the `GOLEMIO_API_KEY` variable
* We will analyze the location `V zámcích` (Troja, behind the ZOO). Use `camea-BC_ZA-BO` as the ID
* Limit the search to February 2021
* Send a GET request with `{'X-Access-Token': GOLEMIO_API_KEY}` as the header

Hint: https://api.golemio.cz/v2/bicyclecounters/detections?from={START_DATE}&to={END_DATE}&id={DIRECTION_ID}
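
One possible way to assemble this request is sketched below. It only prepares the request so that the final URL can be inspected without contacting the API. The function name `build_detections_request`, the placeholder key and the plain `YYYY-MM-DD` date format are assumptions; check the Golemio documentation for the exact format expected by `from` and `to`.

```python
import requests

BASE_URL = "https://api.golemio.cz/v2/bicyclecounters/detections"


def build_detections_request(api_key, start, end, direction_id):
    """Prepare (but do not send) a GET request for one counter direction."""
    request = requests.Request(
        "GET",
        BASE_URL,
        params={"from": start, "to": end, "id": direction_id},
        headers={"X-Access-Token": api_key},
    )
    return request.prepare()


prepared = build_detections_request(
    api_key="YOUR_API_KEY",   # placeholder, use GOLEMIO_API_KEY in the notebook
    start="2021-02-01",       # assumed date format
    end="2021-03-01",         # assumed date format
    direction_id="camea-BC_ZA-BO",
)
print(prepared.url)

# Sending the prepared request would look like this:
# with requests.Session() as session:
#     r = session.send(prepared)
```

The shorter `requests.get(BASE_URL, params=..., headers=...)` builds the same URL and sends it in one step.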

### 1b. Convert the response to a Python-friendly format

Ideally `list` or `dict`
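
A small sketch of the conversion, run on a made-up JSON string rather than a real API response (the field names are placeholders, not the Golemio schema). A JSON array becomes a `list` and a JSON object becomes a `dict`.

```python
import json

# Made-up response body; the real response will have different fields.
response_text = '[{"id": "camea-BC_ZA-BO", "value": 12}, {"id": "camea-BC_ZA-BO", "value": 7}]'

records = json.loads(response_text)  # JSON array -> list, JSON object -> dict
print(type(records), type(records[0]))
print(records[0]["value"])
```

With a `requests` response object `r`, calling `r.json()` performs the same parsing on the response body.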

### 1c. Explore response structure

* Is the JSON structure hierarchical?
* Where is the data located?

### 1d. Let's do a DataFrame
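
As a rough illustration, a list of flat dictionaries can be passed straight to `pd.DataFrame`. The records and the column names `measured_from` and `value` below are invented for the example; adapt them to whatever the response actually contains.

```python
import pandas as pd

# Invented records standing in for the parsed API response.
records = [
    {"id": "camea-BC_ZA-BO", "measured_from": "2021-02-01T00:00:00", "value": 3},
    {"id": "camea-BC_ZA-BO", "measured_from": "2021-02-01T00:05:00", "value": 5},
    {"id": "camea-BC_ZA-BO", "measured_from": "2021-02-02T00:00:00", "value": 2},
]

df = pd.DataFrame(records)
df["measured_from"] = pd.to_datetime(df["measured_from"])  # strings -> timestamps

# Daily totals, once the timestamp column is the index.
daily = df.set_index("measured_from")["value"].resample("D").sum()
print(daily)
```

If the records turn out to be nested, `pd.json_normalize` may be a better starting point than the plain constructor.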

### 1e. Suggest a function structure for the downloading script and implement it

## Task 2: Get a list of Charles University faculties and links to their websites

https://cuni.cz/UKEN-108.html

Is this website scrapable? 

Convert the response text into a BeautifulSoup object

Identify where the faculty links are located in the page using the browser's `INSPECT` tool

Get a list of all links with names

Transform it into a list of dictionaries where `'name'` contains the faculty name and `'link'` contains the link

Convert to a pandas object (e.g. a DataFrame)
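
The steps above can be sketched on a small made-up HTML snippet. The `ul` element with class `faculties` and the `example.org` links are invented; the real page has to be inspected to find the right container.

```python
from bs4 import BeautifulSoup

# Made-up HTML standing in for the downloaded page; the real markup will differ.
html = """
<ul class="faculties">
  <li><a href="https://example.org/faculty-a">Faculty A</a></li>
  <li><a href="https://example.org/faculty-b">Faculty B</a></li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")
container = soup.find("ul", class_="faculties")

# One dictionary per link, with the visible text as the name.
faculties = [
    {"name": a.get_text(strip=True), "link": a["href"]}
    for a in container.find_all("a", href=True)
]
print(faculties)
```

A list of dictionaries like this can be passed directly to `pd.DataFrame(faculties)`.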

## Task 3: POST request

Send a GET request to this API endpoint:
https://httpbin.org/post

Send a POST request with no data.

Send a POST request with the following JSON:


```json

{
    "course_identifier":"JEM207",
    "course_name":"Data Processing in Python",
    "lecturers": ["Vít Macháček","Martin Hronec", "Jan Šíla"]
}
```
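
A minimal sketch of how such a request could be built with `requests`. It prepares the request without sending it, so the headers and body can be inspected offline; the variable names are illustrative.

```python
import json

import requests

payload = {
    "course_identifier": "JEM207",
    "course_name": "Data Processing in Python",
    "lecturers": ["Vít Macháček", "Martin Hronec", "Jan Šíla"],
}

# The `json=` argument serializes the dictionary and sets the Content-Type header.
prepared = requests.Request("POST", "https://httpbin.org/post", json=payload).prepare()

print(prepared.headers["Content-Type"])
print(json.loads(prepared.body)["course_identifier"])

# Sending it in one step would look like this:
# r = requests.post("https://httpbin.org/post", json=payload)
```

Passing the dictionary via `data=` instead would send it form-encoded, which is a different request body.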

## Task 4: Scraping IES news

Use the following code snippets to construct your own IES News web scraper.

In [None]:
def get_soup(link):
    '''
    Accept a link and return a BeautifulSoup object parsed from the text of a successful GET request to that link. If the request returns a status code other than 200, print a message and return None.

    Make sure that the response is decoded as a UTF-8 string.
    '''
    r = requests.get(link)
    r.encoding='UTF-8'
    
    pass

In [None]:
def get_all_news_links(link):
    '''
    Generate a list of URLs of all news-related links found at the URL provided.

    News links have the format: <a href="/en/news/{id}" title="show news" class="show-news">show news</a>

    The URLs are expected in absolute format, i.e. including the full domain.
    '''
    
    pass 
news_links = get_all_news_links('https://ies.fsv.cuni.cz/content/tree/index/lang/en')
news_links
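
One way to turn the relative `href` values into absolute URLs is `urljoin` from the standard library, sketched below. The second news id is made up for the example.

```python
from urllib.parse import urljoin

hub_url = "https://ies.fsv.cuni.cz/content/tree/index/lang/en"
relative_links = ["/en/news/4976", "/en/news/4977"]  # 4977 is a made-up id

# A leading slash replaces the whole path of the base URL and keeps the domain.
absolute_links = [urljoin(hub_url, href) for href in relative_links]
print(absolute_links)
```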

In [None]:
def parse_title(soup):
    '''
    Parse text of the first `h3` object from the soup element.
    '''
    pass

In [None]:
def parse_date(soup):
    '''
    Parse the text of the sibling of the sibling of the first `h3` element in the soup. Note that the immediate sibling of `h3` is not a Tag element but a NavigableString, which is used to represent text between tags.
    '''

    pass
print(parse_date(get_soup('https://ies.fsv.cuni.cz/en/news/4976')))
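
The sibling behaviour described in the docstring can be seen on a tiny made-up snippet. The `span` tag and the date text are invented; the real news page may use a different element.

```python
from bs4 import BeautifulSoup, NavigableString

# Made-up markup: a newline sits between the heading and the next tag.
html = "<div><h3>Some title</h3>\n<span>1 February 2021</span></div>"
soup = BeautifulSoup(html, "html.parser")

h3 = soup.find("h3")
first = h3.next_sibling       # the text between the tags
second = first.next_sibling   # the next actual tag

print(repr(first), isinstance(first, NavigableString))
print(second.name, second.get_text())

# find_next_sibling() skips text nodes and goes straight to the next tag.
next_tag = h3.find_next_sibling()
```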

In [None]:
def parse_news_content(soup):
    '''
    For simplicity, the content of the article is the content of all <p> elements within <div class="col-sm-12 news"></div>

    Return a single string with the whole text. Use a newline character as the separator between the individual paragraph texts.

    Hint: Consider using a `.join()` function applicable on string object
    '''
    pass

print(parse_news_content(get_soup('https://ies.fsv.cuni.cz/en/news/4976')))
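
A sketch of the joining step on made-up HTML that mimics the `div` described in the docstring. The paragraph texts are invented.

```python
from bs4 import BeautifulSoup

# Made-up page: two paragraphs inside the news block and one outside it.
html = """
<div class="col-sm-12 news">
  <h3>Some title</h3>
  <p>First paragraph.</p>
  <p>Second paragraph.</p>
</div>
<p>Footer text outside the news block.</p>
"""

soup = BeautifulSoup(html, "html.parser")

# class_="news" matches any element that has "news" among its classes.
news_div = soup.find("div", class_="news")
content = "\n".join(p.get_text(strip=True) for p in news_div.find_all("p"))
print(content)
```

Searching inside `news_div` rather than the whole soup keeps the footer paragraph out of the result.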

In [None]:
def parse_ies_news(link,pause=.5):
    '''
    From the URL of a given news story, generate a pd.Series object with `title`, `date` and `content`.

    Use the functions `parse_title`, `parse_date` and `parse_news_content` to get the individual attributes.

    Please keep the sleep() call to avoid overloading the IES website.
    '''

    time.sleep(pause)
    
    pass    
parse_ies_news('https://ies.fsv.cuni.cz/en/news/4976')

In [None]:
def get_all_news(link):
    '''
    Wrapper function that accepts a `link` pointing to the hub page with news links and returns a DataFrame with one row per parsed news story.
    '''
    
    # Use the `link` argument rather than a hard-coded URL.
    news_links = get_all_news_links(link)
    
    return pd.DataFrame([parse_ies_news(news_link) for news_link in news_links])

get_all_news('https://ies.fsv.cuni.cz/content/tree/index/lang/en')

## Task 5: Convert Task 4 into OOP

In [None]:
class News:
    def __init__(self,link):
        pass




class Downloader:
    def __init__(self,hub_url):
        pass


#dl = Downloader('https://ies.fsv.cuni.cz/content/tree/index/lang/en')
