# Set up general Infrastructure <a class="tocSkip">
Author: Stefan Roland Schwingenschlögl <br>
email: stefan.roland.schwingenschloegl@gmail.com <br>
github: github.com/stefan-schwingenschloegl <br>
___
*Project File No: 1 <br>*

This notebook creates the folder structure for the project. A root folder serves as the working directory for the whole project, and all notebooks are stored there.
Inside the root folder there are three sub folders: `input_data`, `realtime_data` and `img`. Their purposes are as follows:
* `input_data`: hosts all static files from the Wiener Linien API
* `realtime_data`: hosts all raw data collected from the Realtime API and inserted into the *stage_table* of the database
* `img`: hosts all images that are embedded in one of the notebooks

After the folder structure has been created, all static files available from the Wiener Linien API are downloaded and saved to the `input_data` folder. These folders will not contain any data on GitHub, because the data should be obtained from the primary source, which is Wiener Linien GmbH & Co KG.

____
## Import Libraries

In [1]:
# library for operating system control
import os

# libraries for web scraping
import urllib.request as request
from bs4 import BeautifulSoup

## Make folder structure
___
In this section the folder structure described above is created.

In [3]:
# Set general properties for this section
root_folder = "Wiener_Linien_Projekt"
sub_folder_list = ["input_data", "realtime_data", "img"]

In [4]:
# create root_folder
if root_folder not in os.listdir():
    os.mkdir(root_folder)

In [5]:
# change directory to root folder
os.chdir(root_folder)

In [6]:
# create data folders
for folder in sub_folder_list:
    if folder not in os.listdir():
        os.mkdir(folder)
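
The cells above change the working directory with `os.chdir`, so re-running a single cell out of order can create folders in the wrong place. As a possible alternative, the following minimal sketch builds the same structure with `pathlib` and never changes the working directory. The helper name `create_project_folders` and its `base_dir` parameter are hypothetical and not part of the original notebook; the demonstration writes into a temporary directory only.

```python
from pathlib import Path
import tempfile


def create_project_folders(base_dir, root_folder, sub_folders):
    """Create root_folder and its sub folders below base_dir; return the root path."""
    root = Path(base_dir) / root_folder
    # exist_ok=True makes the call safe to repeat when the folder already exists
    root.mkdir(parents=True, exist_ok=True)
    for name in sub_folders:
        (root / name).mkdir(exist_ok=True)
    return root


# Demonstration in a temporary directory, so nothing is written to the real project
with tempfile.TemporaryDirectory() as tmp:
    root = create_project_folders(
        tmp, "Wiener_Linien_Projekt", ["input_data", "realtime_data", "img"]
    )
    created = sorted(p.name for p in root.iterdir())
    print(created)
```

Because the function returns the root path, later cells could build file paths such as `root / "input_data"` explicitly instead of relying on the current working directory.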


## Download static files
___
Wiener Linien is the public transport operator of Vienna. It provides real-time data from its services as open data, and the data is accessed through an API. The API documentation page is located at: http://www.wienerlinien.at/ogd_realtime/doku/

In [8]:
# change working directory to input data
os.chdir(sub_folder_list[0])

In [9]:
# set general properties
WL_URL = 'http://www.wienerlinien.at/ogd_realtime/doku/'

In [10]:
# Get the HTML text from the Wiener Linien API documentation page (http://www.wienerlinien.at/ogd_realtime/doku)

with request.urlopen(WL_URL) as response:
    #print(response.getcode())
    source = response.read()
    soup = BeautifulSoup(source, 'html.parser')
    print(soup.prettify())

<!DOCTYPE html>
<html lang="de">
 <head>
  <title>
   Wiener Linien - OGD Doku
  </title>
  <meta content=", " name="keywords"/>
  <meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
 </head>
</html>
<body>
 <a href="./ogd/wienerlinien-ogd-fahrwegverlaeufe.csv">
  wienerlinien-ogd-fahrwegverlaeufe.csv
 </a>
 <br/>
 <a href="./ogd/wienerlinien-ogd-gps-punkte.csv">
  wienerlinien-ogd-gps-punkte.csv
 </a>
 <br/>
 <a href="./ogd/wienerlinien-ogd-haltepunkte.csv">
  wienerlinien-ogd-haltepunkte.csv
 </a>
 <br/>
 <a href="./ogd/wienerlinien-ogd-haltestellen.csv">
  wienerlinien-ogd-haltestellen.csv
 </a>
 <br/>
 <a href="./ogd/wienerlinien-ogd-linien.csv">
  wienerlinien-ogd-linien.csv
 </a>
 <br/>
 <a href="./ogd/wienerlinien-ogd-steige.csv">
  wienerlinien-ogd-steige.csv
 </a>
 <br/>
 <a href="./ogd/wienerlinien-ogd-version.csv">
  wienerlinien-ogd-version.csv
 </a>
 <br/>
 <br/>
 <a href="./ogd/wienerlinien_ogd_Beschreibung.pdf">
  wienerlinien_ogd_Beschreibung.pdf
 </a>


In HTML, the `a` tag defines a hyperlink, and its `href` attribute holds the URL the hyperlink points to. In this case every hyperlink points to a file that we want to download. The next step therefore extracts the `href` string of every link and saves it to a list. After that, it is possible to iterate through the list and download all files into the `input_data` folder of the project's repository.

In [11]:
# generate a list of URLs for all static files in the Wiener Linien API
URLS = []
for a in soup.find_all('a', href=True):
    URLS.append(a['href'].replace('./', WL_URL, 1))
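
The string replacement above works because every `href` on this page starts with `./`. As a hedged alternative, the sketch below resolves relative links with `urllib.parse.urljoin` from the standard library, which would also handle links without a leading `./` or links that are already absolute. The two `href` values are copied from the HTML output shown earlier; the variable names `hrefs` and `urls` are only illustrative.

```python
from urllib.parse import urljoin

WL_URL = 'http://www.wienerlinien.at/ogd_realtime/doku/'

# Two href values copied from the HTML output above
hrefs = [
    './ogd/wienerlinien-ogd-linien.csv',
    './ogd/wienerlinien_ogd_Beschreibung.pdf',
]

# urljoin resolves each relative href against the base URL
urls = [urljoin(WL_URL, href) for href in hrefs]
for url in urls:
    print(url)
```

Note that the trailing slash in `WL_URL` matters here: without it, `urljoin` would treat `doku` as a file name and resolve the links one level higher.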

After the URLs of all available files have been collected, the next cell downloads the files.

In [12]:
# Download all static files from the Wiener Linien API.

for file in URLS:
    # save each file under the last path segment of its URL
    request.urlretrieve(file, file.rsplit('/',1)[1])
    print('Successful Download: ' + file)
print("#### All files were downloaded successfully ####")

Successful Download: http://www.wienerlinien.at/ogd_realtime/doku/ogd/wienerlinien-ogd-fahrwegverlaeufe.csv
Successful Download: http://www.wienerlinien.at/ogd_realtime/doku/ogd/wienerlinien-ogd-gps-punkte.csv
Successful Download: http://www.wienerlinien.at/ogd_realtime/doku/ogd/wienerlinien-ogd-haltepunkte.csv
Successful Download: http://www.wienerlinien.at/ogd_realtime/doku/ogd/wienerlinien-ogd-haltestellen.csv
Successful Download: http://www.wienerlinien.at/ogd_realtime/doku/ogd/wienerlinien-ogd-linien.csv
Successful Download: http://www.wienerlinien.at/ogd_realtime/doku/ogd/wienerlinien-ogd-steige.csv
Successful Download: http://www.wienerlinien.at/ogd_realtime/doku/ogd/wienerlinien-ogd-version.csv
Successful Download: http://www.wienerlinien.at/ogd_realtime/doku/ogd/wienerlinien_ogd_Beschreibung.pdf
Successful Download: http://www.wienerlinien.at/ogd_realtime/doku/ogd/wienerlinien-echtzeitdaten-dokumentation.pdf
Successful Download: http://www.wienerlinien.at/ogd_realtime/doku/ogd/gtfs/agen
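
The download loop stops at the first link that cannot be retrieved or that does not end in a file name. The following is a minimal sketch of a more defensive variant, assuming the same kind of URL list as above. The helpers `filename_from_url` and `download_all` and the `retrieve` parameter are hypothetical additions for illustration; `retrieve` defaults to `urllib.request.urlretrieve`, so the behavior on success is the same as in the loop above.

```python
import os
import urllib.request as request
from urllib.error import URLError
from urllib.parse import urlparse


def filename_from_url(url):
    """Return the last path segment of a URL (empty string if the URL ends in '/')."""
    return os.path.basename(urlparse(url).path)


def download_all(urls, target_dir=".", retrieve=request.urlretrieve):
    """Try to download every URL and return the lists (downloaded, failed)."""
    downloaded, failed = [], []
    for url in urls:
        name = filename_from_url(url)
        if not name:
            # no file name to save under, e.g. a link to a directory
            failed.append(url)
            continue
        try:
            retrieve(url, os.path.join(target_dir, name))
            downloaded.append(url)
        except (URLError, OSError) as err:
            # report the problem and continue with the remaining files
            print("Download failed: " + url + " (" + str(err) + ")")
            failed.append(url)
    return downloaded, failed
```

A call such as `downloaded, failed = download_all(URLS)` would then attempt every file and leave a list of failed links to inspect afterwards, instead of raising on the first error.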

## Next Step:
___
After successfully setting up the infrastructure and downloading all files, the next step takes a closer look at these files. An overview of the files will be given and, if needed, the files will be cleaned. This step is carried out in `static_file_cleaning.ipynb`.