# How do I get at external data?

* Data from text files
* Structured data from the web
    * JSON
    * XML
* Web scraping
* Databases

### Text files vs. binary files

Text files consist of lines of text separated by line separators.

Line separators
* Unix: 0x0A (LF)
* Windows: 0x0D 0x0A (CR LF)
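
The effect of the two separators can be made visible by writing raw bytes and reading them back in text mode. This is a minimal sketch; the scratch file `newline_demo.txt` in a temporary directory is an assumption made for illustration and is not part of the notebook's sample files.

```python
import os
import tempfile

# Hypothetical scratch file, created only for this demonstration.
path = os.path.join(tempfile.mkdtemp(), 'newline_demo.txt')

# Write one Unix-style line (LF) and one Windows-style line (CR LF) as raw bytes.
with open(path, 'wb') as f:
    f.write(b'unix line\n')
    f.write(b'windows line\r\n')

# Text mode with the default newline handling translates both endings to '\n'.
with open(path, 'r', encoding='ascii') as f:
    text_lines = f.readlines()

# Binary mode returns the bytes exactly as they are stored on disk.
with open(path, 'rb') as f:
    raw = f.read()

print(text_lines)
print(raw)
```

In text mode both lines end in `'\n'`, while the binary read still shows the `\r\n` pair of the second line.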

# Reading and writing files

## Reading text files

Simple example

In [None]:
f = open('sample.txt')
for s in f:
    # each line still ends with its own newline, so suppress the one print() adds
    print(s, end='')
f.close()

Searching in a file

In [None]:
f = open('sample.txt')
# `in` compares whole lines, so this only matches a line that is exactly 'world'
if 'world\n' in f:
    print("World is here")
f.close()
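
The membership test above only matches a complete line. For a substring search, a small helper can report every line that contains the search term. A minimal sketch, assuming a made-up helper name `find_lines` and a scratch file in a temporary directory:

```python
import os
import tempfile

def find_lines(path, needle):
    """Return the 1-based numbers of all lines that contain `needle`."""
    hits = []
    with open(path, encoding='utf-8') as f:
        for number, line in enumerate(f, start=1):
            if needle in line:
                hits.append(number)
    return hits

# Hypothetical sample data for the demonstration.
demo_path = os.path.join(tempfile.mkdtemp(), 'search_demo.txt')
with open(demo_path, 'w', encoding='utf-8') as f:
    f.write('hello\nworld\nhello world\n')

# Whole-line comparison: true only because line 2 is exactly 'world'.
with open(demo_path, encoding='utf-8') as f:
    whole_line_match = 'world\n' in f

print(whole_line_match)
print(find_lines(demo_path, 'world'))
```

The helper finds the term on lines 2 and 3, whereas the whole-line test would miss a file that only contained `hello world`.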

Read the whole text file into RAM and split it into lines internally

In [None]:
with open('sample.txt', 'r') as f:
    # readlines() loads every line into a list at once
    for cnt, line in enumerate(f.readlines(), start=1):
        line = line.rstrip()
        print("Line: " + line)
        if 'world' in line:
            print("Found, line " + str(cnt))


Reading a large file line by line

In [None]:
with open('sample.txt', 'r') as f:
    while True:
        line = f.readline()
        if line == '':  # an empty string means end of file
            break
        print(line, end='')
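
When a large file has very long lines, or no line breaks at all, reading fixed-size chunks is one possible alternative to `readline()`. A minimal sketch, assuming a made-up generator name `read_in_chunks` and a tiny scratch file so the result is easy to inspect:

```python
import os
import tempfile

def read_in_chunks(path, chunk_size=1024):
    """Yield successive pieces of at most `chunk_size` characters."""
    with open(path, 'r', encoding='utf-8') as f:
        while True:
            chunk = f.read(chunk_size)
            if chunk == '':  # empty string signals end of file, as with readline()
                break
            yield chunk

# Hypothetical scratch file with ten characters and no newline.
demo_path = os.path.join(tempfile.mkdtemp(), 'chunk_demo.txt')
with open(demo_path, 'w', encoding='utf-8') as f:
    f.write('abcdefghij')

chunks = list(read_in_chunks(demo_path, chunk_size=4))
print(chunks)
```

Only one chunk is held in memory at a time, so memory use stays bounded by `chunk_size` regardless of the file size.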
        
    
            


## Reading binary files

In [None]:
with open('sample.txt', 'rb') as f:
    # iterating over a bytes object yields integers in the range 0..255
    for byte in f.read(10):
        print(byte)
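
Because a binary read returns `bytes`, each element is an integer rather than a character. The following sketch shows the values and a hex view for a small scratch file; the file name and its contents are assumptions chosen for illustration.

```python
import os
import tempfile

# Hypothetical scratch file containing 'Hi' followed by a Windows line ending.
demo_path = os.path.join(tempfile.mkdtemp(), 'binary_demo.bin')
with open(demo_path, 'wb') as f:
    f.write(b'Hi\r\n')

with open(demo_path, 'rb') as f:
    data = f.read(10)  # at most 10 bytes; fewer if the file is shorter

values = list(data)                            # integer value of each byte
hex_view = ' '.join(f'{b:02x}' for b in data)  # two hex digits per byte

print(values)    # [72, 105, 13, 10]
print(hex_view)  # 48 69 0d 0a
```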

## Writing files

Writing a text file (write-only)

In [None]:
with open('sample_write.txt', 'w') as f:
    f.write('Hello!\n')
    f.write('This is my first file!\n')

Reading and writing a text file

In [None]:
with open('sample_read_write.txt', 'w+') as f:  # 'w+' truncates the file, then allows reading too
    f.write('Hello again!\n')
    f.write('This is my first file!\n')
    f.seek(0)  # jump back to the start before reading
    print(f.read())
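
The mode string decides whether existing content survives. A short sketch comparing `'w'`, `'a'` and `'w+'` on a scratch file (the file name is made up for illustration):

```python
import os
import tempfile

# Hypothetical scratch file for comparing open modes.
path = os.path.join(tempfile.mkdtemp(), 'modes_demo.txt')

with open(path, 'w', encoding='utf-8') as f:   # create or truncate
    f.write('first\n')

with open(path, 'a', encoding='utf-8') as f:   # append, keeps existing content
    f.write('second\n')

with open(path, 'r', encoding='utf-8') as f:
    after_append = f.read()

with open(path, 'w+', encoding='utf-8') as f:  # truncate again, then read back
    f.write('third\n')
    f.seek(0)
    after_w_plus = f.read()

print(repr(after_append))  # 'first\nsecond\n'
print(repr(after_w_plus))  # 'third\n'
```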

## Without a `with` statement: don't forget close()!

In [None]:
f = open('sample_write.txt', 'w')
f.write('Hello!\n')
f.write('This is my first file!\n')
f.close()
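
If `write()` raises an exception, the plain version above never reaches `close()`. One way to guard against that is `try`/`finally`, which is roughly what `with` does behind the scenes. A minimal sketch using a scratch file (the path is an assumption for illustration):

```python
import os
import tempfile

# Hypothetical scratch file for the demonstration.
demo_path = os.path.join(tempfile.mkdtemp(), 'close_demo.txt')

f = open(demo_path, 'w', encoding='utf-8')
try:
    f.write('Hello!\n')
finally:
    f.close()  # runs even if write() raised an exception
closed_after_finally = f.closed

# The with statement closes the file automatically when the block ends.
with open(demo_path, encoding='utf-8') as g:
    content = g.read()
closed_after_with = g.closed

print(closed_after_finally, closed_after_with)  # True True
```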

# Caution: UTF-8 files

## Reading files that contain non-ASCII characters in text mode can fail when the platform's default encoding is not UTF-8

In [None]:
f = open('sample_utf.txt')
for s in f:
    print(s)
f.close()
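
Whether the cell above fails depends on the platform's default encoding. The failure can be reproduced on any system by forcing a codec that cannot represent the characters. A minimal sketch, assuming a scratch file and using `ascii` as a stand-in for an unsuitable default:

```python
import os
import tempfile

# Hypothetical scratch file containing Cyrillic and Greek text stored as UTF-8.
path = os.path.join(tempfile.mkdtemp(), 'utf_demo.txt')
with open(path, 'wb') as f:
    f.write('Привет! γεια!\n'.encode('utf-8'))

# Decoding with the wrong codec raises an error as soon as a non-ASCII byte is hit.
try:
    with open(path, 'r', encoding='ascii') as f:
        f.read()
    error_name = None
except UnicodeDecodeError as exc:
    error_name = type(exc).__name__

# Decoding with the codec the file was written in works.
with open(path, 'r', encoding='utf-8') as f:
    text = f.read()

print(error_name)  # UnicodeDecodeError
print(len(text))   # 14 characters including the newline
```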

## Reading UTF-8 files correctly

In [None]:
# pass the encoding explicitly instead of relying on the platform default
with open("sample_utf.txt", "r", encoding="utf-8") as f:
    for s in f:
        print(s, end='')

## Writing UTF-8 files correctly

In [1]:
# pass the encoding explicitly so the result does not depend on the platform
with open("sample_utf_w.txt", "w", encoding="utf-8") as f:
    f.write('Привет! γεια!')
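
In UTF-8 the number of bytes on disk can differ from the number of characters written. A small round-trip sketch with the same string, written to a scratch file (the path is an assumption for illustration) with the built-in `open` and an explicit `encoding`:

```python
import os
import tempfile

# Hypothetical scratch file for the round trip.
path = os.path.join(tempfile.mkdtemp(), 'utf_roundtrip.txt')
message = 'Привет! γεια!'

with open(path, 'w', encoding='utf-8') as f:
    f.write(message)

with open(path, 'r', encoding='utf-8') as f:
    read_back = f.read()

char_count = len(read_back)          # characters in the string
byte_count = os.path.getsize(path)   # bytes stored on disk

print(char_count)  # 13
print(byte_count)  # 23: Cyrillic and Greek letters take two bytes each
```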

# Reading structured data from the web

Example: reading simple data

In [29]:
import requests

r = requests.get('http://ip.jsontest.com/')

print('Encoding: ', r.encoding)

print ('Content: ', r.content)

Encoding:  ISO-8859-1
Content:  b'{"ip": "83.136.72.12"}\n'


## Example: converting JSON to a dict

Install `ujson` first: `conda install ujson`

In [33]:
import ujson
data = ujson.loads(r.content)  # named `data` to avoid shadowing the built-in dict
data

{'ip': '83.136.72.12'}
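
If `ujson` is not available, the standard library's `json` module can serve as a substitute for this step. The sketch below works offline on the response bytes shown earlier, copied in as a literal rather than fetched again:

```python
import json

# Response body copied from the output above; no network access needed.
raw = b'{"ip": "83.136.72.12"}\n'

data = json.loads(raw.decode('utf-8'))  # bytes -> str -> dict
ip = data['ip']
round_trip = json.dumps(data)           # dict -> JSON text again

print(ip)          # 83.136.72.12
print(round_trip)  # {"ip": "83.136.72.12"}
```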

### UTF-8 data

In [53]:
import requests

r = requests.get('https://news.google.com/news/feeds?output=rss')

print('Encoding: ', r.encoding)

utf = r.content.decode('utf-8')

# note: `output=rss` requests an RSS (XML) feed, so JSON parsing is unlikely to succeed here
data = ujson.loads(utf)

data['synsets'][4]


ConnectionError: HTTPSConnectionPool(host='news.google.com', port=443): Max retries exceeded with url: /news/feeds?output=rss (Caused by NewConnectionError('<requests.packages.urllib3.connection.VerifiedHTTPSConnection object at 0x000001B9215290B8>: Failed to establish a new connection: [Errno 11001] getaddrinfo failed',))
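
The request above failed with a connection error, so the decoding step never ran. Here is a minimal offline sketch of that step, assuming a made-up UTF-8 JSON payload (the keys `greeting` and `lang` are hypothetical) and using the standard library's `json` in place of `ujson`:

```python
import json

# Made-up payload standing in for a response body (bytes, UTF-8 encoded).
payload = '{"greeting": "Привет", "lang": "ru"}'.encode('utf-8')

text = payload.decode('utf-8')  # bytes -> str, explicit about the encoding
data = json.loads(text)         # str -> dict

# ensure_ascii=False keeps non-ASCII characters readable when serialising again.
serialised = json.dumps(data, ensure_ascii=False)

print(data['lang'])
print(len(payload), len(text))  # byte count differs from character count
```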

# Reading XML from the web
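
An RSS feed is XML, so the raw bytes of such a response can be handed to an XML parser instead of a JSON one. The sketch below uses the standard library's `xml.etree.ElementTree` on a shortened stand-in document; the `<item>` entry is invented for illustration, and only the channel title and link mirror the feed fetched in this section.

```python
import xml.etree.ElementTree as ET

# Shortened stand-in for an RSS response body; the item is made up.
sample = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    '<rss version="2.0"><channel>'
    '<title>Спорт</title>'
    '<link>http://rus.delfi.lv</link>'
    '<item><title>Example item</title></item>'
    '</channel></rss>'
).encode('utf-8')

# fromstring() accepts bytes and uses the encoding named in the XML declaration.
root = ET.fromstring(sample)

channel_title = root.findtext('channel/title')
channel_link = root.findtext('channel/link')
item_titles = [item.findtext('title') for item in root.iter('item')]

print(channel_link)
print(item_titles)
```

With a live response, `r.content` would take the place of `sample`.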

In [51]:
import requests

r = requests.get('http://rus.delfi.lv/sport_rss.php')

print('Encoding: ', r.encoding)

r.content 
#utf = r.content.decode('utf-8')

#dict = ujson.loads(utf)

#dict



Encoding:  None


b'<?xml version="1.0" encoding="UTF-8"?>\n<rss version="2.0">\n\t<channel>\n\t\t<title>\xd0\xa1\xd0\xbf\xd0\xbe\xd1\x80\xd1\x82</title>\n\t\t<link>http://rus.delfi.lv</link>\n\t\t<description>DELFI: \xd1\x81\xd0\xb2\xd0\xb5\xd0\xb6\xd0\xb8\xd0\xb5 \xd0\xbd\xd0\xbe\xd0\xb2\xd0\xbe\xd1\x81\xd1\x82\xd0\xb8 \xd1\x81\xd0\xbf\xd0\xbe\xd1\x80\xd1\x82\xd0\xb0</description>\n\t\t<language>ru</language>\n\t\t<copyright>A/S DELFI, 2009. All rights reserved.</copyright>\n\t\t<lastBuildDate>Wed, 21 Jun 2017 18:24:36 +0300</lastBuildDate>\n\t\t<generator>PHP</generator>\n\t\t<image>\n\t\t\t<url>http://g1.delphi.lv/u/delfi.gif</url>\n\t\t\t<link>http://rus.delfi.lv</link>\n\t\t\t<title>\xd0\xa1\xd0\xbf\xd0\xbe\xd1\x80\xd1\x82</title>\n\t\t</image>\n\t\t\t\t\n\t\t\t<item>\n\t\t\t\t<title>\xd0\x9e\xd1\x81\xd0\xba\xd0\xb0\xd0\xbd\xd0\xb4\xd0\xb0\xd0\xbb\xd0\xb8\xd0\xb2\xd1\x88\xd0\xb8\xd0\xb9\xd1\x81\xd1\x8f \xd0\xb3\xd0\xbb\xd0\xb0\xd0\xb2\xd0\xb0 &quot;\xd0\x9b\xd0\xb5\xd1\x82\xd1\x83\xd0\xb2\xd0\xbe\