# How do I get at external data?

* Data from text files
* Structured data from the web
* Web scraping
* Databases

### Text files vs. binary files

Text files consist of lines of text that are separated by line separators.

Line separators
* Unix: 0xA (LF)
* Windows: 0xD 0xA (CR LF)
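The two conventions can be inspected directly on byte strings. The following is a minimal sketch; the sample strings are made up for illustration and are not the contents of `sample.txt`.

```python
# The same two lines of text, once with Unix and once with Windows separators.
unix_bytes = b"hello\nworld\n"
windows_bytes = b"hello\r\nworld\r\n"

# Show the bytes around the first separator as hexadecimal values.
print([hex(b) for b in unix_bytes[4:7]])
print([hex(b) for b in windows_bytes[4:8]])

# splitlines() recognises both conventions and drops the separators.
unix_lines = unix_bytes.splitlines()
windows_lines = windows_bytes.splitlines()
print(unix_lines == windows_lines)
```

Because `splitlines()` handles both forms, code that relies on it does not need to know which platform produced the data.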

# Reading and writing files

## Reading text files

Simple example

In [None]:
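# Open the file in text mode (the default) and print it line by line
f = open('sample.txt')
for s in f:
    s = s.strip()  # remove the line separator and surrounding whitespace
    print(s)
f.close()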

Searching in a file

In [None]:
f = open('sample.txt')
# 'in' iterates over the lines, so this only matches a line that is exactly 'world'
if 'world\n' in f:
    print("World is here")
f.close()
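The membership test on a file object compares whole lines, including the line separator. A minimal sketch of the difference between a whole-line match and a substring search follows; the file name `sample_search.txt` and its contents are made up so that the example is self-contained.

```python
import os
import tempfile

# Create a small throwaway file so the sketch does not depend on sample.txt.
path = os.path.join(tempfile.mkdtemp(), "sample_search.txt")
with open(path, "w") as f:
    f.write("hello\nworld\nhello world again\n")

# 'x in f' iterates over the lines and compares each whole line to x.
with open(path) as f:
    whole_line_match = "world\n" in f

# Without the trailing newline no line is equal to the search string.
with open(path) as f:
    no_newline_match = "world" in f

# Substring search: test every line individually and keep the line numbers.
with open(path) as f:
    matching_lines = [n for n, line in enumerate(f, start=1) if "world" in line]

print(whole_line_match, no_newline_match, matching_lines)
```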

Reading the whole text file into RAM and splitting it into lines in memory

In [None]:
with open('sample.txt', 'r') as f:
    cnt = 0
    # readlines() loads the whole file into memory as a list of lines
    for line in f.readlines():
        line = line.rstrip()
        print("Line: " + line)
        cnt += 1
        if 'world' in line:
            print("Found, line " + str(cnt))
            break
            


Reading a large file line by line

In [None]:
with open('sample.txt', 'r') as f:
    while True:
        line = f.readline()
        # readline() returns an empty string only at the end of the file
        if line == '':
            break
        print(line, end='')
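Iterating over the file object directly also reads one line at a time, so it is a common alternative to the explicit `readline()` loop. The sketch below assumes a generated file (`big_sample.txt` is a made-up name) and only keeps a counter and the longest line in memory.

```python
import os
import tempfile

# Generate a throwaway file with 1000 lines for the demonstration.
path = os.path.join(tempfile.mkdtemp(), "big_sample.txt")
with open(path, "w") as f:
    for i in range(1000):
        f.write(f"line {i}\n")

line_count = 0
longest = ""
with open(path) as f:
    for line in f:  # the file object yields lines lazily, one at a time
        line_count += 1
        if len(line) > len(longest):
            longest = line

print(line_count, longest.rstrip())
```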
        
    
            


## Reading binary files

In [2]:
with open('sample.txt', 'rb') as f:
    # read(10) returns the first 10 bytes; iterating over bytes yields integers
    for ch in f.read(10):
        print(chr(ch))

h
e
l
l
o



w
o
r


## Writing files

Opening a text file for writing only

In [3]:
with open('sample_write.txt', 'w') as f:
    f.write('Hello!\n')
    f.write('This is my first file!\n')

Reading and writing a text file

In [4]:
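# 'w+' truncates the file and opens it for both writing and reading
with open('sample_read_write.txt', 'w+') as f:
    f.write('Hallo wieder!\n')
    f.write('This is my first file!\n')
    f.seek(0)  # jump back to the start before reading
    print(f.read())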

Hallo wieder!
This is my first file!



## Without a 'with' statement: do not forget close()!

In [None]:
f = open('sample_write.txt', 'w')
f.write('Hello!\n')
f.write('This is my first file!\n')
f.close()
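If a write raises an exception, the `close()` call in the cell above is never reached. One common way to guard against that without `with` is `try`/`finally`; the following is a minimal sketch that writes to a throwaway path rather than to `sample_write.txt`.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "sample_write_safe.txt")

f = open(path, "w")
try:
    f.write("Hello!\n")
    f.write("This is my first file!\n")
finally:
    f.close()  # runs even if one of the writes raises

print(f.closed)

# Read the file back to confirm that both lines were written.
with open(path) as g:
    content = g.read()
```

The `with` statement does the same bookkeeping automatically, which is why the earlier cells prefer it.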

# Caution, danger: UTF-8 files

## Reading files that contain non-ASCII characters in text mode without an explicit encoding can lead to errors

In [5]:
# No encoding is given, so Python uses the platform's default encoding
f = open('sample_utf.txt')
for s in f:
    print(s)
f.close()

UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 50: character maps to <undefined>
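The error can be reproduced without a file. The sketch below assumes that the platform default was the Windows code page `cp1252`, which the `'charmap'` wording suggests but the notebook does not state. It takes one of the Greek lines from `sample_utf.txt`, encodes it as UTF-8 and decodes the bytes with the wrong codec.

```python
text = "αγάπη χαιρετισμούς!"
raw = text.encode("utf-8")  # the bytes as they would be stored in a UTF-8 file

try:
    raw.decode("cp1252")  # assumed platform default, not confirmed by the source
    failed = False
except UnicodeDecodeError as err:
    failed = True
    print(err)

# Decoding with the codec that was used for writing works.
decoded = raw.decode("utf-8")
print(decoded == text)
```

The letter ρ is stored as the bytes 0xCF 0x81, and 0x81 has no character assigned in `cp1252`, which is the same byte value the error message above reports.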

## Reading UTF-8 files correctly

In [7]:
import codecs
# Open the file with an explicit encoding instead of the platform default
f = codecs.open("sample_utf.txt", "r", "utf-8")
for s in f:
    print(s, end='')
f.close()

﻿hello
world
Привет!
αγάπη χαιρετισμούς!
Liebe grüße
these are
strings from sample.txt
file
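In Python 3 the built-in `open()` accepts an `encoding` argument, so `codecs` is not strictly needed. The sketch below also shows the `utf-8-sig` codec, which drops a leading byte order mark (BOM) if the file starts with one. The file `bom_sample.txt` and its contents are made up for the demonstration.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "bom_sample.txt")

# Write a file that starts with a UTF-8 byte order mark (BOM).
with open(path, "wb") as f:
    f.write(b"\xef\xbb\xbf" + "hello\nПривет!\n".encode("utf-8"))

# Plain utf-8 keeps the BOM as an invisible first character.
with open(path, encoding="utf-8") as f:
    first_plain = f.readline()

# utf-8-sig strips the BOM when it is present.
with open(path, encoding="utf-8-sig") as f:
    first_sig = f.readline()

print(repr(first_plain))
print(repr(first_sig))
```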


## Writing UTF-8 files correctly

In [8]:
import codecs
f = codecs.open("sample_utf_w.txt", "w", "utf-8")
f.write('Привет! γεια!')
f.close()
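The same write can be sketched with the built-in `open()` and a `with` block. The example uses a throwaway path instead of `sample_utf_w.txt` and reads the file back in both modes to show that the non-ASCII characters take more than one byte each.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "utf_write_sample.txt")

# Write with an explicit encoding so the result does not depend on the platform.
with open(path, "w", encoding="utf-8") as f:
    f.write("Привет! γεια!")

with open(path, "rb") as f:
    raw = f.read()   # the encoded bytes on disk

with open(path, encoding="utf-8") as f:
    text = f.read()  # the decoded characters

print(len(text), len(raw))
```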

# Reading structured data from the web

Example: reading simple data

In [9]:
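import requests

# Send an HTTP GET request; the service answers with a small JSON document
r = requests.get('http://ip.jsontest.com/')

print('Encoding: ', r.encoding)

# r.content holds the raw response body as bytes
print('Content: ', r.content)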

Encoding:  ISO-8859-1
Content:  b'{"ip": "213.216.9.42"}\n'


## Example: converting JSON to a dict

## Installing ujson: c:> conda install ujson

In [10]:
import ujson
content = r.content
content

b'{"ip": "213.216.9.42"}\n'

In [12]:
# Use a name other than 'dict' so the built-in type is not shadowed
data = ujson.loads(content)
data['ip']

'213.216.9.42'
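If installing `ujson` is not an option, the standard library's `json` module offers a `loads()` function as well. This is a substitute for the `ujson` call in the notebook, not the same library. The sketch reuses the response body shown above as a literal so that it runs without a network connection.

```python
import json

# The response body from the cell above, copied as a bytes literal.
content = b'{"ip": "213.216.9.42"}\n'

# json.loads() accepts str as well as UTF-8 encoded bytes.
data = json.loads(content)

print(type(data).__name__, data["ip"])
```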

### UTF-8 data

In [None]:
print('Encoding: ', r.encoding)

In [13]:
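# Decode the raw bytes explicitly as UTF-8 before parsing
utf = r.content.decode('utf-8')

# Use a name other than 'dict' so the built-in type is not shadowed
data = ujson.loads(utf)

data['ip']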

'213.216.9.42'
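For the IP address the choice of codec makes no visible difference, because the payload is pure ASCII. With non-ASCII content it does. The sketch below uses a made-up payload (the `city` key and its value are hypothetical, not something the service returns) and the standard `json` module to compare decoding with the reported ISO-8859-1 against decoding with UTF-8.

```python
import json

# Hypothetical UTF-8 encoded response body containing a non-ASCII character.
payload = '{"city": "Köln"}'.encode("utf-8")

# Decoding with the wrong codec does not fail, but garbles the text.
wrong = json.loads(payload.decode("iso-8859-1"))

# Decoding with the codec that was actually used gives the intended text.
right = json.loads(payload.decode("utf-8"))

print(wrong["city"], right["city"])
```

Unlike the `UnicodeDecodeError` seen with files, this kind of mismatch is silent, because every byte value is a valid ISO-8859-1 character.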

# Reading unstructured data from the web (web scraping)

In [25]:
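from lxml import html
import requests

# Download the page and parse the HTML into an element tree
page = requests.get('http://econpy.pythonanywhere.com/ex/001.html')

tree = html.fromstring(page.content)

# Text of every <div title="buyer-name"> element
buyers = tree.xpath('//div[@title="buyer-name"]/text()')

# Text of every <span class="item-price"> element
prices = tree.xpath('//span[@class="item-price"]/text()')

# Pair each buyer with the corresponding price
data = zip(buyers, prices)

n = list(data)
# Transpose the pairs again and select the prices
list(zip(*n))[1]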

('$29.95',
 '$8.37',
 '$15.26',
 '$19.25',
 '$19.25',
 '$13.99',
 '$31.57',
 '$8.49',
 '$14.47',
 '$15.86',
 '$11.11',
 '$15.98',
 '$16.27',
 '$7.50',
 '$50.85',
 '$14.26',
 '$5.68',
 '$15.00',
 '$114.07',
 '$10.09')
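The two XPath expressions can be tried offline on a small snippet. The sketch below uses the standard library's `xml.etree.ElementTree`, which supports only a subset of XPath, instead of `lxml`; the markup and the buyer names are made up and merely mimic the structure the expressions above expect.

```python
import xml.etree.ElementTree as ET

# Made-up sample markup, not the content of the real page.
SAMPLE = """
<html><body>
  <div title="buyer-name">Buyer A</div>
  <span class="item-price">$1.00</span>
  <div title="buyer-name">Buyer B</div>
  <span class="item-price">$2.50</span>
</body></html>
""".strip()

tree = ET.fromstring(SAMPLE)

# './/tag[@attr="value"]' selects matching elements anywhere below the root.
buyers = [e.text for e in tree.findall(".//div[@title='buyer-name']")]
prices = [e.text for e in tree.findall(".//span[@class='item-price']")]

# Pair buyers with prices and convert the price strings to numbers.
pairs = list(zip(buyers, prices))
amounts = [float(p.lstrip("$")) for p in prices]

print(pairs)
print(amounts)
```

`ElementTree` needs well-formed markup, so for real web pages a lenient HTML parser such as the `lxml.html` module used above is usually the more practical choice.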

Note: more information about XPath is available at W3Schools: https://www.w3schools.com/xml/xpath_intro.asp