# Reading Data from Files

Data often comes in the form of files, so reading files plays an important role in the exploratory data analysis process.

The following steps should be carried out when reading unfamiliar data:
* Inspect the data
* Read it in on a trial basis
* Validate the result
* Improve the reading step

Rough procedure:
* What is already known about the data?
  * Metadata?
  * Data description?
  * File format?
  * File type?
* Inspect the file in a text editor (e.g. Notepad++) or directly in the IDE
  * Important: make an educated guess at the format; check delimiters, special characters, and line breaks
* Decide how to read the data:
  * Python file objects (`open`)
  * `csv.reader`
  * pandas
  * ...
* Read the data
* Check/validate the result
  * Is everything there?
  * Is the formatting as expected?
  * Are delimiters, whitespace, and special characters preserved?
  * Has superfluous content been removed?
* Improve the reading step
  * After any change, validate again
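
The procedure above can be sketched with the standard library alone. The sample text below is a hypothetical stand-in for the first lines of an unknown file, and `csv.Sniffer` is only a heuristic, so its guess should always be checked by eye:

```python
import csv
import io

# Hypothetical sample standing in for the first lines of an unknown file.
sample = (
    "ISO;COUNTRY;POP\n"
    "CHN;China;1398.72\n"
    "IND;India;1351.16\n"
)

# Step 1: look at the raw lines; repr() makes tabs, line endings
# and stray spaces visible.
for line in io.StringIO(sample):
    print(repr(line))

# Step 2: let csv.Sniffer guess the dialect, restricted to plausible delimiters.
dialect = csv.Sniffer().sniff(sample, delimiters=",;\t|")
print("guessed delimiter:", repr(dialect.delimiter))

# Step 3: trial read with the guessed dialect.
rows = list(csv.reader(io.StringIO(sample), dialect))

# Step 4: validate. Every row should have as many fields as the header.
field_counts = {len(row) for row in rows}
print("rows:", len(rows), "distinct field counts:", field_counts)
```

If `field_counts` contains more than one value, the delimiter or quoting guess is probably wrong and the reading step needs another iteration.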

## Inspecting the Data


### Examples

```csv
  ,COUNTRY,POP,AREA,GDP,CONT,IND_DAY
  CHN,China,1398.72,9596.96,12234.78,Asia,
  IND,India,1351.16,3287.26,2575.67,Asia,1947-08-15
  USA,US,329.74,9833.52,19485.39,N.America,1776-07-04
```

The Zen of Python, by Tim Peters
Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.

```json
{
  "Title": "Bicentennial Man",
  "Release Date": "Dec 17 1999",
  "MPAA Rating": "PG",
  "Running Time min": 132,
  "Distributor": "Walt Disney Pictures",
  "Source": "Based on Book/Short Story",
  "Major Genre": "Drama",
  "Creative Type": "Science Fiction",
  "Director": "Chris Columbus",
  "Rotten Tomatoes Rating": 38,
  "IMDB Rating": 6.4,
  "IMDB Votes": 28827
}
``` 

## Reading CSV

In [2]:
import csv
with open('data/country_data.csv', newline='') as f:  # newline='' as recommended by the csv docs
    reader = csv.reader(f, delimiter=',')
    for row in reader:
        print(row)

['', 'COUNTRY', 'POP', 'AREA', 'GDP', 'CONT', 'IND_DAY']
['CHN', 'China', '1398.72', '9596.96', '12234.78', 'Asia', '']
['IND', 'India', '1351.16', '3287.26', '2575.67', 'Asia', '1947-08-15']
['USA', 'US', '329.74', '9833.52', '19485.39', 'N.America', '1776-07-04']
['IDN', 'Indonesia', '268.07', '1910.93', '1015.54', 'Asia', '1945-08-17']
['BRA', 'Brazil', '210.32', '8515.77', '2055.51', 'S.America', '1822-09-07']
['PAK', 'Pakistan', '205.71', '881.91', '302.14', 'Asia', '1947-08-14']
['NGA', 'Nigeria', '200.96', '923.77', '375.77', 'Africa', '1960-10-01']
['BGD', 'Bangladesh', '167.09', '147.57', '245.63', 'Asia', '1971-03-26']
['RUS', 'Russia', '146.79', '17098.25', '1530.75', '', '1992-06-12']
['MEX', 'Mexico', '126.58', '1964.38', '1158.23', 'N.America', '1810-09-16']
['JPN', 'Japan', '126.22', '377.97', '4872.42', 'Asia', '']
['DEU', 'Germany', '83.02', '357.11', '3693.2', 'Europe', '']
['FRA', 'France', '67.02', '640.68', '2582.49', 'Europe', '1789-07-14']
['GBR', 'UK', '66.44', 

Important: the `delimiter` parameter. Common delimiters include:
* `,`
* `;`
* tab: `\t`
* space
* `|`
* ...
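
Why the delimiter matters can be seen in a small sketch. The data below is hypothetical (semicolon-separated with decimal commas, as some spreadsheet exports produce) and is read from memory via `io.StringIO` instead of a file:

```python
import csv
import io

# Hypothetical semicolon-separated data with decimal commas.
raw = "COUNTRY;POP\nGermany;83,02\nFrance;67,02\n"

# Wrong delimiter: the comma splits the numbers instead of the columns.
wrong = list(csv.reader(io.StringIO(raw), delimiter=","))

# Matching delimiter: two clean fields per row.
right = list(csv.reader(io.StringIO(raw), delimiter=";"))

print(wrong[1])  # ['Germany;83', '02']
print(right[1])  # ['Germany', '83,02']
```

Note that the wrong delimiter raises no error; only validating the result reveals the problem.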

The `csv` module yields structured data that can be processed directly:

In [3]:
with open('data/country_data.csv', newline='') as f:
    reader = csv.reader(f, delimiter=',')
    for row in reader:
        print(f"{row[1]} is located in {row[5]}.")

COUNTRY is located in CONT.
China is located in Asia.
India is located in Asia.
US is located in N.America.
Indonesia is located in Asia.
Brazil is located in S.America.
Pakistan is located in Asia.
Nigeria is located in Africa.
Bangladesh is located in Asia.
Russia is located in .
Mexico is located in N.America.
Japan is located in Asia.
Germany is located in Europe.
France is located in Europe.
UK is located in Europe.
Italy is located in Europe.
Argentina is located in S.America.
Algeria is located in Africa.
Canada is located in N.America.
Australia is located in Oceania.
Kazakhstan is located in Asia.


In [5]:
from collections import Counter
continents = Counter()
with open('data/country_data.csv', newline='') as f:
    reader = csv.reader(f, delimiter=',')
    for row in reader:
        continents.update([row[5]])

print(continents)

Counter({'Asia': 7, 'Europe': 4, 'N.America': 3, 'S.America': 2, 'Africa': 2, 'CONT': 1, '': 1, 'Oceania': 1})


The header row does more harm than good here: it is counted as if it were data (`'CONT': 1`).

Handle it as a special case? That works, but there is something better...
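
For comparison, the special-case handling could look roughly like this: consume the header row with `next()` before the loop. This is a minimal sketch that uses an in-memory excerpt of the sample data rather than the file itself:

```python
import csv
import io
from collections import Counter

# Excerpt of the sample data, held in memory so the sketch is self-contained.
raw = (
    ",COUNTRY,POP,AREA,GDP,CONT,IND_DAY\n"
    "CHN,China,1398.72,9596.96,12234.78,Asia,\n"
    "IND,India,1351.16,3287.26,2575.67,Asia,1947-08-15\n"
    "USA,US,329.74,9833.52,19485.39,N.America,1776-07-04\n"
)

reader = csv.reader(io.StringIO(raw), delimiter=",")
header = next(reader)             # consume the header row before the loop
cont_idx = header.index("CONT")   # look up the column position by name

# Only data rows remain in the reader at this point.
continents = Counter(row[cont_idx] for row in reader)
print(continents)
```

The header is no longer counted, but the column still has to be located by position, which is the bookkeeping the next cell avoids.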

In [7]:
with open('data/country_data.csv', newline='') as f:
    dict_reader = csv.DictReader(f, delimiter=',')
    for row in dict_reader:
        print(f"{row['COUNTRY']} is located in {row['CONT']}.")

China is located in Asia.
India is located in Asia.
US is located in N.America.
Indonesia is located in Asia.
Brazil is located in S.America.
Pakistan is located in Asia.
Nigeria is located in Africa.
Bangladesh is located in Asia.
Russia is located in .
Mexico is located in N.America.
Japan is located in Asia.
Germany is located in Europe.
France is located in Europe.
UK is located in Europe.
Italy is located in Europe.
Argentina is located in S.America.
Algeria is located in Africa.
Canada is located in N.America.
Australia is located in Oceania.
Kazakhstan is located in Asia.
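
Two details are worth checking when validating a `DictReader` result: every value arrives as a string, and missing entries arrive as empty strings. The sketch below uses three rows from the sample data, held in memory; the label `"unknown"` is an arbitrary choice for illustration:

```python
import csv
import io
from collections import Counter

# Three rows from the sample data; Russia has no continent entry.
raw = (
    ",COUNTRY,POP,AREA,GDP,CONT,IND_DAY\n"
    "CHN,China,1398.72,9596.96,12234.78,Asia,\n"
    "RUS,Russia,146.79,17098.25,1530.75,,1992-06-12\n"
    "DEU,Germany,83.02,357.11,3693.2,Europe,\n"
)

reader = csv.DictReader(io.StringIO(raw))
print(reader.fieldnames)  # the unnamed first column gets the key ''

continents = Counter()
total_pop = 0.0
for row in reader:
    # Empty strings are falsy, so "or" substitutes a label for missing values.
    continents[row["CONT"] or "unknown"] += 1
    # Numbers must be converted explicitly.
    total_pop += float(row["POP"])

print(continents)
print(round(total_pop, 2))
```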


## Reading Text Data

...or really any kind of data.

In [9]:
f = open('data/zen_of_python.txt', 'r')
print(f.read())
f.close()

The Zen of Python, by Tim Peters

Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.
Now is better than never.
Although never is often better than *right* now.
If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -- let's do more of those!


Key components:
* `open`/`close`
* Access mode:
  * `'r'` opens a file for reading only (the default).
  * `'w'` opens a file for writing. If the file exists, it is overwritten; otherwise a new file is created.
  * `'a'` opens a file for appending. If the file doesn't exist, it is created.
  * `'x'` creates a new file. If the file already exists, opening fails with `FileExistsError`.
  * `'+'` opens a file for updating (reading and writing); it is combined with another mode, e.g. `'r+'`.

A file can also be opened in text mode, `'t'` (the default), or in binary mode, `'b'`, which is needed for non-text data such as images.
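
The modes can be tried out safely in a temporary directory. This is a minimal sketch; the file name `notes.txt` is hypothetical:

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "notes.txt"  # hypothetical file name

    with open(path, "w", encoding="utf-8") as f:   # 'w' creates or overwrites
        f.write("first line\n")

    with open(path, "a", encoding="utf-8") as f:   # 'a' appends at the end
        f.write("second line\n")

    try:
        with open(path, "x", encoding="utf-8") as f:  # 'x' refuses to overwrite
            f.write("never written\n")
    except FileExistsError:
        exclusive_failed = True
    else:
        exclusive_failed = False

    with open(path, "rb") as f:                    # 'rb' returns raw bytes
        raw_bytes = f.read()

    with open(path, "r", encoding="utf-8") as f:   # 'r' returns decoded text
        text = f.read()

print(text)
print(type(raw_bytes).__name__, exclusive_failed)
```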

Safe file handling:

In [10]:
with open('data/zen_of_python.txt') as f:  # opens the file and closes it automatically
    print(f.read())

The Zen of Python, by Tim Peters

Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.
Now is better than never.
Although never is often better than *right* now.
If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -- let's do more of those!


Ways of reading:

In [11]:
with open('data/zen_of_python.txt') as f:
    print(f.read(17))

The Zen of Python


In [13]:
with open('data/zen_of_python.txt') as f:
    print(f.readline())

The Zen of Python, by Tim Peters



In [14]:
with open('data/zen_of_python.txt') as f:
    print(f.readline())
    print(f.readline())
    print(f.readline())

The Zen of Python, by Tim Peters



Beautiful is better than ugly.



In [17]:
with open('data/zen_of_python.txt') as f:
    line = f.readline()
    while line:
        print(line, end="")  # end="" avoids doubled line breaks (one from the file, one from print)
        line = f.readline()

The Zen of Python, by Tim Peters

Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.
Now is better than never.
Although never is often better than *right* now.
If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -- let's do more of those!
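
The `readline()` loop above can be written more compactly, because a text file object is itself an iterator over its lines. The sketch below uses `io.StringIO` as an in-memory stand-in for the opened file, with only the first lines of the text:

```python
import io

# In-memory stand-in for open('data/zen_of_python.txt');
# io.StringIO behaves like a text file object.
f = io.StringIO(
    "The Zen of Python, by Tim Peters\n"
    "\n"
    "Beautiful is better than ugly.\n"
    "Explicit is better than implicit.\n"
)

# Iterating over the file yields one line at a time, including the
# trailing newline, which rstrip("\n") removes.
lines = [line.rstrip("\n") for line in f]

# Empty strings are falsy, so this drops the blank line.
non_empty = [line for line in lines if line]

print(len(lines), len(non_empty))
print(non_empty[-1])
```

Iterating line by line keeps only one line in memory at a time, which matters for large files.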

In [19]:
with open('data/zen_of_python.txt') as f:
    lines = f.readlines()

print(type(lines))
print(lines)

<class 'list'>
['The Zen of Python, by Tim Peters\n', '\n', 'Beautiful is better than ugly.\n', 'Explicit is better than implicit.\n', 'Simple is better than complex.\n', 'Complex is better than complicated.\n', 'Flat is better than nested.\n', 'Sparse is better than dense.\n', 'Readability counts.\n', "Special cases aren't special enough to break the rules.\n", 'Although practicality beats purity.\n', 'Errors should never pass silently.\n', 'Unless explicitly silenced.\n', 'In the face of ambiguity, refuse the temptation to guess.\n', 'There should be one-- and preferably only one --obvious way to do it.\n', "Although that way may not be obvious at first unless you're Dutch.\n", 'Now is better than never.\n', 'Although never is often better than *right* now.\n', "If the implementation is hard to explain, it's a bad idea.\n", 'If the implementation is easy to explain, it may be a good idea.\n', "Namespaces are one honking great idea -- let's do more of those!"]


In [22]:
lines[3:6]  # lines can be reused directly...

['Explicit is better than implicit.\n',
 'Simple is better than complex.\n',
 'Complex is better than complicated.\n']

In [23]:
lines[-1]

"Namespaces are one honking great idea -- let's do more of those!"

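### File and Character Encodings

Not everything can be detected automatically. Without an explicit `encoding`, `open` falls back to a platform-dependent default, which does not match this file: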

In [29]:
with open('data/unicode-bsp.txt') as f:
    print(f.read())

UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 244: character maps to <undefined>

In [30]:
with open('data/unicode-bsp.txt', encoding="utf-8") as f:
    print(f.read())



Original by Markus Kuhn, adapted for HTML by Martin Dürst.

UTF-8 encoded sample plain-text file
‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾

Markus Kuhn [ˈmaʳkʊs kuːn] <mkuhn@acm.org> — 1999-08-20


The ASCII compatible UTF-8 encoding of ISO 10646 and Unicode
plain-text files is defined in RFC 2279 and in ISO 10646-1 Annex R.


Using Unicode/UTF-8, you can write in emails and source code things such as

Mathematics and Sciences:

  ∮ E⋅da = Q,  n → ∞, ∑ f(i) = ∏ g(i), ∀x∈ℝ: ⌈x⌉ = −⌊−x⌋, α ∧ ¬β = ¬(¬α ∨ β),

  ℕ ⊆ ℕ₀ ⊂ ℤ ⊂ ℚ ⊂ ℝ ⊂ ℂ, ⊥ < a ≠ b ≡ c ≤ d ≪ ⊤ ⇒ (A ⇔ B),

  2H₂ + O₂ ⇌ 2H₂O, R = 4.7 kΩ, ⌀ 200 mm

Linguistics and dictionaries:

  ði ıntəˈnæʃənəl fəˈnɛtık əsoʊsiˈeıʃn
  Y [ˈʏpsilɔn], Yen [jɛn], Yoga [ˈjoːgɑ]

APL:

  ((V⍳V)=⍳⍴V)/V←,V    ⌷←⍳→⍴∆∇⊃‾⍎⍕⌈

Nicer typography in plain text files:

  ╔══════════════════════════════════════════╗
  ║                                          ║
  ║   • ‘single’ and “double” quotes         ║
  ║                                          ║
  ║   • C
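
The effect of a mismatched encoding can be reproduced on any platform. The sketch below writes a short UTF-8 file to a temporary directory and reads it back three ways; ASCII is chosen as the wrong encoding only because it fails predictably, and the file name is hypothetical:

```python
import tempfile
from pathlib import Path

text = "Martin Dürst: ∑ f(i) = ∏ g(i)"

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "sample.txt"  # hypothetical file name
    path.write_text(text, encoding="utf-8")

    # Matching encoding: the text round-trips unchanged.
    with open(path, encoding="utf-8") as f:
        ok = f.read()

    # Wrong encoding with the default strict error handling: decoding fails.
    try:
        with open(path, encoding="ascii") as f:
            f.read()
        failed = False
    except UnicodeDecodeError as exc:
        failed = True
        print("decode failed:", exc.reason)

    # errors="replace" substitutes U+FFFD for undecodable bytes.
    # Useful for a first look, but the original characters are lost.
    with open(path, encoding="ascii", errors="replace") as f:
        lossy = f.read()

print(ok == text)
print(lossy)
```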

## Reading JSON

JavaScript Object Notation: essentially dictionaries stored in files.

In [24]:
import json
with open('data/movie.json') as f:
    content = json.load(f)
    print(content)

{'Title': 'Bicentennial Man', 'Release Date': 'Dec 17 1999', 'MPAA Rating': 'PG', 'Running Time min': 132, 'Distributor': 'Walt Disney Pictures', 'Source': 'Based on Book/Short Story', 'Major Genre': 'Drama', 'Creative Type': 'Science Fiction', 'Director': 'Chris Columbus', 'Rotten Tomatoes Rating': 38, 'IMDB Rating': 6.4, 'IMDB Votes': 28827}


In [25]:
print(f"{content['Title']} directed by {content['Director']}")

Bicentennial Man directed by Chris Columbus
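
Unlike `csv`, the `json` module already converts values to Python types while reading. A minimal sketch using an abbreviated version of the record above, parsed from a string with `json.loads`; the key `"Budget"` is hypothetical and deliberately absent:

```python
import json

# Abbreviated version of the movie record, as a JSON string.
raw = (
    '{"Title": "Bicentennial Man", "Running Time min": 132, '
    '"IMDB Rating": 6.4, "Director": "Chris Columbus"}'
)

# loads() parses a string; load() reads from an open file object.
movie = json.loads(raw)

# JSON objects become dicts, JSON numbers become int or float.
print(type(movie).__name__)
print(type(movie["Running Time min"]).__name__, type(movie["IMDB Rating"]).__name__)

# .get() returns None instead of raising KeyError for a missing field.
budget = movie.get("Budget")  # hypothetical key, not part of the record
print(budget)

# dumps() serializes back to text; parsing that text gives an equal dict.
text = json.dumps(movie, indent=2)
print(json.loads(text) == movie)
```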
