# Pandas -- "Excel for Python, but better"

* Series
* DataFrame
* Data selection
* Conditional selection
* Data input and output
* Missing data
* GroupBy
* Merging, Joining, Concatenation
* Operations

Installation (already included in Anaconda)

    conda install pandas
or
    
    pip install pandas

# Series

Series is a pandas data type for handling sequence-like data (such as time series).

A Series can be created from a Python list, a NumPy array, or a Python dictionary.

In [None]:
import pandas as pd

pd.Series([10, 20, 30, 40, 50])

In [None]:
import numpy as np

pd.Series(np.array([10, 20, 30, 40, 50]))

In [None]:
md = {'a': 10, 'b': 20, 'c': 30, 'd': 40, 'e': 50}

pd.Series(md)

In [None]:
labels = ['n', 'k', 'j', 'x', 'y']

pd.Series([10, 20, 30, 40, 50], index=labels)

Indices allow fast access to the data.
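As a small illustration of label-based lookup, the sketch below builds a Series with state labels like the ones used in this notebook. The variable names and values are made up for the example.

```python
import pandas as pd

# Illustrative data; the labels act as the index
s = pd.Series([43, 12, 54], index=['KY', 'NY', 'OR'])

by_label = s['NY']           # single value looked up by its label
subset = s[['KY', 'OR']]     # a list of labels returns a smaller Series
has_label = 'WY' in s        # membership test checks the index, not the values

print(by_label, has_label)
print(subset)
```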

In [None]:
data = [43, 12, 54, 12, 56]
labels = ['KY', 'NY', 'OR', 'WY', 'AL']
s1 = pd.Series(data, labels)
s1

In [None]:
data = [65, 34, 56, 23, 56]
labels = ['KY', 'WY', 'FL', 'TN', 'AL']
s2 = pd.Series(data, labels)
s2

In [None]:
s3 = s1 + s2; s3
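Arithmetic between two Series aligns on the index labels, not on position. The following minimal sketch (with made-up values) shows that labels present in only one operand yield NaN, and that `Series.add` with `fill_value` is one way to treat the missing side as zero.

```python
import pandas as pd

a = pd.Series([1, 2, 3], index=['KY', 'NY', 'AL'])
b = pd.Series([10, 20, 30], index=['KY', 'FL', 'AL'])

# KY and AL exist in both Series; NY and FL exist in only one, so they become NaN
total = a + b

# Same addition, but a label missing on one side is treated as 0
filled = a.add(b, fill_value=0)

print(total)
print(filled)
```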

### Series can hold arbitrary data
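A minimal sketch of this point, with illustrative values: when the elements have different Python types, pandas stores them in a Series with the generic `object` dtype.

```python
import pandas as pd

# Integers, strings, floats and even lists can live in the same Series
mixed = pd.Series([1, 'two', 3.0, [4, 5]], index=['a', 'b', 'c', 'd'])

print(mixed.dtype)  # object

# Each element keeps its own Python type
type_names = [type(value).__name__ for value in mixed]
print(type_names)
```

Numeric operations on such a Series are slower and less predictable than on a homogeneous one, so mixed content is usually best kept for labels or metadata.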

### Missing values in Series

In [None]:
s3.dropna()

In [None]:
s3.fillna(0.0)

In [None]:
s3.fillna(s3.mean())
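Before dropping or filling, it can help to check where values are missing. The sketch below uses a small hand-made Series (not the `s3` from above) together with `isna()`, and repeats the drop and fill calls so the effect is visible on known numbers.

```python
import numpy as np
import pandas as pd

s = pd.Series([108.0, np.nan, 24.0, np.nan], index=['KY', 'FL', 'WY', 'TN'])

missing_mask = s.isna()               # True where a value is missing
n_missing = int(missing_mask.sum())   # True counts as 1

dropped = s.dropna()                  # keeps only KY and WY
filled_mean = s.fillna(s.mean())      # mean() ignores NaN: (108 + 24) / 2

print(n_missing)
print(filled_mean)
```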

In [None]:
s3.head()

In [None]:
s3.tail()

# DataFrame

In [None]:
from numpy.random import randn
import pandas as pd

df = pd.DataFrame(randn(5, 4), columns='A B C D'.split(), index='WY KY WS AL FL'.split()); df

### Data selection

In [None]:
df['A']

In [None]:
df['A']['WY']

In [None]:
df[['A', 'B']]

In [None]:
df.loc['WY']

In [None]:
df.loc['WY']['A']

Caution, a pitfall: calling 'loc()' instead of indexing with 'loc[]' does not return the data but a 'loc' indexer object.
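The difference can be seen on a tiny DataFrame. This is a sketch with made-up data; the exact class name of the indexer object is a pandas implementation detail and may differ between versions.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2], 'B': [3, 4]}, index=['WY', 'KY'])

row = df.loc['WY']    # square brackets: returns the row as a Series
indexer = df.loc()    # parentheses: returns an indexer object, no data

print(type(row).__name__)
print(type(indexer).__name__)
```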

In [None]:
df.loc['WY', 'A']

Caution, a pitfall: a subset of rows and columns can be obtained with df.loc[], but not by indexing the DataFrame object itself, as in df[['WY', 'KY'], ['A', 'B']]

In [None]:
df.loc[['WY', 'KY'], ['A', 'B']]
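The following sketch contrasts the two forms on a small made-up DataFrame. The failing call is wrapped in `try`/`except`; the broad `except Exception` is deliberate, since the exact exception type raised here depends on the pandas version.

```python
import pandas as pd

df = pd.DataFrame(
    {'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]},
    index=['WY', 'KY', 'AL'],
)

# Works: rows first, then columns
subset = df.loc[['WY', 'KY'], ['A', 'B']]

# Does not work: plain [] indexing does not accept a (rows, columns) pair
try:
    df[['WY', 'KY'], ['A', 'B']]
    plain_indexing_failed = False
except Exception as exc:
    plain_indexing_failed = True
    print(type(exc).__name__)

print(subset)
```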

.iloc[] -- select rows by position

In [None]:
df

In [None]:
df.iloc[0]

In [None]:
df.iloc[0][['A', 'B']]
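`iloc` also accepts a row position and a column position together, as well as slices. A short sketch on made-up data follows; note that position slices exclude the end, while label slices with `loc` include it.

```python
import pandas as pd

df = pd.DataFrame(
    {'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]},
    index=['WY', 'KY', 'AL'],
)

cell = df.iloc[0, 1]           # first row, second column
block = df.iloc[0:2, 0:2]      # rows 0-1 and columns 0-1 (end excluded)
last_row = df.iloc[-1]         # negative positions count from the end
by_label = df.loc['WY':'KY']   # label slice: both endpoints included

print(cell)
print(block)
```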

# Conditional selection

In [None]:
df

In [None]:
df > 0

In [None]:
df[df > 0]

In [None]:
df[df['D'] > 0]

In [None]:
df[(df['D'] > 0) & (df['B'] < 0)]

Caution: combine conditions only with the bitwise operators (|, &, ~), never with or/and/not! The parentheses around each condition are required.
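A minimal sketch of the rule on fixed, made-up numbers (the random data above changes on every run): `&`, `|` and `~` work element by element, whereas `and` asks for a single truth value of the whole Series and raises a `ValueError`.

```python
import pandas as pd

df = pd.DataFrame(
    {'B': [-1.0, 2.0, -3.0], 'D': [0.5, 1.5, -0.5]},
    index=['WY', 'KY', 'AL'],
)

both = df[(df['D'] > 0) & (df['B'] < 0)]     # D positive AND B negative
either = df[(df['D'] > 0) | (df['B'] < 0)]   # D positive OR B negative
negated = df[~(df['D'] > 0)]                 # NOT (D positive)

# The Python keyword 'and' cannot be applied to whole Series
try:
    df[(df['D'] > 0) and (df['B'] < 0)]
    and_failed = False
except ValueError:
    and_failed = True

print(both)
print(and_failed)
```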

# Group by

In [None]:
import pandas as pd

data = {'Company':['GOOG','FB','MSFT','MSFT','GOOG','FB'],
       'Person':['Bob','Charlie','Sam','Vanessa','Charlie','Alice'],
       'Sales':[300,100,300,224,113,351]}

df = pd.DataFrame(data)

df

In [None]:
# numeric_only=True skips the non-numeric 'Person' column (needed in pandas 2.x)
df.groupby('Company').mean(numeric_only=True)

In [None]:
df.groupby('Company').std(numeric_only=True)

In [None]:
df.groupby('Company').count()

In [None]:
df.groupby('Company').describe()
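Several aggregations can also be computed in one call. The sketch below rebuilds the same sales data, selects the `Sales` column and passes a list of aggregation names to `agg`; the variable name `summary` is chosen for illustration.

```python
import pandas as pd

data = {'Company': ['GOOG', 'FB', 'MSFT', 'MSFT', 'GOOG', 'FB'],
        'Person': ['Bob', 'Charlie', 'Sam', 'Vanessa', 'Charlie', 'Alice'],
        'Sales': [300, 100, 300, 224, 113, 351]}
df = pd.DataFrame(data)

# One row per company, one column per aggregation
summary = df.groupby('Company')['Sales'].agg(['mean', 'sum', 'max'])

print(summary)
```

Selecting the column before aggregating avoids any problem with the non-numeric `Person` column.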

# Operations

In [None]:
import pandas as pd
df = pd.DataFrame({'A':[1,2,3,4],'B':[444,555,666,444],'C':['abc','def','ghi','xyz']})
df.head()

In [None]:
df['B'].unique()

In [None]:
df['B'].nunique()

In [None]:
df['B'].value_counts()

In [None]:
df['Q'] = df['B'].apply(lambda n: n**2)

In [None]:
df

In [None]:
del df['Q']
df

In [None]:
df.sort_values(by='B')
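For simple arithmetic such as the squaring done with `apply` above, the same result can be obtained with a vectorized expression, which is usually faster on large data. The sketch below also sorts in descending order; it reuses the small frame from this section without column `C`.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4], 'B': [444, 555, 666, 444]})

squared_apply = df['B'].apply(lambda n: n ** 2)   # element-wise Python function
squared_vec = df['B'] ** 2                        # vectorized equivalent

same = bool((squared_apply == squared_vec).all())

# Largest B first; sort_values returns a new DataFrame and leaves df unchanged
descending = df.sort_values(by='B', ascending=False)

print(same)
print(descending)
```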

# Data input and output

### From the web: pandas parses HTML tables on its own

In [None]:
tables = pd.read_html('http://www.fdic.gov/bank/individual/failed/banklist.html')
df = tables[0]

In [None]:
df.info()

In [None]:
df.head()

# From Excel

In [None]:
# The keyword is sheet_name; the old spelling 'sheetname' is no longer accepted
pd.read_excel('Mappe1.xlsx', sheet_name='Tabelle1')

# To Excel

In [None]:
df.to_excel('Mappe1Neu.xlsx', sheet_name='Tabelle1')

# From CSV

In [None]:
df = pd.read_csv('Mappe1.csv', sep=';', decimal=',')
df

# To CSV

In [None]:
df.to_csv('Mappe1Neu.csv', sep=';', decimal=',')
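The `sep=';'` and `decimal=','` arguments match the CSV dialect that a German-locale Excel typically produces. The sketch below shows their effect without needing a file, by reading from an in-memory string; the column names and values are invented for the example.

```python
import io

import pandas as pd

# Semicolon-separated text with decimal commas
csv_text = "name;price\napple;1,5\npear;2,25\n"

df = pd.read_csv(io.StringIO(csv_text), sep=';', decimal=',')
print(df.dtypes)   # price is parsed as a float column

# Writing back with the same settings restores the decimal commas;
# index=False leaves out the row index column
out = df.to_csv(sep=';', decimal=',', index=False)
print(out)
```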

# SQL

In [None]:
from sqlalchemy import create_engine
engine = create_engine('sqlite:///:memory:')
df.to_sql('data', engine)
sql_df = pd.read_sql('data', con=engine)
sql_df
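The same round trip can be sketched with only the standard-library `sqlite3` module, which pandas also accepts as a connection. The table name `sales_data` and the data are assumptions made for this example; passing `index=False` keeps the DataFrame index out of the table.

```python
import sqlite3

import pandas as pd

df = pd.DataFrame({'Company': ['GOOG', 'FB', 'MSFT'],
                   'Sales': [300, 100, 224]})

conn = sqlite3.connect(':memory:')         # throwaway in-memory database
df.to_sql('sales_data', conn, index=False)

# read_sql also accepts a full SQL query instead of a table name
big = pd.read_sql('SELECT Company, Sales FROM sales_data WHERE Sales > 200', conn)
conn.close()

print(big)
```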

In [None]:
from sqlalchemy import create_engine
engine = create_engine('mysql://pcdb:pcdb@192.168.254.158:3306/pcdb')
sql_df = pd.read_sql('select * from pcdb_puppet_factsets', con=engine)
sql_df

# JSON

In [None]:
sql_df.to_json()
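The layout of the JSON output depends on the `orient` argument. The sketch below uses a small made-up DataFrame rather than the database result and parses the strings with the standard `json` module to make the structure visible.

```python
import json

import pandas as pd

df = pd.DataFrame({'A': [1, 2], 'C': ['abc', 'def']})

# Default layout: column -> {index label -> value}
default_json = df.to_json()

# 'records' layout: one JSON object per row
records_json = df.to_json(orient='records')

by_column = json.loads(default_json)
records = json.loads(records_json)

print(default_json)
print(records_json)
```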

# Dictionary

In [None]:
sql_df.to_dict()
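`to_dict` takes an `orient` argument as well. A brief sketch on made-up data shows three common layouts.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2], 'C': ['abc', 'def']}, index=['x', 'y'])

by_column = df.to_dict()                    # column -> {index label -> value}
by_row = df.to_dict(orient='records')       # list with one dict per row
as_lists = df.to_dict(orient='list')        # column -> list of values

print(by_column)
print(by_row)
print(as_lists)
```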