## Introduction to Pandas

Pandas is a library for working with tabular data, such as the following table:

| | A | B |
|-|---|---|
|0| A1 | B1 |
|1| A2 | B2 |
|2| A3 | B3 |

In Pandas, we define this table as a `DataFrame` built from a dictionary of columns, where each key is a column name and each value is the sequence of values in that column.

In [None]:
import pandas as pd

In [None]:
table = pd.DataFrame({
    'A': ["A1", "A2", "A3"],
    'B': ["B1", "B2", "B3"]
})

### Two views on a Pandas DataFrame

Rows of the table can be accessed by their index label using `loc`. With the default index, the labels are simply the row numbers.

* This corresponds to viewing the table as a list of rows

In [None]:
for i in range(0, table.shape[0]):
    row = table.loc[i]
    print(row["A"], row["B"])

Columns can be accessed using the column name as the key.

* This corresponds to viewing the table as a dictionary of columns

In [None]:
for col in table.columns:
    print(table[col])
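
The two views can also be made explicit by converting the DataFrame to plain Python objects. The following is a minimal sketch that rebuilds the small example table and uses `DataFrame.to_dict` with two different `orient` values; the variable names `rows` and `columns` are chosen here for illustration.

```python
import pandas as pd

table = pd.DataFrame({
    "A": ["A1", "A2", "A3"],
    "B": ["B1", "B2", "B3"],
})

# View 1: a list of rows, each row a dict mapping column name -> value
rows = table.to_dict(orient="records")
print(rows[0])

# View 2: a dictionary of columns, each column a list of values
columns = table.to_dict(orient="list")
print(columns["A"])
```

Both objects hold the same data; only the nesting order (row first or column first) differs.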

### Indexing 

* Every row has an associated index value

In [None]:
table.index

* The default index consists of the integers `0` to `n - 1`, where `n` is the number of rows
* It is possible to replace the index, for example with a column of the DataFrame or with a separate `pd.Index`

In [None]:
indices = pd.Index(["R1", "R2", "R3"])
table.set_index(indices, inplace=True)

In [None]:
table.loc['R2'] 
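
An existing column can serve as the index as well. The sketch below assumes a hypothetical `ID` column, which is not part of the original table, and passes its name to `set_index`. Without `inplace=True`, `set_index` returns a new DataFrame and leaves the original untouched.

```python
import pandas as pd

table = pd.DataFrame({
    "ID": ["R1", "R2", "R3"],  # hypothetical label column, added for this example
    "A": ["A1", "A2", "A3"],
    "B": ["B1", "B2", "B3"],
})

# The "ID" column becomes the index and is removed from the regular columns
indexed = table.set_index("ID")

print(indexed.loc["R2", "A"])
print(list(indexed.columns))
```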

Note that numerical indexing is still possible using `iloc`.
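
As a small illustration of the difference, the sketch below rebuilds the table with the string index from above and fetches the same row once by label and once by position.

```python
import pandas as pd

table = pd.DataFrame(
    {"A": ["A1", "A2", "A3"], "B": ["B1", "B2", "B3"]},
    index=pd.Index(["R1", "R2", "R3"]),
)

by_label = table.loc["R2"]     # label-based lookup
by_position = table.iloc[1]    # position-based lookup (second row)
print(by_label.equals(by_position))

# Positional slices follow Python conventions and exclude the end position
first_two = table.iloc[0:2]
print(list(first_two.index))
```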

We can also select rows using a boolean mask, that is, a Series of truth values with one entry per row:

In [None]:
selected_rows = (table["A"] == "A1") | (table["B"] =="B2")
print(selected_rows)

In [None]:
table[selected_rows]
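
Masks can be combined in other ways too. The following sketch shows the element-wise operators `&` (and) and `~` (not), as well as `Series.isin` for membership tests; the mask names are chosen for illustration. The parentheses around each comparison are required because `&`, `|` and `~` bind more tightly than `==`.

```python
import pandas as pd

table = pd.DataFrame({
    "A": ["A1", "A2", "A3"],
    "B": ["B1", "B2", "B3"],
})

both = (table["A"] == "A1") & (table["B"] == "B1")  # both conditions hold
not_a1 = ~(table["A"] == "A1")                      # negation of a condition
in_set = table["A"].isin(["A1", "A3"])              # value is one of several

print(table[both])
print(table[not_a1])

# A mask can be combined with a column selection inside loc
print(table.loc[in_set, "B"].tolist())
```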

### Other useful operations



Real-world datasets are often large, so we usually look at a subset of the data to get an overview.

Let's load a larger dataset:

In [None]:
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv'
df = pd.read_csv(url, delimiter=";")

With `df.shape` we see how large the data actually is, as a tuple of the number of rows and the number of columns:

In [None]:
df.shape

With `df.head(n)` we can show the first `n` rows:

In [None]:
df.head(10)
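
The first rows are not always representative, for example when the file is sorted. As a sketch of two alternatives, the block below uses `tail` and `sample` on a small synthetic DataFrame (`df_demo`, a stand-in created here so that the example runs without a download).

```python
import pandas as pd

# Synthetic stand-in data, not the wine dataset
df_demo = pd.DataFrame({"x": range(100), "y": [i * i for i in range(100)]})

last_rows = df_demo.tail(3)                         # the last three rows
random_rows = df_demo.sample(n=5, random_state=0)   # five random rows, fixed seed

print(last_rows)
print(random_rows.shape)
```

Passing `random_state` makes the random selection reproducible between runs.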

`df.describe` gives us a quick overview with a few statistics computed for each column. 

In [None]:
df.describe()
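
The result of `describe` is itself a DataFrame, indexed by the name of each statistic, so individual values can be read with `loc`. A minimal sketch on a small synthetic DataFrame (`df_demo`, created here for illustration):

```python
import pandas as pd

# Synthetic stand-in data, not the wine dataset
df_demo = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0],
    "y": [10.0, 20.0, 30.0, 40.0],
})

summary = df_demo.describe()

print(list(summary.index))            # names of the computed statistics
print(summary.loc["mean", "x"])       # a single statistic for a single column
print(summary.loc[["min", "max"]])    # several statistics for all columns
```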

#### Plotting

Pandas contains convenient plotting functions, which use Matplotlib by default:

In [None]:
df["fixed acidity"].plot.hist()

In [None]:
df.plot.scatter(x="fixed acidity", y="pH")

### Missing data

In many real-world datasets, some values are missing. In Pandas, a missing value is indicated by either `np.nan` or `None`.

In [None]:
import numpy as np
table_with_missing_data = pd.DataFrame({"Col1" : [1, np.nan, None, 3], "Col2": [2, 3, 4, None]})

In [None]:
table_with_missing_data

We can either replace the missing values with real values or simply drop the rows that contain missing data. Both methods return a new DataFrame and leave the original unchanged:

In [None]:
table_with_missing_data.fillna(0)

In [None]:
table_with_missing_data.dropna()
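
Before choosing a strategy, it often helps to count how many values are missing. The sketch below rebuilds the same small table (named `data` here), counts missing values per column with `isna`, fills gaps with each column's mean as one possible replacement value, and drops only the rows where a chosen column is missing.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({"Col1": [1, np.nan, None, 3], "Col2": [2, 3, 4, None]})

# Number of missing values in each column
missing_per_column = data.isna().sum()
print(missing_per_column)

# Replace missing values with the mean of the respective column
filled = data.fillna(data.mean())
print(filled)

# Drop only the rows where "Col1" is missing
kept = data.dropna(subset=["Col1"])
print(kept)
```

Whether filling with the mean is appropriate depends on the data and the analysis; it is shown here only as an alternative to filling with a constant.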