# Notebook 6.2: Pandas

Pandas is a library for working with `DataFrames`. This is yet another type of object to learn in Python, but it is an intuitive data structure that is closely connected to `numpy`'s ndarray objects, which makes it easier to learn. Essentially, you can think of a pandas DataFrame as a wrapper around an array object that adds column and row names and allows you to access elements by name instead of only by index.
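
As a minimal sketch of the "wrapper around an array" idea, the cell below builds a small ndarray and wraps it in a DataFrame. The row and column names (`row0`, `row1`, `x`, `y`, `z`) are made up for illustration.

```python
import numpy as np
import pandas as pd

# a plain 2 x 3 ndarray: elements are reachable only by position
arr = np.arange(6).reshape(2, 3)

# the same values wrapped in a DataFrame with hypothetical row and column names
df = pd.DataFrame(arr, index=["row0", "row1"], columns=["x", "y", "z"])

by_position = arr[1, 2]          # ndarray: row 1, column 2
by_name = df.loc["row1", "z"]    # DataFrame: the same element, selected by labels

print(by_position, by_name)
```

Both selections return the same element; the DataFrame simply lets you ask for it by name.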

This notebook will simply reiterate many concepts from your assigned reading.

Read more about pandas in the documentation: https://pandas.pydata.org/pandas-docs/stable/dsintro.html

### Required software

In [None]:
# conda install numpy -c conda-forge
# conda install pandas -c conda-forge

In [None]:
import numpy as np
import pandas as pd

In [None]:
# let's define some ndarrays
arr0 = np.zeros(10)
arr1 = np.ones(10)
arr2 = np.arange(10)
arr3 = np.array(list("abcdefghij"))

## 1-dimensional Series objects
The first datatype in pandas is the `Series`. A Series is simply a 1-dimensional array with the option of attaching a name to it. When displayed, a Series shows its `index` to the left of the values and its dtype below them. Notice that the argument for entering a name is `name`, not `names`: a Series is 1-dimensional and thus has only a single name.

In [None]:
pd.Series(data=arr0, name='arr0')

In [None]:
pd.Series(np.random.randint(0, 10, 5), index=['a', 'b', 'c', 'd', 'e'])

In [None]:
# you can create a series with a custom index by using a dict as the data entry
ddict = {'a': 3, 'b': 4, 'c': 5}
pd.Series(ddict, name="values")

In [None]:
# but it is sometimes clearer to enter the data, name, and index separately
pd.Series(data=arr0, name='arr0', index=list("abcdefghij"))
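
The pieces described above (name, index, dtype, and the underlying array) can also be read back from a Series as attributes. This is a small sketch; the name `example` and the three-letter index are arbitrary choices.

```python
import numpy as np
import pandas as pd

# a short Series with a name and a custom index
s = pd.Series(data=np.zeros(3), name="example", index=list("abc"))

print(s.name)         # the single name attached to the Series
print(s.dtype)        # the dtype of the stored values
print(list(s.index))  # the index labels
print(s.to_numpy())   # the values as a plain ndarray
```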

### A Series can be indexed like an ndarray
It can also be indexed like a Python dictionary, using the labels in its `index` as keys.

In [None]:
# create a Series
ddict = {'a': 3, 'b': 4, 'c': 5, 'd':10, 'e': 40}
data = pd.Series(ddict)

In [None]:
# select a slice of the Series by position, like an ndarray
print(data[1:3])

In [None]:
# select a value by its index label, like a dict key
print(data['a'])

In [None]:
# select a value by its index label, as an attribute
print(data.e)
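
Plain square brackets accept both positions and labels, which can be ambiguous. A more explicit alternative, sketched here on the same dictionary as above, is to use `.iloc` for positions and `.loc` for labels. Note that a label slice includes its end point, while a positional slice does not.

```python
import pandas as pd

data = pd.Series({'a': 3, 'b': 4, 'c': 5, 'd': 10, 'e': 40})

# positional slice: positions 1 and 2 (the end position is excluded)
by_position = data.iloc[1:3]

# label slice: labels 'b' through 'c' (the end label is included)
by_label = data.loc['b':'c']

print(by_position)
print(by_label)
```

Both selections return the same two elements here, labeled `b` and `c`.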

## 2-dimensional DataFrame objects

Although `Series` objects are useful, the `DataFrame` class is used much more frequently. You can think of each column of a DataFrame as a separate Series object. The `DataFrame` constructor takes the argument `columns` to set the column names. There are several ways to create a DataFrame: from dictionaries, ndarrays, Series, or lists of these (almost too many options...).

In [None]:
pd.DataFrame(arr0, columns=['arr0'])

In [None]:
pd.DataFrame({'arr0': arr0, 'arr1': arr1})

In [None]:
pd.DataFrame(
    data=[
        pd.Series(arr0, name='arr0'),
        pd.Series(arr1, name='arr1'),
    ])

In [None]:
pd.DataFrame(
    data=[arr0, arr1, arr2], 
    index=['arr0', 'arr1', 'arr2'],
    )

In [None]:
# Transpose a dataframe (.T)
pd.DataFrame(
    data=[arr0, arr1, arr2], 
    index=['arr0', 'arr1', 'arr2'],
    ).T
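
To see that each column of a DataFrame really is a Series, you can select one and inspect it. This sketch uses a small two-column frame; the column names simply echo the arrays defined earlier.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"arr0": np.zeros(3), "arr2": np.arange(3)})

# selecting a single column returns a Series named after that column
col = df["arr2"]
print(type(col).__name__)
print(col.name)

# shape is (rows, columns); transposing swaps the two
print(df.shape, df.T.shape)
```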

### Parsing CSV files
This is one of the best uses of pandas, and it can spare you from ever having to open an Excel spreadsheet again. Pandas has a flexible framework for reading data tables stored in a wide variety of formats. The most commonly used format for storing data tables is CSV, which stands for comma-separated values. We've seen this type of file before when we were working with the iris data set in our first few lessons. Other common types include TSV (tab-separated values) and XLS, which is the proprietary format of Microsoft Excel. In general, for storing data that you want others to *use* and not just look at, CSV or TSV is much preferred to XLS.

In [None]:
# load a CSV file by reading it straight from a url
data = pd.read_csv(
    "http://eaton-lab.org/data/iris-data-dirty.csv", 
    header=None) 

data.head()

In [None]:
# set new column names
data.columns = ["trait1", "trait2", "trait3", "trait4", "label"]
data.head()

In [None]:
# save as a CSV file
data.to_csv("iris-data-dirty.csv")

In [None]:
# open it again by parsing the CSV file on disk this time
data = pd.read_csv("iris-data-dirty.csv", index_col=0)
data.head()
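
The same functions handle tab-separated files if you pass `sep="\t"`. The sketch below reads a tiny toy table from an in-memory string (so it does not depend on any file or URL) and writes it back out as TSV. The table contents and variable names are placeholders for illustration.

```python
import io
import pandas as pd

# a toy two-row table written as tab-separated text
tsv_text = (
    "trait1\ttrait2\tlabel\n"
    "5.1\t3.5\tIris-setosa\n"
    "7.0\t3.2\tIris-versicolor\n"
)

# read_csv parses TSV when told that the separator is a tab
table = pd.read_csv(io.StringIO(tsv_text), sep="\t")

# write it back out as TSV; index=False leaves out the row index
buffer = io.StringIO()
table.to_csv(buffer, sep="\t", index=False)

# reading the written text again gives back the same table
roundtrip = pd.read_csv(io.StringIO(buffer.getvalue()), sep="\t")
print(roundtrip)
```

Passing `index=False` when saving is an alternative to passing `index_col=0` when loading, since the row index is then never written to the file.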

### Operate on the DataFrame
Here we will replace some typos in the data.

In [None]:
# find all unique values in the 'label' column using .unique
print(data.label.unique())

In [None]:
# RETURNS a modified COPY
data2 = (data
    .replace("Iris-setsa", "Iris-setosa")
    .replace("Iris-versicolour", "Iris-versicolor")
)

In [None]:
# Modifies in place (less desirable, can't chain functions)
data.replace("Iris-setsa", "Iris-setosa", inplace=True)
data.replace("Iris-versicolour", "Iris-versicolor", inplace=True)
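
When there are several typos to fix, one possible variation is to collect them in a dictionary and apply it to the one column that needs it. This sketch uses a toy frame rather than the iris data; `toy`, `fixes`, and `cleaned` are illustrative names.

```python
import pandas as pd

# a toy frame containing the same two typos
toy = pd.DataFrame({
    "trait1": [5.1, 7.0, 6.3],
    "label": ["Iris-setsa", "Iris-versicolour", "Iris-virginica"],
})

# map each misspelled label to its correction
fixes = {
    "Iris-setsa": "Iris-setosa",
    "Iris-versicolour": "Iris-versicolor",
}

# returns a modified copy; the original frame is left unchanged
cleaned = toy.assign(label=toy["label"].replace(fixes))

print(cleaned.label.unique())
```

Restricting the replacement to the `label` column avoids touching values in other columns by accident.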

In [None]:
# return a formatted table of a calculation
pd.DataFrame({
    "trait1": pd.Series({
        "mean": data.trait1.mean(), 
        "std": data.trait1.std()
    }), 
    "trait2": pd.Series({
        "mean": data.trait2.mean(), 
        "std": data.trait2.std(),
    }),
})

In [None]:
data.columns

In [None]:
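# numeric_only=True skips the string 'label' column, which has no mean or std
pd.DataFrame({
    "mean": data.mean(numeric_only=True),
    "std": data.std(numeric_only=True),
})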

In [None]:
# stats summary method 
data.describe()
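
A similar summary table can be sketched with `.agg`, which applies several functions at once, and the `axis` argument controls whether a calculation runs down columns or across rows. The toy values below are invented so the results are easy to check by hand.

```python
import pandas as pd

toy = pd.DataFrame({
    "trait1": [1.0, 2.0, 3.0],
    "trait2": [2.0, 4.0, 6.0],
    "label": ["a", "b", "c"],
})

# keep only the numeric columns before computing statistics
numeric = toy.select_dtypes(include="number")

# one row per statistic, one column per trait
summary = numeric.agg(["mean", "std"])
print(summary)

# axis=1 computes across the columns, giving one value per row
row_means = numeric.mean(axis=1)
print(row_means)
```

Calling `summary.T` gives the orientation used earlier, with one row per trait and one column per statistic.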

### So much more
Pandas is a very powerful library, and your reading introduces a huge range of ways to use it. We'll continue using pandas and numpy very extensively in the coming weeks, so make sure you have a good grasp of their basics (how to index, slice, and access views of these objects) and some idea of their larger capabilities (reading in tables, calculating statistics, performing operations across axes). 

See the official documentation and further tutorials:
https://pandas.pydata.org/pandas-docs/version/1.0.0/getting_started/10min.html