# 19) Learning more about Pandas

Related references:

- [Python Data Science Handbook](https://jakevdp.github.io/PythonDataScienceHandbook/index.html)
- [Python for Data Analysis, 2nd Edition](https://www.safaribooksonline.com/library/view/python-for-data/9781491957653/) 

### Connecting the last few lectures

The last lectures on databases have some commonalities with Pandas DataFrames. When would you prefer to use a database?


## Arithmetic and Data Alignment

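Arithmetic between DataFrames behaves like an automatic outer join on the index labels: the result carries the union of the row and column labels, with NaN wherever a label is missing from either input. We did some work with this in the last Pandas lecture; let's have a little review before moving on.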

In [None]:
import pandas as pd
import numpy as np

df1 = pd.DataFrame(np.arange(12.).reshape((3, 4)),
                   columns=list('abcd'))
df2 = pd.DataFrame(np.arange(20.).reshape((4, 5)),
                   columns=list('abcde'))
df2.loc[1, 'b'] = np.nan
df1

In [None]:
df2

In [None]:
df1 + df2

To pass extra arguments such as `fill_value`, use the arithmetic methods (`add`, `sub`, and so on) rather than the operator symbols. For example:

In [None]:
df1.add(df2, fill_value=0)
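
To make the alignment explicit, here is a minimal sketch using two small hypothetical frames (`left` and `right` are illustrative names, not objects used elsewhere in these notes) whose labels only partly overlap. With `+`, every position that is not present in both frames becomes NaN; with `add(..., fill_value=0)`, a value missing from only one frame is treated as 0 before adding.

```python
import numpy as np
import pandas as pd

# Hypothetical frames: they share only row label 1 and column label 'b'.
left = pd.DataFrame({"a": [1.0, 2.0], "b": [3.0, 4.0]}, index=[0, 1])
right = pd.DataFrame({"b": [10.0, 20.0], "c": [30.0, 40.0]}, index=[1, 2])

# Operator form: the result has the union of labels, NaN where either side is missing.
plain = left + right

# Method form: a value missing on one side only is replaced by 0 before adding.
filled = left.add(right, fill_value=0)

print(plain)
print(filled)
```

Positions missing from both frames, such as row 0 of column `c`, stay NaN even with `fill_value`.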

### Pandas DataFrame arithmetic

The `r*` version of each method performs the same operation with the operands swapped. For example,
`df1.rdiv(other, axis='index')` computes `other / df1`, which is equivalent to `other.div(df1, axis='index')`.

| Method | Description |
|-----|-----|
| add, radd | Methods for addition (+) |
| sub, rsub | Methods for subtraction (-) |
| div, rdiv | Methods for division (/) |
| floordiv, rfloordiv | Methods for floor division (//) |
| mul, rmul | Methods for multiplication (\*) |
| pow, rpow | Methods for exponentiation (\*\*) |
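
As a quick check of the reversed methods, the sketch below compares `1 / df` with `df.rdiv(1)` and `df.div(1)`. The one-column frame `df` is a hypothetical example.

```python
import pandas as pd

# Hypothetical one-column frame.
df = pd.DataFrame({"x": [1.0, 2.0, 4.0]})

reciprocal_op = 1 / df          # operator form: scalar on the left
reciprocal_method = df.rdiv(1)  # reversed method: also computes 1 / df
unchanged = df.div(1)           # regular method: computes df / 1

print(reciprocal_method)
```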

### Operations between series and dataframes: remember broadcasting!

Let's look at an example:

In [None]:
frame = pd.DataFrame(np.arange(12.).reshape((4, 3)),
                     columns=list('bde'),
                     index=['Utah', 'Ohio', 'Texas', 'Oregon'])
frame

In [None]:
series = frame.iloc[0]
series

What do you expect from the following?

In [None]:
frame - series

In [None]:
series2 = pd.Series(range(3), index=['b', 'e', 'f'])
series2

In [None]:
frame + series2

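By default, arithmetic between a DataFrame and a Series matches the Series index against the DataFrame's columns and broadcasts down the rows. To match on the row index and broadcast across the columns instead, use an arithmetic method with `axis='index'`.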

In [None]:
series3 = frame['d']
frame

In [None]:
series3

In [None]:
frame - series3

In [None]:
frame.sub(series3, axis='index')
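
The three cases can be compared side by side in a smaller sketch. The 2x2 frame `small` and its labels are hypothetical, chosen so the results are easy to check by hand.

```python
import pandas as pd

# Hypothetical 2x2 frame.
small = pd.DataFrame([[1.0, 2.0], [3.0, 4.0]],
                     index=["r1", "r2"], columns=["x", "y"])

row = small.loc["r1"]   # Series indexed by the column labels x, y
col = small["x"]        # Series indexed by the row labels r1, r2

# Default: the Series index is matched against the columns, broadcast down the rows.
by_columns = small - row

# Same default with a row-indexed Series: no labels match, so everything is NaN.
misaligned = small - col

# axis='index': the Series index is matched against the rows instead.
by_rows = small.sub(col, axis="index")

print(by_columns)
print(misaligned)
print(by_rows)
```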

There are many other functions that you might want to perform on dataframes, and quick Internet searches will give you the correct syntax. For example:

In [None]:
frame.min()

In [None]:
frame.min(axis=1)

In [None]:
frame

In [None]:
frame.sort_index()

In [None]:
frame.sort_index(axis=1, ascending=False)

In [None]:
obj = pd.Series([4, np.nan, 7, np.nan, -3, 2])
obj.sort_values()

### Function Application and Mapping

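NumPy ufuncs (element-wise array functions) also work with pandas objects, for example `np.abs(frame)`. To apply a function to each column or row instead, use the DataFrame's `apply` method: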

In [None]:
frame = pd.DataFrame(np.random.randn(4, 3), columns=list('bde'),
                     index=['Utah', 'Ohio', 'Texas', 'Oregon'])
frame
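
A minimal sketch of a ufunc applied to a DataFrame, using fixed values instead of random ones so the result is predictable. The frame `data` is a hypothetical example.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with fixed values.
data = pd.DataFrame([[-1.5, 2.0], [3.0, -4.0]],
                    columns=["b", "d"], index=["Utah", "Ohio"])

# The ufunc is applied element-wise; the result is still a DataFrame
# with the same row and column labels.
absolute = np.abs(data)
print(absolute)
```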

In [None]:
def f(x):
    return x.max() - x.min()

frame.apply(f)

Here the function f, which computes the difference between the maximum and minimum of a Series, is invoked once on each column in 'frame'. The result is a Series having the columns of 'frame' as its index.

If you pass `axis='columns'` to apply, the function will be invoked once per row instead:

In [None]:
frame.apply(f, axis='columns')

The function passed to apply need not return a scalar value; it can also return a Series with multiple values:

In [None]:
def f(x):
    return pd.Series([x.min(), x.max()], index=['min', 'max'])
frame.apply(f)

**Note**: Many of the most common array statistics (like sum and mean) are DataFrame methods, so using apply is not necessary.
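
As an illustration of that note, the range computed with `apply` earlier can also be written with the built-in `max` and `min` methods. This is a sketch on a hypothetical frame `data`; the two results should agree.

```python
import pandas as pd

# Hypothetical frame.
data = pd.DataFrame({"b": [1.0, 4.0, 2.0], "d": [10.0, 7.0, 8.0]})

# Range of each column via apply...
via_apply = data.apply(lambda column: column.max() - column.min())

# ...and via the built-in reductions, with no apply needed.
via_methods = data.max() - data.min()

print(via_methods)
```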

### A personal favorite DataFrame method:

In [None]:
frame2 = frame.describe()
frame2

In [None]:
obj = pd.Series(['a', 'a', 'b', 'c'] * 4)
obj

In [None]:
obj.describe()

### A list of helpful built-in descriptive and summary statistic methods

| Method | Description |
|-----|-----|
| count | Number of non-NA values |
| describe | Compute set of summary statistics for Series or each DataFrame column |
| min, max | Compute minimum and maximum values |
| argmin, argmax | Compute index locations (integers) at which minimum or maximum value obtained, respectively |
| idxmin, idxmax | Compute index labels at which minimum or maximum value obtained, respectively |
| quantile | Compute sample quantile ranging from 0 to 1 |
| sum | Sum of values |
| mean | Mean of values |
| median | Arithmetic median (50% quantile) of values |
| mad | Mean absolute deviation from mean value (removed in pandas 2.0) |
| prod | Product of all values |
| var | Sample variance of values |
| std | Sample standard deviation of values |
| skew | Sample skewness (third moment) of values |
| kurt | Sample kurtosis (fourth moment) of values |
| cumsum | Cumulative sum of values |
| cummin, cummax | Cumulative minimum or maximum of values, respectively |
| cumprod | Cumulative product of values |
| diff | Compute first arithmetic difference (useful for time series) |
| pct_change | Compute percent changes |

These methods skip missing data by default, so no separate NaN-aware versions are needed when there are null (e.g. NaN) entries.

In [None]:
df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5],
                   [np.nan, np.nan], [0.75, -1.3]],
                  index=['a', 'b', 'c', 'd'],
                  columns=['one', 'two'])
df

Calling DataFrame’s sum method returns a Series containing column sums:

In [None]:
df.sum()

In [None]:
df.sum(axis='columns')

Note that NA values are excluded by default. If an entire slice (a row or column in this case) is NA, `mean` returns NaN, while `sum` returns 0 in pandas 0.22 and later. Skipping NA values can be disabled with the `skipna` option:

In [None]:
df.mean(axis='columns', skipna=False)
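
The sketch below spells out how the all-NA row `c` behaves under each option. It assumes pandas 0.22 or later, where the sum of an all-NA slice is 0 and the `min_count` argument is available; the frame is rebuilt here so the block runs on its own.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5],
                   [np.nan, np.nan], [0.75, -1.3]],
                  index=["a", "b", "c", "d"],
                  columns=["one", "two"])

row_means = df.mean(axis="columns")                   # NaN skipped; row c is NaN
row_sums = df.sum(axis="columns")                     # NaN skipped; row c sums to 0.0
strict_sums = df.sum(axis="columns", min_count=1)     # need at least 1 value; row c is NaN
strict_means = df.mean(axis="columns", skipna=False)  # any NaN in a row gives NaN

print(pd.DataFrame({"mean": row_means, "sum": row_sums,
                    "sum_min_count": strict_sums, "mean_no_skip": strict_means}))
```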

### Unique Values, Value Counts, and Membership

Another class of related methods extracts information about the values contained in a one-dimensional Series. To illustrate these, consider this example:

In [None]:
obj = pd.Series(['c', 'a', 'd', 'a', 'a', 'b', 'b', 'c', 'c'])

uniques = obj.unique()
uniques.sort()
uniques

The unique values are returned in order of first appearance rather than in sorted order, but they can be sorted afterwards if needed (`uniques.sort()`, as done above). Relatedly, `value_counts` computes a Series containing value frequencies:

In [None]:
obj.value_counts()

The Series is sorted by count in descending order as a convenience. `value_counts` is also available as a top-level pandas function that can be used with any array or sequence (deprecated since pandas 2.1 in favor of `pd.Series(values).value_counts()`):

In [None]:
my_list = ['c', 'a', 'd', 'a', 'a', 'b', 'b', 'c', 'c']
pd.value_counts(my_list, sort=False)

In [None]:
pd.value_counts(obj.values, sort=True)
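
For the membership part of this section, `isin` performs a vectorized set-membership test and is handy for filtering a Series down to a subset of values. A minimal sketch on the same data (the subset `['b', 'c']` is an arbitrary choice for illustration):

```python
import pandas as pd

obj = pd.Series(["c", "a", "d", "a", "a", "b", "b", "c", "c"])

# Boolean mask: True where the value is one of the listed items.
mask = obj.isin(["b", "c"])

# Use the mask to keep only the matching entries.
subset = obj[mask]

# Frequencies via the Series method.
counts = obj.value_counts()

print(mask.tolist())
print(subset.tolist())
print(counts)
```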

## Data Loading, Storage, and File Formats

pandas features a number of functions for reading tabular data as a DataFrame object, summarized in the table below. `read_csv` and `read_table` are likely the ones you’ll use the most.

| Function | Description
|----------| ----------
| read_csv | Load delimited data from a file, URL, or file-like object; use comma as default delimiter
| read_table | Load delimited data from a file, URL, or file-like object; use tab ('\t') as default delimiter
| read_fwf | Read data in fixed-width column format (i.e., no delimiters)
| read_clipboard | Version of read_table that reads data from the clipboard; useful for converting tables from web pages
| read_excel | Read tabular data from an Excel XLS or XLSX file
| read_hdf | Read HDF5 files written by pandas
| read_html | Read all tables found in the given HTML document
| read_json | Read data from a JSON (JavaScript Object Notation) string representation
| read_msgpack | Read pandas data encoded using the MessagePack binary format (removed in pandas 1.0)
| read_pickle | Read an arbitrary object stored in Python pickle format
| read_sas | Read a SAS dataset stored in one of the SAS system’s custom storage formats
| read_sql | Read the results of a SQL query (using SQLAlchemy) as a pandas DataFrame
| read_stata | Read a dataset from Stata file format
| read_feather | Read the Feather binary file format

To start practicing with this, make a file called ex1.csv in a folder "examples" with these lines:

In [None]:
df = pd.read_csv('examples/ex1.csv')
df

We could also have used read_table and specified the delimiter:

In [None]:
pd.read_table('examples/ex1.csv', sep=',')
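
To try the two calls without a file on disk, the sketch below parses a small comma-separated table from an in-memory string. The contents of `csv_text` are hypothetical stand-in data, not necessarily what `examples/ex1.csv` contains.

```python
from io import StringIO

import pandas as pd

# Hypothetical CSV contents with a header row.
csv_text = (
    "a,b,c,d,message\n"
    "1,2,3,4,hello\n"
    "5,6,7,8,world\n"
    "9,10,11,12,foo\n"
)

# read_csv defaults to a comma delimiter; read_table needs it spelled out.
from_csv = pd.read_csv(StringIO(csv_text))
from_table = pd.read_table(StringIO(csv_text), sep=",")

print(from_csv)
```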

A file will not always have a header row. Consider this file, 'examples/ex2.csv':

To read this file, you have a couple of options. You can allow pandas to assign default column names, or you can specify names yourself:

In [None]:
pd.read_csv('examples/ex2.csv', header=None)

In [None]:
pd.read_csv('examples/ex2.csv', names=['a', 'b', 'c', 'd', 'message'])

Suppose you wanted the message column to be the index of the returned DataFrame. You can indicate either the column at index 4 or the column named 'message' using the `index_col` argument:

In [None]:
names = ['a', 'b', 'c', 'd', 'message']
pd.read_csv('examples/ex2.csv', names=names, index_col='message')
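
The header options can be compared on an in-memory example. In this sketch, `no_header_text` is hypothetical data with no header row, standing in for a file like `examples/ex2.csv`.

```python
from io import StringIO

import pandas as pd

# Hypothetical file contents without a header row.
no_header_text = "1,2,3,4,hello\n5,6,7,8,world\n9,10,11,12,foo\n"
names = ["a", "b", "c", "d", "message"]

# header=None: pandas assigns integer column names 0..4.
default_names = pd.read_csv(StringIO(no_header_text), header=None)

# names=...: supply the column names yourself.
named = pd.read_csv(StringIO(no_header_text), names=names)

# index_col accepts either a column name or a column position.
by_name = pd.read_csv(StringIO(no_header_text), names=names, index_col="message")
by_position = pd.read_csv(StringIO(no_header_text), names=names, index_col=4)

print(by_name)
```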

What about a file with variable whitespace separating the columns, like examples/ex3.txt?

You can pass a regular expression as the delimiter for `read_table`. A run of one or more whitespace characters is matched by the regular expression `\s+`, so we have:

In [None]:
result = pd.read_table('examples/ex3.txt', sep=r'\s+')  # raw string avoids an invalid-escape warning
result

Because there was one fewer column name than the number of data columns, read_table infers that the first column should be the DataFrame’s index in this special case.
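
A self-contained sketch of both behaviors, the whitespace delimiter and the inferred index. The contents of `ws_text` are hypothetical: the header has three names while each data row has four fields.

```python
from io import StringIO

import pandas as pd

# Hypothetical whitespace-separated data; the amount of whitespace varies.
ws_text = (
    "A B C\n"
    "aaa  1   2 3\n"
    "bbb 4 5    6\n"
)

# r'\s+' matches one or more whitespace characters.
parsed = pd.read_table(StringIO(ws_text), sep=r"\s+")

# The header is one name short, so the first field of each row becomes the index.
print(parsed)
```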

What if you have comment lines, like in examples/ex4.csv?

In [None]:
pd.read_csv('examples/ex4.csv', skiprows=[0, 2, 3])

In [None]:
pd.read_csv('examples/ex4.csv', comment='#')
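
Both approaches can be checked against each other on an in-memory example. The contents of `commented_text` are hypothetical, arranged so that lines 0, 2, and 3 are comments.

```python
from io import StringIO

import pandas as pd

# Hypothetical file contents: comment lines at positions 0, 2 and 3.
commented_text = (
    "# a comment line\n"
    "a,b,c,d,message\n"
    "# another comment\n"
    "# yet another comment\n"
    "1,2,3,4,hello\n"
    "5,6,7,8,world\n"
)

# Option 1: skip the comment lines by their line numbers.
with_skiprows = pd.read_csv(StringIO(commented_text), skiprows=[0, 2, 3])

# Option 2: let pandas ignore anything that starts with '#'.
with_comment = pd.read_csv(StringIO(commented_text), comment="#")

print(with_comment)
```

The `comment` form does not depend on knowing the line numbers in advance, which makes it more robust when the file changes.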

Let's handle some missing data, like in examples/ex5.csv:

In [None]:
result = pd.read_csv('examples/ex5.csv', na_values=['NULL'])
result
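
By default pandas already treats a set of common sentinels (such as `NA`, `NULL`, and empty fields) as missing; `na_values` adds to that set and can also be given per column as a dict. A minimal sketch on hypothetical in-memory data (`missing_text` and the chosen sentinels are made up for illustration):

```python
from io import StringIO

import pandas as pd

# Hypothetical contents with three kinds of missing markers: NA, an empty field, NULL.
missing_text = (
    "something,a,b,c,message\n"
    "one,1,2,3,NA\n"
    "two,5,,8,world\n"
    "three,9,10,NULL,foo\n"
)

# Default sentinels: NA, the empty field and NULL all become NaN.
default_na = pd.read_csv(StringIO(missing_text))

# Extra, column-specific sentinels on top of the defaults.
sentinels = {"message": ["foo"], "something": ["two"]}
custom_na = pd.read_csv(StringIO(missing_text), na_values=sentinels)

print(default_na.isna())
print(custom_na.isna())
```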

### Writing Data to Text Format

Data can also be exported to a delimited format. Let's use what we just read in.

In [None]:
result.to_csv('examples/out.csv')

Missing values appear as empty strings in the output. You might want to denote them by some other sentinel value:

In [None]:
result.to_csv('examples/out2.csv', na_rep='NULL')

With no other options specified, both the row and column labels are written. Both of these can be disabled:

In [None]:
result.to_csv('examples/out3.csv', index=False, header=False)

You can also write only a subset of the columns, and in an order of your choosing:

In [None]:
result.to_csv('examples/out4.csv', index=False, columns=['a', 'b', 'c'])
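
To inspect the output without opening the written files, note that `to_csv` returns the CSV text as a string when no path is given. The sketch below uses a small hypothetical frame `table` to show the effect of `na_rep`, `index`, and `columns`.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing value.
table = pd.DataFrame({"a": [1, 5], "b": [2.0, np.nan], "c": [3, 7]})

# No path given: the CSV text is returned instead of written to disk.
with_sentinel = table.to_csv(index=False, na_rep="NULL")

# Only a subset of the columns, in a chosen order.
subset_text = table.to_csv(index=False, columns=["c", "a"])

print(with_sentinel)
print(subset_text)
```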

**Up next**: Visualization!!