# Introduction to pandas


## Motivation

Why do we use pandas for data processing?

1. Built-in containers (`list`, `tuple`, `dict`, ...) are *terrible* for data analysis.
2. NumPy is good for *homogeneous numerical* data, but tedious for heterogeneous data types (string + numerical data).
3. Pandas has the most convenient data input/output functions.
4. Pandas supports "data wrangling" using the *split-apply-combine* approach. 

Useful additional material includes:

-   The official [user guide](https://pandas.pydata.org/docs/user_guide/index.html).
-   The official [pandas cheat sheet](https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf)
    which nicely illustrates the most frequently used operations.
-   The official [API reference](https://pandas.pydata.org/docs/reference/index.html) with details on
    every pandas object and function.
-   There are numerous tutorials (including videos) available
    on the internet. See [here](https://pandas.pydata.org/docs/getting_started/tutorials.html)
    for a list.

***
## Creating pandas data structures

Pandas has two main data structures:

1.  [`Series`](https://pandas.pydata.org/docs/reference/series.html) 
    represents observations of a *single* variable.
2.  [`DataFrame`](https://pandas.pydata.org/docs/reference/frame.html) 
    is a container for *several* variables, one per column.

*Example: Create Series from 1-dimensional NumPy array*
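A minimal sketch of this example, using made-up values (the array contents and the name `x` are assumptions for illustration only):

```python
import numpy as np
import pandas as pd

# Hypothetical data: a 1-dimensional array with the integers 5, ..., 9
data = np.arange(5, 10)

# Without an explicit index, pandas assigns the default index 0, 1, 2, ...
s = pd.Series(data, name='x')
print(s)
```

The printed output shows the index in the left column and the values in the right column.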

*Example: Create DataFrame from NumPy array*
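One possible version of this example, assuming a small made-up 3-by-2 array and the hypothetical column names `A` and `B`:

```python
import numpy as np
import pandas as pd

# Hypothetical 3-by-2 array: each column becomes one variable
data = np.arange(6).reshape((3, 2))

# Column names are passed via the columns= argument
df = pd.DataFrame(data, columns=['A', 'B'])
print(df)
```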

*Example: Create from dictionary*
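A short sketch with made-up names and ages. The dictionary keys become column names, and each column can have its own data type:

```python
import pandas as pd

# Hypothetical data: one string column and one integer column
data = {
    'name': ['Alice', 'Bob', 'Carol'],
    'age': [31, 45, 27],
}

df = pd.DataFrame(data)

# Each column has its own data type
print(df.dtypes)
```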

***
## Importing data

### Loading text data with NumPy

We can import CSV data without pandas, just using plain NumPy.

-   [`np.loadtxt()`](https://numpy.org/doc/stable/reference/generated/numpy.loadtxt.html):     load data from a text file.
-   [`np.genfromtxt()`](https://numpy.org/doc/stable/reference/generated/numpy.genfromtxt.html): 
    load data from a text file and handle missing data.

*Example: Load character-separated text data from FRED*

```python
# Use files in the local data/ directory
DATA_PATH = '../data'

# Alternatively, uncomment this to load data directly from GitHub
# DATA_PATH = 'https://raw.githubusercontent.com/richardfoltyn/TECH2-H24/main/data'
```
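To keep the following sketch self-contained, it reads from an in-memory string with made-up numbers instead of the actual FRED file. The column names and values are assumptions for illustration only:

```python
import io
import numpy as np

# Made-up stand-in for a small CSV file with a header row
text = """Year,Value
2000,1.5
2001,2.0
2002,2.5
"""

# delimiter=',' splits fields at commas, skiprows=1 skips the header line
data = np.loadtxt(io.StringIO(text), delimiter=',', skiprows=1)

# All values are stored as floats in a single 2-dimensional array
print(data)
```

To read an actual file, pass its path instead of the `io.StringIO` object.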

*Example: Load character-separated text data with missing values*
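The sketch below again uses a made-up in-memory string rather than a real file. One field is left empty, and `np.genfromtxt()` fills it with `NaN`:

```python
import io
import numpy as np

# Made-up data: the value for 2001 is missing
text = """Year,Value
2000,1.5
2001,
2002,2.5
"""

# skip_header=1 skips the header line; empty fields become NaN
data = np.genfromtxt(io.StringIO(text), delimiter=',', skip_header=1)
print(data)
```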


### Loading data with Pandas

Pandas's input/output routines are more powerful than those implemented in NumPy:

-   They support reading and writing numerous file formats.
-   They support heterogeneous data without having to specify
    the data type in advance.
-   They gracefully handle missing values.

The most important routines are:

-   [`read_csv()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html), 
    [`to_csv()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_csv.html): 
    Read or write CSV text files
-   [`read_fwf()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_fwf.html): 
    Read data with fixed field widths, i.e. text data
    that does not use delimiters to separate fields.
-   [`read_excel()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_excel.html), 
    [`to_excel()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_excel.html): 
    Read or write Excel spreadsheets
-   [`read_stata()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_stata.html), 
    [`to_stata()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_stata.html): 
    Read or write Stata's `.dta` files.

*Example: Read data with missing values using pandas's
[`read_csv()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html)*:
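A minimal sketch using a made-up in-memory CSV (names and numbers are hypothetical). Empty fields are read as `NaN`, and the string column needs no special treatment:

```python
import io
import pandas as pd

# Made-up data with one missing age and one missing fare
text = """Name,Age,Fare
Alice,31,7.25
Bob,,71.28
Carol,27,
"""

df = pd.read_csv(io.StringIO(text))
print(df)

# Count missing values in each column
print(df.isna().sum())
```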

<div class="alert alert-info">
<h3> Your turn</h3>
Use the pandas functions listed above to import data from the following files located in the <TT>data/</TT> folder:
<ol>
    <li>titanic.csv</li>
    <li>FRED.xlsx</li>
</ol>

To load Excel files, you need to have the package <TT>openpyxl</TT> installed.
</div>

***
## Viewing data

Functions to get a quick overview of data:

- [`info()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.info.html): show columns and observation counts
- [`head()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.head.html): show first observations
- [`tail()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.tail.html): show last observations
- [`describe()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.describe.html): show summary statistics
- [`value_counts()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.value_counts.html): tabulate categorical data

*Example: data set of passengers on board the Titanic:*

1.  `PassengerId`
2.  `Survived`: indicator whether the person survived
3.  `Pclass`: accommodation class (first, second, third)
4.  `Name`: Name of passenger (last name, first name)
5.  `Sex`: `male` or `female`
6.  `Age`
7.  `Ticket`: Ticket number
8.  `Fare`: Fare in pounds
9.  `Cabin`: Deck + cabin number
10. `Embarked`: Port at which passenger embarked:
    `C` - Cherbourg, `Q` - Queenstown, `S` - Southampton

*Examples using `info()`, `head()`, `tail()`, `describe()`, and `value_counts()`*
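The sketch below applies these functions to a tiny made-up `DataFrame` that borrows a few Titanic column names. The values are hypothetical and not taken from the actual data set:

```python
import pandas as pd

# Made-up data with a subset of the Titanic columns
df = pd.DataFrame({
    'Survived': [0, 1, 1, 0, 1],
    'Pclass': [3, 1, 3, 1, 2],
    'Sex': ['male', 'female', 'female', 'male', 'female'],
    'Age': [22.0, 38.0, 26.0, None, 27.0],
})

df.info()                # columns, non-missing counts and data types
print(df.head(2))        # first 2 observations
print(df.tail(2))        # last 2 observations
print(df.describe())     # summary statistics for numerical columns

# Tabulate a categorical column
counts = df['Sex'].value_counts()
print(counts)
```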

***
## Indexing

Pandas supports three types of indexing:

1.  Indexing by position with `.iloc[]`: same as NumPy
2.  Indexing by label with `.loc[]`: can use almost arbitrary types of data as index (dates, time, personal IDs, etc.)
3.  Indexing with `[]`: can be used to select *either* columns (by label) *or* rows (by label or position)

*Example: Selecting columns*
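A short sketch with made-up data. A single label returns a `Series`, whereas a list of labels returns a `DataFrame`:

```python
import pandas as pd

# Hypothetical data
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Carol'],
    'Age': [31, 45, 27],
    'Sex': ['female', 'male', 'female'],
})

# Single column label: the result is a Series
ages = df['Age']

# List of column labels: the result is a DataFrame
sub = df[['Name', 'Age']]
print(sub)
```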

*Example: Selecting rows by position*
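A minimal sketch with a made-up single-column `DataFrame`. As in NumPy, positions start at 0, slices exclude the end point, and negative positions count from the end:

```python
import pandas as pd

# Hypothetical data
df = pd.DataFrame({'x': [10, 20, 30, 40, 50]})

# First row, returned as a Series
first = df.iloc[0]

# Rows at positions 1 and 2 (the end point 3 is excluded)
rows = df.iloc[1:3]

# Last row
last = df.iloc[-1]
print(rows)
```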

### Creating and manipulating indices

-   Create a new `Series` or `DataFrame` object with a custom index
    using the `index=` argument.
-   [`set_index()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.set_index.html):
    creates an index from existing columns.
-   [`reset_index()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.reset_index.html): 
    resets the index to its default value, a sequence
    of increasing integers starting at 0.

*Example: Creating Series with custom index*
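A small sketch with made-up values and the hypothetical labels `a`, `b`, and `c`:

```python
import pandas as pd

# Use string labels instead of the default integer index
s = pd.Series([1.0, 2.0, 3.0], index=['a', 'b', 'c'])
print(s)

# Elements can be accessed by label or by position
by_label = s.loc['b']
by_position = s.iloc[1]
```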

#### Manipulating indices

Use 
[`set_index()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.set_index.html)
and
[`reset_index()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.reset_index.html)
to alter an existing `Series` or `DataFrame`.

Important: by default, both methods return a new object and leave the original unchanged. Specify `inplace=True` to modify the existing container instead, or assign the result back to a variable.

*Example: set a new index for Titanic data set*
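The sketch below uses a tiny made-up stand-in for the Titanic data with only two columns, so the names are hypothetical:

```python
import pandas as pd

# Made-up stand-in for the Titanic data
df = pd.DataFrame({
    'PassengerId': [1, 2, 3],
    'Name': ['Alice', 'Bob', 'Carol'],
})

# Returns a new DataFrame, df itself is unchanged
df2 = df.set_index('PassengerId')
columns_before = list(df.columns)

# Modifies df directly
df.set_index('PassengerId', inplace=True)
index_name = df.index.name

# Restore the default index; PassengerId becomes a column again
df.reset_index(inplace=True)
print(df)
```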

<div class="alert alert-info">
<h3> Your turn</h3>
Read in the following data files from the <TT>data/</TT> folder and manipulate the dataframe index:
<ol>
    <li>Read in the file <TT>FRED.csv</TT> and set the column <TT>Year</TT> as the index.</li>
    <li>Read in the file <TT>FRED-monthly.csv</TT> and set the columns <TT>Year</TT> and <TT>Month</TT> as the index</li>
</ol>
Experiment with what happens when you use the <TT>inplace=True</TT> and <TT>append=True</TT> options of <TT>set_index()</TT>.

Restore the original (default) index after you are done.
</div>

### Selecting elements

Recommendation:

1.  Use `df['name']` only to select *columns* and nothing else.
2.  Use `.loc[]` to select by label.
3.  Use `.iloc[]` to select by position.

**Selection by label**
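A minimal sketch with made-up data and a hypothetical integer index. Note that label-based slices include *both* end points:

```python
import pandas as pd

# Hypothetical data with the custom index labels 10, 20, 30, 40
df = pd.DataFrame(
    {'Name': ['Alice', 'Bob', 'Carol', 'Dave'], 'Age': [31, 45, 27, 52]},
    index=[10, 20, 30, 40],
)

# Single row with label 20, returned as a Series
row = df.loc[20]

# Rows with labels 20 to 30, both end points included
sub = df.loc[20:30]

# Single element: row label 30, column label 'Age'
value = df.loc[30, 'Age']
print(sub)
```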

**Selection by position**
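The following sketch uses made-up data with a non-default index to show that `.iloc[]` ignores labels and works with row and column positions only:

```python
import pandas as pd

# Hypothetical data with the custom index labels 10, 20, 30, 40
df = pd.DataFrame(
    {'Name': ['Alice', 'Bob', 'Carol', 'Dave'], 'Age': [31, 45, 27, 52]},
    index=[10, 20, 30, 40],
)

# Single element: first row, second column
cell = df.iloc[0, 1]

# Rows at positions 1 and 2, first column (the end point 3 is excluded)
names = df.iloc[1:3, 0]

# Lists of positions select arbitrary rows and columns
sub = df.iloc[[0, 3], [1]]
print(sub)
```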

**Boolean indexing**

- Use comparison operators (`==`, `!=`, `<`, ...) or methods such as `isin()` to create an array of `True` and `False` values.
- This boolean array can then be used to select a subset of the data.

*Example: Using a single condition*
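A short sketch with made-up data, where the age threshold of 30 is an arbitrary choice for illustration:

```python
import pandas as pd

# Hypothetical data
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Carol', 'Dave'],
    'Age': [38.0, 22.0, 26.0, 54.0],
})

# Boolean Series which is True for observations with Age above 30
mask = df['Age'] > 30
print(mask)

# Keep only the rows for which the mask is True
older = df.loc[mask]
print(older)
```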

*Example: Using multiple conditions*

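- Multiple conditions can be combined using the `&` (logical and) or `|` (logical or) operators. Each condition must be enclosed in parentheses because `&` and `|` bind more tightly than comparison operators.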
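A minimal sketch with made-up data that borrows a few Titanic column names. The particular conditions are arbitrary illustrations:

```python
import pandas as pd

# Hypothetical data
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Carol', 'Dave', 'Eve'],
    'Pclass': [1, 3, 2, 1, 3],
    'Age': [38.0, 22.0, 26.0, 54.0, 4.0],
})

# Logical and: first class AND older than 40
both = df[(df['Pclass'] == 1) & (df['Age'] > 40)]

# Logical or: first class OR younger than 10
either = df[(df['Pclass'] == 1) | (df['Age'] < 10)]

# isin() is a compact alternative to several == conditions joined by |
in_set = df[df['Pclass'].isin([1, 2])]
print(in_set)
```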

*Example: using the  [`query()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.query.html) method*
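The sketch below uses made-up data to show how `query()` accepts conditions as a string. Column names can be used directly, and local Python variables are referenced with the `@` prefix. The variable name `max_age` is hypothetical:

```python
import pandas as pd

# Hypothetical data
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Carol', 'Dave', 'Eve'],
    'Sex': ['female', 'male', 'female', 'male', 'female'],
    'Pclass': [1, 3, 2, 1, 3],
    'Age': [38.0, 22.0, 26.0, 54.0, 4.0],
})

# Conditions are combined with 'and' / 'or' inside the query string
first_old = df.query('Pclass == 1 and Age > 40')

# String values need their own quotes inside the query string
women = df.query('Sex == "female"')

# Reference a local variable using the @ prefix
max_age = 25
young = df.query('Age < @max_age')
print(young)
```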

<div class="alert alert-info">
<h3> Your turn</h3>
Load the Titanic passenger data set <TT>data/titanic.csv</TT> and select the following subsets of data:
<ol>
    <li>Select all passengers with passenger IDs from 10 to 20</li>
    <li>Select the 10th to 20th (inclusive) row of the dataframe</li>
    <li>Using <TT>query()</TT>, select the sub-sample of female passengers aged 30 to 40. Display only the columns <TT>Name</TT>, <TT>Age</TT>, and <TT>Sex</TT> (in that order)</li>
    <li>Repeat the last exercise without using <TT>query()</TT></li>
    <li>Select all men who embarked in Queenstown or Cherbourg</li>
</ol>
</div>