# Introduction to the Panel Data (Pandas) library

## 1. What is Pandas and why do I care?

pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.

pandas does not implement significant modeling functionality outside of linear and panel regression.

https://pandas.pydata.org

It offers functionality that numpy does not (which we will see later).

### 1.1. Pandas Introduction

### 1.2. Pandas vs Numpy
Pandas is more feature-rich than NumPy and thus more complicated; you get a lot of nice features at the expense of slower code. Here is an article explaining this and giving some examples: https://penandpants.com/2014/09/05/performance-of-pandas-series-vs-numpy-arrays/

Note: We will cover indexes later on :)
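
As a small illustration of a feature that NumPy arrays do not have, the sketch below adds two Series whose labels are in a different order. The variable names and values are made up for this example. NumPy combines values by position, while pandas lines values up by label first.

```python
import numpy
import pandas

# Two NumPy arrays are combined purely by position
left_array = numpy.array([10, 20, 30])
right_array = numpy.array([1, 2, 3])
position_sum = left_array + right_array

# Two Series are combined by label, even when the labels are in a different order
left_series = pandas.Series([10, 20, 30], index=["a", "b", "c"])
right_series = pandas.Series([1, 2, 3], index=["c", "b", "a"])
label_sum = left_series + right_series

print(position_sum.tolist())
print(label_sum.to_dict())
```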

## 2. Installation
As with any Python package, you must download and install pandas with your package manager before you can use it. For vanilla Python this is done using pip, and for Anaconda this is done using conda.

In [1]:
! pip install pandas
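
Once the install has finished, one quick way to confirm that it worked is to import the package and print its version. This is only a sanity check; the version string you see depends on your environment.

```python
import pandas

# If the import succeeds, pandas is installed; the version depends on your environment
installed_version = pandas.__version__
print(installed_version)
```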



## 3. Documentation and Useful Links
The official documentation can be found here:
https://pandas.pydata.org/pandas-docs/stable/

The page offers a video giving a quick intro to pandas:
https://youtu.be/_T8LGqJtuGc



## 4. The basic Pandas data types

There are two main data structures used in pandas: Series and DataFrame.


| Dimensions | Name | Description |
|:---------- |:---- |:----------- |
| 1 | Series | 1D labeled homogeneously-typed array |
| 2 | DataFrame | General 2D labeled, size-mutable tabular structure with potentially heterogeneously-typed columns |

Why more than one data structure?

The best way to think about the pandas data structures is as flexible containers for lower dimensional data. For example, DataFrame is a container for Series, and Series is a container for scalars. We would like to be able to insert and remove objects from these containers in a dictionary-like fashion.
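
To make the dictionary-like behavior concrete, here is a minimal sketch that inserts and removes columns of a small DataFrame. The column names `a`, `b`, and `c` are invented for this example.

```python
import pandas

frame = pandas.DataFrame({"a": [1, 2], "b": [3, 4]})

# Insert a new column, just like adding a key to a dictionary
frame["c"] = frame["a"] + frame["b"]

# Remove columns, just like removing keys from a dictionary
removed_column = frame.pop("b")   # removes "b" and returns it as a Series
del frame["a"]                    # removes "a" in place

print(list(frame.columns))
print(removed_column.tolist())
```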

The data structures also provide a natural way to represent and orient data. This is intended to make transformations more intuitive and writing code easier and faster.

For example, with tabular data (DataFrame) it is more semantically helpful to think of the index (the rows) and the columns rather than axis 0 and axis 1. Iterating through the columns of the DataFrame thus results in more readable code.

Consider this pandas code:

``` python
for col in df.columns:
    series = df[col]
    # do something with series
```

Compared to this numpy code:

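``` python
# Iterating over a 2D ndarray directly yields rows, so we index the columns by position
for i in range(ndarray.shape[1]):
    d1array = ndarray[:, i]
    # Do something with the 1D array
```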

# 5. Examples
In the following section we look at some basic examples of using Pandas.

In [3]:
# Don't forget to import our libraries so we can use them!
import numpy
import pandas

## 5.1. Creating basic data types

### 5.1.1. Creating Pandas Series
The documentation for this data structure can be found here:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.html

In [3]:
# Create a Series using a built-in list
pandas_series = pandas.Series([1, 3, 5, 4, 6, 8])
pandas_series

0    1
1    3
2    5
3    4
4    6
5    8
dtype: int64

In [4]:
# Create a Series using a numpy array
numpy_array = numpy.array([1, 3, 5, 4, 6, 8])
pandas_series = pandas.Series(numpy_array)
pandas_series

0    1
1    3
2    5
3    4
4    6
5    8
dtype: int64
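
Both examples above get the default integer labels (0 to 5). Since a Series is a labeled array, we can also supply our own labels. The sketch below shows two ways to do that; the labels and values are invented for illustration.

```python
import pandas

# Supply explicit labels with the index argument
labeled_series = pandas.Series([1, 3, 5], index=["a", "b", "c"])

# When we pass a dictionary, its keys become the labels
sales_series = pandas.Series({"Jim": 500, "Sue": 300, "Michael": 400})

print(labeled_series["b"])
print(sales_series["Sue"])
```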

### 5.1.2. Creating Pandas DataFrames
The documentation for this data structure can be found here: 
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html

In [5]:
# Create a DataFrame using a built-in (nested) list
built_in_array = [[1,2,3,4,5],[6,7,8,9,10]]
pandas_dataframe = pandas.DataFrame(built_in_array)
pandas_dataframe

Unnamed: 0,0,1,2,3,4
0,1,2,3,4,5
1,6,7,8,9,10


In [6]:
# Create a DataFrame using a numpy array
built_in_array = [[1,2,3,4,5],[6,7,8,9,10]]
numpy_array = numpy.array(built_in_array)
pandas_dataframe = pandas.DataFrame(numpy_array)
pandas_dataframe

Unnamed: 0,0,1,2,3,4
0,1,2,3,4,5
1,6,7,8,9,10


In [13]:
# Create a DataFrame using a built-in dictionary
ids = [1,2,3,4,5]
names = ["Jim", "Sue", "Michael", "Kim", "Logan"]
sales = [500, 300, 400, 700, 200]

dictionary = {
    "ID": ids,
    "Name": names,
    "Sales": sales
}

pandas_dataframe = pandas.DataFrame(dictionary, columns=dictionary.keys(), index=names)
pandas_dataframe

Unnamed: 0,ID,Name,Sales
Jim,1,Jim,500
Sue,2,Sue,300
Michael,3,Michael,400
Kim,4,Kim,700
Logan,5,Logan,200
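
After creating a DataFrame it is often worth checking what we actually built. The sketch below uses a shortened version of the dictionary example and inspects the shape, the column names, and the column types.

```python
import pandas

frame = pandas.DataFrame({
    "ID": [1, 2, 3],
    "Name": ["Jim", "Sue", "Michael"],
    "Sales": [500, 300, 400]
})

print(frame.shape)           # (number of rows, number of columns)
print(list(frame.columns))   # the column labels
print(frame.dtypes)          # one type per column, since columns can differ in type
```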


## 5.2. Indexing Pandas data structures (Series and DataFrame)
Pandas gives us a lot of useful methods for indexing and slicing our data structures. Indexing the DataFrame is like indexing a multi-dimensional built-in list, or a dictionary (depending on how you created the structure). There are also some advanced options which we will also cover below...

### 5.2.1. Traditional indexing of Pandas data structures
In this section we will show how Pandas objects can be indexed in the same way as the built-in types.

In [4]:
# Indexing the Series is like indexing a built-in list
# (the lookup is by label; with the default index the labels match the positions)
pandas_series = pandas.Series([1, 3, 5, 4, 6, 8])
pandas_series[3]

4

In [10]:
# Indexing a DataFrame like a multi-dimensional list

built_in_array = [[1,2,3,4,5],[6,7,8,9,10]]
numpy_array = numpy.array(built_in_array)
pandas_dataframe = pandas.DataFrame(numpy_array)

# pandas_dataframe[1] selects the column labeled 1 (a Series);
# [0] then selects the row labeled 0 from that Series
pandas_dataframe[1][0] # Returns the 2nd column, 1st row: 2

2
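
The column-then-row order of the chained lookup above can be surprising. As a sketch of the alternatives, the same cell can be read in row-then-column order with `loc` (by label) or `iloc` (by position). With the default integer labels used here, all three give the same value.

```python
import numpy
import pandas

frame = pandas.DataFrame(numpy.array([[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]]))

chained = frame[1][0]            # column labeled 1, then row labeled 0
by_label = frame.loc[0, 1]       # row labeled 0, column labeled 1
by_position = frame.iloc[0, 1]   # first row, second column, by position

print(chained, by_label, by_position)
```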

In [14]:
# Indexing a DataFrame like a dictionary

dictionary = {
    "ID": [1,2,3,4,5],
    "Name": ["Jim", "Sue", "Michael", "Kim", "Logan"],
    "Sales": [500, 300, 400, 700, 200]
}
pandas_dataframe = pandas.DataFrame(dictionary, columns=dictionary.keys(), index=names)

# The index is case sensitive!!
pandas_dataframe["Sales"]["Jim"]

500

In [11]:
# Indexing like an attribute of a class (the column name must be a valid Python identifier)
pandas_dataframe.Sales

Jim        500
Sue        300
Michael    400
Kim        700
Logan      200
Name: Sales, dtype: int64
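
Attribute-style access is convenient, but it does not work for every column name. The sketch below uses two made-up column names to show the cases where only the dictionary-style brackets work: a name that collides with an existing DataFrame method, and a name that contains a space.

```python
import pandas

frame = pandas.DataFrame({"count": [1, 2], "Total Sales": [3, 4]})

# frame.count is the DataFrame.count method, not our column
print(callable(frame.count))

# Bracket indexing always returns the column
print(frame["count"].tolist())
print(frame["Total Sales"].tolist())
```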

### 5.2.2 Using the Pandas DataFrame.loc indexer
The loc indexer allows us to access a group of rows and columns by label(s) or a boolean array. Note that loc is used with square brackets (loc[...]) rather than called like a function.

This goes beyond basic indexing and is more similar to querying data.

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.loc.html

In [12]:
pandas_dataframe = pandas.DataFrame(dictionary, columns=dictionary.keys(), index=names)
pandas_dataframe

Unnamed: 0,ID,Name,Sales
Jim,1,Jim,500
Sue,2,Sue,300
Michael,3,Michael,400
Kim,4,Kim,700
Logan,5,Logan,200


#### 5.2.2.1. Indexing with DataFrame.loc

In [13]:
# Index the DataFrame
pandas_dataframe.loc["Jim"]

ID         1
Name     Jim
Sales    500
Name: Jim, dtype: object

#### 5.2.2.2. Slicing with DataFrame.loc

In [14]:
# Slice the DataFrame by row
pandas_dataframe.loc[["Jim", "Sue"]]

Unnamed: 0,ID,Name,Sales
Jim,1,Jim,500
Sue,2,Sue,300


In [15]:
# Show that a single label can select several rows when the index contains duplicate names
names = ["Jim", "Sue", "Michael", "Kim", "Logan","Jim"]
dictionary = {
    "ID": [1,2,3,4,5,1],
    "Name": names,
    "Sales": [500, 300, 400, 700, 200,300]
}
pandas_dataframe = pandas.DataFrame(dictionary, columns=dictionary.keys(), index=names)
pandas_dataframe

Unnamed: 0,ID,Name,Sales
Jim,1,Jim,500
Sue,2,Sue,300
Michael,3,Michael,400
Kim,4,Kim,700
Logan,5,Logan,200
Jim,1,Jim,300


In [16]:
pandas_dataframe.loc["Jim"] # We will get two rows instead of one

Unnamed: 0,ID,Name,Sales
Jim,1,Jim,500
Jim,1,Jim,300


In [17]:
# Slice by row and column
pandas_dataframe.loc[["Jim","Sue"], "ID"]

Jim    1
Jim    1
Sue    2
Name: ID, dtype: int64
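
The examples above select rows with a list of labels. `loc` also accepts a label range, and unlike a Python list slice the end label is included. The sketch below contrasts this with the position-based `iloc`, using a DataFrame with unique names so that the range is well defined.

```python
import pandas

names = ["Jim", "Sue", "Michael", "Kim", "Logan"]
frame = pandas.DataFrame(
    {"ID": [1, 2, 3, 4, 5], "Sales": [500, 300, 400, 700, 200]},
    index=names
)

# Label slices include BOTH endpoints
label_slice = frame.loc["Sue":"Kim", ["ID", "Sales"]]

# Position slices exclude the end position, like built-in lists
position_slice = frame.iloc[1:3]

print(list(label_slice.index))
print(list(position_slice.index))
```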

#### 5.2.2.3. Evaluate conditional expressions with DataFrame.loc
This allows us to use a boolean expression to select data from the DataFrame. The expression produces a boolean Series, and loc keeps the rows where it is True.

In [16]:
# Select the rows where Sales were greater than 200
sales_series = pandas_dataframe["Sales"]
pandas_dataframe.loc[sales_series > 200]

Unnamed: 0,ID,Name,Sales
Jim,1,Jim,500
Sue,2,Sue,300
Michael,3,Michael,400
Kim,4,Kim,700
Jim,1,Jim,300

In [19]:
# Select the rows where Sales were greater than 200 and less than 600
sales_series = pandas_dataframe["Sales"]
pandas_dataframe.loc[(sales_series > 200) & (sales_series < 600)]

Unnamed: 0,ID,Name,Sales
Jim,1,Jim,500
Sue,2,Sue,300
Michael,3,Michael,400
Jim,1,Jim,300


In [20]:
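# Careful: chained indexing like this does NOT set the values. The assignment is made on a
# temporary copy, pandas emits the warning shown below, and the DataFrame is left unchanged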
pandas_dataframe.loc[(sales_series > 200) & (sales_series < 600)]["Sales"] = 0
pandas_dataframe

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  


Unnamed: 0,ID,Name,Sales
Jim,1,Jim,500
Sue,2,Sue,300
Michael,3,Michael,400
Kim,4,Kim,700
Logan,5,Logan,200
Jim,1,Jim,300


In [21]:
pandas_dataframe.Sales

Jim        500
Sue        300
Michael    400
Kim        700
Logan      200
Jim        300
Name: Sales, dtype: int64
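
The Sales column is unchanged, as the warning suggested it would be. A minimal sketch of the form the warning recommends, `.loc[row_indexer, col_indexer] = value`, is shown below. It works on a separate DataFrame named `frame` (a name chosen for this example) so that the DataFrame used in the rest of this section is not modified.

```python
import pandas

names = ["Jim", "Sue", "Michael", "Kim", "Logan", "Jim"]
frame = pandas.DataFrame(
    {"ID": [1, 2, 3, 4, 5, 1], "Sales": [500, 300, 400, 700, 200, 300]},
    index=names
)

# Select the rows and the column in a single .loc call, then assign
mask = (frame["Sales"] > 200) & (frame["Sales"] < 600)
frame.loc[mask, "Sales"] = 0

print(frame["Sales"].tolist())
```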

### 5.2.3 Using the Pandas DataFrame.query() function
The query() function selects rows much like a boolean expression passed to loc, but it takes the expression as a string, which makes our code even more dynamic. We will see that we can store queries in strings and manipulate them dynamically.

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.query.html

In [22]:
pandas_dataframe.query('Sales > 200')

Unnamed: 0,ID,Name,Sales
Jim,1,Jim,500
Sue,2,Sue,300
Michael,3,Michael,400
Kim,4,Kim,700
Jim,1,Jim,300


In [23]:
pandas_dataframe.query('Sales > 200 & Sales < 600')

Unnamed: 0,ID,Name,Sales
Jim,1,Jim,500
Sue,2,Sue,300
Michael,3,Michael,400
Jim,1,Jim,300


In [24]:
query = "{0} {1} {2}".format("Sales", ">", "200")
print(query)
pandas_dataframe.query(query)

Sales > 200


Unnamed: 0,ID,Name,Sales
Jim,1,Jim,500
Sue,2,Sue,300
Michael,3,Michael,400
Kim,4,Kim,700
Jim,1,Jim,300
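
Building the query with string formatting works, but query() can also read Python variables directly when they are prefixed with `@`. The sketch below assumes two variables, `minimum_sales` and `maximum_sales`, that are invented for this example.

```python
import pandas

names = ["Jim", "Sue", "Michael", "Kim", "Logan"]
frame = pandas.DataFrame(
    {"ID": [1, 2, 3, 4, 5], "Sales": [500, 300, 400, 700, 200]},
    index=names
)

minimum_sales = 200
maximum_sales = 600

# The @ prefix refers to a Python variable instead of a column name
selected = frame.query("Sales > @minimum_sales and Sales < @maximum_sales")

print(list(selected.index))
```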


## 5.3. Performing Transformations on Pandas data structures

### A quick note on vectorized vs non-vectorized functions
When performance is paramount, you should avoid using .apply() and .map() because those constructs perform Python for-loops over the data stored in a pandas Series or DataFrame. By using vectorized functions instead, you can loop over the data at the same speed as compiled code (C, Fortran, etc.)! NumPy, SciPy and pandas come with a variety of vectorized functions (called Universal Functions or UFuncs in NumPy).

https://campus.datacamp.com/courses/manipulating-dataframes-with-pandas/extracting-and-transforming-data?ex=16

A more in-depth article can be found here:

https://engineering.upside.com/a-beginners-guide-to-optimizing-pandas-code-for-speed-c09ef2c6a4d6
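
To make the distinction concrete, the sketch below computes the same result twice: once with apply(), which calls a Python function for every value, and once with vectorized operators that work on the whole Series at once. It only checks that the results match; timings depend on your machine and data size, so none are shown.

```python
import pandas

values = pandas.Series([1, 3, 4, 6, 2, 3])

# Non-vectorized: a Python function is called once per value
looped = values.apply(lambda value: (value * 5) > 15)

# Vectorized: the operators are applied to the whole Series at once
vectorized = (values * 5) > 15

print(looped.equals(vectorized))
```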

### 5.3.1. Performing Transformations on Pandas Series

First we will look at the Series object. The documentation outlines several useful methods and is worth a read.

We will cover a few:

* apply()
* map()
* factorize()

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.html

#### 5.3.1.1 Using the Series.apply() function

This allows us to apply a function to the values in a series.

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.apply.html#pandas.Series.apply

In [22]:
pandas_series = pandas.Series([1,3,4,6,2,3])

def my_function(value):
    print("Value: {0}".format(value))
    return (value * 5) > 15

pandas_series.apply(my_function)

Value: 1
Value: 3
Value: 4
Value: 6
Value: 2
Value: 3


0    False
1    False
2     True
3     True
4    False
5    False
dtype: bool

#### 5.3.1.2 Using the Series.map() function

This maps each value of the input Series to a value in a new output Series, using either a function or a lookup table such as a dictionary.

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.map.html#pandas.Series.map

In [26]:
# We can see that map can behave exactly like apply when given a function
pandas_series.map(my_function)

Value: 1
Value: 3
Value: 4
Value: 6
Value: 2
Value: 3


0    False
1    False
2     True
3     True
4    False
5    False
dtype: bool

In [27]:
# Unlike apply, map can also take a dictionary that maps old values to new values
pandas_series = pandas.Series(["r", "b", "g", "r", "g", "b"])

my_map = {
    "r": "red", 
    "b": "blue",
    "g": "green"
}

pandas_series.map(my_map)

0      red
1     blue
2    green
3      red
4    green
5     blue
dtype: object
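
One thing to watch for when mapping with a dictionary: values that have no entry in the dictionary become NaN rather than raising an error. The sketch below uses a deliberately incomplete dictionary and then fills the gap with a placeholder string chosen for this example.

```python
import pandas

codes = pandas.Series(["r", "b", "x"])

# "x" has no entry in the dictionary, so it maps to NaN
colors = codes.map({"r": "red", "b": "blue"})
print(colors.isna().tolist())

# Replace the missing results with a placeholder of our choosing
filled = colors.fillna("unknown")
print(filled.tolist())
```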

#### 5.3.1.3 Using the Series.factorize() function

This encodes the values of a Series as integer codes and returns those codes together with the unique values (the factors) they refer to:

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.factorize.html#pandas-series-factorize

In [25]:
pandas_series = pandas.Series(["r", "b", "g", "r", "g", "b"])
labels, uniques = pandas_series.factorize()


print(list(labels))
print(list(uniques))

[0, 1, 2, 0, 2, 1]
['r', 'b', 'g']
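
The codes and the unique values together carry all of the original information. As a sketch of how to go back, the codes can be used as positions into the unique values:

```python
import pandas

colors = pandas.Series(["r", "b", "g", "r", "g", "b"])
codes, uniques = colors.factorize()

# Each code is a position in uniques, so take() rebuilds the original values
rebuilt = uniques.take(codes)

print(list(rebuilt))
```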


### 5.3.2 Transformations on DataFrame
Next we will look at the DataFrame object. The documentation outlines several useful methods and is worth a read.

We will cover a few:

* apply()
* applymap()
* transform()
* transpose()


We will also look at built-in operators.

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html

#### 5.3.2.1 Transformations using indexes, slices, and built in operators
Referencing indexes or slices of a DataFrame will return references to either DataFrames or Series objects, so it's important to know what we are dealing with!

In [29]:
# Review our dataframe
pandas_dataframe = pandas.DataFrame(dictionary, columns=dictionary.keys(), index=names)
pandas_dataframe.Sales

Jim        500
Sue        300
Michael    400
Kim        700
Logan      200
Jim        300
Name: Sales, dtype: int64

In [30]:
# Multiply a column by a scalar
print(type(pandas_dataframe.Sales))
pandas_dataframe.Sales = pandas_dataframe.Sales * 5
pandas_dataframe

<class 'pandas.core.series.Series'>


Unnamed: 0,ID,Name,Sales
Jim,1,Jim,2500
Sue,2,Sue,1500
Michael,3,Michael,2000
Kim,4,Kim,3500
Logan,5,Logan,1000
Jim,1,Jim,1500


In [31]:
# Perform a query and multiply the results by a scalar (the result is a Series, not a DataFrame)
pandas_dataframe = pandas.DataFrame(dictionary, columns=dictionary.keys(), index=names)
scaled_sales = pandas_dataframe.loc[(pandas_dataframe.Sales > 200) & (pandas_dataframe.Sales < 600)].Sales * 5
scaled_sales

Jim        2500
Sue        1500
Michael    2000
Jim        1500
Name: Sales, dtype: int64

#### 5.3.2.2 Using DataFrame.apply() function
This allows us to call a function for each row (axis=1) or each column (axis=0, the default) of our DataFrame. When the function returns a single value, the result is a Series containing the results. This is a non-vectorized function.

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.apply.html#pandas.DataFrame.apply

In [32]:
# Let's consider the example:
#  Employees get a 10% bonus on all their sales if they sold more than $200
#  The bonus only applies if they have been with the company for at least 1 year

dictionary = {
    "ID": [1,2,3,4,5],
    "Name": ["Jim", "Sue", "Michael", "Kim", "Logan"],
    "Sales": [500, 300, 400, 700, 200],
    "Tenure": [0, 1, 3, 5, 0]
}
pandas_dataframe = pandas.DataFrame(dictionary, columns=dictionary.keys())

# Define our function to calculate the bonus using our dataframe
def bonus_amount(row):
    if row.Tenure <= 0:
        return 0
    if row.Sales <= 200:
        return 0
    return row.Sales * 0.10

# Apply the transformation and store the results in a new column
pandas_dataframe["Bonus"] =  pandas_dataframe.apply(bonus_amount, axis=1)
pandas_dataframe

Unnamed: 0,ID,Name,Sales,Tenure,Bonus
0,1,Jim,500,0,0.0
1,2,Sue,300,1,30.0
2,3,Michael,400,3,40.0
3,4,Kim,700,5,70.0
4,5,Logan,200,0,0.0
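
Since apply() is not vectorized, it is worth seeing what a vectorized version of the same rule could look like. The sketch below is one possible approach using numpy.where with the same conditions as the bonus_amount function above (tenure greater than 0 and sales greater than 200); it is an alternative for comparison, not a required replacement.

```python
import numpy
import pandas

frame = pandas.DataFrame({
    "Sales": [500, 300, 400, 700, 200],
    "Tenure": [0, 1, 3, 5, 0]
})

# Build one boolean Series for all rows instead of testing each row in Python
eligible = (frame["Tenure"] > 0) & (frame["Sales"] > 200)

# Where eligible, the bonus is 10% of sales; everywhere else it is 0
frame["Bonus"] = numpy.where(eligible, frame["Sales"] * 0.10, 0.0)

print(frame)
```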


#### 5.3.2.3 Using DataFrame.applymap() function
This allows us to call a function for each element of our DataFrame and returns a DataFrame containing the results. This is a non-vectorized function. In pandas 2.1 and later, applymap() is deprecated in favor of DataFrame.map().

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.applymap.html#pandas.DataFrame.applymap

In [33]:
# Let's consider the example of replacing NaN values in our dataframe with 0

dictionary = {
    "ID": [1,2,3,4,5],
    "Name": ["Jim", "Sue", "Michael", "Kim", "Logan"],
    "Sales": [500, 300, 400, 700, 200],
    "Tenure": [0, 1, 3, numpy.nan, 0]
}
pandas_dataframe = pandas.DataFrame(dictionary, columns=dictionary.keys())

# Define our function to replace NaN with 0 and leave every other item unchanged
import math
import numbers
def replace_nan(item):
    if isinstance(item, numbers.Number) and math.isnan(item):
        return 0
    return item

# Apply the function to every element; this returns a new DataFrame
pandas_dataframe.applymap(replace_nan)


Unnamed: 0,ID,Name,Sales,Tenure
0,1,Jim,500,0.0
1,2,Sue,300,1.0
2,3,Michael,400,3.0
3,4,Kim,700,0.0
4,5,Logan,200,0.0
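
For this particular task pandas also has a built-in, vectorized method, fillna(), which avoids calling a Python function for every element. A minimal sketch on a shortened version of the data:

```python
import numpy
import pandas

frame = pandas.DataFrame({
    "Name": ["Jim", "Kim"],
    "Tenure": [0, numpy.nan]
})

# Replace every NaN in the DataFrame with 0
filled = frame.fillna(0)

# Or only fill specific columns by passing a dictionary of column -> value
filled_one_column = frame.fillna({"Tenure": 0})

print(filled)
```

Like applymap(), fillna() returns a new DataFrame here and leaves the original unchanged.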


#### 5.3.2.4. Using the DataFrame.transform() function
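This function calls our function once per column, passing the entire column as a Series, and the result must have the same shape as the input. Because the function receives a whole column, it can use vectorized operations.

Note: In the output below, the first column (ID) is passed to our function twice, so be careful with functions that have side effects!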

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.transform.html

In [34]:
# Let's consider the example:
#  Print each column as transform() passes it to our function, and return it unchanged
names = ["Jim", "Sue", "Michael", "Kim", "Logan"]
dictionary = {
    "ID": [1,2,3,4,5],
    "Name": names,
    "Sales": [500, 300, 400, 700, 200],
    "Tenure": [0, 1, 3, 5, 0]
}
pandas_dataframe = pandas.DataFrame(dictionary, columns=dictionary.keys(), index=names)

# Define a function that prints the column it receives and returns it unchanged
def show_column(column):
    print(column)
    return column

# Apply the function to every column; the result has the same shape as the input
pandas_dataframe.transform(show_column)

Jim        1
Sue        2
Michael    3
Kim        4
Logan      5
Name: ID, dtype: object
Jim        1
Sue        2
Michael    3
Kim        4
Logan      5
Name: ID, dtype: int64
Jim            Jim
Sue            Sue
Michael    Michael
Kim            Kim
Logan        Logan
Name: Name, dtype: object
Jim        500
Sue        300
Michael    400
Kim        700
Logan      200
Name: Sales, dtype: int64
Jim        0
Sue        1
Michael    3
Kim        5
Logan      0
Name: Tenure, dtype: int64


Unnamed: 0,ID,Name,Sales,Tenure
Jim,1,Jim,500,0
Sue,2,Sue,300,1
Michael,3,Michael,400,3
Kim,4,Kim,700,5
Logan,5,Logan,200,0
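
The function above only prints its input. As a sketch of a transformation that actually changes the data, the example below divides each numeric column by its own maximum, so every value becomes a share of the largest value in its column. The choice of columns and the scaling rule are just for illustration.

```python
import pandas

names = ["Jim", "Sue", "Michael", "Kim", "Logan"]
frame = pandas.DataFrame(
    {"Sales": [500, 300, 400, 700, 200], "Tenure": [0, 1, 3, 5, 0]},
    index=names
)

# The lambda receives a whole column (a Series) and returns a Series of the same length
share_of_max = frame[["Sales", "Tenure"]].transform(lambda column: column / column.max())

print(share_of_max)
```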


#### 5.3.2.5. Using the DataFrame.transpose() function
This function allows us to transpose the DataFrame, swapping its rows and columns.

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.transpose.html#pandas.DataFrame.transpose

In [35]:
pandas_dataframe.transpose()

Unnamed: 0,Jim,Sue,Michael,Kim,Logan
ID,1,2,3,4,5
Name,Jim,Sue,Michael,Kim,Logan
Sales,500,300,400,700,200
Tenure,0,1,3,5,0
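
pandas also offers the shorthand attribute `T` for the same operation. The sketch below, on a smaller made-up DataFrame, checks that both spellings agree and shows how the shape flips.

```python
import pandas

frame = pandas.DataFrame(
    {"ID": [1, 2], "Name": ["Jim", "Sue"], "Sales": [500, 300]},
    index=["Jim", "Sue"]
)

flipped = frame.T   # shorthand for frame.transpose()

print(frame.shape, flipped.shape)
print(list(flipped.index))
print(flipped.equals(frame.transpose()))
```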


# 6. Exercises