---
# Pandas Data Structures
Pandas has two main data structures: `Series` (one-dimensional) and `DataFrame` (two-dimensional, tabular).

---

In [2]:
# Importing pandas and numpy as pd, np, respectively.
# These shorthand names are common practice.
import pandas as pd
import numpy as np

# display() function works like print, but it shows the
# output in its rich display format if possible
from IPython.display import display

In [3]:
# Function for printing a horizontal line, for display purposes.
def printhr(n: int = 50):
    """Print a horizontal rule of the character "=" of length n.

    Args:
        n (int, optional): Number of characters. Defaults to 50.
    """

    print("=" * n)

---
---
## Series

A Series is a homogeneous, one-dimensional labeled array. It can hold any data type, but only one type at a time. The axis labels are collectively referred to as the **index**.

A Series behaves much like a NumPy ndarray, so most NumPy functions that accept an ndarray also accept a Series.

---
---
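
As a rough illustration of the ndarray-like behavior described above, the sketch below applies a few NumPy functions to a Series. The values and labels are made up for the example.

```python
import numpy as np
import pandas as pd

# Illustrative data, not taken from the cells in this notebook
s = pd.Series([1.0, 4.0, 9.0], index=["a", "b", "c"])

# Element-wise NumPy functions return a new Series and keep the labels
roots = np.sqrt(s)

# Reductions return a single value, as they would for an ndarray
total = np.sum(s)

# Boolean masks filter a Series the same way they filter an ndarray
large = s[s > 2]

print(roots)
print(total)
print(large)
```

Note that the labels travel with the values: `roots` and `large` are still indexed by `"a"`, `"b"`, `"c"` rather than by position.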

---
### Creating a Series

You can use the Series class to create a Series.  

`new_series = pd.Series(data, index=index)`

Where **data** is the data to be converted into a Series, and **index** is an iterable containing the labels. If **index** is not set, a default integer index starting at 0 is used.

**data** can take several forms, including:

- a Python dict
- a 1D NumPy ndarray
- a scalar value (like 38)

---


---
#### **Series:** Python dict as Data

When a Python dict is passed, its keys become the labels. If an index is also passed, the index determines both the length and the order of the Series: each index label is looked up in the dict keys, and labels that are missing from the dict get the value NaN. Dict keys that are not in the index are dropped.

---


In [4]:
# Create a dict
data = {"a": 2, "b": 4, "c": 6}
display(data)
printhr()

# Create Series
s = pd.Series(data)
display(s)

# > When an index label does not exist in the provided dict, its
#   value is NaN.
# > The length of the index is the length of the Series.
# > The order of the index is followed.
s2 = pd.Series(data, index=["c", "a", "b", "x"])
display(s2)

{'a': 2, 'b': 4, 'c': 6}



a    2
b    4
c    6
dtype: int64

c    6.0
a    2.0
b    4.0
x    NaN
dtype: float64
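
The output above also shows the dtype changing from `int64` to `float64` once a missing label is introduced. A minimal sketch of that behavior follows. Comparing against `reindex` is an addition for illustration, on the assumption that the two routes give the same result for dict data.

```python
import pandas as pd

data = {"a": 2, "b": 4, "c": 6}
labels = ["c", "a", "b", "x"]

# Route 1: pass the index to the constructor
from_index = pd.Series(data, index=labels)

# Route 2 (for comparison): build first, then reindex
from_reindex = pd.Series(data).reindex(labels)

# Both routes should give the same labels, order, and values
same = from_index.equals(from_reindex)
print(same)

# NaN is a float, so the integer values are upcast to float64
print(pd.Series(data).dtype, "->", from_index.dtype)

# isna() marks the label that had no matching dict key
print(from_index.isna())
```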

---
#### **Series:** NumPy ndarray as Data

A one-dimensional NumPy ndarray can be passed as data. If an index is passed, it must have the same length as the array.

---


In [5]:
# Create an ndarray
data = np.random.randn(5)
display(data)
printhr()

# Create a Series
# Index in descending order using pd.RangeIndex()
s = pd.Series(data, index=pd.RangeIndex(4, -1, -1))
display(s)

# If no index is passed, an integer index starting at 0 is assigned.
s2 = pd.Series(data)
display(s2)

array([-0.98899199,  0.67622427,  1.07229327, -0.66789179,  0.08898007])



4   -0.988992
3    0.676224
2    1.072293
1   -0.667892
0    0.088980
dtype: float64

0   -0.988992
1    0.676224
2    1.072293
3   -0.667892
4    0.088980
dtype: float64
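
Unlike the dict case, an ndarray has no labels to align against, so the index has to match the array's length. The following sketch, with illustrative values, shows what happens when it does not.

```python
import numpy as np
import pandas as pd

values = np.arange(5)

# An index shorter than the data is rejected with a ValueError
raised = False
try:
    pd.Series(values, index=["a", "b"])
except ValueError as err:
    raised = True
    print("ValueError:", err)

# An index of matching length is accepted
matched = pd.Series(values, index=list("abcde"))
print(matched)
```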

---
#### **Series:** Scalar Value as Data

Passing a scalar value alone results in a Series of length 1. If an index is passed, the value is repeated to match the length of the index. A scalar is a single value with no dimensions, that is, not a collection.

---


In [6]:
# Scalar value (not a collection)
data = 4.2

# Create Series
s = pd.Series(data)
display(s)

# Value is repeated throughout the index
s2 = pd.Series(data, index=["i", "n", "d", "e", "x"])
display(s2)

0    4.2
dtype: float64

i    4.2
n    4.2
d    4.2
e    4.2
x    4.2
dtype: float64

---
---
## DataFrame

A DataFrame is a two-dimensional labeled data structure whose columns can each hold a different data type. It can be thought of as a dict of Series that share one index. The row labels are referred to as the **index**, and the column labels as the **columns**.

---
---
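
To make the row and column labels concrete, here is a small sketch with made-up column names and values. Each column keeps its own dtype, and selecting one column gives back a Series.

```python
import pandas as pd

# Illustrative data: three columns with three different types
df = pd.DataFrame(
    {"name": ["x", "y", "z"], "count": [1, 2, 3], "ratio": [0.5, 1.5, 2.5]},
    index=["a", "b", "c"],
)

# One dtype per column
print(df.dtypes)

# Row labels and column labels
print(df.index.tolist())
print(df.columns.tolist())

# A single column is a Series that shares the DataFrame's index
col = df["count"]
print(type(col).__name__)
```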

---
### Creating a DataFrame

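As with a Series, the DataFrame class is used to create a DataFrame.  

`new_df = pd.DataFrame(data, index=index, columns=columns)`

Where **index** holds the row labels and **columns** holds the column labels. Both are optional.

**data** can be of different structures: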

- a Python dict of any of the following:
  - Series
  - lists
  - 1D ndarrays
  - dicts
- a 2D NumPy ndarray
- a Series
- structured or record ndarray
- another DataFrame


---
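
A few of the input types listed above can be sketched briefly. The values, labels, and the Series name `"total"` are illustrative choices, not taken from elsewhere in this notebook.

```python
import numpy as np
import pandas as pd

# 2D ndarray: rows and columns are labeled by index and columns
arr = np.arange(6).reshape(3, 2)
df_arr = pd.DataFrame(arr, index=["a", "b", "c"], columns=["x", "y"])
print(df_arr)

# Series: becomes a single column named after the Series
s = pd.Series([10, 20, 30], name="total")
df_s = pd.DataFrame(s)
print(df_s)

# dict of dicts: outer keys are columns, inner keys are row labels
nested = {"x": {"a": 1, "b": 2}, "y": {"a": 3, "b": 4}}
df_nested = pd.DataFrame(nested)
print(df_nested)
```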


---
#### **DataFrame:** dict of Series as Data

When a DataFrame is created from a dict of Series, the dict keys become the column labels and the resulting index is the union of the indexes of the Series. Values missing for a given label are represented as NaN.

If an index is passed, it determines both the rows kept and their order. Index labels that do not exist in any of the Series produce rows of NaN.

---


In [25]:
# Create a dict containing Series
s1 = pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"])
s2 = pd.Series(["one", "two", "three", "four"], index=["a", "b", "c", "d"])

d = {
    "one": s1,
    "two": s2,
}
display(d)
printhr()

# Create DataFrame
# Missing data is set to NaN
df = pd.DataFrame(d)
display(df)

# When index is passed, it dictates the order
df = pd.DataFrame(d, index=["d", "a", "b"])
display(df)

{'one': a    1.0
 b    2.0
 c    3.0
 dtype: float64,
 'two': a      one
 b      two
 c    three
 d     four
 dtype: object}



   one    two
a  1.0    one
b  2.0    two
c  3.0  three
d  NaN   four


   one   two
d  NaN  four
a  1.0   one
b  2.0   two


---
#### **DataFrame:** dict of Lists as Data

When a dict of lists (or 1D ndarrays) is passed, the keys become the column labels and each list becomes a column. All lists must have the same length. If an index is passed, it must also have that length; otherwise a default integer index starting at 0 is used.

By contrast, a Series built from the same dict stores each list as a single element, as the cells below show.

---


In [7]:
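# Passing in a dict of equal-length lists
data = {"Foo": [2, 5, 5], "Bar": [9, 2, 7]}
display(data)

# A Series keeps each list as one element (dtype object);
# a DataFrame turns each list into a column.
s = pd.Series(data)
df = pd.DataFrame(data, index=range(3))

display(s, df)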

{'Foo': [2, 5, 5], 'Bar': [9, 2, 7]}

Foo    [2, 5, 5]
Bar    [9, 2, 7]
dtype: object

   Foo  Bar
0    2    9
1    5    2
2    5    7


In [9]:
# Passing in a dict that mixes a list and a scalar.
# In the DataFrame, the scalar is repeated to match the list's length.
data = {"Foo": [2, 5, 5], "Bar": 9}
display(data)
printhr()

s = pd.Series(data)
df = pd.DataFrame(data)

display(s, df)

{'Foo': [2, 5, 5], 'Bar': 9}



Foo    [2, 5, 5]
Bar            9
dtype: object

   Foo  Bar
0    2    9
1    5    9
2    5    9
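
A scalar is repeated to fit, but lists of different lengths are not padded. The sketch below shows the error this raises and one possible workaround: wrapping the lists in Series so that missing positions are filled with NaN. The workaround is a suggestion for illustration, not something the cells above do.

```python
import pandas as pd

# Lists of unequal length are rejected with a ValueError
raised = False
try:
    pd.DataFrame({"Foo": [2, 5, 5], "Bar": [9, 2]})
except ValueError as err:
    raised = True
    print("ValueError:", err)

# Wrapping each list in a Series lets pandas align on the index
# and fill the missing position with NaN
df = pd.DataFrame({"Foo": pd.Series([2, 5, 5]), "Bar": pd.Series([9, 2])})
print(df)
```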


In [72]:
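# Dict of Series with partially overlapping indexes
d = {
    "one": pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"]),
    "two": pd.Series([1.0, 2.0, 3.0, 4.0], index=["a", "b", "c", "d"]),
}

# The index is the union of both Series' indexes
df = pd.DataFrame(d)
display(df)

# index selects and orders the rows
display(pd.DataFrame(d, index=["d", "b", "a"]))
# columns selects and orders the columns; "three" is not a key
# in d, so that column is all NaN
display(pd.DataFrame(d, index=["d", "b", "a"], columns=["two", "three"]))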

   one  two
a  1.0  1.0
b  2.0  2.0
c  3.0  3.0
d  NaN  4.0


   one  two
d  NaN  4.0
b  2.0  2.0
a  1.0  1.0


   two three
d  4.0   NaN
b  2.0   NaN
a  1.0   NaN
