---
# Pandas Data Structures
Pandas has two main data structures for handling tabular data: `Series` and `DataFrame`.

---

In [None]:
# Importing pandas and numpy as pd, np, respectively.
# These shorthand names are common practice.
import pandas as pd
import numpy as np

# display() function works like print, but it shows the
# output in its rich display format if possible
from IPython.display import display

In [None]:
# Function for printing a horizontal line, for display purposes.
def printhr(n: int = 50):
    """Print a horizontal rule of the character "=" of length n.

    Args:
        n (int, optional): Number of characters. Defaults to 50.
    """

    print("=" * n)

---
---
## Series

A Series is a homogeneous, one-dimensional labeled array: it can hold any data type, but all of its values share a single type. The axis labels are collectively referred to as the **index**.
 
A Series behaves much like a NumPy ndarray, so most NumPy functions that work on an ndarray also work on a Series.
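
As a minimal sketch of this ndarray-like behavior, the example below applies a NumPy function and a reduction to a small Series. The sample values and the names `roots`, `total`, and `doubled` are illustrative only.

```python
import numpy as np
import pandas as pd

# A small Series with string labels
s = pd.Series([1.0, 4.0, 9.0], index=["a", "b", "c"])

# A NumPy ufunc applied to a Series returns a Series and keeps the index
roots = np.sqrt(s)
print(roots)

# Reductions and vectorized arithmetic work as they do on an ndarray
total = np.sum(s)
doubled = s * 2
print(total)
print(doubled)

# All values share a single dtype
print(s.dtype)
```

Note that the result of `np.sqrt(s)` is still labeled with `"a"`, `"b"`, `"c"`; the index travels with the data.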

---
---

---
### Creating a Series

You can use the Series class to create a Series.  

`new_series = pd.Series(data, index=index)`

Where **data** is the data to be converted into a Series, and **index** is an iterable that contains the labels. If **index** is not set, a default integer index starting from 0 is used.

**data** can be of different structures:

- a Python dict
- a 1D NumPy ndarray
- a scalar value (like 38)

---


---
#### **Series:** Python dict as Data

When a Python dict is passed, its keys act as the labels. If an index is also passed, the index determines both the length and the order of the Series: each index label is looked up among the dict keys, and labels that do not exist in the dict are represented as NaN. Dict keys that are not in the index are dropped.

---


In [None]:
# Create a dict
d = {"a": 2, "b": 4, "c": 6}
display(d)
printhr()

# Create Series
s = pd.Series(d)
display(s)

# > When an index does not exist in the provided dict data, it will
#   be represented with NaN. 
# > The length of the index is length of Series.
# > Order of index is followed.
s2 = pd.Series(d, index=["c", "a", "b", "x"])
display(s2)

---
#### **Series:** NumPy ndarray as Data

A one-dimensional NumPy ndarray can be passed as data. If an index is passed, it must have the same length as the data.
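
Unlike a dict, an ndarray carries no labels, so an index cannot reorder or pad the data; it can only label it. A minimal sketch of what happens when the lengths agree and when they do not (the names `values` and `mismatch_error` are illustrative):

```python
import numpy as np
import pandas as pd

values = np.arange(5)

# Index length equals data length: each value gets one label
s = pd.Series(values, index=["a", "b", "c", "d", "e"])
print(s)

# Index length differs from data length: pandas raises a ValueError
try:
    pd.Series(values, index=["a", "b", "c"])
    mismatch_error = None
except ValueError as exc:
    mismatch_error = exc
    print("ValueError:", exc)
```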

---


In [None]:
# Create an ndarray
d = np.random.randn(5)
display(d)
printhr()

# Create a Series
# Index in descending order using pd.RangeIndex()
s = pd.Series(d, index=pd.RangeIndex(4, -1, -1))
display(s)

# If no index is passed, a default integer index starting from 0 is assigned.
s2 = pd.Series(d)
display(s2)

---
#### **Series:** Scalar Value as Data

Passing a scalar value results in a Series of length 1. If an index is passed, the value is repeated to match the length of the index. A scalar value is a single value with no dimensions, that is, not a collection.

---


In [None]:
# Scalar value (not a collection)
d = 4.2

# Create Series
s = pd.Series(d)
display(s)

# Value is repeated throughout the index
s2 = pd.Series(d, index=["i", "n", "d", "e", "x"])
display(s2)

---
---
## DataFrame

A DataFrame is a 2-dimensional labeled data structure whose columns can each be of a different type. It is the most commonly used pandas object.
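
To illustrate the per-column types, here is a small sketch using made-up sample data (the column names and values are hypothetical). Each column keeps its own dtype, and selecting one column gives back a Series.

```python
import pandas as pd

# Hypothetical sample data: each column holds a different type
df_types = pd.DataFrame(
    {
        "name": ["Ann", "Bob", "Cy"],
        "age": [31, 25, 47],
        "height_m": [1.62, 1.80, 1.75],
        "member": [True, False, True],
    }
)

# One dtype per column
print(df_types.dtypes)

# Selecting a single column returns a Series
age = df_types["age"]
print(type(age))
```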

---
---

---
### Creating a DataFrame

Much like creating a Series, the DataFrame class can be used to create a DataFrame.  

`new_df = pd.DataFrame(data, index=index)`

**data** can be of different structures:

- a Python dict of any of the following:
  - Series
  - lists
  - 1D ndarrays
  - dicts
- a 2D NumPy ndarray
- a Series
- structured or record ndarray
- another DataFrame
- a list of dicts
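
Some of these input types are not demonstrated in the cells that follow, so here is a brief sketch of three of them: a 2D ndarray, a Series, and another DataFrame. The labels and variable names are illustrative.

```python
import numpy as np
import pandas as pd

# 2D ndarray: rows and columns get default integer labels unless
# index and columns are passed
arr = np.arange(6).reshape(3, 2)
df_arr = pd.DataFrame(arr, index=["x", "y", "z"], columns=["left", "right"])
print(df_arr)

# Series: becomes a single-column DataFrame whose column label
# is the name of the Series
s = pd.Series([10, 20, 30], index=["a", "b", "c"], name="score")
df_s = pd.DataFrame(s)
print(df_s)

# Another DataFrame: the columns parameter selects a subset of columns
df_sub = pd.DataFrame(df_arr, columns=["right"])
print(df_sub)
```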


---


---
#### **DataFrame:** dict of Series as Data

When creating a DataFrame from a dict of Series, the resulting index is the union of the indexes of the individual Series, and the dict keys become the column labels. Where a Series has no value for a label, the cell is filled with NaN. Columns can be chosen through the **columns** parameter; the DataFrame will contain only the columns passed.

---


In [None]:
# Create a dict containing Series
d = {
    "one": pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"]),
    "two": pd.Series(["one", "two", "three", "four"], index=["a", "b", "c", "d"]),
}

display(d)
printhr()

# Create DataFrame
# Missing data is set to NaN
df = pd.DataFrame(d)
display(df)

# When an index is passed, it selects the rows and dictates their order
df = pd.DataFrame(d, index=["d", "a", "b"])
display(df)

# Specifying which columns to show
# Notice how "three" is a NaN column. This is because
# the data (d) does not have a column "three"
df = pd.DataFrame(d, index=["d", "a", "b", "c"], columns=["one", "three"])
display(df)

---
#### **DataFrame:** dict of Lists as Data

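Passing a dict of lists works much like passing a dict of Series: the keys become the column labels. Because lists carry no labels, however, all lists must have the same length, and an index, if passed, must match that length.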
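
One difference from a dict of Series is worth a quick sketch: with no labels to align on, pandas cannot fill in missing positions, so lists of unequal length are rejected. The names `df_ok` and `length_error` are illustrative.

```python
import pandas as pd

# Lists of equal length: the index only labels the rows
df_ok = pd.DataFrame({"Foo": [2, 5, 5], "Bar": [9, 2, 7]}, index=["a", "b", "c"])
print(df_ok)

# Lists of different lengths cannot be aligned, so pandas raises a ValueError
try:
    pd.DataFrame({"Foo": [2, 5, 5], "Bar": [9, 2]})
    length_error = None
except ValueError as exc:
    length_error = exc
    print("ValueError:", exc)
```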

---


In [None]:
# Create dict
d = {"Foo": [2, 5, 5], "Bar": [9, 2, 7]}
display(d)
printhr()

# Create df
df = pd.DataFrame(d, index=range(3))
display(df)

---
#### **DataFrame:** List of dicts

When passing a list of dicts as data, each dict will be a row, and the keys in the dict will be the columns.

---


In [59]:
# Create data
d = [
    {"a": 1, "b": 2, "c": 3, "d": 4},
    {"a": 3, "c": 9, "d": 12},
]
display(d)
printhr()

# Create df
df = pd.DataFrame(d)
display(df)
# Note that column "b" shows 2.0 rather than 2.
# The second dict has no "b" key, so that cell is filled with NaN.
# NaN is a float value, so a numeric column that contains NaN is
# stored as float64.

# The dtype parameter of DataFrame() accepts only a single dtype that
# is applied to every column, not a column:type dict, so this fails:
# dt = {"a": int, "b": int, "c": int, "d": int}
# df = pd.DataFrame(d, dtype=dt)

# A column's data type can also be changed after the df is created
# using the astype() method. Casting to int fails while NaN is
# present, because NaN cannot be converted into an integer:
# df["b"].astype(int)
#
# Replace the NaN values first (for example with fillna()) and then
# cast the column.



[{'a': 1, 'b': 2, 'c': 3, 'd': 4}, {'a': 3, 'c': 9, 'd': 12}]



   a    b  c   d
0  1  2.0  3   4
1  3  NaN  9  12


IntCastingNaNError: Cannot convert non-finite values (NA or inf) to integer
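
The `IntCastingNaNError` above arises because NaN cannot be stored in a plain integer column. Below is a minimal sketch of two common workarounds, reusing the data from the cell above. The names `rows`, `b_filled`, `b_nullable`, and `df_cast` are illustrative, and the fill value of 0 is an assumption; choose whatever default makes sense for the data.

```python
import pandas as pd

rows = [
    {"a": 1, "b": 2, "c": 3, "d": 4},
    {"a": 3, "c": 9, "d": 12},
]
df = pd.DataFrame(rows)
print(df.dtypes)  # "b" is float64 because of the NaN

# Option 1: replace NaN with a default value, then cast to int
b_filled = df["b"].fillna(0).astype(int)
print(b_filled)

# Option 2: the nullable integer dtype keeps the missing value as <NA>
b_nullable = df["b"].astype("Int64")
print(b_nullable)

# astype() on a DataFrame accepts a column:type dict
df_cast = df.fillna({"b": 0}).astype({"b": int})
print(df_cast.dtypes)
```

Note that `astype()` returns a new object rather than modifying the DataFrame in place, so the result has to be assigned back (for example `df["b"] = ...`) to take effect.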