---
# Pandas Data Structures
Pandas has two main data structures for handling tabular data: `Series` and `DataFrame`.

---

In [2]:
# Importing pandas and numpy as pd, np, respectively.
# These shorthand names are common practice.
import pandas as pd
import numpy as np

# display() function works like print, but it shows the
# output in its rich display format if possible
from IPython.display import display

In [3]:
# Function for printing a horizontal line, for display purposes.
def printhr(n: int = 50):
    """Print a horizontal rule of the character "=" of length n.

    Args:
        n (int, optional): Number of characters. Defaults to 50.
    """

    print("=" * n)

---
---
## Series

A Series is a homogeneous, one-dimensional labeled array: it can hold any data type, but only one type per Series. The axis labels are collectively referred to as the **index**.

A Series behaves much like a NumPy ndarray, so most NumPy functions that accept an ndarray also accept a Series.
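
As a minimal sketch of that ndarray-like behavior (the sample values and variable names here are illustrative, not part of the notebook's own cells), NumPy ufuncs and vectorized arithmetic apply element-wise and keep the index labels:

```python
import numpy as np
import pandas as pd

# Illustrative Series; values and labels are made up for this sketch
s = pd.Series([1.0, 4.0, 9.0], index=["a", "b", "c"])

# A NumPy ufunc applied to a Series returns a Series with the same index
roots = np.sqrt(s)
print(roots)

# Vectorized arithmetic and boolean masking work as they do on an ndarray
doubled = s * 2
large = s[s > 2]
print(doubled)
print(large)

# NumPy reductions accept a Series as well
print(np.max(s))
```

Unlike a plain ndarray, the results stay labeled, so later operations can still refer to elements by index label.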

---
---

---
### Creating a Series

You can use the Series class to create a Series.  

`new_series = pd.Series(data, index=index)`

Here **data** is the data to be converted into a Series, and **index** is an iterable that contains the labels. If **index** is not set, a default integer index is used (incrementing from 0).

**data** can be of different structures:

- a Python dict
- a 1D NumPy ndarray
- a scalar value (like 38)

---


---
#### **Series:** Python dict as Data

When a Python dict is passed, its keys become the labels. If an index is also passed, the Series takes its length and order from the index rather than from the dict: each index label is looked up among the dict keys, and labels with no matching key get the value NaN.

---


In [4]:
# Create a dict
d = {"a": 2, "b": 4, "c": 6}
display(d)
printhr()

# Create Series
s = pd.Series(d)
display(s)

# > When an index does not exist in the provided dict data, it will
#   be represented with NaN. 
# > The length of the index is length of Series.
# > Order of index is followed.
s2 = pd.Series(d, index=["c", "a", "b", "x"])
display(s2)

{'a': 2, 'b': 4, 'c': 6}



a    2
b    4
c    6
dtype: int64

c    6.0
a    2.0
b    4.0
x    NaN
dtype: float64

---
#### **Series:** NumPy ndarray as Data

A 1-dimensional NumPy ndarray can be passed as data. If an index is passed, it must have the same length as the array.

---


In [5]:
# Create an ndarray
d = np.random.randn(5)
display(d)
printhr()

# Create a Series
# Index in descending order using pd.RangeIndex()
s = pd.Series(d, index=pd.RangeIndex(4, -1, -1))
display(s)

# If no index is passed, a default integer index starting at 0 is assigned.
s2 = pd.Series(d)
display(s2)

array([ 0.2381735 , -1.52091083,  1.89039075, -0.25387356, -2.347492  ])



4    0.238174
3   -1.520911
2    1.890391
1   -0.253874
0   -2.347492
dtype: float64

0    0.238174
1   -1.520911
2    1.890391
3   -0.253874
4   -2.347492
dtype: float64

---
#### **Series:** Scalar Value as Data

Passing a scalar value with no index results in a Series of length 1. If an index is passed, the value is repeated to match the length of the index. A scalar value is a single value with no dimensions, that is, not a collection.

---


In [6]:
# Scalar value (not a collection)
d = 4.2

# Create Series
s = pd.Series(d)
display(s)

# Value is repeated throughout the index
s2 = pd.Series(d, index=["i", "n", "d", "e", "x"])
display(s2)

0    4.2
dtype: float64

i    4.2
n    4.2
d    4.2
e    4.2
x    4.2
dtype: float64

---
---
## DataFrame

A DataFrame is a 2-dimensional labeled data structure whose columns can each be of a different type. It is the most commonly used pandas object.

---
---

---
### Creating a DataFrame

Much like creating a Series, the DataFrame class can be used to create a DataFrame.  

`new_df = pd.DataFrame(data, index=index)`

**data** can be of different structures:

- a Python dict of any of the following:
  - Series
  - lists
  - 1D ndarrays
  - dicts
- a 2D NumPy ndarray
- a Series
- structured or record ndarray
- another DataFrame
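
A rough sketch of three of the input types listed above: a 2D ndarray, a structured ndarray, and another DataFrame. The array contents, labels, and field names are illustrative assumptions chosen for this example:

```python
import numpy as np
import pandas as pd

# 2D ndarray: rows and columns get default integer labels unless
# index/columns are passed (the labels here are illustrative)
arr = np.arange(6).reshape(3, 2)
df_arr = pd.DataFrame(arr, index=["x", "y", "z"], columns=["left", "right"])
print(df_arr)

# Structured ndarray: field names become the column labels
rec = np.array([(1, 2.5), (2, 3.5)], dtype=[("id", "i4"), ("score", "f8")])
df_rec = pd.DataFrame(rec)
print(df_rec)

# Another DataFrame: columns= selects a subset of the existing columns
df_sub = pd.DataFrame(df_arr, columns=["right"])
print(df_sub)
```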


---


---
#### **DataFrame:** dict of Series as Data

When creating a DataFrame from a dict of Series, the resulting index will be the union of the indexes of the various Series. The keys from the dict will be the column labels. Missing data will be converted to NaN. Columns can be set through the **columns** parameter; the DataFrame will only contain the columns passed.

---


In [7]:
# Create a dict containing Series
d = {
    "one": pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"]),
    "two": pd.Series(["one", "two", "three", "four"], index=["a", "b", "c", "d"]),
}

display(d)
printhr()

# Create DataFrame
# Missing data is set to NaN
df = pd.DataFrame(d)
display(df)

# When index is passed, it dictates the order
df = pd.DataFrame(d, index=["d", "a", "b"])
display(df)

# Specifying which columns to show
# Notice how "three" is a NaN column. This is because
# the data (d) does not have a column "three"
df = pd.DataFrame(d, index=["d", "a", "b", "c"], columns=["one", "three"])
display(df)

{'one': a    1.0
 b    2.0
 c    3.0
 dtype: float64,
 'two': a      one
 b      two
 c    three
 d     four
 dtype: object}



   one    two
a  1.0    one
b  2.0    two
c  3.0  three
d  NaN   four


   one   two
d  NaN  four
a  1.0   one
b  2.0   two


   one  three
d  NaN    NaN
a  1.0    NaN
b  2.0    NaN
c  3.0    NaN


---
#### **DataFrame:** dict of Lists as Data

Passing in a dict of lists behaves similarly to passing a dict of Series: the keys become the column labels.

---


In [8]:
# Create dict
d = {"Foo": [2, 5, 5], "Bar": [9, 2, 7]}
display(d)
printhr()

# Create df
df = pd.DataFrame(d, index=range(3))
display(df)

{'Foo': [2, 5, 5], 'Bar': [9, 2, 7]}



   Foo  Bar
0    2    9
1    5    2
2    5    7
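
One difference from a dict of Series is worth a short sketch: lists carry no labels, so pandas cannot align them and requires equal lengths. The shortened `"Bar"` list below is a hypothetical variation on the data above:

```python
import pandas as pd

# Lists carry no labels, so they are matched by position and
# must all have the same length
try:
    pd.DataFrame({"Foo": [2, 5, 5], "Bar": [9, 2]})
    error = None
except ValueError as exc:
    error = exc
print(type(error).__name__)

# Wrapping the lists in Series lets pandas align on the index instead;
# the missing entry is filled with NaN
df_aligned = pd.DataFrame(
    {"Foo": pd.Series([2, 5, 5]), "Bar": pd.Series([9, 2])}
)
print(df_aligned)
```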


---
#### **DataFrame:** List of dicts as Data

When passing a list of dicts as data, each dict will be a row, and the keys in the dict will be the columns.

---


In [21]:
# Create data
d = [
    {"a": 1, "b": 2, "c": 3, "d": 4},
    {"a": 3, "c": 9, "d": 12},
]
display(d)
printhr()

# Create df
df = pd.DataFrame(d)
display(df)
# Note that the column "b" shows 2.0
# This is because when creating our df, we have an empty value
# which is represented by NaN and since NaN is a special float,
# the column is automatically set to be a float64 data type on
# initialization.


# We can replace NaN values with a 0 and convert the type to
# an int

# Replace NaN values
df.fillna(0, inplace=True)

# Convert column to int64 (default for ints)
df["b"] = df["b"].astype("int64")

display(df, df.dtypes)

[{'a': 1, 'b': 2, 'c': 3, 'd': 4}, {'a': 3, 'c': 9, 'd': 12}]



   a    b  c   d
0  1  2.0  3   4
1  3  NaN  9  12


   a  b  c   d
0  1  2  3   4
1  3  0  9  12


a    int64
b    int64
c    int64
d    int64
dtype: object

---
#### **DataFrame:** Series as Data

Converting a Series to a DataFrame uses the Series' name as the column label (if no column name is passed).

---


In [40]:
# Named Series to DataFrame
s = pd.Series(range(3), index=list("abc"), name="my_series")
display(s)
printhr()

# If a named Series is passed, the Series name would be the 
# DataFrame's column label
df = pd.DataFrame(s)
display(df)

# With a named Series, columns= selects by name: "my_series" matches
# the Series name, while "a" matches nothing and becomes an all-NaN column.
df = pd.DataFrame(s, columns=["a", "my_series"])
display(df)

a    0
b    1
c    2
Name: my_series, dtype: int64



   my_series
a          0
b          1
c          2


     a  my_series
a  NaN          0
b  NaN          1
c  NaN          2


In [34]:
# Series as Data, example 2.
s = pd.Series(range(3), index=list("abc"))
display(s)
printhr()

# If the Series is unnamed, a column name can be passed
# that will override the default series name (0).
df = pd.DataFrame(s, columns=["new_column"])
display(df)

a    0
b    1
c    2
dtype: int64



   new_column
a           0
b           1
c           2


---
### Alternate DataFrame Constructors

`DataFrame.from_dict()`

Takes a dict of dicts or a dict of array-like sequences and returns a DataFrame. It operates like the DataFrame constructor except for the orient parameter which is 'columns' by default, but can be set to 'index' in order to use the dict keys as row labels.
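
A small sketch of the two `orient` settings, using made-up data; the key names and the `x`, `y`, `z` column labels are assumptions for illustration:

```python
import pandas as pd

data = {"row1": [1, 2, 3], "row2": [4, 5, 6]}

# Default orient="columns": dict keys become column labels
df_cols = pd.DataFrame.from_dict(data)
print(df_cols)

# orient="index": dict keys become row labels; columns= names the columns
df_rows = pd.DataFrame.from_dict(data, orient="index", columns=["x", "y", "z"])
print(df_rows)
```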

`DataFrame.from_records()`

Takes a list of tuples or an ndarray with structured dtype. It works analogously to the normal DataFrame constructor, except that the resulting DataFrame index may be a specific field of the structured dtype.
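
As a hedged illustration of both input forms, the sketch below builds one DataFrame from a structured ndarray and one from a list of tuples. The field names (`id`, `name`, `score`) and the values are hypothetical:

```python
import numpy as np
import pandas as pd

# Structured ndarray with illustrative field names
rec = np.array(
    [(1, "ann", 3.5), (2, "bob", 4.0)],
    dtype=[("id", "i4"), ("name", "U10"), ("score", "f8")],
)

# index="id" promotes that field to the row labels
df_rec = pd.DataFrame.from_records(rec, index="id")
print(df_rec)

# A list of tuples works too; columns= supplies the labels
df_tup = pd.DataFrame.from_records(
    [(1, "a"), (2, "b")], columns=["num", "letter"]
)
print(df_tup)
```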

See the [pandas docs](https://pandas.pydata.org/docs/user_guide/dsintro.html#alternate-constructors) for more on alternate constructors.  


---
