# 5.1 Introduction to pandas Data Structures

1. [Series](#series)
2. [DataFrame](#dataframe)
3. [Index Objects](#index)

The two main data structures in the pandas library are `Series` and `DataFrame`.

<a name="series"></a>
# Series

A `Series`:
1. is a one-dimensional array-like object
1. contains a sequence of values of the same type
1. contains an associated array of data labels, called its `index`

The simplest Series is made from an array of data:

In [190]:
import numpy as np
import pandas as pd

mySeries = pd.Series([4, 7, -5, 3])
mySeries

0    4
1    7
2   -5
3    3
dtype: int64

The values to the left of our actual values are the `index` values. Since we didn't specify an index when creating the Series, it defaults to the integers 0 through N-1, where N is the number of values.

Both the data and the index are *attributes* of the series and can be accessed as such:

In [191]:
mySeries.array

<NumpyExtensionArray>
[4, 7, -5, 3]
Length: 4, dtype: int64

In [192]:
mySeries.index

RangeIndex(start=0, stop=4, step=1)

The index can be thought of as the row names. You can supply your own labels with the `index` argument:

In [193]:
mySeries2 = pd.Series([4, 7, -5, 3], index=["d", "b", "a", "c"])
mySeries2

d    4
b    7
a   -5
c    3
dtype: int64

In [194]:
mySeries2.index

Index(['d', 'b', 'a', 'c'], dtype='object')

Index labels can also be used to select entries in the Series, using either a single label or a list of labels:

In [195]:
# Select the value associated with the index 'a'
mySeries2["a"]

-5

In [196]:
# Assign the value associated with the index 'd' to be 6
mySeries2["d"] = 6
mySeries2

d    6
b    7
a   -5
c    3
dtype: int64

In [197]:
# Select multiple values using a list of indices
mySeries2[["c", "a", "d"]]

c    3
a   -5
d    6
dtype: int64

In [198]:
# Notice how selecting 1 value versus multiple values changes the type of object returned:
print(type(mySeries2["a"]))
print(type(mySeries2[["c", "a", "d"]]))

<class 'numpy.int64'>
<class 'pandas.core.series.Series'>


Again, like row names, any sort of filtering or modification to the values will not change the association with a given index:

In [199]:
# Filtering maintains index-value relationships
mySeries2[mySeries2 > 0]

d    6
b    7
c    3
dtype: int64

In [200]:
# Scalar multiplication modifies the values but keeps their associated index
mySeries2 * 2

d    12
b    14
a   -10
c     6
dtype: int64

In [201]:
# Applying a math function behaves the same
np.exp(mySeries2)

d     403.428793
b    1096.633158
a       0.006738
c      20.085537
dtype: float64

You can also think of a series as a fixed-length, ordered dictionary because it's a mapping of index (key) values to data values.

They're similar enough that simply passing a dictionary to `pd.Series()` will coerce it to a Series. The reverse can be accomplished with `to_dict`.

In [202]:
# Check if an index (key) is in the series
print("b" in mySeries2)
print("e" in mySeries2)

True
False


In [203]:
# Coerce a dictionary to a series
sdata = {"Ohio": 35000, "Texas": 71000, "Oregon": 16000, "Utah": 5000}
mySeries3 = pd.Series(sdata)
mySeries3

Ohio      35000
Texas     71000
Oregon    16000
Utah       5000
dtype: int64

In [204]:
mySeries3.to_dict()

{'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}

The order of the dictionary keys is preserved when converting to a Series, but you can provide a specific order if you'd like.  

Notice in the example below how the dictionary key that is missing from the supplied index ("Utah") is dropped from the Series, while the index label with no matching dictionary key ("California") is populated as NA:

In [205]:
states = ["California", "Ohio", "Oregon", "Texas"]
mySeries4 = pd.Series(sdata, index=states)
mySeries4

California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
dtype: float64

Use the functions `isna` and `notna` to detect missing data (which will be referred to as 'missing', 'NA', or 'null').

These are pandas functions and also methods of Series.

In [206]:
# pandas function
pd.isna(mySeries4)

California     True
Ohio          False
Oregon        False
Texas         False
dtype: bool

In [207]:
# Series method
mySeries4.notna()

California    False
Ohio           True
Oregon         True
Texas          True
dtype: bool

One cool thing about Series that isn't included in analogous R data.tables/data.frames is that arithmetic operations between Series are automatically aligned by index label.

So if I add mySeries3 and mySeries4, the Ohio, Oregon, and Texas values are matched up and added correctly, and any label that has a value in only one of the two Series still appears in the result, as NaN:

In [208]:
mySeries3

Ohio      35000
Texas     71000
Oregon    16000
Utah       5000
dtype: int64

In [209]:
mySeries4

California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
dtype: float64

In [210]:
mySeries3 + mySeries4

California         NaN
Ohio           70000.0
Oregon         32000.0
Texas         142000.0
Utah               NaN
dtype: float64
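
If the NaN for a label that exists in only one Series isn't what you want, the `add` method with `fill_value` is one option. This is a sketch on rebuilt copies of the two Series (`s3` and `s4` are hypothetical names used only here), not something the notes above rely on:

```python
import numpy as np
import pandas as pd

# Hypothetical copies of mySeries3 and mySeries4
s3 = pd.Series({"Ohio": 35000, "Texas": 71000, "Oregon": 16000, "Utah": 5000})
s4 = pd.Series({"California": np.nan, "Ohio": 35000.0,
                "Oregon": 16000.0, "Texas": 71000.0})

# Plain + : a label with a value on only one side becomes NaN
plain = s3 + s4

# add(..., fill_value=0): a value missing on ONE side is treated as 0;
# a label missing on BOTH sides (California) is still NaN
filled = s3.add(s4, fill_value=0)
print(filled)
```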

Another aspect of Series is the `name` attribute, which can be thought of as a column name in a sense.

Both the Series and the Series's index have `name` attributes. These come in handy when working with DataFrames, where a Series's name serves as its column label.

Below we name the index "state" and the data values "population":

In [211]:
mySeries4.name = "population"
mySeries4.index.name = "state"
mySeries4

state
California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
Name: population, dtype: float64

The final thing we'll mention here is that a Series's index can be replaced in place by assigning a new set of labels:

In [212]:
mySeries

0    4
1    7
2   -5
3    3
dtype: int64

In [213]:
mySeries.index = ["Bob", "Steve", "Jeff", "Ryan"]
mySeries

Bob      4
Steve    7
Jeff    -5
Ryan     3
dtype: int64

You can't change an individual index value though!

```python
mySeries.index[1] = ["George"]

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[29], line 1
----> 1 mySeries.index[1] = ["George"]

File ~/miniconda3/envs/pydata-book/lib/python3.10/site-packages/pandas/core/indexes/base.py:5371, in Index.__setitem__(self, key, value)
   5369 @final
   5370 def __setitem__(self, key, value) -> None:
-> 5371     raise TypeError("Index does not support mutable operations")

TypeError: Index does not support mutable operations
```
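
Since a single label can't be assigned, one workaround is `rename`, which takes a mapping of old labels to new ones and returns a new Series. A small sketch, where `people` is a hypothetical copy of `mySeries` after the relabelling above:

```python
import pandas as pd

# Hypothetical copy of mySeries with the name labels
people = pd.Series([4, 7, -5, 3], index=["Bob", "Steve", "Jeff", "Ryan"])

# rename returns a NEW Series; the original index is left untouched
renamed = people.rename({"Steve": "George"})
print(renamed.index)
print(people.index)
```

To keep the change, assign the result back to a variable (for example `people = people.rename(...)`).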

<a name="dataframe"></a>
# DataFrame

A `DataFrame`:
1. is a rectangular table of data
2. contains an ordered, named collection of columns
    - each column can be a different value type
3. has both a row and a column index
4. is analogous to a dictionary of Series that all share the same index

## Creation

The most common way to construct a DataFrame is from a dictionary of equal-length lists or NumPy arrays.  

The row index will be generated automatically if not specified. The columns will be in the same order as they are in the dictionary.  

Another method of DataFrame construction is a dictionary of dictionaries. In this method, the top-level dictionary keys become the columns and the inner-level keys become the row indices.

### DataFrame out of Dictionary of Lists

In [214]:
# Build a Dictionary of lists
data = {"state": ["Ohio", "Ohio", "Ohio", "Nevada", "Nevada", "Nevada"],
        "year": [2000, 2001, 2002, 2001, 2002, 2003],
        "pop": [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}
data

{'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
 'year': [2000, 2001, 2002, 2001, 2002, 2003],
 'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}

In [215]:
# Coerce to DataFrame
data_DF = pd.DataFrame(data)
data_DF

Unnamed: 0,state,year,pop
0,Ohio,2000,1.5
1,Ohio,2001,1.7
2,Ohio,2002,3.6
3,Nevada,2001,2.4
4,Nevada,2002,2.9
5,Nevada,2003,3.2


In [216]:
# Can specify column order, if desired
data2_DF = pd.DataFrame(data, columns=["year", "state", "pop"])
data2_DF

Unnamed: 0,year,state,pop
0,2000,Ohio,1.5
1,2001,Ohio,1.7
2,2002,Ohio,3.6
3,2001,Nevada,2.4
4,2002,Nevada,2.9
5,2003,Nevada,3.2


In [217]:
# Non-existent columns are populated with NAs
data2_DF = pd.DataFrame(data, columns=["year", "state", "pop", "debt"])
data2_DF

Unnamed: 0,year,state,pop,debt
0,2000,Ohio,1.5,
1,2001,Ohio,1.7,
2,2002,Ohio,3.6,
3,2001,Nevada,2.4,
4,2002,Nevada,2.9,
5,2003,Nevada,3.2,


In [218]:
# Maintained as a column index!
data2_DF.columns

Index(['year', 'state', 'pop', 'debt'], dtype='object')

### DataFrame out of nested dictionary

In [219]:
# Nested dictionary:
populations = {"Ohio": {2000: 1.5, 2001: 1.7, 2002: 3.6},
               "Nevada": {2001: 2.4, 2002: 2.9}}
populations

{'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}, 'Nevada': {2001: 2.4, 2002: 2.9}}

In [220]:
# Coerce to DataFrame
populations_DF = pd.DataFrame(populations)
populations_DF

Unnamed: 0,Ohio,Nevada
2000,1.5,
2001,1.7,2.4
2002,3.6,2.9


In [221]:
# If specific index is provided during creation, then inner keys that aren't in it are excluded and missing ones are added as empty
# (The entry for 2000 in Ohio is omitted while the entry for 2003 is created)
pd.DataFrame(populations, index = [2001, 2002, 2003])

Unnamed: 0,Ohio,Nevada
2001,1.7,2.4
2002,3.6,2.9
2003,,


### DataFrame out of Dictionary of Series

In [222]:
ohio = populations_DF["Ohio"]
nev = populations_DF["Nevada"]
populations2 = {"Ohio": ohio, "Nevada" : nev}
populations2

{'Ohio': 2000    1.5
 2001    1.7
 2002    3.6
 Name: Ohio, dtype: float64,
 'Nevada': 2000    NaN
 2001    2.4
 2002    2.9
 Name: Nevada, dtype: float64}

In [223]:
pd.DataFrame(populations2)

Unnamed: 0,Ohio,Nevada
2000,1.5,
2001,1.7,2.4
2002,3.6,2.9


In [224]:
populations_DF["Ohio"]

2000    1.5
2001    1.7
2002    3.6
Name: Ohio, dtype: float64

### DataFrame Constructors
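
Besides the dictionary-based inputs shown earlier, `pd.DataFrame` also accepts inputs such as a two-dimensional NumPy array or a list of dictionaries. The sketch below uses made-up data and hypothetical variable names purely to illustrate those two cases:

```python
import numpy as np
import pandas as pd

# 2D ndarray: rows and columns come from the array shape;
# column labels are optional and supplied here for readability
from_array = pd.DataFrame(np.arange(6).reshape(2, 3), columns=["a", "b", "c"])

# List of dicts: each dict is one row, keys become columns,
# and keys missing from a row are filled with NaN
records = [{"state": "Ohio", "pop": 1.5}, {"state": "Nevada"}]
from_records = pd.DataFrame(records)

print(from_array)
print(from_records)
```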

<img src="./myImages/table5.1_dfConstructors.png" width = 600>

## Manipulation

In [225]:
# View top 5 rows
data_DF.head()

Unnamed: 0,state,year,pop
0,Ohio,2000,1.5
1,Ohio,2001,1.7
2,Ohio,2002,3.6
3,Nevada,2001,2.4
4,Nevada,2002,2.9


In [226]:
# View bottom 5 rows
data_DF.tail()

Unnamed: 0,state,year,pop
1,Ohio,2001,1.7
2,Ohio,2002,3.6
3,Nevada,2001,2.4
4,Nevada,2002,2.9
5,Nevada,2003,3.2


In [227]:
# Access a single column with "dictionary-like" notation
data2_DF['state']

0      Ohio
1      Ohio
2      Ohio
3    Nevada
4    Nevada
5    Nevada
Name: state, dtype: object

In [228]:
# Access a single column with "dot attribute" notation
# Note: the column name can't clash with an existing attribute or method and must be a valid Python identifier (letters, digits, and underscores only)
data2_DF.year

0    2000
1    2001
2    2002
3    2001
4    2002
5    2003
Name: year, dtype: int64
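
The `pop` column in these examples is a handy illustration of that caveat, because `DataFrame` already has a method called `pop`. A minimal sketch on a small hypothetical frame:

```python
import pandas as pd

frame = pd.DataFrame({"state": ["Ohio", "Nevada"], "pop": [1.5, 2.4]})

# Dictionary-like notation always returns the column
by_key = frame["pop"]

# Dot notation finds the DataFrame.pop METHOD first, not the column
by_attr = frame.pop

print(type(by_key))
print(callable(by_attr))
```

This is why dictionary-like notation is the safer default when column names aren't under your control.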

In [229]:
# Single columns are returned as Series
type(data2_DF.year)

pandas.core.series.Series

In [230]:
# Order columns in specific way
data_DF[["year", "state", "pop"]]

Unnamed: 0,year,state,pop
0,2000,Ohio,1.5
1,2001,Ohio,1.7
2,2002,Ohio,3.6
3,2001,Nevada,2.4
4,2002,Nevada,2.9
5,2003,Nevada,3.2


You can transpose a DataFrame using `.T`, the same as a NumPy array. Be careful, though: if the columns don't share a data type, the transposed columns fall back to a common type (often `object`), so the original column types are lost.

In [231]:
# Transpose with same syntax as NumPy: T
data_DF.T

Unnamed: 0,0,1,2,3,4,5
state,Ohio,Ohio,Ohio,Nevada,Nevada,Nevada
year,2000,2001,2002,2001,2002,2003
pop,1.5,1.7,3.6,2.4,2.9,3.2
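
To see the type loss directly, compare `dtypes` before and after transposing. This sketch uses a two-row hypothetical frame rather than `data_DF` so that it runs on its own:

```python
import pandas as pd

frame = pd.DataFrame({"state": ["Ohio", "Nevada"],
                      "year": [2000, 2001],
                      "pop": [1.5, 2.4]})

before = frame.dtypes          # one dtype per original column
after = frame.T.dtypes         # each transposed column mixes str, int, float
round_trip = frame.T.T.dtypes  # transposing back does not restore the types

print(before)
print(after)
print(round_trip)
```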


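DataFrames also have special attributes `loc` and `iloc` that can be used to select rows: `loc` selects by index label and `iloc` by integer position. More later, but for now (these cells were run after the Modification section, so `debt` already holds the values assigned there):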

In [247]:
data2_DF

Unnamed: 0,year,state,pop,debt
0,2000,Ohio,1.5,
1,2001,Ohio,1.7,
2,2002,Ohio,3.6,-1.2
3,2001,Nevada,2.4,
4,2002,Nevada,2.9,-1.5
5,2003,Nevada,3.2,-1.7


In [248]:
data2_DF.loc[1]

year     2001
state    Ohio
pop       1.7
debt      NaN
Name: 1, dtype: object

In [249]:
data2_DF.iloc[0]

year     2000
state    Ohio
pop       1.5
debt      NaN
Name: 0, dtype: object
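
With the default 0 to N-1 index, labels and positions coincide, so `loc` and `iloc` look interchangeable. A sketch with a hypothetical non-default index makes the difference visible:

```python
import pandas as pd

frame = pd.DataFrame({"year": [2000, 2001, 2002], "pop": [1.5, 1.7, 3.6]},
                     index=[10, 20, 30])

by_label = frame.loc[20]      # the row whose index LABEL is 20
by_position = frame.iloc[0]   # the first row, whatever its label is

# There is no label 0 in this index, so loc raises KeyError
missing_label = False
try:
    frame.loc[0]
except KeyError:
    missing_label = True

print(by_label)
print(by_position)
print(missing_label)
```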

## Modification

Columns can be modified by assignment:
- a scalar value gets repeated for every row
- an array/list gets populated in order (must be of same length as the DataFrame!)
- a Series gets matched by index label, with missing values inserted where no label matches

In [232]:
# Scalar
data2_DF["debt"] = 16.5
data2_DF

Unnamed: 0,year,state,pop,debt
0,2000,Ohio,1.5,16.5
1,2001,Ohio,1.7,16.5
2,2002,Ohio,3.6,16.5
3,2001,Nevada,2.4,16.5
4,2002,Nevada,2.9,16.5
5,2003,Nevada,3.2,16.5


In [233]:
# Array
data2_DF["debt"] = np.arange(6.)
data2_DF

Unnamed: 0,year,state,pop,debt
0,2000,Ohio,1.5,0.0
1,2001,Ohio,1.7,1.0
2,2002,Ohio,3.6,2.0
3,2001,Nevada,2.4,3.0
4,2002,Nevada,2.9,4.0
5,2003,Nevada,3.2,5.0


In [234]:
# Series 1
val1 = pd.Series([-1.2, -1.5, -1.7])
data2_DF["debt"] = val1
data2_DF

Unnamed: 0,year,state,pop,debt
0,2000,Ohio,1.5,-1.2
1,2001,Ohio,1.7,-1.5
2,2002,Ohio,3.6,-1.7
3,2001,Nevada,2.4,
4,2002,Nevada,2.9,
5,2003,Nevada,3.2,


In [235]:
# Series 2
val2 = pd.Series([-1.2, -1.5, -1.7], index = [2, 4, 5])
data2_DF["debt"] = val2
data2_DF

Unnamed: 0,year,state,pop,debt
0,2000,Ohio,1.5,
1,2001,Ohio,1.7,
2,2002,Ohio,3.6,-1.2
3,2001,Nevada,2.4,
4,2002,Nevada,2.9,-1.5
5,2003,Nevada,3.2,-1.7
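
The length rule only applies to lists and arrays. As a sketch on a small hypothetical frame: a list that is too short raises `ValueError`, while a short Series is aligned by label and padded with NaN.

```python
import pandas as pd

frame = pd.DataFrame({"year": [2000, 2001, 2002]})

# A list must match the number of rows, otherwise pandas raises ValueError
message = None
try:
    frame["debt"] = [1.0, 2.0]
except ValueError as err:
    message = str(err)
print(message)

# A Series is aligned on the index instead, so a short one is allowed;
# row 2 has no matching label and gets NaN
frame["debt"] = pd.Series([1.0, 2.0])
print(frame)
```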


New columns can be created directly by assignment. There is no need to set up a blank column first: compute a result and assign it to a new column name in one step:

In [236]:
data2_DF["eastern"] = data2_DF["state"] == "Ohio"
data2_DF

Unnamed: 0,year,state,pop,debt,eastern
0,2000,Ohio,1.5,,True
1,2001,Ohio,1.7,,True
2,2002,Ohio,3.6,-1.2,True
3,2001,Nevada,2.4,,False
4,2002,Nevada,2.9,-1.5,False
5,2003,Nevada,3.2,-1.7,False


Use the `del` keyword to remove columns:

In [237]:
del data2_DF["eastern"]
data2_DF.columns

Index(['year', 'state', 'pop', 'debt'], dtype='object')
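
`del` modifies the DataFrame in place. If you'd rather keep the original untouched, the `drop` method returns a new DataFrame without the listed columns. A minimal sketch on a hypothetical frame:

```python
import pandas as pd

frame = pd.DataFrame({"year": [2000, 2001],
                      "state": ["Ohio", "Ohio"],
                      "debt": [1.0, 2.0]})

# drop returns a copy without the column; frame itself is unchanged
trimmed = frame.drop(columns=["debt"])

print(trimmed.columns)
print(frame.columns)
```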

## Other Items

Both the `index` (row labels) and `columns` of a DataFrame have `name` attributes. If present, they'll be displayed when the DataFrame is printed.  

The DataFrame itself, however, doesn't have a `name` attribute.

In [238]:
populations_DF

Unnamed: 0,Ohio,Nevada
2000,1.5,
2001,1.7,2.4
2002,3.6,2.9


In [239]:
print(populations_DF.index.name)
print(populations_DF.columns.name)

None
None


In [240]:
populations_DF.index.name = "year"
populations_DF

Unnamed: 0,Ohio,Nevada
year,,
2000,1.5,
2001,1.7,2.4
2002,3.6,2.9


In [241]:
populations_DF.columns.name = "state"
populations_DF

state,Ohio,Nevada
year,,
2000,1.5,
2001,1.7,2.4
2002,3.6,2.9


The `to_numpy` method will convert a DataFrame into a two-dimensional array.  

If the columns have different data types, then the final array's data type will accommodate all columns (i.e. float if int+float; object if int+string, etc.).

In [242]:
populations_DF.to_numpy()

array([[1.5, nan],
       [1.7, 2.4],
       [3.6, 2.9]])

In [243]:
data2_DF.to_numpy()

array([[2000, 'Ohio', 1.5, nan],
       [2001, 'Ohio', 1.7, nan],
       [2002, 'Ohio', 3.6, -1.2],
       [2001, 'Nevada', 2.4, nan],
       [2002, 'Nevada', 2.9, -1.5],
       [2003, 'Nevada', 3.2, -1.7]], dtype=object)
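
The dtype rule can be checked directly on the returned array. A sketch with two small hypothetical frames, one numeric-only and one mixing numbers and strings:

```python
import pandas as pd

nums = pd.DataFrame({"year": [2000, 2001], "pop": [1.5, 1.7]})
mixed = pd.DataFrame({"year": [2000, 2001], "state": ["Ohio", "Nevada"]})

# int + float columns: the array is upcast to float
nums_dtype = nums.to_numpy().dtype

# int + string columns: the only common type is object
mixed_dtype = mixed.to_numpy().dtype

print(nums_dtype)
print(mixed_dtype)
```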

<a name="index"></a>
# Index Objects

Index objects:
1. hold the axis labels (including DataFrame column names)
1. hold other metadata (like the axis name or names)
1. are created internally from any array or other sequence of labels passed when constructing a Series or DataFrame

Index objects are **immutable**  

They also behave like a fixed set (i.e. can use `in` and `not in` to check them)

Unlike sets, however, a pandas Index **can have duplicate labels**. Selecting a duplicated label returns all occurrences of it.
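
The set-like behaviour goes beyond `in`: an Index also has set-style methods. The sketch below uses two hypothetical indexes and a handful of methods that exist on `pd.Index`; it is not a full list.

```python
import pandas as pd

left = pd.Index(["a", "b", "c"])
right = pd.Index(["b", "c", "d"])

both = left.intersection(right)     # labels present in both
either = left.union(right)          # labels present in at least one
only_left = left.difference(right)  # labels in left but not in right
membership = left.isin(["a", "c"])  # boolean array, one entry per label

print(both, either, only_left, membership)
```

Each of these returns a new object, consistent with Index objects being immutable.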

<img src="./myImages/table5.2_indexMethods.png" width = 600>

In [244]:
# View an index
obj = pd.Series(np.arange(3), index=["a", "b", "c"])
index = obj.index
index

Index(['a', 'b', 'c'], dtype='object')

In [245]:
# Slice an index
index[1:]

Index(['b', 'c'], dtype='object')

In [246]:
# Immutable
index[1] = "d"

TypeError: Index does not support mutable operations

In [None]:
# Assign indices from a variable
myLabels = pd.Index(np.arange(3))
myLabels

Index([0, 1, 2], dtype='int64')

In [None]:
obj2 = pd.Series([1.5, -2.5, 0], index=myLabels)
obj2

0    1.5
1   -2.5
2    0.0
dtype: float64

In [None]:
# Identity check: the Series reuses the very same Index object
obj2.index is myLabels

True

In [None]:
populations_DF

state,Ohio,Nevada
year,,
2000,1.5,
2001,1.7,2.4
2002,3.6,2.9


In [None]:
# Use `in` to check column index
"Ohio" in populations_DF.columns

True

In [None]:
# Use `not in` to check row index
2003 not in populations_DF.index

True

In [None]:
# Duplicate indices
data2_DF.index = ["A", "A", "B", "C", "C", "D"]
data2_DF

Unnamed: 0,year,state,pop,debt
A,2000,Ohio,1.5,
A,2001,Ohio,1.7,
B,2002,Ohio,3.6,-1.2
C,2001,Nevada,2.4,
C,2002,Nevada,2.9,-1.5
D,2003,Nevada,3.2,-1.7


In [None]:
data2_DF[data2_DF.index == "A"]
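
The boolean mask above keeps every row labelled "A". Selecting by label directly behaves the same way, which means the return type depends on whether the label is duplicated. A self-contained sketch with a hypothetical Series:

```python
import pandas as pd

dup = pd.Series([10, 20, 30], index=["A", "A", "B"])

many = dup["A"]   # duplicated label: a Series holding both entries
one = dup["B"]    # unique label: a single scalar value

# is_unique reports whether any label is repeated
unique_flag = dup.index.is_unique

print(many)
print(one)
print(unique_flag)
```

Code that assumes a scalar comes back can break on duplicated labels, so checking `is_unique` first is a reasonable precaution.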