# Assignment 26: Pandas Series Objects #

### Goals for this Assignment ###

By the time you have completed this assignment, you should be able to:

- Manually create a `Series` object with the Pandas library, including with custom indices
- Create a `Series` object from a Python dictionary
- Access values in `Series` objects by name using `.loc` or square brackets (`[]`), and by numeric index with `.iloc`
- Get the size of a `Series` object with `.size`
- Perform vector operations on `Series` objects to do list-like operations without lists
- Use fancy indexing on `Series` objects to extract values at given indices
- Use masking on `Series` objects to extract elements matching a predicate
- Use the `mean`, `min`, and `max` methods to gather basic statistical information over a `Series` object

## Step 1: Create a `Series` Object with Manually-Specified Indices ##

### Background: Pandas `Series` Objects ###

[Pandas](https://pandas.pydata.org/) is a widely used Python library for Data Science applications.
Pandas offers much of the same functionality as [NumPy](https://numpy.org/), and uses NumPy internally.
As a result, knowledge of NumPy will usually transfer over to Pandas.
However, unlike NumPy, Pandas was designed specifically for Data Science purposes, which makes many common operations simpler.

For this assignment, we will look specifically at [`Series`](https://pandas.pydata.org/docs/reference/api/pandas.Series.html) objects, which are Pandas' representation of single-dimensional data.
As a reminder, with single-dimensional data, we only need to provide a single index to get access to an individual datum; for example, `some_list[3]` will get the element at index `3` in `some_list`.
Notably, what exactly constitutes an "index" is, in general, somewhat open to interpretation.
With lists, indices are integers.
With dictionaries, however, the "indices" are much more flexible: keys effectively act as indices, and keys can be strings, integers, or floats, among other types (exactly which types are allowed is beyond our scope).

With `Series` objects, we can have our indices be integers, much like a normal Python list.
This is shown in the cell below, which uses Pandas to create a `Series` object.
Note that Pandas itself is `import`ed as `pd`; by convention, `pd` is the preferred alias for Pandas.

In [1]:
import pandas as pd

series = pd.Series([3, 2, 8, 4])
print(series[0]) # prints 3
print(series[1]) # prints 2
print(series[2]) # prints 8
print(series[3]) # prints 4

3
2
8
4


As shown in the cell above, we can create a `Series` object in much the same way as a NumPy array, and we can access it in the same way, too.
However, while this works, it is generally not the preferred style.
`Series` objects optionally allow for custom indices to be used, which can be specified with the `index` parameter, like so:

In [2]:
named_series = pd.Series([4, 2, 8, 3, 9, 1], index = ["a", "b", "c", "d", "e", "f"])
print(named_series["a"]) # prints 4
print(named_series["b"]) # prints 2
print(named_series["c"]) # prints 8
print(named_series["d"]) # prints 3
print(named_series["e"]) # prints 9
print(named_series["f"]) # prints 1

4
2
8
3
9
1


With custom indices provided, `Series` objects end up having a very similar look and feel to Python dictionaries, since we can use strings and other types for the indices instead of just integers.
While this may not seem like that big of a shift, in practice this can help a lot when doing data analysis, because usually there is some semantic meaning behind each individual value.
For example, if we have a table, usually each column in the table represents something distinct.
Using medical data as an example, we might have columns for each patient's name, birth date, and blood pressure during their last doctor visit, in that order.
If we have a single row of this table represented as a `Series` object, we could access these values via something like `patient["name"]` or `patient["blood_pressure"]`.
This is self-describing, making this easier to understand (and arguably less error-prone) than `patient[0]` or `patient[2]`.
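
To make the patient example above concrete, here is a minimal sketch of one such row as a `Series` object. The labels (`"name"`, `"birth_date"`, `"blood_pressure"`) and all of the values are made up purely for illustration.

```python
import pandas as pd

# Hypothetical patient record; labels and values are invented for this sketch.
patient = pd.Series(
    ["Ada Example", "1990-01-15", "120/80"],
    index=["name", "birth_date", "blood_pressure"],
)

# Access by label: the code says what each value means.
print(patient["name"])
print(patient["blood_pressure"])

# The labels themselves are available through the .index attribute.
labels = list(patient.index)
print(labels)
```

Reading `patient["blood_pressure"]` requires no knowledge of the column order, which is the point being made above.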

### Try this Yourself ###

In the next cell, create a `Series` object bound to variable `my_series` holding the following:

- `5`, with the label `"foo"`
- `2`, with the label `"bar"`
- `3`, with the label `"baz"`

Leave the prints in place in order to test your code.

In [3]:
# Define your series here.  Leave the prints below in order to test your code.
my_series = pd.Series([5, 2, 3], index=["foo", "bar", "baz"])

print(my_series["foo"]) # should print 5
print(my_series["bar"]) # should print 2
print(my_series["baz"]) # should print 3

5
2
3


## Step 2: Create a `Series` Object from a Dictionary ##

### Background: Making `Series` Objects from a Dictionary ###

The `Series` constructor can be used in multiple ways; instead of a list of values, it can take a dictionary.
In that case, the keys of the dictionary will be used as indices, and the values will be used as the entries.
This is illustrated in the cell below.

In [4]:
dict_example = pd.Series({ "apple" : 1, "pear" : 2 })
print(dict_example["apple"]) # prints 1
print(dict_example["pear"]) # prints 2

1
2


### Try this Yourself ###

In the next cell, create a `Series` object bound to variable `my_series_dict`.
This should hold the same values as your prior definition of `my_series`, only now the `Series` object should be created by passing a dictionary to `pd.Series` instead.
Leave the prints in place in order to test your code.

In [5]:
# Define your series here.  Leave the prints below in order to test your code.

my_series_dict = pd.Series({ "foo" : 5, "bar" : 2 , "baz" : 3})

print(my_series_dict["foo"]) # should print 5
print(my_series_dict["bar"]) # should print 2
print(my_series_dict["baz"]) # should print 3

5
2
3


## Step 3: Access Values Using `.loc` and `.iloc` ##

### Background: `.loc` and `.iloc` ###

In addition to the square bracket notation we have been using so far, you can also access values in a `Series` using `.loc`, as shown below:

In [6]:
example = pd.Series({ "alpha" : 42, "beta" : 12, "gamma" : 38 })
print(example.loc["alpha"]) # prints 42
print(example.loc["beta"]) # prints 12
print(example.loc["gamma"]) # prints 38

42
12
38


As shown, accessing through `.loc` effectively does the same thing as using square brackets directly.
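
As a small sketch of this equivalence (the variable name `loc_demo` and the label `"delta"` are just for illustration), the following checks that both notations return the same value, and shows what happens when a label is not present. The behavior is much like a dictionary: a missing label raises `KeyError`.

```python
import pandas as pd

loc_demo = pd.Series({"alpha": 42, "beta": 12, "gamma": 38})

# Both notations look the value up by its label.
same = bool(loc_demo.loc["beta"] == loc_demo["beta"])
print(same)

# A label that is not in the index raises KeyError.
try:
    loc_demo.loc["delta"]
    missing_raised = False
except KeyError:
    missing_raised = True
print(missing_raised)
```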

You can also access by numeric index via `.iloc`, as shown below:

In [7]:
print(example.iloc[0]) # prints 42
print(example.iloc[1]) # prints 12
print(example.iloc[2]) # prints 38

42
12
38


As shown, `.iloc` allows for access via a numeric index, similar to a list.
The items are in the same order as they were inserted into the `Series`.
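
The sketch below (with the illustrative name `ordered`) lines up each numeric position with its label, to show that `.iloc` and `.loc` are two ways of reaching the same elements. It also shows that `.iloc` accepts negative positions, which count from the end as they do with lists.

```python
import pandas as pd

ordered = pd.Series({"alpha": 42, "beta": 12, "gamma": 38})

# Labels, in the order the entries were inserted.
labels = list(ordered.index)

for position, label in enumerate(labels):
    # Position i via .iloc and the i-th label via .loc name the same element.
    print(position, label, ordered.iloc[position], ordered.loc[label])

# Negative positions count from the end.
last = ordered.iloc[-1]
print(last)
```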

### Try this Yourself ###

In the next cell, access the elements of `my_series_dict` by numeric index using `.iloc`.
You should print out the elements in the _reverse_ of the order in which you inserted them; that is, this should print:

```
3
2
5
```

You don't need to write a loop or use `reverse`; the intention here is to write three individual `print`s which access `.iloc` with the numeric indices corresponding to the above elements.

In [11]:
# Define your prints accessing my_series_dict elements via numeric
# index, using .iloc

print(my_series_dict.iloc[2]) 
print(my_series_dict.iloc[1]) 
print(my_series_dict.iloc[0]) 

3
2
5


## Step 4: Get the Size of a `Series` Object with `.size` ##

### Background: `len` of `Series` Objects and `.size` ###

You can get the size of a `Series` object either with the `len` function or through its `.size` attribute, as shown below:

In [12]:
size_example = pd.Series([4, 9, 4], index=["a", "b", "c"])
print(len(size_example)) # prints 3
print(size_example.size) # prints 3

3
3


### Try this Yourself ###

In the next cell, print the size of your `my_series_dict` `Series` object from before, once using `len`, and again using `.size`.

In [14]:
# Put your prints below
print(len(my_series_dict)) 
print(my_series_dict.size)

3
3


## Step 5: Use NumPy-like Vector Operations on `Series` Objects ##

### Background: Vector Operations on `Series` Objects ###

`Series` objects permit vector operations in practically the same way as NumPy arrays do, as shown below:

In [15]:
example1 = pd.Series({"foo" : 1, "bar" : 2})
example2 = pd.Series({"foo" : 5, "bar" : 3})
result = example1 + example2
print(result["foo"]) # prints 6
print(result["bar"]) # prints 5

6
5


As shown, `example1 + example2` computes a new `Series` object, where each element of the resulting object results from the sum of two values from `example1` and `example2`, using the same index.
That is, `result["foo"]` is `6`, because `example1["foo"] + example2["foo"]` is `6`.
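
One detail worth seeing directly is that elements are matched up by index label, not by position. The following sketch (variable names `left` and `right` are illustrative) adds two `Series` objects that hold the same labels in a different order.

```python
import pandas as pd

left = pd.Series({"foo": 1, "bar": 2})
right = pd.Series({"bar": 3, "foo": 5})  # same labels, different order

total = left + right
print(total["foo"])  # 1 + 5
print(total["bar"])  # 2 + 3
```

If the values had been paired by position instead, `total["foo"]` would have been `1 + 3`.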

`Series` objects also broadcast scalar values in the same way as NumPy arrays do, as shown below:

In [16]:
result = example1 + 7
print(result["foo"]) # prints 8
print(result["bar"]) # prints 9

8
9


That is, `example1 + 7` returns a new `Series` object, where the value of each field is the result of adding `7` to the corresponding field in `example1`.
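
Broadcasting is not limited to `+`. As a brief sketch (the name `counts` is illustrative), other arithmetic and comparison operators also apply a scalar to every element:

```python
import pandas as pd

counts = pd.Series({"foo": 1, "bar": 2})

doubled = counts * 2  # every value multiplied by 2
is_big = counts > 1   # every value compared against 1, giving booleans

print(doubled)
print(is_big)
```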

Unlike NumPy arrays, if we perform a vector operation on two `Series` objects which don't share the same indices, this is not a hard error.
That is, an exception is **not** raised if the indices are different.
Instead, we see `NaN` in the resulting `Series` object for each field that was present in only one of the two input `Series` objects, as shown below:

In [17]:
bar_baz = pd.Series({ "bar" : 5, "baz" : 7})
bar_baz_result = example1 + bar_baz
print(bar_baz_result)

bar    7.0
baz    NaN
foo    NaN
dtype: float64


If you run the prior cell, you'll see that `bar_baz_result` has the indices `"bar"`, `"baz"`, and `"foo"`; this corresponds to the set union of all indices for `example1` and `bar_baz`.
However, only the set _intersection_ has meaningful values; i.e., the only index shared by both `example1` and `bar_baz` is `"bar"`, therefore `bar_baz_result` holds index `"bar"` with a value corresponding to the sum of the values at these indices (`2 + 5 = 7`).
The remainder of the fields (`"baz"` and `"foo"`) hold [`NaN`, or Not a Number](https://en.wikipedia.org/wiki/NaN), which is a special floating-point value used to indicate that the result of a computation is undefined or missing.
In this case, only one of the operands to `+` had the given index, so `NaN` was used as the result, since we couldn't actually sum anything.
Note that this also changed the type of the values in the `Series` object to `float64` (shown with `dtype`); this is because `NaN` is a floating-point value, and integers cannot represent it.
Just as with NumPy arrays, the types of the elements in a `Series` object need to be the same, and the available types correspond to the same types in NumPy.
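
The sketch below inspects the type change and the `NaN` entries directly. The names `ints` and `other` are illustrative. The `.isna()` method and the `fill_value` parameter of `.add` go slightly beyond what this assignment requires; they are shown only as one possible way to detect or avoid `NaN`.

```python
import pandas as pd

ints = pd.Series({"foo": 1, "bar": 2})
other = pd.Series({"bar": 5, "baz": 7})

summed = ints + other
print(ints.dtype)     # an integer dtype before the operation
print(summed.dtype)   # float64 afterwards, because of the NaN entries
print(summed.isna())  # True wherever only one operand had the label

# Optional: treat a label missing from one operand as 0 instead of producing NaN.
filled = ints.add(other, fill_value=0)
print(filled)
```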

### Try this Yourself ###

This step effectively has you redo step 1 of the prior assignment, but over Pandas `Series` objects instead of NumPy array objects.
The next cell defines two different `Series` objects, `s1` and `s2`.
The comments say which operation you should perform.
The first one is provided for you as an example.

In [20]:
s1 = pd.Series([4, 8, 9, 0, 2, 0, 4], index = ["a", "b", "c", "d", "e", "f", "g"])
s2 = pd.Series([0, 1, 2, 3, 4, 5, 6], index = ["a", "b", "c", "d", "e", "f", "g"])

# +
print(s1 + s2)
# prior statement should print:
# a     4
# b     9
# c    11
# d     3
# e     6
# f     5
# g    10
# dtype: int64

# -
print()
print(s1 - s2)
# prior statement should print:
# a    4
# b    7
# c    7
# d   -3
# e   -2
# f   -5
# g   -2
# dtype: int64

# *
print()
print(s1 * s2)
# prior statement should print:
# a     0
# b     8
# c    18
# d     0
# e     8
# f     0
# g    24
# dtype: int64

# <
print()
print(s1 < s2)
# prior statement should print:
# a    False
# b    False
# c    False
# d     True
# e     True
# f     True
# g     True
# dtype: bool

# <=
print()
print(s1 <= s2)
# prior statement should print:
# a    False
# b    False
# c    False
# d     True
# e     True
# f     True
# g     True
# dtype: bool

# >
print()
print(s1 > s2)
# prior statement should print:
# a     True
# b     True
# c     True
# d    False
# e    False
# f    False
# g    False
# dtype: bool

# >=
print()
print(s1 >= s2)
# prior statement should print:
# a     True
# b     True
# c     True
# d    False
# e    False
# f    False
# g    False
# dtype: bool

a     4
b     9
c    11
d     3
e     6
f     5
g    10
dtype: int64

a    4
b    7
c    7
d   -3
e   -2
f   -5
g   -2
dtype: int64

a     0
b     8
c    18
d     0
e     8
f     0
g    24
dtype: int64

a    False
b    False
c    False
d     True
e     True
f     True
g     True
dtype: bool

a    False
b    False
c    False
d     True
e     True
f     True
g     True
dtype: bool

a     True
b     True
c     True
d    False
e    False
f    False
g    False
dtype: bool

a     True
b     True
c     True
d    False
e    False
f    False
g    False
dtype: bool


## Step 6: Use Fancy Indexing on `Series` Objects ##

### Background: Fancy Indexing on `Series` Objects ###

Fancy indexing can also be performed on `Series` objects in much the same way as NumPy arrays.
This is illustrated in the cell below.

In [29]:
example = pd.Series([3, 8, 5, 1, 8, 3], index=["foo", "bar", "baz", "moo", "cow", "bull"])
trimmed = example[["foo", "moo"]]
print(trimmed)

foo    3
moo    1
dtype: int64


If you run the prior cell, the output shows that `trimmed` contains values for `"foo"` and `"moo"`, which are `3` and `1`, respectively.
This works exactly the same as with NumPy arrays, only now the indices correspond to the `index` parameter instead of being hard-coded as sequential integers.
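
As a sketch of how this relates to `.loc` and `.iloc` from Step 3 (the variable name `animals` is illustrative), both accessors also accept a list: `.loc` takes a list of labels, and `.iloc` takes a list of integer positions. The result follows the order of the list you pass in.

```python
import pandas as pd

animals = pd.Series([3, 8, 5, 1, 8, 3], index=["foo", "bar", "baz", "moo", "cow", "bull"])

by_label = animals.loc[["foo", "moo"]]  # list of labels
by_position = animals.iloc[[0, 3]]      # list of integer positions

# Both select the same two elements.
print(by_label.equals(by_position))

# The result follows the order of the requested labels.
reordered = animals[["moo", "foo"]]
print(list(reordered.index))
```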

### Try this Yourself ###

In the next cell, using fancy indexing, extract out the values from `example` for indices `"bar"`, `"cow"`, and `"bull"` into their own `Series` object.
That `Series` object should be bound to the variable `subset`, and then printed out.

In [31]:
# Use fancy indexing to extract out the requested indices into subset below

subset = example[["bar", "cow", "bull"]]

print(subset)

bar     8
cow     8
bull    3
dtype: int64


## Step 7: Use Masking on `Series` Objects ##

### Background: Masking on `Series` Objects ###

In addition to NumPy-style fancy indexing, `Series` objects also support masking in the same way as NumPy arrays.
For example, we can redefine the `evens` function from step 3 of the prior assignment to work with `Series` objects, as shown below:

In [22]:
def evens(s):
    return s[s % 2 == 0]

other_example = pd.Series([3, 8, 5, 1, 2, 7], index=["foo", "bar", "baz", "moo", "cow", "bull"])
print(evens(other_example))

bar    8
cow    2
dtype: int64


In fact, `evens` isn't meaningfully changed **at all** from the prior assignment; the only difference is that the parameter is named `s` instead of `arr`, but the name of the parameter doesn't have any impact on the code's behavior.
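
To see what masking is doing, it can help to look at the mask on its own. In this sketch (variable names `values` and `mask` are illustrative), `values % 2 == 0` produces a `Series` of booleans with the same indices, and indexing with that mask keeps only the entries marked `True`.

```python
import pandas as pd

values = pd.Series([3, 8, 5, 1, 2, 7], index=["foo", "bar", "baz", "moo", "cow", "bull"])

# The mask is itself a Series of booleans, one per index.
mask = values % 2 == 0
print(mask)

# Indexing with the mask keeps only the entries where the mask is True.
kept = values[mask]
print(kept)
```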

### Try this Yourself ###

In the cell below, define a `divisible_by` function which will take (in order):

- A `Series` object, holding integers
- Some positive integer `n`

Given these parameters, `divisible_by` should return a new `Series` object, holding all values which are evenly divisible by `n`.
As a hint, a number is evenly divisible by `n` if the remainder after dividing by `n` (computed with the modulus operator, `%`) is `0`.

In [23]:
# Define your divisible_by function here.  Leave the calls below in order
# to test your code.
def divisible_by(ser, n):
    return ser[ser % n == 0]

threes = pd.Series([3, 5, 6, 15, 7], index=["a", "b", "c", "d", "e"])
print(divisible_by(threes, 3))
# prior statement should print:
# a     3
# c     6
# d    15
# dtype: int64

sevens = pd.Series([3, 7, 2, 14, 21, 12], index=["a", "b", "c", "d", "e", "f"])
print(divisible_by(sevens, 7))
# prior statement should print:
# b     7
# d    14
# e    21
# dtype: int64

a     3
c     6
d    15
dtype: int64
b     7
d    14
e    21
dtype: int64


## Step 8: Use `mean`, `min`, and `max` Methods to Compute Basic Statistics ##

### Background: Statistical Operations with Pandas ###

Like NumPy, Pandas also supports a wide variety of statistical operations out of the box.
However, while NumPy offers these as functions such as `np.mean` (in addition to array methods), Pandas provides them as methods on the `Series` object itself.
This is illustrated in the cell below:

In [24]:
series = pd.Series([3, 5, 6, 15, 7], index=["a", "b", "c", "d", "e"])
print(series.mean()) # prints 7.2
print(series.min()) # prints 3
print(series.max()) # prints 15

7.2
3
15


Putting everything together, as with NumPy, you can combine operations to do quite a bit with one expression.
The following computes the average of all numbers in `series` which are less than or equal to `7`:

In [25]:
print(series[series <= 7].mean())

5.25
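
The one-line expression above can be broken into its individual steps. This sketch uses the illustrative names `mask`, `small`, and `average` for the intermediate results:

```python
import pandas as pd

series = pd.Series([3, 5, 6, 15, 7], index=["a", "b", "c", "d", "e"])

mask = series <= 7      # booleans: which values are at most 7
small = series[mask]    # masking: keeps 3, 5, 6, and 7
average = small.mean()  # (3 + 5 + 6 + 7) / 4

print(small)
print(average)
```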


### Try this Yourself ###

In the next cell, define a function named `descriptive_stats_series`, which will print the smallest, largest, and average number in a given input `Series` object.
This output should be printed as follows:

```
Smallest: MIN
Largest: MAX
Average: MEAN
```

Leave the calls in the next cell in place in order to test your code.
This is expected to be very similar to the `descriptive_stats` function you wrote in step 4 of the prior assignment.

In [27]:
# Define your descriptive_stats_series function here.
# Leave the calls in place below in order to test your code.
def descriptive_stats_series(serie):
    print(f'Smallest: {serie.min()}')
    print(f'Largest: {serie.max()}')
    print(f'Average: {serie.mean()}')

example1 = pd.Series([3, 7, 2, 9, 6, 4], index=["a", "b", "c", "d", "e", "f"])
descriptive_stats_series(example1)
# Above statement should print:
# Smallest: 2
# Largest: 9
# Average: 5.166666666666667

print()
example2 = pd.Series([3, 8, 1, 9], index=["a", "b", "c", "d"])
descriptive_stats_series(example2)
# Above statement should print:
# Smallest: 1
# Largest: 9
# Average: 5.25

Smallest: 2
Largest: 9
Average: 5.166666666666667

Smallest: 1
Largest: 9
Average: 5.25


## Step 9: Submit via Canvas ##

Be sure to **save your work**, then log into [Canvas](https://canvas.csun.edu/). Go to the COMP 502 course, and click "Assignments" in the left pane. Then click "Assignment 26", and upload the `26_pandas_series_objects.ipynb` file.

You can turn in the assignment multiple times, but only the last version you submitted will be graded.

### Special Thanks to Dr. Glenn Bruns ###

Special thanks to [Dr. Glenn Bruns](https://csumb.edu/scd/glenn-bruns/) at California State University, Monterey Bay, for providing me with closely related materials which were used in the creation of this assignment.