In [1]:
# Initialize Otter
import otter
grader = otter.Notebook("hw04.ipynb")

# Functional programming with Pandas Series. 
In this homework, we'll explore some of the features of Pandas that let you program without explicit loops. This can be a bit mind-twisting, but it is the idiomatic way to work in libraries like Pandas. 

In this assignment, try to accomplish the tasks below without using Python loops, comprehensions, or map, filter, and reduce operations. Instead, use the functions available in `Series` to accomplish the same aims, including conditional syntax inside `[]`, `Series.groupby`, etc.  
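
As a rough illustration of the style this assignment asks for, the sketch below contrasts a loop-based filter with the equivalent conditional expression inside `[]`. The series `temps`, its labels, and the threshold of 20 are made up for this example and are not part of the assignment.

```python
import pandas as pd

# Hypothetical data, used only for illustration.
temps = pd.Series([12.5, 30.1, 18.0, 25.4], index=['mon', 'tue', 'wed', 'thu'])

# Loop-based style (what this assignment asks you to avoid).
warm_loop = {}
for label, value in temps.items():
    if value > 20:
        warm_loop[label] = value
warm_loop = pd.Series(warm_loop)

# Series style: the comparison builds a boolean mask,
# and indexing with [] keeps only the entries where the mask is True.
mask = temps > 20
warm = temps[mask]

print(mask.tolist())
print(warm)
```

Both versions keep the same entries, but the second one states the condition once and leaves the iteration to Pandas.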

Also, I feel it fair to warn you that questions 1-4 are relatively easy and *question 5 is a mind-bender.* It is best to think of questions 1-4 as "preparation" for question 5. 

In [2]:
# Run this first to load all libraries. 
import numpy as np
import pandas as pd

<!-- BEGIN QUESTION -->

*Question 1:* In the cell below, construct a function `gezero` that takes a Pandas `Series` as input and returns a `Series` containing only its values greater than or equal to zero. 

*Hint:* Use boolean-mask indexing on the `Series` (a condition inside `[]`).

Example: `gezero(pd.Series([-1, 4, -4, 5, 7]))` returns 
```
1    4
3    5
4    7
dtype: int64
```

In [3]:
def gezero(thing): 
    return thing[thing >= 0]

In [4]:
# This will help you test. 
gezero(pd.Series([-1, 4, -4, 5, 7]))

1    4
3    5
4    7
dtype: int64

In [5]:
grader.check("q1")

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

*Question 2:*  Read up on `groupby` functionality in `Series`, and then write a function `sums` that takes a single `Series` as an argument and returns a new `Series` containing, for each distinct index label, the sum of the values that share that label. Example: if 
```
foo = pd.Series([1, 2, 3, 1, 1, 2, 5], ['alice', 'george', 'alice', 'alice', 'frank', 'george', 'george'])
```
then `sums(foo)` returns: 
```
alice     5
frank     1
george    9
dtype: int64
```
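
To see what grouping a `Series` by its own index labels does, here is a minimal sketch. It assumes `groupby(level=0)`, which is one of several ways to group by the index, and uses a made-up series `scores` with aggregations other than the sum asked for above.

```python
import pandas as pd

# Hypothetical data: two labels, each appearing twice.
scores = pd.Series([10, 20, 5, 7], index=['a', 'b', 'a', 'b'])

# level=0 groups the values by their index label.
grouped = scores.groupby(level=0)

# Any aggregation then produces one row per distinct label.
largest = grouped.max()
counts = grouped.count()

print(largest)
print(counts)
```

The result of each aggregation is itself a `Series` whose index holds the distinct labels in sorted order.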

In [6]:
def sums(s): 
    return s.groupby(level=0).sum()

In [7]:
# This will help you test. 
foo = pd.Series([1, 2, 3, 1, 1, 2, 5], ['alice', 'george', 'alice', 'alice', 'frank', 'george', 'george'])
sums(foo)

alice     5
frank     1
george    9
dtype: int64

In [8]:
grader.check("q2")

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

*Question 3:*  *Binning* is a practice that reduces the size of data by grouping similar values. Write a function `bin` with two arguments: a `Series` to be binned and the `size` of each bin. Return a `Series` that represents the mean of values in each bin, where the index is the midpoint of the bin. Each bin starts at a multiple of `size` and encompasses all values greater than or equal to `midpoint - size/2` and strictly less than `midpoint + size/2`.

Hint: the midpoint of the bin for a value `v` is  `math.floor(v/size)*size + size/2`. Define the groups via a `lambda` expression. 

Caveat: please try to do this without using the `Scipy` binning functions. They aren't guaranteed to get past the grading checks. 
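
The midpoint formula in the hint can be checked by hand on a few values. The sketch below wraps it in a helper named `midpoint` (a name chosen here for illustration, not one required by the assignment) and evaluates it for a bin size of 2.0.

```python
import math

def midpoint(v, size):
    # floor(v / size) * size is the left edge of the bin that contains v;
    # adding size / 2 moves from the left edge to the center of that bin.
    return math.floor(v / size) * size + size / 2

print(midpoint(1.0, 2.0))   # bin [0, 2)  -> midpoint 1.0
print(midpoint(2.0, 2.0))   # bin [2, 4)  -> midpoint 3.0
print(midpoint(3.9, 2.0))   # bin [2, 4)  -> midpoint 3.0
print(midpoint(-0.5, 2.0))  # bin [-2, 0) -> midpoint -1.0
```

Note that a value sitting exactly on a bin boundary, such as 2.0, belongs to the bin that starts there, which matches the "greater than or equal to" rule in the question.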

Example: if 
```
foo = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
```
then `bin(foo, 2.0)` returns
```
1.0    1.0
3.0    2.5
5.0    4.5
dtype: float64
```


In [9]:
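import math
def bin(s, size): 
    # Map each value to the midpoint of the bin that contains it.
    midpoint = lambda a: math.floor(a/size)*size + size/2
    # Group the values by midpoint and average each group. groupby already
    # returns one row per midpoint, sorted by midpoint, so no reindexing is needed.
    return s.groupby(s.apply(midpoint)).mean()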

In [10]:
# This will help you test. 
foo = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
print(foo)
bin(foo, 2.0)

0    1.0
1    2.0
2    3.0
3    4.0
4    5.0
dtype: float64


1.0    1.0
3.0    2.5
5.0    4.5
dtype: float64

In [11]:
grader.check("q3")

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

*Question 4:* Write a function `freq` that counts how many values fall into each bin of a fixed size `size`, using the midpoint of the bin as the index and the number of values in that bin as the value. 

*Hint:* modify the formula from Question 3 as needed.

Example: if 
```
foo = pd.Series([1, 2, 3.5, 2.5, 1.5, 2.5, 10])
```
then `freq(foo, 2)` returns
```
1.0     2
3.0     4
11.0    1
dtype: int64
```

In [12]:
def freq(s, size): 
    # Map each value to the midpoint of the bin that contains it.
    midpoint = lambda a: math.floor(a/size)*size + size/2
    # Count the values in each bin; the result is sorted by midpoint.
    return s.groupby(s.apply(midpoint)).count()

In [13]:
# This will help you test. 
foo = pd.Series([0, 0.5, 1, 2, 3.5, 2.5, 1.5, 2.5, 10])
freq(foo, 2)

1.0     4
3.0     4
11.0    1
dtype: int64

In [14]:
grader.check("q4")

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

*Question 5:* It's really common to want to bin by something other than equal intervals. Write a function `cut` that takes two arguments: a series `s` to be binned and a series `cutoffs` of the high (maximum) cutoff points for each bin. It should return a `Series` holding the number of values of `s` that fall into each interval defined by `cutoffs`. Label each bin with the index label of its cutoff.

Example: if `cutoffs` is 
```
cutoffs = pd.Series([1.0, 7.0, 10.0], ['low', 'medium', 'high'])
```
then the desired bins are 
* `low` <= 1.0
* `medium` > 1.0 and <= 7.0
* `high` > 7.0 and <= 10.0

Values greater than the `high` cutoff are ignored. 

The result of calling  `cut(pd.Series(range(20)), cutoffs)` is: 
```
high      3
low       2
medium    6
dtype: int64
```
Note particularly that the values 11-19 of `range(20)` are ignored. 

*Hint:* this is a sophisticated use of `groupby`. It's best to write a new categorization function for `groupby` that uses `numpy.searchsorted` to find, for each value, the smallest cutoff greater than or equal to it. It's easiest to filter out the values greater than the maximum of `cutoffs` first; otherwise `numpy.searchsorted` returns a position one past the end of `cutoffs`, and looking up a label at that position raises an exception.  
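
The behavior of `numpy.searchsorted` that the hint relies on can be seen in a small sketch. It uses the cutoff values from the example above as a plain array; the `values` array is made up for illustration, and `side='left'` is chosen so that a value equal to a cutoff lands in that cutoff's bin.

```python
import numpy as np

cutoffs = np.array([1.0, 7.0, 10.0])        # must be sorted in ascending order
values = np.array([0, 1, 2, 7, 8, 10, 11])  # hypothetical data

# For each value, find the position of the first cutoff that is >= the value.
pos = np.searchsorted(cutoffs, values, side='left')

print(pos.tolist())  # [0, 0, 1, 1, 2, 2, 3]
```

The last position, 3, equals `len(cutoffs)`: the value 11 is above every cutoff, so there is no bin for it. This is why filtering out values above the largest cutoff first keeps every position usable as an index into the cutoff labels.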

*Comment:* It is ironic that this is harder for a `Series` than for a `DataFrame`.

In [15]:
def cut(s, cutoffs): 
    # Assumes `cutoffs` is sorted in ascending order, as in the example.
    # Drop values above the largest cutoff so every remaining value has a bin.
    kept = s[s <= cutoffs.max()]
    # For each value, find the position of the first cutoff >= that value.
    # side='left' puts a value equal to a cutoff into that cutoff's bin.
    pos = np.searchsorted(cutoffs.values, kept.values, side='left')
    # Translate positions into bin labels and count the values per label.
    labels = cutoffs.index.to_numpy()[pos]
    counts = kept.groupby(labels).count()
    # Report the bins in the order given by `cutoffs`; empty bins count as 0.
    return counts.reindex(cutoffs.index, fill_value=0)

In [16]:
# This will help you test. 
foo = pd.Series(range(20))
cutter = pd.Series([1.0, 7.0, 10.0], ['low', 'medium', 'high'])
cut(foo, cutter)

low       2
medium    6
high      3
dtype: int64

In [17]:
grader.check("q5")

<!-- END QUESTION -->



---

To double-check your work, the cell below will rerun all of the autograder tests.

In [295]:
grader.check_all()

q1 results: All test cases passed!

q2 results: All test cases passed!

q3 results: All test cases passed!

q4 results: All test cases passed!

q5 results: All test cases passed!

## Submission

Make sure you have run all cells in your notebook in order before running the cell below, so that all images/graphs appear in the output. The cell below will generate a zip file for you to submit. **Please save before exporting!**

You are not done until you upload the exported zipfile to GradeScope.

In [296]:
# Save your notebook first, then run this cell to export your submission.
grader.export(pdf=False)