<font color='darkred'> Unless otherwise noted, **this notebook will not be reviewed or autograded.**</font> You are welcome to use it for scratchwork, but **only the files listed in the exercises will be checked.**

---

# Exercises

For these exercises, add your functions to the *app\.py* file and the *apputil\.py* file.

# Building a basic model

We'll build a [Python class](https://pythonbasics.org/class/) called `GroupEstimate` which takes in *categorical* data and corresponding *continuous* values, determines which group a new observation falls into, and "predicts" an estimate value based on the data provided.

## Exercise 1

### Part 1

Define a class `GroupEstimate` that accepts an `estimate` argument, which can be either `"mean"` or `"median"`.

### Part 2

Add a `.fit(X, y)` method that takes in a pandas DataFrame of *categorical* data, `X`, and a 1-D array, `y`. There should be no missing values in `y`, and each row of `X` corresponds to the same "row" in `y`, so they should be the same length.

- Combine `X` and `y` into a shared pandas DataFrame.
- Group the DataFrame by the columns in `X`.
- For each group, calculate either the mean or median value of `y`, depending on the `estimate` argument.
- *Note: Your class should not "store" `X` or `y`.* Only "save" the data needed to accomplish Part 3, below.
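
The steps above can be sketched on a toy frame as follows. This is a minimal illustration under stated assumptions, not the required solution: the helper name `group_stats`, the placeholder column `_y`, and the toy data are all made up for the example.

```python
import numpy as np
import pandas as pd


def group_stats(X, y, estimate="mean"):
    """Hypothetical helper: per-group mean or median of y, keyed by X's columns."""
    combined = X.reset_index(drop=True)
    # np.asarray drops any index on y, so rows are paired by position.
    # "_y" is an assumed column name that must not clash with a column of X.
    combined["_y"] = np.asarray(y)
    # .agg accepts the strings "mean" and "median", so one call covers both cases
    return combined.groupby(list(X.columns))["_y"].agg(estimate)


# Toy data, invented for this sketch
X = pd.DataFrame({"team": ["A", "A", "B", "B"],
                  "position": ["QB", "QB", "QB", "RB"]})
y = [20, 30, 40, 50]

stats = group_stats(X, y, estimate="mean")
print(stats)
```

Only `stats` (one value per group) would need to be kept after fitting; `X` and `y` themselves can be discarded.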

### Part 3

Add a `.predict(X_)` method that takes in an array of observations (or a dataframe) whose columns correspond to the columns in `X`, determines which group each observation falls into, and returns the corresponding estimates for `y`.

If an incoming category or combination of categories was missing in the original data, return `NaN` for that observation and print a message indicating the number of missing groups.
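
One possible way to get this behavior is a left merge against a table of fitted group values: rows with no matching group come back as `NaN`. The sketch below assumes the fitted values are held in a small DataFrame; the names `fitted`, `estimate`, and `lookup` are hypothetical, and the numbers are placeholders.

```python
import pandas as pd

# Assumed stand-in for what .fit might save: one row per group plus its estimate
fitted = pd.DataFrame({
    "country": ["Guatemala", "Mexico"],
    "roast": ["Light", "Medium"],
    "estimate": [88.4, 91.0],
})


def lookup(fitted, X_):
    """Hypothetical helper: match each new observation to its group's estimate."""
    keys = [c for c in fitted.columns if c != "estimate"]
    new = pd.DataFrame(X_, columns=keys)
    # A left merge keeps every incoming row, in order; unmatched rows get NaN
    merged = new.merge(fitted, on=keys, how="left")
    n_missing = int(merged["estimate"].isna().sum())
    if n_missing:
        print(f"{n_missing} observation(s) fall into groups not seen during fit.")
    return merged["estimate"].to_numpy()


preds = lookup(fitted, [["Guatemala", "Light"], ["Canada", "Dark"]])
print(preds)
```

A dictionary keyed by tuples of categories, as in the notebook cell below, is an equally reasonable choice; the merge simply avoids an explicit Python loop.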

In [16]:
## Exercise 1
"""
Build python class called 'GroupEstimate' taking categorical data and corresponding 
continuous values, and determines which group a new observation falls into, and 
predicts an estimate value based on the data provided.
"""
# Part 1: Define a class 'GroupEstimate' that accepts an 'estimate' argument, which can be either "mean" or "median".
class GroupEstimate():
    def __init__(self, estimate):
        if estimate not in ["mean", "median"]:
            raise ValueError("Estimate must be either 'mean' or 'median'")
        self.estimate = estimate
        self.group_data = {}

In [17]:
import pandas as pd
import numpy as np

# Parts 2 and 3: .fit(X, y) and .predict(X_), defined inside the class body
# so that instances can call them as methods.
class GroupEstimate:
    """Estimate y by the mean or median of the group an observation falls into."""

    def __init__(self, estimate):
        if estimate not in ["mean", "median"]:
            raise ValueError("Estimate must be either 'mean' or 'median'")
        self.estimate = estimate
        self.group_data = {}

    def fit(self, X, y):
        """Learn one estimate per combination of the categorical columns in X.

        X and y are not stored; only the per-group estimates are kept.
        """
        # Combine X and y into a single DataFrame; np.asarray ignores any
        # index on y so that rows are paired by position
        data = X.copy()
        data["y"] = np.asarray(y)

        # Group by all columns in X and compute the chosen statistic
        # ("mean" or "median")
        stats = data.groupby(list(X.columns))["y"].agg(self.estimate)

        # Store keys as tuples, even when X has a single column, so that
        # they match the row tuples built in .predict
        self.group_data = {
            (key if isinstance(key, tuple) else (key,)): float(value)
            for key, value in stats.items()
        }

    def predict(self, X_):
        """Return the learned estimate for each row of X_ (NaN for unseen groups)."""
        X_ = pd.DataFrame(X_)
        keys = [tuple(row) for row in X_.to_numpy()]
        predictions = [self.group_data.get(k, np.nan) for k in keys]

        missing_count = int(sum(pd.isna(predictions)))
        if missing_count > 0:
            print(f"{missing_count} observation(s) belong to unseen group(s). Returning NaN for those.")

        return predictions

In [14]:
import pandas as pd

# Example NFL player data for two teams
data = {
    "team": ["IND", "IND", "IND", "BUF", "BUF", "BUF"],
    "player_name": [
        "Daniel Jones", "Jonathon Taylor", "Michael Pittman Jr",
        "Josh Allen", "James Cook", "Keon Coleman"
    ],
    "position": ["QB", "RB", "WR", "QB", "RB", "WR"],
    "jersey_number": [17, 28, 11, 17, 4, 0],
    "height_in": [75, 70, 77, 77, 70, 72],
    "weight_lb": [225, 215, 250, 238, 190, 191],
    "age": [28, 26, 28, 29, 24, 23],
}

players_df = pd.DataFrame(data)
print(players_df)

  team         player_name position  jersey_number  height_in  weight_lb  age
0  IND        Daniel Jones       QB             17         75        225   28
1  IND     Jonathon Taylor       RB             28         70        215   26
2  IND  Michael Pittman Jr       WR             11         77        250   28
3  BUF          Josh Allen       QB             17         77        238   29
4  BUF          James Cook       RB              4         70        190   24
5  BUF        Keon Coleman       WR              0         72        191   23


In [15]:
X = players_df[["team", "position"]]
y = players_df["age"]

model = GroupEstimate(estimate="mean")
model.fit(X, y)


X_ = [["IND", "QB"],
      ["IND", "RB"],
      ["IND", "WR"],
      ["BUF", "QB"],
      ["BUF", "RB"],
      ["BUF", "WR"]]

model.predict(X_)
# Expected Output: [28.0, 26.0, 28.0, 29.0, 24.0, 23.0]


### Example

For example, if we have a dataframe of coffee reviews, and `X` includes two columns: *country* and *roast type*, we might want to predict the average *review score* for a new coffee from a given country and roast type. In this way, we could run:

```python
X = df_raw[["loc_country", "roast"]]
y = df_raw["rating"]

gm = GroupEstimate(estimate='mean')
gm.fit(X, y)

X_ = [["Guatemala", "Light"],
      ["Mexico", "Medium"],
      ["Canada", "Dark"]]

gm.predict(X_)

>> [88.4, 91. ,  nan]  # say there are no Canadian dark roasts
```

## Bonus Exercise 2

Adjust your `GroupEstimate` class to handle the situation where the combination of categories is missing, but a particular category is not. That is, add to your `.fit` method an optional argument `default_category`. If a combination is missing, the estimate for `y` will be based solely on the group defined by `default_category`.

For example, suppose we have the code in the example above, but we replace the fit line with

```python
# ...
gm.fit(X, y, default_category="country")
# ...

>> [4.5, 3.8, 3.1]
```

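In this case, the missing value in that array would be filled with the average review score across all coffees from that observation's country (for instance, all Canadian coffees in the example above), regardless of roast.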

*Hint: consider the `observed` argument of the `groupby` [method](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.groupby.html), and go from there ...*
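
The fallback idea can be sketched with two lookup tables, one per combination and one for the default category alone. This is only one possible approach and not necessarily the one the hint has in mind (it does not use `observed`); the helper `predict_one` and the toy ratings are invented for illustration.

```python
import numpy as np
import pandas as pd

# Toy data (made up): Canada appears, but never with a Dark roast
df = pd.DataFrame({
    "country": ["Canada", "Canada", "Mexico", "Mexico"],
    "roast": ["Light", "Medium", "Dark", "Dark"],
    "rating": [80.0, 90.0, 70.0, 74.0],
})

# Primary lookup: one estimate per (country, roast) combination
by_combo = df.groupby(["country", "roast"])["rating"].mean().to_dict()
# Fallback lookup: one estimate per value of the default category alone
by_default = df.groupby("country")["rating"].mean().to_dict()


def predict_one(country, roast):
    """Hypothetical helper: try the full combination, then the default category."""
    if (country, roast) in by_combo:
        return by_combo[(country, roast)]
    # Still NaN when even the default category was never seen
    return by_default.get(country, np.nan)


print(predict_one("Mexico", "Dark"))   # combination seen during fit
print(predict_one("Canada", "Dark"))   # combination unseen, country seen
print(predict_one("Brazil", "Dark"))   # neither seen
```

In a class-based version, both tables would be built inside `.fit` and `default_category` would name the column used for the second one.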