<br/>

## <div style="padding:10px;background-color:#9d5a47;margin:10;color:white;font-family:newtimeroman;font-size:100%;text-align:center;border-radius: 1px 10px;overflow:hidden;font-weight:50;width:auto">Feature Encoding Homework</div>

This notebook uses and displays the results of helper functions that you will implement in the files `interaction_features.py`, `target_encoding.py`, and `datetime_engineering.py`.

Note that we will be using `pytest` to test the correctness of your implementation. You can run the tests by running the following command in the terminal:

```bash
pytest
```

If the tests pass, you should see an output line stating "9 passed, 1 xfailed". The `xfailed` entry appears because one test is expected to fail. If you see this line, your implementation produces the expected results.

The unit test definitions are in files named `test_target_encoding.py` and `test_interaction_features.py`. Please do not modify these files.

To complete the homework assignment, implement only the functions that are not yet implemented in `interaction_features.py`, `target_encoding.py`, and `datetime_engineering.py`, and leave everything else as is.

### Imports

In [1]:
import pandas as pd

#pd.set_option("future.no_silent_downcasting", True)

# this holds the data we will use
from sklearn.datasets import fetch_california_housing

# these are the functions we will define
from interaction_features import create_interaction_features
from target_encoding import discretize_feature, target_encode
from datetime_engineering import convert_datetime, create_month, aggregate_months

### Data Loading

In [2]:
# Load the California Housing dataset
california_housing = fetch_california_housing()
data = california_housing.data
feature_names = california_housing.feature_names

# Create a DataFrame from the dataset
df = pd.DataFrame(data, columns=feature_names)

# Add the target to the DataFrame
df[california_housing.target_names[0]] = california_housing.target  # MedHouseVal

<br>
<br>
<div style="color:white; background:#9d8547; max-width:800px;padding:40px; margin:0 auto;font-size:120%;text-align:center;">
<div>HW1: Interaction Features</div>
</div>

Having imported the data from the California Housing Prices dataset, we will now create interaction features. Interaction features are new features that are created by combining two or more existing features. The goal is to capture the relationship between the features that are combined.

The most common way to create interaction features is by multiplying two or more features together. For example, if we have two features `x1` and `x2`, we can create a new feature `x3` by multiplying them together: `x3 = x1 * x2`.

In this homework, you will create interaction features by multiplying pairs of features together.

You will fill out the function definition in the file `interaction_features.py`. The function takes a DataFrame and a list of feature-pair tuples, and returns a new DataFrame with the interaction features added. Use the standard multiplication operation to create the new features.
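The multiplication idea can be sketched on a toy DataFrame. This is only an illustration, not the reference solution: the helper name `multiply_pairs` and the toy columns `x1` and `x2` are hypothetical, and the `_x_` column naming is an assumption.

```python
import pandas as pd


def multiply_pairs(frame, pairs):
    """Hypothetical helper: add one product column per (left, right) pair."""
    out = frame.copy()  # leave the caller's DataFrame untouched
    for left, right in pairs:
        # Element-wise product; the "_x_" naming is an assumption
        out[f"{left}_x_{right}"] = out[left] * out[right]
    return out


# Made-up values, chosen only to make the products easy to check by hand
toy = pd.DataFrame({"x1": [1.0, 2.0, 3.0], "x2": [10.0, 20.0, 30.0]})
toy_interactions = multiply_pairs(toy, [("x1", "x2")])
print(toy_interactions)
```

Working on a copy keeps the input DataFrame unchanged, so the same data can be reused in later cells.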

In [3]:
# Define the pairs of features for which we want to create interaction features
interaction_pairs = [
    ("MedInc", "HouseAge"),
    ("AveRooms", "AveBedrms"),
    ("Population", "AveOccup"),
]

# You will need to define this function in interaction_features.py
df_interactions = create_interaction_features(df, interaction_pairs)

# Display the updated DataFrame with interaction features
df_interactions.head()

Unnamed: 0,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,Latitude,Longitude,MedHouseVal,MedInc_x_HouseAge,AveRooms_x_AveBedrms,Population_x_AveOccup
0,8.3252,41.0,6.984127,1.02381,322.0,2.555556,37.88,-122.23,4.526,341.3332,7.150416,822.888889
1,8.3014,21.0,6.238137,0.97188,2401.0,2.109842,37.86,-122.22,3.585,174.3294,6.062724,5065.730228
2,7.2574,52.0,8.288136,1.073446,496.0,2.80226,37.85,-122.24,3.521,377.3848,8.896869,1389.920904
3,5.6431,52.0,5.817352,1.073059,558.0,2.547945,37.85,-122.25,3.413,293.4412,6.242364,1421.753425
4,3.8462,52.0,6.281853,1.081081,565.0,2.181467,37.85,-122.25,3.422,200.0024,6.791193,1232.528958


<br>
<br>
<div style="color:white; background:#9d8547; max-width:800px;padding:40px; margin:0 auto;font-size:120%;text-align:center;">
<div>HW2: Target Encoding</div>
</div>

Target Encoding is a method of encoding categorical features that uses the target variable to create new features. The idea is to encode the categorical feature using the mean of the target variable.

We saw this as an example during the Categorical Encoding lecture. Now we will use it as its own feature engineering technique.

This section has 2 parts:

1. Simplistically discretize the `MedInc` feature by rounding it to the nearest integer. The function you will fill out is named `discretize_feature`.
2. You should now have a discretized feature with 15 categories, which we will encode using target encoding. The function you will fill out is named `target_encode`.

The 2 functions will be defined in the file `target_encoding.py`. 
1. The `discretize_feature` function will take in a DataFrame and a feature name.  It will return a new DataFrame with the feature re-encoded as discrete bins.
2. The `target_encode` function will take in a DataFrame, a list of features to encode and a target column. It will return a new DataFrame with the target-encoded feature added.
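The two steps can be sketched on a toy DataFrame. This is a minimal illustration under stated assumptions, not the expected implementation: the column names `income` and `price` and their values are made up, and the result is written to new columns here only so that both steps stay visible.

```python
import pandas as pd

# Made-up values; none of them ends in .5, so rounding is unambiguous
toy = pd.DataFrame(
    {
        "income": [1.2, 1.4, 2.6, 3.1, 2.9],
        "price": [10.0, 20.0, 30.0, 50.0, 40.0],
    }
)

# Step 1: discretize by rounding to the nearest integer
toy["income_bin"] = toy["income"].round().astype(int)

# Step 2: replace each bin with the mean of the target within that bin
toy["income_encoded"] = toy.groupby("income_bin")["price"].transform("mean")
print(toy)
```

Here `transform("mean")` returns one value per original row, so the encoded column lines up with the DataFrame without a separate merge.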

In [4]:
# Part 1: Discretization
df = discretize_feature(df, "MedInc")

df.head()

Unnamed: 0,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,Latitude,Longitude,MedHouseVal
0,8,41.0,6.984127,1.02381,322.0,2.555556,37.88,-122.23,4.526
1,8,21.0,6.238137,0.97188,2401.0,2.109842,37.86,-122.22,3.585
2,7,52.0,8.288136,1.073446,496.0,2.80226,37.85,-122.24,3.521
3,6,52.0,5.817352,1.073059,558.0,2.547945,37.85,-122.25,3.413
4,4,52.0,6.281853,1.081081,565.0,2.181467,37.85,-122.25,3.422


In [5]:
# Part 2: Target Encoding
df = target_encode(df, "MedInc", "MedHouseVal")

df.head()

Unnamed: 0,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,Latitude,Longitude,MedHouseVal
0,4.07835,41.0,6.984127,1.02381,322.0,2.555556,37.88,-122.23,4.526
1,4.07835,21.0,6.238137,0.97188,2401.0,2.109842,37.86,-122.22,3.585
2,3.590495,52.0,8.288136,1.073446,496.0,2.80226,37.85,-122.24,3.521
3,2.99242,52.0,5.817352,1.073059,558.0,2.547945,37.85,-122.25,3.413
4,2.106762,52.0,6.281853,1.081081,565.0,2.181467,37.85,-122.25,3.422


<br>
<br>
<div style="color:white; background:#9d8547; max-width:800px;padding:40px; margin:0 auto;font-size:120%;text-align:center;">
<div>HW3: Datetime Engineering</div>
</div>

Often we have a single datetime feature that we want to split into its component parts, either to work at different granularities or to measure cyclic trends. For example, we can extract the month and group by it to compute aggregate features for each month.

We will be using the "Timestamp" feature in the FOREX_nzdusd-day-Close dataset to aggregate the remaining features by month.

This section has 3 parts:

1. Convert the "Timestamp" feature to datetime type.
2. Create a new feature called "Month" from the "Timestamp" feature that represents the month in which each observation occurred.
3. Use the new "Month" feature to find the mean of each other feature for each month of the year (i.e. group by "Month").

The 3 functions will be defined in the file `datetime_engineering.py`. 
1. The `convert_datetime` function will take in a DataFrame and a feature name.  It will return a new DataFrame with the feature converted to datetime type.
2. The `create_month` function will take in a DataFrame, the name of the "Timestamp" feature, and the name of the "Month" feature to be created. It will return a new DataFrame that includes the "Month" feature.
3. The `aggregate_months` function will take in a DataFrame, the name of the "Month" feature to group by, and the function used to perform aggregation by month of the year (mean). It will return a new DataFrame containing the mean of each feature for each month of the year.
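The three parts can be sketched with standard pandas calls on a toy DataFrame. This is an illustration rather than the expected implementation: the values are made up, and only one numeric column is aggregated to keep the output short.

```python
import pandas as pd

# Made-up rows: two observations in January, one in February
toy = pd.DataFrame(
    {
        "Timestamp": [
            "2012-01-01 23:00:00",
            "2012-01-02 23:00:00",
            "2012-02-01 23:00:00",
        ],
        "Bid_Close": [0.5, 1.5, 3.0],
    }
)

# Part 1: parse the strings into datetime values
toy["Timestamp"] = pd.to_datetime(toy["Timestamp"])

# Part 2: extract the calendar month (1 to 12) through the .dt accessor
toy["Month"] = toy["Timestamp"].dt.month

# Part 3: group by month and average the numeric column
monthly = toy.groupby("Month")[["Bid_Close"]].agg("mean")
print(monthly)
```

Grouping by the month number pools observations from the same month across all years, which is what makes this useful for spotting seasonal patterns.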

In [6]:
from sklearn.datasets import fetch_openml

forex_nzdusd = fetch_openml("FOREX_nzdusd-day-Close", version=1, parser="pandas", as_frame=True)
df = pd.DataFrame(forex_nzdusd.data, columns=forex_nzdusd.feature_names)

In [7]:
# Part 1: Convert "Timestamp" to datetime
df = convert_datetime(df, "Timestamp")

df.head()

Unnamed: 0,Timestamp,Bid_Open,Bid_High,Bid_Low,Bid_Close,Bid_Volume,Ask_Open,Ask_High,Ask_Low,Ask_Close,Ask_Volume
0,2012-01-01 23:00:00,0.77623,0.78023,0.77364,0.77822,13892.0999,0.7773,0.7813,0.77618,0.77858,15052.76
1,2012-01-02 23:00:00,0.77829,0.79076,0.77775,0.78923,95963.5801,0.77866,0.79087,0.77858,0.79025,94711.0697
2,2012-01-03 23:00:00,0.78922,0.79059,0.78528,0.78746,128332.3296,0.79006,0.79088,0.78545,0.78792,136065.8203
3,2012-01-04 23:00:00,0.78749,0.78795,0.77929,0.78044,145642.4405,0.78795,0.78816,0.77945,0.78131,153696.2908
4,2012-01-05 23:00:00,0.78039,0.78376,0.77736,0.7802,131763.3215,0.78139,0.7839,0.77758,0.7808,135191.8401


In [8]:
# Part 2: Create "Month" feature
df = create_month(df, "Timestamp", "Month")

df.head()

Unnamed: 0,Timestamp,Bid_Open,Bid_High,Bid_Low,Bid_Close,Bid_Volume,Ask_Open,Ask_High,Ask_Low,Ask_Close,Ask_Volume,Month
0,2012-01-01 23:00:00,0.77623,0.78023,0.77364,0.77822,13892.0999,0.7773,0.7813,0.77618,0.77858,15052.76,1
1,2012-01-02 23:00:00,0.77829,0.79076,0.77775,0.78923,95963.5801,0.77866,0.79087,0.77858,0.79025,94711.0697,1
2,2012-01-03 23:00:00,0.78922,0.79059,0.78528,0.78746,128332.3296,0.79006,0.79088,0.78545,0.78792,136065.8203,1
3,2012-01-04 23:00:00,0.78749,0.78795,0.77929,0.78044,145642.4405,0.78795,0.78816,0.77945,0.78131,153696.2908,1
4,2012-01-05 23:00:00,0.78039,0.78376,0.77736,0.7802,131763.3215,0.78139,0.7839,0.77758,0.7808,135191.8401,1


In [9]:
import numpy as np

# Part 3: Display means for each month
aggregate_months(df, "Month", np.mean)

Month,Timestamp,Bid_Open,Bid_High,Bid_Low,Bid_Close,Bid_Volume,Ask_Open,Ask_High,Ask_Low,Ask_Close,Ask_Volume
1,2015-01-14 17:18:27.692307712,0.76064,0.765129,0.756648,0.761008,99406.192912,0.761132,0.765291,0.756845,0.761473,100582.494531
2,2015-02-10 17:05:04.225351936,0.765649,0.769902,0.761819,0.765789,97049.571571,0.766116,0.770056,0.761959,0.766286,97929.776615
3,2015-03-24 11:46:40.000000000,0.762643,0.766968,0.759026,0.762934,93604.191931,0.763166,0.767137,0.759201,0.763417,93943.434882
4,2015-04-07 02:10:43.708609280,0.772132,0.776023,0.768611,0.772145,85324.714615,0.772755,0.776212,0.768811,0.77269,85674.462954
5,2015-05-26 19:36:55.384615424,0.750648,0.754079,0.7465,0.749522,94696.426397,0.751177,0.754244,0.746652,0.750037,95017.687931
6,2015-06-16 20:34:03.243243264,0.750271,0.754878,0.74643,0.750795,97458.605945,0.750878,0.75509,0.746619,0.751396,97986.954564
7,2015-07-08 01:26:45.095541504,0.750875,0.754751,0.747155,0.750918,96866.434113,0.751496,0.754931,0.747327,0.751471,96739.130793
8,2015-08-31 16:17:55.324675328,0.744284,0.747832,0.740447,0.743746,102892.399562,0.744851,0.748011,0.740607,0.744262,102930.327798
9,2015-09-04 12:12:28.993288704,0.74221,0.746519,0.738753,0.742477,108294.418911,0.74278,0.746693,0.73892,0.743013,108684.402374
10,2015-10-17 08:27:08.025477632,0.741032,0.745168,0.737383,0.741054,99469.77052,0.741576,0.745338,0.737556,0.741585,99304.185222
