## Optional Jupyter Notebooks – Task 2

Be sure to read the [Introduction to Notebooks](IntroductiontoNotebooks.ipynb)


### Setup:
To begin, run the cell below (click inside the cell and press `Ctrl-Enter`) to load the framework necessary for Task 2. While you may import other libraries for experimentation, remember that only the standard Python libraries are supported by the Gradescope autograder.
    

In [1]:
import numpy as np
import pandas as pd
import sklearn.preprocessing
import sklearn.decomposition
import sklearn.model_selection
import unittest
import os
import sys
import warnings
warnings.filterwarnings("ignore")


task_path = os.path.abspath(os.path.join(os.getcwd(),"..","src"))
sys.path.append(task_path)
from task2 import *
utils_path = os.path.abspath(os.path.join(os.getcwd(),"..","tests"))
sys.path.append(utils_path)
from utils import *

Pyarrow will become a required dependency of pandas in the next major release of pandas (pandas 3.0),
(to allow more performant data types, such as the Arrow string type, and better interoperability with other libraries)
but was not found to be installed on your system.
If this would cause problems for you,
please provide us feedback at https://github.com/pandas-dev/pandas/issues/54466
        
  import pandas as pd


## `tts`
In this function, you will take a dataset, the name of its label column, a percentage of the data to put into the test set, whether you should stratify on the label column, and a random state to set the sklearn function. You will return features and labels for the training and test sets.

At a high level, you can separate the task into two subtasks. The first is splitting your dataset into both features and labels (by columns), and the second is splitting your dataset into training and test sets (by rows). You should use the sklearn `train_test_split` function but will have to write wrapper code around it based on the input values we give you.
##### Useful Resources
* <https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html>
* <https://developers.google.com/machine-learning/crash-course/framing/ml-terminology>
* <https://stackoverflow.com/questions/40898019/what-is-the-difference-between-a-feature-and-a-label>

##### INPUTS
* `dataset` - a pandas DataFrame that contains some data
* `label_col` - a string containing the name of the column that contains the `label` values (what our model wants to predict)
* `test_size` - a float containing the decimal value of the percentage of the number of rows that the test set should be out of the dataset
* `stratify` - a boolean (`True` or `False`) value indicating if the resulting train/test split should be stratified or not
* `random_state` - an integer value to set the randomness of the function (useful for repeatability especially when autograding)

##### OUTPUTS 
* `train_features` - a pandas DataFrame that contains the train rows and the feature columns
* `test_features` - a pandas DataFrame that contains the test rows and the feature columns
* `train_labels` - a pandas DataFrame that contains the train rows and the label column
* `test_labels` - a pandas DataFrame that contains the test rows and the label column

In [8]:
# Write your code here


def tts(  dataset: pd.DataFrame,
                       label_col: str, 
                       test_size: float,
                       stratify: bool,
                       random_state: int) -> tuple[pd.DataFrame,pd.DataFrame,pd.Series,pd.Series]:

    train_features, test_features, train_labels, test_labels = sklearn.model_selection.train_test_split(
        dataset.drop(columns=[label_col]),
        dataset[label_col],
        test_size=test_size,
        stratify=dataset[label_col],
        random_state=random_state
    )
    
    return train_features,test_features,train_labels,test_labels

In [9]:
# Run this cell to test your code
dataset = pd.read_csv(os.path.join(os.getcwd(),"..","task1","sample.csv"))
#print(dataset)
pkl_files_folder = os.path.join(os.getcwd(),"..","task2","pkl_files")
ans_train_features = pd.read_pickle(os.path.join(pkl_files_folder,"train_feats_tts.pkl"))
#print(ans_train_features)
ans_test_features = pd.read_pickle(os.path.join(pkl_files_folder,"test_feats_tts.pkl"))
#print(ans_test_features)
ans_train_targets = pd.read_pickle(os.path.join(pkl_files_folder,"train_targets_tts.pkl"))
#print(ans_train_targets)
ans_test_targets = pd.read_pickle(os.path.join(pkl_files_folder,"test_targets_tts.pkl"))
#print(ans_test_targets)

target_col = "target"
train_features,test_features,train_targets,test_targets = tts(dataset,target_col,test_size=.2,stratify=True,random_state=0)
#print(train_features)
#print(test_features)
#print(train_targets)
#print(test_targets)
         
if compare_submission_to_answer_df(train_features,ans_train_features,"Train Features DF"):
    print('Passed...')
else:
    print('Failed...')        

if compare_submission_to_answer_df(test_features,ans_test_features,"Test Features DF"):
    print('Passed...')
else:
    print('Failed...')    

if compare_submission_to_answer_series(train_targets,ans_train_targets,"Train Targets Series"):
    print('Passed...')
else:
    print('Failed...')

if compare_submission_to_answer_series(test_targets,ans_test_targets,"Test Targets Series"):
    print('Passed...')
else:
    print('Failed...')


RUNNING TEST FOR Train Features DF

Passed...

RUNNING TEST FOR Test Features DF

Passed...

RUNNING TEST FOR Train Targets Series

Passed...

RUNNING TEST FOR Test Targets Series

Passed...


## `one_hot_encode_columns_train` and `one_hot_encode_columns_test`
One Hot Encoding is the process of taking a column and returning a binary vector representing the various values within it. There is a separate function for the training and test datasets since they should be handled separately to avoid data leakage (see the 3rd link in Useful Resources for a little more info on how to handle them). 

### Pseudocode
`one_hot_encode_columns_train()`
0. In the `__init__()` method initialize an instance variable containing an sklearn `OneHotEncoder` with any Parameters you may need.
1. Split `train_features` into into two DataFrames: one with only the columns you want to one hot encode (using `one_hot_encode_cols`) and another with all the other columns.
2. Fit the `OneHotEncoder` using the DataFrame you split from `train_features` with the columns you want to encode.
3. Transform the DataFrame you split from `train_features` with the columns you want to encode using the fitted `OneHotEncoder`.
4. Create a DataFrame from the 2D array of data that the output from step 3 gave you, with column names in the form of `columnName_categoryName` (there should be an attribute in `OneHotEncoder` that can help you with this) and the same index that `train_features` had.
5. Join the DataFrame you made in step 4 with the DataFrame of other columns from step 1.

`one_hot_encode_columns_test()`
1. Split `test_features` into two DataFrames: one with only the columns you want to one hot encode (using`one_hot_encode_cols`) and another with all the other columns.
2. Transform the DataFrame you split from `train_features` with the columns you want to encode using the `OneHotEncoder` you fit in `one_hot_encode_columns_train()`
3. Create a DataFrame from the 2D array of data that the output from step 2 gave you, with column names in the form of `columnName_categoryName` (there should be an attribute in `OneHotEncoder` that can help you with this) and the same index that `test_features` had.
4. Join the DataFrame you made in step 3 with the DataFrame of other columns from step 1.

### Example Walkthrough (from Local Testing suite):

<!--
| Item         | Price     | Count      | Type       |
|--------------|-----------|------------|------------|
| Apples       | 1.99      | 7          | Fruit      |
| Broccoli     | 1.29      | 435        | Vegetable   |
| Bananas      | 0.99      | 123        | Fruit      |
| Oranges      | 2.79      | 25         | Fruit      |
| Pineapples   | 4.89      | 5234       | Fruit      |
-->
#### INPUTS:

##### one_hot_encode_cols 
 `["color","version"]`

##### Train Features
<html>
<style>
table, th, td {
  text-align: Center;
}
</style>
<body>

<table>
  <tr><th width=100px>index</th><th width=100px>color</th><th width=100px>version</th><th width=100px>cost</th><th width=100px>height</th></tr>
  <tr><td>0</td><td>red</td><td>1</td><td>5.99</td><td>12</td> </tr>
  <tr><td>6</td><td>yellow</td><td>6</td><td>10.99</td><td>18</td></tr>
  <tr><td>3</td><td>red</td><td>1</td><td>5.99</td><td>15</td></tr>
  <tr><td>9</td><td>red</td><td>8</td><td>12.99</td><td>21</td></tr>
  <tr><td>2</td><td>blue</td><td>3</td><td>5.99</td><td>14</td> </tr>
  <tr><td>5</td><td>orange</td><td>5</td><td>10.99</td><td>17</td> </tr>
  <tr><td>1</td><td>green</td><td>2</td><td>5.99</td><td>13</td> </tr>
  <tr><td>7</td><td>green</td><td>2</td><td>12.99</td><td>19</td> </tr>
</table>
</body>
</html>

##### Test Features
<html>
<style>
table, th, td {
  text-align: Center;
}
</style>
<body>

<table>
  <tr><th width=100px>index</th><th width=100px>color</th><th width=100px>version</th><th width=100px>cost</th><th width=100px>height</th></tr>
  <tr><td>4</td><td>purple</td><td>4</td><td>10.99</td><td>16</td> </tr>
  <tr><td>8</td><td>blue</td><td>3</td><td>12.99</td><td>20</td></tr>
</table>
</body>
</html>

In [36]:
# Write your code here
def one_hot_encode_columns_train(train_features,test_features,one_hot_encode_cols) -> pd.DataFrame:
    
    df_encode_cols = train_features[one_hot_encode_cols]
    df_other_cols = train_features.drop(columns=one_hot_encode_cols)
    
    encoder = sklearn.preprocessing.OneHotEncoder(sparse_output=False, handle_unknown='ignore')
    
    encoded = encoder.fit_transform(df_encode_cols)
    features = encoder.get_feature_names_out(one_hot_encode_cols)

    df_encoded = pd.DataFrame(encoded, columns=features, index=train_features.index)
    df_concat = pd.concat([df_encoded, df_other_cols], axis=1)

    return df_concat


In [37]:
# Run this cell to test your code
def double_height(dataframe:pd.DataFrame):
    return dataframe["height"] * 2

ans_train_features = pd.read_pickle(os.path.join(pkl_files_folder,"train_feats_tts.pkl"))
#print(ans_train_features)
ans_test_features = pd.read_pickle(os.path.join(pkl_files_folder,"test_feats_tts.pkl"))
#print(ans_test_features)
ans_train_features_ohe = pd.read_pickle(os.path.join(pkl_files_folder,"train_feats_ohe.pkl"))
#print(ans_train_features_ohe)

one_hot_encode_cols = ["color","version"]
train_features_ohe = one_hot_encode_columns_train(ans_train_features,ans_test_features,one_hot_encode_cols)
#print(train_features_ohe)

if compare_submission_to_answer_df(train_features_ohe,ans_train_features_ohe,"One Hot Encoded Train DF"):
    print('Passed...')
else:
    print('Failed...')

Encoded array shape: (8, 11)
Number of feature names: 11
Result DataFrame shape: (8, 11)
Result DataFrame shape: (8, 13)

RUNNING TEST FOR One Hot Encoded Train DF

Passed...


In [44]:
# Write your code here    
def one_hot_encode_columns_test(train_features,test_features,one_hot_encode_cols) -> pd.DataFrame:
    df_encode_cols = test_features[one_hot_encode_cols]
    df_other_cols = test_features.drop(columns=one_hot_encode_cols)
    
    encoder = sklearn.preprocessing.OneHotEncoder(sparse_output=False, handle_unknown='ignore')
    
    encoded = encoder.fit_transform(df_encode_cols)
    features = encoder.get_feature_names_out(one_hot_encode_cols)

    df_encoded = pd.DataFrame(encoded, columns=features, index=test_features.index)
    df_concat = pd.concat([df_encoded, df_other_cols], axis=1)

    return df_concat

In [45]:
# Run this cell to test your code
def double_height(dataframe:pd.DataFrame):
    return dataframe["height"] * 2

ans_train_features = pd.read_pickle(os.path.join(pkl_files_folder,"train_feats_tts.pkl"))
#print(ans_train_features)
ans_test_features = pd.read_pickle(os.path.join(pkl_files_folder,"test_feats_tts.pkl"))
#print(ans_test_features)
ans_test_features_ohe = pd.read_pickle(os.path.join(pkl_files_folder,"test_feats_ohe.pkl"))
#print(ans_test_features_ohe)

one_hot_encode_cols = ["color","version"]
test_features_ohe = one_hot_encode_columns_test(ans_train_features,ans_test_features,one_hot_encode_cols)
#print(test_features_ohe)

if compare_submission_to_answer_df(test_features_ohe,ans_test_features_ohe,"One Hot Encoded Test DF"):
    print('Passed...')
else:
    print('Failed...')


RUNNING TEST FOR One Hot Encoded Test DF

You do not have the correct number of columns in One Hot Encoded Test DF

Failed...


## `min_max_scaled_columns_train` and `min_max_scaled_columns_test`
Min/Max Scaling is a process of scaling ints/floats from a min and max value in a series to between 0 and 1. The function for how scikit-learn does this is shown below, but for this assignment, you should just use the linked scikit-learn function.
```python
X_std = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
X_scaled = X_std * (max - min) + min
```
There is a separate function for the training and test datasets because they should be handled separately to avoid data leakage (see the 3rd link in Useful Resources for a little more info on how to handle them).

Example Dataframe:
<!--
| Item         | Price     | Count      | Type       |
|--------------|-----------|------------|------------|
| Apples       | 1.99      | 7          | Fruit      |
| Broccoli     | 1.29      | 435        | Vegtable   |
| Bananas      | 0.99      | 123        | Fruit      |
| Oranges      | 2.79      | 25         | Fruit      |
| Pineapples   | 4.89      | 5234       | Fruit      |
-->

<html>
<style>
table, th, td {
  text-align: Center;
}
</style>
<body>

<table>
  <tr><th width=100px>Item</th><th width=100px>Price</th><th width=100px>Count</th><th width=100px>Type</th></tr>
  <tr><td>Apples</td><td>1.99</td><td>7</td><td>Fruit</td> </tr>
  <tr><td>Broccoli</td><td>1.29</td><td>435</td><td>Vegtable</td></tr>
  <tr><td>Bananas</td><td>0.99</td><td>123</td><td>Fruit</td></tr>
  <tr><td>Oranges</td><td>2.79</td><td>25</td><td>Fruit</td></tr>
  <tr><td>Pineapples</td><td>4.89</td><td>5234</td><td>Fruit</td> </tr>
</table>
</body>
</html>

Example One Hot Encoded Dataframe (rounded to 4 decimal places):
<!--
| Item         | Price     | Count      | Type       |
|--------------|-----------|------------|------------|
| Apples       | 0.2564    | 7          | Fruit      |
| Broccoli     | 0.0769    | 435        | Vegtable   |
| Bananas      | 0         | 123        | Fruit      |
| Oranges      | 0.4615    | 25         | Fruit      |
| Pineapples   | 1         | 5234       | Fruit      |
-->

<html>
<style>
table, th, td {
  text-align: Center;
}
</style>
<body>

<table>
  <tr><th width=100px>Item</th><th width=100px>Price</th><th width=100px>Count</th><th width=100px>Type</th></tr>
  <tr><td>Apples</td><td>0.2564</td><td>7</td><td>Fruit</td> </tr>
  <tr><td>Broccoli</td><td>0.0769</td><td>435</td><td>Vegtable</td></tr>
  <tr><td>Bananas</td><td>0</td><td>123</td><td>Fruit</td></tr>
  <tr><td>Oranges</td><td>0.4615</td><td>25</td><td>Fruit</td></tr>
  <tr><td>Pineapples</td><td>1</td><td>5234</td><td>Fruit</td> </tr>
</table>
</body>
</html>

In [None]:
# Write your code here
def min_max_scaled_columns_train(train_features,test_features,min_max_scale_cols) -> pd.DataFrame:
    # TODO: Read the function description in https://github.gatech.edu/pages/cs6035-tools/cs6035-tools.github.io/Projects/Machine_Learning/Task2.html and implement the function as described
    min_max_scaled_dataset = pd.DataFrame()

    return min_max_scaled_dataset

In [None]:
# Run this cell to test your code
ans_train_features = pd.read_pickle(os.path.join(pkl_files_folder,"train_feats_tts.pkl"))
#print(ans_train_features)
ans_test_features = pd.read_pickle(os.path.join(pkl_files_folder,"test_feats_tts.pkl"))
#print(ans_test_features)
ans_train_features_mms = pd.read_pickle(os.path.join(pkl_files_folder,"train_feats_mms.pkl"))
#print(ans_train_features_mms)

min_max_scale_cols = ["cost"]
train_features_mms = min_max_scaled_columns_train(ans_train_features,ans_test_features,min_max_scale_cols)
#print(train_features_mms)

if compare_submission_to_answer_df(train_features_mms,ans_train_features_mms,"Min Max Scaled Train DF"):
    print('Passed...')
else:
    print('Failed...')

In [None]:
# Write your code here    
def min_max_scaled_columns_test(train_features,test_features,min_max_scale_cols) -> pd.DataFrame:
    # TODO: Read the function description in https://github.gatech.edu/pages/cs6035-tools/cs6035-tools.github.io/Projects/Machine_Learning/Task2.html and implement the function as described
    min_max_scaled_dataset = pd.DataFrame()
    
    return min_max_scaled_dataset

In [None]:
# Run this cell to test your code
ans_train_features = pd.read_pickle(os.path.join(pkl_files_folder,"train_feats_tts.pkl"))
#print(ans_train_features)
ans_test_features = pd.read_pickle(os.path.join(pkl_files_folder,"test_feats_tts.pkl"))
#print(ans_test_features)
ans_test_features_mms = pd.read_pickle(os.path.join(pkl_files_folder,"test_feats_mms.pkl"))
#print(ans_test_features_mms)

min_max_scale_cols = ["cost"]
test_features_mms = min_max_scaled_columns_test(ans_train_features,ans_test_features,min_max_scale_cols)
#print(test_features_mms)

if compare_submission_to_answer_df(test_features_mms,ans_test_features_mms,"Min Max Scaled Test DF"):
    print('Passed...')
else:
    print('Failed...')

## `pca_train` and `pca_test`
Principal Component Analysis is a dimensionality reduction technique (column reduction). It aims to take the variance in your input columns and map the columns into N columns that contain as much of the variance as it can. This technique can be useful if you are trying to train a model faster and has some more advanced uses, especially when training models on data which has many columns but few rows. There is a separate function for the training and test datasets because they should be handled separately to avoid data leakage (see the 3rd link in Useful Resources for a little more info on how to handle them).

**Note:** For the Autograder, use the column naming scheme of column names: component_1, component_2 .. component_n for the `n_components` passed into the `__init__` method.

**Note 2:** For your PCA outputs to match the Autograder, make sure you set the seed using a random state of 0 when you initialize the PCA function.

**Note 3:** Since PCA does not work with NA values, make sure you drop any columns that have NA values before running PCA.
 
##### Useful Resources
* <https://builtin.com/data-science/step-step-explanation-principal-component-analysis>
* <https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html#sklearn.decomposition.PCA>
* <https://datascience.stackexchange.com/questions/103211/do-we-need-to-pre-process-both-the-test-and-train-data-set>

##### INPUTS
Use the needed instance variables you set in the `__init__` method

##### OUTPUTS 
a pandas DataFrame with the generated pca values and using column names: component_1, component_2 .. component_n

In [None]:
# Write your code here
def pca_train(train_features,test_features,n_components) -> pd.DataFrame:
    # TODO: Read the function description in https://github.gatech.edu/pages/cs6035-tools/cs6035-tools.github.io/Projects/Machine_Learning/Task2.html and implement the function as described
    pca_dataset = pd.DataFrame()
    
    return pca_dataset

In [None]:
# Run this cell to test your code
ans_train_features = pd.read_pickle(os.path.join(pkl_files_folder,"train_feats_tts.pkl")).drop(columns=["color"])
#print(ans_train_features)
ans_test_features = pd.read_pickle(os.path.join(pkl_files_folder,"test_feats_tts.pkl")).drop(columns=["color"])
#print(ans_test_features)
ans_train_features_pca = pd.read_pickle(os.path.join(pkl_files_folder,"train_feats_pca.pkl"))
#print(ans_train_features_pca)

n_components = 2
train_features_pca = pca_train(ans_train_features,ans_test_features,n_components)
#print(train_features_pca)

if compare_submission_to_answer_df(train_features_pca.round(4),ans_train_features_pca.round(4),"PCA Train DF"):
    print('Passed...')
else:
    print('Failed...')

In [None]:
# Write your code here    
def pca_test(train_features,test_features,n_components) -> pd.DataFrame:
    # TODO: Read the function description in https://github.gatech.edu/pages/cs6035-tools/cs6035-tools.github.io/Projects/Machine_Learning/Task2.html and implement the function as described
    pca_dataset = pd.DataFrame()
        
    return pca_dataset

In [None]:
# Run this cell to test your code
ans_train_features = pd.read_pickle(os.path.join(pkl_files_folder,"train_feats_tts.pkl")).drop(columns=["color"])
#print(ans_train_features)
ans_test_features = pd.read_pickle(os.path.join(pkl_files_folder,"test_feats_tts.pkl")).drop(columns=["color"])
#print(ans_test_features)
ans_test_features_pca = pd.read_pickle(os.path.join(pkl_files_folder,"test_feats_pca.pkl"))
#print(ans_test_features_pca)

n_components = 2
test_features_pca = pca_test(ans_train_features,ans_test_features,n_components)
#print(test_features_pca)

if compare_submission_to_answer_df(test_features_pca.round(4),ans_test_features_pca.round(4),"PCA Test DF"):
    print('Passed...')
else:
    print('Failed...')

## `feature_engineering_train`, `feature_engineering_test`
Feature Engineering is a process of using domain knowledge (physics, geometry, sports statistics, business metrics, etc.) to create new features (columns) out of the existing data. This could mean creating an area feature when given the length and width of a triangle or extracting the major and minor version number from a software version or more complex logic depending on the scenario. For this method, you will be taking in a dictionary with a column name and a function (that takes in a DataFrame and returns a column) and using that to create a new column with the name in the dictionary key.

For example:
```python
def double_height(dataframe:pd.DataFrame):
    return dataframe["height"] * 2
def half_height(dataframe:pd.DataFrame):
    return dataframe["height"] / 2
feature_engineering_functions = {"double_height":double_height,"half_height":half_height}
```
With the above functions, you would create two new columns named "double_height" and "half_height".

##### Useful Resources
* <https://en.wikipedia.org/wiki/Feature_engineering>
* <https://www.geeksforgeeks.org/what-is-feature-engineering/>

##### INPUTS
Use the needed instance variables you set in the `__init__` method

##### OUTPUTS 
a pandas dataframe with the features described in `feature_engineering_train` and `feature_engineering_test` added as new columns and all other columns in the dataframe unchanged


In [None]:
# Write your code here
def feature_engineering_train(train_features,test_features,feature_engineering_functions) -> pd.DataFrame:
    # TODO: Read the function description in https://github.gatech.edu/pages/cs6035-tools/cs6035-tools.github.io/Projects/Machine_Learning/Task2.html and implement the function as described
    feature_engineered_dataset = pd.DataFrame()

    return feature_engineered_dataset

In [None]:
# Run this cell to test your code
ans_train_features = pd.read_pickle(os.path.join(pkl_files_folder,"train_feats_tts.pkl"))
#print(ans_train_features)
ans_test_features = pd.read_pickle(os.path.join(pkl_files_folder,"test_feats_tts.pkl"))
#print(ans_test_features)
ans_train_features_fe = pd.read_pickle(os.path.join(pkl_files_folder,"train_feats_fe.pkl"))
#print(ans_train_features_fe)

feature_engineering_functions = {"double_height":double_height}
train_features_fe = feature_engineering_train(ans_train_features,ans_test_features,feature_engineering_functions)
#print(train_features_fe)

if compare_submission_to_answer_df(train_features_fe,ans_train_features_fe,"Feature Engineered Train DF"):
    print('Passed...')
else:
    print('Failed...')

In [None]:
# Write your code here
def feature_engineering_test(train_features,test_features,feature_engineering_functions) -> pd.DataFrame:
    # TODO: Read the function description in https://github.gatech.edu/pages/cs6035-tools/cs6035-tools.github.io/Projects/Machine_Learning/Task2.html and implement the function as described
    feature_engineered_dataset = pd.DataFrame()

    return feature_engineered_dataset

In [None]:
# Run this cell to test your code
ans_train_features = pd.read_pickle(os.path.join(pkl_files_folder,"train_feats_tts.pkl"))
#print(ans_train_features)
ans_test_features = pd.read_pickle(os.path.join(pkl_files_folder,"test_feats_tts.pkl"))
#print(ans_test_features)
ans_test_features_fe = pd.read_pickle(os.path.join(pkl_files_folder,"test_feats_fe.pkl"))
#print(ans_test_features_fe)

feature_engineering_functions = {"double_height":double_height}
test_features_fe = feature_engineering_test(ans_train_features,ans_test_features,feature_engineering_functions)
#print(test_features_fe)

if compare_submission_to_answer_df(test_features_fe,ans_test_features_fe,"Feature Engineered Test DF"):
    print('Passed...')
else:
    print('Failed...')

## PreprocessDataset:`preprocess_train`, `preprocess_test`
Now, we will put three of the above methods together into a preprocess function. This function will take in a dataset and perform **encoding**, **scaling**, and **feature engineering** using the above methods and their respective columns.

##### Useful Resources
See resources for one hot encoding, min/max scaling and feature engineering above

##### INPUTS
Use the needed instance variables you set in the `__init__` method

##### OUTPUTS 
a pandas dataframe for both test and train features with the columns in `one_hot_encode_cols` encoded, the columns in `min_max_scale_cols` scaled and the columns described in `feature_engineering_functions` engineered. 

In [None]:
# Write your code here
def preprocess_train(train_features,test_features,one_hot_encode_cols,min_max_scale_cols,feature_engineering_functions) -> tuple[pd.DataFrame,pd.DataFrame]:
    # TODO: Read the function description in https://github.gatech.edu/pages/cs6035-tools/cs6035-tools.github.io/Projects/Machine_Learning/Task2.html and implement the function as described
    train_features = pd.DataFrame()
    
    return train_features

In [None]:
# Run this cell to test your code
ans_train_features = pd.read_pickle(os.path.join(pkl_files_folder,"train_feats_tts.pkl"))
#print(ans_train_features)
ans_test_features = pd.read_pickle(os.path.join(pkl_files_folder,"test_feats_tts.pkl"))
#print(ans_test_features)
ans_train_features_preprocess = pd.read_pickle(os.path.join(pkl_files_folder,"train_feats_preprocess.pkl"))
#print(ans_train_features_preprocess)

one_hot_encode_cols = ["color","version"]
min_max_scale_cols = ["cost"]
n_components = 2
feature_engineering_functions = {"double_height":double_height}

train_features_preprocess = preprocess_train(ans_train_features,ans_test_features,one_hot_encode_cols,min_max_scale_cols,feature_engineering_functions)
#print(train_features_preprocess)

if compare_submission_to_answer_df(train_features_preprocess,ans_train_features_preprocess,"Preprocessed Train DF"):
    print('Passed...')
else:
    print('Failed...')

In [None]:
# Write your code here
def preprocess_test(train_features,test_features,one_hot_encode_cols,min_max_scale_cols,feature_engineering_functions) -> tuple[pd.DataFrame,pd.DataFrame]:
    # TODO: Read the function description in https://github.gatech.edu/pages/cs6035-tools/cs6035-tools.github.io/Projects/Machine_Learning/Task2.html and implement the function as described
    test_features = pd.DataFrame()
    
    return test_features

In [None]:
# Run this cell to test your code
# Run this cell to test your code
ans_train_features = pd.read_pickle(os.path.join(pkl_files_folder,"train_feats_tts.pkl"))
#print(ans_train_features)
ans_test_features = pd.read_pickle(os.path.join(pkl_files_folder,"test_feats_tts.pkl"))
#print(ans_test_features)
ans_test_features_preprocess = pd.read_pickle(os.path.join(pkl_files_folder,"test_feats_preprocess.pkl"))
#print(ans_test_features_preprocess)

one_hot_encode_cols = ["color","version"]
min_max_scale_cols = ["cost"]
n_components = 2
feature_engineering_functions = {"double_height":double_height}

test_features_preprocess = preprocess_test(ans_train_features,ans_test_features,one_hot_encode_cols,min_max_scale_cols,feature_engineering_functions)
#print(train_features_preprocess)

if compare_submission_to_answer_df(test_features_preprocess,ans_test_features_preprocess,"Preprocessed Train DF"):
    print('Passed...')
else:
    print('Failed...')

# You have successfully reached the end of this notebook.
