Before you turn this problem in, make sure everything runs as expected. First, **restart the kernel** (in the menubar, select Kernel$\rightarrow$Restart) and then **run all cells** (in the menubar, select Cell$\rightarrow$Run All).

Make sure you fill in any place that says `YOUR CODE HERE` or "YOUR ANSWER HERE", as well as your name and collaborators below:

NAME = ""
COLLABORATORS = ""


---

# Scikit-Learn

This lab will test your knowledge of scikit-learn and your ability to perform basic tasks using this foundational library.

You are welcome (and encouraged) to make use of the scikit-learn documentation:

https://scikit-learn.org/stable/modules/classes.html

# Problem 1

In this problem, we will implement a variety of sklearn estimators.

## Part a)
Implement a transformer which standardizes a dataset according to the following formula:

$$ X = \frac{X - \mu}{\sigma} $$

where $\mu$ is the average and $\sigma$ is the standard deviation.
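As a rough illustration of the formula only (not the transformer itself), the sketch below standardizes a made-up NumPy matrix. It assumes $\mu$ and $\sigma$ are computed per column, and it uses NumPy's default population standard deviation; both are assumptions rather than requirements stated above.

```python
import numpy as np

# Toy feature matrix: 3 samples (rows), 2 features (columns).
X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 30.0]])

mu = X.mean(axis=0)     # per-column mean
sigma = X.std(axis=0)   # per-column population standard deviation

# Broadcasting applies (X - mu) / sigma to every row.
X_std = (X - mu) / sigma

print(X_std.mean(axis=0))  # close to 0 for every column
print(X_std.std(axis=0))   # close to 1 for every column
```

Note that a constant column has $\sigma = 0$, so a real implementation would probably need to guard against division by zero.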

In [None]:
from sklearn.base import BaseEstimator, TransformerMixin, RegressorMixin
import pandas as pd
from sklearn.utils.validation import check_X_y, check_array

class StandardizerTransformer(BaseEstimator, TransformerMixin):

    # applies the transformation to the feature matrix
    def transform(self, x, y=None):
        # YOUR CODE HERE
        return x  # return the transformed x (new data)

    # calculates the "parameters", i.e. the mean and standard deviation
    def fit(self, x, y=None):
        # YOUR CODE HERE
        return self

In [None]:
import pandas as pd

# Small trial dataset: 3 samples, 5 features
trial_df = [[1, 2, 3, 4, 2],
            [1, 2, 3, 4, 2],
            [0, 0, 0, 0, 0]]
y = [[1], [2], [3]]

st = StandardizerTransformer()
model = st.fit(trial_df, y)
print(model.mean_)



In [None]:
from sklearn.utils.estimator_checks import check_estimator
check_estimator(StandardizerTransformer())


## Part b)

Implement a regressor which predicts the average of the input values. Since we know this is not a very good model, we will set the `poor_score` tag so that we can still validate the model interface.

Please note that just because this is not the greatest model, that does not mean it is not a useful one.  This model can provide a wonderful baseline with which to start.
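To see what such a baseline looks like in practice, here is a small sketch using scikit-learn's built-in `DummyRegressor`. It assumes "average" means the mean of the training targets, and the data is made up; it is a point of comparison, not the `MeanModel` you are asked to write.

```python
import numpy as np
from sklearn.dummy import DummyRegressor

# Made-up training data: the features are ignored by a mean baseline.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])

baseline = DummyRegressor(strategy="mean").fit(X, y)
pred = baseline.predict(X)
score = baseline.score(X, y)

print(pred)   # every prediction equals y.mean()
print(score)  # R^2 of a constant mean predictor on its own training data
```

Any real model should beat this score; if it does not, that is a useful warning sign.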

In [None]:
class MeanModel(BaseEstimator, RegressorMixin):
    def _more_tags(self):
        return dict(poor_score=True)
    
    # YOUR CODE HERE
    raise NotImplementedError()

In [None]:
check_estimator(MeanModel())


# Problem 2

In this problem, we will explore some of the higher-level constructs in sklearn. Let's start by downloading the dataset from

https://archive.ics.uci.edu/dataset/2/adult

which is known as the census income dataset (see `adult.names` after downloading the data for more info). We can do this with the following Python functions.



In [None]:
import requests
from pathlib import Path
URL = "https://archive.ics.uci.edu/ml/machine-learning-databases/adult/{}"
def download(url, filepath):
    with requests.get(url, stream=True) as r:
        r.raise_for_status()
        with filepath.open('wb') as fp:
            # the default chunk_size is 1 byte, which makes the download very slow
            for chunk in r.iter_content(chunk_size=8192):
                fp.write(chunk)
        
def download_adult_data(ignore_cache=False):
    data_path = Path('data')
    data_path.mkdir(exist_ok=True)
    files = ['adult.data', 'adult.names', 'adult.test']
    for file_ in files:
        filepath = data_path.joinpath(file_)
        if not ignore_cache and filepath.is_file():
            continue
        download(URL.format(file_), filepath)
    return [data_path.joinpath(f) for f in files]
download_adult_data()

Now we will load the data into an appropriate data structure for processing with scikit-learn. We use `pandas` as a tool to import the data. Our goal is not to rewrite useful libraries but to use the tools available in order to do some interesting data work.
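One detail worth checking when importing with `pandas` is whitespace around string fields. The sketch below uses two made-up rows (not real census records) and assumes the raw file separates fields with a comma followed by a space; `skipinitialspace=True` is a standard `read_csv` option, but whether you need it here is for you to verify.

```python
import io
import pandas as pd

# Two made-up rows in a comma-plus-space layout (hypothetical, for illustration).
raw = "39, Private, Bachelors, <=50K\n52, Self-emp, HS-grad, >50K\n"

plain = pd.read_csv(io.StringIO(raw), header=None)
clean = pd.read_csv(io.StringIO(raw), header=None, skipinitialspace=True)

print(repr(plain.iloc[0, 1]))  # the leading space is kept
print(repr(clean.iloc[0, 1]))  # the leading space is dropped
```

If the leading spaces are kept, string comparisons such as `value == "Private"` silently fail, so either strip the values or load them with `skipinitialspace=True`.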

In [None]:
from dataclasses import dataclass
import numpy as np

@dataclass
class Dataset:
    X_train: np.ndarray
    y_train: np.ndarray
    X_test: np.ndarray
    y_test: np.ndarray

In [None]:
import numpy as np
import pandas as pd
def load_adult_data():
    download_adult_data()
    data = pd.read_csv(Path('data').joinpath('adult.data'), header=None).values
    test_data = pd.read_csv(Path('data').joinpath('adult.test'), header=None, skiprows=1).values
    return Dataset(
        X_train=data[:, :-1],
        y_train=data[:, -1],
        X_test=test_data[:, :-1],
        y_test=np.array([i.rstrip('.') for i in test_data[:, -1]])
    )


Now let's do some exploratory data analysis on this dataset. In particular, look at the following columns:

- age
- workclass
- education


And answer the following questions (with supporting evidence):

- Is there anything about this feature which may need to be "engineered" in order to make it more useful?
- Do you think that this feature has any predictive power?  Does this make intuitive sense to you?
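One possible way to gather supporting evidence is sketched below on a tiny made-up sample. The column names and values are hypothetical; with the real data you would first wrap the arrays in a `DataFrame` with your own column names.

```python
import pandas as pd

# Hypothetical toy sample; names and values are made up for illustration.
df = pd.DataFrame({
    "age": [25, 38, 52, 46, 23, 61],
    "education": ["HS-grad", "Bachelors", "Masters", "HS-grad", "11th", "Bachelors"],
    "income": ["<=50K", ">50K", ">50K", "<=50K", "<=50K", ">50K"],
})

# How often does each category occur? Rare categories may need merging.
counts = df["education"].value_counts()
print(counts)

# Share of high earners per category: large differences hint at predictive power.
high = df["income"].eq(">50K")
rate = high.groupby(df["education"]).mean()
print(rate)

# For a numeric feature, compare its distribution across the two classes.
print(df.groupby("income")["age"].describe())
```

On such a small sample the numbers mean nothing; the point is only the shape of the analysis.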

In [None]:
dataset = load_adult_data()
# column indices: age = 0, workclass = 1, education = 3
for i in range(10):
    print("age =", dataset.X_train[i][0], 
          "workclass =", dataset.X_train[i][1], 
          "education =", dataset.X_train[i][3])

# YOUR CODE HERE

# Problem 3

Now we will start to do some basic feature engineering and start building up a set of models.

## Part a)

First we will start with the education feature. While education is a categorical variable in the dataset, we do not expect all categories to be completely orthogonal. Let's create a transformer which takes the education categories and collapses all education levels below HS-grad into a single value, "No-Degree". And yes, there is an education-number feature as well; however, we want to do some exercises to learn :)

**NOTE**: Although coding standards and checking for correctness of transformations is important, we will be concerned here with only your output.  However, we will be testing edge cases, so make sure you are considering those.

In [None]:
class EducationTransformer(BaseEstimator, TransformerMixin):
    # YOUR CODE HERE
    raise NotImplementedError()

In [None]:
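class EducationTransformer(BaseEstimator, TransformerMixin):
    # education levels below HS-grad, collapsed into "No-Degree"
    BELOW_HS = {'Preschool', '1st-4th', '5th-6th', '7th-8th',
                '9th', '10th', '11th', '12th'}

    def fit(self, x, y=None):
        # nothing to learn; return self so that fit_transform can chain
        return self

    def transform(self, x, y=None):
        # expects a 2D array whose first (only) column is education
        x = np.array(x, dtype=object)  # copy, so the caller's data is not mutated
        for i in range(len(x)):
            # strip first: raw values may carry leading whitespace
            if str(x[i][0]).strip() in self.BELOW_HS:
                x[i][0] = "No-Degree"
        return x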

In [None]:
d = load_adult_data()

transformed = EducationTransformer().fit_transform(d.X_train[:, 3:4])[:, 0]
assert 'No-Degree' in set(transformed)
assert '11th' not in set(transformed)


In [None]:
# inspect the first 100 transformed education values
for i in range(100):
    print(transformed[i])

## Part b)

It is often the case that models cannot handle non-numerical features. We will learn more about this later in class, but for this specific example we can build an intuition by thinking about these education levels. Many algorithms (for example, linear regression) require numerical features. However, when we encode categories as numbers, we make implicit assumptions about the metric on this data, specifically the distance between values. Let's take a simple example with the following data:

|shirt_color|
|---|
|red|
|red|
|blue|
|green|

In this case, we could numerically encode this as 


|shirt_color|
|---|
|1|
|1|
|2|
|3|

and use a regression on the data. However, this would imply that red is somehow closer to blue than to green. In other words, our arbitrary assignment of labels would affect the output. Instead, we can reshape the matrix like so:

|shirt_color_red|shirt_color_blue|shirt_color_green|
|---|---|---|
|1|0|0|
|1|0|0|
|0|1|0|
|0|0|1|

which removes the encoding issues.
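The shirt colour example can be reproduced with scikit-learn's `OneHotEncoder`, as in the sketch below. One difference from the table above: the encoder sorts categories alphabetically, so the output columns come out as blue, green, red.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# One categorical column, shaped (n_samples, 1) as the encoder expects.
shirt_color = np.array([["red"], ["red"], ["blue"], ["green"]])

encoder = OneHotEncoder()
# fit_transform returns a sparse matrix by default; convert it for printing.
encoded = encoder.fit_transform(shirt_color).toarray()

print(encoder.categories_[0])  # learned categories, in column order
print(encoded)                 # one row per sample, one column per category
```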

Use the `OneHotEncoder` from the sklearn library to build a more complex transformer which uses your previous transformer and then applies the one-hot encoding on top of it.

There are two possible ways to do this: one is to use a higher-order construct like a `Pipeline`, and the other is to use composition. You can do either in principle, but please use a `Pipeline` here.
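To show the general shape of a `Pipeline` without using the education data, the sketch below chains a hypothetical cleaning step (`strip_and_lower`, made up for this example) with a `OneHotEncoder` on toy colour values.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder

def strip_and_lower(X):
    # Hypothetical cleaning step: normalise whitespace and case in every cell.
    return np.array([[str(v).strip().lower() for v in row] for row in X],
                    dtype=object)

pipe = Pipeline([
    ("clean", FunctionTransformer(strip_and_lower)),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])

# " Red" and "red " collapse to the same category after cleaning.
X = np.array([[" Red"], ["red "], ["Blue"]], dtype=object)
out = pipe.fit_transform(X).toarray()

print(out.shape)
print(out)
```

Each step receives the output of the previous one, which is why the order of the steps matters.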


In [None]:
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import Pipeline

education_pipe = Pipeline([
    # YOUR CODE HERE
    raise NotImplementedError()
])

In [None]:
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import Pipeline

education_pipe = Pipeline([
    ("education", EducationTransformer()),
    ("one hot coder", OneHotEncoder(handle_unknown='ignore')),
])

In [None]:
d = load_adult_data()

transformed = education_pipe.fit_transform(d.X_train[:, 3:4])

assert transformed.shape[0] == d.X_train.shape[0]


In [None]:
education_pipe.fit(d.X_train[:, 3:4])  # fit on the education column only

for i in range(10):
    print(i, "=", d.X_train[i])

## Part c)

Now we can apply our transformer to the feature matrix by making use of the scikit-learn object `ColumnTransformer`.  Use this to build a transformer which operates directly on the input feature matrix. Additionally, in order to make the auto grading work, please ensure the following:

- Please name your solution transformer `ct_education`.  
- Please make sure to pass through the rest of the columns. (check the `passthrough` options in docs)
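The sketch below shows how `ColumnTransformer` behaves with `remainder="passthrough"` on a made-up matrix; it is not the `ct_education` solution. The `sparse_threshold=0` argument is only there to force a dense result for printing.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Made-up matrix: a number, a colour, and another number per row.
X = np.array([[25, "red", 40],
              [32, "blue", 35],
              [47, "red", 50]], dtype=object)

ct = ColumnTransformer(
    [("color", OneHotEncoder(), [1])],  # encode column 1 only
    remainder="passthrough",            # keep the other columns unchanged
    sparse_threshold=0,                 # always return a dense array
)
out = ct.fit_transform(X)

print(out.shape)  # 2 one-hot columns plus 2 passed-through columns
print(out[0])     # transformed columns come first, then the remainder
```

Note the column order in the output: the transformed columns are placed first and the passed-through columns follow, so column indices change after the transformation.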

In [None]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
ct_education = None # Your column transformer here
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
res = ct_education.fit_transform(d.X_train)

assert res.shape[0] == d.X_train.shape[0]


## Part d)

Now fit a model using the following features:

- your transformed education column
- age
- hours-per-week
- capital-gain
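One way to wire such a feature set into a model is sketched below on made-up data. The choice of `LogisticRegression` and `StandardScaler` is an assumption for illustration; the assignment does not prescribe a particular model, and the real data would use your `ct_education` transformer instead of a plain `OneHotEncoder`.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data; names mirror the feature list above, values are made up.
X = pd.DataFrame({
    "education": ["HS-grad", "Bachelors", "No-Degree", "Masters",
                  "HS-grad", "Bachelors", "No-Degree", "Masters"],
    "age": [25, 41, 19, 52, 33, 47, 22, 58],
    "hours-per-week": [40, 45, 20, 50, 38, 60, 25, 40],
    "capital-gain": [0, 5000, 0, 15000, 0, 7000, 0, 20000],
})
y = ["<=50K", ">50K", "<=50K", ">50K", "<=50K", ">50K", "<=50K", ">50K"]

# Columns not listed here are dropped (the default remainder).
features = ColumnTransformer([
    ("education", OneHotEncoder(handle_unknown="ignore"), ["education"]),
    ("numeric", StandardScaler(), ["age", "hours-per-week", "capital-gain"]),
])

model = Pipeline([
    ("features", features),
    ("clf", LogisticRegression(max_iter=1000)),
])
model.fit(X, y)
pred = model.predict(X)
print(pred)
```

Keeping the preprocessing inside the pipeline means the same transformations are applied automatically at prediction time.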

In [None]:
# YOUR CODE HERE