# Welcome to Full Stack Machine Learning's Week 4 Project!

In the final week, you will return to the workflow you built last week on the [taxi dataset](https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page). 

## Task 1: Deploy the champion
Use what you have learned in the last two weeks to make the necessary modifications and deploy your latest version of the `TaxiFarePrediction` flow to Argo. Use `--branch champion` to mark this deployment as the champion model.

### The Baseline Model

Changes to the code from the Week 3 project:
- Removed `@trigger`
- Added `@project`
- Simplified the `@card`


In [34]:
%%writefile ./taxi_faire_linear_regression.py
from metaflow import FlowSpec, step, card, conda_base, current, Parameter, Flow, project, retry
from metaflow.cards import Markdown, Table, Image, Artifact

URL = "https://outerbounds-datasets.s3.us-west-2.amazonaws.com/taxi/latest.parquet"
DATETIME_FORMAT = "%Y-%m-%d %H:%M:%S"

# @trigger(events=["s3"])   # Disable trigger for this project.
@project(name="fullstack")  # <-- Add project
@conda_base(
    libraries={
        "pandas": "1.4.2",
        "pyarrow": "11.0.0",
        "scikit-learn": "1.1.2",
    }
)
class TaxiFarePrediction(FlowSpec):
    data_url = Parameter("data_url", default=URL)

    def transform_features(self, df):
        idx = (df.fare_amount > 0)
        idx &= (df.trip_distance <= 100)
        idx &= (df.trip_distance > 0)
        idx &= (df.tip_amount >= 0)
        df = df[idx]
        return df

    @step
    def start(self):
        import pandas as pd
        from sklearn.model_selection import train_test_split
        df = pd.read_parquet(self.data_url)
        self.df = self.transform_features(df)
        self.X = self.df["trip_distance"].values.reshape(-1, 1)
        self.y = self.df["total_amount"].values
        self.next(self.linear_model)

    @step
    def linear_model(self):
        "Instantiate a single-variable linear model; it is fit in the validate step."
        from sklearn.linear_model import LinearRegression
        self.model_type = "LinearRegression"
        self.model = LinearRegression()
        self.next(self.validate)

    @card(type="corise")
    @retry(times=2) 
    @step
    def validate(self):
        from sklearn.model_selection import cross_val_score
        # Get CV scores
        self.scores = cross_val_score(self.model, self.X, self.y, cv=5)
        # We still need to fit the model
        self.model.fit(self.X, self.y)
        self.next(self.end)

    @step
    def end(self):
        print("Success!")


if __name__ == "__main__":
    TaxiFarePrediction()

Overwriting ./taxi_faire_linear_regression.py


### Create a production branch: `--production --branch champion`

In [35]:
%%capture
# Deploy Baseline Model (LinearRegression)
!python ./taxi_faire_linear_regression.py --environment=conda --production --branch champion argo-workflows create

In [36]:
# Manually trigger the Baseline Model
!python ./taxi_faire_linear_regression.py --environment=conda --production --branch champion argo-workflows trigger

[35m[1mMetaflow 2.10.6+ob(v1)[0m[35m[22m executing [0m[31m[1mTaxiFarePrediction[0m[35m[22m[0m[35m[22m for [0m[31m[1muser:sandbox[0m[35m[22m[K[0m[35m[22m[0m
[35m[22mProject: [0m[32m[1mfullstack[0m[35m[22m, Branch: [0m[32m[1mprod.champion[0m[35m[22m[K[0m[35m[22m[0m
[35m[22mValidating your flow...[K[0m[35m[22m[0m
[32m[1m    The graph looks good![K[0m[32m[1m[0m
[35m[22mRunning pylint...[K[0m[35m[22m[0m
[32m[1m    Pylint is happy![K[0m[32m[1m[0m
[1mWorkflow [0m[31m[1mfullstack.prod.champion.taxifareprediction[0m[1m triggered on Argo Workflows (run-id [0m[31m[1margo-fullstack.prod.champion.taxifareprediction-7lc2b[0m[1m).[K[0m[1m[0m
[1mSee the run in the UI at https://ui-pw-527107953.outerbounds.dev/TaxiFarePrediction/argo-fullstack.prod.champion.taxifareprediction-7lc2b[K[0m[1m[0m
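
The `Branch: prod.champion` line and the workflow name in the output above follow from how `@project` combines the project name, the `--production` and `--branch` options, and the flow name. The helper below is a minimal sketch of that naming scheme, assuming the rules described in the Metaflow `@project` documentation; `project_branch` and `deployment_name` are hypothetical names used for illustration, not Metaflow functions, and the lowercasing is inferred from the output above.

```python
from typing import Optional


def project_branch(user: str, branch: Optional[str] = None, production: bool = False) -> str:
    """Mimic the branch name that @project derives from the CLI options."""
    if production:
        # --production alone gives "prod"; with --branch it gives "prod.<branch>"
        return f"prod.{branch}" if branch else "prod"
    if branch:
        # --branch without --production gives a test branch
        return f"test.{branch}"
    # No options: a personal branch for the current user
    return f"user.{user}"


def deployment_name(project: str, branch_name: str, flow_name: str) -> str:
    """Mimic the workflow name seen in the trigger output (assumed lowercased)."""
    return f"{project}.{branch_name}.{flow_name}".lower()


champion = project_branch("sandbox", branch="champion", production=True)
print(champion)
print(deployment_name("fullstack", champion, "TaxiFarePrediction"))
```

Because the branch is part of the deployment name, a later deployment with `--branch challenger` would run side by side with the champion rather than overwrite it.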


### Validate Results

In [52]:
from metaflow import namespace, Flow

champ_namespace = "production:mfprj-cqkixzvdsy3tjqdh-0-ndkt"
flow_name = "TaxiFarePrediction"

# Retrieve data
namespace(champ_namespace)
run = Flow(flow_name).latest_successful_run
print(f"- model_type = {run.data.model_type}")
print(f"- CV scores = {run.data.scores}")

# Test the model
model = run.data.model
X = run.data.X
y = run.data.y
print(f"- X.shape = {X.shape}, y.shape = {y.shape}")
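# LinearRegression.score returns R^2 (coefficient of determination), not a
# classification accuracy; here it is computed on the training data itself.
acc_score = model.score(X, y)
print(f"- Overall accuracy score = {acc_score}")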
pred = model.predict(X[0:5])
print(f"- Sample predictions = {pred}")


- model_type = LinearRegression
- CV scores = [0.89223724 0.90081135 0.89633068 0.90003852 0.89430825]
- X.shape = (863296, 1), y.shape = (863296,)
- Overall accuracy score = 0.8973251926945167
- Sample predictions = [15.89793098 16.4845334  22.84691343 20.09439441 17.97360106]
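
The five fold scores are easier to compare across models when reduced to a mean and a spread. The sketch below does that with the standard library, using the scores printed above. `summarize_scores` is a hypothetical helper name. For a regressor, `cross_val_score` with no `scoring` argument uses the estimator's own `score` method, so these values are R² per fold.

```python
from statistics import mean, stdev

# Fold scores copied from the cell output above.
champion_scores = [0.89223724, 0.90081135, 0.89633068, 0.90003852, 0.89430825]


def summarize_scores(scores):
    """Return (mean, sample standard deviation) of per-fold scores."""
    return mean(scores), stdev(scores)


avg, spread = summarize_scores(champion_scores)
print(f"- CV R^2 = {avg:.4f} +/- {spread:.4f}")
```

A small spread relative to the mean suggests the score is stable across folds, which makes it a more trustworthy baseline for a later comparison.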


## Task 2: Build the challenger
Develop a second model using the same `TaxiFarePrediction` flow architecture. Then deploy the flow to Argo with `--branch challenger`.
<br>
<br>
Hint: Modify the `linear_model` step. 
<br>
Bonus: Write a one-paragraph summary of how you developed the second model and tested it before deploying the challenger flow. Let us know in Slack what you found challenging about the task.

In [None]:
%%writefile ./taxi_faire_xgboost.py
from metaflow import FlowSpec, step, card, conda_base, current, Parameter, Flow, project, retry
from metaflow.cards import Markdown, Table, Image, Artifact

URL = "https://outerbounds-datasets.s3.us-west-2.amazonaws.com/taxi/latest.parquet"
DATETIME_FORMAT = "%Y-%m-%d %H:%M:%S"

# @trigger(events=["s3"])   # Disable trigger for this project.
@project(name="fullstack")  # <-- Add project
@conda_base(
    libraries={
        "pandas": "1.4.2",
        "pyarrow": "11.0.0",
        "scikit-learn": "1.1.2",
    }
)
class TaxiFarePrediction(FlowSpec):
    data_url = Parameter("data_url", default=URL)

    def transform_features(self, df):
        idx = (df.fare_amount > 0)
        idx &= (df.trip_distance <= 100)
        idx &= (df.trip_distance > 0)
        idx &= (df.tip_amount >= 0)
        df = df[idx]
        return df

    @step
    def start(self):
        import pandas as pd
        from sklearn.model_selection import train_test_split
        df = pd.read_parquet(self.data_url)
        self.df = self.transform_features(df)
        self.X = self.df["trip_distance"].values.reshape(-1, 1)
        self.y = self.df["total_amount"].values
        self.next(self.linear_model)

    @step
    def linear_model(self):
        "Instantiate a single-variable linear model."
        from sklearn.linear_model import LinearRegression
        self.model_type = "LinearRegression"
        self.model = LinearRegression()
        self.next(self.validate)

    @card(type="corise")
    @retry(times=2) 
    @step
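    def validate(self):
        from sklearn.model_selection import cross_val_score
        # Get CV scores
        self.scores = cross_val_score(self.model, self.X, self.y, cv=5)
        # cross_val_score fits clones, so fit the model itself before storing it
        self.model.fit(self.X, self.y)
        self.next(self.end)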

    @step
    def end(self):
        print("Success!")


if __name__ == "__main__":
    TaxiFarePrediction()

## Task 3: Analyze the results
Return to this notebook and read in the results of the champion and challenger flows using the Metaflow Client API.
<br><br>

#### Questions
- Does your model perform better on the metrics you selected? 
- Think about your day job: how would you assess whether to roll the production "champion" forward to your new model?
    - What gives you confidence one model is better than another?
    - What kinds of information do you need to monitor to get buy-in from stakeholders that model A is preferable to model B?  
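
A first pass at the comparison could look like the sketch below. It assumes that runs deployed with `@project` carry a `project_branch:<branch>` tag (for example `project_branch:prod.champion`) and that both flows store `model_type` and `scores` artifacts, as the champion flow does. `latest_scores` and `compare_scores` are hypothetical helper names, and the Metaflow import is kept inside the function so the comparison logic can be used without a Metaflow installation.

```python
from statistics import mean


def latest_scores(branch, flow_name="TaxiFarePrediction"):
    """Return (model_type, CV scores) of the latest successful run on a branch.

    Assumption: @project tags each run with `project_branch:<branch>`.
    """
    from metaflow import Flow, namespace  # needs Metaflow and access to the runs

    namespace(None)  # look across all namespaces, then filter by tag
    for run in Flow(flow_name).runs(f"project_branch:{branch}"):
        if run.successful:
            return run.data.model_type, list(run.data.scores)
    raise LookupError(f"no successful run found for branch {branch!r}")


def compare_scores(champion_scores, challenger_scores):
    """Compare mean CV scores; a positive delta favors the challenger."""
    champion_mean = mean(champion_scores)
    challenger_mean = mean(challenger_scores)
    delta = challenger_mean - champion_mean
    return {
        "champion_mean": champion_mean,
        "challenger_mean": challenger_mean,
        "delta": delta,
        "challenger_better": delta > 0,
    }
```

With both deployments in place, `compare_scores(latest_scores("prod.champion")[1], latest_scores("prod.challenger")[1])` would return the two means and their difference. A higher mean alone is weak evidence: the fold-to-fold spread, performance on data the model has not seen, and behavior over time in production all bear on whether to promote the challenger.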

## CONGRATULATIONS! 🎉✨🍾
If you made it this far, you have completed the Full Stack Machine Learning Corise course. 
We are so glad that you chose to learn with us, and hope to see you again in future courses. Stay tuned for more content and come join us in [Slack](http://slack.outerbounds.co/) to keep learning about Metaflow!