In [1]:
!pip freeze | grep scikit-learn

scikit-learn @ file:///tmp/build/80754af9/scikit-learn_1642617106979/work


In [1]:
import pickle
import pandas as pd

In [2]:
with open('model.bin', 'rb') as f_in:
    dv, lr = pickle.load(f_in)

In [3]:
categorical = ['PUlocationID', 'DOlocationID']

def read_data(filename):
    df = pd.read_parquet(filename)
    
    df['duration'] = df.dropOff_datetime - df.pickup_datetime
    df['duration'] = df.duration.dt.total_seconds() / 60

    df = df[(df.duration >= 1) & (df.duration <= 60)].copy()

    df[categorical] = df[categorical].fillna(-1).astype('int').astype('str')
    
    return df

In [4]:
df = read_data('https://nyc-tlc.s3.amazonaws.com/trip+data/fhv_tripdata_2021-02.parquet')

In [5]:
dicts = df[categorical].to_dict(orient='records')
X_val = dv.transform(dicts)
y_pred = lr.predict(X_val)

## Q1 Mean predicted duration

Mean predicted ride duration, in minutes, for the FHV February 2021 dataset:

In [6]:
print(y_pred.mean())

16.191691679979066


## Q2 Preparing output

Create an artificial `ride_id` column and put it, together with the predictions, into a results dataframe.

What is the size of the output Parquet file?

In [7]:
import os

In [8]:
taxi_type = 'fhv'
year = 2021
month = 2
EVAL_S3_STORE = os.getenv(key='EVAL_S3_STORE', 
                          default='s3://nyc-duration-predict-vk')
df_result = pd.DataFrame()
df_result['ride_id'] = f'{year:04d}/{month:02d}_' + df.index.astype('str')

# df_result = df.copy()
df_result['pred'] = y_pred


output_file = f'{EVAL_S3_STORE}/{taxi_type}_tripdata_{year:04d}-{month:02d}.parquet'
df_result.to_parquet(
    output_file,
    engine='pyarrow',
    compression=None,
    index=False,
)
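The cell above writes the result to S3 but does not itself report the file size that Q2 asks about. As a minimal local sketch, assuming the Parquet file has also been saved to disk, the size can be read with `os.path.getsize`. The helpers `make_ride_ids` and `file_size_mb` are hypothetical names introduced here for illustration; they are not part of the notebook. The first one only mirrors the `ride_id` f-string used above.

```python
import os
import tempfile


def make_ride_ids(year, month, index):
    """Build 'YYYY/MM_<row index>' identifiers, mirroring the f-string above."""
    return [f'{year:04d}/{month:02d}_{i}' for i in index]


def file_size_mb(path):
    """Return the size of a local file in megabytes (1 MB = 1024 * 1024 bytes)."""
    return os.path.getsize(path) / (1024 * 1024)


print(make_ride_ids(2021, 2, range(3)))

# Demonstrate the size check on a throwaway 2 MB file (not a real Parquet file).
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b'\0' * (2 * 1024 * 1024))
demo_size = file_size_mb(tmp.name)
os.remove(tmp.name)
print(f'{demo_size:.1f} MB')
```

For the actual answer, `file_size_mb` would be pointed at a local copy of the output file; for an object stored on S3, the size has to be looked up through the S3 tooling instead.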

## Q3 - Creating the scoring script

```bash
jupyter nbconvert --to script starter.ipynb
```

## Q4 - Virtual env

Command to create the `Pipfile` and `Pipfile.lock`:

```bash
pipenv install scikit-learn==1.0.2 pandas s3fs pyarrow boto3 --python=3.9
```

Scikit-learn dependency hash:
`sha256:08ef968f6b72033c16c479c966bf37ccd49b06ea91b765e1cc27afefe723920b`
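The hash can also be read programmatically rather than copied by hand. The sketch below assumes the usual `Pipfile.lock` layout, a JSON document whose `default` section maps each package name to an entry with a `hashes` list. `locked_hashes` is a hypothetical helper, and the toy lock file is illustrative: its second hash is a placeholder, not a real value.

```python
import json
import os
import tempfile


def locked_hashes(lock_path, package, section='default'):
    """Return the list of hashes recorded for `package` in a Pipfile.lock."""
    with open(lock_path) as f_in:
        lock = json.load(f_in)
    return lock[section][package]['hashes']


# Toy lock file for demonstration only; the second hash is a placeholder.
sample_lock = {
    'default': {
        'scikit-learn': {
            'hashes': [
                'sha256:08ef968f6b72033c16c479c966bf37ccd49b06ea91b765e1cc27afefe723920b',
                'sha256:placeholder-not-a-real-hash',
            ],
            'version': '==1.0.2',
        }
    }
}

with tempfile.NamedTemporaryFile('w', suffix='.lock', delete=False) as tmp:
    json.dump(sample_lock, tmp)

hashes = locked_hashes(tmp.name, 'scikit-learn')
os.remove(tmp.name)
print(hashes[0])
```

Against the real project, the call would be `locked_hashes('Pipfile.lock', 'scikit-learn')`.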



## Q5 - Parametrize script with year and month 

Use the standard-library `argparse` module to add year and month arguments and parametrize the script.

See `starter.py`
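As a rough sketch of the parametrization, and not necessarily what `starter.py` does, the arguments could be declared as follows. The short flags `-y` and `-m` match the Dockerfile entrypoint in Q6; the long names `--year` and `--month`, the `parse_args` wrapper, and the month range check are assumptions made for this example.

```python
import argparse


def parse_args(argv=None):
    """Parse the year and month to score; argv=None falls back to sys.argv."""
    parser = argparse.ArgumentParser(
        description='Predict ride durations for one month of FHV trips.'
    )
    parser.add_argument('-y', '--year', type=int, required=True,
                        help='four-digit year, e.g. 2021')
    parser.add_argument('-m', '--month', type=int, required=True,
                        choices=range(1, 13), metavar='MONTH',
                        help='month number, 1 to 12')
    return parser.parse_args(argv)


# Simulate the command line: python starter.py -y 2021 -m 4
args = parse_args(['-y', '2021', '-m', '4'])

# Build the input location with the same naming pattern as the notebook.
input_file = (
    'https://nyc-tlc.s3.amazonaws.com/trip+data/'
    f'fhv_tripdata_{args.year:04d}-{args.month:02d}.parquet'
)
print(input_file)
```

Passing `argv` explicitly keeps the parser easy to exercise from a notebook or a test; in the script itself, `parse_args()` with no argument reads the real command line.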


## Q6 - Dockerfile

```dockerfile
FROM agrigorev/zoomcamp-model:mlops-3.9.7-slim
# how it was built:
# FROM python:3.9.7-slim

# WORKDIR /app
# COPY [ "model2.bin", "model.bin" ]
RUN pip install --upgrade pip

RUN pip install pipenv
WORKDIR /app

# model.bin already exists in /app in the base image, so only the project files are copied
COPY [ "Pipfile", "Pipfile.lock", "starter.py", "./"]
RUN pipenv install --system --deploy
ENTRYPOINT [ "python", "starter.py", "-y", "2021", "-m", "4"]
```