## Homework

---

**Note**: all the files, including `app.py`, `Dockerfile`, etc., are in `Homework/Deployment`.

---

> Note: sometimes your answer doesn't match one of the options exactly. 
> That's fine. 
> Select the option that's closest to your solution.

> Note: we recommend using python 3.11 in this homework.

In this homework, we will use the Bank Marketing dataset. Download it from [here](https://archive.ics.uci.edu/static/public/222/bank+marketing.zip).

You can do it with `wget`:

```bash
wget https://archive.ics.uci.edu/static/public/222/bank+marketing.zip
unzip bank+marketing.zip 
unzip bank.zip
```

We need `bank-full.csv`.

You can also access a copy of `bank-full.csv` directly:

```bash
wget https://github.com/alexeygrigorev/datasets/raw/refs/heads/master/bank-full.csv
```


## Question 1

* Install Pipenv
* What's the version of pipenv you installed?
* Use `--version` to find out

In [1]:
! pip install pipenv



In [2]:
! pipenv --version

pipenv, version 2024.2.0


Good blog on how pyenv and pipenv work together: https://prassanna.io/blog/2019-05-29-pipenv-pyenv/

I have since broken my own setup, but in principle you need to point pipenv at the pyenv Python:
`echo 'export PIPENV_PYTHON="$PYENV_ROOT/shims/python"' >> ~/.bashrc`

## Question 2

* Use Pipenv to install Scikit-Learn version 1.5.2
* What's the first hash for scikit-learn you get in Pipfile.lock?

> **Note**: you should create an empty folder for homework
and do it there. 

sha256:03b6158efa3faaf1feea3faa884c840ebd61b6484167c711548fce208ea09445
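
`Pipfile.lock` is a JSON file, so the first hash can also be read programmatically instead of by eye. A minimal sketch, assuming the package sits in the `default` section of the lock file; the helper name `first_hash` and the sample hashes are placeholders invented for this example:

```python
import json


def first_hash(lock_text, package, section="default"):
    """Return the first hash listed for `package` in a Pipfile.lock document."""
    lock = json.loads(lock_text)
    return lock[section][package]["hashes"][0]


# Tiny stand-in for a real Pipfile.lock; the hash values are placeholders.
sample_lock = json.dumps({
    "default": {
        "scikit-learn": {
            "hashes": ["sha256:aaaa", "sha256:bbbb"],
            "version": "==1.5.2",
        }
    }
})

print(first_hash(sample_lock, "scikit-learn"))
```

For the real file, pass its contents instead of the sample, for example `first_hash(open("Pipfile.lock").read(), "scikit-learn")`.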

## Models

We've prepared a dictionary vectorizer and a model.

They were trained (roughly) using this code:

```python
features = ['job', 'duration', 'poutcome']
dicts = df[features].to_dict(orient='records')

dv = DictVectorizer(sparse=False)
X = dv.fit_transform(dicts)

model = LogisticRegression().fit(X, y)
```

> **Note**: You don't need to train the model. This code is just for your reference.

And then saved with Pickle. Download them:

* [DictVectorizer](https://github.com/DataTalksClub/machine-learning-zoomcamp/tree/master/cohorts/2024/05-deployment/homework/dv.bin?raw=true)
* [LogisticRegression](https://github.com/DataTalksClub/machine-learning-zoomcamp/tree/master/cohorts/2024/05-deployment/homework/model1.bin?raw=true)

With `wget`:

```bash
PREFIX=https://raw.githubusercontent.com/DataTalksClub/machine-learning-zoomcamp/master/cohorts/2024/05-deployment/homework
wget $PREFIX/model1.bin
wget $PREFIX/dv.bin
```


## Question 3

Let's use these models!

* Write a script for loading these models with pickle
* Score this client:

```json
{"job": "management", "duration": 400, "poutcome": "success"}
```

What's the probability that this client will get a subscription? 

* 0.359
* 0.559
* 0.759 (yes)
* 0.959

If you're getting errors when unpickling the files, check their checksum:

```bash
$ md5sum model1.bin dv.bin
3d8bb28974e55edefa000fe38fd3ed12  model1.bin
7d37616e00aa80f2152b8b0511fc2dff  dv.bin
```
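
If `md5sum` is not available (for example on Windows), the same check can be done from Python with `hashlib`. This is a sketch only: the helper names `md5_of_file` and `check_checksums` are made up for this example, and the files are assumed to be in the current directory.

```python
import hashlib
import os

# Checksums quoted in the homework text above.
EXPECTED = {
    "model1.bin": "3d8bb28974e55edefa000fe38fd3ed12",
    "dv.bin": "7d37616e00aa80f2152b8b0511fc2dff",
}


def md5_of_file(path, chunk_size=65536):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_checksums(expected, directory="."):
    """Map each file name to 'ok', 'mismatch', or 'missing'."""
    report = {}
    for name, want in expected.items():
        path = os.path.join(directory, name)
        if not os.path.exists(path):
            report[name] = "missing"
        elif md5_of_file(path) == want:
            report[name] = "ok"
        else:
            report[name] = "mismatch"
    return report


print(check_checksums(EXPECTED))
```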


In [2]:
import pickle
import requests


In [4]:
# use a `with` block so that the file is closed automatically
with open("/home/svetlana/code/ml-zoomcamp-2024/Homework/Deployment/dv.bin", "rb") as f:
    dv = pickle.load(f)

In [5]:
with open("/home/svetlana/code/ml-zoomcamp-2024/Homework/Deployment/model1.bin", "rb") as f:
    model = pickle.load(f)

In [16]:
# code from lectures
def predict_single(customer, dv, model):
    X = dv.transform([customer])
    y_pred = model.predict_proba(X)[:, 1] #assuming get a subscription = 1
    return y_pred[0]

client = {"job": "management", "duration": 400, "poutcome": "success"}
predict_single(client, dv, model)


np.float64(0.7590966516879658)

## Question 4

Now let's serve this model as a web service

* Install Flask and gunicorn (or waitress, if you're on Windows)
* Write Flask code for serving the model
* Now score this client using `requests`:

```python
url = "YOUR_URL"
client = {"job": "student", "duration": 280, "poutcome": "failure"}
requests.post(url, json=client).json()
```

What's the probability that this client will get a subscription?

* 0.335 (yes)
* 0.535
* 0.735
* 0.935

```
pipenv install flask
pipenv install gunicorn
```

In [19]:

url = "http://127.0.0.1:9696/predict" #remember to add endpoint!!!
client = {"job": "student", "duration": 280, "poutcome": "failure"}
# requests.get(url) #200 #OK
response = requests.post(url, json=client)#.json()

# gpt: check the status code before decoding the JSON body
if response.status_code == 200:
    print(response.json())
else:
    print(f"Failed with status code {response.status_code}")

{'subscription': False, 'subscription_probability': 0.33480703475511053}


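Silly mistake on my side: the error `JSONDecodeError: Expecting value: line 1 column 1 (char 0)` means that the response body was not JSON at all. Here it happened because I forgot the `/predict` endpoint in the URL, so the server most likely answered with an error page instead of a prediction.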
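
The error can be reproduced without a server: decoding a non-JSON body fails at the very first character. A small sketch using only the standard library; `decode_prediction` is a hypothetical helper, and the HTML string is a made-up stand-in for whatever the server sends for an unknown route:

```python
import json


def decode_prediction(status_code, body_text):
    """Return the parsed JSON body, or a readable error message."""
    if status_code != 200:
        return f"Failed with status code {status_code}"
    try:
        return json.loads(body_text)
    except json.JSONDecodeError:
        return "Response body is not JSON"


# Made-up example of an error page returned for a wrong URL.
not_found_html = "<!doctype html><title>404 Not Found</title>"

print(decode_prediction(404, not_found_html))
print(decode_prediction(200, '{"subscription": false}'))
```

With `requests`, the two arguments would come from `response.status_code` and `response.text`.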

See the comments marked `gpt` in the code for suggestions from HuggingChat (for example, a `load_model` function that loads `dv` and `model`).
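
One possible shape for such a helper is sketched below. The name `load_model` comes from the note above, but its signature, the default file names, and the `load_pickle` helper are assumptions made for this example. Unpickling the real files also requires scikit-learn to be installed.

```python
import pickle


def load_pickle(path):
    """Load one pickled object; the `with` block closes the file afterwards."""
    with open(path, "rb") as f:
        return pickle.load(f)


def load_model(dv_path="dv.bin", model_path="model1.bin"):
    """Return the (vectorizer, model) pair stored in the two pickle files."""
    return load_pickle(dv_path), load_pickle(model_path)
```

In the Flask script this would be called once at start-up, for example `dv, model = load_model()`, so that the files are not re-read on every request.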

## Docker

Install [Docker](https://github.com/DataTalksClub/machine-learning-zoomcamp/blob/master/05-deployment/06-docker.md). 
We will use it for the next two questions.

For these questions, we prepared a base image: `svizor/zoomcamp-model:3.11.5-slim`. 
You'll need to use it (see Question 5 for an example).

This image is based on `python:3.11.5-slim` and has a logistic regression model 
(a different one) as well as a dictionary vectorizer inside. 

This is what the Dockerfile for this image looks like:

```docker 
FROM python:3.11.5-slim
WORKDIR /app
COPY ["model2.bin", "dv.bin", "./"]
```

We already built it and then pushed it to [`svizor/zoomcamp-model:3.11.5-slim`](https://hub.docker.com/r/svizor/zoomcamp-model).

> **Note**: You don't need to build this docker image, it's just for your reference.


## Question 5

Download the base image `svizor/zoomcamp-model:3.11.5-slim`. You can do it with the [docker pull](https://docs.docker.com/engine/reference/commandline/pull/) command.

So what's the size of this base image?

* 45 MB
* 130 MB (yes)
* 245 MB
* 330 MB

You can get this information when running `docker images` - it'll be in the "SIZE" column.


## Dockerfile

Now create your own Dockerfile based on the image we prepared.

It should start like this:

```docker
FROM svizor/zoomcamp-model:3.11.5-slim
# add your stuff here
```

Now complete it:

* Install all the dependencies from the Pipenv file
* Copy your Flask script
* Run it with Gunicorn 

After that, you can build your docker image.


**Note**: pipenv somehow picked up my `probabl` pyenv environment and got mixed up: it pulled in all the dependencies from there :-( I need to understand how to avoid this in the future. My first idea was to simply copy `Pipfile` and `Pipfile.lock` from the lecture, but the lecture uses Python 3.7, so that is a no-go. I will have to live with marimo and polars being dragged around. Luckily, `probabl` was not very full yet.

**Note 2**: Next time, pay attention to this pipenv message:

> Pipenv found itself running within a virtual environment, so it will automatically use that environment, instead of creating its own for any project. You can set PIPENV_IGNORE_VIRTUALENVS=1 to force pipenv to ignore that environment and create its own instead.

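When pipenv is launched from Python rather than from a shell, the variable from the message above can be passed through the child process environment. A sketch under that assumption; the helper name `pipenv_environment` is invented for this example:

```python
import os


def pipenv_environment(base_env=None):
    """Copy an environment mapping and tell pipenv to ignore any active virtualenv."""
    env = dict(os.environ if base_env is None else base_env)
    env["PIPENV_IGNORE_VIRTUALENVS"] = "1"
    return env


# Demonstration with a tiny made-up environment.
env = pipenv_environment({"PATH": "/usr/bin"})
print(env["PIPENV_IGNORE_VIRTUALENVS"])
```

It could then be used as `subprocess.run(["pipenv", "install"], env=pipenv_environment(), check=True)`. In a shell, prefixing the command with `PIPENV_IGNORE_VIRTUALENVS=1` has the same effect.
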
In [5]:
# testing gunicorn locally
# run with `gunicorn app:app`, i.e. <module name (app.py)>:<Flask app object (app in our case)>, not the predict() endpoint!
url = "http://127.0.0.1:8000/predict" #remember to add endpoint!!!
clients = [{"job": "student", "duration": 280, "poutcome": "failure"},
           {"job": "management", "duration": 400, "poutcome": "success"}]

# requests.get(url) #200 #OK
for client in clients:
    response = requests.post(url, json=client)#.json()

    if response.status_code == 200:
        print(response.json())
    else:
        print(f"Failed with status code {response.status_code}")

{'subscription': False, 'subscription_probability': 0.33480703475511053}
{'subscription': True, 'subscription_probability': 0.7590966516879658}


## Question 6

Let's run your docker container!

After running it, score this client once again:

```python
url = "YOUR_URL"
client = {"job": "management", "duration": 400, "poutcome": "success"}
requests.post(url, json=client).json()
```

What's the probability that this client will get a subscription now?

* 0.287
* 0.530
* 0.757
* 0.960




**Note** : they use ENTRYPOINT and not CMD. Why?

Official documentation: https://docs.docker.com/reference/dockerfile/#entrypoint 

Blogpost: https://codewithyury.com/docker-run-vs-cmd-vs-entrypoint/ 

In a nutshell:

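- `CMD` specifies the default command: it runs when the container is started without an explicit command, and it is replaced when `docker run` is given one.
- `ENTRYPOINT` makes the container behave like an executable: arguments passed to `docker run` are appended to it instead of replacing it. It can only be overridden explicitly with `docker run --entrypoint`.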

**Note on forms**

There are two forms to specify Docker instructions:
- `ENTRYPOINT ["executable", "param1", "param2"]` (exec form, preferred)
- `ENTRYPOINT command param1 param2` (shell form)

`ENTRYPOINT` will behave differently with these two forms!

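With the exec form, a `CMD` can be added on top to supply default parameters: both the `ENTRYPOINT` and the `CMD` parameters are used, and arguments passed to `docker run` replace the `CMD` part.

With the shell form, any `CMD` and any `docker run` arguments are simply ignored.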
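
The rules above can be written down as a small Python function. This is a simplified model for reasoning about them, not how Docker is implemented: `effective_command` is an invented name, `CMD` is modelled in exec form only, and a shell-form `ENTRYPOINT` is represented by passing a plain string.

```python
def effective_command(entrypoint=None, cmd=None, run_args=None):
    """Return the argument list a container would start with (simplified).

    entrypoint: list of strings (exec form) or one string (shell form).
    cmd:        list of strings from the image's CMD.
    run_args:   list of strings given after the image name in `docker run`.
    """
    # Arguments given to `docker run` replace the image's CMD.
    extra = list(run_args) if run_args else list(cmd or [])
    if entrypoint is None:
        return extra
    if isinstance(entrypoint, str):
        # Shell form: wrapped in a shell; CMD and run arguments are not used.
        return ["/bin/sh", "-c", entrypoint]
    # Exec form: ENTRYPOINT first, then CMD or the run arguments.
    return list(entrypoint) + extra


print(effective_command(["gunicorn"], ["--bind=0.0.0.0:9696", "app:app"]))
print(effective_command(["gunicorn"], ["--bind=0.0.0.0:9696", "app:app"], ["--version"]))
print(effective_command("gunicorn app:app", ["--version"]))
```

The first call combines `ENTRYPOINT` and `CMD`, the second shows run arguments replacing `CMD`, and the third shows the shell form dropping `CMD`.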


**Note on python version**

To solve the Python version problem, start not from the downloaded course image but from `python:3.X.X-slim`. See the FAQ.

Command to build image:

`docker build -t subscription-prediction .`





Command to run:

`docker run --rm -it subscription-prediction:latest`

**Note on pipenv**: my first Docker entrypoint failed because `gunicorn` was not found in `$PATH`. According to GPT, this is because the command does not run inside the pipenv virtual environment. The new entrypoint should call `pipenv run gunicorn` to work.

**Note on IP address of the container**

As per GPT, run:

- `docker ps` to verify that the container is running and the port is exposed, and to get its ID
- `docker inspect -f '{{range.NetworkSettings.Networks}}{{.IPAddress}}{{end}}' <container_id>` to get the IP address of the container


Other options (did not try):

- Run the container with the `-p` flag to map the container's port to a host port: `docker run -p 9696:9696 <image_name>`. This will allow you to access the container on http://localhost:9696.
- Use `docker logs` to check the container's logs for any errors or warnings: `docker logs -f <container_id>`.
- Try accessing the container using `curl` or a tool like `nc` (Netcat) to test the connection: `curl http://localhost:9696` or `nc localhost 9696`.
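
The connection test from the last bullet can also be done from the notebook itself. A minimal sketch with the standard `socket` module; `port_is_open` is a name invented for this example, and it only checks that something accepts TCP connections, not that the service answers correctly:

```python
import socket


def port_is_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


print(port_is_open("localhost", 9696))
```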

In [8]:
# testing docker locally
url = "http://172.17.0.3:9696/predict" #random IP address...
clients = [{"job": "student", "duration": 280, "poutcome": "failure"},
           {"job": "management", "duration": 400, "poutcome": "success"}]

# requests.get(url) #200 #OK
for client in clients:
    response = requests.post(url, json=client)#.json()

    if response.status_code == 200:
        print(response.json())
    else:
        print(f"Failed with status code {response.status_code}")

{'subscription': False, 'subscription_probability': 0.33480703475511053}
{'subscription': True, 'subscription_probability': 0.7590966516879658}
