In this homework, we will use the Bank Marketing dataset. Download it from [here](https://archive.ics.uci.edu/static/public/222/bank+marketing.zip).

You can do it with wget:
```bash
wget https://archive.ics.uci.edu/static/public/222/bank+marketing.zip
unzip bank+marketing.zip 
unzip bank.zip
```
We need `bank-full.csv`.

You can also download a copy of `bank-full.csv` directly:

`wget https://github.com/alexeygrigorev/datasets/raw/refs/heads/master/bank-full.csv`


In [1]:
import pandas as pd
import numpy as np
import pickle
import requests

In [2]:
path = "../3_classification/data/bank-full.csv"

In [3]:
df = pd.read_csv(path, delimiter=";")

In [4]:
df.head()

Unnamed: 0,age,job,marital,education,default,balance,housing,loan,contact,day,month,duration,campaign,pdays,previous,poutcome,y
0,58,management,married,tertiary,no,2143,yes,no,unknown,5,may,261,1,-1,0,unknown,no
1,44,technician,single,secondary,no,29,yes,no,unknown,5,may,151,1,-1,0,unknown,no
2,33,entrepreneur,married,secondary,no,2,yes,yes,unknown,5,may,76,1,-1,0,unknown,no
3,47,blue-collar,married,unknown,no,1506,yes,no,unknown,5,may,92,1,-1,0,unknown,no
4,33,unknown,single,unknown,no,1,no,no,unknown,5,may,198,1,-1,0,unknown,no


### Question 1

Install Pipenv

What's the version of pipenv you installed?

Use `--version` to find out

In [5]:
!pipenv --version

pipenv, version 2024.1.0


**Note**: a fresh installation installs version 2024.2.0.

### Question 2
Use Pipenv to install Scikit-Learn version 1.5.2.

What's the first hash for scikit-learn you get in `Pipfile.lock`?

Note: you should create an empty folder for the homework and do it there.

### Answer:
* Create a new folder `hw5` on the Desktop.
* `pip install pipenv`
* `pipenv install scikit-learn==1.5.2`

First hash for scikit-learn in `Pipfile.lock`: `sha256:03b6158efa3faaf1feea3faa884c840ebd61b6484167c711548fce208ea09445`
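
Instead of reading the hash by eye, it can be pulled out of `Pipfile.lock` programmatically, since the lock file is plain JSON. The following is a minimal sketch; the helper name `first_hash` and its `section` parameter are illustrative assumptions, not part of the homework.

```python
import json


def first_hash(lock_path, package, section="default"):
    """Return the first hash listed for `package` in a Pipfile.lock file.

    `section` is "default" for regular dependencies and "develop" for
    dev dependencies, following the layout Pipenv writes to the lock file.
    """
    with open(lock_path) as f:
        lock = json.load(f)
    # Each package entry carries a "hashes" list; take its first element
    return lock[section][package]["hashes"][0]
```

Run from the `hw5` folder, `first_hash("Pipfile.lock", "scikit-learn")` is expected to return the same value as the one quoted above.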

### Models
We've prepared a dictionary vectorizer and a model.

They were trained (roughly) using this code:
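```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

features = ['job', 'duration', 'poutcome']
dicts = df[features].to_dict(orient='records')

# Assumption: the target is the `y` column, 1 for "yes" and 0 for "no" (not shown in the original snippet)
y = (df['y'] == 'yes').astype(int)

dv = DictVectorizer(sparse=False)
X = dv.fit_transform(dicts)

model = LogisticRegression().fit(X, y)
```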
Note: You don't need to train the model. This code is just for your reference.

The vectorizer and the model were then saved with Pickle. Download them:

* [DictVectorizer](https://github.com/DataTalksClub/machine-learning-zoomcamp/tree/master/cohorts/2024/05-deployment/homework/dv.bin?raw=true)
* [LogisticRegression](https://github.com/DataTalksClub/machine-learning-zoomcamp/tree/master/cohorts/2024/05-deployment/homework/model1.bin?raw=true)

With `wget`:
```bash
PREFIX=https://raw.githubusercontent.com/DataTalksClub/machine-learning-zoomcamp/master/cohorts/2024/05-deployment/homework
wget $PREFIX/model1.bin
wget $PREFIX/dv.bin
```

### Question 3
Let's use these models!

* Write a script for loading these models with pickle
* Score this client:
```{"job": "management", "duration": 400, "poutcome": "success"}```
What's the probability that this client will get a subscription?

* 0.359
* 0.559
* 0.759
* 0.959

If you're getting errors when unpickling the files, check their checksums:
```
$ md5sum model1.bin dv.bin
3d8bb28974e55edefa000fe38fd3ed12  model1.bin
7d37616e00aa80f2152b8b0511fc2dff  dv.bin
```
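
On systems without `md5sum` (Windows, for example), the same check can be done from Python with the standard library. This is a sketch only: the helper names `md5_of` and `check_files` are made up for illustration, and the expected digests are copied from the listing above.

```python
import hashlib

# Expected digests, copied from the md5sum listing in this notebook
EXPECTED_MD5 = {
    "model1.bin": "3d8bb28974e55edefa000fe38fd3ed12",
    "dv.bin": "7d37616e00aa80f2152b8b0511fc2dff",
}


def md5_of(path, chunk_size=8192):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_files(expected=EXPECTED_MD5):
    """Return {path: True/False}, True when the file matches its expected digest."""
    return {path: md5_of(path) == md5 for path, md5 in expected.items()}
```

Calling `check_files()` in the folder that holds both files returns a dictionary with one boolean per file; a `False` entry suggests the download is corrupted or incomplete.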

In [6]:
with open("dv.bin", 'rb') as d:
    dv = pickle.load(d)
with open("model1.bin", 'rb') as m:
    model = pickle.load(m)

https://scikit-learn.org/stable/model_persistence.html#security-maintainability-limitations


In [7]:
client = {"job": "management", "duration": 400, "poutcome": "success"}
dv_client = dv.transform(client)
pred = model.predict_proba(dv_client)
print(round(pred[0,1], 3))

0.759


### Question 4
Now let's serve this model as a web service

* Install Flask and gunicorn (or waitress, if you're on Windows)
* Write Flask code for serving the model
* Now score this client using `requests`:
```
url = "YOUR_URL"
client = {"job": "student", "duration": 280, "poutcome": "failure"}
requests.post(url, json=client).json()
```
What's the probability that this client will get a subscription?

* 0.335
* 0.535
* 0.735
* 0.935


### Answer using Flask
In a new VS Code window for the `hw5` folder (separate from the module 5 environment and folder):
* Create the Flask app in `hw5_predict.py`.
* `pipenv install flask gunicorn requests`.
* `pipenv shell`.
* `python hw5_predict.py` to start the Flask app and serve the model.
* Then run the next code cell, replacing the URL with the actual one.
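
The contents of `hw5_predict.py` are not shown in this notebook. A minimal sketch of what such an app could look like is given here for reference. The `/predict` route and the `subscription_probability` key are taken from the request cell and its output in this notebook; the function names `load_artifacts` and `create_app` and the app name are assumptions made for illustration.

```python
import pickle

from flask import Flask, jsonify, request


def load_artifacts(dv_path="dv.bin", model_path="model1.bin"):
    """Load the pickled DictVectorizer and model from disk."""
    with open(dv_path, "rb") as f_dv:
        dv = pickle.load(f_dv)
    with open(model_path, "rb") as f_model:
        model = pickle.load(f_model)
    return dv, model


def create_app(dv, model):
    """Build a Flask app that scores one client per POST request."""
    app = Flask("subscription")  # app name is an arbitrary choice

    @app.route("/predict", methods=["POST"])
    def predict():
        client = request.get_json()
        X = dv.transform([client])
        # Column 1 holds the probability of the positive class
        probability = float(model.predict_proba(X)[0][1])
        return jsonify({"subscription_probability": probability})

    return app
```

To serve it, the file would additionally need a module-level `app = create_app(*load_artifacts())`, so that the Gunicorn target `hw5_predict:app` resolves, and a call such as `app.run(host="0.0.0.0", port=9696)` under an `if __name__ == "__main__":` guard for running it with `python hw5_predict.py`.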

In [8]:
url = "http://127.0.0.1:9696/predict"
client = {"job": "student", "duration": 280, "poutcome": "failure"}
requests.post(url, json=client).json()

{'subscription_probability': 0.33480703475511053}

### Alternative: cross-check the answer within this notebook

In [9]:
client2 = {"job": "student", "duration": 280, "poutcome": "failure"}
dv_client2 = dv.transform(client2)
pred = model.predict_proba(dv_client2)
print(round(pred[0,1], 3))

0.335


### Docker
Install Docker. We will use it for the next two questions.

For these questions, we prepared a base image: `svizor/zoomcamp-model:3.11.5-slim`. You'll need to use it (see Question 5 for an example).

This image is based on `python:3.11.5-slim` and has a logistic regression model (a different one) as well as a dictionary vectorizer inside.

This is what the Dockerfile for this image looks like:
```
FROM python:3.11.5-slim
WORKDIR /app
COPY ["model2.bin", "dv.bin", "./"]
```
We already built it and then pushed it to [`svizor/zoomcamp-model:3.11.5-slim`](https://hub.docker.com/r/svizor/zoomcamp-model).

Note: You don't need to build this Docker image; it's just for your reference.


### Question 5
Download the base image `svizor/zoomcamp-model:3.11.5-slim`. You can do this with the `docker pull` command.

So what's the size of this base image?

* 45 MB
* 130 MB
* 245 MB
* 330 MB

You can get this information when running `docker images` - it'll be in the "SIZE" column.



### Answer:
* Execute `docker pull svizor/zoomcamp-model:3.11.5-slim`
* Then `docker images`
  
Answer: `svizor/zoomcamp-model                           3.11.5-slim   975e7bdca086   4 days ago     130MB`

### Dockerfile
Now create your own Dockerfile based on the image we prepared.

It should start like this:
```
FROM svizor/zoomcamp-model:3.11.5-slim
# add your stuff here
```
Now complete it:

* Install all the dependencies from the Pipenv file
* Copy your Flask script
* Run it with Gunicorn

After that, you can build your docker image.



### Answer for Dockerfile
```
FROM svizor/zoomcamp-model:3.11.5-slim

RUN pip install pipenv

# Use /app as the working directory (the base image already places model2.bin and dv.bin there)
WORKDIR /app

# Copy the dependency files into the working directory
COPY ["Pipfile", "Pipfile.lock", "./"]

# Install the locked dependencies.
# --system: install into the system Python instead of creating a virtualenv
# --deploy: fail the build if Pipfile.lock is out of date
RUN pipenv install --deploy --system

# Copy the Python scripts; the model and the vectorizer already come with the base image
COPY ["*.py", "./"]

# Document that the service listens on port 9696 (it still has to be published with -p at run time)
EXPOSE 9696

# Start the prediction service with Gunicorn when the container runs
ENTRYPOINT ["gunicorn", "--bind", "0.0.0.0:9696", "hw5_predict:app"]
```
* Run `docker build -t hw5_predict .` to build the Docker image.

### Question 6
Let's run your docker container!

After running it, score this client once again:
```
url = "YOUR_URL"
client = {"job": "management", "duration": 400, "poutcome": "success"}
requests.post(url, json=client).json()
```
What's the probability that this client will get a subscription now?

* 0.287
* 0.530
* 0.757
* 0.960


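**NOTE**: The Flask app must load `model2.bin` (the model shipped in the base image) instead of `model1.bin`; rebuild the image after making this change.

**NOTE**: Start the container with `docker run -it -p 9696:9696 --name hw5 hw5_predict`.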

In [10]:
url = "http://127.0.0.1:9696/predict"
client = {"job": "management", "duration": 400, "poutcome": "success"}
requests.post(url, json=client).json()

{'subscription_probability': 0.756743795240796}