# Homework 5
In this homework, we're going to continue working with the lead scoring dataset.

## Question 1
* Install `uv`
* What's the version of uv you installed?
* Use `--version` to find out

In [1]:
!uv --version

uv 0.9.7 (0adb44480 2025-10-30)


### Initialize an empty uv project
Create an empty folder for the homework and run the command there.

In [2]:
!uv init

Initialized project `homework-5`


In [1]:
!uv add scikit-learn==1.6.1

Resolved 6 packages in 211ms
Installed 5 packages in 148ms
 + joblib==1.5.2
 + numpy==2.3.4
 + scikit-learn==1.6.1
 + scipy==1.16.3
 + threadpoolctl==3.6.0


The first hash listed for scikit-learn in `uv.lock` is
`sha256:b4fc2525eca2c69a59260f583c56a7557c6ccdf8deafdba6e060f94c1c59738e`
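
To read that hash without opening the lock file by hand, a small text scan can help. The sketch below assumes `uv.lock` lists each package under a `[[package]]` entry with a `name = "..."` line followed by `hash = "sha256:..."` fields. The helper name `first_hash` and the sample text (with shortened placeholder hashes and placeholder URLs) are illustrative and not part of uv.

```python
import re
from typing import Optional

# Matches a hash field such as: hash = "sha256:abc123..."
HASH_RE = re.compile(r'hash = "(sha256:[0-9a-f]+)"')


def first_hash(lock_text: str, package: str) -> Optional[str]:
    """Return the first sha256 hash listed under the given package entry."""
    in_package = False
    for line in lock_text.splitlines():
        stripped = line.strip()
        if stripped == "[[package]]":
            in_package = False  # a new entry starts; wait for its name line
        elif stripped == f'name = "{package}"':
            in_package = True
        elif in_package:
            match = HASH_RE.search(stripped)
            if match:
                return match.group(1)
    return None


# Placeholder lock text (assumed layout); real entries carry full-length hashes.
SAMPLE_LOCK = '''
[[package]]
name = "scipy"
sdist = { url = "https://example.invalid/scipy.tar.gz", hash = "sha256:0000aaaa" }

[[package]]
name = "scikit-learn"
sdist = { url = "https://example.invalid/sklearn.tar.gz", hash = "sha256:1111bbbb" }
wheels = [
    { url = "https://example.invalid/sklearn.whl", hash = "sha256:2222cccc" },
]
'''

print(first_hash(SAMPLE_LOCK, "scikit-learn"))  # prints sha256:1111bbbb
```

For the real project, the same helper could be called on the contents of `uv.lock`, for example `first_hash(open("uv.lock").read(), "scikit-learn")`.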

### Models
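We have prepared a pipeline with a dictionary vectorizer and a model.
It was trained (roughly) using this code:
```
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# df and y_train come from the lead scoring dataset (loading not shown)
categorical = ['lead_source']
numeric = ['number_of_courses_viewed', 'annual_income']

# Fill missing values: 'NA' for categories, 0 for numbers
df[categorical] = df[categorical].fillna('NA')
df[numeric] = df[numeric].fillna(0)

# One dict per row, as expected by DictVectorizer
train_dict = df[categorical + numeric].to_dict(orient='records')

pipeline = make_pipeline(
    DictVectorizer(),
    LogisticRegression(solver='liblinear')
)

pipeline.fit(train_dict, y_train)
```
It was then saved with pickle. Download it: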

In [2]:
!curl -L -O https://github.com/DataTalksClub/machine-learning-zoomcamp/raw/refs/heads/master/cohorts/2025/05-deployment/pipeline_v1.bin

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100  1300  100  1300    0     0   3028      0 --:--:-- --:--:-- --:--:--  3028
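
To make the dictionary vectorizer step in the pipeline more concrete, here is a plain-Python sketch of the encoding idea: string features become one-hot columns named `feature=value`, and numeric features pass through unchanged. It only mimics the behaviour for illustration. The `vectorize` helper and the two sample records are assumptions, and the real pipeline uses scikit-learn's `DictVectorizer`.

```python
def vectorize(records):
    """One-hot encode string values and pass numeric values through."""
    # Collect column names: "key=value" for strings, "key" for numbers
    names = set()
    for rec in records:
        for key, value in rec.items():
            names.add(f"{key}={value}" if isinstance(value, str) else key)
    columns = sorted(names)
    index = {name: i for i, name in enumerate(columns)}

    rows = []
    for rec in records:
        row = [0.0] * len(columns)
        for key, value in rec.items():
            if isinstance(value, str):
                row[index[f"{key}={value}"]] = 1.0
            else:
                row[index[key]] = float(value)
        rows.append(row)
    return columns, rows


# Illustrative records; the second one shows the 'NA' / 0 fill values
records = [
    {"lead_source": "paid_ads", "number_of_courses_viewed": 2, "annual_income": 79276.0},
    {"lead_source": "NA", "number_of_courses_viewed": 0, "annual_income": 0},
]
columns, rows = vectorize(records)
print(columns)
for row in rows:
    print(row)
```

Each record ends up as one numeric row with the same column order, which is the shape the logistic regression expects.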


## Question 3
Let's use the model
* Write a script for loading the pipeline with pickle
* Score this record

In [3]:
import pickle
model_file = 'pipeline_v1.bin'
with open(model_file, mode = 'rb') as f_in:
    pipeline = pickle.load(f_in)

In [4]:
record = {
    "lead_source": "paid_ads",
    "number_of_courses_viewed": 2,
    "annual_income": 79276.0
}

score = pipeline.predict_proba(record)[0,1]
print(f'the score is {score:.3f}')

the score is 0.534


## Question 4
Now let's serve this model as a web service
* Install FastAPI
* Write FastAPI code for serving the model
* Now score this client using `requests`:
```
url = "YOUR_URL"
client = {
    "lead_source": "organic_search",
    "number_of_courses_viewed": 4,
    "annual_income": 80304.0
}
requests.post(url, json=client).json()
```

What's the probability that this client will get a subscription?

In [5]:
!uv add fastapi uvicorn

Resolved 21 packages in 257ms

In [8]:
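from fastapi import FastAPI
import uvicorn
import pickle

app = FastAPI(title='lead_score')

# Load the trained pipeline once, when the module is imported
with open('pipeline_v1.bin', mode='rb') as f_in:
    pipeline = pickle.load(f_in)

def predict_single(client):
    # Probability of the positive class for a single record
    res = pipeline.predict_proba(client)[0, 1]
    return float(res)

# Annotating `client` as dict makes FastAPI read it from the JSON body;
# without an annotation it would be looked up in the query string
@app.post("/predict")
def predict(client: dict):
    prob = predict_single(client)
    return {
        "prob_susbcription": prob,
        "subscription": bool(prob >= 0.5)
    }

# if __name__ == "__main__":
#     uvicorn.run(app, host="0.0.0.0", port=9696)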

In [6]:
!uv add --dev requests

Resolved 25 packages in 129ms
Installed 4 packages in 11ms
 + certifi==2025.10.5
 + charset-normalizer==3.4.4
 + requests==2.32.5
 + urllib3==2.5.0


In [8]:
import requests
url = "http://localhost:9696/predict"

client = {
    "lead_source": "organic_search",
    "number_of_courses_viewed": 4,
    "annual_income": 80304.0
}

response = requests.post(url=url, json=client)
pred = response.json()

print(pred)
if pred['subscription']:
    print("The client will likely get a subscription")
else:
    print("The client is unlikely to get a subscription")

{'prob_susbcription': 0.5340417283801275, 'subscription': True}
The client will likely get a subscription


## Docker
For these questions, we prepared a base image: `agrigorev/zoomcamp-model:2025`. You'll need to use it.
This image is based on `python:3.13.5-slim-bookworm` and contains a pipeline with a dictionary vectorizer and a logistic regression model (a different one).
This is what the Dockerfile for this image looks like:
```
FROM python:3.13.5-slim-bookworm
WORKDIR /code
COPY pipeline_v2.bin .
```

## Question 5
Download the base image.
So what's the size of the base image?

In [12]:
!docker pull agrigorev/zoomcamp-model:2025

2025: Pulling from agrigorev/zoomcamp-model
Digest: sha256:14d79fde0bbf078eb18c99c2bd007205917b758ec11060b2994963a1e485c2ae
Status: Image is up to date for agrigorev/zoomcamp-model:2025
docker.io/agrigorev/zoomcamp-model:2025

What's next:
    View a summary of image vulnerabilities and recommendations → docker scout quickview agrigorev/zoomcamp-model:2025


In [13]:
!docker image inspect agrigorev/zoomcamp-model:2025

[
    {
        "Id": "sha256:14d79fde0bbf078eb18c99c2bd007205917b758ec11060b2994963a1e485c2ae",
        "RepoTags": [
            "agrigorev/zoomcamp-model:2025"
        ],
        "RepoDigests": [
            "agrigorev/zoomcamp-model@sha256:14d79fde0bbf078eb18c99c2bd007205917b758ec11060b2994963a1e485c2ae"
        ],
        "Parent": "",
        "Comment": "buildkit.dockerfile.v0",
        "Created": "2025-10-21T07:58:31.344794708Z",
        "DockerVersion": "",
        "Author": "",
        "Architecture": "amd64",
        "Os": "linux",
        "Size": 44332815,
        "GraphDriver": {
            "Data": null,
            "Name": "overlayfs"
        },
        "RootFS": {
            "Type": "layers",
            "Layers": [
                "sha256:7cc7fe68eff66f19872441a51938eecc4ad33746d2baa3abc081c1e6fe25988e",
                "sha256:9621f68f1f5ddfd1fa67faa2a5c513986b9d4b015f06999152893fc9bcefb093",
                "sha256:547f1fc1a2bb7ad4fbecb26c04

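The `Size` field reported by `docker image inspect` is given in bytes. The conversion below is a small sketch that uses a hypothetical helper, `image_size_mb`, and a trimmed-down copy of the JSON above. It assumes megabytes of 10**6 bytes, which may not be the unit the answer options use.

```python
import json


def image_size_mb(inspect_json: str) -> float:
    """Return the Size of the first inspected image in megabytes (10**6 bytes)."""
    images = json.loads(inspect_json)  # docker image inspect prints a JSON array
    return images[0]["Size"] / 1_000_000


# Trimmed copy of the inspect output; only the fields used here are kept
sample = '[{"RepoTags": ["agrigorev/zoomcamp-model:2025"], "Size": 44332815}]'
print(f"{image_size_mb(sample):.1f} MB")  # prints 44.3 MB
```
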
## Dockerfile
Now create your own `Dockerfile` based on the image we prepared:

* Install all dependencies from `pyproject.toml`
* Copy your FastAPI script
* Run it with uvicorn

After that, you can build the image.

```
FROM agrigorev/zoomcamp-model:2025

# Copy the 'uv' and 'uvx' executables from the latest uv image into /bin/ in this image
# 'uv' is a fast Python package installer and environment manager
COPY --from=ghcr.io/astral-sh/uv:latest /uv /uvx /bin/

# Set the working directory to /code (as in the base image) so the entrypoint path below resolves
WORKDIR /code

# Put the uv-managed virtual environment first on PATH
ENV PATH="/code/.venv/bin:$PATH"

# copy project data
COPY "pyproject.toml" "uv.lock" ".python-version" "pipeline_v1.bin" "predict.py" ./

# Install dependencies
RUN uv sync --locked

# expose port 9696 set in the FastAPI
EXPOSE 9696

# Set up entrypoint
COPY "runapp.sh" ./
RUN chmod +x runapp.sh

ENTRYPOINT ["/code/runapp.sh"]
```

`docker build -f Dockerfile -t lead_score:2025 .`

`docker run -t --rm -p 9696:9696 lead_score:2025`

## Pydantic and Validation


```
categorical = ['lead_source']
numeric = ['number_of_courses_viewed', 'annual_income']

df[categorical] = df[categorical].fillna('NA')
df[numeric] = df[numeric].fillna(0)

train_dict = df[categorical + numeric].to_dict(orient='records')

pipeline = make_pipeline(
    DictVectorizer(),
    LogisticRegression(solver='liblinear')
)

pipeline.fit(train_dict, y_train)
```

In [15]:
import pandas as pd
data = pd.read_csv('lead_score.csv')
categorical = ['lead_source']
numerical = ['number_of_courses_viewed', 'annual_income']

for c in categorical:
    print(f'{data[c].value_counts()}\n')
for n in numerical:
    print(f'{data[n].describe()}\n')

lead_source
organic_search    282
social_media      278
paid_ads          264
referral          260
events            250
Name: count, dtype: int64

count    1462.000000
mean        2.031464
std         1.449717
min         0.000000
25%         1.000000
50%         2.000000
75%         3.000000
max         9.000000
Name: number_of_courses_viewed, dtype: float64

count      1281.000000
mean      59886.273224
std       15070.140389
min       13929.000000
25%       49698.000000
50%       60148.000000
75%       69639.000000
max      109899.000000
Name: annual_income, dtype: float64
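
Based on the value counts and ranges above, a request schema could restrict `lead_source` to the categories seen in the data (plus the `'NA'` fill value used in training) and require non-negative numbers. This is a minimal sketch assuming Pydantic is installed (FastAPI depends on it). The class name `Lead` and the exact constraints are illustrative choices, not part of the homework.

```python
from typing import Literal

from pydantic import BaseModel, Field, ValidationError


class Lead(BaseModel):
    # Categories observed in the dataset, plus the 'NA' fill value
    lead_source: Literal[
        "organic_search", "social_media", "paid_ads", "referral", "events", "NA"
    ]
    # Course counts and incomes cannot be negative
    number_of_courses_viewed: int = Field(ge=0)
    annual_income: float = Field(ge=0)


# A record that satisfies every constraint
valid = Lead(lead_source="paid_ads", number_of_courses_viewed=2, annual_income=79276.0)
print(valid.lead_source, valid.number_of_courses_viewed, valid.annual_income)

# An unknown category and a negative count are both rejected
try:
    Lead(lead_source="billboard", number_of_courses_viewed=-1, annual_income=50000.0)
except ValidationError as err:
    print(f"rejected with {len(err.errors())} validation errors")
```

Annotating the endpoint's `client` parameter with such a model would let FastAPI validate incoming requests and answer invalid ones with a 422 response; the model would then need to be converted back to a dict (for example with `model_dump()` in Pydantic v2) before being passed to the pipeline.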

