# Capstone Project

## Background

Time really flies: you have already reached the last sprint of the second module of the course, and you should be proud of yourself. Over the past three sprints you have been building data engineering skills. By now you should know what good Python code looks like, why OOP is used, how to structure a Python project, how to work with SQL, and how to develop and deploy a web application. These skills let you build projects that go beyond data analysis and modeling and make your findings accessible to other people.

Now it is time to bring everything you have learned together and complete the second capstone project of the course. In this project you will create a Python package, collect a dataset by web scraping, train a model, and deploy it so that others can reach it.

Most importantly, you will create the whole end-to-end (E2E) machine learning plan: define the problem, collect a dataset, train a model, evaluate it, and deploy it. By completing this project, you will strengthen your data engineering skills and prove to yourself and others that you are capable of planning and executing data science projects.

<div style="text-align: center;">
<img src="https://miro.medium.com/max/700/1*x7P7gqjo8k2_bj2rTQWAfg.jpeg" width="300px"/>
</div>

---

## Requirements
The capstone project requires you to carry out a full E2E machine learning project. Here is what you have to complete:

### Define the problem you want to solve
In this part you select a problem. You can choose from these topics: text classification, price prediction, or item category classification. During the second module of the course you saw a few examples of datasets that could be used to solve these problems (eBay listings, Reddit posts, Twitter tweets). In this stage you have to:
- Define the problem and create a short presentation
- Explain what you want to solve and what the potential value of your solution is
- Define the data source you will collect data from

### Collecting data
During this stage, you will need to create a Python package that can scrape a specific website. During the second module you saw many examples of functions that take a few arguments (`keywords`, `number of samples`, etc.) and return a pandas `DataFrame`. Now you will need to turn this functionality into a Python package that is installable through `pip`.
- Create a Python package that can scrape a specific webpage
- The package should be installable through `pip`
- The package should meet all expected Python package standards: clean code, tests, and documentation
- Collect and process a dataset using the package you created
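
To illustrate the "clean code, tests" requirement, here is a minimal sketch of the kind of small, testable helper such a package might expose. The names `parse_number` and `test_parse_number` are hypothetical, and the sketch assumes the scraped strings use `.` as a thousands separator (as on a Dutch listing site).

```python
import re
from typing import Optional


def parse_number(raw: str) -> Optional[int]:
    """Return the first integer found in a scraped string, or None.

    Assumes "." is a thousands separator, e.g. "€ 1.450,-" becomes 1450.
    """
    match = re.search(r"\d+", raw.replace(".", ""))
    return int(match.group()) if match else None


def test_parse_number() -> None:
    # A test in the style pytest would collect (any test runner works).
    assert parse_number("\n€ 1.450,-\n") == 1450
    assert parse_number("\n75.331 km\n") == 75331
    assert parse_number("\n-/- (Vorige eigenaren)\n") is None
```

Keeping parsing logic in pure functions like this makes it possible to test the package without hitting the website.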

### Training and saving the model
During this step, you will need to use the data you collected to train, test, and save a machine learning model. Do not spend much time on this step; just make sure that:
- A suitable machine learning algorithm is selected
- The model is successfully trained (remember the first module of the course)
- The model is saved for later deployment
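
The three points above could be sketched as follows. This is only an illustration under stated assumptions: it uses scikit-learn's `LinearRegression` as an example algorithm, the numbers are made-up placeholder data rather than a real dataset, and `model.pkl` is a hypothetical file name.

```python
import pickle
from pathlib import Path

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Made-up placeholder data: [mileage_km, age_years] -> price.
X = [[4000, 17], [75000, 29], [30000, 24], [900, 9],
     [13000, 17], [52000, 12], [8000, 3], [21000, 6]]
y = [700, 950, 1450, 1650, 1950, 2400, 5200, 3900]

# Hold out part of the data so the model is evaluated on unseen rows.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

model = LinearRegression().fit(X_train, y_train)
mae = mean_absolute_error(y_test, model.predict(X_test))
print(f"MAE on held-out data: {mae:.1f}")

# Save the fitted model so the web application can load it later.
MODEL_PATH = Path("model.pkl")  # hypothetical file name
with MODEL_PATH.open("wb") as f:
    pickle.dump(model, f)

# Loading it back is what the API would do at start-up.
with MODEL_PATH.open("rb") as f:
    restored = pickle.load(f)
```

A pickled model should be loaded with the same library versions it was saved with, so it is worth pinning them in the project's requirements.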

### Creating an API for the trained model
You have done this step at least a couple of times already. You will need to create an API using Flask. While creating the application you will need to:
- Load the trained model
- Create an inference pipeline
- Create a `POST` route that passes the request to the model and sends its outputs back as the response
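
A minimal sketch of such a route is shown below. The route name `/predict`, the `inputs` key in the JSON body, and the `DummyModel` class are assumptions made for illustration; in a real application `DummyModel` would be replaced by the trained model loaded from disk.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)


class DummyModel:
    """Hypothetical stand-in so the example runs without a model file."""

    def predict(self, rows):
        # Placeholder rule: the "prediction" is just the sum of the features.
        return [float(sum(row)) for row in rows]


# In a real application this would be the unpickled trained model.
model = DummyModel()


@app.route("/predict", methods=["POST"])
def predict():
    # silent=True returns None instead of raising on a malformed body.
    payload = request.get_json(silent=True)
    if not payload or "inputs" not in payload:
        return jsonify({"error": "expected a JSON body with an 'inputs' list"}), 400
    try:
        predictions = model.predict(payload["inputs"])
    except Exception as exc:  # report bad input instead of a 500 error
        return jsonify({"error": str(exc)}), 400
    return jsonify({"predictions": [float(p) for p in predictions]})
```

The route can be exercised without starting a server through Flask's test client, for example `app.test_client().post("/predict", json={"inputs": [[4000, 17]]})`.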

### Tracking the model's predictions
Next you will need to enable tracking of the model's predictions. During this step, you will connect your Flask application to a PostgreSQL database hosted by Heroku and store the model's inputs and outputs in one table:
- Create a PostgreSQL database hosted by Heroku
- Create a table for tracking predictions, with columns for the model's inputs and outputs
- On every request to the model, insert the required values into the database
- Create a new route in the Flask application that returns the 10 most recent requests and responses in JSON format
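
The logging and "10 most recent" logic can be sketched like this. Note the loud assumption: the example uses the standard-library `sqlite3` module as a stand-in so that it runs anywhere, not PostgreSQL. With PostgreSQL (for example through `psycopg2`) the placeholders would be `%s` instead of `?` and the column types would differ. The table name `predictions` and the function names are hypothetical.

```python
import json
import sqlite3

# sqlite3 is only a stand-in for the Heroku-hosted PostgreSQL database.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE predictions (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        created_at TEXT DEFAULT CURRENT_TIMESTAMP,
        inputs TEXT NOT NULL,
        outputs TEXT NOT NULL
    )
    """
)


def log_prediction(inputs, outputs) -> None:
    """Store one request/response pair as JSON text."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO predictions (inputs, outputs) VALUES (?, ?)",
            (json.dumps(inputs), json.dumps(outputs)),
        )


def recent_predictions(limit: int = 10) -> list:
    """Return the most recent rows, newest first, as JSON-ready dicts."""
    rows = conn.execute(
        "SELECT inputs, outputs FROM predictions ORDER BY id DESC LIMIT ?",
        (limit,),
    ).fetchall()
    return [
        {"inputs": json.loads(i), "outputs": json.loads(o)} for i, o in rows
    ]
```

In the Flask application, `log_prediction` would be called inside the prediction route, and a second route would return `recent_predictions()` as JSON.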

### Deploying the application
After completing all the steps above, you will need to deploy your application to Heroku by following the steps provided in the fourth lesson of this sprint.
- Make sure all secrets and passwords are set as environment variables in Heroku
- Deploy the application to Heroku
- Ensure that your application is accessible (provide a link to it)
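
One way to keep secrets out of the code is to read them from the environment at start-up, as in the sketch below. The helper name `require_env` is hypothetical. `DATABASE_URL` is the config variable that the Heroku Postgres add-on normally sets; any other variable names you use are your own choice.

```python
import os


def require_env(name: str) -> str:
    """Return an environment variable, failing early if it is missing."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"Missing required environment variable: {name}")
    return value


# Typical use at application start-up (commented out so the sketch runs
# even when the variable is not set locally):
# database_url = require_env("DATABASE_URL")
```

Failing at start-up with a clear message is usually easier to debug than a connection error on the first request.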

---

## Evaluation criteria
- All requirements are met
- The project is well thought out, and the defined problem is clearly presented
- The model works and makes predictions that make sense
- The code is clear and clean, and follows the PEP 8 style guide

In [1]:
!pip install beautifulsoup4



In [None]:
from bs4 import BeautifulSoup
import requests
import pandas as pd

In [None]:
headers = {"User-Agent": "Mozilla/5.0"}

URL = "https://www.autoscout24.nl/aanbod/honda-others-crf250r-ralley-benzine-rood-150226c1-6304-46fb-928e-3cd6da1db945?cldtidx=6&cldtsrc=listPage&searchId=b9380bcf"
page = requests.get(URL, headers=headers)

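soup = BeautifulSoup(page.content, "html.parser")
# Variable names match what each selector targets: the price block and the
# "key fact" spans (these class names are specific to this site).
price = soup.find("div", class_="cldt-price")
keyfact = soup.find("span", class_="cldt-stage-primary-keyfact")
keyfact_large = soup.find("span", class_="sc-font-l cldt-stage-primary-keyfact")
# find() returns None when nothing matches, so guard before reading .text.
for element in (price, keyfact, keyfact_large):
    print(element.text.strip() if element is not None else "not found")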

In [87]:
headers = {"User-Agent": "Mozilla/5.0"}

URL = "https://www.autoscout24.nl/aanbod/honda-others-crf250r-ralley-benzine-rood-150226c1-6304-46fb-928e-3cd6da1db945?cldtidx=6&cldtsrc=listPage&searchId=b9380bcf"
page = requests.get(URL, headers=headers)

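# Re-parse the freshly downloaded page; otherwise `soup` still holds
# whatever an earlier cell left behind.
soup = BeautifulSoup(page.content, "html.parser")
keyfacts = [value.text for value in soup.find_all("span", class_="cldt-stage-primary-keyfact")]
print(keyfacts)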


['3.221 km', '02/2020', '18 kW', '24 PK', '3.221 km', '02/2020', '18 kW', '24 PK']


In [72]:
headers = {"User-Agent": "Mozilla/5.0"}

URL = "https://www.autoscout24.nl/lst-moto?sort=standard&desc=0&ustate=N%2CU&size=20&page=2"
page = requests.get(URL, headers=headers)

soup = BeautifulSoup(page.content, "html.parser")
# Collect the relative links to each listing's detail page.
links = [value["href"] for value in soup.find_all(attrs={"data-item-name": "detail-page-link"})]
print(links)

# for element in soup.find_all(attrs={"data-type": "mileage"}):
#     print(element)

['/aanbod/kawasaki-z-125-jetzt-vorbestellen-benzine-zwart-63347bfa-d1f2-418a-b95d-443048f7cacd', '/aanbod/kawasaki-ninja-125-jetzt-vorbestellen-benzine-e31d045e-77b8-4d09-9f8b-161b8491eeef', '/aanbod/kawasaki-ninja-400-ninja-400-abs-led-1-hand-top-zustand-insp-neu-benzine-groen-731db28b-7894-4fcc-8096-fef9dc5fc14f', '/aanbod/bmw-f-800-gs-0-benzine-grijs-a6326ce3-2040-480c-994f-11300b3b2199', '/aanbod/vespa-gts-300-super-hpe-versand-moeglich-benzine-b5e415d7-25e0-47f5-99e4-b3934a42ce00', '/aanbod/honda-others-crf250r-ralley-benzine-rood-150226c1-6304-46fb-928e-3cd6da1db945', '/aanbod/honda-others-cbf1000a-benzine-zilver-122cbc99-8417-43c9-8fb3-3c27bac56f91', '/aanbod/ktm-125-duke-versand-moeglich-benzine-zilver-45e1b1f4-e755-4c1f-a9c1-fab3fc413b34', '/aanbod/ktm-125-duke-mj21-jetzt-vorbestellen-benzine-zwart-914dcddc-99b0-483d-a3cc-82d9d149343c', '/aanbod/tgb-blade-550-efi-eco-lof-4x4-special-edition-inkl-koffer-benzine-blauw-295ab125-c90d-4bbb-9218-4e5232efcb30', '/aanbod/kawasaki-vn-1

In [75]:
def collect(pages_number: int) -> pd.DataFrame:
    """Scrape the first `pages_number` result pages of the motorcycle
    listings and return the collected fields as a pandas DataFrame."""
    title, price, km, registration, owners = ([] for _ in range(5))

    for page_no in range(1, pages_number + 1):
        url = f"https://www.autoscout24.nl/lst-moto?sort=standard&desc=0&ustate=N%2CU&size=20&page={page_no}"
        print(page_no)

        headers = {"User-Agent": "Mozilla/5.0"}
        page = requests.get(url, headers=headers)
        soup = BeautifulSoup(page.content, "html.parser")

        price.extend([value.text for value in soup.find_all("span", class_="sc-font-xl")])
        title.extend([value.text for value in soup.find_all("h2", class_="cldt-summary-makemodel")])
        km.extend([value.text for value in soup.find_all(attrs={"data-type": "mileage"})])
        registration.extend([value.text for value in soup.find_all(attrs={"data-type": "first-registration"})])
        owners.extend([value.text for value in soup.find_all(attrs={"data-type": "previous-owners"})])

    dict_ = {
        "title": title,
        "price": price,
        "milage": km,
        "first_registration": registration,
        "previous_owners": owners
    }

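    # Build row-wise and transpose so that lists of unequal length are padded
    # with missing values instead of raising a ValueError. Caution: each field
    # is collected independently, so a listing with a missing field can shift
    # later values and misalign rows.
    df = pd.DataFrame.from_dict(dict_, orient='index')
    df = df.transpose()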

    return df

In [76]:
df = collect(4)

1
2
3
4


In [77]:
df.shape

(80, 5)

In [78]:
df.head()

Unnamed: 0,title,price,milage,first_registration,previous_owners
0,PGO X-Rider 150,"\n€ 690,-\n",\n4.000 km\n,\n07/2004\n,\n-/- (Vorige eigenaren)\n
1,Suzuki VX 800,"\n€ 950,-\n",\n75.331 km\n,\n05/1992\n,\n-/- (Vorige eigenaren)\n
2,Suzuki RF 600,"\n€ 1.450,-\n",\n30.461 km\n,\n03/1997\n,\n2 vorige eigenaren\n
3,Jinling,"\n€ 1.650,-\n",\n875 km\n,\n05/2012\n,\n-/- (Vorige eigenaren)\n
4,Moto Guzzi,"\n€ 1.950,-\n",\n13.274 km\n,\n04/2004\n,\n-/- (Vorige eigenaren)\n


In [50]:
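# This cell appears to have been run on an earlier version of the frame, in
# which the price column was named `post_score`.
# "." is the thousands separator; regex=False makes the dot a literal character.
df["post_score"] = df["post_score"].str.replace(".", "", regex=False)
df.head()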

Unnamed: 0,post_score,post_thumb_url
0,"\n€ 690,-\n",PGO X-Rider 150
1,"\n€ 950,-\n",Suzuki VX 800
2,"\n€ 1450,-\n",Suzuki RF 600
3,"\n€ 1650,-\n",Jinling
4,"\n€ 1950,-\n",Moto Guzzi


In [51]:
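# Keep only the first run of digits. The raw string avoids an invalid "\d"
# escape, and expand=False returns a Series rather than a DataFrame.
df["post_score"] = df["post_score"].str.extract(r"(\d+)", expand=False)
df.head()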

Unnamed: 0,post_score,post_thumb_url
0,690,PGO X-Rider 150
1,950,Suzuki VX 800
2,1450,Suzuki RF 600
3,1650,Jinling
4,1950,Moto Guzzi

