# Capstone Project

## Background

Time really flies: you have already reached the last sprint of the second module of the course, and you should be proud of yourself. Over the past three sprints you gained knowledge that built up your data engineering skills. By now you should know what good Python code looks like, why OOP is used, how to structure a Python project, how to work with SQL, and how to develop and deploy a web application. These skills will enable you to build projects that not only cover data analysis and modeling but also make your discoveries available to other people.

Now it is time to bring everything you have learned together and complete the second capstone project of the course. In this project you will create a Python package, collect a dataset by scraping the web, train a model, and deploy it so that others can use it.

Most importantly, you will have to carry out the whole end-to-end (E2E) machine learning plan: define the problem, collect a dataset, train a model, evaluate it, and deploy it. By completing this project, you will strengthen your data engineering skills and prove to yourself and others that you are capable of planning and executing data science projects.

<div style="text-align: center;">
<img src="https://miro.medium.com/max/700/1*x7P7gqjo8k2_bj2rTQWAfg.jpeg" width="300px"/>
</div>

---

## Requirements
The capstone project requires you to execute a full-featured E2E machine learning project. Here is what you have to complete:

### Define the problem you want to solve
This is the part where you select a problem. You can choose from these topics: text classification, price prediction, or item category classification. Throughout the second module of the course, you saw a few examples of datasets that could be used to solve these problems (eBay listings, Reddit posts, Twitter tweets). In this stage you have to:
- Define the problem and create a short presentation
- Explain what you want to solve and what the potential value of your solution is
- Define the data source you will collect data from

### Collecting data
During this stage, you will need to create a Python package that can scrape a specific website. During the second module you saw many examples of functions that take a few arguments (`keywords`, `number of samples`, etc.) and output pandas `DataFrame`s. Now you will need to turn this functionality into a Python package that is installable through `pip`.
- Create a Python package that can scrape a specific webpage
- The package should be installable through `pip`
- The package should meet all expected Python package standards: clean code, tests, documentation
- Collect and process a dataset using the package you created
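
One way to expose a "number of samples" argument is to convert it into the list-page URLs that have to be visited. The sketch below is only an illustration: `listing_page_urls` is a hypothetical helper name, and the base URL and query parameters are copied from the list-page URL used in the scraping cells of this notebook, where each page is requested with `size=20`.

```python
from math import ceil
from urllib.parse import urlencode

# Base URL and query parameters mirror the list-page URL used in this notebook.
BASE_URL = "https://www.autoscout24.nl/lst-moto"


def listing_page_urls(n_samples: int, page_size: int = 20) -> list:
    """Return the list-page URLs needed to cover n_samples listings."""
    if n_samples < 1:
        raise ValueError("n_samples must be at least 1")
    n_pages = ceil(n_samples / page_size)
    urls = []
    for page_no in range(1, n_pages + 1):
        query = urlencode({
            "sort": "standard",
            "desc": 0,
            "ustate": "N,U",  # urlencode turns the comma into %2C
            "size": page_size,
            "page": page_no,
        })
        urls.append(f"{BASE_URL}?{query}")
    return urls
```

Keeping URL construction separate from the network calls means this part of the package can be unit-tested without touching the website.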

### Training and saving the model
During this step, you will need to use your collected data to train, test, and save a machine learning model. Do not spend much time on this step; just make sure that:
- A suitable machine learning algorithm is selected
- The model is successfully trained (remember the first module of the course)
- The model is saved for later deployment
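
The three checks above can be sketched end to end. This is a deliberately tiny stand-in, not a recommended model: a hand-written one-feature least-squares fit takes the place of a real estimator (in practice, for example, a scikit-learn regressor), the numbers are toy values invented for illustration, and `model.pkl` is a hypothetical file name.

```python
import os
import pickle
import tempfile


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return {"slope": slope, "intercept": mean_y - slope * mean_x}


def predict(model, x):
    return model["slope"] * x + model["intercept"]


def mean_absolute_error(model, xs, ys):
    return sum(abs(predict(model, x) - y) for x, y in zip(xs, ys)) / len(xs)


# Toy data invented for illustration: engine power (kW) -> price (EUR).
power = [5, 10, 20, 40, 60, 80]
price = [1500, 2000, 3000, 5000, 7000, 9000]

# Train on the first four rows and hold out the last two for testing.
train_x, test_x = power[:4], power[4:]
train_y, test_y = price[:4], price[4:]

model = fit_line(train_x, train_y)
test_mae = mean_absolute_error(model, test_x, test_y)

# Save the model for later deployment and reload it to confirm it round-trips.
model_path = os.path.join(tempfile.gettempdir(), "model.pkl")
with open(model_path, "wb") as f:
    pickle.dump(model, f)
with open(model_path, "rb") as f:
    restored = pickle.load(f)
```

With real scraped data, a shuffled train/test split is usually a better choice than holding out the last rows, because listings may be sorted by price.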

### Creating an API for the trained model
This is a step you have done at least a couple of times. You will need to create an API using Flask. While creating the application, you will need to:
- Load the trained model
- Create an inference pipeline
- Create a `POST` route that reaches the model and sends its outputs as a response
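
The inference pipeline behind the `POST` route can be kept as a plain function so that it is testable without a running server. The sketch below makes several assumptions: the field names `power_kw` and `mileage_km`, the `DummyModel` class, and its coefficients are hypothetical placeholders. In the real application, the Flask route would pass the parsed JSON body to `run_inference` and return the resulting dictionary and status code.

```python
REQUIRED_FIELDS = ("power_kw", "mileage_km")  # hypothetical feature names


class DummyModel:
    """Placeholder for the trained model loaded at application start-up."""

    def predict(self, rows):
        # Invented coefficients, for illustration only.
        return [1000 + 100 * power - 0.01 * km for power, km in rows]


def run_inference(payload: dict, model) -> tuple:
    """Validate the request body, build features, return (response, status)."""
    missing = [name for name in REQUIRED_FIELDS if name not in payload]
    if missing:
        return {"error": f"missing fields: {', '.join(missing)}"}, 400
    try:
        features = [float(payload[name]) for name in REQUIRED_FIELDS]
    except (TypeError, ValueError):
        return {"error": "all fields must be numeric"}, 400
    prediction = model.predict([features])[0]
    return {"inputs": payload, "predicted_price": round(prediction, 2)}, 200


response, status = run_inference(
    {"power_kw": 18, "mileage_km": 3221}, DummyModel()
)
```

Returning the inputs together with the prediction makes it straightforward to store both later when predictions are tracked.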

### Tracking the model's predictions
Now you will need to enable tracking of the model's predictions. During this step, you will connect your Flask application to a PostgreSQL database hosted by Heroku and store the model's inputs and outputs in one table:
- Create a PostgreSQL database hosted by Heroku
- Create a table for tracking predictions, with columns for the model's inputs and outputs
- On every request to the model, insert the required values into the database
- Create a new route in the Flask application that returns the 10 most recent requests and responses in JSON format
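
The tracking steps above can be sketched as follows. To keep the example runnable anywhere, an in-memory SQLite database stands in for the Heroku PostgreSQL database; the table name `predictions`, its columns, and the helper names are assumptions. With PostgreSQL and a driver such as psycopg2, the placeholders would be `%s` instead of `?`, and the `id` column would typically be declared as `SERIAL PRIMARY KEY`.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the Heroku PostgreSQL connection
conn.execute(
    """
    CREATE TABLE predictions (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        inputs TEXT NOT NULL,   -- request body as JSON text
        outputs TEXT NOT NULL   -- model response as JSON text
    )
    """
)


def log_prediction(conn, inputs: dict, outputs: dict) -> None:
    """Insert one request/response pair; call this on every model request."""
    conn.execute(
        "INSERT INTO predictions (inputs, outputs) VALUES (?, ?)",
        (json.dumps(inputs), json.dumps(outputs)),
    )
    conn.commit()


def recent_predictions(conn, limit: int = 10) -> list:
    """Return the most recent rows, newest first, as JSON-serialisable dicts."""
    rows = conn.execute(
        "SELECT inputs, outputs FROM predictions ORDER BY id DESC LIMIT ?",
        (limit,),
    ).fetchall()
    return [{"inputs": json.loads(i), "outputs": json.loads(o)} for i, o in rows]


# Log 12 made-up requests, then fetch the 10 newest.
for n in range(12):
    log_prediction(conn, {"power_kw": n}, {"predicted_price": 1000 + 100 * n})
latest = recent_predictions(conn)
```

The list returned by `recent_predictions` is what the extra Flask route would serialise to JSON.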

### Deploying the application
After completing all the steps above, you will need to deploy your application to Heroku by following the steps provided in the fourth lesson of this sprint.
- Make sure all secrets and passwords are set as environment variables in Heroku
- Deploy the application to Heroku
- Ensure that your application is accessible (provide a link to it)
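
Reading secrets from environment variables might look like the sketch below. It assumes the database URL is exposed as `DATABASE_URL`, which is the config var Heroku Postgres normally sets; `get_database_url` is a hypothetical helper name. The scheme rewrite is only needed for libraries that reject `postgres://` URLs (SQLAlchemy 1.4 and later, for example).

```python
import os


def get_database_url(env=os.environ) -> str:
    """Read the database URL from the environment instead of hard-coding it."""
    url = env.get("DATABASE_URL")
    if not url:
        raise RuntimeError("DATABASE_URL is not set")
    # Some libraries only accept the postgresql:// scheme.
    if url.startswith("postgres://"):
        url = "postgresql://" + url[len("postgres://"):]
    return url
```

Failing loudly when the variable is missing surfaces a misconfigured deployment at start-up rather than on the first request.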

---

Problem: I like motorcycles and want to learn more about them, including what a correct price for a given motorcycle is.

## Evaluation criteria
- All requirements are met
- The project is well thought out, and the defined problem is clearly presented
- The model actually works and makes predictions that make sense
- The code is clear and clean, and follows the PEP 8 standard

In [None]:
!pip install beautifulsoup4



In [1]:
from bs4 import BeautifulSoup
import requests
import pandas as pd

In [29]:
headers = {"User-Agent": "Mozilla/5.0"}

URL = "https://www.autoscout24.nl/aanbod/honda-others-crf250r-ralley-benzine-rood-150226c1-6304-46fb-928e-3cd6da1db945?cldtidx=6&cldtsrc=listPage&searchId=b9380bcf"
page = requests.get(URL, headers=headers)

soup = BeautifulSoup(page.content, "html.parser")
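# Each find() call returns only the first matching element.
price = soup.find("div", class_="cldt-price")
mileage = soup.find("span", class_="cldt-stage-primary-keyfact")
# This class string also matches the first key fact (the mileage), not the power.
first_keyfact = soup.find("span", class_="sc-font-l cldt-stage-primary-keyfact")
print(price.text)
print(mileage.text)
print(first_keyfact.text)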



€ 5.390,-









3.221 km
3.221 km


In [53]:
headers = {"User-Agent": "Mozilla/5.0"}

URL = "https://www.autoscout24.nl/aanbod/honda-others-crf250r-ralley-benzine-rood-150226c1-6304-46fb-928e-3cd6da1db945?cldtidx=6&cldtsrc=listPage&searchId=b9380bcf"
page = requests.get(URL, headers=headers)
soup = BeautifulSoup(page.content, "html.parser")

# Find the <dt> label "Versnellingen" (gears) and read the <dd> value next to it.
gears_label = soup.find("dt", string="Versnellingen")
print(gears_label.find_next_sibling("dd").text)


6



In [3]:
headers = {"User-Agent": "Mozilla/5.0"}

URL = "https://www.autoscout24.nl/aanbod/honda-others-crf250r-ralley-benzine-rood-150226c1-6304-46fb-928e-3cd6da1db945?cldtidx=6&cldtsrc=listPage&searchId=b9380bcf"
page = requests.get(URL, headers=headers)
soup = BeautifulSoup(page.content, "html.parser")  # parse the page that was just fetched

# <dt> elements hold the attribute labels, <dd> elements hold the values.
labels = [value.text for value in soup.find_all("dt", {"class": "sc-ellipsis"})]
values = [value.text for value in soup.find_all("dd")]
print(labels)
print(values)

['Categorie', 'Merk', 'Advertentienr.', 'Bouwjaar', 'Kleur', 'Carrosserietype', 'Stoelen', 'Transmissie', 'Versnellingen', 'Cilinderinhoud', 'Cilinders', 'Brandstof']
['\nGebruikt\n', '\n1\n', '\n02/2022\n', '\nHonda\n', '\n2201654\n', '\n2020\n', '\nRood\n', '\nExtreme Red\n', '\nEnduro\n', '\n2\n', '\n7432/AAV\n', '\nHandgeschakeld\n', '\n6\n', '\n250 cm³\n', '\n1\n', '\nBenzine\n']


In [None]:
headers = {"User-Agent": "Mozilla/5.0"}

URL = "https://www.autoscout24.nl/aanbod/honda-others-crf250r-ralley-benzine-rood-150226c1-6304-46fb-928e-3cd6da1db945?cldtidx=6&cldtsrc=listPage&searchId=b9380bcf"
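page = requests.get(URL, headers=headers)
soup = BeautifulSoup(page.content, "html.parser")  # parse the page that was just fetched

# Key facts shown at the top of the listing: mileage, registration date, power.
keyfacts = [value.text for value in soup.find_all("span", class_="cldt-stage-primary-keyfact")]
print(keyfacts)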


['3.221 km', '02/2020', '18 kW', '24 PK', '3.221 km', '02/2020', '18 kW', '24 PK']


In [None]:
headers = {"User-Agent": "Mozilla/5.0"}

URL = "https://www.autoscout24.nl/lst-moto?sort=standard&desc=0&ustate=N%2CU&size=20&page=2"
page = requests.get(URL, headers=headers)

soup = BeautifulSoup(page.content, "html.parser")
# Collect the relative link to the detail page of every listing on this list page.
links = [value["href"] for value in soup.find_all(attrs={"data-item-name": "detail-page-link"})]
print(links)

['/aanbod/kawasaki-z-125-jetzt-vorbestellen-benzine-zwart-63347bfa-d1f2-418a-b95d-443048f7cacd', '/aanbod/kawasaki-ninja-125-jetzt-vorbestellen-benzine-e31d045e-77b8-4d09-9f8b-161b8491eeef', '/aanbod/kawasaki-ninja-400-ninja-400-abs-led-1-hand-top-zustand-insp-neu-benzine-groen-731db28b-7894-4fcc-8096-fef9dc5fc14f', '/aanbod/bmw-f-800-gs-0-benzine-grijs-a6326ce3-2040-480c-994f-11300b3b2199', '/aanbod/vespa-gts-300-super-hpe-versand-moeglich-benzine-b5e415d7-25e0-47f5-99e4-b3934a42ce00', '/aanbod/honda-others-crf250r-ralley-benzine-rood-150226c1-6304-46fb-928e-3cd6da1db945', '/aanbod/honda-others-cbf1000a-benzine-zilver-122cbc99-8417-43c9-8fb3-3c27bac56f91', '/aanbod/ktm-125-duke-versand-moeglich-benzine-zilver-45e1b1f4-e755-4c1f-a9c1-fab3fc413b34', '/aanbod/ktm-125-duke-mj21-jetzt-vorbestellen-benzine-zwart-914dcddc-99b0-483d-a3cc-82d9d149343c', '/aanbod/tgb-blade-550-efi-eco-lof-4x4-special-edition-inkl-koffer-benzine-blauw-295ab125-c90d-4bbb-9218-4e5232efcb30', '/aanbod/kawasaki-vn-1

In [123]:
def collect_urls(pages_number: int) -> list:
    """Scrape the first pages_number list pages and return
    the relative URLs of all listings found on them."""
    headers = {"User-Agent": "Mozilla/5.0"}
    urls = []
    for page_no in range(1, pages_number + 1):
        url = f"https://www.autoscout24.nl/lst-moto?sort=standard&desc=0&ustate=N%2CU&size=20&page={page_no}"
        print(page_no)  # progress indicator

        page = requests.get(url, headers=headers)
        soup = BeautifulSoup(page.content, "html.parser")

        urls.extend([value["href"] for value in soup.find_all(attrs={"data-item-name": "detail-page-link"})])

    return urls




def _text_or_none(tag):
    """Return the text of a BeautifulSoup tag, or None if the tag was not found."""
    return tag.text if tag is not None else None


def _detail_value(soup, label):
    """Return the <dd> text that follows the <dt> with the given label, or None."""
    dt = soup.find("dt", string=label)
    return _text_or_none(dt.find_next_sibling("dd")) if dt is not None else None


def collect_info(search_list: list) -> pd.DataFrame:
    """Scrape the detail page of every relative URL in search_list
    and return one row per listing as a pandas DataFrame."""
    headers = {"User-Agent": "Mozilla/5.0"}
    rows = []

    for path in search_list:
        url = f"https://www.autoscout24.nl{path}"
        page = requests.get(url, headers=headers)
        soup = BeautifulSoup(page.content, "html.parser")

        # The price sits in an <h2> inside the price box.
        price_box = soup.find("div", class_="cldt-price")
        price = _text_or_none(price_box.find("h2")) if price_box is not None else None

        # On the listing inspected above, the key facts are ordered as
        # mileage, registration date, power (kW), so power is the third one.
        keyfacts = soup.find_all("span", class_="cldt-stage-primary-keyfact")

        # Missing values are stored as None instead of raising an error.
        rows.append({
            "brand": _detail_value(soup, "Merk"),
            "price": price,
            "milage": keyfacts[0].text if keyfacts else None,
            "power": keyfacts[2].text if len(keyfacts) > 2 else None,
            "category": _detail_value(soup, "Categorie"),
            "first_registration": _detail_value(soup, "Bouwjaar"),
            "gears": _detail_value(soup, "Versnellingen"),
            "previous_owners": _detail_value(soup, "Vorige eigenaren"),
            "fuel": _detail_value(soup, "Brandstof"),
            "cilinders": _detail_value(soup, "Cilinders"),
            "cilinder_content": _detail_value(soup, "Cilinderinhoud"),
            "seats": _detail_value(soup, "Stoelen"),
        })

    columns = ["brand", "price", "milage", "power", "category", "first_registration",
               "gears", "previous_owners", "fuel", "cilinders", "cilinder_content", "seats"]
    return pd.DataFrame(rows, columns=columns)


In [63]:
list_try = collect_urls(1)

1


In [166]:
df = collect_info(list_try)

In [139]:
df

Unnamed: 0,brand,price,milage,power,category,first_registration,gears,previous_owners,fuel,cilinders,cilinder_content,seats
0,\nPGO\n,"\n€ 690,-\n",4.000 km,8,\nGebruikt\n,\n2004\n,,,,,\n150 cm³\n,\n2\n
1,,"\n€ 799,-\n",3.400 km,8,\nGebruikt\n,\n2008\n,,\n3\n,,,\n262 cm³\n,\n2\n
2,\nSuzuki\n,"\n€ 950,-\n",75.331 km,45,\nGebruikt\n,\n1992\n,,,\nBenzine\n,,\n805 cm³\n,
3,\nSuzuki\n,"\n€ 1.450,-\n",30.461 km,72,\nGebruikt\n,\n1997\n,,\n2\n,\nBenzine\n,,\n600 cm³\n,
4,\nDaelim\n,"\n€ 1.999,-\n",8.500 km,10,\nGebruikt\n,\n1998\n,,\n2\n,\nBenzine\n,,\n124 cm³\n,
5,\nHonda\n,"\n€ 2.550,-\n",760 km,7,\nGebruikt\n,\n2020\n,\n5\n,\n1\n,\nSuper 95\n,\n1\n,\n125 cm³\n,\n2\n
6,\nHonda\n,"\n€ 2.590,-\n",1 km,5,\nNieuw\n,,\n4\n,,\nSuper 95\n,\n1\n,\n110 cm³\n,
7,\nNiu\n,"\n€ 2.899,-\n",0 km,3,\nNieuw\n,,,,\nElektrisch\n,,,\n2\n
8,\nYamaha\n,"\n€ 2.900,-\n",42.932 km,99,\nGebruikt\n,\n1991\n,,,\nBenzine\n,,\n1.000 cm³\n,
9,\nHonda\n,"\n€ 2.950,-\n",40.203 km,78,\nGebruikt\n,\n2002\n,,,\nBenzine\n,,\n996 cm³\n,


In [167]:
df.shape

(20, 12)

In [168]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20 entries, 0 to 19
Data columns (total 12 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   brand               19 non-null     object
 1   price               20 non-null     object
 2   milage              20 non-null     object
 3   power               20 non-null     object
 4   category            20 non-null     object
 5   first_registration  16 non-null     object
 6   gears               6 non-null      object
 7   previous_owners     8 non-null      object
 8   fuel                18 non-null     object
 9   cilinders           8 non-null      object
 10  cilinder_content    19 non-null     object
 11  seats               9 non-null      object
dtypes: object(12)
memory usage: 2.0+ KB


In [156]:
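# "." is the thousands separator; regex=False treats it as a literal dot.
df["price"] = df["price"].str.replace(".", "", regex=False)
# Keep only the first run of digits, e.g. "€ 1450,-" -> "1450".
df["price"] = df["price"].str.extract(r"(\d+)", expand=False)
df.head()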

Unnamed: 0,brand,price,milage,power,category,first_registration,gears,previous_owners,fuel,cilinders,cilinder_content,seats
0,\nPGO\n,690,4.000 km,8,\nGebruikt\n,\n2004\n,,,,,\n150 cm³\n,\n2\n
1,,799,3.400 km,8,\nGebruikt\n,\n2008\n,,\n3\n,,,\n262 cm³\n,\n2\n
2,\nSuzuki\n,950,75.331 km,45,\nGebruikt\n,\n1992\n,,,\nBenzine\n,,\n805 cm³\n,
3,\nSuzuki\n,1450,30.461 km,72,\nGebruikt\n,\n1997\n,,\n2\n,\nBenzine\n,,\n600 cm³\n,
4,\nDaelim\n,1999,8.500 km,10,\nGebruikt\n,\n1998\n,,\n2\n,\nBenzine\n,,\n124 cm³\n,
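
The same clean-up can be written as a plain function, which is easier to unit-test inside the package. This is a minimal sketch, assuming the number format seen in the scraped values above (`.` as the thousands separator); `parse_int` is a hypothetical helper name.

```python
import re
from typing import Optional


def parse_int(raw: Optional[str]) -> Optional[int]:
    """Extract the first integer from a scraped value, or None if there is none."""
    if raw is None:
        return None
    # Drop the thousands separator, then take the first run of digits.
    match = re.search(r"\d+", raw.replace(".", ""))
    return int(match.group()) if match else None
```

It could then be applied per column, for example `df["price"].map(parse_int)`. Note that it is not suitable for date-like values such as `02/2020`, where only the leading `02` would be kept.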


In [161]:
df.columns

Index(['brand', 'price', 'milage', 'power', 'category', 'first_registration',
       'gears', 'previous_owners', 'fuel', 'cilinders', 'cilinder_content',
       'seats'],
      dtype='object')

In [171]:
# Series.str.replace() takes a single pattern, not a dict (passing a dict
# raises a TypeError), so the substrings are removed one after another.
for column in df.columns:
    for old in (".", ",-", "\n"):
        df[column] = df[column].str.replace(old, "", regex=False)

df.head()