# Capstone Project

## Background

Time flies: you have already reached the last sprint of the second module of the course, and you should be proud of yourself. Over the past three sprints you built up data engineering skills. You should now know what good Python code looks like, why OOP is used, how to structure a Python project, how to work with SQL, and how to develop and deploy a web application. These skills let you build projects that not only cover data analysis and modeling but also make your findings accessible to other people.

Now it is time to bring everything you have learned together and complete the second capstone project of the course. In this project you will create a Python package, collect a dataset by web scraping, train a model, and deploy it so that others can reach it.

Most importantly, you will have to create the whole end-to-end (E2E) machine learning plan: define the problem, collect a dataset, train a model, evaluate it, and deploy it. By completing this project, you will strengthen your data engineering skills and prove to yourself and others that you are capable of planning and executing data science projects.

<div style="text-align: center;">
<img src="https://miro.medium.com/max/700/1*x7P7gqjo8k2_bj2rTQWAfg.jpeg" width="300px"/>
</div>

---

## Requirements
The capstone requires you to carry out a full end-to-end machine learning project. Here is what you have to complete:

### Define problem you want to solve
This is the part where you select a problem. You can choose from these topics: text classification, price prediction, or item category classification. During the second module of the course you saw a few example datasets that could be used for these problems (eBay listings, Reddit posts, Twitter tweets). In this stage you have to:
- Define the problem and create a short presentation
- Explain what you want to solve and what the potential value of your solution is
- Define the data source you will collect data from

### Collecting data
During this stage, you will create a Python package that can scrape a specific website. During the second module you saw many examples of functions that take a few arguments (`keywords`, `number of samples`, etc.) and output pandas `DataFrame`s. Now you will turn this functionality into a Python package that is installable through pip.
- Create a Python package that can scrape a specific webpage
- The package should be installable through `pip`
- The package should meet all expected Python package standards: clean code, tests, documentation
- Collect and process a dataset using the package you created
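
As a starting point for the package, the URL-building step can live in a small function with no network access, which makes it easy to cover with tests. The sketch below is only an illustration: the function name `build_search_urls` and the `page_size` parameter are hypothetical, and the query parameters are copied from the search URL used in the scraping cells of this notebook.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.autoscout24.nl/lst-moto"


def build_search_urls(keywords, pages_number, page_size=20):
    """Return one search-result URL per (page, keyword) pair."""
    urls = []
    for page_no in range(1, pages_number + 1):
        for keyword in keywords:
            # urlencode escapes the comma in "N,U" as %2C
            query = urlencode({
                "sort": "standard",
                "desc": 0,
                "offer": "N,U",
                "ustate": "N,U",
                "size": page_size,
                "page": page_no,
            })
            urls.append(f"{BASE_URL}/{keyword}?{query}")
    return urls


urls = build_search_urls(["honda", "bmw"], pages_number=2)
print(len(urls))
print(urls[0])
```

Keeping the request and parsing code in separate functions from this one means most of the package can be tested without hitting the website.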

### Training and saving the model
During this step, you will use your collected data to train, test, and save a machine learning model. Do not spend much time on this step; just make sure that:
- A suitable machine learning algorithm is selected
- The model is trained successfully (remember the first module of the course)
- The model is saved for later deployment
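
The train, test, and save cycle can be sketched with the standard library alone. Everything here is an assumption made for illustration: the data is synthetic (not scraped), the model is a hand-written one-feature least-squares fit stored as a plain dictionary, and `fit_line` and `predict` are hypothetical helpers. With scikit-learn the fitted estimator object would typically be passed to `pickle.dump` in the same way.

```python
import pickle
import random
import tempfile
from pathlib import Path


def fit_line(xs, ys):
    """Least-squares fit of y = intercept + slope * x for one feature."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return {"slope": slope, "intercept": mean_y - slope * mean_x}


def predict(model, xs):
    """Apply the fitted line to a list of feature values."""
    return [model["intercept"] + model["slope"] * x for x in xs]


# Synthetic stand-in data (NOT scraped): power in kW -> price in euros.
random.seed(1)
power = [random.uniform(10, 120) for _ in range(100)]
price = [2000 + 90 * p + random.gauss(0, 300) for p in power]

# Train on the first 80%, test on the remaining 20%.
split = int(len(power) * 0.8)
model = fit_line(power[:split], price[:split])
predictions = predict(model, power[split:])
test_mse = sum((p - y) ** 2 for p, y in zip(predictions, price[split:])) / len(predictions)
print(f"slope={model['slope']:.1f}, test MSE={test_mse:.0f}")

# Save the fitted model, then load it back the way the API would at start-up.
with tempfile.TemporaryDirectory() as folder:
    model_path = Path(folder) / "model.pkl"
    with open(model_path, "wb") as file:
        pickle.dump(model, file)
    with open(model_path, "rb") as file:
        loaded_model = pickle.load(file)

print(loaded_model == model)
```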

### Creating API for the trained model
You have done this step at least a couple of times. You will create an API using Flask. While creating the application you will need to:
- Load the trained model
- Create an inference pipeline
- Create a `POST` route that reaches the model and sends its outputs as a response
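
The inference pipeline behind the `POST` route can be written as a plain function that takes the raw request body and returns a response dictionary plus an HTTP status code. This is a sketch under assumptions: `ToyPriceModel`, `FEATURES`, the `"inputs"` payload key, and `handle_predict` are all hypothetical names, and the toy model only stands in for the trained model that would be loaded at start-up.

```python
import json


class ToyPriceModel:
    """Hypothetical stand-in for the trained model loaded at start-up."""

    def predict(self, rows):
        # Toy rule so the sketch runs without a real model file.
        return [1000 + 80 * row[0] - row[1] / 50 for row in rows]


MODEL = ToyPriceModel()
FEATURES = ["power", "milage"]


def handle_predict(raw_body, model=MODEL):
    """Turn a JSON request body into a (response dict, HTTP status) pair."""
    try:
        payload = json.loads(raw_body)
    except json.JSONDecodeError:
        return {"error": "body is not valid JSON"}, 400

    inputs = payload.get("inputs") if isinstance(payload, dict) else None
    if not isinstance(inputs, list) or not inputs:
        return {"error": "expected a non-empty 'inputs' list"}, 400

    # Build one feature row per input, in the order the model expects.
    rows = []
    for item in inputs:
        try:
            rows.append([float(item[name]) for name in FEATURES])
        except (KeyError, TypeError, ValueError):
            return {"error": f"each input needs numeric {FEATURES}"}, 400

    predictions = [round(value, 2) for value in model.predict(rows)]
    return {"predictions": predictions}, 200


body = json.dumps({"inputs": [{"power": 50, "milage": 10000}]})
print(handle_predict(body))
print(handle_predict("not json"))
```

In a Flask application, a view registered with `@app.route("/predict", methods=["POST"])` could pass `request.data` to this function and return `jsonify(response), status`, which keeps the route itself thin and the pipeline testable without a running server.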

### Tracking model's predictions
Now you will enable tracking of the model's predictions. During this step, you will connect your Flask application to a PostgreSQL database hosted by Heroku and store the model's inputs and outputs in one table:
- Create a PostgreSQL database hosted by Heroku
- Create a table for tracking predictions. It should have columns for the inputs and outputs of the model
- On every request to the model, insert the required values into the database
- Create a new route in the Flask application that returns the 10 most recent requests and responses in JSON format
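
The tracking logic can be sketched as follows. Note the swap: this example uses an in-memory SQLite database from the standard library as a stand-in so that it runs anywhere, not the Heroku PostgreSQL database the project requires. The table layout and the helper names `log_prediction` and `recent_predictions` are assumptions. With PostgreSQL and `psycopg2`, the placeholders would be `%s` instead of `?` and the `id` column would typically be `SERIAL PRIMARY KEY`.

```python
import json
import sqlite3

# SQLite stand-in for the PostgreSQL database (assumption, see above).
connection = sqlite3.connect(":memory:")
connection.execute(
    """
    CREATE TABLE predictions (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        inputs TEXT NOT NULL,
        outputs TEXT NOT NULL,
        created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
    )
    """
)


def log_prediction(conn, inputs, outputs):
    """Store one request/response pair as JSON text."""
    conn.execute(
        "INSERT INTO predictions (inputs, outputs) VALUES (?, ?)",
        (json.dumps(inputs), json.dumps(outputs)),
    )
    conn.commit()


def recent_predictions(conn, limit=10):
    """Return the most recent rows, newest first, as JSON-ready dicts."""
    rows = conn.execute(
        "SELECT inputs, outputs FROM predictions ORDER BY id DESC LIMIT ?",
        (limit,),
    ).fetchall()
    return [
        {"inputs": json.loads(inputs), "outputs": json.loads(outputs)}
        for inputs, outputs in rows
    ]


# Simulate 12 requests, then fetch the 10 most recent ones.
for power in range(12):
    log_prediction(connection, {"power": power}, {"price": 1000 + 80 * power})

latest = recent_predictions(connection)
print(len(latest))
print(json.dumps(latest[0]))
```

The list returned by `recent_predictions` is already JSON-serialisable, so the extra Flask route only needs to return it.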

### Deploying the application
After completing all the steps above, you will deploy your application to Heroku by following the steps provided in the fourth lesson of this sprint.
- Make sure all secrets and passwords are set as environment variables in Heroku
- Deploy the application to Heroku
- Ensure that your application is accessible (provide a link to it)
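
Reading secrets from environment variables, rather than hard-coding them, might look like the sketch below. `load_settings` and the `MODEL_PATH` variable are hypothetical. `DATABASE_URL` and `PORT` are the names Heroku normally uses for the Postgres add-on and the web process, but check your app's config vars rather than relying on this.

```python
import os


def load_settings(environ=os.environ):
    """Read deployment settings from environment variables."""
    database_url = environ.get("DATABASE_URL")
    if not database_url:
        # Fail early instead of starting without a database connection.
        raise RuntimeError("DATABASE_URL is not set")
    return {
        "database_url": database_url,
        "model_path": environ.get("MODEL_PATH", "model.pkl"),
        "port": int(environ.get("PORT", "5000")),
    }


# Placeholder values for illustration only; real values live in Heroku config vars.
example_env = {"DATABASE_URL": "postgres://user:password@host:5432/db", "PORT": "8000"}
settings = load_settings(example_env)
print(settings["port"], settings["model_path"])
```

Passing the environment in as an argument keeps the function easy to test without touching the real process environment.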

---

Problem: I like motorcycles and want to learn more about them, in particular what a fair price for a given motorcycle is.

## Evaluation criteria
- All requirements are met
- The project is well thought out and the defined problem is clearly presented
- The model actually works and makes predictions that make sense
- The code is clear and clean and meets the PEP 8 standards

In [None]:
!pip install beautifulsoup4



In [3]:
from bs4 import BeautifulSoup
import requests
import pandas as pd

In [None]:
headers = {"User-Agent": "Mozilla/5.0"}

URL = "https://www.autoscout24.nl/aanbod/honda-others-crf250r-ralley-benzine-rood-150226c1-6304-46fb-928e-3cd6da1db945?cldtidx=6&cldtsrc=listPage&searchId=b9380bcf"
page = requests.get(URL, headers=headers)
soup = BeautifulSoup(page.content, "html.parser")

# Look up the "Versnellingen" (gears) label and print the value next to it
gears_label = soup.find("dt", string="Versnellingen")
print(gears_label.find_next_sibling("dd").text)


6



In [62]:
def collect_urls(pages_number: int) -> list:
    """Collect detail-page links from the search results.

    Scrapes result pages 1..pages_number for each brand keyword and
    returns the relative URLs (hrefs) of the listings found."""
    links = []
    keywords = ['kawasaki', 'honda', 'bmw', 'ktm', 'yamaha']
    headers = {"User-Agent": "Mozilla/5.0"}

    for page_no in range(1, pages_number + 1):
        for keyword in keywords:
            url = f"https://www.autoscout24.nl/lst-moto/{keyword}?sort=standard&desc=0&offer=N%2CU&ustate=N%2CU&size=20&page={page_no}"
            # Alternative: unfiltered search across all brands
            # url = f"https://www.autoscout24.nl/lst-moto?sort=standard&desc=0&ustate=N%2CU&size=20&page={page_no}"

            # Progress indicator: printed once per keyword
            print(page_no)

            page = requests.get(url, headers=headers)
            soup = BeautifulSoup(page.content, "html.parser")

            links.extend([value["href"] for value in soup.find_all(attrs={"data-item-name": "detail-page-link"})])

    return links




def collect_info(search_list: list) -> pd.DataFrame:
    """Scrape every detail page in search_list and return a pandas
    DataFrame with one row per listing. Missing fields are set to None."""
    headers = {"User-Agent": "Mozilla/5.0"}

    # Dutch <dt> labels on the detail page, keyed by output column.
    # Labels tried earlier but left out: "Vorige eigenaren" (owners),
    # "Versnellingen" (gears), "Cilinders" (cilinders), "Stoelen" (seats).
    labels = {
        "brand": "Merk",
        "category": "Categorie",
        "first_registration": "Bouwjaar",
        "fuel": "Brandstof",
        "cilinder_content": "Cilinderinhoud",
    }

    rows = []
    for href in search_list:
        url = f"https://www.autoscout24.nl{href}"
        page = requests.get(url, headers=headers)
        soup = BeautifulSoup(page.content, "html.parser")
        row = {}

        # Price: the <h2> inside the price container
        try:
            row["price"] = soup.find("div", class_="cldt-price").find("h2").text
        except AttributeError:
            row["price"] = None

        # Mileage is the first key fact, power the third
        keyfacts = soup.find_all("span", class_="cldt-stage-primary-keyfact")
        row["milage"] = keyfacts[0].text if len(keyfacts) > 0 else None
        row["power"] = keyfacts[2].text if len(keyfacts) > 2 else None

        # Labelled fields: the value is the <dd> that follows the matching <dt>
        for column, label in labels.items():
            try:
                row[column] = soup.find("dt", string=label).find_next_sibling("dd").text
            except AttributeError:
                row[column] = None

        rows.append(row)

    columns = ["brand", "price", "milage", "power", "category",
               "first_registration", "fuel", "cilinder_content"]
    return pd.DataFrame(rows, columns=columns)


In [74]:
list_try = collect_urls(10)

1
1
1
1
1
2
2
2
2
2
3
3
3
3
3
4
4
4
4
4
5
5
5
5
5
6
6
6
6
6
7
7
7
7
7
8
8
8
8
8
9
9
9
9
9
10
10
10
10
10


In [78]:
print(len(list_try))

840


In [77]:
df = collect_info(list_try)

In [79]:
df.head()

|   | brand | price | milage | power | category | first_registration | fuel | cilinder_content |
|---|---|---|---|---|---|---|---|---|
| 0 | \nKawasaki\n | \n€ 3.950,-\n | 25.000 km | 25 kW | \nGebruikt\n | \n1996\n | \nBenzine\n | \n805 cm³\n |
| 1 | \nKawasaki\n | \n€ 4.395,-\n | - km | 11 kW | \nNieuw\n | \n2021\n | \nBenzine\n | \n125 cm³\n |
| 2 | \nKawasaki\n | \n€ 4.695,-\n | - km | 11 kW | \nNieuw\n | \n2021\n | \nBenzine\n | \n125 cm³\n |
| 3 | \nKawasaki\n | \n€ 5.595,-\n | 33.000 km | 47 kW | \nGebruikt\n | \n1999\n | \nBenzine\n | \n1.471 cm³\n |
| 4 | \nKawasaki\n | \n€ 6.289,-\n | 18.043 km | 50 kW | \nGebruikt\n | \n2018\n | \nBenzine\n | \n649 cm³\n |


In [80]:
df.shape

(840, 8)

In [68]:
df.columns

Index(['brand', 'price', 'milage', 'power', 'category', 'first_registration',
       'fuel', 'cilinder_content'],
      dtype='object')

In [81]:
numeric_columns = ['price', 'milage', 'power', 'cilinder_content']

for column in df.columns:
    # Remove thousands separators and newlines (literal match, not regex)
    df[column] = df[column].str.replace('.', '', regex=False)
    df[column] = df[column].str.replace('\n', '', regex=False)
    if column in numeric_columns:
        # Keep only the first run of digits, e.g. "€ 3950,-" -> "3950"
        df[column] = df[column].str.extract(r'(\d+)', expand=False)


df.head()

|   | brand | price | milage | power | category | first_registration | fuel | cilinder_content |
|---|---|---|---|---|---|---|---|---|
| 0 | Kawasaki | 3950 | 25000.0 | 25 | Gebruikt | 1996 | Benzine | 805 |
| 1 | Kawasaki | 4395 | NaN | 11 | Nieuw | 2021 | Benzine | 125 |
| 2 | Kawasaki | 4695 | NaN | 11 | Nieuw | 2021 | Benzine | 125 |
| 3 | Kawasaki | 5595 | 33000.0 | 47 | Gebruikt | 1999 | Benzine | 1471 |
| 4 | Kawasaki | 6289 | 18043.0 | 50 | Gebruikt | 2018 | Benzine | 649 |


In [82]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 840 entries, 0 to 839
Data columns (total 8 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   brand               840 non-null    object
 1   price               840 non-null    object
 2   milage              703 non-null    object
 3   power               814 non-null    object
 4   category            840 non-null    object
 5   first_registration  662 non-null    object
 6   fuel                782 non-null    object
 7   cilinder_content    617 non-null    object
dtypes: object(8)
memory usage: 52.6+ KB


In [83]:
df.head()

|   | brand | price | milage | power | category | first_registration | fuel | cilinder_content |
|---|---|---|---|---|---|---|---|---|
| 0 | Kawasaki | 3950 | 25000.0 | 25 | Gebruikt | 1996 | Benzine | 805 |
| 1 | Kawasaki | 4395 | NaN | 11 | Nieuw | 2021 | Benzine | 125 |
| 2 | Kawasaki | 4695 | NaN | 11 | Nieuw | 2021 | Benzine | 125 |
| 3 | Kawasaki | 5595 | 33000.0 | 47 | Gebruikt | 1999 | Benzine | 1471 |
| 4 | Kawasaki | 6289 | 18043.0 | 50 | Gebruikt | 2018 | Benzine | 649 |


In [12]:
df[["price", "milage", "power", "cilinder_content"]] = df[["price", "milage", "power", "cilinder_content"]].apply(pd.to_numeric)

In [17]:
df['first_registration'] = pd.to_datetime(df['first_registration'])

In [71]:
df.category.value_counts()

Gebruikt    130
Nieuw        30
Name: category, dtype: int64

In [72]:
df.brand.value_counts()

BMW         40
Yamaha      40
Honda       40
KTM         20
Kawasaki    20
Name: brand, dtype: int64

In [73]:
df.fuel.value_counts()

Benzine       130
Super 95        7
Overig          5
Tweetakt        1
Elektrisch      1
Diesel          1
Name: fuel, dtype: int64

#### Linear regression

In [24]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

import matplotlib.pyplot as plt

In [None]:
# Add categorised values
# (assumption: the state, Gebruikt / Nieuw, is stored in the 'category' column of df)
df['state_cat'] = df['category'].astype('category').cat.codes

In [None]:
# Get all interesting features for linear regression
# ('price' is the target, so it is left out of the features)
features_lin = ['brand', 'milage', 'power', 'category',
                'first_registration', 'fuel', 'cilinder_content']

In [None]:
# Select the feature columns as X
X = df[features_lin]

In [None]:
# Set dummy variables
X = pd.get_dummies(data=X, drop_first=True)
X.head()

In [None]:
# Set all columns incl dummy variables as features_x
features_x = X.columns

In [25]:
# X = StandardScaler().fit_transform(X)

y = df[['price']]

In [None]:
# Set train and test data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

# Fit linear regression
model_lin = LinearRegression().fit(X_train, y_train)

In [None]:
# Predictions
predictions = model_lin.predict(X_test)

In [None]:
# Get summary: fit a statsmodels OLS on the training data
import statsmodels.api as sm

X_train_sm = sm.add_constant(X_train)
ls = sm.OLS(y_train, X_train_sm).fit()
print(ls.summary())