# Capstone Project

## Background

Wow, time really flies: you have already reached the last sprint of the second module of the course! You should be proud of yourself. Over the past three sprints you gained valuable knowledge that helped you build data engineering skills. By now you should know what good Python code looks like, why OOP is used, how to structure a Python project, how to work with SQL, and how to develop and deploy a web application. These skills will enable you to build outstanding projects that not only cover data analysis and modeling but also make your discoveries accessible to other people.

Now the time has come to bring everything you have learned together and complete the second capstone project of the course. In this project, you will create a Python package, collect a dataset by scraping, train a model, and deploy it for others to reach.

Most importantly, you will have to create the whole end-to-end (E2E) machine learning plan: define the problem, collect a dataset, train a model, evaluate it, and deploy it. By completing this project, you will strengthen your data engineering skills and prove to yourself and others that you are capable of planning and executing data science projects.

<div style="text-align: center;">
<img src="https://miro.medium.com/max/700/1*x7P7gqjo8k2_bj2rTQWAfg.jpeg" width="300px"/>
</div>

---

## Requirements
The capstone project requires you to carry out a full E2E machine learning project. Here is what you have to complete:

### Define the problem you want to solve
This is the part where you select a problem. You can choose from these topics: text classification, price prediction, item category classification. Throughout the second module of the course, you saw a few examples of datasets that could be used to solve these problems (eBay listings, Reddit posts, Twitter tweets). In this stage you have to:
- Define the problem and create a short presentation
- Explain what you want to solve and what the potential value of your solution is
- Define the data source you will collect data from

### Collecting data
During this stage, you will need to create a Python package that can scrape a specific website. Throughout the second module you saw many examples of functions that take a few arguments (`keywords`, `number of samples`, etc.) and output pandas `DataFrame`s. Now you will need to turn this functionality into a Python package that is installable through pip.
- Create a Python package that can scrape a specific webpage
- The package should be installable through `pip`
- The package should meet all expected Python package standards: clean code, tests, documentation
- Collect and process a dataset using the package you created

### Training and saving the model
During this step, you will need to use your collected data to train, test, and save a machine learning model. Do not spend much time on this step; just make sure that:
- The correct machine learning algorithm is selected
- The model is successfully trained (remember the first module of the course)
- The model is saved for later deployment

### Creating API for the trained model
You have done this step at least a couple of times. You will need to create an API using Flask. While creating the application, you will need to:
- Load the trained model
- Create an inference pipeline
- Create a `POST` route that reaches the model and sends its outputs as a response
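
The inference pipeline mentioned above mostly comes down to turning a request body into the shape the model expects and rejecting anything else. Below is a minimal sketch, assuming the request body is JSON with an `inputs` list. The helper name `parse_inputs` and its `n_features` parameter are hypothetical and not part of Flask or the course material.

```python
import json


def parse_inputs(raw_body, n_features):
    """Turn a JSON request body into a single-row, 2-D list of floats.

    Raises ValueError when the payload is malformed, so a route can
    answer with HTTP 400 instead of crashing.
    """
    try:
        payload = json.loads(raw_body)
    except (TypeError, ValueError) as exc:
        raise ValueError("body is not valid JSON") from exc

    inputs = payload.get("inputs") if isinstance(payload, dict) else None
    if not isinstance(inputs, list) or len(inputs) != n_features:
        raise ValueError(f"'inputs' must be a list of {n_features} numbers")

    try:
        row = [float(value) for value in inputs]
    except (TypeError, ValueError) as exc:
        raise ValueError("'inputs' must contain only numbers") from exc

    # scikit-learn style models expect 2-D input:
    # rows are samples, columns are features.
    return [row]


print(parse_inputs('{"inputs": [1, 2.5, 3]}', n_features=3))
```

Inside a Flask route, a `ValueError` raised here could be caught and turned into a 400 response with a short error message.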

### Tracking model's predictions
Now you will need to enable tracking of the model's predictions. During this step, you will need to connect your Flask application to a PostgreSQL database hosted by Heroku and store the model's inputs and outputs in one table:
- Create a PostgreSQL database hosted by Heroku
- Create a table for tracking predictions, with columns for the model's inputs and outputs
- On every request to the model, insert the required values into the database
- Create a new route in the Flask application that returns the 10 most recent requests and responses in JSON format
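
The tracking steps above can be tried locally before touching Heroku. The sketch below uses Python's built-in `sqlite3` as a stand-in for PostgreSQL, which is an assumption made purely for illustration: with PostgreSQL you would use a driver such as `psycopg2`, `%s` placeholders instead of `?`, and a `SERIAL` key. The table name `predictions` and the helper names are hypothetical.

```python
import json
import sqlite3


def create_table(conn):
    # One table with a column for the model's inputs and one for its output
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS predictions (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            inputs TEXT NOT NULL,
            output REAL NOT NULL
        )
        """
    )
    conn.commit()


def log_prediction(conn, inputs, output):
    # Store the inputs as a JSON string so any feature vector fits in one column
    conn.execute(
        "INSERT INTO predictions (inputs, output) VALUES (?, ?)",
        (json.dumps(inputs), float(output)),
    )
    conn.commit()


def recent_predictions(conn, limit=10):
    # Newest rows first; the auto-incrementing id reflects insertion order
    rows = conn.execute(
        "SELECT inputs, output FROM predictions ORDER BY id DESC LIMIT ?",
        (limit,),
    ).fetchall()
    return [{"inputs": json.loads(i), "output": o} for i, o in rows]


conn = sqlite3.connect(":memory:")
create_table(conn)
for n in range(12):
    log_prediction(conn, [n, n + 1], 1000 + n)

latest = recent_predictions(conn)
print(len(latest), json.dumps(latest[0]))
```

A route that returns the most recent requests would then only need to serialize the result of `recent_predictions` as JSON.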

### Deploying the application
After completing all the steps above, you will need to deploy your application to Heroku. Follow the steps provided in the fourth lesson of this sprint.
- Make sure all secrets and passwords are set as environment variables in Heroku
- Deploy the application to Heroku
- Ensure that your application is accessible (provide a link to it)
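
Reading secrets from the environment, as the first bullet asks, could look like the sketch below. `DATABASE_URL` is the config variable that Heroku Postgres conventionally sets; the helper name `get_required_env` and the placeholder URL are hypothetical and exist only so the example runs locally.

```python
import os


def get_required_env(name):
    """Return the value of an environment variable or fail loudly."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"environment variable {name!r} is not set")
    return value


# For local experiments only: supply a placeholder when the variable is
# not already set. In production the value comes from the platform's
# configuration, never from the source code.
os.environ.setdefault(
    "DATABASE_URL", "postgres://user:password@localhost:5432/example"
)

database_url = get_required_env("DATABASE_URL")
```

Failing at startup when a variable is missing tends to be easier to debug than a connection error on the first request.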

---

## Evaluation criteria
- All requirements are met
- The project is well thought out. The defined problem is clearly presented
- The model actually works and makes predictions that make sense
- The code is clear and clean. All PEP 8 standards are met

---

## Why do we need price prediction for motorcycles?

When buying second-hand items, a problem known in economics as 'the lemons problem' arises. The lemons problem refers to issues with the value of an investment or product caused by asymmetric information between the buyer and the seller.

To make asymmetric information less of an issue, this project takes all the information from a listing on the second-hand car and motorcycle website [autoscout.nl](autoscout.nl) and predicts what the price should be. If the predicted price is lower than the actual listing price, the listing could be classified as a lemon.
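
The decision rule above can be written as a small function. This is only a sketch: the function name `is_potential_lemon` and the optional `tolerance` parameter are assumptions added for illustration, and the prices used are made-up examples rather than model output.

```python
def is_potential_lemon(listed_price, predicted_price, tolerance=0.0):
    """Flag a listing whose asking price exceeds the predicted price.

    tolerance is a fraction of the predicted price. With the default 0.0,
    any listing priced above the prediction is flagged.
    """
    return listed_price > predicted_price * (1.0 + tolerance)


print(is_potential_lemon(4000, 3000))        # asking price above the prediction
print(is_potential_lemon(4000, 4500))        # asking price below the prediction
print(is_potential_lemon(4000, 3800, 0.10))  # within 10% of the prediction
```

A non-zero tolerance keeps listings that are only slightly above the prediction from being flagged, which matters when the model's own error is large.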

In [None]:
import requests
import pandas as pd
import numpy as np

In [None]:
pip install "git+https://github.com/winckles/motor_scraper.git"

Collecting git+https://github.com/winckles/motor_scraper.git
  Cloning https://github.com/winckles/motor_scraper.git to /tmp/pip-req-build-juywjlkj
  Running command git clone -q https://github.com/winckles/motor_scraper.git /tmp/pip-req-build-juywjlkj
Building wheels for collected packages: motor-scraper
  Building wheel for motor-scraper (setup.py) ... done
  Created wheel for motor-scraper: filename=motor_scraper-0.0.4-cp36-none-any.whl size=3751 sha256=c166e98e21b9f7071d136c122db0f1b5038dbd1cf0695d526be9fd1a63b3a096
  Stored in directory: /tmp/pip-ephem-wheel-cache-yz3u9krp/wheels/45/90/78/af29fb4aae1d001d855dd0a7d64c6b9daed57ba0146e7e63bb
Successfully built motor-scraper
Installing collected packages: motor-scraper
Successfully installed motor-scraper-0.0.4


In [None]:
from package import MotorScraper

list_try2 = MotorScraper().collect_urls(2, ['kawasaki', 'honda', 'bmw', 'yamaha', 'ducati'])

In [None]:
df = MotorScraper().collect_info(list_try2)
df.head()

Unnamed: 0,brand,price,mileage,power,new,year,fuel,cc
0,\nKawasaki\n,"\n€ 3.950,-\n",25.000 km,25 kW,\nGebruikt\n,\n1996\n,\nBenzine\n,\n805 cm³\n
1,\nKawasaki\n,"\n€ 4.395,-\n",- km,11 kW,\nNieuw\n,\n2021\n,\nBenzine\n,\n125 cm³\n
2,\nKawasaki\n,"\n€ 4.695,-\n",- km,11 kW,\nNieuw\n,\n2021\n,\nBenzine\n,\n125 cm³\n
3,\nKawasaki\n,"\n€ 5.595,-\n",33.000 km,47 kW,\nGebruikt\n,\n1999\n,\nBenzine\n,\n1.471 cm³\n
4,\nKawasaki\n,"\n€ 5.989,-\n",48.161 km,140 kW,\nGebruikt\n,\n2007\n,\nBenzine\n,\n1.352 cm³\n


In [None]:
df.shape

(1986, 8)
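
The scraped columns are strings with newlines, currency symbols, units, and dots as thousands separators, while the CSV loaded in the next cell holds numeric values. The cleaning step itself is not shown in this notebook, so the following is only a sketch of how it could be done. The helper names are hypothetical, and mapping `- km` to `0.0` is inferred from the cleaned data rather than confirmed.

```python
import re


def parse_number(text):
    """Extract the first number from a scraped string.

    Dots are treated as thousands separators ('1.471 cm³' -> 1471.0).
    Returns None when the string contains no digits (e.g. '- km').
    """
    match = re.search(r"[0-9][0-9.]*", text)
    if match is None:
        return None
    return float(match.group().replace(".", ""))


def clean_row(raw):
    """Convert one raw scraped row (a dict of strings) to typed values."""
    mileage = parse_number(raw["mileage"])
    return {
        "brand": raw["brand"].strip(),
        "price": int(parse_number(raw["price"])),
        "mileage": 0.0 if mileage is None else mileage,
        "power": parse_number(raw["power"]),
        "new": raw["new"].strip(),
        "year": parse_number(raw["year"]),
        "fuel": raw["fuel"].strip(),
        "cc": parse_number(raw["cc"]),
    }


raw_row = {
    "brand": "\nKawasaki\n", "price": "\n€ 3.950,-\n",
    "mileage": "25.000 km", "power": "25 kW", "new": "\nGebruikt\n",
    "year": "\n1996\n", "fuel": "\nBenzine\n", "cc": "\n805 cm³\n",
}
print(clean_row(raw_row))
```

With pandas, the same function could be applied row by row, for example through `df.apply(clean_row, axis=1, result_type="expand")`.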

In [None]:
df_csv = pd.read_csv("https://raw.githubusercontent.com/TuringCollegeSubmissions/lcramw-DS.2.4/master/data/data.csv?token=AFU2SI7QEKRMHCAKGPPYG6DAHEGSO")

In [None]:
df_csv.head()

Unnamed: 0,brand,price,mileage,power,new,year,fuel,cc
0,Kawasaki,3950,25000.0,25.0,Gebruikt,1996.0,Benzine,805.0
1,Kawasaki,4395,0.0,11.0,Nieuw,2021.0,0,125.0
2,Kawasaki,4695,0.0,11.0,Nieuw,2021.0,0,125.0
3,Kawasaki,5595,33000.0,47.0,Gebruikt,1999.0,Benzine,1471.0
4,Kawasaki,5989,48161.0,140.0,Gebruikt,2007.0,Benzine,1352.0


#### Linear regression

In [None]:
# Model and recursive feature elimination
from sklearn.linear_model import LinearRegression
from sklearn.feature_selection import RFE

# Train/test split, cross-validation folds, and grid search
from sklearn.model_selection import train_test_split, KFold, GridSearchCV

In [None]:
df = df_csv.reset_index()
features = ['brand', 'new', 'mileage', 'power',
            'year', 'fuel', 'cc']
# One-hot encode the categorical columns (brand, new, fuel);
# drop_first avoids perfectly collinear dummy columns
x = pd.get_dummies(data=df[features], drop_first=True)
y = df[['price']]

# Hold out 20% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

# step-1: create a cross-validation scheme
folds = KFold(n_splits=4, shuffle=True, random_state=100)

# step-2: specify range of hyperparameters to tune
# (the number of features RFE keeps, from 1 to 8)
hyper_params = [{'n_features_to_select': list(range(1, 9))}]

# step-3: perform grid search
# 3.1 specify model; RFE clones and fits the estimator itself,
# so the estimator does not need to be fitted beforehand
lm = LinearRegression()
rfe = RFE(lm)

# 3.2 call GridSearchCV()
model_cv = GridSearchCV(estimator=rfe,
                        param_grid=hyper_params,
                        scoring='r2',
                        cv=folds,
                        verbose=1,
                        return_train_score=True)

# fit the model
model_cv.fit(X_train, y_train)

In [None]:
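# Note: this scores on the full dataset (training rows included);
# model_cv.score(X_test, y_test) would give a held-out estimate
r_sq = model_cv.score(x, y)
print(r_sq)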

0.05048379617493415
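
For reference, R² compares the model's squared error with the error of always predicting the mean price, so a value near 0.05 means the model explains only about 5% of the variance in price. Below is a sketch of the computation in plain Python. The function name is hypothetical and the numbers are toy values; scikit-learn's `r2_score` computes the same quantity.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    # Squared error of the model's predictions
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Squared error of a baseline that always predicts the mean
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


print(r_squared([1, 2, 3], [1, 2, 3]))  # perfect predictions
print(r_squared([1, 2, 3], [2, 2, 2]))  # no better than predicting the mean
```

A negative value is possible as well and means the model does worse than the mean baseline.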


In [None]:
# Predictions
predictions = model_cv.predict(X_test)

In [None]:
predictions[0:10]

array([ 2118.08557673,  5501.91306122,  2326.9668198 ,  2419.32122075,
        5501.91306122,  2118.08557673, 10191.9930595 ,  2326.9668198 ,
        2118.08557673,   750.30095222])

In [None]:
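# Note: these are the first 10 rows of the full target, not the targets of
# the test rows predicted above; y_test[0:10] lines up with the predictions
y[0:10]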

Unnamed: 0,price
0,3950
1,4395
2,4695
3,5595
4,5989
5,6289
6,6489
7,6990
8,7489
9,7689


In [None]:
ar = X_test.to_numpy()

In [None]:
ar[9]

array([1.000e+04, 7.000e+00, 1.993e+03, 5.000e+01, 0.000e+00, 1.000e+00,
       0.000e+00, 0.000e+00, 0.000e+00, 0.000e+00, 0.000e+00, 0.000e+00,
       0.000e+00, 0.000e+00, 1.000e+00])



---



In [None]:
import pickle

# TO SAVE MODEL TO FILE
with open("classifier.pkl", "wb") as f:
    pickle.dump(model_cv, f)

# TO LOAD MODEL FROM FILE
with open("classifier.pkl", "rb") as f:
    clf = pickle.load(f)
clf

GridSearchCV(cv=KFold(n_splits=4, random_state=100, shuffle=True),
             error_score=nan,
             estimator=RFE(estimator=LinearRegression(copy_X=True,
                                                      fit_intercept=True,
                                                      n_jobs=None,
                                                      normalize=False),
                           n_features_to_select=None, step=1, verbose=0),
             iid='deprecated', n_jobs=None,
             param_grid=[{'n_features_to_select': [1, 2, 3, 4, 5, 6, 7, 8]}],
             pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
             scoring='r2', verbose=1)

In [None]:
import pickle
import threading
import json

from flask import Flask, request
from werkzeug import serving
import numpy as np

# The database URL contains credentials, so read it from an environment
# variable instead of hard-coding it (needs `import os` and `import psycopg2`):
# db_connection = psycopg2.connect(os.environ["DATABASE_URL"])

# DEFINING PATH TO THE SAVED MODEL'S .pkl FILE
SAVED_MODEL_PATH = "classifier.pkl"

# LOADING THE MODEL FROM FILE
with open(SAVED_MODEL_PATH, "rb") as f:
    classifier = pickle.load(f)

app = Flask(__name__)

# CREATING A PROCESSING FUNCTION TO TRANSFORM INPUTS TO THE
# EXPECTED FORMAT
def __process_input(request_data: bytes) -> np.ndarray:
    # Wrap the feature list in an outer list to get a 2-D array (one row)
    return np.asarray([json.loads(request_data)["inputs"]])

# CREATING ROUTE FOR MODEL PREDICTION
@app.route("/predict", methods=["POST"])
def predict():
    try:
        # Parse inside the try block so malformed requests also return 400
        input_params = __process_input(request.data)
        prediction = classifier.predict(input_params)
    except Exception:
        return json.dumps({"error": "PREDICTION FAILED"}), 400

    return json.dumps({"predicted_price": int(prediction[0])})


t = threading.Thread(target=serving.run_simple, args=('localhost', 9000, app))
t.start()

 * Running on http://localhost:9000/ (Press CTRL+C to quit)


In [None]:
import requests
import json

# MAKING PREDICTION WITH FEATURES AS INPUTS
resp = requests.post("http://localhost:9000/predict", data=json.dumps({"inputs": [4.60000000e+04, 2.50000000e+01, 1.98500000e+03, 7.86409627e+02,
       0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 1.00000000e+00,
       0.00000000e+00, 1.00000000e+00, 0.00000000e+00, 0.00000000e+00,
       0.00000000e+00, 0.00000000e+00, 0.00000000e+00]}))
json.loads(resp.text)

127.0.0.1 - - [19/Feb/2021 15:30:32] "POST /predict HTTP/1.1" 200 -


{'predicted_price': 2118}

In [None]:
# RESTARTING KERNEL TO STOP THE BACKGROUND THREAD
import os
os._exit(0)