# Data Science Aptitude Test - Daimler Group - Part II

- date: yyyy.MM.dd
- author: [name](https://www.linkedin.com/in/)
***
## 1. Background 

The following test consists of one scenario in which you will analyze a data set and train a model on it. The data set contains the year, price, transmission, mileage, fuel type, and engine size of used C-class cars. The goal is to train a model that **predicts** the market price of a used *C-class car*.

### Data Science Workflow stages

The competition solution workflow goes through ten stages described in the Data Science Solutions book.

1. Check Environment & libraries version.
2. Import Libraries.
3. Question or problem definition.
4. Acquire dataset or training and testing data.
5. Analyze, identify patterns, and explore the data.
6. Wrangle, prepare, cleanse the data.
7. Feature Engineering
8. **Model, predict and MLOps.**
9. **Visualize, report, and present the problem solving steps and final solution.**
10. Create the AI pipeline on the cloud, describe the solution architecture, and show the results.

Additional substeps may exist within Data Processing and Data Modeling in order to train and deploy several combinations of machine learning and/or deep learning pipelines.

***
## 2. This notebook covers only:

1. [ ] Upload the csv file to your workspace and load it into a data frame.
2. [ ] Look for null values and outliers. Remove, keep or impute them and explain why you did so.
3. [ ] Show the main statistics (mean, standard deviation…) of the numerical columns of the data set. Are any of the variables skewed? (You can use any visualization you need to answer the last question).
4. [X] Train a model for the prediction of the price. Explain why you chose the model that you have trained.
5. [X] Test the model and obtain some performance metrics from it. Would you say that the model has a good performance? Why?
6. [X] Would you say that you have enough information to predict the price of an EQC (electric C-class)? Why?

## 3. Loading data from Parquet table

In [0]:
%sql


price,transmission,mileage,fuelType,engineSize,carAge
30495,Automatic,1200,Diesel,2.0,1.0
29989,Automatic,1000,Petrol,1.5,1.0
37899,Automatic,500,Diesel,2.0,1.0


In [0]:
%sql


database,tableName,isTemporary
default,cclass_cleaned,False


In [0]:
%sql


database,tableName,isTemporary,information
default,cclass_cleaned,False,Database: default Table: cclass_cleaned Owner: root Created Time: Tue Jul 20 10:50:59 UTC 2021 Last Access: UNKNOWN Created By: Spark 3.1.1 Type: MANAGED Provider: parquet Statistics: 37055 bytes Location: dbfs:/user/hive/warehouse/cclass_cleaned Serde Library: org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe InputFormat: org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat OutputFormat: org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat Schema: root  |-- price: integer (nullable = true)  |-- transmission: string (nullable = true)  |-- mileage: integer (nullable = true)  |-- fuelType: string (nullable = true)  |-- engineSize: string (nullable = true)  |-- carAge: double (nullable = true)


In [0]:
%scala
// Load dataframe from parquet table in Hive 


In [0]:
%scala


price,transmission,mileage,fuelType,engineSize,carAge
30495,Automatic,1200,Diesel,2.0,1.0
29989,Automatic,1000,Petrol,1.5,1.0
37899,Automatic,500,Diesel,2.0,1.0
30399,Automatic,5000,Diesel,2.0,2.0
29899,Automatic,4500,Diesel,2.0,2.0
35999,Automatic,500,Diesel,2.0,1.0
37990,Automatic,1412,Petrol,3.0,2.0
28990,Automatic,3569,Diesel,2.0,2.0
28990,Automatic,3635,Diesel,2.0,2.0
9995,Automatic,44900,Petrol,1.6,8.0


***
## 4. Data Modeling

### Working With Categorical Data

Now that we've prepared everything we're going to run, we need to set up the transformations that turn the data into something our model can use. There are a couple of changes that we're going to make.

Because we have categorical data in several of our columns, we're going to have to convert those into a numerical representation.

In [0]:
%scala
// importing scala libraries



### Creating Training and Test Data Sets

Before we begin feature engineering and modeling, we will divide our data set into two groups: train and test. Depending on the size of your data set, your train/test ratio may vary, but many data scientists use 80/20 as a standard train/test split.
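To make the 80/20 idea concrete, here is a minimal sketch in plain Python, not the Spark call used in the notebook. The helper `train_test_split` and the toy `rows` are hypothetical. Note that Spark's `randomSplit` samples rows independently, so its split sizes are only approximately 80/20, whereas this sketch cuts a shuffled list at an exact position.

```python
import random

def train_test_split(rows, train_fraction=0.8, seed=42):
    """Shuffle a copy of `rows` with a fixed seed and cut it at `train_fraction`."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Hypothetical rows: (price, mileage) pairs standing in for the real table.
rows = [(30000 - 100 * i, 1000 * i) for i in range(50)]
train, test = train_test_split(rows)
print(len(train), len(test))
```

Fixing the seed makes the split repeatable, which matters when several models are compared on the same test set.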

In [0]:
%scala


> The dataset is split into train and test parts in an 80/20 proportion.

### Data Preparation

Let's go ahead and index all of our categorical features, and set our label to be `log(price)`.
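As a small illustration of what a `log(price)` label implies, the sketch below uses plain Python and a few prices taken from the table above. It assumes the natural logarithm. It shows that predictions made in log space are converted back with `exp`, and that a fixed relative error (here 10%) corresponds to the same log-space error regardless of the price level.

```python
import math

prices = [30495, 29989, 37899, 9995]  # sample prices from the table above
log_prices = [math.log(p) for p in prices]

# A model trained on log(price) predicts in log space; exponentiate to get prices back.
recovered = [math.exp(v) for v in log_prices]

# Overestimating by 10% gives the same log-space error for cheap and expensive cars.
log_errors = [math.log(1.1 * p) - math.log(p) for p in prices]
print(recovered, log_errors)
```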

In [0]:
%scala


### One-hot encoding

Let’s take a look at how to build a slightly more complex pipeline that incorporates all of our numeric and categorical features.

Most machine learning models in MLlib expect numerical values as input, represented as vectors. To convert categorical values into numeric values, we can use a technique called one-hot encoding (OHE).

There are a few ways to one-hot encode your data with Spark. A common approach is to use the StringIndexer and OneHotEncoder. With this approach, the first step is to apply the StringIndexer estimator to convert categorical values into category indices. These category indices are ordered by label frequencies, so the most frequent label gets index 0, which provides us with reproducible results across various runs of the same data.
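The two steps described above can be sketched in plain Python. This is an illustration of the idea, not the Spark implementation: the helpers `fit_string_indexer` and `one_hot` are hypothetical, the fuel type values are made up, and ties are broken alphabetically here as an assumption. Unlike this sketch, Spark's `OneHotEncoder` drops the last category by default.

```python
from collections import Counter

def fit_string_indexer(values):
    """Map each label to an index, most frequent label first (ties broken alphabetically)."""
    counts = Counter(values)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: index for index, label in enumerate(ordered)}

def one_hot(index, size):
    """Return a list of `size` zeros with a single 1.0 at position `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]

# Hypothetical fuel type column.
fuel_types = ["Diesel", "Petrol", "Diesel", "Diesel", "Hybrid", "Petrol"]
index_of = fit_string_indexer(fuel_types)
encoded = [one_hot(index_of[f], len(index_of)) for f in fuel_types]
print(index_of)
print(encoded[0])
```

Because the index depends only on label frequencies, running the indexer again on the same data yields the same mapping.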

In [0]:
%scala


### Creating a Pipeline

If we want to apply our model to our test set, then we need to prepare that data in the same way as the training set (i.e., pass it through the vector assembler). Oftentimes data preparation pipelines will have multiple steps, and it becomes cumbersome to remember not only which steps to apply, but also the ordering of the steps.
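The sketch below shows, in plain Python, why a pipeline helps: the stages are fitted once on the training data, in order, and then replayed unchanged on the test data. The classes `MeanImputer`, `Scaler`, and `SimplePipeline` and the mileage values are hypothetical stand-ins. In Spark, `Pipeline.fit` returns a separate `PipelineModel` instead of the fitted object itself.

```python
class MeanImputer:
    """Learns the training mean, then fills missing values with it."""
    def fit(self, values):
        present = [v for v in values if v is not None]
        self.mean = sum(present) / len(present)
        return self

    def transform(self, values):
        return [self.mean if v is None else v for v in values]

class Scaler:
    """Learns the training maximum, then divides every value by it."""
    def fit(self, values):
        self.max = max(values)
        return self

    def transform(self, values):
        return [v / self.max for v in values]

class SimplePipeline:
    """Fits each stage on the output of the previous one and replays them in order."""
    def __init__(self, stages):
        self.stages = stages

    def fit(self, values):
        for stage in self.stages:
            values = stage.fit(values).transform(values)
        return self

    def transform(self, values):
        for stage in self.stages:
            values = stage.transform(values)
        return values

train_mileage = [1000, None, 3000, 4000]  # hypothetical values
test_mileage = [None, 2000]
pipeline = SimplePipeline([MeanImputer(), Scaler()]).fit(train_mileage)
prepared_test = pipeline.transform(test_mileage)
print(prepared_test)
```

The test values are imputed and scaled with statistics learned from the training data only, which is the behavior we want when evaluating on held-out data.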

In [0]:
%scala


### Build a baseline model

This task seems well suited to a random forest regressor, since the target (price) is continuous and there may be interactions between multiple variables.

The following code sets up a simple baseline using scikit-learn. It uses MLflow to keep track of the model's performance, and to save the model for later use.

In [0]:
import mlflow
import mlflow.pyfunc
import mlflow.sklearn
from mlflow.models.signature import infer_signature
from mlflow.utils.environment import _mlflow_conda_env
import cloudpickle
import time

## 4. XGBoost

Distributed XGBoost with Spark only has a Scala API, so we are going to create views of our DataFrames to use in Scala, and save our (untrained) pipeline so that we can load it in Scala.

XGBoost is a specific implementation of the gradient boosting method that uses more accurate approximations to find the best tree model. It employs a number of tricks that make it exceptionally successful, particularly with structured data. XGBoost has additional advantages: training is very fast and can be parallelized and distributed across clusters. For these reasons, XGBoost is one of the models used in this study.

In [0]:
%scala


### Training and Evaluation

Because we specified several variations of the model, we need a way to evaluate them. To do this, we set up a regression evaluator and perform a train/test split, so that we can train each model (each one with a different set of parameters) and evaluate all of them in the same, automated way.

In [0]:
%scala
// Let's load in our data/pipeline that we defined in Python. 
import org.apache.spark.ml.Pipeline


> Now we are ready to train our XGBoost model!

In [0]:
%scala

import ml.dmlc.xgboost4j.scala.spark._
import org.apache.spark.sql.functions._



### Evaluate

Now we can evaluate how well our XGBoost model performed.

In [0]:
%scala


## 5. Evaluate ML Model

The following statistics are shown for your model:

- Mean Absolute Error (MAE): The average of absolute errors. An error is the difference between the predicted value and the actual value.
- Root Mean Squared Error (RMSE): The square root of the average of squared errors of predictions made on the test dataset.
- Relative Absolute Error: The average of absolute errors relative to the absolute difference between actual values and the average of all actual values.
- Relative Squared Error: The average of squared errors relative to the squared difference between the actual values and the average of all actual values.
- Coefficient of Determination: Also known as the R squared value, this statistical metric indicates how well a model fits the data.

For each of the error statistics, smaller is better. A smaller value indicates that the predictions are closer to the actual values. For the coefficient of determination, the closer its value is to one (1.0), the better the predictions.
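The five statistics listed above can be computed directly from their definitions. The following plain-Python sketch uses a hypothetical helper `regression_metrics` and toy numbers, not the notebook's actual predictions.

```python
import math

def regression_metrics(actual, predicted):
    """Compute MAE, RMSE, relative absolute/squared error, and R^2."""
    n = len(actual)
    mean_actual = sum(actual) / n
    abs_errors = [abs(a - p) for a, p in zip(actual, predicted)]
    sq_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    abs_dev = sum(abs(a - mean_actual) for a in actual)
    sq_dev = sum((a - mean_actual) ** 2 for a in actual)
    return {
        "mae": sum(abs_errors) / n,
        "rmse": math.sqrt(sum(sq_errors) / n),
        "rae": sum(abs_errors) / abs_dev,
        "rse": sum(sq_errors) / sq_dev,
        "r2": 1 - sum(sq_errors) / sq_dev,
    }

# Toy values for illustration only.
actual = [10.0, 20.0, 30.0, 40.0]
predicted = [12.0, 18.0, 33.0, 37.0]
metrics = regression_metrics(actual, predicted)
print(metrics)
```

Note that the relative squared error and the coefficient of determination always sum to one, which is why a smaller error and a larger R^2 tell the same story.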

In [0]:
%scala


// Shows results and evaluation metrics


> The results are not bad: R^2 is 0.87.

In [0]:
%scala
// Shows first 20 predictions


### Export Model

In [0]:
%scala


### Question: What happens if you change your cluster configuration?

To test this out, try spinning up a cluster with just one worker, and another with two workers. NOTE: This data is quite small (one partition), and you will need to test it out with a larger dataset (e.g. 2+ partitions). However, in this code below, we will simply repartition our data to simulate how it could have been partitioned differently on a different cluster configuration, and see if we get the same number of data points in our training set.
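The effect can be imitated without a cluster. The sketch below is a simplified simulation in plain Python, not Spark's actual sampling code: the helper `seeded_split` is hypothetical, rows are dealt round-robin into partitions, and each partition is sampled by its own generator seeded with `seed + partition index` (an assumption of this sketch). With the same seed, changing the number of partitions changes which rows land in the training set.

```python
import random

def seeded_split(rows, num_partitions, train_fraction=0.8, seed=42):
    """Deal rows into partitions, then sample each partition with its own seeded generator."""
    partitions = [rows[i::num_partitions] for i in range(num_partitions)]
    train = []
    for index, partition in enumerate(partitions):
        rng = random.Random(seed + index)  # assumption: one generator per partition
        train.extend(row for row in partition if rng.random() < train_fraction)
    return train

rows = list(range(1000))  # hypothetical row ids
train_one = seeded_split(rows, num_partitions=1)
train_two = seeded_split(rows, num_partitions=2)
print(len(train_one), len(train_two))
print(set(train_one) == set(train_two))
```

If reproducibility across cluster configurations matters, one option is to split the data once and persist the resulting training and test sets.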

## 6. Linear Regression

Now that we have prepared our data, we can use the `LinearRegression` estimator to build our first model [Python](https://spark.apache.org/docs/latest/api/python/pyspark.ml.html#pyspark.ml.regression.LinearRegression)/[Scala](https://spark.apache.org/docs/latest/api/scala/#org.apache.spark.ml.regression.LinearRegression). Estimators accept a DataFrame as input and return a model, and have a `.fit()` method.

### Vector Assembler

Linear Regression expects a column of Vector type as input.

We can easily get the values from the features column into a single vector using VectorAssembler Python/Scala. VectorAssembler is an example of a transformer. Transformers take in a DataFrame, and return a new DataFrame with one or more columns appended to it. They do not learn from your data, but apply rule based transformations.
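To illustrate what a transformer such as VectorAssembler does, here is a plain-Python sketch with a hypothetical `assemble` helper working on rows represented as dictionaries. It appends a new `features` column built from the chosen input columns, leaves the original rows untouched, and learns nothing from the data. The two rows are taken from the table shown earlier.

```python
def assemble(rows, input_cols, output_col="features"):
    """Return new rows with `output_col` appended; the input rows are left unchanged."""
    return [{**row, output_col: [float(row[c]) for c in input_cols]} for row in rows]

rows = [
    {"price": 30495, "mileage": 1200, "carAge": 1.0},
    {"price": 9995, "mileage": 44900, "carAge": 8.0},
]
assembled = assemble(rows, ["mileage", "carAge"])
print(assembled[0]["features"])
```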

In [0]:
%scala


In [0]:
%scala



### Apply to Test set

In [0]:
%scala


In [0]:
%scala
// Shows first 20 predictions


### Inspect the model

In [0]:
%scala


// Shows results and evaluation metrics


> Linear regression R^2 result is 0.62, which is worse than the XGBoost result (0.87).

Despite the name $R^2$ containing “squared,” $R^2$ values range from negative infinity to 1. Let’s take a look at the math behind this metric. $R^2$ is computed as follows:

$$R^2 = 1 - \frac{SS_{res}}{SS_{tot}}$$

where $SS_{tot}$ is the total sum of squares if you always predict $\bar{y}$:

$$SS_{tot} = \sum_{i=1}^{n} (y_i - \bar{y})^2$$

and $SS_{res}$ is the sum of residuals squared from your model predictions (also known as the sum of squared errors, which we used to compute the RMSE):

$$SS_{res} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$

If your model perfectly predicts every data point, then your $SS_{res} = 0$, making your $R^2 = 1$. And if your $SS_{res} = SS_{tot}$, then the fraction is 1/1, so your $R^2$ is 0. This is what happens if your model performs the same as always predicting the average value, $\bar{y}$.

But what if your model performs worse than always predicting $\bar{y}$ and your $SS_{res}$ is really large? Then your $R^2$ can actually be negative! If your $R^2$ is negative, you should reevaluate your modeling process. The nice thing about using $R^2$ is that you don’t necessarily need to define a baseline model to compare against.
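The three cases discussed above (perfect predictions, always predicting the mean, and doing worse than the mean) can be checked with a few lines of plain Python. The helper `r_squared` and the toy values are illustrative only.

```python
def r_squared(actual, predicted):
    """R^2 = 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

actual = [1.0, 2.0, 3.0]
perfect = r_squared(actual, [1.0, 2.0, 3.0])    # SS_res = 0
mean_only = r_squared(actual, [2.0, 2.0, 2.0])  # SS_res = SS_tot
worse = r_squared(actual, [3.0, 2.0, 1.0])      # SS_res > SS_tot
print(perfect, mean_only, worse)  # 1.0 0.0 -3.0
```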

> So how do we know if 5271.5 is a good value for the RMSE? 

There are various ways to interpret this value, one of which is to build a simple baseline model and compute its RMSE to compare against. A common baseline model for regression tasks is to compute the average value of the label on the training set, ȳ (pronounced y-bar), then predict ȳ for every record in the test data set and compute the resulting RMSE.
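A minimal sketch of such a baseline in plain Python follows. The prices are a handful of rows from the table shown earlier, arbitrarily divided into "train" and "test" for illustration; they are not the notebook's actual split, so the resulting number is not comparable to the RMSE reported above.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error between two equally long lists."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

# Illustrative split of rows from the table above (not the real train/test split).
train_prices = [30495, 29989, 37899, 30399, 29899]
test_prices = [35999, 37990, 28990]

baseline = sum(train_prices) / len(train_prices)  # y-bar from the training set
baseline_rmse = rmse(test_prices, [baseline] * len(test_prices))
print(baseline, baseline_rmse)
```

A model is only useful if its RMSE on the same test set is clearly below the baseline's.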

***
## Decision Tree

In the previous part, you were working with the parametric model, Linear Regression. We could do some more hyperparameter tuning with the linear regression model, but we're going to try tree based methods and see if our performance improves.

### Create Regressor

Now let's build a `DecisionTreeRegressor` with the default hyperparameters [Python](https://spark.apache.org/docs/latest/api/python/pyspark.ml.html#pyspark.ml.regression.DecisionTreeRegressor)/[Scala](https://spark.apache.org/docs/latest/api/scala/#org.apache.spark.ml.regression.DecisionTreeRegressor).

In [0]:
%scala


### Fit Pipeline

In [0]:
%scala
import org.apache.spark.ml.Pipeline

// Combine stages into pipeline


// Uncomment to perform fit
// val pipelineModel = pipeline.fit(trainDF) 

#### maxBins

What is this parameter [maxBins](https://spark.apache.org/docs/latest/api/python/pyspark.ml.html#pyspark.ml.regression.DecisionTreeRegressor.maxBins)? Let's take a look at the PLANET implementation of distributed decision trees (which Spark uses) and compare it to this paper called [Yggdrasil](https://cs.stanford.edu/~matei/papers/2016/nips_yggdrasil.pdf) by Matei Zaharia and others. This will help explain the `maxBins` parameter.

In Spark, data is partitioned by row. So when it needs to make a split, each worker has to compute summary statistics for every feature for each split point. Then these summary statistics have to be aggregated (via tree reduce) for a split to be made.

Think about it: what if worker 1 had the value `32` but none of the others had it? How could you communicate how good a split that would be? So, Spark has a `maxBins` parameter for discretizing continuous variables into buckets, but the number of buckets has to be at least as large as the number of categories in any categorical feature.
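The sketch below illustrates the bucketing idea in plain Python. It is a simplification, not Spark's algorithm: the helpers `candidate_thresholds` and `check_max_bins`, the mileage values, and the 40-category `model` column are all hypothetical. Continuous values are reduced to a small, shared set of candidate split points, and a separate check mirrors the requirement that the bin count covers the largest categorical feature.

```python
def candidate_thresholds(values, max_bins):
    """Pick at most `max_bins - 1` split thresholds at evenly spaced ranks of the sorted values."""
    ordered = sorted(set(values))
    if len(ordered) <= max_bins:
        # Few distinct values: every midpoint between neighbours is a candidate.
        return [(a + b) / 2 for a, b in zip(ordered, ordered[1:])]
    step = len(ordered) / max_bins
    return [ordered[int(step * i)] for i in range(1, max_bins)]

def check_max_bins(max_bins, category_counts):
    """Raise if `max_bins` is smaller than the largest number of categories."""
    largest = max(category_counts.values())
    if max_bins < largest:
        raise ValueError(f"maxBins={max_bins} is smaller than {largest} categories")

mileages = list(range(0, 100000, 1000))  # 100 distinct hypothetical values
thresholds = candidate_thresholds(mileages, max_bins=4)
print(thresholds)  # [25000, 50000, 75000]

check_max_bins(32, {"fuelType": 3})      # fine: 3 categories fit into 32 bins
try:
    check_max_bins(32, {"model": 40})    # hypothetical column with 40 categories
except ValueError as error:
    print(error)
```

With a fixed set of candidate thresholds, every worker can report statistics for the same split points, so the results can be aggregated.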

Let's go ahead and increase maxBins to `40`.

In [0]:
%scala


In [0]:
%scala


In [0]:
%scala


### Feature Importance

Let's go ahead and get the fitted decision tree model, and look at the feature importance scores.

In [0]:
%scala


### Interpreting Feature Importance

Hmmm... it's a little hard to know what feature 4 vs 11 is. Given that the feature importance scores are "small data", let's use Pandas to help us recover the original column names.
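Recovering the names amounts to pairing each score with the column that occupies the same position in the feature vector. The plain-Python sketch below shows the idea; the column order and the importance scores are hypothetical placeholders, not the values produced by the fitted model.

```python
# Assumed order of the columns in the feature vector (hypothetical).
feature_names = ["transmission", "fuelType", "engineSize", "carAge"]
# Made-up importance scores for illustration only.
importances = [0.05, 0.10, 0.25, 0.60]

# Pair each name with its score and sort from most to least important.
ranked = sorted(zip(feature_names, importances), key=lambda pair: pair[1], reverse=True)
for name, score in ranked:
    print(f"{name}: {score:.2f}")
```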

In [0]:
%scala


> Yes, carAge is the most important feature in this machine learning problem.

### Apply model to test set

In [0]:
%scala


features,price,prediction
"Map(vectorType -> dense, length -> 4, values -> List(0.0, 1.0, 5.0, 2.0))",67495,66016.05263157895
"Map(vectorType -> dense, length -> 4, values -> List(1.0, 1.0, 5.0, 2.0))",67000,60355.52941176471
"Map(vectorType -> dense, length -> 4, values -> List(1.0, 1.0, 5.0, 2.0))",66990,60355.52941176471
"Map(vectorType -> dense, length -> 4, values -> List(1.0, 1.0, 5.0, 2.0))",61495,60355.52941176471
"Map(vectorType -> dense, length -> 4, values -> List(0.0, 1.0, 5.0, 2.0))",57998,66016.05263157895
"Map(vectorType -> dense, length -> 4, values -> List(1.0, 1.0, 5.0, 2.0))",53990,60355.52941176471
"Map(vectorType -> dense, length -> 4, values -> List(1.0, 1.0, 4.0, 1.0))",49998,46956.66666666666
"Map(vectorType -> dense, length -> 4, values -> List(0.0, 1.0, 4.0, 2.0))",49499,41762.53424657534
"Map(vectorType -> dense, length -> 4, values -> List(0.0, 1.0, 4.0, 2.0))",49499,41762.53424657534
"Map(vectorType -> dense, length -> 4, values -> List(1.0, 1.0, 5.0, 2.0))",49280,60355.52941176471


In [0]:
%scala


> What will a decision tree predict?

It turns out decision trees cannot predict any values larger than they were trained on. The max value in our training set was €66,000, so we can't predict any values larger than that (or, technically, any values larger than the largest average label value in a leaf node).
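A one-split regression tree in plain Python makes this limit visible. The helper `fit_stump`, the threshold, and the toy ages and prices are hypothetical. Each leaf predicts the mean label of its training rows, so no input, however extreme, can produce a prediction above the largest leaf mean.

```python
def fit_stump(xs, ys, threshold):
    """One-split regression tree: each leaf predicts the mean label of its training rows."""
    left = [y for x, y in zip(xs, ys) if x <= threshold]
    right = [y for x, y in zip(xs, ys) if x > threshold]
    left_mean, right_mean = sum(left) / len(left), sum(right) / len(right)
    return lambda x: left_mean if x <= threshold else right_mean

# Toy training data for illustration only.
car_age = [1.0, 1.0, 2.0, 8.0, 9.0]
price = [66000, 60000, 58000, 10000, 8000]
predict = fit_stump(car_age, price, threshold=4.0)

# A brand-new car (age 0) still gets the mean of the "young" leaf, below the max label.
print(predict(0.0), max(price))
print(predict(100.0))
```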

- **R^2** is 0.85, close to the XGBoost result (0.87), which is expected since both models are based on decision trees.

As for **RMSE**, the first model (XGBoost) and the last model (decision tree) bring the error down to between 3,000 and 3,300, whereas the linear regression model is above 5,000. We did not normalize or scale the dataset; as next steps, we should apply PCA, scaling, and normalization and compare the results again.

## 7. Summary

By both R^2 and RMSE, the decision tree performs better than the linear regression model.
The next stages will be to check other features and to look at hyperparameter tuning and ensemble models to improve upon the performance of our single decision tree.

Several papers have used different models to predict used car prices. However, the dataset was relatively small for making a strong inference, because the number of observations was only 380962. Gathering more data can yield more robust predictions. Secondly, there could be more features that are good predictors. For example, here are some variables that might improve the model: number of doors, fuel economy (miles per gallon), color, mechanical and cosmetic reconditioning time, used-to-new ratio, appraisal-to-trade ratio.

Another point that has room for improvement is that the data cleaning process can be done more rigorously with the help of more technical information. For example, instead of using the ‘ffill’ method, there might be indicators that help to fill missing values more meaningfully.

As a suggestion for further studies, a one-hot encoder can be used instead of a label encoder while pre-processing the data. Thus, all non-numeric features can be converted to nominal data instead of ordinal data (Raschka & Mirjalili, 2017). This may cause a serious change in the performance of the predictive models. Also, a standard scaler can be applied instead of a min-max scaler and the results compared. Different scalers can be checked to see whether they improve the prediction power of the models.

## 8. Would you say that you have enough information to predict the price of an EQC (electric C-class)? Why?

> Only in part: the few features available here are not enough. After scraping **autoscout.de** (the notebook is attached), we found more variables for the same C-Klasse model and, separately, for EQC cars. Apart from mileage and age, which are the most important determinants of a used car's price, there are other relevant amenities as well as features specific to electric cars. These differ from the features in our dataset. After a thorough analysis of these features and their correlations, we may be able to build a similar machine learning model, tune its hyperparameters, or create an ensemble model.