<div style="text-align: center">
<img src="https://www.frenchscienceindia.org/wp-content/uploads/2017/02/Logo-Univ-Paris-Saclay.png" width="150px">
</div>

<div style="text-align: center">

# [Paris Saclay Center for Data Science](http://www.datascience-paris-saclay.fr)

# Predict used car prices!


_des lauriers Cédric, Cornille Théo_

# Introduction

When shopping for a used vehicle, an overriding concern is typically: am I paying too much? This question is often difficult to answer because it is hard to keep track of all the vehicles of interest currently available on the market.

A second, related concern is: which vehicles with similar specifications are available? This information helps the buyer get a feel for what else is on the market and gives an indication of the value of the vehicle under consideration.

<img src="https://static.carfromjapan.com/wp-content/uploads/2016/08/tips_for_buying_a_used_car.png" width="500px">

In this project, we would like to build a tool that helps both used car buyers and used car sellers. It could tell buyers which price to expect for a given type of car, depending on the characteristics they enter, and it could help sellers adjust their asking price.
Thus, the goal of this project is to develop models able to predict the prices of used cars from their characteristics.
The solutions to this challenge should also give buyers some insight into what drives the price of a car.

## Metric used

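To obtain accurate price predictions for unseen used cars, the metric for this challenge is the mean squared error. The scores reported below use its square root (RMSE), which is expressed in the same unit as the price.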

<img src="https://qph.fs.quoracdn.net/main-qimg-008e40d98b5ce869d6b19c8eb9108178" width="300px">
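As a reminder, the mean squared error averages the squared differences between predicted and true prices, and its square root (RMSE) is in the same unit as the price. A minimal sketch with made-up prices (the values below are illustrative and not taken from the dataset):

```python
import numpy as np

# Hypothetical true and predicted prices, for illustration only
y_true = np.array([4400.0, 10200.0, 16800.0])
y_hat = np.array([4000.0, 11000.0, 16500.0])

# Mean squared error: average of the squared residuals
mse = np.mean((y_true - y_hat) ** 2)

# Root mean squared error: same unit as the price
rmse = np.sqrt(mse)

print("MSE:", mse)
print("RMSE:", rmse)
```

Because the errors are squared, a few badly mispriced cars weigh heavily on this metric.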

## The Data

The dataset we will work with comes from eBay. eBay is an American multinational e-commerce corporation based in San Jose, California, founded in 1995, that facilitates consumer-to-consumer and business-to-consumer sales through its website. Over 370,000 used car listings have been scraped with Scrapy.

As the content was in German, the data has been translated into English to make it easier to understand.

As inputs we have:
* name : "name" of the car
* seller : private or dealer
* price : the price on the ad to sell the car
* abtest
* vehicleType
* yearOfRegistration : at which year the car was first registered
* gearbox
* powerPS : power of the car in PS
* model
* kilometer : how many kilometers the car has driven
* monthOfRegistration : at which month the car was first registered
* fuelType
* brand
* notRepairedDamage : if the car has a damage which is not repaired yet
* postalCode
* dateCreated : the date on which the eBay ad was created
* dateCrawled : when this ad was first crawled; all field values are taken from this date
* lastSeen : when the crawler last saw this ad online

### Required dependencies and downloads

* `numpy`
* `pandas`
* `scikit-learn`
* `matplotlib`
* `seaborn`
* `imbalanced-learn`

You need to install the requirements to be able to run this notebook on your machine. Simply run the command below:

In [None]:
# !pip install -r requirements.txt

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

# I) Exploratory data analysis

### Loading the data

In [2]:
from problem import get_train_data

data_train, y_train = get_train_data()

In [3]:
print(data_train.shape)
data_train.head()

(210000, 15)


Unnamed: 0,seller,abtest,vehicleType,yearOfRegistration,gearbox,powerPS,model,kilometer,monthOfRegistration,fuelType,brand,notRepairedDamage,dateCreated,dateCrawled,lastSeen
0,private,control,suv,2003,manual,114,x_trail,150000,11,diesel,nissan,no,2016-03-12 00:00:00,2016-03-12 09:55:46,2016-03-13 01:17:39
1,private,control,small,2011,automatic,86,polo,90000,5,gasoline,volkswagen,no,2016-04-03 00:00:00,2016-04-03 13:55:50,2016-04-05 12:46:50
2,private,control,limousine,2010,automatic,136,e_klasse,150000,11,diesel,mercedes_benz,no,2016-03-11 00:00:00,2016-03-11 11:37:03,2016-04-07 05:44:58
3,private,test,convertible,1996,manual,193,3er,150000,3,gasoline,bmw,no,2016-03-28 00:00:00,2016-03-28 17:48:37,2016-04-02 22:47:01
4,private,control,,2017,,116,3er,150000,0,gasoline,bmw,no,2016-03-26 00:00:00,2016-03-26 12:47:25,2016-03-29 14:47:45


Values taken by the categorical columns:

In [4]:
cat_val = ["seller", "abtest", "gearbox", "fuelType", "notRepairedDamage", "vehicleType"]
for col in cat_val:
    print([col], ":", data_train[col].unique())

['seller'] : ['private' 'dealer']
['abtest'] : ['control' 'test']
['gearbox'] : ['manual' 'automatic' nan]
['fuelType'] : ['diesel' 'gasoline' nan 'lpg' 'hybrid' 'cng' 'electric' 'other']
['notRepairedDamage'] : ['no' nan 'yes']
['vehicleType'] : ['suv' 'small' 'limousine' 'convertible' nan 'estate' 'bus' 'coupe'
 'other']


In [5]:
data_train.describe()

Unnamed: 0,yearOfRegistration,powerPS,kilometer,monthOfRegistration
count,210000.0,210000.0,210000.0,210000.0
mean,2003.352757,114.158414,125797.5,5.831629
std,7.314191,70.155371,39533.808714,3.668086
min,1945.0,0.0,5000.0,0.0
25%,1999.0,75.0,100000.0,3.0
50%,2003.0,107.0,150000.0,6.0
75%,2008.0,150.0,150000.0,9.0
max,2017.0,800.0,150000.0,12.0


In [6]:
# We count the missing data for each variable
missing_values = data_train.isnull().sum()
missing_values

seller                     0
abtest                     0
vehicleType            17070
yearOfRegistration         0
gearbox                 9409
powerPS                    0
model                  10188
kilometer                  0
monthOfRegistration        0
fuelType               16041
brand                      0
notRepairedDamage      37650
dateCreated                0
dateCrawled                0
lastSeen                   0
dtype: int64
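
To put these counts in perspective, they can be expressed as a fraction of the rows. A small sketch on a toy DataFrame (the frame `toy` is a made-up stand-in for `data_train`):

```python
import numpy as np
import pandas as pd

# Toy stand-in for data_train, for illustration only
toy = pd.DataFrame({
    "gearbox": ["manual", "automatic", np.nan, "manual"],
    "fuelType": ["diesel", np.nan, np.nan, "gasoline"],
    "powerPS": [114, 86, 136, 193],
})

# isnull().mean() gives the fraction of missing values per column
missing_ratio = toy.isnull().mean().sort_values(ascending=False)
print(missing_ratio)
```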

Some values are missing (NaN values). They must be replaced before a model can be fitted, for example by zeros.

The function `clean_and_transform` from the file `problem.py` cleans the data and encodes the categorical columns as integers.

In [7]:
from problem import clean_and_transform

X_train = clean_and_transform(data_train)
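
The actual implementation lives in `problem.py` and is not reproduced here. As a rough idea of what such a step can involve, the sketch below fills missing categories with a placeholder label and encodes a categorical column as integer codes. The function name `encode_categories` and the placeholder `"unknown"` are assumptions made for illustration, not the code of `clean_and_transform`.

```python
import numpy as np
import pandas as pd


def encode_categories(df, columns, placeholder="unknown"):
    """Hypothetical helper: fill NaN with a placeholder, then integer-encode."""
    out = df.copy()
    for col in columns:
        filled = out[col].fillna(placeholder)
        # Category codes are assigned in sorted order of the labels
        out[col] = filled.astype("category").cat.codes
    return out


# Toy data, for illustration only
toy = pd.DataFrame({
    "gearbox": ["manual", "automatic", np.nan],
    "powerPS": [114, 86, 116],
})
encoded = encode_categories(toy, ["gearbox"])
print(encoded)
```

Note that computing the codes separately on two datasets can assign different integers to the same label, so in practice the mapping is best learned once on the training data and reused.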

In [8]:
X_train.head()

Unnamed: 0,seller,abtest,vehicleType,yearOfRegistration,gearbox,powerPS,model,kilometer,monthOfRegistration,fuelType,brand,notRepairedDamage,dateCreated,dateCrawled,lastSeen
0,1,0,8,2003,2,114,242,150000,11,2,23,1,2016-03-12,2016-03-12 09:55:46,2016-03-13 01:17:39
1,1,0,7,2011,0,86,174,90000,5,4,38,1,2016-04-03,2016-04-03 13:55:50,2016-04-05 12:46:50
2,1,0,5,2010,0,136,97,150000,11,2,20,1,2016-03-11,2016-03-11 11:37:03,2016-04-07 05:44:58
3,1,1,2,1996,2,193,11,150000,3,4,2,1,2016-03-28,2016-03-28 17:48:37,2016-04-02 22:47:01
4,1,0,0,2017,1,116,11,150000,0,4,2,1,2016-03-26,2016-03-26 12:47:25,2016-03-29 14:47:45


In [9]:
X_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 210000 entries, 0 to 209999
Data columns (total 15 columns):
seller                 210000 non-null int8
abtest                 210000 non-null int8
vehicleType            210000 non-null int8
yearOfRegistration     210000 non-null int64
gearbox                210000 non-null int8
powerPS                210000 non-null int64
model                  210000 non-null int16
kilometer              210000 non-null int64
monthOfRegistration    210000 non-null int64
fuelType               210000 non-null int8
brand                  210000 non-null int8
notRepairedDamage      210000 non-null int8
dateCreated            210000 non-null datetime64[ns]
dateCrawled            210000 non-null datetime64[ns]
lastSeen               210000 non-null datetime64[ns]
dtypes: datetime64[ns](3), int16(1), int64(4), int8(7)
memory usage: 13.0 MB


In [10]:
# prices for each vehicle (the target)
y_train

array([ 4400, 10200, 16800, ...,  4250,  3200, 29000], dtype=int64)

## Testing data

In order to evaluate the performance of the submissions, the dataset has been split into three parts (train, test and validation).

Only the train and test sets are provided in this starting kit.

The validation set (not provided) will be used as the hidden test data for the final evaluation.

The testing data can be loaded in the same way as the training data:

In [11]:
from problem import get_test_data

data_test, y_test = get_test_data()

In [12]:
data_test.head()

Unnamed: 0,seller,abtest,vehicleType,yearOfRegistration,gearbox,powerPS,model,kilometer,monthOfRegistration,fuelType,brand,notRepairedDamage,dateCreated,dateCrawled,lastSeen
0,private,control,limousine,2003,manual,0,polo,125000,12,gasoline,volkswagen,,2016-03-24 00:00:00,2016-03-24 22:55:07,2016-03-25 06:45:07
1,private,control,small,2009,automatic,71,fortwo,40000,9,gasoline,smart,no,2016-04-03 00:00:00,2016-04-03 12:53:07,2016-04-05 11:45:14
2,private,test,limousine,2008,manual,143,1er,100000,4,diesel,bmw,no,2016-03-26 00:00:00,2016-03-26 18:40:58,2016-04-06 07:44:57
3,private,test,small,1999,manual,60,polo,150000,3,gasoline,volkswagen,,2016-03-23 00:00:00,2016-03-23 18:36:17,2016-03-23 19:42:29
4,private,test,other,2004,manual,0,andere,150000,2,diesel,opel,no,2016-03-31 00:00:00,2016-03-31 21:46:15,2016-03-31 21:46:15


In [13]:
# clean and transform the test set
from problem import clean_and_transform

X_test = clean_and_transform(data_test)

In [14]:
X_test.head()

Unnamed: 0,seller,abtest,vehicleType,yearOfRegistration,gearbox,powerPS,model,kilometer,monthOfRegistration,fuelType,brand,notRepairedDamage,dateCreated,dateCrawled,lastSeen
0,0,0,5,2003,2,0,175,125000,12,4,38,0,2016-03-24,2016-03-24 22:55:07,2016-03-25 06:45:07
1,0,0,7,2009,0,71,109,40000,9,4,32,1,2016-04-03,2016-04-03 12:53:07,2016-04-05 11:45:14
2,0,1,5,2008,2,143,6,100000,4,2,2,1,2016-03-26,2016-03-26 18:40:58,2016-04-06 07:44:57
3,0,1,7,1999,2,60,175,150000,3,4,38,0,2016-03-23,2016-03-23 18:36:17,2016-03-23 19:42:29
4,0,1,6,2004,2,0,40,150000,2,2,24,1,2016-03-31,2016-03-31 21:46:15,2016-03-31 21:46:15


# Data Visualization

## Workflow

<img src="img/carrampchall.png">

# The model to submit

The submission consists of two files: `feature_extractor.py`, which defines a `FeatureExtractor` class, and `regressor.py`, which defines a `Regressor` class.

* FeatureExtractor can (optionally) hold code to calculate, filter or add additional features.
* Regressor fits the model and predicts on (new) data, as outputted by the Feature Extractor.
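
As an illustration of the "add additional features" option, the sketch below derives a vehicle age from `yearOfRegistration` and `dateCreated`. The class name `AgeFeatureExtractor` and the column `age` are hypothetical; this is one possible feature, not part of the starting kit.

```python
import pandas as pd


class AgeFeatureExtractor:
    """Hypothetical extractor adding the age of the car (in years) at ad creation."""

    def fit(self, X_df, y=None):
        # Nothing to learn for this feature
        return self

    def transform(self, X_df):
        X_new = X_df.copy()
        # Year in which the ad was created, taken from the date column
        ad_year = pd.to_datetime(X_new["dateCreated"]).dt.year
        X_new["age"] = ad_year - X_new["yearOfRegistration"]
        return X_new


# Toy data, for illustration only
toy = pd.DataFrame({
    "yearOfRegistration": [2003, 2011],
    "dateCreated": ["2016-03-12", "2016-04-03"],
})
print(AgeFeatureExtractor().fit(toy).transform(toy))
```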

### Feature extractor

An example FeatureExtractor, which drops the date columns and the `abtest` column, can be defined as follows:

In [21]:
import numpy as np
import pandas as pd


class FeatureExtractor():
    def __init__(self):
        pass

    def fit(self, X_df, y=None):
        # Nothing to learn: the transformation is stateless
        return self

    def transform(self, X_df):
        # Drop the columns that are not used as features,
        # then return the remaining ones as a NumPy array
        X_df_new = self.drop_columns(X_df)
        return X_df_new.values

    def drop_columns(self, X_df,
                     columns_to_drop=("dateCrawled", "abtest", "dateCreated", "lastSeen")):
        # DataFrame.drop returns a new DataFrame, so X_df is left untouched
        return X_df.drop(list(columns_to_drop), axis=1)

It can be tested directly on the data, as in this example:

In [23]:
fe = FeatureExtractor()
fe.fit(X_train, y_train)
New_X_train = fe.transform(X_train)
print(New_X_train.shape)

(210000, 11)


### Regressor

An example Regressor using a decision tree regression for the price prediction:

In [24]:
import numpy as np
from sklearn.base import BaseEstimator
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor


class Regressor(BaseEstimator):
    def __init__(self):
        self.reg = DecisionTreeRegressor(random_state=42)
        # self.reg = LinearRegression()  # alternative baseline

    def fit(self, X, y):
        self.reg.fit(X, y)
        # Return self, following the scikit-learn estimator convention
        return self

    def predict(self, X):
        # Reshape the predictions to (n_samples, 1); the values are unchanged
        return self.reg.predict(X)[:, np.newaxis]

## Test pipeline

This model can be used interactively in the notebook to fit on the training data and predict on the testing data:

In [25]:
from sklearn.pipeline import make_pipeline

In [26]:
model = make_pipeline(FeatureExtractor(), Regressor())

In [27]:
len(y_train)

210000

In [28]:
model.fit(X_train, y_train)

Pipeline(memory=None,
     steps=[('featureextractor', <__main__.FeatureExtractor object at 0x000002B4CB56A0F0>), ('regressor', Regressor())])

In [30]:
y_pred = model.predict(X_test)

In [31]:
y_pred.shape

(70000, 1)

In [32]:
y_pred

array([[ 2750.],
       [ 4970.],
       [ 8950.],
       ...,
       [12900.],
       [  299.],
       [ 6132.]])

In [35]:
from sklearn.metrics import mean_squared_error

print("RMSE on test:",np.sqrt(mean_squared_error(y_test, y_pred)))

RMSE on test: 6575.2203749827095


## Evaluation with Cross-Validation

The metric explained above is actually calculated using a cross-validation approach (4-fold cross-validation in the example below):

In [36]:
# 6333 linear regression,  4900 decision tree regressor
from sklearn.model_selection import cross_val_score

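def evaluation(model, X, y):
    # cross_val_score returns negated MSE values, one per fold
    scores = cross_val_score(model, X, y, scoring="neg_mean_squared_error", cv=4)
    # Flip the sign and take the square root to get one RMSE per fold
    return np.sqrt(-scores)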

In [37]:
results = evaluation(model, X_train, y_train)

In [38]:
print("Scores:", results)
print("Mean:", results.mean())
print("Standard deviation:", results.std())

Scores: [5694.48549817 6758.45632694 6194.71856587 5922.9609923 ]
Mean: 6142.655345822698
Standard deviation: 397.1911531439782
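
The `neg_mean_squared_error` scorer returns the negated MSE of each fold, because scikit-learn scorers follow a "greater is better" convention. This is why the sign is flipped before taking the square root. A self-contained sketch on synthetic data (the output of `make_regression` is only a stand-in for the car data):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data, used only to keep the example self-contained
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

neg_mse = cross_val_score(
    DecisionTreeRegressor(random_state=42),
    X_demo,
    y_demo,
    scoring="neg_mean_squared_error",
    cv=4,
)

# One RMSE value per fold
rmse_per_fold = np.sqrt(-neg_mse)
print("RMSE per fold:", rmse_per_fold)
print("Mean RMSE:", rmse_per_fold.mean())
```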


## Submitting to the online challenge: ramp.studio

In [39]:
!ramp_test_submission --submission starting_kit

Testing Cars price
Reading train and test files from ./data ...
Reading cv ...


Traceback (most recent call last):
  File "c:\users\tco\anaconda3\lib\runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "c:\users\tco\anaconda3\lib\runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "C:\Users\TCO\Anaconda3\Scripts\ramp_test_submission.exe\__main__.py", line 9, in <module>
  File "c:\users\tco\anaconda3\lib\site-packages\rampwf\utils\command_line.py", line 93, in ramp_test_submission
    retrain=retrain)
  File "c:\users\tco\anaconda3\lib\site-packages\rampwf\utils\testing.py", line 82, in assert_submission
    cv = assert_cv(ramp_kit_dir, ramp_data_dir)
  File "c:\users\tco\anaconda3\lib\site-packages\rampwf\utils\testing.py", line 53, in assert_cv
    cv = list(problem.get_cv(X_train, y_train))
  File "c:\users\tco\anaconda3\lib\site-packages\sklearn\model_selection\_split.py", line 1237, in split
    for train, test in self._iter_indices(X, y, groups):
  File "c:\users\tco\anaconda3\lib\site-packages\sklearn\model_selection

## Advice

The usual way to work with RAMP is to explore solutions locally (add feature transformations, select models, perhaps do some AutoML/hyperopt, etc.) and to check them with `ramp_test_submission`. The script prints mean cross-validation scores.
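
As one way to explore hyperparameters locally before running `ramp_test_submission`, a grid search over the tree depth could look like the sketch below. The parameter grid and the synthetic data are illustrative assumptions, not tuned values for this challenge.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data, standing in for the car data
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# Try a few tree depths and keep the one with the lowest cross-validated MSE
search = GridSearchCV(
    DecisionTreeRegressor(random_state=42),
    param_grid={"max_depth": [2, 4, 8]},
    scoring="neg_mean_squared_error",
    cv=4,
)
search.fit(X_demo, y_demo)

print("Best max_depth:", search.best_params_["max_depth"])
print("Best RMSE:", (-search.best_score_) ** 0.5)
```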
