## Predict Car Prices

The file car_prices_bmw_audi_benz.parquet contains selling prices for three car models. First plot the data points on a scatter plot to see whether a linear regression model can be applied. If so, build a model that can answer the following questions:
* Predict the price of a Mercedes-Benz that is 4 years old with a mileage of 45000
* Predict the price of a BMW X5 that is 7 years old with a mileage of 86000
* Report the score of your model (hint: use LinearRegression().score(), which returns the coefficient of determination R² rather than a classification accuracy)

Adapted from: https://github.com/vitalii-levko/ML_codebasics/blob/master/4_one_hot_encoding.ipynb


In [4]:
%%pyspark
data_path = spark.read.load('abfss://azureudemeycoursewsfs@azureudemeycoursewsadls.dfs.core.windows.net/car_prices_bmw_audi_benz.parquet', format='parquet')
data_path.show(100)

+--------------------+---------+-------------+---+
|                name|km_driven|selling_price|Age|
+--------------------+---------+-------------+---+
|              BMW X5|    69000|        18000|  6|
|              BMW X5|    35000|        34000|  3|
|              BMW X5|    57000|        26100|  5|
|              BMW X5|    22500|        40000|  2|
|              BMW X5|    46000|        31500|  4|
|             Audi A5|    59000|        29400|  5|
|             Audi A5|    52000|        32000|  5|
|             Audi A5|    72000|        19300|  6|
|             Audi A5|    91000|        12000|  8|
|Mercedez Benz C c...|    67000|        22000|  6|
|Mercedez Benz C c...|    83000|        20000|  7|
|Mercedez Benz C c...|    79000|        21000|  7|
|Mercedez Benz C c...|    59000|        33000|  5|
+--------------------+---------+-------------+---+

In [5]:
%%pyspark
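# registerTempTable is deprecated since Spark 2.0; createOrReplaceTempView is its replacement
data_path.createOrReplaceTempView("car_prices_bmw_audi_benz")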

In [6]:
%%sql
SELECT * FROM car_prices_bmw_audi_benz

In [7]:
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.compose import ColumnTransformer


# Convert the Spark DataFrame to a pandas DataFrame
df = data_path.toPandas()
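
The exercise asks for a scatter plot before fitting, to judge whether a linear model is plausible. A minimal sketch, assuming the 13 rows printed above are rebuilt by hand into a pandas DataFrame (in the notebook itself, `df` from the cell above could be used directly once its numeric columns hold numbers):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Rows copied from the data_path.show() output above (name shown as truncated there)
rows = [
    ("BMW X5", 69000, 18000, 6), ("BMW X5", 35000, 34000, 3),
    ("BMW X5", 57000, 26100, 5), ("BMW X5", 22500, 40000, 2),
    ("BMW X5", 46000, 31500, 4), ("Audi A5", 59000, 29400, 5),
    ("Audi A5", 52000, 32000, 5), ("Audi A5", 72000, 19300, 6),
    ("Audi A5", 91000, 12000, 8),
    ("Mercedez Benz C c...", 67000, 22000, 6),
    ("Mercedez Benz C c...", 83000, 20000, 7),
    ("Mercedez Benz C c...", 79000, 21000, 7),
    ("Mercedez Benz C c...", 59000, 33000, 5),
]
df = pd.DataFrame(rows, columns=["name", "km_driven", "selling_price", "Age"])

# One panel per numeric feature, one colour per car model
fig, axes = plt.subplots(1, 2, figsize=(10, 4), sharey=True)
for name, group in df.groupby("name"):
    axes[0].scatter(group["km_driven"], group["selling_price"], label=name)
    axes[1].scatter(group["Age"], group["selling_price"], label=name)
axes[0].set_xlabel("km_driven")
axes[0].set_ylabel("selling_price")
axes[1].set_xlabel("Age")
axes[1].legend()
fig.tight_layout()
```

If the points in both panels fall roughly along a downward-sloping line, a linear regression is a reasonable first model.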





In [31]:
# Check each column for missing (null) values
df.isnull().sum()

name             0
km_driven        0
selling_price    0
Age              0
dtype: int64

In [9]:
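le = LabelEncoder()
# Work on a copy so that the original df keeps the car names
dfle = df.copy()
# Replace each car name with an integer label
dfle['name'] = le.fit_transform(dfle['name'])
dfle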


name km_driven selling_price Age
0      1     69000         18000   6
1      1     35000         34000   3
2      1     57000         26100   5
3      1     22500         40000   2
4      1     46000         31500   4
5      0     59000         29400   5
6      0     52000         32000   5
7      0     72000         19300   6
8      0     91000         12000   8
9      2     67000         22000   6
10     2     83000         20000   7
11     2     79000         21000   7
12     2     59000         33000   5
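
The integer labels in the table above follow from how `LabelEncoder` works: it sorts the distinct values and numbers them from 0. A small sketch that makes the mapping explicit, using the three names exactly as they appear in the truncated display above:

```python
from sklearn.preprocessing import LabelEncoder

# Names as printed by show(); the Mercedes name is truncated in that output
names = ["BMW X5", "Audi A5", "Mercedez Benz C c..."]

le = LabelEncoder()
le.fit(names)

# classes_ holds the sorted distinct values; the position is the integer label
mapping = {str(label): code for code, label in enumerate(le.classes_)}
print(mapping)
```

This is why Audi A5 is 0, BMW X5 is 1 and the Mercedes is 2, and it determines the column order of the one-hot encoding later on.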

In [10]:
X = dfle[['name','km_driven','Age']].values
X

array([[1, '69000', '6'],
       [1, '35000', '3'],
       [1, '57000', '5'],
       [1, '22500', '2'],
       [1, '46000', '4'],
       [0, '59000', '5'],
       [0, '52000', '5'],
       [0, '72000', '6'],
       [0, '91000', '8'],
       [2, '67000', '6'],
       [2, '83000', '7'],
       [2, '79000', '7'],
       [2, '59000', '5']], dtype=object)

In [11]:
y = dfle[['selling_price']].values
y

array([['18000'],
       ['34000'],
       ['26100'],
       ['40000'],
       ['31500'],
       ['29400'],
       ['32000'],
       ['19300'],
       ['12000'],
       ['22000'],
       ['20000'],
       ['21000'],
       ['33000']], dtype=object)
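
The quoted values and `dtype=object` in the two arrays above suggest that km_driven, selling_price and Age were loaded as strings. scikit-learn will usually coerce such values when fitting, but converting them explicitly is safer. A minimal sketch on a hypothetical three-row frame (values taken from the table above), assuming every entry parses as a number:

```python
import pandas as pd

# Hypothetical frame whose numeric columns arrived as strings
df = pd.DataFrame({
    "name": ["BMW X5", "Audi A5", "Mercedez Benz C c..."],
    "km_driven": ["69000", "59000", "67000"],
    "selling_price": ["18000", "29400", "22000"],
    "Age": ["6", "5", "6"],
})

numeric_cols = ["km_driven", "selling_price", "Age"]
# pd.to_numeric raises on values it cannot parse, which surfaces bad data early
df[numeric_cols] = df[numeric_cols].apply(pd.to_numeric)
print(df.dtypes)
```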

In [12]:


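# Column 0 already holds the label-encoded name, so this re-encoding leaves it unchanged
X[:,0] = le.fit_transform(X[:,0])
# One-hot encode column 0; km_driven and Age pass through unchanged
ct = ColumnTransformer([("Car Info", OneHotEncoder(), [0])], remainder = 'passthrough')
X = ct.fit_transform(X)
X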


array([[0.0, 1.0, 0.0, '69000', '6'],
       [0.0, 1.0, 0.0, '35000', '3'],
       [0.0, 1.0, 0.0, '57000', '5'],
       [0.0, 1.0, 0.0, '22500', '2'],
       [0.0, 1.0, 0.0, '46000', '4'],
       [1.0, 0.0, 0.0, '59000', '5'],
       [1.0, 0.0, 0.0, '52000', '5'],
       [1.0, 0.0, 0.0, '72000', '6'],
       [1.0, 0.0, 0.0, '91000', '8'],
       [0.0, 0.0, 1.0, '67000', '6'],
       [0.0, 0.0, 1.0, '83000', '7'],
       [0.0, 0.0, 1.0, '79000', '7'],
       [0.0, 0.0, 1.0, '59000', '5']], dtype=object)

In [None]:

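# Drop the first dummy column (Audi A5) to avoid the dummy variable trap.
# Remaining feature order: [BMW X5, Mercedes-Benz, km_driven, Age]
X = X[:,1:]
X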
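
Label-encoding, one-hot encoding and then slicing off the first column can also be done in one step. A sketch of that alternative, assuming a pandas DataFrame with numeric columns as input; it is not the approach used in this notebook, and `sparse_threshold=0` is set only to force a dense array for printing:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# One row per car model, values taken from the table at the top
df = pd.DataFrame({
    "name": ["BMW X5", "Audi A5", "Mercedez Benz C c..."],
    "km_driven": [69000, 59000, 67000],
    "Age": [6, 5, 6],
})

ct = ColumnTransformer(
    # drop="first" removes the dummy column of the first (sorted) category, Audi A5
    [("car_info", OneHotEncoder(drop="first"), ["name"])],
    remainder="passthrough",
    sparse_threshold=0,
)
X = ct.fit_transform(df)
print(X)
```

The resulting columns are again [BMW X5, Mercedes-Benz, km_driven, Age], with an Audi A5 represented by zeros in both dummy columns.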

In [14]:
model = LinearRegression()
model.fit(X,y)


LinearRegression()

In [15]:
model.score(X,y)

0.9417050937281082
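
For a regressor, `score` returns the coefficient of determination, R² = 1 - SS_res / SS_tot, and here it is computed on the same rows the model was trained on, so it is an optimistic estimate. The sketch below recomputes R² by hand to show what the number means. It rebuilds the data from the table at the top and uses `pd.get_dummies` for brevity, which is an assumption of this example and orders the columns differently from the notebook's `X`:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rows = [
    ("BMW X5", 69000, 18000, 6), ("BMW X5", 35000, 34000, 3),
    ("BMW X5", 57000, 26100, 5), ("BMW X5", 22500, 40000, 2),
    ("BMW X5", 46000, 31500, 4), ("Audi A5", 59000, 29400, 5),
    ("Audi A5", 52000, 32000, 5), ("Audi A5", 72000, 19300, 6),
    ("Audi A5", 91000, 12000, 8),
    ("Mercedez Benz C c...", 67000, 22000, 6),
    ("Mercedez Benz C c...", 83000, 20000, 7),
    ("Mercedez Benz C c...", 79000, 21000, 7),
    ("Mercedez Benz C c...", 59000, 33000, 5),
]
df = pd.DataFrame(rows, columns=["name", "km_driven", "selling_price", "Age"])

# Dummy-encode the name, dropping the first category (Audi A5)
X = pd.get_dummies(df[["name", "km_driven", "Age"]], columns=["name"], drop_first=True).astype(float)
y = df["selling_price"].astype(float)

model = LinearRegression().fit(X, y)
pred = model.predict(X)

ss_res = ((y - pred) ** 2).sum()      # variation the model fails to explain
ss_tot = ((y - y.mean()) ** 2).sum()  # total variation around the mean price
r2_manual = 1 - ss_res / ss_tot
print(r2_manual, model.score(X, y))
```

With only 13 rows there is little room for a hold-out set, so the training R² should be read as an upper bound on how well the model generalizes.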

In [16]:
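# Feature order: [BMW X5, Mercedes-Benz, km_driven, Age]; [0, 1, ...] selects the Mercedes-Benz
model.predict([[0,1,45000,4]])
# Predicted price of a Mercedes-Benz that is 4 years old with a mileage of 45000: about 36991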


array([[36991.31721062]])

In [17]:
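# Feature order: [BMW X5, Mercedes-Benz, km_driven, Age]; [1, 0, ...] selects the BMW X5
model.predict([[1,0,86000,7]])
# Predicted price of a BMW X5 that is 7 years old with a mileage of 86000: about 11081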


array([[11080.74313219]])

In [18]:
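# Note: [1, 1, ...] sets both the BMW X5 and the Mercedes-Benz dummy columns, so this row
# does not describe a single car model; an Audi A5 would be encoded as [0, 0, 22500, 2]
model.predict([[1,1,22500,2]])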


array([[43699.30500118]])
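
Hand-written rows make it easy to set both dummy columns by mistake. One way to guard against that is a small helper that builds the row from the car name. The sketch below is hypothetical: `make_row`, `DUMMY_COLUMNS` and `BASELINE` are not part of the notebook, and "Mercedez Benz" is a shortened stand-in for the truncated name in the data.

```python
# Dummy columns that remain after the Audi A5 column is dropped, in order
DUMMY_COLUMNS = ["BMW X5", "Mercedez Benz"]
# The dropped category is represented by zeros in every dummy column
BASELINE = "Audi A5"


def make_row(name, km_driven, age):
    """Return [BMW X5, Mercedez Benz, km_driven, Age] for a single car."""
    if name != BASELINE and name not in DUMMY_COLUMNS:
        raise ValueError(f"unknown car model: {name!r}")
    dummies = [1 if name == column else 0 for column in DUMMY_COLUMNS]
    return dummies + [km_driven, age]


print(make_row("Mercedez Benz", 45000, 4))  # [0, 1, 45000, 4]
print(make_row("BMW X5", 86000, 7))         # [1, 0, 86000, 7]
print(make_row("Audi A5", 22500, 2))        # [0, 0, 22500, 2]
```

A row built this way can be passed to the fitted model as `model.predict([make_row("BMW X5", 86000, 7)])`.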