# Pré-processamento e  Treinamento do modelo de regressão Linear
* No primeiro modelo: mod_lin_reg_1 foi realizado o treinamento utilizando o método de encoding label para as variáveis categóricas. 

* Nesse modelo de regressão iremos adotar o uso do one-hotencoding para tratar os dados categóricos. 

In [130]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import joblib
import sklearn
from sklearn.preprocessing import LabelEncoder, MinMaxScaler, StandardScaler, OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
import statsmodels.api as sm
import warnings
warnings.filterwarnings('ignore')

In [4]:
# Importa dataset: 
df = pd.read_csv('df.csv')
df.head()

Unnamed: 0,car_name,kms_driven,fuel_type,transmission,ownership,manufacture,engine,Seats,price,multiply,engine_cat
0,Jeep,86226.0,Diesel,Manual,1st Owner,2017.0,1956.0,5.0,10.03,Lakh,medium
1,Renault,13248.0,Petrol,Automatic,1st Owner,2021.0,1330.0,5.0,12.83,Lakh,medium
2,Toyota,60343.0,Petrol,Automatic,1st Owner,2016.0,2494.0,5.0,16.4,Lakh,high
3,Honda,26696.0,Petrol,Automatic,1st Owner,2018.0,1199.0,5.0,7.77,Lakh,lower
4,Volkswagen,69414.0,Petrol,Manual,1st Owner,2016.0,1199.0,5.0,5.15,Lakh,lower


In [31]:
df['Seats'] = df.Seats.astype('object')

In [32]:
# Seleção das variáveis atributo:
atrib_var = ['kms_driven', 'fuel_type', 'transmission', 'ownership', 'engine_cat', 'Seats']
alvo_var = ['price']
features = df[atrib_var]
target = df[alvo_var]

In [33]:
# Separar dados de treino e teste
X_treino, X_teste, Y_treino, Y_teste = train_test_split(features, 
                                                        target, 
                                                        test_size = 0.20)

In [43]:
# Pipeline para pré-processamento das variáveis: 
atributo_num = ['kms_driven']
atributo_char = ['fuel_type', 'transmission', 'ownership', 'engine_cat', 'Seats']


full_pipeline = ColumnTransformer([
    ('num', MinMaxScaler(), atributo_num),
    ('cat', OneHotEncoder(), atributo_char),
])

In [64]:
# Pré-processando os dados de treino:
X_treino_prep = full_pipeline.fit_transform(X_treino)
X_teste_prep = full_pipeline.fit_transform(X_teste)

In [49]:
# Criando e treinando o modelo de regressão
model_lin_reg2 = LinearRegression()
model_lin_reg2.fit(X_treino_prep, Y_treino)

LinearRegression()

In [52]:
# Calculando o RMSE (Root Mean Square Error) do modelo
from sklearn.metrics import mean_squared_error

price_predictions = model_lin_reg2.predict(X_teste_prep)
lin_mse = mean_squared_error(Y_teste, price_predictions)
lin_rmse = np.sqrt(lin_mse)
lin_rmse

13.91300458752021

In [74]:
print('Coeficientes: \n', model_lin_reg2.coef_)

Coeficientes: 
 [[-10.31969739   1.21054749   2.46274405  -7.06129152   3.60568531
   -0.21768534   2.43893981  -2.43893981  -5.2682167    2.01690221
    1.16693568   0.65215807   0.72285287   0.70936788   0.89327443
   -1.05771332   0.16443889   9.60921722   0.69352103  -2.92175284
   -0.93977842  -1.67887627  -4.76233073]]


In [135]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5250 entries, 0 to 5249
Data columns (total 11 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   car_name      5250 non-null   object 
 1   kms_driven    5250 non-null   float64
 2   fuel_type     5250 non-null   object 
 3   transmission  5250 non-null   object 
 4   ownership     5250 non-null   object 
 5   manufacture   5250 non-null   float64
 6   engine        5250 non-null   float64
 7   Seats         5250 non-null   object 
 8   price         5250 non-null   float64
 9   multiply      5250 non-null   object 
 10  engine_cat    5250 non-null   object 
dtypes: float64(4), object(7)
memory usage: 451.3+ KB


In [137]:
pd.factorize(df.car_name)[0]

array([ 0,  1,  2, ..., 23, 12, 12])