# Decision Tree Regressor
DecisionTreeRegressor is a machine learning model in scikit-learn that uses a decision tree to model the relationship between a set of input features and a continuous target variable.

The decision tree works by recursively partitioning the input space into subsets based on the values of the input features, and assigning a predicted value to each subset based on the average of the target variable for the training examples that fall into that subset. This results in a tree-like structure where each internal node represents a decision based on one of the input features, and each leaf node represents a predicted value.

The DecisionTreeRegressor class in scikit-learn allows you to customize various hyperparameters of the decision tree, such as the maximum depth of the tree, the minimum number of samples required to split an internal node, and the minimum number of samples required to be at a leaf node. These hyperparameters can be tuned to improve the performance of the model on a given task.

DecisionTreeRegressor can be useful for a variety of regression tasks, such as predicting housing prices, stock prices, or any other continuous variable. However, it may not perform as well as more complex models on datasets with complex relationships between the input features and the target variable, or on datasets with a large number of features. In those cases, more advanced models such as random forests or gradient boosting may be more appropriate.








# Linear Regression

Linear regression is a supervised learning algorithm used for modeling the relationship between a dependent variable and one or more independent variables. The goal of linear regression is to find a linear relationship between the independent variables (also called features or predictors) and the dependent variable (also called the response variable or target variable).

The linear regression model assumes that the relationship between the independent variables and the dependent variable is linear. This means that the change in the dependent variable is proportional to the change in the independent variables, with a constant slope.

There are two types of linear regression: simple linear regression and multiple linear regression. In simple linear regression, there is only one independent variable, while in multiple linear regression, there are two or more independent variables.

The linear regression model is defined as:

makefile
Copy code
Y = b0 + b1*X1 + b2*X2 + ... + bn*Xn
where Y is the dependent variable, X1, X2, ..., Xn are the independent variables, b0 is the intercept (the value of Y when all the independent variables are zero), and b1, b2, ..., bn are the coefficients (the values that represent the change in Y for a unit change in each of the independent variables).

The goal of linear regression is to estimate the values of the coefficients that minimize the difference between the predicted values and the actual values of the dependent variable. This is typically done using the method of least squares, which finds the values of the coefficients that minimize the sum of the squared differences between the predicted and actual values.

Linear regression can be used for a wide range of applications, including predicting sales, prices, or any other continuous variable. However, it may not perform well when the relationship between the independent variables and the dependent variable is non-linear, or when there are interactions between the independent variables. In those cases, more advanced models such as polynomial regression or regression trees may be more appropriate.

In [2]:
#importing the library
import numpy as np
import pandas as pd
#regular expression
import re
#StandardScaler
from sklearn.preprocessing import StandardScaler
#train_test_split
from sklearn.model_selection import train_test_split
#Linear Regression
from sklearn.linear_model import LinearRegression
#Decision Tree Regression
from sklearn.tree import DecisionTreeRegressor
#Random Forest
from sklearn.ensemble import RandomForestRegressor

# Trying to Predict the MPG Miles Per Gallon

# Loading the Dataset

In [3]:
df=pd.read_csv('/kaggle/input/autompg-dataset/auto-mpg.csv')
#showing the dataset
df

Unnamed: 0,mpg,cylinders,displacement,horsepower,weight,acceleration,model year,origin,car name
0,18.0,8,307.0,130,3504,12.0,70,1,chevrolet chevelle malibu
1,15.0,8,350.0,165,3693,11.5,70,1,buick skylark 320
2,18.0,8,318.0,150,3436,11.0,70,1,plymouth satellite
3,16.0,8,304.0,150,3433,12.0,70,1,amc rebel sst
4,17.0,8,302.0,140,3449,10.5,70,1,ford torino
...,...,...,...,...,...,...,...,...,...
393,27.0,4,140.0,86,2790,15.6,82,1,ford mustang gl
394,44.0,4,97.0,52,2130,24.6,82,2,vw pickup
395,32.0,4,135.0,84,2295,11.6,82,1,dodge rampage
396,28.0,4,120.0,79,2625,18.6,82,1,ford ranger


# Getting the Preliminary Information

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 398 entries, 0 to 397
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   mpg           398 non-null    float64
 1   cylinders     398 non-null    int64  
 2   displacement  398 non-null    float64
 3   horsepower    398 non-null    object 
 4   weight        398 non-null    int64  
 5   acceleration  398 non-null    float64
 6   model year    398 non-null    int64  
 7   origin        398 non-null    int64  
 8   car name      398 non-null    object 
dtypes: float64(3), int64(4), object(2)
memory usage: 28.1+ KB


# Checking for Missing Values

In [5]:
df.isna().sum()

mpg             0
cylinders       0
displacement    0
horsepower      0
weight          0
acceleration    0
model year      0
origin          0
car name        0
dtype: int64

# Preprocessing Function

import re

s = "chevrolet chevelle malibu"
match = re.match(r'^\w+', s)
if match:
    first_word = match.group(0)
    print(first_word)
else:
    print("No match found.")
This code imports the re module and defines the string s as "chevrolet chevelle malibu". It then uses the re.match() function to attempt to match the regular expression r'^\w+' (which matches the beginning of the string followed by one or more word characters) against the string s.

If a match is found, the code extracts the matched substring using the match.group() method and assigns it to the variable first_word, which is then printed to the console. If no match is found, the code prints the message "No match found." to the console.

In [25]:
def onehot_encode(df,columns):
    df=df.copy()
    for column in columns:
        dummies=pd.get_dummies(df[column],prefix=column)
        df=pd.concat([df,dummies],axis=1)
        df=df.drop(column,axis=1)
    return df

In [53]:
def preprocess_inputs(df):
    df=df.copy()
    df['horsepower']=df['horsepower'].replace('?',np.NaN).astype(np.float)
    df['horsepower']=df['horsepower'].fillna(df['horsepower'].mean())
    df['Car Brand']=df['car name'].apply(lambda x:re.search(r'^\w+',x).group(0))
    df=df.drop('car name',axis=1)
    corrected_typo={'vw':'volkswagen','chevy':'chevrolet','vokswagen':'volkswagen','maxda':'mazda','toyouta':'toyota',
                   'chevroelt':'chevrolet'}
    df['Car Brand']=df['Car Brand'].replace(corrected_typo)
    
    onehot_columns=['Car Brand','cylinders','origin']
    
    
    df=onehot_encode(df,onehot_columns)
    
    y=df['mpg']
    x=df.drop('mpg',axis=1)
    
    #train_test_split
    
    x_train,x_test,y_train,y_test=train_test_split(x,y,train_size=0.7,shuffle=True)
    
    #scaling the dataset
    scaler=StandardScaler()
    scaler.fit(x_train)

    x_train=pd.DataFrame(scaler.transform(x_train),columns=x_train.columns)
    x_test=pd.DataFrame(scaler.transform(x_test),columns=x_test.columns)
    
    
    
    
    
    
    
    
    return x_train,x_test,y_train,y_test

In [13]:
print(re.search(r'^\w+','You are not interesting').group(0))

You


In [54]:
x_train,x_test,y_train,y_test=preprocess_inputs(df)
print(x_train.shape)
print(x_test.shape)
print(y_train.shape)
print(y_test.shape)

(278, 43)
(120, 43)
(278,)
(120,)


Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
  This is separate from the ipykernel package so we can avoid doing imports until


# Training the Model

In [55]:
model=LinearRegression()
model.fit(x_train,y_train)
print(model.score(x_test,y_test))

-2.891564277933033e+21


In [56]:
rn_model=RandomForestRegressor()
rn_model.fit(x_train,y_train)
print(rn_model.score(x_test,y_test))

0.9187015827982402


In [58]:
dt_model=DecisionTreeRegressor()
dt_model.fit(x_train,y_train)
print(dt_model.score(x_test,y_test))

0.8160179620893031


In [60]:
y_pred=rn_model.predict(x_test)
root_mean_square_error=np.sqrt(np.mean(y_pred-y_test)**2)
print(root_mean_square_error)

0.48642500000000066


In [61]:
y_pred

array([36.018, 13.585, 27.575, 17.959, 31.058, 40.883, 35.144, 24.388,
       14.465, 28.62 , 19.26 , 14.41 , 19.06 , 19.35 , 16.257, 27.749,
       28.397, 17.685, 20.98 , 12.71 , 20.883, 13.86 , 21.951, 34.46 ,
       16.229, 18.515, 16.59 , 13.505, 13.765, 22.971, 25.575, 27.152,
       20.163, 25.695, 21.121, 30.11 , 25.857, 15.864, 14.035, 25.439,
       15.205, 33.349, 28.115, 18.696, 25.555, 14.   , 14.435, 26.042,
       20.114, 14.935, 32.094, 25.139, 18.967, 21.298, 26.036, 35.923,
       13.705, 19.   , 14.58 , 12.7  , 29.439, 13.48 , 29.141, 24.795,
       19.817, 39.521, 25.953, 24.066, 28.971, 23.522, 20.707, 34.462,
       28.335, 29.313, 37.4  , 28.253, 20.519, 13.78 , 12.17 , 34.066,
       35.732, 15.438, 32.292, 24.885, 15.643, 23.748, 23.711, 33.567,
       29.539, 20.725, 14.28 , 14.75 , 14.815, 29.718, 33.825, 24.882,
       34.913, 13.63 , 12.985, 16.602, 14.035, 21.276, 35.395, 23.331,
       18.734, 12.87 , 25.056, 31.335, 29.473, 14.56 , 30.987, 13.565,
      

In [62]:
y_test

352    29.9
27     11.0
29     27.0
263    17.7
117    29.0
       ... 
281    19.8
278    31.5
272    23.8
154    15.0
317    34.3
Name: mpg, Length: 120, dtype: float64

In [49]:
y_test

265    17.5
312    37.2
210    19.0
260    18.6
119    20.0
       ... 
180    25.0
211    16.5
396    28.0
164    21.0
359    28.1
Name: mpg, Length: 120, dtype: float64

In [29]:
{column:list(x[column].unique()) for column in x.columns}

{'mpg': [18.0,
  15.0,
  16.0,
  17.0,
  14.0,
  24.0,
  22.0,
  21.0,
  27.0,
  26.0,
  25.0,
  10.0,
  11.0,
  9.0,
  28.0,
  19.0,
  12.0,
  13.0,
  23.0,
  30.0,
  31.0,
  35.0,
  20.0,
  29.0,
  32.0,
  33.0,
  17.5,
  15.5,
  14.5,
  22.5,
  24.5,
  18.5,
  29.5,
  26.5,
  16.5,
  31.5,
  36.0,
  25.5,
  33.5,
  20.5,
  30.5,
  21.5,
  43.1,
  36.1,
  32.8,
  39.4,
  19.9,
  19.4,
  20.2,
  19.2,
  25.1,
  20.6,
  20.8,
  18.6,
  18.1,
  17.7,
  27.5,
  27.2,
  30.9,
  21.1,
  23.2,
  23.8,
  23.9,
  20.3,
  21.6,
  16.2,
  19.8,
  22.3,
  17.6,
  18.2,
  16.9,
  31.9,
  34.1,
  35.7,
  27.4,
  25.4,
  34.2,
  34.5,
  31.8,
  37.3,
  28.4,
  28.8,
  26.8,
  41.5,
  38.1,
  32.1,
  37.2,
  26.4,
  24.3,
  19.1,
  34.3,
  29.8,
  31.3,
  37.0,
  32.2,
  46.6,
  27.9,
  40.8,
  44.3,
  43.4,
  36.4,
  44.6,
  40.9,
  33.8,
  32.7,
  23.7,
  23.6,
  32.4,
  26.6,
  25.8,
  23.5,
  39.1,
  39.0,
  35.1,
  32.3,
  37.7,
  34.7,
  34.4,
  29.9,
  33.7,
  32.9,
  31.6,
  28.1,
  30.7,
  

In [24]:
x['Car Brand'].unique()

array(['chevrolet', 'buick', 'plymouth', 'amc', 'ford', 'pontiac',
       'dodge', 'toyota', 'datsun', 'volkswagen', 'peugeot', 'audi',
       'saab', 'bmw', 'hi', 'mercury', 'opel', 'fiat', 'oldsmobile',
       'chrysler', 'mazda', 'volvo', 'renault', 'honda', 'subaru',
       'capri', 'mercedes', 'cadillac', 'triumph', 'nissan'], dtype=object)

In [23]:
x['Car Brand'].value_counts()

ford          51
chevrolet     47
plymouth      31
amc           28
dodge         28
toyota        26
datsun        23
volkswagen    22
buick         17
pontiac       16
honda         13
mazda         12
mercury       11
oldsmobile    10
fiat           8
peugeot        8
audi           7
volvo          6
chrysler       6
renault        5
saab           4
opel           4
subaru         4
mercedes       3
cadillac       2
bmw            2
capri          1
hi             1
triumph        1
nissan         1
Name: Car Brand, dtype: int64

In [27]:
x['origin'].unique()

array([1, 3, 2])