**Task 2:** Find a data set that is suitable for regression analysis and contains a mix of numerical, nominal, and ordinal variables. Prefer a data set in which at least one variable has missing values; if none does, randomly delete a very small portion of one variable.
Design a machine learning pipeline that scales the numerical features, encodes the nominal and ordinal features, and imputes the missing values.


## Used Car Data Set Description

This data set contains information about used cars for sale in India. The data set includes the following features:

- `Name` : The brand and model of the car.
- `Location` : The location in which the car is being sold or is available for purchase.
- `Year`: The year or edition of the model.
- `Kilometers_Driven` : The total kilometers driven in the car by the previous owner(s) in KM.
- `Fuel_Type` : The type of fuel used by the car.
- `Transmission` : The type of transmission used by the car.
- `Owner_Type` : Whether the car is being sold by its first, second, third, or a later owner.
- `Mileage` : The standard mileage offered by the car company in kmpl or km/kg.
- `Engine` : The displacement volume of the engine in cc.
- `Power` : The maximum power of the engine in bhp.
- `Seats` : The number of seats in the car.
- `New_Price` : The price of a new car of the same model (for example `8.61 Lakh`).
- `Price` : The price of the used car in INR lakhs, where 1 lakh = 100,000 INR: $$1\ \text{lakh} = \frac{1\ \text{million}}{10}$$
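
The unit conversion for `Price` can be sketched in a few lines. The helper name `lakh_to_inr` is hypothetical and is not part of the data set or of any library; the input value is the price from the first row shown further down.

```python
LAKH_IN_INR = 100_000  # 1 lakh = 100,000 INR = 0.1 million INR


def lakh_to_inr(price_lakh: float) -> float:
    """Convert a price in lakhs to rupees (hypothetical helper)."""
    return price_lakh * LAKH_IN_INR


# Price of the first car in the data set: 1.75 lakh
print(lakh_to_inr(1.75))  # 175000.0
```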


This data set can be used to analyze the pricing trends of used cars in India based on various factors such as location, year, kilometers driven, and other features. Additionally, this data set can be used to build predictive models to estimate the price of used cars based on their characteristics.


Import **set_config** and set the `transform_output` parameter to `default`, so that each transformer in the sklearn pipeline returns its default output format (a NumPy array or sparse matrix) from `transform` rather than a pandas DataFrame.

In [1]:
from sklearn import set_config
set_config(transform_output="default")

import numpy as np
import pandas as pd
car_df = pd.read_csv("datasets/car.csv")
car_df.head()

Unnamed: 0,Name,Location,Year,Kilometers_Driven,Fuel_Type,Transmission,Owner_Type,Mileage,Engine,Power,Seats,New_Price,Price
0,Maruti Wagon R LXI CNG,Mumbai,2010,72000,CNG,Manual,First,26.6 km/kg,998 CC,58.16 bhp,5.0,,1.75
1,Hyundai Creta 1.6 CRDi SX Option,Pune,2015,41000,Diesel,Manual,First,19.67 kmpl,1582 CC,126.2 bhp,5.0,,12.5
2,Honda Jazz V,Chennai,2011,46000,Petrol,Manual,First,18.2 kmpl,1199 CC,88.7 bhp,5.0,8.61 Lakh,4.5
3,Maruti Ertiga VDI,Chennai,2012,87000,Diesel,Manual,First,20.77 kmpl,1248 CC,88.76 bhp,7.0,,6.0
4,Audi A4 New 2.0 TDI Multitronic,Coimbatore,2013,40670,Diesel,Automatic,Second,15.2 kmpl,1968 CC,140.8 bhp,5.0,,17.74


The **drop()** method removes the `Name` column from the DataFrame `car_df`. The `columns` parameter names the column to remove, so no separate `axis` argument is needed.

In [2]:
car_df = car_df.drop(columns=["Name"])
car_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6019 entries, 0 to 6018
Data columns (total 12 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Location           6019 non-null   object 
 1   Year               6019 non-null   int64  
 2   Kilometers_Driven  6019 non-null   int64  
 3   Fuel_Type          6019 non-null   object 
 4   Transmission       6019 non-null   object 
 5   Owner_Type         6019 non-null   object 
 6   Mileage            6017 non-null   object 
 7   Engine             5983 non-null   object 
 8   Power              5983 non-null   object 
 9   Seats              5977 non-null   float64
 10  New_Price          824 non-null    object 
 11  Price              6019 non-null   float64
dtypes: float64(2), int64(2), object(8)
memory usage: 564.4+ KB


There are `12 columns` and `6019 rows`. Most columns are complete, but **Mileage** (6017 non-null), **Engine** (5983), **Power** (5983), **Seats** (5977), and **New_Price** (824) have missing values; `New_Price` is missing in about 86% of the rows. The columns use three data types: `object` for columns containing strings (including the measurements that carry a unit suffix), `int64` for integer values, and `float64` for floating-point values.

The `Price` column is of particular importance as it contains the `target variable` for the analysis. **The goal is to predict the price of the used cars based on the other available features.**
<br>

`Mileage`, `Engine`, `Power`, and `New_Price` are stored as strings because each value carries a unit suffix (`kmpl` or `km/kg`, `CC`, `bhp`, and `Lakh`). A regression model needs numeric inputs, so the next cells extract the numeric part of each value with a regular expression and cast it to `float`. Values that do not match the pattern become `NaN` and are handled later by the imputer. Note that `Mileage` mixes two units, `kmpl` and `km/kg` (the latter appears for the CNG car in row 0), so removing the suffix treats both as if they were on the same scale. `This is a simplification: it keeps the feature usable in a single numeric column, at the cost of ignoring the difference between the two units.`

In [3]:
# Remove bhp from Power Data Feature.
df_col=pd.DataFrame()
df_col['Power'] = car_df['Power']

pattern = r'([\d\.]+) bhp'
df_col['numeric_value'] = df_col['Power'].str.extract(pattern)
df_col['numeric_value'] = df_col['numeric_value'].astype(float)

# Adding new updated values back to the Datasets
car_df = car_df.drop(columns=["Power"], axis=1)
car_df['Power'] = df_col['numeric_value']

In [4]:
# Remove CC from Engine Data Feature.
df_col=pd.DataFrame()
df_col['Engine'] = car_df['Engine']
pattern = r'([\d\.]+) CC'
df_col['numeric_value'] = df_col['Engine'].str.extract(pattern)
df_col['numeric_value'] = df_col['numeric_value'].astype(float)

# Adding new updated values back to the Datasets
car_df = car_df.drop(columns=["Engine"], axis=1)
car_df['Engine'] = df_col['numeric_value']

In [5]:
# Remove km/kg and kmpl from Mileage Data Feature.
df_col=pd.DataFrame()
df_col['Mileage'] = car_df['Mileage']
pattern = r'([\d\.]+) km'
df_col['numeric_value'] = df_col['Mileage'].str.extract(pattern)
df_col['numeric_value'] = df_col['numeric_value'].astype(float)

# Adding new updated values back to the Datasets
car_df = car_df.drop(columns=["Mileage"], axis=1)
car_df['Mileage'] = df_col['numeric_value']

In [6]:
# Remove Lakh from New_Price Data Feature.
df_col=pd.DataFrame()
df_col['New_Price'] = car_df['New_Price']
pattern = r'([\d\.]+) Lakh'
df_col['numeric_value'] = df_col['New_Price'].str.extract(pattern)
df_col['numeric_value'] = df_col['numeric_value'].astype(float)

# Adding new updated values back to the Datasets
car_df = car_df.drop(columns=["New_Price"], axis=1)
car_df['New_Price'] = df_col['numeric_value']
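
The four cells above repeat the same extract-and-cast steps. A minimal sketch of a shared helper is given below; the function name `strip_unit` and the sample values (including the `null bhp` entry) are assumptions made for illustration and are not taken from the notebook. Unlike the cells above, it assigns the result in place, so the column order of the DataFrame does not change.

```python
import pandas as pd


def strip_unit(series: pd.Series, unit_pattern: str) -> pd.Series:
    """Return the number in front of the unit as float; non-matches become NaN."""
    pattern = rf"([\d.]+)\s*(?:{unit_pattern})"
    return series.str.extract(pattern, expand=False).astype(float)


# Made-up sample values in the same format as the data set
sample = pd.DataFrame({
    "Power": ["58.16 bhp", "null bhp", None],
    "Mileage": ["26.6 km/kg", "19.67 kmpl", "18.2 kmpl"],
})

sample["Power"] = strip_unit(sample["Power"], "bhp")
sample["Mileage"] = strip_unit(sample["Mileage"], "km/kg|kmpl")
print(sample)
```

Entries without a leading number, such as the made-up `null bhp`, come out as `NaN` and are left for the imputer.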

In [7]:
car_df.head()

Unnamed: 0,Location,Year,Kilometers_Driven,Fuel_Type,Transmission,Owner_Type,Seats,Price,Power,Engine,Mileage,New_Price
0,Mumbai,2010,72000,CNG,Manual,First,5.0,1.75,58.16,998.0,26.6,
1,Pune,2015,41000,Diesel,Manual,First,5.0,12.5,126.2,1582.0,19.67,
2,Chennai,2011,46000,Petrol,Manual,First,5.0,4.5,88.7,1199.0,18.2,8.61
3,Chennai,2012,87000,Diesel,Manual,First,7.0,6.0,88.76,1248.0,20.77,
4,Coimbatore,2013,40670,Diesel,Automatic,Second,5.0,17.74,140.8,1968.0,15.2,


The next cell **splits** the `car_df` DataFrame into training and testing sets, with **70% of the data used for training** and **30% used for testing**. The `X_train` and `X_test` DataFrames **contain all features** except the target variable, Price, while `y_train` and `y_test` **contain only the Price column.** The `random_state` parameter sets the seed of the random number generator used in the split, so the same split is obtained each time the code is run with the same value.

In [8]:
from sklearn.model_selection import train_test_split

# Split 70:30; y is kept as a one-column DataFrame
X_train, X_test, y_train, y_test = train_test_split(
    car_df.drop(columns=['Price']),
    car_df[['Price']],
    test_size=0.30,
    random_state=1250,
)

In [9]:
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

(4213, 11)
(1806, 11)
(4213, 1)
(1806, 1)
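
The effect of `random_state` can be checked on a toy array. This is a small sketch with made-up data, not the car data set: calling `train_test_split` twice with the same seed should return the same rows.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)   # toy feature matrix with 10 rows
y = np.arange(10)                  # toy target

# Same seed, so both calls select the same test rows
_, X_test_a, _, y_test_a = train_test_split(X, y, test_size=0.30, random_state=1250)
_, X_test_b, _, y_test_b = train_test_split(X, y, test_size=0.30, random_state=1250)

print(len(X_test_a))                       # 3 of the 10 rows are held out
print(np.array_equal(y_test_a, y_test_b))  # True
```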


In [10]:
X_train.head()

Unnamed: 0,Location,Year,Kilometers_Driven,Fuel_Type,Transmission,Owner_Type,Seats,Power,Engine,Mileage,New_Price
5229,Bangalore,2012,73000,Diesel,Automatic,First,5.0,190.0,1995.0,22.69,48.79
1267,Chennai,2011,124000,Diesel,Manual,First,5.0,69.0,1396.0,23.03,
2185,Ahmedabad,2017,80000,Diesel,Manual,First,5.0,88.5,1248.0,24.3,10.91
489,Pune,2017,129000,Diesel,Automatic,First,7.0,258.0,2987.0,11.0,
1762,Coimbatore,2011,75871,Petrol,Automatic,First,5.0,254.0,2996.0,9.43,


**To check the distribution of values in categorical features such as Location, Year, Fuel_Type, Transmission, and Owner_Type**, use the `value_counts()` method. This shows the frequency of each unique value in the respective column, that is, how many categories a feature has and how rare some of them are. The choice of encoding, however, depends mainly on whether the categories have a natural order rather than on how many there are. Features without a natural order, such as `Fuel_Type`, are usually encoded with `one-hot encoding`, while features with a natural order, such as `Owner_Type`, can be encoded with `ordinal encoding`.
<br>

**Label encoding** (mapping each category to an arbitrary integer) produces the numeric input that most machine learning algorithms expect, but it has limitations when it is applied to nominal features. Linear models in particular treat the integers as ordered and equally spaced, which is rarely true for categories such as cities or fuel types. In scikit-learn, `LabelEncoder` is intended for target labels, while `OrdinalEncoder` and `OneHotEncoder` are the encoders meant for features.
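
The difference between the two encodings can be seen on a toy column. The four values below are made up for illustration; only the category names are taken from the data set.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

toy = pd.DataFrame({"Fuel_Type": ["Diesel", "Petrol", "CNG", "Petrol"]})

# One-hot: one 0/1 column per category, no order implied
onehot = OneHotEncoder()
onehot_values = onehot.fit_transform(toy).toarray()
print(onehot.categories_[0])   # categories in sorted order: CNG, Diesel, Petrol
print(onehot_values)

# Integer codes: compact, but a linear model reads them as CNG < Diesel < Petrol
ordinal_values = OrdinalEncoder().fit_transform(toy)
print(ordinal_values.ravel())  # [1. 2. 0. 2.]
```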

In [11]:
X_train.Location.value_counts()

Location
Mumbai        561
Hyderabad     502
Coimbatore    458
Kochi         446
Pune          437
Delhi         390
Kolkata       376
Chennai       351
Jaipur        278
Bangalore     255
Ahmedabad     159
Name: count, dtype: int64

In [12]:
X_train.Year.value_counts()

Year
2014    571
2015    536
2016    529
2013    441
2012    398
2017    397
2011    336
2010    242
2018    208
2009    133
2008    117
2007     87
2019     69
2006     56
2005     40
2004     22
2003      8
2002      8
2001      6
1998      4
2000      3
1999      2
Name: count, dtype: int64

In [13]:
X_train.Fuel_Type.value_counts()

Fuel_Type
Diesel      2245
Petrol      1921
CNG           39
LPG            7
Electric       1
Name: count, dtype: int64

In [14]:
X_train.Transmission.value_counts()

Transmission
Manual       3004
Automatic    1209
Name: count, dtype: int64

In [15]:
X_train.Owner_Type.value_counts()

Owner_Type
First             3457
Second             673
Third               77
Fourth & Above       6
Name: count, dtype: int64

In [16]:
X_train.info()

<class 'pandas.core.frame.DataFrame'>
Index: 4213 entries, 5229 to 5173
Data columns (total 11 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Location           4213 non-null   object 
 1   Year               4213 non-null   int64  
 2   Kilometers_Driven  4213 non-null   int64  
 3   Fuel_Type          4213 non-null   object 
 4   Transmission       4213 non-null   object 
 5   Owner_Type         4213 non-null   object 
 6   Seats              4185 non-null   float64
 7   Power              4105 non-null   float64
 8   Engine             4189 non-null   float64
 9   Mileage            4212 non-null   float64
 10  New_Price          570 non-null    float64
dtypes: float64(5), int64(2), object(4)
memory usage: 395.0+ KB
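
The non-null counts above can also be read as missing-value shares; for example, `New_Price` has 570 non-null entries out of 4213, so about 86% of its training values are missing. A minimal sketch of the calculation, using a made-up frame that stands in for `X_train`:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for X_train (values are illustrative)
toy = pd.DataFrame({
    "Seats": [5.0, np.nan, 7.0, 5.0],
    "New_Price": [np.nan, np.nan, 8.61, np.nan],
    "Year": [2010, 2015, 2011, 2012],
})

# isna() marks missing cells; the column mean of that mask is the missing share
missing_share = toy.isna().mean().sort_values(ascending=False)
print(missing_share)
```

Applied to `X_train`, the same expression would list the columns that rely most heavily on imputation.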


## Pipeline Design

**In machine learning, preprocessing the data is a crucial step before building a model**. The pipeline design involves several steps to ensure that the data is properly cleaned, standardized, and encoded.

- The `numerical data` is first preprocessed by `imputing missing values with the median` and then `standardizing` it. The `median` is preferred over the mean here because it is a robust statistic that is less sensitive to outliers. The numerical columns that are imputed and standardized are Kilometers_Driven, Seats, Power, Engine, Mileage, and New_Price. Note that New_Price is missing in most rows, so most of its values after preprocessing are the imputed median.

- The `categorical features` are divided into `ordinal` and `nominal` features. 

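- For `ordinal features`, the missing values are `imputed with the most frequent` value and then `encoded using the OrdinalEncoder`. The OrdinalEncoder maps each unique category to an integer. By default the integers follow the sorted order of the category values (numeric for `Year`, alphabetical for `Owner_Type`), so the intended order has to be passed through the `categories` parameter whenever the sorted order differs from the natural one.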

- For `nominal features`, the missing values are `imputed with the most frequent` value and then `encoded using the OneHotEncoder`. The OneHotEncoder creates a binary vector for each unique category.

- These pipelines are combined using `ColumnTransformer` to preprocess the data and prepare it for machine learning models. 

- After preprocessing the data using ColumnTransformer, we can `combine the preprocessed features with an estimator` to create a complete machine learning pipeline using the make_pipeline function. In this case, we will use LinearRegression as our estimator, a simple and widely used baseline for regression problems. Wrapping preprocessing and model in one pipeline means that imputation, scaling, and encoding are learned from the training data only and are applied in the same way at prediction time, which reduces the potential for human error.
- Once the pipeline is built, we can `fit it on the training data` to train the model and then use it to `make predictions on new data`.
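
The robustness argument for the median can be illustrated with a toy column. The kilometre values below are made up and include one deliberately extreme entry; they are not taken from the data set.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy Kilometers_Driven column with one extreme value and one gap
km = np.array([[40_000.0], [45_000.0], [50_000.0], [6_500_000.0], [np.nan]])

median_filled = SimpleImputer(strategy="median").fit_transform(km)
mean_filled = SimpleImputer(strategy="mean").fit_transform(km)

print(median_filled[-1, 0])  # 47500.0, close to the typical values
print(mean_filled[-1, 0])    # 1658750.0, pulled up by the single outlier
```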


In [17]:
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer

num_cols = ['Kilometers_Driven', 'Seats', 'Power', 'Engine', 'Mileage', 'New_Price' ]  # numerical columns
num_transformer = make_pipeline(
    SimpleImputer(strategy='median'), 
    StandardScaler()
)


The `num_cols` list contains the names of the `numerical columns`, and the `num_transformer pipeline` is designed to first impute missing values with the median and then standardize the numerical data.

<br>


The nominal columns are `Transmission`, `Fuel_Type`, and `Location`, and the ordinal columns are `Year` and `Owner_Type`.




In [18]:
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder
from sklearn.pipeline import make_pipeline

nom_cols = ['Transmission', 'Fuel_Type', 'Location']  # nominal Columns
nom_transformer = make_pipeline(
    SimpleImputer(strategy='most_frequent'),
    OneHotEncoder()
)

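ord_cols = ['Year', 'Owner_Type'] # ordinal columns
# Note: without an explicit `categories` argument, OrdinalEncoder codes each
# column in sorted order (numeric for Year, alphabetical for Owner_Type).
ord_transformer = make_pipeline(
    SimpleImputer(strategy='most_frequent'),
    OrdinalEncoder()
)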

- **The nominal pipeline** first imputes the missing values in the nominal columns with the most frequent value and then encodes the data using the OneHotEncoder. The OneHotEncoder creates a binary column for each unique value in the nominal column, indicating whether that value is present or not for each observation.

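- **The ordinal pipeline** also imputes missing values with the most frequent value and then encodes the data using the OrdinalEncoder. The OrdinalEncoder converts each unique value in the ordinal column to an integer. With default settings the integers follow the sorted order of the values: this matches the natural order for `Year`, but for `Owner_Type` the alphabetical order (First, Fourth & Above, Second, Third) does not match the ownership order.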
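
A minimal sketch of how the ownership order could be made explicit is shown below, using a toy one-column frame. The variable name `owner_order` is an assumption for illustration. In the two-column ordinal pipeline above, `categories` would need one list per column, so a list of years would have to be supplied as well; changing the encoding would also change the scores reported later.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

toy = pd.DataFrame({"Owner_Type": ["First", "Second", "Third", "Fourth & Above"]})

# Default: categories are sorted alphabetically
default_codes = OrdinalEncoder().fit_transform(toy).ravel()
print(default_codes)   # [0. 2. 3. 1.]

# Explicit order: one list per encoded column
owner_order = ["First", "Second", "Third", "Fourth & Above"]
ordered_codes = OrdinalEncoder(categories=[owner_order]).fit_transform(toy).ravel()
print(ordered_codes)   # [0. 1. 2. 3.]
```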

The `ColumnTransformer` takes a list of `(name, transformer, columns)` tuples, where each tuple specifies the columns to transform and the transformation to apply to them. The transformer can be a pipeline, a scaler, an encoder, or any other estimator with `fit` and `transform` methods. Columns that are not listed in any tuple are dropped by default.

In [19]:
from sklearn.compose import ColumnTransformer

# Create a column transformer to apply the transformers to the appropriate columns
preprocessor = ColumnTransformer([
    ('Numerical Columns', num_transformer, num_cols),
    ('Nominal Columns', nom_transformer, nom_cols),
    ('Ordinal Columns', ord_transformer, ord_cols)
])

### Creating a Machine Learning Pipeline with Preprocessing and Linear Regression

After `preprocessing` the data using ColumnTransformer, we can `combine the preprocessed features with an estimator` to create a complete machine learning pipeline using the **make_pipeline** function. 

The estimator will be used to learn the underlying patterns and relationships in the training data, and then generalize to make predictions on new data. In this case, we will use `LinearRegression` as our estimator.

In [20]:
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

# Create a pipeline that combines the preprocessor with a regressor
pipeline = make_pipeline(
    preprocessor, 
    LinearRegression()
)

`Fit` the pipeline on the `training data` to train the model, and then use it to make `predictions` on new data.

In [21]:
pipeline.fit(X_train, y_train)

## Pipeline Results

- **pipeline.predict(X_test)** generates the predicted values for the testing data, while 
- **pipeline.predict(X_train)** generates the predicted values for the training data.

The predicted values can be compared with the actual values to evaluate the performance of the model and assess its accuracy.

In [22]:
y_train_pred = pipeline.predict(X_train)
y_test_pred = pipeline.predict(X_test)

To assess the fit of the **regression model**, R2 and the mean squared error are used. `pipeline.score` returns R2 for a regressor (the same quantity that `r2_score` computes), and `mean_squared_error` comes from the sklearn.metrics module.
- **R2** (the coefficient of determination) is the proportion of variance in the target variable that is explained by the predictor variables. The closer the R-squared value is to 1, the better the model fits the data.

- The **mean_squared_error** function calculates the mean of the squared differences between the actual target values and the predicted values. A lower mean squared error indicates better performance of the model. Because `Price` is in lakhs, the value is expressed in squared lakhs.
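
Both metrics can be reproduced by hand, which makes their definitions concrete. In the sketch below the actual values are the prices of the first five rows of the data set, while the predictions are made up for illustration.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

y_true = np.array([1.75, 12.5, 4.5, 6.0, 17.74])  # prices of the first five rows
y_pred = np.array([2.0, 11.0, 5.0, 7.0, 16.0])    # made-up predictions

# MSE: mean of the squared residuals
mse = np.mean((y_true - y_pred) ** 2)

# R2: 1 minus residual sum of squares over total sum of squares
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(np.isclose(mse, mean_squared_error(y_true, y_pred)))  # True
print(np.isclose(r2, r2_score(y_true, y_pred)))             # True
```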

In [23]:
from sklearn.metrics import mean_squared_error, r2_score

print('R2 on Train data: %.2f' % pipeline.score(X_train, y_train))
print('R2 on Test data: %.2f' % pipeline.score(X_test, y_test))

print('\nMean square Error for Train Data: %.2f' % mean_squared_error(y_train, y_train_pred))
print('Mean square Error for Test Data: %.2f' % mean_squared_error(y_test, y_test_pred))

R2 on Train data: 0.72
R2 on Test data: 0.71

Mean square Error for Train Data: 35.85
Mean square Error for Test Data: 33.23


- The `regression model` has an **R2 of 0.72 on the training data** and an **R2 of 0.71 on the test data**, so it explains roughly 70% of the variance in price. The two values are nearly equal, which indicates that the model is `generalizing` to new data about as well as it fits the training data.

- The **mean square error (MSE) for the training data is 35.85**, while the **MSE for the test data is 33.23**, both in squared lakhs. Taking the square root gives a typical error of about 6 lakh, which is not small relative to the prices in the sample rows shown earlier (1.75 to 17.74 lakh). The test error is no higher than the training error, so the model is `not overfitting` to the training data.

Overall, these results suggest that the linear model is a reasonable baseline that generalizes consistently, although close to 30% of the price variance remains unexplained.
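
Because the MSE is in squared units, it is often easier to read as a root mean squared error (RMSE) in lakhs. The arithmetic is sketched below using only the two MSE values reported above.

```python
import math

mse_train, mse_test = 35.85, 33.23   # values reported above, in squared lakhs

rmse_train = math.sqrt(mse_train)
rmse_test = math.sqrt(mse_test)

print(f"RMSE train: {rmse_train:.2f} lakh")  # 5.99
print(f"RMSE test:  {rmse_test:.2f} lakh")   # 5.76
```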

## Conclusion

- **Preprocessing the data is an important step before building a machine learning model**, and involves `cleaning`, `standardizing`, and `encoding` the data. 
- The numerical features are imputed with the median and standardized, while the categorical features are imputed with the most frequent value and encoded using either OrdinalEncoder or OneHotEncoder. 
- The pipelines are combined using `ColumnTransformer`, and a `LinearRegression model` is used as the estimator to create a complete machine learning pipeline. 
- The model's performance is evaluated with R2 and the mean squared error: R2 is 0.72 on the training data and 0.71 on the test data, so the model generalizes consistently to new data. 
- **The training and test MSE (35.85 and 33.23) are close, so the model is not overfitting, although a typical error of about 6 lakh per car is not negligible.**
- Overall, these results suggest that `the linear regression pipeline is a reasonable baseline for the data`.