# Preprocessing Data And Modeling Pipeline

### This Python code imports essential libraries for a machine learning project:

- pandas: A powerful data manipulation library in Python, commonly used for handling and analyzing structured data.

- joblib: This library is used for saving and loading machine learning models. It's particularly efficient for large NumPy arrays.

- LabelEncoder: A scikit-learn class that encodes categorical labels with numerical values. It's useful for converting categorical data into a format suitable for machine learning algorithms.

- train_test_split: A function from scikit-learn for splitting datasets into training and testing sets. It helps assess the model's performance on unseen data.

- RandomForestRegressor: A machine learning algorithm for regression tasks, belonging to scikit-learn. Random Forests are ensembles of decision trees and are widely used for regression and classification tasks.

- mean_squared_log_error: This metric is used to evaluate the performance of regression models. It measures the mean of the squared differences between the logarithms of the predicted and true values, and is commonly used for tasks where predictions vary across several orders of magnitude.

- SimpleImputer: A scikit-learn class for handling missing values in a dataset. It provides strategies for imputing missing data, such as replacing missing values with the mean or median of the available values.

Together, these libraries cover the data handling, preprocessing, model building, and evaluation steps of this project.

In [44]:
import pandas as pd
import joblib
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_log_error
from sklearn.impute import SimpleImputer

- file_path: A variable containing the path or filename of the CSV file to be read.

- pd.read_csv(file_path): The read_csv function from the pandas library is used to read the CSV file and create a DataFrame (df). DataFrames are tabular data structures that can store and manipulate structured data efficiently.

This code snippet is commonly used at the beginning of a machine learning project to load the dataset into a DataFrame, enabling further exploration and analysis of the data.

In [None]:
file_path = "Train.csv"
df = pd.read_csv(file_path)

In [None]:
numeric_columns_with_missing = ['SalesID', 'SalePrice', 'MachineID', 'ModelID', 'datasource','auctioneerID', 'YearMade', 'MachineHoursCurrentMeter']
categorical_columns_with_missing = ['UsageBand', 'fiModelDesc', 'fiBaseModel',
       'fiSecondaryDesc', 'fiModelSeries', 'fiModelDescriptor', 'ProductSize',
       'fiProductClassDesc', 'state', 'ProductGroup', 'ProductGroupDesc',
       'Drive_System', 'Enclosure', 'Forks', 'Pad_Type', 'Ride_Control',
       'Stick', 'Transmission', 'Turbocharged', 'Blade_Extension',
       'Blade_Width', 'Enclosure_Type', 'Engine_Horsepower', 'Hydraulics',
       'Pushblock', 'Ripper', 'Scarifier', 'Tip_Control', 'Tire_Size',
       'Coupler', 'Coupler_System', 'Grouser_Tracks', 'Hydraulics_Flow',
       'Track_Type', 'Undercarriage_Pad_Width', 'Stick_Length', 'Thumb',
       'Pattern_Changer', 'Grouser_Type', 'Backhoe_Mounting', 'Blade_Type',
       'Travel_Controls', 'Differential_Type', 'Steering_Controls']
# Numeric columns imputation with mean
numeric_imputer = SimpleImputer(strategy='mean')
df[numeric_columns_with_missing] = numeric_imputer.fit_transform(df[numeric_columns_with_missing])
# Categorical columns imputation with the most frequent value
categorical_imputer = SimpleImputer(strategy='most_frequent')
df[categorical_columns_with_missing] = categorical_imputer.fit_transform(df[categorical_columns_with_missing])

- numeric_columns_with_missing: List of numeric columns with missing values.

- categorical_columns_with_missing: List of categorical columns with missing values.

- numeric_imputer: A SimpleImputer object using the mean strategy. It fills missing values in numeric columns with the mean of the available values.

- categorical_imputer: A SimpleImputer object using the most frequent strategy. It fills missing values in categorical columns with the most frequent value or the mode.

This step handles the missing data so that the dataset is ready for further analysis and model training. The next cell checks that no missing values remain.
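The two imputation strategies can be sketched on a toy DataFrame. The column names are borrowed from the dataset, but the values below are invented for illustration and are not taken from `Train.csv`.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy data, invented for illustration (not from Train.csv)
toy = pd.DataFrame({
    "MachineHoursCurrentMeter": [100.0, np.nan, 300.0],
    "UsageBand": pd.Series(["Low", np.nan, "Low"], dtype=object),
})

numeric_cols = ["MachineHoursCurrentMeter"]
categorical_cols = ["UsageBand"]

# Mean strategy: the missing value becomes (100 + 300) / 2
toy[numeric_cols] = SimpleImputer(strategy="mean").fit_transform(toy[numeric_cols])

# Most-frequent strategy: the missing value becomes the mode of the column
toy[categorical_cols] = SimpleImputer(strategy="most_frequent").fit_transform(
    toy[categorical_cols]
)

print(toy)
```

In a stricter setup, the imputers would be fitted on the training split only and then applied to the test split, so that no statistics from the test data reach the model.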

In [None]:
df.isnull().sum()

SalesID                     0
SalePrice                   0
MachineID                   0
ModelID                     0
datasource                  0
auctioneerID                0
YearMade                    0
MachineHoursCurrentMeter    0
UsageBand                   0
saledate                    0
fiModelDesc                 0
fiBaseModel                 0
fiSecondaryDesc             0
fiModelSeries               0
fiModelDescriptor           0
ProductSize                 0
fiProductClassDesc          0
state                       0
ProductGroup                0
ProductGroupDesc            0
Drive_System                0
Enclosure                   0
Forks                       0
Pad_Type                    0
Ride_Control                0
Stick                       0
Transmission                0
Turbocharged                0
Blade_Extension             0
Blade_Width                 0
Enclosure_Type              0
Engine_Horsepower           0
Hydraulics                  0
Pushblock 

In [None]:
df['saledate'] = pd.to_datetime(df['saledate'])
df['sale_year'] = df['saledate'].dt.year
df['sale_month'] = df['saledate'].dt.month
df['sale_day'] = df['saledate'].dt.day
df['sale_dayofweek'] = df['saledate'].dt.dayofweek
df = df.drop(['saledate'], axis=1)

**This kind of feature engineering is common for data with a date column: splitting `saledate` into year, month, day, and day of week gives the model numeric temporal features, after which the original column is dropped.**
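As a small illustration, the same extraction applied to two sale dates that appear in the output further down (16 November 2006 and 26 March 2004). The ISO date strings are an assumption about formatting, chosen so that parsing is unambiguous.

```python
import pandas as pd

# Two sale dates, written as ISO strings for unambiguous parsing
toy = pd.DataFrame({"saledate": ["2006-11-16", "2004-03-26"]})

toy["saledate"] = pd.to_datetime(toy["saledate"])
toy["sale_year"] = toy["saledate"].dt.year
toy["sale_month"] = toy["saledate"].dt.month
toy["sale_day"] = toy["saledate"].dt.day
toy["sale_dayofweek"] = toy["saledate"].dt.dayofweek  # Monday=0 ... Sunday=6
toy = toy.drop(["saledate"], axis=1)

print(toy)
```

pandas numbers the days of the week from Monday as 0, so a Thursday is encoded as 3 and a Friday as 4.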

In [None]:
df = df[df['YearMade'] >= 1900]

**This type of filtering is common when dealing with datasets containing anomalies or invalid values. In this case, it ensures that only valid records with a 'YearMade' value greater than or equal to 1900 are retained.**
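Before applying a filter like this, it can help to count how many rows it removes. The sketch below uses an invented toy column; the value 1000 stands in for an implausible year and is an assumption for illustration only.

```python
import pandas as pd

# Toy YearMade values, invented for illustration; 1000 is a stand-in for an invalid year
toy = pd.DataFrame({"YearMade": [1000, 1996, 2004, 1900]})

valid = toy["YearMade"] >= 1900     # boolean mask of rows to keep
n_dropped = int((~valid).sum())     # rows that the filter would remove
filtered = toy[valid]

print(f"Dropping {n_dropped} of {len(toy)} rows")
```

With `>=`, a row whose `YearMade` is exactly 1900 is kept; a strict `>` comparison would exclude it.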

In [None]:
df[df['YearMade'] > 1900]

Unnamed: 0,SalesID,SalePrice,MachineID,ModelID,datasource,auctioneerID,YearMade,MachineHoursCurrentMeter,UsageBand,fiModelDesc,...,Grouser_Type,Backhoe_Mounting,Blade_Type,Travel_Controls,Differential_Type,Steering_Controls,sale_year,sale_month,sale_day,sale_dayofweek
0,1139246.0,66000.0,999089.0,3157.0,121.0,3.0,2004.0,68.000000,Low,521D,...,Double,None or Unspecified,PAT,None or Unspecified,Standard,Conventional,2006,11,16,3
1,1139248.0,57000.0,117657.0,77.0,121.0,3.0,1996.0,4640.000000,Low,950FII,...,Double,None or Unspecified,PAT,None or Unspecified,Standard,Conventional,2004,3,26,4
2,1139249.0,10000.0,434808.0,7009.0,121.0,3.0,2001.0,2838.000000,High,226,...,Double,None or Unspecified,PAT,None or Unspecified,Standard,Conventional,2004,2,26,3
3,1139251.0,38500.0,1026470.0,332.0,121.0,3.0,2001.0,3486.000000,High,PC120-6E,...,Double,None or Unspecified,PAT,None or Unspecified,Standard,Conventional,2011,5,19,3
4,1139253.0,11000.0,1057373.0,17311.0,121.0,3.0,2007.0,722.000000,Medium,S175,...,Double,None or Unspecified,PAT,None or Unspecified,Standard,Conventional,2009,7,23,3
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
401120,6333336.0,10500.0,1840702.0,21439.0,149.0,1.0,2005.0,3457.955353,Medium,35NX2,...,Double,None or Unspecified,PAT,None or Unspecified,Standard,Conventional,2011,11,2,2
401121,6333337.0,11000.0,1830472.0,21439.0,149.0,1.0,2005.0,3457.955353,Medium,35NX2,...,Double,None or Unspecified,PAT,None or Unspecified,Standard,Conventional,2011,11,2,2
401122,6333338.0,11500.0,1887659.0,21439.0,149.0,1.0,2005.0,3457.955353,Medium,35NX2,...,Double,None or Unspecified,PAT,None or Unspecified,Standard,Conventional,2011,11,2,2
401123,6333341.0,9000.0,1903570.0,21435.0,149.0,2.0,2005.0,3457.955353,Medium,30NX,...,Double,None or Unspecified,PAT,None or Unspecified,Standard,Conventional,2011,10,25,1


In [None]:
features_to_encode = ['UsageBand', 'fiModelDesc', 'fiBaseModel',
       'fiSecondaryDesc', 'fiModelSeries', 'fiModelDescriptor', 'ProductSize',
       'fiProductClassDesc', 'state', 'ProductGroup', 'ProductGroupDesc',
       'Drive_System', 'Enclosure', 'Forks', 'Pad_Type', 'Ride_Control',
       'Stick', 'Transmission', 'Turbocharged', 'Blade_Extension',
       'Blade_Width', 'Enclosure_Type', 'Engine_Horsepower', 'Hydraulics',
       'Pushblock', 'Ripper', 'Scarifier', 'Tip_Control', 'Tire_Size',
       'Coupler', 'Coupler_System', 'Grouser_Tracks', 'Hydraulics_Flow',
       'Track_Type', 'Undercarriage_Pad_Width', 'Stick_Length', 'Thumb',
       'Pattern_Changer', 'Grouser_Type', 'Backhoe_Mounting', 'Blade_Type',
       'Travel_Controls', 'Differential_Type', 'Steering_Controls']
for column in features_to_encode:
    unique_types = df[column].apply(type).unique()
    if len(unique_types) > 1:
        print(f"Column '{column}' has mixed data types: {unique_types}")

        # Handle the mixed data types, for example, convert to numeric or handle separately
        # For simplicity, let's convert the entire column to strings
        df[column] = df[column].astype(str)

Column 'fiModelSeries' has mixed data types: [<class 'str'> <class 'float'>]


In [None]:
features_to_encode = ['UsageBand', 'fiModelDesc', 'fiBaseModel',
       'fiSecondaryDesc', 'fiModelSeries', 'fiModelDescriptor', 'ProductSize',
       'fiProductClassDesc', 'state', 'ProductGroup', 'ProductGroupDesc',
       'Drive_System', 'Enclosure', 'Forks', 'Pad_Type', 'Ride_Control',
       'Stick', 'Transmission', 'Turbocharged', 'Blade_Extension',
       'Blade_Width', 'Enclosure_Type', 'Engine_Horsepower', 'Hydraulics',
       'Pushblock', 'Ripper', 'Scarifier', 'Tip_Control', 'Tire_Size',
       'Coupler', 'Coupler_System', 'Grouser_Tracks', 'Hydraulics_Flow',
       'Track_Type', 'Undercarriage_Pad_Width', 'Stick_Length', 'Thumb',
       'Pattern_Changer', 'Grouser_Type', 'Backhoe_Mounting', 'Blade_Type',
       'Travel_Controls', 'Differential_Type', 'Steering_Controls']
label_encoder = LabelEncoder()
for feature in features_to_encode:
    df[feature + '_label_encoded'] = label_encoder.fit_transform(df[feature])

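- label_encoder = LabelEncoder(): Initializes a single LabelEncoder object. Because the same object is re-fitted for every feature in the loop, it only remembers the mapping of the last feature once the loop finishes.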

- for feature in features_to_encode:: Iterates through the list of categorical features specified in features_to_encode.

- df[feature + '_label_encoded'] = label_encoder.fit_transform(df[feature]): Applies label encoding to the selected feature and creates a new column with the original feature name appended with '_label_encoded'.

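Label encoding is a preprocessing technique commonly used to convert categorical variables into numerical format, making them suitable for machine learning algorithms that require numerical input. The integer codes follow the sorted order of the category values, so they do not express any meaningful ranking.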
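If the category-to-integer mappings are needed later, for example to encode new data at prediction time or to decode values back to their labels, one option is to keep a separate fitted encoder per column. This is a sketch of that variation, not the notebook's own approach; the `encoders` dictionary and the `ProductGroup` values are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy data; the ProductGroup values are invented for illustration
toy = pd.DataFrame({
    "UsageBand": ["Low", "High", "Medium", "Low"],
    "ProductGroup": ["TTT", "WL", "TTT", "SSL"],
})

encoders = {}  # hypothetical helper: one fitted LabelEncoder per column
for feature in ["UsageBand", "ProductGroup"]:
    encoder = LabelEncoder()
    toy[feature + "_label_encoded"] = encoder.fit_transform(toy[feature])
    encoders[feature] = encoder

# Each stored encoder can map its integer codes back to the original labels
decoded = encoders["UsageBand"].inverse_transform(toy["UsageBand_label_encoded"])
print(list(encoders["UsageBand"].classes_))
print(list(decoded))
```

The `encoders` dictionary could be saved with `joblib.dump` alongside the model so that the same mappings are available when the model is deployed.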

In [None]:
df = df.drop(features_to_encode, axis=1)

**After label encoding, the original categorical columns are dropped: they duplicate the information in the new label-encoded columns, and the model needs purely numeric input.**

In [None]:
df

Unnamed: 0,SalesID,SalePrice,MachineID,ModelID,datasource,auctioneerID,YearMade,MachineHoursCurrentMeter,sale_year,sale_month,...,Undercarriage_Pad_Width_label_encoded,Stick_Length_label_encoded,Thumb_label_encoded,Pattern_Changer_label_encoded,Grouser_Type_label_encoded,Backhoe_Mounting_label_encoded,Blade_Type_label_encoded,Travel_Controls_label_encoded,Differential_Type_label_encoded,Steering_Controls_label_encoded
0,1139246.0,66000.0,999089.0,3157.0,121.0,3.0,2004.0,68.000000,2006,11,...,18,27,2,1,0,0,5,5,3,1
1,1139248.0,57000.0,117657.0,77.0,121.0,3.0,1996.0,4640.000000,2004,3,...,18,27,2,1,0,0,5,5,3,1
2,1139249.0,10000.0,434808.0,7009.0,121.0,3.0,2001.0,2838.000000,2004,2,...,18,27,2,1,0,0,5,5,3,1
3,1139251.0,38500.0,1026470.0,332.0,121.0,3.0,2001.0,3486.000000,2011,5,...,18,27,2,1,0,0,5,5,3,1
4,1139253.0,11000.0,1057373.0,17311.0,121.0,3.0,2007.0,722.000000,2009,7,...,18,27,2,1,0,0,5,5,3,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
401120,6333336.0,10500.0,1840702.0,21439.0,149.0,1.0,2005.0,3457.955353,2011,11,...,18,27,2,1,0,0,5,5,3,1
401121,6333337.0,11000.0,1830472.0,21439.0,149.0,1.0,2005.0,3457.955353,2011,11,...,18,27,2,1,0,0,5,5,3,1
401122,6333338.0,11500.0,1887659.0,21439.0,149.0,1.0,2005.0,3457.955353,2011,11,...,18,27,2,1,0,0,5,5,3,1
401123,6333341.0,9000.0,1903570.0,21435.0,149.0,2.0,2005.0,3457.955353,2011,10,...,18,27,2,1,0,0,5,5,3,1


In [None]:
df.select_dtypes(exclude=['int','float'])

0
1
2
3
4
...
401120
401121
401122
401123
401124


In [None]:
file_path = "new_preprocess_data.csv"
df.to_csv(file_path, index=False)

**Saving the preprocessed data to a new file is a crucial step in a machine learning pipeline, as it allows you to use the cleaned and transformed data for model training, testing, and deployment.**

In [None]:
# Load your preprocessed data
df = pd.read_csv(file_path)
# Split the data into features (X) and target variable (y)
X = df.drop('SalePrice', axis=1)
y = df['SalePrice']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Define the Random Forest Regressor model with parameters chosen to limit compute and memory use
rf_model_low_resources = RandomForestRegressor(
    n_estimators=50,  # You can further reduce this number
    max_depth=20,      # Adjust as needed, lower values for shallower trees
    min_samples_split=5,
    min_samples_leaf=2,
    random_state=42
)
# Train the model
rf_model_low_resources.fit(X_train, y_train)
# Predict on the test set
y_pred = rf_model_low_resources.predict(X_test)
# Evaluate the model
rmsle_score = mean_squared_log_error(y_test, y_pred) ** 0.5
rmsle_score

0.21154680375086815

- df = pd.read_csv(file_path): Reads the preprocessed data from a CSV file into a DataFrame.

- Splits the data into features (X) and the target variable (y).

- Divides the dataset into training and testing sets using train_test_split.

- Defines a Random Forest Regressor model (rf_model_low_resources) with specified parameters.

- Trains the model using the training data (X_train, y_train).

- Predicts the target variable on the test set (X_test).

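- Evaluates the model performance using Root Mean Squared Log Error (RMSLE), computed as the square root of mean_squared_log_error. The cell's output shows an RMSLE of about 0.21 on the held-out 20% test split.

This is a typical workflow for training and evaluating a machine learning regression model using the Random Forest algorithm.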
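To make the metric concrete, RMSLE can be computed by hand as the square root of the mean squared difference between `log(1 + prediction)` and `log(1 + true value)`. The sketch below checks that this matches scikit-learn's result; the true prices are taken from the first rows shown earlier, while the predictions are invented for illustration.

```python
import numpy as np
from sklearn.metrics import mean_squared_log_error

y_true = np.array([66000.0, 57000.0, 10000.0])  # sale prices from the first rows above
y_hat = np.array([60000.0, 59000.0, 12000.0])   # invented predictions

# Manual RMSLE: sqrt(mean((log(1 + y_hat) - log(1 + y_true))^2))
rmsle_manual = np.sqrt(np.mean((np.log1p(y_hat) - np.log1p(y_true)) ** 2))

# Same quantity via scikit-learn, as in the notebook cell
rmsle_sklearn = mean_squared_log_error(y_true, y_hat) ** 0.5

print(rmsle_manual, rmsle_sklearn)
```

Because the error is measured on a log scale, it reflects relative rather than absolute differences: being off by 2,000 on a 10,000 machine costs more than being off by 2,000 on a 57,000 machine.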

In [None]:
# Save the trained model to a file
joblib.dump(rf_model_low_resources, 'trained_model.pkl')

['D:/Data_Scientist/Bulldozers_Price_Prediction/FROM_Modeling_TO_Deploying/trained_model.pkl']

In [None]:
# Save the same trained model again, this time with a .joblib extension
joblib.dump(rf_model_low_resources, 'trained_model.joblib')

['D:/Data_Scientist/Bulldozers_Price_Prediction/FROM_Modeling_TO_Deploying/trained_model.joblib']

### Difference between `trained_model.joblib` and `trained_model.pkl`

Both files above were written by `joblib.dump`, so they hold the same joblib serialization; the extension is only a naming convention and does not change the format. The meaningful comparison is between the `joblib` library and Python's built-in `pickle` module, whose output is conventionally saved as `.pkl`:

1. **Efficiency:**
   - **Joblib:** Optimized for objects that carry large NumPy arrays and supports optional compression, making it suitable for fitted machine learning models.
   - **Pickle:** A more general-purpose serialization module that may be less efficient than joblib for large numerical arrays.

2. **Backward Compatibility:**
   - **Joblib:** Built on top of pickle, so it offers no extra guarantee across versions. A model saved with one scikit-learn version is not guaranteed to load, or to behave identically, under another.
   - **Pickle:** The same caveat applies. Recording the Python and scikit-learn versions alongside the saved model makes it possible to recreate a matching environment later.

3. **Dependencies:**
   - **Joblib:** A third-party package, installed as a dependency of scikit-learn.
   - **Pickle:** A standard Python module with no external dependencies, suitable for a wide range of Python objects.

In scikit-learn projects, models are commonly saved with joblib and the `.joblib` extension, but either library can be used, and the choice may depend on specific requirements and preferences. As with any pickle-based format, saved models should only be loaded from trusted sources, because loading can execute arbitrary code.
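A round-trip sketch: a small model is saved with `joblib.dump` under both extensions, loaded back with `joblib.load`, and the predictions are compared. The toy data, the file names, and the use of a temporary directory are assumptions made for illustration.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Small synthetic regression problem, invented for illustration
rng = np.random.default_rng(42)
X_toy = rng.random((50, 3))
y_toy = X_toy.sum(axis=1)

model = RandomForestRegressor(n_estimators=5, random_state=42).fit(X_toy, y_toy)

with tempfile.TemporaryDirectory() as tmp_dir:
    pkl_path = os.path.join(tmp_dir, "toy_model.pkl")
    joblib_path = os.path.join(tmp_dir, "toy_model.joblib")

    # Same library and same call; only the file extension differs
    joblib.dump(model, pkl_path)
    joblib.dump(model, joblib_path)

    from_pkl = joblib.load(pkl_path)
    from_joblib = joblib.load(joblib_path)

same_predictions = np.array_equal(from_pkl.predict(X_toy), from_joblib.predict(X_toy))
print(same_predictions)
```

Both loaded models reproduce the predictions of the original, which is what one would expect when the two files differ only in name.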
