# <center><font color="#2F80ED">FCase_Study1_BSIS4_SoraHigh</font></center>
## <center><font color="#56CCF2">Machine Learning Regression Case Study</font></center>
### <center><font color="#6FCF97">Using CRISP-DM Methodology</font></center>

---

# <font color="#9B51E0">Group Members</font>

- John Andrew Emmanuel Avelino  
- Harvey Kim Solano  
- Maria Consuelo Mangonon  
- Arabella Trixie Sabisol  
- Jelena Jane Montuya  
- Kristine Joy Casaquite  

---

# <font color="#F2994A">Course:</font>  
**BSIS 4 – Information Systems - Business Analytics**

# <font color="#EB5757">Date:</font>  
*November 14, 2025*

---


# <font color="#2F80ED">Business Understanding</font>

The goal of this project is to develop a predictive machine learning model capable of estimating
the **market price of an aircraft** based on its specifications and performance characteristics.

Accurate aircraft price prediction is important for:

- Aircraft buyers evaluating fair market value  
- Sellers determining competitive pricing  
- Aviation brokers conducting appraisal and consultancy  
- Manufacturers analyzing pricing trends  
- Financial institutions and leasing companies assessing asset valuation  

The chosen dataset, the **Plane Price Dataset**, contains a wide variety of technical and
operational attributes such as engine type, performance metrics, dimensions, capacity, and more.
Using these features, the model aims to provide **data-driven price estimation**.

The business objective is to build a model that:

- Achieves high predictive accuracy  
- Generalizes well to unseen aircraft data  
- Can be deployed for real-time or batch predictions  
- Supports aviation stakeholders in pricing decisions  

This understanding serves as the foundation for the remaining CRISP-DM steps.


# <font color="#2F80ED">1. Data Understanding</font>

The Data Understanding stage examines the Kaggle dataset selected for this case study — the Plane Price Dataset, containing detailed aircraft specifications and their market price. The target variable Price is continuous, making this dataset ideal for regression modelling.

# <font color="#56CCF2">1.1 Dataset Overview</font>

The dataset contains 517 aircraft entries and 16 columns describing airplane performance, physical characteristics, and specifications.

Features include:

- Model Name – The name or designation of the aircraft model.
- Engine Type – Type of engine installed (e.g., piston, turboprop, jet), which affects performance and cost.
- HP or lbs thr ea engine – Engine power, expressed as horsepower for piston engines or pounds of thrust for jets.
- Max speed Knots – Maximum achievable speed of the aircraft in knots.
- Rcmnd cruise Knots – Recommended cruising speed in knots, representing typical operational speed.
- Stall Knots dirty – Stall speed in “dirty” configuration (with flaps or landing gear extended), in knots.
- Fuel gal/lbs – Fuel capacity of the aircraft, expressed in gallons or pounds depending on aircraft type.
- All eng rate of climb – Rate of climb with all engines operating (feet per minute).
- Eng out rate of climb – Rate of climb with one engine out (feet per minute), relevant for multi-engine aircraft.
- Takeoff over 50ft – Required takeoff distance to clear a 50-foot obstacle (in feet).
- Landing over 50ft – Required landing distance to clear a 50-foot obstacle (in feet).
- Empty weight lbs – Aircraft weight without passengers, cargo, or fuel (in pounds).
- Length ft/in – Aircraft length expressed in feet and inches.
- Wing span ft/in – Wingspan of the aircraft expressed in feet and inches.
- Range N.M. – Maximum distance the aircraft can fly without refueling (in nautical miles).

# <font color="#EB5757"><b>Price (Target Variable)</b></font>

Each row represents one aircraft listing, and each feature contributes to determining the aircraft’s value.
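
The `Length ft/in` and `Wing span ft/in` columns hold values such as `25/3`, which appear to mean 25 feet 3 inches. The sketch below shows one way to turn such strings into decimal feet so they could be treated as numbers. The helper name `feet_inches_to_feet` is hypothetical, and the feet/inches reading of the format is an assumption based on the column names.

```python
import pandas as pd


def feet_inches_to_feet(value):
    """Convert a 'feet/inches' string such as '25/3' to decimal feet.

    Assumes the part before the slash is feet and the part after is inches.
    Returns NaN for missing or unparseable values.
    """
    if pd.isna(value):
        return float("nan")
    feet, _, inches = str(value).partition("/")
    try:
        return float(feet) + (float(inches) if inches else 0.0) / 12.0
    except ValueError:
        return float("nan")


# Sample values in the same style as the dataset preview
sample = pd.Series(["25/3", "20/7", "35/0", None])
length_ft = sample.apply(feet_inches_to_feet)
print(length_ft.round(2).tolist())
```

Converting these columns this way is optional; the notebook itself keeps them as text and encodes them as categories.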

---

### <font color="#2F80ED">Necessary Imports</font>

In [None]:
# Data manipulation and analysis
import pandas as pd
import numpy as np
# Visualization
import seaborn as sns
import matplotlib.pyplot as plt
# Preprocessing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
# Models
from sklearn.linear_model import Ridge, Lasso, ElasticNet
from sklearn.ensemble import RandomForestRegressor
# Hyperparameter tuning
from sklearn.model_selection import GridSearchCV
# Metrics
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error
# Saving model
import joblib


<font color="#2D9CDB">2. Load the Dataset</font>

In [None]:
# Load the dataset
df = pd.read_csv("Plane_Price.csv")

# Display the first few rows
df.head()


Unnamed: 0,Model Name,Engine Type,HP or lbs thr ea engine,Max speed Knots,Rcmnd cruise Knots,Stall Knots dirty,Fuel gal/lbs,All eng rate of climb,Eng out rate of climb,Takeoff over 50ft,Landing over 50ft,Empty weight lbs,Length ft/in,Wing span ft/in,Range N.M.,Price
0,100 Darter (S.L. Industries),Piston,145,104,91.0,46.0,36,450,900.0,1300.0,2050,1180,25/3,37/5,370,1300000.0
1,7 CCM Champ,Piston,85,89,83.0,44.0,15,600,720.0,800.0,1350,820,20/7,36/1,190,1230000.0
2,100 Darter (S.L. Industries),Piston,90,90,78.0,37.0,19,650,475.0,850.0,1300,810,21/5,35/0,210,1600000.0
3,7 AC Champ,Piston,85,88,78.0,37.0,19,620,500.0,850.0,1300,800,21/5,35/0,210,1300000.0
4,100 Darter (S.L. Industries),Piston,65,83,74.0,33.0,14,370,632.0,885.0,1220,740,21/5,35/0,175,1250000.0


<font color="#6FCF97">2.1 Basic Dataset Information</font>

In [None]:
# Check dataset structure
df.info()

#View Summary statistics of numerical columns
df.describe()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 517 entries, 0 to 516
Data columns (total 16 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   Model Name               517 non-null    object 
 1   Engine Type              517 non-null    object 
 2   HP or lbs thr ea engine  517 non-null    object 
 3   Max speed Knots          497 non-null    object 
 4   Rcmnd cruise Knots       507 non-null    float64
 5   Stall Knots dirty        502 non-null    float64
 6   Fuel gal/lbs             517 non-null    int64  
 7   All eng rate of climb    513 non-null    object 
 8   Eng out rate of climb    491 non-null    float64
 9   Takeoff over 50ft        492 non-null    float64
 10  Landing over 50ft        517 non-null    object 
 11  Empty weight lbs         516 non-null    object 
 12  Length ft/in             517 non-null    object 
 13  Wing span ft/in          517 non-null    object 
 14  Range N.M.               4

Unnamed: 0,Rcmnd cruise Knots,Stall Knots dirty,Fuel gal/lbs,Eng out rate of climb,Takeoff over 50ft,Price
count,507.0,502.0,517.0,491.0,492.0,507.0
mean,200.792899,60.795817,1419.37911,2065.126273,1743.306911,2362673.0
std,104.280532,16.657002,4278.320773,1150.031899,730.009674,1018731.0
min,70.0,27.0,12.0,457.0,500.0,650000.0
25%,130.0,50.0,50.0,1350.0,1265.0,1600000.0
50%,169.0,56.0,89.0,1706.0,1525.0,2000000.0
75%,232.0,73.0,335.0,2357.0,2145.75,2950000.0
max,511.0,115.0,41000.0,6400.0,4850.0,5100000.0


In [None]:
#Missing value check:
df.isnull().sum()

Unnamed: 0,0
Model Name,0
Engine Type,0
HP or lbs thr ea engine,0
Max speed Knots,20
Rcmnd cruise Knots,10
Stall Knots dirty,15
Fuel gal/lbs,0
All eng rate of climb,4
Eng out rate of climb,26
Takeoff over 50ft,25


<font color="#BB6BD9">2.3 Data Types Overview</font>

From `df.info()` and `df.describe()`:

Numerical features (floats/ints):

* Rcmnd cruise Knots

* Stall Knots dirty

* Fuel gal/lbs

* Eng out rate of climb

* Takeoff over 50ft

* Price

Object (text) features:

* Model Name

* Engine Type

* HP or lbs thr ea engine

* Max speed Knots

* All eng rate of climb

* Landing over 50ft

* Empty weight lbs

* Length ft/in

* Wing span ft/in

* Range N.M.

Several of the object columns (for example Max speed Knots, Landing over 50ft, and Empty weight lbs) hold numeric measurements stored as text. This notebook treats every object column as categorical, so all of them are One-Hot Encoded before modelling.
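
Several columns that `df.info()` reports as `object`, such as `Max speed Knots` and `Empty weight lbs`, describe numeric quantities. A possible alternative to one-hot encoding them is to coerce them to numbers first. The sketch below uses a small made-up frame; the assumption that the text contains thousands separators or placeholder strings is a guess, since the raw values are not shown in the notebook.

```python
import pandas as pd

# Toy data imitating numeric measurements stored as text (values are invented)
toy = pd.DataFrame({
    "Max speed Knots": ["104", "89", "1,020", None],
    "Empty weight lbs": ["1,180", "820", "N/A", "740"],
})


def to_numeric_column(col):
    """Strip commas and whitespace, then convert; unparseable text becomes NaN."""
    cleaned = col.str.replace(",", "", regex=False).str.strip()
    return pd.to_numeric(cleaned, errors="coerce")


converted = toy.apply(to_numeric_column)
print(converted.dtypes)
print(converted.isna().sum())
```

Columns converted this way would join `numeric_cols` and pass through the median imputer and scaler instead of the one-hot encoder.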

<font color="#F2994A">2.4 Statistical Summary</font>

Key insights from df.describe():

* Rcmnd cruise Knots: 70 to 511

* Fuel gal/lbs: 12 to 41,000

* Takeoff over 50ft: 500 to 4,850 ft

* Price: min 650,000; max 5,100,000; mean ≈ 2,362,673

Max speed Knots is stored as text, so it does not appear in `df.describe()`.

This shows a wide variation in aircraft performance and price, which gives a regression model a broad range of values to learn from.

<font color="#EB5757">2.5 Target Variable: Price</font>

The Price variable:

* Is continuous

* Has 10 missing values (507 of the 517 rows have a price)

* Represents aircraft market value

* Is influenced by specifications, performance, and weight

This makes it suitable for models like:

* Ridge Regression

* Lasso Regression

* Elastic Net

* Random Forest

<font color="#2D9CDB">2.6 Key Findings</font>

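* The dataset contains a mix of numeric and object (text) features.

* Several aircraft performance columns have missing values, including text-typed ones such as Max speed Knots (20 missing) and All eng rate of climb (4 missing).

* Model Name, Engine Type, and HP or lbs thr ea engine have zero missing values.

* The target variable Price is continuous and appropriate for regression.

Preprocessing will require:

✔ Imputing missing values (median for numeric columns, most frequent value for categorical columns)

✔ Encoding categorical columns

✔ Scaling numeric features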

---

### <font color="#27AE60">**3. Data Preparation**</font>

<font color="#BB6BD9"> Cleaning and Imputing Missing Values (Code)</font>

In [None]:
# Identify numerical columns
numeric_cols = df.select_dtypes(include=['float64', 'int64']).columns.tolist()

# Remove the target column
numeric_cols.remove("Price")

# Identify categorical columns
categorical_cols = df.select_dtypes(include=['object']).columns.tolist()


<font color="#56CCF2">Encoding & Scaling Pipeline</font>

In [None]:
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer

# Numeric and categorical columns already defined:
# numeric_cols, categorical_cols

numeric_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler())
])

categorical_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore"))
])

preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numeric_cols),
        ("cat", categorical_transformer, categorical_cols)
    ]
)


**<font color="#EB5757">Train–Test Split</font>**

In [None]:
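# Drop the 10 rows with a missing target; regressors cannot be fitted on NaN labels.
# This leaves the 507 rows reflected in the split shapes shown further down.
df = df.dropna(subset=["Price"])

X = df.drop("Price", axis=1)
y = df["Price"]

from sklearn.model_selection import train_test_split

# 70/30 split; random_state fixes the shuffle so the split is reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)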


<font color="#F2C94C">Transform the Data Using the Preprocessor</font>

In [None]:
X_train_processed = preprocessor.fit_transform(X_train)
X_test_processed = preprocessor.transform(X_test)

X_train_processed.shape, X_test_processed.shape


((354, 1630), (153, 1630))
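
The 1,630 columns come almost entirely from one-hot encoding: only five numeric columns are scaled, and every distinct value in every object column becomes its own column. The toy example below (invented values) illustrates that the encoded width equals the total number of distinct values across the encoded columns.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy frame: one true category and one numeric measurement stored as text
toy = pd.DataFrame({
    "Engine Type": ["Piston", "Piston", "Jet", "Propjet"],
    "Landing over 50ft": ["1,180", "820", "810", "800"],
})

encoder = OneHotEncoder(handle_unknown="ignore")
encoded = encoder.fit_transform(toy)

# Each distinct value per column produces one output column
expected_width = int(toy.nunique().sum())
print(encoded.shape, expected_width)
```

A text column whose values are nearly all unique therefore adds almost one feature per row, which is why the processed matrix is far wider than the 15 original predictors.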

### **4. MODELING**

In [None]:
models = {}

# ---------- Ridge ----------
ridge_params = {
    "alpha": [0.1, 1, 10, 100]
}

ridge = Ridge(random_state=42)
ridge_cv = GridSearchCV(
    ridge,
    ridge_params,
    cv=5,
    scoring="r2"
)
ridge_cv.fit(X_train_processed, y_train)
models["Ridge"] = ridge_cv

# ---------- Lasso ----------
lasso_params = {
    "alpha": [0.001, 0.01, 0.1, 1]
}

lasso = Lasso(max_iter=5000, random_state=42)
lasso_cv = GridSearchCV(
    lasso,
    lasso_params,
    cv=5,
    scoring="r2"
)
lasso_cv.fit(X_train_processed, y_train)
models["Lasso"] = lasso_cv

# ---------- Elastic Net ----------
elastic_params = {
    "alpha": [0.001, 0.01, 0.1, 1],
    "l1_ratio": [0.2, 0.5, 0.8]
}

elastic = ElasticNet(max_iter=5000, random_state=42)
elastic_cv = GridSearchCV(
    elastic,
    elastic_params,
    cv=5,
    scoring="r2"
)
elastic_cv.fit(X_train_processed, y_train)
models["ElasticNet"] = elastic_cv

# ---------- Random Forest ----------
rf_params = {
    "n_estimators": [100, 200],
    "max_depth": [None, 10],
    "min_samples_split": [2]
}

rf = RandomForestRegressor(random_state=42)
rf_cv = GridSearchCV(
    rf,
    rf_params,
    cv=5,
    scoring="r2"
)
rf_cv.fit(X_train_processed, y_train)
models["RandomForest"] = rf_cv


  model = cd_fast.sparse_enet_coordinate_descent(
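
The grid searches above run on a matrix that was preprocessed once, before cross-validation, so every validation fold has already influenced the imputer, scaler, and encoder. A common alternative is to place the preprocessor and the model in a single `Pipeline` and tune that, so preprocessing is refitted inside each fold. The sketch below shows the idea on synthetic data; the column names, data, and the `toy_preprocessor` variable are invented for illustration and are not part of the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the aircraft data
rng = np.random.default_rng(42)
n = 60
X_toy = pd.DataFrame({
    "cruise_knots": rng.uniform(70, 500, n),
    "engine_type": rng.choice(["Piston", "Jet", "Propjet"], n),
})
y_toy = 5000 * X_toy["cruise_knots"] + rng.normal(0, 50_000, n)

toy_preprocessor = ColumnTransformer([
    ("num", Pipeline([("imputer", SimpleImputer(strategy="median")),
                      ("scaler", StandardScaler())]), ["cruise_knots"]),
    ("cat", Pipeline([("imputer", SimpleImputer(strategy="most_frequent")),
                      ("onehot", OneHotEncoder(handle_unknown="ignore"))]),
     ["engine_type"]),
])

# Preprocessing and model are tuned together; parameters of the final
# step are addressed with the "model__" prefix.
pipe = Pipeline([("preprocessor", toy_preprocessor), ("model", Ridge())])
search = GridSearchCV(pipe, {"model__alpha": [0.1, 1, 10, 100]}, cv=5, scoring="r2")
search.fit(X_toy, y_toy)
print(search.best_params_, round(search.best_score_, 3))
```

With this layout `search.best_estimator_` is already a complete pipeline that accepts raw rows, so it could be saved directly without rebuilding one afterwards.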


In [None]:
results = []

for name, model in models.items():
    y_pred = model.predict(X_test_processed)

    r2 = r2_score(y_test, y_pred)
    mae = mean_absolute_error(y_test, y_pred)
    rmse = np.sqrt(mean_squared_error(y_test, y_pred))

    results.append([
        name,
        r2,
        mae,
        rmse,
        model.best_params_
    ])

results_df = pd.DataFrame(
    results,
    columns=["Model", "R2 Score", "MAE", "RMSE", "Best Parameters"]
)

results_df


Unnamed: 0,Model,R2 Score,MAE,RMSE,Best Parameters
0,Ridge,0.882161,252761.91567,344850.101535,{'alpha': 1}
1,Lasso,0.888728,248289.111231,335104.452925,{'alpha': 1}
2,ElasticNet,0.88189,253928.228488,345246.259039,"{'alpha': 0.01, 'l1_ratio': 0.5}"
3,RandomForest,0.909604,213520.449307,302036.86301,"{'max_depth': 10, 'min_samples_split': 2, 'n_e..."


In [None]:
best_row = results_df.sort_values("R2 Score", ascending=False).iloc[0]
best_model_name = best_row["Model"]
best_model = models[best_model_name]

best_model_name, best_row


('RandomForest',
 Model                                                   RandomForest
 R2 Score                                                    0.909604
 MAE                                                    213520.449307
 RMSE                                                    302036.86301
 Best Parameters    {'max_depth': 10, 'min_samples_split': 2, 'n_e...
 Name: 3, dtype: object)

In [None]:

best_estimator = best_model.best_estimator_

full_pipeline = Pipeline(steps=[
    ("preprocessor", preprocessor),
    ("model", best_estimator)
])

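# Refit the preprocessor and the tuned model on the entire dataset (X, y) before export.
# The test-set metrics reported above come from the model fitted on the training split only.
full_pipeline.fit(X, y)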



joblib.dump(full_pipeline, "plane_price_best_model.pkl")


['plane_price_best_model.pkl']

# <font color="#BB6BD9">5. Evaluation</font>

This section evaluates the performance of all machine learning models trained for aircraft
price prediction. The evaluation metrics used were:

- **R² Score** – Measures how much variance in aircraft prices is explained by the model.
- **MAE (Mean Absolute Error)** – Average magnitude of prediction errors.
- **RMSE (Root Mean Squared Error)** – Penalizes larger errors more heavily.
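
To make the three metrics concrete, the sketch below computes them by hand with NumPy on four invented price pairs and compares the results with the scikit-learn functions used in this notebook. The numbers are illustrative only and are not taken from the dataset.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Invented actual and predicted prices
y_true = np.array([1_300_000, 1_600_000, 2_000_000, 2_950_000], dtype=float)
y_hat = np.array([1_400_000, 1_500_000, 2_100_000, 2_650_000], dtype=float)

errors = y_true - y_hat

# MAE: mean of absolute errors
mae = np.mean(np.abs(errors))

# RMSE: square root of the mean squared error, so large misses weigh more
rmse = np.sqrt(np.mean(errors ** 2))

# R²: 1 minus (residual sum of squares / total sum of squares)
r2 = 1 - np.sum(errors ** 2) / np.sum((y_true - y_true.mean()) ** 2)

print(mae, rmse, r2)
print(mean_absolute_error(y_true, y_hat),
      np.sqrt(mean_squared_error(y_true, y_hat)),
      r2_score(y_true, y_hat))
```

In this example one prediction misses by 300,000 while the others miss by 100,000, so the RMSE comes out above the MAE.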

---

## <font color="#2D9CDB">5.1 Model Performance Summary</font>

The following models were compared:

- Ridge Regression  
- Lasso Regression  
- Elastic Net Regression  
- Random Forest Regressor  

The performance comparison is shown below (values based on the test set):

| Model | R² Score | MAE | RMSE | Best Parameters |
|-------|----------|----------|----------|----------------|
| **RandomForest** | **0.909604** | **213,520** | **302,037** | {'max_depth': 10, 'min_samples_split': 2, …} |
| Ridge | 0.882161 | 252,762 | 344,850 | α = 1 |
| Lasso | 0.888728 | 248,289 | 335,104 | α = 1 |
| ElasticNet | 0.881890 | 253,928 | 345,246 | α = 0.01, l1_ratio = 0.5 |

---

## <font color="#F2994A">5.2 Interpretation of Results</font>

- The **Random Forest Regressor** achieved the **highest R² Score (≈ 0.910)** on the test set,  
  meaning it explains about **91% of the variation** in held-out aircraft prices.
  
- The **MAE of ₱213,520** means that, on average, the model's predictions differ from the actual
  aircraft price by around **₱214k**, roughly 9% of the dataset's mean price (≈ ₱2.36 million).

- The **RMSE of ₱302,037** is noticeably larger than the MAE, which indicates that a few  
  predictions miss by much more than the average error.

---

## <font color="#27AE60">5.3 Final Model Selection</font>

The **Random Forest Regressor** is selected as the **final deployment model** because:

- It achieved the **best test-set score on all three metrics** (highest R², lowest MAE and RMSE).
- It handles nonlinear relationships effectively.
- It works with the mix of categorical and numerical features once they are encoded.
- Tree ensembles are generally less sensitive to outliers and noise in the features than linear models.

This model is therefore chosen for deployment and prediction.


In [None]:
#Create a hypothetical data set and apply the trained model on this one.
hypo_data = pd.DataFrame({
    "Model Name": ["SuperJet 500"],
    "Engine Type": ["Turbojet"],
    "HP or lbs thr ea engine": ["2500 HP"],
    "Max speed Knots": [460],
    "Rcmnd cruise Knots": [400],
    "Stall Knots dirty": [70],
    "Fuel gal/lbs": [5000],
    "All eng rate of climb": [2300],
    "Eng out rate of climb": [900],
    "Takeoff over 50ft": [2500],
    "Landing over 50ft": ["2000"],   # stored as object in df
    "Empty weight lbs": ["15000"],   # stored as object in df
    "Length ft/in": ["50"],
    "Wing span ft/in": ["60"],
    "Range N.M.": [2000]
})

loaded_model = joblib.load("plane_price_best_model.pkl")

predicted_price = loaded_model.predict(hypo_data)
predicted_price


array([3777184.80771499])

# <font color="#27AE60">6. Deployment</font>

After selecting the Random Forest model as the best performer, the next step is deploying it
for real-world predictions. To ensure reliability and ease of use, the full preprocessing
pipeline (imputation + encoding + scaling) was combined with the model, refitted on the full dataset, and saved as:

### **`plane_price_best_model.pkl`**

This exported model can now be loaded and used to predict prices for new aircraft records that have the same columns as the training data.
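
Because the saved pipeline selects columns by name, a new record with a missing or misspelled column will fail at prediction time. The sketch below shows a save/load round trip with a simple column check in front of `predict`. It uses a tiny stand-in pipeline and invented data rather than the real `plane_price_best_model.pkl`, and the helper `predict_checked` is hypothetical.

```python
import os
import tempfile

import joblib
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Tiny stand-in for the training data (values are invented)
train = pd.DataFrame({
    "Engine Type": ["Piston", "Jet", "Piston", "Propjet"],
    "Fuel gal/lbs": [36, 5000, 15, 300],
})
prices = [1_300_000, 4_800_000, 1_230_000, 2_900_000]

pipe = Pipeline([
    ("preprocessor", ColumnTransformer(
        [("cat", OneHotEncoder(handle_unknown="ignore"), ["Engine Type"])],
        remainder="passthrough")),
    ("model", RandomForestRegressor(n_estimators=20, random_state=42)),
])
pipe.fit(train, prices)

# Save to a temporary location and load it back
path = os.path.join(tempfile.mkdtemp(), "toy_plane_model.pkl")
joblib.dump(pipe, path)
loaded = joblib.load(path)


def predict_checked(model, new_rows, expected_columns):
    """Raise a clear error if any expected column is absent, then predict."""
    missing = [c for c in expected_columns if c not in new_rows.columns]
    if missing:
        raise ValueError(f"Missing columns: {missing}")
    return model.predict(new_rows[expected_columns])


new_aircraft = pd.DataFrame({"Engine Type": ["Jet"], "Fuel gal/lbs": [4000]})
prediction = predict_checked(loaded, new_aircraft, list(train.columns))
print(prediction)
```

For the real model, the expected column list would be the 15 predictor columns of the original dataset.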

---

## <font color="#6FCF97">6.1 Creating a Hypothetical Aircraft</font>

A new aircraft record was created to simulate real-world use.  
Example attributes include:

- Model Name: SuperJet 500  
- Engine Type: Turbojet  
- Max Speed: 460 knots  
- Recommended Cruise: 400 knots  
- Fuel Capacity: 5000 gal/lbs  
- Rate of Climb: 2300 ft/min  
- Wing Span: 60 ft  
- Range: 2000 Nautical Miles  
- etc.

These values represent what an aviation company or assessor would input.

---

## <font color="#BB6BD9">6.2 Running the Model</font>

Once the saved `.pkl` file is loaded, the pipeline automatically:

- Imputes missing values  
- Encodes categorical variables  
- Scales numerical features  
- Applies the Random Forest Regressor  
- Outputs the predicted aircraft price  

### **Predicted Price Output:**
3777184.80771499


# <font color="#F2C94C">7. Conclusion</font>

This case study applied the CRISP-DM methodology to build a predictive machine learning model
that estimates aircraft prices based on performance and structural characteristics.

---

## <font color="#2F80ED">7.1 Summary of Achievements</font>

- Successfully explored and understood the aircraft dataset.
- Cleaned missing values and prepared mixed data types.
- Applied multiple regression algorithms (Ridge, Lasso, ElasticNet, RandomForest).
- Evaluated all models using R², MAE, and RMSE.
- Identified **Random Forest** as the best-performing model.
- Exported the final model as a **.pkl deployment file**.
- Used the model to predict price for a hypothetical aircraft.

---

## <font color="#27AE60">7.2 Insights</font>

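- Random Forest outperformed the three linear models on the test set, which suggests that aircraft pricing involves nonlinear relationships.
- The dataset, although small, still allowed strong predictive performance (test R² ≈ 0.91).
- Factors such as speed, climb rate, range, and fuel capacity are expected to influence pricing
  strongly, although this notebook does not include a feature importance analysis to confirm it.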

---

## <font color="#EB5757">7.3 Limitations</font>

- Dataset size is relatively small (517 entries).  
- Many numeric measurements are stored as text and one-hot encoded, which expands the feature space to 1,630 columns for only 354 training rows and can reduce model accuracy.  
- Aircraft prices vary widely, leading to natural prediction noise.  

---

## <font color="#6FCF97">7.4 Future Improvements</font>

- Use a larger and more complete aircraft dataset.
- Apply more advanced models (XGBoost, LightGBM).
- Add visualizations for deeper EDA.
- Deploy the model via a web app (Flask, Streamlit).
- Perform feature importance analysis to identify strongest predictors.
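
As a starting point for the feature importance analysis listed above, the sketch below applies scikit-learn's `permutation_importance` to a Random Forest trained on synthetic data. The data, the column names `cruise_knots` and `noise_feature`, and the price formula are invented; only the technique is meant to carry over.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data: price depends on cruise speed, not on the noise column
rng = np.random.default_rng(0)
n = 300
X_syn = pd.DataFrame({
    "cruise_knots": rng.uniform(70, 500, n),
    "noise_feature": rng.normal(0, 1, n),
})
y_syn = 8000 * X_syn["cruise_knots"] + rng.normal(0, 100_000, n)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.3, random_state=42
)
forest = RandomForestRegressor(n_estimators=100, random_state=42).fit(X_tr, y_tr)

# Shuffle each column on held-out data and measure how much the score drops
result = permutation_importance(forest, X_te, y_te, n_repeats=10, random_state=42)
ranking = pd.Series(result.importances_mean, index=X_syn.columns)
ranking = ranking.sort_values(ascending=False)
print(ranking)
```

Permutation importance also accepts a fitted `Pipeline`, so it could be run on the raw aircraft columns rather than on the 1,630 encoded ones.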

---

## <font color="#BB6BD9">7.5 Final Statement</font>

Overall, the project successfully delivered an end-to-end machine learning solution capable of
predicting aircraft prices with strong accuracy. The deployed model can assist aviation
stakeholders in making informed pricing and valuation decisions.
