This analysis develops an end-to-end probabilistic classification pipeline for structured tabular data using a Gaussian Naive Bayes model. The focus extends beyonds implementation to evaluating model assumptions, analysing feature relationships and interpreting predictive behavior within a statistical framework.
Rather than defaulting to more compplex models, this workflow investigates whether a statistically grounded, lightweight probabilistic approach can achieve meaningful predictive performance.
Naive Bayes was deliberately selected to:
- Examine the impact of feature independence assumptions
- Evaluate robustness on real-world structured data
- Provide interpretable probabilistic outputs
I undertook this analysis to investigate how a model forms structured belief from data, not as a computational trick, but as a principled reasoning process. I wanted to examine classification at the level where assumptions, uncertainty, and evidence interact, and to understand how a model behaves when the world it encounters is messier than the theory that defines it. Bayesian methods provided the ideal lens for this. They force every step of the workflow to be explicit: how prior expectations are set, how likelihoods reshape those expectations, and how each feature contributes to the final posterior decision. Instead of relying on opaque optimisation, this approach allowed me to study the architecture of inference itself.
This work was driven by several deeper questions:
- How does a model reconcile conflicting or imperfect signals?
- What happens when theoretical assumptions collide with real‑world data structure?
- Which features meaningfully shift the model’s internal belief state, and why?
- Where does uncertainty originate, and how does it propagate through the pipeline?
By building the entire workflow end‑to‑end, from data preparation and feature diagnostics to posterior interpretation, I could observe how each analytical decision shapes the model’s reasoning. This process helped me move beyond surface‑level performance metrics and focus instead on interpretability, structural clarity, and the causal logic behind predictions. Ultimately, I did this to strengthen my ability to think in a statistically disciplined way: to interrogate assumptions, trace information flow, and understand not just what a model predicts, but why it arrives at that conclusion. It reflects an interest in modelling that prioritises transparency, principled analysis, and rigorous reasoning over complexity for its own sake.
- Handled missing values to ensure dataset consistency
- Encoded categorical variables into numerical form
- Prepared structured inputs for statistical modelling
A key component of the analysis involved assessing feature relationships in the context of Naive Bayes assumptions:
- Computed correlation coefficients across variables
- Analysed inter-feature dependencies
- Identified high-impact features such as sex, pclass, and fare
- This step provides insight into model suitability, not just performance.
- Used boxplots to examine feature distributions
- Detected outliers and distributional differences
- Analysed subgroup behaviour across key variables
A Gaussian Naive Bayes classifier was implemented due to:
- Computational efficiency
- Suitability for structured tabular data
- Probabilistic interpretability The dataset was split into training and testing subsets to evaluate generalisation on unseen data.
The model applies Bayes’ theorem to compute posterior probabilities.
Continuous features are modelled using Gaussian distributions.
The predicted class is selected by maximising posterior probability.
The workflow illustrates the full pipeline from preprocessing and feature analysis to probabilistic classification and prediction generation.
Model performance was assessed using multiple complementary metrics:
- Accuracy – overall classification correctness
- Recall (~0.69) – ability to correctly identify positive instances
- F1 Score (~0.69) – balance between precision and recall
- Confusion Matrix – detailed classification breakdown
- Mean Absolute Error (MAE) – magnitude of prediction error
The model demonstrated strong predictive capability, but deeper analysis reveals more nuanced behaviour:
- High training accuracy indicates effective pattern capture
- Recall (~0.69) shows the model identifies a substantial proportion of true positives, though some are missed
- The F1 score reflects a balanced trade-off rather than over-optimisation
- The confusion matrix highlights classification asymmetries
Naive Bayes remains effective even when its independence assumption is not fully satisfied, provided that key features retain strong predictive signal.
- Feature independence assumption is not strictly met
- Model cannot capture complex feature interactions
- Performance may be constrained compared to more flexible models
- Compare performance with Logistic Regression and Random Forest
- Apply feature selection to reduce redundancy
- Introduce cross-validation for more robust evaluation
- Perform hyperparameter tuning
- Python
- Pandas
- NumPy
- Scikit-learn
- Matplotlib
- Jupyter Notebook
To replicate this analysis:
pip install -r requirements.txt
jupyter notebook Applied_AI.ipynbThe results of this work show that predictive strength often emerges not from model complexity, but from the alignment between data structure and statistical assumptions. By tracing how likelihoods shift across features and how posterior beliefs evolve, the workflow reveals that Gaussian Naive Bayes remains effective even when independence conditions are imperfect, provided that the dataset contains a few dominant, high‑signal variables. The evaluation metrics expose asymmetries that reflect underlying patterns in the data rather than simple model shortcomings. Accuracy, recall, F1 score, and the confusion matrix together illustrate that understanding a classifier requires examining how information flows through the pipeline, how uncertainty is distributed, and where the model’s reasoning aligns with or diverges from the true structure of the dataset. What emerges is a broader insight: transparent, statistically principled models retain significant value because they make their internal logic visible. When supported by disciplined preprocessing and careful feature analysis, they offer not only reliable predictions but a coherent explanation of why those predictions arise, a quality essential for analytical work that prioritises interpretability, robustness, and sound statistical reasoning.