## Weka (Waikato Environment for Knowledge Analysis)
- Java-based machine learning toolkit that:
    - Provides a large number of built-in algorithms for classification, regression, clustering, and more
    - Is useful for teaching, rapid prototyping, and data analysis

## Weka in the Titanic Example

## Tablesaw vs. Smile

| Aspect                          | Tablesaw                                                                             | Smile                                                                                        |
| ------------------------------- | ------------------------------------------------------------------------------------ | -------------------------------------------------------------------------------------------- |
| **Primary focus**               | Data manipulation and exploratory data analysis (EDA), similar to pandas in Python   | Machine learning, statistics, and data analysis library with ML models and algorithms        |
| **DataFrame support**           | Yes; a rich DataFrame-style `Table` API for tabular data manipulation                | Yes; provides a `DataFrame`, geared mainly toward feeding ML workflows                       |
| **ML Algorithms**               | Minimal or no built-in ML algorithms; mainly for data wrangling and analysis         | Extensive ML support: classification, regression, clustering, dimensionality reduction, etc. |
| **Data types support**          | Supports various column types (numeric, categorical, date, etc.) with convenient API | Supports different types but with a focus on numeric data for ML                             |
| **Data visualization**          | Plotly-based charts through an optional plotting module; can also export data to other Java plotting libs | Basic plotting is available, but visualization is secondary to ML and stats |
| **Performance**                 | Efficient for in-memory tabular data; good for typical data wrangling tasks          | Highly optimized for numerical computation and ML tasks                                      |
| **Missing value handling**      | Good support for missing data in tables                                              | Supports missing data but less focus on data cleaning than Tablesaw                          |
| **API complexity**              | Simple and intuitive for data manipulation and EDA                                   | More complex, with many ML-related classes and utilities                                     |
| **Community and documentation** | Growing, focused on data manipulation                                                | Mature, with focus on ML and statistics                                                      |
| **Integration**                 | Easy integration with Java projects for ETL, data manipulation                       | Great for projects requiring ML algorithms and predictive modeling                           |


## Tablesaw

# Titanic Data Project Overview

This project explores the Titanic dataset using Java, with three main components for data preprocessing, analysis, and machine learning modeling.

---

## 1. `TitanicPreprocess.java` — *Data Cleaning & Preparation*

Prepares the raw Titanic dataset for analysis and modeling:

- Loads the raw dataset from a CSV file.
- Adds a new `"Alone"` column to indicate whether a passenger was traveling alone.
- Removes irrelevant columns: `PassengerId`, `Name`, `Ticket`, and `Cabin`.
- Encodes categorical variables:
  - `Sex`: `male → 1`, `female → 0`
  - `Embarked`: `C → 1`, `Q → 2`, `S → 3`
- Fills missing values with the median of each column.
- Saves the cleaned dataset to `titanic_cleaned.csv`.
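
The cleaning rules above can be sketched in plain Java, without any library calls, to show the logic on its own. This is not the project's code: the class name `PreprocessSketch` and its methods are hypothetical, missing values are represented as `NaN`, and the rule for `"Alone"` (no siblings/spouses and no parents/children aboard, using the dataset's `SibSp` and `Parch` columns) is an assumption, since the notes do not say how the column is derived.

```java
import java.util.Arrays;

/** Hypothetical helper, not part of the project: plain-Java versions of the cleaning rules. */
class PreprocessSketch {

    /** Sex encoding from the list above: male -> 1, female -> 0. */
    static int encodeSex(String sex) {
        switch (sex.trim().toLowerCase()) {
            case "male":   return 1;
            case "female": return 0;
            default: throw new IllegalArgumentException("Unexpected Sex value: " + sex);
        }
    }

    /** Embarked encoding from the list above: C -> 1, Q -> 2, S -> 3. */
    static int encodeEmbarked(String port) {
        switch (port.trim().toUpperCase()) {
            case "C": return 1;
            case "Q": return 2;
            case "S": return 3;
            default: throw new IllegalArgumentException("Unexpected Embarked value: " + port);
        }
    }

    /** Assumption: a passenger is "alone" when SibSp + Parch == 0. Returns 1 or 0. */
    static int alone(int sibSp, int parch) {
        return (sibSp + parch == 0) ? 1 : 0;
    }

    /** Median of the non-missing entries; NaN stands in for a missing value here. */
    static double median(double[] values) {
        double[] present = Arrays.stream(values).filter(v -> !Double.isNaN(v)).sorted().toArray();
        if (present.length == 0) {
            throw new IllegalArgumentException("No non-missing values");
        }
        int mid = present.length / 2;
        return present.length % 2 == 1 ? present[mid] : (present[mid - 1] + present[mid]) / 2.0;
    }

    /** Returns a copy of the column with every missing entry replaced by the column median. */
    static double[] fillWithMedian(double[] values) {
        double m = median(values);
        return Arrays.stream(values).map(v -> Double.isNaN(v) ? m : v).toArray();
    }

    public static void main(String[] args) {
        double[] age = {22.0, Double.NaN, 38.0, 26.0}; // made-up values for illustration
        System.out.println(Arrays.toString(fillWithMedian(age))); // [22.0, 26.0, 38.0, 26.0]
        System.out.println(encodeSex("male") + " " + encodeEmbarked("S") + " " + alone(0, 0)); // 1 3 1
    }
}
```

Note that the encoders reject blank strings, so in a sketch like this any missing `Embarked` entries would have to be handled before encoding.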

---

## 2. `TitanicAnalysis.java` — *Exploratory Data Analysis (EDA)*

Performs visual and statistical analysis on the Titanic dataset:

- Loads the raw dataset from a CSV file.
- Adds the `"Alone"` column.
- Splits data into subsets of survivors and non-survivors.
- Calculates and visualizes:
  - **Survival rate by gender**
  - **Survival rate based on fare price**
  - **Comparison of survival for passengers traveling alone vs. with family**
  - **Age distribution among survivors and non-survivors**
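
As a rough illustration of the statistic behind the first comparison, a survival rate per group is just survivors divided by passengers within each group. The sketch below computes it in plain Java; `SurvivalRateSketch` and `survivalRateBy` are hypothetical names, the columns are plain arrays rather than Tablesaw columns, and the sample rows are invented.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper, not part of the project: survival rate per group label. */
class SurvivalRateSketch {

    /** For each label: (number of survivors with that label) / (number of rows with that label). */
    static Map<String, Double> survivalRateBy(String[] group, int[] survived) {
        if (group.length != survived.length) {
            throw new IllegalArgumentException("Column lengths differ");
        }
        // label -> {survivors, total}; LinkedHashMap keeps first-seen order.
        Map<String, int[]> counts = new LinkedHashMap<>();
        for (int i = 0; i < group.length; i++) {
            int[] c = counts.computeIfAbsent(group[i], k -> new int[2]);
            c[0] += survived[i]; // Survived is 1 for survivors, 0 otherwise
            c[1]++;
        }
        Map<String, Double> rates = new LinkedHashMap<>();
        counts.forEach((label, c) -> rates.put(label, (double) c[0] / c[1]));
        return rates;
    }

    public static void main(String[] args) {
        // Made-up rows for illustration only, not values from the real dataset.
        String[] sex = {"male", "female", "female", "male"};
        int[] survived = {0, 1, 1, 1};
        System.out.println(survivalRateBy(sex, survived)); // {male=0.5, female=1.0}
    }
}
```

The same function works for the alone vs. with-family comparison by passing the `"Alone"` values as the group labels.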

---

## 3. `TitanicML.java` — *Machine Learning Modeling*

Builds and evaluates ML models using the cleaned dataset:

- Loads and processes the data (similar to `TitanicPreprocess.java`).
- Normalizes numerical features to a 0–1 range.
- Converts Tablesaw tables to Weka instances.
- Trains two models:
  - **Decision Tree (J48)**
  - **Logistic Regression**
- Evaluates the models using cross-validation.
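
Two of the steps above, scaling to a 0–1 range and cross-validation, can be sketched independently of Weka. The code below is only an illustration under assumptions: `MlPrepSketch`, `minMaxNormalize`, and `kFolds` are hypothetical names, the fold count and seed are arbitrary, and the fold split is a plain shuffle that is not stratified by class, so it should not be read as what Weka does internally.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Random;

/** Hypothetical helper, not part of the project: min-max scaling and k-fold index splitting. */
class MlPrepSketch {

    /** Min-max scaling: x' = (x - min) / (max - min). A constant column maps to all zeros. */
    static double[] minMaxNormalize(double[] values) {
        double min = Double.POSITIVE_INFINITY;
        double max = Double.NEGATIVE_INFINITY;
        for (double v : values) {
            min = Math.min(min, v);
            max = Math.max(max, v);
        }
        double range = max - min;
        double[] scaled = new double[values.length];
        for (int i = 0; i < values.length; i++) {
            scaled[i] = range == 0 ? 0.0 : (values[i] - min) / range;
        }
        return scaled;
    }

    /** Shuffles row indices and deals them into k folds; each fold is held out once for testing. */
    static List<List<Integer>> kFolds(int rowCount, int k, long seed) {
        List<Integer> indices = new ArrayList<>();
        for (int i = 0; i < rowCount; i++) {
            indices.add(i);
        }
        Collections.shuffle(indices, new Random(seed));
        List<List<Integer>> folds = new ArrayList<>();
        for (int f = 0; f < k; f++) {
            folds.add(new ArrayList<>());
        }
        for (int i = 0; i < rowCount; i++) {
            folds.get(i % k).add(indices.get(i));
        }
        return folds;
    }

    public static void main(String[] args) {
        double[] fare = {10.0, 20.0, 30.0}; // made-up values for illustration
        System.out.println(Arrays.toString(minMaxNormalize(fare))); // [0.0, 0.5, 1.0]

        // 10 rows split into 5 folds: train on 4 folds, test on the remaining one, repeat 5 times.
        for (List<Integer> testFold : kFolds(10, 5, 42L)) {
            System.out.println("held-out rows: " + testFold);
        }
    }
}
```

In a cross-validated evaluation, the reported score is typically the average over all held-out folds, so every row is used for testing exactly once.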

---

## Recommended Execution Order

To ensure a smooth workflow, run the programs in this order:

1. **`TitanicPreprocess.java`**  
   Cleans the dataset and generates `titanic_cleaned.csv`.

2. **`TitanicAnalysis.java`**  
   Performs EDA and generates insights and visualizations.

3. **`TitanicML.java`**  
   Runs machine learning models on the cleaned dataset.

## Popcorn Hack

### - Use Tablesaw to visualize the class distribution (first, second, third class) of the Titanic data


## Let's Look at a Titanic Example

## Homework

### - Use Smile to train a classifier on the Titanic dataset
### - Use Tablesaw to visualize the Iris data in at least 3 different ways