## Weka

## Tablesaw vs. Smile

| Aspect                          | Tablesaw                                                                             | Smile                                                                                        |
| ------------------------------- | ------------------------------------------------------------------------------------ | -------------------------------------------------------------------------------------------- |
| **Primary focus**               | Data manipulation and exploratory data analysis (EDA), similar to pandas in Python   | Machine learning, statistics, and data analysis library with ML models and algorithms        |
| **DataFrame support**           | Yes, Tablesaw provides a rich DataFrame API for tabular data manipulation            | Yes, Smile provides DataFrame, but often more focused on ML workflows                        |
| **ML Algorithms**               | Minimal or no built-in ML algorithms; mainly for data wrangling and analysis         | Extensive ML support: classification, regression, clustering, dimensionality reduction, etc. |
| **Data types support**          | Supports various column types (numeric, categorical, date, etc.) with convenient API | Supports different types but with a focus on numeric data for ML                             |
| **Data visualization**          | Limited built-in support, but can export or integrate with Java plotting libs        | Very limited visualization; focus is on ML and stats                                         |
| **Performance**                 | Efficient for in-memory tabular data; good for typical data wrangling tasks          | Highly optimized for numerical computation and ML tasks                                      |
| **Missing value handling**      | Good support for missing data in tables                                              | Supports missing data but less focus on data cleaning than Tablesaw                          |
| **API complexity**              | Simple and intuitive for data manipulation and EDA                                   | More complex, with many ML-related classes and utilities                                     |
| **Community and documentation** | Growing, focused on data manipulation                                                | Mature, with focus on ML and statistics                                                      |
| **Integration**                 | Easy integration with Java projects for ETL, data manipulation                       | Great for projects requiring ML algorithms and predictive modeling                           |


## Tablesaw

# Titanic Data Project Overview

This project explores the Titanic dataset using Java, with three main components for data preprocessing, analysis, and machine learning modeling.

---

## 1. `TitanicPreprocess.java` — *Data Cleaning & Preparation*

Prepares the raw Titanic dataset for analysis and modeling:

- Loads the raw dataset from a CSV file.
- Adds a new `"Alone"` column to indicate whether a passenger was traveling alone.
- Removes irrelevant columns: `PassengerId`, `Name`, `Ticket`, and `Cabin`.
- Encodes categorical variables:
  - `Sex`: `male → 1`, `female → 0`
  - `Embarked`: `C → 1`, `Q → 2`, `S → 3`
- Fills missing values with the median of each column.
- Saves the cleaned dataset to `titanic_cleaned.csv`.
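
The encoding and median-fill steps above can be sketched in plain Java, without the Tablesaw API, to show the logic involved. This is a minimal illustration rather than the contents of `TitanicPreprocess.java`: the class and method names are hypothetical, missing values are assumed to be represented as `NaN`, and the rule for the `"Alone"` column (no siblings, spouses, parents, or children aboard, i.e. `SibSp + Parch == 0`) is an assumption, since the exact rule is not stated here.

```java
import java.util.Arrays;

class PreprocessSketch {
    // Sex encoding from the list above: male -> 1, female -> 0.
    // Any value other than "male" maps to 0 in this sketch.
    static int encodeSex(String sex) {
        return "male".equalsIgnoreCase(sex) ? 1 : 0;
    }

    // Embarked encoding from the list above: C -> 1, Q -> 2, S -> 3.
    // Assumption: anything else (including null) is flagged as -1.
    static int encodeEmbarked(String port) {
        if (port == null) return -1;
        switch (port.trim().toUpperCase()) {
            case "C": return 1;
            case "Q": return 2;
            case "S": return 3;
            default:  return -1;
        }
    }

    // Assumption: a passenger counts as alone when SibSp + Parch == 0.
    static int alone(int sibSp, int parch) {
        return (sibSp + parch == 0) ? 1 : 0;
    }

    // Median of the non-missing values; NaN marks a missing entry.
    static double median(double[] values) {
        double[] present = Arrays.stream(values)
                .filter(v -> !Double.isNaN(v))
                .sorted()
                .toArray();
        if (present.length == 0) return Double.NaN;
        int mid = present.length / 2;
        return present.length % 2 == 1
                ? present[mid]
                : (present[mid - 1] + present[mid]) / 2.0;
    }

    // Replaces every missing entry with the column median.
    static double[] fillMissingWithMedian(double[] values) {
        double m = median(values);
        return Arrays.stream(values).map(v -> Double.isNaN(v) ? m : v).toArray();
    }

    public static void main(String[] args) {
        // Toy values for illustration only.
        double[] ages = {22.0, Double.NaN, 38.0, 26.0};
        System.out.println(Arrays.toString(fillMissingWithMedian(ages)));
        // prints [22.0, 26.0, 38.0, 26.0]
    }
}
```

The median is computed from the non-missing values only, so the fill value is not distorted by the gaps it is meant to replace.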

---

## 2. `TitanicAnalysis.java` — *Exploratory Data Analysis (EDA)*

Performs visual and statistical analysis on the Titanic dataset:

- Loads the raw dataset from a CSV file.
- Adds the `"Alone"` column.
- Splits data into subsets of survivors and non-survivors.
- Calculates and visualizes:
  - **Survival rate by gender**
  - **Survival rate based on fare price**
  - **Comparison of survival for passengers traveling alone vs. with family**
  - **Age distribution among survivors and non-survivors**
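
As a rough illustration of the statistics behind these charts, a survival rate per group is the number of survivors in the group divided by the group size. The sketch below uses plain Java collections instead of Tablesaw; the class name, method name, and sample rows are hypothetical and are not taken from `TitanicAnalysis.java` or the real dataset.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class SurvivalRateSketch {
    // For each group label, returns the fraction of rows with survived == 1.
    static Map<String, Double> rateByGroup(String[] groups, int[] survived) {
        // Value layout per group: {survivor count, total count}.
        Map<String, int[]> counts = new LinkedHashMap<>();
        for (int i = 0; i < groups.length; i++) {
            int[] c = counts.computeIfAbsent(groups[i], k -> new int[2]);
            c[0] += survived[i];
            c[1]++;
        }
        Map<String, Double> rates = new LinkedHashMap<>();
        counts.forEach((group, c) -> rates.put(group, (double) c[0] / c[1]));
        return rates;
    }

    public static void main(String[] args) {
        // Toy rows for illustration only, not real Titanic records.
        String[] sex = {"male", "female", "female", "male", "female"};
        int[] survived = {0, 1, 1, 1, 0};
        rateByGroup(sex, survived).forEach((group, rate) ->
                System.out.printf("%s: %.1f%%%n", group, rate * 100));
    }
}
```

The same function covers the alone vs. with-family comparison by passing the `"Alone"` values as the group labels.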

---

## 3. `TitanicML.java` — *Machine Learning Modeling*

Builds and evaluates ML models using the cleaned dataset:

- Loads and processes the data (similar to `TitanicPreprocess.java`).
- Normalizes numerical features to a 0–1 range.
- Converts Tablesaw tables to Weka instances.
- Trains two models:
  - **Decision Tree (J48)**
  - **Logistic Regression**
- Evaluates the models using cross-validation.
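
Two of the steps above, 0–1 normalization and cross-validation, can be sketched in plain Java to show what they do. This is a minimal illustration under stated assumptions, not the code in `TitanicML.java`: the class and method names are hypothetical, and Weka ships its own normalization filter and cross-validation routine that a real project would more likely use.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

class ModelPrepSketch {
    // Min-max scaling: (x - min) / (max - min), so values land in [0, 1].
    // Assumption: a constant column (max == min) is mapped to all zeros.
    static double[] minMax(double[] x) {
        double min = Double.POSITIVE_INFINITY;
        double max = Double.NEGATIVE_INFINITY;
        for (double v : x) {
            min = Math.min(min, v);
            max = Math.max(max, v);
        }
        double range = max - min;
        double[] out = new double[x.length];
        for (int i = 0; i < x.length; i++) {
            out[i] = range == 0 ? 0.0 : (x[i] - min) / range;
        }
        return out;
    }

    // Shuffles the row indices 0..n-1 and deals them into k folds of
    // near-equal size. Each fold serves once as the test set while the
    // remaining k-1 folds are used for training.
    static List<List<Integer>> kFolds(int n, int k, long seed) {
        List<Integer> indices = new ArrayList<>();
        for (int i = 0; i < n; i++) indices.add(i);
        Collections.shuffle(indices, new Random(seed));

        List<List<Integer>> folds = new ArrayList<>();
        for (int f = 0; f < k; f++) folds.add(new ArrayList<>());
        for (int i = 0; i < n; i++) folds.get(i % k).add(indices.get(i));
        return folds;
    }

    public static void main(String[] args) {
        // Toy fare values for illustration only.
        double[] scaled = minMax(new double[]{10.0, 20.0, 30.0});
        System.out.println(java.util.Arrays.toString(scaled)); // prints [0.0, 0.5, 1.0]

        List<List<Integer>> folds = kFolds(10, 3, 42L);
        for (int f = 0; f < folds.size(); f++) {
            System.out.println("Fold " + f + " test rows: " + folds.get(f));
        }
    }
}
```

Fixing the seed keeps the fold assignment reproducible, which makes the scores of the two models comparable across runs.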

---

## Recommended Execution Order

Run the classes in this order, since `TitanicPreprocess.java` produces the cleaned CSV file used later:

1. **`TitanicPreprocess.java`**  
   Cleans the dataset and generates `titanic_cleaned.csv`.

2. **`TitanicAnalysis.java`**  
   Performs EDA and generates insights and visualizations.

3. **`TitanicML.java`**  
   Runs machine learning models on the cleaned dataset.

## What is Smile?
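- Smile stands for Statistical Machine Intelligence and Learning Engine.
- It is a Java-based machine learning library that covers classification, regression, clustering, and dimensionality reduction, and is optimized for numerical computation.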

### Downloading and Importing SMILE Libraries

In [None]:
%maven com.github.haifengl:smile-data:2.6.0
%maven com.github.haifengl:smile-math:2.6.0
%maven com.github.haifengl:smile-io:2.6.0
%maven org.slf4j:slf4j-nop:2.0.7
%maven com.github.haifengl:smile-core:2.6.0


## Loading the Dataset

In [53]:
import smile.data.DataFrame;
import smile.data.formula.Formula;
import smile.io.Read;
import smile.classification.LogisticRegression;
import smile.data.vector.IntVector;
import org.apache.commons.csv.CSVFormat;
import smile.validation.metric.Accuracy;
import java.util.HashMap;
import java.util.Map;

// Load the Iris dataset; the first CSV record is treated as the header
String url = "https://gist.githubusercontent.com/netj/8836201/raw/iris.csv";
DataFrame iris = Read.csv(url, CSVFormat.DEFAULT.withFirstRecordAsHeader());

System.out.println(iris.structure());
System.out.println(iris.summary());


[Column: String, Type: DataType, Measure: Measure]
+------------+------+-------+
|      Column|  Type|Measure|
+------------+------+-------+
|sepal.length|double|   null|
| sepal.width|double|   null|
|petal.length|double|   null|
| petal.width|double|   null|
|     variety|String|   null|
+------------+------+-------+

[column: String, count: long, min: double, avg: double, max: double]
+------------+-----+---+--------+---+
|      column|count|min|     avg|max|
+------------+-----+---+--------+---+
|sepal.length|  150|4.3|5.843333|7.9|
| sepal.width|  150|  2|3.057333|4.4|
|petal.length|  150|  1|   3.758|6.9|
| petal.width|  150|0.1|1.199333|2.5|
+------------+-----+---+--------+---+



## Train Logistic Regression with `variety` (Encoded as `label`) as the Target

In [54]:
String[] classes = iris.stringVector("variety").toArray();

Map<String, Integer> classToInt = new HashMap<>();
int labelCounter = 0;
int[] labels = new int[classes.length];

for (int i = 0; i < classes.length; i++) {
    if (!classToInt.containsKey(classes[i])) {
        classToInt.put(classes[i], labelCounter++);
    }
    labels[i] = classToInt.get(classes[i]);
}

iris = iris.merge(IntVector.of("label", labels));

System.out.println(iris.structure());


[Column: String, Type: DataType, Measure: Measure]
+------------+------+-------+
|      Column|  Type|Measure|
+------------+------+-------+
|sepal.length|double|   null|
| sepal.width|double|   null|
|petal.length|double|   null|
| petal.width|double|   null|
|     variety|String|   null|
|       label|   int|   null|
+------------+------+-------+



In [55]:
// The formula names "label" as the target; every other column is used as a feature.
// Drop the string column "variety" first so it is not picked up as a feature.
DataFrame features = iris.drop("variety");

Formula formula = Formula.lhs("label");
LogisticRegression model = LogisticRegression.fit(formula, features);

System.out.println("Model trained.");


Model trained.


In [56]:
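// Feature-only view: drop both the string column and the encoded label
DataFrame features = iris.drop("variety").drop("label");

// Copy the first row into a double[] so it can be passed to the model
double[] sample = new double[features.ncols()];
for (int i = 0; i < features.ncols(); i++) {
    sample[i] = features.getDouble(0, i);
}

// Predicted class index for the first row (matches the values stored in "label")
int pred = model.predict(sample);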


In [57]:
DataFrame features = iris.drop("variety").drop("label");

int[] trueLabels = iris.intVector("label").toIntArray();
int[] predictedLabels = new int[iris.size()];

for (int i = 0; i < iris.size(); i++) {
    double[] x = new double[features.ncols()];
    for (int j = 0; j < features.ncols(); j++) {
        x[j] = features.getDouble(i, j);
    }
    predictedLabels[i] = model.predict(x);
}

double accuracy = Accuracy.of(trueLabels, predictedLabels);
System.out.printf("Training Accuracy: %.2f%%\n", accuracy * 100);


Training Accuracy: 98.00%
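
The accuracy figure above is the fraction of rows whose predicted label matches the true label. A minimal plain-Java sketch of that calculation follows; the class and method names are hypothetical, and this is an illustration of the metric, not Smile's implementation. Because the model is scored on the same rows it was trained on, the number is likely to be optimistic compared with accuracy on held-out data.

```java
class AccuracySketch {
    // Fraction of positions where the predicted label equals the true label.
    static double accuracy(int[] truth, int[] predicted) {
        if (truth.length != predicted.length) {
            throw new IllegalArgumentException("Arrays must have the same length");
        }
        if (truth.length == 0) {
            throw new IllegalArgumentException("Arrays must not be empty");
        }
        int correct = 0;
        for (int i = 0; i < truth.length; i++) {
            if (truth[i] == predicted[i]) correct++;
        }
        return (double) correct / truth.length;
    }

    public static void main(String[] args) {
        // Toy labels for illustration only: 4 of 5 predictions match.
        int[] truth = {0, 0, 1, 1, 2};
        int[] predicted = {0, 0, 1, 2, 2};
        System.out.printf("Accuracy: %.2f%%%n", accuracy(truth, predicted) * 100);
        // prints Accuracy: 80.00%
    }
}
```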



## Popcorn Hack

- Use Tablesaw to visualize the passenger class distribution (first, second, and third class) in the Titanic data.


## Let's Look at a Titanic Example

## Homework

- Use Smile to train a classifier on the Titanic dataset.
- Use Tablesaw to visualize the Iris data in at least three different ways.