## Weka

## Tablesaw vs. Smile

| Aspect                          | Tablesaw                                                                             | Smile                                                                                        |
| ------------------------------- | ------------------------------------------------------------------------------------ | -------------------------------------------------------------------------------------------- |
| **Primary focus**               | Data manipulation and exploratory data analysis (EDA), similar to pandas in Python   | Machine learning, statistics, and data analysis library with ML models and algorithms        |
| **DataFrame support**           | Yes, Tablesaw provides a rich DataFrame API for tabular data manipulation            | Yes, Smile provides a DataFrame, geared mainly toward ML workflows                           |
| **ML Algorithms**               | Minimal or no built-in ML algorithms; mainly for data wrangling and analysis         | Extensive ML support: classification, regression, clustering, dimensionality reduction, etc. |
| **Data types support**          | Supports various column types (numeric, categorical, date, etc.) with convenient API | Supports different types but with a focus on numeric data for ML                             |
| **Data visualization**          | Plotly-based charts via the tablesaw-jsplot module                                   | Basic plotting via the smile-plot module; the main focus is ML and stats                     |
| **Performance**                 | Efficient for in-memory tabular data; good for typical data wrangling tasks          | Highly optimized for numerical computation and ML tasks                                      |
| **Missing value handling**      | Good support for missing data in tables                                              | Supports missing data but less focus on data cleaning than Tablesaw                          |
| **API complexity**              | Simple and intuitive for data manipulation and EDA                                   | More complex, with many ML-related classes and utilities                                     |
| **Community and documentation** | Growing, focused on data manipulation                                                | Mature, with focus on ML and statistics                                                      |
| **Integration**                 | Easy integration with Java projects for ETL, data manipulation                       | Great for projects requiring ML algorithms and predictive modeling                           |


## Tablesaw

### Titanic Example

**`TitanicAnalysis.java`** performs exploratory data analysis:

- Loads the raw Titanic dataset from CSV
- Creates a new "Alone" column to identify passengers traveling alone
- Splits the data into survived/perished subsets
- Calculates statistics and creates visualizations about:
  - Survival rates by gender
  - Survival based on fare price
  - Survival rates for passengers traveling alone vs. with family
  - Age distribution among survivors and non-survivors

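
The "Alone" column and the grouped survival rates listed above can be sketched in plain Java, without Tablesaw. This is only an illustration under stated assumptions, not the code in `TitanicAnalysis.java`: the class `SurvivalStats`, its methods, and the four in-memory rows are hypothetical, and the sketch assumes "alone" means `SibSp + Parch == 0`, which the notes do not spell out.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper; not part of TitanicAnalysis.java or of Tablesaw.
class SurvivalStats {

    // Assumption: a passenger is "alone" when both family counts are zero.
    static boolean isAlone(int sibSp, int parch) {
        return sibSp + parch == 0;
    }

    // Fraction of survivors (survived[i] == 1) within each group label.
    static Map<String, Double> survivalRateByGroup(String[] groups, int[] survived) {
        Map<String, int[]> counts = new LinkedHashMap<>(); // value = {survivors, total}
        for (int i = 0; i < groups.length; i++) {
            int[] c = counts.computeIfAbsent(groups[i], k -> new int[2]);
            c[0] += survived[i];
            c[1]++;
        }
        Map<String, Double> rates = new LinkedHashMap<>();
        for (Map.Entry<String, int[]> e : counts.entrySet()) {
            rates.put(e.getKey(), (double) e.getValue()[0] / e.getValue()[1]);
        }
        return rates;
    }

    public static void main(String[] args) {
        // Made-up rows for illustration only.
        String[] sex = {"male", "female", "female", "male"};
        int[] sibSp = {1, 0, 0, 0};
        int[] parch = {0, 0, 2, 0};
        int[] survived = {0, 1, 1, 0};

        String[] alone = new String[sex.length];
        for (int i = 0; i < sex.length; i++) {
            alone[i] = isAlone(sibSp[i], parch[i]) ? "alone" : "with family";
        }
        System.out.println(survivalRateByGroup(sex, survived));
        System.out.println(survivalRateByGroup(alone, survived));
    }
}
```

Grouping by any other categorical column works the same way: pass that column as `groups`.
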

**`TitanicPreprocess.java`** prepares the data for analysis:

- Loads the raw Titanic dataset
- Adds the "Alone" column
- Removes unnecessary columns (PassengerId, Name, Ticket, Cabin)
- Transforms categorical variables into numeric:
  - Sex: male → 1, female → 0
  - Embarked: C → 1, Q → 2, S → 3
- Fills missing values with median values
- Saves the cleaned dataset as `titanic_cleaned.csv`

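
The encoding and median-fill steps above can be sketched in plain Java. This is a hedged illustration rather than the `TitanicPreprocess.java` source: the class `TitanicEncoding` and its methods are hypothetical, the sample ages are made up, and the sketch assumes missing numeric values are represented as `NaN`.

```java
import java.util.Arrays;
import java.util.Map;

// Hypothetical helper mirroring the preprocessing steps; not the TitanicPreprocess.java source.
class TitanicEncoding {

    // Category-to-number mappings taken from the list above.
    static final Map<String, Integer> SEX = Map.of("male", 1, "female", 0);
    static final Map<String, Integer> EMBARKED = Map.of("C", 1, "Q", 2, "S", 3);

    // Median of the non-missing (non-NaN) values; NaN if every value is missing.
    static double median(double[] values) {
        double[] present = Arrays.stream(values).filter(v -> !Double.isNaN(v)).sorted().toArray();
        int n = present.length;
        if (n == 0) return Double.NaN;
        return n % 2 == 1 ? present[n / 2] : (present[n / 2 - 1] + present[n / 2]) / 2.0;
    }

    // Returns a copy in which every NaN is replaced by the column median.
    static double[] fillMissingWithMedian(double[] values) {
        double m = median(values);
        return Arrays.stream(values).map(v -> Double.isNaN(v) ? m : v).toArray();
    }

    public static void main(String[] args) {
        // Made-up ages; NaN marks a missing value (an assumption of this sketch).
        double[] age = {22.0, Double.NaN, 38.0, 26.0};
        System.out.println(Arrays.toString(fillMissingWithMedian(age)));
        System.out.println(SEX.get("female") + " " + EMBARKED.get("S"));
    }
}
```

The median is computed from the observed values only, so the filled-in entries do not shift it.
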

**`TitanicML.java`** builds machine learning models:

- Loads and cleans the dataset (similar to `TitanicPreprocess.java`)
- Converts categorical variables to numeric
- Normalizes numeric columns to the 0-1 range
- Converts Tablesaw tables to Weka instances
- Builds two ML models:
  - Decision tree (J48)
  - Logistic regression
- Performs cross-validation to evaluate model performance

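
Two of the steps above, scaling a column to the 0-1 range and splitting rows into cross-validation folds, can be sketched in plain Java. This is an illustration only: `TitanicML.java` is described as relying on Tablesaw and Weka for this work, the class `MlPrepSketch` and its methods are hypothetical, and the fold split shown is a simple shuffled split that ignores class balance (stratification).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical helper; not the TitanicML.java source and not a Weka API.
class MlPrepSketch {

    // Rescales values to [0, 1] via (x - min) / (max - min); a constant column maps to 0.
    static double[] minMaxNormalize(double[] values) {
        double min = Double.POSITIVE_INFINITY;
        double max = Double.NEGATIVE_INFINITY;
        for (double v : values) {
            min = Math.min(min, v);
            max = Math.max(max, v);
        }
        double range = max - min;
        double[] out = new double[values.length];
        for (int i = 0; i < values.length; i++) {
            out[i] = range == 0 ? 0.0 : (values[i] - min) / range;
        }
        return out;
    }

    // Shuffles row indices 0..n-1 and deals them into k folds of near-equal size.
    static List<List<Integer>> kFoldIndices(int n, int k, long seed) {
        List<Integer> indices = new ArrayList<>();
        for (int i = 0; i < n; i++) indices.add(i);
        Collections.shuffle(indices, new Random(seed));
        List<List<Integer>> folds = new ArrayList<>();
        for (int f = 0; f < k; f++) folds.add(new ArrayList<>());
        for (int i = 0; i < n; i++) folds.get(i % k).add(indices.get(i));
        return folds;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(minMaxNormalize(new double[]{10, 20, 40})));
        // Each fold serves once as the test set; the remaining folds form the training set.
        List<List<Integer>> folds = kFoldIndices(10, 5, 42L);
        for (int f = 0; f < folds.size(); f++) {
            System.out.println("fold " + f + " test rows: " + folds.get(f));
        }
    }
}
```

Fixing the seed keeps the folds reproducible between runs, which makes model comparisons fairer.
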

#### Running Order

For the best experience, run the files in this order:

1. `TitanicPreprocess.java` first. This creates the cleaned CSV file for further analysis.
2. `TitanicAnalysis.java` second. This gives you visualizations and basic statistics.
3. `TitanicML.java` last. This runs the machine learning models on the data.

## What is Smile?
- SMILE stands for Statistical Machine Intelligence and Learning Engine
- Java-based ML library with fast performance and wide algorithm support

### Downloading and Importing SMILE Libraries

In [None]:
%maven com.github.haifengl:smile-data:2.6.0
%maven com.github.haifengl:smile-math:2.6.0
%maven com.github.haifengl:smile-io:2.6.0
%maven org.slf4j:slf4j-nop:2.0.7
%maven com.github.haifengl:smile-core:2.6.0


## Loading dataset

In [53]:
import smile.data.DataFrame;
import smile.data.formula.Formula;
import smile.io.Read;
import smile.classification.LogisticRegression;
import smile.data.vector.IntVector;
import org.apache.commons.csv.CSVFormat;
import smile.validation.metric.Accuracy;
import java.util.HashMap;
import java.util.Map;

// Load the Iris dataset (CSV with a header row)
String url = "https://gist.githubusercontent.com/netj/8836201/raw/iris.csv";
DataFrame iris = Read.csv(url, CSVFormat.DEFAULT.withFirstRecordAsHeader());

System.out.println(iris.structure());
System.out.println(iris.summary());


[Column: String, Type: DataType, Measure: Measure]
+------------+------+-------+
|      Column|  Type|Measure|
+------------+------+-------+
|sepal.length|double|   null|
| sepal.width|double|   null|
|petal.length|double|   null|
| petal.width|double|   null|
|     variety|String|   null|
+------------+------+-------+

[column: String, count: long, min: double, avg: double, max: double]
+------------+-----+---+--------+---+
|      column|count|min|     avg|max|
+------------+-----+---+--------+---+
|sepal.length|  150|4.3|5.843333|7.9|
| sepal.width|  150|  2|3.057333|4.4|
|petal.length|  150|  1|   3.758|6.9|
| petal.width|  150|0.1|1.199333|2.5|
+------------+-----+---+--------+---+



## Train Logistic Regression with the target column "variety" encoded as "label"

In [54]:
String[] classes = iris.stringVector("variety").toArray();

Map<String, Integer> classToInt = new HashMap<>();
int labelCounter = 0;
int[] labels = new int[classes.length];

for (int i = 0; i < classes.length; i++) {
    if (!classToInt.containsKey(classes[i])) {
        classToInt.put(classes[i], labelCounter++);
    }
    labels[i] = classToInt.get(classes[i]);
}

iris = iris.merge(IntVector.of("label", labels));

System.out.println(iris.structure());


[Column: String, Type: DataType, Measure: Measure]
+------------+------+-------+
|      Column|  Type|Measure|
+------------+------+-------+
|sepal.length|double|   null|
| sepal.width|double|   null|
|petal.length|double|   null|
| petal.width|double|   null|
|     variety|String|   null|
|       label|   int|   null|
+------------+------+-------+



In [55]:
// Drop the string column "variety" so it is not used as a feature.
// Formula.lhs("label") makes "label" the target and every remaining column a predictor.
DataFrame features = iris.drop("variety");

Formula formula = Formula.lhs("label");
LogisticRegression model = LogisticRegression.fit(formula, features);

System.out.println("Model trained.");


Model trained.


In [56]:
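// Rebuild the feature table without the target columns ("variety" and "label")
DataFrame features = iris.drop("variety").drop("label");

// Copy the first row into a double[] so it can be passed to predict()
double[] sample = new double[features.ncols()];
for (int i = 0; i < features.ncols(); i++) {
    sample[i] = features.getDouble(0, i);
}

// pred is the integer code assigned in classToInt, not the variety name
int pred = model.predict(sample);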


In [57]:
DataFrame features = iris.drop("variety").drop("label");

int[] trueLabels = iris.intVector("label").toIntArray();
int[] predictedLabels = new int[iris.size()];

for (int i = 0; i < iris.size(); i++) {
    double[] x = new double[features.ncols()];
    for (int j = 0; j < features.ncols(); j++) {
        x[j] = features.getDouble(i, j);
    }
    predictedLabels[i] = model.predict(x);
}

double accuracy = Accuracy.of(trueLabels, predictedLabels);
// println returns nothing, so the notebook does not echo a PrintStream object as cell output
System.out.println(String.format("Training Accuracy: %.2f%%", accuracy * 100));


Training Accuracy: 98.00%

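
Accuracy is the fraction of rows whose predicted label equals the true label. The plain-Java sketch below shows that computation and adds a confusion matrix, which reveals which classes get mixed up. It is an illustration, not Smile's implementation: the class `ClassificationMetrics`, its methods, and the sample labels are hypothetical. Note that the figure reported in the cell above is measured on the training data, so it is likely an optimistic estimate.

```java
import java.util.Arrays;

// Hypothetical helper; independent of Smile's Accuracy class.
class ClassificationMetrics {

    // Fraction of positions where the prediction equals the true label.
    static double accuracy(int[] truth, int[] predicted) {
        if (truth.length != predicted.length) {
            throw new IllegalArgumentException("arrays must have the same length");
        }
        int correct = 0;
        for (int i = 0; i < truth.length; i++) {
            if (truth[i] == predicted[i]) correct++;
        }
        return (double) correct / truth.length;
    }

    // matrix[t][p] counts rows whose true label is t and predicted label is p.
    static int[][] confusionMatrix(int[] truth, int[] predicted, int numClasses) {
        int[][] matrix = new int[numClasses][numClasses];
        for (int i = 0; i < truth.length; i++) {
            matrix[truth[i]][predicted[i]]++;
        }
        return matrix;
    }

    public static void main(String[] args) {
        // Made-up labels for three classes encoded as 0, 1, 2.
        int[] truth = {0, 0, 1, 1, 2, 2};
        int[] predicted = {0, 0, 1, 2, 2, 2};
        System.out.printf("Accuracy: %.2f%%%n", accuracy(truth, predicted) * 100);
        for (int[] row : confusionMatrix(truth, predicted, 3)) {
            System.out.println(Arrays.toString(row));
        }
    }
}
```

Off-diagonal entries in the matrix are the misclassifications; the diagonal sums to the number of correct predictions.
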

## Let's Look at a Titanic Example

## Popcorn Hack

## Homework