In [1]:
#r "nuget: Microsoft.ML"
using Microsoft.ML;
using Microsoft.ML.Data;
using Microsoft.ML.Transforms;
using System;
using System.IO;
using System.Linq;
using System.Net;

Download data from [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/machine-learning-databases/annealing):

In [2]:
if (!File.Exists("anneal.data"))
{
    using var client = new WebClient();
    client.DownloadFile("https://archive.ics.uci.edu/ml/machine-learning-databases/annealing/anneal.data", "anneal.data");
}

Console.WriteLine($"Train data file has {File.ReadLines("anneal.data").Count():n0} lines");
File.ReadLines("anneal.data").Take(5)

Train data file has 798 lines


index,value
0,"?,C,A,08,00,?,S,?,000,?,?,G,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,COIL,0.700,0610.0,0000,?,0000,?,3"
1,"?,C,R,00,00,?,S,2,000,?,?,E,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,Y,?,?,?,COIL,3.200,0610.0,0000,?,0000,?,3"
2,"?,C,R,00,00,?,S,2,000,?,?,E,?,?,Y,?,B,?,?,?,?,?,?,?,?,?,?,?,?,?,?,SHEET,0.700,1300.0,0762,?,0000,?,3"
3,"?,C,A,00,60,T,?,?,000,?,?,G,?,?,?,?,M,?,?,?,?,?,?,?,?,?,?,?,?,?,?,COIL,2.801,0385.1,0000,?,0000,?,3"
4,"?,C,A,00,60,T,?,?,000,?,?,G,?,?,?,?,B,Y,?,?,?,Y,?,?,?,?,?,?,?,?,?,SHEET,0.801,0255.0,0269,?,0000,?,3"


Some datasets from the UCI Machine Learning Repository come as two files, one for training data and one for test data. The annealing dataset is one of them, so we download `anneal.test` as well. As the output below shows, it has the same layout as the training file and no header line.

In [3]:
if (!File.Exists("anneal.test"))
{
    using var client = new WebClient();
    client.DownloadFile("https://archive.ics.uci.edu/ml/machine-learning-databases/annealing/anneal.test", "anneal.test");
}

Console.WriteLine($"Test data file has {File.ReadLines("anneal.test").Count():n0} lines");
File.ReadLines("anneal.test").Take(5)

Test data file has 100 lines


index,value
0,"?,C,A,00,45,?,S,?,000,?,?,D,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,COIL,1.600,0610.0,0000,?,0000,?,3"
1,"?,C,A,00,00,?,S,3,000,N,?,E,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,Y,?,?,?,COIL,0.699,0609.9,0000,?,0000,?,3"
2,"ZS,C,A,00,85,T,?,?,000,?,?,E,?,?,?,Y,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,SHEET,0.400,0610.0,0762,?,0000,?,U"
3,"ZS,C,A,00,50,T,?,?,000,?,?,E,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,COIL,0.400,0610.0,0000,?,0000,?,3"
4,"?,C,A,00,00,?,S,2,000,?,?,E,?,?,Y,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,COIL,0.699,1320.0,0000,?,0000,?,3"


Notice all of the `?` values. These stand for "missing" data, and our pipeline below handles them through one-hot encoding, which treats a `?` in a string column as just another category. Next, we need to declare a type that defines the shape of our data. It's rather long, so be patient:

In [4]:
class AnnealData
{
    [LoadColumn(0)]
    public string Family { get; set; }

    [LoadColumn(1)]
    public string ProductType { get; set; }

    [LoadColumn(2)]
    public string Steel { get; set; }

    [LoadColumn(3)]
    public float Carbon { get; set; }

    [LoadColumn(4)]
    public float Hardness { get; set; }

    [LoadColumn(5)]
    public string TemperRolling { get; set; }

    [LoadColumn(6)]
    public string Condition { get; set; }

    [LoadColumn(7)]
    public string Formability { get; set; }

    [LoadColumn(8)]
    public float Strength { get; set; }

    [LoadColumn(9)]
    public string NonAgeing { get; set; }

    [LoadColumn(10)]
    public string SurfaceFinish { get; set; }

    [LoadColumn(11)]
    public string SurfaceQuality { get; set; }

    [LoadColumn(12)]
    public string Enamelability { get; set; }

    [LoadColumn(13)]
    public string Bc { get; set; }

    [LoadColumn(14)]
    public string Bf { get; set; }

    [LoadColumn(15)]
    public string Bt { get; set; }

    [LoadColumn(16)]
    public string BwMe { get; set; }

    [LoadColumn(17)]
    public string Bl { get; set; }

    [LoadColumn(18)]
    public string M { get; set; }

    [LoadColumn(19)]
    public string Chrom { get; set; }

    [LoadColumn(20)]
    public string Phos { get; set; }

    [LoadColumn(21)]
    public string Cbond { get; set; }

    [LoadColumn(22)]
    public string Marvi { get; set; }

    [LoadColumn(23)]
    public string Exptl { get; set; }

    [LoadColumn(24)]
    public string Ferro { get; set; }

    [LoadColumn(25)]
    public string Corr { get; set; }

    [LoadColumn(26)]
    public string BlueBrightVarnClean { get; set; }

    [LoadColumn(27)]
    public string Lustre { get; set; }

    [LoadColumn(28)]
    public string Jurofm { get; set; }

    [LoadColumn(29)]
    public string S { get; set; }

    [LoadColumn(30)]
    public string P { get; set; }

    [LoadColumn(31)]
    public string Shape { get; set; }

    [LoadColumn(32)]
    public float Thick { get; set; }

    [LoadColumn(33)]
    public float Width { get; set; }

    [LoadColumn(34)]
    public float Len { get; set; }

    [LoadColumn(35)]
    public string Oil { get; set; }

    [LoadColumn(36)]
    public string Bore { get; set; }

    [LoadColumn(37)]
    public string Packing { get; set; }

    [LoadColumn(38)]
    [ColumnName("Label")]
    public string Classes { get; set; }
}

Create a new MLContext:

In [5]:
var context = new MLContext();

Now that we have an `MLContext` and a class to represent our data, we can load the file into an `IDataView`. It is good practice to shuffle the data after loading. Many datasets come ordered by the values of some column or, even worse, by the label. For training a model, we want our data to be in random order:

In [6]:
var trainData = context.Data.LoadFromTextFile<AnnealData>("anneal.data", hasHeader: false, separatorChar: ',');
trainData = context.Data.ShuffleRows(trainData);
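
To see why ordering matters, here is a minimal plain C# sketch of a seeded shuffle applied to a label column that arrives sorted by class. The `Shuffle` helper and the sample labels are hypothetical and only illustrate the idea; this is not how `ShuffleRows` is implemented internally.

```csharp
using System;
using System.Linq;

// Fisher-Yates shuffle with a fixed seed, so the permutation is repeatable.
// Plain C# illustration only, not ML.NET's internal ShuffleRows algorithm.
T[] Shuffle<T>(T[] items, int seed)
{
    var rng = new Random(seed);
    var result = (T[])items.Clone();
    for (int i = result.Length - 1; i > 0; i--)
    {
        int j = rng.Next(i + 1);   // 0 <= j <= i
        (result[i], result[j]) = (result[j], result[i]);
    }
    return result;
}

// Hypothetical label column stored sorted by class, as some files are.
var sortedLabels = new[] { "1", "1", "2", "2", "2", "3", "3", "3", "3", "U" };
var shuffledLabels = Shuffle(sortedLabels, seed: 42);

Console.WriteLine("Before: " + string.Join(" ", sortedLabels));
Console.WriteLine("After:  " + string.Join(" ", shuffledLabels));
```

Without the shuffle, any split that takes the first rows of the sorted column would see only one or two classes.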

Next we will load the test data.

In [7]:
var testData = context.Data.LoadFromTextFile<AnnealData>("anneal.test", hasHeader: false, separatorChar: ',');
testData = context.Data.ShuffleRows(testData);

Now we declare the features of the dataset that we will train our model on:

In [8]:
var featureColumns = new[]
{
    nameof(AnnealData.Family), nameof(AnnealData.ProductType), nameof(AnnealData.Steel), nameof(AnnealData.Carbon), nameof(AnnealData.Hardness),
    nameof(AnnealData.TemperRolling), nameof(AnnealData.Condition), nameof(AnnealData.Formability), nameof(AnnealData.Strength), nameof(AnnealData.NonAgeing),
    nameof(AnnealData.SurfaceFinish), nameof(AnnealData.SurfaceQuality), nameof(AnnealData.Enamelability), nameof(AnnealData.Bc), nameof(AnnealData.Bf),
    nameof(AnnealData.Bt), nameof(AnnealData.BwMe), nameof(AnnealData.Bl), nameof(AnnealData.M), nameof(AnnealData.Chrom), nameof(AnnealData.Phos),
    nameof(AnnealData.Cbond), nameof(AnnealData.Marvi), nameof(AnnealData.Exptl), nameof(AnnealData.Ferro), nameof(AnnealData.Corr),
    nameof(AnnealData.BlueBrightVarnClean), nameof(AnnealData.Lustre), nameof(AnnealData.Jurofm), nameof(AnnealData.S), nameof(AnnealData.P), nameof(AnnealData.Shape),
    nameof(AnnealData.Thick), nameof(AnnealData.Width), nameof(AnnealData.Len), nameof(AnnealData.Oil), nameof(AnnealData.Bore), nameof(AnnealData.Packing)
};

There are a number of categorical columns (string values that represent discrete values) in the data. We will need to encode those columns, so we declare which columns are categorical here:

In [9]:
var categoricalColumns = new[]
{
    nameof(AnnealData.Family), nameof(AnnealData.ProductType), nameof(AnnealData.Steel), nameof(AnnealData.TemperRolling), nameof(AnnealData.Condition),
    nameof(AnnealData.Formability), nameof(AnnealData.NonAgeing), nameof(AnnealData.SurfaceFinish), nameof(AnnealData.SurfaceQuality), nameof(AnnealData.Enamelability),
    nameof(AnnealData.Bc), nameof(AnnealData.Bf), nameof(AnnealData.Bt), nameof(AnnealData.BwMe), nameof(AnnealData.Bl), nameof(AnnealData.M), nameof(AnnealData.Chrom),
    nameof(AnnealData.Phos), nameof(AnnealData.Cbond), nameof(AnnealData.Marvi), nameof(AnnealData.Exptl), nameof(AnnealData.Ferro), nameof(AnnealData.Corr),
    nameof(AnnealData.BlueBrightVarnClean), nameof(AnnealData.Lustre), nameof(AnnealData.Jurofm), nameof(AnnealData.S), nameof(AnnealData.P), nameof(AnnealData.Shape),
    nameof(AnnealData.Oil), nameof(AnnealData.Bore), nameof(AnnealData.Packing)
};

What are we trying to predict with this data? The column called `Classes`, which we have identified as the `Label` in our type, is the value that we will try to predict. To determine whether this is a regression or a classification problem, we need to look at the values that `Classes` can take on: are they continuous or are they categorical?

In [10]:
trainData.GetColumn<string>("Label").Distinct()

index,value
0,3
1,U
2,5
3,1
4,2


So `Classes` takes only a handful of discrete values, and one of them (`U`) is not even numeric. That makes this a classification problem, which means that we need to treat `Classes` as a categorical variable. The first step in the pipeline will one-hot encode all of the categorical columns. Next, we map the value of our `Label` (i.e., `Classes`) column to a Key, because ML.NET's multiclass trainers expect a key-typed label, and we map that key back to a new column, `LabelValue`, just for purposes of displaying later on. Finally, we concatenate all of the feature columns into a single new column, `Features`.

In [11]:
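// Start from an empty chain and append one one-hot encoder per categorical column.
var chain = new EstimatorChain<OneHotEncodingTransformer>();
var pipeline = categoricalColumns
    .Aggregate(chain, (pl, col) => pl.Append(context.Transforms.Categorical.OneHotEncoding(col)))
    // Convert the string label to a key.
    .Append(context.Transforms.Conversion.MapValueToKey("Label", "Label"))
    // Keep a human-readable copy of the label for display.
    .Append(context.Transforms.Conversion.MapKeyToValue("LabelValue", "Label"))
    // Combine all feature columns into a single vector column.
    .Append(context.Transforms.Concatenate("Features", featureColumns));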
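
The value-to-key and key-to-value steps can be pictured with a small plain C# sketch. It assumes keys are assigned in order of first appearance and start at 1, with 0 reserved for a missing value, which matches ML.NET's default as far as I know. `BuildKeyMap` is a hypothetical helper written for this illustration, not part of ML.NET.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Assign 1-based keys to labels in order of first appearance.
// Key 0 is left unused, mirroring the convention that 0 means "missing".
Dictionary<string, uint> BuildKeyMap(IEnumerable<string> labels)
{
    var map = new Dictionary<string, uint>();
    foreach (var label in labels)
    {
        if (!map.ContainsKey(label))
            map[label] = (uint)(map.Count + 1);
    }
    return map;
}

// Labels in the order the Distinct() output above lists them.
var labelValues = new[] { "3", "U", "5", "1", "2" };
var valueToKey = BuildKeyMap(labelValues);
var keyToValue = valueToKey.ToDictionary(kv => kv.Value, kv => kv.Key);

foreach (var label in labelValues)
    Console.WriteLine($"{label} -> key {valueToKey[label]} -> {keyToValue[valueToKey[label]]}");
```

The round trip is why `LabelValue` shows the original strings even though the trainer only ever sees the key.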

Fit the pipeline to our training data:

In [12]:
var transformer = pipeline.Fit(trainData);

Let's first view the data as it was loaded from the downloaded file:

In [13]:
context.Data.CreateEnumerable<AnnealData>(trainData, reuseRowObject: false).Take(5)

index,Family,ProductType,Steel,Carbon,Hardness,TemperRolling,Condition,Formability,Strength,NonAgeing,SurfaceFinish,SurfaceQuality,Enamelability,Bc,Bf,Bt,BwMe,Bl,M,Chrom,Phos,Cbond,Marvi,Exptl,Ferro,Corr,BlueBrightVarnClean,Lustre,Jurofm,S,P,Shape,Thick,Width,Len,Oil,Bore,Packing,Classes
0,?,C,R,6,0,T,?,?,0,?,?,?,?,?,Y,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,SHEET,0.6,1320.0,4880,?,0,?,3
1,?,C,R,0,0,?,S,2,0,?,?,E,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,SHEET,3.2,1320.0,4880,?,0,?,3
2,?,C,R,0,0,?,S,2,0,?,?,E,?,?,Y,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,SHEET,1.601,1320.0,4880,?,0,?,3
3,?,C,A,0,0,?,S,2,0,?,?,F,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,SHEET,0.7,1220.0,762,?,0,?,3
4,?,C,A,0,60,T,?,?,0,?,?,G,?,?,?,?,B,Y,?,?,?,Y,?,?,?,?,?,?,?,?,?,SHEET,0.8,356.1,4880,?,0,?,3


Now let's see what the data looks like after it has been transformed by our pipeline:

In [14]:
class AnnealDataTransformed
{
    [ColumnName("LabelValue")]
    public string Classes { get; set; }

    [VectorType(84)]
    public float[] Features { get; set; }
}

var transformedData = transformer.Transform(trainData);
context.Data
    .CreateEnumerable<AnnealDataTransformed>(transformedData, reuseRowObject: false)
    .Take(3)

index,Classes,Features
0,3,"[ 1, 0, 0, 1, 1, 0, 0, 0, 0, 0 ... (74 more) ]"
1,3,"[ 1, 0, 0, 1, 1, 0, 0, 0, 0, 0 ... (74 more) ]"
2,3,"[ 1, 0, 0, 1, 1, 0, 0, 0, 0, 0 ... (74 more) ]"


Whoa! Notice that the `Features` vector has 84 slots after running the data through the pipeline! This is due to one-hot encoding, which creates a new slot for each distinct value in our categorical columns. Don't worry, though: this is no problem for ML.NET, which can deal with hundreds, even thousands, of features in a dataset.
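
A rough plain C# sketch of how one-hot encoding widens the data, using the `Shape` and `Steel` values from the first five training rows printed earlier. `OneHot` is a hypothetical helper for illustration; ML.NET's own encoder has more options than this.

```csharp
using System;
using System.Linq;

// One-hot encode a single categorical column: one slot per distinct value,
// with a 1 in the slot that matches the row's value.
float[][] OneHot(string[] column)
{
    var categories = column.Distinct().ToArray();   // order of first appearance
    return column
        .Select(v => categories.Select(c => c == v ? 1f : 0f).ToArray())
        .ToArray();
}

// Shape and Steel fields of the first five rows of anneal.data shown earlier.
var shape = new[] { "COIL", "COIL", "SHEET", "COIL", "SHEET" };
var steel = new[] { "A", "R", "R", "A", "A" };

var shapeEncoded = OneHot(shape);
var steelEncoded = OneHot(steel);

// Concatenating the encoded columns gives the width of the feature vector.
int width = shapeEncoded[0].Length + steelEncoded[0].Length;
Console.WriteLine($"Shape slots: {shapeEncoded[0].Length}, Steel slots: {steelEncoded[0].Length}, total: {width}");
Console.WriteLine("Row 2 Shape: [" + string.Join(", ", shapeEncoded[2]) + "]");
```

Each categorical column contributes as many slots as it has distinct values, so the total grows quickly across 32 categorical columns.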

Now we will create a multiclass classification estimator. You can try different estimators to see how their accuracy differs.

In [15]:
var estimator = context.MulticlassClassification.Trainers.LbfgsMaximumEntropy(featureColumnName: "Features");

Use 3-fold cross-validation and keep the model from the fold with the highest macro accuracy.

In [16]:
var transformedTrainData = transformer.Transform(trainData);
var cvResults = context.MulticlassClassification.CrossValidate(transformedTrainData, estimator, numberOfFolds: 3);
var cvResult = cvResults
    .OrderByDescending(x => x.Metrics.MacroAccuracy)
    .First();
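
For intuition, k-fold cross-validation can be sketched in plain C# as a partition of row indices, where each fold is held out once for validation. The round-robin assignment and the `MakeFolds` helper are assumptions made for this sketch; ML.NET assigns rows to folds in its own randomized way, so its fold sizes need not match exactly.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Split row indices 0..rowCount-1 into k folds by round-robin assignment.
// Each fold serves once as the validation set; the rest is used for training.
List<int>[] MakeFolds(int rowCount, int k)
{
    var result = Enumerable.Range(0, k).Select(_ => new List<int>()).ToArray();
    for (int i = 0; i < rowCount; i++)
        result[i % k].Add(i);
    return result;
}

const int rows = 798;   // line count reported for anneal.data
var folds = MakeFolds(rows, 3);
for (int f = 0; f < folds.Length; f++)
{
    int validation = folds[f].Count;
    int training = rows - validation;
    Console.WriteLine($"Fold {f}: train on {training} rows, validate on {validation} rows");
}
```

Each of the three results returned by `CrossValidate` corresponds to one such train/validate split, with its own model and metrics.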

And here are the metrics for the model we selected:

In [17]:
cvResult.Metrics.ConfusionMatrix.GetFormattedConfusionTable()


Confusion table
PREDICTED ||     3 |     U |     5 |     1 |     2 | Recall
        3 ||   195 |     0 |     0 |     0 |     5 | 0.9750
        U ||     1 |    11 |     0 |     0 |     0 | 0.9167
        5 ||     0 |     0 |    21 |     0 |     0 | 1.0000
        1 ||     0 |     0 |     0 |     2 |     0 | 1.0000
        2 ||     1 |     0 |     0 |     0 |    31 | 0.9688
Precision ||0.9898 |1.0000 |1.0000 |1.0000 |0.8611 |


In [18]:
cvResult.Metrics

LogLoss,LogLossReduction,MacroAccuracy,MicroAccuracy,TopKAccuracy,TopKPredictionCount,PerClassLogLoss,ConfusionMatrix
0.163498091075589,0.8069161254931962,0.9720833333333332,0.9737827715355806,0,0,"[ 0.09889670814034776, 1.0952370782134104, 0.12432197586785135, 0.285133862784716, 0.23596170411742123 ]","{ Microsoft.ML.Data.ConfusionMatrix: PerClassPrecision: [ 0.9898477157360406, 1, 1, 1, 0.8611111111111112 ], PerClassRecall: [ 0.975, 0.9166666666666666, 1, 1, 0.96875 ], Counts: [ [ 195, 0, 0, 0, 5 ], [ 1, 11, 0, 0, 0 ], [ 0, 0, 21, 0, 0 ], [ 0, 0, 0, 2, 0 ], [ 1, 0, 0, 0, 31 ] ], NumberOfClasses: 5 }"


OK, now we can use our best model on the test data.

In [19]:
var transformedTestData = transformer.Transform(testData);
var predictions = cvResult.Model.Transform(transformedTestData);
var metrics = context.MulticlassClassification.Evaluate(predictions);
metrics.ConfusionMatrix.GetFormattedConfusionTable()


Confusion table
PREDICTED ||     3 |     U |     5 |     1 |     2 | Recall
        3 ||    74 |     0 |     0 |     0 |     2 | 0.9737
        U ||     0 |     6 |     0 |     0 |     0 | 1.0000
        5 ||     0 |     0 |     7 |     0 |     0 | 1.0000
        1 ||     0 |     0 |     0 |     0 |     0 | 0.0000
        2 ||     0 |     0 |     0 |     0 |    11 | 1.0000
Precision ||1.0000 |1.0000 |1.0000 |0.0000 |0.8462 |


In [20]:
metrics

LogLoss,LogLossReduction,MacroAccuracy,MicroAccuracy,TopKAccuracy,TopKPredictionCount,PerClassLogLoss,ConfusionMatrix
0.1338884417766255,0.833952280983819,0.993421052631579,0.98,0,0,"[ 0.11400612109017827, 0.27224985752083025, 0.13824694614355384, 0, 0.19301374606174077 ]","{ Microsoft.ML.Data.ConfusionMatrix: PerClassPrecision: [ 1, 1, 1, 0, 0.8461538461538461 ], PerClassRecall: [ 0.9736842105263158, 1, 1, 0, 1 ], Counts: [ [ 74, 0, 0, 0, 2 ], [ 0, 6, 0, 0, 0 ], [ 0, 0, 7, 0, 0 ], [ 0, 0, 0, 0, 0 ], [ 0, 0, 0, 0, 11 ] ], NumberOfClasses: 5 }"
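
Class `1` has no rows in the test set, which is why its recall and precision show as 0 while the macro accuracy is still high. The sketch below recomputes both accuracies from the test confusion counts above. Skipping classes with no actual examples in the macro average is an assumption on my part; it is chosen because it reproduces the reported figures.

```csharp
using System;
using System.Globalization;
using System.Linq;

// Micro accuracy: correct predictions divided by all predictions.
double MicroAccuracy(int[][] counts)
{
    double correct = Enumerable.Range(0, counts.Length).Sum(i => counts[i][i]);
    double total = counts.Sum(row => row.Sum());
    return correct / total;
}

// Macro accuracy: mean per-class recall. Classes with no actual examples are
// skipped here (an assumption that reproduces the figures reported above).
double MacroAccuracy(int[][] counts)
{
    return Enumerable.Range(0, counts.Length)
        .Where(i => counts[i].Sum() > 0)
        .Select(i => (double)counts[i][i] / counts[i].Sum())
        .Average();
}

// Rows are actual classes, columns are predicted classes (order: 3, U, 5, 1, 2).
var testCounts = new[]
{
    new[] { 74, 0, 0, 0, 2 },
    new[] { 0, 6, 0, 0, 0 },
    new[] { 0, 0, 7, 0, 0 },
    new[] { 0, 0, 0, 0, 0 },
    new[] { 0, 0, 0, 0, 11 },
};

var inv = CultureInfo.InvariantCulture;
Console.WriteLine("Micro accuracy: " + MicroAccuracy(testCounts).ToString("F4", inv));  // 0.9800
Console.WriteLine("Macro accuracy: " + MacroAccuracy(testCounts).ToString("F4", inv));  // 0.9934
```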


Now let's pretend we have new data (for convenience we are just randomly re-sampling the test data) to see what predictions our model makes. Each prediction has three properties:
- `LabelValue`: this is the actual `Label` value we are trying to predict. Our model doesn't know what the actual value is; it is shown here for comparison. Since our `Label` column is mapped to a key value, we need to map it to another column in order to actually see the human-readable value.
- `Score`: an array of probabilities, one per class.
- `PredictedLabelValue`: this is the actual prediction made by the model. Again, since our `PredictedLabel` is a Key value, we need to map it to a new, human-readable column in order to view it.

You can run this cell multiple times to get new random samples and their predictions!

In [21]:
class MulticlassClassificationPrediction
{
    public string LabelValue { get; set; }

    public float[] Score { get; set; }

    public string PredictedLabelValue { get; set; }
}

var sampleData = context.Data.ShuffleRows(testData);
var transformedSampleData = transformer.Transform(sampleData);

var samplePredictions = cvResult.Model.Transform(transformedSampleData);
var mapValues = context.Transforms.Conversion
    .MapKeyToValue("PredictedLabelValue", "PredictedLabel")
    .Append(context.Transforms.Conversion.MapKeyToValue("LabelValue", "Label"))
    .Fit(samplePredictions);
samplePredictions = mapValues.Transform(samplePredictions);
var samplePredictionItems = context.Data.CreateEnumerable<MulticlassClassificationPrediction>(samplePredictions, reuseRowObject: false);

samplePredictionItems.Take(5)

index,LabelValue,Score,PredictedLabelValue
0,U,"[ 0.17150913, 0.8284901, 7.490555E-12, 8.5478234E-13, 1.716214E-12 ]",U
1,3,"[ 0.9837427, 0.00029367872, 0.0038972432, 0.002158442, 0.009907948 ]",3
2,5,"[ 0.040364657, 0.0018266856, 0.8743242, 7.40062E-08, 0.083482675 ]",5
3,3,"[ 0.97258323, 0.00026706772, 0.0063666697, 1.05375584E-07, 0.020782344 ]",3
4,2,"[ 0.24653675, 0.0012577792, 0.029561225, 0.020644246, 0.7020001 ]",2
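
The relationship between `Score` and `PredictedLabelValue` can be checked by hand: the predicted class is the one whose slot holds the largest score. The sketch below uses the first sample row above and assumes the score slots follow the class order of the confusion tables (3, U, 5, 1, 2). `ArgMax` is a hypothetical helper.

```csharp
using System;
using System.Linq;

// Return the index of the largest score.
int ArgMax(float[] scores)
{
    int best = 0;
    for (int i = 1; i < scores.Length; i++)
        if (scores[i] > scores[best]) best = i;
    return best;
}

// Class order taken from the confusion tables above; assumed to match the Score slots.
var classes = new[] { "3", "U", "5", "1", "2" };

// Score vector of the first sample prediction shown above.
var score = new[] { 0.17150913f, 0.8284901f, 7.490555E-12f, 8.5478234E-13f, 1.716214E-12f };

Console.WriteLine($"Predicted: {classes[ArgMax(score)]}");
Console.WriteLine($"Sum of scores: {score.Sum()}");
```

The scores in each row sum to roughly 1, which is consistent with reading them as per-class probabilities.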
