In [1]:
#r "nuget: Microsoft.ML"
using System;
using System.IO;
using System.Linq;
using System.Net;
using Microsoft.ML;
using Microsoft.ML.Data;

Download data from [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/machine-learning-databases/abalone):

In [2]:
if (!File.Exists("abalone.data"))
{
    using var client = new WebClient();
    client.DownloadFile("https://archive.ics.uci.edu/ml/machine-learning-databases/abalone/abalone.data", "abalone.data");
}

File.ReadLines("abalone.data").Take(5)

index,value
0,"M,0.455,0.365,0.095,0.514,0.2245,0.101,0.15,15"
1,"M,0.35,0.265,0.09,0.2255,0.0995,0.0485,0.07,7"
2,"F,0.53,0.42,0.135,0.677,0.2565,0.1415,0.21,9"
3,"M,0.44,0.365,0.125,0.516,0.2155,0.114,0.155,10"
4,"I,0.33,0.255,0.08,0.205,0.0895,0.0395,0.055,7"


Create a new `MLContext`. `MLContext` is the entry point to ML.NET: data loading, transforms, trainers, and evaluation are all accessed through it.

In [3]:
var context = new MLContext();

Create a class to represent the data in our file:

In [4]:
class AbaloneData
{
    [LoadColumn(0)]
    public string Sex { get; set; }

    [LoadColumn(1)]
    public float Length { get; set; }

    [LoadColumn(2)]
    public float Diameter { get; set; }

    [LoadColumn(3)]
    public float Height { get; set; }

    [LoadColumn(4)]
    public float WholeWeight { get; set; }

    [LoadColumn(5)]
    public float ShuckedWeight { get; set; }

    [LoadColumn(6)]
    public float VisceraWeight { get; set; }

    [LoadColumn(7)]
    public float ShellWeight { get; set; }

    [LoadColumn(8)]
    [ColumnName("Label")]
    public float Rings { get; set; }
}


Now that we have an `MLContext` and a class to represent our data, we can load the file into an `IDataView`. It is good practice to shuffle the data after loading: many datasets come sorted by one of the feature columns or, even worse, by the label. For training a model, we want the rows in random order:

In [5]:
var allData = context.Data.LoadFromTextFile<AbaloneData>("abalone.data", hasHeader: false, separatorChar: ',');
allData = context.Data.ShuffleRows(allData);

Since we are only given one file of data, we need to split it into training and test sub-datasets:

In [6]:
var splitData = context.Data.TrainTestSplit(allData, testFraction: 0.2);
var (trainData, testData) = (splitData.TrainSet, splitData.TestSet);

Now we declare the features of the dataset that we will train our model on:

In [7]:
var featureColumns = new[]
{
    nameof(AbaloneData.Sex), nameof(AbaloneData.Length), nameof(AbaloneData.Diameter), nameof(AbaloneData.Height),
    nameof(AbaloneData.WholeWeight), nameof(AbaloneData.ShuckedWeight), nameof(AbaloneData.VisceraWeight),
    nameof(AbaloneData.ShellWeight)
};

Create a pipeline that one-hot encodes the `Sex` feature, concatenates all of the features into a single new column, and finally normalizes each row's feature vector to unit length (`NormalizeLpNorm` uses the L2 norm by default), which makes the values better suited for machine learning models.

In [8]:
var pipeline = context
    .Transforms.Categorical.OneHotEncoding(nameof(AbaloneData.Sex))
    .Append(context.Transforms.Concatenate("Features", featureColumns))
    .Append(context.Transforms.NormalizeLpNorm("FeaturesNorm", "Features"));
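
To make the one-hot encoding step concrete, here is a minimal plain C# sketch that does not use ML.NET. It assumes categories are indexed in the order they are first seen, which is consistent with the transformed output further down, where `F` occupies the first slot. The `indexOf` dictionary and the `OneHot` helper are hypothetical names used only for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Sex values from the first five rows of abalone.data shown earlier.
var sexes = new[] { "M", "M", "F", "M", "I" };

// Assumption: each distinct category gets an index in order of first appearance.
var indexOf = new Dictionary<string, int>();
foreach (var s in sexes)
{
    if (!indexOf.ContainsKey(s))
    {
        indexOf.Add(s, indexOf.Count);
    }
}

// Turn a category into an indicator vector: 1 at its index, 0 elsewhere.
float[] OneHot(string category)
{
    var vector = new float[indexOf.Count];
    vector[indexOf[category]] = 1f;
    return vector;
}

foreach (var s in sexes)
{
    Console.WriteLine($"{s} -> [ {string.Join(", ", OneHot(s))} ]");
}
```

With these five rows, `M` takes the first slot, `F` the second, and `I` the third. In the notebook the rows are shuffled before fitting, so the slot order depends on which category the pipeline encounters first.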

Now, we fit the pipeline to our training data:

In [9]:
var transformer = pipeline.Fit(trainData);

Print the data as it was loaded from the file:

In [10]:
var sourceItems = context.Data
    .CreateEnumerable<AbaloneData>(trainData, reuseRowObject: false)
    .Take(5);
sourceItems

index,Sex,Length,Diameter,Height,WholeWeight,ShuckedWeight,VisceraWeight,ShellWeight,Rings
0,F,0.55,0.43,0.14,0.8105,0.368,0.161,0.275,9
1,M,0.62,0.505,0.185,1.5275,0.69,0.368,0.35,13
2,F,0.445,0.355,0.15,0.485,0.181,0.125,0.155,11


Compare that to the data as transformed by the pipeline. First, we one-hot encoded the `Sex` column. Then we concatenated all of the feature columns into a single new vector column, `Features`. Lastly, we normalized the values and put them into a new vector column, `FeaturesNorm`. Notice that the first three values of `Features` are the one-hot encoded values of `Sex`.

In [12]:
class AbaloneDataTransformed
{
    [ColumnName("Label")]
    public float Rings { get; set; }

    [VectorType(10)]
    public float[] Features { get; set; }

    [VectorType(10)]
    public float[] FeaturesNorm { get; set; }
}

var transformedData = transformer.Transform(trainData);
context.Data
    .CreateEnumerable<AbaloneDataTransformed>(transformedData, reuseRowObject: false)
    .Take(5)

index,Rings,Features,FeaturesNorm
0,9,"[ 1, 0, 0, 0.55, 0.43, 0.14, 0.8105, 0.368, 0.161, 0.275 ]","[ 0.6453789, 0, 0, 0.3549584, 0.27751294, 0.09035304, 0.52307963, 0.23749943, 0.103906, 0.1774792 ]"
1,13,"[ 0, 1, 0, 0.62, 0.505, 0.185, 1.5275, 0.69, 0.368, 0.35 ]","[ 0, 0.45927015, 0, 0.28474748, 0.23193142, 0.084964976, 0.70153517, 0.3168964, 0.16901141, 0.16074455 ]"
2,11,"[ 1, 0, 0, 0.445, 0.355, 0.15, 0.485, 0.181, 0.125, 0.155 ]","[ 0.7775134, 0, 0, 0.34599346, 0.27601725, 0.116627015, 0.377094, 0.14072992, 0.09718917, 0.12051458 ]"
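
As a sanity check on the normalization step, the sketch below recomputes `FeaturesNorm` for the first row by hand in plain C#. It assumes the normalization is a per-row L2 normalization (divide each component by the square root of the sum of squares), which matches the numbers printed above.

```csharp
using System;
using System.Linq;

// Features vector of the first transformed row shown above.
float[] features = { 1f, 0f, 0f, 0.55f, 0.43f, 0.14f, 0.8105f, 0.368f, 0.161f, 0.275f };

// L2 norm: square root of the sum of squared components.
double norm = Math.Sqrt(features.Sum(v => (double)v * v));

// Divide every component by the norm so the row has unit length.
float[] normalized = features.Select(v => (float)(v / norm)).ToArray();

Console.WriteLine($"norm = {norm:F4}");
Console.WriteLine(string.Join(", ", normalized.Select(v => v.ToString("F4"))));
```

The norm comes out at roughly 1.5495, so the leading `1` becomes about 0.6454 and `0.55` becomes about 0.3550, in agreement with the first row of `FeaturesNorm`.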


Create an estimator:

In [13]:
var estimator = context.Regression.Trainers.LbfgsPoissonRegression(featureColumnName: "FeaturesNorm");
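
For context, Poisson regression in its standard textbook form models the expected count as the exponential of a linear function of the features. The symbols below are notation introduced here, not taken from the notebook: $\mathbf{x}$ is the `FeaturesNorm` vector, $y$ the ring count, and $\mathbf{w}$, $b$ the learned weights and bias. Any regularization terms the trainer may add are omitted.

$$
\mathbb{E}[y \mid \mathbf{x}] = \exp\left(\mathbf{w}^\top \mathbf{x} + b\right),
\qquad
\mathcal{L}(\mathbf{w}, b) = \sum_{i=1}^{n} \left[ \exp\left(\mathbf{w}^\top \mathbf{x}_i + b\right) - y_i \left(\mathbf{w}^\top \mathbf{x}_i + b\right) \right]
$$

Here $\mathcal{L}$ is the Poisson negative log-likelihood up to a constant that does not depend on the parameters. The model suits this dataset because ring counts are non-negative integers, and the exponential link keeps every prediction positive.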

Now we use 3-fold cross-validation and keep the model from the fold with the highest R-squared.

In [14]:
var transformedTrainData = transformer.Transform(trainData);
var cvResults = context.Regression.CrossValidate(transformedTrainData, estimator, numberOfFolds: 3);
var cvResult = cvResults
    .OrderByDescending(x => x.Metrics.RSquared)
    .First();
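
The idea behind k-fold cross-validation can be sketched in plain C#. This is only an illustration of the concept on a hypothetical ten-row dataset. The round-robin fold assignment is an assumption made for simplicity and is not how ML.NET partitions rows internally.

```csharp
using System;
using System.Linq;

const int rowCount = 10;  // hypothetical tiny dataset
const int foldCount = 3;  // same number of folds as the cell above

// Assign each row index to a fold in round-robin fashion (illustrative choice).
var foldOfRow = Enumerable.Range(0, rowCount).Select(i => i % foldCount).ToArray();

for (var fold = 0; fold < foldCount; fold++)
{
    // Rows in the current fold are held out; all other rows are used for training.
    var validationRows = Enumerable.Range(0, rowCount).Where(i => foldOfRow[i] == fold).ToArray();
    var trainingRows = Enumerable.Range(0, rowCount).Where(i => foldOfRow[i] != fold).ToArray();

    Console.WriteLine(
        $"fold {fold}: train on [{string.Join(",", trainingRows)}], " +
        $"validate on [{string.Join(",", validationRows)}]");
}
```

Every row is held out exactly once, so each fold yields one trained model and one set of validation metrics.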

Here are the metrics for our model. Since `cvResults` contains the model and metrics from each cross-validation fold, we average across all of them to get a measure of performance:

In [15]:
new Dictionary<string, double>
{
    ["Mean Absolute Error"] = cvResults.Average(x => x.Metrics.MeanAbsoluteError),
    ["Mean Squared Error"] = cvResults.Average(x => x.Metrics.MeanSquaredError),
    ["Root Mean Squared Error"] = cvResults.Average(x => x.Metrics.RootMeanSquaredError),
    ["R-squared"] = cvResults.Average(x => x.Metrics.RSquared),
}

key,value
Mean Absolute Error,1.543131147471789
Mean Squared Error,4.613514083469288
Root Mean Squared Error,2.147824299231494
R-squared,0.5541130709539995
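
For reference, these four metrics have the following standard definitions. The notation is introduced here for illustration: $y_i$ is the actual ring count, $\hat{y}_i$ the predicted value, $\bar{y}$ the mean of the actual values, and $n$ the number of evaluated rows.

$$
\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n} \left| y_i - \hat{y}_i \right|,
\qquad
\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2,
\qquad
\mathrm{RMSE} = \sqrt{\mathrm{MSE}},
\qquad
R^2 = 1 - \frac{\sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2}{\sum_{i=1}^{n} \left( y_i - \bar{y} \right)^2}
$$

Note that the table reports each metric averaged over the three folds, so the averaged RMSE need not equal the square root of the averaged MSE exactly.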


Now we can evaluate our model against the test data.

In [16]:
var transformedTestData = transformer.Transform(testData);
var predictions = cvResult.Model.Transform(transformedTestData);
var metrics = context.Regression.Evaluate(predictions);

Here are the metrics for our test data:

In [17]:
metrics

MeanAbsoluteError,MeanSquaredError,RootMeanSquaredError,LossFunction,RSquared
1.503159322490582,4.44769552322018,2.108956026857881,4.4476955180904,0.575775018998866


Now let's pretend we have new data (for convenience, we just reshuffle the test data) to see what predictions our model makes. You will see two values:
- `Label`: the actual number of rings for the example being predicted. Our model never sees this value, but we show it below so that you can see how close the predicted number of rings is to the actual number.
- `Score`: the number of rings predicted by the model. The closer this is to `Label`, the more accurate the prediction.

You can run this cell multiple times to get new random samples and their predictions!

In [18]:
class RegressionPrediction
{
    public float Label { get; set; }

    public float Score { get; set; }
}

// Show some sample predictions
var sampleData = context.Data.ShuffleRows(testData);
var transformedSampleData = transformer.Transform(sampleData);

var predictionEngine = context.Model.CreatePredictionEngine<AbaloneDataTransformed, RegressionPrediction>(cvResult.Model);

context.Data.CreateEnumerable<AbaloneDataTransformed>(transformedSampleData, reuseRowObject: false)
    .Take(5)
    .Select(predictionEngine.Predict)

index,Label,Score
0,10,11.753112
1,8,8.1793375
2,11,10.811496
3,12,12.0632515
4,5,5.9146023
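
To relate these sample predictions to the metrics reported earlier, the sketch below computes the mean absolute error and root mean squared error over just the five rows shown above, in plain C#. Five rows are far too few to be representative, so treat the result as an illustration of the arithmetic rather than a measure of model quality.

```csharp
using System;
using System.Linq;

// (Label, Score) pairs copied from the sample output above.
var samples = new (float Label, float Score)[]
{
    (10f, 11.753112f),
    (8f, 8.1793375f),
    (11f, 10.811496f),
    (12f, 12.0632515f),
    (5f, 5.9146023f),
};

// Mean of the absolute differences between prediction and actual value.
double mae = samples.Average(s => Math.Abs((double)s.Score - s.Label));

// Square root of the mean squared difference.
double rmse = Math.Sqrt(samples.Average(s => Math.Pow((double)s.Score - s.Label, 2)));

Console.WriteLine($"MAE  = {mae:F3}");
Console.WriteLine($"RMSE = {rmse:F3}");
```

For this particular sample the MAE is roughly 0.62 and the RMSE roughly 0.89, both lower than the test-set figures, simply because this draw happens to contain mostly close predictions.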
