In [1]:
#r "nuget: Microsoft.ML"
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Net;
using Microsoft.ML;
using Microsoft.ML.Data;
using Microsoft.ML.Transforms;

Download data from [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/machine-learning-databases/adult):

In [2]:
if (!File.Exists("adult.data"))
{
    using var client = new WebClient();
    client.DownloadFile("https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data", "adult.data");
}

Console.WriteLine($"Train data file has {File.ReadLines("adult.data").Count():n0} lines");
File.ReadLines("adult.data").Take(5)

Train data file has 32,562 lines


index,value
0,"39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K"
1,"50, Self-emp-not-inc, 83311, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 13, United-States, <=50K"
2,"38, Private, 215646, HS-grad, 9, Divorced, Handlers-cleaners, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K"
3,"53, Private, 234721, 11th, 7, Married-civ-spouse, Handlers-cleaners, Husband, Black, Male, 0, 0, 40, United-States, <=50K"
4,"28, Private, 338409, Bachelors, 13, Married-civ-spouse, Prof-specialty, Wife, Black, Female, 0, 0, 40, Cuba, <=50K"


Some datasets from the UCI Machine Learning Repository come as two files, one for training data and one for test data. Notice in the output below that `adult.test` starts with a line of descriptive text. We don't want to load this line; you will see how to deal with it further down.

In [3]:
if (!File.Exists("adult.test"))
{
    using var client = new WebClient();
    client.DownloadFile("https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.test", "adult.test");
}

Console.WriteLine($"Test data file has {File.ReadLines("adult.test").Count():n0} lines");
File.ReadLines("adult.test").Take(5)

Test data file has 16,283 lines


index,value
0,|1x3 Cross validator
1,"25, Private, 226802, 11th, 7, Never-married, Machine-op-inspct, Own-child, Black, Male, 0, 0, 40, United-States, <=50K."
2,"38, Private, 89814, HS-grad, 9, Married-civ-spouse, Farming-fishing, Husband, White, Male, 0, 0, 50, United-States, <=50K."
3,"28, Local-gov, 336951, Assoc-acdm, 12, Married-civ-spouse, Protective-serv, Husband, White, Male, 0, 0, 40, United-States, >50K."
4,"44, Private, 160323, Some-college, 10, Married-civ-spouse, Machine-op-inspct, Husband, Black, Male, 7688, 0, 40, United-States, >50K."


Create a new `MLContext`. `MLContext` is the workhorse of machine learning in .NET: data loading, transforms, training, and evaluation are all reached through it.

In [4]:
var context = new MLContext();

Create a class to represent the data in our file:

In [5]:
class AdultData
{
    [LoadColumn(0)]
    public float Age { get; set; }

    [LoadColumn(1)]
    public string WorkClass { get; set; }

    [LoadColumn(2)]
    public float Fnlwgt { get; set; }

    [LoadColumn(3)]
    public string Education { get; set; }

    [LoadColumn(4)]
    public float EducationNum { get; set; }

    [LoadColumn(5)]
    public float MaritalStatus { get; set; }

    [LoadColumn(6)]
    public float Occupation { get; set; }

    [LoadColumn(7)]
    public float Relationship { get; set; }

    [LoadColumn(8)]
    public string Race { get; set; }

    [LoadColumn(9)]
    public string Sex { get; set; }

    [LoadColumn(10)]
    public float CapitalGain { get; set; }

    [LoadColumn(11)]
    public float CapitalLoss { get; set; }

    [LoadColumn(12)]
    public float HoursPerWeek { get; set; }

    [LoadColumn(13)]
    public string NativeCountry { get; set; }

    [LoadColumn(14)]
    [ColumnName("Label")]
    public string Target { get; set; }
}

Now that we have an `MLContext` and a class to represent our data, we can load the file into an `IDataView`. It is good practice to shuffle the data after loading. Many datasets come ordered by the values of some column or, even worse, by the label. For training a model we want our data to be in random order:

In [6]:
var trainData = context.Data.LoadFromTextFile<AdultData>("adult.data", hasHeader: false, separatorChar: ',');
trainData = context.Data.ShuffleRows(trainData);

Recall that our file of test data has "garbage" text on the first line. The `LoadFromTextFile` method does not have a way to skip arbitrary lines in a file, so we use the `hasHeader` parameter to serve that purpose. We are lucky that there is only one line to skip; otherwise we would need to find another way to deal with "garbage" lines.

In [7]:
var testData = context.Data.LoadFromTextFile<AdultData>("adult.test", hasHeader: true, separatorChar: ',');
testData = context.Data.ShuffleRows(testData);
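
If a file had more than one line to skip, one option would be to write a cleaned copy before loading it. The following is a minimal sketch using only `System.IO`; the helper name `WriteCleanedCopy`, the temporary file names, and the rule "drop blank lines and lines starting with `|`" are assumptions made for illustration, not part of ML.NET or of this notebook.

```csharp
using System;
using System.IO;
using System.Linq;

// Hypothetical helper: copy a text file, dropping blank lines and
// descriptive lines that start with '|'. Returns the number of lines kept.
int WriteCleanedCopy(string sourcePath, string cleanedPath)
{
    var kept = File.ReadLines(sourcePath)
        .Where(line => !string.IsNullOrWhiteSpace(line))
        .Where(line => !line.StartsWith("|", StringComparison.Ordinal))
        .ToList();
    File.WriteAllLines(cleanedPath, kept);
    return kept.Count;
}

// Tiny stand-in for adult.test so the sketch runs on its own.
var demoSource = Path.Combine(Path.GetTempPath(), "adult-demo.test");
var demoCleaned = Path.Combine(Path.GetTempPath(), "adult-demo.cleaned");
File.WriteAllLines(demoSource, new[]
{
    "|1x3 Cross validator",
    "25, Private, <=50K.",
    "",
    "38, Private, >50K."
});

var keptCount = WriteCleanedCopy(demoSource, demoCleaned);
Console.WriteLine($"Kept {keptCount} lines");
```

A file cleaned this way could then be loaded with `hasHeader: false`, exactly like the training file.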

Now we declare the features of the dataset that we will train our model on:

In [8]:
var featureColumns = new[]
{
    nameof(AdultData.Age), nameof(AdultData.WorkClass), nameof(AdultData.Fnlwgt), nameof(AdultData.Education),
    nameof(AdultData.EducationNum), nameof(AdultData.MaritalStatus), nameof(AdultData.Occupation), nameof(AdultData.Relationship),
    nameof(AdultData.Race), nameof(AdultData.Sex), nameof(AdultData.CapitalGain), nameof(AdultData.CapitalLoss),
    nameof(AdultData.HoursPerWeek), nameof(AdultData.NativeCountry)
};

There are a number of categorical columns (string values that represent discrete values) in the data. We will need to encode those columns, so we declare which columns are categorical here:

In [9]:
var categoricalColumns = new[]
{
    nameof(AdultData.WorkClass), nameof(AdultData.Education), nameof(AdultData.MaritalStatus), nameof(AdultData.Occupation),
    nameof(AdultData.Relationship), nameof(AdultData.Race), nameof(AdultData.Sex), nameof(AdultData.NativeCountry)
};

The "target" or "label" for this example can take on two values. Thus, we will be creating a binary classification model. You may have noticed above that the labels in the test file differ from the labels in the training file--they have periods only in the test file! No problem, we can create a mapping such that label `<=50K` is treated the same as the label `<=50K.`. Simply create a `Dictionary` that maps the string value to our binary label of `true` or `false`:

In [10]:
var labelLookup = new Dictionary<string, bool>
{
    ["<=50K"] = false,
    ["<=50K."] = false,
    [">50K"] = true,
    [">50K."] = true
};
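
The same equivalence can also be written as a rule rather than a table. The sketch below is plain C# and is not used by the pipeline; the helper name `IsHighIncome` is hypothetical, and it assumes the only difference between the two files' labels is surrounding whitespace and a trailing period.

```csharp
using System;

// Hypothetical helper: normalise a raw label, then map it to the binary target.
bool IsHighIncome(string rawLabel)
{
    // Remove surrounding whitespace and the trailing '.' used in the test file.
    var label = rawLabel.Trim().TrimEnd('.');
    return label switch
    {
        ">50K" => true,
        "<=50K" => false,
        _ => throw new ArgumentException($"Unexpected label: '{rawLabel}'")
    };
}

Console.WriteLine(IsHighIncome(">50K."));   // True
Console.WriteLine(IsHighIncome(" <=50K"));  // False
```

The explicit `Dictionary` used in this notebook has the advantage that it plugs directly into `MapValue`, and any label it does not list is easy to spot.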

Create a pipeline with the following transforms:
- One-hot encode each of the categorical columns.
- Map our label column to `true` or `false`.
- Concatenate all of the feature columns into a single new column.
- Normalize the feature values.

In [11]:
var chain = new EstimatorChain<OneHotEncodingTransformer>();
var pipeline = categoricalColumns
    .Aggregate(chain, (pl, col) => pl.Append(context.Transforms.Categorical.OneHotEncoding(col)))
    .Append(context.Transforms.Conversion.MapValue("Label", labelLookup, "Label"))
    .Append(context.Transforms.Concatenate("Features", featureColumns))
    .Append(context.Transforms.NormalizeBinning("FeaturesNorm", "Features"));

Fit the pipeline to our training data:

In [12]:
var transformer = pipeline.Fit(trainData);

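Let's first view the data as it was loaded from the downloaded file. Note that `MaritalStatus`, `Occupation`, and `Relationship` come back empty: `AdultData` declares them as `float`, but the file holds text in those columns, so the values cannot be parsed as numbers. Declaring them as `string` would preserve the text (and would change the length of the feature vector used below):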

In [13]:
context.Data
    .CreateEnumerable<AdultData>(trainData, reuseRowObject: false)
    .Take(5)

index,Age,WorkClass,Fnlwgt,Education,EducationNum,MaritalStatus,Occupation,Relationship,Race,Sex,CapitalGain,CapitalLoss,HoursPerWeek,NativeCountry,Target
0,55,Private,247552,Some-college,10,,,,White,Male,0,0,56,United-States,<=50K
1,32,Private,132601,Bachelors,13,,,,White,Male,0,0,50,United-States,<=50K
2,20,Private,298227,Some-college,10,,,,White,Male,0,0,35,United-States,<=50K
3,20,Private,81145,Some-college,10,,,,White,Female,0,0,25,United-States,<=50K
4,48,Private,102102,Assoc-voc,11,,,,White,Male,0,0,50,United-States,>50K


Now let's see what the data looks like after it has been transformed by our pipeline:

In [14]:
class AdultDataTransformed
{
    [ColumnName("Label")]
    public bool Target { get; set; }

    [VectorType(83)]
    public float[] Features { get; set; }

    [VectorType(83)]
    public float[] FeaturesNorm { get; set; }
}

var transformedData = transformer.Transform(trainData);
context.Data
    .CreateEnumerable<AdultDataTransformed>(transformedData, reuseRowObject: false)
    .Take(5)

index,Target,Features,FeaturesNorm
0,False,"[ 55, 1, 0, 0, 0, 0, 0, 0, 0, 0 ... (73 more) ]","[ 0.5277778, 1, 0, 0, 0, 0, 0, 0, 0, 0 ... (73 more) ]"
1,False,"[ 32, 1, 0, 0, 0, 0, 0, 0, 0, 0 ... (73 more) ]","[ 0.20833333, 1, 0, 0, 0, 0, 0, 0, 0, 0 ... (73 more) ]"
2,False,"[ 20, 1, 0, 0, 0, 0, 0, 0, 0, 0 ... (73 more) ]","[ 0.041666668, 1, 0, 0, 0, 0, 0, 0, 0, 0 ... (73 more) ]"
3,False,"[ 20, 1, 0, 0, 0, 0, 0, 0, 0, 0 ... (73 more) ]","[ 0.041666668, 1, 0, 0, 0, 0, 0, 0, 0, 0 ... (73 more) ]"
4,True,"[ 48, 1, 0, 0, 0, 0, 0, 0, 0, 0 ... (73 more) ]","[ 0.43055555, 1, 0, 0, 0, 0, 0, 0, 0, 0 ... (73 more) ]"


Woah! Notice how there are 80+ values per row (83, to be exact) after running the data through the pipeline! This is due to one-hot encoding, which creates a new column for each discrete value in our categorical columns. Don't worry though, this is no problem for ML.NET, which can deal with hundreds, even thousands, of features in a dataset.
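
To make the idea concrete, here is a minimal sketch of one-hot encoding in plain C#. It only illustrates the concept; the vocabulary order and in-memory representation that ML.NET uses may differ, and the sample values and the helper name `OneHot` are chosen just for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Illustrative sample of a single categorical column.
var maritalStatus = new[] { "Never-married", "Married-civ-spouse", "Divorced", "Married-civ-spouse" };

// Build the vocabulary in order of first appearance.
var vocabulary = new List<string>();
foreach (var value in maritalStatus)
{
    if (!vocabulary.Contains(value)) vocabulary.Add(value);
}

// One slot per vocabulary entry: 1 at the value's position, 0 elsewhere.
float[] OneHot(string value) =>
    vocabulary.Select(entry => entry == value ? 1f : 0f).ToArray();

foreach (var value in maritalStatus)
{
    Console.WriteLine($"{value} -> [{string.Join(", ", OneHot(value))}]");
}
// Never-married -> [1, 0, 0]
// Married-civ-spouse -> [0, 1, 0]
// Divorced -> [0, 0, 1]
// Married-civ-spouse -> [0, 1, 0]
```

A column with N distinct values therefore contributes N slots to the concatenated feature vector, which is why the vector grows so quickly.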

Now we will create a binary classification estimator. You can try different estimators to see how their accuracy differs.

In [15]:
var estimator = context.BinaryClassification.Trainers.SdcaLogisticRegression(featureColumnName: "FeaturesNorm");

Use cross-validation to select the best performing model:

In [16]:
var transformedTrainData = transformer.Transform(trainData);
var cvResults = context.BinaryClassification.CrossValidate(transformedTrainData, estimator, numberOfFolds: 3);
var cvResult = cvResults
    .OrderByDescending(x => x.Metrics.Accuracy)
    .First();
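
For intuition, the sketch below shows what splitting into three folds means: every row is held out for validation exactly once and used for training in the other two rounds. The round-robin assignment and the ten-row toy dataset are assumptions for illustration; how `CrossValidate` actually partitions the rows is not shown here and may differ.

```csharp
using System;
using System.Linq;

var rowCount = 10;       // toy dataset size
var numberOfFolds = 3;   // same fold count as the CrossValidate call

// Assign each row index to a fold in round-robin fashion (illustrative choice).
var foldOfRow = Enumerable.Range(0, rowCount).Select(i => i % numberOfFolds).ToArray();

for (var fold = 0; fold < numberOfFolds; fold++)
{
    // Rows in the current fold are held out; all other rows are used for training.
    var validation = Enumerable.Range(0, rowCount).Where(i => foldOfRow[i] == fold).ToArray();
    var training = Enumerable.Range(0, rowCount).Where(i => foldOfRow[i] != fold).ToArray();
    Console.WriteLine($"Fold {fold}: train on [{string.Join(", ", training)}], validate on [{string.Join(", ", validation)}]");
}
```

Each fold yields its own model and its own metrics, which is why `cvResults` holds three entries.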

Here are the metrics for our model. Since `cvResults` contains the model from each cross-validation fold, we will average across them all to get a measure of performance:

In [17]:
new Dictionary<string, double>
{
    ["Accuracy"] = cvResults.Average(x => x.Metrics.Accuracy),
    ["Area Under Roc Curve"] = cvResults.Average(x => x.Metrics.AreaUnderRocCurve),
    ["F1 Score"] = cvResults.Average(x => x.Metrics.F1Score),
}

key,value
Accuracy,0.8227696846019305
Area Under Roc Curve,0.8474232098160762
F1 Score,0.5517198394398587


OK, now we can use our best model on the test data.

In [18]:
var transformedTestData = transformer.Transform(testData);
var predictions = cvResult.Model.Transform(transformedTestData);
context.BinaryClassification.Evaluate(predictions)

LogLoss,LogLossReduction,Entropy,AreaUnderRocCurve,Accuracy,PositivePrecision,PositiveRecall,NegativePrecision,NegativeRecall,F1Score,AreaUnderPrecisionRecallCurve,ConfusionMatrix
0.5649796455271927,0.283664533627649,0.7887081849909641,0.8443358088163494,0.8221853694490511,0.6868369351669941,0.4544981799271971,0.8472626674432149,0.9359067149175714,0.5470192458144265,0.6602950675767943,"{ Microsoft.ML.Data.ConfusionMatrix: PerClassPrecision: [ 0.6868369351669941, 0.8472626674432149 ], PerClassRecall: [ 0.4544981799271971, 0.9359067149175714 ], Counts: [ [ 1748, 2098 ], [ 797, 11638 ] ], NumberOfClasses: 2 }"
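
The headline metrics can be recomputed by hand from the `Counts` in the confusion matrix. The sketch below reads the matrix as rows = actual class and columns = predicted class, with the positive class (`>50K`) first; that reading is an assumption, but it reproduces the `PositivePrecision` and `PositiveRecall` values reported above.

```csharp
using System;
using System.Globalization;

// Counts copied from the ConfusionMatrix output above.
double truePositives = 1748, falseNegatives = 2098;   // actual >50K
double falsePositives = 797, trueNegatives = 11638;   // actual <=50K

var total = truePositives + falseNegatives + falsePositives + trueNegatives;
var accuracy = (truePositives + trueNegatives) / total;
var precision = truePositives / (truePositives + falsePositives);
var recall = truePositives / (truePositives + falseNegatives);
var f1 = 2 * precision * recall / (precision + recall);

// Format with the invariant culture so the decimal separator is always '.'.
string Show(double value) => value.ToString("F4", CultureInfo.InvariantCulture);

Console.WriteLine($"Accuracy:  {Show(accuracy)}");   // 0.8222
Console.WriteLine($"Precision: {Show(precision)}");  // 0.6868
Console.WriteLine($"Recall:    {Show(recall)}");     // 0.4545
Console.WriteLine($"F1 score:  {Show(f1)}");         // 0.5470
```

The low recall shows that the model misses more than half of the `>50K` examples, which the overall accuracy of about 82% hides.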


Now let's pretend we have new data (for convenience we are just randomly re-sampling the test data) to see what predictions our model makes. You will see three values:
- `Label`: the actual label (`false` for <=50K, `true` for >50K) of the example being predicted. Our model never sees this value, but we show it below so that you can see how the prediction compares to the actual label.
- `Probability`: the probability that the model assigns to the `true` label (>50K). Values near 0 indicate a confident `false` prediction, while values near 1 indicate a confident `true` prediction.
- `PredictedLabel`: this is the actual prediction made by the model.

You can run this cell multiple times to get new random samples and their predictions!

In [23]:
class BinaryClassificationPrediction
{
    public bool Label { get; set; }

    public float Probability { get; set; }

    public bool PredictedLabel { get; set; }
}

var sampleData = context.Data.ShuffleRows(testData);
var transformedSampleData = transformer.Transform(sampleData);

var predictionEngine = context.Model.CreatePredictionEngine<AdultDataTransformed, BinaryClassificationPrediction>(cvResult.Model);

context.Data.CreateEnumerable<AdultDataTransformed>(transformedSampleData, reuseRowObject: false)
    .Take(5)
    .Select(predictionEngine.Predict)

index,Label,Probability,PredictedLabel
0,False,0.4270283,False
1,True,0.58039045,True
2,False,0.27477393,False
3,True,0.38614753,False
4,False,0.20412506,False
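
The relationship between `Probability` and `PredictedLabel` in this sample can be sketched as a simple threshold. The 0.5 cut-off is an assumption for illustration; it is consistent with the five rows above but is not read from the model itself.

```csharp
using System;
using System.Linq;

// Probabilities copied from the sample output above.
var probabilities = new[] { 0.4270283f, 0.58039045f, 0.27477393f, 0.38614753f, 0.20412506f };

// Assumed decision rule: predict true (>50K) when the probability exceeds 0.5.
var predictedLabels = probabilities.Select(p => p > 0.5f).ToArray();

Console.WriteLine(string.Join(", ", predictedLabels));
// False, True, False, False, False
```

Row 3 shows the cost of this rule: its actual label is `True`, but a probability of about 0.39 falls below the cut-off, so it is predicted as `False`.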
