In [1]:
#r "nuget: Microsoft.ML"
#load "./Modules/MLWrapper.fs"
open Microsoft.ML
open Microsoft.ML.Data
open Microsoft.ML.Transforms
open System.Collections.Generic
open System.IO
open System.Net
open FunctionalMl

Download data from [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/machine-learning-databases/adult):

In [2]:
if not <| File.Exists("adult.data") then
    use client = new WebClient()
    client.DownloadFile("http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data", "adult.data")

printfn "Train data file has %d lines" <| File.ReadLines("adult.data").Count()
File.ReadLines("adult.data")
|> Seq.take 5

Train data file has 32562 lines


index,value
0,"39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K"
1,"50, Self-emp-not-inc, 83311, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 13, United-States, <=50K"
2,"38, Private, 215646, HS-grad, 9, Divorced, Handlers-cleaners, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K"
3,"53, Private, 234721, 11th, 7, Married-civ-spouse, Handlers-cleaners, Husband, Black, Male, 0, 0, 40, United-States, <=50K"
4,"28, Private, 338409, Bachelors, 13, Married-civ-spouse, Prof-specialty, Wife, Black, Female, 0, 0, 40, Cuba, <=50K"


Some datasets from the UCI Machine Learning Repository have two files, one for training data and one for test data. Notice how the .test dataset has a line of descriptive text at the start of the file. We don't want to load this line--you will see how to deal with that below.

In [3]:
if not <| File.Exists("adult.test") then
    use client = new WebClient()
    client.DownloadFile("http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.test", "adult.test")

printfn "Train data file has %d lines" <| File.ReadLines("adult.test").Count()
File.ReadLines("adult.test")
|> Seq.take 5

Train data file has 16283 lines


index,value
0,|1x3 Cross validator
1,"25, Private, 226802, 11th, 7, Never-married, Machine-op-inspct, Own-child, Black, Male, 0, 0, 40, United-States, <=50K."
2,"38, Private, 89814, HS-grad, 9, Married-civ-spouse, Farming-fishing, Husband, White, Male, 0, 0, 50, United-States, <=50K."
3,"28, Local-gov, 336951, Assoc-acdm, 12, Married-civ-spouse, Protective-serv, Husband, White, Male, 0, 0, 40, United-States, >50K."
4,"44, Private, 160323, Some-college, 10, Married-civ-spouse, Machine-op-inspct, Husband, Black, Male, 7688, 0, 40, United-States, >50K."


Create a class to represent the data in our file:

In [4]:
[<CLIMutable>]
type AdultData =
    {
        [<LoadColumn(0)>]
        Age : float32

        [<LoadColumn(1)>]
        WorkClass : string

        [<LoadColumn(2)>]
        Fnlwgt : float32

        [<LoadColumn(3)>]
        Education : string

        [<LoadColumn(4)>]
        EducationNum : float32

        [<LoadColumn(5)>]
        MaritalStatus : float32

        [<LoadColumn(6)>]
        Occupation : float32

        [<LoadColumn(7)>]
        Relationship : float32

        [<LoadColumn(8)>]
        Race : string

        [<LoadColumn(9)>]
        Sex : string

        [<LoadColumn(10)>]
        CapitalGain : float32

        [<LoadColumn(11)>]
        CapitalLoss : float32

        [<LoadColumn(12)>]
        HoursPerWeek : float32

        [<LoadColumn(13)>]
        NativeCountry : string

        [<LoadColumn(14)>]
        [<ColumnName("Label")>]
        Target : string
    }

Now that we have an `MLContext` and a class to represent our data, we can load the file into a `DataView`. It is a good practice to shuffle the data after loading. Many datasets come ordered by some columns of values or even worse by the label. For training a model we want our data to be in a random order:

In [5]:
let trainData =
    ML.context.Data.LoadFromTextFile<AdultData>("adult.data", hasHeader = false, separatorChar = ',')
    |> ML.shuffle

Remember that we noted earlier that our file of test data has "garbage" text on the first line? The `LoadFromTextFile` method does not have a way to skip lines in a file so we will use the `hasHeader` parameter to serve that purpose. We are lucky here that there is only one line that we need to skip, otherwise, we would need to find another way to deal with "garbage" lines.

In [6]:
let testData =
    ML.context.Data.LoadFromTextFile<AdultData>("adult.test", hasHeader = true, separatorChar = ',')
    |> ML.shuffle

Now we declare the features of the dataset that we will train our model on:

In [7]:
let featureColumns = [| "Age"; "WorkClass"; "Fnlwgt"; "Education"; "EducationNum"; "MaritalStatus"; "Occupation"; "Relationship"; "Race"; "Sex"; "CapitalGain";
                        "CapitalLoss"; "HoursPerWeek"; "NativeCountry" |]

There are a number of categorical columns (string values that reprsent discrete values) in the data. We will need to encode those columns so we declare which columns are categorical here:

In [8]:
let categoricalColumns = [| "WorkClass"; "Education"; "MaritalStatus"; "Occupation"; "Relationship"; "Race"; "Sex"; "NativeCountry" |]

The "target" or "label" for this example can take on two values. Thus, we will be creating a binary classification model. You may have noticed above that the labels in the test file differ from the labels in the training file--they have periods only in the test file! No problem, we can create a mapping such that label `<=50K` is treated the same as the label `<=50K.`. Simply create a `Dictionary` that maps the string value to our binary label of `true` or `false`:

In [9]:
let labelLookup =
    [|
        KeyValuePair("<=50K", false)
        KeyValuePair("<=50K.", false)
        KeyValuePair(">50K", true)
        KeyValuePair(">50K.", true)
    |]

Create a pipeline with the following transforms:
- One-hot encode each of the categorical columns.
- Map our label column to `true` or `false`.
- Concatenate all of the feature columns into a single new column.
- Normalize the feature values.

In [10]:
let pipeline =
    categoricalColumns
    |> Seq.map ML.onehot // Create a one-hot encoder for each categorical column
    |> Seq.fold ML.append (EstimatorChain()) // Add the encoders to a new EstimatorChain
    |> ML.append <| ML.mapValue "Label" labelLookup "Label" // Map labels to either true or false
    |> ML.append <| ML.concatenate "Features" featureColumns // Concatenate feature columns into a single new column
    |> ML.append <| ML.normalizeMinMax "Features" "FeaturesNorm" // Normalize features into a new column, FeaturesNorm

Fit the pipeline to our training data:

In [11]:
let transformer =
    pipeline
    |> ML.fit trainData // Fit our pipeline on the training data

Let's first view the data as it was loaded from the downloaded file:

In [12]:
ML.context.Data.CreateEnumerable<AdultData>(trainData, reuseRowObject = false)
|> Seq.take 3

index,Age,WorkClass,Fnlwgt,Education,EducationNum,MaritalStatus,Occupation,Relationship,Race,Sex,CapitalGain,CapitalLoss,HoursPerWeek,NativeCountry,Target
0,24,Private,278130,HS-grad,9,,,,White,Male,0,0,40,United-States,<=50K
1,33,Self-emp-inc,155781,Some-college,10,,,,White,Male,0,0,60,?,<=50K
2,23,Private,204653,HS-grad,9,,,,White,Male,0,0,72,Dominican-Republic,<=50K


Now let's see what the data looks like after it has been transformed by our pipeline:

In [13]:
[<CLIMutable>]
type AdultDataTransformed =
    {
        [<ColumnName("Label")>]
        Target : bool

        [<VectorType(83)>]
        Features : single[]

        [<VectorType(83)>]
        FeaturesNorm : single[]
    }

let transformedData =
    trainData
    |> ML.transform transformer

ML.context.Data.CreateEnumerable<AdultDataTransformed>(transformedData, reuseRowObject = false)
|> Seq.take 3

index,Target,Features,FeaturesNorm
0,False,"[ 24, 1, 0, 0, 0, 0, 0, 0, 0, 0 ... (73 more) ]","[ 0.26666668, 1, 0, 0, 0, 0, 0, 0, 0, 0 ... (73 more) ]"
1,False,"[ 33, 0, 1, 0, 0, 0, 0, 0, 0, 0 ... (73 more) ]","[ 0.36666667, 0, 1, 0, 0, 0, 0, 0, 0, 0 ... (73 more) ]"
2,False,"[ 23, 1, 0, 0, 0, 0, 0, 0, 0, 0 ... (73 more) ]","[ 0.25555557, 1, 0, 0, 0, 0, 0, 0, 0, 0 ... (73 more) ]"


Woah! Notice how there are 80+ columns after running the data through the pipeline! This is due to one-hot encoding which creates a new column for each discrete value in our categorical columns. Don't worry though, this is no problem for ML.NET which can deal with hundreds, even thousands of features in a dataset.

Now we will create a binary classification estimator. You can try different estimators to see how their accuracy differs.

In [14]:
let estimator =
    ML.context.BinaryClassification.Trainers.SdcaLogisticRegression(featureColumnName = "FeaturesNorm")
    |> ML.downcastEstimator

Use cross-validation to select the best performing model. Along the way we will print the metrics for our model.

In [15]:
let model =
    trainData // Begin with the training data
    |> ML.transform transformer // Transform using the transformer built above
    |> ML.crossValidateBinaryClassification estimator 3 // 3-fold cross-validation
    |> ML.printBinaryClassificationCvMetrics // Print cross-fold metrics
    |> Seq.maxBy (fun cvResult -> cvResult.Metrics.Accuracy) // Select the best model by Accuracy
    |> fun cvResult -> cvResult.Model

------------------
Cross Validation Metrics
------------------
Accuracy: 0.823474
Area Under Roc Curve: 0.846644
F1 Score: 0.561248


OK, now we can use our best model on the test data.

In [16]:
model
|> ML.transform <| ML.transform transformer testData // Transform the test data and get predictions
|> ML.context.BinaryClassification.Evaluate // Get test metrics

LogLoss,LogLossReduction,Entropy,AreaUnderRocCurve,Accuracy,PositivePrecision,PositiveRecall,NegativePrecision,NegativeRecall,F1Score,AreaUnderPrecisionRecallCurve,ConfusionMatrix
0.5666477233935161,0.2815495842736715,0.7887081849909641,0.8447904767819181,0.8220011055831952,0.6692857142857143,0.4872594903796152,0.8537200504413619,0.9255327704061118,0.563948239542582,0.6634202107906937,"{ Microsoft.ML.Data.ConfusionMatrix: PerClassPrecision: [ 0.6692857142857143, 0.8537200504413619 ], PerClassRecall: [ 0.4872594903796152, 0.9255327704061118 ], Counts: [ [ 1874, 1972 ], [ 926, 11509 ] ], NumberOfClasses: 2 }"


Now let's pretend we have new data (for convenience we are just randomly re-sampling the test data) to see what predictions our model makes. You will see three properties for each prediction:
- `Label`: the actual label (<=50K, >50K) from the example being predicted. Our model never sees this value but we show it below so that you can see how close the predicted number of rings are to the actual.
- `Probability`: the probability that the model assigns to the predicted label. 0 would be the most probable `false` label while 1 would be the most probable `true` label.
- `PredictedLabel`: this is the actual prediction made by the model.

You can run this cell multiple times to get new random samples and their predictions!

In [17]:
[<CLIMutable>]
type BinaryClassificationPrediction = { Label : bool; Probability : single; PredictedLabel : bool }

let sampleData =
    testData
    |> ML.shuffle
    |> ML.transform transformer

let predictionEngine = ML.context.Model.CreatePredictionEngine<AdultDataTransformed, BinaryClassificationPrediction>(model)

ML.context.Data.CreateEnumerable<AdultDataTransformed>(sampleData, reuseRowObject = false)
|> Seq.take 5
|> Seq.map predictionEngine.Predict

index,Label,Probability,PredictedLabel
0,True,0.0677757,False
1,False,0.04996717,False
2,False,0.06952896,False
3,False,0.02013277,False
4,False,0.2534104,False
