# Machine Learning to predict Housing Prices with ML.NET

## Dataset

This notebook uses the [California Housing](https://github.com/ageron/handson-ml2/tree/master/datasets/housing) dataset provided by [Aurélien Géron](https://github.com/ageron), the author of the book **Hands-On Machine Learning with Scikit-Learn and TensorFlow**.

This dataset is a modified version of the California Housing dataset available from [Luís Torgo's page](http://www.dcc.fc.up.pt/~ltorgo/Regression/cal_housing.html) (University of Porto). Luís Torgo obtained it from the StatLib repository (which is closed now). The dataset may also be downloaded from StatLib mirrors.

This dataset appeared in a 1997 paper titled *Sparse Spatial Autoregressions* by R. Kelley Pace and Ronald Barry, published in the journal *Statistics & Probability Letters*. They built it using the 1990 California census data. It contains one row per census block group. A block group is the smallest geographical unit for which the U.S. Census Bureau publishes sample data (a block group typically has a population of 600 to 3,000 people).

## Technologies used

- [ML.NET](https://dotnet.microsoft.com/apps/machinelearning-ai/ml-dotnet) Machine Learning for .NET
- [.NET Interactive Notebooks](https://github.com/dotnet/interactive) Jupyter Notebooks for .NET Languages
- [XPlot](https://fslab.org/XPlot//index.html) F# Data Visualization Package

## Setup

Import the required NuGet packages, set up `using` statements, and create formatters for the data.

In [None]:
#i "nuget:https://pkgs.dev.azure.com/dnceng/public/_packaging/dotnet5/nuget/v3/index.json" 
#i "nuget:https://pkgs.dev.azure.com/dnceng/public/_packaging/dotnet-tools/nuget/v3/index.json" 

#r "nuget:Microsoft.ML,1.5.5"
#r "nuget:Microsoft.ML.AutoML,0.17.5"
#r "nuget:Microsoft.Data.Analysis,0.4.0"
#r "nuget:XPlot.Plotly.Interactive,4.0.1"

Installed package Microsoft.ML version 1.5.5

Installed package XPlot.Plotly.Interactive version 4.0.1

Installed package Microsoft.Data.Analysis version 0.4.0

Installed package Microsoft.ML.AutoML version 0.17.5

Loading extensions from `XPlot.Plotly.Interactive.dll`

Configuring PowerShell Kernel for XPlot.Plotly integration.

Installed support for XPlot.Plotly.

In [None]:
using static Microsoft.DotNet.Interactive.Formatting.PocketViewTags;
using Microsoft.DotNet.Interactive.Formatting;
using Microsoft.Data.Analysis;
using XPlot.Plotly;

In [None]:

using Microsoft.AspNetCore.Html;

In [None]:
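// Render a DataFrame as an HTML table, showing at most the first 20 rows.
Formatter.Register<DataFrame>((df, writer) =>
{
    // Header row: an "index" column followed by the DataFrame's column names.
    var headers = new List<IHtmlContent>();
    headers.Add(th(i("index")));
    headers.AddRange(df.Columns.Select(c => (IHtmlContent) th(c.Name)));
    var rows = new List<List<IHtmlContent>>();
    var take = 20;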
    for (var i = 0; i < Math.Min(take, df.Rows.Count); i++)
    {
        var cells = new List<IHtmlContent>();
        cells.Add(td(i));
        foreach (var obj in df.Rows[i])
        {
            cells.Add(td(obj));
        }
        rows.Add(cells);
    }

    var t = table(
        thead(
            headers),
        tbody(
            rows.Select(
                r => tr(r))));

    writer.Write(t);
}, "text/html");

## Import Data

Load `housing.csv` from the local filesystem if it exists; otherwise download it first.

In [None]:
using System.IO;
using System.Net.Http;
string housingPath = "housing.csv";

if (!File.Exists(housingPath))
{
    var contents = await new HttpClient()
        .GetStringAsync("https://raw.githubusercontent.com/ageron/handson-ml2/master/datasets/housing/housing.csv");

    File.WriteAllText(housingPath, contents);
}

var housingData = DataFrame.LoadCsv(housingPath);
housingData

| index | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | ocean_proximity |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | -122.23 | 37.88 | 41 | 880 | 129 | 322 | 126 | 8.3252 | 452600 | NEAR BAY |
| 1 | -122.22 | 37.86 | 21 | 7099 | 1106 | 2401 | 1138 | 8.3014 | 358500 | NEAR BAY |
| 2 | -122.24 | 37.85 | 52 | 1467 | 190 | 496 | 177 | 7.2574 | 352100 | NEAR BAY |
| 3 | -122.25 | 37.85 | 52 | 1274 | 235 | 558 | 219 | 5.6431 | 341300 | NEAR BAY |
| 4 | -122.25 | 37.85 | 52 | 1627 | 280 | 565 | 259 | 3.8462 | 342200 | NEAR BAY |
| 5 | -122.25 | 37.85 | 52 | 919 | 213 | 413 | 193 | 4.0368 | 269700 | NEAR BAY |
| 6 | -122.25 | 37.84 | 52 | 2535 | 489 | 1094 | 514 | 3.6591 | 299200 | NEAR BAY |
| 7 | -122.25 | 37.84 | 52 | 3104 | 687 | 1157 | 647 | 3.12 | 241400 | NEAR BAY |
| 8 | -122.26 | 37.84 | 42 | 2555 | 665 | 1206 | 595 | 2.0804 | 226700 | NEAR BAY |
| 9 | -122.25 | 37.84 | 52 | 3549 | 707 | 1551 | 714 | 3.6912 | 261100 | NEAR BAY |


## Explore the data

Plot the data in two ways, a histogram of median house values and a geographic scatter colored by value, so that we can decide how to use it.

In [None]:
var chart = Chart.Plot(
  new Histogram()
  {
      x = housingData.Columns["median_house_value"],
      nbinsx = 20
  }
);
chart.WithXTitle("Median House Value");
chart.WithYTitle("Count");
display(chart);

In [None]:
var chart = Chart.Plot(
  new Scattergl()
  {
      x = housingData.Columns["longitude"],
      y = housingData.Columns["latitude"],
      mode = "markers",
      marker = new Marker()
      {
          color = housingData.Columns["median_house_value"],
          colorscale = "Jet"
      }
  }
);

chart.Width = 600;
chart.Height = 600;
chart.WithXTitle("Longitude");
chart.WithYTitle("Latitude");
display(chart);

## Split into Training and Test Data

In [None]:
static T[] Shuffle<T>(T[] array)
{
    Random rand = new Random();
    for (int i = 0; i < array.Length; i++)
    {
        int r = i + rand.Next(array.Length - i);
        T temp = array[r];
        array[r] = array[i];
        array[i] = temp;
    }
    return array;
}

int[] randomIndices = Shuffle(Enumerable.Range(0, (int)housingData.Rows.Count).ToArray());
int testSize = (int)(housingData.Rows.Count * .1);
int[] trainRows = randomIndices[testSize..];
int[] testRows = randomIndices[..testSize];

DataFrame housing_train = housingData[trainRows];
DataFrame housing_test = housingData[testRows];

display(housing_train.Rows.Count);
display(housing_test.Rows.Count);
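
Because `Shuffle` above creates an unseeded `Random`, every run produces a different train/test split, so error figures from separate runs are not directly comparable. The following is a minimal sketch of a reproducible variant, assuming a fixed seed is acceptable for this experiment. The helper `SeededShuffle`, the seed `42`, the row count `100`, and the `demo*` variable names are illustrative and not part of the original notebook.

```csharp
using System;
using System.Linq;

// Hypothetical helper: Fisher-Yates shuffle driven by a caller-supplied seed,
// so the same seed always yields the same permutation.
static T[] SeededShuffle<T>(T[] array, int seed)
{
    var rand = new Random(seed);
    for (int i = 0; i < array.Length - 1; i++)
    {
        // Pick a position from the not-yet-fixed tail [i, Length) and swap it into place.
        int r = i + rand.Next(array.Length - i);
        (array[i], array[r]) = (array[r], array[i]);
    }
    return array;
}

int demoRowCount = 100;                          // stand-in for housingData.Rows.Count
int[] demoIndices = SeededShuffle(Enumerable.Range(0, demoRowCount).ToArray(), seed: 42);
int demoTestSize = (int)(demoRowCount * .1);     // hold out 10%, as in the cell above
int[] demoTestRows = demoIndices[..demoTestSize];
int[] demoTrainRows = demoIndices[demoTestSize..];

Console.WriteLine($"train: {demoTrainRows.Length}, test: {demoTestRows.Length}");
```

With the real data, the same idea would apply by passing `(int)housingData.Rows.Count` instead of the stand-in row count and indexing `housingData` with the resulting arrays.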

## Find the best ML Algorithm

Use Microsoft's [AutoML](https://docs.microsoft.com/en-us/azure/machine-learning/concept-automated-ml) to test and tune various machine learning algorithms and find the best one for our data.

In [None]:
using Microsoft.ML;
using Microsoft.ML.Data;
using Microsoft.ML.AutoML;

In [None]:
#!time

const int TRAINING_SECONDS = 15;    // Time budget for the AutoML search, in seconds

var mlContext = new MLContext();

var experiment = mlContext.Auto().CreateRegressionExperiment(maxExperimentTimeInSeconds: TRAINING_SECONDS);
var result = experiment.Execute(housing_train, labelColumnName:"median_house_value");

In [None]:
var scatters = result.RunDetails.Where(d => d.ValidationMetrics != null).GroupBy(
    r => r.TrainerName,
    (name, details) => new Scattergl()
    {
        name = name,
        x = details.Select(r => r.RuntimeInSeconds),
        y = details.Select(r => r.ValidationMetrics.MeanAbsoluteError),
        mode = "markers",
        marker = new Marker() { size = 12 }
    });

var chart = Chart.Plot(scatters);
chart.WithXTitle("Training Time (seconds)");
chart.WithYTitle("Mean Absolute Error");
display(chart);

Console.WriteLine($"Best Trainer:{result.BestRun.TrainerName}");

Best Trainer:LightGbmRegression


## Test the Results

In [None]:
var testResults = result.BestRun.Model.Transform(housing_test);

var trueValues = testResults.GetColumn<float>("median_house_value");
var predictedValues = testResults.GetColumn<float>("Score");

var predictedVsTrue = new Scattergl()
{
    x = trueValues,
    y = predictedValues,
    mode = "markers",
};

var maximumValue = Math.Max(trueValues.Max(), predictedValues.Max());

var perfectLine = new Scattergl()
{
    x = new[] {0, maximumValue},
    y = new[] {0, maximumValue},
    mode = "lines",
};

var chart = Chart.Plot(new[] {predictedVsTrue, perfectLine });
chart.WithXTitle("True Values");
chart.WithYTitle("Predicted Values");
chart.WithLegend(false);
chart.Width = 600;
chart.Height = 600;
display(chart);
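
The scatter plot gives a visual check; a couple of summary numbers make the result easier to compare between runs. Below is a minimal sketch that computes the mean absolute error and root mean squared error from two sequences of floats using plain LINQ. The helper names `MeanAbsoluteError` and `RootMeanSquaredError` and the sample arrays are illustrative assumptions; in this notebook the inputs would presumably be `trueValues` and `predictedValues` from the cell above.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helpers (not ML.NET APIs): error metrics over paired values.
// Zip stops at the shorter sequence, so both inputs should have equal length.
static double MeanAbsoluteError(IEnumerable<float> actual, IEnumerable<float> predicted) =>
    actual.Zip(predicted, (a, p) => Math.Abs((double)a - p)).Average();

static double RootMeanSquaredError(IEnumerable<float> actual, IEnumerable<float> predicted) =>
    Math.Sqrt(actual.Zip(predicted, (a, p) => Math.Pow((double)a - p, 2)).Average());

// Made-up sample values, used only to exercise the helpers.
float[] sampleTrue = { 100f, 200f, 300f };
float[] samplePredicted = { 110f, 190f, 330f };

Console.WriteLine($"MAE:  {MeanAbsoluteError(sampleTrue, samplePredicted)}");
Console.WriteLine($"RMSE: {RootMeanSquaredError(sampleTrue, samplePredicted)}");
```

ML.NET's `mlContext.Regression.Evaluate` can also compute these metrics directly from the transformed test data when given the label column name, which may be preferable once the hand-rolled numbers have served as a sanity check.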