This notebook loads a dataset, optionally normalises it, splits it into training and test sets, and then fits two types of model: 
a neural network (Multi-Layer Perceptron) and 
a Decision Tree, 

for either a classification or a regression task. 

Each model is evaluated on the test set, and the results are printed.

# Import Libraries

In [1]:
import numpy as np 

import matplotlib.pyplot as plt


from numpy import genfromtxt  # explicit import instead of a wildcard import

from sklearn import datasets 
from sklearn.metrics import mean_squared_error 
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from sklearn.preprocessing import Normalizer

from sklearn.model_selection import train_test_split 
from sklearn import metrics
from sklearn.metrics import roc_auc_score
from sklearn.neural_network import MLPClassifier

import random

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.tree import export_text


from sklearn.neural_network import MLPRegressor

from sklearn.tree import DecisionTreeRegressor 


from sklearn.metrics import roc_curve, auc

fpath = "c:\\dropbox\\variance\\unsw\\zzsc5836\\raw_data\\"

# Functions

In [2]:
def read_data(run_num, prob):

    normalise = False
    
    if prob == 'classifification': #Source:  Pima-Indian diabetes dataset: https://www.kaggle.com/kumargh/pimaindiansdiabetescsv
        data_in = genfromtxt(fpath+"pima.csv", delimiter=",")
        data_inputx = data_in[:,0:8] # all features 0, 1, 2, 3, 4, 5, 6, 7 
        data_inputy = data_in[:,-1] # this is target - so that last col is selected from data

    elif prob == 'regression': # energy - regression prob
        data_in = genfromtxt(fpath+'ENB2012_data.csv', delimiter=",")  
        data_inputx = data_in[:,0:8] # all features 0, - 7
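        data_inputy = data_in[:,8] # this is target - just the heating load selected from data

    else: # fail early instead of hitting an undefined-variable error further down
        raise ValueError("unknown problem type: " + repr(prob))
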
    if normalise:
        # Normalizer is stateless, so fit() learns nothing from the data;
        # transform() rescales each sample (row) to unit L2 norm.
        transformer = Normalizer().fit(data_inputx)
        data_inputx = transformer.transform(data_inputx)
 

 
    x_train, x_test, y_train, y_test = train_test_split(data_inputx, data_inputy, test_size=0.40, random_state=run_num)

    return x_train, x_test, y_train, y_test

 
    
def scipy_models(x_train, x_test, y_train, y_test, type_model, hidden, learn_rate, run_num, problem):

    print(run_num, ' is our exp run')

    tree_depth = 2
 
    if problem == 'classifification':
        if type_model ==0: #SGD 
            model = MLPClassifier(hidden_layer_sizes=(hidden,), random_state=run_num, max_iter=100,solver='sgd',  learning_rate_init=learn_rate ) 
        elif type_model ==1: #https://scikit-learn.org/stable/modules/tree.html  (see how tree can be visualised)
            model = DecisionTreeClassifier(random_state=0, max_depth=tree_depth)

    elif problem == 'regression':
        if type_model ==0: # MLP; trained with Adam here (the SGD variant is commented out below)
            #model = MLPRegressor(hidden_layer_sizes=(hidden,), random_state=run_num, max_iter=100,solver='sgd',  learning_rate_init=learn_rate ) 

            model = MLPRegressor(hidden_layer_sizes=(hidden*3,), random_state=run_num, max_iter=500, solver='adam',learning_rate_init=learn_rate) 
        elif type_model ==1: #https://scikit-learn.org/stable/modules/tree.html  (see how tree can be visualised)
            model = DecisionTreeRegressor(random_state=0, max_depth=tree_depth)
   
    # Train the model using the training sets
    model.fit(x_train, y_train)

    if type_model ==1:
        r = export_text(model)
        print(r)

    # Make predictions using the testing set
    y_pred_test = model.predict(x_test)
    y_pred_train = model.predict(x_train) 

    if problem == 'regression':
        perf_test =  np.sqrt(mean_squared_error(y_test, y_pred_test)) 
        perf_train=  np.sqrt(mean_squared_error(y_train, y_pred_train)) 

    if problem == 'classifification': 
        # scikit-learn metrics expect the arguments in the order (y_true, y_pred)
        perf_test = accuracy_score(y_test, y_pred_test) 
        perf_train = accuracy_score(y_train, y_pred_train) 
        cm = confusion_matrix(y_test, y_pred_test) # rows: true class, columns: predicted class
        #print(cm, 'is confusion matrix')
        #auc = roc_auc_score(y_test, y_pred_test, average=None) 

    return perf_test #,perf_train
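
A note on argument order in the metric calls: scikit-learn documents these functions as taking `(y_true, y_pred)`. Accuracy is symmetric, so the order does not affect it, but a confusion matrix built with the arguments reversed comes out transposed. The sketch below illustrates this on made-up labels (the values are illustrative and are not taken from the Pima data).

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Toy labels (illustrative values only).
y_true = np.array([0, 0, 0, 0, 1, 1])
y_pred = np.array([0, 0, 1, 1, 1, 1])

# Documented convention: rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred)

# Reversing the arguments yields the transpose of the matrix above.
cm_swapped = confusion_matrix(y_pred, y_true)

# Accuracy only counts matching positions, so the order is irrelevant here.
acc = accuracy_score(y_true, y_pred)
acc_swapped = accuracy_score(y_pred, y_true)

print(cm)
print(cm_swapped)
print(acc, acc_swapped)
```

With the `(y_true, y_pred)` order, the off-diagonal entry in the first row counts false positives and the one in the second row counts false negatives.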




In [3]:
max_expruns = 5

SGD_all = np.zeros(max_expruns) 
Adam_all = np.zeros(max_expruns) 
tree_all = np.zeros(max_expruns)  

learn_rate = 0.01
hidden = 8

prob = 'classifification' # must match the (misspelled) string tested in read_data and scipy_models
#prob = 'regression' # uncomment to run the regression problem instead


# classification accuracy is reported for classification and RMSE for regression

print(prob, ' is our problem')

for run_num in range(0,max_expruns): 

    x_train, x_test, y_train, y_test = read_data(run_num, prob)   
    
    acc_sgd = scipy_models(x_train, x_test, y_train, y_test, 0, hidden, learn_rate, run_num, prob) # MLP (SGD for classification, Adam for regression)
    acc_tree = scipy_models(x_train, x_test, y_train, y_test, 1, hidden, learn_rate,  run_num, prob) # Decision Tree
    
    SGD_all[run_num] = acc_sgd 
    tree_all[run_num] = acc_tree


classifification  is our problem
0  is our exp run
0  is our exp run
|--- feature_1 <= 154.50
|   |--- feature_7 <= 28.50
|   |   |--- class: 0.0
|   |--- feature_7 >  28.50
|   |   |--- class: 0.0
|--- feature_1 >  154.50
|   |--- feature_5 <= 29.95
|   |   |--- class: 0.0
|   |--- feature_5 >  29.95
|   |   |--- class: 1.0

1  is our exp run
1  is our exp run
|--- feature_1 <= 130.50
|   |--- feature_7 <= 27.50
|   |   |--- class: 0.0
|   |--- feature_7 >  27.50
|   |   |--- class: 0.0
|--- feature_1 >  130.50
|   |--- feature_5 <= 33.25
|   |   |--- class: 0.0
|   |--- feature_5 >  33.25
|   |   |--- class: 1.0

2  is our exp run
2  is our exp run
|--- feature_1 <= 127.50
|   |--- feature_0 <= 4.50
|   |   |--- class: 0.0
|   |--- feature_0 >  4.50
|   |   |--- class: 0.0
|--- feature_1 >  127.50
|   |--- feature_1 <= 165.50
|   |   |--- class: 1.0
|   |--- feature_1 >  165.50
|   |   |--- class: 1.0

3  is our exp run
3  is our exp run
|--- feature_1 <= 144.50
|   |--- feature_7 <=
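
The printed trees refer to columns by position (`feature_1`, `feature_5`, and so on) because `export_text` was called without names. A minimal sketch of passing `feature_names` is shown below. It uses the bundled iris data as a stand-in, since the column names of `pima.csv` are not given in this notebook; the variable names are hypothetical.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# Stand-in dataset (assumption): iris ships with scikit-learn and has named features.
iris = load_iris()

# Same tree settings as in the notebook: depth limited to 2, fixed seed.
tree = DecisionTreeClassifier(random_state=0, max_depth=2)
tree.fit(iris.data, iris.target)

# feature_names replaces the generic feature_<i> labels in the text report.
report = export_text(tree, feature_names=list(iris.feature_names))
print(report)
```

The same call would work for the diabetes data once a list of eight column names, in file order, is available.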

In [4]:
print(SGD_all,' SGD_all')
print(np.mean(SGD_all), ' mean nn_all')
print(np.std(SGD_all), ' std nn_all')

[0.66558442 0.6461039  0.66558442 0.59415584 0.66558442]  SGD_all
0.6474025974025974  mean nn_all
0.02767178669176953  std nn_all


In [5]:
print(tree_all, hidden,' tree_all')
print(np.mean(tree_all),  ' tree _all')
print(np.std(tree_all),  ' tree _all')

[0.72402597 0.70454545 0.73051948 0.71428571 0.78896104] 8  tree_all
0.7324675324675325  tree _all
0.029586456493232507  tree _all
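
The two result cells can be condensed into one side-by-side summary. The sketch below copies the per-run accuracies from the printed output above and reports them as mean and standard deviation; the helper name `summarise` is an assumption introduced for illustration.

```python
import numpy as np

# Test accuracies copied from the printed output of the five runs above.
sgd_acc = np.array([0.66558442, 0.6461039, 0.66558442, 0.59415584, 0.66558442])
tree_acc = np.array([0.72402597, 0.70454545, 0.73051948, 0.71428571, 0.78896104])

def summarise(name, scores):
    """Format mean and std; like np.std, .std() uses the population formula (ddof=0)."""
    return f"{name}: {scores.mean():.3f} +/- {scores.std():.3f} over {scores.size} runs"

print(summarise("MLP (SGD)", sgd_acc))
print(summarise("Decision tree", tree_acc))
```

On these five splits the depth-2 tree has the higher mean accuracy, with a spread similar to that of the MLP. Five runs are few, so the comparison should be read as indicative only.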


# Program Explanation

The notebook runs max_expruns independent experiments. In each run it draws a new 60/40 train/test split seeded by the run number, trains a neural network and a Decision Tree on the training set, and evaluates both on the test set: accuracy for classification, RMSE for regression. The per-run scores are stored in arrays, and their mean and standard deviation are printed at the end.
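
The experiment loop described above can be sketched in a self-contained form. This is a simplified illustration rather than the notebook's code: it uses synthetic data from `make_classification` in place of the CSV files (an assumption, so that it runs without them), covers only the classification case, and the names `scores` and `n_runs` are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the CSV data (assumption: 8 features, binary target).
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

n_runs = 3
scores = {"mlp": np.zeros(n_runs), "tree": np.zeros(n_runs)}

for run in range(n_runs):
    # A different random_state per run gives a different 60/40 split.
    x_tr, x_te, y_tr, y_te = train_test_split(X, y, test_size=0.40, random_state=run)
    models = {
        "mlp": MLPClassifier(hidden_layer_sizes=(8,), solver="sgd",
                             learning_rate_init=0.01, max_iter=100, random_state=run),
        "tree": DecisionTreeClassifier(max_depth=2, random_state=0),
    }
    for name, model in models.items():
        model.fit(x_tr, y_tr)
        scores[name][run] = model.score(x_te, y_te)  # mean accuracy on the test split

# Aggregate across runs, as the notebook does with np.mean and np.std.
for name, vals in scores.items():
    print(name, vals.mean(), vals.std())
```

With `max_iter=100` the MLP may stop before converging and emit a `ConvergenceWarning`; that is a warning, not an error.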

    np, plt: Aliases for the numpy and matplotlib.pyplot libraries, respectively.

    datasets, train_test_split, metrics, etc.: Modules and functions imported from scikit-learn (sklearn), a machine learning library.

    random: The random module is imported but never used. The same is true of matplotlib.pyplot, load_iris, cross_val_score, roc_curve and auc.

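    normalise, transformer: Used for an optional normalization step in the read_data function. normalise is a boolean flag that is hard-coded to False, so the step is skipped by default. transformer is an instance of Normalizer, which rescales each sample (row) to unit norm.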

    data_in, data_inputx, data_inputy: Variables used to store the dataset, its features, and labels respectively.

    x_train, x_test, y_train, y_test: These variables represent the training and testing splits of the dataset's features and labels.

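    type_model, hidden, learn_rate, run_num, problem: Parameters of the scipy_models function, which despite its name builds scikit-learn models. type_model selects the model (0 for the MLP, 1 for the Decision Tree), hidden sets the number of hidden neurons (tripled for regression), learn_rate sets the initial learning rate, run_num seeds the MLP, and problem selects classification or regression.

    tree_depth: Sets the maximum depth of the Decision Tree model (fixed at 2).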

    model: Represents the machine learning model, either MLPClassifier/MLPRegressor or DecisionTreeClassifier/DecisionTreeRegressor.

    y_pred_test, y_pred_train: Variables to store the predictions of the model on test and train data.

    perf_test, perf_train, cm: Variables to store the performance metrics (accuracy, RMSE, confusion matrix) of the model. Only perf_test is returned; perf_train and cm are computed but discarded.

    max_expruns, SGD_all, Adam_all, tree_all: Used in the main loop (cell 3). max_expruns defines the number of experimental runs, and the other variables are arrays that store one test score per run. Adam_all is allocated but never filled.

    prob: Selects the type of problem. The code tests for the string 'classifification' (spelled this way throughout) or 'regression'.
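
One point about the optional normalization step is easy to miss: `Normalizer` rescales each row (sample) to unit norm, which is different from scaling each column (feature). The sketch below contrasts it with `StandardScaler`, which is not used in this notebook and is shown only as a commonly used column-wise alternative; the input values are made up.

```python
import numpy as np
from sklearn.preprocessing import Normalizer, StandardScaler

# Three toy samples (illustrative values); the second is 10x the first.
X = np.array([[3.0, 4.0], [30.0, 40.0], [1.0, 0.0]])

# Normalizer works row by row: each sample is divided by its own L2 norm.
X_rows = Normalizer().fit_transform(X)

# StandardScaler works column by column: zero mean, unit variance per feature.
X_cols = StandardScaler().fit_transform(X)

print(X_rows)
print(np.linalg.norm(X_rows, axis=1))  # every row now has norm 1
print(X_cols.mean(axis=0))             # every column now has mean ~0
```

After row-wise normalization the first two samples become identical, so differences in overall magnitude between samples are lost. Whether that is desirable depends on the data, which may be one reason the flag is left off here.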