# Activity: Let's Break the Student Record Application
There is a big difference between writing code that only you will use and writing code that others will use. When you write code for others, you need to make sure that it is easy to use and robust enough to handle unexpected inputs.

In this activity, you will be given a code base, in particular the `build(...)` and `find(...)` methods from the student records app, that is not robust and has some bugs. Some of these bugs we can fix, while others are just errors in user logic.

Our task is to break the code by providing unexpected inputs, look at how the system reacts (who is doing what) and then suggest fixes to make it more robust for items that we can control. Let's go!
___

## Setup, Data, and Prerequisites
First, we set up the computational environment by including the `Include.jl` file and loading any needed resources.
* The [include command](https://docs.julialang.org/en/v1/base/base/#include) evaluates the contents of the input source file, `Include.jl`, in the notebook's global scope. The `Include.jl` file sets paths, loads required external packages, etc. For additional information on functions and types used in this material, see the [Julia programming language documentation](https://docs.julialang.org/en/v1/). 
* In addition to standard Julia libraries, we'll also use [the `VLDataScienceMachineLearningPackage.jl` package](https://github.com/varnerlab/VLDataScienceMachineLearningPackage.jl). Check out that documentation for more information on the functions and types used in this material.

In [1]:
include("Include.jl");

### Types and Functions
Let's start by defining the mutable `MySimpleStudentModel` type, which will represent a student record. This type includes fields for the student's first name, last name, student identification number, and their netid (email).
* _Why a mutable type?_ Mutable types allow their fields to be modified after creation, which is useful for objects that may need to be updated over time, such as student records. Additionally, a mutable instance can be created with placeholder values and filled in later, which is how the `build(...)` method below works.
* _Keyword argument constructor_: The constructor allows for the creation of `MySimpleStudentModel` objects with default values for fields, making it easier to instantiate objects without providing all arguments. Thus, we can create an empty student record with default values for each field.

In [2]:
"""
    mutable struct MySimpleStudentModel

A mutable struct that models a student with a firstname, lastname, student id and a netid.

### Fields
- `firstname::String`: The first name of the student.
- `lastname::String`: The last name of the student.
- `sid::Int64`: The student identification number.
- `netid::String`: The network identifier (email address) of the student.
"""
mutable struct MySimpleStudentModel

    # data fields -
    firstname::String
    lastname::String
    sid::Int64
    netid::String
    
    # keyword argument constructor: builds a new student model with default values
    MySimpleStudentModel(; firstname::String = "firstname", 
        lastname::String = "lastname", sid::Int64 = -1, netid::String = "abcd") = new(firstname, lastname, sid, netid);
end;
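
As a quick illustration of the two points above, here is a minimal sketch that uses a stand-in type, `DemoStudentModel` (a hypothetical name chosen so that it does not clash with the type defined above). It mirrors the fields and the keyword argument constructor, then shows default construction, a single-field override, and an update after creation.

```julia
# Stand-in type (hypothetical name) that mirrors MySimpleStudentModel
mutable struct DemoStudentModel
    firstname::String
    lastname::String
    sid::Int64
    netid::String

    # keyword argument inner constructor with default values
    DemoStudentModel(; firstname::String = "firstname", lastname::String = "lastname",
        sid::Int64 = -1, netid::String = "abcd") = new(firstname, lastname, sid, netid)
end

empty_student = DemoStudentModel()            # every field takes its default value
partial_student = DemoStudentModel(sid = 42)  # override a single field by keyword

# fields of a mutable struct can be reassigned after construction
empty_student.firstname = "Ada"
```

Because the constructor accepts only keyword arguments, the order in which the fields are supplied does not matter.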

Let's implement a `build(modeltype::Type{MySimpleStudentModel}, data::NamedTuple)::MySimpleStudentModel` method, which takes the type of thing we want to build and the data needed to build the model [in a `NamedTuple` instance](https://docs.julialang.org/en/v1/base/base/#Core.NamedTuple). The `build(...)` method returns a populated `MySimpleStudentModel` instance.

In [3]:
"""
    build(modeltype::Type{MySimpleStudentModel}, data::NamedTuple)::MySimpleStudentModel

Builds a new `MySimpleStudentModel` from a named tuple of data.

### Arguments
- `modeltype::Type{MySimpleStudentModel}`: The type of the model to be built.
- `data::NamedTuple`: A named tuple containing the fields to be set in the model.

### Returns
- `MySimpleStudentModel`: A new instance of `MySimpleStudentModel` with the fields set from the named tuple.
"""
function build(modeltype::Type{MySimpleStudentModel}, data::NamedTuple)::MySimpleStudentModel
    
    # initialize -
    model = modeltype(); # This builds an empty model (with default values)

    # TODO: Uncomment the following code to give a warning if the named tuple is missing required fields
    # required_fields = [:firstname, :lastname, :sid, :netid]
    # for field in required_fields
    #     if haskey(data, field) == false
    #         @warn "Ooops! Missing required field: $field. Using default value."
    #     end
    # end

    # TODO: Uncomment error handling code below to add missing data if needed
    # firstname = get(data, :firstname, "default_firstname");
    # lastname = get(data, :lastname, "default_lastname");
    # sid = get(data, :sid, -1);
    # netid = get(data, :netid, "default_netid");

    # TODO: Get data from the named tuple
    # TODO: Comment this block of code when using the error handling code above
    firstname = data.firstname;
    lastname = data.lastname;
    sid = data.sid;
    netid = data.netid;

    # set the fields of the model
    model.firstname = firstname;
    model.lastname = lastname;
    model.sid = sid;
    model.netid = netid;

    # return -
    return model;
end;

Finally, we have updated the [`find(...)` method](src/Compute.jl) implementation. This method takes a collection of student models and values for the fields of the student we want, and returns the set of student models (a `Set{MySimpleStudentModel}`) that match at least one of the search fields. The set is empty if no student matches.

In [4]:
"""
    find(students::Array{MySimpleStudentModel,1}; netid::String="jdv27", firstname::String = "firstname", 
        lastname::String = "lastname", sid::Int64 = -1) -> Set{MySimpleStudentModel}

Finds the students in the array of `MySimpleStudentModel` that match at least one of the provided search fields. 
Returns the set of matching students; the set is empty if no student matches.

### Parameters
- `students::Array{MySimpleStudentModel,1}`: An array of `MySimpleStudentModel` objects.
- `netid::String`: The netid of the student to search for (default is "jdv27").
- `firstname::String`: The firstname of the student to search for (default is "firstname").
- `lastname::String`: The lastname of the student to search for (default is "lastname").
- `sid::Int64`: The student identification number to search for (default is -1).

### Returns
- `Set{MySimpleStudentModel}`: The set of matching students (empty if none match).
"""
function find(students::Array{MySimpleStudentModel,1}; netid::String="jdv27", firstname::String = "firstname", 
    lastname::String = "lastname", sid::Int64 = -1)::Set{MySimpleStudentModel}

    # initialize -
    set_of_matching_students = Set{MySimpleStudentModel}(); # default: we don't know which student we are looking for

    # main loop -
    for i ∈ eachindex(students)
        test_student = students[i];  # get student i from the array -

        # Default: Let's start by assuming a compound OR check
        if test_student.lastname == lastname || test_student.netid == netid || test_student.firstname == firstname || test_student.sid == sid
            push!(set_of_matching_students, test_student); # add the student to the set of found students
        end
    end

    return set_of_matching_students; # return the search results to the caller
end;

## Task 1: Break the Student Model Build Method
In this task, let's try to break the `build(...)` method by providing unexpected inputs. For example, we'll provide an incorrect type of the object we want to build, and will provide a `data::NamedTuple` with missing or incorrect fields and see how the system reacts.

The interesting bit of the experiment is to see who is doing what when the system reacts. For example, are the errors things that we could anticipate and handle, or are they errors that the system is responsible for handling? Does the system throw an error, or does it return a default value? Does it provide a helpful error message, or is it cryptic?

### Incorrect Object Type
The first argument to the `build(...)` method is the type of the object we want to build. Let's provide an incorrect type, such as `Int64` instead of `MySimpleStudentModel` and see how the system reacts.

* _What do you expect to happen_? This bug (which is a user logic error) should be caught by the system, and it should throw an error. The user is asking for a method with incorrect parameters in the type sense, so the system should not be able to find a matching method, and thus throw an error.

Is this what happens? Let's try it out.

In [5]:
build(Int64, (firstname = "John", lastname = "Doe", sid = 123456, netid = "jbd123")) # Example usage, wrong object build type

MethodError: MethodError: no method matching build(::Type{Int64}, ::@NamedTuple{firstname::String, lastname::String, sid::Int64, netid::String})
The function `build` exists, but no method is defined for this combination of argument types.

Closest candidates are:
  build(!Matched::Type{MySimpleStudentModel}, ::NamedTuple)
   @ Main ~/Desktop/julia_work/CHEME-140-eCornell-Repository/CHEME-140-eCornell-Repository/courses/CHEME-141/module-2/jl_notebook_cell_df34fa98e69747e1a8f8a730347b8e2f_W6sZmlsZQ==.jl:13


__Yes!__ The system throws [a `MethodError`](https://docs.julialang.org/en/v1/base/base/#Core.MethodError) indicating that there is no method matching `build(::Type{Int64}, ::NamedTuple)`. The system is doing the right thing by throwing an error, and the error message is clear and helpful.
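
If a caller wants to recover from this kind of dispatch failure instead of crashing, the error can be intercepted. The sketch below is only an illustration: `DemoRecord`, `demo_build`, and `try_build` are hypothetical stand-ins, not part of the student records app.

```julia
# Hypothetical stand-ins: a tiny type and a build method restricted to that type
struct DemoRecord
    name::String
end
demo_build(modeltype::Type{DemoRecord}, data::NamedTuple) = modeltype(data.name)

# Returns the built record, or `nothing` when no method matches the arguments
function try_build(modeltype, data)
    try
        return demo_build(modeltype, data)
    catch err
        err isa MethodError || rethrow()  # only intercept dispatch failures
        return nothing
    end
end

ok = try_build(DemoRecord, (name = "John",))  # a DemoRecord instance
bad = try_build(Int64, (name = "John",))      # no matching method, so `nothing`
```

Catching only `MethodError` and rethrowing everything else keeps unrelated bugs visible.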

### Missing Fields
Next, let's provide a `data::NamedTuple` with missing fields. For example, we can provide a `NamedTuple` with only the `firstname::String` field and see how the system reacts.

* _What do you expect to happen_? This bug will be caught by the system, but the question is whether it should be, i.e., should we handle this ourselves? Yes, this is a user error, and if our documentation is clear, the user should know that they need to provide all fields. But this seems like a reasonable case that we could handle gracefully, so let's see what happens.

What happens when we try this out?

In [6]:
build(MySimpleStudentModel, (firstname = "John",)) # Example usage, we didn't provide all fields

ErrorException: type NamedTuple has no field lastname

Ok, the system found the method, but it threw an error indicating that the `lastname` field is missing. The error message is clear and helpful. The system is doing the right thing by throwing an error, but would it be better if we provided a default value for the missing field instead?
* _What do you think_? Yes, this is a user error, but it is also a reasonable case that we could handle gracefully. For example, we could provide a default value for the missing field, or we could throw an error with a more helpful message. Let's check this out and see what happens.

`Uncomment` the error handling code in the `build(...)` method, comment out the block that throws the error and reload the modified method. We should now provide default values for the missing field. This way, the system will not throw an error, and a student model will be created and returned with the default value for the missing field.

Did we get the expected behavior?

In [7]:
build(MySimpleStudentModel, (firstname = "John",)) # Example usage, we didn't provide all fields

ErrorException: type NamedTuple has no field lastname

__Hmmm, Yes, we get the expected behavior__: That works, but it is not ideal. The system should not be silently providing default values for missing fields without informing the user. Maybe it would be better if we provided a warning that the field is missing and a default value is being used.
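
The default-value lookup relies on two functions from Julia's `Base` that work on named tuples: `get(data, key, default)` and `haskey(data, key)`. A minimal sketch of how they behave when fields are absent (the variable names here are illustrative only):

```julia
data = (firstname = "John",)  # only one of the four fields is supplied

# get(namedtuple, key, default) returns the default when the key is absent
firstname = get(data, :firstname, "default_firstname")
lastname = get(data, :lastname, "default_lastname")
sid = get(data, :sid, -1)

# haskey reports which fields were actually supplied
missing_fields = [f for f in (:firstname, :lastname, :sid, :netid) if !haskey(data, f)]
```

Here `firstname` comes from the tuple, while `lastname` and `sid` silently fall back to their defaults, which is exactly the behavior that motivates adding a warning.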

`Uncomment` the warning code in the `build(...)` method, and reload the modified method. We should now provide a warning that the field is missing and a default value is being used.

In [8]:
build(MySimpleStudentModel, (firstname = "John",)) # Example usage, we didn't provide all fields, should use default value and get a warning

ErrorException: type NamedTuple has no field lastname

__Ok, that is better!__ Now the system provides a warning that the field is missing and a default value is being used. This way, the user is informed that something is not right, and they can take action if needed. Furthermore, the system is still able to create a valid model instance with the available data. No crashes, no errors, just a warning and a valid model instance (that can be updated because it is a mutable type).
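
The other option mentioned earlier, throwing an error with a more helpful message, might look like the sketch below. The helper name `check_required_fields` and its `required` keyword are assumptions made for illustration; they are not part of the `build(...)` method above.

```julia
# Hypothetical helper: throws one informative error that names every missing field
function check_required_fields(data::NamedTuple;
    required::Vector{Symbol} = [:firstname, :lastname, :sid, :netid])

    missing_fields = filter(f -> !haskey(data, f), required)
    if !isempty(missing_fields)
        throw(ArgumentError("missing required field(s): $(join(missing_fields, ", "))"))
    end
    return true
end

# Usage: capture the message instead of letting the error propagate
message = try
    check_required_fields((firstname = "John",))
catch err
    err.msg
end
```

Compared with the default `ErrorException`, which reports only the first missing field, this version reports all of them at once. Whether to warn and continue or to stop with an error is a design choice that depends on how the records are used downstream.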

## Task 2: Break the Student Model Find Method
In this task, let's try to break the `find(...)` method implementation. We'll provide unexpected inputs, such as a collection of student models with missing fields, and see how the system reacts. 

To begin, build some test data to use with the `find(...)` method with realistic values, such as repeated (non-unique, and perhaps missing) values for the names, and (perhaps) non-unique (or missing) values for the student ID and netid fields.
* _Why?_ This is what we call in data science a _dirty dataset_, i.e., a dataset that contains noise and mistakes, much like data in real life. Testing with realistic data is important because it allows us to test the robustness of the `find(...)` method and see how it handles unexpected inputs. The world is awash with dirty data, and we need to be able to handle it.

Let's start by building a collection of possible student forenames and surnames that we'll use in student model generation. The names datasets are loaded using [the `MyCommonForenameDataset()`](https://varnerlab.github.io/VLDataScienceMachineLearningPackage.jl/dev/data/#VLDataScienceMachineLearningPackage.MyCommonForenameDataset) and [the `MyCommonSurnameDataset()`](https://varnerlab.github.io/VLDataScienceMachineLearningPackage.jl/dev/data/#VLDataScienceMachineLearningPackage.MyCommonSurnameDataset) functions from [the `VLDataScienceMachineLearningPackage.jl` package.](https://github.com/varnerlab/VLDataScienceMachineLearningPackage.jl)

We'll store the forenames and surnames in `set_of_firstnames::Set{String}` and `set_of_lastnames::Set{String}` variables, respectively. Then, we'll use these datasets to build a collection of student models with realistic values.

In [9]:
set_of_firstnames, set_of_lastnames = let 
    
    # initialize -
    set_of_firstnames = Set{String}();
    set_of_lastnames = Set{String}();

    # Get names from the VLDataScienceMachineLearningPackage.jl package
    fornames_df = MyCommonForenameDataset(); # firstnames dataset
    surnames_df = MyCommonSurnameDataset(); # lastnames dataset

    # build fornames set - 
    for name in eachrow(fornames_df)
        push!(set_of_firstnames, name["Romanized Name"]);
    end

    # build surnames set -
    for name in eachrow(surnames_df)

        value = name["Romanized Name"];
        if ismissing(value) == true
            continue; # skip missing values
        end
        push!(set_of_lastnames, value);
    end

    set_of_firstnames, set_of_lastnames # return -
end;

Next, let's build a collection of student models with realistic values by providing repeated (non-unique) values for the names, and non-unique values for the student ID and netid fields, and in some cases missing values.

In [10]:
students = let

    # initialize -
    number_of_students = 1000; # number of students to create
    θ = 0.1; # TODO: set the probability of having a missing field
    students = Array{MySimpleStudentModel,1}(undef, number_of_students); # create an array of students

    # main loop - create a new student with random data
    for i ∈ 1:number_of_students

        firstname = rand(set_of_firstnames); # select a random first name
        if rand() < θ
            firstname = ""; # empty string to simulate missing data
        end
       
        lastname = rand(set_of_lastnames); # select a random last name
        if rand() < θ
            lastname = ""; # empty string to simulate missing data
        end

        sid = rand(1000:9999); # random student id
        if rand() < θ
            sid = 0; # set to 0 to simulate missing data
        end

        netid = "netid_$(firstname)_$(lastname)_$(sid)"; # create a netid from the firstname, lastname and sid
        if rand() < θ
            netid = ""; # empty string to simulate missing data
        end

        # build the student model and add it to the array
        students[i] = build(MySimpleStudentModel, (firstname = firstname, lastname = lastname, sid = sid, netid = netid));
    end

    students; # return the array of students
end

1000-element Vector{MySimpleStudentModel}:
 MySimpleStudentModel("Reem", "Myers", 0, "netid_Reem_Myers_0")
 MySimpleStudentModel("Olga", "Arben", 8247, "netid_Olga_Arben_8247")
 MySimpleStudentModel("Isabella", "Moreira", 9082, "netid_Isabella_Moreira_9082")
 MySimpleStudentModel("Safija", "Lefèvre", 9020, "")
 MySimpleStudentModel("Elias", "Tô", 7178, "netid_Elias_Tô_7178")
 MySimpleStudentModel("Arthur", "", 9377, "netid_Arthur__9377")
 MySimpleStudentModel("", "Ismailov", 0, "netid__Ismailov_0")
 MySimpleStudentModel("Anna", "Salo", 9921, "netid_Anna_Salo_9921")
 MySimpleStudentModel("Esther", "Marchenko", 4326, "netid_Esther_Marchenko_4326")
 MySimpleStudentModel("Ayşa", "Lei", 5667, "netid_Ayşa_Lei_5667")
 ⋮
 MySimpleStudentModel("Sahar", "Jansons", 6098, "netid_Sahar_Jansons_6098")
 MySimpleStudentModel("", "Wagener", 1794, "netid__Wagener_1794")
 MySimpleStudentModel("Diego", "De Smet", 3212, "netid_Diego_De Smet_3212")
 MySimpleStudentModel("Wateen", "Berisha", 4666, "")
 MySim

### Can we Break the Find Method?

Next, we'll search our collection of student models using the `find(...)` method with different combinations of search fields. Let's make a case for using the `||` short-circuiting operator in our search criteria in our new `find(...)` method implementation.
* _Why OR instead of AND?_ The short-circuit OR operator `||` allows us to search for a student model when only some of the student data matches the search criteria, e.g., when `firstname` is known but `lastname` is not. The new `find(...)` method implementation takes advantage of this behavior and returns all students that match at least one criterion, even if some fields are missing or have default values.
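
To make the difference concrete, here is a small sketch on a hand-built collection. The records are hypothetical named tuples rather than `MySimpleStudentModel` instances, and the empty string stands for a missing value.

```julia
# Hypothetical records; the empty string stands for a missing last name
records = [
    (firstname = "Mark", lastname = "Conti"),
    (firstname = "Sherifa", lastname = "Conti"),
    (firstname = "Mark", lastname = ""),
]

# OR: a record matches when any one field agrees with the query
or_matches = filter(r -> r.firstname == "Mark" || r.lastname == "Conti", records)

# AND: a record matches only when every field agrees with the query
and_matches = filter(r -> r.firstname == "Mark" && r.lastname == "Conti", records)
```

The OR search returns all three records, including the one with the missing last name, while the AND search returns only the single record that matches on both fields.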

Select a random student model from the collection and use the `find(...)` method to search for it using different combinations of search fields. For example, we can search for a student by their first name, last name, student ID, or netid.

In [43]:
random_test_student = rand(students) # select a random student from the array

MySimpleStudentModel("Mark", "Conti", 0, "netid_Mark_Conti_0")

Do we get the expected behavior, for example, when searching for a student by their last name?
* _What should we expect to happen?_ We expect to find the student models with the matching last name, even if the other fields are different, i.e., we should see all students who share the last name of the selected student, regardless of their first name, netid, or student ID. If the last name is missing, we'll find all students with missing last names.

Do we see the expected behavior?

In [44]:
students_that_we_found = find(students, lastname=random_test_student.lastname)

Set{MySimpleStudentModel} with 2 elements:
  MySimpleStudentModel("Mark", "Conti", 0, "netid_Mark_Conti_0")
  MySimpleStudentModel("Sherifa", "Conti", 4107, "netid_Sherifa_Conti_4107")

Ok, let's try searching for a student by their netid. 
* _What should we expect to happen?_ We expect to find a set of all student models with the matching netid, even if the other fields are different, i.e., we could have multiple students with the same netid (not supposed to, but it happens in the case of dirty data), so we should see all of them.
* _What about when the netid is missing?_ We expect to find all students with a missing netid, i.e., all students whose netid field is the empty string `""`.

Do we see the expected behavior?

In [45]:
students_that_we_found = find(students, netid=random_test_student.netid)

Set{MySimpleStudentModel} with 1 element:
  MySimpleStudentModel("Mark", "Conti", 0, "netid_Mark_Conti_0")

Ok, how about searching for a student by their student ID?
* _What should we expect to happen?_ We expect to find a set of all student models with the matching student ID, even if the other fields are different, i.e., we could have multiple students with the same student ID in the case of dirty data, so we should see all of them.
* _What about when the student ID is missing?_ We expect to find all students with a missing student ID, i.e., all students whose student ID field is `0`.

Do we see the expected behavior?

In [46]:
students_that_we_found = find(students, sid=random_test_student.sid)

Set{MySimpleStudentModel} with 97 elements:
  MySimpleStudentModel("Elshan", "Rivera", 0, "netid_Elshan_Rivera_0")
  MySimpleStudentModel("Kazi", "Kaneko", 0, "netid_Kazi_Kaneko_0")
  MySimpleStudentModel("Jovan", "Gjoni", 0, "netid_Jovan_Gjoni_0")
  MySimpleStudentModel("Martim", "Camilleri", 0, "netid_Martim_Camilleri_0")
  MySimpleStudentModel("", "Jeon", 0, "netid__Jeon_0")
  MySimpleStudentModel("", "Radović", 0, "netid__Radović_0")
  MySimpleStudentModel("Konul", "Hofer", 0, "")
  MySimpleStudentModel("Paninnguaq", "Veselá", 0, "netid_Paninnguaq_Veselá_0")
  MySimpleStudentModel("Diego", "Wilson", 0, "netid_Diego_Wilson_0")
  MySimpleStudentModel("Harper", "", 0, "")
  MySimpleStudentModel("Solomiya", "Lewis", 0, "netid_Solomiya_Lewis_0")
  MySimpleStudentModel("Mark", "Conti", 0, "netid_Mark_Conti_0")
  MySimpleStudentModel("Ella", "Thường", 0, "")
  MySimpleStudentModel("Tomáš", "Markov", 0, "netid_Tomáš_Markov_0")
  MySimpleStudentModel("Inès", "Karlsson", 0, "netid_Inès_Karls

So far, the `find(...)` method seems to be working as expected, but what happens if we intentionally provide the wrong type of data for the search fields? For example, we can provide a `String` instead of an `Int64` for the `sid` field.
* _What should we expect to happen?_ We expect the system to throw an error indicating that there is no method matching `find(...)` with the provided arguments, i.e., the system should not be able to find a matching method for the provided arguments.

Is this what happens?

In [47]:
students_that_we_found = find(students, sid="0")

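TypeError: TypeError: in keyword argument sid, expected Int64, got a value of type String

__Not quite!__ The system does throw an error, but it is a `TypeError` rather than a `MethodError`. Keyword arguments do not participate in method dispatch, so Julia finds the `find(...)` method and then rejects the `String` value because of the `sid::Int64` annotation.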
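
The two failure modes can be compared side by side in a small sketch. The function `lookup` and the helper `error_type` are hypothetical names used only for this illustration.

```julia
# Hypothetical function with one typed positional and one typed keyword argument
lookup(name::String; sid::Int64 = -1) = (name, sid)

# Returns the type of the exception thrown by f(), or `nothing` if none is thrown
function error_type(f)
    try
        f()
        return nothing
    catch err
        return typeof(err)
    end
end

positional_error = error_type(() -> lookup(42))               # wrong positional type
keyword_error = error_type(() -> lookup("John", sid = "0"))   # wrong keyword type
```

A wrongly typed positional argument fails during dispatch (`MethodError`), while a wrongly typed keyword argument fails the type check after the method has been selected (`TypeError`).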

What about the case where we intentionally provide bad data to the `find(...)` method? For example, for the `firstname` field, we provide a `String` that is not in the set of first names, or for the `sid` field, we provide an `Int64` that is not in the set of student IDs.
* _What should we expect to happen?_ We expect the `find(...)` method to return an empty set, i.e., no students found. Alternatively, depending upon the random values we provided, we could get a set of students that match one or more other search criteria. 

For example, a negative student ID, or a netid that is not in the set of netids, should return an empty set, i.e., no students found. But if we have a `netid = ""`, we should find all students with a missing netid, i.e., all students whose netid field is the empty string.

In [50]:
students_that_we_found = find(students, sid=-1234, lastname="NVDA", netid="")

Set{MySimpleStudentModel} with 92 elements:
  MySimpleStudentModel("Julija", "Van Dyk", 9194, "")
  MySimpleStudentModel("", "Urbonienė", 7883, "")
  MySimpleStudentModel("Hana", "Grech", 7877, "")
  MySimpleStudentModel("", "Chávez", 7444, "")
  MySimpleStudentModel("Petar", "Heng", 8420, "")
  MySimpleStudentModel("Konul", "Hofer", 0, "")
  MySimpleStudentModel("Vasilije", "Claes", 3991, "")
  MySimpleStudentModel("Alikhan", "", 1173, "")
  MySimpleStudentModel("Bisera", "Kvaran", 5931, "")
  MySimpleStudentModel("Arman", "", 9677, "")
  MySimpleStudentModel("Mahammad", "Reyes", 4171, "")
  MySimpleStudentModel("Hinato", "Kawano", 5476, "")
  MySimpleStudentModel("Amaris", "Tahirović", 6909, "")
  MySimpleStudentModel("Peter", "Kazlauskienė", 4544, "")
  MySimpleStudentModel("Matei", "Thill", 8679, "")
  MySimpleStudentModel("Amēlija", "Reuter", 6262, "")
  MySimpleStudentModel("Harper", "", 0, "")
  MySimpleStudentModel("Ella", "Thường", 0, "")
  MySimpleStudentModel("Snežana", "", 

Taken together, the `find(...)` method seems to be robust enough to handle unexpected inputs and return a meaningful result, even if the inputs are not perfect.
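
One caveat remains with the OR-based search: fields that the caller leaves out are still compared, against the keyword defaults such as `"lastname"` or `-1`. A possible refinement, sketched here under the assumption that `nothing` marks a field the caller did not specify, is to compare only the fields that were supplied. The function `find_specified` and the named-tuple records are hypothetical and only illustrate the idea.

```julia
# Hypothetical variant: a field takes part in the search only when the caller supplies it
function find_specified(records::Vector{<:NamedTuple};
    lastname::Union{Nothing,String} = nothing, netid::Union{Nothing,String} = nothing)

    return filter(records) do r
        (lastname !== nothing && r.lastname == lastname) ||
        (netid !== nothing && r.netid == netid)
    end
end

records = [
    (lastname = "Conti", netid = "netid_Mark_Conti_0"),
    (lastname = "Hofer", netid = ""),
]

by_lastname = find_specified(records, lastname = "Conti")  # one match
missing_netid = find_specified(records, netid = "")        # explicit search for empty netids
nothing_given = find_specified(records)                    # no criteria, so no matches
```

With this design, a search for `netid = ""` still finds records with a missing netid, but only when the caller asks for it explicitly.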