# Homework 01

### Brown University  
### DATA 1010  
### Fall 2020

In [1]:
using CSV, DataFrames

## Part I

In this problem, we'll explore the algorithmic side of a simple and common database operation.

In [2]:
# begin by loading the two local CSV files
df1 = CSV.read("flights1.csv", DataFrame)

| Row | id | date | delay | distance |
|-----|----|------|-------|----------|
| | Int64 | Int64 | Int64 | Int64 |
| 1 | 4222203651472 | 2120910 | 10 | 328 |
| 2 | 3154993796870 | 1141335 | 5 | 279 |
| 3 | 1290620327152 | 3131830 | -7 | 293 |
| 4 | 6201397169412 | 1051910 | 22 | 255 |
| 5 | 9573467304590 | 3311247 | 6 | 229 |
| 6 | 2602364768170 | 1230615 | 4 | 197 |
| 7 | 4730451796020 | 1062025 | -5 | 178 |
| 8 | 4630387769705 | 3081910 | 1 | 308 |
| 9 | 2530401818664 | 2260755 | -7 | 328 |
| 10 | 8726984488790 | 2131710 | -5 | 528 |


In [3]:
df2

| Row | id | origin | destination |
|-----|----|--------|-------------|
| | Int64 | String | String |
| 1 | 2140856843975 | MCI | MDW |
| 2 | 9172973311495 | LAX | PHX |
| 3 | 3113090932533 | ONT | SMF |
| 4 | 4130406938034 | OAK | LAX |
| 5 | 4550074644436 | MSY | HOU |
| 6 | 1477758908913 | LAS | LAX |
| 7 | 4107464736747 | MDW | MCI |
| 8 | 6785156927837 | RNO | SJC |
| 9 | 6199429263131 | FLL | TPA |
| 10 | 1099696213029 | SEA | BOI |


Each row of `df1` corresponds to a particular flight (with identifier `id`) and contains the date of the flight, how much it was delayed (in minutes), and the number of miles on the journey. The second data frame, `df2`, lists the origin and destination airport codes for each flight `id`.

The problem we're looking to solve is to put these two data frames together, so that each row contains all five pieces of information for a given flight. This is called a *join* operation, and in real life you'll almost always use a library to do that operation. However, we're going to do it by hand as a learning experience and as a way of highlighting some of the aspects of the problem which are relevant even if you're using library code.

### Problem 1
Write a function to perform this join operation using the most naive algorithm: loop through the rows of the first data frame. For each ID you encounter, use `findfirst` to search the `id` column of the second data frame to find the matching information.

*Hint*. You need to know only a couple of things about data frames in Julia to do this problem: (1) You can access a data frame entry using the row index and the column name; for example, `df[3, "id"]` returns the `id` value in the third row of the data frame `df`. (2) You can construct a data frame from an array of *named tuples*, like so:

In [4]:
(name = "Alice", age = 42)

(name = "Alice", age = 42)

In [5]:
typeof((name = "Alice", age = 42))

NamedTuple{(:name, :age),Tuple{String,Int64}}

In [6]:
DataFrame([
    (name = "Alice", age = 42), 
    (name = "Bob", age = 38)
])

| Row | name | age |
|-----|------|-----|
| | String | Int64 |
| 1 | Alice | 42 |
| 2 | Bob | 38 |
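
Before using `findfirst` on a data frame column, it may help to try it on a plain vector. The following is a minimal sketch, not a solution to the problem: the vectors `ids1` and `ids2` are made-up stand-ins for the two `id` columns.

```julia
# Hypothetical stand-ins for the two `id` columns (not the real flight data).
ids1 = [30, 10, 20]
ids2 = [10, 20, 30]

# `findfirst(pred, v)` returns the index of the first element of `v` that
# satisfies `pred`, or `nothing` if no element does.
matches = [findfirst(==(id), ids2) for id in ids1]

# An id that is absent from `ids2` yields `nothing`.
missing_match = findfirst(==(99), ids2)
```

Note that each `findfirst` call scans `ids2` from the start, so the work per lookup grows with the length of the second table.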


### Problem 2
Write a function to perform this join operation using a different approach: build a *dictionary* which maps each flight ID value to the corresponding row of the second table. Then loop through the rows of the first table, using the dictionary to access the relevant data from the second table. 
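
As a rough illustration of the dictionary idea, the sketch below builds a lookup table from a small array of named tuples. The rows and airport codes in `rows2` are invented placeholders, and the name `lookup` is only a suggestion; adapting this to the actual data frames is left as part of the problem.

```julia
# Hypothetical rows standing in for the second table (not the real data).
rows2 = [
    (id = 10, origin = "AAA", destination = "BBB"),
    (id = 20, origin = "CCC", destination = "DDD"),
]

# Map each id to its full row; building this takes one pass over `rows2`.
lookup = Dict(row.id => row for row in rows2)

# Each lookup is now a hash-table access rather than a linear scan.
row = lookup[20]

# `get` supplies a fallback value when the key is absent.
absent = get(lookup, 99, nothing)
```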

### Problem 3
Time your functions from Problems 1 and 2; which is faster? Here's an example of how to time a simple function in Julia:

In [7]:
@time sqrt.(rand(10^8));

  0.958840 seconds (4 allocations: 1.490 GiB, 17.84% gc time)
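
One caveat when timing: the first call to a Julia function includes just-in-time compilation, so it is usually worth calling the function once before taking the measurement you report. The sketch below uses `@elapsed`, which returns the elapsed seconds as a number instead of printing them; the function `work` is a hypothetical placeholder for whatever you are timing.

```julia
# Hypothetical function to time; any function would do.
work(n) = sum(sqrt.(rand(n)))

# The first call includes compilation time.
first_call = @elapsed work(10^6)

# The second call measures only the run time of the already-compiled code.
second_call = @elapsed work(10^6)
```

The BenchmarkTools package offers a `@btime` macro that handles warm-up and repeated sampling, if you prefer not to do this by hand.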


### Problem 4
Now suppose that you get to store the dictionary mapping ids to rows *in advance*. Rewrite your function from Problem 2 so that it takes that dictionary as an argument. Time the evaluation of this new function (which receives the dictionary as an argument rather than having to build it). How much faster is it?

### Problem 5
What is another way we could make this operation faster without indexing (the name for storing the dictionary in advance, as in Problem 4)?

## Part II

### Problem 6

In the game of Set, every card has four features:

**Number**. 1, 2, or 3  
**Color**. purple, red, or green  
**Shape**. oval, squiggle, or diamond  
**Shading**. striped, solid, or outline

<img src="set.png"/>

There is exactly one card for every possible combination of attributes (for example there's exactly one card with 2 red solid squiggles), so there are 81 cards in total. 

Three cards are said to form a set if for each feature, the three cards are either all the same or all different. For example, the cards shown above form a set because their shapes are all different, their shading is all different, the numbers are all different, and the colors are all different. 
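
The "all the same or all different" rule can be sketched in code as follows. This is one possible formulation under stated assumptions: the `Card` field names, the encoding of each feature as an integer in `1:3`, and the function names `feature_ok` and `forms_set` are all choices made for this illustration.

```julia
# Hypothetical card type; field names and the 1:3 encoding are assumptions.
struct Card
    number::Int
    color::Int
    shape::Int
    shading::Int
end

# Three values are "all the same or all different" exactly when the
# number of distinct values among them is not 2.
feature_ok(a, b, c) = length(unique((a, b, c))) != 2

# Check the rule for every feature of the three cards.
function forms_set(c1::Card, c2::Card, c3::Card)
    all(feature_ok(getfield(c1, f), getfield(c2, f), getfield(c3, f))
        for f in fieldnames(Card))
end
```

For instance, `forms_set(Card(1, 1, 1, 1), Card(2, 2, 2, 2), Card(3, 3, 3, 3))` is `true` because every feature differs across the three cards.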

The cards are shuffled, and 12 cards are dealt face up for all players to see. If any three of the cards form a set, then the first player to identify a set gets to pick up those cards, and the cards are replaced from the deck. If no sets are present, then three additional cards are dealt. If a set is identified at 15 cards, then the three removed cards are not replaced, and the count goes back down to 12. If no sets are present even at 15 cards, then three more cards are dealt to get up to 18, and so on. The game ends when the deck is empty and there are no more sets. 

Write a program to simulate this game, play it 10,000 times, and find the proportion of games in which 18 cards appear at some point.

**Hints:**

(1) You should represent cards using a new `Card` type, rather than using 4-tuples of `Int`s.  
(2) It's going to be important to break your program up into small, dedicated functions. For example, you can write a function for returning the 81 total cards, a function for shuffling the deck, a function which takes three cards and returns true or false depending on whether they form a set, a function for dealing cards, a function for finding all the sets on the board, a function which takes a single turn by randomly selecting one of the available sets or adding three cards to the board if there are no sets, and finally a function which puts those functions together to play the game.  
(3) Some Julia tips: 
  - `randperm` generates random permutations, and you can index an array with the resulting list to shuffle it. 
  - `vcat` concatenates two arrays.
  - You probably want to use a four-dimensional array comprehension to generate the list of all cards, and you can index it with a colon to flatten it.
  - To remove elements at positions `i`, `j`, and `k` from an array `A`, you can do `A[setdiff(1:end, (i,j,k))]`
  - You can start with an empty array and grow it one element at a time inside a loop using this pattern:
    ```julia 
    A = []
    for x in 1:10
        push!(A, x)
    end
    ```
  - You can use nested `for` loops to check all the triples to look for a set. There are more efficient algorithms, but that's OK.
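
The Julia tips above can be tried together in a short sketch. Plain tuples stand in for cards here only to keep the example brief (the hints recommend a dedicated `Card` type), and the variable names and positions removed are arbitrary choices.

```julia
using Random  # standard library; provides randperm

# Four-dimensional comprehension over the feature values, flattened with [:].
cards = [(n, c, s, f) for n in 1:3, c in 1:3, s in 1:3, f in 1:3][:]

# Shuffle by indexing with a random permutation.
deck = cards[randperm(length(cards))]

# Take 12 cards, then concatenate 3 more with vcat.
board = vcat(deck[1:12], deck[13:15])

# Remove the cards at positions 2, 5, and 9.
remaining = board[setdiff(1:end, (2, 5, 9))]
```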

Your solution should contain working code, but you should also discuss in prose how each function works and how the functions fit together in the design of the program.

*Solution.*

## Part III

### Problem 7

Some things which look like functions in Julia start with `@`; these are called **macros**. The difference between a macro and a function is that the former is able to operate on the *code expressions* supplied to it, while a function can only operate on the *values* that those code expressions evaluate to. 

For example, if `f` is a function, then `f(2)` is always equivalent to `f(1 + 1)`. However, macros can behave differently depending on the expression: 

In [8]:
x = 1
@show(x + 1);

x + 1 = 2


In [9]:
@show(1 + 1);

1 + 1 = 2


In [10]:
@show(2);

2 = 2
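
To make the distinction concrete, here is a minimal sketch that contrasts a function with a small user-defined macro. The names `describe_value` and `showexpr` are hypothetical and exist only for this illustration.

```julia
# A function receives only the value its argument evaluates to.
describe_value(v) = string(v)

# A macro receives the unevaluated expression itself.
macro showexpr(ex)
    string(ex)  # the returned string is spliced into the calling code
end

from_function = describe_value(1 + 1)   # sees the value 2
from_macro = @showexpr(1 + 1)           # sees the expression 1 + 1
```

Here `from_function` is `"2"`, while `from_macro` is `"1 + 1"`, since the macro was handed the expression before it was evaluated.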


What would you need to do if you wanted to be able to use plain functions to do a computation with the supplied expression (not just the value it evaluates to)? In other words, write a function `showfunction` which achieves an effect somewhat similar to the `@show` macro. 

Feel free to look up whatever you want online to solve this problem. You might want to look at the `Meta.parse` function (no import necessary), which turns a string into an expression, as well as the `eval` function, which evaluates an expression.
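
As a starting point, the two functions just mentioned can be tried in isolation. This sketch only shows what each one does; combining them into `showfunction` is the exercise.

```julia
# Turn source text into an expression object without evaluating it.
ex = Meta.parse("1 + 1")

# The expression is a tree: a call to + with arguments 1 and 1.
head = ex.head
args = ex.args

# Evaluate the expression in the current module's global scope.
value = eval(ex)
```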

(Note that this technique is essential in a language like Python, which doesn't have macros. By contrast, R's approach is at the opposite end of the spectrum: [*every* function gets access to the code of its arguments](http://adv-r.had.co.nz/Computing-on-the-language.html).)

(Also note: this exercise is helpful because it sheds some light on one way that language design influences the interfaces of various data science libraries you will use.)

### Problem 8

Explain how the following function returns a value even though it contains no return statement. 

In [11]:
function A(x)
    if x ≥ 0
        x
    else
        -x
    end
end

using Test
@testset begin
    @test A(-3) == 3
    @test A(0.0) == 0.0
    @test A(1) == 1
end

[37m[1mTest Summary: | [22m[39m[32m[1mPass  [22m[39m[36m[1mTotal[22m[39m
test set      | [32m   3  [39m[36m    3[39m


Test.DefaultTestSet("test set", Any[], 3, false)

As in Problem 7, feel free to look up whatever you want and write a brief explanation of what's going on.