In [1]:
from datascience import *
import numpy as np

import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
plt.style.use('fivethirtyeight')
%matplotlib inline

from ipywidgets import interact, interactive, fixed, interact_manual
import ipywidgets as widgets

# Plotly plotting support
# import plotly.plotly as py
import plotly.plotly as py

# import cufflinks as cf
# cf.go_offline() # required to use plotly offline (no account required).

import plotly.graph_objs as go
import plotly.figure_factory as ff

# 1. Tables
The `cal` table describes the:

1. `name` (string)
2. `position` (string)
3. `class` (string), and
4. `height` (int)

of Cal basketball players in the 2016-2017 season.

In [2]:
name = make_array('Ivan Rabb', 'Charlie Moore', 'A', 'B', 'C', 'D', 'E')
position = make_array('Forward', 'Guard', 'Center', 'Power Forward', 'Point Guard', 'Forward', 'Guard')
Class = make_array('Sophomore', 'Freshman', 'Junior', 'Senior', 'Sophomore', 'Junior', 'Senior')
height = make_array(83, 71, 75, 90, 81, 77, 92)

cal = Table().with_columns(
    'name', name,
    'position', position,
    'class', Class,
    'height', height
)

cal.show(3)

name,position,class,height
Ivan Rabb,Forward,Sophomore,83
Charlie Moore,Guard,Freshman,71
A,Center,Junior,75


Complete the **Python expressions** below to compute each result.

**You must fit your solution into the lines and spaces provided to receive full credit**

A blank can be filled with multiple expressions, such as 2 expressions separated by commas. The last line of each answer should evaluate to the result requested; you never need to call `print`.

#### (a) The proportion of all players whose position is `Forward`

In [3]:
cal.where('position', 'Forward').num_rows / cal.num_rows

0.2857142857142857

#### (b) The name of the shortest `Freshman`. Assume that one is shorter than the rest.

In [4]:
cal.where('class', 'Freshman').sort('height', descending = False).row(0).item('name')

'Charlie Moore'

#### (c) Whether there are at least $\frac{3}{4}$ of players that are 80 inches tall or shorter. The result should be `True` or `False`.

For this problem, we check whether the 75th percentile of the players' heights is less than or equal to 80 inches. The whole expression therefore evaluates to a boolean.

In [5]:
percentile(75, cal.column('height')) <= 80

False
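As a cross-check, the same question can be answered without `percentile` by computing the proportion directly. This is a minimal sketch in plain NumPy; the `heights` array simply repeats the values from the `cal` table, and the variable names are made up for illustration.

```python
import numpy as np

# Heights copied from the `cal` table above
heights = np.array([83, 71, 75, 90, 81, 77, 92])

# Proportion of players who are 80 inches tall or shorter
proportion_short = np.mean(heights <= 80)

# True only if at least three quarters of the players are that short
at_least_three_quarters = bool(proportion_short >= 0.75)
```

Only 3 of the 7 players are 80 inches or shorter, so `at_least_three_quarters` is `False`, which agrees with the percentile approach.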

#### (d) The number of players that are (strictly) more than one standard deviation above the mean height

Note: **strictly** means just `>` or `<`. So no `>=` or `<=`.

In [6]:
a = cal.column('height')
cal.where('height', are.above(np.mean(a) + np.std(a))).num_rows

2

#### (e) An array of all positions, sorted in increasing order of the average height for all players in that position.

In [7]:
t = cal.select('position', 'height')
t.group('position', np.mean).sort(1).column(0)

array(['Center', 'Forward', 'Point Guard', 'Guard', 'Power Forward'],
      dtype='<U13')

# 2. Experiment
The `cal` table described in the previous question has columns `name`, `position`, `class` and `height`. Read the following code used to test a hypothesis about the `cal` table, then answer the questions below to interpret it.

In [9]:
def diff_of_means(t):
    forward_mean = t.where('position', 'Forward').column('height').mean()
    guard_mean = t.where('position', 'Guard').column('height').mean()
    return forward_mean - guard_mean

differences = make_array()

for i in np.arange(10000):
    shuffled_heights = cal.sample(with_replacement = False).column('height')
    shuffled = cal.select('position').with_column('height', shuffled_heights)
    differences = np.append(differences, abs(diff_of_means(shuffled)))

#### (a) Circle all of the following hypotheses that could potentially be tested using the `differences` array

1. Among guards and forwards on this team, there is an association between height and position
2. Whether a player is a guard or a forward is like flipping a fair coin
3. The forwards on Cal's team have historically been taller than the guards, on average
4. The heights of guards and forwards are like random samples from the same distribution

#### Answer: 1, 4

#### (b) Circle one option among `(A)`, `(B)`, and `(C)` for each blank in this description of the hypothesis test:

In this (----i----), the null hypothesis states that the heights and positions of players are (----ii----)

| Options | i | ii |
| ----- | ---- | -----|
| (A) | permutation test | drawn at random from the same population |
| (B) | confidence interval test | paired up at random | 
| (C) | bootstrap resampling test | normally distributed |


#### Answer: 
1. (i) = A
2. (ii) = B

A permutation test shuffles the heights, which pairs them with positions at random. It does not draw new values at random from a population.

#### (c) What test statistic is being used to test the null hypothesis in (b)? Describe it in English, not code.

Answer: The **absolute** difference between the mean height of the forwards and the mean height of the guards

#### (d) Circle the letter for the chart below that could plausibly be a histogram of the `differences` array.

<img src = 'differences.jpg' width = 1200/>

#### Answer: C

**(A)** is ruled out: we are calculating the **absolute** difference, so the values can only be greater than or equal to 0.

**(B)** is ruled out: this histogram is centered at some positive value (around 7.5). Since we pair heights and positions at random, most of the simulated absolute differences should be close to 0.

**(C)** fits both requirements:

1. Under the null hypothesis, the difference in means produced by random pairing should be near 0.
2. Because of the absolute value, the histogram should look like a descending staircase that starts at 0, with a tail on the right side.

#### (e) Write a Python expression to compute a P-value for this test, using the null hypothesis you defined in part (b) and the test statistic you described in part (c)

Recall that the `P-value` is the probability, assuming the null hypothesis is true, of seeing a test statistic at least as extreme as the observed value.

In [11]:
np.count_nonzero(differences >= abs(diff_of_means(cal))) / 10000

0.8984
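The whole test can also be sketched without the `datascience` library. The version below is an illustration, not the exam's code: it uses plain NumPy, a fixed seed chosen only for reproducibility, and a helper named `abs_diff_of_means` that is made up here. The data are copied from the `cal` table.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen only for reproducibility

# Data copied from the `cal` table
positions = np.array(['Forward', 'Guard', 'Center', 'Power Forward',
                      'Point Guard', 'Forward', 'Guard'])
heights = np.array([83, 71, 75, 90, 81, 77, 92])

def abs_diff_of_means(heights, positions):
    """Absolute difference between the mean Forward and mean Guard height."""
    forward_mean = heights[positions == 'Forward'].mean()
    guard_mean = heights[positions == 'Guard'].mean()
    return abs(forward_mean - guard_mean)

observed = abs_diff_of_means(heights, positions)

# Shuffle the heights, keep the positions fixed, and recompute the statistic
repetitions = 10000
simulated = np.array([
    abs_diff_of_means(rng.permutation(heights), positions)
    for _ in range(repetitions)
])

# Proportion of simulated statistics at least as large as the observed one
p_value = np.count_nonzero(simulated >= observed) / repetitions
```

The observed statistic is 1.5 inches, and the simulated P-value should land close to the 0.8984 shown above, although the exact value depends on the random shuffles.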

#### (f) Using:
1. P-value cutoff of 5% to determine significance
2. The null hypothesis from part **(b)**
3. The test statistic from **(c)**, and
4. The null distribution from **(d)**

What should you conclude if the observed value of the test statistic is `6`?

# ============ Answer =============
Looking at the graph **(C)** 

<img src = 'P-value.jpg' width = 500/>

If the proportion of values above `6` were 5% or greater, then the bins that are greater than `6` should look like the following,

<img src = '5.jpg' width = 500/>

However, from the actual bins, we can see that the proportion of values that are greater than `6` is much less than 5%.

This means: 
1. We reject the null hypothesis, since less than 5% of the simulated values are greater than `6`
2. The data support an association between height and position among guards and forwards

# 3. Sampling
The histogram shows sample means from 2,500 random samples. Each sample contains 10,000 trip distances, measured in miles, drawn at random from the distances of 1.4 million trips from New York taxis in January 2016.

<img src = 'sample.jpg' width = 800/>

#### (a) What quantity is measured by the horizontal axis of this histogram?
1. Total miles for a single randomly chosen trip
2. Total miles for a single randomly chosen sample
3. Average miles for 10,000 randomly chosen trips **ANSWER**
4. Average miles for 2,500 randomly chosen samples
5. None of the above

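Recall from homework problems and labs: each value on the horizontal axis is the mean of one sample of 10,000 trips. The number of samples (2,500) only determines how many values go into the histogram, so we shouldn't be able to tell it just by looking at the histogram.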

#### (b) What quantity is measured by the vertical axis of this histogram?
1. Percent of trips per sample
2. Percent of trips per mile
3. Percent of trips per sample mean
4. Percent of sample means per trip
5. Percent of sample means per mile **ANSWER**

## =========== (b) Answer =============
From part **(a)**, we know that the horizontal axis is measured in miles. The area of a bar (height times width) must be a percentage or proportion, so the vertical axis must be in units of percent per mile.

The values being counted are sample means, not individual trips, so the vertical axis is percent of sample means per mile. The area of a bar is the percentage of sample means that fall in that bar's range of average distances (e.g. around 2.75 miles).

<img src = 'proportion.jpg' width = 500/>

## ======= End of (b) ============

**(c)** The percent of sample means represented by the tallest bar (with height about 1400) is closest to:

**(A)** 1.4 percent     

**(B)** 7 percent   

**(C)** 14 percent --> **ANSWER**

**(D)** 28 percent    

**(E)** 70 percent



## =========== (c) Answer  =============
Recall that the percentage in a bar is its area: height times width. The width of each bar is 0.01 miles, so the tallest bar holds about $1400 \times 0.01 = 14$ percent of the sample means.

## ====================================

**(d)** How would the height of the tallest bar change if we drew 10,000 random samples instead of 2,500?
1. Grow by about 2 times
2. Grow by about 4 times
3. Shrink by about 2 times
4. Shrink by about 4 times
5. Not much change --> **ANSWER**

## ============= (d) Answer =================
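Reasoning: The vertical axis is on a density scale (percent per mile), so the bar heights describe the shape of the distribution of sample means, not how many samples were drawn. 2,500 samples of 10,000 trips are already enough for the histogram to resemble a normal distribution. Drawing more samples makes the histogram smoother but won't change the heights significantly.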

## ==========================================

**(e)** How would the height of the tallest bar change if the bars all had width 0.02 instead of 0.01? The new histogram would be generated by `sample_means.hist(bins = np.arange(2.65, 2.9, 0.02))`.

1. Grow by about 2 times
2. Grow by about 4 times
3. Shrink by about 2 times
4. Shrink by about 4 times
5. Not much change --> **ANSWER**

## =============== (e) Answer =================
After the change:
1. The total area of the new histogram has to be the same as the total area of the old histogram
2. The area over any given range of values must also stay the same, so each wide bar has the same area as the 2 narrow bars it replaces

For example, we pair the first 2 bins (marked orange),

<img src = 'first_2.jpg' width = 500/>

We choose the height of the orange bar so that its area equals the combined area of the 2 blue bars.

<img src = 'orange.jpg' width = 500/>

Each wide bar is therefore about as tall as the average of the 2 narrow bars it replaces. As we estimated above, there appears to be no significant height change for the pair that contains the tallest bar.

# ========== End of (e) Answer ============
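Parts **(d)** and **(e)** can be checked with a small simulation. The sketch below does not use the taxi data. It assumes stand-in sample means drawn from a normal distribution centered at 2.75 with SD 0.03, and a helper named `tallest_bar` that is made up for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen only for reproducibility

def tallest_bar(sample_means, bin_width):
    """Height of the tallest bar of a density-scale histogram, in percent per mile."""
    bins = np.arange(2.65, 2.9, bin_width)
    density, _ = np.histogram(sample_means, bins=bins, density=True)
    return 100 * density.max()

# Stand-in sample means: normal, centered at 2.75 with SD 0.03 (assumed values)
means_2500 = rng.normal(2.75, 0.03, 2500)
means_10000 = rng.normal(2.75, 0.03, 10000)

base = tallest_bar(means_2500, 0.01)            # original histogram
more_samples = tallest_bar(means_10000, 0.01)   # part (d): 4 times as many samples
wider_bins = tallest_bar(means_2500, 0.02)      # part (e): bars twice as wide
```

Under these assumptions all three heights come out in the same range, which matches the "not much change" answers: on a density scale, neither the number of samples nor the bin width rescales the bars.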

#### (f) The standard deviation of taxi trip distances in the population of 1.4 million trips is closest to:
1. 0.01 miles
2. 0.03 miles
3. 0.1 miles
4. 0.3 miles
5. 3 miles --> **ANSWER**

## ============= (f) Answer ================

Recall the formula,

$$\text{SD of sample means} = \frac{\text{Population SD}}{\sqrt{n}}$$

Note that `n` is the sample size, which is `10,000` trips.

We want to solve for the population SD. However, we don't know the SD of the sample means.

Recall a property of the normal distribution: the points 1 SD away from the mean are the points of inflection. If we imagine drawing a smooth curve over the histogram, these are the points where the curve changes from bending downward to bending upward.

<img src = 'SD.jpg' width = 500/>

From the histogram above, the point of inflection on the right side is roughly at `2.75 + 0.03`. Thus, the SD of sample means is roughly `0.03`.

Solving for population SD,

In [1]:
0.03 * (10000)**0.5

3.0

The population SD is `3` miles!

## ========================== End of (f) Answer ===========================
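The formula used in **(f)** can be checked empirically. The sketch below is an illustration under stated assumptions: the "population" is a synthetic right-skewed array, not the real taxi data, and the sample size and number of samples are kept small so the code runs quickly.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen only for reproducibility

# Hypothetical skewed "population" of trip distances (not the real taxi data)
population = rng.exponential(scale=3.0, size=200_000)
population_sd = np.std(population)

sample_size = 400     # trips per sample (assumed, smaller than the 10,000 above)
num_samples = 2000    # number of random samples (assumed)

# Each row is one random sample; take the mean of each row
samples = rng.choice(population, size=(num_samples, sample_size))
sample_means = samples.mean(axis=1)

predicted_sd = population_sd / np.sqrt(sample_size)  # from the formula
observed_sd = np.std(sample_means)                   # from the simulation
```

The two values should agree to within a few percent, even though the population itself is far from normal.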

**(g)** In order to construct a 95% confidence interval for the mean trip distance in the population, such that the width of the interval is 0.4 miles or less, the minimum sample size required is closest to:

1. 9
2. 90
3. 900 --> **ANSWER**
4. 9,000
5. 90,000

## ============= (g) Answer =============
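Recall that the sample means are roughly normally distributed, so **about 95% of sample means are within 2 SDs of the population mean**. This means a 95% confidence interval is about 4 SDs of the sample means wide. For the interval to be 0.4 miles wide, the SD of the sample means must be: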

$$ \frac{0.4 \text{ miles}}{4} = 0.1 \text{ miles}$$

Now using the following formula,

$$\text{SD of sample means} = \frac{\text{Population SD}}{\sqrt{n}}$$

and solving for `n`, we have,

$$ n = \left(\frac{\text{Population SD}}{\text{SD of sample means}}\right)^2 $$

Recall from problem **(f)** that we found the population SD to be about 3 miles.

$$ n = \left(\frac{\text{3 miles}}{\text{0.1 miles}}\right)^2 $$

In [5]:
(3 / 0.1)** 2

900.0

## ========= End of (g) ============
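The two steps in **(g)** can be wrapped in a small helper to see how the required sample size reacts to the target width. This is only a sketch: `min_sample_size` is a made-up name, and it relies on the same "95% interval is about 4 SDs wide" approximation used above.

```python
def min_sample_size(population_sd, interval_width):
    """Sample size at which a 95% interval (about 4 SDs wide) has the given width."""
    sd_of_sample_means = interval_width / 4
    return (population_sd / sd_of_sample_means) ** 2

n_for_width_04 = min_sample_size(3, 0.4)   # the case in part (g)
n_for_width_02 = min_sample_size(3, 0.2)   # half the width
```

Halving the width of the interval requires about 4 times as many trips, because the sample size enters the formula through a square root.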

# 4. Linear Regression
This scatter plot of a sample of 1,000 trips for New York taxis in January 2016 compares distance and cost. The regression line is shown. Two trips of the same length can vary in cost because of waiting times, special fees, taxes, tolls, tips, discounts, etc.

<img src = 'linear_regression.jpg' width = 500/>

In [None]:
np.average(t.column('Distance'))  # 3 (miles)

In [None]:
np.std(t.column('Distance'))  # 2 (miles)

In [None]:
np.average(t.column('Cost'))  # 13 (dollars)

In [None]:
np.std(t.column('Cost'))  # 6 (dollars)

In [None]:
correlation(t, 'Distance', 'Cost')  # 0.9

**(a)** Convert a trip's total cost of 9 dollars to standard units

## ========== (a) Answer ============
Recall that to convert a value `x` to standard units `z`, the formula is:

$$ z = \frac{x - \text{mean(x)}}{\text{SD(x)}} $$

In [6]:
(9 - 13) / 6

-0.6666666666666666

## ======= End of (a) ============

**(b)** What is the slope of the regression line for this sample in dollars per mile?

## ========= (b) Answer =========
Recall the slope formula,

$$ \text{slope} = r \times \frac{\text{SD of y}}{\text{SD of x}} $$

In [7]:
0.9 * 6 / 2

2.7

## ============ End of (b) ================

#### (c) What is the intercept of the regression line for this sample in dollars?

## ====== (c) Answer ==========
Recall the intercept formula,

$$ \text{intercept} = \text{mean}(y) - \text{slope} \times \text{mean}(x) $$

In [9]:
13 - (2.7) * 3

4.899999999999999

## ======== End of (c) ===========

**(d)** If instead we fit a regression line to estimate distance in miles from total cost in dollars, what would be the slope of that line in miles per dollar? Write **not enough info** if it's impossible to say.

## =========== (d) Answer ==============

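Similar to part **(b)**, but this time we swap the roles of `x` and `y`, since we now predict the distance in miles from the cost. The correlation stays the same, and the slope becomes `r` times the SD of distance divided by the SD of cost.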

In [10]:
0.9 * 2 / 6

0.3

## =========== End of (d) ===========
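Parts **(b)** through **(d)** all follow from the same two formulas, so they can be collected in one helper. The function name `regression_line` is made up for this sketch; the numbers are the summary statistics given above.

```python
def regression_line(mean_x, sd_x, mean_y, sd_y, r):
    """Slope and intercept of the regression line that predicts y from x."""
    slope = r * sd_y / sd_x
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Predict cost (dollars) from distance (miles): parts (b) and (c)
cost_slope, cost_intercept = regression_line(3, 2, 13, 6, 0.9)

# Predict distance (miles) from cost (dollars): part (d), x and y swapped
distance_slope, _ = regression_line(13, 6, 3, 2, 0.9)
```

Note that `distance_slope` is not `1 / cost_slope`: the two regression lines are different lines unless the correlation is exactly 1 or -1.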

#### Choose either `True`, `False`, or **Not enough info** to describe the following statement:

**(e)** The total cost values in this sample are normally distributed --> **FALSE**

For a normal distribution, most of the values are near the center, and values become less frequent in both directions away from it (think of the bell-shaped histogram). 

If we look only at the `y` values of the scatter plot, the majority of the points are in the low range of cost, and the points become scarcer as the cost increases. The distribution is skewed to the right rather than symmetric, so the histogram of total cost would look more like the following:

<img src = 'P-value.jpg' width = 500/>

**(f)** All of the total cost values in this sample are within 3 standard deviations of the mean --> **FALSE**

3 SDs of the total cost is $3 \times 6 = 18$ dollars. The range `mean` $\pm$ 3 SDs is therefore $13 \pm 18$, which runs from $-5$ to $31$ dollars. Meanwhile, there are data points at around 35 to 40 dollars.

**(g)** At least 88% of the total cost values in this sample are within 3 standard deviations of the mean --> **TRUE**

Recall Chebyshev's inequality: regardless of the shape of the distribution, the proportion of values in the range `average` $\pm$ $z$ SDs is at least $1 - \frac{1}{z^2}$.

Thus, the range within 3 SDs of the mean covers at least:

In [2]:
1 - (1/3**2)

0.8888888888888888
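Chebyshev's bound holds for any list of numbers, which a quick experiment can illustrate. The costs below are synthetic right-skewed values generated for this sketch, not the taxi sample.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen only for reproducibility

# Hypothetical right-skewed costs in dollars (not the real taxi data)
costs = rng.exponential(scale=6.0, size=1000) + 5

mean, sd = np.mean(costs), np.std(costs)

# Proportion of values within 3 SDs of the mean
within_3_sd = np.mean(np.abs(costs - mean) <= 3 * sd)

# Chebyshev's lower bound for z = 3
chebyshev_bound = 1 - 1 / 3**2
```

The observed proportion is always at least the bound. For most data sets it is much higher, since Chebyshev's inequality is a worst-case guarantee.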

**(h)** The residual costs have a similar average magnitude for short trips (1 mile) and long trips (5+ miles) --> **FALSE**

Recall that a `residual` is `actual` - `predicted`. If we look at the scatter plot, we can see that:

1. For short trips, the data points are relatively close to the prediction (the regression line)
2. For long trips, the data points are relatively far from the line.

The typical size of the residuals grows with distance, so the statement is False.

**(i)** You compute a 95% confidence interval from this sample to estimate the height (fitted value) of the population regression line at 6 miles. Which one of the following could plausibly be the result?

**(A)** 5 to 7

**(B)** 7 to 19

**(C)** 12 to 14

**(D)** 15 to 35

**(E)** 24 to 26 --> **ANSWER**

## ============= (i) Answer ==============
If we look at the scatter plot, at 6 miles the regression line is at about 25 dollars. This rules out options A, B, and C.

Between D and E: the individual data points at 6 miles range from about 15 to 35 dollars, but the confidence interval is for the height of the regression line, not for the cost of an individual trip. The fitted value is an average, and with 1,000 trips in the sample it changes very little from one bootstrap resample to the next. The interval should therefore be narrow and centered on the line, which is 24 to 26.

# 5. Classification
You want to predict whether a final survey response comes from a first-year student (class 1) or not (class 0) based on responses to 2 questions. The average responses from a random sample of 200 surveys are below.

#### Survey Questions:
1. What fraction of **lectures** did you attend?
2. What fraction of the **text** did you read?

| Class | Count | Lecture Average | Text Average|
| --- | --- | --- | --- |
| 1: First Year | 80 | 75% | 64% |
| 0: Other | 120 | 67% | 68% |

**(a)** A `constant` classifier is one that always guesses the same class label, regardless of the example attributes. What's the accuracy on this sample of the best constant classifier for predicting the class?

**(A)** 50%

**(B)** 60% --> **ANSWER**

**(C)** 70%

**(D)** 80%

**(E)** 90%

There are a total of 200 students.

1. If the classifier only classifies `First Year`, this means the classifier would be correct 80/200 of the time.

In [3]:
80 / 200

0.4

2. If the classifier only classifies `Other`, this means the classifier would be correct 120/200 of the time.

In [4]:
120 / 200

0.6

The classifier that always guesses `Other` is the best, since it is correct more often (60%) than the one that always guesses `First Year` (40%). Thus, the accuracy is 60%.
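The same reasoning works for any number of classes: the best constant classifier always guesses the most common label. A minimal sketch, with the counts copied from the table above and variable names made up for illustration:

```python
# Class counts copied from the survey table
class_counts = {'1: First Year': 80, '0: Other': 120}
total = sum(class_counts.values())

# The best constant guess is the most common class
best_label = max(class_counts, key=class_counts.get)
best_accuracy = class_counts[best_label] / total
```

Here `best_label` is `'0: Other'` and `best_accuracy` is `0.6`.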

**(b)** Among the following, what is the best reason to expect that a nearest-neighbor classifier that uses this sample as a training set will have higher accuracy on a held-out test set than any constant classifier?

1. The test set may have a different distribution of classes than this sample
2. A nearest-neighbor classifier is designed to generalize to unseen examples
3. A nearest-neighbor classifier can predict different classes for different examples
4. The attributes (lecture and text) are associated with each other.
5. The attributes (lecture and text) are both associated with the class --> **ANSWER**

## ======= (b) Answer ========
In other words, this problem asks why it makes sense to use `lectures` and `text` for the classifier as opposed to using a constant classifier that only guesses `First Year` or `Other`. The only answer that makes sense is that the fraction of the lecture attended and the textbook read has something to do with whether a student is a `First Year` or not. Thus, the answer is `5`.

## ========= End of (b) =========

**(c)** 2 roommates always attended exactly the same lectures. One read $\frac{1}{2}$ the textbook, and the other read $\frac{9}{10}$. What is the distance between these 2 roommates used by a nearest-neighbor classifier that includes as attributes both the fraction of lectures attended and the fraction of text read?

## ======== (c) Answer =========
Recall the Euclidean distance formula for 2 points in a 2-D system:

$$ \text{Euclidean Distance} = \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2} $$

Now take the first roommate with a subscript `1` and the second roommate with the subscript `2`.
1. $x_1$ = $x_2$
2. $y_1$ = 0.5
3. $y_2$ = 0.9

Solving for the Euclidean distance, we obtain 0.4.
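The calculation can be sketched in NumPy as follows. The shared lecture fraction is not given in the problem, so the value `0.8` below is a placeholder; any shared value gives the same distance because the first term of the formula is 0.

```python
import numpy as np

def distance(point_1, point_2):
    """Euclidean distance between two points given as sequences of attributes."""
    return np.sqrt(np.sum((np.array(point_1) - np.array(point_2)) ** 2))

lecture_fraction = 0.8  # hypothetical; both roommates attended the same lectures

# Each point is (fraction of lectures attended, fraction of text read)
roommate_distance = distance([lecture_fraction, 0.5], [lecture_fraction, 0.9])
```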

**(d)** For the small training set of 9 examples shown below, how will a `k`-nearest neighbor classifier label each of the 2 test examples `(i)` and `(ii)`? 

<img src = 'fraction.jpg' width = 500/>

For each example and each value of `k`, write either `0` or `1`

If it is impossible to determine the predicted label because of tied distances, write `impossible`.

| k-nearest | Prediction for (i) | Prediction for (ii) |
| --- | --- | --- |
| 1-nearest | 1 | 0 |
| 3-nearest | 0 | 1 |
| 5-nearest | 0 | 0 |

**(e)** Your nearest-neighbor classifier is correct $\frac{4}{5}$ of the time on the test, but it's so slow that you can only use it for $\frac{3}{4}$ of test examples. The rest of the time you use a constant classifier that always guesses **"1: First-year"**, which is correct only $\frac{2}{5}$ of the time on the test set. For a randomly chosen test example,

1. What is the chance that it will be classified correctly?
2. What is the chance that you used your nearest-neighbor classifier, given that it was classified correctly?

## ======= (e) Answer ========

### Question 1

$$ \frac{4}{5} \times \frac{3}{4}  + \frac{1}{4} \times \frac{2}{5}$$

$$ = \frac{14}{20}$$

$$ = 70\% $$

### Question 2

1. P(KNN and correct) = the probability that we used the nearest-neighbor classifier and the classification is correct
2. P(correct) = the probability that the classification is correct, regardless of method

Thus, the chance that we used the nearest-neighbor classifier, given that the classification was correct, is:

$$ \frac{P(\text{KNN and correct})}{P(\text{correct})} $$

$$ = \frac{\frac{4}{5} \times \frac{3}{4}}{\frac{14}{20}} = \frac{12}{14} $$

In [5]:
12 / 14

0.8571428571428571
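Both questions in **(e)** can be reproduced with exact fractions, which avoids rounding. This sketch uses `fractions.Fraction` from the Python standard library; the variable names are made up for illustration.

```python
from fractions import Fraction

p_knn = Fraction(3, 4)                    # chance the nearest-neighbor classifier is used
p_constant = 1 - p_knn                    # chance the constant classifier is used
p_correct_given_knn = Fraction(4, 5)
p_correct_given_constant = Fraction(2, 5)

# Question 1: total probability of a correct classification
p_correct = p_knn * p_correct_given_knn + p_constant * p_correct_given_constant

# Question 2: Bayes' rule
p_knn_given_correct = p_knn * p_correct_given_knn / p_correct
```

The results are `Fraction(7, 10)` and `Fraction(6, 7)`, matching the 70% and 0.857 computed above.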