# <span style="color:purple">Prerequisite Knowledge Check</span>

This is a short self-assessment to help you decide whether you have the prerequisite knowledge for the course. It is not a timed test, and there are no passing or failing scores. Feel free to use the internet (especially existing StackOverflow answers and package documentation) to help you get through this assessment. If you can answer all of these questions, you have the prerequisite knowledge to be successful in the course.

## <span style="color:red">Question 1.1</span>

Below is some sample data about students selling different types of fruit. `sales` is a list of lists.  The lists in `sales` contain the name of the student who sold the fruit and the type of fruit which was sold.  So if Bernard sold an apple, then `(["Bernard","Apple"] in sales)==True`. Turn the data into a pandas dataframe and then use pandas to answer the following:

* How many apples did Anna sell? 79

* Who sold more Watermelons: Bernard or Daisy?  Daisy sells more

* Who sold the most fruit? Daisy

* Which fruit was sold the most? Peach

In [1]:
import numpy as np
np.random.seed(0)
N = 1000
students = ['Anna','Bernard','Charlie','Daisy']
fruits = ['Apple','Peach','Watermelon']
sales = [ [np.random.choice(students), np.random.choice(fruits)] for j in range(N)]
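
As a minimal sketch of the general approach (not the answer to the question), the snippet below builds a dataframe from a tiny hand-written list in the same `[student, fruit]` shape as `sales`. The names `toy_sales`, `toy_df`, and the column labels `student` and `fruit` are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical toy data in the same [student, fruit] shape as `sales`.
toy_sales = [
    ["Anna", "Apple"],
    ["Anna", "Peach"],
    ["Bernard", "Apple"],
    ["Anna", "Apple"],
    ["Daisy", "Watermelon"],
]

# Column names are chosen here for illustration.
toy_df = pd.DataFrame(toy_sales, columns=["student", "fruit"])

# Count rows for one (student, fruit) pair with a boolean mask.
anna_apples = ((toy_df["student"] == "Anna") & (toy_df["fruit"] == "Apple")).sum()

# Cross-tabulate students against fruits to see every count at once.
counts = pd.crosstab(toy_df["student"], toy_df["fruit"])

# value_counts().idxmax() gives the most frequent label in a column.
top_seller = toy_df["student"].value_counts().idxmax()

print(anna_apples, top_seller)
print(counts)
```

The same mask, `crosstab`, and `value_counts` calls should carry over to the full `sales` list once it is wrapped in a dataframe.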

## <span style="color:red">Question 1.2</span>

Shown below is data relating to the position of a car in meters. The data was recorded at the indicated times below (so at time = 1, the car was 1 meter from the starting position).  Load the data as a numpy array. Calculate the average speed at which the car was traveling between time points.  Do this with a loop and again using array slicing.   1.7557

Hint: Speed = (Distance Traveled)/(Time To Travel Distance)

positions: [0, 1, 1.2, 1.8, 2.0, 1.7, 1.5, 1.9, 2.1, 2.3]

times:  [0, 1, 1.5, 1.9, 2.3, 2.7, 3.8, 4.8, 5.4, 7.0]
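
The hint can be sketched on a small made-up example. The arrays below are hypothetical values, not the data from the question, and the variable names are chosen for illustration. The loop and the slicing version compute the same thing: distance traveled divided by elapsed time for each pair of consecutive time points.

```python
import numpy as np

# Hypothetical toy measurements (not the data from the question).
positions = np.array([0.0, 2.0, 6.0, 7.0])   # meters
times = np.array([0.0, 1.0, 3.0, 5.0])       # seconds

# Loop version: one speed per pair of consecutive time points.
speeds_loop = []
for i in range(1, len(times)):
    distance = positions[i] - positions[i - 1]
    elapsed = times[i] - times[i - 1]
    speeds_loop.append(distance / elapsed)

# Slicing version: subtract each array from a copy shifted by one element.
speeds_slice = (positions[1:] - positions[:-1]) / (times[1:] - times[:-1])

print(speeds_loop)
print(speeds_slice)
```

`np.diff(positions) / np.diff(times)` is an equivalent shorthand for the slicing version.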

## <span style="color:red">Question 1.3</span>

Generate a random 100-by-100 2-dimensional array of integers using `numpy.random.randint` ranging from 1 to 100.  To ensure your answer is the same as ours, set the random seed to `19920908`. 

Which row has the largest mean? 36

Which column has the smallest sum?  27

Which is the first column (from left to right) to have a sum exceeding 600?

Answer these questions without the use of a loop.

Hint: The `argmin`, `argmax`, and `argwhere` functions may be useful.
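
A small sketch of how these three functions behave, using a hand-written 3-by-3 array (`toy`, hypothetical) and an arbitrary threshold of 12 rather than the array and threshold from the question:

```python
import numpy as np

# Hypothetical toy array, small enough to check by hand.
toy = np.array([[1, 9, 2],
                [4, 5, 6],
                [7, 0, 3]])

row_means = toy.mean(axis=1)   # one mean per row
col_sums = toy.sum(axis=0)     # one sum per column: 12, 14, 11

# Index of the row with the largest mean.
best_row = np.argmax(row_means)

# Index of the column with the smallest sum.
smallest_col = np.argmin(col_sums)

# argwhere returns the indices where the condition holds, in order;
# the first entry is the leftmost column whose sum exceeds the threshold.
first_over = np.argwhere(col_sums > 12)[0, 0]

print(best_row, smallest_col, first_over)
```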


In [2]:
import numpy as np
import pandas as pd

np.random.seed(19920908)

arr = np.arange(0, 100*100, 1)
for i in range (len(arr)):
    arr[i] = np.random.randint(1,100)
arr = arr.reshape(100,100)

means = arr.mean(axis=1)

print(np.argmin(means))

27


## <span style="color:red">Question 1.4</span>

Newton's method is a numerical method for finding the roots of a function.  Starting from an initial guess $x_0$, it repeatedly applies the update

$$ x_{n+1} = x_{n} - \dfrac{f(x_n)}{f'(x_n)} $$
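
As a quick illustration of the update rule, here is a hand-worked sketch on a different function, $g(x) = x^2 - 2$ with $g'(x) = 2x$ and starting point $x_0 = 1$ (all chosen here for illustration, not taken from the question):

$$ x_1 = 1 - \dfrac{1^2 - 2}{2 \cdot 1} = 1.5, \qquad x_2 = 1.5 - \dfrac{1.5^2 - 2}{2 \cdot 1.5} \approx 1.4167 $$

The iterates approach $\sqrt{2} \approx 1.4142$, the positive root of $g$.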

Below, I've written a function to try to use Newton's method to find the two roots of the function $f(x) = \exp(-x)\ln(x+1) - 0.25$.

My function should:

* Terminate when $\vert f(x_n) \vert < 1\times10^{-8}$ or when the number of iterations exceeds 1000.

* Take as its first argument the starting point for the method (i.e., $x_0$)

* Take as its second argument the function $f$

* Take as its third argument the function $f'$

My code, as it stands, does not return the right answer.  Look through the code and debug the function so that it returns answers similar to `scipy.optimize.newton`.  Please don't completely rewrite the code (I spent a long time on it and want to learn what I messed up!).


Don't worry about `f` and `fprime`.  I've ensured those are correct.


f = lambda x: np.exp(-x)*np.log(x+1) - 0.25
fprime = lambda x: -np.exp(-x)*np.log(x+1) + np.exp(-x)/(x+1)

def broken_newtons_method(x0,f, fprime, tol = 1e-8, maxiter = 1000):
    
    res = float('inf')
    iters = 0
    x_n = x0
    
    while (res<tol) and (iters<maxiter):
        
        x_n -= f(x_n)\fprime(x_n)
        
        res = abs(f(x_n))
        
    return x_n
        
    
print('My algorithm, starting at 0.01, yields answer: ',broken_newtons_method(0.01,f,fprime))
print('My algorithm, starting at 2, yields answer: ', broken_newtons_method(2,f,fprime))
        

#compare with scipy
from scipy.optimize import newton

print('scipy.optimize.newton starting at 0.01 returns ',newton(f,0.01))
print('scipy.optimize.newton starting at 2 returns ', newton(f,2))

## <span style="color:red">Question 1.5</span>

Estimate through simulation the probability that a baseball player with a 0.300 batting average (that is, one who gets 300 hits for every 1000 at bats) gets fewer hits than a baseball player with a 0.275 batting average when each has 45 at bats.

Hint: Use the binomial distribution from `scipy.stats`.
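
One way to set up such a simulation is sketched below. The helper `estimate_prob_fewer` and its parameters are hypothetical names introduced here, and the demo uses two fair coins flipped 10 times each rather than the batting averages from the question. It assumes the two players are independent.

```python
import numpy as np
from scipy.stats import binom

def estimate_prob_fewer(p_a, p_b, n_trials, n_sims=100_000, seed=0):
    """Estimate P(A < B) for independent A ~ Binomial(n_trials, p_a)
    and B ~ Binomial(n_trials, p_b) by simulation (illustrative sketch)."""
    # Draw n_sims success counts for each player, with different seeds
    # so the two samples are not identical streams.
    a = binom.rvs(n_trials, p_a, size=n_sims, random_state=seed)
    b = binom.rvs(n_trials, p_b, size=n_sims, random_state=seed + 1)
    # The fraction of simulations in which A has strictly fewer successes.
    return np.mean(a < b)

# Hypothetical demo: two fair coins, 10 flips each.
estimate = estimate_prob_fewer(0.5, 0.5, 10)
print(estimate)
```

Increasing `n_sims` reduces the Monte Carlo error of the estimate at the cost of run time.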


## <span style="color:red">Question 1.6</span>

The file `data.csv` lists the soccer players who participated in the 2022 Soccer World Cup. It contains attributes such as age, overall performance score, and wage. Load this dataset as a pandas dataframe and use pandas methods to find the `Nationality`, `Wage`, `Value`, `Skill Moves`, and `Overall` of `L. Messi`.

In [6]:
import pandas as pd

# Load the data and index it by player name so rows can be looked up by label.
df = pd.read_csv("data.csv").set_index("Name")

# Requested attributes: Nationality, Wage, Value, Skill Moves, Overall.
messi = df.loc["L. Messi"]

# Select fields by column label rather than by position, so the lookup does
# not depend on column order and no FutureWarning filter is needed.
nationality = messi["Nationality"]
wage = messi["Wage"]
value = messi["Value"]
skill_moves = messi["Skill Moves"]
overall = messi["Overall"]
#print(messi)
print("Nationality: ", nationality, ", Wage: ", wage, ", Value: ", value, ", Skill Moves: ", skill_moves, ", Overall: ", overall)

ID                                                                    158023
Age                                                                       35
Photo                       https://cdn.sofifa.net/players/158/023/23_60.png
Nationality                                                        Argentina
Flag                                     https://cdn.sofifa.net/flags/ar.png
Overall                                                                   91
Potential                                                                 91
Club                                                     Paris Saint-Germain
Club Logo                             https://cdn.sofifa.net/teams/73/30.png
Value                                                                   €54M
Wage                                                                   €195K
Special                                                                 2190
Preferred Foot                                                          Left

## <span style="color:red">Question 1.7</span>

The feature `Overall` indicates player's overall performance score, which normally ranges from 0 to 100. Plot the smoothed distribution of `Overall`. Your plot must also include three vertical lines: one for mean, one for median, and one for 99th percentile of the distribution. Your plot must have a legend.

In [None]:
import pandas as pd
import numpy as np
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
import matplotlib.pyplot as plt

df = pd.read_csv("data.csv")

# Smoothed distribution: a kernel density estimate instead of a histogram
# (pandas relies on scipy for the KDE).
ax = df["Overall"].plot.kde(color='#AABBCC')

mean_value = df['Overall'].mean()
median_value = df['Overall'].median()
percentile_99 = np.percentile(df['Overall'], 99)

# Give each vertical line a label so that the legend picks it up.
ax.axvline(mean_value, color='r', label=f'Mean: {mean_value:.2f}')
ax.axvline(median_value, color='g', label=f'Median: {median_value:.2f}')
ax.axvline(percentile_99, color='purple', label=f'99th Percentile: {percentile_99:.2f}')

ax.legend()

ax.set_title('Overall')
ax.set_xlabel('Overall Score')
ax.set_ylabel('Density')

plt.show()





## <span style="color:red">Question 1.8</span>

What are the `Name`, `Nationality`, `Wage`, `Value`, `Skill Moves`, and `Overall` of the players in the top 0.06% of the `Overall` distribution?
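
One possible way to select a top fraction of rows is with a quantile threshold. The sketch below uses a made-up five-row dataframe (`toy`) and a hypothetical cutoff of the top 20%, not the real data or the 0.06% from the question.

```python
import pandas as pd

# Hypothetical toy data; only the column names mirror the real file.
toy = pd.DataFrame({
    "Name": ["A", "B", "C", "D", "E"],
    "Overall": [60, 70, 80, 90, 95],
})

# Hypothetical cutoff: keep the top 20% by Overall.
top_fraction = 0.20
threshold = toy["Overall"].quantile(1 - top_fraction)

# Keep the rows at or above the threshold, then pick the columns of interest.
top_players = toy.loc[toy["Overall"] >= threshold, ["Name", "Overall"]]

print(threshold)
print(top_players)
```

Note that `quantile` interpolates between data points by default, so the threshold need not be a value that occurs in the column.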

## <span style="color:red">Question 1.9</span>

Attributes `Value` and `Wage` are stored as text, so they look categorical, but we need them as pure numbers. Do the following:
-   Remove any white space and the "€" symbol from their entries.
-   Some entries end in "K" and some in "M". Multiply the "K" entries by 1e+3 and the "M" entries by 1e+6.
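
The two steps above can be sketched as follows. The helper `euro_to_number` is a hypothetical name, the sample strings are made up, and the sketch assumes every entry looks like `€54M`, `€195K`, or a plain number.

```python
import pandas as pd

def euro_to_number(text):
    """Convert strings like '€54M' or '€195K' to floats.

    Illustrative sketch: assumes an optional '€', optional white space,
    a number, and an optional trailing 'K' or 'M'.
    """
    cleaned = text.replace("€", "").replace(" ", "")
    multiplier = 1.0
    if cleaned.endswith("K"):
        multiplier, cleaned = 1e3, cleaned[:-1]
    elif cleaned.endswith("M"):
        multiplier, cleaned = 1e6, cleaned[:-1]
    return float(cleaned) * multiplier

# Hypothetical sample entries in the same style as the Value/Wage columns.
toy = pd.Series(["€54M", "€195K", " €0 ", "€1.5M"])
numbers = toy.map(euro_to_number)
print(numbers.tolist())
```

Applying the same function with `df["Value"].map(...)` and `df["Wage"].map(...)` would convert the real columns, provided their entries follow the assumed format.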

---
# <span style="color:orange">Section 2: Statistics Questions</span>

Students should be familiar with the following concepts: 
- Events and probability   
- Discrete and continuous random variables 
- Probability mass, probability density, and cumulative distribution functions 
- Joint, marginal, and conditional probability distributions
- Prior and posterior probability, Bayes rule 
- Maximum likelihood estimation
- Central limit theorem and normal approximation
- Confidence intervals 
- Mean, median, variance, standard deviation 
- Linear regression 

---

## <span style="color:red">Question 2.1</span>

What is the correct interpretation of the 95% confidence interval?

A. There is a 95% probability the true mean lies outside your interval.

B. The probability of the alternative hypothesis being true is 95%.

C. There is a 95% probability that the mean is the midpoint of the interval

D. Upon repeated construction, the long-term relative frequency of 95% confidence intervals containing the true mean is 95%.

## <span style="color:red">Question 2.2</span>

Bill James is credited with creating sabermetrics (baseball analytics). In one of his early "Baseball Abstracts", Bill writes:

"If you see 15 games a year, there is a 40% chance that a .275 hitter will have more hits than a .300 hitter."

Bill refers to players by their *batting average* (i.e., .275 means the hitter will get a hit 275 times for every 1000 times they come to bat).  The actual probability is considerably smaller than that. Bill wrote this in the late 1970s, before computers were widely available to run the kind of simulations we can today.  It is quite plausible that Bill used a Normal approximation to arrive at this conclusion.

Assuming that every batter appears 3 times per game for 15 games (for a total of 45 at bats), use a Normal approximation to estimate the probability that a .275 batter hits more hits than a .300 batter.  Assume the batters are independent.  You can use python to evaluate any complicated functions, but do not estimate the probability via simulation.
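
A sketch of the general technique, under stated assumptions: for independent counts $A \sim \text{Bin}(n, p_a)$ and $B \sim \text{Bin}(n, p_b)$, the difference $A - B$ is treated as Normal with mean $n(p_a - p_b)$ and variance $n[p_a(1-p_a) + p_b(1-p_b)]$. The function name and the demo values below are hypothetical, and no continuity correction is applied.

```python
from statistics import NormalDist

def normal_approx_prob_more(p_a, p_b, n):
    """Approximate P(A > B) for independent A ~ Bin(n, p_a), B ~ Bin(n, p_b).

    Sketch: the difference D = A - B is treated as Normal with
    mean n*(p_a - p_b) and variance n*(p_a*(1-p_a) + p_b*(1-p_b)).
    No continuity correction is applied (an assumption of this sketch).
    """
    mean = n * (p_a - p_b)
    sd = (n * (p_a * (1 - p_a) + p_b * (1 - p_b))) ** 0.5
    # P(D > 0) = 1 - CDF(0) under the approximating Normal.
    return 1 - NormalDist(mean, sd).cdf(0)

# Hypothetical example: two equally skilled players, 100 attempts each.
equal_case = normal_approx_prob_more(0.5, 0.5, 100)
print(equal_case)
```

Because the counts are discrete, a continuity correction (evaluating the CDF at 0.5 instead of 0) would change the result somewhat; which version to report is a judgment call.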

## <span style="color:red">Question 2.3</span>

A diagnostic test has a 99% chance of correctly labeling a person as sick if they are truly sick.  The probability that the test labels someone as sick, regardless of disease status, is 50%.  Approximately 1% of the population has the disease.

a) What is the joint probability of having the disease and a positive test?

b) What is the marginal probability that a test comes back positive?

c) What is the conditional probability that a person has the disease if their test comes back positive?

## <span style="color:red">Question 2.4</span>

Why might someone want to know the median rather than the mean of their data?

## <span style="color:red">Question 2.5</span>

You obtain a dataset with $n$ rows and $n$ columns (the same number of rows and columns). Each column houses numeric data (no categories, just numbers). You're asked to perform a linear regression on this data (the outcome is in a different file; it is not one of the $n$ columns).  Assume that the data matrix is full rank.

What will the $R^2$ of this regression be?

---
# <span style="color:orange">Section 3: Linear Algebra Questions</span>

For the class we require some basic linear algebra:
- Vectors, matrices, inner products, outer products, matrix multiplication
- Eigenvectors, eigenvalues, rank 
- Matrix inversion 
- Norms 

Gilbert Strang's book (http://math.mit.edu/~gs/linearalgebra/) might be a good refresher should you need it. Here (http://vmls-book.stanford.edu/vmls.pdf) is another book which may cover the topics you need, though we have not verified its quality.  If you have taken MATH 1600 and/or AMATH 2811, that should be enough.

---

## <span style="color:red">Question 3.1</span>

If $A$ is an $n \times n$ matrix, and $A$ has full rank, is $A$ invertible?

## <span style="color:red">Question 3.2</span>

If a matrix, $A$, is positive definite, which of the following is false:

A) $\mathbf{x}^T A \mathbf{x} > 0$ for every nonzero vector $\mathbf{x}$

B) Every element of $A$ is positive

C) The eigenvalues of $A$ are positive

D) A is symmetric

## <span style="color:red">Question 3.3</span>

Let $x$ and $y$ be vectors such that $\vert x \vert = 3$ and $\vert y \vert = 4$.  Use the triangle inequality to put an upper bound on $\vert x+y \vert$, the length of $x+y$.


## <span style="color:red">Question 3.4</span>

Let $A$ be a matrix, and let $\mathbf{x},\mathbf{y}$ be vectors.  If $A\mathbf{x} = [4,3,2]^T$ and $A\mathbf{y} = [-1,2,0]^T$ what is $A(2\mathbf{x} - \mathbf{y})$?