\item \textbf{Definition (Expectation and The Law of Large Numbers)}

If you roll a fair die, there are six possible outcomes, $1,2,3,4,5,6$, which are equally likely. The average of these six numbers is $(1+2+3+4+5+6)/6 = 7/2$. We might say that ``on average'', if you roll a die, the outcome should be $7/2$. That is of course absurd for a single die roll, but it becomes increasingly true of the \emph{sample mean} (i.e., the sum of the rolled values, divided by the number of rolls) as we perform more rolls.

To express this mathematically, suppose we roll the die $N$ times and call the $n$th outcome $x_n$ (so $x_n$ is one of $1,2,3,4,5,6$). If we compute the sample mean of the $N$ resulting numbers,

$$\frac{1}{N}\sum_{n=1}^{N}x_n,$$

we expect the result to be near $3.5$ if $N$ is large.

This can be quantified more precisely by the law of large numbers, which says (stated informally) that the sample mean is increasingly likely to be close to the expected value as $N$ grows large. In fact, for any fixed tolerance, the probability that the sample mean differs from the expected value by more than that tolerance approaches zero as $N \to \infty$.

Let's consider rolling the die $2$ times. There are $36$ possible outcomes, as follows:

$$\begin{aligned}
(1,1)\qquad(2,1)\qquad(3,1)\qquad(4,1)\qquad(5,1)\qquad(6,1) \\
(1,2)\qquad(2,2)\qquad(3,2)\qquad(4,2)\qquad(5,2)\qquad(6,2) \\
(1,3)\qquad(2,3)\qquad(3,3)\qquad(4,3)\qquad(5,3)\qquad(6,3) \\
(1,4)\qquad(2,4)\qquad(3,4)\qquad(4,4)\qquad(5,4)\qquad(6,4) \\
(1,5)\qquad(2,5)\qquad(3,5)\qquad(4,5)\qquad(5,5)\qquad(6,5) \\
(1,6)\qquad(2,6)\qquad(3,6)\qquad(4,6)\qquad(5,6)\qquad(6,6) \\
\end{aligned}$$
Each outcome is equally likely, with probability $1/36$. Now let's look at the \emph{sample mean} of the two rolls in each case. So for example, for the first outcome, both rolls were $1$, so the sample mean is $(1+1)/2 = 1$. Computing the sample mean for all $36$ outcomes:
$$\begin{aligned}
1\qquad 1.5\qquad 2\qquad 2.5\qquad 3\qquad 3.5 \\
1.5\qquad 2\qquad 2.5\qquad 3\qquad 3.5\qquad 4 \\
2\qquad 2.5\qquad 3\qquad 3.5\qquad 4\qquad 4.5 \\
2.5\qquad 3\qquad 3.5\qquad 4\qquad 4.5\qquad 5 \\
3\qquad 3.5\qquad 4\qquad 4.5\qquad 5\qquad 5.5 \\
3.5\qquad 4\qquad 4.5\qquad 5\qquad 5.5\qquad 6 \\
\end{aligned}$$
Notice that some sample mean values are more likely than others.

For example, there is only one outcome, namely $(1,1)$, which results in a sample mean of $1$, and similarly, only $(6,6)$ gives a sample mean of $6$. So the probability of observing a sample mean of $1$ is only $1/36$, and similarly for a sample mean of $6$.

On the other hand, there are three outcomes that give a sample mean of $2$, namely $(3,1)$, $(2,2)$, and $(1,3)$. So the probability of observing a sample mean of $2$ is $3/36 = 1/12$.

Looking at the second table, the most likely observed sample mean is $3.5$. It occurs once in each row, for a total of $6$ out of $36$, so the probability that the sample mean is $3.5$ is $1/6$.

If we were to repeat the experiment with more rolls, we would see that a higher percentage of outcomes would result in sample means near $3.5$, and this percentage would grow closer and closer to $100\%$ as the number of rolls grows larger and larger.
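
The trend described above can be made quantitative. The following is a sketch using Chebyshev's inequality, a standard bound that the text above does not itself invoke; the tolerance $\varepsilon$ and the choice $N = 1000$ are illustrative assumptions. A single fair die roll $X$ has
$$\text{E}[X] = \frac{7}{2}, \qquad \text{Var}(X) = \text{E}[X^2] - \left(\text{E}[X]\right)^2 = \frac{91}{6} - \frac{49}{4} = \frac{35}{12},$$
so the sample mean $\bar{x}_N = \frac{1}{N}\sum_{n=1}^{N}x_n$ of $N$ independent rolls has variance $\frac{35}{12N}$, and
$$P\left(\left|\bar{x}_N - \tfrac{7}{2}\right| \ge \varepsilon\right) \le \frac{35}{12N\varepsilon^2}.$$
For instance, with $\varepsilon = 0.5$ and $N = 1000$ the bound is $\frac{35}{3000} \approx 0.012$, so at least about $98.8\%$ of the probability lies within $0.5$ of $3.5$. The bound shrinks to $0$ as $N \to \infty$ for every fixed $\varepsilon > 0$.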

\item \textbf{Definition (Poisson Distribution)}

At first glance, the binomial distribution and the Poisson distribution seem unrelated. But a closer look reveals a pretty interesting relationship.

It turns out the Poisson distribution is a limiting case of the binomial, where the number of trials is large and the probability of success in any given one is small.

The argument below is a simple proof sketch showing that the Poisson distribution is the binomial with $n$ approaching infinity and $p$ approaching zero while $np$ stays fixed.

\bigskip

{\color{red}

Sometimes, when sampling a binomial variable, the probability of observing the event is very small (that is, $p$ tends to zero) and the sample size is large (that is, $n$ tends to infinity). This might be the case, for example, if we were looking at the incidence of a rare disease, where only one in ten thousand people is affected. In such a situation it is difficult and tedious to compute probabilities directly from the binomial distribution.



\bigskip


The Poisson with parameter $\lambda$ has mean $\lambda$. So when we are approximating a binomial with parameters $n$, $p$ (and therefore mean $np$) by a Poisson, the appropriate parameter $\lambda$ is the mean $np$ of the binomial. But the above can mainly be thought of as a \emph{mnemonic}, a device to remember the right answer. So we go into more detail.

The following is an informal calculation that can be turned into a formal limit argument. If $X$ is the binomial, then 
$$P(X=k)=\binom{n}{k}p^k(1-p)^{n-k}~~~~~~~~~~~~~(1)$$
Use the \textbf{abbreviation} $\lambda$ for $np$. Then $p=\frac{\lambda}{n}$, so we have
$1-p=1-\frac{\lambda}{n}$.

We can then rewrite (1) as
$$P(X=k)=\frac{1}{k!}(n)(n-1)\cdots(n-k+1)\frac{\lambda^k}{n^k}\left(1-\frac{\lambda}{n}\right)^n   \left(1-\frac{\lambda}{n}\right)^{-k}~~~~~~(2)$$
Note that we have done nothing so far, except for the introduction of $\lambda$ as an abbreviation for $np$. 

Now imagine $n$ large, and $p$ small, so that $np=\lambda$ stays constant. Let $k$ be fixed. We look at the various terms in Formula (2). 

For fixed $k$, if $n$ is large then $(n)(n-1)\cdots(n-k+1)\frac{1}{n^k}\approx 1$ and $\left(1-\frac{\lambda}{n}\right)^{-k}\approx 1$. Also, for $n$ large, $\left(1-\frac{\lambda}{n}\right)^n  \approx e^{-\lambda}$.   


It follows that the right-hand side of (2) is approximately
$$\frac{1}{k!} \lambda^ke^{-\lambda}.$$
This is precisely the probability that a Poisson with parameter $\lambda$ takes on the value $k$. 
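
As a quick numerical check of the approximation, consider the rare-disease setting mentioned above with $p = 1/10000$, and assume (purely for illustration) a sample of $n = 10000$ people, so that $\lambda = np = 1$. The probability of observing no cases is
$$\text{Binomial: } \left(1-\frac{1}{10000}\right)^{10000} \approx 0.36786, \qquad \text{Poisson: } \frac{1}{0!}\lambda^{0}e^{-\lambda} = e^{-1} \approx 0.36788,$$
so the two values differ by only about $2 \times 10^{-5}$.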



\bigskip}


\item \textbf{Definition (Mode of Poisson Distribution)}

To find the mode of the Poisson distribution, for $k > 0$, consider the ratio
$$
\frac{P\{X = k\}}{P\{X = k-1\}}
= \frac{e^{-\lambda}\frac{\lambda^k}{k!}}{e^{-\lambda}\frac{\lambda^{k-1}}{(k-1)!}} = \frac{\lambda}{k}$$
which is larger than $1$ for $k < \lambda$ and smaller than $1$ for $k > \lambda$.

\begin{itemize}
 \item If $\lambda < 1$, then $P\{X = 0\} > P\{X = 1\} > P\{X = 2\} > \cdots$ and so the mode is $0$.

 \item If $\lambda > 1$ is not an integer, then the mode is $\lfloor\lambda\rfloor$, since $P\{X = \lfloor\lambda\rfloor\} > P\{X = \lfloor\lambda\rfloor - 1\}$ and $P\{X = \lceil\lambda\rceil\} < P\{X = \lfloor\lambda\rfloor\}$.

 \item If $\lambda$ is an integer $m$, then $P\{X = m\} = P\{X = m-1\}$ and so either $m$ or $m-1$ can be taken to be the mode.
\end{itemize}
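
As a worked instance (the value $\lambda = 2.9$ is chosen only for illustration), the ratio $\lambda/k$ gives
$$\frac{P\{X=1\}}{P\{X=0\}} = 2.9 > 1, \qquad \frac{P\{X=2\}}{P\{X=1\}} = \frac{2.9}{2} = 1.45 > 1, \qquad \frac{P\{X=3\}}{P\{X=2\}} = \frac{2.9}{3} \approx 0.97 < 1,$$
so the probabilities rise up to $k = 2$ and fall afterwards, and the mode is $\lfloor 2.9 \rfloor = 2$.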


\bigskip

\item \textbf{Theorem (Mean of Poisson is Linear)}

Intuition: If the mean number of people queuing to buy coffee in 1 minute is 2.9, then the mean number of people queuing to buy coffee in 3 minutes is $2.9 \times 3 = 8.7$.\bigskip

If $X \sim \operatorname{Poisson} \left({2.9}\right)$ represents the number of people queuing to buy coffee in 1 minute, then $Y \sim \operatorname{Poisson} \left({8.7}\right)$ represents the number of people queuing to buy coffee in 3 minutes. This relies on the arrivals in disjoint minutes being independent, since a sum of independent Poisson variables is again Poisson with the means added.


\bigskip



\item \textbf{Definition 5.4.1} 

We say that $X$ is a normal random variable, or simply that $X$ is normally distributed, with parameters $\mu$ and $\sigma^{2}$ if the density of $X$ is given by $$f(x) = \dfrac{1}{\sqrt{2\pi}\sigma}e^{-(x-\mu)^{2}/(2\sigma^{2})}, \qquad -\infty < x < \infty.$$

\bigskip

To prove that $f(x)$ is indeed a probability density function, one needs to show that $$\dfrac{1}{\sqrt{2\pi}\sigma}\int_{-\infty}^{\infty}e^{-(x-\mu)^{2}/(2\sigma^{2})}\,dx = 1.$$

\bigskip

An important fact about normal random variables is that if $X$ is normally distributed with parameters $\mu$ and $\sigma^{2}$, then $Y = aX + b$ (with $a \neq 0$) is normally distributed with parameters $a\mu + b$ and $a^{2}\sigma^{2}$.

Note in particular that if we let $Z = \dfrac{X-\mu}{\sigma}$, then by the fact stated in the previous paragraph, $Z$ is normally distributed with parameters $0$ and $1$. Such a random variable $Z$ is called a standard normal random variable. \bigskip
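
To spell out the step above, write $Z = aX + b$ with $a = \frac{1}{\sigma}$ and $b = -\frac{\mu}{\sigma}$, where $a$ and $b$ are the constants of the fact stated above. The parameters of $Z$ are then
$$a\mu + b = \frac{\mu}{\sigma} - \frac{\mu}{\sigma} = 0, \qquad a^{2}\sigma^{2} = \frac{1}{\sigma^{2}}\cdot\sigma^{2} = 1.$$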

\item \textbf{Propositions 5.4.2}

(i) $P(Z \geq 0) = P(Z \leq 0) = 0.5$ \bigskip

(ii) $-Z \sim N(0,1)$ \bigskip

(iii) $P(Z \leq x) = 1 - P(Z > x)$ for $-\infty < x < \infty$ \bigskip

(iv) $P(Z \leq -x) = P(Z \geq x)$ for $-\infty < x < \infty$ \bigskip


(v) If $X \sim N(\mu,\sigma^{2})$, then $E(X) = \mu$ and $\text{var}(X) = \sigma^{2}$ \bigskip

(vi) If $Z \sim N(0,1)$, then $E(Z) = 0$ and $\text{var}(Z) = 1$ \bigskip
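
As a small worked use of properties (iii) and (iv), take $x = 1$ and the standard table value $P(Z \leq 1) \approx 0.8413$. Since $Z$ is continuous, $P(Z \geq 1) = P(Z > 1)$, and so
$$P(Z \leq -1) = P(Z \geq 1) = 1 - P(Z \leq 1) \approx 1 - 0.8413 = 0.1587.$$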


\item \textbf{Definition (Continuity Correction)}

The continuity correction comes up most often when we are using the normal approximation to the binomial. It comes up sometimes when we are approximating a Poisson distribution with large $\lambda$ by a normal. 

Let $X$ be a binomially distributed random variable that represents the number of successes in $n$ independent trials, where the probability of success on any trial is $p$. Let $Y$ be a normal random variable with the same mean and the same variance as $X$.

Suppose that $np(1-p)$ is not too small. Then if $k$ is an \textbf{integer}, $\Pr(X\le k)$ is reasonably well-approximated by $\Pr(Y\le k)$. It is ordinarily \textbf{better} approximated by $\Pr(Y\le k+\frac{1}{2})$. The difference can be significant when $n$ is not large. When $np(1-p)$ is big, say bigger than $100$, the continuity correction makes little practical difference.

The continuity correction is less important than it used to be, since with modern software we can compute $\Pr(X\le k)$ essentially exactly.

It is easy to get confused when using the continuity correction. In particular, a common question comes up: when do we \textbf{add} $\frac{1}{2}$, and when do we subtract? I deal with that by remembering only one rule. To repeat,

\textbf{Rule:} If $k$ is an integer, then $\Pr(X\le k)\approx \Pr(Y\le k+\frac{1}{2})$, where $Y$ is a normal with the same mean and variance as $X$.

Let us look at a couple of examples. Let $X$ have a binomial distribution. Approximate the probability that $X < k$, where $k$ is an integer. This \textbf{doesn't} quite look like our Rule: we have $< k$, not $\leq k$. But $X< k$ if and only if $X \leq k-1$. Now we are of the right shape. The answer is, approximately, $\Pr(Y\leq k-1+\frac{1}{2})$, where $Y$ is the appropriate normal.
This is $\Pr(Y\leq k-\frac{1}{2})$, so in a sense we subtracted. But it all came from the one Rule, where we always add, but pay close attention to the difference between $<$ and $\leq$.

What is the probability that $X > k$? This is $1-\Pr(X \leq  k)$. Thus we get that the result is approximately $1-\Pr(Y\leq k+\frac{1}{2})$. 

\textbf{A numerical example:} Toss a fair coin $100$ times. Approximate the probability that the number of heads is $\leq 55$.

By working directly with the binomial, and software, I get this is, to $6$ figures, $0.864373$. That's the ``right'' answer.

Using $\Pr(Y\leq 55)$, where $Y$ is normal with mean $50$ and standard deviation $5$, and no continuity correction, I get the approximation $0.8413$.

Using the continuity correction, I get the approximation $0.8643$. I should really do a few other examples; the continuity correction is \textbf{too} good here!
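
The two approximations above can be reproduced by standardizing. This is a sketch: $\Phi$ denotes the standard normal distribution function, a symbol not used in the text above, and the values of $\Phi$ are standard table values. Here $np = 50$ and $\sqrt{np(1-p)} = \sqrt{25} = 5$, so
$$\begin{aligned}
\text{without correction: } & \Pr(Y \leq 55) = \Phi\left(\frac{55-50}{5}\right) = \Phi(1) \approx 0.8413, \\
\text{with correction: } & \Pr(Y \leq 55.5) = \Phi\left(\frac{55.5-50}{5}\right) = \Phi(1.1) \approx 0.8643.
\end{aligned}$$
By the same Rule, $\Pr(X > 55) = 1 - \Pr(X \leq 55) \approx 1 - 0.8643 = 0.1357$.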


\item \textbf{Useful Links}

\begin{enumerate}


    \item This set of notes currently serves as a supplement to An Introduction to Statistical Learning for STAT 432 - Basics of Statistical Learning at the University of Illinois at Urbana-Champaign. It is quite a good introductory text and is recommended as a first read.
    
    \url{https://daviddalpiaz.github.io/r4sl/}
\bigskip


\item Another introductory set of notes for R is as follows.

\url{https://bookdown.org/egarpor/SSS2-UC3M/simplin.html}


\bigskip

\item How adding a second independent variable (IV) can make a previously non-significant variable significant.

\url{https://stats.stackexchange.com/questions/28474/how-can-adding-a-2nd-iv-make-the-1st-iv-significant}

\bigskip

\item How to find outliers in a dataset.

\url{https://stepupanalytics.com/outlier-detection-techniques-using-r/}

\bigskip

\item An excellent R tutorial hosted at the University of Illinois at Chicago (UIC).


\url{https://ademos.people.uic.edu/Chapter12.html}


\bigskip

\item Some basics on EDA.

\url{https://r4ds.had.co.nz/exploratory-data-analysis.html}

\bigskip




\end{enumerate}




\newpage


\bigskip

\item \textbf{Feature Engineering good links}
\begin{enumerate}
    \item Extremely good and informative; however, some of the approaches suggested may not be ideal for every dataset. For example, imputing missing values with the median may not be appropriate in all scenarios, so think each choice through.

\url{https://towardsdatascience.com/feature-engineering-for-machine-learning-3a5e293a5114}
\bigskip

\item How to impute missing values.

\url{https://towardsdatascience.com/handling-missing-values-in-machine-learning-part-2-222154b4b58e}

\url{https://medium.com/ibm-data-science-experience/missing-data-conundrum-exploration-and-imputation-techniques-9f40abe0fd87}

\url{https://www.omicsonline.org/open-access/a-comparison-of-six-methods-for-missing-data-imputation-2155-6180-1000224.php?aid=54590}

\url{https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3074241/}



\end{enumerate}

\newpage


\item \textbf{Model Accuracy}

\url{https://developers.google.com/machine-learning/crash-course/classification/accuracy}



\bigskip

\item \textbf{Encoding}

\url{https://medium.com/@michaeldelsole/what-is-one-hot-encoding-and-how-to-do-it-f0ae272f1179}

\bigskip

\url{https://stackoverflow.com/questions/31506987/scikit-learn-one-hot-encode-before-or-after-train-test-split/}

\bigskip

\url{https://medium.com/data-design/visiting-categorical-features-and-encoding-in-decision-trees-53400fa65931}

\bigskip

\url{https://forums.fast.ai/t/to-label-encode-or-one-hot-encode/6057/4}


\end{enumerate}

\newpage





\subsection{Basic Statistical Terminologies}

\begin{enumerate}


\item \textbf{Definition (Standard Deviation and Variance)}

The variance of a random variable $X$ is the expected value of the squared deviation from the mean of $X$, $\mu = \text{E}[X]$, given by the formula:

$$\text{Var}(X) = \text{E}\left[\left(X-\mu\right)^2\right]$$

\bigskip


\textbf{Simple Example}

Imagine that a company manufactures cell phone batteries. Some of the customers using the batteries report that, after a full charge, a battery lasts maybe 10 min; others report 28 min, 18 min, 8 min, 15 min, etc. \bigskip

These battery durations are up and down, all over the place, and the company's customers are not happy because they can't have confidence in the battery life after a full charge. So in this example, the variance of the dataset is big (the data are spread out widely). But note, to be slightly more precise, we need to know the mean battery life first, say the mean is 11 min; then the variance tells you how scattered the points are around the centre of mass (the mean). \bigskip


\textbf{Standard Deviation}

Standard deviation is the square root of the variance. It is measured in the same units as the data (and the mean), so the idea of spread is more literal.
\bigskip

A low standard deviation indicates that the data points tend to be close to the mean (also called the expected value) of the set, while a high standard deviation indicates that the data points are spread out over a wider range of values.
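
As a worked illustration of both definitions, treat just the five durations quoted in the battery example ($10, 28, 18, 8, 15$ minutes) as the entire data set, each equally likely. This is an assumption made only for illustration, so the mean here is that of these five values rather than the figure supposed in the example. Then
$$\begin{aligned}
\mu &= \frac{10+28+18+8+15}{5} = 15.8, \\
\text{Var}(X) &= \frac{(-5.8)^2 + (12.2)^2 + (2.2)^2 + (-7.8)^2 + (-0.8)^2}{5} = \frac{248.8}{5} = 49.76, \\
\sqrt{\text{Var}(X)} &\approx 7.05.
\end{aligned}$$
The variance is in squared minutes, whereas the standard deviation, about $7.05$ minutes, is in the same units as the data.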

\bigskip


\item \textbf{Definition (Confidence Intervals)}


In survey sampling, we usually cannot measure the whole population. Say we want to find the mean height of the whole population of Singapore (6 million people): we obviously cannot find all 6 million people and measure their heights, so we choose a random sample and measure that instead. Different samples can be randomly selected from the same population, and each sample can produce a different confidence interval. Some confidence intervals include the true population parameter (here, the true population mean height); others do not.

\bigskip


A confidence level refers to the percentage of all possible samples that can be EXPECTED to include the true population parameter. For example, suppose all possible samples were selected from the same population, and a confidence interval were computed for each sample. A $95\%$ confidence level implies that $95\%$ of the confidence intervals would include the true population parameter.
\bigskip


\textbf{Misunderstandings}

Suppose that a $90\%$ confidence interval states that the TRUE/REAL population mean is greater than 100 and less than 200. How would you interpret this statement? \bigskip

Some people think this means there is a $90\%$ chance that the TRUE population mean falls between 100 and 200. This is incorrect. Like any population parameter, the TRUE population mean is a constant, not a random variable. It does not change. The probability that a constant falls within any given range is always 0 or 1. So in a sense it is wrong to say that there is a percentage chance for the TRUE population mean to fall between some values: either the TRUE population mean is in the interval or it is not. \bigskip

The confidence level describes the uncertainty associated with a sampling method. Suppose we used the same sampling method to select different samples and to compute a different interval estimate for EACH sample. Some of the sample's interval estimates would include the true population parameter and some would not. A $90\%$ confidence level means that we would expect $90\%$ of the interval estimates to include the population parameter; a $95\%$ confidence level means that $95\%$ of the intervals would include the parameter. \bigskip

\textbf{Calculating Confidence Intervals}

\begin{enumerate}
    \item [1.] First, you need to set a confidence level. It is often set to $90\%$ or $95\%$. \bigskip
    
    \item [2.] Secondly, you need to identify the sample statistic (like the sample mean or sample proportion) that you will use to estimate a population parameter. So in our example, if we want to find a confidence interval for the TRUE population mean, then our sample statistic is the sample mean. Sample statistics used this way are also called point estimates. \bigskip
    
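    \item [3.] Calculate the margin of error. One can use the following formula when the standard deviation of the statistic is known,
    
    $$\text{Margin of error} = \text{Critical value} \times \text{Standard deviation of statistic},$$
    and otherwise the version based on the standard error,
    $$\text{Margin of error} = \text{Critical value} \times \text{Standard error of statistic}.$$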
    
    \bigskip
    
    \item [4.] Finally, specify the confidence interval. The uncertainty is denoted by the confidence level, and the range of the confidence interval is defined by the following equation.
    
$$\text{Confidence interval} = \text{sample statistic} \pm \text{Margin of error}$$
    
    
    \url{https://stattrek.com/estimation/confidence-interval.aspx}
    
    \bigskip
    
\end{enumerate}
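
The four steps above can be sketched with hypothetical numbers (none of them come from the text): a sample of $n = 100$ people with sample mean height $170$ cm and sample standard deviation $10$ cm, a $95\%$ confidence level, and the corresponding standard normal critical value $1.96$. Then
$$\begin{aligned}
\text{Standard error} &= \frac{10}{\sqrt{100}} = 1, \\
\text{Margin of error} &= 1.96 \times 1 = 1.96, \\
\text{Confidence interval} &= 170 \pm 1.96 = (168.04,\ 171.96).
\end{aligned}$$
Using the normal critical value here is itself a simplifying assumption; for small samples a $t$ critical value would typically be used instead.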

\item \textbf{Definition (Point Estimates, Interval Estimates)}

There is a relationship between point estimates and population parameters. \bigskip


Statisticians use sample statistics to estimate population parameters. For example, sample means are used to estimate population means; sample proportions, to estimate population proportions.
\bigskip

An estimate of a population parameter may be expressed in two ways:

\textbf{Point estimate}

A point estimate of a population parameter is a single value of a statistic. For example, the sample mean $\bar{x}$ is a point estimate of the population mean $\mu$. Similarly, the sample proportion $p$ is a point estimate of the population proportion $P$. \bigskip

\textbf{Interval Estimate}

An interval estimate is defined by two numbers, between which a population parameter is said to lie. For example, $a < \mu < b$ is an interval estimate of the population mean $\mu$. It indicates that the population mean is greater than $a$ but less than $b$.

\bigskip

\item \textbf{Definition (Intuition of p - values)}

\textbf{Hypothesis Testing} 

In order to understand the intuition behind the P-value, it is necessary to first understand hypothesis testing intuitively, so hypothesis testing is explained here as well. If you already know these concepts well, skip the basics and go directly to the P-value.

Usually in hypothesis testing, we evaluate two mutually exclusive statements about a population to determine which statement is best supported by the sample data.\bigskip

\textbf{Why do we need to conduct hypothesis testing?}


We only have sample data from the population. Based on the sample data, we need to make an inference about the population.

For example:

\begin{enumerate}
    \item [1.] A doctor wants to know whether children who take vitamin C are less likely to become ill. \bigskip
    
\item [2.] A manufacturer wants to check whether a product's quality meets pre-specified criteria.

\bigskip

\item [3.] A scientist wants to know whether young boys are more prone to behavioural problems than young girls.
\end{enumerate}

In all of the above examples, it is not possible to check the whole population to make a decision.

\bigskip


If the doctor wants to know whether children who take vitamin C are less likely to become ill, it would be very costly to examine all the children in the world to make a decision. So we prefer to take a sample from the population, and use the sample to make an inference about the population.
\bigskip


\textbf{Example of Null and Alternate Hypothesis}

I am running a company which manufactures beverages. I want my beverage caps' diameter to be approximately equal to 3 cm; otherwise the caps won't fit the bottles.
\bigskip

\begin{enumerate}
    \item [1.] Null Hypothesis ($H_0$): It states that the population parameter is equal to the claimed value.\bigskip
    
    
    \item [2.] Alternate Hypothesis ($H_1$): Three possibilities exist for the alternate hypothesis: the population parameter is not equal to the claimed value, the population parameter is greater than the claimed value, or the population parameter is less than the claimed value. \bigskip
\end{enumerate}


But my quality manager claims that the caps' diameter is not equal to 3 cm. So, for our case:

$$H_0: \mu = 3\text{ cm}$$
$$H_1: \mu \neq 3\text{ cm}$$

\bigskip

\textbf{Random Variability:}

As I said before, it is not possible to measure the cap diameter for the whole population due to budget constraints. So, I took a sample of 100 caps and measured its average diameter, which turned out to be 2.92 cm ($\bar{x}$). I took another sample of the same size (100 caps) and measured its average diameter; it turned out to be 3.12 cm. If I take many samples of the same size, I might get a different average diameter for each sample. The reason we get different values is random variability. \bigskip


Thanks to the central limit theorem, we can define a theoretical distribution for the above scenario which captures the random variability of the sample average diameter.\bigskip

To understand more about the central limit theorem, see Balaji Pitchai Kannu's answer to ``How do you explain central limit theorem of normal distribution?'' on Quora:

(\sloppy\url{https://www.quora.com/How-do-you-explain-central-limit-theorem-of-normal-distribution/answer/Balaji-Pitchai-Kannu})

\bigskip

The central limit theorem tells us three important things about the theoretical distribution, that is, the distribution of the sample mean over repeated samples (it is worth getting to know theoretical distributions and the part they play in hypothesis testing).
\bigskip

\begin{enumerate}
    \item [1.] The theoretical distribution approximately follows a normal distribution. \bigskip

\item [2.] Theoretical distribution mean = Population mean. \bigskip

\item [3.] Theoretical distribution's standard deviation = Population standard deviation / $\sqrt{n}$, where $n$ is the sample size.
\bigskip
\end{enumerate}
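
In symbols, the three statements above say roughly the following. This is a sketch: $\mu$ and $\sigma$ denote the population mean and standard deviation, and the value $\sigma = 0.4$ cm is a hypothetical number, not given in the text.
$$\bar{x} \;\text{is approximately}\; N\left(\mu,\ \frac{\sigma^{2}}{n}\right), \qquad \text{so for } n = 100 \text{ caps: } \frac{\sigma}{\sqrt{n}} = \frac{0.4}{\sqrt{100}} = 0.04 \text{ cm}.$$
So sample averages such as $2.92$ cm and $3.12$ cm are spread around the population mean with a standard deviation ten times smaller than that of individual caps.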



\textbf{What is the intuition of P-value?}

\hl{So in the general hypothesis testing procedure, we have some hypothesis (our null hypothesis) about the population parameter and we investigate it using a sample drawn from the population. The P-value is the probability, given that the null hypothesis is true, of observing a sample at least as extreme as the one we actually observed.} For example, suppose the null hypothesis is that the population mean height is 100 cm. The P-value answers the question: if the population mean really is 100 cm, what is the PROBABILITY of drawing a sample whose mean is at least as far from 100 cm as the mean of our sample? If the P-value is 0.03 (i.e. 3 percent), then, were the null hypothesis true, only 3 percent of samples would look as extreme as ours. That is small, so we reject the null hypothesis (typically, we reject when the P-value falls below a cut-off chosen in advance, such as 0.05). Otherwise we fail to reject the null hypothesis, saying we don't have enough evidence to reject it.

\bigskip


Concretely, I assume that my theoretical distribution is a normal distribution whose mean is the null hypothesis mean ($\mu = 3$ cm). All I need to do is find the probability of observing a sample mean (from a sample of 100 cap diameters) at least as extreme as mine in that distribution. If this probability (the P-value) is high, it means that observing such a sample under the theoretical distribution is quite likely, which indicates that $\bar{x}$ is consistent with the theoretical distribution; we got a different $\bar{x}$ only due to random variability. If the P-value is low, it means that $\bar{x}$ might have come from another distribution, which might be anything other than the theoretical distribution. When our P-value is very low, the only conclusion we can make is that $\bar{x}$ is unlikely to be from the theoretical distribution, so we reject the null hypothesis. When the P-value is high, we have not shown that the null hypothesis is true, only that the sample is consistent with it. That's why we always say we FAIL to reject the null hypothesis instead of saying we accept it.
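
To make this concrete for the cap example, here is a sketch that assumes a hypothetical population standard deviation of $\sigma = 0.4$ cm (not given in the text) and writes $\Phi$ for the standard normal distribution function. Under $H_0$, the sample mean of $n = 100$ caps has standard deviation $0.4/\sqrt{100} = 0.04$ cm, so for the observed $\bar{x} = 2.92$ cm
$$z = \frac{2.92 - 3}{0.04} = -2, \qquad \text{P-value} = 2\,\Phi(-2) \approx 2 \times 0.02275 = 0.0455.$$
The factor $2$ appears because the alternative $\mu \neq 3$ cm is two-sided. Under these assumed numbers the P-value is below a $0.05$ cut-off, so the null hypothesis would be rejected; with a larger assumed $\sigma$ the same $\bar{x}$ could easily give a large P-value.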


\bigskip


\end{enumerate}

\newpage



\subsection{Types of Variables: Categorical vs Numerical}

\begin{enumerate}


\item \textbf{Definition (Categorical)}

\begin{enumerate}
    \item Nominal: Variables that cannot be ordered in any way. For example, colours: you cannot arrange red, black and blue in any meaningful sequence, but they can be sorted into different categories. \bigskip
    
    \item Ordinal: Variables that can be ordered but do not involve explicit numbers per se. For example, student grades such as A, B, C have an order, but no numbers attached to them.
    
    \bigskip
    
    Also, Categorical data represent characteristics such as a person’s gender, marital status, hometown, or the types of movies they like. Categorical data can take on numerical values (such as “1” indicating male and “2” indicating female), but those numbers don’t have mathematical meaning. You couldn’t add them together, for example. (Other names for categorical data are qualitative data, or Yes/No data.) \bigskip
    
    Ordinal data mixes numerical and categorical data. The data fall into categories, but the numbers placed on the categories have meaning. For example, rating a restaurant on a scale from 0 (lowest) to 4 (highest) stars gives ordinal data. Ordinal data are often treated as categorical, where the groups are ordered when graphs and charts are made. However, unlike categorical data, the numbers do have mathematical meaning. For example, if you survey 100 people and ask them to rate a restaurant on a scale from 0 to 4, taking the average of the 100 responses will have meaning. This would not be the case with categorical data.
\end{enumerate}


\bigskip



\item \textbf{Definition (Numerical)}

\begin{enumerate}
    \item Discrete: Takes on whole-number values and is usually finite. For example, there can be 10 people, but not 10.5 people.
    
    
    \bigskip
    
    \item Continuous: Take on any real numbers. \hl{Question: If the variables are in range for salary: 10k-20k, 20k - 30k, etc, are those numerical or ordinal?}
\end{enumerate}
\end{enumerate}
\bigskip


\newpage

\subsection{Some random questions}
\begin{enumerate}
\item \textbf{Linear Regression Questions}

\begin{enumerate}
    \item [1.]  Suppose Pearson correlation between x and y is zero. In such case, is it right to conclude that x and y do not have any relation between them?
    
    \bigskip
    
    {\color{red}
    Zero correlation indicates no linear dependence, but it does not capture non-linear relationships. A typical example is a random variable $x$ that is uniform over $[-1,1]$ (so it has zero mean) together with $y = x^2$. The correlation is zero, but the two are clearly not independent: $y$ depends entirely on $x$, yet this is not shown in the Pearson coefficient because the relationship is non-linear. In short, $$y = x^2$$ is a simple example of zero correlation without independence.
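
    The zero correlation in this example can be checked directly (a short verification; $\text{Cov}$ and $\text{E}$ denote covariance and expectation). For $x$ uniform on $[-1,1]$ the odd moments vanish by symmetry, so
    $$\text{Cov}(x, x^{2}) = \text{E}[x^{3}] - \text{E}[x]\,\text{E}[x^{2}] = 0 - 0\cdot\frac{1}{3} = 0,$$
    and the Pearson correlation, which is this covariance divided by the two standard deviations, is therefore $0$ as well.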
    
    
    \bigskip}
    
    
    
    
    \item [2.] Give me an explanation of overfitting and underfitting.
    
    \bigskip
    
    {\color{red}
    Knowing about overfitting and underfitting is crucial for judging whether a predictive model generalizes well. A good model must be able to generalize well to unseen data. \bigskip
    
    \begin{center}
  \makebox[\textwidth]{\includegraphics[width=100mm,scale=0.5]{overfit.jpg}}
\end{center}
    
    The model is \textbf{Overfitting} when it performs well on training examples but does not perform well on unseen data. It is often a result of an excessively complex model. It happens because the model memorizes the relationship between the input examples (often called X) and the target variable (often called y), and so is unable to generalize well. An overfitting model predicts the targets in the training data set very accurately. As you can see in the third image, the fitted curve seems to fit the training (seen) data set very well; however, such a model may not be as accurate on an unseen/new data set, because it has adapted too closely to the particular training points. \bigskip
    
The predictive model is said to be \textbf{Underfitting} if it performs poorly even on the training data. This happens because the model is unable to capture the relationship between the input examples and the target variable. It could be because the model is too simple, i.e. the input features are not expressive enough to describe the target variable well. An underfitting model does not predict the targets in the training data set very accurately. As you can see from the first image, the best fit line is very generic; it will tend to have a high residual standard error (RSE), indicating a poor fit to the data set.
\bigskip

Both overfitting and underfitting lead to poor predictions on new data sets.}
\end{enumerate}

\end{enumerate}

\newpage


\section{Statistical Learning}

\subsection{What Is Statistical Learning?}

More generally, suppose that we observe a quantitative response $Y$ and $p$ different predictors, $X_{1}, X_{2}, \ldots, X_{p}$. We assume that there is some relationship between $Y$ and $X = (X_{1}, X_{2}, \ldots, X_{p})$, which can be written in the very general form

\begin{equation}
 Y=f(X)+\epsilon 
\end{equation}


Here $f$ is some fixed but unknown function of $X_{1}, \ldots, X_{p}$, and $\epsilon$ is a random
  error term, which is independent of $X$ and has mean zero.


\newpage



\subsection{Why do we need to estimate a function f?}





There are two main reasons that we may wish to estimate $f$: {\it prediction} and {\it inference}. We discuss each in turn. \bigskip


\begin{enumerate}

\item \textbf{Prediction}

In many situations, a set of inputs $X$ is readily available, but the output $Y$ cannot be easily obtained. For example, we may have input variables such as murder rate, rape rate and assault rate, but not the output variable, the crime rate. So while we have many input variables, we still want an output. In this setting, since the error term averages to zero (recall that the mean of the error term is 0), we can predict $Y$ using

\begin{equation}
\hat{Y}=\hat{f}(X)
\end{equation}


where $\hat{f}$ represents our estimate for $f$, and $\hat{Y}$ represents the resulting prediction for $Y$. In this setting, $\hat{f}$ is often treated as a {\it black box}, in the sense that one is not typically concerned with the exact form of $\hat{f}$, provided that it yields accurate predictions for $Y$. In other words, $\hat{f}$ can be any function, so long as it predicts well.

\bigskip

\textbf{Example}

As an example, suppose that $X_{1}$, . . . , $X_{p}$ are characteristics of a patient's blood sample that can be easily measured in a lab, note these $X_i$ are our inputs, and $Y$ is a variable (our output) encoding the patient's risk for a severe adverse reaction to a particular drug. $Y$ can take on a categorical output such as Low risk, Moderate risk and High risk. It is natural to seek to predict $Y$ using $X$, since we can then avoid giving the drug in question to patients who are at high risk of an adverse reaction--that is, patients for whom the estimate of $Y$ is high. \bigskip


The accuracy of $\hat{Y}$ as a prediction for $Y$ depends on two quantities, which we will call the {\it reducible error} and the {\it irreducible error}. In general, $\hat{f}$ will not be a perfect estimate for $f$, and this inaccuracy will introduce some error. This error is {\it reducible} because we can potentially improve the
accuracy of $\hat{f}$ by using the most appropriate statistical learning technique to estimate $f$. However, even if it were possible to form a perfect estimate for $f$, so that our estimated response took the form $\hat{Y}=f(X)$, our prediction would still have some error in it! This is because $Y$ is also a function of
$\epsilon$, which, by definition, cannot be predicted using $X$. Therefore, variability associated with $\epsilon$ also affects the accuracy of our predictions. This is known
as the {\it irreducible error}, because no matter how well we estimate $f$, we cannot reduce the error introduced by $\epsilon$. \bigskip

The focus of this book is on techniques for estimating $f$ with the aim of minimizing the reducible error. It is important to keep in mind that the irreducible error will always provide an upper bound on the accuracy of our prediction for $Y$. This bound is almost always unknown in practice.
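\bigskip

The two sources of error can be separated in a short calculation. As a sketch, treat both $\hat{f}$ and $X$ as fixed, so that the only randomness comes from $\epsilon$. Then, using $Y=f(X)+\epsilon$, $\hat{Y}=\hat{f}(X)$ and $E(\epsilon)=0$,
$$\begin{aligned}
E(Y-\hat{Y})^{2} &= E[f(X)+\epsilon-\hat{f}(X)]^{2} \\
&= [f(X)-\hat{f}(X)]^{2} + 2[f(X)-\hat{f}(X)]\,E(\epsilon) + E(\epsilon^{2}) \\
&= \underbrace{[f(X)-\hat{f}(X)]^{2}}_{\text{reducible}} + \underbrace{\mathrm{Var}(\epsilon)}_{\text{irreducible}}.
\end{aligned}$$
The cross term vanishes because $\epsilon$ has mean zero, and $E(\epsilon^{2})=\mathrm{Var}(\epsilon)$ for the same reason. A better estimate $\hat{f}$ shrinks only the first term.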

\bigskip


\item \textbf{Inference} \label{Inference}

We are often interested in understanding the way that $Y$ is affected as $X_{1}, \ldots, X_{p}$ change. In this situation we wish to estimate $f$, but our goal is not necessarily to make predictions for $Y$. We instead want to understand the relationship between $X$ and $Y$, or more specifically, to understand how $Y$ changes as a function of $X_{1}, \ldots, X_{p}$. Now $f$ cannot be treated as a black box, because we need to know its exact form. In this setting, one may be interested in answering the following questions:

$\bullet$ {\it Which predictors are associated with the response}?

It is often the case that only a small fraction of the available predictors are substantially
associated with $Y$. Identifying the few {\it important} predictors among a large set of possible variables can be extremely useful, depending on the application.
\bigskip

$\bullet$ {\it What is the relationship between the response and each predictor}?

Some predictors may have a positive relationship with $Y$, in the sense that increasing the predictor is associated with increasing values of $Y$. Other predictors may have the opposite relationship. Depending on the complexity of $f$, the relationship between the response and a given predictor may also depend on the values of the other predictors. \bigskip

$\bullet$ {\it Can the relationship between} $Y$ {\it and each predictor be adequately summarized using a linear equation, or is the relationship more complicated}?

Historically, most methods for estimating $f$ have taken a linear form. In some situations, such an assumption is reasonable or even desirable. But often the true relationship is more complicated, in which case a linear model may not provide an accurate representation of the relationship between the input and output variables.


\end{enumerate}

\newpage


\subsection{How to estimate f?}

Throughout this book, we explore many linear and non-linear approaches for estimating $f$. However, these methods generally share certain characteristics. We provide an overview of these shared characteristics in this section. We will always assume that we have observed a set of $n$ different data points. In particular, in supervised learning, you use a training dataset, which contains outcomes, to train the machine. You then use the trained model to predict outcomes for a testing dataset that has no outcomes.
For example, suppose you have a dataset of students with their demographics, hours spent practicing for the SAT, books they have read throughout the year, time spent per day studying for the test, and their SAT results for the years 2005--2016. This is your training dataset, since the test scores are already known. The algorithm will train to predict each score, based on the students' parameters.

\bigskip

Now suppose you have a similar dataset of students, with the same data except for the SAT scores. This is your testing dataset, and the algorithm will predict the outcomes based on the historical data for similar students.
\bigskip


These observations are called the {\it training data} because we will use these observations to train, or teach, our method how to estimate $f$. Let $x_{ij}$ represent the value of the $j$th predictor, or input, for observation $i$, where $i = 1, 2, \ldots, n$ and $j = 1, 2, \ldots, p$. Correspondingly, let $y_{i}$ represent the response variable for the $i$th observation. Then our training data consist of $\{(x_{1}, y_{1}), (x_{2}, y_{2}), \ldots, (x_{n}, y_{n})\}$ where $x_{i}=(x_{i1}, x_{i2}, \ldots, x_{ip})^{T}$.
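\bigskip

One way to picture this notation is to stack the observations into a table with one row per observation and one column per predictor. The symbols $\mathbf{X}$ and $\mathbf{y}$ below are assumptions introduced only for this sketch:
$$\mathbf{X}=\begin{pmatrix}
x_{11} & x_{12} & \cdots & x_{1p} \\
x_{21} & x_{22} & \cdots & x_{2p} \\
\vdots & \vdots & \ddots & \vdots \\
x_{n1} & x_{n2} & \cdots & x_{np}
\end{pmatrix},
\qquad
\mathbf{y}=\begin{pmatrix} y_{1} \\ y_{2} \\ \vdots \\ y_{n} \end{pmatrix}.$$
Row $i$ of $\mathbf{X}$ is $x_{i}^{T}$, the full set of predictor values for observation $i$, and the $i$th entry of $\mathbf{y}$ is the matching response. In the SAT example, each row would be one student and $y_{i}$ that student's score.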

\bigskip

Our goal is to apply a statistical learning method to the training data in order to estimate the unknown function $f$. In other words, we want to find a function $\hat{f}$ such that $Y\approx\hat{f}(X)$ for any observation $(X,\ Y)$ . Broadly speaking, most statistical learning methods for this task can be characterized as either {\it parametric} or {\it non-parametric}. We now briefly discuss these two types of approaches. \bigskip

\newpage


\begin{enumerate}


\item \textbf{Parametric Methods!!!}

Parametric methods involve a two-step model-based approach.
\bigskip

\begin{enumerate}
    \item [1.] First, we make an \textbf{assumption} about the functional form, or shape, of $f$. For example, one very simple assumption is that $f$ is linear in $X$:
    
\begin{equation}
f(X)=\beta_{0}+\beta_{1}X_{1}+\beta_{2}X_{2}+\ldots+\beta_{p}X_{p}
\end{equation}
\bigskip

This is a {\it linear model}, which will be discussed extensively in Chapter 2. Once we have assumed that $f$ is linear, the problem of estimating $f$ is greatly simplified. Instead of having to estimate an entirely
arbitrary $p$-dimensional function $f(X)$, one only needs to estimate the $p+1$ coefficients $\beta_{0}, \beta_{1}, \ldots, \beta_{p}$.

\bigskip

\item [2.] After a model has been selected, we need a procedure that uses the
training data to {\it fit} or {\it train} the model. In the case of the linear model (1.3), we need to estimate the parameters $\beta_{0}, \beta_{1}, \ldots, \beta_{p}$. That is, we want to find values of these parameters such that
$$
Y\approx\beta_{0}+\beta_{1}X_{1}+\beta_{2}X_{2}+\ldots+\beta_{p}X_{p}.
$$
The most common approach to fitting the model (1.3) is referred to as ({\it ordinary}) {\it least squares}, which we discuss in Chapter 2.\bigskip

However, least squares is one of many possible ways to fit the linear model. In
Chapter 5, we discuss other approaches for estimating the parameters in (1.3).
\bigskip
\end{enumerate}
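
As a minimal sketch of what step 2 amounts to under least squares, written with the training-data notation $x_{ij}$ and $y_{i}$ (the hats on the coefficients denote estimates and are an assumption of this sketch), the fitted coefficients are the values that make the sum of squared differences between observed and fitted responses as small as possible:
$$(\hat{\beta}_{0},\hat{\beta}_{1},\ldots,\hat{\beta}_{p}) = \arg\min_{\beta_{0},\ldots,\beta_{p}} \sum_{i=1}^{n}\Bigl(y_{i}-\beta_{0}-\sum_{j=1}^{p}\beta_{j}x_{ij}\Bigr)^{2}.$$
The resulting estimate of $f$ is then $\hat{f}(X)=\hat{\beta}_{0}+\hat{\beta}_{1}X_{1}+\ldots+\hat{\beta}_{p}X_{p}$. \bigskip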

The model-based approach just described is referred to as {\it parametric};
it reduces the problem of estimating $f$ down to one of estimating a set of
parameters. Assuming a parametric form for $f$ simplifies the problem of
estimating $f$ because it is generally much easier to estimate a set of parameters, such as $\beta_{0}, \beta_{1}$, . . . , $\beta_{p}$ in the linear model (1.3), than it is to fit an entirely \textbf{arbitrary function} $f$. The potential disadvantage of a parametric approach is that the model we choose will usually not match the true unknown form of $f$. If the chosen model is too far from the true $f$, then our estimate will be poor. We can try to address this problem by choosing {\it flexible} models that can fit many different possible functional forms for $f$. But in general, fitting a more flexible model requires estimating a greater number of parameters. These more complex models can lead to a phenomenon known as {\it overfitting} the data, which essentially means they follow the errors, or {\it noise}, too closely. These issues are discussed throughout this book. \bigskip



\newpage



\textbf{Example of Parametric}



Figure 1.1 displays income as a function of years of education
and seniority in the Income data set. The blue surface represents the TRUE underlying relationship between income and years of education and seniority,
which is known since the data are simulated. The red dots indicate the observed
values of these quantities for 30 individuals. \bigskip

Figure 1.2 shows an example of the parametric approach applied to the
Income data from Figure 1.1. We have fit a linear model of the form

$$\text{income} \approx \beta_{0}+\beta_{1} \times \text{education} +\beta_{2} \times \text{seniority}$$


\begin{figure}[h]
  \centering
  \begin{minipage}[b]{0.4\textwidth}
    \includegraphics[width=\textwidth]{image006.png}
    \caption{True observations}
  \end{minipage}
  \hfill
  \begin{minipage}[b]{0.4\textwidth}
    \includegraphics[width=\textwidth]{image008.png}
    \caption{Estimated Graphical Representations}
  \end{minipage}
\end{figure}


\bigskip

Since we have assumed a linear relationship between the response and the two predictors, the entire fitting problem reduces to estimating $\beta_{0}, \beta_{1}$, and $\beta_{2}$, which we do using least squares linear regression. Comparing Figure 1.1 to Figure 1.2, we can see that the linear fit given in Figure 1.2 is not quite right: the true $f$ has some curvature that is not captured in the linear fit. However, the linear fit still appears to do a reasonable job of capturing the positive relationship between years of education and income, as well as the slightly less positive relationship between seniority and income. It may be that with such a small number of observations, this is the best we can do. \bigskip

\bigskip

\item \textbf{Non-Parametric Methods}

Non-parametric methods do not make explicit assumptions about the functional form of $f$. Instead they seek an estimate of $f$ that gets as close to the
data points as possible without being too rough or wiggly. Such approaches
can have a major advantage over parametric approaches: by avoiding the
ASSUMPTION of a particular functional form for $f$, they have the potential
to accurately fit a wider range of possible shapes for $f$. Any parametric
approach brings with it the POSSIBILITY that the functional form used to
estimate $f$ is very different from the true $f$, in which case the resulting
model will not fit the data well. In contrast, non-parametric approaches
completely avoid this danger, since essentially no assumption about the
form of $f$ is made. But non-parametric approaches do suffer from a major
disadvantage: since they do not reduce the problem of estimating $f$ to a
small number of parameters, a very large number of observations (far more
than is typically needed for a parametric approach) is required in order to
obtain an accurate estimate for $f.$
\bigskip
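
To make the idea concrete, here is a sketch of one simple non-parametric estimator, $K$-nearest neighbors regression. It is offered only as an illustration of the general idea; the symbol $\mathcal{N}_{0}$, for the set of the $K$ training observations closest to a point $x_{0}$, is an assumption of this sketch:
$$\hat{f}(x_{0})=\frac{1}{K}\sum_{i\in\mathcal{N}_{0}}y_{i}.$$
No functional form for $f$ is assumed; the prediction is simply a local average. For instance, with $K=3$ and neighboring responses $10$, $12$ and $14$, the prediction is $(10+12+14)/3=12$. The need for many observations is also visible here: the local average is only reliable if there are enough training points near $x_{0}$. \bigskip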

\end{enumerate}
\newpage



\subsection{The Trade-Off Between Prediction Accuracy and Model Interpretability}



Of the many methods that we examine in this book, some are less \hl{flexible}, or more restrictive, in the sense that they can produce just a relatively small range of shapes to estimate $f$. For example, linear regression is a relatively inflexible approach, because it can only generate linear functions.
\bigskip

Other methods, such as thin-plate splines, are considerably more flexible because they can generate a much wider range of possible shapes to estimate $f$. \bigskip

One might reasonably ask the following question: {\it why would we ever choose to use a more restrictive method instead of a very flexible approach}? There are several reasons that we might prefer a more restrictive model. If we are mainly interested in \hyperref[Inference]{inference}, then restrictive models are much more interpretable. For instance, when inference is the goal, the linear model may be a good choice since it will be quite easy to understand the relationship between $Y$ and $X_{1}, X_{2}, \ldots, X_{p}$. In contrast, very flexible
approaches, such as the splines and boosting discussed in later chapters, can
lead to such complicated estimates of $f$ that it is difficult to understand
how any individual predictor is associated with the response. \bigskip

\begin{figure}[ht]
  \centering
    \includegraphics[width=\textwidth]{11.png}
    \caption{Tradeoff between flexibility and interpretability}
\end{figure}
\bigskip

Figure 1.3 provides an illustration of the trade-off between flexibility and
interpretability for some of the methods that we cover in this book. Least squares linear regression is relatively inflexible but is quite interpretable. The {\it lasso}, discussed later, relies upon the linear model but uses an alternative fitting procedure for estimating the coefficients $\beta_{0}, \beta_{1}$, . . . , $\beta_{p}$. The new procedure is more restrictive in estimating the coefficients, and sets a number of them to exactly zero. Hence
in this sense the lasso is a less flexible approach than linear regression.
It is also more interpretable than linear regression, because in the final
model the response variable will only be related to a small subset of the
predictors--namely, those with nonzero coefficient estimates. \bigskip
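
For reference, the standard form of the lasso fitting criterion is sketched below. It is not stated in these notes at this point, and the tuning parameter $\lambda \geq 0$ is a symbol introduced here. The lasso chooses the coefficients that minimize the usual least squares sum plus a penalty on the size of the coefficients:
$$\sum_{i=1}^{n}\Bigl(y_{i}-\beta_{0}-\sum_{j=1}^{p}\beta_{j}x_{ij}\Bigr)^{2}+\lambda\sum_{j=1}^{p}|\beta_{j}|.$$
With $\lambda=0$ this is ordinary least squares; as $\lambda$ grows, the penalty forces more of the $\beta_{j}$ to exactly zero, which is the sense in which the procedure is more restrictive. \bigskip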

{\it Generalized additive models} (GAMs), also discussed later, instead extend the linear model to allow for certain non-linear relationships. Consequently, GAMs are more flexible than linear regression. They are also somewhat less interpretable than linear regression, because the relationship between each predictor and the response is now modeled using a curve. Finally, fully non-linear methods such as {\it bagging, boosting}, and {\it support vector machines} with non-linear kernels are highly flexible approaches that are harder to interpret. \bigskip

{\color{red}


We have established that when \hl{inference} is the goal, there are clear advantages to using simple and relatively inflexible statistical learning methods. \bigskip

In some settings, however, we are only interested in \hl{prediction}, and the interpretability of the predictive model is simply not of interest. For instance, if we seek to develop an algorithm to predict the price of a stock, our sole requirement for the algorithm is that it predict accurately--
interpretability IS NOT a concern. In this setting, we might expect that it
will be best to use the most flexible model available. Surprisingly, this is
not always the case! We will often obtain more accurate predictions using
a less flexible method. This phenomenon, which may seem counter-intuitive
at first glance, has to do with the potential for overfitting in highly flexible methods. We will discuss this very important concept further in Section 1.3 and throughout this book.
}



\newpage

\subsection{Supervised Versus Unsupervised Learning}

Most statistical learning problems fall into one of two categories: {\it supervised} or {\it unsupervised}. The examples that we have discussed so far in this chapter all fall into the supervised learning domain. For each observation of the predictor measurement(s) $x_{i}, i = 1$, . . . , $n$ there is an associated response measurement $y_{i}$. We wish to fit a model that relates the response to the predictors, with the aim of accurately predicting the response for future observations (prediction) or better understanding the relationship between the response and the predictors (inference). Many classical statistical learning methods such as linear regression and {\it logistic regression}, as well as more modern approaches such as GAM, boosting, and support vector machines, operate in the supervised learning domain. The vast majority of this book is devoted to this setting.
\bigskip


\begin{figure}[h]
  \centering
  \begin{minipage}[b]{0.4\textwidth}
    \includegraphics[width=\textwidth]{image012.png}
    \caption{}
  \end{minipage}
  \hfill
  \begin{minipage}[b]{0.4\textwidth}
    \includegraphics[width=\textwidth]{image013.png}
    \caption{}
  \end{minipage}
\end{figure}


In contrast, unsupervised learning describes the somewhat more challenging situation in which for every observation $i = 1$, . . . , $n$, we observe
a vector of measurements $x_{i}$ but NO associated response $y_{i}$. It is not possible to fit a linear regression model, since there is no response variable
to predict. In this setting, we are in some sense working blind; the situation is referred to as {\it unsupervised} because we lack a response variable that can supervise our analysis. What sort of statistical analysis is possible? We can seek to understand the relationships between the variables or between the observations. One statistical learning tool that we may use in this setting is {\it cluster analysis}, or clustering. The goal of cluster analysis is to ascertain, on the basis of $x_{1}$, . . . , $x_{n}$, whether the observations fall into relatively distinct groups. For example, in a market segmentation study we might observe multiple characteristics (variables) for potential customers, such as zip code, family income, and shopping habits. We might believe that the customers fall into different groups, such as big spenders versus low spenders. If the information about each customer's spending patterns
were available, then a supervised analysis would be possible. However, this
information is NOT available--that is, we do not know whether each potential customer is a big spender or not. In this setting, we can try to cluster
the customers on the basis of the variables measured, in order to identify
distinct groups of potential customers. Identifying such groups can be of
interest because it might be that the groups differ with respect to some
property of interest, such as spending habits. \bigskip


Figures 1.4 and 1.5 provide a simple illustration of the clustering problem. We have plotted 150 observations with measurements on two variables, $X_{1}$ and $X_{2}$. Each observation corresponds to one of three distinct groups. For
illustrative purposes, we have plotted the members of each group using
different colors and symbols. However, in practice the group memberships are unknown, and the goal is to determine the group to which each observation belongs. In Figure 1.4, this is a relatively easy task because the groups are well-separated. In contrast, Figure 1.5 illustrates a more challenging problem in which there is some overlap between the groups. A clustering method could not be expected to assign all of the overlapping points to their correct group (blue, green, or orange).\bigskip


In the examples shown in Figures 1.4 and 1.5, there are only two variables, and
so one can simply visually inspect the scatterplots of the observations in
order to identify clusters. However, in practice, we often encounter data
sets that contain many more than two variables. In this case, we cannot
easily plot the observations. For instance, if there are $p$ variables in our
data set, then $p(p-1)/2$ distinct scatterplots can be made, and visual
inspection is simply not a viable way to identify clusters. For this reason, automated clustering methods are important. We discuss clustering and
other unsupervised learning approaches in later chapters.\bigskip
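
To get a feel for how quickly the number of pairwise scatterplots grows, here is the arithmetic for a few values of $p$ (the values are chosen only for illustration):
$$\binom{p}{2}=\frac{p(p-1)}{2}:\qquad p=2\ \Rightarrow\ 1,\qquad p=10\ \Rightarrow\ \frac{10\cdot 9}{2}=45,\qquad p=100\ \Rightarrow\ \frac{100\cdot 99}{2}=4950.$$
Even a moderately sized data set with 100 variables would require inspecting nearly five thousand plots. \bigskip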

Many problems fall naturally into the supervised or unsupervised learning paradigms. However, sometimes the question of whether an analysis should be considered supervised or unsupervised is less clear-cut. For instance, suppose that we have a set of $n$ observations. For $m$ of the observations, where $m<n$, we have both predictor measurements and a response measurement. For the remaining $n-m$ observations, we have predictor measurements but no response measurement. Such a scenario can arise if the predictors can be measured relatively cheaply but the corresponding responses are much more expensive to collect. We refer to this setting as a {\it semi-supervised learning} problem. In this setting, we wish to use a statistical learning method that can incorporate the $m$ observations for which response measurements are available as well as the $n-m$ observations for which they are not. Although this is an interesting topic, it is beyond the scope of this book.
\newpage



\subsection{Regression VS Classification Problems}


Variables can be characterized as either quantitative or qualitative (also
known as categorical). Quantitative variables take on numerical values. Examples include a person's age, height, or income, the value of a house, and the price of a stock. In contrast, qualitative variables take on values in one of $K$ different {\it classes}, or categories. Examples of qualitative variables include a person's gender (male or female), the brand of product purchased (brand A, B, or C), and whether a person defaults on a debt (yes or no). We tend to refer to problems with a quantitative response as \textbf{regression} problems, while those involving a qualitative response are often referred to as \textbf{classification} problems.
\bigskip

However, the distinction is not always that crisp. Least squares linear regression (Chapter 2) is used with a quantitative response, whereas logistic
regression (Chapter 3) is typically used with a qualitative (two-class, or
{\it binary}) response. As such it is often used as a classification method. But since it estimates class probabilities, it can be thought of as a regression method as well. Some statistical methods, such as $K$-nearest neighbors and boosting, can be used in the case of either quantitative or qualitative responses. \bigskip


We tend to select statistical learning methods on the basis of whether
the response is quantitative or qualitative; i.e. we might use linear regression when quantitative and logistic regression when qualitative. However,
whether the \textbf{PREDICTORS} are qualitative or quantitative is generally considered less important. Most of the statistical learning methods discussed in this book can be applied regardless of the predictor variable type, provided
that any qualitative predictors are properly {\it coded} before the analysis is
performed. 
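
As a hedged illustration of what such coding can look like (indicator, or dummy, variables are one common choice; these notes do not commit to a particular scheme here), a qualitative predictor such as the brand of product purchased, with levels A, B and C, could be represented by two 0/1 variables:
$$x_{i1}=\begin{cases}1 & \text{if the $i$th purchase is brand B}\\ 0 & \text{otherwise}\end{cases}
\qquad
x_{i2}=\begin{cases}1 & \text{if the $i$th purchase is brand C}\\ 0 & \text{otherwise.}\end{cases}$$
Brand A is then the baseline level, corresponding to $x_{i1}=x_{i2}=0$, so a predictor with three levels needs only two indicator variables.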

\newpage


\section{Assessing Model Accuracy}


\url{https://developers.google.com/machine-learning/crash-course/classification/accuracy}

Online resources, such as the one above, mainly explain model accuracy for classification algorithms, where it is defined as the ratio of the number of correct predictions to the total number of predictions made.

$$\text{Accuracy} = \dfrac{\text{Number of correct predictions}}{\text{Number of total predictions made}}$$

\bigskip



We will, however, start off by explaining how to assess model accuracy in the \textbf{regression setting}.


\newpage


\subsection{Measuring quality of fit in Regression}

\begin{enumerate}

\item \textbf{Mean Squared Error}

There are many regression algorithms, and choosing the best one for a given problem is the hard part. We want to know which model predicts most accurately, and there is no single algorithm that is best for all datasets. So we have to choose wisely.

\bigskip

For example, suppose we are given a training set in which the true response values are known. We use the inputs to train a regression model to predict the output. We need to quantify the extent to which the PREDICTED response value for a given observation (input) is CLOSE to the TRUE response value for that observation. How do we quantify it? In regression, one popular measure is the Mean Squared Error (MSE).
\bigskip

\textbf{MSE} is defined by 

\begin{equation}
\text{MSE} = \dfrac{1}{n}\sum_{i=1}^{n}(y_i-\hat{f}(x_i))^2
\end{equation}

where $\hat{f}(x_i)$ is the prediction that $\hat{f}$ gives for the $i$th observation. The intuition is simple, and you will see a similar quantity in the next chapter called the Residual Sum of Squares. Let $y_i$ be the true output value for each input $x_i$, let $\hat{f}$ be the fitted regression function, and let $\hat{f}(x_i)$ be the predicted output value for each input $x_i$. The difference between the true output value and the predicted output value should be small if the regression model is accurate, so our aim is to make these differences small overall. Why do we square them? The errors can be negative, and without squaring, positive and negative errors would cancel each other out in the sum; squaring also penalizes large errors more heavily. Alternatively, you can take the absolute value of each error, but that approach is not covered here.

\bigskip

The MSE will be small if the predicted responses are very close to the true responses, and will be large if for some of the observations, the predicted and true responses differ substantially.
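\bigskip

As a small worked example with made-up numbers (not taken from any data set in these notes), suppose $n=3$, with true responses $3$, $5$, $7$ and predictions $2.5$, $5.5$, $8$. The errors $y_i-\hat{f}(x_i)$ are $0.5$, $-0.5$ and $-1$, so
$$\text{MSE}=\frac{1}{3}\left(0.5^{2}+(-0.5)^{2}+(-1)^{2}\right)=\frac{0.25+0.25+1}{3}=0.5.$$
The raw errors sum to $-1$, with the first two cancelling each other, whereas the squared errors cannot cancel. Averaging absolute values instead would give $(0.5+0.5+1)/3\approx 0.67$.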

\bigskip


\item \textbf{Alert! Training MSE!}


The MSE in (1.4) is computed using the training data that was used to fit the model, and so should more accurately be referred to as the {\it training} {\it MSE}. But in general, we do not really care how well the method works on the training data. Rather, {\it we are interested in the accuracy of the predictions that we obtain when we apply our method to previously unseen test data}. 

\bigskip

Here is an example to give you some intuition. Suppose that we are interested in developing an algorithm to predict a stock's price based on previous stock returns. We can train the method using stock returns from the past 6 months. But we don't really care how well our method predicts last week's stock price. We instead care about how well it will predict tomorrow's price or next month's price. On a similar note, suppose that we have clinical measurements (e.g. weight, blood pressure, height, age, family history of disease) for a number of patients, as well as information about whether each patient has diabetes. We can use these patients to train a statistical learning method to predict risk of diabetes based on clinical measurements. In practice, we want this method to accurately predict diabetes risk for \textbf{FUTURE UNKNOWN PATIENTS} based on their clinical measurements. We are not very interested in whether or not the method accurately predicts diabetes risk for patients used to train the model, since we already know which of those patients have diabetes.

\bigskip

But a natural question that follows is: if the model is accurate on the training data, should it not also be accurate on new unseen data? Unfortunately, that is a fallacy: there is no guarantee that the method with the lowest training MSE will also
have the lowest test MSE. Roughly speaking, the problem is that many
statistical methods SPECIFICALLY estimate coefficients so as to minimize the
training set MSE. For these methods, the training set MSE can be quite
small, but the test MSE is often much larger. \bigskip


As model flexibility increases, training MSE will decrease, but the test MSE may not. When a given method yields a small training MSE but a large test MSE, we are said to be overfitting the data. This happens because our statistical learning procedure is working too hard to find patterns in the training data, and may be picking up some patterns that are just caused by random chance
rather than by true properties of the unknown function $f$. When we overfit
the training data, the test MSE will be very large because the supposed
patterns that the method found in the training data simply don’t exist
in the test data. Note that regardless of whether or not overfitting has
occurred, we almost always expect the training MSE to be smaller than
the test MSE because most statistical learning methods either directly or
indirectly seek to minimize the training MSE. Overfitting refers specifically
to the case in which a less flexible model would have yielded a smaller
test MSE.
\bigskip

\item \textbf{TEST MSE}


To state the MSE more mathematically, suppose that we fit our statistical learning method on our training observations $\{(x_{1}, y_{1}), (x_{2}, y_{2}), \ldots, (x_{n}, y_{n})\}$ and we obtain the estimate $\hat{f}$. We can then compute $\hat{f}(x_{1}), \hat{f}(x_{2}), \ldots, \hat{f}(x_{n})$.
If these are approximately equal to $y_{1}, y_{2}, \ldots, y_{n}$, then the training MSE given by (1.4) is small. However, we are really not interested in whether $\hat{f}(x_{i})\approx y_{i}$; instead, we want to know whether $\hat{f}(x_{0})$ is approximately equal to $y_{0}$, where $(x_{0}, y_{0})$ is a {\textbf{previously unseen test observation not used to train the statistical learning method}}. We want to choose the method that gives
the \textbf{LOWEST TEST MSE}, as opposed to the lowest training MSE.\bigskip

In other words, if we had a large number of test observations (in reality, we often do not have the outputs for test observations), we could compute

\begin{equation}
    \mathrm{A}\mathrm{v}\mathrm{e}(y_{0}-\hat{f}(x_{0}))^{2}
\end{equation}

the average squared prediction error for these test observations $(x_{0}, y_{0})$. Note that if there are $m$ test observations $(x_{0,k}, y_{0,k})$, $k = 1, \ldots, m$, then equation 1.5 can be written as
\begin{equation}
    \dfrac{1}{m}\sum_{k=1}^{m}(y_{0,k}-\hat{f}(x_{0,k}))^{2}
\end{equation}

\bigskip


We'd like to select the model for which the average of this quantity--the test MSE--is as small as possible.\bigskip



How can we go about trying to select a method that minimizes the test MSE? In some settings, we may have a test data set available--that is, we may have access to a set of observations that were not used to train the statistical learning method. We can then simply evaluate (1.5) on the test observations, and select the learning method for which the test MSE is smallest. However, in reality, we may not have any test observations available.

\bigskip



\item \textbf{Flexibility of a Model}


We will cover flexibility later, but here is an intuition.

You can think of the ``flexibility'' of a model as the model's ``curviness'' when graphing the model equation. A linear regression is said to be inflexible. On the other hand, if you have 9 training sets that are each very different, and the fit has to bend to follow each of them, the model is deemed flexible, simply because it cannot be a straight line. \bigskip


Of course, there is an essential assumption that these models are adequate representations of the training data (a linear representation doesn't work well for highly spread out data, and a jagged polynomial representation doesn't work well for data that follow a straight line).
\bigskip
As a result, a flexible model will:

\begin{enumerate}
    \item [1.] Generalize well across the different training sets. This comes at the cost of higher variance, which is why flexible models are generally associated with low bias.

    \item [2.] Perform better as complexity and/or the number of data points increase (up to a point, beyond which it will not perform better).
\end{enumerate}
\bigskip


\item \textbf{Throughout this book, we discuss a
variety of approaches that can be used in practice to estimate the point at which
the test MSE is minimized. One important method is cross-validation (Chapter 4), which is a method
for estimating test MSE using the training data.}

\end{enumerate}

\newpage


\subsection{The Bias-Variance Trade-Off}


\begin{enumerate}

\item \textbf{Bias}


\begin{center}
  \makebox[\textwidth]{\includegraphics[width=100mm,scale=0.5]{bias1.jpg}}
\end{center}

Graphical interpretation of Bias - Variance. \bigskip

Let’s understand this image. This is a bull’s-eye diagram. Assume that the center of the target (colored red) is a model that perfectly predicts the correct values. As we move away from the bull’s-eye, our predictions get worse. Imagine we can repeat our entire model building process to get a number of separate hits on the target. Each hit represents an individual realization of our model, given the chance variability in the training data we gather. Sometimes we will get a good distribution of training data, so we predict very well and land close to the bull’s-eye, while sometimes our training data might be full of outliers or non-standard values, resulting in poorer predictions. These different realizations result in a scatter of hits on the target.
\bigskip

Let’s look at the definition of Bias and Variance :

\textbf{\Large Bias} - Bias measures how far off our predictions are from the real values. Generally, parametric algorithms have a high bias, making them fast to learn and easier to understand but generally less flexible. In turn, they have lower predictive performance on complex problems that fail to meet the simplifying assumptions behind the algorithm's bias.
\bigskip

\textbf{Low Bias}: Suggests fewer assumptions about the form of the target function $f$. \bigskip

\textbf{High Bias}: Suggests more assumptions about the form of the target function $f$.

\bigskip


Examples of low-bias machine learning algorithms include: Decision Trees, k-Nearest Neighbors and Support Vector Machines.
\bigskip


Examples of high-bias machine learning algorithms include: Linear Regression, Linear Discriminant Analysis and Logistic Regression. \bigskip


\textbf{\Large Variance} - Change in predictions across different data sets. Again, imagine you can repeat the entire model building process multiple times. The variance is how much the predictions for a given point vary between different realizations of the model. \hl{In other words,
Variance is the amount that the estimate of the target function will change if different training data was used.}
The target function is estimated from the training data by a machine learning algorithm, so we should expect the algorithm to have some variance. Ideally, it should not change too much from one training dataset to the next, meaning that the algorithm is good at picking out the hidden underlying mapping between the inputs and the output variables. \bigskip

Machine learning algorithms that have a high variance are strongly influenced by the specifics of the training data. This means that the specifics of the training data influence the number and types of parameters used to characterize the mapping function.
\bigskip

\textbf{Low Variance}: Suggests small changes to the estimate of the target function with changes to the training dataset.
\bigskip

\textbf{High Variance}: Suggests large changes to the estimate of the target function with changes to the training dataset.
\bigskip

Generally, non-parametric machine learning algorithms that have a lot of flexibility have a high variance. For example, decision trees have a high variance, which is even higher if the trees are not pruned before use.
\bigskip


Examples of low-variance machine learning algorithms include: Linear Regression, Linear Discriminant Analysis and Logistic Regression.
\bigskip

Examples of high-variance machine learning algorithms include: Decision Trees, k-Nearest Neighbors and Support Vector Machines.

\bigskip


\item \textbf{More formal Idea of Bias - Variance Trade-off}


Though the mathematical proof is beyond the scope of this book, it is possible to show that the expected test MSE, for a given value/input $x_{0}$, can always be decomposed into the sum of three fundamental quantities: the {\it variance} of $\hat{f}(x_{0})$ , the squared {\it bias} of $\hat{f}(x_{0})$ and the variance of the error terms $\epsilon$. That is,

\begin{equation}
\text{E}\left[y_{0}-\hat{f}(x_{0})\right]^{2}= \Var(\hat{f}(x_{0}))+[\text{Bias}(\hat{f}(x_{0}))]^{2}+ \Var(\epsilon)  
\end{equation}

\bigskip

Here the notation $E(y_{0}-\hat{f}(x_{0}))^{2}$ defines the {\it expected test MSE}, and refers to the average test MSE that we would obtain if we repeatedly estimated $f$ using a large number of training sets, and tested each at $x_{0}$. The overall expected test MSE can be computed by averaging $E(y_{0}-\hat{f}(x_{0}))^{2}$ over all possible values of $x_{0}$ in the test set. \bigskip



Equation 1.7 tells us that in order to \textbf{MINIMIZE the expected test error}, we need to select a statistical learning method that simultaneously achieves \textbf{low variance and low bias}. Note that variance is inherently a nonnegative quantity, and squared bias is also nonnegative. Hence, we see that the expected test MSE can never lie below $\Var(\epsilon)$, the irreducible error. 
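
\bigskip

The decomposition can be sketched in a few lines. This is only an outline, under assumptions that these notes do not state explicitly: $y_0 = f(x_0) + \epsilon$ with $\text{E}[\epsilon] = 0$, the noise $\epsilon$ is independent of $\hat{f}(x_0)$, and the expectation is taken over both the training sets and $\epsilon$. Writing $f$ for $f(x_0)$ and $\hat{f}$ for $\hat{f}(x_0)$, and defining $\text{Bias}(\hat{f}) = \text{E}[\hat{f}] - f$:

$$\begin{aligned}
\text{E}\left[(y_0-\hat{f})^2\right] &= \text{E}\left[(f-\hat{f})^2\right] + 2\,\text{E}\left[(f-\hat{f})\,\epsilon\right] + \text{E}\left[\epsilon^2\right] \\
&= \text{E}\left[(f-\hat{f})^2\right] + \text{Var}(\epsilon) \\
\text{E}\left[(f-\hat{f})^2\right] &= \left(f-\text{E}[\hat{f}]\right)^2 + 2\left(f-\text{E}[\hat{f}]\right)\text{E}\left[\text{E}[\hat{f}]-\hat{f}\right] + \text{E}\left[\left(\hat{f}-\text{E}[\hat{f}]\right)^2\right] \\
&= [\text{Bias}(\hat{f})]^2 + \text{Var}(\hat{f})
\end{aligned}$$

The first cross term vanishes because $\epsilon$ has mean zero and is independent of $\hat{f}$; the second vanishes because $\text{E}\left[\text{E}[\hat{f}]-\hat{f}\right] = 0$. Combining the two results gives the three-term sum stated above.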

\bigskip

\textbf{Important Explanation here!}


What do we mean by the \textbf{variance} and \textbf{bias} of a statistical learning method?

\bigskip


\textbf{Variance} refers to the amount by which $\hat{f}$ would change if we estimated it using a different training data set. Since the training data are used to fit the statistical learning method, different training data sets
will result in a different $\hat{f}$. But ideally the estimate for $f$ should not vary too much between training sets. However, if a method has high variance
then small changes in the training data can result in large changes in $\hat{f}$. In general, more flexible statistical methods have higher variance.

\bigskip


\textbf{Bias}, on the other hand, refers to the error that is introduced by approximating a real-life problem, which may be extremely complicated, by a much simpler model. For example, linear regression assumes that there is a linear relationship between $Y$ and $X_{1}, X_{2}, \ldots, X_{p}$. It is unlikely that any real-life problem truly has such a simple linear relationship, and so performing linear regression will undoubtedly result in some bias in the estimate of $f$. For example, if the true $f$ of a dataset is substantially non-linear, but we used a linear regression model, then no matter how many training observations we are given, it will not be possible to produce an accurate estimate using linear regression. In other words, linear regression
results in high bias in this example. Generally, more flexible methods result in less bias. \bigskip



As a general rule, as we use more flexible methods, the variance will
increase and the bias will decrease. The relative rate of change of these
two quantities determines whether the test MSE increases or decreases. As
we increase the flexibility of a class of methods, the bias tends to initially
decrease faster than the variance increases. Consequently, the expected
test MSE declines. However, at some point increasing flexibility has little
impact on the bias but starts to significantly increase the variance. When
this happens the test MSE increases. 

\bigskip


Good test set performance of a statistical learning method requires low variance as well as low squared bias. This is referred to as a trade-off because it is easy to obtain a method with extremely low bias but high variance (for instance, by drawing a curve that passes through every single training observation) or a method with very low variance but high bias (by fitting a horizontal line to the data). The challenge lies in finding a method for which both the variance and the squared bias are low. This trade-off is one of the most important recurring themes in this book.

\bigskip

In a real-life situation in which $f$ is unobserved, it is generally not possible to explicitly compute the test MSE, bias, or variance for a statistical learning method. Nevertheless, one should always keep the bias-variance trade-off in mind. In this book we explore methods that are extremely flexible and hence can essentially eliminate bias. However, this does not guarantee that they will outperform a much simpler method such as linear
regression. To take an extreme example, suppose that the true $f$ is linear.
In this situation linear regression will have no bias, making it very hard
for a more flexible method to compete. In contrast, if the true $f$ is highly
non-linear and we have an ample number of training observations, then
we may do better using a highly flexible approach. In Chapter 4 we discuss cross-validation, which is a way to estimate the test MSE using the training data.

\bigskip



\item \textbf{Understanding Over- and Under-Fitting}

At its root, dealing with bias and variance is really about dealing with over- and under-fitting. Bias is reduced and variance is increased in relation to model complexity. As more and more parameters are added to a model, the complexity of the model rises and variance becomes our primary concern while bias steadily falls. For example, as more polynomial terms are added to a linear regression, the greater the resulting model's complexity will be. In other words, bias has a negative first-order derivative in response to model complexity while variance has a positive slope.

\begin{center}
  \makebox[\textwidth]{\includegraphics[width=100mm,scale=0.5]{bias2.png}}
\end{center}


Understanding bias and variance is critical for understanding the behavior of prediction models, but in general what you really care about is overall error, not the specific decomposition. The sweet spot for any model is the level of complexity at which the reduction in squared bias from a small increase in complexity is exactly offset by the increase in variance. Mathematically:
$$\frac{d\,\text{Bias}^2}{d\,\text{Complexity}}=-\frac{d\,\text{Variance}}{d\,\text{Complexity}}$$


If our model complexity exceeds this sweet spot, we are in effect over-fitting our model; while if our complexity falls short of the sweet spot, we are under-fitting the model. In practice, there is not an analytical way to find this location. Instead we must use an accurate measure of prediction error and explore differing levels of model complexity and then choose the complexity level that minimizes the overall error.
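
\bigskip

The sweet-spot condition can be read off from the error decomposition. The following is a sketch under simplifying assumptions that are not part of the original discussion: complexity is treated as a continuous quantity $c$, squared bias and variance are differentiable functions of $c$, and the minimum is attained in the interior of the range of $c$. Then

$$\begin{aligned}
\text{Err}(c) &= \text{Bias}^2(c) + \text{Variance}(c) + \text{Var}(\epsilon) \\
\frac{d\,\text{Err}}{dc} &= \frac{d\,\text{Bias}^2}{dc} + \frac{d\,\text{Variance}}{dc} = 0
\quad\Longrightarrow\quad
\frac{d\,\text{Bias}^2}{dc} = -\frac{d\,\text{Variance}}{dc}
\end{aligned}$$

The irreducible error $\text{Var}(\epsilon)$ does not depend on $c$, so it drops out of the derivative. In practice model complexity is usually discrete (for example, the degree of a polynomial), which is one reason the minimum is located empirically instead.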

\bigskip

\textbf{An extremely useful discussion of unbiasedness is in Section 3.1.2 of An Introduction to Statistics, page 65}.



\bigskip





\end{enumerate}












\newpage


\chapter{Linear Regression}

\section{Simple Linear Regression}

\begin{enumerate}

\subsection{Basics}


\item \textbf{Definition (Linear Regression)}

\begin{enumerate}
    \item \textbf{Model Formulation}
    
    We start off with a basic/simple linear regression model in which there is only one predictor variable. Let $X$ be the independent variable and $Y$ the dependent variable. Note that we can view $X$ and $Y$ as vectors in $\R^n$, where $$X= \left[X_1,...,X_n\right], Y = \left[Y_1,...,Y_n\right]$$ are the data points of the data set.
    
    \bigskip
    

So, we can usually write a linear regression like this:

$$Y_i = \beta_0 + \beta_1X_i + \epsilon_i \text{ where } \epsilon_i \sim N(0, \sigma^2)$$

A few things to note: 

$Y$ is the observed dependent variable\\
$X$ is the observed independent variable, and\\
$\beta_0$ is the intercept and $\beta_1$ is the slope for $X$; both are called parameters. Note $\beta_0$ does not have a particular meaning unless the model data set includes $X=0$, in which case $\beta_0$ gives the mean of the probability distribution of $Y$ at $X=0$.\bigskip

$\epsilon$ is the error term. In regression analysis, the difference between the observed value of the dependent variable $\y$ and the predicted value $\hat{\y}$ is called the residual; these notes denote it by $\epsilon$ as well. Each data point has one residual.
\bigskip


\item \textbf{Applied Regression Page 9-10, come back when I am more familiar with basics of statistical features.}

\bigskip




\item \textbf{Residual Variable}


$$\text{Residual = Observed value - Predicted value }  = \y - \hat{\y} $$


For a least squares fit that includes an intercept, both the sum and the mean of the residuals are equal to zero. That is, $\sum{\epsilon} = 0$ and $\bar{\epsilon}= 0$.

\bigskip

    \item Multiple Linear Regression: $$\y = b_0+b_1\x_1+...+b_n\x_n$$
    
    The idea is the same as above: if we want the output to be an employee's salary, there can be multiple independent variables (factors) that are associated with, or directly cause, the salary. 
\end{enumerate}

\bigskip

\item \textbf{Estimating the Coefficients}

In reality, the model coefficients $\beta_0, \beta_1$ are unknown, so our aim is to find estimators for them, which we name $\hat{\beta_0}, \hat{\beta_1}$. We will use the Least Squares method.\bigskip

The observational or experimental data are to be used for estimating the parameters (coefficients) of the
regression function consisting of observations on the explanatory or predictor variable $X$ and
the corresponding observations on the response variable $Y$. For each trial, there is an $X$
observation and a $Y$ observation. We denote the $(X, Y)$ observations for the first trial as
$(X_1, Y_1)$, for the second trial as $(X_2, Y_2)$, and in general for the $i$th trial as $(X_i, Y_i)$, where
$i = 1, ... ,n$.


\bigskip


\item \textbf{Method of Least Squares}

Now, the above regression equation in 1(a) can be rewritten as:
$$Y_i-(\beta_0 + \beta_1X_i) = \epsilon_i $$
which, by 1(c), can be understood as

$$Y_i - (\beta_0 + \beta_1X_i) \sim N(0, \sigma^2)$$

Which we can further simplify to:

$$Y_i \sim N(\beta_0 + \beta_1X_i, \sigma^2)$$

And lastly, note that in practice we work with the predicted values of $Y$, which we denote by $\hat{Y}$. Dropping the error term gives $$\hat{Y_i} \approx \beta_0 + \beta_1X_i$$ so we can simplify again:

$$Y_i \sim N(\hat{Y_i}, \sigma^2)$$

Notice the subscript $i$. This means that, for every individual, we say their observed score comes from a normal distribution with a \emph{mean of the predicted value} and a given residual variance. Notice that, for every person, this predicted value is different, but the variance is the same. The variance being the same is where we get the assumption of \textbf{homogeneity of variance}.

Now, I don't really like using ``average value'' and ``expected value'' interchangeably, because for some distributions, they are not the same thing. But for the normal distribution, the average and expected values are the same (because the mean is equal to the mode of a normal distribution). But that is what they mean by \emph{average}. You are basically saying:

``Every observation of $\y$ is distributed with a mean of $\hat{\y}$ and a variance of $\sigma^2$.'' 
    
\bigskip





\item \textbf{Residual Sum of Squares}

Let $\hat{Y_i} = \hat{\beta_0} + \hat{\beta_1}X_i$ be the prediction for $Y$ based on the $i$th value of $X$.
Then $\epsilon_i =Y_i-\hat{Y_i}$ represents the $i$th residual. We define the residual sum of squares (RSS) as
$$\text{RSS} = \epsilon_1^2+\epsilon_2^2+...+\epsilon_n^2$$

or equivalently as $$\text{RSS} = (Y_1-\hat{\beta_0}-\hat{\beta_1}X_1)^2+(Y_2-\hat{\beta_0}-\hat{\beta_1}X_2)^2+...+(Y_n-\hat{\beta_0}-\hat{\beta_1}X_n)^2$$
\bigskip


The least squares approach chooses $\hat{\beta_0}, \hat{\beta_1}$ to minimize the RSS. Intuitively, picture a scatter plot through which you want to draw a best-fit line: this method picks the line that minimizes the overall distance between the actual $Y$ values and the predicted $Y$ values. We square the residuals because the raw differences can be positive or negative; summed as they are, they would cancel out (for the least squares line they sum to $0$), which would be meaningless as a measure of fit. \bigskip

With some simple calculus, we can actually work out in formula what are those minimizers:

$$\hat{\beta_1} = \dfrac{\sum_{i=1}^{n}(X_i-\bar{X})(Y_i-\bar{Y})}{\sum_{i=1}^{n}(X_i-\bar{X})^2}$$

$$\hat{\beta_0} = \bar{Y}-\hat{\beta_1}\bar{X}$$

where $\bar{Y} = \frac{1}{n}\sum_{i=1}^{n}Y_i$ and $\bar{X} = \frac{1}{n}\sum_{i=1}^{n}X_i$ are the sample means. So now the above minimizers uniquely define the least squares coefficient estimates for simple linear
regression.
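
\bigskip

The ``simple calculus'' can be sketched as follows. This is an outline only, assuming the $X_i$ are not all equal (so that $\sum_{i=1}^{n}(X_i-\bar{X})^2 > 0$). Setting the two partial derivatives of RSS to zero gives

$$\begin{aligned}
\frac{\partial\,\text{RSS}}{\partial \hat{\beta_0}} &= -2\sum_{i=1}^{n}(Y_i-\hat{\beta_0}-\hat{\beta_1}X_i) = 0 \\
\frac{\partial\,\text{RSS}}{\partial \hat{\beta_1}} &= -2\sum_{i=1}^{n}X_i(Y_i-\hat{\beta_0}-\hat{\beta_1}X_i) = 0
\end{aligned}$$

Dividing the first equation by $-2n$ gives $\hat{\beta_0} = \bar{Y}-\hat{\beta_1}\bar{X}$. Substituting this into the second equation:

$$\begin{aligned}
\sum_{i=1}^{n}X_i\left[(Y_i-\bar{Y})-\hat{\beta_1}(X_i-\bar{X})\right] = 0
\quad\Longrightarrow\quad
\hat{\beta_1} = \frac{\sum_{i=1}^{n}X_i(Y_i-\bar{Y})}{\sum_{i=1}^{n}X_i(X_i-\bar{X})} = \frac{\sum_{i=1}^{n}(X_i-\bar{X})(Y_i-\bar{Y})}{\sum_{i=1}^{n}(X_i-\bar{X})^2}
\end{aligned}$$

The last equality uses $\sum_{i=1}^{n}\bar{X}(Y_i-\bar{Y}) = 0$ and $\sum_{i=1}^{n}\bar{X}(X_i-\bar{X}) = 0$. Note that the first condition is exactly the statement that the residuals of the fitted line sum to zero.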

\bigskip

\item \textbf{Theorem (Gauss Markov)}

Under the conditions of regression model $$Y_i = \beta_0+\beta_1X_i+\epsilon_i$$

the least squares
estimators $\hat{\beta_0}$ and $\hat{\beta_1}$ are unbiased and have minimum variance among all unbiased linear estimators.

\bigskip



\item \textbf{Definition (Residual Standard Error (RSE))}


Before we start on the definition of RSE, it is necessary to understand and review the concept of sample mean and the likes. \bigskip

\textbf{Estimation}

The least squares line can always be computed using the coefficient estimates $\hat{\beta_0}, \hat{\beta_1}$ given in pointer 4. But the true population regression line is unobserved, as we can never collect accurate data for everyone. So imagine that Singapore has 6 million people, and we let that be our data set (6 million points). Assume there is a TRUE relationship among the 6 million data points, call it $Y = mX+c$. On the other hand, suppose we generate 10 random sample data sets from the population; DO NOTE that these 10 data sets will not always produce the exact same least squares line!!! There will always be some subtle (or unfortunately, huge) differences. \bigskip


Fundamentally, the concept of these two lines is a natural extension of the standard statistical
approach of using information from a sample to estimate characteristics of a
large population. For example, suppose that we are interested in knowing
the population mean $\mu$ of some random variable $Y$ . Unfortunately, $\mu$ is
unknown, but we do have access to $n$ observations (data points) from $Y$, which we can
write as $y_1, ... , y_n$, and which we can use to estimate $\mu$. A reasonable
estimate is $\hat{\mu}=\bar{y}$, where $$\bar{y} = \frac{1}{n}\sum_{i=1}^ny_i$$ 
is the sample mean. The sample
mean and the population mean are different, but in general the sample
mean will provide a GOOD estimate of the population mean. In the same
way, the unknown coefficients $\beta_0$ and $\beta_1$ in linear regression define the
population regression line. We seek to estimate these unknown coefficients
using $\hat{\beta_0}$ and $\hat{\beta_1}$ given in pointer 4. These coefficient estimates define the least squares line.

\bigskip

So if we use $\hat{\mu}$ to estimate the true population mean $\mu$, we say that this estimate is UNBIASED; in layman's terms, this means we EXPECT that, on average, $\hat{\mu} = \mu$. But one may ask, what do you mean by on average?? Does that mean that for some samples the estimated mean can be way below the true mean, and for other samples it can be above? It means
that on the basis of one particular set of observations $y_1, ... , y_n$, $\hat{\mu}$ might overestimate $\mu$, and on the basis of another set of observations, $\hat{\mu}$ might
underestimate $\mu$. But if we could average a huge number of estimates of
$\mu$ obtained from a huge number of sets of observations, then this average
would exactly equal $\mu$. Hence, an unbiased estimator does not systematically
over- or under-estimate the true parameter. The property of unbiasedness
holds for the least squares coefficient estimates as well: if
we estimate $\beta_0$ and $\beta_1$ on the basis of a particular data set, then our
estimates won’t be exactly equal to $\beta_0$ and $\beta_1$. But if we could average
the estimates obtained over a huge number of data sets, then the average
of these estimates would be spot on!

\bigskip


We continue the analogy with the estimation of the population mean $\mu$ of a random variable $Y$ . A natural question is as follows: how accurate
is the sample mean $\hat{\mu}$ as an estimate of $\mu$? We have established that the
average of $\hat{\mu}$'s over many data sets will be very close to $\mu$, but that a
single estimate $\hat{\mu}$ may be a substantial underestimate or overestimate of $\mu$.
How far off will that single estimate of $\hat{\mu}$ be? In general, we answer this
question by computing the standard error of $\hat{\mu}$, written as $\text{SE}(\hat{\mu})$. We have
the well-known formula: 

\begin{equation}
\text{Var}(\hat{\mu}) = \text{SE}(\hat{\mu})^2 = \dfrac{\sigma^2}{n} \tag{\text{Standard-Error}}\label{Standard-Error}
\end{equation}

 where $\sigma$ is the standard deviation of each of the realizations $y_i$ of $Y$. Note that the $n$ observations here must be uncorrelated for the formula to hold. In layman's terms, the standard error tells us the average amount by which this estimate $\hat{\mu}$ differs from the actual value of $\mu$. Equation \eqref{Standard-Error} readily tells us that as $n$ tends to infinity (becomes large), our standard error tends to $0$. \bigskip
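
 The formula for $\text{Var}(\hat{\mu})$ follows directly from the rules for the variance of a sum. As a short sketch, assuming each $y_i$ has variance $\sigma^2$ and the observations are pairwise uncorrelated:

 $$\begin{aligned}
 \text{Var}(\hat{\mu}) = \text{Var}\left(\frac{1}{n}\sum_{i=1}^{n}y_i\right)
 &= \frac{1}{n^2}\left[\sum_{i=1}^{n}\text{Var}(y_i) + \sum_{i\neq j}\text{Cov}(y_i,y_j)\right] \\
 &= \frac{1}{n^2}\left[n\sigma^2 + 0\right] = \frac{\sigma^2}{n}
 \end{aligned}$$

 This shows where the uncorrelatedness requirement enters: if the covariance terms were not zero, they would remain in the sum and the formula would no longer hold. \bigskip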
 
 Now we also can extend our idea to asking: How close would our ESTIMATED coefficients, $\hat{\beta_0}, \hat{\beta_1}$ be as compared to the TRUE values $\beta_0, \beta_1$? To compute the standard errors associated with $\hat{\beta_0}, \hat{\beta_1}$, we use the following formulas:
 
 \begin{equation}
     \text{SE}\left(\hat{\beta_0}\right)^2 = \sigma^2\left[\dfrac{1}{n}+\dfrac{\bar{x}^2}{\sum_{i=1}^n(x_i-\bar{x})^2}\right] ~~~
     \text{SE}\left(\hat{\beta_1}\right)^2 = \dfrac{\sigma^2}{\sum_{i=1}^n(x_i-\bar{x})^2} \label{2.1}
 \end{equation}

where $\sigma^2 = \Var(\epsilon)$. Also notice that equation \eqref{2.1} above involves $\sigma^2$; in general, the population variance $\sigma^2$ is not known, but it can be estimated from the data. The estimate of the population standard deviation $\sigma$ is known as the \textbf{Residual Standard Error}. The formula is as follows: $$\text{RSE}=\sqrt{\text{RSS}/(n-2)}$$
\bigskip
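
To make the formulas concrete, here is a small worked example on a made-up data set of $n=3$ points, $(1,1), (2,3), (3,2)$. The data are purely illustrative and far too few for real inference. With $\bar{x}=2$ and $\bar{y}=2$:

$$\begin{aligned}
\sum_{i=1}^{3}(x_i-\bar{x})^2 &= 1+0+1 = 2, \qquad \sum_{i=1}^{3}(x_i-\bar{x})(y_i-\bar{y}) = (-1)(-1)+(0)(1)+(1)(0) = 1 \\
\hat{\beta_1} &= \frac{1}{2} = 0.5, \qquad \hat{\beta_0} = 2 - 0.5\cdot 2 = 1 \\
\text{fitted values} &: 1.5,\ 2,\ 2.5 \qquad \text{residuals}: -0.5,\ 1,\ -0.5 \quad (\text{sum} = 0) \\
\text{RSS} &= 0.25+1+0.25 = 1.5, \qquad \text{RSE} = \sqrt{1.5/(3-2)} \approx 1.22 \\
\text{SE}(\hat{\beta_1})^2 &\approx \frac{\text{RSE}^2}{\sum_{i=1}^{3}(x_i-\bar{x})^2} = \frac{1.5}{2} = 0.75, \qquad \text{SE}(\hat{\beta_0})^2 \approx 1.5\left[\frac{1}{3}+\frac{4}{2}\right] = 3.5
\end{aligned}$$

The standard errors are marked as approximate because the unknown $\sigma^2$ has been replaced by its estimate $\text{RSE}^2$. \bigskip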

So what good are standard errors? Standard errors can be used to compute confidence intervals. A $95\%$
confidence interval is defined as a range of values such that with $95\%$
probability, the range will contain the true unknown value of the parameter.
The range is defined in terms of lower and upper limits computed from the
sample of data. For linear regression, the $95\%$ confidence interval for $\beta_1$
approximately takes the form 
\begin{equation}
    \hat{\beta_1} \pm 2 \cdot \text{SE}(\hat{\beta_1})
\end{equation}

That is, there is approximately a $95\%$ chance that the interval $$\left[\hat{\beta_1}-2\cdot \text{SE}(\hat{\beta_1}), \hat{\beta_1}+2\cdot \text{SE}(\hat{\beta_1})\right]$$ will contain the TRUE value of $\beta_1$. Similarly, the confidence interval of $\beta_0$ is of the same form.

\bigskip

\item \textbf{Hypothesis Testing on Regression}

Standard errors can also be used to perform hypothesis tests on the
coefficients. The most common hypothesis test involves testing the null
hypothesis of

\begin{equation}
    H_0: \text{There is no relationship between } X \text{ and } Y
\end{equation}


\begin{equation}
    H_1: \text{There is some relationship between } X \text{ and } Y
\end{equation}

More rigorously, in mathematical terms, it is equivalent to testing: 

\begin{equation}
    H_0: \beta_1 = 0
\end{equation}


\begin{equation}
    H_1: \beta_1 \neq 0
\end{equation}

Because if $\beta_1 = 0$, then our model reduces to $Y = 0\cdot X + \beta_0 + \epsilon$, and $X$ is forced to disappear from this model, so $X$ is not associated with $Y$ in the linear sense (we do not treat a zero gradient as a linear relationship here because it would not make sense). So of course, we want the alternative hypothesis to win; otherwise there is no point in running the regression model, since $X$ and $Y$ are not associated in the first place. Here I refer you to read page 67 of An Introduction to Statistics. \bigskip
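
The notes above do not spell out the test itself. As a brief sketch of the standard approach, the test is based on how many standard errors $\hat{\beta_1}$ lies away from $0$:

$$t = \frac{\hat{\beta_1}-0}{\text{SE}(\hat{\beta_1})}$$

Assuming the errors are normally distributed as in the model formulation, $t$ follows a $t$-distribution with $n-2$ degrees of freedom when $H_0$ is true. The p-value is the probability, under $H_0$, of observing a value at least as large as $|t|$ in absolute value; a small p-value leads us to reject $H_0$ in favor of $H_1$. \bigskip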


\textbf{Assessing the accuracy of the model}

After checking that p-values are indeed small, that $X, Y$ are indeed related in some sense, it is natural to want to quantify the extent to which the
model fits the data. The quality of a linear regression fit is typically assessed
using two related quantities: the residual standard error (RSE) and the $R^2$
statistic.
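
\bigskip

For reference, the usual definitions of the two quantities are sketched below. The symbol TSS (total sum of squares) is introduced here and is not used elsewhere in these notes:

$$\begin{aligned}
\text{RSE} &= \sqrt{\frac{\text{RSS}}{n-2}}, \qquad \text{TSS} = \sum_{i=1}^{n}(Y_i-\bar{Y})^2 \\
R^2 &= \frac{\text{TSS}-\text{RSS}}{\text{TSS}} = 1-\frac{\text{RSS}}{\text{TSS}}
\end{aligned}$$

The RSE is measured in the units of $Y$, so whether it is ``small'' depends on the problem. $R^2$ is the proportion of the variability in $Y$ that the regression explains; for a least squares fit with an intercept it lies between $0$ and $1$.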





\item \textbf{Definition (Multiple Linear Regression)}

Note that multicollinearity occurs when independent variables in a regression model are correlated. This correlation is a problem because independent variables should be independent, and therefore not strongly correlated (as a rule of thumb, the value of the correlation should be smaller than 0.8). If the degree of correlation between variables is high enough, it can cause problems when you fit the model and interpret the results.


\bigskip

The need to reduce multicollinearity depends on its severity and your primary goal for your regression model. Keep the following three points in mind:\bigskip

1. The severity of the problems increases with the degree of the multicollinearity. Therefore, if you have only moderate multicollinearity, you may not need to resolve it.\bigskip

2. Multicollinearity affects only the specific independent variables that are correlated. Therefore, if multicollinearity is not present for the independent variables that you are particularly interested in, you may not need to resolve it. Suppose your model contains the experimental variables of interest and some control variables. If high multicollinearity exists for the control variables but not the experimental variables, then you can interpret the experimental variables without problems. \bigskip

3. Multicollinearity affects the coefficients and p-values, but it does not influence the predictions, precision of the predictions, and the goodness-of-fit statistics. If your primary goal is to make predictions, and you don’t need to understand the role of each independent variable, you don’t need to reduce severe multicollinearity.

\bigskip


Just keep in mind that only the terms in the model with high VIFs are actually affected by multicollinearity. You can have some terms with high VIFs and others with low VIFs. Multicollinearity does not affect the variables with low VIFs, and you don’t need to worry about those. But I think one should compute a correlation matrix nonetheless, so that one can pinpoint which individual variable is the culprit.


\bigskip
\newpage

\item \textbf{Ceteris Paribus}


When we do multiple regressions and say we are looking at the average change in the $y$ variable for a change in an $x$ variable, holding all other variables constant, what values are we holding the other variables constant at? Their mean? Zero? Any value?

\bigskip

{\color{red}

The rough intuition should be visualized first by understanding a simple formula, consider: $$y = b_1x_1+b_2x_2+b_3x_3+d$$

When we say look at the average change in the $y$ variable for a change in $x_3$, \textbf{holding all other variables } $x_1, x_2$ \textbf{constant}, we mean that we can take any value of $x_1, x_2$, and we hold that value right there and do not 'move it', we can then observe the effect of one unit change in $x_3$, which leads to a change of $b_3$ in $y$. So in the mathematical proof below, we easily see that as long as we HOLD the other variables constant, $b_3$ will always be the effect of a unit change in $x_3$. \bigskip



You are right.  Technically, it is \emph{any value}.  However, when I teach this I usually tell people that you are getting the effect of a one unit change in $X_j$ when all other variables are held at their respective means.  I believe this is a common way to explain it that is not specific to me.  \bigskip

I usually go on to mention that if you don't have any interactions, $\beta_j$ will be the effect of a one unit change in $X_j$, no matter what the values of your other variables are.  But I like to start with the mean formulation.  The reason is that there are two effects of including multiple variables in a regression model.  First, you get the effect of $X_j$ controlling for the other variables (see link below).  The second is that the presence of the other variables (typically) reduces the residual variance of the model, making your variables (including $X_j$) `more significant'.  It is hard for people to understand how this works if the other variables have values that are all over the place.  That seems like it would \emph{increase} the variability somehow.  If you think of adjusting each data point up or down for the value of each other variable until all the rest of the $X$ variables have been moved to their respective means, it is easier to see that the residual variability has been reduced. \bigskip


I don't get to interactions until a class or two after I've introduced the basics of multiple regression.  However, when I do get to them, I return to this material.  The above applies when there \emph{are not} interactions.  When there are interactions, it is more complicated.  In that case, the interacting variable[s] is being held constant (very specifically) at $0$, and at no other value.  

\bigskip


If you want to see how this plays out algebraically, it is rather straight-forward.  We can start with the no-interaction case.  Let's determine the change in $\hat Y$ when all other variables are held constant at their respective means.  Without loss of generality, let's say that there are three $X$ variables and we are interested in understanding how the change in $\hat Y$ is associated with a one unit change in $X_3$, holding $X_1$ and $X_2$ constant at their respective means:  \bigskip

\begin{align}
\hat Y_i    &= \hat\beta_0 + \hat\beta_1\bar X_1 + \hat\beta_2\bar X_2 + \hat\beta_3X_{3}  \\
\hat Y_{i'} &= \hat\beta_0 + \hat\beta_1\bar X_1  + \hat\beta_2\bar X_2 + \hat\beta_3(X_{3}\!+\!1)  \\
~  \\
&\text{subtracting the first equation from the second:}  \\
~  \\
\hat Y_{i'} - \hat Y_i &= \hat\beta_0 - \hat\beta_0 + \hat\beta_1\bar X_1 - \hat\beta_1\bar X_1 + \hat\beta_2\bar X_2 - \hat\beta_2\bar X_2 + \hat\beta_3(X_{3}\!+\!1) - \hat\beta_3X_{3}  \\
\Delta Y &= \hat\beta_3X_{3} + \hat\beta_3 - \hat\beta_3X_{3}  \\
\Delta Y &= \hat\beta_3
\end{align}  

Now it is obvious that we \emph{could} have put any value in for $X_1$ and $X_2$ in the first two equations, so long as we put the \emph{same} value for $X_1$ ($X_2$) in both of them.  That is, so long as we are holding $X_1$ and $X_2$ \emph{constant}.  

On the other hand, it does not work out this way if you have an interaction.  Here I show the case where there is an $X_1X_3$ interaction term:  
$$
\begin{aligned}
\hat Y_i &= \hat\beta_0 + \hat\beta_1\bar X_1 + \hat\beta_2\bar X_2 + \hat\beta_3X_{3} \quad\quad\ \! + \hat\beta_4\bar X_1X_{3}  \\
\hat Y_{i'} &= \hat\beta_0 + \hat\beta_1\bar X_1 + \hat\beta_2\bar X_2 + \hat\beta_3(X_{3}\!+\!1) + \hat\beta_4\bar X_1(X_{3}\!+\!1)  \\
~  \\
&\text{subtracting the first equation from the second:}  \\
~  \\
\hat Y_{i'} - \hat Y_i &= \hat\beta_0 - \hat\beta_0 + \hat\beta_1\bar X_1 - \hat\beta_1\bar X_1 + \hat\beta_2\bar X_2 - \hat\beta_2\bar X_2 + \hat\beta_3(X_{3}\!+\!1) - \hat\beta_3X_{3} +  \\
&\quad\ \hat\beta_4\bar X_1(X_{3}\!+\!1) - \hat\beta_4\bar X_1X_{3}  \\
\Delta Y &= \hat\beta_3X_{3} + \hat\beta_3 - \hat\beta_3X_{3}  + \hat\beta_4\bar X_1 X_{3} + \hat\beta_4\bar X_1 - \hat\beta_4\bar X_1X_{3}  \\
\Delta Y &= \hat\beta_3 + \hat\beta_4\bar X_1
\end{aligned}  
$$
In this case, it is not possible to hold all else constant.  Because the interaction term is a function of $X_1$ and $X_3$, it is not possible to change $X_3$ without the interaction term changing as well.  Thus, $\hat\beta_3$ equals the change in $\hat Y$ associated with a one unit change in $X_3$ \emph{only when} the interacting variable ($X_1$) is held at $0$ instead of $\bar X_1$ (or any other value but $0$), in which case the last term in the bottom equation drops out.  

In this discussion, I have focused on interactions, but more generally, the issue is when there is any variable that is a function of another such that it is not possible to change the value of the first without changing the respective value of the other variable.  In such cases, the meaning of $\hat\beta_j$ becomes more complicated.  For example, if you had a model with $X_j$ and $X_j^2$, then $\hat\beta_j$ is the derivative $\frac{dY}{dX_j}$ holding all else equal, and holding $X_j=0$ (see the answer linked at [2] below).  Other, still more complicated formulations are possible as well.  }
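
\bigskip

The quadratic case can be checked in one line. As a sketch, suppose the fitted model contains only $X_j$ and its square, and write $\hat\beta_k$ for the coefficient on $X_j^2$ (the index $k$ is a label introduced here for illustration):

$$\begin{aligned}
\hat Y &= \hat\beta_0 + \hat\beta_j X_j + \hat\beta_k X_j^2 \\
\frac{d\hat Y}{dX_j} &= \hat\beta_j + 2\hat\beta_k X_j, \qquad \left.\frac{d\hat Y}{dX_j}\right|_{X_j=0} = \hat\beta_j
\end{aligned}$$

So the slope with respect to $X_j$ depends on where it is evaluated, and it coincides with $\hat\beta_j$ only at $X_j = 0$, in the same way that the interaction case requires $X_1 = 0$.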


  [1]: \url{https://stats.stackexchange.com/a/78830/7290}
  \bigskip
  
  [2]: \url{https://stats.stackexchange.com/a/28750/7290}
  
  
  \bigskip
  
  \newpage
\item \textbf{Questions}

Could it be that two highly related independent variables (r=0.834!!) yield a VIF of 3.82? Of course, it makes my life easier that I don’t have to deal with the multicollinearity problem, but I don’t understand how this can happen.

\bigskip

Yes, that might be surprising but it is accurate. In fact, for the example in this blog post, the $\%$ fat and body weight variables have a correlation of 0.83, yet the VIF for a model with only those two predictor variables is just 3.2. That's very similar to your situation. When you have only a pair of correlated predictors in your model, the correlation between them has to be very high ($\sim 0.9$) before it starts to cause problems.

However, when you have more than two predictors, the collective predictive power between the predictors adds up more quickly. As you increase the number of predictors, each pair can have a lower correlation, but the overall strength of those relationships accumulates. A VIF is computed by regressing one predictor on all of the other predictors. Consequently, it's easier to get higher VIFs when you have more predictors in the model. No one predictor has to ``work very hard'' to produce problems.

But, when you have only two predictors, the relationship between them must be very strong to produce problems!
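
The two-predictor case can be checked directly. The sketch below assumes the usual definition $\mathrm{VIF}_j = 1/(1-R_j^2)$, where $R_j^2$ comes from regressing predictor $j$ on the other predictors; with only two predictors, $R_j^2$ is simply the squared correlation $r^2$. The function name is hypothetical.

```python
def vif_two_predictors(r):
    """VIF for either predictor when the model has exactly two predictors.

    Uses VIF = 1 / (1 - R^2), where R^2 comes from regressing one predictor
    on the other, which for a single regressor equals r squared.
    """
    return 1.0 / (1.0 - r ** 2)

for r in (0.5, 0.83, 0.9, 0.95):
    print(f"r = {r:.2f}  ->  VIF = {vif_two_predictors(r):.2f}")
```

A correlation of 0.83 gives a VIF of about 3.2, matching the figure quoted above, and the VIF only passes 5 once $r$ is close to 0.9.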


\bigskip

\item \textbf{Question 2}


a) First, we regress assault on a few variables in the USArrests data and find that rape and murder are statistically significant. But when we add one more variable, such as life expectancy, urban population also becomes statistically significant. Why?



\bigskip

Answer from a medical perspective: In epidemiology, we call this a negative confounder. This means that a co-factor artificially reduces the observed association between a studied factor and an outcome.
\bigskip

Imagine we want to study the association between vitamin A supplement intake and diarrhea in infants. Suppose that children who are more likely to take supplements are also those who eat fewer fruits and vegetables. Dietary fiber is known to improve stool consistency, independently of vitamin A. So vitamin A decreases diarrhea while the lack of fiber intake increases it, and the two effects offset each other. When comparing the two sub-populations (exposed vs.\ non-exposed to vitamin A), we might therefore observe no association. If we now ``adjust'' for fiber intake, we study the same association independently of fiber intake, and the association between vitamin A intake and diarrhea appears.

Once you understand that negative confounders exist, you will change two things when constructing models:
\begin{enumerate}
\item You will conceptually identify potential confounders. They need to (a) be known causes of the outcome of interest, (b) be associated with the factor of interest you are studying, and (c) not lie on the causal pathway between the factor of interest and the outcome (beware of over-adjustment).
\item You will test the associations of potential confounders with the outcome of interest and include them in your model using a significance level that is higher (more lenient) than the one you set to define a co-factor. I usually set it at $p<0.2$ for univariate analysis and then at $p<0.05$ for retaining a factor in my final model.
\end{enumerate}
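
The vitamin A example can be made concrete with a small stratified comparison. The counts below are hypothetical, invented only so that the two effects cancel exactly in the crude comparison; ``adjusting'' is sketched here as comparing risks within each fiber stratum rather than fitting a regression model.

```python
# Hypothetical counts, invented for illustration: (children, diarrhea cases).
counts = {
    ("supplement", "low fiber"): (80, 32),
    ("supplement", "high fiber"): (20, 2),
    ("no supplement", "low fiber"): (20, 12),
    ("no supplement", "high fiber"): (80, 22),
}

def risk(exposure, fiber=None):
    """Diarrhea risk for an exposure group, pooled or within one fiber stratum."""
    cells = [v for (e, f), v in counts.items()
             if e == exposure and (fiber is None or f == fiber)]
    children = sum(n for n, _ in cells)
    cases = sum(c for _, c in cells)
    return cases / children

# Crude (unadjusted) comparison: the two effects cancel out.
crude_diff = risk("supplement") - risk("no supplement")

# Comparison within each fiber stratum, i.e. "adjusted" for fiber intake.
stratum_diffs = {
    fiber: risk("supplement", fiber) - risk("no supplement", fiber)
    for fiber in ("low fiber", "high fiber")
}
print(crude_diff, stratum_diffs)
```

The crude risk difference is zero, while within each fiber stratum the supplemented children have a lower risk, which is the pattern a negative confounder produces.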

\bigskip


\newpage



\item \textbf{Question 2: Predicting Assault Rates}

\begin{enumerate}

    \item [Step 1:] Sense-Making
    
    Read the literature of the study, gain a good understanding of the objectives of the research.
    
    
    
    
    \item [Step 2:] Assumptions


There are four principal assumptions which justify the use of linear regression models for purposes of inference or prediction:
    
    \begin{enumerate}
        \item linearity: each predictor has a linear relation with our outcome variable; \bigskip
        
    That is: linearity and additivity of the relationship between dependent and independent variables:

    (1) The expected value of dependent variable is a straight-line function of each independent variable, holding the others fixed. \bigskip

    (2) The slope of that line does not depend on the values of the other variables.
\bigskip

    (3)  The effects of different independent variables on the expected value of the dependent variable are additive. \bigskip
    
    {\color{red} I think plotting each IV against the DV is sufficient as a first check. Note, however, that nonlinearity is usually most evident in a plot of observed versus predicted values or a plot of residuals versus predicted values, which are part of standard regression output.}
    
        \bigskip
        
        \item independence: the errors are statistically independent of one another. \bigskip
        
        \item normality: the prediction errors are normally distributed in the population. \bigskip
        
        \item homoscedasticity: the variance of the errors is constant in the population.
    \end{enumerate}
    
    
    \item [Step 3:] Plot histograms and scatter plots to visualize the data.
    
    
    \bigskip
    
    \item [Step 4:]
     
     Run the Linear Regression model.
     
     
     \bigskip
     
     Then check the summary table. There's a few key things to check in the summary table.
     
     \begin{enumerate}
         \item The first step in interpreting the multiple regression analysis is to examine the F-statistic and the associated p-value, at the bottom of the model summary. In our example, the p-value of the F-statistic is $< 2.2 \times 10^{-16}$, which is highly significant. This means that at least one of the predictor variables is significantly related to the outcome variable.
         
         \bigskip
         
         \item Then read the coefficient estimates together with their $t$-statistics and $p$-values.
         
         \bigskip
         
         \item Then we look at R-squared:

In multiple linear regression, $R$ is the correlation coefficient between the observed values of the outcome variable ($y$) and the fitted (i.e., predicted) values of $y$, and $R^2$ is its square. For this reason, the value of $R$ is never negative, and both $R$ and $R^2$ range from zero to one.
\bigskip

$R^2$ represents the proportion of variance in the outcome variable $y$ that may be predicted by knowing the values of the $x$ variables. An $R^2$ value close to 1 indicates that the model explains a large portion of the variance in the outcome variable.
\bigskip

A problem with $R^2$ is that it will always increase (or at least never decrease) when more variables are added to the model, even if those variables are only weakly associated with the response (James et al. 2014). A solution is to adjust $R^2$ by taking into account the number of predictor variables.
\bigskip

The ``Adjusted R Square'' value in the summary output is $R^2$ corrected for the number of $x$ variables included in the prediction model.

\bigskip

\item Check the VIFs for multicollinearity as well.

     \end{enumerate}
    
\end{enumerate}
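
To make the $R^2$ discussion in Step 4 concrete, the following minimal sketch computes $R^2$ and the adjusted $R^2$ by hand. It assumes the standard formulas $R^2 = 1 - SS_{res}/SS_{tot}$ and $\bar R^2 = 1 - (1-R^2)\frac{n-1}{n-p-1}$ for $n$ observations and $p$ predictors; the observed and fitted values are hypothetical.

```python
def r_squared(y, y_hat):
    """Proportion of the variance in y explained by the fitted values y_hat."""
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot

def adjusted_r_squared(r2, n, p):
    """Adjust R^2 for p predictors (intercept not counted) and n observations."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)

# Hypothetical observed and fitted values, for illustration only.
y = [1.0, 2.0, 3.0, 4.0]
y_hat = [1.1, 1.9, 3.2, 3.8]
r2 = r_squared(y, y_hat)
print(r2, adjusted_r_squared(r2, n=len(y), p=1))
```

For the same $R^2$, a model with more predictors gets a lower adjusted value, which is the penalty described above.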




\bigskip

\item \textbf{Ways to check assumptions in Regression}

\textbf{USEFUL: READ} \url{https://www.theanalysisfactor.com/when-to-check-model-assumptions/}

\bigskip


\begin{enumerate}
    \item Global Stat - Are the relationships between your X predictors and Y roughly linear? Rejection of the null ($p  < .05$) indicates a non-linear relationship between one or more of your X's and Y.
\bigskip    
    
    
    \item Skewness - Is your distribution skewed positively or negatively, necessitating a transformation to meet the assumption of normality? Rejection of the null ($p < .05$) indicates that you should likely transform your data.
    \bigskip
    
    
    \item Kurtosis- Is your distribution kurtotic (highly peaked or very shallowly peaked), necessitating a transformation to meet the assumption of normality? Rejection of the null ($p < .05$) indicates that you should likely transform your data. \bigskip
    
    \item Link Function- Is your dependent variable truly continuous, or categorical? Rejection of the null ($p < .05$) indicates that you should use an alternative form of the generalized linear model (e.g. logistic or binomial regression). \bigskip
    
    \item  Heteroscedasticity - Is the variance of your model residuals constant across the range of X (assumption of homoscedasticity)? Rejection of the null ($p < .05$) indicates that your residuals are heteroscedastic, and thus non-constant across the range of X. Your model is better or worse at predicting for certain ranges of your X scales.
    
\end{enumerate}
\newpage


\subsection{What happens when you add in a Squared Variable?}


\item \textbf{Why does one add powers of a variable in regression models?}


If the coefficient of the quadratic term is close to zero and not significant, this suggests that the relationship is approximately linear, and it might be best to re-run the model without the quadratic term. You can see this by graphing $y = x^2+x$ against $y = 0.00000001x^2+x$: the latter is almost a straight line because the coefficient of $x^2$ is near $0$. In that case, we can drop the $x^2$ term. \bigskip


\textbf{Taylor series} approximations tell us that pretty much any smooth function can be approximated by a polynomial, so including terms like $x^2$ or $x^3$ (where $x$ is age in your example) lets us estimate the coefficients of the approximation to a known or unknown non-linear function of $x$. Testing these coefficients is also a simple way to check whether the relationship is reasonably linear or whether non-linear terms will give a better fit.
\bigskip

Depending on the ultimate goal of the analysis the non-linear terms can be kept for prediction, or plots of the prediction can be used to suggest the actual functional relationship.  There are other tools, such as cubic splines, that can be used instead of polynomial terms to accomplish similar goals, but adding a squared term is a quick and easy way to do this.

\bigskip

Also, let us see an example.

\bigskip

\item \textbf{Example on quadratic variables in Regression}

Define $x = \text{age}$, $x^2 = \text{age}^2$, $y = \text{health (numerical values given)}$ and consider the following equation $$y = 0.28x- 0.007x^2$$

Setting the first derivative to zero shows that the maximum is at $x=20$ years old, after which $y$ starts to decline. If we include only the age variable without its square, we obtain a poorly fitting straight line for the relationship between age and health. Adding an $x^2$ term gives the model more flexibility, allowing the regression to capture a non-linear relationship.
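
As a worked check of the turning point, using only the coefficients in the equation above:
$$\frac{dy}{dx} = 0.28 - 0.014x = 0 \implies x = \frac{0.28}{0.014} = 20, \qquad \frac{d^2y}{dx^2} = -0.014 < 0,$$
so $x = 20$ is a maximum, where $y = 0.28(20) - 0.007(20)^2 = 5.6 - 2.8 = 2.8$.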
\bigskip

You can check the relationship between $x$ and $y$ before running the model: a simple scatter plot will give you a rough idea of whether $x$ and $y$ have a roughly linear relationship. There are also formal tests for linearity between the predictor and response variables.

\newpage







\end{enumerate}

\newpage

\chapter{Classification}

\section{Intuition of Classification}

The linear regression model discussed in Chapter 2 assumes that the response variable $Y$ is quantitative. But in many situations, the response variable is instead qualitative. Qualitative variables are often referred to as \textbf{categorical}; we will use these terms interchangeably. For example, in linear regression we input many independent variables and hope to output a prediction of a house price. But what if our output is not a number? For instance, one can think of predicting whether a person has diabetes from many independent variables. Now the output is a binary class, either yes or no, and linear regression is not suitable.

\bigskip

In this chapter, we study approaches for predicting qualitative responses, a process that is known as \textbf{classification}. Why is it called classification when it seems like "prediction"? Because predicting a qualitative response for an observation can be referred to as \textbf{classifying} that observation, since it involves assigning the observation to a category, or class.

\bigskip

On the other hand, the methods used for classification often first \textbf{predict} the probability of each of the categories of a qualitative variable, as the basis for making the classification. In this sense they also behave like regression methods. The predicted probability is then turned into a class label by comparing it with a \textbf{classification threshold}, which we discuss later.



\newpage

\section{Why Linear Regression is not suitable}

We have stated that linear regression is not appropriate in the case of a
qualitative response. Why not? Suppose that we are trying to predict the medical condition of a patient in the emergency room on the basis of her symptoms. In this simplified example, there are three possible diagnoses: stroke, drug overdose, and epileptic seizure. We could consider encoding these values as a quantitative response variable, $Y$, as follows: \bigskip


$Y=\begin{cases}
1, & \text{if stroke;}\\
2, & \text{if drug overdose;}\\
3, & \text{if epileptic seizure.}
\end{cases}$

\bigskip


Using this coding, least squares could be used to fit a linear regression model
to predict $Y$ on the basis of a set of predictors $X_1, \ldots, X_p$. Unfortunately, this coding implies an ordering on the outcomes, putting drug overdose in between stroke and epileptic seizure, and insisting that the difference between stroke and drug overdose is the same as the difference between drug overdose and epileptic seizure. In practice there is no particular
reason that this needs to be the case. For instance, one could choose an
equally reasonable coding,
\bigskip

$Y=\begin{cases}
1, & \text{if epileptic seizure;}\\
2, & \text{if stroke;}\\
3, & \text{if drug overdose.}
\end{cases}$

\bigskip

which would imply a totally different relationship among the three conditions. Each of these codings would produce fundamentally different linear models that would ultimately lead to different sets of predictions on test observations.
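
The dependence on the coding can be seen in a small sketch. The symptom score and diagnoses below are hypothetical, a single predictor is used for simplicity, and mapping a fitted value back to the label with the nearest code is an assumption made here for illustration.

```python
def fit_line(x, y):
    """Least-squares intercept and slope for a single predictor."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def predict_label(coding, x, labels, x_new):
    """Fit on the numeric codes, then return the label whose code is nearest."""
    intercept, slope = fit_line(x, [coding[lab] for lab in labels])
    y_hat = intercept + slope * x_new
    return min(coding, key=lambda lab: abs(coding[lab] - y_hat))

# Hypothetical symptom score and diagnoses, invented for illustration.
x = [1, 2, 3, 4, 5, 6]
labels = ["stroke", "stroke", "overdose", "overdose", "seizure", "seizure"]

coding_a = {"stroke": 1, "overdose": 2, "seizure": 3}
coding_b = {"seizure": 1, "stroke": 2, "overdose": 3}

pred_a = predict_label(coding_a, x, labels, x_new=1)
pred_b = predict_label(coding_b, x, labels, x_new=1)
print(pred_a, pred_b)
```

The same patient (score 1) is predicted as a stroke under the first coding and as a drug overdose under the second, even though the data are identical.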
\bigskip


If the response variable's values did take on a natural ordering, such as
{\it mild, moderate}, and {\it severe}, and we felt the gap between mild and moderate was similar to the gap between moderate and severe, then a 1, 2, 3 coding would be reasonable. Unfortunately, in general there is no natural way to convert a qualitative response variable with more than two levels into a
quantitative response that is ready for linear regression.
\bigskip

For a {\it binary} (two level) qualitative response, the situation is better. For instance, perhaps there are only two possibilities for the patient's medical condition: stroke and drug overdose. We could then potentially use
the {\it dummy variable} approach to code the response as follows:
\bigskip

$Y=\begin{cases}
0, & \text{if stroke;}\\
1, & \text{if drug overdose.}
\end{cases}$

\bigskip

We could then fit a linear regression to this binary response, and predict
drug overdose if $\hat{Y}>0.5$ and stroke otherwise. In the binary case it is not hard to show that even if we flip the above coding, linear regression will
produce the same final predictions.
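
A short sketch of why the flip is harmless, assuming the model includes an intercept: write the flipped response as $Y' = 1 - Y$. Least-squares fitted values are linear in the response, and a constant is fitted exactly when an intercept is present, so
$$\hat{Y}' = 1 - \hat{Y}, \qquad\text{hence}\qquad \hat{Y}' > 0.5 \iff \hat{Y} < 0.5.$$
Under the flipped coding, $\hat{Y}' > 0.5$ means stroke, and this happens for exactly the observations that were predicted as stroke under the original coding (ties at $\hat{Y} = 0.5$ aside).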

\bigskip



For a binary response with a 0/1 coding as above, regression by least
squares does make sense; it can be shown that the $X\hat{\beta}$ obtained using linear regression is in fact an estimate of $\Pr(\text{drug overdose} \mid X)$ in this special case. However, if we use linear regression, some of our estimates might be outside the $[0, 1]$ interval (see Figure 4.2), making them hard to interpret as probabilities! Nevertheless, the predictions provide an ordering and can be interpreted as crude probability estimates. Curiously, it turns out that the classifications that we get if we use linear regression to predict a binary response will be the same as for the linear discriminant analysis (LDA) procedure we discuss in Section 4.4.
\bigskip


However, the dummy variable approach cannot be easily extended to accommodate qualitative responses with more than two levels. For these reasons, it is preferable to use a classification method that is truly suited for qualitative response values, such as the ones presented next.


\bigskip

\newpage


\section{Classification Threshold}


Recall that in linear regression we have one or more independent input variables, denoted $X$, and we seek a response (output) $Y$. In logistic regression, however, the output is not a number as in linear regression, since it is usually a categorical variable. In our simple example the output is encoded as 0 or 1, and logistic regression models the probability of each class. We use a simple example to showcase the idea.

\bigskip


\textbf{Example:} We are trying to predict whether a person has a malignant tumor based on inputs such as tumor size. Our output $Y$ is encoded as a binary class, where 1 stands for ``Yes, the tumor is malignant'' and 0 stands for ``No, the tumor is not malignant''. Consider this data set, where the response ``Y = Malignant'' falls into one of two categories, Yes or No. Rather than modeling this response $Y$ directly, logistic regression models the probability that $Y$ belongs to a particular category.
\bigskip

For this, in a simple logistic regression model, we can calculate our output Y as a probability defined as $$\text{Prob(malignant = Yes}~|~ \text{tumor size})$$

\bigskip

Clearly $\text{Prob(malignant = Yes}~|~ \text{tumor size})$ must fall between $0$ and $1$, since it is a probability. But this does not yet answer our question of whether you are in class 1 or class 0, which is the output we ultimately want. What if you get $\text{Prob(malignant = Yes}~|~ \text{tumor size}) = 0.2349538$, which is neither 0 nor 1? This is where the classification threshold comes in. You need to pre-define a threshold (the default is usually 0.5). With a classification threshold of 0.5, we predict Yes/1 whenever $\text{Prob(malignant = Yes}~|~ \text{tumor size}) > 0.5$. For 0.2349538, we therefore predict No. However, the threshold is adjustable for a reason: in the medical and healthcare industry we tend to be more conservative with our predictions, because we have very little tolerance for false negatives. We would rather give you a false alarm than classify you as No Malignancy when in fact the tumor is malignant. So we can lower the threshold to something like $\text{Prob(malignant = Yes}~|~ \text{tumor size}) >0.1$, in which case 0.2349538 falls in the Yes class.  \bigskip
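
The thresholding rule can be sketched in a few lines. The function name is hypothetical; the probability is the one used in the paragraph above.

```python
def classify(probability, threshold=0.5):
    """Return 1 (Yes) if the predicted probability exceeds the threshold, else 0 (No)."""
    return 1 if probability > threshold else 0

p_malignant = 0.2349538  # the predicted probability used in the text

print(classify(p_malignant))                 # default threshold of 0.5
print(classify(p_malignant, threshold=0.1))  # more conservative threshold
```

Lowering the threshold moves the same patient from the No class to the Yes class, trading more false alarms for fewer false negatives.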

\newpage

\section{Logistic Modelling Formula}

As with any modelling, there should be a formula relating $X$ and $Y$. Here, however, we represent $Y$ by $\text{Prob(malignant = Yes}~|~ \text{tumor size})=\text{Prob(Y=1}~|~ X)$. For simplicity we write $\text{Prob(Y=1}~|~ X)$ as $p(X)$, and we seek a relationship between $p(X)$ and $X$.


\bigskip

If we were to use linear regression to predict our output probability, our formula would look like this:

$$p(X) = \beta_0+ \beta_1X$$

\bigskip

Also recall that to arrive at the linear regression formula in Chapter 2, we have to estimate the regression coefficients $\beta$. Here we use a linear regression model to represent the probabilities. For example, consider the 8 data points in the left diagram below: 4 data points represent No/0 and 4 represent Yes/1. After fitting our linear regression line (that is, estimating the coefficients), we find that linear regression predicts quite well, as it classifies all the given data points correctly when the classification threshold is 0.5. In that case, we predict the positive class 1 for everything to the right of the point $c$, because there the output value is greater than 0.5 on the $y$ axis, and the negative class 0 for everything to the left.

\bigskip

i.e: 

$$\text{if } p(X) \geq 0.5, \text{ predict } Y = 1$$
$$\text{if } p(X) < 0.5, \text{ predict } Y = 0$$

\bigskip

Everything looks fine until you add one more data point, as in the right diagram. When we fit our linear regression line again, the problem becomes clear: the regression line changes accordingly and may look like the one shown. When we threshold the hypothesis at 0.5 again, we end up with a threshold at point $c_1$. As before, we predict everything to the right of $c_1$ to be positive and everything to the left to be negative. But now this is bad, since it classifies 2 points wrongly, as shown.





\begin{figure}[h]
  \centering
  \begin{minipage}[b]{0.4\textwidth}
    \includegraphics[width=9cm]{logistic1.jpg}
    \caption{}
  \end{minipage}
  \hfill
  \begin{minipage}[b]{0.4\textwidth}
    \includegraphics[width=9cm]{logistic2.jpg}
    \caption{}
  \end{minipage}
\end{figure}
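
The shift of the decision point from $c$ to $c_1$ can be reproduced numerically. The tumor sizes below are hypothetical and are not the data behind the figures, so the number of misclassified points need not match the diagram; the sketch only shows that a single extreme point moves the 0.5 crossing.

```python
def fit_line(x, y):
    """Least-squares intercept and slope for a single predictor."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def boundary(x, y, threshold=0.5):
    """Value of x at which the fitted line crosses the threshold."""
    intercept, slope = fit_line(x, y)
    return (threshold - intercept) / slope

# Hypothetical tumor sizes: four benign (0) and four malignant (1) cases.
sizes = [1, 2, 3, 4, 5, 6, 7, 8]
malignant = [0, 0, 0, 0, 1, 1, 1, 1]

c = boundary(sizes, malignant)
# Add one very large malignant tumor and refit.
c1 = boundary(sizes + [30], malignant + [1])
print(c, c1)
```

Before the extra point the crossing sits between sizes 4 and 5, so all eight cases are classified correctly. Afterwards the crossing moves above 5, so the malignant case of size 5 is now predicted as class 0.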

\bigskip


\textbf{Logistic Formula}

Also, when we use linear regression, we may sometimes predict a negative probability of malignancy or even a probability greater than 1, neither of which is acceptable. We therefore seek a function to represent $p(X)$ such that $0 \leq p(X) \leq 1$, and the function must also be stable in the sense that outliers like the one in Figure 3.2 do not affect the classification of the classes.

\bigskip

To deal with both issues, we use the sigmoid (logistic) function.
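
For reference, the sigmoid is commonly written as $p(X) = \frac{1}{1+e^{-(\beta_0+\beta_1 X)}}$ for a single predictor. The sketch below assumes this standard form; the coefficient values are hypothetical and are not fitted to any data in these notes.

```python
import math

def sigmoid(z):
    """Logistic function 1 / (1 + exp(-z)), written to avoid overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def p(x, beta0, beta1):
    """Modelled probability that Y = 1 for a single predictor x."""
    return sigmoid(beta0 + beta1 * x)

# Hypothetical coefficients, for illustration only.
beta0, beta1 = -4.5, 1.0
for x in (1, 4.5, 8, 30):
    print(x, p(x, beta0, beta1))
```

Every output lies between 0 and 1, and a very large input such as 30 simply pushes the probability towards 1 instead of dragging a fitted line.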





\newpage



\section{Supervised vs Unsupervised Learning}



   \begin{enumerate} 

\item \textbf{Definition (Supervised Learning)}


Most statistical learning problems fall into one of two categories:
supervised or unsupervised. In supervised learning, for each observation of the
independent variables $x_i$, $i = 1, \ldots, n$, there is an associated output (dependent) variable $y_i$. We wish to fit a model that relates the response to the predictors, with the aim of accurately predicting the response for future
observations (prediction) or better understanding the relationship between
the dependent and independent variables. For example, suppose we are given three independent variables, income level, education level and health consciousness level (measured numerically), named $x_1,x_2,x_3$, and a variable $y$, life expectancy, which depends on them. If we are given, say, 5000000 individuals with both the inputs and the output recorded, these are training data in which the machine can look for patterns. It can then fit a suitable model, and if we give the machine a new observation with $x_1,x_2,x_3$ as inputs, it can produce a reliable output (say, the person's life expectancy).
\bigskip

In the examples shown, there are only three variables, and
so one can simply visually inspect the scatter-plots of the observations in
order to identify clusters. However, in practice, we often encounter data
sets that contain many more variables. In this case, we cannot
easily plot the observations. For instance, if there are $p$ variables in our
data set, then $p(p-1)/2$ distinct scatter-plots can be made, and visual
inspection is simply not a viable way to identify clusters. For this reason,
automated clustering methods are important (K means).



\item \textbf{Definition (Unsupervised)}

In contrast, unsupervised learning describes the somewhat more challenging
situation in which for every observation $i = 1, \ldots, n$, we observe
a vector of measurements $x_i$ but no associated output $y_i$ (i.e., we have all the values of income level, education level and health consciousness level, measured numerically, but no associated output). It is not possible to fit a linear regression model, since there is no response variable
to predict. In this setting, we are in some sense working blind; the situation
is referred to as unsupervised because we lack a response variable
that can supervise our analysis.

\bigskip


\item \textbf{Confusion on when to use Regression or Classification}

Regression and classification are both related to prediction, where regression predicts a value from a continuous set (the output variable takes continuous values.), whereas classification predicts the 'belonging' to the class (the output variable takes class labels.). \bigskip


For example, the price of a house depending on the 'size' (in some unit) and say 'location' of the house, can be some 'numerical value' (which can be continuous): this relates to regression.
\bigskip


Similarly, the prediction of price can be in words, viz., 'very costly', 'costly', 'affordable', 'cheap', and 'very cheap': this relates to classification. Each class may correspond to some range of values.

\bigskip

As a rule of thumb, if your output variable (also known as the dependent variable, the variable you want to predict) is a class, use classification; if it is a continuous number, use regression. One can dig deeper to see whether a combination of both methods is appropriate.
\bigskip

Also note that some classification algorithms can be used for both qualitative and quantitative responses!

\bigskip


\item \textbf{Definition (K Means)}

\url{http://www.learnbymarketing.com/tutorials/k-means-clustering-in-r-example/} good website

\bigskip

\url{https://stats.stackexchange.com/questions/21222/are-mean-normalization-and-feature-scaling-needed-for-k-means-clustering/21226}

\url{https://stats.stackexchange.com/questions/183236/what-is-the-relation-between-k-means-clustering-and-pca}

\url{https://stats.stackexchange.com/questions/2691/making-sense-of-principal-component-analysis-eigenvectors-eigenvalues}
\bigskip

\url{https://www.r-bloggers.com/pca-and-k-means-clustering-of-delta-aircraft/}

\url{https://uc-r.github.io/kmeans_clustering}

\url{http://www.sthda.com/english/wiki/print.php?id=236}
\bigskip


\item \textbf{K means and PCA}


It is important to realize that PCA and k-means serve totally different purposes. PCA is for dimensionality reduction: it tells you which variables (or which linear combinations of variables) in your data are most important. K-means is for clustering: it tells you which of your data points ``belong together'' in some sense.

People will often use the two together, but they do different things.


\url{https://www.r-bloggers.com/how-to-perform-pca-on-r/}

\bigskip


    \item \textbf{Curse of dimensions}


I read that 'Euclidean distance is not a good distance in high dimensions'. I guess this statement has something to do with the curse of dimensionality, but what exactly? Besides, what is 'high dimensions'? I have been applying hierarchical clustering using Euclidean distance with 100 features. Up to how many features is it 'safe' to use this metric? \bigskip

See \url{https://stats.stackexchange.com/questions/99171/why-is-euclidean-distance-not-a-good-metric-in-high-dimensions/} for a very good discussion.
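
One common way to see the problem is that Euclidean distances tend to concentrate as the number of features grows: the nearest and farthest points end up at almost the same distance. The sketch below is an illustration under stated assumptions (points drawn uniformly from the unit cube, a hypothetical helper name, a fixed seed), not a rule for how many features are ``safe''.

```python
import math
import random

def relative_contrast(dim, n_points=200, seed=0):
    """(farthest - nearest) / nearest distance from one random point to the rest."""
    rng = random.Random(seed)
    points = [[rng.random() for _ in range(dim)] for _ in range(n_points)]
    query, others = points[0], points[1:]
    dists = [math.dist(query, p) for p in others]
    return (max(dists) - min(dists)) / min(dists)

for dim in (2, 10, 100, 1000):
    print(dim, round(relative_contrast(dim), 3))
```

The relative contrast shrinks as the dimension grows, so ``near'' and ``far'' become harder to tell apart, which is what hurts distance-based methods such as hierarchical clustering.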


\bigskip

\section{Intuition of Classification}

\item \textbf{Why Classification?}
\justify 

The linear regression model discussed in Chapter 2 assumes that the response variable $Y$ is quantitative. But in many situations, the response variable is instead qualitative. Often qualitative variables are referred to as \textbf{categorical}; We will use these terms interchangeably. For example, in Linear Regression, we input many independent variables and hope to output a prediction of the house pricing; But what is our output is not a numerical number. Instead, one can think of predicting whether you have diabetes by inputting many independent variables. But now the output is instead a binary class, it is either a yes or no, as such linear regression does not work here.

\bigskip

In this chapter, we study approaches for predicting qualitative responses, a process that is known as \textbf{classification}. Why is it called classification when it seems like "prediction"? Because predicting a qualitative response for an observation can be referred to as \textbf{classifying} that observation, since it involves assigning the observation to a category, or class.

\bigskip

On the other hand, often the methods used for classification actually first \textbf{predict} the probability of each of the categories of a qualitative variable, as the basis for making the classification. In this sense they also behave like regression methods. This prediction of probability of each of the categories of a qualitative variable is known as \textbf{Classification Threshold. We will mention it later.}



\newpage

\section{Why Linear Regression is not suitable}

We have stated that linear regression is not appropriate in the case of a
qualitative response. Why not? Suppose that we are trying to predict the medical condition of a patient in the emergency room on the basis of her symptoms. In this simplified example, there are three possible diagnoses: stroke, drug overdose, and epileptic seizure. We could consider encoding these values as a quantitative response variable, $Y$, as follows: \bigskip


\begin{equation}
  Y=\begin{cases}
    1, & \text{if stroke}\\
    2, & \text{if drug overdose}\\
    3, & \text{if epileptic seizure}\\
  \end{cases}
\end{equation}


\bigskip


Using this coding, least squares could be used to fit a linear regression model
to predict $Y$ on the basis of a set of predictors $X_{1}$, . . . , $X_{p}$. Unfortunately, this coding implies an ordering on the outcomes, putting drug overdose in between stroke and epileptic seizure, and insisting that the difference between stroke and drug overdose is the same as the difference between drug overdose and epileptic seizure. In practice there is no particular
reason that this needs to be the case. For instance, one could choose an
equally reasonable coding,
\bigskip

\begin{equation}
  Y=\begin{cases}
    1, & \text{if strepileptic seizure}\\
    2, & \text{if stroke}\\
    3, & \text{if drug overdose}\\
  \end{cases}
\end{equation}


\bigskip

which would imply a totally different relationship among the three conditions. Each of these codings would produce fundamentally different linear models that would ultimately lead to different sets of predictions on test observations.
\bigskip


If the response variable's values did take on a natural ordering, such as
{\it mild, moderate}, and {\it severe}, and we felt the gap between mild and moderate was similar to the gap between moderate and severe, then a 1, 2, 3 coding would be reasonable. Unfortunately, in general there is no natural way to convert a qualitative response variable with more than two levels into a
quantitative response that is ready for linear regression.
\bigskip

For a {\it binary} (two level) qualitative response, the situation is better. For instance, perhaps there are only two possibilities for the patient's medical condition: stroke and drug overdose. We could then potentially use
the {\it dummy variable} approach to code the response as follows:
\bigskip

$$Y=\begin{cases}
    0, & \text{if stroke}\\
    1, & \text{if drug overdose}
\end{cases}$$

\bigskip

We could then fit a linear regression to this binary response, and predict
drug overdose if $\hat{Y}>0.5$ and stroke otherwise. In the binary case it is not hard to show that even if we flip the above coding, linear regression will
produce the same final predictions.
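
\bigskip

To see why, here is a short sketch of the argument, assuming the regression includes an intercept. The notation $Y' = 1 - Y$ for the flipped coding, and the primed coefficients, are introduced here only for illustration. Least squares estimates are linear in the response, and regressing the constant response $1$ on the predictors gives an intercept of $1$ and slopes of $0$. Hence, if $\hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_p$ are the coefficients fitted to $Y$, the coefficients fitted to $Y'$ are

$$\hat{\beta}_0' = 1 - \hat{\beta}_0, \qquad \hat{\beta}_j' = -\hat{\beta}_j \quad (j = 1, \ldots, p),$$

so that for any observation

$$\hat{Y}' = \hat{\beta}_0' + \sum_{j=1}^{p}\hat{\beta}_j' X_j = 1 - \hat{Y}.$$

A fitted value above $0.5$ under one coding is therefore below $0.5$ under the other, and the predicted condition is the same in both cases (apart from the tie $\hat{Y} = 0.5$, which is a matter of convention).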

\bigskip



For a binary response with a 0/1 coding as above, regression by least
squares does make sense; it can be shown that the $X\hat{\beta}$ obtained using linear regression is in fact an estimate of $\Pr(\text{drug overdose} \mid X)$ in this special case. However, if we use linear regression, some of our estimates might be outside the $[0, 1]$ interval (see Figure 4.2), making them hard to interpret as probabilities! Nevertheless, the predictions provide an ordering and can be interpreted as crude probability estimates. Curiously, it turns out that the classifications that we get if we use linear regression to predict a binary response will be the same as for the linear discriminant analysis (LDA) procedure we discuss in Section 4.4.
\bigskip


However, the dummy variable approach cannot be easily extended to accommodate qualitative responses with more than two levels. For these reasons, it is preferable to use a classification method that is truly suited for qualitative response values, such as the ones presented next.


\bigskip

\newpage


\section{Logistic Modelling}


Recall that in linear regression we have independent (input) variables $X$ and we seek a response (output) variable $Y$. In logistic regression, however, the output is a categorical variable. In this simple tutorial we only consider a binary output, coded as 0 and 1.

\bigskip

We mentioned in the previous section that linear regression is poorly suited to predicting a categorical output, even if the categories are coded as numerical values. As a result, we need to come up with a slightly different hypothesis to model our relationship between $X$ and $Y$.

\bigskip

As with any modelling, there should be a formula relating $X$ and $Y$. However, we have already established that it is not easy to obtain a direct equation between $X$ and $Y$. Instead, in logistic regression, we are more interested in a relation between $X$ and $P(Y = 1 ~|~ X)$. One should immediately ask: what is $P(Y=1~|~X)$, and why model it? The following example gives some intuition.

\bigskip

\begin{figure}[h]
  \centering
  \includegraphics[width=0.5\textwidth]{phd_template/images/logistic-8.jpg}
  \caption{Example}
\end{figure}



\bigskip

\textbf{Example:} We are trying to predict whether a person has a malignant tumor based on inputs such as ``Tumor Size''. For simplicity's sake, we deal with only one variable: the tumor size $X$.

\bigskip

Our output $Y$ is basically encoded as a binary class where \textbf{Yes it is Malignant Tumor} stands for 1 and \textbf{No it is not a Malignant Tumor} stands for 0. Consider the data set above, where the response "$Y$ = Malignant" falls into one of two categories, Yes (1) or No (0). As mentioned in the previous paragraph, rather than modeling this response $Y$ directly with $X$, logistic regression models the probability that $Y$ belongs to a particular category.

\bigskip

For this, in a simple (one variable only) logistic regression model, the quantity we model is the probability $$P(Y=1~|~ X)$$

\bigskip

\textbf{So to reiterate, instead of modelling our $Y$ directly with $X$, we aim to find a model for the probability that $Y=1$ given $X$.} But why? How does getting a probability help us? It should be obvious that $P(Y=1~|~X)$ falls between $0$ and $1$ since it is a probability, \textbf{but} that alone does not tell us whether you are in class 1 or class 0, and ultimately we are interested in an output value that is either 1 or 0.

\bigskip

So what if you told me you found that $P(Y=1~|~X =1.1 \text{ cm}) = 0.2349538$, which is neither 0 nor 1? This is where the \textbf{classification threshold} comes in. You need to pre-define a threshold (the default is usually 0.5). If we use a classification threshold of 0.5, then we predict $Y = \text{Yes } (1)$ whenever $P(Y=1~|~X) \geq 0.5$. To write it more compactly, we define the following indicator function.

\begin{equation}
  Y=\begin{cases}
    1, & \text{if $P(Y=1~|~X) \geq 0.5$} \\
    0, & \text{if $P(Y=1~|~X) < 0.5$}\\
  \end{cases}
\end{equation}



As a result, if your tumour size is 1.1 cm, then the probability that your tumour is malignant is $0.2349538$, which is less than 0.5, so we predict No (not malignant). However, the threshold is adjustable for a reason: in the medical and healthcare industry we tend to be more conservative with our predictions, because we have very little tolerance for False Negatives. We would rather give you a false alarm than classify you as No Malignancy when in fact the tumour is already at an advanced stage. So we can tune our threshold to something like $P(Y=1~|~X) \geq 0.1$, and in this case $0.2349538$ falls in the Yes class.  \bigskip

\bigskip

I know we are going off the tracks, but I hope I have provided you with some intuition on how modelling the $P(Y=1~|~X)$ as a function of $X$ makes sense here. \bigskip


\bigskip

\item [2.] \textbf{Hypothesizing and Modelling the Logistic Function}

For simplicity we call our function $P(Y=1~|~ X)$ as the function $p(X)$ and we seek to find a relationship between $p(X)$ and $X$. Although we have gone through a lot of ideas just now, it would be meaningless if we cannot find a suitable function (equation) to model $p(X)$ and $X$. 

\bigskip

\textbf{Hypothesis 1: The Linear Hypothesis}

Hmm, we had quite some success hypothesizing linear functions for linear regression models, so can we try that on $p(X)$ and $X$ too? Suppose we guess (hypothesize) that $p(X)$ has a \textbf{linear relationship} with $X$ as follows:

$$p(X) = \beta_0+\beta_1 X$$

However the problem with this modelling is that for very large Tumour sizes $X$, say $X=10 \text{ cm}$, then our $p(X)$ may take values greater than $1$. And for extremely small Tumour sizes $X$, say those very small benign lumps, which may be $X=0.05 \text{ cm}$ in size, then $p(X)$ may take negative values. In any case, no matter how likely or unlikely one is to find his/her tumour to be malignant, how big or small the tumour size is, our $p(X)$ should only output values between $0$ and $1$ because $p(X)$ is a probability. Hence our linear model may be accurate to a certain extent, but not sensible.
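
\bigskip

As a quick numerical illustration of this problem, take the made-up coefficients $\beta_0 = -0.2$ and $\beta_1 = 0.25$ (with $X$ in cm). These values are assumptions chosen only for the arithmetic, not estimates from any data set. Then

$$p(10) = -0.2 + 0.25 \times 10 = 2.3 > 1, \qquad p(0.05) = -0.2 + 0.25 \times 0.05 = -0.1875 < 0,$$

so the ``probability'' leaves the interval $[0,1]$ at both ends. Any straight line with $\beta_1 \neq 0$ does the same for sufficiently large or small $X$.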


\bigskip

\textbf{Further graphical presentation of Hypothesis 1}

\begin{figure}[h]
  \centering
  \begin{minipage}[b]{0.4\textwidth}
    \includegraphics[width=9cm]{phd_template/images/logistic1.jpg}
    \caption{}
  \end{minipage}
  \hfill
  \begin{minipage}[b]{0.4\textwidth}
    \includegraphics[width=9cm]{phd_template/images/logistic2.jpg}
    \caption{}
  \end{minipage}
\end{figure}

\bigskip


Also recall that in order to come up with the linear regression formula in chapter 2, we have to estimate the regression coefficients $\beta$. Here we use a linear regression model to represent the probabilities. For example, consider the 8 data points in the left diagram. We can easily see that 4 data points represent No/0 and 4 data points represent Yes/1. After fitting our linear regression line (that is, estimating the coefficients), we find that the linear regression predicts pretty well: with a classification threshold of 0.5, it classifies all the given data points correctly. In this case, everything to the right of the point $c$ is predicted as the positive class 1, because there the fitted value is greater than 0.5 on the $y$ axis, and everything to the left of $c$ is predicted as the negative class 0.

\bigskip

i.e: 

$$\text{if } p(X) \geq 0.5, \text {predict } Y = 1$$
$$\text{if } p(X) < 0.5, \text{ predict } Y = 0$$

\bigskip

Everything looks fine until one more data point is added, as in the right diagram. When we fit our linear regression line again, the problem becomes clear: the regression line changes accordingly, and when we threshold the hypothesis at 0.5 again we end up with a threshold at the point $c_1$. As before, we predict everything to the right of $c_1$ as positive and everything to the left as negative. But now this is bad, since it classifies 2 points wrongly, as shown.




\bigskip


\textbf{Hypothesis 2}

Instead of the linear hypothesis, we come up with another one. Recall that probability and odds are closely related, and that $$\text{odds} = \dfrac{P(Y=1~|~X)}{1-P(Y=1~|~X)}$$

\bigskip

So why not model the odds against $X$? If we can successfully do that, then we can easily get the probability $P(Y=1~|~X)$, since odds and probability determine each other uniquely. So let us try: $$\text{odds} = \dfrac{p(X)}{1-p(X)} = \beta_0 + \beta_1X$$

\bigskip

But alas! We soon realise that the odds can only take values from $0$ to $\infty$, whereas $\beta_0 + \beta_1X$ can still be negative for some values of $X$, so the ranges of the two sides still do not match.

\bigskip

But we are close: with some mathematical background, we know that taking the natural logarithm ($\ln$) of the odds gives the desired range.

\bigskip

\textbf{Hypothesis 3: The Chosen one}

If we finally consider the modelling of the logarithm of the odds, against the variable $X$, where we still assume a linear relationship, then we may be good to go because the logarithm of the odds gives a range of $-\infty$ to $\infty$ and matches well with $\beta_0+\beta_1X$.      

\bigskip

With this we have achieved a regression model, with the output of the model being the logarithm or ln of the odds. i.e: the modelled equation is as follows:

$$\ln\left(\dfrac{p(X)}{1-p(X)}\right) = \beta_0+\beta_1X$$

The main reason we reach this step is that both sides of the equation now take values in the same range, $(-\infty, \infty)$, so the equation makes more mathematical sense. We have yet to estimate the coefficients $\beta_0, \beta_1$; so far this is just a logical and sound hypothesis.
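
\bigskip

To make the three scales concrete, the small table below converts a few probabilities into odds and log-odds. The probabilities are chosen only for illustration, and the log-odds are rounded to three decimals.

$$\begin{array}{c|c|c}
p & \text{odds } = p/(1-p) & \ln(\text{odds}) \\ \hline
0.1 & 1/9 \approx 0.111 & -2.197 \\
0.2 & 0.25 & -1.386 \\
0.5 & 1 & 0 \\
0.9 & 9 & 2.197
\end{array}$$

Probabilities live in $(0,1)$, odds in $(0,\infty)$, and log-odds in $(-\infty,\infty)$. Note also the symmetry: $p = 0.1$ and $p = 0.9$ give log-odds of equal size and opposite sign, and $p = 0.5$ corresponds to log-odds of $0$.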

\bigskip

\textbf{Recovering the Logistic Function from log odds}

So in the previous paragraph we settled on the hypothesis that there is a \textbf{linear relationship} between the predictor variable $X$ and the \textbf{log-odds} of the event that $Y=1$. However, do not forget our original aim: we modelled the log-odds against $X$ simply because the relationship can be mathematically justified, but we ultimately want the probability of $Y=1$ given $X$. That is easy to recover: once $\beta_0, \beta_1$ are fixed, we rearrange as follows:

\bigskip

\begin{equation*} 
\begin{split}
\ln\left(\dfrac{p(X)}{1-p(X)}\right) &= \beta_0+\beta_1X \\
\iff  \dfrac{p(X)}{1-p(X)} &= \exp(\beta_0+\beta_1X)\\
\iff p(X) &= \bigl(1-p(X)\bigr)\exp(\beta_0+\beta_1X)\\
\iff p(X)  &= \dfrac{\exp(\beta_0+\beta_1X)}{\exp(\beta_0+\beta_1X)+1}\\
\iff p(X) &= \dfrac{1}{1+\exp\bigl(-(\beta_0+\beta_1X)\bigr)}\\
\end{split}
\end{equation*}


\bigskip


\textbf{Given the log-odds model (the logit model, in fact), we can recover the probability of $Y=1$ given $X$ for each $X$.}
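
\bigskip

As a worked illustration, suppose (purely as an assumption, these are not fitted values) that $\beta_0 = -4$ and $\beta_1 = 2.5$, with $X$ the tumour size in cm. Then for $X = 1.1$ and $X = 2$:

$$\begin{aligned}
X = 1.1: \quad \beta_0+\beta_1X &= -4 + 2.75 = -1.25, & p(1.1) &= \dfrac{1}{1+e^{1.25}} \approx 0.223, \\
X = 2: \quad \beta_0+\beta_1X &= -4 + 5 = 1, & p(2) &= \dfrac{1}{1+e^{-1}} \approx 0.731.
\end{aligned}$$

With these made-up coefficients the log-odds are $0$ at $X = -\beta_0/\beta_1 = 1.6$ cm, which is exactly where $p(X) = 0.5$. However large or small $X$ becomes, $p(X)$ stays strictly between $0$ and $1$.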



\bigskip \newpage

\item [3.]  \textbf{Important - The workflow process of Logistic Regression}

\begin{enumerate}

\item [1.] We aim to predict $Y$ given $X$ where $Y$ is encoded as $0$ and $1$.

\bigskip

\item [2.] It is not easy to do step 1 directly, and thus we turn our eyes to finding the probability of $Y = 1$ given $X$ instead. Imagine we are building logistic regression from scratch, and we hypothesize that maybe the probability $p(X)$ can be modelled the same way as in \textbf{linear regression}. We soon realise that modelling $p(X) = \beta_0+\beta_1X$ is not good, since it can give values outside $[0,1]$. To overcome this, we make a transformation and fit the sigmoid/logistic function, which forces the output $p(X)$ to lie strictly between $0$ and $1$.

\bigskip

Since the transformation may not be intuitive, I have given a simple explanation above, showing step by step how modelling $p(X)$ leads to the sigmoid function.

Sigmoid in logistic regression:

$$p(X) = \dfrac{1}{1+\exp\bigl(-(\beta_0+\beta_1X)\bigr)}$$

\bigskip

\item [3.] Finally, we have to estimate the coefficients in $$p(X) = \dfrac{1}{1+\exp\bigl(-(\beta_0+\beta_1X)\bigr)}$$ and we use a method called \textbf{Maximum Likelihood}.

\bigskip

\item [4.] Once we recover the coefficients $\beta_0, \beta_1$, we can simply plug in the coefficients and the respective values of $X$ to get $p(X)$. 

\bigskip

\item [5.] Once we get the $p(X)$, we can define an indicator function as our classification threshold (mentioned earlier) and subsequently, get all the values of $Y$.

    
    
    
\end{enumerate}
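
\bigskip

For reference, the quantity maximised in step 3 can be written down explicitly. This is a sketch under the usual assumption that the $n$ training pairs $(x_i, y_i)$, with $y_i \in \{0,1\}$, are independent; the symbols $L$ and $\ell$ are introduced here and are not used elsewhere in these notes. The likelihood and log-likelihood are

$$L(\beta_0,\beta_1) = \prod_{i=1}^{n} p(x_i)^{y_i}\bigl(1-p(x_i)\bigr)^{1-y_i}, \qquad
\ell(\beta_0,\beta_1) = \sum_{i=1}^{n}\Bigl[y_i \ln p(x_i) + (1-y_i)\ln\bigl(1-p(x_i)\bigr)\Bigr].$$

Maximum likelihood picks the $\beta_0, \beta_1$ that make $\ell$ as large as possible. Unlike least squares, there is in general no closed-form solution, so the maximisation is done numerically. For step 5, note that with a threshold of $0.5$ the rule $p(X) \geq 0.5$ is the same as $\beta_0 + \beta_1 X \geq 0$, because the sigmoid equals $0.5$ exactly when its argument is $0$.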








\newpage



\section{Supervised vs Unsupervised Learning}





\item \textbf{Definition (Supervised Learning)}


Most statistical learning problems fall into one of two categories:
supervised or unsupervised. In supervised learning, for each observation of the
independent variables $x_i$, $i = 1, \ldots, n$, there is an associated output (dependent) variable $y_i$. We wish to fit a model that relates the response to the predictors, with the aim of accurately predicting the response for future
observations (prediction) or better understanding the relationship between
the dependent and independent variables (inference). For example, suppose we are given three independent variables, income level, education level and health consciousness level (all measured numerically), and name them $x_1,x_2,x_3$, and we have a $y$ variable which depends on these three independent variables. We let $y$ be life expectancy. If we are given this set of data for, say, 5000000 individuals, with both inputs and outputs provided, these are training data in which the machine looks for patterns. It can then come up with a suitable model, and if we give the machine a new observation with $x_1,x_2,x_3$ as inputs, it can produce a reliable output (say, the person's life expectancy).
\bigskip

In the examples shown, there are only two variables, and
so one can simply visually inspect the scatter-plots of the observations in
order to identify clusters. However, in practice, we often encounter data
sets that contain many more than two variables. In this case, we cannot
easily plot the observations. For instance, if there are $p$ variables in our
data set, then $p(p-1)/2$ distinct scatter-plots can be made, and visual
inspection is simply not a viable way to identify clusters. For this reason,
automated clustering methods are important (K means).
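
\bigskip

To get a feel for how quickly the count grows: the number of distinct scatter-plots is the number of ways to choose $2$ variables out of $p$. The values of $p$ below are chosen only for illustration.

$$\frac{p(p-1)}{2} = \binom{p}{2}: \qquad p = 3 \Rightarrow 3, \qquad p = 10 \Rightarrow 45, \qquad p = 100 \Rightarrow 4950.$$

Even then, each plot shows only two variables at a time, so clusters that depend on several variables jointly may not be visible in any single plot.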



\item \textbf{Definition (Unsupervised)}

In contrast, unsupervised learning describes the somewhat more challenging
situation in which for every observation $i = 1, \ldots, n$, we observe
a vector of measurements $x_i$ but no associated output $y_i$ (i.e., we have all the values of income level, education level and health consciousness level, measured numerically, but no associated output). It is not possible to fit a linear regression model, since there is no response variable
to predict. In this setting, we are in some sense working blind; the situation
is referred to as unsupervised because we lack a response variable
that can supervise our analysis.

\bigskip


\item \textbf{Confusion on when to use Regression or Classification}

Regression and classification are both related to prediction: regression predicts a value from a continuous set (the output variable takes continuous values), whereas classification predicts membership of a class (the output variable takes class labels). \bigskip


For example, the price of a house depending on the 'size' (in some unit) and say 'location' of the house, can be some 'numerical value' (which can be continuous): this relates to regression.
\bigskip


Alternatively, the prediction of price can be in words, viz., 'very costly', 'costly', 'affordable', 'cheap', and 'very cheap': this relates to classification. Each class may correspond to some range of values.

\bigskip

As a rule of thumb, if your output variable (also known as the dependent variable, the variable you want to predict) is in the form of a class, then use classification; if it is a continuous number, then use regression. One can dig deeper to see whether a combination of both methods applies.
\bigskip

Also note, some classification algorithms can be used on both qualitative and quantitative responses!

\bigskip


\item \textbf{Definition (K Means)}

\url{http://www.learnbymarketing.com/tutorials/k-means-clustering-in-r-example/} good website

\bigskip

\url{https://stats.stackexchange.com/questions/21222/are-mean-normalization-and-feature-scaling-needed-for-k-means-clustering/21226}

\url{https://stats.stackexchange.com/questions/183236/what-is-the-relation-between-k-means-clustering-and-pca}

\url{https://stats.stackexchange.com/questions/2691/making-sense-of-principal-component-analysis-eigenvectors-eigenvalues}
\bigskip

\url{https://www.r-bloggers.com/pca-and-k-means-clustering-of-delta-aircraft/}

\url{https://uc-r.github.io/kmeans_clustering}

\url{http://www.sthda.com/english/wiki/print.php?id=236}
\bigskip


\item \textbf{K means and PCA}


It is important to realize that PCA and K-means serve totally different purposes. PCA is for dimensionality reduction: it tells you which variables (or which linear combinations of variables) in your data are most important. K-means is for clustering: it tells you which of your data points ``belong together'' in some sense.

People will often use the two together, but they do different things.


\url{https://www.r-bloggers.com/how-to-perform-pca-on-r/}

\bigskip


    \item \textbf{Curse of dimensions}


I read that 'Euclidean distance is not a good distance in high dimensions'. I guess this statement has something to do with the curse of dimensionality, but what exactly? Besides, what is 'high dimensions'? I have been applying hierarchical clustering using Euclidean distance with 100 features. Up to how many features is it 'safe' to use this metric? \bigskip

Can imagine, \url{https://stats.stackexchange.com/questions/99171/why-is-euclidean-distance-not-a-good-metric-in-high-dimensions/} super good read.


\bigskip

\section{Intuition of Logistic Regression}

\begin{enumerate}

\item [1.] Logistic regression does not make many of the key assumptions of linear regression and general linear models that are based on ordinary least squares algorithms – particularly regarding linearity, normality, homoscedasticity, and measurement level.
    
    \bigskip

\begin{enumerate}
    \item 

Firstly, it does not need a linear relationship between the dependent and independent variables. Logistic regression can handle all sorts of relationships, because it applies a non-linear log transformation to the predicted odds.

\bigskip

\item  Secondly, the independent variables do not need to be multivariate normal – although multivariate normality yields a more stable solution. Also the error terms (the residuals) do not need to be multivariate normally distributed.

\bigskip

\item  Thirdly, homoscedasticity is not needed. Logistic regression does not need the variances to be equal (homoscedastic) for each level of the independent variables.

\bigskip

\item  Lastly, it can handle ordinal and nominal data as independent variables. The independent variables do not need to be metric (interval or ratio scaled).

\bigskip

\end{enumerate}

However some other assumptions still apply. 

\begin{enumerate}

\item  [1.] Binary logistic regression requires the dependent variable to be binary and ordinal logistic regression requires the dependent variable to be ordinal. Reducing an ordinal or even metric variable to dichotomous level loses a lot of information, which makes this test inferior compared to ordinal logistic regression in these cases.

\bigskip

\item [2.] Secondly, since logistic regression assumes that $P(Y=1)$ is the probability of the event occurring, it is necessary that the dependent variable is coded accordingly. That is, for a binary regression, the factor level 1 of the dependent variable should represent the desired outcome.

\bigskip

\item [3.] Thirdly, the model should be fitted correctly. Neither over fitting nor under fitting should occur. That is only the meaningful variables should be included, but also all meaningful variables should be included. A good approach to ensure this is to use a step-wise method to estimate the logistic regression.

\bigskip

\item [4.] Fourthly, the error terms need to be independent. Logistic regression requires each observation to be independent. That is that the data-points should not be from any dependent samples design, e.g., before-after measurements, or matched pairings. Also the model should have \textbf{little or no multicollinearity}. That is that the independent variables should be independent from each other.

\bigskip

However, there is the option to include interaction effects of categorical variables in the analysis and the model. If multicollinearity is present, centering the variables (i.e. deducting the mean of each variable) might resolve the issue, though this is subject to many restrictions; see \url{https://stats.stackexchange.com/questions/16710/does-standardising-independent-variables-reduce-collinearity}. If this does not lower the multicollinearity, a factor analysis with orthogonally rotated factors should be done before the logistic regression is estimated.

\bigskip


\item [5.] Fifthly, logistic regression assumes linearity of independent variables and log odds. Whilst it does not require the dependent and independent variables to be related linearly, it requires that the independent variables are linearly related to the log odds. Otherwise the test underestimates the strength of the relationship and rejects the relationship too easily, that is being not significant (not rejecting the null hypothesis) where it should be significant. A solution to this problem is the
categorization of the independent variables. That is transforming metric variables to ordinal level and then including them in the model. Another approach would be to use discriminant analysis, if the assumptions of homoscedasticity, multivariate normality, and absence of multicollinearity are met.

\bigskip


\item [6.] Lastly, it requires quite large sample sizes, because maximum likelihood estimates are less powerful than ordinary least squares (e.g., simple linear regression, multiple linear regression). Whilst OLS needs 5 cases per independent variable in the analysis, ML needs at least 10 cases per independent variable, and some statisticians recommend at least 30 cases for each parameter to be estimated.

    \end{enumerate}

\chapter{Statistical Significance, Effect Size,
and Confidence Intervals}\label{cha:Paradoxes of Newtonian Cosmology}

\section{Standard Error}

This is an abstract concept and requires one to read more than once.

\bigskip


\textbf{Definition: Standard Error}

The \textbf{standard error} of a statistic (statistic can be mean, the correlation coefficient or the difference between two means) is the \textbf{standard deviation} of the \textbf{sampling distribution} of that statistic.

\bigskip

\textbf{Example}

We begin our understanding of Standard Error with an example. Although
there are standard errors for all statistics (like correlation coefficient etc), we will focus on the standard error of the mean.

\bigskip

We choose 100 men from the population in Singapore to examine their average shoe size. We find that the sample mean of the shoe sizes of these 100 men is 10. Note that for these 100 men, we can easily form a frequency \textbf{distribution} as follows (table xxx), where the sample mean is just one of the distribution's statistics.

\bigskip

Now, always keep in mind that our sample should be a good representation of our true population, but that is just an aim: one should always realise that no sample can perfectly resemble the true population. In one sample of 100 men the sample mean might be 10, while in another sample of 100 men it might be 6. This is ``bad'' because there is a lot of fluctuation between the sample means, which leaves us unsure whether the chosen sample is representative of the true population.

\bigskip

So \textbf{Standard Errors} are important because they measure how much \textbf{sampling fluctuation a statistic will show}; in this case, we want to be sure that the sampling fluctuation of the mean is not big. The \textbf{Standard Error} tells you how far a sample statistic (like the sample mean) typically deviates from the \textbf{actual population value} (here, the population mean).

\bigskip

Back to our example of the shoes, suppose we take 1,000 different random samples of men, each sample of 100 men, and find the mean of each sample; then we will have 1000 different sample means. These 1000 different sample means also form their own distribution, which (approximately) is what we call the \textbf{sampling distribution of the sample mean.}

\bigskip

So following the above, we have a new distribution, namely the \textbf{sampling distribution of the sample mean.} As with any distribution, it has statistics of its own! In particular, recall from the previous paragraphs that we are interested in the `fluctuation' of this \textbf{sampling distribution of the sample mean}. Hence we need to find the \textbf{Standard Deviation} of the \textbf{sampling distribution of the sample mean}. We also look at the mean of the \textbf{sampling distribution of the sample mean} (approximated here by the sum of the 1000 sample means divided by 1000). This mean is called the \textbf{expected value of the mean}, and it is \textbf{equal to the True Population mean (proof omitted as out of scope)}.

\bigskip

Next up: a logic ride! So put on your thinking cap and come along! Recall our definition from the earlier section on \textbf{standard deviation}: given a \textbf{distribution,} the \textbf{standard deviation} tells us the typical difference, or deviation, between an \textbf{individual score in the distribution and the mean of the distribution.} Now we use this definition to work out what the standard deviation of the \textbf{sampling distribution of the sample mean} means. Firstly, we replace the phrase ``individual score'' with ``individual sample mean'', because each ``score'' in our \textbf{sampling distribution of the sample mean} is none other than a sample mean (recall there are 1000 samples here). Secondly, we replace the word ``distribution'' with ``\textbf{sampling distribution of the sample mean}''. Lastly, we replace the phrase ``mean of the distribution'' with ``mean of the \textbf{sampling distribution of the sample mean}'', bearing in mind that the ``mean of the \textbf{sampling distribution of the sample mean}'' was identified above as the True Population mean!
\bigskip

So the standard deviation of the \textbf{sampling distribution of the sample mean} refers to the typical difference between the \textbf{expected value of the mean, which is the population mean,} and \textbf{an individual sample mean}. Now one might wonder where the standard error we promised to talk about is. Actually, here it is: the standard deviation of the \textbf{sampling distribution of the sample mean} is just the \textbf{standard error}.

\bigskip

So one way to think about the standard error of the mean is that it tells us how confident we should be that a sample mean represents the actual population mean. Phrased another way, the standard error of the mean provides a measure
of how much error we can expect when we say that
a sample mean represents the mean of the larger population. 




That is why it is called a standard
error. Knowing how much error we can expect when selecting a sample of a given size from a
population is critical in helping us determine whether our sample statistic, such as the sample
mean, is meaningfully different from the population parameter, such as the population mean.





\bigskip


\textbf{Properties: Sample Distribution of the mean and its Standard Error}

If the population's standard deviation is known, then it is easy to calculate the standard error:

Given a population with a mean of $\mu$ and a standard deviation of $\sigma$, and random samples of size $n$, the sampling distribution of the mean has a mean of $\mu$ and a standard deviation (standard error) of $$\sigma_M = \dfrac{\sigma}{\sqrt{n}}$$

\bigskip

More often than not, we do not know the population standard deviation. In that case, we use the formula below to \textbf{estimate the standard deviation} of the sampling distribution of the mean, AKA the standard error of the sample mean.

$$S_{\bar{X}} = \dfrac{S}{\sqrt{n}}$$

where $n$ is the sample size, $\bar{X}$ is the sample mean, and $S$ is the standard deviation of the sample (sample standard deviation) given by $$S = \sqrt{\dfrac{\sum_{i=1}^{n}(X_i-\bar{X})^2}{n-1}}$$
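
\bigskip

As a small worked example, take a made-up sample of $n = 5$ shoe sizes, $8, 9, 10, 11, 12$ (these numbers are assumptions for the arithmetic only). The sample mean is $10$, and

$$S = \sqrt{\dfrac{(-2)^2 + (-1)^2 + 0^2 + 1^2 + 2^2}{5-1}} = \sqrt{\dfrac{10}{4}} \approx 1.581, \qquad \dfrac{S}{\sqrt{n}} = \dfrac{1.581}{\sqrt{5}} \approx 0.707.$$

If a sample of $n = 100$ men happened to have the same $S$, the estimated standard error would shrink to $1.581/\sqrt{100} \approx 0.158$: the sample is $20$ times larger, but the standard error is only $\sqrt{20} \approx 4.5$ times smaller.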


\bigskip


While the mean of the sampling distribution is equal to the mean of the population, the standard error depends on the standard deviation of the population and the size of the sample. Knowing how spread apart the sample means are from each other and from the population mean gives an indication of how close a single sample mean is likely to be to the population mean.

\bigskip



\textbf{Derivation of Standard Error of the Mean}

\url{https://stats.stackexchange.com/questions/425319/deriving-the-standard-error?noredirect=1#comment793971_425319}

Let the sample mean be a random variable called $\bar{X}$; we aim to find the variance of $\bar{X}$. By definition, $\bar{X}$ equals $\frac{1}{n}\sum_{i=1}^{n}X_i$, where $n$ is the size of the sample and $X_i$, $1 \leq i \leq n$, is a single observation from the population. Each $X_i$ is itself a random variable, and we assume the $X_i$ are independent, each with the population variance $\sigma^2$. Since $$\bar{X} = \frac{1}{n}\sum_{i=1}^nX_i$$ it hence follows that 
 \bigskip
 
$$
\begin{aligned}
    \text{Var}(\bar{X}) &= \text{Var}\left(\frac{1}{n}\sum_{i=1}^nX_i\right) \\
    & =\left(\frac{1}{n}\right)^2\sum_{i=1}^{n}\text{Var}(X_i)\\
    & =  \left(\frac{1}{n}\right)^2(n\sigma^2)\\
    &= \frac{\sigma^2}{n}\\
\end{aligned}
$$

\bigskip

and thus the standard error of the sample mean is merely the standard deviation of the sample mean, given by $$\frac{\sigma}{\sqrt{n}}$$
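
\bigskip

A concrete check of this formula, using a fair six-sided die as an assumed illustration: a single roll has mean $7/2$ and variance

$$\sigma^2 = \frac{1^2+2^2+\cdots+6^2}{6} - \left(\frac{7}{2}\right)^2 = \frac{91}{6} - \frac{49}{4} = \frac{35}{12} \approx 2.917.$$

For the mean of $n = 2$ independent rolls, the formula gives $\text{Var}(\bar{X}) = \frac{35}{24} \approx 1.458$, so the standard error is $\sqrt{35/24} \approx 1.208$. For $n = 100$ rolls it gives $\text{Var}(\bar{X}) = \frac{35}{1200} \approx 0.0292$ and a standard error of about $0.171$. The sample mean of many rolls is therefore concentrated much more tightly around $3.5$ than a single roll is.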

\bigskip



\newpage



\section{Seeliger's Attack}



 \textbf{Seeliger's Idea}

The real assault on Newtonian Cosmology started off from the famous German astronomer Hugo von Seeliger. In Seeliger's paper dated 1895, he questioned whether Newton's Law of Gravitation will hold exactly for masses separated by immeasurably great distances (Norton, 1999). 

\bigskip

So Seeliger showed that in an \textbf{Infinite Euclidean Universe} with a \textbf{uniform mass density distribution}, one cannot have agreement with Newton's Law of Gravitation, as one will see later that the gravitational force/potential at any point in the Universe does not give a unique result (it is indeterminate) (Seeliger, 1895).

\bigskip

 \textbf{Seeliger's Proposal} 


\begin{enumerate}
    \item  Consider figure 1.1(A) below. The total gravitational potential at the unit test mass (placed at the origin) due to a shell of inner radius $R_0$ and outer radius $R_1$ is 
\begin{equation}\tag{1.1}
\varphi = \int_{0}^{2\pi}\int_{0}^{\pi}\int_{R_0}^{R_1} - \dfrac{G\rho}{r}(r^2\sin \phi) \, dr \,d\phi\, d\theta = \int_{0}^{2\pi}\int_{0}^{\pi}\int_{R_0}^{R_1} -G\rho r\sin\phi  \,dr\, d\phi\, d\theta 
\end{equation}
\bigskip

\item The gravitational force in the direction of the x coordinate axis is denoted as $F_x$ and is given by 
\begin{equation}\tag{1.2}
F_x = -\dfrac{d\varphi}{dx}=G\int_{0}^{2\pi}\int_{0}^{\pi}\int_{R_0}^{R_1}\rho\sin\phi\cos\phi \,dr \,d\phi\, d\theta
\end{equation}

\bigskip

\item The tidal force $Z_x$ is given by 
\begin{equation}\tag{1.3}
Z_x=\dfrac{dF_x}{dx}= -\dfrac{d^2\varphi}{dx^2}=2G\int_{0}^{2\pi}\int_{0}^{\pi}\int_{R_0}^{R_1}\frac{\rho}{r}\,dr\sin\phi\frac{3\cos^2(\phi)-1}{2} \,d\phi \,d\theta
\end{equation}

\end{enumerate}
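
\bigskip

As a sanity check, the three integrals can be evaluated in closed form for the concentric shell of Figure 1.1(A). The sketch below assumes that $\rho$ is constant and that $\phi$ is the polar angle measured from the $x$-axis (the latter is not stated explicitly above); it uses the substitution $u = \cos\phi$.
$$\begin{aligned}
\varphi &= -G\rho\,(2\pi)\left(\int_0^{\pi}\sin\phi\,d\phi\right)\left(\int_{R_0}^{R_1} r\,dr\right) = -2\pi G\rho\,(R_1^2 - R_0^2), \\
F_x &\propto \int_0^{\pi}\sin\phi\cos\phi\,d\phi = \int_{-1}^{1}u\,du = 0, \\
Z_x &\propto \int_0^{\pi}\sin\phi\,\frac{3\cos^2\phi - 1}{2}\,d\phi = \int_{-1}^{1}\frac{3u^2-1}{2}\,du = \left[\frac{u^3-u}{2}\right]_{-1}^{1} = 0.
\end{aligned}$$
So for this symmetric configuration the force and the tidal force vanish for every $R_1$, while the magnitude of the potential grows without bound like $R_1^2$ as $R_1 \to \infty$.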

\begin{figure}
\centering
\begin{subfigure}{.5\textwidth}
  \centering
\definecolor{wrwrwr}{rgb}{0.3803921568627451,0.3803921568627451,0.3803921568627451}
\begin{tikzpicture}[line cap=round,line join=round,>=triangle 45,x=1cm,y=1cm,scale=.7]
\draw [line width=.5pt] (0,0) circle (4cm);
\draw [line width=.5pt] (0,0) circle (2cm);
\draw [line width=.5pt,<->] (0,0.15)-- (0,2);
\draw [line width=.5pt,<->] (0.15,0)-- (3.8371147553897166,0.9911772427811913);
\draw (-0.6,-0.16990637718001933) node[anchor=north west] {\parbox{2 cm}{unit test \\mass}};
\draw [line width=2pt] (0,0) circle (0.1cm);
\begin{scriptsize}
\draw[color=black] (0.1867353955897328,0.8763639528820679) node {$R_{0}$};
\draw[color=black] (1.3605512418350735,0.5) node {$R_{1}$};
\draw [fill=wrwrwr] (0,0) circle (2.5pt);
\end{scriptsize}
\end{tikzpicture}
  \caption{}
  \label{fig:sub1}
\end{subfigure}%
\begin{subfigure}{.5\textwidth}
  \centering
  \includegraphics[width=\linewidth]{phd_template/images/peanut.jpg}
  \caption{Peanut}
  \label{fig:sub2}
\end{subfigure}
\caption{Sphere and Peanut}
\label{fig:test}
\end{figure}




Note that the exact forms of $F_x$ and $Z_x$ are derived via an expansion in Legendre polynomials. The Legendre polynomials form an orthogonal basis of the space of polynomials in $x$ on $[-1,1]$. I shall omit the proof here.
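
\bigskip

For reference, with the standard normalisation $P_n(1) = 1$ (an assumption about the convention intended here), the first few Legendre polynomials and their orthogonality relation are
$$P_0(x) = 1, \qquad P_1(x) = x, \qquad P_2(x) = \frac{1}{2}(3x^2 - 1), \qquad \int_{-1}^{1}P_m(x)P_n(x)\,dx = \frac{2}{2n+1}\,\delta_{mn}.$$
For instance, $\int_{-1}^{1}P_2(x)^2\,dx = \frac{2}{5} \neq 1$, so under this normalisation the family is orthogonal but its members do not have unit norm.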

\bigskip


\textbf{Sphere Shape}

Let us evaluate Seeliger's three integrals over a sphere of radius $R_1$ centered at a point at distance $r$ from the origin. As $R_1 \to \infty$, the gravitational potential and the gravitational force both depend on $r$, which means that $F_x$ and $\varphi$ take values that depend on our \textbf{{\color{red}choice of center of the sphere}}. This does not agree with common experience. In this model, however, the tidal force $Z_x$ remains finite and is NOT indeterminate. Seeliger therefore went on to consider a different shape, the peanut. 

\bigskip

\textbf{Peanut Shape}

Since the tidal force was determinate for the sphere, Seeliger integrated over a different shape, for example the peanut shape shown in Figure 1.1(B), which is described in spherical coordinates by the equation

$$\log\frac{R_1}{R_0} = am+mP_2(\cos \phi)$$

where $P_2(\cos\phi) = \frac{1}{2}(3\cos^2\phi-1)$ is the second Legendre polynomial in $\cos \phi$ (here $\phi$ is the polar angle, not the potential $\varphi$).

\bigskip

We then consider the volume of the peanut shape to be $V(R_1)$ and we have the following results. \bigskip

\begin{enumerate}
    \item  Again considering Figure 1.1(B) above, the total gravitational potential at the test mass, placed at the origin, due to a shell of inner radius $R_0$ and outer radius $R_1$ is 
\begin{equation}\tag{1.11}
\varphi = \int\int\int_{V(R_1)} - \dfrac{G\rho}{r} \,dV = \int\int\int_{V(R_1)} -G\rho r\sin\phi  \,dr\,d\phi\,d\theta
\end{equation}
\bigskip

\item The gravitational force in the direction of the x coordinate axis is denoted as $F_x$ and is given by 
\begin{equation}\tag{1.21}
F_x = -\dfrac{d\varphi}{dx}=G\int\int\int_{V(R_1)}\rho\sin\phi\cos\phi\, dr\,d\phi\,d\theta
\end{equation}

\bigskip

\item The tidal force $Z_x$ is given by 
\begin{equation}\tag{1.31}
Z_x=\dfrac{dF_x}{dx}= -\dfrac{d^2\varphi}{dx^2}=2G\int\int\int_{V(R_1)}\frac{\rho}{r}\sin\phi\,\frac{3\cos^2(\phi)-1}{2} \,dr\,d\phi\,d\theta
\end{equation}

\end{enumerate}


This time, both $F_x$ and $\varphi$ tend to $0$ as $R_1$ tends to $\infty$, but the tidal force $Z_x$ tends to $\infty$. This is again inconsistent with physical observation, as we do not experience infinite tidal forces. The moral of the story: according to Seeliger, something is wrong with Newtonian cosmology.
\bigskip


\textbf{Path Dependence}


We see that the limits $\Lim{R_1 \to \infty} F_x$ and $\Lim{R_1 \to \infty} Z_x$ have no well-defined value; in particular, the limiting values of $F_x$ and $Z_x$ are \textbf{path dependent}, that is, they depend on how $R_1 \to \infty$. If we build the model from different shapes, we get different results as $R_1 \to \infty$. 

\bigskip



\textbf{Neumann proposed a similar idea}

Carl Neumann raised a similar objection concerning the contradictions in Newton's Law; we shall not repeat the narrative here.




\bigskip




\newpage

\section{Kelvin's Krux}


 \textbf{Kelvin's argument}


Kelvin's argument on the issue of Newtonian gravity runs as follows. Let $S$ be the surface of a sphere and $M$ the mass it encloses, and assume a \textbf{uniform matter density} $\rho$ throughout our Universe. The gravitational flux $\Phi_S$ through the surface $S$ is given by the formula: 
\begin{equation}\tag{1.4}
\Phi_S = \int\int_S \vec{E} \cdot \vec{dA}
\end{equation}
where $\vec{E}$ denotes the field strength.

\bigskip

By Gauss's law for gravity, we also have 
\begin{equation}\tag{1.5}
\Phi_S = -4\pi GM\end{equation} where $S$ is a closed surface and $M$ is the total mass bounded within the sphere.
\bigskip

Consequently, since we assumed a uniform matter density throughout, we can substitute $M = \rho V$ into equation 1.5 to obtain 
\begin{equation}\tag{1.6}
\Phi_S = -4\pi G\left(\dfrac{4}{3}\pi R^3\rho\right) = -\dfrac{16}{3}\pi^2G\rho R^3
\end{equation}

\bigskip

The total flux through a sphere of radius $R$ filled with matter of uniform density $\rho$ is $\Phi_S$, and the surface area of the sphere is $4\pi R^2$. By symmetry the field is radial and has the same magnitude everywhere on the sphere, so its outward radial component $E_r$ on the surface is given by 
\begin{equation}\tag{1.7}
E_r = \dfrac{\Phi_S}{4\pi R^2} = -\dfrac{4}{3}\pi G \rho R = -\dfrac{GM}{R^2}
\end{equation}

where $M = \frac{4}{3}\pi R^3\rho$.

\bigskip

Now, by Kelvin's argument, a problem arises if we assume that the Universe has a constant and uniform (matter) density $\rho$. This is because the magnitude of the field is 

$$||\vec{E}|| = \dfrac{GM}{R^2} = \dfrac{G\rho V}{R^2} = \dfrac{4}{3}\pi G\rho R $$

Then as $R \to \infty$, $||\vec{E}|| \to \infty$ as well.
\bigskip

Consequently, any particle on the surface experiences an ever-increasing field strength as $R \to \infty$. Worse still, if we assume the Universe to be infinite, as in Newtonian cosmology, then we can choose the centre of the sphere to be any point in space, so the gravitational force (field strength) on a particle is undefined.\bigskip
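
To see the dependence on the centre explicitly, here is a minimal sketch; the position vectors $\vec{x}$, $\vec{c}_1$, $\vec{c}_2$ are notation introduced only for this illustration. By (1.7), the field at a point $\vec{x}$ computed from a sphere centred at $\vec{c}$ and passing through $\vec{x}$ has magnitude $\frac{4}{3}\pi G\rho\,|\vec{x}-\vec{c}|$ and points towards $\vec{c}$:
$$\vec{E}_{\vec{c}}(\vec{x}) = -\frac{4}{3}\pi G\rho\,(\vec{x} - \vec{c}), \qquad \vec{E}_{\vec{c}_1}(\vec{x}) - \vec{E}_{\vec{c}_2}(\vec{x}) = \frac{4}{3}\pi G\rho\,(\vec{c}_1 - \vec{c}_2).$$
Two equally valid choices of centre therefore give fields at the same point that differ by an amount proportional to the separation of the centres.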



\bigskip

\newpage



\section{Norton's Non Convergence}


 \textbf{The Non Convergence of Gravitational Force in Newtonian Cosmology}


Assume that our Universe has a \textbf{uniform matter distribution} with density $\rho$. Given a test body of unit mass, the gravitational force exerted on it is the resultant of the forces exerted by \textbf{all THE MASSES} in the Universe. \bigskip

We can compute this force by integrating over all these masses. There are many ways to do this; to start, divide the Universe into hemispherical shells centered on the test mass. Fix the test mass and draw shells around it. Let $r$ be the distance from the unit test mass to a particular shell, $\Delta r$ the thickness of the shell and $\rho$ its density. The force this hemispherical shell exerts on the test mass is then computed in spherical coordinates as follows:

$$F = \int_{0}^{2\pi}\int_{0}^{\pi/2}\int_{r}^{r+\Delta r} \dfrac{G\rho \cos\varphi}{R^2}(R^2\sin\varphi) \,dR\, d\varphi\, d\theta = G\pi \rho \Delta r$$
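
\bigskip

The integral separates into three one-dimensional factors, which makes the cancellation of the radius visible. Spelling out the intermediate steps (a routine evaluation, added here as a check):
$$F = G\rho\left(\int_{0}^{2\pi}d\theta\right)\left(\int_{0}^{\pi/2}\sin\varphi\cos\varphi\,d\varphi\right)\left(\int_{r}^{r+\Delta r}dR\right) = G\rho\,(2\pi)\left(\frac{1}{2}\right)(\Delta r) = G\pi\rho\,\Delta r.$$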

\bigskip

\begin{figure}[h!]
  \centering
  \includegraphics[width=0.5\textwidth]{phd_template/Nortons.png}
  \caption{Non - Convergence of force on a test mass in Newtonian Cosmology, courtesy of John Norton}
\end{figure}

So the unit test mass experiences a force of $G\pi \rho \Delta r$ from the shell, and this force is independent of the radius of the shell: as we saw during the integration, the radius cancels. This means that no matter where a shell is located, it exerts a constant force of $G\pi \rho \Delta r$ on the test mass. As the diagram (fig 1.2) illustrates, this causes a problem.

\bigskip

We sum the forces exerted on this unit test mass by all the hemispherical shells out to infinity (in a nutshell, the mass of the whole Universe), which results in $$G\pi \rho \Delta r - G\pi \rho \Delta r + G\pi \rho \Delta r - G\pi \rho \Delta r +\cdots$$

This alternating series has no well-defined limit (MA2108): its terms do not tend to zero, so its partial sums oscillate and never settle on a single value. This is a big problem, because the force we have just calculated can be made to take different values depending on how we group the terms of the series. One would at least expect a definite answer for the force the Universe exerts on you, but by this calculation that is impossible. 
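
\bigskip

To make this concrete, write $a = G\pi\rho\,\Delta r$ (a shorthand used only here) and let $S_N$ be the sum of the first $N$ terms. Then
$$S_N = \begin{cases} a, & \text{if $N$ is odd} \\ 0, & \text{if $N$ is even} \end{cases}$$
so the sequence of partial sums has no limit. Bracketing the terms in two different ways gives two different apparent answers:
$$(a - a) + (a - a) + \cdots = 0, \qquad a + (-a + a) + (-a + a) + \cdots = a.$$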

\bigskip


There are many other arguments that Newtonian gravity fails when we assume an infinite cosmos or a uniform density throughout; most rely on the same reasoning as the few listed above.

\bigskip

\newpage




\chapter{Probability Space}\label{cha:P space}

\begin{enumerate}

\section{Probability Space}



\item [1.] \textbf{Definition: Probability Space}
    
\justify

Real-world scenarios often involve chance. We can model such scenarios mathematically.
For this purpose, we'll use a mathematical object called a \textbf{Probability Space} (or experiment). A Probability Space, typically denoted $PS = (S, \Sigma, P)$, is an ordered triplet composed of three objects, called the
sample space $S$, the event space $\Sigma$ (upper-case sigma), and the probability function $P$, where \bigskip

\begin{enumerate}
    \item [i] The \textbf{sample space} $S$ (sometimes written $\Omega$) is simply the set of ALL possible outcomes, where an \textbf{outcome} is a result of an experiment.
    
    \bigskip
    
    \item [ii] An \textbf{event} is simply any set of possible outcomes, which means that an event is a subset of the sample space. In turn, the event space $\Sigma$ is simply the set of all events.
    
    
    \bigskip
    
    
    \item [iii] The \textbf{probability function} $P$ simply assigns to each event some probability between $0$ and $1$. This probability is interpreted as the likelihood of that particular event occurring.
    
\end{enumerate}

\bigskip


\item [1.1] \textbf{Example of Probability Space}


\begin{center}
\begin{tabular}{ | m{8cm} | m{7cm}|  } 
\hline
\textbf{Definition} & \textbf{Example}  \\ 
\hline
A Probability Space is a scenario involving chance or probability & Throwing a coin \\
\hline
\textbf{Outcome:} Result of an experiment/Probability Space & When you toss a coin, the only outcomes are \textbf{Heads or Tails}  \\ 
\hline
\textbf{Sample Space:} Set of all possible outcomes & $S = \{\text{Heads}, \text{Tails}\}$\\
\hline
\textbf{Event Space:} Set of all possible events & $\Sigma = \{\emptyset, \{\text{Heads}\}, \{\text{Tails}\}, \{\text{Heads, Tails}\}\}$\\
\hline
\textbf{Probability Function:} $P: \Sigma  \to \R$ assigns to each event a \textbf{probability} which is a number between 0 and 1& $P(\{\text{Heads}\}) = \frac{1}{2}$\\
\hline
\end{tabular}
\end{center}

\bigskip

Note that an event is a subset of the sample space. For instance, let $A$ be the event of getting Heads; then $A = \{\text{Heads}\} \in \Sigma$.

\bigskip

\textbf{Probability:} If the sample space $S$ consists of a finite number of equally likely outcomes, the probability of an event $A$, denoted by $P(A)$, is given by $$P(A) = \dfrac{n(A)}{n(S)}$$ where $n(A)$
denotes the number of outcomes in the event $A$ and $n(S)$ the number of outcomes in $S$.
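
\bigskip

For example (a fair six-sided die, used here purely as an illustration), let $S = \{1,2,3,4,5,6\}$ and let $A = \{2,4,6\}$ be the event of rolling an even number. Then
$$P(A) = \frac{n(A)}{n(S)} = \frac{3}{6} = \frac{1}{2}.$$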





\bigskip


\item [1.2] \textbf{Definition (The Three Axioms of Probability (Kolmogorov Axioms))}

A probability function $P$ should satisfy the three axioms below. The first two are stated informally, whilst the third is stated formally.
\bigskip

For any event $A$, $P(A)$ is defined as the probability of event $A$ happening.

\bigskip


\begin{enumerate}
    \item [1.] First axiom (Non - Negativity Axiom): $P(A) \geq 0$ for any event $A$.
    
    \bigskip
    
    \item [2.] Second Axiom (Normalization Axiom): $\sum_{i=1}^{n}P(A_i) = 1$, where $A_1, \ldots, A_n$ are the events consisting of a single outcome each, so that together they make up the whole sample space; equivalently, $P(S) = 1$.
    
    \bigskip
    
    \item [3.] Third Axiom (Additivity Axiom): Given a countable sequence of \textbf{Disjoint Events} $A_1, A_2, ..., A_n,... \subset S$, we have $$P\left(\bigsqcup_{i=1}^{\infty} A_i \right) = \sum_{i=1}^{\infty}P(A_i)$$

\bigskip

Together with the Second Axiom, this axiom lets one find the probability of an event $A$ as $1$ minus the probability of the complement of $A$, since $A$ and $A^c$ are disjoint and $A \sqcup A^c = S$.

\end{enumerate}


\bigskip

\item [1.3] \textbf{Proposition (Implications of the Probability Axioms)}

\begin{enumerate}
    \item [1.] \textbf{Probability of Empty Set:} $P(\emptyset) = 0$ (Trivial)
    
    \bigskip
    
    \item [2.] \textbf{Complements:} $P(A) = 1 - P(A^{c})$
    
    \bigskip
    
    \item [3.] \textbf{Monotonicity:} If $A \subset B$, then $P(A) \leq P(B)$
    
    \bigskip
    
    \item [4.] \textbf{Numeric Bound:} It immediately follows that for any event $A$, $P(A) \leq 1$ by Monotonicity and Axiom 2.
    
    \bigskip
    
    \item [5.] \textbf{Inclusion-Exclusion:} $P(A \cup B) = P(A) + P(B) - P(A \cap B)$
\end{enumerate}
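
\bigskip

The complement rule and monotonicity each follow from the axioms in a line; a brief sketch, writing $B \setminus A$ for the outcomes in $B$ that are not in $A$:
$$\begin{aligned}
S = A \sqcup A^{c} &\Longrightarrow 1 = P(S) = P(A) + P(A^{c}) \Longrightarrow P(A) = 1 - P(A^{c}), \\
A \subset B,\ B = A \sqcup (B \setminus A) &\Longrightarrow P(B) = P(A) + P(B \setminus A) \geq P(A).
\end{aligned}$$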

\bigskip

\item [1.4] \textbf{Definition (Mutually Exclusive Event)}


In logic and probability theory, two events (or propositions) are mutually exclusive or disjoint if they cannot both occur at the same time. 

\bigskip

A clear example is the set of outcomes of a single coin toss, which can result in either heads or tails, but not both. To be precise, let $A$ be the event that the coin lands on heads and $B$ the event that it lands on tails. Events $A$ and $B$ can never occur at the same time: if one of them occurs, the other cannot. 

\bigskip

One can draw a Venn diagram with events $A$ and $B$ disjoint to illustrate the idea.

\bigskip

Also, by Proposition 1.3 (5), it is easy to see that $P(A \cup B) = P(A) + P(B)$, since $P(A \cap B) = 0$. For the coin toss example, since $A \cup B$ is the entire sample space, it is intuitive that $P(A \cup B)$ (the probability of throwing a head OR a tail) is $1$.

\newpage


\section{Conditional Probability}

\item  \textbf{Definition (Conditional Probability)}

Let $P$ be a probability function and $A, B \in \Sigma$ be events with $P(B) > 0$. Then the \textbf{conditional probability of} $A$ \textbf{given} $B$ is denoted $P(A|B)$ and is defined by: $$P(A|B) = \dfrac{P(A\cap B)}{P(B)}$$

\bigskip

\begin{enumerate}
    \item [1.] Cautions: In most textbooks, one doesn't specify the condition that all events involved must be part of the sample space. It might seem intuitive to us but may not be so to the beginners.
    
    \bigskip
    
    
    \item [2.] Informally, picture a Venn diagram with two overlapping circles $A$ and $B$; the shaded overlap belongs to both $A$ and $B$. Given that $B$ has occurred, what is the probability of event $A$ occurring? Since $B$ is now effectively the sample space, only the portion of $A$ that lies inside $B$ can still occur. That portion is $A \cap B$, with probability $P(A \cap B)$, and dividing by $P(B)$ rescales it relative to the new sample space $B$.
    
    
    
    
    \bigskip

\end{enumerate}

\bigskip


\item [1.5.1] \textbf{Intuition for Conditional Probability}


The intuition behind conditional probability may not be immediate for those less familiar with statistical ideas. Here is a good link for building some intuition: \url{https://stats.stackexchange.com/questions/326253/what-is-the-intuition-behind-the-formula-for-conditional-probability}

\bigskip


Suppose I have 10 balls, of which 4 are black and 6 are red. Among the 6 red balls, 2 are round. What is the probability that the ball you pick is round, given that it is red?
\bigskip

Let us connect this example with probability. Our original sample space consists of the 10 balls, so it has 10 outcomes. Define event $B$ to be that the ball picked is red; $B$ consists of the 6
red balls. In other words, once we condition on $B$, the SAMPLE space is reduced to $B$. Now define event $A$ to be that the ball picked is round.

\bigskip

What we want is the probability that a ball is round, counted only among the red balls. The red balls that are round are exactly the balls that are \textbf{BOTH RED AND ROUND}, that is, the event $A \cap B$. It is now easy to see that $P(A|B)$ is merely the number of outcomes in $A \cap B$ over the number of outcomes in $B$.
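
\bigskip

With the numbers above, and assuming each ball is equally likely to be picked, $n(A \cap B) = 2$ and $n(B) = 6$, so
$$P(A|B) = \frac{n(A \cap B)}{n(B)} = \frac{2}{6} = \frac{1}{3},$$
which agrees with the definition, since $P(A \cap B) = \frac{2}{10}$ and $P(B) = \frac{6}{10}$ give $\frac{2/10}{6/10} = \frac{1}{3}$.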

\bigskip

Explained in yet another way.

\begin{figure}[h!]
  \centering
  \includegraphics[width=0.5\textwidth]{phd_template/images/conditional.png}
  \caption{Venn Diagram}
\end{figure}
\bigskip



A good intuition is the following: given that $B$ occurred (with or without $A$), what is the probability of $A$? That is, we are now in the universe in which $B$ occurred, which is the full right circle. In that circle, the probability of $A$ is the area of $A \cap B$ divided by the area of the circle; in other words, it is the number of outcomes of $A$ in the right circle, which is $n(A \cap B)$, over the number of outcomes of the reduced sample space $B$.




\bigskip

\item [1.6] \textbf{Proposition}

Let $A$ and $B$ be events with $P(A \cap B) > 0$. Then:

\begin{enumerate}
    \item If $P(A) < P(B)$, then $P(A|B) <P(B|A)$
    \bigskip
    
    \item If $P(A) > P(B)$, then $P(A|B) > P(B|A)$
    
    \bigskip
    
    
    \item If $P(A) = P(B)$, then $P(A|B) = P(B|A)$
\end{enumerate}
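
\bigskip

All three statements come from comparing the two conditional probabilities directly. A short sketch, assuming $P(A) > 0$, $P(B) > 0$ and $P(A \cap B) > 0$ (if $P(A \cap B) = 0$, both conditional probabilities are $0$ and the strict inequalities fail):
$$\frac{P(A|B)}{P(B|A)} = \frac{P(A \cap B)/P(B)}{P(A \cap B)/P(A)} = \frac{P(A)}{P(B)},$$
so $P(A|B)$ is smaller than, larger than, or equal to $P(B|A)$ exactly when $P(A)$ is smaller than, larger than, or equal to $P(B)$.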


\bigskip

\newpage



\section{Independence}


\item [1.7] \textbf{Definition (Independent Events)}

Two events $A, B \in \Sigma$ are \textbf{independent} if $$P(A \cap B) = P(A)P(B)$$

\bigskip


\item [1.7.1] \textbf{Intuition}

Two events $A$ and $B$ are \textbf{independent} if the occurrence of $B$ does not affect the probability of the occurrence of $A$. Recall the earlier intuition: given that $B$ occurred (with or without $A$), what is the probability of $A$? That is, we are now in the universe in which $B$ occurred, which is the full right circle. In that circle, the probability of $A$ is the area of $A \cap B$ divided by the area of the circle; in other words, it is the number of outcomes of $A$ in the right circle, which is $n(A \cap B)$, over the number of outcomes of the reduced sample space $B$.

\bigskip

So we can view it as follows: given that $B$ occurred, the probability of $A$ occurring is unchanged, i.e. $$P(A|B) = P(A)$$

\bigskip

It follows immediately (provided $P(B) > 0$) that $$P(A) = P(A|B) = \dfrac{P(A \cap B)}{P(B)} \Longrightarrow P(A) P(B) = P(A \cap B)$$


\url{https://math.stackexchange.com/questions/123192/the-definition-of-independence-is-not-intuitive }
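
\bigskip

A small check of the definition, using one roll of a fair die (an example chosen for illustration): let $A = \{2,4,6\}$ and $B = \{1,2,3,4\}$, so that $A \cap B = \{2,4\}$. Then
$$P(A)P(B) = \frac{1}{2}\cdot\frac{2}{3} = \frac{1}{3} = P(A \cap B), \qquad P(A|B) = \frac{1/3}{2/3} = \frac{1}{2} = P(A),$$
so $A$ and $B$ are independent: knowing that the roll is at most $4$ does not change the chance that it is even.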



\newpage



\section{Random Variables}

\item [1.8] \textbf{Definition (Random Variable)}

A \textbf{random variable} is a \textbf{measurable function} $X: \Omega \to E$ where the domain is the \textbf{Sample Space} and the co-domain, a measurable space $E$. 

\bigskip

\textbf{So the idea is that a random variable is actually a function, which takes in an outcome from the sample space and outputs into the measurable space.}

\bigskip


However, in Statistics we usually deal with the case where the measurable space is just the \textbf{real numbers}. We therefore restate the definition informally as follows:

\bigskip

Let $PS = (S, \Sigma, P)$ be a probability space. A \textbf{random variable} $X$ is any function with domain $S$ and co-domain $\R$.

\bigskip

When the image of $X$ is \textbf{countable}, the random variable is called a \textbf{discrete random variable} and its distribution can be described by a probability mass function that assigns a probability to each value in the image of $X$. If however, the image of $X$ is \textbf{uncountably infinite} then $X$ is called a \textbf{continuous random variable.}

\bigskip

\item [1.9] \textbf{Confusion (Realization of a Random Variable)}

As we previously defined, a \textbf{random variable} $X$ is nothing more than a function which maps all outcomes $s \in \Omega$ from the sample space $\Omega$ to a real number $X(s)$.

\bigskip


As an example, consider the experiment of rolling two dice. Our sample space is $$\Omega = \{(1,1), (1,2),\ldots,(6,1),(6,2),\ldots,(6,6)\}$$

Given the above sample space, we can define a function $X$ such that for every outcome $s = (i,j)$, $1 \leq i, j \leq 6$, in $\Omega$, we have $X(s) = i+j$; that is, $X$ maps $s = (i,j)$ to the sum of $i$ and $j$.

\bigskip


\textbf{Notation:} We write $X = x$ for the event $\{s \in \Omega~:~ X(s) = x\}$. For example, if $X$ is the sum of two dice, $X = 4$ is the event $\{(1,3),(2,2),(3,1)\}$. We also write $P(X = x)$ for $P(\{s \in \Omega~:~X(s) = x\})$, so that $P(X = 4) = P(\{(1,3),(2,2),(3,1)\}) = \frac{3}{36}$.

\bigskip

It is important to distinguish between random variables and the values they take. A
\textbf{realization} is a particular value taken by a random variable. Conventionally, we use UPPER CASE for random variables, and lower case (or numbers)
for realizations. So, $\{X = x\}$ is the event that the random variable $X$ takes the specific value $x$. Here, $x$ is an arbitrary specific value, which does
not depend on the outcome $s \in \Omega$.

\bigskip


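\item [1.10] \textbf{Independence of Random Variables}


Given random variables $X: S \to \R$ and $Y: S \to \R$, the \textbf{notation} ``$X = x, Y = y$'' denotes the event $\{s \in S~:~ X(s) = x, Y(s) = y\}$. Two discrete random variables $X$ and $Y$ are then called \textbf{independent} if $P(X = x, Y = y) = P(X = x)P(Y = y)$ for all values $x$ and $y$.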



\newpage


\section{Probability Distribution}


\item [1.5.1]  \textbf{Intuition and Examples of Probability Distribution}


We call a complete specification of $P(X = k)$ for all values of $k$ the probability distribution (or probability law or probability mass function) of $X$. In the examples below, we give the probability distributions of the random variables involved.

\bigskip

\textbf{Example}

Model an experiment with probability space as follows: A fair coin-flip with $PS = (S, \Sigma, P)$, where $$S = \{H,T\}~,~ \Sigma = \{\emptyset, \{H\}, \{T\}, \{H, T\}\}~,~ P: \Sigma \to \R$$ where $P$ is defined as follows: $$P(\emptyset) = 0~,~ P(\{H\}) = P(\{T\}) = 0.5~,~ P(\{H,T\}) = 1$$

\bigskip

\textbf{Define the random variable} $X: S \to \R$ to indicate whether the coin flip is heads. Note carefully that the \textbf{observed value (realization)} of $X$ at an outcome $s$ is 

\begin{equation}
  X(s)=\begin{cases}
    1, & \text{if $s=H$}.\\
    0, & \text{if $s = T$}.
  \end{cases}
\end{equation}

\bigskip

So it follows that $X(H) = 1, X(T) = 0$, and by the notation introduced in 1.9: $$X = 1 \text{ denotes the event } \{s \in S~|~X(s) = 1\}$$

$$X = 0 \text{ denotes the event } \{s \in S~|~X(s) = 0\}$$

\bigskip

We further write $$P(X = 1) = 0.5 \textbf{ AND } P(X = 0) = 0.5 \textbf{ AND } P(X = k) = 0 \text{ where } k \neq 0,1 $$

\bigskip

Writing it compactly: 

\begin{equation}
  P(X=k)=\begin{cases}
    0.5, & \text{if $k=1$}.\\
    0.5, & \text{if $k=0$}\\
    0, &\text{if $k \neq 0,1$}\\    
  \end{cases}
\end{equation}


\bigskip

\textbf{Discrete Random Variable}

A random variable that can take on at most a \textbf{countable number} of possible values is said to be \textbf{discrete}. We define the set of all \textbf{realizations} of a \textbf{discrete random variable} to be its \textbf{support}. 

\bigskip

\textbf{Continuous Random Variable}

A random variable that takes on an \textbf{uncountable number} of values is said to be \textbf{continuous}. We define the set of all \textbf{realizations} of a \textbf{continuous random variable} to be its \textbf{support}. 


\bigskip


\item [1.5.2] \textbf{Definition: (Probability Mass Function)}

For a discrete random variable $X$, we define $$p(x) = P(X = x)$$  as the \textbf{probability mass function} of $X$.

\bigskip

If $X$ can take on the countably many values $x_1,x_2,\ldots,x_n,\ldots$, then 

\begin{equation}
p(x)\begin{cases}
    \geq 0, & \text{if $x = x_i$ for some $i=1,2,\ldots,n,\ldots$}\\
    = 0, &\text{otherwise}\\    
  \end{cases}
\end{equation}
\bigskip

Since $X$ must take on one of the values $x_i$ and the events $X=x_i$ are disjoint, then it follows that $$\sum_{i=1}^{\infty}p(x_i) = 1$$

\bigskip

\textbf{Discrete Distribution}

A \textbf{discrete distribution} is a probability mass function as defined above. We say two random variables have the \textbf{same distribution} if they have the same pmf. But \textbf{having the same pmf does not mean they are equal}. Two random variables are equal, $X=Y$, iff $X(s) = Y(s)$ for all $s \in S$.

\bigskip


\textbf{Example:}

A simple example: roll two dice, one red and one blue. Outcomes are listed as (red die, blue die), so $$S = \{(1, 1),(1, 2), \ldots ,(6, 6)\}$$

Now let $X$ be the random variable that takes the value of the red die and $Y$ the random variable that takes the value of the blue die. Note very carefully that the sample space of this experiment is $S = \{(1, 1),(1, 2), \ldots ,(6, 6)\}$ and not $S = \{1,2,3,4,5,6\}$. Saying that $X$ takes the value of the red die means $$X((i,j)) = i$$ where $(i,j) \in S$ and $i,j \in \{1,\ldots,6\}$. Similarly, $Y((i,j)) = j$.

\bigskip

Now, by definition, $X \neq Y$ because there exists $s \in S$ with $X(s) \neq Y(s)$.

\bigskip

However,  $X$ and $Y$ have the \textbf{same distribution}. Let me show you the pmf for both.

\bigskip


\begin{equation}
  P(X=k)=P(Y = k)=\begin{cases}
    \frac{1}{6}, & \text{if $k \in \{1,2,3,4,5,6\}$}\\
    0, & \text{otherwise}\\ 
  \end{cases}
\end{equation}
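
\bigskip

Concretely, at the outcome $s = (1,2)$ we have $X(s) = 1$ but $Y(s) = 2$, so $X \neq Y$ as functions. The common pmf comes from a simple count, assuming all $36$ outcomes are equally likely: for each $k \in \{1,\ldots,6\}$,
$$P(X = k) = P(\{(k,1),(k,2),\ldots,(k,6)\}) = \frac{6}{36} = \frac{1}{6}, \qquad P(Y = k) = P(\{(1,k),(2,k),\ldots,(6,k)\}) = \frac{6}{36} = \frac{1}{6}.$$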

\bigskip



\item [1.5.3] \textbf{Cumulative Distribution Function (CDF)}

The cumulative distribution function (cdf) of $X$, abbreviated to distribution function (d.f.) of $X$, (denoted as $F_X$ or $F$ if context is clear) is
defined as $$F_X: \R \to \R$$ where $$F_X(x) = P(X \leq x)~~ \text{for } x \in \R$$

\bigskip

Note in particular the case where $X$ is discrete and takes values $x_1, x_2, x_3,\ldots$ with $x_1 < x_2 < x_3 < \cdots$. Then $F$ is a step function: $F$ is constant on the interval $[x_{i-1}, x_i)$, where it takes the value $p(x_1)+\cdots+p(x_{i-1})$, and then it jumps by $p(x_i)$ at $x_i$.

\bigskip

For example, if $X$ is discrete random variable with pmf given below: $$P(X=1) = \frac{1}{4}, P(X=2) = \frac{1}{2}, P(X=3) = \frac{1}{8}, P(X=4) = \frac{1}{8}$$

then by definition, the cumulative distribution function of $X$ is as follows: 

\begin{equation}
  F(x) = \begin{cases}
    0, & \text{if $x < 1$}.\\
    \frac{1}{4}, & \text{if $1 \leq x < 2$}\\ 
    \frac{3}{4}, & \text{if $2 \leq x < 3$}\\
    \frac{7}{8}, & \text{if $3 \leq x < 4$}\\
    1          , & \text{if $4 \leq x$}\\
  \end{cases}
\end{equation}

\bigskip

The cdf of a discrete random variable takes this form because, for example, for any $x$ with $2 \leq x < 3$, the random variable $X$ takes no value strictly between $2$ and $x$, so $F(x) = P(X \leq x)$ equals $P(X \leq 2) = P(X=1) + P(X=2) = \frac{3}{4}$.

\bigskip

Exercise: Given that $X$ is a random variable such that it represents the number of Heads when you throw $3$ coins. Find the CDF of $X$.

\bigskip

{\color{red} Our sample space is $\Omega = \{HHH, HTT, HHT, HTH, TTT, THH, THT, TTH\}$,
and it is easy to see that $X$ can only take the values $0, 1, 2, 3$. $X = 0$ means that none of the 3 coins shows heads, which happens only for $TTT$, so $P(X = 0) = \frac{1}{8}$; $X=1$ means that exactly one of the 3 coins shows heads, which happens for $HTT, THT, TTH$, hence $P(X = 1) = \frac{3}{8}$; $X = 2$ means that exactly 2 of the 3 coins show heads, so $P(X = 2) = \dfrac{n(\{HHT, HTH, THH\})}{8} = \dfrac{3}{8}$; lastly, $P(X = 3) = \frac{1}{8}$.

\bigskip

Once we have the above which is the PMF of the random variable $X$, we can easily recover the CDF of $X$ as follows:

\begin{equation}
  F(x) = \begin{cases}
    0, & \text{if $x < 0$}.\\
    \frac{1}{8}, & \text{if $0 \leq x < 1$}\\ 
    \frac{1}{2}, & \text{if $1 \leq x < 2$}\\
    \frac{7}{8}, & \text{if $2 \leq x < 3$}\\
    1          , & \text{if $3 \leq x$}\\
  \end{cases}
\end{equation}


}

\newpage

\item \textbf{Useful Calculations}

Theoretically, all probability questions about $X$ can be answered in terms of either the distribution function (cdf) or the probability mass function, since each can be recovered from the other. \bigskip

\begin{enumerate}

\item [1] \textbf{Calculating probabilities from the distribution function}

\begin{enumerate}

    \item [i] $P(a<X\leq b)=F_{X}(b)-F_{X}(a)$ \bigskip
    

{\color{red} Proof: For $a < b$, note that $\{X\leq b\}=\{X\leq a\}\cup\{a<X\leq b\}$, where the two events on the right are disjoint, and it follows by additivity that 

$$P(X\leq b)=P(X\leq a)+P(a<X\leq b)$$


Rearrangement yields the result.}

\bigskip


\item [ii] $P(X<b)=\displaystyle \lim_{n\rightarrow\infty}F\left(b-\frac{1}{n}\right)$ 

{\color{red} Proof: 

\begin{align*}
   P(X<b)&=P\left(\lim_{n\rightarrow\infty}\{X\leq b-\frac{1}{n}\}\right) \\
   &=\lim_{n\rightarrow\infty}P(X\leq b-\frac{1}{n})\\
   &= \lim_{n\rightarrow\infty}F(b-\frac{1}{n})\\
\end{align*}

Note that $P\{X<b\}$ does not necessarily equal $F(b)$ , since $F(b)$ also includes the probability that $X$ equals $b$.}

\bigskip



\item [iii] $P(X=a)=F_{X}(a)-F_{X}(a^{-})$ where $F_{X}(a^{-})=\displaystyle \lim_{x \to a^{-}}F_{X}(x)$ \bigskip


\item [iv] Using the above, we can compute $P(a\leq X\leq b); P(a\leq X<b)$ and $P(a<X<b)$. For example,

\begin{align*}
P(a\leq X\leq b) &= P(X=a)+P(a<X\leq b)\\
&=F_{X}(a)-F_{X}(a^{-})+[F_{X}(b)-F_{X}(a)]\\
&= F_{X}(b)-F_{X}(a^{-})\\
\end{align*}

and similarly for the other two.

\end{enumerate}
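
\bigskip

For completeness, the other two cases work out as follows (a sketch for $a < b$, using (ii) in the form $P(X < b) = F_{X}(b^{-})$):
$$\begin{aligned}
P(a \leq X < b) &= P(X < b) - P(X < a) = F_{X}(b^{-}) - F_{X}(a^{-}), \\
P(a < X < b) &= P(X < b) - P(X \leq a) = F_{X}(b^{-}) - F_{X}(a).
\end{aligned}$$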

\bigskip


\item [2] Calculating probabilities from probability mass function

$$P(A)=\sum_{x\in A}p_{X}(x)$$

\bigskip

\item [3] Calculating the probability mass function from the distribution function

$$p_{X}(x)=F_{X}(x)-F_{X}(x^{-})\ ,\ x\in \mathbb{R}$$

\bigskip

\item [4] Calculating the distribution function from the probability mass function

$$F_{X}(x)=\sum_{y\leq x}p_{X}(y)\ ,\ x\in \mathbb{R}$$

\bigskip


\item [5] Example: The distribution function of the random variable $X$ is given by


\begin{equation}
  F(x) = \begin{cases}
    0, & \text{if $x < 0$}.\\
    \frac{x}{2}, & \text{if $0 \leq x < 1$}\\ 
    \frac{2}{3}, & \text{if $1 \leq x < 2$}\\
    \frac{11}{12}, & \text{if $2 \leq x < 3$}\\
    1          , & \text{if $3 \leq x$}\\
  \end{cases}
\end{equation}


\bigskip

Compute

(a) $P(X<3)$ ,

(b) $P(X=1)$ ,

(c) $P(X>\displaystyle \frac{1}{2})$ ,

(d) $P(2<X\leq 4)$ .

by using the above formulas.

\bigskip

{\color{red}
{\it Solution}:

(a) $P(X<3)=\displaystyle \lim_{n}P\left(X\leq 3-\frac{1}{n}\right)=\lim_{n}F\left(3-\frac{1}{n}\right)=\lim_{n}\frac{11}{12}=\frac{11}{12}.$ \bigskip


(b) $P(X=1)=F(1)-\displaystyle \lim_{n}F\left(1-\frac{1}{n}\right)=\frac{2}{3}-\lim_{n}\frac{1-\frac{1}{n}}{2}=\frac{2}{3}-\frac{1}{2}=\frac{1}{6}.$

\bigskip

(c) $P\left(X>\displaystyle \frac{1}{2}\right)=1-P\left(X\leq\frac{1}{2}\right)=1-F \left(\displaystyle \frac{1}{2}\right)=\frac{3}{4}.$

\bigskip

(d) $P\left(2<X\displaystyle \leq 4\right)=F(4)-F(2)=\frac{1}{12}.$}
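
\bigskip

As a sanity check on these answers (an added check, using formula (iii) from item 1), the jumps of $F$ are
$$P(X=1)=\frac{1}{6},\qquad P(X=2)=\frac{11}{12}-\frac{2}{3}=\frac{1}{4},\qquad P(X=3)=1-\frac{11}{12}=\frac{1}{12},$$
while on $0\leq x<1$ the function $F$ rises continuously from $0$ to $\frac{1}{2}$, so that $P(0<X<1)=F(1^{-})-F(0)=\frac{1}{2}$. These pieces account for all of the probability:
$$\frac{1}{2}+\frac{1}{6}+\frac{1}{4}+\frac{1}{12}=1.$$
In particular, $P(2<X\leq 4)=P(X=3)=\frac{1}{12}$ and $P(X<3)=1-P(X=3)=\frac{11}{12}$, in agreement with (d) and (a).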


\end{enumerate}

\newpage


\section{Probability Distribution Function for Continuous Random Variable}

To make the notion of a continuous random variable (CRV) precise, we need a new definition. It is natural to extend our intuition from the PMF and the CDF to an analogous object for a CRV. This object is called the probability density function (PDF), and it will be used widely throughout these notes.


\bigskip


\item [1.6] \textbf{Distribution of CRV}

We often see the phrase ``the probability distribution of this random variable is''. Here we give an intuitive example of how a continuous distribution arises from histograms.

\bigskip

The masses of 200 six-month-old babies are collected. To visualise the distribution of the mass of a six-month-old baby (which is a continuous random variable), the data are grouped into 5 classes with a class width of 1 kg each. The frequency table and the corresponding histogram (Figure 1.2) are shown below. \bigskip

\begin{tabu} to 1\textwidth { | X[c] | X[c] | X[c] | X[c] | X[c] | X[c] |}
 \hline
 x, Mass (kg)  & $5 \leq x < 6$ & $6 \leq x < 7$ & $7 \leq x < 8$ & $8 \leq x < 9$ & $9 \leq x < 10$\\
 \hline
 Frequency, f  & 20  & 48 & 80 & 36 & 16  \\
\hline
\end{tabu}

\bigskip


\begin{figure}[h!]
  \centering
  \includegraphics[width=0.5\textwidth]{phd_template/images/dist1.PNG}
  \caption{Frequency Histogram}
\end{figure}
\bigskip


\newpage

We then plot the \textbf{relative frequency table and histogram of the above}. \bigskip


\begin{tabu} to 1\textwidth { | X[c] | X[c] | X[c] | X[c] | X[c] | X[c] |}
 \hline
 x, Mass (kg)  & $5 \leq x < 6$ & $6 \leq x < 7$ & $7 \leq x < 8$ & $8 \leq x < 9$ & $9 \leq x < 10$\\
 \hline
 Relative Frequency, f/200  & 0.1  & 0.24 & 0.4 & 0.18 & 0.08  \\
\hline
\end{tabu}

\bigskip

\begin{figure}[h!]
  \centering
  \includegraphics[width=0.5\textwidth]{phd_template/images/dist2.PNG}
  \caption{Relative Frequency Histogram}
\end{figure}
\bigskip

In a relative frequency histogram, we have 
\begin{enumerate}
    \item [a)] the heights of the rectangles are the relative frequencies,
    \bigskip
    
    \item [b)] the area of each rectangle is the probability of the variable lying within the respective class interval,
    
    \bigskip
    
    \item [c)] the total area of the rectangles is equal to 1.
    
    \end{enumerate}
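
\bigskip

As a quick check with the relative frequency table above: each class has a width of $1$ kg, so the area of each rectangle equals its height, and the total area is
$$1\times(0.1+0.24+0.4+0.18+0.08)=1.$$
For instance, the probability that a randomly selected baby has a mass $x$ with $7\leq x<9$ would be estimated by the combined area of two rectangles, $0.4+0.18=0.58$.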
    
    \bigskip
    
    
A relative frequency histogram is preferred to a histogram (which only shows absolute
frequencies rather than relative ones) because the area of each rectangle gives the \textbf{probability} of the mass of a randomly selected six-month-old baby lying within that respective class interval. To give a better representation of the distribution of the masses, the data can be further grouped into more classes. The relative frequency histogram of the masses when $k = 20$ is shown in Figure 1.4 below.

\bigskip

\begin{figure}[h!]
  \centering
  \includegraphics[width=0.5\textwidth]{phd_template/images/dist3.PNG}
  \caption{Relative Frequency Histogram with $k = 20$}
\end{figure}
\bigskip

The idea is simple: instead of finding the relative frequency for the classes $5$ to $6$, $6$ to $7$, \ldots, we make the picture more precise by finding the relative frequency for the narrower classes $5$ to $5.25$, $5.25$ to $5.5$, \ldots, $9.75$ to $10$. This gives a more accurate representation of the actual distribution. In general, a relative frequency histogram with $k$ classes has a class width of $\frac{5}{k}$ kg, where $k$ is a positive integer. \bigskip

This idea is similar to the one behind integration, so we ask what happens as $k$ tends to infinity. Imagine drawing a relative frequency histogram with infinitely many bins, so that every real value carries some representation of the probability. This picture is not rigorous, but it gives useful intuition. In particular, it is interesting to note that for a continuous random variable $P(X = x) = 0$ for every single value $x$. To be precise, as $k$ increases to infinity, the histogram (with the rectangle heights scaled so that the area of each rectangle is the relative frequency of its class) approaches the graph of some function $f(x)$, a continuous curve, such that the area under this curve equals one. This is because the total area of the rectangles in the relative frequency histogram is equal to 1 for every value of $k$. It is therefore reasonable that the area under the curve $y = f(x)$, being the limit of a sequence of these relative frequency histograms, is also equal to 1. The function $f(x)$ that describes the distribution of the mass of a six-month-old baby is called a probability density function (p.d.f.).

\bigskip

This leads to the definition of a probability density function (PDF) of a CRV $X$, the continuous analogue of the PMF: it is a function $f$ such that $$P(a \leq X \leq b) = \int_{a}^{b}f(x)\, dx$$ for all $a \leq b$, and $f$ satisfies the following:

1) $f(x) \geq 0$ for all real values of $x$;
\bigskip

2) $\int_{-\infty}^{\infty} f(x) dx = 1$.
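
\bigskip

As an illustration, consider the function $f(x)=3x^{2}$ for $0\leq x\leq 1$ and $f(x)=0$ otherwise. This particular $f$ is a hypothetical example chosen for easy integration; it is not the density of the baby masses. It is non-negative, and
$$\int_{-\infty}^{\infty}f(x)\,dx=\int_{0}^{1}3x^{2}\,dx=\left[x^{3}\right]_{0}^{1}=1,$$
so both conditions hold. A probability is then an area under the curve, for example
$$P\left(\tfrac{1}{4}\leq X\leq\tfrac{1}{2}\right)=\int_{1/4}^{1/2}3x^{2}\,dx=\frac{1}{8}-\frac{1}{64}=\frac{7}{64}.$$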

\bigskip

Note in particular that, if $f$ is a continuous curve, the probability that $X$ lies between $a$ and $b$ is just the area under the curve over that interval, which we obtain by integrating $f$ from $a$ to $b$. This is illustrated in the diagram below.


\begin{figure}[h!]
  \centering
  \includegraphics[width=0.5\textwidth]{phd_template/images/dist.PNG}
  \caption{Probability Distribution KDE}
\end{figure}

\bigskip

\textbf{Given a probability density function $f_X$ of a random variable $X$, we can answer all probability questions about the random variable $X$}.

\bigskip

For example, suppose we want to find the CDF of a continuous random variable $X$, i.e., find $$F_X(x) = P(X \leq x) ~, \text{ for } x \in \R$$
\bigskip

For a continuous random variable, we have $$F_X(x) = \int_{-\infty}^{x} f_X(t)\, dt , ~x \in \R$$ 

It also follows by the Fundamental Theorem of Calculus that $$F_X'(x) = f_X(x)$$ at every $x \in \R$ where $f_X$ is continuous, which means that the PDF is the derivative of the CDF of a continuous random variable $X$. 
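
\bigskip

For instance, assuming the illustrative density $f_X(t)=3t^{2}$ for $0\leq t\leq 1$ and $f_X(t)=0$ otherwise (a made-up example, not one derived from the data above), the integral formula gives
$$F_X(x)=\begin{cases}
0, & x<0\\
\int_{0}^{x}3t^{2}\,dt=x^{3}, & 0\leq x\leq 1\\
1, & x>1
\end{cases}$$
and differentiating on $0<x<1$ returns $F_X'(x)=3x^{2}=f_X(x)$, as expected.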

\bigskip

A more intuitive interpretation of the PDF can be seen by manipulating $$P(a \leq X \leq b) = \int_{a}^{b}f(x)\, dx$$ so that $$P\left(x-\frac{\epsilon}{2} \leq X \leq x+\frac{\epsilon}{2}\right) = \int_{x-\epsilon/2}^{x+\epsilon/2}f(t)\,dt \approx \epsilon f(x)$$

when $\epsilon$ is small and $f$ is continuous at $x$. This means that the probability that $X$ lies in an interval of length $\epsilon$ around the point $x$ is approximately $\epsilon f(x)$. From this result we see that $f(x)$ is a measure of how likely it is that the random variable will be near $x$.
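
\bigskip

A quick numerical check of this approximation, using the hypothetical density $f(t)=3t^{2}$ on $0\leq t\leq 1$ (an assumption made purely for illustration): take $x=0.5$ and $\epsilon=0.1$. Then
$$P(0.45\leq X\leq 0.55)=\int_{0.45}^{0.55}3t^{2}\,dt=0.55^{3}-0.45^{3}=0.07525,$$
while $\epsilon f(x)=0.1\times 3(0.5)^{2}=0.075$, so the two values already agree closely for this choice of $\epsilon$.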





\end{enumerate}

\newpage



