# (S.17) Permutation and Combination

## Sets

1. **Set**: *a well-defined collection of objects*.
    - **Math notation:**
        - Define a set as $S$ 
        - If $x$ is an element of set $S$:
            - $x \in S$.
        - If $x$ is not an element of set $S$:
            - $x\notin S$.
        
2. **Subset**: set $T$ is a subset of set $S$ if *every element* in set $T$ is also in set $S$. 

    - **Math Notation**
        - $T$ is a subset of $S$.   
            - $T \subset S$

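        - $T$ and $S$ can be the SAME set (every set is a subset of itself)
        - If $T \neq S$ but $T \subset S$, then $T$ is called a 'proper subset' of $S$
            - To make the distinction explicit, write $T \subsetneq S$ for a proper subset and $T \subseteq S$ when $T$ may equal $S$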

3. **Universal Set**: The collection of all possible outcomes in a certain context or universe.

    - **Math Notation**
        - Universal set denoted by Omega ($\Omega$)
        - Can have infinite # of elements (e.g. the set of all real numbers!)
        - Example: all possible outcomes of a 6-sided die:
            - $\Omega = \{1,2,3,4,5,6\}$

4. **Set Operations**
    1. **Union** $S \cup T$
        - All unique elements found in either set (in $S$, in $T$, or in both).
    2. **Intersection** $S \cap T$.
        - All elements of $S$ that also belong to $T$. 
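    3.  **Relative complement / difference** 
        - The relative complement of $S$ in $T$ (written $T\backslash S$) is all the elements of $T$ that are NOT in $S$.
            - Example, assuming $S = \{3,6,9,12\}$ and $T = \{2,4,6,8\}$:
                - relative complement of $S$ (or $T\backslash S$) is $\{2,4,8\}$.
                - relative complement of $T$ (or $S\backslash T$) is $\{3,9,12\}$.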
    4. **Absolute complement**
        - The absolute complement of $S$, with respect to the Universal set $\Omega$, is the collection of the objects in $\Omega$ that don't belong to $S$.
        - Denoted as $S'$ or $S^c$.
        
    5. <u>NOTE:</u> Inclusion-exclusion principle
        - The size of the union of two finite sets is the sum of their sizes minus the size of their intersection, so that shared elements are not counted twice:

            - $\mid S \cup T \mid = \mid S \mid + \mid T \mid - \mid S \cap T \mid $
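
The set operations above map directly onto Python's built-in `set` type. The sketch below is only an illustration: the sets `S`, `T` and the universal set `omega` are made-up values.

```python
# Hypothetical example sets, chosen only for illustration.
S = {3, 6, 9, 12}
T = {2, 4, 6, 8}
omega = set(range(1, 13))     # assumed universal set: the integers 1..12

union = S | T                 # elements in S, in T, or in both
intersection = S & T          # elements in both S and T
t_minus_s = T - S             # relative complement of S in T
s_minus_t = S - T             # relative complement of T in S
complement_s = omega - S      # absolute complement of S with respect to omega

print(union, intersection, t_minus_s, s_minus_t, complement_s)

# Inclusion-exclusion: |S ∪ T| = |S| + |T| - |S ∩ T|
print(len(union) == len(S) + len(T) - len(intersection))

# Subset (<=) versus proper subset (<)
print({3, 6} <= S, {3, 6} < S, S <= S, S < S)
```

`<=` tests for a subset and `<` for a proper subset, so `S <= S` is true while `S < S` is false.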

 ## Probability
 1. **Law of relative frequency**
    - As the number of trials grows without bound, the relative frequency of an event converges to a fixed number, its probability.
        - $$P(E) = \lim_{n\to\infty}\frac{S(n)}{n}$$
    - $P(E)$ is the probability of event $E$; $S(n)$ is the number of successful outcomes in $n$ trials
    - A given outcome is successful if it is in the event space 
 2. **Sample Space:**
     - All possible outcomes for a given experiment. 
 3. **Event space:**
     - Those outcomes in the sample space that satisfy some criteria
         - Always a subset of the sample space $S$
 4. Probability Axioms
 
    1.  Positivity: a probability always satisfies $0 \leq P(E) \leq 1$

    2.  Probability of a certain event: $P(S)=1$

    3.  Additivity: the probability of the union of 2 mutually exclusive events is the sum of the probabilities of the individual events<br>
        - If $A\cap B = \emptyset $, then $P(A\cup B) = P(A) + P(B)$
    4. Addition Law:
        - The probability of the union of $A$ and $B$ is the sum of the individual probabilities minus the probability of the intersection
$$P(A\cup B) = P(A) + P(B) - P(A \cap B)$$
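
As a quick check of the addition law, here is a minimal sketch using the 6-sided die sample space from earlier. The events `A` and `B` and the helper `prob` are illustrative choices, and the outcomes are assumed to be equally likely.

```python
from fractions import Fraction

omega = {1, 2, 3, 4, 5, 6}   # sample space of one fair die roll
A = {2, 4, 6}                # event: the roll is even
B = {4, 5, 6}                # event: the roll is greater than 3

def prob(event):
    """Probability of an event when all outcomes are equally likely."""
    return Fraction(len(event & omega), len(omega))

lhs = prob(A | B)                        # P(A ∪ B)
rhs = prob(A) + prob(B) - prob(A & B)    # P(A) + P(B) - P(A ∩ B)
print(lhs, rhs)
```

Both sides come out to the same fraction because the outcomes 4 and 6, which belong to both events, are subtracted once.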


## Permutations

### Permutations of a subset

How many ways can we select k elements out of a pool of n objects when the order of selection matters?
- $k$-permutation of $n$:

$n \cdot (n-1) \cdots (n-k+1)$ or in other words, $P_{k}^{n}= \dfrac{n!}{(n-k)!}$

### Permutations with replacement 

The \# of possible options doesn’t change between draws, so $n$ is raised to the power of $j$, the number of draws from the population:<br><br>
$n^j$

### Permutations with repetition

The \# of permutations of *n* objects with $n_1$ identical objects of type 1, $n_2$ identical objects of type 2, ..., and $n_k$ identical objects of type $k$ is<br>
$\dfrac{n!}{n_1! \, n_2! \cdots n_k!}$

## Combinations
- How many ways can we create a subset of size $k$ out of $n$ objects? 
    - Unordered
$$\displaystyle\binom{n}{k} = \dfrac{P_{k}^{n}}{k!}=\dfrac{ \dfrac{n!}{(n-k)!}}{k!} = \dfrac{n!}{(n-k)!k!}$$
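
These counts can be checked with Python's standard `math` module (`math.perm` and `math.comb` need Python 3.8 or later). The values of `n` and `k` below are arbitrary.

```python
import math

n, k = 5, 3

ordered = math.perm(n, k)       # k-permutations: n! / (n-k)!
with_replacement = n ** k       # ordered draws with replacement: n^k
unordered = math.comb(n, k)     # combinations: n! / ((n-k)! k!)

print(ordered, with_replacement, unordered)  # 60 125 10

# Each unordered subset of size k corresponds to k! orderings.
print(unordered == ordered // math.factorial(k))
```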


# (S.19)  Central Limit Theorem

## Central Limit Theorem

The means of repeated samples will approximately follow a normal distribution once the sample size is large enough,
even if the population from which the samples are drawn is not normal.
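
A small simulation makes this concrete. The sketch below uses only the standard library and assumes an exponential population with rate 1 (mean 1, standard deviation 1, strongly skewed). The sample size, number of samples and seed are arbitrary.

```python
import random
import statistics

rng = random.Random(0)          # fixed seed so the run is repeatable
n, n_samples = 50, 2000         # sample size and number of samples

# Draw many samples from the skewed population and keep each sample's mean.
sample_means = [
    statistics.fmean(rng.expovariate(1.0) for _ in range(n))
    for _ in range(n_samples)
]

mean_of_means = statistics.fmean(sample_means)
sd_of_means = statistics.stdev(sample_means)

print("mean of sample means:", round(mean_of_means, 3))
print("sd of sample means:  ", round(sd_of_means, 3))
print("sigma / sqrt(n):     ", round(1 / n ** 0.5, 3))
```

The sample means should cluster around the population mean with a spread close to $\sigma/\sqrt{n}$, and a histogram of `sample_means` should look roughly bell-shaped even though the population is not.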

In [None]:
import seaborn as sns
sns.histplot(data, kde=True)  # Draws a histogram and KDE (replaces the deprecated sns.distplot)

In [None]:
scipy.stats.normaltest(data)  # Tests the null hypothesis that data comes from a normal distribution

In [None]:
# Generate a normal distribution with mean, sd given
male_height = scipy.stats.norm(mean, sd)
# The result male_height is a SciPy rv object which represents a normal continuous random variable.

In [None]:
# or use this to generate actual numbers in one go
stats.norm.rvs(loc=mean, scale=std_dev, size=n_samples)  # n_samples = number of samples to draw

<strong> Create numpy arrays of data points that have a distribution of a given rv object</strong>

In [None]:
mean = rv.mean()
std = rv.std()

In [None]:
# Use numpy to generate 100 evenly spaced numbers over the interval mean ± 4 sd.
xs = np.linspace(mean - 4*std, mean + 4*std, 100)

In [None]:
# Evaluate the probability density of the normal distribution at each x. 
ys = rv.pdf(xs)

In [None]:
# use stats.t.pdf to get values on the probability density function for the t-distribution
# the second argument is the degrees of freedom
ys = stats.t.pdf(xs, df, 0, 1)

## Sampling Statistics

<ol>

<li><b>Sample Distribution:</b></li>  

>The distribution of the data points within a single sample.

<li><b>Sampl<i><u>ing</u></i> Distribution:</b></li>
    
>The probability distribution of a statistic (for example, the sample mean) over repeated samples

<li><b>Sampling distribution of the sample mean:</b></li>

>A distribution of the means of many samples drawn from the population. For large enough samples this is approximately normal. 

<li><b>Sampling Error:</b></li>
    
> The difference between the sample mean and the population mean.<br><ul><li>This tends to decrease as sample size increases</li></ul>


<li><b>Standard error of the sample mean</b></li>  
> The standard deviation of the sampling distribution of the sample mean. It is given by

 > $$ \sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}} \approx \frac{s}{\sqrt{n}}$$

</ol>

## Confidence Interval

1. <b>Definition</b>
    - A range of values
    - above and below the point estimate
    - that captures the true population parameter
    - at some predetermined confidence level.
    <br><br>
We calculate a confidence interval by taking a point estimate and then adding and subtracting a margin of error to create a range <br><br>
    
2. <b> Margin of Error</b><br><br>
    
    1. <u><font size="+2">Pop (σ)  </font><b>is known</b></u>:<br>
        - use <b> z critical value </b><br><br>
            - Definition of z-critical value:
                - The number of standard deviations you'd have to go 
                - from the mean of the normal distribution 
                - to capture the proportion of the data 
                - associated with the desired confidence level
                    - For instance, roughly 95% of the data in a normal distribution lies within 2 standard deviations of the mean (more precisely, 1.96), so we could use 2 as an approximate z-critical value for a 95% confidence interval<br><br>
            - get z-critical values with:<br><br>
                - <mark>stats.norm.ppf(q, loc=0, scale=1)</mark> q = percentile point i.e. 0.95
                     - Note that we use stats.norm.ppf(q = 0.975) to get the desired z-critical value instead of q = 0.95 because the distribution has two tails.<br><br>
        - <font size="+2">MoE = z ∗ σ / √n</font><br><br>
            - <font size="+1">σ (sigma)</font> is the population standard deviation
            - <u>n</u> is sample size
            - <u>z</u> is the z-critical value
<br><br>
    2. <u><font size="+2">Pop (σ)  </font><b>is <u>NOT</u> known</b></u>:<br>
        - use a <b> t-critical </b> value
            - This is found on a <b>t-distribution</b>, which is like a normal distribution but with heavier tails
            - The shape of a t-distribution is determined by a parameter known as <b>degrees of freedom</b>. 
                - This is essentially determined by the size of the sample (dof = n - 1 for a single sample)
                - the higher the dof (i.e. the larger the sample), <br> the closer the t-distribution is to a normal distribution <br><br>
                
        - <font><b>Confidence interval</b> =  $\bar{x}\pm t_{\alpha/2,n-1}\left(\dfrac{S}{\sqrt{n}}\right)$, where the <b>MoE</b> is the term after the $\pm$</font><br><br>
            - $\bar{x}$  is the sample mean
            - $t_{\alpha/2,n-1}$ is the <b> t - critical value</b><br><br>
            - $\dfrac{S}{\sqrt{n}}$ is the estimated standard error of the mean, with $S$ the sample standard deviation
            - <u>n</u> is sample size
            - <u>$\alpha$</u> is the significance level, i.e. 1 minus the confidence level (0.05 for a 95% interval) <br><br>
            
        - <b>Finding interval with scipy</b>     
               
 

In [None]:
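# Min and max of the confidence interval
stats.t.interval(0.95,                                        # Confidence level (keyword is `confidence` in recent SciPy, `alpha` in older versions)
                 df=len(sample_chol_levels) - 1,              # Degrees of freedom
                 loc=x_bar,                                   # Sample mean
                 scale=s / np.sqrt(len(sample_chol_levels)))  # Standard error of the mean, assuming s is the sample standard deviation

                - returns tuple of lower and upper bound of the confidence interval calculated using the t-distribution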
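
For the case where σ is known, the margin of error MoE = z ∗ σ / √n can be sketched with the standard library alone. `statistics.NormalDist().inv_cdf` plays the role of `stats.norm.ppf` here. The measurements and the value of `sigma` are made up for illustration.

```python
from statistics import NormalDist, fmean

sample = [172, 168, 181, 175, 169, 178, 174, 171]  # made-up measurements
sigma = 5.0            # assumed known population standard deviation
confidence = 0.95

n = len(sample)
x_bar = fmean(sample)  # point estimate

# Two tails: (1 - confidence) / 2 goes in each tail, so q = 0.975 for 95%.
z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
moe = z * sigma / n ** 0.5
ci = (x_bar - moe, x_bar + moe)

print(round(z, 3), round(moe, 3), ci)
```

When σ is not known, the same recipe applies with the t-critical value and the sample standard deviation in place of `z` and `sigma`.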

4. <b> Poisson Dist Code</b>
    * The Poisson distribution is the discrete probability distribution of the number of events occurring in a given time period, given the average number of times the event occurs over that time period. We shall use a Poisson distribution to construct a bimodal distribution.

In [None]:
pop_ages =stats.poisson.rvs(loc=18, mu=35, size=150000)

<b>QUESTION</b>: 

1. Can you explain the difference between the interpretations more? Why can't you say 95% probability? <br>
2. T-dist: where is this confidence interval equation coming from? $\dfrac{\bar{X}-\mu}{\sigma/\sqrt{n}}\sim N(0,1)$
3. What exactly is $t_{\alpha/2,n-1}$? Is this the t-critical value for two tails? Is this what we get from the t-distribution? Is this similar to z-critical values, i.e. the number of standard deviations from the mean on the t-distribution we have to go?

# (S.20) Hypothesis Testing

## p-vals and null hypothesis

1. <b>Null Hypothesis</b>:
    - There is no relationship between A and B<br><br>
2. <b> Alternative Hypothesis</b>
    - The hypothesis traditionally thought of when creating a hypothesis for an experiment, i.e. some effect has occurred<br><br>
3. **_p-value_**: 
    - The probability of observing a test statistic at least as extreme as the one observed, by random chance, assuming that the null hypothesis is true.<br><br>

4. $\alpha$ **_(alpha value)_**: 
    - The marginal threshold at which you're okay with rejecting the null hypothesis.<br><br>
    - <b>If $p < \alpha$, then you reject the null hypothesis</b> <br><br>

## Effect Size

<b> Effect Size </b>is used to quantify the size of the difference between two groups under observation <br><br>
There are a few ways of making this comparison

1. <b><u> Un-standardized or Simple Effect</u></b>

    - Look at the **[ difference in means ]** between the two populations
        - <b>CAVEATS</b>:
            - Without knowing more about the distributions (like the standard deviations or spread of each distribution), it's hard to interpret whether a difference like 15 cm is a big difference or not.
            - The magnitude of the difference depends on the units of measure, making it hard to compare across different studies that may be conducted with different units of measurement.
            
    - There are a number of ways to quantify the difference between distributions. 
        - A simple option is to express the difference as a percentage of the mean.
            - But a problem with relative differences is that you have to choose which mean to express them relative to
                - Express as % of pop A or pop B?
                
                
2. <b><u> Overlap Threshold </u></b>
    - Choose a threshold between the two means. 
        - The simple threshold is the midpoint between the means
    - How many members of pop b are below threshold? = <b> b_below </b>
    - How many of pop A are above threshold? = <b> a_above </b>
    - Calculate overlap (Area under both curves)
        - <b> b_below/len(b)   +  a_above/len(a) </b>
        - divide by 2 to report fraction of misclassified data points
        
<br>
3. <b><u> Probability of superiority </u></b><br><br>
   probability that "a randomly-chosen B is greater than a randomly-chosen A"
   
   - sum(x > y for x, y in zip(b_sample, a_sample)) / len(b_sample)



4. <b><u> Standardized effect size </u></b><br>
An effect size statistic that divides the effect size by some standardizer, i.e. standard deviation:<br>
    <b>Effect Size / Standardiser</b><br><br>

    1. <b>Cohen's D</b> <br>
    One type of standardized effect size statistic.<br>
    
        -   <u>Formula:</u><br>**$d$ = effect size (difference of means) / pooled standard deviation** <br>
            - <b>Pooled standard deviation</b>: Weighted average of the standard deviations of the two groups
                -  ( (group1.var * len(group1)) + (group2.var * len(group2)) )  /  (len(group1) + len(group2))
                    - This is the variance. Take sqroot to get the pooled std
        - <b>Interpretation</b>:<br>
        We use these rules of thumb to assess meaning of magnitude of d
            - Small effect = 0.2
            - Medium Effect = 0.5
            - Large Effect = 0.8     
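
A minimal sketch of Cohen's d, following the size-weighted pooled variance described above and using sample variances from the standard library. The function name `cohens_d` and the two groups are illustrative. Other references pool with `n - 1` weights instead, which gives slightly different values for unequal groups.

```python
from statistics import fmean, variance

def cohens_d(group1, group2):
    """Difference of means divided by the pooled standard deviation."""
    n1, n2 = len(group1), len(group2)
    diff = fmean(group1) - fmean(group2)
    # Size-weighted average of the two sample variances.
    pooled_var = (variance(group1) * n1 + variance(group2) * n2) / (n1 + n2)
    return diff / pooled_var ** 0.5

a = [2, 4, 6, 8]   # made-up data
b = [1, 3, 5, 7]
d = cohens_d(a, b)
print(round(d, 3))
```

By the rules of thumb above, the value printed here falls between a small and a medium effect.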




## T-tests

Just as you can use t distribution to provide confidence intervals for estimating the population mean you can also use it to test whether two populations are different, statistically speaking.

Use a t-test if:
- You don’t know the population standard deviation 
- You have a small sample size



1. <b><u> One Sample T-Test</u></b><br>
    Used to determine whether<br>
    a sample of observations could have been generated<br> 
    by a process with a specific mean
    - For example, you might want to know how your sample mean compares to the population mean
        - Does the sample come from the population?(QUESTION)
        
    - <b>Formula</b> to find t-value of sample 
    $$t = \frac{\bar{x}-\mu}{\frac{s}{\sqrt{n}}}$$
    
    - To find <b>t - critical value</b>
        - scipy.stats.t.ppf(1-alpha, df) for a one-sided test (use 1-alpha/2 for a two-sided test)
            - df is degrees of freedom, i.e. sample size minus 1
    - <u>stats method for conducting a one-sample t-test</u>
        - scipy.stats.ttest_1samp(a, popmean, axis=0, nan_policy='propagate')
            - a = array of sample observations
            -  This function returns the t-value and (two-sided) p-value for the sample
    - if t-val is greater than t-crit,
        - the null hypothesis that the sample mean is the same as the pop mean can be rejected
        
        

2. <b><u> Two Sample T-Test</u></b><br>
    used to determine if two population means are equal<br>
    Two main types:
    - <u>Paired test</u><br>
    the individual items/people in the sample will remain the same and researchers are comparing how they change after treatment
    - <u>Unpaired (Independent) test</u>
    comparing two different, unrelated samples to one another.
       
        - <mark><a href = "https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_ind.html">stats.ttest_ind(experimental, control)</a></mark>
            - Calculates the t-test for the means of *two independent* samples of scores.

            This is a two-sided test for the null hypothesis that 2 independent samples
            have identical average (expected) values. 
            - returns t statistic and pvalue
        - Assumes that the populations the samples have been drawn from have the same variance

    <br><br>
    - <b> Formulas </b><br>
        t statistic:
         $$\large t = \frac{\bar{x}_{1} - \bar{x}_{2}}{\sqrt{s^{2}_{p} (\frac{1}{n_{1}} + \frac{1}{n_{2}}) }    }  $$ <br>
            
         Where $s^{2}_{p}$ is the pooled sample variance, calculated as:

       $$\large s^{2}_{p}  = \frac{(n_{1} -1)s^{2}_{1} +  (n_{2} -1)s^{2}_{2}}{n_{1} + n_{2} - 2}  $$

        Where $s^{2}_{1}$ and $s^{2}_{2}$ are the variances for each sample given by the formula 
        $$ \large s^{2} = \frac{\sum_{i=1}^{n}(x_{i} - \bar{x})^{2}}{n-1} $$
        
    
            

<br>
3. <b><u>Assumptions</u></b>

- sample observations have numeric and continuous values
- sample observations are independent from each other
- the samples have been drawn from normal distributions
- if unpaired two-sample test:
    - the populations the samples have been drawn from have the same variance
- if paired two-sample test:
    - the differences between the paired observations are normally distributed.
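
The two t-statistic formulas above can be written out directly. This is a sketch with the standard library only. The function names and data are illustrative, and it computes the statistics but not the p-values, which need a t-distribution such as the one in `scipy.stats`.

```python
from statistics import fmean, variance

def one_sample_t(sample, mu):
    """t = (x_bar - mu) / (s / sqrt(n)), with s the sample standard deviation."""
    n = len(sample)
    return (fmean(sample) - mu) / (variance(sample) ** 0.5 / n ** 0.5)

def two_sample_t(x1, x2):
    """Unpaired t statistic using the pooled sample variance."""
    n1, n2 = len(x1), len(x2)
    pooled_var = ((n1 - 1) * variance(x1) + (n2 - 1) * variance(x2)) / (n1 + n2 - 2)
    return (fmean(x1) - fmean(x2)) / (pooled_var * (1 / n1 + 1 / n2)) ** 0.5

experimental = [5, 7, 9, 11]   # made-up data
control = [2, 4, 6, 8]

t_one = one_sample_t(experimental, mu=6)
t_two = two_sample_t(experimental, control)
print(round(t_one, 4), round(t_two, 4))
```

`statistics.variance` uses the `n - 1` denominator, matching the formula for $s^2$ above.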

## Type 1 and Type 2 Errors

1. <b><u>Type I error:</u></b>
    - rejecting a null hypothesis when it should not have been rejected.
        - The significance level $\alpha$ (1 minus the confidence level) is the probability that you reject the null hypothesis when it is actually true.
    - commonly known as a <b><u>False Positive</u></b>
        
2. <b><u>Type II error:</u></b>
    
    - failing to reject the null hypothesis when it is actually false
        - aka <u>**False Negatives**</u>; the probability of this error is $\beta$
    - related to something called _Power_,
        - Power = the probability of rejecting the null hypothesis given that it actually is false. 
            - _Power_ = 1 - $\beta$
        
        
        
    
    

<img src="./images/Type1Type2Chart.png">

# (S.21) Statistical Power and ANOVA

## Statistical Power

- Definition:
    - the probability of rejecting the null hypothesis, given that it is indeed false.
        - This is a conditional probability: 
            - P(A) is the probability that you reject the null hypothesis, 
            - P(B) is the prob that the null is actually false, 
            - P(A|B) is the prob that you reject the null hypothesis given that it is actually false. 
            
- For any given 𝛼: 
    - 𝑝𝑜𝑤𝑒𝑟=1−𝛽.
        - 𝛼 = probability of a Type I error i.e. a false positive
        - 𝛽 = probability of a Type II error i.e. a false negative

- The more power a statistical test has,
    - the higher the chance that it will affirm the alternative hypothesis (by rejecting the null hypothesis) when the null is false
    - but if the extra power comes from raising 𝛼, it will also lead to more false positives. 
    
- To increase the power of a test we can:
    1. increase 𝛼:
        - This means that we will increase the tolerance for false positives i.e. we may reject the null for samples that have a relatively higher likelihood of actually being from the distribution of the null hypothesis.
    2. increase the sample size (n):
        - Ideal, but not always feasible
        - makes the sampling distributions under the null and alt hyps narrower,<br>
        reducing the overlap between the two curves 
    3. increase the effect size:
        - i.e. posit a greater difference between the null hypothesis and the alternative hypothesis.
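
The three levers can be illustrated with a simplified power calculation. The sketch below assumes a one-sided, one-sample z-test with known σ. That is a simpler setting than the independent t-test discussed in these notes, so its numbers will not match `statsmodels`. The function name `z_test_power` is illustrative.

```python
from statistics import NormalDist

def z_test_power(effect_size, n, alpha):
    """Power of a one-sided, one-sample z-test.

    effect_size is the standardized shift (difference in means / sigma).
    """
    z_crit = NormalDist().inv_cdf(1 - alpha)
    # Under the alternative, the test statistic is shifted by effect_size * sqrt(n).
    return 1 - NormalDist().cdf(z_crit - effect_size * n ** 0.5)

base = z_test_power(effect_size=0.2, n=80, alpha=0.05)
print(round(base, 3))
print(z_test_power(0.2, 80, 0.10) > base)    # larger alpha
print(z_test_power(0.2, 160, 0.05) > base)   # larger sample size
print(z_test_power(0.5, 80, 0.05) > base)    # larger effect size
```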

<img src="./images/PowerDiagram.png">


See how increasing power when the effect size is small is difficult and requires larger sample sizes. This shows how difficult it is to detect small departures from the null hypothesis.

In [None]:
# Calculating and plotting the power of a given independent t-test:

import numpy as np
from statsmodels.stats.power import TTestIndPower, TTestPower
power_analysis = TTestIndPower()
power_analysis.plot_power(dep_var='nobs',
                          nobs = np.array(range(5,1500)),
                          effect_size=np.array([.05, .1, .2,.3,.4,.5]),
                          alpha=0.05)

# nobs = number of observations (goes on x axis)
# each value in effect_size is drawn as a separate curve

- you can also calculate specific values. Simply don't specify one of the four parameters.

In [None]:
# Calculate power
power_analysis.solve_power(effect_size=.2, nobs1=80, alpha=.05) # power not specified

## Welch's t-test

$ t = \frac{\bar{X_1}-\bar{X_2}}{\sqrt{\frac{s_1^2}{N_1} + \frac{s_2^2}{N_2}}} = \frac{\bar{X_1}-\bar{X_2}}{\sqrt{se_1^2+se_2^2}}$

where $\bar{X_i}$, $s_i^2$, and $N_i$ are the sample mean, sample variance, and sample size, respectively, for sample $i$.

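### Degrees of freedom

$ v \approx \frac{\left( \frac{s_1^2}{N_1} + \frac{s_2^2}{N_2}\right)^2}{\frac{s_1^4}{N_1^2v_1} + \frac{s_2^4}{N_2^2v_2}} $

where $v_i = N_i - 1$ is the degrees of freedom associated with the variance estimate of sample $i$.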
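
Both Welch formulas translate directly into code. This sketch uses the standard library and assumes $v_i = N_i - 1$. The function name `welch_t` and the data are illustrative.

```python
from statistics import fmean, variance

def welch_t(x1, x2):
    """Return Welch's t statistic and its approximate degrees of freedom."""
    n1, n2 = len(x1), len(x2)
    se1_sq = variance(x1) / n1      # s_1^2 / N_1
    se2_sq = variance(x2) / n2      # s_2^2 / N_2
    t = (fmean(x1) - fmean(x2)) / (se1_sq + se2_sq) ** 0.5
    # (s_1^2/N_1)^2 / v_1 is the same quantity as s_1^4 / (N_1^2 v_1)
    v = (se1_sq + se2_sq) ** 2 / (se1_sq ** 2 / (n1 - 1) + se2_sq ** 2 / (n2 - 1))
    return t, v

t, v = welch_t([5, 7, 9, 11], [2, 4, 6, 8])   # made-up data
print(round(t, 4), round(v, 2))
```

With equal sample sizes and equal sample variances, as in this toy data, the degrees of freedom reduce to $N_1 + N_2 - 2$ and Welch's test agrees with the pooled test.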

## ANOVA


In [None]:
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

formula = 'S ~ C(E) + C(M) + X'
lm = ols(formula, df).fit()
table = sm.stats.anova_lm(lm, typ=2)
print(table)

# Extensions to Linear Models (S. 20)

## Interactions

- two or more variables interact in a non-additive manner
    - including them when they're needed will increase your $R^2$ value!
- how would you actually include interaction effects in our model? 
    - To do this, you basically multiply 2 predictors.



## Polynomial Regression

- square your predictor (or raise to 3 or 4 whatever) and 
- include it in your model as if it were a new predictor. 
    - so add the squared column to the df as a new col, e.g. `df['x_squared'] = df['x']**2`

In [None]:
# Scikit-Learn has a built-in polynomial option in the preprocessing module

from sklearn.preprocessing import PolynomialFeatures

y = yld['Yield']
X = yld.drop(columns='Yield', axis=1)

poly = PolynomialFeatures(6)
X_fin = poly.fit_transform(X)

# The transformed feature names are: ['1', 'x0', 'x0^2', 'x0^3', 'x0^4', 'x0^5', 'x0^6']

## Bias Variance Trade off

Bias arises when wrong assumptions are made when training a model. For example, an interaction effect is missed, or we didn't catch a certain polynomial relationship. Because of this, our algorithm misses the relevant relations between predictors and the target variable. Note how this is similar to underfitting!

Variance arises when a model is too sensitive to small fluctuations in the training set. When variance is high, random noise in the training data is modeled, rather than the intended outputs. This is overfitting!

# Linear Algebra

## Systems of linear equations

The *inverse* of a square matrix *A*, sometimes called a *reciprocal matrix*, is a matrix $A^{-1}$ such that

> $A \cdot A^{-1} = I$

where $I$ is the identity matrix 

You need an inverse because you can't perform division operations with matrices! There is no concept of dividing by a matrix. However, you can multiply by an inverse, which achieves the same thing.

> $A \cdot X = B$

> $X = A^{-1} \cdot B$


In [None]:
#The dot product for these two matrices can be calculated as:

import numpy as np
A = np.array([[2,1],[3,4]])
I = np.array([[1,0],[0,1]])
print(I.dot(A))
print('\n', A.dot(I))

In [None]:
np.linalg.inv(A)  # takes in a square matrix A and returns its inverse

In [None]:
# Use Numpy's built in function solve() to solve linear equations
x = np.linalg.solve(A, B)
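
To see what multiplying by an inverse means, here is a hand-rolled 2x2 version of $X = A^{-1} \cdot B$ in plain Python. The helpers `inverse_2x2` and `matvec` are illustrative and not part of NumPy, and the right-hand side `B` is made up. In practice `np.linalg.solve` is the tool to use.

```python
from fractions import Fraction

def inverse_2x2(m):
    """Inverse of a 2x2 matrix [[a, b], [c, d]] via the adjugate formula."""
    (a, b), (c, d) = m
    det = Fraction(a * d - b * c)
    if det == 0:
        raise ValueError("matrix is singular, so it has no inverse")
    return [[d / det, -b / det], [-c / det, a / det]]

def matvec(m, v):
    """Multiply a 2x2 matrix by a length-2 vector."""
    return [m[0][0] * v[0] + m[0][1] * v[1],
            m[1][0] * v[0] + m[1][1] * v[1]]

A = [[2, 1], [3, 4]]
B = [4, 11]                      # made-up right-hand side
X = matvec(inverse_2x2(A), B)    # X = A^{-1} B
print(X)
print(matvec(A, X) == B)         # check that A X reproduces B
```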


# Calculus , Cost Functions

## Gradient Descent

# Regularization ( Ridge and Lasso)

In [None]:
from sklearn.linear_model import Lasso, Ridge

#the Ridge and Lasso models have the parameter alpha, which is Scikit-Learn's version of  𝜆  in the regularization cost functions.


ridge = Ridge(alpha=0.5)
ridge.fit(X_train_transformed, y_train)

lasso = Lasso(alpha=0.5)
lasso.fit(X_train_transformed, y_train)

## Ridge

In ridge regression, the cost function is changed by adding a penalty term proportional to the sum of the squared coefficients.

$$ \text{cost_function_ridge}= \sum_{i=1}^n(y_i - \hat{y}_i)^2 + \lambda \sum_{j=1}^k m_j^2 = \sum_{i=1}^n\Big(y_i - \sum_{j=1}^k(m_jx_{ij})-b\Big)^2 + \lambda \sum_{j=1}^k m_j^2$$

## Lasso regression 
Very similar to Ridge regression, except that the penalty term uses the absolute values of the coefficients rather than their squares.

$$ \text{cost_function_lasso}= \sum_{i=1}^n(y_i - \hat{y}_i)^2 + \lambda \sum_{j=1}^k \mid m_j \mid = \sum_{i=1}^n\Big(y_i - \sum_{j=1}^k(m_jx_{ij})-b\Big)^2 + \lambda \sum_{j=1}^k \mid m_j \mid$$
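
The two cost functions differ only in the penalty term, which the following plain-Python sketch makes explicit. The function `penalized_cost`, the data and the coefficients are all made up for illustration. Scikit-learn's own objectives may be scaled differently.

```python
def penalized_cost(X, y, m, b, lam, penalty):
    """Residual sum of squares plus a ridge ('l2') or lasso ('l1') penalty."""
    rss = 0.0
    for row, y_i in zip(X, y):
        y_hat = sum(m_j * x_ij for m_j, x_ij in zip(m, row)) + b
        rss += (y_i - y_hat) ** 2
    if penalty == "l2":
        return rss + lam * sum(m_j ** 2 for m_j in m)
    if penalty == "l1":
        return rss + lam * sum(abs(m_j) for m_j in m)
    raise ValueError("penalty must be 'l1' or 'l2'")

X = [[1.0, 2.0], [2.0, 0.0], [3.0, 1.0]]   # made-up predictors
y = [5.0, 4.0, 8.0]                        # made-up targets
m, b = [2.0, -1.0], 0.5                    # made-up slopes and intercept

ridge_cost = penalized_cost(X, y, m, b, lam=0.5, penalty="l2")
lasso_cost = penalized_cost(X, y, m, b, lam=0.5, penalty="l1")
print(ridge_cost, lasso_cost)
```

Note that the intercept `b` is not penalized in either version, matching the formulas above.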

## AIC

$$ \text{AIC} = -2\ln(\hat{L}) + 2k $$

Where:
* $k$ : length of the parameter space (i.e. the number of features)
* $\hat{L}$ : the maximum value of the likelihood function for the model

<b>the model with the lowest AIC should be selected.</b>

## BIC

$$ \text{BIC} = - 2\ln(\hat{L}) + \ln(n) \cdot k $$

where:

* $\hat{L}$ and $k$ are the same as in AIC
* $n$ : the number of data points (the sample size)

<b>Like the AIC, the lower your BIC, the better your model is performing.</b>
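
Both criteria are simple to compute once the maximized log-likelihood is available. In this sketch the log-likelihoods, parameter counts and sample size are hypothetical numbers, not results from any fitted model.

```python
import math

def aic(log_likelihood, k):
    """AIC = -2 ln(L) + 2k"""
    return -2 * log_likelihood + 2 * k

def bic(log_likelihood, k, n):
    """BIC = -2 ln(L) + ln(n) * k"""
    return -2 * log_likelihood + math.log(n) * k

# Hypothetical models: (maximized log-likelihood, number of parameters)
models = {"small": (-120.0, 3), "large": (-118.5, 8)}
n = 100  # hypothetical sample size

for name, (ll, k) in models.items():
    print(name, round(aic(ll, k), 2), round(bic(ll, k, n), 2))
```

In this made-up comparison the larger model fits slightly better but pays a bigger penalty, so the smaller model has the lower AIC and BIC.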

# Logistic Regression

## getting dummies

In [None]:
# Getting dummies
pandas.get_dummies(data, prefix=None, prefix_sep='_', dummy_na=False, columns=None, sparse=False, drop_first=False, dtype=None) 

## Fitting a model

### Using Scikit Learn

In [1]:
# Instantiate a Logistic regression model
# Solver must be specified to avoid warning, see documentation for more information
# liblinear is recommended for small datasets
# https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
from sklearn.linear_model import LogisticRegression

regr = LogisticRegression(C=1e5, solver='liblinear')

# Fit the model to the training set
regr.fit(age, income_bin)

# To predict on new data
y_hat_test = regr.predict(X_test)

In [None]:
# Store the coefficients
coef = regr.coef_
interc = regr.intercept_

# Create the linear predictor
lin_pred = (age * coef + interc)

# Perform the logistic (sigmoid) transformation
mod_income = 1 / (1 + np.exp(-lin_pred))

# Sort the numbers to make sure plot looks right
age_ordered, mod_income_ordered = zip(*sorted(zip(age ,mod_income.ravel()),key=lambda x: x[0]))

### Using Stats Models

In [None]:
import statsmodels.api as sm

# Create intercept term required for sm.Logit, see documentation for more information
X = sm.add_constant(X)

# Fit model
logit_model = sm.Logit(y, X)
# Get results of the fit
result = logit_model.fit()

result.summary()

## Evaluating a model's results

### Confusion Matrices

In [None]:
from sklearn.metrics import confusion_matrix
cf = confusion_matrix(example_labels, example_preds)

### Evaluation Metrics

In [2]:
from sklearn.metrics import classification_report

# Signature: classification_report(y_true, y_pred, labels=None, target_names=None, sample_weight=None, digits=2, output_dict=False, zero_division='warn')
# Returns a text summary, or a dict when output_dict=True
print(classification_report(y_true, y_pred))

In [None]:
from sklearn.metrics import precision_recall_fscore_support

$$ \text{Precision} = \frac{\text{Number of True Positives}}{\text{Number of Predicted Positives}} $$    

$$ \text{Recall} = \frac{\text{Number of True Positives}}{\text{Number of Actual Total Positives}} $$  
  
$$ \text{Accuracy} = \frac{\text{Number of True Positives + True Negatives}}{\text{Total Observations}} $$

$$ \text{F1 score} = 2 * \frac{\text{Precision * Recall}}{\text{Precision + Recall}} $$
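
The four formulas can be computed by hand from confusion-matrix counts. The counts and the function name `classification_metrics` below are hypothetical.

```python
def classification_metrics(tp, fp, fn, tn):
    """Precision, recall, accuracy and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp)              # true positives / predicted positives
    recall = tp / (tp + fn)                 # true positives / actual positives
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, accuracy, f1

# Hypothetical counts: 40 TP, 10 FP, 20 FN, 30 TN
precision, recall, accuracy, f1 = classification_metrics(tp=40, fp=10, fn=20, tn=30)
print(round(precision, 3), round(recall, 3), round(accuracy, 3), round(f1, 3))
```

This sketch does not guard against division by zero, for example when a model predicts no positives at all.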

### ROC and AUC

In [None]:
from sklearn.metrics import roc_curve, auc

# Scikit-learn's built in roc_curve method returns the fpr, tpr, and thresholds
# for various decision boundaries given the case member probabilities

# First calculate the probability scores of each of the datapoints:
y_score = logreg.fit(X_train, y_train).decision_function(X_test)

fpr, tpr, thresholds = roc_curve(y_test, y_score)

In [None]:
print('AUC: {}'.format(auc(fpr, tpr)))

### Class Imbalance

#### Class Weight

- By default the class weights for logistic regression in scikit-learn are None, meaning that both classes will be given equal importance in tuning the model. 

- Alternatively, you can pass 'balanced' in order to assign weights that are inversely proportional to each class's frequency
    - You can also pass the weights in as a dictionary with key = name of class and value = weight


In [None]:
weights = [None, 'balanced', {1:2, 0:1}, {1:10, 0:1}, {1:100, 0:1}, {1:1000, 0:1}]
names = ['None', 'Balanced', '2 to 1', '10 to 1', '100 to 1', '1000 to 1']
colors = sns.color_palette('Set2')

plt.figure(figsize=(10,8))

for n, weight in enumerate(weights):
    # Fit a model
    logreg = LogisticRegression(fit_intercept=False, C=1e20, class_weight=weight, solver='lbfgs')
    model_log = logreg.fit(X_train, y_train)
    print(model_log)

    # Predict
    y_hat_test = logreg.predict(X_test)

    y_score = logreg.fit(X_train, y_train).decision_function(X_test)

    fpr, tpr, thresholds = roc_curve(y_test, y_score)
    

#### Over sampling, under sampling, SMOTE

- oversampling the minority class or undersampling the majority class can help by producing a synthetic dataset that the learning algorithm is trained on
    - important to still maintain a test set from the original (unsampled) dataset in order to accurately judge the accuracy of the algorithm overall.

- Synthetic Minority Oversampling (SMOTE)
    - Here, rather than simply oversampling the minority class with replacement (which simply adds duplicate cases to the dataset), the algorithm generates new sample data by creating 'synthetic' examples that are combinations of the closest minority class cases

In [None]:
from imblearn.over_sampling import SMOTE

ratios = [0.1, 0.25, 0.33, 0.5, 0.7, 1]
names = ['0.1', '0.25', '0.33','0.5','0.7','even']
colors = sns.color_palette('Set2')

plt.figure(figsize=(10, 8))

for n, ratio in enumerate(ratios):
    # Fit a model
    smote = SMOTE(sampling_strategy=ratio)
    X_train_resampled, y_train_resampled = smote.fit_resample(X_train, y_train)  # fit_sample was removed in newer imbalanced-learn releases
    logreg = LogisticRegression(fit_intercept=False, C=1e20, solver ='lbfgs')
    model_log = logreg.fit(X_train_resampled, y_train_resampled)
    print(model_log)

    # Predict
    y_hat_test = logreg.predict(X_test)

    y_score = logreg.decision_function(X_test)

    fpr, tpr, thresholds = roc_curve(y_test, y_score)

# K Nearest Neighbors

## Distance Metrics

### Manhattan Distance

$$ \large d(x,y) = \sum_{i=1}^{n}|x_i - y_i | $$

### Euclidean Distance

$$ \large d(x,y) = \sqrt{\sum_{i=1}^{n}(x_i - y_i)^2} $$

### Minkowski Distance

$$\large d(x,y) = \left(\sum_{i=1}^{n}|x_i - y_i|^c\right)^\frac{1}{c}$$ 
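
The Minkowski distance generalizes the other two: $c = 1$ gives the Manhattan distance and $c = 2$ gives the Euclidean distance. A minimal sketch in plain Python, with illustrative function names and points:

```python
def minkowski(x, y, c):
    """Minkowski distance of order c between two equal-length points."""
    return sum(abs(a - b) ** c for a, b in zip(x, y)) ** (1 / c)

def manhattan(x, y):
    return minkowski(x, y, 1)

def euclidean(x, y):
    return minkowski(x, y, 2)

p, q = (0, 0), (3, 4)
print(manhattan(p, q), euclidean(p, q))  # 7.0 5.0
```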

## K - Nearest

In [None]:
from sklearn.neighbors import KNeighborsClassifier

# Functions, Methods, Objects



## Train Test Split

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X,y,  test_size=0.33, random_state=42)

## Scalers

In [None]:
from sklearn.preprocessing import StandardScaler