# Statistics concepts

* **Problem:  Explain what a long-tailed distribution is and provide three examples of relevant phenomena that have long tails. Why are they important in classification and regression problems?**


<img src="Yellow.png" width="400">


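A long-tailed distribution is one in which a small number of values occur very often (the head) while a large number of values each occur rarely (the tail), so the tail decays slowly instead of dropping off sharply. Power laws are the classic case, as described by the Pareto principle: for many events, roughly 80% of the effects come from 20% of the causes.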

Examples:

1. Population of all US cities with population $\geq$ 100,000.
 <img src="newman-power-distribution.png" width="800">
 
2. The marketplace power law is the observation that a large portion of sales on a marketplace is generated by a small fraction of its sellers.

<img src="marketplaces-power-law-distribution.png" width="400">

3. Power law distributions in the app ecosystem: users spend most of their time in a handful of top-ranked apps. The top-ranked app, most likely a default app on the system, accounts for roughly half of all time spent; another 3-4 apps account for an additional 40%; and the long tail outside the top 5 combines to account for just 10-15% of time spent.

<img src="Users_Time_Spent_on_Apps_reference.png" width="400">

Other Descriptions:

4. Loss versus epochs when training a model for a classification problem: most of the reduction in loss happens in the first few epochs, followed by a long tail of small improvements.

![Loss](index.png)

It’s important to be mindful of long-tailed distributions in classification and regression problems because the rarely occurring values can collectively make up a large share of the population. This changes how you deal with outliers (extreme values may be legitimate observations rather than errors), and it conflicts with machine learning techniques that assume the data is normally distributed.
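The 80/20 pattern can be sketched with simulated data. This is a minimal illustration, assuming a Pareto distribution with shape parameter 1.16 (roughly the value at which the top 20% theoretically hold 80% of the total); the seed, sample size, and the helper name `top_share` are illustrative choices, not part of the notes above.

```python
import random

def top_share(values, top_fraction=0.2):
    """Share of the total held by the largest `top_fraction` of values."""
    ordered = sorted(values, reverse=True)
    k = int(len(ordered) * top_fraction)
    return sum(ordered[:k]) / sum(ordered)

rng = random.Random(0)   # fixed seed, illustrative only
ALPHA = 1.16             # assumed shape parameter, roughly the 80/20 case
sample = [rng.paretovariate(ALPHA) for _ in range(100_000)]

share = top_share(sample)
print(f"Top 20% of draws hold {share:.0%} of the total")
```

Because the tail is so heavy, the exact share moves noticeably from one random sample to the next, which is itself a reason to treat long-tailed data with care.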

* **Problem: What is the Central Limit Theorem? Explain it. Why is it important?**

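The central limit theorem states that the sampling distribution of the sample mean approaches a normal distribution as the sample size gets larger, no matter what the shape of the population distribution, provided the population has finite variance. For a population with mean $\mu$ and standard deviation $\sigma$, the mean of $n$ independent observations is approximately $N(\mu, \sigma^2/n)$; equivalently, the standardized mean $\sqrt{n}(\bar{X} - \mu)/\sigma$ approaches $N(0,1)$. It is important because it justifies normal-based confidence intervals and hypothesis tests for means even when the underlying data are not normal.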
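A small simulation can make the theorem concrete. The sketch below assumes an Exponential(1) population (mean 1, standard deviation 1, strongly skewed); the sample size, number of replicates, and seed are arbitrary illustrative values.

```python
import random
import statistics

rng = random.Random(42)
N = 50               # size of each sample
REPLICATES = 5_000   # number of samples drawn

# Exponential(1) population: mean 1, standard deviation 1, strongly skewed.
sample_means = [
    statistics.fmean(rng.expovariate(1.0) for _ in range(N))
    for _ in range(REPLICATES)
]

mean_of_means = statistics.fmean(sample_means)
sd_of_means = statistics.stdev(sample_means)
print(f"mean of sample means: {mean_of_means:.3f} (theory: 1.000)")
print(f"sd of sample means:   {sd_of_means:.3f} (theory: {1 / N ** 0.5:.3f})")
```

The sample means cluster around the population mean with spread close to $\sigma/\sqrt{n}$, even though individual observations are far from normal.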

* **Problem: What is the statistical power?**

Statistical power is the probability that a test rejects the null hypothesis ($H_0$) when the alternative hypothesis is true, that is, when $H_0$ is false. It equals one minus $\beta$, where $\beta$ is the probability of a type-II error, which occurs when a false null hypothesis is not rejected.

$\text{Power} = 1 - \beta$
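As a worked example, power can be computed in closed form for a one-sided z-test with known standard deviation. This is a sketch under those assumptions; the function name `z_test_power` and the effect size, sigma, and sample size used below are hypothetical.

```python
from statistics import NormalDist

def z_test_power(effect, sigma, n, alpha=0.05):
    """Power of a one-sided z-test of H0: mu = mu0 vs H1: mu = mu0 + effect."""
    z_crit = NormalDist().inv_cdf(1 - alpha)   # rejection threshold under H0
    shift = effect * n ** 0.5 / sigma          # standardized true effect
    beta = NormalDist().cdf(z_crit - shift)    # P(fail to reject | H1 true)
    return 1 - beta

power = z_test_power(effect=0.5, sigma=1.0, n=25)
print(f"power = {power:.3f}")
```

Increasing the sample size or the effect size raises the power, while a stricter `alpha` lowers it.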

* **Problem: What is type-I error and $\alpha$?**

A type-I error occurs when a true null hypothesis is rejected. $\alpha$ is the significance level of the test: the probability of a type-I error that we are willing to accept, chosen before the test is run (commonly 0.05 or 0.01).
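The meaning of $\alpha$ can be checked by simulation: when the null hypothesis is really true, a test run at level $\alpha$ should reject in about a fraction $\alpha$ of experiments. The sketch below assumes a two-sided z-test on normal data with known standard deviation 1; the sample size, trial count, and seed are illustrative.

```python
import random
from statistics import NormalDist, fmean

ALPHA = 0.05
N = 30
TRIALS = 10_000
rng = random.Random(1)
z_crit = NormalDist().inv_cdf(1 - ALPHA / 2)   # two-sided critical value

rejections = 0
for _ in range(TRIALS):
    # H0 is true here: the data really come from N(0, 1).
    sample = [rng.gauss(0.0, 1.0) for _ in range(N)]
    z = fmean(sample) * N ** 0.5               # z statistic with known sigma = 1
    if abs(z) > z_crit:
        rejections += 1                        # a type-I error

false_positive_rate = rejections / TRIALS
print(f"observed type-I error rate: {false_positive_rate:.3f}")
```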

* **Problem: How do you handle missing data? What imputation techniques do you recommend?**

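If the dataset is large and only a small fraction of rows have missing values, simply delete those rows. Otherwise, impute the missing values by predicting them from the other features, for example with a random forest.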

* **Problem: Is mean imputation of missing data acceptable practice? Why or why not?**

Generally no, because it does not take the correlations between features into account. Mean imputation reduces the variance of the data and introduces bias. This leads to a less accurate model and to confidence intervals that are too narrow, because the variance is underestimated.
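The variance shrinkage is easy to see on a toy column. The values and the number of missing rows below are made up for illustration.

```python
import statistics

observed = [2.0, 4.0, 6.0, 8.0, 10.0]   # made-up observed values
n_missing = 5                           # assumed number of missing rows

mean_value = statistics.fmean(observed)
imputed = observed + [mean_value] * n_missing   # mean imputation

var_observed = statistics.variance(observed)
var_imputed = statistics.variance(imputed)
print(f"variance before imputation: {var_observed:.2f}")
print(f"variance after imputation:  {var_imputed:.2f}")
```

Every imputed row sits exactly at the mean, so it adds to the sample size without adding any spread, and the estimated variance falls.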

* **Problem: What is an outlier?**

An outlier is a data point that differs significantly from other observations. A common rule of thumb for roughly normal data is a Z-score beyond $\pm 3$. Another common rule (Tukey's fences) flags points that are less than $Q_1 - 1.5 \times IQR$ or greater than $Q_3 + 1.5 \times IQR$, where IQR is the interquartile range, $Q_3 - Q_1$. For a normal distribution these fences sit approximately 2.698 standard deviations from the mean.

<img src="percentile.gif" width="400">
<!-- <img src="figs/IQR.png" width="400"> -->
<img src="outlier.png" width="400">
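The IQR rule can be sketched in a few lines. The data values and the helper name `iqr_outliers` are hypothetical; quartiles are computed with the default method of `statistics.quantiles`, and other quartile conventions can give slightly different fences.

```python
import statistics
from statistics import NormalDist

def iqr_outliers(data, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(data, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [x for x in data if x < low or x > high]

# Made-up data with one obviously extreme value.
data = [10, 12, 12, 13, 12, 11, 14, 13, 15, 10, 100]
outliers = iqr_outliers(data)
print(outliers)

# For a standard normal, IQR = 2 * Q3, so the upper fence is Q3 + 1.5 * (2 * Q3).
q3_normal = NormalDist().inv_cdf(0.75)
upper_fence_sigma = q3_normal + 1.5 * (2 * q3_normal)
print(f"upper fence for a standard normal: {upper_fence_sigma:.3f} sigma")
```

The second part reproduces the figure of roughly 2.698 standard deviations quoted for normally distributed data.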

* **Problem: You have data on the duration of calls to a call center. Generate a plan for how you would code and analyze these data. Explain a plausible scenario for what the distribution of these durations might look like. How could you test, even graphically, whether your expectations are borne out?**

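Call durations are positive and typically right-skewed: most calls are short and a few run very long, so a lognormal distribution is a plausible expectation. To check this graphically, plot a histogram of the durations and draw a QQ plot of the log durations against a normal distribution; the points should fall close to a straight line if the durations are lognormal.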
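Without a plotting library, the straightness of a QQ plot can be summarized by the correlation between the sorted data and the matching normal quantiles. The sketch below uses simulated durations; the lognormal parameters, the seed, and the helper name `qq_correlation` are assumptions for illustration.

```python
import math
import random
from statistics import NormalDist, fmean

def qq_correlation(values):
    """Pearson correlation between sorted values and normal quantiles."""
    n = len(values)
    ordered = sorted(values)
    # Theoretical standard-normal quantiles at plotting positions (i + 0.5) / n.
    theory = [NormalDist().inv_cdf((i + 0.5) / n) for i in range(n)]
    mx, my = fmean(theory), fmean(ordered)
    cov = sum((x - mx) * (y - my) for x, y in zip(theory, ordered))
    sx = math.sqrt(sum((x - mx) ** 2 for x in theory))
    sy = math.sqrt(sum((y - my) ** 2 for y in ordered))
    return cov / (sx * sy)

rng = random.Random(7)
# Simulated call durations in seconds; the lognormal parameters are assumptions.
durations = [rng.lognormvariate(5.0, 0.8) for _ in range(500)]

r_raw = qq_correlation(durations)
r_log = qq_correlation([math.log(d) for d in durations])
print(f"QQ correlation, raw durations: {r_raw:.3f}")
print(f"QQ correlation, log durations: {r_log:.3f}")
```

A correlation close to 1 for the log durations, and a visibly lower one for the raw durations, is what a lognormal shape would produce; plotting `theory` against the sorted values gives the usual QQ plot.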

### Concepts

* **Log-normal:**
In probability theory, a log-normal (or lognormal) distribution is a continuous probability distribution of a random variable whose logarithm is normally distributed. Thus, if the random variable X is log-normally distributed, then Y = ln(X) has a normal distribution. 

* **Examples of data that has neither a Gaussian nor a log-normal distribution:**
Any categorical data; exponentially distributed data, such as the amount of time that a car battery lasts or the amount of time until an earthquake occurs.

* **Lift:** 
It is a measure of the performance of a targeting model measured against a random choice targeting model; in other words, lift tells you how much better your model is at predicting things than if you had no model.
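One common way to compute lift is the response rate in the group the model selects divided by the overall response rate. The campaign numbers below are hypothetical, and `lift` is an illustrative helper rather than a library function.

```python
def lift(responders_targeted, size_targeted, responders_total, size_total):
    """Response rate in the model-selected group divided by the overall rate."""
    targeted_rate = responders_targeted / size_targeted
    baseline_rate = responders_total / size_total
    return targeted_rate / baseline_rate

# Hypothetical campaign: 10,000 customers, 500 responders overall (5%).
# The model's top decile (1,000 customers) contains 200 of them (20%).
top_decile_lift = lift(200, 1_000, 500, 10_000)
print(f"lift in top decile: {top_decile_lift:.1f}x")
```

A lift of 1 means the model does no better than random targeting.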

* **KPI:** 
It stands for Key Performance Indicator, a measurable metric used to determine how well a company is achieving its business objectives, e.g. error rate. A good KPI is SMART: Specific, Measurable, Attainable, Relevant, Time-bound.


* **Six sigma:**
A six sigma process is one in which 99.99966% of all outcomes are expected to be free of defects, i.e. about 3.4 defects per million opportunities.
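The 99.99966% figure can be reproduced from the normal distribution. The calculation below relies on the conventional Six Sigma assumption, not stated in the notes above, that the process mean may drift by 1.5 standard deviations, so the one-sided tail is taken at 4.5 sigma.

```python
from statistics import NormalDist

# Conventional Six Sigma assumption: the process mean may drift by 1.5 sigma,
# so the nearest specification limit is effectively 6 - 1.5 = 4.5 sigma away.
effective_sigma = 6.0 - 1.5
defect_rate = 1 - NormalDist().cdf(effective_sigma)   # one-sided tail area
dpmo = defect_rate * 1_000_000                        # defects per million
yield_pct = (1 - defect_rate) * 100

print(f"defects per million opportunities: {dpmo:.1f}")
print(f"defect-free yield: {yield_pct:.5f}%")
```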

* **Correlation vs Causation**:
Correlation measures the strength of the linear relationship between two variables and ranges from -1 to 1. Causation means that a change in one variable actually produces a change in another. Correlation can arise from a direct causal link, but also from an indirect one such as a shared underlying cause. The most reliable way to test for causation is a controlled experiment such as an A/B test, analyzed with a hypothesis test. Example: a higher crime rate is associated with higher ice cream sales in Canada, i.e. they are positively correlated. However, this doesn’t mean that one causes the other. Instead, both occur more when it’s warmer outside.
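The ice cream example can be simulated with a shared cause. In this sketch temperature drives both variables and neither affects the other; all coefficients, noise levels, and the seed are made-up values, and `pearson` and `residuals` are illustrative helpers.

```python
import math
import random
from statistics import fmean

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = fmean(xs), fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def residuals(ys, xs):
    """Residuals of ys after a least-squares line on xs is removed."""
    mx, my = fmean(xs), fmean(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return [y - (my + slope * (x - mx)) for x, y in zip(xs, ys)]

rng = random.Random(3)
# Temperature is the common cause; ice cream and crime never affect each other.
temperature = [rng.uniform(0, 30) for _ in range(2_000)]
ice_cream = [2.0 * t + rng.gauss(0, 10) for t in temperature]
crime = [1.0 * t + rng.gauss(0, 5) for t in temperature]

r_raw = pearson(ice_cream, crime)
r_adjusted = pearson(residuals(ice_cream, temperature),
                     residuals(crime, temperature))
print(f"correlation(ice cream, crime):           {r_raw:.2f}")
print(f"same, after removing temperature effect: {r_adjusted:.2f}")
```

The raw correlation is strong, yet it all but disappears once the effect of temperature is removed from both series.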

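* **Problem: One box has 12 black and 12 red cards; a second box has 24 black and 24 red. If you draw 2 cards at random from one of the 2 boxes, which box has the higher probability of giving two cards of the same color? Can you tell intuitively why the 2nd box has a higher probability?**

After the first card is drawn, the second card matches it only if it is one of the remaining cards of the same color. For the first box that probability is 11/23 ≈ 0.478, whereas for the second box it is 23/47 ≈ 0.489, so the latter has a slightly higher probability. Intuitively, removing one card depletes the matching color proportionally less in the larger box: with $n$ cards of each color the probability is $(n-1)/(2n-1)$, which approaches 1/2 as $n$ grows.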
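The two probabilities can be verified exactly by counting pairs. The function name `same_color_probability` is an illustrative choice.

```python
from fractions import Fraction
from math import comb

def same_color_probability(n_per_color):
    """P(two cards drawn without replacement share a color), two colors."""
    total = 2 * n_per_color
    same_color_pairs = 2 * comb(n_per_color, 2)   # both black or both red
    return Fraction(same_color_pairs, comb(total, 2))

box_1 = same_color_probability(12)
box_2 = same_color_probability(24)
print(box_1, float(box_1))
print(box_2, float(box_2))
```

Counting unordered pairs gives the same answer as conditioning on the first card, since $2\binom{n}{2} / \binom{2n}{2} = (n-1)/(2n-1)$.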



## Group Study Questions

https://docs.google.com/document/d/1iwZjqmvyVMeh88UGqE6Qv5kwtLSPT9k31MvCzPrAK3M/edit

#### Breakout: https://docs.google.com/spreadsheets/d/1oeEnMdwBjQdAAzeAv6PVjlNamSdywapgMHHkX0zIPVc/edit#gid=0

#### Relevant topics: 
* SQL

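Show the party and RANK for constituency S14000024 in 2017. List the output by party.

```sql
-- Rank parties by votes within the single constituency and year selected below.
-- Partitioning by party would give every row rank 1, so no PARTITION BY is used.
SELECT party,
       RANK() OVER (ORDER BY votes DESC) AS rank_party
FROM ge
WHERE constituency = 'S14000024' AND yr = 2017
ORDER BY party
```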

Use PARTITION to show the ranking of each party in S14000021 in each year. Include yr, party, votes and ranking (the party with the most votes is 1).

```sql
-- Restart the ranking for each year; all years are kept for this constituency.
SELECT yr, party, votes,
       RANK() OVER (PARTITION BY yr ORDER BY votes DESC) AS vote_rank
FROM ge
WHERE constituency = 'S14000021'
ORDER BY yr
```

* Confusion Matrix
* Recall and Precision (https://en.wikipedia.org/wiki/Precision_and_recall)
* Random Forest vs Logistic Regression (https://towardsdatascience.com/random-forest-and-its-implementation-71824ced454f)

* Biased vs unbiased estimator, consistency (https://www.youtube.com/watch?v=6i7mqDJICzQ, https://www.statlect.com/glossary/unbiased-estimator)

* (Stats) Nick's Notebook: https://github.com/nmmichalak/nmmichalak.github.io/blob/master/welch_t_vs_z_test_false_positive.ipynb
* (Stats) Test Hypothesis: https://medium.com/swlh/the-ultimate-guide-to-a-b-testing-part-1-experiment-design-8315a2470c63

* (Stats) Test Hypothesis 2: https://towardsdatascience.com/the-math-behind-a-b-testing-with-example-code-part-1-of-2-7be752e1d06f

* (Stats) Funnel/Test Hypothesis 3: https://medium.com/@henryfeng/customer-funnel-analysis-for-online-retailer-using-pivot-table-and-clustering-in-python-bdcc88824d0b


* (CS) Bennett's Notebook: https://hub.gke.mybinder.org/user/bjmarsh-insight-coding-practice-z8f85ygw/notebooks/data_structures/data_structures.ipynb

## Jun 30th

* Workshop: Fermi Problem (http://web.pdx.edu/~pmoeck/pdf/The%20classic%20Fermi%20problem.pdf)

* Group study + SQL (https://docs.google.com/document/d/1ILxzEJc_uyHRF8UPGnYpSesvfycOipoOqB67vYYP5Vw/edit)

* CS Python (https://runestone.academy/runestone/books/published/pythonds/AlgorithmAnalysis/BigONotation.html)

* Group Study breakout: https://docs.google.com/spreadsheets/d/1oeEnMdwBjQdAAzeAv6PVjlNamSdywapgMHHkX0zIPVc/edit#gid=0

* Stats: https://danieltakeshi.github.io/2016/09/25/the-expectation-of-the-minimum-of-iid-uniform-random-variables/

## White Board Questions
https://docs.google.com/document/d/1136AuJSEYDmWwaOwfZVBGPBGNEAwS-Jrdd6761qQWR4/edit#


# Big O

* $n!$ grows even faster than $2^n$ as $n$ gets large.
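A quick numerical check of this claim: the ratio $n!/2^n$ is a product of factors $k/2$, and every factor with $k \geq 3$ exceeds 1, so the ratio keeps growing. The values of `n` printed below are arbitrary.

```python
import math

# First n at which n! overtakes 2**n.
crossover = next(n for n in range(1, 50) if math.factorial(n) > 2 ** n)
print(f"n! first exceeds 2^n at n = {crossover}")

# A few rows showing how quickly the gap widens.
for n in (5, 10, 20):
    ratio = math.factorial(n) / 2 ** n
    print(f"n={n:>2}  2^n={2 ** n:>9,}  n!={math.factorial(n):>25,}  ratio={ratio:,.1f}")
```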