# What is Statistical Data Analysis?

The goal of **statistical data analysis** is to understand a complex, real-world phenomenon from partial and uncertain observations. The uncertainty in the data results in uncertainty in the knowledge we gain about the phenomenon. A major goal of the theory is to **quantify this uncertainty**.
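
As a rough illustration of what quantifying uncertainty can look like, the sketch below computes a 95% confidence interval for a mean. The measurements are hypothetical values invented for this example, and the interval assumes the data are roughly normally distributed.

```python
import numpy as np
from scipy import stats

# Hypothetical sample: 8 noisy measurements of the same quantity.
sample = np.array([9.8, 10.4, 10.1, 9.6, 10.3, 10.0, 9.9, 10.5])

n = sample.size
mean = sample.mean()
sem = sample.std(ddof=1) / np.sqrt(n)  # standard error of the mean

# 95% confidence interval from Student's t distribution (n - 1 degrees of freedom).
t_crit = stats.t.ppf(0.975, df=n - 1)
ci_low, ci_high = mean - t_crit * sem, mean + t_crit * sem

print(f"mean = {mean:.3f}, 95% CI = ({ci_low:.3f}, {ci_high:.3f})")
```

The width of the interval is one way to express how uncertain the estimate of the mean is: more observations or less noisy data give a narrower interval.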

![image.png](attachment:image.png)


## Key Terms: Exploration, Inference, Decision, Prediction
- Exploratory methods allow us to get a preliminary look at a dataset through basic statistical aggregates and interactive visualization (related topics: computing and data visualization, and exploring a dataset).
- Statistical inference consists of getting information about an unknown process through partial and uncertain observations. In particular, estimation entails obtaining approximate quantities for the mathematical variables describing this process.
- Decision theory allows us to make decisions about an unknown process from random observations, with a controlled risk.
    - Statistical hypothesis testing allows us to make decisions in the presence of incomplete data. By definition, these decisions are uncertain.
    - Related recipes: *Getting started with statistical hypothesis testing – a simple z-test* and *Estimating the correlation between two variables with a contingency table and a chi-squared test*.
- Prediction consists of learning from data, that is, predicting the outcomes of a random process based on a limited number of observations.

![image.png](attachment:image.png)

## Hypothesis Testing
A hypothesis test evaluates two mutually exclusive statements about a population, the null hypothesis and the alternative hypothesis, to determine which statement is better supported by the sample data.

![image.png](attachment:image.png)

### Level of significance
The level of significance, denoted alpha, is the probability threshold that we fix before the test for rejecting the null hypothesis. Common choices are 0.05 and 0.01.
- **Type I error**: We reject the null hypothesis although it is true. The probability of a Type I error is denoted by alpha. On the curve of the test statistic's distribution, the critical (rejection) region is called the alpha region.
- **Type II error**: We fail to reject the null hypothesis although it is false. The probability of a Type II error is denoted by beta. On the same curve, the corresponding acceptance region is called the beta region.
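
The link between alpha and the Type I error rate can be checked with a small simulation. This is a sketch under stated assumptions: the data are simulated so that the null hypothesis is true, and the seed, sample size, and number of trials are arbitrary choices for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
alpha = 0.05
n_trials, n_obs = 2000, 30

# H0 is true by construction: every sample comes from a population with mean 0.
samples = rng.normal(loc=0.0, scale=1.0, size=(n_trials, n_obs))

# One t-test per row; each p-value below alpha is a false rejection (Type I error).
pvalues = stats.ttest_1samp(samples, popmean=0.0, axis=1).pvalue
type1_rate = np.mean(pvalues < alpha)

print(f"Fraction of false rejections: {type1_rate:.3f} (alpha = {alpha})")
```

The fraction of false rejections should land close to alpha, which is what it means for alpha to control the Type I error rate.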

![image.png](attachment:image.png)

- **P-value**: The p-value, or calculated probability, is the probability of obtaining the observed results, or more extreme ones, when the null hypothesis H<sub>0</sub> of a study question is true.

**If your P value is less than the chosen significance level then you reject the null hypothesis**
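
The decision rule can be sketched for a test statistic that follows a standard normal distribution under the null hypothesis. The value of `z` below is a hypothetical observed statistic chosen for illustration.

```python
from scipy.stats import norm

z = 2.1        # hypothetical observed test statistic
alpha = 0.05   # chosen significance level

# Two-sided p-value: probability of |Z| >= |z| when H0 is true.
p_two_sided = 2 * norm.sf(abs(z))
reject = p_two_sided < alpha

print(f"p = {p_two_sided:.4f}, reject H0: {reject}")
```

Here `norm.sf` is the survival function (one minus the cumulative distribution function), so doubling it accounts for extreme values in both tails.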

### Some widely used hypothesis tests
- T Test ( Student T test)
- Z Test
- ANOVA Test
- Chi-Square Test

### One sample t-test 
The One Sample t Test determines whether the sample mean is statistically different from a known or hypothesised population mean. 
- Example: a t-test to check whether the average age in a society is 35.
1. H<sub>0</sub>: the mean age is 35

In [2]:
from scipy.stats import ttest_1samp
import numpy as np

# Sample of ages; H0: the population mean age is 35.
data_ages = np.array([23,45,65,34,45,12,22,33,30,29,38,37,34,32,19,18,45,62,5,10,26,19,50,3,34,44,90,23,45,67])
ages_mean = np.mean(data_ages)
print(ages_mean)

# Two-sided one-sample t-test against the hypothesised mean of 35.
result = ttest_1samp(data_ages, 35)
pval = result.pvalue
print("p-value", pval)

alpha = 0.05    # significance level (5%)
if pval < alpha:
    print("we reject the null hypothesis")
else:
    print("we fail to reject the null hypothesis")

34.63333333333333
p-value 0.9181590416037229
we fail to reject the null hypothesis
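
To see what the one-sample t-test above computes, the sketch below rebuilds the t statistic and the two-sided p-value by hand from the same ages and compares them with `ttest_1samp`. The variable names are chosen for this illustration.

```python
import numpy as np
from scipy import stats

data_ages = np.array([23, 45, 65, 34, 45, 12, 22, 33, 30, 29, 38, 37, 34, 32, 19,
                      18, 45, 62, 5, 10, 26, 19, 50, 3, 34, 44, 90, 23, 45, 67])
mu0 = 35  # hypothesised population mean
n = data_ages.size

# t = (sample mean - hypothesised mean) / (sample standard deviation / sqrt(n))
t_manual = (data_ages.mean() - mu0) / (data_ages.std(ddof=1) / np.sqrt(n))

# Two-sided p-value from Student's t distribution with n - 1 degrees of freedom.
p_manual = 2 * stats.t.sf(abs(t_manual), df=n - 1)

result = stats.ttest_1samp(data_ages, mu0)
print("manual:", t_manual, p_manual)
print("scipy: ", result.statistic, result.pvalue)
```

Both lines should agree up to floating-point rounding, since `ttest_1samp` performs a two-sided test by default.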


### Z-test

A z-test compares a sample mean with a hypothesised value, or two sample means with each other, using the normal distribution. It is appropriate when the following conditions hold:

- Your sample size is greater than 30. Otherwise, use a t-test.
- Data points should be independent of each other. In other words, one data point is not related to and does not affect another data point.
- Your data should be normally distributed. However, for large sample sizes (over 30) this does not always matter.
- Your data should be randomly selected from a population, where each item has an equal chance of being selected.
- When comparing two samples, the sample sizes should be equal if at all possible.

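**Example: a one-sample z-test of whether the mean blood pressure equals 156 (Python code below).** Note that this sample has only 15 values, so by the first condition above a t-test would be the stricter choice; the z-test is used here for illustration.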

In [20]:
import pandas as pd
from statsmodels.stats import weightstats as stests

# Blood pressure readings; H0: the population mean is 156.
a = [110,120,139,140,118,130,109,102,98,120,103,113,140,155,156]
Bp_values = pd.DataFrame(a)

# With a one-column DataFrame, ztest returns length-1 arrays, so take element 0.
ztest, pval = stests.ztest(Bp_values, x2=None, value=156)
pval = float(pval[0])
print(pval)
if pval < 0.05:
    print("reject null hypothesis")
else:
    print("fail to reject null hypothesis")

2.4454166663991188e-11
reject null hypothesis
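
Because the blood pressure sample contains only 15 readings, a one-sample t-test is a reasonable cross-check. The sketch below runs it on the same values with `scipy.stats.ttest_1samp`; treating 156 as the hypothesised mean follows the example above.

```python
from scipy.stats import ttest_1samp

bp_values = [110, 120, 139, 140, 118, 130, 109, 102, 98, 120, 103, 113, 140, 155, 156]

# H0: the population mean blood pressure is 156 (two-sided test).
result = ttest_1samp(bp_values, popmean=156)
print("t statistic:", result.statistic)
print("p-value:", result.pvalue)

if result.pvalue < 0.05:
    print("reject null hypothesis")
else:
    print("fail to reject null hypothesis")
```

The t-test uses heavier tails than the normal distribution, so its p-value is larger than the z-test's, but for this sample the conclusion is the same.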


### ANOVA (F-test)
The t-test works well when dealing with two groups, but sometimes we want to compare more than two groups at the same time. For example, if we wanted to test whether voter age differs based on some categorical variable like race, we would have to compare the means of each level (group) of the variable.
The analysis of variance or ANOVA is a statistical inference test that lets you compare multiple groups at the same time.
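
A minimal one-way ANOVA sketch using `scipy.stats.f_oneway` is shown below. The three groups are hypothetical ages invented for this example; the null hypothesis is that all group means are equal.

```python
from scipy.stats import f_oneway

# Hypothetical ages for three groups of a categorical variable.
group_a = [25, 30, 28, 36, 29]
group_b = [45, 55, 29, 56, 40]
group_c = [30, 29, 33, 37, 27]

# H0: all three groups have the same population mean.
f_stat, p_value = f_oneway(group_a, group_b, group_c)
print("F statistic:", f_stat)
print("p-value:", p_value)

if p_value < 0.05:
    print("reject null hypothesis: at least one group mean differs")
else:
    print("fail to reject null hypothesis")
```

A significant result only says that at least one group mean differs; it does not say which one, so a follow-up comparison between pairs of groups is usually needed.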


### Chi-Square Test
The test is applied when you have two categorical variables from a single population. It is used to determine whether there is a significant association between the two variables.
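
A chi-square test of association can be sketched with `scipy.stats.chi2_contingency`. The contingency table below holds hypothetical counts invented for this example; the null hypothesis is that the two variables are independent.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical 2x2 contingency table of counts:
# rows are the levels of one categorical variable, columns the levels of the other.
observed = np.array([[30, 10],
                     [20, 40]])

# H0: the two variables are independent.
chi2, p_value, dof, expected = chi2_contingency(observed)
print("chi-square statistic:", chi2)
print("p-value:", p_value)
print("degrees of freedom:", dof)
print("expected counts under H0:")
print(expected)

if p_value < 0.05:
    print("reject null hypothesis: the variables appear associated")
else:
    print("fail to reject null hypothesis")
```

For 2x2 tables, `chi2_contingency` applies Yates' continuity correction by default, so the statistic is slightly smaller than the uncorrected chi-square value.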