# What is Statistics?

Statistics is the study of the tools and methods that allow us to learn from data. It involves: 

1. Understanding new types of data and statistical methods for their analysis
2. Understanding properties of statistical methods
3. Development of new statistical methods

## Historical Milestones

1.  **Ancient Times**: Data were collected on populations, harvests, and natural calamities.
2.  **18th Century**: Probability theory emerged as the study of randomness and variation.
3.  **19th Century**: Modern statistics took shape through demographic studies and genetics.
4.  **20th Century**: Statistical theory and statistical learning advanced significantly with the rise of computers.
5.  **21st Century**: Huge data stores (Big Data) and advanced applications of statistical methods to data (Machine Learning) became widespread.


## Statistic

A statistic is a numerical or graphical summary of a collection of data.



# Study Design

Study design refers to the different types of research studies that give rise to data. It is crucial for **power analysis**, the process of assessing whether a study design is likely to yield meaningful findings.
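As a rough illustration of power analysis, the sketch below approximates the power of a two-sided, two-sample z-test. It assumes equal group sizes and a known, common standard deviation; the function name `two_sample_z_power` and the effect size of 0.5 are illustrative choices, not part of the original notes.

```python
from math import sqrt
from statistics import NormalDist


def two_sample_z_power(effect_size, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-sample z-test.

    effect_size is the true mean difference divided by the (assumed known,
    common) standard deviation. All names here are illustrative.
    """
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    # Expected value of the test statistic when the effect is real.
    shift = effect_size * sqrt(n_per_group / 2)
    # Probability that the statistic falls in either rejection region.
    return std_normal.cdf(shift - z_crit) + std_normal.cdf(-shift - z_crit)


# Power grows with the number of records per group.
for n in (10, 30, 64, 100):
    print(n, round(two_sample_z_power(0.5, n), 3))
```

A design whose power is low for a plausible effect size is unlikely to yield a meaningful finding, even if the effect is real.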

## Confirmatory vs Exploratory

1. Confirmatory Studies employ the scientific method (Falsifiable Hypothesis -> Data Collection -> Testing Hypothesis Validity).

2. Exploratory Studies collect and analyze data without pre-specifying a question.

While the latter is informative, it can be misleading. The more questions we ask of a dataset, the more likely we are to draw a misleading conclusion (multiple testing, p-hacking, overfitting).
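The multiple-testing risk can be made concrete with a small calculation. The sketch below assumes the tests are independent and that every null hypothesis is true, so each test has probability `alpha` of a false positive; the function name is illustrative.

```python
def familywise_error_rate(num_tests, alpha=0.05):
    """Probability of at least one false positive across independent tests."""
    return 1 - (1 - alpha) ** num_tests


for m in (1, 5, 20, 100):
    rate = familywise_error_rate(m)
    print(f"{m:>3} tests -> P(at least one false positive) = {rate:.3f}")
```

Under these assumptions, asking twenty questions of the same dataset makes a spurious "finding" more likely than not.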


## Comparative vs Non-Comparative

1. Comparative Studies compare a feature between two groups. Eg: Harvest of Oranges in Spain vs Italy
2. Non-comparative Studies estimate the value of a feature. Eg: Stock Price prediction.


## Observational vs Experimental

1. Observational studies record data from groups that arise naturally, without intervention. Eg: Lifespan of Smokers vs Non Smokers
2. Experimental studies involve manipulation of different groups by the researcher. Eg: Drug Trials

Observational studies are often susceptible to bias, where the sample is not representative or the measurements are systematically off-target.
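The effect of a non-representative sample can be sketched with a small simulation. The population below is entirely hypothetical (two groups with different averages), and the seed is fixed only to make the run reproducible.

```python
import random
from statistics import mean

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical population: 80% in group A, 20% in group B,
# with the measured value higher on average in group B.
population = [("A", random.gauss(50, 5)) for _ in range(8000)]
population += [("B", random.gauss(70, 5)) for _ in range(2000)]
true_mean = mean(value for _, value in population)

# Representative sample: every record is equally likely to be chosen.
random_sample = random.sample(population, 500)
random_estimate = mean(value for _, value in random_sample)

# Non-representative sample: only group B can be reached.
group_b = [record for record in population if record[0] == "B"]
biased_sample = random.sample(group_b, 500)
biased_estimate = mean(value for _, value in biased_sample)

print("population mean:       ", round(true_mean, 2))
print("random-sample estimate:", round(random_estimate, 2))
print("biased-sample estimate:", round(biased_estimate, 2))
```

Both samples have the same size, so collecting more records from the same skewed source would not remove the bias.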

# Data 

It can take on many forms: numbers, images, text, audio, etc. But where does data come from?

1. **Organic/Process Data**: It refers to data that are generated as a result of a process (POS, Stock Market, Search History, etc). It often leads to Big Data that can be "mined" with statistical methods to study trends and relationships.
2. **Designed Data**: It refers to data collected to address a specific research objective via sampling (Tweet Extraction, Census Survey). It is often smaller than organic data and does not necessarily reflect natural trends.

## IID Data 

The data source has direct implications for an important consideration: **IID Data**. This refers to data being independent (each data point does not influence others) and identically distributed (all data points come from the same probability distribution). It is important because: 

1. Under independence, joint probabilities are simply products of the individual probabilities.
2. Under identical distributions, the whole dataset can be described by a single set of summary statistics.
3. Many statistical methods assume IID data in their standard form (Central Limit Theorem, T Test, Regression, etc.).

For Non IID Data: 

1. If Dependent: We need to use methods/models that account for autocorrelation in data (like time-series/clustered models)
2. If Heterogeneous: We need to model data from different distributions separately or include covariates (like mixed effects models)
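One simple way to spot dependence is to look at lag-1 autocorrelation. The sketch below compares simulated IID noise with a random walk built from the same noise; the helper `lag1_autocorrelation` is an illustrative implementation, not a library function.

```python
import random


def lag1_autocorrelation(series):
    """Sample correlation between each value and the one that follows it."""
    n = len(series)
    avg = sum(series) / n
    numerator = sum(
        (series[t] - avg) * (series[t + 1] - avg) for t in range(n - 1)
    )
    denominator = sum((x - avg) ** 2 for x in series)
    return numerator / denominator


random.seed(1)  # fixed seed so the sketch is reproducible

# Independent draws from one distribution: IID by construction.
iid_noise = [random.gauss(0, 1) for _ in range(2000)]

# A random walk accumulates the noise, so each value depends on the last.
random_walk = []
level = 0.0
for step in iid_noise:
    level += step
    random_walk.append(level)

print("IID noise:  ", round(lag1_autocorrelation(iid_noise), 3))
print("Random walk:", round(lag1_autocorrelation(random_walk), 3))
```

A value near zero is consistent with independence, while a value near one signals the kind of dependence that time-series methods are designed to handle.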

## Data Storage Guidelines

1. Database software and tools (e.g. SQL) can be very useful for large-scale data management. Some statistical software can read data directly from a database.  Another approach is to construct a text data file from a database e.g. using SQL.
2. Hadoop and Spark are two popular tools for manipulating very large datasets.
3. HDF5, Apache Parquet, and Apache Arrow are open-source standards for large binary datasets.  Using these formats saves processing time relative to text/csv because fewer conversions are performed when reading and writing the data.
4. Large data sets can be saved in compressed form (e.g. using “gzip”) and read into statistical software directly from the compressed file.  This reduces storage space and can speed up reading when disk or network transfer is the bottleneck.
5. Text/CSV is a better choice than spreadsheet formats (e.g. .xlsx) for data exchange and archiving.
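As a minimal sketch of the compressed-text approach, the example below writes a small CSV file through `gzip` and reads it back without decompressing it to disk first. It uses only the Python standard library; the rows and the file name are made up for illustration.

```python
import csv
import gzip
import os
import tempfile

# Hypothetical records, invented purely for illustration.
rows = [
    {"region": "north", "harvest_tonnes": "120"},
    {"region": "south", "harvest_tonnes": "95"},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "example.csv.gz")

    # Write CSV text straight into a gzip-compressed file.
    with gzip.open(path, "wt", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=["region", "harvest_tonnes"])
        writer.writeheader()
        writer.writerows(rows)

    # Read the rows back directly from the compressed file.
    with gzip.open(path, "rt", newline="", encoding="utf-8") as handle:
        loaded = list(csv.DictReader(handle))

print(loaded)
```

CSV stores every value as text, so numeric columns need to be converted after loading.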

## Data Manipulation Software

Python has an ecosystem of modules (libraries of code) that augment the base language. A library is a collection of functions and data types that can be accessed without having to implement everything yourself from scratch.

Some common libraries useful for statistical analysis are: 


* **[Pandas](http://pandas.pydata.org)**: A library for efficient and easy-to-use data structures and data analysis tools.

* **[Numpy](http://numpy.org)**: A library for working with arrays of data.

* **[Scipy](http://scipy.org)**:  A library of techniques for numerical and scientific computing.

* **[Statsmodels](http://www.statsmodels.org)**:  A library that implements many statistical techniques.

* **[Matplotlib](http://matplotlib.org)**: A library for making graphs.

* **[Seaborn](http://seaborn.pydata.org)**: A higher-level interface to Matplotlib that can be used to simplify many graphing tasks.




```python
# Working with Numpy
import numpy as np

a = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10])  # array function from numpy library
np.mean(a)  # mean function from numpy library (the result is not displayed unless printed)

# 1D array
a = np.array([1, 2, 3])
print("type(a) =", type(a))
print("\na.shape =", a.shape)

# 2D array
b = np.array([[1, 2], [3, 4]])
print("type(b) =", type(b))
print("\nb.shape =", b.shape)

# Array of ones with a given shape
c = np.ones((2, 3))
print("c=\n", c)

# Empty array: memory is allocated but not initialized, so the contents are arbitrary
d = np.empty((2, 5))
print("\nd =\n", d)

# Slicing an array: .copy() makes i independent of later changes to h
h = np.zeros((3, 3))
i = h[0:2, 0:2].copy()
h[0, 0] = 99
print("h =\n", h)
print("\ni =\n", i)
```

```
type(a) = <class 'numpy.ndarray'>

a.shape = (3,)
type(b) = <class 'numpy.ndarray'>

b.shape = (2, 2)
c=
 [[1. 1. 1.]
 [1. 1. 1.]]

d =
 [[0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0.]]
```
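The slicing example above calls `.copy()` for a reason: basic slicing in NumPy returns a view that shares memory with the original array. The sketch below contrasts the two behaviors; the variable names `view` and `independent` are illustrative.

```python
import numpy as np

h = np.zeros((3, 3))
view = h[0:2, 0:2]                # basic slicing returns a view onto h's data
independent = h[0:2, 0:2].copy()  # .copy() allocates separate data

h[0, 0] = 99  # modify the original array after slicing

print("view[0, 0] =", view[0, 0])                # follows the change to h
print("independent[0, 0] =", independent[0, 0])  # keeps the old value
print("view shares memory with h:", np.shares_memory(h, view))
print("copy shares memory with h:", np.shares_memory(h, independent))
```

Views avoid duplicating large arrays, so copying is worthwhile mainly when the slice must stay unchanged while the original is modified.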


# Variables/Features

The variables in data refer to the different aspects of a record on which we have information. Eg: In typical census data, each person is a record, with information on variables like age, occupation, etc.  There can be different types of variables.

## Quantitative Variables

The value of the feature indicates a numerical quantity that can undergo meaningful mathematical operations. They are primarily of two types: 

1. Continuous: The feature value can be any number within an interval. Eg: Age, Height, Time, etc.
2. Discrete: The feature can only take on specific, countable values. Eg: Number of Children.

It is often (but not always) the case that discrete data are represented by the computer with integers, and continuous data are represented by the computer with float values.  In base Python a single "literal" number is stored as an integer or as a "float" value based on whether it is expressed with a decimal point.

```python
# types
# integer
print(type(4))
# base float
print(type(4.))
# numpy float
import numpy as np
numbers = np.r_[2, 3, 4, 5]
print(numbers.mean())
print(type(numbers.mean()))

# float vs integer division
print(3/5)
print(3//5)
```

```
<class 'int'>
<class 'float'>
3.5
<class 'numpy.float64'>
0.6
0
```


## Categorical Variables

The value of the feature indicates a classification into different groups and cannot undergo meaningful mathematical operations. They are primarily of two types:

1. Ordinal: The feature value indicates an order or ranking in groups. Eg: Positions in a Race
2. Nominal: The feature value indicates group names. Eg: Football Teams

In base Python, the Boolean (bool) and String (str) data types are often used to represent nominal values. Ordinal values may be represented by numbers, but it is important to remember that these numbers are codes that convey order only and do not contain any quantitative information.

```python
# boolean
print(type(True))

# string
print(type("This sentence makes sense"))
```

```
<class 'bool'>
<class 'str'>
```
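To illustrate why numeric codes for ordinal values carry order but not quantity, the sketch below uses the codes only for sorting and for locating a middle category. The category names and responses are hypothetical.

```python
from collections import Counter

# Hypothetical ordinal categories, listed from lowest to highest.
levels = ["low", "medium", "high"]
rank = {level: position for position, level in enumerate(levels)}

responses = ["high", "low", "medium", "low", "high", "medium", "medium"]

# Ordering is meaningful for ordinal data, so sorting by rank is fine.
ordered = sorted(responses, key=rank.__getitem__)
median_response = ordered[len(ordered) // 2]

# Counting categories is valid for both nominal and ordinal data.
counts = Counter(responses)

print("ordered:", ordered)
print("median category:", median_response)
print("counts:", dict(counts))
```

Averaging the codes would treat the gap between "low" and "medium" as equal to the gap between "medium" and "high", which the data do not guarantee.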


```python
# None is a special value that is a placeholder representing "no meaningful value".
# None can be compared using "is" or "==", but conventionally "is" is preferred.
print(type(None))
print(None is None)
print(None == None)
```

```
<class 'NoneType'>
True
True
```
