### What is Statistics?
The subject of **“Statistics”** may be called the “science of data” and broadly described as

**A body of methods for**

**(i) Collecting,**

**(ii) Summarizing,**

**(iii) Analyzing, and**

**(iv) Interpreting data.**

#### Collecting Data
How to collect data is the subject of two specialized branches in
statistics, called **“Sampling Techniques”** and **“Design of Experiments”.** 

**Sampling techniques** or methods deal with collecting data in the real world as it exists, namely through opinion
polls, cross-sectional studies, surveys dealing with political and social issues, etc. 
Sampling is an essential part of everyday life and we return to it in the next section. 

The object of **Design of Experiments**, on the other hand, is to design data collection in a more controlled
setting in order to answer specific scientific questions, as in agricultural or clinical trials. 

The goal in either case is to maximize the “information” obtained for a given amount of money
or, conversely, to minimize the cost of attaining a given level of precision. Averages tend to have higher precision than individual observations, a fact we will learn soon. Different methods of collecting information can differ considerably in their efficiency,
so designing a proper experiment is crucial.

#### Summarizing

A large set of numbers does not make sense by itself.
If the numbers are appropriately
summarized, either graphically or numerically, we can figure out some essential features
of the data, such as where the **center** is and how much **spread** there is around this center. 

Providing **graphical** and **numerical** summaries of data (often **sample** data from a larger **population**) is called **descriptive statistics.**
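
As a small illustration of numerical summaries, the sketch below computes two common measures of center (mean and median) and two of spread (standard deviation and range) using Python's standard `statistics` module. The exam scores are made-up numbers chosen only for this example; they do not come from any real data set.

```python
import statistics

# Hypothetical sample of eight exam scores (invented for illustration only).
scores = [62, 71, 75, 78, 80, 84, 90, 100]

center_mean = statistics.mean(scores)      # arithmetic average
center_median = statistics.median(scores)  # middle value of the sorted data
spread_sd = statistics.stdev(scores)       # sample standard deviation (divides by n - 1)
spread_range = max(scores) - min(scores)   # distance between the two extremes

print(f"mean = {center_mean}, median = {center_median}")
print(f"standard deviation = {spread_sd:.2f}, range = {spread_range}")
```

The mean and median describe where the data are centered, while the standard deviation and range describe how far the values spread around that center.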

#### Analyzing and interpreting data

It can be argued that analyzing and interpreting data is the heart of statistics. This is also
sometimes called **inferential statistics** or **statistical inference**, i.e., drawing conclusions
or inferences about the **“population”** after observing only a **subset** – a **“sample”** from it.

### Population, sample and inference

Most often the data statisticians collect is a sample, i.e., a **randomly selected (and hence
representative) subset of the population**. We use the word population to refer to the
“entirety of data one can collect on a topic of interest”.

For instance, if one is interested in the “average family income” in the US, the population here consists of incomes of every
family in the US – over 100 million numbers (after one tackles such non-trivial questions as
what we mean by a “family” and what we mean by “family income”, etc.).

Suppose instead we are interested in the “television rating” for a particular TV show in a given week. The word
population then refers to all the data on time spent by each of the potential viewers watching
this show. 

As a third example, consider the situation where one is interested in figuring out
the “chance of heads” for a given coin. The entirety of data here refers to the results of all
possible tosses with this coin. Since the coin can be flipped forever to obtain more and more
data about this coin, the population is theoretically infinite.

**In all these cases, a sensible thing to do is to take a representative sample from such a
population and try to interpret what this sample has to say regarding the population.** 

This type of generalization, i.e., drawing conclusions about the population from the observed
sample, is called **“statistical inference”.** Statistical inference is a fundamental ingredient in
most scientific advances because in most cases we cannot wait for “all the data to be in”. 

**In some cases, it is patently unwise or impossible to collect all the data, as in the coin-tossing
example.**
Think of what would happen if a manufacturer wanted to determine the “lifetime”
of a certain brand of electric bulb and started to burn each of the bulbs manufactured in
order to get the “totality of data” on the lifetimes. That would not be good for business, since there would
be nothing left to sell!

To recap, **we call a “properly” selected subset of the population a sample, whose representativeness
is guaranteed by selecting the sample units randomly.** 

In most cases, we select a
simple random sample of size n by giving each unit in the population **the same chance of
being selected** into the sample and also by giving every sample of size n from this population
the same chance of being selected. 
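
A simple random sample can be sketched in a few lines of Python. The example below assumes a hypothetical population of 1000 units labelled 0 to 999 and uses `random.sample` from the standard library, which draws without replacement so that every subset of the requested size is equally likely. The seed is fixed only to make the illustration reproducible.

```python
import random

# Hypothetical population: 1000 units labelled 0..999 (an assumption for this sketch).
population = list(range(1000))
n = 10  # desired sample size

rng = random.Random(2024)           # fixed seed, only so the example can be reproduced
sample = rng.sample(population, n)  # n distinct units, every size-n subset equally likely

print(sorted(sample))
```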

The advantages of random sampling include:

(i) reduced cost,

(ii) the possibility of measuring the precision of the sample estimates,

(iii) greater accuracy and scope than are possible with a complete population count,

(iv) sometimes, sampling is the only way, as when the population is infinite (e.g., coin tossing)
or when measurements involve destructive testing (as when studying the lifetimes of light
bulbs).


Sometimes it is known in advance that the characteristic under study is heterogeneous
among the **various subgroups** of the population. For instance, in an agricultural experiment,
say comparing the yields of different varieties of wheat, we may know that certain plots of
land are more fertile than others and will give higher yields. Or in an electoral poll, we
may know that Blacks, Hispanics and Caucasians tend to vote quite differently. In such
cases, it is best to separate out the population into these subgroups, called **“blocks”** in an
experimental design context or **“strata”** in a sampling context, and then randomly select a
proportionate number of units from each subgroup. This idea, termed blocking (or stratification in the sampling context), avoids the
possibility of all fertile plots of land being assigned to a particular variety of wheat, or of our
poll sample being entirely Caucasian, and thus ensures the representativeness of the overall
sample. 
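
The idea of selecting a proportionate number of units from each subgroup can be sketched as follows. This is only an illustration under stated assumptions: the function name `stratified_sample`, the three fertility groups, and their sizes are all hypothetical, and the allocation simply rounds each stratum's share of the total sample.

```python
import random

def stratified_sample(strata, total_n, rng):
    """Draw a simple random sample from each stratum, sized in proportion to the stratum."""
    population_size = sum(len(units) for units in strata.values())
    chosen = {}
    for name, units in strata.items():
        # Proportional allocation; rounding can make the overall total differ slightly from total_n.
        k = round(total_n * len(units) / population_size)
        chosen[name] = rng.sample(units, k)
    return chosen

# Hypothetical plots of land grouped by fertility (labels and counts are invented).
strata = {
    "high": [f"H{i}" for i in range(20)],
    "medium": [f"M{i}" for i in range(50)],
    "low": [f"L{i}" for i in range(30)],
}

picked = stratified_sample(strata, total_n=10, rng=random.Random(0))
for name, units in picked.items():
    print(name, len(units), units)
```

Because each subgroup contributes in proportion to its size, no subgroup can be left out of the sample by chance, which is the point of stratifying.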

**The choice of blocks or strata should be such that all the units within a block are
as homogeneous as possible and there is considerable difference between blocks.** 
A simple example of such blocking occurs later in Section 9.1, in what is referred to as paired data.



Population characteristics **(like its center or spread)** are called **parameters** and are denoted
typically by Greek letters such as μ and σ. On the other hand, a sample characteristic,
which is computed based on the sample values, is called a statistic (yet another meaning
for the word “statistics”, as the plural of the word “statistic”) and is generally denoted
by a letter of the English alphabet, using symbols like x̄ and s. 

Statistical inference typically proceeds
through “estimating” or sometimes “testing hypotheses” i.e., statements about the unknown
parameters, using sample based values, i.e., the statistics. 

Sometimes, the populations may
not be characterized through a few parameters, and the inference for such populations is called
“nonparametric inference”.

![title](1_Intro.png)

Of course, when one tries to generalize from an observed sample to the (incompletely
observed) population, there are bound to be pitfalls. One can never be sure that the conclusions
are quite right and such generalizations may involve errors. 

**The beauty of statistics is
that it allows us to quantify these errors and control them as desired.** 

**Since a larger
sample size typically provides more information about the population, choosing an appropriately large
sample size is one way to reduce the error.**
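
The effect of sample size can be seen in a small simulation. The sketch below is not from the text: it assumes a population in which the true proportion of “yes” answers is 0.5, repeats the sampling 200 times at each of two sample sizes, and measures how much the resulting estimates vary. The helper name `sample_proportion` and all numbers are illustrative choices.

```python
import random
import statistics

def sample_proportion(n, p, rng):
    """Fraction of successes in n simulated yes/no observations."""
    return sum(rng.random() < p for _ in range(n)) / n

rng = random.Random(7)
true_p = 0.5  # assumed population proportion, chosen only for this illustration

spread_by_n = {}
for n in (25, 2500):
    # Repeat the whole sampling exercise 200 times and see how much the estimates vary.
    estimates = [sample_proportion(n, true_p, rng) for _ in range(200)]
    spread_by_n[n] = statistics.stdev(estimates)
    print(f"n = {n:5d}: estimates vary with standard deviation {spread_by_n[n]:.3f}")
```

The estimates from the larger samples cluster much more tightly around the true value, which is what “reducing the error” means here.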

In estimating
the “chance of heads”, recall that the coin can be tossed indefinitely. We are forced to stop
at some finite point, say after 100 tosses. This is a sample of size 100, and suppose the data
collected looks like this (with H and T standing for Heads and Tails respectively):

H, H, T, H, ..., T.

Suppose there are altogether 46 heads (and the other 54 are tails) in this sample of 100
tosses. It appears reasonable to declare the observed proportion of heads in our sample,
namely 46/100 = 0.46, as our best guess or estimate of the “chance of heads”. How certain
are we? Of course there is no guarantee that if we tossed the same coin another 100 times,
we would again get 46 heads and 54 tails. 

**This kind of variation from sample to sample is
referred to later on as sampling variability.**
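
Sampling variability is easy to observe by simulation. The sketch below repeats the 100-toss experiment five times with a simulated coin; the true chance of heads is set to 0.5 purely as an assumption (for a real coin it is unknown), and the function name `proportion_of_heads` is a hypothetical helper.

```python
import random

def proportion_of_heads(n_tosses, p_heads, rng):
    """Toss a simulated coin n_tosses times and return the fraction of heads."""
    heads = sum(rng.random() < p_heads for _ in range(n_tosses))
    return heads / n_tosses

rng = random.Random(1)  # fixed seed, only so the illustration can be reproduced

# Assumption: the true chance of heads is 0.5; in practice it is the unknown parameter.
estimates = [proportion_of_heads(100, 0.5, rng) for _ in range(5)]
print(estimates)
```

Each run of 100 tosses generally gives a somewhat different proportion, even though the coin itself never changes.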

How different can that other estimate be from
0.46? If we can declare something like, “Well, we have 0.46 for our estimate but if we repeat
this again, 90% of the time it will be within 0.08 of 0.46”, i.e., the error in our estimate is
no more than 0.08, with 90% chance, that would be somewhat reassuring. This is the type
of errors we are talking about measuring. 

**This type of measurement of errors is based on
ideas of probability.**
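
The text does not say how the figure of 0.08 was obtained. One standard calculation that gives a value close to it is the normal-approximation margin of error for a proportion, sketched below as an assumption rather than as the author's method. It uses the observed proportion 0.46 from 100 tosses and the standard normal quantile that leaves 5% in each tail.

```python
import math
from statistics import NormalDist

n = 100
p_hat = 46 / n  # observed proportion of heads in the sample

# Estimated standard error of a sample proportion.
standard_error = math.sqrt(p_hat * (1 - p_hat) / n)

# Standard normal quantile for a central 90% range (about 1.645).
z = NormalDist().inv_cdf(0.95)

margin = z * standard_error
print(f"estimate = {p_hat:.2f}, 90% margin of error = {margin:.3f}")
```

Under this approximation the margin comes out at a little over 0.08, in line with the statement in the text.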

### [Q] How does descriptive statistics differ from inferential statistics?

Providing graphical and numerical summaries of data (often sample data from a larger population) is called descriptive statistics.

Analyzing and interpreting data is the heart of statistics. This is also sometimes called inferential statistics or statistical inference, i.e., drawing conclusions or inferences about the “population” after observing only a subset – a “sample” from it.