<img src="../../images/banners/intoroduction.png" width="600"/>

# <img src="../../images/logos/statistics-logo.jpeg" width="27"/> AI Glossary 


## <img src="../../images/logos/toc.png" width="20"/> Table of Contents 
* [Descriptive Statistics](#descriptive_statistics)
* [Inferential Statistics](#inferential_statistics)
* [Population vs. Sample](#population_vs._sample)
    * [Surveys (Random Sampling) vs. Experiments (Random Assignment)](#surveys_(random_sampling)_vs._experiments_(random_assignment))
* [Three Types of Data](#three_types_of_data)
* [Levels of Measurement](#levels_of_measurement)
    * [Qualitative Data and Nominal Measurement](#qualitative_data_and_nominal_measurement)
    * [Ranked Data and Ordinal Measurement](#ranked_data_and_ordinal_measurement)
    * [Quantitative Data and Interval/Ratio Measurement](#quantitative_data_and_interval/ratio_measurement)
    * [Summary](#summary)
* [Types of Variables](#types_of_variables)
    * [Discrete and Continuous Variables](#discrete_and_continuous_variables)
    * [Independent and Dependent Variables](#independent_and_dependent_variables)
* [Observational Studies](#observational_studies)
* [Confounding Variable](#confounding_variable)

---

<i class="fas fa-book" style="color:green !important;"></i> Reference: Statistics (Eleventh Edition) by Robert S. Witte

Statistics is the discipline that concerns the collection, organization, analysis, interpretation, and presentation of data.

<a class="anchor" id="descriptive_statistics"></a>

## Descriptive Statistics
Statistics exists because of the prevalence of variability in the real world. In its simplest form, known as descriptive statistics, statistics provides us with tools (tables, graphs, averages, ranges, correlations) for organizing and summarizing the inevitable variability in collections of actual observations or scores.

Examples:
1. A graph showing the annual change in global temperature during the last 30 years
2. A report that describes the average difference in grade point average (GPA) between college students who regularly drink alcoholic beverages and those who don’t.
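
As a rough illustration of the second example, the sketch below summarizes two small groups of GPAs with Python's standard `statistics` module. The GPA values, the group names, and the `describe` helper are made up for illustration only.

```python
import statistics

# Hypothetical GPAs for two small groups of students (made-up numbers).
drinkers = [2.8, 3.1, 2.5, 3.4, 2.9]
non_drinkers = [3.2, 3.6, 3.0, 3.8, 3.4]

def describe(scores):
    """Return a few common descriptive summaries for a list of scores."""
    return {
        "mean": statistics.mean(scores),
        "median": statistics.median(scores),
        "range": max(scores) - min(scores),
    }

summary_drinkers = describe(drinkers)
summary_non_drinkers = describe(non_drinkers)

# Average difference in GPA between the two groups.
mean_difference = summary_non_drinkers["mean"] - summary_drinkers["mean"]

print(summary_drinkers)
print(summary_non_drinkers)
print(f"Average GPA difference: {mean_difference:.2f}")  # 0.46
```

These summaries only describe the ten scores at hand; they make no claim about students in general.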

<a class="anchor" id="inferential_statistics"></a>

## Inferential Statistics
Statistics also provides tools (a variety of tests and estimates) for generalizing beyond collections of actual observations. This more advanced area is known as inferential statistics. Tools from inferential statistics permit us to use a relatively small collection of actual observations to evaluate claims about a much larger collection of potential observations.

Examples:
1. A researcher’s hypothesis that, on average, meditators report fewer headaches than do nonmeditators
2. An assertion about the relationship between job satisfaction and overall happiness.

<img src="./images/what-is-statistics/descriptive-inferential.webp" alt="descriptive-inferential" width=400 align="left" />

<a class="anchor" id="population_vs._sample"></a>

## Population vs. Sample
Inferential statistics is concerned with generalizing beyond sets of actual observations, that is, with generalizing from a sample to a population. In statistics, a population refers to any complete collection of observations or potential observations, whereas a sample refers to any smaller collection of actual observations drawn from a population.

<img src="../images/sample-population.png" alt="sample-population" width=500 align="left" />
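
To make the distinction concrete, here is a minimal sketch that simulates a population and draws a sample from it. The population size, the normal distribution of scores, and the seed are all assumptions chosen for illustration.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical population: simulated anxiety scores for 10,000 students.
population = [rng.gauss(50, 10) for _ in range(10_000)]

# A sample is a smaller collection of observations drawn from the population.
sample = rng.sample(population, 100)

population_mean = statistics.mean(population)
sample_mean = statistics.mean(sample)

print(f"Population mean: {population_mean:.2f}")
print(f"Sample mean:     {sample_mean:.2f}")
```

In practice the population mean is usually unknown; the sample mean serves as an estimate of it, which is the kind of generalization inferential statistics supports.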

<a class="anchor" id="surveys_(random_sampling)_vs._experiments_(random_assignment)"></a>

### Surveys (Random Sampling) vs. Experiments (Random Assignment)

**Random sampling (Survey)** is a procedure designed to ensure that each potential observation in the population has an equal chance of being selected in a survey.

Estimating the average anxiety score for all college students probably would not generate much interest. Instead, we might be interested in determining whether relaxation training causes, on average, a reduction in anxiety scores between two groups of otherwise similar college students.

College students in the relaxation experiment probably are not a random sample from any intact population of interest, but rather a convenience sample consisting of volunteers from a limited pool of students fulfilling a course requirement. Accordingly, our focus shifts from random sampling to the random assignment of volunteers to the two groups.

**Random assignment (Experiment)** is a procedure designed to ensure that each person has an equal chance of being assigned to any group in an experiment.

<img src="../images/assignment-survey.png" alt="assignment-survey" width=500 align="left" />
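
The two procedures can be sketched with Python's `random` module. The roster, the group sizes, and the choice of the first 20 students as volunteers are hypothetical placeholders.

```python
import random

rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Hypothetical roster of 1,000 students; the names are placeholders.
population = [f"student_{i}" for i in range(1, 1001)]

# Random sampling (survey): every student has an equal chance of selection.
survey_sample = rng.sample(population, 50)

# Random assignment (experiment): start from volunteers (a convenience
# sample), then shuffle so that each volunteer is equally likely to end
# up in either group.
volunteers = population[:20]  # e.g., the first 20 students who signed up
shuffled = volunteers[:]
rng.shuffle(shuffled)
half = len(shuffled) // 2
relaxation_group = shuffled[:half]
control_group = shuffled[half:]

print(len(survey_sample), len(relaxation_group), len(control_group))  # 50 10 10
```

Note that the volunteers are not a random sample of the population; only their assignment to groups is random.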

Indicate whether each of the following terms is associated primarily with a survey (S) or an experiment (E).

- Random assignment
- Representative
- Generalization to the population
- Control group
- Real difference
- Random selection
- Volunteers

<a class="anchor" id="three_types_of_data"></a>

## Three Types of Data

The precise form of a statistical analysis often depends on whether data are **qualitative**, **ranked**, or **quantitative**.

- **Qualitative data** consist of words (Yes or No), letters (Y or N), or numerical codes (0 or 1) that represent a class or category.
- **Ranked data** consist of numbers (1st, 2nd, . . . 40th place) that represent relative standing within a group.
- **Quantitative data** consist of numbers (weights of 238, 170, . . . 185 lbs) that represent an amount or a count.

Indicate whether each of the following terms is qualitative, ranked, or quantitative.

- Ethnic group
- Age
- Family size
- Academic major
- Sexual preference
- IQ score
- Net worth (dollars)
- Gender
- Temperature

<a class="anchor" id="levels_of_measurement"></a>

## Levels of Measurement

The level of measurement specifies the extent to which a number (or word or letter) actually represents some attribute and, therefore, has implications for the appropriateness of various arithmetic operations and statistical procedures.

For our purposes, there are three levels of measurement—**nominal**, **ordinal**, and **interval/ratio**—and these levels are paired with **qualitative**, **ranked**, and **quantitative** data, respectively.

<a class="anchor" id="qualitative_data_and_nominal_measurement"></a>

### Qualitative Data and Nominal Measurement

If people are classified as either male or female (or coded as 1 or 2), the data are qualitative and measurement is nominal. The single property of nominal measurement is **classification**—that is, sorting observations into different classes or categories.

**A distinctive feature of nominal measurement is its bare-bones representation of any attribute**. For instance, a student is either male or female. Even with the introduction of arbitrary numerical codes, such as 1 for male and 2 for female, it would never be appropriate to claim that, because female is 2 and male is 1, females have twice as much gender as males. Similarly, **calculating an average with these numbers would be meaningless**.
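
A short sketch of why averaging nominal codes is meaningless: counts per category survive a relabeling of the codes, but the "average" does not. The observations and the code-to-label mapping below are hypothetical.

```python
from collections import Counter
import statistics

# Arbitrary numerical codes for a nominal variable (hypothetical data).
codes = {1: "male", 2: "female"}
observations = [1, 2, 2, 1, 2, 2, 1, 2]

# Classification is meaningful: count how many fall in each category.
counts = Counter(codes[c] for c in observations)
most_common_category = codes[statistics.mode(observations)]

# The mean can be computed, but it depends entirely on the arbitrary codes.
mean_original = statistics.mean(observations)   # 1.625
recoded = [3 - c for c in observations]         # swap the codes 1 and 2
mean_recoded = statistics.mean(recoded)         # 1.375

print(counts["male"], counts["female"])  # 3 5
print(most_common_category)              # female
print(mean_original, mean_recoded)       # 1.625 1.375
```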

<a class="anchor" id="ranked_data_and_ordinal_measurement"></a>

### Ranked Data and Ordinal Measurement

When any single number indicates only relative standing, such as first, second, or tenth place in a horse race or in a class of graduating seniors, the data are ranked and the level of measurement is ordinal. The distinctive property of ordinal measurement is **order**.

Since ordinal measurement fails to reflect the actual distance between adjacent ranks, **simple arithmetic operations with ranks are inappropriate**. For example, it’s inappropriate to conclude that the arithmetic mean of ranks 1 and 3 equals rank 2, since this assumes that the actual distance between ranks 1 and 2 equals the distance between ranks 2 and 3.
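
The point about unequal distances between ranks can be checked with a small sketch. The runners and finish times are invented for illustration.

```python
# Hypothetical finish times (in seconds) for three runners.
finish_times = {"A": 60.0, "B": 61.0, "C": 75.0}

# Rank runners from fastest (1st) to slowest (3rd).
ordered = sorted(finish_times, key=finish_times.get)
ranks = {runner: place for place, runner in enumerate(ordered, start=1)}

# Adjacent ranks are always one apart, but the underlying gaps differ.
gap_1st_to_2nd = finish_times[ordered[1]] - finish_times[ordered[0]]
gap_2nd_to_3rd = finish_times[ordered[2]] - finish_times[ordered[1]]

print(ranks)           # {'A': 1, 'B': 2, 'C': 3}
print(gap_1st_to_2nd)  # 1.0
print(gap_2nd_to_3rd)  # 14.0
```

The mean of ranks 1 and 3 is 2, yet the mean of the corresponding times (67.5 seconds) is far from the second-place time of 61 seconds.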

<a class="anchor" id="quantitative_data_and_interval/ratio_measurement"></a>

### Quantitative Data and Interval/Ratio Measurement

The distinctive properties of interval/ratio measurement are **equal intervals** and a **true zero**.

- **Equal intervals** imply that hefting a 10-lb weight while on the bathroom scale always registers your actual weight plus 10 lbs.

- A **true zero** signifies that the bathroom scale registers 0 when not in use, that is, when weight is completely absent.

**In the absence of a true zero, numbers, much like the exposed tips of icebergs, fail to reflect the total amount being measured**. For example, a reading of 0 on the Fahrenheit temperature scale does not reflect the complete absence of heat, that is, the absence of any molecular motion. In fact, true zero equals −459.67°F on this scale. It would be inappropriate, therefore, to claim that 80°F is twice as hot as 40°F. An appropriate claim could be salvaged by adding 459.67° to each of these numbers: 80° becomes 539.67° and 40° becomes 499.67° above true zero. **Clearly, 539.67° is not twice as hot as 499.67°**.
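
The arithmetic behind this argument can be sketched as follows. The constant `ABSOLUTE_ZERO_F` uses the conventional value of −459.67°F and, like the helper name `above_absolute_zero`, is an assumption of this sketch.

```python
ABSOLUTE_ZERO_F = -459.67  # conventional value; an assumption of this sketch

def above_absolute_zero(temp_f):
    """Degrees above absolute zero, i.e., on a scale with a true zero."""
    return temp_f - ABSOLUTE_ZERO_F

# Ratio taken directly on the Fahrenheit scale (no true zero).
naive_ratio = 80 / 40

# Ratio after shifting both readings to a scale with a true zero.
true_ratio = above_absolute_zero(80) / above_absolute_zero(40)

print(f"Naive ratio: {naive_ratio:.2f}")                # 2.00
print(f"Ratio on a true-zero scale: {true_ratio:.2f}")  # 1.08
```

On the shifted scale, 80°F is only about 8 percent "hotter" than 40°F, not 100 percent.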

In the absence of a true zero, it would be inappropriate to claim that an IQ score of 140 represents twice as much intellectual aptitude as an IQ score of 70.

Other interpretations are possible. One possibility is to treat IQ scores as attaining only ordinal measurement—that is, for example, a score of 140 represents more intellectual aptitude than a score of 130—without specifying the actual size of this difference.

<img src="../images/data-types.png" alt="data-types" width=500 align="left" />

<a class="anchor" id="types_of_variables"></a>

## Types of Variables

A variable is a characteristic or property that can take on different values.

<a class="anchor" id="discrete_and_continuous_variables"></a>

### Discrete and Continuous Variables

- A **discrete variable** consists of isolated numbers separated by gaps.
- A **continuous variable** consists of numbers whose values, at least in theory, have no restrictions.

<a class="anchor" id="independent_and_dependent_variables"></a>

### Independent and Dependent Variables

- In an experiment, an **independent variable** is the treatment manipulated by the investigator.
- When a variable is believed to have been **influenced by the independent variable**, it is called a **dependent variable**. 

Unlike the independent variable, the dependent variable isn’t manipulated by the investigator. Instead, it represents an outcome: the data produced by the experiment.

<img src="./images/what-is-statistics/dependent-independent-var.jpeg" width=500/>

With just a little practice, you should be able to identify these two types of variables. In an experiment, what is being manipulated by the investigator at the outset and, therefore, qualifies as the independent variable? What is measured, counted, or recorded by the investigator at the completion of the study and, therefore, qualifies as the dependent variable? Once these two variables have been identified, they can be used to describe the problem posed by the study; that is, does the independent variable cause a change in the dependent variable?

<a class="anchor" id="observational_studies"></a>

## Observational Studies

Instead of undertaking an experiment, an investigator might simply observe the relation between two variables. For example, a sociologist might collect paired measures of poverty level and crime rate for each individual in some group. If a statistical analysis reveals that these two variables are related or correlated, then, given some person’s poverty level, the sociologist can better predict that person’s crime rate or vice versa. Having established the existence of this relationship, however, the sociologist can only speculate about cause and effect. Poverty might cause crime or vice versa. On the other hand, both poverty and crime might be caused by one or some combination of more basic variables, such as inadequate education, racial discrimination, unstable family environment, and so on. Such studies are often referred to as observational studies. **An observational study focuses on detecting relationships between variables not manipulated by the investigator, and it yields less clear-cut conclusions about cause-effect relationships than does an experiment**.

<img src="./images/what-is-statistics/correlation-causation-2.jpeg" width=500/>
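
One common way to quantify how strongly two observed variables are related is the Pearson correlation coefficient. The sketch below computes it by hand; the choice of this particular measure, the `pearson_r` helper, and the paired scores are assumptions made for illustration.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient for two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products of deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations for each variable.
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(ss_x * ss_y)

# Hypothetical paired measures for five individuals (made-up scores).
poverty_level = [1, 2, 3, 4, 5]
crime_rate = [2, 4, 5, 4, 5]

r = pearson_r(poverty_level, crime_rate)
print(f"r = {r:.2f}")  # r = 0.77
```

A sizable value of `r` indicates that the variables are related, but by itself it says nothing about which variable, if either, causes the other.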

For example, in an observational study of the relationship between couples’ active-listening scores and their number of breakdowns in communication, couples already possessing high active-listening scores might also tend to be more seriously committed to each other, and this more serious commitment itself might cause both the higher active-listening scores and the fewer breakdowns in communication. In this case, any special training in active listening, without regard to the existing degree of a couple’s commitment, would not reduce the number of breakdowns in communication.

<img src="../images/obs-experiment.png" alt="observation-experiment" width=500 />

<a class="anchor" id="confounding_variable"></a>

## Confounding Variable

Whenever groups differ not just because of the independent variable but also because some uncontrolled variable co-varies with the independent variable, any conclusion about a cause-effect relationship is suspect. If, instead of random assignment, each couple in an experiment is free to choose whether to undergo special training in active listening or to be in the less demanding control group, any conclusion must be qualified. A difference between groups might be due not to the independent variable but to a confounding variable. For instance, couples willing to devote extra effort to special training might already possess a deeper commitment that co-varies with more active-listening skills. **An uncontrolled variable that compromises the interpretation of a study is known as a confounding variable**. You can avoid confounding variables, as in the present case, by assigning subjects randomly to the various groups in the experiment and also by standardizing all experimental conditions, other than the independent variable, for subjects in both groups.

<img src="./images/what-is-statistics/confounding-variable.webp" width=500 />
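
The following simulation is a minimal sketch of how self-selection can manufacture a group difference. Everything in it is assumed for illustration: the commitment scores, the formula for breakdowns, and above all the premise that training has no effect at all.

```python
import random
import statistics

rng = random.Random(1)  # fixed seed so the sketch is repeatable

# Simulated couples. Commitment is the uncontrolled (confounding) variable.
# By construction, training has NO effect on breakdowns in this sketch.
couples = []
for _ in range(2000):
    commitment = rng.gauss(0, 1)
    breakdowns = 10 - 2 * commitment + rng.gauss(0, 1)
    couples.append((commitment, breakdowns))

def gap(trained, control):
    """Mean breakdowns in the control group minus the trained group."""
    control_mean = statistics.mean([b for _, b in control])
    trained_mean = statistics.mean([b for _, b in trained])
    return control_mean - trained_mean

# Self-selection: the more committed couples opt into training.
self_trained = [c for c in couples if c[0] > 0]
self_control = [c for c in couples if c[0] <= 0]

# Random assignment: shuffle, then split in half.
shuffled = couples[:]
rng.shuffle(shuffled)
random_trained = shuffled[:1000]
random_control = shuffled[1000:]

self_selected_gap = gap(self_trained, self_control)
random_gap = gap(random_trained, random_control)

print(f"Gap with self-selection:   {self_selected_gap:.2f}")
print(f"Gap with random assignment: {random_gap:.2f}")
```

With self-selection, the trained group shows noticeably fewer breakdowns even though training does nothing here; with random assignment, commitment is spread across both groups and the gap stays close to zero.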

Sometimes a confounding variable occurs because it’s impossible to assign subjects randomly to different conditions. For instance, if we’re interested in possible differences in active-listening skills between males and females, we can’t assign the subject’s gender randomly. Consequently, any difference between these two preexisting groups must be interpreted cautiously. For example, if females, on average, are better listeners than males, this difference could be caused by confounding variables that co-vary with gender, such as preexisting disparities in active-listening skills attributable not merely to gender, but also to cultural stereotypes, social training, vocational interests, academic majors, and so on.


