# Chapter 1: Statistical Thinking for Programmers

Examples and Exercises from Think Stats, 2nd Edition

http://thinkstats2.com

Copyright 2016 Allen B. Downey

MIT License: https://opensource.org/licenses/MIT


## Notes:
1. **Probability** - The study of random events.
2. **Statistics** - The discipline of using data samples to support claims about populations. Most statistical analysis is based on probability, which is why these pieces are usually presented together.
3. **Computation** - A tool that is well-suited to quantitative analysis; computers are commonly used to process statistics.

## Fallacies and Biases
1. **Selection Bias** - Only a certain subset of the population is sampled, for example people who experienced a particular event or belong to a particular demographic
2. **Confirmation Bias** - Citing examples that support a hypothesis while ignoring the counterevidence
3. **Small Sample Size** - Too few observations to draw a reliable conclusion
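
The effect of a small sample size can be illustrated with a short simulation. This is only a sketch under stated assumptions: the population below is synthetic (normally distributed values with an arbitrary mean and spread), and the helper `spread_of_sample_means` is a hypothetical name introduced for this example, not part of `thinkstats2`.

```python
import numpy as np

rng = np.random.default_rng(seed=17)  # fixed seed so the run is repeatable

# Synthetic population (an assumption for illustration, not NSFG data).
population = rng.normal(loc=7.3, scale=1.4, size=100_000)


def spread_of_sample_means(sample_size, trials=1000):
    """Return the standard deviation of the sample mean over repeated samples."""
    means = [
        rng.choice(population, size=sample_size).mean() for _ in range(trials)
    ]
    return np.std(means)


small = spread_of_sample_means(5)
large = spread_of_sample_means(500)

print(f"spread of the mean with n=5:   {small:.3f}")
print(f"spread of the mean with n=500: {large:.3f}")
```

The mean of a five-element sample varies much more from one sample to the next than the mean of a 500-element sample, which is why conclusions drawn from a handful of observations are unreliable.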

## Steps | Terms | Glossary | Definitions
1. **Data collection** - Finding or sampling data from relevant source
2. **Descriptive Statistics** - Generate statistics that summarize the data concisely, and evaluate different ways to visualize the data
3. **Exploratory data analysis** - We will look for patterns, differences, and other features that address the questions we are interested in. At the same time we will check for inconsistencies and identify limitations
4. **Hypothesis testing** - Where we see apparent effects, like a difference between two groups, we will evaluate whether the effect is real, or whether it might have happened by chance
5. **Estimation** - We will use data from a sample to estimate characteristics of the general population
6. **Cross-sectional Study** - Captures a snapshot of a group at a point in time
7. **Longitudinal Study** - Observes a group repeatedly over a period of time
8. **Population** - All instances of the target group. The objective of surveys, sampling, and statistics is to draw conclusions about the population.
9. **Respondents** - People who participate in the survey
10. **Cohort** - Group of respondents
11. **Representative Survey** - In general, cross-sectional studies are meant to be representative, which means that every member of the target population has an equal chance of participating. That ideal is hard to achieve in practice, but people who conduct surveys come as close as they can.
12. **Oversampling** - Some groups are recruited or sampled at higher rates than their actual proportion in the population.
    1. The reason for oversampling is to make sure that the number of respondents in each of these groups is large enough to draw valid statistical inferences.
    2. The drawback of oversampling is that it is not as easy to draw conclusions about the general population based on statistics from the survey.
13. **Table** - Collection of records | instances | rows
14. **Fields** - Columns | Features | Variables | Each column of Excel sheet
15. **Record** - Instance | Row | Each line or row of Excel sheet
16. **DataFrame** - The fundamental data structure provided by pandas, which is a Python data and statistics package we’ll use
    1. In addition to the data, a DataFrame also contains the variable names and their types, and it provides methods for accessing and modifying the data.
17. **Data Cleaning** - 
    1. Check for errors, missing values
    2. Deal with special values - Outliers, incorrect data
    3. Convert data into different formats - string to datetime
    4. Perform calculations - imputing missing values, standardisation, normalisation, converting kilograms and pounds to the same unit
18. **Data Validation** - When data is exported from one software environment and imported into another, errors might be introduced.
    1. Incorrect data interpretation
    2. Method to validate: compute basic statistics and compare them with published results
19. **Data Interpretation** -  To work with data effectively, you have to think on two levels at the same time: the level of statistics and the level of context.
20. **Anecdotal Evidence** - Evidence, often personal, that is collected casually rather than by a well-designed study.
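
The data cleaning and data validation steps above can be sketched on a toy table. Everything here is an assumption made for illustration: the column names, the use of 97, 98 and 99 as special codes, and the "published" count are hypothetical and are not taken from the NSFG files.

```python
import pandas as pd

# Hypothetical raw records; 97, 98 and 99 stand in for special codes
# such as "not ascertained", "refused" and "don't know".
raw = pd.DataFrame(
    {
        "weight_lb": [7, 8, 99, 6, 97],
        "weight_oz": [4, 0, 3, 98, 12],
    }
)

special_codes = [97, 98, 99]
clean = raw.copy()

# Deal with special values: mask() turns the flagged entries into NaN.
for col in ["weight_lb", "weight_oz"]:
    clean[col] = clean[col].mask(clean[col].isin(special_codes))

# Perform a calculation: combine pounds and ounces into a single column.
clean["total_lb"] = clean["weight_lb"] + clean["weight_oz"] / 16.0

# Validate: compare a basic statistic with a (hypothetical) published value.
published_missing = 3
n_missing = clean["total_lb"].isnull().sum()
print("missing weights:", n_missing, "published:", published_missing)
```

If the computed count disagreed with the published one, that would suggest the data were read or interpreted incorrectly somewhere along the way.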

In [1]:
from os.path import basename, exists


def download(url):
    filename = basename(url)
    if not exists(filename):
        from urllib.request import urlretrieve

        local, _ = urlretrieve(url, filename)
        print("Downloaded " + local)


download("https://github.com/AllenDowney/ThinkStats2/raw/master/code/thinkstats2.py")
download("https://github.com/AllenDowney/ThinkStats2/raw/master/code/thinkplot.py")

In [2]:
download("https://github.com/AllenDowney/ThinkStats2/raw/master/code/nsfg.py")

download("https://github.com/AllenDowney/ThinkStats2/raw/master/code/2002FemPreg.dct")
download(
    "https://github.com/AllenDowney/ThinkStats2/raw/master/code/2002FemPreg.dat.gz"
)

## Examples from Chapter 1

Read NSFG data into a Pandas DataFrame.

In [3]:
import nsfg

In [4]:
preg = nsfg.ReadFemPreg()
preg.head()

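|   | caseid | pregordr | howpreg_n | howpreg_p | moscurrp | nowprgdk | pregend1 | pregend2 | nbrnaliv | multbrth | ... | laborfor_i | religion_i | metro_i | basewgt | adj_mod_basewgt | finalwgt | secu_p | sest | cmintvw | totalwgt_lb |
|---|--------|----------|-----------|-----------|----------|----------|----------|----------|----------|----------|-----|------------|------------|---------|---------|-----------------|----------|--------|------|---------|-------------|
| 0 | 1 | 1 | NaN | NaN | NaN | NaN | 6.0 | NaN | 1.0 | NaN | ... | 0 | 0 | 0 | 3410.389399 | 3869.349602 | 6448.271112 | 2 | 9 | NaN | 8.8125 |
| 1 | 1 | 2 | NaN | NaN | NaN | NaN | 6.0 | NaN | 1.0 | NaN | ... | 0 | 0 | 0 | 3410.389399 | 3869.349602 | 6448.271112 | 2 | 9 | NaN | 7.875 |
| 2 | 2 | 1 | NaN | NaN | NaN | NaN | 5.0 | NaN | 3.0 | 5.0 | ... | 0 | 0 | 0 | 7226.30174 | 8567.54911 | 12999.542264 | 2 | 12 | NaN | 9.125 |
| 3 | 2 | 2 | NaN | NaN | NaN | NaN | 6.0 | NaN | 1.0 | NaN | ... | 0 | 0 | 0 | 7226.30174 | 8567.54911 | 12999.542264 | 2 | 12 | NaN | 7.0 |
| 4 | 2 | 3 | NaN | NaN | NaN | NaN | 6.0 | NaN | 1.0 | NaN | ... | 0 | 0 | 0 | 7226.30174 | 8567.54911 | 12999.542264 | 2 | 12 | NaN | 6.1875 |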


Print the column names.

In [5]:
preg.columns

Index(['caseid', 'pregordr', 'howpreg_n', 'howpreg_p', 'moscurrp', 'nowprgdk',
       'pregend1', 'pregend2', 'nbrnaliv', 'multbrth',
       ...
       'laborfor_i', 'religion_i', 'metro_i', 'basewgt', 'adj_mod_basewgt',
       'finalwgt', 'secu_p', 'sest', 'cmintvw', 'totalwgt_lb'],
      dtype='object', length=244)

Select a single column name.

In [6]:
preg.columns[1]

'pregordr'

Select a column and check what type it is.

In [7]:
pregordr = preg['pregordr']
type(pregordr)

pandas.core.series.Series

Print a column.

In [8]:
pregordr

0        1
1        2
2        1
3        2
4        3
        ..
13588    1
13589    2
13590    3
13591    4
13592    5
Name: pregordr, Length: 13593, dtype: int64

Select a single element from a column.

In [9]:
pregordr[0]

1

Select a slice from a column.

In [10]:
pregordr[2:5]

2    1
3    2
4    3
Name: pregordr, dtype: int64
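
Element and slice selection as shown above rely on the default integer index, where labels and positions coincide. The following sketch, on a toy `Series` with made-up values, shows the explicit `iloc` (position) and `loc` (label) accessors, which may be a safer choice once rows have been filtered or reordered.

```python
import pandas as pd

# Toy column with the default index 0..4 (values chosen for illustration).
s = pd.Series([1, 2, 1, 2, 3], name="pregordr")

first = s.iloc[0]      # first element by position
same = s.loc[0]        # element whose index label is 0
window = s.iloc[2:5]   # positions 2, 3 and 4

# After filtering, the surviving rows keep their original labels (1, 3, 4),
# so label 0 no longer exists and filtered[0] would raise a KeyError.
filtered = s[s > 1]
first_remaining = filtered.iloc[0]

print(first, same, window.tolist(), first_remaining)
```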

Select a column using dot notation.

In [11]:
pregordr = preg.pregordr

Count the number of times each value occurs.

In [12]:
preg.outcome.value_counts().sort_index()

1    9148
2    1862
3     120
4    1921
5     190
6     352
Name: outcome, dtype: int64
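
As a small illustration of why `sort_index` is chained after `value_counts`: by default `value_counts` orders its result by frequency, and `sort_index` reorders it by the values themselves. The codes below are made up for the sketch.

```python
import pandas as pd

# Toy outcome codes (hypothetical values, not NSFG data).
codes = pd.Series([1, 4, 1, 2, 1, 4], name="outcome")

counts = codes.value_counts()    # most frequent value first
by_code = counts.sort_index()    # ordered by code instead

# normalize=True reports fractions rather than raw counts.
fractions = codes.value_counts(normalize=True).sort_index()

print(by_code)
print(fractions)
```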

Check the values of another variable.

In [13]:
preg.birthwgt_lb.value_counts().sort_index()

0.0        8
1.0       40
2.0       53
3.0       98
4.0      229
5.0      697
6.0     2223
7.0     3049
8.0     1889
9.0      623
10.0     132
11.0      26
12.0      10
13.0       3
14.0       3
15.0       1
Name: birthwgt_lb, dtype: int64

Make a dictionary that maps from each respondent's `caseid` to a list of indices into the pregnancy `DataFrame`.  Use it to select the pregnancy outcomes for a single respondent.

In [14]:
caseid = 10229
preg_map = nsfg.MakePregMap(preg)
indices = preg_map[caseid]
preg.outcome[indices].values

array([4, 4, 4, 4, 4, 4, 1])
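
One way such a map could be built is sketched below. This is an illustration only: `make_index_map` is a hypothetical helper written for this example and is not necessarily how `nsfg.MakePregMap` is implemented; the toy table uses made-up values.

```python
from collections import defaultdict

import pandas as pd


def make_index_map(df, key="caseid"):
    """Map each value in df[key] to the list of row labels where it occurs."""
    d = defaultdict(list)
    for index, value in df[key].items():
        d[value].append(index)
    return d


# Toy pregnancy table: respondent 1 has two rows, respondent 2 has three.
toy = pd.DataFrame({"caseid": [1, 1, 2, 2, 2], "outcome": [1, 1, 4, 1, 1]})

toy_map = make_index_map(toy)
outcomes = toy["outcome"].loc[toy_map[2]].values

print(toy_map[2], outcomes)
```

Building the dictionary once makes repeated lookups cheap, compared with scanning the whole column for every respondent.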

## Exercises

Select the `birthord` column, print the value counts, and compare to results published in the [codebook](ftp://ftp.cdc.gov/pub/Health_Statistics/NCHS/Dataset_Documentation/NSFG/Cycle6Codebook-Pregnancy.pdf)

We can also use `isnull` to count the number of NaNs.

In [16]:
preg.birthord.isnull().sum()
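
A separate count of missing values is useful because `value_counts` leaves out NaN by default. The sketch below shows this on a toy `Series` with made-up values; passing `dropna=False` includes the missing entries in the tally.

```python
import numpy as np
import pandas as pd

# Toy column with two missing entries (values chosen for illustration).
toy_birthord = pd.Series([1.0, 2.0, np.nan, 1.0, np.nan])

n_missing = toy_birthord.isnull().sum()
counted = toy_birthord.value_counts().sum()                # NaN excluded
counted_all = toy_birthord.value_counts(dropna=False).sum()  # NaN included

print(n_missing, counted, counted_all)
```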

Select the `prglngth` column, print the value counts, and compare to results published in the [codebook](ftp://ftp.cdc.gov/pub/Health_Statistics/NCHS/Dataset_Documentation/NSFG/Cycle6Codebook-Pregnancy.pdf)

To compute the mean of a column, you can invoke the `mean` method on a Series.  For example, here is the mean birthweight in pounds:

In [18]:
preg.totalwgt_lb.mean()

Create a new column named `totalwgt_kg` that contains birth weight in kilograms.  Compute its mean.  Remember that when you create a new column, you have to use dictionary syntax, not dot notation.
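
As a hint, here is what dictionary syntax looks like on a toy `DataFrame` rather than on `preg`. The three weights are just sample values, and `LB_TO_KG` is a name introduced for this sketch; one pound is defined as exactly 0.45359237 kg.

```python
import pandas as pd

LB_TO_KG = 0.45359237  # kilograms per pound

# Toy table with a few weights in pounds (for illustration only).
toy_weights = pd.DataFrame({"totalwgt_lb": [8.8125, 7.875, 9.125]})

# Dictionary (bracket) syntax creates the new column.
toy_weights["totalwgt_kg"] = toy_weights["totalwgt_lb"] * LB_TO_KG

# Dot notation is fine for reading a column that already exists.
mean_kg = toy_weights.totalwgt_kg.mean()

print(toy_weights)
print(mean_kg)
```

Assigning with dot notation, as in `toy_weights.new_name = ...`, would set an attribute on the object instead of adding a column, which is why the bracket form is needed.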

`nsfg.py` also provides `ReadFemResp`, which reads the female respondents file and returns a `DataFrame`:

In [26]:
download("https://github.com/AllenDowney/ThinkStats2/raw/master/code/2002FemResp.dct")
download("https://github.com/AllenDowney/ThinkStats2/raw/master/code/2002FemResp.dat.gz")

In [27]:
resp = nsfg.ReadFemResp()

`DataFrame` provides a method `head` that displays the first five rows:

In [28]:
resp.head()

Select the `age_r` column from `resp` and print the value counts.  How old are the youngest and oldest respondents?

We can use the `caseid` to match up rows from `resp` and `preg`.  For example, we can select the row from `resp` for `caseid` 2298 like this:

In [30]:
resp[resp.caseid==2298]

And we can get the corresponding rows from `preg` like this:

In [31]:
preg[preg.caseid==2298]
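
The expression inside the brackets is a boolean mask: a `Series` of `True`/`False` values that keeps only the matching rows. The sketch below shows the same idea on two toy tables with made-up values, and, as one possible alternative, uses `merge` to attach respondent fields to every pregnancy row at once.

```python
import pandas as pd

# Toy respondent and pregnancy tables (hypothetical values).
resp_toy = pd.DataFrame({"caseid": [1, 2], "age_r": [30, 25]})
preg_toy = pd.DataFrame({"caseid": [1, 1, 2], "prglngth": [39, 40, 38]})

mask = preg_toy.caseid == 1   # True for rows that belong to caseid 1
rows = preg_toy[mask]

# A left merge keeps every pregnancy row and looks up the respondent's age.
merged = preg_toy.merge(resp_toy, on="caseid", how="left")

print(rows)
print(merged)
```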

How old is the respondent with `caseid` 1?

What are the pregnancy lengths for the respondent with `caseid` 2298?

What was the birthweight of the first baby born to the respondent with `caseid` 5012?