<font color='red'> 
# Computational Statistics for Data Analysis
</font>

# <div class="alert alert-error"><strong><center><small>Statistics is the discipline of using data samples to support claims about populations.</small></center></strong> </div>

Statistics is based on 2 main concepts:

* A **population** is a collection of objects or items (“units”) about which information is sought.

* A **sample** is a part of the population that is observed.

# Index

### 1 Descriptive Statistics.
* 1.1 Getting data
* 1.2 Data preparation
* 1.3 Importing data as a pandas DataFrame
* 1.4 Data cleaning
 

### 2 Exploratory Data Analysis.
* 2.1 Summarizing the data: mean, variance, median, quantiles & percentiles
* 2.2 Histogram
* 2.3 Data distributions
* 2.3.1 PMF
* 2.3.2 CDF
* 2.4 Outliers
* 2.5 Measuring asymmetry  (optional)
* 2.5.1 Skewness
* 2.5.2 Pearson's median skewness coefficient
* 2.6 Relative risk
* 2.7 A first glimpse of Conditional Probability



<center><img src="images/taft_puck_eggdig.jpg">
</center><center><img src="images/eggs.png"></center>
<center><small>Headline from Chicago Tribune June 13, 1897.</small></center>

Read more: "http://ptara.com/2011/12/17/never-eat-eggs-with-an-angry-stomach/"

### "Do first babies arrive late?"

From [Think Stats: Probability and Statistics for Programmers](http://www.greenteapress.com/thinkstats/), by Allen B. Downey, published by O'Reilly Media.

Some people believe it is true, but **without data analysis** to support it, this claim is a case of **anecdotal evidence**:

* There are a **small number of samples** (personal experience, friends, etc.).
* There is a **selection bias**: most *believers* are interested in this claim because their first babies were late.
* There is a **confirmation bias**: believers might be more likely to contribute data that confirm it.
* Sources are **inaccurate**: personal stories are subject to memory distortions.



<font color='blue'>
# 1 Descriptive Statistics.
* 1.1 Getting data
* 1.2 Data preparation
* 1.3 Importing data as a pandas DataFrame
* 1.4 Data cleaning
</font>

## 1.1 Getting Data

There is an interesting and publicly available **data source** to check this claim. Since 1973 the U.S. Centers for Disease Control and Prevention (CDC) have conducted a survey, the National Survey of Family Growth (NSFG), to gather *information on family life, marriage and divorce, pregnancy, infertility, use of contraception, and men's and women's health.* 

Data can be downloaded from: 

http://www.cdc.gov/nchs/nsfg/nsfg_cycle6.htm#cyc6downdatafiles

We will use this file: 

* Female Pregnancy Data File (2002FemPreg.dat): one record for each pregnancy reported by a respondent.

There are 13,593 pregnancies in our data. Each record occupies one line, and the fields we will use are:

+ <code>case.id</code> is the ID of the respondent. From col 1 to 12.
+ <code>prg.length</code> is the duration of the pregnancy in weeks. From col 275 to 276.
+ <code>outcome</code> is the outcome of the pregnancy (1 = live birth). Col 277.
+ <code>birth.ord</code> is the integer birth order of each live birth. From col 278 to 279.
+ <code>final.wgt</code> is the statistical weight associated to the respondent (it is a floating point value that indicates the number of people in the U.S. population this respondent represents). From col 423 to 440.

<small> If curious: Online documentation of the survey is at http://www.icpsr.umich.edu/nsfg6.</small>

### 1.2 Data preparation

One of the reasons we are using a general-purpose language such as Python rather than a stats language like R is that for many projects the *hard* part is preparing the data, not doing the analysis.

The most common steps are:

1. **Getting the data**. Data can be read directly from a file, or it might be necessary to scrape the web.
2. **Parsing the data**.  Of course, this depends on what format it is in: plain text, fixed columns, CSV, XML, HTML, etc.
3. **Cleaning the data**.  Survey responses and other data files are almost always incomplete.  Sometimes there are multiple codes for things like *not asked*, *did not know*, and *declined to answer*. And there are almost always errors. A simple strategy is to remove or ignore incomplete records.
4. **Building data structures**. Once you read the data, you usually want to store it in a data structure that lends itself to the analysis you want to do.

If the data fits into memory, building a data structure is usually the way to go.   If not, you could build a **database**, which is an out-of-memory data structure. Most databases provide a mapping from keys to values, so they are like dictionaries.

In [None]:
file = open('files/2002FemPreg.dat', 'r')

# Let's build a list of lists.

preg=[]
for line in file:
    preg.append([int(line[:12]), int(line[274:276]), int(line[276]),
                 int(line[277:279]), float(line[422:440])])

Oops! There is something wrong in the data file!

By inspecting the data we can see that some records contain blank fields, which make the ``int`` function raise a ``ValueError``.

In [None]:
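def chr_int(a):
    # Blank fields (e.g. the birth order of a non-live birth) are mapped to 0.
    if a.strip() == '':
        return 0
    return int(a)

# Build a list of lists, one inner list per pregnancy record.
preg = []
with open('files/2002FemPreg.dat', 'r') as file:
    for line in file:
        lst = [int(line[:12]), int(line[274:276]), int(line[276]),
               chr_int(line[277:279]), float(line[422:440])]
        preg.append(lst)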

In [None]:
print(preg[1])
print('The number of lines read is: ', len(preg))
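As an alternative to slicing each line by hand, pandas offers `read_fwf` for fixed-width files. The sketch below is not part of the original analysis: it builds two made-up records in memory (the `make_line` helper and every value in it are hypothetical) and parses them with the same column positions used above.

```python
import io
import pandas as pd

# Column spans as zero-based, half-open intervals, matching the slices used above.
colspecs = [(0, 12), (274, 276), (276, 277), (277, 279), (422, 440)]
names = ['caseId', 'prgLength', 'outcome', 'birthOrd', 'finalWgt']

def make_line(case_id, length, outcome, birth_ord, weight):
    """Build one synthetic 440-character record (made-up values, illustration only)."""
    chars = [' '] * 440
    fields = [case_id, length, outcome, birth_ord, weight]
    for (start, end), value in zip(colspecs, fields):
        chars[start:end] = str(value).rjust(end - start)
    return ''.join(chars)

sample = '\n'.join([
    make_line(1, 39, 1, 1, 8567.55),
    make_line(2, 9, 4, '', 12999.54),   # non-live birth: blank birth order
])

demo = pd.read_fwf(io.StringIO(sample), colspecs=colspecs, names=names, header=None)

# Blank fields are read as NaN; map them to 0, as chr_int does.
demo['birthOrd'] = demo['birthOrd'].fillna(0).astype(int)
print(demo)
```

With the real file, the `io.StringIO(sample)` argument would presumably be replaced by the path `'files/2002FemPreg.dat'`.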

### 1.3 Importing data as a pandas DataFrame

In [None]:
%matplotlib inline

import pandas as pd
df = pd.DataFrame(preg) #  Two-dimensional size-mutable, potentially heterogeneous tabular data structure with labeled axes 
# http://pandas.pydata.org/pandas-docs/dev/dsintro.html#dataframe

df.columns = ['caseId', 'prgLength', 'outcome', 'birthOrd', 'finalWgt']
df.head()

In [None]:
df.tail()

In [None]:
df.shape

Let's count the number of births for each birth order:

In [None]:
counts = df.groupby('birthOrd').size()
print(counts[1])
print(counts) 

# also: df.birthOrd.value_counts()

Let's build a partition of *live births* into two groups: first babies and others.

In [None]:
# Divide records into two lists: first babies and others.

firstbirth = df[ (df.outcome == 1) & (df.birthOrd == 1)]
firstbirth.shape

In [None]:
othersbirth = df[(df.outcome == 1) & (df.birthOrd >= 2)]
othersbirth.shape

### 1.4 Data Cleaning

The most common steps are:

+ **Sample the data**. If the amount of raw data is huge, processing all of it may require an impractical amount of processing power.  In this case, it is quite common to sample the input data to reduce the amount of data that needs to be processed.

+ **Impute missing data**. It is quite common that some of the input records are incomplete, in the sense that certain fields are missing or contain input errors.  In a typical tabular data format, we need to validate that each record contains the same number of fields and that each field contains the data type we expect. If a record has some fields missing, we have the following choices:
<small>
* (a) Discard the whole record if it is incomplete; 
* (b) Infer the missing value based on the data from other records.  A common approach is to fill the missing data with the average, or the median.
</small>



+ **Normalize numeric values**. Normalizing data means transforming numeric data into a uniform range.
+ **Reduce dimensionality**. High dimensionality can be a problem for some machine learning methods.  There are two ways to reduce the number of input features: *removing irrelevant* input variables and *removing redundant* input variables.
+ **Add derived features**. In some cases, we may need to compute additional attributes from existing attributes (e.g. converting a geo-location to a zip code, or converting an age to an age group).
+ **Discretize numeric values into categories**. Discretizing data means cutting a continuous value into ranges and assigning each number to the bucket it falls in.  The ranges can be either of constant width (variable height/frequency) or of variable width (constant height).
+ **Binarize categorical attributes**. Certain machine learning models only take binary (or numeric) input.  In this case, we need to convert a categorical attribute into multiple binary attributes, where each binary attribute corresponds to a particular value of the category.

+ **Select, combine, aggregate data**. Designing the form of the training data is the most important part of the whole predictive modeling exercise, because the accuracy largely depends on whether the input features are structured in a form that provides strong signals to the learning algorithm. Rather than using the raw data as it is, it is quite common to combine multiple pieces of raw data, or to aggregate multiple raw data records along some dimensions.
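Several of the steps in the list above can be sketched in a few lines of pandas. The example uses a made-up table (the `toy` DataFrame, its column names, and the bin edges are all assumptions chosen for illustration) and shows one possible way to impute, normalize, discretize and binarize.

```python
import numpy as np
import pandas as pd

# Made-up data: one missing age, one categorical column.
toy = pd.DataFrame({
    'age': [23.0, 35.0, np.nan, 51.0, 64.0],
    'city': ['A', 'B', 'A', 'C', 'B'],
})

# Impute: fill the missing age with the median of the observed ages.
toy['age'] = toy['age'].fillna(toy['age'].median())

# Normalize: rescale age to the [0, 1] range (min-max scaling).
age_range = toy['age'].max() - toy['age'].min()
toy['age_scaled'] = (toy['age'] - toy['age'].min()) / age_range

# Discretize: assign each age to a bucket using hand-picked bin edges.
toy['age_group'] = pd.cut(toy['age'], bins=[0, 30, 50, 100],
                          labels=['young', 'middle', 'senior'])

# Binarize: one indicator column per value of the categorical attribute.
toy = pd.concat([toy, pd.get_dummies(toy['city'], prefix='city')], axis=1)

print(toy)
```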

<font color='blue'>
## 2 Exploratory Data Analysis.
* 2.1 Summarizing the data: mean, variance, median, quantiles & percentiles
* 2.2 Histogram
* 2.3 Data distributions
* 2.3.1 PMF
* 2.3.2 CDF
* 2.4 Outliers
* 2.5 Measuring asymmetry
* 2.5.1 Skewness
* 2.5.2 Pearson's median skewness coefficient
* 2.6 Relative risk
* 2.7 A first glimpse of Conditional Probability
</font>

### 2.1 Summarizing the data: 
#### 2.1.1 Sample Mean 

If you have a sample of $n$ values, $x_i$, the **sample mean** is the sum of the values divided by the number of values:

$$ \mu = \frac{1}{n} \sum_i x_i$$

The **mean** is the most basic and important summary statistic. It describes the central tendency of a sample. 

Let's see if there is a difference between firstbirth and othersbirth:

In [None]:
print(firstbirth['prgLength'].mean(), othersbirth['prgLength'].mean())

In [None]:
print(abs(firstbirth['prgLength'].mean()-
          othersbirth['prgLength'].mean()),"weeks")

In [None]:
print(abs(firstbirth['prgLength'].mean()-
          othersbirth['prgLength'].mean())*7,"days")

In [None]:
print(abs(firstbirth['prgLength'].mean()-
          othersbirth['prgLength'].mean())*7*24, "hours")

This difference in sample means can be considered a first piece of evidence for our hypothesis!


**Comment:** *Later, we will work with both concepts: the population mean and the sample mean. Do not confuse them! Remember, the first one is the mean of the whole population and the second one is the mean of a sample taken from the population.*

#### 2.1.2 Sample Variance

Usually, the mean alone is not a sufficient descriptor of the data. We can do a little better with two numbers: mean and **variance**:

$$ \sigma^2 = \frac{1}{n} \sum_i (x_i - \mu)^2 $$

**Variance** $\sigma^2$ describes the *spread* of data. The term $(x_i - \mu)$ is called the *deviation from the mean*, so variance is the mean squared deviation.

The square root of the variance, $\sigma$, is called the **standard deviation**. We define the standard deviation because the variance is hard to interpret (for example, if the data are in grams, the variance is in grams squared).


In [None]:
mu1 = firstbirth['prgLength'].mean()
mu2 = othersbirth['prgLength'].mean()
var1 = firstbirth['prgLength'].var()
var2 = othersbirth['prgLength'].var()
std1 = firstbirth['prgLength'].std()
std2 = othersbirth['prgLength'].std()
print('mu1:', mu1, 'var1:', var1, 'std1:', std1)
print('mu2:', mu2, 'var2:', var2, 'std2:', std2)
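One detail worth keeping in mind: the pandas methods `var()` and `std()` divide by $n - 1$ by default (`ddof=1`), whereas the variance formula above divides by $n$. The sketch below, on made-up values, computes both versions by hand and compares them with pandas.

```python
import pandas as pd

values = pd.Series([38, 39, 39, 40, 41])  # made-up pregnancy lengths in weeks

n = len(values)
mu = values.sum() / n
squared_dev = (values - mu) ** 2

var_n = squared_dev.sum() / n          # formula above: divide by n
var_n1 = squared_dev.sum() / (n - 1)   # pandas default: divide by n - 1

print('by hand:', mu, var_n, var_n1)
print('pandas: ', values.mean(), values.var(ddof=0), values.var())
```

For samples as large as the ones in this notebook the two versions are nearly identical, but the distinction matters for small samples.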

#### 2.1.3 Sample Median

The statistical median is an order statistic that gives the *middle* value of a sample. It is more robust to outliers than the mean.

<center><img src="images/mean_median.gif"></center>

In [None]:
median1= firstbirth['prgLength'].median()
median2= othersbirth['prgLength'].median()
print(median1, median2)

#### 2.1.4 Summarizing the data: Quantiles & Percentiles

Order the sample $\{ x_i \}$, then find $x_p$ so that it divides the data into two parts where:

+ a fraction $p$ of the data values are less than or equal to $x_p$ and
+ the remaining fraction $(1 - p)$ are greater than $x_p$.

That value $x_p$ is the $p$-quantile, or the $100p$-th percentile.

**5-number summary**: $x_{min}, Q_1, Q_2, Q_3, x_{max}$, where $Q_1$ is the 25th percentile,
$Q_2$ is the 50th percentile (the median) and $Q_3$ is the 75th percentile.
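As a minimal sketch, the 5-number summary can be obtained with the pandas `quantile` method. The values below are made up for illustration. Note that pandas interpolates linearly between neighbouring data points by default, so on small samples the result can differ slightly from the definition above.

```python
import pandas as pd

values = pd.Series([35, 37, 38, 39, 39, 39, 40, 40, 41, 42])  # made-up weeks

# Quantiles 0, 0.25, 0.5, 0.75 and 1 give x_min, Q1, Q2, Q3 and x_max.
five_numbers = values.quantile([0, 0.25, 0.5, 0.75, 1.0])
print(five_numbers)

# The 0.5 quantile coincides with the median.
print(values.median())
```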

### 2.2 Histogram

The most common representation of a distribution is a **histogram**, which is a graph that shows the frequency of each value.

In [None]:
fb=firstbirth['prgLength']

fb.hist(density=False, histtype='stepfilled', bins=30)

In [None]:
        
ob=othersbirth['prgLength']
ob.hist(density=False, histtype='stepfilled', bins=30)

In [None]:
import seaborn as sns
fb.hist(density=True, histtype='stepfilled', alpha=.5)   # default number of bins = 10
ob.hist(density=True, histtype='stepfilled', alpha=.5, color=sns.desaturate("indianred", .75))

In [None]:
import scipy.stats as stats

# Computes several descriptive statistics:
# size of the data 
# minimum and maximum value of data array
# arithmetic mean 
# unbiased variance 
# biased skewness 
# kurtosis (Fisher)

stats.describe(othersbirth['prgLength'].values)

## 2.3 Data Distributions

Summarizing can be dangerous: very different data can be described by the same statistics. It must be validated by inspecting the data.

We can look at the **data distribution**, which describes how often (frequency) each value appears.


We can normalize the frequencies of the histogram by dividing them by $n$, the number of samples. The normalized histogram is called the **Probability Mass Function (PMF)**.

In [None]:
# if needed, execute the command 'pip3 install seaborn'

import seaborn as sns

x = firstbirth['prgLength']
y = othersbirth['prgLength']

x.hist(density=True, histtype='stepfilled')


In [None]:
y.hist(density=True, histtype='stepfilled')
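The PMF can also be computed directly as numbers rather than read off a plot. The sketch below uses made-up values and the pandas `value_counts` method with `normalize=True`. One caveat: a histogram drawn with `density=True` is scaled so that its total area is 1, which matches the PMF only when each bin has unit width.

```python
import pandas as pd

weeks = pd.Series([38, 39, 39, 40, 39, 41, 40, 39])  # made-up pregnancy lengths

# Frequency of each value divided by n, ordered by value.
pmf = weeks.value_counts(normalize=True).sort_index()
print(pmf)

# The probabilities of a PMF add up to 1.
print(pmf.sum())
```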

The **cumulative distribution function (CDF)**, or just distribution function, describes the probability that a real-valued random variable X with a given probability distribution will be found to have a value less than or equal to x. 

In [None]:
x.hist(density=True, histtype='step', cumulative=True, linewidth=3.5)

In [None]:
ob.hist(density=True, histtype='step', cumulative=True, linewidth=3.5)
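For a sample, the empirical CDF at a value $x$ is simply the fraction of observations less than or equal to $x$. A minimal sketch on made-up values, computing it both as the cumulative sum of the PMF and directly from the definition:

```python
import pandas as pd

weeks = pd.Series([38, 39, 39, 40, 39, 41, 40, 39])  # made-up pregnancy lengths

# Cumulative sum of the PMF gives the CDF at each observed value.
cdf = weeks.value_counts(normalize=True).sort_index().cumsum()
print(cdf)

# Same quantity from the definition: fraction of values <= 39.
direct = (weeks <= 39).mean()
print(direct)
```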

In [None]:
fb.hist(bins=10, density=True, histtype='stepfilled', alpha=.5)   # default number of bins = 10
ob.hist(bins=10, density=True, histtype='stepfilled', alpha=.5, color=sns.desaturate("indianred", .75))

# Check with 20, 30, 60 bins.

In [None]:
fb.hist(density=True, histtype='step', cumulative=True,  linewidth=3.5, bins=30)
ob.hist(density=True, histtype='step', cumulative=True,  linewidth=3.5, bins=30, color=sns.desaturate("indianred", .75))

In [None]:
print("The difference in sample means is", x.mean() - y.mean(), "weeks.")

## 2.4 Outliers

**Outliers** are data samples with a value that is far from the central tendency.

We can find outliers by:

+ Computing samples that are *far* from the median.
+ Computing samples whose value *deviates from the mean* by more than 2 or 3 standard deviations.

The following expression returns a series of boolean values that can then be used to index the DataFrame:

In [None]:
(df.outcome == 1) & (df['prgLength'] < df['prgLength'].median() - 10)

In [None]:
df['prgLength'].median()

In [None]:
df[(df.outcome == 1) & (df['prgLength'] < df['prgLength'].median() - 10)]

In [None]:
df[(df.outcome == 1) & (df['prgLength'] > df['prgLength'].median() + 6)]
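The cells above use the first rule (distance from the median). The second rule, distance from the mean measured in standard deviations, can be sketched as follows on made-up values; the threshold of 2 is just one of the two choices mentioned in the list.

```python
import pandas as pd

weeks = pd.Series([39, 40, 38, 39, 41, 39, 40, 38, 39, 17])  # made-up, one odd value

# Standardized distance of each sample from the mean.
z = (weeks - weeks.mean()) / weeks.std()

# Boolean mask: samples more than 2 standard deviations away from the mean.
is_outlier = z.abs() > 2
outliers = weeks[is_outlier]
print(outliers)
```

Because the mean and the standard deviation are themselves pulled by extreme values, this rule tends to be less robust than the median-based one.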

If we think that outliers correspond to errors, an option is to trim the data by discarding the highest and lowest values.

In [None]:
df2 = df.drop(df.index[(df.outcome == 1) & (df['prgLength'] > df['prgLength'].median() + 6)])
df2[(df2.outcome == 1) & (df2['prgLength'] > df2['prgLength'].median() + 6)] # check if removed


In [None]:
df3 = df2.drop(df2.index[(df2.outcome == 1) & (df2['prgLength'] < df2['prgLength'].median() - 10)])
df3[(df3.outcome == 1) & (df3['prgLength'] < df3['prgLength'].median() - 10)]

In [None]:
firstbirth3 = df3[(df3.outcome == 1) & (df3.birthOrd == 1)]
mu3fb = firstbirth3['prgLength'].mean()
std3fb = firstbirth3['prgLength'].std()
md3fb = firstbirth3['prgLength'].median()
print(mu3fb, std3fb, md3fb, firstbirth3['prgLength'].min(),
      firstbirth3['prgLength'].max())

In [None]:
othersbirth3 = df3[(df3.outcome == 1) & (df3.birthOrd >= 2)]
mu3ob = othersbirth3['prgLength'].mean()
std3ob = othersbirth3['prgLength'].std()
md3ob = othersbirth3['prgLength'].median()
print(mu3ob, std3ob, md3ob, othersbirth3['prgLength'].min(),
      othersbirth3['prgLength'].max())

In [None]:
print("The difference in sample means is:",
      firstbirth3['prgLength'].mean() - othersbirth3['prgLength'].mean(),
      "weeks.")

In [None]:
print(len(df3.prgLength[(df3.outcome == 1)]))
print(len(df.prgLength[(df.outcome == 1)]))

In [None]:
import matplotlib.pyplot as plt

fig = plt.figure(figsize=(15,3))

df3.prgLength[(df3.outcome == 1)].plot(alpha=.5, color='blue')
df.prgLength[(df.outcome == 1)].plot(alpha=.5, 
                            color=sns.desaturate("indianred", .95))


Let's see what is happening near the mode:

In [None]:
import numpy as np

x = firstbirth3['prgLength']
y = othersbirth3['prgLength']

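# Use a common range so that both histograms share the same bin edges
# and the counts can be compared bin by bin.
lo = min(x.min(), y.min())
hi = max(x.max(), y.max())
countx, divisionx = np.histogram(x, range=(lo, hi))
county, divisiony = np.histogram(y, range=(lo, hi))
print(countx - county)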


In [None]:
import matplotlib.pyplot as plt

fig = plt.figure(figsize=(15,3))
val = [(divisionx[i]+divisionx[i+1])/2 for i in range(len(divisionx)-1)]
plt.plot(val, countx-county, 'o-') 

There is still some evidence for our hypothesis!

In [None]:
print('The means are: ', x.mean(), y.mean())

**Exercise**:

+ Read the file ``run10.txt`` from the ``files`` directory. It represents 16,924 runners who finished the 2012 Cherry Blossom 10 mile run in the USA. The file is a tab-separated file. It can be read with the pandas ``read_table`` function.
+ Compute the mean time.
+ Compute the difference in mean between men and women.
+ Visualize both distributions (normalized histogram).

In [None]:
## Your solution here

### 2.5 Measuring asymmetry.

**Skewness** is a statistic that measures the asymmetry of a set of $n$ data samples $x_i$:

$$ g_1 = \frac{\frac{1}{n} \sum_i (x_i - \mu)^3 }{\left( \frac{1}{n} \sum_i (x_i - \mu)^2 \right)^{3/2}}$$

The numerator is the mean cubed deviation and the denominator is the mean squared deviation (the variance) raised to the power $3/2$, that is, $\sigma^3$.

Negative skewness indicates that the distribution "skews left" (it extends farther to the left than to the right).

**Skewness** can be strongly affected by outliers. A simpler alternative is to look at the relationship between the mean ($\mu$) and the median ($\mu_{\frac{1}{2}}$).

**Pearson's median skewness coefficient** is a more robust alternative:

$$ g_p = \frac{3(\mu - \mu_{\frac{1}{2}})}{\sigma} $$
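To get a feeling for the sign convention without giving away the exercise that follows, here is a quick check on made-up data using `scipy.stats.skew`. With its default `bias=True` it is computed from the population moments (division by $n$). The three small data sets are assumptions chosen only to show a left-skewed, a right-skewed and a symmetric case.

```python
import numpy as np
import scipy.stats as stats

left = np.array([1, 8, 9, 9, 10, 10, 10])   # long tail to the left
right = -left                                # mirror image: long tail to the right
symmetric = np.array([1, 2, 3, 4, 5])

print('left:     ', stats.skew(left))
print('right:    ', stats.skew(right))
print('symmetric:', stats.skew(symmetric))

# In the left-skewed sample the mean is pulled below the median.
print(left.mean(), np.median(left))
```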

**Exercise**: Write a function to compute $g_1$ and $g_p$ of the pregnancy length.

In [None]:
## Your solution here

**Exercises**: 

+ Could you give a real example, where for all data samples, $x_i \leq \mu$? 
+ Could you give a real example, where for all data samples, $x_i \leq \mu_{\frac{1}{2}}$? This is really a distribution that skews left!
+ If we ask a random group of people "What is your position with respect to the average driver?", what kind of distribution would we get?

## 2.6 Relative Risk

Let's say that a baby is "early" if it is born during week 37 or earlier, "on time" if it is born during week 38, 39 or 40, and "late" if it is born during week 41 or later. 

In [None]:
firstbirth3 = df3[(df3.outcome == 1) & (df3.birthOrd == 1)]
firstbirth3.head()

Let's compute the probability of being *early*, *on time* and *late* for first babies and the others.

In [None]:
print("First babies: ")
print("Early",len(firstbirth3[firstbirth3['prgLength'] <38])/
      float(len(firstbirth3.index)))
print("Late", len(firstbirth3[firstbirth3['prgLength'] >40])/
      float(len(firstbirth3.index)))
print("On time", len(firstbirth3[(firstbirth3['prgLength'] >37) &
    (firstbirth3['prgLength'] < 41)])/float(len(firstbirth3.index)))

In [None]:
print("Other babies:")
print("Early", len(othersbirth3[othersbirth3['prgLength'] <38])/
      float(len(othersbirth3.index)))
print("Late", len(othersbirth3[othersbirth3['prgLength'] >40])/
      float(len(othersbirth3.index)))
print("On time", len(othersbirth3[(othersbirth3['prgLength'] >37) &
    (othersbirth3['prgLength']<41)])/float(len(othersbirth3.index)))

The **relative risk** is the ratio of two probabilities. In our case, the probability that a first baby is born early is about 17%, and for other babies it is about 16%, so the relative risk is:

In [None]:
a = len(firstbirth3[firstbirth3['prgLength'] <38])/float(len(firstbirth3.index))
b = len(othersbirth3[othersbirth3['prgLength'] <38])/float(len(othersbirth3.index))
print("First babies are about", a/b, "times as likely to be early.")

That means that first babies are about 7% more likely to be early. For the case of late births:

In [None]:
a = len(firstbirth3[firstbirth3['prgLength'] >40])/float(len(firstbirth3))
b = len(othersbirth3[othersbirth3['prgLength'] >40])/float(len(othersbirth3))
print("First babies are about", a/b, "times as likely to be late.")

That means that first babies are about 67% more likely to be late. 
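The step from a relative risk to a statement such as "x% more likely" is just `(ratio - 1) * 100`. A minimal sketch with a hypothetical helper, `relative_risk`, and made-up counts (not the survey figures):

```python
def relative_risk(events_a, total_a, events_b, total_b):
    """Ratio of the event probability in group A to that in group B."""
    p_a = events_a / total_a
    p_b = events_b / total_b
    return p_a / p_b

# Made-up example: 30 of 200 in group A versus 20 of 200 in group B.
rr = relative_risk(30, 200, 20, 200)
percent_more = (rr - 1) * 100

print("Relative risk:", round(rr, 3))
print("Group A is about", round(percent_more, 1), "% more likely")
```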

## 2.7 A first glimpse of Conditional Probability

Imagine that someone you know is pregnant and it is the beginning of week 39. What is the chance that the baby will be born in the week 39? What is the chance if it is a first baby?

We can answer these questions by computing a **conditional probability**, $P(X|Y)$.

In our first question, the event $X$ is a birth in week 39 and the event $Y$ is that we know that the baby didn't arrive during weeks 0-38. In the second question, we also know that it is a first baby.

A simple way to compute these chances is to drop from our data the cases that do not fulfill the conditions and then renormalize.

In [None]:
# Keep only live births that had not yet occurred at the beginning of week 39.
df4 = df3[(df3.outcome == 1) & (df3['prgLength'] >= 39)]
df4.shape

In [None]:
x = df4.prgLength
x.hist(bins=6, histtype='stepfilled', alpha=.5)   

We are ready to compute the probability that the baby will be born in week 39 for a woman who is pregnant at the beginning of week 39.

In [None]:
print(len(df4[(df4.prgLength == 39)].index)/float(len(df4)))

Let's now add the second condition.

In [None]:
firstbirth39 = df4[(df4.birthOrd == 1)]
othersbirth39 = df4[(df4.birthOrd > 1)]
x = firstbirth39['prgLength']
y = othersbirth39['prgLength']
x.hist(bins=6,  density=True, histtype='stepfilled', alpha=.5)   # first babies, blue
y.hist(bins=6,  density=True, histtype='stepfilled', alpha=.5, color=sns.desaturate("indianred", .75))

In [None]:
print('Probability First baby on week 39: ', 
    len(firstbirth39[(firstbirth39.prgLength == 39)].index)/
    float(len(firstbirth39.index)))

In [None]:
print('Probability non first baby to be born on week 39: ',
    len(othersbirth39[(othersbirth39.prgLength == 39)].index)/
    float(len(othersbirth39.index)))
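The "drop and renormalize" recipe used in this section is equivalent to the definition $P(X|Y) = P(X \cap Y) / P(Y)$. The sketch below checks this on made-up pregnancy lengths, where $Y$ is "not yet born at the beginning of week 39" and $X$ is "born in week 39".

```python
import pandas as pd

weeks = pd.Series([36, 38, 39, 39, 40, 41, 39, 37, 40, 39])  # made-up values

still_pregnant = weeks >= 39    # event Y
born_week_39 = weeks == 39      # event X

# Recipe used above: keep only the cases that satisfy Y, then renormalize.
p_filtered = (weeks[still_pregnant] == 39).mean()

# Definition: P(X|Y) = P(X and Y) / P(Y).
p_definition = (born_week_39 & still_pregnant).mean() / still_pregnant.mean()

print(p_filtered, p_definition)
```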


### Discussions.

After exploring the data we have seen some **apparent effects** that seem to support our first hypothesis:

+ **Data description**: The mean pregnancy length for first babies is 38.76 weeks and for other babies is 38.65 weeks.

+ **Relative risk**: First babies are about 67% more likely to be late.

+ **Conditional probability**: If someone is pregnant and it is the beginning of week 39, the chance that the baby will be born in week 39 is lower for a first baby (63% vs. 72%).

### Other possible experiments

We can compare the first and later babies of the same woman. While it may be unlikely, there could still be a tendency for a woman's second, third, etc. child to come earlier.

<small>(Result:  The second baby is born a few hours earlier, but this difference is not *statistically significant*.)</small>

### Main reference
*Think Stats: Probability and Statistics for Programmers*, by Allen B. Downey, published by O'Reilly Media.
http://www.greenteapress.com/thinkstats/