# Inference for categorical data
In August of 2012, news outlets ranging from the [Washington Post](http://www.washingtonpost.com/national/on-faith/poll-shows-atheism-on-the-rise-in-the-us/2012/08/13/90020fd6-e57d-11e1-9739-eef99c5fb285_story.html) to the [Huffington Post](http://www.huffingtonpost.com/2012/08/14/atheism-rise-religiosity-decline-in-america_n_1777031.html) ran a story about the rise of atheism in America. The source for the story was a poll that asked people, “Irrespective of whether you attend a place of worship or not, would you say you are a religious person, not a religious person or a convinced atheist?” This type of question, which asks people to classify themselves in one way or another, is common in polling and generates categorical data. In this lab we take a look at the atheism survey and explore what’s at play when making inference about population proportions using categorical data.

# The survey
To access the press release for the poll, conducted by WIN-Gallup International, click on the following link:

http://www.wingia.com/web/files/richeditor/filemanager/Global_INDEX_of_Religiosity_and_Atheism_PR__6.pdf

Take a moment to review the report, then address the following questions.

**Exercise 1** In the first paragraph, several key findings are reported. Do these percentages appear to be sample statistics (derived from the data sample) or population parameters?

**Exercise 2** The title of the report is “Global Index of Religiosity and Atheism”. To generalize the report’s findings to the global human population, what must we assume about the sampling method? Does that seem like a reasonable assumption?

# The data
Turn your attention to Table 6 (pages 15 and 16), which reports the sample size and response percentages for all 57 countries. While this is a useful format to summarize the data, we will base our analysis on the original data set of individual responses to the survey. Load this data set into Python with the following commands:

In [4]:
import pandas as pd
atheism = pd.read_csv("atheism.csv")

**Exercise 3** What does each row of Table 6 correspond to? What does each row of `atheism` correspond to?

To investigate the link between these two ways of organizing this data, take a look at the estimated proportion of atheists in the United States. Towards the bottom of Table 6, we see that this is 5%. We should be able to come to the same number using the `atheism` data.

**Exercise 4** Using the command below, create a new dataframe called `us12` that contains only the rows in `atheism` associated with respondents to the 2012 survey from the United States. Next, calculate the proportion of atheist responses. Does it agree with the percentage in Table 6? If not, why?

In [7]:
us12 = atheism[(atheism['nationality'] == 'United States') & (atheism['year'] == 2012)]

In [34]:
us12[us12['response'] == 'atheist']['response'].count() / us12['response'].count()

0.049900199600798403
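
An equivalent and slightly more compact way to get a proportion like this is to take the mean of a boolean mask, or to tabulate every response category at once with `value_counts(normalize=True)`. The sketch below runs on a small made-up dataframe (not the survey data) that only borrows the column names of `atheism`, so the values and the `non-atheist` label are illustrative assumptions.

```python
import pandas as pd

# Hypothetical toy data with the same column names as `atheism`; not the real survey.
toy = pd.DataFrame({
    "nationality": ["United States"] * 5 + ["Brazil"] * 3,
    "response": ["atheist", "non-atheist", "non-atheist", "non-atheist",
                 "non-atheist", "non-atheist", "atheist", "non-atheist"],
    "year": [2012, 2012, 2012, 2012, 2005, 2012, 2012, 2012],
})

# Same filter as in the exercise: United States respondents in 2012.
toy_us12 = toy[(toy["nationality"] == "United States") & (toy["year"] == 2012)]

# The mean of a boolean mask is the share of True values.
prop_atheist = (toy_us12["response"] == "atheist").mean()

# Relative frequency of every response category at once.
shares = toy_us12["response"].value_counts(normalize=True)

print(prop_atheist)
print(shares)
```

Applied to the real `us12` dataframe, the same two expressions should reproduce the proportion computed above.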

# Inference on proportions
As was hinted at in Exercise 1, Table 6 provides statistics, that is, calculations made from the sample of 51,927 people. What we’d like, though, is insight into the population parameters. You answer the question “What proportion of people in your sample reported being atheists?” with a statistic, while the question “What proportion of people on earth would report being atheists?” is answered with an estimate of the parameter.

The inferential tools for estimating a population proportion are analogous to those used for means in the last chapter: the confidence interval and the hypothesis test.

**Exercise 5** Write out the conditions for inference to construct a 95% confidence interval for the proportion of atheists in the United States in 2012. Are you confident all conditions are met?
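
One of the conditions, the success-failure condition, can be checked numerically: the sample should contain at least 10 successes and at least 10 failures. The sketch below assumes 50 atheist responses out of 1,002 respondents. These counts are not stated explicitly in the lab; they are an assumption that is consistent with the proportion 0.0499 computed earlier. The threshold of 10 is the usual rule of thumb. Independence of the observations cannot be checked in code and has to be judged from the sampling design.

```python
# Assumed counts, consistent with the proportion 50 / 1002 ≈ 0.0499.
n_successes = 50
n_total = 1002
threshold = 10  # common rule of thumb for the success-failure condition

p_hat = n_successes / n_total
expected_successes = n_total * p_hat
expected_failures = n_total * (1 - p_hat)

# Both counts must reach the threshold for the normal approximation to be reasonable.
condition_met = expected_successes >= threshold and expected_failures >= threshold
print(expected_successes, expected_failures, condition_met)
```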

If the conditions for inference are reasonable, we can either calculate the standard error and construct the interval by hand, or let the following code do it for us.

In [36]:
import math
from scipy import stats

a = us12['response']   # responses of the 2012 US respondents
success = 'atheist'    # category counted as a "success"
n = len(a)             # sample size
alpha = 0.05           # 1 - confidence level

# Sample proportion and its standard error
obs = len(a[a == success]) / n
std = math.sqrt(obs * (1 - obs) / n)

# The critical value z* is the (1 - alpha/2) quantile of the standard normal
quantile = 1.0 - alpha / 2
crit_val = stats.norm.ppf(quantile, loc=0, scale=1)
MoE = crit_val * std   # margin of error

print('standard error = ', std, '\n'
      'critical value on the normal distribution is z =', crit_val, '\n'
      'margin of error =', MoE)

print('\nThe observed proportion is {} and the {}% confidence interval is [{} to {}]'.format(obs, (1.0 - alpha)*100, obs - MoE, obs + MoE))

standard error =  0.006878629122390021 
critical value on the normal distribution is z = 1.95996398454 
margin of error = 0.0134818653429

The observed proportion is 0.0499001996007984 and the 95.0% confidence interval is [0.0364183342579056 to 0.0633820649436912]
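
The same calculation can be wrapped in a reusable function. This is a minimal sketch and not part of the original lab: the name `proportion_ci` and its parameters are hypothetical, and it uses `statistics.NormalDist` from the standard library in place of `scipy.stats.norm` so that it has no third-party dependencies. It computes the normal-approximation (Wald) interval $\hat{p} \pm z^{*}\sqrt{\hat{p}(1-\hat{p})/n}$. The counts passed in (50 out of 1,002) are an assumption consistent with the observed proportion printed above.

```python
import math
from statistics import NormalDist

def proportion_ci(n_successes, n_total, alpha=0.05):
    """Normal-approximation (Wald) confidence interval for a proportion."""
    p_hat = n_successes / n_total
    std_err = math.sqrt(p_hat * (1 - p_hat) / n_total)
    # Critical value: the (1 - alpha/2) quantile of the standard normal.
    z_star = NormalDist().inv_cdf(1 - alpha / 2)
    margin = z_star * std_err
    return p_hat - margin, p_hat + margin

# Assumed counts: 50 atheist responses out of 1002 respondents.
lower, upper = proportion_ci(50, 1002)
print(round(lower, 4), round(upper, 4))
```

With these counts the bounds should agree with the interval reported above, roughly 0.036 to 0.063.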



In [30]:
a[a == 'atheist']

49937    atheist
49944    atheist
49949    atheist
49955    atheist
49956    atheist
49983    atheist
49995    atheist
50040    atheist
50057    atheist
50063    atheist
50117    atheist
50149    atheist
50205    atheist
50221    atheist
50249    atheist
50253    atheist
50259    atheist
50280    atheist
50309    atheist
50322    atheist
50353    atheist
50359    atheist
50360    atheist
50374    atheist
50433    atheist
50444    atheist
50445    atheist
50449    atheist
50455    atheist
50466    atheist
50511    atheist
50541    atheist
50556    atheist
50581    atheist
50607    atheist
50640    atheist
50641    atheist
50692    atheist
50696    atheist
50709    atheist
50718    atheist
50720    atheist
50743    atheist
50747    atheist
50798    atheist
50838    atheist
50854    atheist
50858    atheist
50874    atheist
50891    atheist
Name: response, dtype: object