# Categorical Data

Within the general set of data known as numerical data, we have seen both continuous and discrete data sets.

Examples of continuous data:  temperature, pressure, position, speed, weight, etc.

Examples of discrete data:  number appearing on a die, age in years, etc.

However, a LOT of data are in fact not numerical.  These are known as CATEGORICAL data.

Examples of categorical data:  hair color, handedness, house style, country of birth, etc.

We would like to investigate some of the methods that can be used to analyze categorical data.  These turn out to be quite useful in machine learning, neural networks, and other AI applications.

### Confidence Intervals in one-proportion data

Suppose that we have one sample of N data points which represents some sort of categorical data.  For example, it could be something like house style data, where there are 1000 houses that have been divided into 4 categories - condominium, townhouse, duplex, single family home.

One question might be:  does the data sample represent what we expect for the population?

For EACH category, we can think of a probability, $p$, that describes how likely a sample is to fall in that category, and presumably we know what this probability is.  The number of occurrences of this category in the data sample will then be described by a binomial distribution!

Recall that for the binomial distribution, we have that:

$\mu = p \cdot N$

$SEM = \sigma = \sqrt{Np(1-p)}$

And thus, the 95% confidence interval will be given by:

$CI = \mu \pm SEM\cdot t_{N-1,0.025}$
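
As a minimal sketch of the three formulas above, the calculation can be wrapped in a small helper. The function name `count_confidence_interval` and the 25% townhouse probability are assumptions made purely for illustration; they are not part of the house style data described earlier.

```python
import numpy as np
import scipy.stats as stats

def count_confidence_interval(p, N, alpha=0.05):
    """Return (mu, low, high) for the expected count in one category."""
    mu = p * N                                      # binomial mean
    sem = np.sqrt(N * p * (1 - p))                  # binomial standard deviation
    t_critical = stats.t(N - 1).ppf(1 - alpha / 2)  # two-sided critical value
    return mu, mu - sem * t_critical, mu + sem * t_critical

# Hypothetical model: 25% of the 1000 houses are expected to be townhouses.
mu, low, high = count_confidence_interval(p=0.25, N=1000)
print("Expected townhouses: %0.1f   95%% CI: (%0.1f,%0.1f)" % (mu, low, high))
```

If the observed number of townhouses falls outside this interval, the sample is probably not consistent with the assumed probability.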

# Example 1 - Breast Cancer Rates

Question:  How many female students would one expect to develop breast cancer each year, at CNU?  How many of these women would one expect to die from breast cancer in their lifetime?

There are about 5000 students at CNU, about half of whom are women.  So, N = 2500.  The incidence of breast cancer for women in the 18-30 age range is approximately 10 in 100,000 per year.  The mortality rate for breast cancer is approximately 3.8% in their lifetime (btw, I find this number shockingly high).

In [10]:
import numpy as np
import scipy.stats as stats

pi = 10.0/100000.0
pd = 3.8/100.0
N = 2500

SEM_i = np.sqrt(N*pi*(1-pi))
SEM_d = np.sqrt(N*pd*(1-pd))

print (SEM_i,SEM_d)

alpha = 0.05

tdist = stats.t(N-1)
t_critical = tdist.ppf(1-alpha/2)

mu_i = pi*N
mu_d = pd*N

print (mu_i,mu_d)

mu_i_low = mu_i - SEM_i*t_critical
mu_i_high = mu_i + SEM_i*t_critical

mu_d_low = mu_d - SEM_d*t_critical
mu_d_high = mu_d + SEM_d*t_critical

print ("Incidence: (%0.2f,%0.2f)      Death: (%0.2f,%0.2f)" % (mu_i_low,mu_i_high,mu_d_low,mu_d_high))

print ("Conclusions: we expect zero or one woman at CNU to develop breast cancer each year. Of the 2500 female")
print ("             students currently enrolled at CNU, we expect 76-114 of them to eventually die from this")
print ("             disease.  Again, this is a shockingly high number, in my opinion.")

0.49997499937496875 9.559811713627
0.25 95.0
Incidence: (-0.73,1.23)      Death: (76.25,113.75)
Conclusions: we expect zero or one woman at CNU to develop breast cancer each year. Of the 2500 female
             students currently enrolled at CNU, we expect 76-114 of them to eventually die from this
             disease.  Again, this is a shockingly high number, in my opinion.
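
Note that the lower bound of the incidence interval is negative, which cannot happen for a count. This is a sign that the symmetric interval is a poor approximation when $p \cdot N$ is very small. One possible alternative, sketched here and not part of the calculation above, is to take the interval directly from the quantiles of the binomial distribution using `stats.binom.interval`. The variable names are chosen for this sketch only.

```python
import scipy.stats as stats

N = 2500
p_incidence = 10.0 / 100000.0   # yearly incidence, from the example above
p_death = 3.8 / 100.0           # lifetime mortality, from the example above

# Central 95% interval taken from the binomial quantiles (never negative)
inc_low, inc_high = stats.binom.interval(0.95, N, p_incidence)
death_low, death_high = stats.binom.interval(0.95, N, p_death)

print("Incidence: (%d,%d)      Death: (%d,%d)" % (inc_low, inc_high, death_low, death_high))
```

For the mortality case, where $p \cdot N$ is large, the two approaches should give very similar intervals.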


# Example 2 - One Way $\chi^2$ Test

If the data can be organized into a set of categories, then the numbers of samples in each category are known as *frequencies*.  I find this terminology to be a bit confusing, because it is nothing like what a physicist would call a frequency!

In any case, the point of the analysis is to compare the number of samples in each category to what we would expect, based on some model.

Let's suppose that you record the number of times that a person has to do the dishes, in an apartment that houses six people.  Ideally, you would expect as a matter of fairness that everyone would contribute equally, and so the model would be a uniform distribution.  If you recorded who did the dishes for a certain number of days, then you would expect that:

$e_i = \frac{N_{days}}{N_{people}}$ for all $i$.

We can calculate the total deviation of the observed counts, $o_i$, from the expected counts, $e_i$, using:

$SS_{total} = \sum_i {\frac{(o_i - e_i)^2}{e_i}}$

If the model is correct, $SS_{total}$ should follow a $\chi^2$ distribution, where the number of degrees of freedom is the number of categories minus one.
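
To make the formula concrete, here is a short sketch that evaluates the sum by hand and converts it to a probability using the $\chi^2$ survival function. The helper name `chi_square_statistic` and the die-roll counts are invented for illustration only.

```python
import numpy as np
import scipy.stats as stats

def chi_square_statistic(observed, expected):
    """Return (statistic, degrees of freedom, p-value) for a one-way test."""
    observed = np.asarray(observed, dtype=float)
    expected = np.asarray(expected, dtype=float)
    stat = np.sum((observed - expected) ** 2 / expected)
    dof = len(observed) - 1              # number of categories minus one
    p_value = stats.chi2.sf(stat, dof)   # probability of a value at least this large
    return stat, dof, p_value

# Hypothetical data: 60 rolls of a die, with a fair die as the model
rolls = np.array([8, 12, 9, 11, 10, 10])
fair = np.full(6, rolls.sum() / 6)

stat, dof, p_value = chi_square_statistic(rolls, fair)
print("SS_total = ", stat, " dof = ", dof, " p = ", p_value)
```

The result should agree with `stats.chisquare(rolls)`, which assumes equal expected counts by default.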

In [48]:
N_days = 33
N_people = 6
names = ['Atka','Javier','Hans','Aoiffe','Jamal','Marie']
times = np.array([10,6,5,4,5,3])

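# Expected counts under the uniform ("fair") model: N_days/N_people for each person
e = np.full(len(names), N_days/N_people)

# What is the likelihood that Atka is being treated unfairly?

# Compare the observed counts to the expected counts (uniform is also the scipy default)
SS_total, p = stats.chisquare(times, f_exp=e)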
print ("SS_total = ",SS_total," Probability of random chance distribution = ",p)

SS_e, p_e = stats.chisquare(e)
print ("SS_expected = ",SS_e," Probability of random chance distribution = ",p_e)

times2 = np.array([33,0,0,0,0,0])

SS_rare, p_rare = stats.chisquare(times2)
print ("SS_rare = ",SS_rare," Probability of random chance distribution = ",p_rare)

SS_total =  5.363636363636364  Probability of random chance distribution =  0.37313038594870584
SS_expected =  0.0  Probability of random chance distribution =  1.0
SS_rare =  165.0  Probability of random chance distribution =  8.503975940427937e-34
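
The test above only says whether the whole distribution is consistent with the uniform model. As a rough follow-up sketch (an addition for illustration, not a formal test of any single person), we can look at how much each person contributes to the sum, and compare the total to the 5% critical value of the $\chi^2$ distribution with 5 degrees of freedom.

```python
import numpy as np
import scipy.stats as stats

names = ['Atka','Javier','Hans','Aoiffe','Jamal','Marie']
times = np.array([10,6,5,4,5,3])
e = np.full(len(names), times.sum() / len(names))

# Each person's term in the chi-square sum
contributions = (times - e) ** 2 / e
for name, c in zip(names, contributions):
    print("%-7s contributes %0.2f" % (name, c))

# Value that the sum would have to exceed to reject the uniform model at the 5% level
critical = stats.chi2.ppf(0.95, len(names) - 1)
print("Sum = %0.2f   5%% critical value = %0.2f" % (contributions.sum(), critical))
```

Atka accounts for the largest share of the deviation, but the total is still below the critical value, which is consistent with the large p-value printed above.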
