## Inferential statistics lab

It might be a good idea to first check the [source of the Boston housing data](https://archive.ics.uci.edu/ml/datasets/Housing). 

We've saved the data for you in a file named "housing.data". Load it in using any method you choose.

In [3]:
import pandas as pd

columns = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD',
           'TAX', 'PTRATIO', 'B', 'LSTAT', 'MEDV']

# The file is whitespace-delimited and has no header row, so supply the column names.
t = pd.read_csv('housing.data', header=None, sep=r"\s+", names=columns)
t.head()

Unnamed: 0,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT,MEDV
0,0.00632,18.0,2.31,0,0.538,6.575,65.2,4.09,1,296.0,15.3,396.9,4.98,24.0
1,0.02731,0.0,7.07,0,0.469,6.421,78.9,4.9671,2,242.0,17.8,396.9,9.14,21.6
2,0.02729,0.0,7.07,0,0.469,7.185,61.1,4.9671,2,242.0,17.8,392.83,4.03,34.7
3,0.03237,0.0,2.18,0,0.458,6.998,45.8,6.0622,3,222.0,18.7,394.63,2.94,33.4
4,0.06905,0.0,2.18,0,0.458,7.147,54.2,6.0622,3,222.0,18.7,396.9,5.33,36.2


Exercise 1: Conduct a brief integrity check of your data. This integrity check should include, but is not limited to, checking for missing values and making sure all values make logical sense (e.g., is one variable a percentage, but there are observations above 100%?).
Summarize your findings in a few sentences, including what you checked and, if appropriate, any steps you took to rectify potential integrity issues.

In [4]:
t.describe()

Unnamed: 0,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT,MEDV
count,506.0,506.0,506.0,506.0,506.0,506.0,506.0,506.0,506.0,506.0,506.0,506.0,506.0,506.0
mean,3.613524,11.363636,11.136779,0.06917,0.554695,6.284634,68.574901,3.795043,9.549407,408.237154,18.455534,356.674032,12.653063,22.532806
std,8.601545,23.322453,6.860353,0.253994,0.115878,0.702617,28.148861,2.10571,8.707259,168.537116,2.164946,91.294864,7.141062,9.197104
min,0.00632,0.0,0.46,0.0,0.385,3.561,2.9,1.1296,1.0,187.0,12.6,0.32,1.73,5.0
25%,0.082045,0.0,5.19,0.0,0.449,5.8855,45.025,2.100175,4.0,279.0,17.4,375.3775,6.95,17.025
50%,0.25651,0.0,9.69,0.0,0.538,6.2085,77.5,3.20745,5.0,330.0,19.05,391.44,11.36,21.2
75%,3.677082,12.5,18.1,0.0,0.624,6.6235,94.075,5.188425,24.0,666.0,20.2,396.225,16.955,25.0
max,88.9762,100.0,27.74,1.0,0.871,8.78,100.0,12.1265,24.0,711.0,22.0,396.9,37.97,50.0


Looking at the minimum and maximum of each column, nothing jumps out as blatantly wrong: the percentage-like columns ZN and AGE stay within 0 to 100, CHAS ranges only from 0 to 1, and no column contains negative values. Every column has the same count (506), which gives me confidence that no values are missing. Overall, the data looks to be in good working shape.
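The checks described above can also be made explicit rather than read off the summary table. The following is a minimal sketch on a made-up four-row frame (the values and the `bounds` dictionary are assumptions for illustration, not part of the lab data); the same two steps would apply to `t`.

```python
import pandas as pd

# Hypothetical miniature frame standing in for the housing data.
df = pd.DataFrame({
    "ZN": [18.0, 0.0, 0.0, 12.5],
    "AGE": [65.2, 78.9, 61.1, 45.8],
    "CHAS": [0, 0, 1, 0],
    "MEDV": [24.0, 21.6, 34.7, 33.4],
})

# Step 1: count missing values in each column.
missing = df.isnull().sum()

# Step 2: count values outside plausible (low, high) bounds.
# The bounds themselves are assumptions about what each column should contain.
bounds = {"ZN": (0, 100), "AGE": (0, 100), "CHAS": (0, 1)}
out_of_range = {
    col: int((~df[col].between(low, high)).sum())
    for col, (low, high) in bounds.items()
}

print(missing)
print(out_of_range)
```

A nonzero entry in either result would point to the column that needs a closer look.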

Exercise 2: For what two attributes does it make the least sense to calculate mean and median? Why?

The ZN attribute appears to be meaningful only for a subset of the sample, so the mean over the entire sample has little practical meaning. The median is 0 while the maximum is 100, so the mean is essentially an average of a percentage that is 0 for at least half of the observations.

The other attribute whose mean is not very useful is RAD, which represents an index of accessibility to radial highways. The numerical meaning of the index is unclear, and it appears to be non-continuous, as if each index value were a label for something else, so an average of it is hard to interpret.

Exercise 3: Find the mean, standard deviation, and the standard error of the mean for variable 'AGE.'

In [5]:
t['AGE'].describe()

count    506.000000
mean      68.574901
std       28.148861
min        2.900000
25%       45.025000
50%       77.500000
75%       94.075000
max      100.000000
Name: AGE, dtype: float64
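The summary above gives the mean and standard deviation but not the standard error of the mean, which is the standard deviation divided by the square root of the sample size. For AGE that is 28.148861 / sqrt(506), roughly 1.251. The sketch below shows the calculation two ways on a small made-up sample (the `sample` array is an assumption for illustration).

```python
import numpy as np
from scipy import stats

# Hypothetical sample; not the full AGE column.
sample = np.array([65.2, 78.9, 61.1, 45.8, 54.2])

n = len(sample)
std = sample.std(ddof=1)           # sample standard deviation, as pandas reports it
sem_manual = std / np.sqrt(n)      # standard error of the mean by hand
sem_scipy = stats.sem(sample)      # scipy uses ddof=1 by default

print(sem_manual, sem_scipy)
```

On the notebook's data, `t['AGE'].sem()` should give the same quantity directly.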

Exercise 4: Generate a 90%, 95%, and 99% confidence interval for 'AGE'. Do at least one of these manually (i.e. by plugging in the appropriate parts) and at least one of these using a function from scipy.stats. Interpret the results from all three confidence intervals.

In [37]:
import scipy.stats as stats

#let's first do this by hand for a 90% confidence interval
data = t['AGE']
n = len(data)  # n is large, so a z score is a fine approximation
stan_error = data.std() / n ** 0.5  # standard error of the mean
z_value = 1.645  # two-sided critical value for a 90% confidence interval

interval_width = z_value* stan_error
print("Our interval is:" + str((data.mean() - interval_width, data.mean() + interval_width)))

Our interval is:(66.51639831672087, 70.6334040548207)


In [38]:
# Now let's do it the easy way, using the t distribution from scipy.stats

print("Our 90% interval is:" + str(stats.t.interval(.9, n-1, loc=data.mean(), scale=data.std()/(n ** 0.5))))
print("Our 95% interval is:" + str(stats.t.interval(.95, n-1, loc=data.mean(), scale=data.std()/(n ** 0.5))))
print("Our 99% interval is:" + str(stats.t.interval(.99, n-1, loc=data.mean(), scale=data.std()/(n ** 0.5))))


Our 90% interval is:(66.512798667041892, 70.637003704499676)
Our 95% interval is:(66.11636971854324, 71.033432652998329)
Our 99% interval is:(65.3393604183414, 71.810441953200169)


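These are intervals for the population mean of AGE, not for individual observations. We are 90% confident that the mean lies between 66.51 and 70.64, 95% confident that it lies between 66.12 and 71.03, and 99% confident that it lies between 65.34 and 71.81. The interval widens as the confidence level rises, and the manual z-based 90% interval (66.52, 70.63) is almost identical to the t-based one because n = 506 is large.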
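The manual calculation can be extended to all three levels without hard-coding critical values. This is a sketch that works from the summary statistics printed for AGE earlier (mean, standard deviation, and count) and looks up the z critical value with `stats.norm.ppf`; it is a z-based approximation, not the t-based function call used above.

```python
from math import sqrt
from scipy import stats

# Summary statistics copied from the describe() output for AGE.
mean, std, n = 68.574901, 28.148861, 506
sem = std / sqrt(n)

intervals = {}
for level in (0.90, 0.95, 0.99):
    # Two-sided critical value: leave (1 - level) / 2 in each tail.
    z = stats.norm.ppf(1 - (1 - level) / 2)
    intervals[level] = (mean - z * sem, mean + z * sem)
    print(f"{level:.0%} z interval: {intervals[level]}")
```

Swapping `stats.norm.ppf(q)` for `stats.t.ppf(q, n - 1)` would give the t-based version of the same intervals.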

Exercise 5: For variable 'NOX', generate a 95% confidence interval and interpret it.

In [39]:
data = t['NOX']
n = len(data)  # n is large, so the t interval is nearly identical to a z interval

print("Our 95% interval for 'Nox' is:" + str(stats.t.interval(.95, n-1, loc=data.mean(), scale=data.std()/(n ** 0.5))))

Our 95% interval for 'Nox' is:(0.54457426229217976, 0.56481585628489472)


In [35]:
print("max: " + str(data.max()))
print("min: " + str(data.min()))

max: 0.871
min: 0.385


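The 95% confidence interval for the mean of 'NOX' is (0.545, 0.565): we are 95% confident that the population mean lies in this range. The interval is narrow compared with the range of the data (0.385 to 0.871) because it measures uncertainty about the mean, which shrinks with the square root of the sample size (n = 506), not the spread of individual observations. A narrow interval therefore does not by itself indicate outliers.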
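As a rough check on the width of this interval, the half-width can be worked out from the summary statistics printed earlier for NOX (standard deviation 0.115878, n = 506), taking the t critical value for 505 degrees of freedom as approximately 1.965:

$$
\bar{x} \pm t^{*}_{n-1}\,\frac{s}{\sqrt{n}},
\qquad
\frac{s}{\sqrt{n}} = \frac{0.115878}{\sqrt{506}} \approx 0.00515,
\qquad
t^{*}_{505}\,\frac{s}{\sqrt{n}} \approx 1.965 \times 0.00515 \approx 0.0101
$$

A half-width of about 0.0101 around the mean of 0.5547 reproduces the interval reported above.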

Exercise 6: For the variable 'NOX', find the median.

In [41]:
data.median()

0.538

Exercise 7: For the variable 'NOX', test the hypothesis that the mean is equal to the median. You may use scipy functions to complete this, but complete all steps - define hypotheses, etc. Let alpha = 0.05. Interpret your results.

In [48]:
# H0: the population mean of NOX equals 0.538, the sample median computed above.
# HA: the population mean of NOX is not equal to 0.538 (two-sided test, alpha = 0.05).

u = data.median()   # hypothesized value of the mean under H0
x_ = data.mean()    # sample mean
print(stats.ttest_1samp(data, u))



Ttest_1sampResult(statistic=3.2408837167794102, pvalue=0.0012702109998191441)


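Because the p-value (0.0013) is below alpha = 0.05, we reject the null hypothesis that the mean of NOX equals the median, 0.538. The positive t statistic shows that the sample mean (0.555) lies above the median.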
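To see where the test statistic comes from, the one-sample t statistic can be rebuilt by hand as the distance between the sample mean and the hypothesized value, measured in standard errors. The sketch below uses the NOX summary statistics printed earlier instead of the raw column, so small rounding differences from the scipy output are expected.

```python
from math import sqrt
from scipy import stats

# Summary statistics copied from the describe() output for NOX.
mean, std, n = 0.554695, 0.115878, 506
mu0 = 0.538  # hypothesized mean under H0 (the sample median)

sem = std / sqrt(n)
t_stat = (mean - mu0) / sem

# Two-sided p-value: probability of a |t| at least this large under H0.
p_value = 2 * stats.t.sf(abs(t_stat), n - 1)

print(t_stat, p_value)
```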

Exercise 8: What do you notice about the results from Exercise 5 through Exercise 7? If you were going to generalize this to the relationship between hypothesis tests and confidence intervals, what might you say? Be specific.

The hypothesized value u = 0.538 lies outside the 95% confidence interval for the mean from Exercise 5, (0.545, 0.565), and the two-sided test in Exercise 7 rejects it at alpha = 0.05. These are two views of the same calculation: a two-sided test at significance level alpha rejects a hypothesized mean exactly when that value falls outside the (1 - alpha) confidence interval. The confidence interval is the range of hypothesized means that the data would not reject, while the p-value measures how likely a result at least as extreme as the observed one would be if one specific hypothesized value were true.
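The link between the interval and the test can be checked numerically. This sketch, again based on the NOX summary statistics, tries several hypothesized means (all but 0.538 are arbitrary values chosen for illustration) and records whether "inside the 95% interval" and "rejected at alpha = 0.05" disagree, as they should.

```python
from math import sqrt
from scipy import stats

# Summary statistics copied from the describe() output for NOX.
mean, std, n = 0.554695, 0.115878, 506
sem = std / sqrt(n)
alpha = 0.05

# 95% t-based confidence interval for the mean.
t_crit = stats.t.ppf(1 - alpha / 2, n - 1)
low, high = mean - t_crit * sem, mean + t_crit * sem

agreement = []
for mu0 in (0.538, 0.545, 0.555, 0.564, 0.570):
    t_stat = (mean - mu0) / sem
    p_value = 2 * stats.t.sf(abs(t_stat), n - 1)
    inside = low <= mu0 <= high
    rejected = p_value < alpha
    # A value is rejected exactly when it is not inside the interval.
    agreement.append(inside != rejected)
    print(f"mu0={mu0}: inside={inside}, rejected={rejected}")
```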

Exercise 9: For the variable 'NOX', test the hypothesis that the mean is greater than or equal to the median. You may use scipy functions to complete this, but complete all steps - define hypotheses, etc. Let alpha = 0.05. Interpret your results.

In [49]:

u = data.median()
x_ = data.mean()
n = len(data)
print(stats.ttest_1samp(data,u))
# H0: mean >= median, HA: mean < median (one-sided). ttest_1samp reports a two-sided p-value.
# The t statistic is positive, so the sample mean lies above the median, which agrees with H0.
# The one-sided p-value is therefore 1 - p/2 (about 0.999), and we fail to reject H0 at alpha = 0.05.

Ttest_1sampResult(statistic=3.2408837167794102, pvalue=0.0012702109998191441)


Exercise 10: Compare the p-values from Exercise 7 and Exercise 9. What do you notice?

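As printed, the two p-values are identical (0.00127) because stats.ttest_1samp returns a two-sided p-value by default in both exercises. The one-sided p-value for Exercise 9 has to be derived from it: it is half the two-sided p-value when the t statistic points in the direction of the alternative hypothesis, and one minus that half otherwise. Here the alternative is that the mean is less than the median and t is positive, so the one-sided p-value is 1 - 0.00127/2, about 0.999.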
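One way to show the relationship between two-sided and one-sided p-values is to compute each tail of the t distribution directly. The sketch below does this from the NOX summary statistics; it is an illustration under those rounded inputs, not a rerun of the notebook cells.

```python
from math import sqrt
from scipy import stats

# Summary statistics copied from the describe() output for NOX.
mean, std, n = 0.554695, 0.115878, 506
mu0 = 0.538
df = n - 1

t_stat = (mean - mu0) / (std / sqrt(n))

p_two_sided = 2 * stats.t.sf(abs(t_stat), df)  # HA: mean != mu0
p_greater = stats.t.sf(t_stat, df)             # HA: mean > mu0 (upper tail)
p_less = stats.t.cdf(t_stat, df)               # HA: mean < mu0 (lower tail)

print(p_two_sided, p_greater, p_less)
```

With the raw column and SciPy 1.6 or later, `stats.ttest_1samp(data, u, alternative='less')` or `alternative='greater'` should return these one-sided p-values directly.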