# 04-02-analysis-of-observations-practice

## The general analysis pattern

A typical pattern for **`scikit-learn`** supervised learning classes proceeds as follows:
* Begin by instantiating an estimator class, as shown with `GaussianNB` in the next cell.
* Call the `fit` method with the training data (typically called `X`) and the class labels (typically called `y`). This yields a fitted predictor.
* Call the `predict` method on the predictor with test data to obtain its predictions.

In [None]:
# Example of Gaussian Naive Bayes prediction
# Nothing to do here, this cell is provided as a code sample

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)
gnb = GaussianNB()
y_pred = gnb.fit(X_train, y_train).predict(X_test)
print("Number of mislabeled points out of a total %d points : %d"
      % (X_test.shape[0], (y_test != y_pred).sum()))

## 1. Regression Diagnostics Problem

Although a detailed discussion of regression diagnostics is outside our scope,
the following definitions may be helpful ([source](https://sphweb.bumc.bu.edu/otlt/MPH-Modules/BS/R/R5_Correlation-Regression/R5_Correlation-Regression7.html)):
* **Outliers:** An outlier is defined as an observation that has a large residual. In other words, the observed value for the point is very different from that predicted by the regression model.
* **Leverage points:** A leverage point is defined as an observation that has a value of x that is far away from the mean of x. 
* **Influential observations:** An influential observation is defined as an observation that changes the slope of the line. Thus, influential points have a large influence on the fit of the model. One method to find influential points is to compare the fit of the model with and without each observation.
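
The three definitions above can be made concrete with a little linear algebra. The following is a minimal sketch, not part of the exercise, that uses hypothetical toy data (the names `x_demo` and `y_demo` are illustrative). It computes residuals, leverage as the diagonal of the hat matrix, and Cook's distance, which is one common way (an assumption here, not something the source prescribes) to summarize influence without refitting the model for every point.

```python
import numpy as np

# Hypothetical toy data (not the patient data): the last point has an x value
# far from the others.
x_demo = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 12.0])
y_demo = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 9.0])

# Design matrix with an intercept column, then the ordinary least-squares fit
X = np.column_stack([np.ones_like(x_demo), x_demo])
beta, *_ = np.linalg.lstsq(X, y_demo, rcond=None)
residuals = y_demo - X @ beta

# Leverage: diagonal of the hat matrix H = X (X^T X)^{-1} X^T
H = X @ np.linalg.inv(X.T @ X) @ X.T
leverage = np.diag(H)

# Cook's distance combines residual size and leverage into one influence score
n, p = X.shape
s2 = residuals @ residuals / (n - p)  # residual variance estimate
cooks_d = residuals**2 / (p * s2) * leverage / (1.0 - leverage) ** 2

for i in range(n):
    print(f"point {i}: residual={residuals[i]:6.2f}  "
          f"leverage={leverage[i]:.2f}  cook={cooks_d[i]:.2f}")
```

In this toy data the point with the largest residual is not the point with the largest leverage, which is why the three notions are defined separately.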

The table below shows data from a study of 20 patients with chronic congestive heart failure. Two
measurements are shown — ejection fraction x (in percent), which is a measure of left ventricular
dysfunction, and pulmonary arterial wedge pressure y (in mm Hg).

![](patient_data.JPG)

One value has been mistranscribed from the original paper. Determine which patient’s data is most likely to be wrong.

Hint: The next cell has two imports to get you started. The `LinearRegression` class helps you perform a linear regression on the data points,
and the `mean_squared_error` function helps you determine the overall error. Drop the patients one by one and observe the mean squared error.
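
The drop-one-at-a-time procedure from the hint might look like the sketch below. It runs on hypothetical toy data (`x_toy`, `y_toy`) rather than the patient data, and the helper `mse_without` is an illustrative name, not something the notebook provides. Adapting it to the patient lists is left as the exercise.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Hypothetical toy data (not the patient data); one point breaks the linear trend.
x_toy = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y_toy = np.array([2.0, 4.1, 5.9, 15.0, 10.1, 12.0])

def mse_without(x, y, drop):
    """Fit a line to all points except index `drop`; return the MSE of that fit."""
    keep = np.arange(len(x)) != drop
    X = x[keep].reshape(-1, 1)  # scikit-learn expects a 2-D feature array
    model = LinearRegression().fit(X, y[keep])
    return mean_squared_error(y[keep], model.predict(X))

loo_mse = [mse_without(x_toy, y_toy, i) for i in range(len(x_toy))]
for i, mse in enumerate(loo_mse):
    print(f"dropped point {i}: MSE of remaining points = {mse:.3f}")
```

Comparing the printed values against each other shows how much each single point contributes to the overall error.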

**Question:** If you drop the patient whose value was mistranscribed, what do you expect the mean squared error for the remaining set to do?

**Your answer**

---

---

In [None]:
# Nothing to code in this cell, just run it!

%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()
import pandas as pd

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

In [None]:
# For your convenience, the two rows of the table are available as Python variables here
x = [int(n) for n in '28 26 42 29 16 21 25 35 30 36 37 41 20 26 38 26 10 18 10 31'.split()]
y = [int(n) for n in '15 14 15 12 37 30 7 14 28 13 5 13 24 8 13 17 27 29 8 5'.split()]

# For your convenience, a look at the data:
noprint = plt.scatter(x, y)

In [None]:
# Your solution

## 2. Bayesian Estimation Problem

The following table shows data for 10 taxpayers and whether they were audited. Jill's data is shown in row 11. What does a Gaussian Naive Bayes estimate predict about whether Jill will be audited?

| Txid | Refund | Marital Status | Taxable Income | Audit |
|:----:|:------:|:--------------:|:--------------:|:-----:|
| 1    | Yes    | Single         | 125K           | No    |
| 2    | No     | Married        | 100K           | No    |
| 3    | No     | Single         | 70K            | No    |
| 4    | Yes    | Married        | 120K           | No    |
| 5    | No     | Divorced       | 95K            | Yes   |
| 6    | No     | Married        | 60K            | No    |
| 7    | Yes    | Divorced       | 220K           | No    |
| 8    | No     | Single         | 85K            | Yes   |
| 9    | No     | Married        | 75K            | No    |
| 10   | No     | Single         | 90K            | Yes   |
| 11   | Yes    | Divorced       | 80K            | ??    |
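
As a reminder of the mechanics, the sketch below shows how a Gaussian Naive Bayes estimate treats a single numeric feature: each class gets a prior from its frequency and a Gaussian likelihood from the class mean and variance, and the product is normalized into a posterior. The data is a hypothetical toy set, not the taxpayer table, and the names `income`, `label`, and `query` are illustrative. With several features, the per-feature likelihoods are multiplied together, and categorical columns need a numeric encoding before they can be passed to `GaussianNB`.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Hypothetical toy data (not the taxpayer table): one numeric feature, two classes
income = np.array([50.0, 60.0, 70.0, 90.0, 100.0, 110.0])
label = np.array([0, 0, 0, 1, 1, 1])
query = 75.0

def gaussian_pdf(v, mean, var):
    """Density of a normal distribution with the given mean and variance."""
    return np.exp(-(v - mean) ** 2 / (2.0 * var)) / np.sqrt(2.0 * np.pi * var)

# Unnormalized posterior per class: prior * Gaussian likelihood
scores = {}
for c in np.unique(label):
    values = income[label == c]
    prior = len(values) / len(income)
    scores[int(c)] = prior * gaussian_pdf(query, values.mean(), values.var())

# Normalize so the class probabilities sum to one
total = sum(scores.values())
posterior = {c: s / total for c, s in scores.items()}
print("manual posterior:", posterior)

# Cross-check against scikit-learn on the same single feature
gnb = GaussianNB().fit(income.reshape(-1, 1), label)
print("GaussianNB posterior:", gnb.predict_proba([[query]])[0])
```

The two printed posteriors should agree closely; `GaussianNB` adds a very small smoothing term to the variances, so the match is not exact to the last digit.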

In [None]:
# A Python-accessible version of the above table is given below for your convenience.
tbl = '''
1	Yes	Single	125K	No
2	No	Married	100K	No
3	No	Single	70K	No
4	Yes	Married	120K	No
5	No	Divorced	95K	Yes
6	No	Married	60K	No
7	Yes	Divorced	220K	No
8	No	Single	85K	Yes
9	No	Married	75K	No
10	No	Single	90K	Yes
'''

In [None]:
# Your solution

![The end](https://live.staticflickr.com/32/89187454_3ae6aded89_b.jpg)