# Classes and Object Oriented Programming
## Notebook 2

In this notebook you are going to create classes that simulate people. We will be using Pandas to generate the data we will use to create our person objects.

In [None]:
from nose.tools import assert_equal, assert_true, assert_raises, assert_almost_equal

In [None]:
import pandas as pd
import gzip
import pickle
import dateutil
import dateutil.parser  # used explicitly by the person class below
import datetime
from dateutil.relativedelta import relativedelta
import numpy.random as ra
import math
import random
import seaborn as sns
import numpy as np
import numbers
import uuid

## Get Surnames

I have downloaded a CSV file from the [Census Bureau](https://www.census.gov/topics/population/genealogy/data/2000_surnames.html) with the surnames and their counts in the 2010 census.

**Problem 1 (5 points)**: Create a Pandas DataFrame named `surnames` from the file `2010_surnames.csv`. It should have the columns `"name"` and `"count"` and contain only the rows where `count` is greater than 10,000.


In [None]:
surnames = None
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
assert_equal(surnames.shape, (3287,2))

In [None]:
assert_equal(list(surnames.columns), ["name", "count"])

**Problem 2 (10 points):** Create a new column in `surnames` called `"probability"` that is the probability each name occurs in the sample of names (i.e. the count for that name divided by the sum of all the counts). Sort `surnames` so that the rows increase with `"probability"`.

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
assert_almost_equal(surnames[surnames["name"] == "YOUNG"]["probability"].values[0], 0.003138555454437736)

In [None]:
assert_almost_equal(surnames[surnames["name"] == "CHAPMAN"]["probability"].values[0], 0.00083938110990776664)

In [None]:
assert_true(surnames["probability"][surnames.index[0]] < surnames["probability"][surnames.index[-1]])


**Problem 3 (5 points):** Add a column named `"cumulative_probability"` that holds, for a given name, the cumulative sum of the probabilities of that name and all names less common than it.

The tail of your DataFrame should look like this:

<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>name</th>
      <th>count</th>
      <th>probability</th>
      <th>cumulative_probability</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>4</th>
      <td>JONES</td>
      <td>1362755</td>
      <td>0.009179</td>
      <td>0.951855</td>
    </tr>
    <tr>
      <th>3</th>
      <td>BROWN</td>
      <td>1380145</td>
      <td>0.009296</td>
      <td>0.961152</td>
    </tr>
    <tr>
      <th>2</th>
      <td>WILLIAMS</td>
      <td>1534042</td>
      <td>0.010333</td>
      <td>0.971485</td>
    </tr>
    <tr>
      <th>1</th>
      <td>JOHNSON</td>
      <td>1857160</td>
      <td>0.012510</td>
      <td>0.983994</td>
    </tr>
    <tr>
      <th>0</th>
      <td>SMITH</td>
      <td>2376206</td>
      <td>0.016006</td>
      <td>1.000000</td>
    </tr>
  </tbody>
</table>

**Hint:** There is a Pandas DataFrame/Series method for this.

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
assert_almost_equal(surnames[surnames["name"] == "WEBBER"]["cumulative_probability"].values[0], 0.21867902035776357)

In [None]:
assert_almost_equal(surnames[surnames["name"] == "JOHNSON"]["cumulative_probability"].values[0], 0.98399423475974246)

**Problem 4 (15 points):** Write a function `get_lastname` that takes as an argument your modified DataFrame `surnames` and returns a random name based on name frequency. Here is some pseudo-code for the function:

function get_lastname  

    Arguments:  
        a positional argument that is a DataFrame with surnames and the cumulative probability for each surname  
        a keyword argument seed with a seed value (default None)   
    Returns: a name   
    
    Set the random number generator seed with seed.   
    Generate a random number (v) between 0 and 1   
    for each row in DataFrame    ask
       is v < row["cumulative_probability"]?    
           if yes then return the name value in this row   
           if no then continue    
           
There is a faster way to achieve this result that uses Pandas selection with a boolean mask rather than a for loop:

    Arguments:  
        a positional argument that is a DataFrame with surnames and the cumulative probability for each surname  
        a keyword argument seed with a seed value (default None)   
    Returns: a name   
    
    Set the random number generator seed with seed.   
    Generate a random number (v) between 0 and 1   
    
    Select all the rows in the DataFrame where the cumulative probability is greater than v
    Select the first row in the resulting DataFrame and return the name from that row.
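
The boolean-selection idea can be sketched on a toy table. Everything in this block is made up for illustration: the three names and counts, and the helper `pick_by_threshold`, are hypothetical and not part of the assignment. The random number `v` is passed in explicitly so the behavior is easy to check by hand.

```python
import pandas as pd

# Toy data (hypothetical), not the census file.
toy = pd.DataFrame({"name": ["CARTER", "BAKER", "ADAMS"],
                    "count": [10, 30, 60]})
toy["probability"] = toy["count"] / toy["count"].sum()
toy = toy.sort_values("probability")
toy["cumulative_probability"] = toy["probability"].cumsum()


def pick_by_threshold(df, v, return_col="name",
                      test_col="cumulative_probability"):
    """Return return_col from the first row whose test_col exceeds v."""
    candidates = df[df[test_col] > v]   # boolean-mask selection
    return candidates.iloc[0][return_col]


for v in (0.05, 0.25, 0.95):
    print(v, pick_by_threshold(toy, v))
```

Because the rows are sorted by ascending cumulative probability, the first row that passes the mask is the one whose probability interval contains `v`.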


#### We need the seed argument for testing purposes.
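
As a small illustration of why a seed makes a random function testable, the sketch below reseeds the standard-library generator and draws twice. It assumes the `random` module is the generator in use, as the import in the next cell suggests.

```python
import random

random.seed(1)
first = random.random()

random.seed(1)          # same seed, so the same sequence starts over
second = random.random()

print(first == second)
```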

In [None]:
import random
def get_lastname(surnames, seed=None):
    # YOUR CODE HERE
    raise NotImplementedError()

In [None]:

assert_equal(get_lastname(surnames, seed=1), 'BATEMAN')


In [None]:
assert_equal(get_lastname(surnames, seed=356789), 'WEBB')

## Get Mortality Data

We are going to simulate patients living and dying. To do this we need the probability of dying for an individual at a given age. This is available through the [Social Security Administration website](https://www.ssa.gov/oact/STATS/table4c6.html). As we learned in the Pandas module, Pandas can read HTML tables. HTML, however, is notoriously messy and I had to do some hacking to get the data cleaned up to a usable state.


In [None]:
mortality = pd.read_html("https://www.ssa.gov/oact/STATS/table4c6.html", 
                         skiprows=4, 
                         header=None)[0]
mortality = mortality.iloc[0:120,[1,4]]
mortality.rename(columns=dict(zip(mortality.columns, 
                                  ("Male prob. death", 
                                   "Female prob. death"))), 
                 inplace=True)
mortality.head()

## Get USA Municipalities 

Our people need to live somewhere. I've downloaded a csv file with USA municipalities and their populations (`"PEP_2016_PEPANNRES.csv"`).

[Census Website](https://factfinder.census.gov/faces/tableservices/jsf/pages/productview.xhtml?src=bkmk)

**Problem 5 (15 points):** 

1. Use Pandas to read `PEP_2016_PEPANNRES.csv` into a DataFrame named `municipalities`. The DataFrame should contain the following two columns: "Geography" and "April 1, 2010 - Census". To properly read in the data you will need to use the following keyword argument in `read_csv`:
```Python
encoding="latin1"
```
2. Use the `r2` regular expression object to replace all matches in the DataFrame with an empty string.

3. Similar to what we did with `surnames`, create a column `"probability"` (using the "April 1, 2010 - Census" column), sort `municipalities` by `"probability"`, and create a `"cumulative_probability"` column.
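
To see what step 2 does, here is a minimal sketch that applies the same pattern to a few plain strings. The raw strings are assumptions about what the "Geography" values look like before cleaning; check them against the actual file.

```python
import re

r2 = re.compile(" city| village| town")

# Hypothetical raw values, assumed to carry a lowercase place-type suffix.
raw = ["Goss town, Missouri",
       "Monowi village, Nebraska",
       "Salt Lake City city, Utah"]

# sub() replaces every match with the empty string.
cleaned = [r2.sub("", s) for s in raw]
print(cleaned)
```

The pattern is case sensitive, so the capitalized "City" that belongs to a place name is left alone. On a DataFrame column, `Series.str.replace` accepts a compiled pattern when called with `regex=True`.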

Your resulting DataFrame should look like this:

<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>Geography</th>
      <th>April 1, 2010 - Census</th>
      <th>probability</th>
      <th>cumulative_probability</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>9610</th>
      <td>Goss, Missouri</td>
      <td>0.0</td>
      <td>0.000000e+00</td>
      <td>0.000000e+00</td>
    </tr>
    <tr>
      <th>10665</th>
      <td>Monowi, Nebraska</td>
      <td>1.0</td>
      <td>5.203602e-09</td>
      <td>5.203602e-09</td>
    </tr>
    <tr>
      <th>10553</th>
      <td>Gross, Nebraska</td>
      <td>2.0</td>
      <td>1.040720e-08</td>
      <td>1.561081e-08</td>
    </tr>
    <tr>
      <th>14109</th>
      <td>Lotsee, Oklahoma</td>
      <td>2.0</td>
      <td>1.040720e-08</td>
      <td>2.601801e-08</td>
    </tr>
    <tr>
      <th>16043</th>
      <td>Hillsview, South Dakota</td>
      <td>3.0</td>
      <td>1.561081e-08</td>
      <td>4.162882e-08</td>
    </tr>
  </tbody>
</table>

In [None]:
import re
r2 = re.compile(" city| village| town")
municipalities = None
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
assert_equal(municipalities.columns.tolist(), 
             ['Geography', 
              'April 1, 2010 - Census', 
              'probability',
              'cumulative_probability'])

In [None]:
assert_equal(municipalities.iloc[0]["Geography"], "Goss, Missouri")

In [None]:
assert_equal(municipalities.iloc[3274]["Geography"], "Creston, Washington")

**Problem 6 (5 points):**

Generalize the `get_lastname` function to return the value in an arbitrary column based on an arbitrary test column.

In [None]:
def get_random_attribute(df, return_col, 
                         test_col="cumulative_probability",
                         seed=None):
    # YOUR CODE HERE
    raise NotImplementedError()

In [None]:
assert_equal(get_random_attribute(municipalities, 
                                  "Geography",
                                  seed=172),
             'Manassas, Virginia') 

In [None]:
assert_equal(get_random_attribute(surnames, "name",seed=172),"JOHNS")

## Age Distribution
We need distributions for our population ages. Here is information at the [Census Bureau](https://www.census.gov/prod/cen2010/briefs/c2010br-03.pdf). I'm going to just use the geometric distribution in numpy to create a (to me, for this homework) reasonable distribution of ages.


In [None]:

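def get_age(minage=17, maxage=100, p=0.06):
    # Draw minage plus a geometric offset; redraw if the result reaches maxage.
    age = minage+ra.geometric(p,1)[0]
    if age < maxage:
        return age
    else:
        # Pass the arguments along so a retry uses the same settings.
        return get_age(minage, maxage, p)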
ages = pd.Series([get_age() for i in range(100000)])
print(ages.min(), ages.max())
ages.hist(bins=100)

**Problem 7 (5 points):**

Write a function `get_sex` that takes as a positional argument the proportion of the population that is female and returns `"F"` or `"M"` based on that proportion. The keyword argument `seed=None` is needed for testing. The proportion used as the default later in this notebook is:

```Python
0.52
```

In [None]:
def get_sex(female_proportion, seed=None):
    # YOUR CODE HERE
    raise NotImplementedError()

In [None]:
assert_equal(get_sex(0.52, seed=1), "F")

In [None]:
assert_equal(get_sex(0.1, seed=1), "M")

## First Names

**Problem 8 (5 points):** 

The Social Security Administration provides the [names of babies](https://www.ssa.gov/oact/babynames/limits.html) born in the United States between 1880 and 2016. For privacy purposes, names occurring fewer than 5 times per year are excluded. A dictionary containing these names, along with their probabilities and cumulative probabilities, is stored in a compressed pickle file `first_names.pickle.gz`. Use Python to read the dictionary into this notebook.
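
For reference, a gzip-compressed pickle can be written and read with the standard library alone. The sketch below round-trips a small made-up dictionary through a temporary file; the file name `toy_names.pickle.gz` and its contents are hypothetical stand-ins, not the real data.

```python
import gzip
import os
import pickle
import tempfile

# Hypothetical stand-in for the real dictionary.
toy_names = {1880: ["Mary", "John"], 1881: ["Anna", "William"]}

path = os.path.join(tempfile.mkdtemp(), "toy_names.pickle.gz")

# gzip.open returns a file-like object, so pickle can use it directly.
with gzip.open(path, "wb") as f:
    pickle.dump(toy_names, f)

with gzip.open(path, "rb") as f:
    restored = pickle.load(f)

print(restored == toy_names)
```

Only unpickle files from a source you trust, since loading a pickle can execute arbitrary code.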

In [None]:
first_names = None

# YOUR CODE HERE
raise NotImplementedError()

In [None]:
assert_equal(list(first_names.keys()), list(range(1880, 2017)))

**Problem 9 (15 points):**

Complete the code for the person class defined below:


```Python
class person(object):
    """
    person class. 
    """
    __sexes = ('M', 'F')
    def __init__(self, name = "John",
                 dob = "",
                 sex = 'F'):
        """Constructor. Takes a name, a date of birth string, and a sex."""
        self.name = name
        self.dob = dob
        self.sex = sex


    @property
    def name(self):
        ???
    @name.setter
    def name(self,name):
        ???
    @property
    def age(self):
        """
        Returns the person's age in years as an integer.
        
        Age is computed using the difference between the 
        date of birth and the current date.
        """
        ???
    @property
    def dob(self):
        return self.__dob
    @dob.setter
    def dob(self, dob):
        self.__dob = dateutil.parser.parse(dob).date()
    @property
    def sex(self):
        ???
    @sex.setter
    def sex(self, value):
        sex = value.upper()[0]
        if sex not in self.__sexes:
            raise ValueError("%s: invalid sex specification"%sex)
        self.__sex = sex

    def __str__(self):
        ???
```
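
The property pattern used in the template can be sketched on an unrelated toy class. `Pet`, its `born` attribute, and `age_on` are hypothetical names chosen for illustration. This is one possible way to validate in a setter and to count whole years, not the required solution.

```python
import datetime


class Pet:
    """Toy class (hypothetical), showing a validated property."""

    def __init__(self, name, born):
        self.name = name      # goes through the setter below
        self.born = born      # a datetime.date

    @property
    def name(self):
        return self.__name

    @name.setter
    def name(self, value):
        # str methods raise AttributeError on non-strings such as 1947.
        self.__name = value.strip().title()

    def age_on(self, today):
        """Whole years between self.born and today."""
        had_birthday = (today.month, today.day) >= (self.born.month,
                                                    self.born.day)
        return today.year - self.born.year - (0 if had_birthday else 1)


rex = Pet("  rex ", datetime.date(2010, 6, 15))
print(rex.name, rex.age_on(datetime.date(2017, 6, 14)))
```

Storing the value in a double-underscore attribute keeps the public name free for the property itself.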

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
test_person = person(name="Earl", dob="Feb 28, 1929", sex="M")
assert_true("Earl" in test_person.__str__())

In [None]:
test_person = person(name="Earl", dob="Feb 28, 1929", sex="M")
assert_true("88" in test_person.__str__())

In [None]:
test_person = person(name="Donna", dob="March 15, 1947", sex="F")
assert_equal(test_person.age, 70)

In [None]:
assert_raises(ValueError, person, name="Ziggy", 
                                  dob="01/08/1947", 
              sex="Other")


In [None]:
assert_raises(AttributeError, person, name=1947, sex="M", dob="01/08/1947")

**Problem 10 (30 points):**

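Define a class `participant` that inherits from `person`. The class adds the following attributes and properties. For properties, store the actual value in a private variable. The tests below also expect the `age` property of a `participant` to return a dictionary with `"years"` and `"months"` keys, computed at the study start date plus the time in the study: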

* `deceased`: a boolean property set initially to False
    * Define getter/setter properties
* `dod`: a property
    * The actual value is a `datetime.date` object stored as a private attribute 
    * The getter property returns the date in an appropriately formatted string
    * The setter property accepts either a string or a datetime.date object
* `residence`: a string property indicating a municipality in the USA.
    * Define getter/setter properties
* `__study_start`: a datetime.date object initialized from the `enrollment_date` constructor argument (a date string, as in the tests below), defaulting to the current date
    * Initialize in the constructor
    * Define a getter property (def study_start) but not a setter (cannot be modified)
* `__time_in_study`: a `dateutil` `relativedelta` object
    * Define a getter property (`days_in_study`) that returns the number of days in the study as an integer
* `__studyid`: An integer generated with the `uuid` library.
    * Initialize in the constructor
    * Define a getter property but not a setter (cannot be modified)

In addition to the @property and @xyz.setter methods, the class should implement the following methods:

* `increment_study_time`: accepts a `relativedelta` object and increments the `__time_in_study` attribute.
* `dies`: sets `deceased` to `True` and sets `dod` to the study start date plus the time in the study
* `__repr__`
* `__str__`
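
A read-only property is simply a getter with no matching setter. The sketch below shows this on a hypothetical `Enrollment` class; the class and its attribute names are illustrative only. To stay within the standard library it uses `datetime.timedelta` in place of `relativedelta`, so it is an analogy rather than the required design.

```python
import datetime
import uuid


class Enrollment:
    """Toy class (hypothetical) with read-only properties."""

    def __init__(self, start=None):
        self.__start = start or datetime.date.today()
        self.__elapsed = datetime.timedelta(0)
        self.__study_id = uuid.uuid4().int   # random UUID as an integer

    @property
    def start(self):
        return self.__start

    @property
    def study_id(self):
        return self.__study_id

    @property
    def days_elapsed(self):
        return self.__elapsed.days

    def advance(self, delta):
        """Add a timedelta to the elapsed time."""
        self.__elapsed += delta


e = Enrollment(datetime.date(2017, 10, 28))
e.advance(datetime.timedelta(days=10))
print(e.days_elapsed, e.start + datetime.timedelta(days=e.days_elapsed))
```

Assigning to `e.study_id` raises `AttributeError` because no setter is defined.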


In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
p1= participant(residence="Salt Lake City", 
                name="Brian", 
                dob="5/24/1968",
                enrollment_date = "October 28, 2017",
                sex='male')
assert_true(isinstance(p1,person))

In [None]:
p1= participant(residence="Salt Lake City", 
                name="Brian", 
                dob="5/24/1968",
                enrollment_date = "October 28, 2017",
                sex='male')
p1.increment_study_time(relativedelta(months=+24))
assert_equal(p1.days_in_study, 730)

In [None]:
p1= participant(residence="Salt Lake City", 
                name="Brian", 
                dob="5/24/1968",
                enrollment_date = "October 28, 2017",
                sex='male')
p1.increment_study_time(relativedelta(months=+36))
p1.dies()
assert_equal(p1.age, {'months': 5, 'years': 52})

In [None]:
p1= participant(residence="Salt Lake City", 
                name="Brian", 
                dob="5/24/1968",
                enrollment_date = "October 28, 2017",
                sex='male')
p1.increment_study_time(relativedelta(months=+36))
p1.dies()
assert_equal(p1.deceased, True)

In [None]:
p1= participant(residence="Salt Lake City", 
                name="Brian Chapman", 
                dob="5/24/1968",
                enrollment_date = "October 28, 2017",
                sex='male')
p1.increment_study_time(relativedelta(months=+12))
assert_equal(p1.deceased, False)

In [None]:
p1= participant(residence="Salt Lake City", 
                name="Brian Chapman", 
                dob="5/24/1968",
                enrollment_date = "October 28, 2017",
                sex='male')
assert_true("Chapman" in p1.__str__())
#assert_equal(p1.deceased, False)

In [None]:
p1= participant(residence="Salt Lake City", 
                name="Brian Chapman", 
                dob="5/24/1968",
                enrollment_date = "October 28, 2017",
                sex='male')
p1.increment_study_time(relativedelta(months=+6))
assert_true("182" in p1.__repr__())

**Problem 11 (25 points):**

Complete and correct the function `generateRandomPersonAttributes` so that it properly returns a `participant` object with randomly determined attributes:
```Python
def generateRandomPersonAttributes(surnames, first_names, 
                                   municipalities,
                                   proportion_female=0.52):
    sex = get_sex
    age = get_age
    year = datetime.datetime.now().year-???
    dob = (datetime.date(datetime.datetime.now.year-age, 1,1)+\
        relativedelta(days=+random.randint(0,???))).strftime("%B %d, %Y")
    f_name = get_random_attribute(first_names???, ???)
    l_name = get_random_attribute(surnames, ???).capitalize
    residence = get_random_attribute(municipalities, ???)
    return participant(residence=residence, 
                name="??? ???"%(f_name, l_name), 
                dob=???,
                sex=sex)
```

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
participants = \
[generateRandomPersonAttributes(surnames, 
                                first_names, 
                                municipalities) for i in range(5)]
assert_equal(len(participants),5)

In [None]:
participants = \
[generateRandomPersonAttributes(surnames, 
                                first_names, 
                                municipalities) for i in range(5)]
for p in participants:
    assert_true(isinstance(p,participant))

In [None]:
participants = \
[generateRandomPersonAttributes(surnames, 
                                first_names, 
                                municipalities) for i in range(1000)]

In [None]:
pd.Series([p.age["years"] for p in participants]).hist(bins=100)

## A Function to Run the "Study"
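
The `increment_study` function below turns an annual death probability into a per-step probability by dividing by the number of steps per year. The sketch here compares that linear scaling with a compounding alternative that reproduces the annual probability exactly over a full year. Both helper names are hypothetical, and the compounding version is shown only for comparison; it is not what the notebook uses.

```python
def per_step_linear(annual_prob, unit_days=10):
    """Annual probability divided by the number of steps per year."""
    steps_per_year = 365 / unit_days
    return annual_prob / steps_per_year


def per_step_compound(annual_prob, unit_days=10):
    """Per-step probability that compounds back to annual_prob in a year."""
    steps_per_year = 365 / unit_days
    return 1 - (1 - annual_prob) ** (1 / steps_per_year)


for q in (0.001, 0.05, 0.3):
    print(q, per_step_linear(q), per_step_compound(q))
```

For small annual probabilities the two values are nearly identical; the linear version is slightly lower, and the gap grows as the annual probability gets large.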

In [None]:
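def increment_study(participants, mortality, unit=10):
    """Advance every participant by `unit` days and apply mortality."""
    # Number of `unit`-day steps per year; used to scale the annual
    # death probability down to a per-step probability.
    delta = 365/unit
    mkeys = {"M":"Male prob. death", "F":"Female prob. death"}
    for p in participants:
        p.increment_study_time(relativedelta(days=+unit))
        if random.random()< mortality.iloc[p.age["years"]][mkeys[p.sex]]/delta:
            p.dies()
    return None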
    

In [None]:

while True:
    living = [p for p in participants if not p.deceased]
    if len(living)%200 == 0:
        print(len(living))
    if not living:
        break
    increment_study(living, mortality)
    

In [None]:
sns.distplot([p.age["years"] for p in participants])

In [None]:
pd.Series([p.age["years"] for p in participants]).hist(bins=100)

In [None]:
# Print both means; a notebook cell only displays its last expression.
print(np.mean([p.age["years"] for p in participants if p.sex == "M"]))
print(np.mean([p.age["years"] for p in participants if p.sex == "F"]))

**Problem 12 (15 points):** Write a function `get_participant_df` that takes a list of participant objects and returns a Pandas DataFrame with the following columns:
['Age', 'Days in Study', 'First Name', 'Last Name', 'Sex', 'State'].

Each row in the DataFrame corresponds to a particular participant with the column values the appropriate values from a participant object.
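
One common way to build such a table is to turn each object into a dictionary and collect the dictionaries in a list. The sketch below does this for two stand-in objects. `SimpleNamespace` is used here only as a hypothetical substitute for `participant`, and `to_row` is an illustrative helper covering a subset of the columns.

```python
from types import SimpleNamespace

# Hypothetical stand-ins for participant objects.
toy_people = [
    SimpleNamespace(name="Jerry Peck", sex="M",
                    residence="Detroit Lakes, Minnesota"),
    SimpleNamespace(name="Kristen Holland", sex="F",
                    residence="Town and Country, Missouri"),
]


def to_row(p):
    """Map one object to a dict keyed by column name."""
    first, last = p.name.split(" ", 1)
    # The state is whatever follows the last ", " in the residence.
    state = p.residence.rsplit(", ", 1)[-1]
    return {"First Name": first, "Last Name": last,
            "Sex": p.sex, "State": state}


rows = [to_row(p) for p in toy_people]
print(rows)
```

A list of dictionaries like `rows` can be passed directly to `pd.DataFrame`, which uses the keys as column names.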

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
pdf = get_participant_df(participants)

In [None]:
assert_equal(set(pdf.columns),
             set(['Age', 'First Name', 'Days in Study',
                  'Last Name', 'Sex', 'State']))

In [None]:
assert_true(isinstance(pdf[pdf["Sex"]=="M"]["Age"].mean(), numbers.Real))

In [None]:
assert_true(isinstance(pdf.iloc[0]["Days in Study"], numbers.Integral))

In [None]:
sns.kdeplot(data=pdf[pdf["Sex"]=="M"]["Age"], 
            label="Male", color="Blue")
sns.kdeplot(data=pdf[pdf["Sex"]=="F"]["Age"],
            label="Female", color="pink")


In [None]:
sns.boxplot(x="Sex", y="Age", data=pdf);

In [None]:
sns.countplot(x="Sex", data=pdf);

In [None]:
sns.countplot(x="Sex", data=pdf[pdf["Age"]>90]);

**Problem 13 (10 points):** Write a function `create_obituaries` that takes a list of participant objects and a filename, and writes an obituary for each participant to that file in ascending order by date of death. Your file should look something like this:

Jerry Peck, age 82, of Detroit Lakes, Minnesota, died April 06, 2018   
John Conley, age 96, of Kokomo, Indiana, died May 16, 2018   
Jeffery Meyer, age 32, of Wilson's Mills, North Carolina, died May 26, 2018   
Kenneth Jackson, age 54, of Portland, Oregon, died August 04, 2018   
Heather Johnson, age 46, of Sunrise, Florida, died August 24, 2018   
Kristen Holland, age 31, of Town and Country, Missouri, died September 13, 2018   
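
The sorting and date formatting involved can be sketched with plain tuples. The records and the helper `write_lines` are hypothetical, and the example writes to an in-memory buffer instead of a file so it leaves nothing behind.

```python
import datetime
import io

# Hypothetical records: (name, age, residence, date of death).
toy = [
    ("Kenneth Jackson", 54, "Portland, Oregon", datetime.date(2018, 8, 4)),
    ("Jerry Peck", 82, "Detroit Lakes, Minnesota", datetime.date(2018, 4, 6)),
    ("John Conley", 96, "Kokomo, Indiana", datetime.date(2018, 5, 16)),
]


def write_lines(records, handle):
    """Write one line per record, earliest date of death first."""
    for name, age, place, died in sorted(records, key=lambda r: r[3]):
        handle.write("%s, age %d, of %s, died %s\n"
                     % (name, age, place, died.strftime("%B %d, %Y")))


buffer = io.StringIO()
write_lines(toy, buffer)
lines = buffer.getvalue().splitlines()
print(lines[0])
```

Sorting on `datetime.date` objects rather than on formatted strings keeps the order chronological, since month names do not sort alphabetically by calendar position.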

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
import os  # needed for os.remove below
rdate = re.compile(r"""(?P<month>[A-Z][a-z]+) (?P<day>\d{2,2}), (?P<year>\d{4,4})""")
fname = "test_obits.txt"
create_obituaries(participants, fname)
with open(fname) as f0:
    data = f0.readlines()
os.remove(fname)
m0 = rdate.search(data[0])
ml = rdate.search(data[-1])
assert_true(int(m0.group("year")) < int(ml.group("year")))

In [None]:
import os
fname = "test_obits.txt"
create_obituaries(participants, fname)

with open(fname) as f0:
    data = f0.readlines()
os.remove(fname)
m0 = rdate.search(data[0])
ml = rdate.search(data[-1])
assert_equal(len(data), len(participants))
