<font color='darkred'> Unless otherwise noted, **this notebook will not be reviewed or autograded.**</font> You are welcome to use it for scratchwork, but **only the files listed in the exercises will be checked.**

---

# Exercises

For these exercises, add your functions to the *apputil\.py* file. If you like, you're welcome to adjust the *app\.py* file, but it is not required.

## Notes on Recursion

A [recursive function](https://www.w3schools.com/python/gloss_python_function_recursion.asp) is one which calls itself.

1. When the function is called, your CPU runs through each line of code until the function needs to be called again.
2. At that point, all variables are saved in memory, and the function runs through each line of code again until the function is called (again, but with a different passed argument), and so on.
3. Eventually, this process will stop at the "bottom of the **stack**", where the function doesn't get a chance to call itself again (likely because of some condition un/met by the latest passed argument).
4. Then, your CPU will work its way back up the stack to the final result. For example, take a look at [this visual example](https://realpython.com/python-recursion/#calculate-factorial) of calculating 4!.

When you write these functions, keep two things in mind:

- You will need a built-in stopping point (i.e., the "bottom"), where your function returns some result before it calls itself.
- **Don't think too hard about this.** Recursion can be perplexing to conceptualize when writing the code. So, when you call the function inside the function, think about it as a magical "hidden" function that has already done what you want it to do.
- [Python Tutor](https://pythontutor.com/) ([editor](https://pythontutor.com/visualize.html#mode=edit)) can be a helpful resource for this exercise!

## Exercise 1

The Fibonacci Series starts with 0 and 1. Each of the following numbers are the sum of the previous two numbers in the series:

`0 1 1 2 3 5 8 13 21 34 ...`

So, `fib(9) = 34`.

Write a recursive function (`fib`) that, given `n`, will return the `n`th number of the Fibonacci Series.

*Test your function using Google or any other tool that can calculate the Fibonacci Series.*

In [2]:
# Exercise 1: Write a recursive function that returns the nth number of the Fibonacci Series
'''Define the fibonacci function
fibonacci = fib_recursive'''
def fib_recursive(n):
# Using an if-elif-else statement, find fib_recursive starting with 0 and 1
# If n is 0, return 0; if n is 1, return 1
    if n == 0:
        return 0
    elif n == 1:
        return 1
    else:
        return fib_recursive(n-1) + fib_recursive(n-2)
    # Test the function by printing the 5th and 9th Fibonacci number
print(fib_recursive(5))
print(fib_recursive(9))


5
34



## Exercise 2

Write a (single) recursive function, `to_binary()`, that [converts](https://en.wikipedia.org/wiki/Binary_number#Conversion_to_and_from_other_numeral_systems) an integer into its [binary](https://en.wikipedia.org/wiki/Binary_number) representation. So, for example:

```python
to_binary(2)   -->  10
to_binary(12)  -->  1100
```

*Note: you can test your function with the built in `bin()` function.*

In [9]:
# Write a single recursive function, to_binary, that converts an integer into its binary representation
# Define the function to_binary
def to_binary(w):
    # Base case: if w is 0, return the string '0'; if w is 1, return the string '1'
    if w == 0:
        return '0'
    elif w == 1:
        return '1'
  # Build recursive case: Call the function recursively with the integer divided by 2 and append the remainder of the integer divided by 2
    else:
            return to_binary(w // 2) + str(w % 2)
# Test the function to_binary to return a string of the binary representation of the integer
# Test the function to_binary(10)
print(to_binary(10))

1010


## Exercise 3 

Use the raw Bellevue Almshouse Dataset (`df_bellevue`) extracted at the top of the lab (i.e., with `pd.read_csv ...`).

**Write a function for each of the following tasks. Name these functions `task_i()`** (i.e., without any input arguments).

1. Return a list of all column names, *sorted* such that the first column has the *least* missing values, and the last column has the *most* missing values (use the raw column names).
   - *Note: there is an issue with the `gender` column you'll need to remedy first ...*
2. Return a **data frame** with two columns:
   - the year (for each year in the data), `year`
   - the total number of entries (immigrant admissions) for each year, `total_admissions`
3. Return a **series** with:
   - Index: gender (for each gender in the data)
   - Values: the average age for the indexed gender.
4. Return a list of the 5 most common professions *in order of prevalence* (so, the most common is first).

For each of these, if there are messy data issues, use the `print` statement to explain.


In [30]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt 

# Exercise 3: Write a function for the following tasks and name them task_i()
# Read the dataset from the URL into a DataFrame
url = 'https://github.com/melaniewalsh/Intro-Cultural-Analytics/raw/master/book/data/bellevue_almshouse_modified.csv'
df_bellevue = pd.read_csv(url)

# Display the first few rows of the DataFrame
print(df_bellevue.head())
# Display the summary statistics of the DataFrame
print(df_bellevue.describe(include='all'))
# Display the data types of each column in the DataFrame
print(df_bellevue.dtypes)
# Check for missing values in the DataFrame
print(df_bellevue.isnull().sum())

      date_in first_name  last_name   age          disease profession gender  \
0  1847-04-17       Mary  Gallagher  28.0  recent emigrant    married      w   
1  1847-04-08       John  Sanin (?)  19.0  recent emigrant    laborer      m   
2  1847-04-17    Anthony      Clark  60.0  recent emigrant    laborer      m   
3  1847-04-08   Lawrence     Feeney  32.0  recent emigrant    laborer      m   
4  1847-04-13      Henry      Joyce  21.0  recent emigrant        NaN      m   

                     children  
0         Child Alana 10 days  
1              Catherine 2 mo  
2  Charles Riley afed 10 days  
3                       Child  
4                  Child 1 mo  
           date_in first_name last_name          age   disease profession  \
count         9584       9580      9584  9534.000000      6497       8565   
unique         653        523      3142          NaN        75        172   
top     1847-05-24       Mary     Kelly          NaN  sickness    laborer   
freq           113 

In [28]:
# Task 1: Return a list of all column names, sorted such that the first column has the least missing values, and the last column has the most missing values (use the raw column names)
# Define the function task_i
def task_i():
    # Calculate the number of missing values for each column
    # Convert missing values to null values
    missing_values = df_bellevue.isnull().sum()
    # Sort the columns based on the number of missing values in ascending order
    sorted_columns = missing_values.sort_values().index.tolist()
    return sorted_columns
# Test the function task_i
print(task_i())

['date_in', 'last_name', 'gender', 'year', 'first_name', 'age', 'profession', 'disease', 'children']


In [27]:
# Task 2: Return a data frame with two columns: 'year' and 'total_admissions', where 'year' is the year of admission and 'total_admissions' is the total number of admissions for that year
# Define the function task_ii
def task_ii():
    # Extract the year from the 'admission_date' column and create a new 'year' column
    df_bellevue['year'] = pd.to_datetime(df_bellevue['date_in']).dt.year
    # Group the DataFrame by 'year' and count the number of admissions for each year
    admissions_per_year = df_bellevue.groupby('year').size().reset_index(name='total_admissions')
    return admissions_per_year
# Test the function task_ii
print(task_ii())

   year  total_admissions
0  1846              3073
1  1847              6511


In [26]:
# Task 3: Return a series with Index: gender (M or F) and Values: average age for the indexed gender
# Define the function task_iii
def task_iii():
    # Create the series with index
    df = pd.DataFrame(df_bellevue)
    # Group the DataFrame by 'gender' and calculate the mean age for males and females using groupby.mean()
    avg_age = df_bellevue.groupby('gender')['age'].mean()
    return avg_age
# Test the function task_iii
print(task_iii())

gender
?          NaN
g    59.000000
h    56.000000
m    31.813433
w    28.725162
Name: age, dtype: float64


In [50]:
# Task 4: Return a list of the 5 most common professions in order of prevalence (most common first)
# Define the function task_iv
def task_iv():
    # Count the occurrences of each profession and get the 5 most common ones
    most_common_professions = df_bellevue['profession'].value_counts().head(5).index.tolist()
    return most_common_professions
# Test the function task_iv
print(task_iv())


['laborer', 'married', 'spinster', 'widow', 'shoemaker']


In [54]:
# Alternative Solution for Task 4 removing missing values first
# Task 4: Return a list of the 5 most common professions in order of prevalence (most common first)
# Define the function task_iv
def task_iv():
    # Remove missing values from the 'profession' column
    professions = df_bellevue['profession'].dropna()
    # Count the occurrences of each profession in the 'profession' column
    professions = df_bellevue['profession'].value_counts()
    # Find the top 5 most common professions
    most_common_professions = professions.head(5).index.tolist()
    return most_common_professions
# Test the function task_iv
print(task_iv())

['laborer', 'married', 'spinster', 'widow', 'shoemaker']
