# Case 7.6 review module

**Instructions**

To complete this review module, follow these instructions:

1. Complete the functions provided to you in this notebook, but do **not** change the name of the function or the name(s) of the argument(s). If you do that, the autograder will fail and you will not receive any points.
2. Remove from each function the code `raise NotImplementedError()` and replace it with your implementation.
3. Run all the function-definition cells before you run the testing cells. The functions must exist before they are graded!
4. Read the function docstrings carefully. They contain additional information about how the code should look (a [docstring](https://www.datacamp.com/community/tutorials/docstrings-python) is the stuff that comes between the triple quotes).

In [3]:
# Loading the data
import pandas as pd
import numpy as np
df = pd.read_csv("data/insurance.csv")

## Exercise 1

Which region is the most common in the dataset? **Hint**: You may use Pandas' built-in `.mode()` method if you like.

In [4]:
def find_most_common_region(df):
    """
    Find the most common region.
    
    Arguments:
    df: The Pandas dataframe (the dataset)
    
    Output:
    answer: A string.
    """
    
    # YOUR CODE HERE
    answer = df['region'].mode()
    answer = answer[0]
    return answer
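
For reference, `.mode()` always returns a Series, even when there is only one most frequent value, which is why the solution has to extract a single element from it. The sketch below uses a small made-up DataFrame (`toy` is hypothetical and is not the insurance data) to show that behaviour next to an alternative based on `.value_counts()`.

```python
import pandas as pd

# Hypothetical toy data, not the insurance dataset
toy = pd.DataFrame({"region": ["north", "south", "south", "east", "north", "south"]})

# .mode() returns a Series of the most frequent value(s)
modes = toy["region"].mode()

# Take the first modal value by position to get a plain string
most_common = modes.iloc[0]

# Alternative: count each value and take the label with the highest count
via_counts = toy["region"].value_counts().idxmax()

print(type(modes).__name__, most_common, via_counts)
```

If two values were tied for most frequent, `.mode()` would return both of them, so taking the first element is a choice rather than a necessity.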

In [5]:
answer = df['region'].mode()
answer = answer[0]
df.head()

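| Unnamed: 0 | age | sex | bmi | children | smoker | region | charges |
|---|---|---|---|---|---|---|---|
| 0 | 19 | female | 27.9 | 0 | yes | southwest | 16884.924 |
| 1 | 18 | male | 33.77 | 1 | no | southeast | 1725.5523 |
| 2 | 28 | male | 33.0 | 3 | no | southeast | 4449.462 |
| 3 | 33 | male | 22.705 | 0 | no | northwest | 21984.47061 |
| 4 | 32 | male | 28.88 | 0 | no | northwest | 3866.8552 |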
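
A column named `Unnamed: 0` usually means the CSV file has an empty first header cell, which is what `DataFrame.to_csv()` writes when the index is saved along with the data. How `insurance.csv` was produced is not stated here, so the following is only a sketch on a made-up two-row CSV (`csv_text` is hypothetical) showing how `index_col=0` reads that first column as the index instead.

```python
import io
import pandas as pd

# Hypothetical CSV text whose first header cell is empty
csv_text = ",age,region\n0,19,southwest\n1,18,southeast\n"

# Default behaviour: the unnamed first column becomes a regular column
with_extra = pd.read_csv(io.StringIO(csv_text))

# index_col=0 uses the first column as the row index instead
without_extra = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(with_extra.columns))
print(list(without_extra.columns))
```

The exercises in this module do not depend on that column, so either way of loading the data would work for them.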


## Exercise 2

You can easily create a frequency table of the ages present in the dataset with the `.value_counts()` method, and report only the 10 most common:

```python
df['age'].value_counts().head(10)
```

If you run the code above, you'll find that the most common ages are 18 and 19. Just for fun, do the same using the `.groupby()` method and the `.count()` aggregation function. Notice that the function you have to submit has two outputs (which should be returned in this specific order): the frequency table (identical to the one that `.value_counts()` produces) and the [groupby object](https://pandas.pydata.org/pandas-docs/stable/reference/groupby.html) you used to calculate it.

In [10]:
def find_top10_ages(df):
    """
    Find the 10 most common ages.
    
    Arguments:
    df: The Pandas dataframe (the dataset)
    
    Output:
    answer: A sorted 10-row Pandas series in which the index is the ages and the values are the frequencies.
    gob: The groupby object that you used to calculate answer.
    """

    # YOUR CODE HERE
    answer = df.age.value_counts().head(10)
    gob = df[['age','smoker']].groupby('age')
    
    return answer, gob

In [11]:
find_top10_ages(df)

(18    69
 19    68
 20    29
 51    29
 45    29
 46    29
 47    29
 48    29
 50    29
 52    29
 Name: age, dtype: int64,
 <pandas.core.groupby.generic.DataFrameGroupBy object at 0x7fe950b56f10>)
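
As a rough illustration of the `.groupby()` plus `.count()` route that the exercise describes, the sketch below builds the frequency table from a groupby object and compares it with `.value_counts()`. It uses a small made-up DataFrame (`toy` is hypothetical), and counting the `smoker` column is an arbitrary choice: `.count()` counts non-null entries per group, so any column without missing values gives the same numbers.

```python
import pandas as pd

# Hypothetical toy data, not the insurance dataset
toy = pd.DataFrame({
    "age": [18, 18, 18, 19, 19, 30],
    "smoker": ["yes", "no", "no", "no", "yes", "no"],
})

# Group the rows by age; this is the groupby object
gob = toy.groupby("age")

# Count rows per age, sort from most to least frequent, keep the top 10
freq = gob["smoker"].count().sort_values(ascending=False).head(10)

# Reference result from value_counts(), which sorts in descending order already
reference = toy["age"].value_counts().head(10)

print(freq.to_dict() == reference.to_dict())
```

The two Series can carry different names, and ages with equal counts may come out in a different order, so comparing the index as a set (as the testing cell does) is safer than comparing row by row.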

## Exercise 3

What is the correlation between age and number of children?

In [23]:
def find_corr_age_children(df):
    """
    Find the Pearson correlation coefficient of age and number of children
    
    Arguments:
    df: The Pandas dataframe (the dataset)
    
    Output:
    answer: A float or numpy float64.
    """
    
    # YOUR CODE HERE
    
    age_column = df.age
    children_column = df.children
    
    answer = age_column.corr(children_column)
    return answer

In [24]:
find_corr_age_children(df)

0.04246899855884944
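
The same Pearson coefficient can be reached in more than one way. The sketch below, on a made-up DataFrame (`toy` is hypothetical), compares `Series.corr()`, one cell of the `DataFrame.corr()` matrix, and `numpy.corrcoef()`; all three should agree up to floating-point rounding.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not the insurance dataset
toy = pd.DataFrame({"age": [20, 30, 40, 50, 60], "children": [0, 2, 1, 3, 2]})

# Series.corr() computes the Pearson coefficient by default
r_series = toy["age"].corr(toy["children"])

# DataFrame.corr() returns the full matrix; pick one cell by label
r_matrix = toy[["age", "children"]].corr().loc["age", "children"]

# numpy.corrcoef() also returns a matrix; the off-diagonal entry is r
r_numpy = np.corrcoef(toy["age"], toy["children"])[0, 1]

print(np.isclose(r_series, r_matrix), np.isclose(r_series, r_numpy))
```

Selecting the cell with `.loc` by column name avoids having to remember the position of each column in the matrix.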

## Testing Cells

Run the cells below to check your answers. Make sure you run your solution cells first; otherwise, you will get a `NameError` when checking your answers.

In [12]:
# Ex 1
assert type(find_most_common_region(df)) == type('northwest'), 'Exercise 1: Your output doesn\'t seem to be a string! Check it using type(). Maybe you returned a Pandas Series? To extract a value from a Series s, you can use s.iloc[x], where x is the position of the element.'
assert find_most_common_region(df) == 'southeast', 'Exercise 1: This is not the right region. If you\'re using .mode(), make sure you are working with the region column.'
print("Exercise 1 looks correct!")

Exercise 1 looks correct!


In [13]:
# Ex 2
assert len(find_top10_ages(df)) == 2, 'Exercise 2: Make sure your function outputs two objects: The frequency table and the GroupBy object.'
plants = pd.DataFrame({'Plant': ['Rose', 'Rose','Lily', 'Lily'],'Max Growth': [380., 370., 24., 26.]})
gb = plants.groupby(['Plant'])
assert type(find_top10_ages(df)[0]) == type(plants['Plant']), 'Exercise 2: Are you sure your table is a Series? You can check with the type() command.'
assert len(find_top10_ages(df)[0]) == 10, "Exercise 2: Your frequency table doesn't have 10 rows! To get the first 10 rows of a Series, use .head(10)."
assert list(find_top10_ages(df)[0].values) == [69, 68, 29, 29, 29, 29, 29, 29, 29, 29], "Exercise 2: Hmmm, your frequencies don't match ours. Check again! Maybe they aren't properly sorted? To sort a series Z -> A, use .sort_values(ascending=False)"
assert set(list(find_top10_ages(df)[0].index)) == set([18, 19, 51, 45, 46, 47, 48, 50, 52, 20]), "Exercise 2: Check the ages... Are they correct? They should be the index of the series, and should be integers. To access the index of a Pandas series s, run s.index"
assert type(find_top10_ages(df)[1]) == type(gb), 'Exercise 2: Did you submit a GroupBy object? You can check with the type() command.'
print("Exercise 2 looks correct!")

Exercise 2 looks correct!


In [25]:
# Ex 3
assert (type(find_corr_age_children(df)) == type(0.05) or type(find_corr_age_children(df)) == type(np.float64(4))), "Exercise 3: You returned something that is not a float or a numpy float64! Check with type(). You can use the .corr() method to create the correlation matrix, but then you have to extract only the number you are interested in. To access a specific element in a matrix, you can use the .iloc[] indexer."
assert abs(find_corr_age_children(df) - 0.042468998558849488) <= 1e-6, "Exercise 3: Check your results! This number is way off from the true correlation!"
print("Exercise 3 looks correct!")

Exercise 3 looks correct!
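
When a check compares a float against a reference value, the tolerance matters: if it is wider than the reference value itself, even an answer of 0.0 would be accepted. The sketch below illustrates this with the reference correlation from the Exercise 3 check; the helper `close_enough` and the tolerances passed to it are assumptions chosen for illustration.

```python
import math

# Reference value taken from the Exercise 3 check
expected = 0.042468998558849488

def close_enough(value, tol):
    """Return True if value lies within an absolute tolerance of expected."""
    return abs(value - expected) <= tol

# A loose tolerance accepts a clearly wrong answer; a tight one rejects it
loose_accepts_zero = close_enough(0.0, 0.05)
tight_accepts_zero = close_enough(0.0, 1e-6)

# math.isclose() offers a relative tolerance as an alternative
matches_reference = math.isclose(0.04246899855884944, expected, rel_tol=1e-9)

print(loose_accepts_zero, tight_accepts_zero, matches_reference)
```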
