# Jupyter Notebook Challenge

Congratulations on completing the Introduction to Jupyter Notebooks flow! This challenge is designed to help you practice the skills introduced in that flow. 

We will examine the relationship between several variables and their associated insurance premiums. The data set for this challenge is a summary of medical insurance data, loaded below from `insurance.csv`. This notebook is structured to guide you through the challenge, but feel free to reorganize and add cells as needed to accomplish the tasks below. In this first section of the notebook, you will explore the data.
________
## Section 1: Data Exploration
To complete the following section:

- Fill in each `?` in the following table:
    - Fill in the column names
    - Determine the data type stored in each column

| 1: age | 2: ? | 3: ? | 4: ? | 5: ? | 6: ? | 7: ? |
| -- | -- | -- | -- | -- | -- | -- |
| int | ? | ? | ? | ? | ? | ? |

- Determine how many rows of information there are. 
- Analyze the values contained in the following columns:
    - The first column (`age`): determine the frequency of all ages reported in this column
    - The third column: compute the minimum, maximum, average, and standard deviation for this column
    - The fourth column: sort the DataFrame by this column (`ascending=True`)
    - The sixth column: compute how often each value appears in this column
    - The final column: compute the total sum of all reported values in this column
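
The kinds of pandas calls these steps rely on can be sketched on a tiny stand-in frame. This is only an illustration, not the solution: the frame `sample` (an assumed name) holds the five rows that `df.head()` displays later in this notebook, without the `Unnamed: 0` column, so its numbers say nothing about the full data set.

```python
import pandas as pd

# Stand-in data: the five rows printed by df.head() in this notebook.
# On the real data you would work with df = pd.read_csv('insurance.csv').
sample = pd.DataFrame({
    "age": [19, 18, 28, 33, 32],
    "sex": ["female", "male", "male", "male", "male"],
    "bmi": [27.9, 33.77, 33.0, 22.705, 28.88],
    "children": [0, 1, 3, 0, 0],
    "smoker": ["yes", "no", "no", "no", "no"],
    "region": ["southwest", "southeast", "southeast", "northwest", "northwest"],
    "charges": [16884.924, 1725.5523, 4449.462, 21984.47061, 3866.8552],
})

print(sample.dtypes)              # data type stored in each column
n_rows, n_cols = sample.shape     # (number of rows, number of columns)

# Frequency of each distinct value in a column
age_counts = sample["age"].value_counts()
region_counts = sample["region"].value_counts()

# Several summary statistics of one numeric column at once
bmi_stats = sample["bmi"].agg(["min", "max", "mean", "std"])

# Sort the whole frame by one column, smallest values first
by_children = sample.sort_values("children", ascending=True)

# Sum of every value in a column
total_charges = sample["charges"].sum()
```

`sample.describe()` reports the same statistics for every numeric column in one call, which can be a quick cross-check.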

### Import Data + Libraries

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

In [7]:
df = pd.read_csv('insurance.csv')

In [8]:
df.head()

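| Unnamed: 0 | age | sex | bmi | children | smoker | region | charges |
| -- | -- | -- | -- | -- | -- | -- | -- |
| 0 | 19 | female | 27.9 | 0 | yes | southwest | 16884.924 |
| 1 | 18 | male | 33.77 | 1 | no | southeast | 1725.5523 |
| 2 | 28 | male | 33.0 | 3 | no | southeast | 4449.462 |
| 3 | 33 | male | 22.705 | 0 | no | northwest | 21984.47061 |
| 4 | 32 | male | 28.88 | 0 | no | northwest | 3866.8552 |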


________
## Section 2: Indexing Practice
- Index data points: 
    1. Print just the 3rd row of the data.
    2. Print the first 6 rows of the data in a table.
    3. Iterate through these first 6 rows: using `string.format(value1, value2...)` and `iterrows()`, print a summary of each individual's age, smoking status, and charge premium.
    4. Filter the data frame so that you only have individuals with more than 3 children who live in the northwest region. Calculate the average charge of individuals in this category.
    5. Create a copy of the data using `df.copy()`, then add a column titled `'reduced_charge'` that holds each charge reduced by 20%. Follow the `apply` pattern of this example, which creates a different column and needs `import math`: <br>
        `df['rounded_up_charge'] = df['charges'].apply(lambda x: math.ceil(x))`
    6. Create a column titled `'mask_practice'` that contains `'*'` if the individual has more than 3 children and lives in the northwest, and `'x'` if not.
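
The indexing tools named in this list can be sketched as follows. This is a hedged illustration rather than the solution: `sample` is a stand-in built from the five rows shown by `df.head()`, and the filter condition (no children, northwest) and the column name `no_kids_northwest` are assumptions chosen so that the tiny sample has matching rows.

```python
import math

import numpy as np
import pandas as pd

# Stand-in data: selected columns of the five rows printed by df.head().
sample = pd.DataFrame({
    "age": [19, 18, 28, 33, 32],
    "children": [0, 1, 3, 0, 0],
    "smoker": ["yes", "no", "no", "no", "no"],
    "region": ["southwest", "southeast", "southeast", "northwest", "northwest"],
    "charges": [16884.924, 1725.5523, 4449.462, 21984.47061, 3866.8552],
})

third_row = sample.iloc[2]      # positions are zero-based, so 2 is the 3rd row
first_rows = sample.head(3)     # first n rows, still a DataFrame

# iterrows() yields (index, row) pairs; str.format fills the {} placeholders.
template = "Age {}, smoker: {}, charge: {:.2f}"
summaries = [
    template.format(row["age"], row["smoker"], row["charges"])
    for _, row in first_rows.iterrows()
]

# Boolean mask: wrap each condition in parentheses and combine with &.
mask = (sample["children"] == 0) & (sample["region"] == "northwest")
mean_charge = sample.loc[mask, "charges"].mean()

# Work on a copy so the original frame keeps its columns unchanged.
practice = sample.copy()
practice["rounded_up_charge"] = practice["charges"].apply(lambda x: math.ceil(x))

# np.where picks '*' where the mask is True and 'x' elsewhere.
practice["no_kids_northwest"] = np.where(mask, "*", "x")
```

Because `practice` is a copy, the new columns never appear in `sample`, which is the reason the task asks for `df.copy()` first.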

________
## Section 3: Summarizing the DataFrame
<br><b>Everything in this section is based only on the original data; none of the values manipulated or created in the earlier sections should be needed here.</b>
- Summarize the data:
    1. Calculate the number of individuals for each age. 
    2. Create a histogram of this.
    3. Bin the ages in spans of 5 years (i.e., 15-19, 20-24, 25-29, etc.) and recalculate the number of individuals for each age group.
    4. Create a histogram of this.
    5. Using `DataFrame.groupby()` to iterate over the regions, create subplots on a single figure showing the number of children in each region.
    6. Create a heatmap using the binned ages vs region. Color the heatmap using the average charge value for each of these groups.
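
One possible way to prepare the binned and grouped tables behind these plots is sketched below. It assumes `pd.cut` for the binning and uses a stand-in frame `sample` built from the five rows shown by `df.head()`; the names `edges`, `labels`, `age_group`, and `heat` are illustrative, and the plotting itself is left out.

```python
import pandas as pd

# Stand-in data: selected columns of the five rows printed by df.head().
sample = pd.DataFrame({
    "age": [19, 18, 28, 33, 32],
    "region": ["southwest", "southeast", "southeast", "northwest", "northwest"],
    "charges": [16884.924, 1725.5523, 4449.462, 21984.47061, 3866.8552],
})

# Bin edges every 5 years. right=False makes each bin [start, start + 5),
# so the '15-19' bin holds ages 15 through 19.
edges = list(range(15, 40, 5))                           # 15, 20, 25, 30, 35
labels = [f"{start}-{start + 4}" for start in edges[:-1]]
sample["age_group"] = pd.cut(sample["age"], bins=edges, labels=labels, right=False)

# Count per bin, kept in bin order; empty bins show up with a count of 0.
group_counts = sample["age_group"].value_counts(sort=False)

# Iterating over a groupby yields (key, sub-frame) pairs, one per region.
region_sizes = {region: len(group) for region, group in sample.groupby("region")}

# Mean charge per (age group, region), reshaped so rows are age groups
# and columns are regions. Combinations with no rows become NaN.
heat = (
    sample.groupby(["age_group", "region"], observed=True)["charges"]
    .mean()
    .unstack("region")
)
```

A table shaped like `heat` can be handed to `seaborn.heatmap`, and `group_counts.plot(kind="bar")` draws the binned counts.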

________
## Section 4: Data Visualization 
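1. Using `seaborn.distplot`, visualize the distribution of charges listed in the data. (`distplot` is deprecated as of seaborn 0.11; `seaborn.histplot` with `kde=True` is the suggested replacement.)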
2. After completing the above, you should see that the charges data is right-skewed (a peak on the left side of the plot with a long tail extending along the x-axis). Transform the data to make it closer to normal by applying a natural log.
3. Repeat step 1 with the newly transformed data. 
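
The effect of the log transform can be checked numerically as well as visually. The sketch below uses only the five `charges` values shown by `df.head()`, which is far too few to describe the real distribution; it is meant to show the calls, not the result you should expect on the full data.

```python
import numpy as np
import pandas as pd

# Stand-in data: the five charges printed by df.head().
charges = pd.Series(
    [16884.924, 1725.5523, 4449.462, 21984.47061, 3866.8552], name="charges"
)

# np.log is the natural logarithm; it needs strictly positive inputs.
log_charges = np.log(charges)

# Sample skewness: positive values indicate a longer right tail.
raw_skew = charges.skew()
log_skew = log_charges.skew()

print(f"skew before: {raw_skew:.3f}, after log: {log_skew:.3f}")
```

A skewness closer to zero after the transform is consistent with a more symmetric histogram. The transform is reversible with `np.exp`, which matters if a model is later fitted on the log scale.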


________
## Section 5: Introduction to Regression
For this final section, we ask everyone to complete Part One. We encourage everyone to at least attempt Part Two, though it is not required.
- <b>Part One: Linear Regression</b>
    1. Create a train-test split with sklearn. (Hint: `from sklearn.model_selection import train_test_split`)
    2. Fit a linear regression model for age vs charge using your training data. (Hint: `from sklearn.linear_model import LinearRegression`)
    3. Plot a scatter of your test data and overlay a regression line from your trained model.
    4. Plot the residuals from your test data using
- <b>Part Two</b>
    - This section is intentionally left open-ended. Create and evaluate a regressor using all of the provided variables. Additional data processing might be needed to handle the categorical data. We encourage you to challenge yourself here to develop the best-performing model you can. You are welcome to compare different models as well. Compute MAE, MSE, and RMSE for your final model using `metrics` from sklearn. (Hint: use `from sklearn import metrics`)
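
The mechanics of Part One and of the error metrics can be sketched on synthetic data. Everything numeric below is an assumption made for illustration: the ages and charges are randomly generated (they are NOT the insurance data), and the split size and random seeds are arbitrary choices.

```python
import numpy as np
from sklearn import metrics
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (not the insurance data): charge rises with age
# at a made-up rate of 250 per year, plus random noise.
rng = np.random.default_rng(0)
age = rng.integers(18, 65, size=200)
charge = 250.0 * age + rng.normal(0.0, 2000.0, size=200)

# sklearn expects a 2-D feature matrix, even for a single feature.
X = age.reshape(-1, 1)
X_train, X_test, y_train, y_test = train_test_split(
    X, charge, test_size=0.25, random_state=0
)

# Fit on the training portion only, then predict the held-out portion.
model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)

# Residuals: observed minus predicted, one per test row.
residuals = y_test - y_pred

mae = metrics.mean_absolute_error(y_test, y_pred)
mse = metrics.mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)   # RMSE is the square root of MSE
```

On the real data, the feature matrix would be `df[['age']]` and the target `df['charges']`. For Part Two, categorical columns such as `sex`, `smoker`, and `region` need a numeric encoding first; `pd.get_dummies` is one common option.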