# Module 7 Assignment


A few things you should keep in mind when working on assignments:

1. Run the first code cell to import the modules needed by this assignment before proceeding to the problems.
2. Make sure you fill in every place that says `# YOUR CODE HERE`. Do not write your answer anywhere other than where it says `# YOUR CODE HERE`. Anything you write elsewhere will be removed or overwritten by the autograder.
3. Each problem has an autograder cell below the answer cell. Run the autograder cell to check your answer. If anything is wrong with your answer, the autograder cell will display error messages.
4. Before you submit your assignment, make sure everything runs as expected. Go to the menubar, select Kernel, then Restart & Run All. If the notebook runs through the last code cell without an error message, you've answered all problems correctly.
5. Make sure that you save your work (in the menubar, select File → Save and Checkpoint).

-----

# Run Me First!

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()

import pandas as pd

from numpy.testing import assert_almost_equal, assert_array_equal
from nose.tools import assert_equal, assert_true

import warnings
warnings.filterwarnings('ignore')

# Problem 1: Read in a dataset

In this problem you will read in a dataset from iris_data.csv, which is in the data folder.

- Read the dataset from data/iris_data.csv into a dataframe
- Store the dataframe in the variable **df**
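
As a minimal sketch of the general technique, the cell below reads CSV text into a pandas dataframe. It uses made-up data held in memory through `io.StringIO` rather than the assignment file, and the names `csv_text` and `toy` and the column names are hypothetical, chosen only for illustration.

```python
import io

import pandas as pd

# Hypothetical stand-in for a CSV file on disk (not the assignment data).
csv_text = "leaf length,leaf type\n1.5,Oak\n2.0,Maple\n"

# pd.read_csv accepts a file path or any file-like object.
toy = pd.read_csv(io.StringIO(csv_text))

print(toy.shape)             # (rows, columns)
print(toy.columns.tolist())  # column names taken from the header line
```

Passing a path string instead of the `StringIO` object reads from a file in the same way.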

In [None]:
# YOUR CODE HERE


In [None]:
assert_equal(len(df), 110, msg="The dataset should have 110 rows. Your solution has %s" % len(df))
assert_equal(set(df.columns.tolist()),
             {'sepal length', 'sepal width', 'petal length', 'petal width', 'iris type'},
             msg="Your column names do not match the solution's")
df.head()

# Problem 2: Fix column names and convert iris type to upper case

In this problem you will fix the column names of the dataframe loaded in problem 1 and convert the iris type values to upper case.

- Work directly on **df** created in problem 1
- Fix all column names so that each whitespace character is replaced by '_'
- Convert the values in the iris type column to all **upper** case
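
The two string operations above can be sketched on a toy dataframe as follows. This is one possible approach using the pandas `.str` accessor, not necessarily the intended solution, and the dataframe `toy` and its column names are hypothetical.

```python
import pandas as pd

# Made-up data; the column names are for illustration only.
toy = pd.DataFrame({"leaf length": [1.5, 2.0], "leaf type": ["Oak", "Maple"]})

# Replace each space in the column names with an underscore.
toy.columns = toy.columns.str.replace(" ", "_", regex=False)

# Convert every value in a text column to upper case.
toy["leaf_type"] = toy["leaf_type"].str.upper()

print(toy)
```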

In [None]:
# YOUR CODE HERE


In [None]:
assert_true('sepal_length' in df.columns, "Column name is not fixed as directed")
assert_true('sepal_width' in df.columns, "Column name is not fixed as directed")
assert_true('petal_length' in df.columns, "Column name is not fixed as directed")
assert_true('petal_width' in df.columns, "Column name is not fixed as directed")
assert_true('iris_type' in df.columns, "Column name is not fixed as directed")
assert_equal(set(df.iris_type.unique()), {'IRIS-SETOSA', 'IRIS-VERSICOLOR', 'IRIS-VIRGINICA'}, "iris_type values should be all upper case.")
df.head()

# Problem 3: Fill missing values

Three columns have missing values: sepal length, sepal width and petal length. In this problem you will fill each missing value with the mean of that column for the same iris type.

- Work on **df** created in problem 1 and fixed in problem 2
- Fill each missing value with the mean for its iris type. For example, fill a missing sepal length of Iris-setosa with the average setosa sepal length. (Do not fill missing values with the mean of the whole column.)
- Work on one column at a time, and fill the missing values in all 3 columns.
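
To illustrate the difference between a per-group mean and a whole-column mean, here is a minimal sketch on made-up data. It uses `groupby(...).transform("mean")`, which is one of several ways to do this; the dataframe `toy` and the columns `kind` and `height` are hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up data with one missing value in each group.
toy = pd.DataFrame({
    "kind": ["a", "a", "a", "b", "b", "b"],
    "height": [1.0, np.nan, 3.0, 10.0, 20.0, np.nan],
})

# transform("mean") returns a Series aligned with toy's rows, holding
# the mean of each row's own group (missing values are skipped).
group_mean = toy.groupby("kind")["height"].transform("mean")

# Fill each missing value with the mean of its own group:
# 2.0 for group "a" and 15.0 for group "b".
toy["height"] = toy["height"].fillna(group_mean)

print(toy)
```

Filling with the mean of the whole column would instead put 8.5 into both gaps, which is what the instructions above warn against.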

In [None]:
df.info()

In [None]:
# YOUR CODE HERE


In [None]:
assert_equal(df.shape[0], 110, "df should still have 110 rows")
assert_equal(df.sepal_length.isnull().sum(), 0, "sepal_length column has missing values")
assert_equal(df.sepal_width.isnull().sum(), 0, "sepal_width column has missing values")
assert_equal(df.petal_length.isnull().sum(), 0, "petal_length column has missing values")
assert_almost_equal(df.iloc[0,0], 5.004, 3, "Missing value is not filled with correct value")
assert_almost_equal(df.iloc[50,1], 2.776, 3, "Missing value is not filled with correct value")
assert_almost_equal(df.iloc[80,2], 5.590, 3, "Missing value is not filled with correct value")
df.info()

# Problem 4: EDA: Plot multiple box plots

In this problem you will draw a grouped box plot that shows the sepal length feature for the different iris types.

- Work on **df** created in problem 1 and fixed in problems 2 and 3
- Use seaborn boxplot to draw a vertical box plot
- The plot should have 3 boxes, one for each iris type
- The plot should have a descriptive title
- Assign the axes returned by boxplot() to the variable **ax**
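
The following is a minimal sketch of a grouped box plot on made-up data, assuming seaborn and matplotlib are installed. The dataframe `toy`, its columns `kind` and `height`, and the variable `toy_ax` are hypothetical names used only for illustration.

```python
import pandas as pd
import seaborn as sns

# Made-up data: three groups with a few measurements each.
toy = pd.DataFrame({
    "kind": ["a", "a", "a", "b", "b", "b", "c", "c", "c"],
    "height": [1.0, 2.0, 3.0, 4.0, 6.0, 8.0, 5.0, 5.5, 7.0],
})

# A categorical x and a numeric y give one vertical box per category.
# sns.boxplot returns the matplotlib Axes it drew on.
toy_ax = sns.boxplot(x="kind", y="height", data=toy)

# The returned Axes can be customized afterwards, e.g. with a title.
toy_ax.set_title("Height by kind (toy data)")
```

Seaborn labels the axes with the column names passed as `x` and `y`, so `toy_ax.get_xlabel()` returns `'kind'` here.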

In [None]:
# YOUR CODE HERE


In [None]:
assert_true(len(ax.title.get_text()) > 0, msg="Your plot doesn't have a title.")
assert_equal(ax.get_xlabel(), 'iris_type', msg="Box plot is not grouped by iris type.")
assert_equal(ax.get_ylabel(), 'sepal_length', msg="Box plot is not plotted on sepal length.")

# Problem 5: Construct a linear regression model

In this problem you will construct a linear regression model using statsmodels.

- Work with **df** created in problem 1 and fixed in problems 2 and 3
- **petal_width** is the dependent variable
- All other columns are independent variables
- Enclose the categorical feature in "C()" in the regression formula
- Create the linear regression model with the ols function in statsmodels.formula.api
- Fit the model and assign the fitted model to the variable **result**
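
As a minimal sketch of the formula interface, the cell below fits an ordinary least squares model with one numeric and one categorical predictor on made-up data. The dataframe `toy`, its columns `y`, `x` and `group`, and the variable `toy_fit` are hypothetical; the data is constructed to follow y = 1 + 2·x exactly, plus 3 for rows in group "b".

```python
import pandas as pd
import statsmodels.formula.api as smf

# Made-up data: y = 1 + 2*x, plus 3 when group is "b".
toy = pd.DataFrame({
    "x": [0.0, 1.0, 2.0, 0.0, 1.0, 3.0],
    "group": ["a", "a", "a", "b", "b", "b"],
    "y": [1.0, 3.0, 5.0, 4.0, 6.0, 10.0],
})

# Left of "~" is the dependent variable; terms on the right are joined
# with "+". C(...) marks a column as categorical so it is dummy-coded.
toy_model = smf.ols("y ~ x + C(group)", data=toy)
toy_fit = toy_model.fit()

# One coefficient for the intercept, one for x, and one for each
# non-reference level of the categorical column.
print(toy_fit.params)
```

With the default treatment coding, the first level ("a") serves as the reference, so the categorical column contributes a single coefficient named `C(group)[T.b]`.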

In [None]:
import statsmodels.formula.api as smf
# YOUR CODE HERE


In [None]:
result.summary()

In [None]:
assert_almost_equal(result.rsquared, 0.96, 2, "Regression result is not correct")
assert_almost_equal(result.bic, -82.91, 2, "Regression result is not correct")