# Module 7 Assignment

A few things you should keep in mind when working on assignments:

1. Make sure you fill in any place that says `YOUR CODE HERE`. Do **not** write your answer anywhere else other than where it says `YOUR CODE HERE`. Anything you write elsewhere will be removed or overwritten by the autograder.

2. Before you submit your assignment, make sure everything runs as expected. In the menubar, select _Kernel_ → _Restart & Run All_ to restart the kernel and run all cells.

3. Do not change the title (i.e. file name) of this notebook.

4. Make sure that you save your work (in the menubar, select _File_ → _Save and Checkpoint_).


In [1]:
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()

import pandas as pd
import numpy as np

from numpy.testing import assert_array_almost_equal, assert_array_equal

from nose.tools import assert_equal, assert_true


# Problem 1: Read in a dataset

For this problem you will read in a dataset from a **CSV** file using pandas. In the cell below, the function *read_data* takes one argument, "file_path", which is the path to the dataset file.
- Use the *read_csv* function from pandas to read in the dataset from the file path and return the resulting dataframe.

In [2]:
def read_data(file_path):
    '''
    Reads in a dataset using pandas.
    
    Parameters
    ----------
    file_path : string containing path to a file
    
    Returns
    -------
    pandas dataframe with data read in from the file path
    '''
    ### BEGIN SOLUTION
    df = pd.read_csv(file_path)
    return df
    ### END SOLUTION
    

In [3]:
df = read_data('data/iris_data.csv')
df.head()

,sepal length,sepal width,petal length,petal width,iris type
0,NaN,3.5,1.4,0.2,Iris-setosa
1,4.9,3.0,1.4,0.2,Iris-setosa
2,4.7,3.2,1.3,0.2,Iris-setosa
3,4.6,3.1,1.5,0.2,Iris-setosa
4,5.0,3.6,1.4,0.2,Iris-setosa


In [4]:
assert_equal(len(df), 110, msg="The dataset should have 110 rows. Your solution has %s." % len(df))
assert_equal(set(df.columns.tolist()), set(['sepal length', 'sepal width', 'petal length',
       'petal width', 'iris type']), 
             msg="Your column names do not match the solutions")
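As a minimal sketch of what `read_csv` does in Problem 1, the snippet below reads a small in-memory CSV instead of `data/iris_data.csv`. The three sample rows and the name `sample` are made up for illustration and are not part of the assignment data. It also shows that an empty field is read as a missing value (NaN), which matters for Problem 3.

```python
import io

import pandas as pd

# Hypothetical three-row sample standing in for the real data file.
sample_csv = io.StringIO(
    "sepal length,sepal width,iris type\n"
    "5.1,3.5,Iris-setosa\n"
    ",3.0,Iris-setosa\n"          # empty first field -> NaN
    "4.7,3.2,Iris-setosa\n"
)

# read_csv accepts a file path or any file-like object.
sample = pd.read_csv(sample_csv)

print(sample.shape)                           # (3, 3)
print(sample.columns.tolist())                # ['sepal length', 'sepal width', 'iris type']
print(sample["sepal length"].isnull().sum())  # 1
```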

# Problem 2: Fix column names

In this problem you will fix the column names of the dataframe loaded in problem 1 so that they contain no whitespace. Use '_' to connect words. For example, "sepal length" should become "sepal_length".

- Work directly on **df** created in problem 1
- Fix all column names so that whitespace is replaced by '_'

In [5]:
### BEGIN SOLUTION
df.columns = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'iris_type']
### END SOLUTION

In [6]:
assert_true('sepal_length' in df.columns, "Column name sepal_length is not fixed as directed")
assert_true('sepal_width' in df.columns, "Column name sepal_width is not fixed as directed")
assert_true('petal_length' in df.columns, "Column name petal_length is not fixed as directed")
assert_true('petal_width' in df.columns, "Column name petal_width is not fixed as directed")
assert_true('iris_type' in df.columns, "Column name iris_type is not fixed as directed")
df.head()

,sepal_length,sepal_width,petal_length,petal_width,iris_type
0,NaN,3.5,1.4,0.2,Iris-setosa
1,4.9,3.0,1.4,0.2,Iris-setosa
2,4.7,3.2,1.3,0.2,Iris-setosa
3,4.6,3.1,1.5,0.2,Iris-setosa
4,5.0,3.6,1.4,0.2,Iris-setosa
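The solution to Problem 2 assigns the new column names by hand. One alternative, sketched here on a hypothetical one-row frame called `demo`, is to replace the spaces programmatically with the string methods on the column index. This is only an illustration of the same renaming, not the required approach.

```python
import pandas as pd

# Hypothetical frame with space-separated column names like the iris data.
demo = pd.DataFrame({
    "sepal length": [5.1],
    "petal width": [0.2],
    "iris type": ["Iris-setosa"],
})

# Replace every literal space in each column name with an underscore.
demo.columns = demo.columns.str.replace(" ", "_", regex=False)

print(demo.columns.tolist())  # ['sepal_length', 'petal_width', 'iris_type']
```

The advantage of this form is that it does not depend on the number or order of the columns.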


# Problem 3: Drop missing values

In this problem you will drop all rows with missing values from the dataframe **df**.

- Work directly on **df** created in problem 1 and fixed in problem 2
- Drop all rows with missing values
- After this problem, there should be no missing values in **df**

In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 110 entries, 0 to 109
Data columns (total 5 columns):
sepal_length    109 non-null float64
sepal_width     109 non-null float64
petal_length    109 non-null float64
petal_width     110 non-null float64
iris_type       110 non-null object
dtypes: float64(4), object(1)
memory usage: 4.4+ KB


In [8]:
### BEGIN SOLUTION
df.dropna(inplace=True)
### END SOLUTION

In [9]:
assert_equal(df.shape[0], 107, "df should have 107 rows after dropping rows with missing values")
assert_equal(df.sepal_length.isnull().sum(), 0, "sepal_length column has missing values")
assert_equal(df.sepal_width.isnull().sum(), 0, "sepal_width column has missing values")
assert_equal(df.petal_length.isnull().sum(), 0, "petal_length column has missing values")
df.head()

,sepal_length,sepal_width,petal_length,petal_width,iris_type
1,4.9,3.0,1.4,0.2,Iris-setosa
2,4.7,3.2,1.3,0.2,Iris-setosa
3,4.6,3.1,1.5,0.2,Iris-setosa
4,5.0,3.6,1.4,0.2,Iris-setosa
5,5.4,3.9,1.7,0.4,Iris-setosa
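The effect of dropping rows in Problem 3 can be sketched on a small hypothetical frame (`demo`, with made-up values). The snippet counts missing values per column, drops every row that contains at least one, and shows that the surviving rows keep their original index labels, which is why the output above starts at index 1 rather than 0.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: rows 0 and 1 each contain one missing value.
demo = pd.DataFrame({
    "sepal_length": [np.nan, 4.9, 4.7],
    "sepal_width": [3.5, np.nan, 3.2],
    "iris_type": ["Iris-setosa"] * 3,
})

# Number of missing values in each column.
missing_per_column = demo.isnull().sum()
print(missing_per_column["sepal_length"])  # 1

# dropna() removes any row with at least one missing value.
# Without inplace=True it returns a new frame and leaves demo unchanged.
cleaned = demo.dropna()

print(len(cleaned))            # 1
print(cleaned.index.tolist())  # [2]  (original label is kept)
```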


# Problem 4: Create linear regression formula

In this problem you will create a statsmodels regression formula that predicts petal_width from the other columns in the dataframe **df** as cleaned in problem 3.

- Work with the columns in **df** created in problem 1 and fixed in problems 2 and 3
- **petal_width** will be the dependent variable
- All other columns are independent variables
- Enclose the categorical feature in "C()"
- Assign the formula string to the variable **formula**

In [10]:
### BEGIN SOLUTION
formula = "petal_width ~ sepal_length + sepal_width + petal_length + C(iris_type)"
### END SOLUTION

In [11]:
assert_true('petal_width' in formula.split('~')[0], "Dependent variable is wrong")
assert_true('sepal_length' in formula.split('~')[1], "sepal_length should be independent variable")
assert_true('sepal_width' in formula.split('~')[1], "sepal_width should be independent variable")
assert_true('petal_length' in formula.split('~')[1], "petal_length should be independent variable")
assert_true('C(iris_type)' in formula.split('~')[1], "Categorical variable is not enclosed in C()")
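The formula in Problem 4 is written by hand, which is all the assignment asks for. As a rough sketch of how the same string could be assembled from the column names, the helper below (`build_formula`, a hypothetical name that is not part of pandas or statsmodels) wraps every non-numeric column in `C()` and joins the terms with `+`. The two-row `demo` frame is made-up data.

```python
import pandas as pd


def build_formula(frame, target):
    """Build a 'target ~ x1 + x2 + C(cat)' string from a frame's columns."""
    terms = []
    for name in frame.columns:
        if name == target:
            continue  # the dependent variable goes on the left-hand side
        if pd.api.types.is_numeric_dtype(frame[name]):
            terms.append(name)
        else:
            # Assumption: every non-numeric column is categorical.
            terms.append(f"C({name})")
    return f"{target} ~ " + " + ".join(terms)


# Hypothetical frame with the same column names as the cleaned df.
demo = pd.DataFrame({
    "sepal_length": [5.1, 4.9],
    "sepal_width": [3.5, 3.0],
    "petal_length": [1.4, 1.4],
    "petal_width": [0.2, 0.2],
    "iris_type": ["Iris-setosa", "Iris-setosa"],
})

formula_demo = build_formula(demo, "petal_width")
print(formula_demo)
# petal_width ~ sepal_length + sepal_width + petal_length + C(iris_type)
```

A string of this form is what `statsmodels.formula.api.ols(formula, data=df)` expects as its first argument.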


-----

**&copy; 2019: Gies College of Business at the University of Illinois.**

This notebook is released under the [Creative Commons license CC BY-NC-SA 4.0][ll]. Any reproduction, adaptation, distribution, dissemination or making available of this notebook for commercial use is not allowed unless authorized in writing by the copyright holder.

[ll]: https://creativecommons.org/licenses/by-nc-sa/4.0/legalcode