# Module 1 Assignment

A few things you should keep in mind when working on assignments:

1. Run the first code cell to import modules needed by this assignment before proceeding to problems.
2. Make sure you fill in any place that says `# YOUR CODE HERE`. Do not write your answer anywhere else other than where it says `# YOUR CODE HERE`. Anything you write elsewhere will be removed or overwritten by the autograder.
3. Each problem has an autograder cell below the answer cell. Run the autograder cell to check your answer. If there's anything wrong in your answer, the autograder cell will display error messages.
4. Before you submit your assignment, make sure everything runs as expected. In the menubar, select Kernel → Restart & Run All. If the notebook runs through the last code cell without any error message, you've answered all problems correctly.
5. Make sure that you save your work (in the menubar, select File → Save and Checkpoint).

-----

# Run Me First!

In [3]:
import pandas as pd
import numpy as np

from nose.tools import assert_equal, assert_almost_equal, assert_true, assert_is_instance

---

# Problem 1: Read in a dataset

For this problem you will read in a dataset using pandas. In the cell below, the function *read_data* takes a parameter `file_path`, which contains the path to a dataset.
- Use the *read_csv* function from pandas to read in the dataset from the file path.
- Drop all rows with missing values from the DataFrame.
- Return the DataFrame.
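
As a rough illustration of the read-then-drop pattern described above, the sketch below reads a small in-memory CSV instead of a file on disk. The CSV text and the names `csv_text`, `raw`, and `clean` are made up for this example and are not part of the assignment data.

```python
import io

import pandas as pd

# Hypothetical in-memory CSV standing in for a file on disk.
# Two of the four data rows have an empty field.
csv_text = "mpg,horsepower\n18.0,130\n15.0,\n,150\n16.0,140\n"

raw = pd.read_csv(io.StringIO(csv_text))  # empty fields are parsed as NaN
clean = raw.dropna()                      # keep only rows with no missing value

print(len(raw), len(clean))
```

Note that `dropna` returns a new DataFrame and keeps the original index labels, so the labels of the cleaned frame can have gaps.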

-----

In [6]:
def read_data(file_path):
    '''
    Reads in a dataset using pandas.
    Drops all rows with missing values.
    
    Parameters
    ----------
    file_path : string containing path to a file
    
    Returns
    -------
    pandas dataframe with data read in from the file path
    '''
    # YOUR CODE HERE
    # Read the CSV, then drop every row that contains a missing value
    return pd.read_csv(file_path).dropna()

In [8]:
mpg = read_data('data/mpg.csv')
assert_equal(type(mpg), pd.core.frame.DataFrame, msg="Your function doesn't return a dataframe")
assert_equal(len(mpg), 392, msg="The dataset should have 392 rows. Your solution has %s"%len(mpg))

print("2 random rows of the dataset mpg:")
mpg.sample(2)

2 random rows of the dataset mpg:


Unnamed: 0,mpg,cylinders,displacement,horsepower,weight,acceleration,model_year,origin,name
162,21.0,6,231.0,110.0,3039,15.0,75,usa,buick skyhawk
240,21.5,4,121.0,110.0,2600,12.8,77,europe,bmw 320i


---

# Problem 2: Encode "origin" column

For this problem you will work on the DataFrame `mpg` created from problem 1.

- One-hot encode the categorical feature 'origin' using the `get_dummies` function from pandas.
- Set the prefix of dummy columns to 'origin'.
- Assign the resulting DataFrame back to `mpg`.

After this problem, DataFrame `mpg` should have three dummy features, `origin_europe`, `origin_japan` and `origin_usa`.
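
To make the effect of one-hot encoding concrete, here is a minimal sketch on a toy DataFrame. The frame `toy` and its values are invented for illustration; only the column name 'origin' and its three categories mirror the assignment.

```python
import pandas as pd

# Hypothetical toy frame; only the 'origin' column mirrors the assignment data.
toy = pd.DataFrame({
    "mpg": [18.0, 27.0, 26.0],
    "origin": ["usa", "japan", "europe"],
})

# The original 'origin' column is replaced by one dummy column per category,
# each named '<prefix>_<category>'.
encoded = pd.get_dummies(toy, columns=["origin"], prefix="origin")

print(list(encoded.columns))
```

Depending on the pandas version, the dummy columns may hold `0`/`1` integers or `False`/`True` booleans (the default changed in pandas 2.0). The column names are the same in both cases.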

-----

In [10]:
# YOUR CODE HERE
mpg = pd.get_dummies(mpg, columns=['origin'], prefix=['origin'])

In [11]:
assert_true('origin_europe' in mpg.columns, msg="mpg doesn't have 'origin_europe' column")
assert_true('origin_japan' in mpg.columns, msg="mpg doesn't have 'origin_japan' column")
assert_true('origin_usa' in mpg.columns, msg="mpg doesn't have 'origin_usa' column")
mpg.sample(5, random_state=2)

Unnamed: 0,mpg,cylinders,displacement,horsepower,weight,acceleration,model_year,name,origin_europe,origin_japan,origin_usa
55,26.0,4,91.0,70.0,1955,20.5,71,plymouth cricket,0,0,1
70,19.0,3,70.0,97.0,2330,13.5,72,mazda rx2 coupe,0,1,0
313,24.3,4,151.0,90.0,3003,20.1,80,amc concord,0,0,1
179,33.0,4,91.0,53.0,1795,17.5,75,honda civic cvcc,0,1,0
307,41.5,4,98.0,76.0,2144,14.7,80,vw rabbit,1,0,0


---

# Problem 3: Define and split independent and dependent variables

For this problem you will work on the DataFrame `mpg` created from problem 2.

To complete this process, do the following:

- Choose column 'mpg' in DataFrame `mpg` as the dependent variable and assign it to variable **y**.
- Choose columns 'displacement', 'horsepower', and 'acceleration' in DataFrame `mpg` as the independent variables and assign them to variable **x**.
- Split the dependent and independent variables into training and testing sets using `train_test_split`.
- Name the training and testing independent variables `x_train` and `x_test`.
- Name the training and testing dependent variables `y_train` and `y_test`.
- Set the `test_size` argument of `train_test_split` to 0.4.
- Set the `random_state` argument of `train_test_split` to 23.

After this problem, there are 6 new variables defined, **x, y, x_train, x_test, y_train, y_test**.
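
The selection and split steps above can be sketched on toy data as follows. The frame `toy` and the names `features`, `target`, `f_train`, `f_test`, `t_train`, and `t_test` are hypothetical and chosen so they do not clash with the assignment variables.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 rows, two features and one target.
toy = pd.DataFrame({
    "a": np.arange(10),
    "b": np.arange(10) * 2.0,
    "target": np.arange(10) % 3,
})

features = toy[["a", "b"]]  # double brackets select columns and return a DataFrame
target = toy["target"]      # single brackets return a Series

# Rows are shuffled, then split; features and target stay aligned row by row.
f_train, f_test, t_train, t_test = train_test_split(
    features, target, test_size=0.4, random_state=23
)

print(f_train.shape, f_test.shape)
```

With a fractional `test_size`, the test set gets the ceiling of `test_size * n_samples` rows and the training set gets the rest. For the 392-row `mpg` data that is 157 testing rows and 235 training rows, which matches the shapes checked by the autograder.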

-----

In [12]:
from sklearn.model_selection import train_test_split

# YOUR CODE HERE
y = mpg['mpg']
# Double brackets select the three feature columns from mpg and return a DataFrame
x = mpg[['displacement', 'horsepower', 'acceleration']]
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.4, random_state=23)

In [None]:
assert_equal(type(x), pd.core.frame.DataFrame, msg="x should be a DataFrame")
assert_equal(type(y), pd.core.frame.Series, msg="y should be a Series")
assert_equal(len(x.columns), 3, msg="x should have 3 columns")
assert_true('displacement' in x.columns, msg="displacement is not in the independent variable")
assert_equal(y[0], 18, msg="dependent variable values are not right")
assert_equal(x_train.shape, (235, 3), msg="Independent training dataset size is not correct")
assert_equal(x_test.shape, (157, 3), msg="Independent testing dataset size is not correct")
x_train.sample(2)

-----

# Problem 4: Standardize dataset

This problem works on the variables **x_train** and **x_test** created in Problem 3.

Standardize the training and testing independent variables using `StandardScaler`.

To complete this process, do the following:

- Create a `StandardScaler` object and fit it with `x_train`.
- Transform `x_train` and assign the transformed data to `x_train_ss`.
- Transform `x_test` and assign the transformed data to `x_test_ss`.

After this problem, there are 2 new variables, **x_train_ss** and **x_test_ss**.
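
The fit-on-train, transform-both pattern can be illustrated with a small sketch. The arrays `train` and `test` below are invented toy data, and `train_scaled` and `test_scaled` are hypothetical names, not the variables the autograder checks.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy arrays standing in for a training and testing split.
train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
test = np.array([[2.0, 40.0]])

scaler = StandardScaler().fit(train)    # learns per-column mean and std from train only
train_scaled = scaler.transform(train)  # each training column now has mean 0 and std 1
test_scaled = scaler.transform(test)    # reuses the training mean and std

print(train_scaled.mean(axis=0), train_scaled.std(axis=0))
print(test_scaled)
```

Fitting only on the training data keeps information about the testing set out of the scaler. As a consequence, the transformed testing data generally does not have mean 0 and standard deviation 1.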

-----

In [None]:
from sklearn.preprocessing import StandardScaler

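# YOUR CODE HERE
# Fit on the training data only, then apply the same transform to both sets
ss = StandardScaler().fit(x_train)
x_train_ss = ss.transform(x_train)
x_test_ss = ss.transform(x_test)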

In [None]:
assert_almost_equal(x_train_ss[0][2], 1.17940189, msg="Training set is not standardized correctly")
assert_almost_equal(x_test_ss[0][0], -0.74018877, msg="Testing set is not standardized correctly")

-----

# Problem 5: Scale dataset

This problem works on the variables **x_train** and **x_test** created in Problem 3.

Scale the training and testing independent variables using `MinMaxScaler`.

To complete this process, do the following:

- Create a `MinMaxScaler` object and fit it with `x_train`.
- Transform `x_train` and assign the transformed data to `x_train_mm`.
- Transform `x_test` and assign the transformed data to `x_test_mm`.

After this problem, there are 2 new variables, **x_train_mm** and **x_test_mm**.
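
A comparable sketch for min-max scaling is shown below, again on invented toy arrays with hypothetical names (`train`, `test`, `train_scaled`, `test_scaled`). With default settings, `MinMaxScaler` maps each training column onto the range [0, 1].

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy arrays standing in for a training and testing split.
train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
test = np.array([[2.0, 40.0]])

scaler = MinMaxScaler().fit(train)      # learns per-column min and max from train only
train_scaled = scaler.transform(train)  # each training column now spans 0 to 1
test_scaled = scaler.transform(test)    # reuses the training min and max

print(train_scaled.min(axis=0), train_scaled.max(axis=0))
print(test_scaled)
```

Because the scaler reuses the training minimum and maximum, a testing value outside the training range falls outside [0, 1]. Here the second testing feature (40.0) is above the training maximum (30.0), so it scales to a value greater than 1.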

-----

In [None]:
from sklearn.preprocessing import MinMaxScaler

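# YOUR CODE HERE
# Fit on the training data only, then apply the same transform to both sets
mm = MinMaxScaler().fit(x_train)
x_train_mm = mm.transform(x_train)
x_test_mm = mm.transform(x_test)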

In [None]:
assert_almost_equal(x_train_mm[0][0], 0.40673575129533684, msg="Training set is not scaled correctly")
assert_almost_equal(x_test_mm[0][0], 0.12435233160621761, msg="Testing set is not scaled correctly")