# Module 1 Assignment

A few things you should keep in mind when working on assignments:

1. Run the first code cell to import modules needed by this assignment before proceeding to problems.
2. Make sure you fill in any place that says `# YOUR CODE HERE`. Do not write your answer anywhere else other than where it says `# YOUR CODE HERE`. Anything you write elsewhere will be removed or overwritten by the autograder.
3. Each problem has an autograder cell below the answer cell. Run the autograder cell to check your answer. If there's anything wrong in your answer, the autograder cell will display error messages.
4. Before you submit your assignment, make sure everything runs as expected. In the menubar, select Kernel → Restart & Run All. If the notebook runs through the last code cell without an error message, you've answered all problems correctly.
5. Make sure that you save your work (in the menubar, select File → Save and Checkpoint).

-----

# Run Me First!

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns

from nose.tools import assert_equal, assert_almost_equal, assert_true, assert_is_instance

---

# Problem 1: Read in a dataset

For this problem you will read in a dataset using pandas. In the cell below, the function *read_data* has a parameter *file_path*, which contains the path to a dataset.
- Use the *read_csv* function from pandas to read in the dataset from the file path and return the resulting dataframe.
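
As a minimal sketch of what `read_csv` returns, the example below reads a small in-memory CSV instead of a file on disk. The variable names `csv_text` and `demo` and the three sample rows are made up for illustration; they are not part of the assignment.

```python
import io

import pandas as pd

# Hypothetical in-memory CSV standing in for a file on disk.
csv_text = "total_bill,tip,sex\n16.99,1.01,Female\n10.34,1.66,Male\n"

# pd.read_csv accepts a path or a file-like object and returns a DataFrame.
demo = pd.read_csv(io.StringIO(csv_text))

print(type(demo).__name__)  # DataFrame
print(demo.shape)           # (2, 3)
```

Passing a path string such as `file_path` works the same way; only the source of the text changes.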

-----

In [2]:
def read_data(file_path):
    '''
    Reads in a dataset using pandas.
    
    Parameters
    ----------
    file_path : string containing path to a file
    
    Returns
    -------
    pandas dataframe with data read in from the file path
    '''
    # YOUR CODE HERE
    return pd.read_csv(file_path)

In [3]:
tips = read_data('data/tips.csv')
assert_equal(type(tips), pd.core.frame.DataFrame, msg="Your function doesn't return a dataframe")
assert_equal(len(tips), 244, msg="The dataset should have 244 rows. Your solution has %s" % len(tips))

print("2 random rows of the dataset tips:")
tips.sample(2)

2 random rows of the dataset tips:


Unnamed: 0,total_bill,tip,sex,smoker,day,time,size
203,16.4,2.5,Female,Yes,Thur,Lunch,2
161,12.66,2.5,Male,No,Sun,Dinner,2


---

# Problem 2: Encode "sex" column

For this problem you will work on the DataFrame `tips` created from problem 1.

Encode the categorical feature 'sex' by using `LabelEncoder`. Store encoded sex in a new column named 'sex_code'.

After this problem, the DataFrame `tips` should have one more column, 'sex_code', in addition to the original columns.
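
To illustrate how `LabelEncoder` assigns codes, here is a small sketch on a hypothetical four-row DataFrame (`demo` and `encoder` are illustrative names, not part of the assignment). The encoder sorts the distinct labels and numbers them from 0, so 'Female' maps to 0 and 'Male' to 1.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy data; the assignment itself uses the tips DataFrame.
demo = pd.DataFrame({'sex': ['Female', 'Male', 'Male', 'Female']})

encoder = LabelEncoder()
# fit_transform learns the sorted set of labels, then maps each value to its position.
demo['sex_code'] = encoder.fit_transform(demo['sex'])

print(encoder.classes_.tolist())   # ['Female', 'Male']
print(demo['sex_code'].tolist())   # [0, 1, 1, 0]
```

Keeping a reference to the fitted encoder, as done here, makes it possible to look up the mapping later through `classes_`.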

-----

In [4]:
from sklearn.preprocessing import LabelEncoder

# YOUR CODE HERE
tips['sex_code'] = LabelEncoder().fit_transform(tips.sex)

In [5]:
assert_true('sex_code' in tips.columns, msg="tips doesn't have 'sex_code' column")
assert_true(0 in tips.sex_code.unique(), msg="sex is not properly encoded")
assert_true(1 in tips.sex_code.unique(), msg="sex is not properly encoded")
tips.head(2)

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size,sex_code
0,16.99,1.01,Female,No,Sun,Dinner,2,0
1,10.34,1.66,Male,No,Sun,Dinner,3,1


---

# Problem 3: Define independent variable and dependent variable

For this problem you will work on the DataFrame `tips` created from problem 2.

In this problem, you will prepare the dependent and independent variables from the DataFrame `tips` for a regression problem.

- Define the dependent variable y, which has the values in the 'tip' column.
- Define the independent variable x, which has the values in the 'total_bill' and 'sex_code' columns.

After this problem, there are two new variables defined, **x** and __y__. x is a Pandas DataFrame with two columns and y is a Pandas Series.
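
The difference between a Series and a DataFrame comes down to how the columns are selected. The sketch below uses a hypothetical two-row DataFrame (`demo`, `x_demo`, and `y_demo` are illustrative names) to show that selecting one column by name gives a Series, while selecting with a list of names gives a DataFrame.

```python
import pandas as pd

# Hypothetical toy data with the same column names used in this problem.
demo = pd.DataFrame({
    'total_bill': [16.99, 10.34],
    'tip': [1.01, 1.66],
    'sex_code': [0, 1],
})

y_demo = demo['tip']                         # single label -> Series
x_demo = demo[['total_bill', 'sex_code']]    # list of labels -> DataFrame

print(type(y_demo).__name__)  # Series
print(type(x_demo).__name__)  # DataFrame
print(x_demo.shape)           # (2, 2)
```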

-----

In [6]:
# YOUR CODE HERE
y = tips.tip
x = tips[['total_bill', 'sex_code']]

In [7]:
assert_equal(type(x), pd.core.frame.DataFrame, msg="x should be a DataFrame")
assert_equal(type(y), pd.core.frame.Series, msg="y should be a Series")
assert_equal(len(x.columns), 2, msg="x should have two columns")
assert_true('sex_code' in x.columns, msg="sex_code is not in the independent variable list")
assert_equal(y[0], 1.01, msg="dependent variable values are not right")

-----

# Problem 4: Create the Training and Testing Datasets

This problem works on the variables **x** and __y__ created in problem 3.

Split the independent and dependent variables into training and testing sets.

To complete this process, do the following:

- Name the training and testing independent variables x_train and x_test.
- Name the training and testing dependent variables y_train and y_test.
- The `test_size` argument in `train_test_split` should be set to 0.3.
- The `random_state` argument in `train_test_split` should be set to 23.

After this problem, there are four new variables defined: x_train, x_test, y_train, and y_test.
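
As a rough sketch of how these arguments behave, the example below splits synthetic data with 244 rows, the same row count as `tips`. The names `x_demo`, `y_demo`, and the `_tr`/`_te` suffixes are made up for illustration. With `test_size=0.3`, scikit-learn rounds the test size up (0.3 × 244 = 73.2, so 74 rows) and the remaining 170 rows form the training set.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with the same number of rows as the tips dataset.
n_rows = 244
x_demo = pd.DataFrame({'a': np.arange(n_rows), 'b': np.arange(n_rows) * 2})
y_demo = pd.Series(np.arange(n_rows), name='target')

# Rows are shuffled before splitting; random_state makes the shuffle repeatable.
x_tr, x_te, y_tr, y_te = train_test_split(
    x_demo, y_demo, test_size=0.3, random_state=23
)

print(len(x_tr), len(x_te))  # 170 74

# x and y are split with the same row indices, so they stay aligned.
print((x_tr.index == y_tr.index).all())  # True
```

Because the split shuffles rows, the original index labels are kept but appear in a different order in each subset.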

-----

In [8]:
from sklearn.model_selection import train_test_split

# YOUR CODE HERE
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=23)

In [9]:
assert_equal(x_train.shape[0], 170, msg="Training set doesn't have correct size")
assert_equal(x_test.shape[0], 74, msg="Testing set doesn't have correct size")

# Test independent values
assert_equal(x_train.total_bill[0], 16.99, msg="Training independent data is wrong")
assert_equal(y_train[0], 1.01, msg="Training dependent data is wrong")
