# Module 1 Assignment


A few things you should keep in mind when working on assignments:

1. Before you submit your assignment, make sure everything runs as expected. Go to the menu bar, select Kernel, then Restart & Run All.
2. Make sure that you save your work.
3. Upload your notebook to Compass.

-----


# Problem 1: Load and clean up a dataset

For this problem you will read in a dataset from `mpg.csv`.
- Import the needed modules.
- Load the dataset from `mpg.csv` into a DataFrame named `mpg`.
- Use the DataFrame method `info()` to check whether `mpg` has missing values.
- If there are missing values in `mpg`, drop all rows that contain them.
- Use `info()` again to verify that `mpg` has no missing values.
- Display the first 5 rows of the DataFrame `mpg`.
- Feel free to add extra code cells if needed.
-----

In [1]:
# Import modules, load dataset, display basic information.
import pandas as pd

mpg = pd.read_csv('mpg.csv')
mpg.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 398 entries, 0 to 397
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   mpg           398 non-null    float64
 1   cylinders     398 non-null    int64  
 2   displacement  398 non-null    float64
 3   horsepower    392 non-null    float64
 4   weight        398 non-null    int64  
 5   acceleration  398 non-null    float64
 6   model_year    398 non-null    int64  
 7   origin        398 non-null    object 
 8   name          398 non-null    object 
dtypes: float64(4), int64(3), object(2)
memory usage: 28.1+ KB
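
As a complement to `info()`, missing values can also be counted directly. The following is a minimal sketch on a small made-up DataFrame; the `toy` frame and its values are illustrative assumptions, not rows from `mpg.csv`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for mpg.csv; one horsepower value is missing.
toy = pd.DataFrame({
    "mpg": [18.0, 15.0, 25.0],
    "horsepower": [130.0, np.nan, 95.0],
})

# isna() marks each missing cell with True; sum() then counts them per column.
missing_per_column = toy.isna().sum()
print(missing_per_column)
```

In the `info()` output above, `horsepower` has 392 non-null values out of 398 entries, so it is the only column with missing values (6 of them).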


In [2]:
# Drop missing values, verify with info()
mpg = mpg.dropna()
mpg.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 392 entries, 0 to 397
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   mpg           392 non-null    float64
 1   cylinders     392 non-null    int64  
 2   displacement  392 non-null    float64
 3   horsepower    392 non-null    float64
 4   weight        392 non-null    int64  
 5   acceleration  392 non-null    float64
 6   model_year    392 non-null    int64  
 7   origin        392 non-null    object 
 8   name          392 non-null    object 
dtypes: float64(4), int64(3), object(2)
memory usage: 30.6+ KB
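
Note that the index above still runs from 0 to 397 even though only 392 rows remain: `dropna()` removes rows but keeps the original row labels. A minimal sketch on made-up data (the `toy` frame is an illustrative assumption) shows this, along with the optional `reset_index(drop=True)` call for renumbering, which the assignment does not require.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the row labeled 1 has a missing horsepower value.
toy = pd.DataFrame({
    "mpg": [18.0, 15.0, 25.0],
    "horsepower": [130.0, np.nan, 95.0],
})

# dropna() drops the incomplete row but leaves the remaining labels untouched.
cleaned = toy.dropna()
print(list(cleaned.index))

# Optional: renumber the rows from 0 and discard the old labels.
renumbered = cleaned.reset_index(drop=True)
print(list(renumbered.index))
```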


In [3]:
# Display first 5 rows in mpg
mpg.head()

| | mpg | cylinders | displacement | horsepower | weight | acceleration | model_year | origin | name |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 18.0 | 8 | 307.0 | 130.0 | 3504 | 12.0 | 70 | usa | chevrolet chevelle malibu |
| 1 | 15.0 | 8 | 350.0 | 165.0 | 3693 | 11.5 | 70 | usa | buick skylark 320 |
| 2 | 18.0 | 8 | 318.0 | 150.0 | 3436 | 11.0 | 70 | usa | plymouth satellite |
| 3 | 16.0 | 8 | 304.0 | 150.0 | 3433 | 12.0 | 70 | usa | amc rebel sst |
| 4 | 17.0 | 8 | 302.0 | 140.0 | 3449 | 10.5 | 70 | usa | ford torino |


---

# Problem 2: Encode "origin" column

For this problem you will work on the DataFrame `mpg` created from problem 1.

- Use the pandas Series method `unique()` to check the unique values in the `origin` column.
- One-hot encode the categorical feature `origin` using `get_dummies` from pandas.
- Set the prefix of the dummy columns to 'origin'.
- Assign the resulting DataFrame to `mpg_onehot`.
- Display the first 5 rows of `mpg_onehot`.

After this problem, the DataFrame `mpg_onehot` should have three dummy features, `origin_europe`, `origin_japan` and `origin_usa`, in addition to the columns of the DataFrame `mpg` (excluding `origin`).

-----

In [4]:
# Display unique values in origin
mpg.origin.unique()

array(['usa', 'japan', 'europe'], dtype=object)

In [5]:
# One-hot encode origin, create DataFrame mpg_onehot, display first 5 rows of mpg_onehot
mpg_onehot = pd.get_dummies(mpg, columns=["origin"], prefix=["origin"])
mpg_onehot.head()

| | mpg | cylinders | displacement | horsepower | weight | acceleration | model_year | name | origin_europe | origin_japan | origin_usa |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 18.0 | 8 | 307.0 | 130.0 | 3504 | 12.0 | 70 | chevrolet chevelle malibu | 0 | 0 | 1 |
| 1 | 15.0 | 8 | 350.0 | 165.0 | 3693 | 11.5 | 70 | buick skylark 320 | 0 | 0 | 1 |
| 2 | 18.0 | 8 | 318.0 | 150.0 | 3436 | 11.0 | 70 | plymouth satellite | 0 | 0 | 1 |
| 3 | 16.0 | 8 | 304.0 | 150.0 | 3433 | 12.0 | 70 | amc rebel sst | 0 | 0 | 1 |
| 4 | 17.0 | 8 | 302.0 | 140.0 | 3449 | 10.5 | 70 | ford torino | 0 | 0 | 1 |
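
To make the encoding itself easier to see, here is a minimal sketch on a made-up four-row column (the `toy` frame is an illustrative assumption). It also passes `dtype=int`, which is not part of the assignment: pandas 2.0 and later return boolean dummy columns by default, so this argument is one way to get the 0/1 values shown above on newer versions.

```python
import pandas as pd

# Hypothetical toy column with the same three categories as `origin`.
toy = pd.DataFrame({"origin": ["usa", "japan", "europe", "usa"]})

# Each category becomes its own indicator column, named with the given prefix.
# dtype=int requests 0/1 integers instead of the boolean default in pandas 2.0+.
toy_onehot = pd.get_dummies(toy, columns=["origin"], prefix=["origin"], dtype=int)
print(toy_onehot)
```

Every row has exactly one 1 across the three dummy columns, marking the category that row belonged to.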


---

# Problem 3: Define and split independent and dependent variables

For this problem you will work on the DataFrame `mpg_onehot` created from problem 2.

To complete this process, do the following:

- Import `train_test_split` from `sklearn.model_selection`.
- Choose the column `mpg` in DataFrame `mpg_onehot` as the dependent variable and assign it to the variable **y**.
- Choose the columns `horsepower`, `weight`, `origin_europe`, `origin_japan` and `origin_usa` in DataFrame `mpg_onehot` as the independent variables and assign them to the variable **x**.
- Split the dependent and independent variables into training and testing sets.
- Name the training and testing independent variables `x_train` and `x_test`.
- Name the training and testing dependent variables `y_train` and `y_test`.
- The `test_size` argument in `train_test_split` should be set to 0.4.
- **Don't** set the `random_state` argument in `train_test_split`.
- Display the first 5 rows of `x_test`.

After this problem, there are 6 new variables defined, **x, y, x_train, x_test, y_train, y_test**.

-----

In [6]:
# Your answer
from sklearn.model_selection import train_test_split

x = mpg_onehot[['horsepower', 'weight', 'origin_europe', 'origin_japan', 'origin_usa']]
y = mpg_onehot['mpg']
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.4)
x_test.head()

| | horsepower | weight | origin_europe | origin_japan | origin_usa |
|---|---|---|---|---|---|
| 356 | 75.0 | 2350 | 0 | 1 | 0 |
| 100 | 88.0 | 3021 | 0 | 0 | 1 |
| 346 | 67.0 | 2065 | 0 | 1 | 0 |
| 390 | 96.0 | 2665 | 0 | 1 | 0 |
| 145 | 61.0 | 2003 | 0 | 1 | 0 |
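
Because `random_state` is not set, the rows that land in `x_test` change on every run, so your first 5 rows will most likely differ from the ones above. The sizes of the two sets do not change, though. A minimal sketch, using a plain list of 392 integers as a hypothetical stand-in for the cleaned rows:

```python
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the 392 cleaned rows: just the integers 0..391.
rows = list(range(392))

# Same test_size as in the assignment; no random_state, so the split is random.
train_rows, test_rows = train_test_split(rows, test_size=0.4)

# The membership varies between runs, but the set sizes stay the same.
print(len(train_rows), len(test_rows))
```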


-----

# Problem 4: Standardize dataset

This problem works on the variables `x_train` and `x_test` created in problem 3.

Standardize training and testing independent variables using `StandardScaler`.

To complete this process, do the following:

- Import `StandardScaler` from `sklearn.preprocessing`.
- Create a `StandardScaler` object and fit it with `x_train`.
- Transform `x_train` and assign the transformed data to `x_train_ss`.
- Transform `x_test` and assign the transformed data to `x_test_ss`.
- Display the first 5 rows of `x_train_ss` (use array slicing).

After this problem, two new variables have been created: **x_train_ss** and **x_test_ss**.

-----

In [7]:
# Your answer
from sklearn.preprocessing import StandardScaler

# Create and fit scaler
ss = StandardScaler().fit(x_train)
x_train_ss = ss.transform(x_train)
x_test_ss = ss.transform(x_test)

x_train_ss[:5]

array([[-0.80586471, -1.0374177 ,  2.14365065, -0.45291081, -1.36596254],
       [ 0.110397  ,  0.58513513, -0.46649392, -0.45291081,  0.7320845 ],
       [-0.91058034, -0.90526927,  2.14365065, -0.45291081, -1.36596254],
       [-0.88440143, -0.82827845, -0.46649392,  2.20794022, -1.36596254],
       [ 0.110397  ,  0.72877472, -0.46649392, -0.45291081,  0.7320845 ]])
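
The scaler is fitted on the training data only, so the test data is transformed with the training mean and standard deviation. A minimal sketch on made-up single-feature arrays (the `train` and `test` values are illustrative assumptions) compares `StandardScaler` with the z-score computed by hand:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical training and test data with a single feature.
train = np.array([[1.0], [2.0], [3.0], [4.0]])
test = np.array([[5.0]])

scaler = StandardScaler().fit(train)

# Manual z-score using the training mean and the population
# standard deviation (NumPy's default, ddof=0).
mean = train.mean(axis=0)
std = train.std(axis=0)
manual_test = (test - mean) / std

print(scaler.transform(test), manual_test)
```

The two printed values should agree, and the transformed training column has mean 0 and standard deviation 1.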

-----

# Problem 5: Scale dataset

This problem works on the variables `x_train` and `x_test` created in problem 3.

Scale training and testing independent variables using `MinMaxScaler`.

To complete this process, do the following:

- Import `MinMaxScaler` from `sklearn.preprocessing`.
- Create a `MinMaxScaler` object and fit it with `x_train`.
- Transform `x_train` and assign the transformed data to `x_train_mm`.
- Transform `x_test` and assign the transformed data to `x_test_mm`.
- Display the first 5 rows of `x_train_mm` (use array slicing).

After this problem, two new variables have been created: **x_train_mm** and **x_test_mm**.

-----

In [8]:
# Your answer
from sklearn.preprocessing import MinMaxScaler

# Create and fit scaler
mm = MinMaxScaler().fit(x_train)
x_train_mm = mm.transform(x_train)
x_test_mm = mm.transform(x_test)

x_train_mm[:5]

array([[0.16201117, 0.1403459 , 1.        , 0.        , 0.        ],
       [0.3575419 , 0.54068614, 0.        , 0.        , 1.        ],
       [0.1396648 , 0.17295152, 1.        , 0.        , 0.        ],
       [0.1452514 , 0.19194783, 0.        , 1.        , 0.        ],
       [0.3575419 , 0.57612702, 0.        , 0.        , 1.        ]])
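
With default settings, `MinMaxScaler` maps each training column to the range 0 to 1 using that column's training minimum and maximum, which is why the 0/1 dummy columns above are unchanged. Test values outside the training range can fall outside 0 to 1. A minimal sketch on made-up single-feature arrays (the `train` and `test` values are illustrative assumptions):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical training data with minimum 1 and maximum 5.
train = np.array([[1.0], [3.0], [5.0]])
# Hypothetical test value above the training maximum.
test = np.array([[7.0]])

scaler = MinMaxScaler().fit(train)

# Each value is mapped to (x - train_min) / (train_max - train_min).
train_scaled = scaler.transform(train)
test_scaled = scaler.transform(test)

print(train_scaled.ravel(), test_scaled.ravel())
```

Here the training values become 0, 0.5 and 1, while the test value 7 becomes (7 - 1) / (5 - 1) = 1.5.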