# Pre-Processing Exercises

Try the exercises below to practice data pre-processing with *pandas*. To edit and run the code, open the notebook in "playground mode" using the button in the upper right corner. Be sure to add it to your Drive to save your work. 

## Setup

Run the cell below to import the necessary modules.

In [0]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from google.colab import drive



The exercises make use of the following datasets:
1. `insurance` - A modified version of the [Automobile dataset](https://drive.google.com/a/sas.upenn.edu/file/d/1QgyhseW8t7luFPnAlB15n6_v9vpqf7Tr/view?usp=sharing) maintained by the UCI Machine Learning Repository *(the same dataset used in the EDA exercises)*
2. `models` - A modified version of [a dataset of car makes and models](https://drive.google.com/open?id=1a-ymSBOfF_LbnYvd4ZgISK_SzUoH47MI) compiled by GitHub user n8barr

Import these datasets into the notebook using the data import procedure outlined in the [EDA tutorial](https://drive.google.com/open?id=1Cy3izai9zLQYTCTQF9IwkcuLmNArcKZO). First, add them to your Google Drive using the links above, then mount your Drive to the notebook by running the cell below.

In [0]:
prefix = '/content/drive'
from google.colab import drive
drive.mount(prefix, force_remount=True)

In the cell below, assign the paths to the `insurance` and `models` datasets in your Drive to the variables `insurance_path` and `models_path`, respectively. Then run the cell to load them into `DataFrame`s named `insurance` and `models`.

In [0]:
insurance_path = '/content/drive/My Drive/CIS550/cars.csv' # Path to the insurance dataset in your drive here
models_path = '/content/drive/My Drive/CIS550/models.csv' # Path to the models dataset in your drive here
models = pd.read_csv(models_path)
insurance = pd.read_csv(insurance_path)
models.head()

Finally, run the cell below to initialize several functions that will spot-check the correctness of your solutions as you complete the exercises.

In [0]:
def test_1(df):
  assert len(df.index) == 163

def test_2(df):
  # Check the DataFrame passed in, not the global `insurance`
  assert (df['wheel_base'] == 95.9).sum() == 31
  assert (df['width'] == 65.4).sum() == 53
  assert not df['wheel_base'].isna().any()
  assert not df['width'].isna().any()

def test_3(df):
  assert (df["engine_type"] == "ohc").sum() == 129

def test_4(df):
  cols = ["bore", "stroke", "horsepower", "peak_rpm"]
  for col in cols:
    # np.float was removed in NumPy 1.24; np.float64 is the equivalent dtype
    assert df[col].dtype == np.float64
    assert not df[col].isna().any()

def test_5(df):
  assert df.shape[0] == 7268

def test_6(df):
  assert df['city_mpg'].max() == 49

def test_7(df):
  assert df['highway_mpg'].min() == 18

def test_8(a):
  assert len(a) == 0

def test_9(a):
  assert len(a) == 60

def test_10(df):
  assert len(df.index) == 3692

def test_11(df1, df2, codes):
  norm_codes = codes.copy()
  assert len(codes.index) == len(codes['make'].unique())
  assert len(codes.index) == len(codes['id'].unique())
  norm_codes['make'] = norm_codes['make'].str.strip().str.lower()
  tcode = norm_codes.loc[norm_codes['make'] == 'toyota', 'id'].squeeze()
  assert np.issubdtype(df1['make'].dtype, np.number)
  assert np.issubdtype(df2['make'].dtype, np.number)
  assert (df1['make'] == tcode).sum() == 351
  assert (df2['make'] == tcode).sum() == 31
  assert len(codes.index) == 18

def test_12(df):
  assert np.issubdtype(df["num_doors"].dtype, np.number)
  assert (df["num_doors"] == 4).sum() == 95
  assert (df["num_doors"] == 2).sum() == 68


## Missing Values

As you saw in the EDA exercises, several columns in the `insurance` dataset have missing values. Below, you'll remove or replace these. 

First, remove all rows in `insurance` that are missing `price`, `num_doors`, or `normalized_losses`.
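
As a rough illustration on made-up data (the `toy` frame and its columns are hypothetical and not taken from `insurance`), `DataFrame.dropna` with the `subset` argument drops only the rows that are missing values in the listed columns:

```python
import numpy as np
import pandas as pd

# Toy data invented for illustration only
toy = pd.DataFrame({
    "cost": [100.0, np.nan, 300.0, 400.0],
    "size": ["small", "large", None, "large"],
    "color": [None, "red", "blue", "green"],
})

# Drop rows missing `cost` or `size`; a missing `color` is tolerated
toy_clean = toy.dropna(subset=["cost", "size"])
print(toy_clean)
```

This sketch assumes missing values are already stored as NaN/None. If a dataset encodes them with a placeholder string, that placeholder would need to be converted first.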

In [0]:
# YOUR CODE HERE

####################################
# Do not edit below this line
test_1(insurance)
insurance.head()

Replace all missing values in `wheel_base` and `width` with the median of the remaining values in each column. 
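
A minimal sketch of median imputation on a hypothetical `length` column (toy values, not from the dataset). `Series.median` skips NaN by default, so it already computes the median of the known values:

```python
import numpy as np
import pandas as pd

# Toy data invented for illustration only
toy = pd.DataFrame({"length": [1.0, np.nan, 3.0, 10.0]})

# Median of the non-missing values: 1.0, 3.0, 10.0
median_length = toy["length"].median()

# Assign the result back rather than relying on in-place modification
toy["length"] = toy["length"].fillna(median_length)
print(toy)
```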

In [0]:
# YOUR CODE HERE

####################################
# Do not edit below this line
test_2(insurance)
insurance.head()

Replace all missing values in `engine_type` with the mode of the known values.
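
One possible pattern for mode imputation, shown on a hypothetical `fuel` column. `Series.mode` returns a `Series` because several values can tie for most frequent, so this sketch simply takes the first entry:

```python
import pandas as pd

# Toy data invented for illustration only
toy = pd.DataFrame({"fuel": ["gas", None, "gas", "diesel"]})

# mode() ignores missing values by default and may return several ties
most_common = toy["fuel"].mode().iloc[0]

toy["fuel"] = toy["fuel"].fillna(most_common)
print(toy)
```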

In [0]:
# YOUR CODE HERE

####################################
# Do not edit below this line
test_3(insurance)
insurance.head()

Replace all missing values in `bore`, `stroke`, `horsepower`, and `peak_rpm` with the mean of the remaining values in each column, and set the datatype of each column to `np.float64` (the older `np.float` alias was removed in NumPy 1.24).
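
The sketch below assumes a hypothetical `power` column that was read in as text because a `"?"` placeholder marks the missing entries. Whether the real columns look like this is something to verify in your own data. `pd.to_numeric` with `errors="coerce"` turns unparseable entries into NaN, after which the mean of the known values can be filled in:

```python
import numpy as np
import pandas as pd

# Toy data invented for illustration only; "?" is an assumed placeholder
toy = pd.DataFrame({"power": ["100", "?", "200"]})

# Unparseable strings become NaN instead of raising an error
toy["power"] = pd.to_numeric(toy["power"], errors="coerce")

# mean() skips NaN, so this is the mean of 100 and 200
toy["power"] = toy["power"].fillna(toy["power"].mean()).astype(np.float64)
print(toy.dtypes)
```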

In [0]:
# YOUR CODE HERE

####################################
# Do not edit below this line
test_4(insurance)
insurance.head()

Remove all rows with missing values from `models`, if there are any.

*You can assume missing values are represented by NaNs or by placeholder values that contain "unknown" or "?".*
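
As a hedged example of handling both kinds of missing value at once, the sketch below flags placeholder strings in a single hypothetical text column with `Series.str.contains`, then drops NaN rows. The frame and column names are made up, and a real solution may need to check more than one column:

```python
import numpy as np
import pandas as pd

# Toy data invented for illustration only
toy = pd.DataFrame({
    "brand": ["BMW", "unknown brand", "Audi", "?"],
    "year": [2001, 2002, np.nan, 2004],
})

# Regex matches "unknown" (any case) or a literal question mark;
# na=False treats missing strings as non-matches
is_placeholder = toy["brand"].str.contains(r"unknown|\?", case=False, na=False)

# Drop placeholder rows first, then any rows that still contain NaN
toy_clean = toy[~is_placeholder].dropna()
print(toy_clean)
```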

In [0]:
# YOUR CODE HERE

####################################
# Do not edit below this line
test_5(models)

## Misentered Values

As you saw in the EDA exercises, `city_mpg` and `highway_mpg` each contain at least one misentered value. Below, you'll replace these with more plausible values.

First, replace all instances of 200 in `city_mpg` with the maximum value in the column, excluding 200. 
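
A small sketch of this kind of replacement on a hypothetical `speed` column, where `999` plays the role of the misentered value (both the column and the sentinel are invented). The replacement is computed from the rows that do not hold the sentinel:

```python
import pandas as pd

# Toy data invented for illustration only; 999 is the assumed bad value
toy = pd.DataFrame({"speed": [20, 999, 35, 999]})

# Largest value among the rows that are not the sentinel
plausible_max = toy.loc[toy["speed"] != 999, "speed"].max()

toy["speed"] = toy["speed"].replace(999, plausible_max)
print(toy)
```

The same pattern works with `.min()` when the bad value sits at the low end of the range.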

In [0]:
# YOUR CODE HERE

####################################
# Do not edit below this line
test_6(insurance)

Replace all instances of 0 in `highway_mpg` with the minimum value in the column, excluding 0. 

In [0]:
# YOUR CODE HERE

####################################
# Do not edit below this line
test_7(insurance)

## Entity Resolution
Now, you'll perform simple entity resolution between the two datasets, using car makes (e.g. BMW, Toyota) as the shared entities. 

First, compile an alphabetically sorted list of all makes found in either dataset. What appears to be the primary source of resolution problems in this list?

In [0]:
# YOUR CODE HERE

Address the major entity resolution problem you identified above using built-in *pandas* functions. 

*Your solution should simultaneously address all makes. You shouldn't be fixing the names on a case-by-case basis at this stage.*
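
For reference, pandas exposes vectorized string methods through the `.str` accessor, which apply one operation to every value in a column. The sketch below uses made-up labels and shows two such methods chained together. Which methods your data actually needs depends on what you observed in the sorted list:

```python
import pandas as pd

# Toy labels invented for illustration only
labels = pd.Series([" BMW", "bmw ", "Audi"])

# Each .str method is applied element-wise to the whole Series
normalized = labels.str.strip().str.lower()

print(sorted(normalized.unique()))
```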

In [0]:
# YOUR CODE HERE

Output a list of all the remaining make names that appear in either dataset, then inspect it for lingering problems. 

In [0]:
# YOUR CODE HERE

Resolve the remaining entity resolution problems you identified above. 

*Hint: You should have found two makes that the datasets reference using two different names*
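
Once the mismatched names are known, one option is `Series.replace` with a dictionary that maps each alias to a single canonical spelling. The names below are hypothetical and are not the ones in the datasets:

```python
import pandas as pd

# Toy names invented for illustration only
brands = pd.Series(["acme", "acme motors", "globex"])

# Map each alias (key) to the canonical name (value)
aliases = {"acme motors": "acme"}

brands = brands.replace(aliases)
print(brands.tolist())
```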

In [0]:
# YOUR CODE HERE

Determine which makes (if any) in the `insurance` dataset don't appear in the `models` dataset. Store these in a set named `models_missing`.
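
Python's built-in `set` type supports difference with the `-` operator, which fits this kind of comparison. A sketch with two made-up `Series`:

```python
import pandas as pd

# Toy data invented for illustration only
left_brands = pd.Series(["audi", "bmw", "saab", "bmw"])
right_brands = pd.Series(["bmw", "volvo"])

# Values present on the left but absent on the right
only_left = set(left_brands) - set(right_brands)
print(only_left)
```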

In [0]:
# YOUR CODE HERE

####################################
# Do not edit below this line
test_8(models_missing)
models_missing

Determine which makes (if any) in the `models` dataset don't appear in the `insurance` dataset. Store these in a set named `insurance_missing`.

In [0]:
# YOUR CODE HERE

####################################
# Do not edit below this line
test_9(insurance_missing)
insurance_missing

Remove all rows from the `models` dataset where the make doesn't have a match in the `insurance` dataset. 
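
Row filtering by membership can be sketched with `Series.isin`, which returns a boolean mask. The frame and the `keep` set below are hypothetical:

```python
import pandas as pd

# Toy data invented for illustration only
cars = pd.DataFrame({
    "brand": ["audi", "bmw", "volvo", "bmw"],
    "model": ["a4", "m3", "v70", "x5"],
})

# Brands assumed to have a match in the other dataset
keep = {"audi", "bmw"}

# Keep only rows whose brand is in the allowed set
filtered = cars[cars["brand"].isin(keep)]
print(filtered)
```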

In [0]:
# YOUR CODE HERE

####################################
# Do not edit below this line
test_10(models)
models.head()

## Categorical Variables
Replace the text-based makes in both datasets with integer codes, then store the key to the codes in a separate dataframe called `make_codes`. The `make_codes` `DataFrame` should contain a `make` column containing the names of the makes and an `id` column containing the corresponding integer IDs. 
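
One way to keep codes consistent across two frames is to build the lookup table from the union of names first, then map both frames through it. The sketch below uses made-up frames and a hypothetical `brand` column. It is one possible approach, not the only one:

```python
import pandas as pd

# Toy data invented for illustration only
left = pd.DataFrame({"brand": ["audi", "bmw", "audi"]})
right = pd.DataFrame({"brand": ["bmw", "volvo"]})

# One row per distinct name across both frames, with an integer id
names = sorted(set(left["brand"]) | set(right["brand"]))
brand_codes = pd.DataFrame({"brand": names, "id": range(len(names))})

# Translate names to ids using the shared lookup
lookup = dict(zip(brand_codes["brand"], brand_codes["id"]))
left["brand"] = left["brand"].map(lookup)
right["brand"] = right["brand"].map(lookup)
print(brand_codes)
```

Building the table from the union means a name receives the same id in both frames, even if it appears in only one of them.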

In [0]:
# YOUR CODE HERE

####################################
# Do not edit below this line
test_11(models, insurance, make_codes)

Replace the `num_doors` column in `insurance` with a numeric equivalent. 

*In this case, replace each value with its numeric equivalent, rather than an arbitrary ID. For example, you should replace "four" with 4.*
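
A sketch of word-to-number conversion with `Series.map` on a hypothetical `num_seats` column. Note that `map` turns any value missing from the dictionary into NaN, so the dictionary should cover every value present:

```python
import pandas as pd

# Toy data invented for illustration only
toy = pd.DataFrame({"num_seats": ["two", "five", "two"]})

# Explicit mapping from number words to integers
word_to_number = {"two": 2, "five": 5}

toy["num_seats"] = toy["num_seats"].map(word_to_number)
print(toy)
```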

In [0]:
# YOUR CODE HERE

####################################
# Do not edit below this line
test_12(insurance)

## Index Candidates

Show that no subset of columns in `models` can form an index. In other words, show that every subset of columns in `models` contains at least one set of repeated values. 
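
A brute-force sketch of testing index candidates on a made-up frame: `DataFrame.duplicated(subset=...)` flags repeated value combinations, and `itertools.combinations` enumerates the column subsets. The helper name `unique_subsets` is hypothetical:

```python
from itertools import combinations

import pandas as pd

# Toy data invented for illustration only
toy = pd.DataFrame({
    "a": [1, 1, 2],
    "b": ["x", "y", "y"],
    "c": [0, 0, 0],
})

def unique_subsets(df):
    """Return every non-empty column subset whose rows are all distinct."""
    found = []
    for size in range(1, len(df.columns) + 1):
        for cols in combinations(df.columns, size):
            # duplicated() marks rows that repeat an earlier combination
            if not df.duplicated(subset=list(cols)).any():
                found.append(cols)
    return found

candidates = unique_subsets(toy)
print(candidates)
```

The number of subsets grows exponentially with the column count, so it can help to notice that two rows identical across all columns are also identical across any subset of them.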

In [0]:
# YOUR CODE HERE

Find all columns in `insurance` that can form a unique index together with `normalized_losses`.

*Hint: You should only find 1 column*

In [0]:
# YOUR CODE HERE

Check whether this column could form an index without `normalized_losses`. 

In [0]:
# YOUR CODE HERE