# Answer Key
Worksheet 04: Metadata and Missing Data
Section: (0101|0102)

# Reflection
> Mendenhall, R., Brown, N., Black, M. L., Van Moer, M., Lourentzou, I., Flynn, K., … Zerai, A. (2016). Rescuing Lost History: Using Big Data to Recover Black Women’s Lived Experiences. In Proceedings of the XSEDE16 Conference on Diversity, Big Data, and Science at Scale (p. 56:1–56:6). New York, NY, USA: ACM.

Can you think of any other novel uses for the authors' "Search, Recognition, Rescue, and Recover (SeRRR)" process? And do you think it would be the same paper with less theory?

Double Click to Write Here

## Metadata - Import Cars 1985
Look at the following metadata:

In [1]:
# Third-party data-analysis imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

In [2]:
# Open the "imports-85.names" file, a plain-text file
# containing descriptive metadata about the dataset.
# The with-block closes the file once it has been read.
with open('data/imports-85.names', 'r') as txt:
    print(txt.read())

1. Title: 1985 Auto Imports Database

2. Source Information:
   -- Creator/Donor: Jeffrey C. Schlimmer (Jeffrey.Schlimmer@a.gp.cs.cmu.edu)
   -- Date: 19 May 1987
   -- Sources:
     1) 1985 Model Import Car and Truck Specifications, 1985 Ward's
        Automotive Yearbook.
     2) Personal Auto Manuals, Insurance Services Office, 160 Water
        Street, New York, NY 10038 
     3) Insurance Collision Report, Insurance Institute for Highway
        Safety, Watergate 600, Washington, DC 20037

3. Past Usage:
   -- Kibler,~D., Aha,~D.~W., \& Albert,~M. (1989).  Instance-based prediction
      of real-valued attributes.  {\it Computational Intelligence}, {\it 5},
      51--57.
	 -- Predicted price of car using all numeric and Boolean attributes
	 -- Method: an instance-based learning (IBL) algorithm derived from a
	    localized k-nearest neighbor algorithm.  Compared with a
	    linear regression prediction...so all instances
	    with missing attribute values were discarded.  This res

In [3]:
cars = pd.read_csv('data/imports-85.data', header=None)

In [4]:
cars.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 205 entries, 0 to 204
Data columns (total 26 columns):
0     205 non-null int64
1     205 non-null object
2     205 non-null object
3     205 non-null object
4     205 non-null object
5     205 non-null object
6     205 non-null object
7     205 non-null object
8     205 non-null object
9     205 non-null float64
10    205 non-null float64
11    205 non-null float64
12    205 non-null float64
13    205 non-null int64
14    205 non-null object
15    205 non-null object
16    205 non-null int64
17    205 non-null object
18    205 non-null object
19    205 non-null object
20    205 non-null float64
21    205 non-null object
22    205 non-null object
23    205 non-null int64
24    205 non-null int64
25    205 non-null object
dtypes: float64(5), int64(5), object(16)
memory usage: 41.7+ KB


In [5]:
cars.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,16,17,18,19,20,21,22,23,24,25
0,3,?,alfa-romero,gas,std,two,convertible,rwd,front,88.6,...,130,mpfi,3.47,2.68,9.0,111,5000,21,27,13495
1,3,?,alfa-romero,gas,std,two,convertible,rwd,front,88.6,...,130,mpfi,3.47,2.68,9.0,111,5000,21,27,16500
2,1,?,alfa-romero,gas,std,two,hatchback,rwd,front,94.5,...,152,mpfi,2.68,3.47,9.0,154,5000,19,26,16500
3,2,164,audi,gas,std,four,sedan,fwd,front,99.8,...,109,mpfi,3.19,3.4,10.0,102,5500,24,30,13950
4,2,164,audi,gas,std,four,sedan,4wd,front,99.4,...,136,mpfi,3.19,3.4,8.0,115,5500,18,22,17450
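
The `?` entries in the rows above are ordinary strings, which is why `cars.info()` reports 205 non-null values in every column. A minimal sketch of how such hidden placeholders can be counted follows; the `raw` frame is a made-up stand-in for a few cells, not the real file.

```python
import pandas as pd

# Toy frame standing in for the raw import (illustrative values only).
raw = pd.DataFrame({
    "normalized-losses": ["?", "164", "?"],
    "make": ["alfa-romero", "audi", "audi"],
    "price": ["13495", "?", "17450"],
})

# isna() finds nothing, because '?' is a regular string, not NaN.
nan_counts = raw.isna().sum()

# Comparing against the placeholder exposes the hidden gaps per column.
placeholder_counts = (raw == "?").sum()
print(placeholder_counts)
```

Running the same comparison on `cars` before any cleaning would show which columns use `?` as a missing-value marker.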


In [6]:
cars.rename(columns=
            {0:'symboling', 
             1:'normalized-losses', 
             2:'make', 
             3:'fuel-type',
             4:'aspiration',
             5:'num-doors',
             6:'body-styles',
             7:'drive-wheels',
             8:'engine-location',
             9:'wheel-base',
            10:'length',
            11:'width',
            12:'height',
            13:'curb-weight',
            14:'engine-type',
            15:'num-of-cylinders',
            16:'engine-size',
            17:'fuel-system',
            18:'bore',
            19:'stroke',
            20:'compression-ratio',
            21:'horsepower',
            22:'peak-rpm',
            23:'city-mpg',
            24:'highway-mpg',
            25:'price'}, inplace=True)

In [7]:
cars.head()

Unnamed: 0,symboling,normalized-losses,make,fuel-type,aspiration,num-doors,body-styles,drive-wheels,engine-location,wheel-base,...,engine-size,fuel-system,bore,stroke,compression-ratio,horsepower,peak-rpm,city-mpg,highway-mpg,price
0,3,?,alfa-romero,gas,std,two,convertible,rwd,front,88.6,...,130,mpfi,3.47,2.68,9.0,111,5000,21,27,13495
1,3,?,alfa-romero,gas,std,two,convertible,rwd,front,88.6,...,130,mpfi,3.47,2.68,9.0,111,5000,21,27,16500
2,1,?,alfa-romero,gas,std,two,hatchback,rwd,front,94.5,...,152,mpfi,2.68,3.47,9.0,154,5000,19,26,16500
3,2,164,audi,gas,std,four,sedan,fwd,front,99.8,...,109,mpfi,3.19,3.4,10.0,102,5500,24,30,13950
4,2,164,audi,gas,std,four,sedan,4wd,front,99.4,...,136,mpfi,3.19,3.4,8.0,115,5500,18,22,17450


In [8]:
cars['normalized-losses'] = pd.to_numeric(cars['normalized-losses'], errors='coerce')
cars.head()

Unnamed: 0,symboling,normalized-losses,make,fuel-type,aspiration,num-doors,body-styles,drive-wheels,engine-location,wheel-base,...,engine-size,fuel-system,bore,stroke,compression-ratio,horsepower,peak-rpm,city-mpg,highway-mpg,price
0,3,,alfa-romero,gas,std,two,convertible,rwd,front,88.6,...,130,mpfi,3.47,2.68,9.0,111,5000,21,27,13495
1,3,,alfa-romero,gas,std,two,convertible,rwd,front,88.6,...,130,mpfi,3.47,2.68,9.0,111,5000,21,27,16500
2,1,,alfa-romero,gas,std,two,hatchback,rwd,front,94.5,...,152,mpfi,2.68,3.47,9.0,154,5000,19,26,16500
3,2,164.0,audi,gas,std,four,sedan,fwd,front,99.8,...,109,mpfi,3.19,3.4,10.0,102,5500,24,30,13950
4,2,164.0,audi,gas,std,four,sedan,4wd,front,99.4,...,136,mpfi,3.19,3.4,8.0,115,5500,18,22,17450
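
The effect of `errors='coerce'` can be seen on a small series. This is a sketch with made-up values: anything that cannot be parsed as a number becomes `NaN`, and the column becomes `float64` because `NaN` is a float.

```python
import pandas as pd

# Illustrative values only; '?' plays the role of the missing marker.
s = pd.Series(["?", "164", "158"])

# With errors='coerce', unparseable entries turn into NaN.
converted = pd.to_numeric(s, errors="coerce")

# With the default errors='raise', the same input raises ValueError.
try:
    pd.to_numeric(s)
    raised = False
except ValueError:
    raised = True

print(converted)
```

This also explains why `164` is displayed as `164.0` after the conversion.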


In [9]:
#cars[cars[['num-doors', 'bore', 'stroke', 'price']] == '?'] = np.NaN

In [10]:
#cars.sort_values(['city-mpg','highway-mpg','fuel-type'])
# np.nan is the canonical spelling; the np.NaN alias was removed in NumPy 2.0
cars.replace('?', np.nan, inplace=True)
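
An alternative to replacing `?` after loading is to declare it as a missing-value marker at read time with the `na_values` parameter of `pd.read_csv`. The sketch below uses a two-row in-memory sample with only four of the columns; the second row's missing price is invented for illustration.

```python
import io
import pandas as pd

# Tiny in-memory stand-in for the data file (assumed, abbreviated rows).
sample = io.StringIO("3,?,alfa-romero,13495\n2,164,audi,?\n")

df = pd.read_csv(
    sample,
    header=None,
    names=["symboling", "normalized-losses", "make", "price"],
    na_values="?",  # treat '?' as missing while parsing
)

print(df.dtypes)
```

Because the placeholder never reaches the frame as a string, numeric columns are parsed as numbers directly and no separate `pd.to_numeric` step is needed for them.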

In [11]:
cars.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 205 entries, 0 to 204
Data columns (total 26 columns):
symboling            205 non-null int64
normalized-losses    164 non-null float64
make                 205 non-null object
fuel-type            205 non-null object
aspiration           205 non-null object
num-doors            203 non-null object
body-styles          205 non-null object
drive-wheels         205 non-null object
engine-location      205 non-null object
wheel-base           205 non-null float64
length               205 non-null float64
width                205 non-null float64
height               205 non-null float64
curb-weight          205 non-null int64
engine-type          205 non-null object
num-of-cylinders     205 non-null object
engine-size          205 non-null int64
fuel-system          205 non-null object
bore                 201 non-null object
stroke               201 non-null object
compression-ratio    205 non-null float64
horsepower           203 non-nu

In [12]:
cars.to_csv('wk04-cars.csv')
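
Note that `to_csv` writes the row index as an unlabeled first column by default, so reading the file back adds a column that pandas names `Unnamed: 0`. A small sketch of the difference, using an in-memory buffer and a made-up two-row frame instead of the real file:

```python
import io
import pandas as pd

# Illustrative frame; not the real cars data.
df = pd.DataFrame({"make": ["audi", "dodge"], "city-mpg": [24, 37]})

# Default: the index is written as an unlabeled first column.
with_index = io.StringIO()
df.to_csv(with_index)
with_index.seek(0)
reloaded = pd.read_csv(with_index)

# index=False leaves the index out of the file.
no_index = io.StringIO()
df.to_csv(no_index, index=False)
no_index.seek(0)
clean = pd.read_csv(no_index)

print(list(reloaded.columns))
print(list(clean.columns))
```

If the index carries no meaning, as with the default `RangeIndex` here, passing `index=False` is usually the simpler choice.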

In [13]:
cars.head()

Unnamed: 0,symboling,normalized-losses,make,fuel-type,aspiration,num-doors,body-styles,drive-wheels,engine-location,wheel-base,...,engine-size,fuel-system,bore,stroke,compression-ratio,horsepower,peak-rpm,city-mpg,highway-mpg,price
0,3,,alfa-romero,gas,std,two,convertible,rwd,front,88.6,...,130,mpfi,3.47,2.68,9.0,111,5000,21,27,13495
1,3,,alfa-romero,gas,std,two,convertible,rwd,front,88.6,...,130,mpfi,3.47,2.68,9.0,111,5000,21,27,16500
2,1,,alfa-romero,gas,std,two,hatchback,rwd,front,94.5,...,152,mpfi,2.68,3.47,9.0,154,5000,19,26,16500
3,2,164.0,audi,gas,std,four,sedan,fwd,front,99.8,...,109,mpfi,3.19,3.4,10.0,102,5500,24,30,13950
4,2,164.0,audi,gas,std,four,sedan,4wd,front,99.4,...,136,mpfi,3.19,3.4,8.0,115,5500,18,22,17450


In [25]:
df = cars[cars['city-mpg'] > 30]

In [26]:
df.head()

Unnamed: 0,symboling,normalized-losses,make,fuel-type,aspiration,num-doors,body-styles,drive-wheels,engine-location,wheel-base,...,engine-size,fuel-system,bore,stroke,compression-ratio,horsepower,peak-rpm,city-mpg,highway-mpg,price
18,2,121.0,chevrolet,gas,std,two,hatchback,fwd,front,88.4,...,61,2bbl,2.91,3.03,9.5,48,5100,47,53,5151
19,1,98.0,chevrolet,gas,std,two,hatchback,fwd,front,94.5,...,90,2bbl,3.03,3.11,9.6,70,5400,38,43,6295
20,0,81.0,chevrolet,gas,std,four,sedan,fwd,front,94.5,...,90,2bbl,3.03,3.11,9.6,70,5400,38,43,6575
21,1,118.0,dodge,gas,std,two,hatchback,fwd,front,93.7,...,90,2bbl,2.97,3.23,9.41,68,5500,37,41,5572
22,1,118.0,dodge,gas,std,two,hatchback,fwd,front,93.7,...,90,2bbl,2.97,3.23,9.4,68,5500,31,38,6377
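
The filter above is a boolean mask. One point worth keeping in mind with missing data: any comparison against `NaN` evaluates to `False`, so rows with a missing value silently drop out of the result. A sketch with a made-up frame (the real `city-mpg` column has no missing values):

```python
import numpy as np
import pandas as pd

# Illustrative frame with one missing mileage value.
toy = pd.DataFrame({
    "make": ["chevrolet", "audi", "dodge"],
    "city-mpg": [47.0, np.nan, 31.0],
})

# NaN > 30 is False, so the audi row is excluded by the mask.
mask = toy["city-mpg"] > 30
efficient = toy[mask]

print(mask.tolist())
print(efficient)
```

Checking `toy["city-mpg"].isna().sum()` before filtering shows how many rows would be excluded for this reason rather than for failing the threshold.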
