# EPA Fuel Economy Data Assessment

This Jupyter notebook assesses EPA fuel economy data from 2008 and 2018.

To download data: https://www.fueleconomy.gov/feg/download.shtml

Documentation: https://www.fueleconomy.gov/feg/EPAGreenGuide/GreenVehicleGuideDocumentation.pdf

README.txt: http://www.fueleconomy.gov/feg/epadata/Readme.txt

In [67]:
import pandas as pd

# import the 2008 and 2018 data
temp_08 = pd.read_excel("https://www.fueleconomy.gov/feg/EPAGreenGuide/xls/all_alpha_08.xls")
temp_18 = pd.read_excel('https://www.fueleconomy.gov/feg/EPAGreenGuide/xls/all_alpha_18.xlsx')

In [68]:
# work on copies so the in-place edits below leave the raw downloads untouched
df_08 = temp_08.copy()
df_18 = temp_18.copy()

In [69]:
#evaluate column labels to determine if they are aligned
print("2008", df_08.columns, "2018", df_18.columns, 
      "\nDo all columns match?",
      (df_08.columns == df_18.columns).all(),
      sep='\n')


2008
Index(['Model', 'Displ', 'Cyl', 'Trans', 'Drive', 'Fuel', 'Sales Area', 'Stnd',
       'Underhood ID', 'Veh Class', 'Air Pollution Score', 'FE Calc Appr',
       'City MPG', 'Hwy MPG', 'Cmb MPG', 'Unadj Cmb MPG',
       'Greenhouse Gas Score', 'SmartWay'],
      dtype='object')
2018
Index(['Model', 'Displ', 'Cyl', 'Trans', 'Drive', 'Fuel', 'Cert Region',
       'Stnd', 'Stnd Description', 'Underhood ID', 'Veh Class',
       'Air Pollution Score', 'City MPG', 'Hwy MPG', 'Cmb MPG',
       'Greenhouse Gas Score', 'SmartWay', 'Comb CO2'],
      dtype='object')

Do all columns match?
False
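
The elementwise comparison above reports only that the labels differ, not which ones. It also relies on both indexes having the same length. A minimal sketch of a set-based comparison follows; `column_diff` is a hypothetical helper, and the toy frames only stand in for `df_08` and `df_18`.

```python
import pandas as pd

def column_diff(left, right):
    """Return (only_in_left, only_in_right) column labels as sorted lists."""
    only_left = sorted(set(left.columns) - set(right.columns))
    only_right = sorted(set(right.columns) - set(left.columns))
    return only_left, only_right

# Toy frames (assumption: a small subset of the real column labels)
toy_08 = pd.DataFrame(columns=['Model', 'Sales Area', 'FE Calc Appr', 'Unadj Cmb MPG'])
toy_18 = pd.DataFrame(columns=['Model', 'Cert Region', 'Stnd Description', 'Comb CO2'])

only_08, only_18 = column_diff(toy_08, toy_18)
print("Only in 2008:", only_08)
print("Only in 2018:", only_18)
```

Because sets ignore order, this check complements the positional comparison rather than replacing it.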


In [70]:
#dropping columns that are not present in both datasets or are unnecessary for evaluation
df_08.drop(['Stnd', 'Underhood ID', 'FE Calc Appr', 'Unadj Cmb MPG'],axis=1,inplace=True)
df_18.drop(['Stnd', 'Stnd Description', 'Underhood ID', 'Comb CO2'],axis=1,inplace=True)

#renaming columns for consistency
df_08.rename(columns={'Sales Area':'Cert Region'}, inplace=True)

#Make columns lowercase and replace spaces with underscores in column names
df_08.rename(columns=lambda x: x.strip().lower().replace(" ", "_"), inplace=True)
df_18.rename(columns=lambda x: x.strip().lower().replace(" ", "_"), inplace=True)

#confirm columns are identical
print("Do all columns match?",
      (df_08.columns == df_18.columns).all(),
      sep='\n')

Do all columns match?
True


In [71]:
#remove null values and duplicates
df_08.drop_duplicates(inplace=True)
df_08.dropna(inplace=True)
df_18.drop_duplicates(inplace=True)
df_18.dropna(inplace=True)
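
Before dropping rows, it can help to count how many would be removed. The sketch below uses a toy frame with made-up values, not the EPA data, and non-mutating calls so the original stays available for comparison.

```python
import pandas as pd

# Toy frame (assumption: illustrative values only)
toy = pd.DataFrame({
    'model': ['A', 'A', 'B', 'C'],
    'displ': [2.0, 2.0, None, 3.5],
})

n_dupes = int(toy.duplicated().sum())              # fully duplicated rows
n_null_rows = int(toy.isnull().any(axis=1).sum())  # rows with at least one null

# Same order as the cell above: duplicates first, then nulls
clean = toy.drop_duplicates().dropna()

print(n_dupes, n_null_rows, len(toy), len(clean))
```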

In [72]:
df_08.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2142 entries, 0 to 2403
Data columns (total 14 columns):
model                   2142 non-null object
displ                   2142 non-null float64
cyl                     2142 non-null object
trans                   2142 non-null object
drive                   2142 non-null object
fuel                    2142 non-null object
cert_region             2142 non-null object
veh_class               2142 non-null object
air_pollution_score     2142 non-null object
city_mpg                2142 non-null object
hwy_mpg                 2142 non-null object
cmb_mpg                 2142 non-null object
greenhouse_gas_score    2142 non-null object
smartway                2142 non-null object
dtypes: float64(1), object(13)
memory usage: 251.0+ KB


In [73]:
df_18.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2615 entries, 0 to 2681
Data columns (total 14 columns):
model                   2615 non-null object
displ                   2615 non-null float64
cyl                     2615 non-null float64
trans                   2615 non-null object
drive                   2615 non-null object
fuel                    2615 non-null object
cert_region             2615 non-null object
veh_class               2615 non-null object
air_pollution_score     2615 non-null int64
city_mpg                2615 non-null object
hwy_mpg                 2615 non-null object
cmb_mpg                 2615 non-null object
greenhouse_gas_score    2615 non-null int64
smartway                2615 non-null object
dtypes: float64(2), int64(2), object(10)
memory usage: 306.4+ KB


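The following columns do not have matching types across the two datasets: cyl, air_pollution_score, and greenhouse_gas_score. The city_mpg, hwy_mpg, and cmb_mpg columns match, but they are stored as object (string) in both datasets rather than as numbers.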
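
Rather than reading the two `info()` listings side by side, the dtypes can be compared programmatically. This is a sketch under the assumption that both frames share the same column labels; `dtype_mismatches` is a hypothetical helper and the toy frames are illustrative.

```python
import pandas as pd

def dtype_mismatches(left, right):
    """Return a frame listing columns whose dtypes differ between left and right."""
    both = pd.DataFrame({
        'left': left.dtypes.astype(str),
        'right': right.dtypes.astype(str),
    })
    return both[both['left'] != both['right']]

# Toy frames (assumption: illustrative values only)
toy_a = pd.DataFrame({'cyl': ['(4 cyl)'], 'displ': [2.0], 'score': ['6']})
toy_b = pd.DataFrame({'cyl': [4.0], 'displ': [2.0], 'score': [6]})

mismatched = dtype_mismatches(toy_a, toy_b)
print(mismatched)
```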

In [74]:
# 2008 - convert from string to integer by stripping the surrounding '( cyl)' characters
# 2018 - convert from float to integer
df_08['cyl'] = df_08['cyl'].apply(lambda x: int(x.strip('( cyl)')))
df_18['cyl'] = df_18['cyl'].astype(int)
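
Note that `str.strip('( cyl)')` treats its argument as a set of characters to remove from both ends, not as a literal prefix or suffix. An alternative sketch that pulls out the digits with a regular expression is shown below. The sample strings are an assumption about the 2008 format, inferred from the strip call above.

```python
import pandas as pd

# Toy series (assumption: values shaped like "(4 cyl)")
cyl_raw = pd.Series(['(4 cyl)', '(6 cyl)', '(12 cyl)'])

# Capture the first run of digits in each string and cast to int
cyl_int = cyl_raw.str.extract(r'(\d+)', expand=False).astype(int)

print(cyl_int.tolist())
```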

The air_pollution_score, mpg, and greenhouse_gas_score columns need extra handling before they can be converted to numeric types. According to [this link](http://www.fueleconomy.gov/feg/findacarhelp.shtml#airPollutionScore) (via the PDF documentation):

    "If a vehicle can operate on more than one type of fuel, an estimate is provided for each fuel type."
    
For vehicles with more than one fuel type, these columns hold a single string containing two values, one for each fuel.
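
One possible way to handle such rows is to split each dual-fuel row into two rows, one per fuel, and then convert the columns to numbers. The sketch below assumes the two values are separated by a `/` (for example `'13/18'`). That separator, the toy values, and the choice to duplicate rows are assumptions for illustration, not something the documentation quoted above specifies.

```python
import pandas as pd

# Toy frame (assumption: "/" separates the two per-fuel values)
toy = pd.DataFrame({
    'model': ['Sedan X', 'Truck Y'],
    'fuel': ['Gasoline', 'ethanol/gas'],
    'city_mpg': ['21', '13/18'],
})
split_cols = ['fuel', 'city_mpg']

# Rows that describe two fuels
dual = toy[toy['fuel'].str.contains('/')]

# Build one copy per fuel, keeping the first or second value in each column
first = dual.copy()
second = dual.copy()
for col in split_cols:
    parts = dual[col].str.split('/')
    first[col] = parts.str[0]
    second[col] = parts.str[1]

# Replace the original dual-fuel rows with the two single-fuel rows
single = toy.drop(dual.index)
result = pd.concat([single, first, second], ignore_index=True)
result['city_mpg'] = result['city_mpg'].astype(int)

print(result)
```

Splitting into separate rows keeps every column single-valued, at the cost of listing a dual-fuel vehicle twice.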