## Filter, drop nulls and dedupe
1. **Filter** For consistency, compare only cars certified to California standards. Filter both datasets using `query` to select only the rows where `cert_region` is `CA`. Then drop the `cert_region` column from each dataset, since it no longer provides any useful information (every remaining value is 'CA').
2. **Drop Nulls** Drop any rows in both datasets that contain missing values.
3. **Dedupe** Drop any duplicate rows in both datasets.
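
The three steps above can be sketched as a single function. This is only an illustration: the helper name `filter_drop_dedupe` and the toy data are made up here, and the notebook itself performs each step in its own cell below.

```python
import pandas as pd

def filter_drop_dedupe(df, region="CA"):
    """Hypothetical helper: keep one cert region, then drop nulls and duplicates."""
    out = df.query("cert_region == @region")   # 1. filter by certification region
    out = out.drop(columns=["cert_region"])    #    the column is now constant, so drop it
    out = out.dropna()                         # 2. drop rows with any missing value
    out = out.drop_duplicates()                # 3. drop exact duplicate rows
    return out.reset_index(drop=True)

# Toy data, invented for illustration (not taken from the real datasets)
toy = pd.DataFrame({
    "model": ["A", "A", "B", "C", "D"],
    "cert_region": ["CA", "CA", "FA", "CA", "CA"],
    "cyl": [4.0, 4.0, 6.0, None, 8.0],
})
cleaned = filter_drop_dedupe(toy)
print(cleaned)
```

On the toy frame, row B is removed by the filter, row C by the null drop, and the second copy of row A by the dedupe, leaving models A and D.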

In [46]:
import pandas as pd
df_08 = pd.read_csv('data_08_for_03.csv')
df_18 = pd.read_csv('data_18_for_03.csv')

In [47]:
# view dimensions of the 2008 dataset
df_08.shape

(2404, 14)

In [48]:
# view dimensions of the 2018 dataset
df_18.shape

(1611, 14)

### Filter By Certification Region
We only want to look at cars that meet California certification standards.

In [49]:
# filter datasets for rows following California standards
df_08 = df_08.query('cert_region == "CA"')
df_18 = df_18.query('cert_region == "CA"')

In [50]:
# confirm only certification region is California in 2008 data
df_08['cert_region'].unique()

array(['CA'], dtype=object)

In [51]:
# confirm only certification region is California in 2018 data
df_18['cert_region'].unique()

array(['CA'], dtype=object)

In [52]:
# drop certification region columns from both datasets
df_08.drop(['cert_region'], axis=1, inplace=True)
df_18.drop(['cert_region'], axis=1, inplace=True)
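
Depending on the pandas version, dropping a column in place on a filtered frame may emit a `SettingWithCopyWarning`. A minimal sketch of an alternative, assuming a toy frame invented for illustration, is to reassign the result of `drop` instead of using `inplace=True`:

```python
import pandas as pd

# Toy data, invented for illustration
toy = pd.DataFrame({
    "model": ["A", "B", "C"],
    "cert_region": ["CA", "FA", "CA"],
})

ca_only = toy.query('cert_region == "CA"')
# drop() without inplace returns a new DataFrame; the original is untouched
ca_only = ca_only.drop(columns=["cert_region"])

print(ca_only.shape)  # (2, 1)
print(toy.shape)      # (3, 2)
```

Both forms give the same columns; the reassignment style simply avoids modifying an object that pandas may treat as a view of another frame.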

In [53]:
df_08.shape

(1084, 13)

In [54]:
df_18.shape

(798, 13)

### Drop Rows with Missing Values

In [55]:
# view missing value count for each feature in 2008
df_08.isnull().sum()

model                    0
displ                    0
cyl                     75
trans                   75
drive                   37
fuel                     0
veh_class                0
air_pollution_score      0
city_mpg                75
hwy_mpg                 75
cmb_mpg                 75
greenhouse_gas_score    75
smartway                 0
dtype: int64

In [56]:
# view missing value count for each feature in 2018
df_18.isnull().sum()

model                   0
displ                   1
cyl                     1
trans                   0
drive                   0
fuel                    0
veh_class               0
air_pollution_score     0
city_mpg                0
hwy_mpg                 0
cmb_mpg                 0
greenhouse_gas_score    0
smartway                0
dtype: int64

In [57]:
# drop rows with any null values in both datasets
df_08.dropna(inplace=True)
df_18.dropna(inplace=True)
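
Note that `isnull().sum()` counts missing values per column, which is not the same as the number of rows `dropna` removes, because one row can have nulls in several columns. A small sketch on invented toy data shows how to count affected rows directly:

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration
toy = pd.DataFrame({
    "cyl": [4.0, np.nan, 6.0, np.nan],
    "drive": ["2WD", None, None, "4WD"],
})

per_column = toy.isnull().sum()                   # nulls counted column by column
rows_with_null = toy.isnull().any(axis=1).sum()   # rows with at least one null
remaining = len(toy.dropna())                     # rows left after dropna()

print(per_column)
print(rows_with_null, remaining)  # 3 1
```

Here each column has two nulls, but three distinct rows are affected, so `dropna()` leaves one row.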

In [58]:
# check whether any column in the 2008 data has null values - should be False
df_08.isnull().sum().any()

False

In [59]:
# check whether any column in the 2018 data has null values - should be False
df_18.isnull().sum().any()

False

### Dedupe Data

In [60]:
# print number of duplicates in 2008 and 2018 datasets
print(sum(df_08.duplicated()))
print(sum(df_18.duplicated()))

23
3
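
By default `duplicated()` uses `keep='first'`, so it flags only the second and later copies of a row. The count printed above is therefore the number of rows that `drop_duplicates()` will remove. A sketch on invented toy data:

```python
import pandas as pd

# Toy data, invented for illustration: row "A" appears three times
toy = pd.DataFrame({
    "model": ["A", "A", "A", "B"],
    "cyl": [4, 4, 4, 6],
})

n_dupes = toy.duplicated().sum()                  # later copies only (keep='first')
n_all_copies = toy.duplicated(keep=False).sum()   # every row that has a twin
deduped = toy.drop_duplicates()

print(n_dupes, n_all_copies, len(deduped))  # 2 3 2
```

Passing `keep=False` is useful when you want to inspect all copies of a duplicated row before deciding to drop them.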


In [61]:
# drop duplicates in both datasets
df_08.drop_duplicates(inplace = True)
df_18.drop_duplicates(inplace = True)

In [62]:
# print number of duplicates again to confirm dedupe - should both be 0
print(sum(df_08.duplicated()))
print(sum(df_18.duplicated()))

0
0


In [63]:
# save new datasets for next section
df_08.to_csv('data_08_for_04.csv', index=False)
df_18.to_csv('data_18_for_04.csv', index=False)
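
Passing `index=False` keeps the row index out of the file. After filtering and dropping rows the index is no longer contiguous, and writing it would add an extra unnamed column when the file is read back. A sketch using an in-memory buffer and invented toy data instead of the real files:

```python
import io
import pandas as pd

# Toy data with a non-contiguous index, as left behind by filtering
toy = pd.DataFrame({"model": ["A", "B"], "cyl": [4, 6]}, index=[3, 7])

# Written without the index: columns survive the round trip unchanged
buf = io.StringIO()
toy.to_csv(buf, index=False)
buf.seek(0)
round_trip = pd.read_csv(buf)

# Written with the index (the default): an extra column appears on reload
with_index = pd.read_csv(io.StringIO(toy.to_csv()))

print(list(round_trip.columns))  # ['model', 'cyl']
print(list(with_index.columns))  # ['Unnamed: 0', 'model', 'cyl']
```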

In [64]:
df_08.shape

(986, 13)

In [65]:
df_18.shape

(794, 13)