In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

<h1> Input Data Description, Data Cleaning, Data Summary & EDA: California EPA State Agency CO2 </h1>

<h1> Input Data Description </h1>

This dataset comes from the California Environmental Protection Agency (California EPA). It records CO2-equivalent (CO2e) emissions from California state government sources. State agency reporters enter the data manually into the web-based Climate Registry Information System (CRIS) tool.

The data and its data dictionary can be downloaded [here](https://data.ca.gov/dataset/calepa-state-agency-co2).




In [27]:
epa_data = pd.read_csv('../../data/state-agency-co2e-data-2010-2014.csv')
epa_data.head()

Unnamed: 0,Organization Name,Emission Year,Facility Name,Source Name,Activity Type,Fuel Type,Fuel,End Use Sector,Technology,CO2e
0,California Department of General Services,2013,000 Fleet Vehicles,,SEM: Mobile Combustion - Scope 1,N\A,N\A,N\A,N\A,563.025594
1,California Department of General Services,2010,Lease#2243001,,SEM: Purchased Electricity - Scope 2,N\A,N\A,N\A,N\A,848.793031
2,California Department of General Services,2010,Lease#5098002,,SEM: Purchased Electricity - Scope 2,N\A,N\A,N\A,N\A,12.768303
3,California Department of General Services,2010,Lease#5107001,,SEM: Purchased Electricity - Scope 2,N\A,N\A,N\A,N\A,4.683929
4,California Department of General Services,2010,Lease#5368001,,SEM: Purchased Electricity - Scope 2,N\A,N\A,N\A,N\A,0.265898


<h1> EDA: Structure, Granularity, Scope, Temporality and Faithfulness </h1>

# Structure

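The data is published on the California EPA official government website as a CSV file, which makes it easy to load into a pandas DataFrame. It is tabular: each row is one emissions record, and the columns shown above identify the organization, emission year, facility, source, activity type, fuel, end use sector, and technology, along with the CO2e value.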
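As a quick way to back up the structure description, the sketch below inspects the shape and column types of a DataFrame. It is only an illustration: `sample` is a hypothetical stand-in built from three of the rows shown in the `head()` output (with a subset of the columns), so that the snippet runs without the CSV file. On the real data the same calls would be made on `epa_data`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for epa_data: three rows copied from the head() output,
# restricted to a subset of the columns.
sample = pd.DataFrame({
    "Organization Name": ["California Department of General Services"] * 3,
    "Emission Year": [2013, 2010, 2010],
    "Facility Name": ["000 Fleet Vehicles", "Lease#2243001", "Lease#5098002"],
    "Source Name": [np.nan, np.nan, np.nan],
    "Activity Type": [
        "SEM: Mobile Combustion - Scope 1",
        "SEM: Purchased Electricity - Scope 2",
        "SEM: Purchased Electricity - Scope 2",
    ],
    "CO2e": [563.025594, 848.793031, 12.768303],
})

# Number of records and number of fields
n_rows, n_cols = sample.shape

# Data type that pandas inferred for each column
column_types = sample.dtypes

print(n_rows, n_cols)
print(column_types)
```

Checking `dtypes` is useful here because a numeric field such as `CO2e` that was read as text would hint at stray non-numeric entries in the file.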

# Granularity

Each record in the DataFrame appears to represent the CO2e emissions that one government agency reported for one facility or source in one emission year. The records include, for example, the emissions of a government building, a government-owned vehicle, and even an entire fleet of government-owned vehicles. Because no record is a summary row, we believe that all the records sit at the same level of granularity. Because agencies report their individual facilities, the records can also be aggregated up to the agency level.
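The two claims above (no repeated or summary rows, and the ability to roll records up to the agency level) can be checked with a sketch like the following. The choice of key columns is an assumption made for illustration, not something the data dictionary confirms, and the full data may also need `Source Name` in the key. `sample` is a hypothetical stand-in built from rows in the `head()` output.

```python
import pandas as pd

# Hypothetical stand-in for epa_data: four rows copied from the head() output.
sample = pd.DataFrame({
    "Organization Name": ["California Department of General Services"] * 4,
    "Emission Year": [2013, 2010, 2010, 2010],
    "Facility Name": [
        "000 Fleet Vehicles", "Lease#2243001", "Lease#5098002", "Lease#5107001",
    ],
    "Activity Type": [
        "SEM: Mobile Combustion - Scope 1",
        "SEM: Purchased Electricity - Scope 2",
        "SEM: Purchased Electricity - Scope 2",
        "SEM: Purchased Electricity - Scope 2",
    ],
    "CO2e": [563.025594, 848.793031, 12.768303, 4.683929],
})

# Assumed candidate key: one record per agency, year, facility and activity type.
key = ["Organization Name", "Emission Year", "Facility Name", "Activity Type"]

# Rows that repeat an earlier key would suggest mixed granularity or double entry.
n_duplicates = int(sample.duplicated(subset=key).sum())

# Roll the facility-level records up to one total per agency and year.
agency_totals = sample.groupby(
    ["Organization Name", "Emission Year"], as_index=False
)["CO2e"].sum()

print(n_duplicates)
print(agency_totals)
```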

# Scope

The data covers the CO2-equivalent emissions reported by various California government agencies, as published by the California EPA.

# Temporality

The data covers five reporting years, 2010 through 2014. More specifically, the data ranges from:

2010-01-01T05:00:00+00:00 THROUGH 2014-12-31T05:00:00+00:00
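The reporting period can be cross-checked against the `Emission Year` column. The sketch below shows one way to do that. It uses a hypothetical `emission_year` Series holding only the five year values visible in the `head()` output, so its result reflects that small sample and not the full file.

```python
import pandas as pd

# Hypothetical stand-in for epa_data['Emission Year']: the five values
# visible in the head() output.
emission_year = pd.Series([2013, 2010, 2010, 2010, 2010], name="Emission Year")

# Earliest and latest reporting year present
first_year = int(emission_year.min())
last_year = int(emission_year.max())

# Number of records per reporting year, in chronological order
rows_per_year = emission_year.value_counts().sort_index()

print(first_year, last_year)
print(rows_per_year)
```

Looking at the per-year counts, and not only the minimum and maximum, also shows whether some years are much more sparsely reported than others.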

# Faithfulness

The data originates from the California Environmental Protection Agency, which is a reliable source. However, a substantial number of values are NaN; for example, 'Source Name' is NaN in all five rows shown above. In addition, because the data is entered manually, it is worth checking for input errors.
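One way to quantify the missing values is sketched below. Besides true NaN values, the `head()` output also shows the literal text `N\A` in columns such as `Fuel Type`; the sketch counts both and treats `N\A` as a missing-value placeholder, which is an assumption about what the reporters meant. `sample` is a hypothetical stand-in built from three of the rows shown earlier.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for epa_data: three rows copied from the head() output,
# restricted to three columns.
sample = pd.DataFrame({
    "Source Name": [np.nan, np.nan, np.nan],
    "Fuel Type": [r"N\A", r"N\A", r"N\A"],
    "CO2e": [563.025594, 848.793031, 12.768303],
})

# True missing values per column
missing_per_column = sample.isna().sum()

# Cells holding the literal placeholder text N\A (assumed to mean "not applicable")
is_placeholder = sample.isin([r"N\A"])
placeholder_per_column = is_placeholder.sum()

# Turn the placeholder into NaN so both kinds of gaps are counted the same way
cleaned = sample.mask(is_placeholder)

print(missing_per_column)
print(placeholder_per_column)
print(cleaned.isna().sum())
```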

# Data Cleaning and Visualizations

Some of the entries in 'Source Name', particularly those containing the word 'Electricity', are spelled inconsistently (for example 'Purchased Electricty' and 'Puchased Electricity'), so we correct them.
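A minimal sketch of one way to make this correction is shown here. It maps the misspellings that appear in the `unique()` output to a single spelling with `Series.replace`. The `spelling_fixes` mapping and the `source_name` Series are illustrative names, and the Series is a hypothetical stand-in holding a handful of values taken from that output.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for epa_data['Source Name'], using values that
# appear in the unique() output.
source_name = pd.Series(
    [
        np.nan,
        "Electricity",
        "Purchased Electricty",
        "Purchased Electricity",
        "Puchased Electricity",
        "Natural Gas",
    ],
    name="Source Name",
)

# Map each misspelling to the intended spelling.
spelling_fixes = {
    "Purchased Electricty": "Purchased Electricity",
    "Puchased Electricity": "Purchased Electricity",
}

# Exact-match replacement: values not listed in the mapping are left unchanged.
cleaned = source_name.replace(spelling_fixes)

print(cleaned.value_counts(dropna=False))
```

An explicit mapping only touches the values listed in it. Whether labels such as 'Electricity' and 'Purchased Electricity' should also be merged is a separate judgment call that this sketch leaves alone.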

In [31]:
# List the distinct 'Source Name' values to spot inconsistent spellings
epa_data['Source Name'].unique()

array([nan, 'Electricity', 'Natural Gas', 'Van', 'Honda', 'Gasoline',
       'Propane Gas', 'E-85', 'CNG', 'Unleaded Gas', 'Diesel', 'Gas',
       'Purchased Electricty', 'Purchased Electricity',
       'Gas purchase Therms', 'Alternate Fuel Vehicles',
       'Diesel Fuel Vehicles', 'Passenger vehicles and trucks',
       'Puchased Electricity',
       'E-85 biogenic emissions (Light Duty Vehicles PC & Van)',
       'CNG (Medium and Heavy Duty Vehicles)',
       'E-85 anthropogenic emissions (Gasoline Light Trucks, Vans, Pickup Trucks, SUVs)',
       'Dual Propane/Unleaded (Forklifts)',
       'Gasoline (Medium and Heavy Duty Trucks)',
       'Gasoline (Passenger Cars)',
       'Diesel Vehicles (Medium and Heavy Duty Vans, Pickup Trucks, Trucks)',
       'Methanol Passenger Cars', 'Diesel Vehicles (Tractors)',
       'Gasoline Light Trucks (Vans, Pickup Trucks, SUVs)',
       'Purchased electricity (owned and leased)',
       'Purchased natural gas (leased)', 'Purchased natural gas (ow