# Assignment 5

Jane Programmer

## Description

This time, we're working with New York City gas consumption data. More information can be found at https://data.cityofnewyork.us/Environment/Natural-Gas-Consumption-by-ZIP-Code-2010/uedp-fegm/data .

As a data analyst working to improve energy efficiency in NYC, my first job is to ingest and explore this data.  That's what we'll do here.

## Data Ingestion

We'll bring the data into my local environment using a pandas function called `read_csv`.  This might take a while!

In [1]:
import pandas as pd
nyc_gas =  pd.read_csv("https://data.cityofnewyork.us/api/views/uedp-fegm/rows.csv?accessType=DOWNLOAD")

## Quick Data Preview

Sometimes the best way to get an idea of what data looks like is, well, to look at it.  Let's look at the first ten rows.

In [2]:
nyc_gas.head(10)

Unnamed: 0,Zip Code,Building type (service class,Consumption (therms),Consumption (GJ),Utility/Data Source
0,10300,Commercial,470.0,50.0,National Grid
1,10335,Commercial,647.0,68.0,National Grid
2,10360,Large residential,33762.0,3562.0,National Grid
3,11200,Commercial,32125.0,3389.0,National Grid
4,11200,Institutional,3605.0,380.0,National Grid
5,11200,Small residential,3960.0,418.0,National Grid
6,11254,Small residential,1896.0,200.0,National Grid
7,11274,Commercial,8364.0,882.0,National Grid
8,11279,Commercial,2579.0,272.0,National Grid
9,11279,Large residential,301.0,32.0,National Grid


So, it seems that we have five columns of data, which represent:

* ZIP code (postal code)
* Building type
* Consumption in therms (a therm is a unit of heat energy equal to 100,000 BTU)
* Consumption in GJ (gigajoules, another unit of energy; one therm is roughly 0.1055 GJ)
* Utility that reported this data
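
The two consumption columns look like the same quantity expressed in different units. Here is a quick sanity check, assuming the approximate factor of 0.1055 GJ per therm (the constant name and the hand-copied rows are only for illustration):

```python
# Approximate conversion factor (assumption): 1 therm is about 0.1055 GJ.
GJ_PER_THERM = 0.1055

# (therms, reported GJ) pairs copied by hand from the first rows of the preview.
preview_rows = [(470.0, 50.0), (647.0, 68.0), (33762.0, 3562.0),
                (32125.0, 3389.0), (3605.0, 380.0)]

# Convert therms to GJ and round to whole numbers, as the dataset appears to do.
converted = [round(therms * GJ_PER_THERM) for therms, _ in preview_rows]
reported = [int(gj) for _, gj in preview_rows]

print(converted)
print(reported)
```

If the two printed lists agree, the GJ column carries no information beyond the therms column, so either one could be used for analysis.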


## Data Size and Labels

Let's get additional basic details about our data: how many rows and columns we have, and the label names for our columns.

In [4]:
nyc_gas.shape  # a tuple of (number of rows, number of columns)

(1015, 5)

In [5]:
list(nyc_gas.columns)
# We can use list combined with 'columns' to make a nicer list appearance than just using 'columns' alone.

['Zip Code',
 'Building type (service class',
 ' Consumption (therms) ',
 ' Consumption (GJ) ',
 'Utility/Data Source']

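We have 1015 rows.  As we already established, we have five columns.  When looking at the column labels, we notice a few things that might affect how easy it is to use them in code:

* Both consumption labels have leading and trailing spaces, which are easy to miss when typing a label
* The labels contain spaces, mixed capitalization, and punctuation such as parentheses and a slash
* The building type label is missing its closing parenthesis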

I will find it useful to rename the columns of my DataFrame.  I prefer to use labels that are all lowercase and have no punctuation save the underscore (_).

In [6]:
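new_labels = {'Zip Code' : "zip",
              'Building type (service class' : "building_type",
              ' Consumption (therms) ' : "consumption_therms",
              ' Consumption (GJ) ' : "consumption_gj",
              'Utility/Data Source' : "utility_reporter"}
# Each key is an old label and each value is its replacement.
nyc_gas = nyc_gas.rename(columns=new_labels)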
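
A hand-written dictionary works well for five columns. For wider tables, the labels could instead be normalized programmatically. The sketch below assumes a hypothetical helper, `clean_label`, that is not part of pandas; note that it produces longer names (such as `zip_code`) than the hand-picked ones used in this notebook.

```python
import re

import pandas as pd


def clean_label(label):
    """Lowercase a label and collapse runs of non-alphanumerics into '_'."""
    label = label.strip().lower()
    label = re.sub(r"[^a-z0-9]+", "_", label)
    return label.strip("_")


# Empty frame using the original labels, just to demonstrate the renaming.
demo = pd.DataFrame(columns=['Zip Code',
                             'Building type (service class',
                             ' Consumption (therms) ',
                             ' Consumption (GJ) ',
                             'Utility/Data Source'])

# rename accepts a function and applies it to every column label.
demo = demo.rename(columns=clean_label)
print(list(demo.columns))
```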

Let's peek at our renamed DataFrame to see whether the new column names are easier to work with: friendly to code, but still human-readable.

In [7]:
nyc_gas.head()

Unnamed: 0,zip,building_type,consumption_therms,consumption_gj,utility_reporter
0,10300,Commercial,470.0,50.0,National Grid
1,10335,Commercial,647.0,68.0,National Grid
2,10360,Large residential,33762.0,3562.0,National Grid
3,11200,Commercial,32125.0,3389.0,National Grid
4,11200,Institutional,3605.0,380.0,National Grid


Now we want to do some actual data analysis to explore the data!  We'll start with building types.

## Building Types

We have a few questions we want to answer here, including:

* How many distinct building types are included?  
* What is the median energy consumption for each type?  
* How do building types compare?

We'll start by looking at the number of unique building types:

In [8]:
nyc_gas['building_type'].unique()

array(['Commercial', 'Large residential', 'Institutional',
       'Small residential', 'Industrial', 'Residential',
       'Large Residential'], dtype=object)

I see that there are seven kinds of buildings, but two of them seem to be the same thing, just written differently, with different capitalization.  I want to combine "Large residential" and "Large Residential" into one group!  Also, it's unclear what "Residential" is -- is it small? Large?  I'll leave just plain "Residential" on its own until we get more information.
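
One possible way to merge the two spellings is to rewrite one of them before doing any grouping. This is a minimal sketch on a stand-alone Series that copies the seven labels shown above; on the real data the same `replace` call would be applied to the `building_type` column.

```python
import pandas as pd

# The seven labels reported by unique(), copied by hand.
building_types = pd.Series(["Commercial", "Large residential", "Institutional",
                            "Small residential", "Industrial", "Residential",
                            "Large Residential"], name="building_type")

# Map the capitalized variant onto the more common lowercase spelling.
normalized = building_types.replace({"Large Residential": "Large residential"})

print(normalized.nunique())
```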

There are many ways to accomplish what I want to do here.  One way is to filter the DataFrame so that I get several smaller DataFrames, one for each type of building.  That's what I'll do here, doing a quick peek to make sure I have the right data.  Then I can find median values on the columns, in my case the `consumption_gj` column!

In [9]:
commercial_df = nyc_gas[nyc_gas["building_type"] == "Commercial"]  # keep only the commercial rows
commercial_df.head()

Unnamed: 0,zip,building_type,consumption_therms,consumption_gj,utility_reporter
0,10300,Commercial,470.0,50.0,National Grid
1,10335,Commercial,647.0,68.0,National Grid
3,11200,Commercial,32125.0,3389.0,National Grid
7,11274,Commercial,8364.0,882.0,National Grid
8,11279,Commercial,2579.0,272.0,National Grid


In [10]:
commercial_df["consumption_gj"].median()

189413.0

I'll do the same for the other building types, and make sure to include both "large" types when I create that data frame!

In [11]:
institutional_df = nyc_gas[nyc_gas["building_type"] == "Institutional"]
institutional_df.head()

Unnamed: 0,zip,building_type,consumption_therms,consumption_gj,utility_reporter
4,11200,Institutional,3605.0,380.0,National Grid
15,11315,Institutional,339.0,36.0,National Grid
27,11400,Institutional,93140.0,9827.0,National Grid
32,11438,Institutional,1770.0,187.0,National Grid
36,11468,Institutional,49184.0,5189.0,National Grid


In [12]:
institutional_df["consumption_gj"].median()

66027.0

In [13]:
small_residential_df = nyc_gas[nyc_gas["building_type"] == "Small residential"]
small_residential_df.head()

Unnamed: 0,zip,building_type,consumption_therms,consumption_gj,utility_reporter
5,11200,Small residential,3960.0,418.0,National Grid
6,11254,Small residential,1896.0,200.0,National Grid
11,11303,Small residential,3009.0,317.0,National Grid
12,11313,Small residential,3488.0,368.0,National Grid
13,11314,Small residential,6011.0,634.0,National Grid


In [14]:
small_residential_df["consumption_gj"].median()

599600.0

In [15]:
industrial_df = nyc_gas[nyc_gas["building_type"] == "Industrial"]
industrial_df.head()

Unnamed: 0,zip,building_type,consumption_therms,consumption_gj,utility_reporter
26,11400,Industrial,275.0,29.0,National Grid
76,"11385(40.70122489161548, -73.88334858436963)",Industrial,1048061.0,110576.0,National Grid
77,"11201(40.69467825879468, -73.98992086835335)",Industrial,3887016.0,410102.0,National Grid
122,"11413(40.66971118122609, -73.75087393328533)",Industrial,387601.0,40894.0,National Grid
161,"11412(40.69812711747255, -73.75923519834566)",Industrial,164379.0,17343.0,National Grid


In [16]:
industrial_df["consumption_gj"].median()

16867.5

In [17]:
residential_df = nyc_gas[nyc_gas["building_type"] == "Residential"]
residential_df.head()

Unnamed: 0,zip,building_type,consumption_therms,consumption_gj,utility_reporter
51,"10007(40.71363051943297, -74.00913138370635)",Residential,12845.0,1355.0,ConEd
52,"10002(40.71612146793143, -73.98583147024613)",Residential,550055.0,58034.0,ConEd
53,"10012(40.72553802086304, -73.99789641059084)",Residential,178497.0,18832.0,ConEd
54,"10003(40.73194394755518, -73.98887214913032)",Residential,260502.0,27484.0,ConEd
55,"10001(40.75025902143676, -73.99688630375988)",Residential,58338.0,6155.0,ConEd


In [18]:
residential_df["consumption_gj"].median()

53288.5

In [19]:
large_residential_df = nyc_gas[nyc_gas["building_type"].isin(["Large Residential", "Large residential"])]  # both spellings
large_residential_df.head(10)

Unnamed: 0,zip,building_type,consumption_therms,consumption_gj,utility_reporter
2,10360,Large residential,33762.0,3562.0,National Grid
9,11279,Large residential,301.0,32.0,National Grid
16,11315,Large residential,335091.0,35354.0,National Grid
20,11335,Large residential,,,National Grid
28,11400,Large residential,280.0,30.0,National Grid
43,11474,Large residential,2223970.0,234641.0,National Grid
46,11477,Large residential,493904.0,52110.0,National Grid
62,"10075(40.77293949176419, -73.95609016263086)",Large Residential,73358.0,7740.0,ConEd
68,"10459(40.82552621616202, -73.89313106784448)",Large Residential,62003.0,6542.0,ConEd
70,"10451(40.82069640711353, -73.92384136798472)",Large Residential,3433811.0,362286.0,ConEd


In [20]:
large_residential_df["consumption_gj"].median()

160960.0

It looks as though the "industrial" building type has the lowest median consumption, which is surprising!  We should understand more about building classifications before we proceed much further in our energy consumption analysis.  The highest median consumption belongs to "small residential" building types.
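
The per-type medians above could also be computed in one pass with `groupby`, rather than one DataFrame per building type. The sketch below is an alternative to the filtering approach, not a replacement for it, and it runs on a small hand-made sample (a few preview rows) rather than the full dataset, so its numbers will not match the medians reported above.

```python
import pandas as pd

# A handful of rows copied from the previews; None marks a missing reading.
sample = pd.DataFrame({
    "building_type": ["Commercial", "Commercial", "Commercial",
                      "Large residential", "Large residential",
                      "Large Residential", "Institutional"],
    "consumption_gj": [50.0, 68.0, 3389.0, 3562.0, None, 7740.0, 380.0],
})

# str.capitalize() lowercases everything after the first character,
# so both "Large ..." spellings fall into the same group.
group_key = sample["building_type"].str.capitalize()

# median() skips missing values by default.
medians = sample.groupby(group_key)["consumption_gj"].median().sort_values()
print(medians)
```

Sorting the result makes it easy to read off which type has the lowest and highest median.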

## Utility Reporters

Again, we have a few questions to answer here:

* How many utility data reporters are included?
* What's the mean and standard deviation of their energy consumption (in GJ)?
* How do the different utility types compare?

I'm going to use a method similar to what I did with building types!

In [21]:
nyc_gas['utility_reporter'].unique()

array(['National Grid', 'ConEd'], dtype=object)

Great, only two possibilities here!  I'll make two data frames, as always, peeking in a bit to make sure what I'm doing makes sense.

In [22]:
national_grid = nyc_gas[nyc_gas['utility_reporter'] == "National Grid"]
national_grid.head()

Unnamed: 0,zip,building_type,consumption_therms,consumption_gj,utility_reporter
0,10300,Commercial,470.0,50.0,National Grid
1,10335,Commercial,647.0,68.0,National Grid
2,10360,Large residential,33762.0,3562.0,National Grid
3,11200,Commercial,32125.0,3389.0,National Grid
4,11200,Institutional,3605.0,380.0,National Grid


In [23]:
national_grid['consumption_gj'].mean()

357475.56048387097

In [24]:
national_grid['consumption_gj'].std()

562355.2736235437

In [25]:
coned = nyc_gas[nyc_gas['utility_reporter'] == "ConEd"]
coned.head()

Unnamed: 0,zip,building_type,consumption_therms,consumption_gj,utility_reporter
51,"10007(40.71363051943297, -74.00913138370635)",Residential,12845.0,1355.0,ConEd
52,"10002(40.71612146793143, -73.98583147024613)",Residential,550055.0,58034.0,ConEd
53,"10012(40.72553802086304, -73.99789641059084)",Residential,178497.0,18832.0,ConEd
54,"10003(40.73194394755518, -73.98887214913032)",Residential,260502.0,27484.0,ConEd
55,"10001(40.75025902143676, -73.99688630375988)",Residential,58338.0,6155.0,ConEd


In [26]:
coned['consumption_gj'].mean()

224575.75049115912

In [27]:
coned['consumption_gj'].std()

298958.0488076621
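
To put the two utilities side by side, the mean and standard deviation could also be computed together with `groupby` and `agg`. This is a sketch on a few hand-copied preview rows, so the figures differ from the full-data results above. Keep in mind that pandas `std` computes the sample standard deviation (it divides by n - 1) by default.

```python
import pandas as pd

# A few rows copied from the previews, for illustration only.
sample = pd.DataFrame({
    "utility_reporter": ["National Grid", "National Grid", "National Grid",
                         "ConEd", "ConEd"],
    "consumption_gj": [50.0, 68.0, 3562.0, 1355.0, 58034.0],
})

# One row per utility, one column per statistic.
summary = sample.groupby("utility_reporter")["consumption_gj"].agg(["mean", "std"])
print(summary)
```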

## Suggestions for Improvement

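I noticed that the zip codes are listed a bit strangely.  Some rows hold a plain five-digit code (for example, 11200), while others append what look like latitude and longitude coordinates in parentheses (for example, "11385(40.70122489161548, -73.88334858436963)").  I would suggest storing the ZIP code by itself and moving the coordinates into their own columns.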
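
One way the mixed ZIP values seen in the previews could be split apart is with a regular expression. This is only a sketch: the pattern assumes every value is a five-digit code, optionally followed by a parenthesized pair of numbers, and the column names `latitude` and `longitude` are my guess at what those numbers mean.

```python
import pandas as pd

# Three values copied from the previews: one plain, two with coordinates.
raw_zip = pd.Series(["11400",
                     "11385(40.70122489161548, -73.88334858436963)",
                     "10007(40.71363051943297, -74.00913138370635)"])

# Five digits, then an optional "(number, number)" suffix.
pattern = (r"^(?P<zip>\d{5})"
           r"(?:\((?P<latitude>-?[\d.]+),\s*(?P<longitude>-?[\d.]+)\))?$")

# Named groups become column names; unmatched optional groups become NaN.
parts = raw_zip.astype(str).str.extract(pattern)
parts["latitude"] = pd.to_numeric(parts["latitude"])
parts["longitude"] = pd.to_numeric(parts["longitude"])
print(parts)
```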

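I also noticed that the building types are capitalized inconsistently ("Large residential" versus "Large Residential"), that the plain "Residential" category is ambiguous, and that some rows are missing consumption values (row 20, for example).  I would want these cleaned up or documented before working with this data in the future.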
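
Missing values are another thing worth checking in data like this, since row 20 in the large residential preview has empty consumption cells. Here is a minimal sketch for counting and locating such gaps, again on a small hand-made sample rather than the full dataset:

```python
import pandas as pd

# Three large residential rows from the preview; None marks the empty cells.
sample = pd.DataFrame({
    "zip": ["11315", "11335", "11400"],
    "building_type": ["Large residential"] * 3,
    "consumption_therms": [335091.0, None, 280.0],
    "consumption_gj": [35354.0, None, 30.0],
})

# Number of missing values in each column.
missing_per_column = sample.isna().sum()

# Rows that have at least one missing value.
rows_with_gaps = sample[sample.isna().any(axis=1)]

print(missing_per_column)
print(rows_with_gaps)
```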