# DATA CLEANING: METADATA AND IRRIGATION
## Urban Data Genome Project
This notebook contains the exploratory analysis and cleaning of the metadata and irrigation dataframes from the Urban Data Genome Project.

## Read in the Data

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# List every file under the read-only /kaggle/input directory
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
os.chdir('/kaggle/input/buildingdatagenomeproject2')
os.listdir()

# EDA: Meta

### Data-Type Analysis
#### Completed:
- Look at difference between eui, source_eui, site_eui
- Look at difference between building_id and site_id vs building_id_kaggle and site_id_kaggle
- Convert water, irrigation, solar, gas, electricity, hotwater, and chilledwater to binary classifications
- Label-encode energystarscore
- Label-encode rating
- Label-encode leed level
- Convert date_opened to datetime

#### Questions:
- Why are some of the Kaggle IDs NaN? A: These sites were not used in the Kaggle competition.
- What to do about building_id and site_id vs. their respective Kaggle versions? A: Drop the Kaggle IDs.


In [None]:
meta = pd.read_csv('metadata.csv')
print(meta.shape)
meta.head()

In [None]:
meta.info()

Compare eui, site_eui, and source_eui by inspecting the values in each column side by side. The exploration is done on a copy of the dataframe named `test`, so that the original `meta` dataframe is not altered.

In [None]:
test = meta.copy()
test[['eui', 'site_eui', 'source_eui']].head(20)

Replace '-' with NaN, because null values were recorded inconsistently in these features, and strip the thousands separators from eui so that all three columns can be cast to float.

In [None]:
test['site_eui'] = test['site_eui'].replace('-', np.nan).astype('float64')
test['source_eui'] = test['source_eui'].replace('-', np.nan).astype('float64')
test['eui'] = test['eui'].str.replace(',', '').replace('-', np.nan).astype('float64')
test['source_eui'].unique()
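
A possible alternative to chaining `replace` and `astype` is `pd.to_numeric` with `errors='coerce'`, which turns any non-numeric placeholder into NaN in one step. This is a minimal sketch on made-up strings (the `raw` series is hypothetical, not data from metadata.csv), and it assumes that every unparseable entry should be treated as missing.

```python
import pandas as pd

# Hypothetical sample mimicking the raw EUI strings: '-' marks a missing
# value and ',' is a thousands separator.
raw = pd.Series(['1,234.5', '-', '87.2', None])

# Strip the separators, then coerce anything non-numeric to NaN.
cleaned = pd.to_numeric(raw.str.replace(',', '', regex=False), errors='coerce')
print(cleaned.tolist())
```

Unlike an explicit `replace('-', np.nan)`, coercion also hides unexpected placeholders, so it is worth checking the null count before and after.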

Analyze the difference between the Kaggle IDs and the unique name identifiers.

In [None]:
meta[['building_id', 'building_id_kaggle', 'site_id', 'site_id_kaggle']].tail(20)

Turn the categorical utility columns into numeric ones by encoding the presence (1) or absence (0) of each of the following utilities: electricity, hot water, chilled water, water, steam, irrigation, solar, and gas. These utilities correspond to the other datasets found in this project.

In [None]:
def binary(df, cols):
    for col in cols:
        df[col] = df[col].replace(np.nan, 0)
        df[col] = df[col].replace("Yes", 1)
    return df

In [None]:
bin_cols = ['electricity', 'hotwater', 'chilledwater', 'water', 'steam', 'irrigation', 'solar', 'gas']
# Apply to `test` rather than `meta` so the EUI cleaning above is kept and `meta` stays unaltered
test = binary(test, bin_cols)

In [None]:
test[bin_cols].nunique()
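
The same presence/absence encoding can be sketched without modifying the input dataframe. The helper name `to_binary` and the sample values are hypothetical, and the sketch assumes that any non-null entry means the utility is present, which is slightly broader than mapping only "Yes" to 1.

```python
import numpy as np
import pandas as pd

def to_binary(df, cols):
    """Return a copy of df where each column in cols is 1 if non-null, else 0."""
    out = df.copy()
    for col in cols:
        out[col] = out[col].notna().astype('int64')
    return out

# Hypothetical two-utility sample: 'Yes' marks a meter, NaN marks none.
sample = pd.DataFrame({'water': ['Yes', np.nan, 'Yes'],
                       'gas': [np.nan, np.nan, 'Yes']})
flags = to_binary(sample, ['water', 'gas'])
print(flags)
```

Because the function works on a copy, `sample` still holds the original strings afterwards.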

Exploratory analysis of the three different types of ratings in this dataset: energystarscore, rating, and leed. 

In [None]:
test['energystarscore'] = meta['energystarscore'].replace('-', np.nan).astype('float64')
test['energystarscore'].unique()

In [None]:
meta['rating'].unique()

In [None]:
meta['leed_level'].unique()

Convert the date_opened column to a datetime variable.

In [None]:
test['date_opened'] = test['date_opened'].astype('datetime64[ns]')
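
If some date strings turn out to be malformed, `astype('datetime64[ns]')` raises an error. A hedged alternative is `pd.to_datetime` with `errors='coerce'`, sketched here on hypothetical strings rather than the real date_opened values.

```python
import pandas as pd

# Hypothetical date strings; the last one is deliberately malformed.
dates = pd.Series(['2015-03-01', '1998-11-20', 'unknown'])

# errors='coerce' turns unparseable entries into NaT instead of raising.
parsed = pd.to_datetime(dates, errors='coerce')
print(parsed.isna().tolist())
```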

In [None]:
test.dtypes

In [None]:
test.info()

### Missing/Null Data Analysis

### Notes
- Performed one-hot encoding on heatingtype and industry
- Dropped date_opened, site_eui (energy use intensity of site), and source_eui (total primary energy use intensity by area)

### Questions
- What is the missing latitude and longitude information? A: These locations reported their data as 'anonymous' sources.
- Consider predicting various score metrics, using available data as the training and validation sets? A: Yes!
- What is the difference between primaryspaceusage and industry? A: These columns can be merged to fill null values.

In [None]:
import missingno as msno
msno.matrix(test);

Drop the columns that have a very large proportion of null values and won't contribute much to further analysis/prediction.

In [None]:
test = test.drop(['date_opened', 'site_eui', 'source_eui'], axis=1)
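
The missingno matrix gives a visual impression of sparsity; the share of nulls per column can also be computed directly. This is a small sketch on a hypothetical dataframe, and the 0.5 cutoff is an illustrative assumption rather than the threshold used in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: 'sparse' is mostly empty, 'dense' is mostly filled.
df = pd.DataFrame({
    'dense': [1.0, 2.0, np.nan, 4.0],
    'sparse': [np.nan, np.nan, np.nan, 1.0],
})

# Fraction of missing values per column, highest first.
null_share = df.isnull().mean().sort_values(ascending=False)

# Illustrative cutoff (an assumption): flag columns that are more than half empty.
to_drop = null_share[null_share > 0.5].index.tolist()
print(null_share)
print(to_drop)
```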

One-hot encode the heating type to convert this variable into numeric columns for future use. Also rename the resulting columns to be clearer.

In [None]:
test['heatingtype'].unique()

In [None]:
heating = pd.get_dummies(test['heatingtype'], drop_first=True, dtype='int64')
heating.head()

In [None]:
heating = heating.rename(columns={'Electric': 'Electric Heating', 
                                  'Electicity': 'Electricity Heating',
                                  'Gas': 'Gas Heating', 
                                  'Oil': 'Oil Heating', 
                                  'Steam': 'Steam Heating'})

In [None]:
heating.head()
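
A note on `drop_first=True`: it removes the first category's column (categories are sorted alphabetically), so that category is represented by a row of all zeros. The sketch below uses hypothetical heating categories, not the dataset's actual values, to show that a missing value is also encoded as all zeros and therefore cannot be told apart from the dropped category.

```python
import numpy as np
import pandas as pd

# Hypothetical heating categories with one missing entry.
heat = pd.Series(['Gas', 'Electric', np.nan, 'Steam'], name='heatingtype')

full = pd.get_dummies(heat, dtype='int64')
reduced = pd.get_dummies(heat, drop_first=True, dtype='int64')

print(full.columns.tolist())
print(reduced.columns.tolist())
print(reduced)
```

If that distinction matters for later modelling, passing `dummy_na=True` to `pd.get_dummies` adds an explicit column for missing values.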

Analyze the differences between primary space usage and industry (and their respective sub-categories). These columns can be merged to fill in null values, as they are very similar and often repeat the same information.

In [None]:
primaryspaceusage = test['primaryspaceusage'].unique()
sub_primaryspaceusage = test['sub_primaryspaceusage'].unique()
industries = test['industry'].unique()
subindustries = test['subindustry'].unique()
print(primaryspaceusage)
print(industries)

In [None]:
industries = pd.DataFrame(test['industry'])
industries = industries.rename(columns={'industry': 'usage'})
subindustries = pd.DataFrame(test['subindustry'])
subindustries = subindustries.rename(columns={'subindustry': 'subusage'})
primaryspaceusage = pd.DataFrame(test['primaryspaceusage'])
primaryspaceusage = primaryspaceusage.rename(columns={'primaryspaceusage': 'usage'})
sub_primaryspaceusage = pd.DataFrame(test['sub_primaryspaceusage'])
sub_primaryspaceusage = sub_primaryspaceusage.rename(columns={'sub_primaryspaceusage': 'subusage'})
print(primaryspaceusage.isnull().sum())
print(sub_primaryspaceusage.isnull().sum())
print(industries.isnull().sum())
print(subindustries.isnull().sum())

In [None]:
combine_sub = subindustries.combine_first(sub_primaryspaceusage)
combine_sub.isnull().sum()
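
`combine_first` keeps the caller's values and fills only its nulls from the other dataframe, matching on index and column name (which is why both frames were renamed to share a column). A minimal sketch on hypothetical values:

```python
import numpy as np
import pandas as pd

# Hypothetical values; both frames share the column name 'usage' and the index.
industry = pd.DataFrame({'usage': ['Education', np.nan, np.nan]})
space = pd.DataFrame({'usage': ['Government', 'Office', np.nan]})

# Values from `industry` win; its gaps are filled from `space`.
merged = industry.combine_first(space)
print(merged['usage'].tolist())
```

Rows that are null in both sources stay null, so the remaining null count after merging shows how many buildings have no usage label at all.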

In [None]:
combine_sub.head(20)

In [None]:
combine_sub['subusage'].unique()

In [None]:
combine_primary = industries.combine_first(primaryspaceusage)
combine_primary.isnull().sum()

In [None]:
combine_primary.head(20)

In [None]:
(combine_primary['usage']=='Other').sum()

In [None]:
combine_primary['usage'].unique()

In [None]:
test = test.drop(['industry', 'subindustry', 'primaryspaceusage', 'sub_primaryspaceusage', 'heatingtype'], axis=1)
test = pd.concat([test, combine_primary, combine_sub, heating], axis=1)
test.head()

In [None]:
msno.matrix(test)

Drop further columns that cannot be filled in using imputation or interpolation and will not be useful in future analysis.

In [None]:
test = test.drop(['numberoffloors', 'occupants', 'energystarscore'], axis=1)

Analyze why latitude and longitude are missing: the four locations without lat/long data correspond to the sites that listed themselves as 'anonymous' when reporting their data.

In [None]:
# Keep only the rows with a missing latitude and list the buildings they belong to
latlong = test[test['lat'].isna()]
latlong['building_id'].unique()

In [None]:
msno.matrix(test)

In [None]:
test.info()

Creating a cleaned dataset with only the rows used in the kaggle competition.

In [None]:
kaggle = test[test['building_id_kaggle'].notna()]
kaggle = kaggle[kaggle['site_id_kaggle'].notna()]
msno.matrix(kaggle)
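
The two chained `notna()` filters can also be written as a single `dropna` call restricted to the ID columns. The sketch below checks on a hypothetical three-row frame that both forms select the same rows.

```python
import numpy as np
import pandas as pd

# Hypothetical rows; only the first has both competition IDs.
df = pd.DataFrame({
    'building_id': ['a', 'b', 'c'],
    'building_id_kaggle': [0.0, np.nan, 2.0],
    'site_id_kaggle': [0.0, 1.0, np.nan],
})

# Two-step filter, as in the cell above.
two_step = df[df['building_id_kaggle'].notna()]
two_step = two_step[two_step['site_id_kaggle'].notna()]

# Equivalent single call: drop rows where either ID is missing.
one_step = df.dropna(subset=['building_id_kaggle', 'site_id_kaggle'])
print(one_step.equals(two_step))
```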

Create a cleaned dataset that contains only the rows used in the kaggle competition that were not anonymous and therefore have lat/long data.

In [None]:
no_anonymous = kaggle[kaggle['lat'].notna()]
msno.matrix(no_anonymous)

In [None]:
#save as csv
test.to_csv('/kaggle/working/metadata_cleaned.csv', index=False)
kaggle.to_csv('/kaggle/working/metadata_kaggle_cleaned.csv', index=False)
no_anonymous.to_csv('/kaggle/working/metadata_kaggle_anonymous_cleaned.csv', index=False)

# EDA: Irrigation

### DataTypes
- converted timestamp from object to datetime

### Missing Data
- three types of imputation used:
    - slinear (suited to time-series data such as this dataset; does not fill data on the ends of the df)
    - ffill (used to fill in holes at the back of the df, propagating the last valid value forward)
    - bfill (used to fill in holes at the front of the df, propagating the first valid value backward)
- dropped one column because over half of its values were missing: Panther_lodging_Cora
- dropped all columns that had only nan and 0 (13 columns)
- dropped columns with significant missing data on either end (front or back) of the dataframe

In [None]:
irr = pd.read_csv('irrigation_cleaned.csv')
irr.info()

In [None]:
irr.isnull().sum()

In [None]:
clean = irr.copy()

In [None]:
clean.head(20)

Convert timestamp to datetime type

In [None]:
clean['timestamp'] = clean['timestamp'].astype('datetime64[ns]')

In [None]:
msno.matrix(clean)

In [None]:
clean.shape

Analyze distribution of irrigation measurements over time by picking a sample site and plotting its data as a time series.

In [None]:
times = clean['timestamp']
clean['timestamp'].unique()

In [None]:
vals = clean['Panther_lodging_Paulette']
clean['Panther_lodging_Paulette'].unique()

In [None]:
clean.plot.scatter(x='timestamp', y='Panther_lodging_Paulette', figsize=(20,10))

In [None]:
import matplotlib.pyplot as plt
plt.plot(times, vals, '-')
plt.show()

Removing any columns that only have null and/or 0 data

In [None]:
# Treat zeros as missing, then collect every meter column (all columns except
# the first, timestamp) whose values are entirely null.
zeros = irr.replace(0, np.nan)
meter_cols = irr.columns[1:]
drop = [col for col in meter_cols if zeros[col].isnull().all()]
drop
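
The same "only null and/or 0" test can be expressed as a single boolean mask. This sketch runs on a hypothetical frame (the `meter_*` names are made up) and assumes that the timestamp column should be excluded from the check.

```python
import numpy as np
import pandas as pd

# Hypothetical meter readings.
df = pd.DataFrame({
    'timestamp': pd.date_range('2016-01-01', periods=4, freq='D'),
    'meter_a': [0.0, np.nan, 0.0, 0.0],   # only zeros and NaN
    'meter_b': [0.0, 1.5, np.nan, 0.0],   # has one real reading
    'meter_c': [np.nan] * 4,              # entirely missing
})

meters = df.drop(columns='timestamp')

# A column is "empty" if every value is zero once NaN is counted as zero.
empty_mask = (meters.fillna(0) == 0).all()
empty_cols = empty_mask[empty_mask].index.tolist()
print(empty_cols)
```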

In [None]:
clean = clean.drop(drop, axis=1)
msno.matrix(clean)

Use slinear interpolation to fill in missing values in the middle of the dataframe; gaps at the very start or end of a column are left as null.

In [None]:
clean = clean.interpolate(method="slinear")
clean.isnull().sum()
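
To illustrate the "middle only" behaviour without requiring SciPy (which `method="slinear"` depends on), the sketch below uses pandas' built-in linear interpolation with `limit_area='inside'` as a stand-in. On a regularly spaced index this should give the same interior values; the series itself is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical series with gaps at the start, in the middle, and at the end.
s = pd.Series([np.nan, 1.0, np.nan, 3.0, np.nan])

# Linear interpolation restricted to gaps surrounded by valid values.
inner = s.interpolate(method='linear', limit_area='inside')
print(inner.tolist())
```

Only the middle gap is filled (with 2.0); the leading and trailing nulls remain and need a separate fill step.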

In [None]:
msno.matrix(clean)

In [None]:
clean = clean.drop('Panther_lodging_Cora', axis=1)
clean.isnull().sum()

Create cleaned dataframe by dropping columns that have significant missing data at the front/back of the collection period.

In [None]:
clean_drop = clean.drop(['Panther_lodging_Otis', 'Panther_office_Daina', 'Panther_education_Karri', 'Panther_parking_Lorriane'], axis=1)
msno.matrix(clean_drop)

Use forward fill and then backward fill to fill in the remaining null values at the back and front of the dataframe.

In [None]:
clean = clean.ffill()  # fillna(method='ffill') is deprecated in recent pandas versions
clean.isnull().sum()

In [None]:
msno.matrix(clean)

In [None]:
clean = clean.bfill()  # fillna(method='bfill') is deprecated in recent pandas versions
clean.isnull().sum()
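
The order of the two fills can be traced on a small hypothetical series: the forward fill only repairs the trailing gap, and the backward fill then repairs the leading one.

```python
import numpy as np
import pandas as pd

# Hypothetical series with a missing value at each end.
s = pd.Series([np.nan, 1.0, 2.0, 3.0, np.nan])

forward = s.ffill()     # trailing NaN takes the last valid value; leading NaN stays
both = forward.bfill()  # leading NaN takes the first valid value

print(forward.tolist())
print(both.tolist())
```

Both fills repeat an observed value across the gap, so long gaps at either end become flat runs; this is one reason to also keep a version with those columns dropped instead.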

In [None]:
msno.matrix(clean)

Looking at an interpolated/filled column to see if data distribution is similar to what is expected.

In [None]:
clean.plot.scatter(x='timestamp', y='Panther_parking_Adela', figsize=(20,10))

In [None]:
clean.to_csv('/kaggle/working/interpolated_propogated_irrigation.csv', index=False)
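# Save the version without forward/backward filling (columns with large gaps at the ends dropped instead)
clean_drop.to_csv('/kaggle/working/no_propogation_irrigation.csv', index=False)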