## Train and Test Split 

##### The data covers two full years, 2016 and 2017, and is large enough to split by year: we train on the 2016 data and test (predict) on the 2017 data.

![image.png](attachment:98d4d75b-b383-46ad-b6c7-492042bb0641.png)

In [1]:
# Importing the required libraries
import pandas as pd
import numpy as np

In [2]:
# Importing the dataset 
data = pd.read_csv("../../data/cleaned/cleaned_all_data.csv")
data['date'] = pd.to_datetime(data['date'])

In [3]:
# Keep the original string identifiers under more descriptive names
data.rename(columns={'building_id': 'building_name', 'site_id': 'site_name'}, inplace=True)

# Assign numeric IDs (starting at 1) to the buildings
data['building_id'] = pd.factorize(data['building_name'])[0] + 1

# Assign numeric IDs (starting at 1) to the sites
data['site_id'] = pd.factorize(data['site_name'])[0] + 1
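
As a minimal sketch of what `pd.factorize` does in the cell above, the toy example below uses made-up building names (hypothetical, not taken from the dataset). Codes are assigned in order of first appearance and start at 0, which is why the notebook adds 1.

```python
import pandas as pd

# Toy frame with hypothetical building names (not from the real dataset)
toy = pd.DataFrame({"building_name": ["Bear_A", "Bear_B", "Bear_A", "Fox_C"]})

# factorize returns (codes, uniques); codes start at 0 in order of first appearance
codes, uniques = pd.factorize(toy["building_name"])
toy["building_id"] = codes + 1  # shift so that IDs start at 1

print(toy["building_id"].tolist())  # [1, 2, 1, 3]
print(list(uniques))                # ['Bear_A', 'Bear_B', 'Fox_C']
```

Because the codes depend on row order, the same name only maps to the same ID if the input is ordered identically each time the notebook runs.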

In [4]:
# Dropping the columns that are not important 
data = data.drop(columns=['Unnamed: 0', 'building_id_kaggle','site_id_kaggle'])

# Inspecting the head 
data.head()

Unnamed: 0,building_name,meter,date,meter_reading,site_name,sub_primaryspaceusage,sqm,sqft,timezone,airTemperature,cloudCoverage,dewTemperature,precipDepth1HR,precipDepth6HR,seaLvlPressure,windDirection,windSpeed,season,building_id,site_id
0,Bear_education_Alfredo,electricity,2016-01-01,2.905,Bear,Education,609.8,6564.0,US/Pacific,5.246861,1.927009,0.254484,0.351088,10.801125,1018.888301,172.924863,3.807399,Winter,1,1
1,Bear_education_Alfredo,electricity,2016-01-02,2.77,Bear,Education,609.8,6564.0,US/Pacific,5.993973,1.997893,0.892188,0.409453,11.105558,1014.347411,181.359441,4.202455,Winter,1,1
2,Bear_education_Alfredo,electricity,2016-01-03,2.6725,Bear,Education,609.8,6564.0,US/Pacific,5.660314,1.946017,0.778475,0.552568,11.167389,1010.396019,208.978674,4.015919,Winter,1,1
3,Bear_education_Alfredo,electricity,2016-01-04,4.565,Bear,Education,609.8,6564.0,US/Pacific,5.048507,1.987616,-0.268905,0.479493,11.089874,1008.903334,211.37704,3.909701,Winter,1,1
4,Bear_education_Alfredo,electricity,2016-01-05,4.7825,Bear,Education,609.8,6564.0,US/Pacific,4.745567,2.007311,0.321921,1.033857,11.723586,1012.7477,170.002007,3.528571,Winter,1,1


In [5]:
# Splitting the dataset by year. 'date' is already datetime (converted on load),
# and .copy() returns independent DataFrames, avoiding SettingWithCopyWarning
train = data[data['date'] < '2017-01-01 00:00:00'].copy()

test = data[data['date'] >= '2017-01-01 00:00:00'].copy()


In [6]:
# Checking the shape
print(train.shape)
print(test.shape)

(447618, 20)
(446395, 20)
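
Beyond the shapes, it can be worth checking that the two splits do not overlap in time. The sketch below is one possible sanity check on a toy frame; the names `toy`, `toy_train`, `toy_test` and `cutoff` are illustrative assumptions, and the values are made up.

```python
import pandas as pd

# Toy stand-in for the notebook's `data` (values are made up)
toy = pd.DataFrame({
    "date": pd.to_datetime(["2016-06-30", "2016-12-31", "2017-01-01", "2017-07-15"]),
    "meter_reading": [1.0, 2.0, 3.0, 4.0],
})

cutoff = pd.Timestamp("2017-01-01")
toy_train = toy[toy["date"] < cutoff].copy()
toy_test = toy[toy["date"] >= cutoff].copy()

# Every row lands in exactly one of the two splits
assert len(toy_train) + len(toy_test) == len(toy)

# The latest training date precedes the earliest test date
assert toy_train["date"].max() < toy_test["date"].min()
```

The same two assertions could be run on `train` and `test` directly.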


Let's check that both datasets contain the same buildings.

In [7]:
# List of unique buildings in each dataset
train_bdg = pd.DataFrame(train.building_id.unique()).rename(columns={0:"train"})
test_bdg = pd.DataFrame(test.building_id.unique()).rename(columns={0:"test"})

# Number of unique buildings in each dataset
print("Buildings in train: " + str(len(train_bdg)))
print("Buildings in test: " + str(len(test_bdg)))

Buildings in train: 617
Buildings in test: 617


In [8]:
# List of buildings shared by both datasets (inner join on the building IDs)
shared_bdg = (
    pd.merge(train_bdg, test_bdg, how="inner", left_on="train", right_on="test")["train"]
    .tolist()
)
print("Buildings in test AND train: " + str(len(shared_bdg)))

Buildings in test AND train: 617


This confirms that the same 617 buildings appear in both datasets.
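
Had the counts differed, set differences would show which buildings are missing from one side. This is a hedged sketch using hypothetical ID lists (`train_ids`, `test_ids`) rather than the real data.

```python
import pandas as pd

# Hypothetical building IDs, for illustration only
train_ids = pd.Series([1, 2, 3, 4])
test_ids = pd.Series([2, 3, 4, 5])

train_set = set(train_ids.tolist())
test_set = set(test_ids.tolist())

shared = sorted(train_set & test_set)      # present in both splits
only_train = sorted(train_set - test_set)  # missing from test
only_test = sorted(test_set - train_set)   # missing from train

print(shared)      # [2, 3, 4]
print(only_train)  # [1]
print(only_test)   # [5]
```

In the notebook the equivalent inputs would be `train.building_id.unique()` and `test.building_id.unique()`.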

In [9]:
# Saving the files (by default, to_csv also writes the row index as an unnamed first column)
train.to_csv("../../data/cleaned/train.csv")
test.to_csv("../../data/cleaned/test.csv")
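
The `Unnamed: 0` column dropped earlier in this notebook is what pandas produces when a CSV written with the default `index=True` is read back. The sketch below illustrates this on a toy frame with an in-memory buffer. Passing `index=False` is shown as an option, not as a change the notebook makes, since downstream code may expect the extra column.

```python
import io

import pandas as pd

# Toy frame (values are illustrative)
toy = pd.DataFrame({"building_id": [1, 2], "meter_reading": [2.905, 2.77]})

# Default behaviour: the index is written as an unnamed first column
buf = io.StringIO()
toy.to_csv(buf)
buf.seek(0)
with_index = pd.read_csv(buf)
print(with_index.columns.tolist())  # ['Unnamed: 0', 'building_id', 'meter_reading']

# With index=False only the data columns are written
buf = io.StringIO()
toy.to_csv(buf, index=False)
buf.seek(0)
without_index = pd.read_csv(buf)
print(without_index.columns.tolist())  # ['building_id', 'meter_reading']
```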