# <center><font color="green"> https://bit.ly/ptpjb-2021-04</font><br><font color="blue">04 - Exploratory Data Analysis ~ Visualization</font></center>

<center><img alt="" src="images/cover_ptpjb_2021.png"/></center> 

## <center><font color="blue">tau-data Indonesia</font><br>(C) Taufik Sutanto - 2021</center>
<center><a href="https://tau-data.id">https://tau-data.id</a> ~ <a href="mailto:taufik@tau-data.id">taufik@tau-data.id</a></center>

# <center><font color="blue"> Outline: Exploratory Data Analysis (EDA) ~ Visualization</font></center>

* Introduction to visualization
* Visualizing categorical data, numerical data, and their combinations
* Time series visualization
* Spatial visualization
* Case study: building energy usage

# <center><font color="blue">  Introduction to Visualization </font></center>
<center><img alt="" src="images/Purpose_Visualize_Data.jpg" style="height: 300px;" /></center>

* After data preprocessing, visualization can be used to:
 - Check whether further preprocessing is needed.
 - Obtain basic information/insight from the data.
 - Form hypotheses to be tested with a model in the next stage.
 - Later, report model performance and prediction results.
* Examples of basic (generic) visualization goals: monitoring a system, tracking (KPIs/statistics), telling stories, showing outliers/trends, supporting an argument, or simply getting an overview of the data (e.g. Kibana).

<img alt="" src="images/XII_EDA_ML.png" style="height: 200px;" />

In [None]:
!pip install statsmodels folium chart_studio plotly
# this module needs several additional packages

In [None]:
import warnings; warnings.simplefilter('ignore')
import pandas as pd, matplotlib.pyplot as plt, seaborn as sns, numpy as np
import matplotlib.cm as cm
import calendar, folium
from folium.plugins import HeatMap
from collections import Counter
from statsmodels.graphics.mosaicplot import mosaic
plt.style.use('bmh'); sns.set()

In [None]:
# Importing CSV data  https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html
try:
    # Running locally
    price = pd.read_csv('data/price.csv')
except FileNotFoundError:
    # Running in Google Colab: download the file first
    !mkdir -p data
    !wget -P data/ https://raw.githubusercontent.com/taudata-indonesia/eLearning/master/data/price.csv
    price = pd.read_csv('data/price.csv')

In [None]:
# From the previous module (a preprocessed dataset could also be loaded instead)
price.drop("Observation", axis=1, inplace=True)
price.drop_duplicates(inplace=True)
price['Parking'] = price['Parking'].astype('category')
price['City_Category'] = price['City_Category'].astype('category')
# Keep only rows whose price lies within 2 standard deviations of the mean
price2 = price[np.abs(price.House_Price - price.House_Price.mean())<=(2*price.House_Price.std())]
price2.info()

## Does the type of parking tend to make a difference in house prices?

In [None]:
p= sns.catplot(x="Parking", y="House_Price", data=price2)
# What can be seen from this result?

# Add a dimension to the visualization to see insights more clearly

In [None]:
# A plot can also carry information from 3 variables at once
# (to look for possible interaction effects)
p= sns.catplot(x="Parking", y="House_Price", hue="City_Category", kind="swarm", data=price2)

# What information does the result above give?

# <center><font color="blue">1D Visualization: Bar Chart / Count Plot</font></center>
<center><img alt="" src="images/barchart.png" style="height: 300px;" /></center>

Image Source: https://datavizcatalogue.com/methods/bar_chart.html

# <center><font color="blue">Caution: Bar Chart vs Histogram </font></center>
<center><img alt="" src="images/barchart_vs_histogram.png" style="height: 300px;" /></center>

Image Source: https://www.mathsisfun.com/data/bar-graphs.html
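
The difference between the two charts is already visible in the numbers that feed them. The sketch below uses small made-up values (not taken from `price.csv`): a bar chart is drawn from one count per category, while a histogram is drawn from counts per numeric interval.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, for illustration only
parking = pd.Series(["Open", "Covered", "Open", "No Parking", "Open"])
prices = np.array([1.0, 2.0, 2.5, 7.0, 9.5])

# Bar chart input: one count per category (the category order has no numeric meaning)
category_counts = parking.value_counts()

# Histogram input: counts per numeric interval (bins are ordered and adjacent)
bin_counts, bin_edges = np.histogram(prices, bins=[0, 5, 10])

print(category_counts)
print(bin_counts, bin_edges)
```

Reordering the bars of a bar chart changes nothing about its meaning, whereas the bins of a histogram cannot be reordered without destroying the shape of the distribution.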

In [None]:
plt.figure(figsize=(8,6)) # https://matplotlib.org/api/_as_gen/matplotlib.pyplot.figure.html#matplotlib.pyplot.figure
p = sns.countplot(x="City_Category", hue="Parking", data=price2)

# Horizontal? Why?

In [None]:
ax = sns.countplot(y = 'Parking', hue = 'City_Category', palette = 'muted', data=price2)

In [None]:
# "Subplot" demo on a different dataset, because the price data has only 2 categorical variables.

tips=sns.load_dataset('tips') # Built-in dataset from the Seaborn module; explained further below.
categorical = tips.select_dtypes(include = ['category']).columns

fig, ax = plt.subplots(2, 2, figsize=(12, 6))
for variable, subplot in zip(categorical, ax.flatten()):
    # Pass the column as x= explicitly; newer Seaborn versions expect keyword arguments here
    sns.countplot(x=tips[variable], ax=subplot)

# Adding labels? ... Hhhmmm...

In [None]:
X = price2[price2["Parking"].isin(["Open","Covered"])]
X = X[X["House_Price"]<7000000]
X.groupby(["Parking", "City_Category"]).size().unstack()

In [None]:
def groupedbarplot(df, width=0.8, annotate="values", ax=None, **kw):
    ax = ax or plt.gca()
    n = len(df.columns)
    w = 1./n
    pos = (np.linspace(w/2., 1-w/2., n)-0.5)*width
    w *= width
    bars = []
    for col, x in zip(df.columns, pos):
        bars.append(ax.bar(np.arange(len(df))+x, df[col].values, width=w, **kw))
        for val, xi in zip(df[col].values, np.arange(len(df))+x):
            if annotate:
                txt = val if annotate == "values" else col
                ax.annotate(txt, xy=(xi, val), xytext=(0,2), 
                            textcoords="offset points",
                            ha="center", va="bottom")
    ax.set_xticks(np.arange(len(df)))
    ax.set_xticklabels(df.index)
    return bars

In [None]:
counts = price2.groupby(["Parking", "City_Category"]).size().unstack()
plt.figure(figsize=(12,8))
groupedbarplot(counts)
plt.show()

# Stacked/Segmented Chart

In [None]:
CT = pd.crosstab(index=price2["City_Category"], columns=price2["Parking"])
p = CT.plot(kind="bar", figsize=(8,8), stacked=True)

In [None]:
# do this to save the plot to a file
p.figure.savefig('barChart.png')
# a new file will appear in the notebook's folder.

# Mosaic Plot for multiple categorical data analysis

In [None]:
p = mosaic(tips, ['sex','smoker','time'])

# <center><font color="blue">Pie Chart</font></center>
<center><img alt="" src="images/piechart.png" style="height: 400px;" /></center>

Image Source: https://datavizcatalogue.com/methods/pie_chart.html

In [None]:
# PieChart
plot = price2.City_Category.value_counts().plot(kind='pie')

# Show Values?

In [None]:
data = price2['Parking']

proporsion = Counter(data)
values = [float(v) for v in proporsion.values()]
labels = list(proporsion.keys())
colors = ['r', 'g', 'b', 'y'][:len(values)]
explode = [0.1] + [0] * (len(values) - 1) # pull out the first slice only
# The counts are shown as slice labels; the category names go into the legend
plt.pie(values, colors=colors, labels=values, explode=explode, shadow=True)
plt.title('Proportion of Parking Types')
plt.legend(labels, loc='best')
plt.show()

# <center><font color="blue">Box Plot</font></center>

<center><img alt="" src="images/boxplot.png" style="height: 350px;" /></center>

* Lower extreme: $Q_1 - 1.5(Q_3-Q_1)$; upper extreme: $Q_3 + 1.5(Q_3-Q_1)$
* Source: https://datavizcatalogue.com/methods/box_plot.html & https://lsc.deployopex.com/box-plot-with-jmp/
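
The two formulas above define fences, not the whisker ends themselves. In the default Matplotlib/Seaborn box plot the whiskers stop at the most extreme data points that still lie inside the fences. A minimal sketch on a made-up sample (the values are hypothetical):

```python
import numpy as np

# Hypothetical sample with one obvious high outlier
x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 30], dtype=float)

q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers end at the most extreme observations inside the fences
inside = x[(x >= lower_fence) & (x <= upper_fence)]
whisker_low, whisker_high = inside.min(), inside.max()

# Everything outside the fences is drawn as an individual outlier point
outliers = x[(x < lower_fence) | (x > upper_fence)]

print(lower_fence, upper_fence, whisker_low, whisker_high, outliers)
```

Here the upper fence is 14.5, but the upper whisker is drawn at 9 because that is the largest observation below the fence.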

In [None]:
# With outliers present the plot becomes hard to read (data = price, not price2)
p = sns.boxplot(x="House_Price", y="Parking", data=price)

In [None]:
# BoxPlots
p = sns.boxplot(x="House_Price", y="Parking", data=price2)
# What does the pattern shown by this box plot mean?

# How do we retrieve the outlier rows?

* Mind the difference between iloc and loc on a DataFrame.
* Be careful with the box plot outlier rule used by Seaborn!
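
A small sketch of the `loc`/`iloc` distinction, on a hypothetical frame whose index labels are deliberately not 0, 1, 2. `loc` selects by label or boolean mask, `iloc` selects by position, and the two disagree as soon as rows have been filtered or sorted.

```python
import pandas as pd

# Hypothetical frame; the index labels 10, 20, 30 are chosen to differ from positions
df_demo = pd.DataFrame({"House_Price": [5.0, 1.0, 9.0]}, index=[10, 20, 30])

# Boolean mask selection works through loc (the mask is aligned on the index labels)
mask = df_demo["House_Price"] < 6
by_mask = df_demo.loc[mask]

# After sorting, position 0 and label 10 are no longer the same row
sorted_ = df_demo.sort_values("House_Price")
first_by_position = sorted_.iloc[0]["House_Price"]   # cheapest row, whatever its label
row_by_label = sorted_.loc[10, "House_Price"]        # the row labelled 10

print(by_mask)
print(first_by_position, row_by_label)
```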

In [None]:
Q1 = price2['House_Price'].quantile(0.25)
Q3 = price2['House_Price'].quantile(0.75)
IQR = Q3 - Q1 #IQR is interquartile range. 
print("Q1={}, Q3={}, IQR={}".format(Q1, Q3, IQR))

outliers_ = (price2['House_Price'] < (Q1 - 1.5 *IQR)) # lower outliers
rumah_potensial = price2.loc[outliers_]
rumah_potensial

# A box plot can also be split by a category

In [None]:
p = sns.catplot(x="Parking", y="House_Price", hue="City_Category", kind="box", data=price2)

* Does the box plot above suggest any (new) hypothesis or interpretation?
* What are the weaknesses (pitfalls) of the box plot?
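
One commonly cited pitfall is that a box plot only draws a five-number summary, so it cannot show how the data is spread between those five numbers. The sketch below builds two made-up samples, one evenly spread and one concentrated in two clumps, that share exactly the same minimum, quartiles, and maximum.

```python
import numpy as np

# Hypothetical samples, both with 101 observations between 0 and 100
evenly_spread = np.arange(101, dtype=float)
two_clumps = np.array([0] + [25] * 49 + [50] + [75] * 49 + [100], dtype=float)

def five_number_summary(values):
    """Minimum, Q1, median, Q3, maximum: the numbers a box plot is drawn from."""
    return tuple(float(v) for v in np.percentile(values, [0, 25, 50, 75, 100]))

summary_a = five_number_summary(evenly_spread)
summary_b = five_number_summary(two_clumps)

print(summary_a)
print(summary_b)
print(len(np.unique(evenly_spread)), len(np.unique(two_clumps)))
```

Both samples would produce the same box, even though the second one is strongly bimodal. Plots that draw the individual points or the density make that difference visible.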

# Swarm Plot & Violin Plot

### Addressing the weaknesses of the box plot.

In [None]:
p= sns.catplot(x="day", y="total_bill", hue="sex", kind="swarm", data=tips)

In [None]:
p = sns.violinplot(x="day", y="total_bill", data=tips,palette='rainbow')

# <center><font color="blue">Histogram</font></center>

<center><img alt="" src="images/histogram.png" style="height: 300px;" /></center>

image source: https://datavizcatalogue.com/methods/histogram.html

In [None]:
numerical = price2.select_dtypes(include = ['int64','float64']).columns

price2[numerical].hist(figsize=(15, 6), layout=(2, 4));

# <center><font color="blue">Scatter Plot</font></center>

<center><img alt="" src="images/scatter_plot.png" style="height: 350px;" /></center>

image source: https://datavizcatalogue.com/methods/scatterplot.html

In [None]:
p = sns.scatterplot(x=price2['House_Price'], y=price2['Dist_Market'], hue = price2['Parking'])

# Bigger picture?

In [None]:
fig, ax = plt.subplots(1, 1, figsize=(12,8))

p = sns.scatterplot(x=price2['House_Price'], y=price2['Dist_Market'], hue = price2['Parking'], ax=ax)

# Joint Plot

In [None]:
p = sns.jointplot(x=price2['House_Price'], y=price2['Rainfall'], hue = price2['Parking'])

# Conditional Plot

In [None]:
cond_plot = sns.FacetGrid(data=price2, col='Parking', hue='City_Category')
p = cond_plot.map(sns.scatterplot, 'Dist_Hospital', 'House_Price').add_legend()

# Pairwise Plot

In [None]:
# Look at a subset of the variables first, grouped by "Parking"
p = sns.pairplot(price2[['House_Price','Builtup','Dist_Hospital','Parking']], hue="Parking")
# Any interesting patterns?

# 3D Visualization: 3D Scatter Plot

https://pythonprogramming.net/matplotlib-3d-scatterplot-tutorial/

In [None]:
fig = plt.figure(figsize=(12, 10))
ax = fig.add_subplot(111, projection='3d')
x = price2['House_Price']
y = price2['Dist_Hospital']
z = price2['Rainfall']
warna = cm.rainbow(np.linspace(0, 1, len(y)))

ax.scatter(x, y, z, s=50, c=warna, marker='o')
ax.set_xlabel('House price')
ax.set_ylabel('Distance to hospital')
ax.set_zlabel('Rainfall')
plt.show()

# 3D Visualization:  3D Bar Plots

Bar plots are used quite frequently in data visualisation projects since they’re able to convey information, usually some type of comparison, in a simple and intuitive way. The beauty of 3D bar plots is that they maintain the simplicity of 2D bar plots while extending their capacity to represent comparative information.

https://towardsdatascience.com/an-easy-introduction-to-3d-plotting-with-matplotlib-801561999725

In [None]:
import random

fig = plt.figure(figsize=(12, 10))
ax = plt.axes(projection="3d")

num_bars = 15
x_pos = random.sample(range(20), num_bars)
y_pos = random.sample(range(20), num_bars)
z_pos = [0] * num_bars

x_size = np.ones(num_bars)
y_size = np.ones(num_bars)
z_size = random.sample(range(20), num_bars)

ax.bar3d(x_pos, y_pos, z_pos, x_size, y_size, z_size, color='aqua')
plt.show()

# Checking Correlations

In [None]:
price2.select_dtypes(include='number').corr() # correlation is computed on numeric columns only

In [None]:
# Heat map to examine the correlations; only the stronger ones (>= 0.5 or <= -0.4) are shown
corr2 = price2.select_dtypes(include='number').corr() # numeric columns only
plt.figure(figsize=(12, 10))
sns.heatmap(corr2[(corr2 >= 0.5) | (corr2 <= -0.4)], 
            cmap='viridis', vmax=1.0, vmin=-1.0, linewidths=0.1,
            annot=True, annot_kws={"size": 14}, square=True);

# Time Series Plot

# Datetime in Pandas

## References:

* https://towardsdatascience.com/a-complete-guide-to-time-series-data-visualization-in-python-da0ddd2cfb01
* https://machinelearningmastery.com/time-series-data-visualization-with-python/
* https://datascienceanywhere.medium.com/visualizing-time-series-data-in-python-e49fa5d10ea
* Dataset: https://github.com/rashida048/Datasets/blob/master/stock_data.csv

In [None]:
file_ = 'data/stock_data.csv'

try: # Running locally; make sure "file_" is in the "data" folder
    # on_bad_lines='skip' replaces the removed error_bad_lines=False (requires pandas >= 1.3)
    df = pd.read_csv(file_, on_bad_lines='skip', low_memory = False, encoding='utf8')
except FileNotFoundError: # Running in Google Colab
    !mkdir -p data
    !wget -P data/ https://raw.githubusercontent.com/taudata-indonesia/eLearning/master/data/stock_data.csv
    df = pd.read_csv(file_, on_bad_lines='skip', low_memory = False, encoding='utf8')

print(df.shape)
df.head()

In [None]:
# It is important to check the DataFrame's data types
# Note that "Date" is still a string at this point!
df.info()

In [None]:
# Note that "Date" is now no longer a variable (column) but the index of the DataFrame
file_ = 'data/stock_data.csv'

try: # Running locally; make sure "file_" is in the "data" folder
    # on_bad_lines='skip' replaces the removed error_bad_lines=False (requires pandas >= 1.3)
    df = pd.read_csv(file_, parse_dates=True, index_col = "Date", on_bad_lines='skip', low_memory = False, encoding='utf8')
except FileNotFoundError: # Running in Google Colab
    !mkdir -p data
    !wget -P data/ https://raw.githubusercontent.com/taudata-indonesia/eLearning/master/data/stock_data.csv
    df = pd.read_csv(file_, parse_dates=True, index_col = "Date", on_bad_lines='skip', low_memory = False, encoding='utf8')

print(df.shape)
df.head()

In [None]:
set(df["Name"])

In [None]:
# Basic plot using the plotting functions in Pandas
df.sort_index(inplace=True) # First make sure the data is sorted by time
p = df['Volume'].plot(figsize=(10,6))

# Hue: Adding Day-of-Week Information

In [None]:
# The "map" function can be used instead: try it as an exercise
hari_ = {0:"Monday", 1:"Tuesday", 2:"Wednesday", 3:"Thursday", 4:"Friday", 5:"Saturday", 6:"Sunday"}

df['weekdays'] = ['']*df.shape[0]
for i,d in df.iterrows():
    df.loc[i,'weekdays'] =  hari_[i.weekday()] # Note: i is used here rather than d.Date, because time is the index
df.head()

In [None]:
plt.figure(figsize=(15,6))
sns.lineplot(x='Date', y='Volume', data=df, hue='weekdays', palette='Set1')
plt.show()
# We have our first insight!!!...

# SubPlot

In [None]:
p = df.plot(subplots=True, figsize=(10,12))

# Seasonality

* Resampling by month
* Filtering to years from 2016 onward
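
The two steps can be tried on a synthetic daily series first. This is a minimal sketch under stated assumptions: the values are made up, and the `"MS"` (month start) alias is used only because it is accepted by both older and newer pandas versions. It labels each month by its first day, whereas a month-end alias labels it by its last.

```python
import numpy as np
import pandas as pd

# Hypothetical daily series: 60 days covering 1 Jan to 29 Feb 2016
idx = pd.date_range("2016-01-01", periods=60, freq="D")
daily = pd.Series(np.arange(60, dtype=float), index=idx, name="Volume")

# Resampling needs a DatetimeIndex; each month collapses to its mean
monthly_mean = daily.resample("MS").mean()

# Partial-string slicing on the DatetimeIndex keeps everything from February onward
from_feb = monthly_mean.loc["2016-02":]

print(monthly_mean)
print(from_feb)
```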

In [None]:
# Reduce
# Monthly averages; this only works when "time" is the index. Text columns are dropped before averaging.
df_month = df.select_dtypes(include='number').resample("M").mean()
df_month.head()
# Note that Date starts in 2006

In [None]:
import matplotlib.dates as mdates # Need this additional function

fig, ax = plt.subplots(figsize=(10, 6))

ax.xaxis.set_major_formatter(mdates.DateFormatter('%Y-%m')) # makes the tick labels clearer
ax.bar(df_month['2016':].index, df_month.loc['2016':, "Volume"], width=25, align='center')

plt.show()

# Seaborn & Seasonality

* Requires a new "Month" column
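
When the time stamps are the index, calendar fields can be read directly from the `DatetimeIndex` without looping over rows. The sketch below uses a hypothetical three-day frame; `day_names` and `demo` are illustrative names, not part of the notebook.

```python
import pandas as pd

# Hypothetical frame indexed by date: 5 April 2021 is a Monday
idx = pd.date_range("2021-04-05", periods=3, freq="D")
demo = pd.DataFrame({"Volume": [10, 20, 30]}, index=idx)

day_names = {0: "Monday", 1: "Tuesday", 2: "Wednesday", 3: "Thursday",
             4: "Friday", 5: "Saturday", 6: "Sunday"}

# Vectorized attribute access on the index
demo["Month"] = demo.index.month
demo["Year"] = demo.index.year

# Index.map accepts a dict and translates each day number into its name
demo["weekdays"] = demo.index.dayofweek.map(day_names)

print(demo)
```

On large frames this is usually much faster than assigning values row by row inside an `iterrows` loop.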

In [None]:
# Exercise: replace this loop with the "map" function
df['Month'] = ['']*df.shape[0]
for i,d in df.iterrows():
    df.loc[i,'Month'] =  i.month # Note: i is used here rather than d.Date, because time is the index
    
df.head()

In [None]:
#start, end = '2016-01', '2016-12'
fig, axes = plt.subplots(4, 1, figsize=(10, 16), sharex=True)

for name, ax in zip(['Open', 'Close', 'High', 'Low'], axes):
    sns.boxplot(data = df, x='Month', y=name, ax=ax)
    ax.set_ylabel("")
    ax.set_title(name)
    if ax != axes[-1]:
        ax.set_xlabel('')

# Line Plot Revisited with resampling

In [None]:
p = df_month['Volume'].plot(figsize=(8, 6))

# Resampling can also be done by week (contrasted here with daily data)

* Pay close attention to the business understanding.

In [None]:
df_week = df.select_dtypes(include='number').resample("W").mean() # weekly mean of the numeric columns
start, end = '2015-01', '2015-08'

fig, ax = plt.subplots(figsize=(16, 8))

ax.plot(df.loc[start:end, 'Volume'], marker='.', linestyle='-', linewidth = 0.5, label='Daily', color='black')
ax.plot(df_week.loc[start:end, 'Volume'], marker='o', markersize=8, linestyle='-', label='Weekly', color='coral')

ax.set_ylabel("Volume")
ax.legend()
plt.show()

# Rolling?

<img alt="" src="images/rolling_MA_pandas.png"  style="height: 200px;"/>

* A moving average, also called a rolling or running average, is used to analyze time-series data by calculating averages of different subsets of the complete dataset. Since it involves taking the average of the dataset over time, it is also called a moving mean (MM) or rolling mean.
* https://www.datacamp.com/community/tutorials/moving-averages-in-pandas
* https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rolling.html
* https://medium.com/@alexander.mueller/rolling-aggregations-on-time-series-data-with-pandas-80dee5893f9
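
The `center` parameter of `rolling` decides where each window's result is placed. A minimal sketch on a made-up five-value series: with the default `center=False` the window looks backward (trailing), while `center=True` puts the result in the middle of the window.

```python
import pandas as pd

s = pd.Series([0.0, 1.0, 2.0, 3.0, 4.0])

# Trailing window: the value at position i averages positions i-2, i-1, i
trailing = s.rolling(window=3).mean()

# Centered window: the value at position i averages positions i-1, i, i+1
centered = s.rolling(window=3, center=True).mean()

print(pd.DataFrame({"value": s, "trailing": trailing, "centered": centered}))
```

A trailing window uses only past observations, which matters when the smoothed series feeds a forecast. A centered window aligns better visually with the raw series but uses future values.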

In [None]:
# A simple example
df2 = pd.DataFrame({'B': [0, 1, 2, 3, 4]})
print(df2)
df2.rolling(3).mean()

In [None]:
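# Note the center parameter: False gives a trailing window, True a centered one. The window spans 7 rows, and only numeric columns are averaged.
df_7d_rolling = df.select_dtypes(include='number').rolling(window=7, center=False).mean()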
df_7d_rolling.head(10)

In [None]:
start, end = '2016-06', '2017-05'
fig, ax = plt.subplots(figsize=(16, 8))

ax.plot(df.loc[start:end, 'Volume'], marker='.', linestyle='-', linewidth=0.5, label='Daily')
ax.plot(df_week.loc[start:end, 'Volume'], marker='o', markersize=5, linestyle='-', label = 'Weekly mean volume')
ax.plot(df_7d_rolling.loc[start:end, 'Volume'], marker='.', linestyle='-', label='7d Rolling Average')

ax.set_ylabel('Stock Volume')
ax.legend()
plt.show()

# Visualizing Change (Ratio to the Previous Day)

* Uses the "shift" function, which shifts the data forward or backward by the specified number of periods.
* https://pandas.pydata.org/docs/reference/api/pandas.Series.shift.html
* The "div" function (division), here applied together with shift.
* https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.div.html
* https://www.geeksforgeeks.org/python-pandas-dataframe-shift/
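
A minimal sketch of the two functions on three made-up closing prices. `shift()` moves every value down by one row, so dividing the series by its shifted copy gives the ratio of each value to the previous one.

```python
import pandas as pd

# Hypothetical closing prices
close = pd.Series([100.0, 110.0, 99.0])

previous = close.shift()        # NaN, 100.0, 110.0
ratio = close.div(previous)     # NaN, 1.1, 0.9

print(pd.DataFrame({"close": close, "previous": previous, "ratio": ratio}))
```

The first row has no predecessor, so its ratio is NaN. A ratio above 1 means the price rose, below 1 means it fell.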

In [None]:
df.head()

In [None]:
df.Close.shift().head()

In [None]:
df['Change'] = df.Close.div(df.Close.shift())
p = df['Change'].plot(figsize=(20, 8), fontsize = 16)

In [None]:
df.head() # note the "Change" column

In [None]:
# Zoom in to one year
p = df.loc['2008', 'Change'].plot(figsize=(10, 6))

# Percent_Change

* Percentage change between the current and a prior element.
* https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.pct_change.html
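
`pct_change` is the previous-period ratio minus one. The sketch below checks that equivalence on made-up prices and multiplies by 100 to express it in percent.

```python
import pandas as pd

# Hypothetical closing prices
close = pd.Series([100.0, 110.0, 99.0])

pct = close.pct_change() * 100                      # NaN, 10.0, -10.0 (in percent)
via_ratio = (close.div(close.shift()) - 1) * 100    # same quantity built by hand

print(pd.DataFrame({"pct_change": pct, "via_ratio": via_ratio}))
```

The change is always measured between consecutive rows of the series it is called on, so a monthly percentage change should be computed on the monthly (resampled) series rather than on the daily one.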

In [None]:
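# Month-over-month change, computed on the monthly series itself rather than on the daily data
df_month.loc[:, 'pct_change'] = df_month['Close'].pct_change()*100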

df_month.head()

In [None]:
fig, ax = plt.subplots(figsize=(16, 8))
# ax.bar keeps real dates on the x-axis, so the date locator and formatter below apply correctly
ax.bar(df_month.index, df_month['pct_change'], width=20, color='coral', label='pct_change')

ax.xaxis.set_major_locator(mdates.YearLocator())
ax.xaxis.set_major_formatter(mdates.DateFormatter('%Y'))
plt.xticks(rotation=45)
ax.legend()

plt.show()

# Expanding Window: Accumulating Data

* Think of it like the cumulative total of Covid cases, but more flexible, because the mean or standard deviation can be used as well.
* https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.expanding.html
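
A minimal sketch on four made-up daily counts: an expanding window always starts at the first row and grows by one row at a time, so its sum is the running total and its mean is the average of everything seen so far.

```python
import pandas as pd

# Hypothetical daily counts
new_cases = pd.Series([1.0, 2.0, 3.0, 4.0])

running_total = new_cases.expanding().sum()    # 1, 3, 6, 10
running_mean = new_cases.expanding().mean()    # 1, 1.5, 2, 2.5
running_std = new_cases.expanding().std()      # NaN for the first row (one value only)

print(pd.DataFrame({"total": running_total, "mean": running_mean, "std": running_std}))
```

Unlike a rolling window, old observations never drop out, so the expanding statistics become steadily less sensitive to new data.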

In [None]:
fig, ax = plt.subplots(figsize=(16, 8))

df.High.plot(label='High', ax=ax)
df.High.expanding().mean().plot(label='High expanding mean', ax=ax)
df.High.expanding().std().plot(label='High expanding std', ax=ax)

ax.legend(); plt.show()

# Heat Map

* Makes it much easier to obtain insight
* Requires reshaping the data structure.
* Requires an additional "Year" column
* https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.pivot_table.html
* https://seaborn.pydata.org/generated/seaborn.heatmap.html
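
The reshaping step can be sketched on a tiny made-up table. A heat map expects a grid with one row per month and one column per year, and `pivot_table` builds that grid from long-format data, averaging whenever several rows fall into the same cell.

```python
import pandas as pd

# Hypothetical long-format data: one row per observation
long_df = pd.DataFrame({
    "Year":  [2016, 2016, 2016, 2017, 2017],
    "Month": [1, 1, 2, 1, 2],
    "Open":  [10.0, 20.0, 30.0, 40.0, 50.0],
})

# Wide format: rows = Month, columns = Year, cell = mean of "Open"
wide = pd.pivot_table(long_df, values="Open", index="Month",
                      columns="Year", aggfunc="mean")

print(wide)
```

The two January 2016 rows collapse into a single cell holding their mean, which is the value the heat map would colour.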

In [None]:
# Exercise: replace this loop with the "map" function
df['Year'] = ['']*df.shape[0]
for i,d in df.iterrows():
    df.loc[i,'Year'] =  i.year # Note: i is used here rather than d.Date, because time is the index
    
df.head()

In [None]:
all_month_year_df = pd.pivot_table(df, values="Open",
                                   index=["Month"],
                                   columns=["Year"],
                                   fill_value=0,
                                   margins=True)
named_index = [[calendar.month_abbr[i] if isinstance(i, int) else i for i in list(all_month_year_df.index)]] # name months
all_month_year_df = all_month_year_df.set_index(named_index)
all_month_year_df.head()

In [None]:
fig, ax = plt.subplots(figsize=(12, 12))

sns.heatmap(all_month_year_df, cmap='RdYlGn_r', robust=True, fmt='.2f', 
                 annot=True, linewidths=.5, annot_kws={'size':11}, 
                 cbar_kws={'shrink':.8, 'label':'Open'}, ax=ax)                       
    
ax.set_yticklabels(ax.get_yticklabels(), rotation=0, fontsize=10)
ax.set_xticklabels(ax.get_xticklabels(), rotation=0, fontsize=10)
plt.title('Average Opening', fontdict={'fontsize':18},    pad=14);

plt.show()

# Spatial Visualization

In [None]:
def generateBaseMap(default_location=[-0.789275, 113.921], default_zoom_start=5):
    base_map = folium.Map(location=default_location, control_scale=True, zoom_start=default_zoom_start)
    return base_map

In [None]:
# Load data
try:
    # Running locally; make sure the folium module is installed
    df_loc = pd.read_csv('data/df_loc.csv')
except FileNotFoundError:
    # Running in Google Colab; make sure the "data" folder already exists
    !wget -P data/ https://raw.githubusercontent.com/taudata-indonesia/eLearning/master/data/df_loc.csv
    df_loc = pd.read_csv('data/df_loc.csv')
    
df_loc.head()

In [None]:
base_map = generateBaseMap()
HeatMap(data=df_loc[['lat', 'lon', 'count']].groupby(['lat', 'lon']).sum().reset_index().values.tolist(), radius=8, max_zoom=13).add_to(base_map)
base_map

# Note: recent versions of Chrome tend to have trouble with Folium when the data is fairly large
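
One possible way to shrink the data sent to the browser is to merge nearby points before building the heat map. This is a sketch under stated assumptions, not a Folium feature: the coordinates are made up, and rounding to two decimal places (roughly 1 km of latitude) is an arbitrary choice that should be adapted to the zoom level.

```python
import pandas as pd

# Hypothetical points in the same lat / lon / count layout
points = pd.DataFrame({
    "lat":   [-6.2001, -6.2004, -7.2501],
    "lon":   [106.8001, 106.8003, 112.7502],
    "count": [2, 3, 5],
})

# Assumption: snapping coordinates to a 0.01-degree grid is coarse enough for this map
binned = points.assign(lat=points["lat"].round(2), lon=points["lon"].round(2))

# One row per grid cell, with the counts summed; the nested list is the [lat, lon, weight] form
heat_data = (binned.groupby(["lat", "lon"], as_index=False)["count"]
                   .sum()
                   .values.tolist())

print(heat_data)
```

The resulting list has the same shape as the one passed to `HeatMap` above, only with fewer entries.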

# <center><font color="blue"> Case Study (Exercise): Building Energy Usage</font></center>

<img alt="" src="images/Ashrae-Energy-Prediction.jpg" style="height: 200px;" />

<font color="green"> Description</font>

* This case study comes from the building energy usage prediction problem posed by ASHRAE (American Society of Heating, Refrigerating and Air-Conditioning Engineers) https://www.ashrae.org/about
* As an EDA exercise we will only use part of the available data.
* Full data and description: https://www.kaggle.com/c/ashrae-energy-prediction/data
* The data consists of 3 CSV files: building information, building energy usage, and weather.
* The main problem in this case is actually forecasting energy usage. However, in modules 03 and 04 we will only perform EDA on the available data.
* The results of this EDA will later be used for further analysis.

<font color="green"> Metadata</font>

* Variables in the building data "**gd**":
    - site_id & building_id: location id and building id
    - primary_use: what the building is used for
    - square_feet: floor area of the building
    - year_built: year the building was constructed
    - floor_count: number of floors in the building.
* Variables in the building energy data "**en**" (besides building_id):
    - meter: type of meter reading for the building's energy usage.
    - timestamp: time of the measurement (hourly)
    - meter_reading: energy usage.
* Variables in the weather data "**cu**" (besides site_id & timestamp):
    - air_temperature: air temperature
    - cloud_coverage: amount of cloud cover
    - dew_temperature: dew point temperature
    - precip_depth_1_hr: precipitation (the amount of water falling from the sky, whatever the cause)
    - sea_level_pressure: sea level pressure.
    - wind_direction & wind_speed: wind direction and speed
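
The shared keys in the metadata suggest how the three tables could be combined for EDA: energy readings join to buildings on `building_id`, and the result joins to weather on `site_id` plus `timestamp`. The sketch below uses tiny made-up frames (the `_demo` names and all values are hypothetical). With the real CSV files, `timestamp` would first need converting with `pd.to_datetime`.

```python
import pandas as pd

t0, t1 = pd.Timestamp("2016-01-01 00:00"), pd.Timestamp("2016-01-01 01:00")

# Hypothetical stand-ins for gd (buildings), en (energy), and cu (weather)
gd_demo = pd.DataFrame({"site_id": [0, 1], "building_id": [100, 101],
                        "primary_use": ["Office", "Education"]})
en_demo = pd.DataFrame({"building_id": [100, 100, 101], "meter": [0, 0, 0],
                        "timestamp": [t0, t1, t0],
                        "meter_reading": [12.5, 13.0, 40.0]})
cu_demo = pd.DataFrame({"site_id": [0, 0, 1], "timestamp": [t0, t1, t0],
                        "air_temperature": [25.0, 24.4, 19.1]})

# Left joins keep every energy reading, even if building or weather data is missing
merged = (en_demo
          .merge(gd_demo, on="building_id", how="left")
          .merge(cu_demo, on=["site_id", "timestamp"], how="left"))

print(merged)
```

Comparing the row count before and after each merge is a quick check that no readings were duplicated by repeated keys.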

# <center><font color="blue"> Task</font></center>

* Please carry out EDA by doing the following:
    - What data preprocessing needs to be done?
    - Use descriptive statistics & visualization to produce information/insights from the available data.
* The code for loading the data above is given in the cell below.

In [None]:
file_00 = 'data/ashrae-energy_building_metadata.csv'
file_01 = 'data/ashrae-energy_train_sample.csv'
file_02 = 'data/ashrae-energy_weather_test.csv'
# on_bad_lines='skip' replaces the removed error_bad_lines=False (requires pandas >= 1.3)
try: # Running locally; make sure the files are in the "data" folder
    gd = pd.read_csv(file_00, on_bad_lines='skip', low_memory = False, encoding='utf8') # buildings
    en = pd.read_csv(file_01, on_bad_lines='skip', low_memory = False, encoding='utf8') # energy
    cu = pd.read_csv(file_02, on_bad_lines='skip', low_memory = False, encoding='utf8') # weather
except FileNotFoundError: # Running in Google Colab
    !mkdir -p data
    !wget -P data/ https://raw.githubusercontent.com/taudata-indonesia/ptpjb/master/data/ashrae-energy_building_metadata.csv
    !wget -P data/ https://raw.githubusercontent.com/taudata-indonesia/ptpjb/master/data/ashrae-energy_train_sample.csv
    !wget -P data/ https://raw.githubusercontent.com/taudata-indonesia/ptpjb/master/data/ashrae-energy_weather_test.csv
    gd = pd.read_csv(file_00, on_bad_lines='skip', low_memory = False, encoding='utf8') # buildings
    en = pd.read_csv(file_01, on_bad_lines='skip', low_memory = False, encoding='utf8') # energy
    cu = pd.read_csv(file_02, on_bad_lines='skip', low_memory = False, encoding='utf8') # weather

print("Building data shape={}, train data shape={}, weather data shape={}".format(gd.shape, en.shape, cu.shape))
gd.head()

In [None]:
cu.head()

In [None]:
en.head()

In [None]:
# answers start from this cell; add new cells as needed (alt+Enter)



# <center><font color="blue"> End of Module 04 - Exploratory Data Analysis ~ Visualization</font></center>

<hr />
<img alt="" src="images/meme-cartoon/meme visualization.jpg" style="height: 400px;"/>