<a href="https://colab.research.google.com/github/seobho/energy-analysis/blob/master/notebooks/data_manipulation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Data Manipulation for Seoul City Energy Usage Statistics Dataset

This Jupyter notebook walks through cleaning and preparing the Seoul City energy usage statistics dataset. It first introduces the dataset and then describes the key data manipulation steps.

### Introduction to the Dataset

This dataset provides statistics on electricity, gas, water, and district heating usage in Seoul Special City. The dataset is owned by the Seoul Metropolitan Government and covers aggregated information from September 2009 to February 2023.

Dataset link: https://data.seoul.go.kr/dataList/OA-15361/S/1/datasetView.do

### Data Manipulation Process

1. **Removal of Unnecessary Data**: Remove columns that are not relevant to the analysis or contain duplicate data. This helps clean up the dataframe and focus on the necessary data for analysis.

2. **Translation of Korean to English**: Translate the Korean column names and category values in the dataset to English. This makes later analysis and visualization easier.

3. **Handling Missing Values**: Identify and handle missing values using appropriate techniques. Dealing with missing values ensures data integrity and reliability.

Following these steps in order yields a processed dataset, which we save as a CSV file in the dataset directory. Each step is accompanied by code cells and comments explaining the functions used.

**Note**: This notebook utilizes Python and the pandas library for data manipulation.

For more details, please refer to the code cells and comments.


In [None]:
# Uncomment the line below when running in Google Colab.
# !git clone https://github.com/seobho/energy-analysis

In [None]:
# Uncomment the line below when running in Google Colab.
# %cd /content/energy-analysis

In [None]:
import pandas as pd

In [None]:
# The trailing slash matters: file names are appended to this string directly.
dataset_path = '/workspace/energy-analysis/dataset/'

# Use this path instead when running in Google Colab.
# dataset_path = '/content/energy-analysis/dataset/'
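
Because the path and the file name are joined as plain strings, a missing trailing slash silently produces a wrong path. As a minimal sketch, assuming the same directory layout, `pathlib` can build the path instead. The names `dataset_dir`, `raw_file`, and `same_file` are illustrative and do not appear elsewhere in this notebook.

```python
from pathlib import Path

# Illustrative names; the directory mirrors the notebook's local path.
dataset_dir = Path('/workspace/energy-analysis/dataset')
raw_file = dataset_dir / '에너지사용량데이터_통계_요약정보.csv'

# The / operator inserts exactly one separator, with or without a trailing slash.
same_file = Path('/workspace/energy-analysis/dataset/') / '에너지사용량데이터_통계_요약정보.csv'
print(raw_file == same_file)
```

`pd.read_csv` and `DataFrame.to_csv` both accept `Path` objects, so a path built this way can be passed to them directly.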

In [None]:
df = pd.read_csv(dataset_path + '에너지사용량데이터_통계_요약정보.csv', encoding='utf-8')

In [None]:
df.head(5)

## 1. Removal of Unnecessary Data
Keep only the columns needed for the analysis and put them in a fixed order.

In [None]:
df.columns

In [None]:
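# Columns to keep: year, month, user type, count, and current-year
# electricity, gas, water, and district heating usage.
new_columns = ['년도', '월', '회원타입', '건수', '현년 전기사용량', '현년 가스사용량', '현년 수도사용량', '현년 지역난방 사용량']
# reindex keeps only these columns, in this order.
df = df.reindex(columns=new_columns)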
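
One thing to keep in mind with `reindex` is that it does not raise an error when a requested column is absent. It adds that column filled with NaN instead. The sketch below shows this on a toy frame with made-up values, together with an explicit check for missing names. The names `toy`, `wanted`, and `missing` are illustrative.

```python
import pandas as pd

# Toy frame with made-up values; only two of the requested columns exist.
toy = pd.DataFrame({'년도': [2022], '월': [1]})
wanted = ['년도', '월', '건수']

# reindex does not raise for the absent '건수' column; it adds it filled with NaN.
reindexed = toy.reindex(columns=wanted)
print(reindexed.isnull().sum())

# An explicit check makes a misspelled or absent column name visible.
missing = [col for col in wanted if col not in toy.columns]
print(missing)
```

Running a check like this before reindexing helps distinguish NaN values that come from the source data from NaN values introduced by a column name that did not match.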

## 2. Translation of Korean to English

Translate Korean strings to English. Two things need translating: the column names and the values in the 'user_type' column.

In [None]:
# Column names: map each Korean name to its English equivalent.
english_columns = ['year', 'month', 'user_type', 'count', 'electricity_usage', 'gas_usage', 'water_usage', 'district_heating_usage']
change_columns = dict(zip(new_columns, english_columns))
df.rename(columns=change_columns, inplace=True)

# user_type values: map each Korean category to its English equivalent.
translation_dict = {'학교': 'school',
                    '종교단체': 'religious_organization',
                    '소상공인': 'small_business_owner',
                    '기업': 'corporation',
                    '공동주택관리소': 'homeowners_association',
                    '공공기관': 'public_institution',
                    '개인': 'individual'}
df['user_type'] = df['user_type'].replace(translation_dict)
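
`replace` leaves any value without an entry in the mapping unchanged, so a category missing from the dictionary would stay in Korean without any warning. A minimal sketch of a check for leftover values follows, using a toy series with made-up values. The category '기타' is a hypothetical example and is not known to occur in the dataset.

```python
import pandas as pd

# Toy series with made-up values; '기타' is a hypothetical category absent from the mapping.
user_type = pd.Series(['개인', '학교', '기타'])
mapping = {'개인': 'individual', '학교': 'school'}

translated = user_type.replace(mapping)

# Anything that is not one of the English target values was not translated.
untranslated = sorted(set(translated) - set(mapping.values()))
print(untranslated)
```

An empty list from the same check on `df['user_type']` would indicate that every category was covered by the dictionary.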

## 3. Missing Value Handling

### 3.1 Missing Value Identification

We need to check for any missing values in the dataset.


In [None]:
df.isnull().sum()

In [None]:
# Show the rows where district_heating_usage is missing.
df[df['district_heating_usage'].isnull()]

### 3.2 Missing Value Handling
The missing values occur only in the district_heating_usage column.
Since the value 0 already appears 653 times in this column, we treat a missing entry as no recorded usage and impute 0.
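
Before imputing, it can help to see how missing values and explicit zeros are distributed across user types. The sketch below does this on toy data with made-up values, assuming the translated column names used in this notebook. The names `toy`, `flags`, and `summary` are illustrative.

```python
import numpy as np
import pandas as pd

# Toy data with made-up values, using the translated column names.
toy = pd.DataFrame({
    'user_type': ['individual', 'individual', 'school', 'school'],
    'district_heating_usage': [0.0, np.nan, 12.5, np.nan],
})

# Flag missing values and explicit zeros, then count both per user type.
flags = toy.assign(
    is_missing=toy['district_heating_usage'].isnull(),
    is_zero=toy['district_heating_usage'].eq(0),
)
summary = flags.groupby('user_type')[['is_missing', 'is_zero']].sum()
print(summary)
```

If missing values and zeros show up in the same groups, that lends some support to reading a missing entry as zero usage. It is only a heuristic and does not prove that the two mean the same thing.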

In [None]:
df['district_heating_usage']

In [None]:
# Count the rows where district_heating_usage is exactly 0.
(df['district_heating_usage'] == 0).sum()

In [None]:
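# Assign the result back instead of calling fillna(inplace=True) on the selected column,
# which is deprecated in recent pandas versions and may not update df.
df['district_heating_usage'] = df['district_heating_usage'].fillna(0)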

In [None]:
df.isnull().sum()

## 4. Save File

Save the processed data as a CSV file in the dataset directory.

In [None]:
df.to_csv(dataset_path + 'processed_energy_usage_data.csv', index=False)
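
A quick way to confirm the export is to read the file back and compare it with the frame that was written. The sketch below does this with a toy frame (made-up values) and a temporary directory, so it does not touch the real dataset directory. The names `toy`, `out_file`, and `reloaded` are illustrative.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy frame with made-up values standing in for the processed data.
toy = pd.DataFrame({'year': [2022], 'user_type': ['school'], 'district_heating_usage': [0.0]})

with tempfile.TemporaryDirectory() as tmp:
    out_file = Path(tmp) / 'processed_energy_usage_data.csv'
    # index=False keeps the row index out of the file.
    toy.to_csv(out_file, index=False)
    # Read the file back while the temporary directory still exists.
    reloaded = pd.read_csv(out_file)

# Compare the reloaded frame with the one that was written.
print(reloaded.equals(toy))
```

Without `index=False`, the row index would be written as an extra unnamed column, and the reloaded frame would no longer have the same columns as the original.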