# Kaggle Competition Report - Island in the sun

Team: Sun Trek  -  Shiwen Chen, Jin Zhou, Rongsheng Zhang

**Abstract:** aaaaa

### Introduction

Climate change, driven primarily by emissions of greenhouse gases such as carbon dioxide, has become a global concern. According to the EPA, the electricity sector accounts for 29% of greenhouse gas emissions in the US (EPA, 2015). State governments and other concerned parties are trying to reduce the climate impact of power generation by promoting renewable sources such as wind and solar energy.

Currently, solar energy represents only a small fraction of the energy mix (0.9% for solar compared with 64% for fossil fuels, according to EIA, 2017), but it has the greatest potential because it is abundant and inexhaustible. However, a growing share of renewable generation such as solar can threaten the reliability of the grid (Zahedi, 2011). Solar PV generation is intermittent and hard to predict, while the power system operator needs precise forecasts to keep supply and demand in balance. Moreover, solar generation drops suddenly at sunset, which requires a fast ramp-up of generation from other sources to fill the gap; this can be risky and uneconomical.

There are several ways to address this reliability challenge, including installing energy storage. Before taking any such measure, however, it is crucial to know the capacity of installed solar PV. This data is hard to obtain because many solar PV systems are installed on rooftops by individuals. This study aims to collect rooftop PV installation data from satellite imagery with a computer algorithm. Given a satellite image of a house, the algorithm determines whether a rooftop PV system is installed, which makes the data collection fast and automatic.


### Background

This section reviews existing research on similar topics and introduces the algorithms used, so that they can serve as a comparison to our method. Section (A) discusses a popular model used in panel detection and a recent work that applies it to this problem. Section (B) discusses another recent study that uses a different method, which inspired our panel detection design.

**(A) Convolutional Neural Network (CNN) in Panel Detection**

CNN is a popular method that builds on the ideas of both the Neocognitron, introduced by Fukushima, and weight-sharing networks. A convolutional neural network combines three ideas: local receptive fields, shared weights, and spatial subsampling. A CNN model typically contains several kinds of layers, such as convolutional layers, pooling layers, and fully-connected layers, followed by a classifier that produces a prediction from the extracted features. Recent works have shown that CNNs achieve strong ROC performance. The study by Golovko et al. explores using a multi-layer CNN for panel detection. It employs three convolutional-pooling layers, followed by three fully-connected layers, to transform a 200x200-pixel image into a single output signal. This design achieves an AUC of 0.92 and an F1-measure of 0.86 (Golovko et al., 2017).
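To make the three ideas concrete, the following is a minimal NumPy sketch of one convolution-pooling stage applied to a 200x200 input. It is not the architecture of Golovko et al.: the 5x5 kernel, the 2x2 pooling size, and the random input are all assumptions chosen only for illustration.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide one shared kernel over a 2-D image (no padding, stride 1)."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # Local receptive field: each output value sees only a kh x kw patch.
            # Shared weights: the same kernel is reused at every position.
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def max_pool(feature_map, size=2):
    """Spatial subsampling: keep the maximum of each size x size block."""
    h = feature_map.shape[0] // size * size
    w = feature_map.shape[1] // size * size
    blocks = feature_map[:h, :w].reshape(h // size, size, w // size, size)
    return blocks.max(axis=(1, 3))

rng = np.random.default_rng(0)
image = rng.random((200, 200))         # stand-in for one channel of a 200x200 image
kernel = rng.standard_normal((5, 5))   # hypothetical 5x5 filter (assumption)

feature_map = conv2d_valid(image, kernel)
pooled = max_pool(feature_map)
print(feature_map.shape, pooled.shape)  # (196, 196) (98, 98)
```

Stacking several such stages shrinks the spatial size step by step, which is how a large image can eventually be reduced to a single output signal.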

**(B) Feature Extraction Method in Panel Detection**

In the study by Malof et al., the authors describe a feature extraction method designed to reduce computation. The original image is relatively large, and it is expensive for the detector to process that much data directly. To lower the computational complexity, a window-based feature extraction method is proposed. Each image is divided into multiple small 3x3 windows; the detector considers a center window together with its 8 surrounding windows and extracts the mean and variance of their RGB values to form a feature vector. By combining this method with a Random Forest classifier, the proposed design achieves an R value of 0.6 (Malof et al., 2016).
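One possible reading of this window-based extraction is sketched below. It assumes windows of 3x3 pixels arranged in a 3x3 grid around a chosen center pixel, which gives 9 windows x 3 channels x 2 statistics = 54 features. The function name `window_features` and the exact layout are illustrative assumptions, not the implementation of Malof et al.

```python
import numpy as np

def window_features(image, row, col, win=3):
    """Per-channel mean and variance for the window centered at (row, col)
    and its 8 neighboring windows (assumed layout: a 3x3 grid of win x win windows)."""
    half = win // 2
    margin = win + half  # distance needed so that all 9 windows fit inside the image
    h, w = image.shape[:2]
    if not (margin <= row < h - margin and margin <= col < w - margin):
        raise ValueError("center is too close to the image border")

    feats = []
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            r = row + dr * win
            c = col + dc * win
            patch = image[r - half:r + half + 1, c - half:c + half + 1, :]
            pixels = patch.reshape(-1, 3).astype(float)
            feats.extend(pixels.mean(axis=0))  # mean of R, G, B in this window
            feats.extend(pixels.var(axis=0))   # variance of R, G, B in this window
    return np.array(feats)

rng = np.random.default_rng(0)
demo_image = rng.integers(0, 256, size=(101, 101, 3))  # synthetic stand-in image
features = window_features(demo_image, 50, 50)
print(features.shape)  # (54,)
```

A feature vector of this kind can then be passed to any standard classifier, such as a Random Forest.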


### Data

The data consists of satellite images in TIFF format with three color channels (RGB: red, green, and blue). Each image is 101 x 101 pixels. Most of the images contain rooftops of houses. Some examples are shown below:
  
![Img](https://raw.githubusercontent.com/viviancsw/machine-learning-course/master/withoutPV.tif)
![Img](https://raw.githubusercontent.com/viviancsw/machine-learning-course/master/withPV.tif)

*Figure 1. training image without solar PV (top), training image that contains solar PV (bottom)*

The training dataset contains 1,500 images in total, of which about one-third (505 images) contain solar PV. From the examples in Figure 1, we can see that the image content is highly variable. For instance, the proportion of vegetation, the color of the roof, and the presence of objects such as pools and cars vary from image to image. This high variance implies that features derived from the whole picture (e.g. the average color of the picture) may not be ideal features for our classifier.

The bottom image of Figure 1 shows some characteristics of solar PV, such as being darker (having lower RGB values) than most other objects in the picture. However, shadows in the picture are also dark and can be confused with PV when darkness is used as a feature. While a single cell of a solar PV panel is rectangular, the shape of a whole PV array tends to be irregular. However, the total area of the solar PV has a relatively small variance across images. There are also recognizable patterns within the solar PV arrays, which can potentially be captured by the variance of color within the array.

### Methods

methods

### Results

In this section, we set up a simulation platform in Python 3.0 and present the simulation results of our panel detection design. To better evaluate its performance, we compare it with two other classifiers.

**(A) Simulation Parameters**

In the simulation, we set the reference RGB value to (99.84, 111.64, 120.47), which is the mean value of the panel pixels in 30 different images. It is reasonable to use the mean of the panel pixels as the reference value because the average RGB values of panel pixels are lower than those of non-PV pixels, and the RGB values do not change much within each window.
Another parameter worth mentioning is the percentage of 'unrelated' pixels that we discard. In particular, we discard the 70% of pixels with the highest RGB values; 70% gave the best performance in our experiments.
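As a rough illustration of how these two parameters could be applied, the sketch below drops the brightest 70% of pixels and measures how far the remaining pixels are from the reference RGB value. The brightness measure (R + G + B), the Euclidean distance, and the function names are assumptions made for this example only; the report does not specify them.

```python
import numpy as np

REFERENCE_RGB = np.array([99.84, 111.64, 120.47])  # reference value stated above
DISCARD_FRACTION = 0.70                            # fraction of bright pixels to drop

def keep_darkest_pixels(image, discard_fraction=DISCARD_FRACTION):
    """Return the pixels left after dropping the brightest fraction."""
    pixels = image.reshape(-1, 3).astype(float)
    brightness = pixels.sum(axis=1)  # assumption: brightness = R + G + B
    n_keep = max(1, int(round(len(pixels) * (1.0 - discard_fraction))))
    darkest = np.argsort(brightness, kind="stable")[:n_keep]
    return pixels[darkest]

def mean_distance_to_reference(image, reference=REFERENCE_RGB):
    """Average Euclidean distance between the kept pixels and the reference color
    (a hypothetical feature, not necessarily the one used in this report)."""
    kept = keep_darkest_pixels(image)
    return float(np.linalg.norm(kept - reference, axis=1).mean())

rng = np.random.default_rng(0)
demo_image = rng.integers(0, 256, size=(101, 101, 3))  # synthetic stand-in image
print(keep_darkest_pixels(demo_image).shape)  # (3060, 3)
print(mean_distance_to_reference(demo_image))
```

A small distance would suggest that the dark pixels in the image resemble panel pixels, although shadows can produce similar values.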

**(B) Cross Validation**

Cross validation is a method that is widely used in machine learning. It splits the dataset into several folds; in each round of validation, one fold is used as the test set and the remaining folds are used as the training set.
In this experiment, we set the number of splits to 20. This is a reasonable choice because the dataset has 1,500 samples: dividing it into 20 folds gives 75 test samples and 1,425 training samples per round, without incurring high computational cost.
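The splitting scheme can be sketched in a few lines of NumPy. This is a generic illustration of 20-fold cross validation on 1,500 samples; the function name `k_fold_indices` and the shuffling seed are assumptions, and the report's actual implementation may differ.

```python
import numpy as np

def k_fold_indices(n_samples, n_splits, seed=0):
    """Yield (train_idx, test_idx) pairs for k-fold cross validation."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_samples)       # shuffle once before splitting
    folds = np.array_split(order, n_splits)  # list of n_splits index arrays
    for i, test_idx in enumerate(folds):
        # Every fold except the i-th one is used for training.
        train_idx = np.concatenate(folds[:i] + folds[i + 1:])
        yield train_idx, test_idx

splits = list(k_fold_indices(n_samples=1500, n_splits=20))
train_idx, test_idx = splits[0]
print(len(splits), len(train_idx), len(test_idx))  # 20 1425 75
```

Each sample appears in exactly one test fold, so every image contributes one out-of-sample prediction to the final evaluation.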

**(C) Simulation Result**

In this part, we present our simulation results. We use the ROC curve and AUC as our evaluation metrics. Apart from the proposed design, we include two other classifiers as baselines for comparison. These two classifiers are defined as follows:

1) a classifier that randomly guesses the label of each image;

2) a classifier that labels all images as non-PV.

![Img](https://raw.githubusercontent.com/viviancsw/machine-learning-course/master/roc.png)

*Figure 2. ROC of different methods*

Figure 2 shows that the proposed design has a clear advantage over the other classifiers, which is expected because our design extracts information from the image and uses it to make the decision.

|  | Proposed design | Random guess | All non-PV |
| :------: | :------: | :------: | :------: |
| AUC | 0.851 | 0.498 | 0.500 |

*Table 1. AUC of different methods*

Table 1 shows that the AUC of the proposed design is considerably larger than that of the other two methods, which matches the ROC curves above.
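The baseline AUC values near 0.5 can be reproduced with a short sketch. It computes AUC as the probability that a randomly chosen positive image receives a higher score than a randomly chosen negative one, with ties counted as one half. The synthetic labels (505 positives out of 1,500) mirror the dataset size, but the scores are simulated, so the numbers are illustrative and not the report's results.

```python
import numpy as np

def auc_score(labels, scores):
    """AUC via pairwise comparison of positive and negative scores (ties count 0.5)."""
    labels = np.asarray(labels)
    scores = np.asarray(scores, dtype=float)
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    diff = pos[:, None] - neg[None, :]  # every positive against every negative
    wins = (diff > 0).sum() + 0.5 * (diff == 0).sum()
    return wins / (len(pos) * len(neg))

rng = np.random.default_rng(0)
labels = np.zeros(1500, dtype=int)
labels[:505] = 1                    # 505 images with PV, 995 without

random_scores = rng.random(1500)    # baseline 1: random guessing
constant_scores = np.zeros(1500)    # baseline 2: every image scored as non-PV

print(auc_score(labels, random_scores))    # close to 0.5; exact value depends on the seed
print(auc_score(labels, constant_scores))  # 0.5
```

A classifier that gives every image the same score cannot rank positives above negatives, so its AUC is exactly 0.5, while random guessing fluctuates slightly around 0.5.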

### Conclusions

conclusion

### Roles

### References

EIA. (2017). Electric Power Monthly. https://www.eia.gov/electricity/monthly/

EPA. (2015). Sources of Greenhouse Gas Emissions. https://www.epa.gov/ghgemissions/sources-greenhouse-gas-emissions
   
Felzenszwalb, P. F., & Huttenlocher, D. P. (2004). Efficient graph-based image segmentation. International journal of computer vision, 59(2), 167-181.

Golovko, V., Bezobrazov, S., Kroshchanka, A., Sachenko, A., Komar, M., & Karachka, A. (2017, September). Convolutional neural network based solar photovoltaic panel detection in satellite photos. In Intelligent Data Acquisition and Advanced Computing Systems: Technology and Applications (IDAACS), 2017 9th IEEE International Conference on (Vol. 1, pp. 14-19). IEEE.

Malof, J. M., Bradbury, K., Collins, L. M., & Newell, R. G. (2016). Automatic detection of solar photovoltaic arrays in high resolution aerial imagery. Applied energy, 183, 229-240.

Zahedi, A. (2011). A review of drivers, benefits, and challenges in integrating renewable energy sources into electricity grid. Renewable and Sustainable Energy Reviews, 15(9), 4775-4779.