# Data Preprocessing with R

In [1]:
# Load the tidyverse packages
library(tidyverse)

"package 'stringr' was built under R version 4.3.1"
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.2     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.4.2     ✔ tibble    3.2.1
✔ lubridate 1.9.2     ✔ tidyr     1.3.0
✔ purrr     1.0.1
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors


In [2]:
df <- read.csv('out_put_2.csv')

In [3]:
head(df)

|   | DATE | CHANCE_OF_PRECIPITATION | TEMPERATURE | FEELS_LIKE_TEMPERATURE | WIND_GUST | VISIBILITY | HUMIDITY | UV | WIND_DIRECTION | SPEED |
|---|---|---|---|---|---|---|---|---|---|---|
|   | `<chr>` | `<int>` | `<int>` | `<int>` | `<int>` | `<chr>` | `<int>` | `<int>` | `<chr>` | `<int>` |
| 1 | 2023-06-15 18:00:00 | 60 | 30 | 33 | 13 | VG | 73 | 1 | WSW | 7 |
| 2 | 2023-06-15 19:00:00 | 60 | 29 | 32 | 11 | VG | 80 | 0 | WSW | 6 |
| 3 | 2023-06-15 20:00:00 | 60 | 28 | 31 | 10 | VG | 82 | 0 | WSW | 6 |
| 4 | 2023-06-15 21:00:00 | 10 | 28 | 31 | 9 | VG | 84 | 0 | WSW | 5 |
| 5 | 2023-06-15 22:00:00 | 10 | 28 | 31 | 8 | VG | 86 | 0 | SW | 4 |
| 6 | 2023-06-15 23:00:00 | 10 | 27 | 31 | 8 | VG | 88 | 0 | SW | 4 |


### Label Transformation `VISIBILITY`

In [4]:
df$VISIBILITY <- ifelse(df$VISIBILITY == "G", 0, 1)

In [5]:
table(df$VISIBILITY)


  0   1 
118 296 
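
The `ifelse()` rule above sends `"G"` to 0 and every other non-missing code to 1, while a missing value stays `NA`. A minimal sketch on a toy vector shows this; the codes `"E"` and `NA` are invented for illustration and are not known to occur in `out_put_2.csv`.

```r
# Toy visibility codes (made up for illustration; not taken from out_put_2.csv)
visibility <- c("VG", "G", "G", "E", NA)

# Same rule as the cell above: "G" becomes 0, any other non-missing code becomes 1
encoded <- ifelse(visibility == "G", 0, 1)

print(encoded)

# Cross-tabulate original against encoded values to check the mapping
print(table(original = visibility, encoded = encoded, useNA = "ifany"))
```

If the column held more than two codes, a cross-tabulation like this would make it visible that all of them except `"G"` end up in the same class.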

### Label Transformation `WIND_DIRECTION`

In [6]:
table(df$WIND_DIRECTION)


ESE NNE NNW  NW   S  SE SSE SSW  SW   W WNW WSW 
  4   1   1   6  29   3  24  91  84  69  12  90 

We observed that wind directions containing W come with relatively high precipitation, whereas those containing S show the opposite.
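
One way such an observation could be checked is a grouped mean per wind direction. The sketch below uses a toy data frame whose values are invented, so the printed numbers say nothing about the real data; only the column names are borrowed from the notebook.

```r
# Toy hourly records (values invented for illustration only)
toy <- data.frame(
  WIND_DIRECTION = c("W", "WNW", "S", "SSE", "SW", "WSW"),
  CHANCE_OF_PRECIPITATION = c(60, 50, 10, 5, 40, 30)
)

# Mean chance of precipitation for each wind direction
mean_by_direction <- tapply(toy$CHANCE_OF_PRECIPITATION, toy$WIND_DIRECTION, mean)

# Highest averages first
print(sort(mean_by_direction, decreasing = TRUE))
```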

I will therefore collapse the directions into three nominal groups (NNE, which occurs only once, is placed in the S group):
- ESE, S, SE, SSE, NNE -> S
- NNW, NW, W, WNW -> W
- SSW, SW, WSW -> B (Both)

In [7]:
df$WIND_DIRECTION <- ifelse(df$WIND_DIRECTION %in% c("ESE", "S", "SE", "SSE", "NNE"), "S",
                ifelse(df$WIND_DIRECTION %in% c("NNW", "NW", "W", "WNW"), "W",
                       ifelse(df$WIND_DIRECTION %in% c("SSW", "SW", "WSW"), "B", NA)))

In [8]:
table(df$WIND_DIRECTION)


  B   S   W 
265  61  88 
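
The nested `ifelse()` can also be expressed as a named lookup vector, which keeps the whole mapping in one place. This is an alternative sketch, not the notebook's method; `direction_group` and the toy input are hypothetical names and values.

```r
# Named lookup vector: names are raw directions, values are the grouped labels
direction_group <- c(
  ESE = "S", S = "S", SE = "S", SSE = "S", NNE = "S",
  NNW = "W", NW = "W", W = "W", WNW = "W",
  SSW = "B", SW = "B", WSW = "B"
)

# Toy input (invented); "N" is not in the lookup and therefore maps to NA
raw <- c("WSW", "SW", "S", "NNW", "N")
grouped <- unname(direction_group[raw])

print(grouped)
```

As with the final `NA` branch of the nested `ifelse()`, any direction missing from the lookup comes back as `NA`, so unmapped values are easy to spot with `table(grouped, useNA = "ifany")`.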

#### One-hot encoding for `WIND_DIRECTION`

In [9]:
encoded_WIND_DIRECTION <- model.matrix(~ df$WIND_DIRECTION - 1)
colnames(encoded_WIND_DIRECTION) <- c("B", "S", "W")
encode_df <- as.data.frame(encoded_WIND_DIRECTION)
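
Assigning `c("B", "S", "W")` by hand works because `model.matrix()` orders the indicator columns by factor level, which is alphabetical for a character column. A slightly more defensive sketch, on an invented toy vector, takes the column names from the factor levels instead. Note also that `model.matrix()` drops rows containing `NA` under the default `na.action`, which would misalign a later `cbind()`; that did not happen here, since every direction was mapped.

```r
# Toy grouped directions (invented for illustration)
wind <- factor(c("B", "S", "W", "B"), levels = c("B", "S", "W"))

# One indicator column per level; "- 1" drops the intercept so no level is absorbed
encoded <- model.matrix(~ wind - 1)

# Derive column names from the factor levels instead of typing them by hand
colnames(encoded) <- levels(wind)
encoded_df <- as.data.frame(encoded)

print(encoded_df)
```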

In [10]:
head(encode_df)

|   | B | S | W |
|---|---|---|---|
|   | `<dbl>` | `<dbl>` | `<dbl>` |
| 1 | 1 | 0 | 0 |
| 2 | 1 | 0 | 0 |
| 3 | 1 | 0 | 0 |
| 4 | 1 | 0 | 0 |
| 5 | 1 | 0 | 0 |
| 6 | 1 | 0 | 0 |


In [11]:
df <- cbind(df[, -which(names(df) == "WIND_DIRECTION")], encode_df)

In [12]:
head(df)

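|   | DATE | CHANCE_OF_PRECIPITATION | TEMPERATURE | FEELS_LIKE_TEMPERATURE | WIND_GUST | VISIBILITY | HUMIDITY | UV | SPEED | B | S | W |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
|   | `<chr>` | `<int>` | `<int>` | `<int>` | `<int>` | `<dbl>` | `<int>` | `<int>` | `<int>` | `<dbl>` | `<dbl>` | `<dbl>` |
| 1 | 2023-06-15 18:00:00 | 60 | 30 | 33 | 13 | 1 | 73 | 1 | 7 | 1 | 0 | 0 |
| 2 | 2023-06-15 19:00:00 | 60 | 29 | 32 | 11 | 1 | 80 | 0 | 6 | 1 | 0 | 0 |
| 3 | 2023-06-15 20:00:00 | 60 | 28 | 31 | 10 | 1 | 82 | 0 | 6 | 1 | 0 | 0 |
| 4 | 2023-06-15 21:00:00 | 10 | 28 | 31 | 9 | 1 | 84 | 0 | 5 | 1 | 0 | 0 |
| 5 | 2023-06-15 22:00:00 | 10 | 28 | 31 | 8 | 1 | 86 | 0 | 4 | 1 | 0 | 0 |
| 6 | 2023-06-15 23:00:00 | 10 | 27 | 31 | 8 | 1 | 88 | 0 | 4 | 1 | 0 | 0 |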


## Data Normalization

In [13]:
table(df$CHANCE_OF_PRECIPITATION)


  5  10  20  30  40  50  60  70 
 30 170   9  38  70  17  58  22 

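##### Data binning
`CHANCE_OF_PRECIPITATION` takes only the values 5, 10, 20, 30, 40, 50, 60 and 70, so it is binned into five classes. `cut()` uses right-closed intervals by default, and `include.lowest = TRUE` also closes the first bin on the left:

- [5, 20]: 0
- (20, 30]: 1
- (30, 40]: 2
- (40, 50]: 3
- (50, 70]: 4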
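
To see which bin each observed value falls into, the same breaks can be applied to the eight distinct values from the table above. This is a small sketch for checking the boundaries, using the breaks and labels from this section; `intervals` and `bins` are illustrative names.

```r
# The distinct values observed in the frequency table above
values <- c(5, 10, 20, 30, 40, 50, 60, 70)

breaks <- c(5, 20, 30, 40, 50, 70)

# Without labels, cut() names each bin by its interval
intervals <- cut(values, breaks = breaks, include.lowest = TRUE)

# With integer labels 0 to 4, matching the class codes listed above
bins <- cut(values, breaks = breaks, labels = 0:4, include.lowest = TRUE)

print(data.frame(value = values, interval = intervals, bin = bins))
```

Because the intervals are closed on the right, the boundary values 20, 30, 40 and 50 each belong to the lower of the two neighbouring bins.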

In [14]:
breaks <- c(5, 20, 30, 40, 50, 70)
labels <- c(0, 1, 2, 3, 4)
df$CHANCE_OF_PRECIPITATION <- cut(df[['CHANCE_OF_PRECIPITATION']], breaks = breaks, labels = labels, include.lowest = TRUE)


In [18]:
table(df$CHANCE_OF_PRECIPITATION)


  0   1   2   3   4 
209  38  70  17  80 

In [15]:
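# Set aside DATE (text) and the binned CHANCE_OF_PRECIPITATION factor, which are not rescaled
df_without_date <- subset(df, select = -c(DATE, CHANCE_OF_PRECIPITATION))

# Min-max scale every remaining column to the range [0, 1]
normalized_df <- as.data.frame(apply(df_without_date, 2, function(x) {
  (x - min(x)) / (max(x) - min(x))
}))

# Reattach DATE and CHANCE_OF_PRECIPITATION in front of the scaled columns
df <- cbind(DATE = df$DATE, CHANCE_OF_PRECIPITATION = df$CHANCE_OF_PRECIPITATION, normalized_df)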
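
Min-max scaling divides by `max(x) - min(x)`, which is zero for a constant column and would produce `NaN`. The sketch below wraps the same formula in a hypothetical helper, `min_max`, with a guard for that case. Mapping a constant column to 0 is an assumption made here for illustration, and the toy data is invented.

```r
# Hypothetical helper: min-max scaling with a guard for constant columns
min_max <- function(x) {
  rng <- range(x, na.rm = TRUE)
  if (rng[1] == rng[2]) {
    # Assumption: map a constant column to 0 instead of dividing by zero
    return(rep(0, length(x)))
  }
  (x - rng[1]) / (rng[2] - rng[1])
}

# Toy columns (invented); `flat` is constant on purpose
toy <- data.frame(TEMPERATURE = c(26, 28, 30, 32), flat = c(3, 3, 3, 3))

# lapply() works column by column, so the data frame is never converted to a matrix
scaled <- as.data.frame(lapply(toy, min_max))

print(scaled)
```

Using `lapply()` rather than `apply()` is a design choice: `apply()` first coerces the data frame to a matrix, which would turn every column into text if a single non-numeric column slipped in.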


In [16]:
head(df)

|   | DATE | CHANCE_OF_PRECIPITATION | TEMPERATURE | FEELS_LIKE_TEMPERATURE | WIND_GUST | VISIBILITY | HUMIDITY | UV | SPEED | B | S | W |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
|   | `<chr>` | `<fct>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` |
| 1 | 2023-06-15 18:00:00 | 4 | 0.6666667 | 0.6666667 | 0.4090909 | 1 | 0.4054054 | 0.1 | 0.4166667 | 1 | 0 | 0 |
| 2 | 2023-06-15 19:00:00 | 4 | 0.5 | 0.5 | 0.3181818 | 1 | 0.5945946 | 0.0 | 0.3333333 | 1 | 0 | 0 |
| 3 | 2023-06-15 20:00:00 | 4 | 0.3333333 | 0.3333333 | 0.2727273 | 1 | 0.6486486 | 0.0 | 0.3333333 | 1 | 0 | 0 |
| 4 | 2023-06-15 21:00:00 | 0 | 0.3333333 | 0.3333333 | 0.2272727 | 1 | 0.7027027 | 0.0 | 0.25 | 1 | 0 | 0 |
| 5 | 2023-06-15 22:00:00 | 0 | 0.3333333 | 0.3333333 | 0.1818182 | 1 | 0.7567568 | 0.0 | 0.1666667 | 1 | 0 | 0 |
| 6 | 2023-06-15 23:00:00 | 0 | 0.1666667 | 0.3333333 | 0.1818182 | 1 | 0.8108108 | 0.0 | 0.1666667 | 1 | 0 | 0 |


## Save the data for model building

In [17]:
write.csv(df,'out_put_3.csv',row.names=FALSE)
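
One thing to keep in mind when the file is loaded again: a CSV stores no column types, so the `CHANCE_OF_PRECIPITATION` factor should come back from `read.csv()` as plain integers. The round trip below is a sketch on an invented toy frame written to a temporary file; `class_after_read` is an illustrative name.

```r
# Toy frame with a factor column (values invented for illustration)
toy <- data.frame(
  CHANCE_OF_PRECIPITATION = factor(c(4, 0, 2), levels = 0:4),
  TEMPERATURE = c(0.67, 0.50, 0.33)
)

# Write to a temporary file so nothing in the working directory is touched
path <- tempfile(fileext = ".csv")
write.csv(toy, path, row.names = FALSE)

# read.csv() sees only digits in the column, so it is read back as integer
restored <- read.csv(path)
class_after_read <- class(restored$CHANCE_OF_PRECIPITATION)
print(class_after_read)

# Restore the factor explicitly if later steps need a categorical column
restored$CHANCE_OF_PRECIPITATION <- factor(restored$CHANCE_OF_PRECIPITATION, levels = 0:4)
str(restored)
```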