In [None]:
# This R environment comes with many helpful analytics packages installed
# It is defined by the kaggle/rstats Docker image: https://github.com/kaggle/docker-rstats
# For example, here's a helpful package to load

library(tidyverse) # metapackage of all tidyverse packages
library(ggplot2)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

list.files(path = "../input")

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

# Import Data
The dataset used for this practice notebook is the House Prices competition data from Kaggle: https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data

In [None]:
df = read.csv('../input/house-prices-advanced-regression-techniques/train.csv')
head(df)

In [None]:
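# Count the missing values in each column, largest first
missing.values <- df %>%
    gather(key = "key", value = "val") %>%       # long format: one row per cell
    mutate(is.missing = is.na(val)) %>%
    group_by(key, is.missing) %>%
    summarise(num.missing = n()) %>%
    filter(is.missing == TRUE) %>%               # keep only the NA counts
    select(-is.missing) %>%
    arrange(desc(num.missing))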

In [None]:
options(repr.plot.width = 10, repr.plot.height = 7)
missing.values %>%
    ggplot() +
    geom_bar(aes(x=key, y=num.missing), stat = 'identity') +
    labs(x='variable', y="number of missing values", title='Number of missing values') +
    theme(axis.text.x = element_text(angle = 45, hjust = 1))

Before going through the data, I first check whether the dataset has missing values.
The chart above shows that 19 columns contain missing values,
and 5 of those variables are missing in roughly half of the rows or more, so they will need cleaning later.
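
To put the counts on a percentage scale, the share of missing values per column can be computed directly. This is a minimal base R sketch on a made-up data frame (`toy` is hypothetical and stands in for `df`):

```r
# Hypothetical toy data standing in for df
toy <- data.frame(
  a = c(1, NA, 3, NA),
  b = c("x", "y", NA, "z"),
  c = 1:4
)

# is.na() on a data frame gives a logical matrix, so colMeans() is the NA share per column
missing_pct <- sort(colMeans(is.na(toy)) * 100, decreasing = TRUE)

# Keep only the columns that actually have missing values
missing_pct <- missing_pct[missing_pct > 0]
print(missing_pct)
```

Running the same two lines on `df` should list the columns with missing values, ordered by how much of each column is missing.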

# Initial EDA

In [None]:
glimpse(df)

First look at the data:
- there are 1,460 rows and 81 columns
- the columns are either character (string) or integer
- the target variable is `SalePrice`, which is the dependent variable
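
The same facts can be checked without reading the whole `glimpse()` output. A small sketch, assuming a made-up three-column data frame (`toy` is hypothetical, not the real data):

```r
# Hypothetical toy data with the same two column types as df
toy <- data.frame(
  Id = 1:3,
  Street = c("Pave", "Grvl", "Pave"),
  SalePrice = c(208500L, 181500L, 223500L),
  stringsAsFactors = FALSE
)

dims <- dim(toy)                           # rows, columns
type_counts <- table(sapply(toy, class))   # how many columns of each type
print(dims)
print(type_counts)
```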

# 1. Start with the target `SalePrice` 

Summary statistics of `SalePrice`

In [None]:
df$SalePrice %>% summary()

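- the mean sale price is about 180k
- the highest sale price is 755k
- the lowest sale price is about 34k
- the prices are not normally distributed: the mean sits above the median, which hints at a right skew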

Histogram `SalePrice`

In [None]:
options(repr.plot.width = 10, repr.plot.height = 7)
df %>%
    ggplot(aes(x=SalePrice))+
    geom_histogram(bins = 30)

The distribution is positively skewed: most houses sit at the lower end and a long tail stretches to the right.
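
The skew can also be expressed as a number. Below is a minimal sketch of the moment-based skewness coefficient in base R; the `skewness()` helper and the `prices` vector are made up for illustration and are not part of the notebook:

```r
# Hypothetical prices with one large value on the right
prices <- c(100, 120, 130, 150, 170, 200, 600)

# Moment-based skewness: third central moment over the 1.5 power of the variance
skewness <- function(x) {
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

skew_value <- skewness(prices)
print(skew_value)                    # positive for a right-skewed sample
print(mean(prices) > median(prices)) # the mean is pulled toward the long tail
```

A value above zero means a longer right tail, which matches what the histogram shows.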

Boxplot `SalePrice`

In [None]:
df %>%
    ggplot(aes(x=SalePrice))+
    geom_boxplot()

- there are outliers in the data, and 2 of them lie far beyond the others
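
A boxplot marks points that fall more than 1.5 times the interquartile range away from the box. A rough sketch of that rule on made-up numbers (the `prices` vector is hypothetical, and `quantile()` may differ slightly from the hinges that `geom_boxplot()` uses):

```r
# Hypothetical prices with one extreme value
prices <- c(100, 120, 130, 140, 150, 160, 170, 600)

q <- unname(quantile(prices, c(0.25, 0.75)))
iqr <- q[2] - q[1]

lower <- q[1] - 1.5 * iqr   # lower fence
upper <- q[2] + 1.5 * iqr   # upper fence

outliers <- prices[prices < lower | prices > upper]
print(outliers)
```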

In [None]:
df %>%
  ggplot(aes(x = SalePrice)) +
  geom_histogram(aes(fill = factor(YrSold)), alpha = 0.5)

Before moving on, I want to see how house sale prices look across the years.
As far as I can see, house sales peak in 2006 and then slope down slightly over the years until 2010.
This is likely related to the financial crisis that started in 2007.
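
Overlapping histograms can be hard to read, so a small table per year helps back up the observation. A minimal base R sketch on made-up rows (`toy` is hypothetical; on the real data it would be `df`):

```r
# Hypothetical sales, standing in for df
toy <- data.frame(
  YrSold = c(2006, 2006, 2007, 2008, 2008, 2008),
  SalePrice = c(200, 220, 180, 150, 170, 160)
)

sales_per_year <- table(toy$YrSold)                                   # number of sales per year
median_price <- aggregate(SalePrice ~ YrSold, data = toy, FUN = median) # median price per year
print(sales_per_year)
print(median_price)
```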

# 2. Correlations

After the univariate analysis of the target variable,
let's check how the target variable correlates with the other variables.

# Heatmap

In [None]:
options(repr.plot.width = 20, repr.plot.height = 15)
# Drop Id and the numeric columns that contain NAs, since cor() would return NA for them
df %>%
  select(-Id, -LotFrontage, -MasVnrArea, -GarageYrBlt) %>%
  select_if(is.numeric) %>%
  cor() %>%
  corrplot::corrplot(type = 'lower', method = 'number')

From the heatmap above, we can clearly see 5 variables that have a strong correlation with the target variable:
1. OverallQual = 0.79
2. GrLivArea = 0.71
3. GarageCars = 0.64
4. GarageArea = 0.62
5. 1stFlrSF = 0.61

These variables suggest that the quality of the house, its living area, and its garage size are all associated with SalePrice.
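
Reading the numbers off a large heatmap is error-prone, so the correlations with the target can also be pulled out and sorted. A minimal sketch on a made-up data frame (`toy` and its column names `Qual`, `Area`, `Noise` are hypothetical):

```r
# Hypothetical numeric data standing in for the numeric columns of df
toy <- data.frame(
  SalePrice = c(100, 150, 200, 250, 300),
  Qual = c(3, 5, 6, 8, 9),
  Area = c(900, 1000, 1500, 1400, 2000),
  Noise = c(5, 1, 4, 2, 3)
)

# Take the SalePrice column of the correlation matrix and drop the self-correlation
cors <- cor(toy)[, "SalePrice"]
cors <- sort(cors[names(cors) != "SalePrice"], decreasing = TRUE)
print(round(cors, 2))
```

On the real data, `toy` would be replaced by the numeric columns of `df` that have no missing values.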

# 3. Relationships with the target variable

As you would expect, if the quality is low then so is the price, and vice versa. Let's see what the visuals show us.
The corresponding variable is OverallQual, whose values are:
- 10 = Very Excellent
- 9 = Excellent
- 8 = Very Good
- 7 = Good
- 6 = Above Average
- 5 = Average
- 4 = Below Average
- 3 = Fair
- 2 = Poor
- 1 = Very Poor
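
The scale above can be attached to the numeric scores as an ordered factor, which keeps the labels readable in plots and tables. A minimal sketch, where `scores` is a made-up vector standing in for `df$OverallQual`:

```r
# Labels for scores 1 to 10, in ascending order
qual_labels <- c("Very Poor", "Poor", "Fair", "Below Average", "Average",
                 "Above Average", "Good", "Very Good", "Excellent", "Very Excellent")

# Hypothetical scores, standing in for df$OverallQual
scores <- c(5, 7, 10, 1)

# ordered = TRUE keeps the ranking, so comparisons like > still work
qual <- factor(scores, levels = 1:10, labels = qual_labels, ordered = TRUE)
print(qual)
```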

In [None]:
options(repr.plot.width = 10, repr.plot.height = 7)
df %>%
    ggplot(aes(x = factor(OverallQual), y = SalePrice)) +
    geom_boxplot()

In [None]:
df %>%
  ggplot(aes(x = GrLivArea, y = SalePrice)) +
  geom_smooth(method='lm', se=FALSE)+
  labs(title = 'Sale price vs. living area', subtitle = 'Larger living area tends to go with a higher price \n')+
  geom_point()

In [None]:
df %>%
  ggplot(aes(x = GarageCars, y = SalePrice)) +
  geom_smooth(method='lm', se=FALSE)+
  geom_point()

In [None]:
df %>%
  ggplot(aes(x = GarageArea, y = SalePrice)) +
  geom_smooth(method='lm', se=FALSE)+
  geom_point()

In [None]:
df %>%
  ggplot(aes(x = X1stFlrSF, y = SalePrice)) +
  geom_smooth(method='lm', se=FALSE)+
  geom_point()

The graphs above plot SalePrice against the variables that the heatmap showed as strongly correlated:
1. The OverallQual boxplot confirms the expectation: the higher the quality, the higher the price.
2. The GrLivArea graph has 2 anomalies/outliers: houses with a living area of more than 4k square feet
   that sold for under 200k. What happened? Is there a particular reason the owners sold below the market?
3. GarageCars and GarageArea behave almost the same, like twins.
4. 1stFlrSF looks roughly the same as GrLivArea.
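
The two anomalies in point 2 can be pulled out with a simple filter on the thresholds mentioned there. A minimal sketch on made-up rows (`toy` is hypothetical; whether to drop such rows is a modelling choice that this notebook does not make):

```r
# Hypothetical rows standing in for df
toy <- data.frame(
  Id = 1:4,
  GrLivArea = c(1500, 4700, 2200, 5600),
  SalePrice = c(180000, 184000, 250000, 160000)
)

# Large houses (> 4000 sq ft) that sold for less than 200k
suspicious <- subset(toy, GrLivArea > 4000 & SalePrice < 200000)
print(suspicious)
```

Looking at the other columns of the flagged rows is one way to start answering why they sold so low.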

# 4. Feature Engineering

In [None]:
df2 = df %>%
  mutate(age_after_remodAdd = 2021 - YearRemodAdd,          # years since the remodel, relative to 2021
         age_built_to_remodAdd = YearRemodAdd - YearBuilt,  # years from construction to remodel
         age_sell = YrSold - YearBuilt,                     # age of the house at sale, in years
         age_sell_to_month = age_sell * 12,                 # age of the house at sale, in months
         OQ_redefine = case_when(OverallQual <= 4 ~ "low",
                        OverallQual > 4 & OverallQual < 8  ~ "medium",
                        OverallQual >= 8 ~ "high"),
         MS_redefine = case_when(MoSold <= 6 ~ "Early Year",
                        MoSold > 6 ~ "Late Year"),
         HouseType = case_when(YearBuilt >= 1800 & YearBuilt <= 1950 ~ "Antique",
                     YearBuilt >= 1951 & YearBuilt <= 2007 ~ "Recent",
                     YearBuilt >= 2008 ~ "Modern"),         # >= so that 2008 does not fall through to NA
         quartal = case_when(MoSold <= 3 ~ "Quarter 1",
                        MoSold > 3 & MoSold <= 6  ~ "Quarter 2",
                        MoSold > 6 & MoSold <= 9  ~ "Quarter 3",
                        MoSold > 9 ~ "Quarter 4"))

The columns produced by the code above are `age_after_remodAdd`, `age_built_to_remodAdd`, `age_sell`, `age_sell_to_month`, `OQ_redefine`, `MS_redefine`, `HouseType`, and `quartal`.
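
For comparison, the same kind of binning can be written in base R. This is only a sketch of an alternative, not what the notebook uses: `cut()` bins the quality score and integer division gives the quarter. The `qual` and `month` vectors are made up:

```r
# Hypothetical values standing in for df$OverallQual and df$MoSold
qual <- c(3, 4, 5, 7, 8, 10)
month <- c(1, 3, 4, 6, 9, 12)

# Intervals are (0,4], (4,7], (7,10], matching the low / medium / high rules
qual_group <- cut(qual, breaks = c(0, 4, 7, 10), labels = c("low", "medium", "high"))

# Months 1-3 map to 1, 4-6 to 2, 7-9 to 3, 10-12 to 4
quarter <- paste("Quarter", (month - 1) %/% 3 + 1)

print(data.frame(qual, qual_group, month, quarter))
```

`case_when()` reads more explicitly when the rules are irregular, while `cut()` is compact for evenly described ranges.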

In [None]:
df2 %>% 
  select(Id, age_after_remodAdd, age_built_to_remodAdd, age_sell, age_sell_to_month, OQ_redefine, MS_redefine, HouseType, quartal )

# 5. Conclusion

This is the end of my practice exercise.

I did an EDA of SalePrice on its own and together with several correlated variables, and then did some feature engineering on the data. Is this the end? Nope.
There are tons of other things you can do with this data, and I'm going to keep learning so I can update this notebook and make it more advanced :)
