In [None]:
# This R environment comes with many helpful analytics packages installed
# It is defined by the kaggle/rstats Docker image: https://github.com/kaggle/docker-rstats
# For example, here's a helpful package to load

library(tidyverse) # metapackage of all tidyverse packages

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

list.files(path = "../input")

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

# **Import**

In [None]:
library(dplyr)
library(ggplot2)
library(psych)

# **1. Data Import & Structure**

In [None]:
data <- read.csv('/kaggle/input/data-penjualan-zara/zara.csv', sep = ";", quote = "\"" ,stringsAsFactors = FALSE)

**1. Product ID: Unique identifier for each product.**

**2. Product Position: The position of the product in the catalog or store layout.**

**3. Promotion: Indicator of whether the product is currently on promotion or not.**

**4. Product Category: The category of the product, such as clothing, accessories, shoes, etc.**

**5. Seasonal: Indicator of whether the product is part of a specific seasonal collection.**

**6. Sales Volume: The quantity of products sold.**

**7. Brand: Brand of the product.**

**8. URL: Product URL (e.g., if the product is sold online).**

**9. SKU: Stock Keeping Unit, a unique code used to identify items available for sale.**

**10. Name: Name of the product.**

**11. Description: Description of the product.**

**12. Price: Price of the product.**

**13. Currency: Currency of the product price.**

**14. Scraped_at: The time when the data was scraped (e.g., in web scraping process).**

**15. Terms: Terms or conditions of the product.**

**16. Section: Section or category where the product is sold in the store (e.g., women's clothing, men's clothing, children's clothing, etc.).**

**Check data before preprocessing**

In [None]:
head(data)

In [None]:
str(data)

**Missing value Check**

In [None]:
colSums(is.na(data))
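
`is.na()` only counts true `NA` values. When a CSV is read with `stringsAsFactors = FALSE`, empty text fields usually arrive as `""` rather than `NA`, so they would not show up in the check above. The following is a minimal sketch, on a hypothetical toy data frame (`toy`) and with a hypothetical helper (`count_missing`), of how both kinds of gaps could be counted:

```r
# Hypothetical toy data standing in for the real table
toy <- data.frame(
  name  = c("Jacket", "", "Shirt", NA),
  price = c(99.9, 49.9, NA, 19.9),
  stringsAsFactors = FALSE
)

# Count NA values and blank (empty or whitespace-only) strings per column
count_missing <- function(df) {
  sapply(df, function(col) {
    n_na <- sum(is.na(col))
    n_blank <- if (is.character(col)) sum(trimws(col) == "", na.rm = TRUE) else 0L
    c(n_na = n_na, n_blank = n_blank)
  })
}

missing_summary <- count_missing(toy)
print(missing_summary)
```

Applied to the real table, the same helper would be called as `count_missing(data)`.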

In [None]:
dim(data)

In [None]:
describe(data)

**Remove Unnecessary Variables**

In [None]:
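# Drop columns 8, 9, 11 and 14 (URL, SKU, Description and Scraped_at in the list above)
data <- data[, -c(8:9, 11, 14)]
data2 <- data  # copy that keeps the original labels, used for data visualization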
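
Dropping columns by position silently removes the wrong variables if the column order of the file ever changes. As a sketch of an alternative, the columns could be dropped by name. The column names in this toy example are assumptions for illustration; the real names should be taken from `names(data)`:

```r
# Hypothetical toy data; the column names are assumed, not read from the real file
toy <- data.frame(
  Product.ID  = 1:3,
  url         = c("u1", "u2", "u3"),
  sku         = c("s1", "s2", "s3"),
  description = c("d1", "d2", "d3"),
  price       = c(19.9, 49.9, 99.9),
  scraped_at  = c("t1", "t2", "t3"),
  stringsAsFactors = FALSE
)

# Columns considered unnecessary for the analysis
drop_cols <- c("url", "sku", "description", "scraped_at")

# Keep every column whose name is not in drop_cols
toy_kept <- toy[, !(names(toy) %in% drop_cols), drop = FALSE]
print(names(toy_kept))
```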

**Variable Type Transformation**

In [None]:
data$Product.Position <- as.factor(data$Product.Position)
data$Product.Category <- as.factor(data$Product.Category)
data$brand <- as.factor(data$brand)
data$terms <- as.factor(data$terms)
data$section <- as.factor(data$section)
data$name <- as.factor(data$name)
data$currency <- as.factor(data$currency)

**Data Transform**

In [None]:
# Promotion: No -> 0, Yes -> 1, any other value -> 2
data$Promotion <- ifelse(data$Promotion == 'No', 0,
                         ifelse(data$Promotion == 'Yes', 1, 2))

# Seasonal: No -> 0, Yes -> 1, any other value -> 2
data$Seasonal <- ifelse(data$Seasonal == 'No', 0,
                        ifelse(data$Seasonal == 'Yes', 1, 2))

# section: MAN -> 0, WOMEN -> 1, any other value -> 2
data$section <- ifelse(data$section == 'MAN', 0,
                       ifelse(data$section == 'WOMEN', 1, 2))
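
The nested `ifelse()` calls above send every unexpected label to 2, which can hide spelling variants in the raw data. A possible alternative, sketched here with a hypothetical helper (`recode_yes_no`) and made-up input values, is a named lookup vector that returns `NA` for anything it does not recognise, followed by a cross-tabulation to check the mapping:

```r
# Hypothetical raw values, including a spelling variant and a missing value
promotion_raw <- c("Yes", "No", "No", "yes", NA)

# Map labels through a named lookup vector; unknown labels become NA
recode_yes_no <- function(x) {
  lookup <- c(No = 0L, Yes = 1L)
  unname(lookup[as.character(x)])
}

promotion_num <- recode_yes_no(promotion_raw)

# Cross-tabulate raw labels against recoded values to spot unmapped labels
print(table(promotion_raw, promotion_num, useNA = "ifany"))
```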


**Data check after preprocessing**

In [None]:
head(data)

In [None]:
str(data)

In [None]:
summary(data)

# **2. Data Visualization**

**The plots below use data2, the copy taken before the type transformations, so the original category labels appear on the axes.**

In [None]:
cols <- c('Product.Position', 'Promotion', 'Seasonal', 'section')

# Bar chart of the number of products in each level of every categorical variable
for (i in cols) {
    print(ggplot(data2, aes(x = .data[[i]], fill = .data[[i]])) + geom_bar() + ggtitle(paste(i, 'count')) + xlab(i) + theme_bw() + theme(legend.position = 'none'))
}

----------------------------------

In [None]:
# Box plot of price for each level of every categorical variable
for (i in cols) {
    print(ggplot(data2, aes(x = .data[[i]], y = price, fill = .data[[i]])) + geom_boxplot() + xlab(i) + ggtitle(paste(i, 'vs price')) + theme_bw() + theme(legend.position = 'none'))
}

In [None]:
# Box plot of sales volume for each level of every categorical variable
for (i in cols) {
    print(ggplot(data2, aes(x = .data[[i]], y = Sales.Volume, fill = .data[[i]])) + geom_boxplot() + xlab(i) + ggtitle(paste(i, 'vs Sales.Volume')) + theme_bw() + theme(legend.position = 'none'))
}

In [None]:
ggplot(data2, aes(x=price)) + geom_density() + ggtitle('The density of commodity prices') + xlab('price')

**There are a lot of products under 100 dollars.**

In [None]:
ggplot(data2, aes(x = price, y = Sales.Volume)) + geom_smooth(se = FALSE)

**The smoothed curve shows the relationship between price and sales volume.**

-----------------------------------------

**Average price by group**

In [None]:
data2_promotion <- data2 %>% group_by(Promotion) %>% summarize(N = n(), avg_price = round(mean(price,na.rm=T)))
data2_promotion

In [None]:
ggplot(data2_promotion, aes(x=Promotion, y= avg_price,fill = Promotion)) + geom_col()

**Products on promotion (Promotion = 'Yes') have a higher average price.**

In [None]:
data2_Seasonal <- data2 %>% group_by(Seasonal) %>% summarize(N = n(), avg_price = round(mean(price,na.rm=T)))
data2_Seasonal

In [None]:
ggplot(data2_Seasonal, aes(x=Seasonal,y=avg_price, fill = Seasonal)) + geom_col()

**There is not much difference in average price between seasonal and non-seasonal products.**

In [None]:
# Number of products and average price per section
data2_section <- data2 %>% group_by(section) %>% summarize(N = n(), avg_price = round(mean(price, na.rm = TRUE)))
data2_section

# Average price when 30 products are randomly sampled from each section
# (sample_n() draws at random, so the result changes between runs unless a seed is set)

data2_section30 <- data2 %>% group_by(section) %>% sample_n(size = 30) %>% summarize(N = n(), avg_price = round(mean(price, na.rm = TRUE)))
data2_section30
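
Because the sample above is random, the averages differ from run to run, and sampling fails if a section holds fewer rows than the requested size. The following base R sketch, using made-up data and a hypothetical helper (`sample_per_group`), fixes a seed and caps the sample size at the group size. The seed value, the section labels and the prices are assumptions for illustration:

```r
# Fix the random seed so the same rows are drawn on every run (42 is arbitrary)
set.seed(42)

# Hypothetical toy data: 50 products in one section, 10 in the other
toy <- data.frame(
  section = rep(c("MAN", "WOMAN"), times = c(50, 10)),
  price   = c(seq(10, 500, by = 10), seq(20, 110, by = 10))
)

# Draw up to n rows from each group without replacement
sample_per_group <- function(df, group_col, n) {
  groups <- split(seq_len(nrow(df)), df[[group_col]])
  idx <- unlist(lapply(groups, function(rows) {
    rows[sample.int(length(rows), size = min(n, length(rows)))]
  }))
  df[idx, , drop = FALSE]
}

sampled <- sample_per_group(toy, "section", 30)

# Average price per section in the sample
print(aggregate(price ~ section, data = sampled, FUN = mean))
```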

In [None]:
ggplot(data2_section30, aes(x=section,y=avg_price, fill = section)) + geom_col()

**The average price of men's products is higher.**

-----------------------------

In [None]:
data2_promotion2 <- data2 %>% group_by(Promotion) %>% summarize(N = n(), avg_Sales.Volume = round(mean(Sales.Volume,na.rm=T)))
data2_promotion2

In [None]:
ggplot(data2_promotion2,aes(x=Promotion, y= avg_Sales.Volume,fill = Promotion)) + geom_col()

**There is not much difference in average sales volume between promoted and non-promoted products.**

In [None]:
data2_Seasonal2 <- data2 %>% group_by(Seasonal) %>% summarize(N = n(), avg_Sales.Volume = round(mean(Sales.Volume,na.rm=T)))
data2_Seasonal2

In [None]:
ggplot(data2_Seasonal2, aes(x=Seasonal,y=avg_Sales.Volume, fill = Seasonal)) + geom_col()

**There is not much difference in average sales volume between seasonal and non-seasonal products.**

In [None]:
# Number of products and average sales volume per section
data2_section2 <- data2 %>% group_by(section) %>% summarize(N = n(), avg_Sales.Volume = round(mean(Sales.Volume, na.rm = TRUE)))
data2_section2

In [None]:
ggplot(data2_section2, aes(x=section,y=avg_Sales.Volume, fill = section)) + geom_col()

**The average sales volume per product is similar for the men's and women's sections, even though the women's section contains far fewer products.**
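
A per-product average can look the same for two groups while their totals differ widely, simply because one group has more products. The sketch below uses made-up numbers to show the count, mean and total side by side; the section labels and volumes are assumptions for illustration only:

```r
# Hypothetical toy data: 6 products in one section, 2 in the other
toy <- data.frame(
  section      = c(rep("MAN", 6), rep("WOMAN", 2)),
  Sales.Volume = c(1000, 1200, 900, 1100, 1300, 1100, 1000, 1200)
)

# Number of products, mean volume and total volume per section
n_products <- tapply(toy$Sales.Volume, toy$section, length)
by_section <- data.frame(
  section      = names(n_products),
  n            = as.vector(n_products),
  mean_volume  = as.vector(tapply(toy$Sales.Volume, toy$section, mean)),
  total_volume = as.vector(tapply(toy$Sales.Volume, toy$section, sum))
)
print(by_section)
```

In this toy case both sections average 1100 units per product, while the totals differ by a factor of three.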

----------------------------

# **3. Price & Sales.Volume prediction.**

In [None]:
library(psych)
library(forecast)

**Modeling**

**Linear Regression**

In [None]:
md_lr <- lm(price ~Promotion + Seasonal+ section + Sales.Volume ,data=data)

In [None]:
summary(md_lr)

In [None]:
step(md_lr,direction = "backward")

**Select variable**

In [None]:
md_lr <- lm(price ~ Promotion + section , data = data)

In [None]:
summary(md_lr)

**Regression equation: price = 86.014 + 12.298 * Promotion - 20.819 * section**
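
As a worked example of the equation above, the predicted price can be computed for each combination of the two predictors. The coefficients are copied from the equation; only the 0/1 codes are shown, although `section` can also take the value 2 under the recoding used earlier:

```r
# Coefficients copied from the regression equation in the text
intercept   <- 86.014
b_promotion <- 12.298
b_section   <- -20.819

# All combinations of the two 0/1 predictors
grid <- expand.grid(Promotion = 0:1, section = 0:1)

# Apply the equation row by row
grid$predicted_price <- intercept + b_promotion * grid$Promotion + b_section * grid$section
print(grid)
```

For instance, a promoted product with `section = 0` is predicted at 86.014 + 12.298 = 98.312.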

In [None]:
plot(md_lr)

In [None]:
pred <- 86.014 + data$Promotion * 12.298 + data$section * -20.819

In [None]:
# forecast::accuracy() takes the predictions first and the actual values second
accuracy(pred, data$price)
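
Typing the rounded coefficients by hand works, but `predict()` uses the full-precision estimates stored in the fitted model. The sketch below fits a model on simulated data (all values are made up) and computes a few common error measures directly, as one way to cross-check figures such as ME, RMSE and MAE:

```r
# Simulated toy data; the generating coefficients are arbitrary
set.seed(1)
toy <- data.frame(
  Promotion = rep(0:1, each = 10),
  section   = rep(0:1, times = 10)
)
toy$price <- 80 + 10 * toy$Promotion - 20 * toy$section + rnorm(20, sd = 5)

# Fit the model and predict from the model object instead of typed-in coefficients
fit  <- lm(price ~ Promotion + section, data = toy)
pred <- predict(fit, newdata = toy)

# Error = actual - predicted
errors  <- toy$price - pred
metrics <- c(
  ME   = mean(errors),          # mean error
  RMSE = sqrt(mean(errors^2)),  # root mean squared error
  MAE  = mean(abs(errors))      # mean absolute error
)
print(round(metrics, 3))
```

These are in-sample errors, since the model is evaluated on the same rows it was fitted to.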

------------------------------------------

In [None]:
md_lr2 <- lm(Sales.Volume ~  price + Promotion + Seasonal+ section  ,data=data)

In [None]:
summary(md_lr2)

**The regression model that predicts Sales.Volume is not statistically significant.**
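
The overall significance of a linear model is usually judged by the F-test shown on the last line of `summary()`. The sketch below shows how that p-value could be extracted programmatically. It uses simulated data in which the response is pure noise, so the variable values are assumptions and not results from the Zara data:

```r
# Simulated toy data: the response is unrelated to the predictors
set.seed(7)
toy <- data.frame(
  price     = runif(40, min = 10, max = 200),
  Promotion = rep(0:1, times = 20)
)
toy$Sales.Volume <- rnorm(40, mean = 1800, sd = 300)

fit <- lm(Sales.Volume ~ price + Promotion, data = toy)

# summary()$fstatistic holds the F value and its two degrees of freedom
fstat   <- summary(fit)$fstatistic
p_value <- unname(pf(fstat["value"], fstat["numdf"], fstat["dendf"], lower.tail = FALSE))
print(p_value)
```

A p-value above the chosen threshold (commonly 0.05) would mean the predictors jointly explain no more variation than expected by chance.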

----------------------------------