
# Productivity Prediction of Garment Employees



- Family Name: Xuan
- Given Name: Tianyuan
- Email: irvingxuan@gmail.com

- Last edited date: 3/9/2021

Programming Language: R 3.7 in Jupyter Notebook

R Libraries used:

- ggplot2

- reshape2

- car

- stats

- scales

- grid

- gridExtra

- RColorBrewer

- lattice

- caret

- xgboost

- plyr

## Table of Contents

1. [Introduction](#sec_1)
2. [Exploratory Data Analysis](#sec_3)
3. [Methodology](#sec_4)
4. [Model Development](#sec_5)
5. [Results and discussion](#sec_6)
6. [Conclusion](#sec_7)
7. [References](#sec_8)

## 1. Introduction <a class="anchor" id="sec_1"></a>

The file "garments_empolyee_productivity.csv" contains data on productivity in the modern garment industry. It has 14 attributes, one target value, and more than a thousand records. The data is first analyzed to decide which machine learning methods to use. Based on that analysis, the training features are transformed, and the models are improved by adding or filtering features. In addition to ordinary linear regression, logistic regression, KNN, and a decision tree model (XGBoost) are used. By comparing the models, a better model for predicting actual_productivity is finally obtained.

## 2. Exploratory Data Analysis<a class="anchor" id="sec_3"></a>

### Library

In [53]:
library(ggplot2)
library(reshape2)
library(car)
library(stats)
library(scales)
library(grid)
library(gridExtra)
library(RColorBrewer)
library(lattice)
library(caret)
library(xgboost)
library(plyr)

### Overview of the whole Dataset

In [54]:
# Load the raw dataset
GEP <- read.csv("garments_empolyee_productivity.csv")

In [None]:
# Display the dimensions
cat("The garments_empolyee_productivity has", dim(GEP)[1], "records, each with", dim(GEP)[2],
    "attributes. The structure is:\n\n")

# Display the structure
str(GEP)

cat("\nThe first few and last few records in the dataset are:")
# Inspect the first few records
head(GEP)
# And the last few
tail(GEP)

cat("\nBasic statistics for each attribute are:")
# Statistical summary 
summary(GEP)

cat("The numbers of unique values for each attribute are:")
# sapply works column by column without first coercing the data frame to a character matrix
sapply(GEP, function(x) length(unique(x)))

### Split dataset into train and test

A random 70% of the data is used as the train dataset and the remaining 30% as the test dataset.

In [56]:
set.seed(101) # Set the seed so that the same sample can be reproduced later
# Select 70% of the 'n' rows of the data as the training sample
sample <- sample.int(n = nrow(GEP), size = floor(.7*nrow(GEP)), replace = FALSE)
train <- GEP[sample, ]
test  <- GEP[-sample,]


In [None]:
head(train)
head(test)
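
As a quick sanity check on the split above, the following minimal sketch repeats the same procedure on a small synthetic data frame. The `toy` data and its column names are made up for illustration and are not the garment data. It confirms that the two parts are disjoint and together cover every row.

```r
# Synthetic stand-in for GEP (hypothetical data, 10 rows)
toy <- data.frame(id = 1:10, y = seq(0.1, 1, by = 0.1))

set.seed(101)
# Draw 70% of the row indices without replacement
idx <- sample.int(n = nrow(toy), size = floor(0.7 * nrow(toy)), replace = FALSE)
toy_train <- toy[idx, ]
toy_test  <- toy[-idx, ]

# The two parts are disjoint and together cover all rows
nrow(toy_train)                               # 7
nrow(toy_test)                                # 3
length(intersect(toy_train$id, toy_test$id))  # 0
```

The same three checks can be run on `train` and `test` directly once the CSV is loaded.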

### Summary of Attributes

The following table identifies which attributes are numerical and whether they are continuous or discrete, and which are categorical and whether they are nominal or ordinal. It includes some initial observations about the ranges and common values of the attributes.

|Attribute              |Type        |Sub-type   |Comments|
|-----------------------|------------|-----------|--------|
|date                   |Categorical |Ordinal    |Date in MM-DD-YYYY format; probably has no correlation with the target value.|
|day                    |Categorical |Ordinal    |Day of the week; probably has no correlation with the target value.|
|quarter                |Categorical |Ordinal    |A portion of the month. Each month is divided into four quarters; a few records have quarter 5.|
|department             |Categorical |Nominal    |Department associated with the instance; has 2 classes, finishing and sweing.|
|team                   |Numerical   |Discrete   |Team number associated with the instance.|
|no_of_workers          |Numerical   |Discrete   |Number of workers in each team. Ranges from 2 to 89; could have outliers.|
|no_of_style_change     |Numerical   |Discrete   |Number of changes in the style of a particular product. The majority of values are 0.|
|targeted_productivity  |Numerical   |Continuous |Targeted productivity set by the Authority for each team for each day.|
|smv                    |Numerical   |Continuous |Standard Minute Value, the time allocated for a task, with min 2.9 and max 54.56. It could have outliers.|
|wip                    |Numerical   |Continuous |Work in progress: the number of unfinished items for products. It could have outliers.|
|over_time              |Numerical   |Continuous |The amount of overtime by each team in minutes. Ranges from 0 to 25920.|
|incentive              |Numerical   |Continuous |Has 48 distinct values between 0 and 3600, with a mean of 38.21. It could have extreme outliers.|
|idle_time              |Numerical   |Continuous |The amount of time for which production was interrupted. Ranges from 0 to 300; the majority of values are 0.|
|idle_men               |Numerical   |Discrete   |The number of workers who were idle due to production interruption. Ranges from 0 to 45; the majority of values are 0.|
|actual_productivity    |Numerical   |Continuous |The actual % of productivity delivered by the workers. It ranges from 0 to 1. Our target attribute.|


### Investigate Distribution of Each Variable

In [58]:
attach(GEP)

#### View the variable distributions using boxplots (numerical attributes)

Note that only the numerical attributes are shown here.

In [None]:
# Generate box plots of all numerical variables (string columns such as date and department are dropped)
box_data <- melt(GEP[, sapply(GEP, is.numeric)])
ggplot(box_data, aes(x = variable, y = value)) +
    facet_wrap(~variable, scales = "free") +
    geom_boxplot() +
    scale_y_continuous(labels = function(n) format(n, scientific = FALSE))

#### View the variable distributions using histograms and bar charts (numerical and string attributes)

In [None]:
# Plot a histogram or bar chart of each variable
par(mfrow = c(4,3))
hist(no_of_workers)
hist(no_of_style_change)
hist(targeted_productivity)
hist(smv)
hist(wip)
hist(incentive)
hist(idle_time)
hist(idle_men)
hist(actual_productivity)
hist(team)
hist(over_time)

par(fig=c(0,1,0,0.30),ps=10,new=TRUE)


In [None]:
barplot(sort(table(actual_productivity)),las=2,main="Bar Chart of actual_productivity")

In [None]:
par(mfrow = c(2,2))
plot(as.factor(date),main="Bar Chart of date")
plot(as.factor(day), main="Bar Chart of day")
plot(as.factor(quarter), main="Bar Chart of quarter")
plot(as.factor(department), main="Bar Chart of department")
par(fig=c(0,1,0,0.30),ps=10,new=TRUE)

These graphs show:
- wip, incentive, idle_time, and idle_men all have large positive skews.
- wip, incentive, idle_time, idle_men, and no_of_style_change contain many zero values.
- Most incentive values are around 50, although some exceed one thousand.
- Team 1 has more work tasks than the other teams, which are roughly equal to one another.
- Very few records have quarter 5.
- The majority of records belong to the sweing department.
- Judging from the boxplots alone, many attributes have extreme outliers, so these outliers also matter.
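
To put a number on what the boxplots suggest, outliers can be counted with Tukey's 1.5 × IQR rule, which is close to the rule the boxplot whiskers follow. This is only a sketch: the vector `x` below is a synthetic right-skewed sample standing in for a column such as incentive, not the real data.

```r
# Hypothetical right-skewed sample; one extreme value is appended on purpose
set.seed(1)
x <- c(rexp(200, rate = 1 / 40), 3600)

# Tukey's rule: points more than 1.5 * IQR beyond the quartiles are flagged
q   <- unname(quantile(x, c(0.25, 0.75)))
iqr <- q[2] - q[1]
is_outlier <- x < q[1] - 1.5 * iqr | x > q[2] + 1.5 * iqr

sum(is_outlier)           # number of flagged points
is_outlier[which.max(x)]  # the appended extreme value is among them
```

Applied to each numeric column of GEP, the same rule would give a per-attribute outlier count to go with the plots.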

#### Take a closer look at some interesting features

Replot wip, incentive, and smv using a log scale to see whether these variables have a log-normal distribution.

In [None]:
# Set some colours using Colorbrewer
gg.colour <- brewer.pal(12,"Paired")[12]
gg.fill <- brewer.pal(12,"Paired")[11]

# Re-plot some of the charts using log scales to counteract the skew
p1 <- ggplot(aes(x=wip), data=GEP) +
      geom_histogram(bins=20, colour=gg.colour, fill=gg.fill) +
      scale_x_log10(labels=comma) 
p2 <- ggplot(aes(x=incentive), data=GEP) +
      geom_histogram(bins=20, colour=gg.colour, fill=gg.fill) +
      scale_x_log10(labels=comma)
p3 <- ggplot(aes(x=smv), data=GEP) +
      geom_histogram(bins=20, colour=gg.colour, fill=gg.fill) +
      scale_x_log10(labels=comma)

grid.arrange(p1, p2,p3, ncol=1, nrow=3)

These graphs show:
- The log of incentive is somewhat normally distributed, at least in its left part.
- The log of wip is not quite normal. The majority of wip values are between 100 and 10,000, with a few outliers above 10,000 or below 100.
### Investigate Pairs of Variables

#### Correlation Plot Function

This is the DIY correlation plot provided in the tutorial.

First we look at the overall correlation between variables and the distribution of the data.

In [None]:
# DIY correlation plot
# http://stackoverflow.com/questions/31709982/how-to-plot-in-r-a-correlogram-on-top-of-a-correlation-matrix
# there's some truth to the quote that modern programming is often stitching together pieces from SO 

colorRange <- c('#69091e', '#e37f65', 'white', '#aed2e6', '#042f60')
## colorRamp() returns a function which takes as an argument a number
## on [0,1] and returns a color in the gradient in colorRange
myColorRampFunc <- colorRamp(colorRange)

panel.cor <- function(w, z, ...) {
    correlation <- cor(w, z)

    ## because the func needs [0,1] and cor gives [-1,1], we need to shift and scale it
    col <- rgb(myColorRampFunc((1 + correlation) / 2 ) / 255 )

    ## take the square root so that the circle's area, not its radius, scales with |correlation|
    radius <- sqrt(abs(correlation))
    radians <- seq(0, 2*pi, len = 50) # 50 is arbitrary
    x <- radius * cos(radians)
    y <- radius * sin(radians)
    ## make them full loops
    x <- c(x, tail(x,n=1))
    y <- c(y, tail(y,n=1))

    ## trick: par(new=TRUE) makes the next plot() draw on the current panel
    ## instead of starting a new one, following the advice here:
    ## http://www.r-bloggers.com/multiple-y-axis-in-a-r-plot/
    par(new=TRUE)
    plot(0, type='n', xlim=c(-1,1), ylim=c(-1,1), axes=FALSE, asp=1)
    polygon(x, y, border=col, col=col)
}

# usage e.g.:
# pairs(mtcars, upper.panel = panel.cor)

#### Scatterplot Matrix

**1)** Plot the variables using a scatterplot matrix to visualise the correlations between variables.

In [None]:
pairs(GEP[sample.int(nrow(GEP),1000),], lower.panel=panel.cor)

From the plot above we can clearly see that there are strong correlations between some variables.
For example:

- smv, no_of_workers, over_time, and department are correlated with one another pairwise.
- Looking at the scatterplots, there does appear to be a linear relationship between some of the variables, but the panels are too small, so we need to investigate further.

Let's take a closer look at the distribution of some of the above variables through a scatter plot.

In [None]:
scatterplotMatrix(~smv+no_of_workers+over_time+department+actual_productivity,data=GEP)

As seen in the plot above, the correlation between some variables, for example smv and no_of_workers, is still very obvious. The correlation between the target, actual_productivity, and the other variables is hard to identify. This is consistent with the information obtained from the scatterplot matrix above.

### Numeric values

From the correlation analysis above and the relationships between the distributions, it is relatively easy to see how the numeric attributes relate to each other. For attributes such as the date and the department, however, we need to explore further how they relate to the other data.

First let's look at the numeric data.

In [None]:
head(GEP[c(5,6,7,8,9,10,11,12,13,14,15)])

In [None]:
# Calculate Pearson's correlation for all numeric columns
Pearson_cor <- as.data.frame(cor(GEP[, 5:15], method = "pearson"))
# Select the 11th column, i.e. the correlation of every variable with actual_productivity
Pearson_cor[11]

As the table above shows, no variable has a particularly strong correlation with actual_productivity. So let's take a closer look at the exact correlation values between each pair of variables.

In [None]:

# Define your own panel
myPanel <- function(x, y, z, ...) {
    panel.levelplot(x,y,z,...)
    panel.text(x, y, round(z, 2))
}
#Define the color scheme
cols = colorRampPalette(c("red","blue"))
#Plot the correlation matrix.
levelplot(cor(GEP[c(5,6,7,8,9,10,11,12,13,14,15)]), col.regions = cols(100), main = "correlation", xlab = NULL, ylab = NULL, 
          scales = list(x = list(rot = 90)), panel = myPanel)

By analyzing the above correlations, we find that:

- As mentioned above, smv, no_of_workers, over_time, and no_of_style_change are pairwise correlated. Here, "correlated" means that the absolute value of the correlation is greater than 0.3 (purple and blue in the figure).
- targeted_productivity has a correlation of 0.42 with actual_productivity.
- No variable has a particularly strong correlation with actual_productivity.
- actual_productivity itself is roughly Gaussian in distribution.
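
Reading pairs with an absolute correlation above 0.3 off a coloured grid is error-prone, so they can also be listed programmatically. The sketch below uses the built-in mtcars data as a stand-in; with this notebook's data one would pass the numeric columns of GEP instead. The `use` argument is an assumption added here so that missing values, if any, do not turn a whole entry into NA.

```r
# Built-in mtcars is used as a stand-in for the numeric columns of GEP
cm <- cor(mtcars[, c("mpg", "wt", "hp", "qsec")], use = "pairwise.complete.obs")

# Keep each pair once (upper triangle) where |r| exceeds the threshold
threshold <- 0.3
hits <- which(abs(cm) > threshold & upper.tri(cm), arr.ind = TRUE)
strong_pairs <- data.frame(
  var1 = rownames(cm)[hits[, "row"]],
  var2 = colnames(cm)[hits[, "col"]],
  r    = round(cm[hits], 2)
)

# Strongest pairs first
strong_pairs[order(-abs(strong_pairs$r)), ]
```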

### String values

Then let's look at string data.

In [None]:
options(repr.plot.width=8,repr.plot.height=4)

ggplot(aes(x=no_of_workers),data =GEP) + 
    geom_density(aes(fill = quarter)) +
    facet_wrap(~as.factor(quarter)) +
    ggtitle('number of workers and quarter Relationship')

options(repr.plot.width=8,repr.plot.height=3)

ggplot(aes(x=actual_productivity),data =GEP) + 
    geom_density(aes(fill = quarter)) +
    facet_wrap(~as.factor(quarter)) +
    ggtitle('actual_productivity and quarter Relationship')

In [None]:
options(repr.plot.width=8,repr.plot.height=3)

ggplot(aes(x=no_of_workers),data =GEP) + 
    geom_density(aes(fill = department)) +
    facet_wrap(~as.factor(department)) +
    ggtitle('number of workers and department Relationship')

options(repr.plot.width=8,repr.plot.height=3)

ggplot(aes(x=log(incentive+1)),data =GEP) + 
    geom_density(aes(fill = department)) +
    facet_wrap(~as.factor(department)) +
    ggtitle('log(incentive + 1) and department Relationship')

In [None]:
options(repr.plot.width=8,repr.plot.height=3)

ggplot(aes(x=actual_productivity),data =GEP) + 
    geom_density(aes(fill = department)) +
    facet_wrap(~as.factor(department)) +
    ggtitle('actual_productivity and department Relationship')

By analyzing the above figures, we find that:

- The number of workers varies slightly from quarter to quarter, and its distribution is roughly the same for quarters 1 to 4. However, quarter 5 clearly differs from the other four: it looks like a mixture of two Gaussians, while the others look like a mixture of three.
- Where actual productivity is below 0.8, quarters 1 through 4 account for most of the data. Most of the records in quarter 5 have a productivity of more than 0.8.
- The departments clearly differ in their number of workers: the sweing department has more workers than the finishing department.
- The productivity of records in the finishing department is clearly higher than that of records in the sweing department.
- Since the raw incentive values are hard to read, we plot log(incentive + 1) by department, as analyzed earlier. The distribution of incentive clearly differs between departments: in finishing it is mostly close to 0, while in sweing log(incentive + 1) is centred around a mean of about 4.

### data transformation

## 3. Methodology<a class="anchor" id="sec_4"></a>

### Overview

The actual productivity we need to predict is a continuous numeric value rather than a class label, so this is a regression problem. In this part we will use linear regression, logistic regression, KNN, and XGBoost to model the data.

Each method has its own characteristics, even before any feature engineering is considered.

### Linear Regression

- Methodology is simple, easy to implement
- Can solve regression problems
- It is very convenient for us to do feature engineering under this model.

### Logistic Regression
- Methodology is simple, easy to implement
- It is mainly a classification method, but it can also model a target bounded between 0 and 1, such as actual_productivity
- Feature engineering is as convenient as with linear regression.

### KNN
- Methodology is simple, easy to implement
- Can solve regression problems
- Makes no assumptions about the data distribution and, for moderate K, is fairly insensitive to outliers

### Extreme Gradient Boosting

- Can solve regression problems
- It is a mature, widely used implementation developed by machine learning practitioners.
- XGBoost adds a regularization term to the cost function.
- In short, this method is well engineered and may be the most appropriate choice here.

### Model performance

### 1. MSE
We use the Mean Squared Error (MSE) as the error function here.
It is simple and expresses the quality of the model to a certain extent. Here y_i is the true value, y_i^p is the predicted value, and n is the number of records.
\begin{align}
\mathrm{MSE}=\frac{\sum_{i=1}^{n}\left(y_{i}-y_{i}^{p}\right)^{2}}{n}
\end{align}
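
As a small worked example of the formula, the sketch below computes the MSE for three made-up productivity values. The helper name `mse` and the numbers are illustrative only.

```r
# Mean squared error between observed values y and predictions y_pred
mse <- function(y, y_pred) mean((y - y_pred)^2)

# Hypothetical productivity values, for illustration only
y      <- c(0.80, 0.70, 0.90)
y_pred <- c(0.75, 0.80, 0.85)

mse(y, y_pred)  # (0.05^2 + 0.10^2 + 0.05^2) / 3 = 0.005
```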


### 2. Residual standard error

Since we are doing a regression, a good measure for judging how well the model performs is the residual standard error (RSE).

It is the sum of the squared differences between the true and predicted values, divided by the degrees of freedom, with the square root taken at the end. Its formula is:

\begin{equation}
\mathrm{RSE}=\sqrt{\frac{\sum_{i=1}^{n}\left(y_{i}-y_{i}^{p}\right)^{2}}{df}}
\end{equation}

In addition, there are MAE and RMSE, which are similar to RSE but differ in subtle ways. For example, MAE is less sensitive to outliers.
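
The differences between these measures are easier to see side by side. The following sketch fits a small linear model on the built-in mtcars data (a stand-in, not the garment data), computes all three by hand, and then shows how one large error moves RMSE much more than MAE.

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)  # built-in data as a stand-in
res <- residuals(fit)

rse  <- sqrt(sum(res^2) / df.residual(fit))  # divides by the degrees of freedom
rmse <- sqrt(mean(res^2))                    # divides by n
mae  <- mean(abs(res))                       # absolute rather than squared errors

# RSE matches the "Residual standard error" reported by summary()
all.equal(rse, summary(fit)$sigma)

# Effect of a single large error on RMSE versus MAE (made-up errors)
err <- c(rep(0.1, 9), 2)
c(rmse = sqrt(mean(err^2)), mae = mean(abs(err)))
```

Because squaring magnifies large errors, the single error of 2 dominates the RMSE while the MAE stays much closer to the typical error of 0.1.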

### 3. R squared

R squared represents the explanatory power of the model. For example, an R squared of 0.5 means that 50% of the variance of the target is explained by the inputs. Of course, a higher R squared is not always better: a very high R squared may be caused by overfitting.
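
The definition can be verified against what summary reports for a linear model. This sketch again uses mtcars as a stand-in and also computes the adjusted R squared, which penalizes additional predictors.

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)  # built-in data as a stand-in
y   <- mtcars$mpg

sse <- sum((y - fitted(fit))^2)  # variation left unexplained by the model
sst <- sum((y - mean(y))^2)      # total variation of the target
r2  <- 1 - sse / sst

n <- length(y)
p <- length(coef(fit)) - 1       # number of predictors, excluding the intercept
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)

all.equal(r2, summary(fit)$r.squared)
all.equal(adj_r2, summary(fit)$adj.r.squared)
```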

## 4. Model Development <a class="anchor" id="sec_5"></a>

### Linear Regression

First, let's look at the result of using lm directly on the original data.

Note that we train on the train dataset and evaluate on the test dataset, which were split in the first part.

In [73]:
lm_fit1<- lm(actual_productivity~.,data= train)

In [None]:
summary(lm_fit1)

In [75]:
sw.fit =step(lm_fit1,trace=0,k=log(nrow(train)), direction="both")

In [None]:
summary(sw.fit)

In [None]:
par(mfrow=c(4,4))
plot(lm_fit1)
plot(sw.fit)

After fitting a linear regression on the entire train dataset, we apply step to it. Comparing the model before and after pruning, the fit is worse once the relatively unimportant variables are removed. In the figures above, the distribution of the residuals, that is, the gap between the true and predicted values, does not seem to change, and there is no observable change in the shape of the normal Q-Q plot either. In the summaries, however, the residual standard error has increased by 0.02 and the multiple R-squared has dropped. So we trimmed too many variables at this point, and the remaining ones can no longer explain the variance of the target as well.
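
One detail of the step call above is worth spelling out: passing k = log(nrow(train)) replaces the default AIC penalty of 2 per parameter with the BIC penalty, which prunes more aggressively. The sketch below illustrates this on mtcars (a stand-in for the train data); the model names `full` and `reduced` are hypothetical.

```r
full <- lm(mpg ~ ., data = mtcars)  # built-in data as a stand-in
n    <- nobs(full)

# With k = log(n), the AIC penalty becomes the BIC penalty
all.equal(AIC(full, k = log(n)), BIC(full))

# Stepwise selection under the BIC penalty, as in the cell above
reduced <- step(full, trace = 0, k = log(n), direction = "both")
formula(reduced)

c(full = BIC(full), reduced = BIC(reduced))
```

A lower BIC does not guarantee a higher R squared, which is consistent with the drop in R squared observed after pruning.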

### Feature engineering

By directly performing linear regression on the entire train dataset, we learned that it contains many irrelevant variables. But when we keep only the strongly related variables, the model's ability to explain the target weakens. So we have to add some variables. From the EDA above, we know that many variables are directly or indirectly related.

In [None]:
lm_fit2<- update(sw.fit,.~ .+log(wip)+log(incentive+1))

In [None]:
summary(lm_fit2)

- Compared with sw.fit, R squared is improved, indicating that the model explains more of the variance of the target.
- Compared with lm_fit1, the F-statistic has greatly improved. This shows that in the new model, the selected variables are jointly more strongly related to the target variable.
- Compared with lm_fit1, the adjusted R-squared has increased, so we can say that the newly added variables improve the model.
- Looking closely at log(incentive+1), we find that its p-value is quite small. In hypothesis-testing terms, we can reject the null hypothesis that it is unrelated to actual_productivity.
- We still need to consider whether to keep log(wip).
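
For readers unfamiliar with update, the following sketch shows how the `. ~ . + term` notation extends an existing model and how a nested F-test can back up the comparison made in the bullets. It uses mtcars and hypothetical model names rather than the fits above.

```r
base_fit <- lm(mpg ~ wt, data = mtcars)  # built-in data as a stand-in

# ". ~ ." keeps the existing response and terms; "+ log(hp)" appends one term
fit2 <- update(base_fit, . ~ . + log(hp))
attr(terms(fit2), "term.labels")  # "wt" "log(hp)"

# Nested F-test: does the added term significantly reduce the residual sum of squares?
cmp <- anova(base_fit, fit2)
cmp
```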

In [None]:
lm_fit3<- update(lm_fit2,.~ .+quarter:no_of_workers+department:no_of_workers+smv:no_of_workers)

In [None]:
summary(lm_fit3)

- Multiple R-squared is 0.3659, an increase of 0.05, and at the same time the adjusted R-squared is larger. The added features are clearly helpful.
- Some of the added terms have large p-values.

In [None]:
lm_fit4<- update(lm_fit2,.~ .+quarter:no_of_workers+department:no_of_workers+smv:no_of_workers+department:log(incentive+1))

In [None]:
summary(lm_fit4)

### Delete unimportant variables

After the above analysis, we finally enter our final model manually. I cannot fully explain why some of the features are retained here: in addition to the terms suggested by the EDA, there are some that I cannot justify. For example, smv and log(smv + 1) must both be kept to get better performance.

In [None]:
fin = lm(formula = actual_productivity ~ quarter + department + targeted_productivity + 
    smv + over_time + idle_men + no_of_workers + 
    log(targeted_productivity) + log(smv + 1) + quarter:no_of_workers + 
    department:incentive + department:no_of_workers + targeted_productivity:no_of_workers + 
    smv:no_of_workers, data = train)

In [None]:
step =step(fin,trace=0,k=log(nrow(train)), direction="both")

In [None]:
summary(step)

In [None]:
par(mfrow = c(2,2))
plot(step)

- Residuals vs Fitted shows that the residuals lie closer to the red line than in the corresponding figure for lm_fit1 above; the points in the upper half, in particular, are closer to a straight line. This suggests that the linearity assumption holds better here.

- The normal Q-Q plot shows that the residuals are less like a normal distribution; instead, they have a long tail on the left.

- In the Scale-Location plot, the residuals are spread fairly evenly along the red line. We cannot quantify this from the plot, but it looks quite even, which suggests that homoscedasticity is acceptable.

- Residuals vs Leverage shows that we still have some influential points, but most points lie on the left side, where leverage is low.
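
Two of these visual judgements can be complemented with numbers. The sketch below, on mtcars with a hypothetical model, runs a Shapiro-Wilk test on the residuals and flags influential points with Cook's distance. The 4/n cut-off is a common rule of thumb, not a fixed standard.

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)  # built-in data as a stand-in

# Normality of residuals: a small p-value speaks against normality
sw <- shapiro.test(residuals(fit))
sw$p.value

# Influence: Cook's distance per observation, flagged with the 4/n rule of thumb
cd <- cooks.distance(fit)
influential <- names(cd)[cd > 4 / nobs(fit)]
influential
```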

### Logistic Regression

The principle of logistic regression is more involved. It is a generalized linear model that passes a linear predictor through the logistic (sigmoid) function, and its coefficients are estimated iteratively by maximum likelihood. Here we try to use it for regression through glm. The features are still the ones we have sorted out.

In [None]:
glm_fit<- glm(formula = actual_productivity ~ quarter + department + targeted_productivity + 
    smv + over_time + idle_men + no_of_workers + 
    log(targeted_productivity) + log(smv + 1) + quarter:no_of_workers + 
    department:incentive + department:no_of_workers + targeted_productivity:no_of_workers + 
    smv:no_of_workers, data = train)

In [None]:
summary(glm_fit)

In [None]:
sw_flm_fit =step(glm_fit,trace=0,k=log(nrow(train)), direction="both")

In [None]:
summary(sw_flm_fit)

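The result is the same as the multiple linear regression. This is expected: glm is called here without a family argument, so it uses the default gaussian family with an identity link, which is ordinary least squares. A true logistic fit would need a binomial-type family with a logit link.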
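
To make the comparison concrete, the following minimal sketch uses synthetic data (all names and values are hypothetical) to show that glm with its default family reproduces lm, and what a logit-link fit on a target in (0, 1) could look like. The quasibinomial family is one possible choice for a fractional target and was not used in this notebook.

```r
set.seed(7)
n <- 200
x <- runif(n)
# Hypothetical target bounded in (0, 1), like a productivity rate
y <- plogis(-0.5 + 2 * x + rnorm(n, sd = 0.5))
toy <- data.frame(x = x, y = y)

lin  <- lm(y ~ x, data = toy)
gaus <- glm(y ~ x, data = toy)                          # family = gaussian by default
frac <- glm(y ~ x, data = toy, family = quasibinomial)  # logit link

all.equal(coef(lin), coef(gaus))         # TRUE: the same model
range(predict(frac, type = "response"))  # predictions stay inside (0, 1)
```

Unlike the linear fits, the logit-link model can never predict a productivity below 0 or above 1.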

### KNN

KNN stands for K Nearest Neighbours. It predicts a value from the values of the nearest neighbours. We use the Euclidean distance to measure distance.

\begin{equation}
\text{Euclidean distance} = \sqrt{\sum_{i=1}^{N}(x_i-y_i)^2} 
\end{equation}

To compare the different models fairly, we keep the same features in all of them.
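
The idea behind KNN regression fits in a few lines of base R. The function `knn_predict` below is a hypothetical from-scratch sketch for illustration, not what caret runs internally: it computes the Euclidean distance from a query point to every training row and averages the targets of the K closest ones.

```r
# Predict one query point as the mean target of its k nearest training points
knn_predict <- function(train_x, train_y, query, k) {
  # Euclidean distance from the query to every training row
  d <- sqrt(rowSums(sweep(train_x, 2, query)^2))
  mean(train_y[order(d)[seq_len(k)]])
}

# Hypothetical one-feature training set
train_x <- matrix(c(1, 2, 3, 10), ncol = 1)
train_y <- c(1, 2, 3, 10)

knn_predict(train_x, train_y, query = 2.1, k = 3)  # neighbours 2, 3, 1 -> mean 2
```

Because the distance sums over all features, a feature with a large numeric range (such as over_time) can dominate the others unless the features are scaled first.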

In [None]:
count=trainControl(method='cv',number=10)
knn_fit=train(actual_productivity ~ quarter + department + targeted_productivity + 
    smv + over_time + idle_men + no_of_workers + 
    log(targeted_productivity) + log(smv + 1) + quarter:no_of_workers + 
    department:incentive + department:no_of_workers + targeted_productivity:no_of_workers + 
    smv:no_of_workers, data = train,method ='knn',
             trControl=count,
             tuneGrid=expand.grid(k=1:20))

In [None]:
plot(knn_fit,main="K vs RMSE")

From the figure above, we see that the RMSE is lowest at K = 11, so we choose K = 11.
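
What the K vs RMSE curve represents can be reproduced by hand. The sketch below is a simplified base R version of 10-fold cross-validation on synthetic data, with a hypothetical `knn_predict` helper. It is meant to illustrate the procedure, not to replicate caret's exact resampling.

```r
set.seed(3)
n <- 120
x <- matrix(runif(n * 2), ncol = 2)  # hypothetical features
y <- sin(2 * pi * x[, 1]) + x[, 2] + rnorm(n, sd = 0.2)

# Mean target of the k nearest training points (Euclidean distance)
knn_predict <- function(train_x, train_y, query, k) {
  d <- sqrt(rowSums(sweep(train_x, 2, query)^2))
  mean(train_y[order(d)[seq_len(k)]])
}

folds <- sample(rep(1:10, length.out = n))  # random 10-fold assignment
ks <- 1:20

# For each k, average the held-out RMSE over the 10 folds
cv_rmse <- sapply(ks, function(k) {
  fold_rmse <- sapply(1:10, function(f) {
    tr <- folds != f
    pred <- apply(x[!tr, , drop = FALSE], 1, function(q)
      knn_predict(x[tr, , drop = FALSE], y[tr], q, k))
    sqrt(mean((y[!tr] - pred)^2))
  })
  mean(fold_rmse)
})

best_k <- ks[which.min(cv_rmse)]
best_k
```

Very small K tends to overfit the noise and very large K oversmooths, which is why the curve usually has a minimum somewhere in between.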


In [78]:
count=trainControl(method='cv',number=10)
knn_fin=train(actual_productivity ~ quarter + department + targeted_productivity + 
    smv + over_time + idle_men + no_of_workers + 
    log(targeted_productivity) + log(smv + 1) + quarter:no_of_workers + 
    department:incentive + department:no_of_workers + targeted_productivity:no_of_workers + 
    smv:no_of_workers, data = train,method ='knn',
             trControl=count,
             tuneGrid=expand.grid(k=11))

### Decision Tree (xgbTree)

In [79]:
tune_grid <- expand.grid(nrounds = 50,
                        max_depth = 5,
                        eta = 0.05,
                        gamma = 0.01,
                        colsample_bytree = 0.75,
                        min_child_weight = 0,
                        subsample = 0.5)


xgb_model <- train(actual_productivity ~ quarter + department + targeted_productivity + 
    smv + over_time + idle_men + no_of_workers + 
    log(targeted_productivity) + log(smv + 1) + quarter:no_of_workers + 
    department:incentive + department:no_of_workers + targeted_productivity:no_of_workers + 
    smv:no_of_workers, data = train, method = "xgbTree",trControl = trainControl("cv", number = 10), tuneGrid = tune_grid,
                tuneLength = 10)
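
To give some intuition for the nrounds and eta settings, here is a deliberately simplified sketch of gradient boosting for squared error, using one-split trees (stumps) on a single synthetic feature. It is not XGBoost's actual algorithm, which among other things uses second-order information and regularization; the helper names are hypothetical.

```r
# Fit a one-split regression stump to residuals r on a single feature x
fit_stump <- function(x, r) {
  best <- list(sse = Inf)
  for (s in sort(unique(x))[-1]) {  # candidate split points
    left <- x < s
    pl <- mean(r[left])
    pr <- mean(r[!left])
    sse <- sum((r[left] - pl)^2) + sum((r[!left] - pr)^2)
    if (sse < best$sse) best <- list(sse = sse, split = s, left = pl, right = pr)
  }
  best
}
predict_stump <- function(st, x) ifelse(x < st$split, st$left, st$right)

set.seed(5)
x <- runif(200)
y <- sin(2 * pi * x) + rnorm(200, sd = 0.2)  # hypothetical data

eta <- 0.05      # learning rate, same value as in tune_grid
nrounds <- 50    # number of boosting rounds, same value as in tune_grid
pred <- rep(mean(y), length(y))              # start from the mean
mse_start <- mean((y - pred)^2)
for (m in seq_len(nrounds)) {
  st   <- fit_stump(x, y - pred)             # fit the current residuals
  pred <- pred + eta * predict_stump(st, x)  # add a shrunken correction
}
mse_end <- mean((y - pred)^2)
c(start = mse_start, end = mse_end)
```

With a small eta, each round corrects only a fraction of the remaining error, so a low nrounds can leave the model underfitted; the two settings need to be tuned together.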



### Compare different models

### Use RMSE, R squared, and MAE to measure error

### Multiple linear regression

As shown above, the multiple linear regression model is the same as the logistic one, so it is evaluated only once.

In [80]:
postResample(pred = predict(step,train),obs=train$actual_productivity)
postResample(pred = predict(step,test),obs=test$actual_productivity)

"prediction from a rank-deficient fit may be misleading"

"prediction from a rank-deficient fit may be misleading"

### KNN

In [81]:
postResample(pred = predict(knn_fin,train),obs=train$actual_productivity)
postResample(pred = predict(knn_fin,test),obs=test$actual_productivity)

### xgbTree

In [82]:
postResample(pred = predict(xgb_model,train),obs=train$actual_productivity)
postResample(pred = predict(xgb_model,test),obs=test$actual_productivity)

By comparing the RMSE, R squared, and MAE of the three models, it is clear that xgbTree is the best performer: it has the smallest RMSE and MAE and, at the same time, the largest R squared, even though we only let xgbTree run for 50 boosting rounds.
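
For reference, the three numbers compared here can be computed by hand. The `metrics` helper below is a hypothetical stand-in that, to our understanding, mirrors what postResample reports for regression, with R squared taken as the squared correlation between predictions and observations. It is applied to a train/test split of mtcars rather than to the models above.

```r
# RMSE, squared correlation, and MAE between predictions and observations
metrics <- function(pred, obs) {
  c(RMSE     = sqrt(mean((obs - pred)^2)),
    Rsquared = cor(pred, obs)^2,
    MAE      = mean(abs(obs - pred)))
}

set.seed(101)
idx <- sample.int(nrow(mtcars), floor(0.7 * nrow(mtcars)))  # 70/30 split
fit <- lm(mpg ~ wt + hp, data = mtcars[idx, ])

# Side-by-side comparison of train and test performance
res <- rbind(
  train = metrics(predict(fit, mtcars[idx, ]),  mtcars$mpg[idx]),
  test  = metrics(predict(fit, mtcars[-idx, ]), mtcars$mpg[-idx])
)
res
```

A test RMSE far above the train RMSE would point to overfitting, which is why both rows are worth reporting for every model.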

## 5. Results and discussion <a class="anchor" id="sec_6"></a>

### Model selection

- First of all, for a machine learning model, the smaller the gap between its predictions and the true values, the better the model performs. From the calculations above, judging by RMSE and MAE, both values are smallest for xgb_model. This shows that the difference between the predicted and true values is smallest for xgb_model.

- Secondly, xgb_model behaves like this on both the train and test datasets.

- Finally, the R squared of xgb_model is the largest. This shows that the features in the model explain more of the variance of the target. In addition, because RMSE and MAE are both low, the high R squared does not appear to be simply an artifact of the features we selected (for example, of overfitting).

Based on the above analysis, if I can only choose one, I will choose xgb_model.

### Subset of Attributes Discussion

First, the sub-attributes obtained from the EDA.

1. smv:no_of_workers, targeted_productivity:no_of_workers

The first thing we know is that, apart from targeted_productivity, no attribute is correlated with actual_productivity. So in the EDA part we calculated the pairwise correlations of the numeric data and found strong correlations among several attributes. We therefore guess that combining them in pairs may play a big role in the model's predictions.

2. log(targeted_productivity), log(smv + 1)

After observing the distribution of actual_productivity, we found that it is roughly Gaussian. In the EDA section, when we looked at several log-scaled attributes, we found that incentive is approximately normally distributed after taking the log, and that smv becomes concentrated in two parts. So we can reasonably guess that taking logs may work well in the models. log(targeted_productivity), by contrast, has no such justification. It also seems that the model performs better when the logged and unlogged versions of a variable are both present.

3. quarter:no_of_workers, department:incentive, department:no_of_workers

In the EDA part, we found that when no_of_workers is broken down by quarter, the main distribution of productivity changes a lot, and we have reason to suspect that quarter 5 is associated with higher productivity. Similarly, department combined with no_of_workers, and department combined with incentive, also seem to affect the distribution of actual_productivity. In other words, given such a combination, we can roughly predict the value around which actual_productivity will be distributed. So we have reason to suspect that these terms can serve as sub-attributes. As for department:log(incentive+1), we found that it does not perform as well as department:incentive.



### Discussion

When doing machine learning, it is very important to analyze the data first. This gives us not only an overall understanding of the data but, more importantly, a great deal of help in selecting attributes. As the GEP data shows, almost no attribute is correlated with the target value, which forces us to think of ways to transform or combine attributes. If we applied transformations indiscriminately, taking the log of every attribute or combining all of them in pairs, the model would become very complicated, and pruning it afterwards is impractical when the data is extremely complex. Therefore, it is very important to look for potential relationships between variables during EDA.

We also have many models to choose from. In addition to simple basic models such as lm and KNN, there are mature models developed by data scientists, such as XGBoost. Choosing a better model can directly improve accuracy. When choosing a model, we must also consider the goal we want to achieve. Generally we want the model to be as accurate as possible, but besides regression there are other problem types, such as classification and clustering. These issues deserve careful thought before choosing the most appropriate model.

## 6. Conclusion<a class="anchor" id="sec_7"></a>

In this task, we use different visualization methods to analyze the data according to its type (numeric or string). We focus on exploring potential sub-attributes based on correlation calculations. We also draw log-scale plots and density plots against the target value to explore which combined variables may be related to the target. In addition, we use different models for machine learning: multiple linear regression, logistic regression, KNN, and XGBoost trees. We first update the model step by step to show that the sub-attributes we conjectured do make the model perform better. Then we compare the performance of the different models, mainly by comparing RMSE, MAE, and R squared. In the end we choose xgb_model.


Of course, there are also many shortcomings. For example, we ignore the impact of dates, and there must be better sub-attributes that we have not found. This requires more research.

## 7. References <a class="anchor" id="sec_8"></a>

- Persistent invalid graphics state error when using ggplot2 https://stackoverflow.com/questions/20155581/persistent-invalid-graphics-state-error-when-using-ggplot2

- How to Calculate Residual Standard Error in R https://www.statology.org/residual-standard-error-r/

- knn: k-Nearest Neighbour Classification https://www.rdocumentation.org/packages/class/versions/7.3-19/topics/knn

- trainControl: Control parameters for train https://www.rdocumentation.org/packages/caret/versions/6.0-88/topics/trainControl

- How to plot XGBoost trees in R https://www.r-bloggers.com/2021/04/how-to-plot-xgboost-trees-in-r/

- Simple R - xgboost - caret kernel https://www.kaggle.com/nagsdata/simple-r-xgboost-caret-kernel

- The caret Package Max Kuhn 2019-03-27 https://topepo.github.io/caret/index.html