<a href="https://colab.research.google.com/github/royn5618/Data_Science_Project/blob/main/Garments_worker_productivity.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction

## Context

The garment industry is one of the key examples of industrial globalization in the modern era. It is a highly labour-intensive industry with many manual processes. Meeting the huge global demand for garment products depends largely on the production and delivery performance of employees in garment manufacturing companies. Decision makers in the industry therefore want to track, analyse, and predict the productivity of the working teams in their factories.

## Content

This dataset includes important attributes of the garment manufacturing process and the productivity of the employees. It was collected manually and validated by industry experts.

## Acknowledgements

Relevant papers:

[1] Imran, A. A., Amin, M. N., Islam Rifat, M. R., & Mehreen, S. (2019). Deep Neural Network Approach for Predicting the Productivity of Garment Employees. 2019 6th International Conference on Control, Decision and Information Technologies (CoDIT).

[2] Rahim, M. S., Imran, A. A., & Ahmed, T. (2021). Mining the Productivity Data of Garment Industry. International Journal of Business Intelligence and Data Mining, 1(1), 1.

## Inspiration

This dataset can be used for regression, by predicting the productivity value (0-1), or for classification, by transforming the productivity range (0-1) into different classes.

## Task Details

It is highly desirable among the decision makers in the garments industry to track, analyse and predict the productivity performance of the working teams in their factories.




## About this file
## Attribute Information:

01 date : Date in MM-DD-YYYY

02 day : Day of the week

03 quarter : A portion of the month. A month was divided into four quarters

04 department : Associated department with the instance

05 team : Associated team number with the instance

06 no_of_workers : Number of workers in each team

07 no_of_style_change : Number of changes in the style of a particular product

08 targeted_productivity : Targeted productivity set by the authority for each team for each day

09 smv : Standard Minute Value, the allocated time for a task

10 wip : Work in progress. Includes the number of unfinished items for products

11 over_time : Represents the amount of overtime by each team in minutes

12 incentive : Represents the amount of financial incentive (in BDT) that enables or motivates a particular course of action

13 idle_time : The amount of time when the production was interrupted due to several reasons

14 idle_men : The number of workers who were idle due to production interruption

15 actual_productivity : The actual % of productivity that was delivered by the workers. It ranges from 0-1.

# Importing Libraries and Data

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from matplotlib import pyplot
%matplotlib inline

In [None]:
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
path = "/kaggle/input/productivity-prediction-of-garment-employees/garments_worker_productivity.csv"
# squeeze=True was removed from read_csv in pandas 2.0 and is not needed for a multi-column file
df = pd.read_csv(path, header=0, index_col=0, parse_dates=True)
df.head()

In [None]:
df.shape

In [None]:
df.info()

There are missing values in the wip column.

In [None]:
df.isnull().sum()

In [None]:
df.describe()

In [None]:
df.shape

A pairplot gives a quick overview of the data; the call is left commented out below.

In [None]:
#sns.pairplot(df)

# Categorical Features

quarter, department, day, and team are categorical features; no_of_style_change is also treated as categorical.

In [None]:
categorical_cols = ['quarter', 'department', 'day', 'team','no_of_style_change'] 

In [None]:
df.head()

## 1-Quarter

In [None]:
df['quarter'].value_counts()

There are five quarter values (Quarter1 to Quarter5), and they are not evenly distributed.

In [None]:
pyplot.plot(df.index,df.quarter)
plt.show()

Plotting quarter against the date shows a repeating pattern over time for every quarter except Quarter5. There should be a reason for that exception, so Quarter5 needs a closer look.

In [None]:
df_1=df[df['quarter']=='Quarter5']

In [None]:
df_1.shape

In [None]:
df_1.index

Quarter5 covers only two days: the 29th and 31st of January.
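
The two dates above are consistent with quarter being a week-of-month bucket, where days 29 to 31 spill into a fifth bucket. The sketch below encodes that reading. The 7-day rule and the helper name `quarter_of_month` are assumptions made for illustration; the dataset description does not state how quarter was derived, and the dates used are illustrative only.

```python
import pandas as pd

def quarter_of_month(date):
    """Map a date to 'Quarter1'..'Quarter5' using 7-day buckets (assumed rule)."""
    # Days 1-7 -> 1, 8-14 -> 2, 15-21 -> 3, 22-28 -> 4, 29-31 -> 5
    bucket = (pd.Timestamp(date).day - 1) // 7 + 1
    return f"Quarter{bucket}"

# Illustrative dates only; the year is arbitrary
dates = pd.to_datetime(["2015-01-01", "2015-01-08", "2015-01-29", "2015-01-31"])
quarters = [quarter_of_month(d) for d in dates]
print(quarters)
```

Comparing such a derived column with the recorded quarter column would confirm or reject the assumed rule.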

## 2-Department

In [None]:
df.department.value_counts() 

There are three department labels: sweing, finishing, and finishing with a trailing space. The last two need to be merged so that only two groups remain.

In [None]:
df=df.replace(['finishing '], ['finishing'])  
df.department.value_counts()
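
As a more general alternative to replacing one specific label, whitespace can be stripped from every label at once. The snippet below is a minimal sketch on a toy Series (the values are made up to reproduce the trailing-space problem), not the notebook's own data.

```python
import pandas as pd

# Toy column reproducing the trailing-space problem
department = pd.Series(["sweing", "finishing ", "finishing", "sweing", "finishing "])

# str.strip() removes leading and trailing whitespace from every label
cleaned = department.str.strip()
counts = cleaned.value_counts()
print(counts)
```

This also catches any other stray spaces that a hard-coded replacement would miss.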

## 3-Day

In [None]:
df.day.value_counts() 

Friday does not appear in the data, so it is not a working day.

# Numeric Features


In [None]:
df.select_dtypes(include=np.number).columns.tolist()

There are 11 numeric features, as listed above, but no_of_style_change will be handled as a categorical feature.

## 1. Team

In [None]:
ax = sns.countplot(x = 'team', data = df, palette=["#3f3e6fd1", "#85c6a9"])
plt.xlabel('No of Teams')
plt.show()


There are 12 teams. Because decision makers in the garment industry want to track, analyse, and predict the productivity of the working teams in their factories, my analysis will be on a team basis.

## 2. SMV

Standard Minute Value, it is the allocated time for a task

In [None]:
plt.figure(figsize=(25, 10))
palette = "Set3"

sns.boxplot(x="team", y="smv", hue="department", data = df,
     palette = palette, fliersize = 0)

plt.title('smv distribution by team',fontsize= 14)
plt.show()

The smv boxplot by team, separated by department, clearly shows that smv fluctuates between teams in the sewing department, while the finishing department has almost the same smv distribution for every team.

In [None]:
sns.scatterplot(data=df, x="no_of_workers", y="smv", hue="department")

For the finishing department, smv does not change with no_of_workers.

In [None]:
df.columns

In [None]:
pyplot.plot(df.index,df.smv)

## 3. WIP
Work in progress. Includes the number of unfinished items for products

In [None]:
df.wip.isnull().sum()

There are 506 null values in wip column

In [None]:
sns.boxplot(x='department',y='wip',data=df)

All null values belong to the finishing department. The finishing department receives its work from the sewing department, so a missing value could mean that the finishing department had no work in progress while waiting for the sewing department. We can therefore replace the null values with zero.
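
The claim that every missing wip value sits in one department can be checked directly by counting nulls per department. The following is a small sketch on a toy frame; the wip numbers are invented for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame: made-up wip values, missing only for the finishing rows
toy = pd.DataFrame({
    "department": ["sweing", "sweing", "finishing", "finishing", "finishing"],
    "wip": [1108.0, 968.0, np.nan, np.nan, np.nan],
})

# Count missing wip values within each department
nulls_by_department = toy["wip"].isnull().groupby(toy["department"]).sum()
print(nulls_by_department)
```

Running the same two lines on df would show whether any sewing rows are missing wip as well.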

In [None]:
#df['wip'] = df['wip'].fillna(0)

In [None]:
#df.wip.isnull().sum()

In [None]:
pyplot.plot(df.index,df.wip)
plt.yticks(np.arange(0,30000,step=2500))

In [None]:
df[df['wip']>2500].shape

In [None]:
df[df['wip']>2500]

There are 10 rows with high wip values. All of them belong to the sweing department in Quarter1, on February 2nd.

## 4. Over Time

In [None]:
sns.boxplot(x='department',y='over_time',data=df)
plt.show()

In [None]:
sns.boxplot(x='team',y='over_time',data=df)
plt.show()

In [None]:
# Select the column before aggregating so non-numeric columns are not passed to median()
over_time_by_team_department = df.groupby(['department', 'team'])['over_time'].median()

# range(1, 13) covers all 12 teams
for team in range(1, 13):
    for department in ['sweing', 'finishing']:
        print('Median over_time of team {} {}s: {}'.format(team, department, over_time_by_team_department[department][team]))
print('Median over_time of teams: {}'.format(df['over_time'].median()))

In [None]:
over_time_by_team_department.plot.bar()
plt.show()

In [None]:
over_time_by_team_department.head()

The finishing department has lower over_time values than the sweing department. Within the sweing department, teams 6, 11, and 12 have the lowest over_time values.
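
A side-by-side table often makes this kind of department comparison easier to read than a long printed list. The sketch below pivots grouped medians into one column per department; the toy values are invented and only show the shape of the result.

```python
import pandas as pd

# Toy data with made-up over_time values
toy = pd.DataFrame({
    "department": ["sweing", "sweing", "sweing", "finishing", "finishing", "finishing"],
    "team": [1, 1, 2, 1, 2, 2],
    "over_time": [7000, 6000, 5000, 1000, 1500, 2500],
})

# Median per (department, team), then one column per department
median_table = (
    toy.groupby(["department", "team"])["over_time"]
    .median()
    .unstack("department")
)
print(median_table)
```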

In [None]:
pyplot.plot(df.index,df.over_time)

In [None]:
df[df['over_time']>12000].shape

In [None]:
df[df['over_time']>12000]

No clear pattern over time was found among the peak over_time values.

## 5. Incentive

In [None]:
pyplot.plot(df.index,df.incentive)

In [None]:
df[df['incentive']>150].shape

In [None]:
df[df['incentive']>150]

All of the highest incentive values belong to the finishing department on March 9, Quarter2.

## 6. Idle Time

In [None]:
pyplot.plot(df.index,df.idle_time)

In [None]:
df[df['idle_time']>20].shape

In [None]:
df[df['idle_time']>20]

All of the highest idle_time values belong to the sweing department on February 4 and 7, Quarter1.

## 7. Idle Men

In [None]:
pyplot.plot(df.index,df.idle_men)

In [None]:
df[df['idle_men']>0].shape

In [None]:
df[df['idle_men']>0]

All of the nonzero idle_men values belong to the sweing department.

## 8. No_of_style_change

In [None]:
pyplot.plot(df.index,df.no_of_style_change)

In [None]:
plt.figure(figsize=(15, 7))

# Recent seaborn versions require x to be passed as a keyword argument
plt.subplot(1, 2, 1)
sns.countplot(x='no_of_style_change', hue='quarter', data=df)
plt.xlabel('no_of_style_change')

plt.subplot(1, 2, 2)
sns.countplot(x='no_of_style_change', hue='department', data=df)
plt.xlabel('no_of_style_change')

plt.show()

There are no style changes in Quarter5, and all of the changes occurred in the sweing department.

## 9. No_of_workers

In [None]:
data = df.groupby(['department']).no_of_workers.sum()
data.plot.pie(title="Employee rates by department",autopct='%1.1f%%')
plt.ylabel(None)
plt.show()

The sweing and finishing departments account for 87.5% and 12.5% of the workers, respectively.

## 10. Actual Productivity

In [None]:
# distplot is deprecated in seaborn; histplot with kde=True is its replacement
sns.histplot(df.actual_productivity, kde=True)

In [None]:
pyplot.plot(df.index,df.actual_productivity)

There is no obvious pattern over time in actual_productivity.

In [None]:
plt.figure(figsize=(25, 10))
palette = "Set3"

sns.boxplot(x = 'team', y = 'actual_productivity', data = df,
     palette = palette,hue='department',fliersize = 0)
plt.yticks(np.arange(0,1.2,step=0.3))
plt.title('Actual_productivity distribution by team and department',fontsize= 14)
plt.show()

## 11. Targeted Productivity

In [None]:
# distplot is deprecated in seaborn; histplot with kde=True is its replacement
sns.histplot(df['targeted_productivity'], kde=True)

In [None]:
df.targeted_productivity.value_counts()



In [None]:
plt.figure(figsize=(25, 10))
palette = "Set3"

sns.boxplot(x = 'team', y = 'targeted_productivity', data = df,
     palette = palette,hue='department',fliersize = 0)
plt.yticks(np.arange(0,1.2,step=0.3))
plt.title('Targeted_productivity distribution by team and department',fontsize= 14)
plt.show()

# Actual vs Targeted Productivity

In [None]:
plt.figure(figsize=(25, 10))
palette = "Set3"

sns.boxplot(x = 'team', y = df.targeted_productivity-df.actual_productivity, data = df,
     palette = palette,hue='department',fliersize = 0)

plt.title('Difference distribution between targeted_productivity and actual_productivity by team and department',fontsize= 14)
plt.show()

There are both negative and positive variations from targeted_productivity on team and department basis. 

# EDA

##  Correlation Heatmap

In [None]:
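# numeric_only=True skips the string columns (quarter, department, day), which pandas 2.x no longer drops silently
corr = df.corr(numeric_only=True)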
mask=np.zeros(corr.shape,dtype=bool)
mask[np.triu_indices(len(mask))]=True


In [None]:
plt.figure(figsize=(10,10))
sns.heatmap(corr,annot=True,vmin=-1,vmax=1,cmap='Spectral',square=True,mask=mask,linecolor='white',linewidths=1)

**Highest Positive Correlations:**

* no_of_workers and smv (0.91)
* no_of_workers and over_time (0.73)
* over_time and smv (0.67)
* idle_men and idle_time (0.56)


**Positive Correlations:**

* no_of_workers and no_of_style_change (0.33)
* no_of_style_change and smv (0.32)

There isn't any obvious negative correlation between features.
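
Rankings like the ones above can also be read off the correlation matrix programmatically instead of from the heatmap. The sketch below does this on synthetic data (columns `a`, `b`, `c` are hypothetical, not from the garment dataset): it keeps the strict lower triangle so each pair appears once, then sorts.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "b": a + rng.normal(scale=0.1, size=200),  # strongly tied to a
    "c": rng.normal(size=200),                 # independent noise
})

corr = toy.corr()
# Keep only the strict lower triangle so each pair is listed once
lower = corr.where(np.tril(np.ones(corr.shape, dtype=bool), k=-1))
pairs = lower.stack().dropna().sort_values(ascending=False)
print(pairs)
```

Sorting by `pairs.abs()` instead would surface strong negative correlations too, if any existed.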

## Filling in missing values of wip column

In [None]:
df['wip'].isnull().sum()

In [None]:
# Assign the result back; inplace fillna on a selected column is unreliable in recent pandas
df['wip'] = df['wip'].fillna(0)

In [None]:
df['wip'].isnull().sum()

## One-hot encoding


Some columns have been identified that may be useful for predicting the productivity range:

* quarter
* department
* day
* team
* no_of_style_change


Before we build our model, we need to prepare these columns for machine learning.

In [None]:
def create_dummies(df,column_name):
    dummies = pd.get_dummies(df[column_name],prefix=column_name)
    df = pd.concat([df,dummies],axis=1)
    return df

df = create_dummies(df,"quarter")
df = create_dummies(df,"department")
df = create_dummies(df,"day")
df = create_dummies(df,"team")

df.columns
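
For comparison, pandas can encode several columns in one call. The sketch below uses a toy frame; note that, unlike the `create_dummies` helper above, passing `columns=` to `pd.get_dummies` drops the original columns from the result.

```python
import pandas as pd

# Toy frame with two categorical columns
toy = pd.DataFrame({"day": ["Monday", "Sunday", "Monday"], "team": [1, 2, 1]})

# One call encodes both columns and prefixes each dummy with its source column name
encoded = pd.get_dummies(toy, columns=["day", "team"])
print(encoded.columns.tolist())
```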

## Label Encoding

In [None]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
df["no_of_style_change_enc"] = le.fit_transform(df["no_of_style_change"])
df.head()


In [None]:
df.columns

## Creating Target_Label for productivity

In [None]:
df['diff']=df.actual_productivity-df.targeted_productivity
df.columns

In [None]:
df['diff'].describe()

In [None]:
df['Target_label']=np.nan
df.head()
df.loc[df['diff']<0,'Target_label'] = -1
df.loc[(df['diff']==0), 'Target_label'] = 0
df.loc[df['diff']>0, 'Target_label'] = 1
df.head()

If the difference between actual_productivity and targeted_productivity is positive, the team over-performed (Target_label = 1).

If the difference between actual_productivity and targeted_productivity is equal to 0, the team performed as expected (Target_label = 0).

If the difference between actual_productivity and targeted_productivity is negative, the team under-performed (Target_label = -1).
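
The three rules amount to taking the sign of the difference, so the same labels can be produced in one step with `np.sign`. This is a sketch on toy values, not a replacement for the cell above.

```python
import numpy as np
import pandas as pd

# Toy rows: one above target, one exactly on target, one below
toy = pd.DataFrame({
    "actual_productivity": [0.94, 0.80, 0.65],
    "targeted_productivity": [0.80, 0.80, 0.75],
})

diff = toy["actual_productivity"] - toy["targeted_productivity"]
# np.sign returns 1.0, 0.0 or -1.0, matching the three labels
toy["Target_label"] = np.sign(diff)
print(toy)
```

Either way, the 0 label relies on an exact floating-point match between the two columns, so it only fires when both values are stored identically.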


In [None]:
df[df['Target_label']==0]

In [None]:
ax = sns.countplot(x = 'Target_label', data = df, palette='Set1')
plt.xlabel('No of Target_label')

plt.show()


As the graph above shows, the classes are imbalanced, and this needs to be handled.

In [None]:
df['Target_label'].value_counts()

The value counts above confirm that the classes are imbalanced.
In this case, the task can be modelled as a binary classification problem that predicts whether a team under-performed (-1) or met or exceeded its target (1).


As part of our preprocessing, the 3 class labels need to be turned into 2 labels:

In [None]:
df['Target_label'] = [-1 if x==-1 else 1 for x in df['Target_label']]

In [None]:
df['Target_label'].value_counts()

In [None]:
ax = sns.countplot(x = 'Target_label', data = df, palette='Set1')
plt.xlabel('No of Target_label')

plt.show()

## Balancing Data

In [None]:
!pip install imbalanced-learn

In [None]:
# check version number
import imblearn
print(imblearn.__version__)

In [None]:
df1=df.drop(['quarter', 'department', 'day', 'team'],axis=1)

In [None]:
from imblearn.over_sampling import SMOTE
X = df1.loc[:, df1.columns != 'Target_label']
y = df1.Target_label
smt = SMOTE()
X_smote, y_smote = smt.fit_resample(X, y)

# X_smote is a DataFrame, so plot its first two feature columns for each class
plt.figure(figsize=(12, 8))
plt.title('Class distribution after SMOTE')
for label in (1, -1):
    rows = X_smote[y_smote == label]
    plt.scatter(rows.iloc[:, 0], rows.iloc[:, 1], label='class {}'.format(label))
plt.xlabel(X_smote.columns[0])
plt.ylabel(X_smote.columns[1])
plt.legend()
plt.grid(False)
plt.show()
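
SMOTE balances the classes by creating synthetic minority samples that lie between existing minority samples. The sketch below illustrates that interpolation idea with plain NumPy. It is a deliberate simplification: it always uses the single nearest neighbour, whereas imblearn's SMOTE picks among several neighbours, and the function name `synthesize` and the toy points are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Four toy minority-class points in two dimensions
minority = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 2.0], [2.5, 2.5]])

def synthesize(points, n_new, rng):
    """Create n_new points, each on the segment between a sample and its nearest neighbour."""
    new_points = []
    for _ in range(n_new):
        i = rng.integers(len(points))
        # Distances from point i to every point; exclude the point itself
        dist = np.linalg.norm(points - points[i], axis=1)
        dist[i] = np.inf
        j = np.argmin(dist)
        lam = rng.random()  # interpolation weight in [0, 1)
        new_points.append(points[i] + lam * (points[j] - points[i]))
    return np.array(new_points)

synthetic = synthesize(minority, n_new=6, rng=rng)
print(synthetic)
```

Because each new point is a convex combination of two real points, the synthetic samples stay inside the region already covered by the minority class.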



In [None]:
X_smote.shape, y_smote.shape

In [None]:
df = pd.concat([pd.DataFrame(X_smote), pd.DataFrame(y_smote)], axis=1)
df.shape

In [None]:
ax = sns.countplot(x = 'Target_label', data = df, palette='Set1')
plt.xlabel('No of Target_label')

plt.show()

## Splitting Train and Test Data

In [None]:
from sklearn.model_selection import train_test_split

columns = ['smv',
       'wip', 'over_time', 'incentive', 'idle_time', 'idle_men',
       'no_of_workers', 
       'quarter_Quarter1', 'quarter_Quarter2', 'quarter_Quarter3',
       'quarter_Quarter4', 'quarter_Quarter5', 'department_finishing',
       'department_sweing', 'day_Monday', 'day_Saturday', 'day_Sunday',
       'day_Thursday', 'day_Tuesday', 'day_Wednesday', 'team_1', 'team_2',
       'team_3', 'team_4', 'team_5', 'team_6', 'team_7', 'team_8', 'team_9',
       'team_10', 'team_11', 'team_12', 'no_of_style_change_enc']

X = df[columns]
y = df['Target_label']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2,random_state=0)

In [None]:
X_train.shape

In [None]:
y_train.shape

In [None]:
X_test.shape

In [None]:
y_test.shape

## Scaling

In [None]:
from sklearn.preprocessing import Normalizer
scaler = Normalizer()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
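
It may help to see what `Normalizer` does compared with a column-wise scaler. The sketch below, on a toy array, shows that `Normalizer` rescales each row (sample) to unit length, while `StandardScaler` rescales each column (feature) to zero mean and unit variance. It is offered as a comparison only, not as a statement about which one suits this dataset.

```python
import numpy as np
from sklearn.preprocessing import Normalizer, StandardScaler

X = np.array([[3.0, 4.0],
              [6.0, 8.0],
              [1.0, 0.0]])

row_normalized = Normalizer().fit_transform(X)     # each row scaled to unit L2 norm
column_scaled = StandardScaler().fit_transform(X)  # each column to mean 0, std 1

print(np.linalg.norm(row_normalized, axis=1))
print(column_scaled.mean(axis=0))
```

Note that the first two rows become identical after row normalization, because they point in the same direction and differ only in magnitude.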

## Build Models


1. Logistic Regression
2. Decision Tree Classifiers
3. Random Forests
4. Support Vector Machines
5. K-Nearest Neighbors
6. Gaussian Naive Bayes
7. LinearDiscriminantAnalysis

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB

from sklearn.metrics import precision_score, recall_score, accuracy_score, f1_score

## Machine Learning Classifier Training and Validating

In [None]:

df_perf_metrics = pd.DataFrame(columns=[
    'Model', 'Accuracy_Training_Set', 'Accuracy_Test_Set', 'Precision',
    'Recall', 'f1_score'
])
models_trained_list = []


def get_perf_metrics(model, i):
    # model name
    model_name = type(model).__name__
    print("Training {} model...".format(model_name))
    # Fitting of model
    model.fit(X_train, y_train)
    print("Completed {} model training.".format(model_name))
    # Predictions
    y_pred = model.predict(X_test)
    # Add to ith row of dataframe - metrics

    df_perf_metrics.loc[i] = [
        model_name,
        model.score(X_train, y_train),
        model.score(X_test, y_test),
        precision_score(y_test, y_pred),
        recall_score(y_test, y_pred),
        f1_score(y_test, y_pred),
    ]
   
    print("Completed {} model's performance assessment.".format(model_name))
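
To make the metrics collected by the function concrete, the sketch below computes precision, recall, and F1 by hand for the positive class (label 1) on a toy set of predictions using the same -1/1 labels. The label vectors are invented for illustration.

```python
import numpy as np

# Toy ground truth and predictions with the notebook's -1 / 1 labels
y_true = np.array([1, 1, 1, -1, -1, 1, -1, 1])
y_pred = np.array([1, -1, 1, -1, 1, 1, -1, 1])

tp = int(np.sum((y_true == 1) & (y_pred == 1)))   # predicted 1, truly 1
fp = int(np.sum((y_true == -1) & (y_pred == 1)))  # predicted 1, truly -1
fn = int(np.sum((y_true == 1) & (y_pred == -1)))  # predicted -1, truly 1

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
print(tp, fp, fn, precision, recall, f1)
```

scikit-learn's `precision_score`, `recall_score`, and `f1_score` treat label 1 as the positive class by default, which is why they work unchanged with -1/1 labels.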

In [None]:
models_list = [LogisticRegression(),
               DecisionTreeClassifier(),
               RandomForestClassifier(),
               SVC(),
               KNeighborsClassifier(),
               GaussianNB(),LinearDiscriminantAnalysis()
               ]

In [None]:
from sklearn.metrics import f1_score
for n, model in enumerate(models_list):
    get_perf_metrics(model, n)

In [None]:
df_perf_metrics

## Tuning the RandomForestClassifier Model

In [None]:
rfc = RandomForestClassifier()
parameters = {
    "n_estimators":[5,10,50,100,250],
    "max_depth":[2,4,8,16,32,None]}

In [None]:
from sklearn.model_selection import GridSearchCV
cv = GridSearchCV(rfc,parameters,cv=5)
cv.fit(X_train,y_train.values.ravel())
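
The cost of this search follows from the size of the grid. The sketch below counts the candidate combinations with scikit-learn's `ParameterGrid`, using the same parameter dictionary as above.

```python
from sklearn.model_selection import ParameterGrid

parameters = {
    "n_estimators": [5, 10, 50, 100, 250],
    "max_depth": [2, 4, 8, 16, 32, None],
}

n_candidates = len(ParameterGrid(parameters))  # every combination of the two lists
n_folds = 5
print(n_candidates, n_candidates * n_folds)
```

With 5-fold cross-validation each candidate is fitted five times, and `GridSearchCV` by default refits the best candidate once more on the full training set.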

In [None]:
def display(results):
    print(f'Best parameters are: {results.best_params_}')
    print("\n")
    mean_score = results.cv_results_['mean_test_score']
    std_score = results.cv_results_['std_test_score']
    params = results.cv_results_['params']
    # Use a separate loop variable so the params list is not overwritten while iterating
    for mean, std, param in zip(mean_score, std_score, params):
        print(f'{round(mean,3)} +/- {round(std,3)} for {param}')

In [None]:
display(cv)

In [None]:
model = cv.best_estimator_
y_pred = model.predict(X_test)
print('Accuracy: ', accuracy_score(y_test, y_pred))
print('Precision: ', precision_score(y_test, y_pred))
print('Recall: ', recall_score(y_test, y_pred))
print('f1-score: ', f1_score(y_test, y_pred))

In [None]:
df1_perf_metrics = pd.DataFrame(columns=[
    'Model', 'Accuracy_Training_Set', 'Accuracy_Test_Set', 'Precision',
    'Recall', 'f1_score'
])

def get_perf_metrics_t(model):
    # Record the class name of the estimator passed in, not a new estimator object
    model_name = type(model).__name__

    print('Training {}'.format(model_name))
    model.fit(X_train, y_train)
    print('Completed {}'.format(model_name))
    y_pred = model.predict(X_test)

    df1_perf_metrics.loc[0] = [
        model_name,
        model.score(X_train, y_train),
        model.score(X_test, y_test),
        precision_score(y_test, y_pred),
        recall_score(y_test, y_pred),
        f1_score(y_test, y_pred),
    ]

    print("Completed {} model's performance assessment.".format(model_name))

# Evaluate the best estimator found by the grid search
get_perf_metrics_t(cv.best_estimator_)

In [None]:
df1_perf_metrics