## Background
The Garment Industry is one of the key examples of the industrial globalization of this modern era. It is a highly labour-intensive industry with lots of manual processes. Satisfying the huge global demand for garment products is mostly dependent on the production and delivery performance of the employees in the garment manufacturing companies. So, it is highly desirable among the decision makers in the garments industry to track, analyse and predict the productivity performance of the working teams in their factories. 

## Dataset Attribute Information

1. **date**: Date in MM-DD-YYYY
2. **day**: Day of the Week
3. **quarter** : A portion of the month. A month was divided into four quarters
4. **department** : Associated department with the instance
5. **team_no** : Associated team number with the instance
6. **no_of_workers** : Number of workers in each team
7. **no_of_style_change** : Number of changes in the style of a particular product
8. **targeted_productivity** : Targeted productivity set by the Authority for each team for each day.
9. **smv** : Standard Minute Value, it is the allocated time for a task
10. **wip** : Work in progress. Includes the number of unfinished items for products
11. **over_time** : Represents the amount of overtime by each team in minutes
12. **incentive** : Represents the amount of financial incentive (in BDT) that enables or motivates a particular course of action.
13. **idle_time** : The amount of time when the production was interrupted due to several reasons
14. **idle_men** : The number of workers who were idle due to production interruption
15. **actual_productivity** : The actual % of productivity that was delivered by the workers. It ranges from 0-1.

In [2]:
import pandas as pd
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
import numpy as np
from sklearn import preprocessing
from sklearn.naive_bayes import CategoricalNB, GaussianNB, MultinomialNB

## General Data Preprocessing (20 points)

Our dataset needs cleaning before building any models. Some of the cleaning tasks are common in general, but depends on what kind of models we are building, sometimes we have to do additional processing. These additional tasks will be mentioned in each of the remaining two exercises later.

In [5]:
df = pd.read_csv('./garments_worker_productivity.csv')
df = df.drop("date", axis=1)
print(df.columns)

Index(['quarter', 'department', 'day', 'team', 'targeted_productivity', 'smv',
       'wip', 'over_time', 'incentive', 'idle_time', 'idle_men',
       'no_of_style_change', 'no_of_workers', 'actual_productivity'],
      dtype='object')


In [6]:
categoricalcols = ['day','quarter','department','team']
for col in categoricalcols:
    print("Categorical Elements for {} are : {}".format(col,df[col].unique()))

Categorical Elements for day are : ['Thursday' 'Saturday' 'Sunday' 'Monday' 'Tuesday' 'Wednesday']
Categorical Elements for quarter are : ['Quarter1' 'Quarter2' 'Quarter3' 'Quarter4' 'Quarter5']
Categorical Elements for department are : ['sweing' 'finishing ' 'finishing']
Categorical Elements for team are : [ 8  1 11 12  6  7  2  3  9 10  5  4]


For each of the categorical attributes, remapped the duplicated items

In [7]:
df = df.replace(to_replace='.*finishing.*', value='finishing', regex = True)
df = df.replace(to_replace='sweing', value='sewing')
for col in categoricalcols:
    print("Categorical Elements {} for {} are : {}".format(len(df[col].unique()),col,df[col].unique()))

Categorical Elements 6 for day are : ['Thursday' 'Saturday' 'Sunday' 'Monday' 'Tuesday' 'Wednesday']
Categorical Elements 5 for quarter are : ['Quarter1' 'Quarter2' 'Quarter3' 'Quarter4' 'Quarter5']
Categorical Elements 2 for department are : ['sewing' 'finishing']
Categorical Elements 12 for team are : [ 8  1 11 12  6  7  2  3  9 10  5  4]


 - Created another column named `satisfied` that records the productivity performance. The behavior defined as follows.
     - Returns True or 1 if `actual_productivity` is equal to or greater than `targeted_productivity`. Otherwise, return False or 0, which means the team fails to meet the expected performance.

In [8]:
df['satisfied'] = (df['actual_productivity'] >= df['targeted_productivity']).astype(int)
# df['satisfied'].head()

 - Drop the columns `actual_productivity` and `targeted_productivity`.

In [9]:
df = df.drop(columns=['actual_productivity','targeted_productivity'])
# print(df.columns)

 - Find and **print out** which columns/attributes that have empty vaules, e.g., NA, NaN, null, None.

In [10]:
print(df.columns[df.isna().any()].tolist())

['wip']


 - Fill the empty values with 0.


In [11]:
df = df.replace(to_replace=np.nan, value = 0)
ex1 = df # Extra step to save the runs : Ignore

In [12]:
print(df.columns[df.isna().any()].tolist()) #Checking if there are any NAs

[]


## Naïve Bayes Classifier

### Additional Data Preprocessing 

To build a Naïve Bayes Classifier, we need to further encode our categorical variables.

 - For each of the **categorical attribtues**, encode the set of categories to be **0 ~ (n_classes - 1)**.
     - For example, \["paris", "paris", "tokyo", "amsterdam"\] should be encoded as \[1, 1, 2, 0\].
     - Note that the order does not really matter, i.e., \[0, 0, 1, 2\] also works. But you have to start with 0 in your encodings.
     - You can find information about this encoding in the discussion materials.


 - Split the data into training and testing set with the ratio of 80:20.

In [13]:
df = ex1

In [14]:
# print(df)

In [15]:
encoder = preprocessing.OrdinalEncoder() # Encodes each category as integers. From 0 to n_classes - 1
y_encoder = preprocessing.LabelEncoder() # Same functionality but designed for the dependent variable.

X = df[categoricalcols]
encoder.fit(X)
X = encoder.transform(X)

y = df['satisfied']
y_encoder.fit(y)
y = y_encoder.transform(y)

X = pd.DataFrame(data=X, columns=[categoricalcols])
y = pd.DataFrame(data=y)

df = df.drop(columns=df[categoricalcols], axis =1)
df = pd.concat([df,X], axis=1)

# print(df.columns)
df = df[[c for c in df if c not in ['satisfied']] + ['satisfied']]

# print(df.head())

X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.2, random_state=2022)
# These 4 values are just for categorical variables


  result = np.asarray(values, dtype=dtype)


In [16]:
ex21 = df

### Naïve Bayes Classifier for Categorical Attributes

Use the categorical attributes built a Categorical Naïve Bayes classifier that predicts the column `satisfied`. <br >

In [17]:
df = ex21

In [18]:
NB = CategoricalNB() # You can try using MultinomialNB
NB.fit(X_train,y_train)

print(classification_report(y_test, NB.predict(X_test)))


              precision    recall  f1-score   support

           0       0.38      0.16      0.23        56
           1       0.78      0.92      0.84       184

    accuracy                           0.74       240
   macro avg       0.58      0.54      0.54       240
weighted avg       0.69      0.74      0.70       240



  y = column_or_1d(y, warn=True)


### Naïve Bayes Classifier for Numerical Attributes

Using the numerical attributes only, built a Gaussian Naïve Bayes classifier that predicts the column `satisfied`. <br >

In [19]:
# Remember to do this task with your processed data from Exercise 2.1
scaler = preprocessing.StandardScaler()
df = ex21

# Renaming columns as the were formatted differently
df = df.rename(columns={df.columns[8]: 'day'})
df = df.rename(columns={df.columns[9]: 'quarter'})
df = df.rename(columns={df.columns[10]: 'department'})
df = df.rename(columns={df.columns[11]: 'team'})

print(df.head())
X = df.drop(columns=categoricalcols, axis = 'columns')
X = df.drop(columns='satisfied', axis = 'columns')
# print(X.head())


X_train, X_test = train_test_split(X, test_size = 0.2, random_state = 2022)

scaler.fit(X_train)
NB = GaussianNB()
NB.fit(X_train, y_train)

print(classification_report(y_train, NB.predict(scaler.transform(X_train))))
print(classification_report(y_test, NB.predict(scaler.transform(X_test))))

     smv     wip  over_time  incentive  idle_time  idle_men  \
0  26.16  1108.0       7080         98        0.0         0   
1   3.94     0.0        960          0        0.0         0   
2  11.41   968.0       3660         50        0.0         0   
3  11.41   968.0       3660         50        0.0         0   
4  25.90  1170.0       1920         50        0.0         0   

   no_of_style_change  no_of_workers  day  quarter  department  team  \
0                   0           59.0  3.0      0.0         1.0   7.0   
1                   0            8.0  3.0      0.0         0.0   0.0   
2                   0           30.5  3.0      0.0         1.0  10.0   
3                   0           30.5  3.0      0.0         1.0  11.0   
4                   0           56.0  3.0      0.0         1.0   5.0   

   satisfied  
0          1  
1          1  
2          1  
3          1  
4          1  
              precision    recall  f1-score   support

           0       0.43      0.67      0.53

  y = column_or_1d(y, warn=True)
  "X does not have valid feature names, but"
  "X does not have valid feature names, but"


## SVM Classifier

### Additional Data Preprocessing

To build a SVM Classifier, we need a different encoding for our categorical variables.

 - For each of the **categorical attribtues**, encode them with **one-hot encoding**.
     - You can find information about this encoding in the discussion materials.


 - Split the data into training and testing set with the ratio of 80:20.


In [20]:
# Remember to continue the task with your processed data from Exercise 1
df = ex1

encoder = preprocessing.OneHotEncoder()
encoder.fit(df[categoricalcols])
onehot_encoded = encoder.transform(df[categoricalcols]).toarray()
# print(onehot_encoded, onehot_encoded.shape)

X_train, X_test, y_train, y_test = train_test_split(onehot_encoded, df['satisfied'], test_size = 0.2, random_state = 2022)

### SVM with Different Kernels

Using all the attributes we have, built a SVM that predicts the column `satisfied`. <br >
Specifically, 
 - One SVM with **linear kernel**.
 - SVM but with **rbf kernel**.

In [21]:
# Remember to do this task with your processed data from Exercise 3.1
from sklearn.svm import SVC

svmlin = SVC(kernel='linear')
scaler.fit_transform(X_train)

svmlin.fit(X_train, y_train)
yhat = svmlin.predict(X_test)
print(classification_report(y_test, yhat))

svmpoly = SVC(kernel = 'rbf', degree = 3)
svmpoly.fit(X_train,y_train)
yhatpoly = svmpoly.predict(X_test)
print(classification_report(y_test, yhatpoly))

              precision    recall  f1-score   support

           0       0.00      0.00      0.00        56
           1       0.77      1.00      0.87       184

    accuracy                           0.77       240
   macro avg       0.38      0.50      0.43       240
weighted avg       0.59      0.77      0.67       240

              precision    recall  f1-score   support

           0       0.57      0.23      0.33        56
           1       0.80      0.95      0.87       184

    accuracy                           0.78       240
   macro avg       0.68      0.59      0.60       240
weighted avg       0.75      0.78      0.74       240



  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


### SVM with Over-sampling 

In [22]:
# Remember to do this task with your processed data from Exercise 3.1
from imblearn.over_sampling import RandomOverSampler

print(y_train.value_counts())

ros = RandomOverSampler(random_state=21)
X_os, y_os = ros.fit_resample(X_train, y_train)

print(y_os.value_counts())

svmlin = SVC(kernel='linear')
scaler.fit_transform(X_train)

svmlin.fit(X_os, y_os)
yhat = svmlin.predict(X_test)
print(classification_report(y_test, yhat))

svmpoly = SVC(kernel = 'rbf', degree = 3)
svmpoly.fit(X_os,y_os)
yhatpoly = svmpoly.predict(X_test)
print(classification_report(y_test, yhatpoly))

1    691
0    266
Name: satisfied, dtype: int64
0    691
1    691
Name: satisfied, dtype: int64
              precision    recall  f1-score   support

           0       0.36      0.61      0.45        56
           1       0.85      0.67      0.75       184

    accuracy                           0.65       240
   macro avg       0.60      0.64      0.60       240
weighted avg       0.73      0.65      0.68       240

              precision    recall  f1-score   support

           0       0.35      0.59      0.44        56
           1       0.84      0.66      0.74       184

    accuracy                           0.65       240
   macro avg       0.59      0.63      0.59       240
weighted avg       0.73      0.65      0.67       240

