# Homework 3
In this assignment, we start with hyperparameter tuning, carried over from the last homework, and then build a Naïve Bayes classifier and an SVM model that predict productivity satisfaction on [the given dataset](https://archive.ics.uci.edu/ml/datasets/Productivity+Prediction+of+Garment+Employees), which records the productivity of garment employees.

## For Question 1:

### About the Data Set
Seven different types of dry beans were used in a study in Selcuk University, Turkey, taking into account the features such as form, shape, type, and structure by the market situation. A computer vision system was developed to distinguish seven different registered varieties of dry beans with similar features in order to obtain uniform seed classification. For the classification model, images of 13611 grains of 7 different registered dry beans were taken with a high-resolution camera. Bean images obtained by computer vision system were subjected to segmentation and feature extraction stages, and a total of 16 features - 12 dimensions and 4 shape forms - were obtained from the grains.

Number of Instances (records in the data set): __13611__

Number of Attributes (fields within each record, including the class): __17__

### Data Set Attribute Information:

1. __Area (A)__ : The area of a bean zone and the number of pixels within its boundaries.
2. __Perimeter (P)__ : Bean circumference is defined as the length of its border.
3. __Major axis length (L)__ : The distance between the ends of the longest line that can be drawn from a bean.
4. __Minor axis length (l)__ : The longest line that can be drawn from the bean while standing perpendicular to the main axis.
5. __Aspect ratio (K)__ : Defines the relationship between L and l.
6. __Eccentricity (Ec)__ : Eccentricity of the ellipse having the same moments as the region.
7. __Convex area (C)__ : Number of pixels in the smallest convex polygon that can contain the area of a bean seed.
8. __Equivalent diameter (Ed)__ : The diameter of a circle having the same area as a bean seed area.
9. __Extent (Ex)__ : The ratio of the pixels in the bounding box to the bean area.
10. __Solidity (S)__ : Also known as convexity. The ratio of the pixels in the convex shell to those found in beans.
11. __Roundness (R)__ : Calculated with the following formula: (4πA)/(P^2)
12. __Compactness (CO)__ : Measures the roundness of an object: Ed/L
13. __ShapeFactor1 (SF1)__
14. __ShapeFactor2 (SF2)__
15. __ShapeFactor3 (SF3)__
16. __ShapeFactor4 (SF4)__

17. __Classes : *Seker, Barbunya, Bombay, Cali, Dermason, Horoz, Sira*__
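
As a worked illustration of the ratio-style attributes above, here is a minimal sketch that derives them from the basic measurements. The function name `shape_features` is hypothetical, and the formulas `K = L / l` and `Ed = sqrt(4A / pi)` are assumptions read off the attribute descriptions rather than taken from the dataset's code.

```python
import math

def shape_features(area, perimeter, major_axis, minor_axis):
    """Derive ratio-style bean features from four basic measurements (illustrative only)."""
    # Ed: diameter of a circle with the same area as the bean
    equiv_diameter = math.sqrt(4.0 * area / math.pi)
    return {
        "aspect_ratio": major_axis / minor_axis,             # K = L / l (assumed definition)
        "equiv_diameter": equiv_diameter,                    # Ed
        "roundness": 4.0 * math.pi * area / perimeter ** 2,  # R = 4*pi*A / P^2
        "compactness": equiv_diameter / major_axis,          # CO = Ed / L
    }

# Sanity check on a perfect circle of radius 10: every ratio should equal 1
r = 10.0
circle = shape_features(math.pi * r ** 2, 2 * math.pi * r, 2 * r, 2 * r)
print(circle)
```

For a circle, roundness, compactness, and aspect ratio all come out as 1, which is a quick way to confirm the formulas are wired up correctly.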

## For Questions 2-4:
### Background 
The Garment Industry is one of the key examples of the industrial globalization of this modern era. It is a highly labour-intensive industry with lots of manual processes. Satisfying the huge global demand for garment products is mostly dependent on the production and delivery performance of the employees in the garment manufacturing companies. So, it is highly desirable among the decision makers in the garments industry to track, analyse and predict the productivity performance of the working teams in their factories. 

### Dataset Attribute Information

1. **date**: Date in MM-DD-YYYY
2. **day**: Day of the Week
3. **quarter** : A portion of the month. A month was divided into four quarters
4. **department** : Associated department with the instance
5. **team_no** : Associated team number with the instance
6. **no_of_workers** : Number of workers in each team
7. **no_of_style_change** : Number of changes in the style of a particular product
8. **targeted_productivity** : Targeted productivity set by the Authority for each team for each day.
9. **smv** : Standard Minute Value, it is the allocated time for a task
10. **wip** : Work in progress. Includes the number of unfinished items for products
11. **over_time** : Represents the amount of overtime by each team in minutes
12. **incentive** : Represents the amount of financial incentive (in BDT) that enables or motivates a particular course of action.
13. **idle_time** : The amount of time when the production was interrupted due to several reasons
14. **idle_men** : The number of workers who were idle due to production interruption
15. **actual_productivity** : The actual % of productivity that was delivered by the workers. It ranges from 0-1.

#### Libraries that can be used: numpy, scipy, pandas, scikit-learn, cvxpy, imbalanced-learn
Any libraries used in the discussion materials are also allowed.

#### Other Notes

 - Don't worry if you cannot achieve high accuracy; it is neither the goal nor the grading standard of this assignment. <br >
 - If not specified, you are not required to do hyperparameter tuning, but feel free to do so if you'd like.

#### Troubleshooting
In case you have trouble installing or using imbalanced-learn (imblearn): <br >
Run the code cell below, then go to the menu bar at the top: Kernel > Restart. <br >
Then try `import imblearn` to see if it works. 
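
A minimal sketch of such a check, assuming `pip` is available in the same environment as the notebook kernel. The helper names `is_installed` and `pip_install_command` are hypothetical; the snippet only prints the install command instead of running it.

```python
import importlib.util
import sys

def is_installed(module_name):
    """Return True if the module can be found by the current interpreter."""
    return importlib.util.find_spec(module_name) is not None

def pip_install_command(package):
    """Build a pip command bound to the interpreter the kernel is running on."""
    return [sys.executable, "-m", "pip", "install", package]

if not is_installed("imblearn"):
    # Run this command (for example with subprocess.check_call), then restart the kernel
    print("Run:", " ".join(pip_install_command("imbalanced-learn")))
else:
    print("imblearn is already importable")
```

Using `sys.executable` rather than a bare `pip` helps the package land in the environment the kernel actually uses.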

# Exercises

## Exercise 1 - Hyperparameter Tuning (20 points)

Use either grid search or random search methodology to find the optimal number of nodes required in each hidden layer, as well as the optimal learning rate and the number of epochs, such that the accuracy of the model is maximum for the given data set.

__Requirements :__
- The set of optimal hyperparameters
- The maximum accuracy achieved using this set of optimal hyperparameters

__Note :__ Hyperparameter tuning takes a long time to execute. Choose a manageable number of candidate values for each hyperparameter (preferably 3 each), and allocate enough time to run your code.
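
To see why the number of candidate values matters, the sketch below counts how many models a grid search would train. The candidate values are hypothetical, and the fold count of 5 is an assumption that matches the default of scikit-learn's `GridSearchCV` in recent versions. It uses `itertools` only, not scikit-learn.

```python
from itertools import product

# Hypothetical grid: 3 candidate values per hyperparameter
param_grid = {
    "hidden_layer_sizes": [(5, 7), (7, 13), (13, 10)],
    "learning_rate_init": [0.15, 0.30, 0.45],
    "max_iter": [500, 800, 1000],
}
cv_folds = 5  # assumption: 5-fold cross-validation

# Every combination of candidate values is one configuration to evaluate
combos = [dict(zip(param_grid, values)) for values in product(*param_grid.values())]
total_fits = len(combos) * cv_folds

print(len(combos), "configurations,", total_fits, "model fits")
```

With 3 values each, the grid has 3 x 3 x 3 = 27 configurations, or 135 fits under 5-fold cross-validation. A random search with `n_iter = 10` would cap this at 50 fits.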

In [5]:

import numpy as np
import pandas as pd
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import MinMaxScaler

# Load the data and separate the features from the class label
dataset = pd.read_csv("wine.csv")

X = dataset.drop('Wine', axis = 1)
y = dataset['Wine']

# Rescale every feature to the range [0, 1]
scaler = MinMaxScaler(feature_range=(0, 1))
X_rescaled = scaler.fit_transform(X)
X = pd.DataFrame(data = X_rescaled, columns = X.columns)

# Keep the list of class labels, then one-hot encode the target
set_of_classes = y.value_counts().index.tolist()
set_of_classes = pd.DataFrame({'Class': set_of_classes})
y = pd.get_dummies(y)

# Candidate values for each hyperparameter
max_iterations = [500, 800, 1000]
hidden_layer_sizes = [(5, 7), (7, 13), (13, 10)]
learning_rates = 0.15 * np.arange(1, 3)  # [0.15, 0.30]

param_grid = dict(learning_rate_init = learning_rates, hidden_layer_sizes = hidden_layer_sizes, max_iter = max_iterations)

# Base model; the values in param_grid override the matching arguments during the search
mlp = MLPClassifier(solver = 'sgd', random_state = 42, activation = 'logistic', learning_rate_init = 0.3, batch_size = 100, hidden_layer_sizes = (12, 3), max_iter = 500)

# For Grid Search
grid = GridSearchCV(estimator = mlp, param_grid = param_grid)

# For Random Search
# grid = RandomizedSearchCV(estimator = mlp, param_distributions = param_grid, n_iter = 10)

grid.fit(X, y)

In [6]:
print("Optimal Hyper-parameters : ", grid.best_params_)
print("Optimal Accuracy : ", grid.best_score_)

Optimal Hyper-parameters :  {'hidden_layer_sizes': (5, 7), 'learning_rate_init': 0.15, 'max_iter': 500}
Optimal Accuracy :  0.9552380952380952


## Exercise 2 - General Data Preprocessing (10 points)

Our dataset needs cleaning before we build any models. Some cleaning tasks are common to all models, but depending on the kind of model we are building, additional processing is sometimes needed. These additional tasks are described in each of the remaining two exercises.

Note that **we will use the processed data from this exercise (Exercise 2) in each of the remaining two exercises**.

For convenience, here are the attributes that we will treat as **categorical attributes**: `day`, `quarter`, `department`, and `team`. 

 - Drop the column `date`.
 - For each of the categorical attributes, **print out** all the unique elements.
 - For each of the categorical attributes, remap duplicated items that differ only by typos or stray spaces to a single value.
     - For example, "a" and "a " should be the same, so we need to update "a " to be "a".
     - Another example, "apple" and "appel" should be the same, so you should update "appel" to be "apple".
     

 - Create another column named `satisfied` that records the productivity performance. The behavior defined as follows. **This is the dependent variable we'd like to classify in this assignment.**
     - Return True or 1 if `actual_productivity` is equal to or greater than `targeted_productivity`. Otherwise, return False or 0, which means the team fails to meet the expected performance.
 - Drop the columns `actual_productivity` and `targeted_productivity`.


 - Find and **print out** which columns/attributes have empty values, e.g., NA, NaN, null, None.
 - Fill the empty values with 0.
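
The steps listed above can be sketched on a tiny made-up frame. The rows and values below are invented for illustration and only mimic a few of the garment columns; they are not taken from the real dataset.

```python
import numpy as np
import pandas as pd

# Made-up rows that mimic a few garment columns (illustrative values only)
toy = pd.DataFrame({
    "date": ["1/1/2015", "1/1/2015", "1/3/2015", "1/3/2015"],
    "department": ["sewing", "finishing ", "finishing", "sweing"],
    "wip": [1108.0, np.nan, np.nan, 968.0],
    "targeted_productivity": [0.80, 0.75, 0.80, 0.80],
    "actual_productivity": [0.94, 0.70, 0.80, 0.63],
})

toy = toy.drop(columns=["date"])

# Strip stray whitespace, then map known typos onto the canonical spelling
toy["department"] = toy["department"].str.strip().replace({"sweing": "sewing"})
print(toy["department"].unique())

# Target: did the team meet or exceed its productivity target?
toy["satisfied"] = toy["actual_productivity"] >= toy["targeted_productivity"]
toy = toy.drop(columns=["actual_productivity", "targeted_productivity"])

# Report the columns that contain missing values, then fill them with 0
cols_with_na = toy.columns[toy.isna().any()].tolist()
print(cols_with_na)
toy = toy.fillna(0)
```

Stripping whitespace before remapping typos keeps the typo dictionary short, because `"finishing "` and `"finishing"` collapse on their own.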

In [7]:
import pandas as pd
import numpy as np
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB, CategoricalNB
from sklearn.metrics import classification_report
from sklearn.svm import SVC
from imblearn.over_sampling import RandomOverSampler
from IPython.display import display # Just for solution

In [8]:
df = pd.read_csv('heart_disease_uci.csv')
df = df.drop(columns=['id', 'dataset'])

cats = ['sex', 'cp', 'fbs', 'restecg', 'exang', 'slope', 'thal']
for cat in cats:
    print(df[cat].unique()) 
# You should notice that we have duplicated "finishing" and "finishing ".

['Male' 'Female']
['typical angina' 'asymptomatic' 'non-anginal' 'atypical angina']
[ True False]
['lv hypertrophy' 'normal' 'st-t abnormality']
[False  True]
['downsloping' 'flat' 'upsloping']
['fixed defect' 'normal' 'reversable defect']


In [9]:
# typo 
df = df.replace('st-t abnormality ', 'stt abnormality') 
df['restecg'].value_counts()

normal              150
lv hypertrophy      148
st-t abnormality      4
Name: restecg, dtype: int64

In [10]:
df['satisfied'] = (df['num'] >= 1)
df = df.drop(columns=['num'])

In [11]:
print(df.columns[df.isna().any()])
df = df.fillna(0)

Index(['ca'], dtype='object')


In [12]:
# Just for showing you the look of the processed data
print(df.shape)
display(df.head(5))

(302, 14)


|   | age | sex | cp | trestbps | chol | fbs | restecg | thalch | exang | oldpeak | slope | ca | thal | satisfied |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 63 | Male | typical angina | 145 | 233 | True | lv hypertrophy | 150 | False | 2.3 | downsloping | 0.0 | fixed defect | False |
| 1 | 67 | Male | asymptomatic | 160 | 286 | False | lv hypertrophy | 108 | True | 1.5 | flat | 3.0 | normal | True |
| 2 | 67 | Male | asymptomatic | 120 | 229 | False | lv hypertrophy | 129 | True | 2.6 | flat | 2.0 | reversable defect | True |
| 3 | 37 | Male | non-anginal | 130 | 250 | False | normal | 187 | False | 3.5 | downsloping | 0.0 | normal | False |
| 4 | 41 | Female | atypical angina | 130 | 204 | False | lv hypertrophy | 172 | False | 1.4 | upsloping | 0.0 | normal | False |


## Exercise 3 - Naïve Bayes Classifier (40 points in total)

### Exercise 3.1 - Additional Data Preprocessing (10 points)

To build a Naïve Bayes Classifier, we need to further encode our categorical variables.

 - For each of the **categorical attributes**, encode the set of categories to be **0 ~ (n_classes - 1)**.
     - For example, \["paris", "paris", "tokyo", "amsterdam"\] should be encoded as \[1, 1, 2, 0\].
     - Note that the order does not really matter, i.e., \[0, 0, 1, 2\] also works. But you have to start with 0 in your encodings.
     - You can find information about this encoding in the discussion materials.


 - Split the data into training and testing sets with a ratio of 80:20.
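
A minimal sketch of this encoding and split, assuming scikit-learn's `OrdinalEncoder` applied to the categorical columns only. The toy frame and the column names `city` and `visits` are hypothetical.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy data: one categorical column, one numerical column, one target
toy = pd.DataFrame({
    "city": ["paris", "paris", "tokyo", "amsterdam", "tokyo"],
    "visits": [3, 1, 4, 1, 5],
    "satisfied": [1, 0, 1, 0, 1],
})
cat_cols = ["city"]

# Encode only the categorical columns; numerical columns are left untouched
encoded = toy.copy()
encoder = OrdinalEncoder()
encoded[cat_cols] = encoder.fit_transform(toy[cat_cols]).astype(int)
print(encoded["city"].tolist())

# 80:20 split; random_state is fixed here only to make the sketch repeatable
train, test = train_test_split(encoded, test_size=0.2, random_state=0)
print(train.shape, test.shape)
```

`OrdinalEncoder` sorts the categories, so `amsterdam`, `paris`, and `tokyo` map to 0, 1, and 2 respectively.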
 

In [13]:
# Remember to continue the task with your processed data from Exercise 2

In [14]:
df_nb = df.copy()
df_nb = pd.DataFrame(preprocessing.OrdinalEncoder().fit_transform(df_nb), columns=df_nb.columns)
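# Note: applied to the whole frame, OrdinalEncoder also re-codes the numerical columns to the rank of each distinct value (e.g. age 63 becomes 29.0).
# LabelEncoder, applied one column at a time, produces the same codes.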
display(df_nb.head(10))

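|   | age | sex | cp | trestbps | chol | fbs | restecg | thalch | exang | oldpeak | slope | ca | thal | satisfied |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 29.0 | 1.0 | 3.0 | 31.0 | 64.0 | 1.0 | 0.0 | 49.0 | 0.0 | 22.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 33.0 | 1.0 | 0.0 | 40.0 | 111.0 | 0.0 | 0.0 | 10.0 | 1.0 | 15.0 | 1.0 | 3.0 | 1.0 | 1.0 |
| 2 | 33.0 | 1.0 | 0.0 | 14.0 | 60.0 | 0.0 | 0.0 | 29.0 | 1.0 | 25.0 | 1.0 | 2.0 | 2.0 | 1.0 |
| 3 | 3.0 | 1.0 | 2.0 | 22.0 | 80.0 | 0.0 | 1.0 | 84.0 | 0.0 | 32.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 4 | 7.0 | 0.0 | 1.0 | 22.0 | 35.0 | 0.0 | 0.0 | 71.0 | 0.0 | 14.0 | 2.0 | 0.0 | 1.0 | 0.0 |
| 5 | 22.0 | 1.0 | 1.0 | 14.0 | 67.0 | 0.0 | 1.0 | 76.0 | 0.0 | 8.0 | 2.0 | 0.0 | 1.0 | 0.0 |
| 6 | 28.0 | 0.0 | 0.0 | 28.0 | 97.0 | 0.0 | 0.0 | 59.0 | 0.0 | 33.0 | 0.0 | 2.0 | 1.0 | 1.0 |
| 7 | 23.0 | 0.0 | 0.0 | 14.0 | 145.0 | 0.0 | 1.0 | 62.0 | 1.0 | 6.0 | 2.0 | 0.0 | 1.0 | 0.0 |
| 8 | 29.0 | 1.0 | 0.0 | 22.0 | 83.0 | 0.0 | 0.0 | 46.0 | 0.0 | 14.0 | 1.0 | 1.0 | 2.0 | 1.0 |
| 9 | 19.0 | 1.0 | 0.0 | 28.0 | 34.0 | 1.0 | 0.0 | 54.0 | 1.0 | 29.0 | 0.0 | 0.0 | 2.0 | 1.0 |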


In [15]:
nb_train, nb_test = train_test_split(df_nb, test_size=0.2)
X_nb_train, y_nb_train = nb_train.drop(columns=['satisfied']), nb_train['satisfied']
X_nb_test, y_nb_test = nb_test.drop(columns=['satisfied']), nb_test['satisfied']
print(X_nb_train.shape, X_nb_test.shape)

(241, 13) (61, 13)


### Exercise 3.2 - Naïve Bayes Classifier for Categorical Attributes (15 points)

Using the categorical attributes **only**, build a Categorical Naïve Bayes classifier that predicts the column `satisfied`. <br >
Report the **testing result** using `classification_report`.


In [16]:
# Remember to do this task with your processed data from Exercise 3.1

In [17]:
clf_cat = CategoricalNB()
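
The cell above only creates the classifier. As a hedged sketch of the remaining fit, predict, and report steps, the example below uses synthetic ordinal-encoded features in place of the homework data. The data, the label rule, and the use of `min_categories` are assumptions for illustration, not the reference solution.

```python
import numpy as np
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import CategoricalNB

rng = np.random.default_rng(0)
n = 200

# Two synthetic ordinal-encoded categorical features with 3 and 4 categories
X = np.column_stack([rng.integers(0, 3, n), rng.integers(0, 4, n)])
y = (X[:, 0] == 2).astype(int)  # synthetic label that depends on the first feature

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# min_categories reserves a slot for every category per feature, so a category
# that happens to be absent from the training split does not break predict()
clf = CategoricalNB(min_categories=[3, 4])
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)
print(classification_report(y_test, y_pred))
```

In the notebook, the same pattern would apply to the categorical columns of the training and testing frames, for example `X_nb_train[cats]` and `X_nb_test[cats]`.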

### Exercise 3.3 - Naïve Bayes Classifier for Numerical Attributes (15 points)

Using the numerical attributes **only**, build a Gaussian Naïve Bayes classifier that predicts the column `satisfied`. <br >
Report the **testing result** using `classification_report`.

**Remember to scale your data. The scaling method is up to you.**


In [18]:
# Remember to do this task with your processed data from Exercise 3.1

In [19]:
nums = [col for col in X_nb_train.columns if col not in cats]

clf_num = GaussianNB()
scaler = preprocessing.StandardScaler()
scaler.fit(X_nb_train[nums])

Z_nb_train = scaler.transform(X_nb_train[nums])
Z_nb_test = scaler.transform(X_nb_test[nums])

clf_num.fit(Z_nb_train, np.asarray(y_nb_train))

print(classification_report(y_nb_test, clf_num.predict(Z_nb_test)))

              precision    recall  f1-score   support

         0.0       0.76      0.80      0.78        35
         1.0       0.71      0.65      0.68        26

    accuracy                           0.74        61
   macro avg       0.73      0.73      0.73        61
weighted avg       0.74      0.74      0.74        61



## Exercise 4 - SVM Classifier (40 points in total)

### Exercise 4.1 - Additional Data Preprocessing (10 points)

To build an SVM classifier, we need a different encoding for our categorical variables.

 - For each of the **categorical attributes**, encode them with **one-hot encoding**.
     - You can find information about this encoding in the discussion materials.


 - Split the data into training and testing sets with a ratio of 80:20.
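
As a small illustration of one-hot encoding, here is a sketch using `pandas.get_dummies` on a made-up frame. The rows are invented; only the column names echo the garment data.

```python
import pandas as pd

# Made-up rows for illustration only
toy = pd.DataFrame({
    "day": ["Monday", "Tuesday", "Monday", "Sunday"],
    "no_of_workers": [59, 8, 30, 56],
    "satisfied": [True, False, True, False],
})

# Each category of `day` becomes its own indicator column; other columns are kept as-is
onehot = pd.get_dummies(toy, columns=["day"])
print(onehot.columns.tolist())
```

Each row has exactly one active indicator among the `day_*` columns, which is what makes the encoding suitable for distance-based models such as SVMs.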



In [20]:
# Remember to continue the task with your processed data from Exercise 2

In [33]:
# Sklearn solution
cats = ['sex', 'cp', 'fbs', 'restecg', 'exang', 'slope', 'thal']
nums = [col for col in df.drop(columns=['satisfied']).columns if col not in cats]
df_svm = df.copy()
df_svm = pd.get_dummies(df_svm, columns=cats)


In [34]:
print(df_svm.columns)

Index([      'age',  'trestbps',      'chol',    'thalch',   'oldpeak',
              'ca', 'satisfied',           0,           1,           2,
                 3,           4,           5,           6,           7,
                 8,           9,          10,          11,          12,
                13,          14,          15,          16,          17,
                18],
      dtype='object')


In [36]:
svm_train, svm_test = train_test_split(df_svm, test_size=0.2)
X_svm_train, y_svm_train = svm_train.drop(columns=['satisfied']), svm_train['satisfied']
X_svm_test, y_svm_test = svm_test.drop(columns=['satisfied']), svm_test['satisfied']

### Exercise 4.2 - SVM with Different Kernels (20 points)

Using all the attributes we have, build an SVM that predicts the column `satisfied`. <br >
Specifically, please 
 - Build one SVM with a **linear kernel**.
 - Build another SVM with an **rbf kernel**.
 - Report the **testing results** of **both models** using `classification_report`.

The kernel is the only setting requirement. <br >
Other hyperparameter tuning is not required. But make sure they are the same in these two SVMs if you'd like to tune the model. In other words, the only difference between the two SVMs is the kernel setting.

**Remember to scale your data. The scaling method is up to you.**



In [37]:
# Remember to do this task with your processed data from Exercise 4.1

### If all the data is scaled

In [42]:
svc_li = SVC(kernel='linear')

scaler = preprocessing.StandardScaler()
scaler.fit(X_svm_train)

Z_svm_train = scaler.transform(X_svm_train)
Z_svm_test = scaler.transform(X_svm_test)

svc_li.fit(Z_svm_train, np.asarray(y_svm_train))

print('Linear Kernel')
print(classification_report(y_svm_test, svc_li.predict(Z_svm_test)))


Linear Kernel
              precision    recall  f1-score   support

       False       0.79      0.79      0.79        29
        True       0.81      0.81      0.81        32

    accuracy                           0.80        61
   macro avg       0.80      0.80      0.80        61
weighted avg       0.80      0.80      0.80        61
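
The cell above covers only the linear kernel. As a sketch of how both kernels can be compared with identical settings, the example below uses synthetic data from `make_classification` in place of the homework data. The dataset and variable names are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the homework data
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Fit the scaler on the training split only, then apply it to both splits
scaler = StandardScaler().fit(X_train)
Z_train, Z_test = scaler.transform(X_train), scaler.transform(X_test)

reports = {}
for kernel in ("linear", "rbf"):
    model = SVC(kernel=kernel)  # the kernel is the only setting that differs
    model.fit(Z_train, y_train)
    reports[kernel] = classification_report(y_test, model.predict(Z_test))
    print(kernel)
    print(reports[kernel])
```

Looping over the kernel name keeps every other hyperparameter identical, which is the condition the exercise asks for.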



### Exercise 4.3 - SVM with Over-sampling (10 points)
 - For the column `satisfied` in our **training set**, please **print out** the frequency of each class. 
 - Oversample the **training data**. 
 - For the column `satisfied` in the oversampled data, **print out** the frequency of each class again.
 - Rebuild the 2 SVMs with the same settings you used in Exercise 4.2, but **use the oversampled training data** instead.
     - Do not forget to scale the data first. As always, the scaling method is up to you.
 - Report the testing result with `classification_report`.

You can use ANY of the methods listed [here](https://imbalanced-learn.org/stable/references/over_sampling.html#), such as RandomOverSampler or SMOTE. <br > 
You are also welcome to build your own oversampler. <br >

Note that you do not need to oversample your testing data.
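
For those who want to build their own oversampler, here is a minimal NumPy sketch of random oversampling. The function `random_oversample` is hypothetical and is not part of imbalanced-learn; it duplicates randomly chosen rows of each smaller class until every class matches the largest one.

```python
import numpy as np

def random_oversample(X, y, seed=0):
    """Duplicate random rows of each smaller class until all classes match the largest."""
    rng = np.random.default_rng(seed)
    X, y = np.asarray(X), np.asarray(y)
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()

    keep = [np.arange(len(y))]  # every original row is kept
    for cls, count in zip(classes, counts):
        if count < target:
            pool = np.flatnonzero(y == cls)  # row indices of this class
            keep.append(rng.choice(pool, size=target - count, replace=True))

    idx = np.concatenate(keep)
    return X[idx], y[idx]

# Toy data: 7 rows of class 0 and 3 rows of class 1
X = np.arange(20).reshape(10, 2)
y = np.array([0] * 7 + [1] * 3)

X_os, y_os = random_oversample(X, y)
print(np.bincount(y), np.bincount(y_os))
```

Only the training split should be passed through such a function, so that the test set keeps the original class distribution.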

In [43]:
# Remember to do this task with your processed data from Exercise 4.1

In [44]:
# Check the class balance of the training labels before and after oversampling.
# When one class is much rarer than the other, its recall tends to be low,
# because Naïve Bayes and SVM are sensitive to dataset imbalance.
print(y_svm_train.value_counts())

ros = RandomOverSampler()
X_os, y_os = ros.fit_resample(X_svm_train, y_svm_train)

print(y_os.value_counts())

False    134
True     107
Name: satisfied, dtype: int64
True     134
False    134
Name: satisfied, dtype: int64
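
One way to finish this step is sketched below on synthetic data. Everything in it is an assumption rather than the reference solution: the data comes from `make_classification`, and `sklearn.utils.resample` stands in for an imbalanced-learn sampler.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.utils import resample

# Synthetic imbalanced data: roughly 80% class 0 and 20% class 1
X, y = make_classification(n_samples=400, n_features=6, weights=[0.8, 0.2], random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=1)

# Oversample the minority class in the training split only
minority = np.flatnonzero(y_train == 1)
majority = np.flatnonzero(y_train == 0)
extra = resample(minority, replace=True,
                 n_samples=len(majority) - len(minority), random_state=1)
idx = np.concatenate([majority, minority, extra])
X_os, y_os = X_train[idx], y_train[idx]
print(np.bincount(y_train), np.bincount(y_os))

# Scale using statistics from the oversampled training data; the test set is untouched
scaler = StandardScaler().fit(X_os)
Z_os, Z_test = scaler.transform(X_os), scaler.transform(X_test)

for kernel in ("linear", "rbf"):
    model = SVC(kernel=kernel).fit(Z_os, y_os)
    print(kernel)
    print(classification_report(y_test, model.predict(Z_test)))
```

The test split is neither oversampled nor used to fit the scaler, so the reported scores reflect the original class distribution.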


## Example of How to Write an SVM

In [45]:
import numpy as np

class SVM:
    """Linear soft-margin SVM trained with per-sample sub-gradient descent."""

    def __init__(self, lr=.005, iters=1000, lambda_p=0.01):
        self.lr = lr              # learning rate
        self.iters = iters        # number of passes over the training data
        self.lambda_p = lambda_p  # L2 regularization strength

        self.weight = None
        self.bias = None

    def fit(self, X, y):
        n_samples, n_feat = X.shape

        self.weight = np.zeros(n_feat)
        self.bias = 0

        # Map labels to {-1, +1}: -1 if y <= 0 (or False), else +1
        y_true = np.where(y <= 0, -1, 1)

        for _ in range(self.iters):
            for idx, sample in enumerate(X):
                # Functional margin y_i * (w . x_i - b)
                margin = y_true[idx] * (np.dot(sample, self.weight) - self.bias)

                if margin >= 1:
                    # Outside the margin: only the regularizer contributes
                    self.weight -= self.lr * (2 * self.lambda_p * self.weight)
                else:
                    # Margin violated: add the hinge-loss sub-gradient
                    self.weight -= self.lr * (2 * self.lambda_p * self.weight - sample * y_true[idx])
                    self.bias -= self.lr * y_true[idx]

    def predict(self, X):
        # Decision function w . x - b
        y = np.dot(X, self.weight) - self.bias
        return y >= 0  # True (positive class) if the score is >= 0, else False
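
The update rule in `fit` can be read as sub-gradient descent on a regularized hinge loss. The derivation below is a sketch under that reading, with `lambda_p` written as $\lambda$ and `lr` written as $\eta$. The symbols are introduced here for illustration.

```latex
% Per-sample objective, with decision function f(x) = w^\top x - b
J_i(w, b) = \lambda \lVert w \rVert^2 + \max\bigl(0,\; 1 - y_i (w^\top x_i - b)\bigr)

% Sub-gradients when the margin is satisfied
\text{if } y_i (w^\top x_i - b) \ge 1: \quad
  \frac{\partial J_i}{\partial w} = 2 \lambda w, \qquad
  \frac{\partial J_i}{\partial b} = 0

% Sub-gradients when the margin is violated
\text{otherwise}: \quad
  \frac{\partial J_i}{\partial w} = 2 \lambda w - y_i x_i, \qquad
  \frac{\partial J_i}{\partial b} = y_i

% Update with learning rate \eta
w \leftarrow w - \eta \, \frac{\partial J_i}{\partial w}, \qquad
b \leftarrow b - \eta \, \frac{\partial J_i}{\partial b}
```

The sign of the bias update follows from the decision function being $w^\top x - b$ rather than $w^\top x + b$.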




In [46]:
svm = SVM()
svm.fit(Z_svm_train, np.asarray(y_svm_train))

print(classification_report(y_svm_test, svm.predict(Z_svm_test)))

              precision    recall  f1-score   support

       False       0.80      0.83      0.81        29
        True       0.84      0.81      0.83        32

    accuracy                           0.82        61
   macro avg       0.82      0.82      0.82        61
weighted avg       0.82      0.82      0.82        61

