# Homework 3
In this assignment, we start with hyperparameter tuning, carried over from the last homework, and then build a Naïve Bayes classifier and an SVM model that predict productivity satisfaction on [the given dataset](https://archive.ics.uci.edu/ml/datasets/Productivity+Prediction+of+Garment+Employees), which records the productivity of garment employees.

## For Question 1:

### About the Data Set
Seven different types of dry beans were used in a study in Selcuk University, Turkey, taking into account the features such as form, shape, type, and structure by the market situation. A computer vision system was developed to distinguish seven different registered varieties of dry beans with similar features in order to obtain uniform seed classification. For the classification model, images of 13611 grains of 7 different registered dry beans were taken with a high-resolution camera. Bean images obtained by computer vision system were subjected to segmentation and feature extraction stages, and a total of 16 features - 12 dimensions and 4 shape forms - were obtained from the grains.

Number of Instances (records in the data set): __13611__

Number of Attributes (fields within each record, including the class): __17__

### Data Set Attribute Information:

1. __Area (A)__ : The area of a bean zone and the number of pixels within its boundaries.
2. __Perimeter (P)__ : Bean circumference is defined as the length of its border.
3. __Major axis length (L)__ : The distance between the ends of the longest line that can be drawn from a bean.
4. __Minor axis length (l)__ : The longest line that can be drawn from the bean while standing perpendicular to the main axis.
5. __Aspect ratio (K)__ : Defines the relationship between L and l.
6. __Eccentricity (Ec)__ : Eccentricity of the ellipse having the same moments as the region.
7. __Convex area (C)__ : Number of pixels in the smallest convex polygon that can contain the area of a bean seed.
8. __Equivalent diameter (Ed)__ : The diameter of a circle having the same area as a bean seed area.
9. __Extent (Ex)__ : The ratio of the pixels in the bounding box to the bean area.
10. __Solidity (S)__ : Also known as convexity. The ratio of the pixels in the convex shell to those found in beans.
11. __Roundness (R)__ : Calculated with the following formula: (4piA)/(P^2)
12. __Compactness (CO)__ : Measures the roundness of an object: Ed/L
13. __ShapeFactor1 (SF1)__
14. __ShapeFactor2 (SF2)__
15. __ShapeFactor3 (SF3)__
16. __ShapeFactor4 (SF4)__

17. __Classes : *Seker, Barbunya, Bombay, Cali, Dermason, Horoz, Sira*__
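
Several of the attributes above are derived from the four basic measurements. The sketch below shows how they relate, using the formulas given in the list. It assumes that the aspect ratio is the plain ratio L / l and that the equivalent diameter follows from the area of a circle; the function name `shape_features` is made up for illustration and is not part of the dataset tooling.

```python
import math

def shape_features(area, perimeter, major_axis, minor_axis):
    """Derive the ratio-style features from the four basic measurements."""
    # Equivalent diameter: diameter of the circle whose area equals `area`.
    equivalent_diameter = math.sqrt(4.0 * area / math.pi)
    return {
        # Assumption: the "relationship between L and l" is the ratio L / l.
        "aspect_ratio": major_axis / minor_axis,
        "equivalent_diameter": equivalent_diameter,
        # Roundness R = 4*pi*A / P^2, which is 1 for a perfect circle.
        "roundness": 4.0 * math.pi * area / perimeter ** 2,
        # Compactness CO = Ed / L.
        "compactness": equivalent_diameter / major_axis,
    }

# Sanity check on an ideal circle of radius 10 pixels.
r = 10.0
circle = shape_features(math.pi * r ** 2, 2 * math.pi * r, 2 * r, 2 * r)
print(circle)
```

For a circle, roundness, compactness, and aspect ratio all come out as 1, which is a quick way to check the formulas.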

## For Questions 2-4:
### Background 
The Garment Industry is one of the key examples of the industrial globalization of this modern era. It is a highly labour-intensive industry with lots of manual processes. Satisfying the huge global demand for garment products is mostly dependent on the production and delivery performance of the employees in the garment manufacturing companies. So, it is highly desirable among the decision makers in the garments industry to track, analyse and predict the productivity performance of the working teams in their factories. 

### Dataset Attribute Information

1. **date**: Date in MM-DD-YYYY
2. **day**: Day of the Week
3. **quarter** : A portion of the month. A month was divided into four quarters
4. **department** : Associated department with the instance
5. **team_no** : Associated team number with the instance
6. **no_of_workers** : Number of workers in each team
7. **no_of_style_change** : Number of changes in the style of a particular product
8. **targeted_productivity** : Targeted productivity set by the Authority for each team for each day.
9. **smv** : Standard Minute Value, it is the allocated time for a task
10. **wip** : Work in progress. Includes the number of unfinished items for products
11. **over_time** : Represents the amount of overtime by each team in minutes
12. **incentive** : Represents the amount of financial incentive (in BDT) that enables or motivates a particular course of action.
13. **idle_time** : The amount of time when the production was interrupted due to several reasons
14. **idle_men** : The number of workers who were idle due to production interruption
15. **actual_productivity** : The actual % of productivity that was delivered by the workers. It ranges from 0-1.

#### Libraries that can be used: numpy, scipy, pandas, scikit-learn, cvxpy, imbalanced-learn
Any libraries used in the discussion materials are also allowed.

#### Other Notes

 - Don't worry about not being able to achieve high accuracy; it is neither the goal nor the grading standard of this assignment. <br >
 - If not specified, you are not required to do hyperparameter tuning, but feel free to do so if you'd like.

#### Troubleshooting
In case you have trouble installing or using imbalanced-learn (imblearn): <br >
Run the code cell below, then go to the menu bar at the top: Kernel > Restart. <br >
Then try `import imblearn` to see if things work. 

In [1]:
# Install a pip package in the current Jupyter kernel
import sys
!{sys.executable} -m pip install imbalanced-learn delayed

Collecting imbalanced-learn
  Downloading imbalanced_learn-0.10.1-py3-none-any.whl (226 kB)
     -------------------------------------- 226.0/226.0 kB 2.8 MB/s eta 0:00:00
Collecting delayed
  Downloading delayed-0.11.0b1-py2.py3-none-any.whl (19 kB)
Collecting hiredis
  Downloading hiredis-2.2.3-cp38-cp38-win_amd64.whl (21 kB)
Collecting redis
  Downloading redis-4.5.5-py3-none-any.whl (240 kB)
     -------------------------------------- 240.3/240.3 kB 2.9 MB/s eta 0:00:00
Installing collected packages: redis, hiredis, delayed, imbalanced-learn
Successfully installed delayed-0.11.0b1 hiredis-2.2.3 imbalanced-learn-0.10.1 redis-4.5.5


# Exercises

## Exercise 1 : Hyperparameter Tuning (20 points)

Use either grid search or random search methodology to find the optimal number of nodes required in each hidden layer, as well as the optimal learning rate and the number of epochs, such that the accuracy of the model is maximum for the given Dry_Beans_Dataset.

__Requirements :__
- The set of optimal hyperparameters
- The maximum accuracy achieved using this set of optimal hyperparameters

__Note :__ Hyperparameter tuning takes a lot of time to execute. Make sure that you choose an appropriate number of candidate values for each hyperparameter (preferably 3 for each), and that you allocate enough time to execute your code.
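
To get a feel for the cost mentioned in the note, the sketch below enumerates a grid by hand and counts the resulting model fits. The candidate values are hypothetical, and the fold count of 5 is an assumption (it matches the default cross-validation of `GridSearchCV` in recent scikit-learn releases).

```python
from itertools import product

# Hypothetical candidate values, three per hyperparameter.
param_grid = {
    "hidden_layer_sizes": [(5, 7), (7, 13), (13, 10)],
    "learning_rate_init": [0.1, 0.2, 0.3],
    "max_iter": [500, 800, 1000],
}
cv_folds = 5  # assumed number of cross-validation folds

# Grid search evaluates every combination of the candidate values.
names = list(param_grid)
combinations = [dict(zip(names, values)) for values in product(*param_grid.values())]

n_fits = len(combinations) * cv_folds
print(f"{len(combinations)} combinations x {cv_folds} folds = {n_fits} model fits")
```

Adding a fourth candidate to each hyperparameter would raise the count from 27 to 64 combinations, which is why a small grid is recommended.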

In [60]:

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV

dataset = pd.read_csv("Dry_Beans_Dataset.csv")

X = dataset.drop('Class', axis=1)
y = dataset['Class']

scaler = MinMaxScaler(feature_range=(0,1))
X_rescaled = scaler.fit_transform(X)
X = pd.DataFrame(data = X_rescaled, columns = X.columns)

set_of_classes = y.value_counts().index.tolist()
set_of_classes = pd.DataFrame({'Class': set_of_classes})
y = pd.get_dummies(y)

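# Candidate values for each hyperparameter
max_iterations = [500, 800, 1000]
hidden_layer_siz = [(5, 7), (7, 13), (13, 10)]
learning_rates = 0.15 * np.arange(1, 3)  # two candidates: 0.15 and 0.3

param_grid = dict(learning_rate_init = learning_rates, hidden_layer_sizes = hidden_layer_siz, max_iter = max_iterations)

# learning_rate_init, hidden_layer_sizes and max_iter below are placeholders; the grid search overrides them
mlp = MLPClassifier(solver = 'sgd', random_state = 42, activation = 'logistic', learning_rate_init = 0.3, batch_size = 100, hidden_layer_sizes = (12, 3), max_iter = 500)

# Exhaustive search over param_grid with GridSearchCV's default cross-validation
grid = GridSearchCV(estimator = mlp, param_grid = param_grid)

grid.fit(X,y)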

In [61]:
print("Optimal Hyper-parameters : ", grid.best_params_)
print("Optimal Accuracy : ", grid.best_score_)

Optimal Hyper-parameters :  {'hidden_layer_sizes': (7, 13), 'learning_rate_init': 0.3, 'max_iter': 500}
Optimal Accuracy :  0.912203147164209


## Exercise 2 - General Data Preprocessing (10 points)

Our dataset needs cleaning before we build any models. Some cleaning tasks are common to all models, but depending on the kind of model we are building, additional processing is sometimes needed. These additional tasks are described in each of the remaining two exercises.

Note that **we will be using the processed data from this exercise (Exercise 2) in each of the remaining two exercises**.

For convenience, here are the attributes that we treat as **categorical attributes**: `day`, `quarter`, `department`, and `team`. 

 - Drop the column `date`.
 - For each of the categorical attributes, **print out** all the unique elements.
 - For each of the categorical attributes, remap duplicated items that differ only by typos or stray spaces.
     - For example, "a" and "a " should be the same, so we need to update "a " to be "a".
     - Another example, "apple" and "appel" should be the same, so you should update "appel" to be "apple".
     

 - Create another column named `satisfied` that records the productivity performance. It is defined as follows. **This is the dependent variable we'd like to classify in this assignment.**
     - Return True or 1 if `actual_productivity` is equal to or greater than `targeted_productivity`. Otherwise, return False or 0, which means the team fails to meet the expected performance.
 - Drop the columns `actual_productivity` and `targeted_productivity`.


 - Find and **print out** which columns/attributes have empty values, e.g., NA, NaN, null, None.
 - Fill the empty values with 0.
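
The remapping and missing-value steps listed above can be sketched on a toy frame as follows. The column names and values are made up for illustration; on the real data the same calls would be applied to the categorical attributes.

```python
import pandas as pd

# Toy frame with a trailing space, a typo and a missing value (hypothetical data).
toy = pd.DataFrame({
    "fruit": ["a", "a ", "apple", "appel"],
    "count": [1.0, None, 3.0, 4.0],
})

# Trim surrounding whitespace, then map known typos to the intended spelling.
toy["fruit"] = toy["fruit"].str.strip().replace({"appel": "apple"})
print(toy["fruit"].unique())

# List the columns that contain missing values, then fill them with 0.
columns_with_missing = toy.columns[toy.isna().any()].tolist()
print(columns_with_missing)
toy = toy.fillna(0)
```

Stripping whitespace first keeps the typo map short, since `"a "` and `"a"` collapse before any explicit remapping is needed.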


In [47]:
import pandas as pd
import numpy as np
df = pd.read_csv('./garments_worker_productivity.csv')
df

| | date | quarter | department | day | team | targeted_productivity | smv | wip | over_time | incentive | idle_time | idle_men | no_of_style_change | no_of_workers | actual_productivity |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1/1/2015 | Quarter1 | sweing | Thursday | 8 | 0.80 | 26.16 | 1108.0 | 7080 | 98 | 0.0 | 0 | 0 | 59.0 | 0.940725 |
| 1 | 1/1/2015 | Quarter1 | finishing | Thursday | 1 | 0.75 | 3.94 | NaN | 960 | 0 | 0.0 | 0 | 0 | 8.0 | 0.886500 |
| 2 | 1/1/2015 | Quarter1 | sweing | Thursday | 11 | 0.80 | 11.41 | 968.0 | 3660 | 50 | 0.0 | 0 | 0 | 30.5 | 0.800570 |
| 3 | 1/1/2015 | Quarter1 | sweing | Thursday | 12 | 0.80 | 11.41 | 968.0 | 3660 | 50 | 0.0 | 0 | 0 | 30.5 | 0.800570 |
| 4 | 1/1/2015 | Quarter1 | sweing | Thursday | 6 | 0.80 | 25.90 | 1170.0 | 1920 | 50 | 0.0 | 0 | 0 | 56.0 | 0.800382 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1192 | 3/11/2015 | Quarter2 | finishing | Wednesday | 10 | 0.75 | 2.90 | NaN | 960 | 0 | 0.0 | 0 | 0 | 8.0 | 0.628333 |
| 1193 | 3/11/2015 | Quarter2 | finishing | Wednesday | 8 | 0.70 | 3.90 | NaN | 960 | 0 | 0.0 | 0 | 0 | 8.0 | 0.625625 |
| 1194 | 3/11/2015 | Quarter2 | finishing | Wednesday | 7 | 0.65 | 3.90 | NaN | 960 | 0 | 0.0 | 0 | 0 | 8.0 | 0.625625 |
| 1195 | 3/11/2015 | Quarter2 | finishing | Wednesday | 9 | 0.75 | 2.90 | NaN | 1800 | 0 | 0.0 | 0 | 0 | 15.0 | 0.505889 |


In [12]:
data = df.drop('date', axis=1)
data

| | quarter | department | day | team | targeted_productivity | smv | wip | over_time | incentive | idle_time | idle_men | no_of_style_change | no_of_workers | actual_productivity |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Quarter1 | sweing | Thursday | 8 | 0.80 | 26.16 | 1108.0 | 7080 | 98 | 0.0 | 0 | 0 | 59.0 | 0.940725 |
| 1 | Quarter1 | finishing | Thursday | 1 | 0.75 | 3.94 | NaN | 960 | 0 | 0.0 | 0 | 0 | 8.0 | 0.886500 |
| 2 | Quarter1 | sweing | Thursday | 11 | 0.80 | 11.41 | 968.0 | 3660 | 50 | 0.0 | 0 | 0 | 30.5 | 0.800570 |
| 3 | Quarter1 | sweing | Thursday | 12 | 0.80 | 11.41 | 968.0 | 3660 | 50 | 0.0 | 0 | 0 | 30.5 | 0.800570 |
| 4 | Quarter1 | sweing | Thursday | 6 | 0.80 | 25.90 | 1170.0 | 1920 | 50 | 0.0 | 0 | 0 | 56.0 | 0.800382 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1192 | Quarter2 | finishing | Wednesday | 10 | 0.75 | 2.90 | NaN | 960 | 0 | 0.0 | 0 | 0 | 8.0 | 0.628333 |
| 1193 | Quarter2 | finishing | Wednesday | 8 | 0.70 | 3.90 | NaN | 960 | 0 | 0.0 | 0 | 0 | 8.0 | 0.625625 |
| 1194 | Quarter2 | finishing | Wednesday | 7 | 0.65 | 3.90 | NaN | 960 | 0 | 0.0 | 0 | 0 | 8.0 | 0.625625 |
| 1195 | Quarter2 | finishing | Wednesday | 9 | 0.75 | 2.90 | NaN | 1800 | 0 | 0.0 | 0 | 0 | 15.0 | 0.505889 |


In [13]:
for col in data:
    print(col)
    print(data[col].unique())

quarter
['Quarter1' 'Quarter2' 'Quarter3' 'Quarter4' 'Quarter5']
department
['sweing' 'finishing ' 'finishing']
day
['Thursday' 'Saturday' 'Sunday' 'Monday' 'Tuesday' 'Wednesday']
team
[ 8  1 11 12  6  7  2  3  9 10  5  4]
targeted_productivity
[0.8  0.75 0.7  0.65 0.6  0.35 0.5  0.07 0.4 ]
smv
[26.16  3.94 11.41 25.9  28.08 19.87 19.31  2.9  23.69  4.15 11.61 45.67
 21.98 31.83 12.52 42.41 20.79 50.48  4.3  22.4  42.27 27.13 14.61 51.02
 22.52 14.89 22.94 48.68 41.19 48.84 26.87 20.4  49.1  15.26 54.56 40.99
 29.12  4.08 42.97 15.09 30.4  48.18 20.1  38.09 18.79 23.54 50.89 24.26
 20.55 30.1  25.31 10.05 18.22  5.13 29.4  30.33 19.68 21.25  4.6   3.9
 22.53 21.82 27.48 26.66 20.2  15.28 26.82 16.1  23.41 30.48]
wip
[1.1080e+03        nan 9.6800e+02 1.1700e+03 9.8400e+02 7.9500e+02
 7.3300e+02 6.8100e+02 8.7200e+02 5.7800e+02 6.6800e+02 8.6100e+02
 7.7200e+02 9.1300e+02 1.2610e+03 8.4400e+02 1.0050e+03 6.5900e+02
 1.1520e+03 1.1380e+03 6.1000e+02 9.4400e+02 5.4400e+02 1.0720e+03
 5.390

In [25]:
# 'department' was the only column that had this problem: strip the stray trailing space.
# A vectorised assignment avoids the chained indexing data['department'][i] = ..., which pandas warns about.
data['department'] = data['department'].str.strip()
print(data['department'].unique())

['sweing' 'finishing']


In [28]:
# 1 if the team met or exceeded its target, 0 otherwise
data['satisfied'] = (data['actual_productivity'] >= data['targeted_productivity']).astype(int)
data[['targeted_productivity', 'actual_productivity', 'satisfied']]

| | targeted_productivity | actual_productivity | satisfied |
|---|---|---|---|
| 0 | 0.80 | 0.940725 | 1 |
| 1 | 0.75 | 0.886500 | 1 |
| 2 | 0.80 | 0.800570 | 1 |
| 3 | 0.80 | 0.800570 | 1 |
| 4 | 0.80 | 0.800382 | 1 |
| ... | ... | ... | ... |
| 1192 | 0.75 | 0.628333 | 0 |
| 1193 | 0.70 | 0.625625 | 0 |
| 1194 | 0.65 | 0.625625 | 0 |
| 1195 | 0.75 | 0.505889 | 0 |


In [32]:
pro_dropped_data = data.drop(['actual_productivity', 'targeted_productivity'], axis=1)
pro_dropped_data

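| | quarter | department | day | team | smv | wip | over_time | incentive | idle_time | idle_men | no_of_style_change | no_of_workers | satisfied |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Quarter1 | sweing | Thursday | 8 | 26.16 | 1108.0 | 7080 | 98 | 0.0 | 0 | 0 | 59.0 | 1 |
| 1 | Quarter1 | finishing | Thursday | 1 | 3.94 | NaN | 960 | 0 | 0.0 | 0 | 0 | 8.0 | 1 |
| 2 | Quarter1 | sweing | Thursday | 11 | 11.41 | 968.0 | 3660 | 50 | 0.0 | 0 | 0 | 30.5 | 1 |
| 3 | Quarter1 | sweing | Thursday | 12 | 11.41 | 968.0 | 3660 | 50 | 0.0 | 0 | 0 | 30.5 | 1 |
| 4 | Quarter1 | sweing | Thursday | 6 | 25.90 | 1170.0 | 1920 | 50 | 0.0 | 0 | 0 | 56.0 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1192 | Quarter2 | finishing | Wednesday | 10 | 2.90 | NaN | 960 | 0 | 0.0 | 0 | 0 | 8.0 | 0 |
| 1193 | Quarter2 | finishing | Wednesday | 8 | 3.90 | NaN | 960 | 0 | 0.0 | 0 | 0 | 8.0 | 0 |
| 1194 | Quarter2 | finishing | Wednesday | 7 | 3.90 | NaN | 960 | 0 | 0.0 | 0 | 0 | 8.0 | 0 |
| 1195 | Quarter2 | finishing | Wednesday | 9 | 2.90 | NaN | 1800 | 0 | 0.0 | 0 | 0 | 15.0 | 0 |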


In [49]:
fillWithZero = []
for col in pro_dropped_data:
    if pro_dropped_data[col].isnull().values.any():
        fillWithZero.append(col)
        print(col)
for col in fillWithZero:
    pro_dropped_data[col] = pro_dropped_data[col].fillna(0)
pro_dropped_data

wip


| | quarter | department | day | team | smv | wip | over_time | incentive | idle_time | idle_men | no_of_style_change | no_of_workers | satisfied |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Quarter1 | sweing | Thursday | 8 | 26.16 | 1108.0 | 7080 | 98 | 0.0 | 0 | 0 | 59.0 | 1 |
| 1 | Quarter1 | finishing | Thursday | 1 | 3.94 | 0.0 | 960 | 0 | 0.0 | 0 | 0 | 8.0 | 1 |
| 2 | Quarter1 | sweing | Thursday | 11 | 11.41 | 968.0 | 3660 | 50 | 0.0 | 0 | 0 | 30.5 | 1 |
| 3 | Quarter1 | sweing | Thursday | 12 | 11.41 | 968.0 | 3660 | 50 | 0.0 | 0 | 0 | 30.5 | 1 |
| 4 | Quarter1 | sweing | Thursday | 6 | 25.90 | 1170.0 | 1920 | 50 | 0.0 | 0 | 0 | 56.0 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1192 | Quarter2 | finishing | Wednesday | 10 | 2.90 | 0.0 | 960 | 0 | 0.0 | 0 | 0 | 8.0 | 0 |
| 1193 | Quarter2 | finishing | Wednesday | 8 | 3.90 | 0.0 | 960 | 0 | 0.0 | 0 | 0 | 8.0 | 0 |
| 1194 | Quarter2 | finishing | Wednesday | 7 | 3.90 | 0.0 | 960 | 0 | 0.0 | 0 | 0 | 8.0 | 0 |
| 1195 | Quarter2 | finishing | Wednesday | 9 | 2.90 | 0.0 | 1800 | 0 | 0.0 | 0 | 0 | 15.0 | 0 |


## Exercise 3 - Naïve Bayes Classifier (35 points in total)

### Exercise 3.1 - Additional Data Preprocessing (5 points)

To build a Naïve Bayes Classifier, we need to further encode our categorical variables.

 - For each of the **categorical attributes**, encode the set of categories to be **0 ~ (n_classes - 1)**.
     - For example, \["paris", "paris", "tokyo", "amsterdam"\] should be encoded as \[1, 1, 2, 0\].
     - Note that the order does not really matter, i.e., \[0, 0, 1, 2\] also works. But you have to start with 0 in your encodings.
     - You can find information about this encoding in the discussion materials.


 - Split the data into training and testing set with the ratio of 80:20.
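
The encoding described above can be sketched with scikit-learn's `OrdinalEncoder`, applied to the categorical columns only so that numerical columns keep their original values. The toy frame and its column names are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

toy = pd.DataFrame({
    "city": ["paris", "paris", "tokyo", "amsterdam"],
    "price": [10.5, 12.0, 30.0, 8.25],  # numerical column, left untouched
})
categorical_columns = ["city"]

encoder = OrdinalEncoder()
# Categories are sorted before coding: amsterdam -> 0, paris -> 1, tokyo -> 2.
encoded = encoder.fit_transform(toy[categorical_columns]).astype(int)
for i, col in enumerate(categorical_columns):
    toy[col] = encoded[:, i]
print(toy)
```

This reproduces the `[1, 1, 2, 0]` encoding from the example in the task description.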

In [78]:
import pandas as pd
import numpy as np
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB, CategoricalNB
from sklearn.metrics import classification_report
from sklearn.svm import SVC
from imblearn.over_sampling import RandomOverSampler
from IPython.display import display

df_nb = pro_dropped_data
df_nb

| | quarter | department | day | team | smv | wip | over_time | incentive | idle_time | idle_men | no_of_style_change | no_of_workers | satisfied |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Quarter1 | sweing | Thursday | 8 | 26.16 | 1108.0 | 7080 | 98 | 0.0 | 0 | 0 | 59.0 | 1 |
| 1 | Quarter1 | finishing | Thursday | 1 | 3.94 | 0.0 | 960 | 0 | 0.0 | 0 | 0 | 8.0 | 1 |
| 2 | Quarter1 | sweing | Thursday | 11 | 11.41 | 968.0 | 3660 | 50 | 0.0 | 0 | 0 | 30.5 | 1 |
| 3 | Quarter1 | sweing | Thursday | 12 | 11.41 | 968.0 | 3660 | 50 | 0.0 | 0 | 0 | 30.5 | 1 |
| 4 | Quarter1 | sweing | Thursday | 6 | 25.90 | 1170.0 | 1920 | 50 | 0.0 | 0 | 0 | 56.0 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1192 | Quarter2 | finishing | Wednesday | 10 | 2.90 | 0.0 | 960 | 0 | 0.0 | 0 | 0 | 8.0 | 0 |
| 1193 | Quarter2 | finishing | Wednesday | 8 | 3.90 | 0.0 | 960 | 0 | 0.0 | 0 | 0 | 8.0 | 0 |
| 1194 | Quarter2 | finishing | Wednesday | 7 | 3.90 | 0.0 | 960 | 0 | 0.0 | 0 | 0 | 8.0 | 0 |
| 1195 | Quarter2 | finishing | Wednesday | 9 | 2.90 | 0.0 | 1800 | 0 | 0.0 | 0 | 0 | 15.0 | 0 |


In [79]:
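# Note: this ordinal-encodes every column, numerical ones included, so each numerical value
# is replaced by its position among that column's sorted unique values.
df_nb = pd.DataFrame(preprocessing.OrdinalEncoder().fit_transform(df_nb), columns=df_nb.columns)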

df_nb['satisfied']

0       1.0
1       1.0
2       1.0
3       1.0
4       1.0
       ... 
1192    0.0
1193    0.0
1194    0.0
1195    0.0
1196    0.0
Name: satisfied, Length: 1197, dtype: float64

In [80]:
nb_train, nb_test = train_test_split(df_nb, test_size=0.2)
X_nb_train, y_nb_train = nb_train.drop(columns=['satisfied']), nb_train['satisfied']
X_nb_test, y_nb_test = nb_test.drop(columns=['satisfied']), nb_test['satisfied']
print(X_nb_train.shape, X_nb_test.shape)

(957, 12) (240, 12)
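
The split above is not seeded, so the reported scores change from run to run. A minimal sketch of a repeatable 80:20 split that also preserves the class ratio is shown below. The toy arrays and the seed value are assumptions chosen for illustration; stratification is optional and not required by the exercise.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 100 rows, 70% positive labels (hypothetical values).
X = np.arange(200).reshape(100, 2)
y = np.array([1] * 70 + [0] * 30)

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,    # 80:20 split
    stratify=y,       # keep the class ratio the same in both parts
    random_state=0,   # arbitrary seed, chosen only for repeatability
)
print(X_train.shape, X_test.shape, y_train.mean(), y_test.mean())
```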


### Exercise 3.2 - Naïve Bayes Classifier for Categorical Attributes (15 points)

Using the categorical attributes **only**, build a Categorical Naïve Bayes classifier that predicts the column `satisfied`. <br >
Report the **testing result** using `classification_report`.
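
One pitfall with a random split is that a rare category may end up only in the test set. `CategoricalNB` indexes its probability tables by category code, so a code larger than anything seen during training can fail at prediction time. The sketch below uses the `min_categories` parameter to reserve room for such codes; the toy arrays and the choice of 3 categories are assumptions for illustration.

```python
import numpy as np
from sklearn.naive_bayes import CategoricalNB

# Toy training data: one feature whose observed codes are only 0 and 1.
X_train = np.array([[0], [1], [0], [1], [1]])
y_train = np.array([0, 1, 0, 1, 1])

# Assumption: the feature can take 3 codes (0, 1, 2) overall.
clf = CategoricalNB(min_categories=3)
clf.fit(X_train, y_train)

# Code 2 never appeared in training; Laplace smoothing still gives it
# a non-zero probability, so prediction works.
X_test = np.array([[0], [2]])
pred = clf.predict(X_test)
print(pred)
```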

In [94]:
categorical_attributes = ['quarter', 'department', 'day', 'team']
X_nb_trainC = X_nb_train[categorical_attributes]
X_nb_testC = X_nb_test[categorical_attributes]

clf_cat = CategoricalNB()

clf_cat.fit(X_nb_trainC, y_nb_train)

y_predC = clf_cat.predict(X_nb_testC)

print(classification_report(y_nb_test, y_predC, target_names=['not satisfied', 'satisfied']))

               precision    recall  f1-score   support

not satisfied       0.56      0.14      0.22        73
    satisfied       0.72      0.95      0.82       167

     accuracy                           0.70       240
    macro avg       0.64      0.54      0.52       240
 weighted avg       0.67      0.70      0.64       240



### Exercise 3.3 - Naïve Bayes Classifier for Numerical Attributes (15 points)

Using the numerical attributes **only**, build a Gaussian Naïve Bayes classifier that predicts the column `satisfied`. <br >
Report the **testing result** using `classification_report`.

**Remember to scale your data. The scaling method is up to you.**
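
As one possible scaling method, standardization subtracts the mean and divides by the standard deviation of each feature. The sketch below does this by hand with NumPy on made-up values; the point to note is that the statistics come from the training set and are then reused for the test set.

```python
import numpy as np

# Toy numerical feature matrices (hypothetical values).
X_train = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
X_test = np.array([[2.0, 400.0]])

# Statistics are computed on the training set only.
mean = X_train.mean(axis=0)
std = X_train.std(axis=0)  # population standard deviation (ddof=0)

Z_train = (X_train - mean) / std
Z_test = (X_test - mean) / std  # reuse the training statistics
print(Z_train.mean(axis=0), Z_train.std(axis=0))
print(Z_test)
```

Fitting the scaler on the test set as well would leak information about the test data into the model.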

In [100]:
numerical_attributes = [col for col in X_nb_train.columns if col not in categorical_attributes]

clf_num = GaussianNB()
scaler = preprocessing.StandardScaler()
scaler.fit(X_nb_train[numerical_attributes])

Z_nb_train = scaler.transform(X_nb_train[numerical_attributes])
Z_nb_test = scaler.transform(X_nb_test[numerical_attributes])

clf_num.fit(Z_nb_train, np.asarray(y_nb_train))

print(classification_report(y_nb_test, clf_num.predict(Z_nb_test), target_names=['not satisfied', 'satisfied']))

               precision    recall  f1-score   support

not satisfied       0.83      0.07      0.13        73
    satisfied       0.71      0.99      0.83       167

     accuracy                           0.71       240
    macro avg       0.77      0.53      0.48       240
 weighted avg       0.75      0.71      0.61       240



## Exercise 4 - SVM Classifier (35 points in total)

### Exercise 4.1 - Additional Data Preprocessing (5 points)

To build an SVM Classifier, we need a different encoding for our categorical variables.

 - For each of the **categorical attributes**, encode them with **one-hot encoding**.
     - You can find information about this encoding in the discussion materials.


 - Split the data into training and testing set with the ratio of 80:20.
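
One-hot encoding replaces each categorical column with one indicator column per category. A minimal sketch with `pandas.get_dummies` on a toy frame (the values are made up, the column names mirror the dataset):

```python
import pandas as pd

toy = pd.DataFrame({
    "department": ["sweing", "finishing", "sweing"],
    "team": [1, 2, 1],
    "smv": [26.16, 3.94, 11.41],
})

# Only the listed columns are expanded; 'smv' is passed through unchanged.
encoded = pd.get_dummies(toy, columns=["department", "team"])
print(encoded.columns.tolist())
```

Passing `columns=` explicitly matters for `team`, which is stored as an integer and would otherwise be left as a single numerical column.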


In [107]:
categorical_attributes = ['quarter', 'department', 'day', 'team']
numerical_attributes = [col for col in pro_dropped_data.drop(columns=['satisfied']).columns if col not in categorical_attributes]
df_svm = pro_dropped_data.copy()
df_svm = pd.get_dummies(df_svm, columns=categorical_attributes)

In [108]:
print(df_svm.columns)

Index(['smv', 'wip', 'over_time', 'incentive', 'idle_time', 'idle_men',
       'no_of_style_change', 'no_of_workers', 'satisfied', 'quarter_Quarter1',
       'quarter_Quarter2', 'quarter_Quarter3', 'quarter_Quarter4',
       'quarter_Quarter5', 'department_finishing', 'department_sweing',
       'day_Monday', 'day_Saturday', 'day_Sunday', 'day_Thursday',
       'day_Tuesday', 'day_Wednesday', 'team_1', 'team_2', 'team_3', 'team_4',
       'team_5', 'team_6', 'team_7', 'team_8', 'team_9', 'team_10', 'team_11',
       'team_12'],
      dtype='object')


In [109]:
svm_train, svm_test = train_test_split(df_svm, test_size=0.2)
X_svm_train, y_svm_train = svm_train.drop(columns=['satisfied']), svm_train['satisfied']
X_svm_test, y_svm_test = svm_test.drop(columns=['satisfied']), svm_test['satisfied']

### Exercise 4.2 - SVM with Different Kernels (20 points)

Using all the attributes we have, build an SVM that predicts the column `satisfied`. <br >
Specifically, please 
 - Build one SVM with a **linear kernel**.
 - Build another SVM with an **rbf kernel**.
 - Report the **testing results** of **both models** using `classification_report`.

The kernel is the only setting requirement. <br >
Other hyperparameter tuning is not required. If you do tune the models, make sure the other hyperparameters are the same in both SVMs. In other words, the only difference between the two SVMs should be the kernel setting.

**Remember to scale your data. The scaling method is up to you.**
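
To make the difference between the two kernels concrete, the sketch below evaluates both on a pair of toy vectors. The linear kernel is the dot product, and the RBF kernel is `exp(-gamma * ||x - z||^2)`. The value of `gamma` here is an arbitrary assumption; `SVC` picks its own default from the data.

```python
import numpy as np

def linear_kernel(x, z):
    """Linear kernel: the plain dot product."""
    return float(np.dot(x, z))

def rbf_kernel(x, z, gamma):
    """RBF kernel: decays with the squared Euclidean distance."""
    return float(np.exp(-gamma * np.sum((x - z) ** 2)))

x = np.array([1.0, 0.0])
z = np.array([0.0, 1.0])
gamma = 0.5  # assumed value, for illustration only

print(linear_kernel(x, z))      # orthogonal vectors give 0
print(rbf_kernel(x, z, gamma))  # exp(-0.5 * 2) = exp(-1)
print(rbf_kernel(x, x, gamma))  # identical vectors give 1
```

Because the RBF kernel depends on distances, features with large numeric ranges dominate it unless the data is scaled first.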

In [114]:
svc_li = SVC(kernel='linear')
svc_rbf = SVC(kernel='rbf')

scaler = preprocessing.StandardScaler()
scaler.fit(X_svm_train)

Z_svm_train = scaler.transform(X_svm_train)
Z_svm_test = scaler.transform(X_svm_test)

svc_li.fit(Z_svm_train, np.asarray(y_svm_train))
svc_rbf.fit(Z_svm_train, np.asarray(y_svm_train))

print('Linear Kernel')
print(classification_report(y_svm_test, svc_li.predict(Z_svm_test)))
print('RBF Kernel')
print(classification_report(y_svm_test, svc_rbf.predict(Z_svm_test)))

Linear Kernel
              precision    recall  f1-score   support

           0       0.88      0.10      0.19        67
           1       0.74      0.99      0.85       173

    accuracy                           0.75       240
   macro avg       0.81      0.55      0.52       240
weighted avg       0.78      0.75      0.66       240

RBF Kernel
              precision    recall  f1-score   support

           0       0.74      0.39      0.51        67
           1       0.80      0.95      0.87       173

    accuracy                           0.79       240
   macro avg       0.77      0.67      0.69       240
weighted avg       0.78      0.79      0.77       240



### Exercise 4.3 - SVM with Over-sampling (10 points)
 - For the column `satisfied` in our **training set**, please **print out** the frequency of each class. 
 - Oversample the **training data**. 
 - For the column `satisfied` in the oversampled data, **print out** the frequency of each class again.
 - Re-build the 2 SVMs with the same settings you used in Exercise 4.2, but **use the oversampled training data** instead.
     - Do not forget to scale the data first. As always, the scaling method is up to you.
 - Report the **testing result** with `classification_report`.

You can use ANY of the methods listed [here](https://imbalanced-learn.org/stable/references/over_sampling.html#), such as RandomOverSampler or SMOTE. <br > 
You are also welcome to build your own oversampler. <br >

Note that you do not have to over-sample your testing data.
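
For those who want to build their own oversampler, a minimal sketch of random oversampling with NumPy follows. It duplicates randomly chosen minority rows until every class matches the largest one. The function name `random_oversample`, the seed, and the toy data are hypothetical; this is not the imbalanced-learn implementation.

```python
import numpy as np

def random_oversample(X, y, seed=0):
    """Duplicate random minority-class rows until all classes are equally frequent."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()
    keep = [np.arange(len(y))]  # start with every original row
    for cls, count in zip(classes, counts):
        if count < target:
            cls_idx = np.flatnonzero(y == cls)
            # Sample with replacement to make up the shortfall.
            keep.append(rng.choice(cls_idx, size=target - count, replace=True))
    idx = np.concatenate(keep)
    return X[idx], y[idx]

# Toy training data: 7 positives and 3 negatives.
X = np.arange(20).reshape(10, 2)
y = np.array([1] * 7 + [0] * 3)

X_os, y_os = random_oversample(X, y)
print(np.unique(y_os, return_counts=True))
```

Only the training data goes through this function; the test set keeps its original class distribution.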

In [116]:
print(y_svm_train.value_counts())

ros = RandomOverSampler()
X_os, y_os = ros.fit_resample(X_svm_train, y_svm_train)

print(y_os.value_counts())

satisfied
1    702
0    255
Name: count, dtype: int64
satisfied
0    702
1    702
Name: count, dtype: int64


In [124]:
svc_li = SVC(kernel='linear')
svc_rbf = SVC(kernel='rbf')

scaler = preprocessing.StandardScaler()
scaler.fit(X_os)

Z_svm_train = scaler.transform(X_os)
Z_svm_test = scaler.transform(X_svm_test)

svc_li.fit(Z_svm_train, np.asarray(y_os))
svc_rbf.fit(Z_svm_train, np.asarray(y_os))

print('Linear Kernel')
print(classification_report(y_svm_test, svc_li.predict(Z_svm_test)))
print('RBF Kernel')
print(classification_report(y_svm_test, svc_rbf.predict(Z_svm_test)))

Linear Kernel
              precision    recall  f1-score   support

           0       0.49      0.66      0.56        67
           1       0.85      0.74      0.79       173

    accuracy                           0.72       240
   macro avg       0.67      0.70      0.68       240
weighted avg       0.75      0.72      0.73       240

RBF Kernel
              precision    recall  f1-score   support

           0       0.51      0.66      0.57        67
           1       0.85      0.75      0.80       173

    accuracy                           0.73       240
   macro avg       0.68      0.70      0.68       240
weighted avg       0.75      0.72      0.73       240

