# Homework Assignment #5 (Individual)


### <p style="text-align: right;"> &#9989; Qingyun, Xu</p>
### <p style="text-align: right;"> &#9989; xq1msu</p>

### Goals for this homework assignment

By the end of this assignment, you should be able to:
* Use `git` to track your work and turn in your assignment
* Read in data and prepare it for modeling
* Build, fit, and evaluate Logistic Regression models
* Build, fit, and evaluate Perceptron models
* Use PCA to reduce the number of features
* Build, fit, and evaluate an SVC model on PCA-transformed data
* Systematically investigate the effects of the number of PCA components on an SVC model of data

### Assignment instructions:

Work through the following assignment, making sure to follow all of the directions and answer all of the questions.

There are **41 points** possible on this assignment. Point values for each part are included in the section headers.

This assignment is **due at 11:59 pm on Friday, April 15th**. It should be uploaded into the "Homework Assignments" submission folder for Homework #5. Submission instructions can be found at the end of the notebook.


---
## Part 1: Add to your Git repository to track your progress on your assignment (4 points)

For this assignment, you're going to add it to the `cmse202-s22-turnin` repository you created in class so that you can track your progress on the assignment and preserve the final version that you turn in. In order to do this you need to

**&#9989; Do the following**:

1. Navigate to your `cmse202-s22-turnin` repository and create a new directory called `hw-05`.
2. Move this notebook into that **new directory** in your repository, then **add it and commit it to your repository**.
3. Finally, to test that everything is working, "git push" the file so that it ends up in your GitHub repository.

**Important**: Double check you've added your Professor and your TA as collaborators to your "turnin" repository (you should have done this in the previous homework assignment).

**Also important**: Make sure that the version of this notebook that you are working on is the same one that you just added to your repository! If you are working on a different copy of the notebook, **none of your changes will be tracked**!

If everything went as intended, the file should now show up on your GitHub account in the "`cmse202-s22-turnin`" repository inside the `hw-05` directory that you just created.  Periodically, **you'll be asked to commit your changes to the repository and push them to the remote GitHub location**. Of course, you can always commit your changes more often than that, if you wish.  It can be good to get into a habit of committing your changes any time you make a significant modification, or when you stop working on the project for a bit.

&#9989; **Do this**: Before you move on, put the command that your instructor should run to clone your repository in the markdown cell below.

**Put the command for cloning your repository here!**
```
git clone https://github.com/xq1msu/cmse202-s22-turnin.git
```

&#9989; **Do this**: Before you move on, create a new branch called `hw05_branch` and move into it. In the cell below put the command(s) to create a new branch and to checkout the new branch. (_Note_: your TA will be able to see if you have created the branch and its history).

**Put the command for creating the new branch here!**
```
git branch hw05_branch
git checkout hw05_branch
```

&#9989; **Do this**: Import necessary packages

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix, accuracy_score
from sklearn.linear_model import Perceptron
from sklearn.svm import SVC
from sklearn.decomposition import PCA
import statsmodels.api as sm

---
<a id="loading"></a>
## Part 2. Logistic Regression (12 points)
### 2.1 Data processing (5 points)
For this part, you will read and process the dataset `hw5_data.csv` and split the training and testing sets.

The provided data corresponds to a molecular biology dataset, where each row represents a patient classified into either "active" or "repressive". The columns represent features, where each feature comes from the quantification of a specific gene. Ten genes (ten features) are measured. The goal is to make predictive models that can classify patients ("active" or "repressive") based on the ten features.

The dataset is located at:
`https://raw.githubusercontent.com/msu-cmse-courses/cmse202-S22-data/main/data/hw5_data.csv`


**&#9989; Question 2.1.1 (1 point):** Read the `hw5_data.csv` file into your notebook and print out the unique labels in the `label` column.

Note: each row represents one data point and each column (except the `label` column) represents one feature. The `label` column corresponds to the class labels for every data point. There are two types of unique class labels in the `label` column. 

In [3]:
# Put your code here
data = pd.read_csv("hw5_data.csv")
data.head()

Unnamed: 0,label,feature_1,feature_2,feature_3,feature_4,feature_5,feature_6,feature_7,feature_8,feature_9,feature_10
0,active,0.000168,1.089799,3.706677,1.839418,1.000414,0.751246,-0.077189,0.949589,1.641961,1.102132
1,active,0.355099,0.0,2.247742,,0.0,0.194369,0.116891,-0.059497,1.086607,0.50867
2,active,0.001236,0.0,,0.814211,2.696122,0.221476,0.229138,-0.173686,1.091221,1.048915
3,active,0.69014,0.0,0.687331,0.0,0.527291,0.185705,-0.089479,-0.379929,-0.093369,0.272125
4,active,1.37677,0.631267,2.090756,1.581667,0.793976,0.84657,0.178551,0.245401,1.221811,0.111456


In [4]:
data["label"].unique()

array(['active', 'repressive'], dtype=object)

**&#9989; Question 2.1.2 (1 point):** To simplify the process of data modeling, we should convert the labels from strings to integers.

Replace all of the strings in your `label` column with integers based on the following:

| original label | integer label |
| -------- | -------- |
| repressive | 0 |
| active | 1 |

Once you've replaced the labels, display your DataFrame and confirm that it looks correct.

In [5]:
# Put your code here
method = {"repressive":0, "active":1}
data["label"] = data["label"].map(method)

data.head()

Unnamed: 0,label,feature_1,feature_2,feature_3,feature_4,feature_5,feature_6,feature_7,feature_8,feature_9,feature_10
0,1,0.000168,1.089799,3.706677,1.839418,1.000414,0.751246,-0.077189,0.949589,1.641961,1.102132
1,1,0.355099,0.0,2.247742,,0.0,0.194369,0.116891,-0.059497,1.086607,0.50867
2,1,0.001236,0.0,,0.814211,2.696122,0.221476,0.229138,-0.173686,1.091221,1.048915
3,1,0.69014,0.0,0.687331,0.0,0.527291,0.185705,-0.089479,-0.379929,-0.093369,0.272125
4,1,1.37677,0.631267,2.090756,1.581667,0.793976,0.84657,0.178551,0.245401,1.221811,0.111456


**&#9989; Question 2.1.3 (1 point):** At this point, you've hopefully noticed that some of the rows seem to have missing data values, as indicated by the existence of `NaN` values. Since we don't necessarily know what to replace these values with, let's just play it safe and remove all of the rows that have `NaN` in any of the column entries. This should help to ensure that we don't end up with errors or confusing results when we try to classify the data.

Remove all of the rows that contain a `NaN` in any column. **Make sure you actually store this new version of your dataframe either in the original variable name or in a new variable name**. If everything went as intended, you should find that you have 793 rows left over.

In [6]:
# Put your code here 
data_dn = data.dropna()
data_dn.shape

(793, 11)

**&#9989; Question 2.1.4 (1 point):** As we've seen when working with `sklearn` it can be much easier to work with the data if we have separate variables: one that stores the feature matrix and one that stores the class labels.

Split your DataFrame so that you have two separate DataFrames: (1) one called `features`, which contains all columns of features; and (2) one called `labels`, which is a single-column dataframe that contains all of the *new* integer labels you just created. 

In [7]:
# Put your code here
features = data_dn.drop(labels="label", axis=1)
labels = data_dn["label"]

&#9989; **Question 2.1.5 (1 point):** How balanced is your dataset? You need to write a bit of code to figure out how balanced your dataset is, by counting the number of data points for each class label.

In [8]:
# Put your code here
labels.value_counts()

0    402
1    391
Name: label, dtype: int64

It's not perfectly balanced: the two classes have 402 and 391 points, so the counts are not equal. But the difference is only 11 points, about 1.4% of the 793 rows (50.7% vs. 49.3%), so I consider the dataset balanced.
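Class proportions can make the balance easier to judge than raw counts. The following is a minimal sketch that rebuilds a label Series from the counts reported above (402 zeros, 391 ones); the name `example_labels` is made up for illustration and stands in for the notebook's `labels`.

```python
import pandas as pd

# Hypothetical stand-in for `labels`, built from the counts shown above
example_labels = pd.Series([0] * 402 + [1] * 391, name="label")

# Fraction of rows that fall in each class
proportions = example_labels.value_counts(normalize=True)
print(proportions.round(3))

# Gap between the two classes as a fraction of all rows
gap = abs(proportions.loc[0] - proportions.loc[1])
print(f"gap between classes: {gap:.2%}")
```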

---
### 2.2 Logistic Regression (7 points)

For this part, you will apply logistic regression to tackle the classification problem: predicting class labels based on the features.

**&#9989; Question 2.2.1 (1 point):** Split your data into a training and a testing set with a training set representing 75% of your data. For reproducibility, set the `random_state` argument to `314159`. Print the lengths to show you have the right number of entries for the training and testing sets.

In [9]:
# Put your code here
features_train, features_test, labels_train, labels_test = train_test_split(features, labels, test_size=0.25, train_size=0.75, random_state=314159)

In [10]:
print(len(features_train), len(features_test))

594 199


**&#9989; Question 2.2.2 (3 points):** Build a Logistic regression model based on default settings.

Add a constant term to both the training and testing features, fit a Logistic regression model on the training set, and then print out the model summary.

**Note:** You can use the built-in model `Logit` in `statsmodels.api`.


In [11]:
# Put your code here
logit_model = sm.Logit(labels_train, sm.add_constant(features_train))
result = logit_model.fit()
print(result.summary() )

Optimization terminated successfully.
         Current function value: 0.620529
         Iterations 6
                           Logit Regression Results                           
Dep. Variable:                  label   No. Observations:                  594
Model:                          Logit   Df Residuals:                      583
Method:                           MLE   Df Model:                           10
Date:                Fri, 22 Apr 2022   Pseudo R-squ.:                  0.1039
Time:                        22:30:19   Log-Likelihood:                -368.59
converged:                       True   LL-Null:                       -411.32
Covariance Type:            nonrobust   LLR p-value:                 4.244e-14
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const         -0.8952      0.234     -3.830      0.000      -1.353      -0.437
feature_1      0.5892      0.

&#9989; **Question 2.2.3 (1 point):** What is the Pseudo R^2? Which features have p-value < 0.05?

<font size=+3>&#9998;</font> Erase this and put your answer here.

0.1039

feature_1 and feature_3 have p-values less than 0.05.

&#9989; **Question 2.2.4 (2 points):** Make predictions for the testing set using the trained model.

Note: the logistic regression model predicts the probability of belonging to class 1. To make the final binary classification, let's set the threshold to 0.5, which means that every sample in the testing set with a predicted probability greater than 0.5 will be predicted as '1', and other samples with a predicted probability less than 0.5 will be predicted as '0'.

Show the model's accuracy score based on the testing set.

In [15]:
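# Put your code here
from sklearn.metrics import accuracy_score

# Predicted probability of class 1 for each test sample
probs = result.predict(sm.add_constant(features_test))
# Apply the 0.5 threshold explicitly: above 0.5 becomes 1, otherwise 0
pre = (probs > 0.5).astype(int)
accuracy_score(labels_test, pre)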

0.6683417085427136

---
### &#128721; STOP
**Pause to commit your changes to your Git repository!**

Take a moment to save your notebook, commit the changes to your Git repository using the commit message "Committing Part 2", and push the changes to GitHub.

---

---
## Part 3. Perceptron (5 points)

For this part, you will use another model, Perceptron, to continue working on the same classification problem.

**&#9989; Question 3.1 (2 points):** (1) Build a Perceptron model with default settings, and fit the model based on the training set.

(2) Apply the trained model on the test features to predict the labels for the testing dataset. 

(3) Evaluate the model by printing out the confusion matrix and classification report, based on its performance on the testing dataset.

**Note:** You can use the built-in model `Perceptron` in `sklearn`.
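The three steps above can be sketched as follows. This is only an illustration under stated assumptions: it uses synthetic data from `make_classification` as a stand-in for the assignment's features and labels, and the variable names (`X_train`, `y_pred`, and so on) are chosen here, not taken from the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import Perceptron
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the assignment data: 793 rows, 10 features, 2 classes
X, y = make_classification(n_samples=793, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.75, random_state=314159
)

# (1) Build a Perceptron with default settings and fit it on the training set
perceptron = Perceptron()
perceptron.fit(X_train, y_train)

# (2) Predict labels for the test features
y_pred = perceptron.predict(X_test)

# (3) Evaluate with a confusion matrix and a classification report
cm = confusion_matrix(y_test, y_pred)
print(cm)
print(classification_report(y_test, y_pred))
```

In the confusion matrix, rows are true classes and columns are predicted classes, so the diagonal counts the correct predictions.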

In [None]:
# Put your code here


**&#9989; Question 3.2 (3 points):** Finding the best penalty term.

`Perceptron` from `sklearn` can employ different penalty terms, including `l1`, `l2`, and `elasticnet` (Note: check the `penalty` argument of `Perceptron`). Apply the Perceptron on the training dataset again, based on different penalty terms (i.e. make 3 Perceptron models). Print out the accuracy score of each model, based on the testing dataset.

Which penalty term results in the best accuracy?
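One way to compare the penalty terms is a short loop that fits one model per penalty and records its test accuracy. The sketch below assumes synthetic data in place of the assignment's dataset, so the accuracies it prints say nothing about the real data; the dictionary name `scores` is illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import Perceptron
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the assignment data
X, y = make_classification(n_samples=793, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.75, random_state=314159
)

# Fit one Perceptron per penalty term and store its test accuracy
scores = {}
for penalty in ["l1", "l2", "elasticnet"]:
    model = Perceptron(penalty=penalty)
    model.fit(X_train, y_train)
    scores[penalty] = accuracy_score(y_test, model.predict(X_test))
    print(f"{penalty}: {scores[penalty]:.3f}")

# Penalty with the highest test accuracy (ties resolve to the first one seen)
best_penalty = max(scores, key=scores.get)
print("best penalty on this synthetic data:", best_penalty)
```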

In [None]:
# Put your code here


---
### &#128721; STOP
**Pause to commit your changes to your Git repository!**

Take a moment to save your notebook, commit the changes to your Git repository using the commit message "Committing Part 3", and push the changes to GitHub.

---

---
## Part 4. Principal Component Analysis (6 points)

The full model uses all 10 features to predict the results. In many cases, we might need to see how close we can get with fewer features. But instead of simply removing features, we will use a Principal Component Analysis (PCA) to determine the combined features that contribute the most to the model (through their accounted variance).

**&#9989; Question 4.1 (1 point):** Do a little bit of data preparation before we perform our PCA.

Because the features in our dataset have very different ranges of values, the variation captured by the PCA will be skewed by these relative differences. As a result, it is good practice to **normalize** the features so that they have comparable ranges of values. Thankfully, `sklearn` has a useful function for doing this!

```from sklearn.preprocessing import MinMaxScaler```

Perform a "Min-Max" scaling to normalize the features and store the new normalized features in a new dataframe called `features_norm`.
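As a small illustration of what Min-Max scaling does, the sketch below scales a made-up two-column frame (`example_features` is hypothetical, not the assignment data). Each column is mapped to the range [0, 1] via (x - min) / (max - min).

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical two-feature frame whose columns have very different ranges
example_features = pd.DataFrame(
    {"feature_1": [0.0, 0.5, 1.0, 2.0], "feature_2": [10.0, 20.0, 30.0, 50.0]}
)

scaler = MinMaxScaler()
scaled = scaler.fit_transform(example_features)

# fit_transform returns a NumPy array, so rebuild a DataFrame to keep the labels
example_norm = pd.DataFrame(
    scaled, columns=example_features.columns, index=example_features.index
)
print(example_norm)
```

Rebuilding the DataFrame with the original `columns` and `index` keeps the rows aligned with the labels, which matters after rows have been dropped.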

In [None]:
# Put your code here
from sklearn.preprocessing import MinMaxScaler


**&#9989; Question 4.2 (1 point):** As you did in Question 2.2.1 above, split your new normalized features and corresponding labels (the labels are the same as before) into a training and a testing set, with the training set representing 75% of your data. For reproducibility, set the `random_state` argument to `314159`. Print the lengths to show you have the right number of entries.

In [None]:
# Put your code here


**&#9989; Question 4.3 (3 points):** Run a Principal Component Analysis (PCA)

Since we only have 10 features to start with, let's see how well we can do if we try to aggressively reduce the feature count and use only **3** principal components. We'll see how well we can predict the labels of the dataset with just three!


(1) Using `PCA()` and the associated `fit()` method, run a principal component analysis on your training features using only 3 components.

(2) Transform both the test and training features using the result of your PCA. 

(3) Print the `explained_variance_ratio_`.
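These steps can be sketched as below. The arrays `train_demo` and `test_demo` are random stand-ins with the same shapes as the assignment's normalized training and testing features (594 and 199 rows, 10 columns); they are assumptions for illustration, so the printed ratios do not reflect the real data.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Hypothetical stand-ins for normalized features: values in [0, 1), 10 columns
train_demo = rng.random((594, 10))
test_demo = rng.random((199, 10))

# (1) Fit the PCA on the training features only, keeping 3 components
pca = PCA(n_components=3)
pca.fit(train_demo)

# (2) Transform both sets with the PCA learned from the training set
train_pca = pca.transform(train_demo)
test_pca = pca.transform(test_demo)

# (3) Fraction of the training variance captured by each component
print(pca.explained_variance_ratio_)
print("total:", pca.explained_variance_ratio_.sum())
```

Fitting on the training set alone and reusing that fit for the test set keeps information about the test data out of the transformation.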

In [None]:
# Put your code here


&#9989; **Question 4.4 (1 point):** What is the total explained variance ratio captured by the 3 principal components? (just quote the number) How well do you think a model with this many features will perform? Why?

<font size=+3>&#9998;</font> Erase this and put your answer here.

---
### &#128721; STOP
**Pause to commit your changes to your Git repository!**

Take a moment to save your notebook, commit the changes to your Git repository using the commit message "Committing Part 4", and push the changes to GitHub.

---

---
## Part 5. Support vector machine based on PCA (14 points)

### 5.1 Support vector machine (6 points)

For this part, you will build an SVC model using the 3 components from PCA, and do a grid search to find the best hyperparameters.

**&#9989; Question 5.1.1 (2 points):** Build a linear SVC model with `C=0.1`, and fit it to the training set (using the 3 PCA components from the training set).

Then use the test features to predict the labels for the testing set. 

Evaluate the model's performance using the **confusion matrix** and **classification report**.
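A minimal sketch of this workflow, assuming synthetic data in place of the assignment's dataset, is shown below. It chains Min-Max scaling, a 3-component PCA, and a linear SVC with `C=0.1`; all variable names are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# Synthetic stand-in for the normalized assignment features and labels
X, y = make_classification(n_samples=793, n_features=10, random_state=0)
X = MinMaxScaler().fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.75, random_state=314159
)

# Reduce to 3 principal components, fitting the PCA on the training set
pca = PCA(n_components=3).fit(X_train)
X_train_pca = pca.transform(X_train)
X_test_pca = pca.transform(X_test)

# Linear SVC with C=0.1, fit on the PCA-transformed training features
svc = SVC(kernel="linear", C=0.1)
svc.fit(X_train_pca, y_train)
y_pred = svc.predict(X_test_pca)

cm = confusion_matrix(y_test, y_pred)
print(cm)
print(classification_report(y_test, y_pred))
```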

In [None]:
# Put your code here


**&#9989; Question 5.1.2 (3 points):** Find the best hyperparameters

At this point, we have fit one SVC model and determined its performance, but is it the best model? We can use `GridSearchCV` to find the best model (given our choices of parameters). Once we do that, we will use that "best" model for making predictions.

Using the following parameters (`C` = `1e-3`, `0.01`, `0.1`, `1`, `10`, `100` and `gamma` = `1e-6`, `1e-5`, `1e-4`, `1e-3`, `0.01`, `0.1`) for both a `linear` and `rbf` kernel, use `GridSearchCV` with the `SVC()` model to find the best fit parameters. Once you've run the grid search, print the "best estimators".
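The grid described above can be written as a dictionary and passed to `GridSearchCV`. The sketch below is an illustration on synthetic data (an assumption, not the assignment's dataset), with the default 5-fold cross-validation; note that `gamma` has no effect for the `linear` kernel, so those combinations are redundant but harmless.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# Synthetic stand-in for the normalized assignment features and labels
X, y = make_classification(n_samples=793, n_features=10, random_state=0)
X = MinMaxScaler().fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.75, random_state=314159
)
pca = PCA(n_components=3).fit(X_train)
X_train_pca = pca.transform(X_train)

# Parameter grid taken from the question text
param_grid = {
    "C": [1e-3, 0.01, 0.1, 1, 10, 100],
    "gamma": [1e-6, 1e-5, 1e-4, 1e-3, 0.01, 0.1],
    "kernel": ["linear", "rbf"],
}

# Cross-validated search over every combination in the grid
grid = GridSearchCV(SVC(), param_grid)
grid.fit(X_train_pca, y_train)

print(grid.best_estimator_)
print(grid.best_params_)
```

Because `refit=True` is the default, `grid.predict(...)` afterwards uses the best estimator refit on the whole training set.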

In [None]:
# Put your code here


&#9989; **Question 5.1.3 (1 point):**  Evaluate the best fit model

Now that we have found the "best estimators", let's determine how good the fit is.

Use the test features to predict the labels, based on the best model. Evaluate the performance using the **confusion matrix** and **classification report**.

In [None]:
# Put your code here


### 5.2 How well does PCA work? (8 points)
The number of components we use in our PCA matters. Let's investigate how they matter by systematically building a model for any number of selected components. While this might seem a bit unnecessary for such a simple dataset, **this can be very useful for more complex datasets and models!**

**&#9989; Question 5.2.1 (3 points):**

To systematically explore how well PCA improves our classification model, we will write a function that:
* creates the PCA
* creates the SVC model
* uses `GridSearchCV` to find the best hyperparameters
* predicts the labels using test data
* returns the accuracy scores and the explained variance ratio.

Just as you did in Question 5.1.2, use the following parameters (`C` = `1e-3`, `0.01`, `0.1`, `1`, `10`, `100` and `gamma` = `1e-6`, `1e-5`, `1e-4`, `1e-3`, `0.01`, `0.1`) for both a `linear` and `rbf` kernel, and use `GridSearchCV` with the `SVC()` model to find the best fit parameters.

So, your function will take as input:
* the number of requested PCA components
* the training feature data
* the testing feature data
* the training data labels
* the test data labels

and it should **return** the accuracy score for an SVC model fit to PCA-transformed features and the **total** explained variance ratio (i.e. the sum of the explained variance ratios of the components).
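One possible shape for such a function is sketched below. The name `reduced_svm_sketch` is hypothetical; its argument order follows the signature in the code cell that follows, and the demo at the bottom runs on synthetic data rather than the assignment's dataset.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC


def reduced_svm_sketch(n_components, train_features, train_labels,
                       test_features, test_labels):
    """Return (test accuracy, total explained variance ratio)."""
    # Create the PCA from the training features and transform both sets
    pca = PCA(n_components=n_components).fit(train_features)
    train_pca = pca.transform(train_features)
    test_pca = pca.transform(test_features)

    # Grid search over the parameters listed in the question
    param_grid = {
        "C": [1e-3, 0.01, 0.1, 1, 10, 100],
        "gamma": [1e-6, 1e-5, 1e-4, 1e-3, 0.01, 0.1],
        "kernel": ["linear", "rbf"],
    }
    grid = GridSearchCV(SVC(), param_grid)
    grid.fit(train_pca, train_labels)

    # Predict with the best estimator (refit on the full training set)
    predictions = grid.predict(test_pca)
    accuracy = accuracy_score(test_labels, predictions)
    return accuracy, pca.explained_variance_ratio_.sum()


# Demo on synthetic stand-in data
X, y = make_classification(n_samples=793, n_features=10, random_state=0)
X = MinMaxScaler().fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.75, random_state=314159
)
acc, total_var = reduced_svm_sketch(3, X_train, y_train, X_test, y_test)
print(f"accuracy: {acc:.3f}, total explained variance ratio: {total_var:.3f}")
```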

In [None]:
# Put your code here

def reduced_SVM(n_components, train_features, train_labels, test_features, test_labels):
    

**&#9989; Question 5.2.2 (2 points):**

Now that you have created a function that returns the accuracy for a given number of components, we will use that to plot how the accuracy of your SVC model changes when we increase the number of components used in the PCA.

For 1 through 10 components, use your function above to compute and store (as a list) the accuracy of your models and the total explained variance ratio of your models.
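The loop-and-store pattern might look like the sketch below. To keep the run short, it swaps in a fixed linear SVC (`C=0.1`) where the assignment calls for the grid-search function, and it uses synthetic data; both are assumptions made for illustration, and the list names are chosen here.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for the assignment data
X, y = make_classification(n_samples=793, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.75, random_state=314159
)

component_counts = list(range(1, 11))
accuracies = []
total_variances = []

for n in component_counts:
    pca = PCA(n_components=n).fit(X_train)
    # Stand-in model: a fixed linear SVC instead of the grid-search function
    model = SVC(kernel="linear", C=0.1).fit(pca.transform(X_train), y_train)
    accuracies.append(model.score(pca.transform(X_test), y_test))
    total_variances.append(pca.explained_variance_ratio_.sum())

for n, acc, var in zip(component_counts, accuracies, total_variances):
    print(f"{n:2d} components: accuracy={acc:.3f}, explained variance={var:.3f}")
```

Keeping `component_counts`, `accuracies`, and `total_variances` as parallel lists makes them easy to pass straight to a plotting call.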

In [None]:
# Put your code here


**&#9989; Question 5.2.3 (1 point):** Plot the accuracy vs # of components.

In [None]:
# Put your code here


**&#9989; Question 5.2.4 (1 point):** Where does it seem like we have diminishing returns? That is, at what point is there no major increase in accuracy (or perhaps the accuracy is decreased) as we add additional components to the PCA?

<font size=+3>&#9998;</font> Erase this and put your answer here.

**&#9989; Task 5.2.5 (1 point):** Plot the total explained variance ratio vs # of components. 

In [None]:
# Put your code here


---
### &#128721; STOP
**Pause to commit your changes to your Git repository!**

Take a moment to save your notebook, commit the changes to your Git repository using the commit message "Committing Part 5", and push the changes to GitHub.

---

---
## Assignment wrap-up
Please fill out the form that appears when you run the code below. **You must completely fill this out in order to receive credit for the assignment!**

In [None]:
from IPython.display import HTML
HTML(
"""
<iframe 
	src="https://forms.office.com/Pages/ResponsePage.aspx?id=MHEXIi9k2UGSEXQjetVofa-byNJHa0xBs0jOGcRl02lURU83U0ZHUUpWUUFRUzhCQ0JZWDQxVVRUVi4u" 
	width="800px" 
	height="600px" 
	frameborder="0" 
	marginheight="0" 
	marginwidth="0">
	Loading...
</iframe>
"""
)

### Congratulations, you're done!
Submit this assignment by uploading it to the course Desire2Learn web page. Go to the "Homework Assignments" folder, find the submission folder for Homework #5, and upload your notebook.