
# Mini Assignment 4. Linear Classification
Please add one code cell after each question and create a program to answer the question. Make sure your code runs without error. After you are finished, click on File -> Print -> Save as PDF to create a PDF output and upload it on Brightspace.

NOTE: Make sure your PDF does not have your name or any identifying information in the name or content of the file. Anonymity is essential for the peer-review process.

## The Case: Diabetes Prediction
This dataset is originally from the National Institute of Diabetes and Digestive and Kidney Diseases. The objective of the dataset is to diagnostically predict whether or not a patient has diabetes, based on certain diagnostic measurements included in the dataset. Several constraints were placed on the selection of these instances from a larger database. In particular, all patients here are females at least 21 years old of Pima Indian heritage.

The dataset consists of several medical predictor variables and one target variable, Outcome. Predictor variables include the number of pregnancies the patient has had, their BMI, insulin level, age, and so on.

You may [download the dataset here](https://drive.google.com/file/d/1HexlzAyB2zylIAS8Dai8txI42qaVGIgr/view?usp=sharing). The dataset includes 9 columns (8 predictors plus the target) and 768 records. It contains the following variables:

* **Pregnancies:** Number of times pregnant
* **Glucose:** Plasma glucose concentration at 2 hours in an oral glucose tolerance test
* **BloodPressure:** Diastolic blood pressure (mm Hg)
* **SkinThickness:** Triceps skin fold thickness (mm)
* **Insulin:** 2-Hour serum insulin (mu U/ml)
* **BMI:** Body mass index (weight in kg/(height in m)^2)
* **DiabetesPedigreeFunction:** Diabetes pedigree function
* **Age:** Age (years)
* **Outcome:** Class variable (0 or 1); 268 of the 768 records are 1, the others are 0

1. Read the dataset into a dataframe, and check out the first few rows and column data types.

In [32]:
import pandas as pd

df = pd.read_csv("diabetes.csv")
df.head(3)

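| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |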


In [33]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Pregnancies               768 non-null    int64  
 1   Glucose                   768 non-null    int64  
 2   BloodPressure             768 non-null    int64  
 3   SkinThickness             768 non-null    int64  
 4   Insulin                   768 non-null    int64  
 5   BMI                       768 non-null    float64
 6   DiabetesPedigreeFunction  768 non-null    float64
 7   Age                       768 non-null    int64  
 8   Outcome                   768 non-null    int64  
dtypes: float64(2), int64(7)
memory usage: 54.1 KB


2. Inspect the dataset for missing values and treat them if there are any.

In [34]:
df.isna().sum()

Pregnancies                 0
Glucose                     0
BloodPressure               0
SkinThickness               0
Insulin                     0
BMI                         0
DiabetesPedigreeFunction    0
Age                         0
Outcome                     0
dtype: int64
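
`isna()` reports no missing values, yet the first rows printed earlier show zeros for Insulin and SkinThickness, which are unlikely to be real measurements. The sketch below assumes that a zero in such columns stands for "not recorded"; the column list, the small `demo` frame, and the choice of median imputation are all illustrative assumptions, not part of the assignment.

```python
import numpy as np
import pandas as pd

# Small hypothetical frame that mimics a few rows of the dataset;
# in the notebook the same steps would be applied to `df`.
demo = pd.DataFrame({
    "Glucose": [148, 85, 183, 0],
    "SkinThickness": [35, 29, 0, 23],
    "Insulin": [0, 0, 0, 94],
    "BMI": [33.6, 26.6, 23.3, 28.1],
})

# Assumption: a zero in these columns means the value was not recorded.
zero_as_missing = ["Glucose", "SkinThickness", "Insulin", "BMI"]

# Count the zeros that isna() cannot see.
zero_counts = (demo[zero_as_missing] == 0).sum()
print(zero_counts)

# Mark the zeros as NaN, then fill each column with its own median.
cleaned = demo.copy()
cleaned[zero_as_missing] = cleaned[zero_as_missing].replace(0, np.nan)
cleaned[zero_as_missing] = cleaned[zero_as_missing].fillna(
    cleaned[zero_as_missing].median()
)
print(cleaned)
```

Whether imputation helps here is an empirical question; comparing model accuracy with and without it is a reasonable check.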

3. We would like to create linear classifiers to predict the outcome (whether the patient has diabetes). Create predictor (X) and target (y) datasets. Then normalize/standardize the predictor dataset, and split X and y into train and test datasets. Make sure to use stratification.

In [35]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import minmax_scale

X = df.drop(['Outcome'], axis=1)
y = df.Outcome

XS = minmax_scale(X)

X_train, X_test, y_train, y_test = train_test_split(XS, y, test_size=.2, random_state=1, stratify=y)
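
The cell above scales the full predictor matrix before splitting, so the minimum and maximum used for scaling also come from test rows. A stricter variant splits first and fits the scaler on the training rows only. The sketch below shows that ordering on synthetic stand-in data (`X_demo` and `y_demo` are hypothetical, sized like this dataset) and also checks that stratification keeps the class ratio similar in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for X and y: 768 rows, 8 predictors,
# roughly 35% positives (hypothetical data, not the real dataset).
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(768, 8))
y_demo = (rng.random(768) < 0.35).astype(int)

# Split first so the scaler never sees the test rows.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1, stratify=y_demo
)

# Learn min and max from the training rows, then apply to both parts.
scaler = MinMaxScaler().fit(X_tr)
X_tr_s = scaler.transform(X_tr)
X_te_s = scaler.transform(X_te)

# Stratification keeps the share of positives nearly equal in both parts.
print("positive share (train):", y_tr.mean())
print("positive share (test): ", y_te.mean())
```

With this ordering, scaled test values can fall slightly outside [0, 1], which is expected and harmless for the models used here.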

4. Use simple linear regression to predict the outcome class (0 or 1). Print the accuracy rate for the train and the test sets.

In [36]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import accuracy_score, classification_report

sr = LinearRegression()
sr.fit(X_train, y_train)

print("Accuracy rate for simple Reg_ train: ", accuracy_score(y_train, sr.predict(X_train)>.5))
print("Accuracy rate for simple Reg_ test: ", accuracy_score(y_test, sr.predict(X_test)>.5))

Accuracy rate for simple Reg_ train:  0.7866449511400652
Accuracy rate for simple Reg_ test:  0.7727272727272727
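
Linear regression returns real-valued scores rather than class labels, which is why the cell above compares the predictions with 0.5 before computing accuracy. The toy example below (hypothetical one-feature data) illustrates that the raw scores can fall below 0 or above 1 and that the threshold turns them into labels.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical one-feature dataset with 0/1 labels.
X_demo = np.array([[0.0], [0.1], [0.2], [0.8], [0.9], [1.0]])
y_demo = np.array([0, 0, 0, 1, 1, 1])

reg = LinearRegression().fit(X_demo, y_demo)

# Raw predictions are real numbers and are not confined to [0, 1].
scores = reg.predict(np.array([[-0.5], [0.4], [1.5]]))
print(scores)

# Thresholding at 0.5 converts the scores into class labels.
labels = (scores > 0.5).astype(int)
print(labels)
```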


5. Use logistic regression to predict the outcome. Print the accuracy rate for the train and the test sets. Tune the C hyperparameter (the inverse of regularization strength) to get the best fit.

In [37]:
from sklearn.linear_model import LogisticRegression

logr = LogisticRegression(penalty='l2',C=0.3)
logr.fit(X_train, y_train)

print("Accuracy rate for simple LogReg_ train: ", accuracy_score(y_train, logr.predict(X_train)))
print("Accuracy rate for simple LogReg_ test: ", accuracy_score(y_test, logr.predict(X_test)))

Accuracy rate for simple LogReg_ train:  0.757328990228013
Accuracy rate for simple LogReg_ test:  0.7337662337662337
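
The value of C above was chosen by hand. One way to make the choice systematic is to score several candidates with cross-validation on the training data and keep the best one. The sketch below uses synthetic stand-in data from `make_classification`; the candidate list is an assumption, with a log-spaced grid being a common starting point.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for X_train and y_train (hypothetical data).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=4, random_state=1
)

# Candidate values are an assumption, not prescribed by the assignment.
candidate_C = [0.01, 0.1, 1, 10, 100]

cv_scores = {}
for C in candidate_C:
    model = LogisticRegression(C=C, max_iter=1000)
    # Mean accuracy over 5 stratified folds.
    cv_scores[C] = cross_val_score(model, X_demo, y_demo, cv=5).mean()

# Keep the candidate with the highest mean cross-validated accuracy.
best_C = max(cv_scores, key=cv_scores.get)
print(cv_scores)
print("best C:", best_C)
```

Selecting C this way leaves the test set untouched until the final evaluation.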


6. Use a linear support vector machine to predict the outcome. Print the accuracy rate for the train and the test sets. Tune the C hyperparameter (the inverse of regularization strength) to get the best fit.

In [38]:
from sklearn.svm import LinearSVC

svm = LinearSVC(C=1)
svm.fit(X_train, y_train)

print("Accuracy rate for simple svm_ train: ", accuracy_score(y_train, svm.predict(X_train)))
print("Accuracy rate for simple svm_ test: ", accuracy_score(y_test, svm.predict(X_test)))

Accuracy rate for simple svm_ train:  0.7866449511400652
Accuracy rate for simple svm_ test:  0.7727272727272727


7. Use an SVM with a non-linear kernel (the kernel trick) to predict the outcome. Print the accuracy rate for the train and the test sets. Tune the C and gamma hyperparameters to get the best fit.

In [39]:
from sklearn.svm import SVC

ksvm = SVC(gamma=3,C=0.06)
ksvm.fit(X_train, y_train)

print("Accuracy rate for kernel SVM_ train: ", accuracy_score(y_train, ksvm.predict(X_train)))
print("Accuracy rate for kernel SVM_ test: ", accuracy_score(y_test, ksvm.predict(X_test)))

Accuracy rate for kernel SVM_ train:  0.7198697068403909
Accuracy rate for kernel SVM_ test:  0.6883116883116883
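
Because C and gamma interact, tuning them one at a time by hand can miss good combinations. A grid search with cross-validation tries every pair and reports the best. The sketch below runs `GridSearchCV` on synthetic stand-in data; the grid values are illustrative assumptions and would need adjusting for the real, scaled predictors.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in for X_train and y_train (hypothetical data).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=4, random_state=1
)

# Illustrative grid; the useful range depends on how the data are scaled.
param_grid = {"C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1]}

# Every (C, gamma) pair is scored with 5-fold cross-validated accuracy.
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5, scoring="accuracy")
search.fit(X_demo, y_demo)

print(search.best_params_)
print(search.best_score_)
```

After fitting, `search.best_estimator_` is refit on all of the supplied data and can be evaluated once on the held-out test set.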


8. Print the test set classification report for the above four models. Which one offers the best f1-score?

In [40]:
print(classification_report(y_test, sr.predict(X_test)>.5))
print(classification_report(y_test, logr.predict(X_test)))
print(classification_report(y_test, svm.predict(X_test)))
print(classification_report(y_test, ksvm.predict(X_test)))

              precision    recall  f1-score   support

           0       0.77      0.92      0.84       100
           1       0.77      0.50      0.61        54

    accuracy                           0.77       154
   macro avg       0.77      0.71      0.72       154
weighted avg       0.77      0.77      0.76       154

              precision    recall  f1-score   support

           0       0.72      0.96      0.82       100
           1       0.81      0.31      0.45        54

    accuracy                           0.73       154
   macro avg       0.77      0.64      0.64       154
weighted avg       0.75      0.73      0.69       154

              precision    recall  f1-score   support

           0       0.77      0.92      0.84       100
           1       0.77      0.50      0.61        54

    accuracy                           0.77       154
   macro avg       0.77      0.71      0.72       154
weighted avg       0.77      0.77      0.76       154

              preci