# DATA-780: Midterm (Applied Problems)

## Skylar Furey `stfurey@unc.edu`

In this part of the midterm, you will propose solutions to applied problems to demonstrate your understanding of the topics covered so far in the course. Please make sure to upload **both** the `.ipynb` file and the generated `.html` export to the Canvas page.

- You are allowed to use any of the techniques and frameworks discussed in class.
- Please include comments in your code, and a brief discussion of the results you obtained.
- Use markdown as you see fit to properly document your code and typeset your solutions.
- Make sure that code and output are included and properly formatted, and that no error messages are included (or left unexplained).
- **DO NOT** use any artificial intelligence (AI) tools to complete this assignment.  

## Problem 1 (7 points)

In this problem you will explore a subset of survey data collected by the US National Center for Health Statistics (NCHS), which has conducted a series of health and nutrition surveys since the early 1960s.

You can download the dataset from <https://raw.githubusercontent.com/reisanar/datasets/master/nhanes.csv>

Variable      | Description
--------------|------------------------------------------------
`has_diabetes`| Participant has diabetes. `1` = Yes, `0` = No
`Gender`      | Gender of study participant, coded as `male` or `female`
`Age`         | Age in years at screening of study participant
`BMI`         | Body mass index (weight/height² in kg/m²)
`Weight`      | Weight in kg
`Height`      | Standing height in cm

(a) (4 points) Use a logistic regression model to estimate the probability of diabetes as a function of `Age` and `BMI`.


(b) (1 point) Which variable is more important: `Age` or `BMI`? Comment on your results.


(c) (2 points) Predict the probability that a new participant has diabetes based on hypothetical values for `Age` and `BMI` (choose 3 different cases).



## Solutions to Problem

_Include comments as you see fit_

In [1]:
# Hides warnings
import warnings
warnings.filterwarnings("ignore")

In [2]:
# code goes here
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Pull data from internet and clean to remove NA data
data = pd.read_csv('https://raw.githubusercontent.com/reisanar/datasets/master/nhanes.csv')
data = data.dropna()

# Set feature and target datasets
X = data[['Age', 'BMI']]
y = data['has_diabetes']

# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize and fit model
model = LogisticRegression()
model.fit(X_train, y_train)

# Use model coefficients to determine variable importance 
print('Age Variable Importance: ', round(model.coef_[0][0], 3))
print('BMI Variable Importance: ', round(model.coef_[0][1], 3))

# Create hypothetical data
hypo_data = pd.DataFrame({
    'Age': [26, 52, 84],
    'BMI': [27.9, 28.7, 37.8]
})

# Predict probabilities of diabetes for each hypothetical test subject
hypo_probs = model.predict_proba(hypo_data)
print('Probabilities for Hypothetical Values: ', hypo_probs[:,1])

Age Variable Importance:  0.056
BMI Variable Importance:  0.098
Probabilities for Hypothetical Values:  [0.01932525 0.08337916 0.56886634]


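The `BMI` coefficient (0.098) is almost double the `Age` coefficient (0.056), so a one-unit increase in BMI changes the log-odds of diabetes more than one additional year of age does. Because the two features are measured in different units, this comparison is only a rough guide to importance. I am also glad to be among the roughly 98% of people my age and BMI that the model predicts do not have diabetes! The model is likely better suited to Type 2 diabetes; Type 1 diabetes depends much more on heredity, which is not among the input features.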
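
One way to read the fitted coefficients is as odds multipliers. The sketch below is a minimal illustration rather than part of the solution: it takes the two rounded coefficients printed above (0.056 for `Age`, 0.098 for `BMI`) and converts them to odds ratios. The larger step sizes (10 years, 5 BMI units) are arbitrary choices made for illustration.

```python
import math

# Rounded coefficients as printed in the output above
coef_age = 0.056
coef_bmi = 0.098

# In logistic regression, exp(coefficient) is the multiplicative change
# in the odds of the outcome for a one-unit increase in that feature.
odds_ratio_age = math.exp(coef_age)
odds_ratio_bmi = math.exp(coef_bmi)

# Illustrative (arbitrary) step sizes: the two features use different units,
# so a per-unit comparison alone can be misleading.
odds_ratio_age_10y = math.exp(coef_age * 10)
odds_ratio_bmi_5u = math.exp(coef_bmi * 5)

print(f"Odds ratio per year of Age:     {odds_ratio_age:.3f}")
print(f"Odds ratio per unit of BMI:     {odds_ratio_bmi:.3f}")
print(f"Odds ratio per 10 years of Age: {odds_ratio_age_10y:.3f}")
print(f"Odds ratio per 5 units of BMI:  {odds_ratio_bmi_5u:.3f}")
```

Since `Age` is measured in years and `BMI` in kg/m², the ranking of the two variables depends on the step size chosen; standardizing the features before fitting is a common way to put the coefficients on a comparable footing.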

## Problem 2 (8 points)

In this problem you will use the dataset available at <https://github.com/reisanar/datasets/raw/master/email.csv> to build multiple classification models and compare results.

(a) (3 points) Build a logistic regression model using the variables listed below:

  Variable      | Description
----------------|-----------------------------------------------
`spam`          | Specifies whether the message was spam
`to_multiple`   | An indicator variable for if more than one person was listed in the _To_ field of the email
`cc`            | An indicator for if someone was CCed on the email
`attach`        | An indicator for if there was an attachment, such as a document or image
`dollar`        | An indicator for if the word _dollar_ or dollar symbol (`$`) appeared in the email
`winner`        | An indicator for if the word _winner_ appeared in the email message
`inherit`       | An indicator for if the word _inherit_ (or a variation, like _inheritance_) appeared in the email
`password`      | An indicator for if the word _password_ was present in the email
`format`        | Indicates if the email contained special formatting, such as bolding, tables, or links
`re_subj`       | Indicates whether `Re:` was included at the start of the email subject
`exclaim_subj`  | Indicates whether any exclamation point was included in the email subject


(b) (3 points) Build a Naive-Bayes classifier and comment on your results.


(c) (2 points) Can you build a support vector classifier for the same task? Comment on your results.


**Note:** you should use the same training and testing sets in the model building and evaluation stages. 
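
As a rough sketch of what "the same training and testing sets" means in practice, the example below performs one seeded shuffle and reuses the resulting index lists. The helper `split_indices` is hypothetical and only stands in for a library splitter such as `train_test_split` with a fixed `random_state`; the sample count of 10 is made up.

```python
import random

def split_indices(n_samples, test_size=0.2, seed=42):
    """Return (train_idx, test_idx) from a single seeded shuffle."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = int(round(n_samples * test_size))
    return indices[n_test:], indices[:n_test]

# Split once and reuse these index lists for every model being compared
train_idx, test_idx = split_indices(10)

# Calling the helper again with the same seed reproduces the same partition
train_again, test_again = split_indices(10)

print("Train indices:", train_idx)
print("Test indices: ", test_idx)
print("Reproducible: ", (train_idx, test_idx) == (train_again, test_again))
```

Fitting every classifier on the rows in `train_idx` and scoring each on the rows in `test_idx` keeps the accuracy figures comparable across models.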

## Solutions to Problem

_Include comments as you see fit_

In [3]:
# code goes here
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Pull data from internet and make winner feature a boolean
data = pd.read_csv('https://github.com/reisanar/datasets/raw/master/email.csv')
data['winner'] = data['winner'].replace({'no': 0, 'yes': 1})

X = data[['to_multiple', 'cc', 'attach', 'dollar', 'winner', 'inherit', 'password', 'format', 're_subj', 'exclaim_subj']]
y = data['spam']

# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize, fit and test Logistic Regression model
modelLR = LogisticRegression()
modelLR.fit(X_train, y_train)
y_predLR = modelLR.predict(X_test)
accuracyLR = accuracy_score(y_test, y_predLR)

# Initialize, fit and test Naive-Bayes model
modelNB = MultinomialNB(alpha=0.1)
modelNB.fit(X_train, y_train)
y_predNB = modelNB.predict(X_test)
accuracyNB = accuracy_score(y_test, y_predNB)

# Initialize, fit and test Support Vector Machine model
modelSVM = SVC(kernel='rbf')
modelSVM.fit(X_train, y_train)
y_predSVM = modelSVM.predict(X_test)
accuracySVM = accuracy_score(y_test, y_predSVM)

print(f"Logistic Regression Accuracy: {accuracyLR:.3f}")
print(f"Naive-Bayes Accuracy: {accuracyNB:.3f}")
print(f"Support Vector Machine Accuracy: {accuracySVM:.3f}")

Logistic Regression Accuracy: 0.908
Naive-Bayes Accuracy: 0.892
Support Vector Machine Accuracy: 0.911


The SVM model is the most accurate (0.911), but all three models reach roughly 90% accuracy on unseen test data.
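
Accuracy is easier to interpret next to a majority-class baseline. The sketch below uses a small made-up label list (not the email data) to show how to compute the accuracy of always predicting the most frequent class; the helper name `majority_baseline_accuracy` is hypothetical.

```python
from collections import Counter

def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most frequent label."""
    counts = Counter(labels)
    most_common_count = counts.most_common(1)[0][1]
    return most_common_count / len(labels)

# Made-up labels for illustration only: 1 = spam, 0 = not spam
example_labels = [0] * 9 + [1] * 1

baseline = majority_baseline_accuracy(example_labels)
print(f"Majority-class baseline accuracy: {baseline:.3f}")
```

Applied to `y_test` from the cell above, the same calculation would give a reference point against which the three reported accuracies can be judged.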

## Problem 3 (4 points)

In this question, we are using the Wine dataset included in sklearn to apply a simple logistic regression. Most of the code is already completed below. Your task is to fill in the model initialization and model fitting steps. The purpose of this question is to check your knowledge of applying a logistic regression and coding up an ML approach. Please fill in the code to use a Logistic Regression classifier and report the accuracy of the model.

In [4]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.datasets import load_wine

# Load the Wine dataset
data = load_wine()
X = data.data  # Features
y = data.target  # Target

# Split the data into train and test sets (no stratify argument is passed here)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize the model
model = LogisticRegression() #TODO

# Fit the model
model.fit(X_train, y_train) #TODO

# Predict the test set
y_pred = model.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)

# Print accuracy with higher precision
print(f"Accuracy: {accuracy:.3f}")

Accuracy: 0.972


This simple model provides a good baseline against which to compare the accuracy of other models.

## Problem 4 (4 points)

Update the previous code to include a `StandardScaler` and apply it to the `X_train` and `X_test` datasets. 
Return the updated accuracy with 3 decimal points and comment on your results.

In [5]:
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.datasets import load_wine

# Load the Wine dataset
data = load_wine()
X = data.data  # Features
y = data.target  # Target

# Split the data with stratification
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Scale the data
scaler = StandardScaler() #TODO
X_train = scaler.fit_transform(X_train) #TODO
X_test = scaler.transform(X_test) #TODO

# Initialize the model
model = LogisticRegression()

# Fit the model
model.fit(X_train, y_train)

# Predict
y_pred = model.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)

# Print accuracy with higher precision
print(f"Accuracy: {accuracy:.3f}")

Accuracy: 0.972
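
As a rough illustration of what the scaling step does, the sketch below standardizes a single made-up feature by hand. The mean and standard deviation are learned from the training values only and then reused on the test values, mirroring `fit_transform` on `X_train` and `transform` on `X_test`. The helper names `fit_scaler` and `transform` and all numbers are hypothetical; the population standard deviation is used because that is what `StandardScaler` computes.

```python
import statistics

def fit_scaler(column):
    """Learn the mean and population standard deviation from training values."""
    return statistics.fmean(column), statistics.pstdev(column)

def transform(column, mean, std):
    """Apply z = (x - mean) / std using previously learned statistics."""
    return [(x - mean) / std for x in column]

# Made-up single-feature values, for illustration only
train_values = [10.0, 12.0, 14.0, 16.0]
test_values = [11.0, 20.0]

mean, std = fit_scaler(train_values)          # statistics come from train only
train_scaled = transform(train_values, mean, std)
test_scaled = transform(test_values, mean, std)

print("Scaled train:", [round(z, 3) for z in train_scaled])
print("Scaled test: ", [round(z, 3) for z in test_scaled])
```

Reusing the training statistics on the test values keeps information about the test set out of the preprocessing step.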


## Problem 5 (4 points)

Complete the code below to implement a Support Vector Machine model. Your support vector machine model should use a radial basis function (RBF) kernel. The purpose of this question is to check your knowledge of SVM code. 

In [6]:
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler

# Load the Wine dataset
data = load_wine()
X = data.data  # Features
y = data.target  # Target

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42, stratify=y)

# Initialize model with a radial basis function (RBF) kernel
svm_model = SVC(kernel='rbf') #TODO

# Fit the model
svm_model.fit(X_train, y_train)

# Predict
y_pred = svm_model.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"SVM Model Accuracy: {accuracy:.3f}")

SVM Model Accuracy: 0.711


Without a scaler, the SVM model (0.711) was much less accurate than the Logistic Regression model (0.972).

## Problem 6 (4 points)

Update the SVM code from the previous problem to include the standard scaler. Report the updated model accuracy and comment on your results. In particular, please comment on how a Standard Scaler impacts an SVM compared to a Logistic Regression.

In [7]:
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler

# Load the Wine dataset
data = load_wine()
X = data.data  # Features
y = data.target  # Target

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42, stratify=y)

# Scale the data (important for SVM performance)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Initialize model with a radial basis function (RBF) kernel
svm_model = SVC(kernel='rbf')

# Fit the model
svm_model.fit(X_train, y_train)

# Predict
y_pred = svm_model.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"SVM Model Accuracy: {accuracy:.3f}")

SVM Model Accuracy: 0.978


Using a scaler dramatically improved the SVM model (from 0.711 to 0.978) and even made it slightly more accurate than the Logistic Regression model. The scaler has much more impact on the SVM because the RBF kernel is based on distances between samples; features with larger value ranges dominate those distances and therefore the decision boundary, so scaling the features equalizes their influence.
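
The effect of feature ranges on an RBF-kernel SVM can be illustrated directly with the kernel formula $k(a, b) = \exp(-\gamma \lVert a - b \rVert^2)$. The sketch below uses made-up two-feature samples and an arbitrary `gamma`; it is not derived from the Wine data.

```python
import math

def rbf_kernel(a, b, gamma):
    """RBF kernel: exp(-gamma * squared Euclidean distance)."""
    sq_dist = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-gamma * sq_dist)

# Made-up samples: feature 0 has a small range, feature 1 a large range
a = [1.0, 500.0]
b = [3.0, 500.0]   # differs from a only in the small-range feature
c = [1.0, 700.0]   # differs from a only in the large-range feature

gamma = 0.001  # arbitrary value chosen for illustration

k_ab = rbf_kernel(a, b, gamma)
k_ac = rbf_kernel(a, c, gamma)

print(f"k(a, b) = {k_ab:.6f}")
print(f"k(a, c) = {k_ac:.3e}")
```

In this toy setup, a change in the small-range feature leaves the two samples almost identical from the kernel's point of view, while a change in the large-range feature makes them look unrelated. Standardizing both features first would let each contribute on a similar scale.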

## Problem 7 (4 points)

There are a number of ways to evaluate models. For this question, include precision and recall in your code. You can do this directly and by generating a classification report. Comment on your results.

In [8]:
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score, recall_score, f1_score
from sklearn.metrics import classification_report
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler

# Load the Wine dataset
data = load_wine()
X = data.data  # Features
y = data.target  # Target

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42, stratify=y)

# Scale the data (important for SVM performance)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Initialize model with a radial basis function (RBF) kernel
svm_model = SVC(kernel='rbf')

# Fit the model
svm_model.fit(X_train, y_train)

# Predict
y_pred = svm_model.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"SVM Model Accuracy: {accuracy:.3f}")

# Calculate precision
precision = precision_score(y_test, y_pred, average='weighted')  #TODO: Complete code and include average='weighted' in your code. 
print(f'Precision: {precision:.3f}')

# Calculate recall
recall = recall_score(y_test, y_pred, average='weighted')  #TODO: Complete code and include average='weighted' in your code. 
print(f'Recall: {recall:.3f}')

# Calculate F1
f1 = f1_score(y_test, y_pred, average='weighted')  #TODO: Complete code and include average='weighted' in your code.
print(f'F1 Score: {f1:.3f}')

# Calculate classification report
classification_rep = classification_report(y_test, y_pred) #TODO: Complete code.
print(classification_rep)

SVM Model Accuracy: 0.978
Precision: 0.979
Recall: 0.978
F1 Score: 0.978
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        15
           1       0.95      1.00      0.97        18
           2       1.00      0.92      0.96        12

    accuracy                           0.98        45
   macro avg       0.98      0.97      0.98        45
weighted avg       0.98      0.98      0.98        45



Precision was high for all 3 cultivars. The second cultivar was the only one with a false positive, and it had 1 (18 of its 19 predictions were correct, giving a precision of about 0.95).

Recall was also high for all 3 cultivars. The third cultivar was the only one with a false negative, and it had 1.

The f1-score combines the previous two metrics. Since the first cultivar had no false positives or false negatives, it had a perfect f1-score. The other two cultivars had high f1-scores as well.

The second cultivar was the most common in the test set.

The overall accuracy of the model was great at 98%. This is high enough that it makes me wonder whether the model is overfitted.
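
The per-class figures can be checked by hand from a confusion matrix. The matrix below is a reconstruction inferred from the precision, recall, and support columns of the report above (it was not printed by the notebook), so treat it as an assumption; the sketch then recomputes the per-class and weighted metrics from it.

```python
# Confusion matrix inferred from the classification report above
# (rows = true class, columns = predicted class). This is a reconstruction,
# not output produced by the notebook.
confusion = [
    [15, 0, 0],
    [0, 18, 0],
    [0, 1, 11],
]

n_classes = len(confusion)
support = [sum(row) for row in confusion]  # true samples per class
total = sum(support)

precision, recall = [], []
for k in range(n_classes):
    true_pos = confusion[k][k]
    predicted_k = sum(confusion[i][k] for i in range(n_classes))
    precision.append(true_pos / predicted_k)  # TP / (TP + FP)
    recall.append(true_pos / support[k])      # TP / (TP + FN)

# average='weighted' weights each class metric by its support
weighted_precision = sum(p * s for p, s in zip(precision, support)) / total
weighted_recall = sum(r * s for r, s in zip(recall, support)) / total
accuracy = sum(confusion[k][k] for k in range(n_classes)) / total

print("Per-class precision:", [round(p, 2) for p in precision])
print("Per-class recall:   ", [round(r, 2) for r in recall])
print(f"Weighted precision: {weighted_precision:.3f}")
print(f"Weighted recall:    {weighted_recall:.3f}")
print(f"Accuracy:           {accuracy:.3f}")
```

Under this reconstruction, a single sample of the third class predicted as the second class accounts for both the one false positive and the one false negative.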

## Problem 8 (5 points)

Complete the code below to implement a Convolutional Neural Network (CNN) on the MNIST dataset. Comment on your results.

In [9]:
import tensorflow as tf
from tensorflow.keras import layers, models
from tensorflow.keras.datasets import mnist
from tensorflow.keras.utils import to_categorical
from sklearn.metrics import accuracy_score

# Load the MNIST dataset
(X_train, y_train), (X_test, y_test) = mnist.load_data()

# Preprocess the data
X_train = X_train.reshape((X_train.shape[0], 28, 28, 1))  # Reshape to (28, 28, 1) for grayscale images
X_test = X_test.reshape((X_test.shape[0], 28, 28, 1))
X_train = X_train.astype('float32') / 255  # Normalize pixel values to [0, 1]
X_test = X_test.astype('float32') / 255

# One-hot encode the target labels
y_train = to_categorical(y_train)
y_test = to_categorical(y_test)

# Initialize a simple CNN model
model = models.Sequential() #TODO

# Define the CNN architecture -- Add in the Convolutional, Max Pooling, and Fully-Connected Layers
model.add(layers.Conv2D(32, (3, 3), activation='relu', input_shape=(28, 28, 1))) # Add convolutional layer
model.add(layers.MaxPooling2D((2, 2)))  # Add max-pooling layer
model.add(layers.Conv2D(64, (3, 3), activation='relu'))  # Add another convolutional layer
model.add(layers.MaxPooling2D((2, 2)))  # Add another max-pooling layer
model.add(layers.Flatten())  # Flatten the feature maps into a vector
model.add(layers.Dense(64, activation='relu'))  # Fully connected layer
model.add(layers.Dense(10, activation='softmax'))  # Fully connected output layer

# Compile the model
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

# Train the model
model.fit(X_train, y_train, epochs=2, batch_size=64, validation_split=0.2)

# Evaluate the model on the test data
test_loss, test_acc = model.evaluate(X_test, y_test)

# Print the test accuracy
print(f"Test accuracy: {test_acc:.6f}")

Epoch 1/2
750/750 ━━━━━━━━━━━━━━━━━━━━ 6s 7ms/step - accuracy: 0.8572 - loss: 0.4634 - val_accuracy: 0.9788 - val_loss: 0.0707
Epoch 2/2
750/750 ━━━━━━━━━━━━━━━━━━━━ 5s 7ms/step - accuracy: 0.9792 - loss: 0.0665 - val_accuracy: 0.9852 - val_loss: 0.0526
313/313 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - accuracy: 0.9844 - loss: 0.0490
Test accuracy: 0.986300


With just 2 epochs, the CNN reached 98.6% test accuracy on MNIST, which is comparable to the accuracy of the Logistic Regression and SVM models with scaled data on the Wine dataset. These are different datasets, so the numbers are not directly comparable. For the Wine dataset, the SVM results suggest clear separation in the features between the cultivars, indicating distinct groupings in the data. As a result, the resources required to train a CNN would be unnecessary there, and I would suggest using one of the simpler models, SVM or Logistic Regression.
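
For reference, the layer shapes and parameter counts of the CNN defined above can be worked out by hand. The sketch below assumes the Keras defaults used in that cell (stride 1 and no padding for `Conv2D`, non-overlapping windows for `MaxPooling2D`, a batch size of 32 for `evaluate`) and the standard MNIST sizes of 60,000 training and 10,000 test images; the helper names are hypothetical.

```python
import math

def conv_out(size, kernel):
    """Output size of an unpadded ('valid') convolution with stride 1."""
    return size - kernel + 1

def pool_out(size, pool):
    """Output size of non-overlapping max pooling (floor division)."""
    return size // pool

# Spatial size after each layer, starting from 28x28 inputs
size = 28
size = conv_out(size, 3)   # first Conv2D, 3x3 kernel
size = pool_out(size, 2)   # first MaxPooling2D, 2x2 window
size = conv_out(size, 3)   # second Conv2D, 3x3 kernel
size = pool_out(size, 2)   # second MaxPooling2D, 2x2 window
flat_units = size * size * 64  # Flatten over 64 feature maps

# Trainable parameters: weights plus one bias per output unit or filter
params_conv1 = 3 * 3 * 1 * 32 + 32
params_conv2 = 3 * 3 * 32 * 64 + 64
params_dense1 = flat_units * 64 + 64
params_dense2 = 64 * 10 + 10
total_params = params_conv1 + params_conv2 + params_dense1 + params_dense2

# Batches per epoch: 80% of the training images at batch_size=64
train_samples = 60000 * 8 // 10
steps_per_epoch = math.ceil(train_samples / 64)
eval_steps = math.ceil(10000 / 32)

print("Flattened units:", flat_units)
print("Total parameters:", total_params)
print("Training steps per epoch:", steps_per_epoch)
print("Evaluation steps:", eval_steps)
```

The step counts agree with the `750/750` and `313/313` progress counters in the training log above.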