## Lab Assignment: Machine Learning with Scikit-Learn
Student: Zachary Stallard

### Objective: To give students practical experience in implementing basic machine learning algorithms using Scikit-Learn.

### Instructions:
Produce four machine learning models (one for each type), using the datasets available in Python.

1. Decision Trees
- Load the Iris dataset from Scikit-Learn datasets.
- Split the dataset into training and testing sets.
- Implement a decision tree classifier with a maximum depth of 2 and fit it to the training data.
- Evaluate the performance of the decision tree classifier on the testing data using accuracy as the evaluation metric.

2. K-Nearest Neighbors
- Load the Breast Cancer dataset from Scikit-Learn datasets.
- Split the dataset into training and testing sets.
- Implement a K-Nearest Neighbors classifier with k=5 and fit it to the training data.
- Evaluate the performance of the K-Nearest Neighbors classifier on the testing data using precision, recall, and F1-score as the evaluation metrics.

3. Linear Regression
- Load the California Housing dataset from Scikit-Learn datasets.
- Split the dataset into training and testing sets.
- Implement a linear regression model and fit it to the training data.
- Evaluate the performance of the linear regression model on the testing data using mean squared error as the evaluation metric.

4. Naive Bayes
- Load the SMS Spam Collection dataset from Scikit-Learn datasets.
- Split the dataset into training and testing sets.
- Implement a Naive Bayes classifier and fit it to the training data.
- Evaluate the performance of the Naive Bayes classifier on the testing data using accuracy, precision, recall, and F1-score as the evaluation metrics.

### Deliverable:
Modify this notebook to include the Python code as well as any documentation related to your submission. Submit the notebook as your response in Blackboard.

### Grading Criteria:

Your lab assignment will be graded based on the following criteria:

- Correctness of the implementation
- Proper use of basic control structures and functions
- Code efficiency
- Clarity and readability of the code
- Compliance with the instructions and deliverables

### Student Submission
1. Decision Trees
- Load the Iris dataset from Scikit-Learn datasets.
- Split the dataset into training and testing sets.
- Implement a decision tree classifier with a maximum depth of 2 and fit it to the training data.
- Evaluate the performance of the decision tree classifier on the testing data using accuracy as the evaluation metric.

In [None]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Load the Iris dataset
iris = load_iris()

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.3, random_state=42)

# Implement a decision tree classifier with a maximum depth of 2 and fit it to the training data
dt = DecisionTreeClassifier(max_depth=2, random_state=42)
dt.fit(X_train, y_train)

# Evaluate the performance of the decision tree classifier on the testing data using accuracy as the evaluation metric
y_pred = dt.predict(X_test)
acc = accuracy_score(y_test, y_pred)
print("Accuracy: {:.2f}".format(acc))

Accuracy: 0.98
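
To see what a maximum depth of 2 means in practice, the fitted tree's decision rules can be printed as text. The following is a minimal sketch, assuming the same split and hyperparameters as the cell above. `export_text`, `get_depth`, and `get_n_leaves` are part of scikit-learn's tree API; the variable name `rules` is only illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

# Same data, split, and hyperparameters as the cell above
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=42
)

dt = DecisionTreeClassifier(max_depth=2, random_state=42)
dt.fit(X_train, y_train)

# Render the learned splits as indented text; a depth-2 tree has at most
# two levels of decisions and therefore at most four leaves
rules = export_text(dt, feature_names=list(iris.feature_names))
print(rules)
print("Depth:", dt.get_depth(), "Leaves:", dt.get_n_leaves())
```

Reading the printed rules shows which features the shallow tree relies on, which helps explain how such a small model can still separate the three Iris classes.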


2. K-Nearest Neighbors
- Load the Breast Cancer dataset from Scikit-Learn datasets.
- Split the dataset into training and testing sets.
- Implement a K-Nearest Neighbors classifier with k=5 and fit it to the training data.
- Evaluate the performance of the K-Nearest Neighbors classifier on the testing data using precision, recall, and F1-score as the evaluation metrics.


In [3]:
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import precision_score, recall_score, f1_score

# Load the Breast Cancer dataset
breast_cancer = load_breast_cancer()

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(breast_cancer.data, breast_cancer.target, test_size=0.3, random_state=42)

# Implement a K-Nearest Neighbors classifier with k=5 and fit it to the training data
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)

# Evaluate the K-Nearest Neighbors classifier on the testing data using precision, recall, and F1-score
y_pred = knn.predict(X_test)
# average='weighted' computes each metric per class, then averages them weighted by class support
prec = precision_score(y_test, y_pred, average='weighted')
rec = recall_score(y_test, y_pred, average='weighted')
f1 = f1_score(y_test, y_pred, average='weighted')
print("Precision: {:.2f}".format(prec))
print("Recall: {:.2f}".format(rec))
print("F1-score: {:.2f}".format(f1))

Precision: 0.96
Recall: 0.96
F1-score: 0.96
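
The scores above depend on the `average` argument. As a hedged illustration, the sketch below reruns the same model and compares `average='binary'` (scikit-learn's default for these metrics, which scores only the positive class) with `average='weighted'`. The `scores` dictionary is a helper introduced here for illustration and is not part of the original cell.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import precision_score, recall_score, f1_score

# Same data, split, and model as the cell above
data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.3, random_state=42
)

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)

# Compare the two averaging modes for each metric
scores = {}
for name, metric in [("precision", precision_score),
                     ("recall", recall_score),
                     ("f1", f1_score)]:
    scores[name] = {
        # 'binary' scores only the positive class (label 1, "benign" in this dataset)
        "binary": metric(y_test, y_pred, average="binary"),
        # 'weighted' averages the per-class scores, weighted by class support
        "weighted": metric(y_test, y_pred, average="weighted"),
    }
    print("{}: binary={:.2f}, weighted={:.2f}".format(
        name, scores[name]["binary"], scores[name]["weighted"]))
```

Note that weighted recall is always equal to plain accuracy, so when the question is how well one specific class is detected, the binary (or per-class) scores are usually the more informative ones.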


3. Linear Regression
- Load the California Housing dataset from Scikit-Learn datasets.
- Split the dataset into training and testing sets.
- Implement a linear regression model and fit it to the training data.
- Evaluate the performance of the linear regression model on the testing data using mean squared error as the evaluation metric.

In [4]:
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Load the California Housing dataset
california_housing = fetch_california_housing()

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(california_housing.data, california_housing.target, test_size=0.3, random_state=42)

# Implement a linear regression model and fit it to the training data
lr = LinearRegression()
lr.fit(X_train, y_train)

# Evaluate the performance of the linear regression model on the testing data using mean squared error as the evaluation metric
y_pred = lr.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
print("Mean squared error: {:.2f}".format(mse))

Mean squared error: 0.53
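
Mean squared error is the average of the squared differences between the true and predicted targets. The sketch below checks that definition by hand against `mean_squared_error` and also derives the root mean squared error (RMSE). To stay self-contained it uses synthetic data from `make_regression`, which is an assumption of this example and not the housing data used above.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption for this sketch, not the housing data)
X, y = make_regression(n_samples=200, n_features=3, noise=10.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

lr = LinearRegression()
lr.fit(X_train, y_train)
y_pred = lr.predict(X_test)

# MSE computed directly from its definition: the mean of the squared residuals
mse_manual = np.mean((y_test - y_pred) ** 2)
mse_sklearn = mean_squared_error(y_test, y_pred)

# RMSE is the square root of MSE and is expressed in the target's own units
rmse = np.sqrt(mse_sklearn)

print("Manual MSE:  {:.4f}".format(mse_manual))
print("sklearn MSE: {:.4f}".format(mse_sklearn))
print("RMSE:        {:.4f}".format(rmse))
```

Applied to the result above, an MSE of 0.53 corresponds to an RMSE of about 0.73. Because the California Housing target is expressed in units of $100,000, that is a typical error of roughly $73,000.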


4. Naive Bayes
- Load the SMS Spam Collection dataset from Scikit-Learn datasets.
- Split the dataset into training and testing sets.
- Implement a Naive Bayes classifier and fit it to the training data.
- Evaluate the performance of the Naive Bayes classifier on the testing data using accuracy, precision, recall, and F1-score as the evaluation metrics.

In [None]:
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# The SMS Spam Collection is not bundled with scikit-learn, so a two-category
# subset of the 20 Newsgroups text dataset stands in for it here
newsgroups = fetch_20newsgroups(subset='all', categories=['alt.atheism', 'talk.religion.misc'])

# Convert text data to numerical feature vectors using CountVectorizer
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(newsgroups.data)
# With two categories the targets are already 0/1; this keeps the labels explicitly binary
y = (newsgroups.target != 0).astype(int)

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Implement a Naive Bayes classifier and fit it to the training data
nb = MultinomialNB()
nb.fit(X_train, y_train)

# Evaluate the performance of the Naive Bayes classifier on the testing data using accuracy, precision, recall, and F1-score as the evaluation metrics
y_pred = nb.predict(X_test)
acc = accuracy_score(y_test, y_pred)
prec = precision_score(y_test, y_pred, average='weighted')
rec = recall_score(y_test, y_pred, average='weighted')
f1 = f1_score(y_test, y_pred, average='weighted')
print("Accuracy: {:.2f}".format(acc))
print("Precision: {:.2f}".format(prec))
print("Recall: {:.2f}".format(rec))
print("F1-score: {:.2f}".format(f1))

Accuracy: 0.91
Precision: 0.91
Recall: 0.91
F1-score: 0.91
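
The cell above fits `CountVectorizer` on the full corpus before splitting, so the vocabulary also includes words that appear only in the test documents. One possible alternative, sketched below, is to split the raw text first and wrap the vectorizer and classifier in a pipeline so the vocabulary is learned from the training portion only. The tiny corpus and its labels are made up for illustration and are not taken from any real dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy messages (1 = spam, 0 = not spam), for illustration only
texts = [
    "win a free prize now",
    "free cash offer claim now",
    "urgent claim your free reward",
    "cheap loans win cash today",
    "are we meeting for lunch today",
    "see you at the library tonight",
    "can you send the lecture notes",
    "thanks for the help with homework",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Split the raw text first so the test messages stay unseen during fitting
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels
)

# The pipeline fits the vectorizer on the training text only, then the classifier
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(X_train, y_train)

# At prediction time, words that were never seen in training are ignored
y_pred = model.predict(X_test)
print("Vocabulary size:", len(model.named_steps["countvectorizer"].vocabulary_))
print("Predictions:", list(y_pred))
```

The same pattern should carry over to the newsgroups data by passing the raw documents and labels to `train_test_split` in place of the toy lists, though the reported scores may then differ slightly from those above.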
