### **Word2Vec, Naive Bayes, and K-Nearest Neighbor**

#### **Introduction**
In this assignment, we will explore three fundamental concepts: Word2Vec, Naïve Bayes, and K-Nearest Neighbors (KNN). 

- **Word2Vec**: A vector space model that maps each word to a dense vector, so that words used in similar contexts end up close together. These embeddings capture semantic relationships between words and are useful for NLP tasks such as similarity detection and sentiment analysis.
- **Naive Bayes**: A probabilistic machine learning algorithm based on Bayes' theorem that is widely used in text classification tasks, such as spam detection and sentiment analysis. It assumes conditional independence among features, making it computationally efficient.
- **K-Nearest Neighbors (KNN)**: A non-parametric, instance-based machine learning algorithm that classifies a data point by the majority vote of its k nearest neighbors. It is commonly applied in pattern recognition and recommendation systems.

#### **Instructions**
There are three tasks in this lab exercise:
- Task 1: Word2Vec
- Task 2: Naive Bayes
- Task 3: K-Nearest Neighbors (KNN)

In each task, you will first see **example code**. It shows the basic Python implementation you need to understand in order to complete the task.

You will then see a **practice task**. This practice task is the real assignment you need to complete.

#### **Task 1: Word2Vec** (Example Code)
The following example code shows a basic Python implementation of Word2Vec.

In [None]:
# Install python dependencies
!pip install pandas
!pip install nltk
!pip install gensim

In [None]:
# Import required libraries
import numpy as np  # For numerical operations
import pandas as pd  # For data manipulation
import nltk  # Natural Language Toolkit for text processing
from nltk.tokenize import word_tokenize  # Tokenizer for splitting text into words
from gensim.models import Word2Vec  # Word2Vec model for word embeddings

In [None]:
# Download required NLTK resources
nltk.download('punkt')
nltk.download('punkt_tab')  # Recent NLTK releases load the word_tokenize data from 'punkt_tab'

In [None]:
"""
## Step 1: Build your text training corpus. This corpus is used to train your Word2Vec model.
"""

# Sample corpus
corpus = [
    "Machine learning is fascinating",
    "Deep learning and neural networks are powerful",
    "Natural language processing is a subset of AI",
    "Word embeddings capture word meanings",
    "Naïve Bayes is a probabilistic classifier"
]

In [None]:
"""
## Step 2: Tokenize the text corpus. This is a necessary step for most natural language processing tasks.
           Word2Vec learns from lists of word tokens, not from raw sentence strings.
"""

# Tokenizing the corpus (splitting sentences into words)
tokenized_corpus = [word_tokenize(sentence.lower()) for sentence in corpus]
tokenized_corpus
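
To make the tokenization step more concrete, here is a rough sketch of what a tokenizer does, written with only the standard library. The function `simple_tokenize` and the list `sample_sentences` are hypothetical names introduced for illustration. NLTK's `word_tokenize` applies more elaborate rules (for example for contractions), so its output can differ from this approximation.

```python
import re

def simple_tokenize(sentence):
    """Lowercase a sentence and split it into word and punctuation tokens."""
    # \w+ matches a run of letters/digits/underscores; [^\w\s] matches one punctuation mark
    return re.findall(r"\w+|[^\w\s]", sentence.lower())

# Two sentences borrowed from the sample corpus above
sample_sentences = [
    "Machine learning is fascinating",
    "Word embeddings capture word meanings",
]

# Each sentence becomes a list of tokens, which is the input shape Word2Vec expects
simple_tokens = [simple_tokenize(s) for s in sample_sentences]
print(simple_tokens)
```

The result is a list of lists, one inner list per sentence, which matches the structure of `tokenized_corpus`.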

In [None]:
"""
## Step 3: Train Word2Vec model with the tokenized corpus.
"""

# Train a Word2Vec model to generate word embeddings from the tokenized corpus.
# Word2Vec is a neural network-based algorithm that learns vector representations of words.
# These word embeddings capture semantic relationships between words.

word2vec_model = Word2Vec(
    sentences=tokenized_corpus,  # The input corpus, where each sentence is tokenized into a list of words.
    vector_size=50,              # The dimensionality of the word vectors. Higher values capture more semantic information but require more computation.
    window=5,                    # The maximum distance between the target word and neighboring words in a sentence. A larger window captures more context.
    min_count=1,                 # Ignores words that appear fewer times than the specified count. Setting it to 1 ensures all words are included.
    workers=4                    # The number of CPU threads to use for training. More workers speed up training but require more resources.
)
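
The `window` parameter can be easier to picture with a small sketch. The hypothetical helper `context_pairs` below lists every (target, context) pair that falls within a fixed window. This is a simplification for illustration only: real Word2Vec implementations may also vary the effective window size and subsample frequent words, which this sketch ignores.

```python
def context_pairs(tokens, window):
    """Return (target, context) pairs for all words within `window` positions of each target."""
    pairs = []
    for i, target in enumerate(tokens):
        start = max(0, i - window)                # do not run past the start of the sentence
        stop = min(len(tokens), i + window + 1)   # do not run past the end of the sentence
        for j in range(start, stop):
            if j != i:                            # a word is not its own context
                pairs.append((target, tokens[j]))
    return pairs

example_tokens = ["word", "embeddings", "capture", "word", "meanings"]

narrow_pairs = context_pairs(example_tokens, window=1)
wide_pairs = context_pairs(example_tokens, window=5)

print(len(narrow_pairs), "pairs with window=1")
print(len(wide_pairs), "pairs with window=5")
```

With `window=5` every word in this five-word sentence is paired with every other word, which is why a larger window captures more context.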

In [None]:
"""
## Step 4: Print the vector representation of a word
## Reminder! You can NOT generate vectors for the words that are not in your training corpus.

"""

print("Vector for the word 'learning':", word2vec_model.wv['learning'], "\n")
print("Vector for the word 'neural':", word2vec_model.wv['neural'], "\n")
print("Vector for the word 'language':", word2vec_model.wv['language'], "\n")
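
Word vectors are usually compared with cosine similarity, which measures the angle between two vectors. The sketch below implements it by hand on made-up 3-dimensional vectors (`vec_a`, `vec_b`, and `vec_c` are hypothetical values, not output of the model above). In gensim, `word2vec_model.wv.similarity('learning', 'neural')` computes the same quantity for trained vectors, although with a five-sentence corpus the scores are not expected to be meaningful.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical vectors, chosen only to illustrate the two extremes
vec_a = [1.0, 2.0, 0.0]
vec_b = [2.0, 4.0, 0.0]    # points in the same direction as vec_a
vec_c = [-2.0, 1.0, 0.0]   # perpendicular to vec_a

sim_same = cosine_similarity(vec_a, vec_b)
sim_orthogonal = cosine_similarity(vec_a, vec_c)

print("Same direction:", sim_same)        # close to 1
print("Perpendicular:", sim_orthogonal)   # close to 0
```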


#### **Task 1: Word2Vec** (Practice Task)
This is the assignment you need to complete. You may use any code snippets shown above.

In [None]:
# Step 1: Curate your own training corpus. Add another 9 sentences to create a text corpus with a total of 10 sentences.
corpus = [
    'The cat is sleeping on the mat.',
    ...
]

In [None]:
# Step 2: Tokenize your text corpus
tokenized_corpus = ...
tokenized_corpus

In [None]:
# Step 3: Train your Word2Vec model with a word embedding dimension of 5.
word2vec_model = Word2Vec(
    ...
)

In [None]:
# Step 4: Generate and print out three word embeddings from your training corpus.
first_word_embedding = ...
second_word_embedding = ...
third_word_embedding = ...

#### **Task 2: Naive Bayes** (Example Code)
The following example code shows a basic Python implementation of Naive Bayes.
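
For reference, the conditional-independence assumption behind Naive Bayes can be written out as follows. This is the standard textbook formulation rather than anything specific to this lab. Here $y$ is the class label, $x_1, \dots, x_n$ are the features, and $\mu_{y,i}$ and $\sigma_{y,i}^2$ are the mean and variance of feature $i$ within class $y$; the Gaussian form in the second line is the one used by the Gaussian variant of the classifier.

```latex
P(y \mid x_1, \dots, x_n) \;\propto\; P(y) \prod_{i=1}^{n} P(x_i \mid y)

P(x_i \mid y) = \frac{1}{\sqrt{2\pi\sigma_{y,i}^{2}}}
                \exp\!\left( -\frac{(x_i - \mu_{y,i})^{2}}{2\sigma_{y,i}^{2}} \right)

\hat{y} = \arg\max_{y} \; P(y) \prod_{i=1}^{n} P(x_i \mid y)
```

Because each feature contributes its own factor, the model only needs one mean and one variance per feature and class, which is what makes it computationally efficient.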

In [None]:
# Install python dependencies
!pip install scikit-learn # Install sklearn library.

## Scikit-learn (sklearn) is a Python library providing tools for machine learning and data analysis. 
## Sklearn offers a wide range of functionalities, including: 
## - Regression: Algorithms for predicting continuous values, such as linear regression.
## - Classification: Algorithms for predicting categories, such as k-nearest neighbors.
## - Clustering: Algorithms for grouping similar data points, such as k-means.
## - Dimensionality reduction: Techniques for reducing the number of variables in a dataset.
## - Model selection: Tools for comparing and tuning different machine learning models.
## - Preprocessing: Methods for preparing data for machine learning, such as feature extraction, scaling and encoding.

In [None]:
# Step 1: Import necessary libraries

import numpy as np  # Import NumPy for numerical operations
import pandas as pd  # Import pandas for handling datasets
from sklearn.model_selection import train_test_split  # Import function to split dataset into train and test sets
from sklearn.naive_bayes import GaussianNB  # Import Gaussian Naïve Bayes model
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix  # Import metrics for evaluation

In [None]:
# Step 2: Let's create a synthetic dataset

# Columns: Fever, Cough, Fatigue, Age, Has Disease (1 = Yes, 0 = No)
data = np.array([
    [1, 1, 0, 25, 1],  # Example: Patient with fever and cough, no fatigue, age 25, has disease
    [0, 1, 1, 40, 1],  # Example: Patient with cough and fatigue, no fever, age 40, has disease
    [1, 0, 1, 35, 0],  # Example: Patient with fever and fatigue, no cough, age 35, no disease
    [0, 0, 0, 50, 0],  # Example: Patient with no symptoms, age 50, no disease
    [1, 1, 1, 30, 1],  # Example: Patient with all symptoms, age 30, has disease
    [0, 0, 1, 45, 0],  # Example: Patient with fatigue only, age 45, no disease
    [1, 1, 0, 60, 1],  # Example: Patient with fever and cough, no fatigue, age 60, has disease
    [1, 0, 1, 55, 0],  # Example: Patient with fever and fatigue, no cough, age 55, no disease
    [0, 1, 1, 65, 1],  # Example: Patient with cough and fatigue, no fever, age 65, has disease
    [1, 0, 0, 28, 0]   # Example: Patient with fever only, age 28, no disease
])

# In our dataset, each patient is described by four features: fever, cough, fatigue, and age.
# The label is Has Disease (1 = Yes, 0 = No).

In [None]:
# Step 3: Convert the dataset to a pandas DataFrame

df = pd.DataFrame(data, columns=['Fever', 'Cough', 'Fatigue', 'Age', 'Has_Disease'])

In [None]:
# Step 4: Split data into training and test sets

X = df[['Fever', 'Cough', 'Fatigue', 'Age']]  # Features: Symptoms and age
y = df['Has_Disease']  # Target variable: Disease presence (1 or 0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)  # Split 70% training, 30% testing


In [None]:
# Step 5: Train the Gaussian Naïve Bayes model

model = GaussianNB()  # Create a GaussianNB classifier
model.fit(X_train, y_train)  # Train the model on the training data
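
To show what "training" a Gaussian Naive Bayes model involves, here is a minimal from-scratch sketch using only the standard library. It is not scikit-learn's implementation. The names `toy_X`, `toy_y`, `fit_gaussian_nb`, `predict_gaussian_nb`, and the constant `VAR_EPS` are assumptions made for illustration. The rows are the same ten patients as in the synthetic dataset above. Note that in this toy data Cough has zero variance within each class, so a small constant is added to every variance to avoid division by zero (scikit-learn's `GaussianNB` handles this with its `var_smoothing` parameter).

```python
import math

# Features: Fever, Cough, Fatigue, Age (same ten patients as the example dataset)
toy_X = [
    [1, 1, 0, 25], [0, 1, 1, 40], [1, 0, 1, 35], [0, 0, 0, 50], [1, 1, 1, 30],
    [0, 0, 1, 45], [1, 1, 0, 60], [1, 0, 1, 55], [0, 1, 1, 65], [1, 0, 0, 28],
]
toy_y = [1, 1, 0, 0, 1, 0, 1, 0, 1, 0]  # Has Disease

VAR_EPS = 1e-6  # assumed smoothing constant, keeps every variance strictly positive

def fit_gaussian_nb(X, y):
    """Estimate the prior and the per-feature mean and variance for each class."""
    params = {}
    for label in sorted(set(y)):
        rows = [x for x, t in zip(X, y) if t == label]
        n = len(rows)
        columns = list(zip(*rows))  # one tuple of values per feature
        means = [sum(col) / n for col in columns]
        variances = [
            sum((v - m) ** 2 for v in col) / n + VAR_EPS
            for col, m in zip(columns, means)
        ]
        params[label] = (n / len(y), means, variances)
    return params

def predict_gaussian_nb(params, x):
    """Return the class with the highest log prior plus summed Gaussian log-likelihoods."""
    best_label, best_score = None, -math.inf
    for label, (prior, means, variances) in params.items():
        score = math.log(prior)
        for v, m, var in zip(x, means, variances):
            score += -0.5 * math.log(2 * math.pi * var) - (v - m) ** 2 / (2 * var)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

nb_params = fit_gaussian_nb(toy_X, toy_y)
pred_with_cough = predict_gaussian_nb(nb_params, [1, 1, 0, 25])
pred_without_cough = predict_gaussian_nb(nb_params, [1, 0, 1, 35])
print(pred_with_cough, pred_without_cough)
```

Working in log space turns the product of probabilities into a sum, which avoids numerical underflow when many features are multiplied together.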

In [None]:
# Step 6: Make predictions
y_pred = model.predict(X_test)  # Predict disease presence on test data

In [None]:
# Step 7: Evaluate the model
print("Accuracy:", accuracy_score(y_test, y_pred))  # Print accuracy of the model
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))  # Print confusion matrix
print("Classification Report:\n", classification_report(y_test, y_pred))  # Print precision, recall, F1-score, etc.
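
The metrics printed above can also be computed by hand, which helps when reading the confusion matrix. The sketch below uses made-up labels (`demo_true` and `demo_pred` are hypothetical, not results from the model) and follows the same layout as scikit-learn's `confusion_matrix`: rows are true classes, columns are predicted classes.

```python
def confusion_counts(y_true, y_pred):
    """Return [[TN, FP], [FN, TP]] for binary 0/1 labels (rows = true, columns = predicted)."""
    matrix = [[0, 0], [0, 0]]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix

# Hypothetical labels for five patients, made up for illustration
demo_true = [1, 0, 1, 1, 0]
demo_pred = [1, 0, 0, 1, 1]

demo_matrix = confusion_counts(demo_true, demo_pred)
(tn, fp), (fn, tp) = demo_matrix

demo_accuracy = (tp + tn) / len(demo_true)  # share of all predictions that are correct
demo_precision = tp / (tp + fp)             # share of predicted positives that are truly positive
demo_recall = tp / (tp + fn)                # share of true positives that were found

print("Confusion matrix:", demo_matrix)
print("Accuracy:", demo_accuracy)
print("Precision:", demo_precision)
print("Recall:", demo_recall)
```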

#### **Task 2: Naive Bayes** (Practice Task)
This is the assignment you need to complete. You may use any code snippets shown above.

In [None]:
# Step 1: Load the expanded dataset.
# The expanded dataset adds two features, Heart Rate and Blood Pressure, and contains 30 patients in total.

expanded_data = np.array([
    [1, 1, 0, 25, 80, 120, 1],
    [0, 1, 1, 40, 85, 130, 1],
    [1, 0, 1, 35, 78, 125, 0],
    [0, 0, 0, 50, 72, 115, 0],
    [1, 1, 1, 30, 90, 140, 1],
    [0, 0, 1, 45, 76, 118, 0],
    [1, 1, 0, 60, 88, 135, 1],
    [1, 0, 1, 55, 82, 128, 0],
    [0, 1, 1, 65, 74, 110, 1],
    [1, 0, 0, 28, 79, 122, 0],
    [0, 1, 0, 33, 77, 120, 0],
    [1, 0, 1, 48, 86, 132, 1],
    [0, 1, 1, 52, 81, 126, 1],
    [1, 1, 1, 38, 92, 145, 1],
    [0, 0, 0, 70, 70, 108, 0],
    [1, 1, 0, 58, 84, 129, 1],
    [0, 0, 1, 43, 75, 115, 0],
    [1, 0, 0, 29, 78, 120, 0],
    [0, 1, 1, 60, 83, 125, 1],
    [1, 0, 1, 47, 87, 130, 1],
    [1, 1, 1, 32, 89, 138, 1],
    [0, 0, 0, 55, 73, 112, 0],
    [1, 1, 0, 61, 85, 133, 1],
    [0, 0, 1, 42, 76, 118, 0],
    [1, 0, 0, 27, 79, 124, 0],
    [0, 1, 1, 59, 80, 127, 1],
    [1, 0, 1, 49, 88, 136, 1],
    [1, 1, 0, 34, 91, 142, 1],
    [0, 0, 0, 63, 74, 109, 0],
    [1, 1, 1, 37, 93, 148, 1]
])

In [None]:
# Step 2: Convert the data to a DataFrame with seven columns.
# Column names are 'Fever', 'Cough', 'Fatigue', 'Age', 'Heart_Rate', 'Blood_Pressure', and 'Has_Disease'

df = pd.DataFrame(
    ...
)

In [None]:
# Step 3: Split data into training and test sets. Split 70% training, 30% testing

# Define feature columns and target variable
X = df.drop(columns=['Has_Disease'])  # Features. Keep all columns except for column name = 'Has_Disease'
y = df['Has_Disease']  # Target variable: Disease presence

X_train, X_test, y_train, y_test = train_test_split(...)  # Split 70% training, 30% testing

In [None]:
# Step 4: Train the Gaussian Naïve Bayes model

model = ...  # Create a GaussianNB classifier
model.fit(...)  # Train the model on the training data

In [None]:
# Step 5: Make predictions

y_pred = model.predict(...)  # Predict disease presence on test data

In [None]:
# Step 6: Evaluate the model

print("Accuracy:", ...)  # Print accuracy of the model
print("Confusion Matrix:\n", ...)  # Print confusion matrix
print("Classification Report:\n", ...)  # Print precision, recall, F1-score, etc.

#### **Task 3: K-Nearest Neighbors (KNN)** (Example Code)
The following example code shows a basic Python implementation of K-Nearest Neighbors (KNN).

In this example, we will load a real dataset and visualize the data samples, so we need to install one more Python library: Seaborn. Seaborn is a Python data visualization library built on top of Matplotlib. It provides a high-level interface for creating informative and attractive statistical graphics.


In [None]:
# Step 1: Install Seaborn python library

!pip install seaborn

In [None]:
# Step 2: Import required Python libraries

import numpy as np  # Import NumPy for numerical computations
import pandas as pd  # Import Pandas for data manipulation
import matplotlib.pyplot as plt  # Import Matplotlib for data visualization
import seaborn as sns  # Import Seaborn for statistical data visualization
from sklearn.model_selection import train_test_split  # Import function to split dataset into training and testing sets
from sklearn.preprocessing import StandardScaler  # Import StandardScaler for feature normalization
from sklearn.neighbors import KNeighborsClassifier  # Import KNeighborsClassifier to implement KNN algorithm
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix  # Import evaluation metrics
from sklearn.datasets import load_iris  # Import Iris dataset

In [None]:
# Step 3: Load the dataset
iris = load_iris()  # Load the Iris dataset from sklearn
# Convert the dataset into a Pandas DataFrame for easier manipulation
data = pd.DataFrame(iris.data, columns=iris.feature_names)
# Add the target labels to the DataFrame
data['target'] = iris.target

The Iris dataset is a well-known dataset in machine learning and statistics, originally introduced by Ronald Fisher in 1936. It is widely used for classification problems.

Size of the Iris Dataset:
- Total samples: 150
- Features per sample: 4
- Target classes: 3 (Setosa, Versicolor, Virginica)

Features (Independent Variables):

Each sample in the dataset represents a flower and has four numerical features:
- Sepal length (cm)
- Sepal width (cm)
- Petal length (cm)
- Petal width (cm)

Labels (Dependent Variable):

The target variable represents the species of the iris flower, which can be one of three classes:
- 0 → Setosa
- 1 → Versicolor
- 2 → Virginica

In [None]:
# Step 4: Let's display the first 5 rows of the dataset

data.head()

In [None]:
# Step 5: Splitting data into training and testing sets


X = data.drop(columns=['target'])  # Extract feature variables (independent variables)
y = data['target']  # Extract target variable (dependent variable)
# Split the data into training (80%) and testing (20%)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [None]:
# Step 6: Standardizing features to normalize data distribution
# Feature scaling rescales each feature to zero mean and unit variance. KNN is distance-based, so without scaling, features with larger ranges would dominate the distance.


scaler = StandardScaler()  # Initialize the StandardScaler
X_train_scaled = scaler.fit_transform(X_train)  # Fit and transform training data
X_test_scaled = scaler.transform(X_test)  # Transform test data using the same scaler

## You can try to print both X_train and X_train_scaled to see the effect of the feature scaling.
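
The scaling step can be sketched by hand for a single feature. The helpers `fit_scaler` and `transform_column` and the values in `train_col` and `test_col` are hypothetical, chosen for illustration. The key point the sketch tries to show is that the mean and standard deviation are estimated on the training data only and then reused for the test data, which mirrors calling `fit_transform` on the training set and `transform` on the test set.

```python
import math

def fit_scaler(column):
    """Return the mean and population standard deviation of one feature column."""
    mean = sum(column) / len(column)
    std = math.sqrt(sum((v - mean) ** 2 for v in column) / len(column))
    return mean, std

def transform_column(column, mean, std):
    """Apply z-score scaling with previously fitted statistics."""
    return [(v - mean) / std for v in column]

# Hypothetical values of a single feature
train_col = [1.0, 2.0, 3.0]
test_col = [4.0]

col_mean, col_std = fit_scaler(train_col)                          # statistics come from training data only
train_col_scaled = transform_column(train_col, col_mean, col_std)
test_col_scaled = transform_column(test_col, col_mean, col_std)    # reuse the training statistics

print("Scaled training values:", train_col_scaled)
print("Scaled test values:", test_col_scaled)
```

The scaled training values are centered on zero, while the test value is expressed in training standard deviations away from the training mean.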

In [None]:
# Step 7: Implementing K-Nearest Neighbors (KNN) algorithm

k = 3  # Define the number of neighbors to consider (K-value)
knn = KNeighborsClassifier(n_neighbors=k)  # Initialize KNN classifier with chosen K
knn.fit(X_train_scaled, y_train)  # Train the classifier on the scaled training data
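
The idea behind KNN fits in a few lines of plain Python. The sketch below is an illustration, not scikit-learn's implementation: `knn_predict`, `toy_points`, and `toy_labels` are hypothetical names and values. It uses Euclidean distance, which corresponds to the default metric of `KNeighborsClassifier`, and a simple majority vote among the k closest training points.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k):
    """Predict the label of `query` by majority vote among its k nearest training points."""
    # Pair every training point's distance to the query with its label, closest first
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    # The most frequent label among the k neighbors wins
    return Counter(nearest_labels).most_common(1)[0][0]

# Hypothetical 2-D points forming two well-separated groups
toy_points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (6.0, 6.0), (7.0, 7.0), (6.5, 8.0)]
toy_labels = [0, 0, 0, 1, 1, 1]

pred_low = knn_predict(toy_points, toy_labels, (2.0, 2.0), k=3)
pred_high = knn_predict(toy_points, toy_labels, (6.0, 7.0), k=3)
print(pred_low, pred_high)
```

There is no real training phase here: the model simply stores the training data, and all the work happens at prediction time.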

In [None]:
# Step 8: Make predictions on the test data


y_pred = knn.predict(X_test_scaled)  # Predict the target labels for test data

In [None]:
# Step 9: Evaluate Model Performance


print("Accuracy:", accuracy_score(y_test, y_pred))  # Print accuracy of model
print("Classification Report:\n", classification_report(y_test, y_pred))  # Print detailed classification metrics
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))  # Print confusion matrix

#### **Task 3: K-Nearest Neighbors (KNN)** (Practice Task)
This is the assignment you need to complete. You may use any code snippets shown above.

In [None]:
# Step 1: Let's load another real-world dataset: Breast Cancer Dataset
from sklearn.datasets import load_breast_cancer  # Import Breast Cancer dataset

# Load the Breast Cancer dataset
cancer = load_breast_cancer()  # Load the dataset from sklearn
# Convert the dataset into a Pandas DataFrame for easier manipulation
data = pd.DataFrame(cancer.data, columns=cancer.feature_names)
# Add the target labels to the DataFrame
data['target'] = cancer.target

The Breast Cancer Wisconsin (Diagnostic) dataset is a well-known dataset used extensively in machine learning and medical research.

The dataset comprises 30 features: the mean, standard error, and "worst" (largest) value of several cell nuclei characteristics, computed for each image. Examples include:
- mean radius: Mean of distances from the center to points on the perimeter.
- mean texture: Standard deviation of gray-scale values.
- mean perimeter: Mean perimeter of the cell nuclei.
- mean area: Mean area of the cell nuclei.

Out of the 569 samples in the dataset, the binary label distribution is: Benign: 357 (63%) and Malignant: 212 (37%). In scikit-learn's encoding, target 0 is malignant and target 1 is benign.

In [None]:
# Step 2: Let's display the first 5 rows of the dataset. Remember that the last column, 'target', is the label. All other columns are features.

data.head(...)

In [None]:
# Step 3: Splitting data into training and testing sets

X = data.drop(columns=['target'])  # Extract feature variables (independent variables)
y = data['target']  # Extract target variable (dependent variable)

# Split the data into training (80%) and testing (20%)
X_train, X_test, y_train, y_test = train_test_split(...)

In [None]:
# Step 4: Standardizing features to normalize data distribution

...

In [None]:
# Step 5: Implementing K-Nearest Neighbors (KNN) algorithm

...

In [None]:
# Step 6: Make predictions on the test data

...

In [None]:
# Step 7: Evaluate Model Performance

...

In [None]:
# Step 8: Change the k value in Step 5 and observe how the KNN model's outputs differ.