{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "f181a0d4-8ff7-4a6d-923e-69fff5823a54",
   "metadata": {},
   "source": [
    "# SPAM Detector - Playground"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2bf3f1fd-3764-4a2c-9375-bfa88caed274",
   "metadata": {},
   "source": [
    "### Let's help McSkiddy develop an effective spam detector\n",
    "\n",
    "In this activity, we will help McSkiddy build a simple spam detector using Machine Learning.\n",
    "As we are in a testing phase, we will use a small data sample."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "86fc5ed6-6efe-453f-b14f-f6a86a54fe70",
   "metadata": {},
   "source": [
    "# Step 0: Importing the required libraries\n",
    "Before starting with data collection, we will import the required libraries. The Jupyter environment for this task already has the libraries we need for Machine Learning installed. Here, we are importing two key libraries: NumPy and Pandas. These libraries are explained in detail in the previous task.\n",
    "\n",
    "**NumPy:** NumPy (Numerical Python) is the fundamental package for numerical computations in Python.\n",
    "\n",
    "**Pandas:** Pandas provides high-level data structures and methods designed to make data analysis fast and easy in Python. It's built on top of NumPy.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "999a0cdc-9239-44ec-ae36-daed31443b42",
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import pandas as pd"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a7fd1fc6-317c-4a25-99c7-f4bacb3e984b",
   "metadata": {},
   "source": [
    "# Step 1: Data Collection\n",
    "\n",
    "**Data collection** is the process of gathering raw data from various sources to be used for Machine Learning. This data can originate from numerous sources, such as databases, text files, APIs, online repositories, sensors, surveys, web scraping, and many others.\n",
    "\n",
    "Here, we use the Pandas library to load the collected data, which is stored in CSV format. The dataset contains spam and ham (non-spam) emails."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "bd0cdaf3-e554-4ec3-ba4d-03ba603c7e2c",
   "metadata": {},
   "outputs": [],
   "source": [
    "data = pd.read_csv(\"emails_dataset.csv\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7a628f87-bd18-40db-9e16-d366eb01c9b2",
   "metadata": {},
   "source": [
    "## Test/Check Dataset\n",
    "\n",
    "Let's review the dataset we just imported. The `Classification` column contains each email's label (spam or ham), and the `Message` column contains the email body, as shown below:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cb1b3ead-54ca-4952-8f41-eee38ac5f47e",
   "metadata": {},
   "outputs": [],
   "source": [
    "print(data.head())"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "bdf048ff-8c0f-471b-b026-eb2b8364b74e",
   "metadata": {},
   "source": [
    "DataFrames provide a structured, tabular representation of data that's intuitive and easy to read. `pd.read_csv` already returns a DataFrame, so the command below simply wraps the loaded data in a DataFrame named `df`, which we will analyse and manipulate in the following steps."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6fc8d447-f89a-47af-a77f-77ebfcd4f99f",
   "metadata": {},
   "outputs": [],
   "source": [
    "df = pd.DataFrame(data)\n",
    "print(df)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e110f0d9-215c-486e-8f3f-6a3627b656e0",
   "metadata": {},
   "source": [
    "# Step 2: Data Preprocessing\n",
    "\n",
    "Data preprocessing refers to the techniques used to convert raw data into a clean, organised, understandable, and structured format suitable for Machine Learning. Given that raw data is often messy, inconsistent, and incomplete, preprocessing is an essential step to ensure that the data feeding into the ML models is relevant and of high quality.\n",
    "\n",
    "There are several data preprocessing techniques, each of which processes the data in its own way.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "bb07c28d-6c5c-414c-8a89-70c452f7f7ed",
   "metadata": {},
   "source": [
    "### Utilizing CountVectorizer()\n",
    "Machine Learning models understand numbers, not text. This means the text needs to be transformed into a numerical format. CountVectorizer, a class provided by the scikit-learn library in Python, achieves this by converting text into a matrix of token (word) counts. This matrix is what the Machine Learning model trains on and makes its predictions from.\n",
    "\n",
    "Here, we use CountVectorizer to extract features from the text."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a4ff9af1-0035-4d9e-882a-7af28c6e6a17",
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.feature_extraction.text import CountVectorizer"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a2ee7960-968b-4343-b676-c62274f2f140",
   "metadata": {},
   "source": [
    "We will now use the CountVectorizer class to transform the `Message` column into a numeric token-count matrix, as shown below:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "372a036c-fe2c-43be-b991-e1760af2fe8b",
   "metadata": {},
   "outputs": [],
   "source": [
    "vectorizer = CountVectorizer()\n",
    "X = vectorizer.fit_transform(df['Message'])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "aca5f49d-e76b-43db-ac3e-2c89ef463e9e",
   "metadata": {},
   "outputs": [],
   "source": [
    "print(X)"
   ]
  },
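  {
   "cell_type": "markdown",
   "id": "c0ffee01-0000-4000-8000-000000000001",
   "metadata": {},
   "source": [
    "To make the token-count matrix less abstract, here is a minimal sketch on two made-up messages (they are not taken from `emails_dataset.csv`). The names `toy_messages`, `toy_vectorizer`, and `toy_counts` are illustrative only and are kept separate from the `vectorizer` and `X` used in the rest of the notebook. Each row of the printed array is one message, and each column is one token from the learned vocabulary."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c0ffee01-0000-4000-8000-000000000002",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Illustrative sketch only: toy messages, not part of emails_dataset.csv\n",
    "from sklearn.feature_extraction.text import CountVectorizer\n",
    "\n",
    "toy_messages = [\"win a free prize\", \"free lunch today\"]\n",
    "toy_vectorizer = CountVectorizer()\n",
    "toy_counts = toy_vectorizer.fit_transform(toy_messages)\n",
    "\n",
    "# Vocabulary learned from the toy messages, listed in column order\n",
    "# (with the default settings, single-character tokens such as \"a\" are ignored)\n",
    "print(sorted(toy_vectorizer.vocabulary_))\n",
    "\n",
    "# Dense view of the sparse matrix: one row per message, one column per token\n",
    "print(toy_counts.toarray())"
   ]
  },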
  {
   "cell_type": "markdown",
   "id": "a641c86a-10a2-437c-9c0e-d2c2bd64602a",
   "metadata": {},
   "source": [
    "# Step 3: Train/Test split dataset\n",
    "It's important to test the model's performance on unseen data. By splitting the data, we can train our model on one subset and test its performance on another.\n",
    "\n",
    "Here, the variable `X` contains the feature matrix built in Step 2. We will use the `train_test_split` function from the scikit-learn library to split the dataset into training data and testing data, as shown below:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b9be42a3-c68e-47f6-bb88-1907735f490f",
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.model_selection import train_test_split"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2753d9ed-cb45-4484-87f5-ad69913409c9",
   "metadata": {},
   "outputs": [],
   "source": [
    "y = df['Classification']\n",
    "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ec6ac22a-948e-4a47-b254-a2ea1e6268e1",
   "metadata": {},
   "source": [
    "- **X**: The first argument to `train_test_split` is the feature matrix `X`, which you obtained from the `CountVectorizer`. This matrix contains the token counts for each message in the dataset.\n",
    "\n",
    "- **y**: The second argument is the label for each instance in your dataset, which indicates whether a message is spam or ham.\n",
    "\n",
    "- **test_size=0.2**: This argument specifies that 20% of the dataset should be kept as the test set and the rest (80%) should be used for training. It's a common practice to hold out a portion of the dataset for testing to evaluate the performance of the model on unseen data.\n",
    "\n",
    "The call to `train_test_split` is where the actual splitting of data into training and test sets happens.\n",
    "\n",
    "The function then returns four values:\n",
    "\n",
    "- **X_train**: The subset of the features to be used for training.\n",
    "- **X_test**: The subset of the features to be used for testing.\n",
    "- **y_train**: The corresponding labels for the `X_train` set.\n",
    "- **y_test**: The corresponding labels for the `X_test` set."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "171c0dc2-fd4c-4fbf-a8b2-485c4aa178a2",
   "metadata": {},
   "source": [
    "# Step 4: Model Training using Naive Bayes\n",
    "\n",
    "Naive Bayes is a statistical method that uses the probability of certain words appearing in spam and non-spam emails to determine whether a new email is spam or not.\n",
    "\n",
    "## How Naive Bayes Classification Works\n",
    "\n",
    "- Let's say we have a bunch of emails, some labelled as \"spam\" and others as \"ham\".\n",
    "- The Naive Bayes algorithm learns from these emails. It looks at the words in each email and calculates how frequently each word appears in spam and in ham emails. For instance, words like \"free\", \"win\", \"offer\", and \"lottery\" might appear more often in spam emails.\n",
    "- For a new email, the algorithm combines these word frequencies to calculate the probability of the email being spam and the probability of it being ham.\n",
    "- For example, when the trained model receives a new email that says \"Win a free toy now!\", it reasons roughly as follows:\n",
    "    - \"Win\" often appears in spam, so this increases the chance of the email being spam.\n",
    "    - \"Free\" is also common in spam, further increasing the spam probability.\n",
    "    - \"Toy\" might be neutral, appearing about as often in spam as in ham.\n",
    "    - After considering all the words, it calculates the overall probability of the email being spam and of it being ham.\n",
    "\n",
    "If the calculated probability of spam is higher than that of ham, the algorithm classifies the email as spam. Otherwise, it's classified as ham.\n",
    "\n",
    "Let's use Naive Bayes to train the model, as shown and explained below:\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "4a9cd2c2-5115-4895-b975-e99182f45962",
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.naive_bayes import MultinomialNB\n",
    "clf = MultinomialNB()\n",
    "clf.fit(X_train, y_train)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "262fbd15-0ad1-473b-8a50-4364989f5b0b",
   "metadata": {},
   "source": [
    "  - **X_train:** This is the training data you want the model to learn from. It's the token counts for each message in the training dataset, obtained from the CountVectorizer.\n",
    "  - **y_train:** These are the correct labels (either \"spam\" or \"ham\") for each message in the X_train dataset.\n",
    "  \n",
    "  This is where the actual training of the model happens. The fit method is used to train or \"fit\" the model on your training data."
   ]
  },
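  {
   "cell_type": "markdown",
   "id": "c0ffee01-0000-4000-8000-000000000003",
   "metadata": {},
   "source": [
    "The word-by-word reasoning described under \"How Naive Bayes Classification Works\" can also be sketched on a tiny, made-up training set. The four messages, their labels, and the names `toy_texts`, `toy_labels`, `toy_vec`, and `toy_clf` are assumptions for illustration only; they do not come from the dataset and do not touch `clf`. Note that `transform` ignores words that were not seen during fitting, so in this sketch \"toy\" simply contributes nothing to either class."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c0ffee01-0000-4000-8000-000000000004",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Illustrative sketch only: the messages and labels below are made up\n",
    "from sklearn.feature_extraction.text import CountVectorizer\n",
    "from sklearn.naive_bayes import MultinomialNB\n",
    "\n",
    "toy_texts = [\n",
    "    \"win a free lottery offer\",\n",
    "    \"free offer win now\",\n",
    "    \"meeting at noon tomorrow\",\n",
    "    \"lunch with the team tomorrow\",\n",
    "]\n",
    "toy_labels = [\"spam\", \"spam\", \"ham\", \"ham\"]\n",
    "\n",
    "# Learn the vocabulary and per-class word frequencies from the toy data\n",
    "toy_vec = CountVectorizer()\n",
    "toy_clf = MultinomialNB()\n",
    "toy_clf.fit(toy_vec.fit_transform(toy_texts), toy_labels)\n",
    "\n",
    "# Score a new message; words unseen during fitting (such as \"toy\") are ignored\n",
    "new_email = toy_vec.transform([\"Win a free toy now!\"])\n",
    "\n",
    "# classes_ gives the column order of the probabilities printed next\n",
    "print(toy_clf.classes_)\n",
    "print(toy_clf.predict_proba(new_email))\n",
    "print(toy_clf.predict(new_email))"
   ]
  },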
  {
   "cell_type": "markdown",
   "id": "345e99b2-0536-45ee-bebc-45c350b52a56",
   "metadata": {},
   "source": [
    "# Step 5: Model Evaluation\n",
    "\n",
    "After training, it's essential to evaluate the model's performance on the test set to gauge its predictive power. This will give you metrics such as accuracy, precision, and recall."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "117f0dee-88ca-404f-aeef-7eb39321e096",
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.metrics import classification_report\n",
    "\n",
    "y_pred = clf.predict(X_test)\n",
    "print(classification_report(y_test, y_pred))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d362e206-50ef-408f-8445-2f2c330fa8e4",
   "metadata": {},
   "source": [
    "The classification_report function takes in the true labels (y_test) and the predicted labels (y_pred) and returns a text report showing the main classification metrics."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "deb263d8-bab6-4ec1-a682-0c79e19b3f8c",
   "metadata": {},
   "source": [
    "The report gives you insights into how well your model is performing for each class and overall, in terms of these metrics.\n",
    "\n",
    "- **Precision:** The ratio of correctly predicted positive observations to the total predicted positives. The question it answers is: Of all the samples predicted as positive, how many were actually positive?\n",
    "- **Recall (Sensitivity):** The ratio of correctly predicted positive observations to all the actual positives. The question it answers is: Of all the actual positive samples, how many did we predict correctly?\n",
    "- **F1-Score:** The harmonic mean of Precision and Recall. Because it accounts for both false positives and false negatives, it is often more informative than accuracy, especially when there's an imbalance between classes.\n",
    "- **Support:** The number of actual occurrences of the class in the specified dataset.\n",
    "- **Accuracy:** The ratio of correctly predicted observations to the total observations.\n",
    "- **Macro Avg:** The unweighted mean of the per-class values of each metric.\n",
    "- **Weighted Avg:** The mean of the per-class values of each metric, weighted by each class's support."
   ]
  },
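  {
   "cell_type": "markdown",
   "id": "c0ffee01-0000-4000-8000-000000000005",
   "metadata": {},
   "source": [
    "As a small worked example of the definitions above, the sketch below computes precision, recall, and F1 for hand-written labels, treating \"spam\" as the positive class. The lists `toy_true` and `toy_pred` are made up for illustration and are unrelated to `y_test` and `y_pred`. They contain 2 true positives, 1 false positive, and 1 false negative."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c0ffee01-0000-4000-8000-000000000006",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Illustrative sketch only: hand-written labels, not model output\n",
    "from sklearn.metrics import f1_score, precision_score, recall_score\n",
    "\n",
    "toy_true = [\"spam\", \"spam\", \"spam\", \"ham\", \"ham\", \"ham\", \"ham\", \"ham\"]\n",
    "toy_pred = [\"spam\", \"spam\", \"ham\", \"spam\", \"ham\", \"ham\", \"ham\", \"ham\"]\n",
    "\n",
    "# Precision = TP / (TP + FP) = 2 / (2 + 1)\n",
    "print(precision_score(toy_true, toy_pred, pos_label=\"spam\"))\n",
    "\n",
    "# Recall = TP / (TP + FN) = 2 / (2 + 1)\n",
    "print(recall_score(toy_true, toy_pred, pos_label=\"spam\"))\n",
    "\n",
    "# F1 = harmonic mean of precision and recall\n",
    "print(f1_score(toy_true, toy_pred, pos_label=\"spam\"))"
   ]
  },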
  {
   "cell_type": "markdown",
   "id": "913108c5-c537-4fd2-bbef-08c28b680a5e",
   "metadata": {},
   "source": [
    "# Step 6: Testing the Model\n",
    "\n",
    "Once satisfied with the model's performance, we can use it to classify new messages and determine if they are spam or ham."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "86a27d19-6975-4bdd-b4c9-16c1d4741f05",
   "metadata": {},
   "outputs": [],
   "source": [
    "message = vectorizer.transform([\"Today's Offer! Claim ur £150 worth of discount vouchers! Text YES to 85023 now! SavaMob, member offers mobile! T Cs 08717898035. £3.00 Sub. 16 . Unsub reply X \"])\n",
    "prediction = clf.predict(message)\n",
    "print(\"The email is :\", prediction[0])  "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "55b5ed3d-dcdd-4e22-a7be-f1901b8f3038",
   "metadata": {},
   "source": [
    "# Great Work!\n",
    "\n",
    "We have just completed one full cycle of a machine learning pipeline, with deployment as the only step left out."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8a14bc7f-e802-4b7c-947e-4cdcb70537f3",
   "metadata": {},
   "source": [
    "## McSkiddy has provided us with the latest email messages and asked us to test our model on these emails.\n",
    "### Let's run our model on some test emails and see if it can detect spam emails."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "bafd3281-3dc8-4440-b8b8-926aeb92fc72",
   "metadata": {},
   "source": [
    "#### Update the following code to include the test emails file and run the trained model against these emails."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e6a5d601-86bd-43c4-a6ae-52599c97b8ba",
   "metadata": {},
   "outputs": [],
   "source": [
    "test_data = pd.read_csv(\"___________\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "bb80044b-46b1-4af0-8fb5-d0c61937362b",
   "metadata": {},
   "outputs": [],
   "source": [
    "print(test_data.head())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "4096fa12-318d-4947-8f0a-578c291986d1",
   "metadata": {},
   "outputs": [],
   "source": [
    "X_new = vectorizer.transform(test_data['Messages'])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "4238eaa3-a7b1-4c4b-8e0a-d530a50c299c",
   "metadata": {},
   "outputs": [],
   "source": [
    "new_predictions = clf.predict(X_new)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d40eba51-e9ae-4f4b-98e1-d1329f4c0832",
   "metadata": {},
   "outputs": [],
   "source": [
    "results_df = pd.DataFrame({'Messages': test_data['Messages'], 'Prediction': new_predictions})\n",
    "print(results_df)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "67c623b9-c378-4432-b7d3-2b2e8b31b217",
   "metadata": {},
   "source": [
    "### Once you have run the model against the test emails, move on to the task to complete the answers."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "851e16fe-40bf-487c-a75d-0c774f87d7c7",
   "metadata": {},
   "source": [
    "# Conclusion\n",
    "That's it for this task. From a practical point of view, we have to consider the following points to ensure the effectiveness and reliability of the model:\n",
    "\n",
    "- Continuously monitor the model's performance on a test dataset or in a real-world environment.\n",
    "- Collect feedback from end-users regarding false positives.\n",
    "- Use this feedback to understand the model's weaknesses and areas for improvement.\n",
    "- Deploy the model into production."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.10"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}