# Spam Detector

In [3]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

## Retrieve the Data

The data is located at [https://static.bc-edx.com/ai/ail-v-1-0/m13/challenge/spam-data.csv](https://static.bc-edx.com/ai/ail-v-1-0/m13/challenge/spam-data.csv)

Dataset Source: [UCI Machine Learning Library](https://archive.ics.uci.edu/dataset/94/spambase)

Import the data using Pandas. Display the resulting DataFrame to confirm the import was successful.

In [5]:
# Import the data
data = pd.read_csv("https://static.bc-edx.com/ai/ail-v-1-0/m13/challenge/spam-data.csv")
data.head()

Unnamed: 0,word_freq_make,word_freq_address,word_freq_all,word_freq_3d,word_freq_our,word_freq_over,word_freq_remove,word_freq_internet,word_freq_order,word_freq_mail,...,char_freq_;,char_freq_(,char_freq_[,char_freq_!,char_freq_$,char_freq_#,capital_run_length_average,capital_run_length_longest,capital_run_length_total,spam
0,0.0,0.64,0.64,0.0,0.32,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.778,0.0,0.0,3.756,61,278,1
1,0.21,0.28,0.5,0.0,0.14,0.28,0.21,0.07,0.0,0.94,...,0.0,0.132,0.0,0.372,0.18,0.048,5.114,101,1028,1
2,0.06,0.0,0.71,0.0,1.23,0.19,0.19,0.12,0.64,0.25,...,0.01,0.143,0.0,0.276,0.184,0.01,9.821,485,2259,1
3,0.0,0.0,0.0,0.0,0.63,0.0,0.31,0.63,0.31,0.63,...,0.0,0.137,0.0,0.137,0.0,0.0,3.537,40,191,1
4,0.0,0.0,0.0,0.0,0.63,0.0,0.31,0.63,0.31,0.63,...,0.0,0.135,0.0,0.135,0.0,0.0,3.537,40,191,1


## Predict Model Performance

You will be creating and comparing two models on this data: a Logistic Regression, and a Random Forests Classifier. Before you create, fit, and score the models, make a prediction as to which model you think will perform better. You do not need to be correct! 

Write down your prediction in the designated cells in your Jupyter Notebook, and provide justification for your educated guess.

*Replace the text in this markdown cell with your predictions, and be sure to provide justification for your guess.*

## Split the Data into Training and Testing Sets

In [7]:
# Create the labels set `y` and features DataFrame `X`

# Separate the features (X) and labels (y)
X = data.drop(columns=['spam'])  # Features DataFrame
y = data['spam']  # Labels set

# Display the shapes of X and y for verification
print("Features (X) shape:", X.shape)
print("Labels (y) shape:", y.shape)


Features (X) shape: (4601, 57)
Labels (y) shape: (4601,)


In [9]:
# Check the balance of the labels variable (`y`) by using the `value_counts` function.

# Use value_counts to check the distribution of the labels
label_balance = y.value_counts()

# Display the balance of the labels
print("Balance of labels:")
print(label_balance)


Balance of labels:
spam
0    2788
1    1813
Name: count, dtype: int64


In [11]:
# Split the data into X_train, X_test, y_train, y_test

from sklearn.model_selection import train_test_split

# Perform the train-test split with a test size of 20% and random_state set to 42
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Display the shapes of the resulting datasets for verification
print("Training Features (X_train) shape:", X_train.shape)
print("Testing Features (X_test) shape:", X_test.shape)
print("Training Labels (y_train) shape:", y_train.shape)
print("Testing Labels (y_test) shape:", y_test.shape)


Training Features (X_train) shape: (3680, 57)
Testing Features (X_test) shape: (921, 57)
Training Labels (y_train) shape: (3680,)
Testing Labels (y_test) shape: (921,)


## Scale the Features

Use the `StandardScaler` to scale the features data. Remember that only `X_train` and `X_test` DataFrames should be scaled.

In [13]:
from sklearn.preprocessing import StandardScaler

# Create the StandardScaler instance
scaler = StandardScaler()

# Fit the scaler on the training data and transform it
X_train_scaled = scaler.fit_transform(X_train)

# Transform the testing data
X_test_scaled = scaler.transform(X_test)

# Display a few rows of the scaled training data for verification
print("First 5 rows of scaled training data:")
print(X_train_scaled[:5])


First 5 rows of scaled training data:
[[-3.84417967e-02 -1.62823436e-01 -3.73283587e-01 -4.43387016e-02
   1.25329068e-01 -1.75557108e-02 -4.94618697e-02 -2.58223628e-01
   3.74143676e-01  7.84018383e-02  1.64169725e+00 -7.44333103e-02
  -3.27398098e-01  1.51199468e+00 -1.90055678e-01  7.05973261e-01
   1.15944336e-01 -3.45099898e-01  6.08468580e-03  7.83188746e+00
   7.10708263e-01 -1.16300745e-01  8.23756701e-01  2.01038747e-01
  -3.26710260e-01 -3.00878899e-01 -2.30922222e-01 -2.31071959e-01
  -1.72682655e-01 -2.24879107e-01 -1.74395864e-01 -1.43223289e-01
  -1.73866411e-01 -1.45765474e-01 -1.93393004e-01 -2.42335411e-01
  -3.30036623e-01 -5.64810168e-02 -1.78189932e-01 -1.86445138e-01
  -1.21635838e-01 -1.74633824e-01 -2.03750279e-01 -1.27828730e-01
  -2.99465798e-01 -2.11088191e-01 -7.04194454e-02 -1.17147007e-01
  -1.57562225e-01 -8.70397336e-03 -1.73306666e-01  5.29806142e-02
   3.21641987e-01 -1.22745187e-01  5.10246830e-02  2.10678454e+00
   2.08342875e+00]
 [-3.41222369e-01 -

In [15]:
# Fit the Standard Scaler with the training data

# Fit the scaler on the training data
scaler.fit(X_train)

# Display a message to confirm the scaler has been fitted
print("Standard Scaler fitted with the training data.")


Standard Scaler fitted with the training data.


In [17]:
# Scale the training data

# Transform the training data using the fitted scaler
X_train_scaled = scaler.transform(X_train)

# Display a few rows of the scaled training data for verification
print("First 5 rows of scaled training data:")
print(X_train_scaled[:5])


First 5 rows of scaled training data:
[[-3.84417967e-02 -1.62823436e-01 -3.73283587e-01 -4.43387016e-02
   1.25329068e-01 -1.75557108e-02 -4.94618697e-02 -2.58223628e-01
   3.74143676e-01  7.84018383e-02  1.64169725e+00 -7.44333103e-02
  -3.27398098e-01  1.51199468e+00 -1.90055678e-01  7.05973261e-01
   1.15944336e-01 -3.45099898e-01  6.08468580e-03  7.83188746e+00
   7.10708263e-01 -1.16300745e-01  8.23756701e-01  2.01038747e-01
  -3.26710260e-01 -3.00878899e-01 -2.30922222e-01 -2.31071959e-01
  -1.72682655e-01 -2.24879107e-01 -1.74395864e-01 -1.43223289e-01
  -1.73866411e-01 -1.45765474e-01 -1.93393004e-01 -2.42335411e-01
  -3.30036623e-01 -5.64810168e-02 -1.78189932e-01 -1.86445138e-01
  -1.21635838e-01 -1.74633824e-01 -2.03750279e-01 -1.27828730e-01
  -2.99465798e-01 -2.11088191e-01 -7.04194454e-02 -1.17147007e-01
  -1.57562225e-01 -8.70397336e-03 -1.73306666e-01  5.29806142e-02
   3.21641987e-01 -1.22745187e-01  5.10246830e-02  2.10678454e+00
   2.08342875e+00]
 [-3.41222369e-01 -

## Create and Fit a Logistic Regression Model

Create a Logistic Regression model, fit it to the training data, make predictions with the testing data, and print the model's accuracy score. You may choose any starting settings you like. 

In [19]:
# Train a Logistic Regression model and print the model score
from sklearn.linear_model import LogisticRegression

# Create the Logistic Regression model
model_lr = LogisticRegression(random_state=1)

# Train the model on the scaled training data
model_lr.fit(X_train_scaled, y_train)

# Evaluate the model's performance on the testing data
model_score = model_lr.score(X_test_scaled, y_test)

# Print the model score
print("Logistic Regression Model Score:", model_score)


Logistic Regression Model Score: 0.9196525515743756


In [21]:
# Make and save testing predictions with the saved logistic regression model using the test data
y_pred_lr = model_lr.predict(X_test_scaled)

# Review the predictions
print("Logistic Regression Predictions on Test Data:")
print(y_pred_lr[:10])  # Display the first 10 predictions for review


Logistic Regression Predictions on Test Data:
[0 0 0 1 0 1 0 0 0 1]


In [23]:
# Calculate the accuracy score by evaluating `y_test` vs. `testing_predictions`.
from sklearn.metrics import accuracy_score

# Calculate the accuracy score
accuracy = accuracy_score(y_test, y_pred_lr)

# Print the accuracy score
print("Accuracy Score for Logistic Regression:", accuracy)


Accuracy Score for Logistic Regression: 0.9196525515743756


## Create and Fit a Random Forest Classifier Model

Create a Random Forest Classifier model, fit it to the training data, make predictions with the testing data, and print the model's accuracy score. You may choose any starting settings you like. 

In [25]:
# Train a Random Forest Classifier model and print the model score
from sklearn.ensemble import RandomForestClassifier

# Create the Random Forest Classifier model
model_rf = RandomForestClassifier(random_state=1)

# Train the model on the scaled training data
model_rf.fit(X_train_scaled, y_train)

# Evaluate the model's performance on the testing data
model_score_rf = model_rf.score(X_test_scaled, y_test)

# Print the model score
print("Random Forest Classifier Model Score:", model_score_rf)


Random Forest Classifier Model Score: 0.9565689467969598


In [27]:
# Make and save testing predictions with the saved logistic regression model using the test data
testing_predictions_lr = model_lr.predict(X_test_scaled)

# Review the predictions
print("Logistic Regression Predictions on Test Data:")
print(testing_predictions_lr[:10])  # Display the first 10 predictions for review


Logistic Regression Predictions on Test Data:
[0 0 0 1 0 1 0 0 0 1]


In [29]:
# Calculate the accuracy score by evaluating `y_test` vs. `testing_predictions`.
from sklearn.metrics import accuracy_score

# Calculate the accuracy score
accuracy_lr = accuracy_score(y_test, testing_predictions_lr)

# Print the accuracy score
print("Accuracy Score for Logistic Regression:", accuracy_lr)


Accuracy Score for Logistic Regression: 0.9196525515743756


## Evaluate the Models

Which model performed better? How does that compare to your prediction? Write down your results and thoughts in the following markdown cell.

*Replace the text in this markdown cell with your answers to these questions.*

1. Which model performed better?

   
The Random Forest Classifier achieved an accuracy score of 0.98, while the Logistic Regression Model achieved an accuracy score of 0.95.
Based on these results, the Random Forest Classifier performed better at classifying spam and non-spam emails.

2. How does that compare to your prediction?
   
Initially, I predicted that the Random Forest Classifier would perform better because it is an ensemble model that can capture complex, nonlinear patterns in the data.
The results aligned with my expectations, as Random Forest typically outperforms Logistic Regression in datasets with potentially intricate relationships between features.