# Spam Detector

In [13]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

## Retrieve the Data

The data is located at [https://static.bc-edx.com/ai/ail-v-1-0/m13/challenge/spam-data.csv](https://static.bc-edx.com/ai/ail-v-1-0/m13/challenge/spam-data.csv)

Dataset Source: [UCI Machine Learning Repository](https://archive.ics.uci.edu/dataset/94/spambase)

Import the data using Pandas. Display the resulting DataFrame to confirm the import was successful.

In [14]:
# Import the data
data = pd.read_csv("https://static.bc-edx.com/ai/ail-v-1-0/m13/challenge/spam-data.csv")
data.head()

| | word_freq_make | word_freq_address | word_freq_all | word_freq_3d | word_freq_our | word_freq_over | word_freq_remove | word_freq_internet | word_freq_order | word_freq_mail | ... | char_freq_; | char_freq_( | char_freq_[ | char_freq_! | char_freq_$ | char_freq_# | capital_run_length_average | capital_run_length_longest | capital_run_length_total | spam |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.64 | 0.64 | 0.0 | 0.32 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.778 | 0.0 | 0.0 | 3.756 | 61 | 278 | 1 |
| 1 | 0.21 | 0.28 | 0.5 | 0.0 | 0.14 | 0.28 | 0.21 | 0.07 | 0.0 | 0.94 | ... | 0.0 | 0.132 | 0.0 | 0.372 | 0.18 | 0.048 | 5.114 | 101 | 1028 | 1 |
| 2 | 0.06 | 0.0 | 0.71 | 0.0 | 1.23 | 0.19 | 0.19 | 0.12 | 0.64 | 0.25 | ... | 0.01 | 0.143 | 0.0 | 0.276 | 0.184 | 0.01 | 9.821 | 485 | 2259 | 1 |
| 3 | 0.0 | 0.0 | 0.0 | 0.0 | 0.63 | 0.0 | 0.31 | 0.63 | 0.31 | 0.63 | ... | 0.0 | 0.137 | 0.0 | 0.137 | 0.0 | 0.0 | 3.537 | 40 | 191 | 1 |
| 4 | 0.0 | 0.0 | 0.0 | 0.0 | 0.63 | 0.0 | 0.31 | 0.63 | 0.31 | 0.63 | ... | 0.0 | 0.135 | 0.0 | 0.135 | 0.0 | 0.0 | 3.537 | 40 | 191 | 1 |


## Predict Model Performance

Prediction:
I expect the Random Forest classifier to perform better than the Logistic Regression model in detecting spam emails.

Justification:
Random Forest is an ensemble learning method that combines many decision trees, which lets it capture non-linear patterns and interactions between features. The spam dataset has 57 numeric features (word frequencies, character frequencies, and capital-letter run lengths), and Random Forests can work with that many features without extensive feature engineering. Averaging over many trees also makes them less prone to overfitting than a single decision tree.

On the other hand, Logistic Regression is a linear model that assumes a linear relationship between the features and the log-odds of the target variable. While it is efficient and interpretable, it may not capture interactions between features as effectively as Random Forests. Logistic Regression is also sensitive to feature scale, which is why the features are standardized before fitting, and it relies on regularization to stay stable as the number of features grows.

Therefore, considering the dataset's complexity and the ability of Random Forests to model non-linear relationships, I anticipate that the Random Forest classifier will achieve higher accuracy and better overall performance in distinguishing between spam and legitimate emails.
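
The contrast between a linear model and a tree ensemble can be illustrated on synthetic data. The sketch below is not part of the spam analysis: it assumes a hypothetical two-feature dataset with an XOR-style label (the names `X_demo`, `y_demo`, and the sample size are illustrative choices), where no single straight line separates the classes.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Hypothetical synthetic data: the label depends on the interaction of two features
rng = np.random.default_rng(0)
X_demo = rng.uniform(-1, 1, size=(1000, 2))
y_demo = (X_demo[:, 0] * X_demo[:, 1] > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0, stratify=y_demo
)

# A single linear decision boundary cannot separate an XOR-style pattern
lr_demo = LogisticRegression().fit(X_tr, y_tr)
# An ensemble of trees can split on each feature in turn
rf_demo = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

lr_demo_acc = lr_demo.score(X_te, y_te)
rf_demo_acc = rf_demo.score(X_te, y_te)
print(f"Logistic Regression accuracy on XOR-style data: {lr_demo_acc:.3f}")
print(f"Random Forest accuracy on XOR-style data: {rf_demo_acc:.3f}")
```

This only shows that such a gap is possible. Whether the spam features contain interactions of this kind is an empirical question that the test-set comparison has to answer.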


## Split the Data into Training and Testing Sets

In [41]:
# Create the labels set `y` and features DataFrame `X`
y = data['spam']
X = data.drop('spam', axis=1)

# Display the first five rows of the features and labels to confirm
print("Features (X):")
print(X.head())

print("\nLabels (y):")
print(y.head())

Features (X):
   word_freq_make  word_freq_address  word_freq_all  word_freq_3d  \
0            0.00               0.64           0.64           0.0   
1            0.21               0.28           0.50           0.0   
2            0.06               0.00           0.71           0.0   
3            0.00               0.00           0.00           0.0   
4            0.00               0.00           0.00           0.0   

   word_freq_our  word_freq_over  word_freq_remove  word_freq_internet  \
0           0.32            0.00              0.00                0.00   
1           0.14            0.28              0.21                0.07   
2           1.23            0.19              0.19                0.12   
3           0.63            0.00              0.31                0.63   
4           0.63            0.00              0.31                0.63   

   word_freq_order  word_freq_mail  ...  word_freq_conference  char_freq_;  \
0             0.00            0.00  ...         

In [42]:
# Check the balance of the labels variable (`y`) by using the `value_counts` function.
label_counts = y.value_counts()
print("Label Distribution:\n", label_counts)

Label Distribution:
 spam
0    2788
1    1813
Name: count, dtype: int64
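
The label counts give a useful reference point for the accuracy scores later on. As a small sketch using only the two counts printed above (the variable names are illustrative), the class shares and the accuracy of a trivial model that always predicts the most common label can be computed directly:

```python
# Label counts taken from the value_counts output above
counts = {0: 2788, 1: 1813}
total = sum(counts.values())

# Share of each class in the dataset
proportions = {label: count / total for label, count in counts.items()}

# Accuracy of a trivial model that always predicts the most common label
baseline_accuracy = max(counts.values()) / total

print(f"Total emails: {total}")
for label, share in proportions.items():
    print(f"Label {label}: {share:.4f}")
print(f"Majority-class baseline accuracy: {baseline_accuracy:.4f}")
```

Any model worth keeping should score clearly above this baseline on the test set.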


In [43]:
# Split the data into X_train, X_test, y_train, y_test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(f"Training Features Shape: {X_train.shape}")
print(f"Testing Features Shape: {X_test.shape}")
print(f"Training Labels Shape: {y_train.shape}")
print(f"Testing Labels Shape: {y_test.shape}")

Training Features Shape: (3680, 57)
Testing Features Shape: (921, 57)
Training Labels Shape: (3680,)
Testing Labels Shape: (921,)
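
The reported shapes can be checked by hand. The sketch below assumes that the test set size is rounded up and the training set receives the remaining rows, which is consistent with the shapes printed above. The row counts come from the label distribution, and the variable names are illustrative.

```python
import math

n_samples = 2788 + 1813   # total rows, from the label counts
test_size = 0.2

# Assumption: the test set size is rounded up and the training set gets the rest
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test
print(f"Train rows: {n_train}, test rows: {n_test}")

# With stratify=y, each split should keep roughly the overall spam share
spam_share = 1813 / n_samples
print(f"Overall spam share: {spam_share:.4f}")
print(f"Expected spam rows in the test set: about {spam_share * n_test:.0f}")
```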


## Scale the Features

Use the `StandardScaler` to scale the features data. Remember that only `X_train` and `X_test` DataFrames should be scaled.
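
Standardization replaces each value x with (x - mean) / std, where the mean and the population standard deviation are learned from the training rows only and then reused for the testing rows. The following is a minimal pure-Python sketch of that idea for a single feature. The values are hypothetical and the code is an illustration, not the `StandardScaler` implementation.

```python
from statistics import fmean, pstdev

# Hypothetical values of a single feature, split into training and testing rows
train_values = [0.0, 0.5, 1.0, 1.5, 2.0]
test_values = [1.0, 3.0]

# "Fit": learn the mean and population standard deviation from the training rows only
mean = fmean(train_values)
std = pstdev(train_values)

# "Transform": apply the same training statistics to both sets
train_scaled = [(v - mean) / std for v in train_values]
test_scaled = [(v - mean) / std for v in test_values]

print(train_scaled)
print(test_scaled)
```

Reusing the training statistics for the testing rows keeps information about the test set out of the preprocessing step.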

In [44]:
from sklearn.preprocessing import StandardScaler

# Create the StandardScaler instance
scaler = StandardScaler()

In [45]:
# Fit the Standard Scaler with the training data
scaler.fit(X_train)

In [46]:
# Scale the training and testing data with the scaler fitted on the training data
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Wrap the scaled arrays in DataFrames to keep the column names and row indices
X_train_scaled = pd.DataFrame(X_train_scaled, columns=X.columns, index=X_train.index)
X_test_scaled = pd.DataFrame(X_test_scaled, columns=X.columns, index=X_test.index)

print("Scaled Training Features (X_train_scaled):")
print(X_train_scaled.head())

Scaled Training Features (X_train_scaled):
      word_freq_make  word_freq_address  word_freq_all  word_freq_3d  \
2940       -0.183216          -0.166310       0.355314      -0.04948   
1303        0.194418           0.031873       1.887235      -0.04948   
3468       -0.340564          -0.166310      -0.551745      -0.04948   
3181       -0.340564          -0.166310      -0.551745      -0.04948   
794        -0.340564           0.260547      -0.551745      -0.04948   

      word_freq_our  word_freq_over  word_freq_remove  word_freq_internet  \
2940      -0.248384        0.003308         -0.288445           -0.289119   
1303       0.170213        1.742532          0.782699            0.454448   
3468      -0.472632       -0.344536         -0.288445           -0.289119   
3181      -0.472632       -0.344536         -0.288445           -0.289119   
794        0.364562       -0.344536         -0.288445           -0.289119   

      word_freq_order  word_freq_mail  ...  word_freq_confere

## Create and Fit a Logistic Regression Model

Create a Logistic Regression model, fit it to the training data, make predictions with the testing data, and print the model's accuracy score. You may choose any starting settings you like. 

In [47]:
# Train a Logistic Regression model and print the model score
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Initialize the Logistic Regression model with a random state for reproducibility
log_reg = LogisticRegression(random_state=1, max_iter=1000)

# Fit the model on the scaled training data
log_reg.fit(X_train_scaled, y_train)

# Predict on the scaled testing data
y_pred = log_reg.predict(X_test_scaled)

# Calculate the accuracy score
accuracy = accuracy_score(y_test, y_pred)

# Print the accuracy score
print(f"Logistic Regression Accuracy: {accuracy:.4f}")

Logistic Regression Accuracy: 0.9294


In [48]:
# Save the fitted logistic regression model, then make and save testing predictions with it
import pickle

# Save the fitted model to a file so that it can be reloaded below
with open('logistic_regression_model.pkl', 'wb') as file:
    pickle.dump(log_reg, file)

# Load the saved Logistic Regression model from the file using pickle
with open('logistic_regression_model.pkl', 'rb') as file:
    loaded_log_reg = pickle.load(file)

# Make predictions on the scaled testing data using the loaded model
y_pred_loaded = loaded_log_reg.predict(X_test_scaled)


# Create a DataFrame to hold actual and predicted labels
predictions_df = pd.DataFrame({
    'Actual': y_test,
    'Predicted': y_pred_loaded
})

# Save the DataFrame to a CSV file
predictions_df.to_csv('logistic_regression_predictions.csv', index=False)


# Review the predictions
correct_predictions = predictions_df[predictions_df['Actual'] == predictions_df['Predicted']].shape[0]
incorrect_predictions = predictions_df[predictions_df['Actual'] != predictions_df['Predicted']].shape[0]

print(f"\nTotal Correct Predictions: {correct_predictions}")
print(f"Total Incorrect Predictions: {incorrect_predictions}")


Total Correct Predictions: 856
Total Incorrect Predictions: 65


In [49]:
# Calculate the accuracy score by evaluating `y_test` vs. `y_pred_loaded`.
accuracy = accuracy_score(y_test, y_pred_loaded)
print(f"Logistic Regression Accuracy: {accuracy:.4f}")

Logistic Regression Accuracy: 0.9294
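
The accuracy score agrees with the prediction counts reported earlier, since accuracy is the number of correct predictions divided by the total number of predictions. A short check using only those two counts:

```python
correct_predictions = 856     # from the review of the predictions
incorrect_predictions = 65

# Accuracy is the fraction of test rows whose predicted label matches the actual label
accuracy_from_counts = correct_predictions / (correct_predictions + incorrect_predictions)
print(f"Accuracy from counts: {accuracy_from_counts:.4f}")
```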


## Create and Fit a Random Forest Classifier Model

Create a Random Forest Classifier model, fit it to the training data, make predictions with the testing data, and print the model's accuracy score. You may choose any starting settings you like. 
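
As background for the model below, a random forest in scikit-learn predicts the class with the highest mean probability estimate across its trees. The sketch shows that averaging step in plain Python for one email. The three per-tree probability estimates are hypothetical, and the code illustrates the idea rather than the library's implementation.

```python
# Hypothetical [P(not spam), P(spam)] estimates from three trees for one email
tree_probabilities = [
    [0.2, 0.8],
    [0.6, 0.4],
    [0.1, 0.9],
]

n_trees = len(tree_probabilities)
n_classes = len(tree_probabilities[0])

# Average the probability estimates class by class across the trees
mean_probabilities = [
    sum(tree[c] for tree in tree_probabilities) / n_trees
    for c in range(n_classes)
]

# The forest predicts the class with the highest mean probability
predicted_class = max(range(n_classes), key=lambda c: mean_probabilities[c])

print(mean_probabilities)
print(predicted_class)
```

One tree disagreeing (the second one here) does not change the outcome, which is part of why an ensemble tends to be more stable than a single tree.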

In [50]:
# Train a Random Forest Classifier model and print the model score
from sklearn.ensemble import RandomForestClassifier
# Initialize the Random Forest Classifier
rf_clf = RandomForestClassifier(
    n_estimators=100,      # Number of trees in the forest
    max_depth=None,        # Maximum depth of the tree
    random_state=42,       # Seed for reproducibility
    n_jobs=-1               # Use all available cores
)

# Fit the model to the training data
rf_clf.fit(X_train_scaled, y_train)

In [51]:
# Make testing predictions with the fitted Random Forest model using the test data
rf_predictions = rf_clf.predict(X_test_scaled)

# Review the first five predictions to verify
print("First Five Random Forest Predictions:")
print(rf_predictions[:5])



First Five Random Forest Predictions:
[1 1 0 1 0]


In [52]:
# Calculate the accuracy score by evaluating `y_test` vs. `rf_predictions`.
rf_accuracy = accuracy_score(y_test, rf_predictions)

# Print the accuracy score
print(f"Random Forest Classifier Accuracy: {rf_accuracy:.4f}")

Random Forest Classifier Accuracy: 0.9446


## Evaluate the Models

Random Forest Classifier outperforms Logistic Regression on the test set: accuracy of 0.9446 vs. 0.9294, a difference of about 1.5 percentage points. Accuracy is the only metric computed in this notebook, so the comparison is limited to it.

My expectation that the Random Forest Classifier would perform better aligns with the results. Random Forests are ensemble models capable of capturing complex patterns and interactions in the data, which often leads to superior performance compared to simpler models like Logistic Regression.
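
Metrics beyond accuracy can show how each model treats the two classes, which matters for spam filtering because a false positive hides a legitimate email. The sketch below shows how precision, recall, F1-score, and ROC AUC can be computed with `sklearn.metrics`. The labels, predictions, and scores are hypothetical and do not come from the spam dataset.

```python
from sklearn.metrics import (
    accuracy_score,
    f1_score,
    precision_score,
    recall_score,
    roc_auc_score,
)

# Hypothetical labels, predictions, and spam probabilities for eight emails
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_hat = [1, 0, 0, 1, 0, 1, 1, 0]
y_score = [0.9, 0.2, 0.4, 0.8, 0.1, 0.6, 0.7, 0.3]

metrics = {
    "accuracy": accuracy_score(y_true, y_hat),
    "precision": precision_score(y_true, y_hat),   # of predicted spam, how much is spam
    "recall": recall_score(y_true, y_hat),         # of actual spam, how much is caught
    "f1": f1_score(y_true, y_hat),
    "roc_auc": roc_auc_score(y_true, y_score),     # uses scores, not hard labels
}
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

For the models in this notebook, the same calls could be applied to `y_test` with each model's predictions, and the scores for ROC AUC could be taken from `predict_proba(X_test_scaled)[:, 1]`.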
