# Spam Detector

In [48]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

## Retrieve the Data

The data is located at [https://static.bc-edx.com/ai/ail-v-1-0/m13/challenge/spam-data.csv](https://static.bc-edx.com/ai/ail-v-1-0/m13/challenge/spam-data.csv)

Dataset Source: [UCI Machine Learning Repository](https://archive.ics.uci.edu/dataset/94/spambase)

Import the data using Pandas. Display the resulting DataFrame to confirm the import was successful.

In [49]:
# Import the data
data = pd.read_csv("https://static.bc-edx.com/ai/ail-v-1-0/m13/challenge/spam-data.csv")
data.head()

Unnamed: 0,word_freq_make,word_freq_address,word_freq_all,word_freq_3d,word_freq_our,word_freq_over,word_freq_remove,word_freq_internet,word_freq_order,word_freq_mail,...,char_freq_;,char_freq_(,char_freq_[,char_freq_!,char_freq_$,char_freq_#,capital_run_length_average,capital_run_length_longest,capital_run_length_total,spam
0,0.0,0.64,0.64,0.0,0.32,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.778,0.0,0.0,3.756,61,278,1
1,0.21,0.28,0.5,0.0,0.14,0.28,0.21,0.07,0.0,0.94,...,0.0,0.132,0.0,0.372,0.18,0.048,5.114,101,1028,1
2,0.06,0.0,0.71,0.0,1.23,0.19,0.19,0.12,0.64,0.25,...,0.01,0.143,0.0,0.276,0.184,0.01,9.821,485,2259,1
3,0.0,0.0,0.0,0.0,0.63,0.0,0.31,0.63,0.31,0.63,...,0.0,0.137,0.0,0.137,0.0,0.0,3.537,40,191,1
4,0.0,0.0,0.0,0.0,0.63,0.0,0.31,0.63,0.31,0.63,...,0.0,0.135,0.0,0.135,0.0,0.0,3.537,40,191,1


## Predict Model Performance

You will be creating and comparing two models on this data: a Logistic Regression model and a Random Forest Classifier. Before you create, fit, and score the models, make a prediction as to which model you think will perform better. You do not need to be correct!

Write down your prediction in the designated cells in your Jupyter Notebook, and provide justification for your educated guess.

I think the Random Forest model will perform better than the Logistic Regression model:
1. Random Forest can capture non-linear relationships and interactions between features derived from text. Logistic Regression assumes a linear relationship between the features and the log-odds of the target, so it may miss such patterns.
2. The dataset has a fairly large number of numeric features, such as word frequencies and character frequencies. Random Forest considers a random subset of features at each split, so it tends to cope well when some features are uninformative, whereas Logistic Regression can be affected by noisy or correlated features unless it is regularized.
3. Random Forest averages many trees, which reduces variance and helps mitigate overfitting on a complex dataset such as this one. Logistic Regression, being a single linear model, is more likely to underfit if the boundary between spam and non-spam is complex.
4. Random Forest needs little preprocessing because tree splits are insensitive to feature scale, whereas Logistic Regression generally benefits from scaled features. All features in this dataset are numeric, so categorical encoding is not a concern for either model.
5. Random Forest provides feature importance scores, which can help identify the most informative features in the dataset. This can be useful for feature selection and for understanding the underlying relationships in the data.

It's best to try both models and evaluate their performance using metrics like accuracy, F1-score, ROC-AUC, etc. The best model will depend on the specific dataset and problem we are trying to solve.
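
Accuracy alone can hide how a classifier treats each class, so the metrics named above can be sketched from first principles. The labels below are made-up values for illustration only, not taken from the spam dataset; in the notebook itself, `sklearn.metrics` provides `precision_score`, `recall_score`, `f1_score`, and `roc_auc_score` for the same purpose.

```python
# Hypothetical labels for illustration (1 = spam, 0 = not spam)
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 0, 0, 1, 0, 0, 1]

# Count the four confusion-matrix cells
pairs = list(zip(y_true, y_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # spam caught
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # legitimate mail flagged
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # spam missed
tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # legitimate mail passed

accuracy = (tp + tn) / len(pairs)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")
```

In this toy case accuracy is higher than recall, which shows why a spam filter is usually judged on more than one metric.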


## Split the Data into Training and Testing Sets

In [50]:
# Create the labels set `y` and features DataFrame `X`
y = data["spam"]
X = data.drop(columns="spam")

In [51]:
# Check the balance of the labels variable (`y`) by using the `value_counts` function.
y.value_counts()

spam
0    2788
1    1813
Name: count, dtype: int64
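
The counts above imply a useful reference point: the accuracy of a classifier that always predicts the majority class (not spam). The sketch below uses only the two counts printed above; the variable names are illustrative.

```python
# Label counts copied from the `value_counts` output above
not_spam_count = 2788
spam_count = 1813

total = not_spam_count + spam_count

# Accuracy of always predicting the majority class (0 = not spam)
baseline_accuracy = max(not_spam_count, spam_count) / total

print(f"Total emails: {total}")
print(f"Majority-class baseline accuracy: {baseline_accuracy:.3f}")
```

Any model worth keeping should score clearly above this baseline of roughly 0.61.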

In [52]:
y[:5]

0    1
1    1
2    1
3    1
4    1
Name: spam, dtype: int64

In [54]:
# Split the data into X_train, X_test, y_train, y_test
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=78)
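
`train_test_split` is called without `test_size` here, so scikit-learn's default of holding out 25% of the rows applies. The arithmetic below is a sketch of the resulting set sizes, assuming the 4,601 rows implied by the label counts and that the test count is rounded up, as scikit-learn does for fractional sizes.

```python
import math

n_samples = 2788 + 1813   # rows, from the label counts above
test_fraction = 0.25      # scikit-learn's default when test_size is not given

n_test = math.ceil(test_fraction * n_samples)  # test count is rounded up
n_train = n_samples - n_test                   # the remainder is used for training

print(f"train rows: {n_train}, test rows: {n_test}")
```

The split here is not stratified; passing `stratify=y` to `train_test_split` would keep the spam ratio the same in both sets.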

## Scale the Features

Use the `StandardScaler` to scale the features data. Remember that only `X_train` and `X_test` DataFrames should be scaled.

In [55]:
from sklearn.preprocessing import StandardScaler

# Create the StandardScaler instance
scaler = StandardScaler()

In [56]:
# Fit the Standard Scaler with the training data
X_scaler = scaler.fit(X_train)

In [57]:
# Scale the training and testing data with the scaler fitted on the training data
X_train_scaled = X_scaler.transform(X_train)
X_test_scaled = X_scaler.transform(X_test)
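
What `StandardScaler` does can be sketched in plain Python: it learns each column's mean and population standard deviation from the training data, then applies those same statistics to any data it transforms. The single-column values below are hypothetical.

```python
from statistics import fmean, pstdev

# Hypothetical single feature column, split into train and test
train_col = [0.0, 0.5, 1.0, 1.5, 2.0]
test_col = [1.0, 3.0]

# "fit": learn the statistics from the training data only
mean = fmean(train_col)
std = pstdev(train_col)  # population standard deviation (ddof=0), as StandardScaler uses

# "transform": apply the training statistics to both sets
train_scaled = [(x - mean) / std for x in train_col]
test_scaled = [(x - mean) / std for x in test_col]

print(train_scaled)
print(test_scaled)
```

Because the test column is scaled with the training statistics, its scaled values need not have mean 0 or standard deviation 1; that is expected and avoids leaking test information into the model.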

## Create and Fit a Logistic Regression Model

Create a Logistic Regression model, fit it to the training data, make predictions with the testing data, and print the model's accuracy score. You may choose any starting settings you like. 

In [58]:
# Import the LogisticRegression class
from sklearn.linear_model import LogisticRegression


In [59]:
# Create the Logistic Regression model
logistic_regression_model = LogisticRegression(random_state=1)

# Fit the model
logistic_regression_model.fit(X_train_scaled, y_train)


# Apply the fitted model to the `test` dataset
testing_predictions = logistic_regression_model.predict(X_test_scaled)

# Review the predictions
testing_predictions

array([1, 0, 1, ..., 0, 0, 0], dtype=int64)

In [60]:
# Score the model
print(f"Training Data Score: {logistic_regression_model.score(X_train_scaled, y_train)}")
print(f"Testing Data Score: {logistic_regression_model.score(X_test_scaled, y_test)}")

Training Data Score: 0.9272463768115942
Testing Data Score: 0.9278887923544744
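
One way to keep the scaling step and the model together is scikit-learn's `make_pipeline`, which fits the scaler on the training data only and reuses it at prediction time. This is an optional sketch, not the approach the notebook uses; the data is synthetic (generated with `make_classification`) so the block runs on its own, and the score it prints will not match the scores above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the spam features (assumption: not the real dataset)
X_demo, y_demo = make_classification(n_samples=500, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=78)

# The pipeline scales, then classifies; `fit` learns the scaling from X_tr only
pipeline = make_pipeline(StandardScaler(), LogisticRegression(random_state=1))
pipeline.fit(X_tr, y_tr)

demo_score = pipeline.score(X_te, y_te)
print(f"Pipeline testing score on synthetic data: {demo_score:.3f}")
```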


In [61]:
# Calculate the accuracy score by evaluating `y_test` vs. `testing_predictions`.
# Import the accuracy_score function
from sklearn.metrics import accuracy_score

# Calculate the model's accuracy on the test dataset
lr_acc_score = accuracy_score(y_test, testing_predictions)


In [62]:
# Display results
print(f"Accuracy Score of Logistic Regression Model: {lr_acc_score}")

Accuracy Score of Logistic Regression Model: 0.9278887923544744


## Create and Fit a Random Forest Classifier Model

Create a Random Forest Classifier model, fit it to the training data, make predictions with the testing data, and print the model's accuracy score. You may choose any starting settings you like. 

In [63]:
# Import the RandomForestClassifier class
from sklearn.ensemble import RandomForestClassifier


In [64]:
# Create and train the Random Forest model
rf_model = RandomForestClassifier(random_state=1, n_estimators=500)
rf_model = rf_model.fit(X_train_scaled, y_train)

# Make predictions using the testing data, then review them
rf_test_predictions = rf_model.predict(X_test_scaled)
rf_test_predictions

array([1, 0, 1, ..., 0, 0, 0], dtype=int64)
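
A fitted `RandomForestClassifier` exposes `feature_importances_`, one impurity-based score per feature. The sketch below shows how such scores could be ranked; it uses synthetic data and hypothetical feature names so that it is self-contained, so the ranking says nothing about the spam features themselves. In the notebook, `X.columns` would supply the real names.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data and made-up feature names (assumptions for illustration)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=3, random_state=0
)
feature_names = [f"feature_{i}" for i in range(X_demo.shape[1])]

forest = RandomForestClassifier(n_estimators=100, random_state=1)
forest.fit(X_demo, y_demo)

# Pair each importance with its feature name and sort from most to least important
ranked = sorted(zip(forest.feature_importances_, feature_names), reverse=True)
for importance, name in ranked[:3]:
    print(f"{name}: {importance:.3f}")
```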

In [65]:
# Evaluate the model
print(f'Training Score: {rf_model.score(X_train_scaled, y_train)}')
print(f'Testing Score: {rf_model.score(X_test_scaled, y_test)}')


Training Score: 0.9994202898550725
Testing Score: 0.9574283231972198
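
Comparing training and testing scores gives a rough sense of how much each model overfits. The sketch below simply subtracts the scores printed in this notebook; the variable names are illustrative.

```python
# Scores copied from the notebook outputs
lr_train, lr_test = 0.9272463768115942, 0.9278887923544744
rf_train, rf_test = 0.9994202898550725, 0.9574283231972198

# A large positive gap means the model fits the training data better than unseen data
lr_gap = lr_train - lr_test
rf_gap = rf_train - rf_test

print(f"Logistic Regression gap: {lr_gap:+.4f}")
print(f"Random Forest gap:       {rf_gap:+.4f}")
```

The Random Forest fits the training set almost perfectly yet still scores higher on the test set, so the gap alone does not make it the worse model.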


In [66]:
# Calculate the accuracy score by evaluating `y_test` vs. `rf_test_predictions`.
acc_score = accuracy_score(y_test, rf_test_predictions)

In [67]:
# Display results
print(f"Accuracy Score of Random Forest Model: {acc_score}")

Accuracy Score of Random Forest Model: 0.9574283231972198


## Evaluate the Models

Which model performed better? How does that compare to your prediction? Write down your results and thoughts in the following markdown cell.

I had predicted that the Random Forest model would perform better than the Logistic Regression model, and the results agree.
The Random Forest model reached a testing accuracy of 0.9574, meaning it correctly classified about 95.74% of the test emails as spam or non-spam. Its training accuracy (0.9994) is higher than its testing accuracy, so it overfits the training data to some extent, but on unseen data it still distinguishes spam from legitimate emails more accurately than Logistic Regression.

The Logistic Regression model also performed well, with a testing accuracy of 0.9279 (about 92.79%), roughly 3 percentage points below the Random Forest model (0.9574). I didn't expect such a high accuracy score for the Logistic Regression model.

Logistic Regression is a simpler model than Random Forest, so it's impressive that it still achieves an accuracy above 92%. This suggests that much of the relationship between the features and the spam label is close to linear, and that Logistic Regression captures most of it.
However, its lower accuracy suggests that the data also contains non-linear relationships or feature interactions that Logistic Regression cannot capture. Random Forest can model these, which likely explains its better performance.

Still, an accuracy score of 0.9279 is a strong result, and Logistic Regression remains a viable option for this dataset, for example when a simpler, faster, or more interpretable model is preferred.
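
To put the accuracy difference in concrete terms, the scores can be converted to counts of misclassified emails. This sketch assumes a test set of 1,151 rows (25% of 4,601, rounded up, which is scikit-learn's default split); that size is inferred rather than printed in the notebook.

```python
n_test = 1151  # assumed test-set size: ceil(0.25 * 4601)

# Testing accuracies copied from the notebook outputs
lr_accuracy = 0.9278887923544744
rf_accuracy = 0.9574283231972198

# Convert each accuracy into a count of misclassified test emails
lr_errors = n_test - round(lr_accuracy * n_test)
rf_errors = n_test - round(rf_accuracy * n_test)

print(f"Logistic Regression errors: {lr_errors}")
print(f"Random Forest errors:       {rf_errors}")
print(f"Relative error reduction:   {(lr_errors - rf_errors) / lr_errors:.0%}")
```

Under that assumption, a gap of about 3 percentage points in accuracy corresponds to a sizeable cut in the number of misclassified emails.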