### Spam Detection System (SDS)

The Spam Detection System (SDS) described here converts message text into feature vectors and classifies each message as spam or non-spam. The vectors are meant to capture lexical, syntactic, and semantic characteristics of both kinds of message; the steps below use TF-IDF weights over the 1,000 most frequent terms, which is primarily a lexical representation. Three classifiers are trained on a labeled dataset of spam and non-spam messages: a Support Vector Machine, a Random Forest, and Gradient Boosting. The one with the highest accuracy on a held-out test set is selected. Training on diverse labeled examples is what lets a model learn the patterns that separate spam from legitimate messages, although keeping up with evolving spam tactics would require retraining on fresh data.
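The steps below train the three classifiers separately and keep the best one. One possible reading of "multi-layered" is stacking, where a second-level model learns from the first-level classifiers' outputs. The following is a minimal sketch of that idea under stated assumptions, not the notebook's method: the toy messages, the `LogisticRegression` final estimator, and the `random_state` values are all illustrative choices.

```python
from sklearn.ensemble import (GradientBoostingClassifier,
                              RandomForestClassifier, StackingClassifier)
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Hypothetical toy messages (not from data/sms.csv); 1 = spam, 0 = non-spam
spam = [f"Win a free prize, claim offer {i} now" for i in range(10)]
ham = [f"Are we meeting for lunch on day {i}" for i in range(10)]
texts = spam + ham
labels = [1] * len(spam) + [0] * len(ham)

# Layer 1: the three classifiers named in the overview.
# Layer 2: a logistic regression trained on their cross-validated outputs.
stack = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(random_state=0)),
        ("svm", SVC()),
        ("gb", GradientBoostingClassifier(random_state=0)),
    ],
    final_estimator=LogisticRegression(),
)

# TF-IDF features feed the stacked model
model = make_pipeline(TfidfVectorizer(stop_words="english"), stack)
model.fit(texts, labels)

stacked_predictions = model.predict(["Claim your free prize now", "Lunch tomorrow?"])
print(stacked_predictions)
```

Whether stacking beats the best single classifier depends on the data, so it would need to be compared on the same held-out set.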

### Step 1: Data Preparation and Feature Extraction

In [5]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Load the dataset containing labeled messages (spam and non-spam)
data = pd.read_csv('data/sms.csv')

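# Create a TF-IDF vectorizer to convert text data into feature vectors,
# keeping the 1,000 most frequent terms and dropping English stop words.
# Note: the vectorizer is fit on the full dataset, before the train/test
# split in Step 2, so vocabulary and IDF statistics include the test messages.
vectorizer = TfidfVectorizer(max_features=1000, stop_words='english')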
X = vectorizer.fit_transform(data['sms'].values)
y = data['label'].values
data.head()

| Unnamed: 0 | sms | label |
|---|---|---|
| 0 | Go until jurong point, crazy.. Available only ... | 0 |
| 1 | Ok lar... Joking wif u oni...\n | 0 |
| 2 | Free entry in 2 a wkly comp to win FA Cup fina... | 1 |
| 3 | U dun say so early hor... U c already then say... | 0 |
| 4 | Nah I don't think he goes to usf, he lives aro... | 0 |
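To make the feature extraction step concrete, here is a small sketch of what `TfidfVectorizer` produces. The three messages are hypothetical and are not taken from `data/sms.csv`; the names `toy_messages`, `toy_vectorizer`, and `toy_matrix` are introduced only for this illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy messages, not from the dataset
toy_messages = [
    "Free entry to win a prize",
    "Win a free prize now",
    "Are we still meeting for lunch",
]

toy_vectorizer = TfidfVectorizer(stop_words='english')
toy_matrix = toy_vectorizer.fit_transform(toy_messages)

# vocabulary_ maps each kept term to its column index;
# sorting the terms gives them in column order
vocabulary = sorted(toy_vectorizer.vocabulary_)

# One row per message, one column per vocabulary term
print("Matrix shape:", toy_matrix.shape)
print("Vocabulary:", vocabulary)

# Non-zero TF-IDF weights for the first message
first_row = toy_matrix[0].toarray().ravel()
print({term: round(float(w), 3) for term, w in zip(vocabulary, first_row) if w > 0})
```

Each message becomes a sparse row of weights. Terms that appear in fewer messages receive a higher weight than terms shared across many messages.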


### Step 2: Model Training

In [2]:
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

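# Train three classifiers with default hyperparameters.
# Random Forest is given no fixed random_state here, so its accuracy
# can vary slightly between runs.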
random_forest = RandomForestClassifier()
svm = SVC()
gradient_boosting = GradientBoostingClassifier()

random_forest.fit(X_train, y_train)
svm.fit(X_train, y_train)
gradient_boosting.fit(X_train, y_train)

# Make predictions on the test set
rf_predictions = random_forest.predict(X_test)
svm_predictions = svm.predict(X_test)
gb_predictions = gradient_boosting.predict(X_test)

# Calculate accuracy scores
rf_accuracy = accuracy_score(y_test, rf_predictions)
svm_accuracy = accuracy_score(y_test, svm_predictions)
gb_accuracy = accuracy_score(y_test, gb_predictions)

print("Random Forest Accuracy:", rf_accuracy)
print("SVM Accuracy:", svm_accuracy)
print("Gradient Boosting Accuracy:", gb_accuracy)


Random Forest Accuracy: 0.9856502242152466
SVM Accuracy: 0.9856502242152466
Gradient Boosting Accuracy: 0.968609865470852
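Because the vectorizer in Step 1 is fit before the split, the test messages influence the features. One way to avoid this is to wrap the vectorizer and the classifier in a single pipeline and evaluate it with cross-validation, so TF-IDF is refit on each training fold only. The sketch below assumes hypothetical toy data in place of `data['sms']` and `data['label']`, and the choice of 5 folds is illustrative.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Hypothetical toy data standing in for data['sms'] and data['label']
spam = [f"Free prize number {i}, call now to win cash" for i in range(10)]
ham = [f"See you at lunch around {i}, ok?" for i in range(10)]
texts = spam + ham
labels = [1] * len(spam) + [0] * len(ham)

# The pipeline refits TF-IDF inside every training fold,
# so held-out messages never contribute to the vocabulary or IDF weights
pipeline = make_pipeline(
    TfidfVectorizer(max_features=1000, stop_words='english'),
    SVC(),
)

# 5-fold cross-validation (stratified by default for classifiers)
scores = cross_val_score(pipeline, texts, labels, cv=5)
print("Fold accuracies:", scores)
print("Mean CV accuracy:", scores.mean())
```

With the real dataset, `texts` and `labels` would be replaced by the raw message column and the label column, passed in before any vectorization.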


### Step 3: Model Evaluation and Selection

In [3]:
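# Compare the accuracies and choose the best-performing model.
# max() compares the (accuracy, name) tuples element by element, so when two
# accuracies tie (as Random Forest and SVM do above), the tie is broken by
# string comparison of the names and 'SVM' wins over 'Random Forest'.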
best_model = max([(rf_accuracy, 'Random Forest'), (svm_accuracy, 'SVM'), (gb_accuracy, 'Gradient Boosting')])

print("Best Model:", best_model[1])


Best Model: SVM
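Accuracy alone can hide weak spam detection when spam is the minority class, which is common in SMS corpora. As a sketch of a fuller comparison, the example below computes precision, recall, F1, and a confusion matrix. The label arrays are hypothetical and hand-written for illustration; with the notebook's variables, `y_test` and a model's predictions would take their place.

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

# Hypothetical labels: 1 = spam, 0 = non-spam
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 1, 0]

accuracy = accuracy_score(y_true, y_pred)
precision = precision_score(y_true, y_pred)  # of messages flagged as spam, how many were spam
recall = recall_score(y_true, y_pred)        # of actual spam, how much was caught
f1 = f1_score(y_true, y_pred)                # harmonic mean of precision and recall

# Rows are true classes, columns are predicted classes: [[TN, FP], [FN, TP]]
matrix = confusion_matrix(y_true, y_pred)

print("Accuracy:", accuracy)    # 0.8
print("Precision:", precision)  # 0.75
print("Recall:", recall)        # 0.75
print("F1:", f1)                # 0.75
print(matrix)                   # [[5 1], [1 3]]
```

When two models tie on accuracy, these metrics give a principled basis for choosing between them, for example preferring higher precision if blocking legitimate messages is the costlier error.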
