I wrote this notebook as a training exercise, and I will be glad if it turns out to be useful to someone :)

The dataset was loaded from here:
https://www.kaggle.com/wanderfj/enron-spam

In [None]:
import pandas as pd
import numpy as np 
from sklearn.datasets import load_files

import time

# Text cleaning and preprocessing
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from nltk.stem import WordNetLemmatizer

This dataset has 7 sets of emails, but reading, cleaning, and preprocessing all of them takes too much time. Using one of them is more than enough.

In [None]:
X, y = [], []
email = load_files("../input/enron-spam/enron1")
X = np.append(X, email.data)
y = np.append(y, email.target)    

### Let's create a DataFrame with the text and target features

In [None]:
df_all = pd.DataFrame(columns=['text', 'target'])
df_all['text'] = [x for x in X]
df_all['target'] = [t for t in y]

In [None]:
df_all

In [None]:
df_X = df_all.drop(['target'], axis=1)
df_y = df_all['target']

In [None]:
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

Now we have a list of texts stored as raw bytes. Calling `decode` is one possible solution, but some texts cannot be decoded. We could delete those texts, but then we might lose important information. Instead, we can convert each text to its string representation and do the following:

1. Remove the special symbols (the literal `\r\n` sequences).
1. Remove the 'b' at the beginning of each text.
1. Replace all gaps (\t, \n, \r, \f) between words with spaces.
1. Remove all non-letter characters.
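
The steps above can be sketched on a single email as follows. This is a minimal illustration, not the notebook's own cell: `clean_text` and the sample bytes are hypothetical, and the steps run in the order the loop in the next cell applies them rather than the order of the list.

```python
import re

def clean_text(raw: bytes) -> str:
    """Clean one raw email (hypothetical helper mirroring the steps above)."""
    # str() on bytes gives a literal such as "b'Subject: hi\\r\\nBuy now'",
    # so line breaks show up as the character sequences \r and \n.
    text = str(raw)
    # Replace the literal "\r\n" sequences with a space
    text = re.sub(r'\\r\\n', ' ', text)
    # Replace every non-letter character with a space
    text = re.sub('[^a-zA-Z]', ' ', text)
    # Collapse runs of whitespace into a single space
    text = re.sub(r'\s+', ' ', text)
    # Drop the leading 'b' left over from the bytes literal
    text = re.sub(r'^b\s+', '', text)
    # strip() stands in for the split/join done later in the notebook
    return text.lower().strip()

# Hypothetical sample, not taken from the Enron data
sample = b"Subject: Win $1000 now!\r\nClick here"
print(clean_text(sample))  # subject win now click here
```

If losing a few undecodable bytes is acceptable, `raw.decode("utf-8", errors="ignore")` is another possible route, since it skips the bytes that cannot be decoded instead of raising an error.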

In [None]:
start_time = time.time()

# Create corpus
corpus = []
for i in range(len(df_X)):
    # Replace the literal '\r\n' sequences left by str() on bytes with spaces
    review = re.sub(r'\\r\\n', ' ', str(df_X['text'][i]))
    # Replace every non-letter character with a space
    review = re.sub('[^a-zA-Z]', ' ', review)
    # Collapse runs of whitespace into a single space
    review = re.sub(r'\s+', ' ', review)
    # Remove the 'b' at the beginning of each text
    review = re.sub(r'^b\s+', '', review)

    review = review.lower()
    review = review.split()
    review = [stemmer.stem(word) for word in review if word not in stopwords.words('english')]
    review = ' '.join(review)
    corpus.append(review)

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
#tf = TfidfVectorizer()

# Creating the Bag of Words model
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer()
X = cv.fit_transform(corpus).toarray()

In [None]:
# Split the data into train and test sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y,  random_state=9, test_size=0.2)

The probability that an email is spam or ham, given the words it contains, is a posterior probability. So let's try Bayesian models. They usually show high performance in spam detection.
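
To make the idea concrete, here is a minimal pure-Python sketch of multinomial naive Bayes, where the posterior score of a class is its prior multiplied by the per-word likelihoods (summed in log space). It is an illustration under simplifying assumptions, not scikit-learn's implementation: `train_nb`, `predict_nb`, and the toy corpus are all hypothetical.

```python
import math
from collections import Counter

def train_nb(docs, labels, alpha=1.0):
    """Fit a tiny multinomial naive Bayes model (illustrative only)."""
    vocab = {w for d in docs for w in d.split()}
    log_prior, log_lik = {}, {}
    for c in sorted(set(labels)):
        class_docs = [d for d, l in zip(docs, labels) if l == c]
        log_prior[c] = math.log(len(class_docs) / len(docs))
        counts = Counter(w for d in class_docs for w in d.split())
        total = sum(counts.values()) + alpha * len(vocab)
        # Laplace smoothing keeps unseen words from zeroing the product
        log_lik[c] = {w: math.log((counts[w] + alpha) / total) for w in vocab}
    return log_prior, log_lik

def predict_nb(model, doc):
    """Return the class with the highest log posterior score for `doc`."""
    log_prior, log_lik = model
    scores = {
        c: log_prior[c] + sum(log_lik[c][w] for w in doc.split() if w in log_lik[c])
        for c in log_prior
    }
    return max(scores, key=scores.get)

# Hypothetical toy corpus, not taken from the Enron data
docs = ["win money now", "cheap money offer",
        "meeting schedule today", "project meeting notes"]
labels = ["spam", "spam", "ham", "ham"]
model = train_nb(docs, labels)
print(predict_nb(model, "win cheap money"))   # spam
print(predict_nb(model, "project schedule"))  # ham
```

Words outside the training vocabulary are simply skipped here, which is one simple way to handle them in a sketch like this.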

In [None]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.metrics import confusion_matrix

model = MultinomialNB().fit(X_train, y_train)
pred = model.predict(X_test)

accuracy = accuracy_score(y_test, pred)
precision = precision_score(y_test, pred)
recall = recall_score(y_test, pred)
conf_m = confusion_matrix(y_test, pred)

print(f"accuracy: {accuracy:.3f}")
print(f"precision: {precision:.3f}")
print(f"recall: {recall:.3f}")
print("confusion matrix: ")
print(conf_m)
print("--- %s seconds ---" % (time.time() - start_time))

With stemming and Bag of Words I got these results:
* accuracy: 0.978
* precision: 0.961
* recall: 0.964
* It took 116.1 sec

**Let's try different combinations of models (Bag of Words or Tf-Idf) with different preprocessing techniques (Stemming or Lemmatization)**
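
As a rough illustration of how the two vectorizations differ, the sketch below compares raw counts with a textbook Tf-Idf weighting on a toy corpus. The helpers `bag_of_words` and `tf_idf` and the documents are hypothetical, and the formula is the plain `count * log(N / document frequency)` variant. scikit-learn's `TfidfVectorizer` uses a smoothed idf and L2 normalization by default, so its numbers would differ.

```python
import math
from collections import Counter

def bag_of_words(docs):
    """Raw term counts per document."""
    return [Counter(d.split()) for d in docs]

def tf_idf(docs):
    """Textbook Tf-Idf: count * log(N / document frequency), no smoothing."""
    counts = bag_of_words(docs)
    n_docs = len(docs)
    # Number of documents that contain each word
    doc_freq = Counter(w for c in counts for w in c)
    return [
        {w: tf * math.log(n_docs / doc_freq[w]) for w, tf in c.items()}
        for c in counts
    ]

# Hypothetical, already-cleaned toy documents
docs = ["subject free money", "subject meeting today", "subject free offer"]
print(bag_of_words(docs)[0])
print(tf_idf(docs)[0])
```

In this toy case "subject" appears in every document, so its Tf-Idf weight is 0, while Bag of Words still counts it once. Rarer words such as "money" get the largest Tf-Idf weights.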

With lemmatization and Bag of Words:
* accuracy: 0.979
* precision: 0.970
* recall: 0.957
* It took 102.9 sec

With stemming and Tf-Idf:
* accuracy: 0.905
* precision: 0.995
* recall: 0.680
* It took 116.6 sec

With lemmatization and Tf-Idf:
* accuracy: 0.907
* precision: 0.995
* recall: 0.686
* It took 116.6 sec

As we can see, Tf-Idf gives very high precision (0.995), but its recall is poor (about 0.68). Bag of Words, on the other hand, shows high results in both cases. I think lemmatizer + Bag of Words is the best solution. It took the least time (102.9 sec), a difference that will become even more noticeable as the dataset grows. It also shows the highest accuracy (0.979) of all the combinations, and its precision and recall are both high.
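
To see how very high precision can coexist with poor recall, here is a small sketch that derives the three metrics from confusion-matrix counts, treating spam as the positive class. The `summarize` helper and the counts are hypothetical and chosen only for illustration. They are not the confusion matrix produced by this notebook.

```python
def summarize(tn, fp, fn, tp):
    """Compute accuracy, precision and recall from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)  # share of predicted spam that really is spam
    recall = tp / (tp + fn)     # share of real spam that was caught
    return accuracy, precision, recall

# Hypothetical counts: the model never flags ham as spam (fp=0)
# but misses a third of the real spam (fn=10 out of 30)
acc, prec, rec = summarize(tn=70, fp=0, fn=10, tp=20)
print(f"accuracy: {acc:.3f}, precision: {prec:.3f}, recall: {rec:.3f}")
# accuracy: 0.900, precision: 1.000, recall: 0.667
```

For a binary problem, scikit-learn's `confusion_matrix(y_test, pred).ravel()` returns the counts in the order `tn, fp, fn, tp`, so the same arithmetic can be applied to the matrix printed above.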