# <center> Automatic Document Categorization <br> with Machine Learning techniques in Python </center>

**Authors**: Adam Karwan, Roksana Tomanek, Aleksander Zajchowski and Sviatoslav Somov

### Before we start

- You can find all files and information in this GitHub repository: https://github.com/roxytomanek/ai_workshop

### What are we going to achieve

**Can you use this dataset to build a prediction model that will accurately classify which texts are spam?**

##### Context
The SMS Spam Collection is a set of tagged SMS messages that have been collected for SMS spam research. It contains one set of 5,574 SMS messages in English, each tagged as either ham (legitimate) or spam.
##### Content
The files contain one message per line. Each line is composed of two columns: v1 contains the label (ham or spam) and v2 contains the raw text (in the CSV loaded in this notebook these columns are named `Label` and `Text`).
This corpus has been collected from free or free-for-research sources on the Internet.
##### Acknowledgements
The original dataset can be found here: https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection

### I. Importing libraries

- **NumPy** - package for scientific computing; it provides support for large, multi-dimensional arrays and matrices, along with a large collection of high-level mathematical functions to operate on these arrays
- **Pandas** - we use this library for data manipulation and analysis; it offers data structures and operations for manipulating numerical tables and time series
- **Matplotlib** - is a plotting library for the Python programming language and its numerical mathematics extension NumPy. **Pyplot** is a Matplotlib module that provides a MATLAB-like interface while being free and open-source
- **Seaborn** - is a data visualization library based on matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics.
- **Scikit-learn** (sklearn) - is a free software machine learning library. It features various classification, regression and clustering algorithms, including support vector machines, random forests, gradient boosting, k-means and DBSCAN, and it is designed to interoperate with the Python numerical and scientific libraries NumPy and SciPy.
- **Pickle** - The pickle module implements binary protocols for serializing and de-serializing a Python object structure. “Pickling” is the process where a Python object hierarchy is converted into a byte stream.
 

In [None]:
# Import the necessary libraries 
import numpy as np
import pandas as pd # Dataframe Management
import matplotlib.pyplot as plt 
import seaborn as sns # Visualization
from sklearn.model_selection import train_test_split 
import pickle # Model Serialization

### II. Downloading the dataset

The first step in any data science project is to import your data. Often you'll work with data in Comma-Separated Values (CSV) files and may run into problems at the very beginning of your workflow. To load a CSV file we usually use the `read_csv()` function from `pandas`. Inside the parentheses you can pass arguments to adjust the loading process to your needs. In this case we use `delimiter` and `encoding`; all the available options are listed in the documentation.

If you want to see what the DataFrame looks like, you can try these commands:

- `df.head(10)` - this command will show you the first 10 rows of the DataFrame
- `df.tail(10)` - this one will show you the last 10 rows
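
To make the role of `delimiter` and `encoding` more concrete, here is a minimal sketch that reads a tiny in-memory file instead of the workshop CSV. The two rows and the name `toy_df` are invented for illustration; only the `Label` and `Text` column names are taken from this notebook.

```python
import io
import pandas as pd

# Hypothetical two-row file, encoded as Latin-1 bytes to mimic a file on disk
raw_bytes = "Label,Text\nham,See you at 5\nspam,WIN a £1000 prize now\n".encode("latin-1")

# Same arguments as in the workshop cell: a comma delimiter and Latin-1 decoding
toy_df = pd.read_csv(io.BytesIO(raw_bytes), delimiter=",", encoding="latin-1")

print(toy_df.head(1))  # first row
print(toy_df.tail(1))  # last row
```

If the bytes were decoded with the wrong encoding, characters such as `£` would either raise an error or come out garbled, which is one common source of trouble when loading CSV files.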

In [None]:
# Load Data
df = pd.read_csv('./data/spam_or_ham.csv', delimiter=',', encoding='latin-1')

### III. Summarize the Dataset

Now we need to look at the data following these 3 steps:

1. Dimensions of the dataset.
2. Statistical summary of all attributes.
3. Class Distribution.

In [None]:
# Exploratory Analysis
# View Dataset, top 10 Text Messages
df.head(n=10)

#### Dimensions of the dataset

We can get a quick idea of how many instances (rows) and how many attributes (columns) the data contains with the shape property.

In [None]:
df.shape

#### Statistical summary

Now we can take a look at a summary of each attribute.

In [None]:
df.info()

In [None]:
df.describe(include=['object'])

#### Class Distribution
Now let's take a look at the number of instances (rows) that belong to each class. We can view this as an absolute count.

In [None]:
# class distribution
print(df.groupby('Label').size())
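
Besides absolute counts, it can help to look at the share of each class. The sketch below uses a small invented DataFrame (`toy_df`) rather than the workshop data, so the numbers say nothing about the real dataset.

```python
import pandas as pd

# Hypothetical labels: three ham messages and one spam message
toy_df = pd.DataFrame({"Label": ["ham", "ham", "ham", "spam"]})

# Absolute count per class, as in the cell above
counts = toy_df.groupby("Label").size()

# Relative share per class (fractions that sum to 1)
shares = toy_df["Label"].value_counts(normalize=True)

print(counts)
print(shares)
```

On the real data, the same `value_counts(normalize=True)` call on `df["Label"]` would show how unbalanced the two classes are.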

### IV. Data Visualization

After the exploratory analysis we now have a basic idea about the data, but it's always easier to see it in a graph :) 

 

First, we'll see what the class distribution looks like. 

In [None]:
# Check Distribution - Not Balanced Data
sns.countplot(x='Label', data=df)  # pass the column by keyword; recent seaborn versions require it
plt.xlabel('Label')
plt.title('Number of ham and spam messages')
# 20% Spam Data

Now let's check what words appear in each class. 

We'll use `WordCloud` library for this visualization. 
If your environment doesn't have it preinstalled just use 
`conda install -c conda-forge wordcloud` in your terminal if Anaconda is installed 
or `pip install wordcloud`

In [None]:
import matplotlib.pyplot as plt
from wordcloud import WordCloud 

# spam and ham words
spam_words = ' '.join(list(df[df['Label'] == 'spam']['Text']))
ham_words = ' '.join(list(df[df['Label'] == 'ham']['Text']))

# Create Word Clouds 
spam_wc = WordCloud(width = 512, height = 512, colormap = 'plasma').generate(spam_words)
ham_wc = WordCloud(width = 512, height = 512, colormap = 'ocean').generate(ham_words)

# Plot Word Clouds
# SPAM
plt.figure(figsize = (10,8), facecolor = 'r')
plt.imshow(spam_wc)
plt.axis('off')
plt.tight_layout(pad = 0)
plt.show()

# HAM 
plt.figure(figsize = (10,8), facecolor = 'g')
plt.imshow(ham_wc)
plt.axis('off')
plt.tight_layout(pad = 0)
plt.show()

# In spam messages the word FREE occurs very often
# In ham messages the words 'OK', 'will', 'got' occur often, along with corrupted tokens ('gt' or 'lt')

### V. Model evaluation

1. Separate out a validation dataset
2. Build models
3. Select the best model


#### Creating train and test datasets

To evaluate the models we're going to split the data into two parts: one for training and one for testing. In this step we use the `train_test_split` function. With `test_size=0.3`, 70% of the data is used to train our models and 30% is held back as a validation dataset. `random_state`, as the name suggests, seeds the internal random number generator that decides which rows go into the train and test sets. Setting `random_state` to a fixed value guarantees that the same sequence of random numbers is generated each time you run the code, so unless some other source of randomness is present in the process, the results will be the same on every run. This makes the output easier to verify. 
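
The effect of `test_size` and `random_state` can be checked on a toy example. This is only a sketch: the ten messages and all `toy_` names are invented, and only `train_test_split` with the same two arguments comes from this notebook.

```python
from sklearn.model_selection import train_test_split

# Hypothetical data: ten placeholder messages with made-up labels
toy_texts = [f"message {i}" for i in range(10)]
toy_labels = ["spam" if i % 5 == 0 else "ham" for i in range(10)]

# Two independent calls with the same seed
split_a = train_test_split(toy_texts, toy_labels, test_size=0.3, random_state=123)
split_b = train_test_split(toy_texts, toy_labels, test_size=0.3, random_state=123)

toy_X_train, toy_X_test, toy_y_train, toy_y_test = split_a
print(len(toy_X_train), len(toy_X_test))  # sizes of the 70% / 30% parts
print(split_a == split_b)                 # same seed, same split
```

Changing `random_state` to another value (or leaving it out) would generally put different rows into the test set, which is why a fixed seed is handy when comparing models.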

In [None]:
# Split Data Set into Train and Test
X_train, X_test, Y_train, Y_test = train_test_split(df.Text, df.Label, test_size=0.3, random_state=123)

#### MODELS

Now the fun begins! In this step, we'll run three different models and see the results for each of them. Here is a simple task for you: in the cell below we defined these three models, and your job is to rerun the cell with each of them. 

But before we do, here is a brief explanation of the models:

##### Naive Bayes

It is a classification technique based on Bayes’ Theorem with an assumption of independence among predictors. In simple terms, a Naive Bayes classifier assumes that the presence of a particular feature in a class is unrelated to the presence of any other feature. For example, a fruit may be considered to be an apple if it is red, round, and about 3 inches in diameter. Even if these features depend on each other or upon the existence of the other features, all of these properties independently contribute to the probability that this fruit is an apple and that is why it is known as ‘Naive’.

A Naive Bayes model is easy to build and particularly useful for very large data sets. Despite its simplicity, Naive Bayes can perform surprisingly well and on some tasks is competitive with far more sophisticated classification methods.
More information and tutorial :  <br>
https://machinelearningmastery.com/naive-bayes-tutorial-for-machine-learning/<br>
https://towardsdatascience.com/introduction-to-naive-bayes-classification-4cffabb1ae54
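
The "naive" independence assumption can be sketched in a few lines of plain Python. All probabilities below are invented for illustration and are not estimated from the SMS dataset; the point is only that each word contributes its own factor, and the factors are multiplied together with the class prior.

```python
# Hypothetical probabilities, invented for illustration only
priors = {"spam": 0.2, "ham": 0.8}
word_given_class = {
    "spam": {"free": 0.30, "call": 0.20},
    "ham": {"free": 0.01, "call": 0.05},
}

message_words = ["free", "call"]

# Naive Bayes score: prior times the product of per-word likelihoods,
# treating the words as independent given the class
scores = {}
for label, prior in priors.items():
    score = prior
    for word in message_words:
        score *= word_given_class[label][word]
    scores[label] = score

# Normalize the scores so they sum to 1 (posterior probabilities)
total = sum(scores.values())
posteriors = {label: score / total for label, score in scores.items()}
print(posteriors)
```

With these made-up numbers the message is assigned to "spam", even though "ham" has the larger prior, because both words are much more likely under the spam class.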


##### Support Vector Machine

A support vector machine (SVM) is a type of supervised machine learning classification algorithm. SVMs were initially introduced in the 1960s and later refined in the 1990s. They have since become very popular, owing to their ability to achieve strong results. SVMs also work differently from many other machine learning algorithms.

For linearly separable data in two dimensions, a typical machine learning algorithm tries to find a boundary that divides the data so that the misclassification error is minimized. In fact, there can be several boundaries that correctly divide the data points. SVM differs from other classification algorithms in that it chooses the decision boundary that maximizes the distance from the nearest data points of all the classes. An SVM doesn't merely find a decision boundary; it finds the optimal (maximum-margin) decision boundary.
You can read more about it here: <br>
https://towardsdatascience.com/introduction-to-support-vector-machine-svm-4671e2cf3755
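
The maximum-margin idea can be illustrated on four hand-picked 2D points. This is a sketch under assumptions: the points and the `_toy` names are invented, and a very large `C` is used so that scikit-learn's `SVC` behaves approximately like a hard-margin SVM.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical, linearly separable points: class 0 at x=0, class 1 at x=3
X_toy = np.array([[0.0, 0.0], [0.0, 1.0], [3.0, 0.0], [3.0, 1.0]])
y_toy = np.array([0, 0, 1, 1])

# A large C approximates a hard margin (no misclassification tolerated)
svm_toy = SVC(kernel="linear", C=1e6).fit(X_toy, y_toy)

# For a linear SVM the margin width is 2 / ||w||
w = svm_toy.coef_[0]
margin_width = 2.0 / np.linalg.norm(w)

print("support vectors:\n", svm_toy.support_vectors_)
print("margin width:", margin_width)
```

Many vertical-ish lines between x=0 and x=3 would separate these points; the SVM picks the one in the middle, so the margin width comes out close to the full gap between the two groups.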

##### Random Forest

Random forests are a combination of tree predictors such that each tree depends on the values of a random vector sampled independently and with the same distribution for all trees in the forest. A random forest is a classifier consisting of a collection of tree-structured classifiers {h(x, Θ_k), k = 1, …} where the {Θ_k} are independent, identically distributed random vectors and each tree casts a unit vote for the most popular class at input x. In short, a random forest builds multiple decision trees and merges them together to get a more accurate and stable prediction.

Tutorial and more information:<br>
https://towardsdatascience.com/random-forest-in-python-24d0893d51c0<br>
https://machinelearningmastery.com/implement-random-forest-scratch-python/
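
To see the "many trees, one answer" idea in action, the sketch below trains a very small forest on invented 2D points and prints what each individual tree predicts. The data and `_toy` names are hypothetical. Note that scikit-learn's `RandomForestClassifier` combines trees by averaging their predicted class probabilities rather than by counting hard votes, so this is an approximation of the voting picture described above.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Hypothetical data: two well-separated clusters labelled 0 and 1
X_toy = np.array([[0, 0], [0, 1], [1, 0], [1, 1], [4, 4], [4, 5], [5, 4], [5, 5]])
y_toy = np.array([0, 0, 0, 0, 1, 1, 1, 1])

# Five trees, each trained on its own bootstrap sample of the data
forest_toy = RandomForestClassifier(n_estimators=5, random_state=0).fit(X_toy, y_toy)

new_point = np.array([[4.5, 4.5]])

# Each fitted tree is available in estimators_ and can predict on its own
tree_votes = [int(tree.predict(new_point)[0]) for tree in forest_toy.estimators_]
print("individual tree predictions:", tree_votes)
print("forest prediction:", forest_toy.predict(new_point)[0])
```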

#### Instruction

Now your task! In lines **18-20** of the cell below, inside the function, three models are defined and only one of them is active. To see the results for each of them, add or remove `#` at the beginning of the line, keeping exactly one `('clf', ...)` line active. 

In [None]:
# https://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import MultinomialNB # Naive Bayes
from sklearn.svm import LinearSVC, SVC # Support Vector Machine
from sklearn.ensemble import RandomForestClassifier # Random Forest
import time

# Python Function
def models(list_sentences_train, list_sentences_test, train_labels, test_labels):
    t0 = time.time() # start time
    
    # Pipeline 
    model = Pipeline([('vect', CountVectorizer(ngram_range=(1,3))), 
                      ('tfidf', TfidfTransformer(use_idf=False)),
                      ('clf', MultinomialNB())]) # Naive Bayes
    #                  ('clf', SVC(kernel='linear', probability=True))]) # Linear SVM with probability
    #                  ('clf', RandomForestClassifier())]) # Random Forest

    # Train Model
    model.fit(list_sentences_train, train_labels) 
    
    duration = time.time() - t0 # end time
    print("Training done in %.3fs " % duration)

    # Model Accuracy
    print('Model final score: %.3f' % model.score(list_sentences_test, test_labels))
    return model

# Train, Evaluate and Save Model
model_std_NLP = models(X_train, X_test, Y_train, Y_test)
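
The first two pipeline steps turn raw text into numbers. The sketch below shows what they produce on two invented sentences; it uses `ngram_range=(1, 2)` instead of the `(1, 3)` from the cell above just to keep the output short, and the `_toy` names are hypothetical.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Hypothetical mini-corpus
toy_texts = ["free prize now", "see you now"]

# Count single words and two-word sequences (unigrams and bigrams)
vect_toy = CountVectorizer(ngram_range=(1, 2))
toy_counts = vect_toy.fit_transform(toy_texts)
print(sorted(vect_toy.vocabulary_))  # the learned n-gram features

# With use_idf=False the transformer only rescales each row of counts
# (L2 normalization by default), so long and short messages are comparable
toy_tf = TfidfTransformer(use_idf=False).fit_transform(toy_counts)
row_norms = np.asarray(np.sqrt(toy_tf.multiply(toy_tf).sum(axis=1))).ravel()
print(row_norms)
```

The classifier at the end of the pipeline never sees the text itself, only these normalized n-gram frequencies.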

The cell above only gives numerical output, so to make it easier to understand we'll visualise it by creating a **confusion matrix**.

A confusion matrix is a table that is often used to describe the performance of a classification model (or "classifier") on a set of test data for which the true values are known.
What can we learn from this matrix?
- There are two possible predicted classes: "spam" and "ham". When we predict the purpose of a message, "spam" means that it's a spam message, and "ham" means a normal message.

<br>Let's now define the most basic terms, treating "ham" as the positive class:
- true positives (TP): We predicted "ham" and the message is actually ham (an important message).
- true negatives (TN): We predicted "spam" and the message is actually spam.
- false positives (FP): We predicted ham, but it's actually spam. (Also known as a "Type I error.")
- false negatives (FN): We predicted spam, but the message is actually an important one. (Also known as a "Type II error.")
<br>


Very useful article: https://towardsdatascience.com/understanding-confusion-matrix-a9ad42dcfd62
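
The four terms can be counted directly with `confusion_matrix`. In this sketch the five true and predicted labels are invented; passing `labels=["spam", "ham"]` puts "ham" second, so that it plays the role of the positive class as in the definitions above.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical ground truth and predictions for five messages
y_true_toy = ["ham", "ham", "ham", "spam", "spam"]
y_pred_toy = ["ham", "ham", "spam", "spam", "ham"]

# Rows are true labels, columns are predicted labels, in the order given by `labels`.
# With "ham" as the second (positive) label, ravel() yields TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(y_true_toy, y_pred_toy, labels=["spam", "ham"]).ravel()

print(f"TP={tp} TN={tn} FP={fp} FN={fn}")
```

Here one spam message slips through as ham (a false positive) and one ham message is wrongly flagged as spam (a false negative).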

In [None]:
# https://scikit-learn.org/stable/auto_examples/model_selection/plot_confusion_matrix.html

from sklearn.metrics import confusion_matrix # Library to Compute Confusion Matrix

def plot_confusion_matrix(y_true, y_pred, classes,
                          normalize=False,
                          title=None,
                          cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    if not title:
        if normalize:
            title = 'Normalized confusion matrix'
        else:
            title = 'Confusion matrix, without normalization'

    # Compute confusion matrix
    cm = confusion_matrix(y_true, y_pred)
    # Only use the labels that appear in the data
    # classes = classes[unique_labels(y_true, y_pred)]
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        print("Normalized confusion matrix")
    else:
        print('Confusion matrix, without normalization')

    print(cm)

    fig, ax = plt.subplots()
    im = ax.imshow(cm, interpolation='nearest', cmap=cmap)
    ax.figure.colorbar(im, ax=ax)
    # We want to show all ticks...
    ax.set(xticks=np.arange(cm.shape[1]),
           yticks=np.arange(cm.shape[0]),
           # ... and label them with the respective list entries
           xticklabels=classes, yticklabels=classes,
           title=title,
           ylabel='True label',
           xlabel='Predicted label')

    # Rotate the tick labels and set their alignment.
    plt.setp(ax.get_xticklabels(), rotation=45, ha="right",
             rotation_mode="anchor")

    # Loop over data dimensions and create text annotations.
    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i in range(cm.shape[0]):
        for j in range(cm.shape[1]):
            ax.text(j, i, format(cm[i, j], fmt),
                    ha="center", va="center",
                    color="white" if cm[i, j] > thresh else "black")
    fig.tight_layout()
    return ax

In [None]:
np.set_printoptions(precision=2)

# Predictions with model
Y_pred = model_std_NLP.predict(X_test)
class_names = np.array(['ham', 'spam'])


# Plot non-normalized confusion matrix
plot_confusion_matrix(Y_test, Y_pred, classes=class_names,
                      title='Confusion matrix, without normalization')

# Plot normalized confusion matrix
plot_confusion_matrix(Y_test, Y_pred, classes=class_names, normalize=True,
                      title='Normalized confusion matrix')

plt.show()

### Pickle

Pickle is used for serializing and de-serializing Python object structures, a process also called marshalling or flattening. Serialization refers to the process of converting an object in memory to a byte stream that can be stored on disk or sent over a network. Later on, this byte stream can be retrieved and de-serialized back into a Python object. Pickling is not to be confused with compression! The former is the conversion of an object from one representation (data in Random Access Memory (RAM)) to another (bytes on disk), while the latter is the process of encoding data with fewer bits in order to save disk space.

Pickling is useful for applications where you need some degree of persistence in your data. Your program's state can be saved to disk, so you can continue working on it later. It can also be used to send data over a Transmission Control Protocol (TCP) or socket connection, or to store Python objects in a database. Pickle is very useful when you're working with machine learning algorithms and want to save a trained model so that you can make new predictions later, without having to rewrite everything or train the model all over again.

Useful: https://www.pythoncentral.io/how-to-pickle-unpickle-tutorial/
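
The round trip from object to bytes and back can be tried without touching the disk. This is a minimal sketch with an invented dictionary (`toy_state`); `pickle.dumps` and `pickle.loads` are the in-memory counterparts of the `pickle.dump` and `pickle.load` calls used with files.

```python
import pickle

# Hypothetical program state to preserve
toy_state = {"model_name": "toy", "threshold": 0.5, "labels": ["ham", "spam"]}

# Serialize to a byte stream (what would be written to a .pkl file)
blob = pickle.dumps(toy_state)

# De-serialize back into a new, equal Python object
restored = pickle.loads(blob)

print(type(blob).__name__, len(blob))
print(restored == toy_state)
```

The restored object is a copy with the same content, not the original object itself, which is exactly what allows it to be loaded in a different session or on a different machine.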

In [None]:
# Save to file in the current working directory
pkl_filename = "pickle_model.pkl"  
with open(pkl_filename, 'wb') as file:  
    pickle.dump(model_std_NLP, file)

# Load from file
with open(pkl_filename, 'rb') as file:  
    pickle_model = pickle.load(file)

### Let's test the results 

In [None]:
test_text_spam = ['Urgent! call 09066350750 from your landline. Your complimentary 4* Ibiza Holiday or 10,000 cash await collection SAE T&Cs PO BOX 434 SK3 8WP 150 ppm 18+ ']
test_text_ham = ['Good. No swimsuit allowed :)']

# Predict Category and Probability
# Spam
print(model_std_NLP.predict(test_text_spam)) 
print(model_std_NLP.predict_proba(test_text_spam)) 

# Ham
print(model_std_NLP.predict(test_text_ham)) 
print(model_std_NLP.predict_proba(test_text_ham)) 

# More Test Examples
# Ham - 0
# Good. No swimsuit allowed :)
# Wish i were with you now!
# Im sorry bout last nite it wasn't ur fault it was me, spouse it was pmt or sumthin! U 4give me? I think u shldxxxx

# Spam - 1
# Urgent! call 09066350750 from your landline. Your complimentary 4* Ibiza Holiday or 10,000 cash await collection SAE T&Cs PO BOX 434 SK3 8WP 150 ppm 18+ 
# +123 Congratulations - in this week's competition draw u have won the £1450 prize to claim just call 09050002311 b4280703. T&Cs/stop SMS 08718727868. Over 18 only 150ppm
# Double mins and txts 4 6months FREE Bluetooth on Orange. Available on Sony, Nokia Motorola phones. Call MobileUpd8 on 08000839402 or call2optout/N9DX

In [None]:
# Test Pickle Model
print(pickle_model.predict(test_text_spam)) # Predict Category
print(pickle_model.predict_proba(test_text_spam)) # Predict Probability
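
The array returned by `predict_proba` has one column per class, in the order given by the model's `classes_` attribute. The sketch below makes that mapping explicit on a tiny invented training set; `toy_model` is a simplified stand-in (word counts plus Naive Bayes), not the workshop pipeline.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical training data: two spam and two ham messages
toy_texts = ["win free prize now", "free cash call now",
             "see you at lunch", "are you home yet"]
toy_labels = ["spam", "spam", "ham", "ham"]

toy_model = Pipeline([("vect", CountVectorizer()), ("clf", MultinomialNB())])
toy_model.fit(toy_texts, toy_labels)

# One row per input message, one column per class in classes_ order
toy_proba = toy_model.predict_proba(["free prize"])[0]
for label, p in zip(toy_model.classes_, toy_proba):
    print(f"{label}: {p:.3f}")
```

The same `zip(model.classes_, ...)` pattern should work for `model_std_NLP` and `pickle_model`, as long as the active classifier supports `predict_proba`.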