<!--NAVIGATION-->


<a href="https://colab.research.google.com/github/saskeli/x/blob/master/bayes.ipynb"><img align="left" src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open in Colab" title="Open and Execute in Google Colaboratory"></a>

|                                       -                                       |                                       -                                       |                                       -                                       |
|-------------------------------------------------------------------------------|-------------------------------------------------------------------------------|-------------------------------------------------------------------------------|
|  [Exercise 1 (blob classification)](<#Exercise-1-(blob-classification&#41;>)  | [Exercise 2 (plant classification)](<#Exercise-2-(plant-classification&#41;>) |  [Exercise 3 (word classification)](<#Exercise-3-(word-classification&#41;>)  |
|       [Exercise 4 (spam detection)](<#Exercise-4-(spam-detection&#41;>)       |                                                                               |                                                                               |



## ML: Naive Bayes classification

*Classification* is one form of supervised learning. The aim is to annotate all data points with a label. Points that have the same label belong to the same class. There can be two or more labels. For example, a lifeform can be classified (coarsely) with the labels animal, plant, fungi, archaea, bacteria, protozoa, and chromista. The data points are observed to have certain features that can be used to predict their labels. For example, if a lifeform has feathers, then it is most likely an animal.

In supervised learning an algorithm is first given a training set of data points with their features and labels. Then the algorithm learns from these features and labels a (probabilistic) model, which can afterwards be used to predict the labels of previously unseen data.

*Naive Bayes classification* is a fast and simple to understand classification method. Its speed is due to some simplifications we make about the underlying probability distributions, namely, the assumption about the independence of features. Yet, it can be quite powerful, especially when there are enough features in the data.

Suppose that for each label L we have a probability distribution. This distribution gives the probability of each possible combination of features (a feature vector), given that the label is L:

$$P(features | L).$$
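
The independence assumption mentioned above can be written out explicitly. Writing the feature vector as $(f_1, \ldots, f_n)$ (notation introduced here only for illustration), the features are assumed to be independent of each other once the label is known:

$$P(features | L) = \prod_{i=1}^{n} P(f_i | L).$$

Under this assumption only the one-dimensional distributions $P(f_i | L)$ need to be estimated, which is what makes the method fast.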

The main idea in Bayesian classification is to reverse the direction of dependence: we want to predict the label based on the features:

$$P(L | features)$$

This is possible by [the Bayes theorem](https://en.wikipedia.org/wiki/Bayes%27_theorem):

$$P(L | features) = \frac{P(features | L)P(L)}{P(features)}.$$
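
As a small numeric illustration of the theorem, the sketch below uses two labels and a single feature ("has feathers"). All the probabilities are made-up values chosen for illustration only; they do not come from any dataset.

```python
# Hypothetical numbers, chosen only for illustration.
prior = {"animal": 0.3, "plant": 0.7}           # P(L)
likelihood = {"animal": 0.2, "plant": 0.001}    # P(has feathers | L)

# P(features) = sum over all labels of P(features | L) * P(L)
evidence = sum(likelihood[label] * prior[label] for label in prior)

# Bayes' theorem applied to each label
posterior = {label: likelihood[label] * prior[label] / evidence
             for label in prior}

for label, p in posterior.items():
    print(f"P({label} | has feathers) = {p:.4f}")
```

Even though the prior favours "plant", the posterior strongly favours "animal", because the observed feature is far more likely under that label.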

Let's assume we have two labels L1 and L2, and their associated distributions: $P(features | L1)$ and $P(features | L2)$. If we have a data point with "features", whose label we don't know, we can try to predict it using the ratio of posterior probabilities:

$$\frac{P(L1 | features)}{P(L2 | features)} = \frac{P(features | L1)P(L1)}{P(features | L2)P(L2)}.$$

If the ratio is greater than one, we label our data point with label L1, and if not, we give it label L2.
The prior probabilities P(L1) and P(L2) can be estimated from the training data as the relative frequencies of the labels, since each training point comes with its label. The same goes for the probabilities of features conditioned on the label: they are estimated from the training points that carry that label.
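
The decision rule above can be sketched by hand with NumPy. This is a minimal illustration, not how `sklearn` implements it: the toy data, the helper names `log_likelihood`, `log_posterior_ratio` and `predict`, and the choice of one Gaussian per feature are all assumptions made for this example. It works with the logarithm of the ratio, so "ratio greater than one" becomes "log-ratio greater than zero".

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy training data (hypothetical): two features, two labels
X1 = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(60, 2))   # points labelled L1
X2 = rng.normal(loc=[3.0, 3.0], scale=1.0, size=(40, 2))   # points labelled L2

# Priors: relative frequencies of the labels in the training data
n1, n2 = len(X1), len(X2)
prior1, prior2 = n1 / (n1 + n2), n2 / (n1 + n2)

def log_likelihood(x, X_class):
    """log P(x | L), with an independent Gaussian fitted to each feature."""
    mean = X_class.mean(axis=0)
    var = X_class.var(axis=0)
    # The sum over features corresponds to the naive independence assumption
    return np.sum(-0.5 * np.log(2 * np.pi * var) - (x - mean) ** 2 / (2 * var))

def log_posterior_ratio(x):
    """log of P(L1 | x) / P(L2 | x); the common factor P(x) cancels out."""
    return (log_likelihood(x, X1) + np.log(prior1)
            - log_likelihood(x, X2) - np.log(prior2))

def predict(x):
    return "L1" if log_posterior_ratio(x) > 0 else "L2"

print(predict(np.array([0.2, -0.1])))   # a point near the L1 cluster
print(predict(np.array([2.9, 3.3])))    # a point near the L2 cluster
```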

We first demonstrate naive Bayes classification using Gaussian distributions.

#### <div class="alert alert-info">Exercise 1 (blob classification)</div>

Write function `blob_classification` that gets feature matrix X and label vector y as parameters. It should return the accuracy score of the prediction. Do the prediction using `GaussianNB`, and use the `train_test_split` function from `sklearn` to split the dataset into two parts: one for training and one for testing. Give parameter `random_state=0` to the splitting function so that the result is deterministic. Use a training set size of 75% of the whole data.
<hr/>

In [1]:
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn import metrics
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

def blob_classification(X, y):
    model = GaussianNB()
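    # 75% of the data for training, 25% for testing
    x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
    model.fit(x_train, y_train)
    pred = model.predict(x_test)
    # accuracy_score expects the true labels first, then the predictions
    return accuracy_score(y_test, pred)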

def main():
    X,y = datasets.make_blobs(100, 2, centers=2, random_state=2, cluster_std=2.5)
    print("The accuracy score is", blob_classification(X, y))
    a=np.array([[2, 2, 0, 2.5],
                [2, 3, 1, 1.5],
                [2, 2, 6, 3.5],
                [2, 2, 3, 1.2],
                [2, 4, 4, 2.7]])
    accs=[]
    for row in a:
        X,y = datasets.make_blobs(100, int(row[0]), centers=int(row[1]),
                                  random_state=int(row[2]), cluster_std=row[3])
        accs.append(blob_classification(X, y))
    print(repr(np.hstack([a, np.array(accs)[:,np.newaxis]])))

if __name__ == "__main__":
    main()


The accuracy score is 0.92
array([[2.  , 2.  , 0.  , 2.5 , 0.76],
       [2.  , 3.  , 1.  , 1.5 , 0.96],
       [2.  , 2.  , 6.  , 3.5 , 0.84],
       [2.  , 2.  , 3.  , 1.2 , 1.  ],
       [2.  , 4.  , 4.  , 2.7 , 0.8 ]])


#### <div class="alert alert-info">Exercise 2 (plant classification)</div>

Write function `plant_classification` that does the following:

* loads the iris dataset using sklearn (`sklearn.datasets.load_iris`)
* splits the data into training and testing parts using the `train_test_split` function so that the training set size is 80% of the whole data (also give the call the `random_state=0` argument to make the result deterministic)
* uses Gaussian naive Bayes to fit the training data
* predicts the labels of the test data
* returns the accuracy score of the prediction (`sklearn.metrics.accuracy_score`)
<hr/>

In [2]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn import naive_bayes
from sklearn import metrics

def plant_classification():
    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=0)
    model = naive_bayes.GaussianNB()
    model.fit(X_train, y_train)
    pred = model.predict(X_test)

    # accuracy_score expects the true labels first, then the predictions
    return metrics.accuracy_score(y_test, pred)

def main():
    print(f"Accuracy is {plant_classification()}")

if __name__ == "__main__":
    main()

Accuracy is 0.9666666666666667


#### <div class="alert alert-info">Exercise 3 (word classification)</div>

This exercise can give four points at maximum!

In this exercise we create a model that tries to label previously unseen words to be either Finnish or English.

Part 1.

Write function `get_features` that gets a one-dimensional `np.array` of words as its parameter. It should return a feature matrix of shape (n, 29), where n is the number of elements in the input array. There should be one feature for each of the letters in the following alphabet: "abcdefghijklmnopqrstuvwxyzäö-". The value of a feature should be the number of times the corresponding character appears in the word.

Part 2.

Write function `contains_valid_chars` that takes a string as a parameter and returns the truth value of whether all the characters in the string belong to the alphabet or not.

Part 3.

Write function `get_features_and_labels` that returns the tuple (X, y) of the feature matrix and the target vector. Use the labels 0 and 1 for Finnish and English, respectively. Use the supplied functions load_finnish() and load_english() to get the lists of words. Filter the lists in the following ways:

* Convert the Finnish words to lowercase, and then filter out those words that contain characters that don't belong to the alphabet.
* For the English words first filter out those words that begin with an uppercase letter to get rid of proper nouns. Then proceed as with the Finnish words.

Use get_features function you made earlier to form the feature matrix.

Part 4.

We have earlier seen examples where we split the data into a training part and a testing part. This way we can test whether the model can really be used to predict unseen data. However, we may have had bad luck, and the split may have produced very biased training and test sets. To counter this, we can perform the split several times and take as the final result the average over the different splits. This is called [cross validation](<https://en.wikipedia.org/wiki/Cross-validation_(statistics)>).
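
The idea of averaging over several splits can be sketched as follows. The iris data and `GaussianNB` are used here only as a convenient stand-in, and the loop over five seeds is an assumption made for illustration. Strictly speaking this is repeated random splitting; k-fold cross validation instead partitions the data so that every point ends up in a test set exactly once.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)

scores = []
for seed in range(5):
    # A different random_state gives a different train/test split
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=seed)
    model = GaussianNB().fit(X_train, y_train)
    scores.append(model.score(X_test, y_test))   # accuracy on this split

print("Scores per split:", scores)
print("Average accuracy:", np.mean(scores))
```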

Create `word_classification` function that does the following:

Use the function `get_features_and_labels` you made earlier to get the feature matrix and the labels. Use multinomial naive Bayes to do the classification. Get the accuracy scores using the `sklearn.model_selection.cross_val_score` function; use 5-fold cross validation. The function should return a list of five accuracy scores.

The `cv` parameter of `cross_val_score` can be either an integer, which specifies the number of folds, or a *cross-validation generator* that generates the (train set, test set) pairs. What happens if you pass the following cross-validation generator to `cross_val_score` as a parameter: `sklearn.model_selection.KFold(n_splits=5, shuffle=True, random_state=0)`?

Why the difference?
<hr/>

In [None]:


from collections import Counter
import urllib.request
from lxml import etree

import numpy as np

from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import cross_val_score
from sklearn import model_selection

alphabet="abcdefghijklmnopqrstuvwxyzäö-"
alphabet_set = set(alphabet)

# Returns a list of Finnish words
def load_finnish():
    finnish_url="https://www.cs.helsinki.fi/u/jttoivon/dap/data/kotus-sanalista_v1/kotus-sanalista_v1.xml"
    filename="src/kotus-sanalista_v1.xml"
    load_from_net=False
    if load_from_net:
        with urllib.request.urlopen(finnish_url) as data:
            lines=[]
            for line in data:
                lines.append(line.decode('utf-8'))
        doc="".join(lines)
    else:
        with open(filename, "rb") as data:
            doc=data.read()
    tree = etree.XML(doc)
    s_elements = tree.xpath('/kotus-sanalista/st/s')
    return list(map(lambda s: s.text, s_elements))

def load_english():
    with open("src/words", encoding="utf-8") as data:
        lines=map(lambda s: s.rstrip(), data.readlines())
    return list(lines)

def get_features(a):
    features = np.zeros((len(a),29))
    for i, j in enumerate(a) :
        for l, k in enumerate(alphabet):
            features[i, l] = j.count(k)       
    return features

def contains_valid_chars(s):
    for char in s:
        if char not in alphabet:
            return False
    return True

def get_features_and_labels():
    # English: drop words that begin with an uppercase letter (proper nouns),
    # convert the rest to lowercase, and keep only words within the alphabet.
    english = [w for w in load_english() if w and not w[0].isupper()]
    english = [w.lower() for w in english]
    english = [w for w in english if contains_valid_chars(w)]

    # Finnish: convert to lowercase, then keep only words within the alphabet.
    finnish = [w.lower() for w in load_finnish()]
    finnish = [w for w in finnish if contains_valid_chars(w)]

    X = np.concatenate((get_features(english), get_features(finnish)), axis=0)
    # Labels: 1 for English, 0 for Finnish, in the same order as the rows of X.
    y = np.concatenate((np.ones(len(english)), np.zeros(len(finnish))))

    return X, y
    

    
def word_classification():
    X, y = get_features_and_labels()
    model = MultinomialNB()
    gen = model_selection.KFold(n_splits=5, shuffle=True, random_state=0)
    score = cross_val_score(model, X, y, cv=gen)

    return score



def main():
    #sh = get_features_and_labels()
    #print(sh.shape)
    print("Accuracy scores are:", word_classification())

if __name__ == "__main__":
    main()


#### <div class="alert alert-info">Exercise 4 (spam detection)</div>

This exercise gives two points if solved correctly!

In the `src` folder there are two files: `ham.txt.gz` and `spam.txt.gz`. The files are preprocessed versions of the files from https://spamassassin.apache.org/old/publiccorpus/. There is one email per line. The file `ham.txt.gz` contains emails that are non-spam, and, conversely, emails in file `spam.txt.gz` are spam. The email headers have been removed, except for the subject line, and non-ASCII characters have been deleted.

Write function `spam_detection` that does the following:

* reads the lines from these files into arrays. Use function `open` from the `gzip` module, since the files are compressed. From each file take only `fraction` of the lines from the start of the file, where `fraction` is a parameter to `spam_detection` and should be in the range `[0.0, 1.0]`.
* forms the combined feature matrix using the `fit_transform` method of the `CountVectorizer` class. The feature matrix should first have the rows for the `ham` dataset and then the rows for the `spam` dataset. One row in the feature matrix corresponds to one email.
* uses labels 0 for ham and 1 for spam
* divides the feature matrix and the target labels into training and test sets using `train_test_split`. Use 75% of the data for training. Pass the `random_state` parameter from `spam_detection` to `train_test_split`.
* trains a `MultinomialNB` model, and uses it to predict the labels for the test set

The function should return a triple consisting of

* accuracy score of the prediction
* size of test sample
* number of misclassified sample points

Note. The tests use the `fraction` parameter with value 0.1 to ease the load on the TMC server. If the full data were used and the solution did something non-optimal, it could use huge amounts of memory, causing the solution to fail.
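
To see what `CountVectorizer` produces, here is a minimal sketch on three made-up "emails" (hypothetical data, not taken from the corpus). The result of `fit_transform` is a SciPy sparse matrix with one row per document and one column per word in the vocabulary.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Three made-up "emails" (hypothetical data)
docs = ["free money now", "meeting at noon", "free free offer"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)   # sparse matrix, one row per document

print("Shape:", X.shape)
print("Vocabulary:", sorted(vectorizer.vocabulary_))

# vocabulary_ maps each word to its column index in the feature matrix
free_col = vectorizer.vocabulary_["free"]
print("Count of 'free' in the third email:", X[2, free_col])
```

Both `train_test_split` and `MultinomialNB` accept sparse matrices directly, so there is no need to convert the feature matrix to a dense array; the number of rows is available as `X.shape[0]`.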
<hr/>

In [None]:

import gzip
import numpy as np
import sklearn
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

def spam_detection(random_state=0, fraction=1.0):
    # Read the compressed files; each line is one email (as bytes).
    with gzip.open("src/ham.txt.gz") as f:
        ham = f.readlines()
    with gzip.open("src/spam.txt.gz") as f:
        spam = f.readlines()
    ham = ham[:int(fraction * len(ham))]
    spam = spam[:int(fraction * len(spam))]

    # Keep the feature matrix sparse; a dense copy can need a lot of memory.
    X = CountVectorizer().fit_transform(ham + spam)
    # Labels: 0 for ham, 1 for spam, in the same order as the rows of X.
    y = np.concatenate((np.zeros(len(ham)), np.ones(len(spam))))

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, random_state=random_state, train_size=0.75)
    model = MultinomialNB()
    model.fit(X_train, y_train)
    y_fitted = model.predict(X_test)

    acc = accuracy_score(y_test, y_fitted)
    miscl = np.sum(y_test != y_fitted)

    # len() is not defined for sparse matrices, so use shape[0] for the row count.
    return acc, X_test.shape[0], miscl

def main():
    
    accuracy, total, misclassified = spam_detection()
    print("Accuracy score:", accuracy)
    print(f"{misclassified} messages misclassified out of {total}")

if __name__ == "__main__":
    main()


