# Naive Bayes and Support Vector Machines
Name: Rusheel Chande

In [1]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC, LinearSVC
from sklearn.metrics import classification_report, f1_score, accuracy_score, confusion_matrix
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split 
from sklearn.tree import DecisionTreeClassifier 
from sklearn.model_selection import learning_curve
from sklearn.ensemble import RandomForestClassifier

## 1. Bayes’ Theorem shows us how to turn P(E|H) to P(H|E), with E=Evidence and H=Hypothesis. But what does that really mean? Imagine you have to explain this to someone who doesn't understand machine learning or probability at all.

The theorem lets us update the probability that a hypothesis is true once we see new evidence. In effect it reverses the usual direction of reasoning: instead of going from a cause to its effect, we reason from an observed effect back to its likely cause.

a) Explain how to turn P(E|H) to P(H|E), with E=Evidence and H=Hypothesis in layman's terms.

Bayes' Theorem helps us reverse a conditional probability. Normally we might know the probability of observing some evidence given a hypothesis, P(E|H), like knowing the likelihood of seeing rain if we assume it's cloudy. With Bayes' Theorem we can flip this around: it tells us how likely our hypothesis is given the evidence we've observed, P(H|E). The formula is P(H|E) = [P(E|H) * P(H)] / P(E), where P(H) is how likely the hypothesis was before we saw any evidence and P(E) is how likely the evidence is overall.

b) Use an example from real life to ground the explanation.

Suppose H is the event of having a disease and E is testing positive for that disease. The probability of testing positive given that you have the disease is P(Testing Positive (E) | Having Disease (H)). With Bayes' Theorem we can find P(Having Disease (H) | Testing Positive (E)), which is the probability of actually having the disease given a positive test result. Applying the formula from part a) gives P(Having Disease (H) | Testing Positive (E)) = [P(Testing Positive (E) | Having Disease (H)) * P(Having Disease (H))] / P(Testing Positive (E))
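
To make the disease example concrete, here is a minimal numeric sketch. The three input probabilities are hypothetical values chosen only for illustration (they do not come from any real test), and P(E) is obtained from the law of total probability.

```python
# Hypothetical inputs, for illustration only.
p_disease = 0.01             # P(H): 1% of people have the disease
p_pos_given_disease = 0.95   # P(E|H): chance of a positive test if sick
p_pos_given_healthy = 0.05   # P(E|not H): false-positive rate

# P(E): overall chance of a positive test (law of total probability).
p_pos = (p_pos_given_disease * p_disease
         + p_pos_given_healthy * (1 - p_disease))

# Bayes' Theorem: P(H|E) = P(E|H) * P(H) / P(E)
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos

print(f"P(E)   = {p_pos:.4f}")                # 0.0590
print(f"P(H|E) = {p_disease_given_pos:.4f}")  # 0.1610
```

Under these assumed numbers, a positive result raises the probability of having the disease from 1% to about 16%. It stays well below 95% because the disease is rare, so most positive tests come from healthy people.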

## 2. This is a public set of comments collected for spam research. It has five datasets composed of 1,956 real messages extracted from five videos. These five videos are popular pop songs that were among the 10 most viewed in the collection period.

a) For this exercise use any four of these five datasets to build a spam filter with the Naïve Bayes approach.

In [2]:
katy = pd.read_csv("../data/Youtube02-KatyPerry.csv", sep=",")
lmfao = pd.read_csv("../data/Youtube03-LMFAO.csv", sep=",")
eminem = pd.read_csv("../data/Youtube04-Eminem.csv", sep=",")
shakira = pd.read_csv("../data/Youtube05-Shakira.csv", sep=",")

In [3]:
katy.head(5)

Unnamed: 0,COMMENT_ID,AUTHOR,DATE,CONTENT,CLASS
0,z12pgdhovmrktzm3i23es5d5junftft3f,lekanaVEVO1,2014-07-22T15:27:50,i love this so much. AND also I Generate Free ...,1
1,z13yx345uxepetggz04ci5rjcxeohzlrtf4,Pyunghee,2014-07-27T01:57:16,http://www.billboard.com/articles/columns/pop-...,1
2,z12lsjvi3wa5x1vwh04cibeaqnzrevxajw00k,Erica Ross,2014-07-27T02:51:43,Hey guys! Please join me in my fight to help a...,1
3,z13jcjuovxbwfr0ge04cev2ipsjdfdurwck,Aviel Haimov,2014-08-01T12:27:48,http://psnboss.com/?ref=2tGgp3pV6L this is the...,1
4,z13qybua2yfydzxzj04cgfpqdt2syfx53ms0k,John Bello,2014-08-01T21:04:03,Hey everyone. Watch this trailer!!!!!!!! http...,1


In [4]:
lmfao.head(5)

Unnamed: 0,COMMENT_ID,AUTHOR,DATE,CONTENT,CLASS
0,z13uwn2heqndtr5g304ccv5j5kqqzxjadmc0k,Corey Wilson,2015-05-28T21:39:52.376000,"<a href=""http://www.youtube.com/watch?v=KQ6zr6...",0
1,z124jvczaz3dxhnbc04cffk43oiugj25yzo0k,Epic Gaming,2015-05-28T20:07:20.610000,wierd but funny﻿,0
2,z13tczjy5xj0vjmu5231unho1ofey5zdk,LaS Music,2015-05-28T19:23:35.355000,"Hey guys, I&#39;m a human.<br /><br /><br />Bu...",1
3,z13tzr0hdpnayhqqc04cd3zqqqjkf3ngckk0k,Cheryl Fox,2015-05-28T17:49:35.294000,Party Rock....lol...who wants to shuffle!!!﻿,0
4,z12pcvix4zedcjvyb04ccr1r0mr2g5xwyng0k,PATRICK_TW,2015-05-28T16:28:26.818000,Party rock﻿,0


In [5]:
eminem.head(5)

Unnamed: 0,COMMENT_ID,AUTHOR,DATE,CONTENT,CLASS
0,z12rwfnyyrbsefonb232i5ehdxzkjzjs2,Lisa Wellas,,+447935454150 lovely girl talk to me xxx﻿,1
1,z130wpnwwnyuetxcn23xf5k5ynmkdpjrj04,jason graham,2015-05-29T02:26:10.652000,I always end up coming back to this song<br />﻿,0
2,z13vsfqirtavjvu0t22ezrgzyorwxhpf3,Ajkal Khan,,"my sister just received over 6,500 new <a rel=...",1
3,z12wjzc4eprnvja4304cgbbizuved35wxcs,Dakota Taylor,2015-05-29T02:13:07.810000,Cool﻿,0
4,z13xjfr42z3uxdz2223gx5rrzs3dt5hna,Jihad Naser,,Hello I&#39;am from Palastine﻿,1


In [6]:
shakira.head(5)

Unnamed: 0,COMMENT_ID,AUTHOR,DATE,CONTENT,CLASS
0,z13lgffb5w3ddx1ul22qy1wxspy5cpkz504,dharma pal,2015-05-29T02:30:18.971000,Nice song﻿,0
1,z123dbgb0mqjfxbtz22ucjc5jvzcv3ykj,Tiza Arellano,2015-05-29T00:14:48.748000,I love song ﻿,0
2,z12quxxp2vutflkxv04cihggzt2azl34pms0k,Prìñçeśś Âliś Łøvê Dømíñø Mâđiś™ ﻿,2015-05-28T21:00:08.607000,I love song ﻿,0
3,z12icv3ysqvlwth2c23eddlykyqut5z1h,Eric Gonzalez,2015-05-28T20:47:12.193000,"860,000,000 lets make it first female to reach...",0
4,z133stly3kete3tly22petvwdpmghrlli,Analena López,2015-05-28T17:08:29.827000,shakira is best for worldcup﻿,0


In [7]:
np.random.seed(200)

combined = pd.concat([katy, lmfao, eminem, shakira])
combined = combined.drop(["COMMENT_ID", "AUTHOR", "DATE"], axis=1)

X = combined["CONTENT"]
y = combined["CLASS"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=371)

pipeline = Pipeline([
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", MultinomialNB()),
])

pipeline.fit(X_train, y_train)

predictions = pipeline.predict(X_test)

accuracy_score(y_test, predictions)

0.8975155279503105

b) Use that filter to check the accuracy on the remaining dataset.

In [8]:
psy = pd.read_csv("../data/Youtube01-Psy.csv", sep=",")
psy.head(5)

Unnamed: 0,COMMENT_ID,AUTHOR,DATE,CONTENT,CLASS
0,LZQPQhLyRh80UYxNuaDWhIGQYNQ96IuCg-AYWqNPjpU,Julius NM,2013-11-07T06:20:48,"Huh, anyway check out this you[tube] channel: ...",1
1,LZQPQhLyRh_C2cTtd9MvFRJedxydaVW-2sNg5Diuo4A,adam riyati,2013-11-07T12:37:15,Hey guys check out my new channel and our firs...,1
2,LZQPQhLyRh9MSZYnf8djyk0gEF9BHDPYrrK-qCczIY8,Evgeny Murashkin,2013-11-08T17:34:21,just for test I have to say murdev.com,1
3,z13jhp0bxqncu512g22wvzkasxmvvzjaz04,ElNino Melendez,2013-11-09T08:28:43,me shaking my sexy ass on my channel enjoy ^_^ ﻿,1
4,z13fwbwp1oujthgqj04chlngpvzmtt3r3dw,GsMega,2013-11-10T16:05:38,watch?v=vtaRGgvGtWQ Check this out .﻿,1


In [9]:
np.random.seed(200)

psy = psy.drop(["COMMENT_ID", "AUTHOR", "DATE"], axis=1)

X_psy = psy["CONTENT"]
y_psy = psy["CLASS"]

psy_predictions = pipeline.predict(X_psy)

psy_accuracy = accuracy_score(y_psy, psy_predictions)

psy_accuracy

0.7914285714285715

c) Make sure to report the details of the training process and the model.

The Naïve Bayes model was trained on the Katy Perry, LMFAO, Eminem, and Shakira datasets, which were combined and split 80/20 into a training set and a held-out set. Preprocessing dropped the non-essential columns COMMENT_ID, AUTHOR, and DATE, leaving CONTENT as the only feature and CLASS as the label. The text-processing pipeline uses CountVectorizer to convert text into token counts, TfidfTransformer to convert those counts into normalized TF-IDF values, and a MultinomialNB classifier with default settings. The model reached 89.75% accuracy on the held-out 20% of the four combined datasets and about 79.14% on the unseen Psy dataset. That is roughly 10 points lower than the held-out accuracy, which is plausible because the Psy comments come from a video the model never saw during training, but it still shows that the model generalizes reasonably well.
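
To show what the classifier step does with token counts, here is a minimal from-scratch sketch of multinomial Naïve Bayes with add-one smoothing. It is a simplification and not the pipeline used above: the four training comments are made up, tokenization is a plain whitespace split, and the TF-IDF step is omitted. The helper names (`tokenize`, `log_posterior`, `predict`) are hypothetical.

```python
import math
from collections import Counter

# Made-up training comments (1 = spam, 0 = not spam); not from the datasets.
train = [
    ("check out my channel", 1),
    ("subscribe to my channel please", 1),
    ("love this song", 0),
    ("this song is great", 0),
]

def tokenize(text):
    """Lowercase and split on whitespace (much simpler than CountVectorizer)."""
    return text.lower().split()

# Count tokens per class and comments per class.
token_counts = {0: Counter(), 1: Counter()}
doc_counts = Counter()
for text, label in train:
    token_counts[label].update(tokenize(text))
    doc_counts[label] += 1

vocab = set(token_counts[0]) | set(token_counts[1])

def log_posterior(text, label, alpha=1.0):
    """Unnormalized log P(label | text): log prior plus smoothed log likelihoods."""
    score = math.log(doc_counts[label] / sum(doc_counts.values()))
    total = sum(token_counts[label].values())
    for tok in tokenize(text):
        if tok not in vocab:
            continue  # skip words never seen in training
        score += math.log((token_counts[label][tok] + alpha)
                          / (total + alpha * len(vocab)))
    return score

def predict(text):
    """Return the class with the larger unnormalized log posterior."""
    return max((0, 1), key=lambda label: log_posterior(text, label))

print(predict("check my channel"))  # 1 (spam)
print(predict("great song"))        # 0 (not spam)
```

This is Bayes' Theorem from question 1 applied per class: the prior P(H) is the share of comments in each class, and P(E|H) is the product of per-word probabilities, with the "naïve" assumption that words are independent given the class.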

## 3. In this exercise, you will use the Portuguese sea battles data that contains outcomes of naval battles between Portuguese and Dutch/British ships between 1583 and 1663.

a) Use an SVM-based model to predict the Portuguese outcome of the battle from the number of ships involved on all sides and Spanish involvement.

In [16]:
# faced some separator issues with loading the data, so I just manually entered all of it.

data = {
    "Battle": ["Bantam", "Malacca Strait", "Ilha das Naus", "Pulo Butum", "Surrat", 
               "Ilha das Naus", "Jask", "Hormuz", "Mogincoal Shoals", "Hormuz", 
               "Goa", "Goa", "Goa", "Colombo", "Goa", 
               "Invincible Armada", "Bahia", "Bahia", "Bahia", 
               "Recife", "Abrolhos", "Bahia", "Dunas", "Dunas", 
               "Paraiba", "Tamandare", "Recife", "Lisbon"],
    "Year": [1601, 1606, 1606, 1606, 1615, 
             1615, 1620, 1622, 1622, 1625, 
             1636, 1637, 1638, 1654, 1658, 
             1588, 1624, 1625, 1627, 
             1630, 1631, 1636, 1639, 1639, 
             1640, 1645, 1653, 1657],
    "Portuguese ships": [6, 14, 6, 7, 6, 
                         3, 4, 6, 4, 8, 
                         6, 6, 6, 5, 9, 
                         69, 4, 35, 4, 
                         9, 17, 2, 51, 38, 
                         16, 6, 14, 7],
    "Dutch ships": [3, 11, 9, 9, 0, 
                    5, 0, 0, 4, 4, 
                    4, 7, 8, 3, 9, 
                    0, 13, 20, 10, 
                    60, 16, 8, 11, 110, 
                    30, 7, 5, 10],
    "English ships": [0, 0, 0, 0, 4, 
                      0, 4, 5, 2, 4, 
                      0, 0, 0, 0, 0, 
                      31, 0, 0, 0, 
                      0, 0, 0, 0, 0, 
                      0, 0, 0, 0],
    "Ship Ratio P/D&B": [2, 1.273, 0.667, 0.778, 1.5, 
                   0.6, 1, 1.2, 0.667, 1, 
                   1.5, 0.857, 0.75, 1.667, 1, 
                   2.226, 0.308, 1.75, 0.4, 
                   0.15, 1.063, 0.25, 4.636, 0.345, 
                   0.533, 0.857, 2.8, 0.7],
    "Spanish Involvement": [0, 0, 0, 0, 0, 
                            0, 0, 0, 0, 0, 
                            0, 0, 0, 0, 0, 
                            1, 1, 1, 1, 
                            1, 1, 1, 1, 1, 
                            1, 1, 1, 1],
    "Portuguese outcome": [0, 0, -1, 1, 0, 
                           -1, 0, -1, -1, 0, 
                           0, 0, 1, 1, 0, 
                           -1, -1, 1, -1, 
                           -1, 0, 0, 0, -1, 
                           0, -1, 1, 0]
}

armada = pd.DataFrame(data)
armada.head(5)

Unnamed: 0,Battle,Year,Portuguese ships,Dutch ships,English ships,Ship Ratio P/D&B,Spanish Involvement,Portuguese outcome
0,Bantam,1601,6,3,0,2.0,0,0
1,Malacca Strait,1606,14,11,0,1.273,0,0
2,Ilha das Naus,1606,6,9,0,0.667,0,-1
3,Pulo Butum,1606,7,9,0,0.778,0,1
4,Surrat,1615,6,0,4,1.5,0,0


In [11]:
np.random.seed(200)

X = armada[["Portuguese ships", "Dutch ships", "English ships", "Ship Ratio P/D&B", "Spanish Involvement"]]
y = armada["Portuguese outcome"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

svm_model = SVC(kernel="rbf")
svm_model.fit(X_train, y_train)

svm_predictions = svm_model.predict(X_test)

accuracy_score(y_test, svm_predictions)

0.5

b) Try solving the same problem using two other classifiers that you know.

In [12]:
np.random.seed(200)

dt_model = DecisionTreeClassifier()
dt_model.fit(X_train, y_train)
dt_predictions = dt_model.predict(X_test)
accuracy_score(y_test, dt_predictions)

0.3333333333333333

In [13]:
np.random.seed(200)

rf_model = RandomForestClassifier()
rf_model.fit(X_train, y_train)
rf_predictions = rf_model.predict(X_test)
accuracy_score(y_test, rf_predictions)

0.3333333333333333

c) Report and compare their results with those from SVM.

The SVM model achieved an accuracy of 50% on this split, although on other runs it dropped as low as 33%. This indicates low to moderate ability to predict the outcome of Portuguese sea battles. The decision tree and random forest classifiers both scored 33.33% on this split, which puts them at or below the SVM (on other runs they also reached 50%). With only 28 battles, the 20% test split contains just six of them, so a single prediction moves the accuracy by almost 17 percentage points, and the differences between the three models are within run-to-run noise. The accuracies are low partly because there is so little data and partly because sea battles are hard to predict from ship counts alone: weather and sea conditions such as visibility, waves, and rain, as well as troop morale, can all play a role in a win or loss.
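
One way to put these accuracies in context is to compare them with a majority-class baseline, meaning a "model" that always predicts the most common outcome. The sketch below computes that baseline from the "Portuguese outcome" column entered above, then shows how much the baseline's accuracy varies across random six-battle test sets. As a simplification, the majority class is taken from all 28 battles rather than from each training split, and the seed and number of repetitions are arbitrary choices.

```python
import random
from collections import Counter

# "Portuguese outcome" values, copied from the data dictionary above.
outcomes = [0, 0, -1, 1, 0, -1, 0, -1, -1, 0,
            0, 0, 1, 1, 0, -1, -1, 1, -1, -1,
            0, 0, 0, -1, 0, -1, 1, 0]

counts = Counter(outcomes)
majority_class, majority_count = counts.most_common(1)[0]
baseline = majority_count / len(outcomes)

print(dict(counts))        # {0: 13, -1: 10, 1: 5}
print(round(baseline, 3))  # 0.464

# Accuracy of "always predict the majority class" on random 6-battle test sets.
rng = random.Random(0)  # arbitrary seed
accuracies = []
for _ in range(1000):
    test_set = rng.sample(outcomes, 6)
    hits = sum(outcome == majority_class for outcome in test_set)
    accuracies.append(hits / 6)

print(min(accuracies), max(accuracies))
```

Always guessing the most common outcome is right for 13 of the 28 battles (about 46%), which is close to the SVM's 50% and above the 33% from the tree-based models on this split. With such a small test set, repeated or stratified cross-validation would likely give a steadier comparison than a single split.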