# Webinar 2 - Intro to NLP, Text Processing, Spam Classifier
In this lesson, we will go through the following topics one by one.
 1. **Intro to NLP**
 2. **Fetching data from a webpage**
 3. **Text processing in NLP**
 4. **Bayes theorem**
 5. **Naive Bayes theorem**
 6. **Spam detection using Naive Bayes theorem**
 
<hr>

## 1. Intro to NLP

In general, an NLP pipeline takes raw text, processes it, extracts relevant features, and builds models to accomplish various NLP tasks. It can be divided into 3 parts:
1. **Text processing**
2. **Feature extraction**
3. **Modeling**

<hr>

## 2. Fetching data from a webpage

Let's go to <a href="https://techcrunch.com">techcrunch.com</a> and fetch all of the news titles.

<br>

<img src="assets/techcrunch.png">    

<br>
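
As a minimal sketch of the idea, the snippet below extracts headline text from a small, hard-coded HTML string using only the standard library. The markup in `SAMPLE_HTML` and the choice of the `<h2>` tag are assumptions made for illustration; the real page structure may differ and can change at any time. In practice the HTML would first be downloaded (for example with `requests`) and is often parsed with BeautifulSoup instead.

```python
from html.parser import HTMLParser

# Hypothetical markup standing in for a downloaded page.
SAMPLE_HTML = """
<html><body>
  <h2 class="post-title"><a href="/a">First headline</a></h2>
  <p>Some teaser text.</p>
  <h2 class="post-title"><a href="/b">Second <em>headline</em></a></h2>
</body></html>
"""


class TitleParser(HTMLParser):
    """Collects the text found inside every <h2> element."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.titles = []

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self.in_title = True
            self.titles.append("")

    def handle_endtag(self, tag):
        if tag == "h2":
            self.in_title = False

    def handle_data(self, data):
        # Only keep text that appears while we are inside an <h2>.
        if self.in_title:
            self.titles[-1] += data


parser = TitleParser()
parser.feed(SAMPLE_HTML)
titles = [t.strip() for t in parser.titles]
print(titles)
```

Nested tags such as `<a>` and `<em>` are dropped because only the text nodes are collected, which is the "remove the HTML tags" step in miniature.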

In [None]:
# Importing the libraries


In [None]:
# Fetching a webpage


In [None]:
# Remove the HTML tags


In [None]:
# Getting the titles of each article


In [None]:
# Getting the text of first article


In [None]:
# Getting the text of all articles


**Further Resources:**
1. <a href="https://www.youtube.com/watch?v=aIPqt-OdmS0">Web scraping and parsing with Beautiful Soup & Python Introduction p.1</a>
2. <a href="https://medium.freecodecamp.org/how-to-scrape-websites-with-python-and-beautifulsoup-5946935d93fe">How to scrape websites with Python and BeautifulSoup</a>
3. <a href="https://www.kdnuggets.com/2018/03/text-data-preprocessing-walkthrough-python.html">kdnuggets - Text Data Preprocessing: A Walkthrough in Python</a>

<hr>

## 3. Text Processing

In text processing we take the raw data and make it ready for feature extraction. The steps are as follows:
1. **Clean the dataset**: Remove all the HTML tags, if there are any.
2. **Normalize the text**: First lowercase the text, then remove punctuation, and finally tokenize it.
3. **Remove stop words**: Drop very common words (such as “the”, “is”, “at”) that carry little meaning on their own.
4. **Stemming**: The process of removing suffixes. For example, “branching”, “branches”, and “branched” can all be reduced to “branch”.
5. **Lemmatization**: A technique for reducing words to a normalized form, using a dictionary to map the different variants of a word back to its root. For example, “was”, “is”, and “were” are all converted to “be”.
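
The normalization and stop-word steps can be sketched with the standard library alone. The `STOP_WORDS` set here is a tiny hypothetical list made up for this example; real stop-word lists (such as the one shipped with NLTK) are much longer, and real tokenizers are smarter than a whitespace split.

```python
import string

# Hypothetical mini stop-word list, for illustration only.
STOP_WORDS = {"the", "a", "is", "it", "of", "at", "and", "you", "your", "will"}

text = "Is AI a bad thing? It will change your view of the Matrix."

# 1. Lowercase the text
lowered = text.lower()

# 2. Remove punctuation
no_punct = lowered.translate(str.maketrans("", "", string.punctuation))

# 3. Tokenize on whitespace
tokens = no_punct.split()

# 4. Remove stop words
filtered = [t for t in tokens if t not in STOP_WORDS]

print(tokens)
print(filtered)  # ['ai', 'bad', 'thing', 'change', 'view', 'matrix']
```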

Stemming vs lemmatization: The output of stemming may not be a real word, whereas the output of lemmatization is always a meaningful English word. Stemming does not use a dictionary and is based on a set of rules, while lemmatization needs a dictionary, so stemming is the less memory-intensive option.
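
To make the contrast concrete, here is a toy sketch: a rule-based stemmer that strips a few suffixes, next to a lookup-table lemmatizer. Both `toy_stem` and `toy_lemmatize` are hypothetical helpers written for this illustration, not the algorithms used by real libraries (NLTK, for instance, offers `PorterStemmer` and `WordNetLemmatizer`).

```python
# Toy rule-based stemmer: strips a few common suffixes, no dictionary needed.
SUFFIXES = ("ing", "es", "ed", "s")


def toy_stem(word):
    for suffix in SUFFIXES:
        # Keep at least three characters so very short words stay intact.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


# Toy lemmatizer: a lookup table standing in for a real dictionary.
LEMMAS = {"was": "be", "is": "be", "were": "be"}


def toy_lemmatize(word):
    return LEMMAS.get(word, word)


stems = [toy_stem(w) for w in ["branching", "branches", "branched", "studies"]]
lemmas = [toy_lemmatize(w) for w in ["was", "is", "were"]]

print(stems)   # ['branch', 'branch', 'branch', 'studi']
print(lemmas)  # ['be', 'be', 'be']
```

Note how the rule-based stemmer turns "studies" into "studi", which is not an English word, while the dictionary lookup always returns a valid word.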

In [None]:
# Importing the libraries


### 3.1. Loading and cleaning the dataset

In [None]:
# Adjacent string literals are joined, so no indentation whitespace leaks into the text
text = ("The first time you see The Second Renaissance it may look boring. Look at it at least twice and "
        "definitely watch part 2. It will change your view of the matrix. Are the human people the ones "
        "who started the war ? Is AI a bad thing ?")

print(text)

### 3.2. Normalize the text

In [None]:
# Lower casing the dataset


In [None]:
# Remove punctuation


In [None]:
# Tokenizing the text


### 3.3. Remove stop words

In [None]:
# Remove all stop words


### 3.4. Stemming

In [None]:
# Apply stemming


### 3.5. Lemmatization

In [None]:
# Apply Lemmatization on nouns


In [None]:
# Apply Lemmatization on verbs


**Resources:**

1. <a href="https://web.stanford.edu/~jurafsky/slp3/ed3book.pdf">An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition</a> - Chapter 2 Regular Expressions, Text Normalization, Edit Distance
2. <a href="https://www.kdnuggets.com/2018/03/text-data-preprocessing-walkthrough-python.html">kdnuggets - Text Data Preprocessing: A Walkthrough in Python</a>
3. <a href="https://pythonprogramming.net/tokenizing-words-sentences-nltk-tutorial/"> Natural Language Processing Series - Harrison</a>
4. <a href="https://medium.com/@datamonsters/text-preprocessing-in-python-steps-tools-and-examples-bf025f872908"> Text Preprocessing in Python: Steps, Tools, and Examples </a>

<hr>

## 4. Bayes Theorem

The **Bayes theorem** calculates the probability of an event occurring, based on other probabilities that are related to the event in question. It is composed of a prior (the probabilities that we already know or that are given to us) and a posterior (the probabilities we are looking to compute using the priors).

##### Let's take a look at an example about testing for cancer:

Let's find the probability of an individual having cancer, given that he or she was tested for it and got a positive result.
Suppose we search the internet and find that 1% of the whole population has cancer. We also find that the test gives the correct result 90% of the time for those who have cancer (sensitivity, or true positive rate) and 90% of the time for those who do not have cancer (specificity, or true negative rate).

<br>

So we assume the following:

- `P(Cancer)`: The probability of a person having cancer. Its value is `0.01` because 1% of the whole population has cancer.
<br>

- `P(Positive)`: The probability of getting a positive test result.
<br>

- `P(Negative)`: The probability of getting a negative test result.
<br>

- `P(Positive|Cancer)`: The probability of getting a positive result on a test done for detecting cancer, given that you have cancer. This has a value of `0.9` because our test is correct 90% of the time for those who have cancer.
<br>

- `P(Negative|~Cancer)`: The probability of getting a negative result on a test done for detecting cancer, given that you do not have cancer. This also has a value of `0.9` because our test is correct 90% of the time for those who do not have cancer.


The **Bayes theorem** formula is as follows:

<img src="assets/Bayes_theorem.png">
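
Written out in symbols, the usual textbook form of the theorem for two events $A$ and $B$ with $P(B) > 0$ is the following (the notation $A$, $B$ is an assumption here and may differ from the figure):

$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$$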

<br><br>

The probability of getting a positive test result, `P(Positive)`, can be calculated using the sensitivity and specificity:

<img src="assets/bayes_2.png">
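
One way to sketch this calculation in code is via the law of total probability: a positive result comes either from a person with cancer (true positive) or from a person without cancer (false positive). The variable names below are chosen for this illustration.

```python
p_cancer = 0.01                # P(Cancer)
p_no_cancer = 1 - p_cancer     # P(~Cancer)
sensitivity = 0.9              # P(Positive|Cancer)
specificity = 0.9              # P(Negative|~Cancer)

# P(Positive) = P(Cancer) * P(Positive|Cancer) + P(~Cancer) * P(Positive|~Cancer)
p_pos = p_cancer * sensitivity + p_no_cancer * (1 - specificity)

print(round(p_pos, 3))  # 0.108
```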

In [None]:
### Now let's calculate P(Positive)

# P(Cancer)


# P(~Cancer)


# P(Positive|Cancer) or Sensitivity


# P(Negative|~Cancer) or Specificity


# P(Positive)


<br><br>
Given all of this information we can calculate our **posteriors** as follows:
<img src="assets/posterior_1.png">
<img src="assets/posterior_2.png">

In [None]:
### Calculating P(Cancer|Positive)


In [None]:
### Calculating P(~Cancer|Positive)


Note that `P(~Cancer|Positive) = (P(~Cancer) * P(Positive|~Cancer)) / P(Positive)`.

Here `P(Positive|~Cancer)` can be computed as `1 - P(Negative|~Cancer)`.

##### Our result shows that if we get a positive test result, there is only an 8.33% chance that we actually have cancer and a 91.67% chance that we do not have cancer.
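
These figures can be reproduced with a few lines of arithmetic. The following is a minimal sketch using the numbers from the example; the variable names are illustrative.

```python
p_cancer = 0.01
p_no_cancer = 1 - p_cancer
p_pos_given_cancer = 0.9                               # sensitivity
p_neg_given_no_cancer = 0.9                            # specificity
p_pos_given_no_cancer = 1 - p_neg_given_no_cancer      # false positive rate

# Total probability of a positive test
p_pos = p_cancer * p_pos_given_cancer + p_no_cancer * p_pos_given_no_cancer

# Bayes' theorem for both posteriors
p_cancer_given_pos = p_cancer * p_pos_given_cancer / p_pos
p_no_cancer_given_pos = p_no_cancer * p_pos_given_no_cancer / p_pos

print(f"{p_cancer_given_pos:.2%}")     # 8.33%
print(f"{p_no_cancer_given_pos:.2%}")  # 91.67%
```

The posterior is low despite the 90% accurate test because the prior `P(Cancer)` is only 1%: false positives from the large healthy population outnumber true positives.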

**Further Resources:**

1. <a href="https://classroom.udacity.com/courses/st101">intro to statistics (Bayes rule)</a> - Udacity free course
2. <a href="https://betterexplained.com/articles/an-intuitive-and-short-explanation-of-bayes-theorem/">An Intuitive (and Short) Explanation of Bayes’ Theorem</a> 
3. <a href="https://www.mathsisfun.com/data/bayes-theorem.html">Bayes can do magic!</a> 

<hr>

## 5. Naive Bayes Theorem

The word naive in Naive Bayes comes from the naive assumption we make: we always assume that the events (features) are independent of each other given the class. This assumption is usually false, but in practice it works pretty well.

<img src="assets/naive assumption.png">

##### Let's take an example from politics:
Assume we have two political candidates from two different parties: 'Jennifer Stewart' of the Gemini Party and 'Gabrielle Jones' of the Liberal Party. Assume that the probability of Jennifer Stewart giving a speech, `P(Jenifer)`, is `0.5`, and the same holds for Gabrielle Jones, `P(Gabrielle) = 0.5`. Below are the probabilities of each of these candidates saying the words 'liberty', 'war' and 'economy' when they give a speech:

- `P(Liberty|Jenifer)`, the probability of the word 'liberty' being said given that Jennifer Stewart is the speaker: `0.1`
<br>
- `P(War|Jenifer)`, the probability of the word 'war' being said given that Jennifer Stewart is the speaker: `0.1`
<br>
- `P(Economy|Jenifer)`, the probability of the word 'economy' being said given that Jennifer Stewart is the speaker: `0.8`
<br><br>
- `P(Liberty|Gabrielle)`, the probability of the word 'liberty' being said given that Gabrielle Jones is the speaker: `0.7`
<br>
- `P(War|Gabrielle)`, the probability of the word 'war' being said given that Gabrielle Jones is the speaker: `0.2`
<br>
- `P(Economy|Gabrielle)`, the probability of the word 'economy' being said given that Gabrielle Jones is the speaker: `0.1`
<br>



**Naive assumption:** Given this, what if we had to find the probability of Jennifer Stewart being the one who said the words 'liberty' and 'war'? This is where the Naive Bayes' theorem comes into play, as we are considering two features, 'liberty' and 'war'.

<br><br>
Now we are at a place where we can define the <u>formula for the Naive Bayes' theorem</u>:

<img src="assets/naivebayes.png" height="342" width="342">

- `y` is the class variable or in our case the name of the candidate 
- `x1` through `xn` are the features, or in our case the individual words. The theorem assumes that the features or words (`xi`) are independent of each other.
<br><br>
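
With `y` as the class and `x1` through `xn` as the features, the standard form of the formula can be written as follows (this is the common textbook statement and is assumed, not confirmed, to match the figure):

$$P(y \mid x_1, \dots, x_n) = \frac{P(y)\,\prod_{i=1}^{n} P(x_i \mid y)}{P(x_1, \dots, x_n)}$$

The product in the numerator is exactly where the independence assumption is used: the joint likelihood of all features is replaced by the product of the per-feature likelihoods.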

**To break this down, we have to compute the following posterior probabilities:**

1. `P(Jenifer|Liberty,War)`: The probability that Jennifer Stewart is the speaker, given that the words 'liberty' and 'war' were said. Using the formula and our knowledge of Bayes' theorem, we can compute this as follows:
    
    <img src="assets/jenifer_naivebayes.png">
    
    Here P(Liberty,War) is the probability of the words 'liberty' and 'war' being said in a speech.
    

2. `P(Gabrielle|Liberty,War)`: The probability that Gabrielle Jones is the speaker, given that the words 'liberty' and 'war' were said. Using the formula, we can compute this as follows:
    
    <img src="assets/gabrielle_naivebayes.png">
    
    

##### Now compute the `P(Liberty,War)` or the probability of the words 'liberty' and 'war' being said in a speech

In [None]:
### Computing the 3 different probabilities for Jenifer

# P(Jenifer)


# P(Liberty | Jenifer)


# P(War | Jenifer)


In [None]:
### Computing the 3 different probabilities for Gabrielle

# P(Gabrielle)


# P(Liberty | Gabrielle)


# P(War | Gabrielle)


In [None]:
### Compute P(Liberty , War)

# [P(Jenifer) x P(Liberty|Jenifer) x P(War|Jenifer)] + [P(Gabrielle) x P(Liberty|Gabrielle) x P(War|Gabrielle)]



In [None]:
### Compute P(Jenifer | Liberty, War) or the posterior probability of Jenifer

# P(Jenifer|Liberty,War) = (P(Jenifer) * P(Liberty|Jenifer) * P(War|Jenifer)) / P(Liberty,War)



In [None]:
### Compute P(Gabrielle | Liberty, War) or the posterior probability of Gabrielle

# P(Gabrielle|Liberty,War) = (P(Gabrielle) * P(Liberty|Gabrielle) * P(War|Gabrielle)) / P(Liberty,War)


**Note: The posterior probabilities should add up to 1.**
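
The whole calculation, including the check that the posteriors add up to 1, can be sketched as follows. The dictionary layout and variable names are choices made for this illustration; the numbers are the ones given in the example.

```python
priors = {"Jenifer": 0.5, "Gabrielle": 0.5}
likelihoods = {
    "Jenifer": {"liberty": 0.1, "war": 0.1, "economy": 0.8},
    "Gabrielle": {"liberty": 0.7, "war": 0.2, "economy": 0.1},
}
words = ["liberty", "war"]

# Numerator of Naive Bayes for each candidate: P(y) * product of P(x_i|y)
joint = {}
for name, prior in priors.items():
    p = prior
    for word in words:
        p *= likelihoods[name][word]
    joint[name] = p

# P(Liberty, War): sum of the numerators over all candidates
evidence = sum(joint.values())

# Posterior for each candidate
posteriors = {name: joint[name] / evidence for name in joint}

for name, p in posteriors.items():
    print(f"P({name}|Liberty,War) = {p:.4f}")
print("sum =", round(sum(posteriors.values()), 10))
```

With these numbers, `P(Liberty,War)` is `0.075`, and Gabrielle Jones is by far the more likely speaker.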

**Further Resources:**

1. <a href="https://classroom.udacity.com/courses/ud120">intro to machine learning (Naive Bayes)</a> - Udacity free course
2. <a href="https://www.analyticsvidhya.com/blog/2017/09/naive-bayes-explained/">6 Easy Steps to Learn Naive Bayes Algorithm (with codes in Python and R)</a>
3. <a href="https://towardsdatascience.com/naive-bayes-classifier-81d512f50a7c">Naive Bayes Classifier</a>

<hr>

## 6. Spam Detection Using Naive Bayes Theorem

### Introduction 

Naive Bayes classifiers are a popular statistical technique of e-mail filtering. They typically use bag of words features to identify spam e-mail, an approach commonly used in text classification.

Naive Bayes classifiers work by correlating the use of tokens (typically words, or sometimes other things), with spam and non-spam e-mails and then using Bayes' theorem to calculate a probability that an email is or is not spam.

Naive Bayes spam filtering is a baseline technique for dealing with spam that can tailor itself to the email needs of individual users and give low false positive spam detection rates that are generally acceptable to users. It is one of the oldest ways of doing spam filtering, with roots in the 1990s. (Wikipedia)

In [None]:
# Import the libraries


### Loading the Dataset
We will be using the Naive Bayes algorithm to create a model that can classify SMS messages from the [SMS Spam Collection dataset](https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection) as spam or not spam, based on the training we give to the model.

In [None]:
# Loading the dataset


In [None]:
# Get the shape of data
print("dataset shape: ", dataset.shape)

### Data Preprocessing

Now that we have loaded our dataset, let's convert our labels into binary numbers: 0 for ham and 1 for spam.

In [None]:
# Converting the labels to binary numbers


### Splitting the Dataset
To find out how accurate our model really is, we need to test it on data it has not seen during training. To do so, we split our dataset into a training set and a testing set: one for training the model and the other for testing it.
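
A minimal sketch of the split, using scikit-learn's `train_test_split` on a handful of made-up messages. The messages, the `test_size` of 25%, and the `random_state` value are assumptions for illustration, not the values used in the webinar. The labels are first mapped to 0 for ham and 1 for spam.

```python
from sklearn.model_selection import train_test_split

# Hypothetical miniature dataset standing in for the SMS collection.
messages = [
    "win a free prize now",
    "are we still on for lunch",
    "free entry to win cash",
    "see you at the meeting",
    "claim your free reward",
    "call me when you get home",
    "urgent prize waiting for you",
    "thanks for the help today",
]
labels = ["spam", "ham", "spam", "ham", "spam", "ham", "spam", "ham"]

# Convert the text labels to binary numbers: 0 for ham, 1 for spam
y = [0 if label == "ham" else 1 for label in labels]

# Hold out 25% of the messages for testing
X_train, X_test, y_train, y_test = train_test_split(
    messages, y, test_size=0.25, random_state=1
)

print(len(X_train), len(X_test))  # 6 2
```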

In [None]:
# Splitting the dataset into training and test set


### Feature Extraction - Bag of Words
Now that we have split the data, our next objective is to get the bag of words and convert our data into the desired matrix format. To do this we will be using `CountVectorizer()`. The steps are as follows:
* First we fit the `CountVectorizer()` on the training data, `X_train`, and transform it.
* Then we transform our testing data, `X_test`, using the same fitted vectorizer.
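
The two steps can be sketched on a toy example as follows. The three training sentences and the single test sentence are made up for illustration; `CountVectorizer` is used with its default settings.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy data
X_train = ["free prize now", "see you at lunch", "free free entry"]
X_test = ["free lunch tomorrow"]

count_vector = CountVectorizer()

# Learn the vocabulary from the training data and build its count matrix
training_data = count_vector.fit_transform(X_train)

# Reuse the same vocabulary for the test data (no fitting here)
testing_data = count_vector.transform(X_test)

vocab = sorted(count_vector.vocabulary_)
print(vocab)
print(training_data.toarray())
print(testing_data.toarray())
```

The word "tomorrow" never appears in the training data, so it has no column and is silently ignored in the test matrix. This is why the vectorizer is fitted on the training set only.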

In [None]:
# Initializing the count vectorizer


# Fit and transform the training data


# Transform the test data


In [None]:
# Checking the training data in DataFrame


### Naive Bayes

Sklearn has several Naive Bayes implementations that we can use, so we do not have to do the math from scratch. We will be using the `sklearn.naive_bayes` module.

We will be using the multinomial Naive Bayes implementation. This particular classifier is suitable for classification with discrete features.
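
As a minimal sketch, here is `MultinomialNB` trained on word counts from four made-up messages and then applied to two new ones. The messages and labels are hypothetical, and the default smoothing (`alpha=1.0`) is used.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy training set: 1 = spam, 0 = ham
train_messages = [
    "win a free prize now",
    "free entry win cash",
    "are we still on for lunch",
    "see you at the meeting",
]
train_labels = [1, 1, 0, 0]

# Turn the messages into word-count features
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_messages)

# Fit the multinomial Naive Bayes classifier
model = MultinomialNB()
model.fit(X_train, train_labels)

# Classify unseen messages with the same vectorizer
new_messages = ["free cash prize", "lunch meeting today"]
predictions = model.predict(vectorizer.transform(new_messages))
print(predictions)  # [1 0]
```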

In [None]:
# Applying Naive Bayes


### Prediction
Let's make predictions on our testing dataset. Later on we will evaluate how well our model did on this dataset.

In [None]:
# Predicting the test set


In [None]:
# Converting the predictions "0", "1" into "ham", "spam"


In [None]:
# Predict a random message


In [None]:
# Predict a correct ham message


### Evaluating the Model

Now we want to evaluate how well our model is doing. There are various mechanisms for doing so:

1. **Accuracy**: `[Correct Predictions/Total Number of Predictions]`

2. **Precision**: `[True Positives/(True Positives + False Positives)]`

3. **Recall (sensitivity)**: `[True Positives/(True Positives + False Negatives)]`

4. **F1 score**: The harmonic mean of the precision and recall scores: `[2 * Precision * Recall/(Precision + Recall)]`. This score can range from 0 to 1, with 1 being the best possible F1 score.
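
To see how the four metrics relate, the sketch below computes them by hand from a small set of made-up true labels and predictions (1 = spam, 0 = ham). In practice, `sklearn.metrics` provides `accuracy_score`, `precision_score`, `recall_score`, and `f1_score` for the same purpose.

```python
# Hypothetical labels and predictions, for illustration only
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

pairs = list(zip(y_true, y_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # spam correctly flagged
tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # ham correctly passed
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # ham wrongly flagged
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # spam missed

accuracy = (tp + tn) / len(pairs)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(accuracy, precision, recall, f1)  # 0.75 0.75 0.75 0.75
```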

In [None]:
# Evaluation


**Resources:**

<a href="http://www.est.uc3m.es/BayesUC3M/Summer_School_UPM/2017/lecture%20notes/Practical1.pdf">Case Study I: Naive Bayesian spam filtering</a> - Universidad Carlos III de Madrid