# Introduction
Hello everyone, and welcome to this kernel and tutorial. I have recently started learning **boosting algorithms** and wanted to write a simple, implementation-oriented tutorial about the first successful boosting algorithm: **Adaboost**.

After this kernel, you will have learnt:
* How Adaboost and boosting classifiers work
* How to train an Adaboost classifier with decision trees
* What Bag of Words is and what its disadvantages are
* What TF-IDF is and how to implement it using sklearn
* How to save a sklearn model and vectorizer using Python's pickle

# Notebook Content

## Theory
* Simply Explained: Adaboost Classifier
* Simply Explained: Bag of Words (BoW)
* Simply Explained: TF-IDF Vectorization


## Implementation
* Preparing Environment
* Step 1: Data Analyses
* Step 2: Cleaning Text
* Step 3: Implementing TF-IDF
* Step 4: Building and Training Adaboost Classifier
* Step 5: Saving Model and Defining Test Function


# Theory
In this main section, we will cover how things work. I won't explain everything in depth; you can learn the details from articles. The main goal of this kernel is to help you understand concepts such as TF-IDF.

I will explain everything with graphs drawn in **MS Paint :)**

## Simply Explained: Adaboost Classifier

Adaboost is the first successful boosting algorithm, invented in 1996 by **Robert Schapire and Yoav Freund**. You are probably asking: great, but what is a boosting algorithm?

Boosting classifiers build a strong classifier by combining many weak learners. I know this was a confusing sentence, so let me show you how it works with 2-D charts.

![dataset.png](attachment:dataset.png)

* This is our dataset. It contains two classes: the blue dots are class A and the others are class B.
* We want to classify them using only simple vertical and horizontal lines, such as:


![1.png](attachment:1.png)
* But this is a bad line, so let's start over. As we can see, most of the left side of the chart contains class B dots, so we can draw a vertical line to split off most of the class B dots.

![wk1.png](attachment:wk1.png)
* Yes, the **weak learner** concept is just that: a single simple split. As you probably know, we can make such a split with a decision tree.
* But we still need more **weak learners**. As you can see, most of the points at the top are class B, so we can draw another **weak learner** there.

![wk2.png](attachment:wk2.png)
* Almost ready. Now we just need to rescue the blue point at the top.

![finish.png](attachment:finish.png)
* Finally, we have finished training our classifier with three decision tree estimators (weak learners). If a dot is in zone A or zone B, its class must be B.
* And if a dot is in zone C, D or E, its class must be A.
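If you want to see the idea in code, here is a minimal from-scratch sketch of the textbook Adaboost loop on a tiny 2-D dataset. The coordinates, the function names and the `+1`/`-1` label encoding (class A / class B) are my own assumptions for illustration; they are not taken from the charts above, and this is not how sklearn implements it internally.

```python
import math

# Toy 2-D dataset (hypothetical coordinates): +1 stands for class A, -1 for class B.
points = [(1, 1), (1, 4), (2, 2), (2, 5), (4, 5), (5, 5),   # class B
          (4, 1), (5, 2), (4, 3), (5, 3)]                    # class A
labels = [-1, -1, -1, -1, -1, -1, 1, 1, 1, 1]


def stump_predict(axis, threshold, sign, point):
    """A weak learner: one vertical (axis=0) or horizontal (axis=1) line."""
    return sign if point[axis] >= threshold else -sign


def best_stump(weights):
    """Return (weighted_error, axis, threshold, sign) of the best single line."""
    best = None
    for axis in (0, 1):
        for threshold in sorted({p[axis] for p in points}):
            for sign in (1, -1):
                error = sum(w for w, p, y in zip(weights, points, labels)
                            if stump_predict(axis, threshold, sign, p) != y)
                if best is None or error < best[0]:
                    best = (error, axis, threshold, sign)
    return best


def train_adaboost(n_rounds):
    weights = [1.0 / len(points)] * len(points)        # every point starts equally important
    ensemble = []
    for _ in range(n_rounds):
        error, axis, threshold, sign = best_stump(weights)
        error = min(max(error, 1e-10), 1 - 1e-10)      # avoid log(0) and division by zero
        alpha = 0.5 * math.log((1 - error) / error)    # vote strength of this line
        ensemble.append((alpha, axis, threshold, sign))
        # Misclassified points get heavier, so the next line focuses on them.
        weights = [w * math.exp(-alpha * y * stump_predict(axis, threshold, sign, p))
                   for w, p, y in zip(weights, points, labels)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble


def predict(ensemble, point):
    """Weighted vote of all the lines."""
    score = sum(alpha * stump_predict(axis, threshold, sign, point)
                for alpha, axis, threshold, sign in ensemble)
    return 1 if score >= 0 else -1


ensemble = train_adaboost(n_rounds=3)
predictions = [predict(ensemble, p) for p in points]
accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)
print(ensemble)
print("training accuracy:", accuracy)
```

Each round picks the single line with the lowest weighted error, then increases the weights of the points that line got wrong. No single line separates this toy data, but the weighted vote of three lines does.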

That is everything you need to get started with Adaboost; you can (and should) learn the details from articles. Let's move on to the text feature extraction methods.



## Simply Explained: Bag of Words
To understand why TF-IDF feature extraction is important, I must first explain what Bag of Words (BoW) feature extraction is.

In the bag of words approach, we create a sparse matrix (a matrix in which most of the elements are zero) from the sentences. Each feature will be a word. Let's take a look at the example below:

We have a dataset that contains these 6 sentences:
* Today I am going to study data science
* Tomorrow you will go
* I am interested in data science
* You was a good man
* I just wanted to be a good man
* Are you interested in statistics?

I've said each word will be a feature, so (after reducing words to their base form and dropping *to* and *a*) our features will be:

    TODAY I AM GO STUDY DATA SCIENCE TOMORROW YOU WILL INTERESTED IN WAS GOOD MAN JUST WANT BE ARE STATISTICS
    
If the given sentence contains the feature, its value will be 1, and if not, its value will be 0. Let's create our Bag of Words vectors:

       TODAY  I  AM  GO  STUDY  DATA  SCIENCE  TOMORROW  YOU  WILL  INTERESTED  IN  WAS  GOOD  MAN  JUST  WANT  BE  ARE  STATISTICS
    1      1  1   1   1      1     1        1         0    0     0           0   0    0     0    0     0     0   0    0           0
    2      0  0   0   1      0     0        0         1    1     1           0   0    0     0    0     0     0   0    0           0
    3      0  1   1   0      0     1        1         0    0     0           1   1    0     0    0     0     0   0    0           0
    4      0  0   0   0      0     0        0         0    1     0           0   0    1     1    1     0     0   0    0           0
    5      0  1   0   0      0     0        0         0    0     0           0   0    0     1    1     1     1   1    0           0
    6      0  0   0   0      0     0        0         0    1     0           1   1    0     0    0     0     0   0    1           1
    


The first sentence is *Today I am going to study data science*. According to the matrix, this sentence contains *today*, *I*, *am*, *go*, *study*, *data* and *science*.
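The same construction can be sketched in a few lines of plain Python. This is only an illustration under simple assumptions: the `tokenize` helper is my own, it does not reduce words to their base form and it keeps *to* and *a*, so the vocabulary comes out a little larger than the hand-made feature list.

```python
import re

sentences = [
    "Today I am going to study data science",
    "Tomorrow you will go",
    "I am interested in data science",
    "You was a good man",
    "I just wanted to be a good man",
    "Are you interested in statistics?",
]


def tokenize(sentence):
    """Lowercase the sentence and keep only runs of letters."""
    return re.findall(r"[a-z]+", sentence.lower())


# Vocabulary: every distinct word, in order of first appearance.
vocabulary = []
for sentence in sentences:
    for word in tokenize(sentence):
        if word not in vocabulary:
            vocabulary.append(word)


def bow_vector(sentence):
    """Binary bag-of-words vector: 1 if the word occurs in the sentence, else 0."""
    words = set(tokenize(sentence))
    return [1 if term in words else 0 for term in vocabulary]


bow_matrix = [bow_vector(sentence) for sentence in sentences]

print(vocabulary)
for row in bow_matrix:
    print(row)
```

Notice how most entries in every row are zero. That is the sparsity mentioned earlier, and it grows quickly as the vocabulary gets bigger.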

That is all there is to Bag of Words. Let's move on to TF-IDF feature extraction.

## Simply Explained: Term Frequency x Inverse Document Frequency 

In the previous section, we learnt about bag of words. There we used binary features (features that only show whether the sentence contains the word or not), but as you can guess, this can cause real problems: a word that appears in almost every sentence gets the same value as a rare, informative word.

In TFxIDF each feature will be a word as well, but we won't use binary values. We'll use a formula to determine the TF-IDF score of each feature.

=> The value of feature **i** for sentence **j**
#### TF-IDF Score: TF(i,j) * IDF(i)

**Term Frequency(i,j)** = number of times word **i** occurs in sentence **j** / number of words in sentence **j**

**Inverse Document Frequency(i)** = log2(number of sentences in the dataset / number of sentences that contain word **i**)

That's all there is to it. Let's work through an example.

* We have a dataset that contains 500 sentences, and all of the sentences contain the word **I**. (What a selfish dataset, ha!)
* Let's calculate the TF-IDF score of the **I** feature for the sentence below:
        I am really interested in gradient boosting

* The TF score will be 1/7 (the word occurs once and the sentence has 7 words)
* The inverse document frequency will be log2(500/500) = 0
* So its TF-IDF score will be 0, which means **I** does not carry any special meaning.
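The arithmetic of this example can be checked with a small sketch. The function names are my own, and the second calculation assumes, purely hypothetically, that the word *boosting* appears in only 5 of the 500 sentences. The example sentence has seven words, so the term frequency of a word that occurs once is 1/7.

```python
import math


def term_frequency(word, sentence_words):
    """How often the word occurs in the sentence, divided by the sentence length."""
    return sentence_words.count(word) / len(sentence_words)


def inverse_document_frequency(n_sentences, n_sentences_with_word):
    """log2(total sentences / sentences containing the word)."""
    return math.log2(n_sentences / n_sentences_with_word)


sentence_words = "i am really interested in gradient boosting".split()

# "i" appears in all 500 sentences, so its IDF (and its TF-IDF score) is 0.
tf_i = term_frequency("i", sentence_words)
idf_i = inverse_document_frequency(500, 500)
score_i = tf_i * idf_i

# Hypothetical: suppose "boosting" appears in only 5 of the 500 sentences.
tf_boosting = term_frequency("boosting", sentence_words)
idf_boosting = inverse_document_frequency(500, 5)
score_boosting = tf_boosting * idf_boosting

print("I:", tf_i, idf_i, score_i)
print("boosting:", tf_boosting, idf_boosting, score_boosting)
```

Both words occur once in the sentence, so they have the same TF, but the rarer word ends up with a much higher score. Keep in mind that sklearn's `TfidfVectorizer` uses a smoothed, natural-log version of IDF and normalizes each row by default, so its numbers will not match this hand calculation exactly.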

I hope that made sense. Let's start implementing.


# Implementation
In this main section we will implement everything we have learnt. Let's start!

## Preparing Environment
First we'll import the libraries, then we'll load our dataset.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import nltk
from nltk.corpus import stopwords
import re
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score,confusion_matrix
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
import pickle
import seaborn as sns
import time

In [None]:
data = pd.read_csv('../input/sentiment-analysis-for-financial-news/all-data.csv',encoding="latin1",header=None)
data.columns = ["Label","Text"]
data.head()

## Step 1: Data Analyses
In this section we'll take a look at the data.

* Let's check how the labels are distributed

In [None]:
plt.subplots(figsize=(5,5))
sns.countplot(x=data["Label"])
plt.show()

* The dataset is unbalanced: most of it is neutral, as we might expect.
* This may cause problems, because a classifier can reach a decent accuracy just by favouring the majority class.
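To make the problem concrete, here is a small sketch of the "always predict the majority class" baseline. The label counts are hypothetical and chosen only for illustration; they are not the real counts of this dataset.

```python
from collections import Counter

# Hypothetical label counts, chosen only to illustrate an unbalanced dataset.
labels = ["neutral"] * 600 + ["positive"] * 280 + ["negative"] * 120

counts = Counter(labels)
majority_label, majority_count = counts.most_common(1)[0]

# A "classifier" that always answers with the majority label gets this accuracy for free.
baseline_accuracy = majority_count / len(labels)

print(counts)
print("Always predicting '{}' gives accuracy {:.2f}".format(majority_label, baseline_accuracy))
```

Any accuracy we report later should be compared against a baseline like this, otherwise a number such as 60% can look better than it really is.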

* Next, let's take a look at the text lengths by class.

In [None]:
data["len"] = [len(text) for text in data["Text"].values]

data.groupby("Label")["len"].mean()

* As we can see, the length of a sentence is not related to its class.

Let's move on to the next section and preprocess the texts.

## Step 2: Cleaning Texts
In this section we'll clean the texts. To do so, we will define a function.

In [None]:
def cleanText(text):
    
    # Assumes the NLTK resources used below (stopwords, wordnet, punkt) are already downloaded
    lemma = WordNetLemmatizer()
    stp = set(stopwords.words('english'))  # a set makes the membership test below fast
    
    # Replace everything except letters and digits with a space
    text = re.sub("[^a-zA-Z0-9]"," ",text)
    
    text = text.lower()
    
    # Split the sentence into words ("i am good" => ["i","am","good"])
    text = nltk.word_tokenize(text)
    
    # Remove stopwords: very common words with little meaning of their own, such as i, you, me, was.
    # We do this before lemmatizing, because the noun lemmatizer can turn "was" into "wa",
    # which would then slip past the stopword list
    text = [word for word in text if word not in stp]
    
    # Lemmatizers convert words to their base form using a dictionary.
    # With the default part of speech (noun): bees => bee, dog => dog;
    # verbs such as "going" stay unchanged unless pos="v" is passed
    text = [lemma.lemmatize(word) for word in text]
    
    # Everything is ready, now we just need to join the list elements (["feel","good"] => "feel good")
    text = " ".join(text)
    
    return text

* Let's try our function

In [None]:
cleanText("Nowadays I am interested in traditional text feature extraction methods, because I want to learn foundations")

* Now let's clean all of our texts.

In [None]:
start_time = time.time()
cleanedText = []
for text in data["Text"]:
    
    cleanedText.append(cleanText(text))
end_time = time.time()
process_time = round(end_time-start_time,2)

print("="*10)
print("Texts are cleaned, this process took {} seconds \n \n".format(process_time))

print(cleanedText[0])


* Before moving on to the next section, let's encode our labels. Let negatives be 0, neutrals be 1 and positives be 2

In [None]:
data["Label"].value_counts()

In [None]:
y = []
for label in data["Label"]:
    
    if label=="negative":
        y.append(0)
        
    elif label=="positive":
        y.append(2)
        
    elif label=="neutral":
        y.append(1)

y = np.asarray(y)
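One thing to keep in mind with the loop above: a label that matches none of the three branches is silently skipped, so `y` could end up shorter than the data. A dictionary-based variant fails loudly instead. This is only a sketch; `LABEL_TO_ID`, `ID_TO_LABEL` and `encode_labels` are names I made up for it.

```python
# Same encoding as above: negative -> 0, neutral -> 1, positive -> 2
LABEL_TO_ID = {"negative": 0, "neutral": 1, "positive": 2}
ID_TO_LABEL = {label_id: label for label, label_id in LABEL_TO_ID.items()}


def encode_labels(labels):
    """Map label strings to integers; an unknown label raises KeyError instead of being skipped."""
    return [LABEL_TO_ID[label] for label in labels]


sample_labels = ["neutral", "negative", "positive", "neutral"]
encoded = encode_labels(sample_labels)
decoded = [ID_TO_LABEL[label_id] for label_id in encoded]

print(encoded)
print(decoded)
```

The reverse dictionary is also handy later, when a predicted number has to be turned back into a sentiment name.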

## Step 3: Implementing TF-IDF Vectorizing
In this section we'll vectorize our cleaned texts using TF-IDF approach.

In [None]:
# First, we need a vectorizer object.
# max_features=4000 means: only keep the 4000 most frequent words in the corpus
vectorizer = TfidfVectorizer(max_features=4000)

start = time.time()

x = vectorizer.fit_transform(cleanedText).toarray()

process_time = round(time.time()-start,2)

print("Vectorizing cleaned text using TF-IDF approach took {} seconds".format(process_time))

x.shape
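To get a feeling for what `max_features` does, here is a rough plain-Python sketch of the idea: count how often each word occurs in the whole corpus and keep only the top ones. The tiny corpus is hypothetical, and sklearn's own tokenization and tie-breaking may differ in details.

```python
from collections import Counter

# Hypothetical, already-cleaned texts
cleaned = [
    "profit rose quarter",
    "profit fell",
    "company profit rose",
    "company said",
]

# Count every word across the whole corpus
term_counts = Counter(word for text in cleaned for word in text.split())

max_features = 3
kept_vocabulary = sorted(word for word, _ in term_counts.most_common(max_features))

print(term_counts)
print(kept_vocabulary)
```

Words outside the kept vocabulary are simply ignored when a text is vectorized, which is why the second dimension of `x.shape` cannot be larger than `max_features`.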

* The last step before modeling is splitting the dataset into train and test sets.

In [None]:
x_train,x_test,y_train,y_test = train_test_split(x,y,random_state=42,test_size=0.2)

print(x_train.shape)
print(x_test.shape)
print(y_train.shape)
print(y_test.shape)

## Step 4: Building and Training Adaboost Classifier
In this section we'll train our Adaboost classifier using vectorized texts.

In [None]:
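# We'll use 100 weak learners to build a strong learner.
# The tree is passed positionally: the keyword was renamed from base_estimator
# to estimator in newer scikit-learn versions.
classifier = AdaBoostClassifier(DecisionTreeClassifier(),n_estimators=100)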

classifier.fit(x_train,y_train)

* And now let's test our classifier

In [None]:
y_pred = classifier.predict(x_test)

print("Test set accuracy of our Adaboost Classifier is {}%".format(round(accuracy_score(y_test,y_pred)*100,2)))

plt.subplots(figsize=(6,6))
# Rows are the true labels, columns are the predicted labels
sns.heatmap(confusion_matrix(y_true=y_test,y_pred=y_pred),annot=True,fmt="d",linewidths=1.5,cmap="BuPu_r")
plt.show()

* Most likely because our dataset is unbalanced, we couldn't train our classifier very well.
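A quick way to see how imbalance hurts is to compute the recall of each class from the confusion matrix. The matrix below is hypothetical (it is not the output of this notebook); it only follows sklearn's convention that rows are true classes and columns are predicted classes.

```python
# Hypothetical confusion matrix: rows = true class, columns = predicted class
class_names = ["negative", "neutral", "positive"]
confusion = [
    [40,  60,  20],   # true negative
    [15, 520,  45],   # true neutral
    [10, 130, 130],   # true positive
]

# Recall of a class = correctly predicted samples of that class / all samples of that class
recall = {
    name: row[i] / sum(row)
    for i, (name, row) in enumerate(zip(class_names, confusion))
}

total = sum(sum(row) for row in confusion)
accuracy = sum(confusion[i][i] for i in range(len(confusion))) / total

print("accuracy:", round(accuracy, 3))
for name in class_names:
    print(name, "recall:", round(recall[name], 3))
```

In a matrix like this one the overall accuracy looks acceptable, while the minority classes are recognized far less often than the majority class. That is the pattern to look for in the heatmap above.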

## Step 5: Saving Model and Defining Test Function
In this section we'll save the model and the vectorizer using pickle object serialization, and write a function that takes a text as its parameter and returns its sentiment (neutral, positive or negative).

In [None]:
with open("adaboost.pickle","wb") as f:
    pickle.dump(classifier,f)
with open("vectorizer.pickle","wb") as f:
    pickle.dump(vectorizer,f)
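If you have not used pickle before, this is the whole save-and-load round trip in miniature. The dictionary is only a stand-in for the trained classifier or vectorizer, and the file name is hypothetical.

```python
import os
import pickle
import tempfile

# Stand-in object; in the notebook this would be the trained classifier or the vectorizer.
model_like = {"name": "adaboost", "n_estimators": 100}

# Hypothetical file name inside the system's temporary directory
path = os.path.join(tempfile.gettempdir(), "example_model.pickle")

# "with" closes the file as soon as the block ends, so the data is fully written
with open(path, "wb") as f:
    pickle.dump(model_like, f)

with open(path, "rb") as f:
    restored = pickle.load(f)

os.remove(path)  # clean up the example file

print(restored)
```

One word of caution: unpickling can execute arbitrary code, so only load pickle files that you created yourself or that come from a source you trust.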
            

In [None]:
def analyseText(text):
    
    with open("adaboost.pickle","rb") as f:
        cls = pickle.load(f)
    with open("vectorizer.pickle","rb") as f:
        vct = pickle.load(f)
    
    # First we need to clean the text given
    text = cleanText(text)
    
    # Then we need to vectorize the text
    text = vct.transform([text])
    
    # And let's predict results using vector
    pred = cls.predict(text)
    
    decision = "neutral"
    
    if pred[0] == 0:
        decision = "negative"
        
    elif pred[0] == 2:
        decision = "positive"
        
    return decision
        

* Everything is ready, so let's use our function

In [None]:
analyseText("Rental of building equipment accounted for 88 percent of the operating income ")

In [None]:
analyseText("O'Leary 's Material Handling Services , located in Perth , is the leading company in Western Australia that supplies , installs and provides service for tail lifts .")


# Conclusion
Thanks for your attention. If you have any questions, feel free to ask in the comments. Also, if you liked this kernel, please upvote to motivate me!

Have a great day!