<h1 align='center'>AutoSynthesis study group</h1>
<h2 align='right'> Session 3 - Modeling </h2>
<h3 align='right'> 20th February 2019 </h3>
<h3 align='right'> Kazeem </h3>

## Recap of last session

<ul> 
<li> Loading and analysis of data </li>
<li> Basic preprocessing </li>
    <ul>
        <li> Lowercasing </li>
        <li> Stopwords removal </li>
        <li> Stemming </li>
        <li> Lemmatisation </li>
    </ul>
<li> Feature representation </li>
    <ul>
        <li> Bag of words </li>
        <li> Tf-idf </li>
        <li><font color = 'red'><strong> Binary </strong></font></li>
        <li> Word embedding </li>
        <li> N-grams </li>
    </ul>
</ul>

### Load libraries

In [None]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import classification_report, confusion_matrix, precision_score, recall_score, f1_score, accuracy_score
from sklearn.naive_bayes import GaussianNB, ComplementNB, MultinomialNB, BernoulliNB
from nltk.corpus import stopwords

print ('Packages import successful')

### Binary representation

In [None]:
#import pandas as pd
#from sklearn.feature_extraction.text import CountVectorizer
#load data
train = pd.read_csv('../session 2_no password/session 2/AutoSession2.csv')


vectorizer = CountVectorizer(binary = True, stop_words = 'english')
tdm = vectorizer.fit_transform(train['Abstract'])
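words = vectorizer.get_feature_names() #removed in scikit-learn 1.2; use get_feature_names_out() on version 1.0 or later

#build the term-document matrix directly so that the 0/1 values stay numeric
#(stacking the words on top of the matrix with np.vstack turns every value into a string)
tdm_df = pd.DataFrame(data=tdm.toarray(), columns=words)
print (tdm_df.head(5))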
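
The cell above depends on the session 2 CSV file. As a minimal sketch that runs without it, the same binary representation can be tried on a tiny made-up corpus. The three sentences and the name `toy_df` are assumptions for illustration only; they are not part of the session data.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical mini-corpus standing in for train['Abstract']
docs = ["mice were treated with aspirin",
        "aspirin aspirin reduced pain in mice",
        "rats were not treated"]

vectorizer = CountVectorizer(binary=True)
tdm = vectorizer.fit_transform(docs)

# vocabulary_ maps each word to its column index; sort the words by that index
words = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)

# One row per document, one column per word, 1 = word present, 0 = word absent
toy_df = pd.DataFrame(tdm.toarray(), columns=words)
print(toy_df)
```

Note that "aspirin" occurs twice in the second sentence but is still recorded as 1: the binary representation only tracks presence or absence.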

### Refresher
>- Why do we need preprocessing?
>- Why do we need feature representation?

## Modeling - supervised learning

### ML algorithms example 1: Naive Bayes

<p>Bayes' theorem provides a way of calculating the posterior probability P(c|x) from P(c), P(x) and P(x|c), based on the equation below:</p>
<img src = "Bayes_rule.webp">

Where,
<ul>
    <li>P(c|x) is the posterior probability of class (c, target label) given predictor (x, feature).</li>
    <li>P(c) is the prior probability of class.</li>
    <li>P(x|c) is the likelihood which is the probability of predictor given class.</li>
    <li>P(x) is the prior probability of predictor.</li>
</ul>
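
To make the terms concrete, here is a small worked example of the rule P(c|x) = P(x|c) P(c) / P(x). All of the probabilities are hypothetical numbers picked to keep the arithmetic easy; they are not estimated from the session dataset.

```python
# Hypothetical numbers, chosen only to illustrate the arithmetic
p_c = 0.2              # P(c): prior probability that an article is relevant
p_x_given_c = 0.6      # P(x|c): probability that a given word appears in a relevant article
p_x_given_not_c = 0.1  # probability that the same word appears in a non-relevant article

# P(x): overall probability of seeing the word, summed over both classes
p_x = p_x_given_c * p_c + p_x_given_not_c * (1 - p_c)

# Bayes' theorem: P(c|x) = P(x|c) * P(c) / P(x)
p_c_given_x = p_x_given_c * p_c / p_x
print('P(c|x) =', round(p_c_given_x, 3))
```

Seeing the word raises the probability of relevance from the prior of 0.2 to a posterior of 0.6. A Naive Bayes classifier repeats this calculation for every word and assumes, "naively", that the words are independent given the class.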

### ML algorithms example 2: support vector machine

<p>In a support vector machine (SVM), each data item is plotted as a point in an n-dimensional space (where n is the number of features in the TDM), and the value of each feature is the value of the corresponding coordinate. The algorithm classifies the data by finding the hyperplane that best partitions the classes.</p>
<p>The SVM relies only on the data points closest to the hyperplane on either side, called the support vectors, to make predictions.</p>

<img src="SVM_1.png">

<h4>Linear separability and maximum margin</h4>
<div>
    <img src="SVM_21.png">
    <img src="SVM_3.png">
    <img src="SVM_4.png">
</div>
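
The ideas of support vectors and maximum margin can be sketched on a handful of made-up 2-D points. The data and the choice of `C=1000` (a large value that approximates a hard margin) are assumptions for illustration; `support_vectors_` and `coef_` are attributes of a fitted linear `SVC`.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical 2-D points forming two well-separated groups
X = np.array([[1, 1], [2, 1], [1, 2], [4, 4], [5, 4], [4, 5]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = SVC(kernel='linear', C=1000)  # large C approximates a hard margin
clf.fit(X, y)

# Only the points closest to the boundary end up as support vectors
print('Support vectors:\n', clf.support_vectors_)

# For a linear SVM the margin width is 2 / ||w||
w = clf.coef_[0]
margin = 2 / np.linalg.norm(w)
print('Margin width:', margin)

predictions = clf.predict([[0, 0], [6, 6]])
print('Predictions for two new points:', predictions)
```

Removing a point that is not a support vector and refitting should leave the hyperplane unchanged, which is a quick way to check that only the closest points matter.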


<h4>The Kernel trick</h4>
<p>What happens in situations where the data is not linearly separable?</p>
<div>
    <img src="SVM_8.png">
    <img src="SVM_9.png">
</div>
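
A minimal sketch of the question above uses the classic XOR layout, where no straight line can separate the two classes. The four points and the values of `gamma` and `C` are assumptions chosen for illustration, not tuned settings.

```python
import numpy as np
from sklearn.svm import SVC

# XOR-style toy data: opposite corners of the unit square share a class
X = np.array([[0, 0], [1, 1], [0, 1], [1, 0]])
y = np.array([0, 0, 1, 1])

# A linear kernel can only draw a straight line through the square
linear_svm = SVC(kernel='linear', C=100).fit(X, y)

# The RBF kernel implicitly maps the points to a space where they become separable
rbf_svm = SVC(kernel='rbf', gamma=2, C=100).fit(X, y)

linear_acc = linear_svm.score(X, y)
rbf_acc = rbf_svm.score(X, y)
print('Linear kernel training accuracy:', linear_acc)
print('RBF kernel training accuracy:   ', rbf_acc)
```

The linear model cannot classify all four points correctly, while the RBF model can. The kernel does this without ever computing the higher-dimensional coordinates explicitly, which is why it is called a trick.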



## Build text predictive models with ML algorithms

### A typical text classification process

<img src="supervised_learning.png" alt="text classification process" title="text classification" />

### Note - using sample dataset

### Step 1 - load your dataset

In [None]:
data = pd.read_csv('autosynthesis_session3.csv') #set the data path relative to your system and file location
print ('Dataset loaded successfully')
data.head(5) #view some samples

#### Step 1b - explore the dataset to gain insight

Thanks to Lena for her exploratory work on the labelled dataset. The results can be found in the shared session folder named '<em><a href = '../data insight/Lena_Results.ipynb'>data insight</a></em>'.
<p>Note that there are two other Excel files in the same folder.</p>

### Step 2 - correct anomalies and fix NAs

In [None]:
#test for missing values in label
np.unique(pd.isna(data.label), return_counts=True)

In [None]:
#test for missing values in Decision2
np.unique(pd.isna(data.Decision2), return_counts=True)

In [None]:
#There must be no missing values in the data, particularly in the labels. If any exist, fix them.
data.Decision2 = data.Decision2.fillna(0)
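
As a sketch of the checks above, the toy frame below counts missing values for every column in one call and contrasts two common fixes. The frame `toy` and its values are hypothetical; only the column names mirror the ones used in this step.

```python
import numpy as np
import pandas as pd

# Hypothetical frame; the column names mirror the ones used above
toy = pd.DataFrame({'label': ['Yes', 'No', None, 'Yes'],
                    'Decision2': [1.0, np.nan, 0.0, 1.0]})

# Count the missing values in every column at once
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Option 1: fill the gaps with a default value, as fillna(0) does above
filled = toy['Decision2'].fillna(0)

# Option 2: drop rows whose label is missing, since a label cannot be guessed
labelled = toy.dropna(subset=['label'])
print('Rows before and after dropping:', len(toy), len(labelled))
```

Which option is appropriate depends on the column: filling with 0 asserts a value, so it is safer for a feature with a sensible default than for the target label.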

### Step 3 - properly encode the target/label

In [None]:
#from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
data['labels'] = le.fit_transform(data['label'])
print (data['labels'].head(5))

In [None]:
#view label distribution
#import numpy as np
label_freq = np.unique(data.label, return_counts=True)
print("Overall class distribution: \n", label_freq)
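
To see what the encoder does, the sketch below fits a `LabelEncoder` on a few made-up labels. The label values 'Yes' and 'No' are an assumption about the dataset; the point is that the integer codes follow the sorted order of the classes.

```python
from sklearn.preprocessing import LabelEncoder

toy_labels = ['Yes', 'No', 'No', 'Yes', 'No']  # hypothetical label values

toy_le = LabelEncoder()
encoded = toy_le.fit_transform(toy_labels)

# classes_ lists the labels in sorted order; a label's position is its integer code
print('Classes:', list(toy_le.classes_))
print('Encoded:', list(encoded))

# inverse_transform maps the integer codes back to the original labels
decoded = toy_le.inverse_transform(encoded)
print('Decoded:', list(decoded))
```

Keeping the sorted order in mind helps later on, because scikit-learn reports per-class results in that same order.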

### Step 4 - extract required subset of data (if needed)

In [None]:
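#join title, abstract and keywords into one text field (columns accessed by name rather than by position)
data['TiAbs'] = data[['Title', 'Abstract', 'Keywords']].apply(lambda x: '{} {} {}'.format(x['Title'], x['Abstract'], x['Keywords']), axis=1)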

### Step 5 - split data to train and test sets

In [None]:
#from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(data['TiAbs'], data['label'], test_size=0.10, random_state=19)

In [None]:
#check to see that data and labels are of the same size
print ('Train data size: ', X_train.shape)
print ('Test  data size: ', X_test.shape)
print ('Train LABEL size: ', y_train.shape)
print ('Test  LABEL size: ', y_test.shape)

#check the class distribution in the test set
label_freq = np.unique(y_test, return_counts=True)
print("Test set class distribution: \n", label_freq)
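
If the class distribution of the test set drifts far from the overall distribution, one option is a stratified split. The sketch below uses made-up, imbalanced labels; `stratify` is a parameter of `train_test_split`, while the data and variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced dataset: 80 'No' labels and 20 'Yes' labels
X = np.arange(100).reshape(-1, 1)
y = np.array(['No'] * 80 + ['Yes'] * 20)

# stratify=y keeps the class proportions the same in the train and test sets
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.10, random_state=19, stratify=y)

classes, counts = np.unique(y_te, return_counts=True)
test_distribution = dict(zip(classes.tolist(), counts.tolist()))
print('Test set class distribution:', test_distribution)
```

With a 10% test set, stratification matters most for the minority class: without it, a small test set can end up with very few or no minority examples.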

### Step 6 - preprocessing

In [None]:
#optionally write custom preprocessing method.....WHY?
def preprocessor(text):
    #text = text.apply(lambda x: ' '.join(x.lower().replace('[^\w\s]','') for x in str(x).split() if not x in set(stopwords.words('english')) and not x.isdigit()))
    
    # split into words
    from nltk.tokenize import word_tokenize
    tokens = word_tokenize(text)
    # convert to lower case
    tokens = [w.lower() for w in tokens]
    # remove punctuation from each word
    import string
    table = str.maketrans('', '', string.punctuation)
    stripped = [w.translate(table) for w in tokens]
    # remove remaining tokens that are not alphabetic
    words = [word for word in stripped if word.isalpha()]
    # filter out stop words and words of three characters or fewer
    #from nltk.corpus import stopwords
    stop_words = set(stopwords.words('english'))
    words = [w for w in words if w not in stop_words and len(w) > 3]
    
    return ' '.join(words) #return the cleaned text string separated by spaces

#### Step 6a - data cleaning and tokenization

In [None]:
X_train = X_train.apply(preprocessor)
X_test = X_test.apply(preprocessor)

In [None]:
X_test.head(5)

#### Step 6b - feature representation

In [None]:
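#from sklearn.feature_extraction.text import CountVectorizer
#TfidfVectorizer(binary=True) still applies idf weighting and normalisation, so its output is not 0/1;
#CountVectorizer(binary=True) gives true presence/absence features
binary_encoder = CountVectorizer(stop_words='english', binary = True, max_df=0.8, min_df=3, ngram_range=(1, 1))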
binary_train_data = binary_encoder.fit_transform(X_train)
binary_test_data = binary_encoder.transform(X_test)

In [None]:
#from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_encoder = TfidfVectorizer(stop_words='english', max_df=0.8, min_df=3, ngram_range=(1, 1))
tfidf_train_data = tfidf_encoder.fit_transform(X_train)
tfidf_test_data = tfidf_encoder.transform(X_test)
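
One point worth checking when building binary features: in scikit-learn, the `binary=True` option of `TfidfVectorizer` only makes the term-frequency part binary, so the output is still weighted and normalised. The sketch below compares the encoders on a made-up corpus; the three documents are an assumption for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical mini-corpus
docs = ["aspirin aspirin mice", "mice rats", "rats pain aspirin"]

# binary=True on its own: tf is 0/1, but idf weighting and L2 normalisation still apply
tfidf_binary = TfidfVectorizer(binary=True).fit_transform(docs).toarray()

# CountVectorizer with binary=True: pure presence/absence
count_binary = CountVectorizer(binary=True).fit_transform(docs).toarray()

# Switching off idf and normalisation also yields 0/1 values
tfidf_01 = TfidfVectorizer(binary=True, use_idf=False, norm=None).fit_transform(docs).toarray()

print('TfidfVectorizer(binary=True) values:', np.unique(tfidf_binary))
print('CountVectorizer(binary=True) values:', np.unique(count_binary))
print('Same as tf-idf without idf/norm:    ', np.array_equal(tfidf_01, count_binary))
```

This matters for models such as `BernoulliNB(binarize=None)`, which assume the input already consists of 0/1 values.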

### Step 7 - fit a model

#### A. Binary features

In [None]:
#from sklearn.naive_bayes import GaussianNB, ComplementNB, MultinomialNB, BernoulliNB
#MultinomialNB(alpha=1.0, fit_prior=True, class_prior=None)
#SVC(C=1.0, kernel='rbf', degree=3, gamma='auto_deprecated', coef0=0.0, shrinking=True, probability=False, tol=0.001,
# cache_size=200, class_weight=None, verbose=False, max_iter=-1, decision_function_shape='ovr', random_state=None)
#---------------------------------------------------------------------------------------------------------

gnb = GaussianNB()
bnb = BernoulliNB(binarize = None)#if dataset is already in binary form
mnb = MultinomialNB()
cnb = ComplementNB()
svm = SVC(C = 10, kernel = 'linear', class_weight=None, gamma = 'scale', random_state=None)


#train each model using the training data
gnb_model = gnb.fit(binary_train_data.toarray(), y_train)
print ('Done fitting Gaussian NB model')
bnb_model = bnb.fit(binary_train_data.toarray(), y_train)
print ('Done fitting Bernoulli NB model')
mnb_model = mnb.fit(binary_train_data.toarray(), y_train)
print ('Done fitting Multinomial NB model')
cnb_model = cnb.fit(binary_train_data.toarray(), y_train)
print ('Done fitting Complement NB model')
svm_model = svm.fit(binary_train_data.toarray(), y_train)
print ('Done fitting SVM model')

print ('------------------------------------------------------------------------ \n')
print ('Finished fitting all models')
print ('Trained models ready for prediction on new data \n')
print ('------------------------------------------------------------------------ \n')

#use each model to predict on new data
gnb_prediction = gnb_model.predict(binary_test_data.toarray())
print ('Done predicting with Gaussian NB model')
bnb_prediction = bnb_model.predict(binary_test_data.toarray())
print ('Done predicting with Bernoulli NB model')
mnb_prediction = mnb_model.predict(binary_test_data.toarray())
print ('Done predicting with Multinomial NB model')
cnb_prediction = cnb_model.predict(binary_test_data.toarray())
print ('Done predicting with Complement NB model')
svm_prediction = svm_model.predict(binary_test_data.toarray())
print ('Done predicting with SVM model')

In [None]:
print ('Gaussian NB model: \n', gnb_prediction)

In [None]:
print ('Bernoulli NB model: \n', bnb_prediction)

In [None]:
print ('Multinomial  NB model: \n', mnb_prediction)

In [None]:
print ('Complement NB model: \n', cnb_prediction)

In [None]:
print ('SVM model: \n', svm_prediction)

### Model assessment

#### Some basic metrics

<img src = 'conf_mat.png'>
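
The basic metrics can all be derived from the four cells of a binary confusion matrix. The sketch below uses made-up true and predicted labels (1 = positive class, 0 = negative class) and checks the hand-computed values against scikit-learn's own functions.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical true labels and predictions
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 0, 1, 0, 1]

# For binary labels, ravel() returns the cells in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)           # share of all predictions that are correct
precision = tp / (tp + fp)                           # share of predicted positives that are correct
recall = tp / (tp + fn)                              # share of actual positives that were found
f1 = 2 * precision * recall / (precision + recall)   # harmonic mean of precision and recall

print('tn, fp, fn, tp:', tn, fp, fn, tp)
print('Accuracy: ', accuracy)
print('Precision:', precision, '| sklearn:', precision_score(y_true, y_pred))
print('Recall:   ', recall, '| sklearn:', recall_score(y_true, y_pred))
print('F1:       ', f1)
```

For a screening task, recall is often the metric to watch, since a false negative means a relevant article is missed.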

In [None]:
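#from sklearn.metrics import classification_report, confusion_matrix, precision_score, recall_score, f1_score, accuracy_score
accuracy = accuracy_score(y_test, cnb_prediction)
precision = precision_score(y_test, cnb_prediction, average= 'micro')
recall = recall_score(y_test, cnb_prediction, average= 'micro')
print ('Complement NB RESULT ........')
print ('Accuracy: ', accuracy)
print('PRECISION: ', precision)
print('RECALL: ', recall)
print(confusion_matrix(y_test, cnb_prediction)) #printed explicitly; only the last expression of a cell is displayed
#target_names are assigned to the class labels in sorted order, so check that they line up with your labels
print(classification_report(y_test, cnb_prediction, target_names=['Yes', 'No']))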

In [None]:
#rows are the true labels (Actual), columns are the model predictions (Predicted)
print ('Gaussian NB model: \n')
pd.crosstab(y_test, gnb_prediction, rownames=['Actual'], colnames=['Predicted'], margins=True)

In [None]:
print ('Bernoulli NB model: \n')
pd.crosstab(y_test, bnb_prediction, rownames=['Actual'], colnames=['Predicted'], margins=True)

In [None]:
print ('Multinomial NB model: \n')
pd.crosstab(y_test, mnb_prediction, rownames=['Actual'], colnames=['Predicted'], margins=True)

In [None]:
print ('Complement NB model: \n')
pd.crosstab(y_test, cnb_prediction, rownames=['Actual'], colnames=['Predicted'], margins=True)

In [None]:
print ('SVM model: \n')
pd.crosstab(y_test, svm_prediction, rownames=['Actual'], colnames=['Predicted'], margins=True)

In [None]:
test_data = X_test
test_data = pd.DataFrame(test_data)

test_data['Article_ID'] = data.loc[test_data.index, data.columns[0]] #first column of data, matched on the row index
test_data['true label'] = y_test
test_data['gnb_binary'] = gnb_prediction
test_data['bnb_binary'] = bnb_prediction
test_data['mnb_binary'] = mnb_prediction
test_data['cnb_binary'] = cnb_prediction
test_data['svm_binary'] = svm_prediction

In [None]:
test_data

#### B. tfidf features

In [None]:
gnb = GaussianNB()
bnb = BernoulliNB()#default binarize=0.0 turns every non-zero tf-idf value into 1
mnb = MultinomialNB()
cnb = ComplementNB()
svm = SVC(C = 10, kernel = 'linear', class_weight=None, gamma = 'scale', random_state=None)


#train each model using the training data
gnb_model = gnb.fit(tfidf_train_data.toarray(), y_train)
print ('Done fitting Gaussian NB model')
bnb_model = bnb.fit(tfidf_train_data.toarray(), y_train)
print ('Done fitting Bernoulli NB model')
mnb_model = mnb.fit(tfidf_train_data.toarray(), y_train)
print ('Done fitting Multinomial NB model')
cnb_model = cnb.fit(tfidf_train_data.toarray(), y_train)
print ('Done fitting Complement NB model')
svm_model = svm.fit(tfidf_train_data, y_train)
print ('Done fitting SVM model')

#use model to predict on new data
gnb_prediction = gnb_model.predict(tfidf_test_data.toarray())
bnb_prediction = bnb_model.predict(tfidf_test_data.toarray())
mnb_prediction = mnb_model.predict(tfidf_test_data.toarray())
cnb_prediction = cnb_model.predict(tfidf_test_data.toarray())
svm_prediction = svm_model.predict(tfidf_test_data)

In [None]:
print ('Gaussian NB model: \n', gnb_prediction)

In [None]:
print ('Bernoulli NB model: \n', bnb_prediction)

In [None]:
print ('Multinomial  NB model: \n', mnb_prediction)

In [None]:
print ('Complement NB model: \n', cnb_prediction)

In [None]:
print ('SVM model: \n', svm_prediction)

In [None]:
#from sklearn.metrics import classification_report, confusion_matrix, precision_score, recall_score, f1_score, accuracy_score
accuracy = accuracy_score(y_test, svm_prediction)
precision = precision_score(y_test, svm_prediction, average= 'micro')
recall = recall_score(y_test, svm_prediction, average= 'micro')
confusion_matrix(y_test, svm_prediction)

print ('SVM RESULT ........')
print ('Accuracy: ', accuracy)
print('PRECISION: ', precision)
print('RECALL: ', recall)
print(classification_report(y_test, svm_prediction, target_names=['Yes', 'No']))

In [None]:
#rows are the true labels (Actual), columns are the model predictions (Predicted)
print ('Gaussian NB model: \n')
pd.crosstab(y_test, gnb_prediction, rownames=['Actual'], colnames=['Predicted'], margins=True)

In [None]:
print ('Bernoulli NB model: \n')
pd.crosstab(y_test, bnb_prediction, rownames=['Actual'], colnames=['Predicted'], margins=True)

In [None]:
print ('Multinomial NB model: \n')
pd.crosstab(y_test, mnb_prediction, rownames=['Actual'], colnames=['Predicted'], margins=True)

In [None]:
print ('Complement NB model: \n')
pd.crosstab(y_test, cnb_prediction, rownames=['Actual'], colnames=['Predicted'], margins=True)

In [None]:
print ('SVM model: \n')
pd.crosstab(y_test, svm_prediction, rownames=['Actual'], colnames=['Predicted'], margins=True)

In [None]:
#in case you need the individual counts for further processing (this unpacking only works for two classes)
tn, fp, fn, tp = confusion_matrix(y_test, svm_prediction).ravel()

In [None]:
tn, fp, fn, tp

In [None]:
test_data['gnb_tfidf'] = gnb_prediction
test_data['bnb_tfidf'] = bnb_prediction
test_data['mnb_tfidf'] = mnb_prediction
test_data['cnb_tfidf'] = cnb_prediction
test_data['svm_tfidf'] = svm_prediction

In [None]:
test_data

#### Personal tasks