# A typical data science pipeline for text classification, step by step

# Step zero: Problem definition

We are dealing with a supervised learning classification problem with a binary output (the target classes): either True or False. The data we are manipulating is extracted from a CSV file and is in text format. Therefore, some Natural Language Processing techniques will be applied when preparing our dataset before modeling.

NB: The data is imbalanced: the classes are not equally distributed (approximately 70-30%), which is a frequent problem in classification. To evaluate the impact of this imbalance on the output of our model, we can use the confusion matrix to see whether the dominant class is predicted too often when the actual class is the minority one.
However, we will not address this problem here. For some useful techniques, see: https://machinelearningmastery.com/tactics-to-combat-imbalanced-classes-in-your-machine-learning-dataset/
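
To make the confusion-matrix check concrete, here is a minimal sketch in plain Python. The labels are made up for illustration (70 False and 30 True, mirroring the approximate split mentioned above), and the "model" is a hypothetical one that always predicts the dominant class.

```python
from collections import Counter

# Hypothetical labels: 70 samples of the dominant class, 30 of the minority class
y_true = [False] * 70 + [True] * 30
# A degenerate "model" that always predicts the dominant class
y_pred = [False] * len(y_true)

# Count (actual, predicted) pairs: this is the confusion matrix in dictionary form
counts = Counter(zip(y_true, y_pred))
tn, fp = counts[(False, False)], counts[(False, True)]
fn, tp = counts[(True, False)], counts[(True, True)]

accuracy = (tp + tn) / len(y_true)
minority_recall = tp / (tp + fn)

# Rows are actual classes, columns are predicted classes
print([[tn, fp], [fn, tp]])
print("accuracy:", accuracy)
print("recall on the minority class:", minority_recall)
```

The accuracy looks acceptable even though every minority sample is missed, which is why the bottom-left cell of the matrix (actual True, predicted False) deserves attention on imbalanced data.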

# Importing Libraries

In [1]:
import os
import re  
import nltk 

import pandas as pd
import numpy as np  
import matplotlib.pyplot as plt

from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords  

from sklearn.feature_extraction.text import TfidfVectorizer 
from sklearn.model_selection import train_test_split  
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn import model_selection
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning) 


In [2]:
nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package stopwords to
[nltk_data]     /home/zeroeffort/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /home/zeroeffort/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

# Step one: Data collection (importing the dataset)

In [3]:
def import_data(input_path):
    # Fail early if the input file does not exist
    assert os.path.isfile(input_path), "No valid input file: file does not exist"
    # Let pandas raise on a malformed file instead of silently swallowing the error
    # (the original `finally: return(df)` would fail with an unbound `df` in that case)
    df = pd.read_csv(input_path, encoding="ISO-8859-1")
    return df

# Step two: Text Preprocessing

Working with raw text data is not a winning shot! First, we need to clean our data, then vectorize it (transform it into a numerical matrix so it can be passed to our ML algorithm).

1-Many techniques for text data cleaning exist. Most often, we will need to apply an NLP pipeline: sentence segmentation, tokenization, lemmatization, removal of special characters, case conversion, etc. The following class introduces some of the most used techniques with libraries such as nltk, re and spaCy, which offers an interesting packaged pipeline.

2-Different approaches exist to convert text into the corresponding numerical form. The Bag of Words model and the word embedding model are two of the most commonly used approaches. To save time, we will only use the TF-IDF weighting provided by sklearn (TfidfVectorizer). The fastText library and spaCy also provide high-quality word vectors to represent text data.
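
As a rough illustration of what TF-IDF vectorization computes, the sketch below builds the vectors by hand for a made-up three-document corpus. It assumes the smoothed idf formula `ln((1 + n) / (1 + df)) + 1` followed by L2 normalization, which is the variant scikit-learn documents as its default; it is a simplified stand-in, not a replacement for `TfidfVectorizer`.

```python
import math
from collections import Counter

# Toy corpus, made up for illustration
docs = ["the cat sat", "the dog sat", "the bird flew"]
tokenized = [d.split() for d in docs]
vocab = sorted({w for doc in tokenized for w in doc})
n_docs = len(docs)

# Document frequency: in how many documents each term appears
df = Counter(w for doc in tokenized for w in set(doc))

def tfidf_vector(tokens):
    tf = Counter(tokens)
    # Term frequency times smoothed inverse document frequency
    raw = [tf[w] * (math.log((1 + n_docs) / (1 + df[w])) + 1) for w in vocab]
    # L2-normalise so that document length does not dominate
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm for v in raw]

X = [tfidf_vector(doc) for doc in tokenized]
for doc, row in zip(docs, X):
    print(doc, [round(v, 2) for v in row])
```

Terms that appear in every document (here "the") receive the lowest weight, while terms unique to one document receive the highest.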


In [4]:
class text_prep():
    
    def __init__(self, cleaning= 'nltk', input_path= None):
        assert cleaning in ['nltk', 'spacy'], "cleaning value must be either 'nltk' or 'spacy'."
        assert input_path is not None, "No input path given"
    
        text_cleaning = { 'nltk': self.text_cleaning_nltk, 'spacy': self.text_cleaning_spacy}
       
            
        self.df= import_data(input_path)
        self.documents= text_cleaning[cleaning]()
        self.X= self.convert_text()
        self.Y= self.df["Classes"]
        
        

    def text_cleaning_nltk(self):
        documents = []
        lemmatizer = WordNetLemmatizer()

        for sen in self.df.Content.values:
            # Replace every non-word character with a space
            data = re.sub(r'\W', ' ', str(sen))

            # Remove single characters surrounded by whitespace
            data = re.sub(r'\s+[a-zA-Z]\s+', ' ', data)

            # Remove a single character at the start of the text
            # (the caret must not be escaped, otherwise it matches a literal '^')
            data = re.sub(r'^[a-zA-Z]\s+', ' ', data)

            # Collapse runs of whitespace into a single space
            data = re.sub(r'\s+', ' ', data)

            # Remove a leading 'b' left over from stringified bytes objects
            data = re.sub(r'^b\s+', '', data)

            # Convert to lowercase
            data = data.lower()

            # Lemmatize each token and rebuild the document
            data = [lemmatizer.lemmatize(word) for word in data.split()]
            documents.append(' '.join(data))

        return documents
        
        
    def text_cleaning_spacy(self):
        import spacy  # imported lazily so the nltk path does not require spaCy
        # "en" is the spaCy 2.x shortcut; spaCy 3.x needs a full model name such as "en_core_web_sm"
        nlp = spacy.load("en", disable=['parser', 'tagger', 'ner'])
        documents = []
        for data in self.df["Content"].values:
            # str() guards against non-string cells such as NaN
            doc = nlp(str(data).lower())
            # Keep the non-empty lemma of every token
            lemmatized = [token.lemma_.strip() for token in doc if token.lemma_.strip()]
            documents.append(" ".join(lemmatized))

        return documents
        
    # Converting Text to Numbers
    def convert_text(self):
        tfidfconverter = TfidfVectorizer(max_features=1500, min_df=5, max_df=0.7, stop_words=stopwords.words('english'))  
        X = tfidfconverter.fit_transform(self.documents).toarray() 
        return(X)
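
To see what the regular-expression steps of `text_cleaning_nltk` do to a single string, here is a condensed, standalone sketch. It leaves out the `'b'` prefix removal and the lemmatization, assumes the start-of-string step uses an unescaped `^` anchor, and the helper name `clean` and the sample sentence are made up for illustration.

```python
import re

def clean(text):
    # Condensed version of the regex steps, without lemmatization
    data = re.sub(r'\W', ' ', str(text))          # non-word characters -> space
    data = re.sub(r'\s+[a-zA-Z]\s+', ' ', data)   # isolated single letters
    data = re.sub(r'^[a-zA-Z]\s+', ' ', data)     # single letter at the start
    data = re.sub(r'\s+', ' ', data)              # collapse whitespace
    return data.lower().strip()

sample = "A quick test: it's 100% done, isn't it?"
print(clean(sample))
```

Note that splitting on non-word characters turns contractions such as "isn't" into fragments ("isn" and a stray "t" that is then removed), which is a known limitation of this simple approach.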
 

# Steps three and four: Training the text classification model, predicting classes, and model evaluation

We can use these techniques to build machine learning models and estimate their skill:
1- A train/test split (80-20%), followed by evaluating the accuracy and the confusion matrix to visualize the performance of the chosen algorithm.
2- Cross-validation techniques such as k-fold, which generally give a less biased performance estimate than a single split.
To evaluate the results, we use the confusion matrix and the classification report (precision, recall and F1-score per class) provided by sklearn, together with the accuracy score.
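
The idea behind k-fold can be sketched in a few lines of plain Python: the samples are cut into k folds, and each fold serves once as the test set while the remaining folds form the training set. The helper `kfold_indices` below is hypothetical and only illustrates the index bookkeeping for contiguous, unshuffled folds; in practice sklearn's `KFold` does this for us.

```python
def kfold_indices(n_samples, n_splits):
    """Yield (train_idx, test_idx) pairs over contiguous, unshuffled folds."""
    # Spread the remainder over the first folds so sizes differ by at most one
    fold_sizes = [n_samples // n_splits + (1 if i < n_samples % n_splits else 0)
                  for i in range(n_splits)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train_idx, test_idx
        start += size

for train_idx, test_idx in kfold_indices(n_samples=10, n_splits=5):
    print("train:", train_idx, "test:", test_idx)
```

Every sample is used for testing exactly once, so the averaged score depends less on one lucky or unlucky split.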

In the following class, we implement one method that trains and evaluates a single model with the 80-20 split, and a second one that compares several algorithms using k-fold.

Of course, each model has several hyperparameters. We can use the grid search technique to find the combination that gives the best results.
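
Grid search is not implemented in this notebook, but a minimal sketch with sklearn's `GridSearchCV` could look like the following. The data is a synthetic stand-in generated with `make_classification` (not the dataset used here), and the grid over the regularisation strength `C` is an arbitrary choice for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the TF-IDF matrix and labels (not the notebook's data)
X, y = make_classification(n_samples=200, n_features=20, random_state=0)

# Hypothetical grid: only the regularisation strength C is tuned here
param_grid = {"C": [0.01, 0.1, 1, 10]}

# Each candidate value is scored with 5-fold cross-validation
search = GridSearchCV(LogisticRegression(max_iter=1000), param_grid,
                      cv=5, scoring="accuracy")
search.fit(X, y)

print(search.best_params_)
print(round(search.best_score_, 3))
```

With imbalanced classes, a scoring option other than accuracy (for example `"f1"`) may be a better criterion for selecting the parameters.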

In [8]:
class build_eval_model():
    def __init__(self, actual_X= None, actual_Y= None, model_name=""):
        # prepare models
        self.classifier_dict= { 
                            "RF": RandomForestClassifier(),
                            "LR": LogisticRegression(),
                            "CART": DecisionTreeClassifier(),
                            "SVM": SVC()
                             }
        
        assert actual_X is not None and actual_Y is not None, "No values passed to model."
        assert len(actual_X) != 0 and len(actual_Y) != 0, "Empty values passed to model."
        assert model_name in self.classifier_dict.keys() or model_name == "compare", "Import for this algorithm was not provided or no algorithm name was given" 
        self.actual_X , self.actual_Y= actual_X, actual_Y
        if (model_name == "compare"):
            self.compare_models()
        else:
            self.build_one_model(model_name)
        
        
    def build_one_model(self, model_name=""):
        
        # Splitting Training and Test Sets
        X_train, X_test, y_train, y_test = train_test_split(self.actual_X, self.actual_Y, test_size=0.2, random_state=0) 
        
        classifier = self.classifier_dict[model_name]  
        classifier.fit(X_train, y_train)  
        y_pred = classifier.predict(X_test)
        
        # Evaluating the Model
        print(confusion_matrix(y_test,y_pred))  
        print(classification_report(y_test,y_pred))  
        print(accuracy_score(y_test, y_pred))
        
    def compare_models(self):

        # evaluate each model in turn
        results = []
        names = []
        scoring = 'accuracy'
        for name, model in self.classifier_dict.items():
            # random_state only has an effect with shuffle=True
            # (recent sklearn versions raise an error otherwise)
            kfold = model_selection.KFold(n_splits=10, shuffle=True, random_state=5)
            # evaluating and saving cross_val results
            cv_results = model_selection.cross_val_score(model, self.actual_X, self.actual_Y, cv=kfold, scoring=scoring)
            results.append(cv_results)
            names.append(name)
            msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
            print(msg)

        # boxplot algorithm comparison, saved in the data/output folder
        os.makedirs('data/output', exist_ok=True)  # make sure the output folder exists
        fig = plt.figure()
        fig.suptitle('Algorithm Comparison')
        ax = fig.add_subplot(111)
        ax.boxplot(results)
        ax.set_xticklabels(names)
        fig.savefig('data/output/comparision.png')   # save the figure to file
        plt.close(fig)    # close the figure

# Step five: Linking it all together

In [9]:
def run(input_path="", model_name="", cleaning=""):
    # Clean and vectorize the text, then train and evaluate the chosen model
    text = text_prep(cleaning=cleaning, input_path=input_path)
    build_eval_model(actual_X=text.X,
                     actual_Y=text.Y,
                     model_name=model_name)
    
    

In [11]:
if __name__ == '__main__':
    
    input_path= "./data/input/exerciceDS.csv"
    model_name= "LR"
    cleaning = "spacy"
    
    run (input_path= input_path,
         model_name= model_name ,
         cleaning= cleaning)

[[1127    2]
 [  20   50]]
             precision    recall  f1-score   support

      False       0.98      1.00      0.99      1129
       True       0.96      0.71      0.82        70

avg / total       0.98      0.98      0.98      1199

0.981651376146789
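
The figures in the report can be recomputed directly from the confusion matrix printed above, which helps when reading it. The short sketch below uses only the four counts from that output and assumes sklearn's convention that rows are actual classes and columns are predicted classes.

```python
# Confusion matrix printed above: rows are actual classes, columns are predictions
tn, fp = 1127, 2
fn, tp = 20, 50

precision_true = tp / (tp + fp)   # of the samples predicted True, how many are True
recall_true = tp / (tp + fn)      # of the actual True samples, how many were found
f1_true = 2 * precision_true * recall_true / (precision_true + recall_true)
accuracy = (tp + tn) / (tn + fp + fn + tp)

print(round(precision_true, 2), round(recall_true, 2), round(f1_true, 2))
print(accuracy)
```

Although the overall accuracy is about 98%, 20 of the 70 actual True samples are predicted as False, so the recall on the minority class is noticeably lower than the headline figure suggests.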
