# Text Classification 

- Text classification is an example of a supervised learning task.

- We use text data and NLP methods (text vectorization) to obtain a model relating a categorical variable to a given document. 

- We use **classification algorithms**, which fall into three categories:
    
    1. **Binary classification**: 
        - The categorical variable has two values (labels). Each observation has exactly one value.
        
        - Example: an email classified as spam / not spam

    
    2. **Multiclass classification**: 
        - Multiple labels
        
        - Each observation can only have one single value
        
    3. **Multilabel Classification**:
        - Each observation can have multiple labels
        
        - Example: a newspaper article can be assigned multiple labels. 
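
The difference between the three categories shows up in how the labels are stored. The following is a minimal sketch with made-up labels (the topic names are hypothetical, not taken from any dataset): binary and multiclass targets are a single value per observation, while multilabel targets are usually encoded as an indicator matrix with one column per label.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Binary and multiclass: exactly one label per observation
binary_labels = ["spam", "not spam", "spam"]
multiclass_labels = ["P1", "P3", "P2"]

# Multilabel: each observation carries a set of labels (hypothetical topics)
article_topics = [{"politics", "economy"}, {"sport"}, {"economy", "sport"}]

# One column per label, 1 if the article carries that label
mlb = MultiLabelBinarizer()
indicator = mlb.fit_transform(article_topics)

print(mlb.classes_)   # column order of the indicator matrix
print(indicator)
```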


We are going to see an example of Multiclass Classification.


In the following part, we will use:
    
    - textacy
    
    - sklearn

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

## Dataset 

- A dataset of bug reports from the Java Development Tools (JDT) open source project

- The dataset contains 45296 bug reports

In [None]:
df = pd.read_csv('eclipse_jdt.csv')
print(df.columns)

In [None]:
len(df)

In [None]:
df.sample(10)

### Description of the variables of interest in the dataset

- **Priority**: varies from P1 (most critical) to P5 (least severe)
    
- **Title**: a short description of the bug written by the user
    
- **Description**: a more detailed description of the bug
    
- **Component**: part of the project impacted by the bug

## Objective:

- We want to estimate a model that predicts the priority level of a bug from the content of its title and description.

- **Supervised Learning**: 
    
    1. **Training phase**: A model is estimated on the training set, i.e. the training observations (title + description) and their associated labels (priority in our case).
        - Feature engineering: selecting an adequate set of features for the training observations.

       At the end of this phase we have a trained model that can be used to make predictions.

    2. **Prediction phase**: The trained model is applied to new input observations.
        - These new observations are transformed in the same way as in the training phase to produce feature vectors.
        - The new feature vectors are passed to the trained model to generate predictions.
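
The two phases can be sketched on a handful of made-up sentences. This is only an illustration under assumed toy data (the texts and labels are hypothetical); the point is that the vectorizer is fitted once on the training texts and then reused, with `transform` rather than `fit_transform`, on new observations.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Hypothetical toy data, not taken from the bug report dataset
train_texts = ["editor crashes on save", "crash when opening file",
               "typo in preferences label", "wrong label in dialog"]
train_labels = ["P1", "P1", "P3", "P3"]

# Training phase: learn the vocabulary and the classifier from the training set
vectorizer = TfidfVectorizer()
X_train_vec = vectorizer.fit_transform(train_texts)
clf = LinearSVC(random_state=0).fit(X_train_vec, train_labels)

# Prediction phase: reuse the fitted vectorizer (transform, not fit_transform)
new_texts = ["crash on startup"]
X_new_vec = vectorizer.transform(new_texts)
predicted = clf.predict(X_new_vec)
print(predicted)
```

Because the new text goes through the same fitted vectorizer, its feature vector has exactly the same columns as the training matrix.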

### Variable of interest: Priority

In [None]:
df['Priority'].value_counts().sort_index().plot(kind='bar')

### Comment:
- Class imbalance:
    
    - The number of bugs with priority P3 is much higher than for the other priority levels
    
    - The text classification algorithm will therefore have much more information about P3 than about the other priority levels 
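
To put a number on the imbalance, the class shares can be computed with `value_counts(normalize=True)`. The sketch below uses a small made-up series of labels (an assumption for illustration; on the real data one would use `df['Priority']`). The share of the largest class is also the accuracy that a classifier always predicting that class would reach.

```python
import pandas as pd

# Hypothetical priority labels; the real counts come from df['Priority']
priorities = pd.Series(["P3"] * 8 + ["P1"] * 1 + ["P2"] * 1)

# Share of each class, sorted by label
shares = priorities.value_counts(normalize=True).sort_index()
print(shares)

# Share of the largest class = accuracy of "always predict the majority class"
majority_share = shares.max()
print(majority_share)
```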
    
    

### Distribution of the bugs by components

In [None]:
df['Component'].value_counts()

## Devising a Text Classification Model

Four usual steps:
    
    1. Data preparation
    2. Train-Test split
    3. Training the Machine Learning Model
    4. Model evaluation

### Step 1: Data preparation

- Our aim is to predict the **Priority** of a bug report from its **Title** and **Description**
- We keep the columns 'Title', 'Description' and 'Priority' and discard the other ones
- Note that by doing so we restrict our information set => the other variables in the data could contain useful information
- We drop rows with missing information
- We combine 'Title' and 'Description' to obtain a single 'text' column

In [None]:
df=df[['Title','Description','Priority']]
df=df.dropna()
df['text'] = df['Title'] + ' ' + df['Description']  # the space keeps the last word of the title separate from the description
df=df.drop(columns=['Title','Description'])
df.columns

##### We eliminate special characters 

In [None]:
import textacy
import textacy.preprocessing as tprep

In [None]:
preproc = tprep.make_pipeline(
    tprep.replace.urls,
    tprep.remove.html_tags,
    tprep.normalize.hyphenated_words,
    tprep.normalize.quotation_marks,
    tprep.normalize.unicode,
    tprep.remove.accents,
    tprep.remove.punctuation,
    tprep.normalize.whitespace,
    tprep.replace.numbers
)

In [None]:
df['clean_text']=df['text'].apply(preproc)


We remove texts with fewer than 50 characters. Such reports have not been filled in properly: the description of the problem is too short to be informative. 

In [None]:
df=df[df['clean_text'].str.len()>50]

In [None]:
print(f"Final number of bug reports: {len(df)}")

In [None]:
def normalize(text):
    text = tprep.replace.urls(text)  # replace URLs with a placeholder token
    text = tprep.remove.html_tags(text)
    text = tprep.normalize.hyphenated_words(text)
    text = tprep.normalize.quotation_marks(text)
    text = tprep.normalize.unicode(text)
    text = tprep.remove.accents(text)
    text = tprep.remove.punctuation(text)
    text = tprep.normalize.whitespace(text)
    text = tprep.replace.numbers(text)
    return text

In [None]:
df['clean_text']=df['text'].apply(normalize)

In [None]:
df.loc[25026,'text']

In [None]:
df.loc[25026,'clean_text']

### Step 2: Training set and test set

- We split the data set into the training set and the test set 

- We use the `train_test_split` function from sklearn with the following arguments:

    1. The independent variable (`df['clean_text']`)

    2. The target variable (`df['Priority']`)

    3. `test_size=0.2` => the test set represents 20% of the data set and the training set 80%

    4. `random_state=42` => seeds the random assignment of rows to the train and test sets. With another number we would obtain a different 80/20 split. Fixing a value for `random_state` makes our results reproducible, and lets us compare results on the same split when we modify (add/remove) the set of variables

    5. `stratify=df['Priority']` => the distribution of the target variable is maintained in both the training set and the test set
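
The effect of `stratify` is easiest to see on a small artificial example. The sketch below assumes 100 made-up reports with an 80/20 class mix (hypothetical data); after a stratified 80/20 split, both parts keep that same mix.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Hypothetical data: 80 reports labelled P3 and 20 labelled P1
texts = [f"report {i}" for i in range(100)]
labels = ["P3"] * 80 + ["P1"] * 20

X_tr, X_te, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.2, random_state=42, stratify=labels)

# Class counts in each part: the 80/20 ratio is preserved in both
print(Counter(y_tr))
print(Counter(y_te))
```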

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
X_train, X_test, Y_train, Y_test = train_test_split(df['clean_text'],df['Priority'], test_size=0.2,random_state=42,stratify=df['Priority'])

In [None]:
print('Size of Training Data', X_train.shape[0])
print('Size of Test Data', X_test.shape[0])

### Step 3: Training the Machine Learning Model

- Text classification: a supervised machine learning problem
    
- Support Vector Machine: a popular algorithm for text classification 
    
    
- Other possible methods:
    1. Naive Bayes classifier
    2. Boosting models 
    3. Neural networks

#### Computation of the tf-idf on the training set


- We must transform our text data into a numerical array before estimating the model.

- Counting words in each bug report => each document is represented by its word counts (bag of words)
    - Problem: common words will be overweighted 
        
- We therefore represent texts with tf-idf weights, which down-weight terms that appear in many documents 


https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html
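
To make the weighting concrete, here is a small sketch on three made-up documents (hypothetical data). With its default settings (`smooth_idf=True`, `norm='l2'`), `TfidfVectorizer` computes idf(t) = ln((1 + n) / (1 + df(t))) + 1, where n is the number of documents and df(t) the number of documents containing term t, and then scales each row to unit Euclidean length. The code recomputes the idf by hand so it can be compared with `idf_`.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical mini-corpus
docs = ["bug in editor", "bug in debugger", "editor crash"]

vec = TfidfVectorizer()          # defaults: smooth_idf=True, norm='l2'
X = vec.fit_transform(docs)

# Vocabulary terms ordered by column index
terms = sorted(vec.vocabulary_, key=vec.vocabulary_.get)

# Document frequency of each term, then the smoothed idf
n_docs = len(docs)
doc_freq = np.array([sum(t in d.split() for d in docs) for t in terms])
idf_by_hand = np.log((1 + n_docs) / (1 + doc_freq)) + 1

# Terms found in fewer documents get a larger idf
print(dict(zip(terms, idf_by_hand.round(3))))
```

Terms that occur in two of the three documents ("bug", "in", "editor") receive a lower idf than terms that occur in only one ("crash", "debugger").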

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer(min_df=10,ngram_range=(1,2),stop_words='english')
X_train_tf=tfidf.fit_transform(X_train)

In [None]:
X_train_tf.shape, type(X_train_tf)

In [None]:
print(X_train_tf)

In [None]:
print(len(tfidf.vocabulary_))

In [None]:
tfidf.idf_, len(tfidf.idf_)

In [None]:
tfidf.vocabulary_

In [None]:
# terms that were ignored (here because they appear in fewer than min_df documents)
tfidf.stop_words_

#### Estimation of the Model SVC (Support Vector Classification)

Some parameters

- `C` = 1 (default value): regularization parameter
- `random_state` = 0 => to obtain reproducible output across multiple function calls
- `tol`: tolerance for the stopping criterion
- `dual` = 'auto' => chooses whether to solve the primal or the dual optimization problem, depending on n_samples, n_features, loss, ...

In [None]:
from warnings import simplefilter
simplefilter(action='ignore',category=FutureWarning)

In [None]:
from sklearn.svm import LinearSVC
model1 = LinearSVC(random_state=0,tol=1e-5,dual='auto')
model1.fit(X_train_tf,Y_train)

In [None]:
model1.coef_.shape

In [None]:
model1.intercept_

In [None]:
model1.classes_

In [None]:
model1.n_features_in_

### Model evaluation

In [None]:
X_test_tf=tfidf.transform(X_test)
Y_pred = model1.predict(X_test_tf)

The simplest way to evaluate the model is through the **accuracy score**

$$ \text{Accuracy}=\frac{\text{Number of correct predictions}}{\text{Total number of predictions}}$$

In [None]:
from sklearn.metrics import accuracy_score
print('Accuracy score', accuracy_score(Y_test, Y_pred))

- The accuracy score of the trained model is equal to 87.5%. Taken alone, this suggests a good predictor, but the classes are heavily imbalanced, so the figure needs a point of comparison.

- Question: how does this accuracy score compare with that of a simple classifier? Does our trained model have a higher accuracy score?

### Comparison with a Simple Benchmark Model

- sklearn.dummy.DummyClassifier: makes predictions that ignore the input features

- Can be used as a baseline to compare against other, more complex classifiers

- The behavior of this baseline model is selected with the **strategy** parameter:
    
    - "most_frequent": the model always predicts the most frequent class label in the target variable y. The predict_proba 
        method returns the matching one-hot encoded vector
    
    - "prior": the model always predicts the most frequent class label in the observed target variable (like "most_frequent"). The predict_proba 
        method returns the empirical class distribution of the target variable y
    
    - "stratified": makes random predictions by sampling from the empirical class distribution (a multinomial draw)
    
    - "uniform": generates predictions uniformly at random from the list of unique classes observed in y. Each class has equal probability.
    
    - "constant": always predicts a constant label provided by the user. 
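
The difference between "most_frequent" and "prior" only appears in `predict_proba`. The following sketch uses a made-up target with 8 observations of P3 and 2 of P1 (hypothetical data; the feature matrix is a placeholder, since the dummy classifier ignores it).

```python
import numpy as np
from sklearn.dummy import DummyClassifier

X_toy = np.zeros((10, 1))                 # features are ignored by the dummy
y_toy = np.array(["P3"] * 8 + ["P1"] * 2)

most_frequent = DummyClassifier(strategy="most_frequent").fit(X_toy, y_toy)
prior = DummyClassifier(strategy="prior").fit(X_toy, y_toy)

# Both strategies predict the majority class ...
print(most_frequent.classes_)
print(most_frequent.predict(X_toy[:1]), prior.predict(X_toy[:1]))

# ... but their predicted probabilities differ
proba_most_frequent = most_frequent.predict_proba(X_toy[:1])  # one-hot vector
proba_prior = prior.predict_proba(X_toy[:1])                  # class shares
print(proba_most_frequent, proba_prior)
```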


In [None]:
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score
clf=DummyClassifier(strategy="most_frequent")
clf.fit(X_train, Y_train)
Y_pred_baseline = clf.predict(X_test)
print('Accuracy Score',accuracy_score(Y_test,Y_pred_baseline))

In [None]:
Y_pred_baseline

In [None]:
clf.class_prior_

In [None]:
from sklearn.metrics import balanced_accuracy_score
print('Balanced accuracy (baseline)', balanced_accuracy_score(Y_test, Y_pred_baseline))

#### Comment
The baseline model reaches the same accuracy: in terms of accuracy, the SVC model does not do a better job than always predicting the most frequent class.
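
Why plain accuracy flatters a majority-class predictor can be checked on a small made-up example (hypothetical labels, 8 of P3 and 2 of P1). Balanced accuracy is the mean of the per-class recalls, so always predicting P3 scores 1.0 on P3 and 0.0 on P1.

```python
from sklearn.metrics import accuracy_score, balanced_accuracy_score

# Hypothetical labels and a predictor that always answers "P3"
y_true = ["P3"] * 8 + ["P1"] * 2
y_majority = ["P3"] * 10

acc = accuracy_score(y_true, y_majority)               # 8 correct out of 10
bal_acc = balanced_accuracy_score(y_true, y_majority)  # mean of recalls 1.0 and 0.0
print(acc, bal_acc)
```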

### Naive Bayes algorithm

In [None]:
from sklearn.preprocessing import LabelEncoder
Encoder = LabelEncoder()
Train_Y = Encoder.fit_transform(Y_train)
Test_Y = Encoder.transform(Y_test)  # reuse the encoding learned on the training labels

In [None]:
# fit the training dataset on the NB classifier
from sklearn.naive_bayes import MultinomialNB
Naive = MultinomialNB()
Naive.fit(X_train_tf,Train_Y)
# predict the labels on the test dataset
predictions_NB = Naive.predict(X_test_tf)
# Use accuracy_score(y_true, y_pred) to get the accuracy
print("Naive Bayes Accuracy Score -> ",accuracy_score(Test_Y, predictions_NB)*100)

## Confusion Matrix


How well does the model perform for the different values of the target variable (here, the different priority levels)?



In [None]:
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
cm=confusion_matrix(Y_test,Y_pred)
print(cm)

In [None]:
ConfusionMatrixDisplay.from_predictions(Y_test,Y_pred)
plt.show()

In [None]:
# Code for a nicer confusion-matrix plot: https://gist.github.com/mesquita/f6beffcc2579c6f3a97c9d93e278a9f1#file-nice_cm-py

In [None]:
from sklearn.metrics import recall_score
target_names= ['P1','P2','P3','P4','P5']
print(recall_score(Y_test,Y_pred, labels=target_names,average=None,zero_division=0.0))

In [None]:
model1.classes_

In [None]:
from sklearn.metrics import classification_report
print(classification_report(Y_test,Y_pred,zero_division=0.0))

- Accuracy of the model: 0.88

- Precision and recall are good for P3 but not for the other labels => accuracy alone is not sufficient to understand the predictive performance of the model.

- macro avg: the unweighted mean of the per-label metrics. It does not take label imbalance into account.
- weighted avg: the mean of the per-label metrics weighted by support (the number of true instances for each label). This alters 'macro' to account for label imbalance.
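
The two averages can be worked through on a small made-up example (hypothetical labels: 6 reports of P3 and 4 of P1, with only one P1 report recognized). The per-label recalls are 0.25 for P1 and 1.0 for P3, so the macro average is (0.25 + 1.0) / 2 and the weighted average is (4 × 0.25 + 6 × 1.0) / 10.

```python
from sklearn.metrics import recall_score

# Hypothetical labels: all P3 reports are found, only 1 of 4 P1 reports
y_true = ["P3"] * 6 + ["P1"] * 4
y_pred = ["P3"] * 6 + ["P3"] * 3 + ["P1"] * 1

per_class = recall_score(y_true, y_pred, average=None, labels=["P1", "P3"])
macro = recall_score(y_true, y_pred, average="macro")        # every label counts equally
weighted = recall_score(y_true, y_pred, average="weighted")  # labels weighted by support
print(per_class, macro, weighted)
```

The weighted average is pulled toward the score of the frequent label, which is why it looks better than the macro average on imbalanced data.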



## Dealing with Class Imbalance

- P3 is by far the most frequent label in the dataset. 
- Because of this imbalance, the model learns the characteristics of texts associated with P3 well, but much less so for the other labels.

- How can we handle this issue of class imbalance? 

- Two approaches:
    
    1. **upsampling**: techniques used to artificially increase the number of observations of less frequent classes
    
    2. **downsampling**: techniques used to reduce the number of observations of the majority class
        

An example of downsampling is outlined below:
        

### Step 1: Resampling the data set 

1. We choose to randomly downsample the P3 class
2. We create a dataframe with all other categories
3. We concatenate the two dataframes to create a new (balanced) dataset 
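
The three resampling steps can be sketched on a toy dataframe. The data and the target size of 20 rows for P3 are assumptions made for illustration; on the real data the same operations would be applied to `df`, with a sample size chosen by the analyst.

```python
import pandas as pd

# Hypothetical imbalanced dataset: 100 x P3, 20 x P2, 10 x P1
toy = pd.DataFrame({
    "text": [f"report {i}" for i in range(130)],
    "Priority": ["P3"] * 100 + ["P2"] * 20 + ["P1"] * 10,
})

# 1. Randomly downsample the majority class (n=20 is an arbitrary choice)
df_p3 = toy[toy["Priority"] == "P3"].sample(n=20, random_state=42)

# 2. Keep every observation of the other classes
df_rest = toy[toy["Priority"] != "P3"]

# 3. Concatenate the two parts into a new, more balanced dataset
toy_balanced = pd.concat([df_p3, df_rest])
print(toy_balanced["Priority"].value_counts())
```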

### Step 2: Simplifying and cleaning the data

### Step 3: train-test split

### Step 4: Training the ML model

### Step 5: Model evaluation

## Cross-validation 

- K-fold cross-validation 
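
As a minimal sketch of K-fold cross-validation on text, the example below wraps the vectorizer and the classifier in a `Pipeline`, so that the tf-idf vocabulary is re-fitted on the training part of each fold only. The toy texts and labels are hypothetical, and stratified folds are used here as one reasonable choice for imbalanced labels.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy data: two easily separable kinds of reports
texts = ["editor crashes with null pointer"] * 10 + ["typo in dialog label"] * 10
labels = ["P1"] * 10 + ["P3"] * 10

# Vectorizer and model are fitted together inside each fold
pipe = Pipeline([("tfidf", TfidfVectorizer()),
                 ("model", LinearSVC(random_state=0))])

# 5 folds, each keeping the class proportions of the full data
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
cv_scores = cross_val_score(pipe, texts, labels, cv=cv)
print(cv_scores, cv_scores.mean())
```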

In [None]:
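# Keep the sparse matrix (a dense array for tens of thousands of documents would be very large) and use the cleaned text
df_tf = tfidf.fit_transform(df['clean_text'])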

In [None]:
from sklearn.model_selection import cross_val_score
scores= cross_val_score(estimator=model1,X=df_tf,y=df['Priority'],cv=5)

In [None]:
print("Validation scores from each iteration of the cross validation", scores)

In [None]:
print('Mean value of validation scores', scores.mean())

In [None]:
print('Standard deviation of validation scores', scores.std())

## Hyperparameter Tuning

- Grid Search

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

In [None]:
training_pipeline = Pipeline(steps=[('tfidf',TfidfVectorizer(stop_words="english")),
                             ('model', LinearSVC(random_state=42,tol=1e-5,dual='auto'))])

In [None]:
grid_param=[{
    'tfidf__min_df': [5,10],
    'tfidf__ngram_range': [(1,3),(1,6)],
    'model__penalty':['l2'],
    'model__loss': ['hinge'],
    'model__max_iter': [18000]
},{
    'tfidf__min_df': [5,10], 
    'tfidf__ngram_range': [(1,3),(1,6)],
    'model__C': [1,10],
    'model__tol': [1e-2,1e-3]
}]    
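
Before launching a grid search it helps to know how many models will be fitted. The sketch below copies the two parameter dictionaries from the grid above and counts the combinations with `ParameterGrid`; the names `grid`, `n_candidates` and `n_folds` are introduced here for illustration only.

```python
from sklearn.model_selection import ParameterGrid

# Same two sub-grids as above; keys follow the <step name>__<parameter> convention
grid = [{
    'tfidf__min_df': [5, 10],
    'tfidf__ngram_range': [(1, 3), (1, 6)],
    'model__penalty': ['l2'],
    'model__loss': ['hinge'],
    'model__max_iter': [18000]
}, {
    'tfidf__min_df': [5, 10],
    'tfidf__ngram_range': [(1, 3), (1, 6)],
    'model__C': [1, 10],
    'model__tol': [1e-2, 1e-3]
}]

# Each dictionary is expanded separately: 2*2 = 4 and 2*2*2*2 = 16 combinations
n_candidates = len(ParameterGrid(grid))
n_folds = 5
print(n_candidates, n_candidates * n_folds)
```

With 5-fold cross-validation, every candidate is fitted five times, plus one final refit of the best candidate on the full data when `refit=True` (the default of `GridSearchCV`).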

In [None]:
from sklearn.model_selection import GridSearchCV

In [None]:
gridSearchProcessor=GridSearchCV(estimator=training_pipeline,param_grid=grid_param,cv=5)

In [None]:
gridSearchProcessor.fit(df['clean_text'],df['Priority'])

In [None]:
best_params=gridSearchProcessor.best_params_
print(best_params)

In [None]:
best_results=gridSearchProcessor.best_score_

In [None]:
print(best_results)