# <center> <font size = 24 color = 'steelblue'> <b>Text Classification using Naive Bayes Classifier

<div class="alert alert-block alert-info">
    
<font size = 4> 

**By the end of this notebook you will be able to:**
- Extract features from text
- Train a Naive Bayes classifier for basic text classification
- Evaluate the resulting text classification model

# <a id= 'c0'> 
<font size = 4>
    
**Table of contents:**<br>
[1. Installation and import of necessary packages](#c1)<br>
[2. Download the necessary corpus from NLTK](#c2)<br>
[3. Data acquisition](#c3)<br>
[4. Feature extraction](#c4)<br>
[5. Model development](#c5)<br>
[6. Evaluation](#c6)<br>
    

##### <a id = 'c1'>
<font size = 10 color = 'midnightblue'> <b>Installation and import of necessary packages

In [None]:
import nltk
import string
import random
import pandas as pd

##### <a id = 'c2'>
<font size = 10 color = 'midnightblue'> <b>Download necessary corpus and models from nltk

<div class="alert alert-block alert-info">
<font size = 4>
    
<center> <b>Use the "names" corpus from nltk to build a simple model for gender classification of names.
    

In [None]:
nltk.download("names")
nltk.download('product_reviews_1')
nltk.download('punkt')
nltk.download('stopwords')

In [None]:
print(nltk.corpus.names.fileids())

[top](#c0)

##### <a id = 'c3'>
<font size = 10 color = 'midnightblue'> <b>Data acquisition

<div class="alert alert-block alert-success">
    
<font size = 4> 
    
- The names corpus contains two text files.
- `male.txt` contains a list of names most commonly used for males.
- `female.txt` contains a list of names most commonly used for females.

<font size = 5 color = seagreen> **Start by loading the names into separate female and male lists.**

In [None]:
female_names = nltk.corpus.names.words('female.txt')
male_names = nltk.corpus.names.words('male.txt')

<font size = 5 color = seagreen> <b> Create a labeled list of `(name, label)` tuples, labeling names from **female.txt** as `female` and names from **male.txt** as `male`.

In [None]:
labeled_data = ([(name, 'female') for name in female_names] +
                    [(name, 'male') for name in male_names])
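
Before extracting features, it can help to check how many examples each class has. The sketch below uses a small hypothetical stand-in for the two corpus lists (the names are made up for illustration) and counts the labels with `collections.Counter`.

```python
from collections import Counter

# Hypothetical stand-ins for the lists loaded from the names corpus
female_names = ["Alice", "Maria", "Sophie"]
male_names = ["John", "Peter"]

# Same construction as in the cell above: one (name, label) tuple per name
labeled_data = ([(name, 'female') for name in female_names] +
                [(name, 'male') for name in male_names])

# Count how many examples carry each label
label_counts = Counter(label for _, label in labeled_data)
print(label_counts)
```

With the real corpus, the same `Counter` call shows whether one class dominates, which matters when interpreting accuracy later.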

[top](#c0)

##### <a id = 'c4'>
<font size = 10 color = 'midnightblue'> <b> Feature Extraction

<div class="alert alert-block alert-success">
<font size = 4> 
    
- Text data is unstructured, so features must be extracted before it can be used in ML models.
- Here the features are chosen manually: ***length***, ***first letter***, ***last letter***, ***count of each letter*** and ***count of vowels*** in the name.
- The function below extracts these features and returns a dictionary of features.

In [None]:
def getFeatures(name):
    # Lower casing
    name = name.lower()
    feature_dict = {}

    # Getting the features like length, first_letter, last_letter
    feature_dict['length'] = len(name)
    feature_dict['first_letter'] = name[0]
    feature_dict['last_letter'] = name[-1]

    feature_dict['vowels_count'] = 0

    # Count each letter of the alphabet and accumulate the vowel total
    for char in string.ascii_lowercase:
        feature_dict[f'count_{char}'] = name.count(char)
        if char in 'aeiou':
            feature_dict['vowels_count'] += name.count(char)

    return feature_dict
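
To see what such a feature dictionary looks like, the extraction can be applied to a single name. The block below restates the feature logic in compact form so that it runs on its own; `get_features_sketch` is a hypothetical name used only for this illustration.

```python
import string

def get_features_sketch(name):
    """Compact restatement of the feature extraction above (illustrative only)."""
    name = name.lower()
    features = {
        'length': len(name),
        'first_letter': name[0],
        'last_letter': name[-1],
        'vowels_count': sum(name.count(v) for v in 'aeiou'),
    }
    # One count feature per letter of the alphabet
    for char in string.ascii_lowercase:
        features[f'count_{char}'] = name.count(char)
    return features

features = get_features_sketch('Anna')

# Print only the non-zero entries to keep the output short
print({key: value for key, value in features.items() if value})
```

Each name yields 30 features: the four summary features plus one count for each of the 26 letters. Note that an empty string would raise an `IndexError` at `name[0]`, so empty inputs need to be filtered out beforehand.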



<font size = 5 color = seagreen> <b> Transform each name in the labeled data into its feature dictionary using the function above.

In [None]:
new_lab_data= []
for name, label in labeled_data:
    features = getFeatures(name)
    new_lab_data.append((features, label))


##### <a id = 'c5'>
<font size = 10 color = 'midnightblue'> <b> Model development 

<font size = 5 color = seagreen> <b> Shuffle the data before splitting it into training and test sets. The list is ordered with all female names first, so taking the first 1000 records without shuffling would give a test set containing only one class.

In [None]:
random.shuffle(new_lab_data)

<font size = 5 color = seagreen> <b> Select the first 1000 records of the shuffled data as the test set and the remainder as the training set.

In [None]:
test_data = new_lab_data[:1000]
train_data = new_lab_data[1000:]
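
The effect of shuffling before the split can be illustrated on toy data. The sketch below is a minimal example under two assumptions that are not part of the notebook: a fixed seed (42) for reproducibility, and a toy list of 100 records in place of the real feature data.

```python
import random
from collections import Counter

# Toy stand-in: an ordered list with all 'female' records first, then 'male'
toy_data = ([({'id': i}, 'female') for i in range(60)] +
            [({'id': i}, 'male') for i in range(60, 100)])

# Without shuffling, the first 20 records all belong to one class
unshuffled_test = toy_data[:20]
print(Counter(label for _, label in unshuffled_test))

# Shuffle a copy with a seeded generator (the seed is an assumption)
rng = random.Random(42)
shuffled = toy_data[:]
rng.shuffle(shuffled)

# Same slicing pattern as above, with 20 test records instead of 1000
toy_test, toy_train = shuffled[:20], shuffled[20:]
print(Counter(label for _, label in toy_test))
```

Using a dedicated `random.Random` instance keeps the seed local, so other code that relies on the global `random` state is unaffected.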

<font size = 5 color = seagreen> <b> Train a Naive Bayes classifier on the training set.

In [None]:
classifier = nltk.NaiveBayesClassifier.train(train_data)
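
To give an idea of what training involves, here is a minimal pure-Python sketch of a Naive Bayes classifier over feature dictionaries. It is not NLTK's implementation, which differs in details such as smoothing. The function names `train_nb` and `classify_nb`, the add-one smoothing and the toy data are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict

def train_nb(train_data):
    """Count labels and, per label, how often each feature value occurs."""
    label_counts = Counter()
    value_counts = defaultdict(Counter)  # (label, feature) -> value counts
    feature_values = defaultdict(set)    # feature -> distinct values seen
    for features, label in train_data:
        label_counts[label] += 1
        for fname, fval in features.items():
            value_counts[(label, fname)][fval] += 1
            feature_values[fname].add(fval)
    return label_counts, value_counts, feature_values

def classify_nb(model, features):
    """Return the label with the highest log prior plus log likelihoods."""
    label_counts, value_counts, feature_values = model
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_label in label_counts.items():
        score = math.log(n_label / total)  # log prior
        for fname, fval in features.items():
            n_values = len(feature_values[fname]) or 1
            count = value_counts[(label, fname)][fval]
            # Add-one smoothing (an assumption) avoids log(0) for unseen values
            score += math.log((count + 1) / (n_label + n_values))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Toy training data in the same (feature dict, label) format
toy_train = [
    ({'last_letter': 'a'}, 'female'), ({'last_letter': 'a'}, 'female'),
    ({'last_letter': 'e'}, 'female'), ({'last_letter': 'n'}, 'male'),
    ({'last_letter': 'k'}, 'male'), ({'last_letter': 'n'}, 'male'),
]
model = train_nb(toy_train)
print(classify_nb(model, {'last_letter': 'a'}))
print(classify_nb(model, {'last_letter': 'n'}))
```

The "naive" part is the sum over features: each feature contributes its log likelihood independently of the others, given the label.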

<font size = 5 color = seagreen> <b> Once training is complete, the classifier object can be used to classify a single name.

In [None]:
classifier.classify(getFeatures('Johnny'))

<div class="alert alert-block alert-info">
<font size = 4> 
    
**Note :**
  - For classification, the input text must be converted into the same features as the training data.
  - The same feature extraction function can be used for this transformation.
    


<font size = 5 color = seagreen> <b> This classifier object can also be used to classify multiple text inputs at the same time.

<div class="alert alert-block alert-success">
<font size = 4> 

- To do so, pass a list of unlabeled feature sets to the classifier's `classify_many` method.
- The snippet below separates the labels from the feature-extracted test data and prepares the input for `classify_many`.

In [None]:
test_features = []
test_labels = []
for feature_set, label in test_data:
    test_features.append(feature_set)
    test_labels.append(label)
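
The same separation can be written more compactly with `zip`. The sketch below uses a toy list in the same `(feature dict, label)` format; the `toy_` names are hypothetical.

```python
# Toy stand-in for the test set
toy_test_data = [
    ({'last_letter': 'a'}, 'female'),
    ({'last_letter': 'n'}, 'male'),
    ({'last_letter': 'e'}, 'female'),
]

# zip(*pairs) transposes the list of pairs into two tuples
toy_features, toy_labels = zip(*toy_test_data)

# Convert to lists to match the output of the loop above
toy_features = list(toy_features)
toy_labels = list(toy_labels)
print(toy_labels)
```

The explicit loop is more forgiving: unpacking `zip(*...)` into two names raises a `ValueError` when the input list is empty.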

<font size = 5 color = seagreen> <b> Obtain the predicted classes for the test input.

In [None]:
test_labels_pred = classifier.classify_many(test_features)

[top](#c0)

##### <a id = 'c6'>
<font size = 10 color = 'midnightblue'> <b> Evaluation

<font size = 5 color = seagreen> <b> Use standard classification metrics, such as the confusion matrix and accuracy, to assess the model.

<font size = 5 color = powderblue> <b> Confusion Matrix

In [None]:
for_matrix = pd.DataFrame({'pred' : test_labels_pred, 'act' : test_labels})

In [None]:
confusion_mat = pd.crosstab(for_matrix.pred, for_matrix.act)
confusion_mat

In [None]:
# Rows are predictions and columns are actual labels, both sorted alphabetically
# ('female', 'male'), so 'female' is treated as the positive class here.
# Extract TP, TN, FP and FN to compute accuracy and other measures.
TP = confusion_mat.iloc[0,0]
TN = confusion_mat.iloc[1,1]
FP = confusion_mat.iloc[0,1]
FN = confusion_mat.iloc[1,0]

In [None]:
Accuracy = (TP + TN) / sum([TP, TN, FP, FN]) * 100
print(f"Accuracy : {Accuracy:0.2f} %")
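
The same four counts also give precision, recall and F1. The sketch below computes them from two short hypothetical label lists (made up for illustration), assuming `female` is the positive class.

```python
# Hypothetical predicted and actual labels, for illustration only
pred = ['female', 'female', 'male', 'male', 'female', 'male']
act = ['female', 'male', 'male', 'female', 'female', 'male']
positive = 'female'  # assumption: 'female' is the positive class

pairs = list(zip(pred, act))
TP = sum(p == positive and a == positive for p, a in pairs)
FP = sum(p == positive and a != positive for p, a in pairs)
FN = sum(p != positive and a == positive for p, a in pairs)
TN = sum(p != positive and a != positive for p, a in pairs)

precision = TP / (TP + FP)   # share of predicted positives that are correct
recall = TP / (TP + FN)      # share of actual positives that are found
f1 = 2 * precision * recall / (precision + recall)
accuracy = (TP + TN) / len(pairs)

print(f"Precision : {precision:0.2f}")
print(f"Recall    : {recall:0.2f}")
print(f"F1 score  : {f1:0.2f}")
print(f"Accuracy  : {accuracy:0.2f}")
```

Counting from the label lists directly avoids depending on the row and column order of the crosstab.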

<font size = 5 color = seagreen> <b> NLTK also provides functions to obtain accuracy for the model.

In [None]:
## Accuracy on test data :
nltk.classify.accuracy(classifier, test_data)

<font size = 5 color = seagreen> <b> The NLTK Naive Bayes classifier can also display the top `n` most informative features for classification.

In [None]:
classifier.show_most_informative_features(n = 15)
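
The listing reports, for each feature value, how much more likely it is under one label than under the other. The sketch below illustrates that idea on toy counts. It assumes add-one smoothing and a single `last_letter` feature, so the numbers will not match NLTK's output.

```python
from collections import Counter

# Toy training data with a single feature, for illustration only
toy_train = [
    ({'last_letter': 'a'}, 'female'), ({'last_letter': 'a'}, 'female'),
    ({'last_letter': 'a'}, 'female'), ({'last_letter': 'n'}, 'female'),
    ({'last_letter': 'n'}, 'male'), ({'last_letter': 'n'}, 'male'),
    ({'last_letter': 'k'}, 'male'), ({'last_letter': 'a'}, 'male'),
]

# Count feature values separately for each label
counts = {'female': Counter(), 'male': Counter()}
for features, label in toy_train:
    counts[label][features['last_letter']] += 1

values = sorted(set(counts['female']) | set(counts['male']))

def likelihood(label, value):
    """P(value | label) with add-one smoothing (an assumption)."""
    total = sum(counts[label].values())
    return (counts[label][value] + 1) / (total + len(values))

# Ratio above 1 favours 'female', below 1 favours 'male'
ratios = {v: likelihood('female', v) / likelihood('male', v) for v in values}

# Rank by how far the ratio is from 1 in either direction
ranked = sorted(ratios, key=lambda v: max(ratios[v], 1 / ratios[v]), reverse=True)
for value in ranked:
    print(f"last_letter = {value!r}  female : male = {ratios[value]:.2f} : 1.00")
```

A value seen equally often under both labels has a ratio close to 1 and therefore carries little information for the classifier.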

<div class="alert alert-block alert-info">
<font size = 4> 
    
**Note :**

**This model can be modified to be used for any labeled data with required data cleaning and preprocessing.**

- The NLTK Naive Bayes classifier accepts data in a specific format: a list of tuples, each containing a feature dictionary and its label.
- The data should be transformed in this manner and used for classification.
- `sklearn` classifiers may also be used, but they require transforming text data to numerical formats (discussed in later chapters).
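
As a sketch of how other labeled text could be brought into this format, the block below converts short sentences into `(feature dictionary, label)` tuples. The `word_features` function, its `contains(...)` feature names and the two example sentences are hypothetical and only illustrate the shape of the data.

```python
def word_features(text):
    """Hypothetical extractor: marks each lowercased word as present."""
    return {f'contains({word})': True for word in text.lower().split()}

# Made-up labeled sentences, for illustration only
raw_data = [
    ("Great product works well", "pos"),
    ("Poor quality broke quickly", "neg"),
]

# One (feature dictionary, label) tuple per example
formatted = [(word_features(text), label) for text, label in raw_data]
print(formatted[0])
```

A list built this way has the same structure as `train_data` above, so it could be passed to the classifier's `train` method in the same manner.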
