# NLP TF-IDF Classifier

[Disaster Tweets Dataset](https://www.kaggle.com/c/nlp-getting-started)

> Term Frequency – Inverse Document Frequency (TF-IDF) is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus. The tf–idf value increases proportionally to the number of times a word appears in the document and is offset by the number of documents in the corpus that contain the word, which helps to adjust for the fact that some words appear more frequently in general. 
- https://en.wikipedia.org/wiki/Tf%E2%80%93idf

TF-IDF scores are computed for each word to discover which words are most correlated with each label, and are then used to create a solver that predicts a tweet's label based on relative TF-IDF scores.

The first stage uses TF-IDF to make a prediction, which we can then use to generate a new data field "accuracy". This allows us to recompute TF-IDF using "accuracy" (to split between true and false positive) rather than on the "target" label.

TF-IDF is normally computed over many documents within a large corpus. Here I am treating the labels as two giant documents, so each "document" is almost as large as the corpus. With small documents, stopwords would cancel out. With documents this large, however, stopwords have high values in both categories.


By taking the ratio, we are looking for words with a high TF-IDF value in one category, but a low value in the other category. Stopwords now cancel out at close to 1. Using a ratio seemed natural given that we only have 2 documents.

Words that appear mostly in one category but not the other will have a much higher ratio. Note how "Hiroshima" suddenly jumps to the top of the list, compared with ":". This is a highly relevant keyword that is rarely seen in the other category.
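
A toy illustration of the ratio idea, using made-up token lists rather than the real dataset. The lists and the `frequency_ratio` helper are hypothetical, and raw term frequency stands in for the TF-IDF score. A token that is equally common under both labels lands at 1, while a token concentrated in one label gets a larger ratio:

```python
from collections import Counter

# Hypothetical token lists standing in for the two label "documents"
disaster_tokens     = ["the", "fire", "the", "flood", "fire", "fire", "the", "day"]
not_disaster_tokens = ["the", "day", "the", "fire", "sunny", "the", "day", "cake"]


def frequency_ratio(tokens_a, tokens_b):
    """Ratio of relative frequencies, for tokens present in both lists."""
    counts_a, counts_b = Counter(tokens_a), Counter(tokens_b)
    return {
        token: (counts_a[token] / len(tokens_a)) / (counts_b[token] / len(tokens_b))
        for token in counts_a
        if token in counts_b
    }


ratios = frequency_ratio(disaster_tokens, not_disaster_tokens)
for token, ratio in sorted(ratios.items(), key=lambda item: -item[1]):
    print(f"{token:>5s} {ratio:.2f}")
```

In this toy data the stopword "the" scores exactly 1, "fire" scores 3, and "flood" is skipped because it never appears in the second list.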

> By adding the true_positive_ratio_scaled to disaster_ratio, what you’re saying is “increase the importance of these words, by this much since they are a  good indicator of disaster tweets” and by adding false_positive_ratio_scaled to non_disaster_ratio, what you are saying is “these words are usually present in tweets that are actually NOT disasters but are falsely classified as disasters. So increase their non_disaster_ratios so that they are more likely to be treated as indicators for non-disaster tweets” - Pawan Bhandarkar

In [None]:
import numpy  as np  # linear algebra
import pandas as pd  # data processing, CSV file I/O (e.g. pd.read_csv)
import nltk
import pydash
import math
import os
from pydash import flatten
from collections import Counter, OrderedDict
from humanize import intcomma
from operator import itemgetter
from typing import *

# CSV Data

First we read our CSV into a pandas dataframe

In [None]:
df_train = pd.read_csv('../input/nlp-getting-started/train.csv', index_col=0)
df_test  = pd.read_csv('../input/nlp-getting-started/test.csv', index_col=0)
df_train

# Tokenization

Tokenization is the process of splitting text strings into a word array. There are a range of different algorithms, each with slightly different rules for how to define word boundaries. For now we will just use [nltk.tokenize.casual.TweetTokenizer](https://www.nltk.org/api/nltk.tokenize.html#module-nltk.tokenize.casual).
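
To make the point about word boundaries concrete, here is a small standard-library sketch comparing two naive rules on a made-up tweet. Neither rule is what `TweetTokenizer` does. They are hypothetical stand-ins chosen only to show that different boundary rules yield different tokens:

```python
import re

text = "Fire near #Canada!"  # made-up example tweet

# Rule 1: split on whitespace only, so punctuation stays attached to words
whitespace_tokens = text.split()

# Rule 2: runs of word characters, or any single non-space symbol
regex_tokens = re.findall(r"\w+|[^\w\s]", text)

print(whitespace_tokens)  # ['Fire', 'near', '#Canada!']
print(regex_tokens)       # ['Fire', 'near', '#', 'Canada', '!']
```

The first rule glues the exclamation mark onto the hashtag, while the second breaks the hashtag apart. A tweet-aware tokenizer is designed to avoid both problems.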

In [None]:
def tokenize_df( dfs: List[pd.DataFrame] ) -> List[str]:
    # TweetTokenizer defaults produce the best result
    # Lowercasing, stripping handles and stemming reduce submission accuracy to 0.72387 
    tokenizer = nltk.TweetTokenizer(preserve_case=True,  reduce_len=False, strip_handles=False)  # defaults 
    # tokenizer = nltk.TweetTokenizer(preserve_case=False, reduce_len=True, strip_handles=True) 
    tokens = flatten([
        tokenizer.tokenize(tweet_text)    
        for df in flatten([ dfs ])
        for tweet_text in df['text']
    ])
    
    # stemmer   = nltk.PorterStemmer()
    # tokens = [ stemmer.stem(token) for token in tokens ]
    return tokens

tokens_all          = tokenize_df([ df_train, df_test ])
tokens_train        = tokenize_df( df_train )
tokens_test         = tokenize_df( df_test )
tokens_disaster     = tokenize_df( df_train[ df_train['target'] == 1 ] )
tokens_not_disaster = tokenize_df( df_train[ df_train['target'] == 0 ] )
tokens_shared       = set(tokens_train) & set(tokens_test) & set(tokens_disaster) & set(tokens_not_disaster)

print('Unique Tokens:')
print('  tokens_all          ', intcomma(len(set(tokens_all))))
print('  tokens_train        ', intcomma(len(set(tokens_train))))
print('  tokens_test         ', intcomma(len(set(tokens_test))))
print('  tokens_disaster     ', intcomma(len(set(tokens_disaster))))
print('  tokens_not_disaster ', intcomma(len(set(tokens_not_disaster))))
print('  tokens_shared       ', intcomma(len(set(tokens_shared))))
print()
print('New Tokens:')
print(f'  tokens_test         - tokens_train        {intcomma(len(set(tokens_test) - set(tokens_train)))           :>6s} ({len(set(tokens_test) - set(tokens_train))/len(set(tokens_test))*100:.1f}%)')
print(f'  tokens_train        - tokens_test         {intcomma(len(set(tokens_train) - set(tokens_test)))           :>6s} ({len(set(tokens_train) - set(tokens_test))/len(set(tokens_train))*100:.1f}%)')
print(f'  tokens_disaster     - tokens_not_disaster {intcomma(len(set(tokens_disaster) - set(tokens_not_disaster))):>6s} ({len(set(tokens_disaster) - set(tokens_not_disaster))/len(set(tokens_disaster))*100:.1f}%)')
print(f'  tokens_not_disaster - tokens_disaster     {intcomma(len(set(tokens_not_disaster) - set(tokens_disaster))):>6s} ({len(set(tokens_not_disaster) - set(tokens_disaster))/len(set(tokens_not_disaster))*100:.1f}%)')

We have a total vocabulary of 35k word tokens. 

Half the word tokens that appear in the test dataset are not found in the training dataset, and three quarters of the words in the training dataset are not seen again in the test dataset. 

Similar 66-73% figures are found when comparing the split between disaster and safe tweets, suggesting that only a third of vocabulary is shared between dataset splits.

# TF-IDF 

Term Frequency Inverse Document Frequency (TF-IDF) measures the statistical importance of a word in a document, relative to how many times it appears in all text.

We are treating all tweets with a given label as a single document, and comparing the frequency of words between labels. Because we are comparing two large "documents", stopwords and punctuation get high scores. These will be eliminated below by comparing the relative TF-IDF between labels.
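
Written out, the quantities computed by the functions in this section are as follows. Here $n_{t,d}$ is the number of times token $t$ occurs in the token list $d$, $|d|$ is the length of that list, and $C$ is the combined train and test token list. These symbols are notation introduced here for illustration, not taken from the notebook. Note that the IDF term is built from token counts over the whole corpus rather than from the number of documents containing the token, so it differs from the textbook definition:

$$
\mathrm{tf}(t, d) = \frac{n_{t,d}}{|d|}, \qquad
\mathrm{idf}(t) = \log\frac{|C|}{n_{t,C}}, \qquad
\text{tf-idf}(t, d) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t)
$$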

In [None]:
def term_frequency( tokens: List[str] ) -> Counter:
    tf = {
        token: count / len(tokens)
        for token, count in Counter(tokens).items()
    }
    tf = Counter(dict(Counter(tf).most_common()))  # sort and cast
    return tf
    
    
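def inverse_document_frequency( tokens: List[str] ) -> Counter:
    # NOTE: this is log(total tokens / token count) over the whole corpus,
    #       not the textbook log(number of documents / documents containing the token)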
    idf = {
        token: math.log( len(tokens) / count ) 
        for token, count in Counter(tokens).items()
    }
    idf = Counter(dict(Counter(idf).most_common()))  # sort and cast
    return idf


def tf_idf( document_tokens: List[str], all_tokens: List[str] ) -> Counter:
    tf  = term_frequency(document_tokens)
    idf = inverse_document_frequency(all_tokens)
    tf_idf = {
        token: tf[token] * idf[token]
        for token in set(document_tokens)
    }    
    tf_idf = Counter(dict(Counter(tf_idf).most_common()))  # sort and cast
    return tf_idf


tf_disaster         = term_frequency(tokens_disaster)
tf_not_disaster     = term_frequency(tokens_not_disaster)
idf                 = inverse_document_frequency(tokens_all)
tf_idf_disaster     = tf_idf(tokens_disaster, tokens_all)
tf_idf_not_disaster = tf_idf(tokens_not_disaster, tokens_all)

display('tf_idf_disaster')
display(tf_idf_disaster.most_common(5))

display('tf_idf_not_disaster')
display(tf_idf_not_disaster.most_common(5))

However, by comparing the ratio of TF-IDF scores between datasets, we can filter out the stopwords (which have high scores in both datasets) and return a list of keywords that are most indicative of a specific label.
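
One consequence worth spelling out: both labels share the same IDF table, so the IDF factor cancels when the two TF-IDF scores are divided, and the ratio equals the ratio of raw term frequencies. A minimal check with made-up numbers (all values and names below are hypothetical):

```python
import math

# Hypothetical values for a single token
tf_a       = 0.012  # relative frequency under the first label
tf_b       = 0.003  # relative frequency under the second label
shared_idf = 5.3    # IDF is computed once over the corpus, so it is the same for both

ratio_tf_idf = (tf_a * shared_idf) / (tf_b * shared_idf)
ratio_tf     = tf_a / tf_b

print(ratio_tf_idf, ratio_tf)
assert math.isclose(ratio_tf_idf, ratio_tf)  # equal up to floating-point rounding
```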

In [None]:
def tf_idf_ratio( tf_idf_true: Counter, tf_idf_false: Counter ) -> Counter:
    tf_idf_false_tokens = set(tf_idf_false.keys()) 
    tf_idf_ratio = {
        token: tf_idf_true[token] / tf_idf_false[token]
        for token in tf_idf_true.keys()
        if  token in tf_idf_false_tokens        
    }
    tf_idf_ratio = Counter(dict(Counter(tf_idf_ratio).most_common()))  # sort and cast
    return tf_idf_ratio


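# NOTE: raw term frequencies are passed here rather than tf_idf_disaster / tf_idf_not_disaster.
#       The IDF factor is shared by both labels and cancels in the ratio,
#       so the result is the same up to floating-point rounding.
tf_idf_ratio_disaster     = tf_idf_ratio(tf_disaster, tf_not_disaster)
tf_idf_ratio_not_disaster = tf_idf_ratio(tf_not_disaster, tf_disaster)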


display('tf_idf_ratio_disaster')
display(tf_idf_ratio_disaster.most_common(30))
display('tf_idf_ratio_not_disaster')
display(tf_idf_ratio_not_disaster.most_common(30))

# Solver

The solver here is fairly simple.

We tokenize the tweet and compare each word against the TF-IDF scores for each label. The label with the highest score wins.

Despite this being a very simple method, we get a respectable 81% accuracy on the training dataset and 77.4% accuracy on the submission test dataset.

Applying math.log() to the TF-IDF scores helps reduce overfitting, resulting in a +0.49% improvement on submission test_df scores but with a -0.37% reduction in training accuracy.
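
A rough sketch of the scoring rule with and without the logarithm, using a hypothetical three-token tweet and made-up ratios. `toy_ratio_disaster` and `toy_ratio_not_disaster` are illustrative names, not the notebook's tables. It shows how one large ratio can dominate the raw sum, while the log keeps each token's contribution on a comparable scale:

```python
import math

# Made-up ratios; for shared tokens each table is the reciprocal of the other
toy_ratio_disaster     = {"wildfire": 8.0, "lol": 0.2, "my": 0.5}
toy_ratio_not_disaster = {token: 1 / ratio for token, ratio in toy_ratio_disaster.items()}

tokens = ["my", "lol", "wildfire"]

# Raw rule: add the disaster ratio, subtract the non-disaster ratio
raw_score = sum(
    toy_ratio_disaster.get(token, 0) - toy_ratio_not_disaster.get(token, 0)
    for token in tokens
)

# Log rule: same idea on a log scale; unknown tokens default to log(1) == 0
log_score = sum(
    math.log(toy_ratio_disaster.get(token, 1)) - math.log(toy_ratio_not_disaster.get(token, 1))
    for token in tokens
)

print(f"raw score = {raw_score:+.3f}")
print(f"log score = {log_score:+.3f}")
```

With these made-up numbers the raw score is positive and the log score is negative, so the two rules can disagree on the same tweet.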

In [None]:
def tf_idf_classifer_score(tweet_text: str, tf_idf_ratio_disaster, tf_idf_ratio_not_disaster) -> float:
    score  = 0.0
    tokens = nltk.TweetTokenizer().tokenize(tweet_text)
    for token in tokens:
        if token in tokens_shared:
            # math.log() improves test_df score from 0.76892 -> 0.77382 (+0.49%)
            #            reduces train_df score from 0.81229 -> 0.80861 (-0.37%)
            # score += tf_idf_ratio_disaster.get(token, 0)     
            # score -= tf_idf_ratio_not_disaster.get(token, 0) 
            score += math.log( tf_idf_ratio_disaster.get(token, 1)     )  # log(1) == 0
            score -= math.log( tf_idf_ratio_not_disaster.get(token, 1) )
    return score


def tf_idf_classifer(tweet_text: str, tf_idf_ratio_disaster, tf_idf_ratio_not_disaster) -> int:
    score = tf_idf_classifer_score(tweet_text, tf_idf_ratio_disaster, tf_idf_ratio_not_disaster)
    label = 1 if score > 0 else 0  # NOTE: label == 0 if score == 0 
    return label


def tf_idf_classifer_df(df: pd.DataFrame) -> np.ndarray:
    return np.array([
        tf_idf_classifer( row['text'], tf_idf_ratio_disaster, tf_idf_ratio_not_disaster )
        for index, row in df.iterrows()    
    ])

def tf_idf_classifer_accuracy(df: pd.DataFrame, tf_idf_ratio_disaster, tf_idf_ratio_not_disaster) -> float:
    correct = 0
    total   = 0
    for index, row in df.iterrows():
        label = tf_idf_classifer( row['text'], tf_idf_ratio_disaster, tf_idf_ratio_not_disaster )
        if label == row['target']:
            correct += 1
        total += 1
    accuracy = correct / total
    return accuracy

In [None]:
accuracy = tf_idf_classifer_accuracy(df_train, tf_idf_ratio_disaster, tf_idf_ratio_not_disaster)
print('accuracy =', accuracy)

# False Positive TF-IDF

We can repeat this process and use TF-IDF to find the words likely to indicate a false positive match.

This improves train accuracy by +1.15%

In [None]:
df_train

In [None]:
predictions           = tf_idf_classifer_df(df_train)
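# NOTE: despite the names, the split is between correctly and incorrectly classified tweets
#       (true positives + true negatives vs. false positives + false negatives)
tokens_true_positive  = tokenize_df( df_train[ df_train['target'] == predictions ] )
tokens_false_positive = tokenize_df( df_train[ df_train['target'] != predictions ] )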

tf_idf_true_positive  = tf_idf( tokens_true_positive,  tokens_all )
tf_idf_false_positive = tf_idf( tokens_false_positive, tokens_all )

tf_idf_true_positive_ratio  = tf_idf_ratio(tf_idf_true_positive,  tf_idf_false_positive)
tf_idf_false_positive_ratio = tf_idf_ratio(tf_idf_false_positive, tf_idf_true_positive)

display('tf_idf_true_positive_ratio')
display(tf_idf_true_positive_ratio.most_common(30))
display('tf_idf_false_positive_ratio')
display(tf_idf_false_positive_ratio.most_common(30))

Now we perform a hyper-parameter search to find the best ratios for combining this new metadata with our solver.
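
The search and the final solver combine the ratio tables with `collections.Counter` addition, which sums values token by token and keeps tokens that appear in only one operand. A small sketch with made-up ratios (the token values and the `scale` name are hypothetical):

```python
from collections import Counter

base_ratio  = Counter({"fire": 3.0, "the": 1.0})
extra_ratio = Counter({"fire": 4.0, "smoke": 6.0})

scale = 2  # larger values shrink the influence of the second table
extra_scaled = Counter({token: value / scale for token, value in extra_ratio.items()})

# "fire" is summed across both tables; "the" and "smoke" are carried over unchanged
combined = base_ratio + extra_scaled
print(combined)
```

`Counter` addition also discards any entry whose summed value is zero or negative. That does not matter here because the ratios are always positive.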

In [None]:
def ratio_hyperparameter_search():
    results = {}
    for scale1 in     [ 0.5, 1, 1.5, 2, 2.5, 3, 4, 8, 16 ]:
        for scale2 in [ 0.5, 1, 1.5, 2, 2.5, 3, 4, 8, 16 ]:   
            tf_idf_true_positive_ratio_scaled  = Counter({ token: count/scale1 for token, count in tf_idf_true_positive_ratio.items()  }) 
            tf_idf_false_positive_ratio_scaled = Counter({ token: count/scale2 for token, count in tf_idf_false_positive_ratio.items() }) 

            accuracy = tf_idf_classifer_accuracy(
                df_train, 
                tf_idf_ratio_disaster     + tf_idf_true_positive_ratio_scaled, 
                tf_idf_ratio_not_disaster + tf_idf_false_positive_ratio_scaled,
            )
            results[(scale1, scale2)] = accuracy 
    display(Counter(results).most_common(10))
  

if os.environ.get('KAGGLE_KERNEL_RUN_TYPE', 'Localhost') == 'Batch':
    ratio_hyperparameter_search()

In [None]:
# Hardcode optimal values
tf_idf_true_positive_ratio_scaled  = Counter({ token: count/3 for token, count in tf_idf_true_positive_ratio.items()  }) 
tf_idf_false_positive_ratio_scaled = Counter({ token: count/2 for token, count in tf_idf_false_positive_ratio.items() }) 
accuracy = tf_idf_classifer_accuracy(
    df_train, 
    tf_idf_ratio_disaster     + tf_idf_true_positive_ratio_scaled, 
    tf_idf_ratio_not_disaster + tf_idf_false_positive_ratio_scaled,
)
print('accuracy =', accuracy)

# Submission

In [None]:
df_submission = pd.read_csv('../input/nlp-getting-started/sample_submission.csv', index_col=0)
for index, row in df_test.iterrows():
    label = tf_idf_classifer( 
        row['text'], 
        tf_idf_ratio_disaster     + tf_idf_true_positive_ratio_scaled, 
        tf_idf_ratio_not_disaster + tf_idf_false_positive_ratio_scaled,
    )
    df_submission.loc[index] = label 
df_submission.to_csv('submission.csv')
df_submission

# Further Reading

This notebook is part of a series exploring Natural Language Processing:
- 0.74164 - [NLP Logistic Regression](https://www.kaggle.com/jamesmcguigan/disaster-tweets-logistic-regression/)
- 0.77536 - [NLP TF-IDF Classifier](https://www.kaggle.com/jamesmcguigan/disaster-tweets-tf-idf-classifier)
- 0.79742 - [NLP Naive Bayes](https://www.kaggle.com/jamesmcguigan/nlp-naive-bayes)