# Text Classification

In this notebook, we apply text preprocessing and representation techniques to a text classification use case. We will:

- Clean up text data and transform it into vector form
- Work through a text classification use case

Text classification has many applications, such as:
- Document categorization
- Spam vs. ham detection
- Review classification
- Fake vs. real news detection
- Sentiment classification



## Install Dependencies

In [None]:
!pip install contractions
!pip install tqdm

## Import Libraries

In [None]:
import nltk
import contractions
from bs4 import BeautifulSoup
import numpy as np
import re
from tqdm.notebook import tqdm
import unicodedata
import pandas as pd

In [None]:
nltk.download('punkt')

## Get Data
We use the movie review dataset for this tutorial.

In [None]:
dataset = pd.read_csv(r'movie_reviews.csv.bz2')
dataset.info()

In [None]:
dataset.head()

### Prepare Train-Test Splits

In [None]:
# build train and test datasets
reviews = dataset['review'].values
sentiments = dataset['sentiment'].values

In [None]:
train_reviews = reviews[:35000]
train_sentiments = sentiments[:35000]

In [None]:
test_reviews = reviews[35000:]
test_sentiments = sentiments[35000:]
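
The slices above take the first 35,000 rows for training and the remainder for testing, which is only representative if the file is not ordered by label. As a minimal sketch, assuming nothing about the row order, the indices can be shuffled with a fixed seed before slicing. The helper name `shuffled_split` and the toy data are hypothetical and not part of the original notebook.

```python
import random

def shuffled_split(items, labels, n_train, seed=42):
    """Shuffle paired items/labels with a fixed seed, then slice into train/test."""
    if len(items) != len(labels):
        raise ValueError("items and labels must have the same length")
    indices = list(range(len(items)))
    random.Random(seed).shuffle(indices)  # local RNG keeps the split reproducible
    train_idx, test_idx = indices[:n_train], indices[n_train:]
    train = ([items[i] for i in train_idx], [labels[i] for i in train_idx])
    test = ([items[i] for i in test_idx], [labels[i] for i in test_idx])
    return train, test

# toy data standing in for the review and sentiment arrays
toy_reviews = [f"review {i}" for i in range(10)]
toy_sentiments = ["positive" if i % 2 == 0 else "negative" for i in range(10)]

(train_x, train_y), (test_x, test_y) = shuffled_split(toy_reviews, toy_sentiments, n_train=7)
print(len(train_x), len(test_x))
```

Because the seed is fixed, rerunning the cell yields the same split, which keeps later evaluation numbers comparable across runs.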

## Text Preprocessing
- Remove HTML/Special Characters
- Remove accented characters
- Lowercase

In [None]:
def strip_html_tags(text):
  soup = BeautifulSoup(text, "html.parser")
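  # drop embedded iframes and scripts before extracting the visible text
  for s in soup(['iframe', 'script']):
    s.extract()
  stripped_text = soup.get_text()
  # collapse runs of carriage returns/newlines into a single newline
  stripped_text = re.sub(r'[\r\n]+', '\n', stripped_text)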
  return stripped_text

def remove_accented_chars(text):
  text = unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('utf-8', 'ignore')
  return text

def pre_process_corpus(docs):
  norm_docs = []
  for doc in tqdm(docs):
    doc = strip_html_tags(doc)
    doc = doc.translate(doc.maketrans("\n\t\r", "   "))
    doc = doc.lower()
    doc = remove_accented_chars(doc)
    doc = contractions.fix(doc)
    # remove special characters; flags must be passed by keyword (the 4th positional argument is count)
    doc = re.sub(r'[^a-zA-Z0-9\s]', '', doc, flags=re.I|re.A)
    doc = re.sub(' +', ' ', doc)
    doc = doc.strip()  
    norm_docs.append(doc)
  
  return norm_docs
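
To see what the accent-removal and character-filtering steps do, they can be tried on a single string using only the standard library. This is a minimal sketch: the helper names `strip_accents` and `keep_alphanumeric` and the sample sentence are hypothetical, and the HTML-stripping and contraction-expansion steps are left out. Note that `re.sub` takes `count` as its fourth positional argument, so the flags are passed by keyword here.

```python
import re
import unicodedata

def strip_accents(text):
    # decompose accented characters (NFKD), then drop the non-ASCII combining marks
    return unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("utf-8", "ignore")

def keep_alphanumeric(text):
    # flags must be passed by keyword: the 4th positional argument of re.sub is `count`
    text = re.sub(r"[^a-zA-Z0-9\s]", "", text, flags=re.I | re.A)
    # collapse runs of spaces left behind by the removed characters
    return re.sub(" +", " ", text).strip()

sample = "  The café's décor was AMAZING -- 10/10!!  "
cleaned = keep_alphanumeric(strip_accents(sample.lower()))
print(cleaned)  # the cafes decor was amazing 1010
```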

In [None]:
%%time
norm_train_reviews = pre_process_corpus(train_reviews)
norm_test_reviews = pre_process_corpus(test_reviews)

## Feature Engineering

In [None]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

In [None]:
# build BOW features on train reviews
cv = CountVectorizer(binary=False, min_df=5, max_df=1.0, ngram_range=(1,2))
cv_train_features = cv.fit_transform(norm_train_reviews)

In [None]:
# build TFIDF features on train reviews
tv = TfidfVectorizer(use_idf=True, min_df=5, max_df=1.0, ngram_range=(1,2),
                     sublinear_tf=True)
tv_train_features = tv.fit_transform(norm_train_reviews)

In [None]:
%%time

# transform test reviews into features
cv_test_features = cv.transform(norm_test_reviews)
tv_test_features = tv.transform(norm_test_reviews)
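
To make concrete what the two vectorizers compute, the sketch below reproduces the counting and weighting in plain Python for unigrams on a toy corpus. It assumes scikit-learn's documented formulas: smoothed IDF `ln((1 + n) / (1 + df)) + 1`, sublinear TF `1 + ln(tf)` (as with `sublinear_tf=True`), and L2 normalization of each row. It leaves out bigrams and the `min_df` filter, and it is an illustration rather than the library's implementation; the corpus and the helper `tfidf_row` are hypothetical.

```python
import math
from collections import Counter

docs = ["good movie", "bad movie", "good good plot"]  # toy corpus, hypothetical
tokenized = [d.split() for d in docs]
vocab = sorted({tok for doc in tokenized for tok in doc})

# bag of words: raw term counts per document, one column per vocabulary term
bow = [[Counter(doc)[term] for term in vocab] for doc in tokenized]

# document frequency: number of documents containing each term
n_docs = len(tokenized)
df = {term: sum(term in doc for doc in tokenized) for term in vocab}
idf = {term: math.log((1 + n_docs) / (1 + df[term])) + 1 for term in vocab}

def tfidf_row(counts):
    # sublinear tf (1 + ln(tf)) times idf, then scale the row to unit L2 norm
    weights = [(1 + math.log(c)) * idf[t] if c > 0 else 0.0 for c, t in zip(counts, vocab)]
    norm = math.sqrt(sum(w * w for w in weights))
    return [w / norm for w in weights] if norm else weights

tfidf = [tfidf_row(row) for row in bow]

print(vocab)  # ['bad', 'good', 'movie', 'plot']
print(bow)    # [[0, 1, 1, 0], [1, 0, 1, 0], [0, 2, 0, 1]]
```

In the second document, "bad" appears in fewer documents than "movie", so it receives the larger TF-IDF weight even though both occur once.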

## Classification Model: Logistic Regression

Logistic regression, also known as the logit model, passes a weighted sum of the input features through the logistic (sigmoid) function to estimate the probability of the positive class. Training estimates the parameter values, that is, the coefficients of all our features, such that the overall loss is minimized when predicting the outcome.
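
As a minimal sketch of the idea, the snippet below computes the sigmoid of a linear score and the log loss that training would minimize. The weights, bias, and feature values are made up for illustration; scikit-learn's `LogisticRegression` additionally applies regularization and uses a numerical solver, neither of which is shown here.

```python
import math

def sigmoid(z):
    # maps any real-valued score to a probability in (0, 1)
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(features, weights, bias):
    # linear score w . x + b passed through the sigmoid
    score = sum(w * x for w, x in zip(weights, features)) + bias
    return sigmoid(score)

def log_loss(y_true, y_prob):
    # average negative log-likelihood; y_true holds 0/1 labels
    total = 0.0
    for y, p in zip(y_true, y_prob):
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

# hypothetical weights for two features, e.g. counts of "great" and "awful"
weights, bias = [1.5, -2.0], 0.0
X = [[2, 0], [0, 1]]
y = [1, 0]

probs = [predict_proba(x, weights, bias) for x in X]
print(probs, log_loss(y, probs))
```

With these weights the first example is scored as likely positive and the second as likely negative, so the log loss is lower than it would be for an uninformed prediction of 0.5 on both.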

In [None]:
# Logistic Regression model and the evaluation utilities used below
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix

### LR with Count Vectorizer

In [None]:
# instantiate model
lr_cv = LogisticRegression(penalty='l2', 
                        max_iter=500, 
                        C=1, 
                        solver='lbfgs', 
                        random_state=42)

In [None]:
## Train with CountVectorizer Features
# train model
lr_cv.fit(cv_train_features, train_sentiments)

In [None]:
# predict on test data
lr_bow_predictions = lr_cv.predict(cv_test_features)

### Evaluate Model

In [None]:
print(classification_report(test_sentiments, lr_bow_predictions))

In [None]:
labels = ['negative', 'positive']
pd.DataFrame(confusion_matrix(test_sentiments, lr_bow_predictions), 
             index=labels, columns=labels)
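
The per-class precision, recall, and F1 values in the classification report can be derived from the confusion-matrix counts. The sketch below shows the arithmetic, assuming the scikit-learn convention that rows are true labels and columns are predicted labels. The helper `per_class_metrics` is hypothetical and the counts are made up for illustration; they are not results from this notebook.

```python
def per_class_metrics(matrix, labels):
    """Compute precision, recall and F1 per class from a square confusion matrix.

    Rows are true labels and columns are predicted labels.
    """
    metrics = {}
    for k, label in enumerate(labels):
        true_positive = matrix[k][k]
        predicted_k = sum(row[k] for row in matrix)  # column sum
        actual_k = sum(matrix[k])                    # row sum
        precision = true_positive / predicted_k if predicted_k else 0.0
        recall = true_positive / actual_k if actual_k else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        metrics[label] = {"precision": precision, "recall": recall, "f1": f1}
    return metrics

# made-up counts for illustration only
toy_matrix = [[90, 10],
              [20, 80]]
report = per_class_metrics(toy_matrix, ["negative", "positive"])
accuracy = (toy_matrix[0][0] + toy_matrix[1][1]) / sum(map(sum, toy_matrix))
print(report, accuracy)
```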

### LR with TFIDF

In [None]:
# instantiate model
lr_tv = LogisticRegression(penalty='l2', 
                        max_iter=500, 
                        C=1, 
                        solver='lbfgs', 
                        random_state=42)

In [None]:
## Train with TF-IDF Features
# train model
lr_tv.fit(tv_train_features, train_sentiments)

In [None]:
# predict on test data
lr_tfidf_predictions = lr_tv.predict(tv_test_features)

### Evaluate Model

In [None]:
print(classification_report(test_sentiments, lr_tfidf_predictions))

In [None]:
labels = ['negative', 'positive']
pd.DataFrame(confusion_matrix(test_sentiments, lr_tfidf_predictions), 
             index=labels, columns=labels)