<a href="https://colab.research.google.com/github/tehilamal/GeekWeek2019/blob/master/lecture_1/tau_text_mining_1_text_processing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Text Processing - for TAU Text Mining (for MBA) 24/25

## Importing the libraries

In [1]:
import numpy as np
import matplotlib.pyplot as plt

## 1. Importing the dataset

Taken from:
https://www.kaggle.com/datasets/maher3id/restaurant-reviewstsv

In [None]:
import pandas as pd

In [None]:
dataset = pd.read_csv('Restaurant_Reviews.tsv', delimiter = '\t', quoting = 3)

In [None]:
dataset.sample(10)

### Q1: Extract random sentences

Extract a random sample of 10 sentences as a list of strings
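
One possible approach is sketched below. It assumes the review text sits in the first column of `dataset`; the toy frame (including its column names) is a hypothetical stand-in for the real file, and `random_state` is only there to make the draw reproducible.

```python
import pandas as pd

# Hypothetical stand-in for the real dataset: text in the first column, label in the last.
toy = pd.DataFrame({
    "Review": [f"Sentence number {i}." for i in range(50)],
    "Liked": [i % 2 for i in range(50)],
})

# Sample 10 rows from the text column and convert the result to a plain list of strings.
sent_sample = toy.iloc[:, 0].sample(n=10, random_state=0).tolist()
print(len(sent_sample), type(sent_sample[0]))
```

On the real data, the same expression applied to `dataset` instead of `toy` should give a list that later cells can index, e.g. `sent_sample[5]`.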

## 2. Simple Bag-of-Words

### Q2: Build a BoW-er

Write a function that turns a string sentence into a dict-based BoW representation.

**Hint:** Use https://www.nltk.org/api/nltk.tokenize.word_tokenize.html

In [None]:
from typing import Dict
from nltk.tokenize import word_tokenize

def bower(sent_str: str) -> Dict[str, int]:
    # Fix me!
    return {}

bower(sent_sample[5])
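
A minimal sketch of one way to fill in such a function. To stay free of NLTK data downloads it swaps in a simple regex tokenizer (an assumption of this sketch, not the tokenizer from the hint); `simple_bower` is a hypothetical name.

```python
import re
from collections import Counter
from typing import Dict

def simple_bower(sent_str: str) -> Dict[str, int]:
    """Count how often each token occurs in the sentence."""
    # Stand-in tokenizer: runs of word characters, or single punctuation marks.
    # With NLTK and its tokenizer data available, word_tokenize(sent_str)
    # could be used here instead.
    tokens = re.findall(r"\w+|[^\w\s]", sent_str)
    return dict(Counter(tokens))

bow = simple_bower("The food was good, and the service was good.")
print(bow)
```

Note that `The` and `the` are counted separately here, because nothing is lowercased yet.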

### Q3: Generate a single BoW-dict representation for your sentence sample

### Q4: How many unique words in your sentence sample?
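
Q3 and Q4 can be approached together: summing the per-sentence dicts gives one BoW dict for the sample, and its number of keys is the number of unique words. The sketch below uses hypothetical per-sentence dicts and a hypothetical helper name, `merge_bows`.

```python
from collections import Counter
from typing import Dict, List

def merge_bows(bows: List[Dict[str, int]]) -> Dict[str, int]:
    """Sum per-sentence BoW dicts into one BoW dict for the whole sample."""
    total: Counter = Counter()
    for bow in bows:
        total.update(bow)  # adds counts key by key
    return dict(total)

# Hypothetical per-sentence BoW dicts, as a Q2-style function might return them.
bows = [{"good": 1, "food": 1}, {"good": 2, "service": 1}]
sample_bow = merge_bows(bows)
print(sample_bow)       # {'good': 3, 'food': 1, 'service': 1}
print(len(sample_bow))  # 3 unique words
```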

## 3. BoW-based vectorization

Here is how we can use existing `sklearn` features to get count-based BoW vector representations of our sentences (each component holds the number of times a word occurs in the sentence).

https://scikit-learn.org/1.5/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(max_features = 3000, lowercase=False)
X = cv.fit_transform(sent_sample).toarray()

In [None]:
X[0]

### Q5: How many unique words in your sentence sample, according to CountVectorizer?

Why is the number (probably) lower?

Because of the way `CountVectorizer` separates strings into tokens; see the documentation for `token_pattern`:

``token_pattern: str or None, default=r"(?u)\b\w\w+\b"``

  Regular expression denoting what constitutes a “token”, only used if analyzer == 'word'. The default regexp select tokens of 2 or
  more alphanumeric characters **(punctuation is completely ignored and always treated as a token separator)**.
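
The effect of this pattern can be checked with the standard `re` module alone. The sketch below applies the default pattern quoted above to a made-up sentence; it only illustrates the tokenization step, not the rest of `CountVectorizer`.

```python
import re

# Default CountVectorizer token pattern, copied from the documentation quoted above.
TOKEN_PATTERN = r"(?u)\b\w\w+\b"

text = "Wow... I loved a place!"
tokens = re.findall(TOKEN_PATTERN, text)
print(tokens)  # ['Wow', 'loved', 'place']
```

Punctuation never becomes a token, and the single-character words `I` and `a` are dropped as well, so the vocabulary ends up smaller than one built from a tokenizer that keeps them.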

## 4. Text Preprocessing / Cleaning

### Q6: Build a document preprocessing function

Make sure to:
1. Lowercase.
2. Split into tokens.
3. Remove stopwords.
4. Stem the words.

In [None]:
from typing import List

def clean_doc(doc: str) -> List[str]:
    # fix me!
    return doc

In [None]:
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

In [None]:
nltk.download('stopwords')

In [None]:
clean_samp = [clean_doc(doc) for doc in sent_sample]
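
One possible shape for such a function is sketched below, following the four steps listed in Q6. Two things are assumptions made to keep the sketch free of downloads: the tiny `TOY_STOPWORDS` set (in the notebook, `set(stopwords.words('english'))` would be the natural choice) and the letters-only regex tokenizer. `clean_doc_sketch` is a hypothetical name.

```python
import re
from typing import List
from nltk.stem.porter import PorterStemmer

# Tiny hypothetical stopword list, used only so this sketch needs no NLTK downloads.
TOY_STOPWORDS = {"a", "an", "and", "the", "this", "is", "was", "i", "it", "of", "to"}

stemmer = PorterStemmer()

def clean_doc_sketch(doc: str) -> List[str]:
    """Lowercase, tokenize, drop stopwords, and stem a single document."""
    lowered = doc.lower()                                  # 1. lowercase
    tokens = re.findall(r"[a-z]+", lowered)                # 2. split into tokens (letters only)
    kept = [t for t in tokens if t not in TOY_STOPWORDS]   # 3. remove stopwords
    return [stemmer.stem(t) for t in kept]                 # 4. stem each remaining token

cleaned = clean_doc_sketch("Wow... Loved this place.")
print(cleaned)  # ['wow', 'love', 'place']
```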

## 5. Text Preprocessing Pipeline

### Q7: Build a corpus preprocessing function

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

In [None]:
def dataset_to_X(df: pd.DataFrame) -> pd.DataFrame:
    # Fix me!
    return df

In [None]:
dataset_to_X(dataset)

## 6. Text Preprocessing Pipeline w/ TFIDF

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [None]:
def dataset_to_tfidf_X(df: pd.DataFrame) -> pd.DataFrame:
    # Fix me!
    return df

In [None]:
tfidf_X = dataset_to_tfidf_X(dataset)
tfidf_X

In [None]:
def present_active_indices(X: pd.DataFrame, index: int) -> pd.Series:
    """Returns a sub-series of non-zero components of the document vector at the given index."""
    return X.iloc[index][X.iloc[index] != 0]

In [None]:
present_active_indices(tfidf_X, 0)

In [None]:
present_active_indices(tfidf_X, 3)

## 7. Use our pipeline for some simple text classification

We will use the `"Liked"` column (which is either 0 or 1) as our label, and see if we can learn to predict it.

### Set up our X and y

In [None]:
y = dataset.iloc[:, -1].values

In [None]:
y.shape

### Split the dataset into a training and test set

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(tfidf_X, y, test_size = 0.20, random_state = 0)

### Fit a Gaussian Naive Bayes classifier

In [None]:
from sklearn.naive_bayes import GaussianNB
classifier = GaussianNB()
classifier.fit(X_train, y_train)

### Predict on the test set

In [None]:
import numpy as np

In [None]:
y_pred = classifier.predict(X_test)
# print(np.concatenate((y_pred.reshape(len(y_pred),1), y_test.reshape(len(y_test),1)),1))

### Evaluate

In [None]:
# taken from https://gist.github.com/shaypal5/94c53d765083101efc0240d776a23823

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

def print_confusion_matrix(confusion_matrix, class_names, figsize = (10,7), fontsize=14):
    """Prints a confusion matrix, as returned by sklearn.metrics.confusion_matrix, as a heatmap.

    Note that due to returning the created figure object, when this function is called in a
    notebook the figure will be printed twice. To prevent this, either append ; to your
    function call, or modify the function by commenting out the return expression.

    Arguments
    ---------
    confusion_matrix: numpy.ndarray
        The numpy.ndarray object returned from a call to sklearn.metrics.confusion_matrix.
        Similarly constructed ndarrays can also be used.
    class_names: list
        An ordered list of class names, in the order they index the given confusion matrix.
    figsize: tuple
        A 2-long tuple, the first value determining the horizontal size of the output figure,
        the second determining the vertical size. Defaults to (10,7).
    fontsize: int
        Font size for axes labels. Defaults to 14.

    Returns
    -------
    matplotlib.figure.Figure
        The resulting confusion matrix figure
    """
    df_cm = pd.DataFrame(
        confusion_matrix, index=class_names, columns=class_names,
    )
    fig = plt.figure(figsize=figsize)
    try:
        heatmap = sns.heatmap(df_cm, annot=True, fmt="d")
    except ValueError:
        raise ValueError("Confusion matrix values must be integers.")
    heatmap.yaxis.set_ticklabels(heatmap.yaxis.get_ticklabels(), rotation=0, ha='right', fontsize=fontsize)
    heatmap.xaxis.set_ticklabels(heatmap.xaxis.get_ticklabels(), rotation=45, ha='right', fontsize=fontsize)
    plt.ylabel('True label')
    plt.xlabel('Predicted label')
    # Note that due to returning the created figure object, when this function is called in a notebook
    # the figure will be printed twice. To prevent this, either append ; to your function call, or
    # modify the function by commenting out this return expression.
    return fig

In [None]:
from sklearn.metrics import confusion_matrix, accuracy_score
cm = confusion_matrix(y_test, y_pred)

In [None]:
print_confusion_matrix(cm, ['Not Liked', 'Liked']);

In [None]:
acc = accuracy_score(y_test, y_pred)
acc

In [None]:
print(f"We achieve an accuracy score of {100*acc:.2f}%.")
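
To see how the accuracy relates to the confusion matrix, the sketch below builds a 2x2 matrix by hand for a few hypothetical labels and predictions (not results from the model above). Accuracy is the sum of the diagonal divided by the total number of test examples.

```python
import numpy as np

# Hypothetical labels and predictions for six test reviews (0 = Not Liked, 1 = Liked).
y_true = np.array([0, 0, 1, 1, 1, 0])
y_hat = np.array([0, 1, 1, 1, 0, 0])

# Rows index the true label, columns the predicted label.
cm_toy = np.zeros((2, 2), dtype=int)
for t, p in zip(y_true, y_hat):
    cm_toy[t, p] += 1

# Correct predictions lie on the diagonal.
acc_toy = np.trace(cm_toy) / cm_toy.sum()
print(cm_toy)
print(f"{100 * acc_toy:.2f}%")
```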