<a href="https://colab.research.google.com/github/tinamilo6/Textmining/blob/main/A03_TextMining.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
#!pip install pandas scikit-learn nltk tokenizers

# Assignment 3 - Text Mining

Project management and tools for health informatics

## 1. Download and prepare data:

**Do not alter the code in this Section!**

The code in this section downloads the [IMDB Large Movie Review Dataset](https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz), which is the dataset you will work with in this assignment.

In [2]:
import os
import tarfile
from urllib.request import urlretrieve

In [3]:
if not os.path.exists('aclImdb'):
    # download data:
    urlretrieve('https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz', 'aclImdb.tar.gz')

    # unzip data:
    with tarfile.open('aclImdb.tar.gz') as file:
        file.extractall('./')

## 2. Some helper Functions:

**Do not alter the code in this Section!**

This section contains the code for some helper functions that will be useful for solving the assignment. Example code on how to use the functions is provided in section 3.

In [4]:
import pandas as pd

from typing import Literal, Tuple, Iterable

Function for loading data into a pandas dataframe:

In [5]:
def load_data(split:Literal['train', 'test'], texts_per_class:int=500) -> pd.DataFrame:
    ''' Loads the data into a pandas dataframe.'''
    paths  = []
    labels = []

    for label in ('pos', 'neg'):
        # get all files in the folder:
        files = os.listdir(os.path.join('aclImdb', split, label))[:texts_per_class]

        # append them to the lists:
        paths.extend([os.path.join('aclImdb', split, label, f) for f in files])
        labels.extend([label] * len(files))

    return pd.DataFrame({'path':paths, 'label':labels})

Function for loading a specific text:

In [6]:
def load_text(path:str) -> str:
    ''' Reads a single text given the path. '''
    # read file from disk:
    with open(path, 'r', encoding='utf8') as file:
        s = file.read()

    return s

Function for iterating through multiple texts:

In [7]:
def iterate_texts(data:pd.DataFrame) -> Iterable[Tuple[str, str]]:
    ''' Iterates through a pandas dataframe. '''

    for path in data['path'].values:
        # read file from disk:
        with open(path, 'r', encoding='utf8') as file:
            text = file.read()

        yield text

## 3. Text Mining Pipeline

This section will cover the text mining steps for this assignment. The following steps will be performed:

1. **Analyze the Data for Difficult Parts**  
   - Review the data to identify challenging aspects such as contractions, informal language, and complex sentence structures.

2. **Replace Contractions and Informal Language**  
   - Expand common contractions and replace informal phrases, if necessary, to standardize the text for processing.

3. **Tokenize the Texts**  
   - Apply a suitable tokenizer to break down the text into individual tokens for analysis.

### Import needed libraries

In [8]:
# Import needed libraries for the preparation of the texts
import random

### Load the training and test data
The data extracted from the downloaded archive is loaded into two DataFrames, `data_train` and `data_test`.
These are used from here on to access the training and test data.

In [9]:
data_train = load_data('train')
data_test  = load_data('test')
data_train

| | path | label |
|---|---|---|
| 0 | aclImdb\train\pos\0_9.txt | pos |
| 1 | aclImdb\train\pos\10000_8.txt | pos |
| 2 | aclImdb\train\pos\10001_10.txt | pos |
| 3 | aclImdb\train\pos\10002_7.txt | pos |
| 4 | aclImdb\train\pos\10003_8.txt | pos |
| ... | ... | ... |
| 995 | aclImdb\train\neg\10446_2.txt | neg |
| 996 | aclImdb\train\neg\10447_1.txt | neg |
| 997 | aclImdb\train\neg\10448_1.txt | neg |
| 998 | aclImdb\train\neg\10449_4.txt | neg |


### 1. Assess the data / texts for difficult parts
In this part, sample texts are printed and then analyzed for difficult parts that could affect the text mining process.

In [10]:
# Number of samples to be printed
n_samples = 7

# Randomly sample indices from "data_train"
sample_indices = random.sample(range(len(data_train)), n_samples)

# Load and print each sample using the "load_text" function
for i, idx in enumerate(sample_indices, start=1):
    # Load text from the file path specified in the 'path' column
    # (named idx rather than id to avoid shadowing the built-in id())
    text = load_text(data_train.loc[idx, 'path'])
    print(f"Sample {i}:\n")
    print(text)
    print("\n" + "="*80 + "\n")

Sample 1:

Made after QUARTET was, TRIO continued the quality of the earlier film versions of the short stories by Maugham. Here the three stories are THE VERGER, MR. KNOW-IT-ALL, and SANITORIUM. The first two are comic (THE VERGER is like a prolonged joke, but one with a good pay-off), and the last more serious (as health issues are involved). Again the author introduces the film and the stories.<br /><br />James Hayter, soon to have his signature role as Samuel Pickwick, is the hero in THE VERGER. He holds this small custodial-type job in a church, but the new Vicar (Michael Hordern) is an intellectual snob. When he hears Hayter has no schooling he fires him. Hayter has saved some money, so he tells his wife (Kathleen Harrison) he fancies buying a small news and tobacco shop. He has a good eye, and his store thrives. Soon he has a whole chain of stores. When his grandchild is christened by Hordern, the latter is amazed to see how prosperous his ex-Verger. The payoff is when bank mana

#### Limitations/issues captured from the text samples above

- **HTML tags and special characters**
- **Punctuation and symbols** (e.g., `&`)
- **Contractions** (e.g., "isn't", "I'll", "I'm")
- **Parentheses and annotations** (e.g., "(Crouching Tiger)")
- **Informal formatting** (e.g., "my rating is ****")
- **Ambiguity and polysemy** (e.g., "dictators", "nuts")
- **Long and complex sentences**
- **Informal language** (e.g., "what can be so bad about that?")
- **Quotation marks** (e.g., "dictators", "sin")

We will elaborate on these issues and add our conclusions in the report.
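
To get a rough sense of how often some of these issues occur, one could count simple regex matches per text. The sketch below is illustrative only: the patterns and the name `count_issues` are assumptions and not part of the assignment code, and such simple patterns will miss many cases.

```python
import re

# Hypothetical, deliberately simple patterns for three of the issues listed above
ISSUE_PATTERNS = {
    "html_tags": re.compile(r"<[^>]+>"),
    "contractions": re.compile(r"\b\w+'(?:m|s|re|ve|ll|d|t)\b", re.IGNORECASE),
    "star_ratings": re.compile(r"\*{2,}"),
}

def count_issues(text):
    """Return how many times each pattern matches in the given text."""
    return {name: len(pattern.findall(text)) for name, pattern in ISSUE_PATTERNS.items()}

example = "I'm sure it isn't bad.<br /><br />My rating is ****"
print(count_issues(example))
# {'html_tags': 2, 'contractions': 2, 'star_ratings': 1}
```

Applied to every text returned by `load_text`, counts like these could help decide which preprocessing steps are worth the effort.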

### 2. Preprocessing: Simplify the Text

This section focuses on preprocessing the texts to make them more suitable for text mining. Identified constraints from the analysis step will be addressed and, as far as possible, eliminated to improve processing accuracy.


In [11]:
import re

def remove_html_tags(text):
    # Replace HTML tags (e.g. <br />) with a space so that the words on
    # either side of a tag are not merged into one
    clean_text = re.sub(r'<.*?>', ' ', text)
    return clean_text



In [12]:
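def remove_punctuation_and_symbols(text):
    # Remove every character that is neither a word character (letter, digit,
    # underscore) nor whitespace. This also strips apostrophes, so contractions
    # should be expanded before this step.
    clean_text = re.sub(r'[^\w\s]', '', text)
    return clean_text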



In [13]:
def remove_parentheses(text):
    # Remove text inside parentheses along with parentheses
    clean_text = re.sub(r'\(.*?\)', '', text)
    return clean_text
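
The order in which such cleaning steps run matters: if punctuation is stripped first, the angle brackets and parentheses are gone before the tag and parenthesis patterns can match. The sketch below demonstrates this with simplified stand-ins for the helpers above (the `strip_*` names are hypothetical, and the tag is replaced by a space here).

```python
import re

def strip_html(text):
    # Stand-in for the HTML helper; tags become a single space
    return re.sub(r'<.*?>', ' ', text)

def strip_parentheses(text):
    # Stand-in for the parentheses helper; drops the bracketed text as well
    return re.sub(r'\(.*?\)', '', text)

def strip_punctuation(text):
    # Stand-in for the punctuation helper
    return re.sub(r'[^\w\s]', '', text)

sample = "Great film (Crouching Tiger) & more.<br /><br />Loved it!"

# Tags first, then parentheses, then punctuation
intended = strip_punctuation(strip_parentheses(strip_html(sample)))

# Punctuation first: "<br />" degrades to "br" and the bracketed text survives
wrong_order = strip_html(strip_parentheses(strip_punctuation(sample)))

print(intended.split())
# ['Great', 'film', 'more', 'Loved', 'it']
print(wrong_order.split())
# ['Great', 'film', 'Crouching', 'Tiger', 'morebr', 'br', 'Loved', 'it']
```

Note that removing parenthesized text also discards its content, which may or may not be desirable for sentiment classification.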

#### Expand the contractions
There are Python libraries that focus on expanding contractions. To avoid an extra dependency (in case these packages are not available), we created our own replacement list.

In [14]:
# Dictionary of common contractions and their expanded forms
contractions_dict = {
    "i'm": "i am",
    "you're": "you are",
    "he's": "he is",
    "she's": "she is",
    "it's": "it is",
    "we're": "we are",
    "they're": "they are",
    "i've": "i have",
    "you've": "you have",
    "we've": "we have",
    "they've": "they have",
    "i'll": "i will",
    "you'll": "you will",
    "he'll": "he will",
    "she'll": "she will",
    "we'll": "we will",
    "they'll": "they will",
    "isn't": "is not",
    "aren't": "are not",
    "wasn't": "was not",
    "weren't": "were not",
    "can't": "cannot",
    "couldn't": "could not",
    "won't": "will not",
    "wouldn't": "would not",
    "don't": "do not",
    "doesn't": "does not",
    "didn't": "did not",
    "haven't": "have not",
    "hasn't": "has not",
    "hadn't": "had not",
    "let's": "let us",
    "that's": "that is",
    "who's": "who is",
    "what's": "what is",
    "where's": "where is",
    "there's": "there is",
    "here's": "here is",
    "shouldn't": "should not",
    "mustn't": "must not",
    "shan't": "shall not"
}

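# Compile the pattern once instead of on every call; re.escape guards against
# regex metacharacters in the keys
contractions_pattern = re.compile(
    r'\b(' + '|'.join(re.escape(key) for key in contractions_dict) + r')\b',
    flags=re.IGNORECASE
)

# Function to expand contractions (expansions are always lowercase, e.g. "I'm" -> "i am")
def expand_contractions(text):
    def replace(match):
        return contractions_dict.get(match.group(0).lower(), match.group(0))
    return contractions_pattern.sub(replace, text)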
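
The function above always returns the lowercase expansion, so "I'm" becomes "i am". This does not matter for the lowercasing baseline tokenizer, but if capitalization is needed by a later step, a variant could restore the case of the first letter. A minimal sketch, assuming a small stand-in dictionary (`SMALL_CONTRACTIONS`) and a hypothetical function name:

```python
import re

# Small stand-in for the full contractions dictionary
SMALL_CONTRACTIONS = {"i'm": "i am", "isn't": "is not", "can't": "cannot"}

_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(key) for key in SMALL_CONTRACTIONS) + r")\b",
    flags=re.IGNORECASE,
)

def expand_keep_case(text):
    """Expand contractions and keep a leading capital letter if there was one."""
    def replace(match):
        word = match.group(0)
        expanded = SMALL_CONTRACTIONS[word.lower()]
        if word[0].isupper():
            expanded = expanded[0].upper() + expanded[1:]
        return expanded
    return _PATTERN.sub(replace, text)

print(expand_keep_case("Isn't it odd? I'm sure you can't tell."))
# Is not it odd? I am sure you cannot tell.
```

Like the original, this only matches the straight apostrophe (`'`); contractions written with a typographic apostrophe are left untouched.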


In [15]:
# Dictionary mapping informal tokens to their replacements (an empty string removes the token)
informal_tokens = {
    "u": "you",
    "r": "are",
    "lmao": "",  # Remove
    "lol": "",   # Remove
    "btw": "by the way",
    "idk": "I do not know",
    "omg": "oh my god",
    "gonna": "going to",
    "wanna": "want to",
    "gotta": "got to",
    "b/c": "because",
    "thx": "thanks",
    "pls": "please",
    "cuz": "because",
    "wut": "what",
    "smh": "",  # Remove
    "k": "okay",
    "ttyl": "talk to you later"
}

def remove_informal_tokens(text):
    # Replace informal tokens (or remove them if the replacement is empty).
    # Note: matching is case-insensitive, so single-letter keys such as "u", "r" and "k"
    # also match standalone capitals (e.g. the "R" in "rated R").
    for token, replacement in informal_tokens.items():
        # Use regex to match whole words only
        text = re.sub(r'\b' + re.escape(token) + r'\b', replacement, text, flags=re.IGNORECASE)
    return text

def fix_punctuation(text):
    # Remove whitespace before punctuation and ensure a single space after it.
    # Note: "(" is treated like the other characters, so "word (text" becomes "word( text".
    text = re.sub(r'\s*([.,;:!?()])\s*', r'\1 ', text)
    text = re.sub(r'\s+', ' ', text)  # Collapse runs of whitespace into a single space
    text = text.strip()  # Trim leading and trailing whitespace
    return text

# Combined function to clean the text (same steps as in preprocess_text, without contraction expansion)
def clean_text(text):
    text = remove_informal_tokens(text)  # Replace informal tokens
    text = fix_punctuation(text)  # Fix punctuation
    return text
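
Because `fix_punctuation` handles the opening parenthesis like closing punctuation, it moves the space from before `(` to after it. A possible variant that treats opening and closing characters separately is sketched below. It is an assumption for illustration (`fix_punctuation_variant` is a hypothetical name) and was not used to produce any results in this notebook.

```python
import re

def fix_punctuation_variant(text):
    # No whitespace before a run of closing punctuation, one space after it
    text = re.sub(r'\s*([.,;:!?)]+)\s*', r'\1 ', text)
    # One space before an opening parenthesis, none after it
    text = re.sub(r'\s*\(\s*', ' (', text)
    # Collapse runs of whitespace and trim the ends
    return re.sub(r'\s+', ' ', text).strip()

print(fix_punctuation_variant("Homelessness (or Houselessness) is an issue ,right ?"))
# Homelessness (or Houselessness) is an issue, right?
```

This variant still splits decimal numbers such as "7.5" into "7. 5", so it would need further refinement before replacing the original.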




In [16]:
def preprocess_text(path):
    text = load_text(path)              # Load text from file path
    text = expand_contractions(text)    # Expand contractions
    text = remove_informal_tokens(text) # Replace informal tokens
    text = fix_punctuation(text)        # Fix punctuation
    return text

# Apply preprocessing function to training and test data
data_train['cleaned_text'] = data_train['path'].apply(preprocess_text)
data_test['cleaned_text'] = data_test['path'].apply(preprocess_text)

# Now, you can use 'cleaned_text' for further processing or model training
print(data_train[['path', 'cleaned_text']].head())

                             path  \
0       aclImdb\train\pos\0_9.txt   
1   aclImdb\train\pos\10000_8.txt   
2  aclImdb\train\pos\10001_10.txt   
3   aclImdb\train\pos\10002_7.txt   
4   aclImdb\train\pos\10003_8.txt   

                                        cleaned_text  
0  Bromwell High is a cartoon comedy. It ran at t...  
1  Homelessness( or Houselessness as George Carli...  
2  Brilliant over-acting by Lesley Ann Warren. Be...  
3  This is easily the most underrated film inn th...  
4  This is not the typical Mel Brooks film. It wa...  


### A simple pipeline:
[Do not change it!]

This simple pipeline serves as the baseline: the newly created pipeline is compared against it to evaluate the performance gain.

**White-Space tokenization:**

In [17]:
def tokenize(text:str):
    ''' An example tokenization function. '''

    # simple white-space tokenization:
    return text.lower().split()
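
Whitespace tokenization keeps punctuation attached to words, so "film." and "film!" end up as different tokens. The comparison below illustrates this with a simple regex tokenizer. `regex_tokenize` is a hypothetical alternative shown for illustration only; the baseline itself stays unchanged.

```python
import re

def whitespace_tokenize(text):
    # Same idea as the baseline: lowercase, then split on whitespace
    return text.lower().split()

def regex_tokenize(text):
    # Runs of letters/digits, optionally followed by an apostrophe part (e.g. "isn't")
    return re.findall(r"[a-z0-9]+(?:'[a-z]+)?", text.lower())

s = "A great film. Truly a GREAT film!"
print(whitespace_tokenize(s))
# ['a', 'great', 'film.', 'truly', 'a', 'great', 'film!']
print(regex_tokenize(s))
# ['a', 'great', 'film', 'truly', 'a', 'great', 'film']
```

In this toy sentence the whitespace tokenizer produces five distinct tokens and the regex tokenizer four, which hints at how punctuation inflates the vocabulary.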


**Bag-of-words Embedding:**

See documentation of [sklearn.feature_extraction.text.CountVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html)

In [23]:
from sklearn.feature_extraction.text import CountVectorizer

# create a simple bag of words embedding:
bow = CountVectorizer(

    # the next line converts the filepaths to the actual texts:
    #preprocessor = load_text, # Commented, to check if the preprocessing from above changes the results

    # tokenization function from above:
    tokenizer = tokenize,

    # Set token_pattern to None since we're using a custom tokenizer
    token_pattern=None

)

# Train the embedding on cleaned training data
embeddings_train = bow.fit_transform(data_train['cleaned_text'].values)

# Vectorize the cleaned test data
embeddings_test = bow.transform(data_test['cleaned_text'].values)

# These are the original lines
# train the embedding:
#embeddings_train = bow.fit_transform(data_train['path'].values)

# vectorize test data:
#embeddings_test = bow.transform(data_test['path'].values)
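
Conceptually, a bag-of-words embedding maps each text to a vector of token counts over a vocabulary learned from the training texts. The pure-Python sketch below mirrors that idea on toy data. It is not how `CountVectorizer` is implemented (which returns a sparse matrix), and the helper names are made up for illustration.

```python
from collections import Counter

def fit_vocabulary(texts):
    # Collect all distinct tokens and assign each a column index
    tokens = sorted({tok for text in texts for tok in text.lower().split()})
    return {tok: i for i, tok in enumerate(tokens)}

def transform(texts, vocabulary):
    # One row of counts per text; tokens outside the vocabulary are ignored
    rows = []
    for text in texts:
        counts = Counter(text.lower().split())
        rows.append([counts.get(tok, 0) for tok in vocabulary])
    return rows

train = ["good good film", "bad film"]
vocab = fit_vocabulary(train)
print(vocab)
# {'bad': 0, 'film': 1, 'good': 2}
print(transform(train, vocab))
# [[0, 1, 2], [1, 1, 0]]
print(transform(["good unseen film"], vocab))
# [[0, 1, 1]]
```

The last line shows why the vocabulary is fitted on the training data only: a token that appears solely in the test data has no column and is dropped.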

**Classification with a linear SVM**

See documentation of [sklearn.svm.LinearSVC](https://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html#sklearn.svm.LinearSVC)

In [22]:
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score

svm = LinearSVC(dual=False, max_iter=5000)

# train classifier:
svm.fit(embeddings_train, data_train['label'].values)

# test classifier:
predictions = svm.predict(embeddings_test)

# Calculate Accuracy:
print('Accuracy:', accuracy_score(data_test['label'].values, predictions))

Accuracy: 0.767
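
Accuracy is the fraction of test texts whose predicted label matches the true label. With the default `texts_per_class=500` the test split holds 1000 texts (assuming each class folder contains at least 500 files), so 0.767 corresponds to 767 correct predictions. The sketch below computes the same quantity by hand on made-up toy labels; the function name is hypothetical.

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the true label."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("label sequences must be non-empty and of equal length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

# Toy labels, made up for illustration only
y_true = ['pos', 'pos', 'neg', 'neg']
y_pred = ['pos', 'neg', 'neg', 'neg']
print(accuracy(y_true, y_pred))
# 0.75
```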


### Own text mining pipeline

In [20]:
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace

# Example sentence to encode
s = "Hello, this is a test sentence for the tokenizer."

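# Initialize a BPE tokenizer. No unk_token is configured here, so characters
# never seen during training are dropped when encoding (see the output below).
tokenizer = Tokenizer(BPE())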

# Add a pre-tokenizer to handle whitespace properly
tokenizer.pre_tokenizer = Whitespace()

# Initialize a BPE trainer with some default parameters
trainer = BpeTrainer(special_tokens=["<pad>", "<s>", "</s>", "<unk>"])

# Train directly from in-memory strings: train_from_iterator accepts an
# iterator of strings, so no files are needed. This tiny toy corpus is for
# demonstration only.
training_data = ["Hello world", "This is a test", "We are training the tokenizer"]
tokenizer.train_from_iterator(training_data, trainer)

# Encode the string
t = tokenizer.encode(s)

# Decode the tokenized output
decoded_s = tokenizer.decode(t.ids)

# Display the results
print("Original string:", s)
print("Encoded tokens:", t.tokens)
print("Decoded string:", decoded_s)


Original string: Hello, this is a test sentence for the tokenizer.
Encoded tokens: ['Hello', 't', 'h', 'is', 'is', 'a', 'test', 's', 'en', 't', 'en', 'e', 'or', 'the', 'tokenizer']
Decoded string: Hello t h is is a test s en t en e or the tokenizer
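
The output above loses the characters that never occur in the three training strings (the comma, "c", "f" and the final period), which shows how strongly a BPE tokenizer depends on its training corpus. To illustrate what BPE training does, the sketch below learns merges in pure Python. It is a simplified teaching version with a made-up tie-breaking rule, not the `tokenizers` implementation.

```python
from collections import Counter

def learn_bpe_merges(words, num_merges):
    """Learn up to num_merges BPE merges from a list of words."""
    # Each word starts as a tuple of single characters
    corpus = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        # Most frequent pair; ties go to the lexicographically largest pair
        best = max(pair_counts, key=lambda p: (pair_counts[p], p))
        merges.append(best)
        # Rewrite every word with the chosen pair merged into one symbol
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_corpus[tuple(merged)] += freq
        corpus = new_corpus
    return merges

print(learn_bpe_merges(["low", "low", "lower", "lowest"], 3))
# [('o', 'w'), ('l', 'ow'), ('low', 'e')]
```

Frequent character sequences such as "low" become single tokens after a few merges, which is why training on the actual review texts rather than three toy sentences would give far more useful tokens.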
