
In [1]:
# the PyPI package is called 'scikit-learn' ('sklearn' is a deprecated placeholder that fails to install);
# 'tokenizers' is needed for the BPE example further below
!pip install pandas scikit-learn nltk tokenizers


# Assignment 3 - Text Mining

Project management and tools for health informatics

## 1. Download and prepare data:

**Do not alter the code in this Section!**

The code in this section downloads the [IMDB Large Movie Review Dataset](https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz), which is the dataset you will be working on in this assignment.

In [2]:
import os
import tarfile
from urllib.request import urlretrieve

In [3]:
if not os.path.exists('aclImdb'):
    # download data:
    urlretrieve('https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz', 'aclImdb.tar.gz')

    # unzip data:
    with tarfile.open('aclImdb.tar.gz') as file:
        file.extractall('./')

## 2. Some helper Functions:

**Do not alter the code in this Section!**

This section contains the code for some helper functions that will be useful for solving the assignment. Example code on how to use the functions is provided in section 3.

In [4]:
import pandas as pd

from typing import Literal, Tuple, Iterable

Function for loading data into a pandas dataframe:

In [5]:
def load_data(split:Literal['train', 'test'], texts_per_class:int=500) -> pd.DataFrame:
    ''' Loads the data into a pandas dataframe.'''
    paths  = []
    labels = []

    for label in ('pos', 'neg'):
        # get all files in the folder:
        files = os.listdir(os.path.join('aclImdb', split, label))[:texts_per_class]

        # append them to the lists:
        paths.extend([os.path.join('aclImdb', split, label, f) for f in files])
        labels.extend([label] * len(files))

    return pd.DataFrame({'path':paths, 'label':labels})

Function for loading a specific text:

In [6]:
def load_text(path:str) -> str:
    ''' Reads a single text given the path. '''
    # read file from disk:
    with open(path, 'r', encoding='utf8') as file:
        s = file.read()

    return s

Function for iterating through multiple texts:

In [7]:
def iterate_texts(data:pd.DataFrame) -> Iterable[str]:
    ''' Iterates through the texts referenced by the 'path' column of a pandas dataframe. '''

    for path in data['path'].values:
        # read file from disk:
        with open(path, 'r', encoding='utf8') as file:
            text = file.read()

        yield text

## 3. Your Code:

**Alter the code below to complete the assignment!**

Load the training and test data:

In [8]:
data_train = load_data('train')
data_test  = load_data('test')
data_train

|     | path | label |
| --- | --- | --- |
| 0 | aclImdb/train/pos/3937_8.txt | pos |
| 1 | aclImdb/train/pos/10290_8.txt | pos |
| 2 | aclImdb/train/pos/11917_8.txt | pos |
| 3 | aclImdb/train/pos/6627_7.txt | pos |
| 4 | aclImdb/train/pos/8219_7.txt | pos |
| ... | ... | ... |
| 995 | aclImdb/train/neg/2793_1.txt | neg |
| 996 | aclImdb/train/neg/6236_1.txt | neg |
| 997 | aclImdb/train/neg/8990_3.txt | neg |
| 998 | aclImdb/train/neg/7945_1.txt | neg |


### Accessing the texts:

In [9]:
# Sample code: load a single text
load_text(data_train.loc[0, 'path'])

"The early career of Abe Lincoln is beautifully presented by Ford. Not that anyone alive has seen footage of the real Lincoln, but Fonda, wearing a fake nose, is uncanny as Lincoln, with the voice, delivery, walk, and other mannerisms - exactly as one would imagine Lincoln to have been. Ford, in the first of three consecutive films he made with Fonda, is at the top of his form, perfectly evoking early 19th century America. The story focuses on a pair accused of murder that Lincoln defends and the courtroom scenes are quite well done. The supporting cast includes many of Ford's regulars. This was Alice Brady's last film, as she died months after its release."

In [10]:
# Sample code: iterate through all texts
for text in iterate_texts(data_train[:20]):
    print(text)

The early career of Abe Lincoln is beautifully presented by Ford. Not that anyone alive has seen footage of the real Lincoln, but Fonda, wearing a fake nose, is uncanny as Lincoln, with the voice, delivery, walk, and other mannerisms - exactly as one would imagine Lincoln to have been. Ford, in the first of three consecutive films he made with Fonda, is at the top of his form, perfectly evoking early 19th century America. The story focuses on a pair accused of murder that Lincoln defends and the courtroom scenes are quite well done. The supporting cast includes many of Ford's regulars. This was Alice Brady's last film, as she died months after its release.
Bruce Almighty is the best Jim Carrey work since The Truman Show, and was a pleasant surprise after some of his recent "Hey Hollywood - look how good I can act!" box office disappointments. It's great to see Jim recognizing and embracing his strengths. He won't get an Academy Award but the film itself will last longer than many of th

In [11]:
# Define the number of sample texts to print
n_samples = 2

# 'data_train' holds file paths (column 'path'), not the texts themselves,
# so load the text content of the first n_samples reviews from disk.
X_train = [load_text(data_train.loc[i, 'path']) for i in range(n_samples)]

# Print some sample texts from the training set
for i in range(n_samples):
    print(f"Sample {i+1}:\n")
    print(X_train[i])  # Print the text
    print("\n" + "="*80 + "\n")

Sample 1:

The early career of Abe Lincoln is beautifully presented by Ford. Not that anyone alive has seen footage of the real Lincoln, but Fonda, wearing a fake nose, is uncanny as Lincoln, with the voice, delivery, walk, and other mannerisms - exactly as one would imagine Lincoln to have been. Ford, in the first of three consecutive films he made with Fonda, is at the top of his form, perfectly evoking early 19th century America. The story focuses on a pair accused of murder that Lincoln defends and the courtroom scenes are quite well done. The supporting cast includes many of Ford's regulars. This was Alice Brady's last film, as she died months after its release.


Sample 2:

Bruce Almighty is the best Jim Carrey work since The Truman Show, and was a pleasant surprise after some of his recent "Hey Hollywood - look how good I can act!" box office disappointments. It's great to see Jim recognizing and embracing his strengths. He won't get an Academy Award but the film itself will las

**Limitations/issues observed in the text samples above:**

- HTML tags and special characters
- Punctuation and symbols (&)
- Contractions ("isn't", "I'll", "I'm")
- Parentheses and annotations ("(Crouching Tiger)")
- Informal formatting ("my rating is ****")
- Ambiguity and polysemy ("dictators")
- Long and complex sentences
- Informal language ("what can be so bad about that?")
- Quotation marks ("dictators", "sin")

We elaborate on these issues and add our conclusions in the report.
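
To get a rough sense of how often some of these issues occur, one could count simple regex matches per review. The sketch below is only illustrative: the name `ISSUE_PATTERNS`, the function `count_issues`, the regexes, and the sample string are assumptions and not part of the assignment code. The patterns are deliberately simple and will miss many cases.

```python
import re
from collections import Counter

# Hypothetical, deliberately simple patterns for some of the issues listed above.
ISSUE_PATTERNS = {
    "html_tag": re.compile(r"<[^>]+>"),
    "contraction": re.compile(r"\b\w+'(?:m|s|re|ve|ll|d|t)\b", re.IGNORECASE),
    "parenthetical": re.compile(r"\([^)]*\)"),
    "star_rating": re.compile(r"\*{2,}"),
    "quoted_word": re.compile(r'"[^"]+"'),
}

def count_issues(text: str) -> Counter:
    """Count how many times each pattern matches in a single text."""
    return Counter({name: len(pattern.findall(text))
                    for name, pattern in ISSUE_PATTERNS.items()})

# Made-up sample that combines several of the issues in one string.
sample = ('I\'m sure it isn\'t bad.<br /><br />'
          'My rating is **** (Crouching Tiger) for the "dictators".')
print(count_issues(sample))
```

Summing these counters over `iterate_texts(data_train)` would give a per-issue total for the whole training split.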

In [12]:
import re

def remove_html_tags(text):
    # Replace HTML tags (e.g. "<br />") with a space so that neighbouring words do not merge
    clean_text = re.sub(r'<.*?>', ' ', text)
    return clean_text



In [13]:
def remove_punctuation_and_symbols(text):
    # Remove everything that is neither a word character nor whitespace.
    # Note: apostrophes are removed too, so expand contractions first; underscores are kept.
    clean_text = re.sub(r'[^\w\s]', '', text)
    return clean_text



In [14]:
# Dictionary of common contractions and their expanded forms
contractions_dict = {
    "I'm": "I am",
    "you're": "you are",
    "he's": "he is",
    "she's": "she is",
    "it's": "it is",
    "we're": "we are",
    "they're": "they are",
    "I've": "I have",
    "you've": "you have",
    "we've": "we have",
    "they've": "they have",
    "I'll": "I will",
    "you'll": "you will",
    "he'll": "he will",
    "she'll": "she will",
    "we'll": "we will",
    "they'll": "they will",
    "isn't": "is not",
    "aren't": "are not",
    "wasn't": "was not",
    "weren't": "were not",
    "can't": "cannot",
    "couldn't": "could not",
    "won't": "will not",
    "wouldn't": "would not",
    "don't": "do not",
    "doesn't": "does not",
    "didn't": "did not",
    "haven't": "have not",
    "hasn't": "has not",
    "hadn't": "had not",
    "let's": "let us",
    "that's": "that is",
    "who's": "who is",
    "what's": "what is",
    "where's": "where is",
    "there's": "there is",
    "here's": "here is",
    "shouldn't": "should not",
    "mustn't": "must not",
    "shan't": "shall not"
    # Add more contractions as needed
}

# Lower-cased keys so that the lookup works regardless of case (e.g. "I'll", "She's")
contractions_lower = {key.lower(): value for key, value in contractions_dict.items()}

# Compile the pattern once instead of on every call
contractions_pattern = re.compile(
    r'\b(' + '|'.join(re.escape(key) for key in contractions_dict) + r')\b',
    flags=re.IGNORECASE
)

# Function to expand contractions
def expand_contractions(text):
    def replace(match):
        word = match.group(0)
        expanded = contractions_lower.get(word.lower(), word)
        # Keep the capitalization of the original first letter ("She's" -> "She is")
        if word[0].isupper():
            expanded = expanded[0].upper() + expanded[1:]
        return expanded
    return contractions_pattern.sub(replace, text)

# Example usage
# Note: "'s" is ambiguous ("she is" vs. "she has"); this simple lookup always picks "is".
text = "I'll go if you're ready. She's been waiting, but I can't make it."
expanded_text = expand_contractions(text)
print(expanded_text)

I will go if you are ready. She is been waiting, but I cannot make it.


In [17]:
def remove_parentheses(text):
    # Remove text inside parentheses along with parentheses
    clean_text = re.sub(r'\(.*?\)', '', text)
    return clean_text


In [18]:
# Map informal tokens to their replacements (an empty string removes the token)
informal_tokens = {
    "u": "you",
    "r": "are",
    "lmao": "",  # Remove
    "lol": "",   # Remove
    "btw": "by the way",
    "idk": "I do not know",
    "omg": "oh my god",
    "gonna": "going to",
    "wanna": "want to",
    "gotta": "got to",
    "b/c": "because",
    "thx": "thanks",
    "pls": "please",
    "cuz": "because",
    "wut": "what",
    "smh": "",  # Remove
    "k": "okay",
    "ttyl": "talk to you later"
}

def remove_informal_tokens(text):
    # Replace (or remove) each informal token
    for token, replacement in informal_tokens.items():
        # Match whole words only, ignoring case.
        # Caveat: single letters such as "u", "r" and "k" also match legitimate uses (e.g. "R rated").
        text = re.sub(r'\b' + re.escape(token) + r'\b', replacement, text, flags=re.IGNORECASE)
    return text

def fix_punctuation(text):
    # Drop whitespace before punctuation and ensure a single space after it.
    # "(" is left out on purpose: adding a space after it would produce "( text)".
    text = re.sub(r'\s*([.,;:!?)])\s*', r'\1 ', text)
    text = re.sub(r'\s+', ' ', text)  # Replace multiple spaces with a single space
    text = text.strip()  # Trim leading and trailing spaces
    return text

# Combined function to clean the text
def clean_text(text):
    text = remove_informal_tokens(text)  # Replace or remove informal tokens
    text = fix_punctuation(text)  # Fix spacing around punctuation
    return text
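
The individual cleaning steps can be chained into one preprocessing function. The following is a minimal sketch of such a chain, not the definitive pipeline for the assignment. To keep the block runnable on its own it uses compact stand-ins for the helpers defined above, and `normalize_whitespace`, `PIPELINE` and `preprocess` are hypothetical names introduced only for illustration.

```python
import re
from typing import Callable, List

# Compact stand-ins for the cleaning helpers defined earlier in the notebook.
def remove_html_tags(text: str) -> str:
    return re.sub(r"<.*?>", " ", text)

def remove_parentheses(text: str) -> str:
    return re.sub(r"\(.*?\)", "", text)

def remove_punctuation_and_symbols(text: str) -> str:
    return re.sub(r"[^\w\s]", "", text)

def normalize_whitespace(text: str) -> str:
    # Hypothetical helper: collapse runs of whitespace and trim the ends.
    return re.sub(r"\s+", " ", text).strip()

# Order matters: tags and parentheses have to be handled before punctuation
# is stripped, because stripping removes the "<", ">", "(" and ")" they rely on.
PIPELINE: List[Callable[[str], str]] = [
    remove_html_tags,
    remove_parentheses,
    remove_punctuation_and_symbols,
    normalize_whitespace,
]

def preprocess(text: str) -> str:
    """Apply every cleaning step in order."""
    for step in PIPELINE:
        text = step(text)
    return text

raw = "Great film!<br /><br />A must-see (really)."
print(preprocess(raw))
```

In this example the hyphen in "must-see" is dropped together with the other punctuation, which fuses the two words. Whether that is acceptable depends on the tokenizer used afterwards.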




### A simple pipeline:

**Whitespace tokenization:**

In [19]:
def tokenize(text:str):
    ''' An example tokenization function. '''

    # simple white-space tokenization:
    return text.lower().split()
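
Whitespace tokenization leaves punctuation attached to the neighbouring word, so "movie," and "movie" end up as different tokens. As a hedged alternative, the sketch below compares it with a simple regex tokenizer. The function names `tokenize_whitespace` and `tokenize_regex` and the pattern are assumptions for illustration, not part of the assignment code.

```python
import re
from typing import List

def tokenize_whitespace(text: str) -> List[str]:
    # Same idea as the example tokenizer above: lower-case, then split on whitespace.
    return text.lower().split()

def tokenize_regex(text: str) -> List[str]:
    # Keep runs of letters/digits, optionally with one internal apostrophe ("won't").
    return re.findall(r"[a-z0-9]+(?:'[a-z0-9]+)?", text.lower())

s = "Great movie, great cast!"
print(tokenize_whitespace(s))
print(tokenize_regex(s))
```

A function like `tokenize_regex` could be passed as the `tokenizer` argument in place of `tokenize`, which makes it easy to compare the effect on accuracy.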


In [35]:
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

# Example string to encode
s = "Hello, this is a test sentence for the tokenizer."

# Initialize a BPE tokenizer; symbols unseen during training map to "[UNK]"
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))

# Split on whitespace and punctuation first so that merges do not cross word boundaries
tokenizer.pre_tokenizer = Whitespace()

# Initialize a BPE trainer
trainer = BpeTrainer(special_tokens=["[UNK]"])

# Train the tokenizer on the training review files
# (assumption: the placeholder files 'train1.txt' and 'train2.txt' stood for these paths)
tokenizer.train(data_train['path'].tolist(), trainer=trainer)

t = tokenizer.encode(s)  # Encode 's'; t.tokens and t.ids hold the result
decoded_s = tokenizer.decode(t.ids)  # Decode the token ids back to a string
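
To illustrate what a BPE trainer learns, here is a toy, pure-Python sketch of the merge step: it repeatedly merges the most frequent pair of adjacent symbols. It is a simplified illustration under stated assumptions, not how the `tokenizers` library is implemented. The function name `learn_bpe` and the tiny word list are hypothetical.

```python
from collections import Counter
from typing import List, Tuple

def learn_bpe(words: List[str], num_merges: int) -> List[Tuple[str, str]]:
    """Learn up to `num_merges` merge rules from a list of words."""
    # Each distinct word is a tuple of symbols (initially single characters).
    vocab = Counter(tuple(word) for word in words)
    merges: List[Tuple[str, str]] = []

    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs: Counter = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # every word is already a single symbol

        # Pick the most frequent pair (ties: the pair seen first wins).
        best = max(pairs, key=pairs.get)
        merges.append(best)

        # Rewrite every word with the chosen pair merged into one symbol.
        new_vocab: Counter = Counter()
        for symbols, freq in vocab.items():
            merged = []
            i = 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab

    return merges

print(learn_bpe(["low", "low", "lower", "lowest"], num_merges=2))
```

Frequent character sequences such as "low" become single tokens after a few merges, while rare words stay split into smaller pieces.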


**Bag-of-words Embedding:**

See documentation of [sklearn.feature_extraction.text.CountVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html)

In [36]:
from sklearn.feature_extraction.text import CountVectorizer

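# create a simple bag of words embedding:
bow = CountVectorizer(
    # the next line converts the filepaths to the actual texts:
    preprocessor = load_text,
    # tokenization function from above:
    tokenizer = tokenize,
    # the default token_pattern is unused when a tokenizer is given;
    # setting it to None avoids the corresponding warning:
    token_pattern = None
)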

# train the embedding:
embeddings_train = bow.fit_transform(data_train['path'].values)

# vectorize test data:
embeddings_test = bow.transform(data_test['path'].values)
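
Conceptually, a bag-of-words embedding fixes a vocabulary on the training texts and then represents each text as a vector of token counts. The sketch below reproduces that idea in plain Python on a made-up toy corpus. `fit_vocabulary` and `transform` are hypothetical helper names, and the real `CountVectorizer` returns a sparse matrix rather than nested lists.

```python
from collections import Counter
from typing import Dict, List

def fit_vocabulary(docs: List[List[str]]) -> Dict[str, int]:
    """Map every distinct training token to a column index (sorted for reproducibility)."""
    tokens = sorted({token for doc in docs for token in doc})
    return {token: index for index, token in enumerate(tokens)}

def transform(docs: List[List[str]], vocab: Dict[str, int]) -> List[List[int]]:
    """Turn each tokenized document into a count vector over the fitted vocabulary."""
    rows = []
    for doc in docs:
        counts = Counter(doc)
        # Tokens that are not in the vocabulary are ignored.
        rows.append([counts.get(token, 0) for token in vocab])
    return rows

# Toy corpus, already tokenized by whitespace.
train_docs = ["good good film".split(), "bad film".split()]
test_docs = ["good unseen".split()]

vocab = fit_vocabulary(train_docs)
print(vocab)
print(transform(train_docs, vocab))
print(transform(test_docs, vocab))
```

The token "unseen" never occurred in the training texts, so it contributes nothing to the test vector. Words seen only at test time are dropped in the same way.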



**Classification with a linear SVM**

See documentation of [sklearn.svm.LinearSVC](https://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html#sklearn.svm.LinearSVC)

In [37]:
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score

svm = LinearSVC()

# train classifier:
svm.fit(embeddings_train, data_train['label'].values)

# test classifier:
predictions = svm.predict(embeddings_test)

# Calculate Accuracy:
print('Accuracy:', accuracy_score(data_test['label'].values, predictions))

Accuracy: 0.779
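
Accuracy alone does not show whether the classifier errs more on positive or on negative reviews. As a hedged illustration, the sketch below computes per-class precision and recall from two label lists in plain Python. `per_class_report` is a hypothetical helper and the toy labels are made up; on the real data, `data_test['label'].values` and `predictions` would take their place.

```python
from typing import Dict, Sequence

def per_class_report(y_true: Sequence[str], y_pred: Sequence[str],
                     labels: Sequence[str] = ("pos", "neg")) -> Dict[str, Dict[str, float]]:
    """Compute precision and recall for each label."""
    report = {}
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)  # true positives
        fp = sum(1 for t, p in pairs if t != label and p == label)  # false positives
        fn = sum(1 for t, p in pairs if t == label and p != label)  # false negatives
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        report[label] = {"precision": precision, "recall": recall}
    return report

# Made-up toy labels for illustration only.
y_true = ["pos", "pos", "neg", "neg"]
y_pred = ["pos", "neg", "neg", "neg"]
print(per_class_report(y_true, y_pred))
```

Because the data contains the same number of texts per class, large differences between the two classes would point to a systematic bias rather than to class imbalance.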
