## Exercises Week 1: Working with textual data -- ANSWERS

In this assignment, you will work with textual data, focusing on dataset structure, processing, and basic analysis techniques. You will inspect the dataset, discuss research questions, and implement fundamental preprocessing steps such as tokenization, stopword removal, and stemming.

In [None]:
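import os
import random
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

# If NLTK raises a LookupError below, download the required resources once with
# nltk.download(...): 'stopwords', plus 'punkt' or 'punkt_tab' for the tokenizer
# (the tokenizer resource name depends on your NLTK version).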

### 1. Get the data

Download `articles.tar.gz` or `articles.zip` from Canvas (under Week 1). Unpack the dataset and inspect the contents.

*Hint*: On Windows, you can use the built-in extraction tools; on macOS/Linux, run `tar -xvzf articles.tar.gz`.
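
If you prefer to unpack the archive from Python, the standard library can do this on any platform. The following is a minimal sketch; the helper name `unpack_dataset` is made up for illustration, and it assumes the archive sits in the current working directory.

```python
import os
import shutil


def unpack_dataset(archive_path, target_dir='.'):
    """Unpack a .zip or .tar.gz archive into `target_dir`."""
    # shutil infers the archive format from the file extension
    shutil.unpack_archive(archive_path, target_dir)


# Only runs if the archive has been downloaded to the working directory
if os.path.exists('articles.tar.gz'):
    unpack_dataset('articles.tar.gz')
```

The same call works for `articles.zip`, since `shutil.unpack_archive` supports both formats.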

### 2. Inspect the structure of the dataset

What information do the following elements provide about the dataset?

- Folder (directory) names
- Folder structure/hierarchy
- File names
- File contents

How can you programmatically inspect these aspects of the dataset?

*Hint*: Consider using `os.listdir()` to check the folder contents and `glob` for pattern-based file selection.

In [None]:
dataset_path = 'articles'

folders = os.listdir(dataset_path)
print("Folders:", folders)

In [None]:
# Check the contents of each folder to get an overview.
for folder in folders:
    folder_path = os.path.join(dataset_path, folder)
    if os.path.isdir(folder_path):
        print(f"Folder: {folder}")
        print("Contents:", os.listdir(folder_path))
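
The hint above also mentions `glob`. As a rough sketch, a recursive glob pattern can give a compact overview by counting the files in every directory. The helper name `count_files_per_folder` is hypothetical, and the example only prints something if an `articles` folder exists in the working directory.

```python
import os
from collections import Counter
from glob import glob


def count_files_per_folder(root):
    """Count files per directory below `root`, keyed by relative directory path."""
    counts = Counter()
    # '**' with recursive=True matches entries at any depth below root
    for path in glob(os.path.join(root, '**', '*'), recursive=True):
        if os.path.isfile(path):
            counts[os.path.relpath(os.path.dirname(path), root)] += 1
    return counts


# Only runs if the dataset has been unpacked in the working directory
if os.path.isdir('articles'):
    for folder, n in sorted(count_files_per_folder('articles').items()):
        print(f"{folder}: {n} files")
```

Counting files per folder is usually easier to read than printing every file name, especially when a folder holds hundreds of articles.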

### 3. Discuss strategies for working with this dataset

Considering the dataset's size and structure:

- Research questions:
  * How do different news outlets cover the same event? Are there notable differences in tone or word choice?
  * What are the most frequently discussed topics across different dates and sources?
  * Can we detect sentiment trends over time or between news outlets?
  * Is it possible to identify bias or framing through word frequency and topic modeling?
- Strategies: Process files in batches to avoid memory overload, and use generators to handle large datasets efficiently.
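
The two strategies can be combined. The sketch below is one possible approach, not the only one: `read_articles` and `in_batches` are hypothetical helper names, and the batch size is an arbitrary choice.

```python
from itertools import islice


def read_articles(filenames, encoding='utf-8'):
    """Yield (filename, text) pairs one at a time instead of building a list."""
    for filename in filenames:
        try:
            with open(filename, 'r', encoding=encoding) as f:
                yield filename, f.read()
        except (OSError, UnicodeDecodeError) as e:
            print(f"Skipping {filename}: {e}")


def in_batches(iterable, batch_size):
    """Group any iterable into lists of at most `batch_size` items."""
    iterator = iter(iterable)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch
```

With a list of file paths called `paths`, a loop such as `for batch in in_batches(read_articles(paths), 100):` keeps at most 100 articles in memory at a time.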


### 4. Read some (or all) data

Load a sample of the dataset and display the first few lines of text.
How would you handle reading a large number of files efficiently?

In [None]:
from glob import glob
import os

# Define dataset path and the source you want to read from
dataset_path = 'articles/'
source_name = 'Vox'  # Change to desired source if needed, like BBC or The Guardian

# Match every file in the chosen source folder, across all date folders
newspaperfiles = glob(os.path.join(dataset_path, f'*/{source_name}/*'))

# Initialize a list to hold documents
documents = []

# Read files and handle encoding errors if necessary
for filename in newspaperfiles:
    try:
        with open(filename, 'r', encoding='utf-8') as f:
            documents.append(f.read())
    except (OSError, UnicodeDecodeError) as e:
        print(f"Error reading {filename}: {e}")

print(f"Loaded {len(documents)} articles from {source_name}.")

<div class="alert-block alert-warning">
  <p><strong>Tip:</strong> When testing or practicing your code, start with a random sample of the articles. This lets you check quickly whether your logic works without processing the entire dataset. Once the code works on the sample, you can apply it to the full set of documents.</p>

  <p>The following snippet randomly selects a subset of articles:</p>

  <pre><code>import random
articles = random.sample(documents, 10)  # Randomly select 10 articles</code></pre>

  <p>This selects 10 random articles from the <code>documents</code> list. Note that <code>random.sample</code> raises a <code>ValueError</code> if the list contains fewer than 10 items.</p>
</div>


In [None]:
articles = random.sample(documents, 10) 
print(articles[0])

### 5. Tokenization

What is tokenization, and why is it useful in text processing?

Implement a basic tokenization process using Python.

In [None]:
## First: try it out on a sample sentence

text = 'This is a sample sentence for tokenization.'
tokens = word_tokenize(text)
print(tokens)

Experiment with different texts to see how `word_tokenize` handles punctuation, contractions, and other linguistic features. For example, apply it to the `articles` sample you have just created.

In [None]:
## Second: scale up to articles

tokenized_articles = []

for article in articles:
    tokens = word_tokenize(article)
    tokenized_articles.append(tokens)

# Display the tokenized result for the first article as an example
print(tokenized_articles[0])
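
To see why a dedicated tokenizer is useful, it helps to compare two simpler approaches. The sketch below uses only the standard library; the sample sentence and the regular expression are illustrative choices and not how NLTK works internally. `word_tokenize` applies more elaborate rules and may, for instance, split contractions differently.

```python
import re

text = "Don't panic: tokenization isn't hard."

# Naive approach: split on whitespace; punctuation stays attached to words
whitespace_tokens = text.split()

# Regex approach: words (optionally with an internal apostrophe)
# or single punctuation marks as separate tokens
regex_tokens = re.findall(r"\w+(?:'\w+)?|[^\w\s]", text)

print(whitespace_tokens)
print(regex_tokens)
```

With whitespace splitting, `panic:` and `hard.` keep their punctuation, so they would be counted as different words from `panic` and `hard`. The regex variant separates the punctuation into its own tokens.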

### 6. Stopword Removal

- Stopwords are common words that usually carry little meaning in text analysis, such as "is", "and", "the".
- Removing them helps focus on more meaningful content.


The cell below filters stopwords out of a sample sentence using the stopword list from the NLTK library. Tokens are lowercased before the comparison because the NLTK list contains lowercase words only.

In [None]:
# Sample sentence for tokenization and stopword removal
text = 'This is a sample sentence for tokenization and stopword removal.'

# Tokenize the text into words
tokens = word_tokenize(text)

# Get the set of English stopwords
stop_words = set(stopwords.words('english'))

# Filter out stopwords from the tokens
filtered_tokens = [word for word in tokens if word.lower() not in stop_words]

# Display the filtered tokens
print(filtered_tokens)

Now scale up to the sampled articles:

In [None]:
# Get the set of English stopwords
stop_words = set(stopwords.words('english'))

# List to store the filtered tokens for each article
filtered_articles = []

# Apply tokenization and stopword removal to each article
for article in articles:
    # Tokenize the article into words
    tokens = word_tokenize(article)
    
    # Filter out stopwords from the tokens
    filtered_tokens = [word for word in tokens if word.lower() not in stop_words]
    
    # Append the filtered tokens to the result list
    filtered_articles.append(filtered_tokens)

# Display the filtered tokens for the first article as an example
print(filtered_articles[0])
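
The filtered lists still contain punctuation tokens such as `.` and `,`, because these are not part of the stopword list. A possible next step, sketched below under the assumption that only alphabetic tokens are of interest, is to drop them and count the remaining words. The `filtered_tokens` list here is a small made-up stand-in for the output of the previous cell.

```python
from collections import Counter

# Made-up stand-in for the stopword-filtered tokens of one article
filtered_tokens = ['sample', 'sentence', 'tokenization', ',', 'Sample', 'removal', '.']

# Keep alphabetic tokens only and lowercase them so 'Sample' and 'sample' match
words = [token.lower() for token in filtered_tokens if token.isalpha()]

# Count occurrences and list the most frequent words
top_words = Counter(words).most_common(3)
print(top_words)
```

Note that `str.isalpha()` also removes numbers and hyphenated words, which may or may not be desirable depending on the research question.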