# Lesson 1: Exploring Natural Language Processing Foundations with the Reuters Corpus

### Lesson Introduction

Hello and welcome! In today's lesson, we dive into Natural Language Processing (NLP), the branch of artificial intelligence concerned with how computers process and work with human language. You'll be introduced to basic NLP concepts using NLTK, a popular Python library for natural language processing.

### Intro to Natural Language Processing

Natural Language Processing, or NLP, is a field of study that focuses on the interactions between human language and computers. It sits at the intersection of computer science, artificial intelligence, and computational linguistics. NLP involves making computers understand, interpret, and manipulate human language. It's an essential tool for transforming unstructured data into actionable information. For example, it can help us understand the sentiments of customers about a product by analyzing online reviews and social media posts.

Machine learning and data science play a big role in NLP. They provide the methods to "teach" machines how to understand our language. As data scientists, understanding NLP techniques can help us create better models for text analysis.

### Investigating the Reuters Dataset

To practice natural language processing, we first need a dataset to work with. In this course, we'll use the Reuters Corpus, a collection of categorized Reuters news articles that ships with the Natural Language Toolkit (NLTK), a Python library that bundles corpora, lexical resources, and text-processing tools.

Let's start by importing the required library and downloading the dataset.

```python
# Import the required library
import nltk

# Download the Reuters dataset
nltk.download('reuters')
```

Now, our Reuters dataset is downloaded and ready to use.
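
Calling `nltk.download('reuters')` on every run is harmless but unnecessary once the corpus is on disk. A possible sketch of a check-then-download helper follows. `ensure_resource` and `ensure_reuters` are hypothetical names introduced here, and the resource path `corpora/reuters` assumes NLTK's default data layout. The sketch relies on `nltk.data.find` raising `LookupError` when a resource is missing.

```python
def ensure_resource(resource_path, package, find, download):
    """Download `package` only if `find(resource_path)` raises LookupError.

    Returns True if a download was triggered, False if the resource was already present.
    """
    try:
        find(resource_path)
    except LookupError:
        download(package)
        return True
    return False


def ensure_reuters():
    """Wire the helper to NLTK (nltk is imported lazily, only when this is called)."""
    import nltk

    # Assumption: the corpus lives under 'corpora/reuters' in the NLTK data directory.
    return ensure_resource("corpora/reuters", "reuters", nltk.data.find, nltk.download)
```

Passing `find` and `download` as arguments keeps the helper easy to try out without network access; in the lesson you would simply call `ensure_reuters()` once at the top of your script.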

### Exploring Documents in the Reuters Dataset

Let's explore our dataset. Each document in the corpus is identified by a file ID, so we start by listing the file IDs and counting them:

```python
from nltk.corpus import reuters

# List the file IDs of all documents in the corpus
documents = reuters.fileids()

# Print the number of documents
print(f"There are {len(documents)} documents in the Reuters dataset")
```

The output of the above code will be:

```
There are 10788 documents in the Reuters dataset
```
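
In NLTK's distribution of the corpus, each file ID starts with `training/` or `test/`, which reflects a predefined train/test split. Here is a minimal sketch of separating the two groups. `split_file_ids` is a hypothetical helper, and the sample IDs are illustrative stand-ins for the list returned by `reuters.fileids()`.

```python
def split_file_ids(file_ids):
    """Group file IDs by the part before the first '/' (e.g. 'training', 'test')."""
    groups = {}
    for file_id in file_ids:
        prefix, _, _ = file_id.partition("/")
        groups.setdefault(prefix, []).append(file_id)
    return groups


# Illustrative IDs; in the lesson you would pass reuters.fileids() instead.
sample_ids = ["test/14826", "test/14828", "training/1", "training/10"]
groups = split_file_ids(sample_ids)
for prefix, ids in groups.items():
    print(f"{prefix}: {len(ids)} documents")
```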

Each file ID in this dataset represents a document. We can pick any file ID and see the raw text in it:

```python
# Load the text of a single document
document_text = reuters.raw(documents[0])

# Print the first 500 characters of the document text
print("\nThe first 500 characters of the first document:\n")
print(document_text[:500])
```

The output of the above code will be:

```
The first 500 characters of the first document:

ASIAN EXPORTERS FEAR DAMAGE FROM U.S.-JAPAN RIFT
  Mounting trade friction between the
  U.S. And Japan has raised fears among many of Asia's exporting
  nations that the row could inflict far-reaching economic
  damage, businessmen and officials said.
      They told Reuter correspondents in Asian capitals a U.S.
  Move against Japan might boost protectionist sentiment in the
  U.S. And lead to curbs on American imports of their products.
      But some exporters said that while the conflict wo
```

There you have it: the raw text data we will be working with. It may look like a lot right now, but as we go through this course, you'll learn how to break down and handle text data efficiently using NLP techniques like tokenization, POS tagging, and lemmatization.
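
The raw text keeps the original line breaks and indentation. As a small taste of text cleanup, here is a sketch that treats the first line as the headline and collapses the remaining whitespace. `split_headline` is a hypothetical helper, and assuming that the first line is always a headline is a simplification.

```python
def split_headline(raw_text):
    """Return (headline, body) with runs of whitespace in the body collapsed."""
    headline, _, rest = raw_text.partition("\n")
    body = " ".join(rest.split())
    return headline.strip(), body


# Excerpt copied from the output above; reuters.raw(file_id) would supply the full text.
sample = (
    "ASIAN EXPORTERS FEAR DAMAGE FROM U.S.-JAPAN RIFT\n"
    "  Mounting trade friction between the\n"
    "  U.S. And Japan has raised fears among many of Asia's exporting\n"
    "  nations that the row could inflict far-reaching economic\n"
    "  damage, businessmen and officials said.\n"
)
headline, body = split_headline(sample)
print(headline)
print(body)
```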

### Analyzing Document Categories

In the Reuters dataset, each document belongs to one or more categories that describe its topic. Knowing these categories gives us a quick overview of what the corpus covers.

We'll just check the categories of a single document for now:

```python
# Print the categories of the first document
print("\nThe categories of the first document are:")
print(reuters.categories(documents[0]))
```

The output of the above code will be:

```
The categories of the first document are:
['trade']
```

These categories provide us with a top-level view of what each document is about.
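
Because a document can carry several categories, it can be useful to know how common that is. The sketch below computes the share of multi-category documents from a plain mapping of file ID to category list. `multilabel_share` is a hypothetical helper, and the toy mapping is made up for illustration.

```python
def multilabel_share(doc_categories):
    """Return the fraction of documents that carry more than one category."""
    if not doc_categories:
        return 0.0
    multi = sum(1 for cats in doc_categories.values() if len(cats) > 1)
    return multi / len(doc_categories)


# Made-up mapping for illustration; with NLTK you could build the real one as
# {fid: reuters.categories(fid) for fid in reuters.fileids()}.
toy = {
    "doc-a": ["trade"],
    "doc-b": ["grain", "wheat"],
    "doc-c": ["crude"],
    "doc-d": ["coffee", "tea"],
}
print(f"{multilabel_share(toy):.0%} of the toy documents have several categories")
```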

### Lesson Summary and Practice

There we go! We have taken our first steps into the world of Natural Language Processing by exploring the Reuters Corpus from the Natural Language Toolkit (nltk).

As we move forward, we will be working on setting up a proper NLP pipeline and learning key NLP techniques such as tokenization, POS tagging, and lemmatization. All these skills will be extremely useful for your data science and machine learning journey. So, let's keep moving forward and continue exploring these in the upcoming lessons.

## Counting Unique Categories in Reuters Dataset

Fantastic work, Stellar Navigator!

Now, using the Reuters dataset, alter the starter code to print the total number of unique categories in the dataset instead of printing every category with its document count. Remember, the `len` function can count the elements in a set!

```python
# Importing the necessary NLTK corpus
from nltk.corpus import reuters

# Obtain all category names in the dataset
all_categories = reuters.categories()

# Get the unique categories by turning the list into a set
unique_categories = set(all_categories)

# Initialize an empty dictionary to store category counts
category_counts = {}

# Iterate through the unique categories
for category in unique_categories:
    # Count the number of documents in each category and store it in the dictionary
    category_counts[category] = len(reuters.fileids(category))

# Print the category counts dictionary
print(category_counts)

```

I'll modify the code to show just the total count of unique categories in the Reuters dataset:
```python
# Importing the necessary NLTK corpus
from nltk.corpus import reuters

# Obtain all unique categories by converting the list to a set
unique_categories = set(reuters.categories())

# Print the total number of unique categories
print(f"The Reuters dataset contains {len(unique_categories)} unique categories.")
```

This simplified version will output something like:
```
The Reuters dataset contains 90 unique categories.
```

The key changes made:
* Removed the dictionary and the counting loop, since only the total is needed
* Called `set()` directly on `reuters.categories()` to get the unique categories
* Replaced the dictionary printout with a descriptive message

This gives a shorter solution that directly answers how many unique categories the dataset contains. Converting to a set guarantees that each category is counted once, and `len` then returns the total.
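
The `category_counts` dictionary from the starter code is still useful when you want to know which categories are largest. Here is a small sketch of ranking such a dictionary. `largest_categories` is a hypothetical helper, and the counts below are made up rather than taken from the corpus.

```python
def largest_categories(category_counts, n=3):
    """Return the n (category, count) pairs with the highest counts, largest first.

    Ties are broken alphabetically so the result is deterministic.
    """
    ranked = sorted(category_counts.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:n]


# Made-up counts for illustration; the starter code's category_counts has the same shape.
toy_counts = {"alpha": 12, "beta": 40, "gamma": 7, "delta": 40}
print(largest_categories(toy_counts, n=2))
```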

## Explore 'Tea' Category in Reuters Corpus

Superb job, Stellar Navigator!

You're making great progress. It's time to practice making a small change to the code. The starter code fetches the documents associated with the 'sugar' category. Your task is to change the category from 'sugar' to 'tea', fetch the related documents, and then print the raw text and categories of the first 'tea' document. Ready? Your code journey awaits!

```python
# Importing the necessary NLTK corpus
from nltk.corpus import reuters

# Get all documents associated with the 'sugar' category
sugar_documents = reuters.fileids('sugar')

# Access and print the raw text and categories of the first document in the 'sugar' category
first_sugar_doc_text = reuters.raw(sugar_documents[0])
print("\nThe first 500 characters of the first 'sugar' document:\n")
print(first_sugar_doc_text[:500])
first_sugar_doc_categories = reuters.categories(sugar_documents[0])
print("\nThe categories of the first 'sugar' document are:")
print(first_sugar_doc_categories)


```

I'll modify the code to explore documents in the 'tea' category instead of 'sugar':
```python
# Importing the necessary NLTK corpus
from nltk.corpus import reuters

# Get all documents associated with the 'tea' category
tea_documents = reuters.fileids('tea')

# Access and print the raw text and categories of the first document in the 'tea' category
first_tea_doc_text = reuters.raw(tea_documents[0])
print("\nThe first 500 characters of the first 'tea' document:\n")
print(first_tea_doc_text[:500])

# Get and print the categories of the first tea document
first_tea_doc_categories = reuters.categories(tea_documents[0])
print("\nThe categories of the first 'tea' document are:")
print(first_tea_doc_categories)
```

Key changes made:
* Changed variable names from 'sugar' to 'tea' for clarity
* Modified the category being searched from 'sugar' to 'tea'
* Updated the print statements to reflect we're looking at tea documents
* Maintained the same structure but with the new focus on tea-related content

This code will show us:
1. The first 500 characters of the first document that's categorized under 'tea'
2. All the categories that this document belongs to (since Reuters documents can belong to multiple categories)

The output would show us a glimpse into how tea-related news was reported in the Reuters corpus, along with any other topics this particular article might cover.

## Fetch Text and Categories for 'Coffee' in Reuters Corpus


Spot-on, Stellar Navigator!

In this exercise, your objective is to fetch text and categories associated with the 'coffee' category from the Reuters Corpus. Replace each blank space (____) in the code with the correct code statement or variable. Can't wait to read about 'coffee'!

```python
# Importing the necessary NLTK corpus
from nltk.corpus import reuters

# Get all documents associated with the 'coffee' category
coffee_documents = reuters.fileids(____)

# Access and print the raw text and categories of the first document in the 'coffee' category
first_coffee_doc_text = reuters.____(____)
print("\nThe first 500 characters of the first 'coffee' document:\n")
print(____)
first_coffee_doc_categories = reuters.____(____)
print("\nThe categories of the first 'coffee' document are:")
print(____)

```

I'll help fill in the blanks to explore the 'coffee' category documents:

```python
# Importing the necessary NLTK corpus
from nltk.corpus import reuters

# Get all documents associated with the 'coffee' category
coffee_documents = reuters.fileids('coffee')

# Access and print the raw text and categories of the first document in the 'coffee' category
first_coffee_doc_text = reuters.raw(coffee_documents[0])
print("\nThe first 500 characters of the first 'coffee' document:\n")
print(first_coffee_doc_text[:500])
first_coffee_doc_categories = reuters.categories(coffee_documents[0])
print("\nThe categories of the first 'coffee' document are:")
print(first_coffee_doc_categories)
```

Here's what I filled in:
1. `'coffee'` - The category we want to search for
2. `raw(coffee_documents[0])` - Get the raw text of the first coffee document
3. `first_coffee_doc_text[:500]` - Print first 500 characters of the text
4. `categories(coffee_documents[0])` - Get categories for the first coffee document
5. `first_coffee_doc_categories` - Print all categories for this document

The code will:
* Find all documents tagged with the 'coffee' category
* Get the text content of the first coffee-related document
* Display the first 500 characters of that document
* Show all categories associated with that document

This gives us a good look at how coffee-related news was covered in the Reuters corpus and how it might intersect with other topics through its category assignments.

## Exploring the 'Gas' Category in Reuters Corpus

Well done, Stellar Navigator!

As a next step, add the missing lines to the provided code to retrieve all documents associated with the 'gas' category from the Reuters Corpus. Then, print the first 500 characters of the raw text from the first document in this category. Continue your remarkable journey through the vast universe of natural language processing!

```python
# Importing the necessary NLTK corpus
from nltk.corpus import reuters

# TODO: Get all documents associated with the 'gas' category

# TODO: Access and print the raw text of the first document in the 'gas' category

# Print the number of documents and categories of the first document in the 'gas' category
print(f"\nThere are {len(gas_documents)} documents in the 'gas' category.")
first_gas_doc_categories = reuters.categories(gas_documents[0])
print(f"The categories of the first 'gas' document are:\n{first_gas_doc_categories}")

```

I'll help complete the code by adding the missing lines to explore the 'gas' category documents:
```python
# Importing the necessary NLTK corpus
from nltk.corpus import reuters

# Get all documents associated with the 'gas' category
gas_documents = reuters.fileids('gas')

# Access and print the raw text of the first document in the 'gas' category
first_gas_doc_text = reuters.raw(gas_documents[0])
print("\nThe first 500 characters of the first 'gas' document:\n")
print(first_gas_doc_text[:500])

# Print the number of documents and categories of the first document in the 'gas' category
print(f"\nThere are {len(gas_documents)} documents in the 'gas' category.")
first_gas_doc_categories = reuters.categories(gas_documents[0])
print(f"The categories of the first 'gas' document are:\n{first_gas_doc_categories}")
```

Here's what I added:
1. `gas_documents = reuters.fileids('gas')` - Gets all documents tagged with the 'gas' category
2. `first_gas_doc_text = reuters.raw(gas_documents[0])` - Retrieves the raw text of the first gas-related document
3. Added print statements to display the first 500 characters of the document

The complete code will:
* Find all documents in the Reuters corpus tagged with 'gas'
* Get the full text of the first gas-related document
* Show the first 500 characters of that document
* Display the total count of gas-related documents
* List all categories associated with the first document

This gives us a comprehensive view of how gas-related news was covered in the Reuters corpus, including its frequency and its relationships with other topics through category assignments.
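
The 'tea', 'coffee', and 'gas' exercises all follow the same pattern: list a category's file IDs, read the first document, and look up its categories. One way to capture that pattern is a small helper, sketched below. `preview_category` and `TinyCorpus` are hypothetical names: `TinyCorpus` is an in-memory stand-in that mimics only the three reader methods used in this lesson, so the sketch runs without downloading anything.

```python
class TinyCorpus:
    """Hypothetical in-memory stand-in that mimics three methods of the NLTK reader."""

    def __init__(self, docs):
        # docs maps file ID -> (list of categories, raw text)
        self._docs = docs

    def categories(self, file_id=None):
        if file_id is not None:
            return list(self._docs[file_id][0])
        return sorted({c for cats, _ in self._docs.values() for c in cats})

    def fileids(self, category=None):
        return [f for f, (cats, _) in self._docs.items()
                if category is None or category in cats]

    def raw(self, file_id):
        return self._docs[file_id][1]


def preview_category(corpus, category, n_chars=500):
    """Return (document count, excerpt, categories) for the first document in a category."""
    if category not in corpus.categories():
        raise ValueError(f"Unknown category: {category!r}")
    file_ids = corpus.fileids(category)
    first = file_ids[0]
    return len(file_ids), corpus.raw(first)[:n_chars], corpus.categories(first)


# Toy data for illustration only.
toy = TinyCorpus({
    "test/1": (["gas", "crude"], "GAS PRICES RISE\n  Toy text."),
    "test/2": (["tea"], "TEA EXPORTS FALL\n  Toy text."),
})
count, excerpt, cats = preview_category(toy, "gas", n_chars=15)
print(count, repr(excerpt), cats)
```

With NLTK installed and the corpus downloaded, passing `reuters` (from `nltk.corpus`) in place of `toy` should work the same way, since the helper only calls `categories`, `fileids`, and `raw`.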

## Exploring Reuters Corpus by Category

```python
# TODO: Import necessary libraries

# TODO: Fetch the unique categories from Reuters corpus

# TODO: For each unique category
    # TODO: Get all documents related to the category

    # TODO: Access and print the raw text of the first document in that category

    # TODO: Print the categories of the first document in that category

```

I'll help create a script to explore each category in the Reuters corpus systematically:
```python
# Import necessary libraries
from nltk.corpus import reuters

# Fetch the unique categories from Reuters corpus
unique_categories = set(reuters.categories())

# For each unique category, explore its first document
for category in sorted(unique_categories):
    # Get all documents related to the category
    category_docs = reuters.fileids(category)

    # Skip if no documents found
    if not category_docs:
        continue

    print(f"\n### Category: {category} ###")
    print(f"Number of documents: {len(category_docs)}")

    # Access and print the raw text of the first document
    first_doc_text = reuters.raw(category_docs[0])
    print("\nFirst document excerpt (first 200 characters):")
    print(first_doc_text[:200])

    # Print the categories of the first document
    first_doc_categories = reuters.categories(category_docs[0])
    print("\nCategories of this document:")
    print(first_doc_categories)

    print("\n" + "="*50)  # Separator for readability
```

This enhanced version:

1. Imports the Reuters corpus from NLTK

2. Gets all unique categories using a set to avoid duplicates

3. For each category:
   * Retrieves all documents in that category
   * Shows the total number of documents
   * Displays the first 200 characters of the first document
   * Lists all categories associated with that document
   * Adds clear formatting for better readability

4. Guards against empty categories by skipping them

5. Uses sorted() to process categories in alphabetical order

The output will look roughly like this (the count and excerpt shown are illustrative):
```
### Category: acq ###
Number of documents: 719

First document excerpt (first 200 characters):
COMPUTER TERMINAL SYSTEMS <CPML> COMPLETES SALE
  Computer Terminal Systems Inc said it
has completed the sale of 200,000 shares of its common stock, and
warrants to acquire an additional one mln sha

Categories of this document:
['acq']
==================================================

### Category: alum ###
...
```

This gives a quick overview of:
* How many documents are in each category, and therefore how topics are distributed across the corpus
* The kind of content each category contains
* Which other categories the sampled documents also belong to

The script is particularly useful for:
* Understanding the corpus structure
* Identifying major themes
* Seeing how news stories were categorized
* Finding relationships between different categories
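
The script above looks at only the first document of each category, so it shows category relationships only anecdotally. One way to measure overlap directly is to count how often two categories label the same document. The sketch below does this for a plain mapping of file ID to category list. `category_pairs` is a hypothetical helper, and the toy mapping is made up for illustration.

```python
from collections import Counter
from itertools import combinations


def category_pairs(doc_categories):
    """Count how often each unordered pair of categories labels the same document."""
    pairs = Counter()
    for cats in doc_categories.values():
        # Sort so that ('grain', 'wheat') and ('wheat', 'grain') count as the same pair.
        for pair in combinations(sorted(set(cats)), 2):
            pairs[pair] += 1
    return pairs


# Made-up mapping for illustration; with NLTK you could build the real one as
# {fid: reuters.categories(fid) for fid in reuters.fileids()}.
toy = {
    "doc-a": ["grain", "wheat"],
    "doc-b": ["wheat", "grain", "corn"],
    "doc-c": ["trade"],
}
for pair, count in category_pairs(toy).most_common(3):
    print(pair, count)
```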