# Lesson 1: Introduction to Textual Data Collection in NLP

### **Introduction to Text Data Collection** 📚

#### **1. Introduction** 🌟
- As data science and machine learning practitioners, especially in Natural Language Processing (NLP), we often work with text data.
- Text data is usually unstructured and harder to analyze than structured data.
- Examples include emails, social media posts, books, and conversation transcripts.

---

#### **2. The 20 Newsgroups Dataset** 📰
- The dataset contains roughly 20,000 documents taken from newsgroup discussions (early internet discussion forums).
- It is divided into 20 topic categories such as science, religion, politics, sports, and so on.
- It is useful for text classification tasks because the documents are cleanly divided into categories.

---

#### **3. Accessing the Data and Understanding Its Structure** 🔍
- **Function used**: `fetch_20newsgroups()` from `sklearn.datasets`.
- **Data structure**:
  - `data`: the text content (a list of strings).
  - `target`: the label of each text, stored as an integer index (a NumPy array).
  - `target_names`: the label names.

**Example code:**
```python
from sklearn.datasets import fetch_20newsgroups

# Fetch the data
newsgroups = fetch_20newsgroups(subset='all')

# Inspect the data structure
print(f'Type of data: {type(newsgroups.data)}')  # <class 'list'>
print(f'Type of target: {type(newsgroups.target)}')  # <class 'numpy.ndarray'>
```
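
The three attributes fit together through the integer labels: each entry of `target` is a position in `target_names`. The sketch below illustrates this with a tiny made-up stand-in rather than the real dataset, so the texts and the `label_of` helper are assumptions for illustration only.

```python
from types import SimpleNamespace

# Toy stand-in (hypothetical data) exposing the same three attributes as the real dataset
toy = SimpleNamespace(
    data=[
        "GPU drivers keep crashing",
        "The shuttle launch was delayed",
        "New rendering trick",
    ],
    target=[0, 1, 0],
    target_names=["comp.graphics", "sci.space"],
)


def label_of(dataset, index):
    """Return the human-readable label of the document at a zero-based index."""
    return dataset.target_names[dataset.target[index]]


# Pair every document with its label name
labelled = [(text, label_of(toy, i)) for i, text in enumerate(toy.data)]
for text, label in labelled:
    print(f"{label}: {text}")
```

The same helper should work unchanged on the object returned by `fetch_20newsgroups()`, since NumPy integers can be used as list indices.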

---

#### **4. Exploring the Data** 🔎
- **Number of documents**: 18,846.
- **Number of labels**: the same as the number of documents (18,846), one label per document.
- **Categories**: 20 classes, for example:
  - `alt.atheism`, `comp.graphics`, `rec.autos`, `talk.religion.misc`, etc.

**Example code:**
```python
print(f'Number of datapoints: {len(newsgroups.data)}')
print(f'Number of target variables: {len(newsgroups.target)}')
print(f'Possible classes: {newsgroups.target_names}')
```
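
Beyond the totals, it is often useful to see how many documents fall into each class. Here is a minimal sketch, assuming a hypothetical helper `documents_per_class` and made-up toy labels; with the real dataset you would pass `newsgroups.target` and `newsgroups.target_names` instead.

```python
from collections import Counter


def documents_per_class(target, target_names):
    """Count how many documents carry each label, keyed by label name."""
    counts = Counter(int(label) for label in target)
    # Classes that never occur are reported with a count of 0
    return {name: counts.get(i, 0) for i, name in enumerate(target_names)}


# Hypothetical toy labels, not taken from the real dataset
toy_target = [0, 2, 1, 0, 2, 2]
toy_names = ["alt.atheism", "comp.graphics", "rec.autos"]

class_counts = documents_per_class(toy_target, toy_names)
for name, count in class_counts.items():
    print(f"{name}: {count}")
```

A strongly uneven count per class would be worth knowing before training a classifier, which is why this check tends to come early in exploration.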

---

#### **5. Previewing a Sample** 📝
- **Sample article**: the article at index 10 (the 11th article, since indexing starts at 0).
- **Category**: `talk.religion.misc`.

**Example code:**
```python
print(f'Article:\n{newsgroups.data[10]}')
print(f'Corresponding Topic: {newsgroups.target_names[newsgroups.target[10]]}')
```

**Result:**
- The article is a discussion about religion and animals used in rituals.
- Topic: `talk.religion.misc`.

---

#### **6. Lesson Summary** 📖
- Today we learned how to access and explore text data using Python.
- Understanding the structure of a dataset such as `20 Newsgroups` is an important first step in text classification tasks.

**Next Steps** 🚀:
1. Cleaning the text data.
2. Handling missing values.
3. Organizing the data for further analysis.

Each of these steps will build your text classification skills. Keep up the learning! 🎉

## Explore More of the 20 Newsgroups Dataset

Excellent job, Space Voyager! Now, make a small alteration to the starter code: change it to print out the first 150 characters of the 500th article from our 20 Newsgroups dataset, and also display its corresponding topic.

```python
# Import necessary libraries and modules
from sklearn.datasets import fetch_20newsgroups

# Fetch the dataset
newsgroups = fetch_20newsgroups(subset='all')

# Here we are fetching the first 100 characters of the 200th article and its corresponding topic
print(f'\nArticle:')
print(f'{newsgroups.data[200][:100]}')
print(f'\nCorresponding Topic:')
print(f'{newsgroups.target_names[newsgroups.target[200]]}')
```

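Here’s the updated code snippet to fetch the **first 150 characters of the article at index `500`** (the exercise's "500th article"; with zero-based indexing, `data[500]` is strictly the 501st document) and display its corresponding topic: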

```python
# Import necessary libraries and modules
from sklearn.datasets import fetch_20newsgroups

# Fetch the dataset
newsgroups = fetch_20newsgroups(subset='all')

# Fetch and display the first 150 characters of the 500th article and its corresponding topic
print(f'\nArticle:')
print(f'{newsgroups.data[500][:150]}')
print(f'\nCorresponding Topic:')
print(f'{newsgroups.target_names[newsgroups.target[500]]}')
```

### Changes Made:
1. **Article Selection**: Changed the article index from `200` to `500`.  
2. **Character Limit**: Adjusted the character limit to the **first 150 characters** (was previously 100).

This will now output a preview of the 500th article along with its topic label! 🚀
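
The index and the character limit are the only two things that change between the starter and the solution, so they can be turned into parameters. The sketch below assumes a hypothetical helper named `preview` and a made-up toy document; with the real dataset you would pass `newsgroups.data`, `newsgroups.target`, and `newsgroups.target_names`.

```python
def preview(data, target, target_names, index, n_chars=150):
    """Return (snippet, topic) for the document at a zero-based index."""
    if not 0 <= index < len(data):
        raise IndexError(f"index {index} is outside 0..{len(data) - 1}")
    # Slicing past the end of a string is safe: short documents are returned whole
    snippet = data[index][:n_chars]
    topic = target_names[target[index]]
    return snippet, topic


# Hypothetical toy document, not taken from the real dataset
toy_data = ["From: someone@example.com\nSubject: orbit question\n\nHow high is low Earth orbit?"]
toy_target = [0]
toy_names = ["sci.space"]

snippet, topic = preview(toy_data, toy_target, toy_names, index=0, n_chars=25)
print(f"Article:\n{snippet}")
print(f"Corresponding Topic:\n{topic}")
```

The explicit bounds check is a design choice: without it, a negative index would silently count from the end of the list instead of failing.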

## Uncover the End of 20 Newsgroups Dataset

Celestial Traveler, your journey continues! Fill in the blanks (____) to import and explore our dataset. We aim to extract and display the last three articles and their corresponding topics. Can you reveal what's at the end of our dataset?

```python
# Import necessary libraries
from sklearn.datasets import fetch_20newsgroups

# Fetch the dataset
newsgroups = ____(subset='all')

# Fetch last three articles
last_three_articles = newsgroups.____[-3:]

# Fetch the corresponding topics
corresponding_topics = [newsgroups.target_names[i] for i in newsgroups.____[-3:]]

# Display Last three articles and their corresponding topics
for i in range(3):
    print(f'\nLast article {i+1}:') 
    print(f'{last_three_articles[i]}\n')
    print(f'Corresponding Topic {i+1}:')
    print(f'{corresponding_topics[i]}\n')
```

Here's the completed code snippet to fetch and display the last three articles and their corresponding topics from the 20 Newsgroups dataset:

```python
# Import necessary libraries
from sklearn.datasets import fetch_20newsgroups

# Fetch the dataset
newsgroups = fetch_20newsgroups(subset='all')

# Fetch last three articles
last_three_articles = newsgroups.data[-3:]

# Fetch the corresponding topics
corresponding_topics = [newsgroups.target_names[i] for i in newsgroups.target[-3:]]

# Display Last three articles and their corresponding topics
for i in range(3):
    print(f'\nLast article {i+1}:') 
    print(f'{last_three_articles[i]}\n')
    print(f'Corresponding Topic {i+1}:')
    print(f'{corresponding_topics[i]}\n')
```

### Explanation of Blanks:
1. **`newsgroups = fetch_20newsgroups(subset='all')`**: Fetch the dataset.  
2. **`last_three_articles = newsgroups.data[-3:]`**: Extract the last three articles using slicing on the `data` attribute.  
3. **`newsgroups.target[-3:]`**: Retrieve the last three topic indices from the `target` attribute.  

This code will reveal the last three articles and their topics. Let's uncover the dataset's finale! 🚀
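
The same negative slice can be applied to `data` and `target` together and the results paired with `zip`, which avoids indexing by position in the display loop. This is a minimal sketch with a hypothetical helper `last_n_with_topics` and made-up toy data, not the exercise's required solution.

```python
def last_n_with_topics(data, target, target_names, n=3):
    """Return the last n (article, topic) pairs in dataset order."""
    if n <= 0:
        # Guard: data[-0:] would return the whole list, not an empty one
        return []
    return [
        (text, target_names[label])
        for text, label in zip(data[-n:], target[-n:])
    ]


# Hypothetical toy data, not taken from the real dataset
toy_data = ["first post", "second post", "third post", "fourth post"]
toy_target = [0, 1, 1, 0]
toy_names = ["rec.autos", "sci.space"]

last_three = last_n_with_topics(toy_data, toy_target, toy_names, n=3)
for text, topic in last_three:
    print(f"{topic}: {text}")
```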

## Fetch Specific Categories from Dataset

Celestial Traveler, let's narrow down our data collection. Modify the provided code to fetch only the 'alt.atheism' and 'talk.religion.misc' categories from our dataset. Then, display the first two articles from these categories along with their corresponding labels.

```python
# Import necessary libraries
from sklearn.datasets import fetch_20newsgroups

# Fetch a subset of the dataset containing selected categories. Update the categories as needed.
newsgroups_subset = fetch_20newsgroups(subset='all', categories=['comp.graphics', 'sci.space'])

# Display the first two articles and their corresponding topics from this subset
for i in range(2):
    print(f'\nArticle {i+1}:') 
    print(f'{newsgroups_subset.data[i]}\n')
    print(f'Corresponding Topic {i+1}:')
    print(f'{newsgroups_subset.target_names[newsgroups_subset.target[i]]}\n')
```

Here’s the updated code snippet to fetch only the **'alt.atheism'** and **'talk.religion.misc'** categories and display the first two articles along with their labels:

```python
# Import necessary libraries
from sklearn.datasets import fetch_20newsgroups

# Fetch a subset of the dataset containing the selected categories
newsgroups_subset = fetch_20newsgroups(subset='all', categories=['alt.atheism', 'talk.religion.misc'])

# Display the first two articles and their corresponding topics from this subset
for i in range(2):
    print(f'\nArticle {i+1}:') 
    print(f'{newsgroups_subset.data[i]}\n')
    print(f'Corresponding Topic {i+1}:')
    print(f'{newsgroups_subset.target_names[newsgroups_subset.target[i]]}\n')
```

### Changes Made:
1. **Categories Selection**: Updated the `categories` parameter to `['alt.atheism', 'talk.religion.misc']`.  
2. **Everything else unchanged**: The display loop works as before, because `target` and `target_names` in the returned subset refer only to the selected categories.

### Output:
This code will fetch the dataset with only the selected categories and display the **first two articles** and their corresponding topics. Time to explore these specific discussions! 🌌
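
To see why the lookup `target_names[target[i]]` keeps working on a subset, it helps to picture the filtering as two steps: drop documents outside the chosen categories, then renumber the remaining labels from 0. The plain-Python sketch below imitates that behaviour on made-up toy data. The helper `select_categories` is hypothetical and is an illustration only, not scikit-learn's actual implementation.

```python
def select_categories(data, target, target_names, categories):
    """Keep only documents in `categories` and renumber their labels from 0."""
    # Assumption: kept labels are ordered alphabetically, like the full label list
    kept_names = sorted(categories)
    new_index = {name: i for i, name in enumerate(kept_names)}

    subset_data, subset_target = [], []
    for text, label in zip(data, target):
        name = target_names[label]
        if name in new_index:
            subset_data.append(text)
            subset_target.append(new_index[name])
    return subset_data, subset_target, kept_names


# Hypothetical toy data, not taken from the real dataset
toy_names = ["alt.atheism", "comp.graphics", "sci.space", "talk.religion.misc"]
toy_data = ["doc a", "doc b", "doc c", "doc d", "doc e"]
toy_target = [3, 1, 0, 2, 3]

sub_data, sub_target, sub_names = select_categories(
    toy_data, toy_target, toy_names, ["talk.religion.misc", "alt.atheism"]
)
for text, label in zip(sub_data, sub_target):
    print(f"{sub_names[label]}: {text}")
```

Because the labels are renumbered, a label value from the subset should not be looked up in the full dataset's `target_names`, or the other way round.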

## Fetching the Third Article from Dataset

Well done, Stellar Navigator! Next, fill in the missing line in the code below to fetch and display the third article from the 20 Newsgroups dataset with its corresponding topic. Prepare your spacecraft for another adventure in data exploration!

```python
# Import necessary libraries
from sklearn.datasets import fetch_20newsgroups

# Fetch the dataset
newsgroups = fetch_20newsgroups(subset='all')

# TODO: Fetch the third article and its corresponding topic 

```

Here’s the completed code to fetch and display the third article along with its corresponding topic from the 20 Newsgroups dataset:

```python
# Import necessary libraries
from sklearn.datasets import fetch_20newsgroups

# Fetch the dataset
newsgroups = fetch_20newsgroups(subset='all')

# Fetch the third article and its corresponding topic
third_article = newsgroups.data[2]
third_topic = newsgroups.target_names[newsgroups.target[2]]

# Display the results
print(f'\nArticle 3:\n{third_article}\n')
print(f'Corresponding Topic:\n{third_topic}\n')
```

### Explanation:
1. **`newsgroups.data[2]`**: Access the third article (index starts at 0).  
2. **`newsgroups.target[2]`**: Get the topic index of the third article.  
3. **`newsgroups.target_names`**: Use the index to find the human-readable topic name.

This code will display the full content of the third article and its corresponding topic. Happy exploring, Data Voyager! 🚀

## Exploring Text Length in Newsgroups Dataset

Great job, Space Voyager! Now, as a final task, write a Python script that calculates and displays the lengths of the first five articles (in terms of the number of characters) from the 20 Newsgroups dataset.

```python
# Import necessary libraries and modules
from sklearn.datasets import fetch_20newsgroups

# TODO: Fetch the 20 Newsgroups dataset

# TODO: Iterate over the first five articles,
# TODO: Calculate their length in terms of the number of characters and display it
```

Here’s the Python script to calculate and display the lengths (in terms of characters) of the first five articles from the 20 Newsgroups dataset:

```python
# Import necessary libraries and modules
from sklearn.datasets import fetch_20newsgroups

# Fetch the 20 Newsgroups dataset
newsgroups = fetch_20newsgroups(subset='all')

# Iterate over the first five articles, calculate their length, and display it
for i in range(5):
    article_length = len(newsgroups.data[i])  # Calculate the length in characters
    print(f'Article {i+1} length: {article_length} characters\n')
```

### Explanation:
1. **`len(newsgroups.data[i])`**: This calculates the length of each article by counting the number of characters.
2. **`range(5)`**: Loops over the first five articles.
3. **Print statement**: Displays the length of each article in characters.

This script will output the length of the first five articles from the dataset. Enjoy your data exploration journey! 🚀
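
Once individual lengths are available, a natural follow-up is to summarise them. The sketch below is one possible way to do that, assuming a hypothetical helper `length_summary` and made-up toy texts; with the real dataset you could pass `newsgroups.data[:5]` or the whole list.

```python
from statistics import mean, median


def length_summary(texts):
    """Summarise document lengths, measured in characters."""
    lengths = [len(text) for text in texts]
    if not lengths:
        raise ValueError("need at least one document")
    return {
        "count": len(lengths),
        "min": min(lengths),
        "max": max(lengths),
        "mean": mean(lengths),
        "median": median(lengths),
    }


# Hypothetical toy texts with lengths 2, 4 and 6
toy_texts = ["ab", "abcd", "abcdef"]

summary = length_summary(toy_texts)
for key, value in summary.items():
    print(f"{key}: {value}")
```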