# Course 5: Software: Keyword Extraction for Data Analysts

## Hypothetical Scenario
**Context**: Imagine that the Thai Tourism Authority wants to analyze social media posts to understand what locations or services are trending among travelers. Your task is to use keyword extraction to identify these trends.

### Section 1: Introduction to Automated Keyword Extraction
**Text Content**: Automated keyword extraction identifies the words or phrases in a text that best represent its main content. It lets analysts summarize large volumes of text and spot key themes without reading every document.

```python
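# Simple keyword extraction with RAKE (Rapid Automatic Keyword Extraction).
# rake_nltk relies on NLTK data; if it reports a missing resource, download it
# first, e.g. nltk.download("stopwords") and nltk.download("punkt").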
from rake_nltk import Rake

rake = Rake()
text = "The Grand Palace in Bangkok has received thousands of likes on social media this month."
rake.extract_keywords_from_text(text)

keyword_extracted = rake.get_ranked_phrases()
print(keyword_extracted)
```
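To show what happens behind `get_ranked_phrases()`, here is a minimal, standard-library sketch of RAKE-style scoring. It is a simplification, not the `rake_nltk` implementation: the `STOPWORDS` set is a tiny illustrative list (an assumption), and the helper names `candidate_phrases` and `rank_phrases` are hypothetical. Text is split into candidate phrases at stopwords and punctuation, and each phrase is scored as the sum of its words' degree-to-frequency ratios.

```python
import re
from collections import defaultdict

# Tiny illustrative stopword list (an assumption; rake_nltk uses NLTK's full list).
STOPWORDS = {"the", "in", "has", "of", "on", "this", "a", "an", "is", "are", "and", "to", "for"}

def candidate_phrases(text):
    """Split text into candidate phrases at punctuation and stopwords."""
    phrases = []
    for fragment in re.split(r"[^\w\s]+", text.lower()):
        current = []
        for word in fragment.split():
            if word in STOPWORDS:
                if current:
                    phrases.append(current)
                current = []
            else:
                current.append(word)
        if current:
            phrases.append(current)
    return phrases

def rank_phrases(text):
    """Score each phrase as the sum of degree/frequency over its words."""
    phrases = candidate_phrases(text)
    freq = defaultdict(int)
    degree = defaultdict(int)
    for phrase in phrases:
        for word in phrase:
            freq[word] += 1
            degree[word] += len(phrase)  # co-occurrences within the phrase, itself included
    scored = {" ".join(p): sum(degree[w] / freq[w] for w in p) for p in phrases}
    return sorted(scored.items(), key=lambda item: item[1], reverse=True)

ranked = rank_phrases(
    "The Grand Palace in Bangkok has received thousands of likes on social media this month."
)
for phrase, score in ranked:
    print(f"{score:.1f}  {phrase}")
```

Multi-word phrases such as "grand palace" score higher than single words because each of their words co-occurs with a neighbor, which is why RAKE tends to surface place names and compound terms.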

### Section 2: Using APIs for Named Entity Recognition (NER)
**Text Content**: NER is a subtask of information extraction that seeks to locate and classify named entities mentioned in text into predefined categories such as the names of persons, organizations, locations, expressions of times, quantities, monetary values, percentages, etc.

```python
# Python code for using an NER API
import requests

text = "Ayutthaya Historical Park is frequently mentioned in travelers' blogs."
payload = {'text': text}
# Placeholder URL: replace with the endpoint of the NER service you are using
response = requests.post("http://ner_api_endpoint", json=payload)

entities = response.json()
print(entities)
```
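To make "locate and classify" concrete, the sketch below assumes a hypothetical entity record with character offsets (`start`, `end`) and a category (`label`). Real NER services define their own response schemas, so treat these field names, the hand-assigned labels, and the helpers `make_entity` and `group_by_label` as illustrative assumptions.

```python
from collections import defaultdict

def make_entity(text, phrase, label):
    """Build a hypothetical entity record: character offsets plus a category label."""
    start = text.index(phrase)
    return {"start": start, "end": start + len(phrase), "label": label}

def group_by_label(text, entities):
    """Map each category to the entity strings located in the text."""
    grouped = defaultdict(list)
    for entity in entities:
        grouped[entity["label"]].append(text[entity["start"]:entity["end"]])
    return dict(grouped)

text = "The Grand Palace in Bangkok has received thousands of likes on social media this month."

# Hand-labelled entities standing in for an API response (not real model output).
entities = [
    make_entity(text, "Grand Palace", "LOCATION"),
    make_entity(text, "Bangkok", "LOCATION"),
    make_entity(text, "this month", "DATE"),
]

grouped = group_by_label(text, entities)
print(grouped)
```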

### Section 3: Exploring Zero-Shot Sentence Classification APIs
**Text Content**: Zero-shot classification assigns text to candidate labels that the model was never explicitly trained on. The labels are supplied at inference time, so no task-specific training data is needed.

```python
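# Zero-shot classification with the Hugging Face transformers pipeline.
# The pipeline runs a model locally; the default model is downloaded on first use.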
from transformers import pipeline

classifier = pipeline("zero-shot-classification")
classification = classifier(
    "Phi Phi Islands are a popular destination for tourists visiting Thailand.",
    candidate_labels=["travel", "food", "accommodation", "entertainment"],
)

print(classification)
```
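The zero-shot pipeline returns a dictionary with the keys `sequence`, `labels`, and `scores`. The sketch below shows one way to pick the best label and discard low-confidence predictions. The helper `top_label`, the `min_score` threshold, and the numbers in `sample_result` are assumptions for illustration; the scores are hand-written, not real model output.

```python
def top_label(result, min_score=0.5):
    """Return (label, score) for the best label, or None if it scores below min_score."""
    best_label, best_score = max(
        zip(result["labels"], result["scores"]), key=lambda pair: pair[1]
    )
    if best_score < min_score:
        return None
    return best_label, best_score

# Hand-written stand-in for a pipeline result; the scores are illustrative only.
sample_result = {
    "sequence": "Phi Phi Islands are a popular destination for tourists visiting Thailand.",
    "labels": ["travel", "entertainment", "accommodation", "food"],
    "scores": [0.90, 0.05, 0.03, 0.02],
}

print(top_label(sample_result))
```

Applying a threshold like this lets a pipeline route uncertain posts to manual review instead of forcing every post into a category.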

### Section 4: Question Answering with RAG - An Overview
**Text Content**: Retrieval-Augmented Generation (RAG) combines dense-vector retrieval of relevant documents with a generative model that answers questions using the retrieved text.

```python
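# Python code to demonstrate RAG usage for Q&A
# Note: the retriever also needs the `datasets` and `faiss` packages installed;
# use_dummy_dataset=True loads a small sample index instead of the full Wikipedia index.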
from transformers import RagTokenizer, RagRetriever, RagTokenForGeneration

tokenizer = RagTokenizer.from_pretrained("facebook/rag-token-nq")
retriever = RagRetriever.from_pretrained("facebook/rag-token-nq", index_name="exact", use_dummy_dataset=True)
model = RagTokenForGeneration.from_pretrained("facebook/rag-token-nq", retriever=retriever)

input_ids = tokenizer("What is the most visited place in Thailand?", return_tensors="pt").input_ids
outputs = model.generate(input_ids)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
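The retrieve-then-generate flow can be illustrated without downloading a model. The toy sketch below swaps RAG's learned dense embeddings for plain word-count vectors and stops at building the prompt a generator would receive, so it only demonstrates the retrieval step. All helper names (`embed`, `cosine`, `retrieve`, `build_prompt`) are hypothetical.

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy 'embedding': word counts (real RAG uses learned dense vectors)."""
    return Counter(re.findall(r"\w+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[term] * b[term] for term in a)  # Counter returns 0 for missing terms
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(question, documents, k=1):
    """Return the k documents most similar to the question."""
    query = embed(question)
    return sorted(documents, key=lambda doc: cosine(query, embed(doc)), reverse=True)[:k]

def build_prompt(question, passages):
    """Combine retrieved passages and the question into a generator prompt."""
    context = "\n".join(f"- {passage}" for passage in passages)
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"

documents = [
    "The Grand Palace in Bangkok has received thousands of likes on social media this month.",
    "Ayutthaya Historical Park is frequently mentioned in travelers' blogs.",
    "Phi Phi Islands are a popular destination for tourists visiting Thailand.",
]
question = "Which islands are popular with tourists?"

passages = retrieve(question, documents)
prompt = build_prompt(question, passages)
print(prompt)
```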

### Section 5: Setting up the API Environment
**Text Content**: Setting up an environment for API interaction involves getting API keys, installing required libraries, and ensuring proper endpoint configuration.

```python
# Python code for setting up an API environment
import os
from dotenv import load_dotenv
import requests

# Load .env file where the API_KEY is stored
load_dotenv()

API_KEY = os.getenv("API_KEY")

# Base URL of the API service
BASE_URL = "http://example_api_service.com/"

# Test the API with a simple GET request
response = requests.get(BASE_URL, headers={"Authorization": f"Bearer {API_KEY}"})
print(response.json())
```
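A missing key is a common setup failure, and it is easier to debug when the script stops with a clear message before any request is sent. The sketch below is one way to do that. The helper `build_headers` is hypothetical, and the `Bearer` scheme is carried over from the example above; your API may expect a different header.

```python
import os

def build_headers(env=None):
    """Build request headers, failing early if API_KEY is not configured."""
    env = os.environ if env is None else env
    api_key = env.get("API_KEY")
    if not api_key:
        raise RuntimeError("API_KEY is not set; add it to your .env file or shell environment.")
    return {"Authorization": f"Bearer {api_key}", "Accept": "application/json"}

# Demonstration with an explicit mapping instead of the real environment.
print(build_headers({"API_KEY": "demo-key"}))
```

In the setup script, the result can be passed as `requests.get(BASE_URL, headers=build_headers(), timeout=10)`, followed by `response.raise_for_status()` so that HTTP errors surface immediately instead of failing later in `response.json()`.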

### Section 6: Practical Exercise: Integrating APIs into a Data Pipeline
**Text Content**: In this exercise, we will integrate the NER API into a simple data pipeline to process a batch of social media posts.

```python
# Python code for integrating an API into a data pipeline
import json

import requests

# Sample batch of social media posts
posts = ["Chiang Mai's night markets are a must-see!", "Had an amazing street food tour in Bangkok."]

# Function to call the NER API for a batch of posts
def extract_entities(posts):
    entities_batch = []
    for post in posts:
        response = requests.post("http://ner_api_endpoint", json={'text': post})
        entities = response.json()
        entities_batch.append(entities)
    return entities_batch

entities_extracted = extract_entities(posts)
print(json.dumps(entities_extracted, indent=2, ensure_ascii=False))
```
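A pipeline usually has more stages than the API call itself. The sketch below adds a cleaning step and per-post error handling, and takes the extraction function as a parameter so it can be tested offline. `clean_post`, `run_pipeline`, and `fake_extract` are hypothetical names; `fake_extract` is a deliberately crude stand-in for the NER call, not a real entity recognizer.

```python
import re

def clean_post(post):
    """Remove URLs and collapse whitespace before sending text to the API."""
    without_urls = re.sub(r"https?://\S+", "", post)
    return re.sub(r"\s+", " ", without_urls).strip()

def run_pipeline(posts, extract):
    """Clean each post, call extract(text), and record failures alongside successes."""
    records = []
    for post in posts:
        text = clean_post(post)
        try:
            records.append({"text": text, "entities": extract(text), "error": None})
        except Exception as exc:  # one bad post should not stop the whole batch
            records.append({"text": text, "entities": [], "error": str(exc)})
    return records

def fake_extract(text):
    """Hypothetical stand-in for the NER call: treats capitalized words as entities."""
    if not text:
        raise ValueError("empty post")
    return re.findall(r"\b[A-Z][a-z]+\b", text)

records = run_pipeline(
    ["Had an amazing street food tour in Bangkok. https://example.com/tour", "   "],
    fake_extract,
)
for record in records:
    print(record)
```

To use the real service, pass a function that wraps the `requests.post` call from `extract_entities` in place of `fake_extract`.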

### Section 7: Customizing API Requests for Specific Use Cases
**Text Content**: Depending on the API's flexibility, we can customize requests to suit specific needs, such as setting the language for NER to Thai.

```python
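# Python code to customize API requests
import requests

# Thai sample text: "Wat Phra Kaew is a place everyone should visit."
# The 'language' field is illustrative; check which parameters your API accepts.
payload = {'text': "วัดพระแก้วเป็นสถานที่ที่ทุกคนควรไปเยือน", 'language': 'th'}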
response = requests.post("http://ner_api_endpoint", json=payload)

entities = response.json()
print(entities)
```


### Section 8: Handling API Responses and Extracted Keywords
**Text Content**: Once we receive a response from the API, we need to parse the results and extract the relevant keywords.

```python
# Python code for handling API responses
import requests

response = requests.post("http://ner_api_endpoint", json={'text': "Visiting the ruins of Sukhothai is a journey back in time."})
results = response.json()

# Extract and print the keywords.
# Assumes the response contains an 'entities' list whose items have a 'word' field.
keywords = [entity['word'] for entity in results['entities']]
print(keywords)
```
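Responses do not always match the expected shape, so a more defensive parser can be useful. The sketch below tolerates a missing `entities` list or missing `word` fields and optionally filters by confidence. The `score` and `label` fields and the helper name `extract_keywords` are assumptions; check your API's documentation for the actual schema.

```python
def extract_keywords(results, min_score=0.0):
    """Pull entity text out of a response dict, tolerating missing fields."""
    keywords = []
    for entity in results.get("entities", []):
        word = entity.get("word")
        # Entities without a score are kept; scored ones must reach min_score.
        if word and entity.get("score", 1.0) >= min_score:
            keywords.append(word)
    return keywords

# Hypothetical response used for illustration (not real API output).
sample_response = {
    "entities": [
        {"word": "Sukhothai", "label": "LOCATION", "score": 0.98},
        {"label": "MISC", "score": 0.40},  # malformed entry without a 'word'
    ]
}

print(extract_keywords(sample_response))
print(extract_keywords({}))
```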


### Section 9: Visualizing API Data for Better Insights
**Text Content**: Visualization helps in understanding the frequency and distribution of keywords across multiple texts.

```python
# Python code for visualizing API data using Matplotlib
import matplotlib.pyplot as plt

keywords = ['Bangkok', 'Chiang Mai', 'Sukhothai', 'Phuket', 'Samui']
frequencies = [50, 30, 20, 40, 10]

plt.bar(keywords, frequencies)
plt.xlabel('Keyword')
plt.ylabel('Frequency')
plt.title('Keyword Frequency in Social Media Posts')
plt.show()
```
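In practice the frequencies would be computed from extracted keywords instead of typed by hand. A minimal sketch using `collections.Counter` follows; the helper `keyword_frequencies` is hypothetical and the per-post keyword lists are invented for illustration.

```python
from collections import Counter

def keyword_frequencies(keyword_lists, top_n=5):
    """Count keywords across posts and return parallel lists of labels and counts."""
    counts = Counter(keyword for keywords in keyword_lists for keyword in keywords)
    top = counts.most_common(top_n)
    return [keyword for keyword, _ in top], [count for _, count in top]

# Illustrative per-post keyword lists (not real data).
keyword_lists = [
    ["Bangkok", "Chiang Mai"],
    ["Bangkok"],
    ["Phuket", "Bangkok", "Chiang Mai"],
]

labels, counts = keyword_frequencies(keyword_lists)
print(labels, counts)
```

The two returned lists can be passed directly to `plt.bar(labels, counts)`.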


### Section 10: Case Study: Keyword Extraction for Content Strategy
**Text Content**: In this case study, we'll use keyword extraction to inform the content strategy of a tourism blog focusing on Thailand.

```python
# Python code to extract keywords for content strategy
# Using the RAKE algorithm
from rake_nltk import Rake

rake = Rake()
blog_posts = ["10 Things to do in Bangkok", "A Guide to Street Food in Chiang Mai", "Finding Peace in the Temples of Sukhothai"]
keywords = []

for post in blog_posts:
    rake.extract_keywords_from_text(post)
    post_keywords = rake.get_ranked_phrases()
    keywords.append(post_keywords)

print(keywords)
```

### Section 11: Project: Building a Keyword Extraction Tool
**Text Content**: As a project, we will build a tool that extracts keywords from text using the APIs we've explored.

```python
# Python code for building a keyword extraction tool
# This section will guide the students through creating a simple CLI tool that uses the previously discussed APIs.
```
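As a possible starting point for the project, here is a minimal CLI skeleton built on `argparse`. It is a sketch under stated assumptions: `extract_keywords` is a placeholder that ranks words by frequency, and students would replace it with a call to RAKE or the NER API. The `STOPWORDS` set and the `--top` option are illustrative choices.

```python
import argparse
import re
from collections import Counter

# Small illustrative stopword list (an assumption).
STOPWORDS = {"the", "a", "an", "and", "in", "of", "to", "is", "are", "for", "on"}

def extract_keywords(text, top_n):
    """Placeholder extractor: the most frequent non-stopword tokens."""
    words = [w for w in re.findall(r"\w+", text.lower()) if w not in STOPWORDS]
    return [word for word, _ in Counter(words).most_common(top_n)]

def main(argv=None):
    """Parse command-line arguments and print one keyword per line."""
    parser = argparse.ArgumentParser(description="Extract keywords from text.")
    parser.add_argument("text", help="text to analyze")
    parser.add_argument("--top", type=int, default=5, help="number of keywords to print")
    args = parser.parse_args(argv)
    for keyword in extract_keywords(args.text, args.top):
        print(keyword)
    return 0

# To run as a script, add at the bottom of the file:
#     if __name__ == "__main__":
#         raise SystemExit(main())
```

Saved under a hypothetical name such as `keywords.py` with the entry point added, it could be invoked as `python keywords.py --top 3 "Bangkok street food and Bangkok temples"`.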

### Section 12: Evaluating API Performance and Accuracy
**Text Content**: It's crucial to evaluate the performance and accuracy of the APIs to ensure they meet our analysis needs.

```python
# Python code for evaluating API performance
# The section will include methods to measure API response times and accuracy using a test dataset.
```
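One common way to measure accuracy is to compare the extracted keywords for a text against a hand-labelled set using precision, recall, and F1, and to measure latency with `time.perf_counter`. The sketch below assumes exact string matching between predicted and expected keywords; the helper names and the sample lists are illustrative.

```python
import time

def precision_recall_f1(predicted, expected):
    """Compare predicted keywords with a hand-labelled set using exact matching."""
    predicted, expected = set(predicted), set(expected)
    true_positives = len(predicted & expected)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(expected) if expected else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def timed(func, *args, **kwargs):
    """Call func and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    return result, time.perf_counter() - start

# Illustrative labels (not real evaluation data).
predicted = ["Bangkok", "Chiang Mai", "Thailand"]
expected = ["Bangkok", "Chiang Mai", "Phuket", "Sukhothai"]

precision, recall, f1 = precision_recall_f1(predicted, expected)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")

result, seconds = timed(sorted, predicted)
print(f"sorted {len(result)} items in {seconds:.6f}s")
```

Wrapping the API call in `timed` over a test set gives a distribution of response times, which is more informative than a single measurement.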

### Section 13: Scaling API Calls for Large Datasets
**Text Content**: When processing large datasets, we must respect API rate limits and send requests efficiently to avoid being throttled or blocked.

```python
# Python code for scaling API calls
# The section will discuss best practices for batch processing and rate limiting.
```
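A minimal sketch of these practices is shown below: posts are split into batches, each batch is retried with exponential backoff if the call fails, and a fixed pause separates batches. The batch size, delays, and retry count are placeholder values, and `send_batch` is a hypothetical function that would wrap the actual API request. The `sleep` parameter is injectable so the logic can be tested without waiting.

```python
import time

def batched(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

def call_with_backoff(func, arg, max_retries=3, base_delay=1.0, sleep=time.sleep):
    """Call func(arg), retrying with exponentially growing delays on failure."""
    for attempt in range(max_retries + 1):
        try:
            return func(arg)
        except Exception:
            if attempt == max_retries:
                raise
            sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ... with the defaults

def process_all(posts, send_batch, batch_size=10, pause=1.0, sleep=time.sleep):
    """Send posts in batches, pausing between batches to stay under the rate limit."""
    results = []
    for batch in batched(posts, batch_size):
        results.extend(call_with_backoff(send_batch, batch, sleep=sleep))
        sleep(pause)
    return results

# Offline demonstration with a stand-in for the API call and no real waiting.
demo = process_all(
    ["post one", "post two", "post three"],
    send_batch=lambda batch: [post.upper() for post in batch],
    batch_size=2,
    sleep=lambda seconds: None,
)
print(demo)
```

If the API returns a `Retry-After` header or documents its limits, those values should take precedence over the fixed delays used here.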

### Section 14: Course Recap and Real-world Applications
**Text Content**: This section recaps the course and discusses how to apply these skills in real-world scenarios, like optimizing content for search engines or understanding customer feedback.

```python
# Python code for recap and application
# A summary of Python code snippets and techniques learned throughout the course.
```

### Section 15: Final Assignment and Course Evaluation
**Text Content**: The final assignment will involve a comprehensive keyword analysis of a new dataset, and the course evaluation will gather feedback on the learning experience.

```python
# Python code for final assignment
# Instructions and starter code for the final assignment.
```