# Web Scraping and Text Cleaning: MLK Jr. "I Have a Dream" Speech (Beginner Project)

This project was built as part of my Python learning process.  
I used `requests`, `BeautifulSoup`, `re` (regular expressions), and `pandas` to scrape and analyze the full text of Martin Luther King Jr.'s "I Have a Dream" speech from a sample webpage.  

The goals of this project were to:
- Scrape the full text of the speech from the target webpage
- Clean and standardize the text using regular expressions
- Perform word frequency analysis
- Store the results in a CSV file for further analysis or visualization

This is a beginner project, not a production-grade scraper. It focuses on practicing text extraction, data cleaning, and basic natural language processing (NLP) techniques.

## Libraries

In [1]:
from bs4 import BeautifulSoup
import requests
import re
import pandas as pd

## Methodology

Approach:

- Use `requests` to send an HTTP GET request to the target webpage
- Parse the HTML using `BeautifulSoup`
- Extract paragraph text (`<p>` tags) and combine the content into a single string
- Perform initial text cleaning:
    - Replace line breaks (`\r\n`) with spaces
    - Replace escaped apostrophes (`\'`) with plain apostrophes
- Convert all text to lowercase
- Use a regular expression to remove punctuation while keeping apostrophes, so contractions and possessives such as "we're" and "God's" stay intact
- Tokenize the text into individual words by splitting on whitespace
- Build a Pandas DataFrame to store the words
- Perform word frequency analysis using `value_counts()`
- Save the cleaned word counts to a CSV file
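
The cleaning and tokenizing steps in the list above can be tried on a short string without touching the network. This is a minimal sketch: the sample sentence is made up rather than taken from the speech, and the final apostrophe-trimming step is an optional extra that the notebook does not perform. It shows that the pattern `[^\w\s']+` keeps every apostrophe, including ones used as quotation marks around a word.

```python
import re

# Hypothetical sample text, not taken from the speech
sample = "Let freedom ring!\r\nIt's the people's 'dream' -- isn't it?"

# Replace Windows line breaks with spaces, then lowercase
text = sample.replace("\r\n", " ").lower()

# Drop every character that is not a word character, whitespace, or apostrophe
text = re.sub(r"[^\w\s']+", "", text)

# Split on whitespace; runs of spaces left by removed punctuation are ignored
tokens = text.split()
print(tokens)

# Optional extra step (an assumption, not in the notebook):
# trim apostrophes at token edges so "'dream'" becomes "dream"
trimmed = [t.strip("'") for t in tokens]
trimmed = [t for t in trimmed if t]
print(trimmed)
```

Contractions such as `it's` and `isn't` survive both versions; only the quoted `'dream'` differs between `tokens` and `trimmed`.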

In [2]:
# Step 1: Scrape the webpage
url = 'http://www.analytictech.com/mb021/mlk.htm'
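page = requests.get(url, timeout=10)  # Give up if the server does not respond within 10 seconds
page.raise_for_status()  # Stop early if the request failed (4xx/5xx status)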
soup = BeautifulSoup(page.text, 'html.parser')

# Step 2: Extract paragraph text and combine
mlkjr_speech = soup.find_all('p')
speech_combined = [p.text.strip() for p in mlkjr_speech]
string_speech = ' '.join(speech_combined)

# Step 3: Initial cleaning
a_lil_clean = string_speech.replace('\r\n', ' ').replace(r"\'", "'")

# Optional: Quick checks (can be removed if desired)
print("we're" in a_lil_clean)  # Should be True
print("God's" in a_lil_clean)  # Should be True

# Step 4: Lowercase everything
cleaned = a_lil_clean.lower()

# Step 5: Remove punctuation but keep apostrophes (the pattern keeps every apostrophe, not only those inside words)
cleaned = re.sub(r"[^\w\s']+", '', cleaned)

# Step 6: Tokenize (split into words)
cleaned_words = cleaned.split()

# Step 7: Build DataFrame and count word frequencies
df = pd.DataFrame(cleaned_words, columns=['word'])
df = df.value_counts().reset_index(name='count')
df.columns = ['word', 'count']

# Step 8: Show top 10 most common words
print(df.head(10))

# Step 9: Export to CSV
df.to_csv(r'C:\Users\jrwie\OneDrive\Desktop\Data Stuffs\Analyst_Builder\Python\Exports\MLKjr_speech_scraped.csv', index=False)

True
True
      word  count
0      the     54
1       of     49
2       to     29
3      and     27
4        a     20
5       in     17
6       be     16
7     will     16
8  freedom     13
9       we     13
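
The paragraph-extraction step can also be exercised offline, which makes it easier to see what `find_all('p')` and `strip()` return before pointing the code at a live page. The sketch below assumes a small inline HTML string (hypothetical, standing in for the downloaded page) and adds one step the notebook does not have: collapsing whitespace runs inside the joined text.

```python
from bs4 import BeautifulSoup

# Hypothetical inline HTML standing in for the downloaded page
html = """
<html><body>
  <h1>Sample page</h1>
  <p>  First paragraph.  </p>
  <p>Second
  paragraph.</p>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# Same extraction as the notebook: text of each <p>, stripped, then joined with spaces
paragraphs = [p.text.strip() for p in soup.find_all("p")]
combined = " ".join(paragraphs)

# Extra step (an assumption, not in the notebook): collapse whitespace runs,
# including a bare "\n" that replace('\r\n', ' ') would leave in place
combined = " ".join(combined.split())
print(combined)
```

Because the later tokenizing step splits on any whitespace, the word counts are the same with or without this extra step; it only tidies the intermediate string.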


## Next Steps

Potential improvements:
- Perform deeper text cleaning (e.g., stopword removal)
- Visualize the word frequencies with a bar chart or word cloud
- Compare this speech's word frequencies with those of other famous speeches
- Explore more advanced NLP techniques (lemmatization, part-of-speech tagging)
- Build a reusable text analysis pipeline for other scraped texts
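
As a starting point for the stopword-removal idea above, here is a minimal sketch using only the standard library. The stopword set is deliberately tiny and hypothetical (it mirrors the common words in the top-10 output), the token list is made up to stand in for `cleaned_words`, and `count_content_words` is an illustrative helper name rather than part of the notebook.

```python
from collections import Counter

# Hypothetical, deliberately tiny stopword set; real lists are far longer
STOPWORDS = {"the", "of", "to", "and", "a", "in", "be", "will", "we"}


def count_content_words(tokens, stopwords=STOPWORDS):
    """Count tokens after dropping any that appear in the stopword set."""
    return Counter(t for t in tokens if t not in stopwords)


# Hypothetical tokens standing in for `cleaned_words` from the notebook
tokens = ["the", "dream", "of", "freedom", "and", "the", "dream", "will", "ring"]
counts = count_content_words(tokens)
print(counts.most_common(3))
```

On the notebook's DataFrame, the same filter could be written as `df[~df['word'].isin(STOPWORDS)]`, which keeps the existing `word` and `count` columns.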