# Project 2 - Text Data Cleaning and Preprocessing for Weather-Related X Feeds

### DSCI 614

Sean Kilfoy

## Scenario

You are a data scientist working for the Department of Transportation. You have built a road condition dashboard, and your manager wants it to draw on more text data. To that end, your manager has you monitor the X feeds and collect the latest 200 tweets about the weather.

## Introduction

In this project, we clean and preprocess text data from weather-related search results and X feeds. We develop a streamlined approach to extract, clean, and prepare textual data for further analysis and monitoring. We collect the text with web scraping and use regular expressions (regex) to remove irrelevant information. This preprocessing keeps the data that feeds our road condition dashboard, and other analytical applications within the Department of Transportation (DoT), clean and relevant.

## Winter Snowstorm Google Search

We'll perform the Google search and extract the necessary information with the web-scraping Python library, `googlesearch-python`.

In [23]:
%%capture
!pip install googlesearch-python wordcloud spacy
from googlesearch import search
import pandas as pd

We'll set `num_results` to 100 to request up to 100 results. With `advanced=True`, each result exposes a URL, title, and description, which we collect into a data frame.

In [26]:
query = "Winter snowstorm"

num_results = 100

search_results = list(
    search(
        term=query,
        num_results=num_results,
        sleep_interval=5,
        lang="en",
        advanced=True
    )
)

data = []
for result in search_results:
    url = result.url
    title = result.title
    description = result.description
    result_dict = {
        'URL': url,
        'Title': title,
        'Description': description
    }
    data.append(result_dict)

results_df = pd.DataFrame(data)
results_df.head()

|   | URL | Title | Description |
|---|-----|-------|-------------|
| 0 | https://www.weather.gov/fgz/WinterStorms | Winter Storms and Blizzards | Blizzards are dangerous winter storms that are... |
| 1 | https://www.nssl.noaa.gov/education/svrwx101/w... | Severe Weather 101: Winter Weather Types | A winter storm is a combination of heavy snow,... |
| 2 | https://www.weather.gov/safety/winter-snow | Snow Storm Safety | Blizzard: Sustained winds or frequent gusts of... |
| 3 | https://www.redcross.org/get-help/how-to-prepa... | Winter Storm Preparedness & Blizzard Safety | Each winter, hundreds are injured or killed by... |
| 4 | https://scied.ucar.edu/learning-zone/storms/wi... | Winter Storms - UCAR Center for Science Education | Snowstorms are one type of winter storm. Blizz... |


## Text Data Preprocessing

In this section, we will clean and refine the extracted search results to ensure the data is suitable for analysis. We'll use regex to remove unwanted elements, such as dates, URLs, short words, specific stop words, and special characters. This helps maintain the quality and relevance of our textual data, and enhances the accuracy and effectiveness of our road condition dashboard.

We'll perform the following steps:

1. **Concatenate the URL, title, and description:** Combine these fields into a single column to facilitate text processing.
2. **Remove dates and times:** Use regex to eliminate date and time mentions, which are irrelevant for our analysis.
3. **Remove hyperlinks:** Strip out any URLs to focus on the descriptive content.
4. **Remove short words:** Exclude words with two or fewer characters as they typically add little value.
5. **Remove specific stop words:** Eliminate a fixed list of stop words (`are`, `but`, `very`, `since`, `could`) that add little meaning to the text.
6. **Remove special characters and punctuation:** Clean the text by removing non-alphanumeric characters.
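
Steps 2 through 6 above can be sketched as a single function applied to one plain string, which makes the effect of each pattern easier to check in isolation. This is a minimal illustration, not the notebook's own cell: the function name `clean_text`, the constant names, and the sample sentence (including the `example.com` URL) are hypothetical, and the date pattern here assumes that all twelve months should be matched.

```python
import re

# Month names, abbreviated or full (illustrative; covers all twelve months).
MONTHS = (
    r"Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|"
    r"Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?"
)

# Step 2: dates such as "3rd December 2023" and times such as "5:30pm".
DATE_TIME = re.compile(
    r"\b\d{1,2}(?:st|nd|rd|th)?\s+(?:" + MONTHS + r")\s+\d{4}\b"
    r"|\b\d{1,2}:\d{2}(?:am|pm)?\b"
)
URL = re.compile(r"https?://\S+|www\.\S+")                  # Step 3
SHORT_WORD = re.compile(r"\b\w{1,2}\b")                     # Step 4
STOP_WORD = re.compile(r"\b(?:are|but|very|since|could)\b")  # Step 5
SPECIAL = re.compile(r"[^\w\s]")                            # Step 6


def clean_text(text: str) -> str:
    """Apply the five removal patterns in the order listed in the steps."""
    for pattern in (DATE_TIME, URL, SHORT_WORD, STOP_WORD, SPECIAL):
        text = pattern.sub("", text)
    return text


# Hypothetical sample, not taken from the search results.
sample = "https://example.com/snow Roads are icy since 5:30pm on 3rd December 2023!"
cleaned = clean_text(sample)
print(cleaned.split())
```

The order matters: dates and URLs are removed first, while their punctuation (`:`, `/`, `.`) is still present for the patterns to match.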


In [28]:
import re

In [29]:
# Concatenate the URL, title, and description
results_df['search_result'] = results_df['URL'] + " " + results_df['Title'] + " " + results_df['Description']

# Remove dates and times (day + month name + year, or hh:mm with optional am/pm)
date_time_pattern = r'\b(?:\d{1,2}(?:st|nd|rd|th)?\s+(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?)\s+\d{4})\b|\b\d{1,2}:\d{2}(?:am|pm)?\b'
results_df['search_result'] = results_df['search_result'].apply(lambda x: re.sub(date_time_pattern, '', x))

# Remove hyperlinks
url_pattern = r'https?://\S+|www\.\S+'
results_df['search_result'] = results_df['search_result'].apply(lambda x: re.sub(url_pattern, '', x))

# Remove all words containing at most two characters
short_words_pattern = r'\b\w{1,2}\b'
results_df['search_result'] = results_df['search_result'].apply(lambda x: re.sub(short_words_pattern, '', x))

# Remove specific stop words
stop_words = ["are", "but", "very", "since", "could"]
stop_words_pattern = r'\b(?:' + '|'.join(stop_words) + r')\b'
results_df['search_result'] = results_df['search_result'].apply(lambda x: re.sub(stop_words_pattern, '', x))

# Remove all special characters and punctuation
special_chars_pattern = r'[^\w\s]'
results_df['search_result'] = results_df['search_result'].apply(lambda x: re.sub(special_chars_pattern, '', x))
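
The stop-word pattern in the cell above is case-sensitive, so a capitalized "Are" or "Very" at the start of a title would survive. If that is not the intended behavior, one possible variant is sketched below. It is an assumption about what might be wanted, not part of the original pipeline; the sample sentence is made up for illustration.

```python
import re

stop_words = ["are", "but", "very", "since", "could"]

# re.escape guards against regex metacharacters if the list ever grows;
# re.IGNORECASE makes "Are" and "are" match alike.
stop_words_pattern = re.compile(
    r"\b(?:" + "|".join(re.escape(w) for w in stop_words) + r")\b",
    flags=re.IGNORECASE,
)

# Hypothetical sample sentence.
text = "Are roads icy? Very, but plows could help."
without_stop_words = stop_words_pattern.sub("", text)
print(without_stop_words.split())
```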

In [30]:
results_df[['search_result']].head()

|   | search_result |
|---|---------------|
| 0 | Winter Storms and Blizzards Blizzards danger... |
| 1 | Severe Weather 101 Winter Weather Types wint... |
| 2 | Snow Storm Safety Blizzard Sustained winds f... |
| 3 | Winter Storm Preparedness Blizzard Safety Ea... |
| 4 | Winter Storms UCAR Center for Science Educat... |
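
Because every step replaces a match with an empty string, the cleaned text can keep leading spaces and runs of several spaces where words were removed. A small follow-up step, sketched here under the assumption that single-spaced text is preferred downstream, collapses them. The helper name `collapse_whitespace` is hypothetical.

```python
import re


def collapse_whitespace(text: str) -> str:
    """Replace each run of whitespace with one space and trim both ends."""
    return re.sub(r"\s+", " ", text).strip()


# Made-up input with the kind of gaps the removal steps leave behind.
tidy = collapse_whitespace("  Winter Storms   Blizzards  danger  ")
print(tidy)
```

In the notebook this could be applied the same way as the other steps, for example with `results_df['search_result'].apply(collapse_whitespace)`.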


In [71]:
results_df.head()

|   | URL | Title | Description | search_result |
|---|-----|-------|-------------|---------------|
| 0 | https://www.weather.gov/fgz/WinterStorms | Winter Storms and Blizzards | Blizzards are dangerous winter storms that are... | Winter Storms and Blizzards Blizzards danger... |
| 1 | https://www.nssl.noaa.gov/education/svrwx101/w... | Severe Weather 101: Winter Weather Types | A winter storm is a combination of heavy snow,... | Severe Weather 101 Winter Weather Types wint... |
| 2 | https://www.weather.gov/safety/winter-snow | Snow Storm Safety | Blizzard: Sustained winds or frequent gusts of... | Snow Storm Safety Blizzard Sustained winds f... |
| 3 | https://www.redcross.org/get-help/how-to-prepa... | Winter Storm Preparedness & Blizzard Safety | Each winter, hundreds are injured or killed by... | Winter Storm Preparedness Blizzard Safety Ea... |
| 4 | https://scied.ucar.edu/learning-zone/storms/wi... | Winter Storms - UCAR Center for Science Education | Snowstorms are one type of winter storm. Blizz... | Winter Storms UCAR Center for Science Educat... |


## Real-Time Extreme Weather Search Dashboard




In [1]:
%%capture
!pip install googlesearch-python streamlit pandas

In [12]:
%%writefile dashboard.py

import time
from datetime import datetime

import pandas as pd
import streamlit as st
from googlesearch import search
from requests.exceptions import HTTPError

# Function to perform Google search and return results in a DataFrame
def google_search(query, num_results=5, max_retries=5):
    search_results = []
    retry_count = 0
    sleep_interval = 2

    while len(search_results) < num_results and retry_count < max_retries:
        try:
            results = list(
                search(
                    term=query,
                    num_results=num_results,
                    sleep_interval=sleep_interval,
                    lang="en",
                    advanced=True
                )
            )
            search_results.extend(results)
            break  # Exit the loop if successful
        except HTTPError as e:
            if e.response.status_code == 429:
                retry_count += 1
                sleep_interval *= 2  # Exponential backoff
                time.sleep(sleep_interval)
            else:
                raise e

    data = []
    retrieval_time = datetime.now().strftime('%Y-%m-%d %H:%M:%S')
    for result in search_results[:num_results]:
        url = result.url
        title = result.title
        description = result.description  # Use full description from the search result
        result_dict = {
            'URL': url,
            'Title': title,
            'Description': description,
            'Retrieval Time': retrieval_time
        }
        data.append(result_dict)

    return pd.DataFrame(data)

# Streamlit application
st.set_page_config(layout="wide", initial_sidebar_state="collapsed")
st.title("Real-Time Extreme Weather Search Dashboard")

# Sidebar control panel
st.sidebar.title("Control Panel")
refresh_interval = st.sidebar.slider("Update Interval (seconds)", min_value=1, max_value=300, value=10)
num_results = st.sidebar.slider("Number of Top Results", min_value=1, max_value=10, value=5)

# Initialize session state for queries and saved articles
if 'queries' not in st.session_state:
    st.session_state.queries = ["Hurricane", "Winter snowstorm"]
if 'saved_articles' not in st.session_state:
    st.session_state.saved_articles = []

# Function to add a new search query
def add_search():
    st.session_state.queries.append("New Query")
    st.rerun()  # Immediately refresh the app to reflect changes

# Function to remove a search query.
# Used as an on_click callback: Streamlit reruns the script automatically
# after a callback returns, and st.rerun() inside a callback is a no-op.
def remove_search(index):
    if len(st.session_state.queries) > 1:
        st.session_state.queries.pop(index)

# Function to save an article (also used as an on_click callback)
def save_article(url, title, description):
    st.session_state.saved_articles.append({
        'Title': title,
        'URL': url,
        'Description': description
    })

# Display input fields for each query with remove button
for i, query in enumerate(st.session_state.queries):
    cols = st.sidebar.columns([5, 1])
    query_input = cols[0].text_input(f"Search Query {i+1}", value=query, key=f"query_{i}")
    cols[1].button("–", key=f"remove_{i}", on_click=remove_search, args=(i,))
    st.session_state.queries[i] = query_input

# Add button for new search
st.sidebar.markdown(
    """
    <style>
    .full-width-button > div > button {
        width: 100%;
        font-size: 24px;
        font-weight: bold;
    }
    </style>
    """,
    unsafe_allow_html=True
)

if st.sidebar.button("＋", key="add", help="Add New Search Query"):
    add_search()

# Main content: Query results
st.header("Query Results")
placeholders = [st.empty() for _ in st.session_state.queries]

def update_dashboard():
    for i, query in enumerate(st.session_state.queries):
        results = google_search(query, num_results=num_results)
        with placeholders[i].container():
            st.subheader(f"Top {num_results} Results for '{query}'")
            for idx, row in results.iterrows():
                result_key = f"{query}_{idx}_{int(time.time())}"
                title_col, save_col = st.columns([5, 1])
                title_col.markdown(
                    f"""
                    <a href="{row['URL']}" target="_blank">
                        <button style="width: 100%; font-size: 20px; font-weight: bold;">
                            {row['Title']}
                        </button>
                    </a>
                    """,
                    unsafe_allow_html=True
                )
                save_col.button("＋", key=f"save_{result_key}", on_click=save_article, args=(row['URL'], row['Title'], row['Description']))
                st.markdown(f"<div style='text-align: justify;'>{row['Description']}</div>", unsafe_allow_html=True)

# Sidebar: Saved articles
st.sidebar.subheader("Saved Articles")
for article in st.session_state.saved_articles:
    st.sidebar.markdown(f"**[{article['Title']}]({article['URL']})**")
    st.sidebar.markdown(f"{article['Description']}")

# Main loop to refresh the dashboard at the specified interval.
# The first pass runs immediately, so no separate initial call is needed;
# a second back-to-back call would double the requests and could reuse
# the same timestamp-based button keys within one script run.
while True:
    update_dashboard()
    time.sleep(refresh_interval)


Overwriting dashboard.py
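
The retry logic in `google_search` doubles the wait after each HTTP 429 response. The same idea can be sketched in isolation, without any network access, to make the timing easy to verify. Everything named here (`RateLimited`, `call_with_backoff`, `flaky_fetch`) is a hypothetical stand-in rather than part of `googlesearch-python` or `requests`. Unlike the dashboard code, this sketch sleeps before doubling the delay and re-raises once the retries are exhausted instead of returning an empty result.

```python
import time


class RateLimited(Exception):
    """Hypothetical stand-in for an HTTP 429 (Too Many Requests) error."""


def call_with_backoff(fetch, max_retries=5, base_delay=2.0, sleep=time.sleep):
    """Call fetch(); on RateLimited, wait and retry with a doubling delay."""
    delay = base_delay
    for attempt in range(max_retries + 1):
        try:
            return fetch()
        except RateLimited:
            if attempt == max_retries:
                raise  # Give up after the final retry
            sleep(delay)
            delay *= 2  # Exponential backoff


# Demo: a fake fetch that is rate-limited twice, then succeeds.
attempts = []
waits = []  # Records the requested delays instead of actually sleeping


def flaky_fetch():
    attempts.append(1)
    if len(attempts) < 3:
        raise RateLimited()
    return ["result"]


result = call_with_backoff(flaky_fetch, sleep=waits.append)
print(result, waits)
```

Passing `sleep` as a parameter keeps the demo instant; in real use the default `time.sleep` would apply.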


In [13]:
# Run the Streamlit app
!streamlit run dashboard.py

  You can now view your Streamlit app in your browser.

  Local URL: http://localhost:8502
  Network URL: http://192.168.50.5:8502

^C
  Stopping...
