# 🧠 NLP Project: Flood Information Extraction from FloodList Articles (UK)

## 🌊 Objective:
To automatically extract flood-related information (rainfall, water level, storm names, and locations) from FloodList news articles, and to identify whether “flash floods” are mentioned.
The project also sorts the articles into four groups based on whether they mention flash floods and whether they contain quantitative data, which helps identify the most relevant and data-rich articles for further analysis.

## 🧩 Workflow

1. Load and inspect the dataset
    Imported the CSV file containing full news texts.
    Checked for missing or misnamed columns and confirmed the presence of "Full Text".

2. Text preprocessing
    Removed punctuation (Python's string.punctuation) and English stopwords (NLTK).
    Handled non-string entries safely (NaNs become empty strings).
    Tokenized each article into a list of words.

3. Detect flash-flood mentions
    Scanned each tokenized article for the token 'flash' immediately followed by a token starting with 'flood' (so 'flash floods' and 'flash flooding' also count).
    Created a binary column Flash_Flood_Mentioned (1 = present, 0 = absent).

4. Information extraction with spaCy and regex
    Used regular expressions to capture:
        Rainfall values (e.g., “45 mm”, “20 millimetres”)
        Water-level values (e.g., “3.5 m”, “2.8 metres”)
        Storm names (e.g., “Storm Dennis”)
    Applied spaCy NER to extract geographic locations (GPE entities).
    Stored extracted values in new columns: Rainfall_mm, WaterLevel_m, StormName, Location.

5. Feature flags and grouping
    Added flags has_rainfall, has_waterlevel, and has_any_numeric.
    Combined these with Flash_Flood_Mentioned to form four groups:
        FlashFlood + Numerics
        FlashFlood (no numerics)
        OtherFlood + Numerics
        OtherFlood (no numerics)

6. Priority grouping and insights
    Classified each article into the four groups for interpretability.
    Highlighted the “FlashFlood + Numerics” subset as high-priority articles
    (most informative for hydrological research or event validation).
    Extracted top-mentioned locations from this subset.

7. Export results of classification and summary

In [52]:
# ====================================================
# STEP 1: INSTALL & IMPORT MODULES
# ====================================================

#  Install required packages (run once per new environment)
# Uncomment these if you get "ModuleNotFoundError"
# !pip install pandas numpy matplotlib nltk spacy tqdm

#  Download additional resources (only first time)
import nltk
nltk.download('stopwords')

#  Import libraries
import pandas as pd
import numpy as np
import re
import string
import matplotlib.pyplot as plt
from tqdm import tqdm
from nltk.corpus import stopwords
import spacy

#  Load SpaCy language model (download if not installed)
# !python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

print("✅ All modules loaded and ready!")


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\taran\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


✅ All modules loaded and ready!


In [53]:
# ====================================================
# STEP 2: LOAD THE DATA
# ====================================================

# File path (keep your CSV in the same folder or give full path)
uk_article_file_csv = 'uk_flood_articles_80.csv'

# Load the CSV
df = pd.read_csv(uk_article_file_csv)

# Quick basic check
print("✅ File loaded successfully.")
print("Number of articles:", len(df))
print("Column names:", list(df.columns))

# # Optional quick peek
# print(df.head(3))


✅ File loaded successfully.
Number of articles: 80
Column names: ['Title', 'Date', 'Full Text', 'Link']
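
The workflow mentions checking for missing or misnamed columns, while the cell above only prints the column names. A minimal sketch of an explicit check follows. The helper name `find_text_column` and its matching rule (ignore case, underscores, and extra spaces) are assumptions for illustration, not part of the original notebook.

```python
def find_text_column(columns, wanted="Full Text"):
    """Return the actual column name matching `wanted`.

    Hypothetical helper: matching ignores case, underscores and extra
    whitespace, and a KeyError is raised when nothing matches.
    """
    def normalise(name):
        # "full_text " and "Full Text" both become "full text"
        return " ".join(str(name).replace("_", " ").split()).lower()

    target = normalise(wanted)
    for col in columns:
        if normalise(col) == target:
            return col
    raise KeyError(f"No column resembling {wanted!r} in {list(columns)}")


# Column names as printed above
columns = ['Title', 'Date', 'Full Text', 'Link']
print(find_text_column(columns))  # Full Text
```

With pandas, this could be called as `find_text_column(df.columns)` right after loading the CSV, so a renamed column fails early with a readable message.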


In [54]:
# ====================================================
# STEP 3: CLEAN TEXT, REMOVE STOPWORDS, TOKENIZE
# ====================================================

# Prepare stopwords
stop_words = set(stopwords.words('english'))

# Convert text column to list
full_text_list = df['Full Text'].tolist()
row_count = len(full_text_list)
print("Number of text rows:", row_count)

# Process text: remove punctuation + stopwords
cleaned_list = []

for text in full_text_list:
    if isinstance(text, str):  # make sure it's a string
        # Step 1: remove punctuation
        nopunc = ''.join([char for char in text if char not in string.punctuation])

        # Step 2: remove stopwords
        clean_words = [word for word in nopunc.split() if word.lower() not in stop_words]

        # Step 3: join cleaned words back
        cleaned_list.append(' '.join(clean_words))
    else:
        cleaned_list.append('')  # if not string, just keep empty

# Tokenize
tokenized_article_list = [article.split() for article in cleaned_list]

print("✅ Text cleaned and tokenized.")
print("Example tokens from first article:")
print(tokenized_article_list[0][:30])

Number of text rows: 80
✅ Text cleaned and tokenized.
Example tokens from first article:
['Parts', 'United', 'Kingdom', 'continue', 'grapple', 'widespread', 'flooding', 'stemming', 'passage', 'Storm', 'Babet', 'Authorities', 'confirmed', 'grim', 'toll', 'least', 'four', 'fatalities', 'linked', 'storm', 'swept', 'nation', 'recent', 'days', 'Hundreds', 'people', 'evacuated', 'homes', 'parts', 'Scotland']
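
One side effect of deleting punctuation outright is that hyphenated words are glued together: "flash-flood" becomes "flashflood", which a later check for the token 'flash' followed by 'flood...' would miss. The sketch below shows a variant that replaces punctuation with spaces instead. The function name `clean_and_tokenize` and the small `DEMO_STOPWORDS` set are assumptions so the example runs without NLTK data. The notebook itself uses NLTK's English stopword list.

```python
import string

# Small stand-in stopword set so the example runs without NLTK data (assumption);
# the notebook itself uses stopwords.words('english').
DEMO_STOPWORDS = {"the", "a", "an", "of", "to", "in", "and", "was", "by"}

# Map every ASCII punctuation character to a space instead of deleting it
PUNCT_TO_SPACE = str.maketrans({ch: " " for ch in string.punctuation})


def clean_and_tokenize(text, stop_words=DEMO_STOPWORDS):
    """Return a list of tokens with punctuation and stopwords removed."""
    if not isinstance(text, str):
        return []  # NaN or other non-string entries become empty token lists
    words = text.translate(PUNCT_TO_SPACE).split()
    return [w for w in words if w.lower() not in stop_words]


print(clean_and_tokenize("Flash-flooding hit parts of the town."))
# ['Flash', 'flooding', 'hit', 'parts', 'town']
```

The trade-off is that apostrophes also become spaces, so a word such as "don't" splits into two tokens. Whether that matters depends on what the tokens are used for downstream.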


In [55]:
# ====================================================
# STEP 4: DETECT 'FLASH FLOOD' MENTIONS
# ====================================================

"""
An article is marked as mentioning a flash flood when, in its cleaned
token list, the token 'flash' is immediately followed by a token that
starts with 'flood' (e.g. 'flood', 'floods', 'flooding', 'flooded').
"""

df['Flash_Flood_Mentioned'] = [
    1 if any(
        article[i].lower() == 'flash'
        and i + 1 < len(article)
        and article[i + 1].lower().startswith('flood')
        for i in range(len(article))
    )
    else 0
    for article in tokenized_article_list
]

# Preview first few results
print("✅ Flash_Flood_Mentioned column created.")
print(df[['Full Text', 'Flash_Flood_Mentioned']].head(10))


✅ Flash_Flood_Mentioned column created.
                                           Full Text  Flash_Flood_Mentioned
0  Parts of the United Kingdom continue to grappl...                      0
1  Storms and heavy rain brought flash flooding t...                      1
2  In the United Kingdom, intense downpours excee...                      1
3  Thousands of trees are to be planted as part o...                      0
4  England may be set to flood at the end of wint...                      0
5  Police in UK report that one person is missing...                      1
6  Hundreds of homes have been flooded in England...                      1
7  Thunderstorms affected parts of western Europe...                      1
8  Heavy rainfall in eastern England, UK on 09 Ju...                      1
9  More than 300,000 homes in England are now bet...                      0
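
As an alternative to scanning token lists, the same check can be sketched as a single regular expression over the raw text, which also covers the hyphenated spelling. The names `FLASH_FLOOD_RE` and `mentions_flash_flood` are hypothetical, and the sample sentences are invented for illustration.

```python
import re

# 'flash', then optional whitespace or hyphens, then a word starting with 'flood'
FLASH_FLOOD_RE = re.compile(r"\bflash[\s-]*flood\w*", re.IGNORECASE)


def mentions_flash_flood(text):
    """Return 1 if the raw text mentions a flash flood, else 0."""
    if not isinstance(text, str):
        return 0  # NaN or other non-string entries count as "not mentioned"
    return int(FLASH_FLOOD_RE.search(text) is not None)


# Invented sentences, for illustration only
samples = [
    "Heavy rain brought flash flooding overnight.",
    "A flash-flood warning was issued.",
    "Trees are being planted to reduce flood risk.",
]
print([mentions_flash_flood(s) for s in samples])  # [1, 1, 0]
```

Because this works on the original text, it does not depend on how punctuation and stopwords were removed, so its results may differ slightly from the token-based column.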


In [56]:
# ====================================================
# STEP 5: EXTRACT RAINFALL, WATER LEVEL, STORM, LOCATION
# ====================================================

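# Regex patterns (each captures only the number or name, not the unit or prefix)
# Rainfall: a number followed by mm / millimetres / millimeters
rainfall_pattern = re.compile(r'(\d+(?:\.\d+)?)\s*(?:mm|millimetres|millimeters)', re.IGNORECASE)
# Water level: a number followed by m / metres / meters.
# Caution: any "<number> m" matches, so distances or other quantities can slip in.
waterlevel_pattern = re.compile(r'(\d+(?:\.\d+)?)\s*(?:m|metres|meters)\b', re.IGNORECASE)
# Storm name: one capitalised word after Storm / Cyclone / Hurricane / Typhoon
storm_pattern = re.compile(r'\b(?:Storm|Cyclone|Hurricane|Typhoon)\s+([A-Z][a-z]+)\b')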

# Extraction function
def extract_info(text):
    if not isinstance(text, str) or len(text.strip()) == 0:
        return None, None, None, None

    rainfall = rainfall_pattern.findall(text)
    waterlevel = waterlevel_pattern.findall(text)
    storm = storm_pattern.findall(text)

    # NER for location
    doc = nlp(text)
    locations = [ent.text for ent in doc.ents if ent.label_ == "GPE"]

    # Convert lists to readable strings
    rainfall = ', '.join(rainfall) if rainfall else None
    waterlevel = ', '.join(waterlevel) if waterlevel else None
    storm = ', '.join(storm) if storm else None
    # Deduplicate locations; sorting keeps the order stable between runs
    locations = ', '.join(sorted(set(locations))) if locations else None

    return rainfall, waterlevel, storm, locations

# Apply extraction to every article
df[['Rainfall_mm', 'WaterLevel_m', 'StormName', 'Location']] = df['Full Text'].apply(
    lambda x: pd.Series(extract_info(x))
)

# Preview the updated DataFrame
print("✅ Extracted rainfall, water level, storm name, and location.")
print(df[['Rainfall_mm', 'WaterLevel_m', 'StormName', 'Location']].head(10))


✅ Extracted rainfall, water level, storm name, and location.
            Rainfall_mm                          WaterLevel_m  \
0                  None  1.79, 1.65, 2.22, 2.12, 30.52, 30.37   
1                  41.5                2.85, 0.60, 1.55, 1.40   
2            20, 20, 20                                  None   
3                  None                                   400   
4                  None                                  None   
5             140, 71.4                                  None   
6                  None     6.55, 7.04, 5.33, 5.56, 4.4, 3.97   
7  26, 41.6, 34.6, 51.2                                  None   
8                50, 90                                  None   
9                  None                                  None   

                                           StormName  \
0  Babet, Babet, Babet, Babet, Babet, Babet, Babe...   
1                                               None   
2                                               None 
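
To see what the three patterns capture, and where they can over-match, the sketch below runs them on one invented sentence without spaCy. The helper `unique_in_order` is hypothetical. It removes repeats such as "Babet, Babet, ..." while keeping first-seen order, which the notebook's `extract_info` does not do for storm names.

```python
import re

# Same patterns as in the cell above
rainfall_pattern = re.compile(r'(\d+(?:\.\d+)?)\s*(?:mm|millimetres|millimeters)', re.IGNORECASE)
waterlevel_pattern = re.compile(r'(\d+(?:\.\d+)?)\s*(?:m|metres|meters)\b', re.IGNORECASE)
storm_pattern = re.compile(r'\b(?:Storm|Cyclone|Hurricane|Typhoon)\s+([A-Z][a-z]+)\b')


def unique_in_order(items):
    """Drop duplicates while keeping first-seen order (hypothetical helper)."""
    return list(dict.fromkeys(items))


# Invented sentence, for illustration only
text = ("Storm Dennis brought 45 mm of rain in 24 hours. The river peaked at "
        "3.5 metres. Storm Dennis also led to a £5m repair bill.")

print(rainfall_pattern.findall(text))                # ['45']
print(waterlevel_pattern.findall(text))              # ['3.5', '5']
print(unique_in_order(storm_pattern.findall(text)))  # ['Dennis']
```

The '5' in the water-level result comes from "£5m", which is a sum of money. Values in WaterLevel_m are therefore best treated as candidates to review, not confirmed measurements.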

In [57]:
# ====================================================
# STEP 6: CREATE FLAGS FOR RAINFALL / WATER LEVEL / NUMERICS
# ====================================================

# Replace empty strings with NaN so .notna() works properly
for col in ['Rainfall_mm', 'WaterLevel_m']:
    if col in df.columns:
        df[col] = df[col].replace('', np.nan)

# Create flags
df['has_rainfall'] = df['Rainfall_mm'].notna()
df['has_waterlevel'] = df['WaterLevel_m'].notna()
df['has_any_numeric'] = df['has_rainfall'] | df['has_waterlevel']

# Make sure Flash_Flood_Mentioned is in 0/1 format
df['Flash_Flood_Mentioned'] = (
    pd.to_numeric(df['Flash_Flood_Mentioned'], errors='coerce')
    .fillna(0)
    .astype(int)
)

# Quick check
print("✅ Flags created.")
print(df[['has_rainfall', 'has_waterlevel', 'has_any_numeric', 'Flash_Flood_Mentioned']].head(10))


✅ Flags created.
   has_rainfall  has_waterlevel  has_any_numeric  Flash_Flood_Mentioned
0         False            True             True                      0
1          True            True             True                      1
2          True           False             True                      1
3         False            True             True                      0
4         False           False            False                      0
5          True           False             True                      1
6         False            True             True                      1
7          True           False             True                      1
8          True           False             True                      1
9         False           False            False                      0
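
The flags only record whether a value was found. The values themselves are stored as comma-joined strings such as "140, 71.4". For numeric work they would need to be parsed back into numbers. A minimal sketch follows, with `parse_values` as a hypothetical helper name.

```python
def parse_values(cell):
    """Turn a comma-joined string such as '140, 71.4' into a list of floats.

    Hypothetical helper; None, NaN and empty strings give an empty list.
    """
    if not isinstance(cell, str) or not cell.strip():
        return []
    return [float(part) for part in cell.split(",") if part.strip()]


# Values as they appear in the Rainfall_mm preview above
rainfall_cells = [None, "41.5", "20, 20, 20", "140, 71.4"]

for cell in rainfall_cells:
    values = parse_values(cell)
    has_rainfall = bool(values)              # same meaning as the has_rainfall flag
    peak = max(values) if values else None   # largest value mentioned in the article
    print(cell, has_rainfall, peak)
```

The peak is only the largest number mentioned in the article. It carries no information about the period or place it refers to.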


In [58]:
# ====================================================
# STEP 7: GROUP ARTICLES & GET QUICK INSIGHTS
# ====================================================

'''
Group meanings:

FlashFlood + Numerics       → Articles that mention 'flash flood' and include numeric data (rainfall in mm or water level in m).
                              These are data-rich flash flood events and are most useful for detailed analysis.

FlashFlood (no numerics)    → Articles that mention 'flash flood' but have no numeric data.
                              Qualitative flash flood reports; still useful for event occurrence mapping.

OtherFlood + Numerics       → Articles with no flash-flood mention that include numeric data.
                              Likely riverine or long-duration floods with measurable rainfall/levels.

OtherFlood (no numerics)    → Articles with no flash-flood mention and no numeric data.
                              Least data-dense; more like 'human interest' stories, useful mainly for regional flood frequency insights.
'''

# --- Create the main grouping ---
df['Group'] = np.select(
    [
        (df['Flash_Flood_Mentioned'] == 1) & df['has_any_numeric'],
        (df['Flash_Flood_Mentioned'] == 1) & ~df['has_any_numeric'],
        (df['Flash_Flood_Mentioned'] == 0) & df['has_any_numeric']
    ],
    [
        'FlashFlood + Numerics',
        'FlashFlood (no numerics)',
        'OtherFlood + Numerics'
    ],
    default='OtherFlood (no numerics)'
)

# --- Show overall counts ---
print("✅ Articles grouped successfully!\n")
print(df['Group'].value_counts(), "\n")

# --- Get total unique locations per group ---
group_locs = {}

for grp in df['Group'].unique():
    all_locs = []
    for locs in df[df['Group'] == grp]['Location'].dropna():
        parts = str(locs).split(',')
        for p in parts:
            p = p.strip()
            if p != '':
                all_locs.append(p)
    group_locs[grp] = len(set(all_locs))

print("Unique locations mentioned per group:\n")
for g, n in group_locs.items():
    print(f"• {g}: {n} locations")

# --- Top locations in 'FlashFlood + Numerics' group ---
priority = df[df['Group'] == 'FlashFlood + Numerics']

all_locs = []
for locs in priority['Location'].dropna():
    parts = str(locs).split(',')
    for p in parts:
        p = p.strip()
        if p != '':
            all_locs.append(p)

location_counts = pd.Series(all_locs).value_counts()
# print("\nTop 10 locations in 'FlashFlood + Numerics' group:\n", location_counts.head(10), "\n")

# --- Show a few sample rows from that group ---
cols = ['Full Text', 'Rainfall_mm', 'WaterLevel_m', 'StormName', 'Location']
# print("Sample articles from 'FlashFlood + Numerics':\n")
# print(priority[cols].head(5))


✅ Articles grouped successfully!

Group
OtherFlood (no numerics)    35
OtherFlood + Numerics       27
FlashFlood + Numerics       14
FlashFlood (no numerics)     4
Name: count, dtype: int64 

Unique locations mentioned per group:

• OtherFlood + Numerics: 165 locations
• FlashFlood + Numerics: 131 locations
• OtherFlood (no numerics): 71 locations
• FlashFlood (no numerics): 23 locations
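
The four-way grouping is a two-by-two decision on the flash-flood flag and the numeric flag. The plain-Python sketch below spells out that mapping for a single article. The function name `assign_group` is hypothetical, and the notebook itself uses the vectorised `np.select` call above.

```python
def assign_group(flash_flood_mentioned, has_any_numeric):
    """Map the two flags of one article to one of the four group labels."""
    prefix = "FlashFlood" if flash_flood_mentioned == 1 else "OtherFlood"
    suffix = "+ Numerics" if has_any_numeric else "(no numerics)"
    return f"{prefix} {suffix}"


# Flag values for rows 0, 1 and 4 of the previews above
rows = [(0, True), (1, True), (0, False)]
print([assign_group(flash, numeric) for flash, numeric in rows])
# ['OtherFlood + Numerics', 'FlashFlood + Numerics', 'OtherFlood (no numerics)']
```

A function like this is easy to unit-test, while `np.select` is usually the better choice for a whole DataFrame.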


In [59]:
# print(location_counts)
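
The location counting in Step 7 can also be sketched with `collections.Counter`. Each Location cell holds a comma-joined string, so every name is counted at most once per article, and a count is the number of articles that mention the place. The helper name `count_locations` and the sample values are assumptions for illustration.

```python
from collections import Counter


def count_locations(location_cells):
    """Count how many articles mention each location.

    Each cell is a comma-joined string of locations (or None/NaN), as in
    the Location column. Hypothetical helper for illustration.
    """
    counts = Counter()
    for cell in location_cells:
        if not isinstance(cell, str):
            continue  # skip None / NaN
        # A set ensures a name is counted once per article
        names = {part.strip() for part in cell.split(",") if part.strip()}
        counts.update(names)
    return counts


# Invented Location values, for illustration only
cells = ["England, Wales", "England, Scotland", None, "England"]
print(count_locations(cells).most_common(1))  # [('England', 3)]
```

Splitting on commas assumes that no location name itself contains a comma.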

In [63]:
# ====================================================
# STEP 8: SAVE OUTPUTS
# ====================================================

# File names (edit if you want)
output_csv = 'flood_articles_processed.csv'
priority_csv = 'flood_articles_flashflood_numeric.csv'

# Save the full dataframe
df.to_csv(output_csv, index=False)
print("✅ Full dataset saved as:", output_csv)

# Save only the priority group (FlashFlood + Numerics)
priority = df[df['Group'] == 'FlashFlood + Numerics']
priority.to_csv(priority_csv, index=False)
print("✅ Priority dataset (FlashFlood + Numerics) saved as:", priority_csv)

# Optional quick summary file
summary_txt = 'flood_summary.txt'
with open(summary_txt, 'w', encoding='utf-8') as f:
    f.write("Flood Information Extraction Summary\n")
    f.write("===================================\n\n")
    f.write(str(df['Group'].value_counts()))
    f.write("\n\nTop locations (FlashFlood + Numerics):\n")
    f.write(str(location_counts.head(40)))
print("✅ Summary text file saved as:", summary_txt)


✅ Full dataset saved as: flood_articles_processed.csv
✅ Priority dataset (FlashFlood + Numerics) saved as: flood_articles_flashflood_numeric.csv
✅ Summary text file saved as: flood_summary.txt


In [65]:
# summary_txt = 'flood_summary.txt'

# print("📄 Contents of flood_summary.txt:\n")
# with open(summary_txt, 'r', encoding='utf-8') as f:
#     print(f.read())