# NLP Project: Can I identify flood magnitudes from FloodList articles for the UK?

Workflow:
1. Load the data
2. Remove non-English characters
3. Process data and save to CSV/DataFrame
4. Tokenize the text
5. Remove stopwords
6. Perform stemming/lemmatization
7. Extract relevant keywords and rainfall values
8. Feature engineering: convert text to numerical features such as bag of words or TF-IDF
9. Prepare for classification of data
10. Make train/test set

In [4]:
import pandas as pd

uk_article_file_csv = 'uk_flood_articles_80.csv'  # File is in the same directory as the notebook
df = pd.read_csv(uk_article_file_csv)
print(df.head())

                                               Title               Date  \
0  UK – Over 1,000 Homes Damaged, Hundreds Evacua...   24 October, 2023   
1  UK – Evacuations After Floods in Devon and Som...       10 May, 2023   
2  Intense Downpours in the UK Will Increase Due ...      9 March, 2023   
3  UK – Thousands of Trees to Be Planted at Flood...  22 February, 2023   
4  UK – England May Be Set to Flood at the End of...   12 January, 2023   

                                           Full Text  \
0  Parts of the United Kingdom continue to grappl...   
1  Storms and heavy rain brought flash flooding t...   
2  In the United Kingdom, intense downpours excee...   
3  Thousands of trees are to be planted as part o...   
4  England may be set to flood at the end of wint...   

                                                Link  
0  https://floodlist.com/europe/floods-england-sc...  
1  https://floodlist.com/europe/united-kingdom/fl...  
2  https://floodlist.com/europe/united-kingdom/

In [40]:
import re, nltk, string, unicodedata
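
Step 2 of the workflow (removing non-English characters) is not shown in the notebook, although `unicodedata` is imported. A minimal sketch of one way to do it, assuming that "non-English" means "non-ASCII"; the helper name `to_ascii` is hypothetical and not part of the original notebook:

```python
import unicodedata

def to_ascii(text: str) -> str:
    """Decompose accented letters (NFKD), then drop anything that is still non-ASCII."""
    normalized = unicodedata.normalize("NFKD", text)
    return normalized.encode("ascii", "ignore").decode("ascii")

# Hypothetical sample sentence, not taken from the dataset
print(to_ascii("Met Éireann said “widespread” flooding"))
```

Accented letters are reduced to their base letter, while characters with no ASCII equivalent (curly quotes, en dashes, currency symbols such as £) are removed entirely, so this may be too aggressive if those symbols matter later.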

In [42]:
from nltk.corpus import stopwords  # needs a one-time nltk.download('stopwords')
stopwords.words('english')[0:10]

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're"]

In [44]:
stop_words = set(stopwords.words('english'))
full_text_list = df['Full Text'].tolist()
row_count = len(full_text_list)
row_count

80

In [89]:
# Process the text
cleaned_list = []

for text in full_text_list:
    if isinstance(text, str):  # Ensure the value is a string
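        # Step 1: Remove ASCII punctuation.
        # Note: string.punctuation includes '.', so decimals such as '1.5' become '15',
        # and it does not cover curly quotes such as “ ”, which remain in the text.
        nopunc = ''.join([char for char in text if char not in string.punctuation])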
        
        # Step 2: Remove stopwords
        clean_words = [word for word in nopunc.split() if word.lower() not in stop_words]
        
        # Add the cleaned text as a joined string to the list
        cleaned_list.append(' '.join(clean_words))
    else:
        cleaned_list.append('')  # Handle non-string values with an empty string

print(cleaned_list[:2])
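
Because the cleaning loop above strips every `.`, decimal values lose their decimal point before any rainfall or depth extraction can happen. A possible alternative for the punctuation step is sketched below; the function name `strip_punctuation_keep_decimals` is hypothetical, and the sample sentence is invented for illustration:

```python
import re
import string

# All ASCII punctuation except '.', which is handled separately below
_PUNCT_NO_DOT = string.punctuation.replace(".", "")

def strip_punctuation_keep_decimals(text: str) -> str:
    # Remove every '.' that is not sitting between two digits
    # (sentence-ending periods, abbreviations), keeping decimals like 12.5
    text = re.sub(r"(?<!\d)\.|\.(?!\d)", "", text)
    # Remove the remaining ASCII punctuation characters
    return text.translate(str.maketrans("", "", _PUNCT_NO_DOT))

print(strip_punctuation_keep_decimals("Exeter recorded 12.5 mm of rain, officials said."))
```

Thousands separators are still removed (`1,000` becomes `1000`), which is usually what you want for numeric extraction. A token such as `'415'` followed by `'mm'` in the tokenized output may be a case where the original loop dropped a decimal point.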



In [58]:
tokenized_article_list = [article.split() for article in cleaned_list]
tokenized_article_list[1]

['Storms',
 'heavy',
 'rain',
 'brought',
 'flash',
 'flooding',
 'parts',
 'England',
 'Wales',
 'United',
 'Kingdom',
 '09',
 'May',
 '2023',
 'Homes',
 'roads',
 'flooded',
 'Somerset',
 'Devon',
 'South',
 'West',
 'Flooding',
 'also',
 'reported',
 'North',
 'Wales',
 'parts',
 'South',
 'East',
 'England',
 'Devon',
 'Somerset',
 'Fire',
 'Rescue',
 'Service',
 'said',
 'received',
 '“widespread”',
 'calls',
 'help',
 'due',
 'flooding',
 'around',
 'midday',
 '09',
 'May',
 '2023',
 'Around',
 '5',
 'homes',
 'flooded',
 'along',
 'several',
 'roads',
 'areas',
 'Exeter',
 'Devon',
 'Roads',
 'school',
 'buildings',
 'affected',
 'Tipton',
 'St',
 'John',
 'near',
 'Exeter',
 'weather',
 'station',
 'Exeter',
 'Met',
 'Office',
 'recorded',
 '415',
 'mm',
 'rain',
 '24',
 'hours',
 'early',
 '10',
 'May',
 'farm',
 'building',
 'destroyed',
 'vehicle',
 'damaged',
 'floods',
 'Newton',
 'Poppleford',
 'Devon',
 'Devon',
 'Somerset',
 'Fire',
 'Rescue',
 'Service',
 'carried',
 '

In [76]:
# Plan: apply stemming, because I am looking for articles that mention flash floods and the resulting flood depths.
# First, classify each article as mentioning flash floods or not. This is a binary task, so I'll start with Naive Bayes.
# After feature engineering, use TF-IDF to measure how significant each keyword is in each article.
# After that, extract flood depth values using regular expressions.
# Depths are usually paired with 'meters' or 'm' (e.g. 1.5 meters or 0.3m), so the extraction should target numbers followed by those units.
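
The extraction idea in the notes above can be sketched with two regular expressions. This is only an illustration under assumptions: the pattern names, the helper `extract_values`, and the sample sentence are hypothetical, and the patterns should run on the raw `Full Text` column rather than on `cleaned_list`, where decimal points have already been stripped.

```python
import re

# A number (optional decimal part) followed by a metre unit.
# The lookahead (?![a-z]) stops a bare "m" from matching the start of "mm" or "miles".
DEPTH_PATTERN = re.compile(
    r"(\d+(?:\.\d+)?)\s*(?:metres|meters|metre|meter|m)(?![a-z])", re.IGNORECASE
)
# A number (optional decimal part) followed by "mm", for rainfall totals
RAINFALL_PATTERN = re.compile(r"(\d+(?:\.\d+)?)\s*mm(?![a-z])", re.IGNORECASE)

def extract_values(pattern, text):
    """Return every number matched by `pattern` in `text` as a float."""
    return [float(value) for value in pattern.findall(text)]

# Hypothetical sentence, not taken from the dataset
sample = "Water reached 1.5 meters in places and 0.3m elsewhere after 40 mm of rain fell in 24 hours."
print(extract_values(DEPTH_PATTERN, sample))     # depths in metres
print(extract_values(RAINFALL_PATTERN, sample))  # rainfall in millimetres
```

A bare `m` can also stand for "million" in news text (for example in funding figures), so matches on the depth pattern would still need a manual check or extra context words such as "deep" or "depth".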



In [85]:
# Add a "Flash_Flood_Mentioned" column.
# An article is flagged (1) when the exact token 'flash' occurs and the exact token
# 'flood' occurs anywhere after the first 'flash'. The match is case-sensitive and
# does not require the two tokens to be adjacent.
df['Flash_Flood_Mentioned'] = [
    1 if 'flash' in article and 'flood' in article[article.index('flash') + 1:] else 0
    for article in tokenized_article_list
]

# Preview the results
print(df[['Full Text', 'Flash_Flood_Mentioned']].head(20))

                                            Full Text  Flash_Flood_Mentioned
0   Parts of the United Kingdom continue to grappl...                      0
1   Storms and heavy rain brought flash flooding t...                      1
2   In the United Kingdom, intense downpours excee...                      0
3   Thousands of trees are to be planted as part o...                      0
4   England may be set to flood at the end of wint...                      0
5   Police in UK report that one person is missing...                      1
6   Hundreds of homes have been flooded in England...                      0
7   Thunderstorms affected parts of western Europe...                      1
8   Heavy rainfall in eastern England, UK on 09 Ju...                      1
9   More than 300,000 homes in England are now bet...                      0
10  Overlapping disasters struck parts of England ...                      0
11  Storm Christoph brought heavy rain to parts of...                      0
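
If the intent is to flag only articles where "flash" is directly followed by a flood word, a stricter check could look like the sketch below. The function name `mentions_flash_flood` is hypothetical; it lowercases tokens and accepts any token starting with `flood` ("flood", "floods", "flooding", "flooded"), which is an assumption about what should count as a mention.

```python
def mentions_flash_flood(tokens):
    """Return 1 if any 'flash' token is immediately followed by a token starting with 'flood'."""
    lowered = [token.lower() for token in tokens]
    # Walk over adjacent token pairs: (t0, t1), (t1, t2), ...
    return int(any(
        first == "flash" and second.startswith("flood")
        for first, second in zip(lowered, lowered[1:])
    ))

# Hypothetical token lists for illustration
print(mentions_flash_flood(["Storms", "brought", "flash", "flooding"]))
print(mentions_flash_flood(["flash", "photography", "after", "flood"]))
```

Applied to the notebook's data, this would be `[mentions_flash_flood(article) for article in tokenized_article_list]`; the resulting labels may differ from the column shown above.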

In [103]:
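# Check word frequencies across the articles.
# TODO: split the dataset into training and testing sets before this step.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Initialize CountVectorizer
vectorizer = CountVectorizer()

# Bag-of-words matrix: one row per article, one column per vocabulary word
bow_matrix = vectorizer.fit_transform(cleaned_list)
# Sum each column to get one total count per word
# (a bare .sum() would collapse the whole matrix into a single number)
word_frequencies = np.asarray(bow_matrix.sum(axis=0)).ravel().tolist()
vocab = vectorizer.get_feature_names_out()
vocab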

array(['01', '01082019', '02', ..., 'zoom', 'zurichbased', 'éireann'],
      dtype=object)

In [107]:
word_freq_dict = dict(zip(vocab, word_frequencies))
total_unique_words = len(vocab)

In [109]:
total_unique_words

5800

In [111]:
# Extract dictionary items into a list of tuples
word_freq_items = list(word_freq_dict.items())  # Example: [('word1', 4), ('word2', 2), ...]

# Define a function to get the frequency (value) from a tuple
def get_frequency(item):
    return item[1]  # The frequency is the second element of the tuple (index 1)

# Sort the list of tuples by frequency in descending order
sorted_word_freq = sorted(word_freq_items, key=get_frequency, reverse=True)

# Get the top 5 most frequent words
top_5_words = sorted_word_freq[:5]

# Print the result
print("Top 5 Most Frequent Words:", top_5_words)

Top 5 Most Frequent Words: [('flood', 567), ('flooding', 450), ('water', 235), ('said', 200), ('england', 197)]
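
The remaining workflow steps (TF-IDF features, a train/test split, and a Naive Bayes classifier) are not yet in the notebook. A minimal sketch of how they might fit together with scikit-learn follows; the toy texts and labels are invented purely so the example runs on its own, and in the notebook they would be replaced by `cleaned_list` and `df['Flash_Flood_Mentioned']`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy data: 1 = flash flood mentioned, 0 = not mentioned
toy_texts = [
    "flash flooding hit homes after heavy rain",
    "storms brought flash floods to the town",
    "flash flood warning issued as river rose",
    "sudden flash flooding closed several roads",
    "trees planted to reduce future flood risk",
    "report says flood defences need funding",
    "insurance costs rise for homes near rivers",
    "council reviews long term drainage plans",
]
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Hold out 25% of the articles, keeping the class balance in both sets
X_train, X_test, y_train, y_test = train_test_split(
    toy_texts, toy_labels, test_size=0.25, random_state=0, stratify=toy_labels
)

# TF-IDF features feed directly into a multinomial Naive Bayes classifier
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(X_train, y_train)

predictions = model.predict(X_test)
print(list(zip(X_test, predictions)))
```

Wrapping the vectorizer and classifier in one pipeline means the TF-IDF vocabulary is learned from the training set only, which avoids leaking test-set words into the features.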
