# Text extraction

In this notebook, we show how temperatures and wind data from 2002 until now are extracted from avalanche reports that we previously downloaded.

## 1: Temperature extraction:

In [4]:
import numpy as np
import pandas as pd
import re
import glob
import os
import dateutil.parser  # the parser submodule must be imported explicitly

We define some functions that will be used to extract temperatures from text.

In [24]:
temperature_pattern = re.compile(r"(moins |plus )?(\d+) degre", re.IGNORECASE)
CONTEXT = 25
main_directions = ['nord', 'sud', 'est', 'ouest']
directions = main_directions + ['nord-est', 'nord-ouest', 'sud-est', 'sud-ouest']


def replace_words(string, tokenize_map):
    """Replace similar words by a token
    tokenize_map should be a dict(word -> token)
    """
    for w, t in tokenize_map.items():
        string = string.replace(w, t)
    return string

def extract_temperatures(paragraph):
    """Obtain the location for each temperature
    returns a dict(region -> temperature)
    """
    result = {}
    
    ts = []
    for match in temperature_pattern.finditer(paragraph):
        sign = -1 if match[1] == 'moins ' else 1
        value = int(match[2])
        end = match.end()
        ts.append((sign * value, end))
    
    if len(ts) == 1:
        result['default'] = ts[0][0]
    elif len(ts) > 1:
        for value, end in ts:
            for direction in main_directions:
                if direction in paragraph[end:end + CONTEXT]:
                    result[direction] = value
                    break
    
    return result
    
tokens = {
    'degre': ['degré', 'degrés', 'degre', 'degres', 'degree', 'degrees', 'degrée', 'degrées', '°', '°C', '° C'],
    'plus ': ['+', 'jusqu\'au-dela de ', 'au-dela '],
    'moins ': ['-'],
    'situation generale': ['Rétrospective météo', 'Retrospective meteo', 'situation générale', 'COUVERTURE NEIGEUSE', 'Retrospective météorologique', 'Retrospective meteorologique'],
    'plus 0': ['zero', 'zéro'],
    'ouest': ["l'ouest"],
    'est': ["l'est"],
    '1': ['un'],
}

tokens_map = {word: token for token, words in tokens.items() for word in words}

temperatures = {
    'default': [],
    'nord': [],
    'sud': [],
    'est': [],
    'ouest': [],
}
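To make the matching step concrete, here is a minimal sketch that applies the same pattern to a single sentence. The sentence is hypothetical (not taken from a real report) and is assumed to be already normalized and lowercased, as the paragraphs are before extraction.

```python
import re

# Same pattern and context width as in the cell above
temperature_pattern = re.compile(r"(moins |plus )?(\d+) degre", re.IGNORECASE)
CONTEXT = 25

# Hypothetical, already-normalized sentence (an assumption for illustration)
sample = "a 2000 m, moins 4 degre au nord et plus 2 degre au sud"

found = []
for match in temperature_pattern.finditer(sample):
    sign = -1 if match[1] == "moins " else 1
    # Text right after the match, where a direction is searched for
    context = sample[match.end():match.end() + CONTEXT]
    found.append((sign * int(match[2]), context))

for value, context in found:
    print(value, repr(context))
```

The altitude "2000 m" is not matched because the pattern requires the word "degre" right after the number.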

We previously transformed the pdf files into text files in order to do the text processing. Now we extract the temperatures from the files.

In [25]:
no_situation_paragraph = 0
total_files = 0

for year in range(2002,2018):
    path = "../data/slf/{}/nb/fr/txt".format(str(year))
    
    for filename in glob.glob(os.path.join(path, '*.txt')):
        file_date = dateutil.parser.parse(filename[27:35])
        total_files += 1

        with open(filename, 'rb') as file:
            content = file.read().decode("utf-8", "ignore")
            content = replace_words(content, tokens_map)
            
            # find 'situation generale' paragraph
            paragraph = None
            for text in content.split('\n\n\n'):
                text = text.lower()
                if 'situation generale' in text:
                    paragraph = text
            
            if not paragraph:
                no_situation_paragraph += 1
            
            else:
                paragraph = paragraph.replace("\n", " ")
                ts = extract_temperatures(paragraph)
                for direction, t in ts.items():
                    temperatures[direction].append((file_date, t))

print('Total number of reports without a situation paragraph: {}/{}'.format(no_situation_paragraph, total_files))

Total number of reports without a situation paragraph: 42/2063
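The slice `filename[27:35]` depends on the exact length of the directory prefix. A possible alternative, sketched here under the assumption that every file name starts with an eight-digit `YYYYMMDD` date (as in `20020102_nb_fr_bw.txt`), reads the date from the base name instead. `date_from_filename` is a hypothetical helper, not part of the notebook.

```python
import os
from datetime import datetime

def date_from_filename(filename):
    """Parse the leading YYYYMMDD part of a report file name (assumed layout)."""
    stem = os.path.basename(filename)
    return datetime.strptime(stem[:8], "%Y%m%d")

print(date_from_filename("../data/slf/2002/nb/fr/txt/20020102_nb_fr_bw.txt"))
```

With this variant the data directory can be moved or renamed without changing the slice indices.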


In [39]:
records = [(date, region, t) for region, ts in temperatures.items() for date, t in ts]

In [47]:
results = pd.DataFrame(records, columns=['date', 'region', 'temperature'])
results.region = results.region.str.replace('default', '-')
results = results.sort_values(by='date')
results.set_index(['date', 'region'])

date,region,temperature
2001-11-12,-,15
2001-11-25,-,8
2001-11-26,-,0
2001-11-29,-,-5
2001-12-05,-,0
2001-12-08,-,-2
2001-12-10,-,2
2001-12-11,-,-3
2001-12-12,-,-5
2001-12-13,-,-15


We concatenate the per-region dataframes into a single dataframe that holds all temperatures, indexed by date.

In [471]:
new_df = pd.concat([temp_df, temp_df_nord, temp_df_sud, temp_df_est, temp_df_ouest], axis=1)
print('We collected temperatures for %d dates' %len(new_df))
new_df.head()

We collected temperatures for 2091 dates


Date,Temperature,Temperature Nord,Temperature Sud,Temperature est,Temperature ouest
2001-11-12,15.0,,,,
2001-11-25,8.0,,,,
2001-11-26,0.0,,,,
2001-12-02,4.0,,,,
2001-12-05,0.0,,,,
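The frames `temp_df`, `temp_df_nord` and so on are not defined in the cells shown here. As a sketch of one way to get a similar wide table, assuming a long table with the same `date`, `region`, `temperature` columns as `results`, a pivot produces one column per region. The records below are hypothetical.

```python
import pandas as pd

# Hypothetical records in the same (date, region, temperature) layout as `results`
records = [
    (pd.Timestamp("2008-03-31"), "nord", -3),
    (pd.Timestamp("2008-03-31"), "sud", 0),
    (pd.Timestamp("2011-12-09"), "-", 0),
]
long_df = pd.DataFrame(records, columns=["date", "region", "temperature"])

# One row per date, one column per region; missing combinations become NaN.
# aggfunc="first" keeps the first value if a date/region pair occurs twice.
wide_df = long_df.pivot_table(index="date", columns="region",
                              values="temperature", aggfunc="first")
print(wide_df)
```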


We check whether the extraction worked by inspecting a random sample of 20 dates.

In [357]:
new_df.sample(20)

Date,Temperature,Temperature Nord,Temperature Sud,Temperature est,Temperature ouest
2004-04-25,,,5.0,-3.0,
2007-04-14,8.0,,,,
2009-10-20,,,3.0,,
2008-03-31,,-3.0,0.0,,
2017-03-28,,2.0,4.0,,
2007-03-22,,,-6.0,-11.0,-8.0
2009-04-01,,6.0,1.0,,
2008-03-17,,-3.0,2.0,,
2011-12-09,0.0,,,,
2016-12-27,,,3.0,-6.0,1.0


In the sample we took, 95% of the extracted temperatures are correct.
It is therefore reasonable to use these temperatures for further analysis.

## 2: Wind extraction:

Now we will extract wind data from avalanche reports.
The reports give no exact numbers such as wind speed, only an evaluation of the strength of the wind.
Thus, our output will be categorical: strong, moderate or weak.

First, we initialize some variables.

In [472]:
# We will replace all words that are similar to moderate with the same word
words_1 = ['modéré','modere','modérés','moderes']
token_1 = 'modere' 
# Same for the term 'fort' (strong)
words_2 = ['fort','forts']
token_2 = 'fort'
# Same for the term 'faible' (weak)
words_3 = ['faible','faibles']
token_3 = 'faible'
# This change is made to allow paragraph selection
words_4 = ['Rétrospective météo','Retrospective meteo', 'situation générale', 'COUVERTURE NEIGEUSE', 'Retrospective météorologique', 'Retrospective meteorologique']
token_4 = 'situation generale'
    
# Creation of lists that will contain wind data and dates
wind = []
date = []

We now extract the wind data.

In [473]:
for year in range(2002,2018):
    # we select all the text files corresponding to 1 year
    path = "../data/slf/{}/nb/fr/txt".format(str(year))
    # the algorithm is run for each file of the same year
    for filename in glob.glob(os.path.join(path, '*.txt')):
        
        paragraph = []
        # Open the text file, ignoring undecodable bytes as in the temperature extraction
        #filename = '../data/slf/2002/nb/fr/txt/20020102_nb_fr_bw.txt'
        with open(filename, encoding="utf-8", errors="ignore") as file:
            
            content = file.read()
            
            # Map every spelling variant to its token;
            # replace_words expects a dict(word -> token)
            content = replace_words(content, dict.fromkeys(words_1, token_1))
            content = replace_words(content, dict.fromkeys(words_2, token_2))
            content = replace_words(content, dict.fromkeys(words_3, token_3))
            content = replace_words(content, dict.fromkeys(words_4, token_4))

            # collect paragraph in which wind information is present
            for text in content.split('\n\n\n'):
                if 'situation generale' in text.lower():
                    paragraph = text
            
            # Proceed only if a 'situation generale' paragraph was found
            if paragraph:
                flat = paragraph.replace("\n", "")
                
                # Search for each wind strength in the priority order
                # fort, faible, modere; the first term found is recorded
                for strength in ('fort', 'faible', 'modere'):
                    match = re.search(strength, flat, re.IGNORECASE)
                    if match:
                        wind.append(match[0])
                        date.append(dateutil.parser.parse(filename[27:35]))
                        break
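The loop records one term per report, checking `fort` first, then `faible`, then `modere`. The sketch below isolates that rule on a hypothetical sentence and shows a side effect of plain substring matching: `fort` is also found inside longer words such as `fortement`. The helper `first_strength` and its `whole_words` option are assumptions for illustration, not part of the notebook.

```python
import re

# Priority order used by the loop above: fort, then faible, then modere
PRIORITY = ("fort", "faible", "modere")

def first_strength(paragraph, whole_words=False):
    """Return the first term of PRIORITY found in paragraph, or None.

    whole_words=True is a hypothetical variant (not in the notebook)
    that only accepts the term as a separate word.
    """
    for term in PRIORITY:
        pattern = r"\b{}\b".format(term) if whole_words else term
        if re.search(pattern, paragraph, re.IGNORECASE):
            return term
    return None

# Hypothetical normalized sentence, not taken from a real report
sample = "vent faible, il a fortement neige"
print(first_strength(sample))                    # substring match
print(first_strength(sample, whole_words=True))  # whole-word match
```

Whether whole-word matching is preferable depends on how the reports phrase wind strength, which would need checking against the actual texts.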
                

We define the dataframe in which wind and corresponding dates are inserted.

In [362]:
wind_df = pd.DataFrame({'Date':date,'Wind':wind})
wind_df = wind_df.drop_duplicates(subset='Date', keep='first')
wind_df = wind_df.set_index('Date')
print('We collected wind information for %d dates' %len(wind_df))
wind_df.head()

We collected wind information for 2677 dates


Date,Wind
2001-11-12,faible
2001-11-23,faible
2001-11-24,faible
2001-11-25,faible
2001-11-27,faible
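Since the three wind labels have a natural order, one option for later analysis is to store them as an ordered categorical. This is a sketch on hypothetical values, assuming the order faible < modere < fort. It is not a step the notebook performs.

```python
import pandas as pd

# Hypothetical wind labels in the same vocabulary as wind_df
wind_labels = pd.Series(["fort", "faible", "modere", "faible"])

# Ordered categorical: faible < modere < fort
strength = pd.Categorical(wind_labels,
                          categories=["faible", "modere", "fort"],
                          ordered=True)

print(strength.codes)  # integer code of each label
print(strength.max())  # strongest label present
```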


We check whether the extraction worked by inspecting a random sample of 20 dates.

In [370]:
wind_df.sample(20)  

Date,Wind
2005-05-25,fort
2002-02-09,faible
2015-02-26,fort
2004-02-02,fort
2006-05-01,faible
2009-11-29,fort
2014-02-25,fort
2014-12-12,fort
2009-05-05,modere
2008-03-26,faible
