![bse_logo_textminingcourse](https://bse.eu/sites/default/files/bse_logo_small.png)

Introduction to Text Mining and Natural Language Processing

by Hannes Mueller


# Session 2: Project Design and Getting the Text


## 1) Reading Text from Many Files

We made it. We have amassed files on our hard drive. In this session we will mix our knowledge of Regex and pathwalks to read in masses of rtf files. The principle works the same regardless of the file format.

Three key lessons are to be had here:
- how to walk the folder structure to find and filter files
- how the body of the text is identified and extracted
- which metadata to extract

In [4]:
from striprtf.striprtf import rtf_to_text

import os
import re

import pandas as pd
import matplotlib.pyplot as plt


# Adjust to your system if you run the notebook from a different folder
project_root = os.getcwd()
readdir = os.path.join(project_root, 'Spain')
write = project_root


In [5]:
readdir

'/Users/hannesfelixmuller/Dropbox/teaching/Text Mining DSDM 2026/Session2_Projects_Getting_Text/Spain'

### Intro to os.walk

Here is an exercise that demonstrates how to use the os.walk function to recursively walk through a directory and print the names of all the files and directories that it encounters:

Begin by importing the os module.

In [6]:
import os

Next, use the os.walk function to iterate over all the files and directories in a directory of your choice. Its first argument is the path of the directory that you want to start the search from.

For every directory it visits, os.walk yields three values:
- The root variable contains the path of the directory currently being walked through
- The dirs variable contains a list of subdirectories of the current directory
- The files variable contains a list of files in the current directory

In [7]:
for root, dirs, files in os.walk(os.path.join(write, 'testfolder')):
    print('Root:', root)
    print('Directories:', dirs)
    print('Files:', files)
    print('   ')


Root: /Users/hannesfelixmuller/Dropbox/teaching/Text Mining DSDM 2026/Session2_Projects_Getting_Text/testfolder
Directories: ['folder 2 in testfolder', 'folder 1 in testfolder', 'folder 4 in testfolder', 'folder 3 in testfolder']
Files: ['.DS_Store', 'file in testfolder.txt']
   
Root: /Users/hannesfelixmuller/Dropbox/teaching/Text Mining DSDM 2026/Session2_Projects_Getting_Text/testfolder/folder 2 in testfolder
Directories: []
Files: ['file 1 in folder 2 in testfolder.txt', 'file 2 in folder 2 in testfolder.txt']
   
Root: /Users/hannesfelixmuller/Dropbox/teaching/Text Mining DSDM 2026/Session2_Projects_Getting_Text/testfolder/folder 1 in testfolder
Directories: []
Files: ['file 1 in folder 1 in testfolder.txt', 'file 2 in folder 1 in testfolder.txt']
   
Root: /Users/hannesfelixmuller/Dropbox/teaching/Text Mining DSDM 2026/Session2_Projects_Getting_Text/testfolder/folder 4 in testfolder
Directories: ['folder 1 in folder 4 in testfolder']
Files: ['file 1 in folder 4 in testfolder.txt'

## Some variations of this (will not go through)

In [8]:
for allinone in os.walk(os.path.join(write, 'testfolder')):
    print(allinone)
    print('   ')

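# If you want to filter files or directories, you can do so with Python's built-in functions or a regular expression
import re
for root, dirs, files in os.walk(os.path.join(write, 'testfolder')):
    for file in files:
        # The dot is escaped so that it matches a literal '.' and not any character
        if re.search(r'\.txt$', file):
            print(os.path.join(root, file))

# Excluding a folder
# A list is used so that `not in` compares whole folder names (with a plain string it would test for substrings)
exclude_dirs = ['folder 1 in testfolder']
for root, dirs, files in os.walk(os.path.join(write, 'testfolder')):
    # Assigning to dirs[:] changes the list in place, so os.walk skips the excluded folders
    dirs[:] = [d for d in dirs if d not in exclude_dirs]
    for file in files:
        if re.search(r'\.txt$', file):
            print(os.path.join(root, file))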


('/Users/hannesfelixmuller/Dropbox/teaching/Text Mining DSDM 2026/Session2_Projects_Getting_Text/testfolder', ['folder 2 in testfolder', 'folder 1 in testfolder', 'folder 4 in testfolder', 'folder 3 in testfolder'], ['.DS_Store', 'file in testfolder.txt'])
   
('/Users/hannesfelixmuller/Dropbox/teaching/Text Mining DSDM 2026/Session2_Projects_Getting_Text/testfolder/folder 2 in testfolder', [], ['file 1 in folder 2 in testfolder.txt', 'file 2 in folder 2 in testfolder.txt'])
   
('/Users/hannesfelixmuller/Dropbox/teaching/Text Mining DSDM 2026/Session2_Projects_Getting_Text/testfolder/folder 1 in testfolder', [], ['file 1 in folder 1 in testfolder.txt', 'file 2 in folder 1 in testfolder.txt'])
   
('/Users/hannesfelixmuller/Dropbox/teaching/Text Mining DSDM 2026/Session2_Projects_Getting_Text/testfolder/folder 4 in testfolder', ['folder 1 in folder 4 in testfolder'], ['file 1 in folder 4 in testfolder.txt', 'file 2 in folder 4 in testfolder.txt'])
   
('/Users/hannesfelixmuller/Dropbox

Once you are done, you should see the names of all the files and directories that the os.walk function encountered, starting from the directory that you specified and recursing through all its subdirectories.
For each directory, the code above prints the directory itself, the subdirectories in it, and the files in it. You can add more information, such as file size or creation time, depending on what you need.
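As a minimal sketch of the file-size idea, the walk can collect a few attributes per file. The helper name `list_files_with_metadata` and the throwaway demo folder are made up for illustration. Note that the sketch reads the last-modified time via `os.path.getmtime`, because a true creation time is not available in the same way on every operating system.

```python
import os
import tempfile
import time


def list_files_with_metadata(top):
    """Walk `top` and return (path, size_in_bytes, modified_time) tuples."""
    records = []
    for root, dirs, files in os.walk(top):
        for name in files:
            path = os.path.join(root, name)
            size = os.path.getsize(path)        # size in bytes
            modified = os.path.getmtime(path)   # seconds since the epoch
            records.append((path, size, modified))
    return records


# Build a tiny throwaway folder so the example runs anywhere (hypothetical files)
with tempfile.TemporaryDirectory() as demo_dir:
    os.makedirs(os.path.join(demo_dir, 'sub'))
    with open(os.path.join(demo_dir, 'a.txt'), 'w') as f:
        f.write('hello')
    with open(os.path.join(demo_dir, 'sub', 'b.txt'), 'w') as f:
        f.write('hello world')

    demo_records = list_files_with_metadata(demo_dir)
    for path, size, modified in sorted(demo_records):
        print(os.path.basename(path), size, time.ctime(modified))
```

To try it on the course material, call `list_files_with_metadata` with the path of your own folder instead of the temporary one.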

### Apply to AP Spain folder

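Let's apply this to the Spain folder, which contains the articles (RTF files) we want to parse. In what follows I use the fact that, once the loop has finished, the variable files still holds the list of filenames from the last directory visited. The Spain folder has no subfolders, so this is the complete list of files.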

In [9]:
for root, dirs, files in os.walk(readdir):
    print(root)
print(root)
print("In total we have", len(files), "files to process.")

/Users/hannesfelixmuller/Dropbox/teaching/Text Mining DSDM 2026/Session2_Projects_Getting_Text/Spain
/Users/hannesfelixmuller/Dropbox/teaching/Text Mining DSDM 2026/Session2_Projects_Getting_Text/Spain
In total we have 285 files to process.
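The count above works because all files sit directly in one folder. If your own files are spread over subfolders, the list has to be accumulated across all directories visited by the walk. The following is a sketch under that assumption; the function name `collect_files` and the demo files are hypothetical.

```python
import os
import tempfile


def collect_files(top, extension='.rtf'):
    """Return full paths of all files under `top` with the given extension."""
    matches = []
    for root, dirs, files in os.walk(top):
        for name in files:
            # Compare in lower case so that '.RTF' and '.rtf' both match
            if name.lower().endswith(extension):
                matches.append(os.path.join(root, name))
    return sorted(matches)


# Throwaway folder with one nested subfolder (made-up file names)
with tempfile.TemporaryDirectory() as demo_dir:
    os.makedirs(os.path.join(demo_dir, 'nested'))
    for rel in ['one.RTF', 'notes.txt', os.path.join('nested', 'two.rtf')]:
        with open(os.path.join(demo_dir, rel), 'w') as f:
            f.write('x')

    demo_paths = collect_files(demo_dir)
    demo_names = [os.path.basename(p) for p in demo_paths]
    print("In total we have", len(demo_paths), "files to process.")
```

Storing full paths rather than bare filenames means each file can be opened later without having to remember which folder it came from.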


In [10]:
print(files)

['(81986)Files(50)(2).RTF', '(47940)Files(50)(6).RTF', '(63121)Files(50)(5).RTF', '(34860)Files(43).RTF', '(14051)Files(50)(8).RTF', '(74531)Files(50)(30).RTF', '(58039)Files(30).RTF', '(14082)Files(50)(4).RTF', '(45259)Files(50)(4).RTF', '(53865)Files(50)(3).RTF', '(8893)Files(7).RTF', '(79441)Files(50)(8).RTF', '(55354)Files(50)(4).RTF', '(78989)Files(16).RTF', '(18310)Files(50)(9).RTF', '(78456)Files(50)(49).RTF', '(77045)Files(28).RTF', '(49256)Files(50)(2).RTF', '(71672)Files(50)(9).RTF', '(85717)Files(50)(7).RTF', '(2487)Files(50)(2).RTF', '(81120)Files(50)(5).RTF', '(51984)Files(50)(3).RTF', '(1581)Files(50)(4).RTF', '(92678)Files(50)(5).RTF', '(83307)Files(50)(27).RTF', '(62609)Files(50)(5).RTF', '(22195)Files(50)(7).RTF', '(73746)Files(50)(37).RTF', '(76812)Files(50)(17).RTF', '(44772)Files(50)(4).RTF', '(62120)Files(50)(3).RTF', '(93029)Files(50)(32).RTF', '(66832)Files(19).RTF', '(14074)Files(50)(5).RTF', '(41252)Files(50)(38).RTF', '(77477)Files(50)(1).RTF', '(44279)Files(5

In [11]:
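# Let's take a look at the first file
file = files[0]
fname = os.path.join(readdir, file)
try:
    # The with statement closes the file again after reading
    with open(fname, encoding='utf8') as f:
        rtf = f.read()
except (OSError, UnicodeDecodeError):
    print('open error')

# The raw RTF markup looks like gibberish
rtf[:500]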


'{\\rtf1\\ansi\\ansicpg1252\\uc0\\stshfdbch0\\stshfloch0\\stshfhich0\\stshfbi0\\deff0\\adeff0{\\fonttbl{\\f0\\froman\\fcharset0\\fprq2{\\*\\panose 02020603050405020304}Times New Roman;}{\\f1\\froman\\fcharset2\\fprq2{\\*\\panose 05050102010706020507}Symbol;}{\\f2\\fswiss\\fcharset0\\fprq2{\\*\\panose 020b0604020202020204}Arial;}}{\\colortbl;\\red0\\green119\\blue204;\\red255\\green255\\blue255;\n\\red0\\green0\\blue0;\\red118\\green118\\blue118;}{\\stylesheet{\\s0\\snext0\\sqformat\\spriority0\\aspalpha\\aspnum\\adjustright\\ltrpar\\li0\\lin0\\ri0\\ri'

In [12]:
rtf_doc = rtf_to_text(rtf)

#this looks much better!
print(rtf_doc)




Spain opens first mosque in 500 years and hears echoes of a glorious past
The Associated Press
July 11, 2003, Friday, BC cycle


Copyright 2003 Associated Press  All Rights Reserved
Section: International News
Length: 577 words
Byline: By DANIEL WOOLLS, Associated Press Writer
Dateline: GRANADA, Spain
Body


The cry of a muezzin echoed from a hilltop overlooking the Alhambra as Granada, the former seat of Moorish rule in Spain, unveiled its first mosque in 511 years.
Dignitaries from Arab and Muslim countries worldwide attended the opening Thursday of the Great Mosque of Granada, crowning a fitful and emotionally charged project that began in 1981.
With repeated shouts of "Allahu Akhbar" (God is great), Sheik Sultan bin Mohammed al-Qassimi of the United Arab Emirates, which paid half the cost of construction, drew back a blood-red curtain to display a stone plaque inaugurating the building.
Later, a muezzin clad in white climbed atop the mosque's thick, square minaret and called Mus

### Class Discussion

We have read one of 285 files. It seems to contain many articles with different dates. Split the large string contained in the first file into articles.


DON'T SCROLL DOWN - THINK

### Answer
The following code splits the large string into articles:

In [26]:
split_text = rtf_doc.split(r'End of Document')
print(split_text[0])




Spain opens first mosque in 500 years and hears echoes of a glorious past
The Associated Press
July 11, 2003, Friday, BC cycle


Copyright 2003 Associated Press  All Rights Reserved
Section: International News
Length: 577 words
Byline: By DANIEL WOOLLS, Associated Press Writer
Dateline: GRANADA, Spain
Body


The cry of a muezzin echoed from a hilltop overlooking the Alhambra as Granada, the former seat of Moorish rule in Spain, unveiled its first mosque in 511 years.
Dignitaries from Arab and Muslim countries worldwide attended the opening Thursday of the Great Mosque of Granada, crowning a fitful and emotionally charged project that began in 1981.
With repeated shouts of "Allahu Akhbar" (God is great), Sheik Sultan bin Mohammed al-Qassimi of the United Arab Emirates, which paid half the cost of construction, drew back a blood-red curtain to display a stone plaque inaugurating the building.
Later, a muezzin clad in white climbed atop the mosque's thick, square minaret and called Mus
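A small wrapper around the split can tidy up the pieces. This is a sketch, assuming that every article is followed by the string "End of Document", in which case the piece after the final marker is empty and can be dropped. The function name `split_articles` and the miniature document are made up for illustration.

```python
def split_articles(document, marker='End of Document'):
    """Split one converted file into a list of article strings."""
    pieces = document.split(marker)
    # Strip surrounding whitespace and drop empty pieces,
    # for example the leftover after the final marker
    return [piece.strip() for piece in pieces if piece.strip()]


# Made-up miniature document standing in for rtf_doc
sample_doc = (
    "First headline\nBody\nFirst text.\n\nEnd of Document\n"
    "Second headline\nBody\nSecond text.\n\nEnd of Document\n"
)
sample_articles = split_articles(sample_doc)
print(len(sample_articles))
```

In the notebook, `split_articles(rtf_doc)` would play the role of `rtf_doc.split(r'End of Document')`.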

### Exercise to do at home

Get together as a group or work alone as you wish. Keep in mind that you have 285 files and that each file contains many articles, as shown above. Now develop something that loops over all files. Solve the following exercise:

1) Think about how to extract the date of each article.

2) Develop a RegEx to extract the text body from each article.

3) Put date and text body content in a pandas dataframe.

4) Extract at least one additional element of metadata. No need to do more than one.

5) Plot the number of articles over time. 

6) Plot the number of articles with some metadata characteristic over time.
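As one possible starting point for steps 1, 2 and 4, here is a sketch that works on the header of the single article shown earlier. The patterns are assumptions based on that one example and will probably need adjusting for articles that are formatted differently. Parsing the month name with `%B` also assumes an English locale.

```python
import re
from datetime import datetime

# Header and first sentence of the article shown earlier in this session
sample_article = """Spain opens first mosque in 500 years and hears echoes of a glorious past
The Associated Press
July 11, 2003, Friday, BC cycle


Copyright 2003 Associated Press  All Rights Reserved
Section: International News
Length: 577 words
Byline: By DANIEL WOOLLS, Associated Press Writer
Dateline: GRANADA, Spain
Body


The cry of a muezzin echoed from a hilltop overlooking the Alhambra as Granada, the former seat of Moorish rule in Spain, unveiled its first mosque in 511 years.
"""

# A line starting with a month name, a day, and a four-digit year
date_pattern = re.compile(r'^([A-Z][a-z]+ \d{1,2}, \d{4})', re.MULTILINE)
# Everything after a line that consists only of the word "Body"
body_pattern = re.compile(r'^Body\s*$(.*)', re.MULTILINE | re.DOTALL)
# The value after "Section:" as one example of additional metadata
section_pattern = re.compile(r'^Section: (.+)$', re.MULTILINE)

date_match = date_pattern.search(sample_article)
body_match = body_pattern.search(sample_article)
section_match = section_pattern.search(sample_article)

article_date = datetime.strptime(date_match.group(1), '%B %d, %Y')
article_body = body_match.group(1).strip()
article_section = section_match.group(1)

print(article_date.date(), '|', article_section)
print(article_body[:60])
```

When looping over thousands of articles, check whether each `search` returned `None` before calling `group`, since not every article will match.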

### Hints

I get 13475 articles, but only if I adjust the format to pick up some articles that are not otherwise recognized. After excluding articles with fewer than 100 characters I have 13462 articles. You should certainly have over 10000 articles if you do it right. Don't obsess about this!

My code starts as follows (sensitive stuff is covered by xxx). You can follow me here or completely ignore my code. In any case make sure you understand the loop! This is crucial for understanding loading files. 

My explanation: You see I am loading a file and then splitting it using the "End of Document" string. You also see I have a variation in here depending on the file format. If you launch this code it will just open all the files but not store anything.

Note that I end the code by initializing a list called articles (articles=[])! Build on that.

In [None]:
# Note: I'm using pandas to store the results. This works fine with the number of articles we will read in.
# I am setting up the notebook such that I have nice metadata using the time information.
df_test = pd.DataFrame(columns=['xxx','xxx','xxx','xxx', 'text'])

# List of the files that sit directly in readdir (the folder defined at the top of the notebook)
files = next(os.walk(readdir))[2]

#initializing
m=['',0,0]
j=0
i=1
print('Total files to process:', len(files))
for file in files:
    print('Processing file number', i)
    print(file)
    i+=1

    fname=os.path.join(readdir, file)
    # The with statement closes the file again after reading
    with open(fname, encoding='utf-8') as f:
        rtf = f.read()

    try:
        # Convert the RTF markup to plain text
        rtf_doc = rtf_to_text(rtf)
    except Exception:
        print('read error')

    else:
        text = rtf_doc
        # Split the file into articles at the 'End of Document' string
        artis = text.split('End of Document')
        # Variation depending on the file format: keep only what follows 'News|' in the first piece
        arti0 = artis[0].split('News|')
        artis[0] = arti0[-1]

    articles = []
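One way to build on the articles list is to append one dictionary per article and filter out very short texts afterwards. The sketch below uses made-up records and a hypothetical helper `keep_long_articles`. The keys `date` and `text` are assumptions, not the column names hidden behind xxx above.

```python
def keep_long_articles(articles, min_chars=100):
    """Drop records whose text is shorter than `min_chars` characters."""
    return [a for a in articles if len(a['text']) >= min_chars]


# Made-up records standing in for what the loop would collect
articles = []
articles.append({'date': '2003-07-11', 'text': 'A' * 250})
articles.append({'date': '2003-07-12', 'text': 'too short'})
articles.append({'date': '2003-07-12', 'text': 'B' * 100})

kept = keep_long_articles(articles)
print(len(articles), 'collected,', len(kept), 'kept')
```

A list of dictionaries like `kept` can be passed directly to `pd.DataFrame(kept)`, which turns each key into a column.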
