# Making a CSV data file from text lines
## A dataset for Norwegian preterites

by Koenraad De Smedt at UiB

---
This notebook demonstrates how to:

1.  Construct and write a CSV file
2.  Extract character attributes from words
3.  Fill in missing attributes
4.  Break up a problem into smaller steps.

CSV is a widely used format in data science, so it is useful to have some experience in converting data to this format. This demo produces a file that can be fed into a machine learning program like [Weka](https://waikato.github.io/weka-wiki/downloading_weka/).

A plain [list of Norwegian preterites](https://git.app.uib.no/desmedt/teaching/-/raw/main/vpret-et-te.txt) is used as a basis. The last two letters of each verb form make up the suffix (*et* or *te*) and will be extracted as the last attribute for the CSV file. The three attributes before that are the three letters preceding the suffix. All features must be delimited by a separator (usually a comma). Example:

>`abonnerte` -> `n,e,r,te`

>`adlet` -> `a,d,l,et`

If a preterite is shorter than the required 5 characters, a filler character should fill the places of the missing features.

>`aget` -> `+,a,g,et`

>`ået` -> `+,+,å,et`

---

## Define variables and helper functions

First define some global variables. This makes it easy to change them if the need should arise. The filler and separator should be non-word characters.

In [None]:
filler = '+'
sep = ','

Now break down the problem and define some helper functions. Taking the five final characters of a word yields fewer than five characters if the word is too short, so step one is to write a helper function that ‘fills up’ a word with fillers if necessary; otherwise the word is returned unchanged.

In [None]:
def fill_pret(verb):
  # Pad short words on the left with the filler so that 5 characters are available.
  if len(verb) < 5:
    return (5 - len(verb)) * filler + verb
  return verb

fill_pret('ået')

In [None]:
fill_pret('abonnerte')
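
As a side note, Python's built-in `str.rjust` method performs the same kind of left padding. The sketch below is only an illustrative alternative to the helper above; the name `fill_pret_rjust` and its parameters `width` and `fillchar` are hypothetical and not part of the original notebook. Note that `str.rjust` only accepts a single-character fill.

```python
def fill_pret_rjust(verb, width=5, fillchar='+'):
    """Left-pad verb with fillchar up to width characters (hypothetical variant)."""
    # str.rjust returns the string unchanged if it is already long enough.
    return verb.rjust(width, fillchar)

print(fill_pret_rjust('ået'))        # ++ået
print(fill_pret_rjust('abonnerte'))  # abonnerte
```

Passing the width and filler as parameters, rather than reading globals, keeps this variant usable on its own.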

Step two is to write the main conversion function, which first calls the helper function and then joins the four features (three letters and the suffix) with the separator.

In [None]:
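def convert_pret(verb):
  verb = fill_pret(verb)
  # sep.join over the four characters verb[-5:-1] inserts a separator after each
  # of the three letters before the suffix; appending verb[-1] then completes
  # the two-letter suffix.
  return sep.join(verb[-5:-1]) + verb[-1]

convert_pret('ået')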

In [None]:
convert_pret('abonnerte')
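
To make the slicing easier to follow, here is a more explicit sketch of the same conversion that names each part before joining. The function `split_features` and its default arguments are hypothetical; it is meant as an illustration under the assumptions of this notebook (five characters, two-letter suffix), not as a replacement for `convert_pret`.

```python
def split_features(verb, fillchar='+', sep=','):
    """Hypothetical step-by-step variant of the conversion above."""
    padded = verb.rjust(5, fillchar)   # ensure at least five characters
    letters = list(padded[-5:-2])      # the three letters before the suffix
    suffix = padded[-2:]               # the last two letters, e.g. 'et' or 'te'
    return sep.join(letters + [suffix])

for verb in ['abonnerte', 'adlet', 'aget', 'ået']:
    print(verb, '->', split_features(verb))
```

The four lines printed should match the examples given at the top of the notebook.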

## Convert input file to output file

Now we are ready to apply the conversion to every line that is read from a file and to write the results to another file. The input file is remotely available. The output will be a temporary file (downloadable).

In [None]:
import requests
vpret_url = 'https://git.app.uib.no/desmedt/teaching/-/raw/main/vpret-et-te.txt'

Open the URL as a stream from which you can read lines. With an open output file, first print a header (column names). Then iterate over the lines from the stream and print the converted lines to the output file. The `.strip` method strips whitespace (including newlines) from the beginning and end of the line.

In [None]:
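vpret_stream = requests.get(vpret_url, stream=True)
vpret_stream.raise_for_status()  # stop here if the download failed
if vpret_stream.encoding is None:
  vpret_stream.encoding = 'utf-8'  # assumption: the word list is UTF-8

with open('vpret.csv', 'w') as outfile:
  print(sep.join(['ant','pen','fin','suffix']), file=outfile)  # header row
  for line in vpret_stream.iter_lines(decode_unicode=True):
    verb = line.strip()
    if verb:  # skip blank lines
      print(convert_pret(verb), file=outfile)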
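
The same header-then-rows pattern can also be tried offline with Python's standard `csv` module. The following is a minimal sketch under stated assumptions: `sample_lines` is a made-up stand-in for the remote word list, `to_row` is a hypothetical helper, and the output goes to an in-memory buffer instead of a file so that nothing is downloaded or written to disk.

```python
import csv
import io

# Made-up stand-in for the lines of the remote file (assumption).
sample_lines = ['abonnerte\n', 'adlet\n', '\n', 'aget\n', 'ået\n']

def to_row(verb, fillchar='+'):
    """Return the three letters before the suffix plus the suffix, as a list."""
    padded = verb.rjust(5, fillchar)
    return list(padded[-5:-2]) + [padded[-2:]]

buffer = io.StringIO()
writer = csv.writer(buffer, delimiter=',', lineterminator='\n')
writer.writerow(['ant', 'pen', 'fin', 'suffix'])  # header row
for line in sample_lines:
    verb = line.strip()
    if verb:  # skip blank lines
        writer.writerow(to_row(verb))

print(buffer.getvalue())
```

One reason to consider `csv.writer` is that it quotes a field automatically if it happens to contain the delimiter, which manual joining does not do.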

Check the contents of the resulting file. Read and print only the first 100 characters.

In [None]:
with open('vpret.csv') as f:
  print(f.read(100))

Alternatively, show the first ten lines with an operating system command.

In [None]:
!head vpret.csv
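
The `head` command is only available on Unix-like systems. As a portable sketch, the first lines of a file can also be read in pure Python with `itertools.islice`. The function name `head`, its parameters, and the throwaway demo file are hypothetical; the encoding is assumed to be UTF-8 and may need adjusting if the file was written with a different one.

```python
import os
import tempfile
from itertools import islice

def head(path, n=10, encoding='utf-8'):
    """Return the first n lines of a text file, without trailing newlines."""
    with open(path, encoding=encoding) as f:
        return [line.rstrip('\n') for line in islice(f, n)]

# Demo on a small throwaway file; in the notebook you would call head('vpret.csv').
with tempfile.TemporaryDirectory() as tmpdir:
    demo_path = os.path.join(tmpdir, 'demo.csv')
    with open(demo_path, 'w', encoding='utf-8') as f:
        f.write('ant,pen,fin,suffix\nn,e,r,te\na,d,l,et\n')
    first_two = head(demo_path, n=2)

print(first_two)  # ['ant,pen,fin,suffix', 'n,e,r,te']
```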

### Exercises

1.   Change the filler and separator. Run the program and check the result.
2.   If you want to use 6 letters, for instance, instead of 5, you have to change the program in several places. This may be inconvenient and prone to errors. A better way is to make a global variable `nletters` for the number of letters of the word that will be used in the output. Add this variable at the beginning and adapt the program where necessary to use this variable.