In [2]:
%load_ext autoreload
%autoreload 2

# These directives will automatically reload any modules that are loaded from file.

First, let's create a bunch of character identifier functions. If you have a character, pass it into any or all of these functions to find out what it is, or isn't.

In [3]:
def is_char(character):
    """
    Returns True if the character passed in is a letter,
    False otherwise. 
    
    Uses ASCII code points to determine what range a character is in.
    """
    if ord(character) >= 65 and ord(character) <= 90:
        return True
    elif ord(character) >= 97 and ord(character) <= 122:
        return True
    else:
        return False
    

def is_digit(character):
    """
    Returns True if the character passed in is a digit,
    False otherwise. 
    
    Uses ASCII code points to determine what range a character is in.
    """
    if ord(character) >= 48 and ord(character) <= 57:
        return True
    else:
        return False
    

def is_whitespace(character):
    """
    Returns True if the character passed in is white space
    (space, tab),
    False otherwise. 
    
    Uses ASCII code points to determine what range a character is in.
    """
    # 32 is the space character, 9 is the horizontal tab.
    if ord(character) == 32 or ord(character) == 9:
        return True
    else:
        return False
    
    

In [4]:
is_whitespace(' ')

True

In [5]:
is_whitespace(' ')

True

In [6]:
import unittest

class TestCharacterClassFunctions(unittest.TestCase):
    
    def testLetter(self):
        self.assertTrue(is_char('a'))
        self.assertFalse(is_char('*'))
        print("letters work")
    
    def testDigit(self):
        self.assertTrue(is_digit('0'))
        self.assertFalse(is_digit('*'))
        print("numbers work")
    
    def testWhitespace(self):
        self.assertTrue(is_whitespace('\t'))
        self.assertTrue(is_whitespace(' '))
        self.assertFalse(is_whitespace('*'))
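        print("whitespace works")
    
    def run_all(self):
        # Named run_all so it does not shadow unittest.TestCase.run,
        # which the standard test runners call.
        self.testLetter()
        self.testDigit()
        self.testWhitespace()


In [7]:
x = TestCharacterClassFunctions()
x.run_all()

letters work
numbers work
whitespace works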


 Read the text from an external file. 

In [8]:
# Note: encoding='' is not a valid codec name, so this call raises LookupError.
windatafile = open('windata.txt', mode='r', encoding='')

for line in windatafile:
    print(line, end='')

LookupError: unknown encoding: 

In [None]:
# The with block closes the file automatically once the loop finishes.
with open('data.txt', mode='r', encoding='utf-8') as datafile:
    for line in datafile:
        print(line, end='')

In [None]:
help(is_digit)

In [None]:
type(datafile)

In [None]:
help(is_whitespace)

In [None]:
def is_operator(chars):
    """
    Returns True if the string passed in is one of the
    recognized operators ("<<", "+=", "="), False otherwise.
    """
    return chars in ("<<", "+=", "=")

In [None]:
is_operator("<<")

In [None]:
# The with block closes the file automatically once the loop finishes.
with open('data_bom.txt', mode='r', encoding='utf-8') as datafile:
    for line in datafile:
        print(line, end='')
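
If `data_bom.txt` starts with a UTF-8 byte order mark, as its name suggests, decoding it with `utf-8` leaves the mark in the text as a leading `\ufeff` character, which none of the character tests above expect. The sketch below writes a small throwaway file (the name `bom_demo.txt` is made up for this demonstration) and compares `utf-8` with Python's `utf-8-sig` codec, which drops a leading mark.

```python
import os
import tempfile

# Write a tiny file that begins with the three BOM bytes EF BB BF.
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "bom_demo.txt")  # hypothetical file name
    with open(path, "wb") as handle:
        handle.write(b"\xef\xbb\xbfx = 1\n")

    # Plain utf-8 keeps the mark as the first character of the text.
    with open(path, mode="r", encoding="utf-8") as handle:
        plain = handle.read()

    # utf-8-sig removes a leading mark if one is present.
    with open(path, mode="r", encoding="utf-8-sig") as handle:
        stripped = handle.read()

print(repr(plain))
print(repr(stripped))
```

Whether the real data file needs `utf-8-sig` depends on how it was saved, so this is something to check rather than assume.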

## Overview:

This is a state machine with six states, an input stream, an output stream, and a 'temporary' queue.



#### String Storage
Temporary and permanent character storage (files and queues) is written in single quotes and includes:
* 'input' file
* 'output' file
* 'temporary' queue



#### Character Classes
Single characters are grouped into character classes. Code characters cover digits (numbers), alphabetic characters (letters), and the other characters that form, alone or in groups, an operator, an assignment, or a line terminator. The remaining classes are whitespace characters (spaces and tabs), newline characters, and the forward slash, which indicates a comment when two slashes appear with no whitespace between them.
These classes are shown inside less-than/greater-than pairs. Classes indented from another class are a subset of that class:
* < code >
  * < alpha >
  * < digit >
  * < operator >
  * < assignment >
  * < terminator >
* < whitespace >
  * < space >
  * < tab >
* < slash >
* < newline >
  * < carriage return >
  * < line feed >
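
One possible way to map a character onto the top-level classes listed above is sketched here. The function name `classify` and the returned strings are made up for illustration, and the sketch assumes that anything that is not whitespace, a slash, or a newline counts as < code >.

```python
def classify(char):
    """Return the name of the top-level class for a single character."""
    if char in (" ", "\t"):
        return "whitespace"
    if char == "/":
        return "slash"
    if char in ("\r", "\n"):
        return "newline"
    # Assumption: letters, digits, operators, assignment and terminator
    # characters all fall through to the code class.
    return "code"


classes = [classify(c) for c in "x =\t/\n"]
print(classes)
```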



#### Machine States
The states of the state machine are shown in ALL CAPS:
* START
* SPACE
* COMMENT
* FLUSHLINE
* ENDLINE
* TOKEN
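
The six states could be held in an `Enum` so that a misspelled state name fails loudly instead of silently never matching. This is only one option; the class name `State` is an assumption, not part of the plan.

```python
from enum import Enum, auto


class State(Enum):
    """The six machine states, in the order listed above."""
    START = auto()
    SPACE = auto()
    COMMENT = auto()
    FLUSHLINE = auto()
    ENDLINE = auto()
    TOKEN = auto()


state = State.START
print(state.name, len(State))
```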

# Setup
Steps that must be taken before the state machine begins work:
* create empty 'temporary' queue
* create an 'output' stream
* open 'input' file for reading
* set machine to START state

## START state
  * read a char from 'input'
  * if char is < whitespace >:
    * replace char with " " // a literal space, to replace any tabs or other potential whitespace chars
    * push char to 'temporary' queue // push to queue right away or keep in a single-char temp variable?
    * switch to SPACE state
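  * if char is < slash >:
    * push char to 'temporary' queue // COMMENT writes it out if it turns out to be a single slash
    * switch to COMMENT state
  * if char is < newline >:
    * push char to 'temporary' queue // ENDLINE inspects the last char in 'temporary'
    * switch to ENDLINE state
  * if char is < code >:
    * push char to 'temporary' queue // otherwise the first char of the token is lost
    * switch to TOKEN state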
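
The START transitions could be sketched as a function that takes the char just read and returns the name of the next state. The name `start_step` is made up, and the sketch assumes every char is buffered in 'temporary' before the switch, because the later states read a new char straight away and expect to find the previous one in the queue.

```python
def start_step(char, temporary):
    """Apply one START transition and return the name of the next state."""
    if char in (" ", "\t"):
        temporary.append(" ")  # any whitespace is stored as one literal space
        return "SPACE"
    # Assumption: buffer the char so the next state can still see it.
    temporary.append(char)
    if char == "/":
        return "COMMENT"
    if char in ("\r", "\n"):
        return "ENDLINE"
    return "TOKEN"  # everything else is treated as < code >


for first_char in ("\t", "/", "\n", "x"):
    temporary = []
    print(repr(first_char), start_step(first_char, temporary), temporary)
```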

## SPACE state
  * read char from 'input'
  * if char is < slash >:
    * push char to 'temporary' queue
    * switch to COMMENT state
  * if char is < code >:
    * write 'temporary' to 'output' file
    * clear all elements of 'temporary'
    * push char to 'temporary' queue
    * switch to TOKEN state
  * if char is < newline >:
    * clear all elements of 'temporary'
    * push char to 'temporary' queue
    * switch to ENDLINE state
  * if char is < whitespace >, stay in SPACE state // don't push. Only want EXACTLY one space
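
A sketch of the SPACE transitions, under the assumption that 'temporary' is a plain list and that the 'output' file is modelled as a list of written strings. The name `space_step` is hypothetical.

```python
def space_step(char, temporary, output):
    """Apply one SPACE transition and return the name of the next state."""
    if char in (" ", "\t"):
        return "SPACE"  # nothing pushed: the queue already holds one space
    if char == "/":
        temporary.append(char)
        return "COMMENT"
    if char in ("\r", "\n"):
        temporary.clear()  # drop the trailing space before the line ends
        temporary.append(char)
        return "ENDLINE"
    # < code >: flush the single space, then start collecting the token.
    output.append("".join(temporary))
    temporary.clear()
    temporary.append(char)
    return "TOKEN"


temporary, output = [" "], []
after_tab = space_step("\t", temporary, output)   # extra whitespace is ignored
after_code = space_step("x", temporary, output)   # the space is written out
print(after_tab, after_code, output, temporary)
```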



## TOKEN state
  * read char from 'input'
  * if char is < code >:
    * push char to 'temporary' queue
    * stay in TOKEN state
  * if char is < whitespace >:
    * write 'temporary' to 'output' file
    * clear all elements of 'temporary'
    * push " " to 'temporary' queue
    * switch to SPACE state
  * if char is < slash >:
    * write 'temporary' to 'output' file
    * clear all elements of 'temporary'
    * push char to 'temporary' queue
    * switch to COMMENT state
  * if char is < newline >:
    * write 'temporary' to 'output' file
    * clear all elements of 'temporary'
    * push char to 'temporary' queue
    * switch to ENDLINE state
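
The TOKEN transitions share one pattern: a code char extends the token, and any other class flushes it first. The sketch below uses the made-up name `token_step` and again models 'output' as a list of written strings.

```python
def token_step(char, temporary, output):
    """Apply one TOKEN transition and return the name of the next state."""
    if char not in (" ", "\t", "/", "\r", "\n"):
        temporary.append(char)  # still inside the token
        return "TOKEN"
    # Any other class ends the token: write it out and empty the queue.
    output.append("".join(temporary))
    temporary.clear()
    if char in (" ", "\t"):
        temporary.append(" ")  # whitespace is normalized to a single space
        return "SPACE"
    temporary.append(char)
    return "COMMENT" if char == "/" else "ENDLINE"


temporary, output = ["x"], []  # assumes 'x' was buffered by an earlier state
state = "TOKEN"
for char in "yz ":
    state = token_step(char, temporary, output)
print(state, output, temporary)
```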



## COMMENT state 
// this state is entered on the FIRST slash char: should exit state in one char
  * read char from 'input'
  * if char is < slash >:
    * clear all elements of 'temporary'
    * switch to FLUSHLINE state
  * if char is < whitespace >:
    * write 'temporary' to 'output' file
    * clear all elements of 'temporary'
    * push " " to 'temporary' queue
    * switch to SPACE state
  * if char is < code >:
    * write 'temporary' to 'output' file // this is an edge case of a single slash
    * clear all elements of 'temporary'
    * push char to 'temporary' queue
    * switch to TOKEN state
  * if char is < newline >:
    * write 'temporary' to 'output' file // this is an edge case of a single slash
    * clear all elements of 'temporary'
    * push char to 'temporary' queue
    * switch to ENDLINE state  
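
A sketch of the COMMENT transitions, assuming the first slash is already sitting in 'temporary' when the state is entered. The name `comment_step` is hypothetical, and treating a tab the same as a space is an assumption of this example.

```python
def comment_step(char, temporary, output):
    """Apply one COMMENT transition and return the name of the next state."""
    if char == "/":
        temporary.clear()  # a double slash: discard the buffered slash
        return "FLUSHLINE"
    # Single-slash edge case: the buffered slash was real code, so write it.
    output.append("".join(temporary))
    temporary.clear()
    if char in (" ", "\t"):  # assumption: tabs are handled like spaces
        temporary.append(" ")
        return "SPACE"
    temporary.append(char)
    return "ENDLINE" if char in ("\r", "\n") else "TOKEN"


temporary, output = ["/"], []
double_slash = comment_step("/", temporary, output)

temporary2, output2 = ["/"], []
single_slash = comment_step("2", temporary2, output2)
print(double_slash, output, single_slash, output2, temporary2)
```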


## FLUSHLINE state  
// only entered when double-slash comment is found
  * read char from 'input'
  * if char is < newline >:
    * push char to 'temporary' queue
    * switch to ENDLINE state
  * else:
    * stay in FLUSHLINE state
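
FLUSHLINE is the simplest state, since it discards everything up to the newline. A sketch, with the made-up name `flushline_step`:

```python
def flushline_step(char, temporary):
    """Apply one FLUSHLINE transition and return the name of the next state."""
    if char in ("\r", "\n"):
        temporary.append(char)  # ENDLINE looks at this char next
        return "ENDLINE"
    return "FLUSHLINE"  # comment text is dropped, nothing is buffered


temporary = []
state = "FLUSHLINE"
for char in " a note\n":
    state = flushline_step(char, temporary)
    if state != "FLUSHLINE":
        break
print(state, temporary)
```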



## ENDLINE state
  * if last char in 'temporary' queue is `\r` (carriage return): // Windows line-ending
    * read char from 'input'
    * if it's `\n` (line feed):
      * write < carriage return > to 'output' file
      * write < line feed > to 'output' file
      * clear all elements of 'temporary' queue
      * switch to START state
    * else:
      * signal an ERROR // this line ending is malformed.
  * else:
    * write < line feed > to 'output' file
    * clear all elements of 'temporary'
    * switch to START state
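
A sketch of ENDLINE, assuming the input is available as an iterator of single characters and 'output' is a list of written strings. The name `endline_step` and the choice of `ValueError` for the error signal are assumptions of this example.

```python
def endline_step(source, temporary, output):
    """Write the line ending and return the name of the next state."""
    if temporary and temporary[-1] == "\r":
        following = next(source, "")  # "" means the input ran out
        if following != "\n":
            raise ValueError("malformed line ending: lone carriage return")
        output.append("\r\n")  # Windows line ending, kept as found
    else:
        output.append("\n")
    temporary.clear()
    return "START"


output = []
windows = endline_step(iter("\nx"), ["\r"], output)  # queue ends in \r
unix = endline_step(iter("x"), ["\n"], output)       # queue ends in \n
print(windows, unix, output)
```

One thing to keep in mind: in text mode with the default `newline=None`, `open()` translates `\r\n` to `\n` while reading, so the carriage-return branch is only reachable if the 'input' file is opened with `newline=''`.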


A little test on `tempqueue`, which is a list (or array). The slice `[-1:]` returns a one-element list holding the last element (the index `[-1]` returns the element itself), which is helpful to know.

In [None]:
tempqueue = [ 'a', 'b', 'c', '=']

In [None]:
tempqueue[-1:]

In [None]:
tempqueue.append('$')

In [None]:
tempqueue[-1:]
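
The difference between the slice and the plain index matters for the ENDLINE check, so here is a short side-by-side comparison. The variable names other than `tempqueue` are made up for the example.

```python
tempqueue = ['a', 'b', 'c', '=']

last_as_list = tempqueue[-1:]  # slice: a new one-element list
last_item = tempqueue[-1]      # index: the element itself

# On an empty list the slice is safe, while the index raises IndexError.
empty = []
safe = empty[-1:]
try:
    empty[-1]
    index_failed = False
except IndexError:
    index_failed = True

print(last_as_list, last_item, safe, index_failed)
```

The slice is convenient when the queue might be empty, since a comparison such as `tempqueue[-1:] == ['\r']` never raises.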

This is only *an outline* of the plan.