# Week 4: Strings of Data

In computer programming, the word "string" is used to refer to text stored and manipulated by the program. Logically, strings are sequences of individual characters; Python lets you iterate over a string's characters or index them as if the string were a list of characters. The notion of character is general - it could be any letter in the alphabet, a digit, or any Unicode character, including emojis. As of Python 3, strings default to supporting Unicode, so one string could simultaneously have an English word, a math formula, and Chinese characters without making any special effort.

String support is ubiquitous across programming languages. The features of strings vary from language to language, but broadly cover reading and writing raw data from files, converting that raw data into more useful forms like numbers, and converting useful data like numbers back to strings for storage. We will cover Python's string support with a focus on converting between different forms of data.

**Topic Outline**
1. What is a String?
2. Print Anything
3. Basic String Usage
4. Parsing Data from Strings

**Think About It**
- When you see a number, how do you recognize the number as a whole from the individual digits?
- How do you separate words on a page?

## 4.1 Lesson: What is a String? 

A string is a sequence of characters representing text. The following video looks at what we can do with strings with just that description. The example code is Python specific, but the concepts are universal across programming languages.

This video introduces common properties of strings shared across programming languages.

In [1]:
print("hello world.")

hello world.


In [6]:
#length of string
len("hello world.")

12

In [5]:
#loop over string
for c in "hello world.":
    print(c)

h
e
l
l
o
 
w
o
r
l
d
.


In [7]:
# string indexing
"hello world"[0]

'h'

In [8]:
"hello world"[3:8]

'lo wo'

## 4.2 Lesson: Print Anything

You have already seen a number of examples where various objects and values are output in this notebook. You may have noticed that the details vary. Sometimes the print function is used. Sometimes the last expression is the output. Sometimes there are quotes. Sometimes there are no quotes. This lesson will dig in on some of those details as examples of string conversions and manipulations.

This video explores different ways to print strings, and what the different ways highlight.

In [9]:
"Hello Everybody" #output has quotes

'Hello Everybody'

Passing a string to `print` displays it without quotes, whereas a bare string expression at the end of a cell is displayed with quotes. The quotes are not part of the string; they only mark where the string begins and ends.

In [10]:
len("Hello Everybody!")

16

In [12]:
'Hello Everybody!' #output of the repr function

'Hello Everybody!'

`repr` is a function that converts any Python object into a string. `repr` tries to return a string that you can copy and paste into Python code to get the original object or value back.

In [13]:
4.2e1

42.0

In [14]:
print(4.2e1)

42.0


In [16]:
str(4.2e1)

'42.0'

#### `print` vs `str` and `repr`
- `print` is good for things you look at a lot
- For debugging, exploration and quick checks, use the default `str` and `repr`

**Table: String Conversion Functions**
| Function Name | Description |
| :--- | :--- |
| `str` | converts any object into a string. Also the class for string objects. |
| `repr` | converts an object into a string. Usually returns a string that you could copy paste into Python code to reconstruct the object |

#### References:
**Source** | https://pyformat.info/ This web site is devoted to Python formatting options. It is old, and does not include f-strings, but is clearer than Python’s official documentation where it has coverage.


**Source** | https://docs.python.org/3/library/re.html Regular expressions are a standard way to write patterns for searching for and matching with strings. The “re” module is Python’s built-in implementation of regular expressions.

## 4.3 Lesson: Basic String Usage

In data science, most work with strings consists of parsing inputs into something more meaningful, such as numbers, representing labels or categories such as “sunny” or “raining”, or writing output. Later, you will see more kinds of string manipulation, particularly for language models and other text-specific analysis, but these three will cover us for most problems.

How will we cover this work? Parsing inputs will be a major focus of next week, but many useful functions will be mentioned in this overview of strings. Using strings as labels or categories does little manipulation of them - mostly they are used as keys in other data structures. Writing output will also be covered next week, but this lesson will cover useful functions and formatting capability to generate strings that you want to output.

#### Key Terms
**Parsing**: The process of analyzing a sequence to understand its structure and extract meaning.

### A Whirlwind Tour of Python, String Type:

Strings in Python are created with single or double quotes.

In [17]:
message = "what do you like?"
response = "spam"

Python has many extremely useful string functions and methods. Here are a few of them:

In [18]:
#length of string:
len(response)

4

In [19]:
#make upper-case. See also str.lower()
response.upper()

'SPAM'

In [20]:
#capitalize, also see str.title()
message.capitalize()

'What do you like?'

In [21]:
# concatenation with +
message + response

'what do you like?spam'

In [22]:
# multiplication is multiple concatenation
5 * response

'spamspamspamspamspam'

In [23]:
# access individual characters (zero-based indexing)
message[0]

'w'

### Using Strings in Python
A string is a piece of text. `str` is the class for Python strings. Any object can be passed to `str` to turn the object into a string. `repr` can also be used to turn an object into a string.

In [24]:
3.0

3.0

In [25]:
str(3.0)

'3.0'

In [26]:
"hello world"

'hello world'

In [27]:
print(repr("hello world"))

'hello world'


In [28]:
repr("hello world")

"'hello world'"

In [29]:
repr('the dog said "hello" to me.') #the output tries to output a string that can be copied into python code. 

'\'the dog said "hello" to me.\''

The function `repr()` tries to output a string that you can copy and paste to remake the object in Python code. This generally works well for built-in object types, but custom types need code to support it well. When a character cannot appear literally in the output, such as a quote that matches the surrounding quotes, `repr` generates a visible escape sequence for it. In Python, escape sequences start with a backslash `\`, as can be seen in the above example.
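As a minimal sketch of what "code to support it" can look like, a custom class can define the special method `__repr__`. The `Point` class here is hypothetical and exists only for illustration.

```python
class Point:
    """Hypothetical example class used only to illustrate __repr__."""

    def __init__(self, x, y):
        self.x = x
        self.y = y

    def __repr__(self):
        # Return text that looks like the constructor call that built the object.
        return f"Point({self.x!r}, {self.y!r})"


p = Point(1, 2.5)
print(repr(p))  # Point(1, 2.5)
print(str(p))   # Point(1, 2.5), because str falls back to __repr__ when __str__ is not defined
```

Without `__repr__`, the default output would be a generic description of the object that cannot be pasted back into code.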

#### What is a String? 

In Python 3, a string is a sequence of Unicode characters. Unicode is a standard for representing the characters of writing systems from all over the world.

Since strings are sequences, you can write a `for` loop over a string, as in the loop over `"hello world."` in Lesson 4.1.
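A short sketch of looping over a string that mixes scripts may help here. The sample text is an arbitrary choice, written with escape sequences so the code points are unambiguous.

```python
# "h\u00e9llo \u4f60\u597d" is the text: h, e-acute, l, l, o, space, and two Chinese characters
text = "h\u00e9llo \u4f60\u597d"

for ch in text:
    # ord gives the Unicode code point of each character
    print(ch, ord(ch))

print(len(text))  # 8: each character counts once, whatever its script
```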

### Formatting Data in Strings
Using f-strings and the format function to easily include data in strings

F-strings replaced a lot of clunky string formatting functions. Let's look at an example.

In [3]:
x = 3.14
print(f"hi! X = {x}")

hi! X = 3.14


If you put an `f` before the string, Python will look for sections marked with braces and evaluate their contents as Python expressions. Variables are the most common case, but more complicated expressions are allowed.

In [7]:
y = 42
print(f"the answer is {y/x}")

the answer is 13.375796178343949


You can also specify details about how to format those expressions.

In [8]:
print(f"the answer is {y / x:.2f}")

the answer is 13.38


- The `:` marks the beginning of the formatting. The `f` stands for floating point, so a real number, and the `.2` specifies two digits after the decimal point.

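- If you want no digits after the decimal point, use `.0f`. Note that this rounds to the nearest integer rather than truncating; the example below applies floor division (`//`) first, so the value is already a whole number.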

In [9]:
print(f"the answer is {y // x:.0f}")

the answer is 13
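To make the behavior of `.0f` concrete, here is a small comparison. The value `13.6` is an arbitrary choice for illustration.

```python
value = 13.6

print(f"{value:.0f}")       # 14: the format spec rounds
print(int(value))           # 13: int() truncates toward zero
print(f"{value // 1:.0f}")  # 13: floor division first, then formatting
```

If you need truncation rather than rounding, convert or floor the number before formatting it.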


- If you have an integer, you can use `:d`, or just skip it:

In [10]:
print(f"{y:d}")

42


In [11]:
print(f"{y}")

42


If you want to format the number with commas, add a `:,d`

In [13]:
z=1234567
print(f"{z:,d}")

1,234,567


More than one expression can be inserted:

In [14]:
print (f"the answer to {y} / {x} is {y/x}")

the answer to 42 / 3.14 is 13.375796178343949
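The `format` method mentioned at the start of this section accepts the same format specifications as f-strings. The sketch below reuses `x` and `y` from above; the template strings and the `name` and `count` arguments are made up for illustration.

```python
x = 3.14
y = 42

# Positional fields are filled in order from the arguments.
template = "the answer to {} / {} is {:.2f}"
print(template.format(y, x, y / x))  # the answer to 42 / 3.14 is 13.38

# Named fields are filled from keyword arguments.
named = "{name} has {count:,d} rows"
print(named.format(name="table", count=1234567))  # table has 1,234,567 rows
```

Unlike an f-string, the template can be stored in a variable and filled in later with different values.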


**Table: String Methods to Know**
| Method Name | Description |
| :--- | :--- |
| `capitalize` | Capitalizes the first character of a string and lowercases the rest. |
| `encode` | Converts a string into bytes, defaulting to UTF-8. Sometimes needed for writing files. |
| `format` | Uses the string as a template and takes in arguments to fill in that template. |
| `find` | Looks for another string inside the original string. Returns the index where it was found, or -1 otherwise. |
| `index` | Like `find`, but raises an exception if not found. |
| `startswith`/`endswith` | Returns `True` or `False` depending on whether the string starts (or ends) with the given string. |
| `split` | Returns a list of strings from splitting on the separator string given as input. Defaults to splitting on runs of whitespace if no separator is given. |
| `replace` | Does find and replace with the input arguments. |
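The following sketch exercises several of the methods in the table on one made-up string, so the return values can be compared side by side.

```python
s = "name=Ada Lovelace"

print(s.find("="))             # 4
print(s.find("?"))             # -1
print(s.startswith("name"))    # True
print(s.endswith(".txt"))      # False
print(s.split("="))            # ['name', 'Ada Lovelace']
print("a  b \t c".split())     # ['a', 'b', 'c']
print(s.replace("Ada", "A."))  # name=A. Lovelace
print(s.encode())              # b'name=Ada Lovelace'

# index behaves like find, but signals a miss with an exception.
try:
    s.index("?")
except ValueError as err:
    print("index raised:", err)
```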

## 4.4 Lesson: Parsing Data From Strings
 
We already saw several ways to make strings from different types. (The easy answer was to use the `str` function.) How do we convert from strings back to other data types?

In practice, you will not need to write your own code parsing numeric types like integers and floating point numbers, but you may need to write code separating out those numbers. In this lesson, you will see a video showing how to parse an integer, but generally, you should just use the built-in functions `int` or `float`, depending on what type of number you expect. However, you will almost inevitably run into examples where something unexpected is presented as a "number", and suddenly an exception will happen. So, we will also show you how to catch exceptions and deal with them. We will not cover how you should respond to invalid numbers; that will depend on your application. But you will be able to handle them as you see fit.

#### Parsing Numbers from Strings
This video presents a very simple example of parsing integers from scratch and shows how to use the built-in Python functionality. 

Let's write a function to parse an integer from a string:

In [17]:
# bad parser
def my_parse_int(s):
    output = 0
    for c in s:
        output = output * 10 + ord(c) - ord('0')
        print('TEMP', output)
    return output

The print statement in the loop will show how each output value is changing, as each character is read.

In [18]:
my_parse_int('1234')

TEMP 1
TEMP 12
TEMP 123
TEMP 1234


1234

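What is this `ord` function? It returns the Unicode code point of a character. The digits `'0'` through `'9'` have consecutive code points, so `ord(c) - ord('0')` is the numeric value of the digit `c`.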

In [19]:
ord('0')

48

In [20]:
ord('1')

49

In [21]:
ord('2')

50
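One reason the parser above is labeled "bad" is that it accepts any character: `my_parse_int('12a')` quietly returns 169, because `ord('a') - ord('0')` is 49. Below is a sketch of a stricter version. The name `parse_digits` is hypothetical, and it assumes only non-negative integers written with the ASCII digits 0 to 9.

```python
def parse_digits(s):
    """Parse a non-negative integer from ASCII digits, or raise ValueError."""
    if not s:
        raise ValueError("empty string")
    output = 0
    for c in s:
        digit = ord(c) - ord("0")
        # Anything outside 0..9 is not an ASCII digit.
        if not 0 <= digit <= 9:
            raise ValueError(f"not a digit: {c!r}")
        output = output * 10 + digit
    return output


print(parse_digits("1234"))  # 1234
try:
    parse_digits("12a")
except ValueError as err:
    print("rejected:", err)  # rejected: not a digit: 'a'
```

In practice the built-in `int` already does this checking, and also handles signs and surrounding whitespace.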

#### Code Example: Catching Exceptions from Invalid Numbers
In the last video, you saw that calling `int` or `float` with bad number strings would throw an exception. If this happens, your program will stop with an error unless you have arranged to catch the exception. Here's an example of how to do that.

In [23]:
def parse_int_with_default(s, default=None):
    try:
        return int(s)
    except ValueError:
        return default

In [24]:
parse_int_with_default('3')

3

In [25]:
parse_int_with_default('abc')

In [27]:
parse_int_with_default('abc', -9999)

-9999

What default should you use? Is it better to let the exception happen and expose the bad data? That depends entirely on the context that you are working in.

In [28]:
x = "This is a test."

In [29]:
x.startswith("this")

False

In [30]:
x == x.capitalize()

True

In [31]:
x.replace("This", "test") == "test"

False

In [32]:
len(x.split()) == 4

True

In [33]:
x.find("test") == 10

True

# Week 4 Topic 2: Reading and Writing Data in Files

Data science projects usually involve reading and parsing data from files. Files will typically be your first source of historical data when starting a project. Even if your project ultimately pulls from production databases, you should generally start testing with a snapshot saved to a file (and seriously consider whether your project actually needs to pull from live production systems). The following lessons will give you several examples of reading and writing different kinds of files.

The examples in this lesson are written with a focus on the basic functionality. Each example is followed by notes explaining new Python features or functions used in the code; please read those notes carefully. The notes will mention the most likely issues to come up from bad data, but the code will skip error checking for clarity until the later examples.

#### Topic Outline
- Reading a File
- Reading and Writing TSV Files
- Reading and Writing CSV Files
- What Data Structures Should Be Used?
- Reading and Writing JSON Files

#### Think About It
- Have you ever looked inside a CSV file saved by Microsoft Excel? Could you recognize the data?

## 4.6 Lesson: Reading a File

Reading a file in Python is pretty easy using the `open` function to access local files. Later on, you'll see that many Python libraries provide "file-like" objects that can be used similarly to files, or that share the same parsing functions.

For now, here are two example functions that read a file and return the file contents broken into lines:

- The first reads the whole file at once and returns all the lines in a list. 
- The second reads the file incrementally and returns an iterator returning a line at a time. 

The former takes more memory, but can be sorted, processed multiple times, or anything else that you would do with a list. The latter requires minimal memory, and is more efficient if you just need to look at the contents once, or process the contents before storing them in a more persistent data structure. Both are reasonable bases for future development depending on your specific needs.

#### Reading Lines from a File
The following example code reads all the lines of a file at once and returns them in a list.

In [34]:
def read_file_at_once(filename):
    output = []
    with open(filename) as fp:
        for line in fp:
            output.append(line)
 
    return output

In [36]:
read_file_at_once('gettysburg.txt.txt')

['Fourscore and seven years ago our fathers brought forth on this\n',
 'continent a new nation, conceived in liberty and dedicated to the\n',
 'proposition that all men are created equal.\n',
 'Now we are engaged in a great civil war, testing whether that nation\n',
 'or any nation so conceived and so dedicated can long endure. We are\n',
 'met on a great battle field of that war. We have come to dedicate a\n',
 'portion of that field, as a final resting place for those who here\n',
 'gave their lives that that nation might live. It is altogether\n',
 'fitting and proper that we should do this.\n',
 'But, in a larger sense, we can not dedicate - we can not consecrate\n',
 '- we can not hallow - this ground. The brave men, living and dead,\n',
 'who struggled here, have consecrated it, far above our poor power to\n',
 'add or detract. The world will little note, nor long remember, what\n',
 'we say here, but it can never forget what they did here. It is for\n',
 'us the living, rather, 

#### Code Notes
- This function uses the `with` statement, which is new syntax here.
    - In this case, a new variable `fp` is created from the result of `open(filename)`, which represents the open file.
    - At the end of the `with` statement, the open file is closed automatically, even if an exception happened.
    - This "context manager" pattern is often used to manage resource connections, and libraries supporting context managers will give you similar examples to follow.
- Iterating over `fp`, or any other file handle, returns the lines of the file one at a time.
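To see what the `with` statement saves you, here is a rough sketch of the equivalent code written with `try`/`finally`. The temporary file and its contents are made up so the example is self-contained.

```python
import os
import tempfile

# Create a small throwaway file so the sketch does not depend on existing data.
path = os.path.join(tempfile.mkdtemp(), "example.txt")
with open(path, "w") as fp:
    fp.write("first line\nsecond line\n")

# Roughly what the with statement does for a file:
fp = open(path)
try:
    lines = [line for line in fp]
finally:
    fp.close()  # runs even if reading raised an exception

print(lines)      # ['first line\n', 'second line\n']
print(fp.closed)  # True
```

The `with` form is shorter and makes it harder to forget the cleanup step.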

#### Code Example: Reading Lines with a Generator

In [37]:
def read_file_with_generator(filename):
    with open(filename) as fp:
        for line in fp:
            yield line

In [38]:
read_file_with_generator('gettysburg.txt.txt')

<generator object read_file_with_generator at 0x10d0d17b0>

#### Code Notes
- This function uses the `yield` statement or expression. Calling a generator function immediately returns a generator object (a simple kind of "coroutine"), which acts as an iterator. Each time the next iterator output is requested, the generator runs the function's code until the next `yield` statement, or until the function completes. All expressions "yielded" are returned as iterator output.

**TLDR**: `yield` turns your function output into an iterator.
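The sketch below shows when a generator function's code actually runs. The `countdown` function is a hypothetical example; the `print` calls inside it are there only to make the timing visible.

```python
def countdown(n):
    """Hypothetical generator used to show when the function body runs."""
    print("starting")
    while n > 0:
        yield n
        n -= 1
    print("finished")


gen = countdown(2)  # nothing is printed yet: the body has not started
print(next(gen))    # prints "starting", then 2
print(next(gen))    # prints 1
print(list(gen))    # prints "finished", then []
```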

## A Whirlwind Tour of Python: Generators

Here we'll take a deeper dive into Python generators, including *generator expressions* and *generator functions*

### Generator Expressions:
The difference between list comprehensions and generator expressions is sometimes confusing, but we'll outline the differences between them.

#### List comprehensions use square brackets, while generator expressions use parentheses

This is a representative list comprehension:

In [39]:
[n ** 2 for n in range(12)]

[0, 1, 4, 9, 16, 25, 36, 49, 64, 81, 100, 121]

While this is a representative generator expression:

In [41]:
(n ** 2 for n in range(12))

<generator object <genexpr> at 0x10b714380>

Notice that printing the generator expression does not print the contents; one way to print the contents of a generator expression is to pass it to the `list` constructor:

In [42]:
G = (n ** 2 for n in range (12))
list(G)

[0, 1, 4, 9, 16, 25, 36, 49, 64, 81, 100, 121]

#### A list is a collection of values, while a generator is a recipe for producing values

When you create a list, you are actually building a collection of values, and there is some memory cost associated with that. When you create a generator, you are not building a collection of values, but a recipe for producing those values. Both expose the same iterator interface, as we can see here:

In [43]:
L = [n ** 2 for n in range (12)]
for val in L:
    print(val, end= ' ')

0 1 4 9 16 25 36 49 64 81 100 121 

In [44]:
G = (n ** 2 for n in range (12))
for val in G:
    print(val, end = " ")

0 1 4 9 16 25 36 49 64 81 100 121 

The difference is that a generator expression does not actually compute the values until they are needed. This not only leads to memory efficiency, but to computational efficiency as well! This also means that while the size of a list is limited by available memory, the size of a generator expression is unlimited!
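A rough way to see the memory difference is `sys.getsizeof`. The exact numbers depend on the Python version and platform, so none are shown here; the range size of 100,000 is an arbitrary choice.

```python
import sys

squares_list = [n ** 2 for n in range(100_000)]
squares_gen = (n ** 2 for n in range(100_000))

# getsizeof reports the container itself, not the integers it refers to,
# so treat the numbers as a rough comparison only.
print(sys.getsizeof(squares_list))
print(sys.getsizeof(squares_gen))

# sum consumes the generator one value at a time without building a list.
print(sum(squares_gen) == sum(squares_list))  # True
```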

An example of an infinite generator expression can be created using the `count` iterator defined in `itertools`:

In [46]:
from itertools import count
count()

count(0)

In [47]:
for i in count():
    print(i, end= " ")
    if i >= 10: break

0 1 2 3 4 5 6 7 8 9 10 

The `count` iterator will go on happily counting forever until you tell it to stop; this makes it convenient to create generators that will also go on forever:

In [49]:
factors = [2, 3, 5, 7]
G = (i for i in count() if all(i % n > 0 for n in factors))
for val in G:
    print(val, end=" ")
    if val > 40: break

1 11 13 17 19 23 29 31 37 41 

You might see what we're getting at here: if we were to expand the list of factors appropriately, what we would have the beginnings of is a prime number generator, using the Sieve of Eratosthenes algorithm. We'll explore this more momentarily.

#### A list can be iterated multiple times; a generator expression is single-use

This is one of those potential gotchas of generator expressions. With a list, we can straightforwardly do this:

In [50]:
L = [n ** 2 for n in range(12)]
for val in L:
    print(val, end=" ")
print()

for val in L:
    print(val, end=" ")

0 1 4 9 16 25 36 49 64 81 100 121 
0 1 4 9 16 25 36 49 64 81 100 121 

A generator expression, on the other hand, is used up after one iteration:

In [51]:
G = (n ** 2 for n in range(12))
list(G)

[0, 1, 4, 9, 16, 25, 36, 49, 64, 81, 100, 121]

In [52]:
list(G)

[]

This can be very useful because it means iteration can be stopped and started:

In [53]:
G = (n**2 for n in range(12))
for n in G:
    print(n, end=' ')
    if n > 30: break

print("\ndoing something in between")

for n in G:
    print(n, end=' ')

0 1 4 9 16 25 36 
doing something in between
49 64 81 100 121 

One place I've found this useful is when working with collections of data files on disk; it means that you can quite easily analyze them in batches, letting the generator keep track of which ones you have yet to see.
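As a sketch of that batching idea, `itertools.islice` can pull a fixed number of items from a generator while the generator remembers its position. The file names and the batch size of three are hypothetical; a real project might build the names from a directory listing.

```python
from itertools import islice

# Hypothetical file names produced lazily by a generator expression.
filenames = (f"data_{i:03d}.tsv" for i in range(7))

batches = []
while True:
    batch = list(islice(filenames, 3))  # take up to three unseen names
    if not batch:
        break
    batches.append(batch)
    print(batch)
```

Each pass through the loop picks up where the previous one stopped, so the last batch holds only the one remaining name.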

### Generator Functions: Using `yield`

We saw in the previous section that list comprehensions are best used to create relatively simple lists, while using a normal `for` loop can be better in more complicated situations. The same is true of generator expressions: we can make more complicated generators using *generator functions*, which make use of the `yield` statement.

Here we have two ways of constructing the same list:

In [54]:
L1 = [n ** 2 for n in range(12)]

L2 = []
for n in range(12):
    L2.append(n ** 2)

print(L1)
print(L2)

[0, 1, 4, 9, 16, 25, 36, 49, 64, 81, 100, 121]
[0, 1, 4, 9, 16, 25, 36, 49, 64, 81, 100, 121]


Similarly, here we have two ways of constructing equivalent generators:

In [55]:
G1 = (n ** 2 for n in range(12))

def gen():
    for n in range(12):
        yield n ** 2

G2 = gen()
print(*G1)
print(*G2)

0 1 4 9 16 25 36 49 64 81 100 121
0 1 4 9 16 25 36 49 64 81 100 121


A generator function is a function that, rather than using `return` to return a value once, uses `yield` to yield a (potentially infinite) sequence of values. Just as in generator expressions, the state of the generator is preserved between partial iterations, but if we want a fresh copy of the generator we can simply call the function again.

#### Example: Prime Number Generator

Here I'll show my favorite example of a generator function: a function to generate an unbounded series of prime numbers. A classic algorithm for this is the Sieve of Eratosthenes, which works something like this:

In [56]:
# Generate a list of candidates
L = [n for n in range(2, 40)]
print(L)

[2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39]


In [57]:
# Remove all multiples of the first value
L = [n for n in L if n == L[0] or n % L[0] > 0]
print(L)

[2, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 25, 27, 29, 31, 33, 35, 37, 39]


In [58]:
# Remove all multiples of the second value
L = [n for n in L if n == L[1] or n % L[1] > 0]
print(L)

[2, 3, 5, 7, 11, 13, 17, 19, 23, 25, 29, 31, 35, 37]


In [59]:
# Remove all multiples of the third value
L = [n for n in L if n == L[2] or n % L[2] > 0]
print(L)

[2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37]


If we repeat this procedure enough times on a large enough list, we can generate as many primes as we wish.

Let's encapsulate this logic in a generator function:

In [60]:
def gen_primes(N):
    """Generate primes up to N"""
    primes = set()
    for n in range(2, N):
        if all(n % p > 0 for p in primes):
            primes.add(n)
            yield n

print(*gen_primes(100))

2 3 5 7 11 13 17 19 23 29 31 37 41 43 47 53 59 61 67 71 73 79 83 89 97


That's all there is to it! While this is certainly not the most computationally efficient implementation of the Sieve of Eratosthenes, it illustrates how convenient the generator function syntax can be for building more complicated sequences.
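For comparison, here is a sketch of a more conventional bounded sieve that crosses off multiples in a table instead of testing each candidate against every earlier prime. The name `sieve_primes` is hypothetical, and this is one common formulation rather than the only one.

```python
def sieve_primes(limit):
    """Yield the primes below limit using a boolean table."""
    is_prime = [True] * limit
    for n in range(2, limit):
        if is_prime[n]:
            yield n
            # Cross off multiples of n, starting at n * n;
            # smaller multiples were already crossed off by smaller primes.
            for multiple in range(n * n, limit, n):
                is_prime[multiple] = False


print(*sieve_primes(40))  # 2 3 5 7 11 13 17 19 23 29 31 37
```

The trade-off is that this version needs the upper limit in advance and stores a table of that size.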

## 4.7 Lesson: Reading and Writing TSV Files

The tab-separated value (TSV) file format is a common way to share data. TSV files are generally easy to create and parse, and have pretty broad support. Most files with the extension .tsv are TSV files. Microsoft Excel defaults to saving TSV files with the extension .txt.

The idea of TSV files is simple. Each line in the file is broken up into columns by a "tab" character ("\t" in Python and most languages). This works because most data does not naturally have embedded tabs. After reading each line, simply split it on tabs and you have the columns.

Most TSV files have a header row for the first line. This header row is naturally split up by tabs, and the strings between those tabs are the column names.

The following code examples read and write TSV files. They are written using generators for output, or taking in iterables for input, to emphasize the lightweight iterable-oriented style. Each of them can be easily switched to use lists if appropriate.

### Code Example: Reading a TSV File
Here is an example of reading a tab-separated file and yielding a list of values for each row. If there is a header row, it will be the first list returned.

In [61]:
def read_tsv_lists(filename):
    with open(filename) as fp:
        for line in fp:
            # strip newline at end of the line
            line = line.rstrip("\n")
            yield line.split("\t")

#### Notes:
- The string method `str.rstrip` removes the specified characters at the end of the line. In this case, it is being used to strip the newline character `'\n'` marking the end of the line. Each line except the last one will end with the newline character, but it is not really data, just a separator between lines.
- The string method `str.split` breaks up the original string and returns the pieces in a list. The original string is not actually changed.
- Sometimes functions like this are written to return tuples of data to discourage accidental changes. This requires marginal extra code - just add `tuple()` to the `yield` statement. However, this makes changes like parsing numbers from strings more inconvenient, so it's probably better to leave it as a list.
- All the values in the returned rows will be strings. You will eventually want to convert them to other types.
- This code is unlikely to have errors if the file exists and you can read it, but unexpected results may happen if the columns are inconsistent.
    - This code does not check whether the number of columns in each row is the same. Usually, you should at least check that the number matches the header (first) row.
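The last two notes can be sketched as a follow-up step that takes the rows produced by a reader like `read_tsv_lists`, checks each row against the header, and converts the strings. The function name `convert_rows`, the `converters` parameter, and the sample rows are all assumptions made for illustration.

```python
def convert_rows(rows, converters):
    """Check each row's length against the header, then convert the values.

    rows: iterable of lists of strings, with the header row first.
    converters: one function per column, for example [str, int, float].
    """
    rows = iter(rows)
    header = next(rows)
    yield header
    # The header is line 1, so data rows start at line 2.
    for number, row in enumerate(rows, start=2):
        if len(row) != len(header):
            raise ValueError(
                f"line {number}: expected {len(header)} columns, got {len(row)}"
            )
        yield [convert(value) for convert, value in zip(converters, row)]


sample = [["city", "year", "temp"], ["Oslo", "2020", "6.5"], ["Lima", "2021", "19.2"]]
print(list(convert_rows(sample, [str, int, float])))
# [['city', 'year', 'temp'], ['Oslo', 2020, 6.5], ['Lima', 2021, 19.2]]
```

Raising an exception is only one possible response to a bad row; skipping or logging it may suit other applications better.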
    
#### Code Example: Reading a TSV File into Dictionaries
Here is another example of reading a tab-separated file with a header and yielding the contents as dictionaries. This version requires a header row in the file.

In [62]:
def read_tsv_dictionaries(filename):
    with open(filename) as fp:
        def parse_line(line):
            return line.rstrip("\n").split("\t")
        
        header = parse_line(next(fp))
        for line in fp:
            line = parse_line(line)
            yield dict(zip(header, line))

#### Code Notes:
- The nested function `parse_line` is for convenience to avoid repeating the logic for stripping newlines and splitting on tabs. It does not access any variables from the surrounding scope, but it is local to the enclosing function, so code outside `read_tsv_dictionaries` cannot access it.
- The header variable is a list populated by parsing the first line of the file.
- This function uses three built-in functions that we haven't mentioned before.
    - The built-in function `next` returns the next output from an iterator. In this case, it was the first line from the file.
    - The built-in function `dict` takes in an iterable of pairs (sequences of length two) and populates a dictionary. The first value of each pair becomes a key, and the second value of each pair is that key's value.
    - The built-in function `zip` takes two or more iterables, repeatedly takes a value from each, and yields a tuple of those values. The name is an analogy to zipping up a zipper - two sides are becoming paired together. In this usage, the column names in the `header` variable are being paired with the values in the `line` variable.
- This code will raise an exception when `next` is called if the file is empty. (`next` raises `StopIteration`; because this happens inside a generator, Python 3.7 and later report it as a `RuntimeError`.)
    - This means that the file does not even have a header row, and your assumptions about this being a valid data file were wrong. If you catch this exception, you should probably report which file caused the error, and then raise an exception again since your process is unlikely to work with an empty file.
- If rows after the header have different numbers of values, this code will continue without an exception, but you may be surprised later since the dictionaries are not as expected
    - If there are fewer values than expected, then the code will implicitly assume that these values should be matched to the first column names in the header, and the dictionary will be missing the later keys. You will see the `KeyError` exception when trying to access those keys.
    - If there are more values than expected, then they will be silently dropped by the `zip` function.
    - If you want to catch these cases, you can add an explicit length comparison `(len(header) != len(line))` and code your own response, or add `strict=True` to the `zip` call (available in Python 3.10 and later), which will make it raise a `ValueError` if the lengths do not match.
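The silent behavior described in these notes is easy to reproduce. The header and rows below are made up, and `checked_dict` is a hypothetical helper showing the explicit length comparison.

```python
header = ["name", "age", "city"]

short_row = ["Ada", "36"]
long_row = ["Ada", "36", "London", "extra"]

# zip stops at the shorter input, so neither case raises an exception.
print(dict(zip(header, short_row)))  # {'name': 'Ada', 'age': '36'}
print(dict(zip(header, long_row)))   # {'name': 'Ada', 'age': '36', 'city': 'London'}


def checked_dict(header, values):
    """Pair column names with values, refusing rows of the wrong length."""
    if len(header) != len(values):
        raise ValueError(f"expected {len(header)} values, got {len(values)}")
    return dict(zip(header, values))


try:
    checked_dict(header, short_row)
except ValueError as err:
    print("bad row:", err)  # bad row: expected 3 values, got 2
```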
    
#### Code Example: Writing a TSV File in Python
Here is a function to write a simple TSV file. It takes in the target filename, a sequence of column_names for the header, and a sequence of rows of data with each row being a sequence.

In [4]:
def write_tsv(filename, column_names, rows):
    with open(filename, "w") as fp:
        def write_line(row):
            fp.write("\t".join(str(v) for v in row) + "\n")
 
        write_line(column_names)
        for row in rows:
            write_line(row)


#### Code Notes:
- This function assumes that `str` is adequate to convert all of the values to strings.
- The nested function write_line writes one line to the file with its input row converted to strings.
    - Unlike the previous nested function example, it uses the variable `fp` from the surrounding scope.
    - The file object fp's method `.write` takes in a string and writes it to the file. Unlike the `print` function, it does not automatically add a new line at the end, so it needs to be explicitly included.
- The string method `str.join` takes in a sequence of strings and joins them together separated by the original string.
    - `"\t".join(["a", "b", "c"])` returns `"a\tb\tc"`
- This function could be modified to handle dictionary rows by changing the `write_line(row)` line to pass a list comprehension or generator expression to `write_line` to fetch the values in the right order.
    - If you make that modification, you should "freeze" `column_names` as a list so that you can iterate over it multiple times.
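A sketch of that modification follows. The name `write_tsv_dictionaries`, the temporary file, and the sample row are assumptions for illustration; the structure is otherwise the same as `write_tsv`.

```python
import os
import tempfile


def write_tsv_dictionaries(filename, column_names, rows):
    """Write dictionary rows as TSV; a sketch based on write_tsv."""
    column_names = list(column_names)  # "freeze" so it can be reused for every row
    with open(filename, "w") as fp:
        def write_line(row):
            fp.write("\t".join(str(v) for v in row) + "\n")

        write_line(column_names)
        for row in rows:
            # Fetch the values in header order, whatever order the dictionary uses.
            write_line(row[name] for name in column_names)


path = os.path.join(tempfile.mkdtemp(), "people.tsv")
write_tsv_dictionaries(path, ["name", "age"], [{"age": 36, "name": "Ada"}])
with open(path) as fp:
    print(repr(fp.read()))  # 'name\tage\nAda\t36\n'
```

A row that is missing one of the column names will raise a `KeyError`; whether that is the right response depends on your data.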

## 4.8 Lesson: Reading and Writing CSV files & What Data Structure Should be Used When First Reading a File?

The comma-separated value format, or CSV for short, is a more common file format than TSVs. Many older software systems, and Microsoft Excel in particular, used CSV files as their primary shareable file format. Like TSVs, CSVs are human readable. As you likely assumed, comma-separated value files use commas to separate values. But since commas are much more likely to appear in string data than tabs (which are invisible), CSV files use quoting and other means to distinguish commas in values and commas separating values. The details vary across software generating the files, but the Microsoft Excel dialect is close to a de facto standard.

Given the complexities and variation of different CSV formats, we strongly recommend that you use Python's built-in csv library. The csv library works well with default settings, and has many options to handle the various dialects. The csv library can even handle TSV files with appropriate options, and has classes for dictionary support.

https://docs.python.org/3/library/csv.html
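As a small sketch of the "appropriate options" for TSV data, `csv.reader` accepts a `delimiter` argument. Here `io.StringIO` stands in for an open file so the example needs nothing on disk, and the sample text is made up.

```python
import csv
import io

# A two-line TSV document held in memory instead of a file.
tsv_text = "name\tquote\nAda\thello, and goodbye\n"

reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
rows = list(reader)
print(rows)  # [['name', 'quote'], ['Ada', 'hello, and goodbye']]
```

When working with real files, the csv documentation recommends passing `newline=""` to `open` so that the module can handle line endings itself.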

### Complicated Strings and Numbers in the CSV File Format:
How to write different types of data to a csv file. 

In [64]:
import csv
with open("test.csv", "w") as test_fp:
    test_writer = csv.writer(test_fp)
    test_writer.writerow(["hi", "hello, and goodbye"])

This code uses `csv.writer` to get an object specialized in writing CSV files. The function has several options to change the dialect. The object returned by `csv.writer` has a `writerow` method, which we will use to write the data. This `writerow` method takes in a sequence of data, and the data in the sequence does not need to be strings. It will automatically call `str()` to convert non-string data.

In [66]:
with open("test.csv") as test_fp:
    for l in test_fp:
        print(l)

hi,"hello, and goodbye"



In [69]:
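def test_csv(row):
    # Write the row to test.csv, replacing any previous contents.
    with open("test.csv", "w") as test_fp:
        test_writer = csv.writer(test_fp)
        test_writer.writerow(row)
    # Read the file back and print each raw line to show how the row was encoded.
    with open("test.csv") as test_fp:
        for l in test_fp:
            print(l)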

In [70]:
test_csv(['hi', 5, 42.0, "bye"])

hi,5,42.0,bye



In [72]:
test_csv(["hi", 'this is a double quote " in the middle and end"', "done"]) #example with double quotes

hi,"this is a double quote "" in the middle and end""",done



Two double quotes in a row is the default encoding of a double quote inside a quoted string. So if you were writing code to parse this yourself and your code was in the middle of a quoted string, it would need to peek at the character after each double quote to tell whether this was a double quote inside the string or the string was being closed. A double quote that is not followed by another double quote marks the end of the string value.
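In practice you rarely need to write that parser yourself. The sketch below, which uses an in-memory buffer and a made-up row purely for illustration, shows `csv.reader` undoing the quote doubling that `csv.writer` applied.

```python
import csv
import io

# A made-up row whose middle value contains both double quotes and a comma.
original_row = ["hi", 'a "quoted" word, with a comma', "done"]

# Encode the row as one line of CSV text.
buffer = io.StringIO()
csv.writer(buffer).writerow(original_row)
encoded = buffer.getvalue()
print(encoded)

# csv.reader accepts any iterable of lines, including an in-memory buffer.
decoded_row = next(csv.reader(io.StringIO(encoded)))
print(decoded_row == original_row)  # True
```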

In [73]:
test_csv(['lots of double quotes """""'])

"lots of double quotes """""""""""



#### Code Example: Reading a CSV File with the CSV Module

In [74]:
import csv
def read_csv_lists(filename):
    with open(filename) as file:
        reader = csv.reader(file)
        for row in reader:
            yield row

#### Code Notes:
- If you compare this function with `read_tsv_lists` earlier, the addition of one reader object replaced stripping new lines and splitting the line into multiple values.
    - The latter splitting functionality would have been much more complicated than the simple TSV case.
- `csv.reader` has several options for tweaking the dialect that is read.
    - Any changes to the dialect options will be made in the `csv.reader` call, and the following loop will be unchanged.
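As a sketch of what such a change might look like, the following reads semicolon-separated text that has a space after each separator. The data and the choice of options are hypothetical; `delimiter` and `skipinitialspace` are standard `csv.reader` options.

```python
import csv
import io

# Hypothetical data: semicolons as separators, with a space after each one.
text = "name; weight\nmango; 0.4\n"

# Only the csv.reader call changes; the code consuming the rows stays the same.
reader = csv.reader(io.StringIO(text), delimiter=";", skipinitialspace=True)
rows = list(reader)
print(rows)  # [['name', 'weight'], ['mango', '0.4']]
```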

#### Code Example: Reading a CSV File into Dictionaries
Here is another version of the file reading code that returns dictionaries instead of lists. Note how brief the change is; this is a common pattern that the `csv` module makes easy to express.

In [75]:
import csv
def read_csv_dictionaries(filename):
    with open(filename) as file:
        reader = csv.DictReader(file)
        for row in reader:
            yield row

#### Code Notes:
- The only change from the previous `read_csv_lists` function is changing the call from `csv.reader` to `csv.DictReader`.
- `csv.DictReader` can be similarly configured with options to read different dialects.
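To see what the dictionaries look like, here is a minimal sketch that reads a small in-memory table. The column names `weight` and `sweetness` and the values are invented for illustration. By default, `csv.DictReader` takes its keys from the first row.

```python
import csv
import io

# Invented sample data; the first row supplies the dictionary keys.
text = "mango_id,weight,sweetness\nm1,0.4,7\nm2,0.5,9\n"

reader = csv.DictReader(io.StringIO(text))
rows = list(reader)
print(reader.fieldnames)  # ['mango_id', 'weight', 'sweetness']
print(rows[0])
```

Note that every value is still a string at this point; converting them to numbers is a separate step.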

#### Code Example: Reading TSV Files with the CSV Module
Here is an example of using those options to read a TSV file using the CSV module.

In [76]:
import csv
 
def read_tsv_dictionaries_2(filename):
    with open(filename) as file:
        reader = csv.DictReader(file, dialect="excel-tab")
        for row in reader:
            yield row

#### Code Notes:
- The only change is adding the `dialect="excel-tab"` option.
- You could also get this effect with `delimiter="\t"` since `\t` is the tab character.

But what exactly is the `excel-tab` dialect?

In [77]:
import csv
csv.list_dialects()

['excel', 'excel-tab', 'unix']

In [79]:
dialect = csv.get_dialect('excel-tab')
dialect

<_csv.Dialect at 0x10af77c40>

In [80]:
dir(dialect)

['__class__',
 '__delattr__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getstate__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 'delimiter',
 'doublequote',
 'escapechar',
 'lineterminator',
 'quotechar',
 'quoting',
 'skipinitialspace',
 'strict']

In [81]:
for attribute in dir(dialect):
    if not attribute.startswith("_"):
        print(attribute, repr(getattr(dialect, attribute)))

delimiter '\t'
doublequote True
escapechar None
lineterminator '\r\n'
quotechar '"'
quoting 0
skipinitialspace False
strict False


#### Code Notes:
- `csv.list_dialects` returns a list of the available dialects. `"excel-tab"` is included.
- `csv.get_dialect` gives us a dialect object, but converting it into a string is not very helpful.
- The built-in function `dir` was mentioned in week 1, but we have not used it since. It returns a list of the attributes (including methods) of an object, and is handy when trying to learn about an unfamiliar object.
- The built-in function `getattr` returns an object's named attribute. What's an attribute? Anything you can access from the object using the dot notation. So `getattr(o, "foo")` is the same as `o.foo`. Here, it was used to programmatically look at an unfamiliar object where the attribute names weren't known beforehand.
- `repr` was used for clarity since some of these attributes were non-visible characters; `repr` changed them to the backslashed expressions.
- This dialect is a little fancier than a plain TSV file but is probably fine for most purposes.
    - Of particular note, it supports quoting fields with double quotes, with the same doubled double-quote escaping that we saw in the example CSV encodings.
    - Most of the time, quoted fields do not come up at all in TSV data.
    - When they do come up, you'll have to decide on a case-by-case basis whether you want this support or not.
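The following sketch illustrates that decision on a made-up TSV line. With the `excel-tab` dialect, the double quotes are treated as quoting and removed; passing `quoting=csv.QUOTE_NONE` is one way to turn that support off so the quotes are kept as data.

```python
import csv
import io

# A made-up TSV line whose second field is wrapped in double quotes.
text = 'm1\t"extra sweet"\n'

# excel-tab interprets the double quotes as field quoting and strips them.
with_quoting = next(csv.reader(io.StringIO(text), dialect="excel-tab"))

# QUOTE_NONE tells the reader to treat quote characters as ordinary data.
without_quoting = next(
    csv.reader(io.StringIO(text), delimiter="\t", quoting=csv.QUOTE_NONE)
)

print(with_quoting)     # ['m1', 'extra sweet']
print(without_quoting)  # ['m1', '"extra sweet"']
```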

#### Code Example: Handling Different Data Types
We will wrap up this example with a more full-featured parser that handles numeric types too. It will try to convert all fields except `"mango_id"` into numbers and set the value to None if the parsing fails. None will not be a good input to most modeling code, but you will have to separately decide what to do in those cases.

In [82]:
import csv
 
def read_mango_data(filename):
    with open(filename) as file:
        reader = csv.DictReader(file, dialect="excel-tab")
        for row in reader:
            for column_name in row:
                if column_name != "mango_id":
                    try:
                        row[column_name] = float(row[column_name])
                    except (TypeError, ValueError):
                        # Missing or non-numeric values become None.
                        row[column_name] = None
 
            yield row

#### Code Notes:

- You may want to convert individual columns instead of looping over all columns, depending on what your data looks like.
- `float` is a reasonable default for most numeric columns, but occasionally `int` will be preferred.
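One possible way to convert individual columns is sketched below. The names `COLUMN_TYPES` and `convert_row`, the `weight` and `seed_count` columns, and the sample data are all hypothetical; the idea is simply to map each column name to the conversion it should get.

```python
import csv
import io

# Hypothetical mapping from column name to the conversion for that column.
COLUMN_TYPES = {"weight": float, "seed_count": int}

def convert_row(row, column_types=COLUMN_TYPES):
    """Return a copy of row with the listed columns converted.

    Values that are missing or fail to convert become None.
    """
    converted = dict(row)
    for column_name, convert in column_types.items():
        try:
            converted[column_name] = convert(row[column_name])
        except (KeyError, TypeError, ValueError):
            converted[column_name] = None
    return converted

# Invented sample data with one unparseable weight.
text = "mango_id\tweight\tseed_count\nm1\t0.4\t1\nm2\tunknown\t2\n"
reader = csv.DictReader(io.StringIO(text), dialect="excel-tab")
rows = [convert_row(row) for row in reader]
print(rows)
```

Columns not listed in the mapping, such as `mango_id`, are left as strings.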

#### An Opinionated Take
This video argues for using the list of dictionaries representation when first investigating a new file or source of data.

## 4.9 Lesson: Reading and Writing JSON Files
JavaScript Object Notation, commonly known as JSON, is a popular format originally used by JavaScript as a human-readable format to transfer data between web servers and web pages in a browser. Most of the early usage helped make more interactive web pages. Since that original use case, the JSON file format has become standardized, and it is used by many application programming interfaces (APIs), accepted by many databases, and generally used where flexibility is desired.

Let’s look at an example.

In [83]:
import json
print(json.dumps({"hi": ["this", "string", "is", "in", "a", "list"], "numbers": [1, 2, 3, 4, 5], "nothing": None}))

{"hi": ["this", "string", "is", "in", "a", "list"], "numbers": [1, 2, 3, 4, 5], "nothing": null}


That example JSON data should be easy for you to interpret with your knowledge so far. In fact, this example is almost valid Python code except for the `null` at the end instead of None.

JSON is not a great representation for data in data science. It takes up a lot of space and can be expensive to parse. However, it is still handy to know since it is often used for configuration files, and you may need to parse other data sources using JSON to extract the data that you want to model. Fortunately, Python has strong JSON support via the built-in json library.

A particularly nice feature of JSON is that all the type information is built into the file format, so parsing a JSON file includes parsing all the types. A downside of JSON is that if the file is corrupted somehow, then it is difficult to recover, in contrast to TSV and CSV files where you can usually get away with just dropping the impacted lines.
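Both points can be seen in a short sketch. The JSON text below is made up for illustration, and cutting it off partway is just a stand-in for a corrupted file.

```python
import json

# Made-up JSON text containing several different types.
text = '{"count": 3, "ratio": 0.5, "ok": true, "missing": null, "tags": ["a", "b"]}'

# Parsing recovers the Python type of each value with no extra conversion code.
parsed = json.loads(text)
type_names = {key: type(value).__name__ for key, value in parsed.items()}
print(type_names)

# Cutting the text short simulates corruption; nothing can be partially recovered.
corrupted = text[:20]
parse_failed = False
try:
    json.loads(corrupted)
except json.JSONDecodeError as error:
    parse_failed = True
    print("could not parse:", error.msg)
```

Unlike a damaged CSV file, there is no single bad line to drop here; the whole document fails to parse.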

#### Video: Reading and Writing JSON in Python
This video gives a quick introduction to reading and writing data in the common JSON format.

In [84]:
import json

# let's make a JSON
Data = {"config" : {"important": True, "purple_factor": 0.9, "exceptions": ["brown", "red"]}}

To write a file, we just open the file for writing and call `json.dump`.

In [86]:
with open("test.json", "w") as file:
    json.dump(Data, file)

Let's look at what that wrote out:

In [87]:
with open("test.json") as file:
    for line in file:
        print(line)

{"config": {"important": true, "purple_factor": 0.9, "exceptions": ["brown", "red"]}}


In [88]:
with open("test.json") as file:
    new_data = json.load(file)
new_data

{'config': {'important': True,
  'purple_factor': 0.9,
  'exceptions': ['brown', 'red']}}

In [90]:
data_json = json.dumps(Data)
print(data_json)

{"config": {"important": true, "purple_factor": 0.9, "exceptions": ["brown", "red"]}}


The default string conversion used by print usually generates one long line of text for complex objects.

In [91]:
messy_data = [{"x": {"hi": "bye", "hello": "goodbye"}}, 0, 1, [{"weird": {"weirder": "stuff"}}, "some more stuff"], ["a", "b", "c"]]
print(messy_data)

[{'x': {'hi': 'bye', 'hello': 'goodbye'}}, 0, 1, [{'weird': {'weirder': 'stuff'}}, 'some more stuff'], ['a', 'b', 'c']]


If you just use `json.dumps`, then the output is pretty similar to Python's default conversion to strings. In this case, most of the change is just changing the quotes to double quotes.

In [92]:
import json
print(json.dumps(messy_data))

[{"x": {"hi": "bye", "hello": "goodbye"}}, 0, 1, [{"weird": {"weirder": "stuff"}}, "some more stuff"], ["a", "b", "c"]]


But if you add the `indent=2` option, then it will add new lines and indentation to the output, and you'll be able to see the structure of the object much better.

In [93]:
print(json.dumps(messy_data, indent=2))

[
  {
    "x": {
      "hi": "bye",
      "hello": "goodbye"
    }
  },
  0,
  1,
  [
    {
      "weird": {
        "weirder": "stuff"
      }
    },
    "some more stuff"
  ],
  [
    "a",
    "b",
    "c"
  ]
]


**Table: JSON Functions in Python**

The following functions in the `json` module cover most usage.

| Function name | Description |
| :--- | :--- |
| `dump` | Writes the input object to a file-like object as JSON. |
| `dumps` | Returns the input object encoded as a JSON string. |
| `load` | Reads the contents of a file-like object, parses it as JSON, and returns the resulting object. |
| `loads` | Parses the contents of a string as JSON and returns the resulting object. |

In [100]:
#json.dump
with open("output.json", "w") as fp:  # open for writing so the file is created
    json.dump([{"hi" : "bye"}], fp)

In [99]:
#json.dumps
output = json.dumps([{"hi" : "bye"}])
output

'[{"hi": "bye"}]'

In [101]:
#json.load
# input.json must already exist; otherwise open() raises the FileNotFoundError shown below.
with open("input.json") as fp:
    input_data = json.load(fp)

FileNotFoundError: [Errno 2] No such file or directory: 'input.json'

In [105]:
#json.loads
input_data = json.loads(output)
input_data

[{'hi': 'bye'}]
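To tie the file-based functions together, here is a minimal sketch of a full round trip through `json.dump` and `json.load`. It writes to a temporary directory so that it does not depend on any file already existing; the file name `round_trip.json` is arbitrary.

```python
import json
import os
import tempfile

data = [{"hi": "bye"}]

# A temporary directory is cleaned up automatically when the block ends.
with tempfile.TemporaryDirectory() as directory:
    path = os.path.join(directory, "round_trip.json")

    # Write the object out as JSON.
    with open(path, "w") as fp:
        json.dump(data, fp)

    # Read it back and parse it into a new Python object.
    with open(path) as fp:
        loaded = json.load(fp)

print(loaded == data)  # True
```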