# Lab 15 File IO

## Pandas Readers And Writers

### JSON 

Remember the `best_brews` `dict` of `dict`s?  A `dict` of `dict`s in `str` form is an example of *JSON*, short for
JavaScript Object Notation.  This is a popular format for sharing data from application to application.

JSON looks a lot like Python literals in `str` form, but it is its own format.  JSON can contain objects (like `dict`), arrays (like `list`), strings, numbers, `true`/`false` instead of `True`/`False`, and `null` instead of `None`.
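
As a minimal sketch of how Python values map onto JSON text, the standard library `json` module can do the conversion in both directions. The `record` below is made up for illustration; it is not the real `best_brews` data.

```python
import json

# A small, made-up record (hypothetical, not taken from best_brews).
record = {"name": "Example Pub", "open": True, "rating": None, "taps": [12, 4.5]}

# Python values -> JSON text: True becomes true, None becomes null.
text = json.dumps(record)
print(text)

# JSON text -> Python values: the original dict comes back.
restored = json.loads(text)
print(restored == record)
```

Note that the JSON text uses double quotes around every string, which is one of the places where JSON is stricter than Python.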

"brews.txt" is a file in our DataSets directory that has the `best_brews` `dict` as its text.  With that file, it is this easy to put the data into a Pandas DataFrame:

In [1]:
import pandas as pd
json_brews_frame = pd.read_json("DataSets/brews.txt")  # Windows?  r"DataSets\brews.txt" (raw string, so \b is not an escape)
json_brews_frame.T.head()                              # Any OS?
                                                       # import os
                                                       # os.path.join("DataSets", "brews.txt")

Unnamed: 0,bar,brewer_tap_room,brewpub
Alabama,The Nook,Good People Brewing Company,
Alaska,Humpy's Great Alaskan Alehouse,Midnight Sun Brewing Company,Mooses Tooth Pub and Pizzeria
Arizona,Angel's Trumpet Ale House,Dragoon Brewing Co. Tap Room,Papago Brewing Company
Arkansas,Flying Saucer Draught Emporium - Little Rock,Diamond Bear Brewing Company,
California,Churchill's Pub and Grille,The Bruery,Beachwood BBQ and Brewing


In [2]:
json_brews_frame.to_json("DataSets/brews.json")
brews_again = pd.read_json("DataSets/brews.json")
brews_again.T.head()

Unnamed: 0,bar,brewer_tap_room,brewpub
Alabama,The Nook,Good People Brewing Company,
Alaska,Humpy's Great Alaskan Alehouse,Midnight Sun Brewing Company,Mooses Tooth Pub and Pizzeria
Arizona,Angel's Trumpet Ale House,Dragoon Brewing Co. Tap Room,Papago Brewing Company
Arkansas,Flying Saucer Draught Emporium - Little Rock,Diamond Bear Brewing Company,
California,Churchill's Pub and Grille,The Bruery,Beachwood BBQ and Brewing


### CSV

Once in a DataFrame, it's easy to make a CSV (comma separated values) file:

In [3]:
json_brews_frame.T.to_csv("DataSets/brews.csv")

In [4]:
csv_brews_frame = pd.read_csv("DataSets/brews.csv")
csv_brews_frame.head()

Unnamed: 0.1,Unnamed: 0,bar,brewer_tap_room,brewpub
0,Alabama,The Nook,Good People Brewing Company,
1,Alaska,Humpy's Great Alaskan Alehouse,Midnight Sun Brewing Company,Mooses Tooth Pub and Pizzeria
2,Arizona,Angel's Trumpet Ale House,Dragoon Brewing Co. Tap Room,Papago Brewing Company
3,Arkansas,Flying Saucer Draught Emporium - Little Rock,Diamond Bear Brewing Company,
4,California,Churchill's Pub and Grille,The Bruery,Beachwood BBQ and Brewing


### Handling Default Arguments

The display of the data set has changed!  The states are now an ordinary column, and a new integer index labels the rows.

Let's fix that.  First:

```python
In [4]: ?pd.read_csv

Signature: pd.read_csv(filepath_or_buffer, sep=',', delimiter=None, header='infer', names=None, index_col=None, usecols=None, squeeze=False, prefix=None, mangle_dupe_cols=True, dtype=None, engine=None, converters=None, true_values=None, false_values=None, skipinitialspace=False, skiprows=None, nrows=None, na_values=None, keep_default_na=True, na_filter=True, verbose=False, skip_blank_lines=True, parse_dates=False, infer_datetime_format=False, keep_date_col=False, date_parser=None, dayfirst=False, iterator=False, chunksize=None, compression='infer', thousands=None, decimal=b'.', lineterminator=None, quotechar='"', quoting=0, escapechar=None, comment=None, encoding=None, dialect=None, tupleize_cols=False, error_bad_lines=True, warn_bad_lines=True, skipfooter=0, skip_footer=0, doublequote=True, delim_whitespace=False, as_recarray=False, compact_ints=False, use_unsigned=False, low_memory=True, buffer_lines=None, memory_map=False, float_precision=None)
```
An explanation of each of the defaulted arguments follows the signature.  The new row labels are the index.  Passing `index_col=0` makes the first column, the states, the index, so the integer index is not created.  To get rid of the `NaN`s, we use `na_filter=False`, which leaves empty fields as empty strings.
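
To see the effect of those two arguments in isolation, here is a small sketch that reads the same CSV text twice. The `csv_text` string is made up and only stands in for "DataSets/brews.csv"; `io.StringIO` is used so the example does not need a file on disk.

```python
import io
import pandas as pd

# Made-up CSV text (hypothetical data).  The first header field is empty,
# just as it is when to_csv writes a DataFrame with an unnamed index.
csv_text = ",bar,brewpub\nAlabama,The Nook,\nAlaska,Humpy's,Mooses Tooth\n"

# Defaults: the states land in a column named "Unnamed: 0",
# and the empty brewpub field is read as NaN.
default_frame = pd.read_csv(io.StringIO(csv_text))

# index_col=0 turns the first column into the index;
# na_filter=False keeps empty fields as empty strings.
fixed_frame = pd.read_csv(io.StringIO(csv_text), index_col=0, na_filter=False)

print(default_frame.columns.tolist())
print(fixed_frame.index.tolist())
print(repr(fixed_frame.loc["Alabama", "brewpub"]))
```

One design note: `na_filter=False` turns off missing-value detection for the whole file, so it suits all-text data like this better than numeric columns.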


In [5]:
csv_brews_frame = pd.read_csv("DataSets/brews.csv", index_col=0, na_filter=False)

In [6]:
csv_brews_frame.head()


Unnamed: 0,bar,brewer_tap_room,brewpub
Alabama,The Nook,Good People Brewing Company,
Alaska,Humpy's Great Alaskan Alehouse,Midnight Sun Brewing Company,Mooses Tooth Pub and Pizzeria
Arizona,Angel's Trumpet Ale House,Dragoon Brewing Co. Tap Room,Papago Brewing Company
Arkansas,Flying Saucer Draught Emporium - Little Rock,Diamond Bear Brewing Company,
California,Churchill's Pub and Grille,The Bruery,Beachwood BBQ and Brewing


### Clipboard

I copied this onto my clipboard, from: ```https://pandas.pydata.org/pandas-docs/stable/io.html#io-store-in-csv```
```
Format Type	Data Description	Reader	Writer
text	CSV	read_csv	to_csv
text	JSON	read_json	to_json
text	HTML	read_html	to_html
text	Local clipboard	read_clipboard	to_clipboard
binary	MS Excel	read_excel	to_excel
binary	HDF5 Format	read_hdf	to_hdf
binary	Feather Format	read_feather	to_feather
binary	Parquet Format	read_parquet	to_parquet
binary	Msgpack	read_msgpack	to_msgpack
binary	Stata	read_stata	to_stata
binary	SAS	read_sas	 
binary	Python Pickle Format	read_pickle	to_pickle
SQL	SQL	read_sql	to_sql
SQL	Google Big Query	read_gbq	to_gbq
```
Getting the data into a DataFrame is so easy:
```python
pandas_io = pd.read_clipboard(index_col=0)
```
To have it handy on disk, I used:
```
pandas_io.to_csv("DataSets/pandas_io.csv")
```

In [7]:
pandas_io = pd.read_csv("DataSets/pandas_io.csv", index_col=0)
pandas_io

Format Type,Data Description,Reader,Writer
text,CSV,read_csv,to_csv
text,JSON,read_json,to_json
text,HTML,read_html,to_html
text,Local clipboard,read_clipboard,to_clipboard
binary,MS Excel,read_excel,to_excel
binary,HDF5 Format,read_hdf,to_hdf
binary,Feather Format,read_feather,to_feather
binary,Parquet Format,read_parquet,to_parquet
binary,Msgpack,read_msgpack,to_msgpack
binary,Stata,read_stata,to_stata


In [8]:
pandas_io.to_clipboard()

In [9]:
new_pandas_io = pd.read_clipboard(index_col=1)



In [10]:
new_pandas_io

Data Description,Format Type,Reader,Writer
CSV,text,read_csv,to_csv
JSON,text,read_json,to_json
HTML,text,read_html,to_html
Local clipboard,text,read_clipboard,to_clipboard
MS Excel,binary,read_excel,to_excel
HDF5 Format,binary,read_hdf,to_hdf
Feather Format,binary,read_feather,to_feather
Parquet Format,binary,read_parquet,to_parquet
Msgpack,binary,read_msgpack,to_msgpack
Stata,binary,read_stata,to_stata


## Raw File IO

In [11]:
def ReadFile(file_name):
    with open(file_name) as open_file_object:
        for line in open_file_object:
            print(line, end='')
ReadFile("DataSets/ram_tzu.txt")

Ram Tzu knows this:
When God wants you to do something,
you think it's your idea.


In [12]:
def CapFile(file_name, cap_file_name):
    with open(file_name) as reader:
        with open(cap_file_name, "w") as writer:
            for line in reader:
                writer.write(line.upper())
CapFile("DataSets/ram_tzu.txt", "DataSets/ram_tzu_caps.txt") 
ReadFile("DataSets/ram_tzu_caps.txt")

RAM TZU KNOWS THIS:
WHEN GOD WANTS YOU TO DO SOMETHING,
YOU THINK IT'S YOUR IDEA.


## Rawest File IO

In [13]:
def Count(sub_str, file_name):
    count = 0
    file_object = open(file_name)
    for line in file_object:
        count += line.count(sub_str)  
    file_object.close()               # happens automatically with `with`
    return count

print(Count("you", "DataSets/ram_tzu.txt"))

3
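
The comment in `Count` says that `close()` happens automatically with `with`. The sketch below checks that claim by counting a substring both ways. It writes its own throwaway file with `tempfile` (an assumption made so the example does not depend on the DataSets directory), and the variable names are hypothetical.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as folder:
    # A throwaway file with one line of the Ram Tzu text.
    path = os.path.join(folder, "sample.txt")
    with open(path, "w") as writer:
        writer.write("you think it's your idea.\n")

    # Rawest style: open, use, and close by hand.
    # try/finally makes sure close() runs even if counting fails.
    manual = open(path)
    try:
        manual_count = sum(line.count("you") for line in manual)
    finally:
        manual.close()

    # `with` style: the file is closed when the block ends.
    with open(path) as reader:
        with_count = sum(line.count("you") for line in reader)

print(manual_count, with_count)
print(manual.closed, reader.closed)
```

Both file objects report `closed` as `True` afterwards; the `with` version simply gets there without an explicit `close()` call.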


### Dealing with Unicode

To read a file that contains characters that are not `ascii`, pass the encoding when you open it:

`open("the_file", encoding='utf-8')`

Writing data to a file that should be encoded as `utf-8` needs the same argument:

`open("the_file", "w", encoding='utf-8')`
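
A minimal round-trip sketch of both calls, assuming a throwaway file made with `tempfile` and a made-up sample string:

```python
import os
import tempfile

# Made-up sample text with several non-ASCII characters.
text = "café, naïve, Ærø\n"

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "unicode_sample.txt")

    # Write the text as UTF-8.
    with open(path, "w", encoding="utf-8") as writer:
        writer.write(text)

    # Read it back with the same encoding.
    with open(path, encoding="utf-8") as reader:
        round_trip = reader.read()

    # Non-ASCII characters take more than one byte each in UTF-8,
    # so the file has more bytes than the string has characters.
    size_in_bytes = os.path.getsize(path)

print(round_trip == text)
print(len(text), size_in_bytes)
```

Naming the encoding explicitly matters because the default used by `open` can differ from one platform to another.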


# Exercises

`1.` `https://simplemaps.com/data/us-cities` provides the data found in "DataSets/Cities.csv".  Get that data into a Pandas DataFrame.  Use DataFrame.head() to see what you have in your dataset.

`2.`  That gives you a lot of columns, some not so interesting.  Use:
```python
cities.info()
```
to see the columns and the characteristics of each column.

Which are interesting?  Maybe:
```python
interesting = ["state_id", "population", "population_proper", "density"]
```
The `city` label is the index, so it is included automatically.

This will give you just those columns:
```python
cities = cities[interesting]
```

`3.` Those `NaN` values are not useful when you are looking at the population of each city.  Remove the rows that contain them.

`4.` Sort the data by the population, with the largest being first.

`5.` Which 5 cities are the densest?  Which 5 cities have the highest population?

`6.` Save your `cities` data as a comma-separated file.

`7.` The `cities` DataFrame is not suitable for saving as JSON.  Why not?
> Hint:  `cities.loc["San Jose"]` shows you the data for "San Jose".

`8.`  How many lines, words, and characters are in the file `ram_tzu.txt`?

`9.`  Make a new file, `ram_tzu_titled.txt` where the first character of each word is capitalized.  Check `?str.title` for help.

`10.` Do:
```python
import glob
?glob.glob
```
to see that:
`glob.glob("*.ipynb")`

will give you a list of all the `ipynb` files in the current directory.

Make a generator, `GetFirstPounds()` that can be used like this:
```python
for notebook, first_pound in GetFirstPounds():
```
and use it to produce this output:

Note: Some of the Lab notebooks need to be read as unicode.


`11.` Make a function that reads a file and returns the largest number of times the same character appears in a row, along with the characters that have that many in a row.  If the text in the file is "You better look at that report as soon as possible.", the return is `(2, "tos")` because there are 2 in a row of t's, o's, and s's.