# Files 

**Time**
- Teaching: 10 min
- Exercises: 5 min

**Questions**:
- "How do I open a file and read its contents?"
- "How do I write a file with the variables I generated?"

**Learning Objectives**:
- "Learn the Pythonic way of reading in files."
- "Understand how to read/write text files and csv files."
* * * * *

In this lesson we will cover how to read and write files.

## Reading from a file

Reading a file requires three steps:

1. Opening the file
2. Reading the file
3. Closing the file

In [1]:
my_file = open("example.txt", "r")
text = my_file.read()
my_file.close()

print(text)

This is line 1.
This is line 2.
This is line 3.
This is line 4.
This is line 5.



- A better approach is the `with open` syntax, which closes the file for you automatically, even if an error occurs inside the block.
- The `'r'` indicates that you are reading the file, as opposed to, say, writing to it.

In [None]:
# better code
with open('example.txt', 'r') as my_file:
    text = my_file.read()
    
print(text)

`with` keeps the file open only while the program is inside the indented block. Once the block ends, the file is closed and you can no longer read from it; you can only use what you saved to a variable (here, `text`).
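A minimal sketch of this behaviour. It writes a throwaway file first so that it runs without `example.txt`; the name `demo_path` and the file contents are made up for illustration.

```python
import os
import tempfile

# Create a small temporary file so the sketch is self-contained
# (hypothetical path and contents, not the lesson's example.txt).
demo_path = os.path.join(tempfile.mkdtemp(), "demo.txt")
with open(demo_path, "w") as f:
    f.write("This is line 1.\nThis is line 2.\n")

with open(demo_path, "r") as my_file:
    text = my_file.read()
    inside = my_file.closed    # still inside the block, so the file is open

outside = my_file.closed       # the block has ended, so the file is closed

print(inside, outside)
print(text)                    # the saved variable is still available

try:
    my_file.read()             # reading from a closed file fails
except ValueError as err:
    print("Cannot read:", err)
```

The `closed` attribute is a quick way to check whether a file object can still be used.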

## Reading a file as a list

- Very often we want to read in a file line by line, storing those lines as a list.
- To do that, we can use the `for line in my_file` syntax:

In [None]:
stored = []
with open('example.txt', 'r') as my_file:
    for line in my_file:
        stored.append(line)

In [None]:
stored

Remember that the loop variable name can be anything; it does not have to be `line`. Whatever you call it, looping over a file object always gives you one line at a time.

- We can use the `strip` [method](https://github.com/dlab-berkeley/python-intensive/blob/master/Glossary.md#method) to get rid of those line breaks at the end of each line.

In [None]:
stored = []
with open('example.txt', 'r') as my_file:
    for line in my_file:
        stored.append(line.strip())

In [None]:
stored
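The loop above can also be written more compactly. The sketch below compares three ways of building the same list, using a temporary file (`demo_path` and its contents are made up for illustration) so it runs on its own. Note that `strip()` removes all leading and trailing whitespace, while `splitlines()` removes only the line breaks, so the results can differ for lines that start or end with spaces.

```python
import os
import tempfile

# Hypothetical stand-in for example.txt
demo_path = os.path.join(tempfile.mkdtemp(), "demo.txt")
with open(demo_path, "w") as f:
    f.write("This is line 1.\nThis is line 2.\nThis is line 3.\n")

# 1. Explicit loop, as in the lesson
stored = []
with open(demo_path, "r") as my_file:
    for line in my_file:
        stored.append(line.strip())

# 2. List comprehension: the same loop in a single expression
with open(demo_path, "r") as my_file:
    stored_comp = [line.strip() for line in my_file]

# 3. Read the whole file, then split it on line breaks
with open(demo_path, "r") as my_file:
    stored_split = my_file.read().splitlines()

print(stored)
print(stored == stored_comp == stored_split)
```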

## Writing to a file

We can use the `with open` syntax for writing files as well.

In [None]:
# this is okay...
new_file = open("example2.txt", "w")
bees = ['bears', 'beets', 'Battlestar Galactica']
for i in bees:
    new_file.write(i + '\n')
new_file.close()

In [None]:
# but this is better...
bees = ['bears', 'beets', 'Battlestar Galactica']
with open('example2.txt', 'w') as new_file:
    for i in bees:
        new_file.write(i + '\n')
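One thing to keep in mind is that the `'w'` mode starts from an empty file each time, replacing anything that was there before. The sketch below contrasts it with the append mode `'a'`, which the lesson does not cover; it writes to a temporary file (`demo_path` is a made-up name) rather than `example2.txt`.

```python
import os
import tempfile

# Hypothetical output path, used so the lesson's files are left alone
demo_path = os.path.join(tempfile.mkdtemp(), "demo_out.txt")
bees = ['bears', 'beets', 'Battlestar Galactica']

# First write: three lines
with open(demo_path, 'w') as new_file:
    for i in bees:
        new_file.write(i + '\n')

# Opening with 'w' again discards the three lines written above
with open(demo_path, 'w') as new_file:
    new_file.write('bears\n')

# Opening with 'a' keeps the existing contents and adds to the end
with open(demo_path, 'a') as new_file:
    new_file.write('beets\n')

with open(demo_path, 'r') as check:
    contents = check.read()

print(contents)
```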

Let's take a look at the file we wrote.
- In a Jupyter notebook, an exclamation point `!` at the start of a line runs the rest of the line as a shell command.

In [None]:
# on macOS or Linux, use the `cat` command
!cat example2.txt

In [None]:
# on Windows, use the `type` command
!type example2.txt

## Reading/Writing csv files using `pandas`

Reading in a dataset that is stored as a "comma-separated values" (csv) file is easy in Python using the `pandas` package. Central to the `pandas` package is the `DataFrame` type, which stores 2-dimensional tabular data in a format similar to Excel spreadsheets.

Let's import `pandas` and use its `read_csv()` function to load the data stored in a csv file into a `DataFrame`.

In [2]:
import pandas as pd
caps = pd.read_csv('capitals.csv')
# read_csv() is a function of the pandas package

We can look at the first 5 (or any number) rows of data using the `.head()` method of the `DataFrame` object.

In [3]:
caps.head()

| | Country | Capital | Latitude | Longitude |
|---|---|---|---|---|
| 0 | Afghanistan | Kabul | 34°28'N | 69°11'E |
| 1 | Albania | Tirane | 41°18'N | 19°49'E |
| 2 | Algeria | Algiers | 36°42'N | 03°08'E |
| 3 | American Samoa | Pago Pago | 14°16'S | 170°43'W |
| 4 | Andorra | Andorra la Vella | 42°31'N | 01°32'E |


To see how many data points (rows) and variables (columns) the `DataFrame` has, we can use the `.shape` attribute, which gives them in that order.

In [4]:
caps.shape

(200, 4)

Or we can get more detailed information about the number of entries (i.e. observations, data points) and the variables for each entry using the `.info()` method.

In [5]:
caps.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 4 columns):
Country      200 non-null object
Capital      199 non-null object
Latitude     200 non-null object
Longitude    200 non-null object
dtypes: object(4)
memory usage: 6.3+ KB


It looks like there is a single missing value in the Capital variable (there are 199 non-null objects, not 200). Let's remove the row containing that missing value (shown as `NaN` in `pandas`) using the `dropna()` method so that we can save an updated version of the csv file.

In [6]:
caps_nomissing = caps.dropna()
caps_nomissing.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 199 entries, 0 to 199
Data columns (total 4 columns):
Country      199 non-null object
Capital      199 non-null object
Latitude     199 non-null object
Longitude    199 non-null object
dtypes: object(4)
memory usage: 7.8+ KB
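The index in the output above still runs "0 to 199" even though only 199 entries remain, because `dropna()` removes rows without renumbering the ones it keeps. A minimal sketch on a tiny made-up table (the names `demo`, `demo_nomissing` and `demo_reset` and the values are hypothetical, not taken from `capitals.csv`):

```python
import numpy as np
import pandas as pd

# Tiny made-up table with one missing capital
demo = pd.DataFrame({
    'Country': ['A', 'B', 'C'],
    'Capital': ['X', np.nan, 'Z'],
})

# Count missing values per column before dropping anything
missing_per_column = demo.isna().sum()
print(missing_per_column)

# dropna() removes every row that has at least one missing value
demo_nomissing = demo.dropna()
print(list(demo_nomissing.index))   # row 1 is gone; the labels keep a gap

# Optional: renumber the remaining rows from 0
demo_reset = demo_nomissing.reset_index(drop=True)
print(list(demo_reset.index))
```

Whether to renumber is a matter of taste; keeping the original labels makes it easier to trace rows back to the source file.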


That looks better. Now let's write this updated `DataFrame` out to a csv file.

In [7]:
caps_nomissing.to_csv('capitals_nomissing.csv')
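By default, `to_csv()` also writes the row index as an unnamed first column, which shows up as `Unnamed: 0` if the file is read back in. The sketch below shows the difference that `index=False` makes; it uses a made-up two-row table and in-memory strings instead of files, so nothing is written to disk.

```python
import io

import pandas as pd

# Made-up table for illustration
demo = pd.DataFrame({'Country': ['A', 'B'], 'Capital': ['X', 'Y']})

# With no file name, to_csv() returns the csv text as a string
with_index = demo.to_csv()                 # default: index is included
without_index = demo.to_csv(index=False)   # index is left out

# Read both versions back and compare the column names
reread_with = pd.read_csv(io.StringIO(with_index))
reread_without = pd.read_csv(io.StringIO(without_index))

print(list(reread_with.columns))
print(list(reread_without.columns))
```

If the index carries no information of its own, passing `index=False` keeps the saved file identical in shape to the `DataFrame`.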

For more information on using `pandas` come to the D-Lab's workshop titled "Introduction to Pandas". Here's a [link](https://github.com/dlab-berkeley/introduction-to-pandas) to the GitHub repo containing the course materials.

## Challenge 1: Read in a list

The file `counties.txt` has a column of counties in California. Read the data into a list called `counties`.

## Challenge 2: Writing a CSV file

Below is a `pandas` `DataFrame` created from a dictionary of lists representing various information about US states. Write this [object](https://github.com/dlab-berkeley/python-intensive/blob/master/Glossary.md#object) as a CSV file called `states.csv`.

In [None]:
import numpy as np   # needed for np.nan (missing values)
import pandas as pd

states = pd.DataFrame({'state': ['Ohio', 'Michigan', 'California', 'Florida', 'Alabama'],
                       'population': [11.6, 9.9, 39.1, 20.2, 4.9],
                       'year in union': [1803, 1837, 1850, 1834, 1819],
                       'state bird': ['Northern cardinal', np.nan, np.nan, np.nan, np.nan],
                       'capital': ['Columbus', 'Lansing', 'Sacramento', 'Tallahassee', 'Montgomery']})
states

## [OPTIONAL] Using the CSV Module

In addition to reading csv files with the `pandas` package, Python has a built-in `csv` module that can read csv files into lists and dictionaries.
- In Python, a common approach is to read a csv file as a list of dictionaries, one dictionary per row.
- For this, we use the `csv` module.

In [None]:
import csv

In [None]:
# read the csv file into a list of dictionaries
capitals = [] # make empty list
with open('capitals.csv', 'r', newline='') as csvfile: # open file (the csv docs recommend newline='')
    reader = csv.DictReader(csvfile) # create a reader
    for row in reader: # loop through rows
        capitals.append(row) # append each row to the list

In [None]:
capitals[:5]
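To make the shape of the result concrete, here is a small sketch that runs `csv.DictReader` on an in-memory string instead of `capitals.csv` (the two-column `csv_text` is a cut-down stand-in for the real file). Each row becomes a dictionary keyed by the header line, and every value is read as a string.

```python
import csv
import io

# Cut-down stand-in for capitals.csv, held in memory
csv_text = "Country,Capital\nAfghanistan,Kabul\nAlbania,Tirane\n"

rows = []  # make empty list
with io.StringIO(csv_text) as csvfile:   # behaves like an open file
    reader = csv.DictReader(csvfile)     # first line supplies the keys
    for row in reader:                   # loop through the data rows
        rows.append(dict(row))           # store each row as a plain dict

print(rows)
print(rows[0]['Capital'])
```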

- Writing a list of dictionaries as a CSV is similar:

In [None]:
# get the keys in each dictionary
keys = capitals[0].keys()
keys

In [None]:
# write rows
with open('capitals2.csv', 'w', newline='') as output_file:  # newline='' avoids blank lines on Windows
    dict_writer = csv.DictWriter(output_file, fieldnames=keys)
    dict_writer.writeheader()
    dict_writer.writerows(capitals)
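The same steps can be tried without touching the disk. This sketch writes a made-up two-row list (`capitals_demo`, a hypothetical stand-in for `capitals`) into an in-memory buffer and prints the resulting csv text, which makes it easy to see what `writeheader()` and `writerows()` each contribute.

```python
import csv
import io

# Hypothetical stand-in for the `capitals` list of dictionaries
capitals_demo = [
    {'Country': 'Afghanistan', 'Capital': 'Kabul'},
    {'Country': 'Albania', 'Capital': 'Tirane'},
]
keys = list(capitals_demo[0].keys())     # column names, in order

buffer = io.StringIO(newline='')         # in-memory replacement for a file
dict_writer = csv.DictWriter(buffer, fieldnames=keys)
dict_writer.writeheader()                # writes the line "Country,Capital"
dict_writer.writerows(capitals_demo)     # writes one line per dictionary

written = buffer.getvalue()
print(written)
```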

In [None]:
csv.DictWriter.writerows?

In [None]:
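# roughly what writerows() does: build each row by hand, then write it
# ('capitals3.csv' is a file name chosen for illustration)
with open('capitals3.csv', 'w', newline='') as output_file:
    writer = csv.writer(output_file)
    for cur_observation_dict in capitals:
        cur_line = []
        for cur_key in keys:
            cur_line.append(cur_observation_dict[cur_key])
        writer.writerow(cur_line)  # a list must go through a csv writer, not file.write()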