# Working With Files and Doing Data Cleaning

## Working With Files

Most data you will want to work with lives in a file of some kind: comma-separated text files, spreadsheet tables, or tables stored in a database.  Increasingly you will also find data online, not only as text or spreadsheets to download, but through Open Data APIs, which return data from a web service as a JSON object.  This session covers how to work with a variety of file formats in Python, and how to begin cleaning the data in those files.

### Basics of Reading and Writing Files in Python

Let's start by creating a simple file, and then reading it back.  We will use 'open' to open a file called 'tempfile.txt' in write ('w') mode, and assign the resulting file object to a variable we will arbitrarily call 'f'.  File objects opened for reading are **iterable**, which we will make use of when we read the file back.

In [49]:
f = open('tempfile.txt', 'w')
for i in range(10):
    f.write('this is line ' + str(i) + '\n')
f.close()
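
As an aside, a `with` block is a common alternative that closes the file automatically, even if an error occurs partway through. Here is a minimal sketch that writes the same ten lines; the filename `tempfile_with.txt` is just an illustrative choice so that it does not overwrite the file created above.

```python
# Write ten lines using a with block; the file is closed when the block ends.
with open('tempfile_with.txt', 'w') as f:
    for i in range(10):
        f.write('this is line ' + str(i) + '\n')

# f.closed is True once the with block has exited.
print(f.closed)
```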

You can open the text file in an editor to verify that this code wrote the file as expected.  Now open the file we just created in Python, in read ('r') mode.

In [57]:
f = open('tempfile.txt', 'r')

And step through reading each line with the readline() method.  Notice that each time you execute this it advances to the next line.

In [58]:
print(f.readline())
print(f.readline())

this is line 0

this is line 1



The plural version of that method, readlines(), returns a list of the lines in a file.  Note that we re-open the file here to start from the beginning. Otherwise reading would resume from the current position, just past the lines already read, and at the end of the file readlines() would return an empty list. Notice that the list contains the raw text of each line, including the newline character '\n'.

In [61]:
f = open('tempfile.txt', 'r')
f.readlines()

['this is line 0\n',
 'this is line 1\n',
 'this is line 2\n',
 'this is line 3\n',
 'this is line 4\n',
 'this is line 5\n',
 'this is line 6\n',
 'this is line 7\n',
 'this is line 8\n',
 'this is line 9\n']
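
If the trailing newlines are not wanted, one possible approach is to strip them with a list comprehension. This sketch uses a short hand-written list in the same form that readlines() returns, so it runs without the file.

```python
# Sample lines in the same form that readlines() returns.
lines = ['this is line 0\n', 'this is line 1\n', 'this is line 2\n']

# rstrip('\n') removes the trailing newline from each string.
clean = [line.rstrip('\n') for line in lines]
print(clean)
```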

Here is another way to loop through the lines of the file and print them all out. Notice that print displays each line without the quotes, and that end='' stops print from adding a second newline after the one already in each line.

In [62]:
f = open('tempfile.txt', 'r')
for line in f:
    print(line, end='')

this is line 0
this is line 1
this is line 2
this is line 3
this is line 4
this is line 5
this is line 6
this is line 7
this is line 8
this is line 9


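The json module can serialize the list of lines.  json.dumps returns a JSON string, while json.dump writes JSON to a file and json.load reads it back.

In [77]: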
import json
f = open('tempfile.txt', 'r')

x = json.dumps(f.readlines())
x

'["this is line 0\\n", "this is line 1\\n", "this is line 2\\n", "this is line 3\\n", "this is line 4\\n", "this is line 5\\n", "this is line 6\\n", "this is line 7\\n", "this is line 8\\n", "this is line 9\\n"]'

In [78]:
f = open('tempfile.txt', 'r')
j = open('temp.json', 'w')
json.dump(f.readlines(), j)
j.close()  # flush the JSON to disk before reading it back
f.close()

In [80]:
j = open('temp.json', 'r')
x = json.load(j)
x

['this is line 0\n',
 'this is line 1\n',
 'this is line 2\n',
 'this is line 3\n',
 'this is line 4\n',
 'this is line 5\n',
 'this is line 6\n',
 'this is line 7\n',
 'this is line 8\n',
 'this is line 9\n']
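
To make the difference between the string-based and file-based functions concrete, here is a small self-contained sketch. It uses `io.StringIO` as an in-memory stand-in for a file, which is an assumption made for illustration and not part of the cells above.

```python
import io
import json

lines = ['this is line 0\n', 'this is line 1\n']

# dumps returns a JSON-formatted string; loads parses such a string back.
text = json.dumps(lines)
print(json.loads(text) == lines)

# dump writes to a file-like object; load reads from one.
buffer = io.StringIO()      # in-memory stand-in for a file
json.dump(lines, buffer)
buffer.seek(0)              # rewind to the start before reading
restored = json.load(buffer)
print(restored == lines)
```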

## Putting Python to Work: Cleaning up Messy Data

Let's begin looking at some real data and do some work on it using these Python data types and methods.  Remember the rental listings you used to create a map in CartoDB last week, and that we looked at in Python last session?  They were nicely formatted, clean CSV files.  That isn't how they started...

Let's look at the messy set of raw rental listing data, obtained by scraping it from the web.  We left in the messiness to use it for learning how to begin putting to work what we have learned about basic data types.  

You will need a data file called items.csv, and a copy of this notebook, which is 5-strings-lists-dictionaries-part-2.ipynb.  Both are in the files directory, and linked from the session page.  Download the notebook to a location on your hard drive.  I created a Data folder inside that location and put the items.csv file there.  If you organize things differently, you will need to change the file references in each cell below, so it is probably best to use the same file organization.

Once you have the file downloaded and this notebook open, we begin by importing the csv and string modules so we have access to their functions and classes.

In [2]:
import csv, string

Now let's open the items.csv file and use the reader function in the csv module to iterate over the rows in the file, printing the first few.

In [4]:
# Open in text mode; the file is assumed to be UTF-8 encoded.
with open('Data/items.csv', newline='', encoding='utf-8') as csvfile:
    i = 0
    itemreader = csv.reader(csvfile)
    #next(itemreader, None)  # skip the headers
    for row in itemreader:
        i = i+1
        if i < 6:  # print only the first five rows, header included
            print(row)

['neighborhood', 'title', 'price', 'bedrooms', 'pid', 'longitude', 'date', 'link', 'latitude', 'sqft', 'sourcepage']
[' (SOMA / south beach)', '1bed + Den, 1bath at Mission Bay', '$2895', '   / 1br - 950ft² -    ', '4046628359', '-122.399663', 'Sep  4 2013', '/sfc/apa/4046628359.html', '37.774623', '   / 1br - 950ft² -    ', 'http://sfbay.craigslist.org/sfc/apa/']
[' (SOMA / south beach)', 'Love where you live!', '$3354', '   / 1br - 710ft² -    ', '4046761563', '', 'Sep  4 2013', '/sfc/apa/4046761563.html', '', '   / 1br - 710ft² -    ', 'http://sfbay.craigslist.org/sfc/apa/']
[' (inner sunset / UCSF)', 'We Welcome Your Furry Friends! Call Today!', '$2865', '   / 1br - 644ft² -    ', '4046661504', '-122.470727', 'Sep  4 2013', '/sfc/apa/4046661504.html', '37.765739', '   / 1br - 644ft² -    ', 'http://sfbay.craigslist.org/sfc/apa/']
[' (financial district)', 'Golden Gateway Commons | 2BR + office townhouse & 4 decks!!', '$5500', '   / 2br - 14
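
As a possible alternative to indexing rows by position, `csv.DictReader` uses the header row as keys so that columns can be read by name. The sketch below runs on a small in-memory sample (built with `io.StringIO`, an assumption for illustration) containing three of the columns shown above, so it does not need items.csv.

```python
import csv
import io

# A small in-memory sample with three of the columns from items.csv.
sample = io.StringIO(
    'neighborhood,title,price\n'
    ' (SOMA / south beach),Love where you live!,$3354\n'
)

# DictReader uses the first row as keys, so columns can be read by name.
rows = list(csv.DictReader(sample))
print(rows[0]['price'])
```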

## What a Mess!

OK, that is what 'raw' data looks like.  It has all kinds of extraneous text in it, and we need to clean it up in order to use it.  Fortunately, you should already know the tools for this: the basic Python methods for lists and strings that we learned last session, namely split, len, find, indexing (e.g. s[1:4]), and strip.

The exercise for today is to practice those skills to clean up this file.  These are the things you need to do to get it cleaned up:

1. Create variables for neighborhood, title, price, pid, date, latitude, longitude
2. Remove the parentheses around the neighborhood name
3. Remove the $ sign from price
4. Convert price to integer
5. Extract the number of bedrooms from the text string that contains it. For example, the first data row has '   / 1br - 950ft² -    ' as a string (the ² may appear as the escaped bytes \xc2\xb2 in raw output), and you just want the '1' before 'br' (hint: find gives you the index of a specific character or substring, and you can offset from that index by adding or subtracting, like: find('X')+1)
6. Extract the sqft from the same text string, in this case '950' just before 'ft'
7. Unpack the date into year, month and day
8. Print the cleaned up variables for the first 10 rows
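
The find-and-offset hint in step 5 can be sketched on a made-up string that is deliberately different from the listing data, so the exercise itself is left for you. The string `'width=42px'` and the variable names are hypothetical.

```python
# A made-up string, deliberately different from the listing data.
s = 'width=42px'

start = s.find('=') + 1     # index just after '='
end = s.find('px')          # index where 'px' begins
value = int(s[start:end])   # slice between the two positions, then convert
print(value)
```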

To do this without getting too frustrated, and as a general productivity strategy, it is advisable to take small steps and check the results at each step. Don't try to do everything in the smallest number of steps - at least not in your first pass.

When you're done, printing the relevant variables from the first row using the following command:

should look like:

PLEASE do each step above in a different cell, checking your results at each step, and add comment cells to explain your work.

When you complete this exercise, save the notebook and submit it on the class website.  

Let's do the first couple of steps to get started, beginning with creating variables. Note that the bedrooms and sqft columns hold the same string, which contains both items of information.

In [5]:
# Open in text mode; the file is assumed to be UTF-8 encoded.
with open('Data/items.csv', newline='', encoding='utf-8') as csvfile:
    i = 0
    itemreader = csv.reader(csvfile)
    #next(itemreader, None)  # skip the headers
    for row in itemreader:
        i = i+1
        if i < 6:
            # Assign each column of interest to a named variable.
            neighborhood = row[0]
            title = row[1]
            price = row[2]
            bedrooms = row[3]
            sqft = row[9]
            print(neighborhood, ',', price, ',', bedrooms, ',', sqft)

neighborhood , price , bedrooms , sqft
 (SOMA / south beach) , $2895 ,    / 1br - 950ft² -     ,    / 1br - 950ft² -    
 (SOMA / south beach) , $3354 ,    / 1br - 710ft² -     ,    / 1br - 710ft² -    
 (inner sunset / UCSF) , $2865 ,    / 1br - 644ft² -     ,    / 1br - 644ft² -    
 (financial district) , $5500 ,    / 2br - 1450ft² -     ,    / 2br - 1450ft² -    


OK, next we strip the parentheses from around the neighborhood name...

In [6]:
# Open in text mode; the file is assumed to be UTF-8 encoded.
with open('Data/items.csv', newline='', encoding='utf-8') as csvfile:
    i = 0
    itemreader = csv.reader(csvfile)
    #next(itemreader, None)  # skip the headers
    for row in itemreader:
        i = i+1
        if i < 6:
            # Remove spaces and parentheses from both ends of the name.
            neighborhood = row[0].strip(' ()')
            title = row[1]
            price = row[2]
            bedrooms = row[3]
            sqft = row[9]
            print(neighborhood, ',', price, ',', bedrooms, ',', sqft)

neighborhood , price , bedrooms , sqft
SOMA / south beach , $2895 ,    / 1br - 950ft² -     ,    / 1br - 950ft² -    
SOMA / south beach , $3354 ,    / 1br - 710ft² -     ,    / 1br - 710ft² -    
inner sunset / UCSF , $2865 ,    / 1br - 644ft² -     ,    / 1br - 644ft² -    
financial district , $5500 ,    / 2br - 1450ft² -     ,    / 2br - 1450ft² -    
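
It may help to see how strip treats its argument: as a set of characters to remove from both ends, not as a substring. A minimal sketch using the first neighborhood value from the data, plus a made-up second string to show that the middle of a string is left alone:

```python
raw = ' (SOMA / south beach)'

# strip() removes any of the given characters from both ends of the
# string, so one call handles the space and both parentheses.
cleaned = raw.strip(' ()')
print(cleaned)

# Characters in the middle of the string are never touched.
inner = '(a (b) c)'.strip(' ()')
print(inner)
```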


Your turn!  Continue from step 3 through 8, and when you are done, submit the completed notebook.