# Working With Files and Doing Data Cleaning

Today we begin working with files.  We learned about iterating last week as part of programming logic.  We can now begin to use that knowledge to iterate through the rows of a file, to extract values, or do computations on the data in the file.

Today we will get familiar with reading and writing data in arbitrary text files, JSON files and in CSV (comma separated value) formatted files.  We will practice doing some work on the contents of the files, like computing statistics on them, and beginning to clean up messy data using some of our string methods and type conversion methods from the first week.

Next session we will continue working with files, and learn a bit of QGIS to load and visualize our data.

## First, a Review on Planning Your Programs

One of the most difficult things about learning to program is to learn how to start.  What to do before you write the code, and how to work your way through the coding process.

A good general approach is to think through the problem you want to solve first, just conceptually.  How will you know you have solved it?  Are there tests that you can use to be sure?  Can you break the problem down into smaller components, and solve those sequentially?  This step is where you conceptualize your algorithm, or your plan for the code.

What approach would you use to solve each of those components? Can you describe those steps in English? We call this step writing 'pseudo-code'.

Finally, there is the coding step. And the inevitable debugging step.  You really can't do one without the other.

Generally, it is good practice to work your way through problems in this way, and write the code for each building block, testing it to be sure it works for all the kinds of cases you can imagine, then test them together.  You'll end up being more productive, and far less frustrated, using a systematic, problem-solving approach.  

And by all means, don't try to tackle it all at once.  Below is an example of how to work through this process.

### Phase 1: Conceptual Plan

Think about the problem you are trying to solve, and try to determine how to break it down into steps you can conceptually solve.

### Phase 2: Pseudo-Code

Write the idea down as an algorithm, in words rather than code.

### Phase 3: Code Incrementally, Test, and Document

Generally, build the code one step at a time, and test that step.  Add comments to explain your logic.  In your assignments, include narrative explaining your reasoning, and add explanatory comments in the code every few lines to explain what you are doing in each part.

## Reviewing the Prime Numbers Exercise

Let's review the prime numbers example from the perspective of planning how to write your code, and how to build and test it.  The process combines work planning and problem solving, so that you have a productive experience and produce clean, readable code that is reusable and as free of bugs as you can make it.

The objective of this exercise is to enable you to assimilate the material we have covered so far to solve a novel problem.  You should have all the tools you need, by now, and just need a bit of practice at putting the pieces together to solve a problem.  The problem we want to solve is how to test whether a whole number is a prime number. Recall that a Prime Number can be divided evenly only by 1 or itself, and it must be a whole number greater than 1. 

So what would you need to do to have reusable code that tests whether any given number is prime? Write a function that tests whether the number passed to it meets these conditions.

Here is a table of prime numbers up to 1000:

### Phase 1: Conceptual Plan

What do we need to do to determine whether each number between 1 and 100 is a prime number?

1. We need to see whether any given number can be divided evenly by any number other than 1 and itself.  If we count how many whole numbers from 1 up to the number itself divide it evenly, the count should be exactly 2.
2. We need to test this for every number between 1 and 100

### Phase 2: Pseudo-Code

1. Write a function (isprime) to test whether a number passed to it as an argument (x) is a prime number.
    - Iterate over all values from x to 1.
    - At each iteration, test whether the original x is evenly divisible by this iteration value.
    - Keep track of how many times you get an evenly divisible result.
    - If the count is exactly 2, call x a prime number.

2. Write a loop from 1 to 100, call this value (z)
    - Within the loop, call function isprime, and pass it the value of z.
    - Print the list of prime numbers.

### Phase 3: Code

Let's build the code one step at a time, and test that step.  Add comments to explain your logic.

In [58]:
def isprime(x):
    #Need to keep track of the original value of x, so keep a copy as y
    y = x 
    #Initialize a counter to keep track of how many times y is evenly divisible
    count = 0 
    while x > 0: #create an iterator over values of x (be careful of infinite loops!)
        even = (y%x == 0) #is y evenly divisible by x?
        print(y, 'is evenly divisible by', x, 'is:', even) #Print the result of that
        count = count + even #increment the count every time it is evenly divisible
        x = x - 1 #Must not forget to decrement this counter, or will create an infinite loop!
    print(y, 'is evenly divisible', count, 'times')

In [59]:
isprime(6)

6 is evenly divisible by 6 is: True
6 is evenly divisible by 5 is: False
6 is evenly divisible by 4 is: False
6 is evenly divisible by 3 is: True
6 is evenly divisible by 2 is: True
6 is evenly divisible by 1 is: True
6 is evenly divisible 4 times


OK, now that we're getting the results we need, let's streamline the output and check one value at a time.  

Side note: The process of iteratively editing and refining your code is sometimes referred to as 'refactoring' it.  This often involves re-writing sections of it, throwing away parts, and reorganizing it.

In [60]:
def isprime(x):
    y = x
    count = 0
    while x > 0:
        even = (y%x == 0)
        count = count + even
        x = x - 1
    if count != 2:
        print(y, 'is not a prime number')
    else:
        print(y, 'is a prime number')

In [62]:
isprime(12)

12 is not a prime number


After testing this on a bunch of numbers, it seems to be properly discriminating between prime and non-prime numbers, so now we create a loop to test that systematically and print the resulting list of prime numbers.  And let's print the results in a concise way like the table of prime numbers above.

In [63]:
def isprime(x):
    y = x
    count = 0
    while x > 0:
        even = (y%x == 0)
        count = count + even
        x = x - 1
    if count != 2:
        pass #No need to print a result if it is not prime
    else:
        print(y, end=' ') 
        #This just prints the value if it is a prime number. The end=' ' keeps it on the same line

In [64]:
start = 1
limit = 1000
print('The prime numbers between', start, 'and', limit, 'are:')
while start <= limit:
    isprime(start)
    start = start + 1

The prime numbers between 1 and 1000 are:
2 3 5 7 11 13 17 19 23 29 31 37 41 43 47 53 59 61 67 71 73 79 83 89 97 101 103 107 109 113 127 131 137 139 149 151 157 163 167 173 179 181 191 193 197 199 211 223 227 229 233 239 241 251 257 263 269 271 277 281 283 293 307 311 313 317 331 337 347 349 353 359 367 373 379 383 389 397 401 409 419 421 431 433 439 443 449 457 461 463 467 479 487 491 499 503 509 521 523 541 547 557 563 569 571 577 587 593 599 601 607 613 617 619 631 641 643 647 653 659 661 673 677 683 691 701 709 719 727 733 739 743 751 757 761 769 773 787 797 809 811 821 823 827 829 839 853 857 859 863 877 881 883 887 907 911 919 929 937 941 947 953 967 971 977 983 991 997 
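
The versions of isprime above print their results, which makes them harder to reuse or to test automatically. As a minimal sketch (not part of the original exercise), a variant could return True or False instead. The name `is_prime` and the use of a `for` loop over `range` are choices made here for illustration.

```python
def is_prime(n):
    """Return True if n is a prime number, otherwise False."""
    # Primes must be whole numbers greater than 1
    if n < 2:
        return False
    # Count how many values from 1 to n divide n evenly
    divisor_count = 0
    for candidate in range(1, n + 1):
        if n % candidate == 0:
            divisor_count = divisor_count + 1
    # A prime is evenly divisible only by 1 and itself
    return divisor_count == 2

# Collect the primes in a list so they can be reused later
primes = []
for z in range(1, 101):
    if is_prime(z):
        primes.append(z)
print(primes)
```

Because the function returns a value, the caller decides what to do with the result: print it, store it in a list, or check it in a test.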

# Working With Files

Most data you might want to work with is likely to be in some kind of file.  You often will want to work with data in comma separated text files, or in spreadsheet tables, or in tables stored in a database.  And you will increasingly find data online, not only as text or spreadsheets to download, but as Open Data APIs, returned from a web service as a JSON object.  This session covers how to work with a variety of file formats in Python, and how to begin processing data in files to clean data.

### Basics of Reading and Writing Files in Python

Let's start by creating a simple file, and then reading it back.  We will use 'open' to open a file we will call 'tempfile.txt' in 'write' (w) mode.  We will assign the file object, which is **iterable**, to a variable we will arbitrarily call 'f'.

In [117]:
f = open('tempfile.txt', 'w')
for i in range(10):
    f.write('this is line ' + str(i) + '\n')
f.close()

Pay attention to the last line above.  When you open a file in write mode you have to **close** it when you are done, or some of the data may not be written to disk, leaving the file incomplete or unusable when you need it.

You can open the text file in an editor to verify that this code wrote the file as expected.  Now open the file we just created in Python, in read mode (r)

In [66]:
f = open('tempfile.txt', 'r')

The first method to read a file is to read it in all at once, with a read() method. Here we load the whole file in memory and assign it to a. Note that we will re-open the file here to start from the beginning. Otherwise it will be positioned at the end of the file and give us back an empty string. 

In [67]:
f = open('tempfile.txt', 'r')
a = f.read()
a

'this is line 0\nthis is line 1\nthis is line 2\nthis is line 3\nthis is line 4\nthis is line 5\nthis is line 6\nthis is line 7\nthis is line 8\nthis is line 9\n'

Note that when we use read() it creates a string object.  The whole file is loaded into one big string.  Sometimes this may be useful, but often it is not the best way to load a file.

In [68]:
type(a)

str

An alternative approach is to step through reading each line of the file with the readline() method.  Notice that each time you execute this it advances to the next line.  Each line is read into a string object.  In this case we are not doing anything with that object except printing it.

In [69]:
f = open('tempfile.txt', 'r')
print(f.readline())
print(f.readline())

this is line 0

this is line 1



Why did the code above print a blank line between the two lines (and another blank line after the second one)?

Here is another way to loop through the lines of the file and print them all out. Remember how to print the lines and suppress the extra newline?

In [70]:
f = open('tempfile.txt', 'r')
for line in f:
    print(line, end='')

this is line 0
this is line 1
this is line 2
this is line 3
this is line 4
this is line 5
this is line 6
this is line 7
this is line 8
this is line 9


The plural version of the readline() method, readlines(), generates a list of the lines in a file, with one string for each line of the file.  Notice that this list contains the raw text contents, including the newline string '\n'.

Assigning the contents of the file to a list object is helpful since you can do more work on that list object, whereas the results above from printing are gone as soon as you run the code.  They just print results to output.

In [71]:
f = open('tempfile.txt', 'r')
a = f.readlines()
a

['this is line 0\n',
 'this is line 1\n',
 'this is line 2\n',
 'this is line 3\n',
 'this is line 4\n',
 'this is line 5\n',
 'this is line 6\n',
 'this is line 7\n',
 'this is line 8\n',
 'this is line 9\n']
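
Once the lines are in a list, the trailing newline characters can be removed with a string method. The sketch below uses a short hard-coded list in place of the file contents, and the variable names `lines` and `clean_lines` are made up for this example.

```python
# A few lines as readlines() would return them, newline included
lines = ['this is line 0\n', 'this is line 1\n', 'this is line 2\n']

# Build a new list with the trailing newline stripped from each line
clean_lines = []
for line in lines:
    clean_lines.append(line.rstrip('\n'))

print(clean_lines)
```

Using rstrip('\n') removes only newline characters at the end of the string, so any other whitespace in the line is left alone.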

Using **with** is a handy way to open a file, load its data, and automatically close the file. 

In [74]:
with open('tempfile.txt', 'r') as f:
    read_data = f.read()
print(read_data)
f.closed


this is line 0
this is line 1
this is line 2
this is line 3
this is line 4
this is line 5
this is line 6
this is line 7
this is line 8
this is line 9



True

### Working with JSON

JSON (JavaScript Object Notation) is a common format for data accessed from a web browser, which is generally running JavaScript.  We will see this format a lot when we begin working with data on a web page or accessed with an API.

The json dumps() method converts Python objects to JSON format, using the counterpart format for each data type, as in the table below.

In [75]:
import json

json.dumps([1,2,3])

'[1, 2, 3]'

Notice that objects can be complex, containing multiple types of data, and still be easily translated between Python objects and JSON format.  The following example converts a Python list, containing one element that is a dictionary, to JSON.

In [76]:
json.dumps([1,2,3,{'foo': 'bar'}])

'[1, 2, 3, {"foo": "bar"}]'
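
To see how individual Python types map to JSON, it can help to convert a small mixed object and then convert it back with json.loads(). The `record` dictionary below is invented for illustration.

```python
import json

# A made-up record mixing several Python types
record = {
    'name': 'rain gauge',
    'readings': [5.3, 4, None],
    'active': True,
    'location': (37.5, -122.3),
}

# Python -> JSON text: None becomes null, True becomes true, the tuple becomes an array
text = json.dumps(record)
print(text)

# JSON text -> Python: the array comes back as a list, not a tuple
restored = json.loads(text)
print(restored)
```

Note that the round trip is not always exact: JSON has no tuple type, so the tuple returns as a list.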

Below we convert the contents of tempfile.txt to a JSON-formatted string.

In [77]:
f = open('tempfile.txt', 'r')

x = json.dumps(f.readlines())
x

'["this is line 0\\n", "this is line 1\\n", "this is line 2\\n", "this is line 3\\n", "this is line 4\\n", "this is line 5\\n", "this is line 6\\n", "this is line 7\\n", "this is line 8\\n", "this is line 9\\n"]'

With the dump() method, we can write JSON data to a file.  Here we read tempfile.txt, and create a new JSON formatted file into which we write its contents.  Using just a couple of lines of code, we read a file into Python, convert it to JSON format, and write it to a new file.

In [78]:
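# open both files using with, so they are closed (and the JSON fully written) when we are done
with open('tempfile.txt', 'r') as f, open('temp.json', 'w') as j:
    json.dump(f.readlines(), j)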

Using the load() method, we can read JSON formatted data and load it into a Python object.

In [79]:
with open('temp.json', 'r') as j:
    x = json.load(j)
x

['this is line 0\n',
 'this is line 1\n',
 'this is line 2\n',
 'this is line 3\n',
 'this is line 4\n',
 'this is line 5\n',
 'this is line 6\n',
 'this is line 7\n',
 'this is line 8\n',
 'this is line 9\n']

### Working with CSV Files

CSV (Comma Separated Values) is probably the most common format of data you will encounter.  Files in this format are often exported from a database table or from Excel, or just used as a simple, standard text (ASCII) file format for ease of use.

Let's begin by writing a CSV file like the JSON example above, by importing the csv module, and writing a file with several columns, separated by commas.

In [3]:
my_data = []
for i in range(10):
    my_data.append([i, i*2, i+2])
my_data

[[0, 0, 2],
 [1, 2, 3],
 [2, 4, 4],
 [3, 6, 5],
 [4, 8, 6],
 [5, 10, 7],
 [6, 12, 8],
 [7, 14, 9],
 [8, 16, 10],
 [9, 18, 11]]

Now we will write the CSV file using my_data, adding a header row first with column names.  Note that we open the file as before, in write mode, but now use the writerow() method to write one row with the header, and writerows() to write all of the data rows to the file.

In [4]:
import csv
with open('my_data.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(["x", "y", "z"])
    writer.writerows(my_data)

Reading a CSV file is very similar to writing one, but simpler.  We create a reader object that is iterable, and then we can iterate over the rows and do things, like print each row.

In [83]:
with open('my_data.csv', newline='') as f:
    reader = csv.reader(f)
    for row in reader:
        print(row)

['x', 'y', 'z']
['0', '0', '2']
['1', '2', '3']
['2', '4', '4']
['3', '6', '5']
['4', '8', '6']
['5', '10', '7']
['6', '12', '8']
['7', '14', '9']
['8', '16', '10']
['9', '18', '11']


If we want to actually work with the data, then we need to assign it to an object rather than just printing it.  Here we can use the list() function to convert the iterable reader object to a list of lists, one per row.

In [84]:
with open('my_data.csv', newline='') as f:
    reader = csv.reader(f)
    my_data = list(reader)
my_data

[['x', 'y', 'z'],
 ['0', '0', '2'],
 ['1', '2', '3'],
 ['2', '4', '4'],
 ['3', '6', '5'],
 ['4', '8', '6'],
 ['5', '10', '7'],
 ['6', '12', '8'],
 ['7', '14', '9'],
 ['8', '16', '10'],
 ['9', '18', '11']]

If you want to skip the header row in order to have the data without the header, you can use **next** after instantiating the reader object, to advance one row in the CSV file.

In [85]:
with open('my_data.csv', newline='') as f:
    reader = csv.reader(f)
    next(reader)
    my_data = list(reader)
my_data

[['0', '0', '2'],
 ['1', '2', '3'],
 ['2', '4', '4'],
 ['3', '6', '5'],
 ['4', '8', '6'],
 ['5', '10', '7'],
 ['6', '12', '8'],
 ['7', '14', '9'],
 ['8', '16', '10'],
 ['9', '18', '11']]
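
Notice that every value read from a CSV file comes back as a string. Before doing arithmetic, the values need to be converted. The sketch below uses io.StringIO as a stand-in for an open file so that it runs without my_data.csv on disk. The names `csv_text` and `numeric_rows` are assumptions made for this example.

```python
import csv
import io

# A few rows in the same layout as my_data.csv
csv_text = 'x,y,z\n0,0,2\n1,2,3\n2,4,4\n'

reader = csv.reader(io.StringIO(csv_text))
next(reader)  # skip the header row

# Convert each string value to an int, row by row
numeric_rows = []
for row in reader:
    numeric_row = []
    for value in row:
        numeric_row.append(int(value))
    numeric_rows.append(numeric_row)

print(numeric_rows)
```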

Since the data is now available as an object, you can do normal Python processing on it, like selecting the first entry of each row and printing it.

In [86]:
for row in my_data:
    print(row[0])

0
1
2
3
4
5
6
7
8
9


### Reading a CSV File and Computing Statistics With it

Let's do some work with a data file.  We will use a sample file that has monthly rainfall amounts in it.  First read the file and print its contents. The file is 'data/rain.csv'

In [5]:
with open('data/rain.csv', 'r') as csvfile:
    my_csv = csv.reader(csvfile)
    for row in my_csv:
        print(row)

['month_2014', 'rainfall_inches']
['jan', '5.3']
['feb', '5.4']
['mar', '4.8']
['apr', '4.7']
['may', '3.3']
['jun', '1.2']
['jul', '0.8']
['aug', '0.7']
['sep', '']
['oct', '3.9']
['nov', '4.5']
['dec', '5.9']


Use rain.csv to calculate mean value.  How would we do this conceptually?

We need a sum of the monthly rainfall, and a count of months to divide it by to get the mean (average) monthly rainfall.

So as we loop through the file row by row, let's update a count and a cumulative sum, and when we are done looping through the rows, we just divide the cumulative sum by the count to get the mean.

In [6]:
with open('data/rain.csv', 'r') as csvfile: #use with in order to automatically close the file at the end
    
    # initialize a counter and variables to contain our descriptive stats
    count = 0 #at the end, divide cumulative_sum by this to get the mean
    cumulative_sum = 0 #our rolling sum
    
    # open the file and skip the header row
    my_csv = csv.reader(csvfile)
    next(my_csv)
    
    # loop through each data row
    for row in my_csv:       
        # increment the counter and extract this row's rainfall as a float
        count = count + 1
        rainfall = float(row[1])
            
        # add this row's rainfall to the cumulative sum
        cumulative_sum = cumulative_sum + rainfall
            
    # after looping through all the rows, divide the cumulative sum by the count and round to get the mean
    mean_value = round(cumulative_sum / count, 1)
    
    # print out the mean value
    print('mean:', mean_value, 'inches')


ValueError: could not convert string to float: 

Hmm... what's wrong with the code? We got a strange error.

Do any of the rows look like the value of rainfall might be difficult for Python to compute a floating point value from?

Let's see what happens when we use the float() function on an empty string...

In [7]:
float('')

ValueError: could not convert string to float: 

Maybe we should skip September's value since it is missing?  And we should be careful to exclude it from the count also, so we don't get an incorrect mean value.

In [8]:
with open('data/rain.csv', 'r') as csvfile:
    
    count = 0 
    cumulative_sum = 0 
    
    my_csv = csv.reader(csvfile)
    next(my_csv)
    
    for row in my_csv:
        
        # rainfall amount is in column 1, only process this row's value if not an empty string
        if not row[1] == '':
            
            count = count + 1
            rainfall = float(row[1])
            
            cumulative_sum = cumulative_sum + rainfall
            
    mean_value = round(cumulative_sum / count, 1)
    
    print('mean:', mean_value, 'inches')


mean: 3.7 inches
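
Checking for an empty string handles this particular file. A possible alternative, sketched below, is to attempt the conversion and catch the ValueError, which would also cover other unconvertible text. The helper name `to_float` and the 'n/a' value are hypothetical; they do not appear in rain.csv.

```python
def to_float(text):
    """Return text converted to a float, or None if it cannot be converted."""
    try:
        return float(text)
    except ValueError:
        # float() raises ValueError for '' and for non-numeric text
        return None

print(to_float('5.3'))
print(to_float(''))
print(to_float('n/a'))
```

Rows for which the helper returns None could then be skipped in the loop, in the same way the empty string is skipped above.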


OK, great. Now practice with this process by adding to this code a calculation for the maximum rainfall value (it should be 5.9).

In [9]:
with open('data/rain.csv', 'r') as csvfile:
    
    # initialize a counter and variables to contain our descriptive stats
    count = 0 #at the end, divide cumulative_sum by this to get the mean
    cumulative_sum = 0 #our rolling sum
    max_value = -1 #pick a really small number that's guaranteed to be less than the max
    
    # open the file and skip the header row
    my_csv = csv.reader(csvfile)
    next(my_csv)
    
    # loop through each data row
    for row in my_csv:
        
        # rainfall amount is in column 1, only process this row's value if not an empty string
        if not row[1] == '':
            
            # increment the counter and extract this row's rainfall as a float
            count = count + 1
            rainfall = float(row[1])
            
            # add this row's rainfall to the cumulative sum
            cumulative_sum = cumulative_sum + rainfall
            
            # if this row's rainfall is greater than the current max value, update with the new max
            if rainfall > max_value:
                max_value = rainfall

    # after looping through all the rows, divide the cumulative sum by the count and round to get the mean
    mean_value = round(cumulative_sum / count, 1)
    
    # print out the mean and max values
    print('mean:', mean_value, 'inches')
    print('max:', max_value, 'inches')

mean: 3.7 inches
max: 5.9 inches


Now add a calculation to find the minimum rainfall amount.

In [11]:
with open('data/rain.csv', 'r') as csvfile:
    
    # initialize a counter and variables to contain our descriptive stats
    count = 0 #at the end, divide cumulative_sum by this to get the mean
    cumulative_sum = 0 #our rolling sum
    max_value = -1 #pick a really small number that's guaranteed to be less than the max
    min_value = 100 #pick a really large number that's guaranteed to be greater than the min
    
    # open the file and skip the header row
    my_csv = csv.reader(csvfile)
    next(my_csv)
    
    # loop through each data row
    for row in my_csv:
        
        # rainfall amount is in column 1, only process this row's value if not an empty string
        if not row[1] == '':
            
            # increment the counter and extract this row's rainfall as a float
            count = count + 1
            rainfall = float(row[1])
            
            # add this row's rainfall to the cumulative sum
            cumulative_sum = cumulative_sum + rainfall
            
            # if this row's rainfall is greater than the current max value, update with the new max
            if rainfall > max_value:
                max_value = rainfall

            # if this row's rainfall is less than the current min value, update with the new min
            if rainfall < min_value:
                min_value = rainfall

    # after looping through all the rows, divide the cumulative sum by the count and round to get the mean
    mean_value = round(cumulative_sum / count, 1)
    
    # print out the mean, max, and min values
    print('mean:', mean_value, 'inches')
    print('max:', max_value, 'inches')
    print('min:', min_value, 'inches')

mean: 3.7 inches
max: 5.9 inches
min: 0.7 inches
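
The starting values of -1 and 100 work for this data set, but they depend on knowing the range of the data in advance. One alternative, sketched here, is to collect the valid values in a list and then use Python's built-in sum(), len(), max() and min() functions. The rainfall text is copied from the file contents printed above and read through io.StringIO, so the example does not depend on data/rain.csv being present.

```python
import csv
import io

# Contents of rain.csv as printed earlier (September is missing)
rain_text = """month_2014,rainfall_inches
jan,5.3
feb,5.4
mar,4.8
apr,4.7
may,3.3
jun,1.2
jul,0.8
aug,0.7
sep,
oct,3.9
nov,4.5
dec,5.9
"""

reader = csv.reader(io.StringIO(rain_text))
next(reader)  # skip the header row

# Keep only the rows that have a rainfall value
values = []
for row in reader:
    if row[1] != '':
        values.append(float(row[1]))

mean_value = round(sum(values) / len(values), 1)
print('mean:', mean_value, 'inches')
print('max:', max(values), 'inches')
print('min:', min(values), 'inches')
```

This trades a little memory (the whole column is held in a list) for shorter code with no starting values to guess.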


### Cleaning up Messy Data

Let's look at another data file - one that contains 15 Craigslist rental listings that we have already done some cleanup on.

In [92]:
with open('data/rents_raw.csv', 'r') as csvfile:
    my_csv = csv.reader(csvfile)
    for row in my_csv:
        print(row)

['neighborhood', 'price', 'bedrooms', 'date', 'sqft', 'longitude', 'latitude']
['foster city', '2495', '1', '11/14/2014 12:26', '755', '-122.27', '37.5538']
['palo alto', '2695', '', '11/14/2014 12:25', '443', '-122.161524', '37.450289']
['brisbane', '3150', '2', '11/14/2014 12:24', '1242', '-122.417912', '37.692415']
['palo alto', '2800', '2', '11/14/2014 12:24', '', '', '']
['san mateo', '2196', '1', '11/14/2014 12:24', '676', '-122.2998', '37.5395']
['santa clara', '3264', '3', '11/14/2014 12:28', '1138', '', '']
['san jose south', '2000', '2', '11/14/2014 12:28', '822', '-121.902268', '37.253503']
['sunnyvale', '4740', '3', '11/14/2014 12:28', '1406', '-122.034683', '37.368445']
['inner sunset / UCSF', '3395', '2', '11/14/2014 12:32', '', '-122.479345', '37.764582']
['richmond / seacliff', '2699', '1', '11/14/2014 12:32', '', '-122.503781', '37.7718']
['SOMA / south beach', '3620', '1', '11/14/2014 12:30', '860', '-122.395195', '37.775133']
['dublin / pleasanton / livermore', '2025

In [93]:
# the column headers are the first row in the data file
# use next to iterate our csv reader to the first row to grab the headers
with open('data/rents_raw.csv', 'r') as csvfile:
    my_csv = csv.reader(csvfile)
    headers = next(my_csv)
    print(headers)

['neighborhood', 'price', 'bedrooms', 'date', 'sqft', 'longitude', 'latitude']


In [95]:
# what is column index 1 (zero-indexed) in our data set?
headers[1]

'price'

In [96]:
# for each row in the data set, print the price column's value
with open('data/rents_raw.csv', 'r') as csvfile:
    my_csv = csv.reader(csvfile)
    for row in my_csv:
        print(row[1])

price
2495
2695
3150
2800
2196
3264
2000
4740
3395
2699
3620
2025

1795
4299


In [97]:
# create a new list to contain the column of prices in the data set
prices = []
with open('data/rents_raw.csv', 'r') as csvfile:
    my_csv = csv.reader(csvfile)
    for row in my_csv:
        prices.append(row[1])  
prices

['price',
 '2495',
 '2695',
 '3150',
 '2800',
 '2196',
 '3264',
 '2000',
 '4740',
 '3395',
 '2699',
 '3620',
 '2025',
 '',
 '1795',
 '4299']

This list has a couple of problems. First, it includes the header. Second, it's all strings even though prices are numeric data. Third, it contains some empty strings. We'll have to clean it up.

In [98]:
# to remove the first element of the list, we can just capture position 1 through the end of the list
prices_noheader = prices[1:]
prices_noheader

['2495',
 '2695',
 '3150',
 '2800',
 '2196',
 '3264',
 '2000',
 '4740',
 '3395',
 '2699',
 '3620',
 '2025',
 '',
 '1795',
 '4299']

In [99]:
# now let's convert the price strings to integers
for price in prices_noheader:
    print(int(float(price)), ' ')

2495  
2695  
3150  
2800  
2196  
3264  
2000  
4740  
3395  
2699  
3620  
2025  


ValueError: could not convert string to float: 

In [100]:
# you can't convert an empty string to a numeric type
for price in prices_noheader:
    if not price == '':
        print(int(float(price)))
    else:
        print('None')

2495
2695
3150
2800
2196
3264
2000
4740
3395
2699
3620
2025
None
1795
4299


In [101]:
# encapsulate this functionality inside a new function
def extract_int_price(price):
    if not price == '':
        return int(float(price))
    else:
        return None

In [102]:
# use our function to convert each element in the list of prices to an integer
for price in prices_noheader:
    print(extract_int_price(price))

2495
2695
3150
2800
2196
3264
2000
4740
3395
2699
3620
2025
None
1795
4299


In [103]:
# rather than just printing each converted value, turn it into a new list called int_prices
int_prices = []
for price in prices_noheader:
    int_prices.append(extract_int_price(price))
print(int_prices)

[2495, 2695, 3150, 2800, 2196, 3264, 2000, 4740, 3395, 2699, 3620, 2025, None, 1795, 4299]
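
The cleaned list still contains None for the missing price, so any calculation on it has to skip that entry (adding None to a number raises a TypeError). A minimal sketch, reusing the values printed above, with `valid_prices` and `mean_price` as names chosen for this example:

```python
# The cleaned prices from above, including the missing value
int_prices = [2495, 2695, 3150, 2800, 2196, 3264, 2000, 4740,
              3395, 2699, 3620, 2025, None, 1795, 4299]

# Keep only the entries that are real numbers
valid_prices = []
for price in int_prices:
    if price is not None:
        valid_prices.append(price)

# Compute the mean over the valid prices only
mean_price = round(sum(valid_prices) / len(valid_prices))
print('listings with a price:', len(valid_prices))
print('mean price:', mean_price)
```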


### Now let's clean up our neighborhood names


In [104]:
# replace any forward slashes in neighborhood name with a hyphen
with open('data/rents_raw.csv', 'r') as csvfile:
    my_csv = csv.reader(csvfile)
    next(my_csv) #skip the header row
    for row in my_csv:
        print(row[0].replace('/', '-')) #use string.replace() method

foster city
palo alto
brisbane
palo alto
san mateo
santa clara
san jose south
sunnyvale
inner sunset - UCSF
richmond - seacliff
SOMA - south beach
dublin - pleasanton - livermore
concord - pleasant hill - martinez
hercules, pinole, san pablo, el sob
corte madera


In [106]:
# create a new function to replace forward slashes with hyphens and remove commas
def clean_neighborhood(neighborhood_name):
    # you can daisy chain multiple string.replace() methods
    return neighborhood_name.replace('/', '-').replace(',', '')

### Now let's use the functions above to create a cleaner version of the data.

In [120]:
# clean the data set by calling the cleaning functions and save the results to variables
rentals_cleaned = []
with open('data/rents_raw.csv', 'r') as csvfile:
    my_csv = csv.reader(csvfile)
    next(my_csv)
    for row in my_csv:
        neighborhood_cleaned = clean_neighborhood(row[0])
        price_cleaned = extract_int_price(row[1])
        rentals_cleaned.append([neighborhood_cleaned, price_cleaned])      

# display our nested lists of data        
rentals_cleaned

[['foster city', 2495],
 ['palo alto', 2695],
 ['brisbane', 3150],
 ['palo alto', 2800],
 ['san mateo', 2196],
 ['santa clara', 3264],
 ['san jose south', 2000],
 ['sunnyvale', 4740],
 ['inner sunset - UCSF', 3395],
 ['richmond - seacliff', 2699],
 ['SOMA - south beach', 3620],
 ['dublin - pleasanton - livermore', 2025],
 ['concord - pleasant hill - martinez', None],
 ['hercules pinole san pablo el sob', 1795],
 ['corte madera', 4299]]

### Create a new data set with cleaned up variables

In [109]:
with open('data/cleaned_data.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(["neighborhood_cleaned", "price_cleaned"])
    writer.writerows(rentals_cleaned)

### Practice time

Create a function to clean number of bedrooms and add this to the cleaned_data.csv file.

In [105]:
# create a new function to convert bedrooms from a string to an int
def extract_int_bedrooms(bedrooms):
    if not bedrooms == '':
        return int(float(bedrooms))
    else:
        return None

In [107]:
# clean the data set by calling the cleaning functions and save the results to variables
rentals_cleaned = []
with open('data/rents_raw.csv', 'r') as csvfile:
    my_csv = csv.reader(csvfile)
    next(my_csv)
    for row in my_csv:
        neighborhood_cleaned = clean_neighborhood(row[0])
        price_cleaned = extract_int_price(row[1])
        bedrooms_cleaned = extract_int_bedrooms(row[2])
        rentals_cleaned.append([neighborhood_cleaned, price_cleaned, bedrooms_cleaned])      

# display our nested lists of data        
rentals_cleaned

[['foster city', 2495, 1],
 ['palo alto', 2695, None],
 ['brisbane', 3150, 2],
 ['palo alto', 2800, 2],
 ['san mateo', 2196, 1],
 ['santa clara', 3264, 3],
 ['san jose south', 2000, 2],
 ['sunnyvale', 4740, 3],
 ['inner sunset - UCSF', 3395, 2],
 ['richmond - seacliff', 2699, 1],
 ['SOMA - south beach', 3620, 1],
 ['dublin - pleasanton - livermore', 2025, 1],
 ['concord - pleasant hill - martinez', None, 2],
 ['hercules pinole san pablo el sob', 1795, 1],
 ['corte madera', 4299, 3]]

Now save the data to a new csv file:

In [109]:
with open('data/cleaned_data.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(["neighborhood_cleaned", "price_cleaned", "bedrooms_cleaned"])
    writer.writerows(rentals_cleaned)

Load the file in a text editor to confirm that it did what you expected.
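
The file can also be checked from Python by reading it back. The sketch below writes two of the cleaned rows to a temporary folder (so it does not touch data/cleaned_data.csv) and reads them again. The variable names `rows`, `folder`, `path` and `read_back` are assumptions made for this example.

```python
import csv
import os
import tempfile

# Two of the cleaned rows from above, one with a missing bedroom count
rows = [['palo alto', 2695, None], ['brisbane', 3150, 2]]

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, 'cleaned_data.csv')

    # Write the header and the rows
    with open(path, 'w', newline='') as f:
        writer = csv.writer(f)
        writer.writerow(['neighborhood_cleaned', 'price_cleaned', 'bedrooms_cleaned'])
        writer.writerows(rows)

    # Read the file back into a list of rows
    with open(path, newline='') as f:
        read_back = list(csv.reader(f))

print(read_back)
```

Two things are worth noticing in the result: csv.writer writes None as an empty string, and every value comes back as a string, so the type conversions have to be repeated whenever the cleaned file is loaded again.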