# August 28

Working with one dataset is all well and good, but we often want to associate information in two datasets. For example, we could have different subsets of data, with the same columns, collected at different times. Perhaps we have technicians going into the field at different points in time. To the `data` folder, I have also added a second file, `second_subset.csv`. These are data that a field tech forgot to add to the dataset, so we will now add them.

Read in the `surveys.csv` file and the `second_subset.csv` data file. Call the variables to which you save each one `surveys_df` and `second_df`, respectively.



In [2]:
import pandas as pd

# A brief note on files

So far, I've just asked you to take it on faith that our data are in `../data/surveys.csv`. But what does that mean? Let's go to the board for a minute. 

Exercise for once I've explained on the board:

- What is the parent directory of secret_data? 
- What would be the path to secret_data from notebooks?
- What are the child directories of maze?
- Can you read in the data from hidden_data?

In the coming weeks, we will talk about organizing a computational research project, such that we have our scripts all in one place, our data all in one place, and our outputs all in one place. But for now, I'm just introducing a little information on where the "../" comes from. 
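As a rough illustration of where the `../` comes from, the sketch below resolves a relative path by hand. The folder names `project` and `notebooks` are assumptions made up for this example; your own layout may differ.

```python
import posixpath

# Hypothetical layout (an assumption for illustration only):
# project/
#     notebooks/   <- where the notebook is running
#     data/        <- where surveys.csv lives
working_dir = "project/notebooks"
relative_path = "../data/surveys.csv"

# ".." means "go up to the parent directory", so joining the two pieces
# and normalizing the result shows which file the relative path points to.
resolved = posixpath.normpath(posixpath.join(working_dir, relative_path))
print(resolved)
```

The `..` step cancels out `notebooks`, which is why the path lands in the sibling `data` directory.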

First, we'll look at concatenating dataframes.

In [None]:
vertical_stack = pd.concat([surveys_df, second_df], axis=0)
vertical_stack

Have a look at the output. What does "concatenate" mean? Is this the result we wanted? 

In [None]:
vertical_stack = vertical_stack.reset_index(drop=True)

`reset_index()` allows us to renumber the indices, such that we now have a contiguous dataset, as opposed to two numbering systems squashed together.
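To see the two numbering systems on a small scale, here is a minimal sketch with two made-up DataFrames (the names `first` and `second` and their values are invented for illustration, not taken from the course data).

```python
import pandas as pd

# Two tiny made-up DataFrames with the same columns
first = pd.DataFrame({"record_id": [1, 2], "species_id": ["DO", "DL"]})
second = pd.DataFrame({"record_id": [3, 4], "species_id": ["PF", "DO"]})

# Stack the rows; each piece keeps its own row labels
stacked = pd.concat([first, second], axis=0)
print(list(stacked.index))

# Renumber so that the row labels run from 0 without repeats
renumbered = stacked.reset_index(drop=True)
print(list(renumbered.index))
```

Passing `drop=True` discards the old labels instead of keeping them as a new column.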

In the below cell, write the new dataset to a file. Where should you save it? Why?

# Joining Data

Another task we might like to do is combine two different datasets based on columns. For example, records for one specimen might be in two different files which keep track of different variables.

For example, in our classroom dataset, whoever recorded the rodent data used shorthand for the species names - DO, DL, etc. That makes taking data while wrangling live animals easier. In a second file, "species.csv", we have a translation table to go from the shorthand to the full name for analysis as we work on the paper. The goal here is to make it easier to take data in the field (by using shorthand) without losing data (being able to translate back to the long form). 


When we concatenated our DataFrames we simply added them to each other, stacking them vertically. Another way to combine DataFrames is to use columns in each dataset that contain common values (a common unique id). Combining DataFrames using a common field is called “joining”. The columns containing the common values are called “join key(s)”. Joining DataFrames in this way is often useful when one DataFrame is a “lookup table” containing additional data that we want to include in the other.

Storing data in this way has many benefits including:

- It ensures consistency in the spelling of species attributes (genus, species and taxa) given each species is only entered once. Imagine the possibilities for spelling errors when entering the genus and species thousands of times!
- It also makes it easy for us to make changes to the species information once without having to find each instance of it in the larger survey data.
- It optimizes the size of our data.

## Identifying join keys

To identify appropriate join keys we first need to know which field(s) are shared between the files (DataFrames). We might inspect both DataFrames to identify these columns. If we are lucky, both DataFrames will have columns with the same name that also contain the same data. If we are less lucky, we need to identify a (differently-named) column in each DataFrame that contains the same information.



In [None]:
surveys_df.columns

In [None]:
species = pd.read_csv("../data/species.csv")
species.columns

In our example, the join key is the column containing the two-letter species identifier, which is called species_id. 
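One possible way to find shared column names without reading both lists by eye is sketched below. The two empty DataFrames are stand-ins with made-up column lists, so treat the names as assumptions rather than the exact contents of the course files.

```python
import pandas as pd

# Stand-in DataFrames with made-up column lists (no rows needed here)
surveys_toy = pd.DataFrame(columns=["record_id", "year", "species_id", "weight"])
species_toy = pd.DataFrame(columns=["species_id", "genus", "species", "taxa"])

# Column names that appear in both DataFrames are candidate join keys
shared = surveys_toy.columns.intersection(species_toy.columns)
print(list(shared))
```

A shared name is only a candidate: it is still worth checking that the two columns actually hold the same kind of values.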

Now that we know the fields with the common species ID attributes in each DataFrame, we are almost ready to join our data. However, since there are different types of joins, we also need to decide which type of join makes sense for our analysis.

## Inner Joins

The most common type of join is called an inner join. An inner join combines two DataFrames based on a join key and returns a new DataFrame that contains only those rows that have matching values in both of the original DataFrames.

Inner joins yield a DataFrame that contains only rows where the value being joined exists in BOTH tables:

![inner](img/inner-join.png)

The pandas function for performing joins is called merge, and an inner join is the default option:

In [None]:
merged_inner = pd.merge(left=surveys_df, right=species, left_on='species_id', right_on='species_id')
# In this case `species_id` is the only column name shared by both DataFrames, so if we skipped
# the `left_on` and `right_on` arguments we would still get the same result

# What's the size of the output data?
# (print it explicitly: a notebook cell only displays its last expression)
print(merged_inner.shape)
merged_inner


The result of an inner join of surveys and species is a new DataFrame that contains the combined set of columns from surveys and species. It only contains rows that have two-letter species codes that are the same in both the surveys and species DataFrames. In other words, if a row in surveys has a value of species_id that does not appear in the species_id column of species, it will not be included in the DataFrame returned by an inner join. Similarly, if a row in species has a value of species_id that does not appear in the species_id column of surveys, that row will not be included in the DataFrame returned by an inner join.

The two DataFrames that we want to join are passed to the merge function using the left and right arguments. The left_on='species_id' argument tells merge to use the species_id column as the join key from surveys_df (the left DataFrame). Similarly, the right_on='species_id' argument tells merge to use the species_id column as the join key from species (the right DataFrame). For inner joins, the order of the left and right arguments does not change which rows are returned.

The resulting merged_inner DataFrame contains all of the columns from surveys_df (record_id, month, day, etc.) as well as all the columns from species (species_id, genus, species, and taxa).

Notice that merged_inner has fewer rows than surveys_df. This is an indication that there were rows in surveys_df with value(s) for species_id that do not exist as value(s) for species_id in species.
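The same row-dropping behaviour can be seen on a toy scale. In this sketch every value is made up for illustration: the code `ZZ` has no entry in the lookup table, and `PE` is never observed in the toy survey.

```python
import pandas as pd

# Made-up survey records; "ZZ" is missing from the lookup table
surveys_toy = pd.DataFrame({"record_id": [1, 2, 3],
                            "species_id": ["DO", "ZZ", "DL"]})

# Made-up lookup table; "PE" never appears in the survey records
species_toy = pd.DataFrame({"species_id": ["DO", "DL", "PE"],
                            "genus": ["GenusA", "GenusB", "GenusC"]})

# Inner join is the default, so `how` does not need to be given
inner_toy = pd.merge(left=surveys_toy, right=species_toy, on="species_id")

# Only the codes present in BOTH tables survive the join
print(inner_toy)
print(inner_toy.shape)
```

Three survey rows go in and two come out, because the `ZZ` record has nothing to match against.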

## Left Joins

What if we want to add information from species to surveys_df without losing any of the information from surveys_df? In this case, we use a different type of join called a “left outer join”, or a “left join”.

Like an inner join, a left join uses join keys to combine two DataFrames. Unlike an inner join, a left join will return all of the rows from the left DataFrame, even those rows whose join key(s) do not have values in the right DataFrame. Rows in the left DataFrame that are missing values for the join key(s) in the right DataFrame will simply have null (i.e., NaN or None) values for those columns in the resulting joined DataFrame.

Note: a left join will still discard rows from the right DataFrame that do not have values for the join key(s) in the left DataFrame.

![left join](img/left-join.png)

In [None]:
merged_left = pd.merge(left=surveys_df,right=species, how='left', left_on='species_id', right_on='species_id')
merged_left

The result DataFrame from a left join (merged_left) looks very much like the result DataFrame from an inner join (merged_inner) in terms of the columns it contains. However, unlike merged_inner, merged_left contains the same number of rows as the original surveys_df DataFrame. When we inspect merged_left, we find there are rows where the information that should have come from species (i.e., genus, species, and taxa) is missing (they contain NaN values):

In [None]:
merged_left[ pd.isnull(merged_left.genus) ]

These rows are the ones where the value of species_id from surveys_df (in this case, PF) does not occur in species. What might this mean about the `PF` value? 

The pandas merge function supports two other join types:

- Right (outer) join: Invoked by passing how='right' as an argument. Similar to a left join, except all rows from the right DataFrame are kept, while rows from the left DataFrame without matching join key(s) values are discarded.

- Full (outer) join: Invoked by passing how='outer' as an argument. This join type returns all rows from both DataFrames, matched on the join key(s) where possible; i.e., the result DataFrame will contain NaN where data is missing in one of the DataFrames. This join type is very rarely used.

Try one or both of these to get a fuller sense of what they do.
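If it helps to compare the join types side by side first, the sketch below counts the rows each one returns for two small made-up tables (all names and values here are invented for illustration).

```python
import pandas as pd

# Made-up survey records; "ZZ" is missing from the lookup table
surveys_toy = pd.DataFrame({"record_id": [1, 2, 3],
                            "species_id": ["DO", "ZZ", "DL"]})

# Made-up lookup table; "PE" never appears in the survey records
species_toy = pd.DataFrame({"species_id": ["DO", "DL", "PE"],
                            "genus": ["GenusA", "GenusB", "GenusC"]})

# Run the same merge with each join type and record the number of rows
row_counts = {}
for how in ["inner", "left", "right", "outer"]:
    merged = pd.merge(left=surveys_toy, right=species_toy,
                      how=how, on="species_id")
    row_counts[how] = len(merged)

print(row_counts)
```

The outer join returns the most rows here because it keeps both the unmatched `ZZ` record and the unmatched `PE` lookup entry.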

# Reminder: For Loops

April: Don't forget to ask about what databases people think they will use!!!!!

Loops allow us to repeat a workflow (or series of actions) a given number of times or while some condition is true. We would use a loop to automatically process data that’s stored in multiple files (daily values with one file per year, for example). Loops lighten our work load by performing repeated tasks without our direct involvement and make it less likely that we’ll introduce errors by making mistakes while processing each file by hand.

Let’s write a simple for loop that simulates what a kid might see during a visit to the zoo:

In [23]:
animals = ['lion', 'tiger', 'crocodile', 'vulture', 'hippo']
print(animals)


['lion', 'tiger', 'crocodile', 'vulture', 'hippo']


In [24]:
for creature in animals:
    print(creature)


lion
tiger
crocodile
vulture
hippo


The line defining the loop must start with for and end with a colon, and the body of the loop must be indented.

In this example, creature is the loop variable that takes the value of the next entry in animals every time the loop goes around. We can call the loop variable anything we like. After the loop finishes, the loop variable will still exist and will have the value of the last entry in the collection:

In [25]:
for creature in animals:
    pass

In [26]:
print('The loop variable is now: ' + creature)

The loop variable is now: hippo


What happened above? Tell your neighbor.

## Quick challenge:
- What happens if we don’t include the pass statement?
- Rewrite the loop so that the animals are separated by commas, not new lines (Hint: You can concatenate strings using a plus sign. For example, print(string1 + string2) outputs ‘string1string2’).


## Using loops to automate analyses

The file we’ve been using so far, `surveys.csv`, contains 26 years of data (1977 through 2002) and is very large. We would like to separate the data for each year into a separate file.

Let’s start by making a new directory inside the folder data to store all of these files using the module os. If we want to put the directory in the data directory, what will be the path? Enter it between the quotation marks below.

In [28]:
import os

os.mkdir('')


We can check that we've successfully created the directory with the `os.listdir` command. Fill in the path between the quotation marks:

In [29]:
os.listdir('')


[]
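For a feel of what `os.mkdir` and `os.listdir` do without touching the project folders, here is a small sketch that works inside a throwaway temporary directory. The name `example_dir` is a placeholder chosen for this example.

```python
import os
import tempfile

# Work inside a throwaway directory so nothing in the project is changed
with tempfile.TemporaryDirectory() as scratch:
    new_dir = os.path.join(scratch, "example_dir")  # placeholder name

    os.mkdir(new_dir)

    # The new directory shows up when we list its parent...
    parent_listing = os.listdir(scratch)
    # ...and is itself empty, because nothing has been saved into it yet
    child_listing = os.listdir(new_dir)

print(parent_listing)
print(child_listing)
```

An empty list from `os.listdir` therefore means the directory exists but holds no files yet.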

In previous lessons, we saw how to use the library pandas to load the survey data into memory as a DataFrame, how to select a subset of the data using some criteria, and how to write the DataFrame into a CSV file. Let’s write a script that performs those three steps in sequence for the year 2002. Remember to fill in the path to your yearly_files directory in the last command:

In [31]:
import pandas as pd

# Load the data into a DataFrame
surveys_df = pd.read_csv('../data/surveys.csv')

# Select only data for the year 2002
surveys2002 = surveys_df[surveys_df.year == 2002]

# Write the new DataFrame to a CSV file
surveys2002.to_csv('')


To create yearly data files, we could repeat the last two commands over and over, once for each year of data. Repeating code is neither elegant nor practical, and is very likely to introduce errors into your code. We want to turn what we’ve just written into a loop that repeats the last two commands for every year in the dataset.

Let’s start by writing a loop that simply prints the names of the files we want to create - the dataset we are using covers 1977 through 2002, and we’ll create a separate file for each of those years. Listing the filenames is a good way to confirm that the loop is behaving as we expect.

We have seen that we can loop over a list of items, so we need a list of years to loop over. We can get the years in our DataFrame with:

In [32]:
surveys_df['year']

0        1977
1        1977
2        1977
3        1977
4        1977
5        1977
6        1977
7        1977
8        1977
9        1977
10       1977
11       1977
12       1977
13       1977
14       1977
15       1977
16       1977
17       1977
18       1977
19       1977
20       1977
21       1977
22       1977
23       1977
24       1977
25       1977
26       1977
27       1977
28       1977
29       1977
         ... 
35519    2002
35520    2002
35521    2002
35522    2002
35523    2002
35524    2002
35525    2002
35526    2002
35527    2002
35528    2002
35529    2002
35530    2002
35531    2002
35532    2002
35533    2002
35534    2002
35535    2002
35536    2002
35537    2002
35538    2002
35539    2002
35540    2002
35541    2002
35542    2002
35543    2002
35544    2002
35545    2002
35546    2002
35547    2002
35548    2002
Name: year, Length: 35549, dtype: int64

But we want only the unique years, which we can get using the unique method we have already seen:

In [35]:
surveys_df['year'].unique()

array([1977, 1978, 1979, 1980, 1981, 1982, 1983, 1984, 1985, 1986, 1987,
       1988, 1989, 1990, 1991, 1992, 1993, 1994, 1995, 1996, 1997, 1998,
       1999, 2000, 2001, 2002])

Putting this into a for loop we get

In [36]:
for year in surveys_df['year'].unique():
    filename = '../data/yearly_files/surveys' + str(year) + '.csv'
    print(filename)


../data/yearly_files/surveys1977.csv
../data/yearly_files/surveys1978.csv
../data/yearly_files/surveys1979.csv
../data/yearly_files/surveys1980.csv
../data/yearly_files/surveys1981.csv
../data/yearly_files/surveys1982.csv
../data/yearly_files/surveys1983.csv
../data/yearly_files/surveys1984.csv
../data/yearly_files/surveys1985.csv
../data/yearly_files/surveys1986.csv
../data/yearly_files/surveys1987.csv
../data/yearly_files/surveys1988.csv
../data/yearly_files/surveys1989.csv
../data/yearly_files/surveys1990.csv
../data/yearly_files/surveys1991.csv
../data/yearly_files/surveys1992.csv
../data/yearly_files/surveys1993.csv
../data/yearly_files/surveys1994.csv
../data/yearly_files/surveys1995.csv
../data/yearly_files/surveys1996.csv
../data/yearly_files/surveys1997.csv
../data/yearly_files/surveys1998.csv
../data/yearly_files/surveys1999.csv
../data/yearly_files/surveys2000.csv
../data/yearly_files/surveys2001.csv
../data/yearly_files/surveys2002.csv


We can now add the rest of the steps we need to create separate text files:

In [38]:
# Load the data into a DataFrame
surveys_df = pd.read_csv('../data/surveys.csv')

for year in surveys_df['year'].unique():

    # Select data for the year
    surveys_year = surveys_df[surveys_df.year == year]

    # Write the new DataFrame to a CSV file
    filename = '../data/yearly_files/surveys' + str(year) + '.csv'
    surveys_year.to_csv(filename)


Talk through the loop with a partner. Look inside the yearly_files directory and check a couple of the files you just created to confirm that everything worked as expected.

Notice that the code above created a unique filename for each year.

`filename = '../data/yearly_files/surveys' + str(year) + '.csv'`

Let’s break down the parts of this name:

- The first part is simply some text that specifies the directory to store our data file in (../data/yearly_files/) and the first part of the file name (surveys): '../data/yearly_files/surveys'
- We can concatenate this with the value of a variable, in this case year by using the plus + sign and the variable we want to add to the file name: + str(year)
- Then we add the file extension as another text string: + '.csv'

Notice that we use single quotes to add text strings. The variable is not surrounded by quotes. For the year 2002, this code produces the string ../data/yearly_files/surveys2002.csv, which contains the path to the new file AND the file name itself.
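As a side note, the same file name can be built in more than one way. The sketch below compares the plus-sign approach with an f-string, which is an alternative not used elsewhere in this lesson; the value of `year` is just an example.

```python
year = 2002  # example value

# String concatenation: the number must be converted with str() first
by_concatenation = '../data/yearly_files/surveys' + str(year) + '.csv'

# An f-string builds the same text and converts the number for us
by_fstring = f'../data/yearly_files/surveys{year}.csv'

print(by_concatenation)
print(by_concatenation == by_fstring)
```

Either form works in the loop; which one to use is mostly a matter of readability.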

## Challenges

- Some of the surveys you saved are missing data (they have null values that show up as NaN - Not A Number - in the DataFrames and do not show up in the text files). Modify the for loop so that the entries with null values are not included in the yearly files.

- Let’s say you only want to look at data from a given multiple of years. How would you modify your loop in order to generate a data file for only every 5th year, starting from 1977?

- Instead of splitting out the data by years, a colleague wants to analyze each species separately. How would you write a unique CSV file for each species?


In [4]:
# Load the data into a DataFrame
surveys_df = pd.read_csv('../data/surveys.csv')

# Skip missing species codes (NaN), which would otherwise produce an empty file
for spec in surveys_df['species_id'].dropna().unique():

    # Select data for the species
    surveys_sp = surveys_df[surveys_df.species_id == spec]

    # Write the new DataFrame to a CSV file
    filename = '../data/spec_files/surveys' + str(spec) + '.csv'
    surveys_sp.to_csv(filename)

In [5]:
! cat ../data/spec_files/surveysPL.csv


,record_id,month,day,year,plot_id,species_id,sex,hindfoot_length,weight
22816,22817,9,24,1995,10,PL,,20.0,25.0
23134,23135,12,21,1995,24,PL,F,20.0,21.0
23146,23147,12,21,1995,24,PL,M,20.0,21.0
23218,23219,1,27,1996,24,PL,F,20.0,21.0
23253,23254,1,27,1996,23,PL,F,20.0,24.0
23339,23340,1,28,1996,10,PL,F,19.0,20.0
23571,23572,3,23,1996,23,PL,M,20.0,19.0
23698,23699,4,14,1996,23,PL,M,21.0,20.0
24581,24582,10,12,1996,24,PL,F,19.0,23.0
24958,24959,2,8,1997,24,PL,M,21.0,21.0
24986,24987,2,8,1997,24,PL,F,21.0,27.0
25058,25059,2,9,1997,10,PL,M,20.0,21.0
25093,25094,2,9,1997,9,PL,F,21.0,22.0
25133,25134,2,9,1997,14,PL,M,21.0,23.0
26349,26350,7,9,1997,24,PL,F,21.0,16.0
26419,26420,7,9,1997,19,PL,M,18.0,10.0
26423,26424,7,9,1997,16,PL,F,22.0,18.0
26556,26557,7,29,1997,7,PL,F,20.0,22.0
26700,26701,7,30,1997,3,PL,M,21.0,19.0
26786,26787,9,27,1997,7,PL,F,21.0,16.0
26888,26889,9,28,1997,16,PL,F,21.0,10.0
26920,26921,9,28,1997,5,PL,M,22.0,20.0
26930,26931,9,28,1997,10,PL,F,10.0,1

## Building reusable and modular code with functions

Suppose that separating large data files into individual yearly files is a task that we frequently have to perform. We could write a for loop like the one above every time we needed to do it but that would be time consuming and error prone. A more elegant solution would be to create a reusable tool that performs this task with minimum input from the user. To do this, we are going to turn the code we’ve already written into a function.

Functions are reusable, self-contained pieces of code that are called with a single command. They can be designed to accept arguments as input and return values, but they don’t need to do either. Variables declared inside functions only exist while the function is running and if a variable within the function (a local variable) has the same name as a variable somewhere else in the code, the local variable hides but doesn’t overwrite the other.

Many of the commands we have used in Python (for example, print) are functions, and the libraries we import (say, pandas) are largely collections of functions. For now we will only use functions that are defined in the same code that uses them, but it’s also easy to write functions that can be used by different programs.

Functions are declared following this general structure:

In [39]:
def this_is_the_function_name(input_argument1, input_argument2):

    # The body of the function is indented
    # This function prints the two arguments to screen
    print('The function arguments are:', input_argument1, input_argument2, '(this is done inside the function!)')

    # And returns their product
    return input_argument1 * input_argument2

The function declaration starts with the word def, followed by the function name and any arguments in parentheses, and ends in a colon. The body of the function is indented just like loops are. If the function returns something when it is called, it includes a return statement at the end.

This is how we call the function:

In [40]:
product_of_inputs = this_is_the_function_name(2,5)

The function arguments are: 2 5 (this is done inside the function!)


In [41]:
print('Their product is:', product_of_inputs, '(this is done outside the function!)')

Their product is: 10 (this is done outside the function!)


## Challenge 

- Change the values of the arguments in the function and check its output. What if one or both arguments are non-numeric?
- Try calling the function by giving it the wrong number of arguments (not 2) or not assigning the function call to a variable (no `product_of_inputs =`)
- Declare a variable inside the function and test to see where it exists (Hint: can you print it from outside the function?)
- Explore what happens when a variable inside the function and a variable outside the function have the same name. What happens to the global variable when you change the value of the local variable?


We can now turn our code for saving yearly data files into a function. There are many different “chunks” of this code that we can turn into functions, and we can even create functions that call other functions inside them. Let’s first write a function that separates data for just one year and saves that data to a file:

In [47]:
def one_year_csv_writer(this_year, all_data):
    """
    Writes a csv file for data from a given year.

    this_year --- year for which data is extracted
    all_data --- DataFrame with multi-year data
    """

    # Select data for the year
    surveys_year = all_data[all_data.year == this_year]

    # Write the new DataFrame to a csv file
    filename = '../data/yearly_files/function_surveys' + str(this_year) + '.csv'
    surveys_year.to_csv(filename)


The text between the two sets of triple double quotes is called a docstring and contains the documentation for the function. It does nothing when the function is running and is therefore not necessary, but it is good practice to include docstrings as a reminder of what the code does. Docstrings in functions also become part of their ‘official’ documentation:

In [45]:
help(one_year_csv_writer)
#or
#one_year_csv_writer?

Help on function one_year_csv_writer in module __main__:

one_year_csv_writer(this_year, all_data)
    Writes a csv file for data from a given year.
    
    this_year --- year for which data is extracted
    all_data --- DataFrame with multi-year data



In [48]:
one_year_csv_writer(2002, surveys_df)

We changed the root of the name of the CSV file so we can distinguish it from the one we wrote before. Check the yearly_files directory for the file. Did it do what you expect?

What we really want to do, though, is create files for multiple years without having to request them one by one. Let’s write another function that replaces the entire for loop by simply looping through a sequence of years and repeatedly calling the function we just wrote, one_year_csv_writer:

In [49]:
def yearly_data_csv_writer(start_year, end_year, all_data):
    """
    Writes separate CSV files for each year of data.

    start_year --- the first year of data we want
    end_year --- the last year of data we want
    all_data --- DataFrame with multi-year data
    """

    # "end_year" is the last year of data we want to pull, so we loop to end_year+1
    for year in range(start_year, end_year+1):
        one_year_csv_writer(year, all_data)


Because people will naturally expect that the end year for the files is the last year with data, and because range stops just before its second argument, the for loop inside the function runs to end_year + 1. By writing the entire loop into a function, we’ve made a reusable tool for whenever we need to break a large data file into yearly files. Because we can specify the first and last year for which we want files, we can even use this function to create files for a subset of the years available. This is how we call this function:
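The reason for the `+ 1` is easiest to see by listing what `range` produces. This is a small sketch with example years chosen only for illustration.

```python
start_year = 1999
end_year = 2002

# range stops just BEFORE its second argument...
without_plus_one = list(range(start_year, end_year))
print(without_plus_one)

# ...so adding 1 is what makes end_year itself part of the sequence
with_plus_one = list(range(start_year, end_year + 1))
print(with_plus_one)
```

Without the `+ 1`, the file for the final year would silently never be written.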

In [50]:
yearly_data_csv_writer(1977, 2002, surveys_df)

The functions we wrote demand that we give them a value for every argument. Ideally, we would like these functions to be as flexible and independent as possible. Let’s modify the function yearly_data_csv_writer so that the start_year and end_year default to the full range of the data if they are not supplied by the user. Arguments can be given default values with an equal sign in the function declaration. Any argument in the function without a default value (here, all_data) is a required argument and MUST come before the arguments with default values (which are optional in the function call).

In [51]:
def yearly_data_arg_test(all_data, start_year = 1977, end_year = 2002):
    """
    Modified from yearly_data_csv_writer to test default argument values!

    start_year --- the first year of data we want --- default: 1977
    end_year --- the last year of data we want --- default: 2002
    all_data --- DataFrame with multi-year data
    """

    return start_year, end_year


start,end = yearly_data_arg_test (surveys_df, 1988, 1993)
print('Both optional arguments:\t', start, end)

start,end = yearly_data_arg_test (surveys_df)
print('Default values:\t\t\t', start, end)


Both optional arguments:	 1988 1993
Default values:			 1977 2002


But what if our dataset doesn’t start in 1977 and end in 2002? We can modify the function so that it looks for the start and end years in the dataset if those dates are not provided:

In [52]:
def yearly_data_arg_test(all_data, start_year = None, end_year = None):
    """
    Modified from yearly_data_csv_writer to test default argument values!

    start_year --- the first year of data we want --- default: None - check all_data
    end_year --- the last year of data we want --- default: None - check all_data
    all_data --- DataFrame with multi-year data
    """

    if start_year is None:
        start_year = min(all_data.year)
    if end_year is None:
        end_year = max(all_data.year)

    return start_year, end_year


start,end = yearly_data_arg_test (surveys_df, 1988, 1993)
print('Both optional arguments:\t', start, end)

start,end = yearly_data_arg_test (surveys_df)
print('Default values:\t\t\t', start, end)


Both optional arguments:	 1988 1993
Default values:			 1977 2002


The default values of the start_year and end_year arguments in the function yearly_data_arg_test are now None. This is a built-in constant in Python that indicates the absence of a value - essentially, the variable name exists in the namespace of the function (the directory of variable names) but has not been given a meaningful value.

# If Statements

The body of the test function now has two conditionals (if statements) that check the values of start_year and end_year. If statements execute a segment of code when some condition is met. They commonly look something like this:

In [53]:
a = 5

if a<0:  # Meets first condition?

    # if a IS less than zero
    print('a is a negative number')

elif a>0:  # Did not meet first condition. meets second condition?

    # if a ISN'T less than zero and IS more than zero
    print('a is a positive number')

else:  # Met neither condition

    # if a ISN'T less than zero and ISN'T more than zero
    print('a must be zero!')


a is a positive number


Change the value of a to see how this code works. The statement elif means “else if”, and all of the conditional statements must end in a colon.

The if statements in the function yearly_data_arg_test check whether a value was supplied for start_year and end_year. If those variables are None, the conditions evaluate to the boolean True and the body of the if statement is executed. On the other hand, if the variable names are associated with some value (they got a number in the function call), the conditions evaluate to False and the body is skipped. The opposite conditional statements, which would run only if the variables had received a value in the function call, would be if start_year is not None and if end_year is not None. (A bare if start_year is similar, but it also treats a value of 0 as if nothing had been supplied.)
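A plain truth test such as `if start_year` is close to, but not exactly, the opposite of `if start_year is None`. The sketch below uses a small helper, `describe`, which is a hypothetical function written only for this comparison.

```python
def describe(start_year=None):
    """Return the results of two different checks on the same argument."""
    was_supplied = start_year is not None   # True whenever a value was passed
    is_truthy = bool(start_year)            # False for None AND for 0
    return was_supplied, is_truthy

nothing_supplied = describe()
ordinary_year = describe(1988)
zero_supplied = describe(0)

print(nothing_supplied)
print(ordinary_year)
print(zero_supplied)
```

The two checks only disagree for the value 0, which is unlikely to be a real year but matters for other kinds of arguments.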

As we’ve written it so far, the function yearly_data_arg_test associates values in the function call with arguments in the function definition based only on their order. If the function gets only two values in the function call, the first one will be associated with all_data and the second with start_year, regardless of what we intended them to be. We can get around this problem by calling the function using keyword arguments, where each of the arguments in the function definition is associated with a keyword and the function call passes values to the function using these keywords:

In [56]:
start,end = yearly_data_arg_test(surveys_df)
print('Default values:\t\t\t', start, end)

start,end = yearly_data_arg_test(surveys_df, 1988, 1993)
print('No keywords:\t\t\t', start, end)

start,end = yearly_data_arg_test(surveys_df, start_year = 1988, end_year = 1993)
print('Both keywords, in order:\t', start, end)

start,end = yearly_data_arg_test(surveys_df, end_year = 1993, start_year = 1988)
print('Both keywords, flipped:\t\t', start, end)

start,end = yearly_data_arg_test(surveys_df, start_year = 1988)
print('One keyword, default end:\t', start, end)

start,end = yearly_data_arg_test(surveys_df, end_year = 1993)
print('One keyword, default start:\t', start, end)


Default values:			 1977 2002
No keywords:			 1988 1993
Both keywords, in order:	 1988 1993
Both keywords, flipped:		 1988 1993
One keyword, default end:	 1988 2002
One keyword, default start:	 1977 1993


## Challenge


- The code below checks to see whether a directory exists and creates one if it doesn’t. Add some code to your function that writes out the CSV files, to check for a directory to write to.

In [None]:
if 'dir_name_here' in os.listdir('.'):
    print('Processed directory exists')
else:
    os.mkdir('dir_name_here')
    print('Processed directory created')


## For Thursday

On Thursday, we will do some hands-on practice with datasets and with writing functions. 