# Assignment 27: Pandas DataFrame Objects and File Formats #

### Goals for this Assignment ###

By the time you have completed this assignment, you should be able to:

- Create `DataFrame` objects in Pandas by hand
- Access columns of a `DataFrame` object using square brackets (`[]`)
- Load Comma-Separated Values (`.csv`) files into `DataFrame` objects
- Write a `DataFrame` object to a `.csv` file

## Step 1: Create a `DataFrame` Object by Hand ##

### Background: Creating `DataFrame` Objects ###

In Pandas, [`DataFrame`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html) objects are used to represent two-dimensional data.
This is in contrast to `Series` objects, which are for one-dimensional data.
`DataFrame` objects are most commonly used to represent tables, for example:

| Student Name | Age | GPA |
| ------------ | --- | --- |
| Alice        |  25 | 3.1 |
| Bob          |  22 | 2.7 |
| Bill         |  28 | 3.3 |
| Barbara      |  30 | 2.9 |

`DataFrame` objects themselves can be broken down into `Series` objects, generally where one column of the table is represented with one `Series` object.
This follows from the fact that:

- Usually, all the values in the same column are expected to have the same type, which is a requirement of `Series` objects.  For example, in the table above, all the values of "Student Name" are expected to be strings, all the values of "Age" are expected to be integers, and all the values of "GPA" are expected to be floating-point values.
- We often want to process columns as a whole.  For example, while it makes sense to get the average age of a student in the above table, it doesn't make sense to get the average of each row, since each row is composed of values of very different types.
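Both points can be checked on a small table. The sketch below builds the student table directly (the constructor call itself is explained next), inspects the per-column types with `dtypes`, and computes a column-wise mean. The variable names `students`, `age_column`, and `average_age` are only for illustration.

```python
import pandas as pd

students = pd.DataFrame({
    "name": ["Alice", "Bob", "Bill", "Barbara"],
    "age": [25, 22, 28, 30],
    "gpa": [3.1, 2.7, 3.3, 2.9],
})

# Each column has a single type shared by all of its values.
print(students.dtypes)

# Each column is itself a Series object.
age_column = students["age"]
print(type(age_column).__name__)  # Series

# Column-wise operations make sense: the average age is well defined.
average_age = age_column.mean()
print(average_age)  # 26.25
```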

We can manually make a `DataFrame` object by calling the `DataFrame` constructor with a dictionary that maps each column name to the list of values in that column.
This is shown in the cell below, which creates a `DataFrame` object representing the information in the previously provided table:

In [1]:
import pandas as pd

student_names = ["Alice", "Bob", "Bill", "Barbara"]
ages = [25, 22, 28, 30]
gpas = [3.1, 2.7, 3.3, 2.9]
table_data = pd.DataFrame({"name" : student_names, "age" : ages, "gpa" : gpas})
print(table_data)

      name  age  gpa
0    Alice   25  3.1
1      Bob   22  2.7
2     Bill   28  3.3
3  Barbara   30  2.9


In the prior cell, `table_data` is a `DataFrame` object holding the same information as the table.
The names of the columns are controlled by the names of the keys in the dictionary.
In this case, rather than using the original names in the table, shorter names were used, since the `DataFrame` can be indexed by these names (and these are a bit more convenient to work with).
In the next step, we will work with `table_data` a bit to show what we can do with this `DataFrame` representation.

### Try this Yourself ###

Consider the following table, representing part of the inventory of a hardware store:

| Product | Price | Quantity | Location |
| ------------ | --- | --- | --- |
| Hammer        |  20 | 19 | Tools |
| Wrench        |  30 | 15 | Tools |
| Screws (20)   |  2.5 | 150 | Hardware |

In the next cell, create a `DataFrame` object representing this information.
For the column names, use the names `"product"`, `"price"`, `"quantity"`, and `"location"`.
Bind your `DataFrame` object to a variable named `hardware_store`.
Leave the `print` in place in order to test your code.

In [12]:
# Define hardware_store here.
# You may define additional variables beforehand if you wish, instead
# of defining hardware_store all at once.
product = ["Hammer", "Wrench", "Screws (20)"]
price = [20, 30, 2.5]
quantity = [19, 15, 150]
location = ["Tools", "Tools", "Hardware"]
hardware_store = pd.DataFrame({"product" : product, "price" : price, "quantity" : quantity, "location" : location})

print(hardware_store)
# above statement should print:
#        product  price  quantity  location
# 0       Hammer   20.0        19     Tools
# 1       Wrench   30.0        15     Tools
# 2  Screws (20)    2.5       150  Hardware

       product  price  quantity  location
0       Hammer   20.0        19     Tools
1       Wrench   30.0        15     Tools
2  Screws (20)    2.5       150  Hardware


## Step 2: Access and Manipulate Columns and Values in a `DataFrame` Object ##

### Background: Working with `DataFrame` Objects ###

`DataFrame` objects have two levels of indexing: one for columns (which comes first) and another for rows (which comes second).
This is shown below (repeating the definition of `table_data` for convenience):

In [13]:
student_names = ["Alice", "Bob", "Bill", "Barbara"]
ages = [25, 22, 28, 30]
gpas = [3.1, 2.7, 3.3, 2.9]
table_data = pd.DataFrame({"name" : student_names, "age" : ages, "gpa" : gpas})

print(table_data["age"][0]) # prints 25
print(table_data["name"][1]) # prints "Bob"
print(table_data["gpa"][2]) # prints 3.3

25
Bob
3.3


In the cell above, to break down the first `print`, the first thing that happens with `table_data["age"][0]` is the access of `table_data["age"]`.
This accesses the values in the `"age"` column, which returns a Pandas `Series` object.
From there, the `[0]` accesses row `0` of this `Series` object.
As a result, you can think of this expression as having implicit parentheses around the access of the `"age"` column, like so: `(table_data["age"])[0]`.
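The two-step access can be made explicit by storing the intermediate `Series` in a variable. The sketch below repeats the table definition so that it runs on its own; `age_column`, `first_age`, and `same_age` are illustrative names. It also shows `.loc`, which selects by row label and column name in a single step and gives the same value here because the rows carry the default labels 0, 1, 2, 3.

```python
import pandas as pd

table_data = pd.DataFrame({
    "name": ["Alice", "Bob", "Bill", "Barbara"],
    "age": [25, 22, 28, 30],
    "gpa": [3.1, 2.7, 3.3, 2.9],
})

# Step 1: select the column, which yields a Series.
age_column = table_data["age"]
print(type(age_column).__name__)  # Series

# Step 2: select the row labeled 0 from that Series.
first_age = age_column[0]
print(first_age)  # 25

# Single-step alternative: row label first, then column name.
same_age = table_data.loc[0, "age"]
print(same_age == first_age)  # True
```

Note that `.loc` takes the row first and the column second, which is the reverse of the chained square brackets.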

Because you get back `Series` objects from accessing the values in a column, you can do all of the operations you've seen before on `Series` objects, including fancy indexing, masking, and vector operations.
Some of these operations work over the whole table, as well.
For example, we can find the average GPA of everyone over age 25:

In [4]:
print(table_data[table_data["age"] > 25]["gpa"].mean())

3.0999999999999996


Breaking the above expression down into parts:

- `table_data["age"] > 25` gives us a `Series` object holding `bool` values, where a `True` is present for each row where the value of `"age"` was greater than `25`.
- `table_data[table_data["age"] > 25]` gives us a version of `table_data` containing only the rows where `"age"` was greater than `25`; all other rows are cut out.
- `table_data[table_data["age"] > 25]["gpa"]` gives us a `Series` object with the values of the `"gpa"` column, but only for those rows where the `"age"` field's value was greater than `25`.
- `table_data[table_data["age"] > 25]["gpa"].mean()` computes the mean of the aforementioned part of the `"gpa"` column.
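The same breakdown can be written with one variable per step, which makes each intermediate result easy to print and inspect. This is only a sketch with illustrative variable names (`mask`, `older_students`, `older_gpas`, `mean_gpa`); the table definition is repeated so that the block is self-contained.

```python
import pandas as pd

table_data = pd.DataFrame({
    "name": ["Alice", "Bob", "Bill", "Barbara"],
    "age": [25, 22, 28, 30],
    "gpa": [3.1, 2.7, 3.3, 2.9],
})

# A Series of bool values, one per row.
mask = table_data["age"] > 25
print(mask.tolist())  # [False, False, True, True]

# Only the rows where the mask is True are kept.
older_students = table_data[mask]
print(older_students["name"].tolist())  # ['Bill', 'Barbara']

# The "gpa" column of the remaining rows, then its mean.
older_gpas = older_students["gpa"]
mean_gpa = older_gpas.mean()
print(round(mean_gpa, 2))  # 3.1
```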

### Try this Yourself ###

In the next cell, using your `hardware_store` from before, write an expression which will find the maximum price of any product that you have fewer than `50` of in inventory.
This should be `30` (printed as `30.0`, since the `"price"` column holds floating-point values), because only hammers and wrenches have quantities below 50, and of those two, wrenches have the higher price of `30`.
However, your expression should use masking and vector operations to find this information, in a similar way that `table_data` was previously manipulated.
Be sure to `print` the result of your expression.

In [17]:
# Define your expression below.  Be sure to print it.

print(hardware_store[hardware_store["quantity"] < 50]["price"].max())

30.0


## Step 3: Load a Comma-Separated Value (CSV) File into a `DataFrame` ##

### Background: CSV Files ###

CSV files are a common way to distribute and store data.
These typically have the `.csv` file extension.
The most basic version of this format (more on that in a bit) is based on separating each row with a newline, and each column with a comma (hence the "comma" in "comma-separated value").
With this format in mind, our table representing students becomes:

```
Name,Age,GPA
Alice,25,3.1
Bob,22,2.7
Bill,28,3.3
Barbara,30,2.9
```
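To make the basic format concrete, the sketch below parses the text above by hand with plain string operations. It only handles this simplest version of the format (no commas or newlines inside values), and the names `csv_text`, `lines`, `header`, and `rows` are illustrative.

```python
csv_text = """Name,Age,GPA
Alice,25,3.1
Bob,22,2.7
Bill,28,3.3
Barbara,30,2.9
"""

# Rows are separated by newlines; the first row holds the column names.
lines = csv_text.splitlines()
header = lines[0].split(",")
print(header)  # ['Name', 'Age', 'GPA']

# Columns within a row are separated by commas.
rows = [line.split(",") for line in lines[1:]]
print(rows[0])  # ['Alice', '25', '3.1']
```

Every value comes back as a string here; converting `"25"` to an integer or `"3.1"` to a floating-point value would be a separate step.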

Pandas can open CSV files and create `DataFrame` objects directly from the loaded data.
This is shown in the cell below.
Note that this will require you to download `students.csv`, which is included with the assignment on Canvas; `students.csv` contains the same information as shown above.

In [18]:
csv = pd.read_csv("students.csv")
print(csv)

      Name  Age  GPA
0    Alice   25  3.1
1      Bob   22  2.7
2     Bill   28  3.3
3  Barbara   30  2.9


As shown above, `csv` is already a `DataFrame` object.
The names of the columns come directly from the CSV file itself, where the first row is treated as a header for the data.
If you don't like the names of the columns (e.g., they are too long to easily work with when indexing into `DataFrame` objects), then you can modify them by changing `.columns`, like so:

In [19]:
csv.columns = ["name", "age", "gpa"]
print(csv)

      name  age  gpa
0    Alice   25  3.1
1      Bob   22  2.7
2     Bill   28  3.3
3  Barbara   30  2.9


As shown above, the names of the columns have now changed to reflect what was passed to `.columns`.
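If you only want to change some of the names, one alternative is the `rename` method, which takes a dictionary mapping old column names to new ones and returns a new `DataFrame`. The sketch below reads CSV text from an in-memory string via `io.StringIO` instead of `students.csv`, so that it runs without the downloaded file; `csv_text`, `students`, and `renamed` are illustrative names.

```python
import io

import pandas as pd

# Assumption: a shortened, in-memory stand-in for students.csv.
csv_text = "Name,Age,GPA\nAlice,25,3.1\nBob,22,2.7\n"
students = pd.read_csv(io.StringIO(csv_text))
print(list(students.columns))  # ['Name', 'Age', 'GPA']

# Only the columns named in the dictionary are renamed.
renamed = students.rename(columns={"Name": "name", "GPA": "gpa"})
print(list(renamed.columns))  # ['name', 'Age', 'gpa']

# The original DataFrame keeps its column names.
print(list(students.columns))  # ['Name', 'Age', 'GPA']
```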

> Earlier I referred to what was described for CSV files as "the most basic version of this format".
> This unfortunately gets more complex if any values themselves contain commas or newlines, and strings can do exactly this.
> There are ways to handle this, e.g., by enforcing that strings begin and end with quote characters, which allows you to distinguish between commas and newlines within a string and commas and newlines used as data separators.
> However, this makes directly working with the format more difficult, as now we need extra code to differentiate the two.
> This also introduces its own similar problem, in that we now need to distinguish between quotes in a string and quotes used to start or end a string.
> The good news is that there are solutions to all these problems, and the `read_csv` function handles them for us, so we don't need to worry about them.
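
As a small illustration of the quoting described above, the sketch below reads CSV text in which one value contains a comma and another contains a quote character (written as two quote characters inside the quoted value). The data is made up for this example, and `io.StringIO` stands in for a file on disk.

```python
import io

import pandas as pd

# Made-up data: the first product contains a comma, the second a quote.
csv_text = '''product,price
"Screws, wood (20)",2.5
"12"" ruler",4.0
'''

quoted = pd.read_csv(io.StringIO(csv_text))
print(quoted.shape)  # (2, 2)
print(quoted["product"][0])  # Screws, wood (20)
print(quoted["product"][1])  # 12" ruler
```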

### Try this Yourself ###

In the next cell, load the data from the `hardware.csv` file into a `DataFrame` object; `hardware.csv` will need to be downloaded from Canvas, and it contains the same data as the hardware example before.
Bind the `DataFrame` object to a new variable named `my_csv`.
Modify the columns to be `"product"`, `"price"`, `"quantity"`, and `"location"`.
Be sure to print `my_csv` afterwards.

In [24]:
# Create your my_csv variable below, modify the column names appropriately, and print
# the resulting my_csv.
my_csv = pd.read_csv("hardware.csv")
print(my_csv)
print()
my_csv.columns = ["product", "price", "quantity", "location"]
print(my_csv)


       Product  Price  Quantity  Location
0       Hammer   20.0        19     Tools
1       Wrench   30.0        15     Tools
2  Screws (20)    2.5       150  Hardware

       product  price  quantity  location
0       Hammer   20.0        19     Tools
1       Wrench   30.0        15     Tools
2  Screws (20)    2.5       150  Hardware


## Step 4: Write a `DataFrame` into a CSV File ##

### Background: Writing to CSV Files with Pandas ###

With Pandas, `DataFrame` objects already have methods that can be used to write data to a given file.
This is illustrated in the next cell, which takes all the students in `table_data` with GPAs less than 3.0 and writes them to a separate file `gpas_below_3.csv`:

In [25]:
table_data[table_data["gpa"] < 3.0].to_csv("gpas_below_3.csv")

If you look at the actual data written by the above cell, you'll see the following in the `.csv` file:

```
,name,age,gpa
1,Bob,22,2.7
3,Barbara,30,2.9
```

The first column doesn't have a name in the header, and looking at the values in this first column, they correspond to the row indices in the original `table_data` `DataFrame`.
This same `.csv` file can then be loaded into a `DataFrame` using the `read_csv` function from the prior step.
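If you do not want the row indices in the file, `to_csv` accepts `index=False`. The sketch below is one way to see the difference without touching the disk: when no file name is given, `to_csv` returns the CSV text as a string. It also reads the indexed version back with `index_col=0`, so that the unnamed first column becomes the row index again. The variable names are illustrative.

```python
import io

import pandas as pd

table_data = pd.DataFrame({
    "name": ["Alice", "Bob", "Bill", "Barbara"],
    "age": [25, 22, 28, 30],
    "gpa": [3.1, 2.7, 3.3, 2.9],
})
below_3 = table_data[table_data["gpa"] < 3.0]

# Default behavior: the row indices are written as an unnamed first column.
with_index = below_3.to_csv()
print(with_index)

# With index=False, only the actual columns are written.
without_index = below_3.to_csv(index=False)
print(without_index)

# Reading back, treating the first column as the row index.
restored = pd.read_csv(io.StringIO(with_index), index_col=0)
print(restored.index.tolist())  # [1, 3]
```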

### Try this Yourself ###

In the cell below, a `DataFrame` object is defined.
Write this `DataFrame` object to a CSV file named `my_csv.csv`.
You will turn in `my_csv.csv` as part of the assignment on Canvas.

In [28]:
df = pd.DataFrame({"first" : ["foo", "bar", "baz"],
                   "second" : [3, 2, 8],
                   "third" : [2.2, 3.3, 4.4]})
# Write out df to a CSV file below

df.to_csv("my_csv.csv")

## Step 5: Submit via Canvas ##

Be sure to **save your work**, then log into [Canvas](https://canvas.csun.edu/).  Go to the COMP 502 course, and click "Assignments" on the left pane.  From there, click "Assignment 27", where you can upload your `27_pandas_dataframe_objects_and_file_formats.ipynb` file, **as well as** your `my_csv.csv` file.

You can turn in the assignment multiple times, but only the last version you submitted will be graded.

### On Other File Formats ###

While this assignment looked specifically at CSV files, Pandas supports multiple other formats, and data sets can be distributed in these other formats.
However, as far as the usage of Pandas is concerned, you usually don't need to worry too much about which specific one you're dealing with, as long as Pandas supports it.
If you're curious, you can look at the [official Pandas documentation](https://pandas.pydata.org/docs/reference/io.html) to see what other kinds of file formats are supported.
There, the `read_BLAH` functions will read data in `BLAH` format, and the `to_BLAH` methods will write something in `BLAH` format.
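For instance, JSON follows the same naming pattern. The sketch below is a minimal round trip through `to_json` and `read_json`, using an in-memory string rather than a file; the `orient="records"` choice (one JSON object per row) is just one of several layouts Pandas supports, and the variable names are illustrative.

```python
import io

import pandas as pd

students = pd.DataFrame({"name": ["Alice", "Bob"], "age": [25, 22]})

# With no file name given, to_json returns the JSON text as a string.
json_text = students.to_json(orient="records")
print(json_text)

# Read the same text back into a DataFrame.
restored = pd.read_json(io.StringIO(json_text), orient="records")
print(restored["age"].tolist())  # [25, 22]
```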

### Special Thanks to Dr. Glenn Bruns ###

Special thanks to [Dr. Glenn Bruns](https://csumb.edu/scd/glenn-bruns/) at California State University, Monterey Bay, for providing me with closely-related materials which were used in the creation of this assignment.