## Reading data the right way



First, we import a dataset to play with. As a new twist, I will also check whether the file is actually present, because I once spent 20 minutes debugging my code without realizing that I had the wrong filename. The code is straightforward and simply raises an error if the file is missing. To run the check, we import the `pathlib` library, which provides tools for working with file and path names (e.g., checking whether a file exists). I recommend using this code snippet in your submissions. Recall that to locate a file, you need to know the filename and the directory that contains it.

The following piece of code first retrieves the current directory, and then builds a path object from the current working directory plus the filename.



In [1]:
import pathlib as pl
import pandas as pd

# define the file we want to read. Note that the file
# has to be present in the current working directory!
fn: str = "example_file.csv"  # file name

# get the current working directory
cwd: pl.Path = pl.Path.cwd()
print(f"the current working directory is\n \t {cwd}\n")

# build the fully qualified file name (i.e. path + filename)
fqfn: pl.Path = cwd / fn  # the "/" operator joins path components

# This piece of code could have saved me 20 minutes...
if not fqfn.exists():  # check if the file is actually there
    raise FileNotFoundError(f"Cannot find file {fqfn}")
else:
    print(f"The fully qualified filename is\n \t {fqfn}\n")
    df: pd.DataFrame = pd.read_csv(fqfn)  # read csv data

So try this with a filename you know does not exist to see what will
happen!
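If you want to see the check in action without touching your own files, here is a minimal sketch. It assumes a throwaway directory created with `tempfile`, and the file names `present.csv` and `missing.csv` are made up for this illustration.

```python
import pathlib as pl
import tempfile

# Work in a throwaway directory so the example does not depend on your files.
with tempfile.TemporaryDirectory() as tmp:
    folder: pl.Path = pl.Path(tmp)

    # hypothetical file names, only one of them is actually created
    present: pl.Path = folder / "present.csv"
    missing: pl.Path = folder / "missing.csv"
    present.write_text("A\n1\n2\n3\n")  # create a tiny csv file

    # exists() is True for the first file and False for the second
    found: list = [present.exists(), missing.exists()]
    print(found)

    # the same check as above, but we catch the error to look at the message
    try:
        if not missing.exists():
            raise FileNotFoundError(f"Cannot find file {missing}")
    except FileNotFoundError as err:
        message: str = str(err)
        print(message)
```

In a notebook you would normally not catch the error. Letting it stop the cell is the whole point of the check.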

As you can see, there is quite a bit of typing involved. I thus keep a repository of code snippets I use a lot, so I can simply copy and paste them the next time I need to read a file. You could, e.g., use something like this:



In [1]:
# this is just a template, it will not run unless
# you add useful variable values!
import pathlib as pl
import pandas as pd

fn: str = ""  # file name
cwd: pl.Path = pl.Path.cwd()  # get the current working directory
fqfn: pl.Path = cwd / fn  # fully qualified file name

# is_file() is also False for a directory, so an empty fn raises here
if not fqfn.is_file():  # check that the file exists
    raise FileNotFoundError(f"Cannot find file {fqfn}")

Collect these snippets in a separate notebook, and they will save you a lot of time in the coming weeks.
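One possible way to store such a snippet is to wrap it in a small function. The sketch below is only a suggestion: the function name `read_csv_checked`, its `folder` argument, and the file `demo.csv` are made up for this example and are not part of pandas or pathlib.

```python
import pathlib as pl
import tempfile
from typing import Optional

import pandas as pd


def read_csv_checked(fn: str, folder: Optional[pl.Path] = None) -> pd.DataFrame:
    """Read a csv file, but fail early with a clear message if it is missing."""
    if folder is None:
        folder = pl.Path.cwd()  # default: the current working directory
    fqfn: pl.Path = folder / fn  # fully qualified file name
    if not fqfn.is_file():  # also False if fqfn is a directory
        raise FileNotFoundError(f"Cannot find file {fqfn}")
    return pd.read_csv(fqfn)


# usage example with a throwaway csv file
with tempfile.TemporaryDirectory() as tmp:
    demo_folder: pl.Path = pl.Path(tmp)
    (demo_folder / "demo.csv").write_text("A,B\n1,4\n2,5\n3,6\n")
    demo: pd.DataFrame = read_csv_checked("demo.csv", demo_folder)
    print(demo)
```

In your own notebook, you would call `read_csv_checked("example_file.csv")` and let it look in the current working directory.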



## Working with the pandas series data object



A pandas data frame is composed of smaller units, the so-called pandas series objects. You can think of them as columns in a spreadsheet. You already used this data type implicitly in our last module, but here we will explore it more fully. The above code should have created a data frame already. Let's make sure it worked:



In [1]:
display(df.head())

Now, let's extract the data from column `A` and execute the following:



In [1]:
A: pd.Series = df["A"]  # extract column A, and save as pandas data series
B: pd.Series = A * 2  # multiply A with 2

# Now we create a new dataframe from the two pandas series objects.
# If you don't understand the syntax, go back to the "other data types"
# module and check out what I am doing here. If you cannot figure it
# out, speak up!
result: pd.DataFrame = pd.DataFrame({"A": A, "A*2 =": B})
result.head()  # show the first rows of the new dataframe

Note the use of `pd.DataFrame` in the above example. On the left, it is used as a type hint, whereas on the right side of the assignment, it is called as a function to create a new data frame from a dictionary.
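If you want to convince yourself that a data frame built this way really consists of series objects, here is a small sketch. It uses made-up values instead of the csv file, so it runs on its own.

```python
import pandas as pd

A: pd.Series = pd.Series([1, 2, 3])  # made-up values, not the csv data

# pd.DataFrame is the type hint on the left and the constructor on the right
result: pd.DataFrame = pd.DataFrame({"A": A, "A*2 =": A * 2})

# pulling a column out of the data frame gives us a series again
column: pd.Series = result["A*2 ="]
print(type(result).__name__)
print(type(column).__name__)
print(column.dtype)  # every series has exactly one dtype
```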

You probably remember that it was not possible to multiply each element of a list by a number directly, because lists can contain numbers, letters, other lists, tuples, etc. (`[1, 2] * 2` repeats the list instead). To do this, you had to write a loop. A pandas series, on the other hand, can only have one data type per column. In other words, a column can contain strings, integers, floats, etc., but all entries in a given column must be of the same type. Since column `A` consists only of numbers, Python can directly multiply each element by 2. What happens if you multiply `A*B`? Is the multiplication element by element, or do you get the cross product? What happens if you write `A**B`? Hurray! No more loops! (kinda...)
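If you do not have the csv file at hand, you can try these questions with the following sketch. The values are made up, and the results are printed rather than spelled out, so run it and see for yourself.

```python
import pandas as pd

numbers: list = [1, 2, 3]
repeated: list = numbers * 2  # a list is repeated, not multiplied
looped: list = [x * 2 for x in numbers]  # with a list, we need a loop
print(repeated)
print(looped)

A: pd.Series = pd.Series([1, 2, 3])  # made-up values
B: pd.Series = A * 2  # each element is multiplied by 2, no loop needed

product: pd.Series = A * B
power: pd.Series = A**B
print(product.tolist())
print(power.tolist())
```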

In a way, Python treats a pandas series object like a vector, not unlike Matlab. There is some cool stuff we can do with this. For example, we can apply a comparison operator to a pandas series:



In [1]:
print(A>2)

Now, why would this be useful? Remember that `False` equals 0, whereas `True` equals 1. So if you want to count the number of values in `A` that are larger than 2, you can simply write:



In [1]:
n: int = sum(A > 2)
print(n)

Neat!
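Here is the same counting trick as a self-contained sketch with made-up values. It also shows the `.sum()` method of the series, which should give the same count as Python's built-in `sum()`.

```python
import pandas as pd

A: pd.Series = pd.Series([1, 2, 3, 4, 5])  # made-up values
mask: pd.Series = A > 2  # a series of True/False values
print(mask.tolist())

n_builtin: int = sum(mask)  # Python's built-in sum, True counts as 1
n_method: int = int(mask.sum())  # the pandas way, same count
print(n_builtin, n_method)
```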

