# <p style="background-color: #f5df18; padding: 10px;">Programming & Plotting in Python | **Reading Tabular Data into DataFrames** </p>



<div style="display: flex;">
    <div style="flex: 1; margin-right: 20px;">
        <h2>Questions</h2>
        <ul>
            <li>How can I read tabular data?</li>
        </ul>
    </div>
    <div style="flex: 1;">
        <h2>Learning Objectives</h2>
        <ul>
            <li>Import the Pandas library.</li>
            <li>Use Pandas to load a simple CSV data set.</li>
            <li>Get some basic information about a Pandas DataFrame.</li>
        </ul>
    </div>
</div>


## Use the Pandas library to do statistics on tabular data.
---


- [Pandas](https://pandas.pydata.org/) is a widely-used Python library for statistics, particularly on tabular data.


- Pandas uses `DataFrames`, which are 2-dimensional tables whose columns have names
    and potentially have different data types.


- We load Pandas with `import pandas as pd`. The alias `pd` is commonly used to refer to the Pandas library in code.


- We read a Comma Separated Values (CSV) data file with `pd.read_csv`.
  - The argument is the name of the file to be read.
  - It returns a dataframe that you can assign to a variable.
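
The idea of named columns with different data types can be sketched with a tiny hand-built table. The column names and values below are hypothetical and chosen only for illustration; they are not part of the lesson data.

```python
import pandas as pd

# Hypothetical toy table: every column has a name and its own data type.
toy = pd.DataFrame({
    'site': ['A', 'B'],          # text
    'samples': [3, 5],           # integers
    'mean_value': [0.5, 1.25],   # floating point numbers
})

print(toy.dtypes)   # one data type per column
print(toy.shape)    # (2, 3): two rows, three columns
```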


Let's read and print the contents of the `gapminder_gdp_oceania.csv` file stored in the `data` directory:

In [2]:
import pandas as pd

In [3]:
data_oceania = pd.read_csv('data/gapminder_gdp_oceania.csv')

In [4]:
data_oceania

,country,gdpPercap_1952,gdpPercap_1957,gdpPercap_1962,gdpPercap_1967,gdpPercap_1972,gdpPercap_1977,gdpPercap_1982,gdpPercap_1987,gdpPercap_1992,gdpPercap_1997,gdpPercap_2002,gdpPercap_2007
0,Australia,10039.59564,10949.64959,12217.22686,14526.12465,16788.62948,18334.19751,19477.00928,21888.88903,23424.76683,26997.93657,30687.75473,34435.36744
1,New Zealand,10556.57566,12247.39532,13175.678,14463.91893,16046.03728,16233.7177,17632.4104,19007.19129,18363.32494,21050.41377,23189.80135,25185.00911


- The columns in a dataframe are the observed variables, and the rows are the observations.

- Pandas uses backslash `\` to show wrapped lines when output is too wide to fit the screen.

- Using descriptive dataframe names helps us distinguish between multiple dataframes so we won't accidentally overwrite a dataframe or read from the wrong one.

## 🔔 File Not Found
---

Our lessons store their data files in a `data` sub-directory,
which is why the path to the file is `data/gapminder_gdp_oceania.csv`.
If you forget to include `data/`,
or if you include it but your copy of the file is somewhere else,
you will get a [runtime error](04-built-in.md)
that ends with a line like this:

```error
FileNotFoundError: [Errno 2] No such file or directory: 'data/gapminder_gdp_oceania.csv'
```
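
One way to diagnose this error is to check whether the path exists before reading it. This is a minimal sketch that assumes the lesson's `data/` layout; it uses the standard-library `pathlib` module, which the lesson itself does not introduce.

```python
from pathlib import Path

import pandas as pd

# Path to the lesson file, relative to the current working directory.
csv_path = Path('data') / 'gapminder_gdp_oceania.csv'

if csv_path.exists():
    data_oceania = pd.read_csv(csv_path)
    print(data_oceania.shape)
else:
    # Show where Python is looking so the path can be corrected.
    print('Cannot find', csv_path, 'relative to', Path.cwd())
```

If the second message appears, compare the printed working directory with the location of your `data` folder.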


##  Use `index_col` to specify that a column's values should be used as row headings.
---

- Row headings are numbers (0 and 1 in this case).
- We would really like to index by country instead.
- Pass the name of the column to `read_csv` as its `index_col` parameter to do this.
- Naming the dataframe `data_oceania_country` tells us which region the data includes (`oceania`) and how it is indexed (`country`).

In [5]:
data_oceania_country = pd.read_csv('data/gapminder_gdp_oceania.csv', index_col='country')

In [6]:
data_oceania_country

country,gdpPercap_1952,gdpPercap_1957,gdpPercap_1962,gdpPercap_1967,gdpPercap_1972,gdpPercap_1977,gdpPercap_1982,gdpPercap_1987,gdpPercap_1992,gdpPercap_1997,gdpPercap_2002,gdpPercap_2007
Australia,10039.59564,10949.64959,12217.22686,14526.12465,16788.62948,18334.19751,19477.00928,21888.88903,23424.76683,26997.93657,30687.75473,34435.36744
New Zealand,10556.57566,12247.39532,13175.678,14463.91893,16046.03728,16233.7177,17632.4104,19007.19129,18363.32494,21050.41377,23189.80135,25185.00911
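
The effect of `index_col` can also be seen without a file on disk. The sketch below reads a small CSV held in memory with `io.StringIO`; the text is a two-column excerpt of the Oceania values shown above, and the variable names are illustrative only.

```python
import io

import pandas as pd

# A small excerpt of the Oceania data, held in memory instead of a file.
csv_text = """country,gdpPercap_1952,gdpPercap_2007
Australia,10039.59564,34435.36744
New Zealand,10556.57566,25185.00911
"""

by_number = pd.read_csv(io.StringIO(csv_text))
by_country = pd.read_csv(io.StringIO(csv_text), index_col='country')

print(list(by_number.index))    # [0, 1]: default integer row headings
print(list(by_country.index))   # ['Australia', 'New Zealand']
print(by_number.shape, by_country.shape)   # 'country' is no longer a data column
```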


## Use the `DataFrame.info()` method to find out more about a dataframe.
---


In [7]:
data_oceania_country.info()

<class 'pandas.core.frame.DataFrame'>
Index: 2 entries, Australia to New Zealand
Data columns (total 12 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   gdpPercap_1952  2 non-null      float64
 1   gdpPercap_1957  2 non-null      float64
 2   gdpPercap_1962  2 non-null      float64
 3   gdpPercap_1967  2 non-null      float64
 4   gdpPercap_1972  2 non-null      float64
 5   gdpPercap_1977  2 non-null      float64
 6   gdpPercap_1982  2 non-null      float64
 7   gdpPercap_1987  2 non-null      float64
 8   gdpPercap_1992  2 non-null      float64
 9   gdpPercap_1997  2 non-null      float64
 10  gdpPercap_2002  2 non-null      float64
 11  gdpPercap_2007  2 non-null      float64
dtypes: float64(12)
memory usage: 208.0+ bytes


- This is a `DataFrame`
- Two rows named `'Australia'` and `'New Zealand'`
- Twelve columns, each of which has two actual 64-bit floating point values.
  - We will talk later about null values, which are used to represent missing observations.
- Uses 208 bytes of memory.
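
To see how the non-null counts change when an observation is missing, here is a small sketch on a hypothetical table (the column names `a` and `b` and their values are made up for illustration). It also uses the `buf` argument of `info()` to capture the report as text rather than printing it directly.

```python
import io

import pandas as pd

# Hypothetical table with one missing observation (None is stored as NaN).
toy = pd.DataFrame({'a': [1.0, 2.0, 3.0], 'b': [4.0, None, 6.0]})

# info() prints to the screen by default; `buf` redirects the text to a buffer.
buffer = io.StringIO()
toy.info(buf=buffer)
report = buffer.getvalue()
print(report)

# count() gives the same non-null totals as a Series.
print(toy.count())
```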

## The `DataFrame.columns` variable stores information about the dataframe's columns.
---

In [8]:
data_oceania_country.columns

Index(['gdpPercap_1952', 'gdpPercap_1957', 'gdpPercap_1962', 'gdpPercap_1967',
       'gdpPercap_1972', 'gdpPercap_1977', 'gdpPercap_1982', 'gdpPercap_1987',
       'gdpPercap_1992', 'gdpPercap_1997', 'gdpPercap_2002', 'gdpPercap_2007'],
      dtype='object')


- Note that this is data, *not* a method.  (It doesn't have parentheses.)
  - Like `math.pi`.
  - So do not use `()` to try to call it.
- Called a *member variable*, or just *member*.
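
The difference between a member variable and a method can be checked on a small, hypothetical dataframe: reading `columns` works, while calling it raises a `TypeError`.

```python
import pandas as pd

# Hypothetical two-column table used only for illustration.
toy = pd.DataFrame({'x': [1, 2], 'y': [3, 4]})

print(toy.columns)         # member variable: no parentheses
print(list(toy.columns))   # ['x', 'y']

try:
    toy.columns()          # wrong: columns is data, not a method
except TypeError as err:
    print('TypeError:', err)
```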

## Use `DataFrame.T` to transpose a dataframe.
---

- Sometimes want to treat columns as rows and vice versa.
- Transpose (written `.T`) doesn't copy the data, just changes the program's view of it.
- Like `columns`, it is a member variable.


In [9]:
data_oceania_country.T

country,Australia,New Zealand
gdpPercap_1952,10039.59564,10556.57566
gdpPercap_1957,10949.64959,12247.39532
gdpPercap_1962,12217.22686,13175.678
gdpPercap_1967,14526.12465,14463.91893
gdpPercap_1972,16788.62948,16046.03728
gdpPercap_1977,18334.19751,16233.7177
gdpPercap_1982,19477.00928,17632.4104
gdpPercap_1987,21888.88903,19007.19129
gdpPercap_1992,23424.76683,18363.32494
gdpPercap_1997,26997.93657,21050.41377
gdpPercap_2002,30687.75473,23189.80135
gdpPercap_2007,34435.36744,25185.00911
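
A quick way to convince yourself of what `.T` does is to compare shapes before and after. The table below is hypothetical; its labels and values are placeholders.

```python
import pandas as pd

# Hypothetical table: two rows, three columns.
toy = pd.DataFrame(
    {'y1952': [1.0, 2.0], 'y1957': [3.0, 4.0], 'y1962': [5.0, 6.0]},
    index=['A', 'B'],
)

flipped = toy.T
print(toy.shape)       # (2, 3)
print(flipped.shape)   # (3, 2)

# Row headings and column names swap roles.
print(list(flipped.index))     # ['y1952', 'y1957', 'y1962']
print(list(flipped.columns))   # ['A', 'B']

# Transposing twice gives back the original layout.
print(flipped.T.equals(toy))   # True
```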


## Use `DataFrame.describe()` to get summary statistics about data.
---

`DataFrame.describe()` gets the summary statistics of only the columns that have numerical data.
All other columns are ignored, unless you use the argument `include='all'`.

In [10]:
data_oceania_country.describe()

,gdpPercap_1952,gdpPercap_1957,gdpPercap_1962,gdpPercap_1967,gdpPercap_1972,gdpPercap_1977,gdpPercap_1982,gdpPercap_1987,gdpPercap_1992,gdpPercap_1997,gdpPercap_2002,gdpPercap_2007
count,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0
mean,10298.08565,11598.522455,12696.45243,14495.02179,16417.33338,17283.957605,18554.70984,20448.04016,20894.045885,24024.17517,26938.77804,29810.188275
std,365.560078,917.644806,677.727301,43.986086,525.09198,1485.263517,1304.328377,2037.668013,3578.979883,4205.533703,5301.85368,6540.991104
min,10039.59564,10949.64959,12217.22686,14463.91893,16046.03728,16233.7177,17632.4104,19007.19129,18363.32494,21050.41377,23189.80135,25185.00911
25%,10168.840645,11274.086022,12456.839645,14479.47036,16231.68533,16758.837652,18093.56012,19727.615725,19628.685412,22537.29447,25064.289695,27497.598692
50%,10298.08565,11598.522455,12696.45243,14495.02179,16417.33338,17283.957605,18554.70984,20448.04016,20894.045885,24024.17517,26938.77804,29810.188275
75%,10427.330655,11922.958888,12936.065215,14510.57322,16602.98143,17809.077558,19015.85956,21168.464595,22159.406358,25511.05587,28813.266385,32122.777858
max,10556.57566,12247.39532,13175.678,14526.12465,16788.62948,18334.19751,19477.00928,21888.88903,23424.76683,26997.93657,30687.75473,34435.36744


- Not particularly useful with just two records,
  but very helpful when there are thousands.
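
The Oceania data has only numerical columns, so the effect of `include='all'` is not visible there. The following sketch uses a hypothetical table with one text column and one numerical column to show the difference.

```python
import pandas as pd

# Hypothetical table mixing a text column and a numerical column.
toy = pd.DataFrame({
    'site': ['A', 'B', 'A', 'C'],
    'value': [1.0, 2.0, 3.0, 4.0],
})

numeric_only = toy.describe()               # the text column is ignored
everything = toy.describe(include='all')    # the text column is summarised too

print(numeric_only)
print(everything)
```

With `include='all'`, the summary gains rows such as `unique`, `top`, and `freq` that apply to the text column, and statistics that do not apply to a column are shown as `NaN`.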

## <p style="background-color: #f5df18; padding: 10px;"> 🛑 Reading other data </p>

---

Read the data in `gapminder_gdp_americas.csv`
(which should be in the same directory as `gapminder_gdp_oceania.csv`)
into a variable called `data_americas`
and display its summary statistics.

## Solution

To read in a CSV, we use `pd.read_csv` and pass the filename `'data/gapminder_gdp_americas.csv'` to it.
We also once again pass the column name `'country'` to the parameter `index_col` in order to index by country.
The summary statistics can be displayed with the `DataFrame.describe()` method.

```python
data_americas = pd.read_csv('data/gapminder_gdp_americas.csv', index_col='country')
data_americas.describe()
```


## <p style="background-color: #f5df18; padding: 10px;"> 🛑 Inspecting data </p>

---

After reading the data for the Americas,
use `help(data_americas.head)` and `help(data_americas.tail)`
to find out what `DataFrame.head` and `DataFrame.tail` do.

1. What method call will display the first three rows of this data?
2. What method call will display the last three columns of this data?
  (Hint: you may need to change your view of the data.)

## Solution

1. We can check out the first five rows of `data_americas` by executing `data_americas.head()`
  which lets us view the beginning of the DataFrame. We can specify the number of rows we wish
  to see by specifying the parameter `n` in our call to `data_americas.head()`.
  To view the first three rows, execute:
  
  ```python
  data_americas.head(n=3)
  ```
  
  ```output
            continent  gdpPercap_1952  gdpPercap_1957  gdpPercap_1962  \
  country
  Argentina  Americas     5911.315053     6856.856212     7133.166023
  Bolivia    Americas     2677.326347     2127.686326     2180.972546
  Brazil     Americas     2108.944355     2487.365989     3336.585802
  
            gdpPercap_1967  gdpPercap_1972  gdpPercap_1977  gdpPercap_1982  \
  country
  Argentina     8052.953021     9443.038526    10079.026740     8997.897412
  Bolivia       2586.886053     2980.331339     3548.097832     3156.510452
  Brazil        3429.864357     4985.711467     6660.118654     7030.835878
  
             gdpPercap_1987  gdpPercap_1992  gdpPercap_1997  gdpPercap_2002  \
  country
  Argentina     9139.671389     9308.418710    10967.281950     8797.640716
  Bolivia       2753.691490     2961.699694     3326.143191     3413.262690
  Brazil        7807.095818     6950.283021     7957.980824     8131.212843
  
             gdpPercap_2007
  country
  Argentina    12779.379640
  Bolivia       3822.137084
  Brazil        9065.800825
  ```

2. To check out the last three rows of `data_americas`, we would use the command
  `data_americas.tail(n=3)`, analogous to `head()` used above. However, here we want to look at
  the last three columns so we need to change our view and then use `tail()`. To do so, we
  create a new DataFrame in which rows and columns are switched:
  
  ```python
  americas_flipped = data_americas.T
  ```
  
  We can then view the last three columns of `data_americas` by viewing the last three rows
  of `americas_flipped`:
  
  ```python
  americas_flipped.tail(n=3)
  ```
  
  ```output
  country        Argentina  Bolivia   Brazil   Canada    Chile Colombia  \
  gdpPercap_1997   10967.3  3326.14  7957.98  28954.9  10118.1  6117.36
  gdpPercap_2002   8797.64  3413.26  8131.21    33329  10778.8  5755.26
  gdpPercap_2007   12779.4  3822.14   9065.8  36319.2  13171.6  7006.58
  
  country        Costa Rica     Cuba Dominican Republic  Ecuador    ...     \
  gdpPercap_1997    6677.05  5431.99             3614.1  7429.46    ...
  gdpPercap_2002    7723.45  6340.65            4563.81  5773.04    ...
  gdpPercap_2007    9645.06   8948.1            6025.37  6873.26    ...
  
  country          Mexico Nicaragua   Panama Paraguay     Peru Puerto Rico  \
  gdpPercap_1997   9767.3   2253.02  7113.69   4247.4  5838.35     16999.4
  gdpPercap_2002  10742.4   2474.55  7356.03  3783.67  5909.02     18855.6
  gdpPercap_2007  11977.6   2749.32  9809.19  4172.84  7408.91     19328.7
  
  country        Trinidad and Tobago United States  Uruguay Venezuela
  gdpPercap_1997             8792.57       35767.4  9230.24   10165.5
  gdpPercap_2002             11460.6       39097.1     7727   8605.05
  gdpPercap_2007             18008.5       42951.7  10611.5   11415.8
  ```
  
  This shows the data that we want, but we may prefer to display three columns instead of three rows,
  so we can flip it back:
  
  ```python
  americas_flipped.tail(n=3).T    
  ```
  
  **Note:** we could have done the above in a single line of code by 'chaining' the commands:
  
  ```python
  data_americas.T.tail(n=3).T
  ```
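
The same pattern can be checked on a small table where the answer is easy to see by eye. This is a sketch on hypothetical data; the row and column labels are placeholders.

```python
import pandas as pd

# Hypothetical table: four rows, three columns.
toy = pd.DataFrame(
    {'c1': [1, 2, 3, 4], 'c2': [5, 6, 7, 8], 'c3': [9, 10, 11, 12]},
    index=['r1', 'r2', 'r3', 'r4'],
)

first_rows = toy.head(n=2)        # first two rows
last_cols = toy.T.tail(n=2).T     # last two columns, via transpose

print(list(first_rows.index))     # ['r1', 'r2']
print(list(last_cols.columns))    # ['c2', 'c3']
print(last_cols.shape)            # (4, 2)
```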

## <p style="background-color: #f5df18; padding: 10px;"> 🛑 Reading Files in Other Directories </p>
---

The data for your current project is stored in a file called `microbes.csv`,
which is located in a folder called `field_data`.
You are doing analysis in a notebook called `analysis.ipynb`
in a sibling folder called `thesis`:

```output
your_home_directory
+-- field_data/
|   +-- microbes.csv
+-- thesis/
    +-- analysis.ipynb
```

What value(s) should you pass to `read_csv` to read `microbes.csv` in `analysis.ipynb`?

## Solution

We need to specify the path to the file of interest in the call to `pd.read_csv`. We first need to 'jump' out of
the folder `thesis` using `../` and then into the folder `field_data` using `field_data/`. Then we can specify the filename `microbes.csv`.
The result is as follows:

```python
data_microbes = pd.read_csv('../field_data/microbes.csv')
```
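
The same relative path can also be assembled piece by piece. This sketch uses the standard-library `pathlib` module, which is an addition to the lesson; `pd.read_csv` accepts such path objects as well as plain strings.

```python
from pathlib import Path

# Step out of `thesis` with '..', then into `field_data`, then name the file.
csv_path = Path('..') / 'field_data' / 'microbes.csv'

print(csv_path.as_posix())   # ../field_data/microbes.csv

# With pandas imported as pd and the file in place, one could then run:
# data_microbes = pd.read_csv(csv_path)
```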

## <p style="background-color: #f5df18; padding: 10px;"> 🛑 Writing data </p>
---

As well as the `read_csv` function for reading data from a file,
Pandas provides a `to_csv` method to write dataframes to files.
Applying what you've learned about reading from files,
write one of your dataframes to a file called `processed.csv`.
You can use `help` to get information on how to use `to_csv`.

## Solution

In order to write the DataFrame `data_americas` to a file called `processed.csv`, execute the following command:

```python
data_americas.to_csv('processed.csv')
```

For help on `read_csv` or `to_csv`, you could execute, for example:

```python
help(data_americas.to_csv)
help(pd.read_csv)
```

Note that `help(to_csv)` or `help(pd.to_csv)` raises an error! This is because `to_csv` is not a global Pandas function, but
a member function (a method) of DataFrames. This means you can only call it on an instance of a DataFrame,
e.g., `data_americas.to_csv` or `data_oceania.to_csv`.
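
To confirm that a written file can be read back, here is a minimal round-trip sketch. It uses a hypothetical two-row dataframe and writes to a temporary directory (via the standard-library `tempfile` module) so that no file is left behind; neither is part of the lesson's own solution.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical dataframe indexed by a column called 'country'.
toy = pd.DataFrame(
    {'gdp': [1.5, 2.5]},
    index=pd.Index(['A', 'B'], name='country'),
)

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = Path(tmp_dir) / 'processed.csv'
    toy.to_csv(out_path)    # the index is written as the first column
    restored = pd.read_csv(out_path, index_col='country')

print(restored)
print(hasattr(pd, 'to_csv'))   # False: to_csv belongs to DataFrames, not to pd
```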

# <p style="background-color: #f5df18; padding: 10px;"> 🗝️ Key points</p>
---

- Use the Pandas library to get basic statistics out of tabular data.
- Use `index_col` to specify that a column's values should be used as row headings.
- Use `DataFrame.info` to find out more about a dataframe.
- The `DataFrame.columns` variable stores information about the dataframe's columns.
- Use `DataFrame.T` to transpose a dataframe.
- Use `DataFrame.describe` to get summary statistics about data.