# <center>  Introduction to pandas </center>
<div>
<img src="https://pandas.pydata.org/static/img/pandas.svg" width="600"/>
</div>

Previously, we learned about some useful data structures for storing and organizing data, including lists, dictionaries, tuples, and arrays. In this lecture, we will learn about the **pandas** library, some of its features, and the new data structures that the library provides.

Pandas is a popular Python library used by many data scientists. It adds functionality to Python that expands one's ability to store, organize, and analyze data. The additional data structures that come with the pandas library are ***series*** and ***dataframes***. If you have experience working with spreadsheets in Microsoft Excel, working with pandas series and dataframes will look familiar. Essentially, series and dataframes store data in a tabular format. 

For more information on pandas and documentation on functionalities of the pandas library, refer to the <a href="https://pandas.pydata.org/docs/index.html">official pandas webpage</a>. The <a href="https://pandas.pydata.org/docs/reference/index.html">API reference page</a> is an extensive resource for many pandas functions and methods.

# Series

A pandas series can be thought of as a 1D array or a single column in a table. 

To make a series, we must first import the pandas library. By convention, pandas is imported as `pd`. Then, we can make a series by passing a list or array to the `pd.Series()` function, as shown below:

In [None]:
import pandas as pd

groceries = pd.Series(['Doritos', 'Bananas', 'Broccoli', 'Chicken'])

groceries

Calling `groceries` shows us a series of four items that is indexed from 0 to 3 (inclusive). To name the indices, we can pass another list or array into the `.set_axis()` method.

In [None]:
groceries = groceries.set_axis(['Snack', 'Fruit', 'Vegetable', 'Meat'])

groceries

When creating a series, the index can also be set. To achieve the same outcome as above, we can use the `index` parameter in the `pd.Series()` function:

In [None]:
groceries2 = pd.Series(['Doritos', 'Bananas', 'Broccoli', 'Chicken'],
         index = ['Snack', 'Fruit', 'Vegetable', 'Meat'])

groceries2

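Series can be indexed by position (numerically), similar to indexing an array, using the `.iloc` indexer (plain `[]` with an integer relies on a fallback that recent pandas versions deprecate). They can also be indexed by label (name). Below, we index a single item and a range of items by position and by label:

In [None]:
print(groceries2.iloc[1])            # Index by position
print(groceries2["Fruit"])           # Index by label

In [None]:
print(groceries2["Fruit":"Meat"])    # Slice by label
print('\n')                          # Prints a new line
print(groceries2.iloc[1:4])          # Slice by position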

Notice that when slicing by label (name), the end of the slice is **included**. When slicing by index position, the end of the slice is **excluded**.
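The difference between the two slicing styles can be made explicit with the `.loc` (label-based) and `.iloc` (position-based) indexers. The following is a minimal sketch; the series name `demo` is a hypothetical stand-in that mirrors `groceries2`.

```python
import pandas as pd

# Hypothetical series that mirrors groceries2
demo = pd.Series(['Doritos', 'Bananas', 'Broccoli', 'Chicken'],
                 index=['Snack', 'Fruit', 'Vegetable', 'Meat'])

by_label = demo.loc['Fruit':'Vegetable']   # label slice: the end label is included
by_position = demo.iloc[1:2]               # position slice: the end position is excluded

print(list(by_label))      # ['Bananas', 'Broccoli']
print(list(by_position))   # ['Bananas']
```

Using `.loc` and `.iloc` states the intent directly, which helps avoid ambiguity when a series has integer labels.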

# Dataframes

Dataframes are essentially tables that consist of multiple series. Below, we see the multiple ways a dataframe can be made.


### Making a dataframe using a dictionary
To construct a dataframe using a dictionary, you can use the `pd.DataFrame()` function. Passing a dictionary into this function builds the dataframe column by column: the keys of the dictionary become the column titles, while the values of each key become the rows of that column. The `index` parameter can also be passed into the function, but it must be defined outside of the dictionary as a list:

In [None]:
grocery_df1 = pd.DataFrame(
    {"Item": ['Doritos', 'Bananas', 'Broccoli', 'Chicken'],
     "Unit Price": [3.99,0.50,2.00, 5.00],
     "Quantity": [2,5,1,3]},
    index = ['Snack', 'Fruit', 'Vegetable', 'Meat']
)

grocery_df1

### Making a dataframe using lists

Another way to construct a dataframe is by using lists. When a list of lists is passed into the `pd.DataFrame()` function, the dataframe is built row by row: each inner list becomes a single row in the dataframe. With this method, columns can be named by passing a list into the `columns` parameter:

In [None]:
grocery_df2 = pd.DataFrame(
    [['Doritos', 3.99, 2],
     ['Bananas', 0.50, 5],
     ['Broccoli', 2.00, 1],
     ['Chicken', 5.00, 3]],
    index = ['Snack', 'Fruit', 'Vegetable', 'Meat'],
    columns = ["Item","Unit Price","Quantity"]
    )

grocery_df2

### Making a dataframe using pandas series

Finally, multiple series can be used to construct a dataframe by using the `pd.concat()` function. Using this function, one can pass a list of series and specify the method of concatenation/joining through the `axis` parameter. Concatenation while `axis = 0` means that the series will be joined as additional rows; concatenation while `axis = 1` means that the series will be joined together as columns. The `keys` parameter sets the column titles and should be defined using a list. 

Once the series have been concatenated into a dataframe, the indexes of the dataframe will start at 0 by default. The `.set_axis()` method can be used on the dataframe to define the index values/titles:

In [None]:
items = pd.Series(['Doritos', 'Bananas', 'Broccoli', 'Chicken'])
unit_price = pd.Series([3.99,0.50,2.00, 5.00])
quantity = pd.Series([2,5,1,3])
indices = pd.Series(['Snack', 'Fruit', 'Vegetable', 'Meat'])

grocery_df3 = pd.concat([items, unit_price, quantity], axis = 1, keys = ['Item', 'Unit Price', 'Quantity'])
grocery_df3 = grocery_df3.set_axis(indices)

grocery_df3
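To see the effect of the `axis` parameter described above, the following minimal sketch concatenates the same two series both ways. The series `names` and `prices` are hypothetical and only serve as an illustration.

```python
import pandas as pd

# Hypothetical example series
names = pd.Series(['Doritos', 'Bananas'])
prices = pd.Series([3.99, 0.50])

# axis=0 stacks the series end to end, giving one longer series
stacked = pd.concat([names, prices], axis=0)

# axis=1 places the series next to each other as columns of a dataframe
side_by_side = pd.concat([names, prices], axis=1, keys=['Item', 'Unit Price'])

print(stacked.shape)        # (4,)
print(side_by_side.shape)   # (2, 2)
```

With `axis=0` the result stays one-dimensional, so `axis=1` is the setting to use when the goal is a table with one column per series.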

# Loading data and file paths

### Filepaths

More often, dataframes are constructed from pre-existing files rather than made from scratch. This will require us to read in data from our local machine or from the Internet.

<u>Two types:</u>
- *Absolute* filepath - provides directions to files, regardless of your <u>current working directory</u>
- *Relative* filepath - provides directions to files in relation to your <u>current working directory</u> (i.e., your current location in your local machine)

Filepaths may be represented differently, depending on your operating system. On Windows, directories are separated by a *backslash* ( \\ ), while Unix-based systems, such as macOS and Linux, use forward slashes ( / ). 

Below is an example of the structure of directories and files on a computer. The current working directory called **<font color=#99180f>Random-Stuff</font>** can be used to navigate to the desired file **<font color=#120e99>lake_temps.csv</font>**. Alternatively, an absolute path can be used to navigate to the same place. The relative and absolute paths for Windows and Unix-based operating systems are provided:

<img align="center" src="https://raw.githubusercontent.com/campbelle1/CAN2023/4dfaebc11fcfa0c156b58621b73eb8fbed738329/filepaths.png" width="66%"/> 


When working in a Jupyter Notebook, the current working directory will be the folder in which the notebook is stored. To find its filepath, you can import the `os` library and call `os.getcwd()`. The output gives the absolute filepath to the current working directory:

In [None]:
import os
os.getcwd()

To know the files and folders that are in the current working directory, you may want to list its contents. This can be done by using the `os.listdir()` function and passing `'.'` as a string argument. The `.` notation represents the current working directory:

In [None]:
os.listdir('.')

To access files using a relative path, we must first determine where the files are relative to the above current working directory. A visual representation of this hierarchy is shown below:

<img align="center" src="https://raw.githubusercontent.com/campbelle1/CAN2023/main/Tuesday-Lecture-filepaths.png" width="100%"/> 


To navigate <u>one</u> directory up in a filepath in JupyterLab, the notation `../` is used. To get to the data, we next type the name of the desired folder that we wish to navigate to, which is *datasets*. Finally, if we want to read in a specific file, we just need to type the filename, including the file extension. If we wanted to access the *sample_data.csv* file in the *datasets* folder, the relative filepath would be:

> ../datasets/sample_data.csv

The above path would be used as a string to load in data, as shown below. <u>Alternatively, a string containing the URL of an online .csv file could be used as input for `pd.read_csv()` to load data directly from the Internet.</u>

**Note:** On Windows and Mac, you can easily obtain the filepath of a file of interest by right-clicking the file and then clicking *Get Info* (Mac) or *Properties* (Windows). From there, you can copy and paste the filepath for immediate use (<u>except if you have Windows.</u> Because the backslash `\` is an escape character in Python strings, you must first change all backslashes `\` to forward slashes `/`, or prefix the string with `r` to make it a raw string).
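As a rough illustration of how a relative path relates to an absolute one, the sketch below builds the relative path from the text and converts it to an absolute path. It uses `pathlib.Path` and `os.path.abspath()` from the Python standard library, which are not otherwise covered in this lecture. The file does not need to exist for the code to run.

```python
import os
from pathlib import Path

# Build the relative path piece by piece; pathlib uses the right separator for the OS
relative = Path('..') / 'datasets' / 'sample_data.csv'

# Convert to an absolute path based on the current working directory
absolute = os.path.abspath(relative)

print(relative.as_posix())        # ../datasets/sample_data.csv
print(os.path.isabs(absolute))    # True
print(absolute)                   # Output depends on where the code is run
```

Building paths this way avoids typing separators by hand, which can help when the same notebook is shared between Windows and Unix-based systems.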

### Loading Data

Python can read several types of files. Below are some useful functions to load data files:

- `pd.read_csv(file)` : Loads comma separated values files (.csv files). Requires `pandas` to be imported first.

- `pd.read_excel(file)` : Loads Microsoft Excel files (.xlsx files). Requires `pandas` to be imported first.

- `open(file, mode)` : Opens text files (.txt files). The `mode` parameter is optional and determines how the file is opened. When `mode` is not specified, the default argument `'r'` is used, which opens the file for reading.

We will use `pd.read_csv()` to read data stored in comma separated values (.csv) files in this lecture. 
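To compare `open()` with `pd.read_csv()`, here is a minimal, self-contained sketch. It writes a small, hypothetical CSV file (the file name `demo.csv` and its contents are made up for illustration) to a temporary folder and then reads it back both ways.

```python
import os
import tempfile
import pandas as pd

# Hypothetical CSV contents used only for this illustration
csv_text = "Name,Height\nFern,30\nCactus,12\n"

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, 'demo.csv')

    with open(path, 'w') as f:      # mode 'w' opens the file for writing
        f.write(csv_text)

    with open(path) as f:           # default mode 'r' opens the file for reading
        raw_text = f.read()         # the contents as one plain string

    demo_df = pd.read_csv(path)     # the same file parsed into a dataframe

print(raw_text)
print(demo_df.shape)   # (2, 2)
```

`open()` returns the raw text, whereas `pd.read_csv()` parses the header and rows into a table that is ready for analysis.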

When loading data into Python, we can store the csv file as a dataframe for downstream processing and analysis. Below, we load a file called `sample_data.csv` using `pd.read_csv()` and save it as a dataframe called `plants`:

In [None]:
plants = pd.read_csv('../datasets/sample_data.csv')

plants

When calling `plants`, we see a dataframe that contains data on the height of various house plants sold at a local florist. 

To set a column as the index of the dataframe, the `.set_index()` method can be used. The name of the desired column can be passed as a string into the method:


In [None]:
plants = plants.set_index('Name')
plants

We can obtain the index labels by accessing the `.index` attribute of the dataframe:

In [None]:
plants.index

When using the `pd.read_csv()` function to load in data, the `index_col` parameter can also be used to set the index of the dataframe.

Another useful parameter is `nrows`, which loads in a specified number of rows. This may be useful when you are only interested in looking at an initial portion of the data.

Below we load in the first 3 rows of `sample_data.csv` and designate the `Name` column as the index of our dataframe:

In [None]:
three_rows = pd.read_csv('../datasets/sample_data.csv', index_col='Name', nrows=3)

three_rows
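The same two parameters can be tried without any file on disk. The sketch below is only an illustration: it feeds hypothetical CSV text to `pd.read_csv()` through `io.StringIO`, which behaves like an open file.

```python
import io
import pandas as pd

# Hypothetical CSV text standing in for a file on disk
csv_text = "Name,Height\nFern,30\nCactus,12\nOrchid,45\nIvy,20\n"

# Read only the first two data rows and use the Name column as the index
first_two = pd.read_csv(io.StringIO(csv_text), index_col='Name', nrows=2)

print(first_two.index.tolist())   # ['Fern', 'Cactus']
print(first_two.shape)            # (2, 1)
```

Because `Name` becomes the index, it no longer counts as a column, which is why the shape reports a single column.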

By default, pandas displays up to 60 rows of a dataframe. If a dataframe exceeds 60 rows, a truncated version is displayed that shows the first and last five rows.

When working with larger datasets, the number of rows that are shown can be adjusted with the `pd.set_option()` function. This function takes two arguments: the name of the option you want to set and the value to assign to it. 

By passing `'display.max_rows'` as the first argument and the number `15` as the second argument, we set the option in our environment to display a truncated version of any dataframe that exceeds 15 rows. A similar approach could be taken with columns by passing `'display.max_columns'` as an argument. This can be useful if you would like to see all the rows and columns of a dataframe or if you want to abbreviate the dataframe after a certain number of rows and columns:

In [None]:
pd.set_option('display.max_rows', 15)
plants
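Display options can also be inspected or changed only temporarily. The following sketch uses `pd.get_option()` and `pd.option_context()`, two pandas functions that are not used elsewhere in this lecture, to show one possible way of doing so.

```python
import pandas as pd

# Look up the current value of the option
current_rows = pd.get_option('display.max_rows')

# Change the option only inside the with-block
with pd.option_context('display.max_rows', 15):
    inside = pd.get_option('display.max_rows')

# After the block, the previous value is back in place
after = pd.get_option('display.max_rows')

print(inside)                  # 15
print(after == current_rows)   # True

# pd.reset_option('display.max_rows') would restore the pandas default
```

A temporary change like this can be handy when only one dataframe needs a different display setting.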

# Exploratory methods and functions for dataframes

When loading in data as dataframes, especially large datasets, you may want to get a quick overview of the data. 

Two very useful dataframe methods that can help with this are the `.head()` and `.tail()` methods.  By default, calling `.head()` or `.tail()` on a dataframe will return the first five or the last five rows of the dataframe, respectively. Passing an integer into these methods will give you an output with that number of rows:

In [None]:
plants.head(7)

In [None]:
plants.tail(11)

If a negative integer *-n* is passed into the `.head()` method, all rows except the last *n* will be shown.

Likewise, if a negative integer *-h* is passed into the `.tail()` method, all rows except the first *h* will be shown:

In [None]:
plants.head(-23)

In [None]:
plants.tail(-11)
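The behavior of positive and negative arguments is easier to verify on a small table. The sketch below uses a hypothetical 10-row dataframe named `demo` and is meant only as an illustration.

```python
import pandas as pd

# Hypothetical dataframe with 10 rows, holding the values 0 through 9
demo = pd.DataFrame({'value': range(10)})

print(demo.head(3)['value'].tolist())    # [0, 1, 2]
print(demo.head(-3)['value'].tolist())   # [0, 1, 2, 3, 4, 5, 6]
print(demo.tail(3)['value'].tolist())    # [7, 8, 9]
print(demo.tail(-3)['value'].tolist())   # [3, 4, 5, 6, 7, 8, 9]
```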

When working with large datasets with many variables, the `.columns`, `.shape`, and `.size` attributes and the `.info()` method can be helpful. 

The `.columns` attribute returns the column names of a dataframe:

In [None]:
plants.columns

The `.shape` attribute returns a tuple of the number of rows and columns within the dataframe:

In [None]:
plants.shape

The `.size` attribute returns the number of total data points within the dataframe (i.e. the number of rows * the number of columns):

In [None]:
plants.size

Lastly, the `.info()` method provides information on the dataframe, including the range of the indexes, the data type of each column, and the memory usage of the dataframe:

In [None]:
plants.info()
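The relationship between `.shape` and `.size` can be checked on a small table. The dataframe `demo` below is a hypothetical example that reuses a few of the grocery values from earlier in the lecture.

```python
import pandas as pd

# Hypothetical dataframe with 3 rows and 2 columns
demo = pd.DataFrame({'Item': ['Doritos', 'Bananas', 'Broccoli'],
                     'Quantity': [2, 5, 1]})

n_rows, n_cols = demo.shape       # unpack the tuple into two variables

print(demo.columns.tolist())      # ['Item', 'Quantity']
print(demo.shape)                 # (3, 2)
print(demo.size)                  # 6
print(demo.size == n_rows * n_cols)   # True
```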

# Activity

1. Using the provided sheet of paper, collect data from 10 of your fellow CAN cohort members. Try to collect from people that <u>are not</u> from your same home institution.

2. Using this information, create a dataframe using the three different approaches discussed (i.e. via a dictionary, lists, and pandas series).

3. Load the affordable housing data from the *Chicago-Housing* folder. Make the `Community Area Name` column the index of the dataframe.

4. Explore the dataset. How many entries are there? How many variables are documented for each entry? What are the data types of each variable?