# Pandas Basics


**Pandas** (derived from the term "**pan**el **da**ta") is a popular Python library for processing and analyzing data, particularly in a tabular format. Think of it as a spreadsheet in a programming environment, offering far more computational and memory efficiency along with all the automation benefits of Python. Pandas can do virtually any task you can do in a spreadsheet and extends easily to tasks like extract-transform-load (ETL), data cleaning, machine learning preparation, and viewing data stored in formats such as CSV, JSON, and SQL databases.

This series of trainings will focus on data cleaning with Pandas, but we will review a few Pandas basics in this section.

## Setting Up Pandas

Pandas should already come packaged with the Anaconda distribution. But if you ever need to install Pandas you can run this command:  

```
conda install pandas 
```

If using a standard Python distribution, use `pip`. 

```
pip install pandas
```

When using Pandas, you will typically import the `pandas` package under an *alias*, conventionally `pd`. Keeping Pandas functions behind the `pd.` prefix prevents clashes with similarly named functions in other libraries (e.g. NumPy, Python's standard library) while requiring less typing than the full package name. Here is how we alias the import of Pandas.

In [207]:
import pandas as pd 

It is common to use NumPy (a numeric computing Python library) in conjunction with Pandas. It gets aliased in a similar manner but is instead called `np`. 

In [209]:
import numpy as np 

There are two main types of data structures in Pandas we will review: the `Series` and the `DataFrame`. Then we will talk about indexing and importing from data sources. We will focus on the `Series` first. 

## Series

A `Series` holds a single dimension of data. Think of it as a single column, similar to a plain `list` in Python. But there are some key features that Pandas offers. Let's take a look. 

In [213]:
# declare a Series of temperatures 
my_series = pd.Series([97.5, 98.2, 99.5, 98.1, 98.7], name='temperatures')

# print the series 
my_series

0    97.5
1    98.2
2    99.5
3    98.1
4    98.7
Name: temperatures, dtype: float64

Note how we called the `Series()` constructor in Pandas and passed a Python list of floating point values to it. We also gave it the name "temperatures." When we print the series, we see not just the data but also the name provided and the inferred data type (`float64`, which is 64-bit floating point values). This shows we can store useful metadata with our list of values.
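As a minimal sketch of that metadata, the snippet below rebuilds the same series under a different variable name (`temps`, chosen here only for illustration) and reads its `name` and `dtype` attributes directly rather than relying on the printed output.

```python
import pandas as pd

# Same data as above, stored under an illustrative variable name
temps = pd.Series([97.5, 98.2, 99.5, 98.1, 98.7], name='temperatures')

print(temps.name)    # the name we assigned
print(temps.dtype)   # the data type Pandas inferred from the values
print(len(temps))    # the number of elements
```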

But notice how there are two columns, not just one, in our series. This may be surprising as we described a series as a single column of data, so why are there two? What is the column of consecutive integers on the left? This is what we call the **index**, which allows us to reference an element by an identifier. 



We can access an element by its index in a `Series` using a familiar Python getter syntax, just like Python lists. 

In [217]:
my_series[3]

98.1

You might wonder why this explicit index is useful, because numeric indices are a standard feature in Python collections. This is where Pandas can do things differently. We can use strings, dates, and even arbitrary ordered/unordered types. Below we use strings representing the days of week as the index for each data point. 

In [219]:
# declare a Series of temperatures 
my_series = pd.Series(
    index=['MON','TUE','WED','THUR','FRI'],
    data=[97.5, 98.2, 99.5, 98.1, 98.7], 
    name='temperatures'
)

# print the series 
my_series

MON     97.5
TUE     98.2
WED     99.5
THUR    98.1
FRI     98.7
Name: temperatures, dtype: float64

We can then access any element by that corresponding string value in the index. 


In [222]:
# print the WED element
my_series['WED']

99.5

We also call the index the **axis**, and we call its elements **axis labels** (e.g. consecutive integers or MON, TUE, WED...). We call the actual data the **values** in the series. The values do not have to be numeric. They can be strings, dates, or any arbitrary Python type. Ideally, a given series has the same data type throughout its values, because this lets Pandas leverage fast vectorized operations behind the scenes.
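To make "vectorized" concrete, here is a small sketch assuming the temperatures are in Fahrenheit (the source does not state the unit). The variable names `temps_f` and `temps_c` are illustrative. A single arithmetic expression is applied to every value at once, and the axis labels carry over to the result.

```python
import pandas as pd

temps_f = pd.Series(
    index=['MON', 'TUE', 'WED', 'THUR', 'FRI'],
    data=[97.5, 98.2, 99.5, 98.1, 98.7],
    name='temperatures'
)

# One expression converts every value (assumed Fahrenheit) to Celsius; no loop needed
temps_c = (temps_f - 32) * 5 / 9

# The result keeps the same axis labels as the original series
print(temps_c.round(1))

# Aggregations also run across all values at once
print(temps_f.mean())
```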


## DataFrames

Whereas a series holds a single column of data, a **dataframe** in Pandas holds multiple columns of data. It is also the data structure you will work with the most in Pandas.

Let's take a look. Below we have a `DataFrame` of three records. 

In [227]:
df = pd.DataFrame(data=[
    ("Otter Antiques", "5732 Serenity Ln", "Addison", "TX", 75031, "COMMERCIAL", 55),
    ("North Exa Energy", "67 Hearst Dr", "Plano", "TX", 75093, "INDUSTRIAL",73),
    ("City of Plano", "239 Plano Dr", "Plano", "TX", 75093,"GOVERNMENT",193), 
    ],
    columns=["COMPANY_NAME","ADDRESS","CITY","STATE","ZIP","CATEGORY","NUM_EMPLOYEES"]
)

# print dataframe 
df

|   | COMPANY_NAME | ADDRESS | CITY | STATE | ZIP | CATEGORY | NUM_EMPLOYEES |
|---|---|---|---|---|---|---|---|
| 0 | Otter Antiques | 5732 Serenity Ln | Addison | TX | 75031 | COMMERCIAL | 55 |
| 1 | North Exa Energy | 67 Hearst Dr | Plano | TX | 75093 | INDUSTRIAL | 73 |
| 2 | City of Plano | 239 Plano Dr | Plano | TX | 75093 | GOVERNMENT | 193 |


We can initialize these three records by assigning a collection containing collections to the `data` parameter, where each sub-collection is a row of data. In this case I create a list of tuples. The names of each column are provided as a list of strings to the `columns` parameter. Notice there are seven columns and three rows. We can obtain these dimensions using the dataframe's `shape` property.

In [229]:
df.shape

(3, 7)

Notice carefully that we have two axes now, one corresponding to the rows and another to the columns, which are axis 0 and axis 1 respectively.
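Many Pandas operations accept an `axis` argument that refers to these same two axes. The sketch below uses a small hypothetical dataframe (`scores`, with made-up columns `Q1` and `Q2`) to show how `sum()` behaves along each one.

```python
import pandas as pd

# Hypothetical numeric data, used only to illustrate the two axes
scores = pd.DataFrame({
    "Q1": [10, 20, 30],
    "Q2": [1, 2, 3]
})

# axis=0 works down the rows, producing one result per column
col_totals = scores.sum(axis=0)

# axis=1 works across the columns, producing one result per row
row_totals = scores.sum(axis=1)

print(col_totals)
print(row_totals)
```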


We can access a column (in Axis 1) by its axis label using the square bracket getter syntax `[ ]`. Because this is just one column, it will return a series. Below, we retrieve the "ADDRESS" column. 

In [233]:
df["ADDRESS"]

0    5732 Serenity Ln
1        67 Hearst Dr
2        239 Plano Dr
Name: ADDRESS, dtype: object

If we select multiple columns using a list of strings, we will extract another dataframe of just those two columns instead of a series. 

In [235]:
df[["COMPANY_NAME", "ADDRESS"]]

|   | COMPANY_NAME | ADDRESS |
|---|---|---|
| 0 | Otter Antiques | 5732 Serenity Ln |
| 1 | North Exa Energy | 67 Hearst Dr |
| 2 | City of Plano | 239 Plano Dr |
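The type returned depends on what goes inside the brackets, not on how many columns come back. As a small sketch with a trimmed-down version of the data above (the variable name `companies` is illustrative), a single string yields a `Series`, while a list yields a `DataFrame` even when it holds only one name.

```python
import pandas as pd

companies = pd.DataFrame({
    "COMPANY_NAME": ["Otter Antiques", "City of Plano"],
    "ADDRESS": ["5732 Serenity Ln", "239 Plano Dr"]
})

as_series = companies["ADDRESS"]    # a string selects one column as a Series
as_frame = companies[["ADDRESS"]]   # a list selects columns as a DataFrame

print(type(as_series))
print(type(as_frame))
```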


Let's declare another dataframe holding some simple contact information. 

In [237]:
df = pd.DataFrame({
    "first_name": ["Thomas", 'Sam', 'Joe'], 
    "last_name": ["Nield", 'Scala', 'Morrison'], 
    "email": ["tmnield@outlook.com", 'sam.scala@gmail.com', 'joe@rexonmetals.com']
})

df

|   | first_name | last_name | email |
|---|---|---|---|
| 0 | Thomas | Nield | tmnield@outlook.com |
| 1 | Sam | Scala | sam.scala@gmail.com |
| 2 | Joe | Morrison | joe@rexonmetals.com |


Retrieve the `columns` property and notice we have the type `Index`. It contains the column names. 

In [239]:
df.columns

Index(['first_name', 'last_name', 'email'], dtype='object')

An **index** again is an organizer to quickly retrieve data on an axis. In the context of a `DataFrame`, one index is used to retrieve a column by name. There are [many types of index implementations depending on the data type](https://pandas.pydata.org/docs/reference/api/pandas.Index.html). 

With dataframes, you will often use an index to retrieve a given row. The row index can be retrieved with the `index` property.

In [241]:
df.index

RangeIndex(start=0, stop=3, step=1)

Note the index implementation returned is a `RangeIndex`, which is a simple integer index defined by a range and step (typically `1` for consecutive integers). To retrieve a row by its index, use the `loc[]` getter with the row's index label, which for a `RangeIndex` is simply the row number.

In [243]:
df.loc[2]

first_name                    Joe
last_name                Morrison
email         joe@rexonmetals.com
Name: 2, dtype: object

You can manipulate and change the row index, most commonly by using the `set_index()` method and specifying which column you want as the index. Below we set the `email` column to become the index. Note we set `inplace` to `True` so it edits the existing dataframe rather than creating a new one.

In [245]:
df.set_index('email', inplace=True)

You can then use the `loc[]` getter to look up a record by an email address.

In [247]:
df.loc['tmnield@outlook.com']

first_name    Thomas
last_name      Nield
Name: tmnield@outlook.com, dtype: object
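Because `loc[]` looks rows up by label, the integer row numbers no longer work with it once the index holds email addresses. Pandas also provides `iloc[]`, which looks rows up by integer position regardless of the labels. The sketch below rebuilds the contact data under an illustrative name (`contacts`) and compares the two.

```python
import pandas as pd

contacts = pd.DataFrame({
    "first_name": ["Thomas", "Sam", "Joe"],
    "last_name": ["Nield", "Scala", "Morrison"],
    "email": ["tmnield@outlook.com", "sam.scala@gmail.com", "joe@rexonmetals.com"]
}).set_index("email")

# loc[] looks up by axis label
by_label = contacts.loc["joe@rexonmetals.com"]

# iloc[] looks up by integer position, regardless of the labels
by_position = contacts.iloc[2]

# Both refer to the same row
print(by_label.equals(by_position))
```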

If you study the `index` property now, notice it is no longer a `RangeIndex` but rather a plain `Index` organized by string objects. 

In [249]:
df.index

Index(['tmnield@outlook.com', 'sam.scala@gmail.com', 'joe@rexonmetals.com'], dtype='object', name='email')

To reset the index back to the default, call `reset_index()`. Use the `inplace` argument so it edits the existing dataframe rather than creating a new one.

In [251]:
df.reset_index(inplace=True)
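If you would rather not modify a dataframe in place, both methods return a new dataframe when `inplace` is left at its default, so you can assign the result to a variable instead. The sketch below assumes a trimmed-down contact table, and its variable names are illustrative. It also shows the `drop=True` option of `reset_index()`, which discards the old index rather than restoring it as a column.

```python
import pandas as pd

contacts = pd.DataFrame({
    "first_name": ["Thomas", "Sam"],
    "email": ["tmnield@outlook.com", "sam.scala@gmail.com"]
})

# Without inplace=True, set_index() returns a new dataframe and leaves the original alone
by_email = contacts.set_index("email")

# reset_index() moves the index back into a regular column
restored = by_email.reset_index()

# drop=True discards the old index instead of keeping it as a column
without_email = by_email.reset_index(drop=True)

print(list(contacts.columns))
print(list(restored.columns))
print(list(without_email.columns))
```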

## Importing Data

Let's cover some basic and common data import operations in Pandas. To prepare, let's download three files in three data formats: CSV, SQLite, and JSON. 

In [254]:
import urllib.request

# Download a CSV
urllib.request.urlretrieve("https://raw.githubusercontent.com/thomasnield/machine-learning-demo-data/master/classification/maintenance_predict.csv?raw=true", "maintenance.csv")

# Download a SQLite database file (we connect to it later)
urllib.request.urlretrieve("https://github.com/thomasnield/anaconda_intro_to_sql/blob/main/company_operations.db?raw=true", "company_operations.db")

# Download a JSON file 
urllib.request.urlretrieve("https://gist.githubusercontent.com/thomasnield/7daae45b907263808d1374235e73a27d/raw/32763976868928518a7e6b65211704390359601a/json_tabuler.json", "customers.json")

('customers.json', <http.client.HTTPMessage at 0x1962debb440>)

### Reading a CSV

To help view the raw contents of a text file, let's implement this `read_text()` function. 

In [257]:
def read_text(filepath): 
    with open(filepath) as f:
        return f.readlines()

Let's take a look at the CSV file `maintenance.csv` first and view its raw contents line-by-line.

In [259]:
read_text('maintenance.csv')

['DAYS_SINCE_INSTALL,FLIGHT_HOURS,ENV_TEMPERATURE,REPLACEMENT_NEEDED\n',
 '93,792,126,1\n',
 '11,107,113,1\n',
 '23,152,110,1\n',
 '32,375,110,1\n',
 '141,701,110,1\n',
 '21,143,107,1\n',
 '68,453,107,1\n',
 '72,754,107,1\n',
 '33,185,106,1\n',
 '106,715,106,1\n',
 '152,1046,106,1\n',
 '83,1113,106,1\n',
 '175,1616,106,1\n',
 '198,1823,106,1\n',
 '183,1832,106,1\n',
 '11,60,105,1\n',
 '15,67,105,1\n',
 '127,250,105,1\n',
 '129,750,105,1\n',
 '33,285,103,0\n',
 '73,677,103,1\n',
 '142,1825,102,0\n',
 '1,6,102,0\n',
 '62,239,101,0\n',
 '40,167,100,0\n',
 '32,301,100,0\n',
 '92,988,100,1\n',
 '135,1002,99,0\n',
 '61,334,99,0\n',
 '134,473,98,1\n',
 '80,729,98,1\n',
 '127,441,97,0\n',
 '20,204,96,0\n',
 '29,213,96,0\n',
 '156,1228,96,1\n',
 '3,24,95,0\n',
 '80,859,95,1\n',
 '192,943,94,0\n',
 '128,322,93,0\n',
 '56,656,93,0\n',
 '152,1475,93,0\n',
 '26,239,93,0\n',
 '95,379,92,0\n',
 '116,1000,92,0\n',
 '89,373,92,0\n',
 '134,585,92,1\n',
 '13,144,91,0\n',
 '61,269,91,0\n',
 '74,301,91,0\n

We can load it easily into a Pandas `DataFrame` by using the `read_csv()` function and passing it the path to the file.

In [261]:
pd.read_csv('maintenance.csv')

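|   | DAYS_SINCE_INSTALL | FLIGHT_HOURS | ENV_TEMPERATURE | REPLACEMENT_NEEDED |
|---|---|---|---|---|
| 0 | 93 | 792 | 126 | 1 |
| 1 | 11 | 107 | 113 | 1 |
| 2 | 23 | 152 | 110 | 1 |
| 3 | 32 | 375 | 110 | 1 |
| 4 | 141 | 701 | 110 | 1 |
| ... | ... | ... | ... | ... |
| 439 | 57 | 448 | 22 | 0 |
| 440 | 155 | 650 | 21 | 0 |
| 441 | 176 | 1697 | 17 | 1 |
| 442 | 33 | 188 | 12 | 0 |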
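`read_csv()` also accepts optional arguments that control what gets loaded. The sketch below uses two of them, `usecols` and `nrows`. So that it runs without the downloaded file, it reads the first few rows of `maintenance.csv` from an in-memory string via `io.StringIO`; with the real file you would pass the path instead.

```python
import io
import pandas as pd

# A few rows copied from maintenance.csv, held in memory so the sketch runs without the file
csv_text = """DAYS_SINCE_INSTALL,FLIGHT_HOURS,ENV_TEMPERATURE,REPLACEMENT_NEEDED
93,792,126,1
11,107,113,1
23,152,110,1
32,375,110,1
"""

subset = pd.read_csv(
    io.StringIO(csv_text),
    usecols=["FLIGHT_HOURS", "REPLACEMENT_NEEDED"],  # keep only these columns
    nrows=3                                          # stop after three data rows
)

print(subset)
```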


### Reading a SQL Database

Next let's look at a SQL database. There are many SQL database platforms like MySQL, Oracle, Microsoft SQL Server, and PostgreSQL. They all use relational tables to store data, and we can use the `read_sql()` function in Pandas to easily execute a SQL query against a database connection. 

In our example, we will use SQLite since it is lightweight and easily stores data in a simple database file which we just downloaded. SQLite support is also included in Python's standard library. Here is how we connect to a SQLite database and run a query against the `CUSTOMER` table using Pandas' `read_sql()` function. 

In [264]:
import sqlite3

conn = sqlite3.connect('company_operations.db')

pd.read_sql("SELECT * FROM CUSTOMER", conn)

|   | CUSTOMER_ID | CUSTOMER_NAME | ADDRESS | CITY | STATE | ZIP | CATEGORY |
|---|---|---|---|---|---|---|---|
| 0 | 1 | Alpha Medical | 18745 Train Dr | Dallas | TX | 75021 | INDUSTRIAL |
| 1 | 2 | Oak Cliff Base | 2379 Cliff Ave | Abbevile | LA | 70510 | GOVERNMENT |
| 2 | 3 | Sports Unlimited | 1605 Station Dr | Alexandrai | LA | 71301 | COMMERCIAL |
| 3 | 4 | Riley Sporting Goods | 9854 Firefly Blvd | Austin | TX | 78701 | COMMERCIAL |
| 4 | 5 | Lite Industrial | 462 Roadrunner Blvd | Houston | TX | 77254 | INDUSTRIAL |
| 5 | 6 | Prairie Sports Center | 689 Stadium Way | Tulsa | OK | 74101 | COMMERCIAL |
| 6 | 7 | Facility 95 | 2396 Runway Dr | Oklahoma City | OK | 73101 | GOVERNMENT |
| 7 | 8 | Allen Stadium | 573 HIllcrest Rd | Allen | TX | 75002 | COMMERCIAL |
| 8 | 9 | Dent Research | 392 45th St | Waco | TX | 76700 | INDUSTRIAL |
| 9 | 10 | Gamma Solutions | 2752 27th St | Phoenix | AZ | 85001 | COMMERCIAL |
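Queries often need to filter on a value supplied at runtime. Rather than pasting that value into the SQL string, `read_sql()` accepts a `params` argument that fills in placeholders (`?` for SQLite). The sketch below is self-contained, so it builds a hypothetical, trimmed-down `CUSTOMER` table in an in-memory SQLite database instead of relying on the downloaded file.

```python
import sqlite3
import pandas as pd

# In-memory database with a hypothetical, trimmed-down CUSTOMER table
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE CUSTOMER (CUSTOMER_ID INTEGER, CUSTOMER_NAME TEXT, STATE TEXT)")
conn.executemany(
    "INSERT INTO CUSTOMER VALUES (?, ?, ?)",
    [(1, "Alpha Medical", "TX"), (2, "Oak Cliff Base", "LA"), (4, "Riley Sporting Goods", "TX")]
)

# The ? placeholder is filled from params, so the value is never pasted into the SQL string
tx_customers = pd.read_sql(
    "SELECT * FROM CUSTOMER WHERE STATE = ? ORDER BY CUSTOMER_ID",
    conn,
    params=("TX",)
)

conn.close()
print(tx_customers)
```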


### Reading a JSON File

Last let's look at how to read data from a JSON file. Let's look at the contents of `customers.json`. 

In [267]:
read_text('customers.json')

['[\n',
 '     {\n',
 '            "ID" : 1,\n',
 '            "NAME": "Alpha Medical",\n',
 '            "ADDRESS" : "18745 Train Dr",\n',
 '            "CITY" : "Dallas",\n',
 '            "STATE": "TX",\n',
 '            "ZIP" : 75021,\n',
 '            "CATEGORY" : "INDUSTRIAL"\n',
 '        },\n',
 '        {\n',
 '            "ID" : 2,\n',
 '            "NAME": "Oak Cliff Base",\n',
 '            "ADDRESS" : "2379 Cliff Ave",\n',
 '            "CITY" : "Abbevile",\n',
 '            "STATE": "LA",\n',
 '            "ZIP" : 70510,\n',
 '            "CATEGORY": "GOVERNMENT"\n',
 '        },\n',
 '        {\n',
 '            "ID" : 3,\n',
 '            "NAME": "Sports Unlimited",\n',
 '            "ADDRESS" : "1605 Station Dr",\n',
 '            "CITY" : "Alexandrai",\n',
 '            "STATE": "LA",\n',
 '            "ZIP" : 71301,\n',
 '            "CATEGORY" : "COMMERCIAL"\n',
 '        },\n',
 '        {\n',
 '            "ID" : 4,\n',
 '            "NAME": "Riley Sporting Goods"

Assuming the JSON is structured in a way that maps cleanly onto a tabular structure, we can use Pandas' `read_json()` function to read a JSON file into a DataFrame.

In [269]:
pd.read_json('customers.json')

|   | ID | NAME | ADDRESS | CITY | STATE | ZIP | CATEGORY |
|---|---|---|---|---|---|---|---|
| 0 | 1 | Alpha Medical | 18745 Train Dr | Dallas | TX | 75021 | INDUSTRIAL |
| 1 | 2 | Oak Cliff Base | 2379 Cliff Ave | Abbevile | LA | 70510 | GOVERNMENT |
| 2 | 3 | Sports Unlimited | 1605 Station Dr | Alexandrai | LA | 71301 | COMMERCIAL |
| 3 | 4 | Riley Sporting Goods | 9854 Firefly Blvd | Austin | TX | 78701 | COMMERCIAL |
| 4 | 5 | Lite Industrial | 462 Roadrunner Blvd | Houston | TX | 77254 | INDUSTRIAL |
| 5 | 6 | Prairie Sports Center | 689 Stadium Way | Tulsa | OK | 74101 | COMMERCIAL |
| 6 | 7 | Facility 95 | 2396 Runway Dr | Oklahoma City | OK | 73101 | GOVERNMENT |
| 7 | 8 | Allen Stadium | 573 HIllcrest Rd | Allen | TX | 75002 | COMMERCIAL |
| 8 | 9 | Dent Research | 392 45th St | Waco | TX | 76700 | INDUSTRIAL |
| 9 | 10 | Gamma Solutions | 2752 27th St | Phoenix | AZ | 85001 | COMMERCIAL |
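The structure that maps most directly onto a table is the one in `customers.json`: a JSON array of objects, where each object becomes a row and each key becomes a column. The sketch below shows this with a shortened, in-memory version of the data (only three of the fields, read through `io.StringIO` so it runs without the file).

```python
import io
import pandas as pd

# A shortened version of customers.json: a list of records with identical keys
json_text = """[
    {"ID": 1, "NAME": "Alpha Medical", "STATE": "TX"},
    {"ID": 2, "NAME": "Oak Cliff Base", "STATE": "LA"}
]"""

# Each object becomes a row; each key becomes a column
customers = pd.read_json(io.StringIO(json_text))

print(customers)
```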


## Exercise

Complete the code below to declare a `DataFrame` with two records, each containing the fields `ORDER_ID` and `QUANTITY`, both of integer type. Have the first record contain an `ORDER_ID` of `1` and a `QUANTITY` of `450`, and the second an `ORDER_ID` of `2` and a `QUANTITY` of `700`.

Afterwards, select and display the first record. 

In [272]:
df = pd.DataFrame({
    "ORDER_ID": [1, 2], 
    "QUANTITY": [450, 700]
})

df.loc[0]

ORDER_ID      1
QUANTITY    450
Name: 0, dtype: int64