# Pandas

## 1 Import pandas

In [1]:
# The as keyword aliases pandas to pd, so we can write pd instead of pandas
import pandas as pd

In [2]:
# Check version
print(pd.__version__)

1.0.5


In [3]:
# Detailed info on pandas and its dependencies (uncomment to run)
# pd.show_versions()

In [5]:
# See what it does

help(pd)

Help on package pandas:

NAME
    pandas

DESCRIPTION
    pandas - a powerful data analysis and manipulation library for Python
    
    **pandas** is a Python package providing fast, flexible, and expressive data
    structures designed to make working with "relational" or "labeled" data both
    easy and intuitive. It aims to be the fundamental high-level building block for
    doing practical, **real world** data analysis in Python. Additionally, it has
    the broader goal of becoming **the most powerful and flexible open source data
    analysis / manipulation tool available in any language**. It is already well on
    its way toward this goal.
    
    Main Features
    -------------
    Here are just a few of the things that pandas does well:
    
      - Easy handling of missing data in floating point as well as non-floating
        point data.
      - Size mutability: columns can be inserted and deleted from DataFrame and
        higher dimensional objects
      - Automatic an

## 2 Basic commands

### What can you do with Pandas? 
- read_csv() / read_excel()
- Series()
- DataFrame
- values

In [6]:
# Use read_csv() to load a CSV file into a DataFrame (I'll also store the path in a variable)
# You can read xlsx files the same way using read_excel()

csv_path = 'datasets/TMRQLD/vehicleinvolvement.csv'
df = pd.read_csv(csv_path)

In [7]:
# Let's see the dataframe head
df.head()

Unnamed: 0,Crash_Year,Crash_Police_Region,Crash_Severity,Involving_Motorcycle_Moped,Involving_Truck,Involving_Bus,Count_Crashes,Count_Casualty_Fatality,Count_Casualty_Hospitalised,Count_Casualty_MedicallyTreated,Count_Casualty_MinorInjury,Count_Casualty_All
0,2001,Brisbane,Fatal,No,No,No,41,43,20,3,0,66
1,2001,Brisbane,Fatal,No,No,Yes,1,1,1,0,0,2
2,2001,Brisbane,Fatal,No,Yes,No,2,2,0,1,2,5
3,2001,Brisbane,Fatal,Yes,No,No,5,5,0,2,0,7
4,2001,Brisbane,Fatal,Yes,Yes,No,1,1,0,0,0,1
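The dataset file isn't needed to try `read_csv()`. As a minimal sketch, the block below feeds it an in-memory stand-in: the five rows from the `head()` output above, trimmed to four columns. The names `csv_text` and `sample_df` are made up for this illustration.

```python
import io

import pandas as pd

# Hypothetical stand-in for the CSV file: the five rows shown by df.head() above,
# trimmed to four columns to keep the example short.
csv_text = """Crash_Year,Crash_Police_Region,Crash_Severity,Count_Casualty_All
2001,Brisbane,Fatal,66
2001,Brisbane,Fatal,2
2001,Brisbane,Fatal,5
2001,Brisbane,Fatal,7
2001,Brisbane,Fatal,1
"""

# read_csv accepts a path or any file-like object, so StringIO works here.
sample_df = pd.read_csv(io.StringIO(csv_text))

print(sample_df.shape)    # (number of rows, number of columns)
print(sample_df.dtypes)   # column types inferred from the data
print(sample_df.head(3))  # head(n) returns the first n rows (default is 5)
```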


In [14]:
# Isolate a column. Double brackets (a list of column names) return a DataFrame, not a list
region = df[['Crash_Police_Region']]
type(region)

pandas.core.frame.DataFrame

In [20]:
# Or use single brackets to get it as a 'Series', which the lab describes as a 1-D dataframe: one column of values plus the index.
# It is the usual way to work with a single column.
region2 = df['Crash_Police_Region']
type(region2)

pandas.core.series.Series
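To make the difference between the two selections concrete, here is a small sketch on a hypothetical `sample_df` built from the first five rows of the table above. It also touches `values`, which is listed at the top of this section but not shown elsewhere.

```python
import pandas as pd

# Hypothetical stand-in for df, using the first five rows shown earlier.
sample_df = pd.DataFrame({
    'Crash_Police_Region': ['Brisbane'] * 5,
    'Count_Casualty_All': [66, 2, 5, 7, 1],
})

as_frame = sample_df[['Count_Casualty_All']]  # list of names -> DataFrame (2-D)
as_series = sample_df['Count_Casualty_All']   # single name   -> Series (1-D)

print(type(as_frame).__name__, as_frame.shape)
print(type(as_series).__name__, as_series.shape)

# A Series is handy for column-level work:
total = as_series.sum()             # add up the column
casualty_list = as_series.tolist()  # plain Python list
casualty_array = as_series.values   # underlying NumPy array
print(total, casualty_list)
```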

In [17]:
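# You can also use double brackets to create a new df with several columns from the original dataset
# There's no real reason I'd do this with the above table, but anyway:
condensed_df = df[['Crash_Police_Region','Count_Casualty_All']]
# Note: head is missing its parentheses here, so the output below is the bound method's repr
# (which prints the whole frame). Use condensed_df.head() to get just the first five rows.
condensed_df.head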

<bound method NDFrame.head of      Crash_Police_Region  Count_Casualty_All
0               Brisbane                  66
1               Brisbane                   2
2               Brisbane                   5
3               Brisbane                   7
4               Brisbane                   1
...                  ...                 ...
2289            Southern                 316
2290            Southern                   7
2291            Southern                  14
2292            Southern                   2
2293            Southern                  15

[2294 rows x 2 columns]>

In the example above, I once forgot to reference the DataFrame (`df`), as follows:
```python
mistake = [['Crash_Police_Region','Count_Casualty_All']]
correct = df[['Crash_Police_Region','Count_Casualty_All']]
```
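Without `df` in front, the brackets are just a list literal, so what I ended up with was a nested list (one inner list holding the two column names) instead of a DataFrame. Dumb!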
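A quick way to catch this kind of slip is to check the type and length of the result. This sketch needs no data at all:

```python
# Without df in front, the double brackets are only a list literal.
mistake = [['Crash_Police_Region', 'Count_Casualty_All']]

print(type(mistake).__name__)  # a plain Python list, not a DataFrame
print(len(mistake))            # the outer list holds a single item...
print(len(mistake[0]))         # ...which is the inner list of two column names
```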

In [23]:
# Read a single value with iloc by giving the row position, then the column position (both integers).
# The first cell at the top left is 0, 0.
df.iloc[0, 0]

2001

In [26]:
# loc works the same way but takes labels: the row's index label, then the column name.
df.loc[3, 'Involving_Bus']

'No'
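With the default index, row labels and row positions are the same numbers, which hides the difference between `iloc` and `loc`. The sketch below uses a hypothetical `sample_df` (three columns from the first five rows above) and then relabels the rows with letters, an assumption made purely for illustration, to show which accessor uses which.

```python
import pandas as pd

# Hypothetical stand-in for df, using values from the first five rows shown earlier.
sample_df = pd.DataFrame({
    'Crash_Year': [2001] * 5,
    'Involving_Truck': ['No', 'No', 'Yes', 'No', 'Yes'],
    'Involving_Bus': ['No', 'Yes', 'No', 'No', 'No'],
})

# Default index: label 3 and position 3 point at the same row.
by_position = sample_df.iloc[3, 2]            # row position 3, column position 2
by_label = sample_df.loc[3, 'Involving_Bus']  # row label 3, column name
print(by_position, by_label)

# Relabel the rows so that labels and positions differ.
relabelled = sample_df.copy()
relabelled.index = ['a', 'b', 'c', 'd', 'e']

still_by_position = relabelled.iloc[3, 2]              # positions are unchanged
now_by_label = relabelled.loc['d', 'Involving_Bus']    # the label is now 'd'
print(still_by_position, now_by_label)
```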

In [36]:
# You can display a little chunk of the table by giving position ranges (the end of each range is excluded)
# df.iloc[row range, column range]
df.iloc[0:4, 0:2]

Unnamed: 0,Crash_Year,Crash_Police_Region
0,2001,Brisbane
1,2001,Brisbane
2,2001,Brisbane
3,2001,Brisbane


In [31]:
# Again, drop the i and use loc to slice by labels instead. Unlike iloc, loc includes the end label, so 0:4 returns five rows:
df.loc[0:4, 'Crash_Year':'Crash_Police_Region']

Unnamed: 0,Crash_Year,Crash_Police_Region
0,2001,Brisbane
1,2001,Brisbane
2,2001,Brisbane
3,2001,Brisbane
4,2001,Brisbane
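The two outputs above have different row counts because `iloc` slices exclude the end position while `loc` slices include the end label. A minimal sketch on a hypothetical `sample_df` (three columns from the first five rows of the table) makes the comparison explicit:

```python
import pandas as pd

# Hypothetical stand-in for df, using the first five rows shown earlier.
sample_df = pd.DataFrame({
    'Crash_Year': [2001] * 5,
    'Crash_Police_Region': ['Brisbane'] * 5,
    'Crash_Severity': ['Fatal'] * 5,
})

# Position-based: rows 0, 1, 2, 3 and columns 0, 1 (end positions excluded).
by_position = sample_df.iloc[0:4, 0:2]

# Label-based: rows labelled 0 through 4 and both named columns (end labels included).
by_label = sample_df.loc[0:4, 'Crash_Year':'Crash_Police_Region']

print(by_position.shape)  # four rows, two columns
print(by_label.shape)     # five rows, two columns
```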
