## External package use case: Pandas

You will need Pandas for your first assignment. It is a package, and so must be imported to be used:

In [1]:
import pandas as pd

Some packages (like `math`, `time`, etc.) are part of the Python standard library and can always be imported. Others, like `pandas`, are third-party packages and have to be installed separately. If you try to import one that is not installed, you will get a `ModuleNotFoundError`. The recommended way to install them is via a package manager; `pip` has become standard, but there are others like `conda`. These package managers keep all your downloaded packages in one location and handle dependencies and versioning, which makes using packages simpler.
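As a rough illustration of the difference between standard-library and third-party packages, the sketch below checks whether a package can be imported before trying to use it. The helper name `is_installed` is made up for this example; `importlib.util.find_spec` is part of the standard library.

```python
import importlib.util


def is_installed(package_name: str) -> bool:
    """Return True if a top-level package can be imported in this environment."""
    # find_spec returns None when no top-level module of that name is found.
    return importlib.util.find_spec(package_name) is not None


for name in ("math", "pandas"):
    if is_installed(name):
        print(f"{name}: available")
    else:
        # Typical fixes, run from a terminal rather than from Python:
        #   pip install pandas    or    conda install pandas
        print(f"{name}: missing, install it with pip or conda")
```

`math` should always be reported as available, while the result for `pandas` depends on the environment the code runs in.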

We can use a Jupyter command to get more information on a package; the website (https://pandas.pydata.org/) is also always a good resource:

In [2]:
pd?

Type:        module
String form: <module 'pandas' from 'C:\\Users\\simon\\AppData\\Local\\Continuum\\anaconda3\\envs\\thinfilm\\lib\\site-packages\\pandas\\__init__.py'>
File:        c:\users\simon\appdata\local\continuum\anaconda3\envs\thinfilm\lib\site-packages\pandas\__init__.py
Docstring:
pandas - a powerful data analysis and manipulation library for Python

**pandas** is a Python package providing fast, flexible, and expressive data
structures designed to make working with "relational" or "labeled" data both
easy and intuitive. It aims to be the fundamental high-level building block for
doing practical, **real world** data analysis in Python. Additionally, it has
the broader goal of becoming **the most powerful and flexible open source data
analysis / manipulation tool available in any language**. It is already well on
its way toward this goal.

Main Features
-------------
Here are just a few of the things that pandas does well:

  - E
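The `?` syntax only works in Jupyter/IPython. As a minimal sketch, roughly the same fields can be read in plain Python from attributes of the module object; the exact path and docstring text will differ from machine to machine and between pandas versions.

```python
import pandas as pd

print(type(pd))        # corresponds to the "Type" field
print(pd.__file__)     # corresponds to the "File" field

# The "Docstring" field is stored in __doc__; print only its first line here.
doc_lines = pd.__doc__.strip().splitlines()
print(doc_lines[0])
```

The built-in `help(pd)` prints the full documentation, which is long for a package of this size.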

Pandas is now imported under the name `pd`, so we can call pandas functions like so:

In [3]:
myfile_variable = pd.read_csv("Traffic_Violations-short.csv", delimiter="|")

To get information on a package's function, we can navigate to the function definition or look it up online at https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html. We can also just ask Jupyter:

In [4]:
pd.read_csv?

Signature:
pd.read_csv(
    filepath_or_buffer: Union[ForwardRef('PathLike[str]'), str, IO[~T], io.RawIOBase, io.BufferedIOBase, io.TextIOBase, _io.TextIOWrapper, mmap.mmap],
    sep=<object object at 0x00000177F80393A0>,
    delimiter=None,
    header='infer',
    names=None,
    index_col

Here, the first argument is positional and gives the file path to read from. A bare filename is enough in this case, because relative paths are resolved against the current working directory (the directory the interpreter was started from). We also passed the keyword argument `delimiter` to match the formatting of our file, which separates fields with `|` rather than commas.
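To see what the `delimiter` argument changes, here is a small sketch that parses an in-memory sample instead of the real file. The sample text is invented for illustration and only borrows a few column names from the dataset; `io.StringIO` stands in for a file on disk.

```python
import io

import pandas as pd

# Invented pipe-separated sample, not taken from the real file.
raw_text = (
    "Date Of Stop|Agency|Description\n"
    "01/01/2000|MCP|EXAMPLE ONE\n"
    "01/02/2000|MCP|EXAMPLE TWO\n"
)

# The first argument may be a path or any file-like object.
with_pipe = pd.read_csv(io.StringIO(raw_text), delimiter="|")
print(with_pipe.shape)    # (2, 3): three separate columns

# With the default comma separator, each line is read as a single field.
with_default = pd.read_csv(io.StringIO(raw_text))
print(with_default.shape)    # (2, 1): one column holding the whole line
```

In the pandas documentation, `delimiter` is described as an alias for `sep`, so `sep="|"` should behave the same way.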

The function returns a pandas object called a dataframe:

In [5]:
type(myfile_variable)

pandas.core.frame.DataFrame

Let's inspect that:

In [7]:
pd.core.frame.DataFrame?

Init signature:
pd.core.frame.DataFrame(
    data=None,
    index: 'Optional[Axes]' = None,
    columns: 'Optional[Axes]' = None,
    dtype: 'Optional[Dtype]' = None,
    copy: 'bool' = False,
)
Docstring:
Two-dimensional, size-mutable, potentially heterogeneous tabular data.

Data structure also contains labeled axes (rows and columns).
Arithmetic operations align on both row and column labels. Can be
thought of as a dict-like container for Series objects. The primary
pandas data structure.

Par
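The signature above also shows that a dataframe can be built directly, without reading a file. The following is a minimal sketch using the `data` and `index` parameters; the values and row labels are made up, and only the column names are borrowed from the dataset.

```python
import pandas as pd

# Made-up values: each dict key becomes a column, each list holds that column's rows.
data = {
    "Agency": ["MCP", "MCP", "MCP"],
    "Year": [2001, 2005, 2010],
}
frame = pd.DataFrame(data=data, index=["a", "b", "c"])

print(frame.shape)    # (3, 2): three rows, two columns
print(type(frame["Year"]).__name__)    # Series: each column is a pandas Series
```

This matches the docstring's description of a dataframe as a dict-like container for Series objects.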

This object has attributes like `columns`: 

In [8]:
myfile_variable.columns

Index(['Date Of Stop', 'Time Of Stop', 'Agency', 'SubAgency', 'Description',
       'Location', 'Latitude', 'Longitude', 'Accident', 'Belts',
       'Personal Injury', 'Property Damage', 'Fatal', 'Commercial License',
       'HAZMAT', 'Commercial Vehicle', 'Alcohol', 'Work Zone', 'State',
       'VehicleType', 'Year', 'Make', 'Model', 'Color', 'Violation Type',
       'Charge', 'Article', 'Contributed To Accident', 'Race', 'Gender',
       'Driver City', 'Driver State', 'DL State', 'Arrest Type',
       'Geolocation'],
      dtype='object')
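A few other attributes work the same way as `columns`. The sketch below uses a tiny dataframe with made-up values so that the results are easy to check by eye. Attributes are read without parentheses, unlike method calls.

```python
import pandas as pd

# Tiny made-up frame; the column names are borrowed from the dataset above.
frame = pd.DataFrame({"State": ["MD", "VA"], "Year": [2001, 2005]})

print(list(frame.columns))    # ['State', 'Year']
print(frame.shape)    # (2, 2): (number of rows, number of columns)
print(frame.dtypes)    # one data type per column
```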

For dataframes, the columns are the headers of the table:

In [9]:
myfile_variable

| | Date Of Stop | Time Of Stop | Agency | SubAgency | Description | Location | Latitude | Longitude | Accident | Belts | ... | Charge | Article | Contributed To Accident | Race | Gender | Driver City | Driver State | DL State | Arrest Type | Geolocation |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 09/24/2013 | 17:11:00 | MCP | 3rd district, Silver Spring | DRIVING VEHICLE ON HIGHWAY WITH SUSPENDED REGI... | 8804 FLOWER AVE | | | No | No | ... | 13-401(h) | Transportation Article | No | BLACK | M | TAKOMA PARK | MD | MD | A - Marked Patrol | |
| 1 | 12/20/2012 | 00:41:00 | MCP | 2nd district, Bethesda | DRIVING WHILE IMPAIRED BY ALCOHOL | NORFOLK AVE / ST ELMO AVE | 38.983578 | -77.093105 | No | No | ... | 21-902(b1) | Transportation Article | No | WHITE | M | DERWOOD | MD | MD | A - Marked Patrol | (38.9835782, -77.09310515) |
| 2 | 07/20/2012 | 23:12:00 | MCP | 5th district, Germantown | FAILURE TO STOP AT STOP SIGN | WISTERIA DR @ WARING STATION RD | 39.16181 | -77.253581 | No | No | ... | 21-707(a) | Transportation Article | No | ASIAN | F | GERMANTOWN | MD | MD | A - Marked Patrol | (39.1618098166667, -77.25358095) |
| 3 | 03/19/2012 | 16:10:00 | MCP | 2nd district, Bethesda | DRIVER USING HANDS TO USE HANDHELD TELEPHONE W... | CLARENDON RD @ ELM ST. N/ | 38.982731 | -77.100755 | No | No | ... | 21-1124.2(d2) | Transportation Article | No | HISPANIC | M | ARLINGTON | VA | VA | A - Marked Patrol | (38.9827307333333, -77.1007551666667) |
| 4 | 12/01/2014 | 12:52:00 | MCP | 6th district, Gaithersburg / Montgomery Village | FAILURE STOP AND YIELD AT THRU HWY | CHRISTOPHER AVE/MONTGOMERY VILLAGE AVE | 39.162888 | -77.229088 | No | No | ... | 21-403(b) | Transportation Article | No | BLACK | F | UPPER MARLBORO | MD | MD | A - Marked Patrol | (39.1628883333333, -77.2290883333333) |


Dataframes are similar to dicts in that elements can be accessed by key, here the column name as a string:

In [10]:
myfile_variable["Date Of Stop"]

0    09/24/2013
1    12/20/2012
2    07/20/2012
3    03/19/2012
4    12/01/2014
Name: Date Of Stop, dtype: object
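The dict analogy goes a little further than indexing. As a minimal sketch on a tiny dataframe with made-up values, the following shows column lookup, membership tests, `keys()`, and the `KeyError` raised for a column that does not exist.

```python
import pandas as pd

# Tiny made-up frame; the column names are borrowed from the dataset above.
frame = pd.DataFrame({"State": ["MD", "VA"], "Year": [2001, 2005]})

states = frame["State"]    # lookup by column name, like a dict key
print(type(states).__name__)    # Series
print("State" in frame)    # True: `in` tests column names, not values
print(list(frame.keys()))    # ['State', 'Year']

try:
    frame["Color"]    # this small frame has no such column
except KeyError as err:
    print("missing column:", err)
```

The analogy is loose in one respect: a lookup returns a whole column as a Series rather than a single value.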