### Introduction to Pandas 

Pandas is one of the most powerful and beginner-friendly Python packages for data analysis and manipulation. It provides data structures and functions that support a wide range of analytical tasks. To use them, we load Pandas into our script with the **import** command, as shown below. Because we will call Pandas many times in our code, we also give it a short alias (here, pd).

<img src="https://pandas.pydata.org/docs/_static/pandas.svg" width="200"/>

A brief note: Pandas uses object-oriented notation. There is no need to worry if you are not familiar with this concept, but a high-level intuition for object-oriented programming (OOP) helps in understanding how Pandas works. Every data structure that stores our data in the program is a Pandas object, defined by a blueprint called a class. For example, a DataFrame (discussed below) is a data structure that stores data in a format similar to a spreadsheet. We can create a DataFrame object to hold a table from our dataset, and this object comes with many methods and functions that operate on it.
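
As a minimal sketch of the object idea, the cell below builds a small made-up DataFrame and inspects it. The names `views` and `column_totals` are illustrative only and are not part of the dataset used later.

```python
import pandas as pd

# A DataFrame is an object: an instance of the pandas.DataFrame class.
views = pd.DataFrame({"Chrome": [67, 74, 89], "Safari": [44, 58, 70]})

print(type(views))                      # the class the object was built from
print(isinstance(views, pd.DataFrame))  # True

# Attributes describe the object and take no parentheses.
print(views.shape)                      # (3, 2)

# Methods are functions attached to the object and are called with parentheses.
column_totals = views.sum()             # adds up each column
print(column_totals)
```

The distinction to take away is that attributes such as `shape` are read directly, while methods such as `sum()` are called.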

Pandas usually comes pre-installed in Jupyter environments like the one you are using here. If Pandas is installed, you can check its version by running ```pd.__version__``` after the import command below:

*Note: pd is an alias*

In [2]:
import pandas as pd
pd.__version__

'1.4.2'

### Data Structures in Pandas

We will learn the two fundamental data structures in Pandas: **Series** and **DataFrames**. Understanding the difference between a Series and a DataFrame helps you tell which methods work on a Series and which work on a DataFrame when referring to the documentation.

<img src="series.png" width="300"/>

A **Series** is a one-dimensional labelled array that can hold data of any type (integers, floats, strings, Python objects and so on), although each Series has a single dtype. Every element is labelled by an index. Index labels are usually unique, but Pandas does not require this. A Series can be created from a list, a Python dictionary or even a scalar value.
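
The scalar case and the single-dtype rule can be illustrated with a short sketch. The labels and values here are made up for the example.

```python
import pandas as pd

# Series from a scalar: the value is repeated for every index label.
# The labels below are hypothetical.
constant_series = pd.Series(5, index=["a", "b", "c"])
print(constant_series)

# A Series has one dtype, so mixed inputs are stored under a common type.
mixed_numbers = pd.Series([1, 2.5])      # integers and floats become float64
mixed_objects = pd.Series([1, "two"])    # numbers and strings become object
print(mixed_numbers.dtype)
print(mixed_objects.dtype)
```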

Below, we will go through 2 of the ways to create a Pandas Series.

In [3]:
# Series out of a dictionary

sample_dictionary = {'Mercury': 35, 'Venus': 67, 'Earth': 93}

sample_series = pd.Series(sample_dictionary)

# Print the series
sample_series

Mercury    35
Venus      67
Earth      93
dtype: int64

Note: if we do not specify an index, Pandas assigns a default integer index starting at 0.

In [4]:
# Series from a list without index given

sample_list = [35, 67, 93]

sample_series = pd.Series(sample_list)

sample_series

0    35
1    67
2    93
dtype: int64

In [5]:
# Series from a list with index specified

sample_series = pd.Series(sample_list, index = ['Mercury', 'Venus', 'Earth'])

sample_series

Mercury    35
Venus      67
Earth      93
dtype: int64

You can retrieve a value from a Series by specifying its index label in square brackets, as you would for a Python dictionary. You can also use the `Series.get()` method, which takes the label and an optional default value to return if the label is not in the Series.

In [6]:
sample_series['Mercury']

35

In [7]:
sample_series.get('Venus')

67

In [8]:
sample_series.get('Mars', 'Value not found!')

'Value not found!'
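
The two lookup styles differ when a label is missing. The sketch below rebuilds the planet Series under the illustrative name `planet_series` so that it runs on its own.

```python
import pandas as pd

planet_series = pd.Series([35, 67, 93], index=["Mercury", "Venus", "Earth"])

# Square brackets raise a KeyError when the label does not exist.
try:
    planet_series["Mars"]
except KeyError:
    print("'Mars' is not in the index")

# .get() returns None when no default is supplied.
missing = planet_series.get("Mars")
print(missing)

# The `in` operator checks the index labels, not the values.
print("Venus" in planet_series)
print(67 in planet_series)
```

Using `.get()` is convenient when a missing label is an expected case rather than an error.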

<hr style="border:2px solid gray">

Unlike a Series, a **DataFrame** is two-dimensional, consisting of both rows and columns, similar to a spreadsheet or a SQL table. Pandas provides the functionality and methods to work with data in a DataFrame. Each row in a DataFrame is labeled with an index, as in a Series, and each column has a label as well.

<img src="dataframe.png" width="500"/>

There are several ways to create a DataFrame. It can be built by combining one or more Series, or from lists and dictionaries.
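
The Series-based route can be sketched as follows. The variable names are illustrative, and the numbers reuse the browser figures from the cells in this section.

```python
import pandas as pd

years = [2018, 2019, 2020]
chrome = pd.Series([67, 74, 89], index=years)
safari = pd.Series([44, 58, 70], index=years)

# Each Series becomes a column; rows are aligned on the shared index.
browser_views = pd.DataFrame({"Chrome": chrome, "Safari": safari})
print(browser_views)

# A single Series can also be promoted to a one-column DataFrame.
chrome_only = chrome.to_frame(name="Chrome")
print(chrome_only.shape)
```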

In [9]:
# DataFrame from dictionary of lists
# Creating Data for site views (in thousands) on a website per browser every year

df = pd.DataFrame({
    'Chrome': [67, 74, 89],
    'Safari': [44, 58, 70],
    'Firefox': [8, 14, 16]
}, index = [2018, 2019, 2020])

df

| | Chrome | Safari | Firefox |
|---|---|---|---|
| 2018 | 67 | 44 | 8 |
| 2019 | 74 | 58 | 14 |
| 2020 | 89 | 70 | 16 |


In [11]:
# DataFrame from list of dictionaries
df = pd.DataFrame(
    [{"Chrome": 67, "Safari": 44, "Firefox":8 }, 
     {"Chrome": 74, "Safari": 58, "Firefox": 14},
    {"Chrome": 89, "Safari": 70, "Firefox": 16}]
)

df

| | Chrome | Safari | Firefox |
|---|---|---|---|
| 0 | 67 | 44 | 8 |
| 1 | 74 | 58 | 14 |
| 2 | 89 | 70 | 16 |


Note: If we do not provide the index, pandas defaults the index list to a range of integers beginning from 0.

In [12]:
# DataFrame from a 2D list: each inner list is one row (one year)

df = pd.DataFrame(
    [[67, 44, 8],
     [74, 58, 14],
     [89, 70, 16]],
    index = [2018, 2019, 2020], columns = ["Chrome", "Safari", "Firefox"]
)

df

| | Chrome | Safari | Firefox |
|---|---|---|---|
| 2018 | 67 | 44 | 8 |
| 2019 | 74 | 58 | 14 |
| 2020 | 89 | 70 | 16 |


As with a Series, we can use dictionary-like indexing to select a column from a DataFrame.

In [13]:
df['Chrome']

2018    67
2019    74
2020    89
Name: Chrome, dtype: int64
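
What comes back from square-bracket selection depends on what is passed in. The sketch below uses a standalone copy of the browser data under the illustrative name `browser_views`.

```python
import pandas as pd

browser_views = pd.DataFrame(
    {"Chrome": [67, 74, 89], "Safari": [44, 58, 70], "Firefox": [8, 14, 16]},
    index=[2018, 2019, 2020],
)

# A single label returns a Series.
one_column = browser_views["Chrome"]
print(type(one_column).__name__)

# A list of labels returns a DataFrame with just those columns.
two_columns = browser_views[["Chrome", "Firefox"]]
print(type(two_columns).__name__)
```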

### Files in Pandas

The Pandas I/O API is a set of top-level reader functions, accessed like `pandas.read_csv()`, that read a data file and generally return a Pandas object. The most commonly used one, `read_csv()`, reads a delimited text file and, with its default arguments, converts it to a DataFrame. Pandas also offers `read_json()` and `read_excel()` to read JSON and Excel files, respectively, two of the most common file types for data.

We will use the `read_csv()` function to read the Chicago Police Traffic Stops data.

In [14]:
stops_df = pd.read_csv('idot/IDOT_2021.csv')

The [documentation](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html) describes in detail the arguments that the `read_csv()` function accepts. Some of the most common ones are:

**filepath_or_buffer**: Either a path to a file or a URL.
<br>**sep**: Delimiter to use. The default is `','` for comma-separated files; use `'\t'` for tab-separated (TSV) files.
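
To see the effect of `sep` without needing a file on disk, the sketch below reads two small in-memory strings through `io.StringIO`. The text and the column names `city` and `stops` are made up for illustration.

```python
import io
import pandas as pd

# Hypothetical in-memory data standing in for files on disk.
comma_text = "city,stops\nCHICAGO,3\nCHICAGO HEIGHTS,1\n"
tab_text = "city\tstops\nCHICAGO\t3\nCHICAGO HEIGHTS\t1\n"

# The default separator is a comma, so sep can be omitted here.
comma_df = pd.read_csv(io.StringIO(comma_text))

# Tab-separated data needs sep="\t".
tab_df = pd.read_csv(io.StringIO(tab_text), sep="\t")

# Both calls should produce the same two-column DataFrame.
print(comma_df.equals(tab_df))
```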

Running a Jupyter cell that contains just the DataFrame name displays the dataset, with rows and columns truncated.

In [15]:
stops_df

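| | DATESTOP | TIMESTOP | DURATION | OFFNAME | OFFBDGE | CITY_I | STATE | VEHMAKE | VEHYEAR | YRBIRTH | ... | DOGALERT | DOGALERTSRCH | DOGALERTSRCHCONTRA | DOGDRUG | DOGPARA | DOGALC | DOGWEAP | DOGSTOLPROP | DOGOTHER | DOGDRAMT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1/1/21 | 0:33 | 5 | FIDEL LEGORRETA | 5902.0 | CHICAGO | IL | CHEVROLET | 2017.0 | 1993 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 1/1/21 | 1:50 | 4 | VICTOR PEREZ | 7383.0 | CHICAGO | IL | FORD | 2012.0 | 1957 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 1/1/21 | 8:50 | 4 | STEPHANIE ORTIGARA | 18302.0 | CHICAGO | IL | FORD | 2007.0 | 1967 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 1/1/21 | 12:41 | 6 | JASON ARROYO | 14502.0 | CHICAGO | IL | BMW | 1998.0 | 1990 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 1/1/21 | 13:51 | 5 | MONTY OWENS | 11975.0 | CHICAGO | IL | TOYOTA | 2002.0 | 1945 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 377894 | 12/30/21 | 23:40 | 2 | MATTHEW DRINNAN | 13585.0 | CHICAGO | IL | HYUNDAI | 2020.0 | 1964 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 377895 | 12/31/21 | 20:35 | 5 | EMMANUEL GARCIA | 19038.0 | CHICAGO | IL | NISSAN | 2007.0 | 1973 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 377896 | 12/31/21 | 20:31 | 21 | LUIS NUNEZ | 18229.0 | CHICAGO HEIGHTS | IL | HONDA | 2007.0 | 1980 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 377897 | 12/31/21 | 21:33 | 2 | GUSTAVO DOMINGUEZ | 15235.0 | CHICAGO | IL | KIA | 2014.0 | 1982 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |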


This dataset is huge! Pandas also offers methods to view a small sample of a Series or DataFrame. The `head()` and `tail()` methods show the first and last rows, respectively. By default they display 5 rows; pass a different number as the argument to change how many rows are shown.

In [16]:
# View first 5 rows
stops_df.head()

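| | DATESTOP | TIMESTOP | DURATION | OFFNAME | OFFBDGE | CITY_I | STATE | VEHMAKE | VEHYEAR | YRBIRTH | ... | DOGALERT | DOGALERTSRCH | DOGALERTSRCHCONTRA | DOGDRUG | DOGPARA | DOGALC | DOGWEAP | DOGSTOLPROP | DOGOTHER | DOGDRAMT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1/1/21 | 0:33 | 5 | FIDEL LEGORRETA | 5902.0 | CHICAGO | IL | CHEVROLET | 2017.0 | 1993 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 1/1/21 | 1:50 | 4 | VICTOR PEREZ | 7383.0 | CHICAGO | IL | FORD | 2012.0 | 1957 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 1/1/21 | 8:50 | 4 | STEPHANIE ORTIGARA | 18302.0 | CHICAGO | IL | FORD | 2007.0 | 1967 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 1/1/21 | 12:41 | 6 | JASON ARROYO | 14502.0 | CHICAGO | IL | BMW | 1998.0 | 1990 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 1/1/21 | 13:51 | 5 | MONTY OWENS | 11975.0 | CHICAGO | IL | TOYOTA | 2002.0 | 1945 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |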


In [17]:
# View last 3 rows
stops_df.tail(3)

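| | DATESTOP | TIMESTOP | DURATION | OFFNAME | OFFBDGE | CITY_I | STATE | VEHMAKE | VEHYEAR | YRBIRTH | ... | DOGALERT | DOGALERTSRCH | DOGALERTSRCHCONTRA | DOGDRUG | DOGPARA | DOGALC | DOGWEAP | DOGSTOLPROP | DOGOTHER | DOGDRAMT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 377896 | 12/31/21 | 20:31 | 21 | LUIS NUNEZ | 18229.0 | CHICAGO HEIGHTS | IL | HONDA | 2007.0 | 1980 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 377897 | 12/31/21 | 21:33 | 2 | GUSTAVO DOMINGUEZ | 15235.0 | CHICAGO | IL | KIA | 2014.0 | 1982 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 377898 | 12/31/21 | 21:20 | 4 | FRANK GIANAKAKIS | 6934.0 | CHICAGO | IL | TOYOTA | 2015.0 | 1987 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |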
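
The row-count argument works the same way on any DataFrame. Here is a minimal sketch on a made-up ten-row DataFrame named `numbers`, which is not part of the traffic stops data.

```python
import pandas as pd

# A hypothetical ten-row DataFrame.
numbers = pd.DataFrame({"n": range(10), "square": [i * i for i in range(10)]})

first_two = numbers.head(2)    # rows with index 0 and 1
last_three = numbers.tail(3)   # rows with index 7, 8 and 9
print(first_two)
print(last_three)

# Asking for more rows than exist simply returns the whole DataFrame.
print(len(numbers.head(50)))
```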


### Basic metadata

Pandas lets you access basic metadata about its objects through attributes. The important ones we will cover here are **shape**, **dtypes** and the axis labels.

**shape** gives the axis dimensions of the object. For a DataFrame, it is a tuple holding the number of rows and the number of columns.

In [18]:
stops_df.shape

(377899, 55)
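
Because `shape` is an ordinary tuple, it can be unpacked directly. The sketch below uses a small standalone DataFrame with the illustrative name `browser_views`.

```python
import pandas as pd

browser_views = pd.DataFrame(
    {"Chrome": [67, 74, 89], "Safari": [44, 58, 70], "Firefox": [8, 14, 16]},
    index=[2018, 2019, 2020],
)

# shape is a tuple, so it can be unpacked into rows and columns.
n_rows, n_cols = browser_views.shape
print(n_rows, n_cols)

# len() gives the number of rows only.
print(len(browser_views))

# A Series is one-dimensional, so its shape has a single entry.
print(browser_views["Chrome"].shape)
```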

**dtypes** will give the data type for each column in a DataFrame. We are going to learn more about data types in the upcoming sections.

In [22]:
stops_df.dtypes

DATESTOP               object
TIMESTOP               object
DURATION                int64
OFFNAME                object
OFFBDGE               float64
CITY_I                 object
STATE                  object
VEHMAKE                object
VEHYEAR               float64
YRBIRTH                 int64
DRSEX                   int64
DRRACE                float64
REASSTOP              float64
TYPEMOV               float64
RESSTOP                 int64
BEAT_I                  int64
VEHCONSREQ              int64
VEHCONSGIV              int64
VEHSRCHCOND             int64
VEHSRCHCONDBY           int64
VEHCONTRA               int64
VEHDRUGS                int64
VEHPARA                 int64
VEHALC                  int64
VEHWEAP                 int64
VEHSTOLPROP             int64
VEHOTHER                int64
VEHDRAMT                int64
DRCONSREQ               int64
DRCONSGIV               int64
DRVSRCHCOND             int64
DRVSRCHCONDBY           int64
PASSCONSREQ             int64
PASSCONSGI

Pandas objects have axis labels. We have seen that each element in a Series is labeled through an index. A DataFrame, in turn, has one set of labels for its rows and another for its columns. We can access these labels through the `.index` and `.columns` attributes.

In [26]:
stops_df.columns

Index(['DATESTOP', 'TIMESTOP', 'DURATION', 'OFFNAME', 'OFFBDGE', 'CITY_I',
       'STATE', 'VEHMAKE', 'VEHYEAR', 'YRBIRTH', 'DRSEX', 'DRRACE', 'REASSTOP',
       'TYPEMOV', 'RESSTOP', 'BEAT_I', 'VEHCONSREQ', 'VEHCONSGIV',
       'VEHSRCHCOND', 'VEHSRCHCONDBY', 'VEHCONTRA', 'VEHDRUGS', 'VEHPARA',
       'VEHALC', 'VEHWEAP', 'VEHSTOLPROP', 'VEHOTHER', 'VEHDRAMT', 'DRCONSREQ',
       'DRCONSGIV', 'DRVSRCHCOND', 'DRVSRCHCONDBY', 'PASSCONSREQ',
       'PASSCONSGIV', 'PASSSRCHCOND', 'PASSSRCHCONDBY', 'PASSDRVCONTRA',
       'PASSDRVDRUGS', 'PASSDRVPARA', 'PASSDRVALC', 'PASSDRVWEAP',
       'PASSDRVSTOLPROP', 'PASSDRVOTHER', 'PASSDRVDRAMT', 'DOGPERFSNIFF',
       'DOGALERT', 'DOGALERTSRCH', 'DOGALERTSRCHCONTRA', 'DOGDRUG', 'DOGPARA',
       'DOGALC', 'DOGWEAP', 'DOGSTOLPROP', 'DOGOTHER', 'DOGDRAMT'],
      dtype='object')

In [27]:
stops_df.index

RangeIndex(start=0, stop=377899, step=1)

As you can see, since the dataset did not provide an index, Pandas set the default index to a range of integers from 0 up to, but not including, the number of rows in the DataFrame.
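
The difference between a default and an explicit index can be seen on a small example. The DataFrames `default_df` and `labeled_df` below are illustrative and reuse the browser figures from earlier.

```python
import pandas as pd

# No index given: pandas creates a RangeIndex starting at 0.
default_df = pd.DataFrame({"Chrome": [67, 74, 89], "Safari": [44, 58, 70]})
print(default_df.index)
print(list(default_df.columns))

# With an explicit index, the row labels are the values supplied.
labeled_df = pd.DataFrame(
    {"Chrome": [67, 74, 89], "Safari": [44, 58, 70]},
    index=[2018, 2019, 2020],
)
print(list(labeled_df.index))
```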