# Introduction to Pandas 

Pandas is one of the most powerful and approachable Python packages for data analysis and manipulation. It provides data structures and functions that cover a wide range of analytical tasks. To use it in a script, we first load it with the **import** statement, as shown below. Because we will call Pandas many times in our code, we also give it a short alias (we use pd).

<img src="https://pandas.pydata.org/docs/_static/pandas.svg" width="200"/>

A note about Pandas: it uses object-oriented notation. No need to worry if you are not familiar with this concept, but a high-level intuition about object-oriented programming (OOP) helps in understanding how Pandas works. Every data structure that stores our data is a Pandas object defined by a blueprint called a class. For example, a DataFrame (discussed below) is a data structure that stores data in a format similar to a spreadsheet. We can create a DataFrame object to store a table from our dataset, and this object comes with attributes that describe it and methods that operate on it.

Pandas usually comes installed with Jupyter environments like the one you are using here. If Pandas is installed, you can check its version by evaluating the `pd.__version__` attribute after the import statement below:

*Note: pd is an alias for pandas*

In [1]:
import pandas as pd
pd.__version__

'1.4.2'

### Data Structures in Pandas

We will learn the 2 fundamental data structures in Pandas: **Series** and **DataFrames**. 

<img src="series.png" width="300"/>

A **Series** is a one-dimensional labelled array. Every element is labelled by an index. Index labels do not have to be unique, although unique labels keep selection unambiguous. A Series can be created from a list, a Python dictionary or even a scalar value. A Series has a single dtype; when the values are of mixed types, Pandas falls back to a common type such as `object`.

Below, we will go through 2 of the ways to create a Pandas Series. The method below creates a Series out of a Python dictionary.

In [2]:
sample_dictionary = {'Mercury': 35, 'Venus': 67, 'Earth': 93}

sample_series = pd.Series(sample_dictionary)

In [3]:
# Print the series
sample_series

Mercury    35
Venus      67
Earth      93
dtype: int64

Note: if we do not specify the index, the default numeric range will be given.

We will now create a Series from a list, without giving the index. In the cell after that, we will provide the index while we create a Series out of a list.

In [4]:
sample_list = [35, 67, 93]

sample_series = pd.Series(sample_list)

sample_series

0    35
1    67
2    93
dtype: int64

In [5]:
sample_series = pd.Series(sample_list, index = ['Mercury', 'Venus', 'Earth'])

sample_series

Mercury    35
Venus      67
Earth      93
dtype: int64
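
The third way mentioned above, creating a Series from a scalar, can be sketched as follows. The variable names are only illustrative. The second part shows what "a single dtype" means in practice when integers and floats are mixed.

```python
import pandas as pd

# A scalar is repeated for every label in the index.
constant_series = pd.Series(5, index=['Mercury', 'Venus', 'Earth'])
print(constant_series)

# A Series has one dtype: mixing integers and floats upcasts everything to float64.
mixed_series = pd.Series([35, 67.5])
print(mixed_series.dtype)  # float64
```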

<hr style="border:2px solid gray">

Unlike a Series, a **DataFrame** is two-dimensional, consisting of rows and columns. It is similar to a spreadsheet or a SQL table. Pandas provides the functionality and methods to work with data in a DataFrame. Each row in a DataFrame is labeled with an index, as in a Series, and each column has a label as well. In one of the upcoming sections, we will also have a look at multi-level indexing for DataFrames.

<img src="dataframe.png" width="500"/>

There are several ways to create or form a DataFrame. A DataFrame can be made of one or multiple series combined. It can also be formed out of lists and dictionaries. In the first cell, we will create a DataFrame using a Dictionary of lists. 

In the examples below, we are creating data for site views (in thousands) on a website per browser every year.

In [117]:
df = pd.DataFrame({
    'Chrome': [67, 74, 89],
    'Safari': [44, 58, 70],
    'Firefox': [8, 14, 16]
}, index = [2018, 2019, 2020])

df

Unnamed: 0,Chrome,Safari,Firefox
2018,67,44,8
2019,74,58,14
2020,89,70,16


Below we will create a DataFrame using a list of dictionaries.

In [7]:
df = pd.DataFrame(
    [{"Chrome": 67, "Safari": 44, "Firefox":8 }, 
     {"Chrome": 74, "Safari": 58, "Firefox": 14},
    {"Chrome": 89, "Safari": 70, "Firefox": 16}]
)

df

Unnamed: 0,Chrome,Safari,Firefox
0,67,44,8
1,74,58,14
2,89,70,16


Note: If we do not provide the index, pandas defaults the index list to a range of integers beginning from 0.

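In the following cell we will create a DataFrame using a 2-dimensional list. Each inner list becomes one row of the DataFrame. Note that the inner lists here hold the values of one browser each, so the resulting table is the transpose of the two previous examples.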

In [8]:
df = pd.DataFrame(
    [[67, 74, 89],
     [44, 58, 70],
     [8, 14, 16]],
    index = [2018, 2019, 2020], columns = ["Chrome", "Safari", "Firefox"]
)

df

Unnamed: 0,Chrome,Safari,Firefox
2018,67,74,89
2019,44,58,70
2020,8,14,16
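
As mentioned above, a DataFrame can also be built from several Series combined. A minimal sketch, reusing the site-view numbers from the first example (the variable names are illustrative):

```python
import pandas as pd

years = [2018, 2019, 2020]

# One Series per browser, all sharing the same index of years.
chrome = pd.Series([67, 74, 89], index=years)
safari = pd.Series([44, 58, 70], index=years)
firefox = pd.Series([8, 14, 16], index=years)

# Each dictionary key becomes a column label; rows are aligned on the shared index.
views_df = pd.DataFrame({'Chrome': chrome, 'Safari': safari, 'Firefox': firefox})
print(views_df)
```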


### Files in Pandas

The pandas I/O API is a set of top-level reader functions, accessed like `pandas.read_csv()`, that read a data file and generally return a pandas object. The most useful one, `read_csv()`, reads a delimited text file and, with its default arguments, converts it to a DataFrame. Pandas also offers `read_json()` and `read_excel()` to read JSON and Excel files, respectively, two of the most common file types for data.

We will use the `read_csv()` function to read the Chicago Police Traffic Stops data.

In [9]:
stops_df = pd.read_csv('TrafficStopsChicago.csv')

The [documentation](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html) describes in detail the arguments that the `read_csv()` function accepts. Two of the most common ones are:

**filepath_or_buffer**: Either a path to a file or a URL.
<br>**sep**: Delimiter to use. The default delimiter is `','` (for comma-separated files). Use `'\t'` for tab-separated files (tsv).
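
To illustrate the `sep` argument without needing a file on disk, the sketch below reads a small tab-separated text from memory. The two data rows are made up for illustration, and `io.StringIO` stands in for a file path.

```python
import io
import pandas as pd

# Hypothetical tab-separated content; a real call would pass a file path instead.
tsv_text = "City\tDuration\nCHICAGO\t5\nMAYWOOD\t4\n"

# sep='\t' tells read_csv to split columns on tabs rather than commas.
tsv_df = pd.read_csv(io.StringIO(tsv_text), sep='\t')
print(tsv_df)
```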

Typing and running the DataFrame name in a Jupyter cell displays the dataset with the rows and columns truncated.

In [10]:
stops_df

Unnamed: 0,DateStop,TimeStop,Duration,City,DriverRace,ReasonStop,TypeOfMovingViolation,Beat,SearchConducted,ContraFound,DrugsFound,WeaponFound
0,2021-01-01,0:33,5,CHICAGO,4.0,2.0,,1234,2,0,0,0
1,2021-01-01,1:50,4,CHICAGO,2.0,2.0,,1122,2,0,0,0
2,2021-01-01,8:50,4,CHICAGO,2.0,3.0,,332,2,0,0,0
3,2021-01-01,12:41,6,CHICAGO,2.0,2.0,,1121,2,0,0,0
4,2021-01-01,13:51,5,CHICAGO,2.0,2.0,,423,2,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...
76275,2021-02-28,20:34,9,CHICAGO,4.0,1.0,4.0,1533,2,0,0,0
76276,2021-02-28,20:06,4,CHICAGO,2.0,2.0,,214,2,0,0,0
76277,2021-02-28,21:47,4,SAINT PAUL,2.0,3.0,,1821,2,0,0,0
76278,2021-02-28,21:15,5,CHICAGO,2.0,2.0,,331,2,0,0,0


This dataset is huge! Pandas also offers methods to view a small sample of a Series or DataFrame. The `head()` and `tail()` methods show the first and last rows, respectively. By default they display 5 rows; pass a different number as the argument to change how many rows are displayed.

In [11]:
# View first 5 rows
stops_df.head()

Unnamed: 0,DateStop,TimeStop,Duration,City,DriverRace,ReasonStop,TypeOfMovingViolation,Beat,SearchConducted,ContraFound,DrugsFound,WeaponFound
0,2021-01-01,0:33,5,CHICAGO,4.0,2.0,,1234,2,0,0,0
1,2021-01-01,1:50,4,CHICAGO,2.0,2.0,,1122,2,0,0,0
2,2021-01-01,8:50,4,CHICAGO,2.0,3.0,,332,2,0,0,0
3,2021-01-01,12:41,6,CHICAGO,2.0,2.0,,1121,2,0,0,0
4,2021-01-01,13:51,5,CHICAGO,2.0,2.0,,423,2,0,0,0


In [12]:
# View last 3 rows
stops_df.tail(3)

Unnamed: 0,DateStop,TimeStop,Duration,City,DriverRace,ReasonStop,TypeOfMovingViolation,Beat,SearchConducted,ContraFound,DrugsFound,WeaponFound
76277,2021-02-28,21:47,4,SAINT PAUL,2.0,3.0,,1821,2,0,0,0
76278,2021-02-28,21:15,5,CHICAGO,2.0,2.0,,331,2,0,0,0
76279,2021-02-28,21:15,4,CHICAGO,4.0,2.0,,1033,2,0,0,0


### Basic metadata

Pandas enables you to access basic metadata for its objects through attributes. The important ones we are going to learn here are **shape**, **dtypes** and the axis labels.

**Shape** gives the axis dimensions of the object. For a DataFrame, it will give the number of rows and columns.

In [13]:
stops_df.shape

(76280, 12)

**dtypes** will give the data type for each column in a DataFrame. We are going to learn more about data types in the upcoming sections.

In [14]:
stops_df.dtypes

DateStop                  object
TimeStop                  object
Duration                   int64
City                      object
DriverRace               float64
ReasonStop               float64
TypeOfMovingViolation    float64
Beat                       int64
SearchConducted            int64
ContraFound                int64
DrugsFound                 int64
WeaponFound                int64
dtype: object

Pandas objects have axis labels. We have seen that each element in a Series is labeled through an index. A DataFrame has one set of labels for its rows and another for its columns. We can access these labels through the `.index` and `.columns` attributes.

In [15]:
stops_df.columns

Index(['DateStop', 'TimeStop', 'Duration', 'City', 'DriverRace', 'ReasonStop',
       'TypeOfMovingViolation', 'Beat', 'SearchConducted', 'ContraFound',
       'DrugsFound', 'WeaponFound'],
      dtype='object')

In [16]:
stops_df.index

RangeIndex(start=0, stop=76280, step=1)

As you can see, since the dataset did not provide an index, Pandas set the default index to a range of integers from 0 up to, but not including, the length of the DataFrame.
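
The same attributes can be tried on a tiny table. This is a minimal sketch with made-up values, so the numbers do not come from the traffic stops file.

```python
import pandas as pd

# Hypothetical two-row table used only to show the metadata attributes.
small_df = pd.DataFrame({'City': ['CHICAGO', 'MAYWOOD'], 'Duration': [5, 4]})

n_rows, n_cols = small_df.shape       # (number of rows, number of columns)
column_labels = small_df.columns.tolist()
row_labels = small_df.index.tolist()  # default index: 0, 1, ...

print(n_rows, n_cols)
print(column_labels, row_labels)
print(small_df.dtypes)
```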

### Selecting and Subsetting

Pandas offers a few ways that we can use to select a set of data which can include a cell, row, column or a subset of the entire dataframe. The 3 ways that we will go through are selection using `[]`, `.loc` (label based indexing) and `.iloc` (position based indexing).

Using `[]` is the most basic way of indexing and mimics the way we index dictionaries in Python. In Pandas, it selects a lower-dimensional slice of the object. For the Series below, we use an index label to select a single element.

In [17]:
sample_series['Mercury']

35

Similarly, we can use this notation to index a column in a DataFrame. This will result in a Series.

In [18]:
df['Chrome']

2018    67
2019    44
2020     8
Name: Chrome, dtype: int64

While you may still find this notation used for indexing DataFrames and Series, it can be ambiguous, because the same brackets select columns in some cases and rows in others, which often leads to errors and confusion. We bring it up here only to provide familiarity with it. We will therefore focus on `.loc`, which is a more powerful and explicit indexer. The format it follows is `.loc[row_indexer, col_indexer]`.

The integer below is the label for the index that will extract the row with index label 2.

In [19]:
stops_df.loc[2]

DateStop                 2021-01-01
TimeStop                       8:50
Duration                          4
City                        CHICAGO
DriverRace                      2.0
ReasonStop                      3.0
TypeOfMovingViolation           NaN
Beat                            332
SearchConducted                   2
ContraFound                       0
DrugsFound                        0
WeaponFound                       0
Name: 2, dtype: object

We can also select a set of rows using slices. With `.loc`, `:2` is shorthand for `0:2` and refers to the index labels from 0 up to and including 2. Unlike ordinary Python slicing, a `.loc` slice includes its end label.

Note that the index labels are integers in this DataFrame. Slicing also works with non-integer indexes. For example, we can slice with string labels, in which case pandas returns all the rows between the two labels given in the slice, including the rows with those labels.

In [20]:
stops_df.loc[:2]

Unnamed: 0,DateStop,TimeStop,Duration,City,DriverRace,ReasonStop,TypeOfMovingViolation,Beat,SearchConducted,ContraFound,DrugsFound,WeaponFound
0,2021-01-01,0:33,5,CHICAGO,4.0,2.0,,1234,2,0,0,0
1,2021-01-01,1:50,4,CHICAGO,2.0,2.0,,1122,2,0,0,0
2,2021-01-01,8:50,4,CHICAGO,2.0,3.0,,332,2,0,0,0
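
Slicing with string labels can be sketched on a small Series of planets. The extra 'Mars' entry and its value are made up for illustration. The comparison with a plain Python list shows the difference in how the end of the slice is treated.

```python
import pandas as pd

planets = pd.Series([35, 67, 93, 142],
                    index=['Mercury', 'Venus', 'Earth', 'Mars'])

# Label-based slice: both 'Venus' and 'Earth' are included.
label_slice = planets.loc['Venus':'Earth']
print(label_slice.tolist())  # [67, 93]

# Ordinary Python slice on a list: the stop position (2) is excluded.
list_slice = [35, 67, 93, 142][1:2]
print(list_slice)            # [67]
```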


Using the `.loc` attribute, we can also select the second dimension (column for a DataFrame). In the cell below, we are selecting all rows and the 'City' column.

In [21]:
stops_df.loc[: ,'City']

0           CHICAGO
1           CHICAGO
2           CHICAGO
3           CHICAGO
4           CHICAGO
            ...    
76275       CHICAGO
76276       CHICAGO
76277    SAINT PAUL
76278       CHICAGO
76279       CHICAGO
Name: City, Length: 76280, dtype: object

By specifying a single value in both the dimensions, we can select the specific cell as well.

In [22]:
stops_df.loc[5 ,'City']

'CHICAGO'

Or select a range of rows over multiple columns.

In [23]:
stops_df.loc[55:60, ['City','DriverRace']]

Unnamed: 0,City,DriverRace
55,MAYWOOD,2.0
56,CHICAGO,2.0
57,CHICAGO,2.0
58,FORES PARK,2.0
59,BELVIDERE,2.0
60,CHICAGO,1.0


The selection can also be a list of indices rather than a range.

In [24]:
stops_df.loc[[55,78,903], ['City','DriverRace']]

Unnamed: 0,City,DriverRace
55,MAYWOOD,2.0
78,CHICAGO,4.0
903,CHICAGO,1.0


Another approach that comes in handy for data manipulation is selection using boolean conditions. In the cell below, the expression to the left of the comparison operator selects all rows of the 'Duration' column. Comparing it with 5 then yields `True` wherever the value is strictly less than 5. The result is a boolean Series. 

In [25]:
stops_df.loc[:, 'Duration'] < 5

0        False
1         True
2         True
3        False
4        False
         ...  
76275    False
76276     True
76277     True
76278    False
76279     True
Name: Duration, Length: 76280, dtype: bool

The result of this comparison can then be used as a filter that keeps only the rows where it is `True`. In the cell below, we pass the same comparison as the row indexer to select the rows of the DataFrame that have a stop duration of less than 5. Notice that we also pass 'City' as the column indexer, so only the city names of those rows are returned.

In [26]:
stops_df.loc[stops_df.loc[:,'Duration'] < 5, 'City']

1                 CHICAGO
2                 CHICAGO
5                 CHICAGO
8        GLENDALE HEIGHTS
9                 CHICAGO
               ...       
76270             CHICAGO
76272             CHICAGO
76276             CHICAGO
76277          SAINT PAUL
76279             CHICAGO
Name: City, Length: 40558, dtype: object
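
Boolean masks can also be combined. The sketch below uses a small made-up table rather than the real stops data; `&` means "and", and each condition needs its own parentheses.

```python
import pandas as pd

# Hypothetical sample in the same shape as two columns of the stops data.
stops = pd.DataFrame({
    'City': ['CHICAGO', 'MAYWOOD', 'CHICAGO', 'BELVIDERE'],
    'Duration': [5, 4, 2, 9],
})

# Boolean Series: True where the stop lasted less than 5.
short_stop = stops.loc[:, 'Duration'] < 5

# Use the mask as the row indexer and 'City' as the column indexer.
short_cities = stops.loc[short_stop, 'City']

# Combine two conditions: short stops that happened in Chicago.
short_chicago = stops.loc[short_stop & (stops['City'] == 'CHICAGO')]
print(short_cities.tolist())
print(short_chicago)
```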

The other form of selection introduced above is selection by position using `.iloc`. This attribute indexes with integers that mark the position of a row or column, starting from 0. 

In [27]:
stops_df.iloc[8]

DateStop                       2021-01-01
TimeStop                            17:26
Duration                                2
City                     GLENDALE HEIGHTS
DriverRace                            5.0
ReasonStop                            1.0
TypeOfMovingViolation                 6.0
Beat                                 1824
SearchConducted                         2
ContraFound                             0
DrugsFound                              0
WeaponFound                             0
Name: 8, dtype: object

In this dataset, the index labels are sorted integers that coincide with the row positions, so here `.loc` and `.iloc` select the same single row. Slices still differ: `.loc` includes the end label, while `.iloc` excludes the end position. The columns, however, have string labels, so with `.iloc` we have to provide the integer position of a column in the DataFrame. Passing a column name or any other non-integer value to `.iloc` raises an error.

In the cell below, we are selecting the rows at positions 2 to 5 (the slice `2:6` includes 2 and excludes 6) and the column at position 4. Since positions start at 0, this is the fifth column, 'DriverRace'.

In [28]:
stops_df.iloc[2:6, 4]

2    2.0
3    2.0
4    2.0
5    2.0
Name: DriverRace, dtype: float64

We can also select a range for rows as well as for columns.

In [29]:
stops_df.iloc[2:6, 4:8]

Unnamed: 0,DriverRace,ReasonStop,TypeOfMovingViolation,Beat
2,2.0,3.0,,332
3,2.0,2.0,,1121
4,2.0,2.0,,423
5,2.0,2.0,,423


Note: Out-of-range slice indexes are handled gracefully, just as in Python. Pandas does not raise an error but returns everything in the slice that falls within range. If the entire slice is out of range, the result is an empty DataFrame or Series.
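
The difference between labels and positions is easier to see when the index is not a range starting at 0. A minimal sketch using the browser table with years as labels (the variable names are illustrative):

```python
import pandas as pd

views = pd.DataFrame({'Chrome': [67, 74, 89], 'Safari': [44, 58, 70]},
                     index=[2018, 2019, 2020])

by_label = views.loc[2019, 'Chrome']  # row labelled 2019, column 'Chrome'
by_position = views.iloc[1, 0]        # second row, first column
print(by_label, by_position)          # 74 74

# Slices that run past the end are trimmed instead of raising an error.
partly_out = views.iloc[1:10]         # only positions 1 and 2 exist
fully_out = views.iloc[5:10]          # nothing in range: empty DataFrame
print(len(partly_out), fully_out.empty)
```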


### Data types and conversion

We saw above how the `dtypes` attribute provides the data type of each column in a DataFrame. Pandas has a set of data types that it understands and manipulates, and it can also work with extension types implemented by third-party libraries.

We will touch upon the most basic and common ones that you saw in the result returned from running the command `stops_df.dtypes`. In brief, they include:

- integer: the default integer type is int64, which cannot hold missing values (the separate nullable `Int64` extension type can)
- float: the default float type is float64
- boolean: `bool` stores True/False; the nullable `boolean` extension type can also hold missing values
- date/time: Timestamp for a scalar and datetime64[ns] for a Series
- object: holds any Python object, including strings
- string: dedicated to strings
- category: stores a limited, fixed number of possible values

If we print the data types for the stops DataFrame again, we can see that, by default, Pandas doesn't interpret every column in the form that is most useful to us. For example, DateStop and TimeStop are object types (stored as strings), which prevents us from performing temporal operations on them. (We will cover datetime in detail in the following section.)

Similarly, DriverRace and ReasonStop indicate the category of the driver's race and the reason for the stop, respectively. However, since both are stored as float64, they would be treated as continuous numbers, which can mislead our analysis and any prediction we might want to make.

In [30]:
stops_df.dtypes

DateStop                  object
TimeStop                  object
Duration                   int64
City                      object
DriverRace               float64
ReasonStop               float64
TypeOfMovingViolation    float64
Beat                       int64
SearchConducted            int64
ContraFound                int64
DrugsFound                 int64
WeaponFound                int64
dtype: object

The `astype` method enables us to convert data types to the desired ones. A convenient way of performing this operation is to pass a dictionary with the column name whose type needs to be changed as key and the desired data type as value. We are then assigning the returned result to the original DataFrame to replace it with these new changes.

In [31]:
stops_df = stops_df.astype({'DriverRace': 'category', 
                            'ReasonStop': 'category'})

Let's use `dtypes` to check the new assigned data types

In [32]:
stops_df.dtypes

DateStop                   object
TimeStop                   object
Duration                    int64
City                       object
DriverRace               category
ReasonStop               category
TypeOfMovingViolation     float64
Beat                        int64
SearchConducted             int64
ContraFound                 int64
DrugsFound                  int64
WeaponFound                 int64
dtype: object
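
What the conversion to `category` does can be inspected on a small made-up column. This is a sketch under the assumption that the codes are floats with one missing value, as in the real 'DriverRace' column.

```python
import pandas as pd

# Hypothetical sample: race codes stored as floats, one of them missing.
codes = pd.DataFrame({'DriverRace': [4.0, 2.0, 2.0, None],
                      'Duration': [5, 4, 4, 6]})

# Convert only the listed column; the others keep their dtype.
codes = codes.astype({'DriverRace': 'category'})

# The distinct values become the categories; the missing value stays missing.
categories = codes['DriverRace'].cat.categories.tolist()
print(codes.dtypes)
print(categories)  # [2.0, 4.0]
```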

### Datetime in Pandas

Temporal data (consisting of date and time stamps) is highly useful in data analytics and can result in useful features. Pandas offers the ability to work with time series information in various formats. This can enable us to perform several operations on datetime values such as sorting, predicting or categorizing data by certain time periods.

Pandas builds on the `datetime.datetime` type from Python's [datetime](https://docs.python.org/3/library/datetime.html) library, which combines a date and a time and supports date and time arithmetic. Pandas' own scalar type, `Timestamp`, is its equivalent of `datetime.datetime`, and columns holding such values get the datetime64[ns] data type, or datetime64[ns, tz] when they carry a time zone.

Series and DataFrame have extended data type support and functionality for datetime. In order to create this format from a Series or list consisting of timestamps in String or Object form (as is in the case of `stops_df` dataframe), we will use pandas' `to_datetime` function. A single scalar value will be converted to Timestamp data type, whereas, a pandas object will convert to datetime64[ns].

In [33]:
pd.to_datetime("2010/11/12")

Timestamp('2010-11-12 00:00:00')

Using this function on the stops DataFrame will result in pandas parsing it to datetime.datetime like format. You can notice the data type it gets converted into. 

In [34]:
pd.to_datetime(stops_df['DateStop'])

0       2021-01-01
1       2021-01-01
2       2021-01-01
3       2021-01-01
4       2021-01-01
           ...    
76275   2021-02-28
76276   2021-02-28
76277   2021-02-28
76278   2021-02-28
76279   2021-02-28
Name: DateStop, Length: 76280, dtype: datetime64[ns]

We will now assign it back to the original DataFrame. The `to_datetime` function also accepts a `format` argument to enforce a specific parsing pattern, which can also speed up the conversion considerably. You can find the format codes at this [link](https://docs.python.org/3/library/datetime.html#strftime-and-strptime-format-codes).

In [35]:
stops_df['DateStop'] = pd.to_datetime(stops_df['DateStop'], 
                                      format="%Y-%m-%d")

Every Timestamp has a set of time/date properties or temporal features that can be extracted.

For a Series of datetime type, we can use the `.dt` accessor to extract these properties. This [table](https://pandas.pydata.org/docs/user_guide/timeseries.html#time-date-components) lists all the properties that can be accessed. We will use it to get the month and the day of the week (which can be used to classify whether a stop was on a weekend or a weekday). We will assign these to new columns in the DataFrame.

In [36]:
stops_df['Month'] = stops_df['DateStop'].dt.month
stops_df['Weekday'] = stops_df['DateStop'].dt.weekday

We can see the properties extracted from the timestamps in the 'DateStop' column for a handful of selected rows.

In [37]:
stops_df.loc[[1,44,5400,10434,55400], ['DateStop','Month','Weekday']]

Unnamed: 0,DateStop,Month,Weekday
1,2021-01-01,1,4
44,2021-01-01,1,4
5400,2021-02-06,2,5
10434,2021-01-05,1,1
55400,2021-02-20,2,5
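
The weekday number can be turned into the weekend/weekday flag mentioned above. A minimal sketch on two of the dates shown in the table (the `is_weekend` name is our own, not a pandas attribute):

```python
import pandas as pd

dates = pd.Series(['2021-01-01', '2021-02-06'])
parsed = pd.to_datetime(dates, format='%Y-%m-%d')

months = parsed.dt.month      # 1 and 2
weekdays = parsed.dt.weekday  # Monday is 0, so Friday is 4 and Saturday is 5

# Saturday (5) and Sunday (6) count as the weekend.
is_weekend = parsed.dt.weekday >= 5
print(months.tolist(), weekdays.tolist(), is_weekend.tolist())
```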


### Missing Values

Dealing with missing values is a major part of the data cleaning process. Fortunately, pandas has a set of markers and methods to help us do that. Pandas uses NaN as its default missing value marker, and it can appear in columns of different data types.

Pandas offers a set of methods to detect, fill or replace NaN values. Moreover, many pandas functions can skip NaNs when the corresponding function argument is set. It is therefore handy to convert any value that indicates a missing value to NaN.

To check for NaN in a DataFrame or Series, we can use the `.isna()` or `.notna()` method.

In [38]:
stops_df.isna().head()

Unnamed: 0,DateStop,TimeStop,Duration,City,DriverRace,ReasonStop,TypeOfMovingViolation,Beat,SearchConducted,ContraFound,DrugsFound,WeaponFound,Month,Weekday
0,False,False,False,False,False,False,True,False,False,False,False,False,False,False
1,False,False,False,False,False,False,True,False,False,False,False,False,False,False
2,False,False,False,False,False,False,True,False,False,False,False,False,False,False
3,False,False,False,False,False,False,True,False,False,False,False,False,False,False
4,False,False,False,False,False,False,True,False,False,False,False,False,False,False


Using `.notna()` reverses the results. As you can see above, even a quick glance at the dataframe reveals missing values in the 'TypeOfMovingViolation' column. To count the missing values in every column, we can chain the `.sum()` method, which adds up the cells where the value is True (each True counts as 1).

In [39]:
stops_df.isna().sum()

DateStop                     0
TimeStop                     0
Duration                     0
City                         0
DriverRace                   3
ReasonStop                   0
TypeOfMovingViolation    53326
Beat                         0
SearchConducted              0
ContraFound                  0
DrugsFound                   0
WeaponFound                  0
Month                        0
Weekday                      0
dtype: int64

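One of the ways to deal with these missing values is to fill in the NaNs with a scalar value using `.fillna()`. The method accepts the value that should replace each NaN and returns a new Series, which we assign back to the column. Assigning the result back is more robust than calling `fillna` with `inplace=True` on a selected column, which newer Pandas versions may not propagate to the DataFrame.

In [40]:
stops_df['TypeOfMovingViolation'] = stops_df['TypeOfMovingViolation'].fillna(
    'No moving violation')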

**Note**: For storage purposes, it isn't optimal to fill in categorical with a long string. Therefore, categories are usually mapped to a single character to save space.

Another way is to simply exclude rows or columns from a dataset that might contain missing values. We need to be very careful when deciding to use this approach since we might lose critical data or reduce the data size.

If we do plan to, `dropna()` will do the job for us. The axis argument (either 'index' or 'columns') determines whether rows or columns containing at least one NaN are removed.

We will use it below to remove the rows where the driver's race is missing, since that is the key value for our analysis. We pass the column name in the subset argument to only remove rows where 'DriverRace' is missing.

In [41]:
stops_df.dropna(axis='index', 
                subset=['DriverRace'], 
                inplace=True)
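
The three steps (detect, fill, drop) can be followed end to end on a tiny made-up table. The fill value `0.0` is a hypothetical placeholder code, not one from the data dictionary.

```python
import pandas as pd

nan = float('nan')
sample = pd.DataFrame({'DriverRace': [4.0, nan, 2.0],
                       'TypeOfMovingViolation': [nan, 1.0, nan]})

# Detect: count the missing values in each column.
missing_per_column = sample.isna().sum()

# Fill: a dictionary fills only the named column and returns a new DataFrame.
filled = sample.fillna({'TypeOfMovingViolation': 0.0})

# Drop: remove rows where 'DriverRace' is still missing.
complete_race = filled.dropna(axis='index', subset=['DriverRace'])
print(missing_per_column.tolist(), len(complete_race))
```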

### Mapping/Replacing values and labels

The data cleaning process usually requires replacing vague, misspelled or missing values with correct ones. It also requires structuring the index and column labels properly to make analysis easier.

In order to rename index labels or columns, the rename method is very useful. By mapping the existing value to a new one, we can replace the column name.

In [137]:
stops_df.rename(columns={"Beat": "BeatCode"}, 
                inplace= True)
stops_df.columns

Index(['DateStop', 'TimeStop', 'Duration', 'City', 'DriverRace', 'ReasonStop',
       'TypeOfMovingViolation', 'BeatCode', 'SearchConducted', 'ContraFound',
       'DrugsFound', 'WeaponFound', 'Month', 'Weekday'],
      dtype='object')

To change the value of a single cell, we can assign to a selection made with `.loc` or `.iloc`, as shown below. 

In [188]:
stops_df.loc[230, 'City'] = 'Chicago'

However, value replacement, particularly for data cleaning, is usually applied to a whole set of values at once. For this purpose, pandas provides the `map` method, which maps each value in a Series (or a column in a DataFrame) to a new value. 

In the cell below, we use `map` to convert the numerical codes for race into the race titles given in the data dictionary. Note that we pass a dictionary where each key is an existing value and each value is the string that replaces it. A value that has no key in the dictionary becomes NaN. Passing a dictionary is the most convenient way, but `map` accepts other forms of input too.

In [158]:
stops_df['DriverRace'] = stops_df['DriverRace'].map({
    1.0: 'White',
    2.0: 'Black or African American',
    3.0: 'American Indian or Alaska Native',
    4.0: 'Hispanic or Latino',
    5.0: 'Asian',
    6.0: 'Native Hawaiian or Other Pacific Islander',
})

In [192]:
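stops_df['TypeOfMovingViolation'] = stops_df['TypeOfMovingViolation'].map({
    1.0: 'Speed',
    2.0: 'Lane Violation',
    3.0: 'Seat Belt',
    4.0: 'Traffic Sign or Signal',
    5.0: 'Follow too close',
    6.0: 'Other',
    # Keep the placeholder written by fillna; unmapped values would become NaN
    'No moving violation': 'No moving violation',
})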
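
The behaviour of `map` for values that are missing from the dictionary is worth seeing once. In this sketch the code `9.0` is a made-up value with no entry in the mapping, and 'Unknown' is a label chosen only for illustration.

```python
import pandas as pd

codes = pd.Series([1.0, 2.0, 9.0])
labels = {1.0: 'Speed', 2.0: 'Lane Violation'}

# 9.0 has no key in the dictionary, so it is mapped to NaN.
mapped = codes.map(labels)

# The NaN can then be replaced with an explicit label.
mapped_filled = mapped.fillna('Unknown')
print(mapped_filled.tolist())  # ['Speed', 'Lane Violation', 'Unknown']
```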

### Statistics

In the exploratory process, it is essential to understand the distribution of the data and its summary statistics. Pandas offers a set of methods that provide this information and, additionally, enable us to visualize it.

A few ways to do this include the `mean()` and `median()` methods, which return these summary statistics for a DataFrame or a Series. See the cells below:

In [47]:
stops_df['Duration'].mean()

7.579952016990705

In [48]:
stops_df['Duration'].median()

4.0

We got the mean and median duration for all traffic stops. Depending on the data type, we can gather more such stats to assist with our analysis. The `.describe()` method gives a quick view of the essential summary stats for each column. Notice that summary stats are only given for numeric columns, so we do not see any for columns of datetime, categorical or object type.

You can also notice that columns that are categorical in nature but are stored as integers or floats get statistical values that are not meaningful. For example, Month has a mean and percentiles. This underlines why such columns should be converted to a categorical data type before analysis.

In [50]:
stops_df.describe()

Unnamed: 0,Duration,Beat,SearchConducted,ContraFound,DrugsFound,WeaponFound,Month,Weekday
count,76277.0,76277.0,76277.0,76277.0,76277.0,76277.0,76277.0,76277.0
mean,7.579952,1081.62707,1.985251,0.026076,0.027151,0.029091,1.440133,3.116064
std,26.18217,646.260818,0.120547,0.219205,0.226316,0.238602,0.496406,1.958205
min,0.0,111.0,1.0,0.0,0.0,0.0,1.0,0.0
25%,3.0,612.0,2.0,0.0,0.0,0.0,1.0,1.0
50%,4.0,1022.0,2.0,0.0,0.0,0.0,1.0,3.0
75%,6.0,1522.0,2.0,0.0,0.0,0.0,2.0,5.0
max,719.0,6100.0,2.0,2.0,2.0,2.0,2.0,6.0


Other useful methods include `value_counts()`, which counts the occurrences of each unique value in a Series. We can use it to see the number of stops on each day of the week.<br><br>Note: the Pandas `weekday` attribute maps Monday to 0 and Sunday to 6.

In [55]:
stops_df['Weekday'].value_counts()

4    12198
5    11944
3    10899
6    10780
2    10695
1    10405
0     9356
Name: Weekday, dtype: int64
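
These methods can be checked by hand on a short made-up Series of durations. The sketch also uses the `normalize=True` option of `value_counts()`, which returns shares instead of counts.

```python
import pandas as pd

durations = pd.Series([5, 4, 4, 6, 5, 4])

mean_duration = durations.mean()      # 28 / 6
median_duration = durations.median()  # middle of 4, 4, 4, 5, 5, 6 is 4.5

counts = durations.value_counts()                # most frequent value first
shares = durations.value_counts(normalize=True)  # fractions that sum to 1
print(mean_duration, median_duration)
print(counts)
```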

### Pointers and Object References

It is important to know how Pandas objects are treated, especially when they are passed to functions or when changes are made to them. In Python, a function argument is never copied: the function receives a reference to the same object the caller holds. With immutable values such as integers and strings this looks like pass-by-value, because the function cannot change the object and can only rebind its local name. Pandas objects such as DataFrame and Series are mutable, so any in-place change made inside the function to the object the variable is *pointing* to also changes the object seen by the caller.

Moreover, any assignment to a new variable such as `df2 = df` just makes the new variable point to the same DataFrame and as a result, any changes made to `df2` will also impact `df` since they both point to the same object.

In [118]:
df

Unnamed: 0,Chrome,Safari,Firefox
2018,67,44,8
2019,74,58,14
2020,89,70,16


In [119]:
df2 = df

In [120]:
df2.loc[2018, 'Safari'] = 87

In [121]:
df

Unnamed: 0,Chrome,Safari,Firefox
2018,67,87,8
2019,74,58,14
2020,89,70,16


In [122]:
df2.loc[2021] = [32, 19, 45]
df

Unnamed: 0,Chrome,Safari,Firefox
2018,67,87,8
2019,74,58,14
2020,89,70,16
2021,32,19,45


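Similarly, passing a dataframe as an argument to a function call doesn't create a copy; the function receives a reference to the same DataFrame object, so any in-place change made inside the function would be visible outside it. The function below only computes a new DataFrame with `dataframe**2` and prints it, so `df` itself stays unchanged.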

In [123]:
def square_vals(dataframe):
    print(dataframe**2)

In [124]:
square_vals(df)

      Chrome  Safari  Firefox
2018    4489    7569       64
2019    5476    3364      196
2020    7921    4900      256
2021    1024     361     2025
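
To contrast the two cases, here is a small sketch with one function that returns a new object and one that modifies its argument in place. Both function names and the 'Total' column are made up for illustration.

```python
import pandas as pd

def squared(dataframe):
    # Builds and returns a new DataFrame; the argument is left untouched.
    return dataframe ** 2

def add_total_in_place(dataframe):
    # Adds a column to the very object the caller passed in; nothing is copied.
    dataframe['Total'] = dataframe.sum(axis=1)

views = pd.DataFrame({'Chrome': [67, 74], 'Safari': [44, 58]},
                     index=[2018, 2019])

squares = squared(views)   # views is unchanged here
add_total_in_place(views)  # views now has a 'Total' column
print(views)
```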


The process above of assigning DataFrame to a new variable using `=` sign doesn't copy the values but just makes the new variable another point of reference for the dataframe. 

Pandas also offers a shallow copy through `copy(deep=False)`. This creates a new object that only holds references to the original data and index, not copies of them. In the Pandas version used here (1.4.2), any change to existing values in a shallow copy is also reflected in the original dataframe. The difference from plain assignment with `=` is that we can append extra rows and columns to the shallow copy without affecting the original. The original only changes when we modify existing values, because those values are shared by both dataframes. Newer Pandas releases with Copy-on-Write enabled no longer propagate such changes, so it is best not to rely on this behaviour.

In [125]:
df

Unnamed: 0,Chrome,Safari,Firefox
2018,67,87,8
2019,74,58,14
2020,89,70,16
2021,32,19,45


In [126]:
df_shallow = df.copy(deep=False)

In [127]:
df_shallow.loc[2020, 'Chrome'] = 80
df

Unnamed: 0,Chrome,Safari,Firefox
2018,67,87,8
2019,74,58,14
2020,80,70,16
2021,32,19,45


In [128]:
df_shallow.loc[2022] = [44, 59, 50]
df

Unnamed: 0,Chrome,Safari,Firefox
2018,67,87,8
2019,74,58,14
2020,80,70,16
2021,32,19,45
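
Because the behaviour of a shallow copy depends on the pandas version and on whether Copy-on-Write is enabled, it can be worth checking what your own installation does. The following is a minimal sketch on a small illustrative table; the variable names are assumptions made for this example.

```python
import pandas as pd

df = pd.DataFrame({"Chrome": [67, 74], "Safari": [87, 58]}, index=[2018, 2019])
df_shallow = df.copy(deep=False)

# Overwrite an existing value through the shallow copy
df_shallow.loc[2018, "Chrome"] = 80

# True on versions that share data (such as 1.4.2), False under Copy-on-Write
value_propagated = bool(df.loc[2018, "Chrome"] == 80)

# Add a new row to the shallow copy; the original keeps its shape
df_shallow.loc[2020] = [89, 70]
original_rows = len(df)

print(pd.__version__, value_propagated, original_rows)
```

When the result matters, `copy(deep=True)` avoids relying on version-specific sharing behaviour.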


On the other hand, we can use a deep copy, which creates a new object and copies both the data and the index into it. Therefore, any changes to the deep copy will not be reflected in the original.

In [129]:
df_deep = df.copy(deep=True)

In [130]:
df_deep.loc[2018, 'Firefox'] = 99
df_deep

Unnamed: 0,Chrome,Safari,Firefox
2018,67,87,99
2019,74,58,14
2020,80,70,16
2021,32,19,45


In [131]:
df

Unnamed: 0,Chrome,Safari,Firefox
2018,67,87,8
2019,74,58,14
2020,80,70,16
2021,32,19,45


**Note**: In a deep copy the data is copied, but Python objects stored in the cells are not copied recursively; only the references to them are. This means that if the cells in the DataFrame contain references to other objects, only the references are copied and not the objects they refer to.
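
The note above can be illustrated with a Series whose cells hold Python lists. This is a small sketch with made-up values, not part of the browser dataset:

```python
import pandas as pd

# Each cell holds a Python list, so the Series has dtype 'object'
s = pd.Series([[1, 2], [3, 4]])
deep = s.copy(deep=True)

# Mutate the list stored in the first cell of the original
s[0][0] = 10

# The deep copy stores a reference to the same list, so it sees the change
print(deep[0])
```

The two Series are separate objects, yet their first cells still point to one shared list.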

### Export

In the beginning, we went through how to import a variety of files and read them as pandas objects. Similarly, we can use writer methods to save a pandas object to disk. For saving DataFrames, a few of the methods we can use are `.to_csv()` (comma-separated file format), `.to_excel()` (MS Excel format) and `.to_pickle()` (Python pickle format). All these methods take the file path as the first argument.

In [133]:
df.to_csv('browser.csv')
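
To check that an exported file can be read back, a round trip through `to_csv()` and `read_csv()` is a quick test. The sketch below writes to a temporary directory; the file name and the small table are assumptions made for illustration.

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame(
    {"Chrome": [67, 74], "Safari": [87, 58], "Firefox": [8, 14]},
    index=[2018, 2019],
)

# Write to a temporary directory so no file is left behind
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "browser.csv")   # hypothetical file name
    df.to_csv(path)                               # the index becomes the first column
    df_back = pd.read_csv(path, index_col=0)      # restore that column as the index

round_trip_ok = df_back.equals(df)
print(round_trip_ok)
```

Without `index_col=0`, the saved index would come back as an ordinary column instead of the index.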

### Groupby

Groupby is certainly amongst the most useful set of methods in pandas. To make this concept intuitive, we will break it down into its 3 major steps: split-apply-combine.

In the *split* step, the data is divided into groups by a criterion. For example, we will group our dataset into race groups (6 in this DataFrame). The format for this step is shown below, where the `groupby()` method takes the name of the column to group by as an argument.

In [162]:
RaceGroups = stops_df.groupby('DriverRace')

In the *apply* step, a function is applied to each group that was formed. These are referred to as aggregation functions. We will use an averaging function (mean) to be applied to values in each race group.

The *combine* step then combines and structures the result into a DataFrame. By applying the aggregation function to the grouped data below, we are performing both the apply and combine steps.

In [169]:
RaceGroups.mean()

DriverRace,Duration,BeatCode,SearchConducted,ContraFound,DrugsFound,WeaponFound,Month,Weekday
White,7.501639,1472.421225,1.995715,0.00731,0.007562,0.008571,1.444795,3.098437
Black or African American,7.777141,931.528335,1.982218,0.031661,0.032955,0.034966,1.437555,3.107487
American Indian or Alaska Native,6.758278,1316.754967,1.996689,0.006623,0.006623,0.006623,1.476821,3.235099
Hispanic or Latino,7.168813,1312.990674,1.987629,0.021252,0.022204,0.024678,1.445537,3.153651
Asian,6.271658,1445.236364,1.99893,0.002139,0.002139,0.002139,1.435829,3.094118
Native Hawaiian or Other Pacific Islander,7.015957,1291.494681,2.0,0.0,0.0,0.0,1.462766,3.026596
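
The split, apply and combine steps can also be written out by hand, which may help to see what a grouped aggregation does. The following sketch uses a made-up miniature of the stops data; the column names are reused, the values and group labels are hypothetical.

```python
import pandas as pd

# Hypothetical miniature of the stops data (values made up)
stops = pd.DataFrame({
    "DriverRace": ["A", "B", "A", "B", "A"],
    "Duration": [5, 10, 7, 20, 9],
})

# Split: iterating over a GroupBy object yields (group key, sub-DataFrame) pairs
# Apply: compute the mean Duration of each sub-DataFrame
means = {}
for race, group in stops.groupby("DriverRace"):
    means[race] = group["Duration"].mean()

# Combine: collect the per-group results into one Series
manual = pd.Series(means, name="Duration")

# Built-in equivalent of the three steps
built_in = stops.groupby("DriverRace")["Duration"].mean()

print(manual.to_dict() == built_in.to_dict())
```

In practice the built-in form is preferable; the loop is only meant to make the three steps visible.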


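`RaceGroups.mean()` returns a mean for every numerical column in each group. In pandas 2.0 and later, non-numeric columns are no longer dropped silently, so `mean(numeric_only=True)` is needed when the DataFrame also contains text columns. A mean does not make sense for every column, since some of them hold category codes. We can select the column we want to aggregate by indexing the grouped object with its name.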

In [171]:
RaceGroups['Duration'].mean()

DriverRace
White                                        7.501639
Black or African American                    7.777141
American Indian or Alaska Native             6.758278
Hispanic or Latino                           7.168813
Asian                                        6.271658
Native Hawaiian or Other Pacific Islander    7.015957
Name: Duration, dtype: float64

To turn the group names into a regular column, we can use the `reset_index()` method.

In [172]:
RaceGroups['Duration'].mean().reset_index()

Unnamed: 0,DriverRace,Duration
0,White,7.501639
1,Black or African American,7.777141
2,American Indian or Alaska Native,6.758278
3,Hispanic or Latino,7.168813
4,Asian,6.271658
5,Native Hawaiian or Other Pacific Islander,7.015957
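
An alternative that gives the same layout is the `as_index=False` argument of `groupby()`, which keeps the group key as an ordinary column. This is a minimal sketch on made-up data with hypothetical group labels:

```python
import pandas as pd

stops = pd.DataFrame({
    "DriverRace": ["A", "B", "A", "B"],
    "Duration": [5, 10, 7, 20],
})

# as_index=False keeps the group key as an ordinary column
flat = stops.groupby("DriverRace", as_index=False)["Duration"].mean()

# Aggregating first and resetting the index afterwards gives the same table
via_reset = stops.groupby("DriverRace")["Duration"].mean().reset_index()

print(flat.equals(via_reset))
```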


Size is another aggregation function; `size()` returns the number of rows in each group.

In [166]:
RaceGroups.size()

DriverRace
White                                         7934
Black or African American                    50220
American Indian or Alaska Native               302
Hispanic or Latino                           15763
Asian                                         1870
Native Hawaiian or Other Pacific Islander      188
dtype: int64

We can also group by multiple columns and chain all these steps in a single statement.

In [195]:
stops_df.groupby(['TypeOfMovingViolation',
                  'DriverRace'])['Duration'].mean().reset_index()

Unnamed: 0,TypeOfMovingViolation,DriverRace,Duration
0,Follow too close,White,4.6
1,Follow too close,Black or African American,15.0
2,Follow too close,American Indian or Alaska Native,
3,Follow too close,Hispanic or Latino,7.333333
4,Follow too close,Asian,5.0
5,Follow too close,Native Hawaiian or Other Pacific Islander,
6,Lane Violation,White,8.907216
7,Lane Violation,Black or African American,7.950725
8,Lane Violation,American Indian or Alaska Native,6.1
9,Lane Violation,Hispanic or Latino,7.844322


### Additional Concepts

#### Method Chaining

As the name suggests, method chaining enables us to use a series of methods in a single command. This is possible because most pandas methods return a pandas object, allowing us to apply another method to the result. Method chaining simplifies our code and makes it more readable.

We will repeat the method chaining action we performed previously.

In [196]:
stops_df.groupby(['TypeOfMovingViolation',
                  'DriverRace'])['Duration'].mean().reset_index()

Unnamed: 0,TypeOfMovingViolation,DriverRace,Duration
0,Follow too close,White,4.6
1,Follow too close,Black or African American,15.0
2,Follow too close,American Indian or Alaska Native,
3,Follow too close,Hispanic or Latino,7.333333
4,Follow too close,Asian,5.0
5,Follow too close,Native Hawaiian or Other Pacific Islander,
6,Lane Violation,White,8.907216
7,Lane Violation,Black or African American,7.950725
8,Lane Violation,American Indian or Alaska Native,6.1
9,Lane Violation,Hispanic or Latino,7.844322
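
Longer chains are often easier to read when the whole expression is wrapped in parentheses with one method per line. The sketch below applies this layout to a made-up miniature of the stops data; the values are hypothetical and the final `sort_values()` step is an addition for illustration.

```python
import pandas as pd

stops = pd.DataFrame({
    "TypeOfMovingViolation": ["Speeding", "Speeding",
                              "Lane Violation", "Lane Violation"],
    "DriverRace": ["A", "B", "A", "A"],
    "Duration": [5.0, 10.0, 7.0, 9.0],
})

# Wrapping the chain in parentheses allows one method call per line
result = (
    stops
    .groupby(["TypeOfMovingViolation", "DriverRace"])["Duration"]
    .mean()
    .reset_index()
    .sort_values("Duration", ascending=False)
)

print(result)
```

Each line can be commented out or reordered on its own, which makes it easier to inspect intermediate results.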
