## Stanford Open Policing Project dataset
#### Examining the dataset
Throughout this course, you'll be analyzing a dataset of traffic stops in Rhode Island that was collected by the Stanford Open Policing Project.

Before beginning your analysis, it's important that you familiarize yourself with the dataset. In this exercise, you'll read the dataset into pandas, examine the first few rows, and then count the number of missing values.

In [1]:
# Import the pandas library as pd
import pandas as pd

# Read the comma-separated data file (saved here as 'police.TXT') into a DataFrame named ri
ri = pd.read_csv("police.TXT")

# Examine the head of the DataFrame
print(ri.head(),"\n\n")

# Count the number of missing values in each column
print(ri.isnull().sum())

  state   stop_date stop_time  county_name driver_gender driver_race  \
0    RI  2005-01-04     12:55          NaN             M       White   
1    RI  2005-01-23     23:15          NaN             M       White   
2    RI  2005-02-17     04:15          NaN             M       White   
3    RI  2005-02-20     17:15          NaN             M       White   
4    RI  2005-02-24     01:20          NaN             F       White   

                    violation_raw  violation  search_conducted search_type  \
0  Equipment/Inspection Violation  Equipment             False         NaN   
1                        Speeding   Speeding             False         NaN   
2                        Speeding   Speeding             False         NaN   
3                Call for Service      Other             False         NaN   
4                        Speeding   Speeding             False         NaN   

    stop_outcome is_arrested stop_duration  drugs_related_stop district  
0       Citation       F

It looks like most of the columns have at least some missing values. We'll figure out how to handle these values in the next exercise!
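
As a minimal sketch of the missing-value count described above, the example below uses a tiny made-up DataFrame (named `demo` here, not the real `ri` data). It shows that `isnull().sum()` gives the number of missing values per column, and that `isnull().mean()` gives the share of missing values, which is often easier to judge than a raw count.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the traffic-stop data (all values are made up)
demo = pd.DataFrame({
    "stop_date": ["2005-01-04", "2005-01-23", "2005-02-17", "2005-02-20"],
    "county_name": [np.nan, np.nan, np.nan, np.nan],
    "driver_gender": ["M", np.nan, "F", "M"],
})

# Number of missing values in each column
missing_counts = demo.isnull().sum()

# Share of missing values in each column (mean of a True/False mask)
missing_share = demo.isnull().mean()

print(missing_counts)
print(missing_share)
```

A column whose share is 1.0 contains only missing values, which is the situation the course describes for county_name.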

#### Dropping columns
Often, a DataFrame will contain columns that are not useful to your analysis. Such columns should be dropped from the DataFrame, to make it easier for you to focus on the remaining columns.

In this exercise, you'll drop the county_name column because it only contains missing values, and you'll drop the state column because all of the traffic stops took place in one state (Rhode Island). Thus, these columns can be dropped because they contain no useful information. The number of missing values in each column has been printed in the IPython Shell for you.

In [25]:
# Examine the shape of the DataFrame
print(ri.shape)

# Drop the 'county_name' and 'state' columns
ri.drop(["county_name", "state"], axis='columns', inplace=True)

# Examine the shape of the DataFrame (again)
print(ri.shape)

(84495, 15)
(84495, 13)


Great job! We'll continue to remove unnecessary data from the DataFrame in the next exercise.
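
One possible way to spot such columns programmatically, rather than by eye, is sketched below. It is only an illustration on made-up data: the `demo` DataFrame and the rule "at most one distinct non-missing value" are assumptions for this example, not part of the course solution.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the data (all values are made up)
demo = pd.DataFrame({
    "state": ["RI", "RI", "RI"],
    "county_name": [np.nan, np.nan, np.nan],
    "violation": ["Speeding", "Equipment", "Speeding"],
})

# Count distinct non-missing values per column:
# 0 means the column is entirely missing, 1 means it is constant
n_distinct = demo.nunique(dropna=True)
uninformative = n_distinct[n_distinct <= 1].index.tolist()

# Without inplace=True, drop() returns a new DataFrame and leaves demo unchanged
trimmed = demo.drop(columns=uninformative)

print(uninformative)
print(demo.shape, trimmed.shape)
```

Passing `columns=[...]` is equivalent to passing the labels together with `axis='columns'` as in the exercise above.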

#### Dropping rows
When you know that a specific column will be critical to your analysis, and only a small fraction of rows are missing a value in that column, it often makes sense to remove those rows from the dataset.

During this course, the driver_gender column will be critical to many of your analyses. Because only a small fraction of rows are missing driver_gender, we'll drop those rows from the dataset.

In [26]:
# Count the number of missing values in each column
print(ri.isnull().sum(),"\n\n")

# Drop all rows that are missing 'driver_gender'
ri.dropna(subset=["driver_gender"], inplace=True)

# Count the number of missing values in each column (again)
print(ri.isnull().sum(),"\n\n")

# Examine the shape of the DataFrame
print(ri.shape)

stop_date                 0
stop_time                 0
driver_gender          4922
driver_race            4920
violation_raw          4920
violation              4920
search_conducted          0
search_type           81385
stop_outcome           4920
is_arrested            4920
stop_duration          4920
drugs_related_stop        0
district                  0
dtype: int64 


stop_date                 0
stop_time                 0
driver_gender             0
driver_race               0
violation_raw             0
violation                 0
search_conducted          0
search_type           76463
stop_outcome              0
is_arrested               0
stop_duration             0
drugs_related_stop        0
district                  0
dtype: int64 


(79573, 13)


Excellent! We dropped around 5,000 rows, which is a small fraction of the dataset, and now only one column remains with any missing values.
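
The effect of the `subset` argument can be seen on a small made-up example (the `demo` DataFrame and its values are hypothetical). Only the listed column decides whether a row is dropped, so missing values in other columns, such as search_type, are left alone.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the data (all values are made up)
demo = pd.DataFrame({
    "driver_gender": ["M", np.nan, "F", "M"],
    "search_type": [np.nan, np.nan, "Frisk", np.nan],
})

rows_before = len(demo)

# Only 'driver_gender' is checked; a missing 'search_type' does not cause a drop
cleaned = demo.dropna(subset=["driver_gender"])

rows_dropped = rows_before - len(cleaned)

print(rows_dropped)
print(cleaned.index.tolist())
print(cleaned.isnull().sum())
```

Note that `dropna` keeps the original row labels. This is why a DataFrame with 79573 rows can still show index labels up to 84494 after rows have been removed.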

## Using proper data types
#### Finding an incorrect data type

The dtypes attribute of the ri DataFrame has been printed for you. Your task is to explore the ri DataFrame in the IPython Shell to determine which column's data type should be changed.

Possible Answers<br>
<br>
a. stop_time should have a data type of float<br>
b. search_conducted should have a data type of object<br>
<strong>c. is_arrested should have a data type of bool</strong><br>
d. district should have a data type of int<br>

In [27]:
ri.is_arrested

0        False
1        False
2        False
3         True
4        False
         ...  
84490    False
84491    False
84492    False
84493    False
84494    False
Name: is_arrested, Length: 79573, dtype: object

Correct! We'll fix the data type of the is_arrested column in the next exercise.

#### Fixing a data type
We saw in the previous exercise that the is_arrested column currently has the object data type. In this exercise, we'll change the data type to bool, which is the most suitable type for a column containing True and False values.

Fixing the data type will enable us to use mathematical operations on the is_arrested column that would not be possible otherwise.

In [28]:
# Examine the head of the 'is_arrested' column
print(ri.is_arrested.head(),"\n")

# Change the data type of 'is_arrested' to 'bool'
ri['is_arrested'] = ri.is_arrested.astype('bool')

# Check the data type of 'is_arrested' 
print(ri['is_arrested'].dtype)

0    False
1    False
2    False
3     True
4    False
Name: is_arrested, dtype: object 

bool


Note: before the conversion, the column had the object data type. If you run this cell a second time, the first printout already shows bool, because the conversion has already been applied.

Great! It's best to fix these data type problems early, before you begin your analysis.
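
One detail worth knowing about this conversion is sketched below on made-up data: a missing value (NaN) counts as "truthy", so casting an object column that still contains NaN to bool turns each missing value into True. In this course the rows with missing values were dropped before the cast, so the problem does not arise, but the order of the steps matters.

```python
import numpy as np
import pandas as pd

# Object column mixing booleans with one missing value (made-up data)
arrested = pd.Series([True, False, np.nan], dtype="object")

# Direct cast: the NaN silently becomes True
naive = arrested.astype("bool")

# Dropping missing values first keeps only genuine True/False entries
safe = arrested.dropna().astype("bool")

print(naive.tolist())
print(safe.tolist())

# With a bool column, arithmetic works: sum() counts True values,
# and mean() gives the proportion of True values
print(safe.sum(), safe.mean())
```

The last line illustrates the kind of mathematical operation the bool type makes possible, such as computing an arrest rate with `mean()`.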

## Creating a DatetimeIndex
#### Combining object columns
Currently, the date and time of each traffic stop are stored in separate object columns: stop_date and stop_time.

In this exercise, you'll combine these two columns into a single column, and then convert it to datetime format. This will enable convenient date-based attributes that we'll use later in the course.

In [29]:
# Concatenate 'stop_date' and 'stop_time' (separated by a space)
combined = ri.stop_date.str.cat(ri.stop_time, sep = ' ')

# Convert 'combined' to datetime format
ri['stop_datetime'] = pd.to_datetime(combined)

# Examine the data types of the DataFrame
print(ri.dtypes)

stop_date                     object
stop_time                     object
driver_gender                 object
driver_race                   object
violation_raw                 object
violation                     object
search_conducted                bool
search_type                   object
stop_outcome                  object
is_arrested                     bool
stop_duration                 object
drugs_related_stop              bool
district                      object
stop_datetime         datetime64[ns]
dtype: object
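
The same two steps can be tried on a two-row made-up example, shown here as a sketch. The `format` argument is an optional addition that is not used in the exercise; it simply states the expected layout explicitly. The final line shows one of the date-based attributes that the datetime type provides through the `.dt` accessor.

```python
import pandas as pd

# Hypothetical miniature of the data (values follow the layout shown above)
demo = pd.DataFrame({
    "stop_date": ["2005-01-04", "2005-01-23"],
    "stop_time": ["12:55", "23:15"],
})

# Join the two string columns with a single space between them
combined = demo["stop_date"].str.cat(demo["stop_time"], sep=" ")

# Parse the combined strings; the explicit format is optional
demo["stop_datetime"] = pd.to_datetime(combined, format="%Y-%m-%d %H:%M")

print(combined.tolist())
print(demo["stop_datetime"].dt.hour.tolist())
```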


Excellent! Now we're ready to set the stop_datetime column as the index.

In [30]:
# Set 'stop_datetime' as the index
ri.set_index('stop_datetime', inplace=True)

# Examine the index
print(ri.index, "\n")

# Examine the columns
print(ri.columns)

DatetimeIndex(['2005-01-04 12:55:00', '2005-01-23 23:15:00',
               '2005-02-17 04:15:00', '2005-02-20 17:15:00',
               '2005-02-24 01:20:00', '2005-03-14 10:00:00',
               '2005-03-29 21:55:00', '2005-04-04 21:25:00',
               '2005-07-14 11:20:00', '2005-07-14 19:55:00',
               ...
               '2015-02-22 15:07:00', '2015-02-22 17:54:00',
               '2015-02-22 22:47:00', '2015-02-22 23:24:00',
               '2015-02-23 00:12:00', '2015-02-23 01:02:00',
               '2015-02-23 08:37:00', '2015-02-23 10:09:00',
               '2015-02-23 12:35:00', '2015-02-23 12:56:00'],
              dtype='datetime64[ns]', name='stop_datetime', length=79573, freq=None) 

Index(['stop_date', 'stop_time', 'driver_gender', 'driver_race',
       'violation_raw', 'violation', 'search_conducted', 'search_type',
       'stop_outcome', 'is_arrested', 'stop_duration', 'drugs_related_stop',
       'district'],
      dtype='object')


Congratulations! Now that you have cleaned the dataset, you can begin analyzing it in the next chapter.
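
As a short preview of what a DatetimeIndex allows, the sketch below uses a three-row made-up DataFrame (the `demo` data is hypothetical). It shows that the column moves out of the columns and into the index, that date components can be read directly from the index, and that `.loc` accepts a partial date string to select a whole month.

```python
import pandas as pd

# Hypothetical miniature of the data (all values are made up)
demo = pd.DataFrame({
    "stop_datetime": pd.to_datetime(
        ["2005-01-04 12:55", "2005-01-23 23:15", "2005-02-17 04:15"]
    ),
    "violation": ["Equipment", "Speeding", "Speeding"],
})

# set_index() returns a new DataFrame unless inplace=True is passed
indexed = demo.set_index("stop_datetime")

# Date components are available directly on the index
hours = indexed.index.hour.tolist()

# Partial string indexing: every row from January 2005
january = indexed.loc["2005-01"]

print(list(indexed.columns))
print(hours)
print(january)
```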