# Chapter 2: Gathering and Collecting Data

Data collection is a key component of the data science pipeline. By the principle of garbage in, garbage out, it is arguably the most consequential step of all. Data can be collected in many ways: a social non-profit surveying a neighborhood by interviewing people of a specific income group, a cellular network like Verizon taking in real-time call logs from its millions of users, or an online retailer tracking its customers' demographics and purchases to better target ads. Data can therefore vary in volume, flow, and variety.

A critical element in this step is deciding what data to collect, and at what flow and granularity. Do we need real-time aggregate data that is processed immediately to support time-sensitive decisions, or can we store all the incoming data for every single user in a large data dump and process it at our convenience? Decisions like these affect the results we get, which brings us back to the importance of contextual knowledge about the business.

To make data-driven decisions and predictions that are critical for our organization, a large volume of data covering a wide variety of cases provides the variation that helps us build better learning models (covered in the next section) and hence make better predictions. If you've ever heard the phrase 'Data beats Algorithm', you may soon realize that effective prediction and analysis depend less on how good the engineering or the learning model is and more on the quantity, quality, and variety of the data. There are highly profitable firms whose primary business is offering clean, labeled data. Scale.ai is one such venture; it offers clean, labeled vision data that companies can use for computer vision. You might pay a lot to obtain that data, but it results in effective models for data science.

**Traffic Stops Data**

Going back to the traffic stops case study that we worked on in the previous sub-section, we are now going to discuss and evaluate the data sources that were collected for this study. We will discuss the merits and caveats of each approach and talk about how the integrity of the data can be maintained. It will also be useful to think about the ideal dataset that we might require, and to circle back to our earlier discussion of the features and variables we might need for this study.

[Needs an Update with responses to the prompts in the paragraph above]

### Guided Practice

For this capstone, we are going to learn and use Pandas, an open-source Python library with a rich set of data structures and data analysis tools. Throughout this book, we will use the DataFrame, a Pandas object, as the primary data structure for storing and manipulating the datasets in this capstone. A DataFrame is a two-dimensional data structure with rows and columns, similar to a spreadsheet or a SQL table. Pandas provides the functionality and methods needed to work with data in a DataFrame.

We will also work with Series, a one-dimensional data structure that can be thought of as a single labeled column. A DataFrame can be built from multiple Series combined. Understanding the difference between a Series and a DataFrame helps you know which methods work for a Series and which work for a DataFrame when referring to the documentation.

We will now go ahead and import the Pandas library into the local Python environment using the **import** statement shown below.

In [1]:
import pandas

In [2]:
# ignore these lines
pandas.options.display.max_columns = None
pandas.options.display.max_rows = None
import warnings
warnings.filterwarnings('ignore')

The following figures can give you a visual representation of a pandas DataFrame and Series. A series and every column in a DataFrame are labeled. Both Series and DataFrame also have an index that identifies each row. We will go through the following set of exercises on creating DataFrames and Series in order to understand the structure well.

![](https://pandas.pydata.org/docs/_images/01_table_dataframe.svg)

In [3]:
# create pandas dataframe using a 1D array in each column
df = pandas.DataFrame({
    'q1':[450, 200, 310],
    'q2':[410, 540, 650],
    'q3':[730, 800, 90],
    'q4':[110, 350, 85]
})

df

Unnamed: 0,q1,q2,q3,q4
0,450,410,730,110
1,200,540,800,350
2,310,650,90,85


In [4]:
# We can also give a different label to each index of the dataframe
df.index = [2018, 2019, 2020]

df

Unnamed: 0,q1,q2,q3,q4
2018,450,410,730,110
2019,200,540,800,350
2020,310,650,90,85


In [5]:
# create a pandas series object using a 1D array
ser = pandas.Series([510, 820, 90], index=[2018, 2019, 2020])

ser

2018    510
2019    820
2020     90
dtype: int64

There are several ways to create a DataFrame and a Series object. We will try a few more of them.

In [None]:
# create dataframe from multiple series

df_2 = pandas.DataFrame({
    'q1':ser,
    'q2':pandas.Series([110,450,300], index=[2018, 2019, 2020]),
    'q3':ser,
    'q4':ser*2
})

df_2

As the code above shows, we can also use arithmetic operations on a Series; the operation is applied to every element of the Series.
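
To make the element-wise behavior concrete, here is a minimal sketch. The Series `a` reuses the values of `ser`; the second Series `b` and its numbers are made up for illustration. It also shows that arithmetic between two Series matches values by index label rather than by position.

```python
import pandas

# 'a' mirrors ser from the cell above; 'b' is a made-up Series for illustration
a = pandas.Series([510, 820, 90], index=[2018, 2019, 2020])
b = pandas.Series([10, 20, 30], index=[2019, 2020, 2021])

doubled = a * 2   # the scalar is applied to every element
total = a + b     # values are matched by index label, not by position

print(doubled)
print(total)      # labels present in only one Series (2018, 2021) become NaN
```

Because labels that appear in only one Series produce missing values, it is worth checking that two Series share the same index before combining them.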

In [6]:
# create dataframe from a dictionary

dictionary = {
    'q1':[190, 35, 440],
    'q2':[390, 430, 230],
    'q3':[400, 250, 70],
    'q4':[50, 350, 440]
}

years = [2018, 2019, 2020]

df_3 = pandas.DataFrame(dictionary, index = years)


df_3

Unnamed: 0,q1,q2,q3,q4
2018,190,390,400,50
2019,35,430,250,350
2020,440,230,70,440


In the methods below, we will see how to subset pandas DataFrames.

In [7]:
# Selecting a single column
df['q2']

2018    410
2019    540
2020    650
Name: q2, dtype: int64

In [8]:
# Selecting a set of columns
df[['q2','q3']]

# Why double brackets?

Unnamed: 0,q2,q3
2018,410,730
2019,540,800
2020,650,90
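
On the double-bracket question in the cell above: the outer brackets are the indexing operator, and the inner brackets build a Python list of column labels. A small sketch on a toy frame (the name `df_demo` is just for this example) shows how the result type depends on whether a single label or a list is passed.

```python
import pandas

# Toy frame with the same layout as df above
df_demo = pandas.DataFrame(
    {'q1': [450, 200, 310], 'q2': [410, 540, 650], 'q3': [730, 800, 90]},
    index=[2018, 2019, 2020],
)

one_col = df_demo['q2']            # a single label returns a Series
as_frame = df_demo[['q2']]         # a list with one label returns a DataFrame
two_cols = df_demo[['q2', 'q3']]   # a list of labels returns a DataFrame

print(type(one_col).__name__, one_col.shape)
print(type(as_frame).__name__, as_frame.shape)
print(type(two_cols).__name__, two_cols.shape)
```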


It is important to consider the data type of each column, since it can have a major impact on our analysis and on how efficiently we store and process data in a DataFrame. By running the cell below, you'll find that all columns have the int64 data type, which refers to a 64-bit integer.

In [9]:
# View data type of each column
df.dtypes

q1    int64
q2    int64
q3    int64
q4    int64
dtype: object
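
As a rough illustration of how the data type affects storage, the sketch below compares the memory used by a small frame before and after converting it to a narrower integer type. The frame `df_demo` is a toy example, and choosing `int16` is an assumption that only holds because every value here fits in that range.

```python
import pandas

# Toy frame; dtype is set explicitly so the comparison is the same on every platform
df_demo = pandas.DataFrame(
    {'q1': [450, 200, 310], 'q4': [110, 350, 85]},
    dtype='int64',
)

before = df_demo.memory_usage(index=False)   # bytes used by each column
small = df_demo.astype('int16')              # safe only because all values fit in int16
after = small.memory_usage(index=False)

print(before)
print(after)
```

On three rows the saving is trivial, but the same conversion on hundreds of thousands of rows can matter.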

We can also check the labels of the index and of the columns through the DataFrame's index and columns attributes, respectively, as shown below.

In [10]:
# View index
df.index

Int64Index([2018, 2019, 2020], dtype='int64')

In [11]:
# You can also change the index values

df.index = ['2014', '2015','2016']
df

Unnamed: 0,q1,q2,q3,q4
2014,450,410,730,110
2015,200,540,800,350
2016,310,650,90,85


In [12]:
# View column labels
df.columns
# These can be changed too

Index(['q1', 'q2', 'q3', 'q4'], dtype='object')
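
Column labels can be changed in much the same way as the index. A minimal sketch, using a toy frame and made-up replacement names, of the `rename` method, which returns a new DataFrame and leaves the original untouched:

```python
import pandas

df_demo = pandas.DataFrame({'q1': [450, 200], 'q2': [410, 540]}, index=[2018, 2019])

# Map old labels to new ones; the new names here are purely illustrative
renamed = df_demo.rename(columns={'q1': 'quarter_1', 'q2': 'quarter_2'})

print(renamed.columns)
print(df_demo.columns)   # the original frame keeps its labels
```

Assigning a full list to `df_demo.columns` also works, in the same way the index was replaced above, but `rename` is less error-prone when only a few labels change.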

#### Working with traffic stops data

We'll continue this guided practice by working with the traffic stops data for the City of Chicago, collected from the Illinois Department of Transportation (IDOT). By now, you may be familiar with some of the methods and functions that we use below to fetch and view the dataset. We'll introduce a few more.

In [14]:
stops = pandas.read_csv('idot/IDOT_2021.csv')

In [15]:
# shape gives you the dimensions of the dataframe as (rows, columns)
stops.shape

# How many rows and columns do we have?

(377899, 55)
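
With a file of this size, it can be convenient to load only a few columns or rows while exploring. The sketch below uses the `usecols` and `nrows` parameters of `read_csv` on a tiny in-memory CSV; the rows and officer names are made up, and only the column names are borrowed from the IDOT file.

```python
import io
import pandas

# A tiny stand-in for the real file (made-up rows, real column names)
csv_text = """DATESTOP,TIMESTOP,DURATION,OFFNAME,VEHMAKE
1/1/21,0:33,5,OFFICER A,CHEVROLET
1/1/21,1:50,4,OFFICER B,FORD
1/1/21,8:50,4,OFFICER C,FORD
"""

sample = pandas.read_csv(
    io.StringIO(csv_text),
    usecols=['DATESTOP', 'DURATION', 'VEHMAKE'],  # keep only these columns
    nrows=2,                                      # read only the first two data rows
)

print(sample.shape)
print(sample)
```

The same arguments can be passed when reading from a file path instead of a `StringIO` object.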

In [16]:
# It's a huge dataset! The head method lets you view the first rows; the number of rows goes in the brackets
stops.head(3)

Unnamed: 0,DATESTOP,TIMESTOP,DURATION,OFFNAME,OFFBDGE,CITY_I,STATE,VEHMAKE,VEHYEAR,YRBIRTH,DRSEX,DRRACE,REASSTOP,TYPEMOV,RESSTOP,BEAT_I,VEHCONSREQ,VEHCONSGIV,VEHSRCHCOND,VEHSRCHCONDBY,VEHCONTRA,VEHDRUGS,VEHPARA,VEHALC,VEHWEAP,VEHSTOLPROP,VEHOTHER,VEHDRAMT,DRCONSREQ,DRCONSGIV,DRVSRCHCOND,DRVSRCHCONDBY,PASSCONSREQ,PASSCONSGIV,PASSSRCHCOND,PASSSRCHCONDBY,PASSDRVCONTRA,PASSDRVDRUGS,PASSDRVPARA,PASSDRVALC,PASSDRVWEAP,PASSDRVSTOLPROP,PASSDRVOTHER,PASSDRVDRAMT,DOGPERFSNIFF,DOGALERT,DOGALERTSRCH,DOGALERTSRCHCONTRA,DOGDRUG,DOGPARA,DOGALC,DOGWEAP,DOGSTOLPROP,DOGOTHER,DOGDRAMT
0,1/1/21,0:33,5,FIDEL LEGORRETA,5902.0,CHICAGO,IL,CHEVROLET,2017.0,1993,1,4.0,2.0,,3,1234,2,0,2,0,0,0,0,0,0,0,0,0,2,0,2,0,2,0,2,0,0,0,0,0,0,0,0,0,2,0,0,0,0,0,0,0,0,0,0
1,1/1/21,1:50,4,VICTOR PEREZ,7383.0,CHICAGO,IL,FORD,2012.0,1957,1,2.0,2.0,,3,1122,2,0,2,0,0,0,0,0,0,0,0,0,2,0,2,0,2,0,2,0,0,0,0,0,0,0,0,0,2,0,0,0,0,0,0,0,0,0,0
2,1/1/21,8:50,4,STEPHANIE ORTIGARA,18302.0,CHICAGO,IL,FORD,2007.0,1967,1,2.0,3.0,,3,332,2,0,2,0,0,0,0,0,0,0,0,0,2,0,2,0,2,0,2,0,0,0,0,0,0,0,0,0,2,0,0,0,0,0,0,0,0,0,0


In [17]:
# tail lets you view the specified number of rows at the end. The default for both methods is 5
stops.tail()

Unnamed: 0,DATESTOP,TIMESTOP,DURATION,OFFNAME,OFFBDGE,CITY_I,STATE,VEHMAKE,VEHYEAR,YRBIRTH,DRSEX,DRRACE,REASSTOP,TYPEMOV,RESSTOP,BEAT_I,VEHCONSREQ,VEHCONSGIV,VEHSRCHCOND,VEHSRCHCONDBY,VEHCONTRA,VEHDRUGS,VEHPARA,VEHALC,VEHWEAP,VEHSTOLPROP,VEHOTHER,VEHDRAMT,DRCONSREQ,DRCONSGIV,DRVSRCHCOND,DRVSRCHCONDBY,PASSCONSREQ,PASSCONSGIV,PASSSRCHCOND,PASSSRCHCONDBY,PASSDRVCONTRA,PASSDRVDRUGS,PASSDRVPARA,PASSDRVALC,PASSDRVWEAP,PASSDRVSTOLPROP,PASSDRVOTHER,PASSDRVDRAMT,DOGPERFSNIFF,DOGALERT,DOGALERTSRCH,DOGALERTSRCHCONTRA,DOGDRUG,DOGPARA,DOGALC,DOGWEAP,DOGSTOLPROP,DOGOTHER,DOGDRAMT
377894,12/30/21,23:40,2,MATTHEW DRINNAN,13585.0,CHICAGO,IL,HYUNDAI,2020.0,1964,2,2.0,2.0,,3,711,2,0,2,0,0,0,0,0,0,0,0,0,2,0,2,0,2,0,2,0,0,0,0,0,0,0,0,0,2,0,0,0,0,0,0,0,0,0,0
377895,12/31/21,20:35,5,EMMANUEL GARCIA,19038.0,CHICAGO,IL,NISSAN,2007.0,1973,1,2.0,3.0,,3,522,2,0,2,0,0,0,0,0,0,0,0,0,2,0,2,0,2,0,2,0,0,0,0,0,0,0,0,0,2,0,0,0,0,0,0,0,0,0,0
377896,12/31/21,20:31,21,LUIS NUNEZ,18229.0,CHICAGO HEIGHTS,IL,HONDA,2007.0,1980,1,2.0,1.0,3.0,1,2232,2,0,1,2,2,2,2,2,2,2,2,0,2,0,1,2,2,0,2,0,2,2,2,2,2,2,2,1,2,0,0,0,0,0,0,0,0,0,0
377897,12/31/21,21:33,2,GUSTAVO DOMINGUEZ,15235.0,CHICAGO,IL,KIA,2014.0,1982,1,2.0,3.0,,3,1113,2,0,2,0,0,0,0,0,0,0,0,0,2,0,2,0,2,0,2,0,0,0,0,0,0,0,0,0,2,0,0,0,0,0,0,0,0,0,0
377898,12/31/21,21:20,4,FRANK GIANAKAKIS,6934.0,CHICAGO,IL,TOYOTA,2015.0,1987,1,5.0,2.0,,3,1831,2,0,2,0,0,0,0,0,0,0,0,0,2,0,2,0,2,0,2,0,0,0,0,0,0,0,0,0,2,0,0,0,0,0,0,0,0,0,0


In [18]:
# We can select to view just one column
stops['VEHMAKE'].head()

# Notice that this returns a Series

0    CHEVROLET
1         FORD
2         FORD
3          BMW
4       TOYOTA
Name: VEHMAKE, dtype: object

In [19]:
# Or select a subset of columns
stops[['VEHMAKE', 'VEHYEAR', 'STATE']].head()

# The inner brackets create a list, and the outer brackets are the indexing operator

Unnamed: 0,VEHMAKE,VEHYEAR,STATE
0,CHEVROLET,2017.0,IL
1,FORD,2012.0,IL
2,FORD,2007.0,IL
3,BMW,1998.0,IL
4,TOYOTA,2002.0,IL


In [20]:
# We can use this method to slice the dataframe and only keep the columns we need
# We'll copy the result to a new dataframe and call it stops_sliced

stops_sliced = stops[['DATESTOP','TIMESTOP','DURATION','CITY_I', 'STATE', 'DRRACE','RESSTOP',
                 'TYPEMOV','BEAT_I','VEHSRCHCOND','VEHCONTRA','VEHDRUGS','VEHWEAP','VEHSTOLPROP']].copy()

Pandas also offers a range of ways to select or slice a portion of a DataFrame. Using a DataFrame's .loc and .iloc indexers, we can select cells by label and by position, respectively.

In [21]:
# Selecting by label
stops.loc[5, ['OFFNAME']]

OFFNAME    MONTY OWENS
Name: 5, dtype: object

In [22]:
stops_sliced.loc[1:10, ['CITY_I', 'STATE']]
# Notice something peculiar below?

Unnamed: 0,CITY_I,STATE
1,CHICAGO,IL
2,CHICAGO,IL
3,CHICAGO,IL
4,CHICAGO,IL
5,CHICAGO,IL
6,CHICAGO,IL
7,CHICAGO,IL
8,GLENDALE HEIGHTS,IL
9,CHICAGO,IL
10,CHICAGO,IL


In [23]:
# Selecting by position
stops_sliced.iloc[2, 1]

'8:50'

In [24]:
stops_sliced.iloc[:10, 3:5]

Unnamed: 0,CITY_I,STATE
0,CHICAGO,IL
1,CHICAGO,IL
2,CHICAGO,IL
3,CHICAGO,IL
4,CHICAGO,IL
5,CHICAGO,IL
6,CHICAGO,IL
7,CHICAGO,IL
8,GLENDALE HEIGHTS,IL
9,CHICAGO,IL
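
The two slices above behave differently at their end points: `.loc[1:10]` returned the rows labeled 1 through 10, end label included, while `.iloc[:10]` returned positions 0 through 9, end position excluded. A small sketch on a made-up frame makes the contrast easier to see.

```python
import pandas

# Made-up rows; the default index is 0..5, so labels and positions coincide here
demo = pandas.DataFrame({
    'CITY_I': ['CHICAGO', 'CHICAGO', 'EVANSTON', 'CHICAGO', 'CICERO', 'CHICAGO'],
    'STATE': ['IL'] * 6,
})

by_label = demo.loc[1:3, ['CITY_I']]   # labels 1, 2 and 3: the end label is included
by_position = demo.iloc[1:3, 0:1]      # positions 1 and 2: the end position is excluded

print(by_label)
print(by_position)
```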


We can also use an additional set of pandas operations to get summary statistics of the DataFrame and do a quick analysis of every numeric column we have. The describe() method offers us that power. Additionally, we can get specific statistics for a Series or for the entire DataFrame using methods like mean() or median().

In [25]:
# summary stats
stops_sliced.describe()

Unnamed: 0,DURATION,DRRACE,RESSTOP,TYPEMOV,BEAT_I,VEHSRCHCOND,VEHCONTRA,VEHDRUGS,VEHWEAP,VEHSTOLPROP
count,377899.0,377870.0,377899.0,123306.0,377899.0,377899.0,377899.0,377899.0,377899.0,377899.0
mean,7.940148,2.421145,2.914686,4.102006,1144.907007,1.985427,0.024694,0.026044,0.028415,0.029137
std,25.584249,1.051478,0.404165,1.425366,644.821167,0.119835,0.210543,0.219792,0.235143,0.239621
min,0.0,1.0,1.0,1.0,111.0,1.0,0.0,0.0,0.0,0.0
25%,3.0,2.0,3.0,4.0,711.0,2.0,0.0,0.0,0.0,0.0
50%,4.0,2.0,3.0,4.0,1112.0,2.0,0.0,0.0,0.0,0.0
75%,6.0,4.0,3.0,6.0,1533.0,2.0,0.0,0.0,0.0,0.0
max,719.0,6.0,3.0,6.0,6100.0,2.0,2.0,2.0,2.0,2.0


In [26]:
# mean duration
stops_sliced['DURATION'].mean()
# Note: applying mean() to a dataframe gives you the mean of every column by default

7.940148028970704
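
When `mean()` is called on a whole DataFrame rather than a single Series, the direction of the calculation is controlled by the `axis` argument. A minimal sketch on a toy frame: the default averages down each column, and `axis=1` averages across each row.

```python
import pandas

df_demo = pandas.DataFrame(
    {'q1': [450, 200, 310], 'q2': [410, 540, 650]},
    index=[2018, 2019, 2020],
)

col_means = df_demo.mean()         # one value per column (default, axis=0)
row_means = df_demo.mean(axis=1)   # one value per row

print(col_means)
print(row_means)
```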

It is essential to know the data type of each column in our DataFrame. Pandas offers the dtypes attribute for that. Ensuring that the data types are interpreted correctly can significantly streamline our work and analysis; we'll touch more on that later. Think about the data type interpreted for each column. Do you see any problem here?

In [27]:
stops_sliced.dtypes

DATESTOP        object
TIMESTOP        object
DURATION         int64
CITY_I          object
STATE           object
DRRACE         float64
RESSTOP          int64
TYPEMOV        float64
BEAT_I           int64
VEHSRCHCOND      int64
VEHCONTRA        int64
VEHDRUGS         int64
VEHWEAP          int64
VEHSTOLPROP      int64
dtype: object
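
One thing the output hints at: DATESTOP and TIMESTOP are read as generic object (string) columns, and DRRACE is float64 even though the values displayed earlier look like whole-number codes, most likely because the column has missing entries. The sketch below shows one possible set of conversions on a made-up three-row frame. The date format string is an assumption based on the rows displayed earlier (for example 1/1/21 and 0:33), and STOP_DATETIME is a hypothetical column name.

```python
import pandas

# Made-up rows that mimic the layout of the traffic stops columns
demo = pandas.DataFrame({
    'DATESTOP': ['1/1/21', '6/15/21', '12/31/21'],
    'TIMESTOP': ['0:33', '8:50', '21:20'],
    'DRRACE': [4.0, None, 2.0],
})

# Combine date and time strings and parse them (format is an assumption)
demo['STOP_DATETIME'] = pandas.to_datetime(
    demo['DATESTOP'] + ' ' + demo['TIMESTOP'],
    format='%m/%d/%y %H:%M',
)

# The nullable integer type keeps whole-number codes and still allows missing values
demo['DRRACE'] = demo['DRRACE'].astype('Int64')

print(demo.dtypes)
```

Whether these are the right target types depends on the analysis; treat this as a starting point rather than a prescribed cleaning step.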

In [30]:
# Save a dataframe as a csv (Can also be converted to other formats)
# Format: df.to_csv(filepath)
stops_sliced.to_csv('stops_data_sliced.csv')
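
By default, `to_csv` also writes the index as the first, unnamed column, which shows up as `Unnamed: 0` when the file is read back. A small sketch on a made-up frame, writing to a string instead of a file, shows the effect of `index=False`.

```python
import io
import pandas

demo = pandas.DataFrame({'CITY_I': ['CHICAGO', 'CICERO'], 'STATE': ['IL', 'IL']})

# With no path given, to_csv returns the CSV text as a string
with_index = pandas.read_csv(io.StringIO(demo.to_csv()))
without_index = pandas.read_csv(io.StringIO(demo.to_csv(index=False)))

print(with_index.columns.tolist())      # includes an extra 'Unnamed: 0' column
print(without_index.columns.tolist())   # only the original columns
```

Keeping the index is useful when it carries meaning (years, for example); dropping it avoids a stray column when it is just a row counter.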

As we discussed, gathering relevant metadata will help strengthen our analysis of traffic stops in Chicago. To measure whether there were discriminatory practices, we also need to know the proportion of each race in every beat. That way, we can measure police activity targeted toward a race while taking its share of the population into account as well.

We will take you through the steps of downloading demographic data for police beats in Chicago, which will enable us to find the proportion of each race targeted.
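
To make the idea of accounting for population concrete, here is a rough sketch on entirely made-up numbers. It compares each race's share of stops in a beat with its share of that beat's population. The frame `pop_share` is a hypothetical stand-in for the demographics data, and its layout is an assumption, not the format of the actual download.

```python
import pandas

# Made-up stops: one row per stop, with the beat and a driver race code
stops_demo = pandas.DataFrame({
    'BEAT_I': [111, 111, 111, 111, 112, 112],
    'DRRACE': [1, 2, 2, 2, 1, 2],
})

# Share of stops per race within each beat (each row sums to 1)
stop_share = pandas.crosstab(
    stops_demo['BEAT_I'], stops_demo['DRRACE'], normalize='index'
)

# Hypothetical population shares: rows are beats, columns are race codes
pop_share = pandas.DataFrame({1: [0.5, 0.25], 2: [0.5, 0.75]}, index=[111, 112])

# A ratio above 1 means a group is stopped more often than its population share suggests
ratio = stop_share / pop_share

print(ratio)
```

A ratio like this is only a first-pass indicator; it says nothing yet about who actually drives through a beat or why a stop was made.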

### Exercise 1.2

In [32]:
# Download beatrace data from Kaggle

In [33]:
# Read the data as csv

In [34]:
# Check out the shape

In [35]:
# Display the first 5 rows

In [36]:
# Display the data types

In [37]:
# Display the summary statistics