<img src="https://ucfai.org//course/sp19/data-curation/banner.jpg">


<div class="col-12">
    <h1> Cleaning and Manipulating a Dataset with Python </h1>
    <hr>
</div>

<div style="line-height: 2em;">
    <p>by: 
        <strong> Daniel Silva</strong>
        (<a href="https://github.com/danielzgsilva">@danielzgsilva</a>) <br> &emsp;&nbsp;
        <strong> John Muchovej</strong>
        (<a href="https://github.com/ionlights">@ionlights</a>)
   <br>&emsp;&nbsp;  on 2019-03-20</p>
</div>

## Today's lecture will cover how to load, clean, and manipulate a dataset using Python
###  In order to do this we'll be utilizing a Python library named Pandas.

####  Pandas is an open-source library that provides high-performance, easy-to-use data structures and data analysis tools for Python. It is arguably the most widely used tool in the DS/AI industry for data munging and wrangling.

If you do not yet have Python and Pandas installed on your machine, I recommend using a distribution such as the <a href="https://www.anaconda.com/" target="_blank">Anaconda Distribution</a>. 
It is available for Windows, Linux, and Mac and will quickly install Python, Jupyter Notebook, and the most popular Data Science libraries onto your machine. 

### Importing Libraries and Downloading Data

In order to use any Python library we first need to import it...<br>Pandas is built on top of NumPy, a scientific computing library that works hand in hand with Pandas, so we'll import NumPy as well.

In [87]:
# 'pd' will serve as the alias for Pandas when calling functions
import pandas as pd
import numpy as np
import os

This code downloads our dataset. To do this we'll use a script named gdown.pl, which downloads files from Google Drive from the command line.

In [88]:
!wget https://raw.githubusercontent.com/circulosmeos/gdown.pl/master/gdown.pl
!chmod +x gdown.pl

!./gdown.pl https://drive.google.com/open?id=1uFRR5wtQTYjkZgfqUCtHfM1jJAT763Gm LA_Parking_Citations.csv

Will not apply HSTS. The HSTS database must be a regular and non-world-writable file.
ERROR: could not open HSTS store at '/home/danielzgsilva/.wget-hsts'. HSTS will be disabled.
--2019-03-19 16:33:18--  https://raw.githubusercontent.com/circulosmeos/gdown.pl/master/gdown.pl
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.4.133
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.4.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2072 (2.0K) [text/plain]
Saving to: ‘gdown.pl’


2019-03-19 16:33:18 (11.0 MB/s) - ‘gdown.pl’ saved [2072/2072]

Will not apply HSTS. The HSTS database must be a regular and non-world-writable file.
ERROR: could not open HSTS store at '/home/danielzgsilva/.wget-hsts'. HSTS will be disabled.
Cannot open cookies file ‘gdown.cookie.temp’: No such file or directory
--2019-03-19 16:33:19--  https://docs.google.com/uc?id=1uFRR5wtQTYjkZgfqUCtHfM1jJAT763Gm&export=download
Resolving docs.goo

#### Loading in a Dataset with Python

At this point you might be thinking, "Well this is cool and all, but where the heck can I get a dataset in the first place??"

Fear not! There are a number of online repositories which supply both messy and clean datasets for almost any Data Science project you could imagine. Here are some of my favorites:
-  <a href="https://www.kaggle.com/" target="_blank">Kaggle</a>: A popular site within the Data Science community which hosts Machine Learning competitions. It contains a tremendous number of datasets, all of which you can download.
    - As a note, you can open up a kernel under any competition or dataset and the data will already be loaded into the notebook, no need to download to your machine!
-  <a href="https://cloud.google.com/bigquery/public-data/" target="_blank">Google Public Datasets</a>
-  <a href="https://aws.amazon.com/start-now/?sc_channel=BA&sc_campaign=elevator&sc_publisher=captivate" target="_blank">Amazon Web Services Public Datasets</a>
-  <a href="http://mlr.cs.umass.edu/ml/" target="_blank">UC Irvine Machine Learning Repository</a>

**When you want to use Pandas to manipulate or analyze data, you’ll usually get your data in one of three different ways:**

-  Convert a Python list, dictionary, or NumPy array to a Pandas DataFrame
-  Open a local file using Pandas, usually a CSV file, but could also be a tab delimited text file (like TSV), Excel, etc
-  Open a remote file through a URL or read from a database such as SQL
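
The first of these options can be sketched in a few lines. This is a minimal illustration with made-up values; the column names `make`, `fine`, `a`, and `b` are placeholders and do not come from the parking dataset.

```python
import numpy as np
import pandas as pd

# From a dictionary: keys become column names, values become the columns
from_dict = pd.DataFrame({"make": ["TOYT", "FORD"], "fine": [73, 25]})

# From a list of rows: column names are supplied separately
from_list = pd.DataFrame([["TOYT", 73], ["FORD", 25]], columns=["make", "fine"])

# From a NumPy array: every column shares the array's dtype
from_array = pd.DataFrame(np.array([[1, 2], [3, 4]]), columns=["a", "b"])

print(from_dict)
print(from_list)
print(from_array)
```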

In our case we will be loading our data set from a CSV (comma separated values file) which I downloaded from Kaggle:  <a href="https://www.kaggle.com/cityofLA/los-angeles-parking-citations" target="_blank">link to dataset</a>

In [95]:
# Note that file directories in Jupyter Notebook begin from the folder which holds your IPython notebook file. 
# This csv file is saved in the same folder as my notebook. If it was in another folder we'd need to further define the path
df = pd.read_csv('LA_Parking_Citations.csv')

Other methods exist such as:
-  **pd.read_feather** and **DataFrame.to_feather**     
Read into these to explore a lightweight and fast option for storing DataFrames on disk and reading them back
-  **pd.read_sql**
-  You may also pass the **sep** parameter to read text files that use other delimiters,
   e.g. sep = '\t' for tab-delimited files
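
As a small sketch of the **sep** parameter, the snippet below reads tab-delimited text. To keep it self-contained it parses an in-memory string through `io.StringIO` instead of a file on disk; the three columns are invented for illustration.

```python
import io
import pandas as pd

# A small tab-delimited table held in memory, standing in for a .tsv file
tsv_text = "make\tcolor\tfine\nTOYT\tBK\t73\nFORD\tWT\t25\n"

# sep tells read_csv which character separates the fields
tsv_df = pd.read_csv(io.StringIO(tsv_text), sep="\t")
print(tsv_df)
```

With a real file you would pass its path in place of the `io.StringIO` object.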

### Pandas Components

Pandas has two core components:
-  **Series**: This is essentially a one-dimensional NumPy array with an index attached. For the most part these will be the columns within our DataFrames
-  **DataFrames**: These are the bread and butter of pandas. They're equivalent to a table or an excel spreadsheet (made up of columns and rows)
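
The relationship between the two can be seen on a toy DataFrame. This is only a sketch; the `tickets` frame and its values are made up, although the column names mirror ones in the parking dataset.

```python
import pandas as pd

tickets = pd.DataFrame(
    {"Make": ["SUBA", "NISS", "LEXS"], "Fine amount": [73.0, 73.0, 25.0]}
)

# Selecting one column returns a Series; selecting a list of columns returns a DataFrame
fines = tickets["Fine amount"]
print(type(fines).__name__)
print(type(tickets[["Fine amount"]]).__name__)

# A Series carries an index alongside its values, unlike a bare NumPy array
print(fines.index.tolist())
print(fines.to_numpy())
```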

### Inspecting and Analyzing a Dataframe

pandas.DataFrame.head() by default shows the first 5 rows of a DataFrame. An integer can be passed to show a different number of rows

#### This dataset contains a line item for each ticket issued in the City of Los Angeles

In [96]:
df.head()

Unnamed: 0,Ticket number,Issue Date,Issue time,Meter Id,Marked Time,RP State Plate,Plate Expiry Date,VIN,Make,Body Style,Color,Location,Route,Agency,Violation code,Violation Description,Fine amount,Latitude,Longitude
0,4346620795,2019-01-04T00:00:00,814.0,,,CA,201811.0,,SUBA,PA,BL,5242 CARTWRIGHT AVE,356,53.0,80.69BS,NO PARK/STREET CLEAN,73.0,6451806.399,1882867.749
1,4346620806,2019-01-04T00:00:00,816.0,,,CA,201909.0,,NISS,PA,WT,5334 CARTWRIGHT AVE,356,53.0,80.69BS,NO PARK/STREET CLEAN,73.0,6451807.009,1883469.46
2,4346620810,2019-01-04T00:00:00,818.0,,,CA,201905.0,,LEXS,PA,BK,5338 CARTWRIGHT AVE,356,53.0,80.69BS,NO PARK/STREET CLEAN,73.0,6451807.036,1883495.621
3,4346620821,2019-01-04T00:00:00,819.0,,,CA,201905.0,,LEXS,PA,BK,5338 CARTWRIGHT AVE,356,53.0,5200,DISPLAY OF PLATES,25.0,6451807.036,1883495.621
4,4346620832,2019-01-04T00:00:00,824.0,,,CA,201905.0,,CHEV,PA,GY,5330 RIVERTON AVE,356,53.0,80.69BS,NO PARK/STREET CLEAN,73.0,6451146.82,1883618.633


In [97]:
df.head(2)

Unnamed: 0,Ticket number,Issue Date,Issue time,Meter Id,Marked Time,RP State Plate,Plate Expiry Date,VIN,Make,Body Style,Color,Location,Route,Agency,Violation code,Violation Description,Fine amount,Latitude,Longitude
0,4346620795,2019-01-04T00:00:00,814.0,,,CA,201811.0,,SUBA,PA,BL,5242 CARTWRIGHT AVE,356,53.0,80.69BS,NO PARK/STREET CLEAN,73.0,6451806.399,1882867.749
1,4346620806,2019-01-04T00:00:00,816.0,,,CA,201909.0,,NISS,PA,WT,5334 CARTWRIGHT AVE,356,53.0,80.69BS,NO PARK/STREET CLEAN,73.0,6451807.009,1883469.46


pd.DataFrame.shape can quickly tell you the dimensions of your DataFrame

In [98]:
df.shape

(100000, 19)

The next two DataFrame methods can be used to tell us which datatypes our DataFrame consists of, as well as how many NULL values are found in each column

In [99]:
# get_dtype_counts() was removed in pandas 1.0; on newer versions use df.dtypes.value_counts()
df.get_dtype_counts()

float64     8
int64       1
object     10
dtype: int64

Notice below that our Meter Id, Marked Time, and VIN columns have a significant number of NULL values. We'll deal with these in a bit...

In [100]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 19 columns):
Ticket number            100000 non-null int64
Issue Date               99994 non-null object
Issue time               99967 non-null float64
Meter Id                 23588 non-null object
Marked Time              2883 non-null float64
RP State Plate           100000 non-null object
Plate Expiry Date        90157 non-null float64
VIN                      0 non-null float64
Make                     99893 non-null object
Body Style               99896 non-null object
Color                    99953 non-null object
Location                 99989 non-null object
Route                    99515 non-null object
Agency                   99994 non-null float64
Violation code           100000 non-null object
Violation Description    99990 non-null object
Fine amount              99934 non-null float64
Latitude                 100000 non-null float64
Longitude                100000 non-nul

This method below, .describe(), provides a statistical summary of our numerical columns (ints and floats)

In [101]:
df.describe()

Unnamed: 0,Ticket number,Issue time,Marked Time,Plate Expiry Date,VIN,Agency,Fine amount,Latitude,Longitude
count,100000.0,99967.0,2883.0,90157.0,0.0,99994.0,99934.0,100000.0,100000.0
mean,4196131000.0,1157.492382,1066.813736,196594.509456,,51.83235,70.335882,5614210.0,1618228.0
std,680278100.0,484.922067,225.816816,32252.264663,,9.734104,32.780404,2153896.0,593598.4
min,1009245000.0,0.0,1.0,1.0,,1.0,25.0,99999.0,99999.0
25%,4346037000.0,850.0,923.0,201901.0,,53.0,63.0,6423665.0,1826502.0
50%,4346652000.0,1129.0,1049.0,201905.0,,54.0,68.0,6453386.0,1842885.0
75%,4347111000.0,1424.0,1205.0,201908.0,,55.0,73.0,6474729.0,1858854.0
max,4348572000.0,2359.0,2359.0,209912.0,,58.0,363.0,6513161.0,1941745.0


Pass the include parameter with the datatype(s) you'd like to get a summary of

In [102]:
df.describe(include=object)

Unnamed: 0,Issue Date,Meter Id,RP State Plate,Make,Body Style,Color,Location,Route,Violation code,Violation Description
count,99994,23588,100000,99893,99896,99953,99989,99515,100000,99990
unique,85,10267,66,231,35,36,55827,931,138,161
top,2019-01-08T00:00:00,53,CA,TOYT,PA,BK,101 LARCHMONT BL N,600,80.69BS,NO PARK/STREET CLEAN
freq,7930,442,93004,16579,87813,22076,213,9030,28734,29100


In [103]:
df.describe(include='all')

Unnamed: 0,Ticket number,Issue Date,Issue time,Meter Id,Marked Time,RP State Plate,Plate Expiry Date,VIN,Make,Body Style,Color,Location,Route,Agency,Violation code,Violation Description,Fine amount,Latitude,Longitude
count,100000.0,99994,99967.0,23588.0,2883.0,100000,90157.0,0.0,99893,99896,99953,99989,99515.0,99994.0,100000,99990,99934.0,100000.0,100000.0
unique,,85,,10267.0,,66,,,231,35,36,55827,931.0,,138,161,,,
top,,2019-01-08T00:00:00,,53.0,,CA,,,TOYT,PA,BK,101 LARCHMONT BL N,600.0,,80.69BS,NO PARK/STREET CLEAN,,,
freq,,7930,,442.0,,93004,,,16579,87813,22076,213,9030.0,,28734,29100,,,
mean,4196131000.0,,1157.492382,,1066.813736,,196594.509456,,,,,,,51.83235,,,70.335882,5614210.0,1618228.0
std,680278100.0,,484.922067,,225.816816,,32252.264663,,,,,,,9.734104,,,32.780404,2153896.0,593598.4
min,1009245000.0,,0.0,,1.0,,1.0,,,,,,,1.0,,,25.0,99999.0,99999.0
25%,4346037000.0,,850.0,,923.0,,201901.0,,,,,,,53.0,,,63.0,6423665.0,1826502.0
50%,4346652000.0,,1129.0,,1049.0,,201905.0,,,,,,,54.0,,,68.0,6453386.0,1842885.0
75%,4347111000.0,,1424.0,,1205.0,,201908.0,,,,,,,55.0,,,73.0,6474729.0,1858854.0


### Dropping columns from a Dataframe

Let's take a look at the percentage of missing data in each column

In [104]:
percent_missing = df.isnull().sum() * 100 / len(df)
percent_missing.sort_values(ascending=False, inplace=True)
percent_missing

VIN                      100.000
Marked Time               97.117
Meter Id                  76.412
Plate Expiry Date          9.843
Route                      0.485
Make                       0.107
Body Style                 0.104
Fine amount                0.066
Color                      0.047
Issue time                 0.033
Location                   0.011
Violation Description      0.010
Issue Date                 0.006
Agency                     0.006
Longitude                  0.000
RP State Plate             0.000
Latitude                   0.000
Violation code             0.000
Ticket number              0.000
dtype: float64

We see that the VIN, Marked Time, and Meter Id columns all have a high percentage of NULL values, so we'll simply drop them. <br> 
Let's say that for this analysis we're also not concerned with the Route, the Agency, or the Longitude/Latitude, so let's drop those as well.

Pass the drop() method a list of the columns we'd like to drop and specify the axis as 1 (the columns axis). The inplace parameter makes the method modify our current DataFrame directly instead of returning a modified copy.
Think of it as the difference between:
-  x = x  + 1; (build a new value, then assign it back)
-  x++; (modify the existing value in place)
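
The difference is easiest to see on a tiny DataFrame. The sketch below uses a made-up frame named `toy` with placeholder columns `a`, `b`, and `c`.

```python
import pandas as pd

toy = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]})

# Without inplace, drop() returns a new DataFrame and leaves toy untouched
trimmed = toy.drop(["c"], axis=1)
print(list(toy.columns), list(trimmed.columns))

# With inplace=True, toy itself is modified and the call returns None
result = toy.drop(["b"], axis=1, inplace=True)
print(list(toy.columns), result)
```

Because the in-place call returns None, writing `toy = toy.drop(..., inplace=True)` would replace the DataFrame with None, which is a common slip.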


In [105]:
columns_to_drop = ["VIN", "Marked Time", "Meter Id", "Route", "Agency", "Longitude", "Latitude"]
df.drop(columns_to_drop, axis = 1, inplace = True)

In [106]:
df.head()

Unnamed: 0,Ticket number,Issue Date,Issue time,RP State Plate,Plate Expiry Date,Make,Body Style,Color,Location,Violation code,Violation Description,Fine amount
0,4346620795,2019-01-04T00:00:00,814.0,CA,201811.0,SUBA,PA,BL,5242 CARTWRIGHT AVE,80.69BS,NO PARK/STREET CLEAN,73.0
1,4346620806,2019-01-04T00:00:00,816.0,CA,201909.0,NISS,PA,WT,5334 CARTWRIGHT AVE,80.69BS,NO PARK/STREET CLEAN,73.0
2,4346620810,2019-01-04T00:00:00,818.0,CA,201905.0,LEXS,PA,BK,5338 CARTWRIGHT AVE,80.69BS,NO PARK/STREET CLEAN,73.0
3,4346620821,2019-01-04T00:00:00,819.0,CA,201905.0,LEXS,PA,BK,5338 CARTWRIGHT AVE,5200,DISPLAY OF PLATES,25.0
4,4346620832,2019-01-04T00:00:00,824.0,CA,201905.0,CHEV,PA,GY,5330 RIVERTON AVE,80.69BS,NO PARK/STREET CLEAN,73.0


In [107]:
percent_missing = df.isnull().sum() * 100 / len(df)
percent_missing.sort_values(ascending=False, inplace=True)
percent_missing

Plate Expiry Date        9.843
Make                     0.107
Body Style               0.104
Fine amount              0.066
Color                    0.047
Issue time               0.033
Location                 0.011
Violation Description    0.010
Issue Date               0.006
Violation code           0.000
RP State Plate           0.000
Ticket number            0.000
dtype: float64

For the sake of demonstration, we can also drop rows with this method. Specify the axis as 0 for rows, and instead of column names we'll now pass index labels

In [108]:
# Rows 0, 1, and 2 will be dropped from the DataFrame
# Also, notice we do not perform this method inplace. This way we are not permanently altering our DataFrame

df.drop([0, 1, 2], axis = 0)

Unnamed: 0,Ticket number,Issue Date,Issue time,RP State Plate,Plate Expiry Date,Make,Body Style,Color,Location,Violation code,Violation Description,Fine amount
3,4346620821,2019-01-04T00:00:00,819.0,CA,201905.0,LEXS,PA,BK,5338 CARTWRIGHT AVE,5200,DISPLAY OF PLATES,25.0
4,4346620832,2019-01-04T00:00:00,824.0,CA,201905.0,CHEV,PA,GY,5330 RIVERTON AVE,80.69BS,NO PARK/STREET CLEAN,73.0
5,4346620843,2019-01-04T00:00:00,829.0,CA,201901.0,KIA,PA,GY,5220 HARMONY AVE,80.69BS,NO PARK/STREET CLEAN,73.0
6,4346620865,2019-01-04T00:00:00,830.0,CA,201901.0,KIA,PA,GY,5220 HARMONY AVE,5200,DISPLAY OF PLATES,25.0
7,4346620876,2019-01-04T00:00:00,836.0,CA,201707.0,FORD,OT,WT,10831 MAGNOLIA BLVD,80.69BS,NO PARK/STREET CLEAN,73.0
8,4346620880,2019-01-04T00:00:00,838.0,CA,201707.0,FORD,OT,WT,10831 MAGNOLIA BLVD,5204A-,DISPLAY OF TABS,25.0
9,4346620891,2019-01-04T00:00:00,842.0,CA,201901.0,NISS,PA,WT,5200 SATSUMA AVE,80.69BS,NO PARK/STREET CLEAN,73.0
10,4346620902,2019-01-04T00:00:00,846.0,CA,201907.0,FORD,PA,BL,10800 WEDDINGTON ST,80.69BS,NO PARK/STREET CLEAN,73.0
11,4346620913,2019-01-04T00:00:00,928.0,CA,201811.0,MERZ,PA,GY,5200 LANKERSHIM BL,88.13B+,METER EXP.,63.0
12,4346620924,2019-01-04T00:00:00,929.0,CA,201811.0,MERZ,PA,GY,5200 LANKERSHIM BL,5204A-,DISPLAY OF TABS,25.0


In [109]:
df.head()

Unnamed: 0,Ticket number,Issue Date,Issue time,RP State Plate,Plate Expiry Date,Make,Body Style,Color,Location,Violation code,Violation Description,Fine amount
0,4346620795,2019-01-04T00:00:00,814.0,CA,201811.0,SUBA,PA,BL,5242 CARTWRIGHT AVE,80.69BS,NO PARK/STREET CLEAN,73.0
1,4346620806,2019-01-04T00:00:00,816.0,CA,201909.0,NISS,PA,WT,5334 CARTWRIGHT AVE,80.69BS,NO PARK/STREET CLEAN,73.0
2,4346620810,2019-01-04T00:00:00,818.0,CA,201905.0,LEXS,PA,BK,5338 CARTWRIGHT AVE,80.69BS,NO PARK/STREET CLEAN,73.0
3,4346620821,2019-01-04T00:00:00,819.0,CA,201905.0,LEXS,PA,BK,5338 CARTWRIGHT AVE,5200,DISPLAY OF PLATES,25.0
4,4346620832,2019-01-04T00:00:00,824.0,CA,201905.0,CHEV,PA,GY,5330 RIVERTON AVE,80.69BS,NO PARK/STREET CLEAN,73.0


### Creating a unique index for your DataFrame

Pandas allows us to slice and access rows of our DataFrame using their unique index labels. In many cases, it is helpful to use a unique identifying field from the data as the index, rather than having our rows labeled 0 - 99999. In this case,
Ticket number would function as an excellent index for us

In [110]:
# Ensuring Ticket Numbers are in fact, unique
df['Ticket number'].is_unique

True

In [111]:
df.set_index('Ticket number', inplace = True)
df.head()

Unnamed: 0_level_0,Issue Date,Issue time,RP State Plate,Plate Expiry Date,Make,Body Style,Color,Location,Violation code,Violation Description,Fine amount
Ticket number,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
4346620795,2019-01-04T00:00:00,814.0,CA,201811.0,SUBA,PA,BL,5242 CARTWRIGHT AVE,80.69BS,NO PARK/STREET CLEAN,73.0
4346620806,2019-01-04T00:00:00,816.0,CA,201909.0,NISS,PA,WT,5334 CARTWRIGHT AVE,80.69BS,NO PARK/STREET CLEAN,73.0
4346620810,2019-01-04T00:00:00,818.0,CA,201905.0,LEXS,PA,BK,5338 CARTWRIGHT AVE,80.69BS,NO PARK/STREET CLEAN,73.0
4346620821,2019-01-04T00:00:00,819.0,CA,201905.0,LEXS,PA,BK,5338 CARTWRIGHT AVE,5200,DISPLAY OF PLATES,25.0
4346620832,2019-01-04T00:00:00,824.0,CA,201905.0,CHEV,PA,GY,5330 RIVERTON AVE,80.69BS,NO PARK/STREET CLEAN,73.0


### Indexing your Dataframe

#### pd.DataFrame.loc[ ]  allows us to do label-based indexing

This means accessing records using their unique label (index), without regard to their position in the DataFrame. In our case the unique label is now the ticket number

In [116]:
# This will return the record of ticket number 4346620795 (It happens to be the first row in our DataFrame) 
df.loc[4346620795]

Issue Date                2019-01-04T00:00:00
Issue time                                814
RP State Plate                             CA
Plate Expiry Date                      201811
Make                                     SUBA
Body Style                                 PA
Color                                      BL
Location                  5242 CARTWRIGHT AVE
Violation code                        80.69BS
Violation Description    NO PARK/STREET CLEAN
Fine amount                                73
Name: 4346620795, dtype: object

#### pd.DataFrame.iloc[ ] allows us to do position-based indexing

This means accessing a row based on what row number it is in the DataFrame. To access the first record in the DataFrame (which we also pulled above) do:

In [117]:
df.iloc[0]

Issue Date                2019-01-04T00:00:00
Issue time                                814
RP State Plate                             CA
Plate Expiry Date                      201811
Make                                     SUBA
Body Style                                 PA
Color                                      BL
Location                  5242 CARTWRIGHT AVE
Violation code                        80.69BS
Violation Description    NO PARK/STREET CLEAN
Fine amount                                73
Name: 4346620795, dtype: object

This indexer also allows NumPy-like slicing of our DataFrame. For example, to retrieve the last 2,000 records of the DataFrame we can do:

In [118]:
df.iloc[(len(df)-2000):]

Unnamed: 0_level_0,Issue Date,Issue time,RP State Plate,Plate Expiry Date,Make,Body Style,Color,Location,Violation code,Violation Description,Fine amount
Ticket number,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
4347624525,2019-01-24T00:00:00,2014.0,CA,201912.0,NISS,PA,GY,2980 FRANCIS AVE,22500H,DOUBLE PARKING,68.0
4347624536,2019-01-24T00:00:00,2017.0,CA,202001.0,DODG,PA,GN,2984 FRANCIS AVE,5200,DISPLAY OF PLATES,25.0
4347624540,2019-01-24T00:00:00,2259.0,CA,201812.0,MERZ,PA,BK,1145 MOHAWK ST,80.58L,PREFERENTIAL PARKING,68.0
4347624551,2019-01-24T00:00:00,2301.0,CA,201911.0,FORD,PA,GY,1141 MOHAWK ST,80.58L,PREFERENTIAL PARKING,68.0
4347624562,2019-01-24T00:00:00,2327.0,CA,201901.0,CHEV,PA,BL,981 MADISON AVE N,80.56E4+,RED ZONE,93.0
4347624573,2019-01-24T00:00:00,2329.0,OR,202005.0,CHEV,PA,BL,4570 SANTA MONICA BLVD,80.69AA+,NO STOP/STAND,93.0
4347624584,2019-01-24T00:00:00,2331.0,CA,201805.0,VOLK,PA,BL,4570 SANTA MONICA BLVD,80.69AA+,NO STOP/STAND,93.0
4347624595,2019-01-24T00:00:00,2331.0,CA,201805.0,VOLK,PA,BL,4570 SANTA MONICA BLVD,5204A-,DISPLAY OF TABS,25.0
4347624606,2019-01-24T00:00:00,2332.0,CA,201904.0,TOYT,PA,SL,4570 SANTA MONICA BLVD,80.69AA+,NO STOP/STAND,93.0
4347624610,2019-01-24T00:00:00,2350.0,CA,201811.0,VOLV,PA,BL,1821 SILVER LAKE DR W,80.58L,PREFERENTIAL PARKING,68.0
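
To make the contrast between the two indexers concrete, here is a small sketch on a three-row frame. The frame `toy` is made up for illustration and reuses three ticket numbers as its index.

```python
import pandas as pd

toy = pd.DataFrame(
    {"Make": ["SUBA", "NISS", "LEXS"]},
    index=pd.Index([4346620795, 4346620806, 4346620810], name="Ticket number"),
)

# loc looks rows up by index label; iloc looks them up by integer position
by_label = toy.loc[4346620806]
by_position = toy.iloc[1]
print(by_label["Make"], by_position["Make"])

# Label slices include both endpoints, position slices exclude the stop
print(len(toy.loc[4346620795:4346620806]))
print(len(toy.iloc[0:1]))
```

One detail worth remembering is that a label slice with loc includes its end label, while a position slice with iloc stops before its end position, just like a Python list.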


### Dealing with NaN or Inaccurate values

In [119]:
percent_missing

Plate Expiry Date        9.843
Make                     0.107
Body Style               0.104
Fine amount              0.066
Color                    0.047
Issue time               0.033
Location                 0.011
Violation Description    0.010
Issue Date               0.006
Violation code           0.000
RP State Plate           0.000
Ticket number            0.000
dtype: float64

In [120]:
df.dtypes

Issue Date                object
Issue time               float64
RP State Plate            object
Plate Expiry Date        float64
Make                      object
Body Style                object
Color                     object
Location                  object
Violation code            object
Violation Description     object
Fine amount              float64
dtype: object

We'll go ahead and fill the NULL values in the numeric columns with 0. These columns are Issue time, Fine amount, and Plate Expiry Date. Since their values are all whole numbers, we'll then convert them to integers 

In [121]:
# Assign the result back rather than calling fillna(inplace=True) on a selected column,
# which newer pandas versions warn about and may not apply to df itself
df['Issue time'] = df['Issue time'].fillna(0).astype(int)
df['Fine amount'] = df['Fine amount'].fillna(0).astype(int)
df['Plate Expiry Date'] = df['Plate Expiry Date'].fillna(0).astype(int)

df.dtypes

Issue Date               object
Issue time                int64
RP State Plate           object
Plate Expiry Date         int64
Make                     object
Body Style               object
Color                    object
Location                 object
Violation code           object
Violation Description    object
Fine amount               int64
dtype: object

In [122]:
percent_missing = df.isnull().sum() * 100 / len(df)
percent_missing.sort_values(ascending=False, inplace=True)
percent_missing

Make                     0.107
Body Style               0.104
Color                    0.047
Location                 0.011
Violation Description    0.010
Issue Date               0.006
Fine amount              0.000
Violation code           0.000
Plate Expiry Date        0.000
RP State Plate           0.000
Issue time               0.000
dtype: float64

Okay, now we're getting there. Let's recap: We started by dropping all the columns that we either weren't interested in, or simply had too many missing values to be useful. We then created a unique index for the data, Ticket Number, and filled in missing values in our numeric columns with an arbitrary value. Let's take a look at the dataset again to decide if any further manipulation is necessary...

### Cleaning up our Columns 

In [123]:
df.head()

Unnamed: 0_level_0,Issue Date,Issue time,RP State Plate,Plate Expiry Date,Make,Body Style,Color,Location,Violation code,Violation Description,Fine amount
Ticket number,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
4346620795,2019-01-04T00:00:00,814,CA,201811,SUBA,PA,BL,5242 CARTWRIGHT AVE,80.69BS,NO PARK/STREET CLEAN,73
4346620806,2019-01-04T00:00:00,816,CA,201909,NISS,PA,WT,5334 CARTWRIGHT AVE,80.69BS,NO PARK/STREET CLEAN,73
4346620810,2019-01-04T00:00:00,818,CA,201905,LEXS,PA,BK,5338 CARTWRIGHT AVE,80.69BS,NO PARK/STREET CLEAN,73
4346620821,2019-01-04T00:00:00,819,CA,201905,LEXS,PA,BK,5338 CARTWRIGHT AVE,5200,DISPLAY OF PLATES,25
4346620832,2019-01-04T00:00:00,824,CA,201905,CHEV,PA,GY,5330 RIVERTON AVE,80.69BS,NO PARK/STREET CLEAN,73


The first thing I notice is that Plate Expiration Date is in integer form. We'd like to turn this column into a date-time type with a proper year-month format. Let's look at the unique values...

In [124]:
unique_dates = pd.Series(np.sort(df['Plate Expiry Date'].unique()))
# Series.append was removed in pandas 2.0; pd.concat joins the two slices the same way
pd.concat([unique_dates[:15], unique_dates[len(unique_dates)-10:]])

0           0
1           1
2           2
3           3
4           4
5           5
6           6
7           7
8           8
9           9
10         10
11         11
12         12
13     200102
14     200103
226    207401
227    208410
228    209004
229    209011
230    209102
231    209105
232    209512
233    209609
234    209907
235    209912
dtype: int64

We have a couple of things to deal with here. The first thing to tackle is the outliers... The expiration dates seem to range from year 2000 to 2099, so the integers 1 through 12 don't mean much (0 came from our NULL values). I'm going to treat these outliers as missing dates and simply replace them with 0

#### To do this let's utilize Numpy.where 

&emsp;&emsp;&emsp;np.where(condition, then, else)
<br>
<br>
np.where evaluates the condition for every element of the column we pass to it. Where the condition is True it uses the 'then' value; otherwise it uses the 'else' value
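
As a quick sketch of the same pattern on different data, the snippet below labels a handful of fine amounts. The threshold of 70 and the labels "high" and "low" are arbitrary choices made only for this illustration.

```python
import numpy as np
import pandas as pd

fines = pd.Series([25, 63, 73, 363])

# Where the condition holds the result is "high", everywhere else it is "low"
labels = np.where(fines > 70, "high", "low")
print(labels)
```

Note that np.where returns a NumPy array, which Pandas accepts when it is assigned back to a DataFrame column.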

In [125]:
df['Plate Expiry Date'] = np.where(df['Plate Expiry Date'] <= 12, 0, df['Plate Expiry Date'])

In [126]:
unique_dates = pd.Series(np.sort(df['Plate Expiry Date'].unique()))
unique_dates.iloc[:10]

0         0
1    200102
2    200103
3    200104
4    200105
5    200108
6    200109
7    200112
8    200118
9    200119
dtype: int64

Awesome, we've replaced all of those outliers with 0. Now let's take a look at how to convert these integers to a date format

&emsp; We'll utilize **pd.to_datetime()** <br><br>
This function parses the column we pass to it and converts it to the Pandas **datetime** type <br>
-   The datetime type is commonly used when dealing with dates, as it provides a great deal of functionality and makes these columns much easier to work with
-  The **errors** parameter controls how to handle elements that can't be interpreted as a date; **'coerce'** turns them into missing values (NaT)
-  We also pass the format of the column we'll be parsing. In this case the Plate Expiry Date ints are in year-month format: **%Y%m**
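
Here is a minimal sketch of how **errors='coerce'** behaves, using four hand-picked values written as strings: two valid year-months, the placeholder 0, and an entry with an impossible month.

```python
import pandas as pd

raw = pd.Series(["201811", "201905", "0", "200118"])

# "0" has no year-month reading and "200118" has month 18, so both are coerced to NaT
parsed = pd.to_datetime(raw, format="%Y%m", errors="coerce")
print(parsed)
print(parsed.isna().tolist())
```

Valid entries land on the first day of their month, since the format carries no day.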

In [127]:
df['Plate Expiry Date'] = pd.to_datetime(df['Plate Expiry Date'], errors='coerce', format='%Y%m')
df.head()

Unnamed: 0_level_0,Issue Date,Issue time,RP State Plate,Plate Expiry Date,Make,Body Style,Color,Location,Violation code,Violation Description,Fine amount
Ticket number,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
4346620795,2019-01-04T00:00:00,814,CA,2018-11-01,SUBA,PA,BL,5242 CARTWRIGHT AVE,80.69BS,NO PARK/STREET CLEAN,73
4346620806,2019-01-04T00:00:00,816,CA,2019-09-01,NISS,PA,WT,5334 CARTWRIGHT AVE,80.69BS,NO PARK/STREET CLEAN,73
4346620810,2019-01-04T00:00:00,818,CA,2019-05-01,LEXS,PA,BK,5338 CARTWRIGHT AVE,80.69BS,NO PARK/STREET CLEAN,73
4346620821,2019-01-04T00:00:00,819,CA,2019-05-01,LEXS,PA,BK,5338 CARTWRIGHT AVE,5200,DISPLAY OF PLATES,25
4346620832,2019-01-04T00:00:00,824,CA,2019-05-01,CHEV,PA,GY,5330 RIVERTON AVE,80.69BS,NO PARK/STREET CLEAN,73


The last column which needs a bit of cleaning is the **Issue Date** column. We'd like to chop off the end of each string in the column since it seems every entry has 'T00:00:00' tacked on. Let's take a look at how we can do this...

Pandas provides a number of nifty and easy to use vectorized string operations through **pd.Series.str**. Some examples are:
-  pd.Series.str.split()
-  pd.Series.str.replace()
-  pd.Series.str.contains() <br>
<br> And the one which we'll utilize...
-  **pd.Series.str.extract()**
<br> This will allow us to extract the part of each string in the column that matches the Regular Expression we pass to it 
<br> The Regular Expression **\d{4}-\d{2}-\d{2}** will search for the pattern: Any 4 digits - Any 2 digits - Any 2 digits
<br>
<br> *If you're not familiar with RegEx don't worry too much as it's not the purpose of today's lecture*
<br> &emsp; *If you'd like to read more on RegEx visit: https://regexr.com/.*
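
A small sketch of the extraction on three sample values, one of them missing. The Series `stamps` is made up for illustration; the pattern wraps the date in parentheses so that it forms a capture group, which is what extract() returns.

```python
import pandas as pd

stamps = pd.Series(["2019-01-04T00:00:00", "2019-01-24T00:00:00", None])

# The parentheses form a capture group; extract() returns what the group matched.
# expand=False gives back a Series instead of a one-column DataFrame.
dates = stamps.str.extract(r"^(\d{4}-\d{2}-\d{2})", expand=False)
print(dates.tolist())
```

Entries that are missing or do not match the pattern come back as missing values rather than raising an error.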

In [128]:
df['Issue Date'] = df['Issue Date'].str.extract(r'^(\d{4}-\d{2}-\d{2})', expand = False)
df.head()

Unnamed: 0_level_0,Issue Date,Issue time,RP State Plate,Plate Expiry Date,Make,Body Style,Color,Location,Violation code,Violation Description,Fine amount
Ticket number,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
4346620795,2019-01-04,814,CA,2018-11-01,SUBA,PA,BL,5242 CARTWRIGHT AVE,80.69BS,NO PARK/STREET CLEAN,73
4346620806,2019-01-04,816,CA,2019-09-01,NISS,PA,WT,5334 CARTWRIGHT AVE,80.69BS,NO PARK/STREET CLEAN,73
4346620810,2019-01-04,818,CA,2019-05-01,LEXS,PA,BK,5338 CARTWRIGHT AVE,80.69BS,NO PARK/STREET CLEAN,73
4346620821,2019-01-04,819,CA,2019-05-01,LEXS,PA,BK,5338 CARTWRIGHT AVE,5200,DISPLAY OF PLATES,25
4346620832,2019-01-04,824,CA,2019-05-01,CHEV,PA,GY,5330 RIVERTON AVE,80.69BS,NO PARK/STREET CLEAN,73


Okay, so we've extracted the date portion of **Issue Date**. As you can see below, this column is still an object, or essentially a string. Similar to Plate Expiration Date, we'd like to convert this to Datetime format so we can make use of Pandas' Datetime functionality later down the road.

In [129]:
df.dtypes

Issue Date                       object
Issue time                        int64
RP State Plate                   object
Plate Expiry Date        datetime64[ns]
Make                             object
Body Style                       object
Color                            object
Location                         object
Violation code                   object
Violation Description            object
Fine amount                       int64
dtype: object

Note how this time the format to be parsed is a bit different. In this case, Issue Date is already laid out as **%Y-%m-%d**; we just want to convert it to the datetime datatype

In [130]:
df['Issue Date'] = pd.to_datetime(df['Issue Date'], errors='coerce', format='%Y-%m-%d')
df.dtypes

Issue Date               datetime64[ns]
Issue time                        int64
RP State Plate                   object
Plate Expiry Date        datetime64[ns]
Make                             object
Body Style                       object
Color                            object
Location                         object
Violation code                   object
Violation Description            object
Fine amount                       int64
dtype: object
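
To see what **errors='coerce'** and **format** do in isolation, here is a minimal sketch on a small made-up Series. The sample values are assumptions for illustration, not rows from the citation dataset. Values that don't match the format become NaT rather than raising an error.

```python
import pandas as pd

# Hypothetical sample values, not taken from the parking citation dataset
raw = pd.Series(['2019-01-04', '2018-12-31', 'not a date', None])

# Strings matching %Y-%m-%d are parsed; everything else is coerced to NaT
parsed = pd.to_datetime(raw, errors='coerce', format='%Y-%m-%d')

print(parsed)
print('Missing dates:', parsed.isna().sum())
```

Counting the NaT values right after a conversion like this is a cheap way to find out how many rows failed to parse.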

In [131]:
df.head()

Ticket number,Issue Date,Issue time,RP State Plate,Plate Expiry Date,Make,Body Style,Color,Location,Violation code,Violation Description,Fine amount
4346620795,2019-01-04,814,CA,2018-11-01,SUBA,PA,BL,5242 CARTWRIGHT AVE,80.69BS,NO PARK/STREET CLEAN,73
4346620806,2019-01-04,816,CA,2019-09-01,NISS,PA,WT,5334 CARTWRIGHT AVE,80.69BS,NO PARK/STREET CLEAN,73
4346620810,2019-01-04,818,CA,2019-05-01,LEXS,PA,BK,5338 CARTWRIGHT AVE,80.69BS,NO PARK/STREET CLEAN,73
4346620821,2019-01-04,819,CA,2019-05-01,LEXS,PA,BK,5338 CARTWRIGHT AVE,5200,DISPLAY OF PLATES,25
4346620832,2019-01-04,824,CA,2019-05-01,CHEV,PA,GY,5330 RIVERTON AVE,80.69BS,NO PARK/STREET CLEAN,73


We're lookin good!

#### If we've got time we can cover these extra topics 

### Calculated Columns

A big part of Machine Learning and Data Science is brainstorming and creating new features that extract more information from the data than is visible at first glance. In this case we might ask whether the number of tickets issued is correlated with the season... Let's create a new feature, or column, in this dataset for the season of the Issue Date

Creating a new column is as simple as 
-  df['New Column Name'] = an expression or conditional that sets the values of the new column
<br><br> Here we'll apply a small calculation to the month of Issue Date to determine the season (numbered 1 to 4)
<br> **Series.dt** provides a number of datetime functions if the column is in datetime format
-  Specifically, Series.dt.month returns the month of each date as a number, e.g. January = 1. Because some Issue Dates are missing (NaT), the result here is a float Series (January = 1.0) with NaN for the missing dates, which is why the code below fills NaN with 0 before converting to int
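
As a quick illustration of the **.dt** accessor, the sketch below uses two made-up dates plus one missing value (all of them assumptions for demonstration). It shows that the month comes back as a float once a NaT is present.

```python
import pandas as pd

# Hypothetical dates for illustration; None becomes NaT (a missing date)
dates = pd.to_datetime(pd.Series(['2019-01-04', '2018-12-31', None]))

months = dates.dt.month          # float values, NaN for the missing date
weekdays = dates.dt.day_name()   # weekday names, missing for the NaT row

print(months.tolist())
print(weekdays.tolist())
```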

In [132]:
#    Month      Season
# 12 | 1 | 2 = 'Winter' or 1
# 3 | 4 | 5 = 'Spring' or 2
# 6 | 7 | 8 = 'Summer' or 3
# 9 | 10 | 11 = 'Fall' or 4

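# month % 12 sends December to 0 so it groups with January and February
# Rows with a missing Issue Date (NaT) produce NaN, which we encode as season 0
df['Issue Season'] = ((df['Issue Date'].dt.month % 12 + 3) // 3).fillna(0.0).astype(int)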
df.head()

Ticket number,Issue Date,Issue time,RP State Plate,Plate Expiry Date,Make,Body Style,Color,Location,Violation code,Violation Description,Fine amount,Issue Season
4346620795,2019-01-04,814,CA,2018-11-01,SUBA,PA,BL,5242 CARTWRIGHT AVE,80.69BS,NO PARK/STREET CLEAN,73,1
4346620806,2019-01-04,816,CA,2019-09-01,NISS,PA,WT,5334 CARTWRIGHT AVE,80.69BS,NO PARK/STREET CLEAN,73,1
4346620810,2019-01-04,818,CA,2019-05-01,LEXS,PA,BK,5338 CARTWRIGHT AVE,80.69BS,NO PARK/STREET CLEAN,73,1
4346620821,2019-01-04,819,CA,2019-05-01,LEXS,PA,BK,5338 CARTWRIGHT AVE,5200,DISPLAY OF PLATES,25,1
4346620832,2019-01-04,824,CA,2019-05-01,CHEV,PA,GY,5330 RIVERTON AVE,80.69BS,NO PARK/STREET CLEAN,73,1


In [137]:
df['Issue Date'].dt.month.unique()

array([ 1., 12.,  9., 10., 11.,  2., nan,  5.,  3.])

In [139]:
df['Issue Season'].unique()

array([1, 4, 0, 2])
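
The season expression can be checked month by month. The sketch below applies the same arithmetic to all twelve months, plus one NaN standing in for a missing Issue Date; the input Series is made up for the check.

```python
import numpy as np
import pandas as pd

# All twelve months, plus NaN to stand in for a missing Issue Date
months = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, np.nan])

# Same arithmetic as the Issue Season column
seasons = ((months % 12 + 3) // 3).fillna(0.0).astype(int)

for month, season in zip(months, seasons):
    print(month, '->', season)
```

The `% 12` step maps December to 0 so that it lands in the same group as January and February, and the integer division by 3 then buckets the months into groups of three. The missing value ends up as season 0.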

### Sorting and filtering a Dataframe

#### Now, as an example, let's filter down to just the tickets issued in the Winter, and then sort the data by Issue Date

In [140]:
df.shape

(100000, 12)

Filtering is done by passing the condition we want to filter on into the original DataFrame<br><br>Here, the inner condition produces a boolean Series of length 100000: True where Issue Season is 1, and False otherwise <br> Passing this boolean Series to our original DataFrame keeps only the rows where the condition is True

In [141]:
df_winter = df[df['Issue Season'] == 1]
df_winter.head()

Ticket number,Issue Date,Issue time,RP State Plate,Plate Expiry Date,Make,Body Style,Color,Location,Violation code,Violation Description,Fine amount,Issue Season
4346620795,2019-01-04,814,CA,2018-11-01,SUBA,PA,BL,5242 CARTWRIGHT AVE,80.69BS,NO PARK/STREET CLEAN,73,1
4346620806,2019-01-04,816,CA,2019-09-01,NISS,PA,WT,5334 CARTWRIGHT AVE,80.69BS,NO PARK/STREET CLEAN,73,1
4346620810,2019-01-04,818,CA,2019-05-01,LEXS,PA,BK,5338 CARTWRIGHT AVE,80.69BS,NO PARK/STREET CLEAN,73,1
4346620821,2019-01-04,819,CA,2019-05-01,LEXS,PA,BK,5338 CARTWRIGHT AVE,5200,DISPLAY OF PLATES,25,1
4346620832,2019-01-04,824,CA,2019-05-01,CHEV,PA,GY,5330 RIVERTON AVE,80.69BS,NO PARK/STREET CLEAN,73,1


In [142]:
df_winter.shape

(99973, 12)

And, as expected, this new DataFrame is a subset of our original DataFrame
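
To make the two steps of boolean filtering visible, here is a minimal sketch on a made-up five-row DataFrame. The `tickets` name and its values are assumptions; only the column names mirror the dataset.

```python
import pandas as pd

# Hypothetical mini-DataFrame; column names mirror the lecture's dataset
tickets = pd.DataFrame({
    'Issue Season': [1, 4, 1, 0, 2],
    'Fine amount': [73, 25, 63, 50, 68],
})

# Step 1: the comparison alone yields one True/False value per row
mask = tickets['Issue Season'] == 1
print(mask.tolist())

# Step 2: indexing with the mask keeps only the rows where it is True
winter = tickets[mask]
print(winter.shape)
```

Since True counts as 1, `mask.sum()` gives the number of rows the filter will keep without building the filtered DataFrame first.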

Sometimes you need to filter a dataset on whether a column matches any of several values, and you don't want to write a long OR statement... <br> In that case you can filter your DataFrame against a list, like so:

In [147]:
seasons = [0, 2, 3, 4]

As you may have guessed, we're going to filter our data on tickets issued in every season OTHER than winter (0 marks tickets whose Issue Date is missing). <br>
The technique is the same, but the conditional looks a bit different. Let's take a look...

In [148]:
df_not_winter = df[df['Issue Season'].isin(seasons)]

In [149]:
print('Seasons: ' + str(df_not_winter['Issue Season'].unique()) + '    Shape: ' + str(df_not_winter.shape))
print('Seasons: ' + str(df_winter['Issue Season'].unique()) + '        Shape: ' + str(df_winter.shape))

Seasons: [4 0 2]    Shape: (27, 12)
Seasons: [1]        Shape: (99973, 12)


As you can see, this new DataFrame contains every season except Winter, and is the complementary set to the previous DataFrame we made
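
As a small sketch of how **isin** relates to the alternatives, the example below builds the same "not winter" subset three ways on a made-up DataFrame (the `tickets` data is an assumption): with a list, with the `~` operator negating an **isin** mask, and with the long OR statement that **isin** saves us from writing.

```python
import pandas as pd

# Hypothetical season values for illustration
tickets = pd.DataFrame({'Issue Season': [1, 4, 1, 0, 2, 1]})
season = tickets['Issue Season']

# Keep rows whose season appears in the list
in_list = tickets[season.isin([0, 2, 3, 4])]

# Equivalent here: negate the mask for the single value we want to exclude
not_winter = tickets[~season.isin([1])]

# The long OR statement that isin replaces
long_or = tickets[(season == 0) | (season == 2) | (season == 3) | (season == 4)]

print(in_list['Issue Season'].tolist())
```

The negated form is handy when the list of values to exclude is shorter than the list of values to keep.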

#### The last thing I want to cover is how we can sort a DataFrame

We'll utilize **pd.DataFrame.sort_values()**
<br><br>which allows us to sort the rows of a dataset by column value. In this case let's sort all winter tickets by their Issue Date, and then by their Issue time

In [150]:
# df_winter is the winter subset we filtered out earlier
df_winter.sort_values(by=["Issue Date", "Issue time"], axis=0, ascending=True)

Ticket number,Issue Date,Issue time,RP State Plate,Plate Expiry Date,Make,Body Style,Color,Location,Violation code,Violation Description,Fine amount,Issue Season
1120588372,2010-01-09,750,CA,NaT,NISS,PA,BK,8113 CEDROS,5204A,EXPIRED TAGS,25,1
1126565090,2011-01-12,1118,CA,2018-06-01,HOND,TR,WH,509 W 82ND ST,5204A,EXPIRED TAGS,25,1
1115653582,2013-01-03,939,CA,NaT,FORD,PA,BG,1100 VICTORIAS ADJ,8069BS,NO PARK/STREET CLEAN,73,1
1126565171,2013-01-12,1404,CA,2019-11-01,CHEV,TR,BK,3000 W 67TH ST,11,22500F,68,1
1126639102,2017-01-14,1100,CA,2019-04-01,BUIC,PA,GY,260 S AV 50,8069B,NO PARKING,73,1
1126639135,2017-01-14,1319,OR,2019-10-01,HOND,PA,GY,1252 N AV 51,8813B,METER EXPIRED,63,1
1121321515,2017-12-28,1900,CA,2018-11-01,CADI,PA,SI,58TH E/O HOOVER,5204A,EXPIRED TAGS,25,1
1120628611,2018-01-01,850,CA,2018-07-01,TOYO,PA,SI,BUNKER HILL/ALPINE,4000A1,NO EVIDENCE OF REG,50,1
1120626791,2018-01-01,1130,CA,2018-03-01,BUIC,PA,WH,HOPE/N OF VENICE,4000A1,NO EVIDENCE OF REG,50,1
1120999681,2018-01-01,1145,CA,2017-06-01,PORS,PA,WH,1100 S HOPE ST,4000A1,NO EVIDENCE OF REG,50,1
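
Two details of **sort_values** are easy to miss: `ascending` can take one flag per sort column, and the method returns a new DataFrame rather than sorting in place. The sketch below shows both on a made-up four-row DataFrame (the `tickets` data is an assumption).

```python
import pandas as pd

# Hypothetical rows for illustration
tickets = pd.DataFrame({
    'Issue Date': pd.to_datetime(['2018-01-01', '2017-01-14',
                                  '2018-01-01', '2017-01-14']),
    'Issue time': [1130, 1319, 850, 1100],
})

# Newest dates first; within the same date, earliest time first
ordered = tickets.sort_values(by=['Issue Date', 'Issue time'],
                              ascending=[False, True])
print(ordered)

# The original DataFrame keeps its row order unless we reassign it
print(tickets['Issue time'].tolist())
```

To keep the sorted result, assign it to a variable, for example `tickets = tickets.sort_values(...)`.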


For the sake of demonstration, we can also rearrange the columns of a DataFrame, like so:

In [151]:
# get the list of all columns
columns = list(df.columns)
columns

['Issue Date',
 'Issue time',
 'RP State Plate',
 'Plate Expiry Date',
 'Make',
 'Body Style',
 'Color',
 'Location',
 'Violation code',
 'Violation Description',
 'Fine amount',
 'Issue Season']

In [152]:
#rearrange the list of columns in the order we'd like 
columns = ['Issue Date',
 'Issue time',
 'Issue Season',
 'Fine amount',
 'Violation code',
 'Violation Description',
 'RP State Plate',
 'Plate Expiry Date',
 'Make',
 'Body Style',
 'Color',
 'Location',]

In [153]:
#Finally pass the list of new columns to our original DataFrame
df = df[columns]

In [154]:
df.head()

Ticket number,Issue Date,Issue time,Issue Season,Fine amount,Violation code,Violation Description,RP State Plate,Plate Expiry Date,Make,Body Style,Color,Location
4346620795,2019-01-04,814,1,73,80.69BS,NO PARK/STREET CLEAN,CA,2018-11-01,SUBA,PA,BL,5242 CARTWRIGHT AVE
4346620806,2019-01-04,816,1,73,80.69BS,NO PARK/STREET CLEAN,CA,2019-09-01,NISS,PA,WT,5334 CARTWRIGHT AVE
4346620810,2019-01-04,818,1,73,80.69BS,NO PARK/STREET CLEAN,CA,2019-05-01,LEXS,PA,BK,5338 CARTWRIGHT AVE
4346620821,2019-01-04,819,1,25,5200,DISPLAY OF PLATES,CA,2019-05-01,LEXS,PA,BK,5338 CARTWRIGHT AVE
4346620832,2019-01-04,824,1,73,80.69BS,NO PARK/STREET CLEAN,CA,2019-05-01,CHEV,PA,GY,5330 RIVERTON AVE
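
Retyping the full column list gets tedious on wide datasets. As one possible alternative, the sketch below moves a single column by editing the list in code; the small `tickets` DataFrame is made up for illustration.

```python
import pandas as pd

# Hypothetical one-row DataFrame with a subset of the lecture's columns
tickets = pd.DataFrame({
    'Issue Date': ['2019-01-04'],
    'Issue time': [814],
    'Make': ['SUBA'],
    'Fine amount': [73],
    'Issue Season': [1],
})

columns = list(tickets.columns)

# Take 'Issue Season' out of the list, then put it back at position 2
columns.remove('Issue Season')
columns.insert(2, 'Issue Season')

# Selecting with the reordered list returns the columns in that order
tickets = tickets[columns]
print(list(tickets.columns))
```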


## ALL DONE ! <BR>
### Thanks to everyone for coming out tonight, please remember to sign out!