# Programming and Database Fundamentals for Data Scientists - EAS503

## Installing the IPython (Jupyter) notebook

Start by installing `Python 3`. The easiest way is to install `Anaconda Python`, a free and open-source distribution of Python that bundles many useful libraries.

Follow the instructions at: [https://docs.continuum.io/anaconda/install/](https://docs.continuum.io/anaconda/install/)

You can then start the notebook by running:

```shell
jupyter notebook
```


## Setting up the MySQL database
In this course we will use the MySQL database. You can install it on your system by following the instructions here:
[https://dev.mysql.com/downloads/mysql/](https://dev.mysql.com/downloads/mysql/)

After this step, you should have a MySQL server running on your laptop.

### Installing Python bindings for MySQL
To connect to the MySQL database from within a Python application, you need to install an interface library such as MySQLdb. More information here:

[http://mysql-python.sourceforge.net/](http://mysql-python.sourceforge.net/)

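In Unix-like environments (including macOS), you can try the command below. Note that the original `mysql-python` package supports only Python 2; on Python 3, its fork `mysqlclient` provides the same `MySQLdb` module:
```shell
pip install mysqlclient
```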

### Alternative - PyMySQL
#### Installation
```shell
pip install PyMySQL
```
or
```shell
conda install PyMySQL
```
You might need `sudo` privileges, depending on your Python installation.
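
After installing, it can help to confirm which driver the current Python environment can actually see. The snippet below is a minimal sketch using only the standard library; the helper name `available_drivers` is made up for this illustration and is not part of either package.

```python
import importlib.util

def available_drivers(candidates=("pymysql", "MySQLdb")):
    """Return the candidate driver modules that can be imported here."""
    return [
        name for name in candidates
        if importlib.util.find_spec(name) is not None
    ]

# Prints a list such as ['pymysql'], depending on what is installed.
print(available_drivers())
```

An empty list means neither driver is installed in the environment that runs the notebook, which is a common cause of `ImportError` when several Python installations coexist.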


## Demonstrating a simple data science pipeline
Data available from [Chicago Crime Data](https://data.cityofchicago.org/Public-Safety/Crimes-2001-to-present/ijzp-q8t2)

Before working with the Chicago data, you need to import it into your database. We assume that the database server is running on localhost on the standard port (see the MySQL documentation for other settings). The commands below use placeholder credentials: `username`/`password` in Step 2, and `root`/`root` in Step 3 and in the Python code. Substitute your own, and use stronger usernames and passwords in practice!
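
One way to avoid hard-coding credentials in a notebook is to read them from environment variables. The following is only a sketch: the variable names (`EAS503_DB_HOST` and so on) and the helper `load_db_settings` are invented for this example, and the fallback values are the placeholders used in this document.

```python
import os

def load_db_settings(env=os.environ):
    """Collect connection settings, falling back to the tutorial placeholders."""
    return {
        "host": env.get("EAS503_DB_HOST", "127.0.0.1"),
        "user": env.get("EAS503_DB_USER", "username"),
        "password": env.get("EAS503_DB_PASSWORD", "password"),
        "database": env.get("EAS503_DB_NAME", "eas503db"),
    }

settings = load_db_settings()
# Print only the keys, never the password itself.
print(sorted(settings))
```

With a recent PyMySQL release, a dictionary like this could be passed on as `pymysql.connect(**settings)`; older releases may expect the `passwd` and `db` keyword names used later in this document.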

### Step 1: Getting the data
Download the data from [here](https://www.cse.buffalo.edu/ubds/docs/chicago_crime_data.csv). This file only contains data from 2015 onwards. The full dataset is also available [here](https://www.cse.buffalo.edu/ubds/docs/chicago_crime_data_all.csv).

### Step 2: Setting up the database
Create a new database and an empty table using the SQL statements below. You can either paste them at the `mysql` command prompt, or save them in a text file (`script.sql`) and run it from the command line:

```shell
mysql -u username -p < script.sql
```

You will be prompted for your password. The SQL statements are:

```sql
-- create database
create database if not exists eas503db;
use eas503db;
-- create table
drop table if exists `chicago_crime_data`;
create table `chicago_crime_data` (
  ID int, -- IDs in this dataset exceed the signed mediumint maximum (8388607)
  Case_Number text,
  Date varchar(32),
  Block text,
  IUCR text,
  Primary_Type text,
  Description text,
  Location_Description text,
  Arrest varchar(6),
  Domestic varchar(6),
  Beat text,
  District text,
  Ward int,
  Community_Area text,
  FBI_Code text,
  X_Coordinate float,
  Y_Coordinate float,
  Year int(4),
  Updated_On varchar(32),
  Latitude float,
  Longitude float,
  Location varchar(64)
);
```
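
When picking integer column types, it is worth checking the value range against the data. MySQL's signed `MEDIUMINT` holds values up to 8388607 and `INT` up to 2147483647. The ID values in this dataset (for example 10496347, taken from the sample rows shown later) are larger than the former, so the `ID` column needs `INT` or wider. The check below is a small illustration; the helper `fits` is hypothetical.

```python
# Upper bounds of two signed MySQL integer types.
MEDIUMINT_MAX = 2**23 - 1   # 8388607
INT_MAX = 2**31 - 1         # 2147483647

def fits(value, max_value):
    """True if value lies in the signed range [-max_value - 1, max_value]."""
    return -max_value - 1 <= value <= max_value

sample_id = 10496347  # an ID of the size found in this dataset

print(fits(sample_id, MEDIUMINT_MAX))  # False
print(fits(sample_id, INT_MAX))        # True
```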

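### Step 3: Importing data into the database
We will use `mysqlimport` to do the bulk import; it loads each file into the table whose name matches the file name without its extension (here `chicago_crime_data`). Alternatively, you can run a `LOAD DATA INFILE` statement from the `mysql` prompt, which does the same thing.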

```shell
mysqlimport --local --fields-terminated-by=, --fields-enclosed-by='"' --ignore-lines=1 -u root -proot eas503db chicago_crime_data.csv
```
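
The `--fields-enclosed-by='"'` option matters because the `Location` field contains a comma inside double quotes. The sketch below shows the same effect with Python's standard `csv` module on a made-up, abridged record (three columns instead of the full set):

```python
import csv
import io

# A header and one record shaped like the crime CSV; the last field
# contains a comma inside double quotes.
sample = (
    'ID,Primary Type,Location\n'
    '10496347,CRIMINAL DAMAGE,"(41.88285803, -87.766599362)"\n'
)

reader = csv.reader(io.StringIO(sample))   # default quote character is '"'
header = next(reader)                      # skipped, like --ignore-lines=1
record = next(reader)

# Splitting on commas without honouring quotes breaks the last field in two.
naive = sample.splitlines()[1].split(",")

print(len(record))  # 3
print(len(naive))   # 4
```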

### Step 4: Converting the date elements to the correct data type
A few final adjustments are needed so that the date fields are stored with the correct `datetime` type. From the `mysql` prompt, run the following:
```sql
update chicago_crime_data set Date =  STR_TO_DATE(Date, '%m/%d/%Y %h:%i:%s %p');
alter table chicago_crime_data modify Date datetime;
update chicago_crime_data set Updated_On =  STR_TO_DATE(Updated_On, '%m/%d/%Y %h:%i:%s %p');
alter table chicago_crime_data modify Updated_On datetime;
```
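
For comparison, the same timestamp format can be parsed in Python with `datetime.strptime`. This is a sketch of the format-code mapping only, not part of the import procedure; the function name `parse_crime_timestamp` is made up for this example.

```python
from datetime import datetime

# MySQL '%m/%d/%Y %h:%i:%s %p' corresponds to the strptime pattern below:
# %h -> %I (12-hour clock), %i -> %M (minutes), %s -> %S (seconds).
PY_FORMAT = "%m/%d/%Y %I:%M:%S %p"

def parse_crime_timestamp(text):
    """Parse a timestamp such as '04/23/2016 04:55:00 PM'."""
    return datetime.strptime(text, PY_FORMAT)

parsed = parse_crime_timestamp("04/23/2016 04:55:00 PM")
print(parsed.isoformat(sep=" "))  # 2016-04-23 16:55:00
```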

In [23]:
#import MySQLdb
import pymysql.cursors

In [22]:
# Start the connection to the database
# make sure to close it after you are done
db = pymysql.connect(host="127.0.0.1",    # your host, usually localhost
                     user="root",         # your username
                     passwd="root",       # your password
                     db="eas503db")    # name of the data base

### Let us start with a very simple query
Get the total number of rows in the table.

In [23]:
querystr = 'SELECT count(*) FROM chicago_crime_data'
cur = db.cursor()
cur.execute(querystr)

1

In [24]:
for row in cur.fetchall():
    print("Total number of cases is:")
    print(row)

Total number of cases is:
(436264,)
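
The count is printed as `(436264,)` because each row comes back as a tuple, even when it has a single column. The sketch below shows one way to unpack it with `fetchone()`. It uses the standard library's `sqlite3` with an in-memory table as a stand-in so that it runs without a MySQL server; the three rows are made up, and PyMySQL cursors offer the same `execute`/`fetchone` calls.

```python
import sqlite3
from contextlib import closing

# closing() guarantees the connection is closed, even if a query fails.
with closing(sqlite3.connect(":memory:")) as conn:
    conn.execute("CREATE TABLE chicago_crime_data (ID INTEGER, District TEXT)")
    conn.executemany(
        "INSERT INTO chicago_crime_data VALUES (?, ?)",
        [(1, "011"), (2, "011"), (3, "008")],  # made-up rows
    )
    cur = conn.cursor()
    cur.execute("SELECT count(*) FROM chicago_crime_data")
    (total,) = cur.fetchone()  # unpack the single-column row

print(total)  # 3
```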


### Doing more interesting things with SQL
Which districts in Chicago have had the most reported crimes since 2016? The query sorts the counts in ascending order, so the busiest districts appear last.

In [25]:
querystr = 'SELECT count(*) as cnt, District FROM chicago_crime_data WHERE Date > str_to_date(\'2016/01/01\',\'%Y/%m/%d\') GROUP BY District ORDER BY cnt'
cur = db.cursor()
cur.execute(querystr)

23

In [16]:
for row in cur.fetchall():
    print(row)

(5, '031')
(7108, '020')
(12348, '024')
(12543, '017')
(13844, '022')
(14824, '016')
(16794, '014')
(17878, '015')
(18487, '002')
(19171, '005')
(19674, '019')
(19991, '009')
(20191, '003')
(20339, '010')
(22185, '018')
(22445, '012')
(23000, '007')
(23172, '001')
(23492, '025')
(24143, '004')
(26629, '006')
(27913, '008')
(30011, '011')


### Zooming into District 011 (Harrison)

In [17]:
querystr = 'SELECT count(*) as cnt, Primary_Type FROM chicago_crime_data WHERE District = "011" AND Date > str_to_date(\'2016/01/01\',\'%Y/%m/%d\') GROUP BY Primary_Type ORDER BY cnt'
cur = db.cursor()
cur.execute(querystr)

32

In [18]:
for row in cur.fetchall():
    print(row)

(1, 'NON-CRIMINAL')
(1, 'NON-CRIMINAL (SUBJECT SPECIFIED)')
(1, 'NON - CRIMINAL')
(2, 'HUMAN TRAFFICKING')
(2, 'PUBLIC INDECENCY')
(3, 'CONCEALED CARRY LICENSE VIOLATION')
(4, 'OBSCENITY')
(11, 'INTIMIDATION')
(12, 'STALKING')
(18, 'LIQUOR LAW VIOLATION')
(21, 'KIDNAPPING')
(49, 'ARSON')
(73, 'SEX OFFENSE')
(86, 'GAMBLING')
(144, 'HOMICIDE')
(158, 'CRIM SEXUAL ASSAULT')
(203, 'OFFENSE INVOLVING CHILDREN')
(231, 'INTERFERENCE WITH PUBLIC OFFICER')
(274, 'PUBLIC PEACE VIOLATION')
(644, 'CRIMINAL TRESPASS')
(720, 'WEAPONS VIOLATION')
(745, 'PROSTITUTION')
(787, 'BURGLARY')
(847, 'DECEPTIVE PRACTICE')
(1212, 'MOTOR VEHICLE THEFT')
(1571, 'ROBBERY')
(1768, 'OTHER OFFENSE')
(2200, 'ASSAULT')
(2825, 'CRIMINAL DAMAGE')
(3327, 'THEFT')
(5707, 'NARCOTICS')
(6364, 'BATTERY')


In [26]:
db.close()

### The Python library Pandas
One can read data directly from the CSV file into a Pandas `DataFrame` object.

In [1]:
import pandas as pd

In [9]:
df = pd.read_csv('chicago_crime_data.csv',header=0)

In [45]:
df

| | ID | Case Number | Date | Block | IUCR | Primary Type | Description | Location Description | Arrest | Domestic | ... | Ward | Community Area | FBI Code | X Coordinate | Y Coordinate | Year | Updated On | Latitude | Longitude | Location |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 10496347 | HZ237390 | 04/23/2016 04:55:00 PM | 001XX N PARKSIDE AVE | 1310 | CRIMINAL DAMAGE | TO PROPERTY | APARTMENT | False | False | ... | 29.0 | 25 | 14 | 1138578.0 | 1900413.0 | 2016 | 04/30/2016 03:51:13 PM | 41.882858 | -87.766599 | (41.88285803, -87.766599362) |
| 1 | 10496348 | HZ237355 | 04/23/2016 02:00:00 PM | 0000X E RIVERWALK S | 0820 | THEFT | $500 AND UNDER | SIDEWALK | False | False | ... | 42.0 | 32 | 06 | 1176778.0 | 1902518.0 | 2016 | 04/30/2016 03:51:13 PM | 41.887856 | -87.626264 | (41.887856357, -87.626264274) |
| 2 | 10496349 | HZ237341 | 04/23/2016 04:10:00 PM | 003XX W 35TH ST | 0495 | BATTERY | AGGRAVATED OF A SENIOR CITIZEN | SPORTS ARENA/STADIUM | True | False | ... | 11.0 | 34 | 04B | 1174431.0 | 1881739.0 | 2016 | 04/30/2016 03:51:13 PM | 41.830890 | -87.635503 | (41.830890037, -87.635503335) |
| 3 | 10496350 | HZ237330 | 04/23/2016 01:30:00 PM | 040XX N MAJOR AVE | 031A | ROBBERY | ARMED: HANDGUN | ALLEY | False | False | ... | 38.0 | 15 | 03 | 1137626.0 | 1926291.0 | 2016 | 04/30/2016 03:51:13 PM | 41.953887 | -87.769470 | (41.953887423, -87.76947041) |
| 4 | 10496351 | HZ237402 | 04/23/2016 04:45:00 PM | 084XX S DREXEL AVE | 2820 | OTHER OFFENSE | TELEPHONE THREAT | RESIDENCE | False | True | ... | 8.0 | 44 | 26 | 1183692.0 | 1849271.0 | 2016 | 04/30/2016 03:51:13 PM | 41.741584 | -87.602537 | (41.741583562, -87.602537135) |
| 5 | 10496352 | HZ237376 | 04/23/2016 04:39:00 PM | 018XX S ST LOUIS AVE | 0820 | THEFT | $500 AND UNDER | APARTMENT | False | True | ... | 24.0 | 29 | 06 | 1153327.0 | 1890818.0 | 2016 | 04/30/2016 03:51:13 PM | 41.856248 | -87.712695 | (41.856248253, -87.712694737) |
| 6 | 10496353 | HZ237358 | 04/23/2016 07:00:00 AM | 002XX E HURON ST | 0890 | THEFT | FROM BUILDING | HOSPITAL BUILDING/GROUNDS | False | False | ... | 42.0 | 8 | 06 | 1178046.0 | 1905133.0 | 2016 | 04/30/2016 03:51:13 PM | 41.895003 | -87.621528 | (41.895003278, -87.62152816) |
| 7 | 10496354 | HZ237396 | 04/23/2016 12:00:00 PM | 023XX S TRUMBULL AVE | 2825 | OTHER OFFENSE | HARASSMENT BY TELEPHONE | APARTMENT | False | True | ... | 22.0 | 30 | 26 | 1153734.0 | 1888259.0 | 2016 | 04/30/2016 03:51:13 PM | 41.849218 | -87.711269 | (41.849217975, -87.711268873) |
| 8 | 10496355 | HZ237254 | 04/23/2016 01:00:00 AM | 070XX S JEFFERY BLVD | 2825 | OTHER OFFENSE | HARASSMENT BY TELEPHONE | BAR OR TAVERN | False | True | ... | 5.0 | 43 | 26 | 1190765.0 | 1858575.0 | 2016 | 04/30/2016 03:51:13 PM | 41.766947 | -87.576322 | (41.766946791, -87.57632241) |
| 9 | 10496356 | HZ237248 | 04/23/2016 03:10:00 PM | 095XX S COMMERCIAL AVE | 1320 | CRIMINAL DAMAGE | TO VEHICLE | STREET | False | False | ... | 10.0 | 51 | 14 | 1197803.0 | 1842283.0 | 2016 | 04/30/2016 03:51:13 PM | 41.722067 | -87.551069 | (41.722067458, -87.55106874) |


Many of the operations that can be done as SQL queries on the database can also be done on a `Pandas` object. So what is the benefit of a database?
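
Before answering, here is a minimal sketch of that equivalence: a per-district count after a cutoff date, written with `groupby` on a tiny made-up `DataFrame` rather than the real CSV. The variable names and the four rows are illustrative only.

```python
import pandas as pd

# Made-up rows standing in for the crime data.
df_small = pd.DataFrame(
    {
        "District": ["011", "011", "008", "008"],
        "Date": pd.to_datetime(
            [
                "2016-02-01 10:00",
                "2016-03-05 21:30",
                "2016-01-15 08:00",
                "2015-12-31 23:00",  # before the cutoff, not counted
            ]
        ),
    }
)

# WHERE Date > '2016-01-01' ... GROUP BY District ORDER BY cnt
recent = df_small[df_small["Date"] > pd.Timestamp("2016-01-01")]
counts = recent.groupby("District").size().sort_values()

print(counts.to_dict())  # {'008': 1, '011': 2}
```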

## Measuring the performance of Python code
Here we will focus on two types of performance metrics:
#### Speed - time taken to run the code on the machine
We will use a Python tool called `timeit` to measure the time.

We will first define two functions (snippets of code that can be called repeatedly), one using `mysql` and one using `Pandas`.

In [50]:
def pd_performance():
    df = pd.read_csv('chicago_crime_data.csv',header=0)
    cnt = len(df.index)
    
def db_performance():
    db = pymysql.connect(host="127.0.0.1",    # your host, usually localhost
                     user="root",         # your username
                     passwd="root",       # your password
                     db="eas503db")    # name of the data base
    querystr = 'SELECT count(*) FROM chicago_crime_data'
    cur = db.cursor()
    cur.execute(querystr)

    for row in cur.fetchall():
        cnt = row
    db.close()

In [51]:
%timeit pd_performance()

2.09 s ± 98.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [49]:
%timeit db_performance()

137 ms ± 520 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
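
Outside a notebook, the same kind of measurement can be made with the standard library's `timeit` module. The sketch below times a placeholder function (`count_rows`, made up for this example) so that it runs without the CSV file or the database; in practice you would pass `pd_performance` or `db_performance` instead.

```python
import timeit

def count_rows():
    # Placeholder workload standing in for pd_performance() or db_performance().
    return sum(1 for _ in range(100_000))

# 3 repeats of 5 calls each; report the best average time per call.
timings = timeit.repeat(count_rows, repeat=3, number=5)
best_per_call = min(timings) / 5

print(f"{best_per_call * 1e3:.3f} ms per call")
```

Taking the minimum over the repeats is a common convention, since slower runs usually reflect interference from other processes rather than the code itself.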


#### Space - memory used to run the code on the machine
We will use another Python tool called `memory_profiler` for this. For more information, see https://pypi.org/project/memory_profiler/. Conda users may install the tool using:
```shell
conda install memory_profiler
```

Unfortunately, line-by-line profiling with `memory_profiler` needs the profiled function to be defined in a file rather than in a notebook cell. So we will run these two snippets from the command line.

In [43]:
import pandas as pd
from memory_profiler import profile

@profile
def pd_performance():
    df = pd.read_csv('chicago_crime_data.csv',header=0)
    cnt = len(df.index)

pd_performance()  # call the function so that the profiler has something to report

Save the above snippet into a file called `pd_demo.py`. Then run:
```shell
python -m memory_profiler pd_demo.py
```

In [44]:
import pymysql.cursors
from memory_profiler import profile

@profile
def db_performance():
    db = pymysql.connect(host="127.0.0.1",    # your host, usually localhost
                     user="root",         # your username
                     passwd="root",       # your password
                     db="eas503db")    # name of the data base
    querystr = 'SELECT count(*) FROM chicago_crime_data'
    cur = db.cursor()
    cur.execute(querystr)

    for row in cur.fetchall():
        cnt = row
    db.close()

db_performance()  # call the function so that the profiler has something to report

Save the above snippet into a file called `db_demo.py`. Then run:
```shell
python -m memory_profiler db_demo.py
```
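
If you want a rough memory figure without leaving the notebook, the standard library's `tracemalloc` module is one alternative. It is a different tool from `memory_profiler`: it tracks allocations made through Python's memory allocator rather than the memory use of the whole process, so its numbers will not match the tables below. The workload here (`build_rows`) is a made-up placeholder.

```python
import tracemalloc

def build_rows(n):
    # Placeholder workload: materialise n small records in memory.
    return [(i, "THEFT") for i in range(n)]

tracemalloc.start()
rows = build_rows(100_000)
current, peak = tracemalloc.get_traced_memory()  # both values are in bytes
tracemalloc.stop()

print(f"current={current / 2**20:.1f} MiB, peak={peak / 2**20:.1f} MiB")
```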

### Observations:
The output of the `memory_profiler` gives us:

For `pd_demo.py`:
```
Line #    Mem usage    Increment   Line Contents
================================================
     4     77.6 MiB     77.6 MiB   @profile
     5                             def pd_performance():
     6    270.0 MiB    192.4 MiB       df = pd.read_csv('chicago_crime_data.csv',header=0)
     7    270.0 MiB      0.0 MiB       cnt = len(df.index)
```

For `db_demo.py`:
```
Line #    Mem usage    Increment   Line Contents
================================================
     4     49.5 MiB     49.5 MiB   @profile
     5                             def db_performance():
     6     49.5 MiB      0.0 MiB       db = pymysql.connect(host="127.0.0.1",    # your host, usually localhost
     7     49.5 MiB      0.0 MiB                        user="root",         # your username
     8     49.5 MiB      0.0 MiB                        passwd="root",       # your password
     9     49.7 MiB      0.2 MiB                        db="eas503db")    # name of the data base
    10     49.7 MiB      0.0 MiB       querystr = 'SELECT count(*) FROM chicago_crime_data'
    11     49.7 MiB      0.0 MiB       cur = db.cursor()
    12     49.7 MiB      0.0 MiB       cur.execute(querystr)
    13                             
    14     49.7 MiB      0.0 MiB       for row in cur.fetchall():
    15     49.7 MiB      0.0 MiB           cnt = row
    16     49.7 MiB      0.0 MiB       db.close()
```

For this row-count task, pulling the answer from the database is better than reading the file, both in speed (about 137 ms versus 2.09 s per run) and in memory usage (an increment of about 0.2 MiB versus 192.4 MiB in the Python process). Of course, this does not include the cost of running the database itself. But given that the same database will be used over and over, and typically by many applications, that cost tends to be _amortized_.