# About

Scratch notebook for downloading the `sf-crime` competition data from Kaggle and taking a first look at it.

# Download Data from Kaggle via API

See https://github.com/Kaggle/kaggle-api for details on downloading Kaggle data programmatically.
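The CLI needs API credentials before any of the commands in this notebook will work. Below is a minimal sketch of a pre-flight check, assuming the defaults described in the kaggle-api README: a `kaggle.json` token under `~/.kaggle` (or under the folder named by `KAGGLE_CONFIG_DIR`), or the `KAGGLE_USERNAME` and `KAGGLE_KEY` environment variables. The helper name `has_kaggle_credentials` is made up for illustration and is not part of the `kaggle` package.

```python
import os
from pathlib import Path


def has_kaggle_credentials(env=None, home=None):
    """Return True if Kaggle credentials appear to be configured.

    Hypothetical helper, not part of the kaggle package. It only checks
    that credentials are present, not that they are valid.
    """
    env = os.environ if env is None else env
    home = Path.home() if home is None else Path(home)

    # Option 1: credentials supplied through environment variables.
    if env.get("KAGGLE_USERNAME") and env.get("KAGGLE_KEY"):
        return True

    # Option 2: a kaggle.json token file; KAGGLE_CONFIG_DIR overrides the default folder.
    config_dir = Path(env.get("KAGGLE_CONFIG_DIR", home / ".kaggle"))
    return (config_dir / "kaggle.json").is_file()


print(has_kaggle_credentials())
```

If this prints `False`, the `!kaggle ...` cells are likely to fail with an authentication error rather than a download.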

In [1]:
!kaggle --help

usage: kaggle [-h] [-v] {competitions,c,datasets,d,kernels,k,config} ...

optional arguments:
  -h, --help            show this help message and exit
  -v, --version         show program's version number and exit

commands:
  {competitions,c,datasets,d,kernels,k,config}
                        Use one of:
                        competitions {list, files, download, submit, submissions, leaderboard}
                        datasets {list, files, download, create, version, init, metadata, status}
                        config {view, set, unset}
    competitions (c)    Commands related to Kaggle competitions
    datasets (d)        Commands related to Kaggle datasets
    kernels (k)         Commands related to Kaggle kernels
    config              Configuration settings


In [2]:
!kaggle competitions files sf-crime

name                      size  creationDate         
------------------------  ----  -------------------  
test.csv.zip              19MB  2015-05-28 23:56:19  
sampleSubmission.csv.zip   2MB  2015-06-03 19:25:27  
train.csv.zip             22MB  2015-06-03 19:25:58  


In [3]:
!kaggle competitions download -c sf-crime -p ../data/raw

Downloading test.csv.zip to ../data/raw

Downloading sampleSubmission.csv.zip to ../data/raw

Downloading train.csv.zip to ../data/raw





Note that the command downloads zipped files. Fortunately, pandas can read a zipped CSV directly, without unzipping it first.
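A small self-contained sketch of that behaviour, using a made-up two-row CSV (`toy.csv`, not one of the competition files) zipped into a temporary folder. It assumes the archive holds exactly one file, which is what pandas expects for `compression="zip"`.

```python
import tempfile
import zipfile
from pathlib import Path

import pandas as pd

with tempfile.TemporaryDirectory() as tmp_dir:
    # Write a tiny single-file zip archive (made-up rows, for illustration only).
    zip_path = Path(tmp_dir) / "toy.csv.zip"
    with zipfile.ZipFile(zip_path, mode="w") as archive:
        archive.writestr("toy.csv", "Id,Value\n0,a\n1,b\n")

    # Explicit compression, as used in this notebook.
    explicit = pd.read_csv(zip_path, compression="zip")
    # The default compression="infer" picks up the .zip extension on its own.
    inferred = pd.read_csv(zip_path)

print(explicit)
print(explicit.equals(inferred))
```

Passing `compression="zip"` explicitly is therefore optional for paths ending in `.zip`, but it documents the intent.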

# Load Data

In [4]:
import numpy as np
import pandas as pd

In [5]:
train_csv = pd.read_csv("../data/raw/train.csv.zip", compression="zip")
test_csv = pd.read_csv("../data/raw/test.csv.zip", compression="zip")
sample_submissions = pd.read_csv("../data/raw/sampleSubmission.csv.zip", compression="zip")
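By default `read_csv` keeps the `Dates` column as plain strings. If real timestamps are wanted, one option is the `parse_dates` argument. The sketch below uses three rows and a subset of columns copied from `train.csv`, read from an in-memory buffer instead of the zip file; the `Hour` column is an illustrative addition, not something the notebook computes.

```python
import io

import pandas as pd

# Three rows taken from the head of train.csv (subset of columns).
sample = io.StringIO(
    "Dates,Category,DayOfWeek\n"
    "2015-05-13 23:53:00,WARRANTS,Wednesday\n"
    "2015-05-13 23:33:00,OTHER OFFENSES,Wednesday\n"
    "2015-05-13 23:30:00,LARCENY/THEFT,Wednesday\n"
)

df = pd.read_csv(sample, parse_dates=["Dates"])

# With a datetime column, the .dt accessor exposes calendar fields.
df["Hour"] = df["Dates"].dt.hour
print(df.dtypes)
print(df[["Dates", "Hour"]])
```

The same argument should work on the real files, for example `pd.read_csv("../data/raw/train.csv.zip", compression="zip", parse_dates=["Dates"])`, at the cost of a somewhat slower load.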

## What's in train csv?

In [6]:
train_csv.shape

(878049, 9)

In [7]:
train_csv.describe(include="all")

Unnamed: 0,Dates,Category,Descript,DayOfWeek,PdDistrict,Resolution,Address,X,Y
count,878049,878049,878049,878049,878049,878049,878049,878049.0,878049.0
unique,389257,39,879,7,10,17,23228,,
top,2011-01-01 00:01:00,LARCENY/THEFT,GRAND THEFT FROM LOCKED AUTO,Friday,SOUTHERN,NONE,800 Block of BRYANT ST,,
freq,185,174900,60022,133734,157182,526790,26533,,
mean,,,,,,,,-122.422616,37.77102
std,,,,,,,,0.030354,0.456893
min,,,,,,,,-122.513642,37.707879
25%,,,,,,,,-122.432952,37.752427
50%,,,,,,,,-122.41642,37.775421
75%,,,,,,,,-122.406959,37.784369


In [8]:
train_csv.head()

Unnamed: 0,Dates,Category,Descript,DayOfWeek,PdDistrict,Resolution,Address,X,Y
0,2015-05-13 23:53:00,WARRANTS,WARRANT ARREST,Wednesday,NORTHERN,"ARREST, BOOKED",OAK ST / LAGUNA ST,-122.425892,37.774599
1,2015-05-13 23:53:00,OTHER OFFENSES,TRAFFIC VIOLATION ARREST,Wednesday,NORTHERN,"ARREST, BOOKED",OAK ST / LAGUNA ST,-122.425892,37.774599
2,2015-05-13 23:33:00,OTHER OFFENSES,TRAFFIC VIOLATION ARREST,Wednesday,NORTHERN,"ARREST, BOOKED",VANNESS AV / GREENWICH ST,-122.424363,37.800414
3,2015-05-13 23:30:00,LARCENY/THEFT,GRAND THEFT FROM LOCKED AUTO,Wednesday,NORTHERN,NONE,1500 Block of LOMBARD ST,-122.426995,37.800873
4,2015-05-13 23:30:00,LARCENY/THEFT,GRAND THEFT FROM LOCKED AUTO,Wednesday,PARK,NONE,100 Block of BRODERICK ST,-122.438738,37.771541


## What's in test csv?

In [9]:
test_csv.shape

(884262, 7)

In [10]:
test_csv.describe(include="all")

Unnamed: 0,Id,Dates,DayOfWeek,PdDistrict,Address,X,Y
count,884262.0,884262,884262,884262,884262,884262.0,884262.0
unique,,392173,7,10,23184,,
top,,2010-01-01 00:01:00,Friday,SOUTHERN,800 Block of BRYANT ST,,
freq,,150,134703,157456,26984,,
mean,442130.5,,,,,-122.422693,37.771476
std,255264.596206,,,,,0.030985,0.484824
min,0.0,,,,,-122.513642,37.707879
25%,221065.25,,,,,-122.433069,37.752374
50%,442130.5,,,,,-122.416517,37.775421
75%,663195.75,,,,,-122.406959,37.784353


In [11]:
test_csv.head()

Unnamed: 0,Id,Dates,DayOfWeek,PdDistrict,Address,X,Y
0,0,2015-05-10 23:59:00,Sunday,BAYVIEW,2000 Block of THOMAS AV,-122.399588,37.735051
1,1,2015-05-10 23:51:00,Sunday,BAYVIEW,3RD ST / REVERE AV,-122.391523,37.732432
2,2,2015-05-10 23:50:00,Sunday,NORTHERN,2000 Block of GOUGH ST,-122.426002,37.792212
3,3,2015-05-10 23:45:00,Sunday,INGLESIDE,4700 Block of MISSION ST,-122.437394,37.721412
4,4,2015-05-10 23:45:00,Sunday,INGLESIDE,4700 Block of MISSION ST,-122.437394,37.721412
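The two files do not share the same columns. A quick way to see the difference, sketched here with the column names copied from the outputs above rather than from the loaded frames:

```python
# Column names copied from the head() outputs above.
train_columns = ["Dates", "Category", "Descript", "DayOfWeek", "PdDistrict",
                 "Resolution", "Address", "X", "Y"]
test_columns = ["Id", "Dates", "DayOfWeek", "PdDistrict", "Address", "X", "Y"]

# Preserve the original column order while comparing.
only_in_train = [c for c in train_columns if c not in test_columns]
only_in_test = [c for c in test_columns if c not in train_columns]
shared = [c for c in train_columns if c in test_columns]

print("only in train:", only_in_train)
print("only in test: ", only_in_test)
print("shared:       ", shared)
```

`Category`, `Descript` and `Resolution` appear only in the training data, and `Id` only in the test data, so only the shared columns are available at prediction time. With the frames loaded, `train_csv.columns.difference(test_csv.columns)` gives the same train-only set (sorted alphabetically).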


## What's in sample submission?

In [12]:
sample_submissions.shape

(884262, 40)

In [13]:
sample_submissions.describe(include="all")

Unnamed: 0,Id,ARSON,ASSAULT,BAD CHECKS,BRIBERY,BURGLARY,DISORDERLY CONDUCT,DRIVING UNDER THE INFLUENCE,DRUG/NARCOTIC,DRUNKENNESS,...,SEX OFFENSES NON FORCIBLE,STOLEN PROPERTY,SUICIDE,SUSPICIOUS OCC,TREA,TRESPASS,VANDALISM,VEHICLE THEFT,WARRANTS,WEAPON LAWS
count,884262.0,884262.0,884262.0,884262.0,884262.0,884262.0,884262.0,884262.0,884262.0,884262.0,...,884262.0,884262.0,884262.0,884262.0,884262.0,884262.0,884262.0,884262.0,884262.0,884262.0
mean,442130.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
std,255264.596206,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
25%,221065.25,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
50%,442130.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
75%,663195.75,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
max,884261.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0


In [14]:
sample_submissions.head()

Unnamed: 0,Id,ARSON,ASSAULT,BAD CHECKS,BRIBERY,BURGLARY,DISORDERLY CONDUCT,DRIVING UNDER THE INFLUENCE,DRUG/NARCOTIC,DRUNKENNESS,...,SEX OFFENSES NON FORCIBLE,STOLEN PROPERTY,SUICIDE,SUSPICIOUS OCC,TREA,TRESPASS,VANDALISM,VEHICLE THEFT,WARRANTS,WEAPON LAWS
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
1,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
2,2,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
3,3,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
4,4,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
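The sample submission has an `Id` column plus one column per crime category (40 columns in total, matching the 39 unique values of `Category` in the training data), and it assigns every row to `WARRANTS`. A minimal sketch of building a frame with the same layout, using only a handful of the category names and five rows; the helper `constant_submission` is hypothetical and only mimics the shape of the file, not a sensible prediction:

```python
import pandas as pd

# A few of the category columns visible in the sample submission (the real file has 39).
categories = ["ARSON", "ASSAULT", "VEHICLE THEFT", "WARRANTS", "WEAPON LAWS"]
n_rows = 5  # the real test set has 884262 rows


def constant_submission(n_rows, categories, predicted):
    """Hypothetical helper: every row predicts the same single category."""
    frame = pd.DataFrame(0, index=range(n_rows), columns=categories)
    frame[predicted] = 1
    # Id goes first, as in sampleSubmission.csv.
    frame.insert(0, "Id", list(range(n_rows)))
    return frame


submission = constant_submission(n_rows, categories, predicted="WARRANTS")
print(submission)
```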


# Create Mini Data Sets

Save the first 10 rows of each file as small samples that can be committed to GitHub.

In [15]:
mini_train = train_csv.head(10)
mini_test = test_csv.head(10)
mini_submission = sample_submissions.head(10)

In [16]:
mini_train.to_csv("../data/raw/mini-train.csv", index=False)
mini_test.to_csv("../data/raw/mini-test.csv", index=False)
mini_submission.to_csv("../data/raw/mini-sample-submission.csv", index=False)
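Taking `head(10)` works here because the full files are already in memory. When only the sample is needed, `read_csv` can stop early with `nrows`. A sketch under that assumption, using a generated stand-in CSV (made-up `Id` and `Value` columns) instead of the competition files, and checking that the sample survives a round trip through CSV text:

```python
import io

import pandas as pd

# Stand-in for a large CSV file: 1000 made-up rows.
big_csv = "Id,Value\n" + "".join(f"{i},{i * 2}\n" for i in range(1000))

# Read only the first 10 data rows instead of the whole file.
mini = pd.read_csv(io.StringIO(big_csv), nrows=10)

# Round-trip through CSV text, as to_csv(..., index=False) would write it.
restored = pd.read_csv(io.StringIO(mini.to_csv(index=False)))
print(mini.shape, restored.equals(mini))
```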