<a href="https://colab.research.google.com/github/rtheman/Data_IO/blob/master/from_Internet_v1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Objective

This notebook provides recipes for loading data into a Jupyter notebook (Google Colab), in this case from the **Internet**.

**METHOD**. This notebook uses the `urllib` and `zipfile` libraries to download and unzip the file, respectively.

**DATASET**. The data comes from the [Taxi Trips dataset](https://data.cityofchicago.org/Transportation/Taxi-Trips/wrvz-psew) released by the City of Chicago.

**PROCESS**
1. Download the data from the Internet (Google Cloud Storage).
1. Save it to a temporary directory in the Colab file system, `/tmp/[tmp_folder_name]`.
1. Unzip the downloaded file with the `zipfile` library.
1. Load the training dataset into a DataFrame.
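
The steps above can be sketched as a single helper. This is a minimal sketch under stated assumptions, not the notebook's exact code: the function name `download_and_extract` is hypothetical, and the demo builds a tiny local archive with made-up content and fetches it through a `file://` URL so that it runs without network access.

```python
import os
import tempfile
import urllib.request
import zipfile
from pathlib import Path


def download_and_extract(url, dest_dir):
    """Download the ZIP archive at `url` and extract it into `dest_dir`.

    Hypothetical helper for illustration; returns the sorted member names.
    """
    os.makedirs(dest_dir, exist_ok=True)
    # Keep the downloaded archive in a scratch directory that is removed
    # automatically once extraction has finished.
    with tempfile.TemporaryDirectory() as scratch:
        archive_path = os.path.join(scratch, "archive.zip")
        urllib.request.urlretrieve(url, archive_path)
        # The context manager closes the archive even if extraction fails.
        with zipfile.ZipFile(archive_path) as archive:
            archive.extractall(dest_dir)
            return sorted(archive.namelist())


# Offline demo: build a small archive locally and "download" it through a
# file:// URL. The member name and CSV content are made up.
demo_root = tempfile.mkdtemp()
demo_zip = os.path.join(demo_root, "demo.zip")
with zipfile.ZipFile(demo_zip, "w") as archive:
    archive.writestr("data/train/data.csv", "fare,tips\n12.85,0.0\n")

extracted_to = os.path.join(demo_root, "extracted")
members = download_and_extract(Path(demo_zip).as_uri(), extracted_to)
print(members)
```

Passing an `https://` URL instead of the `file://` one would exercise the same code path over the network, and the extracted CSV can then be handed to `pd.read_csv`.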

# Initialize parameters and libraries

In [None]:
import os
from io import BytesIO

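import tempfile
import urllib.request  # `import urllib` alone does not guarantee that the `request` submodule is loaded
import zipfile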

import pandas as pd

In [None]:
# file_name = "http://stats.idre.ucla.edu/stat/data/binary.csv"
data_orig_path = "https://storage.googleapis.com/artifacts.tfx-oss-public.appspot.com/datasets"
data_orig_filename = "chicago_data.zip"

BASE_DIR = tempfile.mkdtemp()
DATA_DIR = os.path.join(BASE_DIR, 'data')
OUTPUT_DIR = os.path.join(BASE_DIR, 'chicago_taxi_output')

TRAIN_DATA = os.path.join(DATA_DIR, 'train', 'data.csv')
EVAL_DATA = os.path.join(DATA_DIR, 'eval', 'data.csv')
SERVING_DATA = os.path.join(DATA_DIR, 'serving', 'data.csv')
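
`tempfile.mkdtemp()` creates the directory but leaves its deletion to the caller. A minimal sketch of one alternative, assuming cleanup is wanted as soon as the work is done: `tempfile.TemporaryDirectory` removes the directory and its contents when the `with` block exits. The file written here is a placeholder, not the taxi data.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as base_dir:
    # Mirror the notebook's layout below the temporary base directory.
    data_dir = os.path.join(base_dir, "data", "train")
    os.makedirs(data_dir)
    placeholder = os.path.join(data_dir, "data.csv")
    with open(placeholder, "w") as handle:
        handle.write("fare,tips\n")
    existed_inside = os.path.exists(placeholder)

# Leaving the `with` block deletes the directory tree.
removed_after = not os.path.exists(base_dir)
print(existed_inside, removed_after)
```

In a notebook, where later cells still need the files, `mkdtemp()` plus an explicit `shutil.rmtree(BASE_DIR)` in a final cell is the other common choice.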

# 1.) Import Data

## Download ZIP file from GCP and unzip it

In [None]:
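# Build the URL with string formatting; os.path.join is meant for file system paths
data_orig_URL = f"{data_orig_path}/{data_orig_filename}"

# Download the zip file from GCP to a temporary file and unzip it into BASE_DIR
zip_path, headers = urllib.request.urlretrieve(data_orig_URL)
# The context manager closes the archive once extraction is done
with zipfile.ZipFile(zip_path) as archive:
    archive.extractall(BASE_DIR)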
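
The cell above extracts the whole archive to disk. As an alternative sketch, a single CSV member can be read straight from an in-memory archive without writing anything to disk. The helper name `read_csv_member` is hypothetical, and the demo uses a made-up archive built in memory so that it runs offline; with a real download, `zip_bytes` would be the bytes returned by `urllib.request.urlopen(url).read()`.

```python
import csv
import io
import zipfile


def read_csv_member(zip_bytes, member_name):
    """Return the rows of one CSV member of an in-memory ZIP archive.

    Hypothetical helper for illustration; `zip_bytes` is the raw archive.
    """
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        with archive.open(member_name) as raw:
            # ZipFile.open yields bytes; wrap it to decode text for csv.
            text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
            return list(csv.DictReader(text))


# Offline demo: build a tiny archive in memory. Names and values are made up.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("data/train/data.csv", "fare,tips\n12.85,0.0\n5.45,0.0\n")

rows = read_csv_member(buffer.getvalue(), "data/train/data.csv")
print(rows[0]["fare"], len(rows))
```

`pd.read_csv` also accepts the file object returned by `archive.open(member_name)`, so the same pattern can load a member directly into a DataFrame.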

## Read data into a DataFrame

In [None]:
# df = pd.read_csv(file_name, header=0, index_col=0)
df = pd.read_csv(TRAIN_DATA, header=0)
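# Without parentheses this displays the bound method (including the frame's repr);
# call df.info() to get the per-column dtype and non-null summary instead.
df.info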

<bound method DataFrame.info of       pickup_community_area   fare  ...  dropoff_community_area  tips
0                        22  12.85  ...                    32.0   0.0
1                        22   5.45  ...                    24.0   0.0
2                        33   0.00  ...                    33.0   0.0
3                        33  11.05  ...                     8.0   0.0
4                        33  11.05  ...                     8.0   0.0
...                     ...    ...  ...                     ...   ...
9995                      8   3.25  ...                     8.0   0.0
9996                      8   3.25  ...                     8.0   0.0
9997                      8   4.25  ...                     8.0   0.0
9998                     61   9.85  ...                    59.0   0.0
9999                     61   5.65  ...                    58.0   0.0

[10000 rows x 18 columns]>
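
A minimal sketch of calling the method rather than referencing it, using a made-up two-row frame instead of the taxi data. `DataFrame.info()` prints its summary and returns `None`; its `buf` parameter redirects the text into a buffer, which is used here so the summary can be kept as a string.

```python
import io

import pandas as pd

# Made-up stand-in for the taxi DataFrame.
demo = pd.DataFrame({"fare": [12.85, 5.45], "payment_type": ["Cash", "Cash"]})

buffer = io.StringIO()
result = demo.info(buf=buffer)  # writes the summary into `buffer`
summary = buffer.getvalue()
print(summary)
```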

In [None]:
df.dtypes

pickup_community_area       int64
fare                      float64
trip_start_month            int64
trip_start_hour             int64
trip_start_day              int64
trip_start_timestamp        int64
pickup_latitude           float64
pickup_longitude          float64
dropoff_latitude          float64
dropoff_longitude         float64
trip_miles                float64
pickup_census_tract       float64
dropoff_census_tract      float64
payment_type               object
company                    object
trip_seconds                int64
dropoff_community_area    float64
tips                      float64
dtype: object

In [None]:
df

| | pickup_community_area | fare | trip_start_month | trip_start_hour | trip_start_day | trip_start_timestamp | pickup_latitude | pickup_longitude | dropoff_latitude | dropoff_longitude | trip_miles | pickup_census_tract | dropoff_census_tract | payment_type | company | trip_seconds | dropoff_community_area | tips |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 22 | 12.85 | 3 | 11 | 7 | 1393673400 | 41.920452 | -87.679955 | 41.877406 | -87.621972 | 0.0 | | 1.703132e+10 | Cash | Taxi Affiliation Services | 720 | 32.0 | 0.0 |
| 1 | 22 | 5.45 | 8 | 21 | 7 | 1439675100 | 41.920452 | -87.679955 | 41.906771 | -87.681025 | 1.2 | | 1.703124e+10 | Cash | Dispatch Taxi Affiliation | 360 | 24.0 | 0.0 |
| 2 | 33 | 0.00 | 5 | 10 | 4 | 1432118700 | 41.849247 | -87.624135 | 41.849247 | -87.624135 | 0.0 | | 1.703184e+10 | Cash | Northwest Management LLC | 0 | 33.0 | 0.0 |
| 3 | 33 | 11.05 | 3 | 15 | 1 | 1427037300 | 41.849247 | -87.624135 | 41.892508 | -87.626215 | 0.0 | | 1.703108e+10 | Cash | Taxi Affiliation Services | 900 | 8.0 | 0.0 |
| 4 | 33 | 11.05 | 5 | 15 | 6 | 1401464700 | 41.849247 | -87.624135 | 41.892508 | -87.626215 | 3.2 | | 1.703108e+10 | Cash | | 960 | 8.0 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 9995 | 8 | 3.25 | 5 | 11 | 5 | 1431602100 | 41.904935 | -87.649907 | 41.904935 | -87.649907 | 0.0 | | 1.703184e+10 | Cash | Taxi Affiliation Services | 60 | 8.0 | 0.0 |
| 9996 | 8 | 3.25 | 11 | 16 | 4 | 1385568900 | 41.904935 | -87.649907 | 41.904935 | -87.649907 | 0.0 | | 1.703184e+10 | Cash | Taxi Affiliation Services | 0 | 8.0 | 0.0 |
| 9997 | 8 | 4.25 | 12 | 13 | 3 | 1449579600 | 41.904935 | -87.649907 | 41.904935 | -87.649907 | 0.3 | | 1.703184e+10 | Cash | | 180 | 8.0 | 0.0 |
| 9998 | 61 | 9.85 | 9 | 15 | 6 | 1410534000 | 41.809018 | -87.659167 | 41.829922 | -87.672503 | 3.0 | | | Cash | Taxi Affiliation Services | 780 | 59.0 | 0.0 |
