<h1> 2. Creating a sampled dataset - HY Working</h1>

This notebook illustrates:
<ol>
<li> Sampling a BigQuery dataset to create datasets for ML
<li> Preprocessing with Pandas
</ol>

In [10]:
# change these to try this notebook out
BUCKET = 'qwiklabs-gcp-0a838c82400aa60e1'
PROJECT = 'qwiklabs-gcp-0a838c82400aa60e1'
REGION = 'australia-southeast1'

In [11]:
import os
os.environ['BUCKET'] = BUCKET
os.environ['PROJECT'] = PROJECT
os.environ['REGION'] = REGION

In [12]:
%%bash
if ! gsutil ls | grep -q gs://${BUCKET}/; then
  gsutil mb -l ${REGION} gs://${BUCKET}
fi

Creating gs://qwiklabs-gcp-0a838c82400aa60e1/...


<h2> Create ML dataset by sampling using BigQuery </h2>
<p>
Let's sample the BigQuery data to create smaller datasets.
</p>

In [13]:
# Create SQL query using natality data after the year 2000
import google.datalab.bigquery as bq
query = """
SELECT
  weight_pounds,
  is_male,
  mother_age,
  plurality,
  gestation_weeks,
  ABS(FARM_FINGERPRINT(CONCAT(CAST(YEAR AS STRING), CAST(month AS STRING)))) AS hashmonth
FROM
  publicdata.samples.natality
WHERE year > 2000
"""

## Lab Task #1

Sample the BigQuery result set (above) so that you have approximately 12,000 training examples and 3,000 evaluation examples.
The training and evaluation datasets have to be well distributed (not all the babies are born in Jan 2005, for example)
and must not overlap (no baby is part of both the training and evaluation datasets).

Hint (highlight to see): <p style='color:white'>You will use MOD() on the hashmonth to divide the dataset into non-overlapping training and evaluation datasets, and RAND() to sample these to the desired size.</p>
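
The idea in the hint can be sketched outside BigQuery. The snippet below is only an illustration: it uses MD5 from the standard library as a stand-in for FARM_FINGERPRINT (so the hash values differ from BigQuery's), and the helper names `hashmonth`, `assign_split` and `sample` as well as the toy rows are hypothetical. It shows that taking the hash of the year-month modulo a bucket count sends every row of a given month to exactly one split, and that a random threshold then thins each split.

```python
import hashlib
import random

def hashmonth(year, month):
    """Stand-in for ABS(FARM_FINGERPRINT(CONCAT(year, month))); MD5, not FarmHash."""
    key = f"{year}{month}".encode("utf-8")
    return int.from_bytes(hashlib.md5(key).digest()[:8], "big")

def assign_split(year, month, buckets=10, eval_bucket=9):
    """All rows from the same year-month land in the same split."""
    return "eval" if hashmonth(year, month) % buckets == eval_bucket else "train"

def sample(rows, rate, seed=0):
    """Keep each row independently with probability `rate`, like RAND() < rate."""
    rng = random.Random(seed)
    return [row for row in rows if rng.random() < rate]

# Hypothetical rows: (year, month) pairs only, 100 births per month.
rows = [(y, m) for y in range(2001, 2009) for m in range(1, 13) for _ in range(100)]
train = sample([r for r in rows if assign_split(*r) == "train"], rate=0.1)
evaluation = sample([r for r in rows if assign_split(*r) == "eval"], rate=0.1)

# No year-month can appear in both splits.
overlap = set(train) & set(evaluation)
print(len(train), len(evaluation), len(overlap))
```

Because the split is decided per month rather than per row, the share of rows in each split depends on how many births fall in the months assigned to it, so it is only roughly 90/10.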

In [14]:
count_sql = "select count(*) from (" + query + ") where MOD(hashmonth, 4) = 1"
print(bq.Query(count_sql).execute().result().to_dataframe())

       f0_
0  9134316


In [15]:
# check how hashmonth works: count the rows that fall in each MOD(hashmonth, 20) bucket
count_sql = "select count(*), MOD(hashmonth, 20) from (" + query + ") where MOD(hashmonth, 20) < 14 group by MOD(hashmonth, 20)"
print(bq.Query(count_sql).execute().result().to_dataframe())

        f0_  f1_
0   2120329    3
1   1343107    7
2   2484033   13
3   2480582    5
4   2080569   12
5   2368609    0
6    354450   11
7    660523    1
8   2328931    4
9   2060460    8
10   988441    2
11  1764733    9
12  1075744    6
13  1726094   10


In [16]:
trainsql = "select * from (" + query + ") where MOD(hashmonth, 10) < 9 AND RAND() < 0.0005" # 90% train
traindat = bq.Query(trainsql).execute().result().to_dataframe()

In [17]:
print(traindat.count())
traindat.head()

weight_pounds      14847
is_male            14859
mother_age         14859
plurality          14859
gestation_weeks    14751
hashmonth          14859
dtype: int64


Unnamed: 0,weight_pounds,is_male,mother_age,plurality,gestation_weeks,hashmonth
0,8.93754,True,21,1,39.0,774501970389208065
1,6.530092,False,34,1,38.0,774501970389208065
2,7.18707,False,35,1,41.0,774501970389208065
3,6.977631,True,35,1,39.0,774501970389208065
4,7.874912,True,31,1,39.0,774501970389208065


In [18]:
testsql = "select * from (" + query + ") where MOD(hashmonth, 10) = 9 AND RAND() < 0.0005" # 10% test
testdat = bq.Query(testsql).execute().result().to_dataframe()
print(testdat.count())
testdat.head()

weight_pounds      1633
is_male            1634
mother_age         1634
plurality          1634
gestation_weeks    1615
hashmonth          1634
dtype: int64


Unnamed: 0,weight_pounds,is_male,mother_age,plurality,gestation_weeks,hashmonth
0,5.81359,True,23,1,38.0,8904940584331855459
1,9.499719,True,23,1,41.0,8904940584331855459
2,7.813183,True,25,1,41.0,5742197815970064689
3,7.211321,True,32,1,38.0,4740473290291881219
4,7.588311,True,22,1,39.0,4740473290291881219
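
The two runs above returned 14,859 training rows and 1,634 evaluation rows with a threshold of 0.0005, while the task asks for roughly 12,000 and 3,000. One way to get closer, sketched below, is to rescale the threshold from the observed counts. The helper name `rescale_rate` is hypothetical, and because RAND() is random the result is only an estimate; a rerun will still land a little above or below the target.

```python
def rescale_rate(observed_rows, used_rate, target_rows):
    """Estimate the RAND() threshold that would yield about `target_rows`."""
    # Rows available in the split, inferred from one sampled run.
    estimated_population = observed_rows / used_rate
    return target_rows / estimated_population

# Counts taken from the outputs above; both runs used RAND() < 0.0005.
train_rate = rescale_rate(observed_rows=14859, used_rate=0.0005, target_rows=12000)
eval_rate = rescale_rate(observed_rows=1634, used_rate=0.0005, target_rows=3000)
print(f"train: {train_rate:.6f}, eval: {eval_rate:.6f}")
```

The evaluation threshold comes out higher than the training one because the evaluation split covers a single hash bucket out of ten.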


## Lab Task #2

Use Pandas to:
* Clean up the data to remove rows that are missing any of the fields.
* Simulate the lack of ultrasound.
* Change the plurality column to be a string.

Hints: <p>
Filtering:
<pre>
df = df[df.weight_pounds > 0]
</pre>
Lack of ultrasound:
<pre>
nous = df.copy(deep=True)
nous['is_male'] = 'Unknown'
</pre>
Modify plurality to be a string:
<pre>
twins_etc = dict(zip([1,2,3,4,5],
                   ['Single(1)', 'Twins(2)', 'Triplets(3)', 'Quadruplets(4)', 'Quintuplets(5)']))
df['plurality'].replace(twins_etc, inplace=True)
</pre>
</p>
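
A small sketch of why the filtering hint also removes rows with missing fields: in Pandas a comparison against NaN evaluates to False, so `df[df.col > 0]` drops those rows. The toy values below are made up for illustration, and `dropna` is shown only as a comparison, not as the approach this notebook takes.

```python
import pandas as pd

# Hypothetical toy rows; the values are made up for illustration.
toy = pd.DataFrame({
    "weight_pounds": [7.5, float("nan"), 6.1],
    "gestation_weeks": [39.0, 40.0, float("nan")],
})

# NaN > 0 evaluates to False, so rows with a missing value are filtered out.
filtered = toy[(toy.weight_pounds > 0) & (toy.gestation_weeks > 0)]

# dropna() on the same columns keeps the same rows for this toy data.
dropped = toy.dropna(subset=["weight_pounds", "gestation_weeks"])

print(len(toy), len(filtered), len(dropped))
```

The two approaches differ when a field holds zero or a negative value: the comparison removes such rows, while `dropna` keeps them.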

In [19]:
# Filtering - clean out 'bad data'
# Let's look at summary statistics of the training data.
# The count row reveals missing values: a column with nulls has a lower count.
traindf = traindat
traindf.describe()

Unnamed: 0,weight_pounds,mother_age,plurality,gestation_weeks,hashmonth
count,14847.0,14859.0,14859.0,14751.0,14859.0
mean,7.227683,27.402315,1.034996,38.591214,4.372484e+18
std,1.318797,6.191578,0.192364,2.602255,2.76546e+18
min,0.500449,13.0,1.0,17.0,7.493147e+16
25%,6.563162,22.0,1.0,38.0,1.639186e+18
50%,7.312733,27.0,1.0,39.0,4.329667e+18
75%,8.028133,32.0,1.0,40.0,6.888635e+18
max,12.50021,50.0,4.0,47.0,9.183606e+18


In [20]:
import pandas as pd
def preprocess(df):
  # Drop rows we don't want to train on. A comparison with NaN is False,
  # so each filter also removes rows where that field is missing.
  # In other words, users will have to tell us the mother's age;
  # otherwise, our ML service won't work.
  # These fields were chosen because they are good predictors
  # and because they are easy enough to collect.
  df = df[df.weight_pounds > 0]
  df = df[df.mother_age > 0]
  df = df[df.gestation_weeks > 0]
  df = df[df.plurality > 0].copy()  # copy so the edit below doesn't act on a slice of the input
  
  # modify plurality field to be a string
  twins_etc = dict(zip([1,2,3,4,5],
                   ['Single(1)', 'Twins(2)', 'Triplets(3)', 'Quadruplets(4)', 'Quintuplets(5)']))
  df['plurality'] = df['plurality'].replace(twins_etc)
  
  # now create extra rows to simulate lack of ultrasound, i.e. the baby's sex is unknown
  nous = df.copy(deep=True)
  nous.loc[nous['plurality'] != 'Single(1)', 'plurality'] = 'Multiple(2+)'
  nous['is_male'] = 'Unknown'
  
  return pd.concat([df, nous])

In [21]:
# Let's see a small sample of the training data now after our preprocessing
traindf_clean = preprocess(traindf)
testdf = testdat
testdf_clean = preprocess(testdf)
traindf_clean.head()

Unnamed: 0,weight_pounds,is_male,mother_age,plurality,gestation_weeks,hashmonth
0,8.93754,True,21,Single(1),39.0,774501970389208065
1,6.530092,False,34,Single(1),38.0,774501970389208065
2,7.18707,False,35,Single(1),41.0,774501970389208065
3,6.977631,True,35,Single(1),39.0,774501970389208065
4,7.874912,True,31,Single(1),39.0,774501970389208065


In [22]:
print(traindf_clean.describe(include='all'))

        weight_pounds  is_male    mother_age  plurality  gestation_weeks  \
count    29480.000000    29480  29480.000000      29480     29480.000000   
unique            NaN        3           NaN          5              NaN   
top               NaN  Unknown           NaN  Single(1)              NaN   
freq              NaN    14740           NaN      28492              NaN   
mean         7.228670      NaN     27.407463        NaN        38.596744   
std          1.316765      NaN      6.189664        NaN         2.584611   
min          0.500449      NaN     13.000000        NaN        18.000000   
25%          6.563162      NaN     23.000000        NaN        38.000000   
50%          7.312733      NaN     27.000000        NaN        39.000000   
75%          8.027582      NaN     32.000000        NaN        40.000000   
max         12.500210      NaN     50.000000        NaN        47.000000   

           hashmonth  
count   2.948000e+04  
unique           NaN  
top              N

## Lab Task #3

Write the cleaned data to CSV files. Change the names of the Pandas dataframes (traindf, evaldf) as appropriate.



In [23]:
traindf_clean.to_csv('train.csv', index=False, header=False)
testdf_clean.to_csv('eval.csv', index=False, header=False)
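
Because the files are written with `header=False`, the column order is not stored in them and any reader has to supply it. A minimal sketch of reading such rows back with the standard library follows. The `CSV_COLUMNS` name is an assumption, the order is the one shown in the dataframe output above, and the two sample lines are inlined so the snippet runs without the files.

```python
import csv
import io

# Column order written by to_csv(); header=False means the files do not record it.
CSV_COLUMNS = ["weight_pounds", "is_male", "mother_age",
               "plurality", "gestation_weeks", "hashmonth"]

# Two lines in the same layout as train.csv, inlined so the sketch runs on its own.
sample_text = (
    "8.93754010148,True,21,Single(1),39.0,774501970389208065\n"
    "6.5300922004399995,False,34,Single(1),38.0,774501970389208065\n"
)

# DictReader maps each field to its name; all values are read as strings.
reader = csv.DictReader(io.StringIO(sample_text), fieldnames=CSV_COLUMNS)
records = list(reader)
print(records[0]["plurality"], records[1]["mother_age"])
```

To read the real file, the `io.StringIO(...)` argument would be replaced by an open file handle for `train.csv` or `eval.csv`.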

In [24]:
%%bash
wc -l *.csv
head *.csv
tail *.csv

   3228 eval.csv
  29480 train.csv
  32708 total
==> eval.csv <==
5.8135898489399995,True,23,Single(1),38.0,8904940584331855459
9.49971886958,True,23,Single(1),41.0,8904940584331855459
7.81318256528,True,25,Single(1),41.0,5742197815970064689
7.21132059002,True,32,Single(1),38.0,4740473290291881219
7.5883110580399995,True,22,Single(1),39.0,4740473290291881219
8.6862131228,True,34,Single(1),40.0,1443901198490054949
8.062304921339999,True,20,Single(1),40.0,5742197815970064689
8.313631900019999,True,34,Single(1),40.0,4740473290291881219
5.6879263596,False,34,Single(1),37.0,260598435387740869
6.0009827716399995,True,17,Single(1),36.0,1443901198490054949

==> train.csv <==
8.93754010148,True,21,Single(1),39.0,774501970389208065
6.5300922004399995,False,34,Single(1),38.0,774501970389208065
7.1870697412,False,35,Single(1),41.0,774501970389208065
6.9776305923,True,35,Single(1),39.0,774501970389208065
7.87491199864,True,31,Single(1),39.0,774501970389208065
8.39299831434,True,37,Single(1),39.0,77

Copyright 2017-2018 Google Inc. Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License