<h1> 2. Creating a sampled dataset - HY Working</h1>

This notebook illustrates:
<ol>
<li> Sampling a BigQuery dataset to create datasets for ML
<li> Preprocessing with Pandas
</ol>

In [1]:
# change these to try this notebook out
BUCKET = 'qwiklabs-gcp-8ad84a85d935caa2'
PROJECT = 'qwiklabs-gcp-8ad84a85d935caa2'
REGION = 'australia-southeast1'  # bucket locations are regions; 'australia-southeast1-a' is a zone

In [2]:
import os
os.environ['BUCKET'] = BUCKET
os.environ['PROJECT'] = PROJECT
os.environ['REGION'] = REGION

In [3]:
%%bash
if ! gsutil ls | grep -q gs://${BUCKET}/; then
  gsutil mb -l ${REGION} gs://${BUCKET}
fi

<h2> Create ML dataset by sampling using BigQuery </h2>
<p>
Let's sample the BigQuery data to create smaller datasets.
</p>

In [4]:
# Create SQL query using natality data after the year 2000
import google.datalab.bigquery as bq
query = """
SELECT
  weight_pounds,
  is_male,
  mother_age,
  plurality,
  gestation_weeks,
  ABS(FARM_FINGERPRINT(CONCAT(CAST(YEAR AS STRING), CAST(month AS STRING)))) AS hashmonth
FROM
  publicdata.samples.natality
WHERE year > 2000
"""

## Lab Task #1

Sample the BigQuery result set (above) so that you have approximately 12,000 training examples and 3,000 evaluation examples.
The training and evaluation datasets have to be well distributed (not all the babies are born in Jan 2005, for example)
and must not overlap (no baby is part of both the training and evaluation datasets).

Hint (highlight to see): <p style='color:white'>You will use MOD() on the hashmonth to divide the dataset into non-overlapping training and evaluation datasets, and RAND() to sample these to the desired size.</p>
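
The idea behind the hint can be sketched outside BigQuery. The snippet below is only an illustration: it uses MD5 from Python's standard library as a stand-in for `FARM_FINGERPRINT` (the two produce different numbers), and the helper names `hash_bucket` and `assign_split` are hypothetical. What it shows is the property that matters: every row from the same year and month hashes to the same bucket, so a split on the bucket number cannot place one month in both datasets.

```python
import hashlib


def hash_bucket(year, month, num_buckets=10):
    """Map a (year, month) pair to a stable bucket in [0, num_buckets)."""
    # Mirror CONCAT(CAST(year AS STRING), CAST(month AS STRING)) from the query.
    key = f"{year}{month}".encode("utf-8")
    # Assumption: MD5 stands in for FARM_FINGERPRINT; only determinism matters here.
    digest = int.from_bytes(hashlib.md5(key).digest()[:8], "big")
    return digest % num_buckets


def assign_split(year, month, num_buckets=10, eval_buckets=(9,)):
    """Return 'eval' for the held-out bucket(s) and 'train' for all others."""
    bucket = hash_bucket(year, month, num_buckets)
    return "eval" if bucket in eval_buckets else "train"


# Assign every month of a hypothetical 2001-2008 range to exactly one split.
months = [(y, m) for y in range(2001, 2009) for m in range(1, 13)]
splits = {ym: assign_split(*ym) for ym in months}
train_months = {ym for ym, s in splits.items() if s == "train"}
eval_months = {ym for ym, s in splits.items() if s == "eval"}
```

Because the assignment depends only on the key, rerunning it gives the same split every time, which `RAND()` alone would not.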

In [5]:
count_sql = "select count(*) from (" + query + ") where MOD(hashmonth, 4) = 1"
print(bq.Query(count_sql).execute().result().to_dataframe())

       f0_
0  9134316
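
A count like this one can be turned into a `RAND()` threshold with simple arithmetic: the fraction to keep is the number of rows wanted divided by the number of rows available. The helper below is a hypothetical sketch of that calculation, using the count printed above and a target of 3,000 rows as an example.

```python
def sampling_fraction(target_rows, available_rows):
    """Fraction of rows to keep so that about target_rows survive sampling."""
    if available_rows <= 0:
        raise ValueError("available_rows must be positive")
    # Never ask for more than everything that is available.
    return min(1.0, target_rows / available_rows)


# 9134316 is the row count reported for MOD(hashmonth, 4) = 1.
fraction = sampling_fraction(3000, 9134316)
print(f"... AND RAND() < {fraction:.6f}")
```

Since `RAND()` keeps each row independently, the sample size will vary a little from run to run around the target.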


In [6]:
# Check how hashmonth works: count the rows in each MOD(hashmonth, 20) bucket below 14
count_sql = "select count(*), MOD(hashmonth, 20) from (" + query + ") where MOD(hashmonth, 20) < 14 group by MOD(hashmonth, 20)"
print(bq.Query(count_sql).execute().result().to_dataframe())

        f0_  f1_
0   2120329    3
1   2368609    0
2    354450   11
3    988441    2
4   1764733    9
5   1075744    6
6   1726094   10
7   2480582    5
8   2080569   12
9   1343107    7
10  2484033   13
11  2328931    4
12   660523    1
13  2060460    8


In [12]:
trainsql = "select * from (" + query + ") where MOD(hashmonth, 10) < 9 AND RAND() < 0.0005" # hash buckets 0-8 (9 of 10) go to training
traindat = bq.Query(trainsql).execute().result().to_dataframe()

In [13]:
print(traindat.count())
traindat.head()

weight_pounds      14999
is_male            15010
mother_age         15010
plurality          15010
gestation_weeks    14919
hashmonth          15010
dtype: int64


Unnamed: 0,weight_pounds,is_male,mother_age,plurality,gestation_weeks,hashmonth
0,6.812284,True,35,1,41.0,774501970389208065
1,6.812284,False,32,1,39.0,774501970389208065
2,7.251004,False,13,1,38.0,774501970389208065
3,7.892549,True,29,1,40.0,774501970389208065
4,5.335187,False,29,1,36.0,774501970389208065


In [14]:
testsql = "select * from (" + query + ") where MOD(hashmonth, 10) = 9 AND RAND() < 0.0005" # hash bucket 9 (1 of 10) is held out for evaluation
testdat = bq.Query(testsql).execute().result().to_dataframe()
print(testdat.count())
testdat.head()

weight_pounds      1552
is_male            1553
mother_age         1553
plurality          1553
gestation_weeks    1542
hashmonth          1553
dtype: int64


Unnamed: 0,weight_pounds,is_male,mother_age,plurality,gestation_weeks,hashmonth
0,7.500126,True,21,1,40.0,8904940584331855459
1,7.62579,True,33,1,39.0,2995620979373137889
2,8.875811,False,39,1,42.0,2995620979373137889
3,7.18707,False,39,1,38.0,260598435387740869
4,6.503637,True,33,1,37.0,270792696282171059
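
One way to confirm that the two samples do not overlap is to check that no `hashmonth` value appears in both. The sketch below assumes a hypothetical helper, `shared_keys`, and uses a few `hashmonth` values copied from the rows displayed above rather than the full DataFrames.

```python
import pandas as pd


def shared_keys(train_df, eval_df, key="hashmonth"):
    """Return the set of key values present in both DataFrames."""
    return set(train_df[key].tolist()) & set(eval_df[key].tolist())


# A few hashmonth values taken from the head() outputs of the two samples.
train_sample = pd.DataFrame({"hashmonth": [774501970389208065,
                                           774501970389208065]})
eval_sample = pd.DataFrame({"hashmonth": [8904940584331855459,
                                          2995620979373137889,
                                          260598435387740869]})

overlap = shared_keys(train_sample, eval_sample)
print("shared hashmonth values:", overlap)
```

On the real data the same call would be `shared_keys(traindat, testdat)`; an empty set means no month contributes rows to both datasets.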


## Lab Task #2

Use Pandas to:
* Clean up the data to remove rows that are missing any of the fields.
* Simulate the lack of ultrasound.
* Change the plurality column to be a string.

Hints: <p>
Filtering:
<pre>
df = df[df.weight_pounds > 0]
</pre>
Lack of ultrasound:
<pre>
nous = df.copy(deep=True)
nous['is_male'] = 'Unknown'
</pre>
Modify plurality to be a string:
<pre>
twins_etc = dict(zip([1,2,3,4,5],
                   ['Single(1)', 'Twins(2)', 'Triplets(3)', 'Quadruplets(4)', 'Quintuplets(5)']))
df['plurality'] = df['plurality'].replace(twins_etc)
</pre>
</p>
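
The filtering hint also takes care of missing values. A comparison such as `weight_pounds > 0` evaluates to `False` for `NaN`, so rows with a missing field are dropped by the same filter. The toy DataFrame below is made up for illustration and shows that, for these columns, the result matches `dropna()`.

```python
import numpy as np
import pandas as pd

# Made-up rows: one complete, one missing the weight, one missing the gestation.
toy = pd.DataFrame({
    "weight_pounds": [7.5, np.nan, 6.1],
    "gestation_weeks": [40.0, 39.0, np.nan],
})

# NaN > 0 is False, so each filter removes the rows where that field is missing.
filtered = toy[toy.weight_pounds > 0]
filtered = filtered[filtered.gestation_weeks > 0]

# With all values positive, filtering and dropna() keep the same rows.
same_as_dropna = filtered.equals(toy.dropna())
print(filtered)
```

The two approaches differ only when a field holds a non-missing value of zero or less, which the `> 0` filter also removes.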

In [16]:
# Filtering: clean out 'bad data'
# Summarize the training data first; a count below the row count (15010)
# means that column contains nulls
traindf = traindat
traindf.describe()

Unnamed: 0,weight_pounds,mother_age,plurality,gestation_weeks,hashmonth
count,14999.0,15010.0,15010.0,14919.0,15010.0
mean,7.241891,27.396536,1.036775,38.580736,4.368225e+18
std,1.311489,6.162164,0.200887,2.540917,2.773487e+18
min,0.500449,12.0,1.0,17.0,7.493147e+16
25%,6.563162,22.0,1.0,38.0,1.639186e+18
50%,7.312733,27.0,1.0,39.0,4.329667e+18
75%,8.062305,32.0,1.0,40.0,6.910175e+18
max,12.588395,51.0,4.0,47.0,9.183606e+18


In [17]:
import pandas as pd
def preprocess(df):
  # Drop rows that are missing any field we train on (NaN > 0 is False,
  # so each filter below also removes nulls).
  # In other words, users will have to tell us the mother's age and the
  # other fields; otherwise, our ML service won't work.
  # These fields were chosen because they are good predictors
  # and because they are easy enough to collect.
  df = df[df.weight_pounds > 0]
  df = df[df.mother_age > 0]
  df = df[df.gestation_weeks > 0]
  df = df[df.plurality > 0]
  
  # modify plurality field to be a string; work on a copy so that we assign
  # to our own DataFrame rather than to a slice of the caller's
  twins_etc = dict(zip([1,2,3,4,5],
                   ['Single(1)', 'Twins(2)', 'Triplets(3)', 'Quadruplets(4)', 'Quintuplets(5)']))
  df = df.copy()
  df['plurality'] = df['plurality'].replace(twins_etc)
  
  # now create extra rows to simulate lack of ultrasound: the baby's sex is
  # unknown and we can only tell a single birth from a multiple birth
  nous = df.copy(deep=True)
  nous.loc[nous['plurality'] != 'Single(1)', 'plurality'] = 'Multiple(2+)'
  nous['is_male'] = 'Unknown'
  
  return pd.concat([df, nous])

In [23]:
# Let's see a small sample of the training data now after our preprocessing
traindf_clean = preprocess(traindf)
testdf = testdat
testdf_clean = preprocess(testdf)
traindf_clean.head()

Unnamed: 0,weight_pounds,is_male,mother_age,plurality,gestation_weeks,hashmonth
0,6.812284,True,35,Single(1),41.0,774501970389208065
1,6.812284,False,32,Single(1),39.0,774501970389208065
2,7.251004,False,13,Single(1),38.0,774501970389208065
3,7.892549,True,29,Single(1),40.0,774501970389208065
4,5.335187,False,29,Single(1),36.0,774501970389208065


In [26]:
print(traindf_clean.describe(include='all'))

        weight_pounds  is_male    mother_age  plurality  gestation_weeks  \
count    29816.000000    29816  29816.000000      29816     29816.000000   
unique            NaN        3           NaN          5              NaN   
top               NaN  Unknown           NaN  Single(1)              NaN   
freq              NaN    14908           NaN      28788              NaN   
mean         7.241639      NaN     27.393413        NaN        38.583311   
std          1.309622      NaN      6.155183        NaN         2.535166   
min          0.500449      NaN     12.000000        NaN        17.000000   
25%          6.563162      NaN     22.000000        NaN        38.000000   
50%          7.312733      NaN     27.000000        NaN        39.000000   
75%          8.062305      NaN     32.000000        NaN        40.000000   
max         12.588395      NaN     51.000000        NaN        47.000000   

           hashmonth  
count   2.981600e+04  
unique           NaN  
top              N

## Lab Task #3

Write the cleaned data to CSV files. Change the names of the Pandas DataFrames (traindf, evaldf) as appropriate.



In [28]:
traindf_clean.to_csv('train.csv', index=False, header=False)
testdf_clean.to_csv('eval.csv', index=False, header=False)
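
Because these files are written with `header=False`, whatever reads them later has to supply the column names itself. The sketch below shows one way to do that with `pandas.read_csv`. The `CSV_COLUMNS` list is an assumption that follows the column order of the DataFrames in this notebook, and an in-memory buffer with two rows in the same format stands in for the real file.

```python
import io

import pandas as pd

# Assumed column order: the same order in which the DataFrame columns appear.
CSV_COLUMNS = ["weight_pounds", "is_male", "mother_age",
               "plurality", "gestation_weeks", "hashmonth"]

# Two rows in the headerless format; a real run would pass 'eval.csv' instead.
sample_csv = io.StringIO(
    "7.50012615324,True,21,Single(1),40.0,8904940584331855459\n"
    "7.62578964258,True,33,Single(1),39.0,2995620979373137889\n"
)

# header=None stops pandas from treating the first data row as column names.
df = pd.read_csv(sample_csv, header=None, names=CSV_COLUMNS)
print(df.dtypes)
```

Without `header=None`, the first record would be consumed as the header and silently lost from the data.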

In [29]:
%%bash
wc -l *.csv
head *.csv
tail *.csv

   3082 eval.csv
  29816 train.csv
  32898 total
==> eval.csv <==
7.50012615324,True,21,Single(1),40.0,8904940584331855459
7.62578964258,True,33,Single(1),39.0,2995620979373137889
8.87581066812,False,39,Single(1),42.0,2995620979373137889
7.1870697412,False,39,Single(1),38.0,260598435387740869
6.503636729,True,33,Single(1),37.0,270792696282171059
6.87401332916,False,18,Single(1),40.0,5742197815970064689
5.93704871566,False,27,Single(1),36.0,4740473290291881219
6.2501051276999995,True,25,Single(1),40.0,7146494315947640619
8.935335478859999,True,37,Single(1),39.0,5742197815970064689
7.87491199864,False,19,Single(1),39.0,7146494315947640619

==> train.csv <==
6.8122838958,True,35,Single(1),41.0,774501970389208065
6.8122838958,False,32,Single(1),39.0,774501970389208065
7.25100379718,False,13,Single(1),38.0,774501970389208065
7.8925489796,True,29,Single(1),40.0,774501970389208065
5.3351867404,False,29,Single(1),36.0,774501970389208065
6.1244416383599996,True,28,Single(1),37.0,774501970389208

Copyright 2017-2018 Google Inc. Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License