


**Task** - Use BigQuery to import the dataset, build a model, and finally save the model to a Google Cloud Storage bucket.

**About the Dataset**

The dataset describes all United States births registered in the 50 States, the District of Columbia, and New York City from 1969 to 2008.

**Feature Description:**

**Dependent Variable**

- **weight_pounds** - Weight of the child, in pounds.

**Independent Variables:**

- **is_male** - TRUE if the child is male, FALSE if female.
- **mother_age** - Reported age of the mother when giving birth.
- **plurality** - Number of children born as a result of this pregnancy (twins = 2, triplets = 3, and so on).
- **gestation_weeks** - The number of weeks of the pregnancy.


In [None]:
# Installing xgboost 
!pip3 install xgboost==0.82

In [None]:
import pandas as pd                                 
import xgboost as xgb                                
import numpy as np                                   

from sklearn.model_selection import train_test_split 
from sklearn.utils import shuffle                    
from sklearn.metrics import r2_score                 

# BigQuery is a fully managed, serverless SQL data warehouse that allows for speedy SQL queries and 
# interactive analysis of large datasets.
from google.cloud import bigquery                    

# Importing the dataset
We will use BigQuery to import the dataset.

1. To find the total number of rows in the entire dataset

In [None]:
query1="""
SELECT
  count(*),
FROM
  publicdata.samples.natality
"""

In [None]:
df = bigquery.Client().query(query1).to_dataframe()
df

2. To find the number of records for the year 2000

In [None]:
query2="""
SELECT
count(*),
FROM
publicdata.samples.natality
WHERE year = 2000
"""

In [None]:
df = bigquery.Client().query(query2).to_dataframe()
df

3. To find the number of records for years after 2000

In [None]:
query3="""
SELECT
  count(*),
FROM
  publicdata.samples.natality
WHERE year > 2000
"""

In [None]:
df = bigquery.Client().query(query3).to_dataframe()
df
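
The three counts above can also be fetched in a single pass over the table. The sketch below assumes BigQuery standard SQL, where `COUNTIF` is available as an aggregate; the column aliases and the `fetch_counts` helper are illustrative names that do not appear elsewhere in this notebook.

```python
# Sketch: one query instead of three. The aliases below are illustrative.
COUNTS_QUERY = """
SELECT
  COUNT(*) AS total_rows,
  COUNTIF(year = 2000) AS rows_in_2000,
  COUNTIF(year > 2000) AS rows_after_2000
FROM
  publicdata.samples.natality
"""


def fetch_counts(client, sql=COUNTS_QUERY):
    """Run the query on an existing BigQuery client and return a one-row DataFrame."""
    # `client` is expected to be a google.cloud.bigquery.Client instance.
    return client.query(sql).to_dataframe()
```

With the imports from the earlier cell, `fetch_counts(bigquery.Client())` should return one row with the three counts side by side.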

## Importing the dataset for our analysis

In [None]:
query="""
SELECT
  weight_pounds,
  is_male,
  mother_age,
  plurality,
  gestation_weeks
FROM
  publicdata.samples.natality
WHERE year > 2000
LIMIT 40000
"""

In [None]:
df = bigquery.Client().query(query).to_dataframe() 
df.head()

# Exploring the data

In [None]:
df.info()

In [None]:
df.describe()

In [None]:
# Checking whether male and female births are evenly represented
df['is_male'].value_counts() 

In [None]:
# Checking null  values
df.isnull().sum()

In [None]:
df = df.dropna(axis=0)  # Dropping every row that contains a missing value

In [None]:
df.isnull().sum()  # Checking null values

# Data Wrangling
In this step we prepare the dataset for model building.

## Shuffling the data
Shuffling helps the model generalize: it removes any ordering in the rows returned by the query, so the training data is representative of all the data and no ordering bias is introduced.

In [None]:
# Shuffle the data
df = shuffle(df,random_state = 42) 

In [None]:
df.head() 

In [None]:
labels = df['weight_pounds']                 # Dependent variable
data = df.drop(columns=['weight_pounds'])    # Independent variables (axis is implied by columns=)

In [None]:
# Converting the data type of all the independent variables to 'float' type
data['is_male'] = data['is_male'].astype(float)
data['mother_age'] = data['mother_age'].astype(float)
data['plurality'] = data['plurality'].astype(float)
data['gestation_weeks'] = data['gestation_weeks'].astype(float)

In [None]:
data.info()

## Splitting the dataset into train and test
This helps in evaluating the model performance.

In [None]:
x, y = data, labels                                          # x holds the independent variables, y the dependent variable
x_train, x_test, y_train, y_test = train_test_split(x, y)    # Splitting the data into train and test sets
                                                             # No test_size is given, so the default applies: 25% of the rows go to the test set
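
To make the default 75/25 split concrete, here is a minimal standard-library sketch of what such a split does. It is a stand-in for illustration, not scikit-learn's implementation: `split_sizes` and `simple_split` are hypothetical helpers, the fixed seed is an assumption added for reproducibility, and the test count is assumed to be rounded up.

```python
import math
import random


def split_sizes(n_rows, test_size=0.25):
    """Return (n_train, n_test) for a fractional test_size, rounding the test count up."""
    n_test = math.ceil(n_rows * test_size)
    return n_rows - n_test, n_test


def simple_split(rows, test_size=0.25, seed=42):
    """Shuffle a copy of rows with a fixed seed, then cut off the test portion."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_train, _ = split_sizes(len(shuffled), test_size)
    return shuffled[:n_train], shuffled[n_train:]


# 40000 is the LIMIT used in the query; the real frame is smaller after dropna.
train_rows, test_rows = simple_split(range(40000))
print(len(train_rows), len(test_rows))
```

In the notebook itself, passing `test_size=` and `random_state=` to `train_test_split` gives the same kind of control, and fixing `random_state` makes the reported score repeatable between runs.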

# Model Building and Evaluation

In [None]:
model = xgb.XGBRegressor(objective = 'reg:linear') # Creating an instance of the class XGBRegressor

In [None]:
model.fit(x_train,y_train)                         # Now fitting to our dataset

In [None]:
y_pred = model.predict(x_test)                     # Predicting the weight for every observation in the test set

In [None]:
r2_score(y_test,y_pred)                            # Evaluating the performance of the model
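
For reference, the R² score compares the model's squared error with that of a baseline that always predicts the mean: R² = 1 - SS_res / SS_tot. The sketch below computes it by hand together with the RMSE, which is expressed in pounds and can be easier to interpret. The helper name `r2_and_rmse` and the four weights are made-up illustrations, not values from the dataset.

```python
import math


def r2_and_rmse(y_true, y_pred):
    """Compute R^2 and RMSE from two equal-length sequences of numbers."""
    y_true = [float(v) for v in y_true]
    y_pred = [float(v) for v in y_pred]
    mean_true = sum(y_true) / len(y_true)
    # Residual sum of squares: error of the model's predictions.
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total sum of squares: error of always predicting the mean.
    # Note: this is zero (and R^2 undefined) if all true values are identical.
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1.0 - ss_res / ss_tot
    rmse = math.sqrt(ss_res / len(y_true))
    return r2, rmse


# Toy values for illustration only.
r2, rmse = r2_and_rmse([7.0, 8.0, 6.0, 7.5], [7.2, 7.6, 6.4, 7.3])
print(round(r2, 3), round(rmse, 3))
```

Calling `r2_and_rmse(y_test, y_pred)` on the notebook's own arrays should agree with `r2_score(y_test, y_pred)` up to floating-point error.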

In [None]:
# Printing the first 25 predictions alongside their true observed values.
for i in range(25):
    print('Predicted weight: ', y_pred[i])
    print('Actual weight: ', y_test.iloc[i])
    print()

# Saving the model

In [None]:
# We save the model in .bst format. This will be saved in the working directory.
model.save_model('jose_model.bst')  

Now to save our model in our cloud storage, we do the following:

In [None]:
# Here we specify the details regarding our bucket
GCP_PROJECT = 'My first Project'
MODEL_BUCKET = 'gs://buckets'
VERSION_NAME = 'v1'
MODEL_NAME = 'model'
REGION = 'asia-east1'
FRAMEWORK="XGBOOST"

In [None]:
!gsutil mb $MODEL_BUCKET # This creates the bucket with the name specified above.

In [None]:
!gsutil cp ./jose_model.bst $MODEL_BUCKET  # Now the model is saved at this bucket

In [None]:
!gcloud ai-platform models create $MODEL_NAME \
  --region=$REGION

In [None]:
!gcloud ai-platform versions create $VERSION_NAME \
  --model=$MODEL_NAME \
  --origin=$MODEL_BUCKET \
  --runtime-version=2.10 \
  --framework=$FRAMEWORK \
  --python-version=3.7 \
  --region=$REGION

# DEPLOYMENT
The following section lists the steps to deploy the model using the command-line interface (CLI). The same can be done from the Console, but only the CLI steps are covered here.

**Step 1.** Set environment variables to store the path to the Cloud Storage directory where your model binary is located, your model name, your version name and your framework choice.

When you create a version with the gcloud CLI, you may provide the framework name in capital letters with underscores (for example, SCIKIT_LEARN) or in lowercase letters with hyphens (for example, scikit-learn). In our case it is 'XGBOOST'.

For example:

MODEL_DIR="gs://your_bucket_name/"

VERSION_NAME="[YOUR-VERSION-NAME]"

MODEL_NAME="[YOUR-MODEL-NAME]"

FRAMEWORK="XGBOOST"

CUSTOM_CODE_PATH="gs://your_bucket_name/my_custom_code-0.1.tar.gz"

---------------------------------------------------------------------------------------------
**Step 2.** Create the version:

For example:

gcloud ai-platform versions create $VERSION_NAME \
  --model=$MODEL_NAME \
  
  --origin=$MODEL_DIR \
  
  --runtime-version=2.10 \
  
  --framework=$FRAMEWORK \
  
  --python-version=3.7 \
  
  --region=REGION \
  
  --machine-type=MACHINE_TYPE
  
Replace the following:

**REGION:** The region of the regional endpoint on which you created the model. If you created the model on the global endpoint, omit the --region flag.

**MACHINE_TYPE:** A machine type, determining the computing resources available to your prediction nodes

----------------------------------------------------------------------------------------------------

**Step 3.** Get information about your new version:

For example:

gcloud ai-platform versions describe $VERSION_NAME \
  --model=$MODEL_NAME

The output looks like this:

createTime: '2018-02-28T16:30:45Z'
deploymentUri: gs://your_bucket_name
framework: [YOUR-FRAMEWORK-NAME]
machineType: mls1-c1-m2
name: projects/[YOUR-PROJECT-ID]/models/[YOUR-MODEL-NAME]/versions/[YOUR-VERSION-NAME]
pythonVersion: '3.7'
runtimeVersion: '2.10'
state: READY
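
To check from Python that the version is ready, the flat `key: value` output above can be turned into a dictionary. This is a minimal sketch under the assumption that the output stays a flat list like the sample; `parse_describe_output` is a hypothetical helper, and asking gcloud for JSON via its `--format=json` option and parsing that with `json.loads` would likely be more robust.

```python
def parse_describe_output(text):
    """Turn flat 'key: value' lines into a dict, stripping surrounding quotes."""
    info = {}
    for line in text.splitlines():
        if ":" not in line:
            continue
        # Split on the first colon only, so values such as gs://... stay intact.
        key, _, value = line.partition(":")
        info[key.strip()] = value.strip().strip("'\"")
    return info


# Sample copied from the placeholder output shown above.
sample_output = """\
createTime: '2018-02-28T16:30:45Z'
deploymentUri: gs://your_bucket_name
machineType: mls1-c1-m2
pythonVersion: '3.7'
runtimeVersion: '2.10'
state: READY
"""

version_info = parse_describe_output(sample_output)
print(version_info["state"] == "READY")
```

A version is only able to serve predictions once `state` reads `READY`, so this check is a reasonable gate before sending requests.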