In this kernel, I explore the Microsoft Malware Prediction datasets with BigQuery, running the queries directly from this notebook.

Thank you for the great resources / support:
1. https://www.kaggle.com/sohier/getting-started-with-big-query
2. https://www.kaggle.com/product-feedback/48573#275757
3. https://www.kaggle.com/happycloud/bigquery-ml-template-intersection-congestion

Feel free to leave a comment and help improve the code...

Jie Geng kindly provided great code to get started. Source: https://www.kaggle.com/jiegeng94/everyone-do-this-at-the-beginning 
In that notebook, 17 columns are removed without affecting the score...

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Any results you write to the current directory are saved as output.

Following instructions provided by:
* https://www.kaggle.com/sirtorry/bigquery-ml-template-intersection-congestion


# 1. Setup and create your dataset

Copy your pre-processed datasets to a Cloud Storage bucket, from which they can be loaded into BigQuery (BQ), with the following command:
> !gsutil -m cp -r /kaggle/input/microsoft-malware-prediction/* gs://{your_bucket_name}/folder/
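
The `gsutil` command only copies the files into a bucket; they still have to be loaded into BigQuery tables. Below is a minimal sketch of that load step, not necessarily how the tables in this notebook were created. The helper names `gcs_uri` and `load_csv_to_bigquery`, the file name `train.csv`, and the choice of schema autodetection are assumptions for illustration.

```python
def gcs_uri(bucket_name, filename, folder="folder"):
    """Build the gs:// URI of a file copied by the gsutil command above."""
    return f"gs://{bucket_name}/{folder}/{filename}"


def load_csv_to_bigquery(client, uri, table_id):
    """Load one CSV file from Cloud Storage into a BigQuery table.

    `client` is a google.cloud.bigquery.Client; `table_id` has the form
    "project.dataset.table".
    """
    # Imported here so gcs_uri() stays usable without the client library.
    from google.cloud import bigquery

    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,  # skip the CSV header row
        autodetect=True,      # assumption: let BigQuery infer the column types
    )
    load_job = client.load_table_from_uri(uri, table_id, job_config=job_config)
    return load_job.result()  # blocks until the load job has finished


# Hypothetical usage (requires credentials and an existing dataset):
# load_csv_to_bigquery(
#     client,
#     gcs_uri("your_bucket_name", "train.csv"),
#     "kaggle-bq-quest.ms_maleware.train",
# )
```

Autodetection is convenient for a first pass; an explicit schema may be safer for columns such as version strings that could be misread as numbers.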

In [None]:
# Replace 'kaggle-bq-quest' with YOUR OWN project id here
PROJECT_ID = 'kaggle-bq-quest'

from google.cloud import bigquery
client = bigquery.Client(project=PROJECT_ID)  # optionally pass location="US"
# dataset = client.create_dataset('bqml_example', exists_ok=True)

from google.cloud.bigquery import magics
from kaggle.gcp import KaggleKernelCredentials
magics.context.credentials = KaggleKernelCredentials()
magics.context.project = PROJECT_ID

# create a reference to our table
table = client.get_table("kaggle-bq-quest.ms_maleware.train")

# look at five rows from our dataset
client.list_rows(table, max_results=5).to_dataframe()

# 2. Create your model

In [None]:
%load_ext google.cloud.bigquery

I am focusing on 27 features to save processing time.

In [None]:
%%bigquery
CREATE MODEL IF NOT EXISTS `ms_maleware.model1`
OPTIONS
( model_type='LOGISTIC_REG',
    auto_class_weights=TRUE
  ) AS
SELECT
    HasDetections as label,
    SmartScreen,
    AppVersion,
    Census_InternalBatteryNumberOfCharges,
    AVProductStatesIdentifier,
    Census_TotalPhysicalRAM,
    LocaleEnglishNameIdentifier,
    Census_SystemVolumeTotalCapacity,
    AVProductsEnabled,
    Census_InternalPrimaryDiagonalDisplaySizeInInches,
    Census_FirmwareVersionIdentifier,
    Census_OSInstallTypeName,
    Census_OSBuildNumber,
    Census_FirmwareManufacturerIdentifier,
    Census_ActivationChannel,
    Census_OSArchitecture,
    Census_ProcessorCoreCount,
    Census_OSEdition,
    Census_PrimaryDiskTypeName,
    Census_IsSecureBootEnabled,
    IsProtected,
    Census_InternalPrimaryDisplayResolutionVertical,
    Census_OSBuildRevision,
    Census_InternalPrimaryDisplayResolutionHorizontal,
    Census_HasOpticalDiskDrive,
    OrganizationIdentifier,
    EngineVersion,
    ProductName
FROM
  `kaggle-bq-quest.ms_maleware.train`
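
The same 27 column names are repeated in the training, evaluation and prediction queries, so it is easy for the lists to drift apart when features are added. One possible way to keep them in sync is to hold the names in a single Python list and generate the SQL from it. This is only a sketch; `FEATURES`, `feature_columns_sql` and `training_query` are hypothetical names, not part of the notebook.

```python
# The 27 feature columns used in the CREATE MODEL statement above.
FEATURES = [
    "SmartScreen", "AppVersion", "Census_InternalBatteryNumberOfCharges",
    "AVProductStatesIdentifier", "Census_TotalPhysicalRAM",
    "LocaleEnglishNameIdentifier", "Census_SystemVolumeTotalCapacity",
    "AVProductsEnabled", "Census_InternalPrimaryDiagonalDisplaySizeInInches",
    "Census_FirmwareVersionIdentifier", "Census_OSInstallTypeName",
    "Census_OSBuildNumber", "Census_FirmwareManufacturerIdentifier",
    "Census_ActivationChannel", "Census_OSArchitecture",
    "Census_ProcessorCoreCount", "Census_OSEdition",
    "Census_PrimaryDiskTypeName", "Census_IsSecureBootEnabled", "IsProtected",
    "Census_InternalPrimaryDisplayResolutionVertical", "Census_OSBuildRevision",
    "Census_InternalPrimaryDisplayResolutionHorizontal",
    "Census_HasOpticalDiskDrive", "OrganizationIdentifier", "EngineVersion",
    "ProductName",
]


def feature_columns_sql(columns=FEATURES, indent="    "):
    """Render the column names as a comma-separated SELECT list."""
    return ",\n".join(indent + name for name in columns)


def training_query(model, table):
    """Build the CREATE MODEL statement for the given model and table."""
    return (
        f"CREATE MODEL IF NOT EXISTS `{model}`\n"
        "OPTIONS (model_type='LOGISTIC_REG', auto_class_weights=TRUE) AS\n"
        "SELECT\n"
        "    HasDetections AS label,\n"
        f"{feature_columns_sql()}\n"
        "FROM\n"
        f"  `{table}`"
    )


print(training_query("ms_maleware.model1", "kaggle-bq-quest.ms_maleware.train"))
```

The generated string could then be run with `client.query(sql).result()` instead of a `%%bigquery` cell, and adding a feature becomes a one-line change to the list.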

# 3. Get training statistics

In [None]:
%%bigquery
SELECT
    *
FROM
  ML.TRAINING_INFO(MODEL `ms_maleware.model1`)
ORDER BY iteration 

# 4. Evaluate your model

In [None]:
%%bigquery
SELECT
  *
FROM ML.EVALUATE(MODEL `ms_maleware.model1`, (
  SELECT
    HasDetections as label,
    SmartScreen,
    AppVersion,
    Census_InternalBatteryNumberOfCharges,
    AVProductStatesIdentifier,
    Census_TotalPhysicalRAM,
    LocaleEnglishNameIdentifier,
    Census_SystemVolumeTotalCapacity,
    AVProductsEnabled,
    Census_InternalPrimaryDiagonalDisplaySizeInInches,
    Census_FirmwareVersionIdentifier,
    Census_OSInstallTypeName,
    Census_OSBuildNumber,
    Census_FirmwareManufacturerIdentifier,
    Census_ActivationChannel,
    Census_OSArchitecture,
    Census_ProcessorCoreCount,
    Census_OSEdition,
    Census_PrimaryDiskTypeName,
    Census_IsSecureBootEnabled,
    IsProtected,
    Census_InternalPrimaryDisplayResolutionVertical,
    Census_OSBuildRevision,
    Census_InternalPrimaryDisplayResolutionHorizontal,
    Census_HasOpticalDiskDrive,
    OrganizationIdentifier,
    EngineVersion,
    ProductName
  FROM
    `kaggle-bq-quest.ms_maleware.train`
    ))
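
For a logistic regression model, `ML.EVALUATE` reports precision, recall, accuracy and F1 score, among other metrics. As a reminder of what those numbers mean, here is a small sketch that computes them from toy labels in plain Python. The function name `binary_metrics` and the toy data are made up for illustration and have nothing to do with the malware dataset.

```python
def binary_metrics(y_true, y_pred):
    """Compute precision, recall, accuracy and F1 for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {"precision": precision, "recall": recall,
            "accuracy": accuracy, "f1_score": f1}


# Toy example: 3 true positives, 1 false positive, 1 false negative, 3 true negatives.
toy_true = [1, 0, 1, 1, 0, 0, 1, 0]
toy_pred = [1, 0, 0, 1, 0, 1, 1, 0]
print(binary_metrics(toy_true, toy_pred))
```

Note that the query above evaluates on the training table, so the reported metrics describe data the model has largely already seen.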

# 5. Use your model to predict outcomes

In [None]:
%%bigquery df
SELECT
    *
FROM
  ML.PREDICT(MODEL `ms_maleware.model1`,
    (
    SELECT
        MachineIdentifier,
        SmartScreen,
        AppVersion,
        Census_InternalBatteryNumberOfCharges,
        AVProductStatesIdentifier,
        Census_TotalPhysicalRAM,
        LocaleEnglishNameIdentifier,
        Census_SystemVolumeTotalCapacity,
        AVProductsEnabled,
        Census_InternalPrimaryDiagonalDisplaySizeInInches,
        Census_FirmwareVersionIdentifier,
        Census_OSInstallTypeName,
        Census_OSBuildNumber,
        Census_FirmwareManufacturerIdentifier,
        Census_ActivationChannel,
        Census_OSArchitecture,
        Census_ProcessorCoreCount,
        Census_OSEdition,
        Census_PrimaryDiskTypeName,
        Census_IsSecureBootEnabled,
        IsProtected,
        Census_InternalPrimaryDisplayResolutionVertical,
        Census_OSBuildRevision,
        Census_InternalPrimaryDisplayResolutionHorizontal,
        Census_HasOpticalDiskDrive,
        OrganizationIdentifier,
        EngineVersion,
        ProductName
    FROM
      `kaggle-bq-quest.ms_maleware.test`))
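
Besides `predicted_label`, `ML.PREDICT` on a classification model also returns a `predicted_label_probs` column holding one (label, prob) record per class. If the competition is scored on a ranking metric such as AUC (worth verifying on the competition page), submitting the probability of class 1 may work better than the hard 0/1 label. The sketch below assumes each cell of that column arrives in pandas as a sequence of dict-like records with `label` and `prob` keys; `positive_class_probability` is a hypothetical helper name.

```python
def positive_class_probability(probs, positive_label=1):
    """Return the probability assigned to the positive class.

    `probs` is assumed to be a sequence of records such as
    [{"label": 1, "prob": 0.7}, {"label": 0, "prob": 0.3}].
    """
    for entry in probs:
        if int(entry["label"]) == positive_label:
            return float(entry["prob"])
    raise ValueError(f"label {positive_label!r} not found in {probs!r}")


# Toy record shaped like one cell of predicted_label_probs (made-up numbers).
example = [{"label": 1, "prob": 0.7}, {"label": 0, "prob": 0.3}]
print(positive_class_probability(example))

# Hypothetical usage on the prediction DataFrame, before any columns are dropped:
# df["HasDetections"] = df["predicted_label_probs"].apply(positive_class_probability)
```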


# 6. Output as CSV

In [None]:
df = df[['predicted_label', 'MachineIdentifier']]
df.head()

In [None]:
# Only predicted_label needs renaming; MachineIdentifier already has the right name.
df = df.rename(columns={'predicted_label': 'HasDetections'})

Rearranging the columns:

In [None]:
df = df[['MachineIdentifier', 'HasDetections']]

In [None]:
df.head()

That looks better!

In [None]:
df.to_csv('submission.csv', index=False)
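
Before submitting, it can be worth checking that the file has the expected header and no duplicate machine ids. The following is a minimal sketch using only the standard library; `check_submission` is a hypothetical helper, and the expected header is taken from the column names used above.

```python
import csv


def check_submission(path, expected_header=("MachineIdentifier", "HasDetections")):
    """Return the number of data rows, raising ValueError on a malformed file."""
    with open(path, newline="") as handle:
        reader = csv.reader(handle)
        header = next(reader, None)
        if header is None or tuple(header) != tuple(expected_header):
            raise ValueError(f"unexpected header: {header}")
        seen = set()
        for line_number, row in enumerate(reader, start=2):
            if len(row) != len(expected_header):
                raise ValueError(
                    f"line {line_number}: expected {len(expected_header)} fields"
                )
            if row[0] in seen:
                raise ValueError(f"line {line_number}: duplicate id {row[0]}")
            seen.add(row[0])
    return len(seen)


# Hypothetical usage after the file has been written:
# print(check_submission("submission.csv"))
```

The returned row count can be compared with the number of rows in the test table to confirm that no machine was dropped along the way.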

# Suggestion for improvement

As you may have noticed, I selected a limited number of features to save processing time.
My decision is based on @neeraj22ny's notebook "Feature Importance Using LOFO".

Source: https://www.kaggle.com/neeraj22ny/feature-importance-using-lofo

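The selected columns are listed below. MachineIdentifier is the row identifier needed for the submission, not a model feature; the remaining 27 columns are the features: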

1.	MachineIdentifier,
2.	SmartScreen,
3.	AppVersion,
4.	Census_InternalBatteryNumberOfCharges,
5.	AVProductStatesIdentifier,
6.	Census_TotalPhysicalRAM,
7.	LocaleEnglishNameIdentifier,
8.	Census_SystemVolumeTotalCapacity,
9.	AVProductsEnabled,
10.	Census_InternalPrimaryDiagonalDisplaySizeInInches,
11.	Census_FirmwareVersionIdentifier,
12.	Census_OSInstallTypeName,
13.	Census_OSBuildNumber,
14.	Census_FirmwareManufacturerIdentifier,
15.	Census_ActivationChannel,
16.	Census_OSArchitecture,
17.	Census_ProcessorCoreCount,
18.	Census_OSEdition,
19.	Census_PrimaryDiskTypeName,
20.	Census_IsSecureBootEnabled,
21.	IsProtected,
22.	Census_InternalPrimaryDisplayResolutionVertical,
23.	Census_OSBuildRevision,
24.	Census_InternalPrimaryDisplayResolutionHorizontal,
25.	Census_HasOpticalDiskDrive,
26.	OrganizationIdentifier,
27.	EngineVersion,
28.	ProductName


I will add more features and test whether the score improves.


I hope some of you Kagglers can suggest ways to improve the code, or ideas that would help me learn more... Thanks.