# Machine Learning and Big Data Analysis course
## Topic: Advanced Big Data analysis techniques
### Part 1. Large files with Pandas and Dask

### 1. Pandas hints

In [None]:
import os
import sys
import numpy as np
import pandas as pd

pd.set_option('display.max_columns', None)

We will use data from [Microsoft Malware Prediction on Kaggle](https://www.kaggle.com/competitions/microsoft-malware-prediction/overview). The goal of this competition is to predict a Windows machine’s probability of getting infected by various families of malware, based on different properties of that machine. 

First, let's try to read the file as is:

In [None]:
file_path = '../__DATA/malware_prediction.csv'

In [None]:
df = pd.read_csv(file_path)

Oops... Let's check the file size:

In [None]:
!ls -lah $file_path

A large file requires a different approach, so let's try a few tricks:

__Trick 1.__ Less data.

In [None]:
df = pd.read_csv(file_path, nrows=10)

In [None]:
df.head()

A small sample is enough to study the structure of the data:

In [None]:
df.describe().T

...and we can load individual columns to inspect the data column by column:

In [None]:
%%time
df = pd.read_csv(file_path, usecols=['ProductName'])

In [None]:
df.head()

In [None]:
df.value_counts()

In [None]:
%%time
df = pd.read_csv(file_path, usecols=['Platform'])
df.value_counts()
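Reading fewer columns can be combined with a more compact data type. The sketch below is a minimal illustration on a small synthetic CSV (the values are hypothetical stand-ins, only the column names follow the real file) and compares the memory used by a string column read as `object` and as `category`:

```python
import io
import pandas as pd

# Synthetic stand-in for the real file (hypothetical values, real column names).
csv_text = (
    'ProductName,Platform,HasDetections\n'
    + 'win8defender,windows10,1\nmse,windows7,0\n' * 1000
)

# Read a single column, first as plain Python strings, then as a categorical.
as_object = pd.read_csv(
    io.StringIO(csv_text), usecols=['Platform'], dtype={'Platform': 'object'}
)
as_category = pd.read_csv(
    io.StringIO(csv_text), usecols=['Platform'], dtype={'Platform': 'category'}
)

# deep=True also counts the memory held by the string objects themselves.
object_bytes = int(as_object.memory_usage(deep=True).sum())
category_bytes = int(as_category.memory_usage(deep=True).sum())
print('object:', object_bytes, 'bytes; category:', category_bytes, 'bytes')
```

The saving depends on how many distinct values the column has: a categorical stores each distinct string once plus a small integer code per row, so it helps most for low-cardinality columns such as `Platform`.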

__Trick 2.__ Chunks.

In [None]:
from tqdm.auto import tqdm

In [None]:
result = None
for chunk in tqdm(pd.read_csv(file_path, chunksize=100_000)):
    chunk_result = chunk['Platform'].value_counts()
    if result is None:
        result = chunk_result
    else:
        result = result.add(chunk_result, fill_value=0)

In [None]:
result.sort_values(ascending=False, inplace=True)
print(result)
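To check that summing per-chunk counts gives the same answer as a single pass, the same pattern can be run on a tiny synthetic CSV. This is a sketch under assumptions: the data are hypothetical and the chunk size is deliberately tiny to force several chunks.

```python
import io
import pandas as pd

# Synthetic stand-in for the real file (hypothetical values).
rows = ['windows10'] * 7 + ['windows8'] * 2 + ['windows7'] * 3
csv_text = 'Platform\n' + '\n'.join(rows) + '\n'

# Aggregate chunk by chunk, exactly as in the loop above.
result = None
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=5):
    counts = chunk['Platform'].value_counts()
    result = counts if result is None else result.add(counts, fill_value=0)

# add(..., fill_value=0) produces floats, so cast back to integers.
result = result.astype('int64').sort_values(ascending=False)

# Reference: the same counts computed in one pass over the whole data.
full = pd.read_csv(io.StringIO(csv_text))['Platform'].value_counts()
print(result.to_dict() == full.to_dict())
```

Note that `fill_value=0` matters: without it, a category missing from one of the two operands would produce `NaN` instead of a count.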

Chunks also work well when the processing involves several columns:

In [None]:
parts = []
for chunk in tqdm(pd.read_csv(file_path, chunksize=100_000)):
    # combine two columns; .str.replace also tolerates missing values
    chunk_result = chunk['Platform'].str.replace('windows', 'WIN') + \
        ' ' + chunk['ProductName']
    parts.append(chunk_result)

# Series.append was removed in pandas 2.0, so concatenate all chunks once
result = pd.concat(parts)

In [None]:
result.head()

### 2. Use of Dask library

Now let's try [Dask](https://docs.dask.org/en/stable/).

#### 2.1. Basic operations

In [None]:
import dask
import dask.dataframe as dd
from dask.distributed import Client, LocalCluster, progress

You may change `n_workers` and `memory_limit` to tune performance, but take into account the resource limits of your server (or cluster):

In [None]:
client = Client(
    n_workers=2,
    threads_per_worker=1,
    memory_limit='2GB'
)
print(
    'Dask dashboard available at:',
    'https://jhas01.gsom.spbu.ru{}proxy/{}/status'.format(
        os.environ['JUPYTERHUB_SERVICE_PREFIX'],
        client.scheduler_info()['services']['dashboard']
    )
)
client

In [None]:
ddf = dd.read_csv(file_path)

Dask infers column types from a sample at the start of the file, so reading may require explicit data types. Let's define them:

In [None]:
dtypes = {
    'AVProductStatesIdentifier':                          'float64',
    'AVProductsEnabled':                                  'float64',
    'AVProductsInstalled':                                'float64',
    'Census_FirmwareManufacturerIdentifier':              'float64',
    'Census_FirmwareVersionIdentifier':                   'float64',
    'Census_InternalBatteryNumberOfCharges':              'float64',
    'Census_InternalPrimaryDisplayResolutionHorizontal':  'float64',
    'Census_InternalPrimaryDisplayResolutionVertical':    'float64',
    'Census_IsAlwaysOnAlwaysConnectedCapable':            'float64',
    'Census_IsFlightsDisabled':                           'float64',
    'Census_IsVirtualDevice':                             'float64',
    'Census_OEMModelIdentifier':                          'float64',
    'Census_OEMNameIdentifier':                           'float64',
    'Census_OSInstallLanguageIdentifier':                 'float64',
    'Census_PrimaryDiskTotalCapacity':                    'float64',
    'Census_ProcessorClass':                              'object', 
    'Census_ProcessorCoreCount':                          'float64',
    'Census_ProcessorManufacturerIdentifier':             'float64',
    'Census_ProcessorModelIdentifier':                    'float64',
    'Census_SystemVolumeTotalCapacity':                   'float64',
    'Census_TotalPhysicalRAM':                            'float64',
    'CityIdentifier':                                     'float64',
    'Firewall':                                           'float64',
    'GeoNameIdentifier':                                  'float64',
    'IsProtected':                                        'float64',
    'PuaMode':                                            'object', 
    'RtpStateBitfield':                                   'float64',
    'SMode':                                              'float64',
    'UacLuaenable':                                       'float64',
    'Wdft_IsGamer':                                       'float64',
    'Wdft_RegionIdentifier':                              'float64'
}

In [None]:
ddf = dd.read_csv(file_path, dtype=dtypes)
ddf.head()
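Most of the explicit types above are `float64` for columns that look like integers. A likely reason is missing values: an integer column with gaps cannot be stored as `int64`, and type inference on the first rows alone can guess wrong. The sketch below reproduces the effect with plain pandas on a hypothetical four-row CSV:

```python
import io
import pandas as pd

# Synthetic stand-in: the identifier is complete in the first rows, missing later.
csv_text = 'MachineId,CityIdentifier\na,1\nb,2\nc,\nd,4\n'

# Inference on the first rows only: the column looks like an integer.
head_only = pd.read_csv(io.StringIO(csv_text), nrows=2)

# Inference on the whole file: the gap forces a floating-point column.
whole = pd.read_csv(io.StringIO(csv_text))

# An explicit dtype gives the same type regardless of which rows are read.
explicit = pd.read_csv(
    io.StringIO(csv_text), nrows=2, dtype={'CityIdentifier': 'float64'}
)

print(head_only['CityIdentifier'].dtype)
print(whole['CityIdentifier'].dtype)
print(explicit['CityIdentifier'].dtype)
```

Declaring such columns as `float64` up front keeps every partition consistent, whichever part of the file it comes from.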

In [None]:
ddf.describe()

In [None]:
ddf.info()

In [None]:
print('all columns:', ddf.columns)

In [None]:
ddf.count()

Dask evaluates lazily: operations only build a task graph, and processing starts when `compute()` is called:

In [None]:
%%time
ddf.count().compute()

In [None]:
ddf.groupby('Platform').count()

In [None]:
%%time
ddf.groupby('Platform').count().compute()

In [None]:
%%time
ddf.groupby('Platform').Platform.count().compute()

Performance can be improved by increasing the number of workers and the memory limit:

In [None]:
client.close()  # close the previous client (and the local cluster it started)
client = Client(
    n_workers=4,
    threads_per_worker=1,
    memory_limit='8GB'
)
print(
    'Dask dashboard available at:',
    'https://jhas01.gsom.spbu.ru{}proxy/{}/status'.format(
        os.environ['JUPYTERHUB_SERVICE_PREFIX'],
        client.scheduler_info()['services']['dashboard']
    )
)
client

In [None]:
%%time
ddf.groupby('Platform').Platform.count().compute()

In [None]:
ddf.groupby('ProductName').ProductName.count().compute()

In [None]:
ddf.HasDetections.unique().compute()

In [None]:
ddf.groupby('HasDetections').HasDetections.count().compute()

In [None]:
ddf.HasDetections.mean().compute()

In [None]:
ddf_win10 = ddf.loc[ddf.Platform == 'windows10']

In [None]:
ddf_win10.HasDetections.mean().compute()

#### 2.2. User-defined functions with Dask

In [None]:
ddf.head()

In [None]:
ddf.Census_OSArchitecture.unique().compute()

Apply a function to a Dask DataFrame column:

In [None]:
def is_amd(text):
    if text == 'amd64':
        return 1
    else:
        return 0

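# meta describes the output: the name of the new Series and its dtype
# (is_amd returns integers, so the dtype is int64, not str)
ddf = ddf.assign(
    Census_OSArchitecture_isAMD=ddf.Census_OSArchitecture.map(
        is_amd, meta=('Census_OSArchitecture_isAMD', 'int64')
    )
)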

In [None]:
ddf.head()

In [None]:
ddf.groupby('Census_OSArchitecture_isAMD').Census_OSArchitecture_isAMD.count().compute()

#### 2.3. Dask for Machine Learning

For further reading, see [Dask for Machine Learning](https://examples.dask.org/machine-learning.html).

In [None]:
!pip install dask_ml
!pip install -U scikit-learn

In [None]:
import joblib
from dask_ml.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

##### 2.3.1. Data preprocessing

In [None]:
ddf = dd.read_csv(file_path, dtype=dtypes)
ddf.head()

In [None]:
# very simple model
ml_cols = [
    'IsBeta',
    'AVProductsInstalled',
    'AVProductsEnabled',
    'HasTpm',
    'Wdft_IsGamer',
    'HasDetections'
]

In [None]:
# drop rows with NaN values: RandomForestClassifier may not accept them (depends on the scikit-learn version)
ddf_ml = ddf[ml_cols].dropna()
ddf_ml.head()

In [None]:
cols = [
    'IsBeta',
    'AVProductsInstalled',
    'AVProductsEnabled',
    'HasTpm',
    'Wdft_IsGamer'
]
X = ddf_ml[cols]
y = ddf_ml['HasDetections']

##### 2.3.2. Training with Dask

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2)

In [None]:
clf = RandomForestClassifier(verbose=1)

In [None]:
with joblib.parallel_backend('dask'):
    clf.fit(X_train, y_train)
    preds = clf.predict(X_test)

##### 2.3.3. Evaluation

In [None]:
from sklearn.metrics import roc_auc_score, accuracy_score

In [None]:
roc_auc_score(y_test.values.compute(), preds)  # scikit-learn metrics expect (y_true, y_score)

In [None]:
accuracy_score(y_test.values.compute(), preds)  # (y_true, y_pred) order
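ROC AUC is a ranking metric, so it is usually computed from predicted probabilities rather than from hard class labels. The sketch below illustrates the difference with plain scikit-learn on synthetic data. Everything in it is an assumption for illustration (random features, a hypothetical target, fixed seeds), not the malware data set:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic binary problem: the target depends on the first feature plus noise.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(400, 3))
y_demo = (X_demo[:, 0] + 0.5 * rng.normal(size=400) > 0).astype(int)

Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)
demo_clf = RandomForestClassifier(n_estimators=50, random_state=0)
demo_clf.fit(Xd_train, yd_train)

labels = demo_clf.predict(Xd_test)              # hard 0/1 predictions
scores = demo_clf.predict_proba(Xd_test)[:, 1]  # probability of class 1

# In scikit-learn metrics the true labels come first.
acc = accuracy_score(yd_test, labels)
auc_from_labels = roc_auc_score(yd_test, labels)
auc_from_scores = roc_auc_score(yd_test, scores)
print(acc, auc_from_labels, auc_from_scores)
```

With hard labels the ROC curve has a single operating point, while probabilities let the metric evaluate the full ranking of the test samples.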