# List of Utility Scripts on Kaggle

It seems the ability to search for Utility scripts has been removed!

Utility scripts (announced [here](https://www.kaggle.com/product-feedback/91185 "Feature Launch: Import scripts into notebook kernels")) are marked by an ***admin > utility script*** tag. When run, they save their source code as an output file; when one is attached to a Notebook as an *input source*, its code is available on the Python path and can be imported, exactly like the first cell below...

This Notebook generates a listing of Utility scripts using the [Meta Kaggle](https://www.kaggle.com/kaggle/meta-kaggle) dataset.
(If it says "dataset no longer available", ignore that: Kaggle refreshes the dataset every day. The link in this paragraph should work.)

## Contents

 * [Plot Output Size](#Plot-Output-Size)
 * [Utility Scripts](#Utility-Scripts)
 * [Large Utility Scripts - Mislabelled?](#Large-Utility-Scripts---Mislabelled?)

In [1]:
from jt_mk_utils import *
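
The import above works because the attached script's saved source file sits in a directory on the Python path. As a rough local illustration only (not how Kaggle wires this up internally), the sketch below writes a tiny module to a temporary directory, adds that directory to `sys.path`, and imports it. The module name `toy_utility_script` and its `greet` function are made up for this example.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Hypothetical stand-in for the source file a utility script saves when run.
source = "def greet(name):\n    return f'hello {name}'\n"

with tempfile.TemporaryDirectory() as tmp:
    # Save the "utility script" source to disk.
    Path(tmp, "toy_utility_script.py").write_text(source)

    # Make its directory importable, then import it like any other module.
    sys.path.insert(0, tmp)
    try:
        importlib.invalidate_caches()
        toy = importlib.import_module("toy_utility_script")
        greeting = toy.greet("kaggle")
    finally:
        # Leave sys.path as we found it.
        sys.path.remove(tmp)

print(greeting)
```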

In [2]:
import os, sys, re, time
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import HTML, Image, display
from datetime import datetime

In [3]:
tags = read_tags()
tags[tags.Slug.str.contains('util')]

In [4]:
ktags = read_kernel_tags()
ktags.nunique()

In [5]:
ids = ktags.query('TagId==16074').KernelId
kernels = read_kernels(filter=('Id', ids))
kernels = kernels.dropna(subset=['CurrentKernelVersionId'])
kernels.count()

In [6]:
users = read_users(filter=('Id', kernels.AuthorUserId))
users = users.set_index('Id')
kernels = kernels.join(users, on='AuthorUserId')
kernels = kernels.dropna(subset=['UserName'])
kernels.count()

Did you know that the [Meta Kaggle](https://www.kaggle.com/kaggle/meta-kaggle) dataset lists all the output files of *every version* of *every public Notebook on the platform*?!

It looks like some of the Notebooks marked with the ***admin > utility script*** tag are not really utility scripts. We could filter Notebooks by the filename extensions of their outputs, but I will leave that for a later version (or as an exercise for ***you***, dear reader: hit that *Copy and Edit* button!)
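
As a possible starting point for that exercise, here is a minimal sketch of such a filter on toy data. It assumes the output-files table has the `KernelVersionId` and `ContentTypeExtension` columns used later in this Notebook. Whether the real extensions carry a leading dot or mixed case is not checked here, so the sketch normalises both. The helper name `versions_with_python_output` and the toy values are invented for illustration.

```python
import pandas as pd

# Toy stand-in for the output-files table (values are made up).
toy_files = pd.DataFrame({
    "KernelVersionId": [101, 101, 102, 103, 103],
    "ContentTypeExtension": [".py", ".csv", ".h5", "py", ".PNG"],
})


def versions_with_python_output(files):
    """Return the KernelVersionIds that have at least one .py output file."""
    ext = (files["ContentTypeExtension"]
           .fillna("")        # tolerate missing extensions
           .str.lower()       # tolerate ".PY"
           .str.lstrip("."))  # tolerate "py" as well as ".py"
    return set(files.loc[ext == "py", "KernelVersionId"])


keep = versions_with_python_output(toy_files)
print(sorted(keep))
```

With the real tables, something like `kernels[kernels.CurrentKernelVersionId.isin(keep)]` would then keep only the Notebooks whose current version produced a Python file.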

In [7]:
ids = set(kernels.CurrentKernelVersionId)
output_files = read_kernel_version_output_files(filter=('KernelVersionId', ids))
output_files.head()

In [8]:
output_files.ContentTypeExtension.value_counts().head()

In [9]:
gb = output_files.groupby('KernelVersionId')
stats = gb.ContentLength.agg(['count', 'sum'])
stats.columns = ['Files', 'Content Length']
stats.head()
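
To make the aggregation above concrete, here is the same `groupby` / `agg` pattern on a three-row toy table (the numbers are made up): `count` gives the number of output files per version and `sum` gives their combined size in bytes.

```python
import pandas as pd

# Toy data: version 1 has two output files, version 2 has one.
toy = pd.DataFrame({
    "KernelVersionId": [1, 1, 2],
    "ContentLength": [1000, 2500, 400],
})

toy_stats = toy.groupby("KernelVersionId").ContentLength.agg(["count", "sum"])
toy_stats.columns = ["Files", "Content Length"]
print(toy_stats)
```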

In [10]:
kernels = kernels.join(stats, on='CurrentKernelVersionId')
kernels.shape

In [11]:
SHOW = ['User', 'Script', 'Votes', 'Content Length', 'Files']
HOST = "https://www.kaggle.com"

BC = '#40c4ff'

plt.rc('figure', figsize=(15, 9))
plt.rc('font', size=14)
plt.style.use('bmh')


def odp(v):
    return f'{v:.0f}'


def user_name_link(r):
    return (f'<a href="{HOST}/{r.UserName}" '
            f' title="UserName: {r.UserName}\n'
            f'RegisterDate: {r.RegisterDate.date()}">'
            f'{r.DisplayName}</a>')


def script_link(r):
    return (
        f'<a href="{HOST}/{r.UserName}/{r.CurrentUrlSlug}" '
        f' title="UserName: {r.UserName}\n'
        f'CreationDate: {r.CreationDate.date()}\n'
        f'EvaluationDate: {r.EvaluationDate.date()}\n'
        f'MadePublicDate: {r.MadePublicDate.date()}\n'
        f'Medal: {r.Medal}\n'
        f'MedalAwardDate: {r.MedalAwardDate.date()}\n'
        f'TotalViews: {r.TotalViews}\n'
        f'TotalComments: {r.TotalComments}">'
        f'{r.CurrentUrlSlug}</a>')


df = kernels.copy()
df['User'] = df.apply(user_name_link, axis=1)
df['Script'] = df.apply(script_link, axis=1)
df = df.sort_values(['TotalVotes', 'TotalViews'], ascending=False)
df = df.rename(columns={'TotalVotes': 'Votes'})


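def show_df(df):
    # Styler.hide_index() was removed in pandas 2.0; hide(axis='index')
    # is the replacement (available from pandas 1.4).
    return df[SHOW].style.bar(color=BC, width=85).format({
        'Files': odp,
        'Content Length': lambda v: f'{v/1e6:.1f} MB'
    }).hide(axis='index')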

# Plot Output Size

In [12]:
np.log(df['Content Length']).hist(bins=44, color=BC)
plt.title('Utility Scripts - Log(Content Length)');

# Utility Scripts

Sorted (in cell 11 above) by vote count, then view count, both descending.

In [13]:
small = (df['Content Length'] <= 3e6)
show_df(df[small])

# Large Utility Scripts - Mislabelled?

In [14]:
show_df(df[~small])

In [15]:
kernels.to_csv('UtilityScripts.csv', index=False)