# Fastai v2 with image, text, and tabular data

* [Fastai v2](https://www.fast.ai/2020/08/21/fastai2-launch/) was launched on August 21st, 2020 along with a companion textbook and a companion course.  Much of this work is adapted from the official fast.ai courses and tutorials.  For more detail about fastai, see https://www.mdpi.com/2078-2489/11/2/108/htm.  I highly recommend these free learning materials:  

 - https://www.fast.ai/2020/08/21/fastai2-launch/ 
 - https://docs.fast.ai/tutorial


* The purpose of this notebook is to demonstrate how to use a GPU-enabled Kaggle Notebook to train an ML model using the recently released fastai v2.  
* Note that this method uses the default Kaggle docker image and does not require any pip install statements.


In [None]:
import torch
import fastai
from fastai.tabular.all import *
from fastai.text.all import *
from fastai.vision.all import *
from fastai.medical.imaging import *
from fastai import *

from datetime import datetime, timezone

# datetime.now(timezone.utc) returns UTC explicitly, so the "UTC" label is accurate on any machine.
print(f'Notebook last run on {datetime.now(timezone.utc).strftime("%Y-%m-%d, %H:%M:%S UTC")}')
print('Using fastai version ',fastai.__version__)
print('And torch version ',torch.__version__)

In [None]:
def plot_fastai_results(learn):
    '''
    Plots the confusion matrix and prints sensitivity, specificity, prevalence, and accuracy
    for a binary classifier wrapped in a fastai Learner named "learn".
    Some portions are adapted from https://github.com/fastai/fastai/blob/master/nbs/61_tutorial.medical_imaging.ipynb
    '''
    interp = ClassificationInterpretation.from_learner(learn)
    interp.plot_confusion_matrix(figsize=(7,7))
    losses,idxs = interp.top_losses()
    # Sanity check: there should be one loss per validation item.
    assert len(learn.dls.valid_ds)==len(losses)==len(idxs)
    # Rows are actual classes and columns are predicted classes (2x2 for a binary problem).
    upp, low = interp.confusion_matrix()
    tn, fp = upp[0], upp[1]
    fn, tp = low[0], low[1]
    sensitivity = tp/(tp + fn)
    print('Sensitivity: ',sensitivity)
    specificity = tn/(fp + tn)
    print('Specificity: ',specificity)
    # Compute prevalence from the validation counts rather than hard-coding 15/50,
    # so the function gives correct results for every dataset in this notebook.
    prevalence = (tp + fn)/(tn + fp + fn + tp)
    print('Prevalence: ',prevalence)
    accuracy = (sensitivity * prevalence) + (specificity * (1 - prevalence))
    print('Accuracy: ',accuracy)
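
The function above reports accuracy as a prevalence-weighted mix of sensitivity and specificity. The following is a minimal sketch of that identity in plain Python, with no fastai dependency. The helper name `binary_metrics` and the counts passed to it are hypothetical and chosen only for illustration.

```python
def binary_metrics(tn, fp, fn, tp):
    """Return sensitivity, specificity, prevalence, and accuracy from 2x2 counts."""
    total = tn + fp + fn + tp
    sensitivity = tp / (tp + fn)      # share of actual positives that were found
    specificity = tn / (tn + fp)      # share of actual negatives that were found
    prevalence = (tp + fn) / total    # share of actual positives in the data
    accuracy = (tp + tn) / total      # share of all items classified correctly
    return sensitivity, specificity, prevalence, accuracy

# Hypothetical counts, not taken from any model in this notebook.
sens, spec, prev, acc = binary_metrics(tn=30, fp=5, fn=3, tp=12)

# Accuracy equals the prevalence-weighted mix of sensitivity and specificity.
weighted = sens * prev + spec * (1 - prev)
print(sens, spec, prev, acc, weighted)
```

Because the weighted form depends on prevalence, it only matches the directly counted accuracy when the prevalence used is the one observed in the same validation set.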

# Image data from fastai
* This is a small dataset of chest X-ray images in DICOM format, labeled for pneumothorax.

In [None]:
pneumothorax_source = untar_data(URLs.SIIM_SMALL)
items = get_dicom_files(pneumothorax_source/"train/")
trn,val = RandomSplitter()(items)
# Each row of labels.csv holds a relative file path (column 0) and its label (column 1).
df = pd.read_csv(pneumothorax_source/"labels.csv")
pneumothorax = DataBlock(blocks=(ImageBlock(cls=PILDicom), CategoryBlock),
                   get_x=lambda x:pneumothorax_source/f"{x[0]}",
                   get_y=lambda x:x[1],
                   batch_tfms=aug_transforms(size=224))
dls = pneumothorax.dataloaders(df.values)
dls.show_batch(max_n=16)

In [None]:
learn = cnn_learner(dls, resnet34, metrics=accuracy, model_dir='/kaggle/tmp/model/')
learn.lr_find()
learn.fine_tune(5)
learn.show_results()
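
The cell above calls `learn.lr_find()` and then `fine_tune(5)` without passing a learning rate, so the finder's output is informational only. A learning rate range test of this kind trains briefly while the learning rate grows exponentially and records the loss at each step. The sketch below shows only the schedule, in plain Python. The helper name `exp_lr_schedule` is hypothetical, and the start, end, and iteration values are assumptions meant to mirror what I believe are fastai's defaults; check the fastai documentation before relying on them.

```python
def exp_lr_schedule(start_lr, end_lr, num_it):
    """Return num_it learning rates spaced evenly on a log scale."""
    ratio = end_lr / start_lr
    return [start_lr * ratio ** (i / (num_it - 1)) for i in range(num_it)]

# Assumed values: start at 1e-7, end at 10, 100 iterations.
lrs = exp_lr_schedule(1e-7, 10, 100)
print(lrs[0], lrs[len(lrs) // 2], lrs[-1])
```

To act on the finder's plot, a chosen value can be passed to `fine_tune` through its `base_lr` argument.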

In [None]:
plot_fastai_results(learn=learn)

# Image data from Kaggle
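* This dataset has many more images than the previous one, which should help the model reach a higher accuracy.  Note that the labels differ: these images are labeled for pneumonia rather than pneumothorax, so the two results are not directly comparable.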

In [None]:
path = Path('/kaggle/input/chest-xray-pneumonia/chest_xray/')
dls = ImageDataLoaders.from_folder(path, train='train',
                                   item_tfms=Resize(224),valid_pct=0.2,
                                   bs=64,seed=0)
dls.show_batch()
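
The `valid_pct=0.2` and `seed=0` arguments above ask for a reproducible random hold-out of 20% of the images. The sketch below illustrates the idea in plain Python. It is not fastai's implementation, and the helper name `random_split` is hypothetical.

```python
import random

def random_split(n_items, valid_pct=0.2, seed=None):
    """Shuffle the indices 0..n_items-1 and hold out a valid_pct share."""
    rng = random.Random(seed)          # a fixed seed makes the shuffle repeatable
    idxs = list(range(n_items))
    rng.shuffle(idxs)
    cut = int(valid_pct * n_items)     # number of validation items
    return idxs[cut:], idxs[:cut]      # (train indices, validation indices)

train_idx, valid_idx = random_split(10, valid_pct=0.2, seed=0)
# Calling again with the same seed yields the same split.
train_again, valid_again = random_split(10, valid_pct=0.2, seed=0)
print(len(train_idx), len(valid_idx), valid_idx == valid_again)
```

Fixing the seed keeps the validation set stable between runs, which makes accuracy figures from repeated runs easier to compare.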

In [None]:
learn = cnn_learner(dls, resnet34, metrics=accuracy, model_dir='/kaggle/tmp/model/')
learn.lr_find()
learn.fine_tune(5)
learn.show_results()

In [None]:
plot_fastai_results(learn=learn)

# Tabular data from fastai

In [None]:
path = untar_data(URLs.ADULT_SAMPLE)
df = pd.read_csv(path/'adult.csv')
# Option 1: build the DataLoaders directly from the CSV file.
dls = TabularDataLoaders.from_csv(path/'adult.csv', path=path, y_names="salary",
    cat_names = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race'],
    cont_names = ['age', 'fnlwgt', 'education-num'],
    procs = [Categorify, FillMissing, Normalize])
# Option 2: build a TabularPandas object first, with an explicit train/validation split.
# The `dls` created here replaces the one from option 1.
splits = RandomSplitter(valid_pct=0.2)(range_of(df))
to = TabularPandas(df, procs=[Categorify, FillMissing,Normalize],
                   cat_names = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race'],
                   cont_names = ['age', 'fnlwgt', 'education-num'],
                   y_names='salary',
                   splits=splits)
dls = to.dataloaders(bs=64)
dls.show_batch()
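
The three `procs` used above, `Categorify`, `FillMissing`, and `Normalize`, preprocess the categorical and continuous columns before training. The sketch below shows the general idea of each step in plain Python. It is a conceptual illustration, not fastai's implementation: the helper names, the use of the median as fill value, and the sample rows are all assumptions.

```python
from statistics import mean, median, pstdev

def categorify(values):
    """Map each distinct category to an integer code; 0 is reserved for missing."""
    vocab = sorted({v for v in values if v is not None})
    codes = {v: i + 1 for i, v in enumerate(vocab)}
    return [codes.get(v, 0) for v in values], vocab

def fill_missing(values):
    """Replace None with the median and return a was-missing flag per row."""
    fill = median(v for v in values if v is not None)
    return [fill if v is None else v for v in values], [v is None for v in values]

def normalize(values):
    """Rescale to zero mean and unit (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

# Hypothetical rows, used only to exercise the helpers.
workclass = ["Private", None, "State-gov", "Private"]
age = [39, None, 50, 28]

codes, vocab = categorify(workclass)
filled, was_missing = fill_missing(age)
scaled = normalize(filled)
print(codes, vocab, filled, was_missing)
```

The missing-value flag matters because the fact that a value was absent can itself carry signal for the model.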

In [None]:
learn = tabular_learner(dls, metrics=accuracy)
learn.lr_find()
learn.fine_tune(5)
learn.show_results()

In [None]:
plot_fastai_results(learn=learn)

# Tabular data from Kaggle

In [None]:
df = pd.read_csv('/kaggle/input/adult-census-income/adult.csv', skipinitialspace=True)
# Option 1: build the DataLoaders directly from the DataFrame.
# `path` is reused from the fastai tabular cell above; it sets the directory fastai uses for saved models.
dls = TabularDataLoaders.from_df(df=df, path=path, y_names="income",
    cat_names = ['workclass', 'education', 'marital.status', 'occupation', 'relationship', 'race'],
    cont_names = ['age', 'fnlwgt', 'education.num'],
    procs = [Categorify, FillMissing, Normalize])
# Option 2: build a TabularPandas object first, with an explicit train/validation split.
# The `dls` created here replaces the one from option 1.
splits = RandomSplitter(valid_pct=0.2)(range_of(df))
to = TabularPandas(df, procs=[Categorify, FillMissing,Normalize],
                   cat_names = ['workclass', 'education', 'marital.status', 'occupation', 'relationship', 'race'],
                   cont_names = ['age', 'fnlwgt', 'education.num'],
                   y_names='income',
                   splits=splits)
dls = to.dataloaders(bs=64)
dls.show_batch()

In [None]:
learn = tabular_learner(dls, metrics=accuracy)
learn.lr_find()
learn.fine_tune(5)
learn.show_results()

In [None]:
plot_fastai_results(learn=learn)

# Text data from fastai
* IMDB Film Reviews, positive or negative

In [None]:
path = untar_data(URLs.IMDB)
# Reuse `path` instead of calling untar_data a second time.
dls = TextDataLoaders.from_folder(path, valid='test')
dls.show_batch(max_n=3) # investigate https://forums.fast.ai/t/most-of-the-items-in-show-batch-is-xxpad-strings/78989
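
The forum thread linked in the comment above concerns batches that display mostly `xxpad` tokens. One likely reason is that every review in a batch is padded to the length of the longest one, so a single long review inflates the padding for all the others. The sketch below illustrates that effect in plain Python. The helper names, the sample token lists, and the choice to pad at the end are assumptions, and this is not fastai's batching code.

```python
PAD = "xxpad"  # pad token name as it appears in the forum thread title

def pad_batch(seqs, pad=PAD):
    """Pad every token list to the length of the longest one in the batch."""
    longest = max(len(s) for s in seqs)
    return [s + [pad] * (longest - len(s)) for s in seqs]

def pad_fraction(batch, pad=PAD):
    """Return the share of tokens in the batch that are padding."""
    flat = [tok for seq in batch for tok in seq]
    return sum(tok == pad for tok in flat) / len(flat)

# Hypothetical tokenized reviews: two short ones and one long one.
reviews = [["great", "film"], ["bad"], ["a"] * 10]

mixed = pad_batch(reviews)           # the long review sets the batch length
short_only = pad_batch(reviews[:2])  # similar lengths need far less padding
print(pad_fraction(mixed), pad_fraction(short_only))
```

Grouping reviews of similar length into the same batch reduces the share of padding, which is why displayed samples can look very different depending on how a batch was assembled.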

In [None]:
learn = text_classifier_learner(dls, AWD_LSTM, drop_mult=0.5, metrics=accuracy)
learn.lr_find()
learn.fine_tune(7)
learn.show_results(max_n=3)

In [None]:
plot_fastai_results(learn=learn)

# Text data from Kaggle
* IMDB Film Reviews, positive or negative

In [None]:
df = pd.read_csv('/kaggle/input/imdb-dataset-of-50k-movie-reviews/IMDB Dataset.csv')
dls = TextDataLoaders.from_df(df=df,model_dir='/kaggle/tmp/model/')
#dls.show_batch() # investigate https://forums.fast.ai/t/most-of-the-items-in-show-batch-is-xxpad-strings/78989/5
df.head(15)

In [None]:
learn = text_classifier_learner(dls, AWD_LSTM, drop_mult=0.5, metrics=accuracy)
learn.lr_find()
learn.fine_tune(5)
learn.show_results(max_n=3) # investigate https://forums.fast.ai/t/most-of-the-items-in-show-batch-is-xxpad-strings/78989/2

In [None]:
plot_fastai_results(learn=learn)

# References

**Tutorials and code snippets:**

 - https://docs.fast.ai/tutorial
 
  - http://docs.fast.ai/tutorial.tabular
  - http://docs.fast.ai/tutorial.text
  - http://docs.fast.ai/tutorial.vision
  
    - https://github.com/fastai/fastai/blob/master/nbs/61_tutorial.medical_imaging.ipynb

**Datasets:**

 - From fastai:

  - Filice R et al. "Crowdsourcing pneumothorax annotations using machine learning annotations on the NIH chest X-ray dataset". J Digit Imaging (2019). https://doi.org/10.1007/s10278-019-00299-9
  - Ron Kohavi, "Scaling Up the Accuracy of Naive-Bayes Classifiers: a Decision-Tree Hybrid". Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (1996).  https://doi.org/10.4304/jcp.6.7.1325-1331
  - Maas, Andrew L., Daly, Raymond E., Pham, Peter T., Huang, Dan, Ng, Andrew Y, Potts, Christopher.  "Learning Word Vectors for Sentiment Analysis". Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies.  (2011).   https://www.aclweb.org/anthology/P11-1015
  
 
 - From Kaggle:
 
  - https://www.kaggle.com/paultimothymooney/chest-xray-pneumonia
  - https://www.kaggle.com/uciml/adult-census-income
  - https://www.kaggle.com/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews

In [None]:
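# Save the package versions of the Kaggle docker image; -p avoids an error if the folder already exists.
!mkdir -p /kaggle/working/docker/
!pip freeze > '/kaggle/working/docker/requirements.txt'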