<a href="https://colab.research.google.com/github/saakolch/procedure_of_extracting_data/blob/main/data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
!pip install datasets evaluate transformers sentencepiece

In [None]:
!wget "https://archive.ics.uci.edu/ml/machine-learning-databases/00462/drugsCom_raw.zip"
!unzip drugsCom_raw.zip

--2024-05-13 05:49:19--  https://archive.ics.uci.edu/ml/machine-learning-databases/00462/drugsCom_raw.zip
Resolving archive.ics.uci.edu (archive.ics.uci.edu)... 128.195.10.252
Connecting to archive.ics.uci.edu (archive.ics.uci.edu)|128.195.10.252|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified
Saving to: ‘drugsCom_raw.zip’

drugsCom_raw.zip        [          <=>       ]  41.00M  61.1KB/s    in 7m 55s  

2024-05-13 05:57:14 (88.5 KB/s) - ‘drugsCom_raw.zip’ saved [42989872]

Archive:  drugsCom_raw.zip
  inflating: drugsComTest_raw.tsv    
  inflating: drugsComTrain_raw.tsv   


In [None]:
from datasets import load_dataset

data_files = {"train": "drugsComTrain_raw.tsv", "test": "drugsComTest_raw.tsv"}

# \t is the tab character
drug_dataset = load_dataset("csv", data_files=data_files, delimiter="\t")

Generating train split: 0 examples [00:00, ? examples/s]

Generating test split: 0 examples [00:00, ? examples/s]

In [None]:
drug_sample = drug_dataset['train'].shuffle(seed=42).select(range(1000))
drug_sample[:3]

{'patient_id': [191114, 142693, 71561],
 'drugName': ['Campral', 'Levonorgestrel', 'Vraylar'],
 'condition': ['alcohol dependence', 'birth control', 'bipolar disorde'],
 'review': ['"Sober a year 8-25-11. God, AA and Campral have worked. No cravings I couldn\'t handle. Together all have helped me have a new lease on life. Feel better, work has improved. Highly recommend this medicine if you want to quit drinking."',
  '"I\'ve been on birth control for a while now due to horrendous cramps and excessively heavy cycles. And I\'ve always been an anti-kid-having lady so hey it works for me. So I ended up switching from the nuva ring (which I loved) to the Mirena. I\'m not going to complain about pain because we\'re all different and to be honest, it really wasn\'t terrible, a little discomfort. But I\'ve had my Mirena for a little over 2 years now and I\'ve gained 40-50 pounds with no change in diet, I have absolutely no energy for anything (I work and go to school full time), and I\'m rece

In [None]:
for split in drug_dataset.keys():
    assert len(drug_dataset[split]) == len(drug_dataset[split].unique("Unnamed: 0"))
    print(len(drug_dataset[split].unique("Unnamed: 0")))
    print(len(drug_dataset[split]))

161297
161297
53766
53766


In [None]:
drug_dataset = drug_dataset.rename_column(
    original_column_name="Unnamed: 0", new_column_name='patient_id'
)
drug_dataset

DatasetDict({
    train: Dataset({
        features: ['patient_id', 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount'],
        num_rows: 161297
    })
    test: Dataset({
        features: ['patient_id', 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount'],
        num_rows: 53766
    })
})

In [None]:
def lowercase_condition(example):
    return {"condition": example["condition"].lower()}


drug_dataset = drug_dataset.filter(lambda x: x['condition'] is not None)
drug_dataset = drug_dataset.map(lowercase_condition)

Filter:   0%|          | 0/161297 [00:00<?, ? examples/s]

Filter:   0%|          | 0/53766 [00:00<?, ? examples/s]

Map:   0%|          | 0/160398 [00:00<?, ? examples/s]

Map:   0%|          | 0/53471 [00:00<?, ? examples/s]

In [None]:
drug_dataset['train']['condition'][:3]

['left ventricular dysfunction', 'adhd', 'birth control']

In [None]:
def compute_review_length(example):
    # Count whitespace-separated words in the review
    return {"review_length": len(example['review'].split())}

In [None]:
drug_dataset = drug_dataset.map(compute_review_length)

drug_dataset['train'][0]

Map:   0%|          | 0/160398 [00:00<?, ? examples/s]

Map:   0%|          | 0/53471 [00:00<?, ? examples/s]

{'patient_id': 206461,
 'drugName': 'Valsartan',
 'condition': 'left ventricular dysfunction',
 'review': '"It has no side effect, I take it in combination of Bystolic 5 Mg and Fish Oil"',
 'rating': 9.0,
 'date': 'May 20, 2012',
 'usefulCount': 27,
 'review_length': 17}

In [None]:
drug_dataset["train"].sort("review_length")[:3]

{'patient_id': [111469, 13653, 53602],
 'drugName': ['Ledipasvir / sofosbuvir',
  'Amphetamine / dextroamphetamine',
  'Alesse'],
 'condition': ['hepatitis c', 'adhd', 'birth control'],
 'review': ['"Headache"', '"Great"', '"Awesome"'],
 'rating': [10.0, 10.0, 10.0],
 'date': ['February 3, 2015', 'October 20, 2009', 'November 23, 2015'],
 'usefulCount': [41, 3, 0],
 'review_length': [1, 1, 1]}

In [None]:
drug_before = drug_dataset['train']
drug_before.num_rows

160398

In [None]:
drug_dataset = drug_dataset.filter(lambda x: x['review_length'] > 30)
drug_dataset.num_rows

Filter:   0%|          | 0/160398 [00:00<?, ? examples/s]

Filter:   0%|          | 0/53471 [00:00<?, ? examples/s]

{'train': 138514, 'test': 46108}

In [None]:
import html

text = "I&#039;m a transformer called BERT"
html.unescape(text)

"I'm a transformer called BERT"

In [None]:
# Non-batched version, one example per call (slower):
# drug_dataset = drug_dataset.map(lambda x: {'review': html.unescape(x['review'])})
# Batched version, many examples per call (much faster):
drug_dataset = drug_dataset.map(
    lambda x: {"review": [html.unescape(o) for o in x["review"]]}, batched=True
)

Map:   0%|          | 0/138514 [00:00<?, ? examples/s]

Map:   0%|          | 0/46108 [00:00<?, ? examples/s]
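
With batched=True, the function passed to .map() receives a dictionary whose values are lists (one entry per example in the batch) and returns lists as well. The sketch below calls such a function directly on a hand-made batch. The name `unescape_batch` and the sample strings are assumptions for illustration and are not part of the notebook.

```python
import html


def unescape_batch(batch):
    """Unescape HTML entities for every review in a batch (a dict of lists)."""
    return {"review": [html.unescape(text) for text in batch["review"]]}


# Assumption: a hand-made batch in the shape a batched map function receives.
batch = {"review": ["I&#039;m fine", "Salt &amp; pepper", "No entities here"]}
cleaned = unescape_batch(batch)
print(cleaned["review"])
```

The function never sees a single string, only a list of them, which is why the list comprehension is needed in the batched version.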

In [None]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('bert-base-cased')

def tokenize_function(examples):
    return tokenizer(examples['review'], truncation=True)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


tokenizer_config.json:   0%|          | 0.00/49.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/570 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/213k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/436k [00:00<?, ?B/s]

In [None]:
def tokenize_and_split(examples):
    return tokenizer(
        examples['review'],
        truncation=True,
        max_length=128,
        return_overflowing_tokens=True
    )

In [None]:
result = tokenize_and_split(drug_dataset['train'][:3])
[len(inp) for inp in result['input_ids']]

[128, 49, 128, 55, 116]

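The next cell raises an error. Because of the overflowing tokens, long reviews are split into several chunks, so a batch of 1,000 examples (the default batch size of .map()) produces 1,463 tokenized rows, and the new columns no longer have the same length as the old ones.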

In [None]:
tokenized_dataset = drug_dataset.map(tokenize_and_split, batched=True)
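
The length mismatch can be reproduced without a real tokenizer. The sketch below uses a hypothetical whitespace "tokenizer" (`toy_tokenize_and_split`, an assumption for illustration that ignores subwords and special tokens) to show that cutting long texts into chunks yields more output rows than input rows.

```python
def toy_tokenize_and_split(texts, max_length):
    """Split each text on whitespace and cut it into chunks of max_length."""
    chunks = []
    for text in texts:
        tokens = text.split()
        # Every slice of at most max_length tokens becomes its own row.
        for start in range(0, len(tokens), max_length):
            chunks.append(tokens[start:start + max_length])
    return {"input_ids": chunks}


# Assumption: three made-up "reviews" of 177, 183 and 116 words.
texts = [" ".join(["w"] * n) for n in (177, 183, 116)]
result = toy_tokenize_and_split(texts, max_length=128)
chunk_lengths = [len(chunk) for chunk in result["input_ids"]]

print(len(texts))       # number of input rows
print(chunk_lengths)    # one length per output row
```

Three inputs become five rows here, so an input_ids column of that length cannot be stored next to the original three-row columns.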

One fix is to remove all of the old columns from drug_dataset, so that only the new tokenized columns remain:

In [None]:
tokenized_dataset = drug_dataset.map(
    tokenize_and_split, batched=True, remove_columns=drug_dataset['train'].column_names
)

Map:   0%|          | 0/138514 [00:00<?, ? examples/s]

Map:   0%|          | 0/46108 [00:00<?, ? examples/s]

In [None]:
len(tokenized_dataset['train']), len(drug_dataset['train']), tokenized_dataset

(206772,
 138514,
 DatasetDict({
     train: Dataset({
         features: ['patient_id', 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount', 'review_length', 'input_ids', 'token_type_ids', 'attention_mask'],
         num_rows: 206772
     })
     test: Dataset({
         features: ['patient_id', 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount', 'review_length', 'input_ids', 'token_type_ids', 'attention_mask'],
         num_rows: 68876
     })
 }))

Alternatively, instead of removing the old columns, we can repeat each original row once for every chunk of tokens produced from it, so that the old columns have the same length as the new ones. The overflow_to_sample_mapping field returned by the tokenizer records which original example each chunk came from.

In [None]:
def tokenize_and_split(examples):
    result = tokenizer(
        examples["review"],
        truncation=True,
        max_length=128,
        return_overflowing_tokens=True,
    )
    # Extract mapping between new and old indices
    sample_map = result.pop("overflow_to_sample_mapping")
    for key, values in examples.items():
        result[key] = [values[i] for i in sample_map]
    return result

In [None]:
tokenized_dataset = drug_dataset.map(tokenize_and_split, batched=True)
tokenized_dataset
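
To see what the mapping does on its own, here is a plain-Python sketch with a hand-written `sample_map`. The column values, the mapping, and the helper name `repeat_columns` are all made up for illustration; in the cell above the tokenizer produces the mapping itself.

```python
def repeat_columns(examples, sample_map):
    """Repeat each original value once per chunk produced from its example."""
    return {
        key: [values[i] for i in sample_map]
        for key, values in examples.items()
    }


# Assumption: two original examples; the first was split into two chunks,
# the second fit into a single chunk.
examples = {"drugName": ["drug A", "drug B"], "rating": [10.0, 8.0]}
sample_map = [0, 0, 1]

expanded = repeat_columns(examples, sample_map)
print(expanded)
```

Each old column now has three entries, one per chunk, so it lines up with the tokenized columns.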


With the Hugging Face Datasets library, we can easily switch the output format to a third-party library such as pandas using Dataset.set_format():

In [None]:
drug_dataset.set_format('pandas')

In [None]:
drug_dataset['train'][:5]

,patient_id,drugName,condition,review,rating,date,usefulCount,review_length
0,95260,Guanfacine,adhd,"""My son is halfway through his fourth week of ...",8.0,"April 27, 2010",192,141
1,92703,Lybrel,birth control,"""I used to take another oral contraceptive, wh...",5.0,"December 14, 2009",17,134
2,138000,Ortho Evra,birth control,"""This is my first time using any form of birth...",8.0,"November 3, 2015",10,89
3,35696,Buprenorphine / naloxone,opiate dependence,"""Suboxone has completely turned my life around...",9.0,"November 27, 2016",37,124
4,155963,Cialis,benign prostatic hyperplasia,"""2nd day on 5mg started to work with rock hard...",2.0,"November 28, 2015",43,68


In [None]:
train_pandas = drug_dataset['train'][:]

In [None]:
frequencies = (
    train_pandas['condition']
    .value_counts()
    .rename_axis('condition')        # name the column of condition labels
    .reset_index(name='frequency')   # name the column of counts
)
frequencies.head()

,condition,frequency
0,birth control,27655
1,depression,8023
2,acne,5209
3,anxiety,4991
4,pain,4744


In [None]:
from datasets import Dataset

freq_dataset = Dataset.from_pandas(frequencies)
freq_dataset

Dataset({
    features: ['condition', 'frequency'],
    num_rows: 819
})
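
For comparison, the same kind of count can be sketched in plain Python with collections.Counter. The short list of conditions below is made-up sample data, not taken from the dataset.

```python
from collections import Counter

# Assumption: a tiny made-up sample of condition labels.
conditions = [
    "birth control", "acne", "birth control",
    "pain", "acne", "birth control",
]

# most_common() returns (value, count) pairs sorted by descending count.
condition_counts = Counter(conditions).most_common()
print(condition_counts)
```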

Average rating per drug:

In [None]:
# Group the training rows by drug and average the ratings within each group
average_rating = train_pandas.groupby('drugName')['rating'].mean()
average_rating.head()

Before training, we need to reset drug_dataset from pandas back to its original Arrow format and create a validation set:

In [None]:
drug_dataset.reset_format()

In [None]:
drug_dataset_clean = drug_dataset['train'].train_test_split(train_size=0.8, seed=42)

drug_dataset_clean['validation'] = drug_dataset_clean.pop('test')

drug_dataset_clean['test'] = drug_dataset['test']
drug_dataset_clean

DatasetDict({
    train: Dataset({
        features: ['patient_id', 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount', 'review_length'],
        num_rows: 110811
    })
    validation: Dataset({
        features: ['patient_id', 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount', 'review_length'],
        num_rows: 27703
    })
    test: Dataset({
        features: ['patient_id', 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount', 'review_length'],
        num_rows: 46108
    })
})
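
The split sizes above follow from applying the 80/20 ratio to the 138,514 training rows. A minimal sketch of such a split on row indices, assuming a simple shuffle-and-cut strategy (the helper `split_indices` is hypothetical and is not how the library is implemented):

```python
import random


def split_indices(num_rows, train_fraction, seed):
    """Shuffle row indices reproducibly and cut them into two disjoint parts."""
    indices = list(range(num_rows))
    random.Random(seed).shuffle(indices)
    cut = int(num_rows * train_fraction)
    return indices[:cut], indices[cut:]


# Assumption: same row count, fraction and seed as in the cell above.
train_idx, validation_idx = split_indices(138514, train_fraction=0.8, seed=42)
print(len(train_idx), len(validation_idx))
```

This gives 110,811 training indices and 27,703 validation indices, the same sizes as in the DatasetDict shown above, although the rows selected will differ from the library's.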

Now we can save the dataset locally. Depending on the format we want, we can use Dataset.save_to_disk() (Arrow), Dataset.to_csv() (CSV), or Dataset.to_json() (JSON).

In [None]:
drug_dataset_clean.save_to_disk('drug-reviews')

Saving the dataset (0/1 shards):   0%|          | 0/110811 [00:00<?, ? examples/s]

Saving the dataset (0/1 shards):   0%|          | 0/27703 [00:00<?, ? examples/s]

Saving the dataset (0/1 shards):   0%|          | 0/46108 [00:00<?, ? examples/s]

This creates the following directory structure:

drug-reviews/
├── dataset_dict.json
├── test
│   ├── dataset.arrow
│   ├── dataset_info.json
│   └── state.json
├── train
│   ├── dataset.arrow
│   ├── dataset_info.json
│   ├── indices.arrow
│   └── state.json
└── validation
    ├── dataset.arrow
    ├── dataset_info.json
    ├── indices.arrow
    └── state.json

In [None]:
from datasets import load_from_disk

drug_dataset_reloaded = load_from_disk('drug-reviews')
drug_dataset_reloaded

For the CSV and JSON formats, each split has to be stored as a separate file:

In [None]:
for split, dataset in drug_dataset_clean.items():
    dataset.to_json(f'drug-reviews-{split}.jsonl')

Creating json from Arrow format:   0%|          | 0/111 [00:00<?, ?ba/s]

Creating json from Arrow format:   0%|          | 0/28 [00:00<?, ?ba/s]

Creating json from Arrow format:   0%|          | 0/47 [00:00<?, ?ba/s]

The cell above writes each split in JSON Lines format, where each row of the dataset is stored as a single line of JSON:


In [None]:
!head -n 1 drug-reviews-train.jsonl

{"patient_id":89879,"drugName":"Cyclosporine","condition":"keratoconjunctivitis sicca","review":"\"I have used Restasis for about a year now and have seen almost no progress.  For most of my life I've had red and bothersome eyes. After trying various eye drops, my doctor recommended Restasis.  He said it typically takes 3 to 6 months for it to really kick in but it never did kick in.  When I put the drops in it burns my eyes for the first 30 - 40 minutes.  I've talked with my doctor about this and he said it is normal but should go away after some time, but it hasn't. Every year around spring time my eyes get terrible irritated  and this year has been the same (maybe even worse than other years) even though I've been using Restasis for a year now. The only difference I notice was for the first couple weeks, but now I'm ready to move on.\"","rating":2.0,"date":"April 20, 2013","usefulCount":69,"review_length":147}
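
Because every line is a complete JSON object, such a file can also be read with the standard library alone. A small round-trip sketch, using made-up records and a temporary file (the file name and field values are assumptions for illustration):

```python
import json
import os
import tempfile

# Assumption: made-up records with the same kind of fields as the dataset rows.
records = [
    {"drugName": "drug A", "rating": 9.0, "review_length": 42},
    {"drugName": "drug B", "rating": 2.0, "review_length": 147},
]

path = os.path.join(tempfile.mkdtemp(), "toy-reviews.jsonl")

# Write one JSON object per line.
with open(path, "w", encoding="utf-8") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")

# Read the file back, parsing each line independently.
with open(path, encoding="utf-8") as f:
    reloaded = [json.loads(line) for line in f]

print(reloaded == records)
```

Since lines are parsed independently, a file like this can be processed row by row without loading it all into memory.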


Loading the JSON files back into a DatasetDict:

In [None]:
data_files = {
    'train': 'drug-reviews-train.jsonl',
    'validation': 'drug-reviews-validation.jsonl',
    'test': 'drug-reviews-test.jsonl',
}
drug_dataset_reloaded = load_dataset('json', data_files=data_files)

Generating train split: 0 examples [00:00, ? examples/s]

Generating validation split: 0 examples [00:00, ? examples/s]

Generating test split: 0 examples [00:00, ? examples/s]