<a href="https://colab.research.google.com/github/sahug/ds-bert/blob/main/BERT%20NLP%20-%20Casual%20Language%20Modeling%20using%20BERT.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**BERT NLP - Causal Language Modeling using BERT**

**Language modeling** is the task of predicting tokens in a sequence. It comes in two forms:
- **Causal Language Modeling** predicts the next token in a sequence of tokens, and the model can only attend to tokens on the left. Example checkpoint: `distilgpt2`, which is used in this notebook.
- **Masked Language Modeling** predicts a masked token in a sequence, and the model can attend to tokens bidirectionally. Example checkpoint: `distilroberta-base`.
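
The difference between the two forms comes down to which positions each token is allowed to attend to. The following is a minimal pure-Python sketch of the two attention patterns; the function names `causal_mask` and `bidirectional_mask` and the example sentence are illustrative and do not come from any library.

```python
def causal_mask(n):
    """Row i lists which positions token i may attend to: itself and everything to its left."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]


def bidirectional_mask(n):
    """Every token may attend to every position, as in masked language modeling."""
    return [[1] * n for _ in range(n)]


tokens = ["The", "cat", "sat", "down"]  # hypothetical example sentence
for token, row in zip(tokens, causal_mask(len(tokens))):
    print(f"{token:>5}: {row}")
```

Each row of the causal mask is one step of next-token prediction: the model sees only the prefix up to and including the current token.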

In [1]:
%pip install -qq datasets

**Load Dataset**

In [2]:
from datasets import load_dataset
eli5 = load_dataset("eli5", split="train_asks[:5000]")

Downloading and preparing dataset eli5/LFQA_reddit (download: 6.03 MiB, generated: 1.26 GiB, post-processed: Unknown size, total: 1.26 GiB) to /root/.cache/huggingface/datasets/eli5/LFQA_reddit/1.0.0/17574e5502a10f41bbd17beba83e22475b499fa62caa1384a3d093fc856fe6fa...


Dataset eli5 downloaded and prepared to /root/.cache/huggingface/datasets/eli5/LFQA_reddit/1.0.0/17574e5502a10f41bbd17beba83e22475b499fa62caa1384a3d093fc856fe6fa. Subsequent calls will reuse this data.


**Train and Test Split**

In [3]:
eli5 = eli5.train_test_split(test_size=0.2)

In [4]:
eli5

DatasetDict({
    train: Dataset({
        features: ['q_id', 'title', 'selftext', 'document', 'subreddit', 'answers', 'title_urls', 'selftext_urls', 'answers_urls'],
        num_rows: 4000
    })
    test: Dataset({
        features: ['q_id', 'title', 'selftext', 'document', 'subreddit', 'answers', 'title_urls', 'selftext_urls', 'answers_urls'],
        num_rows: 1000
    })
})
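
The 4000/1000 row counts follow directly from `test_size=0.2` applied to the 5000 loaded rows. As a rough sketch of what a shuffled split does, assuming the test size is simply rounded to a whole number of rows (the helper `split_indices` and its `seed` parameter are hypothetical, not part of `datasets`):

```python
import random


def split_indices(num_rows, test_size, seed=0):
    """Shuffle row indices and cut them into (train, test) index lists."""
    indices = list(range(num_rows))
    random.Random(seed).shuffle(indices)
    n_test = round(num_rows * test_size)  # assumption: plain rounding
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = split_indices(5000, 0.2)
print(len(train_idx), len(test_idx))
```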

In [5]:
eli5["train"][0], eli5["test"][0] 

({'answers': {'a_id': ['c4zs5n4'],
   'score': [3],
   'text': ["Unfortunately, you don't have enough information to determine how large the object is.  You would need to calibrate using an object of known size at a known distance.\n\nYou can, however determine how far the object was.  Let a be the angular span in the farther away image and b be the angular span in the nearer image.  If d is the farther distance, then d\\*tan(a) = (d-4)\\*tan(b) (= the size of the object).  Also, f\\*tan(a) = 61 px and f\\*tan(b) = 84 px for some unknown f.  So tan(a) = 61/84 \\* tan(b), and you can use this to find d = 336/23 = 14.6 in."]},
  'answers_urls': {'url': []},
  'document': '',
  'q_id': 'uyskn',
  'selftext': 'This might just be simple geometry and my brain is not working right now, but thanks in advance for any help.\n\nI am trying to figure out the size of an object and how far away it is only using two pictures of it.  In the first picture the object appears to be 61 pixels high.  In th

**Look at Dataset**

In [6]:
from datasets import ClassLabel, Sequence
import random
import pandas as pd
from IPython.display import display, HTML


def show_random_elements(dataset, num_examples=10):
    assert num_examples <= len(
        dataset
    ), "Can't pick more elements than there are in the dataset."
    picks = []
    for _ in range(num_examples):
        pick = random.randint(0, len(dataset) - 1)
        while pick in picks:
            pick = random.randint(0, len(dataset) - 1)
        picks.append(pick)

    df = pd.DataFrame(dataset[picks])
    for column, typ in dataset.features.items():
        if isinstance(typ, ClassLabel):
            df[column] = df[column].transform(lambda i: typ.names[i])
        elif isinstance(typ, Sequence) and isinstance(typ.feature, ClassLabel):
            df[column] = df[column].transform(
                lambda x: [typ.feature.names[i] for i in x]
            )
    display(HTML(df.to_html()))

In [7]:
show_random_elements(eli5["train"])

Unnamed: 0,q_id,title,selftext,document,subreddit,answers,title_urls,selftext_urls,answers_urls
0,ygzko,"""Teach a man to reason, and he'll think for a lifetime."" - So, how does one learn to reason?",,,askscience,"{'a_id': ['c5vhda8'], 'text': ['stop blindly accepting anything you are being told'], 'score': [3]}",{'url': []},{'url': []},{'url': []}
1,uk7d5,"Intellectual Ventures has plans for a ""Garden Hose To The Sky"" which would pump sulfates high into the atmosphere in order to reflect sunlight in a supposedly cheap and easy manner. Does it have any merit?","Yes, I just read Super Freakonomics.",,askscience,"{'a_id': ['c4w337k', 'c4w4kkx'], 'text': ['It isn't a great idea; _URL_0_ We need to be resolving the root cause here.', 'They do a lot of speculative stuff since they have so much money to throw around. Their hyped up [mosquito laser](_URL_2_) would be one of them. They play a video of it over and over in their main lobby. It's a butt of a lot of jokes among patent lawyers in the Pacific Northwest. They also want to put [tiny nuclear reactors](_URL_2_) everywhere with one of their *many* companies. They are also the biggest patent trolls on the planet and one of the largest companies most people have never heard about. But they do have easy to sign NDAs; I'll give them credit for that ;) edit spelling'], 'score': [3, 2]}",{'url': []},{'url': []},"{'url': ['http://news.opb.org/article/geoengineered_sky_bye-bye_blue_hello_white/', 'http://en.wikipedia.org/wiki/Mosquito_laser', 'http://www.terrapower.com/home.aspx']}"
2,b5nf3m,How do computers allocate resources?,"If a computer is doing something in the background, say rendering video, and something in the foreground, say browsing the web, and the web browsing is lagging, why doesn't it automatically redirect resources to ensure that the foreground task is smooth, and devote only the excess resources to the background task? \n\nOr does it? \n\nHow do computers allocate resources?",,askscience,"{'a_id': ['ejevav1'], 'text': ['Ressource allocation and management is the primary job of the operating system. What you're talking about is specifically CPU time, which is handled by the so called scheduler. There are different algorithms to determine which task gets CPU time next. Linux, for example, uses the [CFS](_URL_0_) (Completely Fair Scheduler). Simplified, the CFS tracks how much CPU time each task has used so far and gives the CPU to the next runnable task with the least time used. Tasks can also be assigned priorities, which are used to weigh the already used CPU time, so used CPU time by tasks with low priority is weighted higher than the used CPU time of high priority tasks. There are also so called real-time algorithms for scheduling, that are used when a task needs to get their thing done in a very specific time frame, like a driver interacting with a device, for example. Real-time tasks always take priority over everything else. I'm not sure how Windows' scheduler works and if it can identify and prefer interactive tasks, but generally speaking, a scheduler doesn't know or care whether a task is interactive or not. Since interactive (what you call ""foreground"") tasks spend a lot of time waiting for user input, and therefore accumulate less used CPU time, they generally get preferential treatment over long running ""background"" tasks once there's actually something to do, but that's not always the case. Hang ups can especially happen when multiple tasks wait for shared resources, like mass storage. E.g. 
when a long running ""background"" task is currently waiting for the hard disk driver to retrieve some data and an interactive/""foreground"" tasks also wants to access data on the same hard disk, it'll have to wait for the hard disk to become available.'], 'score': [10]}",{'url': []},{'url': []},{'url': ['https://www.kernel.org/doc/Documentation/scheduler/sched-design-CFS.txt']}
3,1qeauv,"If correct, does the 'Entropic Force' of gravity remove the need for gravitons, allowing us to consider GUTs as ToEs?","If correct, does the 'Entropic Force' of gravity remove the need for gravitons, allowing us to consider GUTs as ToEs?",,askscience,"{'a_id': ['cdcc9d3', 'cdd3zrz'], 'text': ['When the source of a gravitational field shifts, the gravitational field around it changes and that change propagates outward as gravitational radiation. If the change in the source is quantized, then the gravitational radiation must also be quantized, and we need a quantum model of gravity to explain that.', 'Naively, yes Less naively, see for instance the conclusion of : [Thermodynamics of Spacetime: The Einstein Equation of State](_URL_0_) > the viewpoint developed here suggests that it may not be correct to canonically quantize the Einstein equations, even if they describe a phenomenon that is ultimately quantum mechanical There are puzzling relations between gravity, thermodynamics, quantum theory, and the nature of time. But to say the least, dismissing the existence of graviton is a particularly controversial viewpoint. One could seriously question their observability for instance.'], 'score': [5, 2]}",{'url': []},{'url': []},{'url': ['http://arxiv.org/abs/gr-qc/9504004']}
4,2geyup,"If I drop a live wire in the tub with me, I'm going to get shocked. If I'm in the ocean and someone drops a live wire in miles away, will I get shocked?","ie: If I drop a toaster in the tub with me that is plugged in, I'm going to get electrocuted. If you drop a toaster in the ocean next to me will we be shocked? How badly? What if you're 1 mile away? 10? 50? 1,000s?\n",,askscience,"{'a_id': ['ckim459'], 'text': ['I am on mobile so it's a pain in the ass to include reference links. I can add some later on if requested. The resistivity of sea water is about 3 orders of magnitude lower than the bulk resistivity of a human body. The effective resistance between the ocean and ground would basically be zero compared to human + ocean to ground. You can roughly model the system as two parallel resistors, you and the ocean. Since there is a finite potential difference (voltage) between the lightning strike and the ocean surface, almost no potential would drop across your body, meaning there should be no current passing through you and you should not experience a shock. Would I swim in the ocean during a lightning storm? Probably not, but you should be fairly safe. An exact calculation may be quite tricky. Assuming an average lightning strike carries 30 kA of current, and that sea water has a resistivity of 0.2 Ohm meters, one could use the effective resistivity of a human body to calculate the radial distance from a lightning strike needed to ensure less than a lethal dose of current. I found a journal article about human conductivity but I'm not at work so I'm pay walled at the moment. Maybe someone else can give it a shot? If I have time tomorrow I may try. Seems interesting. My guess is less than 100 feet. _URL_2_. _URL_0_ Edit: can't fall asleep for some reason so I did a calculation in my head that gives about 7 meters for a safe distance assuming 30 Volts kills a wet human. Might be super wrong, I'll write it out in the morning. 
I basically used conservation of charge from a 30 kA source to get potential as a function of radial distance from the strike assuming spherical symmetry of the current. Edit 2: I quickly wrote up my ""solution"" this morning, which you can see [here](_URL_1_)^\* . This may be a completely horseshit approach that gives a totally wrong answer, but such is life. I assume the lightning strike injects a constant current into the ocean, which then spreads radially away with spherical symmetry so that each hemisphere some distance *r* from the lightning strike must also have this much current passing through it. From the relationship between current density and electric field, I get an expression for the electric field as a function of *r* that I integrate to get a potential. I assume that a human body in the ocean does not have any effect on the lines of equipotential which is perhaps the worst assumption in the whole approach. Then I assume that if you insert a human some distance *r* away from the lightning strike, the front of his or her torso will be at a potential V(r) and the back will be at a potential V(r+d) where d is the width of the torso. You can solve for *r* such that V(r)-V(r+d) ~ V_lethal where V_lethal is the voltage needed to kill you. \*: I accidentally dropped d near the end but I added it in by the magic of technology and computers. The last line reads r^2 + rd on the left hand side.'], 'score': [17]}",{'url': []},{'url': []},"{'url': ['http://www.tandfonline.com/doi/pdf/10.1080/027263401752246199', 'http://imgur.com/KC5Nqnu', 'http://www.aharfield.co.uk/lightning-protection-services/about-lightning']}"
5,4lw7dc,What is the difference between a blazar and a GRB?,,,askscience,"{'a_id': ['d3qy81q'], 'text': ['The emission in both comes from ultrarelativistic jets (jets with Lorentz factor > > 1) which are moving towards us so that their radiation is boosted to very high frequencies/energies, but they are in different systems and last for different times. Blazars are jets coming from a supermassive black holes at the center of galaxies that are continuously accreting matter, and they are quasi-stable objects that can radiate for long periods of time. GRBs are short, catastrophic events lasting from 0.2-2000 seconds that result from the collapse of a massive star or the collision of a neutron star with another neutron star or a black hole.'], 'score': [3]}",{'url': []},{'url': []},{'url': []}
6,16xrmb,"Why are most of the ""superacids"" non-corrosive?","What makes them different from the ""strong"" acids which are far more corrosive despite being much weaker proton donors?",,askscience,"{'a_id': ['c80faak'], 'text': ['Most superacids are corrosive. Perhaps you're thinking about carborane acid, which is unusual in it's lack of corrosivity. Maybe this article will help: _URL_0_'], 'score': [3]}",{'url': []},{'url': []},{'url': ['http://www.nature.com/news/2004/041115/full/news041115-5.html']}
7,6cevyh,Efimov physics: could someone explain the significance?,"I stumble on an article about the [Efimov Effect] (_URL_0_). There were only a couple of posts on reddit about it and most of the comments were deleted. \n\nI'm a lay person (ie. don't have a degree in physics) and was hoping someone might be able to explain why pairing three bosons is such a breakthrough. I've searched everywhere for an answer, but couldn't find anything that explained the potential applicability of this or how this alters the classical view of physics. \n\nIs it just that it suggest things work differently at the quantum level than we think?\n\nThanks!\n\n",,askscience,"{'a_id': ['dhuf2u1'], 'text': ['It's not ""such a breakthrough"" exactly, as there aren't really all that many applications ... so much as it's a purely-quantum, non-classical effect that isn't predicted by classical theory yet exists nonetheless due to quantum effects. So really it's just confirming a prediction we made. The [Wiki article](_URL_1_) describes it as similar to [Borromean rings](_URL_0_). Borromean rings are an arrangement of three rings where none of the three rings is locked with the other two (so, none of the rings pass through each other to form a chain-like link), however even though none of the rings are locked to the others, the way the three rings are arranged topologically results in them all being bound to each other. If you removed any one of the rings, the other two would not be bound -- so it's a bound state of three rings that is explicitly *not* made up of bound states of smaller numbers of rings. Anyway the Efimov state is similar to Borromean rings in that it is a bound state of three particles, where none of the three particles is bound to either of the other two, individually. So the state is stable when it otherwise wouldn't be from the purely classical prediction. 
What's remarkable though is that Efimov's prediction is that there should be an infinite number of excited Efimov states which are stable, at increasingly large distances and energies. So the ground Efimov state has some characteristic distance at which the three particles are bound together -- at smaller distances, the three particles are not bound and they repel each other. They repel each other at larger distances too ... so they are only stable at just the right distance. *Except* ... this also works at multiples of 22.7 times this distance (with 22.7 times the energy). So if your three particles are groupwise-bound at a distance of 1 unit and an energy of 1 unit, they will also be bound at a distance of 22.7 units and an energy of 22.7 units. And they will also bound at a distance of 22.7^(2) = 515.29 units, and so on for 22.7^(3) all the way up. This despite the fact that the three particles will repel each other and break up at any other size/energy combination. Hope that helps.'], 'score': [2]}",{'url': []},{'url': ['https://blogs.scientificamerican.com/cocktail-party-physics/three-8217-s-company-two-8217-s-a-crowd-meet-the-efimov-effect/']},"{'url': ['https://en.wikipedia.org/wiki/Borromean_rings', 'https://en.wikipedia.org/wiki/Efimov_state']}"
8,1smm58,Does staying in space has any influence on the menstrual cycle,If a woman stays for a long period of time in space (The ISS for example). Does it has any influence on her menstrual cycle? The weightlessness or the fact that there's no day/night in space?,,askscience,"{'a_id': ['cdz5fst'], 'text': ['No, women have menstruated fine in space, and haven't reported any changes in length. Menstrual blood is forcefully expelled so gravity isn't an issue, and they had ample access to pads. From the words of astronaut Rhea Seddon. _URL_0_  > There was concern about it. It was one of those unknowns. A lot of people predicted retrograde flow of menstrual blood, and it would get out in your abdomen, get peritonitis, and horrible things would happen. All the women were going, “I don’t think so.” But you couldn’t prove it or disprove it. We were asked, “What do we do about this?” We said, “How about we just consider it a nonproblem until it becomes a problem? If anybody gets sick in space you can bring us home. Then we’ll deal with it as a problem, but let’s consider it a nonproblem.” They did. I’m not totally sure who had the first period in space, but they came back and said, “Period in space, just like period on the ground. Don’t worry about it.” I think the big controversy was about—and a lot of the women disagreed—how many feminine hygiene products do you put [onboard].  > Of course the more you put, the less room you have in your drawer for your clothes and stuff. Or in a drawer. I don’t even remember where they put it. I helped make that decision with the docs. We had to do worst case. Tampons or pads, how many would you use if you had a heavy flow, five days or seven days of flow. Because we didn’t know how it would be different up there. What’s the max that you could use?'], 'score': [41]}",{'url': []},{'url': []},{'url': ['http://www.jsc.nasa.gov/history/oral_histories/SeddonMR/SeddonMR_5-21-10.htm']}
9,njaxz,"How do you get the initial batch of water for the ""pure"" side of a reverse osmosis system?",I know how reverse osmosis works (_URL_0_) but I can't figure out what is used as clean water when the system is first used if you haven't made any yet. If you use bottled water isn't your reverse osmosis (osmosisized?) water only going to be as good as what you started off with?,,askscience,"{'a_id': ['c39jexe'], 'text': ['Perhaps you don't understand how these systems work. You put in water that has high ionic content. You end up with _two_ output streams. One stream has greatly reduced ionic content - this is called product water. The other stream has increased ionic content - this is called brine. You toss out the brine and keep the product water. You do not need to bootstrap the system with purified water. I build my own reverse osmosis systems from components on eBay to deal with well water with a very high ionic content. If you have any questions, fire away! CHEERS'], 'score': [6]}",{'url': []},{'url': ['http://science.howstuffworks.com/reverse-osmosis.htm']},{'url': []}


**Extract Text**

**Flatten** the dataset to make the nested fields easier to extract. After flattening, each nested field becomes its own column, so the answer text is available as `["answers.text"]` instead of `["answers"]["text"]`.

In [8]:
eli5 = eli5.flatten()

In [9]:
eli5["train"]["answers.text"][0]

["Unfortunately, you don't have enough information to determine how large the object is.  You would need to calibrate using an object of known size at a known distance.\n\nYou can, however determine how far the object was.  Let a be the angular span in the farther away image and b be the angular span in the nearer image.  If d is the farther distance, then d\\*tan(a) = (d-4)\\*tan(b) (= the size of the object).  Also, f\\*tan(a) = 61 px and f\\*tan(b) = 84 px for some unknown f.  So tan(a) = 61/84 \\* tan(b), and you can use this to find d = 336/23 = 14.6 in."]
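
To make the renaming concrete, here is a small pure-Python sketch of flattening a single nested record into dotted column names. It only mimics the naming scheme seen above; `flatten_record` is a hypothetical helper, and the answer text in the example is a placeholder.

```python
def flatten_record(record, parent_key=""):
    """Turn nested dictionaries into a flat dictionary with dotted keys."""
    flat = {}
    for key, value in record.items():
        name = f"{parent_key}.{key}" if parent_key else key
        if isinstance(value, dict):
            flat.update(flatten_record(value, name))
        else:
            flat[name] = value
    return flat


example = {
    "q_id": "uyskn",
    "answers": {"a_id": ["c4zs5n4"], "score": [3], "text": ["placeholder answer"]},
}
flat = flatten_record(example)
print(sorted(flat))
```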

**Preprocess**

In [10]:
%pip install -qq transformers

In [11]:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("distilgpt2")

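if tokenizer.pad_token is None:
    # GPT-2 has no padding token. Reuse the end-of-sequence token so that no
    # new token id is added beyond the model's embedding matrix.
    tokenizer.pad_token = tokenizer.eos_token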

Using pad_token, but it is not set yet.


In [12]:
def preprocess_function(examples):
    return tokenizer([" ".join(x) for x in examples["answers.text"]], truncation=True)

In [13]:
tokenized_eli5 = eli5.map(preprocess_function, batched=True, num_proc=4, remove_columns=eli5["train"].column_names)

        

In [14]:
tokenized_eli5

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask'],
        num_rows: 4000
    })
    test: Dataset({
        features: ['input_ids', 'attention_mask'],
        num_rows: 1000
    })
})

In [15]:
print(tokenized_eli5["train"]["input_ids"][0])

[13898, 11, 345, 836, 470, 423, 1576, 1321, 284, 5004, 703, 1588, 262, 2134, 318, 13, 220, 921, 561, 761, 284, 33801, 378, 1262, 281, 2134, 286, 1900, 2546, 379, 257, 1900, 5253, 13, 198, 198, 1639, 460, 11, 2158, 5004, 703, 1290, 262, 2134, 373, 13, 220, 3914, 257, 307, 262, 32558, 11506, 287, 262, 18485, 1497, 2939, 290, 275, 307, 262, 32558, 11506, 287, 262, 40671, 2939, 13, 220, 1002, 288, 318, 262, 18485, 5253, 11, 788, 288, 59, 9, 38006, 7, 64, 8, 796, 357, 67, 12, 19, 19415, 9, 38006, 7, 65, 8, 46121, 262, 2546, 286, 262, 2134, 737, 220, 4418, 11, 277, 59, 9, 38006, 7, 64, 8, 796, 8454, 279, 87, 290, 277, 59, 9, 38006, 7, 65, 8, 796, 9508, 279, 87, 329, 617, 6439, 277, 13, 220, 1406, 25706, 7, 64, 8, 796, 8454, 14, 5705, 3467, 9, 25706, 7, 65, 828, 290, 345, 460, 779, 428, 284, 1064, 288, 796, 38867, 14, 1954, 796, 1478, 13, 21, 287, 13]


**Capture Truncated Text**
Because `truncation=True` is set, the tokenizer cuts any example that exceeds the model's maximum input length, and the remaining examples vary widely in length. A second preprocessing function regroups the tokenized text into fixed-size blocks so that every token that survived tokenization is used for training. This preprocessing function should:

- Concatenate all the tokenized text.
- Split the concatenated text into smaller chunks of `BLOCK_SIZE` tokens. The final chunk of each batch may be shorter.

In [16]:
BLOCK_SIZE = 128

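def group_text(examples):
    # Concatenate the token lists of every example in the batch, per column.
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    # Split each column into chunks of BLOCK_SIZE; the last chunk may be shorter.
    result = {
        k: [t[i : i + BLOCK_SIZE] for i in range(0, total_length, BLOCK_SIZE)]
        for k, t in concatenated_examples.items()
    }
    # For causal language modeling the labels are a copy of the inputs.
    result["labels"] = result["input_ids"].copy()
    return result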
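
The concatenate-then-chunk step is easier to follow on a toy batch. The sketch below reimplements the same idea with the block size as a parameter; `group_tokens` and the toy token ids are illustrative and are not part of the notebook's pipeline.

```python
def group_tokens(examples, block_size):
    """Concatenate every column's token lists, then cut them into blocks."""
    concatenated = {
        key: [token for sequence in sequences for token in sequence]
        for key, sequences in examples.items()
    }
    total_length = len(concatenated["input_ids"])
    result = {
        key: [tokens[i : i + block_size] for i in range(0, total_length, block_size)]
        for key, tokens in concatenated.items()
    }
    result["labels"] = [block.copy() for block in result["input_ids"]]
    return result


toy_batch = {
    "input_ids": [[1, 2, 3], [4, 5, 6, 7, 8], [9, 10]],
    "attention_mask": [[1, 1, 1], [1, 1, 1, 1, 1], [1, 1]],
}
grouped = group_tokens(toy_batch, block_size=4)
print(grouped["input_ids"])  # [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10]]
```

Three examples of lengths 3, 5 and 2 become two full blocks and one short block, which is why the row count changes after grouping.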

In [17]:
lm_dataset = tokenized_eli5.map(group_text, batched=True, num_proc=4)

        

In [18]:
lm_dataset

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 8459
    })
    test: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 2162
    })
})

In [19]:
print(lm_dataset["train"]["input_ids"][0])

[13898, 11, 345, 836, 470, 423, 1576, 1321, 284, 5004, 703, 1588, 262, 2134, 318, 13, 220, 921, 561, 761, 284, 33801, 378, 1262, 281, 2134, 286, 1900, 2546, 379, 257, 1900, 5253, 13, 198, 198, 1639, 460, 11, 2158, 5004, 703, 1290, 262, 2134, 373, 13, 220, 3914, 257, 307, 262, 32558, 11506, 287, 262, 18485, 1497, 2939, 290, 275, 307, 262, 32558, 11506, 287, 262, 40671, 2939, 13, 220, 1002, 288, 318, 262, 18485, 5253, 11, 788, 288, 59, 9, 38006, 7, 64, 8, 796, 357, 67, 12, 19, 19415, 9, 38006, 7, 65, 8, 46121, 262, 2546, 286, 262, 2134, 737, 220, 4418, 11, 277, 59, 9, 38006, 7, 64, 8, 796, 8454, 279, 87, 290, 277, 59, 9, 38006, 7, 65, 8, 796, 9508]


In [20]:
print(lm_dataset["train"]["labels"][0])

[13898, 11, 345, 836, 470, 423, 1576, 1321, 284, 5004, 703, 1588, 262, 2134, 318, 13, 220, 921, 561, 761, 284, 33801, 378, 1262, 281, 2134, 286, 1900, 2546, 379, 257, 1900, 5253, 13, 198, 198, 1639, 460, 11, 2158, 5004, 703, 1290, 262, 2134, 373, 13, 220, 3914, 257, 307, 262, 32558, 11506, 287, 262, 18485, 1497, 2939, 290, 275, 307, 262, 32558, 11506, 287, 262, 40671, 2939, 13, 220, 1002, 288, 318, 262, 18485, 5253, 11, 788, 288, 59, 9, 38006, 7, 64, 8, 796, 357, 67, 12, 19, 19415, 9, 38006, 7, 65, 8, 46121, 262, 2546, 286, 262, 2134, 737, 220, 4418, 11, 277, 59, 9, 38006, 7, 64, 8, 796, 8454, 279, 87, 290, 277, 59, 9, 38006, 7, 65, 8, 796, 9508]
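
The labels are identical to the input ids because the shift needed for next-token prediction is applied inside the model's loss computation: the prediction at position t is scored against the label at position t + 1. The following is a minimal pure-Python sketch of that shifted cross-entropy. `causal_lm_loss` is a hypothetical helper, and the `-100` ignore value follows the convention used by the Transformers data collators.

```python
import math


def causal_lm_loss(logits, labels, ignore_index=-100):
    """Average next-token cross-entropy; logits[t] is scored against labels[t + 1]."""
    total, count = 0.0, 0
    for step_logits, target in zip(logits[:-1], labels[1:]):
        if target == ignore_index:
            continue  # ignored positions (for example padding) add nothing
        log_normalizer = math.log(sum(math.exp(x) for x in step_logits))
        total += log_normalizer - step_logits[target]
        count += 1
    return total / max(count, 1)


# Toy example: vocabulary of 3 tokens, uniform logits at every position.
token_ids = [2, 0, 1, 2]
uniform_logits = [[0.0, 0.0, 0.0] for _ in token_ids]
print(causal_lm_loss(uniform_logits, labels=token_ids))
```

With uniform logits over three tokens the loss equals log(3), the cost of guessing at random.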


For **Causal Language Modeling**, use `DataCollatorForLanguageModeling` with `mlm=False` to create batches of examples. It dynamically pads the sequences in each batch to the length of the longest one, so that they have a uniform length. While it is possible to pad in the tokenizer function by setting `padding=True`, dynamic padding is more efficient.

In [21]:
from transformers import DataCollatorForLanguageModeling
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False, return_tensors="tf")
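
As a rough illustration of dynamic padding, the sketch below pads a batch to its own longest sequence and marks the padded label positions with `-100` so they are skipped by the loss. This is a simplification rather than the collator's actual code: `collate_causal` is a hypothetical helper, and the pad id used in the example is an arbitrary value.

```python
def collate_causal(batch, pad_token_id, ignore_index=-100):
    """Pad each sequence to the longest one in this batch only."""
    longest = max(len(sequence) for sequence in batch)
    input_ids, attention_mask, labels = [], [], []
    for sequence in batch:
        n_pad = longest - len(sequence)
        input_ids.append(sequence + [pad_token_id] * n_pad)
        attention_mask.append([1] * len(sequence) + [0] * n_pad)
        labels.append(sequence + [ignore_index] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask, "labels": labels}


batch = collate_causal([[5, 6, 7], [8]], pad_token_id=0)
print(batch["input_ids"])  # [[5, 6, 7], [8, 0, 0]]
print(batch["labels"])     # [[5, 6, 7], [8, -100, -100]]
```

Padding per batch means a batch of short sequences stays short, whereas padding everything in the tokenizer would pad every example to one global length.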

**Train**

To **fine-tune** a model in TensorFlow, start by converting your datasets to the `tf.data.Dataset` format with `to_tf_dataset`. Specify the input and label columns in `columns`, whether to shuffle the dataset order, the batch size, and the data collator:

In [22]:
tf_train_set = lm_dataset["train"].to_tf_dataset(
    columns=["attention_mask", "input_ids", "labels"],
    dummy_labels=True,
    shuffle=True,
    batch_size=16,    
    collate_fn=data_collator
)

tf_test_set = lm_dataset["test"].to_tf_dataset(
    columns=["attention_mask", "input_ids", "labels"],
    dummy_labels=True,
    shuffle=False,
    batch_size=16,
    collate_fn=data_collator
)

**Optimizer**

In [23]:
from transformers import AdamWeightDecay
optimizer = AdamWeightDecay(learning_rate=2e-5, weight_decay_rate=0.01)

**Model**

In [24]:
from transformers import TFAutoModelForCausalLM
model = TFAutoModelForCausalLM.from_pretrained("distilgpt2")

All model checkpoint layers were used when initializing TFGPT2LMHeadModel.

All the layers of TFGPT2LMHeadModel were initialized from the model checkpoint at distilgpt2.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFGPT2LMHeadModel for predictions without further training.


**Compile**

In [25]:
import tensorflow as tf
model.compile(optimizer=optimizer)

No loss specified in compile() - the model's internal loss computation will be used as the loss. Don't panic - this is a common way to train TensorFlow models in Transformers! To disable this behaviour, please pass a loss argument, or explicitly pass `loss=None` if you do not want your model to compute a loss.


**Fit**

In [None]:
model.fit(x=tf_train_set, validation_data=tf_test_set, epochs=3)

Epoch 1/3
  3/528 [..............................] - ETA: 3:22:15 - loss: 4.2322
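
The 528 steps per epoch in the progress bar can be checked against the 8459 training blocks and the batch size of 16. The count matches a pipeline that drops the final partial batch; that this is what happened here is an assumption, and `steps_per_epoch` is a hypothetical helper written for this check.

```python
def steps_per_epoch(num_rows, batch_size, drop_remainder):
    """Number of batches in one pass over the data."""
    full_batches, leftover = divmod(num_rows, batch_size)
    if drop_remainder or leftover == 0:
        return full_batches
    return full_batches + 1


print(steps_per_epoch(8459, 16, drop_remainder=True))   # 528
print(steps_per_epoch(8459, 16, drop_remainder=False))  # 529
```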